### **TUTORIAL 1: ANALYTICS FOR PERCEPTION**



# Step 0: Does this work?

In [1]:
print("Hello People!")

Hello People!


# Step 1: Import some modules

Never reinvent the wheel! There are tons of programs out there that we can simply import into our own code and use as needed. If you want to know more, or find useful modules, check out [PyPI](https://pypi.org/) (the Python Package Index).

In [2]:
import pandas as pd # for data structures and data analysis
import statsmodels.api as sm # for statistical analysis
from scipy import stats # for statistical tests and distributions
from scipy.stats.mstats import zscore # to standardize variables (yields standardized regression coefficients)
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df) # compatibility shim: older statsmodels versions expect stats.chisqprob, which newer SciPy versions removed
import matplotlib.pyplot as plt # to plot
import numpy as np # to handle arrays and matrices
import seaborn as sns # to plot
print("Module import completed!")

Module import completed!


So what, exactly, are we doing here?


*   "import X as Y": We import the whole module X, but use Y to refer to it (because we really don't want to write matplotlib.pyplot every time we use it)
*   "from X import Y": We don't need all of X, only one particular function, Y. This way, we can also refer to Y directly
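
Both import styles can be tried out with the standard library's `math` module. The module is chosen here only for illustration and is not used elsewhere in this tutorial:

```python
# Style 1: import the whole module under a shorter alias
import math as m

# Style 2: import a single function so it can be called directly
from math import sqrt

# Both names refer to the same function
root_via_alias = m.sqrt(16)  # needs the alias as a prefix
root_direct = sqrt(16)       # no prefix needed
print(root_via_alias, root_direct)
```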




# Step 2: Define functions for plotting correlation matrix and regression coefficients

A function is like a recipe for your computer: it bundles a set of instructions, so that the computer can work its way from a well-defined input (flour, baking powder, milk, eggs), to a well-defined output ([pancakes](https://www.kingarthurflour.com/recipes/simply-perfect-pancakes-recipe)).

Functions are useful, because we would rather just tell our computer to make pancakes than to explain all the instructions every time we are hungry.

Let's first teach our computer something simpler than making pancakes:

In [3]:
def sum_numbers(x, y):
    result = x + y
    return result

Our function lets us sum any two numbers by calling it:

In [4]:
sum_numbers(3,6)

9

We are now ready to define the functions we will use to plot our graphs. Don't worry too much about the details - we are just teaching our computer a specific recipe.

In [5]:
#function to plot correlation matrix
def plot_corr(corr):
    sns.set(style="white")
    mask = np.zeros_like(corr, dtype=bool)  # the np.bool alias was removed in recent NumPy versions
    mask[np.triu_indices_from(mask)] = True
    
    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=(11, 9))
    
    # Generate a custom diverging colormap
    # sns.palplot(sns.diverging_palette(240, 0)) Uncomment to try out different color palettes to use below
    cmap = sns.diverging_palette(240, 0, as_cmap=True)
    
    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(corr, mask=mask, cmap=cmap, vmax=0.15,vmin=-0.15,  # vmax and vmin define the upper and lower boundaries of the colormap respectively
                square=True, linewidths=.5, cbar_kws={"shrink": .5})


#define function to plot coefficients and confidence intervals
def plot_coef(model):
    err_series = model.params - model.conf_int(alpha=0.05)[0]
    coef_df = pd.DataFrame({'coef': model.params.values[1:],
                            'err': err_series.values[1:],
                            'varname': err_series.index.values[1:]
                           })
    fig, ax = plt.subplots(figsize=(8, 5))
    coef_df.plot(x='varname', y='coef', kind='bar', 
                 ax=ax, color='none', 
                 yerr='err', legend=False)
    ax.set_ylabel('')
    ax.set_xlabel('')
    ax.scatter(x=np.arange(coef_df.shape[0]),  # pd.np was removed in recent pandas versions
               marker='s', s=120, 
               y=coef_df['coef'], color='black')
    ax.axhline(y=0, linestyle='--', color='black', linewidth=4)
    ax.xaxis.set_ticks_position('none')
    
    
print("Plotting function definition completed!")

Plotting function definition completed!


# Step 3: Make data available within the program

## Step 3.1: Upload the data file from your local computer to Colab

Luckily, Google makes our life very easy: we just need to run the code below, and then we can upload our file to the current Colab session.

In [None]:
import os.path
from google.colab import files
if not os.path.exists("T1.1_ORG2.0_chimeraCSV.csv"):
    uploaded = files.upload()

## Step 3.2: Set a path to your data CSV file (Database.csv) and access your data

We use [Pandas](https://www.learnpython.org/en/Pandas_Basics), a popular module in Python (not to be confused with this [cute guy](https://www.worldwildlife.org/species/giant-panda)), to store the data from the CSV in a form that Python can work with.

In [None]:
path = "T1.1_ORG2.0_chimeraCSV.csv"
df = pd.read_csv(path, sep = ",") # create a pandas data frame to store data, set the type of separator used in your CSV file

Let's take a sneak peek at what we imported:

In [None]:
df.head()

We can also get a summary of each column:

In [None]:
summary_stats = df.describe()
print(summary_stats)

# Step 4: Bivariate tests (t-tests)

Let's see whether we can find any initial trends in the data. For example, are women more likely to exit the firm? We will use [t-tests](https://towardsdatascience.com/the-statistical-analysis-t-test-explained-for-beginners-and-experts-fd0e358bbb62) to analyze the difference in means:

In [None]:
# ttest_ind returns (t statistic, p-value, degrees of freedom), so we only need to run it once
tstat, pvalue, _ = sm.stats.ttest_ind(df[df.gender==0].exit, df[df.gender==1].exit)

print('the tstat for gender differences in exit is =', tstat)
print('the pvalue is =', pvalue)

The data indicate that women are, on average, more likely to quit. However, the finding is not very [significant from a statistical perspective](https://hbr.org/2016/02/a-refresher-on-statistical-significance). Also, don't forget that we are only looking at two variables here and ignoring a lot of other information!
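
To see what the t statistic measures, here is a minimal sketch that computes it by hand with the standard library. It assumes equal variances in both groups (the pooled variant, which is also the default of `sm.stats.ttest_ind`). The function name `pooled_t_statistic` and the two small lists are made up for illustration and are not taken from the tutorial data:

```python
from math import sqrt
from statistics import mean, variance

def pooled_t_statistic(group_a, group_b):
    """Two-sample t statistic, assuming equal variances in both groups."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    # Standard error of the difference between the two means
    std_err = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / std_err

# Hypothetical exit indicators (1 = left the firm) for two groups
exits_group_0 = [0, 0, 1, 0, 0, 1, 0, 0]
exits_group_1 = [1, 0, 1, 1, 0, 1, 0, 1]

t = pooled_t_statistic(exits_group_0, exits_group_1)
print(t)
```

The sign of the statistic depends on the order of the groups: a negative value means the first group has the lower mean.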

Maybe exiters are paid more?


In [None]:
# run the t-test once and unpack the t statistic and p-value
tstat, pvalue, _ = sm.stats.ttest_ind(df[df.exit==0].salary, df[df.exit==1].salary)

print('the tstat for salary differences in exit is =', tstat)
print('the pvalue is =', pvalue)

There is indeed a difference in average salary between those who exit and those who don't. This time, statistically speaking, we can be more confident in this difference. But again, don't forget that we are looking at only two variables right now, i.e. we are ignoring everything else that may differ between the two groups.

**Exercise:** can you write a test to check if exiters (1) have higher or lower job satisfaction than stayers?





If you need help, you can find the answer below:

In [None]:
#@title
# run the t-test once and unpack the t statistic and p-value
tstat, pvalue, _ = sm.stats.ttest_ind(df[df.exit==1].job_satisfaction, df[df.exit==0].job_satisfaction)

print('the tstat for job satisfaction of exiters is =', tstat)
print('the pvalue is =', pvalue)

# Step 5: Multiple bivariate associations

We create a new pandas data frame to store the [correlation](https://en.wikipedia.org/wiki/Correlation_and_dependence) information and use our custom-made plot_corr function to visualize it.

In [None]:
correlation_matrix = df.corr(method='pearson', min_periods=1)
plot_corr(correlation_matrix)
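
Each cell of the correlation matrix is a Pearson correlation coefficient between two columns. The following sketch computes one such coefficient by hand and compares it with NumPy's built-in result. The helper `pearson_r` and the `age` and `salary` values are invented for illustration:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Center both variables around their means
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return numerator / denominator

# Hypothetical values, not taken from the tutorial data
age = [25, 32, 41, 47, 53]
salary = [30, 38, 45, 52, 61]

r = pearson_r(age, salary)
print(r, np.corrcoef(age, salary)[0, 1])
```

Values close to 1 or -1 indicate a strong linear association, and values near 0 indicate a weak one.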

# Step 6: Multivariate analysis with OLS

In an [ordinary least squares, or OLS, regression](https://www.encyclopedia.com/social-sciences/applied-and-social-sciences-magazines/ordinary-least-squares-regression), we try to understand the *simultaneous* effect of multiple independent variables (X) on our dependent variable (y). As an example, we want to see how education and age, together, influence an employee's salary.

First, let's set our independent and dependent variables:

In [None]:
X = df[["age","education"]]
y = df["salary"]

Standardizing our data sometimes makes it easier to interpret. For this, we simply compute the [z-score](https://www.statisticshowto.com/probability-and-statistics/z-score/) of each column. We also want a constant in our model (which forms the intercept of the regression model).

In [None]:
X_std = pd.DataFrame(data=zscore(X),
                     columns=list(X.columns))
X_std = sm.add_constant(X_std)
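
As a rough illustration of what standardizing does, the sketch below computes z-scores by hand with NumPy. It uses the population standard deviation (`ddof=0`), which is also the default of SciPy's `zscore`. The helper `standardize` and the `ages` values are made up for this example:

```python
import numpy as np

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()  # std() uses ddof=0 by default

# Hypothetical ages, not taken from the tutorial data
ages = [20, 30, 40, 50, 60]
ages_std = standardize(ages)
print(ages_std)
```

After standardizing, a column has mean 0 and standard deviation 1, so a coefficient can be read as the change in y per one standard deviation of that variable.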

We are now ready to fit our OLS model:

In [None]:
model = sm.OLS(y,X_std).fit()
model_summary = model.summary()
print(model_summary)
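
To make the idea of "least squares" more concrete, here is a small sketch that recovers known coefficients from simulated data using NumPy's `lstsq` instead of statsmodels. All variable names and the true coefficients (3, 2 and -1) are assumptions chosen for illustration:

```python
import numpy as np

# Simulate data with known coefficients: y = 3 + 2*x1 - 1*x2 + noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
y = 3.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.1, size=200)

# Design matrix: a column of ones (the constant) plus the two predictors
design = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares picks the coefficients that minimize the sum of squared residuals
coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
print(coefs)
```

The column of ones plays the same role as `sm.add_constant`: without it, the fitted line would be forced through the origin.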

Let's get a better look at the estimated coefficients and their confidence intervals by using our custom-built visualization function:

In [None]:
plot_coef(model)

What we actually care about is whether employees exit the company. Hence, y should indicate whether an employee is an exiteer (not to be confused with a [Brexiteer](https://en.wikipedia.org/wiki/Glossary_of_Brexit_terms)). We combine all available information to try to explain why someone left:

In [None]:
X = df.loc[:, df.columns != "exit"]
y = df["exit"]

Don't forget to standardize and add a constant!

In [None]:
X_std = pd.DataFrame(data=zscore(X),
                     columns=list(X.columns))
X_std = sm.add_constant(X_std)

We are now ready to fit our OLS model:

In [None]:
model = sm.OLS(y,X_std).fit()
model_summary = model.summary()
print(model_summary)

Let's get a better look at the estimated coefficients and their confidence intervals by using our custom-built visualization function:

In [None]:
plot_coef(model)

Before we move on to the logit regression, we first save our model summary as a CSV. You may need to allow your browser to download the file to your computer.

In [None]:
# open a CSV file in write ('w') mode; the with-block closes it automatically
with open("ExiteerOLSModel.csv", "w") as csv_file:
    # write the results of the OLS model to the CSV file
    csv_file.write(model_summary.as_csv())
from google.colab import files
files.download("ExiteerOLSModel.csv")

# Step 7: Multivariate analysis with Logit

In a [logistic (or logit) regression](https://en.wikipedia.org/wiki/Logistic_regression), we try to understand the *simultaneous* effect of multiple independent variables (X) on our dependent variable (y) when y is either 0 or 1, exactly as in our problem!

In [None]:
model_logit = sm.Logit(endog=y, exog=X_std).fit()
model_logit_summary = model_logit.summary()
print(model_logit_summary)
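
A logit model passes a linear combination of the predictors through the logistic function, so that the prediction always lies between 0 and 1 and can be read as a probability. The sketch below shows only this transformation. The `intercept` and `slope` values are hypothetical and are not estimates from the tutorial data:

```python
import math

def logistic(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for a single standardized predictor
intercept, slope = -1.0, 0.8

for x in (-2, 0, 2):
    probability = logistic(intercept + slope * x)
    print(x, round(probability, 3))
```

Because of this transformation, a logit coefficient describes a change in the log-odds of exiting, not a direct change in the probability.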

We can, again, use our custom-made visualization function:

In [None]:
plot_coef(model_logit)

For further reading on Logit, see, for example, [here](https://statisticalhorizons.com/whats-so-special-about-logit). For more on Logit in Python, see [here](https://www.datacamp.com/community/tutorials/understanding-logistic-regression-python).

We again save the model as a CSV export for future reference:

In [None]:
# open a CSV file in write ('w') mode; the with-block closes it automatically
with open("ExiteerLogitModel.csv", "w") as csv_file:
    # write the results of the logit model to the CSV file
    csv_file.write(model_logit_summary.as_csv())
from google.colab import files
files.download("ExiteerLogitModel.csv")