# Lesson 2 - Hiring Analysis

![analytics](images/analytics.jpg)

A Jupyter notebook is an [Integrated Development Environment](https://en.wikipedia.org/wiki/Integrated_development_environment) (IDE) created by the [Jupyter Project](https://jupyter.org/). It lets you combine the tools that are essential to a good coding workflow. For example, you can keep a terminal, a Jupyter notebook, a markdown file for notes or documentation, and other files open at the same time as you write code (see image below).

A silly metaphor for thinking about IDEs: an IDE is to programmers, data analysts, scientists, and researchers what a kitchen is to a chef, an indispensable place to get things done.

Jupyter notebooks are composed of cells, and each cell can be one of three types: "code" (the default), "markdown", or "raw text".

To run code you will use the following two commands:

The first option runs the cell your cursor is in and takes you to the next one. If there is no cell below the one you just ran, it inserts a new one for you.

> # Shift + Enter

This second option runs the cell and automatically inserts a new one below it. You can also run cells with the play (▶︎) button at the top or from the _Run menu_ in the top left-hand corner.

> # Alt + Enter  


Anything that follows a hash `#` sign is a comment and will not be evaluated by Python. Comments are useful for documenting your code and letting others know what is happening in each line or cell.

To see information about a package, function, method, etc., put `?` or `??` at the beginning or end of its name, and Jupyter will show you a lot of information about it.
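The `?` and `??` shortcuts are features of IPython/Jupyter and are not valid in a plain Python script. As a rough equivalent outside a notebook, the standard-library `inspect` module exposes similar information. The sketch below is only illustrative; `describe_object` and `add` are hypothetical names made up for this example.

```python
import inspect


def describe_object(obj):
    """Return the call signature and the first docstring line of a callable."""
    signature = str(inspect.signature(obj))
    doc = inspect.getdoc(obj) or ""
    first_line = doc.splitlines()[0] if doc else ""
    return signature, first_line


def add(a, b=1):
    """Add two numbers."""
    return a + b


signature, summary = describe_object(add)
print(signature)
print(summary)
```

Inside a notebook, typing `add?` in a cell shows comparable details without any extra code.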

# 1. Import Packages

We will start by importing the Python packages we will be using during this session.

- [pandas](https://pandas.pydata.org/) -> "is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language."

- [numpy](https://numpy.org/) -> "is the fundamental package for scientific computing with Python. It contains among other things, a powerful N-dimensional array object, sophisticated (broadcasting) functions, tools for integrating C/C++ and Fortran code, useful linear algebra, Fourier transform, and random number capabilities."

- [statsmodels](https://www.statsmodels.org/) -> "statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration."

In [49]:
import numpy as np
import pandas as pd

import statsmodels.api as sm
from statsmodels.iolib.summary2 import summary_col

We will use a pandas function (pandas is imported as `pd`) called `read_csv()` to read the data into our session as a dataframe, which we will assign to a variable called `hiringData`.

In [50]:
hiringData = pd.read_csv("Hyp_employees.csv")
hiringData.head()

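| | new_id | age | gender | undergradranking | gpa | didmba | extracurriculars | maturityassesment | programmingskill | disciplinetest | ambitiousnesstest | creativitytest | neuroticismtest | extraversiontest | teamquality | tenure | currentseniority | mainperformancemetric |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 48 | 0 | 3 | 4.0 | 0 | 3 | 2 | 40 | 9 | 7 | 4 | 5 | 5 | 2 | 19 | 9 | 5 |
| 1 | 2 | 46 | 0 | 3 | 4.0 | 0 | 7 | 4 | 37 | 6 | 6 | 5 | 5 | 4 | 2 | 18 | 9 | 7 |
| 2 | 3 | 48 | 1 | 3 | 4.0 | 0 | 3 | 2 | 32 | 7 | 7 | 5 | 6 | 5 | 1 | 18 | 9 | 5 |
| 3 | 4 | 44 | 0 | 3 | 3.91 | 0 | 3 | 3 | 55 | 3 | 6 | 6 | 7 | 5 | 1 | 17 | 10 | 5 |
| 4 | 5 | 45 | 0 | 3 | 4.0 | 0 | 3 | 2 | 44 | 5 | 4 | 4 | 8 | 4 | 1 | 17 | 9 | 5 |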
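If you do not have `Hyp_employees.csv` at hand, the same `read_csv()` and `.head()` calls can be tried on a small in-memory CSV. This is a minimal sketch: the three rows below are made up for illustration and are not taken from the real file.

```python
import io

import pandas as pd

# Hypothetical three-row extract; the values are NOT from Hyp_employees.csv
csv_text = """new_id,age,gpa
1,30,3.5
2,41,3.9
3,28,4.0
"""

# read_csv accepts a file path or any file-like object, such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows (5 by default)
first_two = sample.head(2)
print(first_two)
```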


A dataset holds information in a rectangular layout, much like a spreadsheet. You can check the shape of this rectangle (i.e. its rows and columns, in that order) with the `.shape` attribute of your dataframe.

In [51]:
hiringData.shape

(267, 18)
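Because `.shape` is a plain tuple, it can be unpacked into two variables. A small sketch on a made-up dataframe (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical data, only to illustrate the attribute
toy = pd.DataFrame({"age": [30, 41, 28], "gpa": [3.5, 3.9, 4.0]})

# .shape is (number of rows, number of columns), in that order
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")
```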

You can look at the most important descriptive statistics using the method `.describe()` on your dataframe. Appending `.T` transposes the result so that each variable gets its own row.

In [52]:
hiringData.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| new_id | 267.0 | 134.0 | 77.220464 | 1.0 | 67.5 | 134.0 | 200.5 | 267.0 |
| age | 267.0 | 34.430712 | 4.916394 | 25.0 | 31.0 | 33.0 | 37.0 | 48.0 |
| gender | 267.0 | 0.456929 | 0.499077 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| undergradranking | 267.0 | 3.220974 | 0.996163 | 1.0 | 3.0 | 3.0 | 4.0 | 5.0 |
| gpa | 267.0 | 3.890412 | 0.198686 | 3.15 | 3.87 | 4.0 | 4.0 | 4.0 |
| didmba | 267.0 | 0.382022 | 0.486794 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| extracurriculars | 267.0 | 7.756554 | 2.813218 | 0.0 | 7.0 | 10.0 | 10.0 | 10.0 |
| maturityassesment | 267.0 | 1.0 | 0.822613 | 0.0 | 0.0 | 1.0 | 1.0 | 4.0 |
| programmingskill | 267.0 | 38.067416 | 11.682965 | 8.0 | 30.0 | 38.0 | 45.0 | 73.0 |
| disciplinetest | 267.0 | 4.928839 | 1.603155 | 1.0 | 4.0 | 5.0 | 6.0 | 9.0 |
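The transposed summary is itself a dataframe, so you can pick out only the statistics you care about. A minimal sketch with made-up numbers (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical data, only to illustrate the method
toy = pd.DataFrame({"age": [25, 30, 35, 40], "gpa": [3.0, 3.5, 3.5, 4.0]})

# One row per variable, one column per statistic
stats = toy.describe().T

# Keep just a few of the statistics
print(stats[["mean", "min", "max"]])
```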


You can examine the correlation between all of your numerical variables using the `.corr()` method on your dataframe. In this instance, we removed the `new_id` variable as it doesn't have any meaning in this particular use case. We want to check whether all variables, except `neuroticismtest`, are positively correlated with performance.

In [56]:
hiringData.drop('new_id', axis=1).corr()

| | age | gender | undergradranking | gpa | didmba | extracurriculars | maturityassesment | programmingskill | disciplinetest | ambitiousnesstest | creativitytest | neuroticismtest | extraversiontest | teamquality | tenure | currentseniority | mainperformancemetric |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| age | 1.0 | 0.040531 | 0.001219 | -0.06226 | 0.166613 | -0.057081 | 0.845897 | 0.018212 | 0.01869 | 0.329151 | 0.544936 | -0.397345 | 0.440685 | 0.151099 | 0.980432 | 0.870832 | 0.700571 |
| gender | 0.040531 | 1.0 | 0.03812 | 0.026529 | -0.009389 | -0.032934 | -0.027471 | 0.010171 | 0.021998 | 0.063578 | 0.048278 | -0.083482 | 0.074075 | 0.086787 | 0.038801 | 0.064858 | 0.010886 |
| undergradranking | 0.001219 | 0.03812 | 1.0 | 0.096219 | 0.112107 | -0.137685 | -0.022938 | 0.213849 | 0.0193 | -0.11346 | -0.171053 | 0.14548 | -0.13282 | 0.060737 | -0.017599 | 0.016126 | -0.269461 |
| gpa | -0.06226 | 0.026529 | 0.096219 | 1.0 | -0.05605 | -0.004662 | -0.086946 | 0.019002 | 0.614061 | 0.304021 | 0.01523 | 0.068875 | -0.049993 | -0.019868 | -0.06354 | -0.062372 | -0.061309 |
| didmba | 0.166613 | -0.009389 | 0.112107 | -0.05605 | 1.0 | -0.016934 | 0.065717 | 0.052964 | 0.001245 | -0.052913 | 0.269598 | -0.280938 | 0.119911 | 0.811775 | 0.056745 | 0.401769 | 0.338086 |
| extracurriculars | -0.057081 | -0.032934 | -0.137685 | -0.004662 | -0.016934 | 1.0 | -0.01462 | -0.053488 | -0.07971 | 0.034732 | 0.114275 | -0.127076 | 0.042882 | -0.013349 | -0.059354 | -0.045733 | 0.206327 |
| maturityassesment | 0.845897 | -0.027471 | -0.022938 | -0.086946 | 0.065717 | -0.01462 | 1.0 | -0.003129 | -0.022805 | 0.293137 | 0.492473 | -0.342216 | 0.397373 | 0.084093 | 0.866164 | 0.768192 | 0.611439 |
| programmingskill | 0.018212 | 0.010171 | 0.213849 | 0.019002 | 0.052964 | -0.053488 | -0.003129 | 1.0 | -0.096088 | -0.03487 | -0.062098 | -0.014853 | -0.050463 | 0.03834 | 0.014743 | -0.017346 | -0.057286 |
| disciplinetest | 0.01869 | 0.021998 | 0.0193 | 0.614061 | 0.001245 | -0.07971 | -0.022805 | -0.096088 | 1.0 | 0.132192 | 0.114133 | -0.043512 | -0.004248 | 0.053797 | 0.00757 | 0.009319 | 0.076366 |
| ambitiousnesstest | 0.329151 | 0.063578 | -0.11346 | 0.304021 | -0.052913 | 0.034732 | 0.293137 | -0.03487 | 0.132192 | 1.0 | 0.492915 | -0.400247 | 0.390224 | 0.01727 | 0.348781 | 0.283755 | 0.383247 |
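Scanning a full correlation matrix by eye is error-prone. One way to check the sign of each variable's correlation with the outcome is to pull out a single column of the matrix and sort it. The sketch below uses synthetic data; the variable names `skill`, `stress`, and `performance` are made up and do not come from the hiring dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: skill helps performance, stress hurts it
rng = np.random.default_rng(0)
n = 500
skill = rng.normal(size=n)
stress = rng.normal(size=n)
performance = 1.0 * skill - 0.5 * stress + rng.normal(scale=0.5, size=n)
toy = pd.DataFrame({"skill": skill, "stress": stress, "performance": performance})

# Correlation of every variable with the outcome, sorted from lowest to highest
corr_with_target = toy.corr()["performance"].drop("performance").sort_values()
print(corr_with_target)

# Names of the variables that are negatively correlated with the outcome
negative = corr_with_target[corr_with_target < 0].index.tolist()
print(negative)
```

On the real data, the equivalent would be to select the `mainperformancemetric` column of the correlation matrix.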


# Example 1

In [59]:
# Our main variable of interest; select a single column with brackets and its quoted name
y = hiringData['mainperformancemetric']

# All remaining variables, excluding the outcome, the ID, and age
X = hiringData.drop(["mainperformancemetric", "new_id", "age"], axis=1).copy()

# Regress our main metric on the remaining variables as the independent variables
model1 = sm.OLS(y, X).fit()

# print the summary
print(model1.summary())

                                  OLS Regression Results                                  
Dep. Variable:     mainperformancemetric   R-squared (uncentered):                   0.980
Model:                               OLS   Adj. R-squared (uncentered):              0.979
Method:                    Least Squares   F-statistic:                              825.5
Date:                   Thu, 14 Jan 2021   Prob (F-statistic):                   1.53e-204
Time:                           16:35:28   Log-Likelihood:                         -253.31
No. Observations:                    267   AIC:                                      536.6
Df Residuals:                        252   BIC:                                      590.4
Df Model:                             15                                                  
Covariance Type:               nonrobust                                                  
                        coef    std err          t      P>|t|      [0.025      0.975]
----

In [58]:
print("Table with Significance Stars")
print(summary_col(model1, stars=True))

Table with Significance Stars

                  mainperformancemetric
---------------------------------------
gender            -0.0982              
                  (0.0816)             
undergradranking  -0.2567***           
                  (0.0425)             
gpa               0.1441               
                  (0.1499)             
didmba            0.1985               
                  (0.1578)             
extracurriculars  0.0907***            
                  (0.0146)             
maturityassesment -0.0385              
                  (0.0988)             
programmingskill  0.0007               
                  (0.0035)             
disciplinetest    0.0364               
                  (0.0293)             
ambitiousnesstest 0.0159               
                  (0.0442)             
creativitytest    0.3025***            
                  (0.0534)             
neuroticismtest   -0.0108              
                  (0.0414)             
extravers

# Example 2

In [60]:
# new set of independent variables
# we now exclude tenure and age
X_2 = hiringData.drop(["mainperformancemetric", "new_id", "tenure", "age"],axis=1)

In [61]:
model2 = sm.OLS(y, X_2).fit()
print(model2.summary())

                                  OLS Regression Results                                  
Dep. Variable:     mainperformancemetric   R-squared (uncentered):                   0.977
Model:                               OLS   Adj. R-squared (uncentered):              0.975
Method:                    Least Squares   F-statistic:                              759.9
Date:                   Thu, 14 Jan 2021   Prob (F-statistic):                   1.16e-197
Time:                           16:36:02   Log-Likelihood:                         -273.65
No. Observations:                    267   AIC:                                      575.3
Df Residuals:                        253   BIC:                                      625.5
Df Model:                             14                                                  
Covariance Type:               nonrobust                                                  
                        coef    std err          t      P>|t|      [0.025      0.975]
----

In [62]:
# You can also produce a table with just the coefficients with significance 
# stars using the code below 
print("Table with Significance Stars")
print(summary_col(model2, stars=True))

Table with Significance Stars

                  mainperformancemetric
---------------------------------------
gender            -0.0881              
                  (0.0879)             
undergradranking  -0.2523***           
                  (0.0458)             
gpa               0.0015               
                  (0.1596)             
didmba            -0.0218              
                  (0.1659)             
extracurriculars  0.0835***            
                  (0.0156)             
maturityassesment 0.2880***            
                  (0.0913)             
programmingskill  0.0035               
                  (0.0038)             
disciplinetest    0.0525*              
                  (0.0314)             
ambitiousnesstest 0.0433               
                  (0.0474)             
creativitytest    0.3012***            
                  (0.0575)             
neuroticismtest   -0.0109              
                  (0.0446)             
extravers

## Example 2.1 - Tenure Positively Affects both Maturity and Performance

In [63]:
print(sm.OLS(y, hiringData['tenure']).fit().summary())

                                  OLS Regression Results                                  
Dep. Variable:     mainperformancemetric   R-squared (uncentered):                   0.776
Model:                               OLS   Adj. R-squared (uncentered):              0.775
Method:                    Least Squares   F-statistic:                              921.1
Date:                   Thu, 14 Jan 2021   Prob (F-statistic):                    2.22e-88
Time:                           16:37:17   Log-Likelihood:                         -576.24
No. Observations:                    267   AIC:                                      1154.
Df Residuals:                        266   BIC:                                      1158.
Df Model:                              1                                                  
Covariance Type:               nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------

In [64]:
print(sm.OLS(hiringData['maturityassesment'], hiringData['tenure']).fit().summary())

                                 OLS Regression Results                                
Dep. Variable:      maturityassesment   R-squared (uncentered):                   0.898
Model:                            OLS   Adj. R-squared (uncentered):              0.898
Method:                 Least Squares   F-statistic:                              2343.
Date:                Thu, 14 Jan 2021   Prob (F-statistic):                   6.77e-134
Time:                        16:37:21   Log-Likelihood:                         -142.84
No. Observations:                 267   AIC:                                      287.7
Df Residuals:                     266   BIC:                                      291.3
Df Model:                           1                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------
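One reading consistent with this heading is that tenure drives both maturity and performance, which would explain why `maturityassesment` looks significant in Example 2 (without `tenure`) but not in Example 1 (with it). The simulation below is a sketch of that mechanism on synthetic data, not an analysis of the hiring dataset. It uses `np.linalg.lstsq` instead of `sm.OLS` to stay lightweight; both compute ordinary least-squares coefficients. `ols_coefs` is a hypothetical helper.

```python
import numpy as np

# Synthetic data: tenure drives both maturity and performance,
# while maturity has NO direct effect on performance
rng = np.random.default_rng(3)
n = 5000
tenure = rng.normal(size=n)
maturity = 0.8 * tenure + rng.normal(scale=0.5, size=n)
performance = 1.0 * tenure + rng.normal(scale=0.5, size=n)


def ols_coefs(y, *columns):
    """Least-squares coefficients without an intercept, one per column."""
    X = np.column_stack(columns)
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs


# Leaving tenure out makes maturity absorb tenure's effect
maturity_alone = ols_coefs(performance, maturity)[0]

# Controlling for tenure shrinks the maturity coefficient toward zero
maturity_with_tenure = ols_coefs(performance, maturity, tenure)[0]

print(round(maturity_alone, 2), round(maturity_with_tenure, 2))
```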

# Example 3 - No Team Quality

In [65]:
X_3 = hiringData.drop(["mainperformancemetric", "new_id", "teamquality"], axis=1)

In [66]:
model3 = sm.OLS(y, X_3).fit()
print(model3.summary())

                                  OLS Regression Results                                  
Dep. Variable:     mainperformancemetric   R-squared (uncentered):                   0.979
Model:                               OLS   Adj. R-squared (uncentered):              0.977
Method:                    Least Squares   F-statistic:                              774.1
Date:                   Thu, 14 Jan 2021   Prob (F-statistic):                   4.25e-201
Time:                           16:38:01   Log-Likelihood:                         -261.72
No. Observations:                    267   AIC:                                      553.4
Df Residuals:                        252   BIC:                                      607.2
Df Model:                             15                                                  
Covariance Type:               nonrobust                                                  
                        coef    std err          t      P>|t|      [0.025      0.975]
----

In [67]:
# You can also produce a table with just the coefficients with significance 
# stars using the code below 
print("Table with Significance Stars")
print(summary_col(model3, stars=True))

Table with Significance Stars

                  mainperformancemetric
---------------------------------------
age               0.0616**             
                  (0.0299)             
gender            -0.0479              
                  (0.0833)             
undergradranking  -0.2642***           
                  (0.0439)             
gpa               -0.1738              
                  (0.2551)             
didmba            0.6261***            
                  (0.1230)             
extracurriculars  0.0873***            
                  (0.0151)             
maturityassesment -0.0021              
                  (0.1017)             
programmingskill  0.0010               
                  (0.0036)             
disciplinetest    0.0644**             
                  (0.0323)             
ambitiousnesstest 0.0302               
                  (0.0465)             
creativitytest    0.2897***            
                  (0.0559)             
neurotici

## Example 3.1 - Teams are Back

In [68]:
# Bring teamquality back: keep every variable, including both age and tenure
X_4 = hiringData.drop(["mainperformancemetric", "new_id"], axis=1)

In [69]:
model4 = sm.OLS(y, X_4).fit()
print(model4.summary())

                                  OLS Regression Results                                  
Dep. Variable:     mainperformancemetric   R-squared (uncentered):                   0.980
Model:                               OLS   Adj. R-squared (uncentered):              0.979
Method:                    Least Squares   F-statistic:                              780.0
Date:                   Thu, 14 Jan 2021   Prob (F-statistic):                   1.05e-203
Time:                           16:38:55   Log-Likelihood:                         -251.76
No. Observations:                    267   AIC:                                      535.5
Df Residuals:                        251   BIC:                                      592.9
Df Model:                             16                                                  
Covariance Type:               nonrobust                                                  
                        coef    std err          t      P>|t|      [0.025      0.975]
----

In [70]:
# You can also produce a table with just the coefficients with significance 
# stars using the code below 
print("Table with Significance Stars")
print(summary_col(model4, stars=True))

Table with Significance Stars

                  mainperformancemetric
---------------------------------------
age               0.0496*              
                  (0.0290)             
gender            -0.0997              
                  (0.0813)             
undergradranking  -0.2601***           
                  (0.0424)             
gpa               -0.1908              
                  (0.2463)             
didmba            0.1624               
                  (0.1586)             
extracurriculars  0.0888***            
                  (0.0146)             
maturityassesment -0.0338              
                  (0.0985)             
programmingskill  0.0005               
                  (0.0035)             
disciplinetest    0.0556*              
                  (0.0312)             
ambitiousnesstest 0.0306               
                  (0.0448)             
creativitytest    0.2866***            
                  (0.0540)             
neurotici

# Example 4 - Teams with MBAs

In [71]:
print(sm.OLS(hiringData['didmba'], hiringData['teamquality']).fit().summary())

                                 OLS Regression Results                                
Dep. Variable:                 didmba   R-squared (uncentered):                   0.660
Model:                            OLS   Adj. R-squared (uncentered):              0.659
Method:                 Least Squares   F-statistic:                              517.2
Date:                Thu, 14 Jan 2021   Prob (F-statistic):                    2.54e-64
Time:                        16:40:20   Log-Likelihood:                         -106.23
No. Observations:                 267   AIC:                                      214.5
Df Residuals:                     266   BIC:                                      218.1
Df Model:                           1                                                  
Covariance Type:            nonrobust                                                  
                  coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------

# Example 5 - Team Quality on Performance

In [72]:
print(sm.OLS(hiringData['mainperformancemetric'], hiringData['teamquality']).fit().summary())

                                  OLS Regression Results                                  
Dep. Variable:     mainperformancemetric   R-squared (uncentered):                   0.861
Model:                               OLS   Adj. R-squared (uncentered):              0.861
Method:                    Least Squares   F-statistic:                              1654.
Date:                   Thu, 14 Jan 2021   Prob (F-statistic):                   3.55e-116
Time:                           16:41:20   Log-Likelihood:                         -512.05
No. Observations:                    267   AIC:                                      1026.
Df Residuals:                        266   BIC:                                      1030.
Df Model:                              1                                                  
Covariance Type:               nonrobust                                                  
                  coef    std err          t      P>|t|      [0.025      0.975]
----------