# Lesson 2 - Hiring Analysis

![analytics](images/analytics.jpg)

# Table of Contents

0. Important Info
1. Import Packages
2. Read, Evaluate and Describe
3. Ordinary Least Squares Intro
4. Examples
5. Python Breakdown

# 0. Important Info

A Jupyter notebook (what you are using right now) is an [Integrated Development Environment](https://en.wikipedia.org/wiki/Integrated_development_environment) (IDE) created by the [Jupyter Project](https://jupyter.org/). It allows you to combine different tools that are paramount for a good coding workflow. For example, you can have the terminal, a notebook, and a markdown file for note-taking/documenting your work, as well as other files, opened at the same time to improve your workflow as you write code.

A silly metaphor for IDEs: they are to programmers, data analysts, scientists, and researchers what a kitchen is to a chef, an indispensable piece of equipment for getting things done.

Jupyter notebooks are composed of cells, and each cell can be one of 3 types. The default is "code", and the other two are "markdown" and "raw text". The latter two can be used for note-taking purposes (e.g. this is a markdown cell).

To run code you will use the following two commands:

The first option will run the cell where you have your cursor at and take you to the next one. If there is no cell underneath the one you just ran, it will insert a new one for you.

> # Shift + Enter

The second option will run the cell and insert a new one below automatically. Alternatively, you can also run the cells using the play (▶︎) button at the top or with the _Run menu_ on the top left-hand corner.

> # Alt + Enter  

Anything that follows a hash `#` sign is a comment and will not be evaluated by Python. Comments are useful for documenting your code and letting others know what is happening with every line of code you write and/or with every cell.

To check the information of a package, function, method, etc., add `?` or `??` at the beginning or end of its name. A single `?` shows the documentation of the object you are interested in, and `??` also shows its source code when it is available.
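The `?` syntax only works inside Jupyter/IPython. As a small sketch of a plain-Python alternative, the same documentation text is stored on the object itself and can be printed directly (the variable name `doc` is just an illustration):

```python
import math

# In a notebook cell you could type `math.sqrt?` (IPython-only syntax).
# In plain Python, the documentation lives in the object's docstring,
# and help(math.sqrt) prints a longer, formatted version of it.
doc = math.sqrt.__doc__
print(doc)
```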

# 1. Import Packages

We will start by importing the Python packages we will be using during this session.

- [pandas](https://pandas.pydata.org/) -> "is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language."

- [statsmodels](https://www.statsmodels.org/) -> "statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration."

In [None]:
import pandas as pd

import statsmodels.api as sm
from statsmodels.iolib.summary2 import summary_col

We will use a function from pandas (imported as `pd`) called `read_csv()` to read the data into our session as a dataframe, and we will assign it to a variable called `hiringData`.

# 2. Read, Evaluate and Describe

In [None]:
hiringData = pd.read_csv("Hyp_employees.csv")
hiringData.head()

A dataset contains data or information in a rectangular shape in the same way in which you encounter information in a spreadsheet. You can look at the shape of this rectangle (i.e. its rows and columns, in that order) by using the attribute `.shape` on your dataset.

In [None]:
hiringData.shape
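As a quick illustration on a made-up dataframe (the `toy` name and its values are assumptions, not part of the hiring data), `.shape` returns a tuple that can be unpacked into the number of rows and the number of columns:

```python
import pandas as pd

# A tiny, made-up dataframe with 3 rows and 2 columns
toy = pd.DataFrame({"age": [48, 46, 48], "gpa": [4.0, 4.0, 3.91]})

# .shape is an attribute (no parentheses) holding (rows, columns)
n_rows, n_cols = toy.shape
print(f"{n_rows} rows and {n_cols} columns")
```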

You can look at the most important descriptive statistics using the method `.describe()` on your dataframe.

In [None]:
hiringData.describe().T

You can examine the correlation between all of your numerical variables using the `.corr()` method on your dataframe. In this instance, we removed the `new_id` variable as it doesn't have any meaning in this particular use case. We want to check whether all variables, except `neuroticismtest`, are positively correlated with performance.

In [None]:
hiringData.drop('new_id', axis=1).corr()
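To get a feel for how to read the correlation table, here is a minimal sketch on made-up numbers (the column names are hypothetical). Values close to 1 mean two variables move together, and values close to -1 mean they move in opposite directions:

```python
import pandas as pd

# Made-up columns: one moves with x, the other moves against it
toy = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "same_direction": [2, 4, 6, 8],
    "opposite_direction": [8, 6, 4, 2],
})

# Pairwise Pearson correlations between all numerical columns
corr = toy.corr()
print(corr.round(2))
```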

# 3. Ordinary Least Squares Intro

The goal of a regression is to search for associations between one or many variables and another. For example, determining the price of a house (what we want to predict) might only be possible with additional information (what will help us make a prediction) such as the number of bathrooms, the number of bedrooms, the number of garage spaces, etc.

A regression is a type of linear model with which we can quantify the relationships in our data and, at the same time, determine how reliable those relationships are. A linear model usually looks as follows,

$y = a*x + b$

- $a$ - is the slope of the line
- $b$ - y intercept is where the line crosses the y-axis
- $y$ - is what we are trying to predict
- $x$ - is what we are using to predict

We are interested in finding the optimal values of $a$ and $b$, which are also called parameters.

You might also be wondering, which line are we talking about? The line of best fit is a line of predicted values that runs as close as possible to our data points. This means that the values on such a line are not necessarily perfect predictors but rather the best predictors given the data, which in turn means that there will be a difference between the actual data and the predictions. These errors, the differences between a predicted value and a real one, are called residuals. Our goal is often to minimize the squared distances between the observed values and the line of best fit. See the image below for an example of a line of best fit.

![line](https://images.saymedia-content.com/.image/t_share/MTc0MjM1NjgwNzExNzgwMjIw/how-to-create-a-simple-linear-regression-equation.png)
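For a single predictor, the values of $a$ and $b$ that minimize the squared distances have a well-known closed form. The sketch below applies it in plain Python to made-up numbers (the `x` and `y` values are assumptions chosen for illustration, not taken from the hiring data):

```python
# Made-up data: y grows by roughly 2 for every unit of x
x = [1, 2, 3, 4, 5]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Slope: how x and y vary together, divided by how much x varies
a = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sum(
    (xi - x_mean) ** 2 for xi in x
)
# Intercept: makes the line pass through the point (x_mean, y_mean)
b = y_mean - a * x_mean

# Predicted values on the line, and the residuals (actual - predicted)
predictions = [a * xi + b for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predictions)]

print(f"slope a = {a:.2f}, intercept b = {b:.2f}")
```

With an intercept in the model, the residuals of the fitted line add up to zero (up to rounding), which is a handy sanity check.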

Nomenclature and definitions

- $Y$ - The vector, array, characteristic or value that we are trying to predict. This is often called,
    - Dependent Variable
    - Target Variable
    - Outcome Variable
    - Response Variable
- $X$ - Can be a single array or a matrix representing multiple variables. These measures we use to make predictions are often called,
    - Independent Variable(s)
    - Features
    - Predictor Variable(s)
- Fitted values - the estimates obtained from the regression, aka predicted values
- Coefficients - measure the strength of the relationship between the independent variable(s) and the dependent variable as well as the sign of such relationship (i.e. positive or negative). These are also the slopes
- Residuals - difference between predicted value(s) (the line fitted) and the actual data
- $R^2$ - How much of the variation in our dependent variable is explained by the variation in the independent variable(s). This number goes from 0 to 1 and the way to interpret it is, "x% of the variation in our dependent variable is explained by the variation in our independent variable(s)"
- Adjusted $R^2$ - a version of $R^2$ that is penalized for the number of independent variables in the model
- Sum of Squared Residuals - the sum of the squared residuals, i.e. of the squared differences between the real data and the line that best fits the data. We want this to be as close to 0 as possible
- Mean Square Error - is the average squared residuals, in other words, the average of the squared differences between the predicted values and the actual values. We want this number to be as small as possible.
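The fit measures defined above can be computed by hand. This is a minimal sketch on made-up numbers, assuming a line with slope 1.97 and intercept 1.09 has already been fitted to them. The adjusted $R^2$ uses the common formula $1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $n$ is the number of observations and $p$ the number of independent variables:

```python
# Made-up data and an already fitted line y = 1.97 * x + 1.09
x = [1, 2, 3, 4, 5]
actual = [3.1, 4.9, 7.2, 8.8, 11.0]
fitted = [1.97 * xi + 1.09 for xi in x]

n = len(actual)  # number of observations
p = 1            # number of independent variables

residuals = [yi - fi for yi, fi in zip(actual, fitted)]

# Sum of squared residuals and mean squared error
ssr = sum(r ** 2 for r in residuals)
mse = ssr / n

# Total variation of the dependent variable around its mean
y_mean = sum(actual) / n
sst = sum((yi - y_mean) ** 2 for yi in actual)

# Share of the variation explained by the line, and its adjusted version
r2 = 1 - ssr / sst
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"SSR = {ssr:.3f}, MSE = {mse:.4f}")
print(f"R^2 = {r2:.4f}, adjusted R^2 = {adj_r2:.4f}")
```

Note that some software divides the sum of squared residuals by the degrees of freedom rather than by $n$ when reporting a mean squared error, so the numbers may differ slightly from the plain average used here.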


Assumptions:  
- The regression model is linear in the coefficients and the error term
- The error term has a population mean of zero
- All independent variables are uncorrelated with the error term
- Observations of the error term are uncorrelated with each other
- The error term has a constant variance
- No independent variable is a perfect linear function of other explanatory variables
- The error term is normally distributed (optional)

# 4. Examples

The following examples were the ones used in lesson 2 of the course. Let's go over them one by one.

## First Example

Our main variable for this exercise, i.e. the one that we are interested in validating given other conditions, is a one-dimensional array or vector of numbers representing a performance metric for employees.

In [None]:
# This is our main variable of interest, we can select one particular variable with brackets, the name, and quotation marks
# this is also called a vector or array, think of it as a column in Excel
y = hiringData['mainperformancemetric']

# the rest of the variables except the ones below
X = sm.add_constant(hiringData.drop(["mainperformancemetric", "new_id", "age"], axis=1).copy())

# run a regression on our main metric using the remaining variables in X as the independent variables
model1 = sm.OLS(y, X).fit()

# print the summary
print(model1.summary())

In [None]:
print("Table with Significance Stars")
print(summary_col(model1, stars=True))

## Second Example

In [None]:
# new set of independent variables
# we now exclude tenure and age
X_2 = sm.add_constant(hiringData.drop(["mainperformancemetric", "new_id", "tenure", "age"],axis=1))

In [None]:
model2 = sm.OLS(y, X_2).fit()
print(model2.summary())

In [None]:
# You can also produce a table with just the coefficients with significance 
# stars using the code below 
print("Table with Significance Stars")
print(summary_col(model2, stars=True))

## Third Example - Tenure Positively Affects both Maturity and Performance

In [None]:
print(sm.OLS(y, sm.add_constant(hiringData['tenure'])).fit().summary())

## Fourth Example

In [None]:
print(sm.OLS(hiringData['maturityassesment'], sm.add_constant(hiringData['tenure'])).fit().summary())

## Fifth Example - No Team Quality

In [None]:
X_3 = sm.add_constant(hiringData.drop(["mainperformancemetric", "new_id", "teamquality"], axis=1))

In [None]:
model3 = sm.OLS(y, X_3).fit()
print(model3.summary())

In [None]:
# You can also produce a table with just the coefficients with significance 
# stars using the code below 
print("Table with Significance Stars")
print(summary_col(model3, stars=True))

## Sixth Example - Teams are Back

In [None]:
# Same if we include age instead of tenure.
X_4 = sm.add_constant(hiringData.drop(["mainperformancemetric", "new_id"], axis=1))

In [None]:
model4 = sm.OLS(y, X_4).fit()
print(model4.summary())

In [None]:
# You can also produce a table with just the coefficients with significance 
# stars using the code below 
print("Table with Significance Stars")
print(summary_col(model4, stars=True))

## Seventh Example - Teams with MBAs

In [None]:
X_5 = sm.add_constant(hiringData['teamquality'])

print(sm.OLS(hiringData['didmba'], X_5).fit().summary())

## Eighth Example - Team Quality on Performance

In [None]:
X_6 = sm.add_constant(hiringData['teamquality'])

print(sm.OLS(hiringData['mainperformancemetric'], X_6).fit().summary())

# 5. Python Breakdown

## 5.1 Packages

Programming languages are very diverse creatures composed of built-in functionalities and add-ons. These functionalities and add-ons are pieces of code grouped together because they are useful for a particular problem or task. This means that when we start a Python session, either through the command line or in a Jupyter Notebook, we don't immediately have all of its most useful tools available in the session. Instead, Python lets us pick and choose whatever we need, when we need it. The packages we used serve two well-defined purposes: statistical modeling (statsmodels), and data manipulation and analysis (pandas).

For example, to use a built-in mathematical function that gets us the square root of a number, we would have to `import` the library `math` first and then call the function `sqrt()` on it, as in `math.sqrt(49)`, to get the result we want. We could create our own function to do this, but then we would have to write it again every time we needed it for a task (not a very productive thing to do).

To import these additional libraries of code we need to use the `import` statement, or a variation/combination of it. Let's go over a few examples.

In [None]:
import math # first import the library you need

In [None]:
# You can then call the function you need by typing the library name, followed by a dot, and then the function name
math.sqrt(49)

Another way to import libraries is by using an alias. This is particularly useful for libraries with very long names, since typing them out every time would decrease our productivity.

In [None]:
# Here we will import math with the alias ma
import math as ma

In [None]:
ma.sqrt(49)

Sometimes we only want to use a single function from a library, thus, we don't want to import the whole library if we won't be using it. In the following example, we take `sqrt` out of the math package.

In [None]:
# Here is how we can import a standalone function from a library
from math import sqrt

In [None]:
sqrt(49)

We can also give aliases to the functions of a library. To do so, we import the function we want and rename it in the same statement using the keyword `as`.

In [None]:
from math import sqrt as sq

sq(36)

## 5.2 Reading Data

![all the data](http://blogs.ubc.ca/coetoolbox/files/2014/03/meme-data-data-everywhere.png)

When working with data in Python you will encounter datasets in all shapes and formats, so it is crucial to understand how to deal with the different formats in order to work with data. Here are some of the most common ones and the most useful functions for them in Python.

- CSV --> Comma Separated Values --> `pd.read_csv(file, sep=',')`
- TSV --> Tab Separated Values --> `pd.read_csv(file, sep='\t')`
- Excel --> Microsoft Excel format (.xlsx) --> `pd.read_excel()`
- JSON --> JavaScript Object Notation --> `pd.read_json()`
- HTML --> Hypertext Markup Language --> `pd.read_html()`

The one we read is a comma-separated values file, which is equivalent to the data you see in spreadsheets, except that the columns in every row are separated by commas `,`.
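To see that the separator is the only difference between a CSV and a TSV file, here is a small sketch that reads the same made-up table in both formats. The text is kept in memory with `io.StringIO` instead of a file on disk, and the column values are assumptions for illustration:

```python
import io

import pandas as pd

# The same made-up table, once separated by commas and once by tabs
csv_text = "new_id,age,gpa\n1,48,4.0\n2,46,3.91\n"
tsv_text = csv_text.replace(",", "\t")

# read_csv handles both, we only need to tell it which separator to expect
from_csv = pd.read_csv(io.StringIO(csv_text))
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(from_csv)
print(from_csv.equals(from_tsv))
```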

In [28]:
hiringData = pd.read_csv("Hyp_employees.csv")
hiringData.head()

Unnamed: 0,new_id,age,gender,undergradranking,gpa,didmba,extracurriculars,maturityassesment,programmingskill,disciplinetest,ambitiousnesstest,creativitytest,neuroticismtest,extraversiontest,teamquality,tenure,currentseniority,mainperformancemetric
0,1,48,0,3,4.0,0,3,2,40,9,7,4,5,5,2,19,9,5
1,2,46,0,3,4.0,0,7,4,37,6,6,5,5,4,2,18,9,7
2,3,48,1,3,4.0,0,3,2,32,7,7,5,6,5,1,18,9,5
3,4,44,0,3,3.91,0,3,3,55,3,6,6,7,5,1,17,10,5
4,5,45,0,3,4.0,0,3,2,44,5,4,4,8,4,1,17,9,5


## 5.3 Variables, Printing, and data types and structures

When you read in data you don't just leave it flying around, you assign it to a variable. You can think of variables as containers that can hold any piece of information in our session. Variables usually hold one of two kinds of objects: single data types, which can be integers, strings, floats, or booleans, or data structures, which are containers for data types. Variables can hold either, and, as a matter of fact, that is what we did earlier with our data and models.

Here are some rules you should remember about variables.

__Do's & Don'ts__

- A variable name can only contain letters (uppercase or lowercase), underscores, and numbers  
good --> `variable_1`  ✅  
bad --> `vari--!#able_@`  ❌
- Variables cannot start with numbers  
good --> `variable_1`  ✅  
bad --> `123_variables`  ❌
- Variables are case sensitive  
This variable --> `variable_1` <-- is not the same as --> `VARiable_1`
- Variables cannot be only numbers  
good --> `something45678`  ✅  
bad --> `45678` ❌
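These naming rules can be checked without risking a syntax error by using the string method `.isidentifier()`. The sketch below runs it on the names from the list above (the `candidates` and `validity` names are just for illustration):

```python
# Names taken from the rules above
candidates = ["variable_1", "vari--!#able_@", "123_variables", "something45678", "45678"]

# .isidentifier() returns True when a string follows the naming rules.
# It does not catch reserved words such as `class`, which still cannot be used.
validity = {name: name.isidentifier() for name in candidates}

for name, is_valid in validity.items():
    print(f"{name!r} is a valid variable name: {is_valid}")

# Case sensitivity: these are two different variables
variable_1 = "lowercase"
VARiable_1 = "mixed case"
print(variable_1 == VARiable_1)
```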

In [22]:
'this is a string'

'this is a string'

In [23]:
x = 'this is a string inside a variable called x'

In [24]:
print(x)

this is a string inside a variable called x


In [25]:
an_integer = 32
print(an_integer)

32


In [26]:
a_float = 2.1
print(a_float)

2.1


In [27]:
a_boolean = True
print(a_boolean)

True


As mentioned previously, data structures help us hold data, and the one we have been using is called a dataframe. When we read data with pandas, we are reading in the equivalent of a spreadsheet but with more functionalities. Here are some of the other data structures you should be aware of. We will see more of these later, but here are some explanations.

- lists - Lists in Python are some of the most versatile data structures available. They can hold multiple data types at the same time, and their elements can all be accessed using the same conventions as with strings plus more. To create a list you have to use square brackets `[ ]` and separate the values in them with commas `,`.
- dictionaries - Dictionaries are data structures analogous to what is called a hash table. They hold key-value pairs of data where the key is the name of the value (it can be a string or a number), and the value is any kind of data type (e.g. numbers, strings, booleans, etc.) or data structure (e.g. another dictionary, a list, a set, a tuple, etc.). You can create a dictionary using curly braces, `var = {"key": value(s)}`, separating the key and value with a colon `:`. Additional key-value pairs are separated by a comma `,`.
- tuples - Tuples are lists' cousins except that they are immutable. This means that the content of a tuple cannot be altered. What you can do instead is take the elements out of a tuple through what is called unpacking. Unpacking means taking them out of the tuple and putting them into another container (e.g. a variable) or another data structure, e.g. a list. Tuples are usually written with parentheses `()` and with commas `,` separating their elements.
- sets - Sets are the more strict cousins of lists. While lists allow for multiple data types and structures in them, sets don't like to have the same data twice in them and also can't stand their cousins, the lists and the dictionaries. They do get along with their first cousins the tuples though. To get a set started, wrap comma-separated values in curly braces `{}`, in the same fashion you would create a dictionary but without the `key : value` construct. Note that empty braces `{}` create a dictionary, so an empty set has to be created with `set()`. You can also call the function `set()` on a list or tuple and this will return a set.
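The four data structures described above can be sketched in a few lines. All names and values below are made up for illustration:

```python
# list: ordered, can mix data types, and can be changed
a_list = [1, "two", 3.0, True]
a_list.append("five")

# dictionary: key-value pairs, values can be any type or structure
a_dict = {"name": "Ana", "scores": [9, 7]}
a_dict["tenure"] = 3  # add a new key-value pair

# tuple: like a list but immutable, its elements can be unpacked
a_tuple = (48, 4.0)
age, gpa = a_tuple
try:
    a_tuple[0] = 50  # trying to change a tuple raises an error
except TypeError as error:
    print("Tuples cannot be changed:", error)

# set: keeps only unique values
a_set = set([1, 2, 2, 3, 3, 3])

# empty braces make a dictionary, an empty set needs set()
empty_dict = {}
empty_set = set()

print(a_list, a_dict, (age, gpa), a_set, sep="\n")
```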

## 5.4 Examining data

The first thing we want to do as soon as we get the data is to examine its content not only to see the kind of data we have but also to see if we can spot any inconsistencies that need to be dealt with from the start. Here are a few very useful methods available in pandas.

- `df.head()` --> shows the first 5 rows of a DataFrame or Series
- `df.tail()` --> shows the last 5 rows of a DataFrame or Series
- `df.info()` --> provides information about the DataFrame or Series
- `df.describe()` --> provides descriptive statistics of the numerical variables in a DataFrame
- `df.isna()` --> returns True for every element that is NaN and False for every element that isn't
- `df.notna()` --> does the opposite of `.isna()`

In [None]:
hiringData.head()

In [None]:
hiringData.tail()

In [None]:
hiringData.info()

In [None]:
hiringData.describe()

In [None]:
hiringData.isna()

In [None]:
hiringData.isna().sum()

In [None]:
hiringData.notna().sum()
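Chaining `.sum()` after `.isna()` works because `True` counts as 1 and `False` as 0, so the sum of each column is its number of missing values. A minimal sketch on a made-up dataframe with a couple of gaps (the `toy` values are assumptions, not from the hiring data):

```python
import pandas as pd

# Made-up data where None marks a missing value
toy = pd.DataFrame({
    "age": [48, None, 45],
    "gpa": [4.0, 3.91, None],
    "tenure": [19, 18, 17],
})

# Count of missing and of present values in every column
missing_per_column = toy.isna().sum()
present_per_column = toy.notna().sum()

print(missing_per_column)
print(present_per_column)
```

For every column, the two counts add up to the number of rows in the dataframe.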