# In Depth A/B Testing - Lab

## Introduction

In this lab, you'll explore a survey from Kaggle regarding budding data scientists. With this, you'll form some initial hypotheses, and test them using the tools you've acquired to date. 

## Objectives

You will be able to:

* Conduct t-tests and an ANOVA on a real-world dataset and interpret the results

## Load the Dataset and Perform a Brief Exploration

The data is stored in a file called **multipleChoiceResponses_cleaned.csv**. Feel free to check out the original dataset referenced at the bottom of this lab, although this cleaned version will undoubtedly be easier to work with. Additionally, metadata regarding the questions is stored in a file named **schema.csv**. Load the data as a Pandas DataFrame, and take a moment to get acquainted with it.

> Note: If you can't get the file to load properly, try changing the encoding format as in `encoding='latin1'`
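
One way to act on the note above is to try a list of encodings in order. The sketch below is only an illustration: `read_csv_with_fallback` is a hypothetical helper (it is not part of pandas or of this lab's files), and the encoding list is an assumption you may need to adjust.

```python
import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin1"), **kwargs):
    """Hypothetical helper: try each encoding in turn until one decodes the file."""
    last_error = None
    for enc in encodings:
        try:
            # Extra keyword arguments are passed straight through to pd.read_csv
            return pd.read_csv(path, encoding=enc, **kwargs)
        except UnicodeDecodeError as err:
            # Remember the failure and move on to the next candidate encoding
            last_error = err
    if last_error is None:
        raise ValueError("no encodings were supplied")
    raise last_error


# Possible usage with this lab's file (left commented out so the sketch
# does not depend on the file being present):
# df = read_csv_with_fallback('multipleChoiceResponses_cleaned.csv', low_memory=False)
```

Note that `latin1` can decode any byte sequence, so placing it last makes it a catch-all; text may still come out garbled if the file was actually written in a different encoding.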

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns

In [2]:
df = pd.read_csv('multipleChoiceResponses_cleaned.csv', low_memory=False, lineterminator='\n')


## Wages and Education

You've been asked to determine whether education level affects salary. Develop a hypothesis test to compare the salaries of those with Master's degrees to those with Bachelor's degrees. Are the two statistically different according to your results?

> Note: The relevant features are stored in the 'FormalEducation' and 'AdjustedCompensation' columns.

You may import the functions stored in the `flatiron_stats.py` file to help perform your hypothesis tests. It contains the stats functions that you previously coded: `welch_t(a,b)`, `welch_df(a, b)`, and `p_value(a, b, two_sided=False)`. 

Note that `scipy.stats.ttest_ind(a, b, equal_var=False)` performs a two-sided Welch's t-test, and that a two-sided p-value is twice the one-sided p-value taken in the direction of the observed difference. See the [documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html) for more information.
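
The relationship between the two-sided and one-sided p-values can be checked numerically. This is a minimal sketch on synthetic data (the samples `a` and `b` are made up, not drawn from the survey); it computes the Welch-Satterthwaite degrees of freedom by hand rather than relying on the functions in `flatiron_stats.py`.

```python
import numpy as np
from scipy import stats

# Synthetic samples with unequal variances and sizes (illustrative only)
rng = np.random.default_rng(0)
a = rng.normal(loc=50, scale=10, size=200)
b = rng.normal(loc=53, scale=15, size=150)

# Two-sided Welch's t-test
t_stat, p_two_sided = stats.ttest_ind(a, b, equal_var=False)

# Welch-Satterthwaite degrees of freedom
va = a.var(ddof=1) / a.size
vb = b.var(ddof=1) / b.size
df_welch = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))

# One-sided p-value in the direction of the observed difference:
# the upper-tail area beyond |t| under a t distribution with df_welch
p_one_sided = stats.t.sf(abs(t_stat), df_welch)

print(p_two_sided, 2 * p_one_sided)  # the two printed values should agree
```

If your alternative hypothesis points the opposite way from the observed difference, the one-sided p-value is instead `1 - p_two_sided / 2`, so halving is only valid after checking the sign of the t-statistic.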

In [3]:
ed_inc = df[['FormalEducation','AdjustedCompensation']].dropna()

In [4]:
ed_inc.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4372 entries, 3 to 16700
Data columns (total 2 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   FormalEducation       4372 non-null   object 
 1   AdjustedCompensation  4372 non-null   float64
dtypes: float64(1), object(1)
memory usage: 102.5+ KB


In [5]:
ed_inc.head()

| | FormalEducation | AdjustedCompensation |
|---|---|---|
| 3 | Master's degree | 250000.0 |
| 8 | Bachelor's degree | 64184.8 |
| 9 | Bachelor's degree | 20882.4 |
| 11 | Bachelor's degree | 1483.9 |
| 14 | Master's degree | 36634.4 |


In [6]:
ed_inc['FormalEducation'].value_counts()

Master's degree                                                      2006
Bachelor's degree                                                    1110
Doctoral degree                                                       976
Professional degree                                                   131
Some college/university study without earning a bachelor's degree     112
I did not complete any formal education past high school               30
I prefer not to answer                                                  7
Name: FormalEducation, dtype: int64

In [7]:
ed_inc['AdjustedCompensation'].describe()

count    4.372000e+03
mean     6.592544e+06
std      4.279731e+08
min     -7.351631e+01
25%      2.049480e+04
50%      5.381217e+04
75%      9.566608e+04
max      2.829740e+10
Name: AdjustedCompensation, dtype: float64

In [8]:
ed_inc['AdjustedCompensation'].median()

53812.17000000001

In [9]:
# Keep non-negative salaries under 1,000,000 for the three degree levels
# compared in this lab
degrees = ["Bachelor's degree", "Master's degree", "Doctoral degree"]
df_clean = ed_inc[(ed_inc['AdjustedCompensation'] >= 0)
                  & (ed_inc['AdjustedCompensation'] < 1000000)
                  & ed_inc['FormalEducation'].isin(degrees)].reset_index(drop=True)

In [10]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4081 entries, 0 to 4080
Data columns (total 2 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   FormalEducation       4081 non-null   object 
 1   AdjustedCompensation  4081 non-null   float64
dtypes: float64(1), object(1)
memory usage: 63.9+ KB


In [11]:
"""
Metrics: FormalEducation (Categorical) and AdjustedCompensation 
(Numerical)

H_0 = Mean income is the same across education levels.
H_a = Mean income differs for at least one education level.

Test: Because we are comparing three categories, I will conduct a 
one-way ANOVA test.

alpha = 0.05
"""

In [12]:
df_clean.groupby('FormalEducation').mean()

| FormalEducation | AdjustedCompensation |
|---|---|
| Bachelor's degree | 54073.441445 |
| Doctoral degree | 86761.997815 |
| Master's degree | 64153.257678 |


In [13]:
bach = df_clean[df_clean['FormalEducation']== "Bachelor's degree"]['AdjustedCompensation']
mast = df_clean[df_clean['FormalEducation']== "Master's degree"]['AdjustedCompensation']
doc = df_clean[df_clean['FormalEducation']== "Doctoral degree"]['AdjustedCompensation']

stats.f_oneway(bach, mast, doc)

F_onewayResult(statistic=83.35995316548635, pvalue=3.2937908764507496e-36)

In [14]:
# The mean doctoral income is clearly different from the other two, so
# follow up with a Welch's t-test on bachelor's vs. master's incomes
stats.ttest_ind(bach,mast,equal_var=False)

Ttest_indResult(statistic=-4.725515975586772, pvalue=2.4365850139260144e-06)

In [15]:
"""
We reject the ANOVA null hypothesis because the p-value (about 3.3e-36) 
is less than 0.05, meaning mean income differs significantly across 
education levels.

Next, I conducted a Welch's t-test comparing master's and bachelor's 
incomes and also found a significant difference, with a p-value of 
about 0.000002.
"""

## Wages and Education II

Now perform a similar statistical test comparing the AdjustedCompensation of those with Bachelor's degrees and those with Doctorates. If you haven't already, be sure to explore the distribution of the AdjustedCompensation feature for any anomalies. 
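
As a rough illustration of what "exploring for anomalies" can look like, the sketch below builds a synthetic salary-like series, plants a few extreme values, and flags points outside the usual 1.5 × IQR fences. The data, the planted values, and the choice of IQR fences are all assumptions made for the example; the lab itself does not prescribe a specific outlier rule.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "salary" data (illustrative only)
rng = np.random.default_rng(1)
pay = pd.Series(rng.lognormal(mean=11, sigma=0.6, size=1000))

# Plant a few anomalies similar in spirit to those visible in describe()
pay.iloc[:3] = [2.8e10, 5.0e8, -73.5]

# A handful of extreme values drags the mean far from the median
print(pay.mean(), pay.median())

# Flag values outside the 1.5 * IQR fences
q1, q3 = pay.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
flagged = pay[(pay < lower) | (pay > upper)]
print(len(flagged), "values fall outside the IQR fences")

# Small negative values can slip past the lower fence, so check them separately
print((pay < 0).sum(), "negative values")
```

A large gap between the mean and the median, as in the `describe()` output earlier in this lab, is a quick signal that a few extreme values are dominating the summary statistics.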

In [16]:
# H_0 = There is no significant difference in mean income between 
# bachelor's and doctoral degree holders.

# H_a = There is a significant difference in mean income between 
# bachelor's and doctoral degree holders.

# alpha = 0.05

stats.ttest_ind(bach,doc,equal_var=False)



Ttest_indResult(statistic=-12.009757583777525, pvalue=4.2199881493416054e-32)

In [17]:
"""
We can reject the null hypothesis since our p-value is less than 0.05. 
This means that there is a significant difference between mean incomes 
for bachelor's degree holders and doctoral degree holders.
"""

## Wages and Education III

Remember the multiple comparisons problem; rather than continuing on like this, perform an ANOVA test between the various 'FormalEducation' categories and their relation to 'AdjustedCompensation'.
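
To see why repeated pairwise tests are a problem, the following sketch computes the family-wise error rate for all pairwise comparisons among the three degree groups, along with a Bonferroni-adjusted threshold. It assumes the tests are independent, which is not strictly true for pairwise tests that share groups, so treat the result as an approximation.

```python
from math import comb

alpha = 0.05   # per-test significance level used in this lab
n_groups = 3   # Bachelor's, Master's, Doctoral

# Number of distinct pairwise comparisons among the groups
n_tests = comb(n_groups, 2)

# Probability of at least one false positive across all tests,
# assuming the tests are independent (an approximation here)
fwer = 1 - (1 - alpha) ** n_tests

# Bonferroni correction: divide alpha by the number of tests
bonferroni_alpha = alpha / n_tests

print(n_tests, round(fwer, 4), round(bonferroni_alpha, 4))
```

With three groups there are three pairwise tests, and the chance of at least one false positive rises from 5% to roughly 14%. A single ANOVA avoids this inflation by testing all group means at once.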

In [18]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4081 entries, 0 to 4080
Data columns (total 2 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   FormalEducation       4081 non-null   object 
 1   AdjustedCompensation  4081 non-null   float64
dtypes: float64(1), object(1)
memory usage: 63.9+ KB


In [19]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

formula = "AdjustedCompensation ~ C(FormalEducation)"
lm = ols(formula, df_clean).fit()
table = sm.stats.anova_lm(lm, typ=2)
print(table)




                          sum_sq      df          F        PR(>F)
C(FormalEducation)  5.806439e+11     2.0  83.359953  3.293791e-36
Residual            1.420266e+13  4078.0        NaN           NaN


## Additional Resources

Here's the original source where the data was taken from:  
    [Kaggle Machine Learning & Data Science Survey 2017](https://www.kaggle.com/kaggle/kaggle-survey-2017)

## Summary

In this lab, you practiced conducting actual hypothesis tests on actual data. From this, you saw how dependent results can be on the initial problem formulation, including preprocessing!