# In Depth A/B Testing - Lab

## Introduction

In this lab, you'll explore a Kaggle survey of budding data scientists. You'll form some initial hypotheses about the data and test them using the tools you've acquired to date.

## Objectives

You will be able to:
* Conduct t-tests and an ANOVA on a real-world dataset and interpret the results

## Load the Dataset and Perform a Brief Exploration

The data is stored in a file called **multipleChoiceResponses_cleaned.csv**. Feel free to check out the original dataset referenced at the bottom of this lab, although this cleaned version will be easier to work with. Additionally, metadata regarding the questions is stored in a file named **schema.csv**. Load the data as a Pandas DataFrame, and take a moment to get acquainted with it.

> Note: If you can't get the file to load properly, try changing the encoding format as in `encoding='latin1'`
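
If you'd rather not hard-code the encoding, one option is to try a short list of encodings in order. The helper below is a minimal sketch: the name `read_csv_with_fallback` and the default encoding list are assumptions made for illustration, not part of the lab materials.

```python
import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin1")):
    """Try each encoding in turn and return the first DataFrame that loads.

    Hypothetical helper: the name and the default encodings are illustrative.
    """
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError as err:
            # Remember the failure and move on to the next candidate encoding.
            last_error = err
    raise last_error


# Example call (assumes the lab's CSV file is in the working directory):
# tists = read_csv_with_fallback('multipleChoiceResponses_cleaned.csv')
```

Note that `latin1` can decode any byte sequence, so it works best as the last entry in the list.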

In [1]:
import pandas as pd
from scipy import stats  # Welch's t-test via stats.ttest_ind
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm  # sm.stats.anova_lm builds the ANOVA table
from statsmodels.formula.api import ols

In [4]:
#Your code here
tists = pd.read_csv('multipleChoiceResponses_cleaned.csv', encoding='latin1')

In [5]:
tists.head(7)

Unnamed: 0,GenderSelect,Country,Age,EmploymentStatus,StudentStatus,LearningDataScience,CodeWriter,CareerSwitcher,CurrentJobTitleSelect,TitleFit,...,JobFactorTitle,JobFactorCompanyFunding,JobFactorImpact,JobFactorRemote,JobFactorIndustry,JobFactorLeaderReputation,JobFactorDiversity,JobFactorPublishingOpportunity,exchangeRate,AdjustedCompensation
0,"Non-binary, genderqueer, or gender non-conforming",,,Employed full-time,,,Yes,,DBA/Database Engineer,Fine,...,,,,,,,,,,
1,Female,United States,30.0,"Not employed, but looking for work",,,,,,,...,,,,,,Somewhat important,,,,
2,Male,Canada,28.0,"Not employed, but looking for work",,,,,,,...,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,,
3,Male,United States,56.0,"Independent contractor, freelancer, or self-em...",,,Yes,,Operations Research Practitioner,Poorly,...,,,,,,,,,1.0,250000.0
4,Male,Taiwan,38.0,Employed full-time,,,Yes,,Computer Scientist,Fine,...,,,,,,,,,,
5,Male,Brazil,46.0,Employed full-time,,,Yes,,Data Scientist,Fine,...,,,,,,,,,,
6,Male,United States,35.0,Employed full-time,,,Yes,,Computer Scientist,Fine,...,,,,,,,,,,


In [6]:
tists.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26394 entries, 0 to 26393
Columns: 230 entries, GenderSelect to AdjustedCompensation
dtypes: float64(15), object(215)
memory usage: 46.3+ MB


In [7]:
tists['GenderSelect'].unique()

array(['Non-binary, genderqueer, or gender non-conforming', 'Female',
       'Male', 'A different identity', nan, 'Health Data', 'Image data',
       'not in standard formats ',
       'Find the proper number of nodes for reconstruction',
       'come up good labelling strategy',
       'No clear understanding of expections.', 'Data quality',
       'Need to create data lakes',
       'Most data is in cubes or data warehouses and requires data pulls to bring into toolsets',
       'Kaggle Dataset;', 'TensorFlow public models;',
       'facebook fastText;', 'Microsoft 1 billion images',
       'feret database image facial expressions', 'vggface dataset',
       'NPD',
       'Neuroscience; Allen Brain Atlas; BIRN fMRI and MRI data; Database for Reaching Experiments And Models (DREAM);The fMRI Data Center;nternational Neuroimaging Data-sharing Initiative (INDI)',
       'UCI ML repository', 'EMRs', 'Messy data', 'Text mining',
       'When Unstructured data and Text data processing and c

## Wages and Education

You've been asked to determine whether education affects salary. Develop a hypothesis test to compare the salaries of those with Master's degrees to those with Bachelor's degrees. Are the two statistically different according to your results?

> Note: The relevant data is stored in the 'FormalEducation' and 'AdjustedCompensation' columns.

You may import the functions stored in the `flatiron_stats.py` file to help perform your hypothesis tests. It contains the stats functions that you previously coded: `welch_t(a,b)`, `welch_df(a, b)`, and `p_value(a, b, two_sided=False)`. 

Note that `scipy.stats.ttest_ind(a, b, equal_var=False)` performs a two-sided Welch's t-test. Because the t-distribution is symmetric, a two-sided p-value is twice the one-sided p-value taken in the direction of the observed difference. See the [documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html) for more information.
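
To make the relationship between the one-sided and two-sided p-values concrete, here is a minimal sketch on synthetic data (the arrays `a` and `b` are randomly generated for illustration and have nothing to do with the survey). It computes Welch's t-statistic and the Welch-Satterthwaite degrees of freedom by hand and compares them with SciPy's result.

```python
import numpy as np
from scipy import stats

# Synthetic samples with different means, variances, and sizes (illustrative only).
rng = np.random.default_rng(0)
a = rng.normal(loc=10.0, scale=2.0, size=40)
b = rng.normal(loc=11.0, scale=4.0, size=60)

# SciPy: two-sided Welch's t-test.
t_stat, p_two_sided = stats.ttest_ind(a, b, equal_var=False)

# By hand: Welch's t-statistic and the Welch-Satterthwaite degrees of freedom.
va = a.var(ddof=1) / a.size
vb = b.var(ddof=1) / b.size
t_manual = (a.mean() - b.mean()) / np.sqrt(va + vb)
df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))

# One-sided p-value in the direction of the observed difference.
p_one_sided = stats.t.sf(abs(t_manual), df)

print("t (scipy, manual):", t_stat, t_manual)
print("two-sided p:", p_two_sided, " 2 x one-sided p:", 2 * p_one_sided)
```

The two printed p-values should agree up to floating-point error.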

In [14]:
pd.set_option("display.max_columns", None)
tists.head()

Unnamed: 0,GenderSelect,Country,Age,EmploymentStatus,StudentStatus,LearningDataScience,CodeWriter,CareerSwitcher,CurrentJobTitleSelect,TitleFit,CurrentEmployerType,MLToolNextYearSelect,MLMethodNextYearSelect,LanguageRecommendationSelect,PublicDatasetsSelect,LearningPlatformSelect,LearningPlatformUsefulnessArxiv,LearningPlatformUsefulnessBlogs,LearningPlatformUsefulnessCollege,LearningPlatformUsefulnessCompany,LearningPlatformUsefulnessConferences,LearningPlatformUsefulnessFriends,LearningPlatformUsefulnessKaggle,LearningPlatformUsefulnessNewsletters,LearningPlatformUsefulnessCommunities,LearningPlatformUsefulnessDocumentation,LearningPlatformUsefulnessCourses,LearningPlatformUsefulnessProjects,LearningPlatformUsefulnessPodcasts,LearningPlatformUsefulnessSO,LearningPlatformUsefulnessTextbook,LearningPlatformUsefulnessTradeBook,LearningPlatformUsefulnessTutoring,LearningPlatformUsefulnessYouTube,BlogsPodcastsNewslettersSelect,LearningDataScienceTime,JobSkillImportanceBigData,JobSkillImportanceDegree,JobSkillImportanceStats,JobSkillImportanceEnterpriseTools,JobSkillImportancePython,JobSkillImportanceR,JobSkillImportanceSQL,JobSkillImportanceKaggleRanking,JobSkillImportanceMOOC,JobSkillImportanceVisualizations,JobSkillImportanceOtherSelect1,JobSkillImportanceOtherSelect2,JobSkillImportanceOtherSelect3,CoursePlatformSelect,HardwarePersonalProjectsSelect,TimeSpentStudying,ProveKnowledgeSelect,DataScienceIdentitySelect,FormalEducation,MajorSelect,Tenure,PastJobTitlesSelect,FirstTrainingSelect,LearningCategorySelftTaught,LearningCategoryOnlineCourses,LearningCategoryWork,LearningCategoryUniversity,LearningCategoryKaggle,LearningCategoryOther,MLSkillsSelect,MLTechniquesSelect,ParentsEducation,EmployerIndustry,EmployerSize,EmployerSizeChange,EmployerMLTime,EmployerSearchMethod,UniversityImportance,JobFunctionSelect,WorkHardwareSelect,WorkDataTypeSelect,WorkProductionFrequency,WorkDatasetSize,WorkAlgorithmsSelect,WorkToolsSelect,WorkToolsFrequencyAmazonML,WorkToolsFrequencyAWS
,WorkToolsFrequencyAngoss,WorkToolsFrequencyC,WorkToolsFrequencyCloudera,WorkToolsFrequencyDataRobot,WorkToolsFrequencyFlume,WorkToolsFrequencyGCP,WorkToolsFrequencyHadoop,WorkToolsFrequencyIBMCognos,WorkToolsFrequencyIBMSPSSModeler,WorkToolsFrequencyIBMSPSSStatistics,WorkToolsFrequencyIBMWatson,WorkToolsFrequencyImpala,WorkToolsFrequencyJava,WorkToolsFrequencyJulia,WorkToolsFrequencyJupyter,WorkToolsFrequencyKNIMECommercial,WorkToolsFrequencyKNIMEFree,WorkToolsFrequencyMathematica,WorkToolsFrequencyMATLAB,WorkToolsFrequencyAzure,WorkToolsFrequencyExcel,WorkToolsFrequencyMicrosoftRServer,WorkToolsFrequencyMicrosoftSQL,WorkToolsFrequencyMinitab,WorkToolsFrequencyNoSQL,WorkToolsFrequencyOracle,WorkToolsFrequencyOrange,WorkToolsFrequencyPerl,WorkToolsFrequencyPython,WorkToolsFrequencyQlik,WorkToolsFrequencyR,WorkToolsFrequencyRapidMinerCommercial,WorkToolsFrequencyRapidMinerFree,WorkToolsFrequencySalfrod,WorkToolsFrequencySAPBusinessObjects,WorkToolsFrequencySASBase,WorkToolsFrequencySASEnterprise,WorkToolsFrequencySASJMP,WorkToolsFrequencySpark,WorkToolsFrequencySQL,WorkToolsFrequencyStan,WorkToolsFrequencyStatistica,WorkToolsFrequencyTableau,WorkToolsFrequencyTensorFlow,WorkToolsFrequencyTIBCO,WorkToolsFrequencyUnix,WorkToolsFrequencySelect1,WorkToolsFrequencySelect2,WorkFrequencySelect3,WorkMethodsSelect,WorkMethodsFrequencyA/B,WorkMethodsFrequencyAssociationRules,WorkMethodsFrequencyBayesian,WorkMethodsFrequencyCNNs,WorkMethodsFrequencyCollaborativeFiltering,WorkMethodsFrequencyCross-Validation,WorkMethodsFrequencyDataVisualization,WorkMethodsFrequencyDecisionTrees,WorkMethodsFrequencyEnsembleMethods,WorkMethodsFrequencyEvolutionaryApproaches,WorkMethodsFrequencyGANs,WorkMethodsFrequencyGBM,WorkMethodsFrequencyHMMs,WorkMethodsFrequencyKNN,WorkMethodsFrequencyLiftAnalysis,WorkMethodsFrequencyLogisticRegression,WorkMethodsFrequencyMLN,WorkMethodsFrequencyNaiveBayes,WorkMethodsFrequencyNLP,WorkMethodsFrequencyNeuralNetworks,WorkMethodsFrequencyPCA,WorkMethodsFrequency
PrescriptiveModeling,WorkMethodsFrequencyRandomForests,WorkMethodsFrequencyRecommenderSystems,WorkMethodsFrequencyRNNs,WorkMethodsFrequencySegmentation,WorkMethodsFrequencySimulation,WorkMethodsFrequencySVMs,WorkMethodsFrequencyTextAnalysis,WorkMethodsFrequencyTimeSeriesAnalysis,WorkMethodsFrequencySelect1,WorkMethodsFrequencySelect2,WorkMethodsFrequencySelect3,TimeGatheringData,TimeModelBuilding,TimeProduction,TimeVisualizing,TimeFindingInsights,TimeOtherSelect,AlgorithmUnderstandingLevel,WorkChallengesSelect,WorkChallengeFrequencyPolitics,WorkChallengeFrequencyUnusedResults,WorkChallengeFrequencyUnusefulInstrumenting,WorkChallengeFrequencyDeployment,WorkChallengeFrequencyDirtyData,WorkChallengeFrequencyExplaining,WorkChallengeFrequencyPass,WorkChallengeFrequencyIntegration,WorkChallengeFrequencyTalent,WorkChallengeFrequencyDataFunds,WorkChallengeFrequencyDomainExpertise,WorkChallengeFrequencyML,WorkChallengeFrequencyTools,WorkChallengeFrequencyExpectations,WorkChallengeFrequencyITCoordination,WorkChallengeFrequencyHiringFunds,WorkChallengeFrequencyPrivacy,WorkChallengeFrequencyScaling,WorkChallengeFrequencyEnvironments,WorkChallengeFrequencyClarity,WorkChallengeFrequencyDataAccess,WorkChallengeFrequencyOtherSelect,WorkDataVisualizations,WorkInternalVsExternalTools,WorkMLTeamSeatSelect,WorkDatasets,WorkDatasetsChallenge,WorkDataStorage,WorkDataSharing,WorkDataSourcing,WorkCodeSharing,RemoteWork,CompensationAmount,CompensationCurrency,SalaryChange,JobSatisfaction,JobSearchResource,JobHuntTime,JobFactorLearning,JobFactorSalary,JobFactorOffice,JobFactorLanguages,JobFactorCommute,JobFactorManagement,JobFactorExperienceLevel,JobFactorDepartment,JobFactorTitle,JobFactorCompanyFunding,JobFactorImpact,JobFactorRemote,JobFactorIndustry,JobFactorLeaderReputation,JobFactorDiversity,JobFactorPublishingOpportunity,exchangeRate,AdjustedCompensation
0,"Non-binary, genderqueer, or gender non-conforming",,,Employed full-time,,,Yes,,DBA/Database Engineer,Fine,Employed by a company that doesn't perform adv...,SAS Base,Random Forests,F#,Dataset aggregator/platform (i.e. Socrata/Kagg...,"College/University,Conferences,Podcasts,Trade ...",,,,,Very useful,,,,,,,,Very useful,,,Somewhat useful,,,"Becoming a Data Scientist Podcast,Data Machina...",,,,,,,,,,,,,,,,,,,Yes,Bachelor's degree,Management information systems,More than 10 years,"Predictive Modeler,Programmer,Researcher",University courses,0.0,0.0,100.0,0.0,0.0,0.0,"Computer Vision,Natural Language Processing,Su...","Evolutionary Approaches,Neural Networks - GANs...",A doctoral degree,Internet-based,100 to 499 employees,Increased slightly,3-5 years,I visited the company's Web site and found a j...,Not very important,Build prototypes to explore applying machine l...,"Gaming Laptop (Laptop + CUDA capable GPU),Work...","Text data,Relational data",Rarely,10GB,"Neural Networks,Random Forests,RNNs","Amazon Web services,Oracle Data Mining/ Oracle...",,Rarely,,,,,,,,,,,,,,,,,,,,,,,,,,Sometimes,,Most of the time,,,,,,,,,,,,,,,,,,,,,,"Association Rules,Collaborative Filtering,Neur...",,Rarely,,,Often,,,,,,,,,,,,,,,Sometimes,Often,,Most of the time,,,,,,,,,,,0.0,100.0,0.0,0.0,0.0,0.0,Enough to explain the algorithm to someone non...,Company politics / Lack of management/financia...,Rarely,,,,,,,,,,,,,,,,Often,Most of the time,,,,,26-50% of projects,Do not know,Standalone Team,,,Document-oriented (e.g. MongoDB/Elasticsearch)...,"Company Developed Platform,I don't typically s...",,"Mercurial,Subversion,Other",Always,,,I am not currently employed,5,,,,,,,,,,,,,,,,,,,,
1,Female,United States,30.0,"Not employed, but looking for work",,,,,,,,Python,Random Forests,Python,Dataset aggregator/platform (i.e. Socrata/Kagg...,Kaggle,,,,,,,Somewhat useful,,,,,,,,,,,,"Becoming a Data Scientist Podcast,Siraj Raval ...",1-2 years,,Nice to have,Unnecessary,,Unnecessary,,Necessary,,,,,,,,,2 - 10 hours,Master's degree,Yes,Master's degree,Computer Science,Less than a year,Software Developer/Software Engineer,University courses,10.0,30.0,0.0,30.0,30.0,0.0,"Computer Vision,Supervised Machine Learning (T...","Bayesian Techniques,Decision Trees - Gradient ...",A bachelor's degree,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Somewhat important,,,,
2,Male,Canada,28.0,"Not employed, but looking for work",,,,,,,,Amazon Web services,Deep learning,R,Dataset aggregator/platform (i.e. Socrata/Kagg...,"Arxiv,College/University,Kaggle,Online courses...",Very useful,,Somewhat useful,,,,Somewhat useful,,,,Very useful,,,,,,,Very useful,"FastML Blog,No Free Hunch Blog,Talking Machine...",1-2 years,Necessary,,,,,Necessary,,,,,,,,"Coursera,edX",Basic laptop (Macbook),2 - 10 hours,Github Portfolio,Yes,Master's degree,Engineering (non-computer focused),3 to 5 years,"Data Scientist,Machine Learning Engineer",University courses,20.0,50.0,0.0,30.0,0.0,0.0,"Adversarial Learning,Computer Vision,Natural L...","Decision Trees - Random Forests,Ensemble Metho...",A bachelor's degree,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,"Asking friends, family members, or former coll...",1-2,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,,
3,Male,United States,56.0,"Independent contractor, freelancer, or self-em...",,,Yes,,Operations Research Practitioner,Poorly,Self-employed,TensorFlow,Neural Nets,Python,I collect my own data (e.g. web-scraping),"Blogs,College/University,Conferences,Friends n...",,Very useful,Very useful,,Very useful,Very useful,,,,Very useful,Very useful,Very useful,,,,,,,KDnuggets Blog,,,,,,,,,,,,,,,,,,,Yes,Master's degree,Mathematics or statistics,More than 10 years,"Business Analyst,Operations Research Practitio...",University courses,30.0,0.0,40.0,30.0,0.0,0.0,"Recommendation Engines,Reinforcement learning,...","Bayesian Techniques,Decision Trees - Gradient ...",High school,Mix of fields,,,,,Very important,Analyze and understand data to influence produ...,"Laptop + Cloud service (AWS, Azure, GCE ...)",Relational data,Always,1GB,"Bayesian Techniques,Decision Trees,Random Fore...","Amazon Machine Learning,Amazon Web services,Cl...",Rarely,Often,,,Rarely,,,,Rarely,,,,,Rarely,Rarely,,,,,Rarely,Rarely,,Sometimes,,Rarely,,Rarely,,,,Rarely,,Rarely,,,,,Sometimes,,Rarely,,Often,,,Rarely,,,,,,,"A/B Testing,Bayesian Techniques,Data Visualiza...",Sometimes,,Sometimes,,,,Sometimes,Often,Sometimes,,,,,,,Sometimes,Often,Sometimes,,Sometimes,,,Sometimes,,,,Often,,,Often,,,,50.0,20.0,0.0,10.0,20.0,0.0,Enough to refine and innovate on the algorithm,Company politics / Lack of management/financia...,Often,Often,Often,Often,Often,Often,,Often,Often,Often,Most of the time,Often,Often,Often,,Often,Often,Often,Often,Often,Often,,100% of projects,Entirely internal,Standalone Team,Electricity data sets from government and states,"Everything is custom, there is never a tool th...","Column-oriented relational (e.g. KDB/MariaDB),...","Company Developed Platform,Email",,Generic cloud file sharing software (Dropbox/B...,,250000.0,USD,Has increased 20% or more,10 - Highly Satisfied,,,,,,,,,,,,,,,,,,,1.0,250000.0
4,Male,Taiwan,38.0,Employed full-time,,,Yes,,Computer Scientist,Fine,Employed by a company that doesn't perform adv...,TensorFlow,Text Mining,Python,GitHub,"Arxiv,Conferences,Kaggle,Textbook",Very useful,,,,Somewhat useful,,Somewhat useful,,,,,,,,Somewhat useful,,,,"Data Machina Newsletter,Jack's Import AI Newsl...",,,,,,,,,,,,,,,,,,,No,Doctoral degree,Engineering (non-computer focused),More than 10 years,"Computer Scientist,Data Analyst,Data Miner,Dat...",University courses,60.0,5.0,5.0,30.0,0.0,0.0,"Computer Vision,Outlier detection (e.g. Fraud ...","Bayesian Techniques,Decision Trees - Gradient ...",Primary/elementary school,Technology,"5,000 to 9,999 employees",Stayed the same,Don't know,A tech-specific job board,Somewhat important,Build prototypes to explore applying machine l...,"Gaming Laptop (Laptop + CUDA capable GPU),GPU ...","Image data,Relational data",Most of the time,100GB,"Bayesian Techniques,CNNs,Ensemble Methods,Neur...","C/C++,Jupyter notebooks,MATLAB/Octave,Python,R...",,,,Most of the time,,,,,,,,,,,,,Sometimes,,,,Often,,,,,,,,,,Sometimes,,Sometimes,,,,,,,,,,,,,Sometimes,,,,,,"Association Rules,Bayesian Techniques,CNNs,Col...",,Sometimes,Often,Most of the time,Sometimes,,Most of the time,Sometimes,Often,Sometimes,,,,Most of the time,,Sometimes,,Sometimes,,Most of the time,Sometimes,,,,Sometimes,Often,,Most of the time,,Sometimes,,,,30.0,20.0,15.0,15.0,20.0,0.0,Enough to refine and innovate on the algorithm,Company politics / Lack of management/financia...,Often,Sometimes,,,,,,,Sometimes,Sometimes,Sometimes,,,,Sometimes,,Most of the time,,Sometimes,,,,10-25% of projects,Approximately half internal and half external,Business Department,,,Flat files not in a database or cache (e.g. CS...,Company Developed Platform,,Git,Rarely,,,I do not want to share information about my sa...,2,,,,,,,,,,,,,,,,,,,,


In [15]:
#Your code here
tists['FormalEducation'].unique()
# and 'AdjustedCompensation' 

array(["Bachelor's degree", "Master's degree", 'Doctoral degree', nan,
       "Some college/university study without earning a bachelor's degree",
       'I did not complete any formal education past high school',
       'Professional degree', 'I prefer not to answer'], dtype=object)

In [18]:
bach = tists.loc[tists['FormalEducation'] == "Bachelor's degree"]
mast = tists.loc[tists['FormalEducation'] == "Master's degree"]

In [19]:
import flatiron_stats as fls

fls.welch_t(bach['AdjustedCompensation'], mast['AdjustedCompensation'])

0.9104118471854615
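
Two things are worth checking before interpreting a number like the one above. First, judging by its name, `welch_t` presumably returns the test statistic rather than a p-value. Second, the `head()` output shows that 'AdjustedCompensation' is missing for many respondents, so it is safer to drop those rows explicitly before testing. The sketch below shows one way to do both with SciPy. It runs on a tiny made-up DataFrame that only borrows the lab's column names; `compensation_for` is a hypothetical helper, and the salary figures are invented.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy stand-in for the survey data: same column names, invented values.
toy = pd.DataFrame({
    "FormalEducation": ["Bachelor's degree"] * 5 + ["Master's degree"] * 5,
    "AdjustedCompensation": [50000, 62000, np.nan, 58000, 71000,
                             64000, np.nan, 80000, 75000, 69000],
})


def compensation_for(df, level):
    """Return the non-missing compensation values for one education level."""
    mask = df["FormalEducation"] == level
    return df.loc[mask, "AdjustedCompensation"].dropna()


bach_pay = compensation_for(toy, "Bachelor's degree")
mast_pay = compensation_for(toy, "Master's degree")

# Two-sided Welch's t-test on the cleaned samples.
t_stat, p_value = stats.ttest_ind(bach_pay, mast_pay, equal_var=False)
print("n:", len(bach_pay), len(mast_pay))
print("t:", t_stat, "p:", p_value)
```

On the real data, you would pass `tists` in place of `toy`.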

## Wages and Education II

Now perform a similar statistical test comparing the AdjustedCompensation of those with Bachelor's degrees and those with Doctorates. If you haven't already, be sure to explore the distribution of the AdjustedCompensation feature for any anomalies. 

In [11]:
#Your code here

Median Values: 
s1:74131.92 
s2:38399.4
Sample sizes: 
s1: 967 
s2: 1107
Welch's t-test p-value: 0.1568238199472023


Repeated Test with Outliers Removed:
Sample sizes: 
s1: 964 
s2: 1103
Welch's t-test p-value with outliers removed: 0.0
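
The jump in the p-value once a handful of observations are removed is typical of what extreme values do to a t-test: they inflate the sample variance and can mask a real difference in means. The sketch below reproduces the effect on synthetic data. The samples, the injected extreme value, and the `CUTOFF` threshold are all assumptions chosen for illustration; the lab does not state which cutoff produced the output above.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic salaries for two groups (illustrative only).
rng = np.random.default_rng(1)
s1 = pd.Series(rng.normal(90_000, 20_000, 200))
s2 = pd.Series(rng.normal(65_000, 20_000, 200))
s2.iloc[0] = 1e9  # inject a single implausible value

CUTOFF = 500_000  # hypothetical threshold for "plausible" compensation


def welch_p(a, b):
    """Two-sided p-value from Welch's t-test."""
    return stats.ttest_ind(a, b, equal_var=False).pvalue


p_raw = welch_p(s1, s2)
p_trimmed = welch_p(s1[s1 <= CUTOFF], s2[s2 <= CUTOFF])

print("medians:", s1.median(), s2.median())
print("p-value with the extreme value:", p_raw)
print("p-value after filtering:", p_trimmed)
```

Comparing medians alongside means, or plotting the distribution, is a quick way to spot this kind of problem before testing.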


## Wages and Education III

Remember the multiple comparisons problem; rather than continuing on like this, perform an ANOVA test between the various 'FormalEducation' categories and their relation to 'AdjustedCompensation'.
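
To see why repeated pairwise tests are a problem, here is a rough calculation of the family-wise error rate. It assumes the tests are independent and each uses a significance level of 0.05 (both are simplifying assumptions), and it counts the seven non-missing 'FormalEducation' categories shown earlier.

```python
from math import comb

alpha = 0.05   # assumed per-test significance level
n_groups = 7   # non-missing 'FormalEducation' categories

# Number of distinct pairwise comparisons between the groups.
n_pairs = comb(n_groups, 2)

# Probability of at least one false positive across all tests,
# under the simplifying assumption that the tests are independent.
fwer = 1 - (1 - alpha) ** n_pairs

print(n_pairs, round(fwer, 3))
```

With 21 pairwise tests, the chance of at least one false positive is roughly 66%, which is why a single ANOVA is the better starting point.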

In [None]:
#Your code here

## Additional Resources

The data was taken from the original survey: [Kaggle Machine Learning & Data Science Survey 2017](https://www.kaggle.com/kaggle/kaggle-survey-2017)

## Summary

In this lab, you practiced conducting hypothesis tests on real data. Along the way, you saw how strongly the results can depend on the initial problem formulation, including preprocessing steps such as outlier removal!