<a href="https://colab.research.google.com/github/subho99/Computational-Data-Science/blob/main/Subhajit_Basistha_M1_AST_06_Statistical_Testing_C.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Certification Program in Computational Data Science
## A program by IISc and TalentSprint
### Assignment 6: Statistical Testing

## Learning Objectives

At the end of the experiment, you will be able to

* understand the different viewpoints of the frequentist and Bayesian approaches
* understand the basic idea behind statistical hypothesis testing
* perform one-sample z-test for mean
* perform two-sample z-test for mean
* perform paired z-test 
* perform two-sample t-test for mean 

## Information

**Frequentist inference** is a collection of error-probabilistic methods that allow us to learn from data about the true state of nature in the presence of uncertainty by using model-based inference. Its core goal is to provide error control in the face of uncertainty. It was developed in the early 20th century by Fisher, Neyman, Pearson, and others, largely replacing the approaches to statistical inference that prevailed at the time, among them Bayesian inference.

A **confidence interval** estimates a parameter by specifying a range of plausible values. Such an interval is associated with a confidence level, which is the probability that the procedure used to generate the interval will produce an interval containing the true parameter.

To learn more about confidence intervals, click [here](https://www.statology.org/confidence-intervals-python/).
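
As a minimal sketch of the idea, the snippet below computes a confidence interval for a mean when the population standard deviation is assumed to be known. The function name `mean_confidence_interval` and the numbers in the usage lines are illustrative and not part of the assignment; only Python's standard library is used.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for a mean, assuming the population std. deviation sigma is known."""
    # Critical value z* such that the central `confidence` probability mass lies within +/- z*
    z_star = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z_star * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Illustrative numbers (assumed): sample mean 50, sigma 10, n = 100
low, high = mean_confidence_interval(50.0, 10.0, 100)
print(f"95% CI: ({low:.2f}, {high:.2f})")
```

A higher confidence level or a smaller sample widens the interval, since the margin is proportional to the critical value and to $1/\sqrt{n}$.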

### Setup Steps:

In [1]:
#@title Please enter your registration id to start: { run: "auto", display-mode: "form" }
Id = "2236624" #@param {type:"string"}

In [2]:
#@title Please enter your password (your registered phone number) to continue: { run: "auto", display-mode: "form" }
password = "8240187807" #@param {type:"string"}

In [3]:
#@title Run this cell to complete the setup for this Notebook
from IPython import get_ipython

ipython = get_ipython()
  
notebook= "M1_AST_06_Statistical_Testing_C" #name of the notebook

def setup():
#  ipython.magic("sx pip3 install torch")  
    ipython.magic("sx wget https://cdn.iisc.talentsprint.com/CDS/Datasets/2019_Data_Professional_Salary_Survey_Responses.xlsx")
    from IPython.display import HTML, display
    display(HTML('<script src="https://dashboard.talentsprint.com/aiml/record_ip.html?traineeId={0}&recordId={1}"></script>'.format(getId(),submission_id)))
    print("Setup completed successfully")
    return

def submit_notebook():
    ipython.magic("notebook -e "+ notebook + ".ipynb")
    
    import requests, json, base64, datetime

    url = "https://dashboard.talentsprint.com/xp/app/save_notebook_attempts"
    if not submission_id:
      data = {"id" : getId(), "notebook" : notebook, "mobile" : getPassword()}
      r = requests.post(url, data = data)
      r = json.loads(r.text)

      if r["status"] == "Success":
          return r["record_id"]
      elif "err" in r:        
        print(r["err"])
        return None        
      else:
        print ("Something is wrong, the notebook will not be submitted for grading")
        return None
    
    elif getAnswer() and getComplexity() and getAdditional() and getConcepts() and getComments() and getMentorSupport():
      f = open(notebook + ".ipynb", "rb")
      file_hash = base64.b64encode(f.read())

      data = {"complexity" : Complexity, "additional" :Additional, 
              "concepts" : Concepts, "record_id" : submission_id, 
              "answer" : Answer, "id" : Id, "file_hash" : file_hash,
              "notebook" : notebook,
              "feedback_experiments_input" : Comments,
              "feedback_mentor_support": Mentor_support}
      r = requests.post(url, data = data)
      r = json.loads(r.text)
      if "err" in r:        
        print(r["err"])
        return None   
      else:
        print("Your submission is successful.")
        print("Ref Id:", submission_id)
        print("Date of submission: ", r["date"])
        print("Time of submission: ", r["time"])
        print("View your submissions: https://cds.iisc.talentsprint.com/notebook_submissions")
        #print("For any queries/discrepancies, please connect with mentors through the chat icon in LMS dashboard.")
        return submission_id
    else: submission_id
    

def getAdditional():
  try:
    if not Additional: 
      raise NameError
    else:
      return Additional  
  except NameError:
    print ("Please answer Additional Question")
    return None

def getComplexity():
  try:
    if not Complexity:
      raise NameError
    else:
      return Complexity
  except NameError:
    print ("Please answer Complexity Question")
    return None
  
def getConcepts():
  try:
    if not Concepts:
      raise NameError
    else:
      return Concepts
  except NameError:
    print ("Please answer Concepts Question")
    return None
  
  
# def getWalkthrough():
#   try:
#     if not Walkthrough:
#       raise NameError
#     else:
#       return Walkthrough
#   except NameError:
#     print ("Please answer Walkthrough Question")
#     return None
  
def getComments():
  try:
    if not Comments:
      raise NameError
    else:
      return Comments
  except NameError:
    print ("Please answer Comments Question")
    return None
  

def getMentorSupport():
  try:
    if not Mentor_support:
      raise NameError
    else:
      return Mentor_support
  except NameError:
    print ("Please answer Mentor support Question")
    return None

def getAnswer():
  try:
    if not Answer:
      raise NameError 
    else: 
      return Answer
  except NameError:
    print ("Please answer Question")
    return None
  

def getId():
  try: 
    return Id if Id else None
  except NameError:
    return None

def getPassword():
  try:
    return password if password else None
  except NameError:
    return None

submission_id = None
### Setup 
if getPassword() and getId():
  submission_id = submit_notebook()
  if submission_id:
    setup() 
else:
  print ("Please complete Id and Password cells before running setup")



Setup completed successfully


#### Importing required packages

In [4]:
import statsmodels.api as sm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm                             # A normal continuous random variable
from numpy.random import seed                            # set the randomness
from scipy import stats                                  # statistical operations
import warnings
warnings.simplefilter("ignore")

## Hypothesis Testing

**Hypothesis testing** is a statistical method for testing a claim or hypothesis about a parameter in a population, using data measured in a sample. A hypothesis is an assumption that we make about the population parameter.
Based on this, we define the

* null hypothesis: $H_0$
  
  - The null hypothesis is a general statement or default position that there is no relationship between two measured phenomena, i.e. no association among groups.
* alternative hypothesis: $H_A$ or $H_1$

  - The alternative hypothesis is contrary to the null hypothesis. It is usually taken to be that the observations are the result of a real effect (with some amount of chance variation superimposed).

**Significance level** (alpha ($\alpha$)) is the probability that you will make the mistake of rejecting the null hypothesis when in fact it is true. 

$$\alpha = P(Rejecting\ a\ null\ hypothesis\ |\ null\ hypothesis\ is\ true)$$
<br>

The **p-value** is the probability, assuming the null hypothesis is true, of obtaining a test statistic at least as extreme as the one observed in the experiment. 

$$p\text{-}value = P(Observing\ a\ test\ statistic\ at\ least\ as\ extreme\ |\ null\ hypothesis\ is\ true)$$
<br>

|Criteria| Decision|
|--------|---------|
|p-value $\le$ $\alpha$ | Reject the null hypothesis|
|p-value $\gt$ $\alpha$ | Retain (or fail to reject) the null hypothesis|
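
The decision rule can be sketched in a few lines of code. This is a minimal illustration that assumes the test statistic follows a standard normal distribution under the null hypothesis; the helper names `z_p_value` and `decide` are hypothetical, and the `alternative` labels simply mirror common usage.

```python
from statistics import NormalDist


def z_p_value(z, alternative="two-sided"):
    """p-value of a z-statistic under a standard normal null distribution."""
    cdf = NormalDist().cdf(z)
    if alternative == "smaller":      # H1: parameter is smaller than the null value
        return cdf
    if alternative == "larger":       # H1: parameter is larger than the null value
        return 1.0 - cdf
    if alternative == "two-sided":    # H1: parameter differs from the null value
        return 2.0 * min(cdf, 1.0 - cdf)
    raise ValueError(f"unknown alternative: {alternative!r}")


def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 when the p-value is at most alpha."""
    return "reject H0" if p_value <= alpha else "fail to reject H0"


# Illustrative value (assumed): z = -1.96 sits right at the 5% two-sided boundary
p = z_p_value(-1.96)
print(round(p, 4), decide(p))
```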

<br>

**Z-test and t-test** are two widely known hypothesis testing types.

<br>

|Basis for Comparison | Z-Test | t-Test |
|--------|----|---------|
| Meaning | Z-test implies a hypothesis test which ascertains if the means of two datasets are different from each other when variance is given. | T-test refers to a type of parametric test that is applied to identify how the means of two sets of data differ from one another when variance is not given. |
| Based on | Normal distribution | Student-t distribution |
| Population variance | Known | Unknown |
| Sample Size | Large | Small |

### One-Sample Z-Test for Mean

The One-Sample Z-test is used to know if the difference between the sample mean and the mean of a population is large enough to be statistically significant (i.e. unlikely to have occurred by mere chance). 

Assumptions

1. The mean and variance of the population are known.
2. The test statistic follows a normal distribution.

The test statistic for the Z-test is given by

$$Z- statistic = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$

where
- $\bar{X}$: sample mean
- $\mu$: population mean
- $\sigma$: population standard deviation
- $n$: sample size
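
The formula can be evaluated directly. The sketch below is a hand-rolled illustration using only the standard library; the function name `one_sample_z` and the small data set are assumptions made for the example, and $\sigma$ is treated as known, as the assumptions above require.

```python
import math
from statistics import NormalDist, fmean


def one_sample_z(sample, mu, sigma):
    """Return (z, two-sided p-value) for H0: population mean == mu, with sigma known."""
    n = len(sample)
    z = (fmean(sample) - mu) / (sigma / math.sqrt(n))
    p_two_sided = 2.0 * NormalDist().cdf(-abs(z))
    return z, p_two_sided


# Illustrative data (assumed): sample mean 51, mu = 50, sigma = 4, n = 4
z, p = one_sample_z([52, 48, 50, 54], mu=50, sigma=4)
print(f"z = {z:.3f}, p = {p:.3f}")
```

Here the standard error is $4 / \sqrt{4} = 2$, so a sample mean one unit above $\mu$ gives $z = 0.5$.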



**Example 1:** Let's create some dummy age data for the population of voters in the entire country and a sample of voters in `North_Carolina` and test whether the average age of voters in `North_Carolina` differs from the entire country population.

In [5]:
np.random.seed(6)

# Generate Population ages data, see this https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html
# .rvs provides random samples; scale = std. deviation, loc = mean, size = no. of samples

population_ages1 = stats.norm.rvs(scale=18, loc=45, size=150000)                  
population_ages2 = stats.norm.rvs(scale=18, loc=20, size=100000) 
population_ages = np.concatenate((population_ages1, population_ages2))

# Generate North Carolina sample ages data

North_Carolina_ages1 = stats.norm.rvs(scale=18, loc=40, size=30)
North_Carolina_ages2 = stats.norm.rvs(scale=18, loc=20, size=25)
North_Carolina_ages = np.concatenate((North_Carolina_ages1, North_Carolina_ages2))

print("Mean age of population of voters in the entire country: ", population_ages.mean())
print("Mean age of North Carolina population: ", North_Carolina_ages.mean())

Mean age of population of voters in the entire country:  34.97043368335889
Mean age of North Carolina population:  28.52661116077034


Notice that we used a slightly different combination of distributions to generate the sample data for `North_Carolina`, so we know that the two means are different. Let's conduct a z-test at a 5% significance level (95% confidence) and see if it correctly rejects the null hypothesis that the sample comes from the same distribution as the population. To conduct a one-sample z-test, we can use the `sm.stats.ztest()` function from statsmodels, passing `alternative = 'smaller'` because we are testing whether the sample mean is lower than the population mean.

Some of the parameters of the `sm.stats.ztest()` function are:
- **x1**: first of the two independent samples
- **x2**: second of the two independent samples
- **value**: 
    - In the one-sample case, value is the mean of x1 under the Null hypothesis. 
    - In the two-sample case, value is the difference between mean of x1 and mean of x2 under the Null hypothesis.
- **alternative**: The alternative hypothesis, H1, has to be one of the following
    - 'two-sided': H1: difference in means not equal to value (default)
    - 'larger': H1: difference in means larger than value 
    - 'smaller': H1: difference in means smaller than value

In [6]:
# Perform Z-test for mean
one_sample_ztest = sm.stats.ztest(x1= North_Carolina_ages, x2=None, value= population_ages.mean(), alternative='smaller') 

print("test-statistic (z-score): ", one_sample_ztest[0])
print("p-value: ", one_sample_ztest[1])

test-statistic (z-score):  -2.2814904209459996
p-value:  0.011259721451099942


The z-score tells us how many standard errors the sample mean lies from the hypothesized population mean. In this example, the sample mean (28.53) is 2.28 [standard errors below the population mean](https://cdn.iisc.talentsprint.com/CDS/Images/ZScores.jpg) of 34.97. Because the p-value (0.011) is less than 0.05, we can reject the null hypothesis that the mean of the population from which the sample was drawn equals the overall population mean (34.97). This indicates that the difference between the sample mean and the population mean is statistically significant.

### Two-Sample Z-Test for Mean

The Two-Sample Z-test is used to compare the means of two samples to see whether it is plausible that they come from the same population. The null hypothesis is that the population means are equal. The **Z-test is preferred to the t-test for large samples (N > 30) or when the variance is known**; otherwise, the sample standard deviation is not a reliable enough estimate of the population standard deviation, and a two-sample t-test should be considered.

Assume that $\mu_1$ and $\mu_2$ are the population means. Our interest is to test a hypothesis about the difference between $\mu_1$ and $\mu_2$, that is ($\mu_1$ - $\mu_2$). If $\bar{X_1}$ and $\bar{X_2}$ are the estimated mean values from two samples, the statistic ($\bar{X_1} - \bar{X_2}$) follows a normal distribution with mean ($\mu_1$ - $\mu_2$) and standard deviation
$\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$, where $n_1$ and $n_2$ are the sizes of the two samples.
 
The corresponding Z-statistic is given by

$$Z = \frac{(\bar{X_1} - \bar{X_2}) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$
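
A direct translation of this statistic might look like the following sketch. It assumes both population standard deviations are known and passed in explicitly; the function name `two_sample_z`, the parameter `null_diff` (standing for $\mu_1 - \mu_2$ under the null hypothesis), and the data are illustrative.

```python
import math
from statistics import fmean


def two_sample_z(x1, x2, sigma1, sigma2, null_diff=0.0):
    """Z-statistic for H0: mu1 - mu2 == null_diff, with known population std. deviations."""
    # Standard deviation of the difference between the two sample means
    se = math.sqrt(sigma1**2 / len(x1) + sigma2**2 / len(x2))
    return (fmean(x1) - fmean(x2) - null_diff) / se


# Illustrative data (assumed): means 13 and 12, sigma1 = sigma2 = 2, n1 = n2 = 4
z = two_sample_z([10, 12, 14, 16], [9, 11, 13, 15], sigma1=2, sigma2=2)
print(f"z = {z:.4f}")
```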

**Example 2:** Let's create a sample of voters in `South_Carolina` and test whether the average age of voters in `North_Carolina` differs from the average age of voters in `South_Carolina`.

In [7]:
np.random.seed(12)

# Generate South Carolina sample ages data
South_Carolina_ages1 = stats.norm.rvs(scale=15, loc=33, size=30)
South_Carolina_ages2 = stats.norm.rvs(scale=15, loc=20, size=25)
South_Carolina_ages = np.concatenate((South_Carolina_ages1, South_Carolina_ages2))

print("Mean age of South Carolina population: ", South_Carolina_ages.mean())

Mean age of South Carolina population:  24.970648445336874


In [8]:
# Perform Two sample Z-test for Mean
two_sample_ztest = sm.stats.ztest(x1 = North_Carolina_ages,
                                  x2 = South_Carolina_ages,
                                  value = 0,
                                  alternative = 'two-sided')    # Assume samples have equal variance
print("test-statistic (z-score): ", two_sample_ztest[0])
print("p-value: ", two_sample_ztest[1])

test-statistic (z-score):  0.9604812179337472
p-value:  0.3368130793371993


The test yields a p-value of 0.337, which means that if the two populations actually had the same mean (a difference of 0, as specified by `value = 0`), there would be a 33.7% chance of observing a difference between the sample means at least as large as the one we obtained. At a 5% significance level (95% confidence) we fail to reject the null hypothesis, since the p-value is greater than 0.05.

### Paired Z-Test for Mean

The paired z-test is used to test whether the mean difference of two populations is greater than, less than, or not equal to 0. 

The objective is to check whether the difference in the parameter values is statistically significant before and after the intervention or between two different types of interventions.

**Example 3:** Consider the weights of a group of patients before and after an exercise program.

  - Observation 1: The weight of a group of patients was evaluated at baseline.
  - Observation 2: This same group of patients was evaluated after an 8-week exercise program.
  - Variable of interest: Body weight.

In this example, we have one group with two observations, meaning that the data are paired. We also assume that the population variance of body weight is known from previous studies.

The null hypothesis is that there will be no difference in weights measured before and after the exercise program. After checking that our data meet the assumptions of a paired-samples z-test (the variable of interest is continuous and normally distributed, there are exactly two measurements from a single group, the samples are paired, and the population variance is known), we proceed with the analysis.

When we run the analysis, we get a test statistic (in this case a Z-statistic) and a p-value.

  - The test statistic is a measure of how different the group is on our body weight variable of interest across the two observations. 
  - The p-value is the probability of seeing results at least as extreme as ours, assuming the exercise program actually doesn't do anything. A p-value less than or equal to 0.05 means that our result is statistically significant at the 5% level, i.e. a difference this large would be unlikely if only chance were at work.



In [9]:
np.random.seed(11)

# Generate Weights of population before exercise program
before= stats.norm.rvs(scale=10, loc=50, size=100)

# Generate Weights of population after exercise program
after = before + stats.norm.rvs(scale=5, loc=-3.5, size=100)

# Create dataframe
weight_df = pd.DataFrame({"weight_before":before,
                          "weight_after":after,
                          "weight_change":after-before})

weight_df.head()

Unnamed: 0,weight_before,weight_after,weight_change
0,67.494547,68.365911,0.871364
1,47.13927,43.997531,-3.141739
2,45.154349,33.459091,-11.695258
3,23.466814,16.730301,-6.736513
4,49.917154,50.506002,0.588848


In [10]:
# Check summary of the data
weight_df.describe()             

Unnamed: 0,weight_before,weight_after,weight_change
count,100.0,100.0,100.0
mean,50.115182,46.634807,-3.480375
std,9.377513,10.423714,4.783696
min,23.466814,16.730301,-13.745286
25%,43.473681,39.362388,-6.296211
50%,50.276935,47.179677,-3.663463
75%,56.879048,53.765628,-0.511327
max,71.566744,71.336869,7.509282


The summary shows that patients lost about 3.48 pounds on average after the exercise program. Let's conduct a paired z-test to see whether this difference is significant at a 5% significance level (95% confidence).

In [11]:
# Perform paired Z-test
pair_ztest = sm.stats.ztest(x1 = before, x2 = after, value = 0, alternative='two-sided')
print("test-statistic: ", pair_ztest[0])
print("p-value: ", pair_ztest[1])

test-statistic:  2.48223901119329
p-value:  0.013055966969698763


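Here the null hypothesis is that, in the population from which the samples are drawn, the mean difference between paired observations is 0 (as indicated by `value = 0`). The p-value is about 0.013 (1.3%): if the null hypothesis were true, a test statistic at least this extreme would be expected only 1.3% of the time. Since this is below 5%, the null hypothesis is rejected, and the difference in mean weight before and after the exercise program is statistically significant. Note that passing `before` and `after` as `x1` and `x2` makes `sm.stats.ztest()` treat them as two independent samples; a strictly paired analysis would instead run a one-sample test on the per-patient differences (`weight_change`) against 0.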
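
Because `sm.stats.ztest(x1, x2)` compares two independent samples, a paired analysis is commonly carried out as a one-sample test on the per-subject differences. The sketch below illustrates that approach with the standard library only. The function name `paired_z` and the tiny data set are assumptions, and the sample standard deviation of the differences stands in for a known population value.

```python
import math
from statistics import NormalDist, fmean, stdev


def paired_z(before, after):
    """One-sample z-test on the paired differences (after - before) against 0."""
    diffs = [a - b for b, a in zip(before, after)]
    # Standard error of the mean difference (sample std. deviation used as an estimate)
    se = stdev(diffs) / math.sqrt(len(diffs))
    z = fmean(diffs) / se
    p_two_sided = 2.0 * NormalDist().cdf(-abs(z))
    return z, p_two_sided


# Illustrative data (assumed): four subjects measured twice
z, p = paired_z(before=[10, 12, 14, 16], after=[9, 10, 13, 14])
print(f"z = {z:.3f}, p = {p:.2g}")
```

Applied to the summary reported by `describe()` above (mean change -3.48, standard deviation 4.78, n = 100), the same formula gives roughly $-3.48 / (4.78 / \sqrt{100}) \approx -7.3$, because variation between patients cancels out in the differences.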

### Two Sample T-test for Mean

Let's see the hypothesis test for the difference in two population means when the **standard deviations of the populations are unknown**. In this case we need to estimate them from the samples drawn from the two populations. An additional assumption we make here is that the standard deviations of the two populations are equal (though unknown). Then the standardized difference in estimated means $(\bar{X_1} - \bar{X_2})$ follows a t-distribution with ($n_1 + n_2 - 2$) degrees of freedom, where the difference has mean ($\mu_1 - \mu_2$) and estimated standard deviation 
$\sqrt{S_P^2 \big( \frac{1}{n_1} + \frac{1}{n_2} \big)}$

where $S_P^2$ is the pooled variance of the two samples and is given by
$$S_P^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{(n_1 + n_2 - 2)}$$

The corresponding *t-statistic* is
$$t = \frac{(\bar{X_1} - \bar{X_2}) - (\mu_1 - \mu_2)}{\sqrt{S_P^2 \big( \frac{1}{n_1} + \frac{1}{n_2} \big)}}$$

where 
- $\bar{X_1}$: sample mean for first population
- $\bar{X_2}$: sample mean for second population
- $n_1$: sample size for first population
- $n_2$: sample size for second population
- $S_1^2$, $S_2^2$: sample variances for the first and second populations
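
To make the pooled-variance formula concrete, here is a small sketch that computes the t-statistic and its degrees of freedom by hand. It assumes equal population variances; the function name `pooled_t`, the parameter `null_diff` (standing for $\mu_1 - \mu_2$ under the null hypothesis), and the data are illustrative. No p-value is computed because the standard library has no t-distribution.

```python
import math
from statistics import fmean, variance


def pooled_t(x1, x2, null_diff=0.0):
    """Return (t, degrees of freedom) for the equal-variance two-sample t-test."""
    n1, n2 = len(x1), len(x2)
    # Pooled variance: sample variances weighted by their degrees of freedom
    sp2 = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    t = (fmean(x1) - fmean(x2) - null_diff) / se
    return t, n1 + n2 - 2


# Illustrative data (assumed)
t, df = pooled_t([1, 2, 3, 4], [2, 4, 6, 8])
print(f"t = {t:.4f}, df = {df}")
```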

**Example 4:** Consider the Professional Salary Survey Results dataset. 
At $\alpha$ = 0.05, test whether the salary means for both male and female employees for the year 2019 who belong to the United States, are equal.

The corresponding null hypothesis and alternate hypothesis are as follows:

- $H_0$: the mean salaries of male and female employees are equal
- $H_A$: the mean salaries are not equal

Now, we will carry out the two-sample t-test with the `stats.ttest_ind()` function.

#### Dataset Description

The dataset used here is the [Professional Salary Survey Results dataset](https://data.world/finance/data-professional-salary-survey). It is made up of 6893 records and 29 columns and includes important attributes of each respondent's employment. Some of the features are

- Survey Year
- SalaryUSD: salary in US dollars
- Country
- Education
- EmploymentSector
- LookingForAnotherJob
- Gender

Here we will only be considering the United States employees' Salary for the year 2019.

In [12]:
# Read data
raw_data = pd.read_excel('2019_Data_Professional_Salary_Survey_Responses.xlsx', header = 3)
raw_data.head()

Unnamed: 0,Survey Year,Timestamp,SalaryUSD,Country,PostalCode,PrimaryDatabase,YearsWithThisDatabase,OtherDatabases,EmploymentStatus,JobTitle,...,HoursWorkedPerWeek,TelecommuteDaysPerWeek,PopulationOfLargestCityWithin20Miles,EmploymentSector,LookingForAnotherJob,CareerPlansThisYear,Gender,OtherJobDuties,KindsOfTasksPerformed,Counter
0,2017,2017-01-05 05:10:20.451,200000,United States,Not Asked,Microsoft SQL Server,10,MySQL/MariaDB,Full time employee,DBA,...,45,1,Not Asked,Private business,"Yes, but only passively (just curious)",Not Asked,Not Asked,Not Asked,Not Asked,1
1,2017,2017-01-05 05:26:23.388,61515,United Kingdom,Not Asked,Microsoft SQL Server,15,"Oracle, PostgreSQL",Full time employee,DBA,...,35,2,Not Asked,Private business,No,Not Asked,Not Asked,Not Asked,Not Asked,1
2,2017,2017-01-05 05:32:57.367,95000,Germany,Not Asked,Microsoft SQL Server,5,"Oracle, MySQL/MariaDB, Informix",Full time employee,Other,...,45,"None, or less than 1 day per week",Not Asked,Private business,"Yes, but only passively (just curious)",Not Asked,Not Asked,Not Asked,Not Asked,1
3,2017,2017-01-05 05:33:03.316,56000,United Kingdom,Not Asked,Microsoft SQL Server,6,,Full time employee,DBA,...,40,1,Not Asked,Private business,"Yes, but only passively (just curious)",Not Asked,Not Asked,Not Asked,Not Asked,1
4,2017,2017-01-05 05:34:33.866,35000,France,Not Asked,Microsoft SQL Server,10,Oracle,Full time employee of a consulting/contracting...,DBA,...,40,"None, or less than 1 day per week",Not Asked,Private business,"Yes, but only passively (just curious)",Not Asked,Not Asked,Not Asked,Not Asked,1


In [13]:
# Shape of data
raw_data.shape

(6893, 29)

In [14]:
len(raw_data)

6893

In [15]:
# Check for missing values
raw_data.isna().sum()

Survey Year                                0
Timestamp                                  0
SalaryUSD                                  0
Country                                    0
PostalCode                               959
PrimaryDatabase                            0
YearsWithThisDatabase                      0
OtherDatabases                          1373
EmploymentStatus                           0
JobTitle                                   0
ManageStaff                                0
YearsWithThisTypeOfJob                     0
HowManyCompanies                           0
OtherPeopleOnYourTeam                      0
CompanyEmployeesOverall                    0
DatabaseServers                            0
Education                                  0
EducationIsComputerRelated              1216
Certifications                             0
HoursWorkedPerWeek                         0
TelecommuteDaysPerWeek                     0
PopulationOfLargestCityWithin20Miles       0
Employment

In [16]:
# Fraction of missing values per column
raw_data.isna().sum() / len(raw_data)

Survey Year                             0.000000
Timestamp                               0.000000
SalaryUSD                               0.000000
Country                                 0.000000
PostalCode                              0.139127
PrimaryDatabase                         0.000000
YearsWithThisDatabase                   0.000000
OtherDatabases                          0.199188
EmploymentStatus                        0.000000
JobTitle                                0.000000
ManageStaff                             0.000000
YearsWithThisTypeOfJob                  0.000000
HowManyCompanies                        0.000000
OtherPeopleOnYourTeam                   0.000000
CompanyEmployeesOverall                 0.000000
DatabaseServers                         0.000000
Education                               0.000000
EducationIsComputerRelated              0.176411
Certifications                          0.000000
HoursWorkedPerWeek                      0.000000
TelecommuteDaysPerWe

In [17]:
# Remove missing values
raw_data.dropna(inplace= True)
raw_data.reset_index(drop= True, inplace= True)
raw_data.isna().sum()

Survey Year                             0
Timestamp                               0
SalaryUSD                               0
Country                                 0
PostalCode                              0
PrimaryDatabase                         0
YearsWithThisDatabase                   0
OtherDatabases                          0
EmploymentStatus                        0
JobTitle                                0
ManageStaff                             0
YearsWithThisTypeOfJob                  0
HowManyCompanies                        0
OtherPeopleOnYourTeam                   0
CompanyEmployeesOverall                 0
DatabaseServers                         0
Education                               0
EducationIsComputerRelated              0
Certifications                          0
HoursWorkedPerWeek                      0
TelecommuteDaysPerWeek                  0
PopulationOfLargestCityWithin20Miles    0
EmploymentSector                        0
LookingForAnotherJob              

In [18]:
# Data type of SalaryUSD column
raw_data['SalaryUSD'].dtype

dtype('O')

In [19]:
# Unique values of SalaryUSD column
raw_data['SalaryUSD'].unique()

array([200000, 95000, 35000, 47000, 41000, 51652, 60000, 63000, 85000,
       96000, 45000, 78000, 123000, 100800, 113000, 80000, 133500, 102000,
       107690, 84000, 133000, 67000, 120000, 145000, 77600, 46000, 140000,
       23000, 138000, 40000, 72800, 99000, 50000, 55000, 90000, 175000,
       113400, 97500, 17621, 105000, 104000, 98000, 70760, 67770, 82000,
       98517, 107000, 72000, 125000, 129000, 66000, 141000, 58000, 121000,
       10000, 15000, 130000, 300000, 64000, 93000, 87000, 88000, 70000,
       35900, 42000, 61606, 91000, 27000, 116000, 50400, 270000, 48000,
       117832, 160000, 115000, 7968, 100000, 65000, 138500, 63500, 62000,
       111000, 134000, 95500, 122000, 26000, 103110, 92000, 14000, 112500,
       12000, 12300, 12900, 118000, 110000, 53000, 8300, 91800, 225000,
       94000, 120792, 112000, 133900, 93800, 89035, 158000, 93500, 52970,
       108000, 135000, 117000, 118500, 113300, 165000, 81000, 73000,
       142000, 147000, 109000, 124800, 71594, 70400

In [20]:
# Processing Salary column

def process_salary(salary):
    sep = '.'
    salary = str(salary)
    # Replace characters and take the cents out of our data
    salary = salary.replace(" ","").replace("$","").replace(",","").split(sep)[0]
    
    return float(salary)

# Replace spaces (" ") in column names with underscores ("_")
raw_data.columns = raw_data.columns.str.replace(" ","_")

# Apply process_salary function
raw_data['SalaryUSD'] = raw_data['SalaryUSD'].apply(process_salary)
raw_data['SalaryUSD'].head()

0    200000.0
1     95000.0
2     35000.0
3     47000.0
4     41000.0
Name: SalaryUSD, dtype: float64
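
To see what the cleaning step does to differently formatted entries, the sketch below restates the same logic as `process_salary` and runs it on a few hand-written inputs. The string inputs are hypothetical examples and are not values taken from the dataset.

```python
def process_salary(salary):
    """Same logic as above: strip spaces, '$' and ',' and drop anything after the decimal point."""
    salary = str(salary)
    salary = salary.replace(" ", "").replace("$", "").replace(",", "").split(".")[0]
    return float(salary)


# Hypothetical raw entries covering integers, formatted strings and floats
for raw in [200000, "95,000", "$ 61,515.75", 47000.0]:
    print(repr(raw), "->", process_salary(raw))
```

Note that the cents are truncated rather than rounded, so `"$ 61,515.75"` becomes `61515.0`.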

In order to analyze only 2019 United States data, we will filter the DataFrame by year and country.

In [21]:
# Filter dataframe by year
df = raw_data[raw_data.Survey_Year == 2019]
df.head()

Unnamed: 0,Survey_Year,Timestamp,SalaryUSD,Country,PostalCode,PrimaryDatabase,YearsWithThisDatabase,OtherDatabases,EmploymentStatus,JobTitle,...,HoursWorkedPerWeek,TelecommuteDaysPerWeek,PopulationOfLargestCityWithin20Miles,EmploymentSector,LookingForAnotherJob,CareerPlansThisYear,Gender,OtherJobDuties,KindsOfTasksPerformed,Counter
3200,2019,2018-12-06 13:58:01.557,128500.0,United States,442,Microsoft SQL Server,15,"Microsoft SQL Server, Oracle",Full time employee,Architect,...,40,"None, or less than 1 day per week",300K-1M (large city),Private business,No,"Stay with the same employer, same role",Male,"DBA (Development Focus - tunes queries, indexe...","Manual tasks, Meetings & management, Projects",1
3201,2019,2018-12-11 06:24:30.227,110000.0,United States,43016,Microsoft SQL Server,18,Azure SQL DB,Full time employee,"DBA (Development Focus - tunes queries, indexe...",...,44,"None, or less than 1 day per week",1M+ (metropolis),Private business,No,"Stay with the same employer, same role",Male,DBA (Production Focus - build & troubleshoot s...,"Build scripts & automation tools, Manual tasks...",1
3202,2019,2018-12-11 06:30:47.023,116500.0,United States,605,Microsoft SQL Server,12,"PostgreSQL, MongoDB, Azure SQL DB",Full time employee of a consulting/contracting...,Architect,...,30,5 or more,1M+ (metropolis),Private business,"Yes, but only passively (just curious)",Prefer not to say,Male,DBA (General - splits time evenly between writ...,"Build scripts & automation tools, Manual tasks...",1
3203,2019,2018-12-11 06:31:42.083,67000.0,United Kingdom,LE2,Microsoft SQL Server,3,"Microsoft SQL Server, Microsoft Access, Azure ...","Independent consultant, contractor, freelancer...","Developer: App code (C#, JS, etc)",...,37,"None, or less than 1 day per week",300K-1M (large city),Private business,No,"Stay with the same employer, same role",Male,"DBA (Development Focus - tunes queries, indexe...","Build scripts & automation tools, Manual tasks...",1
3204,2019,2018-12-11 06:33:56.358,124000.0,United States,92105,Microsoft SQL Server,18,Oracle,Full time employee,DBA (General - splits time evenly between writ...,...,40,4,1M+ (metropolis),Private business,"Yes, actively looking for something else","Stay with the same role, but change employers",Male,"DBA (Development Focus - tunes queries, indexe...","Manual tasks, Projects, R&D",1


In [22]:
# Filter dataframe by country
US_2019 = df.loc[df.Country == 'United States', :]
US_2019.head(3)

Unnamed: 0,Survey_Year,Timestamp,SalaryUSD,Country,PostalCode,PrimaryDatabase,YearsWithThisDatabase,OtherDatabases,EmploymentStatus,JobTitle,...,HoursWorkedPerWeek,TelecommuteDaysPerWeek,PopulationOfLargestCityWithin20Miles,EmploymentSector,LookingForAnotherJob,CareerPlansThisYear,Gender,OtherJobDuties,KindsOfTasksPerformed,Counter
3200,2019,2018-12-06 13:58:01.557,128500.0,United States,442,Microsoft SQL Server,15,"Microsoft SQL Server, Oracle",Full time employee,Architect,...,40,"None, or less than 1 day per week",300K-1M (large city),Private business,No,"Stay with the same employer, same role",Male,"DBA (Development Focus - tunes queries, indexe...","Manual tasks, Meetings & management, Projects",1
3201,2019,2018-12-11 06:24:30.227,110000.0,United States,43016,Microsoft SQL Server,18,Azure SQL DB,Full time employee,"DBA (Development Focus - tunes queries, indexe...",...,44,"None, or less than 1 day per week",1M+ (metropolis),Private business,No,"Stay with the same employer, same role",Male,DBA (Production Focus - build & troubleshoot s...,"Build scripts & automation tools, Manual tasks...",1
3202,2019,2018-12-11 06:30:47.023,116500.0,United States,605,Microsoft SQL Server,12,"PostgreSQL, MongoDB, Azure SQL DB",Full time employee of a consulting/contracting...,Architect,...,30,5 or more,1M+ (metropolis),Private business,"Yes, but only passively (just curious)",Prefer not to say,Male,DBA (General - splits time evenly between writ...,"Build scripts & automation tools, Manual tasks...",1


In [23]:
# Number of employees by their genders 
US_2019.Gender.value_counts()

Male      274
Female     26
Name: Gender, dtype: int64

In [24]:
# Filter salary by gender
Male_salary = US_2019[US_2019.Gender == 'Male']['SalaryUSD']
Female_salary = US_2019[US_2019.Gender == 'Female']['SalaryUSD']

print("Mean salary for male employees: ", Male_salary.mean())
print("Mean salary for female employees: ", Female_salary.mean())

Mean salary for male employees:  110565.50729927007
Mean salary for female employees:  95569.38461538461


In [25]:
# Perform t-test on data
ttest = stats.ttest_ind(a = Female_salary, b = Male_salary, equal_var= False)
ttest

Ttest_indResult(statistic=-2.3294080652649156, pvalue=0.025597245279444746)

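From the above results, we can see that the p-value is approximately 0.026, which is lower than the level of significance (0.05). We therefore reject the null hypothesis: the data suggest a difference between mean male and female salaries in the United States for the year 2019. Note that the call uses `equal_var = False`, so `stats.ttest_ind()` performs Welch's t-test, which does not assume equal population variances, rather than the pooled-variance test described above; with group sizes as unbalanced as 274 and 26, this is the more cautious choice.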
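
The call above passes `equal_var = False`, which in SciPy selects Welch's t-test. As a rough sketch of what that variant computes, the snippet below evaluates the Welch t-statistic and the Welch-Satterthwaite degrees of freedom by hand. The function name `welch_t` and the data are illustrative assumptions; this is not SciPy's implementation.

```python
import math
from statistics import fmean, variance


def welch_t(x1, x2):
    """Return (t, approximate degrees of freedom) without assuming equal variances."""
    n1, n2 = len(x1), len(x2)
    # Per-group variance of the sample mean
    v1, v2 = variance(x1) / n1, variance(x2) / n2
    t = (fmean(x1) - fmean(x2)) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df


# Illustrative data (assumed): equal sizes but clearly different spreads
t, df = welch_t([1, 2, 3, 4], [2, 4, 6, 8])
print(f"t = {t:.4f}, df = {df:.2f}")
```

With equal group sizes the Welch and pooled t-statistics coincide, but the Welch degrees of freedom are smaller when the variances differ, which makes the test more conservative.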

### References for further reading:

[Hypothesis testing in machine learning](https://towardsdatascience.com/hypothesis-testing-in-machine-learning-using-python-a0dc89e169ce)

[How to use z-table](https://www.statology.org/how-to-use-z-table/)


### Please answer the questions below to complete the experiment:




In [None]:
# @title Select the correct option below, based on the following statement: The p-value is defined as the smallest value of alpha for which the null hypothesis can be rejected. { run: "auto", form-width: "500px", display-mode: "form" }
Answer = "" #@param ["","If the p-value is less than or equal to alpha, we reject the null hypothesis","If the p-value is greater than alpha, we do not reject the null hypothesis","Both of the above", "None of the above"]

In [None]:
#@title How was the experiment? { run: "auto", form-width: "500px", display-mode: "form" }
Complexity = "" #@param ["","Too Simple, I am wasting time", "Good, But Not Challenging for me", "Good and Challenging for me", "Was Tough, but I did it", "Too Difficult for me"]


In [None]:
#@title If it was too easy, what more would you have liked to be added? If it was very difficult, what would you have liked to have been removed? { run: "auto", display-mode: "form" }
Additional = "" #@param {type:"string"}


In [None]:
#@title Can you identify the concepts from the lecture which this experiment covered? { run: "auto", vertical-output: true, display-mode: "form" }
Concepts = "" #@param ["","Yes", "No"]


In [None]:
#@title  Text and image description/explanation and code comments within the experiment: { run: "auto", vertical-output: true, display-mode: "form" }
Comments = "" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [None]:
#@title Mentor Support: { run: "auto", vertical-output: true, display-mode: "form" }
Mentor_support = "" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [None]:
#@title Run this cell to submit your notebook for grading { vertical-output: true }
try:
  if submission_id:
      return_id = submit_notebook()
      if return_id : submission_id = return_id
  else:
      print("Please complete the setup first.")
except NameError:
  print ("Please complete the setup first.")

Please answer Question
