# Hypothesis Testing: One Sample Significance Tests

The purpose of One Sample Significance Tests is to check if a sample of observations could have been generated by a process with a specific mean.

Some questions that can be answered by one sample significance tests are:
* Is there equal representation of men and women in a particular industry?
* Is the normal human body temperature 98.6 F?

We will apply this test to a few real-world problems in this notebook.

The Suicide dataset was obtained from Kaggle courtesy Rajanand Illangovan. You can download it here: https://www.kaggle.com/rajanand/suicides-in-india

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from scipy import stats

## Analyzing Suicides in India by Gender

Are men as likely to commit suicide as women?

This is the question we will attempt to answer in this section. To do so, we will use suicide statistics shared by the National Crime Records Bureau (NCRB), Govt. of India. To perform this analysis, we also need to know the sex ratio in India. The Census 2011 report states that there are 940 females for every 1000 males in India.

Let p denote the fraction of women in India.

In [2]:
p = 940/(940+1000)
p

0.4845360824742268

If there is no association between gender and suicide, then the sex ratio of people committing suicide should closely reflect that of the general population.

Let us now get our data into a Pandas dataframe for analysis.

In [3]:
df = pd.read_csv('data/suicides.csv')
df.head()

Unnamed: 0,State,Year,Type_code,Type,Gender,Age_group,Total
0,A & N Islands,2001,Causes,Illness (Aids/STD),Female,0-14,0
1,A & N Islands,2001,Causes,Bankruptcy or Sudden change in Economic,Female,0-14,0
2,A & N Islands,2001,Causes,Cancellation/Non-Settlement of Marriage,Female,0-14,0
3,A & N Islands,2001,Causes,Physical Abuse (Rape/Incest Etc.),Female,0-14,0
4,A & N Islands,2001,Causes,Dowry Dispute,Female,0-14,0


In [4]:
df.shape

(237519, 7)

In [5]:
df['Gender'].value_counts()

Male      118879
Female    118640
Name: Gender, dtype: int64

We can see that the dataset contains slightly fewer female records than male records. There are also fewer females than males in the general population. How do we test whether females are as likely to commit suicide as males? This can be answered through hypothesis testing.
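One point worth checking before going further: `value_counts()` counts rows, while the preview above shows a `Total` column that appears to hold the number of suicides for each row. The following is a minimal sketch, on a small made-up frame that reuses the column names from `df.head()`, of how row counts and summed totals can differ. The frame `toy` and all of its values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data mimicking the columns shown by df.head();
# the values are invented purely for illustration.
toy = pd.DataFrame({
    "Type_code": ["Causes", "Causes", "Causes", "Causes"],
    "Gender": ["Female", "Female", "Male", "Male"],
    "Total": [0, 3, 5, 10],
})

# Number of rows per gender (what value_counts() reports)
row_counts = toy["Gender"].value_counts()

# Number of suicides per gender (sum of the Total column)
suicide_counts = toy.groupby("Gender")["Total"].sum()

# Fraction of suicides attributed to women in the toy data
female_fraction = suicide_counts["Female"] / suicide_counts.sum()
print(row_counts, suicide_counts, female_fraction, sep="\n")
```

If the different `Type_code` categories describe the same cases from different angles, it may also be necessary to filter to a single category before summing. That is an assumption to verify against the dataset documentation.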

### Step 1: Formulate the hypotheses and decide on the significance level

The null hypothesis, as stated in the slides, is the default state. Therefore, I will state my null and alternate hypotheses as follows.

* **Null Hypothesis (H0)**: Men and women are equally likely to commit suicide.
* **Alternate Hypothesis (H1)**: Men and women are not equally likely to commit suicide.

If the null hypothesis is true, it would mean that the fraction of women committing suicide would be the same as the fraction of women in the general population. We now need to use a suitable statistical test to find out if this is indeed the case.

Our statistical test will generate a p-value, which has to be compared to a significance level ($\alpha$). If the p-value is less than $\alpha$, a result at least as extreme as the one observed would be very unlikely if the null hypothesis were true, and we would be reasonable in rejecting the null hypothesis. Conversely, if the p-value is greater than or equal to $\alpha$, we will not be in a position to reject the null hypothesis.

Let us choose $\alpha$ = 0.05.
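The decision rule described above can be sketched as a small helper. The function name `decide` is hypothetical and only serves to make the rule explicit.

```python
def decide(p_value, alpha=0.05):
    """Return the decision for a significance test at level alpha."""
    if p_value < alpha:
        return "reject H0"
    return "fail to reject H0"

print(decide(0.003))  # p-value below alpha
print(decide(0.2))    # p-value above alpha
```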

### Step 2: Decide on the Statistical Test

We will be using the One Sample Z-Test here. How to decide upon a test will be discussed in another notebook.
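For reference, the statistic computed in the next step can be written as follows. The symbols are introduced here for illustration and do not appear in the original notebook: $\hat{p}$ is the sample proportion of women, $p_0$ is the proportion under the null hypothesis, $n$ is the sample size, and $\Phi$ is the standard normal cumulative distribution function.

$$
z = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}}, \qquad \text{p-value} = 2 \bigl(1 - \Phi(|z|)\bigr)
$$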

### Step 3: Compute the p-value

In [6]:
h0_prop = p
h0_prop

0.4845360824742268

In [9]:
h1_prop = df['Gender'].value_counts()['Female']/len(df)
h1_prop

0.49949688235467476

In [10]:
sigma_prop = np.sqrt((h0_prop * (1 - h0_prop))/len(df))
sigma_prop

0.0010254465276083747

In [11]:
z = (h1_prop - h0_prop)/sigma_prop
z

14.589546580591277

In [12]:
def pvalue(z):
    # Two-sided p-value for a standard normal test statistic
    return 2 * (1 - stats.norm.cdf(abs(z)))

In [14]:
p_val = (1-stats.norm.cdf(z))*2
p_val

0.0

The p-value is so small that this computation returns exactly zero: in floating-point arithmetic `stats.norm.cdf(z)` rounds to 1.0, so subtracting it from 1 leaves 0.0.
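To see how small the p-value actually is, one option is the survival function `stats.norm.sf`, which evaluates the upper tail directly instead of subtracting from 1. This is a minimal sketch that reuses the z value printed above. The names `p_naive` and `p_sf` are illustrative.

```python
from scipy import stats

z = 14.589546580591277  # z statistic reported above

# Subtracting from 1 loses the tail: cdf(z) rounds to exactly 1.0
p_naive = 2 * (1 - stats.norm.cdf(z))

# The survival function evaluates the upper tail directly
p_sf = 2 * stats.norm.sf(abs(z))

print(p_naive, p_sf)
```

The second value is tiny but strictly positive, which does not change the decision in the next step.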

### Step 4: Comparison and Decision

The p-value obtained is far lower than our significance level $\alpha$, which is extremely strong evidence against the null hypothesis. We can thus reject the null hypothesis in favor of the alternate hypothesis (since it is the negation of the null hypothesis).

**Men and women are not equally likely to commit suicide.**

Note that this two-sided test says nothing about whether men are more likely than women to commit suicide or vice versa. It just states that they are not equally likely. The reader is encouraged to form their own hypothesis tests to check these results.
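As a starting point for such a follow-up, a one-sided test keeps the same z statistic and only changes which tail is used for the p-value. This is a minimal sketch that reuses the z value from Step 3. The names `p_greater` and `p_less` are illustrative.

```python
from scipy import stats

z = 14.589546580591277  # z statistic computed in Step 3

# H1: sample proportion is greater than the null proportion (upper tail)
p_greater = stats.norm.sf(z)

# H1: sample proportion is less than the null proportion (lower tail)
p_less = stats.norm.cdf(z)

print(p_greater, p_less)
```

The direction of the alternative should be chosen before looking at the data, otherwise the significance level no longer means what it claims.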

## Analyzing the average heights of NBA Players

I was interested in knowing the average height of NBA players. A quick Google search tells me that the average height of players between 1985 and 2006 was **6'7"** or 200.66 cm. Is this still the case?

To answer this question, we will be using the NBA Players Stats - 2014-2015 dataset on Kaggle courtesy DrGuillermo. The dataset can be downloaded here: https://www.kaggle.com/drgilermo/nba-players-stats-20142015

In [16]:
df2 = pd.read_csv('data/players_stats.csv')
df2.head()

Unnamed: 0,Name,Games Played,MIN,PTS,FGM,FGA,FG%,3PM,3PA,3P%,...,Age,Birth_Place,Birthdate,Collage,Experience,Height,Pos,Team,Weight,BMI
0,AJ Price,26,324,133,51,137,37.2,15,57,26.3,...,29.0,us,"October 7, 1986",University of Connecticut,5,185.0,PG,PHO,81.45,23.798393
1,Aaron Brooks,82,1885,954,344,817,42.1,121,313,38.7,...,30.0,us,"January 14, 1985",University of Oregon,6,180.0,PG,CHI,72.45,22.361111
2,Aaron Gordon,47,797,243,93,208,44.7,13,48,27.1,...,20.0,us,"September 16, 1995",University of Arizona,R,202.5,PF,ORL,99.0,24.142661
3,Adreian Payne,32,740,213,91,220,41.4,1,9,11.1,...,24.0,us,"February 19, 1991",Michigan State University,R,205.0,PF,ATL,106.65,25.377751
4,Al Horford,76,2318,1156,519,965,53.8,11,36,30.6,...,29.0,do,"June 3, 1986",University of Florida,7,205.0,C,ATL,110.25,26.234384


In [17]:
df2.shape

(490, 34)

### Hypothesis Testing

The One Sample Significance Test for a mean is very similar to that for a proportion. We will go through an almost identical process.

The hypotheses are defined as follows:
* **Null Hypothesis**: The average height of an NBA player is 200.66 cm.
* **Alternate Hypothesis**: The average height of an NBA player is not 200.66 cm.

The significance level $\alpha$ is again 0.05. As before, we proceed by assuming that the null hypothesis is true.
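For reference, the statistic computed in the cells that follow can be written as below. The symbols are introduced here for illustration: $\bar{x}$ is the sample mean height, $\mu_0 = 200.66$ is the mean under the null hypothesis, $s$ is the sample standard deviation, and $n$ is the number of observed heights.

$$
z = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}
$$

Because $s$ is estimated from the sample, this is strictly a t statistic. For a sample of a few hundred players, the normal approximation used here should be very close.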

In [18]:
h0_mean = 200.66

In [19]:
h1_mean = df2['Height'].mean()
h1_mean

197.44075829383885

In [23]:
sigma = df2['Height'].std()/np.sqrt(len(df2))
sigma

0.39484424472376178
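A detail worth verifying here: pandas' `mean()` and `std()` skip missing values by default, while `len(df2)` counts every row. If the `Height` column has missing entries, dividing by `np.sqrt(len(df2))` would understate the standard error. The following is a minimal sketch on invented data that uses `count()` for the number of non-missing values. The series `heights` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical heights in cm with two missing entries (invented values)
heights = pd.Series([185.0, 180.0, np.nan, 202.5, 205.0, np.nan, 205.0])

n_rows = len(heights)        # counts missing entries too -> 7
n_valid = heights.count()    # counts only non-missing entries -> 5

# Standard error based on the observations actually used by mean()/std()
se_valid = heights.std() / np.sqrt(n_valid)

# Standard error if every row is counted, including the missing ones
se_rows = heights.std() / np.sqrt(n_rows)

print(n_rows, n_valid, se_valid, se_rows)
```

Running `df2['Height'].isna().sum()` on the real data would show whether this matters for the analysis.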

In [28]:
z = (h1_mean - h0_mean)/sigma
z

-8.1531939471812898

In [34]:
p_val = (1 - stats.norm.cdf(abs(z))) * 2
p_val

4.4408920985006262e-16

The p-value obtained is much lower than the significance level $\alpha$. We therefore reject the null hypothesis in favor of the alternate hypothesis (its negation), and arrive at the following conclusion from this analysis:

**The average height of NBA Players is NOT 6'7"**.
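A complementary view is a 95% confidence interval for the mean, built from the sample mean and standard error reported above. This is a rough sketch that assumes the same normal approximation used throughout this notebook. The names `z_crit`, `lower`, and `upper` are illustrative.

```python
from scipy import stats

h0_mean = 200.66
h1_mean = 197.44075829383885   # sample mean reported above
sigma = 0.39484424472376178    # standard error reported above

# Critical value for a two-sided 95% interval (about 1.96)
z_crit = stats.norm.ppf(0.975)

lower = h1_mean - z_crit * sigma
upper = h1_mean + z_crit * sigma
print(lower, upper)

# Check whether the hypothesised mean falls inside the interval
print(lower <= h0_mean <= upper)
```

For these inputs the interval lies entirely below 200.66 cm, which agrees with rejecting the null hypothesis at $\alpha$ = 0.05.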