# Hypothesis Testing 

Exploring the Titanic survival dataset to demonstrate **Exploratory Data Analysis (EDA)** and **Hypothesis Testing**. 

In [2]:
import numpy as np
import pandas as pd

In [3]:
data_raw = pd.read_csv("data/titanic.csv")
data_raw.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [9]:
data_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


## Breakdown of Titanic Data 

#### Predicting Survivorship

We are going to predict the target **Survived**, *i.e. we are using Survived as our dependent variable.* The target depends on the **features** of the system we are modeling, which here is the passenger population of the Titanic. There were over 2,000 people aboard the Titanic when it sank, and this data set covers **891 of them**. Like most data sets, this one is a sample of a larger population. When we perform statistical analysis of any kind, we use the sample to model patterns that may exist in that population.

We will assume that each feature of the data is an independent dimension, and that the binary variable Survived depends on all of those dimensions. This notebook tackles a few concepts along the way.

#### Concepts in this Notebook

1. Encoding non-numeric variables 
2. Parametric vs Non-parametric 
3. Type I vs II errors
4. Feature engineering: filling in missing data
5. Feature engineering: scaling and normalizing data
6. Checking assumptions of parametric models


In [21]:
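# Encode categorical variables into numeric

def encode_cat_var(col):
    """One-hot encode a pandas Series: one 0/1 column per category."""
    # Skip missing values so no all-zero "<name>_nan" column is created
    categories = col.dropna().unique()
    feats = {}
    for cat in categories:
        binn = (col == cat)
        feats["%s_%s" % (col.name, cat)] = binn.astype("int")
    return pd.DataFrame(feats)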

def get_encoded_features(dat):
    """
        Encode categorical features into numeric
    """
    Y = dat["Survived"]
    X = dat.drop(["Survived", "PassengerId", "Cabin", "Ticket", "Name"], axis = 1)
    X["EnSex"] = X["Sex"].apply(lambda x: 1 if x == "male" else 0)
    X["En_Q"] = X["Embarked"].apply(lambda x: 1 if x == "Q" else 0)
    X["En_C"] = X["Embarked"].apply(lambda x: 1 if x == "C" else 0)
    X["En_S"] = X["Embarked"].apply(lambda x: 1 if x == "S" else 0)
    X = X.drop(["Embarked", "Sex"], axis = 1)
    return X, Y



In [35]:
X, Y = get_encoded_features(data_raw)

In [36]:
X.head()

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,EnSex,En_Q,En_C,En_S
0,3,22.0,1,0,7.25,1,0,0,1
1,1,38.0,1,0,71.2833,0,0,1,0
2,3,26.0,0,0,7.925,0,0,0,1
3,1,35.0,1,0,53.1,0,0,0,1
4,3,35.0,0,0,8.05,1,0,0,1
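The same kind of indicator columns can also be produced with pandas' built-in `pd.get_dummies`. The sketch below is only an illustration on a small made-up frame (`toy` and its values are hypothetical, not rows from the Titanic file). It shows that a missing category ends up as zeros in every indicator column, which is also what the `apply`-based encoding above does for the two missing `Embarked` values.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Sex and Embarked columns.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", None],
})

# One 0/1 column per category; a row with a missing value gets 0 in all of them.
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], dtype=int)
print(encoded)
```

Unlike `EnSex`, which keeps a single column for sex, `get_dummies` emits one column per category unless `drop_first=True` is passed.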


### Parametric vs Nonparametric models 

In general, we make assumptions about the structure of the population we are pulling data from (**sampling from**). Put another way, we make assumptions about how the variables are distributed in the population we want to generalize to.

These assumptions usually come from our models, which break down into two main categories: **parametric and nonparametric models**. Parametric models assume that the population follows a distribution described by a fixed set of parameters. In classical statistics this is very often a normal distribution, along with other assumptions. **Linear Regression** and many other standard or classical statistical models are parametric.

Nonparametric models assume less about the structure of the population the data is drawn from. Many more **complex models and analyses are nonparametric**. In a way, stricter assumptions about the nature of the data buy interpretability: parametric models and analyses tend to have easier and more direct ways to interpret results. For example, many **tests that generate p-values are parametric**. A p-value is the probability of seeing a result at least as extreme as the one observed if there were really no effect in the population, so a small p-value is taken as evidence that a variable has a statistically significant relationship with the outcome.

Note that this is quite a leap, from assumptions about the structure of data to conclusions about how a complex system works. Nonparametric models tend to have less direct explanatory power. In other words, **nonparametric models and analyses tend to be more complicated or more difficult to explain**.
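As a concrete example of a nonparametric analysis, a permutation test compares two groups without assuming any particular distribution. It asks how often randomly reshuffled group labels produce a difference in means at least as large as the observed one. The sketch below is a minimal version on synthetic, skewed data; the samples, the helper name `permutation_test`, and the number of permutations are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, skewed samples (hypothetical values, not Titanic data).
group_a = rng.exponential(scale=1.0, size=40)
group_b = rng.exponential(scale=2.0, size=40)

def permutation_test(a, b, n_perm=5000, rng=rng):
    """Two-sided permutation test for a difference in means."""
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        # Shuffle the pooled values, then split them back into two groups
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:len(a)].mean() - shuffled[len(a):].mean())
        if diff >= observed:
            count += 1
    # Add one to numerator and denominator so the p-value is never exactly 0
    return (count + 1) / (n_perm + 1)

p_value = permutation_test(group_a, group_b)
print(f"permutation p-value: {p_value:.4f}")
```

The price of dropping the distributional assumption is computation: the p-value comes from thousands of reshuffles rather than from a closed-form formula.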



### Error Types 

There are two major types of error that one can make in analysis. 

#### Type I 

These are **false-positives**. You incorrectly find significance when there is none. You believe there is a pattern when there is none. 

In terms of hypothesis testing, this is when you reject a null hypothesis that is actually true in the population. The null hypothesis states that there is no difference in the population between the groups you are comparing. In other words, you reject the claim that there is no difference, and so conclude that there IS a difference when there is none.

#### Type II 

These are **false-negatives**. You incorrectly find no significance when there is some. You believe there is no pattern when there is one.

In terms of hypothesis testing, this is when you fail to reject a null hypothesis that is actually false in the population. The null hypothesis states that there is no difference in the population between the groups you are comparing. In other words, you fail to reject the claim that there is no difference, and so miss a difference that is actually there.
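Both error rates can be estimated by simulation. The sketch below repeats a simple two-sample z-test many times: once with no true difference, where every rejection is a Type I error, and once with a true difference, where every non-rejection is a Type II error. The sample size, the effect size of 0.5, and the known unit variance are assumptions chosen for illustration, not properties of the Titanic data.

```python
import numpy as np

rng = np.random.default_rng(42)
n, trials = 50, 2000
z_crit = 1.96  # two-sided 5% critical value of the standard normal

def reject_rate(true_diff):
    """Fraction of simulated experiments in which the null is rejected."""
    a = rng.normal(0.0, 1.0, size=(trials, n))
    b = rng.normal(true_diff, 1.0, size=(trials, n))
    # z-statistic for a difference in means with known unit variance
    z = (b.mean(axis=1) - a.mean(axis=1)) / np.sqrt(2.0 / n)
    return np.mean(np.abs(z) > z_crit)

# Null is true: every rejection is a false positive (Type I)
type_1_rate = reject_rate(0.0)
# Null is false: every failure to reject is a false negative (Type II)
type_2_rate = 1.0 - reject_rate(0.5)

print(f"Type I rate:  {type_1_rate:.3f}")
print(f"Type II rate: {type_2_rate:.3f}")
```

The Type I rate should land near the chosen 5% significance level, while the Type II rate depends on the sample size and on how large the true difference is.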

In [37]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Pclass  891 non-null    int64  
 1   Age     714 non-null    float64
 2   SibSp   891 non-null    int64  
 3   Parch   891 non-null    int64  
 4   Fare    891 non-null    float64
 5   EnSex   891 non-null    int64  
 6   En_Q    891 non-null    int64  
 7   En_C    891 non-null    int64  
 8   En_S    891 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 62.8 KB


In [38]:
X.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Pclass,891.0,2.308642,0.836071,1.0,2.0,3.0,3.0,3.0
Age,714.0,29.699118,14.526497,0.42,20.125,28.0,38.0,80.0
SibSp,891.0,0.523008,1.102743,0.0,0.0,0.0,1.0,8.0
Parch,891.0,0.381594,0.806057,0.0,0.0,0.0,0.0,6.0
Fare,891.0,32.204208,49.693429,0.0,7.9104,14.4542,31.0,512.3292
EnSex,891.0,0.647587,0.47799,0.0,0.0,1.0,1.0,1.0
En_Q,891.0,0.08642,0.281141,0.0,0.0,0.0,0.0,1.0
En_C,891.0,0.188552,0.391372,0.0,0.0,0.0,0.0,1.0
En_S,891.0,0.722783,0.447876,0.0,0.0,1.0,1.0,1.0


In [39]:
X["Age"]

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

### Filling in missing values 

You need to be careful with this, but a common way to fill in missing values is to replace every missing value in a numeric column with the median of that column. You can alternatively use the mean (average). This keeps the center of the column where it was and lets you keep every row, but it is not free: the filled column has less spread than the observed values (compare the standard deviation of Age before and after filling), which is why care is needed.
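The effect of median imputation can be seen on a tiny example. The sketch below uses a short, made-up series of ages (hypothetical values, not taken from the file) and compares the median and standard deviation before and after filling.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries.
ages = pd.Series([22, 38, 26, 35, 35, np.nan, 27, 19, np.nan, 26, 32], dtype=float)

median_age = ages.median()        # NaN values are skipped by default
filled = ages.fillna(median_age)  # replace each NaN with the median

print("median before:", median_age, "median after:", filled.median())
print("std before:", ages.std(), "std after:", filled.std())
```

The median is unchanged, but the standard deviation shrinks because every filled value sits at the center of the distribution.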



In [41]:
# Fill missing ages with the median of the observed ages (28.0 here)
X["Age"] = X["Age"].fillna(X["Age"].median())

In [42]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Pclass  891 non-null    int64  
 1   Age     891 non-null    float64
 2   SibSp   891 non-null    int64  
 3   Parch   891 non-null    int64  
 4   Fare    891 non-null    float64
 5   EnSex   891 non-null    int64  
 6   En_Q    891 non-null    int64  
 7   En_C    891 non-null    int64  
 8   En_S    891 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 62.8 KB


In [43]:
X.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Pclass,891.0,2.308642,0.836071,1.0,2.0,3.0,3.0,3.0
Age,891.0,29.361582,13.019697,0.42,22.0,28.0,35.0,80.0
SibSp,891.0,0.523008,1.102743,0.0,0.0,0.0,1.0,8.0
Parch,891.0,0.381594,0.806057,0.0,0.0,0.0,0.0,6.0
Fare,891.0,32.204208,49.693429,0.0,7.9104,14.4542,31.0,512.3292
EnSex,891.0,0.647587,0.47799,0.0,0.0,1.0,1.0,1.0
En_Q,891.0,0.08642,0.281141,0.0,0.0,0.0,0.0,1.0
En_C,891.0,0.188552,0.391372,0.0,0.0,0.0,0.0,1.0
En_S,891.0,0.722783,0.447876,0.0,0.0,1.0,1.0,1.0
