# Hypothesis Test - Clearly Explained

![](https://editor.analyticsvidhya.com/uploads/52940cover.jpg)
source: https://editor.analyticsvidhya.com/uploads/52940cover.jpg

## Need statement

Consider the following scenario: you've trained two (or three, or ten, or thousands...) **super cool predictive models** for a specific problem (like the famous Titanic problem, for example). How do you actually know which model has the best of the best of the best performance? Is a Random Forest model always better than a Decision Tree? Is XGBoost the best algorithm in the world? What is your opinion about KNN? And... what about the bubble sort method? Ok ok... I'm just kidding...

I'll try to help you with a cool technique that **statistically** demonstrates whether a model performs better than, worse than, or similarly to another.


Part of the work presented here is based on the following video #bam!:

[![Live 2020-06-01!!! Hypothesis Testing](http://img.youtube.com/vi/hGoTUyBnbxg/0.jpg)](https://www.youtube.com/watch?v=hGoTUyBnbxg)



Of course, you will find much more technical information about design of experiments (DoE) in the book below:

![](https://images-na.ssl-images-amazon.com/images/I/51XN4Kgi0JL._SX396_BO1,204,203,200_.jpg)
[Link](https://www.amazon.com.br/Design-Analysis-Experiments-Douglas-Montgomery/dp/1119722101/)


![](https://thumbs.dreamstime.com/b/lets-go-handwritten-white-background-169989567.jpg)

source: https://thumbs.dreamstime.com/b/lets-go-handwritten-white-background-169989567.jpg



# 1. (Without) exploratory data analysis / just data engineering


The objective here is not to compete but to present the most basic concepts of hypothesis testing, so accuracy will not be our main concern, ok?


We will only use the following database features:
* Pclass (categorical)
* Sex (categorical, with just 2 unique values...)
* Age (numerical)
* Fare (numerical)
* Embarked (categorical)
* Survived (categorical ... of course!!!)

In [None]:
# general imports...
import pandas as pd
import numpy as np
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns

import scipy.stats as st
from statsmodels.stats.weightstats import ztest

# sklearn
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


In [None]:
# SOME CONSTANTS

SEED = 123

In [None]:
# loading the database...
df_train = pd.read_csv('/kaggle/input/titanic/train.csv')

# I will use this list again further down
cols = ['Pclass', 'Sex', 'Age', 'Fare', 'Embarked']

# selecting the features
df_train = df_train[cols + ['Survived']]
df_train

In [None]:
# base info
df_train.info()

In [None]:
for c in df_train.columns:
    print(f'Column: {c}')
    print(f'# of unique values: {len(df_train[c].unique())}')

Now let's separate the predictor variables ($X$) and the outcome ($y$)...

In [None]:
X = df_train[cols]
y = df_train['Survived']

Let's create a (very very very simple) pipeline for the predictor variables, which performs the following tasks:

* Numeric features:
    * Imputer: SimpleImputer (median)
    * Scaler: StandardScaler
* Categorical features:
    * Imputer: Most frequent
    * Encoder: One Hot Encoder
* Binary features:
    * Imputer: Most frequent
    * Encoder: Ordinal Encoder


This pipeline will be used later, at the time of the experiments, ok?

In [None]:
# numerical features
numeric_features = ["Age", "Fare"]
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), 
           ("scaler", StandardScaler())]
)


# categorical features
categorical_features = ["Pclass", "Embarked"]
categorical_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="most_frequent")), 
           ("ohe", OneHotEncoder(handle_unknown="ignore"))])
    
# binary features
binary_features = ["Sex"]
binary_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="most_frequent")), 
           ("ordinal", OrdinalEncoder())])

    

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
        ("bin", binary_transformer, binary_features),
    ]
)

# 2.1 The very first experiment

In this very first experiment we are going to compare a **logistic regression** with a **random forest** (both models with their default settings)


Let's run each of the methods just once and check which one has the best accuracy, ok?


In [None]:
# Split the sample: 70% for training and the remaining 30% for testing, stratified by outcome.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED, stratify=y)

print(f'X_train shape {X_train.shape}')
print(f'y_train shape {y_train.shape}')
print('-'*10)
print(f'X_test shape {X_test.shape}')
print(f'y_test shape {y_test.shape}')

First, logistic regression model....

In [None]:
pipe_lr = Pipeline(
    [('preprocessor', preprocessor), 
     ('estimator', LogisticRegression(random_state=SEED))])
pipe_lr.fit(X_train, y_train)
y_pred = pipe_lr.predict(X_test)
acc_lr = accuracy_score(y_test, y_pred)
print(f'Acc (Logistic Regression): {acc_lr}')

Now, random forest...

In [None]:
pipe_rn = Pipeline(
    [('preprocessor', preprocessor), 
     ('estimator', RandomForestClassifier(random_state=SEED))])
pipe_rn.fit(X_train, y_train)
y_pred = pipe_rn.predict(X_test)
acc_rn = accuracy_score(y_test, y_pred)
print(f'Acc (Random Forest): {acc_rn}')

Let's check which one is the best model...

In [None]:
# Here we have an exceptionally complex quantum calculus...
if acc_lr > acc_rn:
    print('Logistic Regression is the best model!!!')
elif acc_rn > acc_lr:
    print('Random forest is the best model!!!')
else:
    print("Something went wrong... (and it obviously isn't right...)")


Now you can tell the whole world that the **Random Forest model is the best**, right? We can stop the discussion here....


Hmm...not yet...

Let's try some more...

# 2.2 Second experiment (much better than the first)

In this second experiment, instead of running each model just once, we are going to run them 25 times (with different training and test datasets each time). After that, we'll compare the average accuracy of each model, ok?

In [None]:
# I will store the results of the executions here
results = {
    'lr':[],
    'rn':[]
}

local_seed = 10
for i in tqdm(range(25)):
    # Split the sample: 70% for training and the remaining 30% for testing, stratified by outcome.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=local_seed)

    pipe_lr = Pipeline(
        [('preprocessor', preprocessor), 
         ('estimator', LogisticRegression(random_state=SEED))])
    pipe_lr.fit(X_train, y_train)
    y_pred = pipe_lr.predict(X_test)
    acc_lr = accuracy_score(y_test, y_pred)
    results['lr'].append(acc_lr)
    
    pipe_rn = Pipeline(
        [('preprocessor', preprocessor), 
         ('estimator', RandomForestClassifier(random_state=SEED))])
    pipe_rn.fit(X_train, y_train)
    y_pred = pipe_rn.predict(X_test)
    acc_rn = accuracy_score(y_test, y_pred)    
    results['rn'].append(acc_rn)
    
    local_seed += 10
    

# Converting result to dataframe
df_results = pd.DataFrame(results)
df_results

In [None]:
# Summarizing the result...
df_results.describe()

And now, who did better?

In [None]:
# Here we have an exceptionally complex quantum calculus...
if df_results['lr'].mean() > df_results['rn'].mean():
    print('Logistic Regression is the best model!!!')
elif df_results['rn'].mean() > df_results['lr'].mean():
    print('Random forest is the best model!!!')
else:
    print("Something went wrong... (and it obviously isn't right...)")


By now you must be really convinced that Random Forest is the **best model ever**, right?


Hmm... I don't think so...


In our results, was logistic regression better in any round?



In [None]:
df_results['lr win'] = df_results['lr'] > df_results['rn']
df_results

In [None]:
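lr_win = int(df_results['lr win'].sum())
rn_win = int((df_results['rn'] > df_results['lr']).sum())
ties = len(df_results) - lr_win - rn_win  # rounds where both models had the same accuracy

print(f'Logistic Regression was the best {lr_win} times')
print(f'Random Forest was the best {rn_win} times')
print(f'Ties: {ties}')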

Now, there is no doubt that Random Forest is **the best model ever created**, right?

Let's try it one more time (*I promise it will be the last experiment on this notebook*)...

# 2.3 Third experiment (and the most robust of all)

According to Wikipedia: "*A statistical hypothesis test is a method of statistical inference used to decide whether the data at hand sufficiently support a particular hypothesis.*"

And what does that mean in practice? It means that we will use a **robust** method to support some claim (in our case, a hypothesis), based on **experiments** and their **results**.

Let's build our hypotheses. I have an initial hunch that the random forest model will be better than the logistic regression, but I want to test statistically whether that hunch is correct. To do that, we first have to write down our hypotheses. 

The first hypothesis is called the **null hypothesis** (or null effect hypothesis or H0). Here we place our first bet, that is, that there is no statistically significant difference between the models (hence the name null hypothesis).

The other hypothesis is called the **alternative hypothesis** (H1), and it is the one we favor when we reject the null hypothesis. But pay close attention now: we never *accept* the null hypothesis. We either reject H0 or fail to reject H0, okay? (save this information)

Note that earlier I said that the method is robust, but this does not mean that the method is infallible (there is a probability that the result of the experiment does not match reality). In this sense, there are 4 possibilities:

* H0 is true AND we do not reject H0: this is a Right decision
* H0 is true AND we reject H0: this is a Wrong decision (Type I Error)
* H0 is false AND we do not reject H0: this is a Wrong decision (Type II Error)
* H0 is false AND we reject H0: this is a Right decision


According to Wikipedia: "*In statistical hypothesis testing, a type I error is the mistaken rejection of an actually true null hypothesis (also known as a "false positive" finding or conclusion; example: "an innocent person is convicted"), while a type II error is the failure to reject a null hypothesis that is actually false (also known as a "false negative" finding or conclusion; example: "a guilty person is not convicted")*"


The probability of a type I error is called the **significance level** of the test (we use the Greek letter alpha, $\alpha$, to express this value). Its complement, $1-\alpha$, is called the **confidence level**. We choose the $\alpha$ value before the experiment starts, and we compare the experiment's result (the p-value) against it to reach a conclusion: if the p-value is smaller than $\alpha$ we reject H0, otherwise we fail to reject it. For more information see Montgomery's book, ok?
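
To get a feeling for what $\alpha$ means, here is a minimal simulation sketch (not part of the original experiments). It assumes two "models" whose accuracies come from the very same normal distribution, so H0 is true by construction, and it uses a simple z-test with a known standard deviation as a stand-in. The numbers `TRUE_MEAN`, `SIGMA` and the helper names are made up for illustration. If things work as described, roughly a fraction $\alpha$ of the simulated experiments should (wrongly) reject H0.

```python
import random
from statistics import NormalDist, mean

ALPHA = 0.05
N_RUNS = 25        # accuracies per "model", as in the experiments in this notebook
SIGMA = 0.03       # assumed (known) standard deviation of the accuracies
TRUE_MEAN = 0.80   # assumed common mean accuracy, so H0 is true by construction

def z_test_known_sigma(a, b, sigma):
    """Two-sided p-value for H0: equal means, with a known common sigma."""
    std_diff = sigma * (1 / len(a) + 1 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / std_diff
    return 2 * (1 - NormalDist().cdf(abs(z)))

def type_one_error_rate(n_experiments=2000, seed=123):
    """Share of simulated experiments that reject H0 even though it is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_experiments):
        a = [rng.gauss(TRUE_MEAN, SIGMA) for _ in range(N_RUNS)]
        b = [rng.gauss(TRUE_MEAN, SIGMA) for _ in range(N_RUNS)]
        if z_test_known_sigma(a, b, SIGMA) < ALPHA:
            rejections += 1
    return rejections / n_experiments

rate = type_one_error_rate()
print(f"Share of experiments that wrongly rejected H0: {rate:.3f}")
```

The printed share should land somewhere near 0.05: that is the price we agree to pay, in false alarms, when we pick $\alpha = 0.05$.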


I think we already have enough to start our statistical test and actually compare the 2 models and see which one is better. Let's go!!!


So let's write our hypotheses and our parameters for the statistical test:

**hypotheses:**
* H0: There is no statistical difference between the LR and RN methods in terms of mean accuracies
* H1: There is statistical difference between the LR and RN methods in terms of mean accuracies
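
In symbols, writing $\mu_{LR}$ and $\mu_{RN}$ for the (unknown) mean accuracies of the two methods (this notation is introduced here just for illustration), the two-sided hypotheses above can be read as:

$$H_0: \mu_{LR} = \mu_{RN} \qquad\qquad H_1: \mu_{LR} \neq \mu_{RN}$$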

**design of experiment:**
* $\alpha = 0.05$
* Each model will be run 25 times; at the end, a z-test will check whether or not there is a statistical difference between the methods

**premises**
* I'm using the z-test because it's easier to understand, even though some of its assumptions are not verified here. The purpose here is to be didactic. Certainly other statistical tests (such as the t-test, for example, or some non-parametric method) are more interesting, but they require a greater depth in the theory of experiment design.
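
If you are curious about what a two-sample z-test computes, here is a minimal hand-made sketch using only the standard library. It uses a pooled variance estimate, which I believe mirrors the default behaviour of the statsmodels `ztest`, but treat that as an assumption and not as a description of the library internals. The function name `ztest_two_sample` and the accuracy values are made up for illustration; they are not results from this notebook.

```python
from statistics import NormalDist, mean, pvariance

def ztest_two_sample(x1, x2):
    """Two-sided, two-sample z-test with a pooled variance estimate."""
    n1, n2 = len(x1), len(x2)
    # Pooled variance of the two samples (n1 + n2 - 2 degrees of freedom)
    pooled_var = (n1 * pvariance(x1) + n2 * pvariance(x2)) / (n1 + n2 - 2)
    # Standard error of the difference between the two sample means
    std_diff = (pooled_var * (1 / n1 + 1 / n2)) ** 0.5
    z = (mean(x1) - mean(x2)) / std_diff
    # Two-sided p-value from the standard normal distribution
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Made-up accuracies, for illustration only
acc_a = [0.80, 0.82, 0.79, 0.81, 0.83]
acc_b = [0.81, 0.80, 0.82, 0.79, 0.84]

z, p = ztest_two_sample(acc_a, acc_b)
print(f"z = {z:.3f}, p-value = {p:.3f}")
```

The z statistic is just the difference between the two means measured in units of its standard error, and the p-value tells how surprising such a difference would be if H0 were true.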


Let's implement the test then...

In [None]:
# I will store the results of the executions here
results = {
    'lr':[],
    'rn':[]
}

local_seed = 10
for i in tqdm(range(25)):
    # Split the sample: 70% for training and the remaining 30% for testing, stratified by outcome.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=local_seed)

    pipe_lr = Pipeline(
        [('preprocessor', preprocessor), 
         ('estimator', LogisticRegression(random_state=SEED))])
    pipe_lr.fit(X_train, y_train)
    y_pred = pipe_lr.predict(X_test)
    acc_lr = accuracy_score(y_test, y_pred)
    results['lr'].append(acc_lr)
    
    pipe_rn = Pipeline(
        [('preprocessor', preprocessor), 
         ('estimator', RandomForestClassifier(random_state=SEED))])
    pipe_rn.fit(X_train, y_train)
    y_pred = pipe_rn.predict(X_test)
    acc_rn = accuracy_score(y_test, y_pred)    
    results['rn'].append(acc_rn)
    
    local_seed += 10
    

# Converting result to dataframe
df_results = pd.DataFrame(results)
df_results

In [None]:
df_results.describe()

Let's look at the results graphically...

In [None]:
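fig, ax = plt.subplots(figsize=(12, 8))
# sns.distplot is deprecated; histplot with kde=True draws the same kind of plot
sns.histplot(df_results['lr'], kde=True, stat='density', label='Logistic Regression', ax=ax)
sns.histplot(df_results['rn'], kde=True, stat='density', label='Random Forest', ax=ax)
ax.set_xlabel('Accuracy')
ax.legend()
plt.show()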

Well well well, you were absolutely sure that Random Forest was better than Logistic Regression... now what? What do you see in the picture above?

It looks like the histograms have a very large overlapping area, right? Is it possible to keep the same statement as before? Or does it look like the game has changed?

And now... **the statistical test**...

In [None]:
alpha = 0.05
z_calc, p_valor = ztest(x1=df_results['lr'], x2=df_results['rn'], alternative='two-sided')
z_calc, p_valor

It's time to check...

In [None]:
if p_valor < alpha:
    print("Reject H0! - There is statistical difference between the LR and RN methods in terms of mean accuracies")
else:
    print("Do not reject H0! - Maybe there is no statistical difference between the LR and RN methods in terms of mean accuracies")

## #Bam!!!!

Now I've convinced you that we cannot claim a difference between the methods, right??? Read the conclusions with me...

# The Storytelling

Okay okay... and now? How do we report this result?

You won't tell everything you've done, because it will be very boring! You can simply say the following:

"*According to the experiments carried out, it is not possible to state that there are statistically significant differences between the random forest and logistic regression methods.*"

And what could change this result? Many things!!! I will list some of them:

* sample size
* LR and RN hyperparameters
* number of runs
* $\alpha$ value
* the statistical test used
* verification of assumptions according to the statistical test used
* the seed value of the random number generator (see constant SEED)
* if it's raining... (I'm just kidding...)
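
To make the "number of runs" item a bit more concrete, here is a small simulation sketch (again, not part of the original experiments). It assumes two "models" whose true mean accuracies differ by one percentage point, with a made-up standard deviation of 0.03, and counts how often a two-sided z-test rejects H0 for different numbers of runs. All names and numbers here are hypothetical.

```python
import random
from statistics import NormalDist, mean, variance

def rejection_rate(n_runs, true_gap, alpha=0.05, n_experiments=1000, seed=123):
    """Share of simulated experiments in which a two-sided z-test rejects H0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_experiments):
        a = [rng.gauss(0.80, 0.03) for _ in range(n_runs)]
        b = [rng.gauss(0.80 + true_gap, 0.03) for _ in range(n_runs)]
        # With equal sample sizes, the pooled standard error simplifies to this
        std_diff = ((variance(a) + variance(b)) / n_runs) ** 0.5
        z = (mean(a) - mean(b)) / std_diff
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_experiments

for n_runs in (5, 25, 100):
    rate = rejection_rate(n_runs, true_gap=0.01)
    print(f"{n_runs:>3} runs -> H0 rejected in {rate:.0%} of the experiments")
```

Under these assumptions the same real difference is detected much more often with 100 runs than with 5, so "do not reject H0" with few runs may simply be a type II error.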


The message here is that the conclusions depend on several factors involved in the various stages of the experiment. So, here's some friendly advice:

* Always check that the methodology applied is adequate and correct
* Check that the parameters used for the experiment are correct
* If a conclusion does not seem to be very reasonable, try running an independent experiment (following and verifying the methodology) and compare the results


Thank you very much for your attention, and feel free to make suggestions!!! Bye!!!

![](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQeTlBWPzFucVr0vMMgbWtuF1iX_Ja16zMiuUzpx41OHktuj_PeeGQht8qiof2LWZZfv4g&usqp=CAU)