<a href="https://colab.research.google.com/github/sungkim11/compare-datasets/blob/main/compare_two_datasets_with_uneq_obs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Comparing Datasets: Comparing two datasets with an unequal number of observations (Welch's t-test for the means of two independent samples)

## 1. Prerequisites

The following are prerequisites for this tutorial:

- Data
- Python Packages: researchpy and scipy.stats

### 1.1. Data

The dataset used in this notebook was retrieved from Kaggle (https://www.kaggle.com/datasets/wordsforthewise/lending-club) and contains the full Lending Club data available from their site. There are two separate files, one for accepted loans and one for rejected loans.

For this exercise I used the 'fico_range_high' variable from the accepted dataset and the 'Risk_Score' variable from the rejected dataset. I assumed that both are derived credit scores of some kind.

For readers who would like to understand the data in depth, an Exploratory Data Analysis (EDA) notebook for this dataset is available here => https://www.kaggle.com/code/wordsforthewise/eda-with-python/notebook.



### 1.2. Python Packages

#### 1.2.1. Install researchpy

Per researchpy's documentation, which is located here => https://researchpy.readthedocs.io/en/latest/index.html:

*Researchpy produces Pandas DataFrames that contains relevant statistical testing information that is commonly required for academic research. The information is returned as Pandas DataFrames to make for quick and easy exporting of results to any format/method that works with the traditional Pandas DataFrame. Researchpy is essentially a wrapper that combines various established packages such as pandas, scipy.stats, numpy, and statsmodels to get all the standard required information in one method. If analyses were not available in these packages, code was developed to fill the gap*. 

In [1]:
%%writefile requirements.txt

researchpy==0.3.5

Writing requirements.txt


In [2]:
!pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting researchpy==0.3.5
  Downloading researchpy-0.3.5-py3-none-any.whl (33 kB)
Installing collected packages: researchpy
Successfully installed researchpy-0.3.5


## 2. Code (Two samples from same population)

### 2.1. Import Python Packages

Import the Python packages and print their versions. Printing the versions is important because it lets other users replicate your work with the same Python version and package versions.

In [3]:
import pandas as pd
import scipy
from scipy import stats
import researchpy as rp
import sklearn
from sklearn.model_selection import train_test_split

import platform

In [4]:
print('Python: ', platform.python_version())
print('pandas: ', pd.__version__)
print('scipy: ', scipy.__version__)
print('sklearn: ', sklearn.__version__)

Python:  3.7.14
pandas:  1.3.5
scipy:  1.7.3
sklearn:  1.0.2


### 2.2. Mount Storage

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### 2.3. Exploratory Data Analysis

#### 2.3.1. Import and validate dataset

In [6]:
accepted_loans = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/stats/data/accepted_2007_to_2018Q4.csv', low_memory=False)

In [7]:
accepted_loans.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2260701 entries, 0 to 2260700
Columns: 151 entries, id to settlement_term
dtypes: float64(113), object(38)
memory usage: 2.5+ GB
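
Loading all 151 columns takes about 2.5 GB, yet this tutorial only uses one of them. As a possible alternative (not what this notebook does), `pd.read_csv` accepts a `usecols` argument that parses only the listed columns. The sketch below uses a tiny made-up CSV held in memory instead of the real file, so the values are purely illustrative.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the real CSV; the values are made up.
csv_text = """id,loan_amnt,fico_range_high
1,5000,704
2,12000,689
3,8000,
"""

# usecols tells read_csv to parse only the listed columns,
# which keeps the resulting DataFrame small.
scores = pd.read_csv(io.StringIO(csv_text), usecols=["fico_range_high"])

print(scores.columns.tolist())
print(scores.memory_usage(deep=True).sum(), "bytes")
```

With the real file, the same argument would be passed next to `low_memory=False`; whether this is enough to avoid the memory pressure on Colab is something I have not measured.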


List of variables in accepted_loans dataset.

In [8]:
for col_name in accepted_loans.columns: 
    print(col_name)

id
member_id
loan_amnt
funded_amnt
funded_amnt_inv
term
int_rate
installment
grade
sub_grade
emp_title
emp_length
home_ownership
annual_inc
verification_status
issue_d
loan_status
pymnt_plan
url
desc
purpose
title
zip_code
addr_state
dti
delinq_2yrs
earliest_cr_line
fico_range_low
fico_range_high
inq_last_6mths
mths_since_last_delinq
mths_since_last_record
open_acc
pub_rec
revol_bal
revol_util
total_acc
initial_list_status
out_prncp
out_prncp_inv
total_pymnt
total_pymnt_inv
total_rec_prncp
total_rec_int
total_rec_late_fee
recoveries
collection_recovery_fee
last_pymnt_d
last_pymnt_amnt
next_pymnt_d
last_credit_pull_d
last_fico_range_high
last_fico_range_low
collections_12_mths_ex_med
mths_since_last_major_derog
policy_code
application_type
annual_inc_joint
dti_joint
verification_status_joint
acc_now_delinq
tot_coll_amt
tot_cur_bal
open_acc_6m
open_act_il
open_il_12m
open_il_24m
mths_since_rcnt_il
total_bal_il
il_util
open_rv_12m
open_rv_24m
max_bal_bc
all_util
total_rev_hi_lim
inq_fi
to

#### 2.3.2. Cleanse Dataset

Drop all obs where 'fico_range_high' variable is null.

In [9]:
accepted_loans = accepted_loans.dropna(subset=['fico_range_high'])

In [10]:
accepted_loans.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2260668 entries, 0 to 2260698
Columns: 151 entries, id to settlement_term
dtypes: float64(113), object(38)
memory usage: 2.6+ GB
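
In the output above the row count goes from 2,260,701 to 2,260,668, so 33 obs were dropped. A minimal sketch of the same step on a made-up toy DataFrame shows how to count the missing values first and confirms that `dropna(subset=...)` only looks at the named column.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "fico_range_high": [704.0, np.nan, 689.0, np.nan, 744.0],
    "loan_amnt": [5000, 12000, np.nan, 8000, 3000],
})

# Count missing values in the column of interest before dropping.
n_missing = toy["fico_range_high"].isna().sum()

# Rows are dropped only when 'fico_range_high' is null;
# nulls in other columns (here 'loan_amnt') are left alone.
cleaned = toy.dropna(subset=["fico_range_high"])

print("missing:", n_missing)
print("rows before:", len(toy), "rows after:", len(cleaned))
```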


#### 2.3.3. Downsample Dataset

Since I do not need such a big dataset and Google Colab complains about running out of memory, I downsampled the dataset to 50,000 obs.

In [11]:
accepted_loans = accepted_loans.sample(n=50000)

In [12]:
accepted_loans.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50000 entries, 1707701 to 1641780
Columns: 151 entries, id to settlement_term
dtypes: float64(113), object(38)
memory usage: 58.0+ MB
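
Note that `sample(n=50000)` draws a different random subset on every run, so the numbers further down will not reproduce exactly. One way to make the draw repeatable is to pass `random_state`. The seed value 42 and the toy data below are my own choices for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the real dataset.
population = pd.DataFrame({"fico_range_high": range(600, 800)})

# The same random_state yields the same rows on every run.
sample_a = population.sample(n=50, random_state=42)
sample_b = population.sample(n=50, random_state=42)

print(sample_a.equals(sample_b))
```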


#### 2.3.4. Split Dataset

I split the dataset into two datasets of unequal size (30,000 and 20,000 obs) to illustrate comparing two datasets with similar values.

In [13]:
accepted_loans_1, accepted_loans_2 = train_test_split(accepted_loans, test_size=0.4)

In [14]:
accepted_loans_1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30000 entries, 1892604 to 320080
Columns: 151 entries, id to settlement_term
dtypes: float64(113), object(38)
memory usage: 34.8+ MB


In [15]:
accepted_loans_2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20000 entries, 615383 to 713763
Columns: 151 entries, id to settlement_term
dtypes: float64(113), object(38)
memory usage: 23.2+ MB
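
The notebook uses `train_test_split` with `test_size=0.4` for the 60/40 split. For readers who prefer not to import sklearn for this, here is a rough pandas-only sketch of an equivalent split on toy data. The variable names and the seed are assumptions of mine, not the notebook's.

```python
import pandas as pd

# Toy stand-in for the downsampled dataset.
data = pd.DataFrame({"fico_range_high": range(100)})

# Draw 60% of the rows at random, then take the remaining rows as the second part.
part_1 = data.sample(frac=0.6, random_state=0)
part_2 = data.drop(part_1.index)

print(len(part_1), len(part_2))
```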


### 2.4. Perform t-Test

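Perform a t-test to determine whether the two datasets differ from each other (i.e., an independent two-sample t-test). Setting equal_variances=False requests Welch's t-test, which does not assume that the two groups have equal variances.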

In [16]:
summary, results = rp.ttest(group1 = accepted_loans_1['fico_range_high'], group1_name = "1",
                            group2 = accepted_loans_2['fico_range_high'], group2_name = "2",
                            paired=False,
                            equal_variances=False
                            )

In [17]:
summary

Unnamed: 0,Variable,N,Mean,SD,SE,95% Conf.,Interval
0,1,30000.0,702.0404,33.120944,0.191224,701.665593,702.415207
1,2,20000.0,702.11395,32.873109,0.232448,701.658333,702.569567
2,combined,50000.0,702.06982,33.021724,0.147678,701.78037,702.35927


In [18]:
results

Unnamed: 0,Satterthwaite t-test,results
0,Difference (1 - 2) =,-0.0736
1,Degrees of freedom =,43075.3424
2,t =,-0.2444
3,Two side test p value =,0.807
4,Difference < 0 p value =,0.4035
5,Difference > 0 p value =,0.5965
6,Cohen's d =,-0.0022
7,Hedge's g =,-0.0022
8,Glass's delta1 =,-0.0022
9,Point-Biserial r =,-0.0012
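
The t statistic and the degrees of freedom in the results table can be reproduced by hand from the N, Mean and SD columns of the summary table. The sketch below uses the standard Welch statistic and the Welch-Satterthwaite approximation for the degrees of freedom. The numbers are copied from the rounded summary output, so the last digits may differ slightly.

```python
import math

# N, Mean and SD copied from the summary table above.
n1, mean1, sd1 = 30000, 702.0404, 33.120944
n2, mean2, sd2 = 20000, 702.11395, 32.873109

# Squared standard error of each group mean.
v1 = sd1 ** 2 / n1
v2 = sd2 ** 2 / n2

# Welch's t statistic: difference of means over its standard error.
t_stat = (mean1 - mean2) / math.sqrt(v1 + v2)

# Welch-Satterthwaite approximation of the degrees of freedom.
df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

print("t  =", round(t_stat, 4))
print("df =", round(df, 1))
```

The printed values should land close to the 't =' and 'Degrees of freedom =' rows of the results table.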


We are considering whether the two samples were drawn from the same population or two different populations.

The p-value is the probability of observing a test statistic at least as extreme as the one obtained, assuming the null hypothesis (that the samples are drawn from populations with the same mean) is true.

Since the 'Two side test p value' (0.807) is greater than 0.05, the difference is not statistically significant: a difference this large could easily have occurred by chance. Therefore, we do not reject the null hypothesis of equal population means. This makes sense since the two samples were split from one dataset.
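
To make the link between the t statistic and the three p-values in the results table concrete, here is a small sketch that recomputes them from Student's t distribution in scipy.stats. The t and degrees-of-freedom values are copied from the rounded results table, so expect agreement to about three decimals only.

```python
from scipy import stats

# Values copied from the results table above.
t_stat = -0.2444
df = 43075.3424

# Two-sided: probability of a statistic at least this far from 0 in either direction.
p_two_sided = 2 * stats.t.sf(abs(t_stat), df)

# One-sided alternatives: difference < 0 (lower tail) and difference > 0 (upper tail).
p_less = stats.t.cdf(t_stat, df)
p_greater = stats.t.sf(t_stat, df)

print(round(p_two_sided, 4), round(p_less, 4), round(p_greater, 4))
```

The two one-sided p-values always sum to 1, and the two-sided p-value is twice the smaller of the two.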

In [19]:
stats.ttest_ind(accepted_loans_1['fico_range_high'],
                accepted_loans_2['fico_range_high'],
                equal_var=False)

Ttest_indResult(statistic=-0.24435535636503777, pvalue=0.8069567781712216)

stats.ttest_ind leads to the same conclusion: the p-value is greater than 0.05.

## 3. Code (Two samples from different populations)

### 3.1. Exploratory Data Analysis

#### 3.1.1. Import and validate dataset

In [20]:
rejected_loans = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/stats/data/rejected_2007_to_2018Q4.csv', low_memory=False)

In [21]:
rejected_loans.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27648741 entries, 0 to 27648740
Data columns (total 9 columns):
 #   Column                Dtype  
---  ------                -----  
 0   Amount Requested      float64
 1   Application Date      object 
 2   Loan Title            object 
 3   Risk_Score            float64
 4   Debt-To-Income Ratio  object 
 5   Zip Code              object 
 6   State                 object 
 7   Employment Length     object 
 8   Policy Code           float64
dtypes: float64(3), object(6)
memory usage: 1.9+ GB


#### 3.1.2. Cleanse Dataset

Drop all obs where 'Risk_Score' variable is null.

In [22]:
rejected_loans = rejected_loans.dropna(subset=['Risk_Score'])

In [23]:
rejected_loans.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9151111 entries, 0 to 27648740
Data columns (total 9 columns):
 #   Column                Dtype  
---  ------                -----  
 0   Amount Requested      float64
 1   Application Date      object 
 2   Loan Title            object 
 3   Risk_Score            float64
 4   Debt-To-Income Ratio  object 
 5   Zip Code              object 
 6   State                 object 
 7   Employment Length     object 
 8   Policy Code           float64
dtypes: float64(3), object(6)
memory usage: 698.2+ MB


#### 3.1.3. Downsample Dataset

Since I do not need such a big dataset and Google Colab complains about running out of memory, I downsampled the dataset to 50,000 obs.

In [24]:
rejected_loans = rejected_loans.sample(n=50000)

In [25]:
rejected_loans.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50000 entries, 27643233 to 20691279
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Amount Requested      50000 non-null  float64
 1   Application Date      50000 non-null  object 
 2   Loan Title            50000 non-null  object 
 3   Risk_Score            50000 non-null  float64
 4   Debt-To-Income Ratio  50000 non-null  object 
 5   Zip Code              50000 non-null  object 
 6   State                 50000 non-null  object 
 7   Employment Length     49143 non-null  object 
 8   Policy Code           49995 non-null  float64
dtypes: float64(3), object(6)
memory usage: 3.8+ MB


### 3.2. Perform t-Test

In [26]:
summary, results = rp.ttest(group1 = accepted_loans_1['fico_range_high'], group1_name = "Approved",
                            group2 = rejected_loans['Risk_Score'], group2_name = "Rejected",
                            equal_variances=False
                            )

In [27]:
summary

Unnamed: 0,Variable,N,Mean,SD,SE,95% Conf.,Interval
0,Approved,30000.0,702.0404,33.120944,0.191224,701.665593,702.415207
1,Rejected,50000.0,628.31182,89.139896,0.398646,627.53047,629.09317
2,combined,80000.0,655.960037,81.557247,0.288348,655.394876,656.525199


In [28]:
results

Unnamed: 0,Satterthwaite t-test,results
0,Difference (Approved - Rejected) =,73.7286
1,Degrees of freedom =,69520.7172
2,t =,166.7551
3,Two side test p value =,0.0
4,Difference < 0 p value =,1.0
5,Difference > 0 p value =,0.0
6,Cohen's d =,1.0054
7,Hedge's g =,1.0054
8,Glass's delta1 =,2.226
9,Point-Biserial r =,0.5345


We are considering whether the two samples were drawn from the same population or two different populations.

The p-value is the probability of observing a test statistic at least as extreme as the one obtained, assuming the null hypothesis (that the samples are drawn from populations with the same mean) is true.

Since the 'Two side test p value' is less than 0.05, the difference is statistically significant: a difference this large is very unlikely to have occurred by chance. (The reported value of 0.0 means the p-value is too small to be represented as a floating-point number, not that it is exactly zero.) Therefore, we reject the null hypothesis of equal population means. This makes sense since these two samples are from two different datasets.
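
With samples this large almost any difference becomes statistically significant, so it is worth reading the effect-size rows of the results table as well. The sketch below recomputes Cohen's d and Glass's delta1 from the summary table using the usual textbook definitions (pooled SD for Cohen's d, SD of the first group for Glass's delta1). I have not checked researchpy's source, so treat the match with its output as an observation rather than a guarantee.

```python
import math

# N, Mean and SD copied from the summary table above.
n1, mean1, sd1 = 30000, 702.0404, 33.120944   # Approved
n2, mean2, sd2 = 50000, 628.31182, 89.139896  # Rejected

diff = mean1 - mean2

# Pooled standard deviation, weighting each variance by its degrees of freedom.
pooled_sd = math.sqrt(
    ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
)

# Cohen's d: mean difference in units of the pooled SD.
cohens_d = diff / pooled_sd

# Glass's delta1: mean difference in units of the first group's SD.
glass_delta1 = diff / sd1

print(round(cohens_d, 4), round(glass_delta1, 4))
```

The two measures differ a lot here because the rejected group's SD is much larger than the approved group's, which is also the reason for using Welch's test rather than the equal-variance t-test.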

In [29]:
stats.ttest_ind(accepted_loans['fico_range_high'],
                rejected_loans['Risk_Score'],
                equal_var=False)

Ttest_indResult(statistic=173.49920329765496, pvalue=0.0)

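stats.ttest_ind leads to the same conclusion: the p-value is less than 0.05. Note that this call uses the full 50,000-row accepted_loans sample rather than accepted_loans_1, so its t statistic differs from the researchpy result above.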