# Statistical Analysis

In this notebook, we will perform a comprehensive statistical analysis of our dataset. The objective is to understand the relationships between the different features and their impact on income, the target variable. By doing this, we can gain deeper insights that will aid in the creation and tuning of our machine learning models in the following steps.

Let's begin by importing the necessary libraries and loading our dataset.

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as stats
from scipy.stats import chi2_contingency
import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('adult_data_preprocessed.csv')

## Hypothesis Testing for Numerical Variables

Before we can perform certain statistical tests, we need to check the normality of our variables. A common assumption in many statistical techniques is that the data follows a normal distribution. We can test this assumption using the Shapiro-Wilk test.

The null hypothesis for the Shapiro-Wilk test is that the data is drawn from a normal distribution. If the p-value is less than our chosen significance level (commonly 0.05), we reject the null hypothesis and conclude that the data does not come from a normal distribution. If the p-value is greater than our significance level, we fail to reject the null hypothesis; this does not prove that the data is normal, only that the test found no significant evidence against normality.

In [3]:
# Define numerical variables
numerical_vars = ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']

# Perform Shapiro-Wilk test
for var in numerical_vars:
    stat, p = stats.shapiro(df[var])
    print(f'Shapiro-Wilk Test for {var}:\nStatistic={stat}, p-value={p}\n')

Shapiro-Wilk Test for age:
Statistic=0.9668066501617432, p-value=0.0

Shapiro-Wilk Test for fnlwgt:
Statistic=0.9223408699035645, p-value=0.0

Shapiro-Wilk Test for education-num:
Statistic=0.9264371395111084, p-value=0.0

Shapiro-Wilk Test for capital-gain:
Statistic=0.12271404266357422, p-value=0.0

Shapiro-Wilk Test for capital-loss:
Statistic=0.21831119060516357, p-value=0.0

Shapiro-Wilk Test for hours-per-week:
Statistic=0.8851711750030518, p-value=0.0



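The p-values for all our numerical variables are reported as 0.0 (effectively zero), far below the commonly used significance level of 0.05. Therefore, we reject the null hypothesis for each variable, concluding that the data for these variables do not come from a normal distribution. With large samples the test flags even small departures from normality, so the W statistic is the more informative number here: capital-gain (W ≈ 0.12) and capital-loss (W ≈ 0.22) are very far from normal, while age (W ≈ 0.97) is much closer.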

This is important information to consider as we move forward with our statistical analysis. Many statistical tests assume normality, so we may need to use non-parametric tests or apply transformations to our data if we want to use tests that assume normality.
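
As one illustration of the non-parametric route mentioned above, the Mann-Whitney U test compares two groups through their ranks rather than their means. The sketch below uses synthetic, right-skewed data; the names `group_low` and `group_high` are hypothetical and do not come from the notebook's dataset.

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed stand-ins for one feature split by income group
# (hypothetical data, not the notebook's dataset).
rng = np.random.default_rng(0)
group_low = rng.exponential(scale=1.0, size=500)
group_high = rng.exponential(scale=1.5, size=500)

# Mann-Whitney U makes no normality assumption: it works on the ranks
# of the pooled observations instead of on the raw values.
u_stat, p_val = stats.mannwhitneyu(group_low, group_high, alternative='two-sided')

print(f'Mann-Whitney U Test:\nU-Statistic={u_stat}, p-value={p_val}\n')
```

In the notebook, the same call could be applied to the two income groups for each entry of `numerical_vars`, in the same loop shape as the t-tests.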

## T-tests for Numerical Variables

In [4]:
# Perform t-tests
for var in numerical_vars:
    less_than_50k = df[df['income'] == '<=50K'][var]
    more_than_50k = df[df['income'] == '>50K'][var]

    t_stat, p_val = stats.ttest_ind(less_than_50k, more_than_50k)

    print(f'T-Test for {var}:\nT-Statistic={t_stat}, p-value={p_val}\n')

T-Test for age:
T-Statistic=-43.43624424045112, p-value=0.0

T-Test for fnlwgt:
T-Statistic=1.7075109328052842, p-value=0.08773666108063974

T-Test for education-num:
T-Statistic=-64.18797223551665, p-value=0.0

T-Test for capital-gain:
T-Statistic=-41.341868169493665, p-value=0.0

T-Test for capital-loss:
T-Statistic=-27.47417790492585, p-value=2.68654718905867e-164

T-Test for hours-per-week:
T-Statistic=-42.58387349943796, p-value=0.0



The two-sample t-test is a statistical hypothesis test whose null hypothesis is that the means of two populations are equal. As called here with its default settings, `stats.ttest_ind` also assumes that the two groups have equal variances.

For each numerical variable, we performed a t-test comparing the two 'income' classes. The results include a t-statistic and a p-value. A negative t-statistic means the '<=50K' group has the lower mean.

A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, i.e., the means of the two populations are significantly different, which suggests the variable has a relationship with the target variable. This is the case for every variable except fnlwgt.

On the other hand, a large p-value (> 0.05) means we fail to reject the null hypothesis: the data provide no significant evidence that the group means differ. Only fnlwgt (p ≈ 0.088) falls into this category, so its mean is unlikely to help separate the income groups.
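
Because `stats.ttest_ind` pools the two variances by default, it can help to compare it with Welch's t-test (`equal_var=False`), which drops the equal-variance assumption. The sketch below uses synthetic groups with deliberately unequal sizes and spreads; the arrays `a` and `b` are illustrative assumptions, not the notebook's data.

```python
import numpy as np
from scipy import stats

# Two synthetic groups with unequal sizes and unequal variances
# (hypothetical data chosen to make the difference visible).
rng = np.random.default_rng(1)
a = rng.normal(loc=0.0, scale=1.0, size=2000)
b = rng.normal(loc=0.2, scale=3.0, size=300)

# Student's t-test: pools the variances (the default, equal_var=True).
t_student, p_student = stats.ttest_ind(a, b)

# Welch's t-test: uses each group's own variance.
t_welch, p_welch = stats.ttest_ind(a, b, equal_var=False)

# Welch's statistic written out by hand for comparison.
t_manual = (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)

print(f'Student: T-Statistic={t_student}, p-value={p_student}')
print(f'Welch:   T-Statistic={t_welch}, p-value={p_welch}')
```

When the smaller group has the larger variance, as here, the pooled test understates the standard error, so Welch's version is the more cautious choice.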

## Chi-Square Tests for Categorical Variables


In [5]:
# Get list of all categorical columns (one-hot encoded)
categorical_cols = [col for col in df.columns if col not in numerical_vars + ['income']]

# Perform Chi-Square tests
for var in categorical_cols:
    contingency_table = pd.crosstab(df[var], df['income'])
    chi2, p_val, dof, expected = chi2_contingency(contingency_table)

    print(f'Chi-Square Test for {var}:\nChi2={chi2}, p-value={p_val}\n')

Chi-Square Test for workclass_Federal-gov:
Chi2=113.95824874066832, p-value=1.3308131398791266e-26

Chi-Square Test for workclass_Local-gov:
Chi2=35.338974038055184, p-value=2.770324658615066e-09

Chi-Square Test for workclass_Never-worked:
Chi2=1.0987433937485032, p-value=0.2945420423054194

Chi-Square Test for workclass_Private:
Chi2=512.7610556998344, p-value=1.5903442762392623e-113

Chi-Square Test for workclass_Self-emp-inc:
Chi2=631.5498295777486, p-value=2.3001082226580175e-139

Chi-Square Test for workclass_Self-emp-not-inc:
Chi2=29.080977641768094, p-value=6.941522115796686e-08

Chi-Square Test for workclass_State-gov:
Chi2=6.997598758358855, p-value=0.008161912774929033

Chi-Square Test for workclass_Without-pay:
Chi2=3.222564739783703, p-value=0.0726297502727671

Chi-Square Test for education_10th:
Chi2=158.74111273523948, p-value=2.131604526551785e-36

Chi-Square Test for education_11th:
Chi2=238.9841334949431, p-value=6.54955335358297e-54

Chi-Square Test for education_12t... (output truncated)

The Chi-Square test of independence is a statistical hypothesis test whose null hypothesis is that two categorical variables are independent. It compares the observed counts in the contingency table with the counts that would be expected if the variables were unrelated. The results include a Chi2 value and a p-value.

For each categorical variable (after one-hot encoding), we performed a Chi-Square test against the target variable 'income'. The Chi2 value indicates how much the observed frequencies deviate from the expected frequencies, while the p-value tells us about the statistical significance of our test.

A small p-value (typically ≤ 0.05) indicates strong evidence that the observed frequencies differ from those expected under independence, which means the variable likely has a significant relationship with the target variable.

On the other hand, a large p-value (> 0.05) indicates weak evidence against the null hypothesis, so we fail to reject it. Among the results shown, this applies to workclass_Never-worked (p ≈ 0.29) and workclass_Without-pay (p ≈ 0.073), which show no significant relationship with the target variable.
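
To make the "expected frequencies" concrete, the sketch below runs `chi2_contingency` on a small table of made-up counts and rebuilds the expected table by hand as row total × column total / grand total. The counts and the `indicator` labels are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table: a one-hot indicator (rows) vs income (columns).
table = pd.DataFrame(
    [[900, 100],
     [300, 200]],
    index=pd.Index([0, 1], name='indicator'),
    columns=pd.Index(['<=50K', '>50K'], name='income'),
)

chi2, p_val, dof, expected = chi2_contingency(table)

# Expected counts under independence: (row total * column total) / grand total.
counts = table.to_numpy()
manual_expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / counts.sum()

print(f'Chi2={chi2}, p-value={p_val}, dof={dof}')
print(manual_expected)
```

For 2x2 tables such as those produced by one-hot columns, `chi2_contingency` applies Yates' continuity correction by default; passing `correction=False` gives the uncorrected statistic.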

## ANOVA for Numerical Variables

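Analysis of Variance (ANOVA) is a statistical technique used to test for differences in means between two or more groups. In our case, we would like to identify whether the numerical variables differ significantly between the two income levels. With only two groups, one-way ANOVA is equivalent to the equal-variance t-test used earlier (the F-statistic is the square of the t-statistic and the p-values are identical), so this section serves as a consistency check rather than as independent evidence.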

The null hypothesis for ANOVA is that there is no difference in means among the groups. If the p-value is less than our chosen significance level (commonly 0.05), we reject the null hypothesis and conclude that there is a statistically significant difference in means among the groups.

In [6]:
# Perform ANOVA tests
for var in numerical_vars:
    group1 = df[df['income'] == '<=50K'][var]
    group2 = df[df['income'] == '>50K'][var]
    f_stat, p_val = stats.f_oneway(group1, group2)

    print(f'ANOVA Test for {var}:\nF-Statistic={f_stat}, p-value={p_val}\n')

ANOVA Test for age:
F-Statistic=1886.7073137161226, p-value=0.0

ANOVA Test for fnlwgt:
F-Statistic=2.915593585649569, p-value=0.08773666108287717

ANOVA Test for education-num:
F-Statistic=4120.095779707457, p-value=0.0

ANOVA Test for capital-gain:
F-Statistic=1709.1500637437944, p-value=0.0

ANOVA Test for capital-loss:
F-Statistic=754.8304515515161, p-value=2.6865471891129668e-164

ANOVA Test for hours-per-week:
F-Statistic=1813.3862822161345, p-value=0.0



From the ANOVA tests, we observe that the p-values for all numerical variables except fnlwgt are less than our chosen significance level of 0.05. Hence, we reject the null hypothesis that there is no difference in means among the income groups for these variables.

This means that there are statistically significant differences in the means of these numerical variables among income groups, which suggests they may be good predictors of income.

The exception is fnlwgt (p ≈ 0.088). We fail to reject the null hypothesis for this variable, implying that it may not be as effective a predictor of income.
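
With exactly two groups, one-way ANOVA and the equal-variance t-test are the same test, which is why the p-values in this section repeat those from the t-tests. A minimal check on synthetic data (the groups `g1` and `g2` are hypothetical):

```python
import numpy as np
from scipy import stats

# Two synthetic groups standing in for a feature split by income class.
rng = np.random.default_rng(2)
g1 = rng.normal(loc=40.0, scale=12.0, size=800)
g2 = rng.normal(loc=44.0, scale=12.0, size=300)

t_stat, p_t = stats.ttest_ind(g1, g2)  # pooled-variance t-test
f_stat, p_f = stats.f_oneway(g1, g2)   # one-way ANOVA with two groups

# For two groups, F equals t squared and the p-values coincide.
print(f'T-Statistic squared={t_stat ** 2}, F-Statistic={f_stat}')
print(f'p-value (t-test)={p_t}, p-value (ANOVA)={p_f}')
```

The same relationship can be seen in the notebook's own output: for age, (-43.436)² ≈ 1886.7.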

## Effect Size Analysis

Effect size is a quantitative measure of the magnitude of an effect; unlike a p-value, it does not grow simply because the sample is large. The larger the effect size, the stronger the relationship between two variables. We will calculate Cohen's d, which is an appropriate effect size measure for t-tests.

## Point Biserial and Spearman Rank Correlation

Afterwards, we will calculate Point Biserial Correlation for the relationship between our numerical variables and binary target variable. The Point-Biserial Correlation Coefficient is a correlation measure of the strength of association between a continuous-level variable (ratio or interval data) and a binary variable.

For the categorical variables, we will calculate the Spearman Rank Correlation. The Spearman rank-order correlation coefficient is a nonparametric measure of the monotonicity of the relationship between two datasets. Unlike the Pearson correlation, the Spearman correlation does not assume that both datasets are normally distributed.

In [7]:
# Effect size (Cohen's d) calculation
for var in numerical_vars:
    group1 = df[df['income'] == '<=50K'][var]
    group2 = df[df['income'] == '>50K'][var]
    diff = group1.mean() - group2.mean()
    n1, n2 = len(group1), len(group2)
    var1 = group1.var()
    var2 = group2.var()
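    # Size-weighted pooled variance; the textbook form weights by (n - 1)
    # and divides by (n1 + n2 - 2), which is nearly identical for large groups
    pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)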
    d = diff / np.sqrt(pooled_var)

    print(f'Effect Size for {var} (Cohen\'s d) = {d}\n')

# Encode 'income' as a binary column (1 for '>50K', 0 otherwise)
df['income_binary'] = (df['income'] == '>50K').astype(int)

# Point Biserial Correlation
for var in numerical_vars:
    pbc = stats.pointbiserialr(df[var], df['income_binary'])
    print(f'Point Biserial Correlation for {var} = {pbc}\n')

# Spearman Rank Correlation (against the numeric target, not the raw string labels)
for var in categorical_cols:
    src = stats.spearmanr(df[var], df['income_binary'])
    print(f'Spearman Rank Correlation for {var} = {src}\n')

Effect Size for age (Cohen's d) = -0.5629798646917087

Effect Size for fnlwgt (Cohen's d) = 0.022131082425593883

Effect Size for education-num (Cohen's d) = -0.8319413447327589

Effect Size for capital-gain (Cohen's d) = -0.5358150457262507

Effect Size for capital-loss (Cohen's d) = -0.3560885359154622

Effect Size for hours-per-week (Cohen's d) = -0.5519310152474285

Point Biserial Correlation for age = PointbiserialrResult(correlation=0.23403710264886493, pvalue=0.0)

Point Biserial Correlation for fnlwgt = PointbiserialrResult(correlation=-0.009462557247529655, pvalue=0.08773666108238265)

Point Biserial Correlation for education-num = PointbiserialrResult(correlation=0.3351539526909648, pvalue=0.0)

Point Biserial Correlation for capital-gain = PointbiserialrResult(correlation=0.22332881819540146, pvalue=0.0)

Point Biserial Correlation for capital-loss = PointbiserialrResult(correlation=0.15052631177034784, pvalue=2.6865471891972592e-164)

Point Biserial Correlation for hours-pe... (output truncated)

The effect size (Cohen's d) is a measure of the magnitude of the difference between two groups. Here, we use it to quantify the difference in means of our numerical variables between the two income groups. By convention, |d| ≈ 0.2 is considered a 'small' effect, 0.5 a 'medium' effect, and 0.8 a 'large' effect. The negative signs arise because the '<=50K' mean is subtracted first and is the lower of the two. The effect is large for education-num (|d| ≈ 0.83), medium for age, hours-per-week and capital-gain (|d| ≈ 0.54 to 0.56), small for capital-loss (|d| ≈ 0.36), and negligible for fnlwgt (|d| ≈ 0.02).
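
The conventional thresholds can be captured in a small helper. The sketch below defines a hypothetical `cohens_d` using the textbook pooled standard deviation and a hypothetical `label_effect` that maps |d| to the usual labels; neither function is part of the notebook, and the cut-offs are conventions rather than strict rules.

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d for two independent samples, using the textbook pooled SD."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n1, n2 = x.size, y.size
    # Pooled variance with (n - 1) weights and (n1 + n2 - 2) in the denominator.
    pooled_var = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / (n1 + n2 - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

def label_effect(d):
    """Map |d| to the conventional labels (cut-offs at 0.2, 0.5, 0.8)."""
    size = abs(d)
    if size < 0.2:
        return 'negligible'
    if size < 0.5:
        return 'small'
    if size < 0.8:
        return 'medium'
    return 'large'

# Rounded values taken from the output above.
reported = {'fnlwgt': 0.0221, 'capital-loss': -0.3561, 'age': -0.5630, 'education-num': -0.8319}
for name, d in reported.items():
    print(f'{name}: d={d} ({label_effect(d)})')
```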

The Point Biserial Correlation measures the strength and direction of the association that exists between one continuous variable and one dichotomous variable. In our case, the continuous variables are the numerical variables, and the dichotomous variable is the income group. Every numerical variable shown except fnlwgt (r ≈ -0.009, p ≈ 0.088) is significantly correlated with income, although the correlations are weak to moderate (r between about 0.15 and 0.34). The p-values match those of the t-tests because, for a binary grouping variable, the two tests are mathematically equivalent.
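
The point-biserial coefficient is simply Pearson's r computed with the binary variable coded as 0/1, and its p-value coincides with that of the pooled-variance t-test. A short check on synthetic data (the arrays `x` and `y` are hypothetical):

```python
import numpy as np
from scipy import stats

# Synthetic binary target and a continuous feature that shifts with it.
rng = np.random.default_rng(3)
y = rng.integers(0, 2, size=1000)      # 0/1 group label
x = rng.normal(loc=0.5 * y, scale=1.0)  # feature whose mean depends on the group

r_pb, p_pb = stats.pointbiserialr(y, x)
r_pearson, p_pearson = stats.pearsonr(y, x)
t_stat, p_t = stats.ttest_ind(x[y == 1], x[y == 0])

# r and t carry the same information: r^2 = t^2 / (t^2 + n - 2).
r_squared_from_t = t_stat ** 2 / (t_stat ** 2 + x.size - 2)

print(f'Point Biserial r={r_pb}, Pearson r={r_pearson}')
print(f'r squared={r_pb ** 2}, from t-statistic={r_squared_from_t}')
```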

Spearman's Rank Correlation measures the strength and direction of the monotonic association between two variables. Here it is applied to the one-hot encoded categorical columns. For a 0/1 column and a binary target it reduces to the phi coefficient, so its significance should broadly agree with the Chi-Square tests above, where most, but not all, of the indicator columns shown were significantly associated with income (workclass_Never-worked and workclass_Without-pay were not). The significant columns might be useful for our prediction model.
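
For two binary variables, Spearman's rho equals Pearson's r (the phi coefficient), and the uncorrected Chi-Square statistic equals n · phi². The sketch below verifies both identities on synthetic 0/1 data; `indicator` and `target` are hypothetical stand-ins for a one-hot column and the binary income label.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic one-hot style indicator and a binary target that depends on it.
rng = np.random.default_rng(4)
indicator = rng.integers(0, 2, size=500)
target = (rng.random(500) < 0.3 + 0.3 * indicator).astype(int)

rho, p_rho = stats.spearmanr(indicator, target)
r, p_r = stats.pearsonr(indicator, target)

# Chi-Square without Yates' correction, so the identity chi2 = n * phi^2 holds exactly.
chi2, p_chi2, dof, expected = stats.chi2_contingency(
    pd.crosstab(indicator, target), correction=False
)

print(f'Spearman rho={rho}, Pearson r (phi)={r}')
print(f'Chi2={chi2}, n * rho^2={indicator.size * rho ** 2}')
```

With the default `correction=True`, as used in the notebook's Chi-Square cell, the statistic is slightly smaller, so the two approaches agree closely but not exactly.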

However, as with the other tests, it's important to note that a statistically significant correlation doesn't guarantee a practically significant or useful feature. Further feature selection and engineering will be necessary to build a good prediction model.

## General Summary
In this notebook, we conducted a statistical analysis of the variables in our dataset to understand which ones are associated with income. We first checked the numerical variables for normality with the Shapiro-Wilk test, then compared their means across the two income groups using t-tests and ANOVA. For the one-hot encoded categorical variables, we ran chi-square tests of independence. Finally, we quantified the strength of these relationships with Cohen's d, point-biserial correlation, and Spearman rank correlation. Among the numerical variables, all except fnlwgt differ significantly between income groups, with education-num showing the largest effect. The findings from this notebook will be instrumental in the upcoming feature selection and model building stages.

## Business Summary
Our statistical analysis confirmed several intuitive assumptions about factors associated with income. Education level, age, and hours worked per week all differ significantly between the '<=50K' and '>50K' income groups, with education showing the strongest effect. This information will help us create a more accurate prediction model in the next stages. Additionally, it gives us a better understanding of the demographic and socioeconomic factors linked to income, which could be crucial for decision-makers.