# Project Title:
Income Prediction Challenge For Azubian

# Business Understanding:

The "Income Prediction Challenge for Azubian" is a data-driven initiative that seeks to address the critical issue of income inequality in developing nations. The project focuses on utilizing machine learning techniques to predict whether an individual's income falls above or below a specific income threshold. By developing a robust predictive model, we aim to contribute to more accurate and cost-effective methods of monitoring key population indicators, such as income levels, between census years. This valuable information will empower policymakers to take more informed actions to mitigate and manage income inequality on a global scale.

## 1. Introduction:
This project aims to predict whether an individual's income falls above or below a specified threshold, using demographic and socioeconomic data. Accurate income prediction offers a cheaper and faster way to monitor income levels between census years, giving policymakers timely information for addressing income inequality.

### 1.1. Objectives:

- **Income Prediction Model:** 
The primary goal is to create a machine learning model capable of determining whether an individual's income exceeds a specified threshold.


- **Economic Inequality Mitigation:** 
By accurately predicting income levels, the project aims to support the reduction of income inequality by providing policymakers with critical insights.


- **Cost and Accuracy Improvement:** 
This solution endeavors to improve the efficiency of income-level monitoring by offering a more cost-effective and precise method compared to traditional census methods.

### 1.2. Methodology:

To achieve the project objectives, we will follow the Cross-Industry Standard Process for Data Mining (CRISP-DM) framework, which guides the machine learning lifecycle through six phases:

**i. Business Understanding:**
- Gain a deep understanding of the problem, its significance, and the potential impact of addressing income inequality.
- Define the objectives and the F1 score evaluation metric for model performance.

**ii. Data Understanding:**
- Load the provided training, testing and variable definitions datasets.
- Explore the provided datasets, including variable descriptions.
- Analyze the Train.csv dataset, which includes target income labels, to understand the data's structure and relationships.

**iii. Data Preparation:**
- Preprocess the data by handling missing values and data anomalies.
- Perform feature engineering to prepare the data for model training.

**iv. Modeling:**
- Select and implement machine learning algorithms suitable for classification tasks.
- Train the predictive models on the Train.csv dataset using features to predict income labels.

**v. Evaluation:**
- Assess the models' performance using the F1 score, the harmonic mean of precision and recall, and select the best-performing model.
- Perform cross validation and hyperparameter tuning.

**vi. Deployment:**
- Deploy the trained model for prediction on the Test.csv dataset, which does not include target-related columns.
- Serve the model through one of these frameworks: Streamlit, Gradio, or FastAPI. The application will be containerized with Docker and deployed on Hugging Face so that external users can interact with it.

By following the CRISP-DM framework, we aim to create a robust income prediction model that can effectively support efforts to address income inequality, provide policymakers with valuable insights, and improve the accuracy of population monitoring between census years. This project has the potential to make a significant impact on global economic equality.
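
The F1 score named in the Evaluation phase is the harmonic mean of precision and recall. The sketch below computes it from raw labels with NumPy. The function name `f1_from_labels`, the toy label lists, and the choice of "Above limit" as the positive class are assumptions made for illustration, not part of the project code.

```python
import numpy as np

def f1_from_labels(y_true, y_pred, positive="Above limit"):
    """Return (precision, recall, f1) for the chosen positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    # Count true positives, false positives, and false negatives
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))

    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return float(precision), float(recall), float(f1)

# Hypothetical labels, for illustration only
y_true = ["Above limit", "Below limit", "Below limit",
          "Above limit", "Below limit", "Above limit"]
y_pred = ["Above limit", "Below limit", "Above limit",
          "Below limit", "Below limit", "Above limit"]

precision, recall, f1 = f1_from_labels(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because F1 ignores true negatives, it is less flattered by a dominant "Below limit" class than plain accuracy would be.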

# Data Understanding

## Setup

### Installations

### Importation of Relevant Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Data Loading

### Loading the Variable Definitions, Train and Test Datasets

#### Variable Definitions Dataset

In [5]:
# Reading Variable Definitions dataset
def_df = pd.read_csv('VariableDefinitions.csv')
def_df

| | Column | Description |
|---|---|---|
| 0 | age | Age Of Individual |
| 1 | gender | Gender |
| 2 | education | Education |
| 3 | class | Class Of Worker |
| 4 | education_institute | Enrolled Educational Institution in last week |
| 5 | marital_status | Marital_Status |
| 6 | race | Race |
| 7 | is_hispanic | Hispanic Origin |
| 8 | employment_commitment | Full Or Part Time Employment Stat |
| 9 | unemployment_reason | Reason For Unemployment |


These variable definitions provide a clear understanding of the features and target variable used in the dataset, which is essential for data analysis and modeling.


#### Train Dataset

In [2]:
# Reading train dataset
train_df = pd.read_csv('Train.csv')
train_df.head()

Unnamed: 0,ID,age,gender,education,class,education_institute,marital_status,race,is_hispanic,employment_commitment,...,country_of_birth_mother,migration_code_change_in_msa,migration_prev_sunbelt,migration_code_move_within_reg,migration_code_change_in_reg,residence_1_year_ago,old_residence_reg,old_residence_state,importance_of_record,income_above_limit
0,ID_TZ0000,79,Female,High school graduate,,,Widowed,White,All other,Not in labor force,...,US,?,?,?,?,,,,1779.74,Below limit
1,ID_TZ0001,65,Female,High school graduate,,,Widowed,White,All other,Children or Armed Forces,...,US,unchanged,,unchanged,unchanged,Same,,,2366.75,Below limit
2,ID_TZ0002,21,Male,12th grade no diploma,Federal government,,Never married,Black,All other,Children or Armed Forces,...,US,unchanged,,unchanged,unchanged,Same,,,1693.42,Below limit
3,ID_TZ0003,2,Female,Children,,,Never married,Asian or Pacific Islander,All other,Children or Armed Forces,...,India,unchanged,,unchanged,unchanged,Same,,,1380.27,Below limit
4,ID_TZ0004,70,Male,High school graduate,,,Married-civilian spouse present,White,All other,Not in labor force,...,US,?,?,?,?,,,,1580.79,Below limit


#### Test Dataset

In [4]:
# Reading test dataset
test_df = pd.read_csv('Test.csv')
test_df.head()

Unnamed: 0,ID,age,gender,education,class,education_institute,marital_status,race,is_hispanic,employment_commitment,...,country_of_birth_father,country_of_birth_mother,migration_code_change_in_msa,migration_prev_sunbelt,migration_code_move_within_reg,migration_code_change_in_reg,residence_1_year_ago,old_residence_reg,old_residence_state,importance_of_record
0,ID_TZ209499,54,Male,High school graduate,Private,,Married-civilian spouse present,White,All other,Children or Armed Forces,...,US,US,unchanged,,unchanged,unchanged,Same,,,3388.96
1,ID_TZ209500,53,Male,5th or 6th grade,Private,,Married-civilian spouse present,White,Central or South American,Full-time schedules,...,El-Salvador,El-Salvador,?,?,?,?,,,,1177.55
2,ID_TZ209501,42,Male,Bachelors degree(BA AB BS),Private,,Married-civilian spouse present,White,All other,Full-time schedules,...,US,US,?,?,?,?,,,,4898.55
3,ID_TZ209502,16,Female,9th grade,,High school,Never married,White,All other,Children or Armed Forces,...,US,US,unchanged,,unchanged,unchanged,Same,,,1391.44
4,ID_TZ209503,16,Male,9th grade,,High school,Never married,White,All other,Not in labor force,...,US,US,?,?,?,?,,,,1933.18


## Hypothesis

**Null Hypothesis (H0):** There is no significant association between an individual's education level and the likelihood of having an income above the specified threshold.


**Alternative Hypothesis (H1):** Individuals with higher education levels are significantly more likely to have incomes above the specified threshold.


- The null hypothesis (H0) suggests that education level and income level are not related, meaning that having a higher education level does not increase the likelihood of earning an income above the threshold.

- The alternative hypothesis (H1) posits that there is a significant association between education level and income level, indicating that higher education levels are linked to a higher likelihood of having an income above the threshold.

We will conduct statistical tests and, based on the evidence in the dataset, either reject the null hypothesis in favor of the alternative or fail to reject it.
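
One common way to test an association between two categorical variables is a chi-square test of independence on their contingency table. The sketch below is a minimal illustration under stated assumptions: the counts are made up, the `toy` frame stands in for `train_df`, and the 0.05 significance level is an assumed choice. Note that the chi-square test checks for association only; the direction claimed in H1 would still need to be read from the group rates.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical counts, for illustration only (not taken from Train.csv)
toy = pd.DataFrame({
    "education": (["High school graduate"] * 50
                  + ["Some college but no degree"] * 50
                  + ["Bachelors degree"] * 50),
    "income_above_limit": (["Below limit"] * 40 + ["Above limit"] * 10
                           + ["Below limit"] * 30 + ["Above limit"] * 20
                           + ["Below limit"] * 20 + ["Above limit"] * 30),
})

# Rows: education levels, columns: income categories
contingency = pd.crosstab(toy["education"], toy["income_above_limit"])

# chi2_contingency returns the statistic, p-value, degrees of freedom,
# and the expected counts under independence
chi2, p_value, dof, expected = chi2_contingency(contingency)

alpha = 0.05  # assumed significance level
reject_h0 = p_value < alpha
print(f"chi2={chi2:.3f}, dof={dof}, p={p_value:.5f}, reject H0: {reject_h0}")
```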

## Key Analytical Questions:

To gain insights into the dataset and validate the hypothesis, we can formulate several key analytical questions for EDA:

i. What is the distribution of income levels in the dataset (above the threshold vs. below the threshold)?

- This question will provide an initial understanding of the balance between the two income categories.

ii. How does age relate to income levels in the dataset?

- Analyzing the age distribution among individuals with different income levels may reveal patterns related to age and income.

iii. Is there a significant gender-based income disparity?

- Exploring income levels based on gender can help us understand if gender plays a role in income categorization.

iv. What is the distribution of education levels among individuals with different income levels?

- Analyzing the educational backgrounds of individuals in both income categories can help assess the hypothesis regarding education and income.

v. Are there differences in employment status between the two income groups?

- Investigating the employment status and commitment of individuals based on income categories can provide insights into the relationship between employment and income.

vi. How do race and ethnicity correlate with income levels in the dataset?

- Understanding the distribution of income levels across different racial and ethnic groups can shed light on potential disparities.

vii. Are capital gains and losses associated with higher incomes?

- Examining the presence and amounts of capital gains and losses can help determine their impact on income levels.

viii. What is the relationship between occupation and income categories?

- Analyzing the occupation categories and their distribution among income groups can provide insights into employment roles.

ix. Do migration patterns or changes in residence relate to income levels?

- Investigating migration and residence changes may reveal how geographic mobility affects income.

x. How does tax status correspond to income levels?

- Analyzing tax filing status can provide information about the impact of tax-related factors on income.

By exploring these key analytical questions, we can perform an in-depth EDA to uncover patterns, correlations, and potential factors influencing income levels. Additionally, it will help us determine whether our initial hypothesis regarding education and income holds true in this dataset.
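
Questions i and iv can be approached with a couple of pandas calls. The following is a sketch on a small hypothetical frame; the `toy` data and the variable names `class_share` and `above_rate` are illustrative assumptions, and the same calls would be applied to `train_df` in practice.

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    "education": ["High school graduate"] * 4 + ["Bachelors degree"] * 4,
    "income_above_limit": ["Below limit", "Below limit", "Below limit", "Above limit",
                           "Above limit", "Above limit", "Above limit", "Below limit"],
})

# Question i: share of each income category (class balance)
class_share = toy["income_above_limit"].value_counts(normalize=True)

# Question iv: share of individuals above the limit within each education level
is_above = toy["income_above_limit"] == "Above limit"
above_rate = is_above.groupby(toy["education"]).mean()

print(class_share)
print(above_rate)
```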

## Exploratory Data Analysis (EDA):

### Understanding the datasets

This section explores the datasets in depth to gain insights into the available variables, their distributions, and their relationships. It provides an initial understanding of the datasets and surfaces data quality issues that will inform cleaning and preprocessing.

#### i. Column Information of The Datasets

In [3]:
# Column information of the train dataset
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209499 entries, 0 to 209498
Data columns (total 43 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   ID                              209499 non-null  object 
 1   age                             209499 non-null  int64  
 2   gender                          209499 non-null  object 
 3   education                       209499 non-null  object 
 4   class                           104254 non-null  object 
 5   education_institute             13302 non-null   object 
 6   marital_status                  209499 non-null  object 
 7   race                            209499 non-null  object 
 8   is_hispanic                     209499 non-null  object 
 9   employment_commitment           209499 non-null  object 
 10  unemployment_reason             6520 non-null    object 
 11  employment_stat                 209499 non-null  int64  
 12  wage_per_hour   

- The train dataset contains 209,499 rows and 43 columns.


- The columns include various demographic and socioeconomic features, as well as the target variable "income_above_limit," which indicates whether the individual's income is above or below the income threshold ($50,000).


- Many columns contain missing values. 'class,' 'education_institute,' 'unemployment_reason,' 'is_labor_union,' 'under_18_family,' 'veterans_admin_questionnaire,' and 'migration_prev_sunbelt' are missing in a large share of rows. 'migration_code_change_in_msa,' 'migration_code_move_within_reg,' and 'migration_code_change_in_reg' have far fewer nulls (1,588 rows each) but also contain '?' placeholders. 'old_residence_reg' and 'old_residence_state' have missing values as well.
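
The '?' entries visible in the head of the train dataset are not counted by `isnull()`, and the unique-values listing suggests that string values carry a leading space. One possible way to handle both is sketched below. The helper name `normalize_missing` and the `toy` frame are assumptions for illustration, and treating '?' as missing is a modeling choice rather than something the dataset documentation confirms.

```python
import pandas as pd

# Hypothetical rows mimicking the raw formatting (leading spaces, '?' placeholders)
toy = pd.DataFrame({
    "migration_code_change_in_msa": [" ?", " unchanged", None, " ?"],
    "gender": [" Female", " Male", " Male", " Female"],
    "age": [79, 65, 21, 2],
})

def normalize_missing(df, placeholder="?"):
    """Strip whitespace from text columns and turn the placeholder into NaN."""
    cleaned = df.copy()
    for col in cleaned.columns:
        is_text = (pd.api.types.is_object_dtype(cleaned[col])
                   or pd.api.types.is_string_dtype(cleaned[col]))
        if not is_text:
            continue  # leave numeric columns untouched
        stripped = cleaned[col].str.strip()
        # mask() replaces entries where the condition is True with NaN
        cleaned[col] = stripped.mask(stripped == placeholder)
    return cleaned

cleaned = normalize_missing(toy)
print(cleaned.isna().sum())
```

After this step, the missing-value counts include the placeholder rows, which gives a more complete picture of how sparse each column is.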

In [23]:
# Column information of the test dataset
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89786 entries, 0 to 89785
Data columns (total 42 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   ID                              89786 non-null  object 
 1   age                             89786 non-null  int64  
 2   gender                          89786 non-null  object 
 3   education                       89786 non-null  object 
 4   class                           44707 non-null  object 
 5   education_institute             5616 non-null   object 
 6   marital_status                  89786 non-null  object 
 7   race                            89786 non-null  object 
 8   is_hispanic                     89786 non-null  object 
 9   employment_commitment           89786 non-null  object 
 10  unemployment_reason             2680 non-null   object 
 11  employment_stat                 89786 non-null  int64  
 12  wage_per_hour                   

#### ii. Shape of The Datasets

In [24]:
# The shape of the train dataset
train_df.shape

(209499, 43)

In [25]:
# The shape of the test dataset
test_df.shape

(89786, 42)
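
The one-column difference between the two shapes is expected if the test dataset only lacks the target. A quick check such as the sketch below can confirm which columns differ. The helper `compare_columns` and the toy frames are hypothetical; in the notebook the call would be `compare_columns(train_df, test_df)`.

```python
import pandas as pd

def compare_columns(train, test):
    """Return the columns found only in train and only in test, in original order."""
    only_in_train = [c for c in train.columns if c not in test.columns]
    only_in_test = [c for c in test.columns if c not in train.columns]
    return only_in_train, only_in_test

# Hypothetical empty frames with a subset of the real column names
toy_train = pd.DataFrame(columns=["ID", "age", "gender", "income_above_limit"])
toy_test = pd.DataFrame(columns=["ID", "age", "gender"])

only_in_train, only_in_test = compare_columns(toy_train, toy_test)
print("Only in train:", only_in_train)
print("Only in test:", only_in_test)
```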

#### iii. Summary Statistics of The Datasets

In [26]:
# Summary Statistics of The Train Dataset
train_df.describe().round(3)

| | age | employment_stat | wage_per_hour | working_week_per_year | industry_code | occupation_code | total_employed | vet_benefit | gains | losses | stocks_status | mig_year | importance_of_record |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 209499.0 | 209499.0 | 209499.0 | 209499.0 | 209499.0 | 209499.0 | 209499.0 | 209499.0 | 209499.0 | 209499.0 | 209499.0 | 209499.0 | 209499.0 |
| mean | 34.519 | 0.177 | 55.433 | 23.159 | 15.332 | 11.322 | 1.956 | 1.516 | 435.927 | 36.882 | 194.533 | 94.5 | 1740.888 |
| std | 22.307 | 0.556 | 276.757 | 24.398 | 18.05 | 14.461 | 2.365 | 0.851 | 4696.36 | 270.383 | 1956.376 | 0.5 | 995.56 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 94.0 | 37.87 |
| 25% | 15.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 94.0 | 1061.29 |
| 50% | 33.0 | 0.0 | 0.0 | 8.0 | 0.0 | 0.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 94.0 | 1617.04 |
| 75% | 50.0 | 0.0 | 0.0 | 52.0 | 33.0 | 26.0 | 4.0 | 2.0 | 0.0 | 0.0 | 0.0 | 95.0 | 2185.48 |
| max | 90.0 | 2.0 | 9999.0 | 52.0 | 51.0 | 46.0 | 6.0 | 2.0 | 99999.0 | 4608.0 | 99999.0 | 95.0 | 18656.3 |


In [27]:
# Summary Statistics of The Test Dataset
test_df.describe().round(3)

| | age | employment_stat | wage_per_hour | working_week_per_year | industry_code | occupation_code | total_employed | vet_benefit | gains | losses | stocks_status | mig_year | importance_of_record |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 89786.0 | 89786.0 | 89786.0 | 89786.0 | 89786.0 | 89786.0 | 89786.0 | 89786.0 | 89786.0 | 89786.0 | 89786.0 | 89786.0 | 89786.0 |
| mean | 34.586 | 0.176 | 54.339 | 23.224 | 15.377 | 11.298 | 1.956 | 1.518 | 421.978 | 36.773 | 198.926 | 94.501 | 1738.264 |
| std | 22.346 | 0.554 | 265.198 | 24.418 | 18.063 | 14.445 | 2.364 | 0.849 | 4610.516 | 268.401 | 1893.917 | 0.5 | 990.837 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 94.0 | 42.82 |
| 25% | 15.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 94.0 | 1059.115 |
| 50% | 33.0 | 0.0 | 0.0 | 8.0 | 0.0 | 0.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 95.0 | 1617.345 |
| 75% | 50.0 | 0.0 | 0.0 | 52.0 | 33.0 | 26.0 | 4.0 | 2.0 | 0.0 | 0.0 | 0.0 | 95.0 | 2193.735 |
| max | 90.0 | 2.0 | 9400.0 | 52.0 | 51.0 | 46.0 | 6.0 | 2.0 | 99999.0 | 4608.0 | 99999.0 | 95.0 | 12960.2 |


#### iv. Checking for Missing Values in The Datasets

In [54]:
# Define a function to print the missing values in the datasets
def show_missing_values(datasets):
    for name, data in datasets.items():
        title = f"Missing values in the {name.capitalize()} dataset:"
        underline = '*' * len(title)  # Create an underline of asterisks
        print(title)
        print(underline)
        
        # Calculate the sum of missing values for each column
        missing_values = data.isnull().sum()
        
        # Exclude columns with 0 as missing values
        missing_values = missing_values[missing_values > 0]
        
        # Check if there are any columns with missing values
        if not missing_values.empty:
            
            # Calculate the percentage of missing values
            missing_percentages = ((missing_values / len(data)) * 100).round(2)
            
            # Display the columns with missing values, the count of missing values, and their percentages
            missing_info = pd.DataFrame({
                'Column': missing_values.index, 
                'Missing Values': missing_values,
                'Missing Values Percentage (%)': missing_percentages
            })
            
            # Use to_string to remove index column
            print(missing_info.to_string(index=False))  
            print('===' * 26)
        else:
            # If no missing values found, indicate that
            print("No missing values found.")
        print()

# Map display names to the datasets loaded above
datasets = {'train': train_df, 'test': test_df}

# Call the function to show missing values in the datasets
show_missing_values(datasets)

Missing values in the Train dataset:
************************************
                        Column  Missing Values  Missing Values Percentage (%)
                         class          105245                          50.24
           education_institute          196197                          93.65
           unemployment_reason          202979                          96.89
                is_labor_union          189420                          90.42
          occupation_code_main          105694                          50.45
               under_18_family          151654                          72.39
  veterans_admin_questionnaire          207415                          99.01
  migration_code_change_in_msa            1588                           0.76
        migration_prev_sunbelt           88452                          42.22
migration_code_move_within_reg            1588                           0.76
  migration_code_change_in_reg            1588                      

In [22]:
# Loop through each column and print unique values
for column in train_df.columns:
    unique_values = train_df[column].unique()
    print(f"Column: {column}")
    print(f"Unique Values: {unique_values}")
    print('===' * 18)
    print()

Column: ID
Unique Values: ['ID_TZ0000' 'ID_TZ0001' 'ID_TZ0002' ... 'ID_TZ99997' 'ID_TZ99998'
 'ID_TZ99999']

Column: age
Unique Values: [79 65 21  2 70 45 53 22 73 30  4 16 43 36  5 88 40 47 59 69 50 27 39 85
 29 41 14 33 67 52 11  9 13 19 26 23 37 58 63 46 62 28 31  3 18 78 15 38
  7 35  1 20  0 48 24 56 25  8 66 71 32 75 51 10 44 42 55 74 77 34 80 17
 83 86 12 68 60 57 64 72 90 61 82  6 84 49 76 54 89 81 87]

Column: gender
Unique Values: [' Female' ' Male']

Column: education
Unique Values: [' High school graduate' ' 12th grade no diploma' ' Children'
 ' Bachelors degree(BA AB BS)' ' 7th and 8th grade' ' 11th grade'
 ' 9th grade' ' Masters degree(MA MS MEng MEd MSW MBA)' ' 10th grade'
 ' Associates degree-academic program' ' 1st 2nd 3rd or 4th grade'
 ' Some college but no degree' ' Less than 1st grade'
 ' Associates degree-occup /vocational'
 ' Prof school degree (MD DDS DVM LLB JD)' ' 5th or 6th grade'
 ' Doctorate degree(PhD EdD)']

Column: class
Unique Values: [nan ' Federal gov