<a href="https://colab.research.google.com/github/techllen/AI_ML_projects/blob/main/decision_trees/Adult_income_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<font color='56c3ea'><h1><b>Adult Income Analysis</b></h1></font>

### Background and problem statement
- An individual’s annual income results from various factors. Some known factors are education level, age, gender, and occupation.

- In this work we explore which factors in an individual’s personal information contribute most to their income (whether it is low or high).

- The income is divided into two groups: >50K is a high income and <=50K is a low income.

- Dataset :  https://www.kaggle.com/datasets/wenruliu/adult-income-dataset
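The two income groups defined above can be turned into a binary target. The following is a minimal sketch, not the notebook's own preprocessing: the labels are copied from the first ten rows of the dataset shown in section 1.1, and the names `label_to_target` and `high_income` are hypothetical.

```python
import pandas as pd

# Income labels copied from the first ten rows of the dataset (toy sample).
income = pd.Series(
    ["<=50K", "<=50K", ">50K", ">50K", "<=50K",
     "<=50K", "<=50K", ">50K", "<=50K", "<=50K"],
    name="income",
)

# Assumed encoding: 1 = high income (>50K), 0 = low income (<=50K).
label_to_target = {"<=50K": 0, ">50K": 1}
high_income = income.map(label_to_target)

print(high_income.tolist())
print("share of high income in the sample:", high_income.mean())
```

Mapping through an explicit dictionary keeps the meaning of 0 and 1 visible, which is easier to audit than relying on the alphabetical order a label encoder would pick.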

### Data understanding
- Column descriptions

| Number | Column Name      | Description                                                                                           |
|--------|------------------|-------------------------------------------------------------------------------------------------------|
| 1      | age              | The age of the individual.                                                                            |
| 2      | workclass        | Type of employment (e.g., Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, etc.).      |
| 3      | fnlwgt           | Final weight; represents the number of people the census believes the entry represents.               |
| 4      | education        | Highest level of education achieved (e.g., Bachelors, HS-grad, Masters, etc.).                        |
| 5      | educational-num  | Number of years of education completed.                                                               |
| 6      | marital-status   | Marital status of the individual (e.g., Married-civ-spouse, Never-married, Divorced, etc.).           |
| 7      | occupation       | Occupation of the individual (e.g., Tech-support, Craft-repair, Sales, Exec-managerial, etc.).        |
| 8      | relationship     | Relationship status within a family (e.g., Wife, Own-child, Husband, Not-in-family, etc.).            |
| 9      | race             | Race of the individual (e.g., White, Black, Asian-Pac-Islander, Amer-Indian-Eskimo, Other).           |
| 10     | gender           | Gender of the individual (Male or Female).                                                            |
| 11     | capital-gain     | Capital gains; income from investment sources apart from wages/salary.                                |
| 12     | capital-loss     | Capital losses; losses from investment sources apart from wages/salary.                               |
| 13     | hours-per-week   | Number of hours worked per week.                                                                      |
| 14     | native-country   | Country of origin (e.g., United-States, Canada, England, Germany, India, etc.).                       |
| 15     | income           | Income bracket of the individual ('<=50K' or '>50K').                                                 |


<font color='56c3ea'><h1><b>1 Data Loading</b></h1></font>

<font color='56c3ea'><h1><b>1.1 Necessary Imports and dataset observation</b></h1></font>

In [1]:
# Install dependencies as needed:
!pip install kagglehub[pandas-datasets] --quiet
# !pip install kagglehub==0.3.10 --upgrade --quiet
!pip install pycaret > /dev/null 2>&1

In [2]:
# Importing required libraries
import kagglehub
from kagglehub import KaggleDatasetAdapter
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from pycaret.classification import setup, compare_models
import warnings

In [3]:
# Adjusting max columns to be displayed and suppressing warnings
pd.set_option('display.max_columns' , None)
warnings.filterwarnings('ignore')

In [4]:
# Loading dataset from kaggle
file_path = "adult.csv"

# Load the latest version
adult_income_df = kagglehub.load_dataset(
  KaggleDatasetAdapter.PANDAS,
  "wenruliu/adult-income-dataset",
  file_path,
)

Downloading from https://www.kaggle.com/api/v1/datasets/download/wenruliu/adult-income-dataset?dataset_version_number=2&file_name=adult.csv...


100%|██████████| 652k/652k [00:00<00:00, 933kB/s]

Extracting zip of adult.csv...





In [5]:
adult_income_df.head(10)

|   | age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|-----|-----------|--------|-----------|-----------------|----------------|------------|--------------|------|--------|--------------|--------------|----------------|----------------|--------|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K |
| 4 | 18 | ? | 103497 | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |
| 5 | 34 | Private | 198693 | 10th | 6 | Never-married | Other-service | Not-in-family | White | Male | 0 | 0 | 30 | United-States | <=50K |
| 6 | 29 | ? | 227026 | HS-grad | 9 | Never-married | ? | Unmarried | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 7 | 63 | Self-emp-not-inc | 104626 | Prof-school | 15 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 3103 | 0 | 32 | United-States | >50K |
| 8 | 24 | Private | 369667 | Some-college | 10 | Never-married | Other-service | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 9 | 55 | Private | 104996 | 7th-8th | 4 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 10 | United-States | <=50K |


<font color='Green'><h1><b>Observations</b></h1></font>
<ul>
<li>The dataset contains the following features (columns): 'age', 'workclass', 'fnlwgt', 'education', 'educational-num', 'marital-status', 'occupation', 'relationship', 'race', 'gender', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', and 'income'.</li>
<li>Some features may need cleaning before analysis: 'education' mixes digits and text in its labels (for example '11th' and '7th-8th'), while 'workclass' and 'occupation' use a question mark ('?') as a placeholder value.</li>
<li>Some features are ordered categories ('education' and its numeric counterpart 'educational-num').</li>
<li>Some features are unordered categories ('workclass', 'marital-status', 'occupation', 'relationship', 'race', 'gender', 'native-country').</li>
<li>The remaining features are numerical, continuous or discrete ('age', 'capital-loss', 'capital-gain', 'hours-per-week', 'fnlwgt').</li>
</ul>
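The claim that 'education' and 'educational-num' carry the same ordering can be checked directly. The sketch below is illustrative only: it uses a toy sample copied from the ten rows displayed above rather than the full dataset, and the variable names are assumptions.

```python
import pandas as pd

# Toy sample copied from the first ten rows shown above (not the full dataset).
sample = pd.DataFrame({
    "education": ["11th", "HS-grad", "Assoc-acdm", "Some-college", "Some-college",
                  "10th", "HS-grad", "Prof-school", "Some-college", "7th-8th"],
    "educational-num": [7, 9, 12, 10, 10, 6, 9, 15, 10, 4],
})

# Each education label should map to exactly one numeric code.
codes_per_label = sample.groupby("education")["educational-num"].nunique()
is_one_to_one = bool((codes_per_label == 1).all())

# Sorting the distinct labels by their numeric code gives the implied order.
ordered_labels = (
    sample.drop_duplicates("education")
          .sort_values("educational-num")["education"]
          .tolist()
)

print("one code per label:", is_one_to_one)
print("implied order:", ordered_labels)
```

If the same check holds on the full dataframe, one of the two columns is redundant for modelling, and the numeric one already encodes the order.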

In [6]:
# Checking the structure of the data: data types and non-null counts per column
adult_income_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              48842 non-null  int64 
 1   workclass        48842 non-null  object
 2   fnlwgt           48842 non-null  int64 
 3   education        48842 non-null  object
 4   educational-num  48842 non-null  int64 
 5   marital-status   48842 non-null  object
 6   occupation       48842 non-null  object
 7   relationship     48842 non-null  object
 8   race             48842 non-null  object
 9   gender           48842 non-null  object
 10  capital-gain     48842 non-null  int64 
 11  capital-loss     48842 non-null  int64 
 12  hours-per-week   48842 non-null  int64 
 13  native-country   48842 non-null  object
 14  income           48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


In [7]:
# Checking shape
adult_income_df.shape

(48842, 15)

In [8]:
# Selecting the integer (int64) columns
adult_income_df.select_dtypes(include = ['int64']).columns

Index(['age', 'fnlwgt', 'educational-num', 'capital-gain', 'capital-loss',
       'hours-per-week'],
      dtype='object')

In [9]:
# Selecting the object (string) columns
adult_income_df.select_dtypes(include = ['object']).columns

Index(['workclass', 'education', 'marital-status', 'occupation',
       'relationship', 'race', 'gender', 'native-country', 'income'],
      dtype='object')

<font color='Green'><h1><b>Observations</b></h1></font>
<ul>
<li>The dataset consists of integer (int64) and object data types: 6 integer columns and 9 object columns.</li>
<li>The shape shows that the dataset has 15 features (columns) and 48842 entries (rows).</li>
<li>None of the 15 features contains null values; however, some columns contain the special character '?', which may need cleaning later.</li>
</ul>
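Because `info()` reports no nulls while '?' still appears in the data, the placeholder has to be counted separately. The following is a small sketch under stated assumptions: the toy frame is copied from the ten rows displayed earlier, and treating '?' as missing is an assumption about the dataset, not something the output above confirms.

```python
import numpy as np
import pandas as pd

# Toy sample copied from the first ten rows shown earlier (not the full dataset).
sample = pd.DataFrame({
    "workclass": ["Private", "Private", "Local-gov", "Private", "?",
                  "Private", "?", "Self-emp-not-inc", "Private", "Private"],
    "occupation": ["Machine-op-inspct", "Farming-fishing", "Protective-serv",
                   "Machine-op-inspct", "?", "Other-service", "?",
                   "Prof-specialty", "Other-service", "Craft-repair"],
    "native-country": ["United-States"] * 10,
})

# Count how often the '?' placeholder appears in each column.
placeholder_counts = (sample == "?").sum()

# Assumption: '?' means "unknown", so convert it to NaN to make it visible to isna().
cleaned = sample.replace("?", np.nan)
missing_counts = cleaned.isna().sum()

print(placeholder_counts.to_dict())
print(missing_counts.to_dict())
```

After the replacement, the usual pandas tools for missing data (`isna`, `dropna`, `fillna`) apply to these entries.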

<font color='56c3ea'><h1><b>2 Summary statistics</b></h1></font>






<font color='56c3ea'><h1><b>2.1 Descriptive statistics for numerical features</b></h1></font>

In [10]:
# Summary statistics for numerical values
adult_income_df.describe()

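|       | age | fnlwgt | educational-num | capital-gain | capital-loss | hours-per-week |
|-------|-----|--------|-----------------|--------------|--------------|----------------|
| count | 48842.0 | 48842.0 | 48842.0 | 48842.0 | 48842.0 | 48842.0 |
| mean  | 38.643585 | 189664.1 | 10.078089 | 1079.067626 | 87.502314 | 40.422382 |
| std   | 13.71051 | 105604.0 | 2.570973 | 7452.019058 | 403.004552 | 12.391444 |
| min   | 17.0 | 12285.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25%   | 28.0 | 117550.5 | 9.0 | 0.0 | 0.0 | 40.0 |
| 50%   | 37.0 | 178144.5 | 10.0 | 0.0 | 0.0 | 40.0 |
| 75%   | 48.0 | 237642.0 | 12.0 | 0.0 | 0.0 | 45.0 |
| max   | 90.0 | 1490400.0 | 16.0 | 99999.0 | 4356.0 | 99.0 |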


<font color='Green'><h1><b>Observations</b></h1></font>

### Interpretation of Descriptive Statistics for Numerical Features

age
- Count: 48842, no missing entries
- Mean: the average age of individuals in this census sample is 38.64
- Standard Deviation: 13.71, showing a wide spread of ages among participants
- Min and Max: the youngest individual is 17 years old and the oldest is 90 years old

fnlwgt
- Count: 48842, no missing entries
- Mean: the average fnlwgt across individuals in this census sample is 189664.1
- Standard Deviation: 105604, showing a wide spread in the number of people each entry is believed to represent

educational-num
- Count: 48842, no missing entries
- Mean: a working adult has completed an average of 10.08 years of education
- Standard Deviation: 2.57, showing that years of education vary among participants
- Min and Max: 1 year is the least time a working adult has spent in school and 16 years is the most, which aligns with the number of years an individual spends in education through college

capital-gain
- Count: 48842, no missing entries
- Mean: the average capital gain of participants is 1079.07
- Standard Deviation: 7452.02, showing a very wide spread of capital gains among participants; some individual gains are very high
- Min: 0, indicating some individuals have no capital gains outside their income; the 75th percentile is also 0, so at least three quarters of participants report none
- Max: 99999, showing some individuals have invested heavily outside their income

capital-loss
- Count: 48842, no missing entries
- Mean: the average capital loss of participants is 87.50
- Standard Deviation: 403.00, showing a very wide spread of capital losses among participants
- Min: 0, indicating some individuals have no losses (whether or not they hold capital investments); the 75th percentile is also 0
- Max: 4356.00, showing some individuals have taken losses

hours-per-week
- Count: 48842, no missing entries
- Mean: 40.42, close to a schedule of 5 days at 8 hours per day; the 25th and 50th percentiles are both exactly 40
- Standard Deviation: 12.39, showing a wide spread of hours worked per week among participants
- Min: 1, some individuals work only 1 hour per week
- Max: 99, showing some individuals work about 2.5 times the roughly 40 hours an average individual works
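A mean far above the median, as seen for capital-gain, usually means a few large values dominate a column that is mostly zero. The sketch below illustrates this on a toy sample copied from the ten rows displayed in section 1.1; the variable names are assumptions, and the figures describe only those ten rows, not the full dataset.

```python
import pandas as pd

# capital-gain values copied from the first ten rows shown earlier (toy sample).
capital_gain = pd.Series([0, 0, 0, 7688, 0, 0, 0, 3103, 0, 0], name="capital-gain")

zero_share = (capital_gain == 0).mean()  # fraction of rows reporting no gain
median_gain = capital_gain.median()      # robust to the few large values
mean_gain = capital_gain.mean()          # pulled upward by the few large values

print(f"share of zeros: {zero_share:.0%}")
print(f"median: {median_gain}, mean: {mean_gain}")
```

For columns like this, the median and the share of zeros describe a typical individual better than the mean does.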


<font color='56c3ea'><h1><b>2.2 Descriptive statistics for categorical features</b></h1></font>

In [11]:
# Summary statistics for categorical values
adult_income_df.describe(include=['object'])

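|        | workclass | education | marital-status | occupation | relationship | race | gender | native-country | income |
|--------|-----------|-----------|----------------|------------|--------------|------|--------|----------------|--------|
| count  | 48842 | 48842 | 48842 | 48842 | 48842 | 48842 | 48842 | 48842 | 48842 |
| unique | 9 | 16 | 7 | 15 | 6 | 5 | 2 | 42 | 2 |
| top    | Private | HS-grad | Married-civ-spouse | Prof-specialty | Husband | White | Male | United-States | <=50K |
| freq   | 33906 | 15784 | 22379 | 6172 | 19716 | 41762 | 32650 | 43832 | 37155 |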


<font color='Green'><h1><b>Observations</b></h1></font>

### Interpretation of Descriptive Statistics for Categorical Features

workclass
- Count: 48842, no missing entries
- Unique: 9 classes (this count includes the '?' placeholder seen in the first rows)
- Most frequent: Private; a majority of participants (33906 of 48842, about 69.4%) work in the private sector
education
- Count: 48842, no missing entries
- Unique: 16 classes
- Most frequent: HS-grad; high school graduates are the largest group (15784 of 48842, about 32.3%), though not a majority

marital-status
- Count: 48842, no missing entries
- Unique: 7 classes
- Most frequent: Married-civ-spouse; married participants are the largest group (22379 of 48842, about 45.8%), just under half

occupation
- Count: 48842, no missing entries
- Unique: 15 classes (this count includes the '?' placeholder seen in the first rows)
- Most frequent: Prof-specialty (6172 of 48842, about 12.6%), the largest single group but far from a majority; these are professions that require specialized knowledge and often advanced education (doctors, engineers, teachers, scientists, lawyers)

relationship
- Count: 48842, no missing entries
- Unique: 6 classes
- Most frequent: Husband (19716 of 48842, about 40.4%); the largest group of participants are husbands within their family

race
- Count: 48842, no missing entries
- Unique: 5 classes
- Most frequent: White; a majority of participants (41762 of 48842, about 85.5%)

gender
- Count: 48842, no missing entries
- Unique: 2 classes
- Most frequent: Male; a majority of participants (32650 of 48842, about 66.8%)

native-country
- Count: 48842, no missing entries
- Unique: 42 classes, one per distinct value of the country field
- Most frequent: United-States; a majority of participants (43832 of 48842, about 89.7%)

income
- Count: 48842, no missing entries
- Unique: 2 classes, low and high income
- Most frequent: <=50K; a majority of participants (37155 of 48842, about 76.1%) have low income, so the two classes are imbalanced
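Whether the most frequent category is a true majority depends on `freq / count`, not on `top` alone. The sketch below works that out from the numbers in the `describe(include=['object'])` output above; the names `top_values`, `shares`, and `majority_columns` are hypothetical.

```python
# Counts copied from the describe(include=['object']) output above.
total_rows = 48842
top_values = {
    "workclass": ("Private", 33906),
    "education": ("HS-grad", 15784),
    "marital-status": ("Married-civ-spouse", 22379),
    "occupation": ("Prof-specialty", 6172),
    "relationship": ("Husband", 19716),
    "race": ("White", 41762),
    "gender": ("Male", 32650),
    "native-country": ("United-States", 43832),
    "income": ("<=50K", 37155),
}

# Share of rows taken by the most frequent value in each column.
shares = {col: freq / total_rows for col, (_, freq) in top_values.items()}

# A value is a majority only if it covers more than half of the rows.
majority_columns = [col for col, share in shares.items() if share > 0.5]

for col, (label, _) in top_values.items():
    print(f"{col:15s} {label:20s} {shares[col]:.1%}")
print("majority:", majority_columns)
```

On the loaded dataframe, `value_counts(normalize=True)` on a column gives the same shares for every category rather than only the top one.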