# Data Understanding

## Dataset Feature Descriptions

age

Represents the age of the individual in years.
(Type: Numerical)

workclass

Describes the employment status or type of employer, such as Private, Self-employed, Government, or Without-pay.
(Type: Categorical)

fnlwgt

Stands for "final weight": a statistical weight estimating how many people in the U.S. population the record represents.
(Type: Numerical)

education

The highest level of education attained by the individual, such as Bachelors, Masters, or HS-grad.
(Type: Categorical)

education.num

A numeric encoding of the education level, corresponding to education. For example, Bachelors = 13, Masters = 14, etc.
(Type: Numerical)

marital.status

Indicates the marital status of the individual, such as Married-civ-spouse, Divorced, Never-married.
(Type: Categorical)

occupation

Describes the individual’s job title or area of work, such as Tech-support, Craft-repair, Sales, or Exec-managerial.
(Type: Categorical)

relationship

The person's relationship within their household or family structure, like Husband, Wife, Own-child, Not-in-family.
(Type: Categorical)

race

The race category of the individual, such as White, Black, Asian-Pac-Islander, or Amer-Indian-Eskimo.
(Type: Categorical)

sex

The gender of the individual: Male or Female.
(Type: Categorical)

capital.gain

Capital gain received from investments such as stocks or property sales.
(Type: Numerical)

capital.loss

Capital loss incurred from investments or asset sales.
(Type: Numerical)

hours.per.week

The number of hours the person works per week.
(Type: Numerical)

native.country

The country of origin of the individual, such as United-States, Mexico, or India.
(Type: Categorical)

income

The target variable. Indicates whether the individual earns more than $50,000 annually:

- <=50K: less than or equal to $50K
- >50K: more than $50K

(Type: Binary Categorical, Target)
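Many classifiers and metrics are easier to use with a numeric target. A minimal sketch of mapping the two documented labels to 0/1 follows; the toy series and the name `income_binary` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy labels using the two values documented above (not the real data).
income = pd.Series(["<=50K", ">50K", "<=50K", ">50K", "<=50K"], name="income")

# Map the string labels to 0/1; `income_binary` is an illustrative name.
income_binary = income.map({"<=50K": 0, ">50K": 1})

# The mean of a 0/1 column is the share of the positive class.
positive_share = income_binary.mean()
print(income_binary.tolist(), positive_share)
```

On the real data the same mapping could be applied to `df['income']` after loading.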



## Import necessary libraries

In [117]:
import numpy as np
import pandas as pd
import plotly.express as px

Data Loading

In [118]:
df = pd.read_csv('adult.csv',na_values='?')

df

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,90,,77053,HS-grad,9,Widowed,,Not-in-family,White,Female,0,4356,40,United-States,<=50K
1,82,Private,132870,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
2,66,,186061,Some-college,10,Widowed,,Unmarried,Black,Female,0,4356,40,United-States,<=50K
3,54,Private,140359,7th-8th,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
4,41,Private,264663,Some-college,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,22,Private,310152,Some-college,10,Never-married,Protective-serv,Not-in-family,White,Male,0,0,40,United-States,<=50K
32557,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32558,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32559,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K


Data Exploration

In [119]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       30725 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education.num   32561 non-null  int64 
 5   marital.status  32561 non-null  object
 6   occupation      30718 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital.gain    32561 non-null  int64 
 11  capital.loss    32561 non-null  int64 
 12  hours.per.week  32561 non-null  int64 
 13  native.country  31978 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [120]:
df.isna().sum()

age                  0
workclass         1836
fnlwgt               0
education            0
education.num        0
marital.status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital.gain         0
capital.loss         0
hours.per.week       0
native.country     583
income               0
dtype: int64
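Raw counts are easier to judge as a share of all rows. The sketch below turns the three non-zero counts from the output above into percentages, using the 32561 rows reported by `df.info()`; the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the df.isna().sum() output above; row count from df.info().
n_rows = 32561
missing_counts = pd.Series(
    {"workclass": 1836, "occupation": 1843, "native.country": 583}
)

# Percentage of rows with a missing value, per column.
missing_pct = (missing_counts / n_rows * 100).round(2)
print(missing_pct)
```

On the live DataFrame, `df.isna().mean().mul(100).round(2)` should give the same kind of summary in one step.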

In [121]:
df.describe().round(2)

Unnamed: 0,age,fnlwgt,education.num,capital.gain,capital.loss,hours.per.week
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,38.58,189778.37,10.08,1077.65,87.3,40.44
std,13.64,105549.98,2.57,7385.29,402.96,12.35
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117827.0,9.0,0.0,0.0,40.0
50%,37.0,178356.0,10.0,0.0,0.0,40.0
75%,48.0,237051.0,12.0,0.0,0.0,45.0
max,90.0,1484705.0,16.0,99999.0,4356.0,99.0


In [122]:
df.describe(include='object')

Unnamed: 0,workclass,education,marital.status,occupation,relationship,race,sex,native.country,income
count,30725,32561,32561,30718,32561,32561,32561,31978,32561
unique,8,16,7,14,6,5,2,41,2
top,Private,HS-grad,Married-civ-spouse,Prof-specialty,Husband,White,Male,United-States,<=50K
freq,22696,10501,14976,4140,13193,27816,21790,29170,24720


In [123]:
df.duplicated().sum()

24

In [124]:
df.drop_duplicates(inplace=True)

In [125]:
df.duplicated().sum()

0
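Before dropping duplicates it can help to look at them. A small sketch on toy data (not the census rows) shows the difference between counting only the repeats and listing every copy; `toy`, `all_copies`, and `deduped` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame(
    {"age": [30, 30, 41, 30], "income": ["<=50K", "<=50K", ">50K", "<=50K"]}
)

# duplicated() flags only the repeats by default (keep="first") ...
n_repeats = int(toy.duplicated().sum())

# ... while keep=False flags every member of a duplicate group.
all_copies = toy[toy.duplicated(keep=False)]

# drop_duplicates keeps the first copy; resetting the index is optional.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_repeats, len(all_copies), len(deduped))
```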

Data Cleaning

Check categorical columns

In [126]:
cat_cols = df.select_dtypes(include= 'object').columns
cat_cols

Index(['workclass', 'education', 'marital.status', 'occupation',
       'relationship', 'race', 'sex', 'native.country', 'income'],
      dtype='object')

In [127]:
for col in cat_cols:

    print(col)
    print(df[col].nunique())
    print(df[col].unique())
    print('-' * 100)

workclass
8
[nan 'Private' 'State-gov' 'Federal-gov' 'Self-emp-not-inc' 'Self-emp-inc'
 'Local-gov' 'Without-pay' 'Never-worked']
----------------------------------------------------------------------------------------------------
education
16
['HS-grad' 'Some-college' '7th-8th' '10th' 'Doctorate' 'Prof-school'
 'Bachelors' 'Masters' '11th' 'Assoc-acdm' 'Assoc-voc' '1st-4th' '5th-6th'
 '12th' '9th' 'Preschool']
----------------------------------------------------------------------------------------------------
marital.status
7
['Widowed' 'Divorced' 'Separated' 'Never-married' 'Married-civ-spouse'
 'Married-spouse-absent' 'Married-AF-spouse']
----------------------------------------------------------------------------------------------------
occupation
14
[nan 'Exec-managerial' 'Machine-op-inspct' 'Prof-specialty'
 'Other-service' 'Adm-clerical' 'Craft-repair' 'Transport-moving'
 'Handlers-cleaners' 'Sales' 'Farming-fishing' 'Tech-support'
 'Protective-serv' 'Armed-Forces' 'Priv-house-s

Check numerical columns

In [128]:
num_cols = df.select_dtypes(include= 'number').columns
num_cols

Index(['age', 'fnlwgt', 'education.num', 'capital.gain', 'capital.loss',
       'hours.per.week'],
      dtype='object')

In [129]:
for col in num_cols:

    px.histogram(data_frame= df, x= col, title= col).show()

Feature Engineering

In [130]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,90,,77053,HS-grad,9,Widowed,,Not-in-family,White,Female,0,4356,40,United-States,<=50K
1,82,Private,132870,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
2,66,,186061,Some-college,10,Widowed,,Unmarried,Black,Female,0,4356,40,United-States,<=50K
3,54,Private,140359,7th-8th,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
4,41,Private,264663,Some-college,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K


# Data Analysis

## Univariate

What is the distribution of age in the dataset?

In [131]:
px.histogram(data_frame= df, x= 'age', color= 'income', title= 'Age Distribution by Income').show()

What percentage of individuals earn more than $50K?

In [132]:
px.pie(data_frame= df, names= 'income', title= 'Income Distribution').show()
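The pie chart answers the question visually; the share can also be worked out from numbers already printed above. This sketch uses the row count from `df.info()` and the frequency of the top class from `df.describe(include='object')`. Both were computed before the 24 duplicates were dropped, so the chart may differ slightly.

```python
# Figures copied from earlier outputs (before duplicates were removed).
n_rows = 32561   # total rows, from df.info()
n_low = 24720    # frequency of "<=50K", from df.describe(include='object')

n_high = n_rows - n_low
share_high = n_high / n_rows * 100
print(f"{n_high} rows, {share_high:.2f}% earn >50K")
```

On the DataFrame itself, `df['income'].value_counts(normalize=True)` gives the shares directly.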

What is the distribution of hours worked per week?

In [133]:
px.box(df, x= 'hours.per.week', title= 'Hours Worked Per Week Distribution')
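A box plot conventionally marks points beyond 1.5 times the interquartile range as outliers. As a rough worked example, the fences can be computed from the quartiles of `hours.per.week` reported by `df.describe()`. Plotly's own quartile method may give slightly different values, so treat this as an approximation.

```python
# 25% and 75% rows of hours.per.week in the df.describe() output above.
q1, q3 = 40.0, 45.0

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print(lower_fence, upper_fence)
```

With such a narrow box, anyone working noticeably less or more than a standard week falls outside the fences, which is why the plot shows so many outlier points.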

## Bivariate

Is there a relationship between education level and income?

In [134]:
px.histogram(data_frame= df, x= 'education', color= 'income', barmode= 'group', title= 'Education Level Distribution by Income').show()
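Grouped counts are hard to compare across education levels of very different sizes. A row-normalised cross-tabulation gives the share of each income class within a level. The sketch below uses toy rows, not the census data, and the name `share` is illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "education": ["HS-grad", "HS-grad", "HS-grad", "Bachelors", "Bachelors", "Masters"],
    "income": ["<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K"],
})

# normalize="index" makes each row (education level) sum to 1.
share = pd.crosstab(toy["education"], toy["income"], normalize="index")
print(share)
```

The same call with `df['education']` and `df['income']` would answer the question for the real data.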

Does income distribution vary by gender?

In [135]:
px.histogram(data_frame= df, x= 'sex', color= 'income',barmode='group', title= 'Income by Gender')

Do individuals who work more hours tend to earn more?

In [136]:
px.box(df, x='income', y='hours.per.week', color='income', title='Hours per Week by Income Level')

## Multivariate

How do education level and gender together affect income?

In [137]:
px.histogram(data_frame= df, x= 'education', color= 'income',barmode='group',facet_col='sex', title= 'Income by Education and Gender')

Does the relationship between age and income differ by gender?

In [138]:
px.histogram(df, x='age', color='income',
                   facet_col='sex',
                   title='Age Distribution by Income and Gender')


How do education level and working hours together influence income?

In [139]:
px.scatter(df, x='hours.per.week', y='education.num', color='income',
                 title='Income by Hours Worked and Education Level',)

# Deployment

In [140]:
%%writefile adults.py

import pandas as pd
import plotly.express as px
import streamlit as st

st.set_page_config(layout= 'wide', page_title= 'adults EDA', page_icon= '✨')

page = st.sidebar.radio('Pages', ['Dataset','Univariate Analysis', 'Bivariate Analysis', 'Multivariate Analysis'])
# Treat '?' as missing, matching how the notebook loads the data
adults = pd.read_csv('adult.csv', na_values='?')
if page == 'Dataset':
    st.dataframe(adults)
    st.title("Adult Income Dataset - Project Overview")

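    # Raw string keeps the \$ escape intact; the text is flush-left like the
    # rest of the block so Markdown does not render it as an indented code block.
    st.markdown(r"""
### 📄 Dataset Description

The **Adult Income Dataset** is extracted from the 1994 U.S. Census database.  
It is commonly used to predict whether an individual's annual income exceeds \$50K, based on demographic and employment-related attributes.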

---

### 🔍 Features Included:

- **age**: Age of the individual (in years)
- **workclass**: Type of employment (e.g., Private, Self-emp, Government)
- **fnlwgt**: Final weight – represents how many people the record represents
- **education**: Highest level of education completed
- **education.num**: Numeric representation of education
- **marital.status**: Marital status (e.g., Married, Divorced, Never-married)
- **occupation**: Job type (e.g., Tech-support, Sales)
- **relationship**: Relationship status within household
- **race**: Race category (e.g., White, Black, Asian)
- **sex**: Gender (Male/Female)
- **capital.gain**: Income from capital gains
- **capital.loss**: Loss from capital investments
- **hours.per.week**: Total working hours per week
- **native.country**: Country of origin
- **income**: Target variable – whether income is >50K or <=50K

---

### 🎯 Objective

The goal of this project is to analyze and visualize demographic and employment patterns to predict income level, and understand key features that influence high earnings.

  """)
elif page == 'Univariate Analysis':
    question = st.selectbox('Select Question', ['What is the distribution of age in the dataset?','What percentage of individuals earn more than $50K?','What is the distribution of hours worked per week?'])
    if question == 'What is the distribution of age in the dataset?':
        st.plotly_chart(px.histogram(data_frame= adults, x= 'age', color= 'income', title= 'Age Distribution by Income'))
    elif question == 'What percentage of individuals earn more than $50K?':
        st.plotly_chart(px.pie(data_frame= adults, names= 'income', title= 'Income Distribution'))
    elif question == 'What is the distribution of hours worked per week?':
        st.plotly_chart(px.box(adults, x= 'hours.per.week', title= 'Hours Worked Per Week Distribution'))

elif page == 'Bivariate Analysis':
    question = st.selectbox('Select Question', ['Is there a relationship between education level and income?','Does income distribution vary by gender?','Do individuals who work more hours tend to earn more?'])
    if question == 'Is there a relationship between education level and income?':
        st.plotly_chart(px.histogram(data_frame= adults, x= 'education', color= 'income', barmode= 'group', title= 'Education Level Distribution by Income'))
    elif question == 'Does income distribution vary by gender?':
        st.plotly_chart(px.histogram(data_frame= adults, x= 'sex', color= 'income',barmode='group', title= 'Income by Gender'))
    elif question == 'Do individuals who work more hours tend to earn more?':
        st.plotly_chart(px.box(adults, x='income', y='hours.per.week', color='income', title='Hours per Week by Income Level'))
    
elif page == 'Multivariate Analysis':
    question = st.selectbox('Select Question',['How do education level and gender together affect income?','Does the relationship between age and income differ by gender?','How do education level and working hours together influence income?'])
    if question == 'How do education level and gender together affect income?':
        st.plotly_chart(px.histogram(data_frame= adults, x= 'education', color= 'income',barmode='group',facet_col='sex', title= 'Income by Education and Gender'))

    elif question == 'Does the relationship between age and income differ by gender?':
        st.plotly_chart(px.histogram(adults, x='age', color='income',facet_col='sex', title='Age Distribution by Income and Gender'))

    elif question == 'How do education level and working hours together influence income?':
        st.plotly_chart(px.scatter(adults, x='hours.per.week', y='education.num', color='income',title='Income by Hours Worked and Education Level',))



Overwriting adults.py


In [141]:
#! streamlit run adults.py

# Data Preprocessing

Split data into features and target variable

In [142]:
x = df.drop('income',axis=  1)
y = df['income']

Split data into train and test sets

In [143]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
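Because the two income classes are imbalanced, a plain random split can leave the test set with a slightly different class ratio than the training set. One common option, not used in the original cell, is the `stratify` argument. The sketch below runs on toy data with a 15/5 class split; all names are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

toy_x = pd.DataFrame({"age": range(20, 40)})
toy_y = pd.Series(["<=50K"] * 15 + [">50K"] * 5, name="income")

# stratify keeps the 75/25 class ratio in both splits.
x_tr, x_te, y_tr, y_te = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)
print(y_tr.value_counts().to_dict(), y_te.value_counts().to_dict())
```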

## Numerical

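Impute missing values (df.isna().sum() above reports none in the numerical columns, so this step is only a safeguard)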

In [144]:
num_cols = x_train.select_dtypes(include= 'number').columns
num_cols

Index(['age', 'fnlwgt', 'education.num', 'capital.gain', 'capital.loss',
       'hours.per.week'],
      dtype='object')

In [145]:
from sklearn.impute import KNNImputer

knn = KNNImputer()

x_train[num_cols] = knn.fit_transform(x_train[num_cols])

x_test[num_cols] = knn.transform(x_test[num_cols])

In [146]:
scaling_cols = ['fnlwgt']

In [147]:
from sklearn.preprocessing import RobustScaler

rc = RobustScaler()

x_train[scaling_cols] = rc.fit_transform(x_train[scaling_cols])
x_test[scaling_cols] = rc.transform(x_test[scaling_cols])
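`RobustScaler` with default settings centres on the median and divides by the interquartile range, which makes it less sensitive to extreme values than mean/standard-deviation scaling. A small check on toy numbers (not `fnlwgt` itself) compares a manual computation with the scaler's output.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy column with one extreme value.
values = np.array([[10.0], [20.0], [30.0], [40.0], [1000.0]])

# Manual version: (x - median) / (Q3 - Q1)
median = np.median(values)
q1, q3 = np.percentile(values, [25, 75])
manual = (values - median) / (q3 - q1)

scaled = RobustScaler().fit_transform(values)
print(np.allclose(manual, scaled))
```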

## Categorical

Impute missing values

In [148]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy= 'constant', fill_value= 'Unknown')

x_train[['occupation']] = imputer.fit_transform(x_train[['occupation']])

x_test[['occupation']] = imputer.transform(x_test[['occupation']])
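The cell above fills only `occupation`, while `df.isna().sum()` also reported missing values in `workclass` and `native.country`. If an explicit `'Unknown'` level is wanted for all three, the same imputer can be applied to the three columns together. This is a sketch on toy frames under that assumption, not a change the original notebook makes; `toy_train`, `toy_test`, and `cat_missing_cols` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

cat_missing_cols = ["workclass", "occupation", "native.country"]

toy_train = pd.DataFrame({
    "workclass": ["Private", np.nan],
    "occupation": [np.nan, "Sales"],
    "native.country": ["Mexico", np.nan],
}, dtype=object)
toy_test = pd.DataFrame({
    "workclass": [np.nan],
    "occupation": ["Tech-support"],
    "native.country": ["India"],
}, dtype=object)

# Fit on the training split, then reuse the fitted imputer on the test split.
imputer = SimpleImputer(strategy="constant", fill_value="Unknown")
toy_train[cat_missing_cols] = imputer.fit_transform(toy_train[cat_missing_cols])
toy_test[cat_missing_cols] = imputer.transform(toy_test[cat_missing_cols])
print(toy_train, toy_test, sep="\n")
```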

Nominal

In [149]:
x_train.select_dtypes(include= 'object')

Unnamed: 0,workclass,education,marital.status,occupation,relationship,race,sex,native.country
32239,Self-emp-inc,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,United-States
30377,Private,Bachelors,Married-civ-spouse,Sales,Husband,White,Male,United-States
5455,Self-emp-inc,HS-grad,Married-civ-spouse,Other-service,Husband,Asian-Pac-Islander,Male,Thailand
19698,Private,Assoc-voc,Separated,Machine-op-inspct,Not-in-family,Black,Female,United-States
23193,Private,Some-college,Married-civ-spouse,Adm-clerical,Wife,White,Female,United-States
...,...,...,...,...,...,...,...,...
29823,Private,HS-grad,Married-civ-spouse,Transport-moving,Husband,Amer-Indian-Eskimo,Male,United-States
5390,Private,Assoc-voc,Married-civ-spouse,Prof-specialty,Husband,White,Male,United-States
860,Federal-gov,HS-grad,Married-civ-spouse,Tech-support,Husband,White,Male,United-States
15800,Private,Bachelors,Never-married,Sales,Not-in-family,White,Male,United-States


In [150]:
for col in x_train.select_dtypes(include= 'object').columns:
    
    print(col)
    print(x_train[col].nunique())

workclass
8
education
16
marital.status
7
occupation
15
relationship
6
race
5
sex
2
native.country
41


OneHotEncoder

In [151]:
ohe_cols = x_train.select_dtypes(include= 'object').drop(['occupation','education','native.country','workclass'], axis = 1).columns
ohe_cols

Index(['marital.status', 'relationship', 'race', 'sex'], dtype='object')

In [152]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse_output= False, drop= 'first')

ohe_train = ohe.fit_transform(x_train[ohe_cols])

ohe_test = ohe.transform(x_test[ohe_cols])

In [153]:
ohe.get_feature_names_out()

array(['marital.status_Married-AF-spouse',
       'marital.status_Married-civ-spouse',
       'marital.status_Married-spouse-absent',
       'marital.status_Never-married', 'marital.status_Separated',
       'marital.status_Widowed', 'relationship_Not-in-family',
       'relationship_Other-relative', 'relationship_Own-child',
       'relationship_Unmarried', 'relationship_Wife',
       'race_Asian-Pac-Islander', 'race_Black', 'race_Other',
       'race_White', 'sex_Male'], dtype=object)

In [154]:
ohe_train_df = pd.DataFrame(ohe_train, columns= ohe.get_feature_names_out())

ohe_test_df = pd.DataFrame(ohe_test, columns= ohe.get_feature_names_out())

In [155]:
ohe_train_df

Unnamed: 0,marital.status_Married-AF-spouse,marital.status_Married-civ-spouse,marital.status_Married-spouse-absent,marital.status_Never-married,marital.status_Separated,marital.status_Widowed,relationship_Not-in-family,relationship_Other-relative,relationship_Own-child,relationship_Unmarried,relationship_Wife,race_Asian-Pac-Islander,race_Black,race_Other,race_White,sex_Male
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
3,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26024,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
26025,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
26026,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
26027,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0


In [156]:
x_train

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country
32239,41.0,Self-emp-inc,-0.948018,HS-grad,9.0,Married-civ-spouse,Farming-fishing,Husband,White,Male,0.0,0.0,50.0,United-States
30377,29.0,Private,0.282875,Bachelors,13.0,Married-civ-spouse,Sales,Husband,White,Male,0.0,0.0,40.0,United-States
5455,43.0,Self-emp-inc,0.161131,HS-grad,9.0,Married-civ-spouse,Other-service,Husband,Asian-Pac-Islander,Male,0.0,0.0,78.0,Thailand
19698,46.0,Private,0.877954,Assoc-voc,11.0,Separated,Machine-op-inspct,Not-in-family,Black,Female,0.0,0.0,40.0,United-States
23193,30.0,Private,-0.644616,Some-college,10.0,Married-civ-spouse,Adm-clerical,Wife,White,Female,0.0,0.0,40.0,United-States
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29823,31.0,Private,-0.545014,HS-grad,9.0,Married-civ-spouse,Transport-moving,Husband,Amer-Indian-Eskimo,Male,0.0,0.0,40.0,United-States
5390,51.0,Private,-0.259099,Assoc-voc,11.0,Married-civ-spouse,Prof-specialty,Husband,White,Male,0.0,0.0,40.0,United-States
860,55.0,Federal-gov,0.506566,HS-grad,9.0,Married-civ-spouse,Tech-support,Husband,White,Male,0.0,1887.0,40.0,United-States
15800,23.0,Private,0.468327,Bachelors,13.0,Never-married,Sales,Not-in-family,White,Male,0.0,0.0,25.0,United-States


In [157]:
x_train.reset_index(inplace= True, drop= True)

x_test.reset_index(inplace= True, drop= True)

y_train.reset_index(inplace= True, drop= True)

y_test.reset_index(inplace= True, drop= True)

In [158]:
x_train = pd.concat([x_train, ohe_train_df], axis= 1).drop(ohe_cols, axis= 1)

x_test = pd.concat([x_test, ohe_test_df], axis= 1).drop(ohe_cols, axis= 1)
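The index reset before this concat matters: `pd.concat(..., axis=1)` aligns rows by index label, and the encoded frames carry a fresh 0..n-1 index while the split frames keep their shuffled original labels. A toy illustration of what happens without the reset:

```python
import pandas as pd

# Shuffled labels, as left behind by train_test_split.
left = pd.DataFrame({"age": [41, 29, 43]}, index=[32239, 30377, 5455])
# Fresh RangeIndex, as produced by pd.DataFrame(ndarray).
right = pd.DataFrame({"sex_Male": [1.0, 1.0, 0.0]})

misaligned = pd.concat([left, right], axis=1)                        # union of labels
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)    # row by row
print(misaligned.shape, aligned.shape)
```

An alternative that avoids the reset is to build the encoded frame with `index=x_train.index`.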

BinaryEncoder

In [159]:
from category_encoders import BinaryEncoder

be = BinaryEncoder()

be_train_df = be.fit_transform(x_train[['occupation','native.country','workclass']])

be_test_df = be.transform(x_test[['occupation','native.country','workclass']])

In [160]:
x_train = pd.concat([x_train, be_train_df], axis= 1).drop(['occupation','native.country','workclass'], axis = 1)

x_test = pd.concat([x_test, be_test_df], axis= 1).drop(['occupation','native.country','workclass'], axis = 1)
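Binary encoding first assigns each category an integer, then writes that integer in base 2 with one column per bit, so a column with many levels needs far fewer columns than one-hot encoding. The sketch below reproduces the idea in plain pandas. It assumes codes start at 1 in order of first appearance; the exact codes and bit layout produced by `category_encoders` may differ.

```python
import pandas as pd

workclass = pd.Series(
    ["Private", "State-gov", "Private", "Local-gov", "Without-pay"]
)

# Step 1: ordinal codes starting at 1, in order of first appearance (assumption).
codes = {cat: i for i, cat in enumerate(pd.unique(workclass), start=1)}
ordinal = workclass.map(codes)

# Step 2: write each code in binary, most significant bit first.
n_bits = int(ordinal.max()).bit_length()
bits = pd.DataFrame(
    {f"workclass_{b}": (ordinal // 2 ** (n_bits - 1 - b)) % 2 for b in range(n_bits)}
)
print(bits)
```

Under the same assumption, 9 levels need 4 bits and 42 levels need 6, which would be consistent with the `workclass_0..3` and `native.country_0..5` columns visible in the later `x_train.head()` output if missing values are counted as one extra level.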

In [161]:
df.education.unique()

array(['HS-grad', 'Some-college', '7th-8th', '10th', 'Doctorate',
       'Prof-school', 'Bachelors', 'Masters', '11th', 'Assoc-acdm',
       'Assoc-voc', '1st-4th', '5th-6th', '12th', '9th', 'Preschool'],
      dtype=object)

In [162]:
from sklearn.preprocessing import OrdinalEncoder

# Named ord_enc to avoid shadowing Python's built-in ord().
# Categories are listed from the lowest to the highest education level.
ord_enc = OrdinalEncoder(categories= [[
    'Preschool',
    '1st-4th',
    '5th-6th',
    '7th-8th',
    '9th',
    '10th',
    '11th',
    '12th',
    'HS-grad',
    'Some-college',
    'Assoc-voc',
    'Assoc-acdm',
    'Bachelors',
    'Masters',
    'Prof-school',
    'Doctorate'
]
])
x_train[['education']] = ord_enc.fit_transform(x_train[['education']])

x_test[['education']] = ord_enc.transform(x_test[['education']])


In [163]:
x_train.head()

Unnamed: 0,age,fnlwgt,education,education.num,capital.gain,capital.loss,hours.per.week,marital.status_Married-AF-spouse,marital.status_Married-civ-spouse,marital.status_Married-spouse-absent,...,native.country_0,native.country_1,native.country_2,native.country_3,native.country_4,native.country_5,workclass_0,workclass_1,workclass_2,workclass_3
0,41.0,-0.948018,8.0,9.0,0.0,0.0,50.0,0.0,1.0,0.0,...,0,0,0,0,0,1,0,0,0,1
1,29.0,0.282875,12.0,13.0,0.0,0.0,40.0,0.0,1.0,0.0,...,0,0,0,0,0,1,0,0,1,0
2,43.0,0.161131,8.0,9.0,0.0,0.0,78.0,0.0,1.0,0.0,...,0,0,0,0,1,0,0,0,0,1
3,46.0,0.877954,10.0,11.0,0.0,0.0,40.0,0.0,0.0,0.0,...,0,0,0,0,0,1,0,0,1,0
4,30.0,-0.644616,9.0,10.0,0.0,0.0,40.0,0.0,1.0,0.0,...,0,0,0,0,0,1,0,0,1,0
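In the rows shown above, the ordinal-encoded `education` is always one less than `education.num`, which suggests the two columns carry the same information. A quick check on those five rows (values copied from the output above; the same test on the full frame is left as an exercise):

```python
import pandas as pd

# First five rows of x_train.head(), copied from the output above.
sample = pd.DataFrame({
    "education": [8.0, 12.0, 8.0, 10.0, 9.0],
    "education.num": [9.0, 13.0, 9.0, 11.0, 10.0],
})

# A constant offset means one column is a shifted copy of the other.
offset = sample["education.num"] - sample["education"]
is_constant_offset = offset.nunique() == 1
print(is_constant_offset, offset.iloc[0])
```

If the offset is constant on the whole training set as well, one of the two columns could probably be dropped without losing information.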


In [164]:
y_train

0        <=50K
1        <=50K
2        <=50K
3        <=50K
4        <=50K
         ...  
26024    <=50K
26025     >50K
26026     >50K
26027    <=50K
26028    <=50K
Name: income, Length: 26029, dtype: object

# Streamlit link

https://adult-income-analysis-imwvlxqnobjorseupd6kta.streamlit.app/

In [165]:
y_train.value_counts()

income
<=50K    19710
>50K      6319
Name: count, dtype: int64

In [168]:


from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state= 42)

# Resample the training split only. The test split keeps its original class
# balance so that evaluation reflects the real distribution.
x_train_smote, y_train_smote = smote.fit_resample(x_train, y_train)


`BaseEstimator._validate_data` is deprecated in 1.6 and will be removed in 1.7. Use `sklearn.utils.validation.validate_data` instead. This function becomes public and is part of the scikit-learn developer API.



In [169]:
y_train_smote.value_counts()

income
<=50K    19710
>50K     19710
Name: count, dtype: int64
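SMOTE balances the classes by creating synthetic minority rows rather than copying existing ones. Its core step is interpolation between a minority point and one of its minority-class neighbours. The sketch below shows only that step with NumPy on toy points. It omits the nearest-neighbour search that the real algorithm performs, and `synthetic_point` is a hypothetical helper, not an imblearn function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy minority-class points in a 2-D feature space.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0]])

def synthetic_point(x, neighbor, rng):
    """Return a random point on the segment between x and a neighbour."""
    lam = rng.uniform(0.0, 1.0)
    return x + lam * (neighbor - x)

new_point = synthetic_point(minority[0], minority[1], rng)
print(new_point)
```

Synthetic rows like this are not real observations, which is one reason evaluation is usually done on an untouched test split.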