# Predicting High‑Income (> $50K) in the Adult Dataset

## Introduction to Problem & Data

#### **Problem Statement**:

Accurately classifying individuals as high-income earners (annual income > $50,000) matters not only for socioeconomic research and policy design, but also for students like me who are planning a career. As I navigate college life, juggling rigorous coursework, part-time work, and internship applications, knowing which factors actually relate to salary outcomes brings clarity to my own decision-making. Should I invest extra time in a demanding major? Would pursuing a graduate degree or specialized certifications yield a meaningful return? This project uses data to address these questions, so that my peers and I can make better-informed decisions about future pathways.

In this analysis, I leverage the UCI Adult dataset to build and compare classification models that predict whether an individual earns more than $50K per year based on a suite of demographic, educational, and occupational attributes. By systematically exploring and modeling these factors, I gain hands‑on experience with end‑to‑end data science workflows—skills that will bolster my resume and portfolio for future employers. More importantly, this work equips me and my peers with evidence on how investments in education, work habits, and financial decisions can translate into tangible income differences.

Concretely, my goals include:

**Identifying Key Drivers**: Uncover which attributes—such as education level, hours worked per week, or capital gains—most strongly predict high earnings. These insights can inform my decisions about double‑majoring, seeking part‑time roles, or even experimenting with small‑scale investments.

**Model Comparison**: Evaluate a variety of algorithms (Logistic Regression, Decision Trees, Random Forests, K‑Nearest Neighbors) to determine not just which predictor variables matter, but how different modeling approaches perform on realistic socio‑economic data. This deepens my understanding of algorithm strengths and limitations in a context that mirrors real‑world business analytics.

**Fairness and Equity Considerations**: Assess model performance across subgroups (gender, race, etc.) to raise awareness of potential biases—an essential perspective for any data scientist committed to ethical practice.
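The subgroup assessment described in the last goal can be sketched in a few lines of pandas. This is only an illustration under stated assumptions: `subgroup_report` is a hypothetical helper, and the labels, predictions, and group values below are made up rather than taken from any model in this report.

```python
import pandas as pd

def subgroup_report(y_true, y_pred, groups):
    """Per-group sample size, accuracy, and recall for a binary classifier."""
    frame = pd.DataFrame({"y_true": y_true, "y_pred": y_pred, "group": groups})
    frame["correct"] = frame["y_true"] == frame["y_pred"]
    accuracy = frame.groupby("group")["correct"].mean()
    # Recall: among true high earners, the share predicted as high earners
    positives = frame[frame["y_true"] == 1]
    recall = positives.groupby("group")["y_pred"].mean()
    return pd.DataFrame({
        "n": frame.groupby("group").size(),
        "accuracy": accuracy,
        "recall": recall,
    })

# Toy labels and predictions (made up for illustration, not model output)
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
sex = ["Female"] * 4 + ["Male"] * 4

report = subgroup_report(y_true, y_pred, sex)
print(report)
```

In this toy case both groups have the same accuracy but different recall, which is the kind of gap an overall score would hide.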


Through this project, I not only refine my technical skills in Python, scikit-learn, data visualization, and model training, but also develop a richer understanding of how personal and socioeconomic choices intersect. The rest of this report details the data, analytic methods, results, and actionable next steps for translating these findings into career and financial planning strategies.

# 2. Data Description

**2.1 Source and Collection**

My analysis is grounded in the Adult dataset (also known as Census Income), which was extracted from the 1994 U.S. Census database and is distributed through the UCI Machine Learning Repository. It comprises 48,842 anonymized records with 15 columns (14 attributes plus the income label) covering demographics, education, work habits, and financial indicators, and it offers a snapshot of the mid-1990s American labor force.

While the dataset’s breadth supports modeling of income determinants, it also reflects historical biases in race, gender, and regional representation. Missing values (written as “?” in the raw files) occur in the workclass, occupation, and native-country fields. I chose to remove the affected records, about 2.5% of rows (1,221 of 48,842, leaving 47,621), to preserve the fidelity of categorical encodings, with the understanding that this step may underrepresent certain minority subgroups.

I loaded the data from the UCI repository (https://archive.ics.uci.edu/ml/datasets/Adult) using the ucimlrepo package. After initial validation and deduplication, I performed an 80/20 stratified split, which keeps the proportion of each income label the same in both parts, to create training and hold-out sets. This setup supports reproducible modeling and a fair evaluation of classification performance.
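A minimal sketch of an 80/20 stratified split with scikit-learn's `train_test_split` follows. The toy frame and the `random_state` value are assumptions made for illustration; they are not the settings used for the results in this report.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the cleaned data (values are made up)
toy = pd.DataFrame({
    "hours-per-week": list(range(100)),
    "income": [1] * 25 + [0] * 75,  # 25% positive class
})

X_toy = toy.drop(columns="income")
y_toy = toy["income"]

X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy,
    test_size=0.20,   # 80/20 split
    stratify=y_toy,   # keep the class ratio in both parts
    random_state=42,  # assumed seed, only for reproducibility
)

print(len(X_train), len(X_test))
print(y_train.mean(), y_test.mean())
```

Because of `stratify`, the share of positive labels is the same in the training and hold-out parts, which a plain random split does not guarantee.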

**2.2 Attributes and Types**

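The dataset includes both numerical and categorical features:

**Numerical (int64):** age, fnlwgt, education-num, capital-gain, capital-loss, hours-per-week

**Categorical (object):** workclass, education, marital-status, occupation, relationship, race, sex, native-country, and the target income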
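One way to list which columns fall into each group is pandas' `select_dtypes`. The sketch below uses a small stand-in frame (the name `sample` and its handful of columns are assumptions for illustration), but the same two calls should work on the full DataFrame.

```python
import pandas as pd

# Small stand-in with the same kinds of columns as the Adult data
sample = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private"],
    "education-num": [13, 13, 9],
    "sex": ["Male", "Male", "Male"],
})

# Numeric columns versus everything else (treated here as categorical)
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```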

#### 2.3 Data Pre-Processing & Preliminary Examination:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import os

In [2]:
pip install ucimlrepo #install the ucimlrepo package

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7


In [5]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
census_income = fetch_ucirepo(id=20)

# data (as pandas dataframes)
X = census_income.data.features
y = census_income.data.targets

In [6]:
data = pd.concat([X, y], axis=1)

In [7]:
# metadata
print(census_income.metadata)

# variable information
print(census_income.variables) #code taken from the original dataset source to import the dataset into my code

{'uci_id': 20, 'name': 'Census Income', 'repository_url': 'https://archive.ics.uci.edu/dataset/20/census+income', 'data_url': 'https://archive.ics.uci.edu/static/public/20/data.csv', 'abstract': 'Predict whether income exceeds $50K/yr based on census data.  Also known as Adult dataset.', 'area': 'Social Science', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 48842, 'num_features': 14, 'feature_types': ['Categorical', 'Integer'], 'demographics': ['Age', 'Income', 'Education Level', 'Other', 'Race', 'Sex'], 'target_col': ['income'], 'index_col': None, 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 1996, 'last_updated': 'Mon Sep 09 2024', 'dataset_doi': '10.24432/C5GP7S', 'creators': ['Ron Kohavi'], 'intro_paper': None, 'additional_info': {'summary': 'Extraction was done by Barry Becker from the 1994 Census database.  A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && 

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       47879 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education-num   48842 non-null  int64 
 5   marital-status  48842 non-null  object
 6   occupation      47876 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital-gain    48842 non-null  int64 
 11  capital-loss    48842 non-null  int64 
 12  hours-per-week  48842 non-null  int64 
 13  native-country  48568 non-null  object
 14  income          48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


In [9]:
#since there are some missing values in the dataset, I'll replace ' ?' with NaN
data.replace(' ?', pd.NA, inplace=True)

#drop rows with missing values
data.dropna(inplace=True)

#verify no missing values remain
print(data.isnull().sum())

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income            0
dtype: int64
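A check of zero nulls only covers values pandas already treats as missing. `replace(' ?', ...)` matches that exact string, leading space included, so a bare `?` would survive it. Whether the loaded frame contains such tokens is an assumption worth testing, not something shown above. The sketch below uses a made-up frame and a hypothetical helper, `placeholder_mask`, to count placeholders regardless of surrounding whitespace.

```python
import pandas as pd

# Toy frame: one bare "?", one " ?" with a leading space, one real missing value
toy_missing = pd.DataFrame({
    "workclass": ["Private", "?", " ?", None],
    "occupation": ["Sales", "Sales", "Tech-support", "Sales"],
})

def placeholder_mask(frame, token="?"):
    """True where a cell equals `token` once surrounding whitespace is stripped."""
    stripped = frame.apply(lambda col: col.astype("string").str.strip())
    return stripped.eq(token).fillna(False).astype(bool)

mask = placeholder_mask(toy_missing)
placeholder_counts = mask.sum()
print(placeholder_counts)

# Blank out the placeholders, then drop every incomplete row
cleaned = toy_missing.mask(mask).dropna()
print(len(cleaned))
```

If the counts come back as zero on the real data, the cleaning cell above already did its job; if not, the remaining rows can be dropped the same way.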


In [10]:
#drop less relevant columns (e.g., 'fnlwgt' is a sample weight)
data.drop('fnlwgt', axis=1, inplace=True)

In [11]:
#remove trailing periods and strip whitespace from all string columns in the DataFrame
for col in data.columns:
    if data[col].dtype == 'object':
        data[col] = data[col].str.strip().str.rstrip('.')

#verify cleaning
print(data.head())
print(data['income'].unique())  #should output ['<=50K', '>50K']

   age         workclass  education  education-num      marital-status  \
0   39         State-gov  Bachelors             13       Never-married   
1   50  Self-emp-not-inc  Bachelors             13  Married-civ-spouse   
2   38           Private    HS-grad              9            Divorced   
3   53           Private       11th              7  Married-civ-spouse   
4   28           Private  Bachelors             13  Married-civ-spouse   

          occupation   relationship   race     sex  capital-gain  \
0       Adm-clerical  Not-in-family  White    Male          2174   
1    Exec-managerial        Husband  White    Male             0   
2  Handlers-cleaners  Not-in-family  White    Male             0   
3  Handlers-cleaners        Husband  Black    Male             0   
4     Prof-specialty           Wife  Black  Female             0   

   capital-loss  hours-per-week native-country income  
0             0              40  United-States  <=50K  
1             0              13  U
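The cleaning cell implies that some labels differ only by whitespace or a trailing period. A small sketch on made-up strings shows why both steps matter: without them, the same class would appear under several spellings and the later 0/1 mapping would miss some rows.

```python
import pandas as pd

# Toy label column mixing the spellings (made-up rows)
labels = pd.Series([" <=50K", ">50K.", "<=50K.", " >50K "])

# Strip whitespace first, then remove any trailing period
cleaned_labels = labels.str.strip().str.rstrip(".")
print(cleaned_labels.unique().tolist())  # ['<=50K', '>50K']
```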

In [12]:
#convert income categories to numerical values (0/1)
data['income'] = data['income'].map({'<=50K': 0, '>50K': 1}).astype(int)

In [16]:
print(data['income'].value_counts(), "\n")

income
0    36080
1    11541
Name: count, dtype: int64 
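These counts show a clear class imbalance. As a quick worked calculation, the sketch below derives the share of high earners and the accuracy of a trivial classifier that always predicts the majority class. That baseline is my own reference point for judging models, not a result from a fitted model.

```python
# Class counts reported by value_counts() above
n_low, n_high = 36080, 11541
total = n_low + n_high

share_high = n_high / total
baseline_accuracy = max(n_low, n_high) / total  # always predict the majority class

print(total)                        # 47621
print(round(share_high, 3))         # 0.242
print(round(baseline_accuracy, 3))  # 0.758
```

Any model therefore has to exceed roughly 76% accuracy to improve on always guessing the ≤ $50K class.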



# **Dataset Description:**

The revised dataset comprises **47,621** adult records drawn from the 1994 U.S. Census database. Each entry is labeled according to annual income:

**Class 0 (≤ $50 K):** 36,080 individuals

**Class 1 (> $50 K):** 11,541 individuals

In addition to the target label, there are 13 predictor variables (the 14 original attributes minus the dropped fnlwgt) capturing a socioeconomic profile:

**Demographics:** age (17–90), sex, race, native country

**Education:** highest level (e.g. HS-grad, Bachelors, Doctorate), along with its numeric encoding, education-num

**Employment:** workclass (Private, Government, Self‑employed, etc.), occupation, hours worked per week (1–99)

**Financial indicators:** capital‑gain (0–99,999 USD), capital‑loss (0–4,356 USD)

**Family & social:** marital status, relationship status

Missing entries (about 2.5% of rows) were removed, and the fnlwgt sample-weight column was dropped as it does not represent an individual attribute. String values have been cleaned of trailing punctuation and whitespace, and the binary income label is encoded as 0/1 for use in modeling. This mixture of demographic, educational, occupational, and financial features will serve as the basis for my classification models.
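Because the predictors mix numeric and categorical columns, they need different preparation before most scikit-learn models can use them. The sketch below shows one common arrangement, assuming scaling for numeric columns and one-hot encoding for categorical ones. The toy rows, the column subset, and the choice of `LogisticRegression` are illustrative assumptions, not the final setup of this report.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows standing in for the cleaned data (values are made up)
toy = pd.DataFrame({
    "age": [25, 38, 52, 46, 29, 61, 33, 44],
    "hours-per-week": [40, 50, 60, 45, 20, 40, 35, 55],
    "education": ["HS-grad", "Bachelors", "Doctorate", "Bachelors",
                  "HS-grad", "Bachelors", "HS-grad", "Doctorate"],
    "income": [0, 1, 1, 1, 0, 0, 0, 1],
})
features = toy.drop(columns="income")
target = toy["income"]

numeric_cols = ["age", "hours-per-week"]
categorical_cols = ["education"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),  # zero mean, unit variance
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),  # one column per category
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(features, target)

# Two scaled numeric columns plus three one-hot education columns
encoded = model.named_steps["preprocess"].transform(features)
print(encoded.shape)
predictions = model.predict(features)
```

Keeping the preprocessing inside the pipeline means it is fitted on training data only, so nothing from the hold-out set leaks into scaling or encoding.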