# UCI Adult Income Dataset - Exploratory Data Analysis

This notebook is an exploratory data analysis (EDA) of the UCI Adult Income (Census) dataset. The goal is to examine the structure of the data and the relationships between its features and the income label, to inform later predictive modeling.

## Introduction
The UCI Adult dataset contains demographic information for each person, along with a label indicating whether that person earns more than $50K per year. The sections below look at feature distributions, how each feature relates to the income label, and potential data quality issues that should be handled before modeling.

In [None]:
# Import essential libraries for data analysis and visualization
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Load the Adult Income dataset
data_path = '../data/adult.data.csv'
df = pd.read_csv(data_path, header=None)
df.head()
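
The raw UCI file separates fields with a comma followed by a space and writes unknown values as `?`. If the local CSV keeps that layout, string values load with a leading space and unknowns load as ordinary text. The sketch below shows one way to handle both at read time. The `load_adult` helper, the `COLUMNS` constant, and the two made-up sample rows are illustrative assumptions, not part of this notebook.

```python
import io

import pandas as pd

# Column names in file order (same names this notebook assigns later).
COLUMNS = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
    'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
    'hours-per-week', 'native-country', 'income'
]


def load_adult(source):
    """Hypothetical helper: read an Adult-format CSV with no header row.

    skipinitialspace drops the space after each comma, and na_values
    turns the '?' placeholder into NaN.
    """
    return pd.read_csv(
        source,
        header=None,
        names=COLUMNS,
        skipinitialspace=True,
        na_values='?',
    )


# Two made-up rows in the raw layout, used here instead of the real file.
sample = io.StringIO(
    "39, Private, 100000, Bachelors, 13, Never-married, Sales, "
    "Not-in-family, White, Male, 0, 0, 40, United-States, <=50K\n"
    "54, ?, 200000, HS-grad, 9, Married-civ-spouse, ?, "
    "Husband, White, Male, 0, 0, 60, United-States, >50K\n"
)
sample_df = load_adult(sample)
print(sample_df[['age', 'workclass', 'occupation', 'income']])
```

With the real file, the same call would take `data_path` in place of the in-memory sample.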

## Data Preparation and Feature Naming

The file is read without a header row, so the columns are initially numbered 0 to 14. In this section, we assign meaningful column names for easier analysis and inspect the basic structure of the data.

In [None]:
# Assign descriptive column names to the dataset
df.columns = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
    'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
    'hours-per-week', 'native-country', 'income'
]

# Display the first few rows with new column names
df.head()

In [None]:
# Inspect the dataset structure and data types
df.info()

In [None]:
# View summary statistics for numerical features
df.describe()

In [None]:
# Check for missing values in each column
# Note: isnull() counts only NaN; the raw UCI file marks unknown values with '?'
df.isnull().sum()
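
`isnull()` only counts true NaN values. The raw UCI file writes unknown entries as `?`, so a count of zero here does not necessarily mean the data is complete. The sketch below counts those placeholders in the text columns, assuming values may still carry leading spaces. The `count_placeholders` helper and the toy frame are illustrative, not part of this notebook.

```python
import pandas as pd


def count_placeholders(frame, token='?'):
    """Hypothetical helper: count cells equal to `token` in each non-numeric column."""
    text_cols = [
        col for col in frame.columns
        if not pd.api.types.is_numeric_dtype(frame[col])
    ]
    # Strip surrounding whitespace so ' ?' and '?' are both counted.
    return frame[text_cols].apply(lambda col: (col.str.strip() == token).sum())


# Made-up rows for illustration only.
toy = pd.DataFrame({
    'age': [39, 50, 28],
    'workclass': [' Private', ' ?', ' Private'],
    'occupation': [' ?', ' ?', ' Sales'],
})
placeholder_counts = count_placeholders(toy)
print(placeholder_counts)
```

On the real data, `count_placeholders(df)` would show which columns need imputation or row filtering.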

In [None]:
# Examine the distribution of the target variable
df['income'].value_counts()
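
Raw counts make it harder to judge how unbalanced the two classes are. Passing `normalize=True` to `value_counts` returns each class's share of the rows instead. A minimal sketch on made-up labels (the `toy_income` series is illustrative):

```python
import pandas as pd

# Made-up labels for illustration only.
toy_income = pd.Series(['<=50K', '<=50K', '<=50K', '>50K'], name='income')

# normalize=True returns fractions that sum to 1 instead of raw counts.
class_share = toy_income.value_counts(normalize=True)
print(class_share)
```

On the real data, the equivalent call would be `df['income'].value_counts(normalize=True)`.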

In [None]:
# Visualize the distribution of income classes
sns.countplot(x='income', data=df)
plt.title('Income Distribution')
plt.xlabel('Income')
plt.ylabel('Count')
plt.show()

In [None]:
# Explore the relationship between age and income classes
sns.boxplot(x='income', y='age', data=df)
plt.title('Boxplot of Age by Income Class')
plt.xlabel('Income')
plt.ylabel('Age')
plt.show()
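
A boxplot shows the shape of each group but not the exact numbers. A grouped summary can put figures next to the plot. The sketch below computes the median, mean, and row count of `age` per income class on a made-up frame (the `toy` data is illustrative):

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    'age': [25, 35, 45, 40, 50, 60],
    'income': ['<=50K', '<=50K', '<=50K', '>50K', '>50K', '>50K'],
})

# One row per income class, one column per statistic.
age_summary = toy.groupby('income')['age'].agg(['median', 'mean', 'count'])
print(age_summary)
```

On the real data, the equivalent call would be `df.groupby('income')['age'].agg(['median', 'mean', 'count'])`.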

In [None]:
# Visualize the distribution of occupations by income class
plt.figure(figsize=(12, 8))
sns.countplot(y='occupation', hue='income', data=df)
plt.title('Occupation Distribution by Income Class')
plt.xlabel('Count')
plt.ylabel('Occupation')
plt.show()

In [None]:
# Visualize the distribution of education levels by income class
plt.figure(figsize=(12, 8))
sns.countplot(y='education', hue='income', data=df)
plt.title('Education Level Distribution by Income Class')
plt.xlabel('Count')
plt.ylabel('Education Level')
plt.show()
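
Count plots are dominated by the largest categories, which makes it hard to compare the share of high earners across levels. A row-normalized cross-tabulation gives that share directly. A minimal sketch on made-up rows (the `toy` frame is illustrative):

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    'education': ['HS-grad', 'HS-grad', 'HS-grad', 'HS-grad', 'Bachelors', 'Bachelors'],
    'income': ['<=50K', '<=50K', '<=50K', '>50K', '<=50K', '>50K'],
})

# normalize='index' makes each row (education level) sum to 1.
income_share = pd.crosstab(toy['education'], toy['income'], normalize='index')
print(income_share)
```

The same pattern applies to the other categorical features in this notebook, such as `occupation` or `marital-status`.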

In [None]:
# Visualize the distribution of marital status by income class
plt.figure(figsize=(12, 8))
sns.countplot(y='marital-status', hue='income', data=df)
plt.title('Marital Status Distribution by Income Class')
plt.xlabel('Count')
plt.ylabel('Marital Status')
plt.show()

In [None]:
# Visualize the distribution of relationship types by income class
plt.figure(figsize=(12, 8))
sns.countplot(y='relationship', hue='income', data=df)
plt.title('Relationship Type Distribution by Income Class')
plt.xlabel('Count')
plt.ylabel('Relationship Type')
plt.show()

In [None]:
# Visualize the distribution of race by income class
plt.figure(figsize=(12, 8))
sns.countplot(y='race', hue='income', data=df)
plt.title('Race Distribution by Income Class')
plt.xlabel('Count')
plt.ylabel('Race')
plt.show()

In [None]:
# Visualize the distribution of sex by income class
plt.figure(figsize=(12, 8))
sns.countplot(y='sex', hue='income', data=df)
plt.title('Sex Distribution by Income Class')
plt.xlabel('Count')
plt.ylabel('Sex')
plt.show()

In [None]:
# Visualize the distribution of native country by income class
plt.figure(figsize=(12, 8))
sns.countplot(y='native-country', hue='income', data=df)
plt.title('Native Country Distribution by Income Class')
plt.xlabel('Count')
plt.ylabel('Native Country')
plt.show()
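
A feature with many categories can produce a crowded plot, especially if a single value accounts for most rows. One option is to keep the most frequent categories and group the rest under a single label before plotting. The `lump_rare` helper, its `top_n` and `other_label` parameters, and the toy series below are illustrative assumptions, not part of this notebook.

```python
import pandas as pd


def lump_rare(series, top_n=5, other_label='Other'):
    """Hypothetical helper: keep the top_n most frequent values, relabel the rest."""
    top_values = series.value_counts().nlargest(top_n).index
    # where() keeps entries that satisfy the condition and replaces the others.
    return series.where(series.isin(top_values), other_label)


# Made-up values for illustration only.
toy_country = pd.Series(
    ['United-States'] * 5 + ['Mexico'] * 2 + ['India', 'Japan'],
    name='native-country',
)
lumped = lump_rare(toy_country, top_n=2)
print(lumped.value_counts())
```

Note that this sketch also relabels missing values as `Other`, so any NaN handling should happen first.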