# Income Prediction Analysis

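This notebook walks through an income-classification workflow modeled on the UCI Adult Income dataset: exploring the data, training a model that predicts whether a person earns more or less than $50,000, and inspecting which features the model relies on. The data used below is randomly generated dummy data with Adult-style columns, so the reported numbers illustrate the workflow rather than real income patterns.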

## Project setup

Install the required packages:

In [None]:
!pip install pandas numpy matplotlib seaborn scikit-learn jupyter faker

Import the necessary modules:

In [None]:
import pandas as pd
import numpy as np
from faker import Faker
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

## Data Exploration

Create the DataFrame:

In [None]:

# Initialize Faker
fake = Faker()

# Define the number of rows
num_rows = 4000

# Generate the dummy data
data = {
    'Age': np.random.randint(18, 70, size=num_rows),
    'Workclass': np.random.choice(['Private', 'Self-emp', 'Government', 'Unemployed', 'Unknown'], size=num_rows),
    'fnlwgt': np.random.randint(10000, 1000000, size=num_rows),
    'Education': np.random.choice(['Bachelors', 'Masters', 'PhD', 'HS-grad', 'Assoc-acdm', 'Assoc-voc', '9th', '7th-8th', '12th', 'Doctorate', 'Prof-school', 'Some-college'], size=num_rows),
    'EducationNum': np.random.randint(1, 16, size=num_rows),
    'MaritalStatus': np.random.choice(['Married-civ-spouse', 'Divorced', 'Never-married', 'Separated', 'Widowed', 'Married-spouse-absent', 'Married-AF-spouse'], size=num_rows),
    'Occupation': np.random.choice(['Tech-support', 'Craft-repair', 'Other-service', 'Sales', 'Exec-managerial', 'Prof-specialty', 'Handlers-cleaners', 'Machine-op-inspct', 'Adm-clerical', 'Farming-fishing', 'Transport-moving', 'Priv-house-serv', 'Protective-serv', 'Armed-Forces'], size=num_rows),
    'Relationship': np.random.choice(['Wife', 'Own-child', 'Husband', 'Not-in-family', 'Other-relative', 'Unmarried'], size=num_rows),
    'Race': np.random.choice(['White', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo', 'Other', 'Black'], size=num_rows),
    'Sex': np.random.choice(['Male', 'Female'], size=num_rows),
    'HoursPerWeek': np.random.randint(1, 99, size=num_rows),
    'NativeCountry': np.random.choice(['United-States', 'Cambodia', 'England', 'Puerto-Rico', 'Canada', 'Germany', 'Outlying-US(Guam-USVI-etc)', 'India', 'Japan', 'Greece', 'South', 'China', 'Cuba', 'Iran', 'Honduras', 'Philippines', 'Italy', 'Poland', 'Jamaica', 'Vietnam', 'Mexico', 'Brazil', 'Portugal', 'Ireland', 'France', 'Dominican-Republic', 'Laos', 'Ecuador', 'Taiwan', 'Haiti', 'Columbia', 'Hungary', 'Guatemala', 'Nicaragua', 'Scotland', 'Thailand', 'Yugoslavia', 'El-Salvador', 'Trinadad&Tobago', 'Peru', 'Hong', 'Holand-Netherlands'], size=num_rows),
    'Income': np.random.choice(['<=50K', '>50K'], size=num_rows)
}

# Create DataFrame
df = pd.DataFrame(data)

# Save to CSV
df.to_csv('dummy_income_data.csv', index=False)

print("Dummy data created and saved to 'dummy_income_data.csv'")
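
The dummy data above stands in for the real dataset. If a local copy of the UCI adult.data file is available, it could be loaded along the following lines. This is only a sketch under assumptions: the helper name load_adult is hypothetical, the column names are chosen here to match the dummy DataFrame (the raw file has no header row and also contains capital-gain and capital-loss columns), and two made-up rows in the file's layout replace a real path so that the snippet runs on its own.

```python
import io
import pandas as pd

# Column order of the raw adult.data file; the names are chosen to match the dummy data
ADULT_COLUMNS = [
    'Age', 'Workclass', 'fnlwgt', 'Education', 'EducationNum',
    'MaritalStatus', 'Occupation', 'Relationship', 'Race', 'Sex',
    'CapitalGain', 'CapitalLoss', 'HoursPerWeek', 'NativeCountry', 'Income',
]

def load_adult(source):
    """Read an adult.data-style CSV (no header, ', ' separated, '?' marks missing)."""
    return pd.read_csv(
        source,
        header=None,
        names=ADULT_COLUMNS,
        skipinitialspace=True,  # fields are separated by ', '
        na_values='?',          # missing entries are written as '?'
    )

# Two made-up rows in the raw layout, used in place of a real path such as 'adult.data'
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, "
    "Husband, Asian-Pac-Islander, Male, 0, 0, 60, South, >50K\n"
)
real_df = load_adult(sample)
print(real_df.isnull().sum())
```

Reading '?' as missing at load time means the isnull() and dropna() steps later in the notebook would see those entries.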


Explore the data:

In [None]:
# Display basic information
df.info()  # prints its summary directly and returns None, so no print() is needed
print(df.describe())


In [None]:
df.head(10)

Check for missing values:

In [None]:
print(df.isnull().sum())

Visualize the data:

In [None]:

# Plot income distribution by sex
sns.countplot(x='Sex', hue='Income', data=df)
plt.title('Income Distribution by Sex')
plt.show()

# Plot income by education level
plt.figure(figsize=(12, 6))
sns.countplot(y='Education', hue='Income', data=df)
plt.title('Income by Education Level')
plt.show()


## Data Preprocessing

Handle missing values:

In [None]:
# Drop rows with missing values
df.dropna(inplace=True)


Before building a predictive model, we need to preprocess the data. This involves:

1. Encoding categorical variables (since most machine learning algorithms require numerical inputs)

2. Splitting the data into training and testing sets

In [None]:
# Encode categorical variables
# Fit one LabelEncoder per column and keep it, so the integer codes can be
# mapped back to the original labels later with inverse_transform
categorical_cols = ['Workclass', 'Education', 'MaritalStatus', 'Occupation',
                    'Relationship', 'Race', 'Sex', 'NativeCountry']
encoders = {}
for col in categorical_cols:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# Map the target explicitly so that 1 always means '>50K'
df['Income'] = df['Income'].map({'<=50K': 0, '>50K': 1})

# Split data into features and target
X = df.drop('Income', axis=1)
y = df['Income']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
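
Label encoding assigns integer codes in alphabetical order, which imposes an ordering on categories that have none. One common alternative is one-hot encoding inside a scikit-learn pipeline, so that the encoder is fitted on the training rows only. The sketch below assumes a small made-up frame with a subset of the notebook's columns; the names demo, preprocess, and pipeline are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Small made-up frame with a subset of the notebook's columns
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    'Age': rng.integers(18, 70, size=n),
    'HoursPerWeek': rng.integers(1, 99, size=n),
    'Workclass': rng.choice(['Private', 'Self-emp', 'Government'], size=n),
    'Sex': rng.choice(['Male', 'Female'], size=n),
    'Income': rng.choice(['<=50K', '>50K'], size=n),
})

X_demo = demo.drop('Income', axis=1)
y_demo = demo['Income'].map({'<=50K': 0, '>50K': 1})
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# One-hot encode the categorical columns; numeric columns pass through unchanged
preprocess = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['Workclass', 'Sex']),
    ],
    remainder='passthrough',
)

pipeline = Pipeline([
    ('preprocess', preprocess),
    ('model', RandomForestClassifier(n_estimators=50, random_state=42)),
])

# The encoder is fitted on the training rows only, then reused for the test rows
pipeline.fit(X_tr, y_tr)
demo_pred = pipeline.predict(X_te)
print(demo_pred[:10])
```

Setting handle_unknown='ignore' lets the pipeline score rows whose category never appeared in training, which matters for sparse columns such as NativeCountry.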


## Build and Evaluate a Predictive Model
Now, let's build a predictive model. For simplicity, we'll use a RandomForestClassifier. You can swap in other classifiers and compare them to see which performs best.

In [None]:
# Initialize the model
model = RandomForestClassifier(random_state=42)

# Train the model
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

# Classification report
print(classification_report(y_test, y_pred))

# Confusion matrix
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
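
To follow up on the suggestion to try other classifiers, several models can be compared with cross-validation instead of a single train/test split. The sketch below is an assumption-laden illustration: it uses make_classification as stand-in data (in the notebook, X_train and y_train would take its place), and the choice of three candidate models is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; in the notebook this would be X_train and y_train
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)

candidates = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'RandomForest': RandomForestClassifier(n_estimators=50, random_state=42),
    'GradientBoosting': GradientBoostingClassifier(random_state=42),
}

# 5-fold cross-validated accuracy for each candidate
cv_results = {}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring='accuracy')
    cv_results[name] = scores.mean()
    print(f'{name}: {scores.mean():.3f} (+/- {scores.std():.3f})')
```

Averaging over folds gives a steadier comparison than one split, at the cost of fitting each model five times.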


## Interpret the Results


### Accuracy:

The overall accuracy of the model is 0.48: it predicts the correct income class (<=50K or >50K) for about 48% of the test rows. That is no better than the roughly 50% expected from guessing, which is the expected outcome here because the Income labels in the dummy data are drawn at random, independently of every feature. Accuracy also hides per-class behavior, so the sections below look at precision, recall, and F1-score for each income class.
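
One way to put an accuracy figure in context is to compare it with a trivial baseline. The sketch below assumes random binary labels, generated the same way as the Income column in the dummy data and split 3200/800 to mirror the 80/20 split, and uses scikit-learn's DummyClassifier, which always predicts the most frequent training class. The variable names are illustrative.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
# Random binary labels, mimicking how 'Income' is generated in the dummy data
y_tr = rng.integers(0, 2, size=3200)
y_te = rng.integers(0, 2, size=800)
# DummyClassifier ignores the features, so placeholder columns are enough
X_tr = np.zeros((len(y_tr), 1))
X_te = np.zeros((len(y_te), 1))

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_tr, y_tr)
baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
print(f'Baseline accuracy: {baseline_acc:.2f}')
```

A model that cannot clearly beat this baseline has not learned anything usable from the features.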

### Precision, Recall, and F1-score:

Class 0 (<=50K):

Precision (0.48): Out of all instances predicted as <=50K, 48% are actually <=50K.

Recall (0.56): Out of all instances that are actually <=50K, the model correctly identifies 56%.

F1-score (0.52): The harmonic mean of precision and recall for <=50K.

Class 1 (>50K):

Precision (0.47): Out of all instances predicted as >50K, 47% are actually >50K.

Recall (0.39): Out of all instances that are actually >50K, the model correctly identifies 39%.

F1-score (0.43): The harmonic mean of precision and recall for >50K.

### Interpretation of Precision, Recall, and F1-score:

Precision is close to 0.5 for both classes (0.48 for <=50K and 0.47 for >50K), so a prediction of either class is correct slightly less than half the time.

Recall shows that the model finds <=50K instances more often than >50K instances (56% vs. 39%), which means it leans toward predicting <=50K.

The F1-scores (0.52 and 0.43), which balance precision and recall, leave substantial room for improvement, especially for the >50K class.
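
The F1-scores follow directly from the reported precision and recall, since F1 is their harmonic mean. The short check below recomputes them; the helper name f1_from is introduced here for illustration.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported above for each class
f1_class0 = f1_from(0.48, 0.56)
f1_class1 = f1_from(0.47, 0.39)

print(f'F1 (<=50K): {f1_class0:.2f}')  # 0.52
print(f'F1 (>50K):  {f1_class1:.2f}')  # 0.43
```

Because the harmonic mean is pulled toward the smaller of the two values, the low recall for >50K drags its F1-score down even though its precision is similar to that of <=50K.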

## Feature Importance
Let's analyze the feature importance to understand which factors are influencing the model's predictions:

In [None]:
# Feature importance
feature_importance = model.feature_importances_
features = X.columns
importance_df = pd.DataFrame({'Feature': features, 'Importance': feature_importance})

importance_df


## Interpretation of Feature Importance
These scores are the impurity-based importances reported by the random forest, and they sum to 1. Because the Income labels in the dummy data are random, no feature carries real signal here, and impurity-based importance tends to favor features with many distinct values. The ranking below follows that pattern, so the real-world explanations given for each feature are expectations to verify on the actual Adult data, not findings from this run.

fnlwgt: This feature has the highest importance (0.140). However, fnlwgt (final weight) is a sampling weight in the UCI Adult dataset rather than a direct predictor of income. Its top rank here is consistent with it having by far the most distinct values of any column in the generated data.

HoursPerWeek: The number of hours worked per week (0.125) ranks second. In real data, individuals who work longer hours are generally expected to have higher incomes, assuming hourly pay rates are stable.

Age: Age (0.118) ranks third. In real data, older individuals often earn more due to accumulated experience and career progression.

NativeCountry: Native country (0.114) has the most categories of any categorical column, which fits its high rank here. In real data it could reflect varying economic conditions and opportunities across countries.

EducationNum: Education level represented numerically (0.086) comes next. Higher education typically correlates with higher income levels.

Occupation: Occupation (0.084) follows closely. Certain professions are associated with higher incomes (e.g., managerial roles versus service jobs).

Education: The categorical representation of education (0.083) scores almost the same as EducationNum. Specific degrees or qualifications (e.g., PhD versus high school diploma) affect income potential.

MaritalStatus: Marital status (0.065) may influence income in real data, potentially due to joint incomes in married couples or stability in financial decisions.

Relationship: Relationship status (0.059) may indicate social and economic support networks, which can affect income stability.

Race and Workclass: Race (0.054) and work class (0.051) rank near the bottom here. In real data they can reflect societal disparities and differences in economic opportunity across demographic groups and employment sectors.

Sex: Finally, sex (0.021) has the smallest importance, in line with it having only two values. This could differ significantly in real-world datasets, where gender disparities in income are often more pronounced.

In [None]:
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Plotting feature importance
plt.figure(figsize=(10, 8))
sns.barplot(x='Importance', y='Feature', data=importance_df)
plt.title('Feature Importance')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()
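
Impurity-based importances can rank a column highly just because it has many distinct values. Permutation importance, computed on held-out data, is one commonly used cross-check. The sketch below is illustrative only: it builds made-up data with one high-cardinality noise column (noise_id) and one informative column (signal), both hypothetical names, and compares the two importance measures side by side.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
# Made-up data: one high-cardinality noise column and one informative binary column
demo_X = pd.DataFrame({
    'noise_id': rng.integers(10000, 1000000, size=n),  # many distinct values, no signal
    'signal': rng.integers(0, 2, size=n),              # drives the label
})
# The label copies 'signal' but is flipped for about 10% of the rows
flip = rng.random(n) < 0.1
demo_y = np.where(flip, 1 - demo_X['signal'], demo_X['signal'])

X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42
)
forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each column of the held-out data and measure the drop in accuracy
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=42)
perm_df = pd.DataFrame({
    'Feature': demo_X.columns,
    'Impurity': forest.feature_importances_,
    'Permutation': result.importances_mean,
}).sort_values(by='Permutation', ascending=False)
print(perm_df)
```

In the notebook, the same call could be applied to model, X_test, and y_test to see whether fnlwgt keeps its top rank once importance is measured on held-out rows.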


## Conclusion
Feature importance shows which inputs a model relies on most when classifying income. On the dummy data used here, the near-chance accuracy means the ranking carries no real-world meaning. On the actual Adult data, the same workflow can guide further analysis, model refinement, and policy decisions aimed at addressing income disparities and enhancing economic opportunities.

In practical applications, selecting and auditing these features carefully, particularly demographic ones such as race and sex, can improve accuracy and fairness and helps keep decisions based on income prediction models robust and equitable.