# Census Income Prediction

## Introduction About the Data

The prediction task is to determine whether a person earns more than 50K a year (binary classification).

There are 14 independent variables:

- age: continuous.
- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- fnlwgt: continuous.
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- education-num: continuous.
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- sex: Female, Male.
- capital-gain: continuous.
- capital-loss: continuous.
- hours-per-week: continuous.
- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

Target Variable:
- income: >50K, <=50K.

Dataset Source Link: https://archive.ics.uci.edu/ml/datasets/census+income
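
The variable list above can be turned into a loader for the raw, header-less UCI file. This is a minimal sketch under assumptions: the helper `load_adult` and the `COLUMNS` list are hypothetical names, the two sample rows are illustrative rather than taken from the dataset, and the notebook itself reads a CSV whose header may use different column names.

```python
import io
import pandas as pd

# Column names follow the variable list above (assumed spelling);
# the raw UCI file has no header row.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]

def load_adult(source):
    """Read a header-less, comma-separated Adult file into a DataFrame."""
    return pd.read_csv(
        source,
        names=COLUMNS,
        skipinitialspace=True,  # raw values are written as ", value"
        na_values="?",          # "?" marks a missing entry
    )

# Two illustrative rows in the raw layout, used only to exercise the loader.
sample = io.StringIO(
    "30, Private, 100000, Bachelors, 13, Never-married, Sales, "
    "Not-in-family, White, Female, 0, 0, 40, United-States, <=50K\n"
    "50, ?, 200000, HS-grad, 9, Married-civ-spouse, ?, "
    "Husband, White, Male, 0, 0, 45, United-States, >50K\n"
)
demo = load_adult(sample)
print(demo[["age", "workclass", "occupation", "income"]])
```

Loading this way strips the leading space from every value, so any later lookup keyed on `' ?'` or `' <=50K'` would need the space removed as well.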

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler


%matplotlib inline

In [2]:
df = pd.read_csv("data/raw/adult.csv")

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
df.describe()

In [None]:
df.nunique()

In [None]:
df.duplicated().sum()

#### Observation: There are 24 duplicate rows in the dataset.

In [None]:
# dropping duplicates
df.drop_duplicates(keep='first', inplace=True)

In [None]:
df.shape

In [None]:
df.duplicated().sum()

#### Observation: There are no duplicate rows left in the dataset.

In [None]:
# Separating numerical and categorical features
# (kept as DataFrame copies, because later cells call .head(), .columns and .replace() on them)
numerical_features = df.select_dtypes(include=['int64', 'float64']).copy()
categorical_features = df.select_dtypes(include=['object']).copy()

In [None]:
num_col = df.select_dtypes(include=['int64', 'float64'])
cat_col = df.select_dtypes(include=['object'])

In [None]:
num_col

In [None]:
cat_col

In [None]:
numerical_features.head()

In [None]:
categorical_features.head()

In [None]:
numerical_features.shape

In [None]:
categorical_features.shape

#### Observation:
* There are 32537 rows and 15 columns in the dataset.
* Categorical features = 9 and numerical features = 6.

In [None]:
# finding the unique values in numerical feature
for feature in numerical_features.columns:
    print(feature)
    print(numerical_features[feature].unique())
    print("")

In [None]:
# finding the unique values in categorical feature
for feature in categorical_features.columns:
    print(feature)
    print(categorical_features[feature].unique())
    print("")

#### Observation:
* Missing values in the categorical features appear to be encoded as ' ?' rather than NaN.

In [None]:
# replacing ' ?' with NaN
categorical_features = categorical_features.replace(' ?', np.nan)

In [None]:
categorical_features.isnull().sum()

#### Observation:
There are null values in the categorical features:
* workclass - 1836
* occupation - 1843
* native_country - 582
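
The counts above depend on matching the placeholder exactly, including its leading space. A slightly more tolerant variant strips whitespace before comparing. This is a sketch on a toy frame, not the notebook's data; `mark_missing` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def mark_missing(frame, columns, marker="?"):
    """Return a copy where `marker` (ignoring surrounding spaces) becomes NaN."""
    out = frame.copy()
    for col in columns:
        stripped = out[col].str.strip()
        # Keep the stripped text, but blank out cells equal to the marker
        out[col] = stripped.mask(stripped == marker, np.nan)
    return out

# Toy data mixing " ?" and "?" to show both spellings are caught
toy = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov"],
    "occupation": [" ?", " Sales", "?"],
    "age": [25, 38, 52],
})
cleaned = mark_missing(toy, columns=["workclass", "occupation"])
print(cleaned.isnull().sum())
```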

In [None]:
# Checking relative null values for categorical feature
sns.heatmap(categorical_features.isnull(),yticklabels=False,cbar=False,cmap='viridis')

In [None]:
## Checking distribution of our target variable -> income
values = categorical_features['income'].value_counts()
# Take the labels from the same Series so each bar is paired with its own count
categories = values.index

# Create a figure and axes
fig, ax = plt.subplots()

# Plot the vertical bar chart
ax.bar(categories, values, align='center')

# Set labels for x-axis and y-axis
ax.set_xlabel('Income')
ax.set_ylabel('Count')

# Set title
ax.set_title('Distribution of income')

# Show the plot
plt.show()


In [None]:
values

#### Observation:
* The dataset is slightly imbalanced.
* Total records - 32537
* income <=50K - 24698 records (approx. 76%)
* income >50K - 7839 records (approx. 24%)
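
With a roughly 76/24 class ratio, a plain random split can shift the ratio between the training and test parts. One common option is a stratified split. The sketch below uses synthetic labels with the ratio reported above; the variable names are illustrative and it is not the notebook's own split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic labels with the 76/24 ratio reported above
labels = pd.Series([0] * 760 + [1] * 240, name="income")
features = pd.DataFrame({"x": range(len(labels))})

# Class shares of the whole set
print(labels.value_counts(normalize=True))

# stratify=labels keeps the class shares (almost) identical in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.3, stratify=labels, random_state=42
)
print("train share of >50K:", y_tr.mean())
print("test share of >50K:", y_te.mean())
```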

In [None]:
## Checking Distribution of workclass feature

plt.figure(figsize=(10, 6))
categorical_features['workclass'].value_counts().plot(kind='bar')
plt.xlabel('workclass')
plt.ylabel('Count')
plt.title('workclass: count per category')

plt.show()

#### Observation:
* Most frequent category in workclass is "Private" 
* Every category contains both income classes.
* Except for Self-emp-inc, the majority income class in every category is "<=50K".
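
Observations like these can be checked numerically with a cross-tabulation instead of being read off a chart. The sketch below uses a toy frame whose values are made up for illustration; with the real data the same calls would be applied to the `workclass` and `income` columns.

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "workclass": ["Private", "Private", "Private",
                  "Self-emp-inc", "Self-emp-inc", "Self-emp-inc"],
    "income": ["<=50K", "<=50K", ">50K", ">50K", ">50K", "<=50K"],
})

counts = pd.crosstab(toy["workclass"], toy["income"])
# Row-wise shares: each workclass row sums to 1
share = pd.crosstab(toy["workclass"], toy["income"], normalize="index")

# Majority income class per workclass
majority = share.idxmax(axis=1)
# True where a workclass contains at least one record of each income class
both_present = (counts > 0).all(axis=1)

print(share)
print(majority)
print(both_present)
```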

In [None]:
## Checking Distribution of education feature

plt.figure(figsize=(10, 6))
categorical_features['education'].value_counts().plot(kind='bar')
plt.xlabel('education')
plt.ylabel('Count')
plt.title('education: count per category')

plt.show()

In [None]:
## Checking Distribution of occupation feature

plt.figure(figsize=(10, 6))
categorical_features['occupation'].value_counts().plot(kind='bar')
plt.xlabel('occupation')
plt.ylabel('Count')
plt.title('occupation: count per category')

plt.show()

In [None]:
# Visualizing the relationship between occupation, occupation category count and income
# Create a count plot (data= is passed by keyword so the frame is never read as x)
plt.figure(figsize=(10, 6))
sns.countplot(data=categorical_features, x='occupation', hue='income')
plt.xticks(rotation=90)
plt.show()

In [None]:
# Visualizing the relationship between workclass, workclass category count and income
# Create a count plot (data= is passed by keyword so the frame is never read as x)
plt.figure(figsize=(10, 6))
sns.countplot(data=categorical_features, x='workclass', hue='income')
plt.xticks(rotation=90)
plt.show()

In [None]:
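# Series.mode() returns a Series, so take its first entry to fill with a scalar
categorical_features['workclass'] = categorical_features['workclass'].fillna(categorical_features['workclass'].mode()[0])
categorical_features['occupation'] = categorical_features['occupation'].fillna(categorical_features['occupation'].mode()[0])
categorical_features['country'] = categorical_features['country'].fillna(categorical_features['country'].mode()[0])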
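
A note on mode imputation: `Series.mode()` returns a Series, not a scalar, and `fillna` aligns a Series argument on index labels. The toy example below (illustrative values only) shows the difference between filling with the Series and filling with its first entry.

```python
import numpy as np
import pandas as pd

s = pd.Series(["Private", np.nan, "Private", np.nan, "State-gov"], name="workclass")

# s.mode() is a one-element Series with index label 0.
# fillna aligns on labels, so only row 0 could be filled, and row 0 is not missing.
aligned = s.fillna(s.mode())

# Taking the first entry gives a scalar, which fills every missing row.
filled = s.fillna(s.mode()[0])

print("missing after fillna(mode()):   ", aligned.isna().sum())
print("missing after fillna(mode()[0]):", filled.isna().sum())
```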

In [None]:
for feature in categorical_features.columns:
    plt.figure(figsize=(8, 3)) 
    categorical_features[feature].value_counts().plot(kind='bar', stacked=True)
    plt.xlabel(feature)
    plt.ylabel('Count')
    plt.title('Categories vs. Categories Count')

    plt.show()

In [None]:
# numpy, matplotlib and seaborn are already imported in the first cell
from scipy.stats import skew, kurtosis

# Report skewness and the direction of the skew
def skewkness(data):
    data_skewness = skew(data)
    print("Skewness:", data_skewness)

    if data_skewness< 0:
        print("Left skew distribution")
    elif data_skewness > 0 :
        print("Right skew distribution")
    else:
        print("Symmetrical distribution")
        
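def calculate_kurtosis(data):
    # scipy's kurtosis() returns excess (Fisher) kurtosis by default,
    # so a normal distribution scores 0
    data_kurtosis = kurtosis(data)
    print("Kurtosis:", data_kurtosis)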
    
    if data_kurtosis < 0:
        print("platykurtic (lighter tail) ")
        
    elif data_kurtosis > 0 :
        print("leptokurtic (heavier tail)")
    else:
        print("Mesokurtic distribution")

In [None]:
for feature in numerical_features.columns:
    # Create a figure and axes
    fig, ax = plt.subplots()

    # Plot the histogram (density=True normalises the area to 1)
    ax.hist(numerical_features[feature], bins=30, density=True)

    # Set labels and title
    ax.set_xlabel(feature)
    ax.set_ylabel('Density')
    ax.set_title(f'Histogram of {feature}')

    min_value = numerical_features[feature].min()
    max_value = numerical_features[feature].max()

    # Limit the x-axis to the observed range of the feature
    ax.set_xlim([min_value, max_value])

    # Show the plot
    plt.show()
    skewkness(numerical_features[feature])
    calculate_kurtosis(numerical_features[feature])


In [None]:
# Visualizing the relationship between education, education category count and income
# Create a count plot (data= is passed by keyword so the frame is never read as x)
plt.figure(figsize=(10, 6))
sns.countplot(data=categorical_features, x='education', hue='income')
plt.xticks(rotation=90)
plt.show()

In [None]:
# Create a figure and axes

fig, ax = plt.subplots(figsize=( 12, 6))

# Plot violin plots (one per numerical column)
ax.violinplot(numerical_features.values, showmedians=True)

# Set labels and title; violins are drawn at positions 1..N by default
ax.set_xticks(range(1, len(numerical_features.columns) + 1))
ax.set_xticklabels(numerical_features.columns)
ax.set_ylabel('Values')
ax.set_title('Violin Plots')

# Show the plot
plt.show()

Clean the Data:

* Handle missing data: identify missing values and decide how to handle them (e.g., imputation or removal).
* Remove duplicates: identify and remove any duplicated observations.
* Handle outliers: detect outliers and decide whether to keep, remove, or transform them.
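
For the outlier step, one common rule of thumb is the 1.5 x IQR fence. The sketch below is an assumption about how detection could be done, not something this notebook applies; `iqr_outlier_mask` is a hypothetical helper and the values are made up.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up values for illustration
hours = pd.Series([40, 38, 45, 40, 42, 39, 41, 99], name="hours-per-week")
mask = iqr_outlier_mask(hours)

print("flagged values:", hours[mask].tolist())
print("share flagged:", mask.mean())
```

Whether flagged rows are dropped, capped, or kept is a separate decision; the mask only identifies them.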

In [None]:
categorical_features.head()

In [None]:
categorical_features.shape

In [None]:
categorical_features.columns

In [None]:
numerical_features.head()

In [None]:
numerical_features.shape

In [None]:
numerical_features.columns

In [None]:
numerical_features

In [None]:
new_df = pd.concat([numerical_features,categorical_features],axis=1)

In [None]:
new_df.head()

In [None]:
new_df.shape

In [None]:
new_df.columns

In [None]:
categorical_features

In [None]:
categorical_features.columns

In [None]:
target = categorical_features["income"]
categorical_features.drop(labels=["income",'education'], axis=1, inplace=True)

In [None]:
target.unique()

In [None]:
mapping = {' <=50K': 0, ' >50K': 1}

# Use the map() function to apply the mapping
y = target.map(mapping)

In [None]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

In [None]:
scaler = StandardScaler().fit(numerical_features)
preprocessed_numerical_feature = scaler.transform(numerical_features)

encoder = OneHotEncoder().fit(categorical_features)
preprocessed_categorical_feature = encoder.transform(categorical_features).toarray()
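
An alternative arrangement, using the `Pipeline`, `SimpleImputer` and `ColumnTransformer` classes imported above, is to split first and fit the scaler and encoder on the training rows only, so that test rows do not influence the fitted statistics. This is a sketch on a made-up toy frame; the column lists are illustrative, not the notebook's.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows for illustration
toy = pd.DataFrame({
    "age": [25, 38, 52, 41, 33, 60, 29, 47],
    "hours-per-week": [40, 50, 45, 38, 40, 20, 60, 40],
    "workclass": pd.Series(
        ["Private", np.nan, "State-gov", "Private",
         "Private", "Self-emp-inc", np.nan, "Private"],
        dtype=object,
    ),
})
num_cols = ["age", "hours-per-week"]
cat_cols = ["workclass"]

preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), num_cols),
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            # Categories unseen during fit are encoded as all zeros
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), cat_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

train, test = train_test_split(toy, test_size=0.25, random_state=42)
X_train_t = preprocessor.fit_transform(train)  # statistics learned here only
X_test_t = preprocessor.transform(test)        # reused unchanged on test rows

print(X_train_t.shape, X_test_t.shape)
```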

In [None]:
preprocessed_numerical_feature.shape

In [None]:
preprocessed_categorical_feature.shape

In [None]:
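# Name the columns so the numeric and one-hot blocks do not share integer labels
numeric_df = pd.DataFrame(preprocessed_numerical_feature, columns=numerical_features.columns)
categoric_df = pd.DataFrame(preprocessed_categorical_feature, columns=encoder.get_feature_names_out())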

In [None]:
preprocessing_df = pd.concat([numeric_df,categoric_df],axis=1) 

In [None]:
preprocessing_df.shape

In [None]:
preprocessing_df.head()

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(preprocessing_df, y, test_size=0.3, random_state=42)

In [None]:
X_train.to_csv("data/X_train.csv")
y_train.to_csv("data/y_train.csv")
X_test.to_csv("data/X_test.csv")
y_test.to_csv("data/y_test.csv")

In [None]:
pca = PCA(n_components=55)

In [None]:
pca.fit(preprocessing_df)

In [None]:
df_pca = pca.transform(preprocessing_df)

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_pca, y, test_size=0.33, random_state=42)

In [None]:
# Calculate cumulative explained variance ratio
cumulative_variance_ratio = np.cumsum(pca.explained_variance_ratio_)

# Plot the cumulative explained variance ratio
plt.plot(range(1, len(cumulative_variance_ratio) + 1), cumulative_variance_ratio, marker='o')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.title('Cumulative Explained Variance Ratio vs. Number of Components')
plt.grid(True)
plt.show()
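
The cumulative curve can also be used to pick the number of components instead of fixing it by hand. The sketch below uses synthetic data and an assumed 95% variance target; neither the data nor the threshold comes from this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 10 features driven by 3 latent factors plus small noise
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(500, 10))

# Option 1: fit all components and read the count off the cumulative curve
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)

# Option 2: pass the target fraction and let PCA choose the count
pca_auto = PCA(n_components=0.95, svd_solver="full").fit(X)

print("components for 95% variance:", n_needed)
print("chosen by PCA(n_components=0.95):", pca_auto.n_components_)
```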


In [None]:
pd.DataFrame(X_train).to_csv("data/X_pca_train.csv")
pd.DataFrame(y_train).to_csv("data/y_pca_train.csv")
pd.DataFrame(X_test).to_csv("data/X_pca_test.csv")
pd.DataFrame(y_test).to_csv("data/y_pca_test.csv")

In [None]:
# Apply SMOTE (Synthetic Minority Over-sampling Technique) for oversampling
smote = SMOTE(random_state=42)
preprocessing_df_resampled, y_resampled = smote.fit_resample(preprocessing_df, y)
X_train, X_test, y_train, y_test = train_test_split(preprocessing_df_resampled, y_resampled, test_size=0.33, random_state=42)

pd.DataFrame(X_train).to_csv("data/X_SMOTE_train.csv")
pd.DataFrame(y_train).to_csv("data/y_SMOTE_train.csv")
pd.DataFrame(X_test).to_csv("data/X_SMOTE_test.csv")
pd.DataFrame(y_test).to_csv("data/y_SMOTE_test.csv")

In [None]:
# Apply Random Under-sampling for undersampling
rus = RandomUnderSampler(random_state=42)
preprocessing_undersampled, y_undersampled = rus.fit_resample(preprocessing_df, y)

X_train, X_test, y_train, y_test = train_test_split(preprocessing_undersampled, y_undersampled, test_size=0.33, random_state=42)

# Separate file names (chosen here for illustration) so the SMOTE outputs are not overwritten
pd.DataFrame(X_train).to_csv("data/X_RUS_train.csv")
pd.DataFrame(y_train).to_csv("data/y_RUS_train.csv")
pd.DataFrame(X_test).to_csv("data/X_RUS_test.csv")
pd.DataFrame(y_test).to_csv("data/y_RUS_test.csv")
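
Resampling the full dataset before splitting lets resampled or synthetic rows derived from test-side records end up in the training set. A frequently recommended ordering is to split first and resample the training part only. The sketch below illustrates that ordering on synthetic data, with plain random oversampling via `sklearn.utils.resample` as a stand-in for SMOTE so that it depends only on scikit-learn; the same ordering would apply to `SMOTE` or `RandomUnderSampler`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(42)
# Synthetic features and labels with a 76/24 class ratio
X = pd.DataFrame(rng.normal(size=(1000, 4)), columns=["f0", "f1", "f2", "f3"])
y = pd.Series([0] * 760 + [1] * 240, name="income")

# 1. Split first, keeping the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=42
)

# 2. Oversample the minority class inside the training part only
minority_idx = y_tr[y_tr == 1].index.to_numpy()
majority_idx = y_tr[y_tr == 0].index.to_numpy()
extra_idx = resample(
    minority_idx,
    replace=True,
    n_samples=len(majority_idx) - len(minority_idx),
    random_state=42,
)
balanced_idx = np.concatenate([y_tr.index.to_numpy(), extra_idx])
X_tr_bal, y_tr_bal = X.loc[balanced_idx], y.loc[balanced_idx]

print("train class counts after resampling:", y_tr_bal.value_counts().to_dict())
print("test class share of >50K (untouched):", y_te.mean())
```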