# DTSA 5509 Final Project Summary

This project focuses on applying supervised machine learning techniques—specifically binary classification—to predict whether an individual’s annual income exceeds $50,000 or falls below that threshold. Using demographic and employment-related data, the project explores how different machine learning models can learn from patterns in features such as education, occupation, age, and work hours to make accurate income predictions. This type of classification problem is widely used in socioeconomic modeling and workforce analytics, as it helps identify the key factors that influence income levels and supports data-driven decision-making in areas like policy planning, recruitment, and economic forecasting.

The primary goal of this project is to develop and evaluate models capable of accurately predicting a person’s income category—greater than $50K or less than/equal to $50K—based on their personal and employment characteristics. Beyond achieving high model accuracy, the project aims to understand which variables most strongly influence income and how various machine learning techniques compare in performance. This goal is important because it demonstrates how data science can be used to extract meaningful insights from real-world data, improve fairness and transparency in predictive modeling, and refine data-driven policy or business strategies that depend on income-related predictions.

To achieve this, the project follows a structured workflow. It begins with a Data Summary to understand the dataset’s structure and features, followed by Exploratory Data Analysis (EDA) to identify patterns, correlations, and potential outliers. The Data Cleaning phase ensures data quality by addressing missing values, duplicates, and inconsistencies. In the Preprocessing stage, features are transformed and encoded to prepare them for modeling. The Modeling section introduces baseline classifiers—Logistic Regression, Support Vector Classifier, and Random Forest—to establish benchmark performance. Feature Selection is then applied to reduce dimensionality and improve model efficiency, followed by Hyperparameter Tuning to optimize each model’s parameters using cross-validation. Next, Ensemble Methods such as Voting and Stacking are implemented to combine the strengths of multiple models for improved prediction accuracy. Finally, the Results and Analysis stage evaluates all models using metrics like accuracy and ROC AUC, compares their performance, and highlights the most effective approaches and insights gained throughout the process.


# Data Summary

### Import Python Packages

In [None]:
import pandas as pd
import numpy as np
from IPython.display import display, Markdown
from ucimlrepo import fetch_ucirepo

import seaborn as sns
import matplotlib.pyplot as plt
import altair as alt
alt.data_transformers.disable_max_rows()

### Data Source
The dataset used in this project is the Adult Income Dataset, commonly known as the “Census Income” dataset, which originates from the U.S. Census Bureau’s 1994 Census database. It contains demographic and employment-related information for over 48,000 individuals, with attributes such as age, education, occupation, work hours, marital status, and native country. The dataset’s primary purpose is to predict whether a person earns more than $50,000 per year based on these features, making it a well-known benchmark for supervised binary classification tasks in machine learning.  

The data was obtained in CSV format from the UCI Machine Learning Repository, where it is publicly available for academic and research use (Dua & Graff, 2019).

Reference:  
Dua, D., & Graff, C. (2019). UCI Machine Learning Repository: Adult Data Set. University of California, Irvine, School of Information and Computer Sciences. https://archive.ics.uci.edu/dataset/2/adult

In [None]:
# Import dataset 
adult_dataset_dict = fetch_ucirepo(id=2)

# View raw dataset
df = adult_dataset_dict.data.original
df

### Dataset Info
- `df.info()` provides a concise summary of a DataFrame, including the:
  - Number of rows
  - Column names
  - Data types
- It also shows the count of non-null entries for each column, which makes it easy to identify missing values.
- In addition, it displays the memory usage of the DataFrame, helping to assess the size and efficiency of the dataset in memory.

In [None]:
# Inspect column data types and size of the dataframe
df.info()

> The dataset contains 15 columns and 48,842 rows of data.  

> Six columns are integer datatypes, and the other 9 columns are categorical datatypes (shown as 'object' data-type above).

### Dataset Metadata Summary

The table below provides the **metadata of each column** in our dataset:
- `name` are the column names of the features and target in our dataset
- `role` identifies if the column is a feature or target variable
- `type` is the column data type
- `demographic` describes the demographic type
- `description` details the various categories for the categorical features
- `units` are empty (No further information has been provided by the data source)
- `missing_values` is a flag to indicate if there are values missing for that feature (missing values will be confirmed and imputed during the EDA and Pre-Processing stages)

In [None]:
# Create a dataframe of the dataset metadata (as provided by the data source)
adult_dataset_metadata_df = adult_dataset_dict.variables

# Display a markdown table of the metadata for each column
display(Markdown(adult_dataset_metadata_df.to_markdown(index=False)))

### Feature Descriptions
- The UCI Adult dataset is a derived subset by Becker & Kohavi of the U.S. Census 1994 microdata, so I couldn’t reliably find a primary source document that provides descriptions of the parameters as provided by Becker & Kohavi.
- Therefore, I am using the feature descriptions as found in this Kaggle notebook, which are generally the same descriptions I've found elsewhere:
  - Vyass, Y. H. (2022). Adult census income logistic regression explained (86.2%) [Computer software]. Kaggle. https://www.kaggle.com/code/yashhvyass/adult-census-income-logistic-reg-explained-86-2

Feature Name | Description
---|---
age | The age of an individual
workclass | Employment status of an individual
fnlwgt | The number of people the census believes the entry represents
education | The highest level of education achieved by an individual
education-num | The highest level of education achieved in numerical form
marital-status | Marital status of an individual
occupation | The general type of occupation of an individual
relationship | The relationship status of an individual
race | The race of an individual
sex | The sex of an individual
capital-gain | The capital gains for an individual
capital-loss | The capital loss for an individual
hours-per-week | The hours an individual has reported to work per week
native-country | The country of origin of an individual

> - It can be noted that the `education` and `education-num` columns essentially describe the same thing, where `education` are the education labels and `education-num` are the ordinal encoding for those labels.
> - During the EDA stage these columns should be cross-referenced to ensure that each category in `education` directly corresponds to the same ordinal value in `education-num` (e.g. 'HS-grad' in `education` always corresponds to the value 9 in `education-num`).
>   - If they do not match then efforts should be made to replace the ordinal values in `education-num` column with the most frequent values for each category in `education`.
>   - Once the `education` and `education-num` have a 1:1 match then the `education` column can be dropped during Data Cleaning stage, leaving only the ordinal encodings in `education-num` for modeling purposes.
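
The repair step described above could be sketched roughly as follows. This is a minimal illustration on hypothetical toy data (the `toy` frame and its values are made up for the example), not the code used later in this notebook.

```python
import pandas as pd

# Hypothetical toy data: one 'HS-grad' row carries an inconsistent code (10).
toy = pd.DataFrame({
    'education':     ['HS-grad', 'HS-grad', 'HS-grad', 'Bachelors', 'Bachelors'],
    'education-num': [9, 9, 10, 13, 13],
})

# Most frequent education-num within each education category.
most_frequent = toy.groupby('education')['education-num'].agg(lambda s: s.mode().iloc[0])

# Overwrite every row with the most frequent code of its category.
toy['education-num'] = toy['education'].map(most_frequent)
print(toy)
```

If the two columns already match, this step leaves the data unchanged.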

# EDA
In the EDA section, I start by defining which columns in the dataset are numeric and which are categorical, since this separation helps guide how each will be analyzed later in preprocessing. I then calculate descriptive statistics for the numeric features to understand their central tendencies, variability, and overall distributions, which also helps identify potential outliers. Next, I create histograms for each numeric column to visually assess their distributions and confirm where any extreme values, like those in `capital-gain` and `capital-loss`, might exist. After that, I check for correlations between numeric features using both a heatmap and a scatter plot matrix to see whether any strong relationships exist that could cause redundancy. I also examine the categorical columns by listing their unique categories and plotting their frequency distributions to get a sense of balance among categories and identify any `"?"` or missing values that need to be handled. Lastly, I compare the `education` and `education-num` columns using a heatmap to determine if they carry the same information, confirming that the `education` labels are redundant with the ordinal `education-num` values and can be dropped later during data cleaning.

### Define Numeric and Categorical Feature Columns Groups
- These groups will be used for EDA purposes, and during the Pre-Processing stage.

In [None]:
# Define lists of the numeric and categorical column names
numeric_columns = df.select_dtypes(include=['int64']).columns.to_list()
categorical_columns = [col for col in df.columns if col not in numeric_columns]

# Display numeric and categorical columns as lists
print('Numeric Columns :', numeric_columns)
print('Categorical Columns :', categorical_columns)

### Numeric Columns Descriptive Statistics
- The descriptive statistics summary gives us high level insights for each feature, including the:
  - Value counts
  - Mean
  - Standard Deviation
  - Minimum values
  - 25th, 50th, and 75th percentiles (the quartiles)
  - Maximum values
- These statistics can be used to get a general sense of the distribution of each numerical feature, and to possibly detect any numerical outliers.

In [None]:
# Descriptive statistics for the dataset's numeric columns 
df[numeric_columns].describe()
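
As one way to turn the quartiles from `describe()` into an outlier check, the sketch below applies the common 1.5 x IQR rule of thumb. Note that this rule is my own illustrative assumption and is not used elsewhere in this notebook; the `toy` data is hypothetical.

```python
import pandas as pd

# Hypothetical toy column with one extreme value.
toy = pd.DataFrame({'hours-per-week': [35, 38, 40, 40, 40, 42, 45, 99]})

# Reuse the quartiles reported by describe().
stats = toy['hours-per-week'].describe()
iqr = stats['75%'] - stats['25%']
lower = stats['25%'] - 1.5 * iqr
upper = stats['75%'] + 1.5 * iqr

# Rows falling outside the fences are flagged for a closer look.
flagged = toy[(toy['hours-per-week'] < lower) | (toy['hours-per-week'] > upper)]
print(flagged)
```

A flag from this rule is only a prompt for investigation; heavily skewed columns such as `capital-gain` would have most non-zero values flagged.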

### Numeric Columns Histograms

- Histograms of the numeric columns allow us to visually interpret the distributions of the data. They also help us identify where numeric outliers may exist.

In [None]:
# Melt the numeric columns into one column
df_melt_numeric_columns = df[numeric_columns].melt(var_name='feature', value_name='value')

# Create a base Altair histogram chart
chart = alt.Chart(df_melt_numeric_columns).mark_bar().encode(
    x = alt.X('value:Q', 
              axis=alt.Axis(title=''), 
              scale=alt.Scale(zero=False),
              bin=alt.Bin(maxbins=50)),
    y = alt.Y('count():Q', 
              axis=alt.Axis(title='')),
    color = alt.Color('feature:N', legend=None)
).properties(
    width=300,
    height=200
)

# Display a histogram for each numeric_columns
alt.ConcatChart(
    concat=[
      chart.transform_filter(alt.datum.feature == value).properties(title=value)
      for value in numeric_columns
    ],
    columns=3
).configure_title(
    fontSize=10
).resolve_axis(
    x='independent',
    y='independent'
).resolve_scale(
    x='independent', 
    y='independent'
)

> The `capital-gain` and `capital-loss` columns seem to have a lot of values close to zero, as well as some extreme outliers.  
> - In particular, the maximum value for `capital-gain` is equal to 99,999, which seems odd given that most of the values are ~16,000 or less. This outlier needs to be investigated further to decide whether the value is plausible or just an entry error. If it is an error, we should remove the outlier and replace it with the median of `capital-gain` (the median computed after the outlier has been removed) during the Data Cleaning stage.
> - The zero values in the `capital-gain` and `capital-loss` columns may just be an indicator that no capital gains or losses values were recorded, rather than the gains or losses equaling to exactly `$0`. It will be decided during Data Cleaning if it makes sense to replace the zero values and impute new values.
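
A quick way to size up both issues before deciding how to treat them is sketched below on hypothetical toy values (the numbers are invented for illustration): the share of zeros, the number of rows sitting at 99,999, and the median once both are set aside.

```python
import pandas as pd

# Hypothetical toy column: mostly zeros, a few real gains, one value at the 99,999 cap.
toy = pd.DataFrame({'capital-gain': [0, 0, 0, 0, 0, 0, 2174, 7688, 15024, 99999]})
col = toy['capital-gain']

zero_share = (col == 0).mean()              # fraction of rows equal to zero
cap_count = int((col == 99999).sum())       # rows sitting exactly at 99,999
median_without_cap = col[(col != 0) & (col != 99999)].median()

print(f'zeros: {zero_share:.0%}, rows at 99,999: {cap_count}, median of the rest: {median_without_cap}')
```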

### Numeric Column Correlations
- We need to check whether strong correlations exist between the numeric features. If they do, we should consider dropping one or more of the correlated features from the dataset.
- To aid in this evaluation we will visualize the data using:
    - A heatmap, where values close to `-1` or `1` indicate a strong correlation, and values close to `0` indicate a weak correlation.
    - A matrix of scatter plots to help observe whether any trends exist between numeric features. 

In [None]:
# Create heatmap for numeric columns
sns.heatmap(
    df[numeric_columns].corr().replace(1,np.nan), # .replace() removes all of the 1's along the diagonal
    cmap='vlag_r',
    annot=True,
    fmt='.3f',
    linewidths=1,
    vmin=-1,vmax=1
)
plt.title('Numeric Features Correlation Heatmap')
plt.show()

> All of the correlation values are very close to zero so we can assume that **no strong correlations exist between the numeric features**.
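
One caveat with hiding the diagonal via `.replace(1, np.nan)` is that it would also blank any off-diagonal correlation that happens to be exactly 1. A possible alternative, sketched here on hypothetical toy data, is to mask only the diagonal positions with an identity matrix.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'b' is perfectly correlated with 'a'.
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],
    'c': [5, 3, 4, 1, 2],
})

corr = toy.corr()

# Hide only the diagonal; off-diagonal values are kept even if they equal 1.
masked = corr.mask(np.eye(len(corr), dtype=bool))
print(masked)
```

The `masked` frame can be passed to `sns.heatmap` in the same way as before.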

In [None]:
# Scatter plot matrix
alt.Chart(df[numeric_columns]).mark_circle(opacity=0.5).encode(
    x=alt.X(alt.repeat('column'), type='quantitative'),
    y=alt.Y(alt.repeat('row'), type='quantitative')
).properties(
    width=125,
    height=125,
).repeat(
    row=numeric_columns,
    column=numeric_columns,
)

> Once again the outliers are noticeable in the `capital-gain` and `capital-loss` columns.

### Frequency of Categorical Columns
- The categorical columns need to be investigated to understand what unique categories exist in each column, as well as the count of occurrence for each category.

In [None]:
# Investigate the unique values in each of the categorical columns
for col in categorical_columns:
    print(f'Column name: {col}')
    print(f'Categories: {df[col].unique()}\n')

In [None]:
# Visualize the counts of each category

# Melt the categorical_columns
demographics_df = df[categorical_columns].melt(
    var_name='feature', value_name='category'
).dropna()

# Create a base Altair chart
base = (
    alt.Chart(demographics_df)
    .transform_aggregate(count='count()', groupby=['feature','category'])
    .transform_window(
        rank='rank()',
        sort=[alt.SortField('count', order='descending')],
        groupby=['feature']
    )
)

# Category counts facet chart
(base.mark_bar()
    .encode(
        y=alt.Y('count:Q', title=None),
        x=alt.X('category:N', sort='-y', title=None)
    )
    .properties(width=350, height=150)
    .facet(facet=alt.Facet('feature:N', title=None), align='all', columns=3)
    .resolve_scale(x ='independent')
)

> The `native-country`, `occupation`, and `workclass` features contain some rows with `?` values. 
> - These values will need to be replaced with `np.nan`'s during the Data Cleaning stage.  
> - Later during Pre-Processing those `np.nan`'s will be imputed with the most frequent categories for each feature.  

> Otherwise, the distributions of the categorical columns are sensible and don't appear to contain any outliers.
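
To complement the charts, the number of `?` placeholders per column can also be counted directly. The sketch below uses a hypothetical toy frame rather than the full dataset.

```python
import pandas as pd

# Hypothetical toy data with '?' placeholders in two columns.
toy = pd.DataFrame({
    'workclass':  ['Private', '?', 'Private', 'State-gov'],
    'occupation': ['Sales', '?', '?', 'Adm-clerical'],
    'sex':        ['Male', 'Female', 'Male', 'Female'],
})

# Count the '?' entries in every column, then keep only the affected columns.
placeholder_counts = (toy == '?').sum()
columns_with_placeholder = placeholder_counts[placeholder_counts > 0].index.tolist()

print(placeholder_counts)
print(columns_with_placeholder)
```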

### Cross-Reference Education Columns
- It was previously identified that the `education` and `education-num` columns need to be checked to see if they essentially contain the same information.
- A plot will be created to investigate if each individual `education` category corresponds to exactly one `education-num` value.

In [None]:
# Create a subset of the dataframe for the education columns
education_df = df[['education','education-num']]

# Plot the chart 
alt.Chart(education_df).mark_rect().encode(
    y=alt.Y('education:O', sort='x'),
    x='education-num:O',
    color='count()'
).properties(title='Education Cross Reference Plot')

> The chart above indicates that each `education` category corresponds to exactly one `education-num` value, and vice versa.  
> - Note: The color just indicates the count of occurrences in the dataset. What we're most concerned about here is that **only one box appears for each row/column set** in the chart. 

> This confirms that the two columns carry the same information, so the `education` column can be dropped during Data Cleaning, keeping the ordinal `education-num` values for modeling.
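
The visual check can be backed up numerically. As a minimal sketch on hypothetical toy data, a cross-tabulation mirrors the heatmap, and the mapping is one-to-one when every row and every column has exactly one non-zero cell.

```python
import pandas as pd

# Hypothetical toy data with a consistent label/code mapping.
toy = pd.DataFrame({
    'education':     ['HS-grad', 'HS-grad', 'Bachelors', 'Masters'],
    'education-num': [9, 9, 13, 14],
})

# Same layout as the heatmap: education labels as rows, codes as columns.
table = pd.crosstab(toy['education'], toy['education-num'])
nonzero = table > 0

# One non-zero cell per row and per column means a one-to-one mapping.
is_one_to_one = bool((nonzero.sum(axis=1) == 1).all() and (nonzero.sum(axis=0) == 1).all())
print(is_one_to_one)
```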

### EDA - Conclusions/Discussions/Next Steps:
In summary, the EDA process involved defining numeric and categorical feature groups, exploring descriptive statistics, visualizing distributions through histograms, examining correlations among numeric variables, assessing category frequencies, and cross-referencing the `education` and `education-num` columns for redundancy.  

From this analysis, I found that most numeric variables have reasonable distributions with no strong correlations, although `capital-gain` and `capital-loss` contain extreme outliers and many zeros that will require further investigation. In the categorical features, several columns such as `workclass`, `occupation`, and `native-country` include `"?"` values that will need to be treated as missing data. Additionally, the `education` column was found to duplicate the information in `education-num` and should be removed. These findings suggest that the primary challenges in the next stage will involve handling outliers, missing data, and redundant features.  

The next step, Data Cleaning, will focus on addressing these issues to prepare a consistent and reliable dataset for model training.

# Data Cleaning
In the Data Cleaning section, I begin by removing the `education` column since it was found to be redundant with `education-num` during the EDA stage. Next, I identify and drop duplicate rows to prevent data leakage and overfitting during model training. I then address numeric outliers in the `capital-gain` and `capital-loss` columns by flagging and replacing implausible values such as zeros and 99,999 with NaN, ensuring that these will later be imputed appropriately. For the categorical data, I replace `"?"` entries in `workclass`, `occupation`, and `native-country` with NaN to standardize missing values, and I clean the target variable `income` by removing trailing periods to maintain only two valid categories (<=50K and >50K). I then identify which numeric and categorical columns contain missing values and visualize where these occur, noting that `capital-gain`, `capital-loss`, `workclass`, `occupation`, and `native-country` have missing data that will need imputation. Finally, I handle a special case where `workclass` equals “Never-worked” by creating a new occupation category labeled “None.” These steps ensure the dataset is consistent, free of duplicates, and properly formatted for the next phase of preprocessing and imputation.

### Drop Columns
- The `education` column can be dropped because we have proven it to be redundant in the EDA stage. We will instead use the ordinal encoding of `education-num` for modeling.

In [None]:
# Drop the 'education' column
df = df.drop(['education'], axis=1)

# Remove the 'education' column from the categorical_columns list
categorical_columns = [col for col in categorical_columns if col != 'education']

# Ensure the 'education' column has been dropped
df[categorical_columns].columns.to_list()

### Drop Duplicate Values
- Duplicate rows need to be removed because one copy could land in the training set and its twin in the test set when we split the dataset for supervised modeling. This is a form of data leakage: the model would be evaluated on a row it has already seen during training, which inflates test scores and can hide overfitting.

In [None]:
# Display all rows that belong to a group of duplicates, sorted so that copies appear next to each other
df_duplicated = df[df.duplicated(keep=False)].sort_values(by=df.columns.to_list())
df_duplicated

> The dataset contains 57 rows that have duplicated values. Only the first occurrence will be kept and the other duplicates will be dropped to ensure the model is not overfit.

In [None]:
# Check the shape of the dataset before dropping duplicates
print(f'Dataframe shape before dropping duplicates: {df.shape}')
rows_before = df.shape[0]

# Drop duplicate values
df = df.drop_duplicates()

# Check the shape of the dataset after dropping duplicates
print(f'Dataframe shape after dropping duplicates: {df.shape}')
rows_after = df.shape[0]

print(f'Number of duplicate rows removed = {rows_before-rows_after}')

> 29 duplicate rows were removed from the dataset after dropping duplicates.
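
The two figures above count different things: `duplicated(keep=False)` marks every member of a duplicate group, while `drop_duplicates()` removes only the extra copies and keeps the first occurrence. A small sketch on hypothetical toy data shows how the two counts can differ.

```python
import pandas as pd

# Hypothetical toy data: one group of three identical rows, one pair, one unique row.
toy = pd.DataFrame({
    'age':   [25, 25, 25, 40, 40, 31],
    'hours': [40, 40, 40, 50, 50, 38],
})

# Every row that has at least one identical twin (all group members are counted).
rows_in_duplicate_groups = int(toy.duplicated(keep=False).sum())

# Rows actually removed when only the first occurrence of each group is kept.
rows_removed = len(toy) - len(toy.drop_duplicates())

print(rows_in_duplicate_groups, rows_removed)
```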

### Identify Numeric Column Outliers
- The numeric column histograms indicated that the `capital-gain` and `capital-loss` columns have numeric outliers. In particular, both columns have a large number of zero values.
  - The zero values may mean that the value of the capital gain or loss is truly equal to $0, but there is a chance that they instead mean that no gains or losses were recorded. This needs to be investigated further. If the zero values don't make sense then they will be removed from the dataset and new values imputed in their place.
- The `capital-gain` column has a maximum value of `99,999`, which appears to be an outlier in the dataset. This may have been accidentally entered as a placeholder and never removed from the original dataset. If it is determined that this is truly an outlier, then it will be removed from the dataset and new values imputed in its place.

In [None]:
# Create a subset of the dataframe for 'capital-gain' and 'capital-loss', removing outliers
capital_df = (df.copy()
    .loc[:,['capital-gain','capital-loss']]
    .melt(value_vars=['capital-gain','capital-loss'])
    .replace([0,99999], np.nan) # replace the zero and 99999 values
)

# Create a base histogram instance
capital_base_histogram = alt.Chart(capital_df).mark_bar().encode(
    x = alt.X('value:Q', title=None, bin=alt.Bin(maxbins=50)),
    y = 'count():Q'
).properties(title='capital-gain')

# Create histograms of 'capital-gain' and 'capital-loss' (post outliers removal)
capital_gain_histogram = capital_base_histogram.transform_filter(alt.datum.variable=='capital-gain').properties(title='capital-gain (post outliers removal)')
capital_loss_histogram = capital_base_histogram.transform_filter(alt.datum.variable=='capital-loss').properties(title='capital-loss (post outliers removal)')

# Concatenate and display post outliers removal histograms
capital_gain_histogram | capital_loss_histogram

> Removing the numeric outliers makes sense because they don't appear to be part of the main distributions of the `capital-gain` and `capital-loss` columns. These will be removed in the next step.

### Replace Values
- The categorical columns frequency charts indicate that some rows in the `workclass`, `occupation`, and `native-country` features have been filled in with a `?`. It will be assumed that these are unknown values and as such should be replaced with `np.nan` values instead.
- Also, the target variable `income` appears to have four unique categories: `<=50K`, `>50K`, `<=50K.`, `>50K.`. Because we are modeling a binary classification problem we need to ensure there are only two unique categories. Therefore, the periods should be removed where they exist so the target only contains the two categories of `<=50K` and `>50K`.
- Finally, the outliers in the numeric columns need to be removed to better represent the distributions of their data.

In [None]:
# Replace ? with np.nan in the `workclass`, `occupation`, and `native-country` columns
df = df.replace({'?':np.nan})

# Ensure the '?' have been removed from the `workclass`, `occupation`, and `native-country` columns
for col in ['workclass', 'occupation', 'native-country']:
    print(f'Column name: {col}')
    print(f'Categories: {df[col].unique()}\n')

In [None]:
# Remove the '.' from the 'income' labels (regex=False treats '.' as a literal character, not a wildcard)
df.loc[:,'income'] = df.loc[:,'income'].str.replace('.', '', regex=False)

# Ensure the '.' have been removed from the 'income' column
for col in ['income']:
    print(f'Column name: {col}')
    print(f'Categories: {df[col].unique()}\n')
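
Because `.` is a wildcard in regular expressions and the default of the `regex` argument of `str.replace` has changed between pandas versions, it is safer not to rely on that default. One alternative, sketched below on a hypothetical toy Series, is `str.rstrip('.')`, which only removes trailing periods and involves no regular expression at all.

```python
import pandas as pd

# Hypothetical toy labels covering the four raw variants.
income = pd.Series(['<=50K', '>50K', '<=50K.', '>50K.'])

# Strip trailing periods only; characters inside the label are left alone.
cleaned = income.str.rstrip('.')
categories = sorted(cleaned.unique())
print(categories)
```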

In [None]:
# Replace the numeric outliers in 'capital-gain' and 'capital-loss' columns
df[['capital-gain','capital-loss']] = df[['capital-gain','capital-loss']].replace([0,99999], np.nan)

### Identify Missing Numeric Values
- We need to investigate where the numeric outliers values of `0` and `99999` have been replaced with NaNs so we will be able to correctly impute the missing values during pre-processing

In [None]:
# Find which numeric columns contain missing values in the dataset
df[numeric_columns].isna().any(axis=0)

In [None]:
# Filter the dataset to only the rows of the numeric columns that have missing values
numeric_missing_df = df[numeric_columns][
    (df['capital-gain'].isna()) | 
    (df['capital-loss'].isna())
]

# Visualize the rows where the numeric columns have missing values
fig, ax = plt.subplots(figsize=(20,5))  
sns.heatmap(numeric_missing_df.T.isna(), cmap='Blues', cbar=False)
plt.title('Rows of the Numeric Columns with Missing Values')
plt.show()

> These missing numeric values will need to be imputed during Pre-Processing

### Identify Missing Categorical Values
- We need to investigate which categorical columns contain missing values so we will be able to correctly impute the missing values during pre-processing

In [None]:
# Find which categorical columns contain missing values in the dataset
df[categorical_columns].isna().any(axis=0)

> It appears that the `workclass`, `occupation`, and `native-country` categorical columns all contain missing values.

In [None]:
# Calculate the count of missing values for the 'workclass', 'occupation', and 'native-country' columns
df[categorical_columns].isna().sum()[df[categorical_columns].isna().sum() > 0]

In [None]:
# Filter the dataset to only the rows where the categorical columns have missing values
categorical_missing_df = df[categorical_columns][
    (df['workclass'].isna()) | 
    (df['occupation'].isna()) | 
    (df['native-country'].isna())
]

# Visualize the rows of the categorical columns that have missing values
fig, ax = plt.subplots(figsize=(20,5))  
sns.heatmap(categorical_missing_df.T.isna(), cmap='Blues', cbar=False)
plt.title('Rows of Categorical Columns with Missing Values')
plt.show()

> The `workclass` and `occupation` columns appear to have similar patterns of where the data is missing. The missing value count table indicates that there are 10 more missing values in `occupation` than in `workclass`. It would be interesting to look at the rows where `occupation` is NaN but `workclass` is not.

In [None]:
# Filter to rows where workclass is not NaN and occupation is NaN
categorical_missing_subset_df = categorical_missing_df[(categorical_missing_df['workclass'].notna()) & (categorical_missing_df['occupation'].isna())]
categorical_missing_subset_df

> It's interesting that `workclass` is only equal to 'Never-worked' where the `occupation` is missing. Let's check where else 'Never-worked' occurs in `workclass`.

In [None]:
# Filter the original dataset to where 'workclass' is 'Never-worked'
workclass_never_worked_df = df[(df['workclass']=='Never-worked')]
workclass_never_worked_df

> It seems like these are the only rows where 'Never-worked' occurs, which corresponds to the same rows we found in categorical_missing_subset_df.  
> 
> I'm going to assume that where `workclass` is 'Never-worked', the person has no occupation. Therefore I'm going to replace these missing `occupation` values with a new category of 'None'.

In [None]:
# Replace occupation with 'None' where workclass=='Never-worked'
df['occupation'] = np.where(
    (df['workclass']=='Never-worked'),
    'None',
    df['occupation']
)

# Check that 'None' has been added to the unique categories of the 'occupation' column
df['occupation'].unique()
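
An equivalent way to express this replacement is a boolean mask with `.loc`, which can also be restricted to rows where `occupation` is actually missing. This is a sketch on hypothetical toy data, not a change to the cleaning step above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two 'Never-worked' rows and one row missing both fields.
toy = pd.DataFrame({
    'workclass':  ['Never-worked', 'Private', np.nan, 'Never-worked'],
    'occupation': [np.nan, 'Sales', np.nan, np.nan],
})

# Only touch rows that are 'Never-worked' AND have no occupation recorded.
never_worked = toy['workclass'].eq('Never-worked') & toy['occupation'].isna()
toy.loc[never_worked, 'occupation'] = 'None'
print(toy)
```

The row where both fields are missing stays NaN and is left for imputation.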

### Identify Class Imbalance
- Since our task is a binary classification of income with categories [<=50K, >50K], we need to verify whether the target variable is balanced, that is, whether both classes occur in roughly equal proportions. Working with an imbalanced dataset can lead to biased models that perform well on the majority class but fail to correctly identify or predict the minority class, resulting in misleading accuracy scores and poor generalization. If the distribution turns out to be uneven, I will experiment with the SMOTE oversampling technique during the Modeling phase of the project to try to obtain more accurate model results.

In [None]:
# Calculate the percentage of each target class for the "income" target variable
lt50_income_count = df[df['income']=='<=50K']['income'].count()
gt50_income_count = df[df['income']=='>50K']['income'].count()
total_income_count = len(df)

print(f'The count of the income category "<=50K" is {lt50_income_count:,d} out of the total count of {total_income_count:,d} representing {lt50_income_count/total_income_count*100:.1f}% of the income values.')
print(f'The count of the income category  ">50K" is {gt50_income_count:,d} out of the total count of {total_income_count:,d} representing {gt50_income_count/total_income_count*100:.1f}% of the income values.')
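
The same counts and percentages can be obtained more compactly with `value_counts`. The sketch below uses a hypothetical toy Series, so the 3:1 split is for illustration only.

```python
import pandas as pd

# Hypothetical toy target with a 3:1 class split.
income = pd.Series(['<=50K'] * 3 + ['>50K'], name='income')

counts = income.value_counts()                 # absolute count per class
shares = income.value_counts(normalize=True)   # fraction per class
imbalance_ratio = counts.max() / counts.min()  # majority size relative to minority

print(counts)
print(shares)
print(f'Imbalance ratio: {imbalance_ratio:.1f}')
```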

In [None]:
# Create the left bar chart for the count of income <=50
lt50_chart = alt.Chart(df, title='Income <= 50K', height=300, width=800
    ).mark_bar(
        color=alt.Gradient(
            gradient='linear',
            stops=[
                alt.GradientStop(color='#b8e3be', offset=1.00),
                alt.GradientStop(color='#93d5bd', offset=0.75),
                alt.GradientStop(color='#69c2ca', offset=0.5),
                alt.GradientStop(color='#43a5c9', offset=0.25),
                alt.GradientStop(color='#2283b9', offset=0.00),
                ],
                x1=0, x2=1, y1=1, y2=1
            )
    ).transform_filter(
        alt.datum.income == '<=50K'
    ).encode(
        x=alt.X('count():Q', sort='descending', scale=alt.Scale(domain=(0,40000)))
    )
lt50_text = lt50_chart.mark_text(align='left', dy=-10, dx=-30, angle=270, fontWeight='bold', fontSize=20).encode(
        text=alt.Text('count():Q', format=',d')
    )

# Create the right bar chart for the count of income >50
gt50_chart = alt.Chart(df, title='Income > 50K', height=300, width=800
    ).mark_bar(
        color=alt.Gradient(
            gradient='linear',
            stops=[
                alt.GradientStop(color='#b8e3be', offset=0),
                alt.GradientStop(color='#93d5bd', offset=1),
                ],
                x1=0, x2=1, y1=1, y2=1
            )
    ).transform_filter(
        alt.datum.income == '>50K'
    ).encode(
        x=alt.X('count():Q', scale=alt.Scale(domain=(0,40000)))
    )
gt50_text = gt50_chart.mark_text(align='right', dy=-10, dx=30, angle=90, fontWeight='bold', fontSize=20).encode(
    text=alt.Text('count():Q', format=',d')
)

# Combine charts
alt.concat(lt50_chart+lt50_text, gt50_chart+gt50_text, title=alt.Title('Imbalance of the Income Target', fontWeight='bold', fontSize=20, anchor='middle'))

### Data Cleaning - Conclusions/Discussions/Next Steps:
In summary, the Data Cleaning process focused on removing a redundant column and duplicate rows, addressing outliers, and standardizing missing and inconsistent values across the dataset to ensure data integrity. During this process, we also discovered that the target variable was imbalanced, with 37,128 rows labeled “≤50K” and 11,685 rows labeled “>50K.” This imbalance will be addressed during preprocessing to help improve model fairness and predictive performance.

One key insight was that rows with workclass equal to “Never-worked” consistently had missing occupation values, prompting the creation of a new “None” category. These findings show that although the dataset is now well-prepared, careful imputation will be necessary to prevent bias and preserve accuracy.

The next step, Preprocessing, will focus on handling the class imbalance, imputing missing values, encoding categorical variables, and scaling numeric features for modeling.

In [None]:
# Export the dataset so it can be used in the "5509_income_modeling.ipynb" workbook
df.to_csv('./df.csv')
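
One thing worth checking when the CSV is read back in the modeling notebook: some pandas versions include the string `None` among the default missing-value markers of `read_csv`, which could turn the new 'None' occupation category back into NaN. The sketch below shows one possible way to guard against that on hypothetical toy data, using an in-memory buffer instead of a file; how the modeling notebook actually loads the file is not shown here, so treat this as an assumption.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical toy data: a literal 'None' category next to a genuinely missing value.
toy = pd.DataFrame({'occupation': ['None', 'Sales', np.nan], 'age': [18, 40, 35]})

# Write to an in-memory buffer (stands in for './df.csv').
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Treat only empty fields as missing, so the literal string 'None' survives.
restored = pd.read_csv(buffer, keep_default_na=False, na_values=[''])
print(restored)
```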