# **Importing Required Libraries**

In [29]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer

# **Step 1 : Data Loading**

In [30]:
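url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
column_names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
                'occupation', 'relationship', 'race', 'sex', 'capital-gain',
                'capital-loss', 'hours-per-week', 'native-country', 'income']
# Note: skipinitialspace=True strips the space after each comma, so the file's
# ' ?' placeholder is read as '?' and na_values=' ?' does not match it.
# The '?' entries therefore remain ordinary strings in the outputs that follow.
df = pd.read_csv(url, header=None, names=column_names,
                 na_values=' ?', skipinitialspace=True, delimiter=',')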

In [31]:
df.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


# **Step 2 : Initial Data Inspection and Cleaning**

### **Step 2.1 : Display Dataset Information and Preview Data:**

**Preview Data:** Using head() lets students see the actual records, which is critical for understanding the context of the data.

**Dataset Info:** The info() function shows non-null counts and datatypes, which helps in quickly spotting missing values or incorrect data formats.



In [32]:
# Display dataset information and preview data
# -> Display the first 5 rows and print the basic stats for the data
# -> Hint: head and info

print("Dataset Information:")
print(df.info())  # info() prints its report itself and returns None, hence the trailing "None" in the output

print("\nFirst 5 rows of the dataset:")
print(df.head())


Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
None

First 5 rows of the dataset:
   age         workclass  fnlwgt  edu

**Dataset Dimensions:** Knowing the shape of the data informs students about its scale, which can affect computation time and choice of algorithms.

In [33]:
# Display the dimension of the data
# -> Hint : shape
print("Dimension of the dataset:")
print(df.shape)

Dimension of the dataset:
(32561, 15)


### **Step 2.2 : Basic Statistical Summary:**

The describe() function provides vital statistics (min, max, mean, standard deviation) that help identify any outliers or anomalies in numerical columns.
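As a rough illustration of how these summary statistics can point to outliers, the sketch below applies the common 1.5 × IQR rule to a small, made-up series of weekly hours. The rule and the toy values are assumptions chosen for illustration; the notebook itself does not apply this rule.

```python
import pandas as pd

# Hypothetical toy values standing in for a numeric column such as hours-per-week
hours = pd.Series([40, 38, 45, 40, 50, 35, 42, 99], name="hours-per-week")

# describe() reports count, mean, std, min, quartiles and max in one call
print(hours.describe())

# 1.5 * IQR rule: flag values far outside the middle 50% of the data
q1, q3 = hours.quantile(0.25), hours.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = hours[(hours < lower) | (hours > upper)]
print(outliers)
```

A max that sits far above the 75% quartile, as 99 does here, is the kind of signal worth a closer look.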

In [34]:
# Check the min, max, count, mean, standard deviation, etc.
# Hint -> describe
print("Basic Statistical Summary:")
print(df.describe())

Basic Statistical Summary:
                age        fnlwgt  education-num  capital-gain  capital-loss  \
count  32561.000000  3.256100e+04   32561.000000  32561.000000  32561.000000   
mean      38.581647  1.897784e+05      10.080679   1077.648844     87.303830   
std       13.640433  1.055500e+05       2.572720   7385.292085    402.960219   
min       17.000000  1.228500e+04       1.000000      0.000000      0.000000   
25%       28.000000  1.178270e+05       9.000000      0.000000      0.000000   
50%       37.000000  1.783560e+05      10.000000      0.000000      0.000000   
75%       48.000000  2.370510e+05      12.000000      0.000000      0.000000   
max       90.000000  1.484705e+06      16.000000  99999.000000   4356.000000   

       hours-per-week  
count    32561.000000  
mean        40.437456  
std         12.347429  
min          1.000000  
25%         40.000000  
50%         40.000000  
75%         45.000000  
max         99.000000  


### **Step 2.3 : Counting Missing Values:**

This step is essential for diagnosing data quality. Missing values can lead to biased or inaccurate models if not handled properly.
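One caveat is that isnull() only counts real NaN/None values, so placeholder strings such as '?' go unnoticed. The sketch below uses a hypothetical toy frame (not the actual dataset) to show one possible way to turn such placeholders into real missing values before counting.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics a '?' placeholder in a categorical column
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov", "?"],
    "age": [39, 50, 38, 53],
})

# isnull() sees no missing values because '?' is an ordinary string
missing_before = toy.isnull().sum()

# Convert the placeholder into a real missing value, then count again
toy_clean = toy.replace("?", np.nan)
missing_after = toy_clean.isnull().sum()

print(missing_before)
print(missing_after)
```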

In [35]:
# Count missing values in each column
print("Missing Values Count:")
print(df.isnull().sum())

Missing Values Count:
age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income            0
dtype: int64


### **Step 2.4 : Checking for Duplicate Records:**

Duplicates can skew the analysis by over-representing some data, so it's important to remove them.
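Counting duplicates does not remove them. A minimal sketch of the removal step, on a made-up toy frame, could look like this (the frame and variable names are illustrative assumptions):

```python
import pandas as pd

# Hypothetical toy frame in which the third row repeats the first
toy = pd.DataFrame({
    "age": [39, 50, 39, 28],
    "workclass": ["State-gov", "Private", "State-gov", "Private"],
})

# duplicated() marks rows that are identical to an earlier row
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence; reset the index afterwards
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduped))
```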

In [36]:
# Check for duplicate records
print("Duplicate Records Count:")
print(df.duplicated().sum())

Duplicate Records Count:
24


### **Step 2.5 : Inspecting Unique Values in Categorical Columns:**

Unique value inspection reveals if there are any unexpected values (e.g., a '?' or extra spaces) that need cleaning. This is important for ensuring reliable encoding later.

In [37]:
# Check the unique values of each categorical column to spot irrelevant entries, e.g. records containing a '?' placeholder

categorical_columns = df.select_dtypes(include=['object', 'category']).columns

for col in categorical_columns:
    print(f"\nColumn: {col}")
    print(df[col].unique())


Column: workclass
['State-gov' 'Self-emp-not-inc' 'Private' 'Federal-gov' 'Local-gov' '?'
 'Self-emp-inc' 'Without-pay' 'Never-worked']

Column: education
['Bachelors' 'HS-grad' '11th' 'Masters' '9th' 'Some-college' 'Assoc-acdm'
 'Assoc-voc' '7th-8th' 'Doctorate' 'Prof-school' '5th-6th' '10th'
 '1st-4th' 'Preschool' '12th']

Column: marital-status
['Never-married' 'Married-civ-spouse' 'Divorced' 'Married-spouse-absent'
 'Separated' 'Married-AF-spouse' 'Widowed']

Column: occupation
['Adm-clerical' 'Exec-managerial' 'Handlers-cleaners' 'Prof-specialty'
 'Other-service' 'Sales' 'Craft-repair' 'Transport-moving'
 'Farming-fishing' 'Machine-op-inspct' 'Tech-support' '?'
 'Protective-serv' 'Armed-Forces' 'Priv-house-serv']

Column: relationship
['Not-in-family' 'Husband' 'Wife' 'Own-child' 'Unmarried' 'Other-relative']

Column: race
['White' 'Black' 'Asian-Pac-Islander' 'Amer-Indian-Eskimo' 'Other']

Column: sex
['Male' 'Female']

Column: native-country
['United-States' 'Cuba' 'Jamaica' 'I

### **Step 2.6 : Validating and Converting Data Types:**

Converting columns to the correct datatype (like converting an ID column to a string) prevents errors in operations such as merging, filtering, or encoding.
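As a small illustration of what a failed conversion looks like, the sketch below runs pd.to_numeric on a hypothetical series with one non-numeric entry. The errors="coerce" variant is shown as an alternative to catching the exception; it is not what the cell below does.

```python
import pandas as pd

# Hypothetical raw column read as text, with one entry that is not a number
raw = pd.Series(["13", "9", "unknown"])

# Strict conversion raises ValueError as soon as one value cannot be parsed
strict_failed = False
try:
    pd.to_numeric(raw)
except ValueError:
    strict_failed = True

# errors="coerce" converts what it can and marks the rest as NaN
coerced = pd.to_numeric(raw, errors="coerce")
print(strict_failed)
print(coerced)
```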

In [38]:
# Check the data type of each column and if wrong datatype convert it to the suitable datatype

# Show current data types
print("Original Data Types:\n")
print(df.dtypes)

# Try to convert columns to suitable datatypes where needed
for col in df.columns:
    # Object columns: try numeric first, then datetime, otherwise treat as categorical
    if df[col].dtype == 'object':
        try:
            # Try converting to numeric (if applicable)
            df[col] = pd.to_numeric(df[col])
            print(f"Converted '{col}' to numeric.")
        except ValueError:
            try:
                # Try converting to datetime (if applicable)
                df[col] = pd.to_datetime(df[col])
                print(f"Converted '{col}' to datetime.")
            except ValueError:
                # If neither numeric nor datetime, assume it's categorical
                df[col] = df[col].astype('category')
                print(f"Converted '{col}' to category.")
    elif df[col].dtype == 'int64' or df[col].dtype == 'float64':
        continue  # Numeric types are fine
    elif 'date' in col.lower():
        try:
            df[col] = pd.to_datetime(df[col])
            print(f"Converted '{col}' to datetime (by name match).")
        except (ValueError, TypeError):
            pass  # If it can't be converted, keep it as is

# Show updated data types
print("\nUpdated Data Types:\n")
print(df.dtypes)

Original Data Types:

age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
income            object
dtype: object
Converted 'workclass' to category.
Converted 'education' to category.
Converted 'marital-status' to category.
Converted 'occupation' to category.
Converted 'relationship' to category.
Converted 'race' to category.
Converted 'sex' to category.
Converted 'native-country' to category.
Converted 'income' to category.

Updated Data Types:

age                  int64
workclass         category
fnlwgt               int64
education         category
education-num        int64
marital-status    category
occupation        category
relationship      category
race              category
sex      



### **Step 2.7 : Checking Value Counts for Categorical Columns:**

Value counts help in understanding the distribution within each category. They are useful for detecting class imbalances and anomalies.
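To judge class imbalance, relative frequencies are often easier to read than raw counts. The sketch below uses a made-up income column to show the normalize=True option of value_counts(); the toy values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy target column with an uneven class split
income = pd.Series(["<=50K", "<=50K", "<=50K", ">50K"], name="income")

counts = income.value_counts()                 # absolute counts per class
shares = income.value_counts(normalize=True)   # fractions that sum to 1

print(counts)
print(shares)
```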

In [39]:
# Check value counts for each categorical column
categorical_columns = df.select_dtypes(include=['object', 'category']).columns

for col in categorical_columns:
    print(f"\nColumn: {col}")
    print(df[col].value_counts())


Column: workclass
workclass
Private             22696
Self-emp-not-inc     2541
Local-gov            2093
?                    1836
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: count, dtype: int64

Column: education
education
HS-grad         10501
Some-college     7291
Bachelors        5355
Masters          1723
Assoc-voc        1382
11th             1175
Assoc-acdm       1067
10th              933
7th-8th           646
Prof-school       576
9th               514
12th              433
Doctorate         413
5th-6th           333
1st-4th           168
Preschool          51
Name: count, dtype: int64

Column: marital-status
marital-status
Married-civ-spouse       14976
Never-married            10683
Divorced                  4443
Separated                 1025
Widowed                    993
Married-spouse-absent      418
Married-AF-spouse           23
Name: count, dtype: int64

Column: occupation
oc

### **Step 2.8 : Handling Missing Values using SimpleImputer:**

Imputation preserves the dataset size while ensuring that no null values interfere with analysis. Different strategies are used for numerical (mean) and categorical (mode) columns based on their characteristics.
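Since this step is named after SimpleImputer, here is a minimal sketch of how the imputer imported at the top could be applied, using a hypothetical toy frame with one missing value per column. The column lists and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one missing value in each column
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "workclass": pd.Series(["Private", "Private", np.nan], dtype="object"),
})

num_cols = ["age"]
cat_cols = ["workclass"]

# Mean for numeric columns, most frequent value (mode) for categorical ones
num_imputer = SimpleImputer(strategy="mean")
cat_imputer = SimpleImputer(strategy="most_frequent")

toy[num_cols] = num_imputer.fit_transform(toy[num_cols])
toy[cat_cols] = cat_imputer.fit_transform(toy[cat_cols])

print(toy)
```

Fitting the imputers on training data only and reusing them on new data is one reason to prefer this over ad hoc fillna calls.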

In [40]:
# Fill null values with the mean (numeric columns) or the mode (categorical columns)

# Separate columns by data type
numeric_cols = df.select_dtypes(include=['number']).columns
categorical_cols = df.select_dtypes(include=['object', 'category']).columns

# Fill missing values in numeric columns with mean
for col in numeric_cols:
    if df[col].isnull().sum() > 0:
        mean_value = df[col].mean()
        # Assign back instead of using inplace=True on a column selection,
        # which newer pandas versions warn about
        df[col] = df[col].fillna(mean_value)
        print(f"Filled NaN in numeric column '{col}' with mean: {mean_value}")

# Fill missing values in categorical columns with mode
for col in categorical_cols:
    if df[col].isnull().sum() > 0:
        mode_value = df[col].mode()[0]
        df[col] = df[col].fillna(mode_value)
        print(f"Filled NaN in categorical column '{col}' with mode: {mode_value}")


# **Step 3 : Converting Data Types and Cleaning Categorical Data**

### **Step 3.1 : Removing Leading/Trailing Spaces:**

Standardize entries in categorical columns so that no extra spaces lead to misclassification of similar values.



In [41]:
# Check for leading/trailing spaces in categorical data and remove them

# Select categorical columns
categorical_cols = df.select_dtypes(include=['object', 'category']).columns

# Strip leading/trailing spaces
for col in categorical_cols:
    original = df[col].copy()
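    df[col] = df[col].astype(str).str.strip()  # astype(str) turns category columns into plain strings

    # Check if anything changed. Note: equals() also compares dtypes, so a
    # category -> string conversion is reported even if no spaces were removed.
    if not original.equals(df[col]):
        print(f"Stripped spaces from column: {col}")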

Stripped spaces from column: workclass
Stripped spaces from column: education
Stripped spaces from column: marital-status
Stripped spaces from column: occupation
Stripped spaces from column: relationship
Stripped spaces from column: race
Stripped spaces from column: sex
Stripped spaces from column: native-country
Stripped spaces from column: income


### **Step 3.2 : Checking and Converting Data Types:**

Ensure that every column is of the correct data type (e.g., IDs as strings, dates as datetime).

In [42]:
# Check the data type of each column and if wrong datatype convert it to the suitable datatype

# Show original data types
print("Original Data Types:\n", df.dtypes)

# Try to fix data types
for col in df.columns:
    # Skip columns that are already numeric or datetime
    if pd.api.types.is_numeric_dtype(df[col]) or pd.api.types.is_datetime64_any_dtype(df[col]):
        continue

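    # Try to convert to numeric
    try:
        df[col] = pd.to_numeric(df[col])
        print(f"✅ Converted '{col}' to numeric.")
        continue
    except (ValueError, TypeError):
        pass  # Not numeric; fall through to the next attempt

    # Try to convert to datetime
    try:
        df[col] = pd.to_datetime(df[col])
        print(f"✅ Converted '{col}' to datetime.")
        continue
    except (ValueError, TypeError):
        pass  # Not a date; fall through to category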

    # If nothing worked, convert to category (for text-like data)
    df[col] = df[col].astype('category')
    print(f"ℹ️ Converted '{col}' to category.")

# Show updated data types
print("\nUpdated Data Types:\n", df.dtypes)

Original Data Types:
 age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
income            object
dtype: object
ℹ️ Converted 'workclass' to category.
ℹ️ Converted 'education' to category.
ℹ️ Converted 'marital-status' to category.
ℹ️ Converted 'occupation' to category.
ℹ️ Converted 'relationship' to category.
ℹ️ Converted 'race' to category.
ℹ️ Converted 'sex' to category.
ℹ️ Converted 'native-country' to category.
ℹ️ Converted 'income' to category.

Updated Data Types:
 age                  int64
workclass         category
fnlwgt               int64
education         category
education-num        int64
marital-status    category
occupation        category
relationship      category
race     



### **Step 3.3 : Converting to 'category' Datatype:**

Transform columns that represent categorical data (like 'sex' or 'workclass') into the 'category' type to optimize memory and computational performance.
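The memory effect can be checked directly. The following sketch compares a hypothetical low-cardinality column stored as plain strings and as 'category'; the values and sizes are made up for illustration.

```python
import pandas as pd

# Hypothetical column with three distinct values repeated many times
as_object = pd.Series(["Private", "Self-emp-inc", "State-gov"] * 10_000, dtype="object")
as_category = as_object.astype("category")

# deep=True includes the memory used by the strings themselves
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(object_bytes, category_bytes)
print(as_category.cat.categories)
```

The category version stores each distinct string once plus a small integer code per row, which is why it is smaller here.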

In [43]:
# Convert suitable columns to 'category' datatype

# Auto-detect object/string columns with low cardinality (fewer unique values than 10% of the rows)
categorical_cols = [
    col for col in df.columns
    if df[col].dtype == 'object' and df[col].nunique() < 0.1 * len(df)
]

# Convert to 'category' dtype
for col in categorical_cols:
    df[col] = df[col].astype('category')
    print(f"✅ Converted '{col}' to 'category' type.")

# Show updated data types
print("\nUpdated Data Types:\n", df.dtypes)


Updated Data Types:
 age                  int64
workclass         category
fnlwgt               int64
education         category
education-num        int64
marital-status    category
occupation        category
relationship      category
race              category
sex               category
capital-gain         int64
capital-loss         int64
hours-per-week       int64
native-country    category
income            category
dtype: object


# **Step 4 : Feature Engineering**

### **Step 4.1 : Creating "age_group" Feature:**

Binning converts continuous age values into meaningful categories (e.g., 'Young', 'Adult') that are easier to analyze and interpret.
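To make the bin boundaries concrete, the sketch below runs pd.cut on a few hand-picked ages that sit right at the edges. It assumes the same bins and labels as the cell that follows, with right=False so that each bin includes its left edge and excludes its right edge.

```python
import numpy as np
import pandas as pd

# Hand-picked ages at and around each bin edge
ages = pd.Series([17, 24, 25, 44, 45, 64, 65, 90])

bins = [0, 25, 45, 65, np.inf]
labels = ['Young', 'Adult', 'Middle-Aged', 'Senior']

# right=False gives left-closed bins: [0, 25), [25, 45), [45, 65), [65, inf)
groups = pd.cut(ages, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"age": ages, "age_group": groups}))
```

With right=False, an age of exactly 25 falls into 'Adult' rather than 'Young'.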

In [44]:
# Create feature "age_group": age categories using binning
# With right=False each bin is left-closed: [0, 25), [25, 45), [45, 65), [65, inf)
bins = [0, 25, 45, 65, np.inf]
labels = ['Young', 'Adult', 'Middle-Aged', 'Senior']

df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels, right=False)


### **Step 4.2 : Creating "education_hours_interaction" Feature:**

Interaction features help capture complex relationships between variables. In this case, the interaction between education (via 'education-num') and work intensity ('hours-per-week') may reveal underlying patterns related to social or economic outcomes.

In [45]:
# Create an interaction feature "education_hours_interaction": education-num multiplied by hours-per-week (as a proxy for workload vs. education level)
df['education_hours_interaction'] = df['education-num'] * df['hours-per-week']

# **Step 5 : Encoding Categorical Data**

**One-Hot Encoding:** Converts each categorical column into one binary column per category, so that no artificial ordering is implied between categories.
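For a quick look at what the encoded columns look like, the sketch below uses pandas' get_dummies on a hypothetical two-column frame. This is an alternative to the OneHotEncoder used in the cell that follows, shown only because it fits in a few lines.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric column
toy = pd.DataFrame({"sex": ["Male", "Female", "Male"], "age": [39, 28, 50]})

# Each category of "sex" becomes its own 0/1 column; "age" is left untouched
encoded = pd.get_dummies(toy, columns=["sex"], dtype=int)

print(encoded)
```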


In [46]:
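# One-hot encode the categorical columns (sex, workclass, education, etc.).
encoder = OneHotEncoder(sparse_output=False)
encoded_df = df.copy()

for col in df.select_dtypes(include=['object', 'category']).columns:
    # Fit on one column at a time; new columns are named "<col>_<category>"
    x = pd.DataFrame(encoder.fit_transform(encoded_df[[col]]),
                     columns=encoder.get_feature_names_out([col]),
                     index=encoded_df.index)
    # The original categorical column is kept alongside its binary columns
    encoded_df = encoded_df.join(x)

print(encoded_df)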

       age         workclass  fnlwgt   education  education-num  \
0       39         State-gov   77516   Bachelors             13   
1       50  Self-emp-not-inc   83311   Bachelors             13   
2       38           Private  215646     HS-grad              9   
3       53           Private  234721        11th              7   
4       28           Private  338409   Bachelors             13   
...    ...               ...     ...         ...            ...   
32556   27           Private  257302  Assoc-acdm             12   
32557   40           Private  154374     HS-grad              9   
32558   58           Private  151910     HS-grad              9   
32559   22           Private  201490     HS-grad              9   
32560   52      Self-emp-inc  287927     HS-grad              9   

           marital-status         occupation   relationship   race     sex  \
0           Never-married       Adm-clerical  Not-in-family  White    Male   
1      Married-civ-spouse    Exec-manag


**Label Encoding for Income:** Maps income to binary labels for binary classification tasks.
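LabelEncoder assigns integer codes in sorted order of the class labels, which for this target happens to give 0 for '<=50K' and 1 for '>50K'. The sketch below, on a hypothetical toy series, compares it with an explicit dictionary mapping that spells out the intended codes; the mapping is an illustrative alternative, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy target column
income = pd.Series(["<=50K", ">50K", "<=50K"], name="income")

# Option 1: LabelEncoder on a 1-D Series; codes follow the sorted class labels
label_encoder = LabelEncoder()
codes_sklearn = label_encoder.fit_transform(income)

# Option 2: an explicit mapping makes the intended codes visible in the code itself
income_map = {"<=50K": 0, ">50K": 1}
codes_mapped = income.map(income_map)

print(list(label_encoder.classes_))
print(codes_sklearn.tolist(), codes_mapped.tolist())
```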

In [48]:
# Use label encoding for the income column, converting <=50K to 0 and >50K to 1.
label_encoder = LabelEncoder()
# Pass the 1-D Series rather than a one-column DataFrame, which LabelEncoder expects
df['income_encoded'] = label_encoder.fit_transform(df['income'])
print(df)

       age         workclass  fnlwgt   education  education-num  \
0       39         State-gov   77516   Bachelors             13   
1       50  Self-emp-not-inc   83311   Bachelors             13   
2       38           Private  215646     HS-grad              9   
3       53           Private  234721        11th              7   
4       28           Private  338409   Bachelors             13   
...    ...               ...     ...         ...            ...   
32556   27           Private  257302  Assoc-acdm             12   
32557   40           Private  154374     HS-grad              9   
32558   58           Private  151910     HS-grad              9   
32559   22           Private  201490     HS-grad              9   
32560   52      Self-emp-inc  287927     HS-grad              9   

           marital-status         occupation   relationship   race     sex  \
0           Never-married       Adm-clerical  Not-in-family  White    Male   
1      Married-civ-spouse    Exec-manag



# **Step 6 : Normalization and Standardization**

Standardization transforms the specified columns to a mean of 0 and a standard deviation of 1, which is important to ensure comparability among numerical features during model training.
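The transformation itself is z = (x - mean) / std. The sketch below computes it by hand on a made-up series to show what the scaler does per column. It assumes the population standard deviation (ddof=0), which is what StandardScaler uses.

```python
import pandas as pd

# Hypothetical toy values for one numeric column
hours = pd.Series([20.0, 40.0, 40.0, 60.0], name="hours-per-week")

mean = hours.mean()
std = hours.std(ddof=0)   # population standard deviation, as in StandardScaler

# z-score: subtract the mean, divide by the standard deviation
z = (hours - mean) / std

print(z)
print(z.mean(), z.std(ddof=0))
```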

In [49]:
# Standardize the "age", "hours-per-week", "capital-gain" and "capital-loss" columns to have a mean of 0 and a standard deviation of 1.
from sklearn.preprocessing import StandardScaler  # not among the imports at the top of the notebook

num_cols = ['age', 'hours-per-week', 'capital-gain', 'capital-loss']
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])

print(df)