# Adult Income Dataset â€“ Feature Encoding & Scaling

This notebook focuses on preprocessing the Adult Income dataset by applying feature encoding and scaling techniques to make the data suitable for machine learning models.

## Importing Required Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

## Loading the Dataset

In [2]:
df = pd.read_csv("adult_income.csv")
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0.0,0.0,40.0,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0.0,0.0,50.0,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0.0,0.0,40.0,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688.0,0.0,40.0,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0.0,0.0,30.0,United-States,<=50K


## Dataset Structure and Data Types

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48077 entries, 0 to 48076
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   age              48077 non-null  int64  
 1   workclass        48077 non-null  object 
 2   fnlwgt           48077 non-null  int64  
 3   education        48077 non-null  object 
 4   educational-num  48077 non-null  int64  
 5   marital-status   48077 non-null  object 
 6   occupation       48077 non-null  object 
 7   relationship     48076 non-null  object 
 8   race             48076 non-null  object 
 9   gender           48076 non-null  object 
 10  capital-gain     48076 non-null  float64
 11  capital-loss     48076 non-null  float64
 12  hours-per-week   48076 non-null  float64
 13  native-country   48076 non-null  object 
 14  income           48076 non-null  object 
dtypes: float64(3), int64(3), object(9)
memory usage: 5.5+ MB


## Feature Type Identification

### Numerical Features
- age
- fnlwgt
- capital-gain
- capital-loss
- hours-per-week

### Categorical Features (Nominal)
- workclass
- marital-status
- occupation
- relationship
- race
- sex
- native-country

### Categorical Feature (Ordinal)
- education

### Target Variable
- income

## Handling Missing Values

In [4]:
df.replace(' ?', np.nan, inplace=True)
df.dropna(inplace=True)

## Label Encoding for Ordinal Feature

In [5]:
le = LabelEncoder()
df['education'] = le.fit_transform(df['education'])

## One-Hot Encoding for Nominal Features

In [6]:
df = pd.get_dummies(df, drop_first=True)

## Separating Features and Target Variable

In [7]:
X = df.drop('income_>50K', axis=1)
y = df['income_>50K']

## Feature Scaling Using StandardScaler

In [8]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

## Comparison Before and After Scaling

In [9]:
print("Before Scaling:")
print(X.describe())

print("\nAfter Scaling:")
print(pd.DataFrame(X_scaled).describe())

Before Scaling:
                age        fnlwgt  ...  capital-loss  hours-per-week
count  48076.000000  4.807600e+04  ...  48076.000000    48076.000000
mean      38.628214  1.897571e+05  ...     87.338610       40.417942
std       13.710565  1.056980e+05  ...    402.707997       12.379868
min       17.000000  1.228500e+04  ...      0.000000        1.000000
25%       28.000000  1.175670e+05  ...      0.000000       40.000000
50%       37.000000  1.782530e+05  ...      0.000000       40.000000
75%       48.000000  2.377200e+05  ...      0.000000       45.000000
max       90.000000  1.490400e+06  ...   4356.000000       99.000000

[8 rows x 7 columns]

After Scaling:
                 0             1   ...            84            85
count  4.807600e+04  4.807600e+04  ...  4.807600e+04  4.807600e+04
mean   3.133269e-17 -3.458420e-17  ... -5.172850e-18 -7.685378e-18
std    1.000010e+00  1.000010e+00  ...  1.000010e+00  1.000010e+00
min   -1.577502e+00 -1.679066e+00  ... -4.183649e-02 -2.1

## Saving Preprocessed Dataset

In [10]:
processed_df = pd.DataFrame(X_scaled, columns=X.columns)
processed_df['income'] = y.values

processed_df.to_csv("adult_income_preprocessed.csv", index=False)

In [12]:
processed_df.head(500).to_csv("adult_income_preprocessed_sample.csv",index=False)

## Impact of Feature Scaling

Feature scaling ensures fair contribution of all numerical features by bringing them to a similar range. It is crucial for distance-based and gradient-based algorithms, as unscaled data can lead to biased learning and slow convergence. Tree-based models are generally unaffected by scaling.