# DATA PREPROCESSING AND FEATURE ENGINEERING IN MACHINE LEARNING

Dataset:

The "Adult" dataset, which is used to predict whether income exceeds $50K/yr based on census data.

Tasks:

1. Data Exploration and Preprocessing:

●	Load the dataset and conduct basic data exploration (summary statistics, missing values, data types).

●	Handle missing values as per the best practices (imputation, removal, etc.).

●	Apply scaling techniques to numerical features:

a.	Standard Scaling   b. Min-Max Scaling

●	Discuss the scenarios where each scaling technique is preferred and why.

2. Encoding Techniques:

●	Apply One-Hot Encoding to categorical variables with less than 5 categories.

●	Use Label Encoding for categorical variables with more than 5 categories.

●	Discuss the pros and cons of One-Hot Encoding and Label Encoding.

3. Feature Engineering:

●	Create at least 2 new features that could be beneficial for the model. Explain the rationale behind your choices.

●	Apply a transformation (e.g., log transformation) to at least one skewed numerical feature and justify your choice.



In [131]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.preprocessing import LabelEncoder
df = pd.read_csv('/content/adult_with_headers (1).csv')

In [132]:
print(df.head())

   age          workclass  fnlwgt   education  education_num  \
0   39          State-gov   77516   Bachelors             13   
1   50   Self-emp-not-inc   83311   Bachelors             13   
2   38            Private  215646     HS-grad              9   
3   53            Private  234721        11th              7   
4   28            Private  338409   Bachelors             13   

        marital_status          occupation    relationship    race      sex  \
0        Never-married        Adm-clerical   Not-in-family   White     Male   
1   Married-civ-spouse     Exec-managerial         Husband   White     Male   
2             Divorced   Handlers-cleaners   Not-in-family   White     Male   
3   Married-civ-spouse   Handlers-cleaners         Husband   Black     Male   
4   Married-civ-spouse      Prof-specialty            Wife   Black   Female   

   capital_gain  capital_loss  hours_per_week  native_country  income  
0          2174             0              40   United-States   <=50

In [133]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education_num   32561 non-null  int64 
 5   marital_status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital_gain    32561 non-null  int64 
 11  capital_loss    32561 non-null  int64 
 12  hours_per_week  32561 non-null  int64 
 13  native_country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
None


In [134]:
print(df.describe())

                age        fnlwgt  education_num  capital_gain  capital_loss  \
count  32561.000000  3.256100e+04   32561.000000  32561.000000  32561.000000   
mean      38.581647  1.897784e+05      10.080679   1077.648844     87.303830   
std       13.640433  1.055500e+05       2.572720   7385.292085    402.960219   
min       17.000000  1.228500e+04       1.000000      0.000000      0.000000   
25%       28.000000  1.178270e+05       9.000000      0.000000      0.000000   
50%       37.000000  1.783560e+05      10.000000      0.000000      0.000000   
75%       48.000000  2.370510e+05      12.000000      0.000000      0.000000   
max       90.000000  1.484705e+06      16.000000  99999.000000   4356.000000   

       hours_per_week  
count    32561.000000  
mean        40.437456  
std         12.347429  
min          1.000000  
25%         40.000000  
50%         40.000000  
75%         45.000000  
max         99.000000  


In [135]:
print(df.isnull().sum())

age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64


In [136]:
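# Remove rows with missing values.
# Note: isnull() above reports no NaN values, so dropna() leaves the frame unchanged here.
# The Adult data commonly marks unknown entries with '?' rather than NaN; those are not caught by this step.
df = df.dropna()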




In [137]:
print(df.isnull().sum())

age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64
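
The zero counts above only cover real NaN values. The Adult data is commonly distributed with unknown entries written as '?' (often with a leading space), which isnull() does not flag. The sketch below shows one way to surface and handle such placeholders. It runs on a small made-up frame (`demo`, hypothetical values), not on the real file, and the '?' convention is an assumption that should be checked against the actual CSV.

```python
import pandas as pd

# Toy frame standing in for the real data (hypothetical values).
# The ' ?' entries imitate the placeholder assumed to mark unknown values.
demo = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov", " Private"],
    "occupation": [" Sales", " ?", " Adm-clerical", " ?"],
    "age": [39, 50, 38, 53],
})
cat_cols = ["workclass", "occupation"]

# Strip surrounding whitespace, then blank out '?' so pandas sees it as missing.
for col in cat_cols:
    cleaned = demo[col].str.strip()
    demo[col] = cleaned.mask(cleaned == "?")

missing_counts = demo.isnull().sum()
print(missing_counts)

# Option 1: removal, keeps only fully observed rows.
dropped = demo.dropna()

# Option 2: imputation, fills each categorical column with its most frequent value.
imputed = demo.copy()
for col in cat_cols:
    imputed[col] = imputed[col].fillna(imputed[col].mode()[0])
print(imputed)
```

Removal is simplest when few rows are affected; mode imputation keeps every row at the cost of slightly inflating the most common category.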


In [138]:
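num_cols = ['age', 'hours_per_week', 'education_num']

# Standard Scaling: centers each feature at mean 0 and rescales it to unit variance.
# Preferred for roughly bell-shaped features and for models that expect centered inputs
# (e.g. linear/logistic regression, SVM, PCA). It does not bound the values to a fixed range.
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])
df.head()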


,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,0.030671,State-gov,77516,Bachelors,1.134739,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,-0.035429,United-States,<=50K
1,0.837109,Self-emp-not-inc,83311,Bachelors,1.134739,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,-2.222153,United-States,<=50K
2,-0.042642,Private,215646,HS-grad,-0.42006,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,-0.035429,United-States,<=50K
3,1.057047,Private,234721,11th,-1.197459,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,-0.035429,United-States,<=50K
4,-0.775768,Private,338409,Bachelors,1.134739,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,-0.035429,Cuba,<=50K
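
As a check on what StandardScaler computes, the sketch below reproduces it by hand as z = (x - mean) / std on a few illustrative ages (made-up sample, not the full column). Note that scikit-learn uses the population standard deviation (ddof=0), whereas describe() reports the sample standard deviation (ddof=1); on 32561 rows the difference is negligible.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative ages only; the real column has 32561 values.
ages = np.array([[39.0], [50.0], [38.0], [53.0], [28.0]])

scaled = StandardScaler().fit_transform(ages)

# Same computation by hand: subtract the mean, divide by the population std (ddof=0).
manual = (ages - ages.mean()) / ages.std(ddof=0)

print(np.allclose(scaled, manual))
print(scaled.ravel())
```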


In [139]:
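# Min-Max Scaling: rescales each feature linearly to the [0, 1] range.
# Preferred when a bounded range is needed (e.g. neural networks, distance-based models
# such as k-NN) and the feature has no extreme outliers, since outliers squeeze the rest of the values.
# Note: this cell rescales the already standardized columns. The result equals min-max scaling
# of the raw values, because standardization is itself a linear rescaling.
minmax_scaler = MinMaxScaler()
df[num_cols] = minmax_scaler.fit_transform(df[num_cols])
df.head()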

,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,0.30137,State-gov,77516,Bachelors,0.8,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,0.397959,United-States,<=50K
1,0.452055,Self-emp-not-inc,83311,Bachelors,0.8,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,0.122449,United-States,<=50K
2,0.287671,Private,215646,HS-grad,0.533333,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,0.397959,United-States,<=50K
3,0.493151,Private,234721,11th,0.4,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,0.397959,United-States,<=50K
4,0.150685,Private,338409,Bachelors,0.8,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,0.397959,Cuba,<=50K
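
Min-max scaling is x' = (x - min) / (max - min). With the age range from describe() (17 to 90), an age of 39 maps to 22 / 73 = 0.30137, which matches the first row above. The sketch below checks this on a few illustrative values and also shows that min-max scaling standardized data gives the same numbers as scaling the raw data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative ages; 17 and 90 are the min and max reported by describe().
ages = np.array([[17.0], [39.0], [50.0], [90.0]])

minmax_raw = MinMaxScaler().fit_transform(ages)
manual = (ages - ages.min()) / (ages.max() - ages.min())

# Standardize first, then min-max scale: the linear standardization step cancels out.
standardized = StandardScaler().fit_transform(ages)
minmax_after_std = MinMaxScaler().fit_transform(standardized)

print(minmax_raw.ravel())
print(np.allclose(minmax_raw, minmax_after_std))
```

Keeping a separate copy of the raw columns (or fitting each scaler on the raw values) makes it possible to compare the two techniques side by side instead of overwriting one with the other.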


In [140]:
categorical_cols = df.select_dtypes(include='object').columns
print(categorical_cols)

Index(['workclass', 'education', 'marital_status', 'occupation',
       'relationship', 'race', 'sex', 'native_country', 'income'],
      dtype='object')


In [141]:
# Find columns with less than 5 unique categories
few_categories = [col for col in categorical_cols if df[col].nunique() < 5]

print(few_categories)

['sex', 'income']


In [142]:
# One-Hot Encoding for categorical variables with less than 5 categories
# Note: 'income' is the prediction target, so its two dummy columns mirror each other;
# keep only one of them as the label when modelling.
categorical_cols = ['sex', 'income']
df = pd.get_dummies(df, columns=categorical_cols)
df.head()
# Pros: Each category becomes its own binary column, so no artificial order is implied and the encoded features are easy to interpret.
# Cons: Adds one column per category, so variables with many categories inflate dimensionality, sparsity, and computational cost.

,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,capital_gain,capital_loss,hours_per_week,native_country,sex_ Female,sex_ Male,income_ <=50K,income_ >50K
0,0.30137,State-gov,77516,Bachelors,0.8,Never-married,Adm-clerical,Not-in-family,White,2174,0,0.397959,United-States,False,True,True,False
1,0.452055,Self-emp-not-inc,83311,Bachelors,0.8,Married-civ-spouse,Exec-managerial,Husband,White,0,0,0.122449,United-States,False,True,True,False
2,0.287671,Private,215646,HS-grad,0.533333,Divorced,Handlers-cleaners,Not-in-family,White,0,0,0.397959,United-States,False,True,True,False
3,0.493151,Private,234721,11th,0.4,Married-civ-spouse,Handlers-cleaners,Husband,Black,0,0,0.397959,United-States,False,True,True,False
4,0.150685,Private,338409,Bachelors,0.8,Married-civ-spouse,Prof-specialty,Wife,Black,0,0,0.397959,Cuba,True,False,True,False
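
For a binary variable, the two dummy columns carry the same information (one is the negation of the other). The sketch below, on a small made-up frame, shows the `drop_first=True` option of `pd.get_dummies`, which keeps a single column per binary variable. Whether to drop a level is a modelling choice; it mainly matters for linear models, where redundant columns are collinear.

```python
import pandas as pd

# Made-up rows for illustration only.
demo = pd.DataFrame({"sex": ["Male", "Female", "Male"], "age": [39, 28, 50]})

# One column per category.
full = pd.get_dummies(demo, columns=["sex"])

# One column fewer: the first level ('Female') becomes the implicit baseline.
compact = pd.get_dummies(demo, columns=["sex"], drop_first=True)

print(list(full.columns))
print(list(compact.columns))
```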


In [143]:
# Label Encoding for the remaining categorical variables
from sklearn.preprocessing import LabelEncoder

columns_to_encode = ['workclass', 'education', 'marital_status', 'occupation',
                     'relationship', 'race', 'native_country']

# Fit a separate encoder per column and keep it, so the integer codes can be mapped back to labels later.
label_encoders = {}
for col in columns_to_encode:
    label_encoders[col] = LabelEncoder()
    df[col] = label_encoders[col].fit_transform(df[col])
# Pros: Converts each category into a unique integer in a single feature column, so it keeps the dimensionality low.
# Cons: May imply an ordinal relationship between categories where none exists, which can mislead some models (e.g., linear models, logistic regression).

In [144]:
df.head()

,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,capital_gain,capital_loss,hours_per_week,native_country,sex_ Female,sex_ Male,income_ <=50K,income_ >50K
0,0.30137,7,77516,9,0.8,4,1,1,4,2174,0,0.397959,39,False,True,True,False
1,0.452055,6,83311,9,0.8,2,4,0,4,0,0,0.122449,39,False,True,True,False
2,0.287671,4,215646,11,0.533333,0,6,1,4,0,0,0.397959,39,False,True,True,False
3,0.493151,4,234721,1,0.4,2,6,0,2,0,0,0.397959,39,False,True,True,False
4,0.150685,4,338409,9,0.8,2,10,5,2,0,0,0.397959,5,True,False,True,False
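
The integer codes above follow the sorted order of the labels in each column, which is why they carry no real ranking. The sketch below, on made-up values, shows how a per-column encoder exposes that mapping through `classes_` and reverses it with `inverse_transform`. The frame and the `encoders` dictionary are illustrative names.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows for illustration only.
demo = pd.DataFrame({
    "workclass": ["State-gov", "Private", "Private", "Self-emp-inc"],
    "race": ["White", "Black", "White", "Black"],
})

encoders = {}
for col in ["workclass", "race"]:
    enc = LabelEncoder()
    demo[col] = enc.fit_transform(demo[col])
    encoders[col] = enc  # keep the fitted encoder for decoding later

# classes_ lists the labels in code order: index 0 is the first label alphabetically.
workclass_classes = list(encoders["workclass"].classes_)
decoded = encoders["workclass"].inverse_transform(demo["workclass"])

print(workclass_classes)
print(demo["workclass"].tolist())
print(list(decoded))
```

Tree-based models usually tolerate such arbitrary codes, while linear models read them as magnitudes, which is the main drawback noted in the cell above.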


In [145]:
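# Feature 1: Total hours worked yearly
# Caveat: hours_per_week was min-max scaled above, so the values below are 52 x the scaled value
# (e.g. 0.397959 * 52 is about 20.69), not real hours. Compute this before scaling to get actual yearly hours.
df['hours_worked_yearly'] = df['hours_per_week'] * 52
df.head()
# Rationale: expresses workload on a yearly basis, so effort can be compared across people with different weekly hours.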

,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,capital_gain,capital_loss,hours_per_week,native_country,sex_ Female,sex_ Male,income_ <=50K,income_ >50K,hours_worked_yearly
0,0.30137,7,77516,9,0.8,4,1,1,4,2174,0,0.397959,39,False,True,True,False,20.693878
1,0.452055,6,83311,9,0.8,2,4,0,4,0,0,0.122449,39,False,True,True,False,6.367347
2,0.287671,4,215646,11,0.533333,0,6,1,4,0,0,0.397959,39,False,True,True,False,20.693878
3,0.493151,4,234721,1,0.4,2,6,0,2,0,0,0.397959,39,False,True,True,False,20.693878
4,0.150685,4,338409,9,0.8,2,10,5,2,0,0,0.397959,5,True,False,True,False,20.693878
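
A sketch of the same feature derived from unscaled hours, with scaling applied afterwards. The values are made up (40 and 13 correspond to the first two rows of the data). One thing this makes visible: because the feature is a constant multiple of hours_per_week, the two columns become identical after min-max scaling, so it adds readability rather than new information for a model.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up weekly hours in their original units.
demo = pd.DataFrame({"hours_per_week": [40, 13, 40, 45]})

# Derive the feature from raw values first...
demo["hours_worked_yearly"] = demo["hours_per_week"] * 52

# ...and scale afterwards, on a copy so the raw values stay available.
scale_cols = ["hours_per_week", "hours_worked_yearly"]
scaled = demo.copy()
scaled[scale_cols] = MinMaxScaler().fit_transform(demo[scale_cols])

print(demo)
print(scaled)
```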


In [146]:
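# Feature 2: Age group
# Caveat: 'age' was min-max scaled to [0, 1] above, so every value lands in the first bin (0, 25] and is labelled
# 'Young' (the minimum, exactly 0, falls outside that bin and becomes NaN). These bins are meant for unscaled ages.
df['age_group'] = pd.cut(df['age'], bins=[0, 25, 45, 65, 100], labels=['Young', 'Mid-age', 'Senior', 'Elder'])
df.head()
# Rationale: grouping ages into life stages smooths out noise in the continuous values and makes trends across stages easier to spot.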

,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,capital_gain,capital_loss,hours_per_week,native_country,sex_ Female,sex_ Male,income_ <=50K,income_ >50K,hours_worked_yearly,age_group
0,0.30137,7,77516,9,0.8,4,1,1,4,2174,0,0.397959,39,False,True,True,False,20.693878,Young
1,0.452055,6,83311,9,0.8,2,4,0,4,0,0,0.122449,39,False,True,True,False,6.367347,Young
2,0.287671,4,215646,11,0.533333,0,6,1,4,0,0,0.397959,39,False,True,True,False,20.693878,Young
3,0.493151,4,234721,1,0.4,2,6,0,2,0,0,0.397959,39,False,True,True,False,20.693878,Young
4,0.150685,4,338409,9,0.8,2,10,5,2,0,0,0.397959,5,True,False,True,False,20.693878,Young
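
The sketch below applies the same bins to unscaled ages (made-up sample that includes the ages of the first five rows), and then to their min-max scaled versions to show why every row above ends up as 'Young'. `pd.cut` uses right-closed intervals by default, so the bins are (0, 25], (25, 45], (45, 65], (65, 100].

```python
import pandas as pd

bins = [0, 25, 45, 65, 100]
labels = ["Young", "Mid-age", "Senior", "Elder"]

# Made-up sample of ages in years.
ages = pd.Series([17, 25, 39, 50, 53, 28, 90], name="age")
age_group = pd.cut(ages, bins=bins, labels=labels)
print(age_group.tolist())

# Same bins on min-max scaled ages (range 17..90 mapped to 0..1).
scaled_ages = (ages - 17) / (90 - 17)
scaled_groups = pd.cut(scaled_ages, bins=bins, labels=labels)
print(scaled_groups.tolist())
```

On raw ages the first five rows of the data (39, 50, 38, 53, 28) would fall into 'Mid-age' or 'Senior' rather than 'Young'.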


In [147]:
# Log transform a skewed numeric feature, e.g. fnlwgt
import numpy as np
df['log_fnlwgt'] = np.log1p(df['fnlwgt'])
# Skewed data have long tails that can distort statistical measures and models.
# A log transform compresses large values, which reduces right skew and makes the distribution more symmetric.
# log1p computes log(1 + x), so it also stays defined for zeros (relevant for columns such as capital_gain).
df.head()


,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,capital_gain,capital_loss,hours_per_week,native_country,sex_ Female,sex_ Male,income_ <=50K,income_ >50K,hours_worked_yearly,age_group,log_fnlwgt
0,0.30137,7,77516,9,0.8,4,1,1,4,2174,0,0.397959,39,False,True,True,False,20.693878,Young,11.258253
1,0.452055,6,83311,9,0.8,2,4,0,4,0,0,0.122449,39,False,True,True,False,6.367347,Young,11.330348
2,0.287671,4,215646,11,0.533333,0,6,1,4,0,0,0.397959,39,False,True,True,False,20.693878,Young,12.281398
3,0.493151,4,234721,1,0.4,2,6,0,2,0,0,0.397959,39,False,True,True,False,20.693878,Young,12.366157
4,0.150685,4,338409,9,0.8,2,10,5,2,0,0,0.397959,5,True,False,True,False,20.693878,Young,12.732013
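
One way to justify the transform is to compare skewness before and after it. The sketch below does this with pandas' `Series.skew()` on synthetic right-skewed data (log-normal draws, not the real fnlwgt column); on the actual data the same two lines could be applied to `df['fnlwgt']` and `df['log_fnlwgt']`.

```python
import numpy as np
import pandas as pd

# Synthetic, non-negative, right-skewed values; a stand-in for a skewed column.
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=12.0, sigma=0.6, size=5000))

skew_before = values.skew()
skew_after = np.log1p(values).skew()

print(f"skew before: {skew_before:.2f}")
print(f"skew after:  {skew_after:.2f}")
```

A skewness near 0 indicates a roughly symmetric distribution; values well above 0 indicate a long right tail.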
