# **Data Preprocessing and Feature Engineering in Machine Learning**

Data preprocessing and feature engineering are foundational, iterative steps in the machine learning workflow, focusing on preparing and enhancing raw data to maximize model performance and accuracy.

In short:

Data Preprocessing is primarily about cleaning, structuring, and transforming raw data into a machine-readable format.

Feature Engineering is about creating, selecting, and modifying input variables (features) to capture the underlying problem better and boost the model's predictive power.

This assignment aims to equip you with practical skills in data preprocessing, feature engineering, and feature selection techniques, which are crucial for building efficient machine learning models. You will work with a provided dataset to apply various techniques such as scaling, encoding, and feature selection methods including isolation forest and PPS score analysis.

# 1. Handle missing values as per best practices (imputation, removal, etc.).

In [11]:
#importing all the necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
import seaborn as sns

In [12]:
#loading the dataset
df = pd.read_csv('/content/drive/MyDrive/Python excelr/adult_with_headers (1).csv')
df

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


In [13]:
df.head() #first five rows

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education_num   32561 non-null  int64 
 5   marital_status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital_gain    32561 non-null  int64 
 11  capital_loss    32561 non-null  int64 
 12  hours_per_week  32561 non-null  int64 
 13  native_country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [15]:
df.describe() #provides statistical overview

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,38.581647,189778.4,10.080679,1077.648844,87.30383,40.437456
std,13.640433,105550.0,2.57272,7385.292085,402.960219,12.347429
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117827.0,9.0,0.0,0.0,40.0
50%,37.0,178356.0,10.0,0.0,0.0,40.0
75%,48.0,237051.0,12.0,0.0,0.0,45.0
max,90.0,1484705.0,16.0,99999.0,4356.0,99.0


In [16]:
df.dtypes #data type of each column

Unnamed: 0,0
age,int64
workclass,object
fnlwgt,int64
education,object
education_num,int64
marital_status,object
occupation,object
relationship,object
race,object
sex,object


In [17]:
df.isnull().sum() #identifies missing values for handling.

Unnamed: 0,0
age,0
workclass,0
fnlwgt,0
education,0
education_num,0
marital_status,0
occupation,0
relationship,0
race,0
sex,0


In [18]:
# Separate numerical and categorical features
numerical_cols = df.select_dtypes(include=np.number).columns.tolist()
categorical_cols = df.select_dtypes(include='object').columns.tolist()

# Impute numerical features with median
num_imputer = SimpleImputer(strategy='median')
df[numerical_cols] = num_imputer.fit_transform(df[numerical_cols])

# Impute categorical features with mode
cat_imputer = SimpleImputer(strategy='most_frequent')
df[categorical_cols] = cat_imputer.fit_transform(df[categorical_cols])
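
One caveat on the check above: isnull() only counts true NaN values. Some distributions of the Adult census data mark missing entries with a `?` string instead, in which case every count would be zero even though values are missing. The sketch below shows how such placeholders could be converted before imputation. The `?` marker and the toy values are assumptions for illustration and have not been confirmed for this CSV.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical rows; object dtype matches the categorical columns in df.info()
toy = pd.DataFrame(
    {
        "workclass": ["Private", " ?", "Private", "State-gov"],
        "occupation": ["Sales", "Tech-support", "?", "Sales"],
    },
    dtype=object,
)

# Strip stray whitespace, then turn the assumed placeholder into a real missing value
cleaned = toy.apply(lambda s: s.str.strip()).replace("?", np.nan)
missing_counts = cleaned.isnull().sum()

# Mode imputation, as in the cell above
imputer = SimpleImputer(strategy="most_frequent")
imputed = pd.DataFrame(imputer.fit_transform(cleaned), columns=cleaned.columns)
print(missing_counts)
print(imputed)
```

If a check like `(df[categorical_cols] == '?').sum()` returns only zeros on the real data, this step is unnecessary.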


In [20]:
# Min-Max scaling: rescales each numerical feature so that its values lie between 0 and 1.
scaler_mm = MinMaxScaler()
df[numerical_cols] = scaler_mm.fit_transform(df[numerical_cols])

MinMax scaling was chosen to explicitly bound the numerical features between 0 and 1, which often benefits gradient-based optimization and algorithms that rely on distance metrics.
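
For reference, MinMax scaling maps each value x in a column to (x - min) / (max - min). A minimal sketch on a few hypothetical age values, comparing the manual formula with `MinMaxScaler`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical sample: a single column of ages
ages = np.array([[17.0], [37.0], [48.0], [90.0]])

# Library result
scaled = MinMaxScaler().fit_transform(ages)

# Same computation written out per column
col_min = ages.min(axis=0)
col_max = ages.max(axis=0)
manual = (ages - col_min) / (col_max - col_min)

print(np.allclose(scaled, manual))
```

Because the range is set by the minimum and maximum, one extreme value compresses everything else toward 0. In the describe() output above, capital_gain has a 75th percentile of 0 and a maximum of 99999, so almost all scaled values sit at or near 0.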

# 2. Encoding Techniques:

● Apply One-Hot Encoding to categorical variables with fewer than 5 categories.

One-hot encoding is a data preprocessing technique used to convert categorical variables into a numerical format that machine learning algorithms can easily interpret.

In [21]:
onehot_cols = [col for col in categorical_cols if df[col].nunique() < 5]
df = pd.get_dummies(df, columns=onehot_cols, drop_first=True)

The code above selects low-cardinality categorical columns (fewer than 5 unique values) so that encoding does not generate an excessive number of features, which helps mitigate the curse of dimensionality. It then performs one-hot encoding using pd.get_dummies(), with drop_first=True dropping one dummy variable per feature, a standard way to avoid perfect multicollinearity in linear models.
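
A minimal sketch of what drop_first=True does, on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical rows with one binary categorical column
toy = pd.DataFrame({"sex": ["Male", "Female", "Male"], "age": [39, 28, 50]})

encoded = pd.get_dummies(toy, columns=["sex"], drop_first=True)

# Categories are sorted, so "Female" is the dropped level and only "sex_Male" remains
print(encoded.columns.tolist())
```

The dropped level becomes the implicit baseline: a row with `sex_Male` equal to 0 (or False) is "Female".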

● Use Label Encoding for the remaining categorical variables.

Label Encoding is a data preprocessing technique used to convert categorical labels (text) into a numerical format.

In [22]:
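label_cols = [col for col in categorical_cols if col not in onehot_cols]
label_encoders = {}  # keep one fitted encoder per column so codes can be mapped back later
for col in label_cols:
    label_encoders[col] = LabelEncoder()
    df[col] = label_encoders[col].fit_transform(df[col])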

The code above first selects the categorical columns not chosen for one-hot encoding (those with 5 or more unique values) and assigns them to label_cols. It then iterates through these columns, using a LabelEncoder to replace each text category with a unique integer (e.g., 'A' becomes 0, 'B' becomes 1, 'C' becomes 2). This keeps each high-cardinality feature in a single column. Note that LabelEncoder assigns integers in sorted order of the category names, so the codes do not reflect any natural ranking; when a feature is truly ordinal, the order has to be specified explicitly.
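
The difference between sorted codes and an explicit order can be sketched as follows. The `levels` values are hypothetical, and `OrdinalEncoder` is shown as one possible alternative, not as what this notebook uses:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordered category
levels = np.array(["low", "medium", "high", "medium"])

# LabelEncoder sorts the classes alphabetically: high=0, low=1, medium=2
label_codes = LabelEncoder().fit_transform(levels)

# OrdinalEncoder accepts an explicit order: low=0, medium=1, high=2
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
ordinal_codes = ordinal.fit_transform(levels.reshape(-1, 1)).ravel()

print(label_codes, ordinal_codes)
```

Tree-based models are largely insensitive to which integer a category receives, whereas linear and distance-based models treat the codes as magnitudes.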

# 3. Feature Engineering:

Feature Engineering is the critical process of transforming raw data into features that better represent the underlying problem to predictive models, thereby improving model performance and accuracy. Essentially, you're using domain knowledge and mathematical creativity to make the data more informative for the algorithm.

● Create at least 2 new features that could be beneficial for the model. Explain the rationale behind your choices.

In [23]:
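# The numerical columns were MinMax-scaled above, so recover the original units first
raw = pd.DataFrame(scaler_mm.inverse_transform(df[numerical_cols]), columns=numerical_cols, index=df.index).round()
df['AgeGroup'] = pd.cut(raw['age'], bins=[0, 25, 45, 65, 100], labels=[1,2,3,4]) # ex1: age bands in years

df['CapGainLossRatio'] = raw['capital_gain'] / (raw['capital_loss'] + 1)  # ex2: +1 avoids division by zero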
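
The bin edges used for AgeGroup are expressed in years, so the binning is only meaningful on ages in their original units. A small sketch with hypothetical ages shows how pd.cut assigns the bands, and what happens if the same edges are applied to values already scaled to [0, 1]:

```python
import pandas as pd

# Hypothetical ages in years, chosen to sit on and around the bin edges
ages = pd.Series([17, 25, 26, 45, 46, 65, 90])
bins = [0, 25, 45, 65, 100]
labels = [1, 2, 3, 4]

# Intervals are right-closed by default: (0, 25], (25, 45], (45, 65], (65, 100]
age_group = pd.cut(ages, bins=bins, labels=labels)

# On [0, 1]-scaled values the same edges collapse everything into the first band,
# and the minimum (scaled to exactly 0) falls outside (0, 25] and becomes NaN
scaled = (ages - ages.min()) / (ages.max() - ages.min())
scaled_group = pd.cut(scaled, bins=bins, labels=labels)

print(age_group.tolist())
print(scaled_group.tolist())
```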

● Apply a transformation (e.g., log transformation) to at least one skewed numerical feature.

In [24]:
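# Check skewness (unchanged by MinMax scaling, which is a linear rescaling)
skewed_cols = df[numerical_cols].skew().sort_values(ascending=False)
print("\nSkewed columns:\n", skewed_cols)
# capital_gain is the most skewed column. It was scaled to [0, 1] above, where log1p is
# nearly linear, so apply the log transform to the values in their original units.
capital_gain_raw = scaler_mm.inverse_transform(df[numerical_cols])[:, numerical_cols.index('capital_gain')]
df['capital_gain_log'] = np.log1p(capital_gain_raw)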


Skewed columns:
 capital_gain      11.953848
capital_loss       4.594629
fnlwgt             1.446980
age                0.558743
hours_per_week     0.227643
education_num     -0.311676
dtype: float64


The capital_gain feature in the Adult dataset is highly right-skewed (skewness of about 11.95), meaning most values are low while a few are extremely high. Applying a log transformation (here log1p, which also handles the many zero values) reduces the effect of these extreme values, making the distribution more symmetric and easier for machine learning models to learn from. This can improve model performance by preventing extreme values from dominating the learning process.
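
The effect can be checked on a small hypothetical sample shaped loosely like capital_gain (mostly zeros, a few very large values). The sketch also compares the result of applying log1p after the values have been scaled to [0, 1]:

```python
import numpy as np
import pandas as pd

# Hypothetical zero-heavy sample, not taken from the dataset
raw = pd.Series([0] * 90 + [1000] * 8 + [50000, 99999], dtype=float)

skew_raw = raw.skew()
skew_log = np.log1p(raw).skew()

# After scaling to [0, 1], log1p is close to linear, so little skew is removed
scaled = raw / raw.max()
skew_log_scaled = np.log1p(scaled).skew()

print(skew_raw, skew_log, skew_log_scaled)
```

On this sample the log transform of the original values lowers the skewness noticeably, while the transform of the scaled values leaves it close to the untransformed figure.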

**Revised Insight:**\
Scaling was addressed by applying MinMax Scaling exclusively. The redundant step of applying Standard Scaling immediately before it was removed, in line with the best practice of avoiding sequential, overwriting transformations that could lead to non-optimal feature distributions.\
The MinMax Scaler was chosen to bound all numerical features between 0 and 1.\
The remaining preprocessing steps, including handling missing values, applying One-Hot and Label Encoding, and the justified log transformation on the skewed capital_gain feature, all contribute to a well-prepared dataset suitable for predictive modeling.
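
As an optional consolidation, the imputation, scaling, and one-hot steps summarized here could be chained so that no step overwrites the DataFrame in place. The following is a sketch under stated assumptions, not the notebook's actual method: the toy frame, the column lists, and the step names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical rows with one missing value per column type
toy = pd.DataFrame({
    "age": [39, 50, np.nan, 53],
    "hours_per_week": [40, 13, 40, 40],
    "sex": pd.Series(["Male", "Male", "Female", np.nan], dtype=object),
})
numeric = ["age", "hours_per_week"]
categorical = ["sex"]

# Median imputation followed by MinMax scaling for numerical columns
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", MinMaxScaler()),
])

# Mode imputation followed by one-hot encoding (first level dropped)
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(drop="first")),
])

preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric),
    ("cat", categorical_steps, categorical),
])

features = preprocess.fit_transform(toy)
if hasattr(features, "toarray"):  # the result may be sparse depending on density
    features = features.toarray()
print(features.shape)
```

Keeping the fitted transformer separate from the data also makes it possible to apply exactly the same preprocessing to new rows with `preprocess.transform(...)`.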