## *DATA PREPROCESSING AND FEATURE ENGINEERING IN MACHINE LEARNING*

### *__Objective__*

This assignment aims to equip you with practical skills in data preprocessing, feature engineering, and feature selection, which are crucial for building efficient machine learning models. You will work with a provided dataset to apply techniques such as scaling and encoding, outlier detection with Isolation Forest, and Predictive Power Score (PPS) analysis.

### *__1. Data Exploration and Preprocessing__*

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv(r"/Users/rahulpoojith/Documents/Excelr Datasets/Machine Learning Datasets/adult_with_headers.csv")
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [3]:
df.shape

(32561, 15)

In [4]:
# Display basic information
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education_num   32561 non-null  int64 
 5   marital_status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital_gain    32561 non-null  int64 
 11  capital_loss    32561 non-null  int64 
 12  hours_per_week  32561 non-null  int64 
 13  native_country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
None


In [5]:
# Display summary statistics
df.describe(include="all")

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
count,32561.0,32561,32561.0,32561,32561.0,32561,32561,32561,32561,32561,32561.0,32561.0,32561.0,32561,32561
unique,,9,,16,,7,15,6,5,2,,,,42,2
top,,Private,,HS-grad,,Married-civ-spouse,Prof-specialty,Husband,White,Male,,,,United-States,<=50K
freq,,22696,,10501,,14976,4140,13193,27816,21790,,,,29170,24720
mean,38.581647,,189778.4,,10.080679,,,,,,1077.648844,87.30383,40.437456,,
std,13.640433,,105550.0,,2.57272,,,,,,7385.292085,402.960219,12.347429,,
min,17.0,,12285.0,,1.0,,,,,,0.0,0.0,1.0,,
25%,28.0,,117827.0,,9.0,,,,,,0.0,0.0,40.0,,
50%,37.0,,178356.0,,10.0,,,,,,0.0,0.0,40.0,,
75%,48.0,,237051.0,,12.0,,,,,,0.0,0.0,45.0,,
max,90.0,,1484705.0,,16.0,,,,,,99999.0,4356.0,99.0,,


In [6]:
df.describe()

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,38.581647,189778.4,10.080679,1077.648844,87.30383,40.437456
std,13.640433,105550.0,2.57272,7385.292085,402.960219,12.347429
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117827.0,9.0,0.0,0.0,40.0
50%,37.0,178356.0,10.0,0.0,0.0,40.0
75%,48.0,237051.0,12.0,0.0,0.0,45.0
max,90.0,1484705.0,16.0,99999.0,4356.0,99.0


In [7]:
# Check for missing values
print(df.isnull().sum())

age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64


### Handling Missing Values

In [8]:
# Method 1: drop rows that contain missing values.
# The check above found none, so df_cleaned has the same rows as df.
df_cleaned = df.dropna()

# Method 2 (alternative): impute missing numeric values with the column median.
# df_imputed = df.fillna(df.median(numeric_only=True))

In [9]:
# Check again for missing values after handling
print(df_cleaned.isnull().sum())

age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64
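Since this dataset reported no nulls, the imputation path above never runs. The sketch below shows how it could look on a small made-up frame (the `toy` data and column values are assumptions for illustration only): numeric gaps get the column median, text gaps get the most frequent value.

```python
import numpy as np
import pandas as pd

# Toy data with gaps (hypothetical values, not taken from the dataset above).
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "workclass": ["Private", "State-gov", None, "Private"],
})

numeric_cols = toy.select_dtypes(include="number").columns
categorical_cols = toy.select_dtypes(exclude="number").columns

toy_imputed = toy.copy()
# Numeric gaps: fill with the column median, which is robust to outliers.
toy_imputed[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())
# Categorical gaps: fill with the most frequent value (mode) of each column.
for col in categorical_cols:
    toy_imputed[col] = toy[col].fillna(toy[col].mode().iloc[0])

print(toy_imputed)
```

Dropping rows is simpler but discards data; imputing keeps every row at the cost of introducing estimated values.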


### Scaling Techniques

In [10]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [11]:
# Selecting numerical features
numerical_features = df.select_dtypes(include=['int64', 'float64']).columns

# Standard Scaling
scaler_standard = StandardScaler()
df_standard_scaled = df.copy()
df_standard_scaled[numerical_features] = scaler_standard.fit_transform(df[numerical_features])


# Min-Max Scaling
# scaler_minmax = MinMaxScaler()
# df_minmax_scaled = df.copy()
# df_minmax_scaled[numerical_features] = scaler_minmax.fit_transform(df[numerical_features])



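Scenario discussions

Standard Scaling: Preferred when a feature is roughly Gaussian (normally distributed) or when the model expects inputs with zero mean and unit variance. Outliers still shift the mean and standard deviation, but they do not bound the output range.

Min-Max Scaling: Preferred when no particular distribution is assumed and the model needs inputs in a fixed range such as [0, 1]. Because the range is set by the minimum and maximum, a single extreme value can compress all other values into a narrow band.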
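To make the difference concrete, here is a minimal sketch on one made-up feature containing a single extreme value (the numbers are hypothetical). It only illustrates how each scaler reacts to an outlier; it is not a recommendation for this dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One toy feature with a single extreme value (hypothetical numbers).
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard_scaled = StandardScaler().fit_transform(values)
minmax_scaled = MinMaxScaler().fit_transform(values)

# StandardScaler: the result has mean 0 and unit (population) standard deviation.
print(standard_scaled.ravel().round(3))
# MinMaxScaler: the result lies in [0, 1]; the outlier pins the upper end, so
# the four ordinary values are squeezed into the bottom of the range.
print(minmax_scaled.ravel().round(3))
```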

##  *2. Encoding Techniques*

#### • Apply One-Hot Encoding to categorical variables with fewer than 5 categories.

In [12]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

In [13]:
# Identifying categorical features
categorical_features = df.select_dtypes(include=['object']).columns
categorical_features

Index(['workclass', 'education', 'marital_status', 'occupation',
       'relationship', 'race', 'sex', 'native_country', 'income'],
      dtype='object')

In [14]:
# Applying One-Hot Encoding
onehot_features = [col for col in categorical_features if df[col].nunique() < 5]
df_onehot_encoded = pd.get_dummies(df, columns=onehot_features)
df_onehot_encoded.head(5)

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,capital_gain,capital_loss,hours_per_week,native_country,sex_ Female,sex_ Male,income_ <=50K,income_ >50K
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,2174,0,40,United-States,0,1,1,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,0,0,13,United-States,0,1,1,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,0,0,40,United-States,0,1,1,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,0,0,40,United-States,0,1,1,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,0,0,40,Cuba,1,0,1,0
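The generated column names above (`sex_ Female`, `income_ <=50K`) contain a space after the underscore, which suggests the category values carry a leading space. Assuming that is the case, a sketch like the following strips the whitespace before encoding; the `toy` frame is made up for illustration.

```python
import pandas as pd

# Toy frame mimicking values with a leading space (an assumption drawn from
# column names such as "sex_ Female" in the output above).
toy = pd.DataFrame({"sex": [" Male", " Female", " Male"], "age": [39, 28, 50]})

# Strip surrounding whitespace from every text column before encoding.
text_cols = toy.select_dtypes(exclude="number").columns
toy[text_cols] = toy[text_cols].apply(lambda s: s.str.strip())

# dtype=int yields 0/1 columns instead of booleans.
encoded = pd.get_dummies(toy, columns=["sex"], dtype=int)
print(encoded.columns.tolist())
```

Cleaning the values first keeps the encoded column names predictable and avoids treating `" Male"` and `"Male"` as different categories.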


#### • Use Label Encoding for categorical variables with 5 or more categories.

In [15]:
# Applying Label Encoding
label_features = [col for col in categorical_features if df[col].nunique() >= 5]
df_label_encoded = df.copy()
label_encoders = {}  # one fitted encoder per column, kept for inverse_transform

for col in label_features:
    label_encoders[col] = LabelEncoder()
    df_label_encoded[col] = label_encoders[col].fit_transform(df[col])



#### •	Discuss the pros and cons of One-Hot Encoding and Label Encoding.

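One-Hot Encoding: Pros - No ordinal relationship is assumed between categories. Cons - Adds one column per category, so dimensionality grows quickly for high-cardinality features.

Label Encoding: Pros - Simple and compact (one integer column), and works well with tree-based models. Cons - Implies an ordering between categories that may not exist, which can mislead linear and distance-based models.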
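The trade-off can be seen on a tiny made-up column (the `colors` data is hypothetical): one-hot encoding widens the frame, while label encoding assigns integers in alphabetical order and so imposes an order that carries no meaning.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot: one 0/1 column per category, no order implied.
one_hot = pd.get_dummies(colors, columns=["color"], dtype=int)

# Label encoding: a single integer column; classes are sorted alphabetically,
# so blue=0, green=1, red=2, an order with no real meaning here.
encoder = LabelEncoder()
labels = encoder.fit_transform(colors["color"])

print(one_hot.shape)      # (4, 3)
print(labels.tolist())    # [2, 1, 0, 1]
# The fitted encoder can map integers back to the original categories.
print(encoder.inverse_transform([0, 1, 2]).tolist())
```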

## *3. Feature Engineering:*

### Creating New Features

Let's create two new features: 'age_group' and 'hours_per_week_group'. The rationale is to capture age and working hours patterns which might be indicative of income levels.

In [16]:
# Creating 'age_group' feature
df['age_group'] = pd.cut(df['age'], bins=[0, 25, 45, 65, np.inf], labels=['young', 'middle-aged', 'senior', 'elder'])


In [17]:
# Creating 'hours_per_week_group' feature
df['hours_per_week_group'] = pd.cut(df['hours_per_week'], bins=[0, 25, 40, 60, np.inf], labels=['part-time', 'full-time', 'over-time', 'extreme'])

In [18]:
print(df[['age', 'age_group', 'hours_per_week', 'hours_per_week_group']].head())

   age    age_group  hours_per_week hours_per_week_group
0   39  middle-aged              40            full-time
1   50       senior              13            part-time
2   38  middle-aged              40            full-time
3   53       senior              40            full-time
4   28  middle-aged              40            full-time
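One detail of `pd.cut` worth checking is how values that fall exactly on a bin edge are assigned. The sketch below reuses the same age bins on a few hand-picked boundary values and compares the default `right=True` with `right=False`.

```python
import numpy as np
import pandas as pd

ages = pd.Series([17, 25, 26, 45, 46, 65, 66, 90])
bins = [0, 25, 45, 65, np.inf]
labels = ["young", "middle-aged", "senior", "elder"]

# Default right=True: intervals are (0, 25], (25, 45], (45, 65], (65, inf).
right_closed = pd.cut(ages, bins=bins, labels=labels)
# right=False: intervals are [0, 25), [25, 45), [45, 65), [65, inf).
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"age": ages, "right=True": right_closed, "right=False": left_closed}))
```

With the default setting used above, an age of exactly 25 is labelled `young` and 45 is labelled `middle-aged`; with `right=False` they move up one group.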


### • Applying log transformation to 'capital_gain' due to its skewness

In [19]:
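# log1p computes log(1 + x), so zero gains map to 0 instead of -inf.
# Note: this overwrites the column in place; re-running the cell applies the log twice.
df['capital_gain'] = np.log1p(df['capital_gain'])

# Log transformation reduces skewness and brings outlier values closer to the rest of the data.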
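The effect on skewness can be checked numerically. This is a small sketch on synthetic right-skewed data (a seeded log-normal sample, not the dataset's `capital_gain`), comparing skewness before and after `np.log1p` and confirming that `np.expm1` undoes the transformation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed, non-negative values (not the dataset's capital_gain).
raw = pd.Series(rng.lognormal(mean=5.0, sigma=1.5, size=1000))

transformed = np.log1p(raw)        # log(1 + x), defined at x = 0
restored = np.expm1(transformed)   # inverse of log1p

print(f"skew before: {raw.skew():.2f}")
print(f"skew after:  {transformed.skew():.2f}")
```

Keeping the inverse in mind is useful when a transformed value has to be reported back in its original units.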

### *4. Feature Selection*

#### Using Isolation Forest to Remove Outliers

In [20]:
from sklearn.ensemble import IsolationForest

In [21]:
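# Using Isolation Forest to detect outliers (contamination=0.05 flags about 5% of rows)
# random_state is fixed so that the flagged rows are reproducible across runs
isolation_forest = IsolationForest(contamination=0.05, random_state=42)
outliers = isolation_forest.fit_predict(df[numerical_features])

# Removing outliers (fit_predict returns 1 for inliers and -1 for outliers)
df_no_outliers = df[outliers == 1]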

print(df_no_outliers.shape)

(30933, 17)
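To see what the `1` / `-1` labels mean, the following sketch runs Isolation Forest on synthetic two-dimensional data with five hand-placed extreme points. The data, the seed, and the variable names are assumptions for illustration; only the `IsolationForest` calls mirror the cell above.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal_points = rng.normal(loc=0.0, scale=1.0, size=(195, 2))
# Five hand-placed points far from the main cluster (hypothetical data).
extreme_points = np.array([[8, 8], [-8, 8], [8, -8], [-8, -8], [0, 12]], dtype=float)
X = np.vstack([normal_points, extreme_points])

forest = IsolationForest(contamination=0.05, random_state=0)
labels = forest.fit_predict(X)   # 1 = inlier, -1 = outlier

X_clean = X[labels == 1]
print("flagged:", int((labels == -1).sum()), "kept:", len(X_clean))
print("labels of the five extreme points:", labels[-5:])
```

Note that `contamination` fixes the share of rows flagged, so some ordinary points near the edge of the cluster are removed along with the extreme ones.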


#### Using PPS (Predictive Power Score) for Feature Relationships

In [22]:
pip install ppscore

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [23]:
import ppscore as pps

# Computing PPS matrix
pps_matrix = pps.matrix(df)

# Displaying PPS matrix
print(pps_matrix)

# Comparing PPS with correlation matrix
# Pearson correlation is defined only for numeric columns, so restrict to those
correlation_matrix = df.corr(numeric_only=True)

print(correlation_matrix.shape)



                        x                     y   ppscore            case  \
0                     age                   age  1.000000  predict_itself   
1                     age             workclass  0.011232  classification   
2                     age                fnlwgt  0.000000      regression   
3                     age             education  0.052315  classification   
4                     age         education_num  0.000000      regression   
..                    ...                   ...       ...             ...   
284  hours_per_week_group        hours_per_week  0.563282      regression   
285  hours_per_week_group        native_country  0.000000  classification   
286  hours_per_week_group                income  0.000000  classification   
287  hours_per_week_group             age_group  0.088831  classification   
288  hours_per_week_group  hours_per_week_group  1.000000  predict_itself   

     is_valid_score               metric  baseline_score   model_score  \
0


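One reason to compare PPS with correlation is that Pearson correlation only measures linear association. The sketch below illustrates this with a simplified PPS-style score on made-up data where `y = x**2`. It is an approximation written for illustration (a cross-validated decision tree compared against a median baseline), not the `ppscore` implementation, and all names in it are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.tree import DecisionTreeRegressor

# Symmetric grid so that the linear correlation between x and x**2 is zero.
x = np.linspace(-1.0, 1.0, 501)
toy = pd.DataFrame({"x": x, "y": x ** 2})

pearson = toy["x"].corr(toy["y"])

# Simplified PPS-style score for a numeric target:
# 1 - MAE(model) / MAE(always predicting the median), floored at 0.
cv = KFold(n_splits=4, shuffle=True, random_state=0)
model = DecisionTreeRegressor(random_state=0)
predictions = cross_val_predict(model, toy[["x"]], toy["y"], cv=cv)
mae_model = mean_absolute_error(toy["y"], predictions)
mae_baseline = mean_absolute_error(toy["y"], np.full(len(toy), toy["y"].median()))
pps_like = max(0.0, 1.0 - mae_model / mae_baseline)

print(f"Pearson correlation: {pearson:.3f}")
print(f"PPS-style score:     {pps_like:.3f}")
```

Here the correlation is essentially zero even though `x` predicts `y` almost perfectly, which is the kind of relationship a predictive score can surface and a correlation matrix cannot.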