## DATA PREPROCESSING AND FEATURE ENGINEERING IN MACHINE LEARNING

### Objective:
This assignment builds practical skills in data preprocessing, feature engineering, and feature selection, which are crucial for building efficient machine learning models. You will work with a provided dataset and apply scaling, encoding, outlier detection with Isolation Forest, and Predictive Power Score (PPS) analysis.


### Data Exploration and Preprocessing

In [10]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import IsolationForest



# Load dataset
data = pd.read_csv(r"https://raw.githubusercontent.com/rohitmaind/ExcelR_Assignments/main/Datasets/adult_with_headers.csv")

# Basic data exploration
print("Summary Statistics:\n", data.describe())
print("\nMissing Values:\n", data.isnull().sum())
print("\nData Types:\n", data.dtypes)

# Handle missing values (mean imputation; a no-op if 'age' has no NaNs)
imputer = SimpleImputer(strategy='mean')
data['age'] = imputer.fit_transform(data[['age']]).ravel()  # flatten (n, 1) -> (n,)

# Apply scaling techniques to the numeric columns only
numeric_cols = data.select_dtypes(include=[np.number]).columns

scaler_standard = StandardScaler()
data_standard_scaled = pd.DataFrame(scaler_standard.fit_transform(data[numeric_cols]), columns=numeric_cols, index=data.index)

scaler_minmax = MinMaxScaler()
data_minmax_scaled = pd.DataFrame(scaler_minmax.fit_transform(data[numeric_cols]), columns=numeric_cols, index=data.index)

# Display scaled data
print("\nStandard Scaled Data:\n", data_standard_scaled.head())
print("\nMin-Max Scaled Data:\n", data_minmax_scaled.head())


Summary Statistics:
                 age        fnlwgt  education_num  capital_gain  capital_loss  \
count  32561.000000  3.256100e+04   32561.000000  32561.000000  32561.000000   
mean      38.581647  1.897784e+05      10.080679   1077.648844     87.303830   
std       13.640433  1.055500e+05       2.572720   7385.292085    402.960219   
min       17.000000  1.228500e+04       1.000000      0.000000      0.000000   
25%       28.000000  1.178270e+05       9.000000      0.000000      0.000000   
50%       37.000000  1.783560e+05      10.000000      0.000000      0.000000   
75%       48.000000  2.370510e+05      12.000000      0.000000      0.000000   
max       90.000000  1.484705e+06      16.000000  99999.000000   4356.000000   

       hours_per_week  
count    32561.000000  
mean        40.437456  
std         12.347429  
min          1.000000  
25%         40.000000  
50%         40.000000  
75%         45.000000  
max         99.000000  

Missing Values:
 age               0
work
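
The zero count for `age` above means the mean imputation has nothing to fill. A common reason for an all-zero missing-value report is that gaps are stored as a placeholder string instead of NaN. The sketch below assumes a `?` placeholder with leading whitespace, which is how many copies of the Adult dataset mark unknown values; that convention, the toy frame, and the helper names `count_placeholders` and `placeholders_to_nan` are illustrative assumptions, so check the real columns with `value_counts()` first.

```python
import pandas as pd

# Toy stand-in for the real data; the " ?" placeholder is an assumption.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": [" State-gov", " ?", " Private", " ?"],
    "occupation": [" Adm-clerical", " ?", " Sales", " Craft-repair"],
})


def count_placeholders(df, placeholder="?"):
    """Count cells in non-numeric (here: string) columns equal to the placeholder."""
    text = df.select_dtypes(exclude=["number"])
    return text.apply(lambda col: (col.str.strip() == placeholder).sum())


def placeholders_to_nan(df, placeholder="?"):
    """Return a copy with string columns trimmed and placeholders set to NaN."""
    out = df.copy()
    for col in out.select_dtypes(exclude=["number"]).columns:
        stripped = out[col].str.strip()
        # mask() replaces the matching cells with a missing value
        out[col] = stripped.mask(stripped == placeholder)
    return out


placeholder_counts = count_placeholders(toy)
cleaned = placeholders_to_nan(toy)

print(placeholder_counts)
print(cleaned.isnull().sum())
```

Once placeholders are real NaNs, `isnull().sum()` reports them and an imputer (for example `SimpleImputer(strategy='most_frequent')` for categorical columns) has something to act on.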

### Encoding Techniques

In [11]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Select categorical columns
categorical_columns = data.select_dtypes(include=['object']).columns

# One-Hot Encoding for categorical variables with fewer than 5 categories
one_hot_encoder = OneHotEncoder(drop='first', sparse_output=False)
categorical_small = [col for col in categorical_columns if data[col].nunique() < 5]
data_one_hot_encoded = pd.DataFrame(one_hot_encoder.fit_transform(data[categorical_small]), columns=one_hot_encoder.get_feature_names_out(categorical_small), index=data.index)

# Label Encoding for categorical variables with 5 or more categories
categorical_large = [col for col in categorical_columns if data[col].nunique() >= 5]

# Fit a fresh encoder per column and keep it so the codes can be mapped back later
label_encoders = {}
for col in categorical_large:
    label_encoders[col] = LabelEncoder()
    data[col] = label_encoders[col].fit_transform(data[col])

# Merge the one-hot encoded columns back to the original dataframe
data = data.drop(columns=categorical_small)
data = pd.concat([data, data_one_hot_encoded], axis=1)

# Display the first few rows of the transformed dataframe
print(data.head())


    age  workclass  fnlwgt  education  education_num  marital_status  \
0  39.0          7   77516          9             13               4   
1  50.0          6   83311          9             13               2   
2  38.0          4  215646         11              9               0   
3  53.0          4  234721          1              7               2   
4  28.0          4  338409          9             13               2   

   occupation  relationship  race  capital_gain  capital_loss  hours_per_week  \
0           1             1     4          2174             0              40   
1           4             0     4             0             0              13   
2           6             1     4             0             0              40   
3           6             0     2             0             0              40   
4          10             5     2             0             0              40   

   native_country  sex_ Male  income_ >50K  
0              39        1.0       
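
The column names `sex_ Male` and `income_ >50K` in the output above contain a space after the underscore, which suggests the raw category strings start with whitespace. The sketch below, on a small made-up frame, trims the strings before encoding and keeps the code-to-label mapping for the integer-encoded column. It uses pandas only (`get_dummies` and categorical codes) as a stand-in for the scikit-learn encoders; categorical codes are assigned in sorted order, which should mirror what `LabelEncoder` produces for string labels.

```python
import pandas as pd

# Toy frame; the leading spaces imitate the raw strings in the dataset.
toy = pd.DataFrame({
    "sex": [" Male", " Female", " Male"],
    "education": [" Bachelors", " HS-grad", " Masters"],
})

# Trim whitespace so derived column names read "sex_Male", not "sex_ Male".
toy = toy.apply(lambda col: col.str.strip())

# One-hot encode the low-cardinality column, dropping the first level.
one_hot = pd.get_dummies(toy[["sex"]], drop_first=True, dtype=float)

# Integer-encode the other column and keep its code -> label mapping.
education_cat = toy["education"].astype("category")
education_codes = education_cat.cat.codes
education_mapping = dict(enumerate(education_cat.cat.categories))

encoded = pd.concat([one_hot, education_codes.rename("education")], axis=1)

print(encoded)
print(education_mapping)
```

Keeping `education_mapping` around makes it possible to translate the integer codes back into readable labels when interpreting a model.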

### Feature Engineering

In [12]:
# Creating new features
data['age_group'] = pd.cut(data['age'], bins=[0, 25, 45, 65, np.inf], labels=['Young', 'Adult', 'Middle-Aged', 'Senior'])
data['capital_gain_loss'] = data['capital_gain'] - data['capital_loss']

# Log transformation for skewed numerical feature
data['capital_gain_log'] = np.log1p(data['capital_gain'])

# Display new features
print("\nNew Features:\n", data[['age_group', 'capital_gain_loss', 'capital_gain_log']].head())



New Features:
      age_group  capital_gain_loss  capital_gain_log
0        Adult               2174          7.684784
1  Middle-Aged                  0          0.000000
2        Adult                  0          0.000000
3  Middle-Aged                  0          0.000000
4        Adult                  0          0.000000
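
Two details of the cell above are easy to miss: `pd.cut` uses right-closed intervals by default, so an age exactly on a bin edge falls into the lower group, and `np.log1p` maps zero to zero and can be inverted with `np.expm1`. The sketch below checks both on a few hand-picked values (the ages are made up; the capital gain values are taken from the summary statistics and the first data row).

```python
import numpy as np
import pandas as pd

# Ages chosen to sit on and just above each bin edge.
ages = pd.Series([25, 26, 45, 46, 65, 66])
age_group = pd.cut(
    ages,
    bins=[0, 25, 45, 65, np.inf],
    labels=["Young", "Adult", "Middle-Aged", "Senior"],
)
boundary_labels = dict(zip(ages.tolist(), age_group.astype(str).tolist()))
print(boundary_labels)

# log1p(x) = log(1 + x): defined at zero and compresses large values.
capital_gain = np.array([0.0, 2174.0, 99999.0])
gain_log = np.log1p(capital_gain)
recovered = np.expm1(gain_log)  # inverse transform
print(gain_log)
```

Passing `right=False` to `pd.cut` would make the intervals left-closed instead, moving each edge value into the higher group.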


### Feature Selection

In [16]:
# Uncomment to install ppscore if it is not already available
#!pip install --upgrade pip
#!pip install --upgrade pandas
#!pip install ppscore
import ppscore as pps

In [17]:
# Using Isolation Forest to identify outliers
isolation_forest = IsolationForest(contamination=0.1)
outliers = isolation_forest.fit_predict(data.select_dtypes(include=[np.number]))
data['outliers'] = outliers
data_no_outliers = data[data['outliers'] == 1]

# Display data without outliers
print("\nData without Outliers:\n", data_no_outliers.head())

# PPS score analysis (drop the helper 'outliers' column, which is constant after filtering)
pps_matrix = pps.matrix(data_no_outliers.drop(columns=['outliers']))

# Display PPS matrix
print("\nPPS Matrix:\n", pps_matrix)



Data without Outliers:
     age  workclass  fnlwgt  education  education_num  marital_status  \
0  39.0          7   77516          9             13               4   
1  50.0          6   83311          9             13               2   
2  38.0          4  215646         11              9               0   
3  53.0          4  234721          1              7               2   
4  28.0          4  338409          9             13               2   

   occupation  relationship  race  capital_gain  capital_loss  hours_per_week  \
0           1             1     4          2174             0              40   
1           4             0     4             0             0              13   
2           6             1     4             0             0              40   
3           6             0     2             0             0              40   
4          10             5     2             0             0              40   

   native_country  sex_ Male  income_ >50K    age_group
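
To make the filtering step above easier to follow: `fit_predict` returns `1` for inliers and `-1` for outliers, and `contamination` sets the fraction of training rows that end up flagged. The sketch below runs the same call on synthetic two-dimensional data with a few far-away points planted by hand. The data and the `random_state=0` seed are assumptions added for reproducibility and are not part of the original cell.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: 200 points around the origin, 5 of them moved far away.
rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X[:5] = [[20, 20], [-20, 20], [20, -20], [-20, -20], [0, 30]]

# contamination=0.1 asks the model to flag roughly 10% of the rows.
forest = IsolationForest(contamination=0.1, random_state=0)
labels = forest.fit_predict(X)  # 1 = inlier, -1 = outlier

n_flagged = int((labels == -1).sum())
planted_flagged = int((labels[:5] == -1).sum())
X_kept = X[labels == 1]

print("rows flagged:", n_flagged)
print("planted points flagged:", planted_flagged)
print("rows kept:", X_kept.shape[0])
```

Because `contamination` fixes the flagged share in advance, about 10% of rows are removed even when the data contains fewer true anomalies, so it is worth inspecting the flagged rows before dropping them.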

### Conclusion

In this assignment, we focused on essential steps for preparing the "Adult" dataset for machine learning:

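1. **Data Exploration and Preprocessing**:
   - **Summary Statistics & Missing Values**: Conducted basic exploration and applied mean imputation to `age`; the missing-value count shows `age` has no NaNs, so the imputation changes nothing here.
   - **Scaling**:
     - **Standard Scaling**: Centers each feature at zero mean and unit variance; a good default for roughly normally distributed features.
     - **Min-Max Scaling**: Maps each feature to the [0, 1] range; useful when a bounded range is needed, but sensitive to extreme values such as the `capital_gain` maximum of 99999.

2. **Encoding Techniques**:
   - **One-Hot Encoding**: Used for categorical variables with fewer than 5 categories so that no artificial order is imposed on them.
   - **Label Encoding**: Applied to categorical variables with 5 or more categories for simplicity; the integer codes are arbitrary and do imply an order, which matters for linear and distance-based models.

3. **Feature Engineering**:
   - Created two new features, `age_group` and `capital_gain_loss`, and applied a `log1p` transformation to the right-skewed `capital_gain` to compress its range.

4. **Feature Selection**:
   - **Isolation Forest**: Flagged about 10% of rows as outliers (`contamination=0.1`) and removed them; this is an outlier-handling step rather than feature selection proper.
   - **PPS Score**: Analyzed pairwise predictive relationships between columns. Unlike the correlation matrix, PPS is asymmetric and can capture non-linear and categorical relationships.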
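
The two scaling methods listed above are both per-column linear maps: standard scaling computes (x - mean) / std, and min-max scaling computes (x - min) / (max - min). The sketch below recomputes both by hand with NumPy on a made-up column with one extreme value. It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

# Toy column with one extreme value, loosely imitating capital_gain.
x = np.array([0.0, 0.0, 0.0, 2174.0, 99999.0])

# Standard scaling: zero mean, unit variance (x.std() defaults to ddof=0).
standardized = (x - x.mean()) / x.std()

# Min-max scaling: smallest value -> 0, largest value -> 1.
min_max = (x - x.min()) / (x.max() - x.min())

print(standardized.round(3))
print(min_max.round(3))
```

Since both maps are linear, neither changes the shape of the distribution; the practical difference is the output range and how strongly a single extreme value squeezes the remaining values together.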

These preprocessing steps are crucial for building effective and efficient machine learning models, ensuring the data is clean, well-scaled, and features are appropriately engineered and selected.