## FEATURE ENGINEERING using Python

Feature engineering is a critical step in solving a data science problem. It involves creating or selecting the most relevant and informative features (input variables) from raw data. These features are then used to train machine learning models, so their quality directly affects a model’s performance. In this notebook, I’ll take you through a practical guide to feature engineering using Python.

#### Importing Necessary Python Libraries: 

In [13]:
import pandas as pd

#### Loading a dataset. Here, we used the dataset for "Dynamic Pricing":

In [14]:
df = pd.read_csv('dynamic_pricing.csv')
df.head()

| | Number_of_Riders | Number_of_Drivers | Location_Category | Customer_Loyalty_Status | Number_of_Past_Rides | Average_Ratings | Time_of_Booking | Vehicle_Type | Expected_Ride_Duration | Historical_Cost_of_Ride |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 90 | 45 | Urban | Silver | 13 | 4.47 | Night | Premium | 90 | 284.257273 |
| 1 | 58 | 39 | Suburban | Silver | 72 | 4.06 | Evening | Economy | 43 | 173.874753 |
| 2 | 42 | 31 | Rural | Silver | 0 | 3.99 | Afternoon | Premium | 76 | 329.795469 |
| 3 | 89 | 28 | Rural | Regular | 67 | 4.31 | Afternoon | Premium | 134 | 470.201232 |
| 4 | 78 | 22 | Rural | Regular | 74 | 3.77 | Afternoon | Economy | 149 | 579.681422 |


In a typical data science workflow, **feature engineering** is performed after **data exploration** and **preprocessing**, because to create or select the most important features, you must first understand the features you have.

#### Feature Selection Process:

Once you have explored your data, identify the most important features to solve your problem. Use techniques like **correlation analysis**, **feature importance from tree-based models**, or **domain knowledge** to select a subset of features.

In [15]:
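# Correlation is only defined for numeric columns, so restrict the
# calculation to them (numeric_only requires pandas >= 1.5).
correlation_matrix = df.corr(numeric_only=True)

# Maximum absolute correlation allowed between two features
threshold = 0.7

# For each pair above the threshold, mark the later column for removal
correlated_features = set()
for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if abs(correlation_matrix.iloc[i, j]) > threshold:
            colname = correlation_matrix.columns[i]
            correlated_features.add(colname)

correlated_df = df.drop(columns=list(correlated_features))
correlated_df.head()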

| | Number_of_Riders | Number_of_Drivers | Location_Category | Customer_Loyalty_Status | Number_of_Past_Rides | Average_Ratings | Time_of_Booking | Vehicle_Type | Expected_Ride_Duration |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 90 | 45 | Urban | Silver | 13 | 4.47 | Night | Premium | 90 |
| 1 | 58 | 39 | Suburban | Silver | 72 | 4.06 | Evening | Economy | 43 |
| 2 | 42 | 31 | Rural | Silver | 0 | 3.99 | Afternoon | Premium | 76 |
| 3 | 89 | 28 | Rural | Regular | 67 | 4.31 | Afternoon | Premium | 134 |
| 4 | 78 | 22 | Rural | Regular | 74 | 3.77 | Afternoon | Economy | 149 |


Here, we perform feature selection by identifying and removing highly correlated features. We start by calculating the correlation matrix, which measures the linear relationship between each pair of numerical features. We then set a correlation threshold of 0.7, the maximum absolute correlation we allow between two features. The nested loop visits each pair once by walking the lower triangular part of the matrix (every cell where the row index is greater than the column index). Whenever a pair exceeds the threshold, the later of the two columns is added to the `correlated_features` set, so one feature from each correlated pair is kept. Finally, we drop the columns in that set; in this dataset, that removes `Historical_Cost_of_Ride`, as the output above shows. Reducing multicollinearity in this way can improve the performance and interpretability of machine learning models.
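The selection logic described above can be wrapped in a small reusable helper. The sketch below is illustrative only: the function name `find_correlated_columns` and the toy DataFrame are hypothetical and are not part of the dynamic pricing dataset. It restricts the calculation to numeric columns with `select_dtypes`, which is an assumption about how non-numeric columns should be handled.

```python
import pandas as pd


def find_correlated_columns(frame, threshold=0.7):
    """Return the later column of every numeric pair whose absolute
    correlation exceeds the threshold (hypothetical helper)."""
    # Keep numeric columns only; correlation is undefined for text columns
    corr = frame.select_dtypes(include="number").corr().abs()
    to_drop = set()
    for i in range(len(corr.columns)):
        # j < i walks the lower triangle, so each pair is checked once
        for j in range(i):
            if corr.iloc[i, j] > threshold:
                to_drop.add(corr.columns[i])
    return to_drop


# Toy data: cost rises almost linearly with duration, rating does not
toy = pd.DataFrame({
    "duration": [10, 20, 30, 40, 50],
    "cost": [21, 39, 62, 80, 101],
    "rating": [4.1, 3.9, 4.5, 4.0, 4.2],
    "city": ["a", "b", "a", "b", "a"],
})

dropped = find_correlated_columns(toy)
reduced = toy.drop(columns=list(dropped))
print(dropped)
print(reduced.columns.tolist())
```

Because the helper always drops the later column of a pair, the column order of the input decides which feature survives. If one of the columns is the prediction target, it may be worth excluding it before running the filter.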

In [16]:
selected_features = ['Number_of_Riders', 'Number_of_Drivers',
                     'Location_Category', 'Number_of_Past_Rides',
                     'Average_Ratings', 'Vehicle_Type',
                     'Expected_Ride_Duration', 'Historical_Cost_of_Ride']

domain_based_features = df[selected_features]
domain_based_features.head()

| | Number_of_Riders | Number_of_Drivers | Location_Category | Number_of_Past_Rides | Average_Ratings | Vehicle_Type | Expected_Ride_Duration | Historical_Cost_of_Ride |
|---|---|---|---|---|---|---|---|---|
| 0 | 90 | 45 | Urban | 13 | 4.47 | Premium | 90 | 284.257273 |
| 1 | 58 | 39 | Suburban | 72 | 4.06 | Economy | 43 | 173.874753 |
| 2 | 42 | 31 | Rural | 0 | 3.99 | Premium | 76 | 329.795469 |
| 3 | 89 | 28 | Rural | 67 | 4.31 | Premium | 134 | 470.201232 |
| 4 | 78 | 22 | Rural | 74 | 3.77 | Economy | 149 | 579.681422 |


#### Feature Creation:

In [17]:
data = domain_based_features.copy()

In [18]:
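# Demand relative to supply for each ride request
data['Riders_to_Drivers_Ratio'] = data['Number_of_Riders'] / data['Number_of_Drivers']
# Note: rows where Number_of_Past_Rides is 0 produce inf in this column
data['Cost_Per_Past_Ride'] = data['Historical_Cost_of_Ride'] / data['Number_of_Past_Rides']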

In [19]:
data.head()

| | Number_of_Riders | Number_of_Drivers | Location_Category | Number_of_Past_Rides | Average_Ratings | Vehicle_Type | Expected_Ride_Duration | Historical_Cost_of_Ride | Riders_to_Drivers_Ratio | Cost_Per_Past_Ride |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 90 | 45 | Urban | 13 | 4.47 | Premium | 90 | 284.257273 | 2.0 | 21.865944 |
| 1 | 58 | 39 | Suburban | 72 | 4.06 | Economy | 43 | 173.874753 | 1.487179 | 2.414927 |
| 2 | 42 | 31 | Rural | 0 | 3.99 | Premium | 76 | 329.795469 | 1.354839 | inf |
| 3 | 89 | 28 | Rural | 67 | 4.31 | Premium | 134 | 470.201232 | 3.178571 | 7.017929 |
| 4 | 78 | 22 | Rural | 74 | 3.77 | Economy | 149 | 579.681422 | 3.545455 | 7.833533 |
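Row 2 in the output above has `Number_of_Past_Rides` equal to 0, so `Cost_Per_Past_Ride` comes out as `inf`. Many models cannot handle infinite values, so it is usually worth deciding how to treat them. One possible approach, sketched here on a three-row sample copied from the output above, is to mask zero denominators so the result becomes a missing value instead. Whether to keep it as `NaN`, fill it with 0, or drop the row is a modelling choice that this notebook does not make.

```python
import numpy as np
import pandas as pd

# Three rows taken from the output above; the last has no past rides
sample = pd.DataFrame({
    "Historical_Cost_of_Ride": [284.257273, 173.874753, 329.795469],
    "Number_of_Past_Rides": [13, 72, 0],
})

# Replace zero denominators with NaN so the division yields NaN, not inf
past_rides = sample["Number_of_Past_Rides"].where(
    sample["Number_of_Past_Rides"] > 0
)
sample["Cost_Per_Past_Ride"] = sample["Historical_Cost_of_Ride"] / past_rides

# Confirm that no infinite values remain in the new feature
has_inf = np.isinf(sample["Cost_Per_Past_Ride"]).any()
print(sample)
print("Contains inf:", has_inf)
```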


#### Handling text and Categorical Data:

Method: one-hot encoding, which replaces each categorical column with one binary (0/1) column per category.

In [20]:
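# Note: prefix='is_' combined with the default prefix_sep='_' yields
# column names with a double underscore, such as is__Rural
data = pd.get_dummies(data, columns=['Location_Category', 'Vehicle_Type'], prefix='is_')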

In [21]:
data.head()

| | Number_of_Riders | Number_of_Drivers | Number_of_Past_Rides | Average_Ratings | Expected_Ride_Duration | Historical_Cost_of_Ride | Riders_to_Drivers_Ratio | Cost_Per_Past_Ride | is__Rural | is__Suburban | is__Urban | is__Economy | is__Premium |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 90 | 45 | 13 | 4.47 | 90 | 284.257273 | 2.0 | 21.865944 | 0 | 0 | 1 | 0 | 1 |
| 1 | 58 | 39 | 72 | 4.06 | 43 | 173.874753 | 1.487179 | 2.414927 | 0 | 1 | 0 | 1 | 0 |
| 2 | 42 | 31 | 0 | 3.99 | 76 | 329.795469 | 1.354839 | inf | 1 | 0 | 0 | 0 | 1 |
| 3 | 89 | 28 | 67 | 4.31 | 134 | 470.201232 | 3.178571 | 7.017929 | 1 | 0 | 0 | 0 | 1 |
| 4 | 78 | 22 | 74 | 3.77 | 149 | 579.681422 | 3.545455 | 7.833533 | 1 | 0 | 0 | 1 | 0 |
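With a single shared prefix, the encoded column names no longer say which original column they came from. As a possible variation, `pd.get_dummies` also accepts a dictionary for `prefix`, which gives each source column its own label. The sketch below uses a three-row sample copied from the data above; the prefixes `loc` and `vehicle` are illustrative choices, and `dtype=int` is passed so the result is 0/1 regardless of the pandas version.

```python
import pandas as pd

# Small sample with the two categorical columns from the dataset
sample = pd.DataFrame({
    "Location_Category": ["Urban", "Suburban", "Rural"],
    "Vehicle_Type": ["Premium", "Economy", "Premium"],
})

# One prefix per source column keeps the encoded names traceable
encoded = pd.get_dummies(
    sample,
    columns=["Location_Category", "Vehicle_Type"],
    prefix={"Location_Category": "loc", "Vehicle_Type": "vehicle"},
    dtype=int,
)

print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one 1 within each group of encoded columns, which is a quick sanity check that the encoding worked as intended.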

