
# Feature Engineering

#### What is F-test?
The F-test is a statistical method used in machine learning to figure out if a feature (or group of features) is useful for predicting the target variable. It helps you decide if a feature adds value or can be ignored.

**For example:** If you’re predicting house prices, the F-test can tell you whether "house size" or "distance from the city" is important for the prediction or not.

#### What does the F-test do?

It compares two sources of variation:

- Between groups/features: the variation explained by the model.
- Within groups: the random noise or error not explained by the model.

The ratio of the two indicates whether the feature significantly helps predict the target variable.
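
For a single numeric feature and a numeric target, this comparison reduces to a short formula based on the Pearson correlation `r`: `F = r^2 / (1 - r^2) * (n - 2)`. The sketch below uses made-up data (the arrays `feature` and `target` are hypothetical) to compute that value by hand and compare it with scikit-learn's `f_regression`, which applies the same univariate test.

```python
import numpy as np
from scipy import stats
from sklearn.feature_selection import f_regression

# Hypothetical data: one feature that is linearly related to the target, plus noise
rng = np.random.default_rng(0)
n = 50
feature = rng.normal(size=n)
target = 2.0 * feature + rng.normal(size=n)

# F-statistic from the Pearson correlation r, with n - 2 residual degrees of freedom
r = np.corrcoef(feature, target)[0, 1]
dof = n - 2
f_manual = r**2 / (1 - r**2) * dof
p_manual = stats.f.sf(f_manual, 1, dof)  # upper-tail probability of F(1, n - 2)

# The same quantities from scikit-learn (it expects a 2-D feature matrix)
f_sklearn, p_sklearn = f_regression(feature.reshape(-1, 1), target)

print(f"manual:  F = {f_manual:.4f}, p = {p_manual:.3g}")
print(f"sklearn: F = {f_sklearn[0]:.4f}, p = {p_sklearn[0]:.3g}")
```

A large F (strong correlation relative to the noise) gives a small p-value, which is the evidence used later to keep or drop a feature.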

#### Why is it useful in machine learning?

- To identify which features are important for the model.
- To remove features that don’t add value (feature selection).
- To compare models with different sets of features and decide which one is better.

#### Simple Example
Let’s say you’re predicting house prices using:

- Number of bedrooms
- Size of the house
- Distance from the city center

The F-test can indicate which of these features (or which combination) is related to the price. If "number of bedrooms" has a low F-statistic (and a correspondingly high p-value), there is little evidence of a linear relationship with the price, so it is a candidate for removal.

## 1. Feature Selection with F-test

### Apply Feature Selection with F-Test on Linear Regression
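Feature selection identifies the features in a dataset that contribute most to predicting the target variable. In a regression setting, the **F-Test** checks, one feature at a time, whether the linear relationship between that feature and the target is statistically significant. (In its classical ANOVA form, it tests whether the means of two or more groups are significantly different.)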

In this code block, we apply feature selection using the F-Test in the context of Linear Regression. This helps in narrowing down the features that are most predictive of the target variable.

### Comparison of Results
After selecting features, we compare the model's performance with and without feature selection. This step helps in understanding whether the selected features improve the model’s accuracy or reduce computational complexity without compromising performance.


#### Importing pandas
We import the `pandas` library for data manipulation and analysis. It provides powerful tools to work with structured data like tables and CSV files.


In [None]:
# Import libraries
import pandas as pd

#### Reading the Dataset
We use `pandas` to load the `Students2.csv` file into a DataFrame. This allows us to work with the data in a structured tabular format.


In [None]:
# Read the file
f = pd.read_csv('Students2.csv')

#### Displaying the First Few Rows
The `head()` function is used to view the first 5 rows of the dataset. It helps in quickly inspecting the structure and content of the DataFrame.


In [None]:
f.head()

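|   | Hours | sHours | hoursplayed | income | distance | calories | Marks |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 6 | 6 | 146 | 9 | 2491 | 34 |
| 1 | 1 | 7 | 2 | 112 | 5 | 2303 | 36 |
| 2 | 1 | 6 | 1 | 84 | 7 | 2475 | 33 |
| 3 | 1 | 8 | 5 | 134 | 0 | 2282 | 39 |
| 4 | 1 | 8 | 5 | 104 | 8 | 2359 | 42 |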


#### Splitting the Data into Features (X) and Target (Y)
- `x`: contains all the independent features (all columns except the last one).
- `y`: contains the dependent variable (the last column), which is the target to be predicted.

We use `iloc` to slice by column position.
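
As a minimal illustration of this positional slicing, the sketch below applies the same two `iloc` expressions to a tiny hypothetical table (`toy`, with made-up column names) rather than to `Students2.csv`.

```python
import pandas as pd

# Hypothetical three-column table; the last column plays the role of the target
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "target": [5, 6]})

features = toy.iloc[:, :-1]  # all rows, every column except the last -> DataFrame
label = toy.iloc[:, -1]      # all rows, last column only -> Series

print(list(features.columns))
print(label.name)
```

Note that this pattern relies on the target being the last column; if the column order in the file changes, selecting by name is safer.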


In [None]:
# Split the columns into Dependent (Y) and independent (X) features
x = f.iloc[:,:-1]
y = f.iloc[:, -1]

#### Displaying the First Few Rows of Features (X)
The `head()` function shows the first 5 rows of the independent features (`X`). This helps in understanding the structure and values of the input data.


In [None]:
x.head()

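|   | Hours | sHours | hoursplayed | income | distance | calories |
|---|---|---|---|---|---|---|
| 0 | 0 | 6 | 6 | 146 | 9 | 2491 |
| 1 | 1 | 7 | 2 | 112 | 5 | 2303 |
| 2 | 1 | 6 | 1 | 84 | 7 | 2475 |
| 3 | 1 | 8 | 5 | 134 | 0 | 2282 |
| 4 | 1 | 8 | 5 | 104 | 8 | 2359 |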


In [None]:
y.head().to_frame()

|   | Marks |
|---|---|
| 0 | 34 |
| 1 | 36 |
| 2 | 33 |
| 3 | 39 |
| 4 | 42 |


In [None]:
# Split the data
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = \
train_test_split(x, y, test_size = 0.4, random_state = 1234)
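
To make the effect of `test_size` and `random_state` concrete, here is a small sketch on hypothetical data (`X_demo` and `y_demo` are made up, with 10 rows). It shows that `test_size=0.4` holds out 40% of the rows and that a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two calls with the same random_state produce the same partition
split_a = train_test_split(X_demo, y_demo, test_size=0.4, random_state=1234)
split_b = train_test_split(X_demo, y_demo, test_size=0.4, random_state=1234)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)            # 6 training rows, 4 test rows
print(np.array_equal(y_te, split_b[3]))  # same held-out rows both times
```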

In [None]:
# Perform Linear Regression using original dataset
from sklearn.linear_model import LinearRegression

In [None]:
lr = LinearRegression()

In [None]:
lr.fit(X_train, Y_train)

In [None]:
y_predict = lr.predict(X_test)

In [None]:
Y_test

7     45
10    56
4     42
1     36
28    82
8     53
3     39
23    89
14    72
13    56
22    71
24    82
Name: Marks, dtype: int64

In [None]:
y_predict.tolist()

[37.43043619598965,
 56.84799579604985,
 38.85467353701599,
 38.975672999863235,
 82.86947671956898,
 59.67634561811308,
 52.29968092971381,
 81.1729075926475,
 67.775629380509,
 63.75778539649318,
 82.18209853962895,
 86.65908659236905]

In [None]:
# Calculate the RMSE error for the regression
from sklearn.metrics import mean_squared_error
import math
rmse = math.sqrt(mean_squared_error(Y_test, y_predict))

In [None]:
rmse

6.982206715357434
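
RMSE is the square root of the average squared difference between the actual and predicted values. As a check, the sketch below recomputes it directly with NumPy from the `Y_test` and `y_predict` values printed above and compares the result with the `mean_squared_error` route used in this notebook.

```python
import math
import numpy as np
from sklearn.metrics import mean_squared_error

# Actual marks and full-model predictions, copied from the outputs above
actual = np.array([45, 56, 42, 36, 82, 53, 39, 89, 72, 56, 71, 82])
predicted = np.array([
    37.43043619598965, 56.84799579604985, 38.85467353701599,
    38.975672999863235, 82.86947671956898, 59.67634561811308,
    52.29968092971381, 81.1729075926475, 67.775629380509,
    63.75778539649318, 82.18209853962895, 86.65908659236905,
])

# RMSE written out: square the errors, average them, take the square root
rmse_manual = np.sqrt(np.mean((actual - predicted) ** 2))
rmse_sklearn = math.sqrt(mean_squared_error(actual, predicted))

print(rmse_manual, rmse_sklearn)
```

Because the errors are squared before averaging, a few large misses (such as the prediction of about 52.3 for an actual mark of 39) dominate the result.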

In [None]:
# import and perform the f_regression to get the F-Score and P-Values
from sklearn.feature_selection import f_regression as fr
result = fr(x,y)

In [None]:
result

(array([1.41905913e+02, 4.57019756e+00, 1.44882087e-03, 1.59990513e-01,
        3.16606568e-03, 4.04208927e-01]),
 array([1.77038466e-12, 4.14028344e-02, 9.69907241e-01, 6.92200477e-01,
        9.55528076e-01, 5.30086171e-01]))

In [None]:
# Split the result tuple into F_Score and P_Values
f_score = result[0]
p_values = result[1]

In [None]:
f_score

array([1.41905913e+02, 4.57019756e+00, 1.44882087e-03, 1.59990513e-01,
       3.16606568e-03, 4.04208927e-01])

In [None]:
p_values

array([1.77038466e-12, 4.14028344e-02, 9.69907241e-01, 6.92200477e-01,
       9.55528076e-01, 5.30086171e-01])

In [None]:
# Print the table of Features, F-Score and P-values
columns = list(x.columns)

print(" ")
print(" ")
print(" ")

print("    Features     ", "F-Score    ", "P-Values")
print("    -----------  ---------    ---------")

# One row per feature: name left-aligned, F-score to 2 decimals, p-value to 6 decimals
for name, f_val, p_val in zip(columns, f_score, p_values):
    print(f"     {name:<12} {f_val:>8.2f}      {p_val:>8.6f}")

 
 
 
    Features      F-Score     P-Values
    -----------  ---------    ---------
     Hours          141.91      0.000000
     sHours           4.57      0.041403
     hoursplayed      0.00      0.969907
     income           0.16      0.692200
     distance         0.00      0.955528
     calories         0.40      0.530086
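
Reading the table by eye works for six features, but the same decision can be made in code. The sketch below copies the values from the table above and keeps the features whose p-value falls below a threshold; the threshold `alpha = 0.05` is an assumption here, since the notebook does not state one.

```python
import numpy as np

# Feature names, F-scores and p-values copied from the table above
features = ["Hours", "sHours", "hoursplayed", "income", "distance", "calories"]
f_scores = np.array([141.91, 4.57, 0.00, 0.16, 0.00, 0.40])
p_vals = np.array([0.000000, 0.041403, 0.969907, 0.692200, 0.955528, 0.530086])

alpha = 0.05  # assumed significance level; not specified in the original
selected = [name for name, p in zip(features, p_vals) if p < alpha]

print(selected)  # ['Hours', 'sHours']
```

`sHours` passes only narrowly (p is about 0.041), so a stricter threshold such as 0.01 would keep `Hours` alone.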


In [None]:
# Perform the Linear Regression with reduced features
X_train_n = X_train[['Hours', 'sHours']]
X_test_n = X_test[['Hours', 'sHours']]

In [None]:
lr1 = LinearRegression()
lr1.fit(X_train_n, Y_train)
y_predict_n = lr1.predict(X_test_n)

In [None]:
Y_test

7     45
10    56
4     42
1     36
28    82
8     53
3     39
23    89
14    72
13    56
22    71
24    82
Name: Marks, dtype: int64

In [None]:
y_predict_n.tolist()

[45.710625255414804,
 56.08733142623621,
 46.592398855741735,
 41.40404577033103,
 89.31959542296687,
 50.89897834082551,
 46.592398855741735,
 79.82466285247241,
 65.58226399673069,
 55.64644462607274,
 74.6363097670617,
 84.57212913771966]

In [None]:
# Calculate the RMSE with reduced features
rmse_n = math.sqrt(mean_squared_error(Y_test, y_predict_n))
rmse_n

5.09721728108113

#### Note: the RMSE (root mean square error) value has reduced from rmse = 6.98 to rmse_n = 5.10
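
The F-scores in this section were computed on the full dataset (`fr(x, y)`), so the test rows also influenced which features were kept. A common way to avoid this is to fit the selector on the training split only. The sketch below does that by placing `SelectKBest` inside a scikit-learn pipeline. It runs on synthetic data (`make_regression` stands in for `Students2.csv`, and `k=2` is an assumed choice), so it illustrates the pattern rather than reproducing the numbers above.

```python
import math
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (assumption): 6 features, only 2 of them informative
X_syn, y_syn = make_regression(n_samples=200, n_features=6, n_informative=2,
                               noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.4,
                                          random_state=1234)

# The selector is fitted inside the pipeline, so it only ever sees X_tr and y_tr
model = make_pipeline(SelectKBest(score_func=f_regression, k=2), LinearRegression())
model.fit(X_tr, y_tr)

rmse_pipe = math.sqrt(mean_squared_error(y_te, model.predict(X_te)))
kept = model.named_steps["selectkbest"].get_support(indices=True)

print("kept feature indices:", kept)
print("test RMSE:", rmse_pipe)
```

With a dataset as small as the one used here (12 test rows), a single train/test split gives a noisy RMSE, so a comparison like 6.98 versus 5.10 is best treated as indicative.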

## 2. Different Feature Selection Methods

In [None]:
# ----------------------------------------------------------------
# Implement various feature selection, Select Transforms
# ----------------------------------------------------------------

In [None]:
# Import pandas, read the file and split into X and Y
import pandas as pd
f = pd.read_csv('Students2.csv')
X = f.iloc[:, :-1]
Y = f.iloc[:,  -1]

In [None]:
# Import various select transforms along with the f_regression score function
from sklearn.feature_selection import SelectKBest,             \
                                      SelectPercentile,        \
                                      GenericUnivariateSelect, \
                                      f_regression
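
All of these transformers store a score for every input feature, so printing `scores_` alone does not show which features were actually kept. A minimal sketch of how to read the selection back with `get_support()` follows; the data is synthetic and the names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import GenericUnivariateSelect, SelectKBest, f_regression

# Synthetic data (assumption); random_state is fixed so reruns give the same result
X_arr, y_demo = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"Feature_{i}" for i in range(X_arr.shape[1])])

k_best = SelectKBest(score_func=f_regression, k=3).fit(X_demo, y_demo)
generic = GenericUnivariateSelect(score_func=f_regression, mode="k_best",
                                  param=3).fit(X_demo, y_demo)

# get_support() returns a boolean mask over the input columns
kept_k = list(X_demo.columns[k_best.get_support()])
kept_g = list(X_demo.columns[generic.get_support()])

print(kept_k)
print(kept_g)
```

`GenericUnivariateSelect` with `mode="k_best"` is a configurable wrapper around the same strategy, which is why both lists match.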

In [None]:
from sklearn.feature_selection import SelectKBest, SelectPercentile, GenericUnivariateSelect, f_regression
from sklearn.datasets import make_regression
import pandas as pd
import numpy as np

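# Example dataset creation (replace with your data).
# Note: this overwrites the X and Y loaded from Students2.csv above, and since no
# random_state is passed, the scores change from run to run.
X, Y = make_regression(n_samples=100, n_features=5, noise=0.1)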

# Convert to DataFrame for better visualization
X = pd.DataFrame(X, columns=[f'Feature_{i}' for i in range(X.shape[1])])

# ---------------------- SelectKBest ----------------------
selectorK = SelectKBest(score_func=f_regression, k=3)
x_k = selectorK.fit_transform(X, Y)

# Get the F-scores and p-values (computed for every feature, not only the selected ones)
f_score_k = selectorK.scores_
p_values_k = selectorK.pvalues_

print("SelectKBest - F-Score and P-Values:")
print(" ")
print("Features     F-Score    P-Value")
for i in range(len(X.columns)):
    print(f"{X.columns[i]:<12} {f_score_k[i]:>8.2f}    {p_values_k[i]:>8.6f}")

# ---------------------- SelectPercentile ----------------------
selectorP = SelectPercentile(score_func=f_regression, percentile=50)
x_p = selectorP.fit_transform(X, Y)

# Get the F-scores and p-values (computed for every feature, not only the selected ones)
f_score_p = selectorP.scores_
p_values_p = selectorP.pvalues_

print("\nSelectPercentile - F-Score and P-Values:")
print(" ")
print("Features     F-Score    P-Value")
for i in range(len(X.columns)):
    print(f"{X.columns[i]:<12} {f_score_p[i]:>8.2f}    {p_values_p[i]:>8.6f}")

# ---------------------- GenericUnivariateSelect (k_best) ----------------------
selectorG1 = GenericUnivariateSelect(score_func=f_regression, mode='k_best', param=3)
x_g1 = selectorG1.fit_transform(X, Y)

# Get the F-scores and p-values stored on the fitted selector
f_score_g1 = selectorG1.scores_
p_values_g1 = selectorG1.pvalues_

print("\nGenericUnivariateSelect (k_best) - F-Score and P-Values:")
print(" ")
print("Features     F-Score    P-Value")
for i in range(len(X.columns)):
    print(f"{X.columns[i]:<12} {f_score_g1[i]:>8.2f}    {p_values_g1[i]:>8.6f}")

# ---------------------- GenericUnivariateSelect (percentile) ----------------------
selectorG2 = GenericUnivariateSelect(score_func=f_regression, mode='percentile', param=50)
x_g2 = selectorG2.fit_transform(X, Y)

# Get the F-scores and p-values stored on the fitted selector
f_score_g2 = selectorG2.scores_
p_values_g2 = selectorG2.pvalues_

print("\nGenericUnivariateSelect (percentile) - F-Score and P-Values:")
print(" ")
print("Features     F-Score    P-Value")
for i in range(len(X.columns)):
    print(f"{X.columns[i]:<12} {f_score_g2[i]:>8.2f}    {p_values_g2[i]:>8.6f}")


SelectKBest - F-Score and P-Values:
 
Features     F-Score    P-Value
Feature_0        0.04    0.839093
Feature_1       27.73    0.000001
Feature_2       22.88    0.000006
Feature_3       98.73    0.000000
Feature_4       15.22    0.000175

SelectPercentile - F-Score and P-Values:
 
Features     F-Score    P-Value
Feature_0        0.04    0.839093
Feature_1       27.73    0.000001
Feature_2       22.88    0.000006
Feature_3       98.73    0.000000
Feature_4       15.22    0.000175

GenericUnivariateSelect (k_best) - F-Score and P-Values:
 
Features     F-Score    P-Value
Feature_0        0.04    0.839093
Feature_1       27.73    0.000001
Feature_2       22.88    0.000006
Feature_3       98.73    0.000000
Feature_4       15.22    0.000175

GenericUnivariateSelect (percentile) - F-Score and P-Values:
 
Features     F-Score    P-Value
Feature_0        0.04    0.839093
Feature_1       27.73    0.000001
Feature_2       22.88    0.000006
Feature_3       98.73    0.000000
Feature_4       15.2