# Regression Model Selection
- We have 5 different regression models, but which one is the best and most effective model/algorithm for the dataset (Combined_Cycle_Power_Plant.csv) 😕?
- No worries, we answer that question in this **Regression Model Selection** notebook.

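## R squared (R^2)
- R squared (the coefficient of determination) evaluates how well a model's predictions match the actual values. The best possible score is 1; a model that always predicts the mean of the target scores 0, and a model that does worse than that scores below 0.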

<img src="../images/r_squared_eqn.png" alt="r_squared_eqn.png">
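
To make the score concrete, here is a minimal sketch that computes R^2 by hand, assuming the standard definition `1 - SS_res / SS_tot` (the same quantity `sklearn.metrics.r2_score` reports). The function name `r_squared` and the sample values are illustrative assumptions, not part of the notebook.

```python
def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]                     # illustrative values only

perfect = r_squared(y_true, y_true)               # predictions equal the truth
mean_only = r_squared(y_true, [6.0] * 4)          # always predict the mean (6.0)
worse = r_squared(y_true, [9.0, 7.0, 5.0, 3.0])   # predictions in reverse order

print(perfect, mean_only, worse)  # 1.0 0.0 -3.0
```

The third case shows why a negative R^2 is possible: the predictions are worse than simply guessing the mean.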

## Data preprocessing

✔️ Import the necessary libraries.

✔️ Load dataset (Combined_Cycle_Power_Plant.csv).

❌ Our dataset doesn't have any missing data.

❌ Our dataset doesn't have any categorical string data.

✔️ We have 9569 rows of data, so we can split the dataset into training and testing sets to evaluate the results.

⚠️ Please apply feature scaling only if required by the regression model.
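
The checklist items about missing values and categorical columns can be verified rather than assumed. The following is a rough sketch on a small stand-in `DataFrame`; the column names and values are hypothetical, and in the notebook the same checks would be run on `dataset` after loading the CSV.

```python
import pandas as pd

# Hypothetical stand-in for the loaded CSV (column names and values are made up)
sample = pd.DataFrame({
    "feature_1": [14.9, 25.2, 5.1],
    "feature_2": [41.8, 62.9, 39.4],
    "target": [463.3, 444.4, 488.6],
})

# Count missing values in each column; all zeros means no imputation is needed
missing_per_column = sample.isnull().sum()

# List columns that are not numeric; an empty list means no encoding is needed
non_numeric_columns = [
    col for col in sample.columns
    if not pd.api.types.is_numeric_dtype(sample[col])
]

print(missing_per_column)
print("Non-numeric columns:", non_numeric_columns)
```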

In [1]:
# Import libraries....
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
# Load dataset....
dataset = pd.read_csv(r"../dataset/Combined_Cycle_Power_Plant.csv")
X = dataset.iloc[:, :-1].values  # features: every column except the last
Y = dataset.iloc[:, -1].values   # target: the last column

In [3]:
# Split training and testing dataset....
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

## Train and evaluate the performance of multiple linear regression

In [4]:
# Train multiple linear regression model...
from sklearn.linear_model import LinearRegression
multiple_linear_regressor = LinearRegression()
multiple_linear_regressor.fit(x_train, y_train)

# Test multiple linear regression model...
y_pred = multiple_linear_regressor.predict(x_test)

# Check the result (left column: y_pred, right column: y_test)...
np.set_printoptions(precision=2)
print("Comparison of y_pred & y_test", np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1), sep='\n')

# Find performance by R^2 score...
from sklearn.metrics import r2_score
multiple_linear_regression_r2_score = r2_score(y_test, y_pred)
print("R^2 score for Multiple linear regression : ", multiple_linear_regression_r2_score)

Comparison of y_pred & y_test
[[445.27 442.32]
 [440.64 445.64]
 [432.32 429.59]
 ...
 [481.14 482.21]
 [443.1  440.68]
 [463.99 468.4 ]]
R^2 score for Multiple linear regression :  0.9301213220397643


## Train and evaluate the performance of polynomial linear regression

In [5]:
# Expand the features into polynomial terms up to degree 4...
from sklearn.preprocessing import PolynomialFeatures
x_poly_converter = PolynomialFeatures(degree=4)

# Train polynomial linear regression model...
from sklearn.linear_model import LinearRegression
polynomial_linear_regressor = LinearRegression()
polynomial_linear_regressor.fit(x_poly_converter.fit_transform(x_train), y_train)

# Test polynomial linear regression model (transform only; the converter is already fitted)...
y_pred = polynomial_linear_regressor.predict(x_poly_converter.transform(x_test))

# Check the result (left column: y_pred, right column: y_test)...
np.set_printoptions(precision=2)
print("Comparison of y_pred & y_test", np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1), sep='\n')

# Find performance by R^2 score...
from sklearn.metrics import r2_score
polynomial_linear_regression_r2_score = r2_score(y_test, y_pred)
print("R^2 score for Polynomial linear regression : ", polynomial_linear_regression_r2_score)

Comparison of y_pred & y_test
[[443.77 442.32]
 [440.99 445.64]
 [433.2  429.59]
 ...
 [484.41 482.21]
 [443.49 440.68]
 [462.3  468.4 ]]
R^2 score for Polynomial linear regression :  0.9407971284242803


## Train and evaluate the performance of Support Vector Regression (SVR)

In [6]:
# Feature scaling independent variables
from sklearn.preprocessing import StandardScaler
x_sc = StandardScaler()
scaled_X = x_sc.fit_transform(x_train)

# Feature scaling dependent variables
y_sc = StandardScaler()
scaled_Y = y_sc.fit_transform(y_train.reshape(len(y_train), 1)).ravel()

# Train SVR model with an RBF (non-linear) kernel
from sklearn.svm import SVR
svr_regressor = SVR(kernel='rbf')
svr_regressor.fit(scaled_X, scaled_Y)

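# Test SVR model
# NOTE: x_test is passed here unscaled although the model was trained on scaled_X,
# so the predictions collapse to a single value; x_sc.transform(x_test) is the matching input.
y_pred = svr_regressor.predict(x_test)
y_pred = y_sc.inverse_transform(y_pred)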

# Check the result (left column: y_pred, right column: y_test)...
np.set_printoptions(precision=2)
print("Comparison of y_pred & y_test", np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1), sep='\n')

# Find performance by R^2 score...
from sklearn.metrics import r2_score
svr_linear_regression_r2_score = r2_score(y_test, y_pred)
print("R^2 score for SVR linear regression : ", svr_linear_regression_r2_score)


Comparison of y_pred & y_test
[[456.57 442.32]
 [456.57 445.64]
 [456.57 429.59]
 ...
 [456.57 482.21]
 [456.57 440.68]
 [456.57 468.4 ]]
R^2 score for SVR linear regression :  -0.01091284839544615
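
Every prediction above is the same value (456.57) and the R^2 score is slightly negative. That is the pattern to expect when an SVR trained on standardized features is asked to predict from unscaled inputs, and here `x_test` reaches `predict` without going through `x_sc.transform`. The sketch below shows one way to keep the scaling consistent by bundling it with the model. It runs on synthetic data (an assumption, since the real CSV is not available here), so its score says nothing about how SVR would rank on the power plant dataset.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): 4 features on a 0-100 scale, target near 450
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(500, 4))
y = 450 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(0, 1, size=500)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The pipeline scales the features; TransformedTargetRegressor scales the target
# for training and inverse-transforms the predictions automatically.
model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    transformer=StandardScaler(),
)
model.fit(x_train, y_train)

# Raw x_test can be passed directly: the fitted scaler is applied inside predict
y_pred = model.predict(x_test)
score = r2_score(y_test, y_pred)
print("R^2 score for SVR with consistent scaling :", score)
```

With this arrangement the scalers are fitted on the training data only and reused at prediction time, so the train and test inputs cannot drift onto different scales.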


## Train and evaluate the performance of Decision Tree Regression

In [7]:
# Train Decision Tree Regression Model
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor()
regressor.fit(x_train, y_train)

# Test Decision Tree Regression Model
y_pred = regressor.predict(x_test)

# Check the result (left column: y_pred, right column: y_test)...
np.set_printoptions(precision=2)
print("Comparison of y_pred & y_test", np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1), sep='\n')

# Find performance by R^2 score...
from sklearn.metrics import r2_score
decision_tree_regression_r2_score = r2_score(y_test, y_pred)
print("R^2 score for Decision tree regression : ", decision_tree_regression_r2_score)

Comparison of y_pred & y_test
[[444.42 442.32]
 [445.75 445.64]
 [429.18 429.59]
 ...
 [483.12 482.21]
 [441.21 440.68]
 [471.24 468.4 ]]
R^2 score for Decision tree regression :  0.9289645039577621


## Train and evaluate the performance of Random Forest Regression

In [8]:
# Train Random Forest Regression Model
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 10) # n_estimators -> number of trees in the forest
regressor.fit(x_train, y_train)

# Test Random Forest Regression Model
y_pred = regressor.predict(x_test)

# Check the result (left column: y_pred, right column: y_test)...
np.set_printoptions(precision=2)
print("Comparison of y_pred & y_test", np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1), sep='\n')

# Find performance by R^2 score...
from sklearn.metrics import r2_score
random_forest_regression_r2_score = r2_score(y_test, y_pred)
print("R^2 score for Random forest regression : ", random_forest_regression_r2_score)

Comparison of y_pred & y_test
[[445.2  442.32]
 [442.96 445.64]
 [429.61 429.59]
 ...
 [483.82 482.21]
 [443.19 440.68]
 [467.78 468.4 ]]
R^2 score for Random forest regression :  0.9590065187034049


## Which is best for the given dataset?

In [9]:
r2_scores = {
    "Multiple Linear Regression" : multiple_linear_regression_r2_score,
    "Polynomial Linear Regression" : polynomial_linear_regression_r2_score,
    "SVR" : svr_linear_regression_r2_score,
    "Decision Tree Regression" : decision_tree_regression_r2_score,
    "Random Forest Regression" : random_forest_regression_r2_score
}
# Print final result of all models....
for model, r2 in r2_scores.items():
    print(f"{model} with r^2 score {r2}")

# Find the best of them....
best_of_them = max(r2_scores.values())

# Print the best of them....
for model, r2 in r2_scores.items():
    if r2 == best_of_them:
        print_me=f"{model} is the best model for given dataset 🥳 with R^2 score {r2}"
        print("🎉" * (len(print_me) // 2))
        print(print_me)
        print("🎉" * (len(print_me) // 2))
        break

Multiple Linear Regression with r^2 score 0.9301213220397643
Polynomial Linear Regression with r^2 score 0.9407971284242803
SVR with r^2 score -0.01091284839544615
Decision Tree Regression with r^2 score 0.9289645039577621
Random Forest Regression with r^2 score 0.9590065187034049
🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉
Random Forest Regression is the best model for given dataset 🥳 with R^2 score 0.9590065187034049
🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉


**Note:** The result above holds only for the dataset (Combined_Cycle_Power_Plant.csv) that we used as input. If you change the dataset, the scores, and possibly the best model, will change too.
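
Because `train_test_split` is called without a fixed `random_state`, the scores can also shift from run to run on the same dataset. One common way to make the comparison depend less on a single split is k-fold cross-validation. The sketch below is a hedged illustration on synthetic data; the data, the `candidates` dictionary, and the choice of 5 folds are assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption, not the power plant dataset)
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(300, 4))
y = 450 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(0, 1, size=300)

# A few of the model types compared above; settings are illustrative
candidates = {
    "Multiple Linear Regression": LinearRegression(),
    "Decision Tree Regression": DecisionTreeRegressor(random_state=0),
    "Random Forest Regression": RandomForestRegressor(n_estimators=10, random_state=0),
}

# Average the R^2 score over 5 folds for each candidate
mean_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}

for name, mean_r2 in mean_scores.items():
    print(f"{name} with mean r^2 score {mean_r2}")

# Pick the model name with the highest mean score
best_model = max(mean_scores, key=mean_scores.get)
print(f"{best_model} has the highest mean R^2 on this synthetic data")
```

Averaging over folds gives each model several train/test splits, so a lucky or unlucky single split has less influence on which model ends up ranked first.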