# **Introduction**

**How is this project different?**  
In a word: flexibility. We designed this pipeline so it can handle any regression dataset with just a few simple tweaks to the data preprocessing.

**Which model have we used?**  
All the regression models we learned throughout this training. The pipeline automatically picks the model with the highest R² score on the test set and exports it as a .plk file.

**What's with the flags?**  
We introduced flags to our pipeline so you won't have to change the code just to drop some features. All you have to do is add those features to the drop_features flag and tada, problem solved!

Some examples for our flags:  
- debug_mode -> Shows analytical information about the data, for example feature dtypes and missing values
- skew_power -> Handles data skewness using the power method
- check_outliers -> Visualizes outliers in the data
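
As a rough illustration of the flag idea, the sketch below applies a `drop_features`-style list to a small DataFrame. The `apply_drop_features` helper and the toy columns are hypothetical and exist only for this example; only the `drop_features` name comes from the pipeline.

```python
import pandas as pd

# Flag: list the columns to drop here instead of editing the pipeline code
drop_features = ["Rank"]


def apply_drop_features(df, features):
    """Hypothetical helper: drop the listed columns only when the flag is non-empty."""
    if features:  # an empty list means the flag is off
        df = df.drop(columns=features)
    return df


# Toy data, made up for this example
toy = pd.DataFrame({"Rank": [1, 2], "Year": [2006, 2008], "Global_Sales": [1.5, 2.5]})
result = apply_drop_features(toy, drop_features)
print(result.columns.tolist())
```

Leaving the list empty keeps every column, so switching the behaviour on or off never requires touching the function body.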

**What about the constants and models block?**  
It covers:
- Essential information like dataset path and target column name
- Some flags for automatic numerical/categorical feature selection
- Model selection

# **Import libraries**

In [23]:
"""
These are the needed libraries to run the project.
1- Core modules: are for reading data files, mathematical computations and data structuring.
2- Visualization modules: are needed to display plots, and heatmaps that are both needed for
   understanding the nature of the data and model evaluation.
3- Model splitting module: is needed to split data into training and testing.
4- Preprocessing modules: are needed to manipulate data, like encoding and scaling.
5- Regression modules: are needed for different regression models.
6- Evaluation modules: are needed for evaluation metrics like r2_score.
7- Pipeline & utility modules: are for file system handling and saving files.
"""

# Core
import numpy as np
import pandas as pd
from scipy import stats

# Visualization
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Model selection / splitting
from sklearn.model_selection import train_test_split

# Preprocessing
# Encoding
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder

#Scaling
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import PowerTransformer

# Regression models
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor, AdaBoostRegressor, BaggingRegressor
from xgboost import XGBRegressor

#Evaluation
from sklearn.metrics import r2_score

# Pipeline & utilities
import joblib
import os

# **Flags**

In [24]:
"""Flags"""
# Analysis
debug_mode = False # sample / info / Missing / Duplicates / Describe
visualize_mode = False # Features distribution / Outliers / Skewness

# Incorrect dtypes
check_dtypes = False # Checker
enforce_numerical = [] # Setter

# Missing Values
check_missing = False # Checker
remove_missing = ["Year"] # Remover
fill_mean = [] # Imputer
fill_median = [] # Imputer
fill_mode = [] # Imputer
fill_ffill = [] # Imputer
fill_bfill = [] # Imputer
fill_constant = [] # Imputer
fill_constant_value = "" # Imputer, Give it a value if you have features in fill_constant

# Outliers
check_outliers = False # Checker
outliers_zscore = [] # Remover
outliers_zscore_threshold = 2 # Remover, Give it a value if you have features in outliers_zscore

# Scaling
check_scaling = False # Checker
standard_scale = [] # Scaler
min_max_scale = [] # Scaler
robust_scale = [] # Scaler

# Skewness
check_skewness = False # Checker
skew_power = [] # Transformer, General, don't use with outliers
skew_log1p = [] # Transformer, Don't use with negative values
skew_cbrt = [] # Transformer, Don't use with high skew

# Encoding
# check_dtypes is already defined under "Incorrect dtypes"; redefining it here would reset that flag
label_encode = ["Name", "Platform", "Genre", "Publisher"] # Encoder
one_hot_encode = [] # Encoder
ordinal_encode = [] # Encoder

# Feature engineering
check_correlation = False # Checker
corr_threshold = 0 # Remover, Features whose absolute correlation with the target is below this value are dropped (0 disables)
drop_features = [] # Remover

# Model deployment
do_save_model = True

# **Constants & Models**

In [25]:
"""Constants"""
MODEL_PATH = "test.plk"
DATA_PATH = "/content/video_games_sales.csv"
TARGET = "Global_Sales"
TEST_SIZE = 0.2
RANDOM_STATE = 42

auto_select_numerical = True
auto_select_categorical = True
NUMERICAL_FEATURES = []
CATEGORICAL_FEATURES = []

decision_tree_estimator = DecisionTreeRegressor(random_state=RANDOM_STATE)

REGRESSION_MODELS = {
    "LinearRegression": LinearRegression(),
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "SVR_rbf": SVR(kernel="rbf", C=1.0, epsilon=0.1),
    # "SVR_linear": SVR(kernel="linear", C=1.0, epsilon=0.1),
    "SVR_poly": SVR(kernel="poly", C=1.0, epsilon=0.1),
    "SVR_sigmoid": SVR(kernel="sigmoid", C=1.0, epsilon=0.1),
    "DecisionTree": DecisionTreeRegressor(random_state=RANDOM_STATE , max_depth=5),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=RANDOM_STATE),
    "Bagging": BaggingRegressor(estimator=decision_tree_estimator, n_estimators=50, random_state=RANDOM_STATE),
    "AdaBoost": AdaBoostRegressor(estimator=decision_tree_estimator, n_estimators=50, learning_rate=0.1, random_state=RANDOM_STATE),
    "GradientBoosting": GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=RANDOM_STATE),
    "XGBoost": XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=RANDOM_STATE),
}

# **Regression Pipeline**

## **Data collection**

In [26]:
def load_data():
    global NUMERICAL_FEATURES
    global CATEGORICAL_FEATURES

    df = pd.read_csv(DATA_PATH)

    if auto_select_numerical:
        NUMERICAL_FEATURES = df.select_dtypes(include=["int64", "float64"]).columns.tolist()
        print("Numerical features have been selected automatically")
    if auto_select_categorical:
        CATEGORICAL_FEATURES = df.select_dtypes(include=["object", "category"]).columns.tolist()
        print("Categorical features have been selected automatically")

    print("Data has been loaded")
    return df
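
The automatic selection relies on `DataFrame.select_dtypes`. The sketch below shows the idea on a made-up frame. It uses `include="number"` and `exclude="number"` instead of the explicit dtype names in `load_data`, which is an assumption made here so the example does not depend on how a given pandas version stores strings.

```python
import pandas as pd

# Toy data, made up for this example
toy = pd.DataFrame({
    "Year": [2006, 2008, 2009],
    "Genre": ["Sports", "Racing", "Sports"],
    "Global_Sales": [1.5, 2.5, 3.5],
})

# Numeric columns (ints and floats) versus everything else
numerical = toy.select_dtypes(include="number").columns.tolist()
categorical = toy.select_dtypes(exclude="number").columns.tolist()

print(numerical)
print(categorical)
```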

## **Data analysis & preprocessing**

In [27]:
def preprocessing(df):
    if enforce_numerical:
        df = enforce(df, enforce_numerical)

    if label_encode or one_hot_encode or ordinal_encode:
        df = encode(df)

    if skew_power or skew_log1p or skew_cbrt:
        df = skew(df)

    if remove_missing or fill_mean or fill_median or fill_mode or fill_ffill or fill_bfill or fill_constant:
        df = handle_missing(df)

    if outliers_zscore:
        df = handle_outliers(df)

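    if debug_mode or check_dtypes or check_missing or check_scaling:
        # Pass the flag explicitly: debug() has its own debug_mode parameter that defaults to False
        debug(df, debug_mode)

    if visualize_mode or check_outliers or check_skewness:
        visualize(df)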

    return df

### **Debug mode**

In [28]:
def debug(df, debug_mode=False):
    if debug_mode:
        display(df.sample(10))

    if debug_mode or check_dtypes:
        print("******")
        print("Info:")
        print("******")
        df.info()  # info() prints its report directly and returns None
        print()

    if debug_mode or check_missing:
        print("*********")
        print("Missing:")
        print("*********")
        display(df.isnull().sum() / df.shape[0] * 100)
        print()

    if debug_mode:
        print("************")
        print("Duplicates:")
        print("************")
        print(f"Fully duplicated rows: {df.duplicated().sum()}")
        print()

    if debug_mode or check_scaling:
        print("**********")
        print("Describe:")
        print("**********")
        display(df.describe())
        print()

### **Visualize mode**

In [29]:
def visualize(df):
    global NUMERICAL_FEATURES

    if visualize_mode:
        columns = df.columns
        n_cols = 4 # number of plots per row
        n_rows = -(-len(columns) // n_cols)

        plt.figure(figsize=(4 * n_cols, 3 * n_rows))

        for i, column in enumerate(columns, 1):
            plt.subplot(n_rows, n_cols, i)

            if pd.api.types.is_numeric_dtype(df[column]):
                # Histogram for numeric data
                plt.hist(df[column].dropna(), bins=40, color="deeppink", edgecolor="black")
                plt.xlabel(column)
                plt.ylabel("Count")
                plt.title(f"Histogram of {column}")

            else:
                # Bar plot for categorical data
                df[column].value_counts().plot(kind="bar", color="deeppink", edgecolor="black")
                plt.xlabel(column)
                plt.ylabel("Count")
                plt.title(f"Bar Plot of {column}")

        plt.tight_layout()
        plt.show()

    if visualize_mode or check_outliers:
        if len(NUMERICAL_FEATURES) == 0:  # len() works for both a list and a pandas Index
            NUMERICAL_FEATURES = df.select_dtypes(include="number").columns.tolist()

        plt.figure(figsize=(15, 8))
        df[NUMERICAL_FEATURES].boxplot()
        plt.xticks(rotation=45)
        plt.title("Boxplots of All Numerical Columns")
        plt.show()

    if visualize_mode or check_skewness:
        sns.set_style("darkgrid")

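        NUMERICAL_FEATURES = df.select_dtypes(include="number").columns.tolist()
        n_cols = 4 # number of plots per row
        n_rows = max(1, -(-len(NUMERICAL_FEATURES) // n_cols)) # ceiling division, so the grid always fits

        plt.figure(figsize=(15, 3 * n_rows))
        for idx, feature in enumerate(NUMERICAL_FEATURES, 1):
            plt.subplot(n_rows, n_cols, idx)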
            sns.histplot(df[feature], kde=True)
            plt.title(f"{feature} | Skewness: {round(df[feature].skew(), 2)}")

        plt.tight_layout()
        plt.show()
        print()

### **Enforce numeric data type**

In [30]:
def enforce(df, columns):
    df[columns] = df[columns].astype(int)
    return df

### **Handle missing values**

In [31]:
def handle_missing(df):

    for col in remove_missing:
        df = df.dropna(subset=[col])
    for col in fill_mean:
        df[col] = df[col].fillna(df[col].mean())
    for col in fill_median:
        df[col] = df[col].fillna(df[col].median())
    for col in fill_mode:
        df[col] = df[col].fillna(df[col].mode()[0])
    for col in fill_ffill:
        df[col] = df[col].ffill()  # fillna(method='ffill') is deprecated in recent pandas
    for col in fill_bfill:
        df[col] = df[col].bfill()  # fillna(method='bfill') is deprecated in recent pandas
    for col in fill_constant:
        df[col] = df[col].fillna(fill_constant_value)

    return df
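
To make the difference between the remover and the imputers concrete, here is a minimal sketch on a made-up frame. The column names and values are hypothetical; the two operations mirror what the `remove_missing` and `fill_mean` lists trigger above.

```python
import pandas as pd

# Toy data with one missing value per column, made up for this example
toy = pd.DataFrame({"Year": [2000.0, None, 2010.0], "Score": [1.0, 2.0, None]})

# remove_missing-style: drop the rows where "Year" is missing
dropped = toy.dropna(subset=["Year"])

# fill_mean-style: replace missing "Score" values with the column mean
filled = toy.copy()
filled["Score"] = filled["Score"].fillna(filled["Score"].mean())

print(len(dropped))
print(filled["Score"].tolist())
```

Removing shrinks the dataset, while imputing keeps every row at the cost of inserting an estimated value.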

### **Handle outliers**

In [32]:
def handle_outliers(df):
    z = np.abs(stats.zscore(df[outliers_zscore], nan_policy='omit'))
    df = df[(z < outliers_zscore_threshold).all(axis=1)]
    return df
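
The z-score rule can be sketched without SciPy to show what the threshold does. This is a simplified single-column version under the assumption of a population standard deviation (ddof=0, the `scipy.stats.zscore` default); `zscore_filter` and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def zscore_filter(values, threshold):
    """Keep only the values whose absolute z-score is below the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (ddof=0)
    return [v for v in values if abs((v - mu) / sigma) < threshold]


# Made-up values with one obvious outlier at the end
sales = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.3, 1.0, 1.1, 9.0]
kept = zscore_filter(sales, threshold=2)
print(kept)
```

A lower threshold removes more rows, so a value such as 2 is fairly aggressive compared with the commonly used 3.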

### **Handle data scaling**

In [33]:
def scale(command, x_train, x_test):
    # Each flag is a list of features; any non-empty list applies that scaler to every column of x_train / x_test
    # (the command argument is currently unused)

    if standard_scale:
        sc = StandardScaler()
        x_train = sc.fit_transform(x_train)
        x_test = sc.transform(x_test)

    if min_max_scale:
        MinMax = MinMaxScaler()
        x_train = MinMax.fit_transform(x_train)
        x_test = MinMax.transform(x_test)

    if robust_scale:
        rb = RobustScaler()
        x_train = rb.fit_transform(x_train)
        x_test = rb.transform(x_test)

    # Return the scaled arrays; without this the caller would receive None
    return x_train, x_test
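
Each scaler above is fitted on `x_train` and then reused on `x_test`. The hand-written standardization below is only a sketch of that fit/transform split, using made-up numbers and the population standard deviation that `StandardScaler` uses.

```python
from statistics import mean, pstdev

# Made-up feature values
train = [10.0, 20.0, 30.0]
test = [40.0]

# "fit": learn the statistics from the training data only
mu = mean(train)
sigma = pstdev(train)

# "transform": apply the same training statistics to both splits
train_scaled = [(v - mu) / sigma for v in train]
test_scaled = [(v - mu) / sigma for v in test]

print(train_scaled)
print(test_scaled)
```

Because the test split never contributes to `mu` or `sigma`, no information from it leaks into training.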

### **Handle skewness**

In [34]:
def skew(df):
    if skew_power:
        pt = PowerTransformer(method='yeo-johnson')
        df[skew_power] = pt.fit_transform(df[skew_power])
        print(f"Applied PowerTransformer on {skew_power}")

    if skew_log1p:
        applied = []
        for col in skew_log1p:
            if (df[col] < 0).any():  # zeros are fine for log1p; only negative values are skipped
                print(f"Skipping log1p for {col} because it contains negative values.")
                continue
            df[col] = np.log1p(df[col])
            applied.append(col)
        print(f"Applied log1p transformation on {applied}")

    if skew_cbrt:
        for col in skew_cbrt:
            df[col] = np.cbrt(df[col])
        print(f"Applied Cube Root transformation on {skew_cbrt}")

    # preprocessing() assigns the result back to df, so the frame must be returned
    return df
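
To see why `log1p` helps with right-skewed features, the sketch below measures skewness before and after the transform on made-up values. The `skewness` helper is hypothetical and uses the simple population formula, which can differ slightly from the bias-corrected value pandas reports.

```python
import math
from statistics import mean


def skewness(values):
    """Population skewness: third central moment divided by the variance to the power 1.5."""
    mu = mean(values)
    m2 = mean([(v - mu) ** 2 for v in values])
    m3 = mean([(v - mu) ** 3 for v in values])
    return m3 / m2 ** 1.5


# Made-up right-skewed values: many small numbers and a long tail
values = [1.0, 2.0, 3.0, 5.0, 10.0, 30.0, 100.0]
transformed = [math.log1p(v) for v in values]

print(round(skewness(values), 2))
print(round(skewness(transformed), 2))
```

The transform compresses the long tail, so the second number is noticeably closer to zero than the first.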

### **Encode data**

In [35]:
def encode(df):
    if label_encode:
        encoder = LabelEncoder()
        for feature in label_encode:
            df[feature] = encoder.fit_transform(df[feature])

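    if one_hot_encode:
        # OneHotEncoder returns a sparse matrix with one column per category, which cannot be
        # assigned back onto the original columns; pd.get_dummies expands them in the frame instead
        df = pd.get_dummies(df, columns=one_hot_encode, dtype=int)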

    if ordinal_encode:
        encoder = OrdinalEncoder()
        df[ordinal_encode] = encoder.fit_transform(df[ordinal_encode])

    return df
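
The difference between label-style and one-hot encoding is easiest to see on a tiny column. This sketch uses pandas only (category codes in place of `LabelEncoder`, which is a substitution made for brevity); the data is made up.

```python
import pandas as pd

# Toy data, made up for this example
toy = pd.DataFrame({"Genre": ["Sports", "Racing", "Sports", "Puzzle"]})

# Label-style encoding: one integer per category, assigned in sorted order
labels = toy["Genre"].astype("category").cat.codes.tolist()

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(toy, columns=["Genre"], dtype=int)

print(labels)
print(one_hot.columns.tolist())
```

Label codes keep a single column but imply an ordering between categories, whereas one-hot encoding avoids that at the cost of extra columns.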

## **Feature engineering**

### **Feature selection**

In [36]:
def feature_selection(df):
    print("Initializing the feature selection process...")

    if check_correlation or corr_threshold or drop_features:
        df = corr(df)

    x = df.drop(TARGET, axis=1).values
    y = df[TARGET].values

    if debug_mode:
        display(x)
        display(y)

    print("Features and target have been determined!")
    return x, y

### **Correlation**

In [37]:
def corr(df):
    if corr_threshold:
        # Drop all features whose absolute correlation with TARGET is below corr_threshold
        corr_matrix = df.corr(numeric_only=True)  # skip non-numeric columns instead of raising
        price_corr = corr_matrix[TARGET].abs()
        features_to_drop = price_corr[price_corr < corr_threshold].index.tolist()
        df = df.drop(features_to_drop, axis=1)
        print(f"Dropped features with correlation less than {corr_threshold}: {features_to_drop}")

    if drop_features: # Drop features in the drop_features list
        df = df.drop(drop_features, axis=1)

    if check_correlation: # Print the correlation to the target feature
        corr_matrix = df.corr(numeric_only=True)  # skip non-numeric columns instead of raising
        price_corr = (corr_matrix[TARGET].abs() * 100).sort_values(ascending=False)
        print("Pearson Correlation Percentage:\n", price_corr)

    return df
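
The threshold filter can be tried on a small frame where the expected outcome is obvious. The column names and values below are made up: `strong` is a multiple of the target and `noise` is uncorrelated with it.

```python
import pandas as pd

TARGET = "target"
corr_threshold = 0.1

# Toy data, made up for this example
toy = pd.DataFrame({
    "strong": [2, 4, 6, 8, 10],
    "noise": [1, -1, 0, -1, 1],
    "target": [1, 2, 3, 4, 5],
})

# Absolute Pearson correlation of every column with the target
target_corr = toy.corr()[TARGET].abs()
features_to_drop = target_corr[target_corr < corr_threshold].index.tolist()
reduced = toy.drop(columns=features_to_drop)

print(features_to_drop)
print(reduced.columns.tolist())
```

The target correlates perfectly with itself, so it is never dropped unless the threshold is set above 1.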

## **Models training**

In [38]:
def split_train_test(x, y):
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size = TEST_SIZE, random_state = RANDOM_STATE)

    print("Data has been splitted to training and testing")
    return x_train, x_test, y_train, y_test

In [39]:
def train_model(x_train, y_train):
    trained_models = {}
    for name, model in REGRESSION_MODELS.items():
        print(f"Training {name}")
        trained_models[name] = model.fit(x_train, y_train)

    print("Model has been trained")
    return trained_models

## **Models evaluation**

In [40]:

def evaluate_all_models(trained_models, X, y, dataset_name):

    scores = {}
    print(f"\n{'='*60}")
    print(f"Evaluating models on {dataset_name} dataset:")
    print(f"{'='*60}")

    for name, model in trained_models.items():
        y_pred = model.predict(X)
        score = r2_score(y, y_pred)
        scores[name] = score
        print(f"R² score for {name}: {(score * 100):.4f}%")

    print(f"{'='*60}")
    return scores
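
Since every score in this notebook is an R² value, it helps to see how it is computed. The hand-written `r2` function below is a sketch of the standard definition, 1 - SS_res / SS_tot, and is not the scikit-learn implementation; the sample predictions are made up.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r2(y_true, y_true)                     # exact predictions
baseline = r2(y_true, [2.5, 2.5, 2.5, 2.5])      # always predicting the mean
far_off = r2(y_true, [40.0, 30.0, 20.0, 10.0])   # predictions far from the targets

print(perfect, baseline, far_off)
```

R² has no lower bound: a model that does worse than always predicting the mean gets a negative score, which is why very large negative percentages can show up in the printed results.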

## **Model deployment**

In [41]:
def load_model(path):
    if not os.path.exists(path):
        raise FileNotFoundError(f"No model found at {path}")

    model = joblib.load(path)
    print("Model has been loaded")  # report success only after the load succeeded
    return model

In [42]:
def save_model(model, path):
    joblib.dump(model, path)
    print(f"Model has been saved to {path}")

## **Main**

In [43]:
def run_pipeline():
    df = load_data()  # Load data
    df = preprocessing(df)  # Apply all preprocessing
    x, y = feature_selection(df)

    x_train, x_test, y_train, y_test = split_train_test(x, y)

    # Train all models
    trained_models = train_model(x_train, y_train)

    # Evaluate on Training set (R²)
    print("\n--- Training Set Performance ---")
    train_scores = {}
    for name, model in trained_models.items():
        y_pred_train = model.predict(x_train)
        score_train = r2_score(y_train, y_pred_train)
        train_scores[name] = score_train
        print(f"{name} R² on Training set = {score_train*100:.2f}%")

    # Evaluate on Test set (R²)
    print("\n--- Test Set Performance ---")
    test_scores = {}
    for name, model in trained_models.items():
        y_pred_test = model.predict(x_test)
        score_test = r2_score(y_test, y_pred_test)
        test_scores[name] = score_test
        print(f"{name} R² on Test set = {score_test*100:.2f}%")

    # Select best model based on Test set
    best_model_name = max(test_scores, key=test_scores.get)
    best_model = trained_models[best_model_name]
    print(f"\nBest model based on Test set: {best_model_name} with R² = {test_scores[best_model_name]*100:.2f}%")

    # Save best model
    if do_save_model:
        save_model(best_model, MODEL_PATH)
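
The selection rule in `run_pipeline` boils down to taking the key with the highest test score. The sketch below isolates that step and also compares training and test scores per model. All names and numbers are made up for illustration and are not results from this notebook.

```python
# Made-up scores for illustration only
train_scores = {"ModelA": 0.99, "ModelB": 0.85, "ModelC": 0.60}
test_scores = {"ModelA": 0.70, "ModelB": 0.82, "ModelC": 0.58}

# Same selection rule as run_pipeline: the highest test-set score wins
best_model_name = max(test_scores, key=test_scores.get)

# Train/test gap per model; a large positive gap is a common sign of overfitting
gaps = {name: train_scores[name] - test_scores[name] for name in test_scores}
largest_gap_name = max(gaps, key=gaps.get)

print(best_model_name)
print(largest_gap_name)
```

In this toy case the model with the best training score is not the one selected, which is the reason the pipeline ranks models on the test set.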

## **Pipeline runner**

In [22]:
run_pipeline()

Numerical features have been selected automatically
Categorical features have been selected automatically
Data has been loaded
Initializing the feature selection process...
Features and target have been determined!
Data has been splitted to training and testing
Training LinearRegression
Training KNN
Training SVR_rbf
Training SVR_poly
Training SVR_sigmoid
Training DecisionTree
Training RandomForest
Training Bagging
Training AdaBoost
Training GradientBoosting
Training XGBoost
Model has been trained

--- Training Set Performance ---
LinearRegression R² on Training set = 100.00%
KNN R² on Training set = 89.64%
SVR_rbf R² on Training set = 42.37%
SVR_poly R² on Training set = 22.79%
SVR_sigmoid R² on Training set = -2567982.64%
DecisionTree R² on Training set = 99.80%
RandomForest R² on Training set = 99.93%
Bagging R² on Training set = 99.90%
AdaBoost R² on Training set = 100.00%
GradientBoosting R² on Training set = 99.99%
XGBoost R² on Training set = 99.97%

--- Test Set Performance ---
