Title: **Assignment 2 - COSC3013 Computational Machine Learning - End-to-end Machine Learning Project**

Student ID: **S3979613**

Student Name and email (contact info): **Dao Sy Trung Kien - S3979613@rmit.edu.vn**

Affiliations: **RMIT University Vietnam.**

Date of Report: 03/08/2023

I certify that this is all my own original work. If I took any parts from elsewhere, then they were non-essential parts of the assignment, and they are clearly attributed in my submission.  I will show I agree to this honor code by typing "Yes": Yes.


### Required Libraries and Utilities

In [None]:
# Import packages: pandas, NumPy, Matplotlib, Seaborn, SimpleImputer
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.style as style
from sklearn.impute import SimpleImputer
style.use('fivethirtyeight')
np.random.seed(0)

# Modelling
import sklearn.metrics as metrics
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve, accuracy_score
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# Oversampling
from imblearn.over_sampling import SMOTE

# Suppress warnings
import warnings; warnings.filterwarnings('ignore')
pd.options.display.max_rows = 4000

### Import data and check for null values.

In [None]:
# Import the data from the UCI_data.csv file
# df_train = pd.read_csv(r'C:\Users\Kien\Downloads\Computational ML\UCI-electricity\UCI-electricity\UCI_data.csv')
df_train = pd.read_csv(r'UCI_data.csv')
# Print out data.
df_train.head()

In [None]:
# Calculating the Missing Values % contribution in Train Data
df_train_null = round(100*(df_train.isnull().sum())/len(df_train), 2)
df_train_null

### Basic Data Exploration.

In [None]:
# Check the dimensions of the Training dataset
print(df_train.shape)

In [None]:
# Get info of the dataframe columns
df_train.info()

### Data Pre-processing.

In [None]:
from scipy import stats

# Calculate Z-scores for each column
z_scores = stats.zscore(df_train.select_dtypes(include=[float, int]))

# Create a boolean mask for rows where all Z-scores are less than 3
# mask = (abs(z_scores) < 3).all(axis=1)

# Filter the DataFrame using the mask
# df_train_filtered = df_train[mask]
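
The Z-score filter that is commented out above can be sketched on a small toy frame. This is only an illustration under assumed data: the `toy_*` names and values are hypothetical and are not part of the assignment dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy data: 19 ordinary rows plus one extreme reading in column "a"
toy = pd.DataFrame({
    "a": [10.0] * 19 + [1000.0],
    "b": np.linspace(0.0, 1.0, 20),
})

# Absolute Z-score of every value, computed column by column
toy_z = np.abs(stats.zscore(toy))

# Keep only the rows where every column lies within 3 standard deviations
toy_mask = (toy_z < 3).all(axis=1)
toy_kept = toy[toy_mask]
print(len(toy), "->", len(toy_kept))
```

Because the mask requires every column to pass, a single extreme value is enough to drop the whole row.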

In [None]:
df_train_filtered = df_train
df_train_filtered

In [None]:
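# Calculate the median of the Windspeed column
median_windspeed = df_train_filtered['Windspeed'].median()
# Treat zero readings as missing and replace them with the median.
# Assign the result back: an in-place replace on a selected column may not update the DataFrame.
df_train_filtered['Windspeed'] = df_train_filtered['Windspeed'].replace(0, median_windspeed)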
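
A minimal sketch of the zero-replacement step on hypothetical values (the `toy_*` names are assumptions for illustration). Note that the median here is computed with the zeros still included, as in the cell above.

```python
import pandas as pd

# Hypothetical wind speed readings where 0 stands in for a missing value
toy = pd.DataFrame({"Windspeed": [0.0, 2.0, 4.0, 0.0, 6.0]})

# Median of [0, 0, 2, 4, 6], zeros included
toy_median = toy["Windspeed"].median()

# Replace the zeros and assign the result back to the column
toy["Windspeed"] = toy["Windspeed"].replace(0, toy_median)
print(toy["Windspeed"].tolist())  # [2.0, 2.0, 4.0, 2.0, 6.0]
```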

### Feature Engineering.

In [None]:
# Parse the 'date' column into a pandas datetime column
df_train_filtered['datetime'] = pd.to_datetime(df_train_filtered['date'])

# Create new columns for date and time
df_train_filtered['date'] = df_train_filtered['datetime'].dt.date
df_train_filtered['time'] = df_train_filtered['datetime'].dt.time

#Hour of the Day
df_train_filtered['hour'] = df_train_filtered['datetime'].dt.hour

# Day Part
df_train_filtered['day_part'] = pd.cut(df_train_filtered['hour'], bins=[0, 6, 12, 18, 24], labels=['Night', 'Morning', 'Afternoon', 'Evening'], right=False)

# Season
df_train_filtered['month'] = df_train_filtered['datetime'].dt.month
df_train_filtered['season'] = df_train_filtered['month'].apply(lambda x: 'Winter' if x in [12, 1, 2] else 'Spring' if x in [3, 4, 5] else 'Summer' if x in [6, 7, 8] else 'Fall')

# Drop the original datetime column if no longer needed
df_train_filtered.drop('datetime', axis=1, inplace=True)


In [None]:
df_train_filtered

In [None]:
# Encode the 'day_part' categorical feature into numerical values
df_train_filtered['day_part_encoded'] = df_train_filtered['day_part'].cat.codes
df_train_filtered['day_part_encoded']
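
To check how the day-part bins and their integer codes behave at the bin edges, here is a small sketch on hypothetical hours (the `toy_*` names are assumptions). With `right=False` each bin includes its left edge and excludes its right edge, so hour 6 falls in 'Morning' and hour 18 in 'Evening'.

```python
import pandas as pd

# Hypothetical hours chosen to sit on either side of each bin edge
toy_hours = pd.Series([0, 5, 6, 11, 12, 17, 18, 23])

# Same bins and labels as the feature-engineering cell: [0, 6), [6, 12), [12, 18), [18, 24)
toy_parts = pd.cut(
    toy_hours,
    bins=[0, 6, 12, 18, 24],
    labels=['Night', 'Morning', 'Afternoon', 'Evening'],
    right=False,
)

# Integer codes follow the order of the labels: Night=0, Morning=1, Afternoon=2, Evening=3
toy_codes = toy_parts.cat.codes
print(dict(zip(toy_hours, toy_parts)))
print(toy_codes.tolist())
```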

In [None]:
temperature_columns = ['T1', 'T2', 'T3', 'T4', 'T5', 'T6', 'T7', 'T8', 'T9']
df_train_filtered['temperature_mean'] = df_train_filtered[temperature_columns].mean(axis=1)
df_train_filtered

In [None]:
humidity_columns = ['RH_1', 'RH_2', 'RH_3', 'RH_4', 'RH_5', 'RH_6', 'RH_7', 'RH_8', 'RH_9']
df_train_filtered['humidity_mean'] = df_train_filtered[humidity_columns].mean(axis=1)
df_train_filtered

In [None]:
df_train_filtered['daylight'] = df_train_filtered['hour'].between(6,18)
df_train_filtered['daylight'] = np.where(df_train_filtered['daylight'], 1, 0)
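
A quick sketch of the daylight flag on hypothetical hours (the `toy_*` names are assumptions). `Series.between` includes both endpoints by default, so hours 6 and 18 are both flagged as daylight.

```python
import numpy as np
import pandas as pd

# Hypothetical hours around the two boundaries
toy_hours = pd.Series([5, 6, 12, 18, 19])

# between(6, 18) is inclusive on both sides by default
toy_daylight = np.where(toy_hours.between(6, 18), 1, 0)
print(toy_daylight.tolist())  # [0, 1, 1, 1, 0]
```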

In [None]:
# TODO: create hot-day / cold-day features

### Correlation Matrix

In [None]:
# Correlation Matrix
numeric_df_train = df_train_filtered.select_dtypes(include=[np.number])
plt.figure(figsize=(30, 25))
sns.heatmap(numeric_df_train.corr(), annot=True, cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()


In [None]:
df_train_filtered.info()

In [None]:
numerical_df_train = df_train_filtered.drop(columns=['day_part', 'time', 'season','date'])

In [None]:
plt.figure(figsize=(15, 10))
numeric_df_train.boxplot()
plt.title('Box plot of numerical features before handling outliers')
plt.xticks(rotation=45)
plt.show()

In [None]:
# Calculate Z-scores for each numerical feature
z_scores = np.abs(stats.zscore(numerical_df_train))
z_outliers = np.where(z_scores > 3)  # Positions of points more than 3 standard deviations away
# Calculate the IQR (a measure of statistical dispersion) for each numerical feature
Q1 = numerical_df_train.quantile(0.25)
Q3 = numerical_df_train.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = ((numerical_df_train < (lower_bound)) | (numerical_df_train > (upper_bound))).sum()
print(outliers)

In [None]:
# Cap and floor the outliers
for feature in numerical_df_train:
    lower_cap = Q1[feature] - 1.5 * IQR[feature]
    upper_cap = Q3[feature] + 1.5 * IQR[feature]
    numerical_df_train[feature] = np.where(numerical_df_train[feature] < lower_cap, lower_cap, numerical_df_train[feature])
    numerical_df_train[feature] = np.where(numerical_df_train[feature] > upper_cap, upper_cap, numerical_df_train[feature])
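
The same cap-and-floor rule can also be written with `DataFrame.clip`. The sketch below uses a hypothetical one-column frame (the `toy_*` names are assumptions) to show the 1.5 × IQR bounds in action.

```python
import pandas as pd

# Hypothetical feature with one extreme value
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 100.0]})

# Per-column quartiles and the 1.5 * IQR fences
toy_q1 = toy.quantile(0.25)
toy_q3 = toy.quantile(0.75)
toy_iqr = toy_q3 - toy_q1
toy_lower = toy_q1 - 1.5 * toy_iqr
toy_upper = toy_q3 + 1.5 * toy_iqr

# axis=1 aligns the per-column bounds with the DataFrame columns
toy_capped = toy.clip(lower=toy_lower, upper=toy_upper, axis=1)
print(toy_capped["x"].tolist())  # [1.0, 2.0, 3.0, 4.0, 7.0]
```

Here Q1 = 2 and Q3 = 4, so the fences are -1 and 7, and only the value 100 is pulled in to the upper fence.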

In [None]:
plt.figure(figsize=(15,10))
numerical_df_train.boxplot()
plt.title('Box plot of numerical features after handling outliers')
plt.xticks(rotation=45)
plt.show()

### Training and Fitting Models

In [None]:
# Split the dataset into training and test sets (70% train, 30% test)
target_feature = numerical_df_train['TARGET_energy']
features = numerical_df_train.drop(columns=['TARGET_energy'])
selected_features = ['hour', 'day_part_encoded', 'daylight']
X_train, X_test, y_train, y_test = train_test_split(features[selected_features], target_feature, test_size=0.3, random_state=42)

In [None]:
target_feature.describe()

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Scale numerical features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
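
The scaler is fitted on the training split only and then reused on the test split. A minimal sketch with hypothetical arrays (the `toy_*` names are assumptions) shows the consequence: the training data ends up centred at zero, while the test data is expressed in the training set's units and need not be.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature splits
toy_train = np.array([[0.0], [2.0], [4.0]])
toy_test = np.array([[6.0]])

# Learn mean and standard deviation from the training split only
toy_scaler = StandardScaler().fit(toy_train)

toy_train_scaled = toy_scaler.transform(toy_train)
toy_test_scaled = toy_scaler.transform(toy_test)

print(toy_scaler.mean_, toy_scaler.scale_)
print(toy_train_scaled.ravel(), toy_test_scaled.ravel())
```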

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.neural_network import MLPRegressor
import xgboost as xgb
import lightgbm as lgb

# Define the models and their hyperparameter grids
models = {
    'LinearRegression': {
        'model': LinearRegression(),
        'params': {
            'fit_intercept': [True, False],
            'n_jobs': [None, -1, 1],
            'positive': [False, True]
        }
    },
    'Ridge': {
        'model': Ridge(random_state=42),
        'params': {
            'alpha': [0.01, 0.1, 1, 10, 100]
        }
    },
    'Lasso': {
        'model': Lasso(random_state=42),
        'params': {
            'alpha': [0.01, 0.1, 1, 10, 100]
        }
    },
    'ElasticNet': {
        'model': ElasticNet(),
        'params': {
            'alpha': [0.01, 0.1, 1, 10, 100],
            'l1_ratio': [0.1, 0.5, 0.7, 1.0]
        }
    },
    'DecisionTreeRegressor': {
        'model': DecisionTreeRegressor(random_state=42),
        'params': {
            'max_depth': [None, 10, 20, 30],
            'min_samples_split': [2, 5, 10]
        }
    },
    'RandomForestRegressor': {
        'model': RandomForestRegressor(random_state=42),
        'params': {
            'n_estimators': [50, 100, 200],
            'max_depth': [None, 10, 20, 30],
            'min_samples_split': [2, 5, 10]
        }
    },
    'SVR': {
        'model': SVR(),
        'params': {
            'C': [0.1, 1, 10],
            'epsilon': [0.01, 0.1, 0.2],
            'kernel': ['linear', 'rbf']
        }
    },
    'GradientBoostingRegressor': {
        'model': GradientBoostingRegressor(random_state=42),
        'params': {
            'n_estimators': [50, 100, 200],
            'learning_rate': [0.01, 0.1, 0.2],
            'max_depth': [3, 5, 10]
        }
    },
    'KNeighborsRegressor': {
        'model': KNeighborsRegressor(),
        'params': {
            'n_neighbors': [3, 5, 7, 9],
            'weights': ['uniform', 'distance']
        }
    },
    'AdaBoostRegressor': {
        'model': AdaBoostRegressor(random_state=42),
        'params': {
            'n_estimators': [50, 100, 200],
            'learning_rate': [0.01, 0.1, 0.2]
        }
    },
    'XGBoost': {
        'model': xgb.XGBRegressor(random_state=42),
        'params': {
            'n_estimators': [100, 200],
            'learning_rate': [0.01, 0.1, 0.2],
            'max_depth': [3, 5, 7]
        }
    },
    'LightGBM': {
        'model': lgb.LGBMRegressor(random_state=42),
        'params': {
            'n_estimators': [100, 200],
            'learning_rate': [0.01, 0.1, 0.2],
            'max_depth': [3, 5, 7]
        }
    },
    'MLPRegressor': {
        'model': MLPRegressor(random_state=42),
        'params': {
            'hidden_layer_sizes': [(50,), (100,), (50, 50)],
            'activation': ['relu', 'tanh'],
            'solver': ['adam', 'lbfgs'],
            'learning_rate_init': [0.001, 0.01]
        }
    }
}

In [None]:
# # Evaluate each model with its default parameters
# for model_name, model_spec in models.items():
#     model = model_spec['model']
#     model.fit(X_train, y_train)
#     # Make predictions
#     y_pred = model.predict(X_test)
#     # Evaluate the model
#     mse = mean_squared_error(y_test, y_pred)
#     r2 = r2_score(y_test, y_pred)
#     print(f"{model_name} - MSE: {mse}, R²: {r2}")


In [31]:
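# Perform a grid search for each model, fitting on X_train, y_train
best_params = {}
best_scores = {}
best_models = {}
# Note: StratifiedKFold is designed for class labels; for a continuous target,
# KFold is the usual choice. It is kept here so the recorded output below still applies.
skf = StratifiedKFold(n_splits=10)
for model_detail, model_spec in models.items():
    model = model_spec['model']
    param_grid = model_spec['params']
    grid_search = GridSearchCV(model, param_grid, cv=skf, scoring='neg_mean_absolute_error', n_jobs=-1, verbose=2)
    grid_search.fit(X_train, y_train)
    # Keep the refitted best estimator, its parameters and its cross-validated score
    best_models[model_detail] = grid_search.best_estimator_
    best_params[model_detail] = grid_search.best_params_
    best_scores[model_detail] = grid_search.best_score_
    print(f'{model_detail} best estimator: {grid_search.best_estimator_}')
    print(f'{model_detail} best params: {grid_search.best_params_}')
    print(f'{model_detail} best score: {grid_search.best_score_:.4f}')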


Fitting 10 folds for each of 12 candidates, totalling 120 fits
LinearRegression best estimator: LinearRegression(positive=True)
LinearRegression best params: {'fit_intercept': True, 'n_jobs': None, 'positive': True}
LinearRegression best score: -29.6687
Fitting 10 folds for each of 5 candidates, totalling 50 fits
Ridge best estimator: Ridge(alpha=100, random_state=42)
Ridge best params: {'alpha': 100}
Ridge best score: -29.6719
Fitting 10 folds for each of 5 candidates, totalling 50 fits
Lasso best estimator: Lasso(alpha=0.1, random_state=42)
Lasso best params: {'alpha': 0.1}
Lasso best score: -29.6650
Fitting 10 folds for each of 20 candidates, totalling 200 fits
ElasticNet best estimator: ElasticNet(alpha=0.1, l1_ratio=1.0)
ElasticNet best params: {'alpha': 0.1, 'l1_ratio': 1.0}
ElasticNet best score: -29.6650
Fitting 10 folds for each of 12 candidates, totalling 120 fits
DecisionTreeRegressor best estimator: DecisionTreeRegressor(random_state=42)
DecisionTreeRegressor best params: {
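
For comparison, here is a hedged sketch of the same kind of search using plain `KFold`, which is the usual splitter for a continuous target. It runs on synthetic data, not the assignment dataset; the `toy_*` names, the coefficients and the noise level are all assumptions. Since the scoring is `neg_mean_absolute_error`, the best score is negated to read it as a cross-validated MAE.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic regression problem (hypothetical coefficients and noise level)
toy_rng = np.random.default_rng(0)
toy_X = toy_rng.normal(size=(200, 3))
toy_y = toy_X @ np.array([1.0, -2.0, 0.5]) + toy_rng.normal(scale=0.1, size=200)

# Shuffled K-fold split, suitable for a continuous target
toy_cv = KFold(n_splits=5, shuffle=True, random_state=42)

toy_search = GridSearchCV(
    Ridge(),
    {"alpha": [0.01, 0.1, 1, 10, 100]},
    cv=toy_cv,
    scoring="neg_mean_absolute_error",
)
toy_search.fit(toy_X, toy_y)

# Scores are negated errors, so flip the sign to get the cross-validated MAE
toy_cv_mae = -toy_search.best_score_
print(toy_search.best_params_, round(toy_cv_mae, 4))
```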

In [38]:
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error, mean_absolute_error
# Refit each model with its best parameters and evaluate it on X_test, y_test
for model_detail in models:
    model = models[model_detail]['model']
    model.set_params(**best_params[model_detail])
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    # Mean absolute percentage error (MAPE), returned as a fraction; printed under the label "MPE"
    mape = mean_absolute_percentage_error(y_test, y_pred)
    print(f"Test MSE for {model}: {mse}")
    print(f"Test MAE for {model}: {mae}")
    print(f"Test MPE for {model}: {mape}")
    print('')

Test MSE for LinearRegression(positive=True): 1540.6557935151873
Test MAE for LinearRegression(positive=True): 29.81502703435183
Test MPE for LinearRegression(positive=True): 0.4380858874907701

Test MSE for Ridge(alpha=100, random_state=42): 1539.1668992186985
Test MAE for Ridge(alpha=100, random_state=42): 29.814473660429147
Test MPE for Ridge(alpha=100, random_state=42): 0.4382421890023142

Test MSE for Lasso(alpha=0.1, random_state=42): 1539.831647666405
Test MAE for Lasso(alpha=0.1, random_state=42): 29.80939874830973
Test MPE for Lasso(alpha=0.1, random_state=42): 0.438066450571489

Test MSE for ElasticNet(alpha=0.1, l1_ratio=1.0): 1539.831647666405
Test MAE for ElasticNet(alpha=0.1, l1_ratio=1.0): 29.80939874830973
Test MPE for ElasticNet(alpha=0.1, l1_ratio=1.0): 0.438066450571489

Test MSE for DecisionTreeRegressor(random_state=42): 1333.3483007341024
Test MAE for DecisionTreeRegressor(random_state=42): 26.932463525894462
Test MPE for DecisionTreeRegressor(random_state=42): 0.