# Linear Regression Model Training 

## Overview

This script performs training of a linear regression model to predict energy consumption based on extracted features from a dataset containing timestamps and energy consumption records.

- The script demonstrates the entire process of training a linear regression model for energy consumption prediction, from data loading and preprocessing to model evaluation and saving.
- It utilizes popular libraries such as pandas, scikit-learn, and joblib to streamline various tasks involved in machine learning model training.
- Holding out a test set gives an estimate of how well the model generalizes; a test error much higher than the training error indicates overfitting.
- MSE and MAE quantify the prediction error: MSE penalizes large errors more heavily, while MAE is expressed in the same units as the target (kWh).
- Saving the trained model allows for easy reuse and deployment in production environments.

## Dependencies

- **pandas**: A data manipulation library used to load and preprocess the dataset.
- **scikit-learn** (imported as `sklearn`): A machine learning library providing tools for model selection, training, and evaluation.
- **matplotlib**: A plotting library imported for visualization; the steps below do not produce any plots.
- **joblib**: A library used to save the trained model to disk.

In [14]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt

1. **Load Dataset**: 
- The script loads the dataset containing energy consumption records from the Excel file `new_data.xlsx` using the `pd.read_excel()` function. This dataset serves as the basis for training the linear regression model.



In [15]:
# Load the dataset
df = pd.read_excel("new_data.xlsx")

In [16]:
df.head()

| Unnamed: 0 | Timestamp | Answer Value first | Answer Value last | Equipment SNO first | Asset Number first | Reading Name first | Is Error Set? first | Is Error Code? first | Asset first | Company first | Energy Consumption (kWh) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2024-01-01 00:00:00 | 54199.73 | 54199.79 | 68B6B34180C8-3 | FSCHN-E-00001 | activeenergydla | 0.0 |  | AHU DB | Chennai – Bayline | 0.06 |
| 1 | 2024-01-01 01:00:00 | 54199.8 | 54199.91 | 68B6B34180C8-3 | FSCHN-E-00001 | activeenergydla | 0.0 |  | AHU DB | Chennai – Bayline | 0.11 |
| 2 | 2024-01-01 02:00:00 | 54199.92 | 54200.03 | 68B6B34180C8-3 | FSCHN-E-00001 | activeenergydla | 0.0 |  | AHU DB | Chennai – Bayline | 0.11 |
| 3 | 2024-01-01 03:00:00 | 54200.04 | 54200.16 | 68B6B34180C8-3 | FSCHN-E-00001 | activeenergydla | 0.0 |  | AHU DB | Chennai – Bayline | 0.12 |
| 4 | 2024-01-01 04:00:00 | 54200.17 | 54200.29 | 68B6B34180C8-3 | FSCHN-E-00001 | activeenergydla | 0.0 |  | AHU DB | Chennai – Bayline | 0.12 |


2. **Data Preprocessing**: 
- The script converts the 'Timestamp' column to datetime format using the `pd.to_datetime()` function. 

- It then extracts relevant features such as year, month, day, hour, minute, and second from the timestamp using pandas' datetime properties (`dt.year`, `dt.month`, `dt.day`, `dt.hour`, `dt.minute`, `dt.second`).

In [17]:

# Convert the 'Timestamp' column to datetime format
df['Timestamp'] = pd.to_datetime(df['Timestamp'])

In [18]:

# Extract relevant features from the timestamp
df['Year'] = df['Timestamp'].dt.year
df['Month'] = df['Timestamp'].dt.month
df['Day'] = df['Timestamp'].dt.day
df['Hour'] = df['Timestamp'].dt.hour
df['Minute'] = df['Timestamp'].dt.minute
df['Second'] = df['Timestamp'].dt.second
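
To see what the `dt` accessors return, the sketch below applies the same extraction to a two-row toy frame. The timestamps are made up for illustration and are not taken from `new_data.xlsx`; the loop is only a compact way of writing the six assignments above.

```python
import pandas as pd

# Toy frame with made-up timestamps (illustrative only, not from new_data.xlsx)
toy = pd.DataFrame({"Timestamp": ["2024-01-01 00:00:00", "2024-03-15 13:30:45"]})
toy["Timestamp"] = pd.to_datetime(toy["Timestamp"])

# Same accessor-based extraction as in the notebook, written as a loop
for part in ["year", "month", "day", "hour", "minute", "second"]:
    toy[part.capitalize()] = getattr(toy["Timestamp"].dt, part)

print(toy.drop(columns="Timestamp"))
```

Each new column holds plain integers, which is what lets them be passed directly to a linear model as numeric features.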

3. **Feature Selection**: 
 - Extracted features are used as input features (X) for training the linear regression model, while the 'Energy Consumption (kWh)' column serves as the target variable (y).

In [19]:
# Feature Selection

# Use the extracted features as input features
X = df[['Year', 'Month', 'Day', 'Hour', 'Minute', 'Second']]
y = df['Energy Consumption (kWh)']  # Target variable


**Impute Missing Target Values**:
- Before the data is split, NaN values in the target column ('Energy Consumption (kWh)') are replaced with the column mean.
- The mean is calculated with the pandas `mean()` method, which skips NaN values by default.
- NaN values are replaced with `fillna()`, and the result is assigned back to the column. Calling `fillna(..., inplace=True)` on a selected column is avoided because it may leave the DataFrame unchanged in recent pandas versions.
- `y` is re-created from the imputed column so that the target used for training contains no NaN values.
- Optionally, display the updated dataset to verify that the NaN values have been imputed.


In [22]:
# Impute NaN values in the target variable with the mean
mean_energy_consumption = df['Energy Consumption (kWh)'].mean()
df['Energy Consumption (kWh)'] = df['Energy Consumption (kWh)'].fillna(mean_energy_consumption)

# Refresh the target so it reflects the imputed values
y = df['Energy Consumption (kWh)']
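
A minimal sketch of mean imputation on a toy column, assuming made-up values. It assigns the result of `fillna()` back to the column, which behaves the same across pandas versions, whereas a chained `inplace=True` call may not update the DataFrame when Copy-on-Write is active.

```python
import numpy as np
import pandas as pd

# Toy target column with one missing reading (values are made up)
col = "Energy Consumption (kWh)"
toy = pd.DataFrame({col: [0.06, np.nan, 0.12]})

mean_value = toy[col].mean()            # NaN is skipped: (0.06 + 0.12) / 2
toy[col] = toy[col].fillna(mean_value)  # assign back instead of inplace=True

print(toy[col].isna().sum())  # number of missing values left
```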


4. **Data Splitting**:
- The dataset is split into training and testing sets using the `train_test_split()` function from `sklearn.model_selection`.
- The testing set is 20% of the total dataset (`test_size=0.2`), and a random seed of 42 (`random_state=42`) makes the split reproducible.

In [23]:

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
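
The effect of `test_size` and `random_state` can be checked on a small synthetic array. The sketch below uses made-up data (`X_demo`, `y_demo` are hypothetical names) and the same call as above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and target: 10 rows, 2 columns
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)  # 8 training rows and 2 test rows

# Same seed, same split: a second call selects the same test rows
_, _, _, y_te_again = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(y_te, y_te_again))
```

Note that `train_test_split()` shuffles rows by default, so the test set is a random sample of hours rather than the most recent ones.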


5. **Model Initialization and Training**:
-  A linear regression model is initialized using the `LinearRegression()` class from `sklearn.linear_model`, and then trained on the training data using the `fit()` method.


In [24]:
# Initializing and training the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
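
After `fit()`, the learned weights are available as `coef_` and `intercept_`. The sketch below, on synthetic data generated from a known linear rule (an assumption made purely for illustration), shows that ordinary least squares recovers that rule.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following y = 3*x1 - 2*x2 + 5 exactly (made up for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 5

demo_model = LinearRegression().fit(X_demo, y_demo)

# The fitted weights and intercept match the generating rule
print(demo_model.coef_)       # approximately [ 3. -2.]
print(demo_model.intercept_)  # approximately 5.0
```

On the notebook's model, `model.coef_` would hold one weight per input column, in the order Year, Month, Day, Hour, Minute, Second.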

6. **Making Predictions**:
-  Predictions are made on both the training and testing sets using the trained model's `predict()` method.


In [26]:

# Making predictions
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)


7. **Model Evaluation**:
-  The mean squared error (MSE) and mean absolute error (MAE) are calculated to evaluate the performance of the model on both the training and testing sets using the `mean_squared_error()` and `mean_absolute_error()` functions from `sklearn.metrics`.


In [27]:

# Evaluating the model
mse_train = mean_squared_error(y_train, y_pred_train)
mse_test = mean_squared_error(y_test, y_pred_test)
mae_train = mean_absolute_error(y_train, y_pred_train)
mae_test = mean_absolute_error(y_test, y_pred_test)
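
Both metrics are simple averages over the prediction errors: MSE averages the squared errors and MAE averages their absolute values. The sketch below computes them by hand on made-up numbers and checks the results against the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up true values and predictions
y_true = np.array([1.0, 2.0, 4.0])
y_hat = np.array([1.5, 2.0, 2.0])

errors = y_true - y_hat               # [-0.5, 0.0, 2.0]
mse_manual = np.mean(errors ** 2)     # (0.25 + 0 + 4) / 3
mae_manual = np.mean(np.abs(errors))  # (0.5 + 0 + 2) / 3

# The hand-computed values agree with scikit-learn's functions
print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print(np.isclose(mae_manual, mean_absolute_error(y_true, y_hat)))
```

The single large error (2.0) dominates the MSE but contributes only proportionally to the MAE, which is why the two metrics can rank models differently.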

8. **Printing Evaluation Metrics**: 
- The calculated MSE and MAE for both training and testing sets are printed to the console for analysis.


In [28]:
# Print the evaluation metrics
print("Training MSE:", mse_train)
print("Testing MSE:", mse_test)
print("Training MAE:", mae_train)
print("Testing MAE:", mae_test)


Training MSE: 97.07064487207533
Testing MSE: 95.01444622559254
Training MAE: 8.195468089792882
Testing MAE: 7.876718217135026
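
Because MSE is in squared units (kWh²), it can help to take its square root (RMSE) so the error is in the same units as the target and the MAE. The sketch below does this for the MSE values printed above; the variable names are reused only for readability.

```python
import math

# MSE values copied from the printed output above
mse_train = 97.07064487207533
mse_test = 95.01444622559254

# RMSE is the square root of MSE, so it is expressed in kWh like MAE
rmse_train = math.sqrt(mse_train)
rmse_test = math.sqrt(mse_test)

print(f"Training RMSE: {rmse_train:.2f}")
print(f"Testing RMSE: {rmse_test:.2f}")
```

The training and testing errors are close to each other, which suggests the model is not overfitting; whether the error level itself is acceptable depends on the typical magnitude of the target.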


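9. **Save Model**:
- Finally, the trained model is saved to `model.h5` using the `joblib.dump()` function from the `joblib` library. Despite the `.h5` extension, the file is in joblib's pickle-based format, not HDF5, so it must be read back with `joblib.load()`. An extension such as `.joblib` or `.pkl` would describe the format more accurately.


In [47]:
import joblib
# Save the model with joblib (pickle-based format, despite the .h5 extension)
joblib.dump(model, 'model.h5')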

['model.h5']
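
To reuse a saved model, it can be read back with `joblib.load()`. The sketch below is a round trip on a small stand-in model fitted to made-up data; the file name `demo_model.joblib` is hypothetical and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Small stand-in model fitted on made-up data (y = 2*x)
X_demo = np.array([[0.0], [1.0], [2.0]])
y_demo = np.array([0.0, 2.0, 4.0])
demo_model = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")  # hypothetical file name
    joblib.dump(demo_model, path)
    restored = joblib.load(path)

# The restored model gives the same predictions as the original
print(np.allclose(restored.predict(X_demo), demo_model.predict(X_demo)))
```

Since joblib files are pickle-based, they should only be loaded from trusted sources, and ideally with the same scikit-learn version that was used to save them.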