# One_month_prediction_Alkalis

## TO-DOs
```
[v] Import monthly electricity data
[v] Import monthly TTF_GAS data
[v] Import price evolution data
[v] Create rows and encoding Alkalis_RM02_0001, Alkalis_RM02_0002
[v] Calculate the monthly average prices of Alkalis
[v] Create 12*N features: external factor prices lagged from 1 month to 12 months before
[v] Combine features with target variables
[v] train_test_split() - fit calculations and scaling on the train data set only, to prevent data leakage
[x] Detect outliers - skip
[v] Check data distribution
[v] Data scaling - log transformation
[x] Check multicollinearity (fit one regression per feature, compute the correlations among all features, and keep the features with the best performance and lowest correlation for the final model) - skip
[v] Lasso regression - fit and transform train data set
[v] Cross validation and Hyperparameter tuning using RandomizedSearchCV
[v] Lasso regression - transform test data set
[] Compare Lasso with Simple linear model
[] Visualisation
```
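
The lagged-feature step in the list above can be sketched as follows. This is only an illustration of the idea, not the code behind `pre.generate_features`, whose implementation is not shown in this notebook. The helper name `add_lag_features` and the `Price` column are assumptions; `Year`, `Month` and the `Electricity_k` naming follow the DataFrame printed further down.

```python
import pandas as pd

def add_lag_features(target: pd.DataFrame, factor: pd.DataFrame,
                     name: str, start: int = 1, end: int = 12) -> pd.DataFrame:
    """Attach columns `name`_k holding the factor price k months before each target row.

    Hypothetical helper: assumes `factor` has one row per (Year, Month) with a 'Price' column.
    """
    out = target.copy()
    # Encode (Year, Month) as one integer so that a k-month lag is a simple subtraction
    out["_period"] = out["Year"] * 12 + (out["Month"] - 1)
    lookup = factor.assign(_period=factor["Year"] * 12 + (factor["Month"] - 1))
    lookup = lookup.set_index("_period")["Price"]
    for k in range(start, end + 1):
        # Look up the factor price k months earlier; months without data become NaN
        out[f"{name}_{k}"] = (out["_period"] - k).map(lookup)
    return out.drop(columns="_period")

# Toy data: 24 monthly factor prices (0.0 .. 23.0) and a single target row for March 2012
factor = pd.DataFrame({
    "Year": [2011] * 12 + [2012] * 12,
    "Month": list(range(1, 13)) * 2,
    "Price": [float(i) for i in range(24)],
})
target = pd.DataFrame({"Year": [2012], "Month": [3], "Average_price": [100.0]})
features = add_lag_features(target, factor, "Electricity")
```

Because every feature for a given month only uses factor prices from earlier months, the one-month-ahead prediction never sees information from the month being predicted.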

In [1]:
!pip install fredapi

Defaulting to user installation because normal site-packages is not writeable


In [2]:
import preprocessor as pre
import pandas as pd
from fredapi import Fred
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, RandomizedSearchCV
import numpy as np
import matplotlib.pyplot as plt


In [3]:
def monthly_mean_to_daily(df_monthly: pd.DataFrame) -> pd.DataFrame:
    """
    Convert monthly data into daily data, filling every day with that month's mean price.
    """
    df_monthly = df_monthly.copy()  # avoid mutating the caller's DataFrame
    # The first day of each month becomes the anchor date for that month's row
    df_monthly['Date'] = pd.to_datetime(df_monthly[['Year', 'Month']].assign(DAY=1))

    # Generate the complete range of daily dates covered by the data
    start_date = df_monthly['Date'].min()  # first day of the earliest month
    end_date = df_monthly['Date'].max() + pd.offsets.MonthEnd(1)  # last day of the latest month
    full_date_range = pd.date_range(start=start_date, end=end_date, freq='D')  # fixed-frequency DatetimeIndex

    # Left-merge the monthly rows onto the daily dates, then forward-fill each month's values
    df_full_date_range = pd.DataFrame(full_date_range, columns=['Date'])
    df = pd.merge(df_full_date_range, df_monthly, on='Date', how='left')
    df_daily = df.ffill(axis=0)  # fill each missing value with the last valid observation
    return df_daily
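
As a quick sanity check of the monthly-to-daily expansion, the same idea can be sketched with `reindex` instead of a merge. The function name `monthly_to_daily_reindex` and the `Price` column are assumptions made for this example; it is meant as a compact cross-check, not a replacement for the function above.

```python
import pandas as pd

def monthly_to_daily_reindex(df_monthly: pd.DataFrame) -> pd.DataFrame:
    """Expand one row per month into one row per day, forward-filling the monthly values."""
    df = df_monthly.copy()
    # Anchor each monthly row on the first day of its month
    df["Date"] = pd.to_datetime(df[["Year", "Month"]].assign(Day=1))
    # Daily range from the first day of the first month to the last day of the last month
    full_range = pd.date_range(df["Date"].min(),
                               df["Date"].max() + pd.offsets.MonthEnd(1),
                               freq="D")
    daily = df.set_index("Date").sort_index().reindex(full_range).ffill()
    return daily.rename_axis("Date").reset_index()

# Two months of toy data (2020 is a leap year, so February has 29 days)
monthly = pd.DataFrame({"Year": [2020, 2020], "Month": [1, 2], "Price": [10.0, 12.5]})
daily = monthly_to_daily_reindex(monthly)
```

With this toy input the result has one row for each of the 60 days in January and February 2020, and every February row carries the February price.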

In [4]:
print(pre.get_Fred_data.__doc__)

To extract data from Fred database: https://fred.stlouisfed.org/ 
    apiKey = '29219060bc68b2802af8584e0f328b52'
    PWHEAMTUSDM - wheat https://fred.stlouisfed.org/series/PWHEAMTUSDM
    WPU0652013A - Ammonia https://fred.stlouisfed.org/series/WPU0652013A
    PNGASEUUSDM - TTG_Gas https://fred.stlouisfed.org/series/PNGASEUUSDM
    


In [5]:
gas_df = pre.get_Fred_data('PNGASEUUSDM',2011,2023)
ammonia_df = pre.get_Fred_data('WPU0652013A',2011,2023)
wheat_df = pre.get_Fred_data('PWHEAMTUSDM',2011,2023)

elec_df = pre.clean_elec_csv('ELECTRICITY.csv',2011,2023)

price_evo_df = pre.clean_pred_price_evo_csv('Dataset_Predicting_Price_Evolutions.csv',2012,2023)

dummy_df = pre.get_dummies_and_average_price(price_evo_df,'Alkalis',\
                                         'RM02/0001','RM02/0002')

Alkalis_df = pre.generate_features(1,12,dummy_df,\
                                       Electricity=elec_df,\
                                       PNGASEUUSDM=gas_df)

print(Alkalis_df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6007 entries, 0 to 6006
Data columns (total 30 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   Time               6007 non-null   datetime64[ns]
 1   Group Description  6007 non-null   object        
 2   Year               6007 non-null   int64         
 3   Month              6007 non-null   int64         
 4   RM02/0002          6007 non-null   uint8         
 5   Average_price      6007 non-null   float64       
 6   Electricity_1      6007 non-null   float64       
 7   PNGASEUUSDM_1      6007 non-null   float64       
 8   Electricity_2      6007 non-null   float64       
 9   PNGASEUUSDM_2      6007 non-null   float64       
 10  Electricity_3      6007 non-null   float64       
 11  PNGASEUUSDM_3      6007 non-null   float64       
 12  Electricity_4      6007 non-null   float64       
 13  PNGASEUUSDM_4      6007 non-null   float64       
 14  Electric

In [6]:
## train_test_split()
## Log transformation

# # Observe data distribution
# Alkalis_df.drop(['RM02/0002','Time', 'Group Description', 'Year','Month'],axis=1).hist()
# Alkalis_df['Average_price'].hist()

# Create X, y
feature_list = Alkalis_df.drop(['Time', 'Group Description', 'Year','Month','Average_price'],axis=1)
X = feature_list.values
y = Alkalis_df['Average_price'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) # 30% of our data as the test set

# Log transformation and standardisation
y_train_log = np.log(y_train)
y_test_log = np.log(y_test)

scaler_x = StandardScaler()
X_train_scaled = scaler_x.fit_transform(X_train)
X_test_scaled = scaler_x.transform(X_test)

scaler_y = StandardScaler()
y_train_scaled = scaler_y.fit_transform(y_train_log.reshape(-1,1))
y_test_scaled = scaler_y.transform(y_test_log.reshape(-1,1))
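
Because the target is log-transformed and then standardised, model outputs are in scaled log units. To report predictions as prices, both steps have to be undone in reverse order. A minimal sketch, assuming a fitted `StandardScaler` like `scaler_y` above; the helper name `to_price_scale` and the toy prices are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def to_price_scale(y_scaled, scaler_y: StandardScaler) -> np.ndarray:
    """Undo the standardisation, then undo the log, returning values in price units."""
    y_log = scaler_y.inverse_transform(np.asarray(y_scaled).reshape(-1, 1))
    return np.exp(y_log).ravel()

# Toy round trip: price -> log -> standardise -> back to price
prices = np.array([100.0, 150.0, 220.0, 400.0])
scaler_y_demo = StandardScaler()
y_scaled_demo = scaler_y_demo.fit_transform(np.log(prices).reshape(-1, 1))
recovered = to_price_scale(y_scaled_demo, scaler_y_demo)
```

The same helper could be applied to the output of `predict` on the scaled test features to obtain predicted prices.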

In [7]:
## Lasso regression - fit and transform train data set
## Cross validation and Hyperparameter tuning using RandomizedSearchCV

# Define the parameter grid
param_grid = {'alpha': np.linspace(0.0000001, 1, 3000)}

# Create a Lasso regression model
lasso = Lasso()

# Create RandomizedSearchCV object
random_search = RandomizedSearchCV(estimator=lasso, 
                                   param_distributions=param_grid, 
                                   n_iter=300, 
                                   cv=5, 
                                   random_state=42)

# Fit on the training data to run the randomized search
random_search.fit(X_train_scaled, y_train_scaled)

# Best alpha parameter
print("Best alpha parameter:", random_search.best_params_['alpha'])

# Best mean cross-validated R-squared score
print("Best R-squared score:", round(random_search.best_score_, 3))

# Coefficients of the best Lasso model
assert random_search.n_features_in_ == len(feature_list.columns)

print("Coefficients of the selected features in the best Lasso model:")
for feature, coefficient in zip(feature_list.columns, random_search.best_estimator_.coef_):
    print(f"{feature}: {round(coefficient,3)}")

Best alpha parameter: 1e-07
Best R-squared score: 0.924
Coefficients of the selected features in the best Lasso model:
RM02/0002: 0.467
Electricity_1: 0.386
PNGASEUUSDM_1: 0.011
Electricity_2: 0.278
PNGASEUUSDM_2: 0.256
Electricity_3: -0.031
PNGASEUUSDM_3: -0.301
Electricity_4: 0.102
PNGASEUUSDM_4: 0.151
Electricity_5: 0.045
PNGASEUUSDM_5: -0.171
Electricity_6: -0.022
PNGASEUUSDM_6: 0.031
Electricity_7: -0.082
PNGASEUUSDM_7: -0.041
Electricity_8: 0.132
PNGASEUUSDM_8: 0.313
Electricity_9: 0.221
PNGASEUUSDM_9: -0.098
Electricity_10: 0.265
PNGASEUUSDM_10: 0.048
Electricity_11: -0.024
PNGASEUUSDM_11: -0.218
Electricity_12: -0.138
PNGASEUUSDM_12: -0.131
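
The selected alpha, 1e-07, is the smallest value in the `np.linspace` grid, and the next grid point is already around 3.3e-04, so the search cannot tell what happens between those values. One possible refinement, sketched here on synthetic data rather than the Alkalis features, is to sample alpha from a log-uniform distribution so that small values are explored more densely. The bounds 1e-6 and 1 and the demo data are assumptions.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.linear_model import Lasso
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression problem: only three of the ten features carry signal
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 10))
true_coef = np.array([1.5, -2.0, 0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.0, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.1, size=200)

# Sample alpha log-uniformly instead of from an evenly spaced grid
search = RandomizedSearchCV(
    estimator=Lasso(max_iter=10_000),
    param_distributions={"alpha": loguniform(1e-6, 1e0)},
    n_iter=30,
    cv=5,
    random_state=42,
)
search.fit(X_demo, y_demo)
best_alpha = search.best_params_["alpha"]
print("Best alpha:", best_alpha)
print("Best mean CV R-squared:", round(search.best_score_, 3))
```

If the best alpha still sits at the lower bound after such a search, the penalty is effectively inactive and Lasso behaves almost like ordinary least squares.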


In [8]:
## Lasso regression - evaluate on the test data set
# Get the best Lasso model from RandomizedSearchCV
best_lasso_model = random_search.best_estimator_

# Predict on the test data
y_pred_test = best_lasso_model.predict(X_test_scaled)

# Evaluate the model performance on the test data
test_score = best_lasso_model.score(X_test_scaled, y_test_scaled)
print("Best Model:", best_lasso_model)
print("Test Set R-squared score:", round(test_score, 3))



Best Model: Lasso(alpha=1e-07)
Test Set R-squared score: 0.923
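
`Pipeline` is imported at the top of the notebook but not used. One way to bundle the feature scaling and the target log transform with the model is sketched below, so that `predict` and `score` work directly in price units. This is an alternative arrangement, not the notebook's method: it does not standardise the log target, and the synthetic data and `alpha=1e-3` are assumptions.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic positive "prices" whose logarithm is linear in two of the five features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
log_price = 0.5 * X_demo[:, 0] - 0.3 * X_demo[:, 1] + 3.0 + rng.normal(scale=0.05, size=300)
y_demo = np.exp(log_price)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Features are scaled inside the pipeline; the target is logged for fitting and exponentiated on predict
model = TransformedTargetRegressor(
    regressor=Pipeline([("scale", StandardScaler()), ("lasso", Lasso(alpha=1e-3))]),
    func=np.log,
    inverse_func=np.exp,
)
model.fit(X_tr, y_tr)
pred_prices = model.predict(X_te)          # predictions in price units
r2_price_scale = model.score(X_te, y_te)   # R-squared measured on the price scale
```

Since the scaler is fitted inside `fit`, it only ever sees training data, which keeps the leakage safeguard from the TO-DO list without manual bookkeeping.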


In [9]:
## Simple Linear regression - fit and transform train data set
## Cross validation and Hyperparameter tuning using RandomizedSearchCV

# Define the parameter grid
param_grid = {'fit_intercept': [True, False]}

# Create a linear regression model
linear = LinearRegression()

# Create RandomizedSearchCV object
# The grid only has two candidates, so n_iter is set to 2
random_search_compare = RandomizedSearchCV(estimator=linear, 
                                   param_distributions=param_grid, 
                                   n_iter=2, 
                                   cv=5, 
                                   random_state=42)

# Fit on the training data to run the randomized search
random_search_compare.fit(X_train_scaled, y_train_scaled)

# Best fit_intercept setting
print("Best parameter:", random_search_compare.best_params_['fit_intercept'])

# Best mean cross-validated R-squared score
print("Best R-squared score:", round(random_search_compare.best_score_, 3))

# Check that the number of coefficients matches the number of feature columns
assert random_search_compare.n_features_in_ == len(feature_list.columns)

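print("Coefficients of the selected features in the best Linear model:")
# y_train_scaled is 2-D, so LinearRegression stores coef_ with shape (1, n_features); flatten it before pairing with feature names
for feature, coefficient in zip(feature_list.columns, random_search_compare.best_estimator_.coef_.ravel()):
    print(f"{feature}: {np.round(coefficient,3)}")

Best parameter: False
Best R-squared score: 0.924
Coefficients of the selected features in the best Linear model:
RM02/0002: 0.467
Electricity_1: 0.393
PNGASEUUSDM_1: 0.01
Electricity_2: 0.299
PNGASEUUSDM_2: 0.251
Electricity_3: -0.03
PNGASEUUSDM_3: -0.308
Electricity_4: 0.055
PNGASEUUSDM_4: 0.141
Electricity_5: 0.024
PNGASEUUSDM_5: -0.166
Electricity_6: 0.01
PNGASEUUSDM_6: 0.066
Electricity_7: -0.053
PNGASEUUSDM_7: -0.042
Electricity_8: 0.128
PNGASEUUSDM_8: 0.306
Electricity_9: 0.2
PNGASEUUSDM_9: -0.111
Electricity_10: 0.259
PNGASEUUSDM_10: 0.054
Electricity_11: -0.02
PNGASEUUSDM_11: -0.225
Electricity_12: -0.131
PNGASEUUSDM_12: -0.13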


In [10]:
## Simple Linear regression - transform test data set
# Get the best linear model from RandomizedSearchCV
best_linear_model = random_search_compare.best_estimator_

# Predict on the test data
y_pred_test = best_linear_model.predict(X_test_scaled)

# Evaluate the model performance on the test data
test_score = best_linear_model.score(X_test_scaled, y_test_scaled)
print("Best Model:", best_linear_model)
print("Test Set R-squared score:", round(test_score, 3))



Best Model: LinearRegression(fit_intercept=False)
Test Set R-squared score: 0.923
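
For the open "Visualisation" item in the TO-DO list, one common choice is a predicted-versus-actual scatter plot with a diagonal reference line. The sketch below uses random demo data. The function name `plot_predicted_vs_actual` and the axis labels are assumptions, and in the notebook the arguments would be `y_test_scaled` and a model's predictions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

def plot_predicted_vs_actual(y_true, y_pred, title="Predicted vs actual"):
    """Scatter predictions against actual values with a y = x reference line."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.scatter(y_true, y_pred, s=10, alpha=0.6)
    # Points on the dashed diagonal are perfect predictions
    lo = min(y_true.min(), y_pred.min())
    hi = max(y_true.max(), y_pred.max())
    ax.plot([lo, hi], [lo, hi], linestyle="--", color="grey")
    ax.set_xlabel("Actual (scaled log price)")
    ax.set_ylabel("Predicted (scaled log price)")
    ax.set_title(title)
    return fig

# Demo with synthetic values standing in for test targets and predictions
rng = np.random.default_rng(1)
actual = rng.normal(size=100)
predicted = actual + rng.normal(scale=0.2, size=100)
fig = plot_predicted_vs_actual(actual, predicted)
```

Calling the function once per model with different titles would give a side-by-side view of the Lasso and linear fits.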
