# Requirements

In [35]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
import seaborn as sns
import missingno as msno
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from xgboost import XGBRegressor
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Laboratory Exercise - Run Mode (8 points)

## Introduction
This laboratory assignment focuses on time series forecasting: predicting the current day's **mean temperature** in the city of Delhi. Your task is to forecast the **mean temperature** with bagging and boosting methods. As input, use data from the preceding three days, consisting of **mean temperature**, **humidity**, **wind speed**, and **mean pressure**.

**Note: You are required to perform this laboratory assignment on your local machine.**

## The Climate Dataset

## Downloading the Climate Dataset

## Exploring the Climate Dataset
This dataset consists of daily weather records for the city of Delhi spanning a period of 4 years (from 2013 to 2017). The dataset includes the following attributes:

- date - date in the format YYYY-MM-DD,
- meantemp - mean temperature averaged from multiple 3-hour intervals in a day,
- humidity - humidity value for the day (measured in grams of water vapor per cubic meter volume of air),
- wind_speed - wind speed measured in kilometers per hour, and
- meanpressure - pressure reading of the weather (measured in atm).

*Note: The dataset is complete, with no missing values in any of its entries.*

Load the dataset into a `pandas` data frame.

In [121]:
climate_data = pd.read_csv('climate-data.csv')
climate_data.head()

Unnamed: 0,date,meantemp,humidity,wind_speed,meanpressure
0,2013-01-01,10.0,84.5,0.0,1015.666667
1,2013-01-02,7.4,92.0,2.98,1017.8
2,2013-01-03,7.166667,87.0,4.633333,1018.666667
3,2013-01-04,8.666667,71.333333,1.233333,1017.166667
4,2013-01-05,6.0,86.833333,3.7,1016.5


Explore the dataset using visualizations of your choice.

In [None]:
# Write your code here. Add as many boxes as you need.
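
One possible starting point for the exploration is a stacked line plot with one panel per attribute. The sketch below runs on a hypothetical stand-in frame (`demo`, filled with random numbers) that only borrows the column names of the climate dataset; in the notebook you would pass the loaded `climate_data` frame instead, ideally after converting `date` with `pd.to_datetime`.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in with the same column names as the climate dataset.
# The values are random and carry no meteorological meaning.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "date": pd.date_range("2013-01-01", periods=60, freq="D"),
    "meantemp": rng.normal(15, 3, 60),
    "humidity": rng.normal(70, 10, 60),
    "wind_speed": rng.normal(6, 2, 60),
    "meanpressure": rng.normal(1015, 2, 60),
})

features = ["meantemp", "humidity", "wind_speed", "meanpressure"]

# One panel per attribute, sharing the time axis so trends line up vertically.
fig, axes = plt.subplots(len(features), 1, figsize=(10, 8), sharex=True)
for ax, column in zip(axes, features):
    ax.plot(demo["date"], demo[column])
    ax.set_ylabel(column)
axes[-1].set_xlabel("date")
fig.tight_layout()
```

Sharing the x-axis makes it easier to spot seasonal patterns that the attributes have in common, as well as isolated outliers in a single attribute.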

## Feature Extraction
Apply a lag of one, two, and three days to each feature, creating a set of features representing the meteorological conditions from the previous three days. To maintain dataset integrity, eliminate any resulting missing values at the beginning of the dataset.

Hint: Use `df['column_name'].shift(period)`. Check the documentation at https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shift.html.
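
To see what `shift` does before applying it to the full dataset, here is a minimal sketch on a hypothetical five-row frame (the name `toy` and its values are made up for illustration). Each lag moves the column down by that many rows, so the first three rows end up with at least one `NaN` and are removed by `dropna`.

```python
import pandas as pd

# Hypothetical toy data, not taken from the climate dataset.
toy = pd.DataFrame({"meantemp": [10.0, 7.4, 7.2, 8.7, 6.0]})

# Lag 1..3: row t receives the value observed at t-1, t-2, and t-3.
for lag in range(1, 4):
    toy[f"meantemp_lag{lag}"] = toy["meantemp"].shift(lag)

# Rows 0-2 lack at least one lagged value, so only rows 3 and 4 survive.
toy_clean = toy.dropna().reset_index(drop=True)
print(toy_clean)
```

With three lags, exactly three rows are lost at the start of the series, regardless of how many columns are lagged.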

In [123]:
for lag in range(1, 4):
    climate_data[f"meantemp_lag{lag}"] = climate_data["meantemp"].shift(lag)
    climate_data[f"humidity_lag{lag}"] = climate_data["humidity"].shift(lag)
    climate_data[f"wind_speed_lag{lag}"] = climate_data["wind_speed"].shift(lag)
    climate_data[f"meanpressure_lag{lag}"] = climate_data["meanpressure"].shift(lag)

In [87]:
climate_data = climate_data.dropna().reset_index(drop=True)

In [125]:
# Drop the target and the non-numeric date column from the feature matrix.
X = climate_data.drop(columns=["date", "meantemp"])
Y = climate_data["meantemp"]

In [127]:
X

Unnamed: 0,humidity,wind_speed,meanpressure,meantemp_lag1,humidity_lag1,wind_speed_lag1,meanpressure_lag1,meantemp_lag2,humidity_lag2,wind_speed_lag2,meanpressure_lag2,meantemp_lag3,humidity_lag3,wind_speed_lag3,meanpressure_lag3
0,84.500000,0.000000,1015.666667,,,,,,,,,,,,
1,92.000000,2.980000,1017.800000,10.000000,84.500000,0.000000,1015.666667,,,,,,,,
2,87.000000,4.633333,1018.666667,7.400000,92.000000,2.980000,1017.800000,10.000000,84.500000,0.000000,1015.666667,,,,
3,71.333333,1.233333,1017.166667,7.166667,87.000000,4.633333,1018.666667,7.400000,92.000000,2.980000,1017.800000,10.000000,84.500000,0.000000,1015.666667
4,86.833333,3.700000,1016.500000,8.666667,71.333333,1.233333,1017.166667,7.166667,87.000000,4.633333,1018.666667,7.400000,92.000000,2.980000,1017.800000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1457,68.043478,3.547826,1015.565217,16.850000,67.550000,8.335000,1017.200000,17.142857,74.857143,8.784211,1016.952381,14.000000,94.300000,9.085000,1014.350000
1458,87.857143,6.000000,1016.904762,17.217391,68.043478,3.547826,1015.565217,16.850000,67.550000,8.335000,1017.200000,17.142857,74.857143,8.784211,1016.952381
1459,89.666667,6.266667,1017.904762,15.238095,87.857143,6.000000,1016.904762,17.217391,68.043478,3.547826,1015.565217,16.850000,67.550000,8.335000,1017.200000
1460,87.000000,7.325000,1016.100000,14.095238,89.666667,6.266667,1017.904762,15.238095,87.857143,6.000000,1016.904762,17.217391,68.043478,3.547826,1015.565217


## Dataset Splitting
Partition the dataset into training and testing sets with an 80:20 ratio.

**WARNING: DO NOT SHUFFLE THE DATASET.**



In [93]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, shuffle=False)
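
The reason for the warning is that a shuffled split would let the model train on days that come after the days it is tested on. As a small illustration on a hypothetical array of ten consecutive day indices (`days` is not part of the dataset), `shuffle=False` reduces the split to a chronological cut:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical ordered index standing in for 10 consecutive days.
days = np.arange(10)

train_days, test_days = train_test_split(days, test_size=0.2, shuffle=False)

# Without shuffling, the first 80% of the rows form the training set
# and the last 20% form the test set.
print("train:", train_days.tolist())
print("test:", test_days.tolist())
```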

## Ensemble Learning Methods

### Bagging

Create an instance of a Random Forest model and train it using the `fit` function.

In [95]:
rf_model = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42)
rf_model.fit(x_train, y_train)

Use the trained model to make predictions for the test set.

In [97]:
y_pred = rf_model.predict(x_test)

Assess the performance of the model by using different metrics provided by the `scikit-learn` library.

In [99]:
print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred))
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R2 Score:", r2_score(y_test, y_pred))

Mean Absolute Error: 1.1419082226518071
Mean Squared Error: 2.146107518422741
R2 Score: 0.9330001471544205
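
To make the three numbers easier to interpret, the sketch below recomputes each metric by hand with NumPy on hypothetical targets and predictions (`y_true` and `y_hat` are invented to keep the arithmetic simple) and compares the results with the `scikit-learn` functions. The root mean squared error is added only because it is in the same unit as the target.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical targets and predictions, chosen only for easy arithmetic.
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_hat = np.array([11.0, 12.0, 13.0, 18.0])

errors = y_true - y_hat                          # [-1, 0, 1, -2]
mae = np.mean(np.abs(errors))                    # (1 + 0 + 1 + 2) / 4
mse = np.mean(errors ** 2)                       # (1 + 0 + 1 + 4) / 4
rmse = np.sqrt(mse)                              # same unit as the target
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot                       # share of variance explained

print("MAE:", mae, "sklearn:", mean_absolute_error(y_true, y_hat))
print("MSE:", mse, "sklearn:", mean_squared_error(y_true, y_hat))
print("RMSE:", rmse)
print("R2:", r2, "sklearn:", r2_score(y_true, y_hat))
```

Read this way, an MAE of about 1.14 on the test set means the forecast is off by roughly 1.14 temperature units on average.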


### Boosting

Create an instance of an XGBoost model and train it using the `fit` function.

In [101]:
xgb_model = XGBRegressor(n_estimators=100, max_depth=5, learning_rate=0.1, random_state=42, objective="reg:squarederror")
xgb_model.fit(x_train, y_train)

Use the trained model to make predictions for the test set.

In [103]:
y_pred2 = xgb_model.predict(x_test)

Assess the performance of the model by using different metrics provided by the `scikit-learn` library.

In [105]:
print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred2))
print("Mean Squared Error:", mean_squared_error(y_test, y_pred2))
print("R2 Score:", r2_score(y_test, y_pred2))

Mean Absolute Error: 1.1450612598227679
Mean Squared Error: 2.102660654654704
R2 Score: 0.9343565253666359


# Laboratory Exercise - Bonus Task (+ 2 points)

As the bonus task in this laboratory assignment, fine-tune the number of estimators (`n_estimators`) of the XGBoost model using grid search with time series cross-validation. This involves systematically trying several values for `n_estimators` and scoring each one with cross-validation. Once you have determined the most suitable `n_estimators` value, evaluate the model on the test set for the final assessment.

Hints:
- For grid search use the `GridSearchCV` from the `scikit-learn` library. Check the documentation at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html.
- For cross-validation use the `TimeSeriesSplit` from the `scikit-learn` library. Check the documentation at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html.
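
As a quick illustration of how `TimeSeriesSplit` differs from ordinary k-fold splitting, the sketch below prints the folds for a hypothetical series of 12 ordered observations (`samples` is a made-up index array). Every validation block lies strictly after its training block, and the training window grows from fold to fold.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical series of 12 ordered observations.
samples = np.arange(12)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(samples))

for fold, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {fold}: train={train_idx.tolist()} validate={val_idx.tolist()}")
```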

## Dataset Splitting
Partition the dataset into training and testing sets with a 90:10 ratio.

**WARNING: DO NOT SHUFFLE THE DATASET.**

In [107]:
x_train2, x_test2, y_train2, y_test2 = train_test_split(X, Y, test_size=0.1, shuffle=False)

## Fine-tuning the XGBoost Hyperparameter
Experiment with various values for `n_estimators` and evaluate the model's performance using cross-validation.

In [109]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedKFold

In [111]:
param_grid = {
    'max_depth': [3, 5, 10, None],
    'learning_rate': [0.01, 0.1],
    'n_estimators': [50, 100, 200],
    'min_child_weight': [1, 3, 5],
}

cv = RepeatedKFold(n_splits=10, n_repeats=1, random_state=42)

grid_search = GridSearchCV(
    estimator=xgb_model,
    param_grid=param_grid,
    scoring='neg_mean_absolute_error',  
    cv=cv,
    verbose=1,
    n_jobs=-1
)

grid_search.fit(x_train2, y_train2)

# Print the best hyperparameters and cross-validation score
print("Best Hyperparameters:", grid_search.best_params_)
print("Best Cross-Validation Score:", -grid_search.best_score_)


Fitting 10 folds for each of 72 candidates, totalling 720 fits
Best Hyperparameters: {'learning_rate': 0.1, 'max_depth': 3, 'min_child_weight': 1, 'n_estimators': 200}
Best Cross-Validation Score: 0.9529234095984149
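
The cell above cross-validates with `RepeatedKFold`, which assigns rows to folds at random, so some validation rows precede the rows used for training. The task statement asks for a time series split instead. A minimal sketch of that variant follows, under two assumptions: the data is a hypothetical noisy sine wave with three lag features (`X_demo`, `y_demo`), and `GradientBoostingRegressor` from `scikit-learn` stands in for `XGBRegressor` so the example runs without `xgboost` installed. Since `XGBRegressor` follows the scikit-learn estimator interface, passing `xgb_model` as the estimator should work the same way.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Hypothetical ordered data: a noisy sine wave predicted from its
# three previous values (lags 1, 2, and 3).
rng = np.random.default_rng(42)
series = np.sin(np.arange(203) / 10.0) + rng.normal(scale=0.05, size=203)
X_demo = np.column_stack([series[2:-1], series[1:-2], series[:-3]])
y_demo = series[3:]

# Only n_estimators is tuned, as the bonus task requests.
search = GridSearchCV(
    estimator=GradientBoostingRegressor(random_state=42),  # stand-in for XGBRegressor
    param_grid={"n_estimators": [20, 50]},
    scoring="neg_mean_absolute_error",
    cv=TimeSeriesSplit(n_splits=3),  # validation folds always follow training folds
)
search.fit(X_demo, y_demo)

print("Best n_estimators:", search.best_params_["n_estimators"])
print("Cross-validated MAE:", -search.best_score_)
```

Scores obtained with a time series split are usually less optimistic than shuffled k-fold scores, because the model is never validated on days that fall between its training days.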


## Final Assessment of the Model Performance
Upon determining the most suitable `n_estimators` value, evaluate the model's performance on a test set for final assessment.

In [117]:
best_model = grid_search.best_estimator_
y_pred3 = best_model.predict(x_test2)

In [119]:
print("Mean Absolute Error:", mean_absolute_error(y_test2, y_pred3))
print("Mean Squared Error:", mean_squared_error(y_test2, y_pred3))
print("R2 Score:", r2_score(y_test2, y_pred3))

Mean Absolute Error: 0.8977508971711522
Mean Squared Error: 1.391355688300696
R2 Score: 0.9572718133689079
