# Task 2
This template guides you through the implementation of this task. Read the whole template first to get a sense of the overall code structure before filling in any of the TODO gaps.
This is the Jupyter notebook version of the template. For the Python file version, see `template_solution.py`.

First, we import necessary libraries:

In [8]:
import numpy as np
import pandas as pd
# Add any other imports you need here
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.metrics import r2_score
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Data Loading
TODO: Perform data preprocessing and imputation, then extract X_train, y_train and X_test
(and potentially change the initialization of variables to accommodate how you deal with non-numeric data)
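
As a minimal sketch of this preprocessing step, the example below mean-imputes numeric columns and one-hot encodes a categorical column with scikit-learn's `ColumnTransformer`. The toy frames, their values and the names `toy_train`, `toy_test` and `toy_preproc` are made up for illustration; only the column layout (a `season` string column plus numeric price columns) follows the data in this task. Setting `sparse_threshold=0.0` is an assumption made here so that the result is always a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data mimicking the layout of train.csv (values are made up)
toy_train = pd.DataFrame({
    "season": ["spring", "summer", "autumn", "winter"],
    "price_GER": [1.0, np.nan, 3.0, 2.0],
    "price_FRA": [np.nan, 4.0, 6.0, 8.0],
})
toy_test = pd.DataFrame({
    "season": ["summer", "spring"],
    "price_GER": [np.nan, 5.0],
    "price_FRA": [1.0, np.nan],
})

# Split the columns by type: numeric ones are imputed, the rest are encoded
num_cols = toy_train.select_dtypes(include="number").columns.tolist()
cat_cols = toy_train.select_dtypes(exclude="number").columns.tolist()

toy_preproc = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="mean"), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ],
    sparse_threshold=0.0,  # assumption: force a dense NumPy array as output
)

# Column means and categories are learned from the training data only,
# then reused unchanged on the test data
toy_train_out = toy_preproc.fit_transform(toy_train)
toy_test_out = toy_preproc.transform(toy_test)

print(toy_train_out.shape, toy_test_out.shape)
```

Fitting on the training frame and only calling `transform` on the test frame keeps test-set statistics out of the imputation, which is the same pattern used in the cell below.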

In [9]:
"""
This loads the training and test data, preprocesses it, and fills in the
missing values using imputation

Parameters
----------
Compute
----------
X_train: matrix of floats, training input with features
y_train: array of floats, training output with labels
X_test: matrix of floats: dim = (100, ?), test input with features
"""
# Load training data
train_df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/IML FS24/task2/train.csv")

print("Training data:")
print("Shape:", train_df.shape)
print(train_df.head(2))
print('\n')

# Load test data
test_df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/IML FS24/task2/test.csv")

print("Test data:")
print(test_df.shape)
print(test_df.head(2))

# Dummy initialization of the X_train, X_test and y_train
# TODO: Depending on how you deal with the non-numeric data, you may want to
# modify/ignore the initialization of these variables
#X_train = np.zeros_like(train_df.drop(['price_CHF'],axis=1))
#y_train = np.zeros_like(train_df['price_CHF'])
#X_test = np.zeros_like(test_df)
X_train = train_df.drop(['price_CHF'],axis=1)
y_train = train_df['price_CHF']
X_test = test_df
# TODO: Perform data preprocessing, imputation and extract X_train, y_train and X_test
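# Split the columns by type: numeric columns are imputed, categorical ones are encoded
num_features = X_train.select_dtypes(include=['int64', 'float64']).columns
cat_features = X_train.select_dtypes(include=['object']).columns

# Replace missing numeric values with the column mean
num_trans = SimpleImputer(strategy='mean')
# Categories unseen during fitting are encoded as all zeros instead of raising an error
cat_trans = OneHotEncoder(handle_unknown='ignore')

preproc = ColumnTransformer(transformers=[('num', num_trans, num_features), ('cat', cat_trans, cat_features)])

# Fit on the training data only, then apply the same transformation to the test data
X_train_preproc = preproc.fit_transform(X_train)
X_test_preproc = preproc.transform(X_test)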

assert (X_train.shape[1] == X_test.shape[1]) and (X_train.shape[0] == y_train.shape[0]) and (X_test.shape[0] == 100), "Invalid data shape"

Training data:
Shape: (900, 11)
   season  price_AUS  price_CHF  price_CZE  price_GER  price_ESP  price_FRA  \
0  spring        NaN   9.644028  -1.686248  -1.748076  -3.666005        NaN   
1  summer        NaN   7.246061  -2.132377  -2.054363  -3.295697  -4.104759   

   price_UK  price_ITA  price_POL  price_SVK  
0 -1.822720  -3.931031        NaN  -3.238197  
1 -1.826021        NaN        NaN  -3.212894  


Test data:
(100, 10)
   season  price_AUS  price_CZE  price_GER  price_ESP  price_FRA  price_UK  \
0  spring        NaN   0.472985   0.707957        NaN  -1.136441 -0.596703   
1  summer  -1.184837   0.358019        NaN  -3.199028  -1.069695       NaN   

   price_ITA  price_POL  price_SVK  
0        NaN   3.298693   1.921886  
1  -1.420091   3.238307        NaN  


# Modeling and Prediction
TODO: Define the model and fit it using training data. Then, use test data to make predictions
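
One way to approach this step is to tune several candidate models with `GridSearchCV` and keep the one with the highest cross-validated score. The sketch below illustrates that on synthetic data. The data, the `Ridge` candidate and the names `candidates`, `searches` and `best_name` are assumptions for illustration and are not part of the template. Note that the candidates are ranked by `best_score_`, since estimator objects themselves cannot be ordered.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (made up for illustration)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = X_toy @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=60)

# Each candidate pairs an estimator with its hyperparameter grid
candidates = {
    "Ridge": (Ridge(), {"alpha": [0.1, 1.0, 10.0]}),
    "RandomForestRegressor": (
        RandomForestRegressor(random_state=0),
        {"n_estimators": [10, 20], "max_depth": [None, 5]},
    ),
}

# Run one grid search per candidate and keep the fitted search objects
searches = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=3, scoring="r2")
    search.fit(X_toy, y_toy)
    searches[name] = search

# Rank candidates by their cross-validated R^2 score
best_name = max(searches, key=lambda n: searches[n].best_score_)
best_estimator = searches[best_name].best_estimator_

# best_estimator_ is already refit on all of the data passed to fit()
toy_pred = best_estimator.predict(X_toy[:5])
print(best_name, toy_pred.shape)
```

Because `GridSearchCV` refits the best configuration on the full training set by default, `best_estimator_` can be used for prediction directly.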

In [17]:
"""
This defines the model, fits training data and then does the prediction
with the test data

Parameters
----------
X_train: matrix of floats, training input with 10 features
y_train: array of floats, training output
X_test: matrix of floats: dim = (100, ?), test input with 10 features

Compute
----------
y_pred: array of floats: dim = (100,), predictions on test set
"""

y_pred=np.zeros(X_test.shape[0])
#TODO: Define the model and fit it using training data. Then, use test data to make predictions
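# Fill missing target values with the mean of the observed targets
y_imp = SimpleImputer(strategy='mean')
y_train_imp = y_imp.fit_transform(y_train.values.reshape(-1,1)).ravel()
models = {'RandomForestRegressor': RandomForestRegressor()}
param_grids = {'RandomForestRegressor': {'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20]}}

# Tune each candidate with 5-fold cross-validation; keep its best estimator and R^2 score
best_models = {}
best_scores = {}
for name, model in models.items():
  grid_search = GridSearchCV(model, param_grids[name], cv=5, scoring='r2')
  grid_search.fit(X_train_preproc, y_train_imp)
  best_models[name] = grid_search.best_estimator_
  best_scores[name] = grid_search.best_score_

# Select by cross-validated score; estimator objects themselves cannot be compared
best_model_name = max(best_scores, key=best_scores.get)
best_model = best_models[best_model_name]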

y_pred = best_model.predict(X_test_preproc)

assert y_pred.shape == (100,), "Invalid data shape"

# Saving Results
You don't have to change this

In [18]:
dt = pd.DataFrame(y_pred)
dt.columns = ['price_CHF']
dt.to_csv('/content/drive/MyDrive/Colab Notebooks/IML FS24/task2/results_task2_2.csv', index=False)
print("\nResults file successfully generated!")


Results file successfully generated!
