# Introduction and Prerequisites
This notebook is a tutorial on applying the ElasticNet model from the QSAR package to a specific dataset. Before you start:
- Basic familiarity with Python, pandas, and scikit-learn is helpful.
- The main aim is to understand the workflow, not just to achieve a high score.

In [None]:
# TODO: add installation instructions
# Option 1: !python setup.py install
# Option 2: !pip install -r requirements.txt

# Table of Contents
1. Data Import
2. Preprocessing
3. Cross-validation
4. Model evaluation before hyperparameter optimization
5. Hyperparameter optimization
6. Model evaluation after hyperparameter optimization
7. Prediction and results

In [None]:
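import warnings

# Suppress all warnings to keep the notebook output readable.
# Note that this also hides potentially useful ones, such as convergence warnings.
warnings.filterwarnings('ignore')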

import pandas as pd
from optuna.visualization import plot_optimization_history

from qsar.utils.visualizer import Visualizer
from qsar.utils.cross_validator import CrossValidator
from qsar.utils.extractor import Extractor
from qsar.utils.hyperparameter_optimizer import HyperParameterOptimizer

from qsar.models.elasticnet import ElasticnetModel
from sklearn.preprocessing import StandardScaler

## 1. Data Import
Here we load the datasets and prepare them for modeling.
We define paths to the full, neutral, and ionizable datasets, each with a training file and a test file.
We then initialize the extractor and split each dataset into features (X) and the target variable (y), using the "Log_MP_RATIO" column as the target.
Finally, we retrieve the full training dataset, which is used later for cross-validation and hyperparameter optimization.

In [None]:
data_paths = {
    "full_train": "../data/full/train/full_train_unfiltered.csv",
    "full_test": "../data/full/test/full_test_unfiltered.csv",
    "neutral_train": "../data/neutral/train/neutral_train_unfiltered.csv",
    "neutral_test": "../data/neutral/test/neutral_test_unfiltered.csv",
    "ionizable_train": "../data/ionizable/train/ionizable_train_unfiltered.csv",
    "ionizable_test": "../data/ionizable/test/ionizable_test_unfiltered.csv",
}

extractor = Extractor(data_paths)

# TODO: move the split after the preprocessing
x_dfs, y_dfs = extractor.split_x_y("Log_MP_RATIO")
df_full_tain = extractor.get_df("full_train")
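
The internals of `Extractor.split_x_y` are not shown in this notebook. As a rough sketch of what a feature/target split typically amounts to in plain pandas (the toy DataFrame and its descriptor columns `MolWt` and `LogP` are made up for illustration and are not taken from the real data):

```python
import pandas as pd

# Toy stand-in for one of the CSV files; the descriptor columns are hypothetical.
toy_df = pd.DataFrame({
    "MolWt": [180.2, 94.1, 151.2],
    "LogP": [1.2, 1.5, 0.5],
    "Log_MP_RATIO": [-0.3, 0.1, -1.2],
})

target = "Log_MP_RATIO"
X_toy = toy_df.drop(columns=[target])  # every column except the target
y_toy = toy_df[target]                 # the target column as a Series

print(X_toy.shape, y_toy.shape)
```

The package's own method presumably does the equivalent for every dataset listed in `data_paths`, returning dictionaries keyed by dataset name, as the later `x_dfs["full_train"]` lookups suggest.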

## 2. Preprocessing
Scaling the features to mean 0 and variance 1 can help the model converge during training. The scaling code below is currently commented out, so the features are used unscaled.

In [None]:
# TODO: include and review all the preprocessing steps
# TODO: check if we need scaling to make the model converge
# scaler = StandardScaler()
# x_full_train_scaled = scaler.fit_transform(x_dfs["full_train"])
# x_dfs["full_train"] = pd.DataFrame(x_full_train_scaled, index=x_dfs["full_train"].index, columns=x_dfs["full_train"].columns)
# x_full_test_scaled = scaler.transform(x_dfs["full_test"])
# x_dfs["full_test"] = pd.DataFrame(x_full_test_scaled, index=x_dfs["full_test"].index, columns=x_dfs["full_test"].columns)
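
The commented-out cell above follows the usual fit-on-train, transform-on-test pattern. A minimal, self-contained sketch of that pattern on toy data may make it easier to see (the two small DataFrames are hypothetical stand-ins for `x_dfs["full_train"]` and `x_dfs["full_test"]`):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features standing in for the real train and test sets.
x_train = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]})
x_test = pd.DataFrame({"a": [2.5, 5.0], "b": [25.0, 50.0]}, index=[10, 11])

scaler = StandardScaler()
# Fit on the training data only, so no test-set statistics leak into the scaling.
x_train_scaled = pd.DataFrame(
    scaler.fit_transform(x_train), index=x_train.index, columns=x_train.columns
)
# Reuse the training mean and standard deviation for the test data.
x_test_scaled = pd.DataFrame(
    scaler.transform(x_test), index=x_test.index, columns=x_test.columns
)

print(x_train_scaled.mean().tolist())        # close to 0 for each column
print(x_train_scaled.std(ddof=0).tolist())   # close to 1 for each column
```

Wrapping the result back into a DataFrame keeps the original index and column names, which `fit_transform` would otherwise drop by returning a NumPy array.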

## 3. Cross-validation
Initialize the cross-validation and visualization tools, create cross-validation folds from the full training dataset, and visualize the resulting folds.

In [None]:
cross_validator = CrossValidator(df_full_tain)
visualizer = Visualizer()
X_list, y_list, df, y, n_folds = cross_validator.create_cv_folds()
visualizer.display_cv_folds(df, y, n_folds)
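
How `create_cv_folds` builds its folds is not shown here. For readers new to the idea, the following sketch uses scikit-learn's `KFold` on toy data; plain shuffled K-fold with five splits is an assumption made for illustration and may differ from the strategy the package uses:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features
y = np.arange(10, dtype=float)

# Assumed settings: 5 shuffled folds with a fixed seed for reproducibility.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(X)):
    # Each sample lands in exactly one validation fold.
    fold_sizes.append((len(train_idx), len(val_idx)))
    print(f"fold {fold}: train={len(train_idx)} val={len(val_idx)}")
```

Each of the five iterations trains on eight samples and validates on the remaining two.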

## 4. Model evaluation before hyperparameter optimization
Evaluate the ElasticNet model with its default hyperparameters, using the full training and test sets, and visualize its performance metrics.

In [None]:
elasticnet_model = ElasticnetModel()
metrics = cross_validator.evaluate_model_performance(
    elasticnet_model.model,
    x_dfs["full_train"], y_dfs["full_train"],
    x_dfs["full_test"], y_dfs["full_test"],
)
visualizer.display_model_performance("ElasticNet", metrics)
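
Which metrics `evaluate_model_performance` reports is not visible in this notebook. As a sketch of what a baseline evaluation can look like, the example below fits scikit-learn's `ElasticNet` with default settings on synthetic data and computes R2 and RMSE on both splits. It assumes that `ElasticnetModel` wraps a scikit-learn style estimator; the synthetic data and the choice of metrics are illustrative only:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the real descriptors and target.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

baseline = ElasticNet()  # scikit-learn defaults: alpha=1.0, l1_ratio=0.5
baseline.fit(X_train, y_train)

scores = {}
for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = baseline.predict(X_part)
    rmse = float(np.sqrt(mean_squared_error(y_part, pred)))
    scores[name] = (r2_score(y_part, pred), rmse)
    print(f"{name}: R2={scores[name][0]:.3f} RMSE={rmse:.3f}")
```

Comparing the train and test rows gives a first impression of over- or underfitting before any tuning is done.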

## 5. Hyperparameter optimization
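Initialize a fresh ElasticnetModel, then use Optuna to search for hyperparameters on the full training dataset over 100 trials, maximizing the objective.
After the optimization, print the best objective value and the corresponding hyperparameters.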

In [None]:
elasticnet_model = ElasticnetModel()

optimizer = HyperParameterOptimizer(model=elasticnet_model, data=df_full_tain, direction='maximize', trials=100)

study = optimizer.optimize()
trial = study.best_trial
print(trial.value, trial.params)

## 6. Model evaluation after hyperparameter optimization
Evaluate the ElasticNet model with the best hyperparameters, using the full training and test sets.
Then visualize its performance metrics.

In [None]:
elasticnet_model.set_hyperparameters(**study.best_params)
metrics = cross_validator.evaluate_model_performance(
    elasticnet_model.model, x_dfs["full_train"], y_dfs["full_train"], x_dfs["full_test"], y_dfs["full_test"])
visualizer.display_model_performance("ElasticNet", metrics)

Display the optimization history of the Optuna study, which shows how the objective value evolved over the trials.

In [None]:
display(plot_optimization_history(study))

## 7. Prediction and results
Predict with the optimized ElasticNet model and visualize the comparison between predicted and actual values.

In [None]:
y_full_train_pred, y_full_test_pred = cross_validator.get_predictions(elasticnet_model.model, x_dfs["full_train"],
                                                                      y_dfs["full_train"], x_dfs["full_test"])
visualizer.display_true_vs_predicted("ElasticNet", y_dfs["full_train"], y_dfs["full_test"], y_full_train_pred,
                                     y_full_test_pred)