# Regression Project with Various Models

This notebook fits several regression models (SVR, linear regression, random forests, gradient boosting, etc.) to predict the target variable `pSat_Pa` on a log10 scale.  
We will:  
- Install and import the necessary libraries  
- Load and preprocess the data (encoding, standardization, etc.)  
- Separate features and the target variable (with logarithmic transformation)  
- Train and evaluate different models  
- Visualize the results (histograms, comparison tables)  
- Experiment with a subset of features  
- Implement an ensemble model for the final predictions


In [None]:
!pip install pandas numpy scikit-learn matplotlib xgboost


## Importing Libraries

Here we import all the necessary libraries for data manipulation, modeling, and visualization.


In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

# Models and validation tools
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score, train_test_split, KFold
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.metrics import r2_score, mean_squared_error
import xgboost as xgb


## 1. Loading and Preprocessing Data

We start by loading the training and test datasets.  
- The `Id` column is removed from both the training and test datasets.  
- The categorical variable `parentspecies` is encoded into dummy variables (one-hot encoding).  
- The columns are reindexed in alphabetical order to ensure consistency.
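
As a minimal sketch of the encoding step, the toy frames below (hypothetical values; only the column names `MW` and `parentspecies` come from this notebook) show how one-hot encoding can leave the test set with fewer columns than the training set, and how `reindex` with `fill_value=0` restores a common layout.

```python
import pandas as pd

# Toy data (hypothetical values); the test set lacks the "toluene" category
toy_train = pd.DataFrame({"MW": [120.0, 150.0, 98.0],
                          "parentspecies": ["apin", "toluene", "decane"]})
toy_test = pd.DataFrame({"MW": [110.0, 130.0],
                         "parentspecies": ["apin", "decane"]})

# One-hot encode each set separately
train_enc = pd.get_dummies(toy_train, columns=["parentspecies"])
test_enc = pd.get_dummies(toy_test, columns=["parentspecies"])

# Sort the training columns, then give the test set exactly the same columns;
# categories absent from the test set become all-zero columns
train_enc = train_enc.reindex(sorted(train_enc.columns), axis=1)
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(test_enc)
```

Reindexing against the training columns also discards any dummy column that exists only in the test set, which a model fitted on the training layout could not use anyway.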


In [None]:
# Load datasets
training_set = pd.read_csv("train.csv")
test_set = pd.read_csv("test.csv")

# Remove 'Id' column from the training dataset
train = training_set.drop("Id", axis=1)

# Encode the categorical variable 'parentspecies'
train = pd.get_dummies(train, columns=["parentspecies"])
train = train.reindex(sorted(train.columns), axis=1)

# Prepare the test dataset: remove 'Id' and encode 'parentspecies'
testing = test_set.drop("Id", axis=1)
testing = pd.get_dummies(testing, columns=["parentspecies"])

# The test set is standardized in the next section, using the scaler fitted on the training features


## 2. Feature Separation and Standardization

We separate the features (`X`) and the target variable (`pSat_Pa`).  
The target is transformed using log10 to reduce the skew of its distribution.  
Then, we fit the scaler on the training features, give the test set the same columns as the training set, and standardize it with that same scaler.


In [None]:
# Separate features and target variable
x_train = train.drop("pSat_Pa", axis=1)
y_train = np.log10(train["pSat_Pa"])

# Standardize the training features
sc_trx = StandardScaler()
x_train = pd.DataFrame(sc_trx.fit_transform(x_train), columns=x_train.columns)

# Align the test columns with the training columns (same names, same order);
# dummy columns missing from the test set are filled with 0
testing = testing.reindex(columns=x_train.columns, fill_value=0)

# Standardize the test set with the scaler fitted on the training features
testing = pd.DataFrame(sc_trx.transform(testing), columns=testing.columns)

# Display the shapes of the datasets
print("x_train shape:", x_train.shape)
print("y_train shape:", y_train.shape)
print("testing shape:", testing.shape)


## 3. Preview of Preprocessed Data

We display the first few rows of the processed datasets to verify preprocessing.


In [None]:
display(x_train.head())
display(y_train.head())
display(testing.head())


## 4. Training the SVR Model

We train an SVR model (with an RBF kernel) on the standardized data.  
Next, we make predictions on the test set and save the results to a CSV file.
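
Because the model is fitted on `log10(pSat_Pa)`, its predictions are already on the log10 scale. The sketch below uses synthetic data (an assumption, not this notebook's dataset) to illustrate that, and shows how `10 ** preds` would recover values in the original units if they are needed.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic features and a strictly positive target spanning several decades
X_demo = rng.normal(size=(200, 3))
y_raw = 10.0 ** (X_demo @ np.array([1.0, -0.5, 0.25]) + 0.05 * rng.normal(size=200))

# Fit on the log10 of the target, as in this notebook
y_log = np.log10(y_raw)
X_scaled = StandardScaler().fit_transform(X_demo)
svr_demo = SVR(kernel="rbf").fit(X_scaled, y_log)

# Predictions come out on the log10 scale (they can be negative)
preds_log = svr_demo.predict(X_scaled)

# Back-transform only if values in the original units are required
preds_raw = 10.0 ** preds_log

print(preds_log[:3])
print(preds_raw[:3])
```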


In [None]:
# Train the SVR model
svr = SVR(kernel="rbf")
svr.fit(x_train, y_train)

# Make predictions on the test set
preds = svr.predict(testing)

# Prepare and save the results
# The model was trained on log10(pSat_Pa), so the predictions are already on the log10 scale
results = pd.DataFrame({
    'Id': test_set['Id'],
    'target': preds
})
results.to_csv("results.csv", index=False)
print(results.head())


## 5. Model Evaluation with Cross-Validation

We evaluate the SVR model using 2-fold cross-validation with the R² scoring metric.
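
A possible variation, sketched here on synthetic data: wrapping the scaler and the SVR in a pipeline lets `cross_val_score` refit the scaler inside each fold, and an explicit `KFold` with shuffling makes the split reproducible. The 5 folds and the seed are illustrative choices, not values used in this notebook.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic regression data standing in for the notebook's features and target
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + 0.1 * rng.normal(size=300)

# The scaler is refitted on the training part of every fold
model = make_pipeline(StandardScaler(), SVR(kernel="rbf"))

# Shuffled, seeded folds (5 folds is an illustrative choice)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores_demo = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")

print("R^2 per fold:", scores_demo)
print("Mean R^2:", scores_demo.mean())
```

Looking at the per-fold scores alongside the mean gives a rough sense of how stable the estimate is, which is hard to judge from two folds alone.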


In [None]:
scores = cross_val_score(svr, x_train, y_train, cv=2, scoring="r2")
print("Mean R^2 score from cross-validation:", scores.mean())


## 6. Visualization of Predictions

We create a histogram of the predictions, which are already on the log10 scale, to observe their spread.


In [None]:
plt.figure(figsize=(8, 5))
plt.hist(preds, bins=30, edgecolor='black')
plt.title('Histogram of Predictions (log10 scale)')
plt.xlabel('Predicted log10(pSat_Pa)')
plt.ylabel('Frequency')
plt.show()


## 7. Comparing Different Models

We compare the performance (R² score in cross-validation) of different models:  
- Linear Regression  
- Polynomial Regression (degree 3)  
- Random Forest  
- SVR  
- Gradient Boosting Regressor
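
The same comparison can be written as a loop over a dictionary of estimators, which avoids tracking list positions by hand. This is a sketch on synthetic data; the polynomial degree is lowered to 2 and the forest to 50 trees only to keep the toy example fast.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic data with a quadratic term, so the models behave differently
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + 0.1 * rng.normal(size=200)

models = {
    "Linear": LinearRegression(),
    # The pipeline expands the features inside each cross-validation fold
    "Polynomial": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
    "SVR": SVR(kernel="rbf"),
    "GradientBoostRegressor": GradientBoostingRegressor(random_state=0),
}

# One row per model: its name and its mean 2-fold cross-validated R^2
rows = [{"model": name,
         "CV_R2": cross_val_score(est, X_demo, y_demo, cv=2, scoring="r2").mean()}
        for name, est in models.items()]
comparison = pd.DataFrame(rows).sort_values("CV_R2", ascending=False)
print(comparison)
```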


In [None]:
# Dictionary to store scores
model_comparison = {
    "model": ["Linear", "Polynomial", "RandomForest", "SVR", "GradientBoostRegressor"],
    "CV_R2": [0] * 5
}

# Linear Regression
lin_reg = LinearRegression()
model_comparison["CV_R2"][0] = cross_val_score(lin_reg, x_train, y_train, cv=2, scoring="r2").mean()

# Polynomial Regression (degree 3)
poly_features = PolynomialFeatures(degree=3)
x_poly = poly_features.fit_transform(x_train)
lin_reg_poly = LinearRegression()
model_comparison["CV_R2"][1] = cross_val_score(lin_reg_poly, x_poly, y_train, cv=2, scoring="r2").mean()

# Random Forest
rf_reg = RandomForestRegressor(n_estimators=100, random_state=0)
model_comparison["CV_R2"][2] = cross_val_score(rf_reg, x_train, y_train, cv=2, scoring="r2").mean()

# SVR
svr_reg = SVR(kernel="rbf")
model_comparison["CV_R2"][3] = cross_val_score(svr_reg, x_train, y_train, cv=2, scoring="r2").mean()

# Gradient Boosting Regressor
gb_reg = GradientBoostingRegressor(random_state=0)
model_comparison["CV_R2"][4] = cross_val_score(gb_reg, x_train, y_train, cv=2, scoring="r2").mean()

# Convert to DataFrame and display
losses = pd.DataFrame(model_comparison)
print(losses)


## 8. Visualization of the Comparison Table

We display the comparison table of scores using a graphical representation.


In [None]:
from pandas.plotting import table

fig, ax = plt.subplots(figsize=(8, 2))
ax.xaxis.set_visible(False)
ax.yaxis.set_visible(False)
ax.set_frame_on(False)
table(ax, losses, loc='center', colWidths=[0.2]*len(losses.columns))
plt.show()


## 9. SVR Model with a Subset of Features

Here we test the impact of excluding certain features (`nitroester`, `MW`, `carbonylperoxynitrate`, `C.C.C.O.in.non.aromatic.ring`, `aromatic.hydroxyl`) on the SVR model.
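
One way to quantify the effect before training on the reduced set is to compare cross-validated R² with and without the columns. The sketch below uses a synthetic frame; its column names (`f0` to `f4`), the dropped subset, and the helper `mean_cv_r2` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic, already standardized features; only f0 and f1 drive the target
X_demo = pd.DataFrame(rng.normal(size=(300, 5)), columns=[f"f{i}" for i in range(5)])
y_demo = X_demo["f0"] - X_demo["f1"] + 0.1 * rng.normal(size=300)

# Hypothetical list of uninformative columns to exclude
drop_demo = ["f3", "f4"]

def mean_cv_r2(X, y):
    """Mean R^2 of an RBF SVR over 2-fold cross-validation."""
    return cross_val_score(SVR(kernel="rbf"), X, y, cv=2, scoring="r2").mean()

r2_full = mean_cv_r2(X_demo, y_demo)
r2_reduced = mean_cv_r2(X_demo.drop(columns=drop_demo), y_demo)
print(f"all features: {r2_full:.3f}, reduced: {r2_reduced:.3f}")
```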


In [None]:
# List of columns to drop
drop_columns = ['nitroester', 'MW', 'carbonylperoxynitrate', 
                'C.C.C.O.in.non.aromatic.ring', 'aromatic.hydroxyl']

# Create reduced datasets for training and testing
train_x_reduced = x_train.drop(drop_columns, axis=1)
testing_reduced = testing.drop(drop_columns, axis=1)

# Train the SVR model on the reduced dataset
svr_reduced = SVR()
svr_reduced.fit(train_x_reduced, y_train)
pred_reduced = svr_reduced.predict(testing_reduced)

# Save the reduced model results
results_reduced = pd.DataFrame({
    "Id": test_set["Id"],
    "target": pred_reduced
})
results_reduced.to_csv("results_reduced.csv", index=False)

# Histogram of reduced model predictions
plt.figure(figsize=(8, 5))
plt.hist(results_reduced['target'], bins=50, edgecolor='black')
plt.title("Histogram of Predictions (Reduced Feature Set)")
plt.show()


## 10. Ensemble Modeling

In this section, we use an ensemble of models (linear regression, SVR, XGBoost, random forest, gradient boosting) to make final predictions.  
We will load the data (located in the `data` folder), perform encoding, standardization, and split into training and validation sets, then combine the predictions from different models.
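
The order of operations matters here: the scaler is fitted on the training split only and then reused for the validation and test data. A minimal sketch with synthetic arrays (the shapes and seed are arbitrary, and `X_new` merely stands in for a test set):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(100, 4))
y_demo = rng.normal(size=100)
X_new = rng.normal(loc=5.0, scale=3.0, size=(20, 4))  # stands in for the test set

# Hold out 10% for validation, as in this section
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.1, random_state=1)

# Fit the scaler on the training split only, then reuse it everywhere else
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s = scaler_demo.transform(X_tr)
X_va_s = scaler_demo.transform(X_va)
X_new_s = scaler_demo.transform(X_new)

print(X_tr_s.mean(axis=0).round(6))  # zero up to floating-point error
print(X_va_s.mean(axis=0).round(2))  # generally near zero, but not exactly
```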


In [None]:
# Load datasets for ensemble modeling (note the different path: the files are in the "data" folder)
train_ensemble = pd.read_csv('data/train.csv')
test_ensemble = pd.read_csv('data/test.csv')

# Encode 'parentspecies'
train_encoded = pd.get_dummies(train_ensemble, columns=['parentspecies'], drop_first=True)
test_encoded = pd.get_dummies(test_ensemble, columns=['parentspecies'], drop_first=True)

# Align the test dataset with the training dataset columns
for column in train_encoded.columns:
    if column not in test_encoded.columns and column not in ['pSat_Pa']:
        test_encoded[column] = 0

# Transform the target variable using log10
train_encoded['pSat_Pa_log'] = np.log10(train_ensemble['pSat_Pa'])

# Select features, excluding specific columns
drop_features = ['Id', 'pSat_Pa', 'pSat_Pa_log', 'nitroester', 'MW', 
                 'carbonylperoxynitrate', 'C.C.C.O.in.non.aromatic.ring', 'aromatic.hydroxyl']
X = train_encoded.drop(drop_features, axis=1)
y = train_encoded['pSat_Pa_log']

# Prepare the test dataset with the same features
X_test = test_encoded.drop('Id', axis=1)
X_test = X_test[X.columns]

# Split into training and validation sets (y_train_val holds the targets of the training split)
X_train, X_val, y_train_val, y_val = train_test_split(X, y, test_size=0.1, random_state=1)

# Standardize the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)


## 11. Training Models for the Ensemble

We train several models on the reduced training set (after standardization):  
- Linear Regression  
- SVR  
- XGBoost  
- Random Forest  
- Gradient Boosting  

Their predictions will then be combined to obtain the ensemble prediction.


In [None]:
# Initialize models
lin_model = LinearRegression()
svr_model = SVR(kernel='rbf')
xgb_model = xgb.XGBRegressor(objective='reg:squarederror', random_state=1)
rf_model = RandomForestRegressor(n_estimators=100, random_state=1)
gb_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=1)

# Train models on the training set
lin_model.fit(X_train_scaled, y_train_val)
svr_model.fit(X_train_scaled, y_train_val)
xgb_model.fit(X_train_scaled, y_train_val)
rf_model.fit(X_train_scaled, y_train_val)
gb_model.fit(X_train_scaled, y_train_val)


## 12. Evaluating the Ensemble

We compute the ensemble prediction (average of each model’s predictions) on the validation set and evaluate its performance (R² and MSE).
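
Averaging the predictions of fitted regressors is what scikit-learn's `VotingRegressor` does with uniform weights, so it can serve as a cross-check. The sketch below compares the two on synthetic data and uses three scikit-learn models only; XGBoost is left out to keep the example free of extra dependencies.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo[:, 0] + np.sin(X_demo[:, 1]) + 0.1 * rng.normal(size=300)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.1, random_state=1)

estimators = [
    ("lin", LinearRegression()),
    ("svr", SVR(kernel="rbf")),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=1)),
]

# Manual ensemble: fit each model, then average the predictions
manual_avg = np.mean([est.fit(X_tr, y_tr).predict(X_va) for _, est in estimators], axis=0)

# VotingRegressor clones and refits the same estimators, then averages them
voting = VotingRegressor(estimators).fit(X_tr, y_tr)
voting_avg = voting.predict(X_va)

print("R2 :", r2_score(y_va, manual_avg))
print("MSE:", mean_squared_error(y_va, manual_avg))
print("max difference:", np.abs(manual_avg - voting_avg).max())
```

A uniform average is the simplest combination rule; `VotingRegressor` also accepts a `weights` argument if some models should count more than others.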


In [None]:
# Average the validation-set predictions of the five models
ensemble_val_preds = (lin_model.predict(X_val_scaled) + 
                      svr_model.predict(X_val_scaled) +
                      xgb_model.predict(X_val_scaled) +
                      rf_model.predict(X_val_scaled) +
                      gb_model.predict(X_val_scaled)) / 5

# Compute metrics
r2_ensemble = r2_score(y_val, ensemble_val_preds)
mse_ensemble = mean_squared_error(y_val, ensemble_val_preds)

print(f"Ensemble R2 Score on the validation set: {r2_ensemble}")
print(f"Ensemble MSE on the validation set: {mse_ensemble}")


## 13. Final Predictions with the Ensemble

We generate the final predictions on the test set by combining the predictions from each model and save the results to a CSV file.


In [None]:
# Predictions on the test set by each model
test_pred_lin = lin_model.predict(X_test_scaled)
test_pred_svr = svr_model.predict(X_test_scaled)
test_pred_xgb = xgb_model.predict(X_test_scaled)
test_pred_rf = rf_model.predict(X_test_scaled)
test_pred_gb = gb_model.predict(X_test_scaled)

# Compute ensemble prediction (average)
ensemble_test_preds = (test_pred_lin + test_pred_svr + test_pred_xgb + test_pred_rf + test_pred_gb) / 5

# Prepare and save final results
ensemble_results = pd.DataFrame({
    'Id': test_ensemble['Id'],
    'target': ensemble_test_preds
})
ensemble_results.to_csv("test_results_ensemble.csv", index=False)
