## Model Training
#### Train the model using the pipeline exported by TPOT, then evaluate it.

In [29]:
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import MinMaxScaler, RobustScaler
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from tpot.builtins import StackingEstimator
from imblearn.over_sampling import RandomOverSampler

#### Split the engineered data into training and test sets. After some experimentation, we found that holding out 20% of the data for testing and using 80% for training worked best.

In [32]:
tpot_data = pd.read_json("../data/engineered/presences.json")
features = tpot_data.drop('target', axis=1)

training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['target'], test_size=0.20, random_state=None)
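
#### With `random_state=None` the split is drawn differently on every run, so the metrics reported later can vary between executions. The following is a minimal sketch of a repeatable 80/20 split. It uses a small synthetic frame in place of the engineered data; the column names, the helper `split_80_20` and the seed value 42 are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered data (hypothetical columns).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "feature_a": rng.normal(size=50),
    "feature_b": rng.normal(size=50),
    "target": rng.integers(0, 30, size=50),
})
demo_features = demo.drop("target", axis=1)


def split_80_20(seed):
    # A fixed integer seed makes the shuffle, and therefore the split, repeatable.
    return train_test_split(
        demo_features, demo["target"], test_size=0.20, random_state=seed
    )


X_train, X_test, y_train, y_test = split_80_20(42)
X_train_again, X_test_again, _, _ = split_80_20(42)

print(len(X_train), len(X_test))
print(X_test.index.equals(X_test_again.index))
```

#### Calling the helper twice with the same seed selects the same rows for the test set, which makes later comparisons between runs easier to interpret.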

#### After splitting, we save the training and test features so that predictions can be re-evaluated later if needed.

In [33]:
import os

NEW_DATA_DIR = "../data/segregation/"

# Create the output directory if it does not already exist.
os.makedirs(NEW_DATA_DIR, exist_ok=True)

training_features.to_json(os.path.join(NEW_DATA_DIR, "presencesTrain.json"))
testing_features.to_json(os.path.join(NEW_DATA_DIR, "presencesTest.json"))

#### Fit the pipeline exported by TPOT, then save its predictions.

In [34]:
exported_pipeline = make_pipeline(
    MinMaxScaler(),
    RobustScaler(),
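    # Note: "squared_loss" was renamed "squared_error" in scikit-learn 1.0 and is no longer accepted from 1.2 on.
    StackingEstimator(estimator=SGDRegressor(alpha=0.01, eta0=0.1, fit_intercept=False, l1_ratio=0.5, learning_rate="invscaling", loss="squared_loss", penalty="elasticnet", power_t=0.1)),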
    SelectFromModel(estimator=ExtraTreesRegressor(max_features=0.9500000000000001, n_estimators=100), threshold=0.05),
    RandomForestRegressor(bootstrap=True, max_features=0.45, min_samples_leaf=1, min_samples_split=5, n_estimators=100)
)

exported_pipeline.fit(training_features, training_target)
preds = exported_pipeline.predict(testing_features)

np.savetxt("../predictions/preds.csv", preds, delimiter=",")

## Model Evaluation

#### In regression, unlike classification, accuracy is harder to define. A model will rarely predict the exact value, so we measure how close the predictions are to the real values instead.
#### There are three main metrics for evaluating regression models:
 - R-squared (R²): measures how much of the variability in the dependent variable is explained by the model, i.e. how well the model fits the data.
 - Mean Squared Error (MSE) and Root Mean Squared Error (RMSE): MSE is an absolute measure of goodness of fit, the average of the squared differences between predicted and actual values. RMSE is the square root of MSE and is reported more often because it is expressed in the same units as the target, whereas MSE values can be too large to compare easily.
 - Mean Absolute Error (MAE): similar to MSE but, instead of averaging the squared errors, it averages their absolute values.
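
#### To make the three definitions above concrete, here is a small NumPy sketch that computes them by hand. The arrays are toy values chosen for illustration and are not taken from the dataset; the `demo_` names are hypothetical.

```python
import numpy as np

# Toy values for illustration only.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true - y_pred

# MSE: mean of the squared errors; RMSE: its square root.
demo_mse = np.mean(errors ** 2)
demo_rmse = np.sqrt(demo_mse)

# MAE: mean of the absolute errors.
demo_mae = np.mean(np.abs(errors))

# R-squared: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
demo_r2 = 1.0 - ss_res / ss_tot

print("MSE:", demo_mse)
print("RMSE:", demo_rmse)
print("MAE:", demo_mae)
print("R2:", demo_r2)
```

#### On these toy values the MSE is 1.5 and the MAE is 1.0; because the errors are squared before averaging, the single error of size 2 weighs more heavily in the MSE and RMSE than in the MAE.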

In [35]:
r2 = r2_score(testing_target, preds)

rmse = mean_squared_error(testing_target, preds, squared=False)

mae = mean_absolute_error(testing_target, preds)

print("R Square (r2):", r2)
print("Root Mean Square Error (RMSE)", rmse)
print("Mean Absolute Error (MAE):", mae)

R Square (r2): 0.7881535934679571
Root Mean Square Error (RMSE) 4.10359376773461
Mean Absolute Error (MAE): 2.43473827818464


#### As you can see, the results are not very good: the predictions still differ noticeably from the real values, probably because of the small amount of data.
#### For this reason, we decided to oversample the data to see whether the model performs better with more data. We used the Imbalanced-Learn library to perform random oversampling.

In [36]:
ros = RandomOverSampler(random_state=0)
features_oversampled, target_oversampled = ros.fit_resample(features, tpot_data['target'])
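
#### To illustrate what random oversampling does, here is a NumPy-only sketch of the idea as we understand the default strategy: existing rows are duplicated at random until every distinct target value occurs as often as the most frequent one. This is an illustration of the technique, not Imbalanced-Learn's implementation; the function name `random_oversample` and the toy arrays are hypothetical.

```python
import numpy as np


def random_oversample(X, y, seed=0):
    """Duplicate rows at random until each distinct y value is equally frequent."""
    rng = np.random.default_rng(seed)
    values, counts = np.unique(y, return_counts=True)
    target_count = counts.max()
    keep = [np.arange(len(y))]  # start with every original row
    for value, count in zip(values, counts):
        if count < target_count:
            candidates = np.flatnonzero(y == value)
            # Sample with replacement from the rows that carry this value.
            extra = rng.choice(candidates, size=target_count - count, replace=True)
            keep.append(extra)
    idx = np.concatenate(keep)
    return X[idx], y[idx]


X_toy = np.arange(12).reshape(6, 2)
y_toy = np.array([0, 0, 0, 1, 1, 2])
X_res, y_res = random_oversample(X_toy, y_toy)

print(np.unique(y_res, return_counts=True))
```

#### Note that no new observations are created: every added row is an exact copy of a row that was already in the data.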

#### As always, we store the intermediate data.

In [37]:
import os

NEW_DATA_DIR = "../data/oversampled/"

# Create the output directory if it does not already exist.
os.makedirs(NEW_DATA_DIR, exist_ok=True)

features_oversampled.to_json(os.path.join(NEW_DATA_DIR, "features.json"))
target_oversampled.to_json(os.path.join(NEW_DATA_DIR, "target.json"))

#### Split the data again, refit the pipeline, and store the predictions.

In [40]:
training_features, testing_features, training_target, testing_target = \
    train_test_split(features_oversampled, target_oversampled, test_size=0.20, random_state=None)

exported_pipeline.fit(training_features, training_target)
preds = exported_pipeline.predict(testing_features)

np.savetxt("../predictions/preds_oversampled.csv", preds, delimiter=",")

#### The evaluation on the oversampled data shows a large improvement in all three metrics: R² rises from 0.788 to 0.999, RMSE falls from 4.10 to 0.60, and MAE falls from 2.43 to 0.05.

In [41]:
r2 = r2_score(testing_target, preds)

rmse = mean_squared_error(testing_target, preds, squared=False)

mae = mean_absolute_error(testing_target, preds)

print("R Square (r2):", r2)
print("Root Mean Square Error (RMSE)", rmse)
print("Mean Absolute Error (MAE):", mae)

R Square (r2): 0.9989233839667933
Root Mean Square Error (RMSE) 0.596933035434028
Mean Absolute Error (MAE): 0.051922954284027135
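
#### Random oversampling only duplicates existing rows, so when the split is made after oversampling, copies of the same row can end up in both the training and the test set. In that case the test scores may partly reflect rows the model has already seen. The sketch below, on synthetic data with hypothetical names, counts how many test rows have an identical copy in the training set; a similar check could be run on `training_features` and `testing_features`.

```python
import numpy as np

rng = np.random.default_rng(0)

# 20 distinct synthetic rows, each repeated 5 times to mimic oversampling.
original = rng.normal(size=(20, 3))
oversampled = np.repeat(original, 5, axis=0)

# Shuffle, then hold out 20% as a test set (split made after oversampling).
order = rng.permutation(len(oversampled))
n_test = len(oversampled) // 5
test_rows = oversampled[order[:n_test]]
train_rows = oversampled[order[n_test:]]

# A test row counts as leaked if an identical row is in the training set.
train_set = {row.tobytes() for row in train_rows}
leaked = sum(row.tobytes() in train_set for row in test_rows)

print(f"{leaked} of {n_test} test rows also appear in the training set")
```

#### If the count is high, one option to consider is splitting first and oversampling only the training portion, so that the test set contains rows the model has never seen.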
