# Feature Engineering and Modelling

---

1. Import packages
2. Load data
3. Modelling

---

## 1. Import packages

In [1]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
from datetime import datetime
import matplotlib.pyplot as plt

# Shows plots in jupyter notebook
%matplotlib inline

# Set plot style
sns.set(color_codes=True)

---
## 2. Load data

In [3]:
df = pd.read_csv('./data_for_predictions.csv')
df.drop(columns=["Unnamed: 0"], inplace=True)
df.head()

Unnamed: 0,id,cons_12m,cons_gas_12m,cons_last_month,forecast_cons_12m,forecast_discount_energy,forecast_meter_rent_12m,forecast_price_energy_off_peak,forecast_price_energy_peak,forecast_price_pow_off_peak,...,months_modif_prod,months_renewal,channel_MISSING,channel_ewpakwlliwisiwduibdlfmalxowmwpci,channel_foosdfpfkusacimwkcsosbicdxkicaua,channel_lmkebamcaaclubfxadlmueccxoimlema,channel_usilxuppasemubllopkaafesmlibmsdf,origin_up_kamkkxfxxuwbdslkwifmmcsiusiuosws,origin_up_ldkssxwpmemidmecebumciepifcamkci,origin_up_lxidpiddsbxsbosboudacockeimpuepw
0,24011ae4ebbe3035111d65fa7c15bc57,0.0,4.739944,0.0,0.0,0.0,0.444045,0.114481,0.098142,40.606701,...,2,6,0,0,1,0,0,0,0,1
1,d29c2c54acc38ff3c0614d0a653813dd,3.668479,0.0,0.0,2.28092,0.0,1.237292,0.145711,0.0,44.311378,...,76,4,1,0,0,0,0,1,0,0
2,764c75f661154dac3a6c254cd082ea7d,2.736397,0.0,0.0,1.689841,0.0,1.599009,0.165794,0.087899,44.311378,...,68,8,0,0,1,0,0,1,0,0
3,bba03439a292a1e166f80264c16191cb,3.200029,0.0,0.0,2.382089,0.0,1.318689,0.146694,0.0,44.311378,...,69,9,0,0,0,1,0,1,0,0
4,149d57cf92fc41cf94415803a877cb4b,3.646011,0.0,2.721811,2.650065,0.0,2.122969,0.1169,0.100015,40.606701,...,71,9,1,0,0,0,0,1,0,0


---

## 3. Modelling

We now have a dataset containing features that we have engineered and we are ready to start training a predictive model. Remember, we only need to focus on training a `Random Forest` classifier.

In [4]:
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

### Data sampling

The first thing we want to do is split our dataset into training and test samples. We do this to simulate a real-life situation: the model generates predictions for the test sample without ever having seen those data points during training. This lets us see how well the model generalises to new data, which is critical.

A typical share of the data to dedicate to testing is between 20% and 30%. For this example we will use a 75-25% split between train and test respectively.

In [5]:
# Make a copy of our data
train_df = df.copy()

# Separate target variable from independent variables
y = df['churn']
X = df.drop(columns=['id', 'churn'])
print(X.shape)
print(y.shape)

(14606, 61)
(14606,)


In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(10954, 61)
(10954,)
(3652, 61)
(3652,)
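The split above is purely random. As a side note, `train_test_split` also accepts a `stratify` argument that keeps the class ratio equal in both partitions, which can matter when one class is rare. The sketch below is only an illustration on synthetic data; the arrays `X_demo` and `y_demo` and the 10% positive rate are assumptions made for the example, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data (assumption): 1000 rows, 10% positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.zeros(1000, dtype=int)
y_demo[:100] = 1

# stratify=y_demo keeps the class ratio the same in train and test
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

train_rate = ys_train.mean()
test_rate = ys_test.mean()
print(f"Positive rate in train: {train_rate:.3f}, in test: {test_rate:.3f}")
```

Whether stratification changes the results on the real dataset is not something this sketch can show; it only demonstrates the mechanism.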


### Model training

Once again, we are using a `Random Forest` classifier in this example. A Random Forest sits within the category of `ensemble` algorithms because internally the `Forest` refers to a collection of `Decision Trees` which are tree-based learning algorithms. As the data scientist, you can control how large the forest is (that is, how many decision trees you want to include).

An `ensemble` algorithm is powerful because of averaging, weak learners and the central limit theorem: when many imperfect models are combined, their individual errors tend to cancel out. If we take a single decision tree and give it a sample of data and some parameters, it will learn patterns from the data. It may overfit or it may underfit, but that single model is then all we have to rely on.

With `ensemble` methods, instead of banking on one trained model, we can train thousands of decision trees, each using a different sample of the data and learning different patterns. It would be like asking 1000 people to all learn how to code: you would end up with 1000 people with different answers, methods and styles! The weak learner notion applies here too. It has been found that if you train your learners not to overfit, but to learn weak patterns within the data, and you have a lot of these weak learners, together they form a highly predictive pool of knowledge! This is a real-life application of the idea that many brains are better than one.

Now, instead of relying on a single decision tree for prediction, the random forest defers to the combined view of the entire collection of decision trees. Some ensemble algorithms use a voting approach to decide which prediction is best, while others use averaging.
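To make the averaging idea concrete, the sketch below fits a small forest on synthetic data and averages the class probabilities of its individual trees by hand. In scikit-learn, `RandomForestClassifier` combines its trees by averaging their predicted probabilities, so the manual average should agree with `predict_proba`. The dataset and the names `X_demo`, `y_demo` and `forest` are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic classification problem (assumption, not the churn data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Collect the class probabilities predicted by each individual tree
per_tree = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])

# Average across trees: shape (n_samples, n_classes)
averaged = per_tree.mean(axis=0)

# Compare the manual average with the forest's own probabilities
matches = np.allclose(averaged, forest.predict_proba(X_demo))
print("Manual average matches forest.predict_proba:", matches)
```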

As we increase the number of learners, the idea is that the random forest's performance should converge to its best possible solution.

Some additional advantages of the random forest classifier include:

- The random forest uses a rule-based approach instead of a distance calculation, so features do not need to be scaled
- It is able to handle non-linear relationships better than linear models
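The scaling point can be checked with a small experiment: fit two forests with the same seed, one on the raw features and one on the same features multiplied by a constant, and compare their predictions. This is a sketch on synthetic, integer-valued data (chosen so that the rescaling is exact in floating point); all names and values here are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Integer-valued synthetic features (assumption) so rescaling is exact
X_demo = rng.integers(0, 100, size=(400, 4)).astype(float)
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 100).astype(int)

# The same quantities expressed in different units
X_rescaled = X_demo * 1000.0

forest_raw = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)
forest_scaled = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_rescaled, y_demo)

# Tree splits depend only on the ordering of values, not their magnitude
same_predictions = np.array_equal(
    forest_raw.predict(X_demo), forest_scaled.predict(X_rescaled)
)
print("Predictions unchanged by rescaling:", same_predictions)
```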

On the flip side, some disadvantages of the random forest classifier include:

- The computational power needed to train a random forest on a large dataset is high, since we need to build a whole ensemble of estimators.
- Training time can be longer due to the increased complexity and size of the ensemble

In [17]:
# Train a random forest with 500 trees; random_state fixes the seed for reproducibility
model = RandomForestClassifier(n_estimators=500, random_state=42)
model.fit(X_train, y_train)

## Parameters
- n_estimators:
 - This parameter defines the number of trees in the forest. Specifically, n_estimators=500 means that the Random Forest model will create 500 decision trees.
 - More trees generally help improve the model's performance by reducing variance, which makes the predictions more stable and less sensitive to noise in the data.
- random_state:
 - random_state ensures reproducibility of the model. It sets the seed for the random number generator used by the Random Forest algorithm.
 - By setting random_state=42, we ensure that the same random subsets of the data and features are chosen each time the model is trained. This means that you will get the same results every time you run the code, provided everything else (like the data and hyperparameters) remains the same.
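The reproducibility claim can be illustrated with a short sketch: training twice with the same `random_state` on the same data should give identical predicted probabilities. The synthetic dataset and the helper `fit_and_score` are hypothetical and exist only for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption), used only to demonstrate seeding
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

def fit_and_score(seed):
    """Hypothetical helper: fit a forest with the given seed and return P(class 1)."""
    forest = RandomForestClassifier(n_estimators=30, random_state=seed)
    forest.fit(X_demo, y_demo)
    return forest.predict_proba(X_demo)[:, 1]

run_1 = fit_and_score(42)
run_2 = fit_and_score(42)

# Same seed, same data, same hyperparameters: the outputs should coincide
identical = np.array_equal(run_1, run_2)
print("Same seed gives identical probabilities:", identical)
```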

### Evaluation

Now let's evaluate how well this trained model is able to predict the values of the test dataset.

In [18]:
# Generate predictions for the test set
y_pred = model.predict(X_test)

In [19]:
# Calculate performance metrics on the test set
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

# Print evaluation metrics
print(f"Accuracy: {accuracy}")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(classification_rep)

Accuracy: 0.9036144578313253
Confusion Matrix:
[[3283    3]
 [ 349   17]]
Classification Report:
              precision    recall  f1-score   support

           0       0.90      1.00      0.95      3286
           1       0.85      0.05      0.09       366

    accuracy                           0.90      3652
   macro avg       0.88      0.52      0.52      3652
weighted avg       0.90      0.90      0.86      3652



## Why did you choose the evaluation metrics that you used? Please elaborate on your choices.
- Accuracy is the ratio of correctly predicted instances (true positives + true negatives) to the total number of predictions. It is a simple and intuitive measure that provides a quick snapshot of the model's performance, and it is most informative for balanced datasets where each class has a similar number of samples. We received an accuracy of about 90%. Because only 366 of the 3652 test customers churned, a model that predicted "no churn" for everyone would also score about 90% (3286 / 3652), so accuracy alone is not enough here.
- A confusion matrix is a table that summarizes the performance of a classification algorithm by showing the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). In the matrix above, rows are the actual classes and columns are the predicted classes, and we treat churn (class 1) as the positive class.
 - 3283 observations of the negative class were predicted correctly (TN): customers who stayed, and the model correctly predicted them as not churned
 - 17 observations of the positive class were predicted correctly (TP): customers who actually churned, and the model correctly predicted them as churned
 - 3 observations of the negative class were wrongly classified as positive (FP): customers who stayed, but the model incorrectly predicted them as churned, which is a Type I error
 - 349 observations of the positive class were wrongly classified as negative (FN): customers who churned, but the model incorrectly predicted them as not churned, which is a Type II error
- Precision is the ratio of true positives to all predicted positives: of all instances that the model predicted as a given class, how many actually belong to it? For class 0, the precision is 0.90, meaning that 90% of the instances predicted as class 0 are actually class 0. For class 1, the precision is 0.85 (17 of 20), meaning 85% of the instances predicted as class 1 are correct.
- Recall is the ratio of true positives to all actual positives: of all the instances that actually belong to a given class, how many did the model correctly identify? For class 0, the recall is 1.00 after rounding (3283 of 3286), indicating that the model correctly identified almost all instances of class 0. For class 1, the recall is very low at 0.05 (17 of 366), meaning the model only identified about 5% of the actual class 1 instances.
- The F1-score is the harmonic mean of precision and recall, giving a balanced measure when both values are important. For class 0, the F1-score is 0.95, indicating good performance in both precision and recall for this class. For class 1, the F1-score is very low at 0.09, reflecting the very low recall for this class.
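As a cross-check, the reported metrics can be recomputed by hand from the four counts in the confusion matrix printed above. The sketch below assumes scikit-learn's layout (rows are actual classes, columns are predicted classes) and treats churn (class 1) as the positive class; the variable names are chosen for this example.

```python
# Counts read from the confusion matrix [[3283, 3], [349, 17]]
# (rows = actual class, columns = predicted class, churn = class 1)
tn, fp, fn, tp = 3283, 3, 349, 17

total = tn + fp + fn + tp

# Share of all predictions that are correct
acc = (tp + tn) / total

# Metrics for the churn class
precision_churn = tp / (tp + fp)
recall_churn = tp / (tp + fn)
f1_churn = 2 * precision_churn * recall_churn / (precision_churn + recall_churn)

# Accuracy of a trivial model that always predicts "no churn"
baseline_acc = (tn + fp) / total

print(f"accuracy={acc:.4f}, baseline={baseline_acc:.4f}")
print(f"precision={precision_churn:.2f}, recall={recall_churn:.2f}, f1={f1_churn:.2f}")
```

Comparing `acc` with `baseline_acc` shows how little the headline accuracy says about the churn class on its own.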

## Do you think that the model performance is satisfactory? Give justification for your answer.
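The model performs very well on class 0 (high precision, recall, and F1-score), but very poorly on class 1 (recall of 0.05 and F1-score of 0.09): it misses 349 of the 366 customers who actually churned. Since identifying churners is the purpose of the model, this performance is not satisfactory, even though the overall accuracy is about 90%. The test set is clearly imbalanced, with 366 instances of class 1 against 3286 of class 0, which helps explain why the model struggles to correctly identify class 1 instances.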
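One possible direction for improving recall on the rare class, offered here only as a sketch and not as something the notebook does, is to lower the probability threshold at which a customer is flagged as a churner. The example below uses synthetic imbalanced data; the dataset, the names, and the threshold of 0.3 are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption): roughly 10% positives
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(Xd_train, yd_train)

# Predicted probability of the positive class for each test row
positive_proba = forest.predict_proba(Xd_test)[:, 1]

# Default-style rule versus a hypothetical lower threshold
pred_default = (positive_proba > 0.5).astype(int)
pred_lowered = (positive_proba >= 0.3).astype(int)

recall_default = recall_score(yd_test, pred_default)
recall_lowered = recall_score(yd_test, pred_lowered)
print(f"recall at 0.5: {recall_default:.2f}, recall at 0.3: {recall_lowered:.2f}")
```

Lowering the threshold can only flag more customers, so recall cannot go down, but it typically comes at the cost of more false positives. Any threshold would need to be chosen against the business cost of each error type.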