# Big Data Processes Exercises - Week 06
# <font color= MediumSpringGreen>CodeCarbon</font>

#### What we will cover today

<ol>
    <li>Importing packages and libraries</li>
    <li>Loading the dataset</li>
    <li>CodeCarbon
        <ol>
            <li>Decision Tree from week 3</li>
            <li>Testing with CodeCarbon</li>
            <li>Evaluating the model</li>
            <li>Evaluating emissions</li>
        </ol>
    </li>
</ol>

Info about CodeCarbon: https://mlco2.github.io/codecarbon/

***
***
***

## 1. Importing various libraries

In [None]:
#%pip install seaborn
#%pip install scikit-learn



import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, roc_curve, auc 
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

### 1.1 Installing and importing <font color=MediumSpringGreen>CodeCarbon </font>

In [None]:
#%pip install codecarbon

In [None]:
from codecarbon import EmissionsTracker

***
***
## 2. Load and examine the data

Once again, we will be using the **IBM-employee-attrition dataset** to predict attrition, i.e. whether an employee has left the company or not.

As we have explained before, our target variable, attrition, is either 0 or 1:
- 0 = No attrition, the employee did not leave the company. The negative class
- 1 = Attrition, the employee left the company. The positive class <-- our focus

In [None]:
df = pd.read_csv("IBM-Employee-Attrition.csv", delimiter=',')

Examine the dataset if necessary

In [None]:
df.info()
df.head()  # last expression in the cell, so the notebook displays it

***
***

## 3. Classification Decision Trees with an EmissionsTracker

We will apply CodeCarbon to the Classification Decision Tree model from week 3 (Classification) to estimate how much CO2 training and running our model emits.

### 3.1 Selecting target features

We will select the same features as we've used before for our models (go back to the notebook for week 3, if you need a refresher as to how we got these features):

In [None]:
#Create the feature and target variables
#From list of feature(s) 'X', the model will guess/predict the 'y' feature (our target)
X = df[['EnvironmentSatisfaction', 'JobSatisfaction', 'JobInvolvement', 'YearsAtCompany', 'StockOptionLevel', 'YearsWithCurrManager', 'Age', 'MonthlyIncome', 'YearsInCurrentRole', 'JobLevel', 'TotalWorkingYears']].values

y = df['Attrition'].values

### 3.2 Split the data into training and test samples

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

No need to standardise this time: a Decision Tree splits on one feature at a time using thresholds, so the scale of the features does not affect the result.

### 3.3 Create the model and initialise the EmissionsTracker

In [None]:
# Decision Tree Classifier with some hyperparameter tuning from week 4
model_DTC = DecisionTreeClassifier(max_depth=5, min_samples_leaf=2, min_samples_split=5, random_state=42)

tracker = EmissionsTracker()

### 3.4 Start the tracker and then fit/train the model on training data

In [None]:
# Start tracking carbon emissions
tracker.start()

# Fit the classifier to the training data (not standardised, see section 3.2)
model_DTC.fit(X_train, y_train)

### 3.5 Make predictions and stop the tracker

In [None]:
# Make predictions on the test set
y_pred = model_DTC.predict(X_test)

# Stop tracking carbon emissions
tracker.stop()

### 3.6 Evaluate the decision tree

In [None]:
# We evaluate the performance of the classifier using the accuracy score
print("Accuracy: ", metrics.accuracy_score(y_test, y_pred))

~0.82 = 82% accuracy, which looks pretty good for a model. But as we learned last week, accuracy is misleading on our imbalanced dataset, so we need to use other evaluation metrics.

#### 3.6.1 Other evaluation metrics

There are many ways to evaluate machine learning models. The important thing is to figure out which evaluation metric or score is best for your <font color=red> model and data </font>. If you need a recap of the different evaluation metrics, we have written a wonderful look-up section in last week's notebook 'BDP_Evaluation.ipynb'.

Last week, we found out that our IBM-Employee-Attrition dataset is imbalanced: the 'Attrition' class is very small compared to the 'No Attrition' majority class. A model can therefore reach a high accuracy while seldom predicting 'Attrition', because it has not seen enough examples of that class to learn it. This is why we cannot rely on the accuracy metric.
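
To see why accuracy misleads on imbalanced data, consider a baseline that always predicts 'No Attrition'. The sketch below uses made-up labels (84 stayers and 16 leavers, not the real class counts of the dataset) and plain Python so the arithmetic stays visible.

```python
# Hypothetical labels: 84 employees stayed (0), 16 left (1).
# These counts are illustrative, not taken from the IBM dataset.
y_true = [0] * 84 + [1] * 16

# A "model" that never predicts attrition.
y_pred = [0] * len(y_true)

# Accuracy: share of all predictions that are correct.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall for the positive class: share of actual leavers that were found.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
recall = true_positives / actual_positives

print("Accuracy:", accuracy)
print("Recall:", recall)
```

The baseline reaches 84% accuracy on these labels while finding none of the employees who left.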

Instead, we will use the *precision*, *recall* and *f1* scores. 
- Precision is the fraction of examples assigned the positive class that actually belong to the positive class. (aka. of the employees predicted as 'Attrition', how many really left)
- Recall is the fraction of actual positive examples that were predicted as positive. (aka. of the employees who really left, how many the model caught)
- F1 score combines precision and recall into a single score (their harmonic mean) that balances the two.

For our case, we are equally interested in False Negatives and False Positives, aka. the mistakes our model makes, so we will focus on the F1-score.
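
As a worked example of the three scores, the sketch below computes them from confusion-matrix counts. The counts are hypothetical, chosen only to make the arithmetic easy to follow. For a binary target, `precision_score`, `recall_score` and `f1_score` apply the same formulas to the true and predicted labels.

```python
# Hypothetical counts for the positive class ('Attrition').
tp = 10  # predicted attrition, employee did leave
fp = 15  # predicted attrition, employee stayed
fn = 30  # predicted no attrition, employee did leave

# Of everyone predicted as 'Attrition', how many really left?
precision = tp / (tp + fp)

# Of everyone who really left, how many did the model catch?
recall = tp / (tp + fn)

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", round(f1, 4))
```

False Positives lower the precision and False Negatives lower the recall, so the F1-score drops when either kind of mistake grows.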

In [None]:
# Precision
precision = precision_score(y_test, y_pred)
# Recall 
recall = recall_score(y_test, y_pred) 
# F1-Score 
f1 = f1_score(y_test, y_pred) 

print("Precision:", precision) 
print("Recall:", recall) 
print("F1-Score:", f1) 

Okay, so this is not the greatest model. But most importantly: how much CO2 did our model emit? Hint: look at the output printed when the tracker was stopped...

***
***

## 4. Emissions

When the EmissionsTracker is stopped, it saves its measurements to a file called emissions.csv in your working directory

In [None]:
emissions_df = pd.read_csv("emissions.csv")
emissions_df.info()

You can read more about the columns, what they represent and their format, at: https://mlco2.github.io/codecarbon/output.html#csv 

In [None]:
# Show all columns when displaying a dataframe
pd.set_option('display.max_columns', None) 
emissions_df

N.B.: the measurement from your latest run is saved in the last row of the dataframe. 
The other rows represent previous measurements.
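
To pull out just the latest run, the last row can be selected with `iloc[-1]`. This is a minimal sketch, assuming the `duration`, `emissions` and `energy_consumed` columns are in seconds, kg CO2eq and kWh, as listed in the CodeCarbon output documentation. The small dataframe is made up so the example runs without an emissions.csv file, and `summarise_last_run` is a hypothetical helper.

```python
import pandas as pd

def summarise_last_run(emissions_df):
    """Return duration, emissions in grams and energy in Wh for the last row."""
    last = emissions_df.iloc[-1]
    return {
        "duration_s": float(last["duration"]),
        "emissions_g": float(last["emissions"]) * 1000,      # kg -> g
        "energy_wh": float(last["energy_consumed"]) * 1000,  # kWh -> Wh
    }

# Made-up values standing in for pd.read_csv("emissions.csv")
demo_df = pd.DataFrame({
    "duration": [1.5, 2.0],
    "emissions": [0.000002, 0.000004],
    "energy_consumed": [0.00001, 0.00002],
})

summary = summarise_last_run(demo_df)
print(summary)
```

With your own measurements, pass the dataframe loaded from emissions.csv instead of `demo_df`.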

***
***
***

## Your turn 🚀

#### We encourage you to try the EmissionsTracker on models in your own project or on models from previous exercise sessions. Try out different classification models, with/without hyperparameter tuning, run your model on different computers, etc.

Link to the Quickstart guide for CodeCarbon: https://mlco2.github.io/codecarbon/usage.html