# Big Data Processes Exercises - Week 06
# <font color= MediumSpringGreen>CodeCarbon</font>

#### What we will cover today

<ol>
    <li>Importing packages and libraries</li>
    <li>Loading the dataset</li>
    <li>CodeCarbon</li>
    <ol>
        <li>Decision Tree from week 3</li>
        <li>Testing with CodeCarbon</li>
        <li>Evaluating the model</li>
        <li>Evaluating emissions</li>
    </ol>
</ol>

Info about CodeCarbon: https://mlco2.github.io/codecarbon/

***
***
***

## 1. Importing various libraries

In [1]:
#%pip install seaborn
#%pip install scikit-learn

import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, roc_curve, auc 
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

### 1.1 Installing and importing <font color=MediumSpringGreen>CodeCarbon </font>

In [2]:
%pip install codecarbon

Collecting codecarbon
  Obtaining dependency information for codecarbon from https://files.pythonhosted.org/packages/8a/c5/4b02e1eaa6277f0d0a0e354f3c49842fc485aedb146d4ad2c2fb3112ac65/codecarbon-2.3.4-py3-none-any.whl.metadata
  Downloading codecarbon-2.3.4-py3-none-any.whl.metadata (6.6 kB)
Collecting pynvml (from codecarbon)
  Obtaining dependency information for pynvml from https://files.pythonhosted.org/packages/5b/9c/adb8070059caaa15d5a572b66bccd95900d8c1b9fa54d6ecea6ae97448d1/pynvml-11.5.0-py3-none-any.whl.metadata
  Downloading pynvml-11.5.0-py3-none-any.whl.metadata (7.8 kB)
Collecting py-cpuinfo (from codecarbon)
  Obtaining dependency information for py-cpuinfo from https://files.pythonhosted.org/packages/e0/a9/023730ba63db1e494a271cb018dcd361bd2c917ba7004c3e49d5daf795a2/py_cpuinfo-9.0.0-py3-none-any.whl.metadata
  Downloading py_cpuinfo-9.0.0-py3-none-any.whl.metadata (794 bytes)
Collecting rapidfuzz (from codecarbon)
  Obtaining dependency information for rapidfuzz from htt

In [3]:
from codecarbon import EmissionsTracker

***
***
## 2. Load and examine the data

Once again, we will use the **IBM-employee-attrition dataset** to predict whether an employee has attrition, i.e. whether they have left the company.

As explained before, our target variable, attrition, is either 0 or 1:
- 0 = No attrition, the employee did not leave the company. The negative class
- 1 = Attrition, the employee left the company. The positive class <-- our focus

In [4]:
df = pd.read_csv("IBM-Employee-Attrition.csv", delimiter=',')

Examine the dataset if necessary

In [5]:
df.head()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   int64 
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                

***
***

## 3. Classification Decision Trees with an EmissionsTracker

We will try CodeCarbon on the Classification Decision Tree model from week 3 (Classification) to estimate how much CO2 training and running our model emits.

### 3.1 Selecting features and target

We will select the same features as we've used before for our models (go back to the notebook for week 3, if you need a refresher as to how we got these features):

In [6]:
#Create the feature and target variables
#From list of feature(s) 'X', the model will guess/predict the 'y' feature (our target)
X = df[['EnvironmentSatisfaction', 'JobSatisfaction', 'JobInvolvement', 'YearsAtCompany', 'StockOptionLevel', 'YearsWithCurrManager', 'Age', 'MonthlyIncome', 'YearsInCurrentRole', 'JobLevel', 'TotalWorkingYears']].values

y = df['Attrition'].values

### 3.2 Split the data into training and test samples

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

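No need to standardise this time since we are working with a Decision Tree, which splits on one feature threshold at a time and is therefore not affected by the scale of the features.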
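Since the attrition class is small, a random split can leave the training and test samples with slightly different class ratios. A minimal sketch of a stratified split, using the `stratify` argument of `train_test_split`; the stand-in data (`X_demo`, `y_demo`, 1000 rows with 16% positives) is hypothetical and not taken from the IBM dataset. Note that the notebook itself does not stratify, so switching would change the scores reported further down.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1000 rows, 16% in the positive class
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 160 + [0] * 840)

# stratify=y_demo keeps the class ratio (nearly) identical in both samples
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0, stratify=y_demo
)

print("Positive share overall: ", y_demo.mean())
print("Positive share in train:", y_tr.mean())
print("Positive share in test: ", y_te.mean())
```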

### 3.3 Create the model and initialise the EmissionsTracker

In [8]:
# Decision Tree Classifier with some hyperparameter tuning from week 4
model_DTC = DecisionTreeClassifier(max_depth=5, min_samples_leaf=2, min_samples_split=5, random_state=42)

tracker = EmissionsTracker()

[codecarbon INFO @ 14:24:38] [setup] RAM Tracking...
[codecarbon INFO @ 14:24:38] [setup] GPU Tracking...
[codecarbon INFO @ 14:24:38] No GPU found.
[codecarbon INFO @ 14:24:38] [setup] CPU Tracking...
[codecarbon INFO @ 14:24:38] CPU Model on constant consumption mode: Apple M1 Pro
[codecarbon INFO @ 14:24:38] >>> Tracker's metadata:
[codecarbon INFO @ 14:24:38]   Platform system: macOS-14.4-arm64-arm-64bit
[codecarbon INFO @ 14:24:38]   Python version: 3.10.12
[codecarbon INFO @ 14:24:38]   CodeCarbon version: 2.3.4
[codecarbon INFO @ 14:24:38]   Available RAM : 16.000 GB
[codecarbon INFO @ 14:24:38]   CPU count: 10
[codecarbon INFO @ 14:24:38]   CPU model: Apple M1 Pro
[codecarbon INFO @ 14:24:38]   GPU count: None
[codecarbon INFO @ 14:24:38]   GPU model: None


### 3.4 Start the tracker and then fit/train the model on training data

In [9]:
# Start tracking carbon emissions
tracker.start()

# Fit the classifier to the training data (not standardised, see section 3.2)
model_DTC.fit(X_train, y_train)

### 3.5 Make predictions and stop the tracker

In [10]:
# Make predictions on the test set
y_pred = model_DTC.predict(X_test)

# Stop tracking carbon emissions
tracker.stop()

[codecarbon INFO @ 14:24:45] Energy consumed for RAM : 0.000004 kWh. RAM Power : 6.0 W
[codecarbon INFO @ 14:24:45] Energy consumed for all CPUs : 0.000003 kWh. Total CPU Power : 5.0 W
[codecarbon INFO @ 14:24:45] 0.000007 kWh of electricity used since the beginning.


1.6262311401552623e-06

### 3.6 Evaluate the decision tree

In [11]:
# We evaluate the performance of the classifier using the accuracy score
print("Accuracy: ", metrics.accuracy_score(y_test, y_pred))

Accuracy:  0.8299319727891157


~0.83 = 83% accuracy, which can look pretty good for a model. But as we learned last week, accuracy alone is misleading on our imbalanced dataset. We need to use other evaluation metrics.

#### 3.6.1 Other evaluation metrics

There are many ways to evaluate machine learning models. What matters is figuring out which evaluation metric or score suits your <font color=red> model and data </font>. If you need a recap of the different evaluation metrics, there is a look-up section in last week's notebook, 'BDP_Evaluation.ipynb'.

Last week, we found out that our IBM-Employee-Attrition dataset is imbalanced: the 'Attrition' class is very small compared to the 'No Attrition' majority class. Accuracy is therefore misleading here, because a model can score high simply by predicting 'No Attrition' almost every time, without having learned enough about the 'Attrition' class to predict it.
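To see why accuracy is misleading here, it can help to compare against a model that always predicts the majority class. The sketch below uses scikit-learn's `DummyClassifier`. The class counts (245 'No Attrition' and 49 'Attrition' in a 294-row test split) are an assumption, back-computed to be consistent with the scores this notebook reports, and the `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Assumed class counts for a 294-row test split: 245 stayed (0), 49 left (1)
y_demo = np.array([0] * 245 + [1] * 49)
X_demo = np.zeros((len(y_demo), 1))  # the dummy model ignores the features

# Always predict the most frequent class, i.e. 'No Attrition'
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
y_base = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, y_base)
baseline_recall = recall_score(y_demo, y_base)

print("Baseline accuracy:", baseline_accuracy)
print("Baseline recall:  ", baseline_recall)
```

Under these assumed counts the baseline reaches about 83% accuracy without finding a single attrition case, which is why accuracy alone says little about the positive class.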

Instead, we will use the *precision*, *recall* and *f1* scores. 
- Precision is the fraction of examples predicted as positive that actually belong to the positive class (i.e. of the employees we predicted to have attrition, how many really left).
- Recall is the fraction of actual positive examples that the model predicted as positive (i.e. of the employees who really left, how many we caught).
- The F1 score is the harmonic mean of precision and recall, combining both into a single score that balances them.

In our case, we are equally interested in False Negatives and False Positives, i.e. in both kinds of mistakes our model makes, so we will focus on the F1-score.

In [12]:
# Precision
precision = precision_score(y_test, y_pred)
# Recall
recall = recall_score(y_test, y_pred)
# F1-Score
f1 = f1_score(y_test, y_pred)

print("Precision:", precision) 
print("Recall:", recall) 
print("F1-Score:", f1) 

Precision: 0.4444444444444444
Recall: 0.08163265306122448
F1-Score: 0.13793103448275862
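The three scores above can be traced back to confusion-matrix counts. The counts in this sketch (4 true positives, 5 false positives, 45 false negatives, 240 true negatives) are not printed anywhere in the notebook; they are back-computed from the reported scores and the 294-row test split, so treat them as an assumption.

```python
# Confusion-matrix counts, back-computed from the reported scores (an assumption)
tp, fp, fn, tn = 4, 5, 45, 240

# Of everything predicted as 'Attrition', how much was right?
precision = tp / (tp + fp)
# Of everything that really was 'Attrition', how much did we find?
recall = tp / (tp + fn)
# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
# Share of all predictions that were right
accuracy = (tp + tn) / (tp + fp + fn + tn)

print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
print("Accuracy:", accuracy)
```

With these counts the formulas reproduce the precision, recall, F1 and accuracy values printed by the notebook.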


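Okay, so this is not the greatest model: a recall of about 0.08 means it catches only a small fraction of the employees who actually left. But most importantly for today, how much CO2 did our model emit? Hint: look at the output printed when the tracker was stopped...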

***
***

## 4. Emissions

When the EmissionsTracker is stopped, it saves its measurements as a row in the file `emissions.csv` in your working directory.

In [13]:
emissions_df = pd.read_csv("emissions.csv")
emissions_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 31 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   timestamp           5 non-null      object 
 1   project_name        5 non-null      object 
 2   run_id              5 non-null      object 
 3   duration            5 non-null      float64
 4   emissions           5 non-null      float64
 5   emissions_rate      5 non-null      float64
 6   cpu_power           5 non-null      float64
 7   gpu_power           5 non-null      float64
 8   ram_power           5 non-null      float64
 9   cpu_energy          5 non-null      float64
 10  gpu_energy          5 non-null      float64
 11  ram_energy          5 non-null      float64
 12  energy_consumed     5 non-null      float64
 13  country_name        5 non-null      object 
 14  country_iso_code    5 non-null      object 
 15  region              1 non-null      object 
 16  cloud_provid

You can read more about the columns, what they represent and their format, at: https://mlco2.github.io/codecarbon/output.html#csv 
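As a rough worked example of how the power, energy and emissions columns relate, the sketch below redoes the arithmetic for the Apple M1 Pro run logged in this notebook (5.0 W CPU, 6.0 W RAM, no GPU, a recorded duration of 2.404702 s, and 1.6262311401552623e-06 kg returned by `tracker.stop()`). It assumes constant power over the whole run. The carbon intensity is only the value implied by these numbers, not an official figure.

```python
# Values logged for the Apple M1 Pro run in this notebook
cpu_power_w = 5.0
ram_power_w = 6.0
duration_s = 2.404702
emissions_kg = 1.6262311401552623e-06

JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 million joules (watt-seconds)

# Energy = power x time, converted from joules to kWh
cpu_energy_kwh = cpu_power_w * duration_s / JOULES_PER_KWH
ram_energy_kwh = ram_power_w * duration_s / JOULES_PER_KWH
energy_consumed_kwh = cpu_energy_kwh + ram_energy_kwh  # no GPU in this run

# Emissions = energy x carbon intensity, so the intensity can be back-computed
implied_intensity_kg_per_kwh = emissions_kg / energy_consumed_kwh

print(f"CPU energy:   {cpu_energy_kwh:.6f} kWh")
print(f"RAM energy:   {ram_energy_kwh:.6f} kWh")
print(f"Total energy: {energy_consumed_kwh:.6f} kWh")
print(f"Implied carbon intensity: {implied_intensity_kg_per_kwh:.3f} kg CO2eq/kWh")
```

Rounded to six decimals, the three energies match the values CodeCarbon printed when the tracker was stopped.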

In [14]:
# Show all columns of the dataframe instead of truncating the display
pd.set_option('display.max_columns', None) 
emissions_df

Unnamed: 0,timestamp,project_name,run_id,duration,emissions,emissions_rate,cpu_power,gpu_power,ram_power,cpu_energy,gpu_energy,ram_energy,energy_consumed,country_name,country_iso_code,region,cloud_provider,cloud_region,os,python_version,codecarbon_version,cpu_count,cpu_model,gpu_count,gpu_model,longitude,latitude,ram_total_size,tracking_mode,on_cloud,pue
0,2024-02-05T21:41:36,codecarbon,37ef3e50-1e90-4bae-a6cd-25a6dfe35872,2.570896,1.3e-05,5.236615e-06,42.5,36.764062,5.977035,3e-05,2.6e-05,4e-06,6.1e-05,Denmark,DNK,capital region,,,Windows-10-10.0.19045-SP0,3.11.7,2.3.4,8,Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz,1.0,1 x NVIDIA GeForce GTX 1080,12.4809,55.674,15.938759,machine,N,1.0
1,2024-03-01T11:23:43,codecarbon,f45d1a95-3541-4200-8e8e-63c17cad077d,4.338811,1.2e-05,2.877786e-06,42.5,0.0,4.33449,5.1e-05,0.0,5e-06,5.6e-05,Denmark,DNK,,,,Linux-6.5.0-21-generic-x86_64-with-glibc2.35,3.10.12,2.3.4,4,Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz,,,12.0564,55.7123,11.55864,machine,N,1.0
2,2024-03-01T11:31:21,codecarbon,a797f0cb-dc2b-41c0-82f6-522ae48cd31a,329.024675,0.000947,2.879003e-06,42.5,0.0,4.33449,0.003882,0.0,0.000396,0.004278,Denmark,DNK,,,,Linux-6.5.0-21-generic-x86_64-with-glibc2.35,3.10.12,2.3.4,4,Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz,,,12.0564,55.7123,11.55864,machine,N,1.0
3,2024-03-04T10:12:43,codecarbon,afcb5316-9e17-4e2d-8a4b-713663795217,3.755695,1.2e-05,3.318624e-06,42.5,0.0,11.538693,4.4e-05,0.0,1.2e-05,5.6e-05,Denmark,DNK,,,,Windows-10-10.0.19045-SP0,3.12.1,2.3.4,16,AMD Ryzen 7 PRO 6850U with Radeon Graphics,,,12.0564,55.7123,30.769848,machine,N,1.0
4,2024-03-08T14:24:45,codecarbon,2e8d0e65-c7c8-4764-90e6-7a0315ab4215,2.404702,2e-06,6.762714e-07,5.0,0.0,6.0,3e-06,0.0,4e-06,7e-06,Denmark,DNK,,,,macOS-14.4-arm64-arm-64bit,3.10.12,2.3.4,10,Apple M1 Pro,,,12.0564,55.7123,16.0,machine,N,1.0


N.B.: the measurements from your own run are saved in the last row of the dataframe.
The other rows come from previous runs.
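A small sketch of how the logged runs could be inspected with pandas. To stay self-contained it rebuilds a reduced dataframe from four of the columns shown above (values as displayed, so the emissions are rounded) instead of reading `emissions.csv`. The names `runs`, `latest`, `highest` and the `emissions_mg` column are hypothetical.

```python
import pandas as pd

# Reduced copy of the five runs displayed above (emissions in kg CO2eq, rounded)
runs = pd.DataFrame({
    "timestamp": [
        "2024-02-05T21:41:36", "2024-03-01T11:23:43", "2024-03-01T11:31:21",
        "2024-03-04T10:12:43", "2024-03-08T14:24:45",
    ],
    "duration": [2.570896, 4.338811, 329.024675, 3.755695, 2.404702],
    "emissions": [1.3e-05, 1.2e-05, 0.000947, 1.2e-05, 2e-06],
    "cpu_model": [
        "Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz",
        "Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz",
        "Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz",
        "AMD Ryzen 7 PRO 6850U with Radeon Graphics",
        "Apple M1 Pro",
    ],
})

# Kilograms are awkward at this scale, so add a column in milligrams
runs["emissions_mg"] = runs["emissions"] * 1e6

# The most recent run is the last row
latest = runs.iloc[-1]
# The run with the highest emissions (here also the longest one)
highest = runs.sort_values("emissions", ascending=False).iloc[0]

print(latest[["timestamp", "cpu_model", "emissions_mg"]])
print(highest[["timestamp", "duration", "emissions_mg"]])
```

On your own machine the same steps can be applied directly to `emissions_df`.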

***
***
***

## Your turn 🚀

#### We encourage you to try the EmissionsTracker on models in your own project or on models from previous exercise sessions. Try out different classification models, with/without hyperparameter tuning, run your model on different computers, etc.

Link to the Quickstart guide for CodeCarbon: https://mlco2.github.io/codecarbon/usage.html