## Data transformation

This is the second automatically graded exercise for JODA. The objective here is to get our hands dirty with data. 

The context of this particular analysis is a fictional company that routinely runs different machine learning operations. 

We have generated a dataset that has the following columns or properties (to be engineered into features):

* Date
* Department
* ML Task ID
* ML Method
* Task Category
* Model Complexity (Parameters)
* Training Data Size (GB)
* Training Duration (Hours)
* Hardware Used
* Energy Consumption (kWh)
* CO2 Emissions (Kg)
* Cloud Provider

Moreover, there is a secondary dataset that includes information about the energy sources for the different cloud providers:

* Cloud Provider    
* Green Energy


#### Install the required packages using requirements.txt 

pip install -r requirements.txt 

#### Import the needed packages  

In [563]:
import pandas as pd

#### Read the two data files

In [564]:
emissions = pd.read_excel("./data/co2-emissions.xlsx")
cloud_providers = pd.read_excel("./data/cloud-providers.xlsx")

#### Joining the two data frames to add information about the energy sources that the cloud providers use.

In [565]:
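# pd.read_excel already returns DataFrames, so these wrappers are redundant but harmless
emissions_df = pd.DataFrame(emissions)
cloud_providers_df = pd.DataFrame(cloud_providers)

# Left join: keep every emissions row, even if its provider is missing from the lookup table
merged_df = pd.merge(emissions_df, cloud_providers_df, on="Cloud Provider", how="left")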
# print(merged_df)
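
Because the join is a left join, an emissions row whose provider is absent from the provider table keeps its place but gets a missing `Green Energy` value. The following is a minimal sketch of how such rows could be spotted. The toy frames, provider names and numbers are made up for illustration and are not taken from the exercise data.

```python
import pandas as pd

# Toy stand-ins for the two Excel sheets; all names and values are hypothetical.
emissions_toy = pd.DataFrame({
    "Department": ["R&D", "Sales", "R&D"],
    "Cloud Provider": ["ProviderA", "ProviderB", "ProviderC"],
    "CO2 Emissions (Kg)": [10.0, 5.0, 2.5],
})
providers_toy = pd.DataFrame({
    "Cloud Provider": ["ProviderA", "ProviderB"],
    "Green Energy": ["Green", "Hybrid"],
})

# validate="many_to_one" raises an error if a provider is listed twice in the lookup table.
# indicator=True adds a "_merge" column telling where each row found its match.
checked = pd.merge(
    emissions_toy,
    providers_toy,
    on="Cloud Provider",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Rows that exist only on the left side had no matching provider.
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["Department", "Cloud Provider"]])
```

If `unmatched` is empty on the real data, every emissions row received an energy category.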

#### Aggregating the data to department level. Each row represents the aggregated values for each department.

In [566]:
# Group by department, ML method, and green energy category, then sum CO2 emissions
co2_distribution = merged_df.groupby(['Department', 'ML Method', 'Green Energy'])['CO2 Emissions (Kg)'].sum().reset_index()

# Pivot the data to have departments and ML methods as rows and green energy categories as columns
co2_distribution_pivot = co2_distribution.pivot_table(index=['Department', 'ML Method'], columns='Green Energy', values='CO2 Emissions (Kg)', aggfunc='sum').fillna(0)

# Reset index to make Department and ML Method regular columns
co2_distribution_pivot.reset_index(inplace=True)

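# Rename columns for better clarity
# (positional assignment: assumes the pivot produced exactly three Green Energy columns, in this order)
co2_distribution_pivot.columns = ['Department', 'ML Method', 'Green', 'Hybrid', 'Unknown']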

# Calculate the total CO2 emissions for each department and ML method
co2_distribution_pivot['CO2 Emissions (Kg)'] = co2_distribution_pivot['Green'] + co2_distribution_pivot['Hybrid'] + co2_distribution_pivot['Unknown']

# Reorder the columns
co2_distribution_pivot = co2_distribution_pivot[['Department', 'CO2 Emissions (Kg)', 'ML Method', 'Green', 'Hybrid', 'Unknown']]

# print(co2_distribution_pivot)
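
Assigning column names by position only works when the pivot yields exactly the expected categories. A more defensive variant could look like the sketch below, which assumes the raw `Green Energy` labels are literally `Green`, `Hybrid` and `Unknown` (the exercise data may use other labels). The toy rows are invented for illustration.

```python
import pandas as pd

# Hypothetical rows; note that no row has the "Unknown" category.
toy = pd.DataFrame({
    "Department": ["R&D", "R&D", "Sales"],
    "ML Method": ["CNN", "CNN", "SVM"],
    "Green Energy": ["Green", "Hybrid", "Green"],
    "CO2 Emissions (Kg)": [10.0, 4.0, 3.0],
})

pivot = toy.pivot_table(
    index=["Department", "ML Method"],
    columns="Green Energy",
    values="CO2 Emissions (Kg)",
    aggfunc="sum",
)

# reindex guarantees all three columns exist, in a fixed order,
# even when a category never occurs in the data.
categories = ["Green", "Hybrid", "Unknown"]  # assumed labels
pivot = pivot.reindex(columns=categories, fill_value=0).fillna(0).reset_index()
pivot.columns.name = None  # drop the leftover "Green Energy" axis label

print(pivot)
```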

#### Calculate the total of CO2 emissions for each department

In [567]:
total_co2_emissions = co2_distribution_pivot.groupby('Department')['CO2 Emissions (Kg)'].sum().reset_index()

# print(total_co2_emissions)

#### Renaming the column CO2 Emissions (Kg)

In [568]:
co2_distribution_pivot.rename(columns={"CO2 Emissions (Kg)": "co2_emissions_kg"}, inplace=True)
# print(co2_distribution_pivot)

#### Creating a function that picks the most common value in a pandas Series object

In [569]:
def pick_most_frequent(values):
    # Convert the input to a pandas Series if it's not already one
    if not isinstance(values, pd.Series):
        values = pd.Series(values)

    # mode() returns every most-frequent value, sorted in ascending order
    mode_values = values.mode()

    # If there are multiple modes, return the first (smallest) one;
    # iloc[0] selects by position, independent of the index labels
    return mode_values.iloc[0]
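
A short usage sketch shows how ties are resolved. The function is restated in compact form so the snippet runs on its own, and the method names are arbitrary sample values. Since `Series.mode()` returns the modes in sorted order, a tie goes to the value that sorts first, not to the one that appears first.

```python
import pandas as pd

def pick_most_frequent(values):
    # Compact restatement of the function above
    return pd.Series(values).mode().iloc[0]

clear_winner = pick_most_frequent(["SVM", "CNN", "SVM"])
tie = pick_most_frequent(["SVM", "CNN"])

print(clear_winner)  # SVM
print(tie)           # CNN (both occur once; "CNN" sorts before "SVM")
```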


#### Picking the most frequent ML method for each department.

In [570]:
# Group by department and apply pick_most_frequent to ML Method
most_frequent_ml_method = merged_df.groupby('Department')['ML Method'].agg(pick_most_frequent).reset_index()

# print(most_frequent_ml_method)

#### Sorting the DataFrame so that the department with the largest emissions comes first

In [571]:
# Calculate total CO2 emissions for each department
total_co2_emissions = co2_distribution_pivot.groupby('Department')['co2_emissions_kg'].sum().reset_index()

# Merge the most frequent ML method DataFrame with total CO2 emissions DataFrame
result = pd.merge(most_frequent_ml_method, total_co2_emissions, on='Department')

# Sort the DataFrame by CO2 emissions in descending order
result = result.sort_values(by='co2_emissions_kg', ascending=False)

# print(result)

#### Calculating the CO2 emissions for each department in different Green Energy categories

In [572]:
# Group by department and sum the CO2 emissions in the Green category
co2_emissions_by_department = co2_distribution_pivot.groupby('Department')['Green'].sum().reset_index()

# print(co2_emissions_by_department)

#### Calculating the sum of emissions per energy type for every department

In [573]:
# Calculate sum of Hybrid and Unknown emissions by Department
co2_emissions_hybrid = co2_distribution_pivot.groupby('Department')['Hybrid'].sum().reset_index()
co2_emissions_unknown = co2_distribution_pivot.groupby('Department')['Unknown'].sum().reset_index()

# Merge with co2_emissions_by_department DataFrame
co2_emissions_by_department = pd.merge(co2_emissions_by_department, co2_emissions_hybrid, on='Department')
co2_emissions_by_department = pd.merge(co2_emissions_by_department, co2_emissions_unknown, on='Department')

# Merge the result to the final DataFrame (Department is the only shared column)
final_df = pd.merge(result, co2_emissions_by_department, on='Department')

# print(final_df)
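
The department-level table built in the cells above can also be produced more compactly. The sketch below is one possible alternative, not the reference solution: it uses named aggregation plus a single pivot on invented toy data, and the output column name `ml_method` is a hypothetical choice.

```python
import pandas as pd

# Hypothetical merged data with the same column names as merged_df
toy = pd.DataFrame({
    "Department": ["R&D", "R&D", "R&D", "Sales"],
    "ML Method": ["CNN", "CNN", "SVM", "SVM"],
    "Green Energy": ["Green", "Hybrid", "Green", "Unknown"],
    "CO2 Emissions (Kg)": [10.0, 4.0, 1.0, 3.0],
})

# One groupby yields both the most frequent method and the total emissions.
summary = toy.groupby("Department").agg(
    ml_method=("ML Method", lambda s: s.mode().iloc[0]),
    co2_emissions_kg=("CO2 Emissions (Kg)", "sum"),
)

# One pivot yields the per-category emissions for each department.
by_energy = toy.pivot_table(
    index="Department",
    columns="Green Energy",
    values="CO2 Emissions (Kg)",
    aggfunc="sum",
    fill_value=0,
)

# Both frames are indexed by Department, so a plain join lines them up.
combined = (
    summary.join(by_energy)
    .sort_values("co2_emissions_kg", ascending=False)
    .reset_index()
)
print(combined)
```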


#### Making sure the results folder exists

In [574]:
import os

def ensure_folder_exists(folder_path):
    if not os.path.exists(folder_path):
        os.makedirs(folder_path)
        print(f"Folder '{folder_path}' created.")
    else:
        print(f"Folder '{folder_path}' already exists.")

ensure_folder_exists('results')

Folder 'results' already exists.
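
The same check can be written without the explicit `if`/`else`. This is a sketch using the standard library's `pathlib`, run inside a temporary directory here so it does not touch the real `results` folder.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    results_dir = Path(tmp) / "results"

    # exist_ok=True makes the call safe to repeat; parents=True creates
    # any missing intermediate folders as well.
    results_dir.mkdir(parents=True, exist_ok=True)
    results_dir.mkdir(parents=True, exist_ok=True)  # second call is a no-op

    created = results_dir.is_dir()

print(created)
```

The trade-off is that this version no longer reports whether the folder was newly created.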


#### Save the results to Excel and Pickle

In [575]:
final_df.to_excel('results/department_co2.xlsx', index=False)
final_df.to_pickle('results/department_co2.pkl')
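
A quick way to confirm that a saved file is usable is to load it back and compare it with the original. The sketch below does this for the pickle format only, on a made-up two-row frame and in a temporary directory, so the file path differs from the one used above.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for final_df
toy_final = pd.DataFrame({
    "Department": ["R&D", "Sales"],
    "co2_emissions_kg": [15.0, 3.0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "department_co2.pkl")
    toy_final.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() compares values, dtypes and index in one call
round_trip_ok = restored.equals(toy_final)
print(round_trip_ok)
```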