![](https://imgur.com/AcU0UOv.jpg)

# CDP: Unlocking Climate Solutions - KPIs Proposal

This notebook proposes several Key Performance Indicators (KPIs) derived from the analysis of the three questionnaires made available by the CDP: the cities, corporations and water questionnaires.

The KPIs we propose can be grouped as follows:
* [Objective tracking KPIs](#section-one): these KPIs assess whether the objective set by the city/company is reasonable and achievable given historical performance, or whether it is too easy for the indicated time frame. They also show whether a company/city is on track to achieve its goals.
* [Risk/Opportunities assessment KPIs](#section-two): these KPIs are aimed at assessing exposure to a risk/opportunity given the probability of its occurrence, its severity and a time horizon. KPIs in this section can easily be enhanced or cross-referenced with external data to add further dimensions; for example, we use the Social Vulnerability Index and the ND-GAIN datasets to add insight into the potential impact of a climate-related event on a city.
* [Return on Investment KPIs](#section-three): KPIs that assess the "value for money" of a given target set by a company/city. These KPIs could help identify the most efficient measures a company/city can implement to, for example, cut CO2 emissions or reduce electricity consumption (MWh).
* [Trends KPIs](#section-four): these KPIs look at trend information (e.g. CO2 emissions, water consumption) and monitor how well or badly a company/city is doing in controlling polluting behaviours.
* [City assessment KPIs](#section-five): KPIs monitoring some of the main environmental themes for a city, including green energy usage, transportation preferences, waste produced, eating habits and CO2 emissions (e.g. consumption of meat & dairy, number of private cars per citizen, waste production per citizen or per square km).
* [Corporate assessment KPIs](#section-six): KPIs monitoring efficiency and consumption in terms of energy, CO2 emissions efficiency and consumption/production of green energy. This is also produced with a country level view. 
* [Water Management assessment KPIs](#section-seven): several KPIs built on water consumption and water management, also with a country/facility level view. 

**--> All the KPIs we produce can be used either as a raw indicator or as a "ranking factor" to order the cities/companies and identify best/worst in class.**
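
As a rough illustration of the "ranking factor" idea, the sketch below turns a raw indicator into a percentile rank with pandas. The company names and values are made up for the example and are not CDP data.

```python
import pandas as pd

# Hypothetical raw KPI values for four made-up companies (not CDP data).
kpi = pd.DataFrame({
    "organization": ["A", "B", "C", "D"],
    "pct_renewables": [0.10, 0.80, 0.45, 0.45],
})

# Percentile rank: higher raw value -> higher rank; ties share the average rank.
kpi["KPI_rank"] = kpi["pct_renewables"].rank(pct=True)

# Order from best to worst in class.
ranked = kpi.sort_values("KPI_rank", ascending=False).reset_index(drop=True)
print(ranked)
```

For indicators where a lower raw value is better (for example emissions per MWh), the rank can be flipped with `1 - rank`, which is the convention the project code uses for its efficiency KPIs.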

<a id="section-one"></a>
# Objective tracking KPIs

In this section the objective tracking KPIs will be described. The table below summarizes the main KPIs we propose for this section and the tables within the questionnaires they can be applied to:

<html>
<head>
<style>
table, th, td {
  border: 1px solid black;
}
table {
  width: 100%;
} 
</style>
</head>
<body>
   
    
<table>
   
  <tr>
    <th style="text-align:center; width:130px" bgcolor="#ff9999">KPI type</th>
    <th style="text-align:center; width:200px" bgcolor="#ff9999">Formula/methodology</th>
    <th style="text-align:center; width:300px" bgcolor="#ff9999">Interpretation</th>
    <th style="text-align:center; width:100px" bgcolor="#ff9999">Cities</th>
    <th style="text-align:center; width:100px" bgcolor="#ff9999">Corporations</th>
    <th style="text-align:center; width:100px" bgcolor="#ff9999">Water</th>
  </tr>
  <tr>
    <td style="text-align:right" BGcolor=#ffcccc><b>Objective strategy</b></td>
    <td style="text-align:right" BGcolor=#ffcccc>Objective strategy = Years set to achieve the target - Actual years it should take to achieve it
<br>Where:
<br>Years set to achieve the target: the total number of years a company/city has given itself to achieve the target
<br>Actual years: the years it should take to achieve the target based on performance so far (e.g. emissions cut per year since the objective was set)</td>
    <td style="text-align:right" BGcolor=#ffcccc>This KPI assesses how reasonable the objective is, based on current performance.
A large negative value means that the objective will be hard to achieve given historical performance (e.g. a company wants to cut emissions by 50% in 5 years and after 4 years has only cut 10%).
A positive value means that the objective might be too easy (e.g. a company gives itself 10 years to cut emissions by 50% but after 2 years it has already cut 45%). Balanced objectives sit around zero for this metric.</td>
   <td style="text-align:right" BGcolor=#ffcccc>5.0a</td>
   <td style="text-align:right" BGcolor=#ffcccc>4.1a</td>
   <td style="text-align:right" BGcolor=#ffcccc>8.1a</td>
  </tr>
  <tr>
   <td style="text-align:right" BGcolor=#ffcccc><b>Objective ambition</b></td>
   <td style="text-align:right" BGcolor=#ffcccc>Objective ambition = Emissions cuts targeted by the objective / Total covered emissions in base year</td>
   <td style="text-align:right" BGcolor=#ffcccc>A measure of how ambitious a company is in setting the objective (e.g. a company wanting to cut 100% of its covered emissions can be considered more ambitious than a company cutting only 10%)</td>
   <td style="text-align:right" BGcolor=#ffcccc>5.0a</td>
   <td style="text-align:right" BGcolor=#ffcccc>4.1a</td>
   <td style="text-align:right" BGcolor=#ffcccc>8.1a</td>
  </tr>
   <tr>
    <td style="text-align:right" BGcolor=#ffcccc><b>Objective status</b></td>
    <td style="text-align:right" BGcolor=#ffcccc>Objective status = if Objective strategy &lt; 0 the objective is "Behind target"; if Objective strategy &gt; 0 the objective is "On target"</td>
    <td style="text-align:right" BGcolor=#ffcccc>A binary outcome to check if objectives are on track or not, based on the "Objective Strategy" KPI above. </td>
    <td style="text-align:right" BGcolor=#ffcccc>5.0a/ 8.0a/ 8.5a</td>
    <td style="text-align:right" BGcolor=#ffcccc>4.1a</td>
    <td style="text-align:right" BGcolor=#ffcccc>8.1a</td>   
  </tr>
    
   <tr>
    <td style="text-align:right" BGcolor=#ffcccc><b>Objective progress</b></td>
    <td style="text-align:right" BGcolor=#ffcccc>Objective progress = % achieved against an objective</td>
    <td style="text-align:right" BGcolor=#ffcccc>A rank-order measure by company/city showing which ones are closest to achieving their objectives. For certain tables this is weighted by objective size when grouping by company (e.g. the emissions being cut under that specific objective)</td>
    <td style="text-align:right" BGcolor=#ffcccc>5.0a/ 8.0a/ 8.5a</td>
    <td style="text-align:right" BGcolor=#ffcccc>4.1a</td>
    <td style="text-align:right" BGcolor=#ffcccc>8.1a</td>
  </tr>
    
</table>
</body>
</html>
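
To make the "Objective strategy" formula concrete, here is a minimal sketch that reproduces the two examples from the interpretation column. The function name and its arguments are hypothetical and chosen only for illustration; it assumes progress continues at the average pace observed so far, and it ignores the inclusive year counting (`+1`) used in the project code.

```python
def objective_strategy(target_years, years_elapsed, pct_of_target_achieved):
    """Years set for the target minus the years needed at the pace observed so far.

    Hypothetical helper for illustration only.
    pct_of_target_achieved is the share of the target already met, in percent.
    """
    if pct_of_target_achieved <= 0 or years_elapsed <= 0:
        return None  # no measurable progress yet, so the pace cannot be extrapolated
    # Linear extrapolation: if X% of the target took N years, 100% takes N / (X / 100) years.
    actual_years = years_elapsed / (pct_of_target_achieved / 100)
    return target_years - actual_years


# 50% cut in 5 years, only 10% cut after 4 years -> 20% of the target achieved.
hard = objective_strategy(target_years=5, years_elapsed=4, pct_of_target_achieved=20)

# 50% cut in 10 years, already 45% cut after 2 years -> 90% of the target achieved.
easy = objective_strategy(target_years=10, years_elapsed=2, pct_of_target_achieved=90)

print(round(hard, 2))  # -15.0 (large negative: hard to achieve at the current pace)
print(round(easy, 2))  # 7.78 (positive: the objective may be too easy)
```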


Before we start we will show how we have re-organised the data to build the KPIs. This approach is aimed at re-constructing the data in the same way the corporation/city has visualised and entered the information. After trying several approaches we found this to be the easiest way to understand the contents of the questionnaires. 
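
The re-organisation is essentially a long-to-wide pivot: each questionnaire answer is stored as one row, and pivoting puts every table column back side by side. A minimal sketch on toy data follows; the organization, column names and values are invented for illustration, and passing a list to `index` assumes pandas 1.1 or later.

```python
import pandas as pd

# Toy long-format extract: one row per answered cell of a questionnaire table.
long_df = pd.DataFrame({
    "account_number": [1, 1, 1, 1],
    "organization": ["Example Corp"] * 4,
    "row_number": [1, 1, 2, 2],
    "column_name": ["Target year", "Targeted reduction (%)"] * 2,
    "response_value": ["2025", "50", "2030", "30"],
})

# Pivot so that each questionnaire column becomes a DataFrame column.
wide_df = long_df.pivot(
    index=["account_number", "organization", "row_number"],
    columns="column_name",
    values="response_value",
).reset_index()
wide_df.columns.name = None

# Responses are stored as strings; convert the numeric ones before computing KPIs.
for col in ["Target year", "Targeted reduction (%)"]:
    wide_df[col] = pd.to_numeric(wide_df[col], errors="coerce")

print(wide_df)
```

Each row of the result corresponds to one row of the table as the respondent filled it in, which is the layout the KPI calculations work on.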

Below is an example for table 4.1a of the corporate questionnaire for "Celestica Inc.". This is how we re-organize the data to build the KPIs:

In [None]:
### For simplicity we present below the main project code, which will create all the KPIs discussed throughout the notebook.
### The code starts with the tables of the corporate questionnaire and then moves to water and finally the cities. Putting all the code here makes it simple
### for us to call different KPIs throughout the notebook. The underlying process which we use to analyse the tables and create the KPIs is always the same:
### import the table from the main dataset, remove unnecessary columns and prepare for the pivot, pivot the table, convert strings to floats where
### necessary and create the KPI.


### IMPORTS
# Python packages
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from pandas_profiling import ProfileReport # quick EDA
import matplotlib.pyplot as plt # plotting
from math import pi
import string
import seaborn as sns # plotting
from shapely.geometry import Point # To represent and analyze geo coordinates
import functools 
import operator
import warnings
warnings.filterwarnings('ignore')
import os
# for dirname, _, filenames in os.walk('/kaggle/input'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))

# Import files

q20 = pd.read_csv('../input/cdp-unlocking-climate-solutions/Corporations/Corporations Responses/Climate Change/2020_Full_Climate_Change_Dataset.csv', low_memory=False)
w20 = pd.read_csv('../input/cdp-unlocking-climate-solutions/Corporations/Corporations Responses/Water Security/2020_Full_Water_Security_Dataset.csv', low_memory=False)
c20 = pd.read_csv('../input/cdp-unlocking-climate-solutions/Cities/Cities Responses/2020_Full_Cities_Dataset.csv', low_memory=False)
SVI_C = pd.read_csv('../input/cdp-unlocking-climate-solutions/Supplementary Data/CDC Social Vulnerability Index 2018/SVI2018_US_COUNTY.csv', low_memory=False)
SVI = pd.read_csv('../input/cdp-unlocking-climate-solutions/Supplementary Data/CDC Social Vulnerability Index 2018/SVI2018_US.csv', low_memory=False)
disc = pd.read_csv('../input/cdp-unlocking-climate-solutions/Cities/Cities Disclosing/2020_Cities_Disclosing_to_CDP.csv', low_memory=False)
ND_DROUGHT = pd.read_csv('../input/ndgain-uaa-dataset/UAA Data/Drought Data.csv', low_memory=False)

## Create folder to store KPI df:

# exist_ok=True keeps re-runs from failing, and each folder is created independently
for sub_folder in ['Climate', 'Water', 'Cities', 'Final']:
    os.makedirs(os.path.join('/kaggle/working/df_KPIs', sub_folder), exist_ok=True)


################################################################################################################################################################
################################################################################################################################################################
################################################################################################################################################################
###################################################                                                      #######################################################
###################################################            C O R P O R A T I O N S - 2 0 2 0         #######################################################
###################################################                                                      #######################################################
################################################################################################################################################################
################################################################################################################################################################
################################################################################################################################################################

############################################################### TABLE - 4.1a Emission reduction targets ########################################################

# Select questions
C41a = q20[q20['question_number']=='C4.1a']

# Remove unnecessary columns
C41a =C41a.drop(['survey_year', 'response_received_date','accounting_period_to','ors_response_id','submission_date','page_name','module_name',
          'question_number','question_unique_reference','data_point_name','data_point_id','comments','row_name','table_columns_unique_reference','column_number'], axis=1)

## Pivot
C41a=C41a.pivot(index=['account_number','organization','row_number'], columns='column_name', values='response_value').reset_index()
C41a=C41a.sort_values(by=['account_number','organization','row_number'])

## Select columns
C41a=C41a[['account_number','organization','row_number','C4.1a_C1Target reference number', 'C4.1a_C2Year target was set','C4.1a_C4Scope(s) (or Scope 3 category)',
     'C4.1a_C5Base year','C4.1a_C6Covered emissions in base year (metric tons CO2e)','C4.1a_C7Covered emissions in base year as % of total base year emissions in selected Scope(s) (or Scope 3 category)',
     'C4.1a_C8Target year','C4.1a_C9Targeted reduction from base year (%)','C4.1a_C12% of target achieved [auto-calculated]','C4.1a_C13Target status in reporting year',
     'C4.1a_C14Is this a science-based target?','C4.1a_C15Please explain (including target coverage)']]

# Re set col type
C41a['C4.1a_C8Target year'] = C41a['C4.1a_C8Target year'].astype(float)
C41a['C4.1a_C2Year target was set'] = C41a['C4.1a_C2Year target was set'].astype(float)
C41a['C4.1a_C12% of target achieved [auto-calculated]'] = C41a['C4.1a_C12% of target achieved [auto-calculated]'].astype(float)
C41a['C4.1a_C9Targeted reduction from base year (%)'] = C41a['C4.1a_C9Targeted reduction from base year (%)'].astype(float)
C41a['C4.1a_C6Covered emissions in base year (metric tons CO2e)'] = C41a['C4.1a_C6Covered emissions in base year (metric tons CO2e)'].astype(float)

# Calculate new cols
C41a['target_years']=C41a['C4.1a_C8Target year'] - C41a['C4.1a_C2Year target was set']+1
C41a['Emissions_reduction_obj']= (C41a['C4.1a_C9Targeted reduction from base year (%)']/100)*C41a['C4.1a_C6Covered emissions in base year (metric tons CO2e)']
C41a['Emissions_reduction_achieved']= (C41a['C4.1a_C12% of target achieved [auto-calculated]']/100)*C41a['C4.1a_C6Covered emissions in base year (metric tons CO2e)']
C41a['Percentage_reduction_per_year']= (C41a['C4.1a_C9Targeted reduction from base year (%)']/100)/C41a['target_years']

C41a['years_left']= C41a['C4.1a_C8Target year']-2020

C41a['achiev_per_year'] = (C41a['C4.1a_C9Targeted reduction from base year (%)'])*(C41a['C4.1a_C12% of target achieved [auto-calculated]']/100)/(2020- C41a['C4.1a_C2Year target was set']+1)/100
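# Note: this ratio is algebraically 1 wherever it is defined (Emissions_reduction_obj = covered emissions x targeted %); it is not used in the KPIs below
C41a['effective_years_to_achiev'] = C41a['Emissions_reduction_obj']/ (C41a['C4.1a_C6Covered emissions in base year (metric tons CO2e)']*(C41a['C4.1a_C9Targeted reduction from base year (%)']/100))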
C41a['Emissions_reduction per year'] = C41a['Emissions_reduction_obj']/C41a['target_years']
C41a['Actual_emissions_cut_per_year'] = (C41a['Emissions_reduction_obj']*(C41a['C4.1a_C12% of target achieved [auto-calculated]']/100))/(2020- C41a['C4.1a_C2Year target was set']+1)
C41a['Actual_years_to_achiev'] = C41a['Emissions_reduction_obj']/C41a['Actual_emissions_cut_per_year']
C41a['Years_diff'] = C41a['target_years']-C41a['Actual_years_to_achiev']
# Assign the result back (in-place replace on a selected column is unreliable); -np.inf replaces np.NINF, which was removed in NumPy 2.0
C41a['Years_diff'] = C41a['Years_diff'].replace([np.inf, -np.inf], np.nan)
C41a.loc[(C41a['C4.1a_C12% of target achieved [auto-calculated]']<0),'Years_diff'] = np.nan 
C41a['KPI_Strategy']=np.nan 

C41a.loc[(C41a['Years_diff']>C41a['Years_diff'].std()) | (C41a['Years_diff']<-C41a['Years_diff'].std()), 'Years_diff'] = np.nan
C41a['Years_diff'] = C41a['Years_diff'].replace([np.inf, -np.inf], 0)


## Filter down Q're for KPI creation  
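# .copy() so that the .loc assignments below modify an independent DataFrame rather than a view of C41a
C41a_KPI=C41a[['account_number','organization','row_number', 'C4.1a_C6Covered emissions in base year (metric tons CO2e)', 'Emissions_reduction_achieved','Emissions_reduction_obj','Years_diff']].copy()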

# Do a groupby and weight by emission reduction objective
wm = lambda x: np.average(x, weights=C41a_KPI.loc[x.index, 'Emissions_reduction_obj'])

# C41a_KPI.loc[C41a_KPI['Years_diff'].isna(), 'Years_diff'] = 0.00000001
# C41a_KPI.loc[C41a_KPI['Emissions_reduction_obj'].isna(), 'Emissions_reduction_obj'] = 0.00000001
C41a_KPI.loc[C41a_KPI['Years_diff']==0, 'Years_diff'] = 0.00000001
C41a_KPI.loc[C41a_KPI['Emissions_reduction_obj']==0, 'Emissions_reduction_obj'] = 0.00000001

C41a_KPI=C41a_KPI.groupby(['account_number','organization'], as_index=False).agg(
    Emissions_reduction_achieved=('Emissions_reduction_achieved', 'sum'),
    Covered_emissions= ('C4.1a_C6Covered emissions in base year (metric tons CO2e)', 'sum'),
    Emissions_reduction_obj=('Emissions_reduction_obj', 'sum'),
    Years_diff=('Years_diff', wm)).dropna().reset_index()

# Create final features
C41a_KPI['Percentage_obj_total'] = C41a_KPI.Emissions_reduction_obj / C41a_KPI.Covered_emissions
C41a_KPI['Percentage_obj_achieved'] = C41a_KPI.Emissions_reduction_achieved / C41a_KPI.Covered_emissions

# Create ranking KPIs
C41a_KPI['KPI_rank_objective_strategy'] = C41a_KPI['Years_diff'].rank(pct=True)
C41a_KPI['KPI_rank_objective_ambition'] = C41a_KPI['Percentage_obj_total'].rank(pct=True)
C41a_KPI['KPI_rank_objective_progress'] = C41a_KPI['Percentage_obj_achieved'].rank(pct=True)

C41a_KPI = C41a_KPI.sort_values(by=['KPI_rank_objective_strategy']).reset_index()


## Save KPI df to folder
C41a_KPI.to_csv('df_KPIs/Climate/C41a_KPI.csv')

################################################### TABLE 7.5C1 global Scope 2 emissions by country/region #####################################################

# Location vs market based: http://www.trackmyelectricity.com/learn-more/scope-2-guidance-on-energy-emissions-reporting

## Prepare the df
C75 = q20[q20['question_number']=='C7.5']
C75 =C75.drop(['survey_year', 'response_received_date','accounting_period_to','ors_response_id','submission_date','page_name','module_name',
          'question_number','question_unique_reference','data_point_name','data_point_id','comments','row_name','table_columns_unique_reference','column_number'], axis=1)
C75=C75.pivot(index=['account_number','organization','row_number'], columns='column_name', values='response_value').reset_index()
C75=C75.sort_values(by=['account_number','organization','row_number'])
C75=C75.dropna()

# Clean cols
C75['C7.5_C2Scope 2, location-based (metric tons CO2e)'] = C75['C7.5_C2Scope 2, location-based (metric tons CO2e)'].astype(float)
C75['C7.5_C3Scope 2, market-based (metric tons CO2e)'] = C75['C7.5_C3Scope 2, market-based (metric tons CO2e)'].astype(float)
C75['C7.5_C4Purchased and consumed electricity, heat, steam or cooling (MWh)'] = C75['C7.5_C4Purchased and consumed electricity, heat, steam or cooling (MWh)'].astype(float)
C75['C7.5_C5Purchased and consumed low-carbon electricity, heat, steam or cooling accounted for in Scope 2 market-based approach (MWh)'] = C75['C7.5_C5Purchased and consumed low-carbon electricity, heat, steam or cooling accounted for in Scope 2 market-based approach (MWh)'].astype(float)

### Create KPI for country
C75_country=C75.copy()
C75_country=C75_country.drop(columns=['row_number'])

C75_country=C75_country.groupby(['C7.5_C1Country/Region']).agg({
    'account_number': 'count', 
    'C7.5_C2Scope 2, location-based (metric tons CO2e)': 'sum',
    'C7.5_C3Scope 2, market-based (metric tons CO2e)': 'sum',
    'C7.5_C4Purchased and consumed electricity, heat, steam or cooling (MWh)': 'sum',
    'C7.5_C5Purchased and consumed low-carbon electricity, heat, steam or cooling accounted for in Scope 2 market-based approach (MWh)': 'sum'
    })

C75_country=C75_country.rename(columns={'account_number': 'count_companies'})

# C75_country=C75_country.groupby(['C7.5_C1Country/Region']).sum()
C75_country['mkt_vs_loc']=C75_country['C7.5_C2Scope 2, location-based (metric tons CO2e)']-C75_country['C7.5_C3Scope 2, market-based (metric tons CO2e)']

C75_country['t_co2_per_megawatt_loc']=C75_country['C7.5_C2Scope 2, location-based (metric tons CO2e)']/(C75_country['C7.5_C4Purchased and consumed electricity, heat, steam or cooling (MWh)']+C75_country['C7.5_C5Purchased and consumed low-carbon electricity, heat, steam or cooling accounted for in Scope 2 market-based approach (MWh)'])
C75_country['t_co2_per_megawatt_loc'] = C75_country['t_co2_per_megawatt_loc'].replace(np.inf, np.nan)

C75_country['t_co2_per_megawatt_mkt']=C75_country['C7.5_C3Scope 2, market-based (metric tons CO2e)']/(C75_country['C7.5_C4Purchased and consumed electricity, heat, steam or cooling (MWh)']+C75_country['C7.5_C5Purchased and consumed low-carbon electricity, heat, steam or cooling accounted for in Scope 2 market-based approach (MWh)'])
C75_country['t_co2_per_megawatt_mkt'] = C75_country['t_co2_per_megawatt_mkt'].replace(np.inf, np.nan)


## Create KPI (0: least efficient; 1: most efficient)
C75_country['KPI_rank_energy_efficiency_mkt'] = C75_country['t_co2_per_megawatt_mkt'].rank(pct=True)
C75_country['KPI_rank_energy_efficiency_loc'] = C75_country['t_co2_per_megawatt_loc'].rank(pct=True)

C75_country['KPI_rank_energy_efficiency_mkt'] = 1-C75_country['KPI_rank_energy_efficiency_mkt']
C75_country['KPI_rank_energy_efficiency_loc'] = 1-C75_country['KPI_rank_energy_efficiency_loc']

# Sort df by KPI values
C75_country = C75_country.sort_values(by=['KPI_rank_energy_efficiency_mkt']).reset_index()


### Create KPI for company
C75_KPI=C75.copy()
C75_KPI=C75_KPI.drop(columns=['C7.5_C1Country/Region'])

# Do groupby
C75_KPI=C75_KPI.groupby(['account_number','organization'], as_index=False).agg({
    'row_number': 'max', 
    'C7.5_C2Scope 2, location-based (metric tons CO2e)': 'sum',
    'C7.5_C3Scope 2, market-based (metric tons CO2e)': 'sum',
    'C7.5_C4Purchased and consumed electricity, heat, steam or cooling (MWh)': 'sum',
    'C7.5_C5Purchased and consumed low-carbon electricity, heat, steam or cooling accounted for in Scope 2 market-based approach (MWh)': 'sum'
    })

C75_KPI=C75_KPI.rename(columns={'row_number': 'count_countries'})

# Create extra features
C75_KPI['mkt_vs_loc']=C75_KPI['C7.5_C2Scope 2, location-based (metric tons CO2e)']-C75_KPI['C7.5_C3Scope 2, market-based (metric tons CO2e)']
C75_KPI['t_co2_per_megawatt_loc']=C75_KPI['C7.5_C2Scope 2, location-based (metric tons CO2e)']/(C75_KPI['C7.5_C4Purchased and consumed electricity, heat, steam or cooling (MWh)']+C75_KPI['C7.5_C5Purchased and consumed low-carbon electricity, heat, steam or cooling accounted for in Scope 2 market-based approach (MWh)'])
C75_KPI['t_co2_per_megawatt_loc'] = C75_KPI['t_co2_per_megawatt_loc'].replace(np.inf, np.nan)
C75_KPI['t_co2_per_megawatt_mkt']=C75_KPI['C7.5_C3Scope 2, market-based (metric tons CO2e)']/(C75_KPI['C7.5_C4Purchased and consumed electricity, heat, steam or cooling (MWh)']+C75_KPI['C7.5_C5Purchased and consumed low-carbon electricity, heat, steam or cooling accounted for in Scope 2 market-based approach (MWh)'])
C75_KPI['t_co2_per_megawatt_mkt'] = C75_KPI['t_co2_per_megawatt_mkt'].replace(np.inf, np.nan)


## Create KPIs (0: least efficient; 1: most efficient)
C75_KPI['KPI_rank_energy_efficiency_mkt'] = C75_KPI['t_co2_per_megawatt_mkt'].rank(pct=True)
C75_KPI['KPI_rank_energy_efficiency_loc'] = C75_KPI['t_co2_per_megawatt_loc'].rank(pct=True)

C75_KPI['KPI_rank_energy_efficiency_mkt'] = 1-C75_KPI['KPI_rank_energy_efficiency_mkt']
C75_KPI['KPI_rank_energy_efficiency_loc'] = 1-C75_KPI['KPI_rank_energy_efficiency_loc']

# Sort df by KPI values
C75_KPI = C75_KPI.sort_values(by=['KPI_rank_energy_efficiency_mkt']).reset_index()

pd.set_option('display.max_columns', None)
C75_KPI

## Save KPI df to folder
C75_KPI.to_csv('df_KPIs/Climate/C75_company.csv')
C75_country.to_csv('df_KPIs/Climate/C75_country.csv')

# # Do some checks
# C75_KPI['t_co2_per_megawatt_mkt'].sort_values(ascending=False)
# C75_KPI['t_co2_per_megawatt_loc'].sort_values(ascending=False)

##################################################### Q8.2a energy consumption totals ######################################################################

C82a = q20[q20['question_number']=='C8.2a']
C82a =C82a.drop(['survey_year', 'response_received_date','accounting_period_to','ors_response_id','submission_date','page_name','module_name',
          'question_number','question_unique_reference','data_point_name','data_point_id','comments','row_name','table_columns_unique_reference','column_number'], axis=1)
C82a=C82a.pivot(index=['account_number','organization','row_number'], columns='column_name', values='response_value').reset_index()
C82a=C82a.sort_values(by=['account_number','organization','row_number'])
C82a=C82a.dropna()

# Drop row_number, as there is only 1 value
C82a=C82a.drop(columns=['row_number'])


## Clean cols
C82a['C8.2a_C2MWh from renewable sources'] = C82a['C8.2a_C2MWh from renewable sources'].astype(float)
C82a['C8.2a_C3MWh from non-renewable sources'] = C82a['C8.2a_C3MWh from non-renewable sources'].astype(float)
C82a['C8.2a_C4Total (renewable and non-renewable) MWh'] = C82a['C8.2a_C4Total (renewable and non-renewable) MWh'].astype(float)

C82a['C8.2a_C1Heating value'].value_counts(dropna=True)


## Binarize 'C8.2a_C1Heating value'
one_hot = pd.get_dummies(C82a['C8.2a_C1Heating value']).add_prefix('C8.2a_C1Heating value_')
# df = df.drop('C8.2a_C1Heating value', axis = 1)
C82a = C82a.join(one_hot)


## Create % of green energy consumption 
C82a['energy_consumption_mwh_pctg_renewables']=C82a['C8.2a_C2MWh from renewable sources']/C82a['C8.2a_C4Total (renewable and non-renewable) MWh']
C82a['energy_consumption_mwh_pctg_renewables'] = C82a['energy_consumption_mwh_pctg_renewables'].replace(np.inf, np.nan)

# C82a[C82a['energy_consumption_mwh_pctg_renewables']>0]


## Create KPI (0: least green; 1: most green)
C82a['KPI_rank_energy_renewables'] = C82a['energy_consumption_mwh_pctg_renewables'].rank(pct=True)
# C82a['KPI_rank_energy_efficiency_mkt'] = 1-C82a['KPI_rank_energy_efficiency_mkt']

# Sort df by KPI values
C82a = C82a.sort_values(by=['KPI_rank_energy_renewables']).reset_index()


## Save KPI df to folder
C82a.to_csv('df_KPIs/Climate/C82a.csv')


## Visualize
#pd.set_option('display.max_columns', None)
#C82a.head(5)
# C82a['energy_consumption_mwh_pctg_renewables'].describe()
# sns.distplot(C82a['energy_consumption_mwh_pctg_renewables'])


########################### Q8.2d details on the electricity, heat, steam, and cooling your organization has generated and consumed ############################

C82d = q20[q20['question_number']=='C8.2d']
C82d =C82d.drop(['survey_year', 'response_received_date','accounting_period_to','ors_response_id','submission_date','page_name','module_name',
          'question_number','question_unique_reference','data_point_name','data_point_id','comments','row_name','table_columns_unique_reference','column_number'], axis=1)
C82d=C82d.pivot(index=['account_number','organization','row_number'], columns='column_name', values='response_value').reset_index()
C82d=C82d.sort_values(by=['account_number','organization','row_number'])
C82d=C82d.dropna()

## Clean cols
C82d['C8.2d_C1Total Gross generation (MWh)'] = C82d['C8.2d_C1Total Gross generation (MWh)'].astype(float)
C82d['C8.2d_C2Generation that is consumed by the organization (MWh)'] = C82d['C8.2d_C2Generation that is consumed by the organization (MWh)'].astype(float)
C82d['C8.2d_C3Gross generation from renewable sources (MWh)'] = C82d['C8.2d_C3Gross generation from renewable sources (MWh)'].astype(float)
C82d['C8.2d_C4Generation from renewable sources that is consumed by the organization (MWh)'] = C82d['C8.2d_C4Generation from renewable sources that is consumed by the organization (MWh)'].astype(float)
# C82d['row_number'].value_counts(dropna=True)


### Create KPI for company
C82d_KPI=C82d.copy()
# C82d_KPI=C82d_KPI.drop(columns=['C7.5_C1Country/Region'])

C82d_KPI=C82d_KPI.groupby(['account_number','organization'], as_index=False).agg({
    'row_number': 'max', 
    'C8.2d_C1Total Gross generation (MWh)': 'sum',
    'C8.2d_C2Generation that is consumed by the organization (MWh)': 'sum',
    'C8.2d_C3Gross generation from renewable sources (MWh)': 'sum',
    'C8.2d_C4Generation from renewable sources that is consumed by the organization (MWh)': 'sum'
    })

C82d_KPI=C82d_KPI.rename(columns={'row_number': 'count_rows'})


C82d_KPI['electricity_generation_mwh_renewables_pctg']=C82d_KPI['C8.2d_C3Gross generation from renewable sources (MWh)']/C82d_KPI['C8.2d_C1Total Gross generation (MWh)']
C82d_KPI['electricity_consumption_generated_mwh_renewables_pctg']=C82d_KPI['C8.2d_C4Generation from renewable sources that is consumed by the organization (MWh)']/C82d_KPI['C8.2d_C2Generation that is consumed by the organization (MWh)']

C82d_KPI['renewables_sold']=C82d_KPI['C8.2d_C3Gross generation from renewable sources (MWh)']-C82d_KPI['C8.2d_C4Generation from renewable sources that is consumed by the organization (MWh)']
C82d_KPI['electricity_generation_contribution_mwh_renewables_pctg']=C82d_KPI['renewables_sold']/C82d_KPI['C8.2d_C1Total Gross generation (MWh)']

# Replace infinities from division by zero; assign back rather than relying on in-place replace on a selected column
for pctg_col in ['electricity_generation_mwh_renewables_pctg',
                 'electricity_consumption_generated_mwh_renewables_pctg',
                 'electricity_generation_contribution_mwh_renewables_pctg']:
    C82d_KPI[pctg_col] = C82d_KPI[pctg_col].replace(np.inf, np.nan)

# Check out values:
# C82d_KPI[(C82d_KPI['electricity_generation_mwh_renewables_pctg']<0) | (C82d_KPI['electricity_generation_mwh_renewables_pctg']>1)]
# C82d_KPI[(C82d_KPI['electricity_consumption_generated_mwh_renewables_pctg']<0) | (C82d_KPI['electricity_consumption_generated_mwh_renewables_pctg']>1)]
C82d_KPI[(C82d_KPI['electricity_generation_contribution_mwh_renewables_pctg']<0) | (C82d_KPI['electricity_generation_contribution_mwh_renewables_pctg']>1)]


# Correct wrongly assigned values:
C82d_KPI.loc[C82d_KPI['electricity_consumption_generated_mwh_renewables_pctg']>1, 'electricity_consumption_generated_mwh_renewables_pctg'] = 1
C82d_KPI.loc[C82d_KPI['electricity_generation_contribution_mwh_renewables_pctg']<0, 'electricity_generation_contribution_mwh_renewables_pctg'] = 0


## Create KPI (0: least green; 1: most green)
C82d_KPI['KPI_rank_electricty_generated_from_renewables'] = C82d_KPI['electricity_generation_mwh_renewables_pctg'].rank(pct=True)
C82d_KPI['KPI_rank_electricty_consumed_generated_mwh_renewables'] = C82d_KPI['electricity_consumption_generated_mwh_renewables_pctg'].rank(pct=True)
C82d_KPI['KPI_rank_electricty_generated_from_renewables_sold'] = C82d_KPI['electricity_generation_contribution_mwh_renewables_pctg'].rank(pct=True)
# C82a['KPI_rank_energy_efficiency_mkt'] = 1-C82a['KPI_rank_energy_efficiency_mkt']

# Sort df by KPI values
C82d_KPI = C82d_KPI.sort_values(by=['KPI_rank_electricty_generated_from_renewables']).reset_index()


## Save KPI df to folder
C82d_KPI.to_csv('df_KPIs/Climate/C82d_KPI.csv')

## Visualize
# pd.set_option('display.max_columns', None)
C82d_KPI.head(5)


# ### Q6.1
# C61 = q20[q20['question_number']=='C6.1']
# C61 =C61.drop(['survey_year', 'response_received_date','accounting_period_to','ors_response_id','submission_date','page_name','module_name',
#           'question_number','question_unique_reference','data_point_name','data_point_id','comments','row_name','table_columns_unique_reference','column_number'], axis=1)
# C61=C61.pivot(index=['account_number','organization','row_number'], columns='column_name', values='response_value').reset_index()
# C61=C61.sort_values(by=['account_number','organization','row_number'])

# C61=C61.dropna()
# C61['row_number'].value_counts(dropna=True)



################################################################ TABLE 2.3a details of risks identified #####################################################

C23a = q20[q20['question_number']=='C2.3a']
C23a =C23a.drop(['survey_year', 'response_received_date','accounting_period_to','ors_response_id','submission_date','page_name','module_name',
          'question_number','question_unique_reference','data_point_name','data_point_id','comments','row_name','table_columns_unique_reference','column_number'], axis=1)
C23a=C23a.pivot(index=['account_number','organization','row_number'], columns='column_name', values='response_value').reset_index()
C23a=C23a.sort_values(by=['account_number','organization','row_number'])
# C23a=C23a.dropna()

# Drop row_number, as there is only 1 value
# C23a=C82a.drop(columns=['row_number'])


## Clean cols
C23a[C23a['row_number']==0]
C23a = C23a[C23a['row_number'] != 0] # All row number = 0 are NaN

C23a['C2.3a_C11Potential financial impact figure (currency)'] = C23a['C2.3a_C11Potential financial impact figure (currency)'].astype(float)
C23a['C2.3a_C12Potential financial impact figure â€“ minimum (currency)'] = C23a['C2.3a_C12Potential financial impact figure â€“ minimum (currency)'].astype(float)
C23a['C2.3a_C13Potential financial impact figure â€“ maximum (currency)'] = C23a['C2.3a_C13Potential financial impact figure â€“ maximum (currency)'].astype(float)
C23a['C2.3a_C15Cost of response to risk'] = C23a['C2.3a_C15Cost of response to risk'].astype(float)

C23a =C23a.drop([
    'C2.3a_C14Explanation of financial impact figure',
    'C2.3a_C17Comment',
    'C2.3a_C1Identifier',
    'C2.3a_C5Climate risk type mapped to traditional financial services industry risk classification'
    ], axis=1)


## Calculate financial impact into a single col
C23a['risk_financial_impact'] = C23a['C2.3a_C11Potential financial impact figure (currency)']
C23a.loc[C23a['risk_financial_impact'].isnull(), 'risk_financial_impact'] = (C23a['C2.3a_C12Potential financial impact figure â€“ minimum (currency)'] + C23a['C2.3a_C13Potential financial impact figure â€“ maximum (currency)'])/2
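
# Illustrative sketch only (not part of the KPI pipeline): the same "use the single
# figure, else the midpoint of the min/max range" rule, written with fillna on a toy
# frame. `_demo_impact` and its column names are hypothetical, chosen for this example.
import numpy as np
import pandas as pd

_demo_impact = pd.DataFrame({
    'point': [100.0, np.nan, np.nan],   # single reported figure
    'low':   [np.nan, 10.0, np.nan],    # reported range minimum
    'high':  [np.nan, 30.0, np.nan],    # reported range maximum
})
_demo_impact['impact'] = _demo_impact['point'].fillna(
    (_demo_impact['low'] + _demo_impact['high']) / 2
)
# Rows with neither a figure nor a complete range remain NaN.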

# C23a =C23a.drop([
#     'C2.3a_C11Potential financial impact figure (currency)', 
#     'C2.3a_C12Potential financial impact figure â€“ minimum (currency)', 
#     'C2.3a_C13Potential financial impact figure â€“ maximum (currency)'
#     ], axis=1)


### Add time horizon in years
# Q2.1a  Extract and assign short-, medium- and long-term time horizons definitions
C21a = q20[q20['question_number']=='C2.1a']
C21a =C21a.drop(['survey_year', 'response_received_date','accounting_period_to','ors_response_id','submission_date','page_name','module_name',
          'question_number','question_unique_reference','data_point_name','data_point_id','comments','row_name','table_columns_unique_reference','column_number'], axis=1)
C21a=C21a.pivot(index=['account_number','organization','row_number'], columns='column_name', values='response_value').reset_index()
C21a =C21a.drop(['C2.1a_C3Comment', 'C2.1a_C1From (years)'], axis=1)
C21a=C21a.pivot(index=['account_number','organization'], columns='row_number', values='C2.1a_C2To (years)').reset_index()

C23a = C23a.merge(C21a, how='left', left_on='account_number', right_on='account_number')

# Assign horizon years values
"""
TIME HORIZON:
Short-term: Use company definition from 2.1
Medium-term: Use company definition from 2.1
Long-term: Use company definition from 2.1
Unknown: np.nan
"""
C23a['time_horizon_years'] = np.nan
C23a.loc[C23a['C2.3a_C7Time horizon']=='Short-term', 'time_horizon_years'] = C23a[1]
C23a.loc[C23a['C2.3a_C7Time horizon']=='Medium-term', 'time_horizon_years'] = C23a[2]
C23a.loc[C23a['C2.3a_C7Time horizon']=='Long-term', 'time_horizon_years'] = C23a[3]

C23a['time_horizon_years'] = C23a['time_horizon_years'].astype(float)

C23a =C23a.drop(['organization_y', 1, 2, 3], axis=1)

### Categorize numerically categoricals, between 0 and 1
## Likelihood
"""
LIKELIHOOD:
    Virtually certain = 8/8
    Very likely = 7/8
    Likely = 6/8
    More likely than not = 5/8
    About as likely as not = 4/8
    Unlikely = 3/8
    Very unlikely = 2/8
    Exceptionally unlikely = 1/8
    Unknown = np.nan
"""
C23a['risk_likelihood'] = np.nan
C23a.loc[C23a['C2.3a_C8Likelihood']=='Virtually certain', 'risk_likelihood'] = 8/8
C23a.loc[C23a['C2.3a_C8Likelihood']=='Very likely', 'risk_likelihood'] = 7/8
C23a.loc[C23a['C2.3a_C8Likelihood']=='Likely', 'risk_likelihood'] = 6/8
C23a.loc[C23a['C2.3a_C8Likelihood']=='More likely than not', 'risk_likelihood'] = 5/8
C23a.loc[C23a['C2.3a_C8Likelihood']=='About as likely as not', 'risk_likelihood'] = 4/8
C23a.loc[C23a['C2.3a_C8Likelihood']=='Unlikely', 'risk_likelihood'] = 3/8
C23a.loc[C23a['C2.3a_C8Likelihood']=='Very unlikely', 'risk_likelihood'] = 2/8
C23a.loc[C23a['C2.3a_C8Likelihood']=='Exceptionally unlikely', 'risk_likelihood'] = 1/8
# C23a =C23a.drop(['C2.3a_C8Likelihood'], axis=1)
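
# Illustrative sketch only: the likelihood encoding above could equally be expressed
# as one dictionary plus Series.map. `_LIKELIHOOD_SCALE` and `_encode_ordinal` are
# hypothetical names introduced for this example; they are not used by the pipeline.
import numpy as np
import pandas as pd

_LIKELIHOOD_SCALE = {
    'Virtually certain': 8/8,
    'Very likely': 7/8,
    'Likely': 6/8,
    'More likely than not': 5/8,
    'About as likely as not': 4/8,
    'Unlikely': 3/8,
    'Very unlikely': 2/8,
    'Exceptionally unlikely': 1/8,
}

def _encode_ordinal(series, scale):
    """Map labels to scores; labels absent from `scale` (e.g. 'Unknown') become NaN."""
    return series.map(scale).astype(float)

_demo_likelihood = _encode_ordinal(
    pd.Series(['Likely', 'Unknown', 'Very unlikely']), _LIKELIHOOD_SCALE
)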


## Magnitude
"""
MAGNITUDE
    High = 5/5
    Medium-high = 4/5
    Medium = 3/5
    Medium-low = 2/5
    Low = 1/5
    Unknown = np.nan
"""
C23a['risk_impact_magnitude'] = np.nan
C23a.loc[C23a['C2.3a_C9Magnitude of impact']=='High', 'risk_impact_magnitude'] = 5/5
C23a.loc[C23a['C2.3a_C9Magnitude of impact']=='Medium-high', 'risk_impact_magnitude'] = 4/5
C23a.loc[C23a['C2.3a_C9Magnitude of impact']=='Medium', 'risk_impact_magnitude'] = 3/5
C23a.loc[C23a['C2.3a_C9Magnitude of impact']=='Medium-low', 'risk_impact_magnitude'] = 2/5
C23a.loc[C23a['C2.3a_C9Magnitude of impact']=='Low', 'risk_impact_magnitude'] = 1/5
# C23a =C23a.drop(['C2.3a_C8Likelihood'], axis=1)

C23a.head(5)



## Do a groupby company
C23a_KPI=C23a.copy()
C23a_KPI=C23a_KPI.drop(columns=[
    'C2.3a_C10Are you able to provide a potential financial impact figure?',
    'C2.3a_C16Description of response and explanation of cost calculation',
    'C2.3a_C2Where in the value chain does the risk driver occur?',
    'C2.3a_C3Risk type & Primary climate-related risk driver',
    'C2.3a_C3Risk type & Primary climate-related risk driver_G',
    'C2.3a_C4Primary potential financial impact',
    'C2.3a_C7Time horizon',
    'C2.3a_C9Magnitude of impact',
    'C2.3a_C8Likelihood'
    ])


## Groupby company
# wm_risk_financial_impact = lambda x: np.average(x, weights=C23a_KPI.loc[x.index, 'risk_financial_impact'])
# wm_time_horizon_years = lambda x: np.average(x, weights=C23a_KPI.loc[x.index, 'risk_financial_impact'])
# wm_risk_likelihood = lambda x: np.average(x, weights=C23a_KPI.loc[x.index, 'risk_financial_impact'])
# wm_risk_impact_magnitude = lambda x: np.average(x, weights=C23a_KPI.loc[x.index, 'risk_financial_impact'])
# wm_risk_index = lambda x: np.average(x, weights=C23a_KPI.loc[x.index, 'risk_financial_impact'])

C23a_KPI=C23a_KPI.groupby(['account_number','organization_x'], as_index=False).agg(
    row_number = ('row_number', 'max'),
    C11_Potential_financial_impact = ('C2.3a_C11Potential financial impact figure (currency)', 'sum'),
    C12_Potential_financial_impact_range_min = ('C2.3a_C12Potential financial impact figure â€“ minimum (currency)', 'sum'),
    C12_Potential_financial_impact_range_max = ('C2.3a_C13Potential financial impact figure â€“ maximum (currency)', 'sum'),
    C15_Cost_of_response_to_risk = ('C2.3a_C15Cost of response to risk', 'sum'),
    risk_financial_impact = ('risk_financial_impact', 'sum'),
    time_horizon_years = ('time_horizon_years', 'mean'),
    risk_likelihood=('risk_likelihood', 'mean'),
    risk_impact_magnitude=('risk_impact_magnitude', 'mean')
    )

C23a_KPI=C23a_KPI.rename(columns={'row_number': 'count_risks'})


### Calculate final features and KPIs
## Likelihood * Magnitude / Years
C23a_KPI['risk_index'] = (C23a_KPI['risk_likelihood'] * C23a_KPI['risk_impact_magnitude']) / C23a_KPI['time_horizon_years'] 
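# Assign the result back: in-place replace on a selected column is unreliable in recent pandas
C23a_KPI['risk_index'] = C23a_KPI['risk_index'].replace(np.inf, np.nan)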

## ROI_risk_mitigation
C23a_KPI['ROI_risk_mitigation'] = C23a_KPI['risk_financial_impact']/C23a_KPI['C15_Cost_of_response_to_risk']
C23a_KPI['ROI_risk_mitigation'] = C23a_KPI['ROI_risk_mitigation'].replace(np.inf, np.nan)
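
# Illustrative sketch only: how likelihood * magnitude / years behaves on toy values,
# including a zero-year horizon, which yields inf and is then set to NaN.
# `_demo_index` and its columns are hypothetical, chosen for this example.
import numpy as np
import pandas as pd

_demo_index = pd.DataFrame({
    'likelihood': [0.75, 0.5],
    'magnitude':  [0.6, 1.0],
    'years':      [3.0, 0.0],
})
_demo_index['score'] = (
    _demo_index['likelihood'] * _demo_index['magnitude'] / _demo_index['years']
).replace([np.inf, -np.inf], np.nan)
# First row: 0.75 * 0.6 / 3 is about 0.15; second row divides by zero and becomes NaN.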

## Create KPI (0: least at risk; 1: most at risk)
C23a_KPI['KPI_rank_risk_response_cost'] = C23a_KPI['C15_Cost_of_response_to_risk'].rank(pct=True)
C23a_KPI['KPI_rank_risk_financial_impact'] = C23a_KPI['risk_financial_impact'].rank(pct=True)
C23a_KPI['KPI_rank_risk_time_horizon'] = C23a_KPI['time_horizon_years'].rank(pct=True)
C23a_KPI['KPI_rank_risk_likelihood'] = C23a_KPI['risk_likelihood'].rank(pct=True)
C23a_KPI['KPI_rank_risk_impact_magnitude'] = C23a_KPI['risk_impact_magnitude'].rank(pct=True)
C23a_KPI['KPI_rank_risk_index'] = C23a_KPI['risk_index'].rank(pct=True)
C23a_KPI['KPI_rank_risk_mitigation'] = C23a_KPI['ROI_risk_mitigation'].rank(pct=True)

# Invert the time-horizon and mitigation-ROI ranks so that 0 (good) to 1 (bad) matches the other KPIs
C23a_KPI['KPI_rank_risk_time_horizon'] = 1-C23a_KPI['KPI_rank_risk_time_horizon']
C23a_KPI['KPI_rank_risk_mitigation'] = 1-C23a_KPI['KPI_rank_risk_mitigation']
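
# Illustrative sketch only: what rank(pct=True) followed by `1 - rank` does on a toy
# series. `_demo_years` is a hypothetical stand-in for a time-horizon column.
import numpy as np
import pandas as pd

_demo_years = pd.Series([1.0, 5.0, 10.0, np.nan])
_demo_rank = _demo_years.rank(pct=True)   # percentile rank among non-missing values
_demo_inverted = 1 - _demo_rank           # longest horizon -> 0, missing stays NaN
# Note: after inversion the largest possible value is 1 - 1/n, not 1.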
# C23a_KPI[''] = 1-C23a_KPI['']

# Sort df by KPI values
C23a_KPI = C23a_KPI.sort_values(by=['KPI_rank_risk_index']).reset_index()



## Save KPI df to folder
C23a_KPI.to_csv('df_KPIs/Climate/C23a_KPI.csv')


## Visualize results
# pd.set_option('display.max_columns', None)
# C23a.head(5)
#C23a_KPI.head(5)
# C23a[C23a['C2.3a_C7Time horizon']=='Unknown']
# C23a[C23a['C2.3a_C10Are you able to provide a potential financial impact figure?']=='No, we do not have this figure']
# C23a[C23a['C2.3a_C8Likelihood']=='Unknown']


############################################################ TABLE 2.4a climate-related opportunities ##########################################################

C24a = q20[q20['question_number']=='C2.4a']
C24a =C24a.drop(['survey_year', 'response_received_date','accounting_period_to','ors_response_id','submission_date','page_name','module_name',
          'question_number','question_unique_reference','data_point_name','data_point_id','comments','row_name','table_columns_unique_reference','column_number'], axis=1)
C24a=C24a.pivot(index=['account_number','organization','row_number'], columns='column_name', values='response_value').reset_index()
C24a=C24a.sort_values(by=['account_number','organization','row_number'])
# C23a=C23a.dropna()

## Clean up df and cols
C24a[C24a['row_number']==0]
C24a = C24a[C24a['row_number'] != 0] # All row number = 0 are NaN

C24a['C2.4a_C11Potential financial impact figure (currency)'] = C24a['C2.4a_C11Potential financial impact figure (currency)'].astype(float)
C24a['C2.4a_C12Potential financial impact figure â€“ minimum (currency)'] = C24a['C2.4a_C12Potential financial impact figure â€“ minimum (currency)'].astype(float)
C24a['C2.4a_C13Potential financial impact figure â€“ maximum (currency)'] = C24a['C2.4a_C13Potential financial impact figure â€“ maximum (currency)'].astype(float)
C24a['C2.4a_C15Cost to realize opportunity'] = C24a['C2.4a_C15Cost to realize opportunity'].astype(float)

C24a =C24a.drop([
    'C2.4a_C14Explanation of financial impact figure',
    'C2.4a_C17Comment',
    'C2.4a_C1Identifier',
    'C2.4a_C16Strategy to realize opportunity and explanation of cost calculation'
    ], axis=1)


## Calculate financial impact into a single col
C24a['opportunity_financial_impact'] = C24a['C2.4a_C11Potential financial impact figure (currency)']
C24a.loc[C24a['opportunity_financial_impact'].isnull(), 'opportunity_financial_impact'] = (C24a['C2.4a_C12Potential financial impact figure â€“ minimum (currency)'] + C24a['C2.4a_C13Potential financial impact figure â€“ maximum (currency)'])/2


## Add time horizon in years (Merge with horizon definition in 2.1)
"""
TIME HORIZON:
Short-term: Use company definition from 2.1
Medium-term: Use company definition from 2.1
Long-term: Use company definition from 2.1
Unknown: np.nan
"""
C24a = C24a.merge(C21a, how='left', left_on='account_number', right_on='account_number')

## Add values expressed in years
C24a['time_horizon_years'] = np.nan
C24a.loc[C24a['C2.4a_C7Time horizon']=='Short-term', 'time_horizon_years'] = C24a[1]
C24a.loc[C24a['C2.4a_C7Time horizon']=='Medium-term', 'time_horizon_years'] = C24a[2]
C24a.loc[C24a['C2.4a_C7Time horizon']=='Long-term', 'time_horizon_years'] = C24a[3]

C24a['time_horizon_years'] = C24a['time_horizon_years'].astype(float)

C24a =C24a.drop(['organization_y', 1, 2, 3], axis=1)


### Categorize categoricals numerically, between 0 and 1
## Likelihood
"""
LIKELIHOOD:
    Virtually certain = 8/8
    Very likely = 7/8
    Likely = 6/8
    More likely than not = 5/8
    About as likely as not = 4/8
    Unlikely = 3/8
    Very unlikely = 2/8
    Exceptionally unlikely = 1/8
    Unknown = np.nan
"""
C24a['opportunity_likelihood'] = np.nan
C24a.loc[C24a['C2.4a_C8Likelihood']=='Virtually certain', 'opportunity_likelihood'] = 8/8
C24a.loc[C24a['C2.4a_C8Likelihood']=='Very likely', 'opportunity_likelihood'] = 7/8
C24a.loc[C24a['C2.4a_C8Likelihood']=='Likely', 'opportunity_likelihood'] = 6/8
C24a.loc[C24a['C2.4a_C8Likelihood']=='More likely than not', 'opportunity_likelihood'] = 5/8
C24a.loc[C24a['C2.4a_C8Likelihood']=='About as likely as not', 'opportunity_likelihood'] = 4/8
C24a.loc[C24a['C2.4a_C8Likelihood']=='Unlikely', 'opportunity_likelihood'] = 3/8
C24a.loc[C24a['C2.4a_C8Likelihood']=='Very unlikely', 'opportunity_likelihood'] = 2/8
C24a.loc[C24a['C2.4a_C8Likelihood']=='Exceptionally unlikely', 'opportunity_likelihood'] = 1/8
# C24a =C24a.drop(['C2.4a_C8Likelihood'], axis=1)


## Magnitude
"""
MAGNITUDE
    High = 5/5
    Medium-high = 4/5
    Medium = 3/5
    Medium-low = 2/5
    Low = 1/5
    Unknown = np.nan
"""
C24a['opportunity_impact_magnitude'] = np.nan
C24a.loc[C24a['C2.4a_C9Magnitude of impact']=='High', 'opportunity_impact_magnitude'] = 5/5
C24a.loc[C24a['C2.4a_C9Magnitude of impact']=='Medium-high', 'opportunity_impact_magnitude'] = 4/5
C24a.loc[C24a['C2.4a_C9Magnitude of impact']=='Medium', 'opportunity_impact_magnitude'] = 3/5
C24a.loc[C24a['C2.4a_C9Magnitude of impact']=='Medium-low', 'opportunity_impact_magnitude'] = 2/5
C24a.loc[C24a['C2.4a_C9Magnitude of impact']=='Low', 'opportunity_impact_magnitude'] = 1/5
# C24a =C24a.drop(['C2.4a_C8Likelihood'], axis=1)


## Do a groupby 
C24a_KPI=C24a.copy()
C24a_KPI=C24a_KPI.drop(columns=[
    'C2.4a_C10Are you able to provide a potential financial impact figure?',
    'C2.4a_C2Where in the value chain does the opportunity occur?',
    'C2.4a_C3Opportunity type',
    'C2.4a_C4Primary climate-related opportunity driver',
    'C2.4a_C5Primary potential financial impact',
    'C2.4a_C6Company-specific description',
    'C2.4a_C7Time horizon',
    'C2.4a_C9Magnitude of impact',
    'C2.4a_C8Likelihood'
    ])

# wm_risk_financial_impact = lambda x: np.average(x, weights=C23a_KPI.loc[x.index, 'risk_financial_impact'])
# wm_time_horizon_years = lambda x: np.average(x, weights=C23a_KPI.loc[x.index, 'risk_financial_impact'])
# wm_risk_likelihood = lambda x: np.average(x, weights=C23a_KPI.loc[x.index, 'risk_financial_impact'])
# wm_risk_impact_magnitude = lambda x: np.average(x, weights=C23a_KPI.loc[x.index, 'risk_financial_impact'])
# wm_risk_index = lambda x: np.average(x, weights=C23a_KPI.loc[x.index, 'risk_financial_impact'])

C24a_KPI=C24a_KPI.groupby(['account_number','organization_x'], as_index=False).agg(
    row_number = ('row_number', 'max'),
    C11_Potential_financial_impact = ('C2.4a_C11Potential financial impact figure (currency)', 'sum'),
    C12_Potential_financial_impact_range_min = ('C2.4a_C12Potential financial impact figure â€“ minimum (currency)', 'sum'),
    C12_Potential_financial_impact_range_max = ('C2.4a_C13Potential financial impact figure â€“ maximum (currency)', 'sum'),
    C15_Cost_to_realize_opportunity = ('C2.4a_C15Cost to realize opportunity', 'sum'),
    opportunity_financial_impact = ('opportunity_financial_impact', 'sum'),
    time_horizon_years = ('time_horizon_years', 'mean'),
    opportunity_likelihood=('opportunity_likelihood', 'mean'),
    opportunity_impact_magnitude=('opportunity_impact_magnitude', 'mean')
    )

C24a_KPI=C24a_KPI.rename(columns={'row_number': 'count_opportunities'})


## create new features:
## Likelihood * Magnitude / Years
C24a_KPI['opportunity_index'] = (C24a_KPI['opportunity_likelihood'] * C24a_KPI['opportunity_impact_magnitude']) / C24a_KPI['time_horizon_years'] 
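# Assign the result back: in-place replace on a selected column is unreliable in recent pandas
C24a_KPI['opportunity_index'] = C24a_KPI['opportunity_index'].replace(np.inf, np.nan)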

## ROI_opportunity
C24a_KPI['ROI_opportunity'] = C24a_KPI['opportunity_financial_impact']/C24a_KPI['C15_Cost_to_realize_opportunity']
C24a_KPI['ROI_opportunity'] = C24a_KPI['ROI_opportunity'].replace(np.inf, np.nan)

## Create KPI (percentile rank: 0 = lowest value, 1 = highest value)
C24a_KPI['KPI_rank_opportunity_realization_cost'] = C24a_KPI['C15_Cost_to_realize_opportunity'].rank(pct=True)
C24a_KPI['KPI_rank_opportunity_financial_impact'] = C24a_KPI['opportunity_financial_impact'].rank(pct=True)
C24a_KPI['KPI_rank_opportunity_time_horizon'] = C24a_KPI['time_horizon_years'].rank(pct=True)
C24a_KPI['KPI_rank_opportunity_likelihood'] = C24a_KPI['opportunity_likelihood'].rank(pct=True)
C24a_KPI['KPI_rank_opportunity_impact_magnitude'] = C24a_KPI['opportunity_impact_magnitude'].rank(pct=True)
C24a_KPI['KPI_rank_opportunity_index'] = C24a_KPI['opportunity_index'].rank(pct=True)
C24a_KPI['KPI_rank_opportunity_creation'] = C24a_KPI['ROI_opportunity'].rank(pct=True)

# Revert time horizon to align 0 (good) to 1 (bad) as the other KPIs
# C24a_KPI['KPI_rank_opportunity_time_horizon'] = 1-C24a_KPI['KPI_rank_opportunity_time_horizon']
# C24a_KPI['KPI_rank_opportunity_creation'] = 1-C24a_KPI['KPI_rank_risk_mitigation']

# Sort df by KPI values
C24a_KPI = C24a_KPI.sort_values(by=['KPI_rank_opportunity_index']).reset_index()

pd.set_option('display.max_columns', None)
C24a_KPI.head(5)

## Save KPI df to folder
C24a_KPI.to_csv('df_KPIs/Climate/C24a_KPI.csv')

## Visualize results:

#C24a.head(5)
#C24a_KPI.head(5)

################################################################################################################################################################
################################################################################################################################################################
################################################################################################################################################################
###################################################                                                      #######################################################
###################################################                 W A T E R - 2 0 2 0                  #######################################################
###################################################                                                      #######################################################
################################################################################################################################################################
################################################################################################################################################################
################################################################################################################################################################


####################################################### TABLE 1.1 Water importance ############################################################################

W11 = w20[w20['question_number']=='W1.1']
W11 =W11.drop(['survey_year', 'response_received_date','accounting_period_to','ors_response_id','submission_date','page_name','module_name',
          'question_number','question_unique_reference','data_point_name','data_point_id','comments','table_columns_unique_reference','column_number'], axis=1)
W11=W11.pivot(index=['account_number','organization','row_number','row_name'], columns='column_name', values='response_value').reset_index()
W11=W11.sort_values(by=['account_number','organization','row_number'])
# C11=C11.dropna()

# Pivot dataset to manage data and calculations 
W11_p1 = W11.drop(['W1.1_C3Please explain', 'row_number', 'W1.1_C2Indirect use importance rating'], axis=1)
W11_p2 = W11.drop(['W1.1_C3Please explain', 'row_number', 'W1.1_C1Direct use importance rating'], axis=1)

W11_p1 = W11_p1.pivot(index=['account_number','organization'], columns='row_name', values='W1.1_C1Direct use importance rating').reset_index()
W11_p1.columns = ['account_number', 'organization', 'direct_use_importance_sufficient_freshwater', 'direct_use_importance_sufficient_recycled']

W11_p2 = W11_p2.pivot(index=['account_number','organization'], columns='row_name', values='W1.1_C2Indirect use importance rating').reset_index()
W11_p2.columns = ['account_number', 'organization', 'indirect_use_importance_sufficient_freshwater', 'indirect_use_importance_sufficient_recycled']

W11_c = W11_p1.merge(W11_p2, how='left', on='account_number')

W11_c.drop(['organization_y'], axis=1, inplace=True)
W11_c.rename(columns={'organization_x': 'organization'}, inplace=True)


### Categorize columns into numericals
# W11_c['direct_use_importance_sufficient_recycled'].value_counts(dropna=True)
## Water importance
"""
IMPORTANCE:
    'Vital' = 5/5
    'Important' = 4/5
    'Neutral' = 3/5
    'Not very important' = 2/5
    'Not important at all' = 1/5
    'Have not evaluated' = np.nan
"""
W11_c.loc[W11_c['direct_use_importance_sufficient_freshwater']=='Vital', 'direct_use_importance_sufficient_freshwater'] = 5/5
W11_c.loc[W11_c['direct_use_importance_sufficient_freshwater']=='Important', 'direct_use_importance_sufficient_freshwater'] = 4/5
W11_c.loc[W11_c['direct_use_importance_sufficient_freshwater']=='Neutral', 'direct_use_importance_sufficient_freshwater'] = 3/5
W11_c.loc[W11_c['direct_use_importance_sufficient_freshwater']=='Not very important', 'direct_use_importance_sufficient_freshwater'] = 2/5
W11_c.loc[W11_c['direct_use_importance_sufficient_freshwater']=='Not important at all', 'direct_use_importance_sufficient_freshwater'] = 1/5
W11_c.loc[W11_c['direct_use_importance_sufficient_freshwater']=='Have not evaluated', 'direct_use_importance_sufficient_freshwater'] = np.nan

W11_c.loc[W11_c['direct_use_importance_sufficient_recycled']=='Vital', 'direct_use_importance_sufficient_recycled'] = 5/5
W11_c.loc[W11_c['direct_use_importance_sufficient_recycled']=='Important', 'direct_use_importance_sufficient_recycled'] = 4/5
W11_c.loc[W11_c['direct_use_importance_sufficient_recycled']=='Neutral', 'direct_use_importance_sufficient_recycled'] = 3/5
W11_c.loc[W11_c['direct_use_importance_sufficient_recycled']=='Not very important', 'direct_use_importance_sufficient_recycled'] = 2/5
W11_c.loc[W11_c['direct_use_importance_sufficient_recycled']=='Not important at all', 'direct_use_importance_sufficient_recycled'] = 1/5
W11_c.loc[W11_c['direct_use_importance_sufficient_recycled']=='Have not evaluated', 'direct_use_importance_sufficient_recycled'] = np.nan

W11_c.loc[W11_c['indirect_use_importance_sufficient_freshwater']=='Vital', 'indirect_use_importance_sufficient_freshwater'] = 5/5
W11_c.loc[W11_c['indirect_use_importance_sufficient_freshwater']=='Important', 'indirect_use_importance_sufficient_freshwater'] = 4/5
W11_c.loc[W11_c['indirect_use_importance_sufficient_freshwater']=='Neutral', 'indirect_use_importance_sufficient_freshwater'] = 3/5
W11_c.loc[W11_c['indirect_use_importance_sufficient_freshwater']=='Not very important', 'indirect_use_importance_sufficient_freshwater'] = 2/5
W11_c.loc[W11_c['indirect_use_importance_sufficient_freshwater']=='Not important at all', 'indirect_use_importance_sufficient_freshwater'] = 1/5
W11_c.loc[W11_c['indirect_use_importance_sufficient_freshwater']=='Have not evaluated', 'indirect_use_importance_sufficient_freshwater'] = np.nan

W11_c.loc[W11_c['indirect_use_importance_sufficient_recycled']=='Vital', 'indirect_use_importance_sufficient_recycled'] = 5/5
W11_c.loc[W11_c['indirect_use_importance_sufficient_recycled']=='Important', 'indirect_use_importance_sufficient_recycled'] = 4/5
W11_c.loc[W11_c['indirect_use_importance_sufficient_recycled']=='Neutral', 'indirect_use_importance_sufficient_recycled'] = 3/5
W11_c.loc[W11_c['indirect_use_importance_sufficient_recycled']=='Not very important', 'indirect_use_importance_sufficient_recycled'] = 2/5
W11_c.loc[W11_c['indirect_use_importance_sufficient_recycled']=='Not important at all', 'indirect_use_importance_sufficient_recycled'] = 1/5
W11_c.loc[W11_c['indirect_use_importance_sufficient_recycled']=='Have not evaluated', 'indirect_use_importance_sufficient_recycled'] = np.nan
# C24a =C24a.drop(['C2.4a_C8Likelihood'], axis=1)


## Calculate new features
# Composite metrics (each sums the freshwater and recycled ratings, or the direct and indirect ones)
W11_c['direct_use_importance'] = W11_c['direct_use_importance_sufficient_freshwater'] + W11_c['direct_use_importance_sufficient_recycled']
W11_c['indirect_use_importance'] = W11_c['indirect_use_importance_sufficient_freshwater'] + W11_c['indirect_use_importance_sufficient_recycled']

W11_c['use_importance_freshwater'] = W11_c['direct_use_importance_sufficient_freshwater'] + W11_c['indirect_use_importance_sufficient_freshwater']
W11_c['use_importance_recycled'] = W11_c['direct_use_importance_sufficient_recycled'] + W11_c['indirect_use_importance_sufficient_recycled']

W11_c['use_importance'] = W11_c['direct_use_importance'] + W11_c['indirect_use_importance']

# Create KPIs
W11_c['KPI_rank_use_importance_freshwater'] = W11_c['use_importance_freshwater'].rank(pct=True)
W11_c['KPI_rank_use_importance_recycled'] = W11_c['use_importance_recycled'].rank(pct=True)
W11_c['KPI_rank_direct_use_importance'] = W11_c['direct_use_importance'].rank(pct=True)
W11_c['KPI_rank_indirect_use_importance'] = W11_c['indirect_use_importance'].rank(pct=True)
W11_c['KPI_rank_use_importance'] = W11_c['use_importance'].rank(pct=True)


## Visualize results
#W11_c.head(5)

############################################################# TABLE 1.2b Water use ###############################################################################

W12b = w20[w20['question_number']=='W1.2b']
W12b =W12b.drop(['survey_year', 'response_received_date','accounting_period_to','ors_response_id','submission_date','page_name','module_name',
          'question_number','question_unique_reference','data_point_name','data_point_id','comments','table_columns_unique_reference','column_number'], axis=1)
W12b=W12b.pivot(index=['account_number','organization','row_number','row_name'], columns='column_name', values='response_value').reset_index()
W12b=W12b.sort_values(by=['account_number','organization','row_number'])
# C11=C11.dropna()

# Pivot dataset to manage data and calculations 
W12b_p1 = W12b.drop(['W1.2b_C3Please explain', 'row_number', 'W1.2b_C2Comparison with previous reporting year'], axis=1)
W12b_p2 = W12b.drop(['W1.2b_C3Please explain', 'row_number', 'W1.2b_C1Volume (megaliters/year)'], axis=1)

W12b_p1 = W12b_p1.pivot(index=['account_number','organization'], columns='row_name', values='W1.2b_C1Volume (megaliters/year)').reset_index()
W12b_p1.columns = ['account_number', 'organization', 'water_use_consumption_quantity_MLpa', 'water_use_discharges_quantity_MLpa', 'water_use_withdrawals_quantity_MLpa']

W12b_p2 = W12b_p2.pivot(index=['account_number','organization'], columns='row_name', values='W1.2b_C2Comparison with previous reporting year').reset_index()
W12b_p2.columns = ['account_number', 'organization', 'water_use_consumption_vs_last_year', 'water_use_discharges_vs_last_year', 'water_use_withdrawals_vs_last_year']

W12b_c = W12b_p1.merge(W12b_p2, how='left', on='account_number')

W12b_c.drop(['organization_y'], axis=1, inplace=True)
W12b_c.rename(columns={'organization_x': 'organization'}, inplace=True)

### Clean cols
W12b_c['water_use_consumption_quantity_MLpa'] = W12b_c['water_use_consumption_quantity_MLpa'].astype(float)
W12b_c['water_use_discharges_quantity_MLpa'] = W12b_c['water_use_discharges_quantity_MLpa'].astype(float)
W12b_c['water_use_withdrawals_quantity_MLpa'] = W12b_c['water_use_withdrawals_quantity_MLpa'].astype(float)

W12b_c['water_use_consumption_vs_last_year'].value_counts(dropna=True)

## Water consumption
"""
CONSUMPTION:
Much lower = 1/5
Lower = 2/5
About the same = 3/5
Higher = 4/5
Much higher = 5/5
This is our first year of measurement = np.nan
"""
W12b_c.loc[W12b_c['water_use_consumption_vs_last_year']=='Much lower', 'water_use_consumption_vs_last_year'] = 1/5
W12b_c.loc[W12b_c['water_use_consumption_vs_last_year']=='Lower', 'water_use_consumption_vs_last_year'] = 2/5
W12b_c.loc[W12b_c['water_use_consumption_vs_last_year']=='About the same', 'water_use_consumption_vs_last_year'] = 3/5
W12b_c.loc[W12b_c['water_use_consumption_vs_last_year']=='Higher', 'water_use_consumption_vs_last_year'] = 4/5
W12b_c.loc[W12b_c['water_use_consumption_vs_last_year']=='Much higher', 'water_use_consumption_vs_last_year'] = 5/5
W12b_c.loc[W12b_c['water_use_consumption_vs_last_year']=='This is our first year of measurement', 'water_use_consumption_vs_last_year'] = np.nan

W12b_c.loc[W12b_c['water_use_discharges_vs_last_year']=='Much lower', 'water_use_discharges_vs_last_year'] = 1/5
W12b_c.loc[W12b_c['water_use_discharges_vs_last_year']=='Lower', 'water_use_discharges_vs_last_year'] = 2/5
W12b_c.loc[W12b_c['water_use_discharges_vs_last_year']=='About the same', 'water_use_discharges_vs_last_year'] = 3/5
W12b_c.loc[W12b_c['water_use_discharges_vs_last_year']=='Higher', 'water_use_discharges_vs_last_year'] = 4/5
W12b_c.loc[W12b_c['water_use_discharges_vs_last_year']=='Much higher', 'water_use_discharges_vs_last_year'] = 5/5
W12b_c.loc[W12b_c['water_use_discharges_vs_last_year']=='This is our first year of measurement', 'water_use_discharges_vs_last_year'] = np.nan

W12b_c.loc[W12b_c['water_use_withdrawals_vs_last_year']=='Much lower', 'water_use_withdrawals_vs_last_year'] = 1/5
W12b_c.loc[W12b_c['water_use_withdrawals_vs_last_year']=='Lower', 'water_use_withdrawals_vs_last_year'] = 2/5
W12b_c.loc[W12b_c['water_use_withdrawals_vs_last_year']=='About the same', 'water_use_withdrawals_vs_last_year'] = 3/5
W12b_c.loc[W12b_c['water_use_withdrawals_vs_last_year']=='Higher', 'water_use_withdrawals_vs_last_year'] = 4/5
W12b_c.loc[W12b_c['water_use_withdrawals_vs_last_year']=='Much higher', 'water_use_withdrawals_vs_last_year'] = 5/5
W12b_c.loc[W12b_c['water_use_withdrawals_vs_last_year']=='This is our first year of measurement', 'water_use_withdrawals_vs_last_year'] = np.nan

# C24a =C24a.drop(['C2.4a_C8Likelihood'], axis=1)


## Create KPIs
# Composite metrics
W12b_c['water_use_total_movement_quantity_MLpa'] = W12b_c['water_use_consumption_quantity_MLpa'] + W12b_c['water_use_discharges_quantity_MLpa'] + W12b_c['water_use_withdrawals_quantity_MLpa']

# KPIs
W12b_c['KPI_rank_water_use_consumption_quantity'] = W12b_c['water_use_consumption_quantity_MLpa'].rank(pct=True)
W12b_c['KPI_rank_water_use_discharges_quantity'] = W12b_c['water_use_discharges_quantity_MLpa'].rank(pct=True)
W12b_c['KPI_rank_water_use_withdrawals_quantity'] = W12b_c['water_use_withdrawals_quantity_MLpa'].rank(pct=True)
W12b_c['KPI_rank_water_use_total_movement_quantity'] = W12b_c['water_use_total_movement_quantity_MLpa'].rank(pct=True)
W12b_c['KPI_rank_water_use_consumption_vs_last_year'] = W12b_c['water_use_consumption_vs_last_year'].rank(pct=True)
W12b_c['KPI_rank_water_use_discharges_vs_last_year'] = W12b_c['water_use_discharges_vs_last_year'].rank(pct=True)
W12b_c['KPI_rank_water_use_withdrawals_vs_last_year'] = W12b_c['water_use_withdrawals_vs_last_year'].rank(pct=True)


## Visualize results
#W12b_c.head(5)

####################################################### TABLE 4.1b Water-related risks #######################################################################

W41b = w20[w20['question_number']=='W4.1b']
W41b =W41b.drop(['survey_year', 'response_received_date','accounting_period_to','ors_response_id','submission_date','page_name','module_name',
          'question_number','question_unique_reference','data_point_name','data_point_id','comments','table_columns_unique_reference','column_number'], axis=1)
W41b=W41b.pivot(index=['account_number','organization','row_number','row_name'], columns='column_name', values='response_value').reset_index()
W41b=W41b.sort_values(by=['account_number','organization','row_number'])

W41b = W41b.drop(['W4.1b_C3Comment', 'row_number', 'row_name'], axis=1)


### Clean cols
W41b['W4.1b_C1Total number of facilities exposed to water risk'] = W41b['W4.1b_C1Total number of facilities exposed to water risk'].astype(float)


### Categorize columns into numericals
## Water risk
"""
RISK:
    100 = 1.0
    76-99 = (99 + 76)/2/100
    51-75 = (75 + 51)/2/100
    26-50 = (50 + 26)/2/100
    1-25 = (25 + 1)/2/100
    Less than 1% = (1 + 0)/2/100
    Unknown = np.nan
"""
W41b['water_risk_business_exposure_pct'] = np.nan
W41b.loc[W41b['W4.1b_C2% company-wide facilities this represents']=='100', 'water_risk_business_exposure_pct'] = 1.0
W41b.loc[W41b['W4.1b_C2% company-wide facilities this represents']=='76-99', 'water_risk_business_exposure_pct'] = (99 + 76)/2/100
W41b.loc[W41b['W4.1b_C2% company-wide facilities this represents']=='51-75', 'water_risk_business_exposure_pct'] = (75 + 51)/2/100
W41b.loc[W41b['W4.1b_C2% company-wide facilities this represents']=='26-50', 'water_risk_business_exposure_pct'] = (50 + 26)/2/100
W41b.loc[W41b['W4.1b_C2% company-wide facilities this represents']=='1-25', 'water_risk_business_exposure_pct'] = (25 + 1)/2/100
W41b.loc[W41b['W4.1b_C2% company-wide facilities this represents']=='Less than 1%', 'water_risk_business_exposure_pct'] = (1 + 0)/2/100
W41b.loc[W41b['W4.1b_C2% company-wide facilities this represents']=='Unknown', 'water_risk_business_exposure_pct'] = np.nan
# C24a =C24a.drop(['C2.4a_C8Likelihood'], axis=1)
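
# Illustrative sketch (not part of the original pipeline): the band-to-midpoint
# recoding above could also be written as one small helper. The name
# `example_band_midpoint` is hypothetical. It assumes labels look like '76-99',
# '100', '100%' or 'Less than 1%', and returns NaN for anything else ('Unknown').
def example_band_midpoint(label):
    """Return the midpoint of a percentage band as a fraction between 0 and 1."""
    text = str(label).strip().rstrip('%')
    if text == 'Less than 1':
        return (1 + 0) / 2 / 100
    try:
        bounds = [float(part) for part in text.split('-')]
    except ValueError:
        return float('nan')  # unparseable label, e.g. 'Unknown'
    if len(bounds) == 1:
        return bounds[0] / 100
    if len(bounds) == 2:
        return (bounds[0] + bounds[1]) / 2 / 100
    return float('nan')

# Possible usage, assuming the raw column holds the band labels:
# W41b['W4.1b_C2% company-wide facilities this represents'].map(example_band_midpoint)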


## Create KPIs
# KPIs
W41b['KPI_rank_business_risk_facilities_exposed'] = W41b['W4.1b_C1Total number of facilities exposed to water risk'].rank(pct=True)
W41b['KPI_rank_business_risk_business_exposure'] = W41b['water_risk_business_exposure_pct'].rank(pct=True)
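
# Illustrative sketch of what `.rank(pct=True)` computes with its default
# settings: ties share the average rank, missing values stay missing, and ranks
# are divided by the number of non-missing values. `example_pct_rank` is a
# hypothetical pure-Python helper written only to make the KPI scale explicit.
import math

def example_pct_rank(values):
    valid = [v for v in values if not math.isnan(v)]
    n = len(valid)
    ranks = []
    for v in values:
        if math.isnan(v):
            ranks.append(float('nan'))
            continue
        lower = sum(1 for w in valid if w < v)
        equal = sum(1 for w in valid if w == v)
        # Average rank of the tie group, scaled to (0, 1]
        ranks.append((lower + (equal + 1) / 2) / n)
    return ranks

# Example: example_pct_rank([10.0, 20.0, 20.0, float('nan')])
# gives 1/3 for 10.0, 2.5/3 for each 20.0, and NaN for the missing value.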



## Visualize results
#W41b.head(5)
# W41b['row_number'].value_counts(dropna=True)
# W41b['water_risk_business_exposure_pct'].value_counts(dropna=True)

#####################################################  TABLE 4.1c Water-related risk by geography #################################################################

W41c = w20[w20['question_number']=='W4.1c']
W41c =W41c.drop(['survey_year', 'response_received_date','accounting_period_to','ors_response_id','submission_date','page_name','module_name',
          'question_number','question_unique_reference','data_point_name','data_point_id','comments','table_columns_unique_reference','column_number'], axis=1)
W41c=W41c.pivot(index=['account_number','organization','row_number','row_name'], columns='column_name', values='response_value').reset_index()
W41c=W41c.sort_values(by=['account_number','organization','row_number'])
# C11=C11.dropna()


### Clean up dataframe
## Remove questions
"""
These questions will be removed: 
'W4.1c_C4Production value for the metals & mining activities associated with these facilities': too few values/answers (23)
'W4.1c_C1Country/Area & River basin': too atomic/text sparse
'W4.1c_C5% companyâ€™s annual electricity generation that could be affected by these facilities': too few values/answers (63)
'W4.1c_C6% companyâ€™s global oil & gas production volume that could be affected by these facilities': too few values/answers (27)

"""
W41c = W41c.drop([
    'row_number',
    'row_name',
    'W4.1c_C4Production value for the metals & mining activities associated with these facilities', 
    'W4.1c_C1Country/Area & River basin', 
    'W4.1c_C5% companyâ€™s annual electricity generation that could be affected by these facilities',
    'W4.1c_C6% companyâ€™s global oil & gas production volume that could be affected by these facilities',                  
    'W4.1c_C8Comment'], axis=1)


## Clean cols
W41c['facilities_exposed_to_water_risk'] = W41c['W4.1c_C2Number of facilities exposed to water risk'].astype(float)

### Categorize columns into numericals
## Water risk impact
"""
RISK:
    # 100% = 1.0
    # 81-90 = (90 + 81)/2
    # 71-80 = (80 + 71)/2
    # 61-70 = (70 + 61)/2
    # 51-60 = (60 + 51)/2
    # 41-50 = (50 + 41)/2
    # 31-40 = (40 + 31)/2
    # 21-30 = (30 + 21)/2
    # 11-20 = (20 + 11)/2
    # 1-10 = (10 + 1)/2
    # Less than 1% = (1 + 0)/2
    # Unknown = np.nan
"""
W41c['water_risk_business_exposure_by_country_pct'] = np.nan
W41c.loc[W41c['W4.1c_C7% companyâ€™s total global revenue that could be affected']=='100%', 'water_risk_business_exposure_by_country_pct'] = 1.0
W41c.loc[W41c['W4.1c_C7% companyâ€™s total global revenue that could be affected']=='81-90', 'water_risk_business_exposure_by_country_pct'] = (90 + 81)/2/100
W41c.loc[W41c['W4.1c_C7% companyâ€™s total global revenue that could be affected']=='71-80', 'water_risk_business_exposure_by_country_pct'] = (80 + 71)/2/100
W41c.loc[W41c['W4.1c_C7% companyâ€™s total global revenue that could be affected']=='61-70', 'water_risk_business_exposure_by_country_pct'] = (70 + 61)/2/100
W41c.loc[W41c['W4.1c_C7% companyâ€™s total global revenue that could be affected']=='51-60', 'water_risk_business_exposure_by_country_pct'] = (60 + 51)/2/100
W41c.loc[W41c['W4.1c_C7% companyâ€™s total global revenue that could be affected']=='41-50', 'water_risk_business_exposure_by_country_pct'] = (50 + 41)/2/100
W41c.loc[W41c['W4.1c_C7% companyâ€™s total global revenue that could be affected']=='31-40', 'water_risk_business_exposure_by_country_pct'] = (40 + 31)/2/100
W41c.loc[W41c['W4.1c_C7% companyâ€™s total global revenue that could be affected']=='21-30', 'water_risk_business_exposure_by_country_pct'] = (30 + 21)/2/100
W41c.loc[W41c['W4.1c_C7% companyâ€™s total global revenue that could be affected']=='11-20', 'water_risk_business_exposure_by_country_pct'] = (20 + 11)/2/100
W41c.loc[W41c['W4.1c_C7% companyâ€™s total global revenue that could be affected']=='1-10', 'water_risk_business_exposure_by_country_pct'] = (10 + 1)/2/100
W41c.loc[W41c['W4.1c_C7% companyâ€™s total global revenue that could be affected']=='Less than 1%', 'water_risk_business_exposure_by_country_pct'] = (1 + 0)/2/100
W41c.loc[W41c['W4.1c_C7% companyâ€™s total global revenue that could be affected']=='Unknown', 'water_risk_business_exposure_by_country_pct'] = np.nan


"""
RISK (band midpoint, stored as a fraction of 1):
    100% = 1.0
    76-99 = (99 + 76)/2/100
    51-75 = (75 + 51)/2/100
    26-50 = (50 + 26)/2/100
    1-25 = (25 + 1)/2/100
    Less than 1% = (1 + 0)/2/100
    Unknown = np.nan
"""
W41c['water_risk_company_size_facilities_exposed_by_country_pct'] = np.nan
W41c.loc[W41c['W4.1c_C3% company-wide facilities this represents']=='100%', 'water_risk_company_size_facilities_exposed_by_country_pct'] = 1.0
W41c.loc[W41c['W4.1c_C3% company-wide facilities this represents']=='76-99', 'water_risk_company_size_facilities_exposed_by_country_pct'] = (99 + 76)/2/100
W41c.loc[W41c['W4.1c_C3% company-wide facilities this represents']=='51-75', 'water_risk_company_size_facilities_exposed_by_country_pct'] = (75 + 51)/2/100
W41c.loc[W41c['W4.1c_C3% company-wide facilities this represents']=='26-50', 'water_risk_company_size_facilities_exposed_by_country_pct'] = (50 + 26)/2/100
W41c.loc[W41c['W4.1c_C3% company-wide facilities this represents']=='1-25', 'water_risk_company_size_facilities_exposed_by_country_pct'] = (25 + 1)/2/100
W41c.loc[W41c['W4.1c_C3% company-wide facilities this represents']=='Less than 1%', 'water_risk_company_size_facilities_exposed_by_country_pct'] = (1 + 0)/2/100
W41c.loc[W41c['W4.1c_C3% company-wide facilities this represents']=='Unknown', 'water_risk_company_size_facilities_exposed_by_country_pct'] = np.nan
# C24a =C24a.drop(['C2.4a_C8Likelihood'], axis=1)


## Drop last cols:
W41c = W41c.drop([
    'W4.1c_C2Number of facilities exposed to water risk',
    'W4.1c_C3% company-wide facilities this represents',
    'W4.1c_C7% companyâ€™s total global revenue that could be affected'], axis=1)


### Create by country list:
## Clean up cols to allow weighted means calc
W41c.loc[W41c['water_risk_company_size_facilities_exposed_by_country_pct'].isna(), 'water_risk_company_size_facilities_exposed_by_country_pct'] = 0.0000000001
W41c.loc[W41c['water_risk_business_exposure_by_country_pct'].isna(), 'water_risk_business_exposure_by_country_pct'] = 0.0000000001
wm_risk_financial_impact_by_country = lambda x: np.average(x, weights=W41c.loc[x.index, 'water_risk_company_size_facilities_exposed_by_country_pct'])
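
# Illustrative alternative (not used below): rather than replacing missing values
# with 1e-10, a weighted mean can skip every pair whose value or weight is
# missing. `example_weighted_mean` is a hypothetical helper name.
import math

def example_weighted_mean(values, weights):
    pairs = [(v, w) for v, w in zip(values, weights)
             if not (math.isnan(v) or math.isnan(w))]
    total_weight = sum(w for _, w in pairs)
    if total_weight == 0:
        return float('nan')  # nothing usable to average
    return sum(v * w for v, w in pairs) / total_weight

# Example: example_weighted_mean([0.5, float('nan'), 1.0], [1.0, 2.0, 3.0])
# averages only the first and third pair: (0.5*1 + 1.0*3) / 4 = 0.875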

## Group by country
W41c_by_country = W41c.groupby(['W4.1c_C1Country/Area & River basin_G'], as_index=False).agg(
    facilities_exposed_to_water_risk = ('facilities_exposed_to_water_risk', 'sum'),
    water_risk_business_exposure_by_country_pct = ('water_risk_business_exposure_by_country_pct', wm_risk_financial_impact_by_country),
    )

## Create KPIs
W41c_by_country['KPI_rank_business_risk_facilities_exposed_ctry'] = W41c_by_country['facilities_exposed_to_water_risk'].rank(pct=True)
W41c_by_country['KPI_rank_business_risk_business_exposure_ctry'] = W41c_by_country['water_risk_business_exposure_by_country_pct'].rank(pct=True)



## Visualize results
# W41c.head(5)
#W41c_by_country.head(5)
# W41c['water_risk_business_exposure_by_country_pct'].value_counts(dropna=True)
# W41c_by_country.head(5)
# W41c[W41c['W4.1c_C1Country/Area & River basin_G']=='Brazil']

###########################################################  TABLE 4.2 Water-related identified risks ##############################################################

W42 = w20[w20['question_number']=='W4.2']
W42 = W42.drop(['survey_year', 'response_received_date','accounting_period_to','ors_response_id','submission_date','page_name','module_name',
          'question_number','question_unique_reference','data_point_name','data_point_id','comments','table_columns_unique_reference','column_number'], axis=1)
W42=W42.pivot(index=['account_number','organization','row_number','row_name'], columns='column_name', values='response_value').reset_index()
W42=W42.sort_values(by=['account_number','organization','row_number'])
# C11=C11.dropna()

## Drop rows where row_number == 0
W42 = W42[W42['row_number']!=0]

## Calculate financial impact into a single col
# Clean up numeric cols
W42['W4.2_C10Potential financial impact figure - minimum (currency)'] = W42['W4.2_C10Potential financial impact figure - minimum (currency)'].astype(float)
W42['W4.2_C11Potential financial impact figure - maximum (currency)'] = W42['W4.2_C11Potential financial impact figure - maximum (currency)'].astype(float)
W42['W4.2_C9Potential financial impact figure (currency)'] = W42['W4.2_C9Potential financial impact figure (currency)'].astype(float)

# Calculate financial impact into a single col
W42['C9_C10_C11_risk_financial_impact'] = W42['W4.2_C9Potential financial impact figure (currency)']
W42.loc[W42['C9_C10_C11_risk_financial_impact'].isnull(), 'C9_C10_C11_risk_financial_impact'] = (W42['W4.2_C10Potential financial impact figure - minimum (currency)'] + W42['W4.2_C11Potential financial impact figure - maximum (currency)'])/2


### Re-categorize categoricals into float
## Timeframe
"""
TIMEFRAME:
    Current up to one year = 1
    1-3 years = 2
    4-6 years = 5
    More than 6 years = 6
    Unknown = np.nan
"""
W42['C5_timeframe'] = np.nan
W42.loc[W42['W4.2_C5Timeframe']=='Current up to one year', 'C5_timeframe'] = 1
W42.loc[W42['W4.2_C5Timeframe']=='1-3 years', 'C5_timeframe'] = 2
W42.loc[W42['W4.2_C5Timeframe']=='4-6 years', 'C5_timeframe'] = 5
W42.loc[W42['W4.2_C5Timeframe']=='More than 6 years', 'C5_timeframe'] = 6
W42.loc[W42['W4.2_C5Timeframe']=='Unknown', 'C5_timeframe'] = np.nan

## Magnitude
"""
MAGNITUDE
    High = 5/5
    Medium-high = 4/5
    Medium = 3/5
    Medium-low = 2/5
    Low = 1/5
    Unknown = np.nan
"""
W42['C6_risk_impact_magnitude'] = np.nan
W42.loc[W42['W4.2_C6Magnitude of potential impact']=='High', 'C6_risk_impact_magnitude'] = 5/5
W42.loc[W42['W4.2_C6Magnitude of potential impact']=='Medium-high', 'C6_risk_impact_magnitude'] = 4/5
W42.loc[W42['W4.2_C6Magnitude of potential impact']=='Medium', 'C6_risk_impact_magnitude'] = 3/5
W42.loc[W42['W4.2_C6Magnitude of potential impact']=='Medium-low', 'C6_risk_impact_magnitude'] = 2/5
W42.loc[W42['W4.2_C6Magnitude of potential impact']=='Low', 'C6_risk_impact_magnitude'] = 1/5

## Likelihood
"""
LIKELIHOOD:
    Virtually certain = 8/8
    Very likely = 7/8
    Likely = 6/8
    More likely than not = 5/8
    About as likely as not = 4/8
    Unlikely = 3/8
    Very unlikely = 2/8
    Exceptionally unlikely = 1/8
    Unknown = np.nan
"""
W42['C7_risk_likelihood'] = np.nan
W42.loc[W42['W4.2_C7Likelihood']=='Virtually certain', 'C7_risk_likelihood'] = 8/8
W42.loc[W42['W4.2_C7Likelihood']=='Very likely', 'C7_risk_likelihood'] = 7/8
W42.loc[W42['W4.2_C7Likelihood']=='Likely', 'C7_risk_likelihood'] = 6/8
W42.loc[W42['W4.2_C7Likelihood']=='More likely than not', 'C7_risk_likelihood'] = 5/8
W42.loc[W42['W4.2_C7Likelihood']=='About as likely as not', 'C7_risk_likelihood'] = 4/8
W42.loc[W42['W4.2_C7Likelihood']=='Unlikely', 'C7_risk_likelihood'] = 3/8
W42.loc[W42['W4.2_C7Likelihood']=='Very unlikely', 'C7_risk_likelihood'] = 2/8
W42.loc[W42['W4.2_C7Likelihood']=='Exceptionally unlikely', 'C7_risk_likelihood'] = 1/8


## Drop last cols:
W42 = W42.drop([
    'W4.2_C12Explanation of financial impact'
    ,'W4.2_C14Description of response'
    ,'W4.2_C16Explanation of cost of response'
    ,'W4.2_C1Country/Area & River basin'
    ,'W4.2_C4Company-specific description'
    ,'W4.2_C8Are you able to provide a potential financial impact figure?'
    ,'row_name'
    ,'W4.2_C10Potential financial impact figure - minimum (currency)'
    ,'W4.2_C11Potential financial impact figure - maximum (currency)'
    ,'W4.2_C9Potential financial impact figure (currency)'
    ,'W4.2_C5Timeframe'
    ,'W4.2_C6Magnitude of potential impact'
    ,'W4.2_C7Likelihood'
    ], axis=1)

## Re-categorize col values with low frequencies:
cols_to_recat =['W4.2_C13Primary response to risk', 'W4.2_C2Type of risk & Primary risk driver', 'W4.2_C3Primary potential impact']
W42 = W42.apply(lambda x: x.mask(x.map(x.value_counts())<10, 'Other') if x.name in cols_to_recat else x)

# Clean up a 'Other' value:
"""
8, 9, 9: too many columns! Recode into new df, one for companies without these, and another by country
"""
W42.loc[W42['W4.2_C13Primary response to risk']=='Other, please specify: Comprehensive site water stewardship action plans', 'W4.2_C13Primary response to risk'] = 'Other'



## Create df by company
W42_company = W42.groupby(['account_number','organization'], as_index=False).agg(
    risks_count = ('row_number', 'count'),
    C9_C10_C11_risk_financial_impact = ('C9_C10_C11_risk_financial_impact', 'sum'), 
    C5_timeframe_years = ('C5_timeframe', 'mean'),
    C6_risk_impact_magnitude = ('C6_risk_impact_magnitude', 'mean'), 
    C7_risk_likelihood = ('C7_risk_likelihood', 'mean')
    )

## Create df by country
W42_country = W42.groupby(['W4.2_C1Country/Area & River basin_G'], as_index=False).agg(
    risks_count = ('row_number', 'count'),
    C9_C10_C11_risk_financial_impact = ('C9_C10_C11_risk_financial_impact', 'sum'), 
    C5_timeframe_years = ('C5_timeframe', 'mean'),
    C6_risk_impact_magnitude = ('C6_risk_impact_magnitude', 'mean'), 
    C7_risk_likelihood = ('C7_risk_likelihood', 'mean')
    )

# pandas sum() skips NaN by design, so an all-NaN group sums to 0; recode those zeros as NaN
# (note: this also turns genuine zero totals into NaN)
W42_company.loc[W42_company['C9_C10_C11_risk_financial_impact']==0, 'C9_C10_C11_risk_financial_impact'] = np.nan
W42_country.loc[W42_country['C9_C10_C11_risk_financial_impact']==0, 'C9_C10_C11_risk_financial_impact'] = np.nan
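
# Illustrative sketch (hypothetical helper, not used below): a sum that returns
# NaN only when every input is missing, so genuine zero totals are preserved.
# pandas offers the same behaviour through `Series.sum(min_count=1)`.
import math

def example_sum_or_nan(values):
    present = [v for v in values if not math.isnan(v)]
    return sum(present) if present else float('nan')

# Example: example_sum_or_nan([0.0, float('nan')]) keeps the real zero (0.0),
# while example_sum_or_nan([float('nan'), float('nan')]) stays NaN.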


## Create KPIs:
## Company
# Likelihood * Magnitude / Years
W42_company['risk_index'] = (W42_company['C7_risk_likelihood'] * W42_company['C6_risk_impact_magnitude']) / W42_company['C5_timeframe_years'] 
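# Assign the result back: in-place replace on a selected column is not guaranteed to modify the dataframe
W42_company['risk_index'] = W42_company['risk_index'].replace(np.inf, np.nan)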

# Create KPI (0: least at risk; 1: most at risk)
W42_company['KPI_rank_risk_count'] = W42_company['risks_count'].rank(pct=True)
W42_company['KPI_rank_risk_financial_impact'] = W42_company['C9_C10_C11_risk_financial_impact'].rank(pct=True)
W42_company['KPI_rank_risk_timeframe_years'] = W42_company['C5_timeframe_years'].rank(pct=True)
W42_company['KPI_rank_risk_timeframe_years'] = 1 - W42_company['KPI_rank_risk_timeframe_years']
W42_company['KPI_rank_risk_impact_magnitude'] = W42_company['C6_risk_impact_magnitude'].rank(pct=True)
W42_company['KPI_rank_risk_risk_likelihood'] = W42_company['C7_risk_likelihood'].rank(pct=True)
W42_company['KPI_rank_risk_index'] = W42_company['risk_index'].rank(pct=True)
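
# Worked example of the risk index formula above, with illustrative values only:
# ratings of 'Very likely' (7/8) and 'High' (5/5) with a '1-3 years' timeframe
# (coded as 2) give 7/8 * 5/5 / 2 = 0.4375. `example_risk_index` is a
# hypothetical helper, not part of the pipeline.
def example_risk_index(likelihood, magnitude, timeframe_years):
    if not timeframe_years:  # zero or None timeframe: index is undefined
        return float('nan')
    return likelihood * magnitude / timeframe_years

# A nearer timeframe raises the index: the same ratings with timeframe 1 give 0.875.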

## Country
# Likelihood * Magnitude / Years
W42_country['risk_index'] = (W42_country['C7_risk_likelihood'] * W42_country['C6_risk_impact_magnitude']) / W42_country['C5_timeframe_years'] 
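# Assign the result back: in-place replace on a selected column is not guaranteed to modify the dataframe
W42_country['risk_index'] = W42_country['risk_index'].replace(np.inf, np.nan)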

# Create KPI (0: least at risk; 1: most at risk)
W42_country['KPI_rank_risk_count'] = W42_country['risks_count'].rank(pct=True)
W42_country['KPI_rank_risk_financial_impact'] = W42_country['C9_C10_C11_risk_financial_impact'].rank(pct=True)
W42_country['KPI_rank_risk_timeframe_years'] = W42_country['C5_timeframe_years'].rank(pct=True)
W42_country['KPI_rank_risk_timeframe_years'] = 1 - W42_country['KPI_rank_risk_timeframe_years']
W42_country['KPI_rank_risk_impact_magnitude'] = W42_country['C6_risk_impact_magnitude'].rank(pct=True)
W42_country['KPI_rank_risk_risk_likelihood'] = W42_country['C7_risk_likelihood'].rank(pct=True)
W42_country['KPI_rank_risk_index'] = W42_country['risk_index'].rank(pct=True)


## Visualize results
# W42.head(5)
#W42_company.head(5)
# W42_country.head(5)

################################################## TABLE 4.3a Water  opportunities currently being realized ########################################################

W43a = w20[w20['question_number']=='W4.3a']
W43a = W43a.drop(['survey_year', 'response_received_date','accounting_period_to','ors_response_id','submission_date','page_name','module_name',
          'question_number','question_unique_reference','data_point_name','data_point_id','comments','table_columns_unique_reference','column_number'], axis=1)
W43a=W43a.pivot(index=['account_number','organization','row_number','row_name'], columns='column_name', values='response_value').reset_index()
W43a=W43a.sort_values(by=['account_number','organization','row_number'])
# C11=C11.dropna()

## Drop rows where row_number == 0
W43a = W43a[W43a['row_number']!=0]

## Calculate financial impact into a single col
# Clean up numeric cols
W43a['W4.3a_C8Potential financial impact figure â€“ minimum (currency)'] = W43a['W4.3a_C8Potential financial impact figure â€“ minimum (currency)'].astype(float)
W43a['W4.3a_C9Potential financial impact figure â€“ maximum (currency)'] = W43a['W4.3a_C9Potential financial impact figure â€“ maximum (currency)'].astype(float)
W43a['W4.3a_C7Potential financial impact figure (currency)'] = W43a['W4.3a_C7Potential financial impact figure (currency)'].astype(float)

# Calculate financial impact into a single col
W43a['C7_C8_C9_opportunity_financial_impact'] = W43a['W4.3a_C7Potential financial impact figure (currency)']
W43a.loc[W43a['C7_C8_C9_opportunity_financial_impact'].isnull(), 'C7_C8_C9_opportunity_financial_impact'] = (W43a['W4.3a_C8Potential financial impact figure â€“ minimum (currency)'] + W43a['W4.3a_C9Potential financial impact figure â€“ maximum (currency)'])/2

## Re-categorize into float
"""
TIMEFRAME:
    Current up to one year = 1
    1-3 years = 2
    4-6 years = 5
    More than 6 years = 6
    Unknown = np.nan
"""
W43a['C4_timeframe'] = np.nan
W43a.loc[W43a['W4.3a_C4Estimated timeframe for realization']=='Current - up to 1 year', 'C4_timeframe'] = 1
W43a.loc[W43a['W4.3a_C4Estimated timeframe for realization']=='1-3 years', 'C4_timeframe'] = 2
W43a.loc[W43a['W4.3a_C4Estimated timeframe for realization']=='4-6 years', 'C4_timeframe'] = 5
W43a.loc[W43a['W4.3a_C4Estimated timeframe for realization']=='More than 6 years', 'C4_timeframe'] = 6
W43a.loc[W43a['W4.3a_C4Estimated timeframe for realization']=='Unknown', 'C4_timeframe'] = np.nan

## Magnitude
"""
MAGNITUDE
    High = 5/5
    Medium-high = 4/5
    Medium = 3/5
    Medium-low = 2/5
    Low = 1/5
    Unknown = np.nan
"""
W43a['C5_opportunity_impact_magnitude'] = np.nan
W43a.loc[W43a['W4.3a_C5Magnitude of potential financial impact']=='High', 'C5_opportunity_impact_magnitude'] = 5/5
W43a.loc[W43a['W4.3a_C5Magnitude of potential financial impact']=='Medium-high', 'C5_opportunity_impact_magnitude'] = 4/5
W43a.loc[W43a['W4.3a_C5Magnitude of potential financial impact']=='Medium', 'C5_opportunity_impact_magnitude'] = 3/5
W43a.loc[W43a['W4.3a_C5Magnitude of potential financial impact']=='Medium-low', 'C5_opportunity_impact_magnitude'] = 2/5
W43a.loc[W43a['W4.3a_C5Magnitude of potential financial impact']=='Low', 'C5_opportunity_impact_magnitude'] = 1/5

## Create binaries for type of opportunity:
one_hot = pd.get_dummies(W43a['W4.3a_C1Type of opportunity']).add_prefix('opportunity_type_')
W43a = W43a.join(one_hot)


## Drop last cols:
W43a = W43a.drop([
    'row_name'
    ,'W4.3a_C4Estimated timeframe for realization'
    ,'W4.3a_C5Magnitude of potential financial impact'
    ,'W4.3a_C6Are you able to provide a potential financial impact figure?'
    ,'W4.3a_C7Potential financial impact figure (currency)'
    ,'W4.3a_C8Potential financial impact figure â€“ minimum (currency)'
    ,'W4.3a_C9Potential financial impact figure â€“ maximum (currency)'
    ,'W4.3a_C10Explanation of financial impact'
    ,'W4.3a_C1Type of opportunity'
    ,'W4.3a_C2Primary water-related opportunity'
    ,'W4.3a_C3Company-specific description & strategy to realize opportunity'
    ], axis=1)



## Create df by company
W43a_KPI = W43a.groupby(['account_number','organization'], as_index=False).agg(
    opportunity_count = ('row_number', 'count'),
    opportunity_type_efficiency = ('opportunity_type_Efficiency', 'mean'),
    opportunity_type_products_and_services = ('opportunity_type_Products and services', 'mean'),
    opportunity_type_resilience = ('opportunity_type_Resilience', 'mean'),
    opportunity_type_markets = ('opportunity_type_Markets', 'mean'),
    opportunity_type_other = ('opportunity_type_Other', 'mean'),
    C7_C8_C9_opportunity_financial_impact = ('C7_C8_C9_opportunity_financial_impact', 'sum'), 
    C4_timeframe_years = ('C4_timeframe', 'mean'),
    C5_opportunity_impact_magnitude = ('C5_opportunity_impact_magnitude', 'mean') 
    )

# pandas sum() skips NaN by design, so an all-NaN group sums to 0; recode those zeros as NaN
# (note: this also turns genuine zero totals into NaN)
W43a_KPI.loc[W43a_KPI['C7_C8_C9_opportunity_financial_impact']==0, 'C7_C8_C9_opportunity_financial_impact'] = np.nan


## Create KPIs:
# Magnitude / Years (this table has no likelihood column, see warning below)
"""
!!! WARNING !!!
There is no associated likelihood with this df
"""
# W43a_KPI['opportunity_index'] = (W43a_KPI['C7_risk_likelihood'] * W42_country['C6_risk_impact_magnitude']) / W42_country['C5_timeframe_years'] 
W43a_KPI['opportunity_index'] = (W43a_KPI['C5_opportunity_impact_magnitude']) / W43a_KPI['C4_timeframe_years'] 
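# Assign the result back: in-place replace on a selected column is not guaranteed to modify the dataframe
W43a_KPI['opportunity_index'] = W43a_KPI['opportunity_index'].replace(np.inf, np.nan)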

## Create KPI (0: smallest opportunity; 1: largest opportunity)
W43a_KPI['KPI_rank_opportunity_financial_impact'] = W43a_KPI['C7_C8_C9_opportunity_financial_impact'].rank(pct=True)
W43a_KPI['KPI_rank_opportunity_timeframe'] = W43a_KPI['C4_timeframe_years'].rank(pct=True)
W43a_KPI['KPI_rank_opportunity_impact_magnitude'] = W43a_KPI['C5_opportunity_impact_magnitude'].rank(pct=True)
W43a_KPI['KPI_rank_opportunity_index'] = W43a_KPI['opportunity_index'].rank(pct=True)


## Visualize results
#W43a.head(5)
#W43a_KPI.head(5)
# W42_country.head(5)
# W43a['W4.3a_C1Type of opportunity'].value_counts(dropna=True)

###########################################################  TABLE 5.1 Water use by facility ######################################################################

W51 = w20[w20['question_number']=='W5.1']
W51 = W51.drop(['survey_year', 'response_received_date','accounting_period_to','ors_response_id','submission_date','page_name','module_name',
          'question_number','question_unique_reference','data_point_name','data_point_id','comments','table_columns_unique_reference','column_number'], axis=1)
W51=W51.pivot(index=['account_number','organization','row_number','row_name'], columns='column_name', values='response_value').reset_index()
W51=W51.sort_values(by=['account_number','organization','row_number'])
# C11=C11.dropna()

## Drop rows where row_number == 0
W51 = W51[W51['row_number']!=0]

## Convert coordinates into Point
W51['W5.1_C5Longitude'] = W51['W5.1_C5Longitude'].astype(float)
W51['W5.1_C4Latitude'] = W51['W5.1_C4Latitude'].astype(float)
W51['C4_C5_facility_coordinates_lon_lat'] = W51.apply(lambda x: Point(x['W5.1_C5Longitude'], x['W5.1_C4Latitude']), axis=1)

## Create binaries for whether there is water stress in area:
# Recode Unknown into NaN
W51.loc[W51['W5.1_C6Located in area with water stress']=='Unknown', 'W5.1_C6Located in area with water stress'] = np.nan

# Binarize
one_hot = pd.get_dummies(W51['W5.1_C6Located in area with water stress']).add_prefix('C6_facility_in_area_with_water_stress_')
W51 = W51.join(one_hot)
del W51['C6_facility_in_area_with_water_stress_No']


## Convert (and rename) strings to float:
float_cols = [
 'W5.1_C9Total water withdrawals at this facility (megaliters/year)',
 'W5.1_C11Withdrawals from fresh surface water, including rainwater, water from wetlands, rivers and lakes', 
 'W5.1_C12Withdrawals from brackish surface water/seawater', 
 'W5.1_C13Withdrawals from groundwater - renewable', 
 'W5.1_C14Withdrawals from groundwater - non-renewable', 
 'W5.1_C15Withdrawals from produced/entrained water', 
 'W5.1_C16Withdrawals from third party sources', 
 'W5.1_C17Total water discharges at this facility (megaliters/year)',
 'W5.1_C19Discharges to fresh surface water', 
 'W5.1_C20Discharges to brackish surface water/seawater', 
 'W5.1_C21Discharges to groundwater', 'W5.1_C22Discharges to third party destinations', 
 'W5.1_C23Total water consumption at this facility (megaliters/year)'
]
float_cols_renamed = [
 'C9_total_water_withdrawals_MLpa',
 'C11_withdrawals_from_fresh_surface_water_MLpa', 
 'C12_withdrawals_from_brackish_surface_water_or_seawater_MLpa', 
 'C13_withdrawals_from_groundwater_renewable_MLpa', 
 'C14_withdrawals_from_groundwater_non_renewable_MLpa', 
 'C15_withdrawals_from_produced_water_MLpa', 
 'C16_withdrawals_from_third_party_sources_MLpa', 
 'C17_total_water_discharges_at_facility_MLpa',
 'C19_discharges_to_fresh_surface_water_MLpa', 
 'C20_discharges_to_brackish_surface_water_or_seawater_MLpa', 
 'C21_discharges_to_groundwater_MLpa', 
 'C22_discharges_to_third_party_destinations_MLpa', 
 'C23_total_water_consumption_at_facility_MLpa'
]
# Cast each raw column to float and store it under its shorter name
for col, new_col in zip(float_cols, float_cols_renamed):
    W51[new_col] = W51[col].astype(float)
    del W51[col]

    
### Recode time comparisons
"""
CONSUMPTION:
    Much lower = 1/5
    Lower = 2/5
    About the same = 3/5
    Higher = 4/5
    Much higher = 5/5
    This is our first year of measurement = np.nan
"""
compa_cols = ['W5.1_C10Comparison of total withdrawals with previous reporting year',
    'W5.1_C18Comparison of total discharges with previous reporting year',
    'W5.1_C24Comparison of total consumption with previous reporting year']
compa_cols_new = [
    'C10_water_total_withdrawals_vs_last_year',
    'C18_water_total_discharges_vs_last_year',
    'C24_water_total_consumption_vs_last_year']

# Recode each year-on-year comparison column onto the 1/5..5/5 scale
for col, compa_col_new in zip(compa_cols, compa_cols_new):
    W51[compa_col_new] = np.nan
    W51.loc[W51[col]=='Much lower', compa_col_new] = 1/5
    W51.loc[W51[col]=='Lower', compa_col_new] = 2/5
    W51.loc[W51[col]=='About the same', compa_col_new] = 3/5
    W51.loc[W51[col]=='Higher', compa_col_new] = 4/5
    W51.loc[W51[col]=='Much higher', compa_col_new] = 5/5
    W51.loc[W51[col]=='This is our first year of measurement', compa_col_new] = np.nan

    
## Create composite cols:
W51['C9_C17_C23_water_total_use_MLpa'] = W51['C9_total_water_withdrawals_MLpa'] + W51['C17_total_water_discharges_at_facility_MLpa'] + W51['C23_total_water_consumption_at_facility_MLpa']
W51['C10_C18_C24_water_total_use_vs_last_year'] = W51['C10_water_total_withdrawals_vs_last_year'] + W51['C18_water_total_discharges_vs_last_year'] + W51['C24_water_total_consumption_vs_last_year']


## Drop last cols:
W51 = W51.drop([
#     'row_number'
    'row_name'
    ,'W5.1_C25Please explain'
    ,'W5.1_C2Facility name (optional)'
    ,'W5.1_C3Country/Area & River basin'
    ,'W5.1_C5Longitude'
    ,'W5.1_C4Latitude'
    ,'W5.1_C1Facility reference number'
    ,'W5.1_C8Oil & gas sector business division' # Too few data: only 42 answers
    ,'W5.1_C6Located in area with water stress'
    ,'W5.1_C10Comparison of total withdrawals with previous reporting year'
    ,'W5.1_C18Comparison of total discharges with previous reporting year'
    ,'W5.1_C24Comparison of total consumption with previous reporting year'
    ], axis=1)



### Create 3 dataframes: by facility (original df), by company and by country
## Prepare for groupbys
# Create weight based on tot water use:
wm_water_use = lambda x: np.average(x, weights=W51.loc[x.index, 'C9_C17_C23_water_total_use_MLpa'])
# wm_water_use = lambda x: np.average(x, weights=W51.loc[x.index, 'facility_count'])


# Replace NaN with 0, then add a tiny epsilon so that np.average never receives weights that sum to zero
W51['C9_C17_C23_water_total_use_MLpa'] = W51['C9_C17_C23_water_total_use_MLpa'].fillna(0)
W51['C10_water_total_withdrawals_vs_last_year'] = W51['C10_water_total_withdrawals_vs_last_year'].fillna(0)
W51['C18_water_total_discharges_vs_last_year'] = W51['C18_water_total_discharges_vs_last_year'].fillna(0)
W51['C24_water_total_consumption_vs_last_year'] = W51['C24_water_total_consumption_vs_last_year'].fillna(0)

W51['C9_C17_C23_water_total_use_MLpa'] = W51['C9_C17_C23_water_total_use_MLpa'] + 0.0000000001
W51['C10_water_total_withdrawals_vs_last_year'] = W51['C10_water_total_withdrawals_vs_last_year'] + 0.0000000001
W51['C18_water_total_discharges_vs_last_year'] = W51['C18_water_total_discharges_vs_last_year'] + 0.0000000001
W51['C24_water_total_consumption_vs_last_year'] = W51['C24_water_total_consumption_vs_last_year'] + 0.0000000001

## Facility
W51_by_facility = W51.copy()

## Company 
W51_by_company = W51.groupby(['account_number','organization'], as_index=False).agg(
    facility_count = ('row_number', 'count'),
    C6_facility_in_area_with_water_stress_Yes = ('C6_facility_in_area_with_water_stress_Yes', wm_water_use),
    C11_withdrawals_from_fresh_surface_water_MLpa = ('C11_withdrawals_from_fresh_surface_water_MLpa', 'sum'),
    C12_withdrawals_from_brackish_surface_water_or_seawater_MLpa = ('C12_withdrawals_from_brackish_surface_water_or_seawater_MLpa', 'sum'),
    C13_withdrawals_from_groundwater_renewable_MLpa = ('C13_withdrawals_from_groundwater_renewable_MLpa', 'sum'),
    C14_withdrawals_from_groundwater_non_renewable_MLpa = ('C14_withdrawals_from_groundwater_non_renewable_MLpa', 'sum'),
    C15_withdrawals_from_produced_water_MLpa = ('C15_withdrawals_from_produced_water_MLpa', 'sum'),
    C16_withdrawals_from_third_party_sources_MLpa = ('C16_withdrawals_from_third_party_sources_MLpa', 'sum'),
    C19_discharges_to_fresh_surface_water_MLpa = ('C19_discharges_to_fresh_surface_water_MLpa', 'sum'),
    C20_discharges_to_brackish_surface_water_or_seawater_MLpa = ('C20_discharges_to_brackish_surface_water_or_seawater_MLpa', 'sum'),
    C21_discharges_to_groundwater_MLpa = ('C21_discharges_to_groundwater_MLpa', 'sum'),
    C22_discharges_to_third_party_destinations_MLpa = ('C22_discharges_to_third_party_destinations_MLpa', 'sum'),
    C9_total_water_withdrawals_MLpa = ('C9_total_water_withdrawals_MLpa', 'sum'),
    C17_total_water_discharges_at_facility_MLpa = ('C17_total_water_discharges_at_facility_MLpa', 'sum'),
    C23_total_water_consumption_at_facility_MLpa = ('C23_total_water_consumption_at_facility_MLpa', 'sum'),
    C9_C17_C23_water_total_use_MLpa = ('C9_C17_C23_water_total_use_MLpa', 'sum'),
    C10_water_total_withdrawals_vs_last_year = ('C10_water_total_withdrawals_vs_last_year', wm_water_use),
    C18_water_total_discharges_vs_last_year = ('C18_water_total_discharges_vs_last_year', wm_water_use),
    C24_water_total_consumption_vs_last_year = ('C24_water_total_consumption_vs_last_year', wm_water_use),
    C10_C18_C24_water_total_use_vs_last_year = ('C10_C18_C24_water_total_use_vs_last_year', wm_water_use)
    )

W51_by_company['C10_C18_C24_water_total_use_vs_last_year'] = W51_by_company['C10_water_total_withdrawals_vs_last_year'] + W51_by_company['C18_water_total_discharges_vs_last_year'] + W51_by_company['C24_water_total_consumption_vs_last_year']


## Country
W51_by_country = W51.groupby(['W5.1_C3Country/Area & River basin_G'], as_index=False).agg(
    facility_count = ('row_number', 'count'),
    C6_facility_in_area_with_water_stress_Yes = ('C6_facility_in_area_with_water_stress_Yes', wm_water_use),
    C11_withdrawals_from_fresh_surface_water_MLpa = ('C11_withdrawals_from_fresh_surface_water_MLpa', 'sum'),
    C12_withdrawals_from_brackish_surface_water_or_seawater_MLpa = ('C12_withdrawals_from_brackish_surface_water_or_seawater_MLpa', 'sum'),
    C13_withdrawals_from_groundwater_renewable_MLpa = ('C13_withdrawals_from_groundwater_renewable_MLpa', 'sum'),
    C14_withdrawals_from_groundwater_non_renewable_MLpa = ('C14_withdrawals_from_groundwater_non_renewable_MLpa', 'sum'),
    C15_withdrawals_from_produced_water_MLpa = ('C15_withdrawals_from_produced_water_MLpa', 'sum'),
    C16_withdrawals_from_third_party_sources_MLpa = ('C16_withdrawals_from_third_party_sources_MLpa', 'sum'),
    C19_discharges_to_fresh_surface_water_MLpa = ('C19_discharges_to_fresh_surface_water_MLpa', 'sum'),
    C20_discharges_to_brackish_surface_water_or_seawater_MLpa = ('C20_discharges_to_brackish_surface_water_or_seawater_MLpa', 'sum'),
    C21_discharges_to_groundwater_MLpa = ('C21_discharges_to_groundwater_MLpa', 'sum'),
    C22_discharges_to_third_party_destinations_MLpa = ('C22_discharges_to_third_party_destinations_MLpa', 'sum'),
    C9_total_water_withdrawals_MLpa = ('C9_total_water_withdrawals_MLpa', 'sum'),
    C17_total_water_discharges_at_facility_MLpa = ('C17_total_water_discharges_at_facility_MLpa', 'sum'),
    C23_total_water_consumption_at_facility_MLpa = ('C23_total_water_consumption_at_facility_MLpa', 'sum'),
    C9_C17_C23_water_total_use_MLpa = ('C9_C17_C23_water_total_use_MLpa', 'sum'),
    C10_water_total_withdrawals_vs_last_year = ('C10_water_total_withdrawals_vs_last_year', wm_water_use),
    C18_water_total_discharges_vs_last_year = ('C18_water_total_discharges_vs_last_year', wm_water_use),
    C24_water_total_consumption_vs_last_year = ('C24_water_total_consumption_vs_last_year', wm_water_use),
    C10_C18_C24_water_total_use_vs_last_year = ('C10_C18_C24_water_total_use_vs_last_year', wm_water_use)
    )

W51_by_country['C10_C18_C24_water_total_use_vs_last_year'] = W51_by_country['C10_water_total_withdrawals_vs_last_year'] + W51_by_country['C18_water_total_discharges_vs_last_year'] + W51_by_country['C24_water_total_consumption_vs_last_year']



### KPIs
# # W43a_KPI['opportunity_index'] = (W43a_KPI['C7_risk_likelihood'] * W42_country['C6_risk_impact_magnitude']) / W42_country['C5_timeframe_years'] 
# W43a_KPI['opportunity_index'] = (W43a_KPI['C5_opportunity_impact_magnitude']) / W43a_KPI['C4_timeframe_years'] 
# W43a_KPI['opportunity_index'].replace(np.inf, np.nan, inplace=True)

## Create KPIs as percentile ranks in (0, 1]: the larger the underlying value, the closer the rank is to 1
# By facility
W51_by_facility['KPI_rank_facility_in_area_with_water_stress'] =W51_by_facility['C6_facility_in_area_with_water_stress_Yes'].rank(pct=True)
W51_by_facility['KPI_rank_total_water_withdrawals'] =W51_by_facility['C9_total_water_withdrawals_MLpa'].rank(pct=True)
W51_by_facility['KPI_rank_total_water_discharges_at_facility'] =W51_by_facility['C17_total_water_discharges_at_facility_MLpa'].rank(pct=True)
W51_by_facility['KPI_rank_total_water_consumption_at_facility'] =W51_by_facility['C23_total_water_consumption_at_facility_MLpa'].rank(pct=True)
W51_by_facility['KPI_rank_water_total_use'] =W51_by_facility['C9_C17_C23_water_total_use_MLpa'].rank(pct=True)
W51_by_facility['KPI_rank_water_total_withdrawals_vs_last_year'] =W51_by_facility['C10_water_total_withdrawals_vs_last_year'].rank(pct=True)
W51_by_facility['KPI_rank_water_total_discharges_vs_last_year'] =W51_by_facility['C18_water_total_discharges_vs_last_year'].rank(pct=True)
W51_by_facility['KPI_rank_water_total_consumption_vs_last_year'] =W51_by_facility['C24_water_total_consumption_vs_last_year'].rank(pct=True)
W51_by_facility['KPI_rank_water_total_use_vs_last_year'] =W51_by_facility['C10_C18_C24_water_total_use_vs_last_year'].rank(pct=True)

# By company
W51_by_company['KPI_rank_facility_in_area_with_water_stress'] =W51_by_company['C6_facility_in_area_with_water_stress_Yes'].rank(pct=True)
W51_by_company['KPI_rank_total_water_withdrawals'] =W51_by_company['C9_total_water_withdrawals_MLpa'].rank(pct=True)
W51_by_company['KPI_rank_total_water_discharges_at_facility'] =W51_by_company['C17_total_water_discharges_at_facility_MLpa'].rank(pct=True)
W51_by_company['KPI_rank_total_water_consumption_at_facility'] =W51_by_company['C23_total_water_consumption_at_facility_MLpa'].rank(pct=True)
W51_by_company['KPI_rank_water_total_use'] =W51_by_company['C9_C17_C23_water_total_use_MLpa'].rank(pct=True)
W51_by_company['KPI_rank_water_total_withdrawals_vs_last_year'] =W51_by_company['C10_water_total_withdrawals_vs_last_year'].rank(pct=True)
W51_by_company['KPI_rank_water_total_discharges_vs_last_year'] =W51_by_company['C18_water_total_discharges_vs_last_year'].rank(pct=True)
W51_by_company['KPI_rank_water_total_consumption_vs_last_year'] =W51_by_company['C24_water_total_consumption_vs_last_year'].rank(pct=True)
W51_by_company['KPI_rank_water_total_use_vs_last_year'] =W51_by_company['C10_C18_C24_water_total_use_vs_last_year'].rank(pct=True)

# By country
W51_by_country['KPI_rank_facility_in_area_with_water_stress'] =W51_by_country['C6_facility_in_area_with_water_stress_Yes'].rank(pct=True)
W51_by_country['KPI_rank_total_water_withdrawals'] =W51_by_country['C9_total_water_withdrawals_MLpa'].rank(pct=True)
W51_by_country['KPI_rank_total_water_discharges_at_facility'] =W51_by_country['C17_total_water_discharges_at_facility_MLpa'].rank(pct=True)
W51_by_country['KPI_rank_total_water_consumption_at_facility'] =W51_by_country['C23_total_water_consumption_at_facility_MLpa'].rank(pct=True)
W51_by_country['KPI_rank_water_total_use'] =W51_by_country['C9_C17_C23_water_total_use_MLpa'].rank(pct=True)
W51_by_country['KPI_rank_water_total_withdrawals_vs_last_year'] =W51_by_country['C10_water_total_withdrawals_vs_last_year'].rank(pct=True)
W51_by_country['KPI_rank_water_total_discharges_vs_last_year'] =W51_by_country['C18_water_total_discharges_vs_last_year'].rank(pct=True)
W51_by_country['KPI_rank_water_total_consumption_vs_last_year'] =W51_by_country['C24_water_total_consumption_vs_last_year'].rank(pct=True)
W51_by_country['KPI_rank_water_total_use_vs_last_year'] =W51_by_country['C10_C18_C24_water_total_use_vs_last_year'].rank(pct=True)
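
# --- Illustrative sketch (not part of the original pipeline) ---
# The percentile-rank KPIs above repeat one pattern for every table: take a metric column and
# store its rank(pct=True) under a 'KPI_rank_' name. A minimal sketch of that pattern as a
# reusable helper follows. The helper name `add_rank_kpis`, its mapping argument and the toy
# frame `_rank_demo` are assumptions made for illustration only.
import pandas as pd

def add_rank_kpis(df, kpi_to_metric):
    """Return a copy of df with one percentile-rank column per (KPI name, metric column) pair."""
    out = df.copy()
    for kpi_name, metric_col in kpi_to_metric.items():
        # rank(pct=True) maps the smallest value to 1/n and the largest value to 1.0
        out['KPI_rank_' + kpi_name] = out[metric_col].rank(pct=True)
    return out

# Toy usage: four facilities with made-up withdrawal volumes
_rank_demo = add_rank_kpis(
    pd.DataFrame({'C9_total_water_withdrawals_MLpa': [10.0, 30.0, 20.0, 40.0]}),
    {'total_water_withdrawals': 'C9_total_water_withdrawals_MLpa'},
)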



## Visualize results
#W51.head(5)
#W51_by_facility.head(5)
#W51_by_company.head(5)
#W51_by_country.head(5)
# W51['W5.1_C24Comparison of total consumption with previous reporting year'].value_counts()
# W51.size

######################################################## TABLE 8.1a Water targets ###############################################################################

W81a = w20[w20['question_number']=='W8.1a']
W81a = W81a.drop(['survey_year', 'response_received_date','accounting_period_to','ors_response_id','submission_date','page_name','module_name',
          'question_number','question_unique_reference','data_point_name','data_point_id','comments','table_columns_unique_reference','column_number'], axis=1)
W81a=W81a.pivot(index=['account_number','organization','row_number','row_name'], columns='column_name', values='response_value').reset_index()
W81a=W81a.sort_values(by=['account_number','organization','row_number'])
# C11=C11.dropna()

## Drop rows where == nan
# W81a = W81a[W81a['row_number']!=0]

## Convert (and rename) strings to float:
float_cols = [
'W8.1a_C7Baseline year',
'W8.1a_C8Start year',
'W8.1a_C9Target year',
'W8.1a_C10% of target achieved'
]
float_cols_renamed = [
'C7_baseline_year',
'C8_start_year',
'C9_target_year',
'C10_pctg_target_achieved'
]
# Cast each column to float and store it under its shorter name
for col, new_col in zip(float_cols, float_cols_renamed):
    W81a[new_col] = W81a[col].astype(float)
    del W81a[col]

W81a['C10_pctg_target_achieved'] = W81a['C10_pctg_target_achieved'] / 100


## Re-categorize col values with low frequencies:
cols_to_recat = ['W8.1a_C2Category of target', 'W8.1a_C3Level', 'W8.1a_C4Primary motivation', 'W8.1a_C6Quantitative metric']
W81a = W81a.apply(lambda x: x.mask(x.map(x.value_counts())<10, 'Other') if x.name in cols_to_recat else x)


"""
!!! WARNING !!!
Targets are too disparate/sparse, meaning a comparison of the 'Other' classifications would be inaccurate.
At this point, let's calculate overall achievement metrics & KPI
"""


## Drop last cols:
W81a = W81a.drop([
#     'row_number'
    'row_name'
    ,'W8.1a_C11Please explain'
    ,'W8.1a_C1Target reference number'
    ,'W8.1a_C5Description of target'
    ], axis=1)


## Calculate extra features:
# Calculate years to achievement (start and target year both included):
W81a['target_years_to_achievement'] = W81a['C9_target_year'] - W81a['C8_start_year'] + 1 
# Years still available after the 2020 reporting year (0 once the target year has passed)
W81a.loc[W81a['C9_target_year']>2020, 'years_left'] = W81a['C9_target_year']-2020
W81a.loc[W81a['C9_target_year']<=2020, 'years_left'] = 0
# Years elapsed so far: counted up to 2020 for running targets, the full target period for expired ones
W81a.loc[W81a['C9_target_year']>2020, 'years_past'] = 2020 - W81a['C8_start_year'] + 1
W81a.loc[W81a['C9_target_year']<=2020, 'years_past'] = W81a['target_years_to_achievement']

# Progress
W81a['target_pctg_achievement_per_year']= 1/W81a['target_years_to_achievement']
W81a['actual_pctg_achievement_per_year']= W81a['C10_pctg_target_achieved']/W81a['years_past']
W81a['actual_years_to_achiev'] = 1/W81a['actual_pctg_achievement_per_year']

# Replace the infinities produced by divisions by zero with NaN
# (-np.inf is used because np.NINF was removed in NumPy 2.0; assigning back avoids chained in-place edits)
rate_cols = ['target_pctg_achievement_per_year', 'actual_pctg_achievement_per_year', 'actual_years_to_achiev']
W81a[rate_cols] = W81a[rate_cols].replace([np.inf, -np.inf], np.nan)

W81a['Years_diff'] = W81a['target_years_to_achievement']-W81a['actual_years_to_achiev']
# Treat differences larger than one standard deviation (in either direction) as outliers
years_diff_std = W81a['Years_diff'].std()
W81a.loc[W81a['Years_diff'].abs() > years_diff_std, 'Years_diff'] = np.nan
W81a['Years_diff'] = W81a['Years_diff'].replace([np.inf, -np.inf], np.nan)

W81a['diff_actual_vs_achieved_pctg_achievement_per_year'] = W81a['actual_pctg_achievement_per_year'] / W81a['target_pctg_achievement_per_year']


## Group by company
W81a_by_company = W81a.groupby(['account_number','organization'], as_index=False).agg(
    targets_count = ('row_number', 'count'),
    C7_baseline_year=('C7_baseline_year', 'mean'),
    C8_start_year=('C8_start_year', 'mean'),
    C9_target_year=('C9_target_year', 'mean'),
    C10_pctg_target_achieved=('C10_pctg_target_achieved', 'mean'),
    target_years_to_achievement=('target_years_to_achievement', 'mean'),
    years_left=('years_left', 'mean'),
    years_past=('years_past', 'mean'),
    target_pctg_achievement_per_year=('target_pctg_achievement_per_year', 'mean'),
    actual_pctg_achievement_per_year=('actual_pctg_achievement_per_year', 'mean'),
    actual_years_to_achiev=('actual_years_to_achiev', 'mean'),
    Years_diff=('Years_diff', 'mean'),
    diff_actual_vs_achieved_pctg_achievement_per_year = ('diff_actual_vs_achieved_pctg_achievement_per_year', 'mean')
    ).dropna().reset_index(drop=True)



## Calculate KPIs
# Rank the company-level aggregates (ranking the target-level W81a rows would misalign on the index)
W81a_by_company['KPI_rank_objective_strategy'] = W81a_by_company['Years_diff'].rank(pct=True)
W81a_by_company['KPI_rank_objective_progress'] = W81a_by_company['C10_pctg_target_achieved'].rank(pct=True)
W81a_by_company['KPI_rank_objective_ambition'] = W81a_by_company['diff_actual_vs_achieved_pctg_achievement_per_year'].rank(pct=True)

W81a_by_company = W81a_by_company.sort_values(by=['KPI_rank_objective_strategy']).reset_index(drop=True)



## Visualize results
#W81a.head(5)
#W81a_by_company.head(5)
# W81a['W8.1a_C4Primary motivation'].value_counts()
# W81a.size
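
# --- Illustrative sketch (not part of the original pipeline) ---
# A worked example of the target-pace arithmetic used above, for a single hypothetical target.
# The function name `target_pace`, its arguments and the default 2020 reporting year are
# assumptions for illustration; elapsed time is counted up to the reporting year for targets
# that are still running.
import numpy as np

def target_pace(start_year, target_year, pct_achieved, reporting_year=2020):
    planned_years = target_year - start_year + 1                  # inclusive planning window
    elapsed_years = min(target_year, reporting_year) - start_year + 1
    planned_rate = 1 / planned_years                               # share of the target due per year
    actual_rate = pct_achieved / elapsed_years                     # share actually delivered per year
    projected_years = 1 / actual_rate if actual_rate > 0 else np.nan
    return {
        'planned_years': planned_years,
        'elapsed_years': elapsed_years,
        'years_diff': planned_years - projected_years,             # > 0: ahead of plan, < 0: behind
        'pace_ratio': actual_rate / planned_rate,                  # 1.0 means exactly on plan
    }

# A 2016-2025 target that is 50% complete in 2020 is exactly on plan
_pace_demo = target_pace(2016, 2025, 0.5)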

### Climate
##################################################################### Make list of all df with KPIs: ###############################################################
# Companies 
climate_df_company = [C23a_KPI,
    C24a_KPI,
    C41a_KPI,
    C75_KPI,
    C82a,
    C82d_KPI]

climate_df_company_suffixes = [
    '_risks_identified',
    '_opportunities_identified',
    '_emissions_targets',
    '_emissions_targets_by_country',
    '_energy_consumption',
    '_energy_consumption_details']  # one suffix per dataframe in climate_df_company


## Merge dataframes together
# Companies
company_climate_cleaned_2020 = q20[['account_number', 'organization']].drop_duplicates(ignore_index=True)
ix = 0
count_cols = 0
for df_ in climate_df_company:
    company_climate_cleaned_2020 = pd.merge(company_climate_cleaned_2020, df_.copy().add_suffix(climate_df_company_suffixes[ix]), 
                                 how='outer', 
                                 left_on='account_number', 
                                 right_on='account_number'+climate_df_company_suffixes[ix], 
                                ) 

    ix += 1
    count_cols += len(df_.columns)

# Sort
company_climate_cleaned_2020 = company_climate_cleaned_2020.sort_values('account_number', ascending=True, ignore_index=True)

# Make a single df with only KPIs   
KPIs_cols = ['account_number', 'organization']
KPIs_cols += [col for col in company_climate_cleaned_2020.columns if 'KPI_rank_' in col]
company_climate_KPIs_2020 = company_climate_cleaned_2020[KPIs_cols]

# Do the same for country df
country_climate_cleaned_2020 = C75_country.copy()
KPIs_cols = ['C7.5_C1Country/Region']
KPIs_cols += [col for col in country_climate_cleaned_2020.columns if 'KPI_rank_' in col]
country_climate_KPIs_2020 = country_climate_cleaned_2020[KPIs_cols]



### Water
## Make list of all df with KPIs:
# Companies 
water_df_company = [
    W11_c,
    W12b_c,
    W41b,
    W42_company,
    W43a_KPI,
    W51_by_company,
    W81a_by_company
    ]

water_df_company_suffixes = [
    '_water_importance',
    '_water_use',
    '_risks_overall_identified',
    '_risks_identified',
    '_opportunities_identified',
    '_water_use_by_facility',
    '_water_targets'
    ]

# Country
water_df_country = [
    W41c_by_country,
    W42_country,
    W51_by_country
    ]

water_df_country_suffixes = [
    '_risks_overall_identified_by_geography',
    '_risks_identified',
    '_water_use_by_facility'
    ]


## Merge dataframes together
# Companies
company_water_cleaned_2020 = w20[['account_number', 'organization']].drop_duplicates(ignore_index=True)
ix = 0
count_cols = 0
for df_ in water_df_company:
    try:
        company_water_cleaned_2020 = pd.merge(company_water_cleaned_2020, df_.copy().add_suffix(water_df_company_suffixes[ix]), 
                                     how='outer', 
                                     left_on='account_number', 
                                     right_on='account_number'+water_df_company_suffixes[ix], 
                                    ) 
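    except Exception as err:
        # Report which dataframe could not be merged (e.g. no 'account_number' column) and carry on
        print(f"Merge failed for suffix '{water_df_company_suffixes[ix]}': {err}")
        print(df_.columns)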

    ix += 1
    count_cols += len(df_.columns)

# Sort
company_water_cleaned_2020 = company_water_cleaned_2020.sort_values('account_number', ascending=True, ignore_index=True)

# Make a single df with only KPIs   
KPIs_cols = ['account_number', 'organization']
KPIs_cols += [col for col in company_water_cleaned_2020.columns if 'KPI_rank_' in col]
company_water_KPIs_2020 = company_water_cleaned_2020[KPIs_cols]


# Countries
country_water_cleaned_2020 = W41c_by_country.copy().add_suffix(water_df_country_suffixes[0])
country_water_cleaned_2020 = pd.merge(country_water_cleaned_2020, 
                                     W42_country.copy().add_suffix(water_df_country_suffixes[1]), 
                                     how='outer', 
                                     left_on='W4.1c_C1Country/Area & River basin_G'+water_df_country_suffixes[0], 
                                     right_on='W4.2_C1Country/Area & River basin_G'+water_df_country_suffixes[1], 
                                    ) 
country_water_cleaned_2020 = pd.merge(country_water_cleaned_2020, 
                                     W51_by_country.copy().add_suffix(water_df_country_suffixes[2]), 
                                     how='outer', 
                                     left_on='W4.1c_C1Country/Area & River basin_G'+water_df_country_suffixes[0], 
                                     right_on='W5.1_C3Country/Area & River basin_G'+water_df_country_suffixes[2], 
                                    ) 

country_water_cleaned_2020 = country_water_cleaned_2020.rename(columns={'W4.1c_C1Country/Area & River basin_G'+water_df_country_suffixes[0]: 'country_or_region'})

# Sort
country_water_cleaned_2020 = country_water_cleaned_2020.sort_values('country_or_region', ascending=True, ignore_index=True)

# Make a single df with only KPIs   
KPIs_cols = ['country_or_region']
KPIs_cols += [col for col in country_water_cleaned_2020.columns if 'KPI_rank_' in col]
country_water_KPIs_2020 = country_water_cleaned_2020[KPIs_cols]


## Facilities
facility_water_cleaned_2020 = W51_by_facility.copy()
KPIs_cols = ['account_number','organization', 'C4_C5_facility_coordinates_lon_lat']
KPIs_cols += [col for col in facility_water_cleaned_2020.columns if 'KPI_rank_' in col]
facility_water_KPIs_2020 = facility_water_cleaned_2020[KPIs_cols]



### Save all df into CSV:
company_climate_cleaned_2020.to_csv('./df_KPIs/Final/company_climate_cleaned_2020.csv')
company_climate_KPIs_2020.to_csv('./df_KPIs/Final/company_climate_KPIs_2020.csv')
country_climate_cleaned_2020.to_csv('./df_KPIs/Final/country_climate_cleaned_2020.csv')
country_climate_KPIs_2020.to_csv('./df_KPIs/Final/country_climate_KPIs_2020.csv')
company_water_cleaned_2020.to_csv('./df_KPIs/Final/company_water_cleaned_2020.csv')
company_water_KPIs_2020.to_csv('./df_KPIs/Final/company_water_KPIs_2020.csv')
country_water_cleaned_2020.to_csv('./df_KPIs/Final/country_water_cleaned_2020.csv')
country_water_KPIs_2020.to_csv('./df_KPIs/Final/country_water_KPIs_2020.csv')
facility_water_cleaned_2020.to_csv('./df_KPIs/Final/facility_water_cleaned_2020.csv')
facility_water_KPIs_2020.to_csv('./df_KPIs/Final/facility_water_KPIs_2020.csv')


# print(count_cols) # 90+2 = 92
# print(len(company_water_cleaned_2020.columns))
# print(len(climate_cleaned_2020.index))
# climate_cleaned_2020.head(5)
# climate_KPIs_2020.head(5)
#country_water_KPIs_2020.head(5)


################################################################################################################################################################
################################################################################################################################################################
################################################################################################################################################################
###################################################                                                      #######################################################
###################################################                 C I T I E S - 2 0 2 0                #######################################################
###################################################                                                      #######################################################
################################################################################################################################################################
################################################################################################################################################################
################################################################################################################################################################

#### ########################################################### TABLE 2.1 - CLIMATE HAZARD #################################################################### 
#Please list the most significant climate hazards faced by your city and indicate the probability and consequence of these hazards, 
### as well as the expected future change in frequency and intensity. Please also select the most relevant assets or services that are affected by the climate hazard 
### and provide a description of the impact.

# Import table

C21 = c20[c20['Question Number']=='2.1']
C21 = C21.drop(['Questionnaire', 'Year Reported to CDP','Parent Section','Section','Comments','File Name','Last update','Row Name','Column Number', 'Question Name',
               'Question Number','CDP Region', 'Country'], axis=1)

#Prepare table for analysis

def convert(values): 
    # Turn a list of answers into a tuple (parameter renamed so it no longer shadows the built-in `list`)
    return tuple(values) 
C21 =C21.groupby(['Account Number', 'Organization','Column Name', 'Row Number'])['Response Answer'].apply(list).reset_index()
C21['Response Answer'] = C21['Response Answer'].apply(convert)
C21=C21.pivot(index=['Account Number','Organization','Row Number'], columns='Column Name', values='Response Answer').reset_index()
C21 = C21.drop(['Most relevant assets / services affected overall','Please describe the impacts experienced so far, and how you expect the hazard to impact in   the future',
               'Please identify which vulnerable populations are affected','Social impact of hazard overall'], axis=1)
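def convertTuple(tup): 
    # Concatenate the answers held in a tuple into one string; leave missing cells (NaN) untouched
    if not isinstance(tup, tuple):
        return tup
    return functools.reduce(operator.add, tup)
collist = C21.columns.values.tolist()[3:]
# Apply cell by cell (DataFrame.apply would pass a whole column to convertTuple at once)
for col in collist:
    C21[col] = C21[col].apply(convertTuple)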

# Current risk likelihood indicator  

C21['risk_likelihood_curr'] = np.nan
C21.loc[C21['Current probability of hazard']=='High', 'risk_likelihood_curr'] = 5/5
C21.loc[C21['Current probability of hazard']=='Medium High', 'risk_likelihood_curr'] = 4/5
C21.loc[C21['Current probability of hazard']=='Medium', 'risk_likelihood_curr'] = 3/5
C21.loc[C21['Current probability of hazard']=='Medium Low', 'risk_likelihood_curr'] = 2/5
C21.loc[C21['Current probability of hazard']=='Low', 'risk_likelihood_curr'] = 1/5
C21.loc[C21['Current probability of hazard']=='Does not currently impact the city', 'risk_likelihood_curr'] = 0/5

# Current Risk magnitude indicator

C21['risk_impact_magnitude_curr'] = np.nan
C21.loc[C21['Current magnitude of hazard']=='High', 'risk_impact_magnitude_curr'] = 5/5
C21.loc[C21['Current magnitude of hazard']=='Medium High', 'risk_impact_magnitude_curr'] = 4/5
C21.loc[C21['Current magnitude of hazard']=='Medium', 'risk_impact_magnitude_curr'] = 3/5
C21.loc[C21['Current magnitude of hazard']=='Medium Low', 'risk_impact_magnitude_curr'] = 2/5
C21.loc[C21['Current magnitude of hazard']=='Low', 'risk_impact_magnitude_curr'] = 1/5
C21.loc[C21['Current magnitude of hazard']=='Does not currently impact the city', 'risk_impact_magnitude_curr'] = 0/5

# Time horizon Variable

C21['time_horizon_years'] = np.nan
C21.loc[C21['When do you first expect to experience those changes in frequency and intensity?']=='Immediately', 'time_horizon_years'] = 1
C21.loc[C21['When do you first expect to experience those changes in frequency and intensity?']=='Short-term (by 2025)', 'time_horizon_years'] = 5
C21.loc[C21['When do you first expect to experience those changes in frequency and intensity?']=='Medium-term (2026-2050)', 'time_horizon_years'] = 30
C21.loc[C21['When do you first expect to experience those changes in frequency and intensity?']=='Long-term (after 2050)', 'time_horizon_years'] = 50

# Future risk likelihood indicator

C21['risk_likelihood_fut_temp'] = np.nan
C21.loc[C21['Current probability of hazard']=='High', 'risk_likelihood_fut_temp'] = 5
C21.loc[C21['Current probability of hazard']=='Medium High', 'risk_likelihood_fut_temp'] = 4
C21.loc[C21['Current probability of hazard']=='Medium', 'risk_likelihood_fut_temp'] = 3
C21.loc[C21['Current probability of hazard']=='Medium Low', 'risk_likelihood_fut_temp'] = 2
C21.loc[C21['Current probability of hazard']=='Low', 'risk_likelihood_fut_temp'] = 1
C21.loc[C21['Current probability of hazard']=='Does not currently impact the city', 'risk_likelihood_fut_temp'] = 0

# Change in future intensity - trend factor

C21['risk_add_fut_temp'] = np.nan
C21.loc[C21['Future change in intensity']=='Increasing','risk_add_fut_temp']=1
C21.loc[C21['Future change in intensity']=='Decreasing','risk_add_fut_temp']=-1
C21.loc[C21['Future change in intensity']=='None','risk_add_fut_temp']=0
C21.loc[C21['Future change in intensity']=='Not expected to happen in the future','risk_add_fut_temp']=-5 # will set the risk to zero if currently existing

# Change in future likelihood indicator

C21['risk_likelihood_fut'] = C21['risk_likelihood_fut_temp']+ C21['risk_add_fut_temp']
C21.loc[C21['risk_likelihood_fut'] > 5.0,'risk_likelihood_fut'] = 5.0       
C21.loc[C21['risk_likelihood_fut'] < 0,'risk_likelihood_fut'] = 0
C21['risk_likelihood_fut'] = C21['risk_likelihood_fut']/5
C21 = C21.drop(['risk_likelihood_fut_temp','risk_add_fut_temp'],axis=1)

# Future risk magnitude indicator

C21['risk_impact_magnitude_fut'] = np.nan
C21.loc[C21['Future expected magnitude of hazard']=='High', 'risk_impact_magnitude_fut'] = 5/5
C21.loc[C21['Future expected magnitude of hazard']=='Medium High', 'risk_impact_magnitude_fut'] = 4/5
C21.loc[C21['Future expected magnitude of hazard']=='Medium', 'risk_impact_magnitude_fut'] = 3/5
C21.loc[C21['Future expected magnitude of hazard']=='Medium Low', 'risk_impact_magnitude_fut'] = 2/5
C21.loc[C21['Future expected magnitude of hazard']=='Low', 'risk_impact_magnitude_fut'] = 1/5
C21.loc[C21['Future expected magnitude of hazard']=='Does not currently impact the city', 'risk_impact_magnitude_fut'] = 0/5

### KPI: Yearly risk exposure (Likelihood * Magnitude / Years) - Current & Future
C21['risk_index_curr'] = (C21['risk_likelihood_curr'] * C21['risk_impact_magnitude_curr']) / C21['time_horizon_years'] 
C21['risk_index_curr'] = C21['risk_index_curr'].replace(np.inf, np.nan)
C21['risk_index_fut'] = (C21['risk_likelihood_fut'] * C21['risk_impact_magnitude_fut']) / C21['time_horizon_years'] 
C21['risk_index_fut'] = C21['risk_index_fut'].replace(np.inf, np.nan)

# Grouping at city level by using SIMPLE AVERAGE

C21_city= C21.drop(['Row Number','Climate Hazards','Current magnitude of hazard','Current probability of hazard','Did this hazard significantly impact your city before 2020?',
                'Future change in frequency','Future change in intensity','Future expected magnitude of hazard','When do you first expect to experience those changes in frequency and intensity?'],
               axis=1)
                
C21_city =C21_city.groupby(['Account Number', 'Organization'])[['risk_likelihood_curr', 'risk_impact_magnitude_curr','time_horizon_years',
                                                                                 'risk_likelihood_fut','risk_impact_magnitude_fut','risk_index_curr',
                                                                                 'risk_index_fut']].mean().reset_index()

### KPI: Rank by city - Current & Future Yearly risk exposure - The higher the number, the more the exposure to risk

C21_city['KPI_21_risk_index_curr_rank'] = C21_city['risk_index_curr'].rank(pct=True)
C21_city['KPI_21_risk_index_fut_rank'] = C21_city['risk_index_fut'].rank(pct=True)
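
# --- Illustrative sketch (not part of the original pipeline) ---
# The ordinal answers above are scored one label at a time with .loc assignments. The same
# encoding can be written as a dictionary lookup with Series.map, which leaves any label that
# is not in the dictionary as NaN. The names `_ORDINAL_SCORES`, `yearly_risk_index` and the toy
# series below are assumptions for illustration only.
import pandas as pd

_ORDINAL_SCORES = {
    'High': 5/5,
    'Medium High': 4/5,
    'Medium': 3/5,
    'Medium Low': 2/5,
    'Low': 1/5,
    'Does not currently impact the city': 0/5,
}

def yearly_risk_index(likelihood_labels, magnitude_labels, horizon_years):
    """Likelihood * Magnitude / Years, with both ordinal labels scored on a 0-1 scale."""
    likelihood = likelihood_labels.map(_ORDINAL_SCORES)
    magnitude = magnitude_labels.map(_ORDINAL_SCORES)
    return likelihood * magnitude / horizon_years

# Toy usage: the third hazard has an unrecognised likelihood label, so its index is NaN
_risk_demo = yearly_risk_index(
    pd.Series(['High', 'Medium', 'Unknown']),
    pd.Series(['Medium High', 'Low', 'High']),
    pd.Series([1, 5, 30]),
)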

################################################### Import the Social Vulnerability Index by CDP #####################################################################
# A lot of manual cleaning is required because no merge key is provided. The SVI is available for the USA only

cities=c20[['Account Number','Organization','Country','CDP Region']].groupby(['Account Number','Organization','Country','CDP Region']).sum().reset_index()
cities= cities[cities['Country']=='United States of America']

# Remove extra text to be able to merge by city name

cities['Organization'] = cities['Organization'].map(lambda x: x.replace('Town of',''))
cities['Organization'] = cities['Organization'].map(lambda x: x.replace('Township of',''))
cities['Organization'] = cities['Organization'].map(lambda x: x.replace('City of',''))
cities['Organization'] = cities['Organization'].map(lambda x: x.replace('City and County of',''))
cities['Organization'] = cities['Organization'].map(lambda x: x.replace('Metropolitan Government of',''))
cities['Organization'] = cities['Organization'].map(lambda x: x.lstrip())

# Prepare for merge

cities['Organization_new']=cities['Organization']
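# Strip a trailing ', XX' state abbreviation (anchored with $ so only names that end with it lose their last 4 characters)
cities.loc[cities['Organization_new'].str.contains(', [A-Z][A-Z]$', regex=True), 'Organization_new'] = cities['Organization_new'].str[:-4]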
ct =pd.merge(cities, SVI_C, left_on='Organization_new',right_on='COUNTY',how='left')
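
# --- Illustrative sketch (not part of the original pipeline) ---
# Merging on the bare city name is what creates the wrong-state matches cleaned up below. One
# possible alternative is to keep the state abbreviation when it is present in the name, so it
# could also be compared with ST_ABBR. The helper `split_city_state` and both patterns are
# assumptions for illustration; unlike the replace() calls above, the prefix is only removed
# at the start of the name.
import re

_PREFIX_RE = re.compile(r'^(Town|Township|City|City and County|Metropolitan Government) of\s*')
_STATE_RE = re.compile(r', ([A-Z]{2})$')

def split_city_state(name):
    """Return (city, state_abbreviation or None) from a CDP organization name."""
    name = _PREFIX_RE.sub('', name).strip()
    match = _STATE_RE.search(name)
    if match:
        return name[:match.start()], match.group(1)
    return name, None

_city_demo = [split_city_state(n) for n in ['City of Richmond, VA', 'City and County of Denver', 'Township of Maplewood']]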

# Clean after merge - cities in wrong states are picked up - they need to be eliminated

ct = ct[ct['ST'].notna()]
# Matches picked up in the wrong state: drop only the rows for the listed states
wrong_state_matches = {
    'York, ME': ['NE', 'PA', 'SC', 'VA'],
    'Dallas': ['MO', 'IA', 'AR', 'AL'],
    'Houston': ['AL', 'GA', 'MN', 'TN'],
    'Richmond, VA': ['GA', 'NY', 'NC'],
    'Santa Cruz, CA': ['AZ'],
}
for city_name, wrong_states in wrong_state_matches.items():
    ct = ct[~((ct['Organization'] == city_name) & ct['ST_ABBR'].isin(wrong_states))]

# Cities removed from the merged table altogether
cities_to_drop = [
    'Columbia, MO', 'Henderson', 'Fremont', 'Cleveland', 'Northampton, MA', 'Miami', 'Buffalo',
    'Ashland, OR', 'Charlotte', 'Berkeley', 'Aurora, IL', 'Columbus', 'Oakland', 'Palo Alto',
    'Salem, MA', 'Guilford, VT', 'Lexington, MA',
]
ct = ct[~ct['Organization'].isin(cities_to_drop)]

# Merge datasets
# Legend:
# Socioeconomic – RPL_THEME1
# Household Composition & Disability – RPL_THEME2
# Minority Status & Language – RPL_THEME3
# Housing Type & Transportation – RPL_THEME4
# The overall tract summary ranking variable is RPL_THEMES.
# SVI is created by using percentile ranking values range from 0 to 1, with higher values indicating greater vulnerability.

ct=ct[['Account Number','Organization','ST_ABBR','FIPS','E_TOTPOP','RPL_THEME1','RPL_THEME2','RPL_THEME3','RPL_THEME4','RPL_THEMES']]
ct =pd.merge(ct,C21_city, left_on='Account Number',right_on='Account Number',how='left')

########################################################### TABLE 3.0 - ADAPTATION #############################################################################

#Please describe the main actions you are taking to reduce the risk to, and vulnerability of, your city’s infrastructure, 
###services, citizens, and businesses from climate change as identified in the Climate Hazards section.

# Import table

C30 = c20[c20['Question Number']=='3.0']
C30 = C30.drop(['Questionnaire', 'Year Reported to CDP','Parent Section','Section','Comments','File Name','Last update','Row Name','Column Number', 'Question Name',
               'Question Number','CDP Region', 'Country'], axis=1)

# Prepare table for analysis

def convert(values): 
    # Turn a list of answers into a tuple (parameter renamed so it no longer shadows the built-in `list`)
    return tuple(values) 
C30 = C30.groupby(['Account Number', 'Organization','Column Name', 'Row Number'])['Response Answer'].apply(list).reset_index()
C30['Response Answer'] = C30['Response Answer'].apply(convert)
C30 = C30.pivot(index=['Account Number','Organization','Row Number'], columns='Column Name', values='Response Answer').reset_index()
C30 = C30.drop(['Action description and implementation progress','Action title','Web link','Finance status'],axis=1)
def convertTuple(tup):
    # Concatenate the answers held in a tuple; a single answer is returned unchanged
    return functools.reduce(operator.add, tup)
C30['Action'] = C30['Action'].apply(convertTuple)
C30['Climate hazards'] = C30['Climate hazards'].apply(convertTuple)
C30['Co-benefit area'] = C30['Co-benefit area'].apply(convertTuple)
C30['Majority funding source'] = C30['Majority funding source'].apply(convertTuple)
C30['Sectors/areas adaptation action applies to'] = C30['Sectors/areas adaptation action applies to'].apply(convertTuple)
C30['Status of action'] = C30['Status of action'].apply(convertTuple)
C30['Total cost of the project (currency)'] = C30['Total cost of the project (currency)'].apply(convertTuple)
C30['Total cost provided by the local government (currency)'] = C30['Total cost provided by the local government (currency)'].apply(convertTuple)
C30['Total cost provided by the majority funding source (currency)'] = C30['Total cost provided by the majority funding source (currency)'].apply(convertTuple)
C30['Means of implementation'] = C30['Means of implementation'].apply(convertTuple)

# Group the 'Other, please specify' answers into a single 'Other' category

C30= C30.dropna(subset=['Action'])
C30.loc[C30['Action'].str.contains('Other, please specify:'), 'Action'] = 'Other'
C30= C30.dropna(subset=['Status of action'])
C30.loc[C30['Status of action'].str.contains('Other, please specify:'), 'Status of action'] = 'Other'
C30= C30.dropna(subset=['Majority funding source'])
C30.loc[C30['Majority funding source'].str.contains('Other, please specify:'), 'Majority funding source'] = 'Other'

# Import population figures

C05 = c20[c20['Question Number']=='0.5']
C05 = C05.drop(['Questionnaire', 'Year Reported to CDP','Parent Section','Section','Comments','File Name','Last update','Row Name','Column Number', 'Question Name',
               'Question Number','CDP Region', 'Country'], axis=1)
C05 = C05.pivot(index=['Account Number','Organization','Row Number'], columns='Column Name', values='Response Answer').reset_index()
pop = C05[['Account Number', 'Current population']]

# Import currency - most recurrent currencies are converted to USD

C04 = c20[c20['Question Number']=='0.4']
C04 = C04.drop(['Questionnaire', 'Year Reported to CDP','Parent Section','Section','Comments','File Name','Last update','Row Name','Column Number', 'Question Name',
               'Question Number','CDP Region', 'Country'], axis=1)
C04 = C04.pivot(index=['Account Number','Organization','Row Number'], columns='Column Name', values='Response Answer').reset_index()
C04.columns = C04.columns.fillna('Currency')

#Exchange Rates for main currencies to USD as of 25/11/20

C04['Exchange Rate'] = np.nan
C04.loc[C04['Currency']=='USD US Dollar', 'Exchange Rate'] = 1
C04.loc[C04['Currency']=='BRL Brazilian Real', 'Exchange Rate'] = 0.19
C04.loc[C04['Currency']=='EUR Euro', 'Exchange Rate'] = 1.19
C04.loc[C04['Currency']=='MXN Mexican Peso', 'Exchange Rate'] = 0.05
C04.loc[C04['Currency']=='GBP Pound Sterling', 'Exchange Rate'] = 1.34
C04.loc[C04['Currency']=='CAD Canadian Dollar', 'Exchange Rate'] = 0.77
C04.loc[C04['Currency']=='COP Colombian Peso', 'Exchange Rate'] = 0.00028
C04.loc[C04['Currency']=='PEN Nuevo Sol', 'Exchange Rate'] = 0.28
C04.loc[C04['Currency']=='DKK Danish Krone', 'Exchange Rate'] = 0.16
C04.loc[C04['Currency']=='AUD Australian Dollar', 'Exchange Rate'] = 0.74
C04.loc[C04['Currency']=='SEK Swedish Krona', 'Exchange Rate'] = 0.12
curr = C04[['Account Number','Exchange Rate']]
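
# Illustrative sketch (not part of the pipeline): the same conversion expressed as a lookup table.
# EXAMPLE_USD_RATES and example_to_usd are hypothetical names introduced here; the rates are the
# ones assigned above (as of 25/11/20). Currencies outside the table return None, which parallels
# the rows that are later dropped because their 'Exchange Rate' is missing.
EXAMPLE_USD_RATES = {
    'USD US Dollar': 1,
    'BRL Brazilian Real': 0.19,
    'EUR Euro': 1.19,
    'MXN Mexican Peso': 0.05,
    'GBP Pound Sterling': 1.34,
    'CAD Canadian Dollar': 0.77,
    'COP Colombian Peso': 0.00028,
    'PEN Nuevo Sol': 0.28,
    'DKK Danish Krone': 0.16,
    'AUD Australian Dollar': 0.74,
    'SEK Swedish Krona': 0.12,
}

def example_to_usd(amount, currency, rates=EXAMPLE_USD_RATES):
    """Convert an amount in a reported currency to USD, or return None if no rate is known."""
    rate = rates.get(currency)
    if rate is None:
        return None
    return amount * rate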

# Merge population & currencies

C30 =pd.merge(C30,pop, left_on='Account Number',right_on='Account Number',how='left')
C30 =pd.merge(C30,curr, left_on='Account Number',right_on='Account Number',how='left')
C30= C30.dropna(subset=['Total cost of the project (currency)'])
C30= C30.dropna(subset=['Exchange Rate'])

# Convert total project cost to USD

C30['Total cost of the project (currency)'] = C30['Total cost of the project (currency)'].astype(float)
C30['Total Project Cost USD']= C30['Total cost of the project (currency)']*C30['Exchange Rate']

### KPI: Calculate spending per capita

C30['Current population'] = C30['Current population'].astype(float)
C30['Spend_per_capite'] = C30['Total Project Cost USD']/C30['Current population']

# Group by city - sum the per-capita spending across all of a city's actions

C30_KPI = C30.groupby(['Account Number', 'Organization'])['Spend_per_capite'].sum().reset_index()

#Some cities omitted the info and entered 0 - exclude

C30_KPI = C30_KPI[C30_KPI['Spend_per_capite']!=0]

# There are some outliers - Big infrastructure projects
#C30_KPI['Spend_per_capite'].describe()

## KPI: Spending per capita ranking - the more the city spends, the higher the rank

C30_KPI['KPI_30_spend_per_capite_rank'] = C30_KPI['Spend_per_capite'].rank(pct=True)
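
# Illustrative sketch (not part of the pipeline): what a percentile rank such as rank(pct=True)
# computes, written in plain Python. example_pct_rank is a hypothetical helper; it assumes the
# default 'average' tie handling and a list without missing values.
def example_pct_rank(values):
    """Return each value's average-method rank divided by the number of values."""
    n = len(values)
    ranks = []
    for v in values:
        smaller = sum(1 for other in values if other < v)
        ties = sum(1 for other in values if other == v)
        # Tied values share the mean of the positions they occupy
        ranks.append((smaller + (ties + 1) / 2) / n)
    return ranks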



# Compare current risk with spending per capita and vulnerability
ct1 =pd.merge(ct,C30_KPI, left_on='Account Number',right_on='Account Number',how='left')
ct1= ct1.dropna(subset=['Spend_per_capite'])


########################################################################## TABLE 4.6b - CITY WIDE EMISSIONS ##########################################################

#Please provide a summary of emissions by sector and scope as defined in the Global Protocol for Community Greenhouse Gas Emissions Inventories (GPC) in the table below.

# Import table 
C46b = c20[c20['Question Number']=='4.6b']
C46b = C46b.drop(['Questionnaire', 'Year Reported to CDP','Parent Section','Section','Comments','File Name','Last update','Column Number', 'Question Name',
               'Question Number','CDP Region', 'Country'], axis=1)

# Prepare table for analysis

C46b = C46b.pivot(index=['Account Number','Organization','Row Number','Row Name'], columns='Column Name', values='Response Answer').reset_index()
C46b = pd.merge(C46b,pop, left_on='Account Number',right_on='Account Number',how='left')
C46b['Current population'] = C46b['Current population'].astype(float)

### KPI: Metric tonnes of CO2 emissions per capita. Note: it is possible to do this by emissions type

C46b= C46b.dropna(subset=['Current population'])
C46b= C46b.dropna(subset=['Emissions (metric tonnes CO2e)'])
C46b=C46b[C46b['Emissions (metric tonnes CO2e)']!='Question not applicable']
C46b['Emissions (metric tonnes CO2e)'] = C46b['Emissions (metric tonnes CO2e)'].astype(float)
C46b['CO2 Emissions per capite'] = C46b['Emissions (metric tonnes CO2e)']/C46b['Current population']

## KPI: Metric tonnes of CO2 emissions per capita - TOTAL BASIC emissions rank order - the higher the rank, the higher the emissions

# .copy() so the rank column is written to an independent frame rather than to a slice of C46b
C46b_KPI = C46b[C46b['Row Name'] == 'TOTAL BASIC emissions'].copy()
C46b_KPI['KPI_46b_CO2 Emissions per capite_rank'] = C46b_KPI['CO2 Emissions per capite'].rank(pct=True)

####################################################### TABLE 4.13 - CITY WIDE EMISSIONS ########################################################################
#Please provide details on any historical and base year city-wide emissions inventories your city has, in order to allow assessment of targets in the table below.

# Import table

C413 = c20[c20['Question Number']=='4.13']
C413 = C413.drop(['Questionnaire', 'Year Reported to CDP','Parent Section','Section','Comments','File Name','Row Name','Last update','Column Number', 'Question Name',
               'Question Number','CDP Region', 'Country'], axis=1)

# Prepare table for analysis

def convert(list): 
    return tuple(list) 
C413 =C413.groupby(['Account Number', 'Organization','Column Name', 'Row Number'])['Response Answer'].apply(list).reset_index()
C413['Response Answer'] = tuple(list(C413['Response Answer']))
C413 ['Response Answer'] = C413['Response Answer'].apply(convert)
C413 = C413.pivot(index=['Account Number','Organization','Row Number'], columns='Column Name', values='Response Answer').reset_index()
def convertTuple(tup): 
    str = functools.reduce(operator.add, (tup)) 
    return str
C413 = C413.drop(['Comments','File name and attach your inventory','Is this inventory used as the base year inventory?','Scopes / boundary covered'],axis=1)
collist = C413.columns.values.tolist()[3:]
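# Unwrap each cell individually (DataFrame.apply on its own passes whole columns to convertTuple)
C413[collist] = C413[collist].apply(lambda col: col.map(convertTuple))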

# KPI: Average trend of emissions reduction/increase percentage from time series

C413_KPI = C413.drop('Inventory date from',axis=1)
C413_KPI['Inventory date to'] = pd.to_datetime(C413_KPI['Inventory date to'])
C413_KPI['Inventory date to'] = C413_KPI['Inventory date to'].dt.year
C413_KPI = C413_KPI.sort_values(['Organization','Inventory date to','Row Number']).reset_index()
C413_KPI = C413_KPI[C413_KPI['Methodology']== 'Global Protocol for Community Greenhouse Gas Emissions Inventories (GPC)']
C413_KPI['Previous emissions (metric tonnes CO2e)'] = C413_KPI['Previous emissions (metric tonnes CO2e)'].astype(float)
C413_KPI['diffs'] = C413_KPI['Previous emissions (metric tonnes CO2e)'].pct_change()
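# Blank out the first row of each city: its change would otherwise be measured against the previous city's value
mask = C413_KPI.Organization != C413_KPI.Organization.shift(1)
C413_KPI.loc[mask, 'diffs'] = np.nan
C413_KPI['y_diffs'] = C413_KPI['Inventory date to'].diff()
C413_KPI.loc[mask, 'y_diffs'] = np.nan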
C413_KPI=C413_KPI.dropna(subset=['diffs'])
C413_KPI=C413_KPI[C413_KPI['y_diffs']==1]
C413_KPI = C413_KPI.groupby(['Account Number', 'Organization'])['diffs'].mean().reset_index()

### KPI: Ranking of the average year-on-year percentage change in city emissions - the lower the rank the better, as it means the
### city is cutting more emissions

C413_KPI['KPI_413_diffs_rank'] = C413_KPI['diffs'].rank(pct=True)
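
# Illustrative sketch (not part of the pipeline): the trend KPI above for a single city, in plain
# Python. example_mean_yearly_change is a hypothetical helper. Like the y_diffs == 1 filter, it only
# keeps changes between consecutive inventory years; the guard against a zero previous value is an
# added assumption that the pandas code above does not make.
def example_mean_yearly_change(inventories):
    """inventories: iterable of (year, emissions) pairs for one city."""
    ordered = sorted(inventories)
    changes = []
    for (prev_year, prev_value), (year, value) in zip(ordered, ordered[1:]):
        if year - prev_year == 1 and prev_value != 0:
            changes.append(value / prev_value - 1)
    # None when the city has no pair of consecutive inventories
    return sum(changes) / len(changes) if changes else None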

###################################################### TABLE 5.0a - EMISSIONS REDUCTION ######################################################################

# Please provide details of your total city-wide base year emissions reduction (absolute) target(s). 
# In addition, you may add rows to provide details of your sector-specific targets, by providing the base year emissions specific to that target.

# Import table 

C50a = c20[c20['Question Number']=='5.0a']
C50a = C50a.drop(['Questionnaire', 'Year Reported to CDP','Parent Section','Section','Comments','File Name','Row Name','Last update','Column Number', 'Question Name',
               'Question Number','CDP Region', 'Country'], axis=1)

# Prepare table for analysis 

def convert(list): 
    return tuple(list) 
C50a =C50a.groupby(['Account Number', 'Organization','Column Name', 'Row Number'])['Response Answer'].apply(list).reset_index()
C50a['Response Answer'] = tuple(list(C50a['Response Answer']))
C50a['Response Answer'] = C50a['Response Answer'].apply(convert)
C50a = C50a.pivot(index=['Account Number','Organization','Row Number'], columns='Column Name', values='Response Answer').reset_index()
C50a = C50a.drop(['Boundary of target relative to city boundary (reported in 0.1)','Does this target align to a requirement from a higher level of sub-national government',
                 'Does this target align with the global 1.5 - 2 °C pathway set out in the Paris Agreement?','Select the initiatives that this target contributes towards',
                 'Target meets initial GCoM validation criteria','Where sources differ from the inventory, identify and explain these additions / exclusions',
                 'Please describe your target. If your country has an NDC and your city’s target is less ambitious than the NDC, please explain why.'],axis=1)
def convertTuple(tup): 
    str = functools.reduce(operator.add, (tup)) 
    return str
collist = C50a.columns.values.tolist()[3:]
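# Unwrap each cell individually (DataFrame.apply on its own passes whole columns to convertTuple)
C50a[collist] = C50a[collist].apply(lambda col: col.map(convertTuple))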
C50a = C50a[C50a['Base year']!='Question not applicable']
C50a = C50a.dropna()
C50a['Base year'] = C50a['Base year'].astype(float)
C50a['Base year emissions (metric tonnes CO2e)'] = C50a['Base year emissions (metric tonnes CO2e)'].astype(float)
C50a['Percentage of target achieved so far'] = C50a['Percentage of target achieved so far'].astype(float)
C50a['Percentage reduction target'] = C50a['Percentage reduction target'].astype(float)
C50a['Target year absolute emissions (metric tonnes CO2e) [Auto-calculated]'] = C50a['Target year absolute emissions (metric tonnes CO2e) [Auto-calculated]'].astype(float)
C50a['Target year'] = C50a['Target year'].astype(float)
C50a['Year of target introduction'] = C50a['Year of target introduction'].astype(float)

# Calculate new cols - to evaluate ambition & progress of objectives 

C50a['target_years']=C50a['Target year'] - C50a['Year of target introduction']+1
C50a['Emissions_reduction_obj']= C50a['Target year absolute emissions (metric tonnes CO2e) [Auto-calculated]']
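C50a['Emissions_reduction_achieved']= (C50a['Percentage of target achieved so far']/100)*C50a['Base year emissions (metric tonnes CO2e)'] # NOTE: possibly overstated - the share of the target achieved is applied to the full base year emissions rather than to the targeted reduction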
C50a['Percentage_reduction_per_year']= (C50a['Percentage reduction target']/100)/C50a['target_years']
C50a['years_left']= C50a['Target year']-2020
C50a['achiev_per_year'] = (C50a['Percentage reduction target'])*(C50a['Percentage of target achieved so far']/100)/(2020- C50a['Year of target introduction']+1)/100
C50a['effective_years_to_achiev'] = C50a['Emissions_reduction_obj']/ (C50a['Base year emissions (metric tonnes CO2e)']*(C50a['Percentage reduction target']/100))
C50a['Emissions_reduction per year'] = C50a['Emissions_reduction_obj']/C50a['target_years']
C50a['Actual_emissions_cut_per_year'] = (C50a['Emissions_reduction_obj']*(C50a['Percentage of target achieved so far']/100))/(2020- C50a['Year of target introduction']+1)
C50a['Actual_years_to_achiev'] = C50a['Emissions_reduction_obj']/C50a['Actual_emissions_cut_per_year']
C50a['Years_diff'] = C50a['target_years']-C50a['Actual_years_to_achiev']
# np.NINF was removed in NumPy 2.0; reassigning also avoids an in-place replace on a column selection
C50a['Years_diff'] = C50a['Years_diff'].replace([np.inf, -np.inf], np.nan)
C50a.loc[(C50a['Percentage of target achieved so far']<0),'Years_diff'] = np.nan 
C50a['KPI_Strategy']=np.nan 
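
# Illustrative sketch (not part of the pipeline): the Years_diff calculation for a single target.
# example_years_diff is a hypothetical helper. It relies on the fact that, for a non-zero
# objective, Actual_years_to_achiev simplifies to (years elapsed) / (share of target achieved).
# A positive result means the target would be met ahead of schedule at the current pace.
import math

def example_years_diff(intro_year, target_year, pct_achieved, current_year=2020):
    target_years = target_year - intro_year + 1
    elapsed_years = current_year - intro_year + 1
    if pct_achieved <= 0:
        # No (or negative) progress: undefined, as in the NaN handling above
        return math.nan
    projected_years = elapsed_years / (pct_achieved / 100)
    return target_years - projected_years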

C50a.loc[(C50a['Years_diff']>C50a['Years_diff'].std()) | (C50a['Years_diff']<-C50a['Years_diff'].std()), 'Years_diff'] = np.nan
C50a['Years_diff'] = C50a['Years_diff'].replace([np.inf, -np.inf], 0)

C50a_KPI=C50a[['Account Number','Organization','Row Number', 'Base year emissions (metric tonnes CO2e)', 'Emissions_reduction_achieved',
               'Emissions_reduction_obj','Years_diff']].copy()

# Group by city, averaging Years_diff weighted by each target's emissions reduction objective

wm = lambda x: np.average(x, weights=C50a_KPI.loc[x.index, 'Emissions_reduction_obj'])
# Add a tiny offset so that the weights never sum to zero in np.average
C50a_KPI['Years_diff']=C50a_KPI['Years_diff']+0.00000000001
C50a_KPI['Emissions_reduction_obj']=C50a_KPI['Emissions_reduction_obj']+0.00000000001
C50a_KPI=C50a_KPI.groupby(['Account Number','Organization']).agg(Emissions_reduction_achieved=('Emissions_reduction_achieved', 'sum'),
    Covered_emissions= ('Base year emissions (metric tonnes CO2e)', 'sum'),
    Emissions_reduction_obj=('Emissions_reduction_obj', 'sum'),
    Years_diff=('Years_diff', wm)).dropna().reset_index()

### KPIs: Objective strategy, Objective ambition, Objective progress - RANK

C50a_KPI['Percentage_obj_total'] = C50a_KPI.Emissions_reduction_obj / C50a_KPI.Covered_emissions
C50a_KPI['Percentage_obj_achieved'] = C50a_KPI.Emissions_reduction_achieved / C50a_KPI.Covered_emissions

C50a_KPI['KPI_50a_rank_objective_strategy'] = C50a_KPI['Years_diff'].rank(pct=True)
C50a_KPI['KPI_50a_rank_objective_ambition'] = C50a_KPI['Percentage_obj_total'].rank(pct=True)
C50a_KPI['KPI_50a_rank_objective_progress'] = C50a_KPI['Percentage_obj_achieved'].rank(pct=True)

##################################################### TABLE 5.4 - EMISSIONS REDUCTION ################################################################################# 
#Describe the anticipated outcomes of the most impactful mitigation actions your city is currently undertaking; the total cost of the action and how much is being 
#funded by the local government.

# Import table

C54 = c20[c20['Question Number']=='5.4']
C54 = C54.drop(['Questionnaire', 'Year Reported to CDP','Parent Section','Section','Comments','File Name','Row Name','Last update','Column Number', 'Question Name',
               'Question Number','CDP Region', 'Country'], axis=1)

# Prepare table for analysis

def convert(list): 
    return tuple(list) 
C54 =C54.groupby(['Account Number', 'Organization','Column Name', 'Row Number'])['Response Answer'].apply(list).reset_index()
C54['Response Answer'] = tuple(list(C54['Response Answer']))
C54['Response Answer'] = C54['Response Answer'].apply(convert)
C54 = C54.pivot(index=['Account Number','Organization','Row Number'], columns='Column Name', values='Response Answer').reset_index()
C54 = C54.drop(['Web link to action website','Scope and impact of action','Role in the GCC program','Name of the stakeholder group','Name of the engagement activities',
               'Co-benefit area','Aim of the engagement activities','Attach reference document','Means of implementation'],axis=1)
def convertTuple(tup): 
    str = functools.reduce(operator.add, (tup)) 
    return str
collist = C54.columns.values.tolist()[3:]
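# Unwrap each cell individually (DataFrame.apply on its own passes whole columns to convertTuple)
C54[collist] = C54[collist].apply(lambda col: col.map(convertTuple))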
C54=C54.dropna(subset=['Energy savings (MWh)', 'Estimated emissions reduction (metric tonnes CO2e)','Renewable energy production (MWh)'], how='all')
C54['Energy savings (MWh)'] = C54['Energy savings (MWh)'].astype(float)
C54['Estimated emissions reduction (metric tonnes CO2e)'] = C54['Estimated emissions reduction (metric tonnes CO2e)'].astype(float)
C54['Renewable energy production (MWh)'] = C54['Renewable energy production (MWh)'].astype(float)
C54['Total cost of the project'] = C54['Total cost of the project'].astype(float)

# Import currency information to convert the cost of the project to USD

C54 =pd.merge(C54,curr, left_on='Account Number',right_on='Account Number',how='left')
C54['Total dollar cost'] = C54['Total cost of the project']*C54['Exchange Rate']

## Clean up final data
C54 = C54.dropna(subset=['Total dollar cost'])
C54 = C54[C54['Timescale of reduction / savings / energy production'] == 'Per year']
C54 = C54[C54['Total dollar cost']!=0]


### KPI: Dollars spent per MWh of energy savings
### KPI: Dollars spent per metric tonne of CO2 emissions reduction
### KPI: Dollars spent per MWh of renewable energy production

C54_KPI = C54.groupby(['Account Number', 'Organization'], as_index=False).agg({
    'Total dollar cost': 'sum',
    'Energy savings (MWh)': 'sum',
    'Estimated emissions reduction (metric tonnes CO2e)': 'sum',
    'Renewable energy production (MWh)': 'sum',
    'Total cost of the project': 'sum'
    })

C54_KPI['Dollar per 1 MWh of Energy Savings'] = C54_KPI['Total dollar cost']/ C54_KPI['Energy savings (MWh)']
C54_KPI['Dollar per 1 Mt of CO2 Emissions reductions'] = C54_KPI['Total dollar cost']/ C54_KPI['Estimated emissions reduction (metric tonnes CO2e)']
C54_KPI['Dollar per 1 MWh of Renewable energy production'] = C54_KPI['Total dollar cost']/ C54_KPI['Renewable energy production (MWh)']
# np.NINF was removed in NumPy 2.0, so use -np.inf
C54_KPI = C54_KPI.replace([np.inf, -np.inf], np.nan)
C54_KPI=C54_KPI.dropna(subset=['Dollar per 1 MWh of Energy Savings','Dollar per 1 Mt of CO2 Emissions reductions',
                               'Dollar per 1 MWh of Renewable energy production'], how='all')
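
# Illustrative sketch (not part of the pipeline): the cost-effectiveness ratio for one city.
# example_cost_per_unit is a hypothetical helper; returning NaN for a zero quantity mirrors the
# inf -> NaN replacement above, so that cities reporting no savings are not ranked as "free".
def example_cost_per_unit(total_cost_usd, quantity):
    """Dollars spent per unit of benefit (MWh saved, tonnes of CO2e cut, MWh produced)."""
    if quantity == 0:
        return float('nan')
    return total_cost_usd / quantity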

### KPI: Ranking of dollars per MWh of energy savings, per tonne of CO2 emissions cut and per MWh of renewable energy - the lower the better (more savings for less $)
# Done on the Per year metric for comparability
# C54_KPI = C54_KPI[C54_KPI['Timescale of reduction / savings / energy production'] == 'Per year']
# C54_KPI = C54_KPI[C54_KPI['Total dollar cost'] != 0]

# Cities have likely used different scales, e.g. writing 100 to mean 100k (100000). Try to correct for that by filtering out implausible values

C54_KPI = C54_KPI[C54_KPI['Dollar per 1 MWh of Energy Savings']<10000]
C54_KPI = C54_KPI[C54_KPI['Dollar per 1 Mt of CO2 Emissions reductions']<25000]
C54_KPI = C54_KPI[C54_KPI['Energy savings (MWh)']>1000]
C54_KPI = C54_KPI[C54_KPI['Estimated emissions reduction (metric tonnes CO2e)']>10000]
C54_KPI = C54_KPI[C54_KPI['Total cost of the project']>10000]

C54_KPI['KPI_54_$_per_MWh_saving'] = C54_KPI['Dollar per 1 MWh of Energy Savings'].rank(pct=True)
C54_KPI['KPI_54_$_per_CO2_cut'] = C54_KPI['Dollar per 1 Mt of CO2 Emissions reductions'].rank(pct=True)
C54_KPI['KPI_54_$_per_MWh_production'] = C54_KPI['Dollar per 1 MWh of Renewable energy production'].rank(pct=True)

################################################################ TABLE 8.0a - ENERGY ########################################################################### 
#Please provide details of your renewable energy or electricity target(s) and how the city plans to meet those targets.

# Import table

C80a = c20[c20['Question Number']=='8.0a']
C80a = C80a.drop(['Questionnaire', 'Year Reported to CDP','Parent Section','Section','Comments','File Name','Row Name','Last update','Column Number', 'Question Name',
               'Question Number','CDP Region', 'Country'], axis=1)

# Prepare table for analysis 

C80a = C80a.pivot(index=['Account Number','Organization','Row Number'], columns='Column Name', values='Response Answer').reset_index()
C80a = C80a.drop(['Please specify plans to meet the target(s) and in which sector this target will be implemented (i.e. All energy sectors, electricity, heating and cooling and/or transport)'],
                 axis=1)
C80a = C80a[C80a['Base year']!='Question not applicable']

## Clean up the table
C80a['Base year'] = C80a['Base year'].astype(float)
C80a['Target year'] = C80a['Target year'].astype(float)
C80a['Percentage of target achieved'] = C80a['Percentage of target achieved'].astype(float)
C80a['Total renewable energy / electricity covered by target in base year (in unit specified in column 3:  energy/electricity types covered by target)'] = C80a['Total renewable energy / electricity covered by target in base year (in unit specified in column 3:  energy/electricity types covered by target)'].astype(float)
C80a['Total renewable energy / electricity covered by target in target year (in unit specified in column 3: energy/electricity types covered by target)'] = C80a['Total renewable energy / electricity covered by target in target year (in unit specified in column 3: energy/electricity types covered by target)'].astype(float)
C80a['Percentage renewable energy / electricity of total energy or electricity in base year'] = C80a['Percentage renewable energy / electricity of total energy or electricity in base year'].astype(float)
C80a['Percentage renewable energy / electricity of total energy or electricity in target year'] = C80a['Percentage renewable energy / electricity of total energy or electricity in target year'].astype(float)

## Create new calculated features
C80a['target_years']=C80a['Target year'] - C80a['Base year']+1
C80a['Percentage_E_growth_per_year']= ((C80a['Percentage renewable energy / electricity of total energy or electricity in target year']-C80a['Percentage renewable energy / electricity of total energy or electricity in base year'])/100)/C80a['target_years']
C80a['years_past']=2020-C80a['Base year']+1
C80a['Green_E_achiev_per_year']=(C80a['Percentage of target achieved']/100)/C80a['years_past']
C80a=C80a.dropna(subset=['Green_E_achiev_per_year','Percentage_E_growth_per_year'])
C80a = C80a[C80a['Percentage_E_growth_per_year']>=0]

## KPI: Target status tracker: On target or behind target?
C80a['Target_status'] = 'Behind target'
C80a.loc[C80a['Percentage_E_growth_per_year']<=C80a['Green_E_achiev_per_year'], 'Target_status'] = 'On target'
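
# Illustrative sketch (not part of the pipeline): the target status tracker above for one target.
# example_target_status is a hypothetical helper; it reproduces the same comparison between the
# yearly pace implied by the target and the yearly pace implied by the reported progress.
def example_target_status(base_year, target_year, pct_base, pct_target, pct_achieved, current_year=2020):
    target_years = target_year - base_year + 1
    required_per_year = ((pct_target - pct_base) / 100) / target_years
    years_past = current_year - base_year + 1
    achieved_per_year = (pct_achieved / 100) / years_past
    return 'On target' if required_per_year <= achieved_per_year else 'Behind target'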



### KPI: Target percentage growth of renewables per year vs actual growth against the target
# wm_ = lambda x: np.average(x, weights=C41a_KPI.loc[x.index, 'Emissions_reduction_obj'])
C80a_KPI = C80a.groupby(['Account Number', 'Organization'], as_index=False).agg({
    'target_years': 'mean',
    'Percentage_E_growth_per_year': 'mean',
    'Green_E_achiev_per_year': 'mean',
    })


### KPI: Rank order of distance from target - the higher the number, the worse it is, as the city is further from its target

C80a_KPI['Distance from target'] = C80a_KPI['Percentage_E_growth_per_year'] - C80a_KPI['Green_E_achiev_per_year']
C80a_KPI['KPI_80a_rank_dist_from_target'] = C80a_KPI['Distance from target'].rank(pct=True)

############################################################# TABLE 8.1 - ENERGY ############################################################################ 
#Please indicate the source mix of electricity consumed in your city

# Import table

C81 = c20[c20['Question Number']=='8.1']
C81 = C81.drop(['Questionnaire', 'Year Reported to CDP','Parent Section','Section','Comments','File Name','Row Name','Last update','Column Number', 'Question Name',
               'Question Number','CDP Region', 'Country'], axis=1)

# Prepare data for analysis

C81 = C81.pivot(index=['Account Number','Organization','Row Number'], columns='Column Name', values='Response Answer').reset_index()
C81=C81.dropna(subset=['Total - please ensure this equals 100%'])

### KPI: Rank ordering greenest cities - a higher rank means cleaner energy is used

C81['Geothermal'] = C81['Geothermal'].astype(float)
C81['Solar'] = C81['Solar'].astype(float)
C81['Wind'] = C81['Wind'].astype(float)
C81['Green_energy%'] = C81['Geothermal'] + C81['Wind'] + C81['Solar']
C81['KPI_81_Green_energy_rank'] = C81['Green_energy%'].rank(pct=True)

########################################################################### TABLE 8.5 - ENERGY ############################################################### 
#Does your city have a target to increase energy efficiency?

# Import table
C85a = c20[c20['Question Number']=='8.5a']
C85a = C85a.drop(['Questionnaire', 'Year Reported to CDP','Parent Section','Section','Comments','File Name','Row Name','Last update','Column Number', 'Question Name',
               'Question Number','CDP Region', 'Country'], axis=1)

# Prepare data for analysis

def convert(list): 
    return tuple(list) 
C85a =C85a.groupby(['Account Number', 'Organization','Column Name', 'Row Number'])['Response Answer'].apply(list).reset_index()
C85a['Response Answer'] = tuple(list(C85a['Response Answer']))
C85a['Response Answer'] = C85a['Response Answer'].apply(convert)
C85a = C85a.pivot(index=['Account Number','Organization','Row Number'], columns='Column Name', values='Response Answer').reset_index()
C85a = C85a.drop(['Plans to meet target (include details on types of energy in thermal /electricity)','Please indicate to which energy sector(s) the target applies (Multiple choice)']
                 ,axis=1)
def convertTuple(tup): 
    str = functools.reduce(operator.add, (tup)) 
    return str
collist = C85a.columns.values.tolist()[3:]
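# Unwrap each cell individually (DataFrame.apply on its own passes whole columns to convertTuple)
C85a[collist] = C85a[collist].apply(lambda col: col.map(convertTuple))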
C85a = C85a[C85a['Base year']!='Question not applicable']


## Clean up columns
C85a['Base year'] = C85a['Base year'].astype(float)
C85a['Target year'] = C85a['Target year'].astype(float)
C85a['Percentage of energy efficiency improvement in target year compared to base year levels'] = C85a['Percentage of energy efficiency improvement in target year compared to base year levels'].astype(float)
C85a['Percentage of target achieved'] = C85a['Percentage of target achieved'].astype(float)


## Calculate new features
C85a['target_years']=C85a['Target year'] - C85a['Base year']+1
C85a['Percentage_E_growth_per_year']= (C85a['Percentage of energy efficiency improvement in target year compared to base year levels']/100)/C85a['target_years']
C85a['years_past']=2020-C85a['Base year']+1
C85a['Green_E_achiev_per_year']=(C85a['Percentage of target achieved']/100)/C85a['years_past']
C85a=C85a.dropna(subset=['Green_E_achiev_per_year','Percentage_E_growth_per_year'])
C85a = C85a[C85a['Percentage_E_growth_per_year']>=0]

## KPI: Target status tracker: On target or behind target?
C85a['Target_status'] = 'Behind target'
C85a.loc[C85a['Percentage_E_growth_per_year']<=C85a['Green_E_achiev_per_year'], 'Target_status'] = 'On target'



### KPI: Target percentage growth of energy efficiency per year VS Actual growth against the target
# wm_ = lambda x: np.average(x, weights=C41a_KPI.loc[x.index, 'Emissions_reduction_obj'])
C85a_KPI = C85a.groupby(['Account Number', 'Organization'], as_index=False).agg({
    'Percentage_E_growth_per_year': 'mean',
    'Green_E_achiev_per_year': 'mean',
    })



### KPI: Rank order of distance from target - the higher the number, the worse it is, as the city is further from its target

C85a_KPI['Distance from target'] = C85a_KPI['Percentage_E_growth_per_year'] - C85a_KPI['Green_E_achiev_per_year']
C85a_KPI['KPI_85a_rank_dist_from_target_eff'] = C85a_KPI['Distance from target'].rank(pct=True)

################################################################# TABLE 10.1 TRANSPORT ##################################################################### 
#What is the mode share of each transport mode in your city for passenger transport?

# Import table

C101 = c20[c20['Question Number']=='10.1']
C101 = C101.drop(['Questionnaire', 'Year Reported to CDP','Parent Section','Section','Comments','File Name','Row Name','Last update','Column Number', 'Question Name',
               'Question Number','CDP Region', 'Country'], axis=1)

# Prepare table for analysis

C101 = C101.pivot(index=['Account Number','Organization','Row Number'], columns='Column Name', values='Response Answer').reset_index()
C101 = C101[C101['Cycling']!='Question not applicable']
C101['Micro-Mobility'] = C101['Micro-Mobility'].astype(float)
C101['Cycling'] = C101['Cycling'].astype(float)
C101['Walking'] = C101['Walking'].astype(float)

### KPI: Rank ordering greenest cities - the higher the greener  

C101['Green_transport'] = C101['Micro-Mobility'] + C101['Cycling'] + C101['Walking']
C101['KPI_101_Green_transport_rank'] = C101['Green_transport'].rank(pct=True)

#################################################################### TABLE 10.4 TRANSPORT ############################################################
#Please provide the total fleet size and number of vehicle types for the following modes of transport.

# Import table

C104 = c20[c20['Question Number']=='10.4']
C104 = C104.drop(['Questionnaire', 'Year Reported to CDP','Parent Section','Section','Comments','File Name','Question Name','Last update','Column Number',
               'Question Number','CDP Region', 'Country'], axis=1)

# Prepare table for analysis 

C104 = C104.pivot(index=['Account Number','Organization','Row Number','Row Name'], columns='Column Name', values='Response Answer').reset_index()

# Merge with population per city

C104 =pd.merge(C104,pop, left_on='Account Number',right_on='Account Number',how='left')
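# .copy() so the columns added below are written to an independent frame rather than to a slice of C104
C104_KPI = C104[C104['Row Name']== 'Total fleet size'].copy()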
C104_KPI['Number of private cars'] = C104_KPI['Number of private cars'].astype(float)
C104_KPI['Current population'] = C104_KPI['Current population'].astype(float)
C104_KPI['Number of buses'] = C104_KPI['Number of buses'].astype(float)
C104_KPI['Number of municipal fleet (excluding buses)'] = C104_KPI['Number of municipal fleet (excluding buses)'].astype(float)

### KPI: Public transport per capita / private cars per capita

C104_KPI['Municipal fleet'] = C104_KPI['Number of buses'] + C104_KPI['Number of municipal fleet (excluding buses)']
C104_KPI['Municipal fleet per capite'] = C104_KPI['Municipal fleet'] / C104_KPI['Current population']
C104_KPI['Private cars per capite'] = C104_KPI['Number of private cars'] / C104_KPI['Current population']

# Remove implausible outliers (2 or more vehicles per person)

C104_KPI=C104_KPI[C104_KPI['Private cars per capite']<2]
C104_KPI=C104_KPI[C104_KPI['Municipal fleet per capite']<2]

### KPI: Rank of public transport per capita (the higher the better) and rank of private cars per capita (the higher the worse)

C104_KPI['KPI_104_public_transport_rank'] = C104_KPI['Municipal fleet per capite'].rank(pct=True)
C104_KPI['KPI_104_private_car_rank'] = C104_KPI['Private cars per capite'].rank(pct=True)

################################################################################# TABLE 10.7 TRANSPORT ########################################################
#Do you have a low or zero-emission zone in your city? (i.e. an area that disincentivises fossil fuel vehicles through a charge, a ban or access restriction)

# Import table

C107a = c20[c20['Question Number']=='10.7a']
C107a = C107a.drop(['Questionnaire', 'Year Reported to CDP','Parent Section','Section','Comments','File Name','Question Name','Last update','Column Number','Row Name',
               'Question Number','CDP Region', 'Country'], axis=1)
C107a = C107a.pivot(index=['Account Number','Organization','Row Number'], columns='Column Name', values='Response Answer').reset_index()

# Import land area 

C06 = c20[c20['Question Number']=='0.6']
C06 = C06.drop(['Questionnaire', 'Year Reported to CDP','Parent Section','Section','Comments','File Name','Question Name','Last update','Column Number','Row Name',
               'Question Number','CDP Region', 'Country'], axis=1)
C06 = C06.pivot(index=['Account Number','Organization','Row Number'], columns='Column Name', values='Response Answer').reset_index()
C06['Land area of the city boundary as defined in question 0.1 (in square km)'] = C06['Land area of the city boundary as defined in question 0.1 (in square km)'].astype(float)

### KPI: Percentage of zero emission area with respect to city size

C107a = C107a[C107a['Size (sq. km)']!='Question not applicable']
C107a = C107a.dropna(subset=['Size (sq. km)'])
C107a['Size (sq. km)'] = C107a['Size (sq. km)'].astype(float)
C107a =pd.merge(C107a,C06, left_on='Account Number',right_on='Account Number',how='left')
C107a['Percentage of 0 emission'] = C107a['Size (sq. km)'] / C107a['Land area of the city boundary as defined in question 0.1 (in square km)']

# Remove implausible values (zone reported as larger than the city itself)

C107a = C107a[C107a['Percentage of 0 emission'] <= 1]

### KPI: Ranking of percentage of zero emission area - the higher the better

C107a['KPI_107a_zero_emissions'] = C107a['Percentage of 0 emission'].rank(pct=True)

############################################################################ TABLE 12.1 - FOOD ##################################################################
#What is the per capita meat and dairy consumption (kg) in your city?

# Import table

C121 = c20[c20['Question Number']=='12.1']
C121 = C121.drop(['Questionnaire', 'Year Reported to CDP','Parent Section','Section','Comments','File Name','Question Name','Last update','Column Number',
               'Question Number','CDP Region', 'Country'], axis=1)

#Prepare table for analysis

C121 = C121.pivot(index=['Account Number','Organization','Row Number','Row Name'], columns='Column Name', values='Response Answer').reset_index()
C121 = C121[C121['Amount']!='Question not applicable']
C121 = C121.dropna(subset=['Amount'])
C121['Amount']= C121['Amount'].astype(float)
# Split meat and dairy rows; .copy() avoids assigning rank columns on a slice of C121
C121_meat = C121[C121['Row Name']=='Meat consumption per capita (kg/year)'].copy()
C121_meat['Meat_rank'] = C121_meat['Amount'].rank(pct=True)
C121_dairy = C121[C121['Row Name']=='Dairy consumption per capita (kg/year)'].copy()
C121_dairy['dairy_rank'] = C121_dairy['Amount'].rank(pct=True)
C121_dairy = C121_dairy[['Account Number','dairy_rank']]
C121_md =pd.merge(C121_meat,C121_dairy, left_on='Account Number',right_on='Account Number',how='left')
C121_md['M/D'] = C121_md['dairy_rank'] + C121_md['Meat_rank']

### KPI: Meat/Dairy consumption rank - the lower the better

C121_md['KPI_121_Meat/Dairy Rank'] = C121_md['M/D'].rank(pct=True)

############################################################################# TABLE 13.0 - WASTE ####################################################
#What is the annual solid waste generation in your city?

# Import table

C130 = c20[c20['Question Number']=='13.0']
C130 = C130.drop(['Questionnaire', 'Year Reported to CDP','Parent Section','Section','Comments','File Name','Question Name','Last update','Column Number',
               'Question Number','CDP Region', 'Country'], axis=1)

# Prepare table for analysis

C130 = C130.pivot(index=['Account Number','Organization','Row Number','Row Name'], columns='Column Name', values='Response Answer').reset_index()
C130['Amount of waste generated (tonnes/year)']= C130['Amount of waste generated (tonnes/year)'].astype(float)

# Import population size data

C130 =pd.merge(C130,pop, left_on='Account Number',right_on='Account Number',how='left')
C130['Current population']= C130['Current population'].astype(float)


### KPI: Tonnes of waste produced per capita

C130['Waste tonnes per capite'] = C130['Amount of waste generated (tonnes/year)']/C130['Current population']

# Remove implausible outliers (4 or more tonnes per person per year)

C130 = C130[C130['Waste tonnes per capite']<4]

### KPI: Ranking of tonnes of waste produced per capita - the smaller the better

C130['KPI_130_waste_rank'] = C130['Waste tonnes per capite'].rank(pct=True)

########################################################### CREATE SINGLE KPI RANKING TABLE - CITY LEVEL ########################################################

T21 = C21_city[['Account Number', 'KPI_21_risk_index_curr_rank', 'KPI_21_risk_index_fut_rank']]
T30 = C30_KPI[['Account Number','KPI_30_spend_per_capite_rank']]
T46b = C46b_KPI[['Account Number','KPI_46b_CO2 Emissions per capite_rank']]
T413 = C413_KPI[['Account Number','KPI_413_diffs_rank']]
T50a = C50a_KPI[['Account Number','KPI_50a_rank_objective_strategy','KPI_50a_rank_objective_ambition','KPI_50a_rank_objective_progress']] 
T54 = C54_KPI[['Account Number','KPI_54_$_per_MWh_saving','KPI_54_$_per_CO2_cut','KPI_54_$_per_MWh_production']] 
T80a = C80a_KPI[['Account Number','KPI_80a_rank_dist_from_target']]
T81 = C81[['Account Number','KPI_81_Green_energy_rank']]
T85a = C85a_KPI[['Account Number','KPI_85a_rank_dist_from_target_eff']]
T101 = C101[['Account Number','KPI_101_Green_transport_rank']]
T104 = C104_KPI[['Account Number','KPI_104_public_transport_rank','KPI_104_private_car_rank']]
T107a = C107a[['Account Number','KPI_107a_zero_emissions']]
T121 = C121_md[['Account Number','KPI_121_Meat/Dairy Rank']]
T130 = C130[['Account Number','KPI_130_waste_rank']]


## Merge everything
# Create list of df
cities_df_list = [
    T21,
    T30,
    T46b,
    T413,
    T50a,
    T54,
    T80a,
    T81,
    T85a,
    T101,
    T104,
    T107a,
    T121,
    T130]

# Create list of suffixes
cities_df_suffixes_list = [
    '_climate_hazard',
    '_adaptation',
    '_city_wide_emissions',
    '_city_wide_emissions_historical',
    '_emissions_reduction_targets',
    '_emissions_reduction_outcomes',
    '_energy_targets',
    '_energy_consumption',
    '_energy_efficiency_targets',
    '_transport',
    '_transport_fleet_size',
    '_transport_low_zero_emission_areas',
    '_food',
    '_waste']

# Prime the final table
cities_KPIs_2020 = disc[['Account Number', 'Organization']]

# Merge everything: outer-join each KPI table on 'Account Number',
# suffixing its columns with the matching theme name
count_cols = 0
for df_, suffix in zip(cities_df_list, cities_df_suffixes_list):
    cities_KPIs_2020 = pd.merge(cities_KPIs_2020, df_.add_suffix(suffix),
                                how='outer',
                                left_on='Account Number',
                                right_on='Account Number' + suffix)
    count_cols += len(df_.columns)

    
## Save into CSV
cities_KPIs_2020.to_csv('./df_KPIs/Final/cities_KPIs_2020.csv')


## Final checks
#print(count_cols) # 34+2 = 36
#print(len(cities_KPIs_2020.columns)) # 36

################################################################ Add region and Country to Cities KPIs df #######################################################

uq_cities = c20[['Account Number', 'Organization', 'CDP Region', 'Country', 'Row Number']].groupby(['Account Number', 'Organization', 'CDP Region', 'Country'], as_index=False).count() #value_counts()

cities_KPIs_2020 = pd.merge(cities_KPIs_2020, 
                        uq_cities.copy().add_suffix('_'), 
                        how='left', 
                        left_on='Account Number',
                        right_on='Account Number_'
                        )


cities_KPIs_2020 = cities_KPIs_2020.drop(['Organization_', 'Account Number_', 'Row Number_'], axis=1)


# uq_cities.head()
#cities_KPIs_2020.head() Organization

################################################################### Link Companies to cities ######################################################

## Add Cities to corporations' df
corp_locations = pd.read_csv('../input/cdp-unlocking-climate-solutions/Supplementary Data/Locations of Corporations/NA_HQ_public_data.csv', low_memory=False)
corp_locations_cleaned= corp_locations.groupby(['account_number', 'hq_country','address_city'], as_index=False)['public'].count()
company_climate_KPIs_2020 = pd.merge(company_climate_KPIs_2020, 
                                    corp_locations_cleaned.copy().add_suffix('_'), 
                                    how='left', 
                                    left_on='account_number',
                                    right_on='account_number_'
                                    )

company_water_KPIs_2020 = pd.merge(company_water_KPIs_2020, 
                                    corp_locations_cleaned.copy().add_suffix('_'), 
                                    how='left', 
                                    left_on='account_number',
                                    right_on='account_number_'
                                    )


climate_city_KPIs_2020 = pd.merge(company_water_KPIs_2020, 
                                   corp_locations_cleaned.copy().add_suffix('_'), 
                                   how='left', 
                                    left_on='account_number',
                                    right_on='account_number_'
                                    )


## Merge 


# print(len(corp_locations.index))
# corp_locations.head(4)
# company_climate_KPIs_2020['theme'].value_counts()
# company_water_KPIs_2020['theme'].value_counts()








## Visualize sample table
# pd.set_option('display.max_columns', None)
# C41a_KPI
C41a[C41a['organization'] == 'Celestica Inc.']


**Objective strategy**

We start by presenting a KPI that tracks the "objective strategy", i.e. the feasibility of the objective set by a company/city. The proposed metric is "Years to achievement", which is built by directly comparing the percentage of CO2 emissions a company still needs to cut to reach its goal with how much it has already been able to cut so far against the same goal. 

It is interesting to observe that whilst the distribution peaks at zero, many large negative and positive values are present for this metric. How to interpret this KPI:
* A value close to zero indicates that the objective is well calibrated, meaning that if the company/city keeps reducing emissions at the pace it has kept so far, it will achieve its goal exactly in the indicated time span
* A negative value indicates that the company/city has not kept on track with its objective, and reaching it is starting to look unfeasible unless there is a change in pace. The further the value is from zero, the less likely the company/city is to achieve its goal
* A positive value means that the company/city has been able to reduce emissions at a much faster pace than the target requires. While this can be seen as a positive, it could also indicate that the objective is not ambitious enough, was therefore too easy to achieve, and should probably be adjusted.
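
To make the sign convention concrete, here is a minimal sketch of how a metric of this kind could be computed. The exact formula behind `Years_diff` is not shown in this section, so the function below is an assumption: it supposes a constant yearly pace of cuts, and the names `years_to_achievement`, `pct_achieved`, `base_year`, `target_year` and `reporting_year` are hypothetical.

```python
def years_to_achievement(pct_achieved, base_year, target_year, reporting_year):
    """Years ahead (+) or behind (-) schedule, assuming a constant pace of cuts.

    pct_achieved is the share of the target already achieved, in percent (0-100).
    This is an illustrative assumption, not the notebook's actual formula.
    """
    years_elapsed = reporting_year - base_year
    years_left = target_year - reporting_year
    if years_elapsed <= 0 or pct_achieved <= 0:
        return None  # no pace can be estimated
    pace = pct_achieved / years_elapsed          # percent of target achieved per year
    years_needed = (100 - pct_achieved) / pace   # years still required at that pace
    return years_left - years_needed


on_track = years_to_achievement(50, 2010, 2030, 2020)  # half done at half time
behind = years_to_achievement(20, 2010, 2030, 2020)    # only 20% done at half time
ahead = years_to_achievement(80, 2010, 2030, 2020)     # 80% done at half time
```

Under these assumptions the three cases return 0, a negative value and a positive value respectively, matching the three bullets above.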

Below is the distribution of "Years to achievement" for the Corporates' objectives, based on the answers to table 4.1a of the Corporate questionnaire:


In [None]:
def kde_plotter(df, x, y, x_label, y_label, main_title, subtitle):

    # Keyword arguments as expected by seaborn >= 0.11 (x/y, fill, bw_method)
    fig = sns.kdeplot(x=df[x], y=df[y], cmap="Blues", fill=True, bw_method=.1)

    fig.figure.suptitle(main_title, fontsize = 14)

    plt.xlabel(x_label, fontsize=12)
    plt.ylabel(y_label, fontsize=12)
    plt.title(subtitle, fontsize=12)
    plt.ylim(0, 1.0)
    plt.xlim(0, 1.0)
    plt.gca().set_aspect('equal', adjustable='box')

fig = sns.kdeplot(C41a_KPI['Years_diff'], fill=True)  # 'fill' replaces the deprecated 'shade' (seaborn >= 0.11)
fig.figure.suptitle("Companies ambitions of objectives", fontsize = 14)
plt.title("Distribution of companies' years to achievement of objective", fontsize=12)
plt.xlabel('Years to achievement', fontsize=12)
plt.ylabel("Companies' distribution", fontsize=12)
plt.xlim(-50, 50)
plt.legend('')
plt.show()

**Objective strategy VS objective ambition**

We also created another metric which looks at the "objective ambition", expressed as the percentage of CO2 that the company/city intends to cut for a given objective. For this, we produce another graph that shows the distance from the objective for companies grouped by sector (the companies around the red dotted line are on track to achieve the goal, the ones above are well ahead of schedule and the ones below are falling behind). The size of the dot shows the quantity of emissions that they are aiming to cut, while the "ambition KPI" shows the percentage of emissions they are aiming to cut. 

In the example below, we cross-reference three different KPIs for the cities' objectives: "Years to achievement", "Ambition" and overall CO2 emissions targeted. The KPI refers to question 5.0a of the Cities questionnaire on city-wide emissions reductions in absolute targets.

The closer to the red dotted line, the more the city is on track to achieve its goal. Points below the red dotted line mean that the city is behind target and points above it mean that the city is ahead of target, similarly to the distribution presented for companies in the previous section. Further, we cross-reference the KPIs to get additional insight: we can also check how "ambitious" the objective is (meaning what percentage of emissions the city is aiming to cut) and how big the emissions cuts in scope of the objective are in terms of CO2.

In other words, to read the graph below, "big dots" that are close to or above the red line on the right-hand side of the graph are the preferable status for a city objective.  
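
As a rough illustration of this reading, the sketch below ranks a few made-up objectives by ambition and flags the "preferable" ones. The data, the column names (`pct_reduction_target`, `years_diff`, `ambition_rank`) and the 0.5 cut-off are assumptions for illustration only; the percentile ranking mirrors the `rank(pct=True)` pattern used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical objectives: targeted reduction (%) and years ahead (+) / behind (-)
targets = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "pct_reduction_target": [20.0, 40.0, 80.0, 60.0],
    "years_diff": [5.0, -3.0, 1.0, -10.0],
})

# Ambition expressed as a percentile rank in (0, 1]; higher means more ambitious
targets["ambition_rank"] = targets["pct_reduction_target"].rank(pct=True)

# "Preferable" objectives: ambitious (right-hand side) and on or above the zero line
targets["preferable"] = (targets["ambition_rank"] > 0.5) & (targets["years_diff"] >= 0)
```

With this toy data only city C is flagged: city D is ambitious but behind schedule, and city A is ahead of schedule but not ambitious.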

In [None]:
fig = sns.relplot(x="KPI_50a_rank_objective_ambition", 
            y="Years_diff", 
            size="Covered_emissions", # WARNING: these emissions are based on different years
            sizes=(50, 500),
            hue="Covered_emissions", 
            legend='brief',
            data=C50a_KPI)
leg = fig._legend
leg.texts[0].set_text("")#, fontsize = 10)
# fig._legend.set_title("Covered emissions (metric tonnes CO2e)", fontsize = 10)
# leg.set_bbox_to_anchor([.7,0.7])
plt.axhline(0, color='r', linestyle='--')  # zero line: objective exactly on track
# fig.figure.suptitle("Cities: ambitions of objectives", fontsize = 18)
# plt.suptitle("Cities: ambitions of objectives", fontsize = 18)
plt.title("Cities' objectives: Years to achievement VS Ambition", fontsize=14)
plt.ylabel('Years to achievement', fontsize=12)
plt.xlabel("KPI: Objective Ambition", fontsize=12)
plt.gcf().set_size_inches(12, 4)#, dpi=80)#, forward=True)
plt.xlim(0, 1)

# plt.setp(
fig._legend.set_title("Covered emissions")
#     , fontsize='14') # for legend text #(metric tonnes CO2e)
plt.setp(fig._legend.get_texts(), fontsize='6') # for legend text
plt.setp(fig._legend.get_title(), fontsize='8') # for legend title
# plt.legend('Covered emissions (metric tonnes CO2e)')
plt.show()

**Objective Status**

The objectives can also be monitored with a binary flag that checks whether they are on target or behind target, based on progress history.
Below is a representation for the Cities' objectives on renewable energy targets from table 8.0a of the Cities questionnaire:

In [None]:
# On target VS Behind target
fig = sns.countplot(x='Target_status', data=C80a)

fig.figure.suptitle("Cities: Objective status Table 8.0a", fontsize = 14)
plt.title("Count of target achievement statuses", fontsize=12)
plt.xlabel('Objective achievement status', fontsize=12)
# plt.ylabel("Count of companies", fontsize=12)
# plt.xlim(-80, 80)
# plt.legend('')
plt.show()

**Progress VS Ambition of an objective**

Finally, for table 4.1a of the Corporations questionnaire, we show the relationship between two KPIs we have created: "progress" and "ambition". As can be expected, there is an inverse relationship between the two: many low-ambition objectives have been achieved or are close to being achieved (see the dark area on the top left of the graph below), whereas there is a concentration of highly ambitious objectives with little progress that will require a lot of further work in the future (bottom right of the graph below).  

In [None]:
x="KPI_rank_objective_ambition" 
y="KPI_rank_objective_progress"
x_label = "Ambition"
y_label = "Progress"
main_title = "Corporations Objectives"
subtitle = "Progress vs Ambition"

kde_plotter(C41a_KPI,
    x, 
    y, 
    x_label, 
    y_label, 
    main_title, 
    subtitle)

<a id="section-two"></a>
# Risk & Opportunity assessment KPIs

In this section the Risk & Opportunity assessment KPIs will be described. The table below summarizes the main KPIs we propose for this section and the tables within the questionnaires they can be applied to:

<html>
<head>
<style>
table, th, td {
  border: 1px solid black;
}
table {
  width: 100%;
} 
</style>
</head>
<body>
   
    
<table>
   
  <tr>
    <th style="text-align:center; width:130px" bgcolor="#ff9999">KPI type</th>
    <th style="text-align:center; width:200px" bgcolor="#ff9999">Formula/methodology</th>
    <th style="text-align:center; width:300px" bgcolor="#ff9999">Interpretation</th>
    <th style="text-align:center; width:100px" bgcolor="#ff9999">Cities</th>
    <th style="text-align:center; width:100px" bgcolor="#ff9999">Corporations</th>
    <th style="text-align:center; width:100px" bgcolor="#ff9999">Water</th>
  </tr>
  <tr>
    <td style="text-align:right" BGcolor=#ffffcc><b>Risk Index</b></td>
    <td style="text-align:right" BGcolor=#ffffcc>Risk Index = Risk probability * Risk Magnitude / Time horizon</td>
    <td style="text-align:right" BGcolor=#ffffcc>This KPI gives the risk exposure of a company/city based on its own expectations. The time horizon in years for companies is taken from table C2.1a, where they state their definition of short/medium/long term. 
The concept behind this KPI is a yearly exposure to risk, based on how likely the risk is to happen and how severe its manifestation will be. Companies/cities can be ranked on their risk exposure. 
For cities, it was also possible to compare current VS future exposure to a given risk, as the questionnaire asks respondents to assess both time dimensions. Additionally, for cities this KPI is linked to the Social Vulnerability Index (SVI) to better understand a possible risk/effect outcome.
For the water questionnaire this is also aggregated at the country level.</td>
   <td style="text-align:right" BGcolor=#ffffcc>2.1</td>
   <td style="text-align:right" BGcolor=#ffffcc>2.3a</td>
   <td style="text-align:right" BGcolor=#ffffcc>4.2</td>
  </tr>
  <tr>
   <td style="text-align:right" BGcolor=#ffffcc><b>Opportunity Index</b></td>
   <td style="text-align:right" BGcolor=#ffffcc>Opportunity Index = Opportunity Probability * Opportunity magnitude / Time horizon</td>
   <td style="text-align:right" BGcolor=#ffffcc>This KPI returns an assessment of the opportunities available to companies, by considering the likelihood and the magnitude of the event within a time horizon. The time horizon in years for companies is taken from table C2.1a, where they state their definition of short/medium/long term.  
Please note that the Water questionnaire does not report the magnitude of the opportunity, so in this case the probability is simply divided by the time horizon.</td>
   <td style="text-align:right" BGcolor=#ffffcc>-</td>
   <td style="text-align:right" BGcolor=#ffffcc>2.4a</td>
   <td style="text-align:right" BGcolor=#ffffcc>-</td>
  </tr>
  <tr>
   <td style="text-align:right" BGcolor=#ffffcc><b>Water related risk </b></td>
   <td style="text-align:right" BGcolor=#ffffcc>Ranking of water related risks by country and company</td>
   <td style="text-align:right" BGcolor=#ffffcc>A ranking based on the number of facilities exposed to water risks. The second version of this KPI aggregates by business and looks at how much a company is exposed, in % terms, in a given facility.</td>
   <td style="text-align:right" BGcolor=#ffffcc>-</td>
   <td style="text-align:right" BGcolor=#ffffcc>-</td>
   <td style="text-align:right" BGcolor=#ffffcc>4.1b/ 4.1c</td>
  </tr>
</table>
</body>
</html>

**Risk Index**

We first display the relationship between the Risk Index KPI and the Time Horizon. In the case of Corporations, the Time Horizon is obtained from table 2.1a, where they define their timeframe for short, medium and long term. 

We then match that answer with table 2.3a, where the risks faced by the company are transformed by assigning a numerical value to the categorical answers (e.g. High=5, Medium-High=4, Medium=3, etc.) and probability is multiplied by magnitude. This value is then divided by the time horizon (where short/medium/long is based on question 2.1a).
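
The calculation described above can be sketched as follows. Only the values High=5, Medium-High=4 and Medium=3 come from the text; the remaining labels in `LEVELS`, the function name `risk_index` and the sample inputs are assumptions made for illustration.

```python
# Categorical answers mapped to numbers. High/Medium-high/Medium follow the text;
# the two lower levels are assumed to continue the same scale.
LEVELS = {"Low": 1, "Medium-low": 2, "Medium": 3, "Medium-high": 4, "High": 5}


def risk_index(probability, magnitude, horizon_years):
    """Risk Index = probability * magnitude / time horizon (in years)."""
    return LEVELS[probability] * LEVELS[magnitude] / horizon_years


# The same risk weighs more when expected in the short term than in the long term
short_term = risk_index("High", "Medium", horizon_years=5)
long_term = risk_index("High", "Medium", horizon_years=20)
```

Dividing by the time horizon turns the probability-magnitude product into a yearly exposure, so the same risk scores four times higher over 5 years than over 20 years.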

This information can be further enhanced with the "Financial Impact" of the risk that is also identified within table 2.3a.

We can observe from the plot below that the majority of risks faced by organizations are expected to materialise within the next 10 years, including some high probability/materiality ones with significant financial impact. 

In [None]:
## MOST IDENTIFIED RISKS ARE CLOSER IN TIME AND HIGHER IN COSTS! 
fig = sns.relplot(x="time_horizon_years", 
            y="KPI_rank_risk_index", 
            size="risk_financial_impact",
            sizes=(10, 500), 
            hue="risk_financial_impact", 
            legend='brief',
            data=C23a_KPI)

axes = fig.axes.flatten()
leg = fig._legend
leg.texts[0].set_text("")#, fontsize = 10)
# axes[0].set_ylabel("Risk Index")
# axes[0].set_xlabel("Time Horizon (years)")
axes[0].set_xlim(0, 50)
axes[0].set_ylim(0, 1.1)
# axes[0].set_title("Risk: forward look and magnitude")

plt.title("Risk: forward look and magnitude", fontsize=14)
plt.ylabel('Risk Index', fontsize=12)
plt.xlabel("Time Horizon (years)", fontsize=12)
plt.gcf().set_size_inches(12, 4)#, dpi=80)#, forward=True)
plt.xlim(0, 50)
# plt.setp(#
fig._legend.set_title("Financial impact")
# , fontsize='14') # for legend text #(metric tonnes CO2e)
plt.setp(fig._legend.get_texts(), fontsize='6') # for legend text
plt.setp(fig._legend.get_title(), fontsize='8') # for legend title
plt.show()

**Current VS Future Risk Index for Cities with Social Vulnerability Index**

An advantage of the KPIs we propose in this section is that they can easily be crossed with external data sources to enhance the understanding of the risks faced. For cities, two time dimensions are covered: present (or current) risk and future risk. 

We present below how the risk index increases/decreases over time, also adding the Social Vulnerability Index that was made available by CDP in the supplementary data. 

The bigger the dot, the higher the social vulnerability (which is the result of an assessment of demographic data on socioeconomic status, household composition & disability, minority & language barriers, housing type and transportation). 

Additionally, dots that are above the diagonal indicate that the risk is increasing in the future, whereas the ones below it show a decreasing risk trend reported by the city. Points on the diagonal have a constant expected present/future risk level.   
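
The position relative to the diagonal can also be expressed as a simple label. This is a minimal sketch: the function name `risk_trend` and the optional tolerance `tol` are hypothetical and not part of the notebook.

```python
def risk_trend(current, future, tol=0.0):
    """Classify a point of the scatter plot relative to the diagonal y = x."""
    if future > current + tol:
        return "increasing"   # above the diagonal
    if future < current - tol:
        return "decreasing"   # below the diagonal
    return "constant"         # on the diagonal (within the tolerance)


labels = [risk_trend(0.2, 0.6), risk_trend(0.7, 0.3), risk_trend(0.5, 0.5)]
```

A small positive `tol` could be used to treat near-diagonal points as constant rather than splitting them on tiny differences.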

In [None]:
g = sns.relplot(data=ct, x='risk_index_curr', y='risk_index_fut',  size='RPL_THEMES', hue='RPL_THEMES',
    sizes=(20, 200))
axes = g.axes.flatten()
plt.gcf().set_size_inches(10, 7.5)
axes[0].set_ylabel("Future Risk Index")
axes[0].set_xlabel("Current Risk Index")
axes[0].set_title("Current VS Future - Risk Index")
data_to_graph = ct[['risk_index_curr','risk_index_fut','Organization_x']]
cities_to_remove=[
   'San Luis Obispo', 'Racine, WI', 'Durham', 'Providence', 'York, ME', 'District of Columbia', 'Winona, MN'
]
data_to_graph[~data_to_graph['Organization_x'].isin(cities_to_remove)].apply(lambda x: axes[0].text(*x),axis=1)
#     ct['risk_index_curr'], ct['risk_index_fut'], *x, transform=ax.transAxes
#     ),axis=1)

g._legend.texts[0].set_text("")
g._legend.set_title("Social Vulnerability Index")

X_plot = np.linspace(0, 1)
Y_plot = X_plot
plt.plot(X_plot, Y_plot, '--', color='r')

plt.axis('equal')
plt.ylim(0, 1.1)
plt.xlim(0, 1.1)

plt.show()

**Risk Index and Project spending on Drought risk - with ND-GAIN data for vulnerability**

Similarly, we can use the Risk vulnerability KPI against other external datasets such as the ND-GAIN dataset, which provides a risk/vulnerability assessment against specific climate change events such as rising sea levels, droughts, heavy precipitation, etc.

In the graph below we only select projects from table 2.1 of the City questionnaire which are earmarked for "Drought" prevention. We then enhance the data with the ND-GAIN dataset, which gives us the Vulnerability to the Drought event for a given city, and finally we plot it against the project cost to check how well a city is gearing up against the risk.

From the plot below, it looks like cities that are more exposed and vulnerable to Drought risk are investing relatively less than cities that are less exposed, which suggests they should probably invest more to prevent this risk.

In [None]:
a=C30.dropna(subset=['Climate hazards'])
a=a.loc[a['Climate hazards'].str.contains('Water Scarcity')] 
a = a.groupby(['Account Number', 'Organization'])['Total Project Cost USD'].sum().reset_index()
a['Total_cost_Log'] = np.log(a['Total Project Cost USD'])
drough =pd.merge(cities, ND_DROUGHT, left_on='Organization_new',right_on='City',how='left')
#drough = drough.dropna()
dro =pd.merge(drough,a, left_on='Account Number',right_on='Account Number',how='left')
dro= dro[dro['Organization_x']!='Fremont']
dro['Risk_Rank'] = dro['Risk'].rank(pct=True)
g = sns.relplot(data=dro, x='Total_cost_Log', y='Risk',  size='Vulnerability', hue= 'Vulnerability',
    sizes=(20, 200))

axes = g.axes.flatten()
plt.gcf().set_size_inches(12, 4)
axes[0].set_ylabel("Risk Index")
axes[0].set_xlabel("Total Cost of Project in USD (Log))")
#axes[0].set_xlim(0, 50)
# axes[0].set_ylim(0, 0.5)
axes[0].set_title("Drought: Risk Index/Vulnerability and investment")
dro[['Total_cost_Log','Risk','Organization_x']].apply(lambda x: axes[0].text(*x),axis=1)
plt.show()

**Water related Risk**

It is also possible to visualize the Risk Index with a "country" level aggregation. Table 4.1c of the Water questionnaire lends itself quite well to this purpose, as it reports the Country/Area for which the risk is analysed. 
The map below shows water related Risk Index by country, where a darker red color indicates a higher Risk Index, meaning that the country is more exposed to water related risk:

In [None]:
import plotly.express as px

fig = px.choropleth(country_water_KPIs_2020, locations="country_or_region",
                    color="KPI_rank_risk_index_risks_identified", # column that drives the colour scale
                    labels={
                     "country_or_region": "Countries",
                     "KPI_rank_risk_index_risks_identified": "Risk index"},
                    hover_name="country_or_region", # column to add to hover information
                    locationmode='country names',
                    color_continuous_scale=px.colors.sequential.OrRd)

fig.update_layout(
    title_text="Companies' water-related Risk Index by country"
)

fig.show()

**Stress areas by country**

We finally show another view by country of a KPI: "water stress" by area. This is from Table 6.0 of the Water questionnaire. It reports if there is a "water stress" for the area where the company facility is located. The darker the red, the higher the stress for the area.

In [None]:
fig = px.choropleth(country_water_KPIs_2020, locations="country_or_region",
                    color="KPI_rank_facility_in_area_with_water_stress_water_use_by_facility", # column that drives the colour scale
                    labels={
                     "country_or_region": "Countries",
                     "KPI_rank_facility_in_area_with_water_stress_water_use_by_facility": "Water stress facility"},
                    hover_name="country_or_region", # column to add to hover information
                    locationmode='country names',
                    color_continuous_scale=px.colors.sequential.OrRd)

fig.update_layout(
    title_text="Companies' facilities in water stress areas, by country"
)

fig.show()

<a id="section-three"></a>
# Return on Investment KPIs

In this section the Return on investment KPIs will be described. The table below summarizes the main KPIs we propose for this section and the tables within the questionnaires they can be applied to:

<html>
<head>
<style>
table, th, td {
  border: 1px solid black;
}
table {
  width: 100%;
} 
</style>
</head>
<body>
   
    
<table>
   
  <tr>
    <th style="text-align:center; width:130px" bgcolor="#ff9999">KPI type</th>
    <th style="text-align:center; width:200px" bgcolor="#ff9999">Formula/methodology</th>
    <th style="text-align:center; width:300px" bgcolor="#ff9999">Interpretation</th>
    <th style="text-align:center; width:100px" bgcolor="#ff9999">Cities</th>
    <th style="text-align:center; width:100px" bgcolor="#ff9999">Corporations</th>
    <th style="text-align:center; width:100px" bgcolor="#ff9999">Water</th>
  </tr>
  <tr>
    <td style="text-align:right" BGcolor=#00cc66><b>ROI of risk mitigation</b></td>
    <td style="text-align:right" BGcolor=#00cc66>ROI of risk mitigation = Risk financial impact / Risk response cost </td>
    <td style="text-align:right" BGcolor=#00cc66>This KPI shows the return on investment of the response cost. A low response cost linked to a high financial impact indicates a high ROI, meaning that the company will get a big benefit from a proportionally low investment.</td>
   <td style="text-align:right" BGcolor=#00cc66>-</td>
   <td style="text-align:right" BGcolor=#00cc66>2.3a</td>
   <td style="text-align:right" BGcolor=#00cc66>4.3a</td>
  </tr>
  <tr>
    <td style="text-align:right" BGcolor=#00cc66><b>ROI of opportunity</b></td>
    <td style="text-align:right" BGcolor=#00cc66>ROI of opportunity = Opportunity financial impact / Cost of realizing the opportunity</td>
    <td style="text-align:right" BGcolor=#00cc66>This KPI shows the return on investment for realizing opportunities. A low opportunity cost linked to a high financial impact indicates a high ROI, meaning that the company will get a big benefit from a proportionally low investment.</td>
   <td style="text-align:right" BGcolor=#00cc66></td>
   <td style="text-align:right" BGcolor=#00cc66>2.4a</td>
   <td style="text-align:right" BGcolor=#00cc66></td>
  </tr>
  <tr>
    <td style="text-align:right" BGcolor=#00cc66><b>Spending per capita</b></td>
    <td style="text-align:right" BGcolor=#00cc66>Spending per capita = Total USD project cost / Population</td>
    <td style="text-align:right" BGcolor=#00cc66>Amount in USD per citizen spent on reducing the vulnerability of the city to climate threats.</td>
   <td style="text-align:right" BGcolor=#00cc66>3.0</td>
   <td style="text-align:right" BGcolor=#00cc66>-</td>
   <td style="text-align:right" BGcolor=#00cc66>-</td>
  </tr>
  <tr>
    <td style="text-align:right" BGcolor=#00cc66><b>Return on investment for MWh savings/production and CO2 cuts</b></td>
    <td style="text-align:right" BGcolor=#00cc66>ROI MWh cuts = Total USD cost / MWh cuts 
<br>ROI Green MWh produced = Total USD cost / MWh of green energy produced
<br>ROI Carbon cuts = Total USD cost / CO2 cuts</td>
    <td style="text-align:right" BGcolor=#00cc66>This KPI gives the cost in dollars per unit of the target being achieved. It makes it possible to compare the effectiveness of projects and select the best ones. For example, between 2 projects both aimed at saving 10 MWh and costing respectively 1000 USD and 500 USD, the second project is clearly more efficient, as it achieves the same result for half the cost.</td>
   <td style="text-align:right" BGcolor=#00cc66>5.4</td>
   <td style="text-align:right" BGcolor=#00cc66>-</td>
   <td style="text-align:right" BGcolor=#00cc66>-</td>
  </tr>
</table>
</body>
</html>
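
As a minimal sketch of the first two formulas in the table above, the snippet below computes both ratios for made-up figures. The function names and all input numbers are illustrative assumptions, not values taken from the questionnaires.

```python
def roi_of_opportunity(financial_impact_usd: float, cost_to_realize_usd: float) -> float:
    """ROI of opportunity = opportunity financial impact / cost of realizing it."""
    if cost_to_realize_usd <= 0:
        raise ValueError("cost must be positive to compute a ratio")
    return financial_impact_usd / cost_to_realize_usd


def spending_per_capita(total_project_cost_usd: float, population: int) -> float:
    """Spending per capita = total USD project cost / population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return total_project_cost_usd / population


# Hypothetical inputs, for illustration only
roi = roi_of_opportunity(financial_impact_usd=2_000_000, cost_to_realize_usd=500_000)
per_capita = spending_per_capita(total_project_cost_usd=12_000_000, population=800_000)
print(roi)         # 4.0
print(per_capita)  # 15.0
```

A ratio above 1 for the opportunity KPI means the reported financial impact exceeds the reported cost of realizing it.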

**Return on investment for MWh savings/production and CO2 cuts**

The graph below shows the Return on Investment (ROI) KPI for cities' investments, applied to Table 5.4 of the Cities questionnaire. The KPI is read as a unit cost: for energy savings, for example, it gives how many US dollars one MWh of saved electricity has cost.

Here we compare: energy savings projects, CO2 reduction projects and energy production projects. 

The lower the number the better it is, as a unit saving is achieved for less cost. 

We can observe that, on average, CO2 reduction projects tend to be the most cost-efficient, whereas green energy production projects tend to be more expensive and show greater variability in cost efficiency.

However, these savings are often linked to each other (e.g. a new green energy facility also brings CO2 reductions), so it should not necessarily be concluded that investing in a CO2 reduction project alone is more cost-efficient.

Furthermore, this KPI can be used to identify the projects with the best ROI, which could then be recommended to other cities to help them achieve their goals.
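
To illustrate how the best-ROI projects could be identified, here is a small sketch that computes the unit cost and ranks projects by it. The table and its column names are hypothetical; the first two rows mirror the 10 MWh example given in the summary table.

```python
import pandas as pd

# Hypothetical projects; A and B mirror the 10 MWh example from the summary table
projects = pd.DataFrame({
    "project": ["A", "B", "C"],
    "total_cost_usd": [1000.0, 500.0, 1200.0],
    "mwh_saved": [10.0, 10.0, 40.0],
})

# Cost of one unit of savings: lower is better
projects["usd_per_mwh_saved"] = projects["total_cost_usd"] / projects["mwh_saved"]

# Rank 1 = most cost-efficient project
projects["efficiency_rank"] = projects["usd_per_mwh_saved"].rank(method="min").astype(int)
best_first = projects.sort_values("usd_per_mwh_saved").reset_index(drop=True)
print(best_first[["project", "usd_per_mwh_saved", "efficiency_rank"]])
```

Project C comes first because it spends 30 USD per MWh saved, even though it has the highest total cost.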

ROI KPIs for corporations' risk mitigation actions (Table 2.3a of the corporate climate change questionnaire and Table 4.3a of the water questionnaire) can be analysed in a similar fashion to identify the most cost-efficient actions. Similarly, Table 2.4a is analysed with respect to climate-related opportunities; in this case the KPI gives the return on investment a company would achieve by realizing that opportunity.

In [None]:
# Select the three unit-cost columns (USD per unit saved, cut or produced)
C54KPI= C54_KPI[['Dollar per 1 MWh of Energy Savings','Dollar per 1 Mt of CO2 Emissions reductions','Dollar per 1 MWh of Renewable energy production']]
# Drop outliers: keep only projects costing less than 5,000 USD per unit on every measure
C54KPI = C54KPI[C54KPI['Dollar per 1 MWh of Energy Savings']<5000]
C54KPI = C54KPI[C54KPI['Dollar per 1 MWh of Renewable energy production']<5000]
C54KPI = C54KPI[C54KPI['Dollar per 1 Mt of CO2 Emissions reductions']<5000]
# Shorter column labels for the plot axis
C54KPI=C54KPI.rename(columns={'Dollar per 1 MWh of Energy Savings':"$ return MWH Savings",'Dollar per 1 Mt of CO2 Emissions reductions':'$ return CO2 reductions',
                              'Dollar per 1 MWh of Renewable energy production':'$ return MWH ren eng prod'})

#C54KPIe=C54KPI[["$ return MWH Savings",'$ return MWH renewable energy production']]
sns.set_style("white")
sns.boxplot(data=C54KPI)
plt.gcf().set_size_inches(12, 4)
plt.title("Dollars spent for MWh/CO2 unit reduction")
plt.show()
 

**Other ROI KPIs produced in this section**

ROI KPIs are useful in their own right and can also be combined with other KPIs produced in this work to put results into perspective. 

We have shown examples of this throughout this notebook, for instance using total project cost or spending per capita to add further insight to other KPIs. 


<a id="section-four"></a>
# Trends KPIs

In this section the Trends KPIs will be described. The table below summarizes the main KPIs we propose for this section and the tables within the questionnaires they can be applied to:

<html>
<head>
<style>
table, th, td {
  border: 1px solid black;
}
table {
  width: 100%;
} 
</style>
</head>
<body>
   
    
<table>
   
  <tr>
    <th style="text-align:center; width:130px" BGcolor=#ff9999>KPI type</th>
    <th style="text-align:center; width:200px" BGcolor=#ff9999>Formula/methodology</th>
    <th style="text-align:center; width:300px" BGcolor=#ff9999>Interpretation</th>
    <th style="text-align:center; width:100px" BGcolor=#ff9999>Cities</th>
    <th style="text-align:center; width:100px" BGcolor=#ff9999>Corporations</th>
    <th style="text-align:center; width:100px" BGcolor=#ff9999>Water</th>
  </tr>
  <tr>
    <td style="text-align:right" BGcolor=#ffcc66><b>Historical emissions trends (city or company)</b></td>
    <td style="text-align:right" BGcolor=#ffcc66>For each city, compute the % change in CO2 emissions between consecutive years (year t-1 to year t), then average these changes over the available time series</td>
    <td style="text-align:right" BGcolor=#ffcc66>This KPI looks at the trend in historical CO2 emissions for the city. Cities that have consistently reduced emissions, showing a % cut across the available time series, are the best performing ones.</td>
   <td style="text-align:right" BGcolor=#ffcc66>4.13</td>
   <td style="text-align:right" BGcolor=#ffcc66>6.1</td>
   <td style="text-align:right" BGcolor=#ffcc66>-</td>
  </tr>
  <tr>
   <td style="text-align:right" BGcolor=#ffcc66><b>Historical trends of water consumption</b></td>
   <td style="text-align:right" BGcolor=#ffcc66>A ranking is produced by looking at the trends of water consumption of the company.</td>
   <td style="text-align:right" BGcolor=#ffcc66>Based on the answers to question 1.2b, companies are ranked by their trend in water consumption</td>
   <td style="text-align:right" BGcolor=#ffcc66>-</td>
   <td style="text-align:right" BGcolor=#ffcc66>-</td>
   <td style="text-align:right" BGcolor=#ffcc66>1.2d</td>
  </tr>

</table>
</body>
</html>


**Trends KPI**

The chart below shows the data on which we build the historical emissions trends KPI, in this case for cities (Table 4.13). The KPI is obtained by averaging the time series of historical year-on-year percentage changes in emissions. We then rank cities on this metric to compare their performance: cities with the largest average CO2 cuts are the best performing ones.  

In [None]:
C413time = C413.drop('Inventory date from',axis=1)
C413time['Inventory date to'] = pd.to_datetime(C413time['Inventory date to'])
C413time['Inventory date to'] = C413time['Inventory date to'].dt.year
C413time = C413time.sort_values(['Organization','Inventory date to','Row Number']).reset_index()
C413time = C413time[C413time['Methodology']== 'Global Protocol for Community Greenhouse Gas Emissions Inventories (GPC)']
C413time['Previous emissions (metric tonnes CO2e)'] = C413time['Previous emissions (metric tonnes CO2e)'].astype(float)
C413time['diffs'] = C413time['Previous emissions (metric tonnes CO2e)'].pct_change()
# The first row of each organization has no valid predecessor, so blank it out
mask = C413time.Organization != C413time.Organization.shift(1)
C413time.loc[mask, 'diffs'] = np.nan
C413time['y_diffs'] = C413time['Inventory date to'].diff()
C413time.loc[mask, 'y_diffs'] = np.nan
C413time=C413time.dropna(subset=['diffs'])
C413time=C413time[C413time['y_diffs']==1]


C413time = C413time.drop(['index','Methodology','Row Number','Previous emissions (metric tonnes CO2e)','y_diffs','Account Number'],axis=1)
#C413time = C413time.pivot(index=['Inventory date to'], columns='Organization', values='diffs').reset_index()
C413time = C413time.pivot(index=['Organization'], columns='Inventory date to', values='diffs').reset_index()
C413time = C413time.T.reset_index()
new_header = C413time.iloc[0] #grab the first row for the header
C413time = C413time[1:] #take the data less the header row
C413time.columns = new_header
C413time=C413time[8:12]
C413time=C413time[['Organization','Ajuntament de Barcelona','City of Boston','City of Columbus','City of Los Angeles','City of Toronto','Hørsholm Kommune']]
C413time=C413time.astype(float)

C413time['Organization'] = C413time['Organization'].astype(int)

# sns.lineplot(x="Organization", y="Ajuntament de Barcelona", data=C413time)
# sns.lineplot(x="Organization", y="City of Boston", data=C413time)
# sns.lineplot(x="Organization", y="City of Columbus", data=C413time)
# sns.lineplot(x="Organization", y="City of Toronto", data=C413time)


fig, ax = plt.subplots(1, 1)
ax.plot(C413time["Organization"], C413time["Ajuntament de Barcelona"], color="blue", label="Ajuntament de Barcelona", linestyle="-")
ax.plot(C413time["Organization"], C413time["City of Boston"], color="red", label="City of Boston", linestyle="-")
ax.plot(C413time["Organization"], C413time["City of Columbus"], color="green", label="City of Columbus", linestyle="-")
ax.plot(C413time["Organization"], C413time["City of Toronto"], color="orange", label="City of Toronto", linestyle="-")

plt.title("CO2 emission trends for cities", fontsize=14)
plt.xlabel('Years', fontsize=12)
plt.ylabel("Percentage emissions change year on year", fontsize=12)

import math
xint = range(C413time['Organization'].min(), math.ceil(C413time['Organization'].max())+1)

plt.xticks(xint)

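from matplotlib.ticker import PercentFormatter
ax = plt.gca()
# Values are fractions, so format the y axis as percentages with xmax=1
ax.yaxis.set_major_formatter(PercentFormatter(xmax=1, decimals=1))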

plt.legend(bbox_to_anchor=(1.4, 0.57))#loc=0)
plt.gcf().set_size_inches(12, 4)
plt.axhline(0, color='grey', linestyle=':', linewidth=0.5)  # zero-change reference line

plt.show()
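
The KPI itself (average year-on-year change per city, then ranking) can be sketched on a toy table as follows. This is a simplified illustration, not the notebook's exact pipeline: the column names and figures are hypothetical, and the changes are computed per city with a grouped shift.

```python
import pandas as pd

# Hypothetical emissions inventories (metric tonnes CO2e), one row per city and year
emissions = pd.DataFrame({
    "city": ["X", "X", "X", "Y", "Y", "Y"],
    "year": [2016, 2017, 2018, 2016, 2017, 2018],
    "emissions": [100.0, 90.0, 81.0, 200.0, 210.0, 189.0],
})
emissions = emissions.sort_values(["city", "year"]).reset_index(drop=True)

# Year-on-year change within each city; the first year of a city has no predecessor
previous = emissions.groupby("city")["emissions"].shift(1)
emissions["pct_change"] = emissions["emissions"] / previous - 1

# Keep only changes between consecutive years
emissions["year_gap"] = emissions.groupby("city")["year"].diff()
consecutive = emissions[emissions["year_gap"] == 1]

# KPI: mean year-on-year change per city; the most negative value ranks first
kpi = consecutive.groupby("city")["pct_change"].mean().sort_values()
print(kpi)
```

City X cuts emissions by about 10% each year and ranks first; city Y rises by 5% and then falls by 10%, for an average of about -2.5%.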

**Historical trends of water consumption**

For the water questionnaire, we show the change in water movement versus the previous reporting year. This KPI is built from the information collected in Table 1.2d of the Water questionnaire. Water movement is defined as the sum of water withdrawal, consumption and discharge, which is then grouped by country.

In [None]:
fig = px.choropleth(country_water_KPIs_2020, locations="country_or_region",
                    color="KPI_rank_water_total_use_vs_last_year_water_use_by_facility", # KPI rank that drives the colour scale
                    labels={
                     "country_or_region": "Countries",
                     "KPI_rank_water_total_use_vs_last_year_water_use_by_facility": "Total water movements VS previous year"},
                    hover_name="country_or_region", # column to add to hover information
                    locationmode='country names',
                    color_continuous_scale=px.colors.sequential.PuBu)

fig.update_layout(
    title_text="Companies' total water movements VS previous year, by country"
)

fig.show()
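
As a rough sketch of the calculation behind this map, the snippet below sums the three water flows and compares two reporting years per country. It assumes that volumes are available for both years; the column names, units and figures are hypothetical and may not match how the questionnaire reports the comparison.

```python
import pandas as pd

# Hypothetical volumes (megalitres) for two reporting years
water = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2019, 2020, 2019, 2020],
    "withdrawal": [100.0, 120.0, 50.0, 40.0],
    "consumption": [30.0, 35.0, 20.0, 15.0],
    "discharge": [70.0, 85.0, 30.0, 25.0],
})

# Water movement = withdrawal + consumption + discharge
water["movement"] = water[["withdrawal", "consumption", "discharge"]].sum(axis=1)

# Total movement per country and year, then the relative change versus the previous year
totals = water.pivot_table(index="country", columns="year", values="movement", aggfunc="sum")
change = (totals[2020] - totals[2019]) / totals[2019]
print(change)
```

Here country A moves 20% more water than in the previous year and country B 20% less.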

<a id="section-five"></a>
# City Assessment KPIs

In this section the City Assessment KPIs will be described. The table below summarizes the main KPIs we propose for this section and the tables within the questionnaires they can be applied to:


<html>
<head>
<style>
table, th, td {
  border: 1px solid black;
}
table {
  width: 100%;
} 
</style>
</head>
<body>
   
    
<table>
   
  <tr>
    <th style="text-align:center; width:130px" BGcolor=#ff9999>KPI type</th>
    <th style="text-align:center; width:200px" BGcolor=#ff9999>Formula/methodology</th>
    <th style="text-align:center; width:300px" BGcolor=#ff9999>Interpretation</th>
    <th style="text-align:center; width:100px" BGcolor=#ff9999>Cities</th>
    <th style="text-align:center; width:100px" BGcolor=#ff9999>Corporations</th>
    <th style="text-align:center; width:100px" BGcolor=#ff9999>Water</th>
  </tr>
  <tr>
    <td style="text-align:right" BGcolor=#ccffcc><b>Green Energy</b></td>
    <td style="text-align:right" BGcolor=#ccffcc>Ranking of cities by % of green energy, where green energy is % of Geothermal + % of Solar + % of Wind Energy</td>
    <td style="text-align:right" BGcolor=#ccffcc>This KPI ranks cities by the share of their energy that comes from green sources.</td>
   <td style="text-align:right" BGcolor=#ccffcc>8.1</td>
   <td style="text-align:right" BGcolor=#ccffcc>-</td>
   <td style="text-align:right" BGcolor=#ccffcc>-</td>
  </tr>
  <tr>
    <td style="text-align:right" BGcolor=#ccffcc><b>Green Transport</b></td>
    <td style="text-align:right" BGcolor=#ccffcc>Ranking of cities with the greenest ways of transport. Sum of % Micromobility, % Cycling, % Walking.</td>
    <td style="text-align:right" BGcolor=#ccffcc>This KPI ranks the city by looking at % of clean transportation means. </td>
   <td style="text-align:right" BGcolor=#ccffcc>10.1</td>
   <td style="text-align:right" BGcolor=#ccffcc>-</td>
   <td style="text-align:right" BGcolor=#ccffcc>-</td>
  </tr>
  <tr>
    <td style="text-align:right" BGcolor=#ccffcc><b>Private VS Public Transport</b></td>
    <td style="text-align:right" BGcolor=#ccffcc>Private transport = Ranking of cities with the fewest cars per capita, calculated as number of cars / number of citizens
<br>Public transport = Ranking of cities by public transport vehicles per capita</td>
    <td style="text-align:right" BGcolor=#ccffcc>This KPI returns the number of cars in the city per capita and the number of public transportation vehicles per capita</td>
   <td style="text-align:right" BGcolor=#ccffcc>10.4</td>
   <td style="text-align:right" BGcolor=#ccffcc>-</td>
   <td style="text-align:right" BGcolor=#ccffcc>-</td>
  </tr>
  <tr>
    <td style="text-align:right" BGcolor=#ccffcc><b>Size of Zero emissions zone</b></td>
    <td style="text-align:right" BGcolor=#ccffcc>Percentage of zero emissions zone compared to city area</td>
    <td style="text-align:right" BGcolor=#ccffcc>This KPI assesses the size of zero emissions zones in comparable terms</td>
   <td style="text-align:right" BGcolor=#ccffcc>10.7a</td>
   <td style="text-align:right" BGcolor=#ccffcc>-</td>
   <td style="text-align:right" BGcolor=#ccffcc>-</td>
  </tr>
  <tr>
    <td style="text-align:right" BGcolor=#ccffcc><b>Meat/Dairy consumption</b></td>
    <td style="text-align:right" BGcolor=#ccffcc>Ranking of cities using the dairy and meat consumption per capita data</td>
    <td style="text-align:right" BGcolor=#ccffcc>This KPI assesses food habits by ranking cities by meat/dairy consumption per capita</td>
   <td style="text-align:right" BGcolor=#ccffcc>12.1</td>
   <td style="text-align:right" BGcolor=#ccffcc>-</td>
   <td style="text-align:right" BGcolor=#ccffcc>-</td>
  </tr>
  <tr>
    <td style="text-align:right" BGcolor=#ccffcc><b>Waste per capita/ per km2</b></td>
    <td style="text-align:right" BGcolor=#ccffcc>Ranking of cities by tons of waste per capita (or per km2)</td>
    <td style="text-align:right" BGcolor=#ccffcc>This KPI returns the amount of waste in a city per capita or per km2, giving a common basis for comparison between cities</td>
   <td style="text-align:right" BGcolor=#ccffcc>13.0</td>
   <td style="text-align:right" BGcolor=#ccffcc>-</td>
   <td style="text-align:right" BGcolor=#ccffcc>-</td>
  </tr>
  <tr>
    <td style="text-align:right" BGcolor=#ccffcc><b>CO2 emissions per capita/ per km2</b></td>
    <td style="text-align:right" BGcolor=#ccffcc>Ranking of cities by CO2 emissions per citizen (or per km2)</td>
    <td style="text-align:right" BGcolor=#ccffcc>This KPI gives comparable CO2 emissions per capita (or per km2) in order to understand which cities pollute the most</td>
   <td style="text-align:right" BGcolor=#ccffcc>4.6b</td>
   <td style="text-align:right" BGcolor=#ccffcc>-</td>
   <td style="text-align:right" BGcolor=#ccffcc>-</td>
  </tr>
</table>
</body>
</html>
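
As an illustration of the Green Energy formula in the table above, the sketch below sums the three source shares and ranks cities on the result. The city names, column names and percentages are hypothetical.

```python
import pandas as pd

# Hypothetical energy mix per city, in percent of total consumption
energy_mix = pd.DataFrame({
    "city": ["X", "Y", "Z"],
    "geothermal_pct": [5.0, 0.0, 10.0],
    "solar_pct": [10.0, 20.0, 5.0],
    "wind_pct": [15.0, 5.0, 30.0],
}).set_index("city")

# Green energy = % geothermal + % solar + % wind
energy_mix["green_energy_pct"] = energy_mix[["geothermal_pct", "solar_pct", "wind_pct"]].sum(axis=1)

# Rank 1 = greenest city
energy_mix["green_energy_rank"] = (
    energy_mix["green_energy_pct"].rank(ascending=False, method="min").astype(int)
)
print(energy_mix[["green_energy_pct", "green_energy_rank"]])
```

The same pattern (sum the relevant shares, then rank) would apply to the Green Transport KPI with micromobility, cycling and walking shares.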

**City assessment and performance comparison**

The City Assessment KPIs allow a direct comparison between cities. 

This is achieved by calculating the raw KPI value (explained below) and then ranking cities based on their results. 

In the spider charts below we compare four cities on CO2 emissions per capita (the lower the better), % of green energy sourcing (the higher the better), number of private cars per capita (the lower the better), number of public transport vehicles per capita (the higher the better) and tons of waste per capita (the lower the better). The charts show how the KPIs built in this section allow a city's performance to be compared directly against key metrics. 

In [None]:
abc = cities_KPIs_2020[['Organization','KPI_130_waste_rank_waste','KPI_104_private_car_rank_transport_fleet_size',
                  'KPI_104_public_transport_rank_transport_fleet_size','KPI_81_Green_energy_rank_energy_consumption',
                 'KPI_46b_CO2 Emissions per capite_rank_city_wide_emissions']]
abc=abc.dropna()
abc=abc.rename(columns={"KPI_130_waste_rank_waste": "Waste per capita", "KPI_104_private_car_rank_transport_fleet_size": "Cars per capita", 
                   "KPI_104_public_transport_rank_transport_fleet_size": "Public transport per individual",'KPI_81_Green_energy_rank_energy_consumption':'Green Energy consumed',
                  'KPI_46b_CO2 Emissions per capite_rank_city_wide_emissions':'CO2 per capita'}, errors="raise")
bcd =abc[2:6]
bcd =bcd.reset_index()
bcd=bcd.drop(['index'], axis=1)

# ------- PART 1: Define a function that draws the plot for one row of the dataset
 
def make_spider( row, title, color):
 
# number of variable
    categories=list(bcd)[1:]
    N = len(categories)
 
# What will be the angle of each axis in the plot? (we divide the plot / number of variable)
    angles = [n / float(N) * 2 * pi for n in range(N)]
    angles += angles[:1]
 
# Initialise the spider plot
    ax = plt.subplot(2,2,row+1, polar=True, )
 
# If you want the first axis to be on top:
    ax.set_theta_offset(pi / 2)
    ax.set_theta_direction(-1)
 
# Draw one axis per variable and add the labels
    plt.xticks(angles[:-1], categories, color='grey', size=8)
 
# Draw ylabels
    ax.set_rlabel_position(0)
    plt.yticks([0.25,0.5,0.75,1], ['0.25','0.5','0.75','1'], color="grey", size=7)
    plt.ylim(0,1)
 
# Ind1
    values=bcd.loc[row].drop('Organization').values.flatten().tolist()
    values += values[:1]
    ax.plot(angles, values, color=color, linewidth=2, linestyle='solid')
    ax.fill(angles, values, color=color, alpha=0.4)
 
# Add a title
    plt.title(title, size=11, color=color, y=1.1)
    plt.subplots_adjust(wspace=0.4)
 
# ------- PART 2: Apply to all individuals
# initialize the figure
my_dpi=96
plt.figure(figsize=(1000/my_dpi, 1000/my_dpi), dpi=my_dpi)
 
# Create a color palette:
my_palette = plt.get_cmap("Set2", len(bcd.index))  # plt.cm.get_cmap was removed in Matplotlib 3.9
 
# Loop to plot
for row in range(0, len(bcd.index)):
    make_spider( row=row, title=bcd['Organization'][row], color=my_palette(row))
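
The radar axes above run from 0 to 1, which suggests the rank columns are normalized. How they are built is not shown in this section, so the following is only one possible construction, assuming a percentile rank in which 1.0 always marks the best performer. The raw values and the `higher_is_better` mapping are hypothetical.

```python
import pandas as pd

# Hypothetical raw KPI values for four cities
raw = pd.DataFrame({
    "co2_per_capita": [4.0, 8.0, 6.0, 2.0],        # lower is better
    "green_energy_pct": [30.0, 10.0, 50.0, 20.0],  # higher is better
}, index=["W", "X", "Y", "Z"])

higher_is_better = {"co2_per_capita": False, "green_energy_pct": True}

# Percentile rank in (0, 1]; the ranking direction is flipped for "lower is better"
# KPIs so that 1.0 always marks the best-performing city on that KPI
scores = pd.DataFrame({
    col: raw[col].rank(pct=True, ascending=higher_is_better[col])
    for col in raw.columns
})
print(scores)
```

With this orientation a larger filled area on the radar chart always means better overall performance, which makes the four panels easier to compare at a glance.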

**Correlation between cars per capita and Public transport per capita**

Another example of how the KPIs presented in this section can be used is to check the correlation between them, where meaningful. For example, somewhat expectedly, we find a negative correlation between the number of cars and the public transport available: the more cars per capita in a city, the less public transport per capita. This is consistent with investment in public transportation reducing the number of cars per person, although a correlation alone does not establish the direction of the effect. 

Correlation between the two KPIs:

In [None]:
C104_KPI['Private cars per capite'].corr(C104_KPI['Municipal fleet per capite'])
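
For readers without access to `C104_KPI`, the same check can be reproduced on a toy table. The figures below are invented for illustration; the Spearman coefficient is computed as a Pearson correlation on ranks, which is its definition and is often a useful robustness check when a few cities have extreme values.

```python
import pandas as pd

# Hypothetical per-capita figures for five cities
fleet = pd.DataFrame({
    "private_cars_per_capita": [0.60, 0.50, 0.45, 0.30, 0.20],
    "public_vehicles_per_capita": [0.001, 0.002, 0.002, 0.004, 0.005],
})

# Pearson correlation (the default of Series.corr), as in the cell above
pearson = fleet["private_cars_per_capita"].corr(fleet["public_vehicles_per_capita"])

# Spearman correlation = Pearson correlation of the ranks
spearman = fleet["private_cars_per_capita"].rank().corr(
    fleet["public_vehicles_per_capita"].rank()
)
print(round(pearson, 3), round(spearman, 3))
```

Both coefficients are negative for this toy data, matching the sign of the relationship described above.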

<a id="section-six"></a>
# Corporate Assessment KPIs

In this section the Corporate Assessment KPIs will be described. The table below summarizes the main KPIs we propose for this section and the tables within the questionnaires they can be applied to:

<html>
<head>
<style>
table, th, td {
  border: 1px solid black;
}
table {
  width: 100%;
} 
</style>
</head>
<body>
   
    
<table>
   
  <tr>
    <th style="text-align:center; width:130px" BGcolor=#ff9999>KPI type</th>
    <th style="text-align:center; width:200px" BGcolor=#ff9999>Formula/methodology</th>
    <th style="text-align:center; width:300px" BGcolor=#ff9999>Interpretation</th>
    <th style="text-align:center; width:100px" BGcolor=#ff9999>Cities</th>
    <th style="text-align:center; width:100px" BGcolor=#ff9999>Corporations</th>
    <th style="text-align:center; width:100px" BGcolor=#ff9999>Water</th>
  </tr>
  <tr>
    <td style="text-align:right" BGcolor=#99ccff><b>Efficiency of purchased energy in terms of CO2 for company or country</b></td>
    <td style="text-align:right" BGcolor=#99ccff>Metric tons of CO2 (Scope 2, market-based) divided by purchased and consumed electricity, including low-carbon sources. Available per company and per country</td>
    <td style="text-align:right" BGcolor=#99ccff>This KPI shows the CO2 efficiency of energy sources; lower CO2 emissions per MWh purchased are desirable. Available at company or country level.</td>
   <td style="text-align:right" BGcolor=#99ccff>-</td>
   <td style="text-align:right" BGcolor=#99ccff>7.5</td>
   <td style="text-align:right" BGcolor=#99ccff>-</td>
  </tr>
  <tr>
   <td style="text-align:right" BGcolor=#99ccff><b>Energy consumption - renewable</b></td>
   <td style="text-align:right" BGcolor=#99ccff>% of renewable energy consumed</td>
   <td style="text-align:right" BGcolor=#99ccff>This KPI allows companies to be ordered by % of renewable energy consumed</td>
   <td style="text-align:right" BGcolor=#99ccff>-</td>
   <td style="text-align:right" BGcolor=#99ccff>8.2a</td>
   <td style="text-align:right" BGcolor=#99ccff>-</td>
  </tr>
  <tr>
   <td style="text-align:right" BGcolor=#99ccff><b>Electricity generated from renewable sources</b></td>
   <td style="text-align:right" BGcolor=#99ccff>Renewable energy generated = Energy from renewable sources divided by total generation of energy
<br>Renewable energy consumed = Energy from renewable sources that the company consumes divided by total energy consumed
<br>Renewable energy sold = Renewable energy produced in excess (gross renewable produced - consumed) divided by total gross generation of energy</td>
   <td style="text-align:right" BGcolor=#99ccff>This KPI allows companies to be ranked by the amount of renewable energy generated, consumed and released into the network, in percentage terms</td>
   <td style="text-align:right" BGcolor=#99ccff>-</td>
   <td style="text-align:right" BGcolor=#99ccff>8.2d</td>
   <td style="text-align:right" BGcolor=#99ccff>-</td>
  </tr>
</table>
</body>
</html>
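
The ratios in the table above can be sketched as follows. The company figures and column names are hypothetical and chosen only to make the arithmetic easy to follow; they are not drawn from the questionnaire data.

```python
import pandas as pd

# Hypothetical company-level figures (MWh unless stated otherwise)
companies = pd.DataFrame({
    "company": ["P", "Q"],
    "scope2_market_tco2": [500.0, 900.0],
    "purchased_electricity_mwh": [2000.0, 1500.0],
    "gross_generation_mwh": [1000.0, 400.0],
    "renewable_generation_mwh": [600.0, 100.0],
    "renewable_consumed_mwh": [450.0, 100.0],
    "total_consumed_mwh": [3000.0, 1900.0],
}).set_index("company")

kpis = pd.DataFrame({
    # Metric tons of Scope 2 (market-based) CO2 per MWh purchased: lower is better
    "tco2_per_mwh_purchased": companies["scope2_market_tco2"]
    / companies["purchased_electricity_mwh"],
    # Share of generation that comes from renewable sources
    "renewable_generated_share": companies["renewable_generation_mwh"]
    / companies["gross_generation_mwh"],
    # Share of consumption covered by renewable sources
    "renewable_consumed_share": companies["renewable_consumed_mwh"]
    / companies["total_consumed_mwh"],
    # Excess renewable generation (produced minus consumed) over gross generation
    "renewable_sold_share": (
        companies["renewable_generation_mwh"] - companies["renewable_consumed_mwh"]
    ) / companies["gross_generation_mwh"],
})
print(kpis.round(3))
```

Company Q consumes all the renewable energy it generates, so its excess (sold) share is zero.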

**Corporations Key performance**

Below is an example of corporation assessment KPIs: energy produced from renewables (the higher the better), excess energy produced from renewables (clean energy produced but not used by the company and fed into the network; the higher the better), percentage consumption of renewable energy (the higher the better) and, finally, market emissions efficiency, where the higher the KPI, the fewer metric tons of CO2 are produced per MWh. 

For the KPIs presented in this section, it is also possible to group by country using the data of the companies located in that country. Note that this grouping is based on the location of the company headquarters, so the visual could be biased: global companies may have operations abroad, and some of their emissions may be generated in another country.

In [None]:
abc1 = company_climate_KPIs_2020[['organization','KPI_rank_electricty_generated_from_renewables_energy_consumption_details','KPI_rank_electricty_generated_from_renewables_sold_energy_consumption_details',
                       'KPI_rank_energy_renewables_energy_consumption','KPI_rank_energy_efficiency_mkt_emissions_targets_by_country']]
abc1=abc1.dropna()
abc1=abc1.rename(columns={'KPI_rank_electricty_generated_from_renewables_energy_consumption_details':'Energy produced from renewables',
                          'KPI_rank_electricty_generated_from_renewables_sold_energy_consumption_details': 'Excess renewable produced',
                       'KPI_rank_energy_renewables_energy_consumption':'Consumption of renewable energy','KPI_rank_energy_efficiency_mkt_emissions_targets_by_country':'Market emissions efficiency'},
                 errors="raise")
bcd1 =abc1[8:12]
bcd1 =bcd1.reset_index()
bcd1=bcd1.drop(['index'], axis=1)
# ------- PART 1: Define a function that draws the plot for one row of the dataset
 
def make_spider( row, title, color):
 
# number of variable
    categories=list(bcd1)[1:]
    N = len(categories)
 
# What will be the angle of each axis in the plot? (we divide the plot / number of variable)
    angles = [n / float(N) * 2 * pi for n in range(N)]
    angles += angles[:1]
 
# Initialise the spider plot
    ax = plt.subplot(2,2,row+1, polar=True, )
 
# If you want the first axis to be on top:
    ax.set_theta_offset(pi / 2)
    ax.set_theta_direction(-1)
 
# Draw one axis per variable and add the labels
    plt.xticks(angles[:-1], categories, color='grey', size=8)
 
# Draw ylabels
    ax.set_rlabel_position(0)
    plt.yticks([0.25,0.5,0.75,1], ['0.25','0.5','0.75','1'], color="grey", size=7)
    plt.ylim(0,1)
 
# Ind1
    values=bcd1.loc[row].drop('organization').values.flatten().tolist()
    values += values[:1]
    ax.plot(angles, values, color=color, linewidth=2, linestyle='solid')
    ax.fill(angles, values, color=color, alpha=0.4)
 
# Add a title
    plt.title(title, size=11, color=color, y=1.1)
    plt.subplots_adjust(wspace=0.75)
 
# ------- PART 2: Apply to all individuals
# initialize the figure
my_dpi=96
plt.figure(figsize=(1000/my_dpi, 1000/my_dpi), dpi=my_dpi)
 
# Create a color palette:
my_palette = plt.get_cmap("Set2", len(bcd1.index))  # plt.cm.get_cmap was removed in Matplotlib 3.9
 
# Loop to plot
for row in range(0, len(bcd1.index)):
    make_spider( row=row, title=bcd1['organization'][row], color=my_palette(row))

**Energy efficiency in terms of CO2**

Additionally, we map the CO2 efficiency reported by companies answering Table 7.5: the greener the color on the map, the more energy efficient the country, where efficiency is expressed as metric tons of CO2 (Scope 2, market-based) divided by purchased and consumed electricity, including low-carbon sources. 

In [None]:
import plotly.express as px

fig = px.choropleth(country_climate_KPIs_2020, locations="C7.5_C1Country/Region",
                    color="KPI_rank_energy_efficiency_loc", # KPI rank that drives the colour scale
                    labels={
                     "C7.5_C1Country/Region": "Countries",
                     "KPI_rank_energy_efficiency_loc": "Energy efficiency"},
                    hover_name="C7.5_C1Country/Region", # column to add to hover information
                    locationmode='country names',
                    color_continuous_scale=px.colors.sequential.BuGn)

fig.update_layout(
    title_text="Companies' energy efficiency by country"
)

fig.show()
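
One way such a country-level figure could be aggregated is as a ratio of sums, so that large electricity consumers weigh more than small ones. This is an assumption made for illustration; the notebook's own aggregation is not shown here, and the data and column names below are hypothetical.

```python
import pandas as pd

# Hypothetical companies, each attributed to one country
companies = pd.DataFrame({
    "country": ["A", "A", "B"],
    "scope2_market_tco2": [100.0, 900.0, 300.0],
    "purchased_electricity_mwh": [1000.0, 1000.0, 1500.0],
})

# Sum emissions and electricity per country, then take the ratio of the sums
by_country = companies.groupby("country")[
    ["scope2_market_tco2", "purchased_electricity_mwh"]
].sum()
by_country["tco2_per_mwh"] = (
    by_country["scope2_market_tco2"] / by_country["purchased_electricity_mwh"]
)
print(by_country["tco2_per_mwh"])
```

Averaging the company-level ratios instead would give every company equal weight regardless of size, which may or may not be the desired reading of the map.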

<a id="section-seven"></a>
# Water Management Assessment KPIs

In this section the Water Management Assessment KPIs will be described. The table below summarizes the main KPIs we propose for this section and the tables within the questionnaires they can be applied to:

<html>
<head>
<style>
table, th, td {
  border: 1px solid black;
}
table {
  width: 100%;
} 
</style>
</head>
<body>
   
    
<table>
   
  <tr>
    <th style="text-align:center; width:130px" BGcolor=#ff9999>KPI type</th>
    <th style="text-align:center; width:200px" BGcolor=#ff9999>Formula/methodology</th>
    <th style="text-align:center; width:300px" BGcolor=#ff9999>Interpretation</th>
    <th style="text-align:center; width:100px" BGcolor=#ff9999>Cities</th>
    <th style="text-align:center; width:100px" BGcolor=#ff9999>Corporations</th>
    <th style="text-align:center; width:100px" BGcolor=#ff9999>Water</th>
  </tr>
  <tr>
    <td style="text-align:right" BGcolor=#ffccff><b>Overall importance of water</b></td>
    <td style="text-align:right" BGcolor=#ffccff>Overall importance = Direct water importance + Indirect water importance both for fresh and recycled water </td>
    <td style="text-align:right" BGcolor=#ffccff>This KPI looks at the company's questionnaire answers and rates the importance of water for the company. A ranking of the companies is provided.</td>
   <td style="text-align:right" BGcolor=#ffccff>-</td>
   <td style="text-align:right" BGcolor=#ffccff>-</td>
   <td style="text-align:right" BGcolor=#ffccff>1.1</td>
  </tr>
  <tr>
   <td style="text-align:right" BGcolor=#ffccff><b>Water movement </b></td>
   <td style="text-align:right" BGcolor=#ffccff>Water movement = water withdrawn + water consumed + water discharged
<br>This is also available by location/facility/company by using Table 5.1</td>
   <td style="text-align:right" BGcolor=#ffccff>This KPI shows the amount of water handled by the company, measured as the total quantity withdrawn, consumed and discharged. A ranking of the companies by higher/lower movements is produced.</td>
   <td style="text-align:right" BGcolor=#ffccff>-</td>
   <td style="text-align:right" BGcolor=#ffccff>-</td>
   <td style="text-align:right" BGcolor=#ffccff>1.2d/ 5.1</td>
  </tr>

</table>
</body>
</html>
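
The Overall importance formula in the table above adds up ratings, which requires turning the questionnaire's categorical answers into numbers. The sketch below shows one way this could be done; both the category labels and the numeric weights are assumptions made for illustration and may differ from the scoring used in this project.

```python
# Hypothetical scoring of the importance ratings; both the category labels and the
# numeric weights are assumptions made for this sketch
IMPORTANCE_SCORE = {
    "Vital": 4,
    "Important": 3,
    "Neutral": 2,
    "Not very important": 1,
    "Not important at all": 0,
}


def overall_importance(answers: dict) -> int:
    """Sum direct and indirect importance scores across fresh and recycled water."""
    return sum(IMPORTANCE_SCORE[rating] for rating in answers.values())


# Hypothetical answers from one company
company_answers = {
    "fresh_direct": "Vital",
    "fresh_indirect": "Important",
    "recycled_direct": "Neutral",
    "recycled_indirect": "Not very important",
}
print(overall_importance(company_answers))  # 10
```

Companies can then be ranked on this total in the same way as the other KPIs.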

**Water management - Key performance**

Below we show some of the KPIs produced to assess Water management, built from the water questionnaire. 

In detail we show: Water Intensity (which accounts for the company's overall water usage), Water Withdrawals, Water Consumption, Water Discharges and Water Importance. These KPIs are also produced by country, based on the facility location. 

In [None]:
abc2=company_water_KPIs_2020[['organization','KPI_rank_water_use_total_movement_quantity_water_use','KPI_rank_water_use_withdrawals_quantity_water_use','KPI_rank_water_use_discharges_quantity_water_use',
                        'KPI_rank_water_use_consumption_quantity_water_use','KPI_rank_use_importance_water_importance']]
abc2=abc2.dropna()
abc2=abc2.rename(columns={'KPI_rank_water_use_total_movement_quantity_water_use':'Water Intensity','KPI_rank_water_use_withdrawals_quantity_water_use':'Water Withdrawals',
                          'KPI_rank_water_use_discharges_quantity_water_use':'Water Discharges',
                        'KPI_rank_water_use_consumption_quantity_water_use':'Water Consumption','KPI_rank_use_importance_water_importance':'Water Importance'},errors='raise')
bcd2 =abc2[8:12]
bcd2 =bcd2.reset_index()
bcd2=bcd2.drop(['index'], axis=1)
# ------- PART 1: Define a function that draws the plot for one row of the dataset
 
def make_spider( row, title, color):
 
# number of variable
    categories=list(bcd2)[1:]
    N = len(categories)
 
# What will be the angle of each axis in the plot? (we divide the plot / number of variable)
    angles = [n / float(N) * 2 * pi for n in range(N)]
    angles += angles[:1]
 
# Initialise the spider plot
    ax = plt.subplot(2,2,row+1, polar=True, )
 
# If you want the first axis to be on top:
    ax.set_theta_offset(pi / 2)
    ax.set_theta_direction(-1)
 
# Draw one axis per variable and add the labels
    plt.xticks(angles[:-1], categories, color='grey', size=8)
 
# Draw ylabels
    ax.set_rlabel_position(0)
    plt.yticks([0.25,0.5,0.75,1], ['0.25','0.5','0.75','1'], color="grey", size=7)
    plt.ylim(0,1)
 
# Ind1
    values=bcd2.loc[row].drop('organization').values.flatten().tolist()
    values += values[:1]
    ax.plot(angles, values, color=color, linewidth=2, linestyle='solid')
    ax.fill(angles, values, color=color, alpha=0.4)
 
# Add a title
    plt.title(title, size=11, color=color, y=1.1)
    plt.subplots_adjust(wspace=0.6)
 
# ------- PART 2: Apply to all individuals
# initialize the figure
my_dpi=96
plt.figure(figsize=(1000/my_dpi, 1000/my_dpi), dpi=my_dpi)
 
# Create a color palette:
my_palette = plt.get_cmap("Set2", len(bcd2.index))  # plt.cm.get_cmap was removed in Matplotlib 3.9
 
# Loop to plot
for row in range(0, len(bcd2.index)):
    make_spider( row=row, title=bcd2['organization'][row], color=my_palette(row))

**Water Movements**

Finally, total water movements by country are captured from Table 5.1 of the Water questionnaire, where water movement is defined as the sum of water withdrawal, consumption and discharge. 

This map can be read together with the water results shown previously. For example, the USA has a high water risk (shown in the Risk KPI section) combined with an upward water movement trend (shown in the Trends KPI section). This country, as well as others, could therefore be affected by water-related issues in the future and should make efforts to mitigate this risk.

In [None]:
### Companies' water uses
fig = px.choropleth(country_water_KPIs_2020, locations="country_or_region",
                    color="KPI_rank_water_total_use_water_use_by_facility", # KPI rank that drives the colour scale
                    labels={
                     "country_or_region": "Countries",
                     "KPI_rank_water_total_use_water_use_by_facility": "Total water movements"},
                    hover_name="country_or_region", # column to add to hover information
                    locationmode='country names',
                    color_continuous_scale=px.colors.sequential.PuBu)

fig.update_layout(
    title_text="Companies' total water movements, by country"
)

fig.show()

# Conclusion

We have produced a large number of possible KPIs, which can be found in our code and summary tables for this project, and have presented in this notebook the ones we consider most meaningful. 

We think our KPIs can help CDP build a meaningful monitoring framework of key indicators showing how well a company or city is doing. Additionally, we have shown that the KPIs can be combined with other data sources to obtain further insights.

We have enjoyed working on this project and we have learned a lot from it. We hope our results will be interesting for the CDP and the Kaggle community. 