# About this Notebook

### In this notebook, I propose a new KPI (Key Performance Indicator) system, "FRoG" (Functional Regressive optimized Goals of CO2 reduction). It measures the performance of corporations and cities from two aspects, social and environmental, and uses quantitative measurements to find the most effective target for each city and company.


# Agenda
## ・ Data Cleaning and EDA
## ・ Explanations of KPI (FRoG)
## ・ Implementation Example of FRoG
## ・ Comparison to past KPIs
## ・ Conclusion
## ・ Reference

# Data Cleaning & EDA (Exploratory Data Analysis)

### First, I'll inspect and clean the current tables to make the most of the existing data. Some corporation/city responses contain NaN (Not a Number), and we need to separate numerical responses from text responses in order to measure quantitatively how the social and environmental impact of cities and corporations has changed in recent years.
![](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F1314380%2F6f0f4d334e5b094bfcf002c4d2e931f6%2FCDP_dataset.png?generation=1603468553539656&alt=media)
### Apart from the supplementary data, the data mainly consists of Corporations/Cities Disclosure, Questionnaires, and Responses. First, let's look at the current situation of both corporations and cities in these data.

# Response rate

In [None]:

import warnings

warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt

data = {}

for year in range(2018,2021):
    
    data[year] = []
    
    # Corporations ---Climate Change
    
    data[year].append(pd.read_csv(f'../input/cdp-unlocking-climate-solutions/Corporations/Corporations Disclosing/Climate Change/{year}_Corporates_Disclosing_to_CDP_Climate_Change.csv'))
    data[year].append(pd.read_csv(f'../input/cdp-unlocking-climate-solutions/Corporations/Corporations Responses/Climate Change/{year}_Full_Climate_Change_Dataset.csv'))
    
    # Corporations ---Water Security
    
    data[year].append(pd.read_csv(f'../input/cdp-unlocking-climate-solutions/Corporations/Corporations Disclosing/Water Security/{year}_Corporates_Disclosing_to_CDP_Water_Security.csv'))
    data[year].append(pd.read_csv(f'../input/cdp-unlocking-climate-solutions/Corporations/Corporations Responses/Water Security/{year}_Full_Water_Security_Dataset.csv'))
    
    # Cities
    
    data[year].append(pd.read_csv(f'../input/cdp-unlocking-climate-solutions/Cities/Cities Disclosing/{year}_Cities_Disclosing_to_CDP.csv'))
    data[year].append(pd.read_csv(f'../input/cdp-unlocking-climate-solutions/Cities/Cities Responses/{year}_Full_Cities_Dataset.csv'))
    
    

answering_rate = {'Corporations Climate Change':[], 'Corporations Water Security':[], 'Cities':[]}

for year in range(2018,2021):
    
    print('year', year)
    print()

    # Corporations ---Climate Change

    Corporations_disclosing_Climate = data[year][0]
    Corporations_response_Climate = data[year][1]

    # Corporations ---Water Security

    Corporations_disclosing_Water = data[year][2]
    Corporations_response_Water = data[year][3]

    # Cities

    Cities_disclosing = data[year][4]
    Cities_response = data[year][5]
    
    # Unique accounts that responded vs. unique accounts listed as disclosing
    n_answered_account = []
    n_total_account = []
    
    n_answered_account.append(len(np.unique(Corporations_response_Climate['account_number'].values)))
    n_total_account.append(len(np.unique(Corporations_disclosing_Climate['account_number'].values)))
    
    n_answered_account.append(len(np.unique(Corporations_response_Water['account_number'].values)))
    n_total_account.append(len(np.unique(Corporations_disclosing_Water['account_number'].values)))
    
    n_answered_account.append(len(np.unique(Cities_response['Account Number'].values)))
    n_total_account.append(len(np.unique(Cities_disclosing['Account Number'].values)))
    
    
    # Iterate over the dict keys so the labels always match `answering_rate`
    for n, name in enumerate(answering_rate):
        
        answering_rate[name].append(100*n_answered_account[n]/n_total_account[n])
        print(n_answered_account[n], '/', n_total_account[n], 'accounts of', name, 'responded')
    
    print()
    

fig = plt.figure(figsize=(9,6))

for name in answering_rate.keys():
    
    plt.plot(np.array(list(range(2018,2021))).astype(str), answering_rate[name], label=name)
    
plt.legend()
plt.xlabel('year')
plt.ylabel('%')
plt.show()

### We can see that the response rate of corporations is 100% in every year, while the response rate of cities increased each year and reached 100% in 2020.
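
The response rate plotted above is the share of disclosing accounts that appear at least once in the responses table. A minimal sketch on toy data follows; the `response_rate` helper and the sample frames are illustrative assumptions, not part of the CDP dataset. Unlike the cell above, it intersects the two account sets so an account that is missing from the disclosing table cannot push the rate above 100%.

```python
import pandas as pd

def response_rate(disclosing, responses, key):
    """Percentage of disclosing accounts with at least one response row."""
    disclosed = set(disclosing[key].dropna())
    answered = set(responses[key].dropna()) & disclosed
    return 100 * len(answered) / len(disclosed)

# Toy frames standing in for one year's "Disclosing" and "Responses" tables
toy_disclosing = pd.DataFrame({'account_number': [1, 2, 3, 4]})
toy_responses = pd.DataFrame({'account_number': [1, 1, 2, 4, 4]})

rate = response_rate(toy_disclosing, toy_responses, 'account_number')
print(f'{rate:.1f}% of accounts responded')
```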
### Next, I'll build a clean dataframe for each of Climate Change, Water Security, and City. To compare quantitatively how the impact of organizations has changed, each clean dataframe keeps only the organizations that appear in all of 2018-2020.
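
Keeping only organizations present in every year comes down to a set intersection over the yearly account numbers, followed by one row per (account, year). The sketch below uses made-up account numbers; `toy_responses`, `common_accounts`, and `clean` are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy response tables keyed by year (values are illustrative only)
toy_responses = {
    2018: pd.DataFrame({'account_number': [10, 11, 12]}),
    2019: pd.DataFrame({'account_number': [11, 12, 13]}),
    2020: pd.DataFrame({'account_number': [12, 11, 14]}),
}

# Accounts that appear in every year
common_accounts = sorted(
    set.intersection(*(set(df['account_number']) for df in toy_responses.values()))
)

# One row per (account, year): accounts repeat in sorted order within each year
clean = pd.DataFrame(
    [(an, year) for year in sorted(toy_responses) for an in common_accounts],
    columns=['account_number', 'year'],
)
print(common_accounts)
print(clean)
```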

# Climate Change

In [None]:

sets = {}
sets_qname = {}

print('Climate Change')

print()

print('number of questions')

print()

for year in range(2018, 2021):
    
    #Corporations ---Climate Change

    Corporations_disclosing_Climate = data[year][0]
    Corporations_response_Climate = data[year][1]
    
    vc = Corporations_response_Climate.account_number.value_counts()
    
    longest = 0
    
    a = []
    for i in range(len(vc)):
        an = vc.index[i]
        v = len(np.unique(Corporations_response_Climate[Corporations_response_Climate.account_number==an].question_number.value_counts().index))
        a.append(v)
        
        if v>longest:
            sets[year] = set(Corporations_response_Climate[Corporations_response_Climate.account_number==an].question_number.values)
            sets_qname[year] = set(Corporations_response_Climate[Corporations_response_Climate.account_number==an].question_unique_reference.values)
            longest = v
        
    print(year, max(a))
    
duplicate = sets[2018]&sets[2019]&sets[2020]
duplicate_questions = sets_qname[2018]&sets_qname[2019]&sets_qname[2020]

print()
print(len(duplicate_questions), ' questions are included for all 2018-2020.')

print()
print('number of accounts')
print()

from tqdm import tqdm

sets_an = []
for year in range(2018, 2021):
    
    #Corporations ---Climate Change
    
    Corporations_disclosing_Climate = data[year][0]
    Corporations_response_Climate = data[year][1]
    
    print(year, Corporations_response_Climate.account_number.value_counts().shape[0])
    
    sets_an.append(set(Corporations_response_Climate.account_number.value_counts().index.values))
    
print()
print(len(sets_an[0]&sets_an[1]&sets_an[2]), 'of account_number are contained for all 2018-2020')

print()
print('unique questions')
print()

from textblob import TextBlob

for n, q in enumerate(np.unique(list(duplicate_questions))):
    
    blob = TextBlob(q)
    print('Question', n, q)
    #print(blob.translate(to='ja'))
    
    #print(Corporations_response_Climate[Corporations_response_Climate['question_unique_reference']==q].response_value.value_counts().shape[0])
    

df_climate = pd.DataFrame()
df_climate['account_number'] = list(np.sort(list(sets_an[0]&sets_an[1]&sets_an[2])))*3

# CO2 per revenue

q = 'Describe your gross global combined Scope 1 and 2 emissions for the reporting year in metric tons CO2e per unit currency total revenue and provide any additional intensity metrics that are appropriate to your business operations.'
value_q = []

for year in range(2018,2021):
    
    #Corporations ---Climate Change
    
    Corporations_disclosing_Climate = data[year][0]
    Corporations_response_Climate = data[year][1]
    
    #print(year)
    for an in list(np.sort(list(sets_an[0]&sets_an[1]&sets_an[2]))):
        if year!=2020:
            value_q.append(float(Corporations_response_Climate[(Corporations_response_Climate['question_unique_reference']==q)&(Corporations_response_Climate['account_number']==an)&(Corporations_response_Climate['table_columns_unique_reference']=='C6.10_c2-Metric numerator (Gross global combined Scope 1 and 2 emissions)')]['response_value'].values[0]))
        else:
            value_q.append(float(Corporations_response_Climate[(Corporations_response_Climate['question_unique_reference']==q)&(Corporations_response_Climate['account_number']==an)&(Corporations_response_Climate['table_columns_unique_reference']=='C6.10_c2-Metric numerator (Gross global combined Scope 1 and 2 emissions, metric tons CO2e)')]['response_value'].values[0]))
        
df_climate['CO2/revenue'] = value_q

# Scope 1: CO2 when fuels are burned

q = 'What were your organizationâ€™s gross global Scope 1 emissions in metric tons CO2e?'
value_q = []
value_year = []

for year in range(2018,2021):
    
    #Corporations ---Climate Change
    
    Corporations_disclosing_Climate = data[year][0]
    Corporations_response_Climate = data[year][1]
    
    #print(year)
    for an in list(np.sort(list(sets_an[0]&sets_an[1]&sets_an[2]))):
        value_q.append(float(Corporations_response_Climate[(Corporations_response_Climate['question_unique_reference']==q)&(Corporations_response_Climate['account_number']==an)&(Corporations_response_Climate['table_columns_unique_reference']=='C6.1_c1-Gross global Scope 1 emissions (metric tons CO2e)')]['response_value'].values[0]))
        value_year.append(year)
        
df_climate['year'] = value_year
df_climate['CO2 emissions Scope1 [tons]'] = value_q

#  Scope 2: CO2 when electricities are used

q = 'What were your organizationâ€™s gross global Scope 2 emissions in metric tons CO2e?'

value_q = []
value_year = []

for year in range(2018,2021):
    #Corporations ---Climate Change
    Corporations_disclosing_Climate = data[year][0]
    Corporations_response_Climate = data[year][1]
    #print(year)
    for an in list(np.sort(list(sets_an[0]&sets_an[1]&sets_an[2]))):
        
        dd = Corporations_response_Climate[(Corporations_response_Climate['question_unique_reference']==q)&(Corporations_response_Climate['account_number']==an)].table_columns_unique_reference.values
        if 'C6.3_c1-Scope 2, location-based' in dd:
            value_q.append(float(Corporations_response_Climate[(Corporations_response_Climate['question_unique_reference']==q)&(Corporations_response_Climate['account_number']==an)&(Corporations_response_Climate['table_columns_unique_reference']=='C6.3_c1-Scope 2, location-based')]['response_value'].values[0]))
        else:
            value_q.append(float(np.nan))
        
df_climate['CO2 emissions Scope2 [tons]'] = value_q

df_climate['total CO2 [tons]'] = df_climate['CO2 emissions Scope1 [tons]'].values + df_climate["CO2 emissions Scope2 [tons]"].values

q = 'Select the currency used for all financial information disclosed throughout your response.'

value_q = []
value_year = []

for year in range(2018,2021):
    #Corporations ---Climate Change
    Corporations_disclosing_Climate = data[year][0]
    Corporations_response_Climate = data[year][1]
    #print(year)
    for an in list(np.sort(list(sets_an[0]&sets_an[1]&sets_an[2]))):
        
        dd = Corporations_response_Climate[(Corporations_response_Climate['question_unique_reference']==q)&(Corporations_response_Climate['account_number']==an)].table_columns_unique_reference.values
        value_q.append(Corporations_response_Climate[(Corporations_response_Climate['question_unique_reference']==q)&(Corporations_response_Climate['account_number']==an)]['response_value'].values[0])
        
        
df_climate['currency'] = value_q

# calculate revenue [USD] using the 2018-2020 mean exchange rate of each currency to USD
converter = {'USD':1,
             'CAD':0.756111,
             'CNY':0.147014,
             'GBP':1.296685,
             'JPY':0.009194,
             'TWD':0.033720,
             np.nan:np.nan}
df_climate['currency2USD'] = df_climate.currency.apply(lambda x: converter[x]).values

cols = [
         'organization','country','authority_types','activities','sectors',
         'industries','primary_activity','primary_sector','primary_industry',
         'primary_questionnaire_sector','tickers'
       ]
disclosing = []
for an in df_climate.account_number.values:
    disclosing.append(Corporations_disclosing_Climate[Corporations_disclosing_Climate.account_number==an][cols].values)
disclosing = np.concatenate(disclosing,0)
for col in cols:
    df_climate[col] = 0
    
df_climate.loc[:,cols] = disclosing

# Add Annual Financial Data For Hybrid Metrics from Takahiro Kubo (https://www.kaggle.com/takahirokubo0) to bring in economic information

fin = pd.read_csv('../input/annual-financial-data-for-hybrid-cdp-kpi/cdp_financial_data.csv')
financial_df = []
idxs = []
df_climate_tickers = df_climate.tickers.apply(lambda x: str(x).split()[0].replace('nan','')).values
for i in range(len(df_climate)):
    an, year = df_climate.loc[i,['account_number','year']].values
    ticker = df_climate_tickers[i]
    financial_ticker_year = fin[(fin.Ticker.values==ticker)&(fin['Fiscal Year'].values==year)]
    if len(financial_ticker_year)==1:
        idxs.append(i)
        financial_df.append(financial_ticker_year)
financial_df = pd.concat(financial_df, axis=0)  # axis must be passed by keyword in recent pandas
fin_cols = ['Ticker','Revenue','Cost of Revenue','Operating Income (Loss)','Operating Expenses','Depreciation & Amortization','EBITDA']
for col in fin_cols:
    df_climate[col] = np.nan
    
df_climate.loc[np.array(idxs),fin_cols] = financial_df[fin_cols].values

cols = []
for col in df_climate.columns:
    if col=='revenue':
        cols.append('revenue[USD]')
    else:
        cols.append(col)
df_climate.columns = cols

print('Data cleaning for climate-change completed.')


# Water Security

In [None]:

print('Water Security')
print()

print('number of questions')
print()

sets = {}

for year in range(2018, 2021):
    
    # Corporations ---Water Security
    
    Corporations_disclosing_Water = data[year][2]
    Corporations_response_Water = data[year][3]
    
    vc = Corporations_response_Water.account_number.value_counts()
    
    longest = 0
    
    a = []
    for i in range(len(vc)):
        an = vc.index[i]
        v = len(np.unique(Corporations_response_Water[Corporations_response_Water.account_number==an].question_number.value_counts().index))
        a.append(v)
        
        if v>longest:
            sets[year] = Corporations_response_Water[Corporations_response_Water.account_number==an].question_number.values
            longest = v
    
    print(year, max(a))
    
print()
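# len(sets) would only count the years; intersect the per-year question numbers instead
print(len(set(sets[2018])&set(sets[2019])&set(sets[2020])), 'unique questions are included for all 2018-2020.')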
print()

print('number of accounts')
print()

sets_an = []

for year in range(2018, 2021):
    
    #Corporations ---Water Security
    
    Corporations_disclosing_Water = data[year][2]
    Corporations_response_Water = data[year][3]
    
    print(year, Corporations_response_Water.account_number.value_counts().shape[0])
    
    sets_an.append(set(Corporations_response_Water.account_number.value_counts().index.values))
    
print()
print(len(sets_an[0]&sets_an[1]&sets_an[2]), 'of account_number are contained for all 2018-2020')
print()
print('unique questions')

# `duplicate_questions` was built from the Climate Change responses above, so the
# shared questions are recomputed here from the Water Security responses.
# Assumption: the Water Security files also have a `question_unique_reference` column.
water_questions = set.intersection(*[set(data[year][3]['question_unique_reference'].dropna().values) for year in range(2018, 2021)])

for n, q in enumerate(np.unique(list(water_questions))):
    
    print('Question', n, q)
    
    

    
    

# City

In [None]:
sets = {}
sets_qname = {}

print('City')
print()

print('number of questions')
print()

for year in range(2018, 2021):
    
    # Cities

    Corporations_disclosing_City = data[year][4]
    Corporations_response_City = data[year][5]
    
    vc = Corporations_response_City['Account Number'].value_counts()
    
    longest = 0
    
    a = []
    for i in range(len(vc)):
        an = vc.index[i]
        v = len(np.unique(Corporations_response_City[Corporations_response_City['Account Number']==an]['Question Number'].value_counts().index))
        a.append(v)
        
        if v>longest:
            sets[year] = set(Corporations_response_City[Corporations_response_City['Account Number']==an]['Question Number'].values)
            sets_qname[year] = set(Corporations_response_City[Corporations_response_City['Account Number']==an]['Question Name'].values)
            longest = v
        
    print(year, max(a))
    
duplicate = sets[2018]&sets[2019]&sets[2020]
duplicate_questions = sets_qname[2018]&sets_qname[2019]&sets_qname[2020]

print()
print(len(duplicate_questions), 'unique questions are included for all 2018-2020.')
print()

print('number of accounts')
print()

sets_an = []
for year in range(2018, 2021):
    
    # Cities
    
    Corporations_disclosing_City = data[year][4]
    Corporations_response_City = data[year][5]
    
    print(year, Corporations_response_City['Account Number'].value_counts().shape[0])
    
    sets_an.append(set(Corporations_response_City['Account Number'].value_counts().index.values))
    
print(len(sets_an[0]&sets_an[1]&sets_an[2]), 'of account_number are contained for all 2018-2020')
print()

print('unique questions')
print()

#from textblob import TextBlob

for n, q in enumerate(np.unique(list(duplicate_questions))):
    
    #blob = TextBlob(q)
    print('Question', n, q)
    #print(blob.translate(to='ja'))
    
    #print(Corporations_response_City[Corporations_response_City['Question Name']==q]['Response Answer'].value_counts().shape[0])
    
df_city = pd.DataFrame()
df_city['Account Number'] = list(np.sort(list(sets_an[0]&sets_an[1]&sets_an[2])))*3
df_city['year'] = [2018]*len(sets_an[0]&sets_an[1]&sets_an[2])+[2019]*len(sets_an[0]&sets_an[1]&sets_an[2])+[2020]*len(sets_an[0]&sets_an[1]&sets_an[2])

# city's KPI and goals

q = 'Please describe the main goals of your city’s adaptation efforts and the metrics / KPIs for each goal.'
value_q = []
for year in range(2018,2021):
    # Cities
    Corporations_disclosing_City = data[year][4]
    Corporations_response_City = data[year][5]
    #print(year)
    for an in list(np.sort(list(sets_an[0]&sets_an[1]&sets_an[2]))):
        try:
            value_q.append(Corporations_response_City[(Corporations_response_City['Question Name']==q)&(Corporations_response_City['Account Number']==an)]['Response Answer'].values[0])
        except IndexError:  # no response row for this account/question
            value_q.append(float(np.nan))
df_city['KPI'] = value_q

cols = ['Organization','City','Country','CDP Region','Population','City Location']

disclosing = []
for an in df_city['Account Number'].values:
    disclosing.append(Corporations_disclosing_City[Corporations_disclosing_City['Account Number']==an][cols].values)
disclosing = np.concatenate(disclosing,0)

for col in cols:
    df_city[col] = 0
    
df_city.loc[:,cols] = disclosing

print('Data cleaning for city completed.')
    

### Below is a plot of average CO2 emissions from two viewpoints (primary industry vs. emissions, revenue vs. emissions). The distribution of CO2 emissions varies with a corporation's revenue, industry, and many other factors, which makes quantitative assessment difficult. FRoG addresses this problem by taking these features into account.

In [None]:
fig = plt.figure(figsize=(16,6))
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)
# Select the CO2 column before averaging so non-numeric columns are not aggregated
co2_by_industry = df_climate.groupby('primary_industry')['total CO2 [tons]'].mean()
co2_by_revenue = df_climate.groupby('Revenue')['total CO2 [tons]'].mean()
ax1.scatter(co2_by_industry.index, co2_by_industry.values)
ax1.tick_params(axis='x', labelrotation=-40)  # rotate the industry labels
ax1.set_xlabel('primary_industry')
ax1.set_ylabel('total CO2 [tons]')
ax2.scatter(co2_by_revenue.index, co2_by_revenue.values)
ax2.set_xlabel('Revenue')
ax2.set_ylabel('total CO2 [tons]')
plt.show()

# Explanations of KPI

### To measure how large the impact of city and corporate ambitions is, I propose a new KPI system,
## "FRoG", Functional Regressive optimized Goals (of CO2-reduction)

## Overview of FRoG

In [None]:
import cv2
im = cv2.imread('../input/frog-diagram/FRoG.jpg')
im = cv2.cvtColor(im, cv2.COLOR_BGR2RGB)
fig = plt.figure(figsize=(12,12))
plt.imshow(im)
plt.axis("off")
plt.show()

### FRoG estimates the underlying excess ("wasted") CO2 of each company and shows how much room corporations have to reduce CO2 emissions, by using regression. Functional regression (for example linear regression, as in the image below) gives a stable, statistical estimate of the CO2 value expected for a corporation's conditions, and by adding a population-based penalty we can assess corporations and cities from two aspects, environmental and social.

![](http://i.imgur.com/DT4H1Yk.jpg)
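
The idea can be written compactly. The notation here is introduced only for illustration and does not come from the original description: let $x_i$ be the features of organization $i$ (for example revenue and industry), $y_i$ its reported CO2 emissions, and $f$ the fitted regression function.

$$
\hat{y}_i = f(x_i), \qquad w_i = y_i - \hat{y}_i
$$

Here $\hat{y}_i$ plays the role of the FRoG value. A positive gap $w_i > 0$ means the organization emits more than the regression expects for its conditions ("wasted" CO2), and $w_i < 0$ means it emits less ("saved" CO2). For linear regression, $f(x) = \beta_0 + \beta^\top x$ with coefficients fitted by least squares.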

### Merits of using a regression function for FRoG
### FRoG uses functional regression, a technique from machine learning. The regression function captures the statistical distribution and importance of the input values, so the expected value (the FRoG value) can be estimated more precisely. The regression function can also be another machine-learning model (RandomForest, gradient-boosted decision trees, a neural network, etc.); a better estimate lets us evaluate potential wasted CO2 more precisely. I chose linear regression here because it is widely used in both ML and statistics.
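
As a rough illustration of swapping the regression function, the sketch below fits both a linear model and a random forest from scikit-learn on synthetic data and compares the gap between actual and expected values. The data, feature meaning, and hyperparameters are assumptions chosen for the example; none of it comes from the CDP tables.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Synthetic stand-in for (revenue, industry signal) -> CO2; not CDP data
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] + 5.0 * (X[:, 1] > 5) + rng.normal(0, 0.1, size=200)

models = {
    'linear': LinearRegression(),
    'random_forest': RandomForestRegressor(n_estimators=50, random_state=0),
}

expected = {}
for name, model in models.items():
    model.fit(X[:150], y[:150])               # train on the first 150 rows
    expected[name] = model.predict(X[150:])   # expected value for held-out rows

for name, pred in expected.items():
    gap = y[150:] - pred                      # positive = above the estimate
    print(name, 'mean absolute gap:', np.abs(gap).mean())
```

Because both models expose the same `fit`/`predict` interface, the rest of the pipeline does not need to change when the regressor is replaced.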

# Implementation Example of FRoG

## FRoG without social-dimension penalty
### First, encode the label data (for example, data['primary_industry']=='Services') into numerical values. This time I use the simple columns ['primary_industry','Revenue'] as inputs and ['total CO2 [tons]'] as the target to train and define FRoG (2018 data for training, 2019 data for evaluation). Other columns can be added for a more precise fit, and any function that regresses the target can be used; linear regression is used here for simplicity.
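
One possible alternative to a hand-written industry-to-index mapping is `pandas.get_dummies`, which builds the indicator columns directly. The sketch below runs on a toy frame with made-up values; the column names mirror the ones used in this notebook, but the data are assumptions.

```python
import pandas as pd

# Toy rows with the same column names as the cleaned climate dataframe
toy = pd.DataFrame({
    'year': [2018, 2018, 2019, 2019],
    'primary_industry': ['Services', 'Retail', 'Services', 'Materials'],
    'Revenue': [1.0e9, 2.5e9, 1.1e9, 4.0e8],
    'total CO2 [tons]': [1.2e4, 3.0e4, 1.1e4, 9.0e4],
})

# One indicator column per industry; dtype=float keeps the matrix numeric
encoded = pd.get_dummies(
    toy, columns=['primary_industry'], prefix='', prefix_sep='', dtype=float
)

input_cols = [c for c in encoded.columns if c not in ('year', 'total CO2 [tons]')]
train_X = encoded.loc[encoded.year == 2018, input_cols]
eval_X = encoded.loc[encoded.year == 2019, input_cols]
print(input_cols)
```

Encoding all years in one call before splitting keeps the training and evaluation matrices on the same set of columns, even when an industry appears in only one of the years.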

In [None]:
converter = {'Services': 0,
 'Manufacturing': 1,
 'Materials': 2,
 'Retail': 3,
 'Food, beverage & agriculture': 4,
 'Biotech, health care & pharma': 5,
 'Transportation services': 6,
 'Infrastructure': 7,
 'Fossil Fuels': 8,
 'Power generation': 9,
 'Hospitality': 10,
 'Apparel': 11,
 'International bodies': 12}
df_encode = df_climate[df_climate[['primary_industry','Revenue']+['total CO2 [tons]']].isna().sum(1)==0][['year']+['primary_industry','Revenue']+['total CO2 [tons]']]
onehot = np.eye(len(converter))[df_encode.primary_industry.apply(lambda x: converter[x]).values]
for c in converter:
    df_encode[c] = 0
df_encode.loc[:,list(converter.keys())] = onehot

input_cols = ['Revenue']+list(converter.keys())
target_cols = ['total CO2 [tons]']

train_X = df_encode[df_encode.year==2018][input_cols]
train_y = df_encode[df_encode.year==2018][target_cols]
eval_X = df_encode[df_encode.year==2019][input_cols]
eval_y = df_encode[df_encode.year==2019][target_cols]

### Next, let's fit the FRoG function

In [None]:
from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit(train_X, train_y)
eval_pred = clf.predict(eval_X)

## Result
### Raw FRoG scores are below

In [None]:
fig = plt.figure(figsize=(9,6))
plt.plot(eval_y.head(69)['total CO2 [tons]'].values,label='real value')
plt.plot(eval_pred[:69],label='FRoG value')
plt.legend()
plt.show()

### Based on the functional regression, we can see that some companies emit more CO2 than their FRoG value (blue line > orange line), while others emit less (orange line > blue line). Some specific examples are shown below.
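
The comparison between the real value and the FRoG value can also be tabulated. This is a minimal sketch with invented numbers; in the notebook the two columns would come from `eval_y` and `eval_pred`, and the column and label names used here are hypothetical.

```python
import numpy as np
import pandas as pd

# Illustrative numbers only, not results from the CDP data
toy = pd.DataFrame({
    'organization': ['A', 'B', 'C', 'D'],
    'real': [120.0, 80.0, 50.0, 200.0],
    'frog': [100.0, 95.0, 50.0, 150.0],
})

toy['gap'] = toy['real'] - toy['frog']   # positive = above the FRoG estimate
toy['status'] = np.select(
    [toy['gap'] > 0, toy['gap'] < 0],
    ['over-emitting', 'saving'],
    default='on target',
)

# Largest excess first
ranked = toy.sort_values('gap', ascending=False).reset_index(drop=True)
print(ranked)
```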

In [None]:

fig = plt.figure(figsize=(9,6))

idx = [ 1,  5,  6, 16, 22, 23, 34, 38, 39, 40, 48]
dec = df_climate[df_climate[['primary_industry','Revenue']+['total CO2 [tons]']].isna().sum(1)==0][df_climate[df_climate[['primary_industry','Revenue']+['total CO2 [tons]']].isna().sum(1)==0].year==2019]
plt.bar(dec.iloc[idx]['organization'].values, dec.iloc[idx]['total CO2 [tons]'].values,label='real value')
plt.bar(dec.iloc[idx]['organization'].values, eval_pred.flatten()[idx],label='FRoG value')
flag_r = True
flag_b = True
for i in range(len(idx)):
    x = dec.iloc[idx]['organization'].values[i]
    real, pred = dec.iloc[idx]['total CO2 [tons]'].values[i], eval_pred.flatten()[idx][i]
    margin = real-pred
    if margin>0:
        plt.bar([x],[real],color='#1f77b4')
        plt.bar([x],[pred],color='#ff7f0e')
    else:
        plt.bar([x],[pred],color='#ff7f0e')
        plt.bar([x],[real],color='#1f77b4')
for i in range(len(idx)):
    x = dec.iloc[idx]['organization'].values[i]
    real, pred = dec.iloc[idx]['total CO2 [tons]'].values[i], eval_pred.flatten()[idx][i]
    margin = real-pred
    if margin>0:
        if flag_r:
            plt.vlines([x],ymin=pred,ymax=real,color='r',label='Wasted CO2')
            flag_r = False
        else:
            plt.vlines([x],ymin=pred,ymax=real,color='r')
    else:
        if flag_b:
            plt.vlines([x],ymin=real,ymax=pred,color='b',label='Saved CO2')
            flag_b = False
        else:
            plt.vlines([x],ymin=real,ymax=pred,color='b')
            
plt.legend()
plt.xticks(rotation=-50)
plt.xlabel('organization name')
plt.ylabel('CO2[tons]')
plt.show()

## FRoG with social-dimension penalty (city's population)
![](http://wattsupwiththat.files.wordpress.com/2016/05/clip_image002_thumb1.png?resize=625%2C409)
### Historically, there is an almost linear relationship between population and atmospheric CO2 (ppm). As population increases, CO2 emissions from a city increase roughly proportionately, and we need to take this into account.
### We add a linear, population-based penalty to FRoG: corporations in cities with larger populations need to reduce more CO2, which lets us take the social impact of CO2 emissions into account.
### FRoG = FRoG(functional regression estimation) - α * population (α = 1.0 is used in this example).
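
The penalty formula can be sketched in a few lines. The helper name `penalized_frog` and the numbers are assumptions for illustration; treating a missing population as "no penalty" is also a choice made only in this sketch.

```python
import numpy as np

def penalized_frog(pred, population, alpha=1.0):
    """Lower the regression estimate by alpha * population.

    A NaN population is treated as zero penalty (an assumption of this sketch).
    """
    pred = np.asarray(pred, dtype=float)
    penalty = alpha * np.nan_to_num(np.asarray(population, dtype=float), nan=0.0)
    return pred - penalty

pred = [5.0e6, 2.0e6, 8.0e6]          # illustrative regression estimates [tons]
population = [1.0e6, np.nan, 2.5e6]   # illustrative city populations

target = penalized_frog(pred, population, alpha=1.0)
print(target)
```

A larger `alpha` tightens the target for organizations in populous cities, so it acts as the knob that weighs the social dimension against the purely statistical estimate.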
### The results are shown below

In [None]:

organization2city = {'The AES Corporation':'Arlington','American Airlines Group Inc':'Fort Worth',
                     'American Electric Power Company, Inc.':'Columbus','CSX Corporation':'Jacksonville',
                     'Exelon Corporation':'Chicago','Fluor Corporation':'Irving',
                     'NRG Energy Inc':'Houston','PG&E Corporation':'San Francisco',
                     'Pinnacle West Capital Corporation':'Phoenix','PPL Corporation':'AllenTown',
                     'TransAlta Corporation':'Calgary'}
idx = [ 1,  5,  6, 16, 22, 23, 34, 38, 39, 40, 48]
dec = df_climate[df_climate[['primary_industry','Revenue']+['total CO2 [tons]']].isna().sum(1)==0][df_climate[df_climate[['primary_industry','Revenue']+['total CO2 [tons]']].isna().sum(1)==0].year==2019]
city_names = dec.iloc[idx]['organization'].apply(lambda x: organization2city[x])
populations = []
for name in city_names:
    if len(df_city[(df_city.City==name)&(df_city.year==2019)])==0:
        populations.append(np.nan)
    else:
        populations.append(df_city[(df_city.City==name)&(df_city.year==2019)]['Population'].values[0])

# Penalty: -alpha * (population), this time alpha = 1
        

fig = plt.figure(figsize=(9,6))

idx = [ 1,  5,  6, 16, 22, 23, 34, 38, 39, 40, 48]
dec = df_climate[df_climate[['primary_industry','Revenue']+['total CO2 [tons]']].isna().sum(1)==0][df_climate[df_climate[['primary_industry','Revenue']+['total CO2 [tons]']].isna().sum(1)==0].year==2019]
plt.bar(dec.iloc[idx]['organization'].values, dec.iloc[idx]['total CO2 [tons]'].values,label='real value')
plt.bar(dec.iloc[idx]['organization'].values, eval_pred.flatten()[idx],label='FRoG value')
flag_r = True
flag_b = True
alpha = 1
for i in range(len(idx)):
    x = dec.iloc[idx]['organization'].values[i]
    real, pred = dec.iloc[idx]['total CO2 [tons]'].values[i], eval_pred.flatten()[idx][i]
    margin = real-pred
    if margin>0:
        plt.bar([x],[real],color='#1f77b4')
        plt.bar([x],[pred],color='#ff7f0e')
    else:
        plt.bar([x],[pred],color='#ff7f0e')
        plt.bar([x],[real],color='#1f77b4')
for i in range(len(idx)):
    x = dec.iloc[idx]['organization'].values[i]
    real, pred = dec.iloc[idx]['total CO2 [tons]'].values[i], eval_pred.flatten()[idx][i]
    margin = real-pred
    if margin>0:
        if flag_r:
            plt.vlines([x],ymin=pred,ymax=real,color='r',label='Wasted CO2')
            plt.bar([x],bottom=pred-(alpha * np.array(populations))[i],height=(alpha * np.array(populations))[i], label='Penalty',color='g')
            flag_r = False
        else:
            plt.vlines([x],ymin=pred,ymax=real,color='r')
            plt.bar([x],bottom=pred-(alpha * np.array(populations))[i],height=(alpha * np.array(populations))[i], color='g')
    else:
        if flag_b:
            plt.vlines([x],ymin=real,ymax=pred,color='b',label='Saved CO2')
            plt.bar([x],bottom=pred-(alpha * np.array(populations))[i],height=(alpha * np.array(populations))[i],color='g')
            flag_b = False
        else:
            plt.vlines([x],ymin=real,ymax=pred,color='b')
            plt.bar([x],bottom=pred-(alpha * np.array(populations))[i],height=(alpha * np.array(populations))[i],color='g')
            
plt.legend()
plt.xticks(rotation=-50)
plt.xlabel('organization name')
plt.ylabel('CO2[tons]')
plt.show()

# Comparison to Past KPIs
### Lastly, let's see what advantage FRoG has over the KPIs cities have used in the past.
### Past KPIs of cities are listed below (only the USA and Canada in 2020 with non-NaN responses, 64 in total).
### We can see that every city defines its own absolute KPI from its own past data, but almost none of them mentions a country-wide relative CO2-reduction metric that compares it with other cities in the same country. In this sense, FRoG has an advantage as a relative metric: it regresses the expected CO2 emissions and also considers multiple other factors of cities and corporations to measure their CO2 emissions from social and environmental standpoints.

In [None]:
sets_an = []
for year in range(2018, 2021):
    
    # Cities
    
    Corporations_disclosing_City = data[year][4]
    Corporations_response_City = data[year][5]
    
    sets_an.append(set(Corporations_response_City['Account Number'].value_counts().index.values))
    
year = 2020
Corporations_disclosing_City = data[year][4]
Corporations_response_City = data[year][5]

count = 0

q = 'Please describe the main goals of your city’s adaptation efforts and the metrics / KPIs for each goal.'

# Iterate over every account present in all three years instead of a hard-coded count
for an in sorted(sets_an[0]&sets_an[1]&sets_an[2]):
    
    city = Corporations_disclosing_City[Corporations_disclosing_City['Account Number']==an].City.values[0]
    country = Corporations_disclosing_City[Corporations_disclosing_City['Account Number']==an].Country.values[0]
    response = Corporations_response_City[(Corporations_response_City['Account Number']==an)&(Corporations_response_City['Question Name']==q)&(Corporations_response_City['Column Name']=='Description of metric / indicator used to track goal')]['Response Answer'].values[0]
    if (country in ['United States of America','Canada']) and isinstance(response, str):
        print('city name :',city, country)
        print()
        print('response :',response)
        print()
        count += 1

# Conclusion
### In this notebook I proposed a new KPI system, FRoG, which applies machine-learning techniques to the evaluation of CO2 emissions from both environmental and social standpoints. Machine learning lets us estimate potential wasted CO2 more precisely, and the estimate is not limited to the revenue and industry of corporations: other characteristics of companies can also be used as inputs.

# Reference
### [1] Visualization of CDP files from Jared Savage at https://www.kaggle.com/c/cdp-unlocking-climate-solutions/data
### [2] Financial information of corporations from Takahiro Kubo https://www.kaggle.com/takahirokubo0/annual-financial-data-for-hybrid-cdp-kpi
### [3] Explanation image of Linear Regression http://i.imgur.com/DT4H1Yk.jpg
### [4] Relationship between world population and global CO2 https://wattsupwiththat.com/2016/05/17/the-correlation-between-global-population-and-global-co2/

# Thank you for reading!