# KPI - Estimating Level of City Effort - The "Yes Score"

The Pareto Principle captures the observation that most of the success in a complex effort comes from a relatively small portion of the overall effort. It is often summarized as 80% of the improvement coming from 20% of the effort. Along those lines, this data investigation attempts to estimate a city's level of effort by assigning each city a "Positive Yes Score" as a measure of whether it is even trying.

The Yes Score is based on each city's positive ("yes") responses to its questionnaire, with diminishing returns on each successive yes. Since the questionnaires use a "branching" style of questions, the most important part of a response is whether or not the city is even trying, so the main question is the primary differentiator. As an example of a branching question, a 'yes' for 1.0 leads the respondent to give more detail in 1.1, 1.2, and so on. Conversely, a 'no' either ends that line of questions or branches to a different set of questions altogether.
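
As a minimal sketch of how that numbering can be handled, the snippet below separates plainly numbered questions from lettered follow-ups by coercing the labels to numbers. The labels are made up for illustration and are not taken from the CDP files. It also shows one caveat of this approach: once converted to floats, labels such as "6.1" and "6.10" become the same value.

```python
import pandas as pd

# Hypothetical question labels in the style of the CDP questionnaires.
labels = pd.Series(["2.0", "2.1", "2.1a", "6.1", "6.10"])

# errors="coerce" turns anything that is not a plain number into NaN,
# which filters out lettered sub-questions such as "2.1a".
numbers = pd.to_numeric(labels, errors="coerce")

numbered_only = labels[numbers.notna()].tolist()
print(numbered_only)

# Caveat: as floats, "6.1" and "6.10" can no longer be told apart.
same_value = bool(numbers[3] == numbers[4])
print(same_value)
```

If distinct labels such as 6.1 and 6.10 both matter, keeping the label as a string next to the numeric column is one way to avoid the collision.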

**I kept the order of the survey questions in this investigation, with an assumption that the more important questions are posed at the beginning of each section in the survey.**

The first two questions as an example case for 2020 are: 
1.0 Does your city incorporate sustainability goals and targets (e.g. GHG reductions) into the master planning for the city?
2.0 Has a climate change risk or vulnerability assessment been undertaken for your city?

**Datasets Used:**
1) CDP Questionnaire Responses
2) CDP Complementary Datasets - GHG Emissions
3) Gapminder - Gini Index for years 2018-2020


In [None]:
import os
import math
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# set seed for reproducibility
np.random.seed(42) 

In [None]:
# Functions for creating easier to use column names
def std_col_ref(text_series):
    ''' Standardizes column values for matching column names dynamically
    '''
    new_text = text_series.str.strip()
    # Pass regex=True/False explicitly: recent pandas versions treat patterns as literals by default
    new_text = new_text.str.replace(r'\s+', '_', regex=True)
    new_text = new_text.str.replace(r'[-]+', '_', regex=True)
    new_text = new_text.str.replace('#', 'num', regex=False)
    new_text = new_text.str.replace('(', '', regex=False)
    new_text = new_text.str.replace(')', '', regex=False)
    new_text = new_text.str.replace(r'[_]+', '_', regex=True)
    new_text = new_text.str.replace('/', '_', regex=False)
    new_text = new_text.str.replace('\\', '_', regex=False)
    new_text = new_text.str.replace(r'[!@#$%^&*]+', '', regex=True)
    new_text = new_text.str.lower()
    return new_text


def standardize_columns(df):
    ''' Standardizes column names and references
    '''
    the_columns = df.columns
    the_columns = std_col_ref(the_columns)
    df.columns = the_columns
    df = df.dropna(axis=1, how='all')
    return df

In [None]:
# Read in the disclosing data and format the dates
cities_disc_2020_data = standardize_columns(pd.read_csv("/kaggle/input/cdp-unlocking-climate-solutions/Cities/Cities Disclosing/2020_Cities_Disclosing_to_CDP.csv"))
cities_disc_2020_data['account_number'] = cities_disc_2020_data.account_number.astype(object)
cities_disc_2020_data['last_update'] = pd.to_datetime(cities_disc_2020_data['last_update'], format='%Y-%m-%dT%H:%M:%S.%f')

cities_disc_2019_data = standardize_columns(pd.read_csv("/kaggle/input/cdp-unlocking-climate-solutions/Cities/Cities Disclosing/2019_Cities_Disclosing_to_CDP.csv"))
cities_disc_2019_data['account_number'] = cities_disc_2019_data.account_number.astype(object)
cities_disc_2019_data['last_update'] = pd.to_datetime(cities_disc_2019_data['last_update'], format='%Y-%m-%dT%H:%M:%S.%f')

cities_disc_2018_data = standardize_columns(pd.read_csv("/kaggle/input/cdp-unlocking-climate-solutions/Cities/Cities Disclosing/2018_Cities_Disclosing_to_CDP.csv"))
cities_disc_2018_data['account_number'] = cities_disc_2018_data.account_number.astype(object)
cities_disc_2018_data['last_update'] = pd.to_datetime(cities_disc_2018_data['last_update'], format='%Y-%m-%dT%H:%M:%S.%f')

In [None]:
# Read in the response data
cities_resp_2020_data = standardize_columns(pd.read_csv("/kaggle/input/cdp-unlocking-climate-solutions/Cities/Cities Responses/2020_Full_Cities_Dataset.csv"))
cities_resp_2019_data = standardize_columns(pd.read_csv("/kaggle/input/cdp-unlocking-climate-solutions/Cities/Cities Responses/2019_Full_Cities_Dataset.csv"))
cities_resp_2018_data = standardize_columns(pd.read_csv("/kaggle/input/cdp-unlocking-climate-solutions/Cities/Cities Responses/2018_Full_Cities_Dataset.csv"))

# Drop rows that don't have any response answer
cities_resp_2020_data.dropna(subset=['response_answer'], inplace=True)
cities_resp_2019_data.dropna(subset=['response_answer'], inplace=True)
cities_resp_2018_data.dropna(subset=['response_answer'], inplace=True)

# Turn the question numbers in to numbers, nullifying any subquestions by "coercing" items such as 2.1a to NaN
cities_resp_2018_data['question_number'] = pd.to_numeric(cities_resp_2018_data['question_number'], errors='coerce')
cities_resp_2019_data['question_number'] = pd.to_numeric(cities_resp_2019_data['question_number'], errors='coerce')
cities_resp_2020_data['question_number'] = pd.to_numeric(cities_resp_2020_data['question_number'], errors='coerce')

cities_resp_2020_data['question_number'].value_counts()

**This is where I make a big assumption. The assumption is that the most important questions are posed early in each section.**
In that regard, I put the first question for each section at the front of the list. I also referenced the "CDP Recommendations for Questions to Focus On" Excel worksheet to aid the selection of questions.
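
To make the effect of that ordering concrete, here is a minimal sketch of the Pareto-style weighting on a toy question list. The helper name `pareto_weights`, the string question labels, and the set of "yes" answers are assumptions for illustration only; the notebook itself works with float question numbers and per-year lists.

```python
import math

def pareto_weights(questions, head_share=0.8, head_fraction=0.2):
    """Give the first ~20% of an ordered question list 80% of the total weight."""
    n_head = math.ceil(len(questions) * head_fraction)
    head, tail = questions[:n_head], questions[n_head:]
    weights = {q: head_share / len(head) for q in head}
    if tail:
        # The remaining questions share the leftover 20% equally.
        weights.update({q: (1 - head_share) / len(tail) for q in tail})
    return weights

# Hypothetical ordered list: section-opening questions first, follow-ups after.
ordered_questions = ["1.0", "2.0", "3.1", "4.0", "5.0", "6.2", "8.0", "2.3", "3.2", "4.9"]
weights = pareto_weights(ordered_questions)

# A city's Yes Score is the sum of the weights of its positive answers.
yes_answers = {"1.0", "2.3", "4.9"}
yes_score = sum(weights[q] for q in yes_answers)
print(round(yes_score, 3))
```

With ten questions, the first two carry 0.4 each and the other eight carry 0.025 each, so a single "yes" to an early question outweighs every later question combined.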

In [None]:
# Yes/No Response Questions for each year, first question of each section, with secondary questions in order
list_of_yes_no_questions_2018 = [1.4,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,2.1,2.2,3.1,3.2,5.1,6.8,6.10,7.3,7.9,7.10,7.12,8.2,9.3,15.3]
list_of_yes_no_questions_2019 = [1.0,2.0,3.1,4.0,5.0,6.1,7.0,8.0,1.1,1.6,1.15,4.9,4.11,5.5,6.3,6.4,6.6,6.9,7.7,7.9,8.1,8.6,9.1,9.2,9.3,10.7,10.11,12.4,12.5,13.6,14.3,14.5]
list_of_yes_no_questions_2020 = [1.0,2.0,3.1,4.0,5.0,6.2,8.0,2.3,3.2,4.9,4.12,5.2,5.3,5.5,8.5,10.7,12.3,14.2,14.4]

question_filter_2018 = cities_resp_2018_data['question_number'].isin(list_of_yes_no_questions_2018)
question_filter_2019 = cities_resp_2019_data['question_number'].isin(list_of_yes_no_questions_2019)
question_filter_2020 = cities_resp_2020_data['question_number'].isin(list_of_yes_no_questions_2020)

# List of "Positive" answers 
positive_answer_regex = ['Yes','In progress','Base year','Fixed level']

# Filter response answers based on "positive" responses
answer_filter_2018 = (cities_resp_2018_data['response_answer'].isin(positive_answer_regex)) | (cities_resp_2018_data['response_answer'].str.startswith('Base year'))  | (cities_resp_2018_data['response_answer'].str.startswith('Fixed level'))
answer_filter_2019 = (cities_resp_2019_data['response_answer'].isin(positive_answer_regex)) | (cities_resp_2019_data['response_answer'].str.startswith('Base year'))  | (cities_resp_2019_data['response_answer'].str.startswith('Fixed level'))
answer_filter_2020 = (cities_resp_2020_data['response_answer'].isin(positive_answer_regex)) | (cities_resp_2020_data['response_answer'].str.startswith('Base year'))  | (cities_resp_2020_data['response_answer'].str.startswith('Fixed level'))

# Filter out the "yes" data for scoring
# Copy the filtered rows so that the weight column added later is written to an independent frame
cities_resp_2018_yes_data = cities_resp_2018_data.loc[question_filter_2018 & answer_filter_2018,:].copy()
cities_resp_2019_yes_data = cities_resp_2019_data.loc[question_filter_2019 & answer_filter_2019,:].copy()
cities_resp_2020_yes_data = cities_resp_2020_data.loc[question_filter_2020 & answer_filter_2020,:].copy()
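
The positive-answer test above is repeated once per year. As a hedged sketch, the same logic could live in one helper; the function name `positive_answer_mask` and the sample answers below are hypothetical and only illustrate the idea.

```python
import pandas as pd

def positive_answer_mask(answers):
    """Return True where an answer counts as 'positive' for the Yes Score."""
    exact = answers.isin(["Yes", "In progress"])
    # na=False makes missing answers count as not positive instead of NaN.
    base_year = answers.str.startswith("Base year", na=False)
    fixed_level = answers.str.startswith("Fixed level", na=False)
    return exact | base_year | fixed_level

# Made-up answers for illustration.
sample_answers = pd.Series(
    ["Yes", "No", "In progress", "Base year emissions target",
     "Fixed level target", "Do not know", None]
)
mask = positive_answer_mask(sample_answers)
print(mask.tolist())
```

Calling the helper on each year's `response_answer` column would give the same kind of boolean filter as the three hand-written expressions.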

In [None]:
# Assign "weight" based on the Pareto Principle (First 20% of questions are worth 80%)
p80 = .8  # Pareto 80% effective weight
p20 = .2 # Pareto 20% effective weight

# Set ceiling index values which I use to select the first ~20% of the yes/no questions
ceil_2018 = math.ceil(len(list_of_yes_no_questions_2018) * .2)
ceil_2019 = math.ceil(len(list_of_yes_no_questions_2019) * .2)
ceil_2020 = math.ceil(len(list_of_yes_no_questions_2020) * .2)

# Select the questions which will be assigned 80% of the weighted value
eighty_percent_2018 = list_of_yes_no_questions_2018[0:ceil_2018]
eighty_percent_2019 = list_of_yes_no_questions_2019[0:ceil_2019]
eighty_percent_2020 = list_of_yes_no_questions_2020[0:ceil_2020]

# Select the questions which will be assigned 20% of the weighted value
twenty_percent_2018 = list_of_yes_no_questions_2018[ceil_2018:]
twenty_percent_2019 = list_of_yes_no_questions_2019[ceil_2019:]
twenty_percent_2020 = list_of_yes_no_questions_2020[ceil_2020:]

# Filters for question weight assignments
eighty_percent_2018_filter = (cities_resp_2018_yes_data['question_number'].isin(eighty_percent_2018))
eighty_percent_2019_filter = (cities_resp_2019_yes_data['question_number'].isin(eighty_percent_2019))
eighty_percent_2020_filter = (cities_resp_2020_yes_data['question_number'].isin(eighty_percent_2020))

twenty_percent_2018_filter = (cities_resp_2018_yes_data['question_number'].isin(twenty_percent_2018))
twenty_percent_2019_filter = (cities_resp_2019_yes_data['question_number'].isin(twenty_percent_2019))
twenty_percent_2020_filter = (cities_resp_2020_yes_data['question_number'].isin(twenty_percent_2020))

# Weight assignments for each question.
cities_resp_2018_yes_data.loc[eighty_percent_2018_filter, 'question_weight'] = p80 / len(eighty_percent_2018)
cities_resp_2019_yes_data.loc[eighty_percent_2019_filter, 'question_weight'] = p80 / len(eighty_percent_2019)
cities_resp_2020_yes_data.loc[eighty_percent_2020_filter, 'question_weight'] = p80 / len(eighty_percent_2020)

cities_resp_2018_yes_data.loc[twenty_percent_2018_filter, 'question_weight'] = p20 / len(twenty_percent_2018)
cities_resp_2019_yes_data.loc[twenty_percent_2019_filter, 'question_weight'] = p20 / len(twenty_percent_2019)
cities_resp_2020_yes_data.loc[twenty_percent_2020_filter, 'question_weight'] = p20 / len(twenty_percent_2020)

# Double check that every question got weighted.
if any(cities_resp_2018_yes_data.question_weight.isna()):
    print('WARNING: Some questions did not receive a weight value')
    
if any(cities_resp_2019_yes_data.question_weight.isna()):
    print('WARNING: Some questions did not receive a weight value')
    
if any(cities_resp_2020_yes_data.question_weight.isna()):
    print('WARNING: Some questions did not receive a weight value')
    
print(cities_resp_2018_yes_data.question_weight.describe())
print(cities_resp_2019_yes_data.question_weight.describe())
print(cities_resp_2020_yes_data.question_weight.describe())

In [None]:
# Aggregate scores for each city
city_scores_2018 = cities_resp_2018_yes_data.groupby(['account_number']).question_weight.sum()
city_scores_2019 = cities_resp_2019_yes_data.groupby(['account_number']).question_weight.sum()
city_scores_2020 = cities_resp_2020_yes_data.groupby(['account_number']).question_weight.sum()

print(city_scores_2018.describe())
print(city_scores_2019.describe())
print(city_scores_2020.describe())

## Top and Bottom Cities by Yes Scores for 2020

In [None]:
# Bring in data from the Cities Disclosing dataset
all_city_data_2018 = city_scores_2018.reset_index().merge(cities_disc_2018_data, how='inner', on='account_number')
all_city_data_2019 = city_scores_2019.reset_index().merge(cities_disc_2019_data, how='inner', on='account_number')
all_city_data_2020 = city_scores_2020.reset_index().merge(cities_disc_2020_data, how='inner', on='account_number')

cities_2020 = all_city_data_2020.sort_values('question_weight')[['organization','question_weight']]
print(f'5 Cities with worst Yes Scores in 2020:\n{cities_2020.head()}')
print('-' * 25)
print(f'5 Cities with best Yes Scores in 2020:\n{cities_2020.tail()}')

In [None]:
all_city_scores = pd.concat([all_city_data_2018, all_city_data_2019, all_city_data_2020], ignore_index=True)
all_city_scores.rename(columns={'question_weight':'yes_score'}).to_csv('all_city_yes_scores.csv')

## Calculate the average scores across country and region
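
As a small illustration of what the averaging step produces, the sketch below attaches a per-country mean to each city row using `groupby(...).transform('mean')`. This is an alternative to the group-then-merge pattern used in the notebook, not the notebook's own method, and the city and country values are made up.

```python
import pandas as pd

# Toy data: two cities in each of two hypothetical countries.
cities = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "country": ["X", "X", "Y", "Y"],
    "question_weight": [0.9, 0.5, 0.4, 0.2],
})

# transform('mean') returns one value per original row, aligned to the index,
# so the country average can be assigned directly as a new column.
cities["country_avg_score"] = (
    cities.groupby("country")["question_weight"].transform("mean")
)
print(cities)
```

The result has the same rows in the same order as the input, with each city carrying its country's average score.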

In [None]:
# Generate average score for the country
country_means_2018 = all_city_data_2018.groupby('country').question_weight.mean().rename('country_avg_score').reset_index()
country_means_2019 = all_city_data_2019.groupby('country').question_weight.mean().rename('country_avg_score').reset_index()
country_means_2020 = all_city_data_2020.groupby('country').question_weight.mean().rename('country_avg_score').reset_index()
country_means =  country_means_2018.rename(columns={'country_avg_score':'2018_country_avg_score'}).merge(country_means_2019.rename(columns={'country_avg_score':'2019_country_avg_score'})).merge(country_means_2020.rename(columns={'country_avg_score':'2020_country_avg_score'}))

all_city_data_2018 = all_city_data_2018.merge(country_means_2018, how='inner', on='country') 
all_city_data_2019 = all_city_data_2019.merge(country_means_2019, how='inner', on='country')
all_city_data_2020 = all_city_data_2020.merge(country_means_2020, how='inner', on='country')


# Generate average cdp_region score
region_means_2018 = all_city_data_2018.groupby('cdp_region').question_weight.mean().rename('region_avg_score').reset_index()
region_means_2019 = all_city_data_2019.groupby('cdp_region').question_weight.mean().rename('region_avg_score').reset_index()
region_means_2020 = all_city_data_2020.groupby('cdp_region').question_weight.mean().rename('region_avg_score').reset_index()

region_means =  region_means_2018.rename(columns={'region_avg_score':'2018_region_avg_score'}).merge(region_means_2019.rename(columns={'region_avg_score':'2019_region_avg_score'})).merge(region_means_2020.rename(columns={'region_avg_score':'2020_region_avg_score'}))
region_means.to_csv('region_yes_score_means.csv')

# Bring region mean to the rest of the data for correlation analysis
all_city_data_2018 = all_city_data_2018.merge(region_means_2018, how='inner', on='cdp_region') 
all_city_data_2019 = all_city_data_2019.merge(region_means_2019, how='inner', on='cdp_region')
all_city_data_2020 = all_city_data_2020.merge(region_means_2020, how='inner', on='cdp_region')

In [None]:
# Visualize city scores by region for 2018, 2019, and 2020
for year, city_data in [(2018, all_city_data_2018),
                        (2019, all_city_data_2019),
                        (2020, all_city_data_2020)]:
    f, ax = plt.subplots(figsize=(20, 5))
    ax = sns.swarmplot(data=city_data.sort_values('region_avg_score'), x='cdp_region', y='question_weight', ax=ax)
    ax.set(title=f'{year} City Scores by Region',
           ylabel="City Score",
           xlabel="CDP Region")
    plt.xticks(rotation=90)  # every region label is kept, rotated for readability
    plt.show()

# What correlation exists between score and C40 emissions?

In [None]:
c40_cities_emmissions = standardize_columns(pd.read_csv("/kaggle/input/cdp-complementary-datasets/C40_Cities_Citywide_Emissions.csv"))
c40_cities_emmissions.info()

In [None]:
country_c40_map = {'United States of America':'USA',
               'United Kingdom of Great Britain and Northern Ireland': 'United Kingdom',
               'China, Hong Kong Special Administrative Region': 'Hong Kong',
               'Republic of Korea': 'South Korea'
              }
country_means['country_mapped'] = country_means.country.map(country_c40_map)
country_means.loc[country_means.country_mapped.isna(), 'country_mapped'] = country_means['country']
country_means_c40 = country_means.merge(c40_cities_emmissions, how='inner', left_on='country_mapped', right_on='country')

In [None]:
# Plot the correlations
f, ax = plt.subplots(figsize=(12, 12))
sns.heatmap(country_means_c40.corr(numeric_only=True), annot=True)  # skip text columns
plt.show()

### Only a weak correlation between C40 Emissions and a country's "Yes Score"

## Analysis of the correlation between country or region "Yes Score" and the Gini index for the country
The idea here is to determine how a Yes Score might correlate with the Gini Index. Going into this part of the analysis, a negative correlation between the two values seems likely. A negative correlation would mean that a country with a higher Yes Score tends to have a lower Gini Index value, that is, countries or regions that make sustainability efforts would have less social inequality. This would not necessarily be a causal relationship, though.
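
To show what a negative correlation looks like numerically, here is a toy sketch using `Series.corr`, which computes the Pearson coefficient by default. The numbers are invented for illustration and are not CDP or Gapminder values.

```python
import pandas as pd

# Invented values: higher scores paired with lower Gini values.
toy = pd.DataFrame({
    "country_avg_score": [0.2, 0.4, 0.6, 0.8],
    "gini": [50.0, 45.0, 38.0, 30.0],
})

# Pearson correlation between the two columns; a value below zero
# means the columns tend to move in opposite directions.
r = toy["country_avg_score"].corr(toy["gini"])
print(round(r, 2))
```

A coefficient close to -1 indicates a strong inverse relationship, while a value near 0 indicates little linear relationship in either direction.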

In [None]:
gm_gini_scores = standardize_columns(pd.read_csv('/kaggle/input/income-inequality/gini.csv', encoding = 'cp1252'))

# Keep only the four year columns used below (2018-2021)
gm_gini_scores = pd.concat([gm_gini_scores.country, gm_gini_scores.iloc[:,-23:-19]], axis=1)
gm_gini_scores.head()

In [None]:
# Map Country names so joining to the Gap Minder Gini data is possible
country_map = {'United States of America':'United States',
               'United Kingdom of Great Britain and Northern Ireland': 'United Kingdom',
               'Gibraltar': None, 
               'Viet Nam': 'Vietnam', 
               'Taiwan, Greater China': 'Taiwan',
               'China, Hong Kong Special Administrative Region': None,
               'Republic of Korea': 'South Korea', 
               'United Republic of Tanzania': 'Tanzania',
               'Democratic Republic of the Congo': 'Congo, Dem. rep',
               "Côte d'Ivoire": "Cote d'Ivoire",
               'Bolivia (Plurinational State of)': 'Bolivia',
               'State of Palestine': 'Palestine'
              }
all_city_data_2018['country_mapped'] = all_city_data_2018.country.map(country_map)
all_city_data_2019['country_mapped'] = all_city_data_2019.country.map(country_map)
all_city_data_2020['country_mapped'] = all_city_data_2020.country.map(country_map)

all_city_data_2018.loc[all_city_data_2018.country_mapped.isna(), 'country_mapped'] = all_city_data_2018.country
all_city_data_2019.loc[all_city_data_2019.country_mapped.isna(), 'country_mapped'] = all_city_data_2019.country
all_city_data_2020.loc[all_city_data_2020.country_mapped.isna(), 'country_mapped'] = all_city_data_2020.country
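
The map-then-fill pattern above can also be written as a single expression. The sketch below is a minimal example with a made-up three-entry mapping. It also shows that a name mapped to `None` ends up with its original value after the fallback.

```python
import pandas as pd

country = pd.Series(["United States of America", "Viet Nam", "France", "Gibraltar"])

# Hypothetical subset of a name mapping; "Gibraltar" is deliberately mapped to None.
name_map = {
    "United States of America": "United States",
    "Viet Nam": "Vietnam",
    "Gibraltar": None,
}

# Names without a usable mapping come back as missing, then fall back
# to the original value through fillna.
country_mapped = country.map(name_map).fillna(country)
print(country_mapped.tolist())
```

Because of that fallback, entries mapped to `None` keep their original name and simply drop out later if the inner join on the Gini data finds no match.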

In [None]:
# How much data can we keep when joining the gini index data?
print(all_city_data_2018.country_mapped.isin(gm_gini_scores['country']).sum() / len(all_city_data_2018.country))
print(all_city_data_2019.country_mapped.isin(gm_gini_scores['country']).sum() / len(all_city_data_2019.country))
print(all_city_data_2020.country_mapped.isin(gm_gini_scores['country']).sum() / len(all_city_data_2020.country))

In [None]:
# Merge to compare "Current Year" Yes Score with the Gini Score the same year and the next year
all_city_gini_data_2018 = all_city_data_2018.merge(gm_gini_scores[['country','2018','2019']], how='inner', left_on='country_mapped', right_on='country')
all_city_gini_data_2019 = all_city_data_2019.merge(gm_gini_scores[['country','2019','2020']], how='inner', left_on='country_mapped', right_on='country')
all_city_gini_data_2020 = all_city_data_2020.merge(gm_gini_scores[['country','2020','2021']], how='inner', left_on='country_mapped', right_on='country')
all_city_gini_data_2020.sample(5)

In [None]:
# Plot the correlations
f, ax = plt.subplots(1, 3, figsize=(30,7))
sns.heatmap(all_city_gini_data_2018.corr(numeric_only=True), annot=True, ax=ax[0])
ax[0].set_title('2018 Correlation Matrix')

sns.heatmap(all_city_gini_data_2019.corr(numeric_only=True), annot=True, ax=ax[1])
ax[1].set_title('2019 Correlation Matrix')

sns.heatmap(all_city_gini_data_2020.corr(numeric_only=True), annot=True, ax=ax[2])
ax[2].set_title('2020 Correlation Matrix')
plt.show()

### Result of Correlation Analysis between Yes Score and Gini Index
There is a moderate negative correlation between the country and region average scores and the Gini Index. This indicates that countries whose cities make additional efforts to improve sustainability and reduce the environmental impact of human actions also tend to have more social equality, as measured by a lower Gini Index.

## Examples of "low" scoring regions and countries

In [None]:
# Examples of "low" scoring regions and countries

print(all_city_gini_data_2020[all_city_gini_data_2020.region_avg_score < .6].cdp_region.value_counts())
print('-'*25)
print(all_city_gini_data_2020[all_city_gini_data_2020.country_avg_score < .6].country_mapped.value_counts().head())

In [None]:
# Checking to see if a low score for the country correlates with the Gini index
f, ax = plt.subplots(1, 3, figsize=(30,7))
sns.heatmap(all_city_gini_data_2018[all_city_gini_data_2018.country_avg_score < .6].loc[:,['country_mapped','country_avg_score']].drop_duplicates().merge(gm_gini_scores[['country','2018','2019']], how='inner', left_on='country_mapped', right_on='country').corr(numeric_only=True), annot=True, ax=ax[0])
ax[0].set_title('2018 Country Score Correlation Matrix')

sns.heatmap(all_city_gini_data_2019[all_city_gini_data_2019.country_avg_score < .6].loc[:,['country_mapped','country_avg_score']].drop_duplicates().merge(gm_gini_scores[['country','2019','2020']], how='inner', left_on='country_mapped', right_on='country').corr(numeric_only=True), annot=True, ax=ax[1])
ax[1].set_title('2019 Country Score Correlation Matrix')

sns.heatmap(all_city_gini_data_2020[all_city_gini_data_2020.country_avg_score < .6].loc[:,['country_mapped','country_avg_score']].drop_duplicates().merge(gm_gini_scores[['country','2020','2021']], how='inner', left_on='country_mapped', right_on='country').corr(numeric_only=True), annot=True, ax=ax[2])
ax[2].set_title('2020 Country Score Correlation Matrix')
plt.show()

### No obvious correlation between a low country "Yes score" and the Gini Index for 2020.

# Checking correlation between regions with a low Yes Score and the Gini Index

In [None]:
# Visualize - Correlation matrix for "Low" region scores
f, region_ax = plt.subplots(1, 3, figsize=(30,7))
region_gini_merge_2018 = all_city_gini_data_2018[all_city_gini_data_2018.region_avg_score < .6].loc[:,['country_mapped','region_avg_score']].drop_duplicates()
region_gini_merge_2018 = region_gini_merge_2018.merge(gm_gini_scores[['country','2018','2019']], how='inner', left_on='country_mapped', right_on='country')
region_gini_merge_2018.rename(columns={'2018':'cdp_year','2019':'cdp_year_+_1'}, inplace=True)
sns.heatmap(region_gini_merge_2018.corr(numeric_only=True), annot=True, ax=region_ax[0])
region_ax[0].set_title('2018 Region Score Correlation Matrix')

region_gini_merge_2019 = all_city_gini_data_2019[all_city_gini_data_2019.region_avg_score < .6].loc[:,['country_mapped','region_avg_score']].drop_duplicates()
region_gini_merge_2019 = region_gini_merge_2019.merge(gm_gini_scores[['country','2019','2020']], how='inner', left_on='country_mapped', right_on='country')
region_gini_merge_2019.rename(columns={'2019':'cdp_year','2020':'cdp_year_+_1'}, inplace=True)
sns.heatmap(region_gini_merge_2019.corr(numeric_only=True), annot=True, ax=region_ax[1])
region_ax[1].set_title('2019 Region Score Correlation Matrix')

region_gini_merge_2020 = all_city_gini_data_2020[all_city_gini_data_2020.region_avg_score < .6].loc[:,['country_mapped','region_avg_score']].drop_duplicates()
region_gini_merge_2020 = region_gini_merge_2020.merge(gm_gini_scores[['country','2020','2021']], how='inner', left_on='country_mapped', right_on='country')
region_gini_merge_2020.rename(columns={'2020':'cdp_year','2021':'cdp_year_+_1'}, inplace=True)
sns.heatmap(region_gini_merge_2020.corr(numeric_only=True), annot=True, ax=region_ax[2])
region_ax[2].set_title('2020 Region Score Correlation Matrix')

plt.show()

In [None]:
# Bring all the data together to see what factors may help predict the Gini Index
ac_2018 = all_city_gini_data_2018.rename(columns={'country_x':'country','2018':'gini_cdp_year','question_weight':'yes_score'})
ac_2019 = all_city_gini_data_2019.rename(columns={'country_x':'country','2019':'gini_cdp_year','question_weight':'yes_score'})
ac_2020 = all_city_gini_data_2020.rename(columns={'country_x':'country','2020':'gini_cdp_year','question_weight':'yes_score'})
all_city_gini_data = pd.concat([ac_2018, ac_2019, ac_2020]).drop(['country_y','2019','2020','2021','country_mapped'], axis=1)
all_city_gini_data

In [None]:
cdp_region_categories = pd.CategoricalDtype(all_city_gini_data.cdp_region.dropna().unique())
all_city_gini_data['cdp_region'] = all_city_gini_data.cdp_region.astype(cdp_region_categories)

first_time_region_categories = pd.CategoricalDtype(all_city_gini_data.first_time_discloser.dropna().unique())
all_city_gini_data['first_time_discloser'] = all_city_gini_data.first_time_discloser.astype(first_time_region_categories)
print(all_city_gini_data.dtypes)
all_city_gini_data

In [None]:
# Regional factors Correlations:
f, ax = plt.subplots(figsize=(12,12))
sns.heatmap(all_city_gini_data.corr(numeric_only=True), annot=True)  # skip text and categorical columns
plt.show()

### A low region "Yes Score" is negatively correlated with the country's Gini score
This may indicate that areas with little or no effort toward sustainability are somehow correlated with conditions that also produce worse social equality (a higher Gini Index).

## Decision Tree Prediction of Gini Scores from Country Yes Scores and count of cities responding
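
For orientation, here is a minimal sketch of the general workflow on synthetic data: tune `max_depth` with a grid search on the training split, then score the fitted best estimator once on the held-out split. This is one common evaluation pattern and is not the notebook's own procedure, which cross-validates on the test split instead. The data, noise level, and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for (score, city count) -> Gini; not real data.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(200, 2))
y = 0.5 - 0.3 * X[:, 0] + rng.normal(0, 0.05, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Tune tree depth with 5-fold cross-validation on the training split only.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    {"max_depth": range(3, 7)},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_train)

# Score the already-fitted best tree on data it has never seen.
test_r2 = r2_score(y_test, search.best_estimator_.predict(X_test))
print(search.best_params_, round(test_r2, 2))
```

Scoring the fitted model on the held-out split measures how well the tuned tree generalizes, whereas cross-validating on the test split refits new trees on subsets of it.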

In [None]:
from sklearn import svm
from sklearn import tree
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

# Aggregate values on country
country_gini_data = all_city_gini_data.groupby(['country','year_reported_to_cdp']).agg({'country_avg_score':'mean','city':'count','gini_cdp_year':'mean'}).reset_index()

# Scale the country yes scores and the Gini values to the 0-1 range
scaler = MinMaxScaler()
country_gini_data[['country_avg_score','gini_cdp_year']] = scaler.fit_transform(country_gini_data[['country_avg_score','gini_cdp_year']])
country_gini_data

In [None]:
parameter_selction_scoring_type = 'neg_mean_squared_error'
model_scoring_type = 'r2'

# Set up prediction
x = country_gini_data[['country_avg_score','city']]
y = country_gini_data['gini_cdp_year']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

parameters = {'max_depth':range(3,7)}
clf = GridSearchCV(tree.DecisionTreeRegressor(), parameters, n_jobs=4, scoring=parameter_selction_scoring_type)
clf.fit(X=X_train, y=y_train)
tree_model = clf.best_estimator_
print(f'{parameter_selction_scoring_type}:\tBest{(clf.best_score_, clf.best_params_)}')

scores = cross_val_score(tree_model, X_test, y_test, cv=5, scoring=model_scoring_type)
print(f"{model_scoring_type}: {scores.mean():0.2f} (+/- {scores.std() * 2:0.2f})")

In [None]:
# Test countries with low scores
score_threshold = .6

# Set up prediction using only the low-scoring countries
low_score_filter = country_gini_data.country_avg_score <= score_threshold
x = country_gini_data.loc[low_score_filter, ['country_avg_score','city']]
y = country_gini_data.loc[low_score_filter, 'gini_cdp_year']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

parameters = {'max_depth':range(3,7)}
low_clf = GridSearchCV(tree.DecisionTreeRegressor(), parameters, n_jobs=4, scoring=parameter_selction_scoring_type)
low_clf.fit(X=X_train, y=y_train)
low_tree_model = low_clf.best_estimator_
print(f'{parameter_selction_scoring_type}:\tBest{(low_clf.best_score_, low_clf.best_params_)}')

scores = cross_val_score(low_tree_model, X_test, y_test, cv=5, scoring=model_scoring_type)
print(f"{model_scoring_type}: {scores.mean():0.2f} (+/- {scores.std() * 2:0.2f})")

In [None]:
fig = plt.figure(figsize=(25,20))
_ = tree.plot_tree(tree_model, 
                   feature_names=['country_avg_score','city'],  
                   class_names=['gini_cdp_year'],
                   filled=True)

# Conclusion
My analysis of a "Yes Score" as an indicator of the level of sustainability effort offers one method for comparing the relative efforts of cities, countries, and regions. It does not seem to be a predictor of the factors that lead to better social equality as measured by the Gini Index, even though it is negatively correlated with the Gini Index, and it is only weakly correlated with C40 city emissions.

Since the country and region Yes Scores are negatively correlated with the Gini Index, it seems that efforts tend to be higher in environments that also foster better social equality as measured by the Gini Index. The Yes Score is not a good predictor of the Gini Index, though. At the region level, a low Yes Score (indicating a low level of effort toward sustainability as measured by the CDP Yes/No questions) is negatively correlated with the Gini Index, so countries making less effort may also have an environment that is worse for social equality.