<table>
    <tr>
        <td style="align:center;font-size:20pt;  font-weight: 600;"> CDP: <br>Unlocking <br>Climate <br>Solutions
        </td>
        <td>
<img src="https://upload.wikimedia.org/wikipedia/commons/a/a0/Climate_change_mitigation_icon.png" alt="wikimedia" width="150"/></td>
    </tr>
</table>
<br>
    

<p style="text-align: justify;font-style: italic;">CDP is an organization that runs a global environmental disclosure system. Every year, many cities and companies take part in its surveys on the climate, its consequences and their commitment to it. CDP asks how companies and cities are dealing with the climate emergency. By extracting information from the various responses and their results, we hope to improve the climate transition. </p>
<hr />
<br>

<h2 style="color:#FFA57D">> Goal:</h2>
The aim of this work is to find KPIs at the intersection between the <b>environment</b> and <b>social</b> challenges of climate change. In this notebook, we tackle the cities' answers. Note that section 1.2 is a storyline in which few results are found, but it is how I started the search; I keep it to show the progress of the work. If you only want to see the KPIs, go straight to section 2.



## Outlines:
- [1. How cities Deal with Climate Change: Dataset Exploration](#1) 
    * [1.1. Relating Demography, Budget and Social Risks: What are the Numerical Key Values?](#11)
      + [1.1.1. Administrative Boundary](#111)
      + [1.1.2. Population](#112) 
      + [1.1.3. Governance & Data Management](#113)
      + [1.1.4. Climate Risk and Vulnerability Assessment](#114)
      + [1.1.5. City-wide Emissions](#115)
      + [1.1.6. Emission Reduction](#116)
      + [1.1.7. Collaborations](#117)
      + [1.1.8 Energy](#118)
      + [1.1.9. Transports](#119)
    * [1.2. Data Analysis: Clusters, Features Correlations](#12)
      + [1.2.1 Simple K-means to get some clusters and plot on a map](#121)
      + [1.2.2 Improving K-means to get KPIs](#122) 
      + [1.2.3 A Decision Tree by Using GHG Reduction as Target](#123)
- [2. Intersection between Environment and Social Challenge for the Cities](#2)
    * [2.1 Logistic Regression for KPI Extractions](#21)
    * [2.2 Results](#22)
    * [2.3 Conclusion](#23)
- [3. Other Datasets](#3)
    * [3.1 Adding Country Information to Cities](#31)
    * [3.2 Country Information Only](#32)
<hr />


In [None]:
#all the imports required for the notebook are given in this block
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import HTML, Markdown, display
import seaborn as sns
import random
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import geopandas as gpd
from shapely.geometry import Point
from  sklearn.metrics import davies_bouldin_score
from sklearn.decomposition import PCA
from matplotlib.path import Path
from matplotlib.spines import Spine
from matplotlib.projections.polar import PolarAxes
from matplotlib.projections import register_projection
from matplotlib.patches import Circle, RegularPolygon
from matplotlib.transforms import Affine2D
from sklearn import tree
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
#centering figures
HTML("""<style> .output_png { display: table-cell; text-align: center; vertical-align: middle;}</style>""")

#my colors
colors= ['#003f5c','#2f4b7c','#665191','#a05195','#d45087','#f95d6a','#ff7c43','#ffa600','#fcca46','#a1c181','#619b8a','#386641']

In [None]:
# import the datasets, skipping the columns that are not used in this notebook
cities_dir = "/kaggle/input/cdp-unlocking-climate-solutions/Cities/Cities Responses/"
unused_columns = ['Questionnaire','Year Reported to CDP','File Name','Last update','Comments']

def load_cities(year):
    # usecols accepts a callable that is evaluated on each column name
    return pd.read_csv(f"{cities_dir}{year}_Full_Cities_Dataset.csv", usecols=lambda c: c not in unused_columns)

cities_2018 = load_cities(2018)
cities_2019 = load_cities(2019)
cities_2020 = load_cities(2020)

geo_cities_2020 = pd.read_csv("/kaggle/input/cdp-unlocking-climate-solutions/Cities/Cities Disclosing/2020_Cities_Disclosing_to_CDP.csv", usecols=["Account Number","City Location"])



<a id='1'></a>
<h1 style="color:#084FD0"> 1. How cities Deal with Climate Change</h1>
<p style="text-align: justify;">In this section, we focus on cities. Recall that the objective of this work is to find correlations between the recorded values in order to interpret the numerous responses to the questionnaires. To do this, we first examine the numerical values and the categorical values that can easily be transformed into numerical values. We first try to extract KPIs from data that incorporate city answers only. Then, in section 1.2, we analyze information from the textual responses. There are methods for extracting topics that are useful to get an overview of the reflections. </p>

<h2 style="color:#7209b7"> 1.1. Relating Demography, Budget and Social Risks: What are the Numerical Key Values?</h2>

<p style="text-align: justify;">Our first interest goes to the easy-to-interpret responses, i.e., the data that are either numerical or categorical. Indeed, the questionnaires contain a lot of textual information that requires deep data processing. With numerical values, we aim to give a first correlation between important climate keys. </p>

In [None]:
cities_2020.head(2)

In [None]:
cities_2020.shape

Each row is one city's answer to one question. There are 869313 rows. Not all cities respond to all questions. The next cell shows the mean number of answers per city, as well as the minimum and maximum.

In [None]:
cities_2020["Account Number"].value_counts().describe()

In order to get correlations between the values, we should group the values per `Account Number`. Remember that in this first part, we only use the numerical and categorical answers.

<h2 style="color:#FFA57D"> Analyzing each question, one by one</h2>

In order to extract numerical information, we look at every question to obtain a dataset in the format:

| Account Number | Feature 1 | Feature 2 | ... | Feature n| 
| --- | --- | ---| ---| --- |
| 68296 | 1.0 | 3304 | ... | 20394 |
|  ... | ... |  ... | ... |  ... |

where each feature <i>i</i> is the result of a question. For this purpose, I have created a generic function that you can find below.

In [None]:
def extract_answer(df, question, newColumnName, condition=True):
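    '''
    Extract the answers to one question.
    @df: (DataFrame) full cities dataset
    @question: (string) question number, e.g. '0.5'
    @newColumnName: (string) name given to the `Response Answer` column
    @condition: (boolean Series) extra row filter, used when the question has several outputs
    @return: DataFrame with columns `Account Number` and `newColumnName`
    '''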
    result_df = df[(df['Question Number'] == question) & condition ][["Account Number","Response Answer"]]
    result_df.columns = [newColumnName if x=='Response Answer' else x for x in result_df.columns]
    return result_df
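
As a quick check of how `extract_answer` behaves, here is a minimal sketch on a toy DataFrame. The rows are made up for illustration and are not CDP data; only the column names follow the dataset used above, and the function is repeated so that the snippet runs on its own.

```python
import pandas as pd

def extract_answer(df, question, newColumnName, condition=True):
    # same logic as the function above: filter one question, keep two columns
    result_df = df[(df['Question Number'] == question) & condition][["Account Number", "Response Answer"]]
    return result_df.rename(columns={"Response Answer": newColumnName})

# toy long-format data: one row per (city, question, column) answer
toy = pd.DataFrame({
    "Account Number": [1, 1, 2, 2],
    "Question Number": ["0.5", "0.5", "0.5", "0.1"],
    "Column Name": ["Current population", "Projected population",
                    "Current population", "Administrative boundary"],
    "Response Answer": ["1000", "1200", "500", "City"],
})

current_pop = extract_answer(toy, "0.5", "Current population",
                             condition=(toy["Column Name"] == "Current population"))
print(current_pop)
```

The result keeps one row per city that answered, which is the shape needed to merge several features on `Account Number` afterwards.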


<a id="111"></a>
<h3 style="color:#D00892"> 1.1.1 Administrative Boundary </h3>
The first piece of information recorded in the form is the type of city (Question 0.1). It is a categorical value that we transform into numerical values. But first, let's have a look at the distribution.

In [None]:
# General description of your city's reporting boundary.
cities_2020_admin_boundary = extract_answer(cities_2020,'0.1','Administrative boundary', condition = (cities_2020['Column Name'] == 'Administrative boundary'))
# group all free-text "Other..." answers under a single "Other" category
cities_2020_admin_boundary['Administrative boundary'] = cities_2020_admin_boundary['Administrative boundary'].replace('^(Other).*','Other',regex=True)
cities_2020_admin_boundary.head(3)

In [None]:
# show a pie chart of the distribution
cities_2020_admin_boundary['Administrative boundary'].value_counts().plot.pie(textprops={'color':"w"},pctdistance=0.7,autopct='%.2f%%',figsize=(6,6),colors=colors, labels=None)
plt.title("Administrative Boundary Distribution ",fontsize=17,ha='left')
plt.legend(labels=cities_2020_admin_boundary['Administrative boundary'].value_counts().index, loc="best",bbox_to_anchor=(1, 0.25, 0.5, 0.5))
plt.show()

We observe that most cities are of the City or Municipality type.

<a id="112"></a>
<h3 style="color:#D00892"> 1.1.2. Population </h3>
First, we find the current population and the projected population (Question 0.5). What do the cities plan for: a rising or a decreasing population?

In [None]:
# extract current population from answer of question 0.5 
cities_2020_numerical_current_pop = extract_answer(cities_2020,'0.5','Current population', condition = (cities_2020['Column Name'] == 'Current population'))

# extract projected population from answer of question 0.5
cities_2020_numerical_projected_pop = extract_answer(cities_2020,'0.5','Projected population', condition = (cities_2020['Column Name'] == 'Projected population'))

# merge two answers in a dataframe
cities_2020_numerical_pop = cities_2020_numerical_current_pop.merge(cities_2020_numerical_projected_pop)
cities_2020_numerical_pop.head(3)

We get two features: "Current population" and "Projected population". Let's compute the difference as a percentage of the current population to see whether cities plan to grow.
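
On toy values, the percentage difference can be sketched as follows. This is a minimal illustration rather than the notebook's exact code: it assumes the answers are stored as strings and uses `pd.to_numeric(..., errors="coerce")` so that missing answers stay `NaN` instead of being filled in. The `toy_pop` frame is made up.

```python
import numpy as np
import pandas as pd

toy_pop = pd.DataFrame({
    "Account Number": [1, 2, 3],
    "Current population": ["1000", "2000", None],
    "Projected population": ["1100", "1900", "500"],
})

current = pd.to_numeric(toy_pop["Current population"], errors="coerce")
projected = pd.to_numeric(toy_pop["Projected population"], errors="coerce")

# relative change in percent; NaN propagates when either value is missing
toy_pop["Population Projection Difference"] = (projected - current) * 100 / current
print(toy_pop)
```

With this variant, a city that left one of the two fields empty ends up with a missing difference, so no sentinel values need to be replaced afterwards.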

In [None]:
# compute the difference
cities_2020_numerical_pop["Population Projection Difference"] = (cities_2020_numerical_pop["Projected population"].fillna(0).astype(int)-cities_2020_numerical_pop["Current population"].fillna(0).astype(int))*100/ cities_2020_numerical_pop["Current population"].fillna(1).astype(int)

# some cities did not fill in this information, which yields wrong values; replace them with NaN
cities_2020_numerical_pop["Population Projection Difference"] = cities_2020_numerical_pop["Population Projection Difference"].replace([0, 1], np.nan)

# give a plot of the plans of the cities
sns.set_style("whitegrid",{'axes.grid' : False})
fig = plt.figure(figsize=(12,6))
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)

fig.suptitle("Population Projection Difference: What are the plans?", fontsize=18)
sns.boxplot(data=cities_2020_numerical_pop["Population Projection Difference"], palette=["#2f4b7c"],ax=ax1)
ax1.set_xlabel(xlabel='All Cities',fontsize=15)

# remove extremes values to zoom 
ax2.set_xlabel(xlabel='Excluding Extremes',fontsize=15)
cities_2020_numerical_pop_diff_without_huge = cities_2020_numerical_pop[cities_2020_numerical_pop["Population Projection Difference"]<300]
sns.boxplot(data=cities_2020_numerical_pop_diff_without_huge["Population Projection Difference"], palette=["#ffa600"],ax=ax2)

plt.show()

Despite a mean growth of 4% of the current population size, the plot above shows that several cities expect to grow a lot. We wonder whether these plans are related to the cities' decisions about climate change!

<a id="113"></a>
<h3 style="color:#D00892"> 1.1.3. Governance & Data Management</h3>
In section (1), the form addresses how far the governance of the cities incorporates climate-change-related goals.

In [None]:
# extract Governance & Data Management answer of question 1.0 
cities_2020_gov_and_data = extract_answer(cities_2020,'1.0','Governance & Data Management')

# compare to 2019
cities_2019_gov_and_data = extract_answer(cities_2019,'1.0','Governance & Data Management')

# show distribution
cities_2019_2020_gov_and_data = pd.DataFrame({'2019': cities_2019_gov_and_data['Governance & Data Management'].value_counts(),
                               '2020': cities_2020_gov_and_data['Governance & Data Management'].value_counts()})
#plot it!
fig = plt.figure(figsize=(9,6))
fig.suptitle("Governance & Data Management: Does your city incorporate sustainability goals and targets (e.g. GHG reductions) into the master planning for the city?", fontsize=18)
ax1 = fig.add_subplot(111)
cities_2019_2020_gov_and_data.plot.barh(color=['#2f4b7c','#ffa600'], ax=ax1, edgecolor='w',linewidth=1.3)
ax1.yaxis.grid(False)
ax1.xaxis.grid(False)
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)

We observe that, over time, more cities declare that they incorporate sustainability goals and targets (e.g. GHG reductions) into their master planning.

<a id="114"></a>
<h3 style="color:#D00892">  1.1.4 Climate Risk and Vulnerability Assessment</h3>
In section (2), the form focuses on the assessment of the risks of climate change.

In [None]:
# extract answer of question 2.0 
cities_2020_risk = extract_answer(cities_2020,'2.0','Climate Change')

# compare to 2019
cities_2019_risk = extract_answer(cities_2019,'2.0','Climate Change')

# show distribution
cities_2019_2020_risk = pd.DataFrame({'2019': cities_2019_risk['Climate Change'].value_counts(),
                               '2020': cities_2020_risk['Climate Change'].value_counts()})
fig = plt.figure(figsize=(8,4))
fig.suptitle("Has a climate change risk or vulnerability assessment been undertaken for your city?", fontsize=17)
ax1 = fig.add_subplot(111)
cities_2019_2020_risk.plot.barh(color=['#2f4b7c','#ffa600'], ax=ax1, edgecolor='w',linewidth=1.3)
ax1.yaxis.grid(False)
ax1.xaxis.grid(False)
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)

When a city answers `Yes` to the previous question, it is asked for more details about the assessment, such as the processes and methods used to carry it out.

In [None]:
# Select the primary process or methodology used to undertake the risk or vulnerability assessment of your city.
cities_2020_risk_detail = extract_answer(cities_2020,'2.0a','Climate Change Detail')

# when I use 'df' it means that it's only for plots
cities_2020_risk_detail = cities_2020_risk_detail.groupby('Climate Change Detail').filter(lambda x : len(x)>3)
df= cities_2020_risk_detail[cities_2020_risk_detail['Climate Change Detail']!="Question not applicable"]

plt.title("The primary process or methodology used to undertake the risk or vulnerability assessment of the cities:",fontsize=17,ha="center")
ax=df['Climate Change Detail'].value_counts().plot.pie(textprops={'color':"#000000"},pctdistance=1.18,autopct='%.2f%%',figsize=(6,6),colors=colors, labels=None)
ax.set_ylabel('')
plt.legend(labels=df['Climate Change Detail'].value_counts().index, loc="best",bbox_to_anchor=(1.1, 0.4, 0.5, 0.5))
centre_circle = plt.Circle((0,0),0.8,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
plt.show()

<p>There are 4 main methods: <i>State or region vulnerability and risk assessment methodology</i>, <i>IPCC climate change impact assessment guidance</i>, <i>Proprietary Methodology</i>, and <i>Agency specific vulnerability and risk assessment methodology</i>, which are very vague and similar. This answer might not be interesting to consider. </p>

<p>An interesting question about risk and vulnerability assessment is what supports or hinders the cities' ability to adapt. Below we see the different factors that are reported. We notice that the answers are very varied.  </p>

In [None]:
# Identify and describe the factors that most greatly affect your city's ability to adapt to climate change and indicate how those factors either support or challenge this ability.
cities_2020_risk_factor_ability_adapt = extract_answer(cities_2020,'2.2','Factor Affect Climate Adaptation', condition=(cities_2020['Column Number'] == 1))
cities_2020_risk_factor_ability_adapt['Factor Affect Climate Adaptation'] = cities_2020_risk_factor_ability_adapt['Factor Affect Climate Adaptation'].replace('^(Other).*','Other',regex=True)
cities_2020_risk_factor_ability_adapt=cities_2020_risk_factor_ability_adapt.groupby('Factor Affect Climate Adaptation').filter(lambda x : len(x)>3)

df=cities_2020_risk_factor_ability_adapt['Factor Affect Climate Adaptation'].value_counts().to_frame().transpose()
fig = plt.figure(figsize=(22,12))
ax1 = fig.add_subplot()
df.plot(kind='barh',stacked=True,legend=False, color=colors, ax=ax1, grid=False, width=0.04)
ax1.set_ylabel('')
for p in range(0,len(ax1.patches)):
    b = ax1.patches[p].get_bbox()
    ax1.annotate(df.columns[p] , ((b.x0 + b.x1)/2 - 0.2 , b.y1 + 0.01),rotation=-280,fontsize=13)
ax1.set_xticklabels([])
ax1.set_yticklabels([])
ax1.set_frame_on(False)
ax1.tick_params(tick1On=False)
plt.title("Factors that Affect Ability to Adapt",fontsize=18)
plt.show()


<a id="115"></a>
<h3 style="color:#D00892">  1.1.5 City-wide Emissions </h3> 

In [None]:
# Does your city have a city-wide emissions inventory to report?
cities_2020_emission = extract_answer(cities_2020,'4.0','City-wide emissions')

# plot it!
df = cities_2020_emission['City-wide emissions'].value_counts()
explode=[0.2] * len(df)  # one offset per category, whatever their number
ax =df.plot.pie(explode=explode,pctdistance=0.7,autopct='%.2f%%',figsize=(6,6),colors=colors, labels=df.index )
listOfWhiteText=[]
for a in ax.texts:
    if "%" in a.get_text(): 
        listOfWhiteText.append(a)
plt.setp(listOfWhiteText, **{'color':'white'})
ax.set_ylabel('')
plt.title("Does your city have a city-wide emissions inventory to report?",fontsize=18)
plt.show()


<a id="116"></a>
<h3 style="color:#D00892"> 1.1.6. Emissions Reduction </h3>
This section is interesting because it contains many select-type questions. We show some of the questions and the distributions of their answers.

In [None]:
# Do you have a GHG emissions reduction target(s) in place at the city-wide level?
cities_2020_GHG_reduction = extract_answer(cities_2020,'5.0','GHG Reduction')
df1 = cities_2020_GHG_reduction['GHG Reduction'].value_counts()

# Is your city-wide emissions reduction target(s) conditional on the success of an externality or component of policy outside of your control?
cities_2020_external_control_reduction = extract_answer(cities_2020,'5.2','External Control Reduction')
df2 = cities_2020_external_control_reduction['External Control Reduction'].value_counts()

# Does your city-wide emissions reduction target(s) account for the use of transferable emissions units?
cities_2020_transferable_emission = extract_answer(cities_2020,'5.3','Transferable Emission')
df3 = cities_2020_transferable_emission['Transferable Emission'].value_counts()

# Does your city have a climate change mitigation or energy access plan for reducing city-wide GHG emissions?
cities_2020_GHG_mitigation_plan = extract_answer(cities_2020,'5.5','GHG Mitigation Plan')
df4 = cities_2020_GHG_mitigation_plan['GHG Mitigation Plan'].value_counts()

fig = plt.figure(figsize=(15,9))
plt.suptitle("Emission Reduction",fontsize=18)
ax1 = fig.add_subplot(221)
ax2 = fig.add_subplot(222)
ax3 = fig.add_subplot(223)
ax4 = fig.add_subplot(224)

df1.plot.pie(pctdistance=0.7,autopct='%.2f%%',colors=colors[3:], labels=df1.index, ax=ax1,ylabel="" )
ax1.title.set_text("Do you have a GHG emissions reduction target(s) in \n place at the city-wide level?")
df2.plot.pie(pctdistance=0.7,autopct='%.2f%%',colors=colors[5:], labels=df2.index, ax=ax2,ylabel="" )
ax2.title.set_text("Is your city-wide emissions reduction target(s) conditional \n on the success of an externality or component of policy outside of your control?")              
df3.plot.pie(pctdistance=0.7,autopct='%.2f%%',colors=colors[1:], labels=df3.index, ax=ax3,ylabel="" )
ax3.title.set_text("Does your city-wide emissions reduction target(s) account \n for the use of transferable emissions units?")
df4.plot.pie(pctdistance=0.7,autopct='%.2f%%',colors=colors[8:], labels=df4.index, ax=ax4,ylabel="" )
ax4.title.set_text("Does your city have a climate change mitigation or \n energy access plan for reducing city-wide GHG emissions?")
plt.show()


<a id="117"></a>
<h3 style="color:#D00892"> 1.1.7. Collaborations</h3>
We all know that companies have a huge impact on decisions. The survey covers this aspect by asking the cities whether they work with private businesses. The results are encouraging.

In [None]:
# Does your city collaborate in partnership with businesses in your city on sustainability projects?
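# note: this assignment reuses the variable name from section 1.1.6 and overwrites the transferable-emission answers
cities_2020_transferable_emission = extract_answer(cities_2020,'6.2','Collaborations')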
sums = cities_2020_transferable_emission['Collaborations'].value_counts()

fig = plt.figure(figsize=(10,2))
plt.suptitle("Collaborations",fontsize=18)
plt.scatter(sums.index,[1]*len(sums),[x*6 for x in list(sums)],color=colors[::-1][3:3+len(sums)])
plt.box(on=None)
plt.yticks([])
plt.ylim(0.2, 1.5)
plt.xticks(rotation=-20,y=0.4, ha="left" )
plt.show()


<a id="118"></a>
<h3 style="color:#D00892"> 1.1.8. Energy</h3>
Now comes the strongest part of the form: where does the energy of the cities come from?

In [None]:
# Does your city have a renewable energy or electricity target?
cities_2020_renewable_energie = extract_answer(cities_2020,'8.0','Renewable Energie Target')
sums = cities_2020_renewable_energie['Renewable Energie Target'].value_counts()

fig = plt.figure(figsize=(10,3))
plt.suptitle("Renewable Energy Target",fontsize=18)
plt.hlines(sums.index, 0, sums, color=colors[7:],linewidth=3)
plt.plot(sums, sums.index, 'o',color=colors[1])
plt.box(on=None)
plt.xticks([])
plt.show()

Question 8.1 of the survey tackles the different sources of electricity of the cities. For 566 cities, we show below the number of answers for each source. 

In [None]:
# 8.1 Please indicate the source mix of electricity consumed in your city.
cities_2020_energie_info =  cities_2020[(cities_2020['Question Number'] == '8.1')   ]
df_energie_info = cities_2020_energie_info[cities_2020_energie_info['Response Answer'].notna()]['Column Name'].value_counts().to_frame()
df_energie_info.columns=['Number of Answers']
display(HTML(df_energie_info.to_html()))

More than half of the cities filled in the table of energy sources! What does this question highlight?

In [None]:
# all sources in a unique dataframe
cities_2020_sources_energy = extract_answer(cities_2020,'8.1',df_energie_info.index[0], condition=(cities_2020["Column Name"]==df_energie_info.index[0]))
for source in df_energie_info.index[1:10]:
    cities_2020_sources_energy = cities_2020_sources_energy.merge(extract_answer(cities_2020,'8.1',source, condition=(cities_2020["Column Name"]==source)))

# data are not numeric
for c in cities_2020_sources_energy.columns[1:]:
    cities_2020_sources_energy[c] =  pd.to_numeric(cities_2020_sources_energy[c])
    
# plot the mean share of each source
df = cities_2020_sources_energy.transpose().drop(index='Account Number')
ax = df.mean(axis=1).plot.pie(colors=colors,pctdistance=0.7,autopct='%.2f%%',figsize=(7,7))
listOfWhiteText=[]
for a in ax.texts:
    if "%" in a.get_text(): 
        listOfWhiteText.append(a)
plt.setp(listOfWhiteText, **{'color':'white'})
ax.set_ylabel('')
plt.title("Mean of Percentages of Energy Sources", fontsize=18)
plt.show()

We see that <i>Hydro</i> and <i>Gas</i> are the two main sources for the cities that answered the question.


<a id="119"></a>
<h3 style="color:#D00892"> 1.1.9. Transports</h3>
Transport pollutes. Let's have a look at the transport decisions of the cities.

In [None]:
#What is the mode share of each transport mode in your city for passenger transport?
def transport_plot(dataset,ax,date):
    dataset_transports =  dataset[(dataset['Question Number'] == '10.1')   ]
    df_transports = dataset_transports[dataset_transports['Response Answer'].notna()]['Column Name'].value_counts().to_frame()

    dataset_transports = extract_answer(dataset,'10.1',df_transports.index[0], condition=(dataset["Column Name"]==df_transports.index[0]))
    for source in df_transports.index[1:10]:
        dataset_transports = dataset_transports.merge(extract_answer(dataset,'10.1',source, condition=(dataset["Column Name"]==source)))

    # data are not numeric
    for c in dataset_transports.columns[1:]:
        dataset_transports[c] =  pd.to_numeric(dataset_transports[c], errors='coerce')


    # plot the mean share of each transport mode
    df = dataset_transports.transpose().drop(index='Account Number', axis=0)
    df.mean(axis=1).plot.pie(colors=colors[5:],pctdistance=0.7,autopct='%.2f%%',ax=ax)
    listOfWhiteText=[]
    for a in ax.texts:
        if "%" in a.get_text(): 
            listOfWhiteText.append(a)
    plt.setp(listOfWhiteText, **{'color':'white'})
    ax.set_ylabel('')
    ax.set_xlabel(xlabel=date,fontsize=15)
    return ax, dataset_transports

# prepare plot
fig = plt.figure(figsize=(15,6))
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
axa, cities_2019_transports = transport_plot(cities_2019,ax1,"Transports in 2019")
axb, cities_2020_transports = transport_plot(cities_2020,ax2,"Transports in 2020")
plt.show()




Sadly, we see that from 2019 to 2020 the mean percentage of <i>Private Motorized Transport</i> grows. We also see a new category of transport, <i>Micro-Mobility</i>.

<a id='12'></a>
<h2 style="color:#7209b7">1.2 Data Analysis and KPI Extraction 🔎 </h2>
Now that we have loaded and briefly analyzed some questions of the survey, we can go to the next step: data mining! First we merge the different features and list them. 

In [None]:
# recap of our datasets
ready_dataset = [cities_2020_numerical_pop,cities_2020_sources_energy,cities_2020_transports]
to_transform_into_dummies = [cities_2020_renewable_energie,cities_2020_GHG_mitigation_plan,cities_2020_transferable_emission,cities_2020_external_control_reduction,cities_2020_GHG_reduction,cities_2020_admin_boundary,cities_2020_gov_and_data,cities_2020_risk,cities_2020_risk_detail,cities_2020_risk_factor_ability_adapt,cities_2020_emission]

# transform categorical data into dummies
for df in to_transform_into_dummies:
    df = pd.get_dummies(df, columns=[df.columns[1]], prefix=[df.columns[1]]).groupby(['Account Number'], as_index=False).sum() 
    ready_dataset.append(df)
    
# merge all dataset!
all_num_2020 = ready_dataset[0] 
for df in ready_dataset[1:]:
    all_num_2020 = pd.merge(all_num_2020, df, on='Account Number', how="outer")

# preview the merged dataset (the merge above is an outer join)
all_num_2020.head(2)

We have used get_dummies for the categorical variables (see [what is get_dummies?](https://pbpython.com/categorical-encoding.html)). The list of the features is the following:
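
To make the encoding step concrete, here is a minimal sketch of `get_dummies` followed by the per-city aggregation, on made-up data. The column name `Factor` and its values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "Account Number": [1, 1, 2],
    "Factor": ["Budget", "Housing", "Budget"],
})

# one indicator column per category, then one row per city
dummies = pd.get_dummies(toy, columns=["Factor"], prefix=["Factor"])
per_city = dummies.groupby(["Account Number"], as_index=False).sum()
print(per_city)
```

A city that gave several answers to the same question (city 1 here) gets a count in each matching indicator column, so multi-answer questions are preserved in a single row.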

In [None]:
display(Markdown(pd.Series(all_num_2020.columns, name="Features").to_markdown()))

<h3 style="color:#8B4513"> Correlation between Features </h3>
We check whether some columns are too strongly correlated with each other and, if so, remove them.

In [None]:
f = plt.figure(figsize=(7,7))
plt.matshow(all_num_2020.corr(),fignum=1)
plt.xticks([])
plt.yticks([])
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.title('Correlation Matrix', fontsize=18);

In [None]:
# Select upper triangle of correlation matrix
# source: https://stackoverflow.com/questions/29294983/how-to-calculate-correlation-between-all-columns-and-remove-highly-correlated-on
corr_matrix = all_num_2020.corr()
# np.bool was removed from recent NumPy versions, the builtin bool is used instead
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Find features with correlation greater than 0.9
to_drop = [column for column in upper.columns if any(upper[column] > 0.9) and column!="Account Number"]

# Drop features 
all_num_2020_reduced = all_num_2020.drop(to_drop, axis=1)
all_num_2020_reduced.shape
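
The upper-triangle filter used above can be checked on a small synthetic example. This is only a sketch: the three columns are generated at random, with `a_copy` built to be almost perfectly correlated with `a`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a * 2 + 0.001 * rng.normal(size=200),  # near-duplicate of "a"
    "b": rng.normal(size=200),                       # independent feature
})

corr = toy.corr()
# keep only the part strictly above the diagonal so that each pair is seen once
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)
```

Because only the upper triangle is inspected, the second column of a correlated pair is dropped and the first one is kept. Note that the test `> 0.9` ignores strong negative correlations; using the absolute value of the matrix would catch those too.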

<a id="121"></a>
<h3 style="color:#D00892">1.2.1. Simple K-means to get some clusters and plot on a map</h3>

I first used K-means to create clusters. To choose the number of clusters, I run K-means for 2 to 9 clusters and compute the Davies-Bouldin score. The lowest score indicates the best clustering according to this metric.
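
As an illustration of this selection rule, here is a minimal sketch on synthetic data (three well-separated blobs generated with `make_blobs`, not the CDP features). The values of `n_init` and `random_state` are arbitrary choices for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# synthetic data: three well-separated groups
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = davies_bouldin_score(X, labels)

# lower Davies-Bouldin score means more compact, better separated clusters
best_k = min(scores, key=scores.get)
print(best_k, scores)
```

On real data the curve is rarely this clear-cut, so the minimum is best read as a hint rather than as a definitive answer.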


In [None]:
# replace NaN to -1
all_num_2020_reduced.fillna(-1, inplace=True)

# normalizing data
scaler = StandardScaler()
scaled_features = scaler.fit_transform(all_num_2020_reduced)

# choose 5 because many categorical variables have 5 choices, arbitrary 
#kmeans = KMeans( n_clusters=5).fit(scaled_features)

dbs = {} 
for k in range(2, 10):
    kmeans = KMeans(n_clusters=k,random_state=5).fit(scaled_features)
    dbs[k] = davies_bouldin_score(scaled_features,kmeans.labels_)
plt.figure()
plt.plot(list(dbs.keys()), list(dbs.values()))
plt.xlabel("Number of clusters")
plt.ylabel("Davies-Bouldin score")
plt.title("How many clusters should we keep?", fontsize=18)
plt.show()

kmeans1 = KMeans(n_clusters=2,random_state=5).fit(scaled_features)

We then choose to keep 2 clusters, which gives a convenient split between cities that are heading in the right direction on climate change and cities that do not tackle the problem yet.<br>  Before analyzing the feature importances of the clusters, I decided to plot the cities on a map. Each color on the map corresponds to a cluster.

In [None]:
# concat label to Account Number
concatenation_labels_an = pd.concat([all_num_2020_reduced["Account Number"],pd.Series(kmeans1.labels_)],axis=1)
# work on a copy so that adding a column does not trigger a chained-assignment warning
geo_cities_2020_noNa = geo_cities_2020.dropna(axis=0).copy()

# parse the "POINT (x y)" strings into shapely Points with the WKT parser instead of eval
from shapely import wkt
geo_cities_2020_noNa["geometry"] = geo_cities_2020_noNa["City Location"].apply(wkt.loads)
geo_cities_2020_noNa = geo_cities_2020_noNa.drop(["City Location"],axis=1)

print(geo_cities_2020_noNa[geo_cities_2020_noNa['Account Number']==50378])
merge_cluster_an_geo = pd.merge(concatenation_labels_an, geo_cities_2020_noNa, on='Account Number', how="inner")
merge_cluster_an_geo = merge_cluster_an_geo.dropna(axis=0)

# start to plot! 
world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))

# We plot the whole world as a background.
ax = world.plot(color="#303030", edgecolor='white',figsize=(18,10))

# We can now plot our ``GeoDataFrame`` for the different clusters (column=0)
gpd.GeoDataFrame(merge_cluster_an_geo[merge_cluster_an_geo[0]==1]).plot(ax=ax, color=colors[5])
gpd.GeoDataFrame(merge_cluster_an_geo[merge_cluster_an_geo[0]==0]).plot(ax=ax, color="#ffd500")
plt.box(on=None)
plt.xticks([])
plt.yticks([])

# to show the value: uncomment
#geo_cities_2020_noNa[geo_cities_2020_noNa["Account Number"]==50378].apply(lambda x: ax.annotate(s=x["Account Number"],  xy=x.geometry.centroid.coords[0], ha='center', color="w"),axis=1);

See the dots at the bottom... Number 50378 is "San José, Costa Rica". Yes, in Latin America. After some research on CRS (https://geopandas.org/projections.html), I realized that the longitude and latitude are reversed for this city. This is also the case for other cities, but not all of them... Good luck!
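
One possible way to repair such a point, once a city is known to be affected, is to swap its two coordinates. The sketch below uses `shapely.ops.transform`; the helper `swap_xy` and the coordinates are hypothetical, and deciding which cities need the swap still has to be done by hand.

```python
from shapely.geometry import Point
from shapely.ops import transform

def swap_xy(geom):
    # exchange the x and y coordinates of a shapely geometry (hypothetical helper)
    return transform(lambda x, y: (y, x), geom)

# made-up point whose two coordinates were stored in the wrong order
stored = Point(10.0, -84.0)
fixed = swap_xy(stored)
print(fixed.x, fixed.y)
```

Applied to a whole column, this could look like `df.loc[mask, "geometry"].apply(swap_xy)`, where `mask` selects the account numbers identified as reversed.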

In fact, we see that we cannot conclude anything about the location of the clusters. However, the cities of the survey are quite often near the oceans! 

Now we look at the centroids of the clusters and point out the differences between them. To do so, we compute the absolute difference between the two centroids and list the features for which this difference is the largest.
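
The idea can be sketched on synthetic data in which the two groups differ on a single feature. The feature names `feature_a` and `feature_b` are made up, and the K-means settings are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# synthetic data: the two groups differ on "feature_a" only
group0 = np.column_stack([rng.normal(0, 1, 100), rng.normal(0, 1, 100)])
group1 = np.column_stack([rng.normal(10, 1, 100), rng.normal(0, 1, 100)])
toy = pd.DataFrame(np.vstack([group0, group1]), columns=["feature_a", "feature_b"])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy)

# absolute gap between the two centroids, one value per feature
diff = pd.Series(abs(km.cluster_centers_[0] - km.cluster_centers_[1]), index=toy.columns)
print(diff.sort_values(ascending=False))
```

The feature with the largest gap is the one that separates the clusters most. This only compares features fairly when they share a common scale, which is why the data are standardized before clustering.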

In [None]:
# cluster 0 - cluster 1 give us the difference between the centroids 
difference_between_centroids = abs(kmeans1.cluster_centers_[0] - kmeans1.cluster_centers_[1])

# add the labels
difference_between_centroids_df = pd.Series(difference_between_centroids, index=all_num_2020_reduced.columns)
difference_between_centroids_df

# print the top highest feature importances
difference_between_centroids_df = difference_between_centroids_df.sort_values(ascending=False)
difference_between_centroids_df.head(10)

Interesting results! We see that the features are strong. Now, we plot a 2D version of the data points to see whether the centroids represent the clusters well. If so, we are on a good path to extract some KPIs. (I am using PCA to reduce the dimensionality to 2. PCA is often applied before K-means... maybe I should do that.)
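
For reference, running PCA before K-means could look like the sketch below. It is not what this notebook does: the random matrix `X` is only a stand-in for the scaled features, and the number of components and clusters are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))  # synthetic stand-in for the feature matrix

X_scaled = StandardScaler().fit_transform(X)
X_2d = PCA(n_components=2, random_state=1).fit_transform(X_scaled)

# cluster in the reduced space instead of the full feature space
labels = KMeans(n_clusters=2, n_init=10, random_state=5).fit_predict(X_2d)
print(X_2d.shape, np.bincount(labels))
```

The trade-off is that centroids then live in the space of principal components, so they can no longer be read directly as differences between the original features.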

In [None]:
# to get 2 dimensions to plot the data points
def plot_pca_of_my_kmeans(scaled_features,kmeans,ax):
    pca = PCA(n_components=2, random_state=1)
    pca_results = pca.fit_transform(scaled_features)

    for i in range(0, pca_results.shape[0]):
        ax.scatter(pca_results[i,0],pca_results[i,1],c=colors[kmeans.labels_[i]*2+1], marker='x')
    return ax

plot_pca_of_my_kmeans(scaled_features,kmeans1,plt)
plt.title('2D visualization of our datapoints', fontsize=16)
plt.show()


From this chart, we can say two main things: first, our k-means was quite good; second, we should definitely use 4 clusters. I redo the clustering below, but the cell is hidden on Kaggle because it repeats the work above. 

In [None]:
# a bit different of previous cluster search because I'm looking for 4 clusters.
dbs = {} 
for k in range(10, 20):
    # in fact, I'm looking for a random_state that is better for n_clusters=4 
    kmeans = KMeans(n_clusters=4,random_state=k).fit(scaled_features)
    dbs[k] = davies_bouldin_score(scaled_features,kmeans.labels_)

fig = plt.figure(figsize=(13,6))
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)

ax1.plot(list(dbs.keys()), list(dbs.values()))
ax1.set_title("Plot to find the best random_state for n_cluster=4")

# the best is for random_state=17
kmeans2 = KMeans(n_clusters=4,random_state=17).fit(scaled_features)
plot_pca_of_my_kmeans(scaled_features,kmeans2,ax2)
ax2.set_title( "Chart to show the new clusters")
plt.show()


The result of this clustering is also very good, but in the end we work with the first version, which does not contain any clusters that are grouped together.<br>
Let's come back to the features. We saw that the `Factor Affect Adaptation` section separates the two groups strongly. Let's look at it.

In [None]:
# taken from https://stackoverflow.com/questions/52910187/how-to-make-a-polygon-radar-spider-chart-in-python
def radar_factory(num_vars, frame='circle'):
    """Create a radar chart with `num_vars` axes.

    This function creates a RadarAxes projection and registers it.

    Parameters
    ----------
    num_vars : int
        Number of variables for radar chart.
    frame : {'circle' | 'polygon'}
        Shape of frame surrounding axes.

    """
    # calculate evenly-spaced axis angles
    theta = np.linspace(0, 2*np.pi, num_vars, endpoint=False)

    class RadarAxes(PolarAxes):

        name = 'radar'

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # rotate plot such that the first axis is at the top
            self.set_theta_zero_location('N')

        def fill(self, *args, closed=True, **kwargs):
            """Override fill so that line is closed by default"""
            return super().fill(closed=closed, *args, **kwargs)

        def plot(self, *args, **kwargs):
            """Override plot so that line is closed by default"""
            lines = super().plot(*args, **kwargs)
            for line in lines:
                self._close_line(line)

        def _close_line(self, line):
            x, y = line.get_data()
            # FIXME: markers at x[0], y[0] get doubled-up
            if x[0] != x[-1]:
                x = np.concatenate((x, [x[0]]))
                y = np.concatenate((y, [y[0]]))
                line.set_data(x, y)

        def set_varlabels(self, labels):
            self.set_thetagrids(np.degrees(theta), labels)

        def _gen_axes_patch(self):
            # The Axes patch must be centered at (0.5, 0.5) and of radius 0.5
            # in axes coordinates.
            if frame == 'circle':
                return Circle((0.5, 0.5), 0.5)
            elif frame == 'polygon':
                return RegularPolygon((0.5, 0.5), num_vars,
                                      radius=.5, edgecolor="k")
            else:
                raise ValueError("unknown value for 'frame': %s" % frame)

        def draw(self, renderer):
            """ Draw. If frame is polygon, make gridlines polygon-shaped """
            if frame == 'polygon':
                gridlines = self.yaxis.get_gridlines()
                for gl in gridlines:
                    gl.get_path()._interpolation_steps = num_vars
            super().draw(renderer)


        def _gen_axes_spines(self):
            if frame == 'circle':
                return super()._gen_axes_spines()
            elif frame == 'polygon':
                # spine_type must be 'left'/'right'/'top'/'bottom'/'circle'.
                spine = Spine(axes=self,
                              spine_type='circle',
                              path=Path.unit_regular_polygon(num_vars))
                # unit_regular_polygon gives a polygon of radius 1 centered at
                # (0, 0) but we want a polygon of radius 0.5 centered at (0.5,
                # 0.5) in axes coordinates.
                spine.set_transform(Affine2D().scale(.5).translate(.5, .5)
                                    + self.transAxes)


                return {'polar': spine}
            else:
                raise ValueError("unknown value for 'frame': %s" % frame)

    register_projection(RadarAxes)
    return theta



In [None]:
# the idea is to show the answers of the cities of the two clusters.
# difference_between_centroids_df
# all_num_2020_reduced 
# kmeans1.labels_ 

all_num_2020_reduced_c1 = all_num_2020_reduced.iloc[np.where(kmeans1.labels_ == 1)]
all_num_2020_reduced_c1 = all_num_2020_reduced_c1.astype(float).mean(axis=0)

all_num_2020_reduced_c2 = all_num_2020_reduced.iloc[np.where(kmeans1.labels_ == 0)]
all_num_2020_reduced_c2 = all_num_2020_reduced_c2.astype(float).mean(axis=0)

def give_top_n(n, top, data):
    '''
    Give the n first columns of data with order given by top 
    '''
    cols = [top.index[c] for c in range(0,n)]
    return data[cols]
    
top_C1 = give_top_n(9,difference_between_centroids_df, all_num_2020_reduced_c1)
top_C2 = give_top_n(9,difference_between_centroids_df, all_num_2020_reduced_c2)


theta = radar_factory(9, 'polygon')
 
fig = plt.figure(figsize=(5,5))
ax = fig.add_subplot(1, 1, 1, projection='radar')

colors_2 = [colors[0*2+1],colors[1*2+1]]
i=0
for d in [top_C2,top_C1]:
    ax.plot(theta, d, color= colors_2[i] )
    ax.fill(theta, d, color= colors_2[i], alpha=0.25)
    i+=1
ax.set_varlabels([x.split("_")[1] for x in top_C2.index])
plt.title("What affects the climate change adaptation of the cities?", fontsize=16)
plt.show()

Sadly, we see that the blue cluster differs from the purple one mostly because its cities did not answer the question on climate change adaptation. As this is not a KPI, we will redo the clustering step after removing the rows that contain too many blanks. 

<a id="122"></a>
<h3 style="color:#D00892">1.2.2. Improving K-means to get KPIs</h3>
<p>I first remove the columns that have more than 100 rows with NaN values. Then I remove the rows that still contain NaN values and obtain a nice subset of 421 cities and 80 features.</p>

In [None]:
# count NaN rows
count = all_num_2020.isna().sum()

# remove columns that have more than 100 blanks
cols_to_remove = [x for x  in all_num_2020.columns if count[x]>100]
all_num_2020_remove_Nan_col = all_num_2020.drop(cols_to_remove,axis=1)

# now the lines
all_num_2020_no_Nan = all_num_2020_remove_Nan_col.dropna()
all_num_2020_no_Nan.shape

In [None]:
# normalizing data
scaler = StandardScaler()
scaled_features = scaler.fit_transform(all_num_2020_no_Nan)

dbs = {} 
for k in range(2, 10):
    kmeans = KMeans(n_clusters=k,random_state=12).fit(scaled_features)
    dbs[k] = davies_bouldin_score(scaled_features,kmeans.labels_)

fig = plt.figure(figsize=(13,6))
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)

ax1.plot(list(dbs.keys()), list(dbs.values()))
ax1.set_title("Chart to find the best number of clusters")

# we keep 3 clusters with random_state=12
kmeans_noNan = KMeans(n_clusters=3,random_state=12).fit(scaled_features)
plot_pca_of_my_kmeans(scaled_features,kmeans_noNan,ax2)
ax2.set_title( "Chart to show the new clusters")
plt.show()

I tried different random_state values to get 2 clusters, but most scores point to 7 clusters. As I prefer fewer clusters so that I can interpret them, I choose 3, which is not so bad with random_state=12. Let's go back to the polygon chart and look at the differences! 
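
The trade-off described above can be sketched in a few lines. A lower Davies-Bouldin score means better separated clusters, so the best k is the one with the smallest score, and a tolerance lets us prefer a smaller k that is almost as good. The scores and the tolerance below are hypothetical, and `best_k` and `smallest_k_within` are helper names introduced for the example; the dictionary only mimics the shape of `dbs`.

```python
def best_k(scores):
    """Return the number of clusters with the lowest Davies-Bouldin score."""
    return min(scores, key=scores.get)


def smallest_k_within(scores, tolerance):
    """Return the smallest k whose score is within `tolerance` of the best score."""
    threshold = min(scores.values()) + tolerance
    return min(k for k, score in scores.items() if score <= threshold)


# Hypothetical scores, shaped like the `dbs` dictionary (k -> score).
example_dbs = {2: 2.9, 3: 2.35, 4: 2.6, 5: 2.5, 6: 2.3, 7: 2.1, 8: 2.2, 9: 2.45}

print(best_k(example_dbs))
print(smallest_k_within(example_dbs, tolerance=0.3))
```

Making the tolerance explicit documents how much separation is given up in exchange for clusters that are easier to interpret.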

In [None]:
# what are the top differences between the three centroids:
# label the differences with the feature names so that give_top_n selects columns by name
difference_between_centroids = pd.Series(abs(kmeans_noNan.cluster_centers_[0] - kmeans_noNan.cluster_centers_[1])
                                       + abs(kmeans_noNan.cluster_centers_[0] - kmeans_noNan.cluster_centers_[2])
                                       + abs(kmeans_noNan.cluster_centers_[2] - kmeans_noNan.cluster_centers_[1]),
                                       index=all_num_2020_no_Nan.columns)

difference_between_centroids.sort_values(ascending=False, inplace=True)

# a lot of copy/paste code...hhh, sorry, I will work better if you hire me, I promise
all_num_2020_reduced_c1 = all_num_2020_no_Nan.iloc[np.where(kmeans_noNan.labels_ == 0)]
all_num_2020_reduced_c1 = all_num_2020_reduced_c1.astype(float).mean(axis=0)

all_num_2020_reduced_c2 = all_num_2020_no_Nan.iloc[np.where(kmeans_noNan.labels_ == 1)]
all_num_2020_reduced_c2 = all_num_2020_reduced_c2.astype(float).mean(axis=0)

all_num_2020_reduced_c3 = all_num_2020_no_Nan.iloc[np.where(kmeans_noNan.labels_ == 2)]
all_num_2020_reduced_c3 = all_num_2020_reduced_c3.astype(float).mean(axis=0)
    
top_C1 = give_top_n(15,difference_between_centroids, all_num_2020_reduced_c1)
top_C2 = give_top_n(15,difference_between_centroids, all_num_2020_reduced_c2)
top_C3 = give_top_n(15,difference_between_centroids, all_num_2020_reduced_c3)

# and we plot it !!
theta = radar_factory(15, 'polygon')
 
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(1, 1, 1, projection='radar')

colors_2 = [colors[0*2+1],colors[1*2+1],colors[2*2+1]]
i=0
for d in [top_C2,top_C1,top_C3]:
    ax.plot(theta, d, color= colors_2[i] )
    ax.fill(theta, d, color= colors_2[i], alpha=0.25)
    i+=1
ax.set_varlabels([x for x in top_C2.index])
plt.title("Top 15 features that separates the clusters", fontsize=16)
plt.show()



This chart shows the 15 main differences and similarities between the clusters. For instance, the blue cluster mainly contains the cities that detailed several factors affecting their adaptation, while the red cluster represents the cities that have no target for GHG reduction and do not have to answer the External Control Reduction question. Finally, the purple cluster has a city-wide inventory to report but did not detail the factors that affect adaptation. 

<a id="123"></a>
<h3 style="color:#D00892">1.2.3. Using GHG reduction as target </h3>
In this part, I train a decision tree with the target `GHG Reduction`, which seems to be an important criterion of the study. 

In [None]:
#all_num_2020_no_Nan.columns
ghglist = ['GHG Reduction_Base year emissions (absolute) target','GHG Reduction_Base year intensity target',\
           'GHG Reduction_Baseline scenario (business as usual) target','GHG Reduction_Fixed level target', \
           'GHG Reduction_No target']

def revert_get_dummiers_for_GHG(row):
    for c in ghglist:
        if row[c]==1:
            return ghglist.index(c)
        
all_num_2020_no_Nan_Y = all_num_2020_no_Nan.apply(revert_get_dummiers_for_GHG, axis=1)
all_num_2020_no_Nan_X_ = all_num_2020_no_Nan.drop(ghglist, axis=1)
all_num_2020_no_Nan_X = all_num_2020_no_Nan_X_.drop(["Account Number","External Control Reduction_Question not applicable"],axis=1)
all_num_2020_no_Nan_X

index = all_num_2020_no_Nan_Y.index[all_num_2020_no_Nan_Y.apply(np.isnan)]
all_num_2020_no_Nan_Y.drop(index,axis=0, inplace=True)
all_num_2020_no_Nan_X.drop(index,axis=0,inplace=True)

In [None]:
# I first try a decision tree
X_train, X_test, y_train, y_test = train_test_split(all_num_2020_no_Nan_X, all_num_2020_no_Nan_Y, test_size=0.2, random_state=42)
clf = tree.DecisionTreeClassifier(random_state=42)
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))
r = tree.export_text(clf, feature_names=list(all_num_2020_no_Nan_X.columns))
print(r)


The decision tree gives easy-to-read outputs, which is good for KPI extraction. The classes that we set as targets of our classifier come from the GHG Reduction feature and are the following:
- <b>class 0.0 = 'Base year emissions (absolute) target':</b> Reduce emissions by a specified quantity relative to a base year. For example, a 25% reduction in emissions from 1990 levels by 2030.
- <b>class 1.0 = 'Base year intensity target':</b> Reduce emissions intensity (emissions per unit of another variable, typically Gross Domestic Product (GDP) or per capita) by a specified quantity relative to a base year. For example, a 40% reduction in emissions intensity per capita from 1990 levels by 2030.
- <b>class 2.0 = 'Baseline scenario (business as usual) target':</b> Reduce emissions by a specified quantity relative to a projected emissions baseline scenario. A Business as Usual (BAU) baseline scenario is a reference case that represents the future emissions most likely to occur if the current trends in population, economy and technology continue and in the absence of changes in current energy and climate policies. For example, a 30% reduction from baseline scenario emissions in 2030. 
- <b>class 3.0 = 'Fixed level target':</b> Reduce, or control the increase of, emissions to an absolute emissions level in a target year. One type of fixed-level target is a carbon neutrality target, which is designed to reach zero net emissions by a certain date (e.g., 2050).
- <b>class 4.0 = 'No target'</b>
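
To read the tree output or the classification report more comfortably, the class ids can be mapped back to their labels. This is a small sketch; `GHG_TARGET_LABELS` and `decode_classes` are names introduced here for illustration.

```python
# Class id -> label, in the same order as the list above.
GHG_TARGET_LABELS = {
    0: "Base year emissions (absolute) target",
    1: "Base year intensity target",
    2: "Baseline scenario (business as usual) target",
    3: "Fixed level target",
    4: "No target",
}


def decode_classes(predictions):
    """Map predicted class ids (ints or floats such as 4.0) to target labels."""
    return [GHG_TARGET_LABELS[int(p)] for p in predictions]


print(decode_classes([0.0, 4.0, 3.0]))
```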


<h3 style="color:#FFA57D"> First outcome:</h3> 
<i>Please click on "show output" of the previous cell to see the decision tree. I have hidden it because the text is long.</i><br>
We created a training and a testing set to validate our decision tree.<br>
The accuracy is 0.53, which is not very good. However, when we look at the precision and recall of the different classes, we see that the model recognizes class 0.0 and class 4.0 more easily than the others. This may be due to the size of those classes.<br>
We see that all the branches that satisfy the conditions "GHG Mitigation Plan_Yes <= 0.50" plus "External Control Reduction_Yes >  0.50" return class 4.0, i.e., no target. We can <i>conclude</i> that not having a mitigation plan yet (Question 5.5) and being under an external control outside the policy of the city (Question 5.2) go together with cities not setting a target... <br> 
From this first "KPI", I see that the method is interesting but my features are not. I am going to improve the learning features, first by removing all the target questions (section 5 of the survey) and then by adding more data from other datasets. <br>
I don't want to give more details on the interpretation of these results because the model (tree) is not good. 


<a id="2"></a>
<h1  style="color:#084FD0">2. Intersection between Environment and Social Challenge for the Cities</h1>

<p style="text-align: justify;">In the previous section, we explored the dataset. We now aim at answering the competition problem, i.e., can we extract KPIs from the dataset?</p>
<a id="21"></a>
<h2 style="color:#7209b7">2.1 Logistic Regression for KPI Extractions</h2> 
Linear regression is an excellent method to find correlations in data. However, for the moment, we are working with classes. Thanks to logistic regression, we can fit a probabilistic model that gives us a coefficient for each feature. Similarly to the linear model, the logistic model can be learned with all the features or with a restricted number of them. As we want KPIs that are understandable by humans, we look for models with 1 to 3 features. To do so, we learn all the possible models and keep the best one. 
As in the previous section, we created a training and a testing set to validate the model. 
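
To give an idea of the cost of this exhaustive search, here is a small sketch that enumerates every subset of 1 to 3 features and counts the models to fit. The helper names and the feature list are assumptions for the example ("Solar" is a hypothetical column).

```python
from itertools import combinations
from math import comb


def feature_subsets(columns, max_size=3):
    """Yield every subset of `columns` containing 1 to `max_size` features."""
    for size in range(1, max_size + 1):
        yield from combinations(columns, size)


def count_models(n_features, max_size=3):
    """Number of models that the exhaustive search has to fit."""
    return sum(comb(n_features, size) for size in range(1, max_size + 1))


demo_columns = ["Hydro", "Geothermal", "Wind", "Solar"]  # hypothetical feature list
subsets = list(feature_subsets(demo_columns))
print(len(subsets), count_models(len(demo_columns)))
print(count_models(50))
```

The count grows roughly with the cube of the number of features, which is why the learning step is long.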


Find below 3 functions that help us train the best logistic regressions:
- <b>remove_outlier_from_zscore</b>: removes the rows that are outliers, using the outlier definition given by the z-score (absolute value above 3)
- <b>compute_regressions</b>: learns all the logistic regressions of 1 to 3 variables. 
- <b>train_and_print_regression</b>: shows several results that we will use to interpret and extract valuable KPIs
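
The outlier rule can be illustrated without the dataset. The sketch below computes z-scores with the population standard deviation, which is what `scipy.stats.zscore` uses by default, and drops the values whose absolute z-score is above 3 together with their labels. The data and the helper names are made up for the example, and returning 0 for a constant column is a choice of this sketch.

```python
from statistics import mean, pstdev


def zscores(values):
    """Z-scores computed with the population standard deviation (ddof=0)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing can be an outlier
    return [(v - mu) / sigma for v in values]


def drop_outliers(values, labels, threshold=3.0):
    """Remove the values with |z| > threshold and the labels of the same rows."""
    kept = [(v, y) for v, y, z in zip(values, labels, zscores(values))
            if abs(z) <= threshold]
    return [v for v, _ in kept], [y for _, y in kept]


demo_values = [10.0] * 19 + [100.0]   # one extreme value among 20
demo_labels = [0, 1] * 10
kept_values, kept_labels = drop_outliers(demo_values, demo_labels)
print(len(kept_values), len(kept_labels))
```

Note that with very few rows no value can reach a z-score of 3, so the rule only starts removing rows on larger samples.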

In [None]:
def remove_outlier_from_zscore(df, column,y_follows):
    '''
    Function to remove outliers: drops the rows whose |z-score| on `column` is above 3
    and drops the same rows from y_follows
    source: https://stackoverflow.com/questions/23199796/detect-and-exclude-outliers-in-pandas-data-frame
    '''
    # work on a copy so that the caller's dataframe is not modified
    df = df.copy()
    df['z_score']=stats.zscore(df[column])
    rows = df[df['z_score'].abs()>3].index
    y_follows = y_follows.drop(rows,axis=0)
    return df[df['z_score'].abs()<=3], y_follows


def compute_regressions(X, Y):
    '''
    Function to learn a good model of 3 features (X)
    @return: best logistic regression + the list of the columns used for it
    source: memories of my Master degree! 
    '''
    best_model = None
    best_score = 0 
    columns = None
    
    def learn_regression(X,Y,list_of_indices,best_score, best_model,columns):
        X2 = X[[X.columns[i] for i in list_of_indices]]
        Y2 = Y
        # remove outliers and same rows in Y
        for c in list_of_indices:
            X2, Y2 = remove_outlier_from_zscore(X2,X.columns[c],Y2)
        X2 = X2.drop("z_score",axis=1)
        # learn model
        model = LogisticRegression(random_state=1)
        model.fit( X2,Y2)
        result = model.score(X2,Y2)
        # if better, keep it
        if result > best_score:
            best_model = model
            best_score = result
            columns = [X.columns[i] for i in list_of_indices]
            print("...score:",result, columns)
        return (best_score, best_model, columns)
    
    # learn regression of 1, 2 or 3 predictive variables
    for i in range(0,len(X.columns)):
        best_score, best_model, columns = learn_regression(X, Y,[i],best_score, best_model,columns)
        for j in range(i+1,len(X.columns)): 
            best_score, best_model, columns = learn_regression(X, Y,[i,j],best_score, best_model,columns)
            for k in range(j+1,len(X.columns)): 
                best_score, best_model, columns = learn_regression(X, Y,[i,j,k],best_score, best_model, columns)
                
    return best_model, columns

def train_and_print_regression(X,Y):
    '''
    Function that split the training and testing set
    Get the best regression model for 3 features
    Show accuracy on the testing set
    '''
    X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size=0.3, random_state=1)
    model, keys = compute_regressions(X_train,y_train)
    y_pred = model.predict(X_test[[keys[i] for i in range(0,len(keys))]])
    
    print("Training Score:",model.score(X_train[[keys[i] for i in range(0,len(keys))]],y_train))
    print("Testing Score:",model.score(X_test[[keys[i] for i in range(0,len(keys))]],y_test))
    
    # nice prints 
    # print(classification_report(list(y_test), [1 if x>0.5 else 0 for x in y_pred]))
    for k in range(0,len(keys)):
        print(keys[k],  str(model.coef_[0][k]))
    return model, keys

Below (hidden in the Kaggle notebook view; open it if interested), we preprocess the data a bit. First, I remove the columns of our previous dataset that have more than 250 blanks (NaN values). Then, I also remove the rows that contain blanks. After this, we see that there are many columns that are hard to interpret, such as columns with `Do not know` or `Question not applicable`. I took arbitrary decisions, such as removing some of them and associating others with a real answer. Find below the list of decisions:
- <b>.._Do not know</b>: if a city answers that it does not know about a climate change target, I assume that nothing is planned, so I associated this answer with the negative answers
- <b>.._Intending to undertake in the next 2 years</b>: I associated this answer with yes. In general, the motivation is present and that is what interests us
- <b>.._Question not applicable and .._Unknown</b>: what would we do if we found a KPI with that target? I removed these. 

Then, I applied a scaler to the variables. 
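
The decisions above can be summarized as a rule on the column names. The following sketch only plans the action for each dummy column and does not touch the data; `plan_column` and `NEGATIVE_SUFFIXES` are names introduced here, and keeping a `Do not know` column when no negative column exists is an assumption of this sketch.

```python
# Negative answers, tried in this order when merging a "Do not know" column.
NEGATIVE_SUFFIXES = ["Not intending to undertake", "Not intending to incorporate", "No"]


def plan_column(column, all_columns):
    """Return ('merge', target), ('drop', None) or ('keep', None) for one dummy column."""
    question = column.split("_")[0]
    if "Do not know" in column:
        for suffix in NEGATIVE_SUFFIXES:
            target = f"{question}_{suffix}"
            if target in all_columns:
                return ("merge", target)
        return ("keep", None)  # assumption: no negative column, leave for manual review
    if "Intending to undertake in the next 2 years" in column:
        return ("merge", f"{question}_Yes")
    if "Question not applicable" in column or "Unknown" in column:
        return ("drop", None)
    return ("keep", None)


demo_columns = [
    "GHG Mitigation Plan_Do not know",
    "GHG Mitigation Plan_Not intending to undertake",
    "GHG Mitigation Plan_Intending to undertake in the next 2 years",
    "GHG Mitigation Plan_Yes",
    "External Control Reduction_Question not applicable",
    "Hydro",
]
plan = {c: plan_column(c, demo_columns) for c in demo_columns}
for name, action in plan.items():
    print(name, "->", action)
```

Printing the plan before applying it makes the arbitrary choices easy to review and to change.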

In [None]:
# I take back the data of the cities only. I want to work on the energies, transports and emissions. 
# count NaN rows
count = all_num_2020.isna().sum()

# remove columns that have more than 250 blanks
cols_to_remove = [x for x  in all_num_2020.columns if count[x]>250]
all_num_2020_remove_Nan_col = all_num_2020.drop(cols_to_remove,axis=1)

# now the lines
all_num_2020_remove_Nan_col = all_num_2020_remove_Nan_col.dropna()

# I modify the Do not know to -> "no"
# and "Intending to undertake in the next 2 years" to "yes" 
for c in all_num_2020_remove_Nan_col.columns:
    if "Do not know" in c:
        if c.split("_")[0]+"_Not intending to undertake" in all_num_2020_remove_Nan_col.columns:
            all_num_2020_remove_Nan_col[c.split("_")[0]+"_Not intending to undertake"] =  all_num_2020_remove_Nan_col[c.split("_")[0]+"_Not intending to undertake"]+all_num_2020_remove_Nan_col[c]
        elif c.split("_")[0]+"_Not intending to incorporate" in all_num_2020_remove_Nan_col.columns:
            all_num_2020_remove_Nan_col[c.split("_")[0]+"_Not intending to incorporate"] =  all_num_2020_remove_Nan_col[c.split("_")[0]+"_Not intending to incorporate"]+all_num_2020_remove_Nan_col[c]
        else :
            all_num_2020_remove_Nan_col[c.split("_")[0]+"_No"] =  all_num_2020_remove_Nan_col[c.split("_")[0]+"_No"]+all_num_2020_remove_Nan_col[c]
        all_num_2020_remove_Nan_col.drop(c,axis=1, inplace=True)
    if "Intending to undertake in the next 2 years" in c:
        all_num_2020_remove_Nan_col[c.split("_")[0]+"_Yes"] = all_num_2020_remove_Nan_col[c.split("_")[0]+"_Yes"]+all_num_2020_remove_Nan_col[c]
        all_num_2020_remove_Nan_col.drop(c,axis=1, inplace=True)
    if "Question not applicable" in c:
        all_num_2020_remove_Nan_col.drop(c,axis=1, inplace=True)
    if "Unknown" in c:
        all_num_2020_remove_Nan_col.drop(c,axis=1, inplace=True)

# remove nearly empty columns (due to get dummies)
count = all_num_2020_remove_Nan_col.sum()   
cols_to_remove = [i for i in count.index if int(count[i])<10]
all_num_2020_remove_Nan_col = all_num_2020_remove_Nan_col.drop(cols_to_remove,axis=1)


# Select upper triangle of correlation matrix
corr = all_num_2020_remove_Nan_col.corr()
# np.bool was removed from recent NumPy versions, the builtin bool is equivalent here
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
# Find features with correlation greater than 0.85
to_drop = [column for column in upper.columns if any(upper[column] > 0.85) and column!="Account Number"]

In [None]:
# We arbitrarily define the targets (Y value)
target_columns = ['Renewable Energie Target_In progress',
       'Renewable Energie Target_Not intending to undertake',
       'Renewable Energie Target_Yes', 'GHG Mitigation Plan_In progress',
       'GHG Mitigation Plan_Yes',
       'GHG Reduction_Base year emissions (absolute) target',
       'GHG Reduction_Base year intensity target',
       'GHG Reduction_Baseline scenario (business as usual) target',
       'GHG Reduction_Fixed level target', 'GHG Reduction_No target',
        'City-wide emissions_In progress', 'City-wide emissions_Yes']
# then the predictive values (X) are the others
X_cities = all_num_2020_remove_Nan_col.drop(target_columns, axis=1)
X_cities_ = X_cities.drop(["Account Number", "Current population" ,"Projected population"], axis=1)

# scale the data (but not sure it is useful with this model..)
scaler = StandardScaler()
X_cities_scaled = scaler.fit_transform(X_cities_[X_cities_.columns])
X_cities_scaled = pd.DataFrame(X_cities_scaled, columns = X_cities_.columns, index=X_cities_.index)
#X_cities_scaled["cst"] = 1

<a id="22"></a>
<h2 style="color:#7209b7">2.2 Results </h2> 
Now that we have the functions, we can run them to extract the predictive variables for different targets. As the learning step is long, I ran it once, noted the selected columns and only present the best logistic regression in the notebook. However, if you want to verify that the chosen predictive variables are the right ones, you can uncomment the corresponding line in the code to run all the regressions.  

Logistic regression, similarly to OLS, gives an equation based on coefficients, here for the log-odds of the probability p:<br>

$$\log\left(\frac{p}{1-p}\right) = b_0 + C_1 x_1 + C_2 x_2 + C_3 x_3$$
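
To turn this equation into a probability, the log-odds are inverted with the logistic function. A minimal sketch follows, with a hypothetical intercept, coefficients and feature values; `predicted_probability` and `odds_ratio` are helper names introduced for the example.

```python
import math


def predicted_probability(intercept, coefficients, values):
    """Invert log(p / (1 - p)) = b0 + sum(C_i * x_i) to recover p."""
    log_odds = intercept + sum(c * x for c, x in zip(coefficients, values))
    return 1.0 / (1.0 + math.exp(-log_odds))


def odds_ratio(coefficient):
    """Factor applied to the odds p / (1 - p) when x_i increases by one unit."""
    return math.exp(coefficient)


# Hypothetical intercept, coefficients and scaled feature values.
p = predicted_probability(0.0, [0.5, -1.0], [1.0, 0.5])
print(p)
```

A positive coefficient gives an odds ratio above 1, so the feature pushes the prediction towards the positive class; a negative one pushes it away.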


In [None]:
def just_run_the_best_model(X, Y, X_predictive_columns):
    X2 = X[X_predictive_columns]
    Y2 = Y
    
    X_train, X_test, y_train, y_test = train_test_split(X2,Y2, test_size=0.3, random_state=1)    
    X_train2, y_train2 = X_train.copy(), y_train.copy()
    
    # remove outliers and same rows in Y2
    for c in range(0,len(X_train2.columns)):
        X_train2, y_train2 = remove_outlier_from_zscore(X_train2,X_train2.columns[c],y_train2)
    X_train2 = X_train2.drop("z_score",axis=1)
    # learn model
    model = LogisticRegression(random_state=1)
    model.fit(X_train2,y_train2)
    
    y_pred = model.predict(X_test)
    
    print("Training Score:",model.score(X_train,y_train))
    print("Testing Score:",model.score(X_test,y_test),"\n")
    
    # nice prints 
    print(classification_report(list(y_test), [1 if x>0.5 else 0 for x in y_pred]))
    for k in range(0,len(X2.columns)):
        print(X2.columns[k], str(model.coef_[0][k]))
    return model

<a id="221"></a>   <h3 style="color:#D00892">2.2.1 KPI for 'Renewable Energie Target_Not intending to undertake' </h3>

In [None]:
# to run all the model, uncomment the following line
# model, keys = train_and_print_regression(X_cities_scaled,all_num_2020_remove_Nan_col['Renewable Energie Target_Not intending to undertake'])    
model = just_run_the_best_model(X_cities_scaled,all_num_2020_remove_Nan_col['Renewable Energie Target_Not intending to undertake'],
                                     ["Hydro","Geothermal","Factor Affect Climate Adaptation_Access to basic services"])#,"cst"])    

From this first result, we get the following, where $b_0$ is the constant term: <br>

$$\log\left(\frac{p}{1-p}\right) = 0.6697022837363221\, x_{\text{Hydro}} - 0.933891979530483\, x_{\text{Geothermal}} - 0.053014559214614364\, x_{\text{Factor..}} + b_0$$

Despite an average accuracy of 0.82, we see that the model can only predict the negative class. We could interpret this as follows: given `Hydro`, `Geothermal` and `Factor Affect Climate Adaptation_Access to basic services` of a city, if the model predicts `Renewable Energie Target_Not intending to undertake` to be `False`, the prediction is correct 82 percent of the time. However, when we look at the data, we see that this result is due to the class proportions: there are 160 `False` values for 38 `True` ones. 
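
A quick way to see the effect of the class proportions is to compute the accuracy of a model that always predicts the most frequent class. The sketch below uses the 160/38 counts given above; `majority_baseline` is a helper name introduced here.

```python
def majority_baseline(counts):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(counts.values()) / sum(counts.values())


class_counts = {False: 160, True: 38}  # counts of the target reported in the text
baseline = majority_baseline(class_counts)
print(f"baseline accuracy: {baseline:.3f}")
```

The baseline is about 0.81, so an accuracy of 0.82 is barely better than always answering `False`.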

In [None]:
all_num_2020_remove_Nan_col['Renewable Energie Target_Not intending to undertake'].value_counts()

We will try to run a very small model on 76 cities (38 False and 38 True). This implies a very small amount of data for the training and testing sets.
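
The balancing step can be sketched as follows, with a fixed seed so that the random subsample is reproducible from one run to the next. The function name `undersample_majority` and the demo labels are assumptions made for the example.

```python
import random


def undersample_majority(labels, seed=0):
    """Return the row ids to drop so that both classes end up with the same size.

    `labels` maps a row id to its class, 0 or 1.
    """
    by_class = {0: [], 1: []}
    for row_id, label in labels.items():
        by_class[label].append(row_id)
    majority = max(by_class, key=lambda c: len(by_class[c]))
    minority = 1 - majority
    n_drop = len(by_class[majority]) - len(by_class[minority])
    rng = random.Random(seed)  # fixed seed: the same rows are dropped at every run
    return rng.sample(by_class[majority], n_drop)


# 160 negative rows and 38 positive rows, as in the target above.
demo_labels = {i: 0 for i in range(160)}
demo_labels.update({i: 1 for i in range(160, 198)})

to_drop = set(undersample_majority(demo_labels))
kept = {i: y for i, y in demo_labels.items() if i not in to_drop}
print(len(to_drop), len(kept), sum(kept.values()))
```

Running the model with several seeds would show how much the selected features depend on which majority rows were kept.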

In [None]:
# randomly pick 160-38 negative rows to drop so that both classes have 38 rows
index_of_pos_to_remove = [x for x in all_num_2020_remove_Nan_col.index if all_num_2020_remove_Nan_col['Renewable Energie Target_Not intending to undertake'][x]==0]
suffle_index = random.sample(index_of_pos_to_remove,160-38)
suffle_index

X_221 = X_cities_scaled.drop(suffle_index)
Y_221 = all_num_2020_remove_Nan_col.drop(suffle_index)
#model, keys = train_and_print_regression(X_221,Y_221['Renewable Energie Target_Not intending to undertake'])    
model = just_run_the_best_model(X_221,Y_221['Renewable Energie Target_Not intending to undertake'],["Hydro"])    

The feature `Hydro` seems to be the most correlated with the `Renewable Energie Target_Not intending to undertake` target. When a city uses hydro as an energy source, the probability that it gives a negative answer to the question about a renewable energy target is lower than the probability that it gives a positive one. Those cities answer with a positive target such as "in progress" or "yes". 

 <a id="222"></a>  <h3 style="color:#D00892">2.2.2 KPI for 'GHG Mitigation Plan_Yes' </h3>

For each target, we now reduce the data to get a balanced proportion of True/False values to predict.  

In [None]:
all_num_2020_remove_Nan_col['GHG Mitigation Plan_Yes'].value_counts()

In [None]:
# randomly pick 163-35 positive rows to drop so that both classes have 35 rows
index_of_pos_to_remove = [x for x in all_num_2020_remove_Nan_col.index if all_num_2020_remove_Nan_col['GHG Mitigation Plan_Yes'][x]==1]
suffle_index = random.sample(index_of_pos_to_remove,163-35)
suffle_index

X_222 = X_cities_scaled.drop(suffle_index)
Y_222 = all_num_2020_remove_Nan_col.drop(suffle_index)
# to run all the model, uncomment the following line
#model, keys = train_and_print_regression(X_222,Y_222[ 'GHG Mitigation Plan_Yes'])    
model = just_run_the_best_model(X_222,Y_222['GHG Mitigation Plan_Yes'],
                                     ["Wind","External Control Reduction_No","Governance & Data Management_In progress"])    

<h2 style="color:#7209b7">2.3 Conclusion </h2> 

The method can be applied to any other feature, but I did not run it for lack of time. We see that there is no strong conclusion for the features I have tried. 

<h1 style="color:#084FD0"> 3. Other Datasets</h1>

In the first part of the competition, some Kaggle users shared datasets. In this section, we try to add them to our dataset. 

<h2 style="color:#7209b7">3.1 Adding Country Information to Cities</h2> 

In this section, we join several country-level datasets to our city responses in order to see whether the cities' targets are related to social and environmental metrics that do not come from the cities' own answers.

In [None]:
# I decide to keep the same target 
cities_plus_countries_Y = pd.concat([all_num_2020_no_Nan_Y,all_num_2020_no_Nan_X_["Account Number"]], axis=1)

First, I remove all the target columns because they are too strongly correlated with the GHG reduction target and do not give much information on the intersection of society and environment. 

In [None]:
# first I remove all target questions (about future plans)
# + Projected population because current and Difference is enough
target_columns = ["Projected population",\
                  'Renewable Energie Target_Do not know',\
                   'Renewable Energie Target_In progress',\
               'Renewable Energie Target_Intending to undertake in the next 2 years',\
               'Renewable Energie Target_Not intending to undertake',\
               'Renewable Energie Target_Yes', 'GHG Mitigation Plan_Do not know',\
               'GHG Mitigation Plan_In progress',\
               'GHG Mitigation Plan_Intending to undertake in the next 2 years',\
               'GHG Mitigation Plan_Not intending to undertake',\
               'GHG Mitigation Plan_Yes', 'Collaborations_Do not know',\
                         'Governance & Data Management_Do not know',\
       'Governance & Data Management_In progress',\
       'Governance & Data Management_Intending to incorporate in the next 2 years',\
       'Governance & Data Management_Not intending to incorporate',\
       'Governance & Data Management_Yes', 'Climate Change_Do not know',\
                         'City-wide emissions_In progress',\
       'City-wide emissions_Intending to undertake in the next 2 years',\
       'City-wide emissions_Not intending to undertake',\
                  'External Control Reduction_Do not know',\
                  'External Control Reduction_Yes',\
                  'External Control Reduction_No',\
       'City-wide emissions_Yes']
cities_plus_countries_X = all_num_2020_no_Nan_X_.drop(target_columns, axis=1)

We have loaded some public datasets and now we link their data to the cities: 
- <b>yearly-air-quality-index-aqi-for-CDP-Cities:</b> environmental information 
- <b>globses:</b> social information such as a Socioeconomic Status Score  
- <b>ecological-footprint:</b> environmental information such as the Forest Footprint
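
The linking step keeps the most recent record per key and then joins it to the cities. Here is a plain Python stand-in for that logic, runnable without the input files. The SES values, the second city and the helper names are hypothetical. The sketch also makes visible that an inner join, which is the default behaviour of `merge`, silently drops the cities that have no match.

```python
def latest_per_key(rows, key, year_field="year"):
    """Keep, for each value of `key`, the row with the largest year."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[year_field] > current[year_field]:
            latest[row[key]] = row
    return latest


def inner_join_on_country(cities, metrics_by_country, metric_fields):
    """Attach country metrics to each city; unmatched cities are dropped."""
    joined = []
    for city in cities:
        metrics = metrics_by_country.get(city["Country"])
        if metrics is None:
            continue  # same effect as an inner merge
        joined.append({**city, **{f: metrics[f] for f in metric_fields}})
    return joined


# Hypothetical country-level rows and cities.
ses_rows = [
    {"country": "Costa Rica", "year": 2005, "SES": 50.0},
    {"country": "Costa Rica", "year": 2010, "SES": 55.0},
    {"country": "France", "year": 2010, "SES": 80.0},
]
cities = [
    {"Account Number": 50378, "Country": "Costa Rica"},
    {"Account Number": 1, "Country": "Atlantis"},
]

latest_ses = latest_per_key(ses_rows, key="country")
joined = inner_join_on_country(cities, latest_ses, ["SES"])
dropped = [c["Account Number"] for c in cities if c["Country"] not in latest_ses]
print(joined, dropped)
```

Listing the dropped account numbers after each join helps to spot country names that are spelled differently across datasets.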

In [None]:
# add each city's arithmetic_mean AQI for its most recent year,
# taken from yearly-air-quality-index-aqi-for-CDP-Cities
air = pd.read_csv("../input/yearly-air-quality-index-aqi-for-cdp-cities/year_over_year_aqi_data_v2.csv")
# sort by year (newest first) and keep the first row of each city
air = air.sort_values(by='year', ascending=False).groupby("account_number").head(1)
# keep only the key and the AQI value, in a fixed order, before renaming
air = air[["account_number", "arithmetic_mean"]]
air.columns = ['Account Number','Air Quality']
cities_plus_countries_X = cities_plus_countries_X.merge(air)

In [None]:
# add SES, gdppc and popshare from the globses dataset
country_cities_2020 = pd.read_csv("/kaggle/input/cdp-unlocking-climate-solutions/Cities/Cities Disclosing/2020_Cities_Disclosing_to_CDP.csv", usecols=["Account Number", "Country"])
globses = pd.read_csv("../input/globses/GLOB.SES.csv",usecols=["country","year","SES","gdppc","popshare"], encoding = "ISO-8859-1" )
globses = globses.sort_values(by='year', ascending=False).groupby("country").head(1).drop("year",axis=1)
globses_cities = country_cities_2020.merge(globses, left_on="Country",right_on="country").drop(["country","Country"],axis=1)
cities_plus_countries_X = cities_plus_countries_X.merge(globses_cities)

In [None]:
# add footprint from the ecological-footprint dataset
footprint = pd.read_csv("../input/ecological-footprint/countries.csv", encoding = "ISO-8859-1" ).drop(["Region","Population (millions)","Data Quality"],axis=1)
footprint_cities = country_cities_2020.merge(footprint, left_on="Country",right_on="Country")
cities_plus_countries_X = cities_plus_countries_X.merge(footprint_cities).drop("Country",axis=1)
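
The merges above rely on the default inner join of pandas, which silently drops every city that has no match in the public dataset. Here is a minimal sketch, on made-up toy frames (the account numbers and values are hypothetical, not taken from the CDP data), of how one could count the unmatched cities before committing to the join:

```python
import pandas as pd

# Toy stand-ins for the CDP cities table and one public dataset (hypothetical values).
cities = pd.DataFrame({"Account Number": [1, 2, 3], "Population": [100, 200, 300]})
air = pd.DataFrame({"Account Number": [1, 3], "Air Quality": [42.0, 57.5]})

# A left join keeps every city; indicator=True adds a "_merge" column
# telling whether a row was found in both frames or only in the left one.
linked = cities.merge(air, on="Account Number", how="left", indicator=True)
unmatched = linked.loc[linked["_merge"] == "left_only", "Account Number"].tolist()
print("cities without air-quality data:", unmatched)
```

If the list of unmatched cities is long, the later `dropna` step removes a large share of the sample, which is worth knowing before reading anything into the model.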

In [None]:
sociality_env = cities_plus_countries_Y.merge(cities_plus_countries_X)
sociality_env = sociality_env.dropna(axis=0)
cities_plus_countries_X_ = sociality_env.drop([0,"GDP per Capita","Countries Required","Earths Required", "Account Number","External Control Reduction_Question not applicable"], axis=1)
cities_plus_countries_Y_ = sociality_env[0]


In fact, this idea does not work well: too many cities share the same country-level measures, so their rows are correlated. We could either fit several models with at most one city per country, or fit a single model on the new country-level features only. Because the new features are floats, I am curious to try an OLS model on those alone.
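
The first option (at most one city per country) is not run in this notebook. As a rough sketch of what it could look like, assuming a table that still carries a `Country` column (the toy frame and the helper `one_city_per_country` below are hypothetical):

```python
import pandas as pd

# Toy table: several cities share the same country-level features (hypothetical values).
df = pd.DataFrame({
    "Account Number": [1, 2, 3, 4, 5],
    "Country": ["France", "France", "Brazil", "Brazil", "Japan"],
    "SES": [80.0, 80.0, 55.0, 55.0, 85.0],
})

def one_city_per_country(frame, seed):
    # Draw one random row from each country group.
    return frame.groupby("Country").sample(n=1, random_state=seed)

# Several draws give several training sets, one model could be fitted per draw.
subsets = [one_city_per_country(df, seed) for seed in range(3)]
print(subsets[0])
```

Each subset contains every country exactly once, so the duplicated country-level values no longer dominate the fit.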

<h2 style="color:#7209b7">3.2  Country Information Only</h2> 


In [None]:
globses = pd.read_csv("../input/globses/GLOB.SES.csv",usecols=["country","year","SES","gdppc","popshare"], encoding = "ISO-8859-1" )
globses = globses.sort_values(by='year', ascending=False).groupby("country").head(1).drop("year",axis=1)

footprint = pd.read_csv("../input/ecological-footprint/countries.csv", encoding = "ISO-8859-1" ).drop(["Region","Population (millions)","Data Quality"],axis=1)
globses_plus_footprint = footprint.merge(globses, left_on="Country",right_on="country").drop(["country","Country"],axis=1)
globses_plus_footprint.drop("GDP per Capita", axis=1, inplace=True)  

count = globses_plus_footprint.isna().sum()

# remove columns that have more than 50 blanks
cols_to_remove = [x for x  in globses_plus_footprint.columns if count[x]>50]
globses_plus_footprint = globses_plus_footprint.drop(cols_to_remove,axis=1)

# now drop the rows that still contain blanks
globses_plus_footprint = globses_plus_footprint.dropna()
globses_plus_footprint

In [None]:
def learn_regression(X,Y):
    # learn model
    X = sm.add_constant(X)
    model = sm.OLS(Y, X)
    result = model.fit()
    return result


# regress every pair of columns against each other and print the strong fits
cols = globses_plus_footprint.columns
for x in range(0, len(cols)):
    for y in range(x+1, len(cols)):
        X = np.asarray(globses_plus_footprint[cols[x]])
        Y = np.asarray(globses_plus_footprint[cols[y]])
        result = learn_regression(X,Y)
        if result.rsquared > 0.8 :
            print(cols[x], cols[y], result.rsquared)
            # note: these keep the last pair above the threshold, not the highest-scoring one
            best_rsquared = result.rsquared
            best_model = result
            columns = [cols[x], cols[y]]
            

I ran a linear regression on each pair of features to see whether one is a linear function of the other, and printed the pairs whose R-squared is greater than 0.8. We see that the Human Development Index is strongly correlated with the Socioeconomic Status score, even though the two scores come from different datasets. We also notice that the Carbon Footprint is related to the Gross Domestic Product per capita. We want to see in which direction one implies the other:

In [None]:
print(learn_regression(np.asarray(globses_plus_footprint["Carbon Footprint"]),np.asarray(globses_plus_footprint["gdppc"])).summary())

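The p-value is high (about 0.35), meaning that, if the Carbon Footprint and the Gross Domestic Product per capita were unrelated, we would still see a relationship at least this strong about 35% of the time. We therefore cannot conclude that the two are linearly related.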
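
As a side note on the pairwise scan: for a regression with a single feature and an intercept, the R-squared equals the squared Pearson correlation of the two columns. The same screening can therefore be sketched with one correlation matrix. The data below is synthetic and the column names `a`, `b`, `c` are placeholders, not columns of the real dataset:

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic data: "b" is almost a linear function of "a", "c" is pure noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "a": x,
    "b": 2.0 * x + rng.normal(scale=0.3, size=200),
    "c": rng.normal(size=200),
})

# Squared Pearson correlation = R-squared of a one-feature OLS with intercept.
r2 = df.corr() ** 2

# Visit each unordered pair once and keep the strong ones.
high_pairs = [(u, v, r2.loc[u, v]) for u, v in combinations(df.columns, 2)
              if r2.loc[u, v] > 0.8]
for u, v, score in high_pairs:
    print(u, v, round(score, 3))
```

Because the correlation matrix is symmetric, swapping X and Y gives the same R-squared, so this kind of score alone cannot tell which variable implies the other.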

<h1 style="color:#D00892">Thank you for reading</h1>
And thanks for the nice comments! 

In [None]:
# what are the top differences between the three centroids:
difference_between_centroids = pd.Series(abs(kmeans_noNan.cluster_centers_[0] - kmeans_noNan.cluster_centers_[1])+\
                                       abs(kmeans_noNan.cluster_centers_[0] - kmeans_noNan.cluster_centers_[2])+\
                                      abs(kmeans_noNan.cluster_centers_[2] - kmeans_noNan.cluster_centers_[1]))

difference_between_centroids.sort_values(ascending=False, inplace=True)

# a lot of copy/paste code...hhh, sorry, I will work better if you hire me, I promise
all_num_2020_reduced_c1 = all_num_2020_no_Nan.iloc[np.where(kmeans_noNan.labels_ == 0)]
all_num_2020_reduced_c1 = all_num_2020_reduced_c1.astype(float).mean(axis=0)

all_num_2020_reduced_c2 = all_num_2020_no_Nan.iloc[np.where(kmeans_noNan.labels_ == 1)]
all_num_2020_reduced_c2 = all_num_2020_reduced_c2.astype(float).mean(axis=0)

all_num_2020_reduced_c3 = all_num_2020_no_Nan.iloc[np.where(kmeans_noNan.labels_ == 2)]
all_num_2020_reduced_c3 = all_num_2020_reduced_c3.astype(float).mean(axis=0)
    
top_C1 = give_top_n(15,difference_between_centroids, all_num_2020_reduced_c1)
top_C2 = give_top_n(15,difference_between_centroids, all_num_2020_reduced_c2)
top_C3 = give_top_n(15,difference_between_centroids, all_num_2020_reduced_c3)

# and we plot it !!
theta = radar_factory(15, 'polygon')
 
fig = plt.figure(figsize=(20,20))
ax = fig.add_subplot(1, 1, 1, projection='radar')

colors_2 = [colors[0*2+1],colors[1*2+1],colors[2*2+1]]
i=0
for j in range(4,43):
    theta = radar_factory(j, 'polygon')
    top_C1 = give_top_n(j,difference_between_centroids, all_num_2020_reduced_c1)
    top_C2 = give_top_n(j,difference_between_centroids, all_num_2020_reduced_c2)
    top_C3 = give_top_n(j,difference_between_centroids, all_num_2020_reduced_c3)
    for d in [top_C2, top_C1, top_C3]:
        #ax.plot(theta, d, color= colors[i%10] )
        ax.fill(theta, d, color= colors[i%10], alpha=0.25)     
        i+=1
#ax.set_varlabels([x for x in top_C2.index])
#plt.title("Top 15 features that separates the clusters", fontsize=16)
ax.yaxis.grid(False)
ax.xaxis.grid(False)
plt.xticks([])
plt.yticks([])
plt.show()



Please upvote if you like it :)