# Risk, Ambition and Solution KPIs for cities

<a id="ToC"></a>
## Table of Contents

* [Introduction](#intro)
* [1. Risk exposure KPIs](#Risk-exposure)
    - [1.1. Dataset description](#Dataset-description)
    - [1.2. Risk exposure scoring methodology](#Risk-exposure-methodology)
* [2. Ambition KPIs](#Ambition)
    - [2.1. Dataset preparation](#Dataset-preparation)
    - [2.2. Ambition Scores by question](#Ambition-score-by-question)
    - [2.3. Ambition KPIs by macro-theme](#KPIs-compilation)
* [3. Discussions](#Discussion)
    - [3.1. Studying the intersection of KPIs](#Risk-Effort)
    - [3.2. KPIs validation](#Kpi-validation)
    - [3.3. Questions' Pertinence](#Pertinence)
    - [3.4. Solutions recommendations](#Political)
* [Conclusion](#Conclusion)
* [Appendices](#Appendices)

<a id="intro"></a>
## Introduction

### Problem original statement

<p style="text-align:justify;">Let us start by analyzing the terms of the request, as formulated by the Carbon Disclosure Project.</p>

> <p style="text-align:justify;">"Data scientists will scour environmental information provided to CDP by disclosing companies and cities, searching for <b>solutions</b> to our most pressing problems related to <b>climate change, water security, deforestation, and social inequity</b>."</p>

> <p style="text-align:justify;">"Develop a <b>methodology</b> for calculating <b>key performance indicators (KPIs)</b> that relate to the <b>environmental</b> and <b>social</b> issues that are discussed in the CDP survey data. Leverage external data sources and thoroughly discuss the intersection between environmental issues and social issues. Mine information to create automated insight generation demonstrating whether city and corporate <b>ambitions</b> take these factors into account."</p>

<p style="text-align:justify;">We propose to focus our analysis on the "cities", as companies have already been the subject of numerous studies on environmental, social and governance (ESG) issues.</p>

Hence, some of the questions that these KPIs may help to answer could be:

- <p style="text-align:justify;">How do you help cities adapt to a rapidly changing climate amidst a global pandemic, but do it in a way that is socially equitable?</p>

- <p style="text-align:justify;">What are the projects that can be invested in that will help pull cities out of a recession, mitigate climate issues, but not perpetuate racial/social inequities?</p>

- <p style="text-align:justify;">What are the practical and actionable points where city and corporate ambition join, i.e. where do cities have problems that corporations affected by those problems could solve, and vice versa?</p>

- <p style="text-align:justify;">How can we measure the intersection between environmental risks and social equity, as a contributor to resiliency?</p>

### Extracting the main goals

We propose to synthesize the request as follows:

| Type of output | Topics | Specificities |
| ----------- | ----------- | ----------- | 
| - Methodology <br> - KPIs | - Climate change <br> - Water security <br> - Deforestation <br> - Racial/social inequities <br> - Pull out of a recession <br> - Global pandemic <br> - Mitigate and Adapt to climate change | - Intersection between env. and soc. issues <br> - Ambitions to take factors into account <br> - Projects can be invested <br> - That corporations could solve <br> - External data sources <br> - Automated process |

### Contextualizing the challenge

#### Topics: the need for a holistic approach to tackle *grand challenges*

<p style="text-align:justify;">Although the CDP has historically focused on climate issues, its questionnaires have regularly evolved to incorporate social and economic issues. This development is part of an overall trend of changing issues facing decision-makers in cities and companies. Formerly confronted with <em>local</em> problems, mainly related to the direct consequences of their activities, they are now involved in global and interconnected issues such as climate change and pandemics. This new type of societal issue has been conceptualised under the term <em>grand challenges</em>. Grand challenges differ from tame problems in three ways: they are complex, uncertain and evaluative <a href="https://www.researchgate.net/publication/272788893_Tackling_Grand_Challenges_Pragmatically_Robust_Action_Revisited">(Ferraro et al., 2015)</a>. They can be defined as <em>’specific critical barriers that, if removed, would help solve an important societal problem with a high likelihood of global impact through widespread implementation’</em> <a href="https://journals.aom.org/doi/10.5465/amj.2016.4007">(George et al., 2016)</a>. Being complex means that they are inseparable, as illustrated by the holistic vision of the 17 Sustainable Development Goals of the United Nations (see below). </p>

<p style="text-align:justify;">The case of the 'yellow vests' (2018), who opposed an increase in the French carbon tax, illustrates the need to integrate social issues into energy transition plans. More recently, the COVID-19 crisis has shown one of the underestimated impacts of climate change and the importance of health system resilience. On this specific concern, the only real performance indicator (or at least the one that facilitates the management of this crisis) seems to be hospital capacity. We do not have enough hindsight on these situations and their local management to draw relevant indicators and, more generally, we have too little data on this subject to properly address the risk of an infectious wave. However, it is noteworthy that this risk is already fully identified by respondents in the 2020 questionnaires. Although we focus on climate issues, we have therefore tried throughout our study to integrate these social and economic issues. </p>

#### Data: the (climate) data gap

<p style="text-align:justify;">One of the issues <a href="https://2degrees-investing.org/resource/asset-level-data-and-climate-relate-financial-analysis-a-market-survey/">regularly raised</a> in studies on the integration of climate issues in finance is the lack of data on the exposure of the various actors regarding their climate risks. In this study we rely on the definition of the <a href="https://assets.bbhub.io/company/sites/60/2020/10/FINAL-2017-TCFD-Report-11052018.pdf">TCFD (2017)</a> to distinguish:</p>

-  <p style="text-align:justify;">transition risk: risks related to the <em>transition</em> to a lower-carbon economy </p>
- <p style="text-align:justify;">physical risks: risks related to the <em>physical</em> impacts of climate change.</p>

<p style="text-align:justify;">The question of physical risks and the necessary adaptations in the face of changing conditions requires particularly precise data.  Cities and companies that are very close to each other may face very different climate hazards. Once again, putting these physical risk indicators into perspective with the state of the art of the market clearly shows us the added value of the CDP databases. Its questionnaires make it possible to locate risks by typology, probability and intensity, which in the end can easily allow a mapping of these risks and trace them back to the different forms of economic activity. This question is particularly relevant for enriching the offer made to the financial organization.</p>

<p style="text-align:justify;">Although climate data is still fragmented, a large amount of information is available on open data platforms, such as those of the <a href="https://datatopics.worldbank.org/esg/">World Bank</a> or the <a href="https://unstats.un.org/sdgs/indicators/database">United Nations</a>. We will therefore complement the CDP database with these external resources. </p>

#### Outputs: evaluation and values

<p style="text-align:justify;">One of the objectives of the Kaggle challenge is to build KPIs for each city. This methodological challenge is part of a global logic of evaluation of environmental, social and governance (ESG) performance, which has taken several names: corporate social responsibility (CSR) for companies and actors of the real economy or ESG on the financial side. While financial performance is measured in a relatively universal way (thanks to the homogeneity of accounting frameworks and financial markets), environmental and social performance leaves room for many, sometimes divergent approaches. For the same counterpart, the themes selected, the way of calculating the scores on these themes and the weighting of these themes differ according to the observers and their values <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3438533">(Berg, Kölbel and Rigobon, 2020)</a>. In order to limit the bias induced by <em>our</em> values, we have tried to refer as much as possible to a widely shared international reference framework - the Sustainable Development Goals - in particular in the choice of macro-themes and variables used to build our KPIs. </p>

### The United Nations Sustainable Development Goals framework

<p style="text-align:justify;">As one of the specificities of the project is to study the interactions between the social, environmental and economic components of climate change, we have chosen to organise our study according to the 17 Sustainable Development Goals (SDGs) of the United Nations. Adopted in 2015, the SDGs are an extension of the eight Millennium Development Goals (MDGs) defined in 2000 with a horizon of 2015. However, they differ on two points: whereas the MDGs focused on humanitarian issues, the SDGs cover social, environmental and economic issues on the one hand, and are the result of negotiations involving all stakeholders (local authorities, civil society and the private sector) on the other.</p>

![sdgs.PNG](attachment:sdgs.PNG)

We therefore believe that this framework is relevant to structure our work: 

- It enables us to address the interactions between social, environmental and economic issues.

- It is recognised by the organisations we are studying: cities and companies (72% of the companies surveyed by PwC worldwide included SDGs in their communications in 2018).

<p style="text-align:justify;">After analysing the 17 goals and 169 targets, we propose to select and group the following goals into three macro-themes:</p>

![tableau_Sdgs_macro_themes.png](attachment:tableau_Sdgs_macro_themes.png)

### Risks, Ambitions, Solutions

Given the CDP's request, our study has three main objectives:

- **Risks**:  develop a risk KPI for each city and macro-theme, based on CDP and external data;

- <p style="text-align:justify;"><b>Ambitions</b>: develop an ambition KPI for each city and macro-theme, based on the analysis of all the responses to the CDP: quantitative, categorical and free text;</p>

- <p style="text-align:justify;"><b>Solutions</b>: propose solutions for each city based on the responses provided by other cities with similar risks.</p>

We articulate these three goals in the following methodology:

![schema1.3.jpg](attachment:schema1.3.jpg)

#### In order to meet these three objectives in the most relevant and rigorous way possible, we propose:

* <p style="text-align:justify;">KPIs per city built from the CDP database, supplemented by external data;</p>
* <p style="text-align:justify;">KPIs based on the SDGs' framework;</p>
* <p style="text-align:justify;">A method for automatically extracting and processing the qualitative content of the CDP responses;</p>
* <p style="text-align:justify;">An analysis of the intersections between environmental, economic and social issues based on correlation analysis;</p>
* <p style="text-align:justify;">A method for checking and explaining the relevance of scores based on decision trees;</p>
* <p style="text-align:justify;">An assessment of the questions' pertinence to give the Carbon Disclosure Project insights on possible changes to the questionnaire;</p>
* <p style="text-align:justify;">A tool to provide solutions/recommendations to cities even if they are not in the database.</p>

#### Libraries

In [None]:
!pip install python-highcharts pycountry-convert

In [None]:
# standard libs
import os
import pandas as pd
import numpy as np
import json
import string
import pickle
import math
from collections import Counter
from collections import defaultdict
from collections.abc import Iterable
from itertools import compress

# Graphic libraries
import matplotlib.pyplot as plt
import seaborn as sn
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
from highcharts import Highchart
import pydot
from scipy import stats
#import altair as alt # https://altair-viz.github.io/ 
# to install: conda install -c conda-forge altair vega_datasets

# Maps 
import geopandas as gpd
import pycountry_convert as pc
# import basemap as bm

# Natural Language Toolkit
import re
import spacy
import nltk
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer

# Topics modeling
from gensim import corpora
from gensim import models
from gensim.models import Phrases
# LDA visualisation
import pyLDAvis
import pyLDAvis.gensim

# Min max scaler
from sklearn.preprocessing import MinMaxScaler

# Machine learning 
from sklearn import tree
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from itertools import chain # to flatten list of sentences of tokens into list of tokens

[Back to Table of Contents](#ToC)

<a id="Risk-exposure"></a>
# 1. Risk exposure KPIs

<a id="Dataset-description"></a>
## 1.1 Dataset description

### 1.1.1 CDP survey

<p style="text-align:justify;">We use the CDP 2020 Cities database as a reference to build our KPIs. This database contains responses to questionnaires from 566 cities. After light cleaning of the coordinates, the selected cities are located as follows:</p>

In [None]:
CDP_cities_correctly_located  = pd.read_excel('../input/cdp-inputs-final/Cities_Disclosing_to_CDP_corrected.xlsx')

gdf = gpd.GeoDataFrame(
    CDP_cities_correctly_located, 
    geometry=gpd.points_from_xy(CDP_cities_correctly_located.Longitude,
                                CDP_cities_correctly_located.Latitude))

fig = px.scatter_geo(gdf,
                    lat=gdf.geometry.y,
                    lon=gdf.geometry.x,
                    hover_name='Organization')
fig.show()

### 1.1.2 External databases

<p style="text-align:justify;">To complete the CDP database, we rely on the following open-source databases: </p>

- <a href="https://unstats.un.org/sdgs/indicators/database/">United Nations database</a>
- <p style="text-align:justify;"><a href="http://sedac.ciesin.columbia.edu/es/compendium.html">Compendium of Environmental Sustainability Indicator Collections: 2004 Environmental Vulnerability Index (EVI) (July 2006) Center for International Earth Science Information Network (CIESIN) Columbia University</a></p>
- [Notre Dame GAIN (ND-GAIN) Country Index](https://gain.nd.edu/our-work/country-index/)
- City 500 census (for health issues scores)

<p style="text-align:justify;">As an example, the map below shows one indicator available in the UN database: the renewable energy share in the total final energy consumption (%). </p>

In [None]:
df = pd.read_excel('../input/cdp-inputs-final/Country_indicators.xlsx')

fig = go.Figure(data=go.Choropleth(
    locations = df['iso3'],
    z = df['7.2.1 Renewable energy share in the total final energy consumption (%)'],
    text = df['Country'],
    colorscale = 'Reds',
    autocolorscale=False,
    reversescale=True,
    marker_line_color='darkgray',
    marker_line_width=0.5,
    colorbar_title = 'Renewable energy share (%)',
))
fig.show()

For global climate conditions we used:
- [Climate conditions (GLDAS NOAH)](https://disc.gsfc.nasa.gov/datasets?keywords=GLDAS): We use a model from NASA web service.
- [Soils properties from Global Data Set of Derived Soil Properties, 0.5-Degree Grid (ISRIC-WISE)](https://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=546)

<a id="Risk-exposure-methodology"></a>
## 1.2 Risk exposure scoring methodology

<p style="text-align:justify;">In this section we present the methodology for constructing risk exposure KPIs. We illustrate this methodology through the example of the KPI of the macro-theme "Planet". Within each macro-theme, we have defined sub-components around the relevant SDGs. For example, for the macro-theme "Planet", we built the scores for the following SDGs: water (6), energy (7), climate change (13), biodiversity (14 and 15). Each SDG score is built from relevant variables from CDP and external databases.</p>

### 1.2.1 From CDP and external database to SDGs scores

<p style="text-align:justify;">From a risk exposure perspective, the question we considered most relevant in the CDP questionnaire is question 2.1, where respondents are asked to list <em>the most significant climate hazards faced by [their] city</em>. For each potential hazard, cities must report the probability of occurrence and the magnitude of the impact. The SDG 13 risk exposure score, for each city $j$, is therefore built as follows:</p>

$$Expo_{j,SDG13} = \sum_{i=1}^{N}Probability_{i} \times Magnitude_{i}$$

<p style="text-align:justify;">where $N$ is the total number of climate hazards reported by city $j$, and $Probability_{i}$ and $Magnitude_{i}$ are the probability of occurrence and the magnitude of the impact of hazard $i$. </p>
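
<p style="text-align:justify;">As an illustration only, the sum above can be sketched on a toy table with one row per city and hazard. The numeric encoding of the categorical answers (<code>PROBABILITY_SCALE</code>, <code>MAGNITUDE_SCALE</code>), the column names and the helper <code>sdg13_exposure</code> are assumptions made for this sketch; the mapping actually used in the study is not shown here.</p>

```python
import pandas as pd

# Hypothetical numeric encodings of the categorical answers to question 2.1
# (assumption: the study's real mapping is not reproduced here).
PROBABILITY_SCALE = {"Low": 1, "Medium Low": 2, "Medium": 3, "Medium High": 4, "High": 5}
MAGNITUDE_SCALE = {"Low": 1, "Medium Low": 2, "Medium": 3, "Medium High": 4, "High": 5}


def sdg13_exposure(hazards: pd.DataFrame) -> pd.Series:
    """Sum Probability x Magnitude over the hazards reported by each city."""
    prob = hazards["Probability"].map(PROBABILITY_SCALE)
    magn = hazards["Magnitude"].map(MAGNITUDE_SCALE)
    return (prob * magn).groupby(hazards["Account"]).sum()


# Toy long-format table: one row per (city, hazard)
toy_hazards = pd.DataFrame({
    "Account": [1, 1, 2],
    "Climate Hazard": ["Flood", "Heat wave", "Drought"],
    "Probability": ["High", "Medium", "Low"],
    "Magnitude": ["Medium", "Medium", "High"],
})
exposure = sdg13_exposure(toy_hazards)
print(exposure)
```

<p style="text-align:justify;">With this encoding, a city that reports many likely and severe hazards accumulates a higher raw score, which is why the score is normalized before being combined with other indicators.</p>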

<p style="text-align:justify;">Below are the most frequent climate hazards. The marker size gives the number of respondents who considered the hazard in their analysis, and the coordinates are the average probability and magnitude across respondents.</p>

In [None]:
Proba_magnitude_aggreg = pd.read_excel('../input/cdp-inputs-final/Proba_Magn_agg_risk.xlsx')
fig = px.scatter(Proba_magnitude_aggreg, 
                 x="Magnitude", 
                 y="Probability",
                 size="n", color="Risk",
                 hover_name="Climate Hazard", log_x=False, size_max=30)
fig.update_layout({
'plot_bgcolor': 'rgba(0, 0, 0, 0)',
'paper_bgcolor': 'rgba(0, 0, 0, 0)',
})
fig.show()

For each indicator, we standardize the raw values (z-score) to obtain a normalized signal:

$$ Score = \frac{Score - mean(Score)}{sd(Score)} $$
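
A minimal sketch of this standardization, applied column by column with pandas. The helper name `standardize`, the toy columns and the use of the population standard deviation (`ddof=0`) are assumptions for illustration; the text does not say which convention was used.

```python
import pandas as pd


def standardize(indicators: pd.DataFrame) -> pd.DataFrame:
    """Column-wise z-score: subtract the mean, divide by the standard deviation."""
    # Assumption: population standard deviation (ddof=0)
    return (indicators - indicators.mean()) / indicators.std(ddof=0)


# Toy indicators on very different scales
toy_indicators = pd.DataFrame({
    "water": [1.0, 2.0, 3.0],
    "energy": [10.0, 20.0, 60.0],
})
z_scores = standardize(toy_indicators)
print(z_scores)
```

After this step every indicator has zero mean and unit standard deviation, so indicators measured in different units can be averaged.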

The other SDG scores within the macro-theme Planet are based on the following variables:

|Water (6)  | Energy  (7) |  Biodiversity (14-15)| 
| :----- | :-----------   | :-----               |
|- (CDP2.1) Water Scarcity > Drought <br> - (CDP14.1) pct pop potable water <br> - (CDP4.2a) Change in land-use <br> - (CDP14.2a) Declining water quality <br> - (CDP14.2a) Drought<br>- (CDP14.2a) Ecosystem vulnerability <br>- (CDP14.2a) Energy supply issues<br>- (CDP14.2a) Environmental regulations<br>- (CDP14.2a) Higher water prices<br>- (CDP14.2a) Inadequate or ageing water supply infrastructure<br>- (CDP14.2a) Increased levels of plastic in freshwater bodies<br>- (CDP14.2a) Increased water demand<br>- (CDP14.2a) Increased water scarcity<br> - (CDP14.2a) Increased water stress<br>- (CDP14.2a) Pollution incidents<br>- (CDP14.2a) Rationing of municipal water supply<br>- (CDP14.2a) Severe weather events<br>- (CDP14.2a) Unauthorised/unregistered water connections<br>- (CDP14.2a) Water infestation/disease<br>- (CDP6.2.1) Proportion of population practicing open defecation by urban/rural (%)<br>- (CDP6.6.1) Water body extent (permanent and maybe permanent) (% of total land area)<br>- (CDP6.6.1) Water body extent (permanent and maybe permanent) (square kilometres)<br>- (CDP6.6.1) Water body extent (permanent) (% of total land area)<br>- (CDP6.6.1) Water body extent (permanent) (square kilometres)<br>- (SEDAC) DRYEVI<br> - (SEDAC) WATEVI<br>- (SEDAC) WATEREVI<br>- (NDG) water Index| - (UN8.1) Biomass (+) <br> - (UN8.1) Coal (-) <br>- (UN8.1) Gas (-) <br>- (UN8.1) Geothermal (+) <br>- (UN8.1) Hydro (+) <br> - (UN8.1) Nuclear (+) <br>- (UN8.1) Oil (-)<br> - (UN8.1) Solar (+)<br>- (UN8.1) Wind (+)<br> - (UN7.1.1) Proportion of population with access to electricity, by urban/rural (%) (+) <br> - (UN7.2.1) Renewable energy share in the total final energy consumption (%) (+) <br> - (UN7.3.1) Energy intensity level of primary energy (megajoules per constant 2011 purchasing power parity GDP)|- (NDG) habitat<br> - (UN14.1.1) Chlorophyll-a deviations, remote sensing (%)<br> - (UN15.1.1) Forest area (thousands of hectares) <br>- (UN15.1.1) Forest area as a proportion of total land area (%)<br>- (UN15.1.2) Average proportion of Terrestrial Key Biodiversity Areas (KBAs)<br> covered by protected areas (%)<br>- (UN15.2.1) Above-ground biomass stock in forest (tonnes per hectare)<br>- (UN15.2.1) Forest area annual net change rate (%)<br>- (UN15.2.1) Forest area under an independently verified forest management <br>certification scheme (thousands of hectares)<br>- (UN15.4.1) Average proportion of Mountain Key Biodiversity Areas (KBAs) <br>covered by protected areas (%)<br>- (UN15.5.1) Red List Index<br>- (UN15.6.1) Countries that are contracting Parties to the International Treaty <br>on Plant Genetic Resources for Food and Agriculture (PGRFA) (1 = YES; 0 = NO)<br>- (UN15.6.1) Countries that are parties to the Nagoya Protocol (1 = YES; 0 = NO)<br>- (UN15.6.1) Countries that have legislative, administrative and policy framework or<br> measures reported through the Online Reporting System on Compliance of the International Treaty on Plant Genetic Resources for Food and Agriculture (PGRFA) (1 = YES; 0 = NO)<br>- (UN15.6.1) Countries that have legislative, administrative and <br>policy framework or measures reported to the Access and Benefit-Sharing Clearing-House (1 = YES; 0 = NO)<br>- (UN15.6.1) Total reported number of Standard Material Transfer Agreements (SMTAs) transferring plant genetic resources for food and agriculture to the country (number)<br> - (UN15.8.1) Legislation, Regulation, Act related to the prevention of introduction<br> and management of Invasive Alien Species (1 = YES, 0 = NO)<br>- (UN15.8.1) National Biodiversity Strategy and Action Plan (NBSAP)<br> targets alignment to Aichi Biodiversity target 9 set out in the Strategic Plan for Biodiversity 2011-2020 (1 = YES, 0 = NO)|

### 1.2.2 From SDG to macro-theme KPIs

<p style="text-align:justify;">We construct the Planet and Social components by aggregating the metrics listed in the table below. We use an equal weighting scheme because of the large number of components. For the Prosperity risk exposure, we propose a more advanced method. Following the <a href="http://www.oecd.org/sdd/42495745.pdf">OECD methodology</a>, we build a new score using Principal Component Analysis. More specifically, we use factor loadings, i.e. the correlations between the original variables and the factors. Squared factor loadings indicate what percentage of the variance in an original variable is explained by a factor and are retained as the weight for each variable.</p>
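
<p style="text-align:justify;">The weighting idea can be sketched as follows. This is a simplified illustration, not the exact procedure of the study: it keeps only the first, unrotated principal component, whereas the OECD methodology also covers factor rotation and the combination of several factors. The function <code>pca_weights</code> and the toy variables are hypothetical.</p>

```python
import numpy as np
import pandas as pd


def pca_weights(data: pd.DataFrame) -> pd.Series:
    """Weights from the squared loadings of the first principal component.

    Simplifying assumption: only the first, unrotated component is used.
    """
    corr = np.corrcoef(data.to_numpy(), rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues in ascending order
    # Loadings = correlations between the variables and the component
    loadings = eigvecs[:, -1] * np.sqrt(eigvals[-1])
    squared = loadings ** 2
    return pd.Series(squared / squared.sum(), index=data.columns)


# Toy data: two variables driven by a common factor, one pure noise variable
rng = np.random.default_rng(0)
common = rng.normal(size=200)
toy = pd.DataFrame({
    "unemployment": common + 0.1 * rng.normal(size=200),
    "cost_of_living": common + 0.1 * rng.normal(size=200),
    "noise": rng.normal(size=200),
})

weights = pca_weights(toy)
standardized = (toy - toy.mean()) / toy.std(ddof=0)
composite = standardized @ weights  # weighted score per observation
print(weights)
```

<p style="text-align:justify;">Variables that move together with the dominant component receive most of the weight, while a variable unrelated to it contributes little to the composite score.</p>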

In [None]:
CDP_cities_scores = pd.read_excel('../input/cdp-inputs-final/final_risk_scores.xlsx')
scores  = ["biodiv_Z",
           "water_Z",
           "energy_Z",
           "physical_Z"]
CDP_cities_scores = CDP_cities_scores.loc[:, scores]
g = sn.pairplot(CDP_cities_scores)

The macro-theme scores are based on the following variables:

| Prosperity  | Planet| Social|
| :---        | :---- | :---- |
|- (CDP.2.2.) Budgetary capacity <br>- (CDP.2.2.) Cost of living<br> - (CDP.2.2.) Economic diversity <br>- (CDP.2.2.) Economic health <br>- (CDP.2.2.) Infrastructure capacity <br>- (CDP.2.2.) Infrastructure conditions / maintenance<br> - (CDP.2.2.) Underemployment<br>- (CDP.2.2.) Unemployment<br>- Population<br>- (UN8.1.1) Annual growth rate of <br>real GDP per capita (%)<br>- (UN9.2.1) Manufacturing value added as a <br>proportion of GDP (%)| Equally weighted <br>construction based on the <br>- biodiversity ,<br> - water,<br>- energy and<br>- physical score<br> detailed above| - (NDG) Food <br>- (NDG) Health  <br> - (UN3.9.1) Mortality  Age-standardized mortality rate<br> attributed to household and ambient air pollution <br>(deaths per 100,000 population)<br>- (UN3.9.2) Mortality rate attributed to unsafe <br>water, unsafe sanitation and lack of<br> hygiene <br>(deaths per 100,000 population) <br>- (UN14.2a) Higher water prices <br> - (CDP2.2) Unemployment <br> - (UN7.1.1) Proportion of population with access<br> to electricity, by urban/rural (%)|

In [None]:
def hc_gaussianKDE(x, bounds=(0, 1), title='', dataType=''):
    """Return a Highchart with the Gaussian kernel density estimate of each column of `x`.

    `bounds` gives the (min, max) of the evaluation grid, sampled in steps of 0.01.
    """
    
    # Highchart options
    H = Highchart(width=750, height=600)
    
    options = {
        'title': {
            'text': title
        },
        'xAxis': {
            'title': { 'text': dataType}
        },
        'tooltip': {
            'shared': True,  
        },
        'colors': ["#df5353","#434348", "#7cb5ec", "#90ed7d", "#f7a35c", "#8085e9", "#f15c80", "#e4d354", "#2b908f", "#f45b5b", "#91e8e1"]
    }

    H.set_dict_options(options)
    
    # Kernel Density Estimation
    for i in x.columns:
        kde = stats.gaussian_kde(x.loc[:,i].values)
        data = [[float(j),float(np.round(kde(j)[0],3))] for j in np.array(range(int(100*bounds[0]),int(100*bounds[1])+1))/100]

        H.add_data_set(data, 'area', i, tooltip={'headerFormat': '<b>{series.name}</b><br>'})#, color='rgba(223, 83, 83, .5)')
    
    return H

In [None]:
CDP_cities_scores = pd.read_excel('../input/cdp-inputs-final/final_risk_scores.xlsx')
scores  = ["Social Risk", # To call prosperity
           "Prosperity Risk",
           "Planet Risk"]
CDP_cities_scores = CDP_cities_scores.loc[:, scores] 

H_risks = hc_gaussianKDE(CDP_cities_scores, [0,1], title = 'Density estimation of the Risk scores', dataType = 'Score')
H_risks

Depending on the range of variables used, the KPIs are more or less normally distributed. However, the scope of variables is large enough to prevent strongly skewed distributions. We analyse the empirical meaning of our KPIs later on.

[Back to Table of Contents](#ToC)

<a id="Ambition"></a>
# 2. Ambition KPIs

<p style="text-align:justify;">This section aims to develop ambition key performance indicators for each city and macro-theme, based on the analysis of all types of responses addressed to the CDP: quantitative, categorical and free text.</p>

<p style="text-align:justify;">We differentiate the creation of scores per question from the computation of KPIs by macro-themes of SDGs.</p>

<a id="Dataset-preparation"></a>
## 2.1. Dataset preparation

We added supplementary variables coming from our analysis of the 2020 cities' questionnaire:
* <p style="text-align:justify;"><b>Type of Answer</b> takes four possible values depending on the type of the question: Descriptive Statistics, Risk Exposure, Willingness to fight risks (also referred to as Ambition to fight risks) and Investment. </p>
* <p style="text-align:justify;"><b>SDGs</b> takes three possible values depending on which macro-theme it relates to (<em>Prosperity</em>, <em>Social</em> and <em>Planet</em>).</p>
* <p style="text-align:justify;"><b>Which is text to analyse</b> indicates which columns of the question are suitable for text analysis, while <b>Which else to analyse</b> indicates which other columns must be analysed with other approaches. </p>
* <p style="text-align:justify;"><b>Political Recommendation</b> is marked for a given question if we can use the answers of some cities to give political recommendations to others.</p>
* <p style="text-align:justify;"><b>Is Positive</b> tells if the possible answers are positive. <em>Positive</em> can take three forms: quantitative, ranked categories or all positive categories. It helps us apply a specific approach to the evaluation of answers when building our KPIs. </p>
* <p style="text-align:justify;"><b>Text Ref Col</b> tells which column may be used to define reference categories to analyse free text answers. </p>

In [None]:
cities_2020 = pd.read_pickle('../input/cdp-inputs-final/2020_Full_Cities_Dataset_Expended.pkl')
cities_2020.rename(columns={'Account Number': 'Account'}, inplace = True)

In [None]:
geo_cities_2020 = pd.read_csv('../input/cdp-unlocking-climate-solutions/Cities/Cities Disclosing/2020_Cities_Disclosing_to_CDP.csv', usecols=["Account Number","City Location"])
geo_cities_2020.rename(columns={'Account Number': 'Account'}, inplace = True)

<p style="text-align:justify;">From our first analysis of the questionnaire, we created three different datasets based on the underlying <b>type of Answer</b>:</p>

* **Cities_Stat_2020**: with all descriptive questions of the cities.

* **Cities_Risk_2020**: with all questions regarding risk exposures.

* **Cities_Will_2020**: with all questions regarding ambition of cities to fight risks.

In [None]:
cities_Will_2020 = pd.read_pickle('../input/cdp-inputs-final/cities_Will_2020.pkl')
cities_Stat_2020 = pd.read_pickle('../input/cdp-inputs-final/cities_Stat_2020.pkl')

<p style="text-align:justify;">Each question in the original dataset may lead to several answers and hence multiple rows. To ease our analysis of the data, we reshape the data into a "<em>dict/wide format</em>" where each question's answers are stored in a pandas DataFrame. These DataFrames are in a "<em>wide format</em>" with one row per respondent and one column per sub-question. For a given question, each cell of the DataFrame is a dictionary where keys are the answers to a sub-question and values are their identifiers.</p>
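
<p style="text-align:justify;">A minimal sketch of this reshaping for a single question, on toy data. The column names (<code>Column Name</code>, <code>Row Number</code>, <code>Response Answer</code>), the choice of the row number as identifier and the helper <code>to_wide</code> are assumptions for illustration; the pickled objects loaded in the next cell were built by the authors' own pipeline.</p>

```python
import pandas as pd

# Toy long-format answers to one question: one row per (respondent, sub-question, answer)
long_answers = pd.DataFrame({
    "Account": [1, 1, 1, 2],
    "Column Name": ["Action", "Action", "Status", "Action"],
    "Row Number": [1, 2, 1, 1],
    "Response Answer": ["Tree planting", "Flood mapping", "Operation", "Tree planting"],
})


def to_wide(question_df: pd.DataFrame) -> pd.DataFrame:
    """One row per respondent, one column per sub-question, cells = {answer: identifier}."""
    rows = {}
    for (account, column), group in question_df.groupby(["Account", "Column Name"]):
        cell = dict(zip(group["Response Answer"], group["Row Number"]))
        rows.setdefault(account, {})[column] = cell
    # Sub-questions a respondent did not answer are left as NaN
    return pd.DataFrame(list(rows.values()), index=list(rows.keys()))


wide_answers = to_wide(long_answers)
print(wide_answers)
```

<p style="text-align:justify;">Storing one such DataFrame per question in a dictionary keyed by question number gives the "<em>dict/wide format</em>" described above.</p>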

In [None]:
with open('../input/cdp-inputs-final/cities_Will_2020_vWide.pkl', 'rb') as handle:
    cities_Will_2020_vWide = pickle.load(handle)

We then build a new dataset, called ***extra_info***, from the extra columns we added to the main dataset.

In [None]:
extra_info = pd.read_pickle('../input/cdp-inputs-final/extra_info_2020.pkl')

<a id="Ambition-score-by-question"></a>
## 2.2. Ambition Scores by question

For each question, we aim at creating an ambition score per respondent. 

<p style="text-align:justify;">We first distinguish between data that are directly analyzable (<em>analytical answers</em>) and data that will require particular processing (<em>free text answers</em>). For the free text answers we apply:</p>

* <p style="text-align:justify;"><b>Sentiment analysis:</b> It aims to capture the mindset of the respondent. Does the respondent emphasise their actions to mitigate risk and promote new policies and projects, or simply give a list of what is being done, with no real conviction on the matter?</p>

* <p style="text-align:justify;"><b>Topics analysis:</b> It aims to assess the respondent's ability to fully capture the scope of the question. For a given question, did the respondent write about the main topics that other respondents mentioned in their answers to the same question?</p>

Each type of answer is then treated differently, following the process presented below.

![schema_AmbitionScore.jpg](attachment:schema_AmbitionScore.jpg)

<a id="Quantitative-and-Categorical"></a>
### 2.2.1. Quantitative and Categorical answers: Frequentist and Automated Scoring

<p style="text-align:justify;">Out of the 73 questions we identified as relevant for the analysis of <b>cities' ambition</b> to protect the <b><em>Planet</em></b>, take care of <b><em>Social</em></b> issues and ensure <b><em>Prosperity</em></b> through sustainable development, only four have neither quantitative nor categorical answers to analyse. This shows how important this part is for building the KPIs.</p>

We follow the process explained above to score each city's answers.

In [None]:
unique_answer = pd.read_pickle('../input/cdp-inputs-final/positive_dict.pkl')
unique_answer = dict(zip(unique_answer.Category, unique_answer.Label))

In [None]:
cities_Will_2020_CatAnalysis = pd.read_pickle('../input/cdp-inputs-final/cities_Will_2020_CatAnalysis.pkl')

with open('../input/cdp-inputs-final/cities_Will_2020_CatCount.pkl', 'rb') as handle:
    cities_Will_2020_CatCount = pickle.load(handle)

<p style="text-align:justify;">We quickly check the final output of this process. Since we have one score per question for every city that answered, a quick way is to average all scores at the city level, then visualise the top 10 and bottom 10 cities and the distribution of all scores.</p>

In [None]:
cities_Will_2020_CatScore = cities_Will_2020_CatAnalysis.mean(axis=1)
top_10_CatAnalysis = cities_Will_2020_CatScore.sort_values(ascending=False).iloc[0:10]
bottom_10_CatAnalysis = cities_Will_2020_CatScore.sort_values(ascending=True).iloc[0:10]
Topbottom_Cities_CatAnalysis = pd.DataFrame([[cities_Will_2020.loc[cities_Will_2020.Account == x, 'Organization'].values[0] for x in list(top_10_CatAnalysis.index)],
                           [cities_Will_2020.loc[cities_Will_2020.Account == x, 'Organization'].values[0] for x in list(bottom_10_CatAnalysis.index)]],
                         index = ['Top', 'Bottom'], columns = range(1,11)).T


Topbottom_Cities_CatAnalysis = Topbottom_Cities_CatAnalysis.applymap(lambda x: x.encode('iso-8859-1', 'ignore').decode('utf8','ignore'))
Topbottom_Cities_CatAnalysis

In [None]:
data = pd.DataFrame(cities_Will_2020_CatScore, columns = ['Score'])
H_risks = hc_gaussianKDE(data, [0,1], title = 'Density estimation of the Quantitative and Categorical answers', dataType = 'Score')
H_risks

<p style="text-align:justify;">The output is consistent with our expectations: most of the top cities are well known for their efforts in favor of all three SDG macro-themes, and the scores are approximately normally distributed.</p>

<a id="Free-text"></a>
### 2.2.2. Free text answers: Unsupervised Sentiment and Topics Scoring

<p style="text-align:justify;">We build a function that, for a given question, reshapes and cleans all free text answers into a single string. It is possible to specify a reference sub-question to analyse free text answers based on the sub-question topics.</p>

<p style="text-align:justify;">This approach is relevant for both topics and sentiment analysis, as we can compare answers between cities while filtering on topics. For topics analysis, using LDA for instance, it avoids rediscovering the sub-question topics and instead identifies the topics that are genuinely common among responses.</p>

All details are given in the dedicated [section](#Text-explore) of the appendix.

#### 2.2.2.1. Unsupervised Sentiment Analysis

<p style="text-align:justify;">We define all functions in a <a href="#Utils-sent">Utils section</a>, which may be seen as a repository for unsupervised sentiment analysis. It consolidates WordNet and VADER sentiment analysis with multiple possible word weighting schemes, including term frequency-inverse document frequency (<em>TF-IDF</em>). 
Each score is then discretized into quintiles, and the final sentiment score is computed by a majority vote between the different models.</p>
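
The discretisation and vote described above can be sketched as follows. This is a minimal illustration, not the notebook's actual Utils code: the function name `quintile_majority_vote`, the column names and the tie-breaking rule (lowest class wins) are assumptions made for the example.

```python
import pandas as pd


def quintile_majority_vote(raw_scores: pd.DataFrame) -> pd.Series:
    """Combine several sentiment models into one class per answer (0 to 4)."""
    # Discretise each model's raw scores into quintile classes 0..4.
    # Ranking first avoids duplicate bin edges when many scores are tied.
    classes = raw_scores.apply(
        lambda col: pd.qcut(col.rank(method="first"), 5, labels=False)
    )
    # Majority vote across models; mode() sorts its output, so on a tie
    # column 0 holds the lowest class (an assumption of this sketch).
    return classes.mode(axis=1)[0].astype(int)


# Hypothetical raw scores from three models for ten answers.
values = [i / 10 for i in range(1, 11)]
raw = pd.DataFrame({"wordnet": values, "vader": values, "tfidf": values[::-1]})
print(quintile_majority_vote(raw).tolist())
```

Here two of the three hypothetical models agree on every answer, so the vote follows them and the dissenting model is ignored.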

<p style="text-align:justify;">We apply the sentiment analysis to every question marked in the <em><b>Which is text to Analyse</b></em> column of the <em><b>extra_info</b></em> dataset. This amounts to 21 questions. We also set the reference sub-question from the <em><b>Text Ref Col</b></em> column.</p>

In [None]:
with open('../input/cdp-inputs-final/cities_Will_2020_SentimentAnalysis.pkl', 'rb') as handle:
    cities_Will_2020_SentimentAnalysis = pickle.load(handle)

<p style="text-align:justify;">Below is an example of the output for question 6.0, for the sub-question on opportunities regarding the development of energy efficiency measures and technologies.</p>

<p style="text-align:justify;">Question 6.0 asks: "<em>Please indicate the opportunities your city has identified as a result of addressing climate change and describe how the city is positioning itself to take advantage of these opportunities.</em>"</p>

<p style="text-align:justify;">We give three examples of free text answers with their sentiment score. One can clearly see a difference in the way respondents answer the question:</p>

* "*LED lighting upgrades - Municipal buildings and street lights*" : **Sentiment Score = 0** (index 23);

* <p style="text-align:justify;">"<em>The City is looking for new opportunities within the clean technology business like implementing electrical vehicle chargers and enrolling all the residents and municipal accounts into East Bay Community Energy's 100 Renewable plan.</em>" : <b>Sentiment Score = 2</b> (index 38);</p>

* <p style="text-align:justify;">"<em>The Regional Energy System Operator (RESO) project delivers a smart local energy system design for Coventry and a viable business model using an approach that can be replicated across the wider region. The cost effective project will assist Coventry in decarbonising to align with West Midlands 2041 targets. Coventry won a national competition as the location of a new Battery Industrialisation Centre which hopes to be the centre of excellence in battery technologies, aimed specifically at electric car energy storage.</em>" : <b>Sentiment Score = 4</b> (index = 33)</p>

<a id="U-Top"></a>
#### 2.2.2.2. Unsupervised Topics analysis

<p style="text-align:justify;">As explained previously, we capture the topics addressed in a specific sub-question and then check whether a given answer covers them all by retrieving each topic's density in the answer.</p>

<p style="text-align:justify;">Similar to the sentiment analysis section, we define all functions in a <a href="#Utils-top">Utils section</a>, which may be seen as a repository for unsupervised topics analysis.</p>

<p style="text-align:justify;">In practice, we set the number of topics a priori to three. This seems enough to capture the complexity of the answers' structure, since filtering on reference sub-questions gives us homogeneous sets of answers.</p>

In [None]:
with open('../input/cdp-inputs-final/cities_Will_2020_LDA_models.pkl', 'rb') as handle:
    cities_Will_2020_LDA_models = pickle.load(handle)

with open('../input/cdp-inputs-final/cities_Will_2020_LDA_topics.pkl', 'rb') as handle:
    cities_Will_2020_LDA_topics = pickle.load(handle)

<p style="text-align:justify;">To illustrate the pertinence of our approach, we present an example of the topics identified in the corpus of question 6.0 about the "<em>Development of sustainable transport sector</em>".</p>

<p style="text-align:justify;">Although the first topic seems to be predominant in the answers, all three are clearly identified and justified. Indeed, by looking at the most relevant terms for each topic, we can conclude that their meanings are as follows:</p>

* Topic 1: Personal vehicles and electric transportation
* Topic 2: Active transportation / clean transportation
* Topic 3: Public transport and infrastructure

No city has covered all three identified topics, but some answers cover two out of three.

<p style="text-align:justify;">For example, the answer with index 168 is "<em>The City has recently received funds to complete two trails - the Clippership Connector and the South Medford Connector - that will expand options for active transportation throughout the region.</em>"</p>

<p style="text-align:justify;">We can clearly identify Topic 3, which corresponds to public infrastructure development, and Topic 2 on active transportation.</p>

<p style="text-align:justify;">Another example is the answer with index 9: "<em>Expansion and promotion of E-mobility by grants for appropriate investments such as vehicle procurement or charging infrastructure. Impulse investments in hydrogen technology through charging station, vehicle and bus procurement.Testing new types of E-vehicles and exemplary use in new areas (e.g. city logistics).</em>"</p>

<p style="text-align:justify;">One can easily identify Topic 1 on electric transportation as well as Topic 2 on clean transportation.</p>

<a id="Score-bq"></a>
### 2.2.3. Scores aggregation to questions' level

To sum things up, we have the following data sources to exploit from our automated data mining process:

* ***cities_Will_2020_CatAnalysis*** : categorical and quantitative questions scores

* <p style="text-align:justify;"><em><b>cities_Will_2020_SentimentAnalysis</b></em> : sentiment on qualitative questions. For each respondent, we still have to combine the sentiments of the sub-questions to create a final sentiment per question. Since every sentiment score lies on the same quintile scale (0 to 4), we can average them at the question level and still compare sentiment between questions.</p>

* <p style="text-align:justify;"><em><b>cities_Will_2020_LDA_topics</b></em> : topics raised in qualitative questions. For each sub-question we have a distribution of topics within each answer. We set the score as the product between the number of captured topics in the answer and the difference between the maximum and the difference in the probability distribution. Hence this favours answers that cover as many topics as possible, as evenly as possible. The final score per question is therefore comparable between questions.</p>
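
The topic score described in the last bullet can be read in more than one way. The sketch below shows one possible reading, assuming that "the maximum" is 1 and "the difference" is the gap between the largest and smallest topic probabilities. The function name `topic_coverage_score` and the `threshold` used to decide that a topic is "captured" are hypothetical and are not taken from the notebook.

```python
import numpy as np


def topic_coverage_score(topic_probs, threshold=0.2):
    """Score one answer from its topic distribution (one possible reading)."""
    p = np.asarray(topic_probs, dtype=float)
    # A topic counts as captured when its probability reaches the threshold
    # (hypothetical cut-off chosen for this illustration).
    n_captured = int((p >= threshold).sum())
    # Evenness term: 1 minus the spread between the most and least present topics.
    evenness = 1.0 - (p.max() - p.min())
    return n_captured * evenness


print(topic_coverage_score([1 / 3, 1 / 3, 1 / 3]))  # all three topics, evenly covered
print(topic_coverage_score([0.9, 0.05, 0.05]))      # a single dominant topic
```

Under this reading an answer that spreads evenly over the three topics gets the highest score, while an answer dominated by one topic is penalised twice: fewer captured topics and a larger spread.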

<p style="text-align:justify;">We first bring the scores of the <b>sentiment and topics analyses</b> to the question level. Sub-question scores are comparable with one another since they are either classes or densities. Hence we simply average the scores over sub-questions to obtain the questions' scores.</p>

In [None]:
cities_Will_2020_LDA_topics_Avg = pd.read_pickle('../input/cdp-inputs-final/cities_Will_2020_LDA_topics_Avg.pkl')
cities_Will_2020_SentimentAnalysis_Avg = pd.read_pickle('../input/cdp-inputs-final/cities_Will_2020_SentimentAnalysis_Avg.pkl')

<p style="text-align:justify;">We then aggregate the scores of the three approaches to compute a final score per respondent and per question answered. To do so, we rank each respondent on a given approach and then compute their average scaled rank.</p>

<p style="text-align:justify;">It is important to note that the user can balance the importance of each approach in the final score by weighting them. In this case, we decided to assign <b>80%</b> to the <b>Quantitative/Categorical</b> analysis and <b>10%</b> to each of the two <b>Free Text</b> analyses. Depending on the evolution of the questionnaire, one might want to change the weights.</p>

\begin{equation}
    Score_{q,i} = Score_{q,i,QCat}*W_{QCat} + Score_{q,i,Senti}*W_{Senti} + Score_{q,i,Topics}*W_{Topics}
\end{equation}

<p style="text-align:justify;">With <b>QCat</b> being the quantitative/categorical analysis, <b>Senti</b> the sentiment analysis and <b>Topics</b> the output of the LDA analysis.</p>
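
As an illustration of the formula above, here is a minimal sketch for a single question, assuming that every city has a score for all three approaches. The function name `combine_approaches` and the toy values are hypothetical; the actual implementation is the one referenced in the appendix.

```python
import pandas as pd

# Weights from the text: 80% quantitative/categorical, 10% sentiment, 10% topics.
WEIGHTS = {"QCat": 0.8, "Senti": 0.1, "Topics": 0.1}


def combine_approaches(scores: pd.DataFrame, weights: dict = WEIGHTS) -> pd.Series:
    """Weighted sum of scaled ranks; one row per city, one column per approach."""
    w = pd.Series(weights)
    # rank(pct=True) turns raw scores into ranks scaled to (0, 1], higher is better.
    scaled_ranks = scores[w.index].rank(pct=True)
    return (scaled_ranks * w).sum(axis=1)


# Toy scores for four hypothetical cities on one question.
scores = pd.DataFrame(
    {"QCat": [0.2, 0.4, 0.6, 0.8], "Senti": [4, 3, 2, 1], "Topics": [1.0, 2.0, 3.0, 4.0]},
    index=["A", "B", "C", "D"],
)
print(combine_approaches(scores))
```

Because the weights sum to one and ranks are scaled to the same range, the combined score stays comparable across questions. A city missing one approach would need its weights renormalised, which this sketch does not handle.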

All details are presented in the dedicated [appendix section](#Question-level-app).

In [None]:
Scores_All_Questions = pd.read_pickle('../input/cdp-inputs-final/cities_2020_Scores_All_Questions.pkl')

<a id="KPIs-compilation"></a>
## 2.3. Ambition KPIs by macro-theme

We then aggregate scores by macro-theme to create the so-called **KPIs**.

<p style="text-align:justify;">We compute a weighted average of the questions' scores using a <b>question quality metric</b> that we built and discuss further in <a href="#Pertinence">Section 3.3</a>.</p>

The score of city *i* on a specific macro-theme *s* is then:
\begin{equation}
    KPI_{s,i} = \frac{\sum_{q=1}^{Q} Pertinence_q * Score_{q,i}}{\sum_{q=1}^{Q} Pertinence_q}
\end{equation}

With **Q** the number of relevant questions for the macro-theme *s* and $Pertinence_q$ the assessed quality of question *q*. Questions that city *i* did not answer are left out of both sums.

In [None]:
Question_Quality = pd.read_pickle('../input/cdp-inputs-final/Questions_Quality.pkl')

In [None]:
# Subset data per SDGs macro-theme
Planet_questions = list(extra_info['Question Number'][extra_info['SDGs'].apply(lambda x: 'Planet' in x if isinstance(x, list) else False)])
Planet_Scores = Scores_All_Questions.loc[:, ['Account', 'City Location'] + Planet_questions].copy()

Prosperity_questions = list(extra_info['Question Number'][extra_info['SDGs'].apply(lambda x: 'Prosperity' in x if isinstance(x, list) else False)])
Prosperity_Scores = Scores_All_Questions.loc[:, ['Account', 'City Location'] + Prosperity_questions].copy()

Social_questions = list(extra_info['Question Number'][extra_info['SDGs'].apply(lambda x: 'SocialJustice' in x if isinstance(x, list) else False)])
Social_Scores = Scores_All_Questions.loc[:, ['Account', 'City Location'] + Social_questions].copy()

In [None]:
def weighted_nan_average(x, w):
    # Weighted average of the scores in x with weights w, ignoring unanswered (NaN) questions
    indices = x.notna().to_numpy()
    if indices.any():
        out = np.average(np.array(x[indices], dtype='float'), weights = w[indices])
    else:
        out = np.nan
    return out

In [None]:
# Compute average score per macro-theme
weights_Planet = np.array([float(Question_Quality.Pertinence.iloc[w]) for w in [list(Question_Quality.index).index(q) for q in Planet_Scores.columns[2:]]])
Planet_Scores['KPI'] = Planet_Scores.apply(lambda x: weighted_nan_average(x[2:], weights_Planet), axis = 1)

weights_Prosperity = np.array([float(Question_Quality.Pertinence.iloc[w]) for w in [list(Question_Quality.index).index(q) for q in Prosperity_Scores.columns[2:]]])
Prosperity_Scores['KPI'] = Prosperity_Scores.apply(lambda x: weighted_nan_average(x[2:], w = weights_Prosperity), axis = 1)

weights_Social = np.array([float(Question_Quality.Pertinence.iloc[w]) for w in [list(Question_Quality.index).index(q) for q in Social_Scores.columns[2:]]])
Social_Scores['KPI'] = Social_Scores.apply(lambda x: weighted_nan_average(x[2:], w = weights_Social), axis = 1)

We obtain well distributed KPIs on our three macro-themes: **Planet**, **Prosperity** and **Social**.

In [None]:
KPIs = pd.DataFrame([Planet_Scores.KPI, Prosperity_Scores.KPI, Social_Scores.KPI]).T
KPIs.columns = ['Planet', 'Prosperity', 'Social']
KPIs.insert(loc=0, column='Account Number', value=Planet_Scores.Account)
KPIs.set_index('Account Number', inplace = True)

In [None]:
H_scores = hc_gaussianKDE(pd.concat([Social_Scores['KPI'].rename('Social'),Prosperity_Scores['KPI'].rename('Prosperity'),Planet_Scores['KPI'].rename('Planet')],axis=1), [0,1], title = 'Density estimation of the Ambition scores', dataType = 'Score')
H_scores

[Back to Table of Contents](#ToC)

<a id="Discussion"></a>
# 3. Discussion

<p style="text-align:justify;">In this section, we discuss the methodology used to generate our KPIs and propose some potential extensions to leverage our results. First, we study the intersections of the Ambitions and Risks KPIs of the Planet, Prosperity and Social macro themes. Next, in order to make sure that our Ambition KPIs are coherent measures, we propose a systematic process for validation. In addition, we introduce metrics to measure the pertinence and coverage of the information extracted from the CDP questions. Finally, we propose an extension of our work where we generate solution recommendations from model cities facing similar climate risks.</p>

<a id="Risk-Effort"></a>
## 3.1 Studying the intersection of KPIs

<p style="text-align:justify;">We built:</p>

* Risk KPIs, based on a selection of external data consolidated by some relevant questions from the CDP. 
* Ambition KPIs, derived from the unsupervised numerical scoring of the answers to the questions.
    
<p style="text-align:justify;">It is thus natural to ask how the two types of KPIs are related. A correlation study provides an answer to the question of the intersection between the environmental, economic and social components raised in the initial problem statement.</p>

In [None]:
CDP_cities_scores_ambitions      = pd.read_excel('../input/cdp-inputs-final/Ambition_KPIs.xlsx')
CDP_cities_scores_risk_exposures = pd.read_excel('../input/cdp-inputs-final/final_risk_scores.xlsx')
loc  = ["Account Number",
        "Prosperity Risk", 
        "Planet Risk",
        "Social Risk"]
CDP_cities_scores_risk_exposures= CDP_cities_scores_risk_exposures.loc[:, loc] 

CDP_cities_scores = pd.merge(CDP_cities_scores_ambitions,CDP_cities_scores_risk_exposures, on='Account Number')

KPIs_corr = CDP_cities_scores.iloc[:,1:].corr()
ax = sn.heatmap(KPIs_corr, annot=True, fmt='.2%', cmap = 'Blues') # change cmap to use another colour palette
# bottom, top = ax.get_ylim()
# ax.set_ylim(bottom + 0.5, top - 0.5)

<p style="text-align:justify;">We observe high correlation between the ambition KPIs. This result is characteristic of the ambitions expressed in terms of social and responsible development, and it seems that the two dimensions are intrinsically linked in the general discourse. Depending on the scores' objectives, correlations may be reduced by not taking into account the same question in multiple macro-themes. On the other hand, risk KPIs are mostly uncorrelated with each other and with the ambition KPIs. This is because we used a non-exhaustive set of indicators, selected to be representative and independent.</p>
    
<p style="text-align:justify;">This second result is important. It suggests that cities that are highly exposed to certain risks have not systematically developed a specific action plan on these issues. It is therefore essential to come up with solutions for these cities. This second result is also found at the country level:</p>

In [None]:
CDP_cities_correctly_located  = pd.read_excel('../input/cdp-inputs-final/Cities_Disclosing_to_CDP_corrected.xlsx')

gdf = gpd.GeoDataFrame(
    CDP_cities_correctly_located, 
    geometry=gpd.points_from_xy(CDP_cities_correctly_located.Longitude,
                                CDP_cities_correctly_located.Latitude))

# Look up each city's KPI by account number in its own score table; cities without a score get NaN
gdf['Social Ambition Score'] = gdf['Account.Number'].map(Social_Scores.drop_duplicates('Account').set_index('Account')['KPI'])
gdf['Planet Ambition Score'] = gdf['Account.Number'].map(Planet_Scores.drop_duplicates('Account').set_index('Account')['KPI'])
gdf['Prosperity Ambition Score'] = gdf['Account.Number'].map(Prosperity_Scores.drop_duplicates('Account').set_index('Account')['KPI'])

In [None]:
country_code = [pc.country_alpha3_to_country_alpha2(c) for c in gdf.iso3.unique()]
gdf['Continent'] = [pc.country_alpha2_to_continent_code(pc.country_alpha3_to_country_alpha2(c)) for c in gdf.iso3] 

CDP_cities_scores = pd.read_excel('../input/cdp-inputs-final/final_risk_scores.xlsx')
scores  = ["Planet Risk", # To call prosperity
           "Prosperity Risk",
           "Social Risk"]
CDP_cities_scores = CDP_cities_scores.loc[:, scores] 
CDP_cities_scores['Account'] = geo_cities_2020.loc[:,'Account']

risk_env = CDP_cities_scores.copy().loc[:,['Account','Planet Risk']].dropna()
risk_env['iso3']=np.nan
risk_env.loc[[x in gdf.loc[:,'Account.Number'].values for x in risk_env.Account],'iso3'] = [gdf.loc[gdf['Account.Number']==x,'iso3'].values for x in risk_env.Account if x in gdf['Account.Number'].values]

def country_bubbles(continent):
    # One bubble per country: x = mean ambition, y = mean risk, z = number of disclosing cities
    counts = gdf.loc[gdf.Continent==continent,'iso3'].value_counts()
    return [{'x':np.round(gdf.loc[gdf.iso3==x,'Planet Ambition Score'].mean(),2),
             'y':np.round(risk_env.loc[risk_env.iso3==x, 'Planet Risk'].mean(),2),
             'z':float(counts.loc[x]),
             'country':x} for x in gdf.loc[gdf.Continent==continent,'iso3'].unique()]

EUR, NA, SA, AS, AF, OC = [country_bubbles(c) for c in ['EU', 'NA', 'SA', 'AS', 'AF', 'OC']]

H = Highchart(width=900, height=600)

options = {
    'chart': {
        'type': 'bubble',
        'zoomType': 'xy'
    },

    'title': {
        'text': 'Risk exposure/Ambition for Planet related issues aggregated at the country level'
    },
    'xAxis': {
        'title': { 'text': 'Average Ambition Score amongst cities'}
    },
    'yAxis': {
        'title': { 'text': 'Average Risk Score amongst cities'}
    },
    'tooltip': {
        'useHTML': True,
        'headerFormat': '<table>',
        'pointFormat': '<tr><th colspan="2"><h3>{point.country}</h3></th></tr>' +
            '<tr><th>Risk Score:</th><td>{point.y}</td></tr>' +
            '<tr><th>Ambition Score:</th><td>{point.x}</td></tr>' +
            '<tr><th>Number of cities:</th><td>{point.z}</td></tr>',
        'footerFormat': '</table>',
        'followPointer': True
    },
    'colors': ["#df5353","#434348", "#7cb5ec", "#90ed7d", "#f7a35c", "#8085e9", "#f15c80", "#e4d354", "#2b908f", "#f45b5b", "#91e8e1"]

}

H.set_dict_options(options)

H.add_data_set(EUR, 'bubble', 'Europe')
H.add_data_set(NA, 'bubble', 'North America')
H.add_data_set(SA, 'bubble', 'South America')
H.add_data_set(AS, 'bubble', 'Asia')
H.add_data_set(AF, 'bubble', 'Africa')
H.add_data_set(OC, 'bubble', 'Oceania')

H

In addition to the absence of correlation between risk exposure and ambition, we can see from this figure that:
* There are no clear clusters of Planet risk exposure per region. This seems reasonable since climates differ within a region;
* On the contrary, there are different trends on the Ambition side between regions. Europe seems clearly ahead of its counterparts.

<a id="Kpi-validation"></a>
## 3.2 KPIs validation 

<p style="text-align:justify;">Assessing the quality of our KPIs is not straightforward. Indeed, there are no clear target scores for our macro-themes and environmental and social assessments are often controversial <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3438533">(Berg, Kölbel and Rigobon, 2020)</a>. In our case, the alignment of cities with the SDGs is by no means evaluated or referenced. One way to tackle this issue is to directly challenge some of the output, like the top 10 cities of each macro-theme. Another one is to use <em>machine-learning</em> algorithms to build classification rules that explain the deciding factors of our KPIs, and then challenge those rules. The next sections go through these two approaches.</p>

### 3.2.1 Validation by visualisation

In [None]:
top_10_Planet = Planet_Scores.sort_values('KPI', ascending=False).iloc[0:10,0].values
bottom_10_Planet = Planet_Scores.sort_values('KPI', ascending=True).iloc[0:10,0].values

top_10_Social = Social_Scores.sort_values('KPI', ascending=False).iloc[0:10,0].values
bottom_10_Social = Social_Scores.sort_values('KPI', ascending=True).iloc[0:10,0].values

top_10_Prosperity = Prosperity_Scores.sort_values('KPI', ascending=False).iloc[0:10,0].values
bottom_10_Prosperity = Prosperity_Scores.sort_values('KPI', ascending=True).iloc[0:10,0].values

In [None]:
fig = px.scatter_geo(gdf,
                    lat=gdf.geometry.y,
                    lon=gdf.geometry.x,
                    hover_name='Organization',
                    color ='Planet Ambition Score',
                    width = 1200,
                    height = 600, 
                    title = 'Planet Ambition Score')
fig.show()

<p style="text-align:justify;">The map above leads to the same conclusion: different regional trends appear in the ambition on the Planet macro-theme. However, there are some exceptions, driven by several factors such as the geographic situation.</p>

In [None]:
Top_Cities = pd.DataFrame([[cities_Will_2020.loc[cities_Will_2020.Account == x, 'Organization'].values[0] for x in top_10_Planet],
                           [cities_Will_2020.loc[cities_Will_2020.Account == x, 'Organization'].values[0] for x in top_10_Prosperity],
                           [cities_Will_2020.loc[cities_Will_2020.Account == x, 'Organization'].values[0] for x in top_10_Social]],
                         index = ['Planet', 'Prosperity', 'Social'], columns = range(1,11)).T

Top_Cities = Top_Cities.applymap(lambda x: x.encode('iso-8859-1','ignore').decode('utf8','ignore'))
Top_Cities

<p style="text-align:justify;">Our top 10 on the three macro-themes looks very convincing. Indeed, on the Planet side, we can compare our results to the CDP A list, since cities from the A list are <em>"setting and meeting ambitious climate goals"</em>. <b>Out of our ten cities with the highest Planet Ambition score, 7 are in the 2020 CDP A list</b>. However, the match is not perfect because we did not use the CDP's methodology to build our Planet KPI and therefore the scope of questions may differ. We believe we can be confident on the other macro-themes since our approach is mostly unsupervised and automated.</p>

### 3.2.2 Validation by classification tree

<p style="text-align:justify;">Here is an example for the Planet macro-theme. We bin the scores into 5 groups (quintiles) and run a classic classification tree based on simple categorical variables that are included in the Planet KPI's construction. </p>

<p style="text-align:justify;">The main goal is to use the abilities of classification trees to come up with simple rules that make sense to justify our KPIs on the ambition of cities to protect the environment. </p>

In [None]:
Planet_data = {k: cities_Will_2020_vWide[k] for k in Planet_questions}
Planet_data_dummies = geo_cities_2020.copy()
label = {'1.0':'Sust Plan', '5.5':'GHG Plan', '10.7':'Zero Emission Zone', '12.3':'Food Policies', '14.4':'Public Water Management',
         '2.0':'Climate Assessment', '3.2':'Climate Change Plan', '4.0':'Emissions Inventory', '4.8':'Emission Growth', '6.2':'Business Partnership',
         '8.0':'Renewable Target'}
col = {'1.0':1, '5.5':1, '10.7':1,  '12.3':1, '14.4':1, '2.0':1, '3.2':1, '4.0':1, '4.8':1, '6.2':1, '8.0':1}
for q in ['1.0', '5.5', '10.7', '12.3', '14.4', '2.0', '3.2', '4.0', '4.8', '6.2', '8.0']:
    # Extract the categorical answer(s) to question q (first key when the answer is stored as a dict)
    df = pd.DataFrame(Planet_data[q].iloc[:,(1+col[q])].apply(lambda x: list(x.keys())[0] if isinstance(x, dict) else x))
    df.columns = [label[q]]
    df.dropna(inplace=True)
    df = pd.DataFrame(df[label[q]].tolist(), index= df.index)
    # One-hot encode the answers; groupby(level=0).sum() replaces sum(level=0), removed in pandas 2.0
    df = pd.get_dummies(df.stack(), prefix = label[q]).groupby(level=0).sum()
    # Drop categories chosen by fewer than 10 cities; items() replaces iteritems(), removed in pandas 2.0
    df.drop([c for c, val in df.sum().items() if val < 10], axis=1, inplace=True)
    Planet_data_dummies = pd.merge(Planet_data_dummies, df, left_index=True, right_index=True, how = 'outer')
Planet_data_dummies.fillna(0, inplace = True)
# Remove the "No" dummies so that only the other answers remain as features
Planet_data_dummies.drop([c for c in Planet_data_dummies.keys() if re.search(r'_No$', c)],axis=1,inplace=True)

In [None]:
Y = pd.qcut(Planet_Scores.KPI,5, labels = False)
X_train, X_test, Y_train, Y_test = train_test_split(Planet_data_dummies.iloc[:,2:], Y, test_size=0.2, stratify=Y, random_state=10)

In [None]:
clf = tree.DecisionTreeClassifier(min_samples_leaf=40, max_depth=3,random_state=10)
clf = clf.fit(X_train, Y_train)
Y_pred = clf.predict(X_test)
# print(classification_report(Y_test, Y_pred))

In [None]:
from sklearn.tree import export_graphviz
export_graphviz(clf, 
                out_file='tree.dot', 
                feature_names = list(Planet_data_dummies.columns[2:]),
                rounded = True, proportion = False, 
                precision = 2, filled = True)
import pydot
(graph,) = pydot.graph_from_dot_file('tree.dot')
graph.write_png('tree.png')
plt.figure(figsize = (18, 20))
plt.imshow(plt.imread('tree.png'))
plt.axis('off');
plt.show();


<p style="text-align:justify;">Based on the classification tree results, the <b>Planet KPI</b> is very consistent. Indeed, cities with the highest scores are the ones with: a food policy to fight against waste AND a public water management agency AND a decrease in GHG emissions OR renewable targets OR collaboration with business on climate issues, while the cities with the lowest scores have none of the above.</p>

<p style="text-align:justify;">This classification tree has not been designed for prediction but rather to explain our KPI results. Hence it works particularly well for the extremes and could be used as an independent sparse scoring system.</p> 

<p style="text-align:justify;">This approach may of course also be applied to the other macro-themes.</p>


<a id="Pertinence"></a>
## 3.3 Questions' pertinence

[Back to section 2.3](#KPIs-compilation)

<p style="text-align:justify;">We wish to assess the pertinence of the questions to determine their value-added in our analysis and more broadly in the CDP's questionnaire.</p>

<p style="text-align:justify;">A naive metric would be to consider the question's coverage, i.e. the percentage of cities that were eligible for the question. Although it is important to measure this quantity, it is not sufficient to assess the pertinence of a question. Indeed, even if all the cities are involved, if all respondents give the same answer, the question will not be useful to discriminate between ambitious and non-ambitious cities. We thus propose a finer approach to pertinence, where we examine the distributions of the answers. </p>

<p style="text-align:justify;"> <b>Hence, we define Pertinence as a measure of the diversity in the answers provided to a given question.</b> The higher the diversity, the closer to 1 the score will be. Conversely, the higher the concentration on a particular answer, the closer the score will be to 0. To measure this concentration, we consider the following cases:</p>

* <p style="text-align:justify;">The answer is free text: we check that the frequency of topic appearances amongst answers is not concentrated on a single topic, and that the sentiments are well distributed amongst all answers. </p>
* <p style="text-align:justify;">The answer is quantitative: we verify that the answers are not concentrated on single values (usually the bounds of the distribution). </p>
* <p style="text-align:justify;">The answers are categorical: we check that no single category accounts for the majority of the cities' answers.</p>

All functions are detailed in the dedicated [appendix section](#Quality).
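
To make the difference between coverage and pertinence concrete, here is a minimal sketch for the categorical case. It uses a normalised Shannon entropy as the diversity measure, which is an assumption of this example and not necessarily the metric implemented in the appendix. The function names and toy answers are hypothetical.

```python
import math
from collections import Counter


def coverage(answers):
    """Share of cities that gave an answer (None marks a missing answer)."""
    if not answers:
        return 0.0
    return sum(a is not None for a in answers) / len(answers)


def pertinence_entropy(answers):
    """Normalised Shannon entropy of the answers: 0 = all identical, 1 = evenly spread."""
    counts = Counter(a for a in answers if a is not None)
    if len(counts) < 2:
        return 0.0  # zero or one distinct answer carries no discriminating information
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    # Normalise by the maximum entropy for the number of observed categories.
    return entropy / math.log(len(counts))


everyone_yes = ["Yes"] * 10                  # full coverage, no diversity
few_but_varied = ["Yes", "No"] + [None] * 8  # low coverage, high diversity
print(coverage(everyone_yes), pertinence_entropy(everyone_yes))
print(coverage(few_but_varied), pertinence_entropy(few_but_varied))
```

The two toy questions show why the two metrics can disagree: the first is answered by every city but cannot separate them, while the second is answered by few cities yet splits them evenly.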

In [None]:
# Generates a Highchart with the pertinence score and coverage of every question

H = Highchart(width=900, height=400)

options = {
    'chart': {
        'type': 'column'
    },
    'title': {
        'text': 'Questions quality'
    },
    'xAxis': {
        'title': { 'text': 'Question Number'},
        'categories' : [q for q in Question_Quality.index.values]
    },
    'yAxis': {
        'title': {'text': 'Score'},
        'max' : 1.0
    },
    'tooltip': {
        'headerFormat': '<span style="font-size:10px"><b>Question {point.key}</b></span><table>',
        'pointFormat': '<tr><td style="color:{series.color};padding:0">{series.name}: </td>' +
            '<td style="padding:0">{point.y}</td></tr>',
        'footerFormat': '</table>',
        'shared': True,
        'useHTML': True
    }
}

H.set_dict_options(options)

pertinence = [x for x in Question_Quality.loc[:, 'Pertinence'].apply(lambda x: np.round(x,2)).values]
coverage = [x for x in Question_Quality.loc[:, 'Coverage'].apply(lambda x: np.round(x,2)).values]

H.add_data_set(pertinence, 'scatter', 'Pertinence')
H.add_data_set(coverage, 'scatter', 'Coverage', color='rgba(223, 83, 83, .5)')

H

<p style="text-align:justify;">We can see that coverage and pertinence are not much related. It is consistent with our definition because some questions apply to very specific cases. Hence, our pertinence score appears more able to capture the question's quality than only the proportion of answers. </p>

<p style="text-align:justify;">In particular, the chart shows that questions 2.1a, 2.1b and 2.1c have the same coverage, because the same cities were asked these questions, but they do not all have the same pertinence score, because the answers differ. This is precisely the point: coverage is not a good measure of pertinence. A question can have weak coverage, with few cities responding, yet a wide variety of responses conveying information, hence a pertinence of 1 for 2.1b and 2.1c even with low coverage.</p>
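
<p style="text-align:justify;">A toy example makes the distinction concrete. Here coverage is the share of cities that answered, and the diversity of answers is measured with a normalized Shannon entropy. The entropy measure is an assumption used for illustration only; it is not the pertinence score defined in the appendix.</p>

```python
import math
from collections import Counter

def coverage(answers):
    """Share of cities that gave an answer (None marks a missing answer)."""
    return sum(a is not None for a in answers) / len(answers)

def normalized_entropy(answers):
    """Shannon entropy of the answers given, divided by its maximum for the observed categories."""
    valid = [a for a in answers if a is not None]
    counts = Counter(valid)
    if len(counts) < 2:
        return 0.0
    total = len(valid)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))

# Hypothetical answers from ten cities
widely_answered = ['Yes'] * 9 + ['No']          # high coverage, little diversity
rarely_answered = ['A', 'B', 'C'] + [None] * 7  # low coverage, maximal diversity

for name, answers in [('widely answered', widely_answered), ('rarely answered', rarely_answered)]:
    print(name, round(coverage(answers), 2), round(normalized_entropy(answers), 2))
```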

<a id="Political"></a>
## 3.4. Solutions recommendations

<p style="text-align:justify;">We propose to use our risk and ambition scores to generate policy recommendations for cities given a set of properties. To make these solutions accessible to any city, we chose to construct clusters from external databases only (answering the CDP questionnaire is not a requirement to be assigned a cluster).</p>

<p style="text-align:justify;">We clustered cities on their climate and geological conditions using a Gaussian finite mixture model fitted by the EM algorithm, applied to the principal components of the respondents' climate and geological data (see Appendix). We use one year of average monthly information for the following variables:</p>

* Soil moisture (0-10, 10-40 and 40-100cm)
* Total precipitation rate
* Surface and air temperature
* Snow precipitation rate
* Wind speed
* Surface albedo (sunlight reflection)
* Specific Humidity
* Total available water capacity (mm water per 1 m soil depth)
* Soil organic carbon density (kg C/m2 for 0-30cm depth range)
* Soil organic carbon density (kg C/m2 for 0-100cm depth range)
* Soil carbonate carbon density (kg C/m2 for 0-100cm depth range)
* Soil pH (0-30 cm depth range)
* Soil pH (30-100 cm depth range)

<p style="text-align:justify;">Based on this information set, we retrieve the principal components and define clusters that can be used for non-responding cities as well. For instance, a city could be interested in what ambitious cities facing similar challenges are doing.</p>

<p style="text-align:justify;">We can see on the following map that clusters are grouped by latitude, which is an indicator of quality for the climate clusters. Note that we could easily include the Risk KPIs in the clustering process if we wanted to restrict the analysis to respondents only.</p>
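
<p style="text-align:justify;">Because the clusters rely on external data only, a non-responding city can be assigned to a cluster once the mixture is fitted. The sketch below assumes that the fitted parameters (weights, means and covariances) are available as arrays in principal-component space. The function name and the toy numbers are hypothetical; the notebook's own fit is the one described in the appendix.</p>

```python
import numpy as np

def assign_cluster(x, weights, means, covariances):
    """Return the 0-based index of the most probable Gaussian component for point x."""
    x = np.asarray(x, dtype=float)
    log_scores = []
    for w, mu, cov in zip(weights, means, covariances):
        diff = x - np.asarray(mu, dtype=float)
        cov = np.asarray(cov, dtype=float)
        # Log-density of a multivariate normal, up to a constant shared by all components
        _, logdet = np.linalg.slogdet(cov)
        maha = diff @ np.linalg.solve(cov, diff)
        log_scores.append(np.log(w) - 0.5 * (logdet + maha))
    return int(np.argmax(log_scores))

# Toy mixture with two well-separated components (hypothetical values)
weights = [0.5, 0.5]
means = [[0.0, 0.0], [5.0, 5.0]]
covariances = [np.eye(2), np.eye(2)]
print(assign_cluster([4.5, 5.0], weights, means, covariances))  # 1
```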

In [None]:
CDP_cities_correctly_located  = pd.read_excel('../input/cdp-inputs-final/Cities_Disclosing_to_CDP_corrected.xlsx')

gdf = gpd.GeoDataFrame(
    CDP_cities_correctly_located, 
    geometry=gpd.points_from_xy(CDP_cities_correctly_located.Longitude,
                                CDP_cities_correctly_located.Latitude))

fig = px.scatter_geo(gdf,
                    lat=gdf.geometry.y,
                    lon=gdf.geometry.x,
                    hover_name='Organization',
                    color ='fit.classification')
fig.show()

<p style="text-align:justify;">We then take the top five cities in this sample and look at their answers to some questions that could be of interest to cities wanting to change their adaptation strategies.</p>
    
<p style="text-align:justify;">The example proposed below focuses on Planet risk and the ambitions to tackle related issues. We illustrate our results on two questions for a city that would be placed in the second group of our clustering. These cities are mostly located around or below the Tropic of Capricorn.</p>
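
<p style="text-align:justify;">The selection step can be sketched in a few lines: keep the cities of one cluster, sort them by the KPI of interest and retain the first five. The column names and scores below are hypothetical and only stand in for the notebook's score tables.</p>

```python
import pandas as pd

def top_cities_in_cluster(scores, cluster, kpi, n=5):
    """Return the accounts of the n best-scoring cities of a cluster for one KPI."""
    in_cluster = scores[scores['cluster'] == cluster]
    return in_cluster.sort_values(kpi, ascending=False)['Account'].head(n).tolist()

# Hypothetical scores for seven cities
scores = pd.DataFrame({
    'Account': [101, 102, 103, 104, 105, 106, 107],
    'cluster': [2, 2, 1, 2, 2, 2, 2],
    'Planet':  [0.40, 0.90, 0.95, 0.10, 0.70, 0.80, 0.60],
})
print(top_cities_in_cluster(scores, cluster=2, kpi='Planet'))
```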

In [None]:
def remove_dictKey(d, key):
    # Removes a key in a dictionary if the key is present
    r = dict(d)
    if key in d.keys():
        del r[key]
    return r

def get_Recommendation(SDG, comparableCities = []):
    
    # if comparableCities is empty, we use the top 5 of all cities
    # if comparableCities is not empty, we retain only the 5 top cities in the list
    if len(comparableCities)==0:
        comparableCities = pd.DataFrame([top_10_Planet[0:5], top_10_Social[0:5], top_10_Prosperity[0:5]], index = ['Planet', 'Social', 'Prosperity']).transpose()
        comparableCities = comparableCities.loc[:,SDG].values
    
    elif len(comparableCities)>5:
        comparableCities = pd.concat([Planet_Scores.loc[[Planet_Scores.loc[c,'Account'] in comparableCities for c in Planet_Scores.index], 'Account'].rename('Account'),Planet_Scores.loc[[Planet_Scores.loc[c,'Account'] in comparableCities for c in Planet_Scores.index], 'KPI'].rename('Planet'), Social_Scores.loc[[Social_Scores.loc[c,'Account'] in comparableCities  for c  in Social_Scores.index], 'KPI'].rename('Social'), Prosperity_Scores.loc[[Prosperity_Scores.loc[c,'Account'] in comparableCities  for c  in Prosperity_Scores.index], 'KPI'].rename('Prosperity')], axis=1)
        comparableCities = comparableCities.sort_values(SDG,ascending=False).Account.values[:5]
    
    # Gets the answers related to this SDG and that can be used to generate recommendations
    reco_info = extra_info.loc[extra_info['Political Recommandation'].isna()==False]
    reco_info = reco_info.loc[[q for q in reco_info.loc[reco_info['SDGs'].isna()==False,'Question Number'].index if SDG in reco_info.loc[q,'SDGs']]]
    #reco_q = reco_info.loc[:,'Question Number'].values
    reco_q = ['3.0','6.2a']
    reco_cols = [[9],[2,3]]
    reco_dict = {q:[] for q in reco_q}
    
    for i,q in enumerate(reco_q):
        # Gets the column of interests
        cols_study = reco_cols[i]
        reco_ans = cities_Will_2020_vWide[q].loc[[cities_Will_2020_vWide[q].loc[c,'Account'] in comparableCities for c in cities_Will_2020_vWide[q].index]]
        
        if q == '3.0':
            reco_dict[q].append(reco_ans)
        else:
            reco_ans = reco_ans.iloc[:,cols_study]
            t0 = [list(item.keys()) for item in reco_ans.iloc[:,0] if isinstance(item, dict)]
            t0 = [item for sublist in t0 for item in sublist]
            t1 = [list(item.keys()) for item in reco_ans.iloc[:,1] if isinstance(item, dict)]
            t1 = [item for sublist in t1 for item in sublist]
            c0 = remove_dictKey(dict(Counter(t0)), 'Question not applicable')
            c1 = remove_dictKey(dict(Counter(t1)), 'Question not applicable')
            counters = {reco_ans.columns[0] : {sect : c0[sect] for sect in list(c0)}, reco_ans.columns[1]:{sect:{col:0 for col in list(c1)} for sect in list(c0)}}

            # `idx` is used here so that the outer loop variable `i` is not overwritten
            for idx in reco_ans.index:
                for j in list(reco_ans.loc[idx,reco_ans.columns[1]]):
                    if j!='Question not applicable':
                        for k in reco_ans.loc[idx,reco_ans.columns[1]][j]:
                            z = [x for x in list(reco_ans.loc[idx,reco_ans.columns[0]]) if k in reco_ans.loc[idx,reco_ans.columns[0]][x]][0]
                            counters[reco_ans.columns[1]][z][j]=counters[reco_ans.columns[1]][z][j]+1
             
            reco_dict[q].append(counters)
    
    return reco_dict

In [None]:
SDG='Planet'
comparableCities = pd.read_excel('../input/cdp-inputs-final/Cities_Disclosing_to_CDP_corrected.xlsx')
comparableCities = comparableCities.iloc[:,[0,-1]]
comparableCities = comparableCities.loc[comparableCities['fit.classification']==2,'Account.Number'].values

reco_dict = get_Recommendation(SDG,comparableCities)
print('Cities used to generate recommendations:')
reco_cities = pd.DataFrame([reco_dict['3.0'][0].loc[:,'Account'].values, [cities_Will_2020.loc[cities_Will_2020['Account']==c, 'Organization'].values[0] for c in reco_dict['3.0'][0].loc[:,'Account'].values]], index = ['Account','Organisation']).transpose()
reco_cities.iloc[:,1] = reco_cities.iloc[:,1].apply(lambda x: x.encode('iso-8859-1','ignore').decode('utf8','ignore'))
reco_cities

In [None]:
# Example 1:
print(cities_Will_2020.loc[cities_Will_2020['Question Number']=='3.0', 'Question Name'].values[0]+'\n')
print('Respondent: ' +cities_Will_2020.loc[cities_Will_2020['Account']==reco_dict['3.0'][0].loc[351,'Account'], 'Organization'].values[0]+'\n')
print('Action: '+list(reco_dict['3.0'][0].loc[351,'Action'])[8])
print('Action description: '+list(reco_dict['3.0'][0].loc[351,'Action description and implementation progress'])[8])

In [None]:
# Example 2
print(cities_Will_2020.loc[cities_Will_2020['Question Number']=='6.2a', 'Question Name'].values[0]+'\n')
d = reco_dict['6.2a'][0]
H_pie = Highchart(width = 850, height = 600)

colors = ["#df5353","#434348", "#7cb5ec", "#90ed7d", "#f7a35c", "#8085e9", "#f15c80", "#e4d354", "#2b908f", "#f45b5b", "#91e8e1"]
colors = [tuple(int(c.lstrip('#')[i:i+2], 16) for i in (0, 2, 4)) for c in colors]
colors = ['rgba('+str(c[0])+','+str(c[1])+','+str(c[2])+',1)' for c in colors]


data = [{
            'y': np.round(100*float(d[list(d)[0]][x]/sum(d[list(d)[0]][xx] for xx in d[list(d)[0]])),2),
            'color': colors[c],
            'drilldown': {
                'name': x,
                'categories': [xx for xx in list(d[list(d)[1]][x]) if d[list(d)[1]][x][xx]>0],
                'data': [np.round(100*float(d[list(d)[1]][x][k]*d[list(d)[0]][x]/(sum(d[list(d)[1]][x][xx] for xx in d[list(d)[1]][x])*sum(d[list(d)[0]][xx] for xx in d[list(d)[0]]))),2) for k in list(d[list(d)[1]][x]) if d[list(d)[1]][x][k]>0],
                'color': colors[c]
            } 
        } for c,x in enumerate(list(d[list(d)[0]]))]


options = {
    'chart': {
        'type': 'pie'
    },
    'title': {
        'text': 'Share of collaboration areas and the corresponding type of collaborations provided by the 5 top cities'
    },
    'plotOptions': {
        'pie': {
            'shadow': False,
            'center': ['50%', '50%']
        }
    },
    'tooltip': {
        'valueSuffix': '%'
    }
}


categories = list(d[list(d)[0]])
browserData = []
versionsData = []

for i in range(len(data)):

    browserData.append({
        'name': categories[i],
        'y': data[i]['y'],
        'color': data[i]['color']
        })

    drillDataLen = len(data[i]['drilldown']['data'])
    for j in range(drillDataLen): 

        brightness = 0.2 - (j / drillDataLen) / 5  # not used below
        versionsData.append({
            'name': data[i]['drilldown']['categories'][j],
            'y': data[i]['drilldown']['data'][j],
            'color': data[i]['color'][:-2]+ '0.5)'
        })

H_pie.set_dict_options(options)

H_pie.add_data_set(browserData, 'pie', cities_Will_2020_vWide['6.2a'].columns[2], size='60%',
            dataLabels={
                'formatter': 'function () { \
                                    return this.y > 5 ? this.point.name : null;\
                                }',
                'color': 'white',
                'distance': -30
            })

H_pie.add_data_set(versionsData, 'pie', cities_Will_2020_vWide['6.2a'].columns[3], size='80%',
            innerSize='60%',
            dataLabels={
                'formatter': "function () {\
                                    return this.y > 5 ? '<b>' + this.point.name + ':</b> ' + this.y + '%'  : null;\
                                }"
            })
H_pie

[Back to Table of Contents](#ToC)

<a id="Conclusion"></a>
# Conclusion

<p style="text-align:justify;">To summarize in a few words, we capitalize on the CDP databases and questionnaires to analyze both the cities' risk exposures and their demonstrated ambitions to tackle such issues. Our methodology exploits the data synergies between the UN Sustainable Development Goals and CDP questionnaires to construct relevant Key Performance Indicators. The quality of the obtained KPIs is then assessed empirically and using machine learning techniques. In addition, we investigate which questions seem to be more pertinent in the CDP questionnaires. Finally, we develop a tool for cities, inside or outside of the CDP database, to gain insights into which directions ambitious cities facing similar climate risks are pursuing.</p>

<p style="text-align:justify;">Our mainly unsupervised approach yields very satisfactory results. Most of the cities with the highest KPIs are publicly known for their leading efforts in the three macro-themes we studied: Planet, Prosperity and Social. In that sense, our KPIs seem to be reliable. </p>

<p style="text-align:justify;">In addition to that, our methodology may be applied in a systematic and dynamic way to follow the evolution of the actors over time. Also, while this notebook focuses on cities - because they have been the subject of less study than corporates - the approach is generalizable, especially the unsupervised assessment of ambition detailed in Section 2.</p> 

<p style="text-align:justify;">Finally, we believe the small number of final KPIs we retained makes these measures easy to understand and compatible with many processes, such as portfolio management, and can help stakeholders take climate change, sustainable development and social issues into account.</p>

[Back to Table of Contents](#ToC)

<a id="Appendices"></a>
# Appendices

## Appendices 1. Risk Scores

Everything regarding data generation in this section is available in the R notebook *SDG_quant_database_construction* attached in the dataset cdp-inputs-final.

## Appendices 2. Ambition Scores

### 2.1. Dataset preparation

We clean the raw CDP data and define functions for further text cleaning.

In [None]:
# cities_2020 = pd.read_excel('../input/cdp-inputs-final/2020_Full_Cities_Dataset_Expended.xlsx', sheet_name = '2020_Full_Cities_Dataset')
# cities_2020.to_pickle('../input/cdp-inputs-final/2020_Full_Cities_Dataset_Expended.pkl')
# cities_2020.rename(columns={'Account Number': 'Account'}, inplace = True)

In [None]:
cities_2020.drop(['Questionnaire', 'Country', 'CDP Region', 'Parent Section', 'Section', 'File Name', 'Last update', 'Row Name', 'Comments'], axis = 1, inplace = True)
cities_2020.keys()

In [None]:
# First we clean our custom annotation columns: zeros are replaced with NaN
annotation_cols = ['SDGs', 'Type of Answer', 'Max Column Number', 'Which Is Text to Analyse',
                   'Which else to Analyse', 'Political Recommandation', 'Is Positive',
                   'Text Ref Col', '2019 OK']
for col in annotation_cols:
    cities_2020.loc[cities_2020[col]==0, col] = np.nan

In [None]:
cities_2020.shape

We build a dictionary referencing which cities answered in English. This will be useful later for free-text analysis.

In [None]:
isEnglish = {'Account':np.array(cities_2020.Account[cities_2020['Question Name'] == 'What language are you submitting your response in?']), 
             'isEnglish':np.array(cities_2020[cities_2020['Question Name'] == 'What language are you submitting your response in?']['Response Answer']=='English')}

isEnglish_df = pd.DataFrame(isEnglish)
isEnglish_df = isEnglish_df[isEnglish_df.isEnglish==True]

# with open('../input/cdp-inputs-final/isEnglish_df.pkl', 'wb') as handle:
#     pickle.dump(isEnglish_df, handle, protocol=pickle.HIGHEST_PROTOCOL)

We further build a function for text cleaning.

In [None]:
import en_core_web_sm
nlp = en_core_web_sm.load()

def cleanData(text, to_list=False):

    if isinstance(text, str):
        
        if text: # empty string is False
            
            # Remove http(s) links
            pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
            text = pattern.sub('', text)

            # Remove Twitter handles and email addresses
            pattern = re.compile(r'\S*@\S*\s?')
            text = pattern.sub('', text)
            
            # Remove specific symbols
            pattern = re.compile(r'[$€£¥\+%]{1}')
            text = pattern.sub('', text)
            
            # Remove multiple whitespace
            text = ' '.join(text.split())
            
            if to_list: 
                
                # Remove dates
                pattern = re.compile(r'([0-9]{2}[:\/,.]){1,2}[:\/,.]?[0-9]{0,4}|am|pm|(\s[0-9]{1,2}[a-zA-Z]{1,3})')
                text = pattern.sub('', text)

                # Remove months
                months = ['january', 'february', 'march', 'april', 'may', 'june', 'july', 'august', 'september', 'october', 
                          'november', 'december']
                text = ' '.join([word for word in text.split() if word.lower() not in months])
                
                # Remove remaining numbers and what surrounds them
                pattern = re.compile(r'\s[a-zA-Z]{0,3}[0-9]+[.,]?[0-9]*[a-zA-Z]{0,3}')
                text = pattern.sub(' ', text)
                
                # Remove multiple whitespace
                text = ' '.join(text.split())
                
                CLEAN_words = {word.lemma_.strip().lower(): word.pos_ for word in nlp(text) 
                               if (not word.is_stop and not word.is_punct and not word.is_space and not word.pos_ in ['NUM'])}  
                
            else:
                CLEAN_words = text
        else:
            CLEAN_words = np.nan
    
    elif (isinstance(text, float) or isinstance(text, int)):
        CLEAN_words = float(text)
            
    else:
        CLEAN_words = np.nan
    
    return CLEAN_words
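
The first three substitutions can be tried in isolation, without loading spaCy. The sketch below reuses the patterns from `cleanData`; the helper name `strip_noise` and the sample sentence are made up for the illustration.

```python
import re

# Patterns copied from cleanData
URL = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
HANDLE_OR_EMAIL = re.compile(r'\S*@\S*\s?')
SYMBOLS = re.compile(r'[$€£¥\+%]{1}')

def strip_noise(text):
    """Remove links, handles/emails and currency-like symbols, then collapse whitespace."""
    for pattern in (URL, HANDLE_OR_EMAIL, SYMBOLS):
        text = pattern.sub('', text)
    return ' '.join(text.split())

sample = 'See https://example.org/plan for details,  contact info@city.example  about the 40% target'
print(strip_noise(sample))  # See for details, contact about the 40 target
```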

From our analysis of the questionnaire, we saw that questions cover different topics. Hence we create three datasets based on the underlying topic given in **Type of Answer**:

* **cities_Stat_2020**: all descriptive questions about the cities.

* **cities_Risk_2020**: all questions regarding risk exposures.

* **cities_Will_2020**: all questions regarding the ambition of cities to fight risks.

In [None]:
cities_Risk_2020 = cities_2020[['Risk Exposure' in k for k in 
                                   cities_2020['Type of Answer'].map(lambda x: x.split(', ') if isinstance(x, str) else [])]].copy()
cities_Will_2020 = cities_2020[['Willingness' in k for k in 
                                   cities_2020['Type of Answer'].map(lambda x: x.split(', ') if isinstance(x, str) else [])]].copy()
cities_Stat_2020 = cities_2020[['Statistics' in k for k in 
                                   cities_2020['Type of Answer'].map(lambda x: x.split(', ') if isinstance(x, str) else [])]].copy()

In [None]:
# We reset the index of each subset.
cities_Risk_2020.reset_index(drop=True, inplace=True)
cities_Will_2020.reset_index(drop=True, inplace=True)
cities_Stat_2020.reset_index(drop=True, inplace=True)

# cities_Stat_2020.to_pickle('../input/cdp-inputs-final/cities_Stat_2020.pkl')
# cities_Will_2020.to_pickle('../input/cdp-inputs-final/cities_Will_2020.pkl')

### Transform database to wide format for a given question

Each question in the original dataset may lead to several answers and hence multiple rows. To ease our analysis, we reshape the data into a "*dict/wide format*": a dictionary that maps each question to a pandas DataFrame holding all its answers. These DataFrames are in a "*wide format*", with one row per respondent and one column per sub-question.

Hence, for a given question, each cell of the DataFrame is a dictionary whose keys are the answers to a sub-question and whose values are the lists of row numbers in which each answer was given.
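
The content of one such cell can be sketched as follows, mirroring the grouping done inside `extract_answer_perCol` below. The helper name and the sample rows are hypothetical.

```python
from collections import defaultdict

def build_cell(rows):
    """Group the (answer, row_number) pairs of one respondent into {answer: [row numbers]}."""
    cell = defaultdict(list)
    for answer, row_number in rows:
        cell[answer].append(row_number)
    return dict(cell)

# Hypothetical long-format rows for one city and one sub-question
rows = [('Energy', 1), ('Transport', 2), ('Energy', 3)]
print(build_cell(rows))  # {'Energy': [1, 3], 'Transport': [2]}
```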

In [None]:
def replace_other_spec(s):
    # Collapse every "Other, please specify..." answer into the single category 'Other'
    if isinstance(s, str) and s.startswith('Other, please specify'):
        return 'Other'
    return s

In [None]:
def extract_answer_perCol(df,question=""):
    
    # Possible to run for one question or multiple
    if question:
        df = df[df['Question Number']==question]
    else:
        out_list = []
        q_nb_list = []
    
    # Run over all questions in df
    for q in df['Question Number'].unique():
        # Add geographic data, in case it is useful for matching with other data sources.
        out = geo_cities_2020.copy()
        #Subset data to question q
        df_small = df[df['Question Number']==q].copy()
        # Apply transformation to "Other, please specify"
        df_small['Response Answer'] = df_small['Response Answer'].apply(replace_other_spec)
        # Drop NaN answers
        df_small = df_small[df_small['Response Answer'].notna()]
    
        # Run over each sub-question (given by the column number)
        for cn in np.sort(df_small['Column Number'].unique()):
            # subset again 
            res = df_small[df_small['Column Number']==cn][['Account', 'Response Answer', 'Row Number']].copy()
            
            # define output dataset
            res_ok = pd.DataFrame()
            res_ok['Account'] = res.Account.unique()
            
            # For each column number we get all answers and the corresponding rows
            resp_list = []
            for i in res.Account.unique():
                h = defaultdict(list)
                for k, v in zip(list(res[res.Account==i].iloc[:,1]), list(res[res.Account==i].iloc[:,2])):
                    h[k].append(v)
                resp_list.append(dict(h))
            res_ok['Response Answer'] = pd.Series(resp_list, dtype='object')
            
            res_ok.rename(columns = {'Response Answer': df_small[df_small['Column Number']==cn]['Column Name'].unique()[0]}, 
                          inplace=True)
            out = pd.merge(out, res_ok, on='Account', how='outer')
        
            if not question:               
                out_list.append(out)
                q_nb_list.append(q)
    
    # output change if there is one question or multiple.
    if question:
        return out
    else:
        return dict(zip(q_nb_list, out_list))

In [None]:
cities_Will_2020_vWide = extract_answer_perCol(cities_Will_2020)

# with open('../input/cdp-inputs-final/cities_Will_2020_vWide.pkl', 'wb') as handle:
#     pickle.dump(cities_Will_2020_vWide, handle, protocol=pickle.HIGHEST_PROTOCOL)

Here is an example for question **6.0**: *Please indicate the opportunities your city has identified as a result of addressing climate change and describe how the city is positioning itself to take advantage of these opportunities.*

In [None]:
print(cities_Will_2020[cities_Will_2020['Question Number']=='6.0'].iloc[0,4])
df_6_0 = cities_Will_2020_vWide['6.0']
df_6_0.head(5)

### 2.2. Ambition Scores by question

### Extra information from questionnaire analysis

From the extra columns we added to the main dataset we build a new dataset called ***extra_info***. 

We could have built it automatically from the added columns, but some questions in the main dataset do not appear in the questionnaire. Therefore we had to update the extra_info dataframe manually.

In [None]:
def transform_ToL(data, split_arg = ','):
    
    x = data.copy()
    
    isStr = x.apply(lambda y: isinstance(y, str))
    
#     x[isStr] = pd.Series(x[isStr].apply(lambda y: [int(i) if len(i)<3 else i for i in list(y.split(split_arg))]), dtype = 'object')
    x[isStr] = pd.Series(x[isStr].apply(lambda y: [int(i) if len(i)<3 else i.replace(' ', '') for i in list(y.split(split_arg))]), dtype = 'object')
    
    return x

In [None]:
##### This step can't be run because it requires a manual input.

# We build extra_info from the main dataset. Some reshaping is required.

# extra_info = []
# for q in cities_Will_2020['Question Number'].unique():
#     d = pd.DataFrame(cities_Will_2020[cities_Will_2020['Question Number']==q].iloc[0,10:])
#     d.columns = [q]
#     extra_info.append(d)
# extra_info = pd.concat(extra_info, axis = 1).T
# extra_info.reset_index(inplace = True, drop = False)
# extra_info.rename(columns={'index':'Question Number'}, inplace = True)
# extra_info['SDGs'] = transform_ToL(extra_info['SDGs'],',')
# extra_info['Which else to Analyse'] = transform_ToL(extra_info['Which else to Analyse'],';')
# extra_info['Is Positive'] = transform_ToL(extra_info['Is Positive'],';')
# extra_info['2019 OK'] = extra_info['2019 OK'].astype('str')

####################################################################################

# # We manually update some questions.

# # question
# q = '10.1'

# # Question Name
# cities_Will_2020['Question Name'][cities_Will_2020['Question Number']==q].unique()

# # Example of possible answers
# boo = cities_Will_2020_vWide[q].iloc[:,2].apply(lambda x: list(x.keys())[0] if isinstance(x, dict) else x) != 'Question not applicable'
# boo2 = cities_Will_2020_vWide[q].iloc[:,2].notna()
# cities_Will_2020_vWide[q][boo & boo2].iloc[0:7,2:]

# # Current data in  extra_info
# extra_info[extra_info['Question Number'] ==q]

# # Update extra_info
# # # Text
# # extra_info.loc[extra_info['Question Number'] ==q, 'Which Is Text to Analyse'] = 2

# # # Else
# extra_info.loc[extra_info['Question Number'] ==q, 'Which else to Analyse'] = [[2,3,5,6]]

# # # Positive
# extra_info.loc[extra_info['Question Number'] ==q, 'Is Positive'] = [[1,1,1,1]]

# # Ref Text
# # extra_info.loc[extra_info['Question Number'] ==q, 'Text Ref Col'] = np.nan

# ###################################################################################

# extra_info.to_pickle('../input/cdp-inputs/extra_info_2020.pkl')

### 2.2.1. Quantitative and Categorical answers: Frequentist and Automated Scoring

Out of the 73 questions we identified as relevant for the analysis of **cities' ambition** to protect the ***Planet***, take care of ***Social*** inequalities and ensure ***Prosperity*** through sustainable development, only four have neither quantitative nor categorical answers to analyse. This shows how important this part is for building *Key Performance Indicators* on these issues from the cities' answers.

We follow the process explained [above](#Ambition-score-by-question) to score each city's answers.

We first define a dictionary that scores all categories that may be ranked. These questions are identified from the **extra_info** dataset based on the **Is Positive** column.
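
As an illustration only, such a dictionary and its use could look like the sketch below. The labels, the scores and the helper names are hypothetical; the actual values were assigned manually in Excel.

```python
# Hypothetical labels and scores (a higher score means a more ambitious answer)
category_scores = {'Yes': 1.0, 'In progress': 0.5, 'Not intending to undertake': 0.0}

def is_rankable(answers, scores):
    """A column is treated as rankable when every answer given has a score."""
    return all(a in scores for a in answers)

def score_answers(answers, scores):
    """Translate rankable answers into their numeric scores."""
    return [scores[a] for a in answers]

print(is_rankable(['Yes', 'In progress'], category_scores))  # True
print(is_rankable(['Yes', 'Energy'], category_scores))       # False
```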

In [None]:
from collections.abc import Iterable

def flatten_LoloS(coll):
    # Flatten a (possibly nested) list of lists of strings, yielding the strings one by one
    for i in coll:
        if isinstance(i, Iterable) and not isinstance(i, str):
            yield from flatten_LoloS(i)
        else:
            yield i

In [None]:
####### READ ME ####### 

# This chunk will not lead to the downloaded dictionary because some steps were made manually in Excel.

# #  We identify all positive questions
# pos_question = list()
# for q in extra_info['Question Number'][extra_info['Is Positive'].notna() & (extra_info['Is Positive']!=0)]:
#     pos_question.append(q)

# #  We subset all non-numeric answers to those questions
# cat_Pos = []
# for q in pos_question:
#     df = cities_Will_2020_vWide[q]
#     # Categories to analyse per question
#     cat_to_analyse = extra_info.loc[extra_info['Question Number']==q, 'Which else to Analyse'].values[0]
#     is_to_analyse = extra_info.loc[extra_info['Question Number']==q, 'Is Positive'].values[0] 
    
#     # Filter to positive categories only
#     if isinstance(cat_to_analyse, list):
#         cat_to_analyse = [i[1] for i in enumerate(cat_to_analyse) if is_to_analyse[i[0]]==1]
    
#     # Get text
#     if isinstance(cat_to_analyse, int):
#         cat_to_analyse = [cat_to_analyse]

#     output = []
#     for c in cat_to_analyse:
#         text = list(df.iloc[:,(2+c-1)]) 
#         # Only keep unique variable
#         for n in range(len(df.Account)):
#             d = text[n]
#             if isinstance(d, dict):
#                 isSTR = [isinstance(x, str) for x in list(d.keys())]
#                 kSTR = list(d.keys())
#                 kSTR = list(compress(kSTR, isSTR))
#                 output.append(kSTR)

#     # Flatten list of lists of strings to a single list of strings
#     output = list(set(list(flatten_LoloS(output))))
#     if 'Question not applicable' in output:
#         output.remove('Question not applicable')

#     # delete numbers entered as strings
#     output = [i for i in output if any(map(str.isalpha, i))]

#     cat_Pos.append(output)
    
# # Assign unique answers to a dictionary as usual
# Dict_POS = dict(zip(pos_question, cat_Pos))
# # Delete empty strings
# Dict_POS = {k:v for k,v in Dict_POS.items() if v}

# # Reshape to dataframe format
# unique_answer = pd.DataFrame(list(set(list(flatten_LoloS(list(Dict_POS.values()))))))

# # Export to Excel to manually assign quantitative labels
# unique_answer.to_excel('Positive_dict.xlsx', engine='xlsxwriter')

# # Load and save up-to-date dataset
# unique_answer = pd.read_excel('Positive_dict.xlsx', sheet_name='Feuil1')
# unique_answer.to_pickle('../input/cdp-inputs-final/positive_dict.pkl')

<a id="Utils-cat"></a>
#### Utils
We now add some functions that help us determine the type of answers and compute the scores.

In [None]:
def remove_dictKey(d, key):
    # Removes a key in a dictionary if the key is present
    r = dict(d)
    if key in d.keys():
        del r[key]
    return r

In [None]:
def isRepartition(df, threshold=0.5):
    # Check if the answers in the dataframe correspond to a question about a repartition,
    # that is, when the share of respondents whose answers sum to 100 is above the threshold.
    # If yes, we rescale the data so that answers are expressed as fractions (e.g. 0.5 rather than 50).
    
    out = df.copy()
    vals = pd.DataFrame(index = df.index, columns = df.columns[2:])
    for i,ct in enumerate(vals.index):
        for j,c in enumerate(vals.columns):
            if type(df.iloc[i,j+2])!=dict:
                vals.loc[ct,c] = 0
            else:
                try:
                    vals.loc[ct,c] = float(list(df.iloc[i,j+2])[0])
                except:
                    vals.loc[ct,c] = np.nan
    
    vals_drop = vals[vals.sum(axis=1)!=0].dropna()
    if vals_drop.empty:
        return False, pd.DataFrame()
    else:
        if (vals_drop.sum(axis=1).round(0)==100).mean()>threshold:
            if vals_drop.shape[1]>1:
                vals_drop = vals_drop.divide(vals_drop.sum(axis=1), axis=0)
                out.loc[vals_drop.index, df.columns[2:]] = vals_drop.values
            else:
                vals_drop[vals_drop>1] = vals_drop[vals_drop>1]/100
                vals_drop[vals_drop>1] = 1.0
                out.loc[vals_drop.index, df.columns[2]] = vals_drop.loc[:,df.columns[2]].values
            
            return True, out
        else:
            return False, pd.DataFrame()
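
The idea behind `isRepartition` can be sketched on plain lists: detect that most respondents entered percentages summing to 100, then divide each row by its total. This is a simplified illustration of the multi-column case with hypothetical helper names, not a replacement for the function above.

```python
def looks_like_repartition(rows, threshold=0.5):
    """True when more than `threshold` of the non-empty rows sum to 100 (after rounding)."""
    non_empty = [r for r in rows if sum(r) != 0]
    if not non_empty:
        return False
    hits = sum(round(sum(r)) == 100 for r in non_empty)
    return hits / len(non_empty) > threshold

def rescale_shares(row):
    """Divide a row by its total so that it sums to 1; all-zero rows are returned unchanged."""
    total = sum(row)
    if total == 0:
        return list(row)
    return [value / total for value in row]

# Hypothetical answers: two cities used percentages, one used fractions, one did not answer
rows = [[50, 30, 20], [60, 40, 0], [0.5, 0.3, 0.2], [0, 0, 0]]
if looks_like_repartition(rows):
    rows = [rescale_shares(r) for r in rows]
print(rows[0])  # [0.5, 0.3, 0.2]
```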

In [None]:
def questionType(df, columns = [], colPos = []):
    # Determine the type of answers for each columns of interest in the specified question
    # The possible types are: ['quant', 'cat_Rank', 'cat_noRank', 'cat_noRank_noPos']
    
    # Check that the columns of interest are in the expected format
    if type(columns)==int:
        columns = [columns + 1]
    else:
        columns = [x+1 for x in columns]
     
    # Check that the orientation of the columns of interest is in the expected format
    if type(colPos)!=list:
        if math.isnan(colPos):
            colPos = ['9999' for x in columns]
        else:
            colPos = [colPos]
    
    # Determination of the types  
    questionType = []
    for k,n in enumerate(columns):
        data = df.iloc[:,n]
        t = [list(item.keys()) for item in data if isinstance(item, dict)]
        t = [item for sublist in t for item in sublist]
        t_drop = [x for x in t if x!='Question not applicable']
        
        try:
            # Quantitative answers
            pd.Series(t_drop).astype(float).dropna()
            questionType.append('quant')
        except:
            # Categorical answers
            isRank = np.sum([(x in unique_answer) for x in t_drop])==len(t_drop)
            if isRank:
                # Gradable categories
                questionType.append('cat_Rank')
            else:
                if type(colPos[k])==str:
                    # Ungradable categories with different orientations
                    questionType.append('cat_noRank_noPos')
                else:
                    # Ungradable categories with similar orientation
                    questionType.append('cat_noRank')
            
    return list(questionType)
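
The decision rule used by `questionType` can be summarised on plain lists. The sketch below is a simplified restatement, not the function itself: `classify_answers`, `graded_answers` (standing in for `unique_answer`) and the convention that `orientation=None` means "categories with different orientations" are assumptions made for the illustration.

```python
def classify_answers(answer_keys, graded_answers, orientation=None):
    """Classify one column from the list of answer labels given by all respondents."""
    keys = [k for k in answer_keys if k != "Question not applicable"]
    try:
        # Quantitative if every answer can be read as a number
        [float(k) for k in keys]
        return "quant"
    except (TypeError, ValueError):
        pass
    # Gradable if every answer has a known grade
    if all(k in graded_answers for k in keys):
        return "cat_Rank"
    # Otherwise ungradable: a shared orientation (0 or 1) or none at all
    return "cat_noRank_noPos" if orientation is None else "cat_noRank"

# Hypothetical grading table
graded = {"Yes": 1.0, "In progress": 0.5, "No": 0.0}
examples = {
    "numbers": classify_answers(["12.5", "3"], graded),
    "graded": classify_answers(["Yes", "No"], graded),
    "same_orientation": classify_answers(["Flood", "Heat wave"], graded, orientation=1),
    "mixed_orientation": classify_answers(["Flood", "Question not applicable"], graded),
}
```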

Main function to compute the scores on analytical data

In [None]:
def Explore_AnalyticData(df, columns = [], colPos = [], models = [], weights = {}):
    
    # Compute the score of each column of interest based on the column type
    
    # Make sure the columns of interest are in the expected format
    if type(columns)==int:
        columns = [columns + 1]
    else:
        columns = [x+1 for x in columns]
        
    if type(colPos)!=list:
        if math.isnan(colPos):
            colPos = ['9999' for x in columns]
        else:
            colPos = [colPos]
            
    # Check whether the columns describe a repartition (shares that sum to 100)
    isRep, rescaledData = isRepartition(df)
    if isRep:
        df = rescaledData.copy()
        
    # Determine the column type
    # If quantitative, compute the decile of each answer
    # If categorical, compute the frequency of each category
    Count = []
    colType = []
    colDec = []
    for n in columns:
        data = df.iloc[:,n]
        if isRep:
            t = [item for item in data if isinstance(item, float)]
            t = [item for item in t if not math.isnan(item)]
            
        else:
            t = [list(item.keys()) for item in data if isinstance(item, dict)]
            t = [item for sublist in t for item in sublist]
        c = remove_dictKey(dict(Counter(t)), 'Question not applicable')
        Count.append(c)
        
        try:
            # The data are quantitative if we can convert all the answers to float
            val = pd.Series([x for x in t if x!='Question not applicable']).astype(float).dropna()
            dec = pd.Series(index = val.index, data = pd.qcut(val.values, 10, labels=False, duplicates='drop'))
            colType.append('quant')
            colDec.append(pd.concat([val.rename(0),dec.rename(1)], axis=1))
            
        except Exception:
            # If at least one of the answers cannot be converted to a numeric type, the column is categorical
            colType.append('cat')
            colDec.append(pd.DataFrame())
    
    # For each column of interest, we compute the score of every respondent
    df_city = pd.DataFrame(index = df['Account'].unique(), columns = df.columns[columns])
    
    for k,n in enumerate(columns):
        
        # If model cities were provided, we compute the category frequencies among their answers only
        if len(models)>0:
            # A boolean mask keeps the selection label-based, whatever the index of df
            is_model = df['Account'].isin(list(models))
            model_data = df.loc[is_model, df.columns[n]]
            
            t_model = [list(item.keys()) for item in model_data if isinstance(item, dict)]
            t_model = [item for sublist in t_model for item in sublist]
            c_model = remove_dictKey(dict(Counter(t_model)), 'Question not applicable')
            
        for ct in df_city.index:
            
            # Quantitative Data
            # We get the decile corresponding to the city's answer
            # If the orientation of the question is negative, we invert the decile in order 
            # to give a higher score to the cities with the lowest answers
            if colType[k]=='quant':
                decs = colDec[k]
                try: 
                    if isRep:
                        val = float(df.loc[df['Account']==ct, df.columns[n]].values[0])
                    else:
                        val = float(list(df.loc[df['Account']==ct, df.columns[n]].values[0])[0])
                    df_city.loc[ct, df.columns[n]] = decs[decs[0]==val].iloc[0,1]
                    if colPos[k]==0:
                        df_city.loc[ct, df.columns[n]] = 9-df_city.loc[ct, df.columns[n]]
                except:
                    df_city.loc[ct, df.columns[n]] = np.nan
                    
            # Categorical Data
            # We determine the sub-type of the column to compute the score
            else:
                # set the weighting scheme
                w = np.ones(len(Count[k]))/np.sum([x for x in Count[k].values()])
                if n in weights.keys():
                    if len(weights[n])==len(Count[k]):
                        w = weights[n]

                try: 
                    # The answer is not empty
                    if type(df.loc[df.index[df['Account']==ct].values[0], df.columns[n]])==dict:
                        # The city was eligible for the question
                        if len(remove_dictKey(df.loc[df.index[df['Account']==ct].values[0], df.columns[n]],'Question not applicable'))>0:
                            
                            isRank = np.sum([(x in unique_answer) for x in list(Count[k])])==len(list(Count[k]))
                            
                            # Gradable categories
                            # We weight the number of answers by their relative scores in unique_answer
                            if isRank:
                                df_city.loc[ct, df.columns[n]] = np.sum([unique_answer[x] for x in df.loc[df['Account']==ct, df.columns[n]].values[0]])/np.sqrt(len(Count[k]))
                            else:
                            
                                # Ungradable categories with similar orientation
                                # We count the number of chosen categories
                                if type(colPos[k])==int:
                                    df_city.loc[ct, df.columns[n]] = len(df.loc[df['Account']==ct, df.columns[n]].values[0])/len(Count[k])
                                    if colPos[k]==0:
                                        df_city.loc[ct, df.columns[n]]=1-df_city.loc[ct, df.columns[n]]
                                
                                else:
                                    # Ungradable categories with different orientations
                                    # If the list of model cities is empty, we weight the number of categories 
                                    # by their relative frequency in the whole set of answers
                                    # If the list of model cities is not empty, we weight the number of categories 
                                    # by their relative frequency in the set of answers from the model cities
                                    if len(models) == 0:
                                        df_city.loc[ct, df.columns[n]] = np.sum([Count[k][x]*w[i]  for i,x in enumerate(df.loc[df.index[df['Account']==ct].values[0], df.columns[n]].keys())])

                                    else:
                                        df_city.loc[ct, df.columns[n]] = np.sum([c_model[x]*w[i]  for i,x in enumerate(df.loc[df.index[df['Account']==ct].values[0], df.columns[n]].keys()) if x in list(c_model)])


                    # Empty answer         
                    else:
                        df_city.loc[ct, df.columns[n]] = 0
                    
                                
                        
                except Exception as exception:
                    print(exception.__str__())
                    df_city.loc[ct, df.columns[n]] = 0
                    
                    
        # We scale our scores to make them comparable  
        z = (df_city.loc[:, df.columns[n]].max()-df_city.loc[:, df.columns[n]].min())
        df_city.loc[:, df.columns[n]] = (df_city.loc[:, df.columns[n]]-df_city.loc[:, df.columns[n]].min())/z
    
    return Count, df_city
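
For quantitative columns, the scoring boils down to three steps: bucket the answers into deciles, invert the decile when a lower answer is better, then min-max scale. The following is a minimal sketch of that idea on made-up data. `decile_score` and the city labels are hypothetical, and the inversion uses the highest observed decile rather than the constant 9, which is a deliberate simplification.

```python
import pandas as pd

def decile_score(values: pd.Series, higher_is_better: bool = True) -> pd.Series:
    """Decile rank of each answer, optionally inverted, rescaled to [0, 1]."""
    deciles = pd.Series(
        pd.qcut(values.values, 10, labels=False, duplicates="drop"),
        index=values.index,
        dtype=float,
    )
    if not higher_is_better:
        # Lowest answers get the highest rank
        deciles = deciles.max() - deciles
    spread = deciles.max() - deciles.min()
    # Guard against a constant column, where min-max scaling is undefined
    return (deciles - deciles.min()) / spread if spread > 0 else deciles * 0.0

# Hypothetical emissions figures for 20 cities (lower is better)
emissions = pd.Series(range(1, 21), index=[f"city_{i}" for i in range(1, 21)], dtype=float)
scores_low_is_good = decile_score(emissions, higher_is_better=False)
scores_high_is_good = decile_score(emissions)
```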

<a id="Practice-cat"></a>
##### Explore_AnalyticData In Practice

We first identify the questions whose answers are categorical, with ungradable categories and different orientations. Scoring such questions requires a set of model cities, which can only be selected once the other questions have been scored.

In [None]:
questions = extra_info['Question Number'][extra_info['Which else to Analyse'].notna()]
questionsTypes = pd.DataFrame(index = questions, columns = ['questionsType'])

cities_Will_2020_CatAnalysis = pd.DataFrame(index = cities_Will_2020.Account.unique(), columns = questions)
cities_Will_2020_CatCount = {q:{} for q in questions}

for q in questions:
    
    if type(extra_info[extra_info['Question Number']==q].loc[:,'Which else to Analyse'].values[0])==list:
        cols_study = [int(c) for c in extra_info[extra_info['Question Number']==q].loc[:,'Which else to Analyse'].values[0]]
    else:
        cols_study = [extra_info[extra_info['Question Number']==q].loc[:,'Which else to Analyse'].values[0]]
    
    colsPos = extra_info[extra_info['Question Number']==q].loc[:,'Is Positive'].values[0] 
    questionsTypes.loc[q, 'questionsType'] = questionType(cities_Will_2020_vWide[q], columns = cols_study, colPos = colsPos)

q_secondAnalysis = questionsTypes.applymap(lambda x: np.sum([i=='cat_noRank_noPos' for i in x])>0)
q_secondAnalysis = q_secondAnalysis[q_secondAnalysis].dropna().index.values
q_firstAnalysis = [x for x in questions if x not in q_secondAnalysis]
q_secondAnalysis
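
The split between the two passes can be pictured with a plain dictionary. This sketch uses invented type assignments for three question numbers; only the label `cat_noRank_noPos` comes from the notebook.

```python
# Hypothetical answer types per question, one entry per column of interest
question_types = {
    "2.1": ["cat_Rank"],
    "5.0": ["quant", "cat_noRank"],
    "6.0": ["cat_noRank", "cat_noRank_noPos"],
}

# A question waits for the second pass as soon as one of its columns needs model cities
needs_models = [q for q, types in question_types.items() if "cat_noRank_noPos" in types]
first_pass = [q for q in question_types if q not in needs_models]
```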

We then compute the scores for the questions that do not require model cities.

In [None]:
for q in q_firstAnalysis:
    
    if type(extra_info[extra_info['Question Number']==q].loc[:,'Which else to Analyse'].values[0])==list:
        cols_study = [int(c) for c in extra_info[extra_info['Question Number']==q].loc[:,'Which else to Analyse'].values[0]]
    else:
        cols_study = [extra_info[extra_info['Question Number']==q].loc[:,'Which else to Analyse'].values[0]]
    
    colsPos = extra_info[extra_info['Question Number']==q].loc[:,'Is Positive'].values[0] 
    
    try:
        catCounter, catAnalysis = Explore_AnalyticData(cities_Will_2020_vWide[q], columns = cols_study, colPos = colsPos)
        catAnalysis = catAnalysis.mean(axis=1)
        cities_Will_2020_CatAnalysis.loc[catAnalysis.index, q] = catAnalysis.values
        cities_Will_2020_CatCount[q] = catCounter
    except Exception as exception:
        print(exception.__str__())

We select the cities with the highest scores as model cities and use them to compute the scores on the remaining questions.

In [None]:
cities_Will_2020_CatScore = cities_Will_2020_CatAnalysis.loc[:,q_firstAnalysis].mean(axis=1)
cities_Will_2020_CatModel = pd.Series(index = cities_Will_2020_CatScore.index, data = pd.qcut(cities_Will_2020_CatScore.values, 10, labels=False, duplicates='drop'))
cities_Will_2020_CatModel = cities_Will_2020_CatModel[cities_Will_2020_CatModel==9].index.values
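
Once model cities are known, a city's score on an ungradable question reflects how often its chosen categories are also chosen by the model cities. The sketch below illustrates that idea under simplifying assumptions: `model_frequency_scores` is a hypothetical helper, answers are plain lists instead of dictionaries, and the frequencies are normalised by the model cities' own total, whereas the notebook rescales the scores afterwards.

```python
from collections import Counter

def model_frequency_scores(answers, model_cities):
    """Score each city by the popularity of its categories among the model cities."""
    model_counts = Counter(cat for city in model_cities for cat in answers.get(city, []))
    total = sum(model_counts.values())
    if total == 0:
        return {city: 0.0 for city in answers}
    # Categories never picked by a model city contribute nothing
    return {
        city: sum(model_counts[cat] for cat in set(chosen)) / total
        for city, chosen in answers.items()
    }

# Hypothetical answers to a multiple-choice question
answers = {"A": ["solar", "wind"], "B": ["solar"], "C": ["coal"], "D": ["wind", "coal"]}
scores = model_frequency_scores(answers, model_cities=["A", "B"])
```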

We check the coherence of the results before going forward with the process.

In [None]:
cities_2020.Organization[cities_2020.Account.isin(list(cities_Will_2020_CatModel))].unique()

This seems consistent with our expectations. The list contains the expected major cities, but also some smaller cities that may have a strong ambition to tackle our macro-themes for various reasons, such as geographic exposure or the rollout of a national plan.

Hence we run the second step of the analysis.

In [None]:
for q in q_secondAnalysis:
    if type(extra_info[extra_info['Question Number']==q].loc[:,'Which else to Analyse'].values[0])==list:
        cols_study = [int(c) for c in extra_info[extra_info['Question Number']==q].loc[:,'Which else to Analyse'].values[0]]
    else:
        cols_study = [extra_info[extra_info['Question Number']==q].loc[:,'Which else to Analyse'].values[0]]
    
    colsPos = extra_info[extra_info['Question Number']==q].loc[:,'Is Positive'].values[0] 
    
    try:
        catCounter, catAnalysis = Explore_AnalyticData(cities_Will_2020_vWide[q], columns = cols_study, colPos = colsPos, models = cities_Will_2020_CatModel)
        catAnalysis = catAnalysis.mean(axis=1)
        cities_Will_2020_CatAnalysis.loc[catAnalysis.index, q] = catAnalysis.values
        cities_Will_2020_CatCount[q] = catCounter
    except Exception as exception:
        print(exception.__str__())
        
# cities_Will_2020_CatAnalysis.to_pickle('../input/cdp-inputs-final/cities_Will_2020_CatAnalysis.pkl')

# with open('../input/cdp-inputs-final/cities_Will_2020_CatCount.pkl', 'wb') as handle:
#     pickle.dump(cities_Will_2020_CatCount, handle, protocol=pickle.HIGHEST_PROTOCOL)

### 2.2.2 Free text answers: Unsupervised Sentiment and Topics Scoring
[Back to main section](#Free-text)

<a id="Text-explore"></a>
Below is an example of the output of *Explore_Textdata*, the function that reshapes and cleans the text so that the answers are ready for sentiment and topic analysis.

In [None]:
def Explore_Textdata(df, niv_ref=np.nan, niv_text=np.nan):
    
    # For a given question, this function reshapes all the free text answers of a respondent into a single string. 
    # A reference sub-question can be specified to split the free text answers by sub-question topic.
    
    # The output is a single dataframe if no reference sub-question is given, or a dictionary of dataframes 
    # keyed by sub-question topic otherwise.

    # Args:
    # niv_ref=np.nan # Column number from the main dataset to use as reference
    # niv_text=np.nan # Column number from the main dataset that holds the free text answers
    # NOTE: Column numbers follow the order of the original CDP questionnaire (easy to find in the .pdf file), 
    # counted from 0.
    
    # Case with no reference column, i.e. we want to collect all the available text in one large table.
    if np.isnan(niv_ref):
        
        # Retrieve all the text associated with the question
        if np.isnan(niv_text):
            text = list(df.iloc[:,(df.shape[1]-1)]) 
        else:
            text = list(df.iloc[:,(2+niv_text)])
        
        # Build one string per respondent
        output = []
        for n in range(len(df.Account)):
            d = text[n]
            if isinstance(d, dict):
                # Join all the answers of a city on a given theme into one long string.
                output.append(' '.join(list(d.keys())) if (len(list(d.keys()))>1) else list(d.keys())[0])
            else:
                output.append('')
        # Convert output to a dataframe
        output_d = {'Account': df.Account, 'Response Answer': output}
        output_d = pd.DataFrame(output_d)
        output_d.dropna(inplace=True) # drop na

        # Keep English answers only.
        output_d = pd.merge(output_d, isEnglish_df, on='Account', how='inner')
        del output_d['isEnglish']
        output_d = output_d[output_d['Response Answer'].apply(lambda x: isinstance(x, str) and 
                                                              len(x)>1 and x!='Question not applicable')]

        # Clean the text
        output_d['Response Answer'] = output_d['Response Answer'].map(lambda x: cleanData(x))
        output_d['Response Answer ToL'] = output_d['Response Answer'].map(lambda x: cleanData(x, to_list=True))
        # Reset the index
        output_d.reset_index(drop=True, inplace=True)

        text_per_category = output_d   
    
    # Case with a reference column, used to split the text answers into categories.
    else:
        
        # Compute the frequency of each category
        Count = []
        for n in range(2,(df.shape[1]-1)):
            data = df.iloc[:,n]
            t = [list(item.keys()) for item in data if isinstance(item, dict)]
            t = [item for sublist in t for item in sublist]
            Count.append(dict(Counter(t)))

        # Collect the texts by top-level category
        # List of categories
        all_refs = list(Count[niv_ref].keys()) 
        if 'Question not applicable' in all_refs:
            all_refs.remove('Question not applicable')
        ref_vect = list(df.iloc[:,2+niv_ref])

        all_refs_text = []

        for item_ref in all_refs:
            
            print('Category for text selection :', item_ref)
            
            # Retrieve, for each respondent, the row indices that relate to item_ref
            idx = []
            for i in ref_vect:
                if isinstance(i, dict):
                    idx.append([i[k] for k in list(i.keys()) if k == item_ref])
                else:
                    idx.append([])

            idx_notEmpty = [i for i, x in enumerate(idx) if x]
            # idx_notEmpty holds positions, hence the positional lookup with iloc
            Account_With_Text = list(df.Account.iloc[idx_notEmpty])
            idx = [idx[i] for i in idx_notEmpty] 

            # Retrieve all the text associated with the question
            if np.isnan(niv_text):
                text = list(df.iloc[:,(df.shape[1]-1)]) 
            else:
                text = list(df.iloc[:,(2+niv_text)])

            # Keep the respondents that address item_ref
            res_list = [text[i] for i in idx_notEmpty] 

            # Keep the texts that address item_ref and attach them to their respondents
            output = []
            for n in range(len(Account_With_Text)):
                d = res_list[n]
                if isinstance(d, dict):
                    # Join all the answers of a city on a given theme into one long string, keeping only 
                    # actual text (str format).
                    output.append(' '.join([list(d.keys())[k] for k in range(len(d)) if 
                                            (list(d.values())[k][0] in idx[n][0]) and isinstance(list(d.keys())[k], str)]))
                else:
                    output.append('')
            # Convert output to a dataframe
            output_d = {'Account': Account_With_Text, 'Response Answer': output}
            output_d = pd.DataFrame(output_d)
            output_d.dropna(inplace=True) # drop na

            # Keep English answers only.
            output_d = pd.merge(output_d, isEnglish_df, on='Account', how='inner')
            del output_d['isEnglish']
            output_d = output_d[output_d['Response Answer'].apply(lambda x: isinstance(x, str) and 
                                                                  len(x)>1 and x!='Question not applicable')]
            if not output_d.empty:
                # Clean the text
                output_d['Response Answer'] = output_d['Response Answer'].apply(lambda x: cleanData(x))
                output_d['Response Answer ToL'] = output_d['Response Answer'].apply(lambda x: cleanData(x, to_list=True))
                # Reset the index
                output_d.reset_index(drop=True, inplace=True)

            all_refs_text.append(output_d)

        text_per_category = dict(zip(all_refs, all_refs_text))
        
    return text_per_category

In [None]:
# Example for question '6.0'
Explore_Td_6_0 = Explore_Textdata(df_6_0,niv_ref=0, niv_text=1)
Explore_Td_6_0['Development of energy efficiency measures and technologies'].head(5)
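
The core reshaping step, turning the dictionary of free text answers of one respondent into a single string, can be sketched in isolation. `join_free_text` is a hypothetical helper, and the shape of the dictionary values is assumed for the example; cleaning and language filtering are left out.

```python
def join_free_text(answer):
    """Concatenate the text keys of one respondent's answer into a single string."""
    if not isinstance(answer, dict):
        # Missing answers (e.g. NaN) become an empty string
        return ""
    return " ".join(
        key for key in answer
        if isinstance(key, str) and key != "Question not applicable"
    )

# Hypothetical raw answers: text as keys, row references as values
joined = join_free_text({"We retrofit schools.": [1], "LED street lighting.": [2]})
missing = join_free_text(float("nan"))
not_applicable = join_free_text({"Question not applicable": [1]})
```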

#### 2.2.2.1. Sentiment Analysis

We define all the functions in a Utils section, which can be seen as a repository for sentiment analysis.

<a id="Utils-sent"></a>
#### Utils

[Back to the main section](#Free-text)

##### A) Word weights

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

def Compute_weights(corpus, type = "tf_idf"):
    
    if not ((isinstance(corpus, list) and isinstance(corpus[0], str)) 
            or (isinstance(corpus,  pd.Series) and isinstance(corpus[0],  dict))):
        print("Corpus must be either a list of strings or a series of dict with words as keys.")
        return
    
    def dict_to_string(x):
        if isinstance(x,dict):
            res = ' '.join(x.keys())
        else:
            res = ''
        return res
        
    if (isinstance(corpus,  pd.Series) and isinstance(corpus[0],  dict)):
        corpus = list(corpus.map(lambda x: dict_to_string(x)))
    
    if type=="tf_idf":
        # Tf-Idf
        vectorizer = TfidfVectorizer()
        got_tfidf = vectorizer.fit_transform(corpus)
        tfidf = pd.DataFrame(got_tfidf.toarray())
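        # get_feature_names() is gone from scikit-learn 1.2 onwards; use it only when the newer method is missing
        get_names = getattr(vectorizer, 'get_feature_names_out', None) or vectorizer.get_feature_names
        tfidf.columns = get_names()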

        weights = dict(zip(tfidf.columns, tfidf.mean(axis = 0)))
    else:
        print("type : corpus frequency")  
        # Corpus freq
        bow = CountVectorizer()
        BOW = bow.fit_transform(corpus)
        bagOFwords = pd.DataFrame(BOW.toarray())
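        # get_feature_names() is gone from scikit-learn 1.2 onwards; use it only when the newer method is missing
        get_names = getattr(bow, 'get_feature_names_out', None) or bow.get_feature_names
        bagOFwords.columns = get_names()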

        weights = dict(zip(bagOFwords.columns, bagOFwords.mean(axis = 0)))

    return weights
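
To make the notion of a word weight concrete, here is a hand-rolled sketch that averages a textbook tf-idf (term frequency times `log(N / document frequency)`) over the corpus. It is not what `TfidfVectorizer` computes, since scikit-learn smooths the idf and L2-normalises each document by default, so the numbers differ; `mean_tfidf_weights` and the toy corpus are assumptions for the illustration.

```python
import math
from collections import Counter

def mean_tfidf_weights(corpus):
    """Average unsmoothed tf-idf of each word over all documents of the corpus."""
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)
    # Number of documents in which each word appears
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = Counter()
    for doc in docs:
        for word, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[word])
            weights[word] += tf * idf / n_docs
    return dict(weights)

toy_weights = mean_tfidf_weights(["solar solar wind", "wind power"])
```

Words that appear in every answer, such as `wind` here, receive a zero weight, which is the behaviour that makes tf-idf useful for down-weighting boilerplate vocabulary.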

##### B) Wordnet sentiment analysis

In [None]:
import nltk
from nltk.corpus import wordnet as wn
nltk.download('sentiwordnet')
from nltk.corpus import sentiwordnet as swn

In [None]:
def spacy_to_wn(tag):
    """
    Convert between the Spacy tags to simple Wordnet tags
    """
    if tag.startswith('ADJ'):
        return wn.ADJ
    elif tag.startswith('N'):
        return wn.NOUN
    elif tag.startswith('ADV'):
        return wn.ADV
    elif tag.startswith('V'):
        return wn.VERB
    else:
        return None

In [None]:
def Wordnet_sentiment(sentence, weights=np.nan):
        
    if not isinstance(sentence, dict):
        print("Sentence must be cleaned in a dict format with pos tags")
        return np.nan
    
    if not isinstance(weights,dict):
#         print("Equal weights will be given to words")
        weights = dict(zip(list(sentence.keys()), np.ones(len(sentence))))
    
    if not sentence:
        # Case where the text is empty
        score = np.nan
        
    else:

        # 1) Retrieve the sentiment of each word.
        sentiment = []

        for word, pos_tag in sentence.items():
            synsets = wn.synsets(word, pos=spacy_to_wn(pos_tag))
            if synsets:
                # Take the first sense, the most common
                synset = synsets[0]
                swn_synset = swn.senti_synset(synset.name())
#                 sentiment.append([swn_synset.pos_score(),swn_synset.neg_score(),swn_synset.obj_score()])
                sentiment.append(swn_synset.pos_score() - swn_synset.neg_score())
            else:
#                 sentiment.append([])
                sentiment.append(0)

        # 2) Weight the words using the vector passed as argument
        score = 0
        tokens_count = 0
        for i in range(len(sentiment)):
            if list(sentence.keys())[i] in weights.keys():
                score += sentiment[i]*weights[list(sentence.keys())[i]]               
            if sentiment[i]!=0:
                tokens_count +=1

        # 3) Normalise the score by the sentence length, counting only the words that carry sentiment.
        if tokens_count!=0:
            score = score/tokens_count
            
    return score
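
The weighting and normalisation steps can be followed on a three-word sentence. The sketch replaces SentiWordNet with a tiny hand-written polarity table, so `weighted_sentiment`, `polarity` and the numbers are purely illustrative.

```python
def weighted_sentiment(tokens, polarity, weights=None):
    """Weighted sum of word polarities, divided by the number of non-neutral words."""
    if not tokens:
        return float("nan")
    total, scored = 0.0, 0
    for token in tokens:
        value = polarity.get(token, 0.0)
        # Equal weights when none are given; words absent from the weights are ignored
        weight = 1.0 if weights is None else weights.get(token, 0.0)
        total += value * weight
        if value != 0:
            scored += 1
    return total / scored if scored else total

# Hypothetical polarity lexicon and word weights
polarity = {"good": 0.5, "bad": -0.25}
tokens = ["good", "plan", "bad"]
score_weighted = weighted_sentiment(tokens, polarity, weights={"good": 2.0, "bad": 1.0})
score_equal = weighted_sentiment(tokens, polarity)
```

Dividing by the number of sentiment-bearing words, rather than by the full sentence length, keeps long neutral answers from being pulled towards zero.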

##### C) Vader sentiment analysis

In [None]:
!pip install twython

In [None]:
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [None]:
def Vader_sentiment(sentence, weights=np.nan):
    sid = SentimentIntensityAnalyzer()
    
    if isinstance(sentence,dict) and bool(sentence):
        sentence = list(sentence.keys())
    
    if isinstance(sentence, list):
        
        if not isinstance(weights,dict):
            weights = dict(zip(sentence, np.ones(len(sentence))))  
        
        # 1) Retrieve the sentiment of each word.
        sentiment = []
        for word in sentence:
            sentiment.append(sid.polarity_scores(word)['compound'])

        # 2) Weight the words using the vector passed as argument
        score = 0
        tokens_count = 0
        for i in range(len(sentiment)):
            if sentence[i] in weights.keys():
                score += sentiment[i]*weights[sentence[i]]               
            if sentiment[i]!=0:
                tokens_count +=1
    
        # 3) Normalise the score by the sentence length, counting only the words that carry sentiment.
        if tokens_count!=0:
            score = score/tokens_count
            
    elif (isinstance(sentence, str) and len(list(sentence.split(" ")))>1):
        score = sid.polarity_scores(sentence)['compound']
    
    else:
        score = np.nan
            
    return score   

##### D) From Score to Label

In [None]:
def Score2Label(scores,*tau):
    q = []
    for t in tau:
        q.append(np.nanquantile(scores,t))
    
    q.sort()
    
    label = np.empty(len(scores))
    label[:] = np.nan
    
    label[scores<q[0]] = 0
    for i in range(len(q)-1):
        label[(q[i]<=scores) & (scores<q[i+1])] = (i+1)  
    # >= so that a score exactly equal to the highest quantile also receives a label
    label[scores>=q[-1]] = len(q)
    
    return label.astype(float)
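
Quantile labelling can also be expressed with `np.digitize`, which assigns each score the number of cut points it reaches. This is an alternative sketch, not the notebook's function: `score_to_label` is a hypothetical name, and missing scores are explicitly kept as NaN because `np.digitize` would otherwise place them in the last bin.

```python
import numpy as np

def score_to_label(scores, *tau):
    """Label each score with the index of the quantile interval it falls into."""
    scores = np.asarray(scores, dtype=float)
    cuts = np.sort(np.nanquantile(scores, tau))
    # Intervals are closed on the left: cuts[i-1] <= score < cuts[i]
    labels = np.digitize(scores, cuts).astype(float)
    labels[np.isnan(scores)] = np.nan
    return labels

# Ten made-up scores plus one missing value, split into quintiles
labels = score_to_label(list(range(1, 11)) + [np.nan], 0.2, 0.4, 0.6, 0.8)
# A score sitting exactly on the highest cut point falls in the top class
boundary = score_to_label([0.0, 1.0, 2.0], 0.5)
```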

##### E) Apply sentiment analysis

In [None]:
def Apply_Sentiment_Analysis(df):
    
    df_new = {}
    
    if not isinstance(df, dict):
        df = {'All': df}
        
    # We work thematic by thematic
    for thematic in df.keys():
        
        if not df[thematic].empty:
        
            df_small = df[thematic].copy()

            # Build a series of the non-empty answers (as dicts) to compute the tf-idf vector; rebuilding the 
            # series also resets its index
            answer_dict = pd.Series([i for i in df_small['Response Answer ToL'].dropna() if i])
            # Compute the tf-idf vector over all the answers to the question.
            tfidf_weights = Compute_weights(answer_dict)
            # Compute the sentiments from these weights
            WordNet_EQ = df_small['Response Answer ToL'].map(lambda x: Wordnet_sentiment(x, weights="EQ"))
            WordNet_TFIDF = df_small['Response Answer ToL'].map(lambda x: Wordnet_sentiment(x, weights=tfidf_weights))
            Vader_TFIDF = df_small['Response Answer ToL'].map(lambda x: Vader_sentiment(x, weights=tfidf_weights))
            Vader = df_small['Response Answer'].map(lambda x: Vader_sentiment(x))
            # Convert the scores into labels.
            WordNet_EQ = Score2Label(WordNet_EQ, 0.2, 0.4, 0.6, 0.8)
            WordNet_TFIDF = Score2Label(WordNet_TFIDF, 0.2, 0.4, 0.6, 0.8)
            Vader_TFIDF = Score2Label(Vader_TFIDF, 0.2, 0.4, 0.6, 0.8)
            Vader = Score2Label(Vader, 0.2, 0.4, 0.6, 0.8)
            # Apply a majority vote to choose the label
            Sentiment_df = pd.DataFrame({'Wordnet_EQ': WordNet_EQ, 'Wordnet_TFIDF':WordNet_TFIDF, 
                                             'Vader_TFIDF':Vader_TFIDF, 'Vader': Vader})
            df_small['Sentiment'] = Sentiment_df.apply(lambda x: Counter(x).most_common()[0][0], axis = 1)

            df_new[thematic] = df_small
        
        else:
            df_new[thematic] = pd.DataFrame()

    return df_new
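
The final label of an answer is the one most scorers agree on. The toy example below reproduces that majority vote on three made-up answers; the column names mirror the notebook, while the label values are invented. When two labels tie, the one met first in column order wins.

```python
from collections import Counter
import pandas as pd

# Hypothetical labels (0 to 4) given by the four scorers to three answers
votes = pd.DataFrame({
    "Wordnet_EQ":    [1, 0, 4],
    "Wordnet_TFIDF": [1, 2, 4],
    "Vader_TFIDF":   [2, 2, 3],
    "Vader":         [3, 2, 3],
})

# Most frequent label per row
majority = votes.apply(lambda row: Counter(row).most_common(1)[0][0], axis=1)
```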

#### Sentiment Analysis in Practice

We apply the sentiment analysis to every question flagged in the ***Which Is Text to Analyse*** column of the ***extra_info*** database. The reference sub-question is read from the ***Text Ref Col*** column.

In [None]:
cities_Will_2020_SentimentAnalysis = []
questions = []
# Question to analyse
for q in extra_info['Question Number'][extra_info['Which Is Text to Analyse'].notna()]:
    
    if q not in []:
    
        print('Question number', q)

        # Col number for reference categories
        l_r = extra_info.loc[extra_info['Question Number']==q, 'Text Ref Col'].iloc[0]
        if isinstance(l_r, list):
            l_r = int(l_r[0])
        if not np.isnan(l_r):
            l_r = int(l_r-1) # to be good with python indexing starting at 0

        # Col number of text
        l_t = int(extra_info.loc[extra_info['Question Number']==q, 'Which Is Text to Analyse'].iloc[0])-1

        print('Column number ref :', l_r, 'and Column text:', l_t)

        # Get text for the given categories into a dictionary
        df = Explore_Textdata(cities_Will_2020_vWide[q],niv_ref=l_r, niv_text=l_t)

        # Run sentiment analysis
        df = Apply_Sentiment_Analysis(df)
        
        # Reshape to size of all unique respondent
        df_new={}
        for k in df.keys():
            d = df[k].copy()
            if not d.empty:
                df_new[k] = pd.merge(geo_cities_2020, d, on='Account', how='outer')
                
        cities_Will_2020_SentimentAnalysis.append(df_new)
        questions.append(q)

cities_Will_2020_SentimentAnalysis = dict(zip(questions, cities_Will_2020_SentimentAnalysis))   

# with open('../input/cdp-inputs-final/cities_Will_2020_SentimentAnalysis.pkl', 'wb') as handle:
#     pickle.dump(cities_Will_2020_SentimentAnalysis, handle, protocol=pickle.HIGHEST_PROTOCOL)

#### 2.2.2.2. Unsupervised Topics analysis

As explained previously, the idea is to extract the topics addressed in a specific sub-question, then check whether a given answer covers them all by measuring each topic's density in the answer.

<a id="Utils-top"></a>
##### Utils

[Back to the main section](#U-Top)

In [None]:
from nltk.corpus import stopwords

In [None]:
def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wn.ADJ
    elif treebank_tag.startswith('V'):
        return wn.VERB
    elif treebank_tag.startswith('N'):
        return wn.NOUN
    elif treebank_tag.startswith('R'):
        return wn.ADV
    else:
        return ''
    
def topics_document_to_dataframe(topics_document, num_topics):
    res = pd.DataFrame(columns=['Topic ' + str(t) for t in range(num_topics)])
    for topic_weight in topics_document:
        res.loc[0, 'Topic ' + str(topic_weight[0])] = topic_weight[1]
    return res

def LDA_Analysis(df_source, nb_topics):
    
    # We check if there are enough answers to run a LDA
    if df_source['Response Answer'].value_counts().shape[0]<20:
        df_source = df_source.loc[:,['Account', 'Response Answer']].dropna().reset_index(drop=True)
        df_topics = pd.DataFrame(data = np.nan, index = df_source.index, columns = np.append(['Account', 'Response Answer'],['Topic ' + str(x) for x in range(nb_topics)]))
        df_topics.loc[:, ['Account', 'Response Answer']] = df_source.loc[:, ['Account', 'Response Answer']] 
        lda = {}
    
    else:
    
        df = df_source.loc[:,['Account', 'Response Answer']].dropna().reset_index(drop=True)
        df.columns = ['account', 'answer']
    
        # Tokenization
        df['sentences'] = df.answer.map(sent_tokenize)
        df['tokens_sentences'] = df['sentences'].map(lambda sentences: [word_tokenize(sentence) for sentence in sentences])

        # Lemmatization with POS tagging
        df['POS_tokens'] = df['tokens_sentences'].map(lambda tokens_sentences: [pos_tag(tokens) for tokens in tokens_sentences])

        # Lemmatizing each word with its POS tag, in each sentence
        lemmatizer = WordNetLemmatizer()
        df['tokens_sentences_lemmatized'] = df['POS_tokens'].map(
            lambda list_tokens_POS: [
                [
                    lemmatizer.lemmatize(el[0], get_wordnet_pos(el[1])) 
                    if get_wordnet_pos(el[1]) != '' else el[0] for el in tokens_POS
                ] 
                for tokens_POS in list_tokens_POS
            ]
        )

        # Regrouping tokens and removing stop words
        stopwords_verbs = ['say', 'get', 'go', 'know', 'may', 'need', 'like', 'make', 'see', 'want', 'come', 'take', 'use', 'would', 'can']
        stopwords_other = ['one', 'mr', 'bbc', 'image', 'getty', 'de', 'en', 'caption', 'also', 'copyright', 'something']
#         my_stopwords = stopwords.words('English') + stopwords_verbs + stopwords_other
        my_stopwords = stopwords.words('english') + stopwords_verbs + stopwords_other

        df['tokens'] = df['tokens_sentences_lemmatized'].map(lambda sentences: list(chain.from_iterable(sentences)))
        df['tokens'] = df['tokens'].map(lambda tokens: [token.lower() for token in tokens if token.isalpha() and token.lower() not in my_stopwords and len(token)>1])

        # Data preparation - Bi-gram and Tri-gram models
        tokens = df['tokens'].tolist()
        bigram_model = Phrases(tokens)
        trigram_model = Phrases(bigram_model[tokens], min_count=1)
        tokens = list(trigram_model[bigram_model[tokens]])

        # LDA
        dictionary_LDA = corpora.Dictionary(tokens)
        dictionary_LDA.filter_extremes(no_below=3)
        corpus = [dictionary_LDA.doc2bow(tok) for tok in tokens]

        np.random.seed(2020)
        lda_model = models.LdaModel(corpus, num_topics=nb_topics, id2word=dictionary_LDA, passes=4, alpha=[0.01]*nb_topics, eta=[0.01]*len(dictionary_LDA.keys()))


        # Like TF-IDF, create a matrix of topic weighting, with documents as rows and topics as columns
        topics = [lda_model[corpus[i]] for i in range(len(df))]
        document_topic = pd.concat([topics_document_to_dataframe(topics_document, num_topics=nb_topics) for topics_document in topics]).reset_index(drop=True).fillna(0)
        
        df_topics = df_source.loc[:,['Account', 'Response Answer']].dropna().reset_index(drop=True)
        df_topics[['Topic ' + str(x) for x in range(nb_topics)]] = document_topic

        # LDA results
        lda = {'model': lda_model, 'corpus': corpus, 'dictionary': dictionary_LDA}
    
    return df_topics, lda
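
The last step of `LDA_Analysis` turns the sparse per-document output of the model, a list of `(topic_id, weight)` pairs, into a dense table with one column per topic. The helper `topics_document_to_dataframe` is defined elsewhere in the notebook; the sketch below is only a stand-in for illustration. The function name `sparse_topics_to_row` and the toy weights are assumptions, not the notebook's own code.

```python
import pandas as pd

def sparse_topics_to_row(topics_document, num_topics):
    """Hypothetical stand-in: one document's (topic_id, weight) pairs -> one dense row."""
    row = {topic_id: 0.0 for topic_id in range(num_topics)}
    for topic_id, weight in topics_document:
        row[topic_id] = weight
    return pd.DataFrame([row])

# Toy sparse output for three documents and three topics (made-up weights)
toy_topics = [
    [(0, 0.9), (2, 0.1)],
    [(1, 1.0)],
    [(0, 0.2), (1, 0.3), (2, 0.5)],
]

# Stack the rows: documents as rows, topics as columns, missing topics set to 0
document_topic = pd.concat(
    [sparse_topics_to_row(doc, num_topics=3) for doc in toy_topics]
).reset_index(drop=True)
document_topic.columns = ['Topic ' + str(x) for x in range(3)]
print(document_topic)
```

Topics that the model does not report for a document end up with weight 0, which mirrors the `fillna(0)` call in the function above.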

In [None]:
def Apply_LDA_Analysis(df, nb_topics):
    
    d_topics = {}
    d_lda = {}
    
    if not isinstance(df, dict):
        df = {'All': df}
        
    # Process one theme at a time
    for thematic in df.keys():
        
        if not df[thematic].empty:
            df_small = df[thematic].copy()
            df_topics, lda = LDA_Analysis(df_small, nb_topics)
            d_topics[thematic] = df_topics
            d_lda[thematic] = lda
            
        else:
            d_topics[thematic] = pd.DataFrame()
            d_lda[thematic] = {}

    return d_topics, d_lda

In [None]:
def LDA_heatmap(lda_df, n_topics):
    if lda_df.dropna().empty:
        print('The LDA was not computed')
    else:
        document_topic = lda_df.dropna().iloc[:,-n_topics:]
        sn.set(rc={'figure.figsize':(10,20)})
        sn.heatmap(document_topic.loc[document_topic.idxmax(axis=1).sort_values().index])
    
def LDA_printTopics(lda_model, n_topics, n_words = 10):
    if len(lda_model)==0:
        print('The LDA was not computed')
    else:
        lda_model = lda_model['model']
        for i,topic in lda_model.show_topics(formatted=True, num_topics=n_topics, num_words=n_words):
            print('Topic ' + str(i)+": "+ topic)
            print()
        
def LDA_visual(lda_model):
    
    vis = pyLDAvis.gensim.prepare(topic_model=lda_model['model'], corpus=lda_model['corpus'], dictionary=lda_model['dictionary'])
    pyLDAvis.enable_notebook()
    return vis
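
`LDA_heatmap` reorders the documents by their dominant topic before plotting, so that rows sharing a main topic appear as contiguous bands. A minimal sketch of that reordering on made-up weights (the city labels are hypothetical):

```python
import pandas as pd

# Toy document-topic weights, one row per respondent
document_topic = pd.DataFrame(
    {
        'Topic 0': [0.1, 0.8, 0.2],
        'Topic 1': [0.7, 0.1, 0.1],
        'Topic 2': [0.2, 0.1, 0.7],
    },
    index=['city_a', 'city_b', 'city_c'],
)

# Dominant topic of each row, then rows sorted by that label
dominant = document_topic.idxmax(axis=1)
ordered = document_topic.loc[dominant.sort_values().index]
print(ordered)
```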

#### LDA in Practice

In [None]:
nb_topics=3

In [None]:
cities_Will_2020_LDA_topics = []
cities_Will_2020_LDA_models = []
questions = []

# Question to analyse
for q in extra_info['Question Number'][extra_info['Which Is Text to Analyse'].notna()]:
    
    if q not in []:  # empty exclusion list: no question is skipped
    
        print('Question number', q)

        # Col number for reference categories
        l_r = extra_info.loc[extra_info['Question Number']==q, 'Text Ref Col'].iloc[0]
        if isinstance(l_r, list):
            l_r = int(l_r[0])
        if not np.isnan(l_r):
            l_r = int(l_r-1) # Convert to Python's 0-based indexing

        # Col number of text
        l_t = int(extra_info.loc[extra_info['Question Number']==q, 'Which Is Text to Analyse'].iloc[0])-1

        print('Column number ref :', l_r, 'and Column text:', l_t)

        # Get text for the given categories into a dictionary
        df = Explore_Textdata(cities_Will_2020_vWide[q],niv_ref=l_r, niv_text=l_t)
        
        # Run LDA
        d_topics, d_lda = Apply_LDA_Analysis(df, nb_topics)

        # Reshape the dataframe so that it covers all unique respondents
        d_topics_new={}
        for k in d_topics.keys():
            d = d_topics[k].copy()
            if not d.empty:
                d_topics_new[k] = pd.merge(geo_cities_2020, d, on='Account', how='outer')  
        cities_Will_2020_LDA_topics.append(d_topics_new)
        
        cities_Will_2020_LDA_models.append(d_lda)
        
        questions.append(q)

cities_Will_2020_LDA_topics = dict(zip(questions, cities_Will_2020_LDA_topics)) 
cities_Will_2020_LDA_models = dict(zip(questions, cities_Will_2020_LDA_models))  

# with open('../input/cdp-inputs-final/cities_Will_2020_LDA_models.pkl', 'wb') as handle:
#     pickle.dump(cities_Will_2020_LDA_models, handle, protocol=pickle.HIGHEST_PROTOCOL)
    
# with open('../input/cdp-inputs-final/cities_Will_2020_LDA_topics.pkl', 'wb') as handle:
#     pickle.dump(cities_Will_2020_LDA_topics, handle, protocol=pickle.HIGHEST_PROTOCOL)

<a id="Quality"></a>
### 2.2.3. Questions' pertinence

In [None]:
def Question_Pertinence(q, threshold=0.8):
    
    sent = []
    # Check if the answer is categorical
    isCat = q in extra_info.loc[extra_info['Which else to Analyse'].notna(), 'Question Number'].values
    
    # Check if the answers are text
    isText = q in extra_info.loc[extra_info['Which Is Text to Analyse'].notna(), 'Question Number'].values
    
    # True when the free-text answers are not split by reference category (no 'Text Ref Col')
    isFreeText = extra_info.loc[extra_info['Question Number']==q, 'Text Ref Col'].isna().values[0]
    
    # Check if the answers are well distributed amongst the potential categories
    if isCat:
        max_freq = []
        cat_count = cities_Will_2020_CatCount[q]
                    
        for i in range(len(cat_count)):
            freq = np.array([x for x in cat_count[i].values()])/np.sum([x for x in cat_count[i].values()])
            max_freq.append(max(freq))
            
            if min(max_freq)<threshold:
                question_score = 1.0
            else: 
                # If the answers are concentrated on one answer, check if the text answers are different
                if isText:
                    if max(cat_count[0], key=lambda key: cat_count[0][key])=='Question not applicable':
                        question_score = 0
                    else:
                        if isFreeText:
                            sent = cities_Will_2020_SentimentAnalysis[q]['All']
                            sent_freq = (sent.iloc[:, -1].value_counts())/sum(sent['Sentiment'].notna())
                            if sent_freq.empty:
                                sent_freq = pd.Series([0.0, 0.0])

                            # Check topics distribution in the answers
                            lda_topics = cities_Will_2020_LDA_topics[q]['All']
                            lda_freq = (lda_topics.iloc[:, -nb_topics:].idxmax(axis=1).value_counts())/lda_topics.iloc[:, -nb_topics:].dropna(axis=0, how='all').shape[0]
                            
                        else:
                            # Check sentiment distribution in the answers
                            sent = cities_Will_2020_SentimentAnalysis[q][max(cat_count[0], key=lambda key: cat_count[0][key])]
    #                         sent_freq = (sent.iloc[:, -1].value_counts())/sent.shape[0]
                            sent_freq = (sent.iloc[:, -1].value_counts())/sum(sent['Sentiment'].notna())
                            if sent_freq.empty:
                                sent_freq = pd.Series([0.0, 0.0])

                            # Check topics distribution in the answers
                            lda_topics = cities_Will_2020_LDA_topics[q][max(cat_count[0], key=lambda key: cat_count[0][key])]
    #                         lda_freq = (lda_topics.iloc[:, -nb_topics:].idxmax(axis=1).value_counts())/lda_topics.shape[0]
                            lda_freq = (lda_topics.iloc[:, -nb_topics:].idxmax(axis=1).value_counts())/lda_topics.iloc[:, -nb_topics:].dropna(axis=0, how='all').shape[0]
                        if lda_freq.empty:
                            lda_freq = pd.Series([0.0, 0.0])

                        if max(max(lda_freq), max(sent_freq))<threshold:
                            question_score = 1.0
                        else:
                            question_score = 1.0 - max(max(lda_freq), max(sent_freq))
                    
                else:
                    question_score = (1.0 - max(max_freq))
            
    else:
        if isText:
            # Check sentiment distribution in the answers
            sent = cities_Will_2020_SentimentAnalysis[q]['All']
#             sent_freq = (sent.iloc[:, -1].value_counts())/sent.shape[0]
            sent_freq = (sent.iloc[:, -1].value_counts())/sum(sent['Sentiment'].notna())
            if sent_freq.empty:
                sent_freq = pd.Series([0.0, 0.0])
                
            # Check topics distribution in the answers
            lda_topics = cities_Will_2020_LDA_topics[q]['All']
#             lda_freq = (lda_topics.iloc[:, -nb_topics:].idxmax(axis=1).value_counts())/lda_topics.shape[0]
            lda_freq = (lda_topics.iloc[:, -nb_topics:].idxmax(axis=1).value_counts())/lda_topics.iloc[:, -nb_topics:].dropna(axis=0, how='all').shape[0]
            if lda_freq.empty:
                lda_freq = pd.Series([0.0, 0.0])
            
            if max(max(lda_freq), max(sent_freq))<threshold:
                question_score = 1.0
            else:
                question_score = 1.0 - max(max(lda_freq), max(sent_freq))
    
        else:
            question_score = 1.0
            
    return question_score
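
The core rule applied throughout `Question_Pertinence` is the same in every branch: a question keeps a full score of 1.0 as long as no single answer, sentiment or topic reaches the threshold share, and is otherwise penalised by the share of the dominant one. The sketch below isolates that rule for a single categorical column; the function name `concentration_score` and the answer counts are assumptions made for illustration.

```python
def concentration_score(frequencies, threshold=0.8):
    """Return 1.0 if no answer dominates, else 1 - share of the dominant answer."""
    top_share = max(frequencies)
    if top_share < threshold:
        return 1.0
    return 1.0 - top_share

# Made-up answer counts for two questions
concentrated = {'Yes': 95, 'No': 3, 'In progress': 2}
balanced = {'Yes': 40, 'No': 35, 'In progress': 25}

def to_frequencies(counts):
    total = sum(counts.values())
    return [c / total for c in counts.values()]

score_concentrated = concentration_score(to_frequencies(concentrated), threshold=0.6)
score_balanced = concentration_score(to_frequencies(balanced), threshold=0.6)
print(score_concentrated, score_balanced)
```

A question where almost every city gives the same answer carries little information for ranking cities, which is why its weight shrinks towards 0.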

In [None]:
def Question_Coverage(q):
    
    # Determine the share of cities that were eligible for the question
    # and the share of eligible cities that actually answered (answer not left blank)
    
    df = cities_Will_2020_vWide[q].iloc[:,2]
    coverage = (df.shape[0]-np.sum(['Question not applicable' in list(x) for x in df if type(x)==dict]))
    not_empty_answers = 1-(df.shape[0] - df.dropna().shape[0])/coverage
            
    return coverage/df.shape[0], not_empty_answers
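
A small worked example of the two ratios returned by `Question_Coverage`. It assumes, as the `type(x)==dict` check suggests, that each cell holds a dictionary keyed by answer label, or NaN when the city left the question blank; the toy answers are made up.

```python
import numpy as np
import pandas as pd

# Five cities: one not applicable, one blank, three real answers
answers = pd.Series([
    {'Yes': 1},
    {'Question not applicable': 1},
    np.nan,
    {'No': 1},
    {'Yes': 1},
])

n_cities = answers.shape[0]
n_not_applicable = sum(
    'Question not applicable' in x for x in answers if isinstance(x, dict)
)
eligible = n_cities - n_not_applicable
n_blank = n_cities - answers.dropna().shape[0]

coverage_ratio = eligible / n_cities     # share of cities eligible for the question
answered_ratio = 1 - n_blank / eligible  # share of eligible cities that answered
print(coverage_ratio, answered_ratio)
```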

In [None]:
Question_Quality = pd.DataFrame(index = extra_info['Question Number'].unique(), columns = ['Pertinence','Coverage'])
for q in extra_info['Question Number'].unique():
    Question_Quality.loc[q, 'Pertinence'] = Question_Pertinence(q, 0.6)
    Question_Quality.loc[q, 'Coverage'] = Question_Coverage(q)[0]
        
# Question_Quality.to_pickle('../input/cdp-inputs-final/Questions_Quality.pkl')

## 2.3. Ambition KPIs by macro-theme

<a id="Question-level-app"></a>
### 2.3.1. Combine scores to questions' level

[Back to the main section](#Score-bq)

In [None]:
def Average_Score_Per_Question(data, nb_col_to_analyse):
    if not isinstance(data, dict):
        print('Data do not seem to require any reshape')
        return
    # Start from the account information; one score column per question is merged in below
    All_Scores = geo_cities_2020.copy()
    
    # We iterate through each question of the dict
    for q in data.keys():
#     for q in ['1.0a', '1.3', '10.7a', '12.3']:
        # We create an empty dataframe for scores
        q_scores = data[q][list(data[q].keys())[0]].iloc[:,:1].copy()
        
        # We iterate through each category of a given question
        for cat in data[q].keys():
            df = data[q][cat].copy()
            if nb_col_to_analyse>1:
                qutile = df.iloc[:,-nb_col_to_analyse:].apply(lambda x: np.nanquantile(x,0.5), axis = 0)
                # Multiple scores per category means we are looking at topic weights. The aggregate score is then the
                # number of topics whose weight exceeds the column median, discounted by the gap between the largest and the smallest topic weight.
                df['Score'] = df.iloc[:,-nb_col_to_analyse:].apply(lambda x: sum(x>qutile)*(1-(np.max(x)-np.min(x))) 
                                                                   if not np.isnan(x).any() else np.nan, axis = 1)
            else:
                df['Score'] = df.iloc[:,-nb_col_to_analyse:]
            
            # Save the data to the question level
            q_scores = pd.merge(q_scores, df.iloc[:,[0, -1]], on='Account', how='outer')
        
        # Average all categories scores to compute a single score per question
        q_scores['Avg'] = q_scores.iloc[:,1:].apply(lambda x: np.nanmean(x), axis = 1)
        All_Scores = pd.merge(All_Scores, q_scores.iloc[:,[0, -1]], on='Account', how='outer')
        All_Scores.rename(columns={'Avg': q}, inplace=True)  
        
        del q_scores
    
    return All_Scores
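
To make the topic aggregation concrete, the sketch below applies the same formula to made-up topic weights: count the topics whose weight exceeds the column median, then multiply by one minus the spread between the largest and smallest weight in the row. The helper name `topic_score` and the data are assumptions.

```python
import numpy as np
import pandas as pd

# Toy topic weights; the last respondent has no text answer
topic_weights = pd.DataFrame({
    'Topic 0': [0.90, 0.40, 0.10, 0.30, np.nan],
    'Topic 1': [0.05, 0.35, 0.20, 0.40, np.nan],
    'Topic 2': [0.05, 0.25, 0.70, 0.30, np.nan],
})

# Median weight of each topic across respondents, ignoring missing rows
medians = topic_weights.apply(lambda col: np.nanquantile(col, 0.5), axis=0)

def topic_score(row):
    if np.isnan(row).any():
        return np.nan
    n_above_median = sum(row > medians)
    spread = np.max(row) - np.min(row)
    return n_above_median * (1 - spread)

scores = topic_weights.apply(topic_score, axis=1)
print(scores)
```

An answer dominated by a single topic (first row) scores low, while an answer that touches several topics with similar weights (second and fourth rows) scores high.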

In [None]:
cities_Will_2020_LDA_topics_Avg = Average_Score_Per_Question(cities_Will_2020_LDA_topics, 3)
cities_Will_2020_SentimentAnalysis_Avg = Average_Score_Per_Question(cities_Will_2020_SentimentAnalysis, 1)
# cities_Will_2020_LDA_topics_Avg.to_pickle('../input/cdp-inputs-final/cities_Will_2020_LDA_topics_Avg.pkl')
# cities_Will_2020_SentimentAnalysis_Avg.to_pickle('../input/cdp-inputs-final/cities_Will_2020_SentimentAnalysis_Avg.pkl')

Average all three approaches (categorical analysis, LDA topics and sentiment analysis) based on each city's ranking within each approach.

In [None]:
def weighted_nan_average(x, w):
    indices = np.array(x.apply(lambda y: ~np.isnan(y)))
    if sum(indices)>0:
        out = np.average(np.array(x[indices], dtype='float'), weights = w[indices])
    else:
        out = np.nan
    return out
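
A short usage example, with the function repeated so that the sketch runs on its own. The input values are made up; the point is that the weights are renormalised over the entries that are actually available.

```python
import numpy as np
import pandas as pd

def weighted_nan_average(x, w):
    # Same logic as above: weighted average over the non-missing entries only
    indices = np.array(x.apply(lambda y: ~np.isnan(y)))
    if sum(indices) > 0:
        out = np.average(np.array(x[indices], dtype='float'), weights=w[indices])
    else:
        out = np.nan
    return out

weights = np.array([0.8, 0.1, 0.1])

# Second approach missing: only weights 0.8 and 0.1 are used
partial = weighted_nan_average(pd.Series([0.2, np.nan, 0.8]), weights)

# Nothing available: the result stays missing
empty = weighted_nan_average(pd.Series([np.nan, np.nan, np.nan]), weights)
print(partial, empty)
```

Here `partial` equals (0.8 × 0.2 + 0.1 × 0.8) / 0.9, so a city is not penalised for an approach that could not be computed for it.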

In [None]:
def Compute_Aggregate_Score(X_o,Y_o,Z_o, Approach_weights):
    # Create the output with the account information
    res = geo_cities_2020.copy()
    
    # Rank scores per approach
    X = X_o.copy()
    Y = Y_o.copy()
    Z = Z_o.copy()
    X.iloc[:,2:] = X.iloc[:,2:].apply(lambda x: x.rank(ascending = True, method = 'dense'))
    Y.iloc[:,2:] = Y.iloc[:,2:].apply(lambda x: x.rank(ascending = True, method = 'dense'))
    Z.iloc[:,2:] = Z.iloc[:,2:].apply(lambda x: x.rank(ascending = True, method = 'dense'))
    
    # We define a scaler to map all rankings to the range 0 to 1.
    scaler = MinMaxScaler()
    
    # For each question, average ranking over the available approaches
    for q in extra_info['Question Number'][extra_info.loc[:,['Which Is Text to Analyse' , 'Which else to Analyse']].isna().apply(lambda x: sum(x)<2, axis = 1)]:
#     for q in ['1.0a']:
        if q in X.columns:
            id_X = list(X.columns).index(q)
            r_X = X.iloc[:,id_X].copy()
        else:
            r_X = np.full(X.shape[0], np.nan)  # question absent from this approach: all-NaN ranks
        if q in Y.columns:
            id_Y = list(Y.columns).index(q)
            r_Y = Y.iloc[:,id_Y].copy()
        else:
            r_Y = np.full(Y.shape[0], np.nan)  # question absent from this approach: all-NaN ranks
        if q in Z.columns:
            id_Z = list(Z.columns).index(q)
            r_Z = Z.iloc[:,id_Z].copy()
        else:
            r_Z = np.full(Z.shape[0], np.nan)  # question absent from this approach: all-NaN ranks

        r_inter = pd.DataFrame([r_X, r_Y, r_Z])        
        # We scale all rankings between 0 and 1 for meaningful averaging
        r_inter = pd.DataFrame(scaler.fit_transform(r_inter.T)).T
        
        # We average the approaches with the specified weights
#         r_mean = r_inter.apply(lambda x: np.nanmean(x), axis = 0)
        r_mean = r_inter.apply(lambda x: weighted_nan_average(x, Approach_weights), axis = 0)
        res[q] = r_mean

    return res
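
The three steps of `Compute_Aggregate_Score` (dense ranking, scaling to [0, 1], weighted averaging over the available approaches) can be followed on a single question with made-up scores. This sketch replaces `MinMaxScaler` with the equivalent explicit formula so that it only depends on pandas and NumPy; the column names, city labels and helper `row_average` are assumptions.

```python
import numpy as np
import pandas as pd

# Raw scores of four cities for one question under the three approaches
raw = pd.DataFrame(
    {
        'categorical': [3.0, 1.0, 2.0, 2.0],
        'topics': [0.4, 1.7, np.nan, 0.15],
        'sentiment': [0.1, 0.9, 0.5, np.nan],
    },
    index=['city_a', 'city_b', 'city_c', 'city_d'],
)

# 1. Dense ranking within each approach (ties share a rank, NaN stays NaN)
ranks = raw.rank(ascending=True, method='dense')

# 2. Min-max scaling of each approach to [0, 1]
scaled = (ranks - ranks.min()) / (ranks.max() - ranks.min())

# 3. Weighted average over the approaches available for each city
weights = np.array([0.8, 0.1, 0.1])

def row_average(row):
    mask = row.notna().to_numpy()
    if not mask.any():
        return np.nan
    return np.average(row.to_numpy()[mask], weights=weights[mask])

aggregate = scaled.apply(row_average, axis=1)
print(aggregate)
```

Working on ranks rather than raw scores keeps the three approaches comparable even though their raw scales differ.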

In [None]:
# Bring this table in line with the structure of the other datasets
cities_Will_2020_CatAnalysis.reset_index(drop = False, inplace = True)
cities_Will_2020_CatAnalysis.rename(columns={'index':'Account'}, inplace=True)
cities_Will_2020_CatAnalysis = pd.merge(geo_cities_2020, cities_Will_2020_CatAnalysis, on='Account', how='outer')

In [None]:
Scores_All_Questions = Compute_Aggregate_Score(cities_Will_2020_CatAnalysis, cities_Will_2020_LDA_topics_Avg, cities_Will_2020_SentimentAnalysis_Avg, Approach_weights=np.array([0.8, 0.1, 0.1]))
# Scores_All_Questions.to_pickle('../input/cdp-inputs-final/cities_2020_Scores_All_Questions.pkl')

### 2.3.2. Combine scores to SDGs macro-themes level

In [None]:
# Subset data per SDGs macro-theme
Planet_questions = list(extra_info['Question Number'][extra_info['SDGs'].apply(lambda x: 'Planet' in x if isinstance(x, list) else False)])
Planet_Scores = Scores_All_Questions.loc[:, ['Account', 'City Location'] + Planet_questions].copy()

Prosperity_questions = list(extra_info['Question Number'][extra_info['SDGs'].apply(lambda x: 'Prosperity' in x if isinstance(x, list) else False)])
Prosperity_Scores = Scores_All_Questions.loc[:, ['Account', 'City Location'] + Prosperity_questions].copy()

Social_questions = list(extra_info['Question Number'][extra_info['SDGs'].apply(lambda x: 'SocialJustice' in x if isinstance(x, list) else False)])
Social_Scores = Scores_All_Questions.loc[:, ['Account', 'City Location'] + Social_questions].copy()

In [None]:
# Compute average score per macro-theme
weights_Planet = np.array([float(Question_Quality.Pertinence.iloc[w]) for w in [list(Question_Quality.index).index(q) for q in Planet_Scores.columns[2:]]])
Planet_Scores['KPI'] = Planet_Scores.apply(lambda x: weighted_nan_average(x[2:], weights_Planet), axis = 1)

weights_Prosperity = np.array([float(Question_Quality.Pertinence.iloc[w]) for w in [list(Question_Quality.index).index(q) for q in Prosperity_Scores.columns[2:]]])
Prosperity_Scores['KPI'] = Prosperity_Scores.apply(lambda x: weighted_nan_average(x[2:], w = weights_Prosperity), axis = 1)

weights_Social = np.array([float(Question_Quality.Pertinence.iloc[w]) for w in [list(Question_Quality.index).index(q) for q in Social_Scores.columns[2:]]])
Social_Scores['KPI'] = Social_Scores.apply(lambda x: weighted_nan_average(x[2:], w = weights_Social), axis = 1)
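
A compact sketch of the macro-theme step on toy data: questions are selected through their SDG tags, and each city's KPI is the pertinence-weighted average of its available question scores. All names ending in `_toy`, the question labels and the values are assumptions; the pertinence weights are looked up by label here, which is one possible alternative to the positional lookup used above.

```python
import numpy as np
import pandas as pd

extra_info_toy = pd.DataFrame({
    'Question Number': ['q1', 'q2', 'q3'],
    'SDGs': [['Planet'], ['Planet', 'Prosperity'], np.nan],
})
pertinence_toy = pd.Series({'q1': 1.0, 'q2': 0.5, 'q3': 1.0})
scores_toy = pd.DataFrame({
    'Account': [1, 2],
    'q1': [0.9, np.nan],
    'q2': [0.3, 0.6],
    'q3': [0.5, 0.5],
})

# Questions tagged with the 'Planet' macro-theme (untagged questions are ignored)
is_planet = extra_info_toy['SDGs'].apply(
    lambda x: 'Planet' in x if isinstance(x, list) else False
)
planet_questions = list(extra_info_toy.loc[is_planet, 'Question Number'])
weights = pertinence_toy.loc[planet_questions].to_numpy(dtype=float)

def pertinence_weighted_kpi(row):
    values = row.to_numpy(dtype=float)
    mask = ~np.isnan(values)
    if not mask.any():
        return np.nan
    return np.average(values[mask], weights=weights[mask])

kpi = scores_toy[planet_questions].apply(pertinence_weighted_kpi, axis=1)
print(kpi)
```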