In this notebook we try to establish a way to investigate and evaluate how engaged cities are in answering the questionnaire. The level of engagement shows which questions matter more to cities and which questions are avoided more often. Measuring engagement could help in analysing social and climate issues, but most importantly it should help to improve the questionnaire and to encourage cities to provide more complete answers.

# Table of Contents

* [Data Loading and Cleaning](#data-load)
* [Cities Engagement in the Survey](#engagement)
    - [Engagement KPI](#kpi)    
    - [Grouping cities based on answering or skipping questions](#groupping)
    - [Grouping questions based on cities answering](#groupping_questions)
* [Corporations Engagement in the Survey](#corp_engagement)    
    

<a id="data-load"></a>
# Data Loading and Cleaning
Although additional data sources help a lot in understanding the environmental and social issues our planet is facing, we concentrate on analysing the CDP questionnaire responses only.

Let us start with restructuring and cleaning the input data.
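
As a minimal sketch of what "restructuring" means here, the toy example below turns long-format answers (one row per city and question) into a wide table (one row per city). The column values and the use of `aggfunc="first"` are assumptions for illustration only; the real data needs extra handling for multi-choice questions, which the functions in the next cell take care of.

```python
import pandas as pd

# Toy long-format data: one row per (city, question) pair (hypothetical values)
long_df = pd.DataFrame({
    "Account Number": [1, 1, 2],
    "Question Number": ["0.1", "0.5", "0.1"],
    "Response Answer": ["Yes", "100", "No"],
})

# One row per city, one column per question; questions a city did not answer become NaN
wide_df = long_df.pivot_table(
    index="Account Number",
    columns="Question Number",
    values="Response Answer",
    aggfunc="first",
)
print(wide_df)
```

City 2 has no row for question 0.5 in the long table, so the corresponding cell in the wide table is NaN. These NaN cells are what the engagement analysis later counts as skipped questions.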


In [None]:
#all the imports required for the notebook are given in this block
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import HTML, Markdown
import seaborn as sns
import geopandas as gpd
from shapely.geometry import Point

import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

import umap
from sklearn.cluster import DBSCAN


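#centering figures; display() is needed because HTML(...) is not the last expression in the cell
from IPython.display import display
display(HTML("""<style> .output_png { display: table-cell; text-align: center; vertical-align: middle;}</style>"""))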

#my colors
colors= ['#003f5c','#2f4b7c','#665191','#a05195','#d45087','#f95d6a','#ff7c43','#ffa600','#fcca46','#a1c181','#619b8a','#386641']

In [None]:
# Load cities disclosing info
cities_disclosing_2018 = pd.read_csv("/kaggle/input/cdp-unlocking-climate-solutions/Cities/Cities Disclosing/2018_Cities_Disclosing_to_CDP.csv")
cities_disclosing_2019 = pd.read_csv("/kaggle/input/cdp-unlocking-climate-solutions/Cities/Cities Disclosing/2019_Cities_Disclosing_to_CDP.csv")
cities_disclosing_2020 = pd.read_csv("/kaggle/input/cdp-unlocking-climate-solutions/Cities/Cities Disclosing/2020_Cities_Disclosing_to_CDP.csv")


# Load cities responses
cities_2018 = pd.read_csv("/kaggle/input/cdp-unlocking-climate-solutions/Cities/Cities Responses/2018_Full_Cities_Dataset.csv")
cities_2019 = pd.read_csv("/kaggle/input/cdp-unlocking-climate-solutions/Cities/Cities Responses/2019_Full_Cities_Dataset.csv")
cities_2020 = pd.read_csv("/kaggle/input/cdp-unlocking-climate-solutions/Cities/Cities Responses/2020_Full_Cities_Dataset.csv")

# Functions for bringing the data into tabular form
   
def pivot_cities(cities):
    ''' make pivot table preserving multi-choice columns '''    
    
    def fix_pivot_column(col):
        ''' remove dummy arrays in pivoted cities dataframe '''
        def fix_multi_choice(x):
            if x==x: 
                return x[0]
            else:
                return x    
        def fix_single_choice(x):
            if x==x: 
                return x[0][0]
            else:
                return x

        if all([len(c[0]) == 1 for c in col if c==c]):
            return col.map(fix_single_choice)
        else:
            return col.map(fix_multi_choice)
    
    cities_piv = cities.pivot_table(index=['CDP Region', 'Country', 'Account Number', 'Organization'], 
                      columns=[ 'Question Number', 'Column Number', 'Row Number'], 
                      values = ['Response Answer'],
                   aggfunc = lambda x:[x.values])    

    # drop columns that have only one single-valued answer, which were most likely filled in by mistake
    columns = cities_piv.columns
    to_drop=[]
    for col in columns:        
        if sum([len(c[0]) == 1 for c in cities_piv[col] if c==c]) == 1:
            to_drop.append(col)
    cities_piv.drop(columns=to_drop, inplace=True)        
    
    # remove dummy arrays that appear in pivot because of multi-choice columns
    columns = cities_piv.columns
    for col in columns:        
        cities_piv[col] = fix_pivot_column(cities_piv[col])
    return cities_piv

def get_questions(cities):
    ''' get questions reference table '''
    return cities.loc[:, ['Question Number', 'Question Name', 'Column Number', 'Column Name', 'Row Number', 'Row Name']].drop_duplicates()

def get_question_names(cities, questionNumber, columnNumber, rowNumber):
    questions = get_questions(cities)
    return questions.loc[(questions['Question Number'] == questionNumber) & 
                  (questions['Column Number'] == columnNumber) & 
                  (questions['Row Number'] == rowNumber), ['Question Name', 'Column Name', 'Row Name']]
    

def fix_row_number(cities):
    '''For Rows without Name, positive Row Number means multi-choice question. Change Row Number accordingly'''
    cities.loc[pd.isnull(cities['Row Name']), 'Row Number'] = 0

def fix_skipped_multichoice(cities_piv):
    """Because of the way row numbers are used and then reset by fix_row_number, some multi-choice answers contain NaN. This function removes those NaNs."""
    def fix_skipped_multichoice_loc(multichoice):
        # keep only non-NaN choices (NaN is the only value that is not equal to itself)
        multichoice_fixed = np.array([choice for choice in multichoice if choice==choice])
        if len(multichoice_fixed)==0:
            multichoice_fixed = np.array([np.nan])
        return multichoice_fixed
    
    for col in cities_piv.columns:
        c = cities_piv[col]
        # positional access, because the index is a MultiIndex and c[0] would be a label lookup
        if isinstance(c.iloc[0], np.ndarray):
            cities_piv[col] = cities_piv[col].apply(fix_skipped_multichoice_loc)
    return cities_piv
    

# Questions to consider    
CDP_recommended_questions = ['0.1', '0.5', '1.0', '1.0a', '2.0', '2.0a', '2.0b', '2.1', '2.2', '4.0', '4.1', '4.2', '4.3', '4.4', '4.5', 
                    '5.0', '5.0a', '5.0b','5.0c','5.0d', '6.0', '6.1',  '8.0', '8.0a', '8.2', '8.3', '8.5', '10.1',
                   '10.2', '10.11', '10.12', '10.13', '10.14', '10.15', '12.0', '12.4', '14.0', '14.1', '14.3',  '14.4'] # ['6.1a' '8.6' '10.16' '14.3a'] were never answered/asked in 2020


# Clean the data and transform it to tabular form
fix_row_number(cities_2020)
cities_2020_piv = pivot_cities(cities_2020)
cities_2020_piv = cities_2020_piv['Response Answer']#[CDP_recommended_questions]
cities_2020_piv = fix_skipped_multichoice(cities_2020_piv)


        

<a id="engagement"></a>
# Cities Engagement in the Survey
We would like to understand why some questions are not answered by some cities. 

As the plots below show, the number of answers varies widely between questions.
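
The distinction used throughout this section can be sketched on toy data: a question counts as answered only if the response is present and is not the placeholder text 'Question not applicable'. The series below is hypothetical, and the sketch covers single-valued answers only (the notebook's own functions also handle multi-choice arrays).

```python
import numpy as np
import pandas as pd

# Hypothetical responses of four cities to one question
responses = pd.Series(
    ["Yes", np.nan, "Question not applicable", "42"],
    index=["city_a", "city_b", "city_c", "city_d"],
)

answered = responses.notna()                           # something was entered
not_applicable = responses == "Question not applicable"
answer_provided = answered & ~not_applicable           # a real answer was given

# NaN is the only value that is not equal to itself,
# which is why the notebook uses the `x == x` idiom as a missing-value check
nan_value = float("nan")
nan_equals_itself = nan_value == nan_value

print(answer_provided)
```

Here only `city_a` and `city_d` count as having provided an answer.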


In [None]:
# Find which questions were not answered at all, and which were answered with 'Question not applicable'

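def is_answered (x):
    if isinstance(x, list) or isinstance(x, np.ndarray):
        # element-wise NaN check; `x==x` on a plain list gives a single bool that any() cannot iterate over
        return any(el == el for el in x)
    else:
        return x==x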
    
def is_not_applicable(x):
    if isinstance(x, list) or isinstance(x, np.ndarray):
        return all([el == 'Question not applicable' for el in x ])
    else:
        return x == 'Question not applicable'


def find_answered_question(cities_piv):    
    cities_answered = pd.DataFrame().reindex_like(cities_piv, method = None)
    cities_notApplicable = pd.DataFrame().reindex_like(cities_piv, method = None)
    
    for col in cities_piv.columns:    
        cities_answered[col] = cities_piv[col].apply(is_answered)
        cities_notApplicable[col] = cities_piv[col].apply(is_not_applicable)

    cities_answer_provided = cities_answered & (~cities_notApplicable)    
    return cities_answered, cities_notApplicable, cities_answer_provided

cities_2020_answered, cities_2020_notApplicable, cities_2020_answer_provided = find_answered_question(cities_2020_piv)


In [None]:
def plot_answere_per_question(cities, cities_bool):

    answers_per_question = cities_bool.values.sum(axis=0)

    questionNumber = list(map(lambda x: x[0], cities_bool.columns))
    columnNumber = list(map(lambda x: x[1], cities_bool.columns))
    rowNumber = list(map(lambda x: x[2], cities_bool.columns))

    df = pd.DataFrame({'Question Number': questionNumber, 'Column Number': columnNumber, 'Row Number':rowNumber, 'Answer Count': answers_per_question})
    df = df.merge(get_questions(cities), on = ['Question Number', 'Column Number', 'Row Number'])
    df['ind'] = df.index
    df['Question Name'] = df['Question Name'].apply(lambda x: x[:60])
    df['Column Name'] = df['Column Name'].apply(lambda x: (x[:60] if x==x else 'None'))
    df['Row Name'] = df['Row Name'].apply(lambda x: (x[:60] if x==x else 'None'))


    fig = px.scatter(df, x='ind', y='Answer Count', color='Question Number', 
                     hover_data=['Answer Count', 'Question Number', 'Question Name', 'Column Number', 'Column Name', 'Row Number', 'Row Name'],
                     title="Number of answers per question",height=800,)

    return fig
    
    
plot_answere_per_question(cities_2020, cities_2020_answer_provided)


To make the plot less busy, we look only at the questions recommended for the analysis.

In [None]:
fig = plot_answere_per_question(cities_2020, find_answered_question(cities_2020_piv[CDP_recommended_questions])[2])
fig.update_layout(title='Number of answers for the selected set of questions')

Few of the recommended questions were actually answered by the majority of cities, and some very important questions were answered by only a fraction of them.
For example, question 10.15, "Please indicate if your city currently has any programs or projects to improve air quality.", was answered by only 39 cities (see the table below).

In [None]:
# the only 39 cities that indicated they currently have any programs or projects to improve air quality
pd.DataFrame(cities_2020_piv['10.15'][0][0].loc[cities_2020_answer_provided['10.15'][0][0]])

<a id="kpi"></a>
## Engagement KPI

We define an engagement KPI that shows how willing a city is to answer the questionnaire: the quantile (percentile) rank of the number of questions it answered. The city that answered the most questions in a category gets a KPI value of 1.0, and the city that answered the fewest gets a value close to 0. Such a KPI could be defined per question, per section, per group of questions, or for the whole questionnaire. Below we calculate the KPI on the set of recommended questions.
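
To make the definition concrete, here is a small sketch of the quantile ranking on made-up answer counts. The city names and numbers are hypothetical; the only real API used is pandas `Series.rank(pct=True)`, which is also what the next cell relies on.

```python
import pandas as pd

# Hypothetical number of answered questions per city
n_answers = pd.Series(
    [40, 25, 25, 3],
    index=["city_a", "city_b", "city_c", "city_d"],
)

# Percentile rank: ties share the average rank, and ranks are divided by the number of cities
kpi = n_answers.rank(pct=True)
print(kpi)
```

With four cities, the most engaged city gets 1.0, the two tied cities get 0.625 each, and the least engaged city gets 0.25 (that is, 1/4) rather than exactly 0. The minimum approaches 0 only as the number of cities grows.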

In [None]:
cities_2020_selected_answered =find_answered_question(cities_2020_piv[CDP_recommended_questions])[2]
cities_2020_selected_answered['n_answers'] = cities_2020_selected_answered.values.sum(axis=1)
cities_2020_selected_answered['rank'] = cities_2020_selected_answered['n_answers'].rank(pct=True)

df = cities_2020_selected_answered.reset_index().loc[:, ['Account Number']]
df = df.merge(cities_disclosing_2020.loc[:, ['City', 'Account Number', 'City Location', 'CDP Region']], on=['Account Number'])
df['City Location'] = df['City Location'].fillna('POINT (0 0)') # quick placeholder for cities without coordinates; only this column is filled so that other fields keep their missing values
df['x'] = df['City Location'].apply(lambda txt: float(txt.split("POINT ")[1].replace('(', '').replace(")", '').split(" ")[0]))
df['y'] = df['City Location'].apply(lambda txt: float(txt.split("POINT ")[1].replace('(', '').replace(")", '').split(" ")[1]))
df['rank'] = cities_2020_selected_answered['rank'].values
df

smb = go.Scattermapbox(name='severity map',
        lon = df.x,
        lat = df.y,
        text = df.loc[:,['City', 'rank']],
        mode = 'markers',
        marker = dict(
            size = 10,
            opacity = 0.8,
            colorscale = 'Viridis',
            cmin = 0,
            color = cities_2020_selected_answered['rank'],

        ))
fig = go.Figure()

fig.add_trace(smb)
fig.update_layout(
        mapbox_style="open-street-map",
        title = dict(text='Questionnaire Engagement KPI for recommended set of questions',x=0.5,y=.97),
        height=900,
        width=900,
        )

cities_2020_selected_answered = cities_2020_selected_answered.drop(columns= ['n_answers', 'rank'])

fig 

<a id="groupping"></a>
## Grouping cities based on answering or skipping questions

Going further, we can cluster cities based on the way they answer the questions. We see distinct clusters associated with frequently answering particular questions. For example, cluster number 4 consists of cities that answered the 10.14 questions.
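
The cell below embeds the answer matrix with UMAP using the cosine metric. As a rough intuition for why that groups cities with similar answering patterns, the sketch here computes the cosine distance between a few made-up binary answer vectors with plain NumPy. The helper `cosine_distance` and the vectors are illustrative assumptions, not part of the notebook's pipeline.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance between two answer vectors (1 = answered, 0 = skipped)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        # Assumption for this sketch: a city with no answers is maximally distant
        return 1.0
    return 1.0 - float(a @ b) / denom

# Hypothetical answering patterns over four questions
city_a = [1, 1, 0, 0]
city_b = [1, 1, 0, 0]   # same pattern as city_a
city_c = [0, 0, 1, 1]   # answered exactly the questions city_a skipped
city_d = [1, 1, 1, 0]   # overlaps with city_a, plus one extra answer

print(cosine_distance(city_a, city_b))
print(cosine_distance(city_a, city_c))
print(cosine_distance(city_a, city_d))
```

Identical patterns are at distance (numerically close to) 0, disjoint patterns are at distance 1, and partially overlapping patterns fall in between. The distance depends on which questions were answered, not only on how many.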

In [None]:
embedding = umap.UMAP(metric='cosine', n_components = 2, random_state=42).fit_transform(cities_2020_selected_answered.values)

clustering = DBSCAN().fit(embedding)
labels = clustering.labels_


In [None]:
df['umap_x'] = embedding[:,0]
df['umap_y'] = embedding[:,1]
df['cluster_id'] = labels

fig = px.scatter(df, x='umap_x', y='umap_y', color='cluster_id', 
                 hover_data=['City', 'cluster_id'],
                 title="Clustering of cities by their answers to the questions recommended for the analysis")
fig

In [None]:
cities_2020_selected_answered['cluster_id'] = labels
df2 = cities_2020_selected_answered.groupby('cluster_id').mean()
x = np.arange(len(df2.columns))

# Create traces
fig = go.Figure(layout = go.Layout(title='Average answering frequency per cluster over set of recommended questions', legend = {'title':'cluster_id'}))


for i, cluster_id in  enumerate(list(df2.index)):
    fig.add_trace(go.Scatter(x=x, y=df2.values[i,:],
                        mode='lines',
                        name= str(cluster_id),
                        hovertext = list(df2.columns)))

fig.show()

In [None]:
smb = go.Scattermapbox(name='severity map',
        lon = df.x,
        lat = df.y,
        hovertext = df.loc[:, ['City', 'cluster_id']],
        mode = 'markers',
        marker = dict(
            size = 10,
            opacity = 0.8,
            colorscale = 'rainbow',
            color = df["cluster_id"],
        ))
fig = go.Figure()

fig.add_trace(smb)
fig.update_layout(
        mapbox_style="open-street-map",
        title = dict(text='Clustering of cities by their answers to the questions recommended for the analysis',x=0.5,y=.97),
        height=900,
        width=900,
        )

<a id="groupping_questions"></a>
## Grouping questions based on cities answering


Questions can also be clustered in a similar way.
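
The key step is transposing the answer matrix so that each row describes a question instead of a city. The toy sketch below shows this with a deliberately simplified grouping: questions are grouped when they were answered by exactly the same cities. This exact-match grouping is only a stand-in for illustration; the notebook itself uses UMAP and DBSCAN, which also group questions with similar but not identical patterns.

```python
import pandas as pd

# Hypothetical boolean answer matrix: rows are cities, columns are questions
answers = pd.DataFrame(
    {"Q1": [True, True, False], "Q2": [True, True, False], "Q3": [False, False, True]},
    index=["city_a", "city_b", "city_c"],
)

# Transpose: each row is now a question, each column a city
by_question = answers.T

# Encode every question's answering pattern as a string such as "110"
pattern = by_question.astype(int).astype(str).apply(lambda row: "".join(row), axis=1)

# Group questions that share exactly the same pattern
groups = {}
for question, p in pattern.items():
    groups.setdefault(p, []).append(question)

print(groups)
```

In this toy data, Q1 and Q2 end up in one group because the same two cities answered them, while Q3 forms its own group.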

In [None]:
embedding = umap.UMAP(metric='cosine', n_components = 2, random_state=42).fit_transform(cities_2020_selected_answered.values.T)
clustering = DBSCAN(eps = 1).fit(embedding)
labels = clustering.labels_


In [None]:
questions = list(cities_2020_selected_answered.columns)
df = pd.DataFrame({'x': embedding[:,0],'y': embedding[:,1], 'cluster_id':labels, 'question': questions})


fig = px.scatter(df, x='x', y='y', color='cluster_id', 
                 hover_data=['question'],
                 title="Clustering of recommended questions by how cities answered them")
fig


<a id="corp_engagement"></a>
# Corporations Engagement in the Survey
Our assumption is that the analysis of corporations' engagement is much more interesting, as we expect corporations to avoid answering certain questions. Analysing this phenomenon with the methodology developed above for cities could lead to the definition of other KPIs that are more meaningful for actual social and environmental effects. Such an analysis should also lead to a better questionnaire and a better strategy for engaging corporations in sharing answers.



In [None]:
#TODO