# Introduction

**Challenge**

This Kaggle analytics challenge consists of exploring and analyzing the social and environmental information that disclosing companies and cities provide to CDP, in order to help identify solutions to the impacts of climate change.

The main questions defined by the challenge organizer are:

1. Helping cities adapt: How do you help cities adapt to a rapidly changing climate in a way that is socially equitable?
2. Cities/corporation cooperation: What are the practical and actionable points where city and corporate ambition join? Where do cities have problems that corporations affected by those problems could solve, and vice-versa?
3. Environmental risks & social equity: How can we measure the intersection between environmental risks and social equity?


**Goals**

1. Calculate KPIs that relate to environmental and social issues and demonstrate whether city and corporate ambitions take these factors into account.
2. Leverage external data sources and discuss the intersection between environmental and social issues.


In [None]:
# standard libs
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import json
from glob import glob
from tqdm import tqdm

# plotting libs
import seaborn as sns
import altair as alt
import plotly
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

# geospatial libs
from mpl_toolkits.basemap import Basemap
from shapely.geometry import Polygon
import geopandas as gpd
import folium

# render plotly figures inline in the notebook
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)

# misc
import warnings
warnings.filterwarnings('ignore')
from difflib import SequenceMatcher

# shared colour palette for the pie charts below
colors = ['#003f5c', '#2f4b7c', '#665191', '#a05195', '#d45087', '#f95d6a',
          '#ff7c43', '#ffa600', '#fcca46', '#a1c181', '#619b8a', '#386641']

print(os.getcwd())

In [None]:
data_2018 = glob('/kaggle/input/cdp-unlocking-climate-solutions/*/*/2018*.csv')+glob('/kaggle/input/cdp-unlocking-climate-solutions/*/*2018*.csv')
data_2019 = glob('/kaggle/input/cdp-unlocking-climate-solutions/*/*/2019*.csv')+glob('/kaggle/input/cdp-unlocking-climate-solutions/*/*2019*.csv')
data_2020 = glob('/kaggle/input/cdp-unlocking-climate-solutions/*/*/2020*.csv')+glob('/kaggle/input/cdp-unlocking-climate-solutions/*/*2020*.csv')
cities_response_files = glob('/kaggle/input/cdp-unlocking-climate-solutions/*/*/*Full_Cities_Dataset.csv')
cities_disclosing_files = glob('/kaggle/input/cdp-unlocking-climate-solutions/*/*/*Cities_Disclosing_to_CDP.csv')
cities_disclosing = pd.concat([pd.read_csv(file) for file in cities_disclosing_files])
cities_response = pd.concat([pd.read_csv(file) for file in cities_response_files])
corporate_climate_change_disclosing_files = glob('/kaggle/input/cdp-unlocking-climate-solutions/*/*/*/*Corporates_Disclosing_to_CDP_Climate_Change*.csv')
corporate_water_security_disclosing_files = glob('/kaggle/input/cdp-unlocking-climate-solutions/*/*/*/*Corporates_Disclosing_to_CDP_Water_Security*.csv')
corporate_climate_change_response_files = glob('/kaggle/input/cdp-unlocking-climate-solutions/*/*/*/*Full_Climate_Change_Dataset*.csv')
corporate_water_security_response_files = glob('/kaggle/input/cdp-unlocking-climate-solutions/*/*/*/*Full_Water_Security_Dataset*.csv')
corporate_climate_change_disclosing = pd.concat([pd.read_csv(file) for file in corporate_climate_change_disclosing_files])
corporate_water_security_disclosing = pd.concat([pd.read_csv(file) for file in corporate_water_security_disclosing_files])
corporate_climate_change_response = pd.concat([pd.read_csv(file) for file in corporate_climate_change_response_files])
corporate_water_security_response = pd.concat([pd.read_csv(file) for file in corporate_water_security_response_files])

Here we can see that although the organisation name is present, the city name is often missing and the population column also contains nulls. The next visualization shows this.
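
A quick way to check this is to count the missing values per column. The sketch below runs on a small made-up frame rather than the real `cities_disclosing` data, and it assumes the city name lives in a column called `City`; the sample rows and the `null_summary` name are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for cities_disclosing (made-up rows; 'City' column name is an assumption).
sample = pd.DataFrame({
    "Organization": ["City of A", "Municipality of B", "C County"],
    "City": ["A", np.nan, np.nan],
    "Population": [120000, np.nan, 45000],
})

# Number and share of missing values in each column.
null_summary = pd.DataFrame({
    "missing": sample.isna().sum(),
    "share": sample.isna().mean().round(2),
})
print(null_summary)
```

Running the same two lines on `cities_disclosing` would show which columns are too sparse to rely on.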

Going through all the comments for 2018, 2019 and 2020 would be a hectic task and would take a lot of time, so I am only visualizing the 2020 data.

# Cities Disclosing Overview

In [None]:
cities_disclosing_2020=cities_disclosing[cities_disclosing["Year Reported to CDP"] == 2020]

In [None]:
_cities_20=cities_disclosing_2020[cities_disclosing_2020["CDP Region"] == "Latin America"]

In [None]:
cities_disclosing.head()

In [None]:
cities_disclx=cities_disclosing[cities_disclosing["Population Year"] < 2017  ]

In [None]:
cities_disclx.shape

In [None]:
fig = px.sunburst(cities_disclx, path=['Year Reported to CDP','Population Year','CDP Region'], values='Population',
                  color='Population',
                  color_continuous_scale='RdBu')
fig.show()

As we can see here, many regions have reported a population year earlier than 2017.
Across the three years, the cities_disclosing data contains 991 cities. The year-on-year chart below elaborates on this.

In [None]:
fig = px.sunburst(cities_disclosing, path=['Year Reported to CDP','First Time Discloser','CDP Region','Organization'], values='Population',
                  color='Population',
                  color_continuous_scale='RdBu')
fig.show()

In [None]:
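# Cities disclosing per CDP region and year (the counts used in the bar chart)
'''CDP Region       2020   2019   2018
Latin America    173    304    187
North America    169    208    184
Europe           137    182    165
SE ASIA OCEAN    33     72     51
Africa           29     48     22
East Asia        23     24     21
Middle East      5      15     7
South West Asia  3      8      4'''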

In [None]:
# CDP regions on the x axis, one group of bars per reporting year
regions = ['Latin America', 'North America', 'Europe', 'South East Asia and Oceania',
           'Africa', 'East Asia', 'Middle East', 'South West Asia']

fig = go.Figure()
fig.add_trace(go.Bar(
    x=regions,
    y=[187, 184, 165, 51, 22, 21, 7, 4],
    name='Year_18',
    marker_color='yellowgreen'
))
fig.add_trace(go.Bar(
    x=regions,
    y=[304, 208, 182, 72, 48, 24, 15, 8],
    name='Year_19',
    marker_color='orange'
))
fig.add_trace(go.Bar(
    x=regions,
    y=[173, 169, 137, 33, 29, 23, 5, 3],
    name='Year_20',
    marker_color='orangered'
))

# Here we modify the tickangle of the xaxis, resulting in rotated labels.
fig.update_layout(barmode='group', xaxis_tickangle=-45)
fig.show()
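
The counts in this chart are typed in by hand. As a sketch of how they could be derived from the data instead, the example below cross-tabulates region against reporting year. It uses a few made-up rows with the `CDP Region` and `Year Reported to CDP` columns seen earlier; `region_year_counts` is an illustrative name.

```python
import pandas as pd

# Toy stand-in for cities_disclosing (made-up rows).
sample = pd.DataFrame({
    "Year Reported to CDP": [2018, 2018, 2019, 2019, 2019, 2020],
    "CDP Region": ["Europe", "Africa", "Europe", "Europe", "Africa", "Europe"],
})

# One row per region, one column per year, each cell the number of disclosures.
region_year_counts = pd.crosstab(sample["CDP Region"], sample["Year Reported to CDP"])
print(region_year_counts)
```

Each year column, for example `region_year_counts[2018]`, could then be passed as `y` to a `go.Bar` trace with `region_year_counts.index` as `x`.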



In [None]:
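# fillna returns a new Series, so assign it back to keep the change
cities_disclosing['Population'] = cities_disclosing['Population'].fillna(0)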

In [None]:
cities_disclosing_p=cities_disclosing.sort_values('Population',ascending=False)[:2068]
cities_dp20=cities_disclosing_p[cities_disclosing_p["Year Reported to CDP"] == 2020]
cities_dp19=cities_disclosing_p[cities_disclosing_p["Year Reported to CDP"] == 2019]
cities_dp18=cities_disclosing_p[cities_disclosing_p["Year Reported to CDP"] == 2018]

In [None]:
fig = make_subplots(rows=1, cols=1,
                    specs=[[{'type': 'choropleth'}]],
                    subplot_titles=['Country V Population'])
fig.add_trace(go.Choropleth(locations=cities_dp20.Country,
                            z=cities_dp20.Population,
                            locationmode='country names',
                            colorscale=px.colors.sequential.Viridis),
              row=1, col=1)
fig.show()

**Points of Cities Disclosing**
1. There are too many irregularities in the data.
2. Many countries do not give accurate data; Uruguay, for example, reported a population of 1 billion.
3. The reported population is often outdated, and some cities gave the same figure for all three years, for example from a 2010 or 2011 census.
4. Many organizations have also reported in their local language, such as Spanish.
5. Far fewer cities reported in 2020, likely because of the coronavirus crisis; many cities that had previously participated strongly did not take part actively.
6. Because participation is very low in regions such as the Middle East, Asia and Oceania, we cannot characterise a country from only a handful of cities; doing so is likely to mislead us.
7. We should assess only the participating cities and look for similarities between different metropolitan areas and counties.
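
Points 2 and 3 can be turned into simple data-quality flags. This is a minimal sketch on made-up rows; the two thresholds and the `flag_population_issues` helper are hypothetical choices for illustration, not values taken from the CDP data.

```python
import pandas as pd

# Hypothetical cut-offs, chosen only for this illustration.
MAX_PLAUSIBLE_POPULATION = 40_000_000
OLDEST_ACCEPTABLE_YEAR = 2015

# Toy stand-in for cities_disclosing (made-up rows).
sample = pd.DataFrame({
    "Organization": ["City X", "City Y", "City Z"],
    "Population Year": [2019, 2010, 2018],
    "Population": [850_000, 1_200_000, 1_000_000_000],
})

def flag_population_issues(df, max_population=MAX_PLAUSIBLE_POPULATION,
                           oldest_year=OLDEST_ACCEPTABLE_YEAR):
    """Return a copy of df with two boolean data-quality columns added."""
    flagged = df.copy()
    flagged["implausible_population"] = flagged["Population"] > max_population
    flagged["stale_population_year"] = flagged["Population Year"] < oldest_year
    return flagged

flagged = flag_population_issues(sample)
print(flagged[["Organization", "implausible_population", "stale_population_year"]])
```

Rows with either flag set could be excluded or reviewed by hand before population is used as a weight in the charts.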

# Cities Response Overview

In [None]:
cities_response.head()

In [None]:
cities_response.replace(['NaN'],['None'], inplace=True)

In the cities data, Latin America has the largest population in all three years. Why is that, given that the ten most populous countries in the world are
China, India, USA, Indonesia, Pakistan, Brazil, Nigeria, Bangladesh, Russia and Mexico?

The irregularity has two causes: few cities from the most populous countries report, and many cities report a population from a census taken long ago, which makes the figures hard to compare. Latin American cities, in contrast, have been updating their population every year, and many cities from that region participate. Below I show the year-on-year participation of countries that have disclosed to CDP.

In [None]:
import IPython
url = 'https://data.cdp.net/dataset/bh/9wyq-236v/embed?width=800&height=600'
iframe = '<iframe src=' + url + ' width=800 height=650></iframe>'
IPython.display.HTML(iframe)

**The data above is for 2020**

**The pie chart above shows that many cities have not elaborated on their answers, which can lead to misunderstanding and confusion. The questionnaire should offer predefined answers, like multiple-choice questions.**

**We can also see that most responses came from the Latin America region and were written in the local language, which makes it hard to get a clear picture.**

Note: the NaN values in the column should be changed to 'None' for the visualization below, because they cause problems there.
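
One thing to watch when making that change: replacing the string `'NaN'` is not the same as filling real missing values. The sketch below shows the difference on a made-up three-row frame; whether the real data holds the literal text or true missing values is not something this example can tell.

```python
import numpy as np
import pandas as pd

# Made-up column holding a real missing value (np.nan) and the literal text 'NaN'.
sample = pd.DataFrame({"Response Answer": ["Yes", np.nan, "NaN"]})

# replace() with the string 'NaN' only touches cells that contain that exact text.
replaced = sample.replace(["NaN"], ["None"])

# fillna() targets real missing values and leaves the literal text alone.
filled = sample.fillna("None")

print(replaced["Response Answer"].tolist())
print(filled["Response Answer"].tolist())
```

If both kinds of value occur, the two calls can be chained.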

In [None]:
url = 'https://data.cdp.net/dataset/Response-answer/wrf6-9nbk/embed?width=800&height=600'
iframe = '<iframe src=' + url + ' width=800 height=600></iframe>'
IPython.display.HTML(iframe)

In [None]:
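def extract_answer(df, question, newColumnName, condition=True):
    '''
    Extract the answers to one question.
    @df: (DataFrame) responses with 'Question Number', 'Account Number' and 'Response Answer' columns
    @question: (string) question number, e.g. '2.0c'
    @newColumnName: (string) name given to the 'Response Answer' column in the result
    @condition: (boolean mask) extra filter for questions that have several output columns
    @return: DataFrame with 'Account Number' and the renamed answer column
    '''
    mask = (df['Question Number'] == question) & condition
    # copy so that callers can modify the result without touching df
    result_df = df.loc[mask, ['Account Number', 'Response Answer']].copy()
    return result_df.rename(columns={'Response Answer': newColumnName})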

**Points of Cities Responses**
1. The responses are interesting: most strategy-related questions were answered with "Question not applicable", so we cannot tell how a city wants to strategize.
2. The responses have not been elaborated. The comments are 99% null, which is a bad sign.
3. Cities should be asked to respond in English, or the answers should be translated into English for better understanding.

Below we look at the responses to the 2020 cities questionnaire.

**Cities Responses by questions**


1. Projects in which money can be invested to pull the city out of a recession

In [None]:
cities_response_20=cities_response[cities_response["Year Reported to CDP"] == 2020]

In [None]:
cities_2020_admin_boundary = extract_answer(cities_response_20,'2.0c','Reason', condition = (cities_response_20['Column Name'] == 'Reason'))
cities_2020_admin_boundary['Reason'] = cities_2020_admin_boundary['Reason'].replace('^(Other).*', 'Other', regex=True)
cities_2020_admin_boundary['Reason'].value_counts().plot.pie(textprops={'color':"w"},pctdistance=0.7,autopct='%.2f%%',figsize=(6,6),colors=colors, labels=None)
plt.title("Reason for not having a climate assessment",fontsize=17,ha='left')
plt.legend(labels=cities_2020_admin_boundary['Reason'].value_counts().index, loc="best",bbox_to_anchor=(1, 0.25, 0.5, 0.5))
plt.show()
cities_2020_admin_boundary = extract_answer(cities_response_20,'3.2b','Reason', condition = (cities_response_20['Column Name'] == 'Reason'))
cities_2020_admin_boundary['Reason'] = cities_2020_admin_boundary['Reason'].replace('^(Other).*', 'Other', regex=True)
cities_2020_admin_boundary['Reason'].value_counts().plot.pie(textprops={'color':"w"},pctdistance=0.7,autopct='%.2f%%',figsize=(6,6),colors=colors, labels=None)
plt.title("Reason for having no published plan for climate change ",fontsize=17,ha='left')
plt.legend(labels=cities_2020_admin_boundary['Reason'].value_counts().index, loc="best",bbox_to_anchor=(1, 0.25, 0.5, 0.5))
plt.show()

In the climate assessment responses, many cities answered "Question not applicable", and the second most common response is lack of resources/funding. This suggests that cities with more revenue can put more funding into climate work.
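
The ranking read off the pie chart can also be checked as a table. The sketch below counts the reasons for one question after collapsing every "Other..." free-text answer into a single category, as the cells above do. The sample rows are made up (the "Other: ..." texts in particular) and `reason_counts` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for cities_response_20 (made-up rows).
sample = pd.DataFrame({
    "Question Number": ["2.0c"] * 5 + ["3.2b"],
    "Column Name": ["Reason"] * 6,
    "Response Answer": [
        "Question not applicable",
        "Question not applicable",
        "Lack of resources/funding",
        "Other: staff shortage",
        "Other: new administration",
        "Lack of resources/funding",
    ],
})

def reason_counts(df, question):
    """Count the 'Reason' answers for one question, merging all 'Other...' texts."""
    mask = (df["Question Number"] == question) & (df["Column Name"] == "Reason")
    reasons = df.loc[mask, "Response Answer"].str.replace(r"^Other.*", "Other", regex=True)
    return reasons.value_counts()

counts_2_0c = reason_counts(sample, "2.0c")
print(counts_2_0c)
```

Calling the helper once per question number would replace the repeated filtering lines in the pie chart cells.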

In [None]:
cities_2020_admin_boundary = extract_answer(cities_response_20,'4.12b','Reason', condition = (cities_response_20['Column Name'] == 'Reason'))
cities_2020_admin_boundary['Reason'] = cities_2020_admin_boundary['Reason'].replace('^(Other).*', 'Other', regex=True)
cities_2020_admin_boundary['Reason'].value_counts().plot.pie(textprops={'color':"w"},pctdistance=0.7,autopct='%.2f%%',figsize=(6,6),colors=colors, labels=None)
plt.title("Reason why city-wide emissions aren't verified",fontsize=17,ha='left')
plt.legend(labels=cities_2020_admin_boundary['Reason'].value_counts().index, loc="best",bbox_to_anchor=(1, 0.25, 0.5, 0.5))
plt.show()
cities_2020_admin_boundary = extract_answer(cities_response_20,'5.0e','Reason', condition = (cities_response_20['Column Name'] == 'Reason'))
cities_2020_admin_boundary['Reason'] = cities_2020_admin_boundary['Reason'].replace('^(Other).*', 'Other', regex=True)
cities_2020_admin_boundary['Reason'].value_counts().plot.pie(textprops={'color':"w"},pctdistance=0.7,autopct='%.2f%%',figsize=(6,6),colors=colors, labels=None)
plt.title("Reason for having no climate reduction target ",fontsize=17,ha='left')
plt.legend(labels=cities_2020_admin_boundary['Reason'].value_counts().index, loc="best",bbox_to_anchor=(1, 0.25, 0.5, 0.5))
plt.show()

Here, too, many cities do not take verification of their data seriously, which is a bad sign. The reasons given for having no published plan are also discouraging.

In [None]:
cities_2020_admin_boundary = extract_answer(cities_response_20,'5.5b','Reason', condition = (cities_response_20['Column Name'] == 'Reason'))
cities_2020_admin_boundary['Reason'] = cities_2020_admin_boundary['Reason'].replace('^(Other).*', 'Other', regex=True)
cities_2020_admin_boundary['Reason'].value_counts().plot.pie(textprops={'color':"w"},pctdistance=0.7,autopct='%.2f%%',figsize=(6,6),colors=colors, labels=None)
plt.title("Reason for not having mitigation plan ",fontsize=17,ha='left')
plt.legend(labels=cities_2020_admin_boundary['Reason'].value_counts().index, loc="best",bbox_to_anchor=(1, 0.25, 0.5, 0.5))
plt.show()

In [None]:
cities_2020_admin_boundary = extract_answer(cities_response_20,'8.0b','Reason', condition = (cities_response_20['Column Name'] == 'Reason'))
cities_2020_admin_boundary['Reason'] = cities_2020_admin_boundary['Reason'].replace('^(Other).*', 'Other', regex=True)
cities_2020_admin_boundary['Reason'].value_counts().plot.pie(textprops={'color':"w"},pctdistance=0.7,autopct='%.2f%%',figsize=(6,6),colors=colors, labels=None)
plt.title("Reason for not having a renewable energy target ",fontsize=17,ha='left')
plt.legend(labels=cities_2020_admin_boundary['Reason'].value_counts().index, loc="best",bbox_to_anchor=(1, 0.25, 0.5, 0.5))
plt.show()
cities_2020_admin_boundary = extract_answer(cities_response_20,'8.5b','Reason', condition = (cities_response_20['Column Name'] == 'Reason'))
cities_2020_admin_boundary['Reason'] = cities_2020_admin_boundary['Reason'].replace('^(Other).*', 'Other', regex=True)
cities_2020_admin_boundary['Reason'].value_counts().plot.pie(textprops={'color':"w"},pctdistance=0.7,autopct='%.2f%%',figsize=(6,6),colors=colors, labels=None)
plt.title("Reason for having no energy efficiency target ",fontsize=17,ha='left')
plt.legend(labels=cities_2020_admin_boundary['Reason'].value_counts().index, loc="best",bbox_to_anchor=(1, 0.25, 0.5, 0.5))
plt.show()

In the mitigation, climate and renewable energy pie charts we can see that many cities' projects are still in development. Comparing the year-on-year response answers per organization would show how many cities are really taking action.
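
A sketch of such a year-on-year comparison: for each account, count in how many reporting years the answer mentions "in development". The rows and the exact answer wording are made up for illustration; the real answers may be phrased differently, so the search text is an assumption.

```python
import pandas as pd

# Toy stand-in for the responses to one question across three years (made-up rows).
sample = pd.DataFrame({
    "Account Number": [1, 1, 1, 2, 2, 2],
    "Year Reported to CDP": [2018, 2019, 2020, 2018, 2019, 2020],
    "Response Answer": [
        "Target is in development", "Target is in development", "Target is in development",
        "Target is in development", "Yes", "Yes",
    ],
})

# Rows whose answer mentions "in development" (missing answers count as no match).
in_development = sample["Response Answer"].str.contains("in development", case=False, na=False)

# Number of distinct years each account gave such an answer.
years_in_development = (
    sample[in_development]
    .groupby("Account Number")["Year Reported to CDP"]
    .nunique()
)

# Accounts that gave the same "in development" answer in all three years.
stuck_accounts = years_in_development[years_in_development == 3].index.tolist()
print(stuck_accounts)
```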

In [None]:
cities_2020_admin_boundary = extract_answer(cities_response_20,'14.2b','Reason', condition = (cities_response_20['Column Name'] == 'Reason'))
cities_2020_admin_boundary['Reason'] = cities_2020_admin_boundary['Reason'].replace('^(Other).*', 'Other', regex=True)
cities_2020_admin_boundary['Reason'].value_counts().plot.pie(textprops={'color':"w"},pctdistance=0.7,autopct='%.2f%%',figsize=(6,6),colors=colors, labels=None)
plt.title("Why cities don't consider themselves exposed to a substantive water risk",fontsize=17,ha='left')
plt.legend(labels=cities_2020_admin_boundary['Reason'].value_counts().index, loc="best",bbox_to_anchor=(1, 0.25, 0.5, 0.5))
plt.show()
cities_2020_admin_boundary = extract_answer(cities_response_20,'14.4b','Reason', condition = (cities_response_20['Column Name'] == 'Reason'))
cities_2020_admin_boundary['Reason'] = cities_2020_admin_boundary['Reason'].replace('^(Other).*', 'Other', regex=True)
cities_2020_admin_boundary['Reason'].value_counts().plot.pie(textprops={'color':"w"},pctdistance=0.7,autopct='%.2f%%',figsize=(6,6),colors=colors, labels=None)
plt.title("Reason for not having a water resource strategy ",fontsize=17,ha='left')
plt.legend(labels=cities_2020_admin_boundary['Reason'].value_counts().index, loc="best",bbox_to_anchor=(1, 0.25, 0.5, 0.5))
plt.show()

Now we are going to look into opportunities

In [None]:
cities_2020_admin_boundary = extract_answer(cities_response_20,'0.1','Administrative boundary', condition = (cities_response_20['Column Name'] == 'Administrative boundary'))
cities_2020_admin_boundary['Administrative boundary'] = cities_2020_admin_boundary['Administrative boundary'].replace('^(Other).*', 'Other', regex=True)
cities_2020_admin_boundary['Administrative boundary'].value_counts().plot.pie(textprops={'color':"w"},pctdistance=0.7,autopct='%.2f%%',figsize=(6,6),colors=colors, labels=None)
plt.title("Administrative Boundary Distribution ",fontsize=17,ha='left')
plt.legend(labels=cities_2020_admin_boundary['Administrative boundary'].value_counts().index, loc="best",bbox_to_anchor=(1, 0.25, 0.5, 0.5))
plt.show()

In [None]:
worldcities = pd.read_csv('/kaggle/input/world-cities/worldcities.csv').rename(columns={'country':'Country'})
# map worldcities country names to the names used in the CDP data
country_name_map = {
    'Bolivia': 'Bolivia (Plurinational State of)',
    'Hong Kong': 'China, Hong Kong Special Administrative Region',
    'Côte D’Ivoire': "Côte d'Ivoire",
    'Congo (Kinshasa)': 'Democratic Republic of the Congo',
    'Korea, South': 'Republic of Korea',
    'Moldova': 'Republic of Moldova',
    'Russia': 'Russian Federation',
    'West Bank': 'State of Palestine',
    'Taiwan': 'Taiwan, Greater China',
    'United Kingdom': 'United Kingdom of Great Britain and Northern Ireland',
    'Tanzania': 'United Republic of Tanzania',
    'United States': 'United States of America',
    'Venezuela': 'Venezuela (Bolivarian Republic of)',
    'Vietnam': 'Viet Nam',
}
worldcities['Country'] = worldcities['Country'].replace(country_name_map)

worldcities = worldcities.merge(cities_response[['Country','CDP Region']].drop_duplicates(),on='Country',how='left')
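
The country name replacements above look hand-built. As a sketch of how candidate pairs could be suggested automatically, the example below scores name similarity with `SequenceMatcher`, which the notebook already imports. The `suggest_match` helper and the short candidate list are illustrative assumptions.

```python
from difflib import SequenceMatcher

def suggest_match(name, candidates):
    """Return (best_candidate, similarity) for name, comparing case-insensitively."""
    scored = [
        (candidate, SequenceMatcher(None, name.lower(), candidate.lower()).ratio())
        for candidate in candidates
    ]
    return max(scored, key=lambda pair: pair[1])

# A few CDP-style country names used as the candidate pool (illustrative subset).
cdp_names = ["Viet Nam", "Russian Federation", "Republic of Korea"]

for name in ["Vietnam", "Russia"]:
    print(name, "->", suggest_match(name, cdp_names))
```

Low-scoring suggestions still need a manual check, since names that are worded very differently will not score well.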

In [None]:
import plotly.express as px 
import numpy as np
fig = px.sunburst(cities_disclosing_2020, path=['CDP Region','Country','Organization'], values='Population',
                  color='Population', hover_data=['Organization'],
                  color_continuous_scale='RdBu',
                  color_continuous_midpoint=np.average(cities_disclosing_2020['Population'], weights=cities_disclosing_2020['Population']))
fig.show()

**Suggestions for the CDP team**
1. Verify the data and translate local languages into English.
2. Look at the year-on-year charts. For example, many cities answered that their target was still in development in all three years.
3. Many questions received free-text responses unique to one city: instead of elaborating in the comments, cities elaborated in the response answer itself, which makes the answers hard to compare.
4. Add more columns, such as city GDP, per capita income, city revenue and city spending.
5. Also add a political stability score, because in less developed cities the authorities often try to take advantage of money that was granted for a project.
6. For climate and water security, develop a score showing which countries have stricter rules, and whether those rules actually help.
7. I love the initiative of holding an analytics competition. I learned a lot through it, and I love your company's agenda too.