# Building Data-Driven Models to Identify the Key Socio-Environmental Hazards of Today

For this project I decided to focus on the question: *How do you help cities adapt to a rapidly changing climate amidst a global pandemic, but do it in a way that is socially equitable?*

I built the following KPIs to measure how climate hazards affected cities in 2020, with a particular focus on environmental and social impacts. These KPIs help identify which climate hazards to tackle first and where:

**KPIs to measure hazards:**

* Expected Environmental Impact

* Expected Social Impact

**KPIs to measure cities:**

* Climate Hazard Count

* Total Expected Environmental Impact

* Total Expected Social Impact

The first set of KPIs helps cities identify which of the many climate hazards they face call for the most urgent action.

The second set helps national and regional governments identify the cities that are most at risk and therefore most in need of funding or aid.




# Deep dive into KPIs



**Expected Environmental Impact**

This KPI is calculated using the formula

*Expected_Environmental_Impact = Magnitude_of_Hazard x Probability_of_Hazard*

where the magnitude and probability of each hazard are taken from the 2020 CDP Cities disclosure survey.
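
As a minimal sketch of this formula, the snippet below scores a single hazard. It assumes the survey's ordinal answers (Low through High) are mapped onto fifths, as the notebook code does further down; `LEVEL_TO_NUM` and `expected_environmental_impact` are hypothetical names used only for illustration.

```python
# Ordinal survey answers mapped onto a 0-1 scale
# (assumed mapping: Low = 1/5 ... High = 5/5).
LEVEL_TO_NUM = {
    "Low": 1 / 5,
    "Medium Low": 2 / 5,
    "Medium": 3 / 5,
    "Medium High": 4 / 5,
    "High": 5 / 5,
}


def expected_environmental_impact(magnitude, probability):
    """Return magnitude x probability on the 0-1 scale (hypothetical helper)."""
    return LEVEL_TO_NUM[magnitude] * LEVEL_TO_NUM[probability]


# A hazard rated High in magnitude and Medium in probability.
score = expected_environmental_impact("High", "Medium")
print(round(score, 2))
```

Because both inputs lie between 0.2 and 1, the score of a rated hazard lies between 0.04 and 1.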

**Expected Social Impact**

Because social impact data is also provided for each hazard, we can build an Expected Social Impact KPI that weighs not only the social impacts a climate hazard generates but also the vulnerable populations it affects. Expected Social Impact is calculated as:

*Expected_Social_Impact = Social_Impact_Factor x Vulnerable_Population_Factor*

where the social impact factor is the number of social impacts selected for a hazard divided by the number of social impact categories available. Similarly, the vulnerable population factor is the number of vulnerable populations selected for a hazard divided by the number of vulnerable population categories available. The code below uses 12 social impact categories and 11 vulnerable population categories.
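
The two factors can be sketched for a single hazard as follows. The category totals (12 and 11) are taken from the divisors used later in this notebook; the function name and the example counts are hypothetical.

```python
# Category totals taken from the divisors used later in this notebook.
N_SOCIAL_IMPACT_CATEGORIES = 12
N_VULNERABLE_POPULATION_CATEGORIES = 11


def expected_social_impact(n_social_impacts, n_vulnerable_populations):
    """Product of the two selection shares for one hazard (hypothetical helper)."""
    social_impact_factor = n_social_impacts / N_SOCIAL_IMPACT_CATEGORIES
    vulnerable_population_factor = (
        n_vulnerable_populations / N_VULNERABLE_POPULATION_CATEGORIES
    )
    return social_impact_factor * vulnerable_population_factor


# A hazard with 3 selected social impacts and 2 selected vulnerable populations.
print(round(expected_social_impact(3, 2), 4))
```

A hazard reaches the maximum score of 1 only if every social impact and every vulnerable population category is selected for it.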

**Climate Hazard Count:**

This is a simple KPI that counts the climate hazards each city reported in a given year. It allows national or regional governments to identify the areas facing the most hazards, which may therefore need more aid.

**Total Expected Environmental Impact**

Because the CDP Cities data includes probability and magnitude estimates for each climate hazard, we can develop a second KPI that is more complex but also more robust. The Total Expected Environmental Impact KPI considers not just how many hazards a city reports but also the magnitude and probability of each one. For each city and year, it is calculated as follows:

*Total_Environmental_Impact = Expected_Environmental_Impact (1st Hazard) + Expected_Environmental_Impact (2nd Hazard) + ... + Expected_Environmental_Impact ("n"-th Hazard),*

where "n" is the climate hazard count.

**Total Expected Social Impact**

We can similarly sum the expected social impacts of a city's hazards to get a Total Expected Social Impact KPI using the formula

*Total_Social_Impact = Expected_Social_Impact (1st Hazard) + Expected_Social_Impact (2nd Hazard) + ... + Expected_Social_Impact ("n"-th Hazard),*

where "n" is the climate hazard count.
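
The three city-level KPIs can be sketched as a single pandas aggregation over per-hazard rows. The cities, hazards, and scores below are made up for illustration and are not CDP data; the column names are also assumptions.

```python
import pandas as pd

# Hypothetical per-hazard scores for two made-up cities (illustrative values only).
hazards = pd.DataFrame({
    "City": ["A", "A", "A", "B"],
    "Climate Hazard": ["Flood", "Heat wave", "Drought", "Flood"],
    "Expected_Environmental_Impact": [0.6, 0.4, 0.2, 1.0],
    "Expected_Social_Impact": [0.05, 0.10, 0.00, 0.25],
})

# One row per city: hazard count plus the two per-hazard scores summed.
city_kpis = hazards.groupby("City").agg(
    Climate_Hazard_Count=("Climate Hazard", "nunique"),
    Total_Environmental_Impact=("Expected_Environmental_Impact", "sum"),
    Total_Social_Impact=("Expected_Social_Impact", "sum"),
).reset_index()

print(city_kpis)
```

Since the totals are plain sums, a city with many low-scoring hazards can reach the same total as a city with a few severe ones, which is why the hazard count is reported alongside them.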


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
pd.set_option('display.max_columns', None)

import os



In [None]:

# importing cities response df
cities_df = pd.read_csv("../input/cdp-unlocking-climate-solutions/Cities/Cities Responses/2020_Full_Cities_Dataset.csv")


In [None]:

# Question 2.1 holds the climate hazard answers; each answer column is pulled out separately
# and later joined back on 'Row Number' and 'Organization'.

# Column 5: social impacts of the hazard
social_impacts=cities_df[((cities_df['Question Number'] == '2.1')&(cities_df['Column Number'] == 5))]
social_impacts=social_impacts.rename(columns={"Response Answer":"Social Impact"})

# Column 7: vulnerable populations affected by the hazard
vulnerables=cities_df[((cities_df['Question Number'] == '2.1')&(cities_df['Column Number'] == 7))]
vulnerables=vulnerables.rename(columns={"Response Answer":"Vulnerable Population"})

# Column 3: current probability of the hazard
hazards_prob=cities_df[((cities_df['Question Number'] == '2.1')&(cities_df['Column Number'] == 3))]
hazards_prob=hazards_prob.rename(columns={"Response Answer":"Current Probability of Hazard"})

# Column 4: current magnitude of the hazard
hazards_mag=cities_df[((cities_df['Question Number'] == '2.1')&(cities_df['Column Number'] == 4))]
hazards_mag=hazards_mag.rename(columns={"Response Answer":"Current Magnitude of Hazard"})

# Column 1: the climate hazard itself
climate_harzards=cities_df[((cities_df['Question Number'] == '2.1')&(cities_df['Column Number'] == 1))]
climate_harzards=climate_harzards.rename(columns={"Response Answer":"Climate Hazard"})


In [None]:
impacts_hazards = social_impacts.merge(climate_harzards, on=['Row Number','Organization'])

impacts_hazards_prob = impacts_hazards.merge(hazards_prob,  on=['Row Number','Organization'])

impacts_hazards_prob_mag = impacts_hazards_prob.merge(hazards_mag,  on=['Row Number','Organization'])

impacts_hazards_prob_mag_vul = impacts_hazards_prob_mag.merge(vulnerables, on=['Row Number','Organization'])

test=impacts_hazards_prob_mag_vul[impacts_hazards_prob_mag_vul['Organization']=='City of Berkeley']
a=test.groupby(['Organization','Climate Hazard']).agg({'Social Impact': 'nunique','Row Number': 'count','Vulnerable Population':'nunique'})
a.reset_index()


In [None]:
# Map the ordinal survey answers onto a 0-1 scale (Low = 1/5 ... High = 5/5).
# Any other answer becomes NaN.
level_to_num = {'Low': 1/5, 'Medium Low': 2/5, 'Medium': 3/5, 'Medium High': 4/5, 'High': 5/5}

impacts_hazards_prob_mag_vul['Mag_num'] = impacts_hazards_prob_mag_vul['Current Magnitude of Hazard'].map(level_to_num)
impacts_hazards_prob_mag_vul['Prob_num'] = impacts_hazards_prob_mag_vul['Current Probability of Hazard'].map(level_to_num)

#impacts_hazards_prob_mag_vul=impacts_hazards_prob_mag_vul2.rename(columns={"Response Answer_y": "Climate Hazard","Response Answer_x":"Social Impact"})


#impacts_hazards_prob_mag_vul[['Climate Hazard','Social Impact','Prob_num','Mag_num']]
#test=impacts_hazards_prob_mag_vul

summary=impacts_hazards_prob_mag_vul.groupby(['Country','CDP Region','Organization','Climate Hazard']).agg({'Social Impact': 'nunique','Vulnerable Population': 'nunique','Prob_num': 'max','Mag_num': 'max','Row Number': 'count'})
summary=summary.sort_values(['Prob_num','Social Impact'], ascending=[False,False]).reset_index()
summary[summary['Organization']=='City of Berkeley']

#cities_df[((cities_df['Question Number'] == '2.1')&(cities_df['Column Number'] ==1)&(cities_df['Organization']=='Comune di Milano'))]


In [None]:
summary[summary['Organization']=='City of Berkeley']


In [None]:
summary[summary['Organization']=='Comune di Milano']


In [None]:
#showing for US
cities_details = pd.read_csv("../input/cdp-unlocking-climate-solutions/Cities/Cities Disclosing/2020_Cities_Disclosing_to_CDP.csv")

cities_details=cities_details.rename(columns={"Country":"Country2","CDP Region":"CDP Region2"})

impacts_hazards_prob_mag_vul_regional=impacts_hazards_prob_mag_vul.merge(cities_details, on=['Account Number'])



us_summary=impacts_hazards_prob_mag_vul_regional[impacts_hazards_prob_mag_vul_regional['Country2']=='United States of America']

us_summary[us_summary['Organization_x']=='City of Miramar'].head(30)


#impacts_hazards_prob_mag_vul  34601


In [None]:
us=impacts_hazards_prob_mag_vul_regional[impacts_hazards_prob_mag_vul_regional['Country2']=='United States of America']

us_pre=us[['Country','CDP Region2','Organization_y','Climate Hazard','Mag_num','Prob_num','Vulnerable Population','Social Impact','Population']].groupby(['Country','CDP Region2','Organization_y','Climate Hazard','Mag_num','Prob_num','Population']).agg({'Vulnerable Population': 'nunique','Social Impact': 'nunique'}).reset_index()
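# Expected Environmental Impact = magnitude x probability, both on the 0-1 scale
us_pre['Expected_Environmental_Impact']=us_pre['Mag_num']*us_pre['Prob_num']

# Share of the 11 available vulnerable population categories selected for the hazard
us_pre['Vulnerable Population Factor']=us_pre['Vulnerable Population']/11

# Share of the 12 available social impact categories selected for the hazard
us_pre['Social Impact Factor']=us_pre['Social Impact']/12

# Expected Social Impact (stored as Expected_Social_Strain) = product of the two factors
us_pre['Expected_Social_Strain']=us_pre['Social Impact Factor']*us_pre['Vulnerable Population Factor']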

us_summary=us_pre.groupby(['Country','CDP Region2','Organization_y','Population']).agg({'Climate Hazard': 'nunique','Mag_num': 'mean','Prob_num': 'mean','Vulnerable Population':'mean','Social Impact':'mean','Expected_Environmental_Impact':'sum','Expected_Social_Strain':'sum'}).reset_index()
top20=us_summary.sort_values(['Climate Hazard'], ascending=[False]).head(20)
topAll=us_summary.sort_values(['Climate Hazard'], ascending=[False])

top20_exp_imp=us_summary.sort_values(['Expected_Environmental_Impact'], ascending=[False]).head(20)
topAll_exp_imp=us_summary.sort_values(['Expected_Environmental_Impact'], ascending=[False])
top20_exp_imp

top20_social=us_summary.sort_values(['Expected_Social_Strain'], ascending=[False]).head(20)


#top20_exp_imp
us_pre[us_pre['Organization_y']=='City of Cincinnati']


us_pre[us_pre['Organization_y']=='City of Cincinnati'].sort_values(by='Expected_Social_Strain', ascending=False)
#us_summary
#us
#us_summary

us[us['Organization_x']=='City of Miramar']
topAll
#us_pre=us[['Country','CDP Region2','Organization_y','Climate Hazard','Mag_num','Prob_num','Vulnerable Population','Social Impact','Population']].groupby(['Country','CDP Region2','Organization_y','Climate Hazard','Mag_num','Prob_num']).agg({'Vulnerable Population': 'nunique','Social Impact': 'nunique'}).reset_index()



In [None]:
#us_pre=us[['Country','CDP Region2','Organization_y','Climate Hazard','Mag_num','Prob_num','Vulnerable Population','Social Impact','Population']].groupby(['Country','CDP Region2','Organization_y','Climate Hazard','Mag_num','Prob_num']).agg({'Vulnerable Population': 'nunique','Social Impact': 'nunique'}).reset_index()

us_prex=us[['Organization_y','Population']].groupby(['Organization_y']).max().reset_index()

us_prex.sort_values(by='Population')
#us.sort_values(by='Organization_x')

In [None]:
# plotting libs
!pip install seaborn==0.11.0
import seaborn as sns
import matplotlib.pyplot as plt
print(sns.__version__)

# Analysis and Insights

Using the Climate Hazard Count KPI to look at all the US cities that reported in 2020, the chart below shows that most cities reported between 3 and 7 climate hazards, while a few reported more than 10. These cities might need more attention or help from the national government.

In [None]:
plot1=sns.histplot(data=topAll, x="Climate Hazard", binwidth = 1)
plot1.set(ylabel="Number of Cities", xlabel = "Climate Hazard Count")
plt.show() #.xticks([0, 5, 10, 20])

Ranking the cities by Climate Hazard Count, we see that the top contenders are Highland Park, Urbana, and Boynton Beach, each identifying 14 or more climate hazards.

In [None]:


plot2 = sns.barplot(x="Organization_y", y="Climate Hazard",hue='Country', data=top20)

plot2.set_xticklabels(plot2.get_xticklabels(), rotation=45, horizontalalignment='right')
plot2.set(xlabel="City", ylabel = "Climate Hazard Count")

plt.show()

We can then use the Expected Environmental Impact KPI to check out which are the most urgent and impactful issues in Boynton Beach. It appears that it's facing major issues related to flood and sea level rise. This makes sense; the city is called Boynton *Beach* after all. Storm and wind related climate hazards follow in terms of Expected Environmental Impact.

In [None]:
df=us_pre[['Organization_y','Climate Hazard','Expected_Environmental_Impact']][us_pre['Organization_y']=='City of Boynton Beach'].sort_values(by='Expected_Environmental_Impact', ascending=False).reset_index()

df=df[['Organization_y','Climate Hazard','Expected_Environmental_Impact']]

cm = sns.light_palette("red", as_cmap=True)

s = df.rename(columns={"Expected_Environmental_Impact":"Expected Environmental Impact","Organization_y":"City"}).style.background_gradient(cmap=cm)

s
#social_impacts=social_impacts.rename(columns={"Response Answer":"Social Impact"})



We can also show the distribution of cities by Total Expected Social Impact. Most cities fall between 0 and 0.5, because many cities cited only one or two social impacts or vulnerable populations per hazard, which keeps each hazard's score small.
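
A quick back-of-the-envelope check shows why such cities stay below 0.5. It uses the divisors from this notebook (12 social impact and 11 vulnerable population categories); the hazard count of 14 is a hypothetical value chosen for illustration.

```python
# Largest per-hazard score when at most two social impacts and
# two vulnerable populations are selected (divisors from this notebook).
per_hazard_max = (2 / 12) * (2 / 11)

# Hypothetical city reporting 14 hazards, each at that per-hazard maximum.
hazard_count = 14
city_total = hazard_count * per_hazard_max

print(round(per_hazard_max, 4), round(city_total, 3))
```

Even with 14 hazards, such a city's total stays under 0.5, so only cities that select many categories per hazard land in the right tail of the histogram.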

In [None]:
plot1b=sns.histplot(data=topAll, x="Expected_Social_Strain", binwidth = .5)
plot1b.set(ylabel="Number of Cities", xlabel = "Total Expected Social Impact")
plt.show() #.xticks([0, 5, 10, 20])

If we rank the cities from highest to lowest Total Expected Social Impact, we see below that Medford, Tempe, and Iowa cited the highest Total Expected Social Impact.

In [None]:
top20_social


plot2b = sns.barplot(x="Organization_y", y="Expected_Social_Strain",hue='Country', data=top20_social)

plot2b.set_xticklabels(plot2b.get_xticklabels(), rotation=45, horizontalalignment='right')
plot2b.set(xlabel="City", ylabel = "Total Expected Social Impact")

plt.show()


Below is a breakdown of the Expected Social Impact of each hazard in Medford.

In [None]:
df=us_pre[['Organization_y','Climate Hazard','Expected_Social_Strain']][us_pre['Organization_y']=='City of Medford'].sort_values(by='Expected_Social_Strain', ascending=False).reset_index()

df=df[['Organization_y','Climate Hazard','Expected_Social_Strain']]

cm = sns.light_palette("red", as_cmap=True)

s = df.rename(columns={"Expected_Social_Strain":"Expected Social Impact","Organization_y":"City"}).style.background_gradient(cmap=cm)

s

Let's check out the social impacts of "Extreme Precipitation > Rain storm" in Medford.

In [None]:
us_medford=us[['Organization_y','Climate Hazard','Social Impact']][((us['Organization_y']=='City of Medford')&(us['Climate Hazard']=='Extreme Precipitation > Rain storm'))].drop_duplicates()
us_medford['Len']=us_medford['Social Impact'].str.len()

us_medford.sort_values('Len', ascending=False, inplace=True)
us_medford_short=us_medford[['Organization_y','Climate Hazard','Social Impact']].reset_index()
us_medford_short.rename(columns={"Organization_y":"City"}).style

We can also check out the list of vulnerable populations affected by this climate hazard.

In [None]:
us_medford=us[['Organization_y','Climate Hazard','Vulnerable Population']][((us['Organization_y']=='City of Medford')&(us['Climate Hazard']=='Extreme Precipitation > Rain storm'))].drop_duplicates()
us_medford['Len']=us_medford['Vulnerable Population'].str.len()

us_medford.sort_values('Len', ascending=False, inplace=True)
us_medford_short=us_medford[['Organization_y','Climate Hazard','Vulnerable Population']].reset_index()
us_medford_short.rename(columns={"Organization_y":"City"}).style

Lastly, we can plot Total Expected Environmental Impact against Total Expected Social Impact to see where each city lies on both dimensions. Scaling each dot by city population shows where the highly populated cities fall on the environmental and social impact scales.

The scatterplot below also shows a positive association between environmental impact and social impact. This suggests that as climate hazards intensify, the social impacts, especially on vulnerable populations, worsen as well. Note that both totals are sums over a city's hazards, so part of this association may simply reflect the number of hazards reported.
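
One way to put a number on this relationship is a correlation coefficient. The sketch below uses made-up city totals, not CDP data; in this notebook the same calls would be applied to the `Expected_Environmental_Impact` and `Expected_Social_Strain` columns of `topAll`.

```python
import pandas as pd

# Hypothetical city-level totals (illustrative values only, not CDP data).
cities = pd.DataFrame({
    "Expected_Environmental_Impact": [0.5, 1.2, 2.0, 3.1, 4.4],
    "Expected_Social_Strain": [0.02, 0.10, 0.08, 0.30, 0.45],
})

env = cities["Expected_Environmental_Impact"]
soc = cities["Expected_Social_Strain"]

# Pearson measures linear association between the two totals.
pearson = env.corr(soc)

# Spearman is the Pearson correlation of the ranks; it only assumes
# a monotonic relationship and is less sensitive to outlier cities.
spearman = env.rank().corr(soc.rank())

print(f"Pearson r = {pearson:.2f}, Spearman rho = {spearman:.2f}")
```

A coefficient near 1 would support the visual impression from the scatterplot, though it would not by itself show that one impact drives the other.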

In [None]:
sns.scatterplot(data=topAll, x="Expected_Environmental_Impact", y="Expected_Social_Strain",size='Population',sizes=(50, 500),alpha=.5)


plt.xlabel("Total Expected Environmental Impact")
plt.ylabel("Total Expected Social Impact")
plt.show()

#topAll.a = df.a.astype(float)

#topAll
#useful for cohorting cities to address
#highlights that social impact is not yet fully realized harder to measure. 