# Measuring the Progress Toward Preventing Climate Change

Climate change is an ongoing and growing problem in today's world. Studies indicate that it is most likely caused by human (anthropogenic) activity. Data science can be a useful tool here: it can surface insights from data and point to ways of preventing climate change.

In this notebook, we will investigate three sets of publicly available survey responses: 1) corporate climate change disclosures, 2) corporate water security disclosures, and 3) disclosures from cities.

First, we will read the city responses and combine them into one DataFrame.

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly

## Cleaning and Exploratory Data Analysis (EDA)

In [None]:
# See first 5 rows of cities 2020 data
cities_2020 = pd.read_csv("../input/cdp-unlocking-climate-solutions/Cities/Cities Responses/2020_Full_Cities_Dataset.csv")
cities_2020.head()

In [None]:
# Read and append all city responses data into one DataFrame
cities_2019 = pd.read_csv("../input/cdp-unlocking-climate-solutions/Cities/Cities Responses/2019_Full_Cities_Dataset.csv")
cities_2018 = pd.read_csv("../input/cdp-unlocking-climate-solutions/Cities/Cities Responses/2018_Full_Cities_Dataset.csv")

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the three years the same way
cities = pd.concat([cities_2020, cities_2019, cities_2018])
cities.head()

In [None]:
# See info about data
cities.info()

The output shows that several string columns are incomplete. It also shows that many rows (one per question asked) have no response. We will filter out any rows without responses.

In [None]:
# Filter out questions with no responses and fill other rows with "No response"
# Take an explicit copy so the fillna assignment below does not write to a slice of the original
cities = cities[cities['Response Answer'].notna()].copy()
text_cols = ['Parent Section', 'Column Name', 'Row Name', 'Comments', 'File Name']
cities[text_cols] = cities[text_cols].fillna("No response")
cities.info()

Now, we will investigate the questions that were asked and responses to see which ones could be useful for defining KPIs.

In [None]:
cities_questions = cities['Question Name'].unique()
print(cities_questions)

In [None]:
cities_responses = cities['Response Answer'].unique()
print(cities_responses)

Let us count the responses per question number to see which questions were answered most often. We can then use this to define our KPI, since we want the KPI to be representative of the views and perspectives of the sample.

In [None]:
cities_no_questions = cities.groupby('Question Number').count()

# We will visualize the top 10 questions as a bar chart
cities_no_questions = cities_no_questions.sort_values('Response Answer', ascending=False)
cities_no_questions

fig, ax = plt.subplots(figsize=(15, 10))
plt.bar(cities_no_questions.index[:10], cities_no_questions['Response Answer'].iloc[:10])
plt.ylabel("Number of Responses", fontsize=18)
plt.xlabel("Question Number", fontsize=18)
plt.title("Number of Responses for Top 10 Question Numbers", fontsize=24)
plt.show()

We can see that the top 5 questions have over 1,500 responses. Let us see what questions these question numbers correspond to.

In [None]:
top_4_qs = cities.loc[cities['Question Number'].isin(['2.1', '5.4', '2.2a', '3.0']), ['Question Number', 'Question Name']].copy()
top_4_qs = top_4_qs.drop_duplicates()
top_4_qs

In the code cell output, we see that there is not a one-to-one relationship between the Question Number and Question Name.
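One way to confirm this programmatically is to count the distinct question names per question number. The sketch below uses a small, made-up DataFrame (`toy`, with placeholder question names) in place of `cities`, so the values are illustrative only; the same `groupby` could be applied to the real data.

```python
import pandas as pd

# Hypothetical stand-in for `cities`; the question names are placeholders.
toy = pd.DataFrame({
    "Question Number": ["2.1", "2.1", "3.0", "3.0"],
    "Question Name": ["Wording A", "Wording B", "Wording C", "Wording C"],
})

# Number of distinct question names attached to each question number
names_per_number = toy.groupby("Question Number")["Question Name"].nunique()

# Question numbers that map to more than one question name
ambiguous = names_per_number[names_per_number > 1]
print(ambiguous)
```

Any question number that appears in `ambiguous` has more than one name attached to it, so it would need to be disambiguated (for example by file or year) before being used in a KPI.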

## Defining KPIs

According to Klipfolio (https://www.klipfolio.com/resources/articles/what-is-a-key-performance-indicator#:~:text=Key%20Performance%20Indicator%20(KPI)%20Definition,their%20success%20at%20reaching%20targets.), KPIs are measurable values that demonstrate how effectively an organization is achieving key objectives.

We can define an environmental KPI with a maximum value of 100, composed of 4 categories: energy usage, water usage, waste generated, and greenhouse gases generated. A score of 100 means that a city or organization used a lot of resources or generated a lot of waste and greenhouse gases, so a score of 0 should be the target.
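As a rough sketch of how such a score could be assembled, the example below assumes each category has already been scaled to the range 0 to 1 (0 = best, 1 = worst) and that the 4 categories are weighted equally at 25 points each. Both the scaling and the equal weights are assumptions made for illustration, and the city names and values are invented.

```python
import pandas as pd

# Hypothetical per-city category scores, each assumed to be scaled to 0..1
# (0 = least resource use / emissions, 1 = most).
scores = pd.DataFrame(
    {
        "energy": [0.2, 0.8],
        "water": [0.4, 0.6],
        "waste": [0.0, 1.0],
        "ghg": [0.2, 0.8],
    },
    index=["City A", "City B"],
)

# Assumption: equal weights, so each category contributes at most 25 points.
weights = pd.Series({"energy": 25, "water": 25, "waste": 25, "ghg": 25})

# Multiply each column by its weight, then sum across categories per city.
kpi = scores.mul(weights, axis="columns").sum(axis=1)
print(kpi)
```

With these weights the KPI is bounded between 0 and 100, and lower is better, which matches the target of 0 described above.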

To quantify the responses, we can use natural language processing (NLP) to score each city and corporation based on their performance in the 4 categories.
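A very simple first step, well short of full NLP, is to route each free-text response to one or more of the 4 categories by keyword matching. The sketch below is only a stand-in for whatever NLP method is eventually chosen: the keyword lists in `CATEGORY_KEYWORDS` and the helper `tag_categories` are hypothetical and would need to be built from the actual survey text.

```python
import re

# Hypothetical keyword lists per category (illustrative, not derived from the data).
CATEGORY_KEYWORDS = {
    "energy": {"energy", "electricity", "renewable"},
    "water": {"water", "drought", "flood"},
    "waste": {"waste", "landfill", "recycling"},
    "ghg": {"emissions", "carbon", "greenhouse"},
}

def tag_categories(text):
    """Return the sorted list of categories whose keywords appear in the text."""
    # Lowercase the text and split it into alphabetic tokens
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    # Keep a category if at least one of its keywords is among the tokens
    return sorted(cat for cat, kws in CATEGORY_KEYWORDS.items() if tokens & kws)

print(tag_categories("Increased flood risk and higher carbon emissions"))
```

Once responses are grouped by category, each group could be scored separately and the results fed into the composite KPI.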