## **DESCRIPTION**

CDP is a global non-profit that drives companies and governments to reduce their greenhouse gas emissions, safeguard water resources, and protect forests. Each year, CDP takes the information supplied in its annual reporting process and scores companies and cities based on their journey through disclosure and towards environmental leadership.

CDP houses the world’s largest, most comprehensive dataset on environmental action. As the data grows to include thousands more companies and cities each year, there is increasing potential for the data to be utilized in impactful ways. Because of this potential, CDP is excited to launch an analytics challenge for the Kaggle community. Data scientists will scour environmental information provided to CDP by disclosing companies and cities, searching for solutions to our most pressing problems related to climate change, water security, deforestation, and social inequity.

How do you help cities adapt to a rapidly changing climate amidst a global pandemic, but do it in a way that is socially equitable?

What are the projects that can be invested in that will help pull cities out of a recession, mitigate climate issues, but not perpetuate racial/social inequities?

What are the practical and actionable points where city and corporate ambition join, i.e. where do cities have problems that corporations affected by those problems could solve, and vice versa?

How can we measure the intersection between environmental risks and social equity, as a contributor to resiliency?

## **PROBLEM STATEMENT**
Develop a methodology for calculating key performance indicators (KPIs) that relate to the environmental and social issues that are discussed in the CDP survey data. Leverage external data sources and thoroughly discuss the intersection between environmental issues and social issues. Mine information to create automated insight generation demonstrating whether city and corporate ambitions take these factors into account.

In [None]:
# Install TextBlob before importing it (notebook shell command).
!pip install textblob

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# Text analysis libraries
import nltk
from textblob import TextBlob
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer as vader

# Download only the resources used below instead of opening the interactive downloader.
nltk.download('stopwords')
nltk.download('vader_lexicon')
print('Libraries Imported')

In [None]:
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 10
matplotlib.rcParams['figure.facecolor'] = '#00000000'

## **CITIES:**

## **CITIES DISCLOSING**

In [None]:
c1=pd.read_csv('../input/cdp-unlocking-climate-solutions/Cities/Cities Disclosing/2018_Cities_Disclosing_to_CDP.csv')
c2=pd.read_csv('../input/cdp-unlocking-climate-solutions/Cities/Cities Disclosing/2019_Cities_Disclosing_to_CDP.csv')
c3=pd.read_csv('../input/cdp-unlocking-climate-solutions/Cities/Cities Disclosing/2020_Cities_Disclosing_to_CDP.csv')
c21=pd.read_csv('../input/cdp-unlocking-climate-solutions/Cities/Cities Responses/2018_Full_Cities_Dataset.csv')
c22=pd.read_csv('../input/cdp-unlocking-climate-solutions/Cities/Cities Responses/2019_Full_Cities_Dataset.csv')
c23=pd.read_csv('../input/cdp-unlocking-climate-solutions/Cities/Cities Responses/2020_Full_Cities_Dataset.csv')

In [None]:

c1

In [None]:
c2

In [None]:
c3

In [None]:
c21.head()

In [None]:
c22.head()

In [None]:
c23.head()

In [None]:
# Shorten long official country names so the plot labels stay readable.
country_names = {
    'United States of America': 'USA',
    'Republic of Korea': 'S.Korea',
    'United Kingdom of Great Britain and Northern Ireland': 'UK',
    'Bolivia (Plurinational State of)': 'Bolivia',
    'Taiwan, Greater China': 'Taiwan',
    'Democratic Republic of the Congo': 'DR Congo',
    'China, Hong Kong Special Administrative Region': 'Hong Kong',
    'United Republic of Tanzania': 'Tanzania',
    'Russian Federation': 'Russia',
    'United Arab Emirates': 'UAE',
    'Venezuela (Bolivarian Republic of)': 'Venezuela',
    'State of Palestine': 'Palestine',
}

# Apply the same mapping to all three yearly frames.
for df in (c1, c2, c3):
    df['Country'] = df['Country'].replace(country_names)

## **CDP OPERATION DISTRIBUTION BASED ON COUNTRY (YEARS 2018, 2019, 2020)**

In [None]:
plt.figure(figsize=(20,25))
plt.subplot(1,2,1)
sns.countplot(y=c1['Country'],order = c1['Country'].value_counts().index,palette='rainbow')
plt.ylabel('COUNTRY',fontsize=25)
plt.xlabel('COUNT',fontsize=30)
plt.title('2018', fontsize= 20);

plt.subplot(1,2,2)
sns.countplot(y=c2['Country'],order = c2['Country'].value_counts().index,palette='rainbow')
plt.ylabel('',fontsize=10)
plt.xlabel('COUNT',fontsize=30)
plt.title('2019',fontsize= 20);

In [None]:
plt.figure(figsize=(10,25))
sns.countplot(y=c3['Country'],order = c3['Country'].value_counts().index,palette='rainbow')
plt.ylabel('COUNTRY - 2020',fontsize=40)
plt.xlabel('COUNT',fontsize=30)

### **CDP OPERATION DISTRIBUTION BASED ON REGION (YEARS 2018, 2019, 2020)**

In [None]:
plt.figure(figsize=(30,30))
plt.subplot(3,1,1)
sns.countplot(y=c1['CDP Region'],order = c1['CDP Region'].value_counts().index,palette='rocket')
plt.ylabel('REGION - 2018',fontsize=40)
plt.xlabel('COUNT',fontsize=25)
plt.yticks(fontsize=15);

plt.subplot(3,1,2)
sns.countplot(y=c2['CDP Region'],order = c2['CDP Region'].value_counts().index,palette='rocket')
plt.ylabel('REGION - 2019',fontsize=40)
plt.xlabel('COUNT',fontsize=25)
plt.yticks(fontsize=15);


plt.subplot(3,1,3)
sns.countplot(y=c3['CDP Region'],order = c3['CDP Region'].value_counts().index,palette='rocket')
plt.ylabel('REGION - 2020',fontsize=40)
plt.xlabel('COUNT',fontsize=25)
plt.yticks(fontsize=15);


### **DISTRIBUTION OF CDP RESPONSE STATUS - PRIVATE/PUBLIC (YEARS 2018, 2019, 2020)**

In [None]:
plt.figure(figsize=(20,5))
plt.subplot(1,3,1)
# Pass the column as a keyword argument; recent seaborn versions no longer accept it positionally.
sns.countplot(x=c1['Access'],palette='summer')
plt.ylabel('COUNT',fontsize=20)
plt.title('2018', fontsize= 20);

plt.subplot(1,3,2)
sns.countplot(x=c2['Access'],palette='summer')
plt.ylabel('')
plt.title('2019', fontsize= 20);

plt.subplot(1,3,3)
sns.countplot(x=c3['Access'],palette='summer')
plt.ylabel('')
plt.title('2020', fontsize= 20);

#### 2018 - In 2018, around one third of the responses were not publicly accessible.
#### 2019 - In 2019, the share of non-public responses was lower than in 2018.
#### 2020 - In 2020, the restriction was removed and all responses were public.
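
The proportions read off the bar charts can also be computed directly. The sketch below is a minimal illustration on made-up data: `access_share` is a hypothetical helper, the toy frames only mimic the `Access` column of `c1`, `c2` and `c3`, and the labels `'public'` and `'non public'` are assumptions about the values stored in that column.

```python
import pandas as pd

def access_share(frames):
    """Return the fraction of each Access value per year.

    `frames` maps a year label to a DataFrame with an 'Access' column.
    """
    shares = {
        year: df['Access'].value_counts(normalize=True)
        for year, df in frames.items()
    }
    # Rows are years, columns are Access values; absent values become 0.
    return pd.DataFrame(shares).T.fillna(0.0)

# Toy data standing in for c1, c2, c3 (values are illustrative only).
toy = {
    '2018': pd.DataFrame({'Access': ['public', 'public', 'non public']}),
    '2019': pd.DataFrame({'Access': ['public', 'public', 'public', 'non public']}),
    '2020': pd.DataFrame({'Access': ['public', 'public']}),
}
share_table = access_share(toy)
print(share_table)
```

With the real data, the same helper could be called as `access_share({'2018': c1, '2019': c2, '2020': c3})`.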

## **CITIES RESPONSES**

### PARENT SECTION DISTRIBUTION

In [None]:
c21.head()

In [None]:
c21.columns

In [None]:
c21.shape

In [None]:
plt.figure(figsize=(10,7))
sns.countplot(y=c21['Parent Section'],order = c21['Parent Section'].value_counts().index,alpha=0.7,palette='spring')
plt.yticks(fontsize=10)
plt.ylabel('PARENT_SECTION - 2018', fontsize='22')
plt.xlabel('COUNT', fontsize='15')

In [None]:
plt.figure(figsize=(10,10))
plt.subplot(2,1,1)
sns.countplot(y=c22['Parent Section'],order = c22['Parent Section'].value_counts().index,alpha=0.7,palette='spring')
plt.yticks(fontsize=10);
plt.ylabel('PARENT_SECTION - 2019', fontsize='20')
plt.xlabel('COUNT', fontsize='15')

plt.subplot(2,1,2)
sns.countplot(y=c23['Parent Section'],order = c23['Parent Section'].value_counts().index,alpha=0.7,palette='spring')
plt.yticks(fontsize=10);
plt.ylabel('PARENT_SECTION - 2020', fontsize='20')
plt.xlabel('COUNT', fontsize='15')

### SECTION DISTRIBUTION

In [None]:
plt.figure(figsize=(10,15))
sns.countplot(y=c21['Section'],order = c21['Section'].value_counts().index,alpha=0.7,palette='magma')
plt.yticks(fontsize=10);
plt.ylabel('SECTION - 2018', fontsize='50')
plt.xlabel('COUNT', fontsize='25')

In [None]:
plt.figure(figsize=(10,15))
sns.countplot(y=c22['Section'],order = c22['Section'].value_counts().index,alpha=0.7,palette='magma')
plt.yticks(fontsize=10);
plt.ylabel('SECTION - 2019', fontsize='50')
plt.xlabel('COUNT', fontsize='25')

In [None]:
plt.figure(figsize=(10,15))
sns.countplot(y=c23['Section'],order = c23['Section'].value_counts().index,alpha=0.7,palette='magma')
plt.yticks(fontsize=10);
plt.ylabel('SECTION - 2020', fontsize='50')
plt.xlabel('COUNT', fontsize='25')

## **PERFORMING SENTIMENT ANALYSIS ON QUESTION NAME:**

In [None]:
print(c21['Question Name'].duplicated().sum())
print(c22['Question Name'].duplicated().sum())
print(c23['Question Name'].duplicated().sum())

In [None]:
c211 = c21['Question Name']
c211 = c211.drop_duplicates().reset_index()
c211 = c211.drop('index',axis=1)
c222 = c22['Question Name']
c222 = c222.drop_duplicates().reset_index()
c222 = c222.drop('index',axis=1)
c233 = c23['Question Name']
c233 = c233.drop_duplicates().reset_index()
c233 = c233.drop('index',axis=1)

In [None]:
c31 = c211.merge(c222, on='Question Name', how='right')
c32 = c31.merge(c233,on='Question Name', how='right')
c32
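
Because both merges use `how='right'` on a single shared column, the result keeps one row per question in the right-hand frame of the last merge. The toy example below (made-up question strings and hypothetical variable names) illustrates this and contrasts it with an inner merge, which keeps only the questions present in every year.

```python
import pandas as pd

# Toy question lists standing in for c211, c222, c233 (made-up strings).
q18 = pd.DataFrame({'Question Name': ['A', 'B', 'C']})
q19 = pd.DataFrame({'Question Name': ['B', 'C', 'D']})
q20 = pd.DataFrame({'Question Name': ['C', 'D', 'E']})

# Right merges: every row of the right-hand frame survives each step.
right = (
    q18.merge(q19, on='Question Name', how='right')
       .merge(q20, on='Question Name', how='right')
)

# Inner merges: only questions asked in all three years survive.
inner = (
    q18.merge(q19, on='Question Name', how='inner')
       .merge(q20, on='Question Name', how='inner')
)

print(sorted(right['Question Name']))  # ['C', 'D', 'E']
print(sorted(inner['Question Name']))  # ['C']
```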

In [None]:
c32['Question Name'] = c32['Question Name'].apply(lambda x: " ".join(x.lower() for x in x.split()))
c32['Question Name'].head()

In [None]:
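# Strip punctuation; regex=True is required because newer pandas treats the pattern as a literal string by default.
c32['Question Name'] = c32['Question Name'].str.replace(r'[^\w\s]', '', regex=True)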
c32['Question Name'].head()

In [None]:
stop = stopwords.words('english')

c32['Question Name'] = c32['Question Name'].apply(lambda x: " ".join(x for x in x.split() if x not in stop));
c32['Question Name'].head()

In [None]:
freq = pd.Series(' '.join(c32['Question Name']).split()).value_counts()[:10]
freq

In [None]:
freq = list(freq.index)
c32['Question Name'] = c32['Question Name'].apply(
    lambda x: " ".join(x for x in x.split() if x not in freq))
c32['Question Name'].head()

In [None]:
freq = pd.Series(' '.join(c32['Question Name']).split()).value_counts()[-15:]
freq

In [None]:
freq = list(freq.index)
c32['Question Name'] = c32['Question Name'].apply(
    lambda x: " ".join(x for x in x.split() if x not in freq))
c32['Question Name'].head()
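
The cleaning steps applied so far (lowercasing, punctuation removal, stop-word removal, then dropping the most and least frequent words) can be sketched in plain Python. This is an illustration rather than the notebook's exact pipeline: `clean_texts`, `TOY_STOPWORDS` and the sample strings are hypothetical, and the tiny stop-word set stands in for NLTK's English list.

```python
import re
from collections import Counter

# Small stand-in for NLTK's English stop-word list (assumption for this sketch).
TOY_STOPWORDS = {'the', 'of', 'a', 'is', 'to', 'and', 'your'}

def clean_texts(texts, n_common=1, n_rare=1):
    """Lowercase, strip punctuation, drop stop words, then drop the
    n_common most frequent and n_rare least frequent remaining words."""
    tokenised = []
    for text in texts:
        text = re.sub(r'[^\w\s]', '', text.lower())
        tokenised.append([w for w in text.split() if w not in TOY_STOPWORDS])

    # Rank the remaining words from most to least frequent.
    counts = Counter(w for words in tokenised for w in words)
    ranked = [w for w, _ in counts.most_common()]
    rare = ranked[-n_rare:] if n_rare else []
    drop = set(ranked[:n_common]) | set(rare)
    return [' '.join(w for w in words if w not in drop) for words in tokenised]

# Made-up questions, not taken from the CDP data.
samples = [
    'Does your city have a climate plan?',
    'Describe the climate hazards of the city.',
    'Is the city plan public?',
]
cleaned = clean_texts(samples)
print(cleaned)
```

When many words share the lowest count, which of them land in the "rare" slice depends only on their ordering, so the choice among ties is arbitrary.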

In [None]:
st = PorterStemmer()
c32['Question Name'].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))

In [None]:
c32['sentiment'] = c32['Question Name'].apply(lambda x: TextBlob(x).sentiment[0] )
c32[['Question Name','sentiment']].head()

In [None]:
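# Build the analyzer from the class itself so this cell can be re-run without error.
from nltk.sentiment.vader import SentimentIntensityAnalyzer
vader = SentimentIntensityAnalyzer()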

In [None]:
c32['scores'] = c32['Question Name'].apply(lambda x: vader.polarity_scores(x))
c32['compound']=c32['scores'].apply(lambda score_dict: score_dict['compound']) 
c32['pos']=c32['scores'].apply(lambda score_dict: score_dict['pos'])
c32['neg']=c32['scores'].apply(lambda score_dict: score_dict['neg']) 
c32['neu']=c32['scores'].apply(lambda score_dict: score_dict['neu'])
c32=c32.drop('scores',axis=1)
c32

In [None]:
# Collapse the polarity to -1 (negative), 0 (neutral) or 1 (positive) without chained assignment.
c32['sentiment'] = np.sign(c32['sentiment'].astype(float))

In [None]:
c32

## **SENTIMENT ANALYSIS ON RESPONSE ANSWER:**

In [None]:
c41=c21['Response Answer']
c42=c22['Response Answer']
c43=c23['Response Answer']

In [None]:
c41=c41.drop_duplicates().reset_index()
c42=c42.drop_duplicates().reset_index()
c43=c43.drop_duplicates().reset_index()

In [None]:
c41=c41.drop('index', axis=1)
c42=c42.drop('index', axis=1)
c43=c43.drop('index', axis=1)

In [None]:
c51 = c41.merge(c42, how='right',on='Response Answer')
c52 = c51.merge(c43, how='right',on='Response Answer')

In [None]:
# Copy the filtered frame so the column assignments below do not act on a view of c52.
c53 = c52.dropna().copy()
c53

In [None]:
c53['Response Answer'] = c53['Response Answer'].apply(lambda x: " ".join(x.lower() for x in x.split()))
c53['Response Answer'].head()

In [None]:
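# Strip punctuation; regex=True is required because newer pandas treats the pattern as a literal string by default.
c53['Response Answer'] = c53['Response Answer'].str.replace(r'[^\w\s]', '', regex=True)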
c53['Response Answer'].head()

In [None]:
stop = stopwords.words('english')

c53['Response Answer'] = c53['Response Answer'].apply(lambda x: " ".join(x for x in x.split() if x not in stop));
c53['Response Answer'].head()

In [None]:
freq = pd.Series(' '.join(c53['Response Answer']).split()).value_counts()[:10]
freq

In [None]:
freq = list(freq.index)
c53['Response Answer'] = c53['Response Answer'].apply(
    lambda x: " ".join(x for x in x.split() if x not in freq))
c53['Response Answer'].head()

In [None]:
freq = pd.Series(' '.join(c53['Response Answer']).split()).value_counts()[-10:]
freq

In [None]:
freq = list(freq.index)
c53['Response Answer'] = c53['Response Answer'].apply(
    lambda x: " ".join(x for x in x.split() if x not in freq))
c53['Response Answer'].head()

In [None]:
st = PorterStemmer()
# Preview only: the stemmed text is not assigned back, so the sentiment steps below run on unstemmed text.
c53['Response Answer'].apply(lambda x: " ".join([st.stem(word) for word in x.split()])).head()

In [None]:
c53['sentiment'] = c53['Response Answer'].apply(lambda x: TextBlob(x).sentiment[0] )
c53[['Response Answer','sentiment']].head()

In [None]:
c53['scores'] = c53['Response Answer'].apply(lambda x: vader.polarity_scores(x))
c53['compound']=c53['scores'].apply(lambda score_dict: score_dict['compound']) 
c53['pos']=c53['scores'].apply(lambda score_dict: score_dict['pos'])
c53['neg']=c53['scores'].apply(lambda score_dict: score_dict['neg']) 
c53['neu']=c53['scores'].apply(lambda score_dict: score_dict['neu'])
c53=c53.drop('scores',axis=1)
c53

In [None]:
# Collapse the polarity to -1 (negative), 0 (neutral) or 1 (positive) without chained assignment.
c53['sentiment'] = np.sign(c53['sentiment'].astype(float))

In [None]:
plt.figure(figsize=(20,8))
plt.subplot(1,2,1)
# Fix the class order so the tick labels always line up with the bars.
ax=sns.countplot(x=c32['sentiment'],order=[-1.0,0.0,1.0],palette='winter')
ax.set_xticklabels(['Negative','Neutral','Positive']);
plt.ylabel('SENTIMENT DISTRIBUTION', fontsize=25)
plt.title('QUESTION NAME', fontsize=15)

plt.subplot(1,2,2)
ax=sns.countplot(x=c53['sentiment'],order=[-1.0,0.0,1.0],palette='winter')
ax.set_xticklabels(['Negative','Neutral','Positive'])
plt.title('RESPONSE ANSWER', fontsize=15)
plt.ylabel('')
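
The tick labels above assume the classes appear in the order negative, neutral, positive. The class counts behind the bar charts can also be tabulated directly. The sketch below uses made-up polarity values and a hypothetical helper, `label_counts`, which discretises scores with `np.sign` and reindexes the counts so that every class is listed, even with a zero count.

```python
import numpy as np
import pandas as pd

# Mapping from discretised polarity to a readable class name.
LABELS = {-1.0: 'Negative', 0.0: 'Neutral', 1.0: 'Positive'}

def label_counts(polarity):
    """Map polarity scores to -1/0/1 and count each class."""
    signs = np.sign(pd.Series(polarity, dtype=float))
    # Reindex so that classes missing from the data show up with count 0.
    counts = signs.value_counts().reindex(list(LABELS), fill_value=0)
    return counts.rename(index=LABELS)

# Made-up polarity scores with no negative entries.
counts = label_counts([0.5, 0.0, 0.2, 0.0, 0.0])
print(counts)
```

On the notebook's frames, the equivalent call would be `label_counts(c32['sentiment'])` or `label_counts(c53['sentiment'])`.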