The objective of this notebook is to propose an analytical view of how cities across the globe respond to climate change. I first conduct an exploratory data analysis, using graphical tools to build self-explanatory plots that show what the cities' responses contain. I then apply Natural Language Processing tools. First, I use sentiment analysis to build a text classifier that detects whether a city sees opportunity or concern over future climate scenarios. Second, I build a text summarization tool to summarize each city's readiness for climate change.

I hope you enjoy this notebook!

<a id="top"></a>
<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" role="tab" aria-controls="home">Table of Contents</h3>
    
* [I. EDA - Text mining](#1)
    - [Overview](#1.1)
    - [Length (word count) of answers](#1.2)
    - [Sentiment polarity](#1.3)
* [II. Natural Language Processing](#2)
    - [1 Sentiment Analysis](#2.1)
        - [1.1 Text cleaning](#2.1.1)
        - [1.2 Test the data](#2.1.2)
        - [1.3 Evaluation metrics](#2.1.3)
        - [1.4 Model classification - CountVectorizer](#2.1.4)
        - [1.5 Deploying the model](#2.1.5)
    - [2 Text Summarizing](#2.2)
        - [2.1 Text cleaning](#2.2.1)
        - [2.2 Test summarizing](#2.2.2)

In [57]:
# Importing libraries

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

import numpy as np
import pandas as pd
from time import time
import re
import string
import emoji
from pprint import pprint
import collections

import plotly.graph_objects as go
import plotly.figure_factory as ff
import plotly.express as px
from plotly.offline import (download_plotlyjs, 
                            init_notebook_mode, 
                            plot, 
                            iplot)
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

# import gensim

# import heapq
# from textblob import TextBlob
# import spacy
# nlp = spacy.load('en_core_web_sm')
# from wordcloud import WordCloud
# from IPython.display import display
# import base64

# from collections import Counter


# import nltk
# from nltk.corpus import stopwords
# stopwords = stopwords.words('english')
# from nltk.stem import PorterStemmer
# from nltk.tokenize import word_tokenize, sent_tokenize
# import warnings
# warnings.filterwarnings('ignore')
# import logging
# logging.getLogger("lda").setLevel(logging.WARNING)


In [78]:
# Generic NLP variables
year = 2018
question_number = '4.0a'
col_number = 1

# Accepted values for city_subset:
#   {
#   'all'   :    all cities that are in the dataset
#   'top'   :    only cities in the US, UK, Canada... see the cell that defines top_cities for the entire list
#   'aus'   :    only cities in Australia
#   }
city_subset = 'all'

In [79]:
# import cities response df
cities_df = pd.read_csv("../Cities/Cities Responses/{}_Full_Cities_Dataset.csv".format(year))

In [80]:
# Keep only the answers to the selected question and column
all_cities = cities_df[(cities_df["Question Number"] == question_number)
                       & (cities_df["Column Number"] == col_number)]\
    .rename(columns={'Organization': 'City'})

all_cities = all_cities.loc[:, ['Year Reported to CDP', 'City', 'Country', 'CDP Region', 'Section', 'Question Number', 'Column Number', 'Question Name', 'Response Answer']]

# Fill missing answers, then switch to an underscore column name
all_cities['Response Answer'] = all_cities['Response Answer'].fillna('No Response')
all_cities = all_cities.rename(columns={'Response Answer': 'Response_Answer'})

all_cities.head()

Unnamed: 0,Year Reported to CDP,City,Country,CDP Region,Section,Question Number,Column Number,Question Name,Response_Answer
41,2018,Greater Amman Municipality,Jordan,Middle East,Social Risks,4.0a,1,Please complete the table to indicate which s...,Increased resource demand
332,2018,Región Metropolitana de Santiago,Chile,Latin America,Social Risks,4.0a,1,Please complete the table to indicate which s...,Increased risk to already vulnerable populations
649,2018,Município de Torres Vedras,Portugal,Europe,Social Risks,4.0a,1,Please complete the table to indicate which s...,Increased incidence and prevalence of disease
769,2018,Comune di Padova,Italy,Europe,Social Risks,4.0a,1,Please complete the table to indicate which s...,Increased risk to already vulnerable populations
1001,2018,City of Providence,United States of America,North America,Social Risks,4.0a,1,Please complete the table to indicate which s...,Population displacement


In [81]:
# Countries kept in the 'top' subset
top_countries = ['United States of America', 'Canada',
                 'United Kingdom of Great Britain and Northern Ireland',
                 'Brazil', 'Mexico', 'Peru', 'Portugal', 'Italy',
                 'Australia', 'Argentina']
top_cities = all_cities.loc[all_cities['Country'].isin(top_countries)]

# Cities in Australia only
aus = all_cities.loc[all_cities['Country'] == 'Australia']


top_cities['Country'].unique()

array(['Portugal', 'Italy', 'United States of America', 'Canada',
       'Brazil', 'United Kingdom of Great Britain and Northern Ireland',
       'Argentina', 'Australia', 'Peru', 'Mexico'], dtype=object)

In [82]:
top_cities.head()

Unnamed: 0,Year Reported to CDP,City,Country,CDP Region,Section,Question Number,Column Number,Question Name,Response_Answer
649,2018,Município de Torres Vedras,Portugal,Europe,Social Risks,4.0a,1,Please complete the table to indicate which s...,Increased incidence and prevalence of disease
769,2018,Comune di Padova,Italy,Europe,Social Risks,4.0a,1,Please complete the table to indicate which s...,Increased risk to already vulnerable populations
1001,2018,City of Providence,United States of America,North America,Social Risks,4.0a,1,Please complete the table to indicate which s...,Population displacement
1552,2018,City of San Francisco,United States of America,North America,Social Risks,4.0a,1,Please complete the table to indicate which s...,Increased demand for public services (includin...
2361,2018,City of Vancouver,Canada,North America,Social Risks,4.0a,1,Please complete the table to indicate which s...,Increased demand for public services (includin...


In [83]:
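if city_subset == 'all':
    df_subset = all_cities
elif city_subset == 'top':
    df_subset = top_cities
elif city_subset == 'aus':
    # only cities in Australia
    df_subset = all_cities.loc[all_cities['Country'] == 'Australia']
else:
    raise ValueError(f"Unknown city_subset: {city_subset!r}")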

In [84]:
df_subset.head()

Unnamed: 0,Year Reported to CDP,City,Country,CDP Region,Section,Question Number,Column Number,Question Name,Response_Answer
41,2018,Greater Amman Municipality,Jordan,Middle East,Social Risks,4.0a,1,Please complete the table to indicate which s...,Increased resource demand
332,2018,Región Metropolitana de Santiago,Chile,Latin America,Social Risks,4.0a,1,Please complete the table to indicate which s...,Increased risk to already vulnerable populations
649,2018,Município de Torres Vedras,Portugal,Europe,Social Risks,4.0a,1,Please complete the table to indicate which s...,Increased incidence and prevalence of disease
769,2018,Comune di Padova,Italy,Europe,Social Risks,4.0a,1,Please complete the table to indicate which s...,Increased risk to already vulnerable populations
1001,2018,City of Providence,United States of America,North America,Social Risks,4.0a,1,Please complete the table to indicate which s...,Population displacement


Ayuntamiento de Celaya (Mexico) submitted the most responses on whether cities see climate change as an opportunity or a concern, closely followed by the City of Columbus. The remaining cities on this list trail further behind.
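
A ranking like this can be checked by counting rows per city. The following is a minimal sketch, assuming a frame shaped like `df_subset` with a `City` column. The rows in `toy` are made up for illustration and are not taken from the CDP data.

```python
import pandas as pd

# Hypothetical stand-in for df_subset: one row per submitted answer.
toy = pd.DataFrame({
    "City": ["City A", "City A", "City A", "City B", "City B", "City C"],
    "Response_Answer": ["x"] * 6,
})

# Number of answers per city, largest first.
responses_per_city = toy["City"].value_counts()
print(responses_per_city.head(10))
```

On the real data, the same call on `df_subset['City']` would list the cities with the most answers for the selected question.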

In [85]:
def get_top_n_words(corpus, n=None):
    """Return the n terms with the largest summed TF-IDF weight in corpus."""
    vec = TfidfVectorizer(stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    # Sum each term's TF-IDF weight over all answers (this is not a raw word count)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_words(df_subset['Response_Answer'], 20)
for word, freq in common_words:
    print(word, freq)
# The 'count' column holds summed TF-IDF weights
df1 = pd.DataFrame(common_words, columns=['word', 'count'])

fig = px.bar(df1, x='word', y='count')
fig.update_layout(title_text='Response_Answer top 20 words by summed TF-IDF weight', template="plotly_white")
fig.show()

increased 174.73486136846813
risk 115.77268832832685
vulnerable 115.77268832832685
populations 115.77268832832685
demand 110.97838563110757
health 76.15510820826829
public 76.10391587679482
services 76.10391587679482
including 76.10391587679482
incidence 70.19634751577462
prevalence 70.19634751577462
disease 70.19634751577462
response 68.0
resource 67.87772583024733
population 46.32250896190046
displacement 46.0559618636058
fluctuating 42.5
socio 42.5
economic 42.5
conditions 42.5


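Ranked by summed TF-IDF weight, the most prominent words in the city answers are increased, risk, vulnerable and populations, followed by demand, health, public and services. The scores are sums of TF-IDF weights rather than raw word counts, which is why most of them are not whole numbers. The `response` entry most likely comes from the `No Response` placeholder used to fill missing answers.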
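
The scores printed above come from `TfidfVectorizer`, so they are summed TF-IDF weights and not word counts. The sketch below contrasts the two on a made-up corpus. The helper `summed_term_scores` and the three toy answers are assumptions introduced for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up answers for illustration only (not rows from the CDP dataset).
toy_corpus = [
    "Increased resource demand",
    "Increased resource demand",
    "Population displacement",
]

def summed_term_scores(vectorizer, corpus):
    """Fit the vectorizer and sum each term's column over all documents."""
    matrix = vectorizer.fit_transform(corpus)
    totals = np.asarray(matrix.sum(axis=0)).ravel()
    return {term: float(totals[idx]) for term, idx in vectorizer.vocabulary_.items()}

raw_counts = summed_term_scores(CountVectorizer(stop_words="english"), toy_corpus)
tfidf_sums = summed_term_scores(TfidfVectorizer(stop_words="english"), toy_corpus)

# Print both scores side by side, most frequent terms first.
for term in sorted(raw_counts, key=raw_counts.get, reverse=True):
    print(f"{term:15s} count={raw_counts[term]:.0f}  tfidf_sum={tfidf_sums[term]:.3f}")
```

Because each answer's TF-IDF vector is L2-normalized by default, a term's weight shrinks as the answer contains more distinct terms. If true frequencies are wanted, swapping in `CountVectorizer` is one option.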

In [86]:
def get_top_n_trigram(corpus, n=None):
    """Return the n trigrams with the largest summed TF-IDF weight in corpus."""
    # ngram_range=(3,3) extracts three-word sequences (trigrams), not bigrams
    vec = TfidfVectorizer(ngram_range=(3,3), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_trigram(df_subset['Response_Answer'], 20)
for word, freq in common_words:
    print(word, freq)
# The 'count' column holds summed TF-IDF weights
df2 = pd.DataFrame(common_words, columns=['word', 'count'])

fig = px.bar(df2, x='word', y='count')
fig.update_layout(title_text='Answers top 20 trigrams by summed TF-IDF weight', template="plotly_white")
fig.show()

increased risk vulnerable 149.19953083036214
risk vulnerable populations 149.19953083036214
increased incidence prevalence 89.09545442950511
incidence prevalence disease 89.09545442950511
increased resource demand 87.0
increased demand public 85.0
demand public services 85.0
public services including 85.0
services including health 85.0
fluctuating socio economic 60.10407640085646
socio economic conditions 60.10407640085646
loss traditional jobs 38.0
increased conflict crime 34.0
migration rural areas 33.23401871576771
rural areas cities 33.23401871576771
decreased access recreation 1.0
oferta recurso hídrico 1.0
historical racial trauma 1.0
remocion en masa 1.0
low energy efficiency 0.7071067811865475


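From the top trigrams in the answers, we can see that cities are mainly concerned about increased risk to already vulnerable populations, followed by increased incidence and prevalence of disease, increased resource demand, and increased demand for public services, including health. Fluctuating socio-economic conditions, loss of traditional jobs, increased conflict and crime, and migration from rural areas to cities rank lower, while phrases such as low energy efficiency appear only in isolated answers.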
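
Several rows in the list above come in pairs with identical scores because they are overlapping three-word windows over the same answer. A small sketch of how that happens follows. It uses `CountVectorizer` for simplicity (the cell above uses `TfidfVectorizer`, which tokenizes the same way), and the single-answer corpus is chosen for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

answer = ["Increased risk to already vulnerable populations"]

# ngram_range=(3, 3) keeps only three-word sequences. English stop words
# are dropped first, so the n-grams are built from the remaining tokens.
vec = CountVectorizer(ngram_range=(3, 3), stop_words="english")
vec.fit(answer)

trigrams = sorted(vec.vocabulary_)
print(trigrams)
```

One answer with four remaining tokens yields two overlapping trigrams, so both receive the same score every time that answer appears.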