# CDP Competition Starter Notebook
Example data mapping, EDA and data wrangling pipeline to relate CDP Corporate response data to CDP Cities data and external data sets containing social equity data.

#### Parameters

#### Input

**CDP Corporate Questionnaire response data sets**
- **2019_Full_Climate_Change_Dataset.csv** = 2019 Climate Change publicly disclosed questionnaire responses for North America
- **2019_Full_Water_Security_Dataset.csv** = 2019 Water Security publicly disclosed questionnaire responses for North America

**CDP Cities Questionnaire response data sets**
- **2020_Full_Cities_Dataset.csv** = Full 2020 Cities Questionnaire response data set

**CDP Cities Meta data sets**
- **NA_HQ_public_data.csv** = CDP curated Organisations metadata, mapping publicly disclosed North American organisations to HQ city and state

**External Non-CDP data sets**
- **SVI2018_US.csv** = US Centers for Disease Control and Prevention (CDC) Social Vulnerability Index (SVI) Data for 2018 (*Census tract level*) - available publicly at https://www.atsdr.cdc.gov/placeandhealth/svi/data_documentation_download.html
- **SVI2018_US_COUNTY.csv** = US Centers for Disease Control and Prevention (CDC) Social Vulnerability Index (SVI) Data for 2018 (*County level*) - available publicly at https://www.atsdr.cdc.gov/placeandhealth/svi/data_documentation_download.html
- **uscities.csv** = metadata for United States cities and towns, with information such as population size, median age and lat,lng location coordinates - available publicly at https://simplemaps.com/data/us-cities.

SVI 2018 Documentation and Data Dictionary https://www.atsdr.cdc.gov/placeandhealth/svi/documentation/SVI_documentation_2018.html

#### Output

EDA and Visualisations to begin investigating the CDP competition data sets, environmental performance indicators and social-equity KPIs.


## Imports

In [None]:
# standard libs
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import json

# plotting libs
import seaborn as sns

# geospatial libs
from mpl_toolkits.basemap import Basemap
from shapely.geometry import Polygon
import geopandas as gpd
import folium
import plotly.graph_objects as go
import plotly.express as px

# enable inline plotly rendering in the notebook
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)

print(os.getcwd())

## Data

### Import Data

In [None]:
# import corporate response data
cc_df = pd.read_csv('../input/cdp-unlocking-climate-solutions/Corporations/Corporations Responses/Climate Change/2019_Full_Climate_Change_Dataset.csv')
ws_df = pd.read_csv('../input/cdp-unlocking-climate-solutions/Corporations/Corporations Responses/Water Security/2019_Full_Water_Security_Dataset.csv')

In [None]:
cc_df.head()

In [None]:
ws_df.head()

In [None]:
# import cities response df
cities_df = pd.read_csv("../input/cdp-unlocking-climate-solutions/Cities/Cities Responses/2020_Full_Cities_Dataset.csv")

In [None]:
cities_df.head()

In [None]:
cities_df['Question Number'].unique()

In [None]:
# external data - import CDC social vulnerability index data - census tract level
svi_df = pd.read_csv("../input/cdp-unlocking-climate-solutions/Supplementary Data/CDC Social Vulnerability Index 2018/SVI2018_US.csv")

In [None]:
svi_df

In [None]:
# cities metadata - lat,lon locations for US cities
cities_meta_df = pd.read_csv("../input/cdp-unlocking-climate-solutions/Supplementary Data/Simple Maps US Cities Data/uscities.csv")

# cities metadata - CDP metadata on organisation HQ cities
cities_cdpmeta_df = pd.read_csv("../input/cdp-unlocking-climate-solutions/Supplementary Data/Locations of Corporations/NA_HQ_public_data.csv")

In [None]:
cities_meta_df.head()

In [None]:
cities_cdpmeta_df.head()

### Helpers

In [None]:
def list_dedupe(x):
    """
    Remove duplicates from a list while preserving first-seen order,
    by converting the list to dict keys and back to a list
    
    Parameters
    ----------
    x: list
        Python list object
        
    Returns
    -------
    list:
        list with duplicates removed
        
    """
    return list(dict.fromkeys(x))
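
A quick check of the helper on a toy list may make its behaviour clearer. The sample values below are made up for illustration. Because `dict.fromkeys` keeps insertion order, the first occurrence of each item is the one retained:

```python
def list_dedupe(x):
    """Return a list with duplicates removed, keeping first-seen order."""
    return list(dict.fromkeys(x))

# hypothetical sample values, for illustration only
sample = ['Boston', 'Denver', 'Boston', 'Austin', 'Denver']
deduped = list_dedupe(sample)
print(deduped)
```

Unlike `list(set(x))`, this keeps the original ordering, which matters later when ids and names are zipped together.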

### Set up and Parameters

## Calculations

### Data Cleaning & EDA

#### Extract City Questionnaire Response and map Cities to Organisations

- Extract city response data for question *6.0 whether a city sees opportunity*
- Map cities to organisations who are headquartered within that city, using the NA_HQ_public_data.csv meta data file

(see [CDP Cities questionnaire guidance](https://guidance.cdp.net/en/guidance?cid=16&ctype=theme&idtype=ThemeID&incchild=1&microsite=0&otype=Questionnaire&tags=TAG-637%2CTAG-570%2CTAG-13013%2CTAG-13002%2CTAG-13009%2CTAG-13010))


In [None]:
cities_6_0 = cities_df[cities_df['Question Number'] == '6.0']\
    .rename(columns={'Organization': 'City'})

cities_6_0['Response Answer'] = cities_6_0['Response Answer'].fillna('No Response')

cities_6_0.head()

Clean Organisation City HQ Metadata

In [None]:
# state abbreviation dictionary
us_state_abbrev = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'American Samoa': 'AS',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'District of Columbia': 'DC',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Guam': 'GU',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Northern Mariana Islands':'MP',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Puerto Rico': 'PR',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virgin Islands': 'VI',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY'
}

# map dict to clean full state names to abbreviations
cities_cdpmeta_df['state'] = cities_cdpmeta_df['address_state'].map(us_state_abbrev)

# infill non-matched from dict
cities_cdpmeta_df['state'] = cities_cdpmeta_df['state'].fillna(cities_cdpmeta_df['address_state'])
cities_cdpmeta_df['state'] = cities_cdpmeta_df['state'].replace({'ALBERTA':'AB'})
cities_cdpmeta_df['address_city'] = cities_cdpmeta_df['address_city'].replace({'CALGARY':'Calgary'})
cities_cdpmeta_df= cities_cdpmeta_df.drop(columns=['address_state'])

# create joint city state variable
cities_cdpmeta_df['city_state'] = cities_cdpmeta_df['address_city'].str.cat(cities_cdpmeta_df['state'],sep=", ")

cities_cdpmeta_df
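
The map-then-infill pattern used above can be sketched on a toy frame. The rows and the two-entry `abbrev` dictionary are made up stand-ins for `NA_HQ_public_data.csv` and `us_state_abbrev`; listing the unmatched values before infilling is an optional extra check, not part of the original pipeline:

```python
import pandas as pd

# toy stand-in for NA_HQ_public_data.csv (values are made up)
toy = pd.DataFrame({
    'address_city': ['Boulder', 'CALGARY', 'Austin'],
    'address_state': ['Colorado', 'ALBERTA', 'TX'],
})
abbrev = {'Colorado': 'CO', 'Texas': 'TX'}  # small subset of us_state_abbrev

# full names become abbreviations; anything not in the dict becomes NaN
toy['state'] = toy['address_state'].map(abbrev)
unmatched = toy.loc[toy['state'].isna(), 'address_state'].tolist()

# fall back to the original value, then patch known exceptions by hand
toy['state'] = toy['state'].fillna(toy['address_state']).replace({'ALBERTA': 'AB'})
toy['city_state'] = toy['address_city'].str.cat(toy['state'], sep=', ')

print(unmatched)
print(toy['city_state'].tolist())
```

Inspecting `unmatched` is a cheap way to spot values such as Canadian provinces or already-abbreviated states that need manual handling.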

Summarise the cities metadata to count the number of organisations (HQ) per city

In [None]:
cities_count = (
    cities_cdpmeta_df[['organization', 'address_city', 'state', 'city_state']]
    .groupby(['address_city', 'state', 'city_state'])
    .count()
    .sort_values(by=['organization'], ascending=False)
    .reset_index()
    .rename(columns={'organization': 'num_orgs'})
)
cities_count.head()

City name conversion

- Align City names in CDP City questionnaire response data ('City Org') with common city names that may be present in external data sets
- e.g. 'City of Boulder' -> Boulder

*Note* This data quality control step can also be addressed by using the 'City' column in the 2019_Cities_Disclosing_to_CDP.csv dataset
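
As a rough illustration of the name alignment described above, organisation-style city names could also be normalised directly with a regular expression. This is a sketch under assumptions: the prefix list and the trailing state-code pattern are hypothetical and would need checking against the actual `Organization` values:

```python
import re

# hypothetical prefixes seen in organisation-style city names
PREFIX = re.compile(r"^(City|Town|Village|County|Municipality) of\s+", flags=re.IGNORECASE)
# hypothetical trailing ", CO" style state code
SUFFIX = re.compile(r",\s*[A-Z]{2}$")

def common_city_name(city_org):
    """Strip a leading 'City of ' style prefix and a trailing state code."""
    name = PREFIX.sub("", city_org.strip())
    return SUFFIX.sub("", name)

examples = ['City of Boulder', 'Town of Vail, CO', 'Denver']
cleaned = [common_city_name(e) for e in examples]
print(cleaned)
```

This avoids the pairwise comparison of every city against every organisation name, at the cost of maintaining the prefix list by hand.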

In [None]:
# convert indexes to columns
cities_count.reset_index(inplace=True)
cities_count = cities_count.rename(columns = {'index':'city_id'})
cities_df.reset_index(inplace=True)
cities_df = cities_df.rename(columns = {'index':'city_org_id'})

# convert id and city label columns into lists
city_id_no = list_dedupe(cities_count['city_id'].tolist())
city_name = list_dedupe(cities_count['address_city'].tolist())

city_org_id_no = list_dedupe(cities_df['city_org_id'].tolist())
city_org_name = list_dedupe(cities_df['Organization'].tolist())

# remove added index column in cities df
cities_df.drop('city_org_id', inplace=True, axis=1)
cities_count.drop('city_id', inplace=True, axis=1)

# zip to join the lists and dict function to convert into dicts
city_dict = dict(zip(city_id_no, city_name))
city_org_dict = dict(zip(city_org_id_no, city_org_name))

In [None]:
# compare dicts - matching when city name appears as a substring in the full city org name
rows = []  # collect matches in a list; DataFrame.append was removed in pandas 2.0

for ID, seq1 in city_dict.items():
    for ID2, seq2 in city_org_dict.items():
        m = re.search(re.escape(seq1), seq2) # escape so the city name is matched literally
        if m:
            rows.append({'City ID No.': ID, 'address_city': seq1, 'City Org ID No.': ID2, 'City Org': seq2, 'Match' : m.group()})

# build the dataframe once from the collected rows
city_names_df = pd.DataFrame(rows, columns=['City ID No.','address_city', 'City Org ID No.','City Org', 'Match'])

# subset for city to city org name matches
city_names_df = city_names_df.loc[:,['address_city','City Org']]

city_names_df.head()

Join city_org names to city-org count table


In [None]:
cities_count  = pd.merge(cities_count, city_names_df, on='address_city', how='left')
cities_count.head()

Join Count of Disclosing Organisations in HQ Cities with Question 6.0 Response dataframe

- Label the response variable `Opportunity`, i.e. the city's answer to question 6.0

In [None]:
cities_6_0 = cities_6_0[['City', 'Response Answer']].rename(columns={'City' : 'City Org'})
cities_count = pd.merge(left=cities_count, right=cities_6_0, how='left', 
                        on ='City Org').rename(columns={'Response Answer' : 'Opportunity'})
cities_count['Opportunity'] = cities_count['Opportunity'].fillna('No Response')
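
When chaining left joins like this, it can help to check how many rows actually found a match. A minimal sketch using the `indicator` option of `DataFrame.merge`, on made-up toy frames rather than the competition data:

```python
import pandas as pd

# toy stand-ins for cities_count and cities_6_0 (values are made up)
counts = pd.DataFrame({'City Org': ['City of Boulder', 'City of Austin', 'City of Reno'],
                       'num_orgs': [3, 5, 2]})
responses = pd.DataFrame({'City Org': ['City of Boulder'],
                          'Response Answer': ['Yes']})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = counts.merge(responses, how='left', on='City Org', indicator=True)
matched = int((merged['_merge'] == 'both').sum())

print(merged['_merge'].value_counts())
```

Rows flagged `left_only` are the ones that end up as 'No Response' after the `fillna`, so they mix genuine non-responders with cities whose names simply failed to match.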

In [None]:
cities_count

Plot the cities with the highest number of organisations disclosing to CDP

- Highlight number of disclosing organisations with a HQ in the city
- Highlight the city's response to question 6.0 as bar colour


In [None]:
# first 40 rows of the table, which is sorted by number of organisations
cities_count_top = cities_count.iloc[0:40,:]

plt.figure(figsize=(15,8))
ax = sns.barplot(
    x="city_state", y="num_orgs",
    hue = "Opportunity",
    data=cities_count_top,
    palette="OrRd_r"
)

plt.xticks(
    rotation=45, 
    horizontalalignment='right',
    fontweight='light',
    fontsize='medium'  
)

Spatial plot of cities and organisation mapping

[Example of bubble map plotting with plotly](https://plotly.com/python/bubble-maps/)

In [None]:
# subset for lat, lng cities data
cities_meta_df = cities_meta_df[['city', 'state_id', 'lat','lng']].rename(columns={'city' : 'address_city', 'state_id' : 'state'})
cities_meta_df.head()

In [None]:
# join coordinates to cities count
cities_count = pd.merge(left=cities_count, right=cities_meta_df, how='left', on=['address_city', 'state'])

# convert text response to question 6.0 to an integer encoding
labels = cities_count['Opportunity'].unique().tolist()
mapping = dict(zip(labels, range(len(labels))))

# map each response label to its integer code
cities_count['resp_int'] = cities_count['Opportunity'].map(mapping)
cities_count.head()
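
For reference, pandas offers `pd.factorize` for this kind of label-to-integer encoding. A small sketch on made-up response values:

```python
import pandas as pd

# made-up responses, for illustration only
responses = pd.Series(['Yes', 'No Response', 'Yes', 'No', 'No Response'])

# codes are assigned in order of first appearance; labels holds the reverse lookup
codes, labels = pd.factorize(responses)

print(codes.tolist())
print(labels.tolist())
```

The codes carry no ordering or magnitude, so they are suitable as a colour key but should not be read as a numeric scale.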

- Highlight number of disclosing organisations with a HQ in the city via bubble size
- Highlight city's response to question 6.0 as bubble colour and highlight in hover box

In [None]:
# plot spatial bubble map
cities_count['text'] = cities_count['address_city'] + '<br>Number of Orgs: ' + (cities_count['num_orgs']).astype(str) +\
    '<br>Opportunity: ' + (cities_count['Opportunity']).astype(str)
scale = 5

fig = go.Figure()

# a single trace covers every city: bubble size encodes num_orgs, colour encodes the 6.0 response
# (looping over size bins without subsetting the data would draw the same trace repeatedly)
fig.add_trace(go.Scattergeo(
    locationmode = 'USA-states',
    lon = cities_count['lng'],
    lat = cities_count['lat'],
    text = cities_count['text'],
    marker = dict(
        size = cities_count['num_orgs']*scale,
        color = cities_count['resp_int'],
        line_color='rgb(40,40,40)',
        line_width=0.5,
        sizemode = 'area'
    ),
    name = 'HQ cities'))

fig.update_layout(
        title_text = '2019 CDP Climate Change Corporate Responders (Public) by City',
        showlegend = False,
        geo = dict(
            scope = 'usa',
            landcolor = 'rgb(217, 217, 217)',
        )
    )

fig.show()

#### Build NYC City Specific Dataset

Combine SVI dataset with CDP City Questionnaire and Organisation level 2019 Climate Change questionnaire response data

E.g. :
- Identify which organisations located within NYC see climate-related opportunities within their operations
- Match organisations with areas of the city that suffer from high unemployment rates
- Pinpoint areas of NYC that present an opportunity for corporate collaboration and therefore an uplift in social equity metrics
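
One way the unemployment step might start is by subsetting the SVI data to the five NYC counties and ranking by unemployment rate. This is a sketch on made-up rows; the column names `ST_ABBR`, `COUNTY` and `EP_UNEMP`, and the use of -999 as a missing-value code, are assumptions to verify against the SVI 2018 data dictionary:

```python
import pandas as pd

# toy county-level rows (values are made up; column names are assumed, see the SVI data dictionary)
svi = pd.DataFrame({
    'ST_ABBR': ['NY', 'NY', 'NY', 'NJ'],
    'COUNTY': ['Bronx', 'Kings', 'Albany', 'Hudson'],
    'EP_UNEMP': [11.9, 7.5, 5.8, -999.0],
})

# the five boroughs expressed as county names
NYC_COUNTIES = ['Bronx', 'Kings', 'New York', 'Queens', 'Richmond']

# treat the assumed -999 missing-value code as NaN before ranking
svi = svi.replace(-999, float('nan'))

# keep NYC counties only, highest unemployment estimate first
nyc = svi[(svi['ST_ABBR'] == 'NY') & (svi['COUNTY'].isin(NYC_COUNTIES))]
nyc_ranked = nyc.sort_values('EP_UNEMP', ascending=False)

print(nyc_ranked)
```

The same filter applied to the census tract file would give a finer-grained view within each borough.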

###### Subset climate change questionnaire response data for question C6.0

- Apply sentiment analysis to detect whether a city sees opportunity (positive sentiment/polarity) in its response to Cities question 6.0

(see [CDP Climate Change questionnaire guidance](https://guidance.cdp.net/en/guidance?cid=13&ctype=theme&idtype=ThemeID&incchild=1&microsite=0&otype=Questionnaire&tags=TAG-646%2CTAG-605%2CTAG-600))
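
To make the idea of polarity concrete, here is a deliberately simple lexicon-based scorer. It is a toy stand-in, not the method used in this notebook: the word lists are hypothetical, and a real analysis would more likely rely on an established sentiment library:

```python
import re

# hypothetical mini-lexicons, for illustration only
POSITIVE = {'opportunity', 'opportunities', 'growth', 'benefit', 'improve'}
NEGATIVE = {'risk', 'barrier', 'cost', 'lack'}

def polarity(text):
    """Return a score in [-1, 1]: (positive hits - negative hits) / total hits."""
    words = re.findall(r"[a-z]+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    # no lexicon hits at all is treated as neutral
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)

print(polarity("We see growth opportunities in clean energy"))
print(polarity("High cost and lack of funding, but some growth"))
print(polarity("No Response"))
```

Placeholder answers such as 'No Response' score as neutral here, which keeps them from being counted as negative sentiment.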

In [None]:
cities_count

In [None]:
# Import package
from wordcloud import WordCloud, STOPWORDS

In [None]:
%matplotlib inline

In [None]:
import nltk 
nltk.download('stopwords')
nltk.download('punkt')  # tokenizer models needed by nltk.word_tokenize (newer NLTK releases may need 'punkt_tab')
from nltk.corpus import stopwords 
from nltk.stem.porter import PorterStemmer

In [None]:
# collect the question 6.0 responses as a list of strings
rv=cities_count['Opportunity'].tolist()

In [None]:
# join with a space so words from adjacent responses are not fused together
rv1=" ".join(rv)

In [None]:
rv_tokens = nltk.word_tokenize(rv1)

In [None]:
rv_tokens

In [None]:
# build the word cloud text once the tokens exist
cloud_words = " ".join(rv_tokens)
# named cloud_stopwords to avoid shadowing the nltk stopwords module
cloud_stopwords = set(STOPWORDS)
cloud_stopwords.update(["http", "https","co","com","amp"])

In [None]:
# Define a function to plot word cloud
def plot_cloud(wordcloud):
    # Set figure size
    plt.figure(figsize=(26, 8))
    # Display image
    plt.imshow(wordcloud) 
    # No axis details
    plt.axis("off");

In [None]:
# Generate word cloud
wordcloud = WordCloud(width = 5000, height = 2000, random_state=1, background_color='salmon', colormap='Pastel1', collocations=False, stopwords = cloud_stopwords).generate(cloud_words)
# Plot
plot_cloud(wordcloud)

In [None]:
ps = PorterStemmer()
english_stopwords = set(stopwords.words('english'))  # build the set once, not per row

def clean_response(text):
    # keep letters, commas and whitespace; stripping whitespace would fuse the words together
    text = re.sub(r"[^a-zA-Z,\s]", "", str(text)).lower()
    # drop stopwords, stem the remaining words and rejoin with spaces
    return " ".join(ps.stem(word) for word in text.split() if word not in english_stopwords)

# apply per response rather than assigning through chained indexing
cities_count['Opportunity'] = cities_count['Opportunity'].apply(clean_response)

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize, regexp_tokenize # functions for standard tokenisation
from nltk.tokenize import TweetTokenizer # tokenizer designed for tweets, reused here for short responses

In [None]:
tknzr = TweetTokenizer() # initialization of Tweet Tokenizer

def mean_words_length(text):
    words = word_tokenize(text)
    word_lengths = [len(w) for w in words]
    return round(np.mean(word_lengths),1)

# words count
cities_count['words_count'] = cities_count['Opportunity'].apply(lambda x: len(tknzr.tokenize(x))) # number of tokens in the response

# numbers count
numbers_regex = r"(\d+\.?,?\s?\d+)"
cities_count['numbers_count'] = cities_count['Opportunity'].apply(lambda x: len(regexp_tokenize(x, numbers_regex))) # count of extracted numbers

# hashtags count
hashtags_regex = r"#\w+"
cities_count['hashtags_count'] = cities_count['Opportunity'].apply(lambda x: len(regexp_tokenize(x, hashtags_regex))) # count of extracted hashtags


# mean words length
cities_count['mean_words_length'] = cities_count['Opportunity'].apply(mean_words_length) # mean token length in characters

# characters count
cities_count['characters_count'] = cities_count['Opportunity'].apply(len) # total number of characters

In [None]:
from nltk.corpus import stopwords
import string

# lowercase tokens
cities_count['lowercase_bag_o_w']=cities_count['Opportunity'].apply(lambda x: [w for w in tknzr.tokenize(x.lower())])

# stopwords
cities_count['stopwords']=cities_count['lowercase_bag_o_w'].apply(lambda x: [t for t in x if t in stopwords.words('english')])

# stopwords count
cities_count['stopwords_count']=cities_count['stopwords'].apply(lambda x: len(x))

# alpha words only (excludes mentions and hashtags)
cities_count['alpha_only']=cities_count['lowercase_bag_o_w'].apply(lambda x: [t for t in x if t.isalpha()])

# counts of alpha words only
cities_count['alpha_count']=cities_count['alpha_only'].apply(lambda x: len(x))

# counts of punctuation marks only
punctuation_regex = r"[^\w\s]"
cities_count['punctuation_count']=cities_count['Opportunity'].apply(lambda x: len(regexp_tokenize(x, punctuation_regex)))
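
The per-response features above can be sketched with the standard library alone. The example text is made up and the regular expressions are simplified assumptions, not the exact patterns used in the cells above. Since the cleaning step strips digits and most punctuation, counts like these are most informative when computed on the raw response text:

```python
import re

def text_features(text):
    """Compute simple surface features for one response string."""
    words = re.findall(r"\w+", text)
    return {
        'words_count': len(words),
        # digits optionally grouped by '.' or ',' (e.g. 1,200) count as one number
        'numbers_count': len(re.findall(r"\d+(?:[.,]\d+)*", text)),
        # anything that is neither a word character nor whitespace
        'punctuation_count': len(re.findall(r"[^\w\s]", text)),
        'mean_words_length': round(sum(map(len, words)) / len(words), 1) if words else 0.0,
        'characters_count': len(text),
    }

# made-up response, for illustration only
features = text_features("Retrofit 1,200 buildings by 2030; cut emissions 40%.")
print(features)
```

Returning a dict per response makes it straightforward to expand the result into columns, for example with `apply(text_features).apply(pd.Series)`.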

In [None]:
cities_count