# CDP Competition Starter Notebook
Example data mapping, EDA and data wrangling pipeline to relate CDP Corporate response data to CDP Cities data and external data sets containing social equity data.

#### Parameters

#### Input

**CDP Corporate Questionnaire response data sets**
- **2019_Full_Climate_Change_Dataset.csv** = 2019 Climate Change publicly disclosed questionnaire responses for North America
- **2019_Full_Water_Security_Dataset.csv** = 2019 Water Security publicly disclosed questionnaire responses for North America

**CDP Cities Questionnaire response data sets**
- **2020_-_Full_Cities_Dataset.csv** = Full 2020 Cities Questionnaire response data set

**CDP Cities Meta data sets**
- **NA_HQ_public_data.csv** = CDP curated Organisations metadata, mapping publicly disclosed North American organisations to HQ city and state

**External Non-CDP data sets**
- **SVI2018_US.csv** = US Centers for Disease Control and Prevention (CDC) Social Vulnerability Index (SVI) Data for 2018 (*Census tract level*) - available publicly at https://www.atsdr.cdc.gov/placeandhealth/svi/data_documentation_download.html
- **SVI2018_US_COUNTY.csv** = US Centers for Disease Control and Prevention (CDC) Social Vulnerability Index (SVI) Data for 2018 (*County level*) - available publicly at https://www.atsdr.cdc.gov/placeandhealth/svi/data_documentation_download.html
- **uscities.csv** = metadata for United States cities and towns, with information such as population size, median age and lat,lng location coordinates - available publicly at https://simplemaps.com/data/us-cities.

SVI 2018 Documentation and Data Dictionary https://www.atsdr.cdc.gov/placeandhealth/svi/documentation/SVI_documentation_2018.html

#### Output

EDA and Visualisations to begin investigating the CDP competition data sets, environmental performance indicators and social-equity KPIs.


## Imports

In [None]:
# standard libs
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import json

# plotting libs
import seaborn as sns

# geospatial libs
from mpl_toolkits.basemap import Basemap
from shapely.geometry import Polygon
import geopandas as gpd
import folium
import plotly.graph_objects as go
import plotly.express as px

# render plotly figures inline in the notebook
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)

print(os.getcwd())

In [None]:
class cdp_kpi:
    """
    Load the CDP competition data sets and derive the city / organisation mappings used in the EDA
    """
    # import corporate climate change response data
    cc_df = pd.read_csv('../input/cdp-unlocking-climate-solutions/Corporations/Corporations Responses/Climate Change/2019_Full_Climate_Change_Dataset.csv')
    # import corporate water security response data
    ws_df = pd.read_csv('../input/cdp-unlocking-climate-solutions/Corporations/Corporations Responses/Water Security/2019_Full_Water_Security_Dataset.csv')
    # import cities response df
    cities_df = pd.read_csv("../input/cdp-unlocking-climate-solutions/Cities/Cities Responses/2020_Full_Cities_Dataset.csv")
    # external data - import CDC social vulnerability index data - census tract level
    svi_df = pd.read_csv("../input/cdp-unlocking-climate-solutions/Supplementary Data/CDC Social Vulnerability Index 2018/SVI2018_US.csv")
    # cities metadata - lat,lng locations for US cities
    cities_meta_df = pd.read_csv("../input/cdp-unlocking-climate-solutions/Supplementary Data/Simple Maps US Cities Data/uscities.csv")
    # cities metadata - CDP metadata on organisation HQ cities
    cities_cdpmeta_df = pd.read_csv("../input/cdp-unlocking-climate-solutions/Supplementary Data/Locations of Corporations/NA_HQ_public_data.csv")
    
    def list_dedupe(self, x):
        """
        Convert list to dict and back to list to dedupe

        Parameters
        ----------
        x: list
            Python list object

        Returns
        -------
        list:
            list with duplicates removed, preserving first-seen order

        """
        return list(dict.fromkeys(x))
    
    def __init__(self):
        """
        Extract city response data for question 6.2 Does your city collaborate in partnership with businesses in your city on sustainability projects?
        Map cities to organisations who are headquartered within that city, using the NA_HQ_public_data.csv meta data file
        """
        self.cities_6_2 = self.cities_df[self.cities_df['Question Number'] == '6.2']\
            .rename(columns={'Organization': 'City'})
        self.cities_6_2['Response Answer'] = self.cities_6_2['Response Answer'].fillna('No Response')
        # map dict to clean full state names to abbreviations
        self.cities_cdpmeta_df['state'] = self.cities_cdpmeta_df['address_state'].map(self.us_state_abbrev)

        # infill non-matched from dict
        self.cities_cdpmeta_df['state'] = self.cities_cdpmeta_df['state'].fillna(self.cities_cdpmeta_df['address_state'])
        self.cities_cdpmeta_df['state'] = self.cities_cdpmeta_df['state'].replace({'ALBERTA':'AB'})
        self.cities_cdpmeta_df['address_city'] = self.cities_cdpmeta_df['address_city'].replace({'CALGARY':'Calgary'})
        self.cities_cdpmeta_df= self.cities_cdpmeta_df.drop(columns=['address_state'])

        # create joint city state variable
        self.cities_cdpmeta_df['city_state'] = self.cities_cdpmeta_df['address_city'].str.cat(self.cities_cdpmeta_df['state'],sep=", ")
        #Summarise the cities metadata to count the number organisations (HQ) per city
        self.cities_count = self.cities_cdpmeta_df[['organization', 'address_city', 'state', 'city_state']]\
        .groupby(['address_city', 'state', 'city_state']).count().\
            sort_values(by = ['organization'],ascending = False)\
                .reset_index()\
                    .rename(columns={'organization' : 'num_orgs'})
        # convert indexes to columns
        self.cities_count.reset_index(inplace=True)
        self.cities_count = self.cities_count.rename(columns = {'index':'city_id'})
        self.cities_df.reset_index(inplace=True)
        self.cities_df = self.cities_df.rename(columns = {'index':'city_org_id'})

        # convert id and city label columns into lists
        self.city_id_no = self.list_dedupe(self.cities_count['city_id'].tolist())
        self.city_name = self.list_dedupe(self.cities_count['address_city'].tolist())

        self.city_org_id_no = self.list_dedupe(self.cities_df['city_org_id'].tolist())
        self.city_org_name = self.list_dedupe(self.cities_df['Organization'].tolist())

        # remove added index column in cities df
        self.cities_df.drop('city_org_id', inplace=True, axis=1)
        self.cities_count.drop('city_id', inplace=True, axis=1)

        # zip to join the lists and dict function to convert into dicts
        self.city_dict = dict(zip(self.city_id_no, self.city_name))
        self.city_org_dict = dict(zip(self.city_org_id_no, self.city_org_name))
        
        # compare dicts - matching when city name appears as a substring in the full city org name
        matches = [] # collect matched rows first; DataFrame.append was removed in pandas 2.0

        for ID, seq1 in self.city_dict.items():
            for ID2, seq2 in self.city_org_dict.items():
                m = re.search(re.escape(seq1), seq2) # escape so the city name is matched literally
                if m:
                    matches.append({'City ID No.': ID, 'address_city': seq1, 'City Org ID No.': ID2, 'City Org': seq2, 'Match' : m.group()})

        # build the df of matches in one step
        self.city_names_df = pd.DataFrame(matches, columns=['City ID No.','address_city', 'City Org ID No.','City Org', 'Match'])

        # subset for city to city org name matches
        self.city_names_df = self.city_names_df.loc[:,['address_city','City Org']]
        self.cities_count  = pd.merge(self.cities_count, self.city_names_df, on='address_city', how='left')
        self.cities_6_2 = self.cities_6_2[['City', 'Response Answer']].rename(columns={'City' : 'City Org'})
        self.cities_count = pd.merge(left=self.cities_count, right=self.cities_6_2, how='left', 
                                on ='City Org').rename(columns={'Response Answer' : 'Sustainability Project Collab.'})

        self.cities_count['Sustainability Project Collab.'] = self.cities_count['Sustainability Project Collab.'].fillna('No Response')
        self.cities_meta_df = self.cities_meta_df[['city', 'state_id', 'lat','lng']].rename(columns={'city' : 'address_city', 'state_id' : 'state'})
        
        # join coordinates to cities count
        self.cities_count = pd.merge(left=self.cities_count, right=self.cities_meta_df, how='left', on=['address_city', 'state'])

        # convert text response to question 6.2 to an integer encoding
        labels = self.cities_count['Sustainability Project Collab.'].unique().tolist()
        mapping = dict(zip(labels, range(len(labels))))
        self.cities_count['resp_int'] = self.cities_count['Sustainability Project Collab.'].map(mapping)
        
        
        self.cc_2_4a = self.cc_df[self.cc_df['question_number'] == 'C2.4a']
        cities_cdpmeta_join = self.cities_cdpmeta_df[["account_number", 'survey_year', 'address_city']]
        self.cc_2_4a = pd.merge(left=self.cc_2_4a, right=cities_cdpmeta_join, on=['account_number','survey_year'])
        
    def City_SVI_Geo(self, city_svi_df, city, shapefile):
        cc_nyc = self.cc_2_4a[(self.cc_2_4a['address_city'] == city)]
        self.cities_6_2['City Org'] = self.cities_6_2['City Org'].replace({city +' City':city})
        cc_nyc = pd.merge(left=cc_nyc, right= self.cities_6_2,  left_on=['address_city'], right_on = ['City Org']).rename(columns={'Response Answer' : 'sustain_collab'})

        #e.g.'../input/cdp-unlocking-climate-solutions/Supplementary Data/NYC CDP Census Tract Shapefiles/nyu_2451_34505.shp'
        # import shapefile of NYC census tracts
        self.geodf = gpd.read_file(shapefile)

        # join geospatial data to SVI unemployment rates ('E_UNEMP')
        gdf_join = self.geodf[['tractid', 'geometry']].to_crs('+proj=robin')
        nyc_join = city_svi_df[['E_UNEMP', 'FIPS']]  # use the argument, not the notebook-level nyc_svi_df
        gdf_join["tractid"] = pd.to_numeric(self.geodf["tractid"])
        gdf_nyc = pd.merge(left=gdf_join, right=nyc_join, how='left', left_on='tractid', right_on = 'FIPS')
        return gdf_nyc
        
    def City_CC_Resp(self, cc_city, city_svi_df, county, bb_df):
        # subset for Bronx
        #bb_df = city_svi_df[(city_svi_df.COUNTY == county)]

        # join to city and climate change response data
        print(cc_city.shape)
        cc_city_temp = cc_city.rename(columns={'City Org' : 'City'})
        city_df = pd.merge(cc_city_temp,bb_df,on='City',how='outer')
        return city_df
        
    
    # state abbreviation dictionary
    us_state_abbrev = {
        'Alabama': 'AL',
        'Alaska': 'AK',
        'American Samoa': 'AS',
        'Arizona': 'AZ',
        'Arkansas': 'AR',
        'California': 'CA',
        'Colorado': 'CO',
        'Connecticut': 'CT',
        'Delaware': 'DE',
        'District of Columbia': 'DC',
        'Florida': 'FL',
        'Georgia': 'GA',
        'Guam': 'GU',
        'Hawaii': 'HI',
        'Idaho': 'ID',
        'Illinois': 'IL',
        'Indiana': 'IN',
        'Iowa': 'IA',
        'Kansas': 'KS',
        'Kentucky': 'KY',
        'Louisiana': 'LA',
        'Maine': 'ME',
        'Maryland': 'MD',
        'Massachusetts': 'MA',
        'Michigan': 'MI',
        'Minnesota': 'MN',
        'Mississippi': 'MS',
        'Missouri': 'MO',
        'Montana': 'MT',
        'Nebraska': 'NE',
        'Nevada': 'NV',
        'New Hampshire': 'NH',
        'New Jersey': 'NJ',
        'New Mexico': 'NM',
        'New York': 'NY',
        'North Carolina': 'NC',
        'North Dakota': 'ND',
        'Northern Mariana Islands':'MP',
        'Ohio': 'OH',
        'Oklahoma': 'OK',
        'Oregon': 'OR',
        'Pennsylvania': 'PA',
        'Puerto Rico': 'PR',
        'Rhode Island': 'RI',
        'South Carolina': 'SC',
        'South Dakota': 'SD',
        'Tennessee': 'TN',
        'Texas': 'TX',
        'Utah': 'UT',
        'Vermont': 'VT',
        'Virgin Islands': 'VI',
        'Virginia': 'VA',
        'Washington': 'WA',
        'West Virginia': 'WV',
        'Wisconsin': 'WI',
        'Wyoming': 'WY'
    }


In [None]:
c = cdp_kpi()

In [None]:
#verify data captured
print("c.cc_df.head()")
print(c.cc_df.head())
print("c.cities_df.head()")
print(c.cities_df.head())
print("c.svi_df.head()")
print(c.svi_df.head())
print("c.cities_meta_df.head()") 
print(c.cities_meta_df.head()) 
print("c.cities_cdpmeta_df.head()")
print(c.cities_cdpmeta_df.head())
print("c.cities_6_2.head()")
print(c.cities_6_2.head())
print("c.cities_count.head()")
print(c.cities_count.head())
print("c.cc_2_4a.head()")
print(c.cc_2_4a.head())

In [None]:
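# subset SVI data for the five NYC counties; copy so the new column is not written to a view of c.svi_df
nyc_svi_df = c.svi_df[c.svi_df['STCNTY'].isin([36005, 36047, 36061, 36081, 36085])].copy()
nyc_svi_df['City'] = 'New York'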
cc_city = c.City_SVI_Geo(nyc_svi_df, "Bronx", '../input/cdp-unlocking-climate-solutions/Supplementary Data/NYC CDP Census Tract Shapefiles/nyu_2451_34505.shp')
bb_df = nyc_svi_df[(nyc_svi_df.COUNTY == "Bronx")]
cc_city = cc_city.rename(columns={'City Org' : 'City'})
#print(cc_city)
nyc_df = c.City_CC_Resp(cc_city,nyc_svi_df, "Bronx", bb_df)

In [None]:

print(nyc_df.shape)

In [None]:
nyc_df.head()

#### Water Responses

Identify organisations with facilities operating in the Hudson river basin, flagging companies whose operations may impact NYC's major fresh water resource.


In [None]:
ws_df_4_1c = c.ws_df[c.ws_df['question_number'] == 'W4.1c']
ws_df_4_1c = ws_df_4_1c[ws_df_4_1c['response_value'].notnull()]
ws_df_4_1c.head()         

Reshape data

- Climate change and water response datasets are often presented in long format in the CDP datasets.
- These data sets will become more useful when widened on the 'column_name' variable, enabling you to derive measurable metrics and KPIs from questionnaire response data

In [None]:
# pivot data
ws_df_4_1c_wide = ws_df_4_1c.pivot_table(index=['account_number', 'organization', 'row_number'],
                                     columns='column_name', 
                                     values='response_value',
                                     aggfunc=lambda x: ' '.join(x)).reset_index()
# identify orgs with facilities within the Hudson river basin
ws_df_4_1c_wide = ws_df_4_1c_wide[ws_df_4_1c_wide['W4.1c_C2River basin'].str.contains('Hudson', na=False)]
ws_df_4_1c_wide.head()

In [None]:
c.ws_df.head()

### Modelling

#### What next?

Suggested analysis and modelling techniques that you can apply as you tackle the [competition's problem statement](https://www.kaggle.com/c/cdp-unlocking-climate-solutions/overview/description).

Suggestions below are **only** a guide. You are not limited to these approaches - use your imagination and publicly available data to tackle this challenge from any angle you can dream of!


**NLP principles to investigate the social-environmental overlap between Corporations and Cities Climate Change 'Readiness'**

- Utilise Python's NLP capabilities and tokenization approaches such as [Term Frequency–inverse Document Frequency (TF-IDF)](https://medium.com/analytics-vidhya/getting-started-with-nlp-tokenization-document-term-matrix-tf-idf-2ea7d01f1942) (1) to construct a Document Term Matrix (DTM) from questionnaire responses, highlighting key terms in free text answers to aid in topic identification

    - e.g. summarise city 'readiness' for climate change and the hazards they anticipate (Cities Question 2.1)
    - e.g. outline the future adaptations cities must implement to prepare for environmental challenges (Cities Question 3.0)
    - e.g. find common topics in examples of collaboration between cities and business on sustainability projects (Cities Question 6.2a)

- Apply [sentiment analysis](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6267440/) (2) to detect whether a city sees opportunity (positive sentiment/polarity) (Cities Question 6.0) or concern (negative sentiment/polarity) (Cities Question 2.2) over future climate scenarios


- Combine DTM and sentiment analysis to build a KPI that incorporates measures of sentiment and susceptibility into one metric, identifying cities with high levels of perceived risk that may be open to collaboration with business as they foster climate resilience.

    - e.g. Sentiment x Susceptibility = Climate Risk Sensitivity Score
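
As a rough illustration of the TF-IDF idea above, the sketch below builds a small document-term matrix using only the Python standard library. The three responses are invented placeholders, not CDP data, and `tokenize` and `tfidf_matrix` are hypothetical helper names introduced for this example. It uses the plain `tf * log(N / df)` weighting; library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalisation, so their numbers will differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (deliberately simple)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_matrix(docs):
    """Return (vocab, rows); rows[i][j] is the TF-IDF weight of vocab[j] in docs[i]."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n_docs = len(docs)
    # document frequency: number of documents containing each term
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log(n_docs / doc_freq[t]) for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty responses
        rows.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, rows


# Hypothetical free-text answers standing in for Cities questionnaire responses
responses = [
    "Flooding and extreme heat threaten coastal neighbourhoods",
    "Drought and extreme heat reduce water supply",
    "Flooding damages transport infrastructure",
]

vocab, dtm = tfidf_matrix(responses)

# print the three highest-weighted terms per response
for row in dtm:
    top = sorted(zip(vocab, row), key=lambda pair: pair[1], reverse=True)[:3]
    print([term for term, _ in top])
```

Terms that occur in only one response receive the highest weights, which is what makes the matrix useful for surfacing topics that distinguish one city's answer from another's.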



**Social Accounting with Water Shadow Price Modeling**

Using external datasets and water-related risks identified by Corporations (Water Security Question W4.2), build a 'Shadow Price' of water for Corporations operating in a selection of North American cities. 

- A [shadow price](https://www.fir-pri-awards.org/wp-content/uploads/MasterThesis_Chisem.pdf) (3) can attempt to account for the total cost of a Corporation's water use, estimating all internal and external costs, as well as exposure to water stress.

- The shadow price coefficient can be combined with volumetric withdrawal data (Water Security Question W5.1a) to assign a Water Risk Cost per company, weighting corporate activities with a measure of the intersection between environmental risks and social impact.

    - e.g. Water risk cost for Company = Shadow price for Company * Water withdrawal volume for Company
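
The formula above can be sketched with pandas as follows. Every figure, company name and column name here is a hypothetical placeholder rather than a value from the CDP data sets, and the units (megalitres, USD per megalitre) are assumptions made for illustration; a real shadow price would have to be estimated from external data first.

```python
import pandas as pd

# Hypothetical withdrawal volumes (assumed unit: megalitres per year)
withdrawals = pd.DataFrame({
    "organization": ["Company A", "Company B", "Company C"],
    "withdrawal_megalitres": [1200.0, 300.0, 4500.0],
})

# Hypothetical shadow prices (assumed unit: USD per megalitre)
shadow_prices = pd.DataFrame({
    "organization": ["Company A", "Company B", "Company C"],
    "shadow_price_usd_per_megalitre": [850.0, 2100.0, 400.0],
})

# join the two tables, then apply: risk cost = shadow price * withdrawal volume
risk = withdrawals.merge(shadow_prices, on="organization", how="inner")
risk["water_risk_cost_usd"] = (
    risk["shadow_price_usd_per_megalitre"] * risk["withdrawal_megalitres"]
)

# rank companies from highest to lowest water risk cost
risk = risk.sort_values("water_risk_cost_usd", ascending=False).reset_index(drop=True)
print(risk[["organization", "water_risk_cost_usd"]])
```

An inner join drops any company missing from either table, which keeps the product well defined; switching to `how="left"` would instead keep companies without a shadow price and leave their cost as `NaN`.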



**References**

1. Muñoz (2020). Getting started with NLP: Tokenization, Document-Term Matrix, TF-IDF. Medium. https://medium.com/analytics-vidhya/getting-started-with-nlp-tokenization-document-term-matrix-tf-idf-2ea7d01f1942

2. Reyes-Menendez A, Saura JR, Alvarez-Alonso C. Understanding #WorldEnvironmentDay User Opinions in Twitter: A Topic-Based Sentiment Analysis Approach. Int J Environ Res Public Health. 2018;15(11):2537. Published 2018 Nov 13. doi:10.3390/ijerph15112537. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6267440/

3. FIR-PRI. Portfolio Analysis Using Water Shadow Pricing: How Valuing Water Risk Can Reduce Carbon Emissions. https://www.fir-pri-awards.org/wp-content/uploads/MasterThesis_Chisem.pdf

In [None]:
sub.to_csv('submission.csv')