# Improving City-Business Collaboration: A Data Science Approach

### Notebook by Carlos GIRONDA

To enjoy a full viewing experience of <a href="https://nbviewer.jupyter.org/github/cgironda/CDP_prj/blob/main/CDP_final.ipynb">this notebook</a>, you can also use nbviewer.

## Table of Contents <a name="TOC"></a>
1. [Required Python Libraries](#RPL)
2. [Abstract](#abs)
3. [Introduction](#intro)
4. [Overview of the Cities Datasets](#overview_1)
   - [Cities Disclosing and Cities Responses Datasets](#cdcrd)
5. [Analyzing the Datasets](#ad)
   - [Cities Disclosing Dataset](#cdd)
   - [Map of the Organizations Locations](#mol)
   - [Cities Responses Dataset](#crd)
       - [Cities Responses by Year, Country and Question Number](#crgc)
6. [Overview of the Corporations Datasets](#overview_2)
   - [Corporations Disclosing and Responses Datasets](#cdrd)
   - [Climate Change Corporations Datasets](#cccd)
7. [Corporations: Analyzing Low-Carbon Energy Technologies](#alcet)
8. [Cities: Analyzing Sources of Renewable Energy](#asre)
9. [Conclusions](#conclusions)

## Required Python Libraries <a name="RPL"></a>
[back to the top](#TOC)

This notebook uses the following Python libraries:

- Pandas: Provides a DataFrame structure to store data
- NumPy: Provides a numerical array structure for data
- Folium: Plotting library that provides an interactive Leaflet map
- Matplotlib: Python 2D plotting library that produces quality figures
- NLTK: Platform for working with human language data
- WordCloud: Library that generates word clouds and tag clouds from text
- Scikit-Learn: Library that provides tools for predictive data analysis
- Seaborn: Data visualization library for drawing informative statistical graphics

### Abstract <a name="abs"></a>

To investigate the impact of the low-carbon technologies used by corporations on the sources of renewable energy used by cities in the US, datasets disclosed by the international non-profit organization CDP (Carbon Disclosure Project) were analyzed for the year 2018.

To perform this task, text preprocessing was applied to the survey responses. The analysis found that corporations tend to use wind (aeolic) energy over other renewable sources of energy, whereas cities have installed more renewable energy via photovoltaic systems.

## Introduction <a name="intro"></a>
[back to the top](#TOC)

Several technologies based on renewable energy sources, together with operational approaches, can cost-effectively and dramatically reduce energy consumption, and help to avoid social problems while providing an equivalent or better quality of life and services.

These technologies can be enhanced by integrated renewable energy systems that are less risky for companies to implement in a natural environment, but this requires information such as:

   - a) The low-carbon energy technology most used by corporations
   - b) The amount of renewable energy installed in natural environments or within cities' boundaries

If these two items can be answered, it becomes possible to build collaboration between cities and corporations, taking into account that, in a periodic electric power demand cycle, the peak demand can be met with stored energy from renewable sources such as photovoltaic systems, wind turbines, hydropower plants, etc.

The following sections explain how items a) and b) are derived from the underlying data.

## Overview of the Cities Datasets <a name="overview_1"></a>
[back to the top](#TOC)

The code below reads the CSV files from the **Cities**, **Corporations**, and **Supplementary Data** <a href="https://www.kaggle.com/c/cdp-unlocking-climate-solutions/data" target="_blank">folders provided by the CDP</a>.

These folders are in the folder `/kaggle/input/cdp-unlocking-climate-solutions`, and the CSV files are loaded into the `files_csv` list.

In [None]:
import glob2

path = '/kaggle/input/cdp-unlocking-climate-solutions'
files_csv = sorted(glob2.glob(path + '/*/**/*.csv'))
files_csv
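
Because the cells below select files by position (for example `files_csv[0]` and `files_csv[4]`), it can help to print each position next to its path before relying on it. The following is a minimal sketch using only the standard library; the helper name `index_csv_files` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

def index_csv_files(root):
    """Return a list of (position, path) pairs for every CSV file under root."""
    root_path = Path(root)
    if not root_path.is_dir():  # nothing to list if the folder does not exist
        return []
    paths = sorted(str(p) for p in root_path.rglob('*.csv'))
    return list(enumerate(paths))

# Assumed location of the competition data; elsewhere the loop simply prints nothing.
for position, csv_path in index_csv_files('/kaggle/input/cdp-unlocking-climate-solutions'):
    print(position, csv_path)
```

Note that `rglob` also picks up CSV files located directly under `root`, so the positions could differ from those in `files_csv` if such files exist.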

### Cities Disclosing and Cities Responses Datasets<a name="cdcrd"></a>
[back to Overview of the Cities Datasets](#overview_1)

Let's see how the features of the `Cities_Disclosing` and `Cities Responses` CSV files are structured for the year 2018:

In [None]:
import pandas as pd

df_cidis18, df_cires18 = pd.read_csv(files_csv[0]), pd.read_csv(files_csv[4]) # Disclosing # Responses
print(df_cidis18.info())
print('\n')
print(df_cires18.info())

Now that we see what features these **CSV** files have, let's join the **Cities Disclosing** and **Cities Responses** files of 2018, 2019 and 2020.

In [None]:
# Cities Disclosing Datasets
df_cidis = pd.concat([pd.read_csv(files_csv[i], encoding = "utf-8") for i in range(2+1)])
col_cidis = [*(df_cidis.columns[0:5+1]), *(df_cidis.columns[9:11+1])]
df_cidis = df_cidis[col_cidis].reset_index(drop=True)
# Cities Responses Datasets
df_cires = pd.concat([pd.read_csv(files_csv[i], encoding = "utf-8", low_memory=False) for i in range(4, 6+1)])
df_cires = df_cires[df_cires.columns[1:15]].reset_index(drop=True)

In [None]:
df_cidis.head(3)

In [None]:
df_cires.head(3)

## Analyzing the Datasets <a name="ad"></a>
[back to the top](#TOC)

### Cities Disclosing Dataset<a name="cdd"></a>
[back to Analyzing the Datasets](#ad)

In this section we analyze the data to create a **map** that visually summarizes the information in the `df_cidis` DataFrame. The analysis starts by cleaning the `City Location` feature and extracting the **(latitude, longitude)** coordinates. There is no special reason to proceed in this way; however, it is convenient to analyze data that can be visualized on the aforementioned map.

In [None]:
# Number of 'City Location' rows that contain NaNs
df_cidis['City Location'].isna().sum()

Next we insert **zero** as a string into the NaN places of the `df_cidis` DataFrame.

In [None]:
df_cidis.fillna(str(0), inplace=True) # We fill with 'zeros' the 'NAN' places
df_cidis.info()

The `City Location` column is cleaned by considering only the elements that differ from the **zero** string.

In [None]:
import numpy as np
import re

lon_lat_index, lon_lat_list = ([] for l in range(2))

for idx_cl, j in enumerate(df_cidis['City Location']): # Remember that we need the 'indexes'
    if j != str(0): # The 'City Location' column is given by 'Longitude' and 'Latitude'
        lon_lat = tuple(map(float, re.sub(r'[POINT \(\)]', " ", j).strip().split()))
        lon_lat_list.append(lon_lat) # List of 'Longitude' and 'Latitude' - 2018, 2019, 2020
        lon_lat_index.append(idx_cl) # List of indexes that have not NULL values in 'City Location' column
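
The loop above strips the characters of `POINT ( )` with a character class. As an illustration of the same parsing step, the sketch below uses an explicit pattern instead. It assumes that the values look like `POINT (-73.93 40.73)` with plain decimal numbers, longitude first; `parse_point` is a hypothetical helper, not part of the notebook.

```python
import re

# Matches strings such as 'POINT (-73.93 40.73)': longitude first, then latitude.
POINT_PATTERN = re.compile(
    r'^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$')

def parse_point(text):
    """Return (longitude, latitude) as floats, or None if text is not a POINT string."""
    match = POINT_PATTERN.match(str(text))
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

print(parse_point('POINT (-73.93 40.73)'))  # (-73.93, 40.73)
print(parse_point('0'))                     # None
```

Returning `None` for anything that is not a well-formed point makes the placeholder string `'0'` and malformed entries easy to tell apart from real coordinates.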

From the `City Location` feature, the `longitude` and `latitude` are extracted and added as columns to the `df_cidis_cl` DataFrame.

In [None]:
df_cidis_crd = df_cidis.copy() # We preserve the ORIGINAL DataFrame
df_lon_lat = pd.DataFrame(lon_lat_list, columns=['longitude', 'latitude'], index=lon_lat_index)
df_cidis_cl = pd.concat([df_cidis_crd.iloc[lon_lat_index], df_lon_lat], axis=1)
df_cidis_cl.head(5)

In [None]:
df_cidis_cl.info()

We sort the `df_cidis_cl` DataFrame by the `Account Number` and `Year Reported to CDP` columns to get the `Population` for every year of the cities disclosure survey cycle.

In [None]:
# 'df_cidis_cl' DataFrame has non-ZERO values in the 'City Location' column plus two additional columns
df_cidis_cl.sort_values(['Account Number', 'Year Reported to CDP'], inplace=True)
df_cidis_cl.head()

Note that some elements of the `City Location` column are repeated. One reason is that the `df_cidis` DataFrame is the result of joining the CSV files of three consecutive years: 2018, 2019 and 2020.

Let's find the indices of duplicate rows for the `longitude` and `latitude` columns of the `df_cidis_cl` DataFrame. We just need to keep a single pair of `longitude` and `latitude` coordinates to visualize each city on the map.

On the other hand, we need to know which duplicated indexes were dropped, because we will use those indexes plus the retained index to locate the `Year Reported to CDP` and `Population` values that pop up at every `City Location` on the map.

In [None]:
# We extract the 'latitude' and 'longitude' columns from 'df_cidis_cl' DataFrame
lat_lon_df = df_cidis_cl[['latitude', 'longitude']]
lat_lon_df = lat_lon_df[lat_lon_df.duplicated(keep=False)] # The duplicated rows are identified
# The following code extract the duplicated indexes of the 'latitude' and 'longitude'
lat_lon_idx = lat_lon_df.groupby(list(lat_lon_df)).apply(lambda x: tuple(x.index)).tolist()
lat_lon_idx[:5] # These are the FIRST 5 duplicated indexes of the 'latitude' and 'longitude'
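
To make the grouping step concrete, here is a small sketch on toy data (the coordinates and index labels are made up). It collects, for each repeated coordinate pair, the tuple of row labels that share it, using the `groups` attribute of a pandas `groupby` object.

```python
import pandas as pd

# Toy coordinates; the index labels play the role of the 'df_cidis_cl' index.
toy = pd.DataFrame({'latitude': [40.7, 40.7, 51.5, 35.6, 51.5],
                    'longitude': [-74.0, -74.0, -0.1, 139.7, -0.1]},
                   index=[10, 11, 12, 13, 14])

dup = toy[toy.duplicated(keep=False)]  # rows whose coordinates occur more than once
groups = [tuple(labels)
          for labels in dup.groupby(['latitude', 'longitude']).groups.values()]
print(groups)
```

In this toy frame the row labelled `13` does not appear in any tuple, because its coordinates occur only once and `duplicated(keep=False)` selects repeated rows only.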

Only the first index of every tuple is extracted and stored in the `idx0_lat_lon` list. This list is used to construct the `df_lat_lon` DataFrame from the `df_cidis_cl` DataFrame. The new DataFrame has unique `latitude` and `longitude` rows.

In [None]:
idx0_lat_lon = [idx[0] for idx in lat_lon_idx] # Extract first indexes of every tuple in 'lat_lon_idx'
df_lat_lon = pd.DataFrame(df_cidis_cl[['latitude', 'longitude']].loc[idx0_lat_lon])
df_lat_lon.info()

In [None]:
df_lat_lon.head()

The `get_reverse_geocode_data` function used in the <a href="https://medium.com/@ericsalesdeandrade/how-to-call-rest-apis-with-pandas-and-store-the-results-in-redshift-2b35f40aa98f" target="_blank">article of Eric Sales</a>, helps to construct a new DataFrame that contains the addresses of their corresponding geographical coordinates.

Note that the `get_reverse_geocode_data` function must be used with the unique `latitude` and `longitude` values contained in the `df_lat_lon` DataFrame.

In [None]:
import requests, json, time

def get_reverse_geocode_data(row):
    try:
        YOUR_API_KEY = '3a4b154aa58257' # You should change to your own 'YOUR_API_KEY'
        url = 'https://eu1.locationiq.org/v1/reverse.php?key=' + YOUR_API_KEY \
            + '&lat=' + str(row['latitude']) + '&lon=' + str(row['longitude']) + '&format=json'
        response = (requests.get(url).text)
        response_json = json.loads(response)
        time.sleep(0.5)
        return(response_json)
    
    except Exception as e:
        raise e
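
As a possible variation on how the request URL is assembled, the sketch below builds the query string with `urllib.parse.urlencode` and reads the key from an environment variable rather than from the source code. The helper `build_reverse_url` and the variable name `LOCATIONIQ_API_KEY` are assumptions made for illustration; no request is sent.

```python
import os
from urllib.parse import urlencode

def build_reverse_url(lat, lon, api_key):
    """Build the reverse-geocoding URL with encoded query parameters."""
    query = urlencode({'key': api_key, 'lat': lat, 'lon': lon, 'format': 'json'})
    return 'https://eu1.locationiq.org/v1/reverse.php?' + query

# 'LOCATIONIQ_API_KEY' is a hypothetical environment variable name.
api_key = os.environ.get('LOCATIONIQ_API_KEY', 'YOUR_API_KEY')
print(build_reverse_url(40.73, -73.93, api_key))
```

Keeping the key outside the notebook avoids publishing it together with the code.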

After we receive the JSON response, the `API_response` column is inserted into the `df_lat_lon` DataFrame. Once the `API_response` column is flattened into columns using the `pd.json_normalize()` function, we pick the features that we need.

Unfortunately, the original indexes are lost when the `df_API` DataFrame is created, but we recover them from the `df_lat_lon` DataFrame later; the result is copied to the `df_api` DataFrame and then filtered into the `df_api_clean` DataFrame.

In [None]:
df_lat_lon['API_response'] = df_lat_lon.apply(get_reverse_geocode_data, axis=1)
df_API = pd.json_normalize(df_lat_lon['API_response']) # The `df_API` DataFrame is created
df_API = df_API[['lat', 'lon', 'display_name']]

In [None]:
print(df_API.head())
print('\n')
print(df_API.tail())
print('\n')
print(df_API.info())

The info above shows `544 non-null` elements from `560 rows`. This is because some `(latitude, longitude)` coordinates were not identified by the `get_reverse_geocode_data()` function.

On the other hand, the `lat` and `lon` columns are no longer numeric. So the process below is applied, where the indexes of `df_lat_lon` become the indexes of `df_api`.

In [None]:
df_API[['lat', 'lon']] = df_API[['lat', 'lon']].apply(pd.to_numeric) # 'lat', 'lon' columns are numeric
df_api = df_API.copy() # We preserve the original output of the `get_reverse_geocode_data` function
df_api.index = df_lat_lon.index # The original indexes of `df_lat_lon` are put into the `df_api`
df_api.head()

The `NaN` values of the `df_api` DataFrame are shown below, together with their indexes and the total number of `NaN` values for each column.

In [None]:
df_api_null = df_api[df_api.isnull().any(axis=1)]
print(df_api_null)
print('\n')
print(df_api.isnull().sum())

Using the `idx0_lat_lon` list and the `df_api_null.index` of the rows with `NaN` values, the `non_null_idx` list is obtained to build the `df_api_clean` DataFrame, which has no `NaN` values.

In [None]:
non_null_idx = [idx for idx in idx0_lat_lon if idx not in df_api_null.index]
df_api_clean = df_api.loc[non_null_idx]
df_api_clean

In [None]:
df_api_clean.info()

The `df_api_clean` DataFrame above contains no `NaN` elements.

The `lat_lon_list` list below contains `tuple` elements that are used to identify the cities in a `Folium map`. The `lat_lon_idx` list is used to create the `df_citis_flt` DataFrame.

In [None]:
lat_lon_list = list(df_api_clean.to_records())
lat_lon_list[0:5] # These are the first five tuples in the 'lat_lon_list' list

To obtain a version of the `df_cidis_cl` DataFrame free of the coordinates for which the API returned `NaN` elements, the `df_citis_flt` DataFrame is created.

We collect into the `idx_nan` list the tuples of the `lat_lon_idx` list whose first index appears among the `NaN` rows.
From there, we create the `idx_nonan` list of tuples, which is flattened to obtain the `df_citis_flt` DataFrame.

In [None]:
idx_nan = [] # List of tuples of indexes that will contain rows with 'NaNs' elements
for i in lat_lon_idx:
    for j in df_api_null.index: # We need the NaNs indexes of the 'df_api_null' DataFrame
        if i[0] == j:
            idx_nan.append(i)

idx_nonan = [i for i in lat_lon_idx if i not in idx_nan] # List of tuples of indexes that have non-NaNs
idx_flatten = [j for i in idx_nonan for j in i] # List of tuples of indexes that were flatten 
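
The same split can be sketched with a `set` for the membership test, which avoids the nested loop. This is only an illustration on made-up index tuples; `split_index_groups` is a hypothetical helper.

```python
def split_index_groups(index_groups, null_first_indexes):
    """Split tuples of row indexes by whether their first index is in null_first_indexes."""
    null_set = set(null_first_indexes)  # set membership is O(1) on average
    with_nan = [group for group in index_groups if group[0] in null_set]
    without_nan = [group for group in index_groups if group[0] not in null_set]
    flat = [idx for group in without_nan for idx in group]
    return with_nan, without_nan, flat

with_nan, without_nan, flat = split_index_groups([(1, 5), (2, 7, 9), (3, 4)], [2])
print(with_nan)     # [(2, 7, 9)]
print(without_nan)  # [(1, 5), (3, 4)]
print(flat)         # [1, 5, 3, 4]
```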

In [None]:
df_citis_flt = df_cidis_cl.loc[idx_flatten]
df_citis_flt.head()

A closer look at the `df_citis_flt` DataFrame shows that the `Account Number` and `City Location` columns do not correspond one to one: the number of unique `Account Number` identifiers, which are given to every city organisation that receives a request to complete a CDP questionnaire, is **551**, whereas the number of unique coordinates in the `City Location` column is **544**. This is shown below using the `citis_flt()` function:

In [None]:
def citis_flt(df, col1, col2):
    years = 3 # The 'City Location' shows up to three times
    col1_u, col2_u = df[col1].nunique(), df[col2].nunique()
    df_g = df.groupby([col2]).filter(lambda x: len(x) > years)
    df_nu = df_g[col2].nunique()
    return([col1_u, col2_u, df_nu, df_g])
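
The key step in `citis_flt()` is `groupby(...).filter(...)`, which keeps all rows of the groups that satisfy a condition. A minimal sketch on toy data (the locations and account numbers are invented):

```python
import pandas as pd

toy = pd.DataFrame({'City Location': ['P1', 'P1', 'P1', 'P1', 'P2', 'P2'],
                    'Account Number': [1, 1, 1, 9, 2, 2]})

# Keep only the rows of locations that occur more than three times.
crowded = toy.groupby('City Location').filter(lambda g: len(g) > 3)
print(crowded['City Location'].unique().tolist())  # ['P1']
print(len(crowded))                                # 4
```

Here `P1` appears four times because two different account numbers share it, which is the situation the function is meant to detect.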

For the years 2018, 2019 and 2020, each `City Location` should appear at most **three times**, unless there are different unique identifiers in the `Account Number` column for the same location. In this case, it appears **five** times, as shown below.

In [None]:
citis_flt(df_citis_flt, 'Account Number', 'City Location')[0:2+1]

As shown below, there are **26** unique identifiers in the `Account Number` column that are not counted in the rest of the analysis. We proceed in this way because, in some cases, there is no information about the `Population` for a particular year in the `Year Reported to CDP` column, or because the same **population** is used for different identifiers in the `Account Number` column.

In [None]:
df_citis_nou = citis_flt(df_citis_flt, 'Account Number', 'City Location')[3]
df_citis_nou.info()

After extracting the indexes from the `df_citis_nou` DataFrame, the `df_citis_u` unique DataFrame is obtained:

In [None]:
idx_u = [i for i in df_citis_flt.index if i not in df_citis_nou.index]
df_citis_u = df_citis_flt.loc[idx_u]
df_citis_u.head() # Very last clean DataFrame

In [None]:
df_citis_u.info()

### Map of the Organizations Locations <a name="mol"></a>
[back to Analyzing the Datasets](#ad)

The following code extracts the **Year Reported to CDP, City,** and **Population** features for each unique **(latitude, longitude)** coordinate from the `df_citis_u` DataFrame.

Every location on the map below shows its address when the mouse pointer hovers over it, thanks to the `get_reverse_geocode_data()` function, which uses an **API** key obtained after creating a free account on the `https://locationiq.com/` website.

In addition, if you **click** on any location on the map, it shows a **table** with the `Year Reported to CDP`, `City`, and `Population` features obtained from the `df_citis_u` DataFrame.

In [None]:
# We group by 'longitude' and 'latitude' and print 'Year Reported to CDP', 'City', 'Population' columns
df_citis_flt_cp = df_citis_u.copy() # We preserve the 'df_citis_u' DataFrame
flt_value, flt_key = ([] for l in range(2))
for key, value in df_citis_flt_cp.groupby(by=['latitude', 'longitude']):
    df_citis_mod = value[['Year Reported to CDP', 'City', 'Population']]
    flt_value.append(df_citis_mod) # This list gives a DataFrame of the columns specified in the 'value' item
    flt_key.append(key) # This list gives the (latitude, longitude) coordinate

In [None]:
import folium, branca
import folium.plugins as fp

map_osm = folium.Map(location=[50.7128, 44.0060], zoom_start=2.49, tiles='Stamen Terrain'
                     , max_bounds=True, scrollWheelZoom=False, no_wrap=True)

marker_cluster = fp.MarkerCluster().add_to(map_osm)

loc_df = [(i, j) for i, j in zip(flt_key, flt_value)]
for crd in lat_lon_list:
    for df in loc_df:
        if crd[0] == df[1].index[0]:
            address = str([k.lstrip() for k in crd[3].split(',')][:])[1:-1]
            html = df[1].to_html(classes='table table-striped table-hover table-condensed table-responsive')
            popup_op = folium.Popup(html, max_width='100%')
            folium.Marker(location=[crd[1], crd[2]], popup=popup_op, tooltip=address).add_to(marker_cluster) 
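# Add the click helpers once, after all markers are placed ('add_child' takes one child per call)
map_osm.add_child(folium.LatLngPopup())
map_osm.add_child(folium.ClickForMarker(popup='Waypoint'))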
map_osm

The geographical map above shows the number of cities per region that answered the CDP questionnaire.

The `year_nation_city_pop()` function below is used to obtain the DataFrames behind the 2D figures of the **twenty** most populated US cities in 2018, 2019, and 2020.

Unlike the **map** shown above, the 2D graphics below show the same information in a more compact way.

In [None]:
def year_nation_city_pop(df, year, country, org, pop):
    
    year_CDP = df['Year Reported to CDP'] == year
    nation = df['Country'] == country
    df_u = df[year_CDP & nation]

    org_pop_list = []
    for i, j in df_u.groupby(by=pop):
        if int(i) != 0: # 'Population' --> integers <> This is to avoid 'zero' rows
            k = j[org].array[0]
            org_pop_list.append(tuple([k, i]))
    df_n = pd.DataFrame(org_pop_list, columns=[org, pop])

    return(df_n)
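
By default, `groupby` returns its groups sorted by key in ascending order, so the order of the rows in the DataFrame built above depends on that sorting. If an explicit ranking by largest population is wanted, one option is `DataFrame.nlargest`. The sketch below uses invented organizations and populations only to illustrate the call.

```python
import pandas as pd

# Toy data with hypothetical values, only to illustrate the call.
toy = pd.DataFrame({'Organization': ['A', 'B', 'C', 'D'],
                    'Population': [120000, 0, 950000, 43000]})

# Drop the zero placeholders, then keep the two largest populations.
top2 = toy[toy['Population'] > 0].nlargest(2, 'Population')
print(top2['Organization'].tolist())  # ['C', 'A']
```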

In [None]:
def df_of_df(df, year):
    df_pop = year_nation_city_pop(df, year, 'United States of America', 'Organization', 'Population')
    return(df_pop)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')

plt.figure()

df_short_18 = df_of_df(df_citis_u, 2018).iloc[0:20]
plt.subplot(2, 2, 1)
df_short_18.plot(x = 'Organization', y = 'Population', figsize = (15, 12), kind='barh', fontsize=14
                , legend=False, title='20 Most Populated US Cities in 2018', color='#00a6d2'
                , ax=plt.gca())
plt.xlabel('Population', color='black')
plt.ylabel('Organizations', color='black')
for i in range(len(df_short_18)):
    plt.text(23000, i - 0.3, int(df_short_18['Population'][i]))
## --------------------------
df_short_19 = df_of_df(df_citis_u, 2019).iloc[0:20]
plt.subplot(2, 2, 2)
df_short_19.plot(x = 'Organization', y = 'Population', figsize = (15, 12), kind='barh', fontsize=14
                 , legend=False, title='20 Most Populated US Cities in 2019', color='#f9c642'
                 , ax=plt.gca())
plt.xlabel('Population', color='black')
plt.ylabel('Organizations', color='black')
for i in range(len(df_short_19)):
    plt.text(30000, i - 0.3, int(df_short_19['Population'][i]))
## --------------------------    
df_short_20 = df_of_df(df_citis_u, 2020).iloc[0:20]
plt.subplot(2, 2, 3)
df_short_20.plot(x = 'Organization', y = 'Population', figsize = (15, 12), kind='barh', fontsize=14
                 , legend=False, title='20 Most Populated US Cities in 2020', color='#4aa564'
                 , ax=plt.gca())
plt.xlabel('Population', color='black')
plt.ylabel('Organizations', color='black')
for i in range(len(df_short_20)):
    plt.text(33000, i - 0.3, int(df_short_20['Population'][i]))

plt.tight_layout(pad=1.5)
plt.show()

### Cities Responses Dataset<a name="crd"></a>
[back to Analyzing the Datasets](#ad)

In this section, we select from the `df_cires` DataFrame the `Question Number` feature and its corresponding row in the `Response Answer` feature, based on its `Section` feature.

Remember that:

In [None]:
df_cires.head(3)

Let's use the indexes of the `df_citis_u` DataFrame to clean the `df_cires` DataFrame based on the `Account Number` column:

In [None]:
ac_citis_u = df_citis_u['Account Number'].unique() # Cleaning list
ac_cires = df_cires['Account Number'].unique()
ac_cires_u = [i for i in ac_cires if i in ac_citis_u] # List that contains unique numbers in 'Account Number'

In [None]:
idx_ac_cires_u = [] # Lists of indexes of unique numbers in 'Account Number' column of 'df_cires' DataFrame
for idx, ide in enumerate(df_cires['Account Number']):
    if ide in ac_cires_u:
        idx_ac_cires_u.append(idx)
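
The row selection performed by the loop above can also be expressed with `Series.isin`, which builds a boolean mask in one step. A minimal sketch with invented account numbers:

```python
import pandas as pd

# Toy frame with hypothetical account numbers.
responses = pd.DataFrame({'Account Number': [1, 2, 2, 3, 4],
                          'Response Answer': ['a', 'b', None, 'd', 'e']})
kept_accounts = [2, 4]

subset = responses[responses['Account Number'].isin(kept_accounts)]
print(subset.index.tolist())  # [1, 2, 4]
```

The original index labels are preserved, so the result can still be matched against the source DataFrame.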

The `df_cires_unique` DataFrame contains the rows whose `Account Number` also appears in the `df_citis_u` DataFrame; it is sorted by the `Account Number` and `Year Reported to CDP` columns.

In [None]:
df_cires_unique = df_cires.loc[idx_ac_cires_u]
df_cires_unique.sort_values(['Account Number', 'Year Reported to CDP'], inplace=True)
df_cires_unique.info()

In [None]:
df_cires_unique.head()

The last step is to clean the `df_cires_unique` DataFrame where the `Response Answer` column has rows of `NaN` values:

In [None]:
idx_cires_u = df_cires_unique.index
idx_ra = df_cires_unique[df_cires_unique['Response Answer'].isnull()].index
idx_u = [i for i in idx_cires_u if i not in idx_ra]
df_cires_u = df_cires_unique.loc[idx_u]
df_cires_u.info()

In [None]:
df_cires_u.head()

This `df_cires_u` DataFrame was obtained from the indexes of the `df_citis_u` DataFrame based on its `Account Number` column.

#### Cities Responses by Year, Country and Question Number<a name="crgc"></a>
[back to Cities Responses Dataset](#crd)

In the following analysis, the `year_country()` function is built using the `df_citis_u` and `df_cires_u` DataFrames.

This function can be used to select any **country** in those DataFrames for the years 2018, 2019, and 2020.

In [None]:
cires_key, cires_value = ([] for l in range(2))
group_cols = df_cires_u.columns[0:4+1].tolist() # Columns grouped from the 'df_cires_u' DataFrame
shown_cols = [*df_cires_u.columns[1:2+1], *df_cires_u.columns[5:]] # Columns shown after grouping the columns
for key, value in df_cires_u.groupby(by=group_cols):
    df_cires_mod = value[shown_cols]
    cires_key.append(key)
    cires_value.append(df_cires_mod)

The `year_country()` function returns two outputs, `year_country()[0]` and `year_country()[1]`, which help to isolate the `Response Answer` column of every `Organization`, together with its `Account Number`, for any of the years 2018, 2019, and 2020.

In [None]:
def year_country(year, country, sort_column): # It allows to choose the 'year', 'country', and 'sort_column'
    key_list, value_list = ([] for l in range(2))
    for i, j in zip(cires_key, cires_value):
        if i[0] == year and i[3] == country:
            key_list.append((i[0], i[1], i[2]))
            value_list.append(j.sort_values([sort_column])) # The DataFrame is sort by column
    return([key_list, value_list])

In [None]:
def question_number(year, country, sort_column):
    fn = year_country(year, country, sort_column)
    df_list= []
    for k in range(len(fn[1])):
        for i, j in fn[1][k].groupby(by=[sort_column]):
            df_j = j[fn[1][k].columns]
            df_list.append(df_j)
    df_q_n = pd.concat(df_list).sort_values([sort_column])
    return(df_q_n)

Next, let's do the analysis for the **US** in 2018, specifically for the `Question Number` column.

In [None]:
df_US_2018_res = question_number(2018, 'United States of America', 'Question Number')
print(df_US_2018_res.info())
print('\n')
print(df_US_2018_res['Section'].unique())

## Overview of the Corporations Datasets<a name="overview_2"></a>
[back to top](#TOC)

### Corporations Disclosing and Responses Datasets<a name="cdrd"></a>
[back to Overview of the Corporations Datasets](#overview_2)

The `files_csv` list shown below is needed to concatenate the `Corporation Disclosing` datasets for 2018, 2019 and 2020, as well as the `Corporation Responses` datasets.

In [None]:
files_csv

In [None]:
import pandas as pd

# Corporation Disclosing Datasets
## The files_csv[11] --> 'Corporations_Disclosing_to_CDP_Data_Dictionary.csv', is not included
df_codis = pd.concat([pd.read_csv(files_csv[i]) for i in range(8, 14+1) if i != 11]) # range(8, 14+1) --> 'files_csv'
col_codis = [*(df_codis.columns[0:3+1]), *(df_codis.columns[9:20+1])] # Some columns are not included
df_codis = df_codis[col_codis].reset_index(drop=True)

# Corporation Responses Datasets
## The files_csv[19] --> 'Full_Corporations_Response_Data_Dictionary copy.csv', is not included
df_cores = pd.concat([pd.read_csv(files_csv[i], low_memory=False) for i in range(16, 22+1) if i != 19])
col_cores = [*(df_cores.columns[0:2+1]), *(df_cores.columns[7:18+1])] # Some columns are not included
df_cores = df_cores[col_cores].reset_index(drop=True)

In [None]:
print(df_codis.info())
print('\n')
print(df_cores.info())

Then, the `df_codis` corporation disclosing dataset and the `df_cores` corporation responses dataset are **merged** over the `account_number`, `organization`, and `survey_year` features.

In [None]:
merge_cols = ['account_number', 'organization', 'survey_year']
df_codisres = pd.merge(df_codis, df_cores, on=merge_cols, how='inner') # The indexes must be kept
df_codisres.info()
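
To see what an inner merge keeps and what it drops, it can help to run an outer merge with `indicator=True` on a small example. The sketch below uses invented frames with a single key column; the real merge uses three key columns.

```python
import pandas as pd

left = pd.DataFrame({'account_number': [1, 2, 3], 'country': ['US', 'US', 'FR']})
right = pd.DataFrame({'account_number': [2, 3, 4], 'response_value': ['x', 'y', 'z']})

inner = pd.merge(left, right, on='account_number', how='inner')
audit = pd.merge(left, right, on='account_number', how='outer', indicator=True)

print(len(inner))                                # 2
print(audit['_merge'].value_counts().to_dict())  # counts of both / left_only / right_only
```

Note that a merge on columns returns a DataFrame with a fresh integer index, so the index labels of the two inputs are not carried over.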

The `df_codisres` DataFrame is sorted by the three columns below, so that it is in ascending order, driven first by the `question_number` column.

In this way, it is less time consuming to extract at once the `response_value` for every question of the **questionnaire** for 2018, 2019 and 2020.

In [None]:
df_codisres.sort_values(['question_number', 'account_number', 'survey_year'], inplace=True)
df_codisres.head(3)

In [None]:
df_codisres.tail(3)

Next, the `df_codisres` DataFrame is cleaned by removing the rows with `NaN` values in the `response_value` feature using the `get_nonan_df()` function, which yields the `df_codisres_u` DataFrame.

In [None]:
def get_nonan_df(df, colnan):
    idx = df.index
    idx_nan = df[df[colnan].isnull()].index
    idx_u = [i for i in idx if i not in idx_nan]
    df_u = df.loc[idx_u]
    return(df_u)
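
For a DataFrame with unique index labels, the same rows can be obtained with `DataFrame.dropna(subset=...)`, which avoids the Python-level loop over the index. A minimal sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({'question_number': ['C1', 'C2', 'C3'],
                    'response_value': ['yes', None, 'no']})

# Drop the rows whose 'response_value' is missing; other columns are not inspected.
clean = toy.dropna(subset=['response_value'])
print(clean.index.tolist())  # [0, 2]
```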

In [None]:
df_codisres_u = get_nonan_df(df_codisres, 'response_value')
df_codisres_u.info()

### Climate Change Corporations Datasets<a name="cccd"></a>
[back to Overview of the Corporations Datasets](#overview_2)

The `df_year_nation_glbissue()` function selects rows of the `df` DataFrame by `year`, `nation`, and **global issue**.

In [None]:
def df_year_nation_glbissue(df, year, nation, glbissue):
    col_1, col_2, col_3 = 'survey_year', 'country', 'theme'
    survey, country, glbissue = df[col_1] == year, df[col_2] == nation, df[col_3] == glbissue
    df_res = df[survey & country & glbissue]
    return(df_res)

In the following, the function above is used to obtain the `df_cc_2018` DataFrame for the US with `Climate Change` as the global issue in 2018. *(Note: because of time constraints, the analysis is narrowed to 2018 only; however, the code is written so that it can be used for any of the years 2018, 2019, or 2020.)*

From this DataFrame the unique modules of the `module_name` column are obtained as shown below: 

In [None]:
# 'Climate Change' based DataFrame
df_cc_2018 = df_year_nation_glbissue(df_codisres, 2018, 'United States of America', 'Climate Change')
df_cc_2018['module_name'].unique() # From 'df_codisres_u.info()' above we select the 'module_name' column

The `name_of_module()` function below obtains the DataFrame for the **Energy** module from the `df_cc_2018` DataFrame.

In [None]:
def name_of_module(df, module, endstr): # module: column name, endstr: suffix that the module name must end with
    
    module_key, module_value, name_module = ([] for l in range(3))   
    for key, value in df.groupby(by=module):
        module_key.append(key)
        module_value.append(value)
    for i, j in zip(module_key, module_value):
        if i.endswith(endstr):
            name_module.append(j) # 'name_module' has only ONE element --> [0]
    
    return(name_module[0]) 
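
A suffix match like the one above can also be written as a boolean mask with `Series.str.endswith`. The sketch below uses invented module names; unlike `name_of_module()`, which returns the first matching group, the mask keeps the rows of every module whose name ends with the suffix.

```python
import pandas as pd

# Hypothetical module names, only to illustrate the mask.
toy = pd.DataFrame({'module_name': ['C8. Energy', 'C1. Governance', 'C8. Energy'],
                    'response_value': ['a', 'b', 'c']})

energy = toy[toy['module_name'].str.endswith('nergy')]
print(energy['response_value'].tolist())  # ['a', 'c']
```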

In [None]:
df_energy18 = name_of_module(df_cc_2018, 'module_name', 'nergy')
df_energy18.head(3)

From the `question_number` column of the `df_energy18` DataFrame shown below, question `C8.2f` contains a sub-question related to the **low carbon technology type**, as shown by the `column_name` column, which is extracted for analysis.

In [None]:
print(df_energy18['question_number'].unique().tolist())
print('\n')
print(df_energy18['column_name'].unique())

With the help of the `df_carbon()` function below, we obtain the `response_value` column for each of the five `C8.2f` questions above.

The `response_value` column of the output `qC2` DataFrame obtained via the `df_carbon()` function will help to analyze the question related to the **low carbon technology type** in the next section.

In [None]:
def df_carbon(df, que_num, que, col_name, analysis):
    
    # Select the rows of question 'que' in the 'que_num' column, sorted by 'col_name'
    df_que = df[df[que_num] == que].sort_values(col_name)
    df_all = get_nonan_df(df_que, analysis) # 'get_nonan_df()' cleans the 'analysis' column of 'df_que'

    questions = df_all[col_name].unique().tolist()
    carbon_questions = []
    for q in questions:
        for i, j in df.groupby(by=col_name):
            if i == q:
                carbon_questions.append(get_nonan_df(j, analysis))
    
    return(carbon_questions)

In [None]:
qC1, qC2, qC3, qC4, qC5 = df_carbon(df_energy18, 'question_number', 'C8.2f', 'column_name', 'response_value')

As we can see below, the `qC2` DataFrame is related to the **Low-Carbon Technology** question.

In [None]:
df_qC2 = qC2[['column_name', 'response_value']]
df_qC2.head()

## Corporations: Analyzing Low-Carbon Energy Technologies<a name="alcet"></a>
[back to the top](#TOC)

In [None]:
import nltk, string
from nltk.tokenize import word_tokenize # our tokenizer
from nltk.corpus import stopwords # used for preprocessing
from nltk.stem import WordNetLemmatizer # used for preprocessing
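nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt') # tokenizer models used by word_tokenize ('punkt_tab' on recent NLTK releases)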

In [None]:
print(df_qC2.info())
print('\n')
print(df_qC2.shape)

In the **preprocessing text analysis** below, I used the `preprocessing()` function that I found in <a href="https://github.com/stgran/Coursework/blob/master/Practical%20Data%20Science/Preprocessing_Text_Data_in_Python.ipynb" target="_blank">this link</a>:

In [None]:
# remove urls, handles, and the hashtag from hashtags, see the link below: 
#https://stackoverflow.com/questions/8376691/how-to-remove-hashtag-user-link-of-a-tweet-using-regular-expression
def remove_urls(text):
    clean_text = ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ", text).split())
    return(clean_text)

# make all text lowercase
def text_lowercase(text): 
    return(text.lower())

# remove numbers
def remove_numbers(text): 
    nonumbers = re.sub(r'\d+', '', text) 
    return(nonumbers)

# remove punctuation
def remove_punctuation(text): 
    nopunct = str.maketrans('','', string.punctuation)
    return(text.translate(nopunct))

# tokenize
def tokenize(text):
    words = word_tokenize(text)
    return(words)

# remove stopwords
stop_words = set(stopwords.words('english'))
def remove_stopwords(text):
    words = [w for w in text if w not in stop_words]
    return(words)

# lemmatize
lemmatizer = WordNetLemmatizer()
def lemmatize(text):
    lem_text = [lemmatizer.lemmatize(token) for token in text]
    return(lem_text)

def preprocessing(text):
    text = text_lowercase(text)
    text = remove_urls(text)
    text = remove_numbers(text)
    text = remove_punctuation(text)
    text = tokenize(text)
    text = remove_stopwords(text)
    text = lemmatize(text)
    text = ' '.join(text) # rejoins the list of tokens
    return(text)
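
To make the order of the steps easier to follow, here is a minimal standard-library sketch of the same kind of pipeline applied to one made-up sentence. It is only an approximation of `preprocessing()`: the stopword set `DEMO_STOPWORDS` and the helper `demo_preprocess()` are hypothetical names introduced for illustration, whitespace splitting stands in for `word_tokenize`, and lemmatization is left out.

```python
import re
import string

# Assumption: a tiny illustrative stopword set, not NLTK's English list
DEMO_STOPWORDS = {"the", "and", "of", "at", "is", "a", "see"}

def demo_preprocess(text):
    """Simplified stand-in for preprocessing(): no NLTK, no lemmatization."""
    text = text.lower()                                    # lowercase
    text = re.sub(r"\w+://\S+", " ", text)                 # drop URLs
    text = re.sub(r"\d+", "", text)                        # drop numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    tokens = [w for w in text.split() if w not in DEMO_STOPWORDS]     # tokenize + stopwords
    return " ".join(tokens)                                # rejoin the tokens

sample = "Solar PV: 25 MW installed; see https://example.org/report"
demo_result = demo_preprocess(sample)
print(demo_result)
```

The output is a single lowercase string of space-separated tokens, which is the shape that the later word-cloud and word-count steps expect.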

The phrase **"low carbon technology please specify"** appears frequently in the `response_value` column of the `df_qC2` DataFrame, so the `unused_sentence()` function removes it along with a few other uninformative phrases. It also merges **solar pv** into the single token **solarpv**.

In [None]:
def unused_sentence(text):
    sentence = re.sub(r'\b\w*(low carbon technology|please specify|renewable energy|including)\w*\b', '', text)
    sentence = sentence.replace('solar pv', 'solarpv').strip()
    return(sentence)
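
One plausible reason for merging **solar pv** into **solarpv** is that a word-level counter would otherwise split the term into two separate entries. The sketch below illustrates this with plain `re`, using the token pattern that scikit-learn documents as the `CountVectorizer` default (words of two or more characters); the sample string is made up.

```python
import re

# Default token pattern of scikit-learn's CountVectorizer: words of 2+ characters
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

sample_text = "solar pv wind"  # hypothetical cleaned response

# Without merging, 'solar' and 'pv' are counted as two unrelated words
before_merge = re.findall(TOKEN_PATTERN, sample_text)

# After merging, the technology is counted as one token
after_merge = re.findall(TOKEN_PATTERN, sample_text.replace("solar pv", "solarpv"))

print(before_merge)
print(after_merge)
```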

In [None]:
df_qC2_copy = df_qC2.copy() # The original DataFrame is preserved
response_clean = df_qC2_copy['response_value'].apply(lambda x: preprocessing(x)) \
                                              .apply(lambda x: unused_sentence(x))
df_qC2_copy['response_clean'] = response_clean
df_qC2_copy.head()

To verify that the text preprocessing worked as intended, let's build a visual representation of the most common words in the `response_clean` column and check whether more preprocessing is needed.

To do so, the `response_clean` column is joined into a **single string** as shown below, after importing the `wordcloud` library.

In [None]:
def word_cloud(df, col):
    # Import the wordcloud library
    from wordcloud import WordCloud

    single_string = ','.join(df[col].tolist())
    # Create a WordCloud object
    wordcloud = WordCloud(background_color="white", max_words=1000, contour_width=2, contour_color='steelblue')
    wordcloud.generate(single_string)
    return(wordcloud.to_image())

In [None]:
word_cloud(df_qC2_copy, 'response_clean')

As the image above shows, words like **wind** and **solar pv** appear frequently in the text.

Let's see how often these and other words occur in the whole text.

In [None]:
def words_counts(df, col):
    
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer(stop_words='english')
    count_data = vectorizer.fit_transform(df[col].tolist())

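    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() exists since 1.0
    words = vectorizer.get_feature_names_out()
    # Sum each column of the sparse document-term matrix to get the total count per word
    counts = np.asarray(count_data.sum(axis=0)).ravel()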

    words_counts = zip(words, counts)
    words_counts = sorted(words_counts, key=lambda x:x[1], reverse=True)[0: 7+1]
    df_words_counts = pd.DataFrame(words_counts, columns=['words', 'counts'])
    
    return(df_words_counts)

In [None]:
words_counts(df_qC2_copy, 'response_clean')
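
As a quick cross-check of the table above, the same kind of ranking can be sketched with the standard library alone. The list `demo_responses` is hypothetical data, not taken from the CDP datasets. Unlike `CountVectorizer(stop_words='english')`, this version keeps one-letter tokens and does not filter stop words, so totals on real data may differ slightly.

```python
from collections import Counter

# Hypothetical cleaned responses, standing in for the 'response_clean' column
demo_responses = ["solarpv wind", "wind hydropower", "solarpv wind biomass"]

# Count every whitespace-separated word across all responses
demo_counter = Counter(word for response in demo_responses for word in response.split())

# Keep the two most frequent words, analogous to the top-8 slice in words_counts()
demo_top = demo_counter.most_common(2)
print(demo_top)
```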

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')
plt.figure(figsize=(8,5))
plt.title('Most Common Words for Low-Carbon Source of Energy in 2018')
df_w_c_co = words_counts(df_qC2_copy, 'response_clean')
for i in range(len(df_w_c_co)):
    plt.text(220, i + 0.2, int(df_w_c_co['counts'][i]), weight='bold')

sns.set(font_scale = 1.5)
ax = sns.barplot(y = "words", x = "counts", data = df_w_c_co)

After preprocessing the `response_value` column (taken from the `df_energy18` DataFrame), **wind, solar PV, and hydropower** emerge as the most frequent terms. In some cases, Corporations support **solar PV**, **wind power**, and **hydropower** generation at the same time as renewable energy sources.

## Cities: Analyzing Sources of Renewable Energy<a name="asre"></a>
[back to the top](#TOC)

Recall from [this subsection](#crgc) that the `question_number()` function was used to obtain the `df_US_2018_res` DataFrame:

In [None]:
df_US_2018_res.head(3)

In [None]:
print(df_US_2018_res.info()) # It shows the info about the 'columns' of the DataFrame
print('\n')
print(df_US_2018_res['Section'].unique()) # It shows the 'Sections' of the questionnaire
print('\n')
print(df_US_2018_res['Question Number'].unique()) # It shows the 'Question Numbers' of the questionnaire

The `df_renewable_energy()` function below extracts the rows for question `9.1` (in the `Question Number` column) within the `Energy` section (in the `Section` column), from which the `Row Name` and `Response Answer` columns are then selected. This question concerns the amount of **renewable energy** (in MW capacity) installed within the city boundary in the categories **solar PV**, **solar thermal**, **ground or water source**, **wind**, or other.

In [None]:
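def df_renewable_energy(df, col1, col2, num):
    # Keep only the rows of the 'Energy' section that belong to the given question number.
    # A boolean mask returns an empty DataFrame (instead of failing) if no 'Energy' rows exist.
    df_j = df[(df[col1] == 'Energy') & (df[col2] == num)]
    return(df_j)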

In [None]:
df_energy_18 = df_renewable_energy(df_US_2018_res, 'Section', 'Question Number', '9.1')
df_e_18 = df_energy_18[['Row Name', 'Response Answer']]
df_e_18

The `erase_sentence()` function below removes unnecessary text and merges multi-word terms into single tokens, for example **solar pv** into **solarpv**.

In [None]:
def erase_sentence(text):
    sentence = re.sub(r'\b\w*(renewable district heat cooling)\w*\b', '', text)
    sentence = sentence.replace('solar pv', 'solarpv').strip() #'solar pv' is a repeated word
    sentence = sentence.replace('ground water source', 'GroundWaterSource').strip()
    sentence = sentence.replace('solar thermal', 'SolarThermal').strip() #'solar thermal' is a repeated word
    return(sentence)

The preprocessing text analysis is applied below:

In [None]:
df_e_18_copy = df_e_18.copy() # The original DataFrame is preserved
Row_Name_Clean = df_e_18_copy['Row Name'].apply(lambda x: preprocessing(x))\
                                         .apply(lambda x: erase_sentence(x))
df_e_18_copy['Row_Name_Clean'] = Row_Name_Clean
df_e_18_copy.head()

As we can see, the new `Row_Name_Clean` column contains some empty strings.

As shown below, after dropping these empty rows, the amount of **renewable energy (in MW)** from 'Wind' and 'Solar' sources is extracted.

In [None]:
# Rows whose cleaned name is empty are dropped (.copy() avoids a SettingWithCopyWarning below).
df_e_18_copy = df_e_18_copy[df_e_18_copy['Row_Name_Clean'].str.strip().astype(bool)].copy()
# The 'Response Answer' column is converted to 'float'
df_e_18_copy['Response Answer'] = df_e_18_copy['Response Answer'].apply(float)
# The amount of 'renewable energy (in MW)' from 'Wind' and 'Solar' sources is extracted
df_e_18_MW_s = df_e_18_copy[df_e_18_copy['Response Answer'] > 500]
df_e_18_MW_w = df_e_18_copy[(df_e_18_copy['Response Answer'] > 0.0) & (df_e_18_copy['Response Answer'] < 1)]
print('Solar Sources:')
print(df_e_18_MW_s)
print('\n')
print('Wind Sources:')
print(df_e_18_MW_w)
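
The float conversion in the cell above assumes that every `Response Answer` value is a numeric string. If some answers were free text or missing, `float` would raise an error. The helper below is a hedged sketch of a more tolerant converter; the name `to_float_or_nan()` and the sample answers are hypothetical and not part of the original analysis.

```python
import math

def to_float_or_nan(value):
    """Return float(value), or NaN when the value cannot be parsed as a number."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan

# Hypothetical raw answers, including a non-numeric string and a missing value
demo_answers = ["12.5", "0", "n/a", None]
demo_floats = [to_float_or_nan(v) for v in demo_answers]
print(demo_floats)
```

It could be plugged in as `df_e_18_copy['Response Answer'].apply(to_float_or_nan)`. Rows that become NaN fail both threshold comparisons, so they would simply drop out of the two filtered tables.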

The **most common words for installed sources of renewable energy** chart below is consistent with the amount of renewable energy that cities have installed via solar power systems.

According to the table above, more cities have installed renewable energy via photovoltaic systems than via wind devices.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')
plt.figure(figsize=(8,5))
plt.title('Most Common Words for Installed Sources of Energy in the Cities 2018')
df_w_c_ci = words_counts(df_e_18_copy, 'Row_Name_Clean')
for i in range(len(df_w_c_ci)):
    plt.text(52, i + 0.1, int(df_w_c_ci['counts'][i]), weight='bold')

sns.set(font_scale = 1.5)
ax = sns.barplot(y = "words", x = "counts", data = df_w_c_ci)

## Conclusions<a name="conclusions"></a>
[back to the top](#TOC)

After analyzing the impact of low-carbon technology used by Corporations on the sources of renewable energy used by Cities in the US during 2018, it was found that:

 - a) The low-carbon energy technologies most used by Corporations are predominantly wind devices, photovoltaic systems that harvest solar energy, and hydropower.

 - b) The renewable energy installed in natural environments or within city boundaries is mainly solar and wind energy, as shown in the graphics of the last two sections.

So, it is possible to build a collaboration bridge between corporations and cities: the peak demand for electricity in cities can be balanced by stored renewable energy supplied by corporations. This would also improve the cost-effectiveness of energy consumption in cities, contributing to a better quality of life in society.