This notebook focuses on the data only: we dissect the provided files, giving remarks and ideas along the way.
[CDP: Unlocking Climate Solutions | Kaggle](https://www.kaggle.com/c/cdp-unlocking-climate-solutions)

# Initial Setup: Installing Dependencies & Importing Libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

# import os
# for dirname, _, filenames in os.walk('/kaggle/input'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# The next line is used to avoid the following warning --> debconf: delaying package configuration, since apt-utils is not installed
# Source: [[16.04] debconf: delaying package configuration, since apt-utils is not installed · Issue #319 · phusion/baseimage-docker](https://github.com/phusion/baseimage-docker/issues/319)
!apt-get update > /dev/null
!apt-get install -y --no-install-recommends apt-utils > /dev/null

In [None]:
# Upgrading pip to avoid the following warning:
# WARNING: You are using pip version 20.3.1; however, version 20.3.3 is available.
# You should consider upgrading via the '/opt/conda/bin/python3.7 -m pip install --upgrade pip' command.
!/opt/conda/bin/python3.7 -m pip install --upgrade pip > /dev/null

In [None]:
# Installing Dependencies
!pip install git+https://github.com/GeospatialPython/pyshp.git > /dev/null # installs PyShp (https://github.com/GeospatialPython/pyshp)
!pip install pandas-profiling[notebook] > /dev/null
!pip install simpledbf > /dev/null
!pip install geopandas > /dev/null
!apt-get install tree > /dev/null
!pip install sweetviz > /dev/null
!pip install folium > /dev/null

In [None]:
from scipy.spatial import ConvexHull, convex_hull_plot_2d
from pandas_profiling import ProfileReport
from IPython.display import clear_output
import xml.etree.ElementTree as ET
import matplotlib.pyplot as plt
from collections import Counter
from simpledbf import Dbf5
from pprint import pprint
from pathlib import Path
import geopandas as gpd  # [geopandas/geopandas: Python tools for geographic data](https://github.com/geopandas/geopandas)
import sweetviz as sv
import shapefile
import folium
import re

In [None]:
# [python - How to make inline plots in Jupyter Notebook larger? - Stack Overflow](https://stackoverflow.com/questions/36367986/how-to-make-inline-plots-in-jupyter-notebook-larger)
plt.rcParams['figure.figsize'] = [12, 8]
plt.rcParams['figure.dpi'] = 100 # 200 e.g. is really fine, but slower

In [None]:
# Dataset File Hierarchy
!tree /kaggle/input

# Helper Functions & Global Variables

In [None]:
CDP_UNLOCKING_CLIMATE_SOLUTIONS                                              = Path('/kaggle/input/cdp-unlocking-climate-solutions')
CITIES                                                                       = CDP_UNLOCKING_CLIMATE_SOLUTIONS / 'Cities'
CITIES_DISCLOSING                                                            = CITIES / 'Cities Disclosing'
_2018_CITIES_DISCLOSING_TO_CDP_CSV                                           = CITIES_DISCLOSING / '2018_Cities_Disclosing_to_CDP.csv'
_2019_CITIES_DISCLOSING_TO_CDP_CSV                                           = CITIES_DISCLOSING / '2019_Cities_Disclosing_to_CDP.csv'
_2020_CITIES_DISCLOSING_TO_CDP_CSV                                           = CITIES_DISCLOSING / '2020_Cities_Disclosing_to_CDP.csv'
CITIES_DISCLOSING_TO_CDP_DATA_DICTIONARY_CSV                                 = CITIES_DISCLOSING / 'Cities_Disclosing_to_CDP_Data_Dictionary.csv'
CITIES_QUESTIONNAIRES                                                        = CITIES / 'Cities Questionnaires'
_2018_CITIES_QUESTIONNAIRE_PDF                                               = CITIES_QUESTIONNAIRES / '2018_Cities_Questionnaire.pdf'
_2019_CITIES_QUESTIONNAIRE_PDF                                               = CITIES_QUESTIONNAIRES / '2019_Cities_Questionnaire.pdf'
_2020_CITIES_QUESTIONNAIRE_PDF                                               = CITIES_QUESTIONNAIRES / '2020_Cities_Questionnaire.pdf'
CITIES_RESPONSES                                                             = CITIES / 'Cities Responses'
_2018_FULL_CITIES_DATASET_CSV                                                = CITIES_RESPONSES / '2018_Full_Cities_Dataset.csv'
_2019_FULL_CITIES_DATASET_CSV                                                = CITIES_RESPONSES / '2019_Full_Cities_Dataset.csv'
_2020_FULL_CITIES_DATASET_CSV                                                = CITIES_RESPONSES / '2020_Full_Cities_Dataset.csv'
FULL_CITIES_RESPONSE_DATA_DICTIONARY_CSV                                     = CITIES_RESPONSES / 'Full_Cities_Response_Data_Dictionary.csv'
CORPORATIONS                                                                 = CDP_UNLOCKING_CLIMATE_SOLUTIONS / 'Corporations'
CORPORATIONS_DISCLOSING                                                      = CORPORATIONS / 'Corporations Disclosing'
CLIMATE_CHANGE                                                               = CORPORATIONS_DISCLOSING / 'Climate Change'
_2018_CORPORATES_DISCLOSING_TO_CDP_CLIMATE_CHANGE_CSV                        = CLIMATE_CHANGE / '2018_Corporates_Disclosing_to_CDP_Climate_Change.csv'
_2019_CORPORATES_DISCLOSING_TO_CDP_CLIMATE_CHANGE_CSV                        = CLIMATE_CHANGE / '2019_Corporates_Disclosing_to_CDP_Climate_Change.csv'
_2020_CORPORATES_DISCLOSING_TO_CDP_CLIMATE_CHANGE_CSV                        = CLIMATE_CHANGE / '2020_Corporates_Disclosing_to_CDP_Climate_Change.csv'
CORPORATIONS_DISCLOSING_TO_CDP_DATA_DICTIONARY_CSV                           = CLIMATE_CHANGE / 'Corporations_Disclosing_to_CDP_Data_Dictionary.csv'
WATER_SECURITY                                                               = CORPORATIONS_DISCLOSING / 'Water Security'
_2018_CORPORATES_DISCLOSING_TO_CDP_WATER_SECURITY_CSV                        = WATER_SECURITY / '2018_Corporates_Disclosing_to_CDP_Water_Security.csv'
_2019_CORPORATES_DISCLOSING_TO_CDP_WATER_SECURITY_CSV                        = WATER_SECURITY / '2019_Corporates_Disclosing_to_CDP_Water_Security.csv'
_2020_CORPORATES_DISCLOSING_TO_CDP_WATER_SECURITY_CSV                        = WATER_SECURITY / '2020_Corporates_Disclosing_to_CDP_Water_Security.csv'
CORPORATIONS_DISCLOSING_TO_CDP_DATA_DICTIONARY_CSV                           = WATER_SECURITY / 'Corporations_Disclosing_to_CDP_Data_Dictionary.csv'
CORPORATIONS_QUESTIONNAIRES                                                  = CORPORATIONS / 'Corporations Questionnaires'
CLIMATE_CHANGE                                                               = CORPORATIONS_QUESTIONNAIRES / 'Climate Change'
_2018_CLIMATE_CHANGE_QUESTIONNAIRE_PDF                                       = CLIMATE_CHANGE / '2018_Climate_Change_Questionnaire.pdf'
_2019_CLIMATE_CHANGE_QUESTIONNAIRE_PDF                                       = CLIMATE_CHANGE / '2019_Climate_Change_Questionnaire.pdf'
_2020_CLIMATE_CHANGE_QUESTIONNAIRE_PDF                                       = CLIMATE_CHANGE / '2020_Climate_Change_Questionnaire.pdf'
CDP_CLIMATE_CHANGE_CHANGES_DOCUMENT_PDF                                      = CLIMATE_CHANGE / 'CDP-climate-change-changes-document.pdf'
WATER_SECURITY                                                               = CORPORATIONS_QUESTIONNAIRES / 'Water Security'
_2018_WATER_SECURITY_QUESTIONNAIRE_PDF                                       = WATER_SECURITY / '2018_Water_Security_Questionnaire.pdf'
_2019_WATER_SECURITY_QUESTIONNAIRE_PDF                                       = WATER_SECURITY / '2019_Water_Security_Questionnaire.pdf'
_2020_WATER_SECURITY_QUESTIONNAIRE_PDF                                       = WATER_SECURITY / '2020_Water_Security_Questionnaire.pdf'
CDP_WATER_CHANGES_DOCUMENT_PDF                                               = WATER_SECURITY / 'CDP-water-changes-document.pdf'
CORPORATIONS_RESPONSES                                                       = CORPORATIONS / 'Corporations Responses'
CLIMATE_CHANGE                                                               = CORPORATIONS_RESPONSES / 'Climate Change'
_2018_FULL_CLIMATE_CHANGE_DATASET_CSV                                        = CLIMATE_CHANGE / '2018_Full_Climate_Change_Dataset.csv'
_2019_FULL_CLIMATE_CHANGE_DATASET_CSV                                        = CLIMATE_CHANGE / '2019_Full_Climate_Change_Dataset.csv'
_2020_FULL_CLIMATE_CHANGE_DATASET_CSV                                        = CLIMATE_CHANGE / '2020_Full_Climate_Change_Dataset.csv'
FULL_CORPORATIONS_RESPONSE_DATA_DICTIONARY_COPY_CSV                          = CLIMATE_CHANGE / 'Full_Corporations_Response_Data_Dictionary copy.csv'
WATER_SECURITY                                                               = CORPORATIONS_RESPONSES / 'Water Security'
_2018_FULL_WATER_SECURITY_DATASET_CSV                                        = WATER_SECURITY / '2018_Full_Water_Security_Dataset.csv'
_2019_FULL_WATER_SECURITY_DATASET_CSV                                        = WATER_SECURITY / '2019_Full_Water_Security_Dataset.csv'
_2020_FULL_WATER_SECURITY_DATASET_CSV                                        = WATER_SECURITY / '2020_Full_Water_Security_Dataset.csv'
FULL_CORPORATIONS_RESPONSE_DATA_DICTIONARY_CSV                               = WATER_SECURITY / 'Full_Corporations_Response_Data_Dictionary.csv'
SUPPLEMENTARY_DATA                                                           = CDP_UNLOCKING_CLIMATE_SOLUTIONS / 'Supplementary Data'
CDC_500_CITIES_CENSUS_TRACT_DATA                                             = SUPPLEMENTARY_DATA / 'CDC 500 Cities Census Tract Data'
_500_CITIES__CENSUS_TRACT_LEVEL_DATA__GIS_FRIENDLY_FORMAT___2019_RELEASE_CSV = CDC_500_CITIES_CENSUS_TRACT_DATA / '500_Cities__Census_Tract-level_Data__GIS_Friendly_Format___2019_release.csv'
CDC_SOCIAL_VULNERABILITY_INDEX_2018                                          = SUPPLEMENTARY_DATA / 'CDC Social Vulnerability Index 2018'
SVI2018_US_CSV                                                               = CDC_SOCIAL_VULNERABILITY_INDEX_2018 / 'SVI2018_US.csv'
SVI2018_US_COUNTY_CSV                                                        = CDC_SOCIAL_VULNERABILITY_INDEX_2018 / 'SVI2018_US_COUNTY.csv'
DATASET_LICENSES                                                             = SUPPLEMENTARY_DATA / 'Dataset Licenses'
CDP_DATASET_LICENSES_TXT                                                     = DATASET_LICENSES / 'CDP_dataset_licenses.txt'
SUPPLEMENTARY_DATASET_LICENSES_TXT                                           = DATASET_LICENSES / 'Supplementary_dataset_licenses.txt'
LOCATIONS_OF_CORPORATIONS                                                    = SUPPLEMENTARY_DATA / 'Locations of Corporations'
NA_HQ_PUBLIC_DATA_CSV                                                        = LOCATIONS_OF_CORPORATIONS / 'NA_HQ_public_data.csv'
NYC_CDP_CENSUS_TRACT_SHAPEFILES                                              = SUPPLEMENTARY_DATA / 'NYC CDP Census Tract Shapefiles'
NYU_2451_34505_DBF                                                           = NYC_CDP_CENSUS_TRACT_SHAPEFILES / 'nyu_2451_34505.dbf'
NYU_2451_34505_PRJ                                                           = NYC_CDP_CENSUS_TRACT_SHAPEFILES / 'nyu_2451_34505.prj'
NYU_2451_34505_SHP                                                           = NYC_CDP_CENSUS_TRACT_SHAPEFILES / 'nyu_2451_34505.shp'
NYU_2451_34505_SHX                                                           = NYC_CDP_CENSUS_TRACT_SHAPEFILES / 'nyu_2451_34505.shx'
NYU_2451_34505_ISO_XML                                                       = NYC_CDP_CENSUS_TRACT_SHAPEFILES / 'nyu_2451_34505_iso.xml'
RECOMMENDATIONS_FROM_CDP                                                     = SUPPLEMENTARY_DATA / 'Recommendations from CDP'
CDP_RECOMMENDATIONS_FOR_QUESTIONS_TO_FOCUS_ON_XLSX                           = RECOMMENDATIONS_FROM_CDP / 'CDP_recommendations_for_questions_to_focus_on.xlsx'
CDP_RECOMMENDATIONS_FOR_SUPPLEMENTARY_DATASETS_TO_INCLUDE_XLSX               = RECOMMENDATIONS_FROM_CDP / 'CDP_recommendations_for_supplementary_datasets_to_include.xlsx'
SIMPLE_MAPS_US_CITIES_DATA                                                   = SUPPLEMENTARY_DATA / 'Simple Maps US Cities Data'
USCITIES_CSV                                                                 = SIMPLE_MAPS_US_CITIES_DATA / 'uscities.csv'

In [None]:
def file_path_wrapper(absolute_path):
    full_file_name = absolute_path.split('/')[-1] # file_name + extension
    return {
        'path': absolute_path,
        'file_name': '.'.join(full_file_name.split('.')[:-1]),
        'extension': full_file_name.split('.')[-1]
    }
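
The string splitting in `file_path_wrapper` can also be expressed with `pathlib`, which this notebook already imports. The sketch below is an alternative, not the implementation used in the rest of the notebook; `file_path_wrapper_pathlib` is a hypothetical name. One behavioural difference: for a file name without any dot, the original returns the whole name as the extension, while this variant returns an empty extension.

```python
from pathlib import PurePosixPath

def file_path_wrapper_pathlib(absolute_path):
    """Hypothetical pathlib-based variant of file_path_wrapper."""
    p = PurePosixPath(absolute_path)
    return {
        'path': str(p),
        'file_name': p.stem,                # name without the last suffix
        'extension': p.suffix.lstrip('.'),  # suffix without the leading dot
    }

info = file_path_wrapper_pathlib(
    '/kaggle/input/cdp-unlocking-climate-solutions/Cities/Cities Disclosing/2018_Cities_Disclosing_to_CDP.csv'
)
print(info['file_name'], info['extension'])
```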

In [None]:
def list_files():
    import os
    for dirname, _, filenames in os.walk('/kaggle/input'):
        for filename in filenames:
            absolute_path = os.path.join(dirname, filename)
            yield file_path_wrapper(absolute_path)

In [None]:
def read_dbf_file(file_path):
    path = file_path_wrapper(file_path)
    # dbf_to_csv
    tmp_file_path = '/tmp/' + path.get('file_name') + '.csv'
    dbf = Dbf5(file_path, codec='utf-8')
    dbf.to_csv(tmp_file_path)
    return read_file(tmp_file_path)

In [None]:
def read_file(path):
    path = file_path_wrapper(path)
    extension = path.get('extension')
    if extension == 'csv':  return pd.read_csv(path.get('path'), low_memory=False)
    if extension == 'xlsx': return pd.read_excel(path.get('path'))
    if extension == 'dbf':  return read_dbf_file(path.get('path'))
    # if extension == 'shp': return gpd.read_file(path.get('path'))
    # if extension == 'shx': return shapefile.Reader(path.get('path'))
    if extension in ['shp', 'shx']: return { 'gpd': gpd.read_file(path.get('path')), 'shp': shapefile.Reader(path.get('path')) }
    raise NotImplementedError(f'Filetype ({extension}) is not supported yet')

In [None]:
# Get the set of file types found in the provided datasets
file_types = set()
for file in list_files():
    file_types.add(file.get('extension'))
print(', '.join([ f'`{x.upper()}`' for x in sorted(file_types) ]))

In [None]:
def eda_sweetviz(df):
    # EDA using sweetviz
    df_report = sv.analyze(df)
    clear_output()
    return df_report.show_notebook()

In [None]:
def eda_pandas_profiling(df):
    # EDA using pandas-profiling
    profile = ProfileReport(df,
                            title='Pandas Profiling Report',
                            html={'style':{'full_width':True}})
    clear_output()
    return profile.to_widgets()

In [None]:
def eda_minimal(df):
    print('> df.columns')       ; print(df.columns)
    print('\n> df.info()')      ; df.info()  # df.info() prints its own report and returns None
    print('\n> df.describe()')  ; print(df.describe())
    return df.head()

In [None]:
def eda(df, eda_tool='minimal'):
    eda_tool_values = [ 'sweetviz', 'pandas_profiling', 'minimal' ]  # possible values for eda_tool
    if   eda_tool == 'sweetviz': return eda_sweetviz(df)
    elif eda_tool == 'pandas_profiling': return eda_pandas_profiling(df)
    elif eda_tool == 'minimal': return eda_minimal(df)
    raise NotImplementedError(f'Invalid eda_tool value ({eda_tool})\nThe possible values are: {", ".join(eda_tool_values)}')

In [None]:
def add_locations_to_dataframe(df1):
    '''Return the result of merging the given dataframe with the one of NA_HQ_public_data.csv
    to be able to know the location of each participating organization'''
    file_path = str(NA_HQ_PUBLIC_DATA_CSV)
    df2 = read_file(file_path)
    # print('-'*20)
    # print(*df2['account_number'].unique())
    # Select the columns we're interested in
    columns = ['account_number', 'organization', 'theme', 'survey_year', 'survey_name', 'hq_country', 'address_city', 'address_state']
    df2 = df2[columns]
    return pd.merge(df1, df2, on='account_number')
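
`pd.merge` performs an inner join by default, so any row of the input dataframe whose `account_number` does not appear in `NA_HQ_public_data.csv` is dropped from the result. Below is a minimal sketch of how the overlap could be checked before merging, using plain Python sets. The account numbers are made-up placeholders, not values from the dataset, and `account_overlap` is a hypothetical helper.

```python
def account_overlap(left_accounts, right_accounts):
    """Split the left-hand account numbers into matched and unmatched sets."""
    left, right = set(left_accounts), set(right_accounts)
    return {
        'matched': left & right,    # would survive an inner merge
        'unmatched': left - right,  # would be dropped by an inner merge
    }

# Placeholder values for illustration only
cities_accounts = [101, 102, 103, 104]
hq_accounts = [102, 104, 999]

overlap = account_overlap(cities_accounts, hq_accounts)
print(sorted(overlap['matched']))
print(sorted(overlap['unmatched']))
```

With the real data, the inputs would be the `account_number` columns of the two dataframes. Passing `how='left'` to `pd.merge` would keep the unmatched rows, with missing values in the location columns.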

In [None]:
def point_to_xy(input_string):
    '''Converts a POINT string to list of numerical coordinates
    Example: 'POINT (12.5921 56.0308)' --> ['12.5921 56.0308'] --> '12.5921 56.0308' --> ['12.5921', '56.0308'] --> [12.5921, 56.0308]
    '''
    pattern = r'\((.*)\)'  # raw string, so the backslashes reach the regex engine unchanged
    matches = re.findall(pattern, input_string)[0].split()
    numbers = list(map(float, matches))
    # print('[DEBUG]', input_string, numbers, sep='\t')
    return numbers
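
The parsing steps listed in the docstring can be made more defensive by matching the whole `POINT (x y)` shape instead of any parenthesised text. The following is a sketch of such a variant, not a function the notebook relies on. `parse_point` is a hypothetical name, and the sketch assumes coordinates are plain decimal numbers in x (longitude), y (latitude) order.

```python
import re

# Matches e.g. 'POINT (12.5921 56.0308)'; allows a sign and an optional decimal part
_POINT_RE = re.compile(
    r'^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$'
)

def parse_point(text):
    """Return [x, y] as floats, or raise ValueError if text is not a POINT string."""
    match = _POINT_RE.match(text)
    if match is None:
        raise ValueError(f'Not a POINT string: {text!r}')
    return [float(match.group(1)), float(match.group(2))]

print(parse_point('POINT (12.5921 56.0308)'))
```

Raising on malformed input makes bad rows visible early instead of surfacing later as an `IndexError` or a `float` conversion error.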

In [None]:
def sequence_to_coords(point_sequence):
    point_sequence = point_sequence.dropna()  # Drop NaN values from sequence
    coords = list(map(point_to_xy, point_sequence))
    coords.append(coords[0]) # repeat the first point to create a 'closed loop'
    return coords

In [None]:
def plot_point_data(point_sequence):
    '''Given a sequence of POINT (x y) strings, turn them into numerical data and plot them along with their convex hull (envelope)'''
    coords = sequence_to_coords(point_sequence)
    hull = ConvexHull(coords)
    np_coords = np.array(coords)
    plt.plot(np_coords[:,0], np_coords[:,1], 'o')
    for simplex in hull.simplices:
        plt.plot(np_coords[simplex, 0], np_coords[simplex, 1], 'k-')
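
To make the idea of the convex hull (the smallest convex polygon enclosing all points) concrete, here is a small pure-Python sketch using Andrew's monotone chain algorithm. This is not how `scipy.spatial.ConvexHull` works internally (SciPy wraps the Qhull library). It is only an illustration on made-up points, and `convex_hull` is a hypothetical helper name.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Return hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(map(tuple, points)))  # sort and drop duplicate points
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:  # build the lower hull, left to right
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):  # build the upper hull, right to left
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each half is the first point of the other half
    return lower[:-1] + upper[:-1]

# Made-up coordinates: a square with one interior point
square = [[0, 0], [2, 0], [2, 2], [0, 2], [1, 1]]
print(convex_hull(square))
```

The interior point `(1, 1)` is discarded, and only the four corners remain as hull vertices.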

In [None]:
def plot_point_data_on_map(point_sequence):
    coords = sequence_to_coords(point_sequence)
    # called the variable map_ to avoid the clash with the built-in function map
    map_ = folium.Map(location=[0.0673459650476065, 15.137661925287018], zoom_start=2) # map centered on Africa
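    for lon, lat in coords[:-1]:  # POINT strings are (x y) = (longitude latitude); skip the repeated closing point
        folium.Marker([lat, lon], popup=str([lon, lat])).add_to(map_)  # folium expects [latitude, longitude]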
    return map_

In [None]:
def plot_counter(c, title=None):
    '''Plot Counter as Horizontal Bar Plot'''
    keys, values = zip(*c.most_common()[::-1]) # max value at the top
    y_pos = np.arange(len(keys))
    plt.barh(y_pos, values, align='center')
    plt.yticks(y_pos, keys)
    if title: plt.title(title)
    return plt.show()

# Exploratory Data Analysis

The data provided in this challenge is a set of `CSV`, `DBF`, `PDF`, `PRJ`, `SHP`, `SHX`, `TXT`, `XLSX`, and `XML` files.  
The environment provided by Kaggle lets us open most of these file types. Namely:
- `CSV`: A simple file format used to store tabular data, such as a spreadsheet or database. [(Source)](https://www.computerhope.com/issues/ch001356.htm#:~:text=CSV%20is%20a%20simple%20file,%22comma%2Dseparated%20values%22.)
- `PDF`: The Portable Document Format (PDF) is a file format developed by Adobe in 1993 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems. [(Source)](https://en.wikipedia.org/wiki/PDF)
- `PRJ`: In the context of a shapefile, the `.prj` file is a plain-text file describing the coordinate system and projection of the geometry, stored in well-known text (WKT) format. [(Source)](https://en.wikipedia.org/wiki/Shapefile)
- `TXT`: A standard text document that contains unformatted text. [(Source)](https://fileinfo.com/extension/txt#:~:text=A%20TXT%20file%20is%20a,Microsoft%20Notepad%20and%20Apple%20TextEdit.)
- `XLSX`: The default file format of Microsoft Excel since Excel 2007, used to store spreadsheets, graphs, and more. [(Source)](https://www.leadtools.com/help/sdk/v20/main/api/file-formats-ms-excel-format-xls-xlsx.html)
- `XML`: An XML file is an XML (Extensible Markup Language) data file. It is formatted much like an .HTML document, but uses custom tags to define objects and the data within each object. XML files can be thought of as a text-based database. [(Source)](https://fileinfo.com/extension/xml)

Unsupported File Types:

- `DBF`: Unless we install additional software, we can't open `DBF` files [(Wikipedia Page)](https://en.wikipedia.org/wiki/.dbf): attempting to open one shows the error message `Previews for binary data are not supported`. Two tools that can be used to access this type of file are [dbfread](https://dbfread.readthedocs.io/en/latest/) and [simpledbf](https://pypi.org/project/simpledbf/).
- `SHP`: The shapefile format is a geospatial vector data format for geographic information system software. [(Source)](https://en.wikipedia.org/wiki/Shapefile) Opening this type of file requires additional libraries such as `Fiona` or `PyShp`. [(Source)](https://gis.stackexchange.com/questions/113799/how-to-read-a-shapefile-in-python)
- `SHX`: The shape index file, which stores the position of each geometry record in the `.shp` file. A shapefile consists of a collection of files with a common filename prefix, stored in the same directory; the three mandatory files have the filename extensions `.shp`, `.shx`, and `.dbf`. [(Source)](https://en.wikipedia.org/wiki/Shapefile)
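
Since a shapefile is a bundle of sibling files sharing one filename prefix, it can be useful to confirm that the mandatory `.shp`, `.shx`, and `.dbf` parts are all present before trying to read it. Below is a minimal sketch of such a check using only the standard library. `missing_shapefile_parts` is a hypothetical helper, and the file names mirror the ones in the NYC census tract shapefile folder.

```python
from pathlib import PurePosixPath

MANDATORY_SUFFIXES = {'.shp', '.shx', '.dbf'}

def missing_shapefile_parts(file_names, stem):
    """Return the mandatory suffixes not found among file_names for the given stem."""
    present = {
        PurePosixPath(name).suffix.lower()
        for name in file_names
        if PurePosixPath(name).stem == stem
    }
    return MANDATORY_SUFFIXES - present

files = ['nyu_2451_34505.dbf', 'nyu_2451_34505.prj', 'nyu_2451_34505.shp',
         'nyu_2451_34505.shx', 'nyu_2451_34505_iso.xml']
print(missing_shapefile_parts(files, 'nyu_2451_34505'))      # nothing missing
print(missing_shapefile_parts(files[:2], 'nyu_2451_34505'))  # .shp and .shx are missing
```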



* Kaggle's platform already provides some insights about the data, including the number of records, missing values, and unique values, as well as the mean, standard deviation, minimum, and maximum, among others.
* The website providing the data also offers visualization tools that can help during the preliminary EDA phase of a data science project.

# Cities

## Cities Disclosing

In [None]:
str(CITIES_DISCLOSING)

In [None]:
!tree "/kaggle/input/cdp-unlocking-climate-solutions/Cities/Cities Disclosing"

### 2018_Cities_Disclosing_to_CDP.csv

In [None]:
file_path = str(_2018_CITIES_DISCLOSING_TO_CDP_CSV)
df = read_file(file_path)
eda(df)

```python
# The 'Last update' column holds datetime values stored as strings, so we convert them from string to datetime
# [Python | Pandas.to_datetime() - GeeksforGeeks](https://www.geeksforgeeks.org/python-pandas-to_datetime/)
x = df['Last update'].iloc[0]
print(type(x))
print(x)
print('-'*50)
last_update = pd.to_datetime(df['Last update'])
x = last_update.iloc[0]
print(type(x))
print(x)

# <class 'str'>
# 2020-06-25T04:52:49.050
# --------------------------------------------------
# <class 'pandas._libs.tslibs.timestamps.Timestamp'>
# 2020-06-25 04:52:49.050000
```

In [None]:
eda(df, eda_tool='pandas_profiling')

In [None]:
df['Last update'] = pd.to_datetime(df['Last update']) # str -> datetime

In [None]:
plot_point_data(df['City Location'])

In [None]:
# We should use the same column name, otherwise, merging the two dataframes will fail
df.columns = [ 'account_number' if x == 'Account Number' else x for x in df.columns ]
# print(*df['account_number'].unique())
df2 = add_locations_to_dataframe(df)
df2

In [None]:
# Not all account numbers are present in the NA_HQ_PUBLIC_DATA_CSV file!
x = set('840309 54609 840914 841416 840030 54649 54632 2028 840926 36504 73678 3203 54104 50356 841153 59631 31184 36286 60236 36495 54579 73763 60906 58865 69995 73301 60268 58871 31113 59595 74428 840930 74531 35848 54700 74309 49367 54270 60361 36492 35859 54430 839669 834258 840515 55325 831616 73736 60384 73732 58395 36037 54111 50354 50794 63562 64014 49342 35897 54109 42178 73789 59588 63919 840070 1499 826236 69848 31154 50680 3429 60328 54529 60369 832097 43917 70005 60414 35475 60271 50579 50385 54293 35860 59678 50394 840601 31117 73668 60391 59681 46514 50387 54337 74573 840529 55466 60409 50559 55419 834251 60320 54488 839965 31115 54290 55615 60318 55381 36242 36041 834287 74386 35907 54402 826103 50378 58511 839667 60114 36501 50543 831152 826182 60284 50565 50395 43930 60218 42388 49360 31111 35904 59160 59535 74453 840927 54108 831823 68290 60292 74423 50386 69840 840161 43909 31181 58482 832610 14344 73725 69823 826423 49359 834163 834280 60140 36261 54493 31090 54623 35886 55334 49172 54459 50391 839673 54098 74525 73365 31185 73754 50389 834300 63999 54706 54696 73252 50364 50792 35862 58424 58590 58597 841269 46470 54342 834229 50671 2185 840521 54345 73712 31110 60332 826209 49335 839648 74466 59697 839970 50370 839964 50368 826167 31164 53829 59537 826431 63616 60385 834161 73694 59707 51075 831617 35870 58670 840244 35393 73302 60419 831926 35893 37261 839980 50541 53254 31109 60898 50390 74563 73724 13067 73707 58796 1184 826407 840941 840492 62180 54588 63543 44185 36158 74673 63601 50371 59165 834413 60387 3422 840042 833379 73645 60399 50377 60379 827047 839954 43932 73665 58543 50562 826212 839666 54386 54124 840514 74560 834120 840371 74568 52897 55380 60125 49347 51079 59572 59168 35873 58671 838939 840507 73650 840313 54364 31146 59642 60394 58797 832274 54078 35903 58609 834277 54611 60413 73293 59163 74558 54518 834246 73679 60276 63862 834374 50381 54613 840037 52894 35878 73713 31175 36043 826239 832838 36263 73759 54289 10894 
834301 73295 31149 60340 840253 68383 73749 35867 43920 60223 54084 826208 832078 60638 54110 50549 50362 31168 840693 50382 35993 57347 74575 73801 36493 54092 31151 44076 58668 32480 74495 840937 54060 834058 68296 60393 60295 60273 43934 73709 59158 31155 14088 49339 60003 73788 826210 42120 60603 74631 36032 60408 43905 54491 60258 50358 54030 73676 70261 69850 36274 834261 840938 19233 43928 35449 55372 43910 58413 54389 60433 834153 55801 834313 54620 839665 832909 31153 834362 73706 839670 54619 54102 59531 54352 831618 839966 35854 36045 60356 841003 31163 31446 54113 832000 841326 53921 59971 74534 73693 31166 54656 58795 54654 50220 54606 36039 54625 54478 69985 50396 58627 54088 834255 36285 58626 43970 54329 834259 50550 54070 35898 55800 43940 834289 826396 57616 50359 36426 54360 31148 36159 834226 60375 827048 54647 60577 60371 14874 840925 74594 36494 62864 54699 55379 59580 62868 54291 54692 54318 60349 60126 68378 73652 73722 59151 50361 839963 58569 50566 839971 69834 35879 826429 60233 59545 834403 839668 59298 50557 31177 54510 54119 55371 31114 51374 834405 35885 55799 834260 35274 36254 53959 60400 54650 35857 54395 36477 43975 43914 841154 50578 35880 59956 63615 31172 31108 839931 60307 831674 35883 54662 36470 42123 60546 35910 59633 50574 74508 50681 73413 36004 74677 73752 50375 840425 50572 839650 54612 35884 54305 74488 54085 50380 36522 35853 31170 31112 826237 60599 31182 31150 840349 54608 54703 54538 840924 50401 54026 60104 49343 31157 63941 2430 49334 73684 834083 54603 826427 73695 54678 59562 59967 60267 54114 60621 56276 31169 54617 840024 840269 31171 60142 31052 840033 50392 50560 60053 73240 43911 60274 57509 31051 58513 50357 840916 58310 31179 55331 50558 73686 36282 50203 841098 35877 73701 11315 36036 840419 58346 36469 69822 36262 840490 54667 55373 54457 840370 54048 31187 62791 60278 16581 55415 50679 840039 43912 60656 834323 54327 840919 54057 61427 834238 60073 60050 54390 54641 54517 54348 8242 54498 74401 73879 
59644 54605 73803 840036 49327 74427 52638 50673 35874 834370 70017 60127 60216 73802 59536 840034 50672 50555 50551 840918 62817 59969 841155 60108 60392 68337 60229 36223 54274 54349 73637 59552 73806 54082 54116 31156 10495 54681 834219 60264 60007 59563 840931 74418 49330 54513 32550 43923 54687 36002 46473 54361 840943 50211 35865 54347 35913 36410 54521 73680 60213 840018 35905 54029 61467 69824 50388 60588 834202 74678 831999 31173 73750 31180 59669 20113 1093 54391 73666 58357 60633 35872 73690 54100 59996 73648 73530 50650 31158 44077 37241 59180 37038 54306 60272 54709 839972 60410 50571 826380 43969 61790 55324 68373 50674 59657 43907 73671 50665 35858 53860 35863 73787 60416 834278 69968 69999 35915 31009 54367 43921 59538 834406 50154 49344 54066 58530 68385 54651 54409 31165 834347 58595 59166 54075 36491 840936 60374 50544 49787 58485 35887 35268 60417 54627 36289 831620 840328 73762 50383 839982 73663 50568 62855 43938 59167 839967 54335 61876 60029 50782 50384 840935 826207 54670 834167 60279 54277 58531 54637 58621 826446 54497 31176 43937 59124 60381 50373 35755 54356 831230 31174 54697 54037 831433 58783 54388 31167 74680 73715 35894 36265 840917 73746 73738 59653 54253 36512 54652 49333 74414 50398 54370 58489 54633 31055 35864 68388 54683 60388 840944 45219 58591 60347 3417 61753'.split())
y = set('58859 31831 40952 30634 19582 1708 13532 7631 1328 4089 41522 5154 19581 4826 9284 64620 14013 832061 34462 32533 61180 34494 832077 20813 8126 23577 10661 22360 29936 40297 12348 8274 8644 10820 13488 17788 4109 47034 20186 20344 7264 831827 33756 40557 10494 19593 11876 2017 10666 827075 33253 1884 586 3848 831441 14061 18125 9843 20175 35790 57761 45126 7345 64 1857 14683 533 453 2825 20265 1087 9336 11566 51923 912 2523 19845 40299 33790 33636 52425 4528 33628 42037 14926 15027 435 32490 59326 4895 2354 3795 5377 9366 11411 57973 44266 35800 30498 35761 20822 16158 7616 4830 34291 34446 14019 29391 73445 834785 17925 829451 829441 829489 61298 61308 61315 62954 61306 828812 829444 829510 47287 53106 3349 33597 40316 1227 13425 4120 3538 3564 7156 73894 2555 1574 3352 1203 51990 51380 9298 22865 52448 20523 14590 15913 8348 11329 1452 660 1639 51199 19802 40574 829232 50820 56166 33259 33677 16114 56165 56447 831636 33293 71319 52378 46726 34079 32929 33241 51909 39166 28949 5169 11085 14268 826823 15419 22859 41519 44650 52854 35313 12406 682 28617 13024 4287 33899 52983 58 33701 40442 14928 33437 38759 35050 39684 62656 71543 71563 829463 46940 71941 61317 10405 34392 61560 61544 38434 28826 62079 57241 62058 57255 62055 2836 28973 10217 38980 38107 38092 827719 56997 46939 57190 61802 3329 33870 33766 7166 28924 13042 33715 50818 70407 70279 56270 836121 63625 830396 837106 826809 34006 7915 868 40327 29807 10195 45103 10408 846 23634 13121 48105 59901 50074 4822 18951 831825 4832 15623 40255 40413 40195 831380 20355 40300 831379 58295 22365 40228 61824 57031 62916 828005 828000 828286 840355 828361 839872 828866 829760 2360 35092 70974 23521 30867 33374 33461 40889 30039 39066 29952 19271 30724 39716 39735 15306 49441 40219 4151 36718 34285 3751 6559 582 44763 15831 35778 34382 62281 28692 41491 22861 53090 838457 38592 39362 830475 50821 19259 1464 831862 11930 46978 47122 47144 17180 34497 52169 47019 47056 34581 56728 51307 56720 827220 70631 
15625 33161 18513 692 30131 58858 20595 11584 8272 4833 20384 699 19859 9759 41697 17690 1875 15878 834496 56479 47400 51661 5229 829486 71860 70601 61310 10057 2885 11905 13604 6550 13888 334 47763 5194 21407 2519 40344 29943 13884 19569 31497 14821 32296 28840 28800 61565 14855 702 57225 10124 57237 57246 28588 62861 62044 62024 17666 57238 61980 70624 28869 3694 38005 19905 2068 23229 847168 847995 20402 63724 37921 14712 3732 11547 2611 3507 2326 33267 11383 4408 33275 17420 57179 14089 71637 61823 61821 62288 17421 73516 829741 28560 4428 23504 51409 38829 38432 21320 59258 6595 35244 40719 23197 18155 19923 61897 1213 2044 830415 30591 30047 23612 35090 30056 51444 33411 29969 33481 29855 34391 38144 454 333 33282 22373 6113 40183 23227 1017 13649 15132 14830 35762 59899 45116 8761 838718 56958 46657 839066 61316 48218 55955 3323 30108 38509 40330 4357 19898 58210 39231 71825 33405 33419 36917 7164 11796 20575 18127 59271 719 2914 833291 840657 9630 840748 12889 37977 6287 19075 44672 18633 3654 12379 11421 22867 12768 13486 8553 37523 14678 5052 9902 51537 18640 746 11401 11904 19902 58301 34996 10612 2951 18526 427 11332 17886 52363 37764 28797 34448 37851 47448 34437 831934 2985 1616 15673 20841 11160 50117 17166 2670 2683 10797 11846 17063 14960 865 23254 20515 15946 20839 304 74263 54554 21491 59722 41902 4638 18405 839509 19822 20111 45114 63454 20593 829442 35323 2573 6333 8838 6414 9301 9310 19328 36604 902 629 8675 20173 10117 19304 4058 5624 2667 5581 21481 7582 350 14961 7690 53513 6685 12799 1470 1417 15980 19241 9037 5337 32144 1602 10175 16804 3635 51895 829479 30125 47030 62059 62158 830700 35063 14266 15046 70616 28915 838776 61832 827983 58002 47784 28853 48089 20021 511 3398 9871 16072 5021 661 61318 71847 19858 16652 17684 3583 10498 15373 19305 3005 8634 31648 524 119 1941 35769 2656 35814 2666 2924 7344 17929 50116 8526 8116 11141 40418 14605 58857 10148 16307 5519 15916 7814 11336 11382 9829 3551 47069 30813 71199 70085 828417 35024 
828352 39219 33496 830537 38459 33452 51451 1113 7599 22872 689 20750 7147 1830 20896 5540 1536 8858 7619 2055 827001 1092 33536 48107 57020 47765 58642 58646 53518 1152 1951 31771 32134 58651 34439 39404 40461 22317 44767 13779 2902 7904 7292 839199 839184 839192 839251 839265 38819 839511 41907 839414 839526 839452 839451 839531 33274 6430 2091 285 578 13918 14360 64106 44677 15440 3546 17652 11765 14132 10056 17673 32491 15670 8055 33272 28844 837939 30766 827205 2455 4136 51343 34453 1198 37878 47054 29789 32143 13489 33830 23276 23260 829994 47808 40472 37788 2407 2802 73541 838140 838078 62299 71671 29986 839279 29979 30853 40405 830182 41429 16639 836147 37884 39865 40241 6128 57312 706 5521 16842 34555 12893 38807 38267 28707 13117 20869 10331 11581 4365 628 11017 19145 20705 5767 6383 14169 13813 12382 13849 13483 1693 13562 13314 16606 40303 362 3133 4678 50311 20917 2695 5197 17815 5885 36602 40390 716 21063 1219 23202 58304 17069 15279 2191 57963 710 38280 29803 35311 21492 33438 17035 51726 38015 57259 836132 838822 828316 40698 3876 15271 19016 391 17771 18320 7060 14901 9954 889 836615 836609 70970 62549 33788 18535 44675 56402 56444 34067 831987 831480 50815 37837 64049 15541 2870 8587 23217 1271 7271 50310 33739 51461 33562 39749 70960 56327 848471 165 70639 56167 19396 38047 33051 33108 838737 836552 37896 829863 829509 14783 8027 745 23228 356 22331 13279 15169 19377 6684 6285 3944 9638 36916 18169 10109 8670 707 20612 22873 3640 20159 40342 10733 13439 10233 53510 832418 1146 3358 58300 40310 15097 365 10696 22874 4562 9327 40175 19083 5057 40279 10143 62144 2942 23219 1356 3989 8054 22370 20141 8770 59242 40410 31070 2095 20398 9352 827649 57307 47812 40284 20801 35322 18553 73490 44653 40271 18573 291 6419 5115 40288 5195 57326 47818 47776 56163 22901 2982 33755 56266 33890 13126 39421 30003 38537 34728 31585 29751 59905 58316 57411 37816 840352 46674 827200 47471 51725 45120 40423 54862 23233 5653 22306 22408 59911 57208 57396 35086 830201 
30586 39378 39381 29860 40399 20661 45142 23576 31591 30628 72401 37920 52389 40460 11623 13422 18859 16303 59710 62121 829223 836571 837515 38340 836327 6323 71053 52918 40560 829847 40581 33882 33836 829846 33767 33261 62278 56396 831830 73207 831828 52562 52587 11578 62672 40425 19564 61426 40328 40435 46977 827169 38063 46679 32914 56107 836310 56114 33173 70037 51742 829448 837271 829478 829484 61423 2861 838414 57233 62138 57256 838452 838416 57230 838447 838451 56055 826960 836136 47853 836883 836945 838806 838809 827650 51994 840379 57212 57157 828002 828001 827972 71641 837631 47254 839045 57419 837164 837108 837938 837675 828261 828245 828349 837881 828307 837872 838145 837681 837718 837759 829773 829709 39764 30092 57397 38206 57228 830234 830194 30607 39747 830288 830296 830294 39226 39725 61989 33515 830534 37957 41279 39408 840733 33524 57092 9736 40759 1340 70638 838477 829918 56149 829487 836629 34390 70039 51526 51536 828702 39196 47583 829467 40573 52401 71211 33264 45150 35459 61683 70615 32913 37854 47536 33104 34430 47428 51840 52394 37974 829490 826824 829506 57996 57221 838427 838444 838439 838450 838445 836107 826966 51301 836819 57100 838805 57106 57136 827997 47801 827993 827996 47839 52622 47805 61846 71642 47004 71123 837128 837087 38205 828344 828276 73520 828392 828242 828338 828264 828256 828233 837743 837877 837974 837822 828393 838070 830444 837932 39355 38657 33520 837511 837551 32945 38374 38547 39564 830202 30113 839438 38793 830359 830366 39643 39269 34706 39449 39552 51449 39126 848018 848285 848284 848215 61402 62435 59247 51889 51785'.split())
x.intersection(y)

In [None]:
# eda(df, eda_tool='sweetviz')

In [None]:
# Some population values are missing. The gaps could be filled by searching the web or by using an API such as
# [World Cities Population — Opendatasoft](https://public.opendatasoft.com/explore/dataset/worldcitiespop/api/?disjunctive.country)
filter_flags = df['Population'].isnull()
df[filter_flags]

In [None]:
# The population data is also right-skewed: most cities have fewer than 4_000_000 inhabitants
# [Skewness Definition](https://www.investopedia.com/terms/s/skewness.asp)
# [pandas.notnull — pandas 0.23.4 documentation](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.notnull.html)
# [pandas.Series.hist — pandas 1.1.5 documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.hist.html)
# [How to Set Axis Range (xlim, ylim) in Matplotlib](https://stackabuse.com/how-to-set-axis-range-xlim-ylim-in-matplotlib/)
# [python - Make a histogram of a pandas series - Stack Overflow](https://stackoverflow.com/questions/53055708/make-a-histogram-of-a-pandas-series/53056267)
filter_flags = df['Population'].notnull()
df[filter_flags]['Population'].hist(bins=1000)
_ = plt.xlim([0, 30_000_000])
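
The skew noted above can also be checked numerically rather than only by eye. The following is a minimal sketch on made-up population values (not taken from the CDP files): a positive `Series.skew()` and a mean well above the median both point to a long right tail.

```python
import pandas as pd

# Hypothetical populations (illustrative only): many small cities and one megacity.
population = pd.Series(
    [120_000, 250_000, 310_000, 480_000, 900_000, 1_500_000, 21_000_000]
)

skewness = population.skew()      # sample skewness; > 0 means a long right tail
mean = population.mean()
median = population.median()

print(f"skewness={skewness:.2f}, mean={mean:,.0f}, median={median:,.0f}")
```

On the real data, the same calls could be applied to `df['Population'].dropna()`.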

In [None]:
# Which CDP Region contributed the most to this survey?
column_name = 'CDP Region'
filter_flags = df[column_name].notnull()
counter = Counter(df[filter_flags][column_name])
counter.most_common()

In [None]:
# The 'Population Year' column also contains some zeros.
# A zero probably means that the year is unavailable or unknown: most rows without a year hold NaN, and only 3 rows hold 0.
# This may come from how the Online Response System (ORS) works: undefined values may have been replaced with zeros
filter_flags = df['Population Year'] == 0
df[filter_flags]
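
If the zeros really do stand for "unknown" (an assumption, as noted above), one option is to convert them to NaN so that every missing year is represented the same way. A minimal sketch on a toy frame; the `City` column and its values are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy frame: one real year, one zero placeholder, one NaN, one real year.
toy = pd.DataFrame({
    "City": ["A", "B", "C", "D"],
    "Population Year": [2017, 0, np.nan, 2019],
})

cleaned = toy.copy()
# Treat 0 as "unknown" so it is counted together with the existing NaN values.
cleaned["Population Year"] = cleaned["Population Year"].replace(0, np.nan)

missing_count = int(cleaned["Population Year"].isna().sum())
print(missing_count)
```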

### 2019_Cities_Disclosing_to_CDP.csv

In [None]:
file_path = str(_2019_CITIES_DISCLOSING_TO_CDP_CSV)
df = read_file(file_path)
eda(df)

In [None]:
plot_point_data(df['City Location'])

In [None]:
# Which CDP Region contributed the most to this survey?
column_name = 'CDP Region'
filter_flags = df[column_name].notnull()
counter = Counter(df[filter_flags][column_name])
counter.most_common()

In [None]:
# We should use the same column name, otherwise, merging the two dataframes will fail
df.columns = [ 'account_number' if x == 'Account Number' else x for x in df.columns ]
df2 = add_locations_to_dataframe(df)
df2

### 2020_Cities_Disclosing_to_CDP.csv

In [None]:
file_path = str(_2020_CITIES_DISCLOSING_TO_CDP_CSV)
df = read_file(file_path)
# eda(df)

In [None]:
plot_point_data(df['City Location'])

In [None]:
# The city location data contains some errors (cities in the water, cities off the map, ...)
# This is an R notebook explaining how to fix this issue: [CDP - Cities Location Exploratory Analysis | Kaggle](https://www.kaggle.com/shabou/cdp-cities-location-exploratory-analysis)
plot_point_data_on_map(df['City Location'])
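
Some of these errors can be flagged before plotting. The sketch below assumes that each `City Location` value is a string of the form `POINT (longitude latitude)`; that format and the helper name `parse_point` are assumptions made for illustration. It only catches malformed or out-of-range coordinates (points off the map); a city placed in the water still needs a geographic check such as the one in the linked R notebook.

```python
import re

# Assumed format: "POINT (<longitude> <latitude>)"
POINT_RE = re.compile(
    r"POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)"
)

def parse_point(text):
    """Return (lon, lat) for a valid point string, or None otherwise."""
    if not isinstance(text, str):
        return None                      # NaN / None / non-string values
    match = POINT_RE.fullmatch(text.strip())
    if match is None:
        return None                      # does not look like a point
    lon, lat = float(match.group(1)), float(match.group(2))
    if not (-180 <= lon <= 180 and -90 <= lat <= 90):
        return None                      # outside the valid coordinate range
    return lon, lat

samples = ["POINT (-73.9857 40.7484)", "POINT (200.0 95.0)", None, "not a point"]
parsed = [parse_point(s) for s in samples]
print(parsed)
```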

In [None]:
# Which CDP Region contributed the most to this survey?
column_name = 'CDP Region'
filter_flags = df[column_name].notnull()
counter = Counter(df[filter_flags][column_name])
counter.most_common()

In [None]:
plot_counter(counter, title='Number of Reports Submitted by CDP Region (2020)')

In [None]:
# We should use the same column name, otherwise, merging the two dataframes will fail
df.columns = [ 'account_number' if x == 'Account Number' else x for x in df.columns ]
# print(*df['account_number'].unique())
df2 = add_locations_to_dataframe(df)
df2

### Cities_Disclosing_to_CDP_Data_Dictionary.csv

In [None]:
file_path = str(CITIES_DISCLOSING_TO_CDP_DATA_DICTIONARY_CSV)
df = read_file(file_path)
eda(df)

In [None]:
for idx, row in df.iterrows():
    # print(row)
    print(row.field)
    print(row.description)
    print('-'*50)

## Cities Questionnaires

This folder contains only `PDF` files.

## Cities Responses

### 2018_Full_Cities_Dataset.csv

In [None]:
file_path = str(_2018_FULL_CITIES_DATASET_CSV)
df = read_file(file_path)
eda(df)

In [None]:
pd.set_option('max_colwidth', 300)

In [None]:
df[['Question Name', 'Response Answer']]

In [None]:
# Which CDP Region contributed the most to this survey?
column_name = 'CDP Region'
filter_flags = df[column_name].notnull()
counter = Counter(df[filter_flags][column_name])
counter.most_common()

In [None]:
# We should use the same column name, otherwise, merging the two dataframes will fail
df.columns = [ 'account_number' if x == 'Account Number' else x for x in df.columns ]
# print(*df['account_number'].unique())
df2 = add_locations_to_dataframe(df)
df2

### 2019_Full_Cities_Dataset.csv

In [None]:
file_path = str(_2019_FULL_CITIES_DATASET_CSV)
df = read_file(file_path)
# eda(df)

In [None]:
# Which CDP Region contributed the most to this survey?
column_name = 'CDP Region'
filter_flags = df[column_name].notnull()
counter = Counter(df[filter_flags][column_name])
counter.most_common()

In [None]:
plot_counter(counter, title='Number of Reports Submitted by CDP Region (2019)')

In [None]:
# We should use the same column name, otherwise, merging the two dataframes will fail
df.columns = [ 'account_number' if x == 'Account Number' else x for x in df.columns ]
# print(*df['account_number'].unique())
df2 = add_locations_to_dataframe(df)
df2

### 2020_Full_Cities_Dataset.csv

In [None]:
file_path = str(_2020_FULL_CITIES_DATASET_CSV)
df = read_file(file_path)
# eda(df)

In [None]:
# Which CDP Region contributed the most to this survey?
column_name = 'CDP Region'
filter_flags = df[column_name].notnull()
counter = Counter(df[filter_flags][column_name])
counter.most_common()

In [None]:
# We should use the same column name, otherwise, merging the two dataframes will fail
df.columns = [ 'account_number' if x == 'Account Number' else x for x in df.columns ]
# print(*df['account_number'].unique())
df2 = add_locations_to_dataframe(df)
df2

### Full_Cities_Response_Data_Dictionary.csv

In [None]:
file_path = str(FULL_CITIES_RESPONSE_DATA_DICTIONARY_CSV)
df = read_file(file_path)
eda(df)

In [None]:
for idx, row in df.iterrows():
    # print(row)
    print(row.field)
    print(row.description)
    print('-'*50)

# Corporations

## Corporations Disclosing

### Climate Change

#### 2018_Corporates_Disclosing_to_CDP_Climate_Change.csv

In [None]:
file_path = str(_2018_CORPORATES_DISCLOSING_TO_CDP_CLIMATE_CHANGE_CSV)
df = read_file(file_path)
eda(df)

In [None]:
# Which part of the world contributed the most to this survey?
column_name = 'country'
filter_flags = df[column_name].notnull()
counter = Counter(df[filter_flags][column_name])
counter.most_common()

In [None]:
# We should use the same column name, otherwise, merging the two dataframes will fail
df.columns = [ 'account_number' if x == 'Account Number' else x for x in df.columns ]
# print(*df['account_number'].unique())
df2 = add_locations_to_dataframe(df)
Counter(df2['country'])

#### 2019_Corporates_Disclosing_to_CDP_Climate_Change.csv

In [None]:
file_path = str(_2019_CORPORATES_DISCLOSING_TO_CDP_CLIMATE_CHANGE_CSV)
df = read_file(file_path)
eda(df)

In [None]:
# Which part of the world contributed the most to this survey?
column_name = 'country'
filter_flags = df[column_name].notnull()
counter = Counter(df[filter_flags][column_name])
counter.most_common()

In [None]:
# We should use the same column name, otherwise, merging the two dataframes will fail
df.columns = [ 'account_number' if x == 'Account Number' else x for x in df.columns ]
# print(*df['account_number'].unique())
df2 = add_locations_to_dataframe(df)
Counter(df2['country'])

#### 2020_Corporates_Disclosing_to_CDP_Climate_Change.csv

In [None]:
file_path = str(_2020_CORPORATES_DISCLOSING_TO_CDP_CLIMATE_CHANGE_CSV)
df = read_file(file_path)
eda(df)

In [None]:
# Which part of the world contributed the most to this survey?
column_name = 'country'
filter_flags = df[column_name].notnull()
counter = Counter(df[filter_flags][column_name])
counter.most_common()

In [None]:
# We should use the same column name, otherwise, merging the two dataframes will fail
df.columns = [ 'account_number' if x == 'Account Number' else x for x in df.columns ]
# print(*df['account_number'].unique())
df2 = add_locations_to_dataframe(df)
Counter(df2['country'])

#### Corporations_Disclosing_to_CDP_Data_Dictionary.csv

In [None]:
file_path = str(CORPORATIONS_DISCLOSING_TO_CDP_DATA_DICTIONARY_CSV)
df = read_file(file_path)
eda(df)

### Water Security

#### 2018_Corporates_Disclosing_to_CDP_Water_Security.csv

In [None]:
file_path = str(_2018_CORPORATES_DISCLOSING_TO_CDP_WATER_SECURITY_CSV)
df = read_file(file_path)
eda(df)

In [None]:
# Which part of the world contributed the most to this survey?
column_name = 'country'
filter_flags = df[column_name].notnull()
counter = Counter(df[filter_flags][column_name])
counter.most_common()

In [None]:
# We should use the same column name, otherwise, merging the two dataframes will fail
df.columns = [ 'account_number' if x == 'Account Number' else x for x in df.columns ]
# print(*df['account_number'].unique())
df2 = add_locations_to_dataframe(df)
Counter(df2['country'])

#### 2019_Corporates_Disclosing_to_CDP_Water_Security.csv

In [None]:
file_path = str(_2019_CORPORATES_DISCLOSING_TO_CDP_WATER_SECURITY_CSV)
df = read_file(file_path)
eda(df)

In [None]:
# Which part of the world contributed the most to this survey?
column_name = 'country'
filter_flags = df[column_name].notnull()
counter = Counter(df[filter_flags][column_name])
counter.most_common()

In [None]:
# We should use the same column name, otherwise, merging the two dataframes will fail
df.columns = [ 'account_number' if x == 'Account Number' else x for x in df.columns ]
# print(*df['account_number'].unique())
df2 = add_locations_to_dataframe(df)
Counter(df2['country'])

#### 2020_Corporates_Disclosing_to_CDP_Water_Security.csv

In [None]:
file_path = str(_2020_CORPORATES_DISCLOSING_TO_CDP_WATER_SECURITY_CSV)
df = read_file(file_path)
eda(df)

In [None]:
# Which part of the world contributed the most to this survey?
column_name = 'country'
filter_flags = df[column_name].notnull()
counter = Counter(df[filter_flags][column_name])
counter.most_common()

In [None]:
# We should use the same column name, otherwise, merging the two dataframes will fail
df.columns = [ 'account_number' if x == 'Account Number' else x for x in df.columns ]
# print(*df['account_number'].unique())
df2 = add_locations_to_dataframe(df)
Counter(df2['country'])

#### Corporations_Disclosing_to_CDP_Data_Dictionary.csv

In [None]:
file_path = str(CORPORATIONS_DISCLOSING_TO_CDP_DATA_DICTIONARY_CSV)
df = read_file(file_path)
eda(df)

## Corporations Questionnaires

### Climate Change

This folder contains only `PDF` files.

### Water Security

This folder contains only `PDF` files.

## Corporations Responses

### Climate Change

#### 2018_Full_Climate_Change_Dataset.csv

In [None]:
file_path = str(_2018_FULL_CLIMATE_CHANGE_DATASET_CSV)
df = read_file(file_path)
eda(df)

In [None]:
# Which organizations account for the most rows in this dataset?
column_name = 'organization'
filter_flags = df[column_name].notnull()
counter = Counter(df[filter_flags][column_name])
counter.most_common()

In [None]:
# We should use the same column name, otherwise, merging the two dataframes will fail
df.columns = [ 'account_number' if x == 'Account Number' else x for x in df.columns ]
# print(*df['account_number'].unique())
df2 = add_locations_to_dataframe(df)
c = Counter(df2['hq_country'])
c

In [None]:
x = np.array(c.most_common())
np.random.shuffle(x) # to avoid getting small values next to each other (i.e. overlapping labels) <-- this is the easiest solution!
x

In [None]:
x[:,0]

In [None]:
x[:,1]

In [None]:
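# [Matplotlib Pie Charts](https://www.w3schools.com/python/matplotlib_pie_charts.asp)
# np.array() stored the counts as strings next to the labels, so convert them back to numbers
y = x[:,1].astype(float)
mylabels = x[:,0]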
# myexplode = [0, 0, 0.5, 0]

# plt.pie(y, labels = mylabels, shadow = True, explode = myexplode)
plt.pie(y, labels = mylabels, shadow = True)
# plt.legend(title = 'Countries')
plt.tight_layout()
plt.show()
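
Shuffling reduces label overlap but does not remove it when there are many tiny slices. An alternative, sketched here under the assumption that small countries can be reported together, is to fold every slice below a threshold into a single "Other" slice before plotting. The function name `group_small_slices`, the 3% threshold, and the toy counts are all hypothetical.

```python
from collections import Counter

def group_small_slices(counts, min_share=0.03, other_label="Other"):
    """Fold entries whose share of the total is below min_share into one label."""
    total = sum(counts.values())
    grouped = Counter()
    if total == 0:
        return grouped                   # nothing to group
    for label, n in counts.items():
        key = label if n / total >= min_share else other_label
        grouped[key] += n
    return grouped

# Toy counts (illustrative only, not the real headquarters distribution).
toy = Counter({
    "United States of America": 60,
    "Canada": 30,
    "Mexico": 8,
    "Bermuda": 1,
    "Panama": 1,
})
grouped = group_small_slices(toy)
print(grouped.most_common())
```

The grouped counter can then be passed to the pie chart in place of the original one, for example `plt.pie(list(grouped.values()), labels=list(grouped.keys()))`.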

#### 2019_Full_Climate_Change_Dataset.csv

In [None]:
str(_2019_FULL_CLIMATE_CHANGE_DATASET_CSV)

In [None]:
!head -n 1 '/kaggle/input/cdp-unlocking-climate-solutions/Corporations/Corporations Responses/Climate Change/2019_Full_Climate_Change_Dataset.csv'

In [None]:
file_path = str(_2019_FULL_CLIMATE_CHANGE_DATASET_CSV)
df = read_file(file_path)
eda(df)

In [None]:
df

In [None]:
# We should use the same column name, otherwise, merging the two dataframes will fail
df.columns = [ 'account_number' if x == 'Account Number' else x for x in df.columns ]
# print(*df['account_number'].unique())
df2 = add_locations_to_dataframe(df)
c = Counter(df2['hq_country'])
c

In [None]:
plot_counter(c)

#### 2020_Full_Climate_Change_Dataset.csv

In [None]:
file_path = str(_2020_FULL_CLIMATE_CHANGE_DATASET_CSV)
df = read_file(file_path)
eda(df)

#### Full_Corporations_Response_Data_Dictionary copy.csv

In [None]:
# Check whether the files "Full_Corporations_Response_Data_Dictionary copy.csv" and "Full_Corporations_Response_Data_Dictionary.csv" are identical
# /kaggle/input/cdp-unlocking-climate-solutions/Corporations/Corporations Responses/Climate Change/Full_Corporations_Response_Data_Dictionary copy.csv
# /kaggle/input/cdp-unlocking-climate-solutions/Corporations/Corporations Responses/Water Security/Full_Corporations_Response_Data_Dictionary.csv
!cmp --silent "/kaggle/input/cdp-unlocking-climate-solutions/Corporations/Corporations Responses/Climate Change/Full_Corporations_Response_Data_Dictionary copy.csv" "/kaggle/input/cdp-unlocking-climate-solutions/Corporations/Corporations Responses/Water Security/Full_Corporations_Response_Data_Dictionary.csv" && echo "identical! ✅" || echo "different! ❌"
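
The same byte-for-byte comparison can be done from Python with the standard library, which avoids relying on the shell. This is a minimal sketch using temporary files in place of the two data dictionaries; the helper name `files_identical` is hypothetical.

```python
import filecmp
import tempfile
from pathlib import Path

def files_identical(path_a, path_b):
    """Compare file contents byte by byte (shallow=False ignores os.stat shortcuts)."""
    return filecmp.cmp(path_a, path_b, shallow=False)

with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "a.csv"
    b = Path(tmp) / "b.csv"
    c = Path(tmp) / "c.csv"
    a.write_text("field,description\nx,first\n")
    b.write_text("field,description\nx,first\n")   # same content as a
    c.write_text("field,description\nx,second\n")  # different content

    same = files_identical(a, b)
    different = files_identical(a, c)

print(same, different)
```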

In [None]:
file_path = str(FULL_CORPORATIONS_RESPONSE_DATA_DICTIONARY_COPY_CSV)
df = read_file(file_path)
# eda(df)

### Water Security

#### 2018_Full_Water_Security_Dataset.csv

In [None]:
file_path = str(_2018_FULL_WATER_SECURITY_DATASET_CSV)
df = read_file(file_path)
# eda(df)

In [None]:
# We should use the same column name, otherwise, merging the two dataframes will fail
df.columns = [ 'account_number' if x == 'Account Number' else x for x in df.columns ]
# print(*df['account_number'].unique())
df2 = add_locations_to_dataframe(df)
c = Counter(df2['hq_country'])
c

In [None]:
plot_counter(c)

#### 2019_Full_Water_Security_Dataset.csv

In [None]:
file_path = str(_2019_FULL_WATER_SECURITY_DATASET_CSV)
df = read_file(file_path)
# eda(df)

In [None]:
# We should use the same column name, otherwise, merging the two dataframes will fail
df.columns = [ 'account_number' if x == 'Account Number' else x for x in df.columns ]
# print(*df['account_number'].unique())
df2 = add_locations_to_dataframe(df)
c = Counter(df2['hq_country'])
c

In [None]:
plot_counter(c)

#### 2020_Full_Water_Security_Dataset.csv

In [None]:
file_path = str(_2020_FULL_WATER_SECURITY_DATASET_CSV)
df = read_file(file_path)
# eda(df)

In [None]:
# We should use the same column name, otherwise, merging the two dataframes will fail
df.columns = [ 'account_number' if x == 'Account Number' else x for x in df.columns ]
# print(*df['account_number'].unique())
df2 = add_locations_to_dataframe(df)
c = Counter(df2['hq_country'])
c

In [None]:
plot_counter(c)

#### Full_Corporations_Response_Data_Dictionary.csv

In [None]:
file_path = str(FULL_CORPORATIONS_RESPONSE_DATA_DICTIONARY_CSV)
df = read_file(file_path)
eda(df)

# Supplementary Data

## CDC 500 Cities Census Tract Data

### 500_Cities__Census_Tract-level_Data__GIS_Friendly_Format___2019_release.csv

In [None]:
file_path = str(_500_CITIES__CENSUS_TRACT_LEVEL_DATA__GIS_FRIENDLY_FORMAT___2019_RELEASE_CSV)
df = read_file(file_path)
eda(df)

## CDC Social Vulnerability Index 2018

### SVI2018_US.csv

In [None]:
file_path = str(SVI2018_US_CSV)
df = read_file(file_path)
eda(df)

### SVI2018_US_COUNTY.csv

In [None]:
file_path = str(SVI2018_US_COUNTY_CSV)
df = read_file(file_path)
eda(df)

## Dataset Licenses

In [None]:
str(DATASET_LICENSES)

In [None]:
!tree '/kaggle/input/cdp-unlocking-climate-solutions/Supplementary Data/Dataset Licenses'

### CDP_dataset_licenses.txt

In [None]:
!cat '/kaggle/input/cdp-unlocking-climate-solutions/Supplementary Data/Dataset Licenses/CDP_dataset_licenses.txt'

The links contained in this file lead to metadata about the data provided in this challenge (Column Name, Description, Type), and they even make it possible to contact the dataset owner. They also provide online tools for visualizing and exploring the data in depth.

### Supplementary_dataset_licenses.txt

In [None]:
!cat '/kaggle/input/cdp-unlocking-climate-solutions/Supplementary Data/Dataset Licenses/Supplementary_dataset_licenses.txt'

This file contains a list of references to the sources from which the data for this competition was collected.

## Locations of Corporations

### NA_HQ_public_data.csv

In [None]:
file_path = str(NA_HQ_PUBLIC_DATA_CSV)
df = read_file(file_path)
# eda(df)

In [None]:
df['survey_year'].unique()

In [None]:
df2.columns

In [None]:
# We should use the same column name, otherwise, merging the two dataframes will fail
df.columns = [ 'account_number' if x == 'Account Number' else x for x in df.columns ]
# print(*df['account_number'].unique())
df2 = add_locations_to_dataframe(df)
c = Counter(df2['hq_country_x'])
c

In [None]:
plot_counter(c)
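
The `hq_country_x` column name used above is presumably a by-product of merging two frames that both contain an `hq_country` column. `add_locations_to_dataframe` is defined earlier in the notebook; the sketch below assumes it relies on a pandas merge on `account_number`, and uses toy frames to show where the `_x`/`_y` suffixes come from and how the `suffixes` parameter can keep the left-hand names unchanged.

```python
import pandas as pd

# Toy frames (illustrative only): both contain an 'hq_country' column.
left = pd.DataFrame({
    "account_number": [1, 2],
    "hq_country": ["Canada", "Mexico"],
})
right = pd.DataFrame({
    "account_number": [1, 2],
    "hq_country": ["Canada", "Mexico"],
    "hq_city": ["Toronto", "Monterrey"],
})

# Default behaviour: overlapping non-key columns get '_x' (left) and '_y' (right).
merged = left.merge(right, on="account_number", how="left")
merged_columns = list(merged.columns)

# Explicit suffixes keep the left-hand column names as they are.
merged_named = left.merge(right, on="account_number", how="left", suffixes=("", "_hq"))
merged_named_columns = list(merged_named.columns)

print(merged_columns)
print(merged_named_columns)
```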

## NYC CDP Census Tract Shapefiles

This folder contains `DBF`, `PRJ`, `SHP`, `SHX`, and `XML` files.

In [None]:
str(NYC_CDP_CENSUS_TRACT_SHAPEFILES)

In [None]:
!tree "/kaggle/input/cdp-unlocking-climate-solutions/Supplementary Data/NYC CDP Census Tract Shapefiles"

### nyu_2451_34505.dbf

In [None]:
file_path = str(NYU_2451_34505_DBF)
df = read_file(file_path)
eda(df)

### nyu_2451_34505.prj

In [None]:
str(NYU_2451_34505_PRJ)

In [None]:
!cat '/kaggle/input/cdp-unlocking-climate-solutions/Supplementary Data/NYC CDP Census Tract Shapefiles/nyu_2451_34505.prj'

### nyu_2451_34505.shp

In [None]:
file_path = str(NYU_2451_34505_SHP)
sf = read_file(file_path)

In [None]:
print(sf)

In [None]:
sf.keys()

In [None]:
sf['gpd']

In [None]:
sf['shp']

In [None]:
sf['shp'].shapeType, sf['shp'].shapeTypeName

In [None]:
# The number of features
len(sf['shp'])

In [None]:
# The bounding box area the shapefile covers
sf['shp'].bbox

In [None]:
# GeoJSON Data; [GeoJSON](https://geojson.org/)
sf['shp'].__geo_interface__

### nyu_2451_34505.shx

Opening this file gives the same result as opening `nyu_2451_34505.shp`:
> the format consists of a collection of files with a common filename prefix, stored in the same directory. The three mandatory files have filename extensions .shp, .shx, and .dbf. — [Shapefile - Wikipedia](https://en.wikipedia.org/wiki/Shapefile)

In [None]:
# Proof that opening the SHX file is the same as opening the SHP or the DBF ones

file_path = str(NYU_2451_34505_SHX)
sf2 = read_file(file_path)
print('SHP == SHX ?', sf['shp'].__geo_interface__ == sf2['shp'].__geo_interface__, sep='\t') # True

sf3 = shapefile.Reader(str(NYU_2451_34505_DBF))
print('SHX == DBF ?', sf2['shp'].__geo_interface__ == sf3.__geo_interface__, sep='\t') # True

In [None]:
# Use a name other than `shapefile` here so that the imported `shapefile` module is not shadowed
file_path = str(NYU_2451_34505_SHP)
shp_data = read_file(file_path)
shp_data

In [None]:
print(shp_data)

In [None]:
# [Using GeoPandas to display Shapefiles in Jupyter Notebooks – acgeospatial](http://www.acgeospatial.co.uk/geopandas-shapefiles-jupyter/)
shp_data['gpd'].plot()

### nyu_2451_34505_iso.xml

In [None]:
str(NYU_2451_34505_ISO_XML)

In [None]:
!cat '/kaggle/input/cdp-unlocking-climate-solutions/Supplementary Data/NYC CDP Census Tract Shapefiles/nyu_2451_34505_iso.xml'

In [None]:
# [Python XML Parser Tutorial: Read xml file example(Minidom, ElementTree)](https://www.guru99.com/manipulating-xml-with-python.html)
# [How do I parse XML in Python? - Stack Overflow](https://stackoverflow.com/questions/1912434/how-do-i-parse-xml-in-python)
tree = ET.parse(str(NYU_2451_34505_ISO_XML))
root = tree.getroot()

In [None]:
# [xml.etree.ElementTree — The ElementTree XML API — Python 3.9.1 documentation](https://docs.python.org/3/library/xml.etree.elementtree.html)
for child in root:
    print('>', child.tag)
    if child.attrib != {}: pprint(child.attrib)
    for grandchild in child:
        print(grandchild.tag)
        if grandchild.attrib != {}: pprint(grandchild.attrib)
    print('-'*50)
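
The loop above stops at two levels and prints tags together with their namespace prefix in `{...}` form, which gets hard to read for namespaced metadata. A minimal sketch of a recursive walk that strips the namespace and indents by depth; the inline XML and its `http://example.org/ns` namespace are made up for illustration and are not the contents of the real file.

```python
import xml.etree.ElementTree as ET

# Toy document with a made-up namespace (not the real ISO metadata file).
XML = (
    '<m:metadata xmlns:m="http://example.org/ns">'
    '<m:title lang="en">Census tracts</m:title>'
    '<m:contact><m:name>Someone</m:name></m:contact>'
    '</m:metadata>'
)

def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]

def walk(element, depth=0):
    """Yield (depth, tag, attributes) for the element and all its descendants."""
    yield depth, local_name(element.tag), dict(element.attrib)
    for child in element:
        yield from walk(child, depth + 1)

root = ET.fromstring(XML)
rows = list(walk(root))
for depth, name, attrib in rows:
    print("  " * depth + name, attrib or "")
```

For the real file, `ET.parse(str(NYU_2451_34505_ISO_XML)).getroot()` could be passed to `walk` instead of the toy root.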

## Recommendations from CDP

### CDP_recommendations_for_questions_to_focus_on.xlsx

In [None]:
file_path = str(CDP_RECOMMENDATIONS_FOR_QUESTIONS_TO_FOCUS_ON_XLSX)
df = read_file(file_path)
eda(df)

In [None]:
df

### CDP_recommendations_for_supplementary_datasets_to_include.xlsx

In [None]:
file_path = str(CDP_RECOMMENDATIONS_FOR_SUPPLEMENTARY_DATASETS_TO_INCLUDE_XLSX)
df = read_file(file_path)
eda(df)

## Simple Maps US Cities Data

### uscities.csv

In [None]:
file_path = str(USCITIES_CSV)
df = read_file(file_path)
eda(df)

# References
* [Python Dictionary get()](https://www.programiz.com/python-programming/methods/dictionary/get)
* [Reading and Writing CSV Files in Python – Real Python](https://realpython.com/python-csv/)
* [python - When to use 'raise NotImplementedError'? - Stack Overflow](https://stackoverflow.com/questions/44315961/when-to-use-raise-notimplementederror)
* [python - ipython notebook clear cell output in code - Stack Overflow](https://stackoverflow.com/questions/24816237/ipython-notebook-clear-cell-output-in-code)
* [Sweetviz: Automated EDA in Python | by Himanshu Sharma | Towards Data Science](https://towardsdatascience.com/sweetviz-automated-eda-in-python-a97e4cabacde)
* [Modern Exploratory Data Analysis. Review of 4 libraries for automatic EDA | by ChiefHustler | Towards Data Science](https://towardsdatascience.com/modern-exploratory-data-analysis-29fdbecec957)
* [shutil - Python Pathlib path object not converting to string - Stack Overflow](https://stackoverflow.com/questions/44315815/python-pathlib-path-object-not-converting-to-string)
* [pandas-profiling · PyPI](https://pypi.org/project/pandas-profiling/)
* [sweetviz · PyPI](https://pypi.org/project/sweetviz/)
* [Python Pathlib Tutorial - YouTube](https://www.youtube.com/watch?v=HejUKf88Ua0)
* [python - How to read a .xlsx file using the pandas Library in iPython? - Stack Overflow](https://stackoverflow.com/questions/16888888/how-to-read-a-xlsx-file-using-the-pandas-library-in-ipython)
* [simpledbf · PyPI](https://pypi.org/project/simpledbf/)
* [Check for NaN in Pandas DataFrame (examples included) - Data to Fish](https://datatofish.com/check-nan-pandas-dataframe/)
* [pip - How to install Python package from GitHub? - Stack Overflow](https://stackoverflow.com/questions/15268953/how-to-install-python-package-from-github)
* [python - Should I ignore DtypeWarning: Columns(17,62).....? - Stack Overflow](https://stackoverflow.com/questions/61335916/should-i-ignore-dtypewarning-columns17-62)
* [how to merge two data frames based on particular column in pandas python? - Stack Overflow](https://stackoverflow.com/questions/37697195/how-to-merge-two-data-frames-based-on-particular-column-in-pandas-python)
* [Replace values in list using Python - Stack Overflow](https://stackoverflow.com/questions/1540049/replace-values-in-list-using-python)
* [python - How to extract a floating number from a string - Stack Overflow](https://stackoverflow.com/questions/4703390/how-to-extract-a-floating-number-from-a-string)
* [RegExr: Learn, Build, & Test RegEx](https://regexr.com/)
* [plot - How to draw polygons with Python? - Stack Overflow](https://stackoverflow.com/questions/43971259/how-to-draw-polygons-with-python)
* [scipy.spatial.ConvexHull — SciPy v1.5.4 Reference Guide](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.ConvexHull.html)
* [folium · PyPI](https://pypi.org/project/folium/)
* [Mapping Points with Folium | Data EconoScientist](https://georgetsilva.github.io/posts/mapping-points-with-folium/)
* [python - How to iterate over rows in a DataFrame in Pandas - Stack Overflow](https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas)
* [Python pandas Filtering out nan from a data selection of a column of strings - Stack Overflow](https://stackoverflow.com/questions/22551403/python-pandas-filtering-out-nan-from-a-data-selection-of-a-column-of-strings)
* [python - How to plot Counter object in horizontal bar chart? - Stack Overflow](https://stackoverflow.com/questions/22222573/how-to-plot-counter-object-in-horizontal-bar-chart)
* [python - Unpacking a list / tuple of pairs into two lists / tuples - Stack Overflow](https://stackoverflow.com/questions/7558908/unpacking-a-list-tuple-of-pairs-into-two-lists-tuples)

---