## References
* [Automate Entity Extraction of Reddit Subgroup using BERT Model | by Manmohan Singh | Towards Data Science](https://towardsdatascience.com/automate-entity-extraction-of-reddit-subgroup-using-bert-model-336f9edb176e)
* [langdetect · PyPI](https://pypi.org/project/langdetect/)
* [TextBlob: Simplified Text Processing — TextBlob 0.16.0 documentation](https://textblob.readthedocs.io/en/dev/)
* [Transformers — transformers 4.0.0 documentation](https://huggingface.co/transformers/)
* [How to Compute Sentence Similarity Using BERT and Word2Vec | by Pedram Ataee, PhD | Oct, 2020 | Towards Data Science](https://towardsdatascience.com/how-to-compute-sentence-similarity-using-bert-and-word2vec-ab0663a5d64)



## Methodology
This notebook is the fruit of the hard work of the team: [Ausberto Escorcia](mailto:ausbertoescorcia@think-it.io), [Ghada Louil](mailto:ghada@think-it.io) and [Mustapha Sahli](mailto:mustaphasahli@think-it.io).

We collaborated as a team to solve [the CDP - Unlocking climate solutions challenge on Kaggle](https://www.kaggle.com/c/cdp-unlocking-climate-solutions) at race speed over the last month or so.

What is particularly interesting and challenging in this Kaggle competition is that the expected output is not precisely defined. Our first task as a team was therefore to define the deliverables of the competition: `the methodology` and the resulting `KPIs`.

As a first step, we had to define the KPIs so that we had a clear understanding of how to calculate them and, more generally, how to describe the methodology. For that, we referred to the resources mentioned across this notebook. We also noticed that, given the large amount of data, our efforts would be dispersed if we tackled every domain at once, so we focused on a single important domain, `Water` (combined with `Water security`), and on delivering the KPIs for it. Once that is done, we have a methodology that is applicable to the other domains and that can deliver the rest of the KPIs.

Our approach is described by the following figure.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
plt.figure(figsize=(20,10))
img = mpimg.imread('/kaggle/input/cdp-kpis/Water KPIs Tree.png')
imgplot = plt.imshow(img)
plt.show()

As the figure shows, our methodology is quite simple and should be efficient. We intend to combine our knowledge of NLP with the CDP data to extract the stated KPIs.

But before digging into the details of the KPIs, we take a quick look at the data and introduce some of the state-of-the-art NLP models that we would like to use.

## Dependencies

In [None]:
!pip install langdetect > /dev/null
!pip install --upgrade pip > /dev/null  # updating the package-management system pip
!pip install PyDictionary > /dev/null   # installing PyDictionary (https://pypi.org/project/PyDictionary/)
!pip install sent2vec > /dev/null

## Imports

In [None]:
import numpy as np 
import pandas as pd
import warnings; warnings.filterwarnings("ignore")  # to ignore "wandb"'s warning

from sent2vec.vectorizer import Vectorizer  # used to compute the embedding vectors
from scipy import spatial                   # used to compute the distance between these embedding vectors
from PyDictionary import PyDictionary
from collections import Counter
from pprint import pprint          # pretty prints 
from langdetect import detect      # a Language detection library ported from Google's language-detection.
from textblob import TextBlob      # a library for processing textual data
from glob import glob              # used for filename pattern matching
from transformers import pipeline  # state-of-the-art natural language processing

### Globals

In [None]:
CDP_PATH = "/kaggle/input/cdp-unlocking-climate-solutions/"

In [None]:
# load the question-answering pipeline (default pretrained model)
question_answerer = pipeline('question-answering')

# load the summarization pipeline (default pretrained model)
summary = pipeline('summarization')

# load the named entity recognition pipeline (default pretrained model)
ner = pipeline('ner')

## Helper functions

In [None]:
def load_data(filepath):
    """Reads in the CSV datasets (of the given file path) as Dataframes then return their concatenation"""
    # skip the data dictionaries, which describe the columns rather than hold data
    files = [file for file in glob(filepath + "*.csv") if "Data_Dictionary" not in file]
    if not files:
        return pd.DataFrame()
    # concatenate once instead of growing a dataframe inside the loop
    return pd.concat([pd.read_csv(file) for file in files], ignore_index=True)

def get_csv_data(path, folder, subfolder=None):
    """A helper function providing an easier way to access the different dataset files"""
    if path.lower() == "cities":
        if folder.lower() == "disclosing":
            data_path = "Cities/Cities Disclosing/"
        elif folder.lower() == "responses":
            data_path = "Cities/Cities Responses/"
        else:
            data_path = "Cities/Cities Questionnaires/"
        
    elif path.lower() == "corporations":
        if not subfolder:
            print("Subfolder needs to be specified")
            return None
        if folder.lower() == "disclosing":
            if subfolder.lower() == "climate change":
                data_path = "Corporations/Corporations Disclosing/Climate Change/"
            elif subfolder.lower() == "water security":
                data_path = "Corporations/Corporations Disclosing/Water Security/"
            else:
                print("No such folder")
                return None
        elif folder.lower() == "responses":
            if subfolder.lower() == "climate change":
                data_path = "Corporations/Corporations Responses/Climate Change/"
            elif subfolder.lower() == "water security":
                data_path = "Corporations/Corporations Responses/Water Security/"
            else:
                print("No such folder")
                return None
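        else:
            # the original fell back to the Cities questionnaires path here,
            # which looks like a copy-paste slip; report the problem instead
            print("No such folder")
            return None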
            
    else:
        print("open the file directly")
        return None
    
    try:
        return load_data(CDP_PATH + data_path)
    except Exception:
        print("No such file in directory")
        return None
    
def translate(text):
    """Translates the given text to English"""
    if detect(text) == "en":
        return text
    blob = TextBlob(text)
    translated = blob.translate(to='en')
    return str(translated)

def get_questions(module_id):
    """Given the module ID, returns the questions belonging to that module"""
    res = list()
    for ques_id in list(questions['2019 Question number'].unique()):
        if module_id == ques_id.split('.')[0]:
            res.append(ques_id)
    return res

def get_city_name(account_number):
    """Given an account number, returns the city name"""
    df = cities_disc[cities_disc['Account Number'] == account_number]
    values = list(df['City'].unique())
    if len(values) > 1 or values == [""]:
        values = cities_resp[cities_resp['Account Number'] == account_number]['Organization'].values
        translated_org_name = translate(values[0])
        return translated_org_name
    else:
        return values[0]

## Pipeline

In [None]:
%%time 

# loading data
cities_resp = get_csv_data("cities", "responses")
cities_disc = get_csv_data("cities", "disclosing")
corpos_resp = get_csv_data("corporations", "responses", "water security")
corpos_disc = get_csv_data("corporations", "disclosing", "water security")
print(cities_resp.shape, cities_disc.shape, corpos_resp.shape, corpos_disc.shape)

# replace NA/NaN values with an empty string from cities and corporations dataframes
cities_resp.fillna("", inplace = True)
cities_disc.fillna("", inplace = True)
corpos_resp.fillna("", inplace = True)
corpos_disc.fillna("", inplace = True)

# read and process questions
questions = pd.read_excel(CDP_PATH + "Supplementary Data/Recommendations from CDP/CDP_recommendations_for_questions_to_focus_on.xlsx")
questions.drop(columns=["Unnamed: 4", "Unnamed: 5", "Unnamed: 6", "Unnamed: 7"], inplace=True)  # drop unnamed columns
questions.columns = questions.iloc[3]                             # setting the dataframe's column names
questions.drop(index=[0,1,2,3], inplace=True)                     # remove unwanted rows
questions.reset_index(drop=True, inplace=True)                    # resets the index to the default integer index
questions.columns.name = "index"                                  # setting the name of the columns axis
questions.fillna("", inplace=True)                                # replace NA/NaN values with an empty string
questions = questions.apply(lambda x: x.astype(str).str.lower())  # set the questions to lowercase


In [None]:
pd.options.display.max_rows = 200  # sets the maximum number of rows displayed when a frame is printed
questions

In [None]:
q = questions['2019 Question number'].iloc[0]              # select the first row (containing the first question)
print('Question', q, sep='\t')
df = cities_resp[cities_resp["Question Number"] == q[1:]]  # select the rows related to that question (the first character, referring to the module, is omitted)
print('Dataframe Shape', df.shape, sep='\t')
df.head()

In [None]:
answers = list(df["Response Answer"].unique())  # get a list of unique answers to that question
print(len(answers), 'Answers Found')
pprint(answers)

In [None]:
languages = []
for answer in answers:
    try:
        languages.append(detect(answer))
    except Exception:
        # langdetect raises on empty or non-linguistic text; show the answer and skip it
        print(answer)
counter = Counter(languages)

In [None]:
labels = list(counter.keys())
lang_counter = list(counter.values())

explode = [0.1 if i == 1 else 0 for i in range(len(labels))]  # pull out the second slice, whatever the number of languages
fig = plt.figure(figsize =(10,8))
plt.title('Distribution of languages across water response data')
plt.pie(lang_counter, explode=explode, startangle=90) 
plt.legend(labels = labels,loc=[1.1, 0.5])
plt.show()

### Translation

A notable feature of our approach is that we take all languages into consideration: if a response is not written in English, we first translate it and then pass it through the same pipeline as the answers originally written in English.
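
The routing described above can be sketched independently of any particular library. The snippet below is a minimal sketch, not the notebook's implementation: `to_english`, `detect_fn` and `translate_fn` are hypothetical names, and the two toy callables only stand in for a real language detector and a real translation backend. It also shows one possible way to keep the pipeline running when detection fails on empty or non-linguistic answers.

```python
def to_english(text, detect_fn, translate_fn):
    """Return `text` in English, translating only when needed.

    `detect_fn` maps text to a language code and `translate_fn` maps text
    to English; both are injected so that any backend can be plugged in
    (an assumption made for this sketch).
    """
    if not text or not text.strip():
        return text  # nothing to detect or translate
    try:
        language = detect_fn(text)
    except Exception:
        return text  # detection failed: keep the original answer
    return text if language == "en" else translate_fn(text)


# Toy stand-ins, for illustration only.
def toy_detect(text):
    return "es" if "ciudad" in text.lower() else "en"


def toy_translate(text):
    return text.replace("La ciudad", "The city")


print(to_english("La ciudad de Barrancabermeja", toy_detect, toy_translate))
print(to_english("City water supply is secure", toy_detect, toy_translate))
```

Injecting the detector and the translator keeps the routing logic testable without network access or model downloads.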

In [None]:
a = "La ciudad de Barrancabermeja es el mayor centro urbano de la Región del Magdalena Medio,  Ubicado en el departamento de Santander en Colombia, es la Capital de la Provincia de Mares, Localizada a 120 Km de distancia de la Capital del Departamento Bucaramanga, se encuentra en la margen derecha del Rio Magdalena conocida como la Capital Petrolera de Colombia, porque desde sus inicios ésta actividad contribuyó al desarrollo de lo que hoy es la ciudad. Con la llegada del petróleo a este territorio se generó una fuerte migración de pobladores de diferentes zonas del país, y junto con la explotación del petróleo originada por la concesión de Mares, llamada así por Roberto de Mares, quien dirigió las primeras actividades de explotación, la ciudad empezó a generar un proceso de crecimiento y expansión urbanística hasta el punto que en 1.922 se dio la ordenanza para convertir al caserío en un municipio, y en ese mismo año entró en funcionamiento la refinería, la cual era administrada por la Tropical Oil Company, hoy en día ExxonMobil de Colombia S.A.Gracias al desarrollo de la industria petrolera, Barrancabermeja se ha convertido en un fuerte complejo empresarial para el departamento de Santander donde diferentes sectores económicos tales como la construcción y los servicios, registraron un crecimiento de sus actividades económicas en los últimos años, contribuyendo así en la generación de oportunidades de empleo para los habitantesy el crecimiento urbano de la misma ciudad."
translated = translate(a)
print('Original Language', detect(a), sep='\t')
print('English Translation', translated, sep='\t')

### Text Summarization

One more idea is to summarize the long answers and assess the speed/accuracy tradeoff introduced with this decision.
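
One way to limit the cost of summarization is to apply it only to answers above a length threshold. The sketch below assumes a hypothetical helper `maybe_summarize` and an arbitrary threshold of 150 words; `first_sentence` is a toy summarizer standing in for the real model.

```python
def maybe_summarize(text, summarize_fn, max_words=150):
    """Summarize `text` only when it is longer than `max_words` words."""
    if len(text.split()) <= max_words:
        return text  # short answers skip the (slow) summarizer
    return summarize_fn(text)


# Toy summarizer (assumption): keep the first sentence only.
def first_sentence(text):
    return text.split(". ")[0].rstrip(".") + "."


short_answer = "City water supply is secure."
long_answer = "Water supply is monitored. " * 100

print(maybe_summarize(short_answer, first_sentence))
print(maybe_summarize(long_answer, first_sentence))
```

The threshold is the knob for the speed/accuracy tradeoff: a higher value summarizes fewer answers and preserves more of the original text.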

In [None]:
s = summary(translated)[0]['summary_text']
s

### Question Answering

Once translated, we can use question-answering algorithms to extract the particular information we need from the whole answer.
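
The question-answering pipeline returns a dictionary with the extracted `answer`, its `start` and `end` offsets in the context, and a confidence `score`. A minimal sketch of how low-confidence extractions could be discarded is shown below; `accept_answer` and the 0.5 cut-off are assumptions for illustration, and the two result dictionaries are hand-written, not real model outputs.

```python
def accept_answer(result, min_score=0.5):
    """Return the extracted answer if the model is confident enough, else None.

    `result` is assumed to be a dict with at least 'answer' and 'score' keys.
    """
    if result["score"] >= min_score:
        return result["answer"].strip()
    return None


# Hand-written results for illustration; these are not real model outputs.
confident = {"score": 0.93, "start": 12, "end": 27, "answer": "Barrancabermeja"}
uncertain = {"score": 0.08, "start": 0, "end": 3, "answer": "The"}

print(accept_answer(confident))
print(accept_answer(uncertain))
```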

In [None]:
question_answerer({'question': "what is the name of the city?", 'context': s})

## Methodology

*Idea*
As explained before, the KPIs were not precisely defined, and the methodology aims at calculating them. We first explored the available resources and came up with a set of KPIs for water security, climate change, city governance and the other domains present in our data. Our method consists of processing the response data and extracting meaningful information from it, so that we have a clear understanding of whether a particular KPI is present in a specific location.

Three remarks:
1. We worked on water and water security KPIs only; we therefore extracted the cities' response data for both sections and applied the rest of the processing to them.
2. We also only considered water KPIs, including water security, as resources to build the methodology.
3. We work at the city level, and prioritize cities that focus on water issues and solutions.

The KPIs considered for water and water security are listed below (kudos to Ausberto Escorcia):
* safe and affordable drinking water
* end open defecation and provide access to sanitation and hygiene
* improve water quality, wastewater treatment and safe reuse
* increase water use efficiency and ensure freshwater supplies
* implement integrated water resources management
* protect and restore water-related ecosystems
* expand water and sanitation support to developing countries
* support local engagement in water and sanitation management

In the following section, we detail the steps of our methodology along with some testing of methods and models for text processing and feature extraction.

We used state-of-the-art pretrained Transformer models (BERT and related architectures), through the `transformers` pipelines, for question answering, summarization and named entity recognition.

P.S. Some of the tests on the data failed, either because more preprocessing was needed or because the method was simply inadequate. They were therefore removed from this notebook and may reappear in future versions.

### In-depth data exploration

In [None]:
# water security questions
water_quests_ids = get_questions("c14")
water_questions = [list(questions[questions['2019 Question number'] == q]['Question text'].values)[0] for q in water_quests_ids]
water_questions

In [None]:
water_security_df = cities_resp[cities_resp["Parent Section"] == "Water Security"]  # get all the questions related to "Water Security"
print('Dataframe Shape', water_security_df.shape, sep='\t')
water_security_df.head()

Let's consider one city, then identify its water issues, city governance and accomplished KPIs.  
Within the same country, different cities may give different answers to the same question, so it is better to focus on one country at a time.
We randomly selected `Canada` as a large country that contains multiple cities and a fair amount of data to work on.

In [None]:

canada_resp = water_security_df[water_security_df["Country"] == "Canada"]
print(canada_resp.shape)
canada_resp.head()

In [None]:
get_city_name(848408)  # find out which city this account number belongs to

Another idea we had is to apply [NER (Named Entity Recognition)](https://en.wikipedia.org/wiki/Named-entity_recognition) on an answer to extract other pieces of information.

In [None]:
text = 'Municipality of Cajamarca'
words = text.split()
for word in words:
    print(ner(word))

## Next Steps
This is only a first iteration, in which we relied on the power of the algorithms we used (i.e. their resilience to raw data). We can improve the results of our approach by focusing more on the data. We also noticed that it would be easier if some pieces of information were kept together rather than spread over different datasets.

### Using Synonyms

Another idea we can try is to search first for a list of keywords and then for their synonyms, for example clean water -> potable or affordable -> cheap. This way, we get the information we need even if it is not worded as we expected. As we all know, if you give the same text to different translators, you won't get exactly the same result; searching for synonyms mitigates this inherent variability of natural language. Below is an example of this method, using the [PyDictionary](https://pypi.org/project/PyDictionary/) library.
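
The two-level search (keyword first, synonyms second) can be sketched without any external dependency. In the snippet below the `SYNONYMS` table and the `find_concepts` helper are hypothetical; in the notebook the synonyms would come from PyDictionary rather than from a hard-coded dictionary.

```python
import re

# Hypothetical synonym table; in the notebook this would come from PyDictionary.
SYNONYMS = {
    "clean": ["potable", "pure", "safe"],
    "affordable": ["cheap", "inexpensive", "low-cost"],
}


def find_concepts(text, synonyms=SYNONYMS):
    """Return {concept: matched_terms} for every concept found in `text`."""
    # lowercase word tokens, keeping hyphenated words such as "low-cost" whole
    words = set(re.findall(r"[a-z\-]+", text.lower()))
    found = {}
    for concept, alternatives in synonyms.items():
        matches = [term for term in [concept] + alternatives if term in words]
        if matches:
            found[concept] = matches
    return found


print(find_concepts("The city provides potable water at a cheap, low-cost rate."))
```

Exact word matching is deliberately simple here; it misses inflected forms and multi-word expressions, which is one motivation for the embedding-based similarity discussed next.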

In [None]:
dictionary = PyDictionary()
keywords = ['clean', 'affordable', 'accessible']
for keyword in keywords:
    synonyms = dictionary.synonym(keyword) or []  # may be None when no synonym is found
    print(f'{keyword:10}', ', '.join(synonyms[:5]) + ', ...', sep='\t')

### Sentence Similarity

Another idea is to check whether an answer can be related to more than one KPI. To find out, we compute the similarity between that answer and the keywords related to each KPI. Below is an example of applying sentence similarity to this use case.
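
The similarity measure used in this section is the cosine distance, defined as one minus the cosine of the angle between two vectors, which is the definition `scipy.spatial.distance.cosine` follows. It is 0 for vectors pointing in the same direction, 1 for orthogonal vectors and 2 for opposite ones. The pure-Python sketch below illustrates the formula on made-up three-dimensional vectors; real sentence embeddings have hundreds of dimensions.

```python
import math


def cosine_distance(u, v):
    """Cosine distance: 1 - (u . v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Toy 3-dimensional "embeddings" (made up for illustration).
answer_vec = [0.9, 0.1, 0.0]
water_vec = [1.0, 0.0, 0.0]
climate_vec = [0.0, 1.0, 0.0]

print(cosine_distance(answer_vec, water_vec))    # small: similar direction
print(cosine_distance(answer_vec, climate_vec))  # larger: different direction
```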

In [None]:
question = 'Rate the importance (current and future) of water quality and water quantity to the success of your business.'
answer = 'We use water for drinking, sanitary purposes and some industrial processes at our plants. We strive to create a positive impact on our environment, by providing products and services that enable our customers to use less water. For example, we produce a thermosyphon cooler hybrid system, which can create substantial savings in the water used by power plants. Our district energy solutions that include both equipment and controls, such as that deployed at Stanford University, have the benefit of reducing both energy and water use. We also seek to continuously improve in our water management in our operations. Our facility siting and facility acquisitions undergo a due diligence process that we believe helps avoid situations where we would face significant water risks. Given our business changes, we are in the process of conducting additional analysis on our water use.'
keywords = [ 'Water Security', 'Climate Change' ]
sentences = [ answer ] + keywords

In [None]:
print('question', question, sep='\t')
print('answer', answer, sep='\t')

In [None]:
# string --> vector
vectorizer = Vectorizer()
vectorizer.bert(sentences)
vectors_bert = vectorizer.vectors
vectors_bert

In [None]:
answer_vector = vectors_bert[0]
ws_keyword_vector, cc_keyword_vector = vectors_bert[1], vectors_bert[2]

In [None]:
answer_ws_distance = spatial.distance.cosine(answer_vector, ws_keyword_vector)
answer_cc_distance = spatial.distance.cosine(answer_vector, cc_keyword_vector)
print('answer_ws_distance', answer_ws_distance, sep='\t')
print('answer_cc_distance', answer_cc_distance, sep='\t')

In [None]:
# helper function
def semantic_distance(string_a, string_b):
    """Computes the semantic distance between the given strings"""
    sentences = [ string_a, string_b ]
    vectorizer = Vectorizer()
    vectorizer.bert(sentences)
    vectors_bert = vectorizer.vectors
    a_vector, b_vector = vectors_bert
    return spatial.distance.cosine(a_vector, b_vector)

In [None]:
answer = '''
Warming temperatures and increased run off in a changing climate could lead to algal blooms, higher levels of bacterial activity, and a potential increase in the current very low levels of naturally-occurring disease-causing organisms (such as Giardia) in water supply reservoirs.   Algal blooms can cause taste and odour issues and interfere with disinfection.  The low levels of bacterial activity and disease-causing organisms in the reservoirs at present are able to be deactivated by existing water disinfection processes – ultraviolet light, chlorine, and ammonia.  It is unlikely that these organisms would increase beyond the capability of the disinfection system.  The nutrient poor status and large volume of water in supply reservoirs will greatly buffer any effects of climate change.  There is regular testing and monitoring in place to ensure a safe drinking water supply and detect any changes that would require an adjustment to processes.
'''

A practical example is to compute the distance between the answer and one keyword plus all of its synonyms.  
The distance to the concept is the minimum distance found.

In [None]:
for keyword in ['clean'] + (dictionary.synonym('clean') or []):
    distance = semantic_distance(answer, keyword)
    print(keyword, distance, sep='\t')

In [None]:
def answer_concept_distance(answer, concept, threshold=0.5):
    """Returns the smallest semantic distance between the answer and the concept (or any of its synonyms), and whether it falls below the threshold"""
    assert 0 <= threshold <= 1, "The threshold should be between 0 and 1!"
    distance_with_concept = float('inf')
    for keyword in [concept] + (dictionary.synonym(concept) or []):  # synonym() may return None
        distance = semantic_distance(answer, keyword)
        distance_with_concept = min(distance_with_concept, distance)
    return { 'distance_with_concept': distance_with_concept, 'is_relevant': distance_with_concept < threshold }

In [None]:
answers = [
    'City water supply is secure',
     'The City of Toronto has a Source Protection Plan that contains a series of policies that, when implemented, will protect its drinking water sources from current and future threats. None of the threats identified in the Plan are considered to be substantive.  More information is available at http://www.ctcswp.ca/ctc-source-protection-plan/',
     'Other: Water supply is secure but being monitored and managed',
     "The City of Toronto is part of a larger watershed under the umbrella of the Credit Valley-Toronto and Region-Central Lake Ontario (CTC) Source Protection Plan.  This source protection plan  contains a series of policies intended to protect the watershed's drinking water sources from current and future threats.  See also   https://ctcswp.ca/protecting-our-water/the-ctc-source-protection-plan/",
     'Other, please specify: Water supply is secure but being monitored and managed'
]

In [None]:
%%time

for idx, answer in enumerate(answers):
    print(idx, answer_concept_distance(answer, 'clean'), sep='\t')

As we can see above, the algorithm loops over the answers to detect whether any of them hints at `clean water` (or any paraphrase of it), which is directly associated with the KPI `Water safety`. 

The printed result is the list of distances for the pairs (answer, "clean"). The closer the distance is to 0, the closer the meaning of the two texts. We defined a relevance flag based on a threshold of 0.5, which states whether the closeness is relevant or not.
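
Since each distance requires embedding both texts, repeated (answer, keyword) pairs are recomputed from scratch, and several of the answers above are near-duplicates. If the loop turns out to be slow, memoization is one possible remedy. The sketch below uses `functools.lru_cache`; `slow_distance` is a toy stand-in for the embedding-based distance, and `cached_distance` and `sample_answers` are hypothetical names.

```python
from functools import lru_cache

calls = {"count": 0}


def slow_distance(answer, keyword):
    """Stand-in for an expensive embedding-based distance (assumption)."""
    calls["count"] += 1
    shared = set(answer.lower().split()) & set(keyword.lower().split())
    return 0.0 if shared else 1.0


@lru_cache(maxsize=None)
def cached_distance(answer, keyword):
    """Memoized wrapper: each (answer, keyword) pair is computed once."""
    return slow_distance(answer, keyword)


sample_answers = ["Water supply is secure", "Water supply is secure", "No risk identified"]
distances = [cached_distance(answer, "water") for answer in sample_answers]

print(distances)
print(calls["count"])  # number of times the expensive function actually ran
```

The cache only helps when identical strings recur; for distinct answers, embedding all texts in a single batch would be the more effective optimization.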

## Intention Recognition

In [None]:
sa = pipeline("sentiment-analysis")

In [None]:
for answer in answers:
    print("\n", answer, sa(answer))

In [None]:
# # !python -m deeppavlov install tfidf_logreg_en_faq
# # !python -m deeppavlov interact tfidf_logreg_en_faq -d
# !pip install deeppavlov

In [None]:
# %load https://raw.githubusercontent.com/deepmipt/DeepPavlov/master/deeppavlov/configs/faq/tfidf_logreg_en_faq.json
'''
{
  "dataset_reader": {
    "class_name": "faq_reader",
    "x_col_name": "Question",
    "y_col_name": "Answer",
    "data_url": "http://files.deeppavlov.ai/faq/school/faq_school_en.csv"
  },
  "dataset_iterator": {
    "class_name": "data_learning_iterator"
  },
  "chainer": {
    "in": "q",
    "in_y": "y",
    "pipe": [
      {
        "class_name": "stream_spacy_tokenizer",
        "in": "q",
        "id": "my_tokenizer",
        "lemmas": true,
        "out": "q_token_lemmas"
      },
      {
        "ref": "my_tokenizer",
        "in": "q_token_lemmas",
        "out": "q_lem"
      },
      {
        "in": [
          "q_lem"
        ],
        "out": [
          "q_vect"
        ],
        "fit_on": [
          "q_lem"
        ],
        "id": "tfidf_vec",
        "class_name": "sklearn_component",
        "save_path": "{MODELS_PATH}/faq/mipt/en_mipt_faq_v4/tfidf.pkl",
        "load_path": "{MODELS_PATH}/faq/mipt/en_mipt_faq_v4/tfidf.pkl",
        "model_class": "sklearn.feature_extraction.text:TfidfVectorizer",
        "infer_method": "transform"
      },
      {
        "id": "answers_vocab",
        "class_name": "simple_vocab",
        "fit_on": [
          "y"
        ],
        "save_path": "{MODELS_PATH}/faq/mipt/en_mipt_faq_v4/en_mipt_answers.dict",
        "load_path": "{MODELS_PATH}/faq/mipt/en_mipt_faq_v4/en_mipt_answers.dict",
        "in": "y",
        "out": "y_ids"
      },
      {
        "in": "q_vect",
        "fit_on": [
          "q_vect",
          "y_ids"
        ],
        "out": [
          "y_pred_proba"
        ],
        "class_name": "sklearn_component",
        "main": true,
        "save_path": "{MODELS_PATH}/faq/mipt/en_mipt_faq_v4/logreg.pkl",
        "load_path": "{MODELS_PATH}/faq/mipt/en_mipt_faq_v4/logreg.pkl",
        "model_class": "sklearn.linear_model:LogisticRegression",
        "infer_method": "predict_proba",
        "C": 1000,
        "penalty": "l2"
      },
      {
        "in": "y_pred_proba",
        "out": "y_pred_ids",
        "class_name": "proba2labels",
        "max_proba": true
      },
      {
        "in": "y_pred_ids",
        "out": "y_pred_answers",
        "ref": "answers_vocab"
      }
    ],
    "out": [
      "y_pred_answers",
      "y_pred_proba"
    ]
  },
  "train": {
    "evaluation_targets": [],
    "class_name": "fit_trainer"
  },
  "metadata": {
    "variables": {
      "ROOT_PATH": "~/.deeppavlov",
      "DOWNLOADS_PATH": "{ROOT_PATH}/downloads",
      "MODELS_PATH": "{ROOT_PATH}/models"
    },
    "requirements": [
      "{DEEPPAVLOV_PATH}/requirements/spacy.txt",
      "{DEEPPAVLOV_PATH}/requirements/en_core_web_sm.txt"
    ],
    "download": [
      {
        "url": "http://files.deeppavlov.ai/faq/mipt/en_mipt_faq_v4.tar.gz",
        "subdir": "{MODELS_PATH}/faq/mipt"
      }
    ]
  }
}
'''

In [None]:
# !python -m deeppavlov install tfidf_logreg_en_faq
# !python -m deeppavlov interact tfidf_logreg_en_faq -d
# !python -m deeppavlov train tfidf_logreg_en_faq

In [None]:
# import csv

# with open("answers.csv", "w") as f:
#     writer = csv.writer(f)
#     writer.writerows(answers)

In [None]:
# %%bash
# wget -q http://files.deeppavlov.ai/faq/school/faq_school_en.csv -O faq.csv
# echo "What's DeepPavlov?, DeepPavlov is an open-source conversational AI library" >> faq.csv

In [None]:
# # https://colab.research.google.com/github/deepmipt/dp_notebooks/blob/master/DP_autoFAQ.ipynb#scrollTo=wMZqyzBYc2eV&uniqifier=1
# from deeppavlov import configs
# from deeppavlov.core.common.file import read_json
# from deeppavlov.core.commands.infer import build_model
# import sklearn.model_selection

# from deeppavlov import configs, train_model


# faq = build_model(configs.faq.tfidf_logreg_en_faq, download = True)
# a = faq(["I need help"])
# a


# model_config = read_json(configs.faq.tfidf_logreg_en_faq)
# model_config["dataset_reader"]["data_path"] = "/kaggle/working/answers.csv"
# model_config["dataset_reader"]["data_url"] = None
# faq = train_model(model_config)
# a = faq(["tell me about water"])
# a

A solution to the DeepPavlov installation problem can be found here: http://docs.deeppavlov.ai/en/master/features/models/classifiers.html
However, it does not seem that intention recognition will improve the accuracy of detecting a KPI or add any refinement to the methodology. That is because, when attempting to extract KPIs, the sentiment of the author has no impact on the facts mentioned in their answer.

Better processing can be done using text summarization for long answers and more accurate questions to extract the exact information needed.

In [None]:
answers

In [None]:
a

## Given a city, detect KPIs

In [None]:
# testing with chicago

chicago_account_numbers = list(set(cities_disc[cities_disc['City'] == "Chicago"]["Account Number"]))
chicago_resp_data = cities_resp[cities_resp['Account Number'] == chicago_account_numbers[0]]
print(chicago_resp_data.shape)
chicago_resp_data.head()

In [None]:
sections = chicago_resp_data['Section'].unique()
sections

In [None]:
maxi = ("", 0)
for section in sections:
    answers_count = chicago_resp_data[chicago_resp_data['Section'] == section].shape[0]
    print(section, answers_count)
    if answers_count > maxi[1]:
        maxi = (section, answers_count)
print("\n", maxi)

In [None]:
# count the answers per section; assigning to rows yielded by iterrows() does not reliably update the dataframe
df = (chicago_resp_data['Section']
      .value_counts()
      .rename_axis('section_name')
      .reset_index(name='answers_number')
      .sort_values(by=['answers_number'])
      .reset_index(drop=True))
df

In [None]:
# import matplotlib.pyplot as plt

fig = plt.figure(figsize=[10,5])
ax = fig.add_axes([0,0,1,1])
sections = list(df['section_name'].values)
section_count = list(df['answers_number'].values)
ax.bar(sections, section_count)
plt.xticks(rotation=90)
plt.show()

The three most discussed topics in the city of Chicago are related to Climate Change KPIs.
Let's see which KPIs relate to climate change and which of them are activated in Chicago.

Climate Change KPIs:
* Hazards and Vulnerability
    * Natural disaster risk management
    * Hazardous waste generation
* Emissions
    * Emissions policy
    * Emissions measurements
    * Emissions planning
* Energy
    * Access to energy
    * Renewable energy
    
For the sake of the proof of concept, we will not go through all the questions; we only select the first three sections: `City-wide GHG Emissions Data`, `Climate Hazards`, and `Transport`.
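
Picking the most discussed sections amounts to counting answers per section and keeping the top entries. A minimal sketch with `collections.Counter` follows; the section labels in `answer_sections` are made up for illustration and do not reflect the real CDP counts.

```python
from collections import Counter

# Made-up section labels, one per answer row (not real CDP counts).
answer_sections = [
    "Climate Hazards", "Transport", "Climate Hazards", "Water Supply",
    "City-wide GHG Emissions Data", "Climate Hazards", "Transport",
    "City-wide GHG Emissions Data", "City-wide GHG Emissions Data",
    "City-wide GHG Emissions Data",
]

section_counts = Counter(answer_sections)
# most_common(3) returns the three (section, count) pairs with the highest counts
top_three = [name for name, _ in section_counts.most_common(3)]
print(top_three)
```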

In [None]:
sec = ['City-wide GHG Emissions Data', 'Climate Hazards','Transport']
climate_answers_chicago_df = pd.DataFrame()
for s in sec:
    climate_answers_chicago_df = pd.concat([climate_answers_chicago_df, chicago_resp_data[chicago_resp_data['Section'] == s ]])
print(climate_answers_chicago_df.shape)
cc_chicago_answers = list(climate_answers_chicago_df['Response Answer'].unique())
cc_chicago_answers

In [None]:
cc_answers = ' '
cc_answers = cc_answers.join(cc_chicago_answers)
cc_answers

### Using the question-answerer from BERT

In [None]:
# Using question-answering of BERT
question_answerer({'question': "what is the name of the city?", 'context': cc_answers})

In [None]:
question_answerer({'question': "what are climate change issues?", 'context': cc_answers})

In [None]:
question_answerer({'question': "how is climate change being fought?", 'context': cc_answers})

In [None]:
question_answerer({'question': "what are the plans to fight climate change?", 'context': cc_answers})

In [None]:
question_answerer({'question': "what are the plans to fight disaster risks?", 'context': cc_answers})

In [None]:
question_answerer({'question': "what is being done to reduce emissions?", 'context': cc_answers})

In [None]:
question_answerer({'question': "business and emission?", 'context': cc_answers})

In [None]:
question_answerer({'question': "business and energy?", 'context': cc_answers})

In [None]:
cc_answers[9098:10000]

### Using semantic similarity

In [None]:
for keyword in ['Hazards and Vulnerability'] + (dictionary.synonym('Hazards and Vulnerability') or []):
    print(keyword)
#     distance = semantic_distance(answer, keyword)
#     print(keyword, distance, sep='\t')

Ideas:
* We can categorize according to our own researched KPIs and see where the biggest focus is.
* We can leave it as it is and learn that information by applying NLP to the answers.
* We can move forward to extract more information and then understand, from the answers, which KPIs are activated or need to be activated.
* Doing the same processing with parent sections can give more perspective on the focus area of the region/city.

In [None]:
parent_sections = chicago_resp_data['Parent Section'].unique()
maxi = ("", 0)
for section in parent_sections:
    answers_count = chicago_resp_data[chicago_resp_data['Parent Section'] == section].shape[0]
    print(section, answers_count)
    if answers_count > maxi[1]:
        maxi = (section, answers_count)
print("\n", maxi)

The parent section has missing values, which makes this method harder to move forward with. Let's nevertheless look at the rows with no parent section.

In [None]:
chicago_resp_data[chicago_resp_data['Parent Section'] == ""]

Can we use the sections to figure out the parent sections?

In [None]:
ques_num = chicago_resp_data[chicago_resp_data['Parent Section'] == ""]['Question Number'].unique()
ques_num

In [None]:
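def fill_empty_section(df, module_names):
    """Fills the empty 'Parent Section' values using the module code found in 'Question Number'"""
    df = df.copy()
    # rows without a parent section, ignoring the two entries that carry no module code
    mask = (df['Parent Section'] == '') & ~df['Question Number'].isin(['Response Language', 'Amendments_question'])
    module_codes = df.loc[mask, 'Question Number'].astype(str).str.split('.').str[0]
    # unknown module codes are left empty
    df.loc[mask, 'Parent Section'] = module_codes.map(module_names).fillna('')
    return df

# `dic` (module code -> module name) is built in a later cell, which has to run first
chicago_resp_data = fill_empty_section(chicago_resp_data, dic)
chicago_resp_data['Parent Section'].unique()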

In [None]:
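# rows that still have no parent section
ques_num = chicago_resp_data[chicago_resp_data['Parent Section'] == ''].copy()
# DataFrame.set_value was removed from pandas; .at is the label-based replacement.
# 'Intro' is assumed to be the intended section name (it was an undefined variable).
if not ques_num.empty:
    ques_num.at[ques_num.index[0], 'Parent Section'] = 'Intro'
ques_num.head(1)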

In [None]:
dic = {}

modules = questions.Module.unique()
for module in modules:
    if module:
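        code, module_name = module.split(".", 1)  # split on the first dot only
        dic[code[1:]] = module_name.strip()       # drop the leading letter of the code and surrounding spaces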
dic

Look at the `Question Number` column to extract the parent section name; the supplementary data (the recommended questions file) could also be used.
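
The lookup can be sketched as follows, assuming question numbers of the form `<module code>.<question>` such as `14.2a`. The `parent_section` helper and the `module_names` mapping are hypothetical; in the notebook the mapping would be derived from the recommended questions file.

```python
def parent_section(question_number, module_names):
    """Map a question number such as '14.2a' to its module name.

    Returns None when the number carries no known module code.
    """
    module_code = str(question_number).split(".")[0].strip()
    return module_names.get(module_code)


# Hypothetical mapping from module code to module name.
module_names = {"0": "introduction", "14": "water security"}

print(parent_section("14.2a", module_names))
print(parent_section("0.1", module_names))
print(parent_section("Response Language", module_names))
```

Returning `None` for entries without a module code makes the unresolved rows easy to list and inspect afterwards.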

TODO
- Display a figure that shows the areas of focus for Chicago
- Gather all answers related to Chicago in one text variable
- Use the answers to extract the active KPIs
- Understand the strategies implemented or being planned


In [None]:
chicago_modules = chicago_resp_data