# Themes Analysis for Consultation Sandbox
## TF-IDF

This notebook tests the extraction of key themes from dummy consultation data.
Inspired by: https://datasciencecampus.ons.gov.uk/projects/automating-consultation-analysis/

This version of the notebook focuses on the two most promising approaches to extracting key phrases using TF-IDF; the second is tried in two variants (2a and 2b).


_Note: I'm not using a train/test split for prediction here; every response goes into the vectoriser to get key phrases back. This is legitimate because it's not a supervised approach._

---
## Technique C: TF-IDF


### Approaches
In this notebook I've tried out a number of different approaches. I'm not 100% sure they're all legitimate, but at this stage I'm just experimenting. A summary of the approaches is as follows:

- [**Approach 1:**](#approach-1) _Treating all responses of each type as one whole document (so all positive responses form a single document), to gain a picture of the phrases that mark out positive responses as a group distinct from other types of responses, without any need to summarise scores from individual responses._

    We do this by calculating TF-IDF scores of bigrams and trigrams in all responses, where the positive responses are all one document. We can then easily pick out the most important phrases in the positive responses versus the other types of response.

- [**Approach 2:**](#approach-2) _Treating every individual response as its own document in the corpus; finding out what makes positive responses as a group distinct from other types of responses._

    We do this by:
    
    [a)](#approach-2a) Calculating TF-IDF scores of bigrams and trigrams appearing in all responses. Calculating mean scores for just positive responses, to gain a summary-view of most important phrases for positive responses.
    
    [b)](#approach-2b) Calculating TF-IDF scores of bigrams and trigrams appearing in all responses, removing phrases that appear most often first. Calculating mean scores for just positive responses, to gain a summary-view of most important phrases for positive responses.
    
[**Results comparison**](#results) _Compare the results from the 3 different approaches._

----
### Prepare data

In [None]:
# Load packages
from arrow_pd_parser import reader
import os
import spacy
import string
import pandas as pd
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [None]:
# Import data

s3_bucket = "s3://alpha-everyone/nlp-code-examples/"
file_loc = "Consultation_Dummy_NewQuestions.csv"

df = reader.read(os.path.join(s3_bucket, file_loc))

Clean column names

In [None]:
def multiple_replace(replacements, text):
    # Create a regular expression from the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, replacements.keys())))
    # For each match, look-up corresponding value in dictionary
    return regex.sub(lambda mo: replacements[mo.group()], text) 

replacements = {" ":"_",
              "-":"_",
              "/":"_",
              "?":"",
              "'":""}

new_cols = list()
for i in df.columns.str.split('- '):
    cleaned = multiple_replace(replacements, i[-1]).lower().strip()
    new_cols.append(cleaned)
df.columns = new_cols
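
To see what the column cleaning does, here is a small standalone sketch applied to a single header. The raw header string is an assumption (the real CSV headers are not shown in this notebook); the function and replacement table mirror the cell above.

```python
import re


def multiple_replace(replacements, text):
    # Build one alternation pattern from the (escaped) dictionary keys
    regex = re.compile("|".join(map(re.escape, replacements)))
    # Substitute each match with its mapped value
    return regex.sub(lambda mo: replacements[mo.group()], text)


replacements = {" ": "_", "-": "_", "/": "_", "?": "", "'": ""}

# Hypothetical raw header, invented for illustration
raw_header = "Q3 - What's your general understanding of the pilot scheme?"

# Keep only the text after the last "- ", then clean it
question_text = raw_header.split("- ")[-1]
cleaned = multiple_replace(replacements, question_text).lower().strip()
print(cleaned)  # whats_your_general_understanding_of_the_pilot_scheme
```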

In [None]:
df.head()

In [None]:
#load spacy
nlp = spacy.load("en_core_web_sm")

Define data cleansing functions:

In [None]:
#function to clean and lemmatize comments
def clean_comments(text):
    #remove punctuations
    regex = re.compile('[' + re.escape(string.punctuation) + '\\r\\t\\n]')
    nopunct = regex.sub(" ", str(text))
    #use spacy to lemmatize comments
    doc = nlp(nopunct, disable=['parser','ner'])
    lemma = [token.lemma_ for token in doc]
    return lemma

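def list_to_string(tokens):
    # Drop empty or whitespace-only tokens, then join the rest with single spaces
    # (the parameter is no longer called `list`, which shadowed the built-in)
    filtered_tokens = [token for token in tokens if token.strip()]
    return " ".join(filtered_tokens)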
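
The reason for filtering out blank tokens before joining can be shown without loading spaCy. In this sketch a plain split on single spaces stands in for the spaCy tokeniser (an assumption made only to keep the example self-contained), and the sample response is invented.

```python
import re
import string

# Same pattern as in clean_comments: punctuation plus \r, \t and \n
pattern = re.compile("[" + re.escape(string.punctuation) + "\\r\\t\\n]")

sample = "Great scheme!\nSaves time, and money."  # hypothetical response
no_punct = pattern.sub(" ", sample)

# Stand-in for spaCy tokenisation (assumption): split on single spaces,
# which leaves empty strings wherever two spaces sit side by side
tokens = no_punct.lower().split(" ")


def list_to_string(tokens):
    # Drop empty or whitespace-only tokens, then rejoin with single spaces
    return " ".join(token for token in tokens if token.strip())


print(repr(no_punct))
print(list_to_string(tokens))  # great scheme saves time and money
```

Replacing each punctuation mark with a space leaves runs of consecutive spaces behind, so the blank-token filter is what keeps the rejoined string tidy.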

---------
<a id='approach-1'></a>
### Approach 1
Treat all responses to each question as one document, then pull out the TF-IDF scores for the document made up of the positive responses.

In [None]:
question_cols = ['whats_your_general_understanding_of_the_pilot_scheme',
       'what_are_the_objectives_of_the_pilot_scheme',
       'what_are_the_positives_of_the_pilot_scheme',
       'what_are_the_negatives_of_the_pilot_scheme',
       'has_the_pilot_scheme_been_successful']

In [None]:
# Create corpus where each type of response is a doc, labelled with the Q asked
df_corpus1 = df[question_cols]
df_corpus_cat = dict()
for i in question_cols:
    df_corpus_cat[i] = df_corpus1[i].str.cat(sep = ' ')
df_corpus1 = pd.DataFrame(df_corpus_cat, index=range(1))

df_corpus1 = df_corpus1.melt(var_name = "question", value_name = "response")

In [None]:
df_corpus1

In [None]:
#apply function to clean and lemmatize comments for whole corpus
df_corpus1["response_lemm"] = df_corpus1["response"].map(clean_comments)

#make sure to lowercase everything
df_corpus1["response_lemm"] = df_corpus1["response_lemm"].map(lambda x: [word.lower() for word in x])

# stop everything being a list
df_corpus1["response_lemm"] = df_corpus1.response_lemm.map(list_to_string)

In [None]:
# Create corpus as list from df
corpus1 = df_corpus1.response_lemm.tolist()

In [None]:
# Create an instance of the tfidf vectorizer
td_idf_vectorizer1 = TfidfVectorizer(ngram_range = (2,3))

corpus_vectorised1 = td_idf_vectorizer1.fit_transform(corpus1)

# If you want to look at it
tfidf_matrix1 = pd.DataFrame(corpus_vectorised1.toarray(), 
                            columns=td_idf_vectorizer1.get_feature_names_out())
print(tfidf_matrix1.shape)
tfidf_matrix1.head()

The positive responses are in the row at position 2 (zero-based), matching the third entry in `question_cols`:

In [None]:
top_positives1 = tfidf_matrix1.iloc[2,].sort_values(ascending = False)
top_positives1[0:20]

-----
<a id='approach-2'></a>

#### Approach 2


<a id='approach-2a'></a>
##### Approach 2a
Include all responses to all questions in the corpus (each response as its own doc), then calculate the mean TF-IDF score of each phrase across the positive responses:

In [None]:
df_corpus2 = df[question_cols]
df_corpus2 = df_corpus2.melt(var_name = "question", value_name = "response")

In [None]:
df_corpus2

In [None]:
#apply function to clean and lemmatize comments for whole corpus
df_corpus2["response_lemm"] = df_corpus2["response"].map(clean_comments)

#make sure to lowercase everything
df_corpus2["response_lemm"] = df_corpus2["response_lemm"].map(lambda x: [word.lower() for word in x])

# stop everything being a list
df_corpus2["response_lemm"] = df_corpus2.response_lemm.map(list_to_string)

In [None]:
corpus2 = df_corpus2.response_lemm.tolist()
corpus2[0:2]

In [None]:
# Create an instance of the tfidf vectorizer
td_idf_vectorizer2 = TfidfVectorizer(ngram_range = (2,3))

corpus_vectorised2 = td_idf_vectorizer2.fit_transform(corpus2)

# If you want to look at it
tfidf_matrix2 = pd.DataFrame(corpus_vectorised2.toarray(), 
                            columns=td_idf_vectorizer2.get_feature_names_out())
print(tfidf_matrix2.shape)
tfidf_matrix2.head()

In [None]:
tfidf_matrix2["question"] = df_corpus2["question"]
tfidf_matrix2["response"] = df_corpus2["response"]

In [None]:
tfidf_matrix2.question.unique()

In [None]:
tfidf_matrix2.head()

Select question we want to look at keywords for (using the mean of the TF-IDF scores for responses related to that question):

In [None]:
question_to_select = "what_are_the_positives_of_the_pilot_scheme"
df_selected2 = tfidf_matrix2.loc[tfidf_matrix2.question == question_to_select]

In [None]:
most_important2a = df_selected2.drop(columns = ["question", "response"]).mean()

In [None]:
most_important2a.sort_values(ascending = False)[0:20]

-----

<a id='approach-2b'></a>
##### Approach 2b

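As in approach 2a, include all responses to all questions in the corpus and calculate the mean score of each phrase across the positive responses. This time, first remove phrases that appear too often, using `max_df`, to get phrases that are more distinctive of the positives. With `max_df = 0.25`, any phrase that occurs in more than 25% of the documents (here, individual responses) is dropped from the vocabulary.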

In [None]:
# Create an instance of the tfidf vectorizer
td_idf_vectorizer2b = TfidfVectorizer(max_df = 0.25, ngram_range = (2,3))

corpus_vectorised2b = td_idf_vectorizer2b.fit_transform(corpus2)

# If you want to look at it
tfidf_matrix2b = pd.DataFrame(corpus_vectorised2b.toarray(), 
                            columns=td_idf_vectorizer2b.get_feature_names_out())
tfidf_matrix2b.head()

In [None]:
tfidf_matrix2b["question"] = df_corpus2["question"]
tfidf_matrix2b["response"] = df_corpus2["response"]

In [None]:
question_to_select = "what_are_the_positives_of_the_pilot_scheme"
df_selected2b = tfidf_matrix2b.loc[tfidf_matrix2b.question == question_to_select]

In [None]:
most_important2b = df_selected2b.drop(columns = ["question", "response"]).mean()

In [None]:
most_important2b.sort_values(ascending = False)[0:20]

-----
<a id='results'></a>
### Results comparison

Compare results between approaches 1, 2a and 2b:


In [None]:
results1 = pd.DataFrame(top_positives1)
results1.reset_index(inplace=True)
results1.columns = ["phrase", "value_approach1"]

results1["rank_approach1"] = results1.index + 1

In [None]:
most_important2a = most_important2a.sort_values(ascending = False)
results2a = pd.DataFrame(most_important2a)
results2a.reset_index(inplace=True)
results2a.columns = ["phrase", "value_approach2a"]

results2a["rank_approach2a"] = results2a.index + 1

In [None]:
most_important2b = most_important2b.sort_values(ascending = False)
results2b = pd.DataFrame(most_important2b)
results2b.reset_index(inplace=True)
results2b.columns = ["phrase", "value_approach2b"]

results2b["rank_approach2b"] = results2b.index + 1

In [None]:
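# Join the results tables together; an outer join keeps phrases found by only some approaches
join1 = pd.merge(results1, results2a, on='phrase', how='outer')
all_results = pd.merge(join1, results2b, on='phrase', how='outer')
# Cast to float so all rank columns share a dtype (columns with missing phrases hold NaN)
all_results["rank_approach1"] = all_results["rank_approach1"].astype("float")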

In [None]:
all_results["average_rank"] =  all_results[['rank_approach1', 'rank_approach2a', 'rank_approach2b']].mean(axis=1, skipna=True)
all_results["average_value"] =  all_results[['value_approach1', 'value_approach2a', 'value_approach2b']].mean(axis=1, skipna=True)
all_results = all_results.sort_values(by = "average_rank")

In [None]:
all_results[["phrase", "average_rank", 'rank_approach1', 'rank_approach2a', 'rank_approach2b']][0:20]

In [None]:
all_results[["phrase", "average_value", 'value_approach1', 'value_approach2a', 'value_approach2b']][0:20]
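
One thing to keep in mind when reading the averaged ranks: `mean(axis=1, skipna=True)` ignores approaches in which a phrase is missing. The sketch below shows the effect on two made-up score tables; the phrases, scores and the helper `to_ranked` are hypothetical.

```python
import pandas as pd

# Hypothetical scores for two approaches, invented for illustration
a = pd.Series({"saves time": 0.9, "saves money": 0.5})
b = pd.Series({"saves money": 0.8, "reduces cost": 0.4})


def to_ranked(scores, suffix):
    """Sort scores descending and attach a 1-based rank column."""
    out = scores.sort_values(ascending=False).rename_axis("phrase").reset_index()
    out.columns = ["phrase", f"value_{suffix}"]
    out[f"rank_{suffix}"] = out.index + 1
    return out


merged = pd.merge(to_ranked(a, "a"), to_ranked(b, "b"), on="phrase", how="outer")

# NaN ranks are skipped, so a phrase missing from one approach is
# averaged only over the approaches where it does appear
merged["average_rank"] = merged[["rank_a", "rank_b"]].mean(axis=1, skipna=True)
merged = merged.sort_values("average_rank").reset_index(drop=True)
print(merged)
```

Here `saves time` appears in only one approach yet gets an average rank of 1.0, ahead of `saves money` (1.5), which both approaches found. Counting how many approaches returned each phrase alongside the average rank would be one way to make that visible.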