# Themes Analysis for Consultation Sandbox
## Phrase Collocation

This notebook tests the extraction of key themes from dummy consultation data.
Inspired by: https://datasciencecampus.ons.gov.uk/projects/automating-consultation-analysis/

---
## Technique B: Collocation

Method taken from: https://medium.com/@nicharuch/collocations-identifying-phrases-that-act-like-individual-words-in-nlp-f58a93a2f84a#:~:text=The%20two%20most%20common%20types,or%20'Proctor%20and%20Gamble'.

We'll start by counting the frequency of bigrams (two adjacent words) and trigrams (three adjacent words).
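As a rough illustration of what is being counted, here is a minimal plain-Python sketch. The `ngrams` helper and the toy tokens are made up for this example; the notebook itself uses NLTK's collocation finders further down.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Toy lemmatized tokens, assumed for illustration only
toy_tokens = ["the", "pilot", "scheme", "help", "the", "pilot", "scheme", "grow"]

bigram_counts = Counter(ngrams(toy_tokens, 2))
trigram_counts = Counter(ngrams(toy_tokens, 3))

# Show the most frequent bigrams and trigrams
print(bigram_counts.most_common(2))
print(trigram_counts.most_common(1))
```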

---

### Prepare data

In [None]:
# Load packages 

from arrow_pd_parser import reader
import os
import pandas as pd
import numpy as np
import spacy
import re
import string
import nltk

In [None]:
# Import data

s3_bucket = "s3://alpha-everyone/nlp-code-examples/"
file_loc = "Consultation_Dummy_NewQuestions.csv"

df = reader.read(os.path.join(s3_bucket, file_loc))

Clean column names

In [None]:
def multiple_replace(replacements, text):
    # Create a regular expression from the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, replacements.keys())))
    # For each match, look-up corresponding value in dictionary
    return regex.sub(lambda mo: replacements[mo.group()], text) 

replacements = {" ":"_",
              "-":"_",
              "/":"_",
              "?":"",
              "'":""}

new_cols = list()
for i in df.columns.str.split('- '):
    cleaned = multiple_replace(replacements, i[-1]).lower().strip()
    new_cols.append(cleaned)
df.columns = new_cols
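To see what this cleaning does, here is a small self-contained sketch applied to a hypothetical raw header (the real headers in the dummy data may differ). `clean_column` is a helper name introduced only for this illustration.

```python
import re

def multiple_replace(replacements, text):
    # Build one regex that matches any of the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, replacements.keys())))
    # Swap each match for its dictionary value
    return regex.sub(lambda mo: replacements[mo.group()], text)

replacements = {" ": "_", "-": "_", "/": "_", "?": "", "'": ""}

def clean_column(raw_header):
    # Keep only the text after the last "- ", as the loop above does
    question = raw_header.split("- ")[-1]
    return multiple_replace(replacements, question).lower().strip()

# Hypothetical raw header, assumed for illustration
cleaned_header = clean_column("Q1 - What are the positives of the pilot scheme?")
print(cleaned_header)
```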

In [None]:
#load spacy
nlp = spacy.load("en_core_web_sm")

Define data cleansing functions:

In [None]:
#function to clean and lemmatize comments
def clean_comments(text):
    #remove punctuations
    regex = re.compile('[' + re.escape(string.punctuation) + '\\r\\t\\n]')
    nopunct = regex.sub(" ", str(text))
    #use spacy to lemmatize comments
    doc = nlp(nopunct, disable=['parser','ner'])
    lemma = [token.lemma_ for token in doc]
    return lemma
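The punctuation step can be tried on its own, without spaCy. This sketch uses a made-up sentence and shows that apostrophes become spaces, so contractions are split into separate pieces (a stray `t` from "don't", for example).

```python
import re
import string

# Same character class as in clean_comments: punctuation plus \r, \t, \n
punct_or_ws = re.compile("[" + re.escape(string.punctuation) + "\\r\\t\\n]")

# Made-up comment, assumed for illustration
sample = "It's well-run;\nstaff don't rush."
no_punct = punct_or_ws.sub(" ", sample)

# Splitting on whitespace shows the pieces that would reach the lemmatizer
pieces = no_punct.split()
print(pieces)
```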

Cleanse data:

In [None]:
comments_col = "what_are_the_positives_of_the_pilot_scheme"
comments = df[comments_col]

In [None]:
#apply function to clean and lemmatize comments
df["comments_lemm"] = df[comments_col].map(clean_comments)

#make sure to lowercase everything
df["comments_lemm"] = df["comments_lemm"].map(lambda x: [word.lower() for word in x])

#turn all comments' tokens into one single list
unlist_comments = [item for items in df.comments_lemm for item in items]

In [None]:
tokens = unlist_comments
tokens[0:10]
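One thing to keep in mind: because all comments are flattened into a single token list, some n-grams will span the boundary between two different comments. The sketch below uses two made-up comments to show this. If it matters for your data, NLTK's finders also offer a `from_documents` constructor that takes one token list per comment.

```python
# Two made-up lemmatized comments, assumed for illustration
toy_comments = [["good", "idea"], ["staff", "helpful"]]

# Flatten exactly as the notebook does
flat = [tok for comment in toy_comments for tok in comment]
flat_bigrams = set(zip(flat, flat[1:]))

# Bigrams built one comment at a time never cross a boundary
per_comment_bigrams = {bg for comment in toy_comments for bg in zip(comment, comment[1:])}

# Bigrams that exist only because the comments were joined together
spanning = flat_bigrams - per_comment_bigrams
print(spanning)
```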

Create bigrams and trigrams:

In [None]:
bigramFinder = nltk.collocations.BigramCollocationFinder.from_words(tokens)
trigramFinder = nltk.collocations.TrigramCollocationFinder.from_words(tokens)

----
#### a. Counting frequencies of adjacent words with part of speech filters:

The simplest method is to rank the most frequent bigrams or trigrams:

In [None]:
#bigrams
bigram_freq = bigramFinder.ngram_fd.items()
bigramFreqTable = pd.DataFrame(list(bigram_freq), columns=['bigram','freq']).sort_values(by='freq', ascending=False)
#trigrams
trigram_freq = trigramFinder.ngram_fd.items()
trigramFreqTable = pd.DataFrame(list(trigram_freq), columns=['trigram','freq']).sort_values(by='freq', ascending=False)

However, the most frequent n-grams are often not meaningful, because they are dominated by adjacent spaces, stop words, articles, prepositions or pronouns:

In [None]:
bigramFreqTable.head()

In [None]:
trigramFreqTable.head()

To fix this, we keep only collocations that contain no stop words and that match one of the following structures:

Bigrams: (Noun, Noun), (Adjective, Noun)

Trigrams: (Adjective/Noun, Anything, Adjective/Noun)

This structure is common in the collocation literature and generally works well.
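The structures above can be sketched as simple checks on Penn Treebank tags. In this illustration the tags are assigned by hand so that it runs without any tagger data; the function names are hypothetical, and the notebook's own filters below use `nltk.pos_tag` instead.

```python
ADJ_OR_NOUN = {"JJ", "JJR", "JJS", "NN", "NNS", "NNP", "NNPS"}
NOUN = {"NN", "NNS", "NNP", "NNPS"}

def bigram_pattern_ok(tagged):
    """(Adjective/Noun, Noun): first tag adjective or noun, second tag noun."""
    return tagged[0][1] in ADJ_OR_NOUN and tagged[1][1] in NOUN

def trigram_pattern_ok(tagged):
    """(Adjective/Noun, Anything, Adjective/Noun): the middle tag is ignored."""
    return tagged[0][1] in ADJ_OR_NOUN and tagged[2][1] in ADJ_OR_NOUN

# Hand-tagged examples, assumed for illustration
print(bigram_pattern_ok([("pilot", "NN"), ("scheme", "NN")]))
print(bigram_pattern_ok([("run", "VB"), ("scheme", "NN")]))
print(trigram_pattern_ok([("staff", "NN"), ("feel", "VBP"), ("safe", "JJ")]))
```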

In [None]:
from nltk.corpus import stopwords
#get english stopwords
en_stopwords = set(stopwords.words('english'))

In [None]:
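#function to filter for ADJ/NN bigrams
def rightTypes(ngram):
    # Drop pronoun placeholder lemmas ('-pron-', as produced by older spaCy versions)
    # and the stray 't' left once apostrophes become spaces (e.g. "don't" -> "don t")
    if '-pron-' in ngram or 't' in ngram:
        return False
    # Drop any n-gram containing a stop word or a whitespace-only token
    for word in ngram:
        if word in en_stopwords or word.isspace():
            return False
    acceptable_types = ('JJ', 'JJR', 'JJS', 'NN', 'NNS', 'NNP', 'NNPS')
    second_type = ('NN', 'NNS', 'NNP', 'NNPS')
    # Tag the words (requires NLTK's POS tagger data to be downloaded)
    tags = nltk.pos_tag(ngram)
    # Keep only (Adjective/Noun, Noun) pairs
    return tags[0][1] in acceptable_types and tags[1][1] in second_type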

In [None]:
#filter bigrams
filtered_bi = bigramFreqTable[bigramFreqTable.bigram.map(lambda x: rightTypes(x))]
filtered_bi.head(10)

In [None]:
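#function to filter for ADJ/NN trigrams
def rightTypesTri(ngram):
    # Drop pronoun placeholder lemmas and the stray 't' left by split contractions
    if '-pron-' in ngram or 't' in ngram:
        return False
    # Drop any n-gram containing a stop word or a whitespace-only token
    for word in ngram:
        if word in en_stopwords or word.isspace():
            return False
    first_type = ('JJ', 'JJR', 'JJS', 'NN', 'NNS', 'NNP', 'NNPS')
    third_type = ('JJ', 'JJR', 'JJS', 'NN', 'NNS', 'NNP', 'NNPS')
    # Tag the words (requires NLTK's POS tagger data to be downloaded)
    tags = nltk.pos_tag(ngram)
    # Keep only (Adjective/Noun, Anything, Adjective/Noun); the middle word is unconstrained
    return tags[0][1] in first_type and tags[2][1] in third_type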

In [None]:
#filter trigrams
filtered_tri = trigramFreqTable[trigramFreqTable.trigram.map(lambda x: rightTypesTri(x))]
filtered_tri.head(10)

In [None]:
filtered_tri.columns = [w.replace('trigram', 'gram') for w in filtered_tri.columns]
filtered_bi.columns = [w.replace('bigram', 'gram') for w in filtered_bi.columns]

In [None]:
filtered_gram = pd.concat([filtered_tri, filtered_bi]).sort_values("freq", ascending = False)
filtered_gram.head(10)

#### Create word cloud for key phrases

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

In [None]:
# Generate a dictionary from DataFrame
words = filtered_gram.gram.apply(lambda x: " ".join(x))
freq = filtered_gram.freq
word_freq_dict = dict(zip(words, freq))

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(word_freq_dict)

# Plot the WordCloud image
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')  # Turn off the axis labels
plt.show()

#### Next step: Extract context around key phrases

In [None]:
filtered_gram.head(20)

We'll extract the context for just the trigrams or bigrams that appear more than 5 times:

In [None]:
key_phrases = filtered_gram.loc[filtered_gram.freq > 5]

We need to convert the lemmatized version of the text to a string (currently a list):

In [None]:
def list_to_string(tokens):
    # Drop empty or whitespace-only tokens, then join the rest with single spaces
    filtered_tokens = [token for token in tokens if token.strip()]
    return " ".join(filtered_tokens)

df["comments_lemm_clean"] = df.comments_lemm.map(list_to_string)

In [None]:
def extract_surrounding_characters(full_string, target_string, chars = 50):
    # Find the index of the target string in the full string
    index = full_string.find(target_string)

    # Check if the target string is present in the full string
    if index != -1:
        # Extract the surrounding characters
        start_index = max(0, index - chars)
        end_index = min(len(full_string), index + len(target_string) + chars)
        surrounding_chars = full_string[start_index:end_index]

        return surrounding_chars

    return None    
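Note that `str.find` returns only the first match, so the function above gives the context for the first occurrence of a phrase in each comment. If every occurrence is wanted, one possible variant is sketched below. `extract_all_contexts` is a hypothetical helper, not part of the notebook, and the sample text is made up.

```python
import re

def extract_all_contexts(full_string, target_string, chars=50):
    """Return the surrounding text for every occurrence of target_string."""
    contexts = []
    for match in re.finditer(re.escape(target_string), full_string):
        # Clamp the window to the bounds of the string
        start_index = max(0, match.start() - chars)
        end_index = min(len(full_string), match.end() + chars)
        contexts.append(full_string[start_index:end_index])
    return contexts

# Made-up lemmatized comment, assumed for illustration
sample_text = "good idea but a good idea need time"
contexts = extract_all_contexts(sample_text, "good idea", chars=5)
print(contexts)
```

An empty list plays the role that `None` plays in the single-occurrence version.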

We'll loop through all key phrases and create a column for each. For every row where the key phrase is present, the column holds the context around the phrase; otherwise it is empty.

In [None]:
# Loop through all key phrases and create a column for each, 
# which is populated for each row where key phrase is present
# with the context around the phrase
for i in key_phrases.gram:
    phrase = (" ".join(i))
    phrase_col = phrase.replace(" ", "_")
    df[phrase_col] = df['comments_lemm_clean'].apply(lambda x: extract_surrounding_characters(x, phrase))

We'll extract this information into a separate table, with one column for the key phrase and one column for the context around it each time it appears. This table contains no individual's information; it is designed just to give a summary of the kind of context each phrase appears in.

In [None]:
# Select only context columns
key_phrases_cols = key_phrases.gram.apply(lambda x: "_".join(x)).tolist()
df_key_phrases = df[key_phrases_cols]

In [None]:
# Stack columns using pd.melt()
df_key_phrases = pd.melt(df_key_phrases, var_name = "key_phrase", value_name = "context")

In [None]:
# Filter out 'none' values
df_key_phrases = df_key_phrases[~df_key_phrases.context.isna()]

In [None]:
df_key_phrases.head()

----
#### b. Alternative approach: Pointwise Mutual Information

Pointwise Mutual Information (PMI) measures how much more likely words are to co-occur than they would be if they were independent. However, it is very sensitive to rare combinations of words. For example, if a random bigram ‘abc xyz’ appears, and neither ‘abc’ nor ‘xyz’ appears anywhere else in the text, ‘abc xyz’ will be identified as a highly significant bigram when it could just be a random misspelling or a phrase too rare to generalize as a bigram. Therefore, this method is often used with a frequency filter.
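To make the sensitivity to rare pairs concrete, here is a minimal plain-Python sketch of bigram PMI using the textbook definition, log2 of P(w1, w2) / (P(w1) P(w2)), estimated from raw counts with the total token count as the sample size. The helper name and toy tokens are assumptions for illustration; the notebook itself relies on NLTK's `pmi` measure below.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """PMI for each adjacent word pair, estimated from raw counts."""
    n = len(tokens)
    word_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return {
        (w1, w2): math.log2(count * n / (word_counts[w1] * word_counts[w2]))
        for (w1, w2), count in bigram_counts.items()
    }

# 'good idea' appears twice; 'abc xyz' appears once and nowhere else
toy_tokens = ["good", "idea", "is", "good", "idea", "too", "abc", "xyz"]
pmi_scores = bigram_pmi(toy_tokens)

print(pmi_scores[("good", "idea")])
print(pmi_scores[("abc", "xyz")])
```

In this toy example the one-off pair scores higher than the repeated phrase, which is the behaviour a frequency filter is meant to counter.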

In [None]:
bigrams = nltk.collocations.BigramAssocMeasures()
trigrams = nltk.collocations.TrigramAssocMeasures()

In [None]:
#filter for only those with at least 20 occurrences
bigramFinder.apply_freq_filter(20)
trigramFinder.apply_freq_filter(20)
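`apply_freq_filter(min_freq)` removes candidate n-grams that occur fewer than `min_freq` times, so an n-gram seen exactly `min_freq` times is kept. The effect can be sketched in plain Python as follows; `freq_filter` and the toy counts are hypothetical and only mimic the idea.

```python
from collections import Counter

def freq_filter(ngram_counts, min_freq):
    """Keep only n-grams whose count is at least min_freq."""
    return Counter({gram: c for gram, c in ngram_counts.items() if c >= min_freq})

# Toy counts, assumed for illustration
toy_counts = Counter({("pilot", "scheme"): 25, ("good", "idea"): 20, ("abc", "xyz"): 1})
kept = freq_filter(toy_counts, 20)
print(kept)
```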

In [None]:
bigramPMITable = pd.DataFrame(list(bigramFinder.score_ngrams(bigrams.pmi)), columns=['bigram','PMI']).sort_values(by='PMI', ascending=False)
bigramPMITable

In [None]:
trigramPMITable = pd.DataFrame(list(trigramFinder.score_ngrams(trigrams.pmi)), columns=['trigram','PMI']).sort_values(by='PMI', ascending=False)
trigramPMITable

In [None]:
pmi_tri = trigramPMITable.copy()
pmi_bi = bigramPMITable.copy()

pmi_tri.columns = [w.replace('trigram', 'gram') for w in trigramPMITable.columns]
pmi_bi.columns = [w.replace('bigram', 'gram') for w in bigramPMITable.columns]

pmi_gram = pd.concat([pmi_tri, pmi_bi]).sort_values("PMI", ascending = False)
pmi_gram.head(10)

#### Create word cloud for key phrases

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

In [None]:
# Generate a dictionary from DataFrame
words_pmi = pmi_gram.gram.apply(lambda x: " ".join(x))
pmi = pmi_gram.PMI
word_pmi_dict = dict(zip(words_pmi, pmi))

In [None]:
# Generate the word cloud
wordcloud2 = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(word_pmi_dict)

# Plot the WordCloud image
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud2, interpolation='bilinear')
plt.axis('off')  # Turn off the axis labels
plt.show()