# Themes Analysis for Consultation Sandbox
## Topic Modelling

This notebook tests the extraction of key themes from dummy consultation data.
Inspired by: https://datasciencecampus.ons.gov.uk/projects/automating-consultation-analysis/

---
## Technique A: Topic Modelling
Upshot: the dataset has too few responses for the model to produce distinct, useful topics.
### 1. Prepare data

In [None]:
from arrow_pd_parser import reader
import os

In [None]:
s3_bucket = "s3://alpha-everyone/nlp-code-examples/"
file_loc = "Consultation_Dummy_NewQuestions.csv"

In [None]:
df = reader.read(os.path.join(s3_bucket, file_loc))

Clean column names

In [None]:
import re


def multiple_replace(replacements, text):
    """Replace every key of `replacements` found in `text` with its value, in one pass."""
    # Build one alternation pattern from the escaped dictionary keys
    regex = re.compile("|".join(map(re.escape, replacements)))
    # For each match, look up the corresponding value in the dictionary
    return regex.sub(lambda match: replacements[match.group()], text)

In [None]:
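replacements = {" ": "_",
                "-": "_",
                "/": "_",
                "?": "",
                "'": ""}

new_cols = []
# Keep only the text after the last "- " in each header, then normalise it
for parts in df.columns.str.split("- "):
    # Strip whitespace first: once replaced, spaces have already become underscores
    cleaned = multiple_replace(replacements, parts[-1].strip()).lower()
    new_cols.append(cleaned)
df.columns = new_cols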
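
As a rough illustration of what this cleaning step does, the sketch below mirrors it on two made-up headers, using a plain Python list in place of `df.columns`. The header strings are hypothetical and are not taken from the dataset.

```python
import re


def multiple_replace(replacements, text):
    # Replace every dictionary key found in the text with its value
    regex = re.compile("|".join(map(re.escape, replacements)))
    return regex.sub(lambda match: replacements[match.group()], text)


replacements = {" ": "_", "-": "_", "/": "_", "?": "", "'": ""}

# Hypothetical headers, invented for illustration only
example_headers = [
    "Q1 - What are the positives of the pilot scheme?",
    "Q2 - Any other comments/feedback?",
]

cleaned_headers = []
for header in example_headers:
    question = header.split("- ")[-1]  # keep the text after the last "- "
    cleaned_headers.append(multiple_replace(replacements, question.strip()).lower())

print(cleaned_headers)
```

Headers cleaned this way are valid Python identifiers, which is what allows attribute-style access such as `df.some_column_name` later on.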

Look at the first response in the column we want to model topics on:

In [None]:
df.what_are_the_positives_of_the_pilot_scheme.iloc[0]

In [None]:
model_docs = df.what_are_the_positives_of_the_pilot_scheme.tolist()
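
Free-text columns often contain blank or missing answers, while the topic model expects a list of strings. Below is a minimal sketch of a pre-check, assuming missing answers arrive as `None`, `NaN` or empty strings. The helper `keep_text_responses` and the sample responses are hypothetical.

```python
def keep_text_responses(responses):
    """Return only non-empty string responses, stripped of surrounding whitespace."""
    kept = []
    for response in responses:
        # NaN is a float and None is NoneType, so both fail the isinstance check
        if isinstance(response, str) and response.strip():
            kept.append(response.strip())
    return kept


# Made-up responses for illustration only
sample_responses = ["It saved time.", "", None, float("nan"), "  Staff were helpful  "]
usable = keep_text_responses(sample_responses)
print(f"{len(usable)} of {len(sample_responses)} responses are usable")
```

On the real data this could be applied as `model_docs = keep_text_responses(model_docs)`. The resulting count also gives a quick sense of whether there are enough responses for topic modelling to be worthwhile.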

---
### 2. Set up model
Import the libraries needed:

In [None]:
from bertopic import BERTopic

In [None]:
# Fix UMAP's random state so the topics are reproducible between runs
from umap import UMAP
umap_model = UMAP(random_state=42)

In [None]:
topic_model = BERTopic(umap_model=umap_model,
                       verbose=True,
                       min_topic_size=5)
topics, probs = topic_model.fit_transform(model_docs)

In [None]:
# Save model
# topic_model.save("models/my_model")

In [None]:
# Load model
# topic_model = BERTopic.load("models/my_model")

---
### 3. Look at results

After fitting the model, we can view the topics it generated and how many responses fall into each:

In [None]:
topic_model.get_topic_info()
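
The `topics` list returned by `fit_transform` holds one topic id per document, with `-1` used by BERTopic for outliers. As a sketch of the kind of count the topic table summarises, the snippet below tallies a made-up `example_topics` list. These are hypothetical values, not output from this model.

```python
from collections import Counter

# Hypothetical topic assignments, one per document; -1 marks outliers
example_topics = [0, 0, 1, -1, 0, 2, 1, -1, 0, 2]

topic_counts = Counter(example_topics)
outlier_share = topic_counts[-1] / len(example_topics)

for topic_id, count in sorted(topic_counts.items()):
    label = "outliers" if topic_id == -1 else f"topic {topic_id}"
    print(f"{label}: {count} documents")
print(f"Share of documents left unassigned: {outlier_share:.0%}")
```

With a small dataset, a high outlier share is one sign that there are too few responses to form stable topics.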

Only 3 topics were generated, and they all look quite similar.

In [None]:
for i in topic_model.get_topic_info().Representation:
    print(i)

In [None]:
for i in topic_model.get_topic_info().Representative_Docs:
    print(i)
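
To put a number on how similar the topics look, one option is the Jaccard overlap between their keyword lists. This is a sketch only: the `jaccard` helper and the keyword lists are hypothetical, standing in for the lists held in the `Representation` column.

```python
from itertools import combinations


def jaccard(words_a, words_b):
    """Share of distinct keywords that two topics have in common (0 to 1)."""
    set_a, set_b = set(words_a), set(words_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Made-up keyword lists for illustration only
example_keywords = {
    0: ["pilot", "scheme", "support", "staff"],
    1: ["pilot", "scheme", "time", "staff"],
    2: ["cost", "travel", "access", "time"],
}

# Score every pair of topics
overlaps = {
    (a, b): jaccard(example_keywords[a], example_keywords[b])
    for a, b in combinations(example_keywords, 2)
}
for pair, score in overlaps.items():
    print(pair, round(score, 2))
```

Scores close to 1 would support the impression that the topics are near-duplicates, while scores close to 0 would suggest they are more distinct than they appear.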