# Themes Sandbox

This notebook tests the extraction of key themes from dummy data.
Inspired by: https://datasciencecampus.ons.gov.uk/projects/automating-consultation-analysis/

---
## Technique C: TF-IDF


### Approaches
In this notebook I've tried out a number of different approaches. I'm not certain they are all legitimate; at this stage I'm just experimenting. A summary of the approaches follows:

**Group 1:** _Treating all positive responses as the whole corpus; finding out what makes each individual response distinct from other positive responses._
- [**Attempt 1**:](#attempt-1) Calculating TF-IDF scores of single words appearing in positive responses. Pull out top n most important words for each response.
- [**Attempt 2**:](#attempt-2) Calculating TF-IDF scores of bigrams and trigrams appearing in positive responses. Pull out top n most important phrases for each response.
- [**Attempt 3**:](#attempt-3) Calculating mean TF-IDF scores of bigrams and trigrams appearing in positive responses across all documents. Pulling out highest-scoring phrases appearing in all positive responses.


**Group 2:** _Treating all responses as the whole corpus; finding out what makes positive responses as a group distinct from other types of responses._
- [**Attempt 4**:](#attempt-4) Calculating TF-IDF scores of bigrams and trigrams appearing in all responses. Calculating mean scores for just positive responses, to gain a summary-view of most important phrases for positive responses.
- [**Attempt 5**:](#attempt-5) Calculating TF-IDF scores of bigrams and trigrams appearing in all responses, removing phrases that appear most often first. Calculating mean scores for just positive responses, to gain a summary-view of most important phrases for positive responses.
- [**Attempt 6**:](#attempt-6) ChatGPT's suggestion 1; calculating mean TF-IDF scores for different response types (e.g. positive, negative), and pulling out top 5 words for each type.
- [**Attempt 7**:](#attempt-7) ChatGPT's suggestion 2; calculating TF-IDF scores for different response types (e.g. positive, negative), and pulling out words that have non-zero TF-IDF scores in a positive response but zero scores in the negative response it is compared against.

**Group 3:** _Treating all positive responses as one document (and likewise the responses of each other type as one document each), to see which phrases mark out positive responses as a group from other types of response, without needing to summarise scores across individual responses._
- [**Attempt 8**:](#attempt-8) Calculating TF-IDF scores of bigrams and trigrams in all responses, where positive responses are all one document. Can then easily pick out most important phrases in positive response vs other types of response.

**Other:**


_Note: I'm not using train/test for prediction here - just shoving everything in to get key words back. This is legitimate as it's not a supervised approach._
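
To make the three groupings above concrete, here is a minimal sketch of how the corpus differs in each case. The toy data and the column names `positives` and `negatives` are made up for illustration; they are not the consultation data.

```python
import pandas as pd

# Hypothetical toy responses, one column per question
toy = pd.DataFrame({
    "positives": ["saves time", "staff are helpful"],
    "negatives": ["costs too much", "takes too long"],
})

# Group 1: only the positive responses, one document per response
group1_corpus = toy["positives"].tolist()

# Group 2: every response to every question, one document per response
toy_long = toy.melt(var_name="question", value_name="response")
group2_corpus = toy_long["response"].tolist()

# Group 3: one document per question, made by joining all of its responses
group3_corpus = (
    toy_long.groupby("question", sort=False)["response"].agg(" ".join).tolist()
)

print(len(group1_corpus), len(group2_corpus), len(group3_corpus))
```

The only thing that changes between groups is what counts as a document, which in turn changes what the IDF part of the score is measured against.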

----
### 1. Prepare data

In [None]:
from arrow_pd_parser import reader
import os
import spacy
import string
import pandas as pd
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [None]:
s3_bucket = "s3://alpha-everyone/rayner_nikki/"
file_loc = "Consultation_Dummy_NewQuestions.csv"

In [None]:
df = reader.read(os.path.join(s3_bucket, file_loc))

Clean column names

In [None]:
def multiple_replace(replacements, text):
    # Create a regular expression from the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, replacements.keys())))
    # For each match, look-up corresponding value in dictionary
    return regex.sub(lambda mo: replacements[mo.group()], text) 

In [None]:
replacements = {" ":"_",
              "-":"_",
              "/":"_",
              "?":"",
              "'":""}

new_cols = list()
for i in df.columns.str.split('- '):
    cleaned = multiple_replace(replacements, i[-1]).lower().strip()
    new_cols.append(cleaned)
df.columns = new_cols
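
As a quick check of what the cleaning step does, here is a sketch applied to a single header. The raw header format (`"Q1 - ..."`) is an assumption for illustration, and `clean_column_name` is a hypothetical helper that wraps the same steps as the loop above.

```python
import re

def multiple_replace(replacements, text):
    # Build one regex that matches any of the dictionary keys
    regex = re.compile("|".join(map(re.escape, replacements)))
    # Swap each match for its replacement value
    return regex.sub(lambda match: replacements[match.group()], text)

replacements = {" ": "_", "-": "_", "/": "_", "?": "", "'": ""}

def clean_column_name(raw_name):
    # Keep only the text after the last "- ", then replace, lowercase and strip
    question_text = raw_name.split("- ")[-1]
    return multiple_replace(replacements, question_text).lower().strip()

# Hypothetical raw header; the real file may be laid out differently
example_header = "Q1 - What's your general understanding of the pilot scheme?"
cleaned_header = clean_column_name(example_header)
print(cleaned_header)
```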

Prepare the column to look at:

In [None]:
comments_col = "what_are_the_positives_of_the_pilot_scheme"

In [None]:
#load spacy
nlp = spacy.load("en_core_web_sm")

In [None]:
#function to clean and lemmatize comments
def clean_comments(text):
    #remove punctuations
    regex = re.compile('[' + re.escape(string.punctuation) + '\\r\\t\\n]')
    nopunct = regex.sub(" ", str(text))
    #use spacy to lemmatize comments
    doc = nlp(nopunct, disable=['parser','ner'])
    lemma = [token.lemma_ for token in doc]
    return lemma

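def list_to_string(words):
    # Drop empty or whitespace-only tokens, then join the rest into one string
    # (the argument is named `words` to avoid shadowing the built-in `list`)
    filtered_words = [word for word in words if word.strip()]
    return " ".join(filtered_words)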

#apply function to clean and lemmatize comments
df["comments_lemm"] = df[comments_col].map(clean_comments)

#make sure to lowercase everything
df["comments_lemm"] = df["comments_lemm"].map(lambda x: [word.lower() for word in x])

# Turn list of words to a string
df["comments_lemm_clean"] = df.comments_lemm.map(list_to_string)

----
<a id='attempt-1'></a>
#### Attempt 1
To start off, score the single words in each positive response against the corpus of all positive responses.

In [None]:
corpus = df.comments_lemm_clean.tolist()

In [None]:
# Create an instance of the tfidf vectorizer
td_idf_vectorizer = TfidfVectorizer()

corpus_vectorised = td_idf_vectorizer.fit_transform(corpus)

# If you want to look at it
tfidf_matrix = pd.DataFrame(corpus_vectorised.toarray(), 
                            columns=td_idf_vectorizer.get_feature_names_out())
tfidf_matrix.head()

We want to choose the n most important words for each response (each treated as a doc):

In [None]:
def get_top_n_columns(row, n):
    # Get the positions of the top n values in the row, highest first
    # (working on the underlying array keeps the indexing purely positional)
    top_indices = row.to_numpy().argsort()[-n:][::-1]
    
    # Get the corresponding column names
    top_columns = row.index[top_indices]
    
    return top_columns

In [None]:
# Number of top values to retrieve
n = 5

# Apply the function to each row
tfidf_matrix['keywords'] = tfidf_matrix.apply(lambda row: get_top_n_columns(row, n), axis=1)

In [None]:
tfidf_matrix.keywords.head()
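
One caveat with picking the top n by position: a short response with fewer than n scored terms gets padded with zero-score terms. A possible variant that drops zeros first is sketched below; `top_n_nonzero` and the toy scores are hypothetical and not part of the analysis above.

```python
import pandas as pd

def top_n_nonzero(row, n):
    # Keep only terms that actually occur in this response
    scored = row[row > 0]
    # Return up to n term names, highest score first
    return scored.nlargest(n).index.tolist()

# Hypothetical TF-IDF scores: two responses, three terms
toy = pd.DataFrame({
    "cost": [0.8, 0.0],
    "staff": [0.6, 0.0],
    "time": [0.0, 1.0],
})

toy_keywords = toy.apply(lambda row: top_n_nonzero(row, 2), axis=1)
print(toy_keywords.tolist())
```

The second response only contains one scored term, so it returns a single keyword rather than being padded out to two.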

-----
<a id='attempt-2'></a>
#### Attempt 2
With bigrams and trigrams

In [None]:
# Create an instance of the tfidf vectorizer
td_idf_vectorizer = TfidfVectorizer(ngram_range = (2,3))

corpus_vectorised = td_idf_vectorizer.fit_transform(corpus)

# If you want to look at it
tfidf_matrix = pd.DataFrame(corpus_vectorised.toarray(), 
                            columns=td_idf_vectorizer.get_feature_names_out())
tfidf_matrix.head()
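
To see which phrases `ngram_range = (2,3)` generates, the vectorizer's analyzer can be run on a single string. This is just a sketch on a made-up lemmatised sentence.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Build the same tokeniser/n-gram generator the vectorizer uses internally
analyzer = TfidfVectorizer(ngram_range=(2, 3)).build_analyzer()

# Made-up lemmatised response
phrases = analyzer("the pilot scheme work well")
print(phrases)
```

Five words give four bigrams and three trigrams, so the vocabulary grows quickly compared with single words.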

In [None]:
# Number of top values to retrieve
n = 5

# Apply the function to each row
tfidf_matrix['keywords'] = tfidf_matrix.apply(lambda row: get_top_n_columns(row, n), axis=1)

In [None]:
tfidf_matrix.keywords.head()

In [None]:
all_top_n = tfidf_matrix.keywords.apply(lambda x : x.tolist()).tolist()
unlisted_top_n = [item for sublist in all_top_n for item in sublist]

In [None]:
pd.DataFrame(unlisted_top_n).value_counts(0)

In [None]:
# no repeats of top n phrases between docs - this is what you'd expect 
# as we're trying to capture uniqueness
any(pd.DataFrame(unlisted_top_n).value_counts(0) > 1)

------------------
<a id='attempt-3'></a>
#### Attempt 3
With bigrams and trigrams, as in the ONS example, calculate the mean TF-IDF score of each column to pick out the most interesting phrases across all documents. Note: I'm not sure this is a legitimate approach, as it might be biased towards more frequent terms.

In [None]:
mean_tfidf = tfidf_matrix.drop(columns = ['keywords']).mean()

In [None]:
mean_tfidf.sort_values(ascending = False)[0:20]

This seems to make sense, but is still quite general.
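
To illustrate the possible bias mentioned above, here is a sketch on made-up scores. Averaging over all documents counts every zero, so a phrase that appears in many responses with a modest score can outrank a phrase that scores highly in one response. The column names and numbers are hypothetical.

```python
import pandas as pd

# Hypothetical TF-IDF scores for four responses
toy = pd.DataFrame({
    "common phrase": [0.3, 0.3, 0.3, 0.3],
    "rare phrase": [0.9, 0.0, 0.0, 0.0],
})

summary = pd.DataFrame({
    # Mean over every response, zeros included (what .mean() does above)
    "mean_all_docs": toy.mean(),
    # Mean over only the responses that contain the phrase
    "mean_when_present": toy.where(toy > 0).mean(),
    # Number of responses containing the phrase
    "doc_count": (toy > 0).sum(),
})
print(summary)
```

Looking at the three columns side by side may help to judge whether a high mean reflects importance or just frequency.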

-----
<a id='attempt-4'></a>
#### Attempt 4
Include all responses to all questions in the corpus, then calculate the mean for the positive responses:

In [None]:
question_cols = ['whats_your_general_understanding_of_the_pilot_scheme',
       'what_are_the_objectives_of_the_pilot_scheme',
       'what_are_the_positives_of_the_pilot_scheme',
       'what_are_the_negatives_of_the_pilot_scheme',
       'has_the_pilot_scheme_been_successful']

In [None]:
df_corpus = df[question_cols]
df_corpus = df_corpus.melt(var_name = "question", value_name = "response")

In [None]:
df_corpus

In [None]:
#apply function to clean and lemmatize comments for whole corpus
df_corpus["response_lemm"] = df_corpus["response"].map(clean_comments)

#make sure to lowercase everything
df_corpus["response_lemm"] = df_corpus["response_lemm"].map(lambda x: [word.lower() for word in x])

# stop everything being a list
df_corpus["response_lemm"] = df_corpus.response_lemm.map(list_to_string)

In [None]:
corpus = df_corpus.response_lemm.tolist()
corpus[0:2]

In [None]:
# Create an instance of the tfidf vectorizer
td_idf_vectorizer = TfidfVectorizer(ngram_range = (2,3))

corpus_vectorised = td_idf_vectorizer.fit_transform(corpus)

# If you want to look at it
tfidf_matrix = pd.DataFrame(corpus_vectorised.toarray(), 
                            columns=td_idf_vectorizer.get_feature_names_out())
print(tfidf_matrix.shape)
tfidf_matrix.head()

In [None]:
tfidf_matrix["question"] = df_corpus["question"]
tfidf_matrix["response"] = df_corpus["response"]

In [None]:
tfidf_matrix.question.unique()

In [None]:
tfidf_matrix.head()

In [None]:
question_to_select = "what_are_the_positives_of_the_pilot_scheme"
df_selected = tfidf_matrix.loc[tfidf_matrix.question == question_to_select]

--------
### Attempt 4b

Trying to think of a way to pull out the most important phrases across all positive responses.

Approach: Pick the 10% highest scoring phrases, and use these.

However, this is flawed by construction: phrases score highest precisely because they are rare, so we're unlikely to see much agreement between different responses.

In [None]:
# Remove text columns
df_selected_num = df_selected.iloc[: , :-2]
# Put all scores into one column
tfidf_scores = pd.melt(df_selected_num)
# Remove 0s
tfidf_scores = tfidf_scores.loc[tfidf_scores.value !=0]

In [None]:
# Visualise the distribution of the non-zero scores
# (pandas is already imported as pd at the top of the notebook)
import matplotlib.pyplot as plt
import numpy as np

# Determine bins
bins = np.arange(0, 0.25, 0.01)

# Plotting the histogram
plt.hist(tfidf_scores['value'], bins=bins, edgecolor='black')

# Adding labels and title
plt.xlabel('TF-IDF score')
plt.ylabel('Number of non-zero scores')
plt.title('Distribution of non-zero TF-IDF scores (binned)')

# Show plot
plt.show()

In [None]:
# Select the highest-scoring 10% of non-zero-scoring phrases
# Number of non-zero scores
nzw = tfidf_scores.shape[0]
nzw_thresh = round(nzw / 10)
# Score found at the 10% cut-off position, after sorting highest first
threshold = tfidf_scores["value"].sort_values(ascending = False).iloc[nzw_thresh]
most_important_phrases = tfidf_scores.loc[tfidf_scores.value >= threshold]

In [None]:
# Examine most common most important phrases
most_important_phrases.variable.value_counts()
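
A shorter way to express a top-10% cut-off is with `Series.quantile`. This is a sketch on made-up scores in the same long format that `pd.melt` produces (`variable` and `value` columns). Note that `quantile` interpolates between values by default, so the threshold can differ slightly from the positional one used above.

```python
import pandas as pd

# Hypothetical non-zero TF-IDF scores in long format
toy_scores = pd.DataFrame({
    "variable": ["a b", "b c", "c d", "d e", "e f",
                 "f g", "g h", "h i", "i j", "j k"],
    "value": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
})

# Score below which 90% of the non-zero scores fall
threshold = toy_scores["value"].quantile(0.9)

# Keep the phrases at or above that score
top_phrases = toy_scores.loc[toy_scores["value"] >= threshold]
print(top_phrases)
```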

-----

<a id='attempt-5'></a>
### Attempt 5
Include all responses to all questions in the corpus, then calculate the mean for the positive responses.

Use `max_df` to drop phrases that appear in more than 25% of documents, so that the remaining phrases are more specific to the positives.

In [None]:
# Create an instance of the tfidf vectorizer
td_idf_vectorizer = TfidfVectorizer(max_df = 0.25, ngram_range = (2,3))

corpus_vectorised = td_idf_vectorizer.fit_transform(corpus)

# If you want to look at it
tfidf_matrix = pd.DataFrame(corpus_vectorised.toarray(), 
                            columns=td_idf_vectorizer.get_feature_names_out())
tfidf_matrix.head()

In [None]:
tfidf_matrix["question"] = df_corpus["question"]
tfidf_matrix["response"] = df_corpus["response"]

In [None]:
tfidf_matrix.question.unique()

In [None]:
question_to_select = "what_are_the_positives_of_the_pilot_scheme"
df_selected = tfidf_matrix.loc[tfidf_matrix.question == question_to_select]

In [None]:
most_important = df_selected.drop(columns = ["question", "response"]).mean()

In [None]:
most_important.sort_values(ascending = False)[0:20]

---------
<a id='attempt-8'></a>
### Attempt 8
Approach: treat all positive responses as one document (and likewise the responses to each other question), then pull out the TF-IDF scores for that document.

In [None]:
question_cols = ['whats_your_general_understanding_of_the_pilot_scheme',
       'what_are_the_objectives_of_the_pilot_scheme',
       'what_are_the_positives_of_the_pilot_scheme',
       'what_are_the_negatives_of_the_pilot_scheme',
       'has_the_pilot_scheme_been_successful']

In [None]:
df_corpus = df[question_cols]
df_corpus_cat = dict()
for i in question_cols:
    df_corpus_cat[i] = df_corpus[i].str.cat(sep = ' ')
df_corpus = pd.DataFrame(df_corpus_cat, index=range(1))

df_corpus = df_corpus.melt(var_name = "question", value_name = "response")

In [None]:
df_corpus

In [None]:
#apply function to clean and lemmatize comments for whole corpus
df_corpus["response_lemm"] = df_corpus["response"].map(clean_comments)

#make sure to lowercase everything
df_corpus["response_lemm"] = df_corpus["response_lemm"].map(lambda x: [word.lower() for word in x])

# stop everything being a list
df_corpus["response_lemm"] = df_corpus.response_lemm.map(list_to_string)

In [None]:
corpus = df_corpus.response_lemm.tolist()

In [None]:
# Create an instance of the tfidf vectorizer
td_idf_vectorizer = TfidfVectorizer(ngram_range = (2,3))

corpus_vectorised = td_idf_vectorizer.fit_transform(corpus)

# If you want to look at it
tfidf_matrix = pd.DataFrame(corpus_vectorised.toarray(), 
                            columns=td_idf_vectorizer.get_feature_names_out())
print(tfidf_matrix.shape)
tfidf_matrix.head()

Positive responses are in row 2 (counting from zero), following the order of `question_cols`:

In [None]:
top_positives = tfidf_matrix.iloc[2,].sort_values(ascending = False)

In [None]:
top_positives[0:20]

---------
<a id='attempt-6'></a>
### Attempt 6 - ChatGPT 1

I have a survey question where people have responded 'yes', 'no' or 'maybe'. They then provide a free text explanation for their response.

I want to understand the key phrases in the free text responses, split by whether the person has responded 'yes', 'no' or 'maybe'. I want to use TF-IDF to do this. How could I do this?


-----
**How the approach works:**

- The corpus is every response as a separate doc, with `categories` holding the matching list of labels.
- Apply the TF-IDF vectorizer to the corpus.
- Calculate the mean TF-IDF score of each feature by response type.
- Return the top 5 features for each type.

-----

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample data
responses = {
    'yes': ["Yes, I am interested in the new product.", "I agree with the proposal.", "Yes, definitely."],
    'no': ["No, I don't think it's a good idea.", "I disagree with the plan.", "No, I'm not interested."],
    'maybe': ["Maybe, I need more information before deciding.", "I'm not sure yet.", "Perhaps, I'll think about it."]
}

# Preprocess data, tokenization, and TF-IDF calculation
corpus = []
categories = []
for category, texts in responses.items():
    corpus.extend(texts)
    categories.extend([category] * len(texts))

In [None]:
# TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

In [None]:
# Get feature names (terms)
feature_names = tfidf_vectorizer.get_feature_names_out()

These feature names are essentially the terms (words or n-grams) from the vocabulary that the vectorizer has learned from the training data, and they are ordered based on their indices in the feature matrix.

In [None]:
# Print key phrases for each category, in first-seen order
# (iterating over set(categories) would give an arbitrary order)
for category in dict.fromkeys(categories):
    print(f"Key phrases for '{category}':")
    
    # Select indices of the phrases belonging to each category
    category_indices = [i for i, cat in enumerate(categories) if cat == category]
    
    # This line calculates the mean TF-IDF scores across all documents in the current category. 
    # It selects rows of the TF-IDF matrix corresponding to documents in the current category, 
    # calculates the mean TF-IDF score for each feature (term), and then converts the resulting matrix to a 1-dimensional array.
    category_tfidf_scores = tfidf_matrix[category_indices].mean(axis=0).A1
    
    # This line sorts the indices of the features (terms) based on their TF-IDF scores in descending order. 
    # argsort() returns the indices that would sort the array, and [::-1] reverses the order to achieve descending sorting.
    sorted_indices = category_tfidf_scores.argsort()[::-1]
    
    for i in sorted_indices[:5]:  # Print top 5 key phrases
        print(f"{feature_names[i]} (TF-IDF: {category_tfidf_scores[i]:.2f})")
    print()

---------
<a id='attempt-7'></a>
### Attempt 7 - ChatGPT 2

I have a consultation with a number of positive and negative responses. I want to pull out the key phrases that occur in the positive responses that are different to negative responses. How could I do that?


_One approach could be to use natural language processing (NLP) techniques to analyze the responses. You could start by tokenizing the responses into words or phrases, then identify the most frequent phrases in the positive responses that are absent or occur less frequently in the negative responses. Techniques like TF-IDF (Term Frequency-Inverse Document Frequency) or cosine similarity can help in this process. Alternatively, you could train a machine learning model to classify responses as positive or negative and then extract key features that contribute to the positive classification._


Give me the code to do this with TF-IDF

https://chat.openai.com/c/be8ba9fc-679f-457c-9b65-9bafd6519661

-----
**How the approach works:**

- The corpus is every response as a separate doc, with a label to say positive or negative.
- Apply the TF-IDF vectorizer to the corpus.
- Calculate the TF-IDF scores for each doc.
- Compare the nth positive response to the nth negative response, and return the words that have a non-zero score in the positive response but not in the negative one.

-----


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Positive and negative responses
positive_responses = [
    "The service was excellent and very professional.",
    "I'm extremely satisfied with the outcome.",
    "The staff were friendly and helpful."
]

negative_responses = [
    "The service was terrible, never coming back.",
    "I'm highly disappointed with the outcome.",
    "The staff were rude and unprofessional."
]

# Combine positive and negative responses
all_responses = positive_responses + negative_responses

# Labels for responses (1 for positive, 0 for negative)
labels = [1] * len(positive_responses) + [0] * len(negative_responses)

# Initialize TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit the vectorizer on all responses
tfidf_vectorizer.fit(all_responses)

# Transform responses to TF-IDF feature vectors
tfidf_features = tfidf_vectorizer.transform(all_responses)

# Separate positive and negative TF-IDF feature vectors
positive_tfidf_features = tfidf_features[:len(positive_responses)]
negative_tfidf_features = tfidf_features[len(positive_responses):]

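# Find key phrases in positive responses that are different from negative responses
# (look up the feature names once, rather than on every loop iteration)
feature_names = tfidf_vectorizer.get_feature_names_out()
positive_key_phrases = []
for i, response in enumerate(positive_responses):
    # Column indices of terms with non-zero scores in the ith positive and ith negative response
    positive_indices = positive_tfidf_features[i].nonzero()[1]
    negative_indices = negative_tfidf_features[i].nonzero()[1]
    # Terms present in the positive response but absent from its paired negative response
    key_indices = sorted(set(positive_indices) - set(negative_indices))
    key_phrases = [feature_names[index] for index in key_indices]
    positive_key_phrases.append(key_phrases)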

In [None]:
print("Key phrases in positive responses:")
for i, phrases in enumerate(positive_key_phrases):
    print(f"Response {i+1}: {phrases}")