# Advanced LLM-Driven Topic Extraction

**Project Overview**
Building on the initial BERTopic analysis, this phase demonstrates the use of Large Language Models (LLMs) for automated topic extraction and summarization of patient feedback. Instead of unsupervised clustering alone, Falcon-7B-Instruct is leveraged to generate precise, human-readable topics from textual reviews, providing a complementary, AI-driven perspective on negative patient experiences.

**Differences from BERTopic Approach**

* **LLM vs. traditional topic modelling:** While BERTopic clusters reviews based on semantic embeddings, Falcon-7B-Instruct directly interprets text to extract the top themes per review.
* **Structured summarization:** LLM output allows generation of concise, normalized topic labels across reviews, improving readability and actionable insights.
* **Scalability:** Falcon-7B enables processing of large datasets in parallel using GPU acceleration, supporting rapid insights for enterprise-scale applications.

**Methodology**

1. **Data Selection & Cleaning**

   * Focused on negative reviews (<3 stars) from Google and Trustpilot datasets.
   * Preprocessed text: lowercasing, punctuation removal, stopword filtering, tokenization.

2. **Language Filtering**

   * Applied `langdetect` to ensure only English reviews were analyzed.

3. **LLM Topic Extraction**

   * Loaded Falcon-7B-Instruct via Hugging Face Transformers.
   * Generated top 3 topics per review using an instruction-based text generation pipeline.
   * Normalized topics into consistent categories (e.g., “Waiting Times,” “Staff Behavior,” “Treatment Quality”).

4. **Aggregation & Insights**

   * Counted overall frequency of each topic across hospitals.
   * Identified recurring issues and location-specific pain points.

**Key Skills Demonstrated**

* Applying LLMs for structured information extraction from unstructured text.
* GPU-accelerated NLP processing and memory optimization.
* Combining traditional and advanced NLP methods for richer insights.
* Translating raw textual data into actionable business intelligence for healthcare operations.

**Outcome**
This approach provides a human-interpretable, AI-enhanced view of patient feedback, complementing traditional topic modeling. Hospitals and healthcare managers can quickly understand critical pain points and prioritize interventions, demonstrating how LLMs can augment standard NLP pipelines for practical, decision-making insights.


# Change to GPU for parallel processing

Here, for the below code to return true, I choose Runtime → Change runtime type → T4 GPU

In [1]:
import torch
print(torch.cuda.is_available())  # Should be true

True


In [2]:
import torch
torch.cuda.empty_cache()

In [3]:
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

In [4]:
!pip install -U transformers accelerate bitsandbytes huggingface_hub
!pip install umap-learn
!pip install googletrans==4.0.0-rc1
!pip install gensim pyldavis

Collecting huggingface_hub
  Using cached huggingface_hub-1.1.4-py3-none-any.whl.metadata (13 kB)
Collecting httpx==0.13.3 (from googletrans==4.0.0-rc1)
  Using cached httpx-0.13.3-py3-none-any.whl.metadata (25 kB)
Collecting httpcore==0.9.* (from httpx==0.13.3->googletrans==4.0.0-rc1)
  Using cached httpcore-0.9.1-py3-none-any.whl.metadata (4.6 kB)
Collecting h11<0.10,>=0.8 (from httpcore==0.9.*->httpx==0.13.3->googletrans==4.0.0-rc1)
  Using cached h11-0.9.0-py2.py3-none-any.whl.metadata (8.1 kB)
Using cached httpx-0.13.3-py3-none-any.whl (55 kB)
Using cached httpcore-0.9.1-py3-none-any.whl (42 kB)
Using cached h11-0.9.0-py2.py3-none-any.whl (53 kB)
Installing collected packages: h11, httpcore, httpx
  Attempting uninstall: h11
    Found existing installation: h11 0.14.0
    Uninstalling h11-0.14.0:
      Successfully uninstalled h11-0.14.0
  Attempting uninstall: httpcore
    Found existing installation: httpcore 0.17.3
    Uninstalling httpcore-0.17.3:
      Successfully uninstalle

In [5]:
!pip uninstall -y openai httpx
!pip install httpx==0.24.1
!pip install openai==0.28
!pip install bertopic[hdbscan]

[0mFound existing installation: httpx 0.13.3
Uninstalling httpx-0.13.3:
  Successfully uninstalled httpx-0.13.3
Collecting httpx==0.24.1
  Using cached httpx-0.24.1-py3-none-any.whl.metadata (7.4 kB)
Collecting httpcore<0.18.0,>=0.15.0 (from httpx==0.24.1)
  Using cached httpcore-0.17.3-py3-none-any.whl.metadata (18 kB)
Collecting h11<0.15,>=0.13 (from httpcore<0.18.0,>=0.15.0->httpx==0.24.1)
  Using cached h11-0.14.0-py3-none-any.whl.metadata (8.2 kB)
Using cached httpx-0.24.1-py3-none-any.whl (75 kB)
Using cached httpcore-0.17.3-py3-none-any.whl (74 kB)
Using cached h11-0.14.0-py3-none-any.whl (58 kB)
Installing collected packages: h11, httpcore, httpx
  Attempting uninstall: h11
    Found existing installation: h11 0.9.0
    Uninstalling h11-0.9.0:
      Successfully uninstalled h11-0.9.0
  Attempting uninstall: httpcore
    Found existing installation: httpcore 0.9.1
    Uninstalling httpcore-0.9.1:
      Successfully uninstalled httpcore-0.9.1
[31mERROR: pip's dependency resolve

In [6]:
!pip uninstall -y openai
!pip install bertopic[hdbscan] sentence-transformers
!pip install bertopic[hdbscan]

Found existing installation: openai 0.28.0
Uninstalling openai-0.28.0:
  Successfully uninstalled openai-0.28.0


# Restart session

# Reload data and prep

# Import libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
import re
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from transformers import AutoModelForCausalLM, pipeline

In [2]:
# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

# Load data from source

In [3]:
# Import the drive module from Colab
from google.colab import drive

# Mount Google Drive so Colab can access your files
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [4]:
# Define the full paths to the excel files
google_path = '/content/drive/My Drive/Colab Notebooks/google_reviews_synthetic.csv'
trustpilot_path = '/content/drive/My Drive/Colab Notebooks/trustpilot_reviews_synthetic.csv'

# Load the reviews CSV files into DataFrames
google_df = pd.read_csv(google_path)
trustpilot_df = pd.read_csv(trustpilot_path)

In [5]:
print(google_df.columns)
print(trustpilot_df.columns)

Index(['Customer Name', 'SurveyID for external use (e.g. tech support)',
       'Club's Name', 'Social Media Source', 'Creation Date', 'Comment',
       'Overall Score'],
      dtype='object')
Index(['Review ID', 'Review Created (UTC)', 'Review Consumer User ID',
       'Review Title', 'Review Content', 'Review Stars', 'Source Of Review',
       'Review Language', 'Domain URL', 'Webshop Name', 'Business Unit ID',
       'Tags', 'Company Reply Date (UTC)', 'Location Name', 'Location ID'],
      dtype='object')


# Data cleaning

# Remove missing reviews

In [6]:
# Remove rows with missing review text
google_df = google_df.dropna(subset=['Comment'])
trustpilot_df = trustpilot_df.dropna(subset=['Review Content'])

# Preprocessing: lowercase, remove stopwords, remove numbers

In [7]:
stop_words = set(stopwords.words('english'))
def preprocess_text(text):
    # Lowercase
    text = text.lower()
    # Remove numbers and punctuation
    text = re.sub(r'\d+', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords and single characters
    tokens = [word for word in tokens if word not in stop_words and len(word) > 1]
    return tokens

In [8]:
# Apply to Google and Trustpilot reviews
google_df['clean_tokens'] = google_df['Comment'].apply(preprocess_text)
trustpilot_df['clean_tokens'] = trustpilot_df['Review Content'].apply(preprocess_text)

# Identify Negative Reviews

- For Google reviews, negative: `score < 3`
- For Trustpilot reviews, negative: `stars < 3`

In [9]:
# For Google reviews (negative if Overall Score < 3)
google_neg = google_df[google_df['Overall Score'] < 3].copy()

# For Trustpilot reviews (negative if Review Stars < 3)
trustpilot_neg = trustpilot_df[trustpilot_df['Review Stars'] < 3].copy()

# Safe results using pickle

In [10]:
import pickle

# Define paths to save the pickle files in your Google Drive
google_neg_path = '/content/drive/My Drive/Colab Notebooks/google_neg_reviews.pkl'
trustpilot_neg_path = '/content/drive/My Drive/Colab Notebooks/trustpilot_neg_reviews.pkl'

# Save Google negative reviews
with open(google_neg_path, 'wb') as f:
    pickle.dump(google_neg, f)

# Save Trustpilot negative reviews
with open(trustpilot_neg_path, 'wb') as f:
    pickle.dump(trustpilot_neg, f)

# Restart session

# Load the pickled dfs

In [11]:
import pickle

# Define paths to save the pickle files in your Google Drive
google_neg_path = '/content/drive/My Drive/Colab Notebooks/google_neg_reviews.pkl'
trustpilot_neg_path = '/content/drive/My Drive/Colab Notebooks/trustpilot_neg_reviews.pkl'

# Load (unpickle) the DataFrames
with open(google_neg_path, 'rb') as f:
    google_neg = pickle.load(f)

with open(trustpilot_neg_path, 'rb') as f:
    trustpilot_neg = pickle.load(f)

# Load the following model: tiiuae/falcon-7b-instruct. Set the pipeline for text generation and a max length of 1,000 for each review.

In [12]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, BitsAndBytesConfig

# Load model and tokenizer
model_name = "tiiuae/falcon-7b-instruct"

# Configuration for 8-bit quantization (for better memory usage)
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True  # Offload some parts to CPU automatically
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load model with the specified configuration
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # Automatically choose the best device (GPU/CPU)
    trust_remote_code=True
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

# Extract Main Topics from Each Negative Review (Prompted with Falcon-7b-Instruct)

In [13]:
!pip install langdetect



I'm running a subset of 50 because of memory issues.

In [14]:
import re
from collections import Counter
import pandas as pd
from langdetect import detect, LangDetectException
from tqdm import tqdm

# -------------------------------
# Topic normalization dictionary
# -------------------------------
topic_normalization = {
    # Cleanliness & Hygiene
    "dirty": "Hospital Cleanliness",
    "hygiene": "Hospital Cleanliness",
    "sanitary": "Hospital Cleanliness",
    "filthy": "Hospital Cleanliness",
    "infection": "Hospital Cleanliness",
    "sterile": "Hospital Cleanliness",
    "disinfection": "Hospital Cleanliness",

    # Parking & Access
    "parking": "Parking Issues",
    "car park": "Parking Issues",
    "space": "Parking Issues",
    "access": "Access Issues",
    "wheelchair": "Accessibility",
    "ramp": "Accessibility",

    # Pricing & Billing
    "price": "Pricing",
    "charges": "Pricing",
    "expensive": "Pricing",
    "cost": "Pricing",
    "billing": "Billing Issues",
    "insurance": "Billing Issues",

    # Staff Behavior & Communication
    "nurse": "Nurse Behavior",
    "doctor": "Doctor Behavior",
    "rude": "Staff Rudeness",
    "friendly": "Staff Friendliness",
    "unprofessional": "Staff Rudeness",
    "communication": "Communication Quality",
    "explained": "Communication Quality",
    "updates": "Communication Quality",

    # Waiting Times
    "waiting": "Waiting Times",
    "delay": "Waiting Times",
    "queue": "Waiting Times",
    "appointment": "Waiting Times",

    # Treatment & Care Quality
    "treatment": "Treatment Quality",
    "procedure": "Treatment Quality",
    "operation": "Treatment Quality",
    "surgery": "Treatment Quality",
    "specialist": "Doctor Specialty",
    "expert": "Doctor Specialty",
    "cardiology": "Doctor Specialty",
    "orthopedic": "Doctor Specialty",

    # Facilities & Equipment
    "equipment": "Medical Equipment",
    "machines": "Medical Equipment",
    "facility": "Hospital Facilities",
    "room": "Hospital Facilities",
    "bed": "Hospital Facilities",

    # General Experience
    "experience": "Overall Experience",
    "service": "Overall Experience",
    "care": "Overall Experience"
}

# -------------------------------
# Functions
# -------------------------------
def normalize_topic(topic):
    topic = re.sub(r'^\s*\d+\.\s*', '', topic).strip()
    for key, normalized in topic_normalization.items():
        if key.lower() in topic.lower():
            return normalized
    return topic

def is_english(text):
    try:
        return detect(text) == 'en'
    except LangDetectException:
        return False

def extract_and_normalize_topics(main_topic_text):
    raw_extracted_topics = re.findall(r'^\s*\d+\.\s*(.+)', main_topic_text, re.MULTILINE)

    if not raw_extracted_topics:
        lines = [line.strip() for line in main_topic_text.split('\n') if line.strip()]
        raw_extracted_topics = [
            re.sub(r'^(Topic\\s*\\d+\\s*:|^\\d+\\.\\s*|^\\s*-\\s*)*', '', line, flags=re.IGNORECASE).strip()
            for line in lines
        ]
        raw_extracted_topics = [t for t in raw_extracted_topics if t]

    normalized_topics_list = []
    for topic_text in raw_extracted_topics:
        normalized_topic = normalize_topic(topic_text)
        if normalized_topic:
            normalized_topics_list.append(normalized_topic)

    return normalized_topics_list[:3]  # Top 3

# -------------------------------
# Load pipeline for text generation
# -------------------------------
text_gen = pipeline(
    "text-generation",
    model="tiiuae/falcon-7b-instruct",
    device_map="auto"
)

# -------------------------------
# Prepare data
# -------------------------------
# Assuming 'google_neg' DataFrame exists with a 'Comment' column
english_reviews = google_neg[google_neg['Comment'].apply(is_english)]
subset_size = 50
google_subset = english_reviews.iloc[:subset_size].copy()

google_prompts = [
    f"In the following customer review, pick out the main 3 topics. Return them in a numbered list format, each on a new line.\n\nReview: {review}\nMain topics:"
    for review in google_subset['Comment']
]

# -------------------------------
# Generate topics using pipeline
# -------------------------------
google_results = []
for prompt in tqdm(google_prompts, desc="Generating topics"):
    result = text_gen(prompt, max_new_tokens=64, temperature=0.7)
    google_results.append(result[0]['generated_text'])

# -------------------------------
# Extract and normalize topics
# -------------------------------
google_subset['main_topic_raw'] = [
    text.split("Main topics:")[-1].strip() for text in google_results
]
google_subset['topics_list'] = google_subset['main_topic_raw'].apply(extract_and_normalize_topics)

# -------------------------------
# Count topics
# -------------------------------
all_topics = [topic for sublist in google_subset['topics_list'] for topic in sublist]
topic_counts = Counter(all_topics)
top_3_overall_topics = topic_counts.most_common(3)

# -------------------------------
# Display results
# -------------------------------
print("---")
print("## Overall Top 3 Most Common Topics")
for topic, count in top_3_overall_topics:
    print(f"- **{topic}**: {count} reviews")

print("\n---")
print("## Reviews with Extracted and Limited Topics")
for index, row in google_subset.iterrows():
    print(f"**Review {index+1}:** {row['Comment']}")
    print("  **Extracted Topics (Max 3):**")
    if row['topics_list']:
        for i, topic in enumerate(row['topics_list']):
            print(f"  {i+1}. {topic}")
    else:
        print("  No main topics extracted.")
    print("-" * 40)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cuda:0
Generating topics:   0%|          | 0/50 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
Generating topics:   2%|▏         | 1/50 [00:01<01:02,  1.27s/it]Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
Generating topics:   4%|▍         | 2/50 [00:01<00:41,  1.16it/s]Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
Generating topics:   6%|▌         | 3/50 [00:02<00:33,  1.39it/s]Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
Generating topics:   8%|▊         | 4/50 [00:03<00:37,  1.24it/s]Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
Generating topics:  10%|█         | 5/50 [00:03<00:33,  1.33it/s]Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
Generating topics:  12%|█▏        | 6/50 [00:04<00:31,  1.42it/s]Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
Generating topics:  14%|█▍        | 7/50 [00:

---
## Overall Top 3 Most Common Topics
- **Overall Experience**: 20 reviews
- **Pricing**: 14 reviews
- **Waiting Times**: 9 reviews

---
## Reviews with Extracted and Limited Topics
**Review 2:** Waiting time was minimal, I was seen within 10 minutes of arrival.
  **Extracted Topics (Max 3):**
  1. Waiting Times
  2. Staff friendliness
  3. Overall Experience
----------------------------------------
**Review 14:** Waiting time was minimal, I was seen within 10 minutes of arrival.
  **Extracted Topics (Max 3):**
  1. Overall Experience
  2. Convenient location
  3. Staff Friendliness
----------------------------------------
**Review 21:** Communication was clear and timely, received all updates via SMS and email.
  **Extracted Topics (Max 3):**
  1. Communication Quality
  2. Communication Quality
  3. Timely delivery
----------------------------------------
**Review 24:** Nurse was rude and dismissive, barely answered my questions.
  **Extracted Topics (Max 3):**
  1. Staff Rudeness





# The output of the model will be the top 3 topics from each review. Append each of these topics from each review to create a comprehensive list.

In [15]:
all_topics = [topic for sublist in google_subset['topics_list'] for topic in sublist]
print(all_topics[:15]) # Adjust 15 to any number you prefer, or remove [:15] for the whole list

['Waiting Times', 'Staff friendliness', 'Overall Experience', 'Overall Experience', 'Convenient location', 'Staff Friendliness', 'Communication Quality', 'Communication Quality', 'Timely delivery', 'Staff Rudeness', 'Inadequate answers to questions', 'Overall Experience', 'Pricing', 'Pricing', 'Comparison with other clinics']


# Use this list as input to run BERTopic again.

In [16]:
from bertopic import BERTopic
from collections import Counter
import pandas as pd
#from openai import OpenAI

bertopic_model = BERTopic(language="english", verbose=True,
                          min_topic_size=5, nr_topics="auto", n_gram_range=(1, 2))

# Fit BERTopic on your 'all_topics' list
# The 'meta_topics' output here refers to the assigned meta-topic ID for each item in 'all_topics'
meta_topics, probabilities = bertopic_model.fit_transform(all_topics)

# Get information about the generated meta-topics
meta_topic_info = bertopic_model.get_topic_info()

print(meta_topic_info)

print("\n--- Top words for each BERTopic Meta-Topic ---")
for topic_id in meta_topic_info['Topic'].unique():
    if topic_id != -1: # -1 usually represents outliers/noise
        print(f"\nMeta-Topic {topic_id}: {bertopic_model.get_topic(topic_id)}")

2025-11-16 19:53:51,790 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/5 [00:00<?, ?it/s]

2025-11-16 19:53:55,727 - BERTopic - Embedding - Completed ✓
2025-11-16 19:53:55,728 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-11-16 19:54:03,993 - BERTopic - Dimensionality - Completed ✓
2025-11-16 19:54:03,994 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-11-16 19:54:04,002 - BERTopic - Cluster - Completed ✓
2025-11-16 19:54:04,003 - BERTopic - Representation - Extracting topics using c-TF-IDF for topic reduction.
2025-11-16 19:54:04,021 - BERTopic - Representation - Completed ✓
2025-11-16 19:54:04,022 - BERTopic - Topic reduction - Reducing number of topics
2025-11-16 19:54:04,028 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-11-16 19:54:04,039 - BERTopic - Representation - Completed ✓
2025-11-16 19:54:04,040 - BERTopic - Topic reduction - Reduced number of topics from 11 to 11


    Topic  Count                                               Name  \
0      -1      1                                   -1_comparison___   
1       0     29  0_experience_overall_overall experience_experi...   
2       1     22           1_pricing_pricing pricing_condition_fees   
3       2     21                 2_waiting times_waiting_times_time   
4       3     17  3_quality_communication quality_communication_...   
5       4     13  4_friendliness_staff friendliness_staff_friend...   
6       5      9  5_hospital_hospital cleanliness_cleanliness_cl...   
7       6      9  6_doctor_specialty_doctor specialty_specialty ...   
8       7      8           7_staff rudeness_rudeness_staff_attitude   
9       8      7        8_other clinics_clinics_other_comparison to   
10      9      6     9_issues_parking_parking issues_issues parking   

                                       Representation  \
0                      [comparison, , , , , , , , , ]   
1   [experience, overall, overall

In [17]:
# Visualize the top 5 topics as a bar chart
fig = bertopic_model.visualize_barchart(top_n_topics=5)

# Adjust the figure size
fig.update_layout(
    autosize=False,
    width= 1400,  # Adjust width for more space
    height=600  # Adjust height for more space
)

# Rotate x-axis labels to prevent overlap
fig.update_layout(
    xaxis_tickangle=-45  # Rotate labels by 45 degrees to avoid overlap
)

# Show the plot
fig.show()

In [18]:
# Visualize topics with a heatmap
bertopic_model.visualize_heatmap()

In [19]:
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="tiiuae/falcon-7b-instruct",
    device_map="auto"
)

if not all_topics:
    print("The 'all_topics' list is empty. Cannot generate insights.")
else:
    print("\nGenerating Actionable Insights from all_topics")

    # Join the topics into a single string for the prompt
    topics_for_insight_prompt = ", ".join(all_topics)

    # Construct the full prompt as specified
    actionable_insight_prompt = (
        "For the following text topics obtained from negative customer reviews, "
        "can you give some actionable insights that would help this gym company?\n\n"
        f"Topics: {topics_for_insight_prompt}\n\n"
        "Actionable Insights:" # This phrase helps guide the LLM's output
    )

    print(f"Prompt sent to Falcon model (truncated for display):\n{actionable_insight_prompt[:500]}...\n")

    # Run the Falcon-7b-Instruct model to generate insights
    # Adjust max_new_tokens for the desired length of insights.
    # Adjust temperature for creativity (higher = more creative).

    insight_results = generator(
        actionable_insight_prompt,
        max_new_tokens=300, # Example: 300 tokens for comprehensive insights
        do_sample=True,
        temperature=0.7,
        num_return_sequences=1
    )

    # Extract the generated insights from the model's output
    generated_insights = insight_results[0]['generated_text'].split("Actionable Insights:")[-1].strip()

    print("\nGenerated Actionable Insights from Falcon Model")
    print(generated_insights)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.



Generating Actionable Insights from all_topics
Prompt sent to Falcon model (truncated for display):
For the following text topics obtained from negative customer reviews, can you give some actionable insights that would help this gym company?

Topics: Waiting Times, Staff friendliness, Overall Experience, Overall Experience, Convenient location, Staff Friendliness, Communication Quality, Communication Quality, Timely delivery, Staff Rudeness, Inadequate answers to questions, Overall Experience, Pricing, Pricing, Comparison with other clinics, Parking Issues, Overall Experience, Time Consuming,...


Generated Actionable Insights from Falcon Model
1. Staff friendliness and communication quality are crucial for a positive patient experience. Customers expect to be treated with respect and receive timely, clear answers to their questions. Improve your staff recruitment process and invest in training to ensure that your staff are well-equipped to handle customer inquiries and complaints.



# **Gensim LDA Comparison**

# Reload data set

In [20]:
!pip install gensim



In [21]:
import pickle

# Define paths to save the pickle files in your Google Drive
google_neg_path = '/content/drive/My Drive/Colab Notebooks/google_neg_reviews.pkl'
trustpilot_neg_path = '/content/drive/My Drive/Colab Notebooks/trustpilot_neg_reviews.pkl'

# Load (unpickle) the DataFrames
with open(google_neg_path, 'rb') as f:
    google_neg = pickle.load(f)

with open(trustpilot_neg_path, 'rb') as f:
    trustpilot_neg = pickle.load(f)

In [22]:
# Check the structure of the first few rows from both DataFrames
print(google_neg['clean_tokens'].head())
print(trustpilot_neg['clean_tokens'].head())

1     [waiting, time, minimal, seen, within, minutes...
13    [waiting, time, minimal, seen, within, minutes...
20    [communication, clear, timely, received, updat...
23    [nurse, rude, dismissive, barely, answered, qu...
24      [prices, consultation, high, compared, clinics]
Name: clean_tokens, dtype: object
0     [doctor, specialized, cardiology, provided, de...
2     [doctor, specialized, cardiology, provided, de...
4     [treatment, quality, exceptional, doctor, expl...
10    [waiting, times, long, wait, hour, past, sched...
11    [parking, nightmare, spent, minutes, looking, ...
Name: clean_tokens, dtype: object


# Perform the preprocessing required to run the LDA model from Gensim. Use the list of negative reviews (combined Google and Trustpilot reviews).

In [23]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim import corpora

import nltk
nltk.download('wordnet')
nltk.download('punkt')

# Initialize stopwords and lemmatizer
stop_words = set(stopwords.words('english'))  # List of stopwords in English
lemmatizer = WordNetLemmatizer()

# Preprocess each review (already tokenized)
def preprocess(tokens):
    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]

    # Lemmatize tokens
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    return tokens

# Apply preprocessing to all Google reviews
google_preprocessed = [preprocess(tokens) for tokens in google_neg['clean_tokens'].tolist()]

# Apply preprocessing to all Trustpilot reviews
trustpilot_preprocessed = [preprocess(tokens) for tokens in trustpilot_neg['clean_tokens'].tolist()]

# Combine the preprocessed Google and Trustpilot reviews
preprocessed_texts = google_preprocessed + trustpilot_preprocessed

# Remove rare words (appear in fewer than 3 documents)
# Create dictionary
id2word = corpora.Dictionary(preprocessed_texts)

# Filter tokens that appear in fewer than 3 documents or more than 50% of the documents
id2word.filter_extremes(no_below=3, no_above=0.5)

# Create corpus
corpus = [id2word.doc2bow(text) for text in preprocessed_texts]

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# Using Gensim, perform LDA on the tokenised data. Specify the number of topics = 10.

In [24]:
from gensim.models import LdaModel
import numpy as np

# Number of topics
num_topics = 10

# Train the LDA model using the preprocessed corpus and dictionary
lda_model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    num_topics=num_topics,
    random_state=42,
    passes=15,  # Number of passes over the entire corpus (higher = better)
    iterations=400,  # Number of iterations for each pass
    eval_every=None  # Disable evaluation during training to save time
)

# Display the topics and their top words
topics = lda_model.print_topics(num_words=10)  # Show top 10 words for each topic
for topic in topics:
    print(topic)

(0, '0.125*"step" + 0.125*"exceptional" + 0.125*"every" + 0.125*"quality" + 0.125*"explained" + 0.125*"treatment" + 0.125*"clearly" + 0.125*"doctor" + 0.000*"time" + 0.000*"waiting"')
(1, '0.250*"time" + 0.125*"waiting" + 0.125*"scheduled" + 0.125*"long" + 0.125*"hour" + 0.125*"past" + 0.125*"wait" + 0.000*"doctor" + 0.000*"advice" + 0.000*"condition"')
(2, '0.087*"high" + 0.087*"clinic" + 0.087*"consultation" + 0.087*"price" + 0.087*"compared" + 0.080*"arrival" + 0.080*"seen" + 0.080*"minimal" + 0.080*"within" + 0.080*"minute"')
(3, '0.082*"minute" + 0.082*"spot" + 0.082*"looking" + 0.082*"appointment" + 0.082*"spent" + 0.082*"nightmare" + 0.082*"parking" + 0.061*"cleanliness" + 0.061*"sanitized" + 0.061*"impressive"')
(4, '0.143*"detailed" + 0.143*"condition" + 0.143*"specialized" + 0.143*"provided" + 0.143*"cardiology" + 0.143*"advice" + 0.143*"doctor" + 0.000*"high" + 0.000*"compared" + 0.000*"clinic"')
(5, '0.016*"time" + 0.016*"doctor" + 0.016*"consultation" + 0.016*"clinic" + 0.

In [25]:
!pip install pyLDAvis



# Show the visualisations of the topics, displaying the distance maps and the bar chart listing out the most salient terms.

In [26]:
import pyLDAvis.gensim_models
import pyLDAvis

# Visualizing the LDA topics using pyLDAvis
lda_visualization = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word)

# Display the visualization
pyLDAvis.display(lda_visualization)


datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).




datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).


datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).


datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).

