<a href="https://colab.research.google.com/github/sheldonkemper/portfolio/blob/main/KEMPER_SHELDON_CAM_C301_Week_4and5_Topic_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic project 4.1 Applying NLP for topic modelling in a real-life context

**Welcome to your topic project: Applying NLP for topic modelling in a real-life context**

In this project, you will bridge the gap between theory and practical application by developing automated topic modelling tools tailored to a specific industry context.


Applying NLP for topic modelling is crucial for data analysis in business because it enables companies to identify and understand key themes and patterns within large volumes of text data. This efficiency allows businesses to extract essential insights and trends without manually sifting through extensive documents. Automated topic modelling helps businesses make informed decisions faster, which helps to improve productivity and gain a competitive edge. Additionally, it supports better information management by uncovering underlying topics in reports, emails, customer feedback, and market research, which enhances overall business intelligence and strategic planning.




Please set aside approximately **19 hours** to complete the topic project by **Thursday, 22 August at 5 p.m. (UK Time)**.

<br>

## **Business context**

The PureGym Group, founded in 2008, has approximately 2 million members and 600 gyms across the world (particularly in the UK, Denmark, and Switzerland). As one of the world’s largest value fitness operators, PureGym appeals to a broad range of customers by offering high-quality, low-cost, and flexible fitness facilities. The company’s customer-centric proposition – affordable membership fees, no fixed-term contracts, and 24/7 access to high-quality gyms – differentiates it from more traditional gyms and elevates it as a market leader within this space.

This customer focus centres on understanding what motivates members to join and what influences their behaviour once they have joined. Understanding how to leverage innovative technology to influence, improve, and simplify their experience allows PureGym to foster an open, welcoming, and diverse environment for its members while maintaining the value proposition that PureGym is built upon.

With the shift in focus to value-for-money memberships across the gym industry, PureGym seeks to achieve its mission of ‘inspiring a healthier world by providing members with affordable access to the benefits being healthy can offer’.


<br></br>

## **Objective**

By the end of this topic project, you will have analysed PureGym's review data to uncover key drivers that provide actionable insights for enhancing customer experience.

In the Notebook, you will:

- Use two data sets containing customer reviews from Google and Trustpilot.
- Perform basic level analysis by finding the frequently used words in both data sets.
- Generate a wordcloud to visualise the most frequently used words in the reviews.
- Apply BERTopic for topic modelling, keeping track of gym locations, to identify common topics and words in the negative reviews.
- Identify the locations that have the most negative reviews.
- Use the built-in visualisation functions in BERTopic to cluster and visually represent the topics and words in these reviews, thereby helping to identify specific themes from the reviews.
- Conduct a comparison with Gensim’s LDA model to validate the topic modelling results.
- Perform emotion analysis to identify the emotions associated with customer reviews.
- Filter for angry reviews and apply BERTopic to discover the prevalent topics and words being discussed in these negative reviews.
- Leverage the multi-purpose capability of the state-of-the-art Falcon-7b-instruct model, with the help of prompts, to identify top topics in each review.
- Use a different prompt with the Falcon-7b-instruct model to further generate suggestions for improvements for PureGym, based on the top topics identified from the negative reviews.


You will also write a report summarising the results of your findings and recommendations.


<br></br>

## **Assessment criteria**
By completing this project, you will be able to provide evidence that you can:

- Investigate real-world data to find potential trends for deeper investigation.
- Preprocess and refine textual data for visualisations.
- Apply topic modelling using various techniques.
- Apply emotion analysis using BERT.
- Evaluate the outcomes of your investigation.
- Communicate actionable insights.



<br></br>

## **Project guidance**

**Import packages and data:**
1. Import the data set with the provided URL:
  - Data set's Drive link: https://drive.google.com/drive/folders/1azch13dtGeAkbnEjqeoyl3nU9jheojU4
2. Import the data file **Google_12_months.xlsx** into a dataframe.
3. Import the data file **Trustpilot_12_months.xlsx** into a dataframe.
4. Remove any rows with missing values in the Comment column (Google review) and Review Content column (Trustpilot).


**Conducting initial data investigation:**

1. Find the number of unique locations in each data set:
  - Google data set: use Club's Name.
  - Trustpilot data set: use Location Name.
2. Find the number of common locations between the Google data set and the Trustpilot data set.
3. Perform preprocessing of the data – change to lower case, remove stopwords using NLTK, and remove numbers.
4. Tokenise the data using word_tokenize from NLTK.
5. Find the frequency distribution of the words from each data set's reviews separately. You can use nltk.FreqDist.
6. Plot a histogram/bar plot showing the top 10 words from each data set.
7. Use the wordcloud library on the cleaned data and plot the word cloud.
8. Create a new dataframe by filtering the data to keep only the negative reviews from both data sets.

  • For Google reviews, overall scores < 3 can be considered negative scores.

  • For Trustpilot reviews, stars < 3 can be considered negative scores.

  Repeat the frequency distribution and wordcloud steps on the filtered data consisting of only negative reviews.


**Conducting initial topic modelling:**

1. With the data frame created in the previous step:

  • Keep only the reviews from locations common to both data sets.

  • Merge the reviews to form a new list.

2. Preprocess this data set. Use BERTopic on this cleaned data set.
3. Output: List out the top topics along with their document frequencies.
4. For the top 2 topics, list out the top words.
5. Show an interactive visualisation of the topics to identify the cluster of topics and to understand the intertopic distance map.
6. Show a barchart of the topics, displaying the top 5 words in each topic.
7. Plot a heatmap, showcasing the similarity matrix.
8. For 10 clusters, provide a brief description in the Notebook of the topics they comprise, along with the general theme of each cluster, evidenced by the top words within each cluster's topics.

**Performing further data investigation:**

1. List out the top 20 locations with the highest number of negative reviews. Do this separately for the Google and Trustpilot reviews, and comment on the result. Are the locations roughly similar in both data sets?
2. Merge the 2 data sets using Location Name and Club's Name.

  Now, list out the following:

  • Locations

  • Number of Trustpilot reviews for this location

  • Number of Google reviews for this location

  • Total number of reviews for this location (sum of Google reviews and Trustpilot reviews)

  Sort based on the total number of reviews.
3. For the top 30 locations, redo the word frequency and word cloud. Comment on the results, and highlight if the results are different from the first run.
4. For the top 30 locations, combine the reviews from Google and Trustpilot and run them through BERTopic.

  Comment on the following:

  • Are the results any different from the first run of BERTopic?

  • If so, what has changed?

  • Are there any additional insights compared to the first run?

**Conducting emotion analysis:**

1. Import the BERT model bhadresh-savani/bert-base-uncased-emotion from Hugging Face, and set up a pipeline for text classification.
2. With the help of an example sentence, run the model and display the different emotion classifications that the model outputs.
3. Run this model on both data sets, and capture the top emotion for each review.
4. Use a bar plot to show the top emotion distribution for all negative reviews in both data sets.
5. Extract all the negative reviews (from both data sets) where anger is the top emotion.
6. Run BERTopic on the output of the previous step.
7. Visualise the clusters from this run. Comment on whether it is any different from the previous runs, and whether it is possible to narrow down the primary issues that have led to an angry review.

**Using a large language model from Hugging Face:**

1. Load the following model: tiiuae/falcon-7b-instruct. Set up a text-generation pipeline with a max length of 1,000 for each review.
2. Add the following prompt to every review, before passing it on to the model: **In the following customer review, pick out the main 3 topics. Return them in a numbered list format, with each one on a new line.**

  Run the model.

  Note: If the execution time is too high, you can use a subset of the bad reviews (instead of the full set) to run this model.
3. The output of the model will be the top 3 topics from each review. Append each of these topics from each review to create a comprehensive list.
4. Use this list as input to run BERTopic again.
5. Comment on the output of BERTopic. Highlight any changes, improvements, and further insights obtained.
6. Use the comprehensive list from Step 3.

  Pass it to the model as the input, but prefix the following to the prompt: **For the following text topics obtained from negative customer reviews, can you give some actionable insights that would help this gym company?**

  Run the Falcon-7b-Instruct model.
7. List the output, ideally in the form of suggestions, that the company can employ to address customer concerns.

**Using Gensim:**
1. Perform the preprocessing required to run the LDA model from Gensim. Use the list of negative reviews (combined Google and Trustpilot reviews).
2. Using Gensim, perform LDA on the tokenised data. Specify the number of topics = 10.
3. Show the visualisations of the topics, displaying the distance maps and the bar chart listing out the most salient terms.
4. Comment on the output and whether it is similar to other techniques, or whether any extra insights were obtained.

**Report:**
1. Document your approach and major inferences from the data analysis.
2. When you have completed the project:
  - Download your completed Notebook as an IPYNB (Jupyter Notebook) or PY (Python) file. Save the file as follows: **LastName_FirstName_CAM_C301_Week_4and5_Topic_project.ipynb**.
  - Prepare a detailed report (between 800–1,000 words) that includes:
    - an overview of your approach
    - a description of your analysis
    - an explanation of the insights you identified
    - a summary of the comments requested in earlier steps
    - final insights, based on the output obtained from the various models employed
  - Save the document as a PDF named according to the following convention: **LastName_FirstName_CAM_C301_Week_4and5_Topic_project.pdf**.
  - Submit your Notebook and PDF document by **Thursday, 22 August at 5 p.m. (UK Time)**.


<br></br>
> **Declaration**
>
> By submitting your project, you indicate that the work is your own and has been created with academic integrity. Refer to the Cambridge plagiarism regulations.

# 1. Introduction

## 1.1 Brief Project Description

todo

## 1.2 Import Necessary Libraries


## 1.3 Assessment Criteria Overview

todo

# 2. Data Import and Cleaning

## 2.1 Load the Datasets

## 2.2 Drop Irrelevant Columns and Handle Missing Values
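
A minimal sketch of the import and missing-value steps from the project guidance. It assumes the two Excel files have been downloaded to the working directory and that `pandas` has an Excel engine such as `openpyxl` available. The helper names `drop_missing_reviews` and `load_reviews` are illustrative and not part of the brief.

```python
import pandas as pd


def drop_missing_reviews(df: pd.DataFrame, text_col: str) -> pd.DataFrame:
    """Drop rows whose review text is missing and renumber the index."""
    return df.dropna(subset=[text_col]).reset_index(drop=True)


def load_reviews(path: str, text_col: str) -> pd.DataFrame:
    """Read one Excel export and remove rows without review text."""
    return drop_missing_reviews(pd.read_excel(path), text_col)


# Intended notebook usage (file locations are an assumption):
# google = load_reviews("Google_12_months.xlsx", "Comment")
# trustpilot = load_reviews("Trustpilot_12_months.xlsx", "Review Content")
```

Keeping the cleaning step in its own function makes it easy to check on a small hand-made dataframe before running it on the full exports.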

# 3. Initial Data Investigation

## 3.1 Count Unique Locations

## 3.2 Identify Common Locations
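
One possible way to count unique locations and find the overlap, using the Club's Name and Location Name columns named in the project guidance. Trimming surrounding whitespace before comparing names is an assumption. The real data may need further normalisation (for example, case differences).

```python
import pandas as pd


def unique_locations(df: pd.DataFrame, column: str) -> set:
    """Return the distinct, whitespace-trimmed location names in a column."""
    return set(df[column].dropna().astype(str).str.strip())


def common_locations(google: pd.DataFrame, trustpilot: pd.DataFrame) -> set:
    """Locations that appear in both the Google and the Trustpilot data."""
    google_locs = unique_locations(google, "Club's Name")
    trustpilot_locs = unique_locations(trustpilot, "Location Name")
    return google_locs & trustpilot_locs


# Intended notebook usage:
# print(len(unique_locations(google, "Club's Name")))
# print(len(unique_locations(trustpilot, "Location Name")))
# print(len(common_locations(google, trustpilot)))
```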

## 3.3 Text Preprocessing
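
A hedged sketch of the preprocessing described in the project guidance: lower-casing, number removal, tokenisation, and stopword removal. The tokeniser and stopword list are passed in as arguments so the function can be tried without downloading NLTK data; `nltk_tools` shows how the NLTK pieces might be wired in. Dropping non-alphabetic tokens (punctuation) is an extra assumption not stated in the brief.

```python
import re
from typing import Callable, Iterable, List


def preprocess(text: str,
               tokenize: Callable[[str], List[str]],
               stop_words: Iterable[str]) -> List[str]:
    """Lower-case, strip digits, tokenise, then drop stopwords and punctuation."""
    stop = set(stop_words)
    text = re.sub(r"\d+", " ", text.lower())
    return [tok for tok in tokenize(text) if tok.isalpha() and tok not in stop]


def nltk_tools():
    """Return (word_tokenize, English stopword set); NLTK is imported lazily."""
    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    # Newer NLTK releases ask for "punkt_tab"; older ones only know "punkt".
    for resource in ("punkt", "punkt_tab", "stopwords"):
        nltk.download(resource, quiet=True)
    return word_tokenize, set(stopwords.words("english"))


# Intended notebook usage:
# word_tokenize, english_stop = nltk_tools()
# google["tokens"] = google["Comment"].apply(
#     lambda review: preprocess(str(review), word_tokenize, english_stop))
```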

## 3.4 Word Frequency Analysis
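
A small sketch of the frequency-distribution step. It uses `collections.Counter` from the standard library as a stand-in; `nltk.FreqDist` is a subclass of `Counter` and offers the same `most_common` method, so swapping it in should be a one-line change. The function name `top_words` is illustrative.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_words(token_lists: Iterable[List[str]], n: int = 10) -> List[Tuple[str, int]]:
    """Count tokens across all reviews and return the n most frequent."""
    counts = Counter()  # nltk.FreqDist() could be used here instead
    for tokens in token_lists:
        counts.update(tokens)
    return counts.most_common(n)


# Intended notebook usage, assuming a "tokens" column of token lists:
# print(top_words(google["tokens"]))
# print(top_words(trustpilot["tokens"]))
```

The resulting list of `(word, count)` pairs can be passed straight to a bar plot or converted to a dictionary for a word cloud.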

## 3.5 Visualise Top Words

## 3.6 Generate Word Clouds

# 4. Filter Negative Reviews

## 4.1 Define Negative Reviews
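
A minimal sketch of the negative-review filter, applying the "score below 3" rule from the project guidance. The brief does not give the exact score column names, so `"Overall Score"` and `"Review Stars"` in the usage comments are hypothetical and should be replaced with the real headers.

```python
import pandas as pd


def negative_reviews(df: pd.DataFrame, score_col: str, threshold: int = 3) -> pd.DataFrame:
    """Keep only rows whose numeric score is strictly below the threshold."""
    scores = pd.to_numeric(df[score_col], errors="coerce")  # non-numeric -> NaN
    return df[scores < threshold].copy()


# Intended notebook usage (column names are assumptions):
# google_neg = negative_reviews(google, "Overall Score")
# trustpilot_neg = negative_reviews(trustpilot, "Review Stars")
```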

## 4.2 Repeat Word Frequency and Word Cloud Analysis for Negative Reviews

# 5. Initial Topic Modelling with BERTopic

## 5.1 Prepare Data from Common Locations
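
One way the negative reviews from common locations might be merged into a single list of documents for BERTopic. It assumes two dataframes of negative reviews and a set of common location names already exist; the function name is illustrative.

```python
import pandas as pd


def reviews_from_common_locations(google_neg: pd.DataFrame,
                                  trustpilot_neg: pd.DataFrame,
                                  common: set) -> list:
    """Return Google then Trustpilot review texts for locations in `common`."""
    g = google_neg.loc[google_neg["Club's Name"].isin(common), "Comment"]
    t = trustpilot_neg.loc[trustpilot_neg["Location Name"].isin(common), "Review Content"]
    return g.astype(str).tolist() + t.astype(str).tolist()


# Intended notebook usage:
# docs = reviews_from_common_locations(google_neg, trustpilot_neg, common)
```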

## 5.2 Apply BERTopic
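
A hedged sketch of fitting BERTopic and listing the top topics with their document frequencies. It assumes the `bertopic` package is installed and uses its default embedding model. `top_topics` relies on the `Topic`, `Count`, and `Name` columns returned by `get_topic_info()` and drops topic `-1`, which BERTopic uses for outliers. Both function names are illustrative.

```python
import pandas as pd


def fit_bertopic(docs):
    """Fit a default BERTopic model; returns (model, topic id per document)."""
    from bertopic import BERTopic  # imported lazily: heavy optional dependency

    model = BERTopic(language="english")
    topics, _ = model.fit_transform(docs)
    return model, topics


def top_topics(topic_info: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Largest n topics by document count, excluding the outlier topic -1."""
    real = topic_info[topic_info["Topic"] != -1]
    return real.sort_values("Count", ascending=False).head(n)[["Topic", "Count", "Name"]]


# Intended notebook usage:
# model, topics = fit_bertopic(docs)
# print(top_topics(model.get_topic_info()))
# print(model.get_topic(0), model.get_topic(1))  # top words of the top 2 topics
# model.visualize_topics()                       # intertopic distance map
# model.visualize_barchart(n_words=5)            # top 5 words per topic
# model.visualize_heatmap()                      # topic similarity matrix
```

The three `visualize_*` calls return Plotly figures, which render interactively when they are the last expression in a notebook cell.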

## 5.3 Visualise Topics and Clusters

## 5.4 Topic Description Documentation

todo

# 6. Further Investigation

## 6.1 Identify Top 20 Locations with Negative Reviews
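
A short sketch for ranking locations by number of negative reviews and checking how much the two rankings overlap. It assumes dataframes of negative reviews already exist; the function names are illustrative.

```python
import pandas as pd


def top_negative_locations(negative: pd.DataFrame, location_col: str, n: int = 20) -> pd.Series:
    """Review counts per location, largest first, limited to n locations."""
    return negative[location_col].value_counts().head(n)


def shared_locations(google_top: pd.Series, trustpilot_top: pd.Series) -> set:
    """Locations present in both top-n rankings."""
    return set(google_top.index) & set(trustpilot_top.index)


# Intended notebook usage:
# g_top = top_negative_locations(google_neg, "Club's Name")
# t_top = top_negative_locations(trustpilot_neg, "Location Name")
# print(sorted(shared_locations(g_top, t_top)))
```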


## 6.2 Merge Datasets by Location and Summarise
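
A possible implementation of the per-location summary described in the project guidance. The brief does not say which join to use; this sketch assumes an inner join, so only locations present in both data sets are kept. The output column names are illustrative.

```python
import pandas as pd


def review_counts_by_location(google: pd.DataFrame, trustpilot: pd.DataFrame) -> pd.DataFrame:
    """Per-location Trustpilot, Google, and total review counts, largest total first."""
    g = google.groupby("Club's Name").size().rename("google_reviews").reset_index()
    t = trustpilot.groupby("Location Name").size().rename("trustpilot_reviews").reset_index()
    merged = t.merge(g, left_on="Location Name", right_on="Club's Name", how="inner")
    merged["total_reviews"] = merged["trustpilot_reviews"] + merged["google_reviews"]
    cols = ["Location Name", "trustpilot_reviews", "google_reviews", "total_reviews"]
    return merged[cols].sort_values("total_reviews", ascending=False).reset_index(drop=True)


# Intended notebook usage:
# summary = review_counts_by_location(google_neg, trustpilot_neg)
# top_30 = summary.head(30)["Location Name"].tolist()
```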

## 6.3 Analyse Top 30 Locations

Redo the word frequency analysis, word clouds, and BERTopic for the combined reviews.


## 6.4 Location-Specific Insights

For the top 5 locations with the highest number of negative reviews, consider providing a more detailed breakdown of customer complaints or concerns.

# 7. Emotion Analysis with BERT

## 7.1 Set Up Emotion Classifier

## 7.2 Classify Emotions in Reviews
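
A hedged sketch of the emotion-classification steps using the model named in the project guidance. It assumes a recent `transformers` release in which `top_k=None` makes the pipeline return a score for every label. `top_emotion` and `classify_reviews` are illustrative helper names.

```python
from typing import Dict, Iterable, List


def load_emotion_classifier():
    """Build the Hugging Face text-classification pipeline (downloads the model)."""
    from transformers import pipeline  # imported lazily: heavy optional dependency

    return pipeline("text-classification",
                    model="bhadresh-savani/bert-base-uncased-emotion",
                    top_k=None)  # return scores for all emotion labels


def top_emotion(scores: List[Dict]) -> str:
    """Label with the highest score among one review's emotion scores."""
    return max(scores, key=lambda item: item["score"])["label"]


def classify_reviews(classifier, reviews: Iterable[str]) -> List[str]:
    """Top emotion per review; long reviews are truncated to the model limit."""
    results = classifier(list(reviews), truncation=True)
    return [top_emotion(scores) for scores in results]


# Intended notebook usage:
# classifier = load_emotion_classifier()
# print(classifier(["The showers were freezing again."]))  # all emotion scores
# google_neg["emotion"] = classify_reviews(classifier, google_neg["Comment"])
```

Passing `truncation=True` is a design choice to avoid errors on reviews longer than the model's input limit, at the cost of ignoring the tail of very long reviews.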

## 7.3 Visualise Emotion Distribution

## 7.4 Filter Angry Reviews and Apply BERTopic

# 8. Advanced Topic Modelling with Falcon-7b-Instruct

## 8.1 Set Up Falcon-7b-Instruct for Topic Extraction

## 8.2 Extract Topics with Prompt
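
A sketch of how the topic-extraction prompt from the project guidance might be attached to each review and the model's numbered list parsed back into topics. The `"Review:"` separator, the parsing regular expression, and the helper names are assumptions; `device_map="auto"` additionally requires the `accelerate` package.

```python
import re
from typing import Iterable, List

TOPIC_PROMPT = (
    "In the following customer review, pick out the main 3 topics. "
    "Return them in a numbered list format, with each one on a new line."
)


def build_topic_prompt(review: str) -> str:
    """Prefix one review with the instruction prompt."""
    return f"{TOPIC_PROMPT}\n\nReview: {review.strip()}"


def parse_numbered_topics(generated: str) -> List[str]:
    """Extract items from lines that look like '1. topic' or '2) topic'."""
    topics = []
    for line in generated.splitlines():
        match = re.match(r"\s*\d+[.)]\s*(.+)", line)
        if match:
            topics.append(match.group(1).strip())
    return topics


def load_falcon():
    """Text-generation pipeline for Falcon-7b-instruct (large download, GPU advised)."""
    from transformers import pipeline  # imported lazily: heavy optional dependency

    return pipeline("text-generation", model="tiiuae/falcon-7b-instruct", device_map="auto")


def extract_topics(generator, reviews: Iterable[str]) -> List[str]:
    """Run the prompt on every review and pool all parsed topics into one list."""
    all_topics: List[str] = []
    for review in reviews:
        output = generator(build_topic_prompt(review), max_length=1000, return_full_text=False)
        all_topics.extend(parse_numbered_topics(output[0]["generated_text"]))
    return all_topics


# Intended notebook usage (a subset keeps the run time manageable):
# generator = load_falcon()
# all_topics = extract_topics(generator, negative_docs[:200])
```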

## 8.3 Generate Actionable Insights
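
A minimal sketch of building the second prompt from the comprehensive topic list. Removing duplicate and empty topics before joining is an assumption, made to keep the prompt short. The helper name is illustrative.

```python
from typing import Iterable

INSIGHT_PROMPT = (
    "For the following text topics obtained from negative customer reviews, "
    "can you give some actionable insights that would help this gym company?"
)


def build_insight_prompt(topics: Iterable[str]) -> str:
    """Prefix the de-duplicated topic list (original order kept) with the prompt."""
    unique = list(dict.fromkeys(t.strip() for t in topics if t.strip()))
    return INSIGHT_PROMPT + "\n\n" + "\n".join(unique)


# Intended notebook usage, where `generator` is a text-generation pipeline
# for tiiuae/falcon-7b-instruct and `all_topics` is the pooled topic list:
# result = generator(build_insight_prompt(all_topics),
#                    max_length=1000, return_full_text=False)
# print(result[0]["generated_text"])
```

Note that `max_length` counts the prompt tokens as well as the generated ones, so a very long topic list may need trimming, or `max_new_tokens` could be used instead.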

## 8.4 Compare Falcon-7b-Instruct Insights to Previous Analysis

After generating actionable insights using Falcon-7b-Instruct, compare these insights with earlier results from BERTopic and emotion analysis. Highlight any new or refined suggestions.


# 9. Topic Modelling with Gensim LDA

## 9.1 Preprocess Data for LDA

## 9.2 Apply LDA and Visualise Topics
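
A hedged sketch of the Gensim LDA step with 10 topics, as specified in the project guidance. The values `passes=10` and `random_state=42` are assumptions chosen for reproducibility. The brief does not name a visualisation library; `pyLDAvis` is assumed here because it produces an intertopic distance map alongside a bar chart of the most salient terms.

```python
from typing import List, Sequence


def fit_lda(token_lists: Sequence[List[str]], num_topics: int = 10, random_state: int = 42):
    """Fit a Gensim LDA model on tokenised reviews; returns (model, corpus, dictionary)."""
    from gensim import corpora, models  # imported lazily: optional dependency

    dictionary = corpora.Dictionary(token_lists)                      # token -> id map
    corpus = [dictionary.doc2bow(tokens) for tokens in token_lists]   # bag-of-words
    lda = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics,
                          random_state=random_state, passes=10)
    return lda, corpus, dictionary


def lda_visualisation(lda, corpus, dictionary):
    """Prepare the pyLDAvis panel for a fitted model."""
    import pyLDAvis.gensim_models as gensimvis  # assumed visualisation library

    return gensimvis.prepare(lda, corpus, dictionary)


# Intended notebook usage, with `negative_tokens` as a list of token lists:
# lda, corpus, dictionary = fit_lda(negative_tokens)
# print(lda.print_topics(num_words=5))
# import pyLDAvis
# pyLDAvis.display(lda_visualisation(lda, corpus, dictionary))
```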

## 9.3 Compare LDA Results with BERTopic

Explicitly mention how the LDA results compare to BERTopic, particularly if they highlight different or similar themes.

# 10. Conclusion

## 10.1 Summarise Insights

todo

## 10.2 Highlight Recommendations

todo

## 10.3 Reflection on NLP Techniques

Briefly reflect on the strengths and limitations of the NLP techniques (BERTopic, Gensim LDA, BERT, Falcon-7b) used in the project. This will showcase your ability to critically evaluate the methods.