<a href="https://www.kaggle.com/code/killianmcguinness/topic-modelling-airline-customer-reviews?scriptVersionId=166416017" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Topic Modeling using BERTopic to understand user comments

## Objective:
Using a dataset of user-generated comments about five prominent airlines, apply several topic modelling strategies to understand what users think about each provider. We use BERTopic for topic modelling and then Cohere to summarize the findings into a scorecard for each airline.

## About The Data:
This dataset comprises user-generated reviews for five prominent airlines: RyanAir, EasyJet, Singapore Air, Qatar & Emirates. The data was extracted from a comparison website where users share their experiences and rate service providers. The dataset spans a 12-month period from January 2023 to January 2024.

**Period:**
January 28, 2023, to January 28, 2024.

Note - You can visit my [GitHub](https://github.com/sneakykilli) to see how I compiled this dataset.


# Action Plan
| Step | Task                                   | Objective                                               | Details                                            |
|------|----------------------------------------|---------------------------------------------------------|----------------------------------------------------|
| 1    | Data Preprocessing                     | Ensure Dataframe is in the correct format for the Model.        | - Import Libraries. <br> - Examine data types, missing values, and basic statistics. <br> - Convert Date from strings to Datetime to ensure accurate time-based analysis. |
| 2    | BERTopic Experimentation               | Figure out the Best BERTopic Configuration.             | - Add an agnostic column that replaces references to a specific airline with a generic stand-in. <br> - Pre-calculate Embeddings & Save for better Runtime. <br> - Build 5 Different Pipelines, each with a Different Configuration. |
| 3    | Deep Dive into Clustering               | Fine-tune the Clustering Approach & Evaluate Results.       | - Visualize Topic Hierarchy, Add custom Labels. <br> - Visualize Top 10 Feedback Topics. <br> - Compare the Top 10 Topics per Airline to Generic Topics |
| 4    | Train Model on Airline Data               | Further Fine-Tune the Modelling Approach above & Train on Airline-specific Data.       | - Train on separate Airline Datasets, using Cohere as the Representation Model for EasyJet & Qatar. <br> - Save Models Locally & to Hugging Face Hub to Conserve Compute Power. <br> - Use Cohere API to Generate a Summary of all Representative Docs per Topic as well as an Intro Summary |



# Step 1: Data Preparation 🛠️ 
### Read in CSV data & make it easier to work with.<br>

**1. Import Key Libraries:**
- Import necessary libraries, there are quite a few! 

**2. Color Palette:**
- Import a Color Palette & Style Guide for Graphs. 

**3. Data Loading & Cleaning:**
- Read the dataset CSV into a pandas DataFrame.
- Give the columns meaningful names.
- The date column is a bit messy, so convert it from strings to datetime objects 📅

In [None]:
!pip install bertopic sentence-transformers cohere;

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
import datetime
import time

from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic
from umap import UMAP
from datasets import load_dataset
from bertopic.vectorizers import ClassTfidfTransformer
from hdbscan import HDBSCAN

import plotly.graph_objects as go
from plotly.subplots import make_subplots

# cohere.Client is used further down when summarizing representative docs
import cohere
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, Cohere

In [None]:
# Create color palette & font dicts; used later to keep graphs & plots looking consistent.
color_pal = sns.color_palette("tab20c")
color_brand = ['#ABE3C4', '#E7F7EE', '#5F8778', '#85B59E', '#E3ABCA', '#FFCCCC', '#F18C72', '#6BAED6', '#9ECAE1', '#D9D9D9']
plt.style.use('fivethirtyeight')

titles_dict = {'fontsize': 28,
 'fontweight': 25,
 'color':   color_brand[2]}

sub_title_dict = {'fontsize': 20,
 'fontweight': 18,
 'color':   color_brand[2]}

fig_text_dict = {
    'color':   color_brand[2], 
}

textprops={'color': color_brand[2], 'fontsize':8}

In [None]:
# Load Dataframe & Name cols. Create a var companies containing List of the Airlines. 
df = pd.read_csv('/kaggle/input/user-comments-travel-companies/airlines_12_months.csv')
df.columns = ['company', 'date', 'comment', 'star']
companies = df['company'].unique()

In [None]:
# Deal with Data format issues in Original DF

# Extract unique date values from the 'date' column
unique_date_values = df['date'].unique().tolist()

# Identify dates with 'ago' in them
ago_date_values = [item for item in unique_date_values if 'ago' in item]

# Fixed list of dates without 'ago'
ago_fixed_list = ['Jan 27, 2024', 'Jan 26, 2024', 'Jan 28, 2024', 'Jan 28, 2024', 'Jan 25, 2024', 'Jan 24, 2024',
                  'Jan 27, 2024', 'Jan 27, 2024', 'Jan 22, 2024', 'Jan 23, 2024', 'Jan 21, 2024', 'Jan 23, 2024',
                  'Jan 21, 2024', 'Jan 24, 2024', 'Jan 28, 2024', 'Jan 26, 2024', 'Jan 27, 2024', 'Jan 28, 2024',
                  'Jan 25, 2024']

# Create a dictionary mapping 'ago' dates to fixed dates
ago_fixed_dict = dict(zip(ago_date_values, ago_fixed_list))

# Identify dates with 'Updated' but without 'ago'
updated_date_values = [item for item in unique_date_values if 'Updated' in item and 'ago' not in item]

# Extract fixed dates from 'Updated' dates
fixed_updated_list = [item.split('Updated ')[1] for item in updated_date_values]

# Create a dictionary mapping 'Updated' dates to fixed dates
updated_fixed_dict = dict(zip(updated_date_values, fixed_updated_list))

# Update the 'ago' dictionary with the 'Updated' dictionary
ago_fixed_dict.update(updated_fixed_dict)

# Function to update values based on the created dictionary
def update_value_dict_map(my_value, my_dict):
    # Return the mapped value if there is one, otherwise keep the original value
    return my_dict.get(my_value, my_value)

# Apply the update function to the 'date' column to get 'updated_dates'
df['updated_dates'] = df['date'].apply(lambda x: update_value_dict_map(x, ago_fixed_dict))

# Convert 'updated_dates' to datetime format
df['updated_dates'] = pd.to_datetime(df['updated_dates'], format='%b %d, %Y')

# Save the updated DataFrame to a CSV file
df.to_csv('updated_airlines_12_months.csv', index=False)
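
The clean-up above maps each relative date (the values containing `ago`) to a hand-written list, which only fits the exact scrape it was written for. As a possible alternative, the sketch below resolves relative dates against a reference date. It rests on two assumptions: that the relative strings look like `3 days ago` or `5 hours ago`, and that the scrape date was January 28, 2024 (the end of the stated period). `normalize_date` and `SCRAPE_DATE` are hypothetical names, not part of the original notebook.

```python
import re
from datetime import datetime, timedelta

# Assumption: the date the reviews were scraped (end of the stated period).
SCRAPE_DATE = datetime(2024, 1, 28)

def normalize_date(raw, reference=SCRAPE_DATE):
    """Turn 'Updated Jan 5, 2024', '3 days ago' or 'Jan 5, 2024' into a datetime."""
    text = raw.strip()
    # Drop the 'Updated ' prefix if present
    if text.startswith('Updated '):
        text = text[len('Updated '):]
    # Assumed relative format: '<number> hour(s)/day(s) ago'
    match = re.fullmatch(r'(\d+) (hour|day)s? ago', text)
    if match:
        amount, unit = int(match.group(1)), match.group(2)
        delta = timedelta(hours=amount) if unit == 'hour' else timedelta(days=amount)
        resolved = reference - delta
        # Keep only the calendar date
        return resolved.replace(hour=0, minute=0, second=0, microsecond=0)
    # Absolute dates use the same format as the notebook: 'Jan 27, 2024'
    return datetime.strptime(text, '%b %d, %Y')
```

With pandas this could be applied as `df['date'].apply(normalize_date)`. Any string that matches neither pattern raises a `ValueError`, which makes unexpected formats visible instead of silently mis-dating them.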

# Step 2: BERTopic Model Testing 🔍 
### Objective: Using our entire dataset and the agnostic comments, try different BERTopic pipeline configurations to understand how the model performs on our data. <br>

**1. Make Comments Agnostic:**
- Add a column to our DataFrame that makes the comment text airline agnostic, i.e., replace easyjet, ryanair with killiair. 🔄

**2. Save Embeddings:**
- To avoid the process of vector embedding each time we train a model, we're going to save our embeddings and reuse them for each iteration. 🔄

**3. Test:**
- We're going to test 5 different iterations of BERTopic modeling approaches to understand which setup works best on our data.
    1. No fine-tuning at all
    2. Stop words Removed - Common words like "the" or "and" are often removed before topic modeling as they don't carry much meaning. 
    3. Seed Words - Pre-defined words or phrases used to guide topic modeling and improve the quality of topics identified.
    4. Guiding - Providing additional information or constraints to help the model find more accurate and relevant topics.
    5. Clustering using hdbscan_model - Grouping similar documents or data points together to identify distinct topics or themes.

**4. Visualize:**
- Using BERTopic's `visualize_topics` method, visualize the results. 📊

In [None]:
# Add a Col containing Comments but with no reference to Specific Airlines

# Define a list of airline names and their proxies
carrier_names_proxies = ['Ryanair', 'Easyjet', 'Singapore', 'Singapore Airlines', 'Singapore Air', 'Qatar', 'Qatar Air', 'Qatar Airlines', 'Emirates', 'Airline']

# Replace longer names first, so 'Singapore Airlines' becomes 'killiair' rather than 'killiair killiairs'
carrier_names_sorted = sorted(carrier_names_proxies, key=len, reverse=True)

# Function to replace airline names with a generic term
def agnostic_comment(comment):
    comment = comment.casefold()
    for word in carrier_names_sorted:
        comment = comment.replace(word.casefold(), 'killiair')
    return comment

# Apply the function to the 'comment' column to create a new 'comment_agnostic' column
df['comment_agnostic'] = df['comment'].apply(agnostic_comment)
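
The replacement step can also be expressed with a single regular expression. The sketch below is one way to do it, not the notebook's method: `carrier_names` is a shortened, hypothetical version of the name list, and `make_agnostic` is a made-up helper name. The names are sorted longest first because regex alternation takes the first alternative that matches at a given position.

```python
import re

# Hypothetical subset of the airline names used in the notebook
carrier_names = ['Ryanair', 'Easyjet', 'Singapore Airlines', 'Singapore', 'Qatar', 'Emirates']

# Longest names first, so 'Singapore Airlines' wins over 'Singapore'
pattern = re.compile(
    '|'.join(re.escape(name) for name in sorted(carrier_names, key=len, reverse=True)),
    flags=re.IGNORECASE,
)

def make_agnostic(comment, stand_in='killiair'):
    """Replace every airline name with the stand-in and lower-case the comment."""
    return pattern.sub(stand_in, comment).casefold()

example = make_agnostic('Flew Singapore Airlines, then RYANAIR home.')
print(example)
```

Compiling the pattern once keeps the per-comment cost low when it is applied to a whole column.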

In [None]:
# Save embeddings of the original comments in a DataFrame
# NB UNCOMMENT TO RUN !!!!

# from sentence_transformers import SentenceTransformer

# sentence_model = SentenceTransformer("all-MiniLM-L6-v2")

# embeddings_agnostic = sentence_model.encode(df['comment_agnostic'], show_progress_bar=True)

# embeddings_specific = sentence_model.encode(df['comment'], show_progress_bar=True)

# np.save('/kaggle/working/embeddings_agnostic.npy', embeddings_agnostic)
# np.save('/kaggle/working/embeddings_specific.npy', embeddings_specific)
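
The save-once, reuse-later pattern above can be wrapped in a small helper so the same cell works whether or not the cache exists. This is a minimal sketch: `load_or_compute_embeddings` is a hypothetical helper, and `fake_encode` is a stand-in for `SentenceTransformer("all-MiniLM-L6-v2").encode` so the example runs without downloading a model.

```python
import os
import tempfile
import numpy as np

def load_or_compute_embeddings(path, docs, encode_fn):
    """Return cached embeddings from `path`, or compute them and cache the result."""
    if os.path.exists(path):
        return np.load(path)
    embeddings = np.asarray(encode_fn(docs))
    np.save(path, embeddings)
    return embeddings

# Stand-in encoder for illustration only (not a real embedding model)
def fake_encode(docs):
    return [[float(len(doc)), 1.0] for doc in docs]

docs = ['good flight', 'lost bag']
cache_path = os.path.join(tempfile.mkdtemp(), 'embeddings_demo.npy')

first = load_or_compute_embeddings(cache_path, docs, fake_encode)   # computes and saves
second = load_or_compute_embeddings(cache_path, docs, fake_encode)  # loads from disk
```

Note that this simple cache does not notice when the documents change, so the file should be deleted whenever the comments or the embedding model are updated.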

In [None]:
# Aim: Read in the saved embeddings & get them in a format that BERTopic can use for training.

# embeddings associated with the agnostic version of comments (i.e. replacing any company-specific mention with killiair)
embeddings_agnostic = np.load('/kaggle/input/embeddings-airlines/embeddings_agnostic.npy', allow_pickle=True)
df['embeddings_agnostic'] = list(embeddings_agnostic)

# We will use embeddings for specific companies a little later!
embeddings_specific = np.load('/kaggle/input/embeddings-airlines/embeddings_specific.npy', allow_pickle=True)

ryanair_loc_ = df.loc[df['company'] == 'www.ryanair.com'].index
easy_loc_ = df.loc[df['company'] == 'www.easyjet.com'].index
sing_loc_ = df.loc[df['company'] == 'www.singaporeair.com'].index
qatar_loc_ = df.loc[df['company'] == 'www.qatarairways.com'].index
emi_loc_ = df.loc[df['company'] == 'www.emirates.com'].index

# Select rows by position; unlike min/max slicing, this also works if a company's rows are not contiguous.
# Assumes df still has the default RangeIndex from read_csv, so index labels equal row positions.
emb_ryan = embeddings_specific[ryanair_loc_.to_numpy()]
emb_easy = embeddings_specific[easy_loc_.to_numpy()]
emb_sing = embeddings_specific[sing_loc_.to_numpy()]
emb_qatar = embeddings_specific[qatar_loc_.to_numpy()]
emb_emi = embeddings_specific[emi_loc_.to_numpy()]

In [None]:
# [Pipeline 1] Train model with no fine tuning

# Define the embedding model to be used
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"

# Initialize BERTopic model with the specified embedding model
topic_model_a = BERTopic(embedding_model=embedding_model)

# Fit the BERTopic model to the agnostic comments and their embeddings
topics_a, probs_a = topic_model_a.fit_transform(df['comment_agnostic'], embeddings_agnostic)

In [None]:
# [Pipeline 2] Train model on comments with stop words removed

vectorizer_model = CountVectorizer(stop_words="english")
topic_model_b = BERTopic(
                        vectorizer_model=vectorizer_model, 
                        embedding_model = embedding_model
                        )
topics_b, probs_b = topic_model_b.fit_transform(df['comment_agnostic'], embeddings_agnostic)

In [None]:
# [Pipeline 3] Train a model using seed words

# Define a list of seed words to use for topic modeling
seed_words = [
    'reservation', 'booking', 'online booking', 'ticket purchase', 'booking system',
    'service quality', 'customer support', 'assistance', 'helpdesk', 'service experience',
    'flight delays', 'arrival delays', 'departure delays', 'delayed flights', 'schedule disruptions',
    'flight cancellations', 'canceled flights', 'cancellation policy', 'itinerary changes', 'canceled services',
    'lost baggage', 'missing luggage', 'baggage claim', 'lost items', 'luggage recovery',
    'check-in experience', 'check-in process', 'online check-in', 'check-in counter', 'boarding pass',
    'entertainment options', 'in-flight movies', 'onboard entertainment', 'streaming services', 'in-flight media',
    'flight attendants', 'cabin crew', 'ground staff', 'crew behavior', 'staff professionalism',
    'onboard meals', 'food quality', 'meal options', 'catering service', 'in-flight dining'
]

# Initialize a ClassTfidfTransformer with the defined seed words
ctfidf_model = ClassTfidfTransformer(
    seed_words=seed_words, 
    seed_multiplier=2,
)

# Initialize a BERTopic model with the defined ctfidf_model, vectorizer_model, and embedding_model
topic_model_c = BERTopic(
    ctfidf_model=ctfidf_model,
    vectorizer_model=vectorizer_model, 
    embedding_model=embedding_model,
)

# Fit the BERTopic model to the agnostic comments and their embeddings
topics_c, probs_c = topic_model_c.fit_transform(df['comment_agnostic'], embeddings_agnostic)
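
To give an intuition for what the seed words do, here is a toy, pure-Python version of class-based TF-IDF with a seed boost. It is a simplified illustration under stated assumptions, not BERTopic's implementation: each topic is treated as one concatenated document, a word's weight is its in-topic frequency times `log(1 + A / f)` (with `A` the average number of words per topic and `f` the word's frequency across all topics), and the IDF part is multiplied by `seed_multiplier` for seed words. The function name `ctfidf_scores` and the sample texts are made up.

```python
import math
from collections import Counter

def ctfidf_scores(class_docs, seed_words=(), seed_multiplier=2):
    """Toy c-TF-IDF: one concatenated document per topic, optional seed-word boost."""
    counts = {topic: Counter(text.split()) for topic, text in class_docs.items()}

    # Frequency of each word across all topics
    total_per_word = Counter()
    for counter in counts.values():
        total_per_word.update(counter)
    avg_words_per_class = sum(total_per_word.values()) / len(counts)

    scores = {}
    for topic, counter in counts.items():
        n_words = sum(counter.values())
        scores[topic] = {}
        for word, freq in counter.items():
            idf = math.log(1 + avg_words_per_class / total_per_word[word])
            if word in seed_words:
                idf *= seed_multiplier  # boost words we care about
            scores[topic][word] = (freq / n_words) * idf
    return scores

topics = {
    0: 'baggage lost baggage claim slow',
    1: 'crew friendly crew helpful food',
}
plain = ctfidf_scores(topics)
seeded = ctfidf_scores(topics, seed_words={'baggage'}, seed_multiplier=2)
```

In this toy setup the score of `baggage` doubles while all other words are unchanged, which is how a seed word gets pushed up a topic's representation. A seed entry can only be boosted if it appears as a term in the vocabulary, so with a unigram vectorizer multi-word phrases such as `online booking` would likely not match.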

In [None]:
# Aim: [Pipeline 4] Train a model using seed word guiding

# Define a list of lists of seed words to use for topic modeling
seed_words_list = [
    ['reservation', 'booking', 'online booking', 'ticket purchase', 'booking system'],
    ['service quality', 'customer support', 'assistance', 'helpdesk', 'service experience'],
    ['flight delays', 'arrival delays', 'departure delays', 'delayed flights', 'schedule disruptions'],
    ['flight cancellations', 'canceled flights', 'cancellation policy', 'itinerary changes', 'canceled services'],
    ['lost baggage', 'missing luggage', 'baggage claim', 'lost items', 'luggage recovery'],
    ['check-in experience', 'check-in process', 'online check-in', 'check-in counter', 'boarding pass'],
    ['entertainment options', 'in-flight movies', 'onboard entertainment', 'streaming services', 'in-flight media'],
    ['flight attendants', 'cabin crew', 'ground staff', 'crew behavior', 'staff professionalism'],
    ['onboard meals', 'food quality', 'meal options', 'catering service', 'in-flight dining']
]

# Initialize a BERTopic model with the defined seed_topic_list, vectorizer_model, and calculate_probabilities
topic_model_d = BERTopic(
    seed_topic_list=seed_words_list,
    vectorizer_model=vectorizer_model,
    calculate_probabilities=True,
)

# Fit the BERTopic model to the agnostic comments and their embeddings
topics_d, probs_d = topic_model_d.fit_transform(df['comment_agnostic'], embeddings_agnostic)

In [None]:
# Aim: [Pipeline 5] Build a model using clustering

# Initialize an HDBSCAN model with specified parameters
hdbscan_model = HDBSCAN(
    min_cluster_size=10, 
    metric='euclidean', 
    cluster_selection_method='eom', 
    prediction_data=True
)

# Initialize a BERTopic model with the defined hdbscan_model, vectorizer_model, and embedding_model
topic_model_e = BERTopic(
    hdbscan_model=hdbscan_model, 
    vectorizer_model=vectorizer_model,
    embedding_model=embedding_model
)

# Fit the BERTopic model to the agnostic comments and their embeddings
topics_e, probs_e = topic_model_e.fit_transform(df['comment_agnostic'], embeddings_agnostic)

In [None]:
# Visualize the models trained above

# Visualize the topics for each model
visualise_a = topic_model_a.visualize_topics()
visualise_b = topic_model_b.visualize_topics()
visualise_c = topic_model_c.visualize_topics()
visualise_d = topic_model_d.visualize_topics()
visualise_e = topic_model_e.visualize_topics()

# Create a subplot with 3 rows and 2 columns
fig = make_subplots(rows=3, cols=2, subplot_titles=['No Fine Tuning', 'No Stopwords', 'Seeds', 'Guiding', 'Clustering'])

# Add traces for each visualization to the subplot
for trace in visualise_a['data']:
    fig.add_trace(trace, row=1, col=1)
    
for trace in visualise_b['data']:
    fig.add_trace(trace, row=1, col=2)
    
for trace in visualise_c['data']:
    fig.add_trace(trace, row=2, col=1)
    
for trace in visualise_d['data']:
    fig.add_trace(trace, row=2, col=2)
    
for trace in visualise_e['data']:
    fig.add_trace(trace, row=3, col=1)
    
# Update the layout of the subplot
fig.update_layout(title_text='Fine Tuning BERTopic Topic Modelling', title_x=0.5)
    
# Show the figure
fig.show()

# Step 3: Deepdive into Clustering 🔬
### Clustering approach seems to yield good results, let's apply some post-training fine-tuning & see what findings we can extrapolate. <br>

**1. View the Cluster Hierarchy:**
- Get a big picture view of each of the topics our model has detected. 🌐

**2. Apply Custom Labels & Visualise:**
- BERTopic applied topic representations which describe what the topic is about. We want to apply some custom labels to make it easier to read.
- We want to reduce the number of topics from 50+ to about 10 & visualize the distribution in a simple pie chart. 📊

**3. Compare Topic Breakdown across Airlines:**
- At this point, we can compare the distribution of representative topics between specific airlines. We'll build a stacked bar chart that compares the top 10 topic reasons and their relative distribution among the different airlines. 📊

In [None]:
# Visualize the hierarchical clustering

# Visualize the hierarchical clustering for the clustering model
topic_model_e.visualize_hierarchy(title='Clustering')

In [None]:
# Apply custom labels to the topics

# Define a dictionary with custom labels for the topics
my_custom_labels = {
    0: 'Baggage Policy', 
    1: 'Seat Allocation',
    2: 'Refund request (Delay/Cancel)', 
    3: 'Bad Experience', 
    4: 'Poor Customer Service', 
    5: 'Great Experience', 
    6: 'Ticket Change Policy', 
    7: 'Customer Service Bot', 
    8: 'Suing / Legal Action', 
    9: 'Good Customer Service'
}

topic_model_e.set_topic_labels(my_custom_labels)

In [None]:
# Reduce the number of topics to 11 and visualize the top 10 feedback for airlines

topic_model_e.reduce_topics(df['comment_agnostic'], nr_topics=11)
topic_model_e.set_topic_labels(my_custom_labels)

# Get information about the clusters
clusters_info = topic_model_e.get_topic_info()

# Drop the outlier cluster (topic -1), if present
clusters_info = clusters_info[clusters_info['Topic'] != -1]

# Extract labels and sizes for the pie chart.
# 'CustomName' is available because set_topic_labels was called above.
labels = clusters_info['CustomName']
sizes = clusters_info['Count']

# Create a pie chart of the top 10 feedback for airlines
plt.figure(figsize=(8, 8))
plt.pie(sizes, autopct='', startangle=140, colors=color_brand)
plt.legend(labels, loc='center left', bbox_to_anchor=(1, 0.5), title='Cluster Names')
plt.axis('equal') 
plt.title('Top 10 Feedback Topics for Airlines Generally', fontdict=titles_dict)
plt.show()

In [None]:
# Build stacked bars to compare feedback between different companies

# Reduce the number of topics to 11
topic_model_e.reduce_topics(df['comment_agnostic'], nr_topics=11)

# Get information about the clusters
info_clustering = topic_model_e.get_document_info(df['comment_agnostic'])

# Add a column to the dataframe to store the cluster labels
df['clustering_reps'] = info_clustering['CustomName']

# Get the count of documents for each cluster
df_counts_all = df.groupby('clustering_reps').size().reset_index(name='count')
df_counts_all = df_counts_all.drop(df_counts_all.index[0])
df_counts_all = df_counts_all.sort_values(by='clustering_reps')
df_counts_all['%'] = (df_counts_all['count'] / df_counts_all['count'].sum()) * 100

# Initialize lists to store data for the stacked bars
stacked_data = {}
airlines = (
    "All Airlines", 
    "RyanAir", 
    "EasyJet",
    "Singapore Air", 
    "Qatar Air", 
    "Emirates"
)

# Get the percentage of documents for each cluster for each airline
for company in companies:
    df_ = df[df['company'] == company]
    df_ = df_.groupby('clustering_reps').size().reset_index(name='count')
    df_ = df_.drop(df_.index[0])
    df_ = df_.sort_values(by='clustering_reps')
    df_['%'] = (df_['count'] / df_['count'].sum()) * 100
    df_counts_all[company] = df_['%']

# Get the category names for the stacked bars
clustering_reps = df_counts_all['clustering_reps'].tolist()

# Create the stacked bars
fig, ax = plt.subplots(figsize=(12, 8))
bottom = np.zeros(6)
width = 0.5

for index, rep in enumerate(clustering_reps):
    data_point = np.array(df_counts_all.iloc[index].tolist()[2:])
    p = ax.bar(airlines, data_point, width, label=rep, bottom=bottom, color=color_brand[index % len(color_brand)])
    bottom += data_point

ax.set_title("\n % Share of Each Feedback Category Per Airline \n", fontdict=titles_dict)
ax.legend(loc='upper left', bbox_to_anchor=(1, 1), bbox_transform=fig.transFigure, title='\n Feedback Category \n')

plt.show()
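
The per-airline percentages can also be computed in one step with `pd.crosstab`, which avoids relying on row order when a topic is missing for an airline. The sketch below uses made-up toy data; the column names mirror the notebook, and the assumption that the outlier label starts with `-1` is mine.

```python
import pandas as pd

# Toy data: column names follow the notebook, values are invented
toy = pd.DataFrame({
    'company': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'clustering_reps': ['Baggage', 'Refund', 'Baggage', 'Refund', 'Refund', 'Seats', '-1_outliers'],
})

# Drop outliers (assumed to be labelled with a leading '-1')
kept = toy[~toy['clustering_reps'].str.startswith('-1')]

# Share of each topic per company; every row sums to 100
share = pd.crosstab(kept['company'], kept['clustering_reps'], normalize='index') * 100

# A topic that a company never mentions appears as 0 rather than a misaligned value
share_b_baggage = share.loc['B', 'Baggage']
```

The resulting table has one row per company and one column per topic, so `share.plot(kind='bar', stacked=True)` would produce a stacked bar chart directly.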


# Step 4: Apply Model to Airline Specific Datasets 📊
#### Pipeline 5 (clustering) works well on our generic dataset. We'll use this approach to train models on airline-specific datasets, with some additional fine-tuning to account for the smaller corpus size. <br>

**1. Train on Airline Specific Data:**
- Using what we learned above, we'll deploy the model to our airline-specific datasets. We'll take some additional steps to fine-tune by reducing the minimum cluster size to account for the reduced corpus size.
- For two of our datasets, we'll get the topic representation using Cohere, an external LLM Tool (Note: we've limited to two datasets as I have some API Limitations :D)

**2. Save the Model Results Locally & to Hugging Face Hub:**
- Save model results locally to avoid retraining & hitting API limits.
- Save the models to the Hugging Face Hub to use in the future with expanded datasets.

**3. Build Text Insights Using Cohere & the Model Results:**
- Build a function that takes in the most representative docs per topic and returns a summary of the comments.
- Build a function that reviews the most representative docs and suggests a topic name.
- Build a function that creates a stacked bar chart summarizing the most common topic per month. 📊

In [None]:
# Train Model on General Approach of Pipeline 5 but reducing Min Cluster Size
# NB Uncomment to Run! If using Cohere API Limit may apply 

# Import necessary libraries
# from bertopic.representation import KeyBERTInspired
# from bertopic.representation import MaximalMarginalRelevance
# from sklearn.feature_extraction.text import CountVectorizer
# from hdbscan import HDBSCAN

# # Install Cohere
# ! pip install cohere;

# import cohere
# from bertopic.representation import Cohere

# # Get comments for specific airlines
# doc_ryan = df['comment'][df['company'] == 'www.ryanair.com']
# doc_easy = df['comment'][df['company'] == 'www.easyjet.com']
# doc_sing = df['comment'][df['company'] == 'www.singaporeair.com']
# doc_qatar = df['comment'][df['company'] == 'www.qatarairways.com']
# doc_emi = df['comment'][df['company'] == 'www.emirates.com']

# # Initialize representation models
# representation_model = KeyBERTInspired()

# my_key = [ADD_YOUR_APIKEY]
# co = cohere.Client(my_key)
# representation_model_cohere = Cohere(co, delay_in_seconds=12, model='command')
# representation_model_bert = KeyBERTInspired()
# representation_model_maximal = MaximalMarginalRelevance(diversity=0.3)

# # Initialize vectorizer model
# vectorizer_model = CountVectorizer(stop_words="english")

# # Initialize HDBSCAN model
# hdbscan_model = HDBSCAN(min_cluster_size=5, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

# # Train the Ryanair model
# ryan_air_model = BERTopic(
#     min_topic_size = 5,
#     hdbscan_model=hdbscan_model, 
#     vectorizer_model=vectorizer_model, 
#     embedding_model="all-MiniLM-L6-v2",
#     representation_model=representation_model_bert)

# topics_ryan, probs_ryan = ryan_air_model.fit_transform(doc_ryan, emb_ryan)

# # Train the Easyjet model
# easyjet_model = BERTopic(
#     min_topic_size = 5,
#     hdbscan_model=hdbscan_model, 
#     vectorizer_model=vectorizer_model, 
#     embedding_model="all-MiniLM-L6-v2", 
#     representation_model=representation_model_cohere)

# topics_easy, probs_easy = easyjet_model.fit_transform(doc_easy, emb_easy)

# # Train the Singapore model
# singapore_model = BERTopic(
#     min_topic_size = 5, 
#     hdbscan_model=hdbscan_model, 
#     vectorizer_model=vectorizer_model, 
#     embedding_model="all-MiniLM-L6-v2", 
#     representation_model=representation_model_bert)

# topics_sing, probs_sing = singapore_model.fit_transform(doc_sing, emb_sing)

# # Train the Qatar model
# qatar_model = BERTopic(
#     min_topic_size = 5, 
#     hdbscan_model=hdbscan_model, 
#     vectorizer_model=vectorizer_model, 
#     embedding_model="all-MiniLM-L6-v2", 
#     representation_model=representation_model_cohere)

# topics_qatar, probs_qatar = qatar_model.fit_transform(doc_qatar, emb_qatar)

# # Train the Emirates model
# emirates_model = BERTopic(
#     min_topic_size = 5,
#     hdbscan_model=hdbscan_model, 
#     vectorizer_model=vectorizer_model, 
#     embedding_model="all-MiniLM-L6-v2", 
#     representation_model=representation_model_maximal)

# topics_emi, probs_emi = emirates_model.fit_transform(doc_emi, emb_emi)

In [None]:
# # Create & save dataframes from the different models above, because we have API limits

# # Define the columns to add to the dataframes
# cols_to_add = ['Topic', 'Name', 'Representation', 'Representative_Docs', 'Top_n_words']

# # Define the models and documents for each company
# model_holder = [ryan_air_model, easyjet_model, singapore_model, qatar_model, emirates_model]
# doc_holder = [doc_ryan, doc_easy, doc_sing, doc_qatar, doc_emi]

# # Loop through each company and create a dataframe with the results
# for index, company in enumerate(companies):
#     # Get the documents for the current company
#     df_ = df[df['company'] == company]
#     df_.reset_index(drop=True, inplace=True)
    
#     # Get the results for the current company
#     results_ = model_holder[index].get_document_info(doc_holder[index])[cols_to_add]
    
#     # Add the results to the dataframe
#     df_[cols_to_add] = results_[cols_to_add]
    
#     # Save the dataframe to a csv file
#     df_.to_csv(f'/kaggle/working/df_results_{company}.csv', index=False)

In [None]:
## Save Models to Hugging Face Hub
from huggingface_hub import login
login()

In [None]:
# ryan_air_model.push_to_hf_hub(
#     repo_id="sneakykilli/Ryanair_BERTopic",
#     save_ctfidf=True
# )

# easyjet_model.push_to_hf_hub(
#     repo_id="sneakykilli/Easyjet_BERTopic",
#     save_ctfidf=True
# )

# singapore_model.push_to_hf_hub(
#     repo_id="sneakykilli/Singapore_BERTopic",
#     save_ctfidf=True
# )

# qatar_model.push_to_hf_hub(
#     repo_id="sneakykilli/Qatar_BERTopic",
#     save_ctfidf=True
# )

# emirates_model.push_to_hf_hub(
#     repo_id="sneakykilli/Emirates_BERTopic",
#     save_ctfidf=True
# )

In [None]:
# Aim: Read in dataframes of results of clustering per airlines
df_ryan = pd.read_csv('/kaggle/input/topic-modelling-outputs/df_results_www.ryanair.com.csv')
df_easy = pd.read_csv('/kaggle/input/topic-modelling-outputs/df_results_www.easyjet.com.csv')
df_singapore = pd.read_csv('/kaggle/input/topic-modelling-outputs/df_results_www.singaporeair.com.csv')
df_qatar = pd.read_csv('/kaggle/input/topic-modelling-outputs/df_results_www.qatarairways.com.csv')
df_emirates = pd.read_csv('/kaggle/input/topic-modelling-outputs/df_results_www.emirates.com.csv')

In [None]:
# Aim: Create a function that takes a dataframe as an input and summarizes the representative docs

def generate_document_summaries(df):
    # Initialize an empty list to store the summaries
    holder = []
    
    # Drop duplicates from the dataframe based on the 'Representative_Docs' column
    unique_df = df.drop_duplicates(subset=['Representative_Docs'])
    
    # Extract the unique representative docs as a list
    example_docs = unique_df['Representative_Docs'].tolist()
    
    # Initialize a Cohere client; `my_api_key` must be defined before calling this function
    co = cohere.Client(my_api_key)

    # Loop through each representative doc and summarize it
    for item in example_docs:
        try:
            # Use the Cohere API to generate a summary
            response = co.summarize(text=item, length='medium', extractiveness='medium', temperature=5, additional_command="focusing on what can be improved, end the summary with a Next Action suggestion")
            
            # Pause for 10 seconds between requests to avoid rate limiting
            time.sleep(10)
            
            # Append the summary to the holder list
            holder.append(response)
        except Exception as e:
            # Print any errors that occur during processing
            print(f"Error processing document: {e}")
            continue
    
    # Return the list of summaries
    return holder
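
The fixed `time.sleep(10)` above keeps requests under the rate limit but drops any document whose call fails. A possible refinement is to retry failed calls with a growing pause. The sketch below is generic and does not call Cohere: `call_with_pause` is a hypothetical helper, and `flaky_summarize` is a fake function that fails once to show the retry path. The `sleep` argument is injectable so the example runs instantly.

```python
import time

def call_with_pause(func, items, pause_seconds=10, max_retries=2, sleep=time.sleep):
    """Call `func` on each item, pausing between calls and retrying failures."""
    results = []
    for item in items:
        for attempt in range(max_retries + 1):
            try:
                results.append(func(item))
                break
            except Exception as error:
                if attempt == max_retries:
                    print(f"Giving up on item after {attempt + 1} attempts: {error}")
                else:
                    # Wait longer after each failed attempt
                    sleep(pause_seconds * (attempt + 1))
        # Regular pause between items to stay under the rate limit
        sleep(pause_seconds)
    return results

# Fake API call that fails on its first invocation only
calls = {'count': 0}
def flaky_summarize(text):
    calls['count'] += 1
    if calls['count'] == 1:
        raise RuntimeError('rate limited')
    return text.upper()

summaries = call_with_pause(flaky_summarize, ['late flight', 'lost bag'], sleep=lambda s: None)
```

In the notebook, `func` could be a small wrapper around the summarize call, with the default `time.sleep` left in place.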

In [None]:
# # Create summaries of all representative docs and save to a dataframe so we don't re-run
# # NB Uncomment to Run, Update API Key & Beware of API Limits!

# # Generate summaries for the representative documents of each airline
# ryanair_summaries = generate_document_summaries(df_ryan)
# easyjet_summaries = generate_document_summaries(df_easy)
# qatarairways_summaries = generate_document_summaries(df_qatar)
# singaporeair_summaries = generate_document_summaries(df_singapore)

# # Extract the summaries from each response object
# ryanair_summaries = [item.summary for item in ryanair_summaries]
# easyjet_summaries = [item.summary for item in easyjet_summaries]
# qatarairways_summaries = [item.summary for item in qatarairways_summaries]
# singaporeair_summaries = [item.summary for item in singaporeair_summaries]

# # Create a list of company names for each airline
# company_names = [['ryanair' for _ in ryanair_summaries], ['easyjet' for _ in easyjet_summaries], ['qatar' for _ in qatarairways_summaries], ['singapore' for _ in singaporeair_summaries]]

# # Flatten the company names list
# company_names = [name for sublist in company_names for name in sublist]

# # Combine the summaries and company names into a dictionary
# data_dict = {
#     'company': company_names,
#     'summary': ryanair_summaries + easyjet_summaries + qatarairways_summaries + singaporeair_summaries
# }

# # Create a dataframe from the dictionary and save it to a CSV file
# df_summaries = pd.DataFrame(data_dict)
# df_summaries.to_csv('/kaggle/working/df_summaries.csv', index=False)

In [None]:
# Create a function that will accept the summaries of representative documents and create a single introduction summary for the company

def generate_summary_prompt(df, company_name):
    # Get the summaries of representative documents from the dataframe
    topic_representation = df['summary'].tolist()
    
    # Initialize an empty string to hold the text of the prompt
    text_holder = ''
    
    # Iterate through the summaries, adding each one to the text holder
    for index, topic in enumerate(topic_representation):
        prompt_middle = f"""\nDocument[{index}] : {topic}\n"""
        text_holder += prompt_middle

    # Create the beginning and end of the prompt.
    # Use the actual number of documents (the loop index is zero-based).
    prompt_beginning = f"""
Below [DOCS] delimited by /// are {len(topic_representation)} Documents, each document is labeled as "Document[Number]".
Each Document is a user review of {company_name} the airline. Based on the Documents, write a Summary of {company_name}.
The Summary should be no more than 5 Sentences long.
\n[DOCS] /// """

    prompt_end = f""" ///
The tag line should not exceed 10 words. Each Tag line should be similar in Tone and format. 
"""

    # Combine the beginning, middle, and end of the prompt
    prompt = prompt_beginning + text_holder + prompt_end

    return prompt

In [None]:
# # Generate summaries for each company based on the representative document summaries
# # NB Uncomment to Run, Update API Key & Beware of API Limits!

# # Read in the dataframe containing all summaries
# df_all_summaries = pd.read_csv('/kaggle/input/topic-modelling-outputs/df_summaries.csv')

# # Filter the dataframe for each company
# sum_ryan = df_all_summaries[df_all_summaries['company'] == 'ryanair']
# sum_easy = df_all_summaries[df_all_summaries['company'] == 'easyjet']
# sum_sing = df_all_summaries[df_all_summaries['company'] == 'singapore']
# sum_qatar = df_all_summaries[df_all_summaries['company'] == 'qatar']

# # Create a dictionary to hold the dataframes for each company
# run_cal = {'ryanair':sum_ryan, "easyjet": sum_easy, 'singapore':sum_sing, 'qatar': sum_qatar}

# # Define the API key for Cohere
# my_api_key = 'YOUR_API_KEY'  # Replace with your own Cohere API key

# # Initialize a Cohere client
# co = cohere.Client(my_api_key)

# # Initialize a dictionary to hold the summaries for each company
# summary_holder = {}

# # Loop through each company and generate a summary (reusing the client created above)
# for key, value in run_cal.items():
    
#     # Generate a summary prompt for the company
#     prompt = generate_summary_prompt(value, key)

#     # Generate a summary using Cohere
#     response_summary = co.generate(
#         model='command-light',
#         prompt=prompt,
#         max_tokens=200,
#         temperature=0.9,
#         truncate='END')
    
#     # Add the summary to the summary holder dictionary
#     summary_holder[key] = response_summary.generations[0].text

# # Create a dataframe from the summary holder
# data = {
#     'company' : ['ryanair', 'easyjet', 'singapore', 'qatar'], 
#     'company_summary' : [summary_holder['ryanair'], summary_holder['easyjet'], summary_holder['singapore'], summary_holder['qatar']]
# }
# df_findings = pd.DataFrame(data)

# # Save the dataframe to a csv file
# df_findings.to_csv('/kaggle/working/df_findings.csv')
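
Since the cell above warns about API limits, one possible way to make the loop more tolerant of rate-limit errors is to retry each request with an increasing delay. This is a rough sketch, not part of the original pipeline: `call_with_backoff` and `flaky_request` are hypothetical names, the request is a stand-in rather than a real Cohere call, and catching every `Exception` is a simplification.

```python
import time


def call_with_backoff(call, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Run call(); on failure wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            # Give up after the final attempt and surface the original error
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Stand-in for an API request: fails twice, then succeeds (hypothetical)
attempts = {"count": 0}


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("rate limit reached")
    return "summary text"


# Record the waits instead of sleeping so the example runs instantly
waits = []
result = call_with_backoff(flaky_request, sleep=waits.append)
print(result, waits)
```

In the loop above, the request could be wrapped as `call_with_backoff(lambda: co.generate(...))`, leaving the default `sleep` in place so the delays are real.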

In [None]:
# Create a function to create a stacked bar chart summarizing representations per month

def plot_stacked_bar_with_line(df, company):
    # Convert the 'updated_dates' column to datetime
    df['updated_dates'] = pd.to_datetime(df['updated_dates'])
    
    # Extract the month and year from the 'updated_dates' column
    df['Month_Year'] = df['updated_dates'].dt.to_period('M')

    # Get the 11 most frequent values in the 'Name' column
    top_10_values = df['Name'].value_counts().head(11).index.tolist()

    # Return the label part of a top-ranked topic name, or "other" for everything else
    def check_item(value):
        if value in top_10_values:
            return value.split("_")[1]
        else:
            return "other"

    # Apply the check_item function to create a new column 'Top10'
    df['Top10'] = df['Name'].apply(check_item)

    # Group the data by 'Month_Year' and 'Top10' and get the count of each group
    monthly_counts = df.groupby(['Month_Year', 'Top10']).size().reset_index(name='Count')
    
    # Group the data by 'Month_Year' and keep the 11 largest categories for each month
    top_10_per_month_year = monthly_counts.groupby('Month_Year').apply(lambda x: x.nlargest(11, 'Count')).reset_index(drop=True)
    
    # Pivot the table to get the 'Top10' values as columns
    pivot_table = top_10_per_month_year.pivot(index='Month_Year', columns='Top10', values='Count').fillna(0)
    
    # Get the line data for the average star number per month
    line_data = df.groupby('Month_Year')['star'].mean()
    
    # Create the figure and axis
    fig, ax = plt.subplots(figsize=(10, 6))
    
    # Plot the stacked bar chart
    pivot_table.plot(kind='bar', stacked=True, ax=ax, color=color_pal)
    
    # Create a second y-axis for the line plot
    ax2 = ax.twinx()
    
    # Plot the line chart for the average star number
    ax2.plot(line_data.index.astype(str), line_data.values, color='red', marker='o', linestyle='-', linewidth=2, label='Average Star Number')
    
    # Set the ylabel for the second y-axis
    ax2.set_ylabel('Average Star Number')
    
    # Set the legend for the first y-axis
    ax.legend(loc='upper left', bbox_to_anchor=(1, 1), bbox_transform=fig.transFigure, title='\n Feedback Category \n')
    
    # Set the legend for the second y-axis
    ax2.legend(loc='upper right')
    
    # Set the title and labels for the axes
    plt.title(f'{company.capitalize()} Monthly User Feedback', fontdict=titles_dict)
    ax.set_xlabel('Month_Year')
    ax.set_ylabel('Count')

    # Show the plot
    plt.show()
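
The bucketing step inside the plotting function (keep the most frequent topic names, fold the rest into "other") can be illustrated in plain Python. This is a minimal sketch under the assumption that topic names follow a `<topic id>_<label>` pattern, which is what `split("_")[1]` expects; the names below are invented.

```python
from collections import Counter

# Toy topic names in the assumed "<topic id>_<label>" form (invented)
names = ["0_delays", "0_delays", "0_delays", "1_refunds", "1_refunds", "2_seats"]

top_n = 2  # the notebook keeps 11; a smaller value keeps the toy example readable
top_names = [name for name, _ in Counter(names).most_common(top_n)]


def bucket(name):
    # Keep the label of a top-ranked name, otherwise fall back to "other"
    return name.split("_")[1] if name in top_names else "other"


bucket_counts = Counter(bucket(name) for name in names)
print(bucket_counts)
```

Note that `split("_")[1]` keeps only the text between the first and second underscore, so a name such as `0_lost_luggage` would be shortened to `lost`.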

In [None]:
# AIM: Read in the DataFrames holding the clustering results for each airline

# Read in the DataFrames
df_summaries = pd.read_csv('/kaggle/input/topic-modelling-outputs/df_findings.csv')
df_representations = pd.read_csv('/kaggle/input/topic-modelling-outputs/df_summaries.csv')
df_clusters = pd.read_csv('/kaggle/input/topic-modelling-outputs/clustering_findings.csv')

# Get the top line summary
top_line_summary = df_summaries['company_summary'].tolist()

# Import Markdown to display the summary
from IPython.display import Markdown

In [None]:
df_representations.head()

In [None]:
# AIM GENERATE A SUMMARY OF FINDINGS 

def generate_summary_report_page2(df, company):
    df_representations_ = df_representations[df_representations['company'] == company]
    comment_summary_ = df_representations_['summary'].tolist()[1:]
    
    df_ = df.sort_values(by='Topic', ascending=True)
    reps_ = df_['Representation'].unique()
    reps_ = [item.split("'")[1] for item in reps_][1:]
    reps_comments_holder = f""" **Summary of {company}'s TrustPilot Feedback from Their Customers** \n
Below you will find a summary of the feedback that {company} has received from its customers.
This feedback has been clustered using BERTopic, and the text of the most representative comments per topic has been
summarised using Cohere. Happy reading! \n\n
"""

    for index, item in enumerate(reps_):
        paragraph_ = f"\n **{index+1}). {item}.**\n\n {comment_summary_[index]} \n"
        reps_comments_holder += paragraph_
    
    return reps_comments_holder
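
The function above pulls the first keyword out of each `Representation` value with `item.split("'")[1]`. The sketch below compares that with parsing the value back into a list, assuming the column holds the string form of a Python list (as it would after a CSV round-trip); the sample values are invented.

```python
import ast

# Assumed cell content after a CSV round-trip: a list rendered as a string
raw_representation = "['delays', 'refund', 'customer service']"

# Approach used above: take the text between the first pair of single quotes
first_by_split = raw_representation.split("'")[1]

# Alternative: parse the string back into a list, then take the first keyword
keywords = ast.literal_eval(raw_representation)
first_by_parse = keywords[0]
print(first_by_split, first_by_parse)

# A keyword containing an apostrophe is rendered with double quotes,
# which the quote-splitting approach does not handle
raw_with_apostrophe = str(["didn't board", "refund"])
print(ast.literal_eval(raw_with_apostrophe)[0])
```

Both approaches agree on simple keywords; parsing is only a possible safeguard for keywords that contain apostrophes.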

In [None]:
# Display Stacked Graph of Monthly User Clusters
plot_stacked_bar_with_line(df_easy, 'EasyJet')

In [None]:
# Display the top line summary for EasyJet
Markdown(df_summaries.loc[df_summaries['company'] == 'easyjet', 'company_summary'].iloc[0])

In [None]:
# Display Detailed Summary per Topic
Markdown(generate_summary_report_page2(df_easy, 'easyjet'))

# Project Conclusion 🎉📊 <br>
### Using BERTopic, we transformed a CSV of user-generated comments into actionable insights: monthly topic breakdowns alongside average star ratings, a summary of key findings and recommendations, and a detailed explanation of each topic cluster. The same approach can be adapted to any user-generated content. <br>

**Project Journey:**
- **Date Clean-Up & Preprocessing:** We cleaned up the datetime data and saved the embeddings to cut compute time.
- **Model Testing on Generic Comments:** By creating a generic comment column and experimenting with five pipeline approaches, we chose the best configuration before fine-tuning the model after training, which included limiting the number of topics and providing custom labels.
- **Fine Tuning & Model Training on Airline Specific Data:** To address corpus size differences, we adjusted the minimum number of clusters and used five separate documents (one for each airline) to train our model. We used the Cohere API and generative AI to create topic labels. Our models were then saved both locally and on the Hugging Face Hub.
- **Leveraging Cohere LLM API & Generative AI:** In the final step, we passed our topics and representative documents to Cohere's generate endpoint, which produced a summary and suggested actions for each airline, as well as a summary for each topic and representative document pair. We also created a stacked bar chart showing each airline's monthly topic share alongside its average star rating.