Transformer-based NLP topic modeling using the Python package BERTopic: modeling, prediction, and visualization

# Intro

BERTopic is a Python topic modeling library that combines transformer embeddings with clustering algorithms to identify topics in text, a common task in NLP (Natural Language Processing).



# Step 1: Install And Import Python Libraries

In step 1, we will install and import the Python libraries.


In [None]:
# Install bertopic
!pip install bertopic

Collecting bertopic
  Downloading bertopic-0.16.0-py2.py3-none-any.whl (154 kB)
Collecting hdbscan>=0.8.29 (from bertopic)
  Downloading hdbscan-0.8.33.tar.gz (5.2 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting umap-learn>=0.5.0 (from bertopic)
  Downloading umap-learn-0.5.5.tar.gz (90 kB)

In [None]:
# Try to import BERTopic
from bertopic import BERTopic

After installing the Python packages, we import the libraries.
* `pandas` and `numpy` are imported for data processing.
* `nltk` is imported for text preprocessing. We also download the `nltk` resources needed for stopword removal and lemmatization.
* `BERTopic` is imported for the topic modeling.
* `UMAP` is imported for dimension reduction.


In [None]:
# Data processing
import pandas as pd
import numpy as np

# Text preprocessing
import nltk
nltk.download('stopwords')
nltk.download('omw-1.4')
nltk.download('wordnet')
wn = nltk.WordNetLemmatizer()

# Topic model
from bertopic import BERTopic

# Dimension reduction
from umap import UMAP

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# Step 2: Download And Read Data

The second step is to download and read the dataset.



In [None]:
sheet_name = 'Comments'
df = pd.read_excel('/content/lgbtq_topic.xlsx', sheet_name=sheet_name)

The code above reads the `Comments` sheet of the Excel file into a `pandas` dataframe. Let's see what the dataset looks like.

The dataset has many columns, but we only use the `Comments` column for topic modeling in this notebook.


In [None]:
df.head()

Unnamed: 0,pri_icd9_dx_cd,ltr,Comments,Living Lens,Item,LGBTQ+ Filer,Sentiment,Emotion,Emotion Group,Confidence score,processed_comments,Topic Number,Topic Name,Key words
0,F32.9,0,The online claim submission process is terribl...,I am overseas and it is IMPOSSIBLE to find psy...,Major Depression,0,,,,,,,,
1,F32.9,0,"""Aetna processed original EOB on 5/26. I paid...","If you can't tell, I extremely frustrated by A...",Major Depression,0,,,,,,,,
2,Z11.4,0,Aetna is the worst health insurance I've ever ...,Do what I pay you for. I pay premiums every mo...,Encounter for screening for HIV:,1,,,,,,,,
3,F32.A,0,"After an injury to my arm, an orthopedic reque...",Very disappointing to have an MRI authorizatio...,is grouped within Diagnostic Related Group(s),0,,,,,,,,
4,F64.0,0,I am upset about not having gender confirming ...,"In general I am satisfied, I have been frustra...",Dysphoria Gender Adolescence and adulthood,1,,,,,,,,


In [None]:
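# Drop rows without a comment; copy so that later column assignments do not trigger SettingWithCopyWarning
df_cleaned = df.dropna(subset=['Comments']).copy()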

In [None]:
df_cleaned.head()

Unnamed: 0,pri_icd9_dx_cd,ltr,Comments,Living Lens,Item,LGBTQ+ Filer,Sentiment,Emotion,Emotion Group,Confidence score,processed_comments,Topic Number,Topic Name,Key words
0,F32.9,0,The online claim submission process is terribl...,I am overseas and it is IMPOSSIBLE to find psy...,Major Depression,0,,,,,,,,
1,F32.9,0,"""Aetna processed original EOB on 5/26. I paid...","If you can't tell, I extremely frustrated by A...",Major Depression,0,,,,,,,,
2,Z11.4,0,Aetna is the worst health insurance I've ever ...,Do what I pay you for. I pay premiums every mo...,Encounter for screening for HIV:,1,,,,,,,,
3,F32.A,0,"After an injury to my arm, an orthopedic reque...",Very disappointing to have an MRI authorizatio...,is grouped within Diagnostic Related Group(s),0,,,,,,,,
4,F64.0,0,I am upset about not having gender confirming ...,"In general I am satisfied, I have been frustra...",Dysphoria Gender Adolescence and adulthood,1,,,,,,,,


In [None]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7112 entries, 0 to 7115
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   pri_icd9_dx_cd      7112 non-null   object 
 1   ltr                 7112 non-null   int64  
 2   Comments            7112 non-null   object 
 3   Living Lens         7112 non-null   object 
 4   Item                7112 non-null   object 
 5   LGBTQ+ Filer        7112 non-null   int64  
 6   Sentiment           0 non-null      float64
 7   Emotion             0 non-null      float64
 8   Emotion Group       0 non-null      float64
 9   Confidence score    0 non-null      float64
 10  processed_comments  0 non-null      float64
 11  Topic Number        0 non-null      float64
 12  Topic Name          0 non-null      float64
 13  Key words           0 non-null      float64
dtypes: float64(8), int64(2), object(4)
memory usage: 833.4+ KB


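`.info()` gives us summary information about the dataset.

From the output, we can see that the dataset has 7112 records after dropping the rows with a missing comment. The first six columns have no missing data, while the remaining eight columns (`Sentiment` through `Key words`) are empty placeholders that we fill in later. The `Comments` column is of the `object` type.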

# Step 3: Text Data Preprocessing (Optional)



Generally speaking, there is no need to preprocess the text data when using the Python BERTopic model. However, in our dataset many stopwords end up among the terms picked to represent the topics.

Therefore, we remove stopwords and lemmatize the text as data preprocessing. Skip this step if stopwords are not an issue for you.

In [None]:
# Remove stopwords
stopwords = nltk.corpus.stopwords.words('english')
print(f'There are {len(stopwords)} default stopwords. They are {stopwords}')

There are 179 default stopwords. They are ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'no

Lemmatization refers to changing words to their base form.

After removing stopwords and lemmatizing the words, we can see that stopwords like `the` and `is` are removed, and a word like `questions` is converted to `question`.
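
The stopword filter can be sketched in plain Python. This is a minimal illustration, not the notebook's exact pipeline: `SAMPLE_STOPWORDS` and `remove_stopwords` are hypothetical names, and the tiny set stands in for the full `nltk` English stopword list.

```python
# Hypothetical stand-in for nltk's English stopword list (only a few entries).
SAMPLE_STOPWORDS = {"the", "is", "to", "for", "i", "a"}


def remove_stopwords(text, stopwords=SAMPLE_STOPWORDS):
    """Drop whitespace-separated tokens whose lowercase form is a stopword."""
    return " ".join(w for w in text.split() if w.lower() not in stopwords)


cleaned = remove_stopwords("The online claim submission process is terrible")
print(cleaned)
```

Because the text is split on whitespace only, a token with attached punctuation (for example `is.`) does not match the stopword list and is kept.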

In [None]:
df_cleaned['review_without_stopwords'] = df_cleaned['Comments'].astype(str).apply(lambda x: ' '.join([w for w in x.split() if w.lower() not in stopwords]))

In [None]:
# Lemmatization: reduce each remaining word to its base form
df_cleaned['review_lemmatized'] = df_cleaned['review_without_stopwords'].apply(lambda x: ' '.join([wn.lemmatize(w) for w in x.split() if w not in stopwords]))

# Take a look at the data
df_cleaned.head()

Unnamed: 0,pri_icd9_dx_cd,ltr,Comments,Living Lens,Item,LGBTQ+ Filer,Sentiment,Emotion,Emotion Group,Confidence score,processed_comments,Topic Number,Topic Name,Key words,review_without_stopwords,review_lemmatized
0,F32.9,0,The online claim submission process is terribl...,I am overseas and it is IMPOSSIBLE to find psy...,Major Depression,0,,,,,,,,,online claim submission process terrible. know...,online claim submission process terrible. know...
1,F32.9,0,"""Aetna processed original EOB on 5/26. I paid...","If you can't tell, I extremely frustrated by A...",Major Depression,0,,,,,,,,,"""Aetna processed original EOB 5/26. paid amoun...","""Aetna processed original EOB 5/26. paid amoun..."
2,Z11.4,0,Aetna is the worst health insurance I've ever ...,Do what I pay you for. I pay premiums every mo...,Encounter for screening for HIV:,1,,,,,,,,,Aetna worst health insurance I've ever had. co...,Aetna worst health insurance I've ever had. co...
3,F32.A,0,"After an injury to my arm, an orthopedic reque...",Very disappointing to have an MRI authorizatio...,is grouped within Diagnostic Related Group(s),0,,,,,,,,,"injury arm, orthopedic requested MRI. Aetna de...","injury arm, orthopedic requested MRI. Aetna de..."
4,F64.0,0,I am upset about not having gender confirming ...,"In general I am satisfied, I have been frustra...",Dysphoria Gender Adolescence and adulthood,1,,,,,,,,,upset gender confirming surgeries. cancel surg...,upset gender confirming surgeries. cancel surg...


# Step 4: Topic Modeling Using BERTopic

In step 4, we will build the topic model using BERTopic.

By default, the BERTopic model produces different results on each run because of the stochasticity inherited from UMAP.

To get reproducible topics, we need to pass a value to the `random_state` parameter of `UMAP`. The parameters used in this notebook are:
* `n_neighbors=5` means that the local neighborhood size for UMAP is 5. This parameter controls the balance between local and global structure in the data.
 * A low value forces UMAP to focus more on local structure, and may lose insights into the big picture.
 * A high value pushes UMAP to look at a broader neighborhood, and may lose details on local structure.
 * The default `n_neighbors` value for UMAP is 15, so this notebook emphasizes local structure more than the default does.
* `n_components=7` indicates that the target dimension from UMAP is 7. This is the dimension of the data that will be passed into the clustering model.
* `min_dist` controls how tightly UMAP is allowed to pack points together. It is the minimum distance between points in the low-dimensional space.
 * Small values of `min_dist` result in clumpier embeddings, which is good for clustering. Since our goal of dimension reduction is to build clustering models, we set `min_dist` to 0.
 * Large values of `min_dist` prevent UMAP from packing points together and preserve the broad structure of the data.
* `metric='cosine'` indicates that we will use cosine distance to measure the distance between documents.
* `random_state` sets a random seed to make the UMAP results reproducible.
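
To make the `metric='cosine'` setting concrete, here is a minimal pure-Python sketch of cosine distance (one minus cosine similarity). The helper name `cosine_distance` and the sample vectors are illustrative assumptions; UMAP uses its own optimized implementation.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Vectors pointing the same way have a distance near 0, regardless of length.
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])
# Orthogonal vectors have a distance of 1.
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Cosine distance ignores vector length, which is why it is a common choice for comparing text embeddings.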

After initiating the UMAP model, we pass it to the BERTopic model, set the language to English, and set the `calculate_probabilities` parameter to `True`.

Finally, we pass the processed comments to the topic model and save the resulting topics and topic probabilities.
 * The values in `topics` represent the topic each document is assigned to.
 * The values in `probabilities` represent the probability of a document belonging to each of the topics.
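
As a small sketch of how the `topics` list can be summarized, the snippet below counts documents per topic with `collections.Counter`. The assignments are hypothetical values made up for illustration, not results from this dataset.

```python
from collections import Counter

# Hypothetical topic assignments for eight documents; -1 marks outliers.
topics = [3, 1, 6, -1, 1, 3, 1, -1]

topic_counts = Counter(topics)
outlier_count = topic_counts[-1]

# Largest real topic, ignoring the outlier bucket.
largest_topic, largest_size = max(
    ((t, n) for t, n in topic_counts.items() if t != -1),
    key=lambda pair: pair[1],
)
print(outlier_count, largest_topic, largest_size)
```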

In [None]:
df_cleaned.to_csv('lgbt_output.csv',index=False)

In [None]:
# Initiate UMAP
umap_model = UMAP(n_neighbors=5,
                  n_components=7,
                  min_dist=0.0,
                  metric='cosine',
                  random_state=100)

# Initiate BERTopic
topic_model = BERTopic(umap_model=umap_model, language="english", calculate_probabilities=True)

# Run BERTopic model
topics, probabilities = topic_model.fit_transform(df_cleaned['review_lemmatized'])


# Step 5: Extract Topics From Topic Modeling

In step 5, we will extract topics from the BERTopic modeling results.

Calling the `get_topic_info()` method on the topic model gives us the list of topics. In the run shown below, the output has more than 170 rows.
* Topic -1 should be ignored. It collects the comments that are not assigned to any specific topic (1952 comments in this run).
* The topics numbered from 0 upward are the topics created for the comments. They are ordered by the number of comments in each topic, so topic 0 has the highest number of comments.
* The `Name` column lists the top terms for each topic. For example, the top 4 terms for Topic 0 are dental, dentist, vision, and eye, indicating a topic about dental and vision benefits.
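
The `Name` values follow a `<topic number>_<term>_<term>_...` pattern, so they can be split back into their parts. The helper below is a hypothetical sketch (`parse_topic_name` is not part of BERTopic), and it assumes that the terms themselves contain no underscores.

```python
def parse_topic_name(name):
    """Split a name like '0_dental_dentist_vision_eye' into (topic_id, terms)."""
    topic_id, _, rest = name.partition("_")
    return int(topic_id), rest.split("_")


topic_id, terms = parse_topic_name("0_dental_dentist_vision_eye")
outlier_id, outlier_terms = parse_topic_name("-1_aetna_care_get_health")
print(topic_id, terms)
print(outlier_id, outlier_terms)
```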

In [None]:
# Get the list of topics
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,1952,-1_aetna_care_get_health,"[aetna, care, get, health, coverage, plan, nee...",[Way expensive. Information cost doctor visits...
1,0,209,0_dental_dentist_vision_eye,"[dental, dentist, vision, eye, reimbursement, ...","[Dental reimbursement, dental coverage, dental..."
2,1,154,1_claim_call_provider_aetna,"[claim, call, provider, aetna, told, paid, bac...",[There's wide variance competence call center ...
3,2,139,2_que_la_de_muy,"[que, la, de, muy, los, con, en, servicio, el,...",[Porq en el a??o q llevo con este plan m??dico...
4,3,126,3_claim_claims_process_handled,"[claim, claims, process, handled, processing, ...","[problem claim paid., Cannot communicate claim..."
...,...,...,...,...,...
168,167,11,167_phase_qny_disinterested_ofail,"[phase, qny, disinterested, ofail, hadgreat, p...",[85 I've tried numerous provider simple found ...
169,168,11,168_area_network_locate_country,"[area, network, locate, country, fiber, doctor...",[trying locate ophthalmologist network area. L...
170,169,10,169_reputationmost_presenting_dernatologist_miami,"[reputationmost, presenting, dernatologist, mi...",[Aetna never give problem....it EASY QUICK get...
171,170,10,170_communication_thorough_comment_knowledgeab...,"[communication, thorough, comment, knowledgeab...","[Thorough, good communication helpful people, ..."


If more than 4 terms are needed for a topic, we can use `get_topic` and pass in the topic number. For example, `get_topic(0)` gives us the top 10 terms for topic 0 and their relative importance.

In [None]:
# Get top 10 terms for a topic
topic_model.get_topic(0)

[('dental', 0.058442153445868415),
 ('dentist', 0.026855628291710777),
 ('vision', 0.024720448974142262),
 ('eye', 0.02212248580532108),
 ('reimbursement', 0.012766788095311026),
 ('coverage', 0.010601074739467762),
 ('benefit', 0.009556269491611947),
 ('hearing', 0.009307698223590852),
 ('glass', 0.008945671946291617),
 ('reimbursed', 0.008188270485278228)]
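
Since `get_topic` returns `(term, score)` pairs, the terms can be pulled out on their own when only the keywords are needed. The sketch below uses the first five pairs from the output above; the variable names are illustrative.

```python
# First five (term, score) pairs copied from the get_topic(0) output.
topic_terms = [
    ("dental", 0.058442153445868415),
    ("dentist", 0.026855628291710777),
    ("vision", 0.024720448974142262),
    ("eye", 0.02212248580532108),
    ("reimbursement", 0.012766788095311026),
]

# Keep only the words, dropping the scores.
keywords = ", ".join(term for term, _ in topic_terms)
# The pair with the highest score is the most important term.
top_term, top_score = max(topic_terms, key=lambda pair: pair[1])
print(keywords)
print(top_term)
```

The same unpacking can be used when building a keyword column, so that it stores plain words instead of stringified tuples.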

We can visualize the top keywords using a bar chart. `top_n_topics=12` means that we will create bar charts for the top 12 topics. The length of the bar represents the score of the keyword. A longer bar means higher importance for the topic.

In [None]:
topic_info = topic_model.get_topic_info()

In [None]:
# Visualize top topic keywords
topic_model.visualize_barchart(top_n_topics=12, custom_labels=True)

In [None]:
topic_info

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,1952,-1_aetna_care_get_health,"[aetna, care, get, health, coverage, plan, nee...",[Way expensive. Information cost doctor visits...
1,0,209,0_dental_dentist_vision_eye,"[dental, dentist, vision, eye, reimbursement, ...","[Dental reimbursement, dental coverage, dental..."
2,1,154,1_claim_call_provider_aetna,"[claim, call, provider, aetna, told, paid, bac...",[There's wide variance competence call center ...
3,2,139,2_que_la_de_muy,"[que, la, de, muy, los, con, en, servicio, el,...",[Porq en el a??o q llevo con este plan m??dico...
4,3,126,3_claim_claims_process_handled,"[claim, claims, process, handled, processing, ...","[problem claim paid., Cannot communicate claim..."
...,...,...,...,...,...
168,167,11,167_phase_qny_disinterested_ofail,"[phase, qny, disinterested, ofail, hadgreat, p...",[85 I've tried numerous provider simple found ...
169,168,11,168_area_network_locate_country,"[area, network, locate, country, fiber, doctor...",[trying locate ophthalmologist network area. L...
170,169,10,169_reputationmost_presenting_dernatologist_miami,"[reputationmost, presenting, dernatologist, mi...",[Aetna never give problem....it EASY QUICK get...
171,170,10,170_communication_thorough_comment_knowledgeab...,"[communication, thorough, comment, knowledgeab...","[Thorough, good communication helpful people, ..."


In [None]:
topic_keywords = topic_model.get_topics()

In [None]:
topic_name = dict(zip(topic_info['Topic'], topic_info['Name']))

In [None]:
df_cleaned['Topic Number'] = topics

In [None]:
df_cleaned['Topic Name'] = df_cleaned['Topic Number'].map(topic_name)

In [None]:
df_cleaned['Key words'] = df_cleaned['Topic Number'].map(lambda x: ', '.join(map(str,topic_keywords[x][:5])))

In [None]:
topic_labels = topic_model.generate_topic_labels(nr_words=3,
                                                 topic_prefix=False,
                                                 word_length=10,
                                                 separator=", ")

In [None]:
topic_model.set_topic_labels(topic_labels)


Another view of keyword importance is the "Term score decline per topic" chart. It's a line chart with the term rank on the x-axis and the c-TF-IDF score on the y-axis.

The chart has one line for each topic. Hovering over a line shows the term score information.

In [None]:
# Visualize term rank decrease
topic_model.visualize_term_rank()

# Step 6: Topic Similarities

In step 6, we will analyze the relationship between the topics generated by the topic model.

We will use three visualizations to study how the topics are related to each other: the intertopic distance map, the hierarchical clustering of topics, and the topic similarity matrix.

The intertopic distance map measures the distance between topics. Similar topics are closer to each other, and very different topics are far from each other. In the visualization, topics with similar semantic meanings appear together in the same group.

The size of a circle represents the number of documents in the topic, so larger circles mean that more comments belong to the topic.

In [None]:
# Visualize intertopic distance
topic_model.visualize_topics()

Another way to see how the topics are connected is through a hierarchical clustering graph. We can control the number of topics in the graph by the `top_n_topics` parameter.

In this example, the top 30 topics are included in the hierarchical graph. Topics that are joined at a shorter distance in the graph are more closely connected to each other.

In [None]:
# Visualize connections between topics using hierarchical clustering
topic_model.visualize_hierarchy(top_n_topics=30)

A heatmap can also be used to analyze the similarities between topics. The similarity score ranges from 0 to 1. A value close to 1 represents a higher similarity between the two topics, which is shown in a darker blue color.

In [None]:
# Visualize similarity using heatmap
topic_model.visualize_heatmap(top_n_topics=10)

# Step 7: Topic Model Predicted Probabilities

In step 7, we will talk about how to use the BERTopic model to get predicted probabilities.

The topic prediction for a document is based on the predicted probabilities of the document belonging to each topic. The topic with the highest probability is the predicted topic. This probability represents how confident we are about finding the topic in the document.

We can visualize the probabilities using `visualize_distribution` and pass in the probabilities of one document. `visualize_distribution` has a default probability threshold of 0.015, so only the topics with a probability greater than 0.015 will be included.
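
The thresholding and the "highest probability wins" rule can be sketched in plain Python. The probability row below is hypothetical, and the strict `>` comparison follows the description above rather than the library's source code.

```python
# Hypothetical probability row for one document over five topics.
doc_probabilities = [0.010, 0.620, 0.005, 0.300, 0.020]
MIN_PROBABILITY = 0.015  # the threshold value described in the text

# Topics that would be shown in the chart (probability above the threshold).
shown = {
    topic: p
    for topic, p in enumerate(doc_probabilities)
    if p > MIN_PROBABILITY
}

# The predicted topic is the index of the largest probability.
predicted_topic = max(
    range(len(doc_probabilities)), key=doc_probabilities.__getitem__
)
print(shown)
print(predicted_topic)
```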

In [None]:
# Visualize probability distribution
topic_model.visualize_distribution(topic_model.probabilities_[0], min_probability=0.015)

If you would like to save the visualization as a separate HTML file, you can save the chart into a variable and use `write_html` to write it to a file.

In [None]:
# Save the chart to a variable
chart = topic_model.visualize_distribution(topic_model.probabilities_[0])

# Write the chart as a html file
chart.write_html("amz_review_topic_probability_distribution.html")

The topic probability distribution for the first comment in the dataset shows which topics the model considers plausible for it. The topic with the highest probability is the predicted topic.

The first comment begins "The online claim submission process is terribl..." (truncated in the output above). In the `Topic Number` column it is assigned to topic 3, `3_claim_claims_process_handled`, and a claim-processing topic is pretty relevant.

We can also get the predicted probability for all topics using the code below.

In [None]:
# Get probabilities for all topics
topic_model.probabilities_[0]

The output has one probability value for each topic. The index with the highest value indicates the predicted topic.

# Bonus: Sentiment Analysis


In [None]:
from transformers import pipeline, AutoTokenizer

In [None]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")


In [None]:
import pandas as pd
from transformers import BertTokenizer, BertForSequenceClassification
from bertopic import BERTopic

In [None]:
import torch

tokenizer = BertTokenizer.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
model = BertForSequenceClassification.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
model.eval()

# Perform sentiment analysis on each comment
def predict_sentiment(comment):
    # Truncate long comments to the model's maximum input length
    inputs = tokenizer(comment, return_tensors="pt", truncation=True, max_length=512)
    # Inference only, so gradients are not needed
    with torch.no_grad():
        logits = model(**inputs).logits
    # The model predicts a star rating: class 0 is 1 star, class 4 is 5 stars
    predicted_class = logits.argmax().item()
    # Treat 4 and 5 stars as positive, everything else as negative
    return "positive" if predicted_class >= 3 else "negative"

df_cleaned['Sentiment'] = df_cleaned['review_lemmatized'].apply(predict_sentiment)

# Display the DataFrame with sentiment labels
print(df_cleaned[['Comments', 'Sentiment']])

                                               Comments Sentiment
0     The online claim submission process is terribl...  negative
1     "Aetna processed original EOB on 5/26.  I paid...  negative
2     Aetna is the worst health insurance I've ever ...  negative
3     After an injury to my arm, an orthopedic reque...  negative
4     I am upset about not having gender confirming ...  negative
...                                                 ...       ...
7111  Keep me informed on annual physical, and colon...  negative
7112                      Atencion, cobertura y calidad  negative
7113  The customer service has been above and beyond...  positive
7114  Very professional, knowledgeable and resolve q...  negative
7115                                               Bbbb  negative

[7112 rows x 2 columns]
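
The sentiment model used here predicts a star rating rather than a binary label, so the class index has to be mapped to a sentiment. Below is a minimal sketch of one possible three-way mapping, assuming class index 0 means 1 star and class index 4 means 5 stars. The function name `stars_to_sentiment` and the cut-offs are illustrative choices, not part of the model.

```python
def stars_to_sentiment(class_index):
    """Map a 0-based class index (assumed 0 = 1 star ... 4 = 5 stars) to a label."""
    if not 0 <= class_index <= 4:
        raise ValueError("class_index must be between 0 and 4")
    stars = class_index + 1
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


labels = [stars_to_sentiment(i) for i in range(5)]
print(labels)
```

Keeping a separate `neutral` label avoids forcing 3-star predictions into either the positive or the negative group.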


In [None]:
df_cleaned

Unnamed: 0,pri_icd9_dx_cd,ltr,Comments,Living Lens,Item,LGBTQ+ Filer,Sentiment,Emotion,Emotion Group,Confidence score,processed_comments,Topic Number,Topic Name,Key words,review_without_stopwords,review_lemmatized
0,F32.9,0,The online claim submission process is terribl...,I am overseas and it is IMPOSSIBLE to find psy...,Major Depression,0,negative,,,,,3,3_claim_claims_process_handled,"('claim', 0.057251400717669806), ('claims', 0....",online claim submission process terrible. know...,online claim submission process terrible. know...
1,F32.9,0,"""Aetna processed original EOB on 5/26. I paid...","If you can't tell, I extremely frustrated by A...",Major Depression,0,negative,,,,,1,1_claim_call_provider_aetna,"('claim', 0.018008945913460545), ('call', 0.01...","""Aetna processed original EOB 5/26. paid amoun...","""Aetna processed original EOB 5/26. paid amoun..."
2,Z11.4,0,Aetna is the worst health insurance I've ever ...,Do what I pay you for. I pay premiums every mo...,Encounter for screening for HIV:,1,negative,,,,,6,6_bill_medical_aetna_pay,"('bill', 0.016433343428977072), ('medical', 0....",Aetna worst health insurance I've ever had. co...,Aetna worst health insurance I've ever had. co...
3,F32.A,0,"After an injury to my arm, an orthopedic reque...",Very disappointing to have an MRI authorizatio...,is grouped within Diagnostic Related Group(s),0,negative,,,,,53,53_mri_knee_xray_request,"('mri', 0.06643027837956514), ('knee', 0.03112...","injury arm, orthopedic requested MRI. Aetna de...","injury arm, orthopedic requested MRI. Aetna de..."
4,F64.0,0,I am upset about not having gender confirming ...,"In general I am satisfied, I have been frustra...",Dysphoria Gender Adolescence and adulthood,1,negative,,,,,74,74_gender_genderaffirming_affirming_cosmetic,"('gender', 0.07223442629588038), ('genderaffir...",upset gender confirming surgeries. cancel surg...,upset gender confirming surgeries. cancel surg...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7111,Z13.31,10,"Keep me informed on annual physical, and colon...",while I was satisfied with my recent in home ...,Encounter for screening for depression,0,negative,,,,,-1,-1_aetna_care_get_health,"('aetna', 0.00568494137448727), ('care', 0.005...","Keep informed annual physical, colon screening.","Keep informed annual physical, colon screening."
7112,Z20.6,10,"Atencion, cobertura y calidad","Excelente empresa, calidad y cobertura",exposure to HIV virus,1,negative,,,,,2,2_que_la_de_muy,"('que', 0.0804812441418273), ('la', 0.07657264...","Atencion, cobertura calidad","Atencion, cobertura calidad"
7113,Z13.31,10,The customer service has been above and beyond...,Just very pleased overall with Aetna. Custome...,Encounter for screening for depression,0,positive,,,,,55,55_customer_professional_knowledgeable_service,"('customer', 0.09322782930423916), ('professio...",customer service beyond helpful. Always kind p...,customer service beyond helpful. Always kind p...
7114,Z13.31,10,"Very professional, knowledgeable and resolve q...",Aetna Medicare representative is always profes...,Encounter for screening for depression,0,negative,,,,,-1,-1_aetna_care_get_health,"('aetna', 0.00568494137448727), ('care', 0.005...","professional, knowledgeable resolve questions ...","professional, knowledgeable resolve question c..."


In [None]:
df_cleaned.to_csv('lgbt_output_v1.csv',index=False)

In [None]:
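sentiment = pipeline("sentiment-analysis")  # assumption: `sentiment` is never defined above, so the default Hugging Face sentiment pipeline stands in
df_cleaned['Sentiment'] = df_cleaned['review_lemmatized'].apply(lambda x: sentiment(x, truncation=True)[0]['label'])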