                                               Workshop 7: Maria Balaet, Valentina Giunchiglia and Dragos Gruia

# Topic Modelling on drug use free text data

The aim of this workshop is to introduce you to the application of **natural language processing (NLP)** algorithms to study free text data. NLP is a branch of artificial intelligence that focuses on trying to understand written and spoken text. It is a very active area of research, with new methods being developed all the time. In this workshop, we will focus specifically on **topic modelling**, which is an unsupervised machine learning method that analyses a corpus of documents that people have written on a given theme in order to identify the most common topics within that theme.

In particular, we will use a type of topic modelling called Latent Dirichlet Allocation (LDA) to examine the explanations that people gave for changes in their drug use during the early stages of the COVID-19 pandemic. In the morning, we will investigate why some recreational drug users decided to increase their use, whilst later in the day we will apply what we learnt to understand why other users decided to decrease their drug use.


The first thing we need to do is to download and import the packages we will need during the workshop, and to change the display settings in order to be able to visualise more rows and columns when printing dataframes. Today, we will work a lot with two new Python modules called `gensim` and `nltk`. `gensim` is one of the most commonly used modules for topic modelling in Python, and `nltk` is an NLP Python toolkit.

In [None]:
pip install gensim nltk

In [None]:
import pandas as pd
import warnings 
import gensim
import nltk
import seaborn as sb
import matplotlib.pyplot as plt
import numpy as np

warnings.filterwarnings('ignore')

pd.set_option('display.float_format', '{:.2f}'.format)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', 500)

sb.set_theme("talk")
sb.set_style("whitegrid")

Now, let's import the data that we will use during the workshop and inspect them. As you can see, the dataframe consists of three columns that report the user ids, whether the drug use of participants increased or decreased during the pandemic, and the reasons behind this change as free text.

In [None]:
reasons_for_change_dec2020_more = pd.read_csv("Data/Day7_morning.csv")

In [None]:
reasons_for_change_dec2020_more


---------
### Code here
The column names of the dataframes are quite long and the row names are not really easy to interpret. 

1. Replace the column headers with the names "how" and "why" respectively for "How has your drug use changed due to the pandemic?" and "Why has your drug use changed during the pandemic?". 
2. Set the user ids as row index
3. Since we are analysing only the participants who increased their drug use, the "how" column should have only "I am using more" answers, check that this is the case.
4. Confirm that there are no missing values

In [None]:
## CODE HERE


--------
Now that we have a better looking dataframe, let's check out the answers in the *why* column.

In [None]:
reasons_for_change_dec2020_more['why'].to_list()

By quickly looking at the answers, it appears that one of the main reasons for using more drugs during the pandemic was *boredom*. However, different people express the same concept in slightly different ways. Let's print a few answers.

In [None]:
print(reasons_for_change_dec2020_more['why'][3], "\n",
      reasons_for_change_dec2020_more['why'][14],"\n",
      reasons_for_change_dec2020_more['why'][34]
     )

If you look more carefully, you will notice a few other things: 1) some people wrote boredom with or without capital letters, 2) some answers have empty spaces, 3) other answers have special characters, 4) some words are spelled incorrectly... These are just a few examples of the noise that free text answers contain, and that needs to be removed before starting any analysis. The data cleaning step in free text analysis is **fundamental**!


In [None]:
print(reasons_for_change_dec2020_more['why'][285], "\n",
      reasons_for_change_dec2020_more['why'][286],"\n",
      reasons_for_change_dec2020_more['why'][281], "\n",
      reasons_for_change_dec2020_more['why'][278], "\n",
      reasons_for_change_dec2020_more['why'][273], "\n",
     )

## Data cleaning of free text


Data cleaning is necessary to remove errors in the data, to reduce the noise to the minimum and to include in the analysis only what is essential. The most important cleaning steps of free-text data are the following:

1. **Turning all letters to lower case**: this is important because otherwise words with capital letters will be mistakenly treated as different from the same words without capital letters (e.g. This and this).
2. **Removal of punctuation, special characters and digits**: punctuation creates noise in the data. It cannot be used to make sense of the meaning of a topic because it does not represent words, and a bag-of-words method like LDA has no way to interpret it.
3. **Tokenization**: segmenting a piece of text (in this case the answer of each participant) into discrete units (i.e., words).
4. **Stop words removal**: removing words that are really common in English, but don't provide much information about the content of a document when considered alone, e.g., "to", "in" or "when". These words just create noise in the data for a method like LDA (though they can be important for other methods that consider the interrelationship/ordering of multiple words).
5. **Lemmatization**: grouping together the inflected forms of a word so they can be analysed as a single item. E.g., run, running, runs and ran are all forms of the same word and can be replaced with a single token.
6. **Removal of empty answers**: dropping answers that contain no words at all once the previous steps have been applied.

Let's complete these cleaning steps, in order to understand better what they do and why they are important. First, we will turn everything to lower case. 

In [None]:
reasons_for_change_dec2020_more['why'] = reasons_for_change_dec2020_more['why'].str.lower() 

Then, we remove punctuation (e.g. *,* or *.*), special characters (e.g. */* and *&*) and digits.

In [None]:
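# regex=True is required: recent pandas versions treat the pattern as a literal string by default
reasons_for_change_dec2020_more['why'] = reasons_for_change_dec2020_more['why'].str.replace(r'[,\.!?/&]', '', regex=True)
reasons_for_change_dec2020_more['why'] = reasons_for_change_dec2020_more['why'].str.replace(r'\d+', '', regex=True)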
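
To see what the two patterns do, here is a minimal sketch that applies them with Python's built-in `re` module to a made-up answer (the sentence is illustrative and not taken from the dataset). Note that the character class removes only the symbols it lists, so other characters, such as apostrophes or brackets, would survive this step.

```python
import re

# Made-up example answer, for illustration only
example = "bored at home, 24/7! nothing else to do & no work?"

# Same two patterns as in the pandas cell: listed punctuation first, then digits
no_punct = re.sub(r"[,\.!?/&]", "", example)
no_digits = re.sub(r"\d+", "", no_punct)

# Removing characters can leave double spaces behind; tokenization deals with those later
print(no_digits)
```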

Now, let's save the free text data into a separate variable, and let's check it out. Were all digits, punctuation, and special characters removed?

In [None]:
data = reasons_for_change_dec2020_more['why'].to_list()
data

The next step of the data cleaning and preparation is **tokenization**. Tokenization is necessary to make the sentences analysable by the computer, and consists of splitting each participant's answer into a list of individual words. The gensim package has a function called `simple_preprocess` that does this directly, taking each separate answer as input. The function lowercases the text and discards any remaining punctuation while splitting it into tokens. Setting `deacc=True` additionally strips accent marks from the tokens (e.g. "é" becomes "e"). Finally, with the arguments used here, the function removes all words that have fewer than 2 letters or more than 20. Words with fewer than 2 letters are usually not really meaningful on their own, and those with more than 20 are suspiciously long in most normal contexts! They could just be typing mistakes.

In [None]:
data_words = []
for sentence in data:
    listwords = gensim.utils.simple_preprocess(str(sentence), deacc=True, min_len=2, max_len=20)
    data_words.append(listwords)
    
data_words[10]  
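
If you want a feel for what tokenization does without calling gensim, the sketch below is a rough stand-in written with the standard library only. The function name `simple_tokenize` is hypothetical, and this is an approximation of the idea (lowercase, keep runs of letters, filter by length), not gensim's actual implementation.

```python
import re

def simple_tokenize(text, min_len=2, max_len=20):
    """Rough stand-in for tokenization: lowercase, keep alphabetic runs, filter by length."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

# Made-up answer, for illustration only
tokens = simple_tokenize("I was bored, so I smoked more at home")
print(tokens)
```

The one-letter word "I" is dropped by the length filter, and the comma disappears because only runs of letters are kept.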

Check different elements in the *data_words* list. Do you understand how tokenization works?

Now that we have the answers as lists of words, we can do some cleaning on the words themselves. The first thing we are going to do is to remove the stop words, i.e. very commonly used words in the English language. Luckily, the *nltk* module already has a list of these words that we can simply download.

In [None]:
nltk.download('stopwords')
stop = nltk.corpus.stopwords.words('english') 
print(stop)

Of course, if you think that other words are too common and should be removed, but they are not in this list, you can easily add them. Adding an extra filter like this reduces the noise in the data even more. Do you have any other words in mind? If you do, add them to the following list.

In [None]:
custom_stop = ['goes','with'] 
finalstop = stop + custom_stop
print(finalstop)

Now that we have our final list of stop words, we can remove those words from the list of words of each participant's answers. 

In [None]:
data_words_final = []

for answer in data_words:
    data_words_cleaned = []
    for word in answer:
        if word not in finalstop:
            data_words_cleaned.append(word)
    data_words_final.append(data_words_cleaned)

print("Before:" , data_words[10]  ) 
print("After:" , data_words_final[10]  ) 
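
The nested loop above can also be written as a list comprehension. The sketch below shows the same logic on toy data; `toy_stop` and `toy_answers` are made-up stand-ins for `finalstop` and `data_words`. Storing the stop words in a `set` makes each membership check faster than searching a list, which matters little here but helps on larger corpora.

```python
# Hypothetical mini stop-word list; the real one comes from nltk plus your own additions
toy_stop = {"for", "more", "to", "the", "i"}

# Made-up tokenized answers, for illustration only
toy_answers = [["more", "time", "for", "hobby"], ["to", "the"]]

# Keep only the words that are not stop words, answer by answer
filtered = [[word for word in answer if word not in toy_stop] for answer in toy_answers]
print(filtered)
```

Notice that the second toy answer ends up empty, which is exactly the situation handled by the final cleaning step (removal of empty answers).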

As you can see, the stop words "for" and "more" were removed.

Now, we can complete the final step of the pre-processing, which is the **lemmatization** step. Lemmatization groups together different inflected forms of the same word, so that they can be analysed as a single item with the same meaning. For example, it converts the different conjugations of verbs into the infinitive form (e.g. "swim", "swam" and "swum" would all be converted to "swim"), or turns the plurals of nouns into the singular form.

In Python, it can be completed with the `WordNetLemmatizer` class and its `lemmatize` method. Note that `lemmatize` treats every word as a noun unless told otherwise, so irregular verb forms such as "swam" are only reduced to "swim" if you pass `pos="v"`.

In [None]:
nltk.download('wordnet')
nltk.download('omw-1.4')
lemma = nltk.stem.wordnet.WordNetLemmatizer()

data_words_lemmatized = []
for answer in data_words_final:
    data_words_cleaned = []
    for word in answer:
        lematized = lemma.lemmatize(word)
        data_words_cleaned.append(lematized)
    data_words_lemmatized.append(data_words_cleaned)

print("Before:" , data_words_final[10]  ) 
print("After:" , data_words_lemmatized[10]  ) 

As you can see, "events" was converted to "event", and "makes" to "make". Check a few other answers to see what changed.

Let's check how many answers we have now.

In [None]:
len(data_words_lemmatized)

In total we have 290 answers. However, after the cleaning, some of these will be completely empty. This usually happens when participants don't reply properly to the question, and just write down something random. These answers are of course useless and should be removed. We will save the positions of the answers that were dropped, because we will need this information later on.

In [None]:
data_words_lemmatized_cleaned = []
dropped_ids = []
for i, answer in enumerate(data_words_lemmatized):
    if answer:
        data_words_lemmatized_cleaned.append(answer)
    else:
        dropped_ids.append(i)
        
    
len(data_words_lemmatized_cleaned)

After removing the empty answers, we now have 283 fully cleaned answers that can be used for the analysis.

In [None]:
reasons_for_change_dec2020_more_cleaned=data_words_lemmatized_cleaned

## Topic modelling using LDA

In today's workshop, we will do topic modelling using **Latent Dirichlet Allocation (LDA)**. LDA is amongst the most established and popular topic modelling methods (though there are new ones being developed all the time). It aims to find the topics within a body of text based on the words it contains. Let's look at the name of the method and try to understand what each word means:

- **Latent**: indicates that the model discovers the ‘yet-to-be-found’ or hidden topics that are common across the documents that people have written.
- **Dirichlet**: indicates the two assumptions of LDA - that both the distribution of topics within a document and the distribution of words within each topic are Dirichlet distributions (which is a type of probability distribution).
- **Allocation**: indicates the distribution of topics in the document.
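
To get an intuition for what a Dirichlet distribution looks like, the sketch below draws two samples using only the standard library. It relies on a standard result: normalising independent Gamma draws gives a Dirichlet sample. The helper name `sample_dirichlet` and the parameter values are assumptions chosen for illustration; they are not what gensim uses internally.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalising independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Small alpha values tend to put most of the mass on a few topics;
# large alpha values tend to spread the mass evenly across topics
sparse_mix = sample_dirichlet([0.1] * 4, rng)
even_mix = sample_dirichlet([10.0] * 4, rng)

print([round(p, 2) for p in sparse_mix])
print([round(p, 2) for p in even_mix])
```

Each sample is a list of four non-negative numbers that sum to 1, which is why it can be read as "the proportion of each of 4 topics in one document".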

LDA assumes that the words within a document can be used to determine the topics. LDA assigns each word in a document to a topic, then maps the entire document to a list of topics. Put another way, LDA computes a many-to-many relationship between topics and words, and thus a many-to-many relationship between documents and topics. This many-to-many mapping between latent topics and written documents makes intuitive sense: a document that someone has written can be saying the same thing as one that someone else wrote in somewhat different words, and an individual document may also cover more than one topic.

[Here](https://proceedings.neurips.cc/paper/2001/file/296472c9542ad4d4788d543508116cbc-Paper.pdf) you can find the original paper, if you are really interested in understanding the method in detail (do not worry though, we will not be assessing your mathematical understanding of LDA).

One of the requirements of LDA is to specify the number of topics that should be identified within the set of answers. The number of topics is a user-defined parameter, better called a **hyperparameter**. In machine learning, a hyperparameter is a parameter that is external to the model and cannot be inferred from the data during a single run of the fitting process. It therefore has to be chosen for each model and dataset, either by comparing multiple runs to find the optimal value or by using prior knowledge to pick the most appropriate one.

In the case of LDA, one of the best approaches to identify the optimal value for the number of topics is to use the **Coherence Score**. The coherence score specifies whether a certain topic split gives rise to coherent topics, where "coherent" is defined as combinations of documents that humans would tend to agree should be grouped together. Interestingly, there is a branch of machine learning dedicated to automatically determining coherence scores that correlate well with those that have been assigned manually. This specific method is described [here](http://www.saf21.eu/wp-content/uploads/2017/09/5004a165.pdf), though for now we are just interested in using it. The higher the score, the more coherent the topics. In order to use this score to identify the optimal number of topics, it is necessary to run the LDA analysis with different numbers of topics (this being the hyperparameter) and calculate the score for each of them. The number of topics that leads to the highest coherence score is taken as the optimal hyperparameter value, and is then used in the model that we analyse and interpret.


Let's start by getting the data in the right format in order to be able to run LDA. The LDA function takes as input two main arguments: 

1. A **dictionary** that has as keys an id number and as values a word (each word is assigned to a different id number)
2. The **corpus**, which is essentially the list of answers in a bag-of-words format: each answer becomes a list of (word_id, word_count) pairs, where the word id is the id assigned to that word in the dictionary.

To better understand what these are, let's create them using the *gensim* package.

In [None]:
id2word = gensim.corpora.Dictionary(reasons_for_change_dec2020_more_cleaned)
print(id2word[1])
print(id2word[2])
print(id2word[11])

As you can see `id2word` is a dictionary, where each key is a number, and the value of that key is a different word. For example, *id* 1 is assigned to the word *marijuana*.

In [None]:
corpus = []
for answer in reasons_for_change_dec2020_more_cleaned:
    corpus.append(id2word.doc2bow(answer))

print("Answer", reasons_for_change_dec2020_more_cleaned[0])
print("Corpus", corpus[0])

As you can see, each word of each answer is converted into (word_id, word_count).
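
If the bag-of-words format still feels abstract, the sketch below rebuilds the idea on toy data with the standard library only. The names `word2id` and `to_bow` are hypothetical, and the ids are assigned in order of first appearance, which is an assumption of this sketch and may differ from the ids that gensim assigns.

```python
from collections import Counter

# Made-up cleaned answers, for illustration only
toy_answers = [["bored", "home", "bored"], ["home", "stress"]]

# Give every distinct word an integer id, in order of first appearance
word2id = {}
for answer in toy_answers:
    for word in answer:
        word2id.setdefault(word, len(word2id))

def to_bow(answer):
    """Convert a list of words into sorted (word_id, word_count) pairs."""
    counts = Counter(word2id[word] for word in answer)
    return sorted(counts.items())

toy_corpus = [to_bow(answer) for answer in toy_answers]
print(word2id)
print(toy_corpus)
```

In the first toy answer, "bored" appears twice, so its pair is `(0, 2)`. Word order is lost in this representation, which is why LDA is called a bag-of-words method.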

Now that we have the data in the right format, we can define which numbers of topics we want to test to find the optimal one. Today, we will try a maximum of 10 topics. Some data might require many more than 10, but the more candidate values you test, the more time and computational resources you will need.

In [None]:
min_topics = 1
max_topics = 10
step_size = 1
# range() excludes its end point, so add 1 to make sure max_topics itself is tested
topics_range = range(min_topics, max_topics + 1, step_size)
topics_range

**Let's run LDA with all the potential numbers of topics!**

To be able to do it, we need to complete the following steps:
1. Create a results dictionary where we will store the topic number and coherence score at each iteration
2. Loop over all the potential numbers of topics
3. Run the LDA model and change the number of topics at each iteration
4. Calculate the coherence score for each topic number
5. Save the coherence score in the results dictionary

Depending on the amount of RAM that your computer has, this step can be more or less slow. To get an idea of how long you still need to wait before the computation is completed, you can use a handy package called `tqdm`, which displays a progress bar when running for loops.

In [None]:
pip install tqdm

In [None]:
from tqdm import tqdm 

results = {'Topics_Number': [], "cv_Coherence_avg": []}
    
for n_topics in tqdm(topics_range):
    lda_model = gensim.models.ldamulticore.LdaMulticore(corpus=corpus,
                                                        id2word=id2word,
                                                        num_topics=n_topics, 
                                                        random_state=100,
                                                        chunksize=100,
                                                        passes=50,
                                                        workers=20,
                                                        iterations=150,
                                                        minimum_probability=0)
                                                                 
    cv_coherence_total = gensim.models.CoherenceModel(model=lda_model, 
                                            texts=reasons_for_change_dec2020_more_cleaned, 
                                            dictionary=id2word).get_coherence()
            
    results['Topics_Number'].append(n_topics)
    results['cv_Coherence_avg'].append(cv_coherence_total)

In [None]:
results_df = pd.DataFrame(results)
results_df

Great! Now that we have the coherence score for each number of topics, we can create a plot to visualise the results and better see where the peak is.

In [None]:
sb.lineplot(x='Topics_Number', y='cv_Coherence_avg', data=pd.DataFrame(results))
plt.xlabel("Number of topics")
plt.ylabel("Coherence Score")
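
The peak can also be located programmatically rather than by eye. The sketch below does this for a dictionary shaped like `results`; the scores in `toy_results` are made up for illustration and are NOT the values produced by the workshop data.

```python
# Illustrative scores only; these are not real results from the workshop dataset
toy_results = {
    "Topics_Number": [1, 2, 3, 4, 5],
    "cv_Coherence_avg": [0.31, 0.35, 0.42, 0.38, 0.36],
}

scores = toy_results["cv_Coherence_avg"]

# Position of the highest coherence score, then the matching number of topics
best_position = max(range(len(scores)), key=lambda i: scores[i])
best_n_topics = toy_results["Topics_Number"][best_position]
print(best_n_topics)
```

A numerical peak is only a starting point: neighbouring values often have very similar scores, so the topics still need to be inspected qualitatively.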

Based on the plot and the results dataframe, the peak is at topic number = 4, which suggests that 4 is the optimal number of topics. However, there is something important to mention here. If you look at the **LdaMulticore** function, there are three arguments that we set for you, which are `passes`, `iterations` and `chunksize`. These three values are also hyperparameters that one should generally fine-tune. `Passes` corresponds to the number of passes through the corpus during training, `chunksize` to the number of documents to be used in each training chunk, and `iterations` to the maximum number of iterations that are completed through the corpus when inferring its topic distribution. Depending on the values set in these three parameters, the results **will change**. 

Try to change those values and see what happens. 

In the end, to properly understand whether the number of topics is optimal, it is necessary to manually look at the final results, and see whether the identified topics make sense using a more qualitative approach. Using current methods, a combination of a quantitative (e.g. the coherence score) and qualitative approach is usually the best way to go, though the field is advancing quickly.

Due to time constraints, we will not fine-tune the number of `passes`, `iterations` and `chunksize`. We have already provided you with the best values based on previous research experiments. Now that we know that 4 is the optimal topic number, let's run LDA using 4 as the number of topics.

In [None]:
lda_model = gensim.models.ldamulticore.LdaMulticore(corpus=corpus,
                                                    id2word=id2word,
                                                    num_topics=4, 
                                                    random_state=100,
                                                    chunksize=100,
                                                    passes=50,
                                                    workers=20,
                                                    iterations=150,
                                                    minimum_probability=0)

Let's look at the 4 topics and the top 5 words that represent them using the method `show_topics`

In [None]:
lda_model.show_topics(formatted=False, num_words= 5)

As you can see the output of the function is the topic number, and a list of (word, probability) for each topic, where the probability corresponds to the probability of each word for that specific topic. Let's print only the words and ignore the probabilities for a moment.

In [None]:
for idx, topic in lda_model.show_topics(formatted=False, num_words= 10):
    wordskeep = [w[0] for w in topic]
    print(f'Topic: {idx} \nWords: {wordskeep}')

**How would you describe each topic in one sentence?** If you struggle to do it, try to increase the number of words that you print for each topic. Is the topic easier to understand now? Do the topics make sense?

Printing the words is of course useful, but there are better ways to visualise the most relevant words for each topic. One way is through **wordclouds**, which are a visual representation of the probability of words for each topic. The bigger the word in the wordcloud, the higher the probability of that word in the topic. To be able to visualise wordclouds, we need to install and import a new package called `wordcloud`.

In [None]:
!pip install wordcloud

In [None]:
from wordcloud import WordCloud

Now that we have the package, we can start preparing the input to the wordclouds function. We will first check how to create the wordcloud for one topic, and then create a for loop to create wordclouds for all topics at once. The input format of the wordclouds function is a dictionary, where each key is a word, and the value is the probability of that word in the topic. To obtain this dictionary we can use a method called `show_topic`, which returns the words and the probability value for a specific topic. The `topn` parameter specifies how many top words should be selected for each topic.


In [None]:
words_topic = {}
for word, probability in lda_model.show_topic(0, topn = 50): #0 means that we are extracting the words, probability for topic 0
    words_topic[word] = probability
print(words_topic)

Great! Now we can initialise the wordcloud function, and then use the method `generate_from_frequencies` to create the wordcloud from the dictionary we prepared.

In [None]:
wc = WordCloud(background_color="white", max_words=50)
wc.generate_from_frequencies(words_topic)

_ = plt.figure(figsize = (10, 10))
plt.imshow(wc)
plt.axis("off")

--------
## Code here 

Generate a figure with 4 subplots where each subplot corresponds to the wordcloud for a topic. 


In [None]:
# Code here

    

-----
At this point, thanks to the wordclouds and after printing the words, you should have a general interpretation of the different topics. Now, let's try to use the output of LDA to identify the dominant topic in each of the participants' answers. This is useful for two main reasons. First, by identifying the dominant topic in each answer we can evaluate the distribution of topics across all answers and see whether one topic was predominant compared to the others. Second, we can check out the 10 answers with the highest dominant topic contribution (or probability) in each topic and assess whether the way we interpreted the meaning of the topics based on the words was appropriate.

To obtain the probability of each topic for each answer, you can use the `get_document_topics` function. Based on how you defined each topic, do you think it makes sense that the following answer was assigned the highest probability for topic 3? Check a few more.

In [None]:
all_topics = lda_model.get_document_topics(corpus, minimum_probability=0.0)
print("Answer", reasons_for_change_dec2020_more_cleaned[10])
print(all_topics[10])

Now that we have this information for each answer, we want to extract it in an easier-to-view format, namely a dataframe. The aim is to create a dataframe where each row corresponds to an answer, and each column represents the probability of one topic for that answer.

In [None]:
all_topics_df = pd.DataFrame(gensim.matutils.corpus2csc(all_topics).T.toarray())
all_topics_df

Rename the column headers to something easier to interpret. The names can be simply "Topic_1", "Topic_2", "Topic_3" and "Topic_4", or one word that describes the main theme of the topic based on your interpretation.

In [None]:
## CODE HERE

Finally, to find out the predominant topic for each participant, we just have to identify the topic with the highest probability.

In [None]:
all_topics_df['dominant_topic_contribution'] = all_topics_df.max(axis = 1) 
all_topics_df['dominant_topic'] = np.argmax(all_topics_df.values, axis=1)
all_topics_df
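
The logic of the cell above can be seen on a tiny example without pandas. In the sketch below, `toy_probs` holds made-up topic probabilities for three answers; for each row we take the largest probability as the contribution and its position as the dominant topic.

```python
# Made-up topic probabilities for three answers (each row sums to 1)
toy_probs = [
    [0.10, 0.70, 0.15, 0.05],
    [0.40, 0.20, 0.30, 0.10],
    [0.05, 0.05, 0.10, 0.80],
]

# For each answer: (position of the largest probability, the largest probability)
dominant_topics = [(row.index(max(row)), max(row)) for row in toy_probs]

for topic, contribution in dominant_topics:
    print(f"dominant topic: {topic}, contribution: {contribution}")
```

As in the dataframe version, topics are numbered from 0, so a dominant topic of 3 refers to the fourth column.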

In [None]:
all_topics_df["dominant_topic"].value_counts()

As you can see, it appears that the topic number 1 was the most common reason why people increased their use of drugs during the pandemic.

Now that we have the dominant topic for each person, we can check out the 10 answers with the highest dominant topic contribution in each topic. This will allow us to make sure that the interpretation of the topics that we made before, based on the words, is actually appropriate. Indeed, it might be easier to interpret a topic by looking at entire answers rather than single words. To do this, we first have to merge the `all_topics_df` dataframe with the original one containing the actual answers. If you remember, before running LDA, we excluded some answers because at the end of the data cleaning they were completely empty, and we saved the positions of those answers in a variable called `dropped_ids`. Before doing the merging, we will need to remove those answers from the original dataframe, in order to be sure to assign the topic contributions to the respective participants' answers.

In [None]:
reasons_for_change_dec2020_more.shape, all_topics_df.shape

As you can see, the shapes of the dataframes are different because the original data still includes the answers that were removed at the end of the data cleaning.

In [None]:
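# dropped_ids holds row positions, so translate them to index labels before dropping
answers_df = reasons_for_change_dec2020_more.drop(reasons_for_change_dec2020_more.index[dropped_ids])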
answers_df.head()

In [None]:
answers_df.shape, all_topics_df.shape

The shapes are now the same! We can move forward with the merging. 

In [None]:
finaldf = pd.concat([all_topics_df.reset_index(), answers_df.reset_index()], axis = 1)
print(finaldf.shape)
finaldf.head()

Great! Now we can find out the top n answers with the highest dominant topic contribution for each topic. To do this we have to complete the following steps:
1. Identify the unique dominant topics (e.g. 0, 1, 2, and 3)
2. Create a subset of the dataframe where only the answers assigned to a specific topic are kept
3. Sort the subset dataframe in descending order based on the dominant_topic_contribution value
4. Extract the first n rows - that correspond to the top n answers with the highest dominant topic contributions

The steps are repeated for each topic and the data are stored in a dictionary. The separate dataframes for each topic are merged into a unique one.

In [None]:
n_largest = 10
unique_topics = finaldf['dominant_topic'].unique()
largest_contributors_per_topic = {}
for topic in unique_topics:
    topic_df = finaldf[finaldf['dominant_topic'] == topic]
    largest_contibutors_idx = topic_df['dominant_topic_contribution'].sort_values(ascending = False).iloc[:n_largest].index
    largest_contibutors_df = topic_df.loc[largest_contibutors_idx, :]
    largest_contributors_per_topic[topic] = largest_contibutors_df
    
top_opinions_df_ = pd.concat(largest_contributors_per_topic).reset_index(drop=True)
top_opinions_df_.head(5)
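
The same "top n per topic" steps can be followed on toy data with the standard library, which may help if the pandas indexing above feels dense. Everything in `toy_rows` is made up for illustration.

```python
from collections import defaultdict

# Made-up rows: (answer text, dominant topic, dominant topic contribution)
toy_rows = [
    ("bored at home", 0, 0.91),
    ("nothing to do", 0, 0.62),
    ("stress from work", 1, 0.88),
    ("too much free time", 0, 0.75),
    ("anxious about news", 1, 0.70),
]
n_largest = 2

# Step 2: group the answers by their dominant topic
by_topic = defaultdict(list)
for text, topic, contribution in toy_rows:
    by_topic[topic].append((contribution, text))

# Steps 3 and 4: sort each group by contribution (descending) and keep the first n
top_answers = {
    topic: [text for _, text in sorted(rows, reverse=True)[:n_largest]]
    for topic, rows in by_topic.items()
}
print(top_answers)
```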

Topics are sometimes easier to characterise when you read the full text answers that best fit them. Notably, LDA, although a powerful tool, does not take the order of words in a document into account, which is a limitation. Nonetheless, it tends to work quite well. Check the top answers and assess whether your interpretation of the different topics was correct. If it wasn't, update it accordingly.

# DAY 7 CHALLENGE - Decrease in drug use and free text analysis 

For today's challenge, you will use all of the concepts that you learned in this session and apply them to a new dataset. The dataset provides information about why different drug users decided to *decrease* their use of drugs during the pandemic. This information is available in the format of free text answers. The aim of this challenge is to find out what are the most common reasons behind the decrease in drug use, and to identify what is the most dominant one. Concretely, you are expected to do the following:

1. Clean the free text data by completing all the following steps: a) turning all letters to lower case, b) removal of punctuation, special characters and digits, c) tokenization, d) stop words removal, e) lemmatization and f) removal of empty answers
2. Identify the optimal number of topics using the coherence score, together with some qualitative analysis. 
3. Run LDA analysis to identify the most common topics, and interpret the meaning of each topic. Identify the dominant topics across all participants' answers.
4. **BONUS** (You get extra points if you do it, but you won't lose points if you don't do it): try to use different visualization techniques to better represent your topics and their distributions across the answers of the participants.

The data are available in the csv file called *Day7_challenge.csv*