### NLP Workshop Part 2

#### Topic Modeling via LDA:
- Topic modelling in natural language processing is a technique that infers the topics present in a corpus from the words its documents contain. It matters because, as the volume of text data grows, it has become increasingly important to categorize documents automatically.
- Clustering the transcriptions of the conversation turns in a dialogue in an unsupervised fashion helps us quickly understand the topics of the conversation.
  - **Ideas**:
    - LDA
    - Unsupervised transformers
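
To make the idea behind LDA more concrete, here is a minimal sketch of the generative story it assumes: each document has its own mixture of topics, and each word is drawn by first picking a topic from that mixture and then picking a word from that topic. The vocabulary, the two topic distributions, and the helper names (`sample_dirichlet`, `generate_document`) are made up for illustration; a real LDA model learns the topic distributions from data rather than having them written by hand.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word_probs, vocab, alpha, n_words, rng):
    """Generate one document following LDA's assumed generative process."""
    # 1. Draw the document's topic mixture.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word from that topic's word distribution.
        word = rng.choices(vocab, weights=topic_word_probs[topic])[0]
        words.append(word)
    return theta, words

# Toy vocabulary and two hand-written topics (illustrative values, not learned).
vocab = ["goal", "team", "match", "vote", "party", "election"]
topic_word_probs = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a "sports"-like topic
    [0.0, 0.0, 0.0, 0.3, 0.3, 0.4],  # a "politics"-like topic
]

rng = random.Random(0)
theta, doc = generate_document(topic_word_probs, vocab, alpha=[0.5, 0.5], n_words=8, rng=rng)
print("topic mixture:", [round(p, 3) for p in theta])
print("document:", doc)
```

Fitting LDA runs this story in reverse: given only the documents, it estimates the topic mixtures and the per-topic word distributions that most plausibly produced them.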

### Code Demo: Unsupervised mechanisms to group corpora into concepts/clusters

- LDA [What is it?](https://medium.com/analytics-vidhya/topic-modelling-using-lda-aa11ec9bec13)  |  [Link to package](https://radimrehurek.com/gensim/models/ldamodel.html)  |  [MultiCore Fast LDA](https://radimrehurek.com/gensim/models/ldamulticore.html)
- Unsupervised transformers [What is it?](https://jalammar.github.io/illustrated-transformer/)  |  [Hugging Face](https://huggingface.co/docs/transformers/index)

### LDA

In [None]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim.models.ldamodel import LdaModel
from gensim.corpora import Dictionary

from nltk.stem import WordNetLemmatizer, SnowballStemmer

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

In [None]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

In [None]:
stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()  # create once and reuse for every token

def lemmatize_stemming(text):
    # Lemmatize the token as a verb, then reduce it to its stem
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

# Tokenize, drop stopwords and short tokens, then lemmatize and stem
def preprocess(text):
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

In [None]:
transcription_file = pd.read_csv('Transcription_E04.csv') # can swap for a different data file 
transcription_file = transcription_file[transcription_file['prompt_id']=='7c9053bd-3d78-4dc4-8canb-74947po3bdf0']

response_list = []

for doc in transcription_file.text_response:
    response_list.append(preprocess(doc))

print(response_list)

In [None]:
# set up the dictionary
id2word = Dictionary(response_list)

# set up the corpus 
corpus = [id2word.doc2bow(text) for text in response_list]

In [None]:
# check corpus data
print(corpus[:1])

In [None]:
# check dictionary + corpus output
[[(id2word[i], freq) for i, freq in doc] for doc in corpus[:1]]
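
The dictionary and corpus cells above rely on the bag-of-words representation: each document becomes a list of `(token_id, count)` pairs. The sketch below reproduces that idea with only the standard library, assuming a toy pair of documents. The helper names `build_vocab` and `to_bow` are hypothetical, and the ids are assigned in order of first appearance, which may differ from the ids gensim's `Dictionary` assigns.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Toy tokenized documents (illustrative only).
docs = [["topic", "model", "topic"], ["model", "corpus"]]
token2id = build_vocab(docs)
bow = [to_bow(d, token2id) for d in docs]

print(token2id)  # {'topic': 0, 'model': 1, 'corpus': 2}
print(bow)       # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded in this representation; only how often each token occurs in each document is kept, which is all LDA uses.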

In [None]:
# set up the model 
lda_model = LdaModel(corpus=corpus,
                   id2word=id2word,
                   num_topics=2, # can change the number of topics 
                   random_state=100,
                   update_every=1,
                   chunksize=100,
                   alpha='auto',
                   per_word_topics=True)

In [None]:
# print out and check the two topics 

for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic))
    print("\n")
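
Each topic printed above is a single string of the form `weight*"word" + weight*"word" + ...`. If the weights are needed as numbers, for example to build a table or a plot, a small parser like the sketch below can split that string into pairs. The example string is made up for illustration and the helper name `parse_topic_string` is hypothetical.

```python
import re

def parse_topic_string(topic):
    """Turn '0.040*"game" + 0.025*"team"' into [('game', 0.04), ('team', 0.025)]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic)
    return [(word, float(weight)) for weight, word in pairs]

# Made-up topic string in the same format that print_topics produces.
example = '0.040*"game" + 0.025*"team" + 0.010*"play"'
print(parse_topic_string(example))  # [('game', 0.04), ('team', 0.025), ('play', 0.01)]
```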

In [None]:
# visualize the topics 
pyLDAvis.enable_notebook()
gensimvis.prepare(lda_model, corpus, id2word)

## Team Discussion:
1. What are your findings? 
2. What are the pros and cons of this approach? 

In [None]:
# load a pre-defined dataset (20 Newsgroups) with separate train and test splits
newsgroups_train = fetch_20newsgroups(subset='train', shuffle = True)
newsgroups_test = fetch_20newsgroups(subset='test', shuffle = True)

In [None]:
X_train, y_train, y_train_names = newsgroups_train['data'], newsgroups_train['target'], newsgroups_train['target_names'] 
X_test, y_test, y_test_names = newsgroups_test['data'], newsgroups_test['target'], newsgroups_test['target_names']

In [None]:
# let's look at the data (first few documents only; the training set has thousands)
for doc in X_train[:3]:
    print(doc)

In [None]:
# preprocess the training and testing documents
processed_X_train = []
processed_X_test = []

for doc in X_train:
    processed_X_train.append(preprocess(doc))

for doc in X_test:
    processed_X_test.append(preprocess(doc))

In [None]:
'''
Create a dictionary from 'processed_X_train' that maps each word in the training set 
to an integer id and tracks how often it appears, using gensim.corpora.Dictionary, and call it 'dictionary'
'''

dictionary = gensim.corpora.Dictionary(processed_X_train)

In [None]:
dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n= 100000)
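
The `filter_extremes` call above prunes the vocabulary by document frequency: tokens appearing in fewer than `no_below` documents or in more than a fraction `no_above` of all documents are dropped, and at most `keep_n` of the most frequent remaining tokens are kept. The sketch below mimics that rule in plain Python on a toy corpus, so the effect of each parameter is easier to see. The function name `filter_by_doc_freq` is hypothetical and this is an approximation of the behaviour, not gensim's implementation.

```python
def filter_by_doc_freq(tokenized_docs, no_below, no_above, keep_n):
    """Keep tokens whose document frequency lies in [no_below, no_above * num_docs]."""
    num_docs = len(tokenized_docs)

    # Count in how many documents each token appears (not total occurrences).
    doc_freq = {}
    for doc in tokenized_docs:
        for token in set(doc):
            doc_freq[token] = doc_freq.get(token, 0) + 1

    max_docs = no_above * num_docs
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]

    # Most frequent first (ties broken alphabetically), then truncate to keep_n.
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return kept[:keep_n]

# Toy corpus (illustrative only).
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
    ["the", "bird", "flew"],
]
print(filter_by_doc_freq(docs, no_below=2, no_above=0.5, keep_n=10))  # ['cat', 'sat']
```

Here "the" is removed for appearing in every document, and the words seen only once are removed for being too rare, which is the same trade-off the `no_below=15, no_above=0.1` setting makes on the newsgroups data.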

In [None]:
'''
Checking dictionary created
'''
count = 0
for k, v in dictionary.items():
    print(k, v)
    count += 1
    if count > 10:
        break

In [None]:
'''
Create the bag-of-words representation of each document, i.e. for each document, a list of 
(word id, count) pairs reporting which words appear and how many times. Save this to 'bow_corpus'
'''
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_X_train]

In [None]:
print(processed_X_train[0])
for k, v in bow_corpus[0]:
    print(dictionary[k], v)

In [None]:
# The parameters can be tweaked based on the understanding of the data.

# For instance, the number of topics can be set to what we expect the data to divide into.
# Passes can be experimented with based on the data.

lda_model =  gensim.models.LdaMulticore(bow_corpus, 
                                   num_topics = 8, 
                                   id2word = dictionary,                                    
                                   passes = 10,
                                   workers = 2)

In [None]:
# Display learned topics

for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic ))
    print("\n")

# LDA outputs the topics it learned from the data. For each topic, it shows the words and their corresponding
# importance.

In [None]:
# visualization
pyLDAvis.enable_notebook()
gensimvis.prepare(lda_model, bow_corpus, dictionary)

In [None]:
# LDA inference on test document

num = 100 #10
doc = processed_X_test[num]
# Data preprocessing step for the unseen document
bow_vector = dictionary.doc2bow(doc)

print("Original doc----->\n")
print(X_test[num])

print("Topic inferred by LDA----->\n")
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

In [None]:
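# Update the existing LDA model on a new corpus of data.
# 'other_corpus' must be a bag-of-words corpus built with the same dictionary;
# here the preprocessed test documents are used as an illustrative example.
other_corpus = [dictionary.doc2bow(doc) for doc in processed_X_test]
lda_model.update(other_corpus)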