# Latent Dirichlet Allocation (LDA)

🎯 The goal of this challenge is to find topics within a corpus of emails with the **LDA** algorithm (unsupervised learning in NLP).

✉️ Here is a collection of 1K+ ***unlabelled emails***. Let's try to ***extract topics*** from them!

In [1]:
# Standard library
import string

# Data manipulation
import numpy as np
import pandas as pd

# Text preprocessing
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Vectorization and topic modelling
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

In [2]:
# import pandas as pd

url = 'https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/lda_data'

data = pd.read_csv(url, sep=",", header=None)
data.columns = ['text']
data.head()

| | text |
|---|---|
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... |
| 1 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 2 | From: miner@kuhub.cc.ukans.edu\nSubject: Re: A... |
| 3 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 4 | From: vzhivov@superior.carleton.ca (Vladimir Z... |


In [3]:
data.shape

(1199, 1)

## (1) Preprocessing 

❓ **Question (Cleaning)** ❓ You're used to it by now... Clean up! Store the cleaned text in a new column `clean_text` of the DataFrame.

In [4]:
# YOUR CODE HERE
def preprocessing(sentence):

    # Basic cleaning
    sentence = sentence.strip()  # remove leading/trailing whitespace
    sentence = sentence.lower()  # lowercase
    sentence = ''.join(char for char in sentence if not char.isdigit())  # remove digits

    # Advanced cleaning
    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation, '')  # remove punctuation

    tokenized_sentence = word_tokenize(sentence)  # tokenize
    stop_words = set(stopwords.words('english'))  # define stopwords

    # Remove stopwords
    tokens = [w for w in tokenized_sentence if w not in stop_words]

    # Lemmatize each token as a verb, then a noun, then an adjective
    lemmatizer = WordNetLemmatizer()
    for pos in ("v", "n", "a"):
        tokens = [lemmatizer.lemmatize(word, pos=pos) for word in tokens]

    return ' '.join(tokens)


In [5]:
data['clean_text'] = data['text'].apply(preprocessing)

In [6]:
data.clean_text[0]

'gldcunixbcccolumbiaedu gary l dare subject stan fischler summary devil pregame show prior host penguin nntppostinghost cunixbcccolumbiaedu replyto gldcunixbcccolumbiaedu gary l dare organization phd hall line lester patrick award lunch bill torrey mention one option next season president miami team bob clarke work dinner clarke say bad mistake philadelphia let mike keenan go retrospect almost player come realize keenan know take win rumour circulate keenan back fly nick polano sick scapegoat schedule make red wing bryan murray approve gerry meehan john muckler worry sabre prospect assistant lever say sabre get share quebec dynasty emerge mighty duck declare throw money around loosely buy team oiler coach ted green remark guy around fill tie domis skate none fill helmet senator andrew mcbain tell security guard chicago stadium warn stair lead locker room mcbain mouth season professional tumble entire steep flight gld je souviens gary l dare gldcolumbiaedu go winnipeg jet go gldcunixcbi
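Because punctuation is removed before tokenizing, email addresses collapse into single long tokens such as `gldcunixbcccolumbiaedu` in the output above. One optional variant, shown here only as a sketch, is to drop anything that looks like an email address before the rest of the cleaning. The helper name `strip_emails` and the regular expression are assumptions for illustration, not part of the challenge, and the pattern is a rough heuristic rather than a full address parser.

```python
import re

def strip_emails(text):
    """Replace whitespace-delimited chunks containing '@' with a space (rough heuristic)."""
    return re.sub(r"\S+@\S+", " ", text)

# Sample header taken from the first row of the dataset preview
sample = "From: gld@cunixb.cc.columbia.edu (Gary L Dare)"
cleaned = strip_emails(sample)
print(cleaned.split())
```

Calling such a helper at the top of `preprocessing` would keep address fragments out of the vocabulary, at the cost of discarding sender information.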

## (2) Latent Dirichlet Allocation model

❓ **Question (Training)** ❓ Train an LDA model to extract potential topics.

In [23]:
# YOUR CODE HERE
vectorizer = TfidfVectorizer()

vector_doc = vectorizer.fit_transform(data.clean_text)
vector_doc = pd.DataFrame(vector_doc.toarray(), columns=vectorizer.get_feature_names_out())

# vector_doc.head()

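# Note: max_doc_update_iter=100 is already scikit-learn's default (per-document E-step iterations);
# the number of passes over the corpus is controlled by max_iter (default 10).
lda = LatentDirichletAllocation(n_components=5, max_doc_update_iter=100)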
lda.fit(vector_doc)

# lda = make_pipeline(
#     TfidfVectorizer(),
#     LatentDirichletAllocation(n_components=5, max_iter=100)
# )

# lda.fit(data.clean_text)
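LDA's generative model is defined over word counts, so a common alternative to TF-IDF features is raw term counts from `CountVectorizer`. The following is a minimal sketch of that variant on a tiny made-up corpus. The corpus, the variable names, and the choices `n_components=2`, `max_iter=50`, and `random_state=0` are all assumptions for illustration, not the settings used in this notebook.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus: two sports-like and two religion-like documents
toy_corpus = [
    "team win game season player",
    "player score goal game team",
    "god faith church believe bible",
    "church bible faith god pray",
]

# Raw term counts instead of TF-IDF weights
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(toy_corpus)

# random_state makes the fitted topics reproducible between runs
toy_lda = LatentDirichletAllocation(n_components=2, max_iter=50, random_state=0)
doc_topics = toy_lda.fit_transform(counts)

print(doc_topics.round(2))
```

Fixing `random_state` is worth considering on the real data too, since topic indices and weights otherwise change from one fit to the next.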

In [24]:
topic_mixture = lda.transform(vector_doc)

In [25]:
topic_mixture

array([[0.01747368, 0.85395852, 0.017647  , 0.01748331, 0.09343749],
       [0.01856174, 0.92575219, 0.018561  , 0.01856349, 0.01856158],
       [0.01737478, 0.83905309, 0.10885951, 0.01734305, 0.01736956],
       ...,
       [0.02097532, 0.91609204, 0.02097407, 0.02098358, 0.02097499],
       [0.03688933, 0.6631545 , 0.03688919, 0.2261647 , 0.03690229],
       [0.02302838, 0.90788282, 0.02302766, 0.02303151, 0.02302963]])
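Each row of `topic_mixture` is one document's distribution over the 5 topics. A short sketch of reading it, using only the first three rows printed above (the variable names here are illustrative), is to take the index of the largest weight per row:

```python
import numpy as np

# First three rows of the topic mixture shown above
topic_mixture_head = np.array([
    [0.01747368, 0.85395852, 0.017647, 0.01748331, 0.09343749],
    [0.01856174, 0.92575219, 0.018561, 0.01856349, 0.01856158],
    [0.01737478, 0.83905309, 0.10885951, 0.01734305, 0.01736956],
])

# Index of the topic with the largest weight for each document
dominant_topic = topic_mixture_head.argmax(axis=1)
# The corresponding weight
dominant_weight = topic_mixture_head.max(axis=1)

print(dominant_topic, dominant_weight)
```

In every row displayed above, Topic 1 carries the largest weight, which suggests this fit assigns most documents to a single broad topic.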

## (3) Visualize potential topics

🎁 We coded a function for you that prints the words associated with the potential topics.

In [26]:
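def print_topics(model, vectorizer):
    feature_names = vectorizer.get_feature_names_out()  # look up the vocabulary once
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # Indices of the 10 largest weights, in descending order
        print([(feature_names[i], topic[i])
               for i in topic.argsort()[:-10 - 1:-1]])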
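The weights that `print_topics` displays come from `model.components_`, which holds unnormalized pseudo-counts rather than probabilities. Dividing each row by its sum gives a word distribution per topic. The sketch below illustrates this on a small made-up matrix; the vocabulary and numbers are hypothetical and stand in for a fitted model's `components_`.

```python
import numpy as np

# Hypothetical pseudo-counts for 2 topics over a 4-word vocabulary
vocab = np.array(["game", "team", "god", "church"])
components = np.array([
    [8.0, 6.0, 1.0, 1.0],
    [0.5, 0.5, 5.0, 4.0],
])

# Normalize each topic's row so its weights sum to 1
topic_word = components / components.sum(axis=1, keepdims=True)

for idx, dist in enumerate(topic_word):
    top = dist.argsort()[::-1][:2]  # indices of the 2 most probable words
    print("Topic %d:" % idx, [(vocab[i], round(float(dist[i]), 3)) for i in top])
```

Normalizing makes weights comparable across topics, which raw pseudo-counts are not when one topic absorbs most of the corpus.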

❓ **Question** ❓ Print the topics extracted by your LDA.

In [27]:
# YOUR CODE HERE
print_topics(lda, vectorizer)

Topic 0:
[('hrivnak', 3.4499275153106543), ('gtdaprismgatechedu', 3.2913365403199584), ('abc', 2.588033764205411), ('mask', 2.2785376860031468), ('colon', 1.8329385426746028), ('patton', 1.6312052600033031), ('hornet', 1.6312052600032974), ('clement', 1.6022630154399644), ('friedman', 1.5584549261189822), ('rolfe', 1.522587981828814)]
Topic 1:
[('god', 35.84266989136198), ('would', 26.061570310967294), ('go', 25.98749876813292), ('team', 25.271068361078605), ('one', 24.242098294006244), ('game', 24.002986613123056), ('say', 23.442607342549437), ('write', 23.176897741869343), ('line', 22.88049673795609), ('subject', 22.738208282323903)]
Topic 2:
[('espn', 5.258366250928465), ('puck', 4.55264272077918), ('keller', 4.525001129281165), ('keith', 4.254166723248209), ('kkellermailsasupennedu', 4.048930194060726), ('gainey', 3.878797067257158), ('mask', 3.704935846468111), ('shoot', 3.0235559033992447), ('cal', 2.929468018441733), ('chi', 2.87284318851517)]
Topic 3:
[('gm', 3.1987881669899316

## (4) Predict the document-topic mixture of a new text

❓ **Question (Prediction)** ❓

Now that your LDA model is fitted, you can use it to predict the topics of a new text.

1. Vectorize the example
2. Use the LDA on the vectorized example to predict the topics

In [29]:
example = ["My team performed poorly last season. Their best player was out injured and only played one game"]
vect_ex = vectorizer.transform([preprocessing(example[0])])
vect_ex = pd.DataFrame(vect_ex.toarray(), columns=vectorizer.get_feature_names_out())
topic_ex = lda.transform(vect_ex)
topic_ex

array([[0.04932427, 0.8018085 , 0.05013018, 0.04932542, 0.04941162]])

In [30]:
# YOUR CODE HERE
sum([0.04932427, 0.8018085 , 0.05013018, 0.04932542, 0.04941162])

0.9999999900000001
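The sum above falls just short of 1 only because the printed weights are rounded to 8 decimals. A tolerance-based check is a more robust way to confirm that the output is a probability distribution over topics. This is a small sketch that reuses the values printed above; the variable names are illustrative.

```python
import numpy as np

# Document-topic weights printed for the example sentence
topic_ex_values = np.array([0.04932427, 0.8018085, 0.05013018, 0.04932542, 0.04941162])

total = topic_ex_values.sum()

# Non-negative weights that sum to 1 (within floating-point tolerance)
is_distribution = bool(np.isclose(total, 1.0) and (topic_ex_values >= 0).all())

# Most likely topic for the example
best_topic = int(topic_ex_values.argmax())

print(total, is_distribution, best_topic)
```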

🏁 Congratulations! You now know how to quickly implement an LDA model.

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!

In [None]:
!git add lda.ipynb

! git commit -m "Latently allocated "

! git push origin master