# Latent Dirichlet Allocation

In [18]:
import pickle
import string
import pandas as pd
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

with open("data_pickle", "rb") as pkl_reviews:
    data = pickle.load(pkl_reviews)

data.columns = ['target', 'text']

data.head()

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/vincent/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/vincent/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/vincent/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/vincent/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


  target                                               text
0    neg  plot : two teen couples go to a church party ,...
1    neg  the happy bastard's quick movie review \ndamn ...
2    neg  it is movies like these that make a jaded movi...
3    neg   " quest for camelot " is warner bros . ' firs...
4    neg  synopsis : a mentally unstable man undergoing ...


The data is a collection of movie reviews, each labelled as positive or negative. Let's set the labels aside and try to extract topics from the text!

## Preprocessing 

👇 You're used to it by now... Clean up! Store the cleaned text in a new dataframe column "clean_text".

In [19]:
def lower_all(text):
    return text.lower()

def remove_punctuations(text):
    # Drop every ASCII punctuation character
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

def remove_numbers(text):
    # Keep every character that is not a digit
    return "".join(char for char in text if not char.isdigit())

def remove_stopwords(text):
    # `stop` is defined in the next cell, before this function is applied
    text_tokens = word_tokenize(text)
    tokens_without_sw = [word for word in text_tokens if word not in stop]
    return " ".join(tokens_without_sw)

def lemmatize(text):
    # `lemmatizer` is defined in the next cell, before this function is applied
    return " ".join(lemmatizer.lemmatize(word) for word in text.split())

In [20]:
lemmatizer = WordNetLemmatizer()
stop = set(stopwords.words("english"))  # a set makes membership tests fast

data["clean_text"] = data['text'].apply(remove_punctuations)
data["clean_text"] = data['clean_text'].apply(lower_all)
data["clean_text"] = data["clean_text"].apply(remove_numbers)
data["clean_text"] = data["clean_text"].apply(remove_stopwords)
data["clean_text"] = data["clean_text"].apply(lemmatize)
print(data.head())

  target                                               text  \
0    neg  plot : two teen couples go to a church party ,...   
1    neg  the happy bastard's quick movie review \ndamn ...   
2    neg  it is movies like these that make a jaded movi...   
3    neg   " quest for camelot " is warner bros . ' firs...   
4    neg  synopsis : a mentally unstable man undergoing ...   

                                          clean_text  
0  plot two teen couple go church party drink dri...  
1  happy bastard quick movie review damn yk bug g...  
2  movie like make jaded movie viewer thankful in...  
3  quest camelot warner bros first featurelength ...  
4  synopsis mentally unstable man undergoing psyc...  
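
The punctuation and digit steps above each scan the text several times. As a possible alternative, here is a minimal sketch that removes both in a single pass with `str.translate`. The helper name `strip_punct_and_digits` is hypothetical and not part of the notebook; stop-word removal and lemmatization would still run afterwards.

```python
import string

# Translation table that deletes every ASCII punctuation character and digit
_DELETE_TABLE = str.maketrans("", "", string.punctuation + string.digits)

def strip_punct_and_digits(text):
    """Lowercase the text, delete punctuation and digits, collapse whitespace."""
    stripped = text.lower().translate(_DELETE_TABLE)
    # split() with no argument also removes the double spaces left behind
    return " ".join(stripped.split())

print(strip_punct_and_digits("It's 2 good, 4 real!"))  # its good real
```

Like `remove_punctuations`, this only handles ASCII punctuation; typographic quotes or dashes would pass through unchanged.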


## Latent Dirichlet Allocation model

👇 Train an LDA model to extract potential topics.

In [21]:
vectorizer = TfidfVectorizer().fit(data["clean_text"])
data_vec = vectorizer.transform(data["clean_text"])

# n_components = number of topics (2 here)
lda = LatentDirichletAllocation(n_components=2).fit(data_vec)
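
LDA is usually described as a model over word counts, so raw term counts from `CountVectorizer` are a common input instead of TF-IDF weights. The sketch below shows that variant on a made-up toy corpus (the documents and the names `toy_docs`, `count_vectorizer` and `toy_lda` are illustrative assumptions, not part of this notebook). It also fixes `random_state`, since LDA results otherwise change from run to run.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up toy corpus, for illustration only
toy_docs = [
    "film scene actor plot film",
    "actor scene film director",
    "goal match team player goal",
    "team player match season",
]

# Raw term counts rather than TF-IDF weights
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(toy_docs)

# random_state makes the fitted topics reproducible
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = toy_lda.fit_transform(counts)

print(doc_topics.shape)           # one row per document, one column per topic
print(toy_lda.components_.shape)  # one row per topic, one column per vocabulary word
```

Whether counts give more interpretable topics than TF-IDF on the review data is something to check empirically; this sketch only shows how the swap would look.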

## Visualize potential topics

👇 The function to print the words associated with the potential topics is already made for you. You just have to pass the correct arguments!

In [25]:
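def print_topics(model, vectorizer):
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print(f"Topic {idx}:")
        # Ten highest-weighted words for this topic, largest first
        print([(feature_names[i], topic[i])
               for i in topic.argsort()[:-10 - 1:-1]])

print_topics(lda, vectorizer)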

Topic 0:
[('horrendous', 0.511793834182354), ('extraordinarily', 0.5115869387863768), ('kermit', 0.5084480816390747), ('marquis', 0.5084205772975534), ('sade', 0.5082236972242369), ('hrundi', 0.5078988112659871), ('alessa', 0.5078782336001176), ('camembert', 0.507469759788167), ('nbsp', 0.5073425602681043), ('trumpet', 0.5072231354454846)]
Topic 1:
[('film', 110.23068358090568), ('movie', 80.78641071388188), ('one', 59.75042790160767), ('character', 46.95186262866856), ('like', 43.91426354569686), ('get', 37.42624134469493), ('time', 36.71898760842922), ('scene', 36.45305609482864), ('story', 34.70808637930863), ('good', 34.42277968390615)]
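
The slice `topic.argsort()[:-10 - 1:-1]` in `print_topics` is compact but easy to misread: `argsort` orders indices from smallest to largest weight, and the negative-step slice walks backwards from the end to pick the ten largest. A small sketch with made-up weights (the helper `top_k_indices` and the toy array are assumptions for illustration):

```python
import numpy as np

def top_k_indices(weights, k=10):
    """Return the indices of the k largest weights, largest first."""
    # argsort is ascending, so read the last k entries in reverse order
    return np.argsort(weights)[:-k - 1:-1]

# Made-up weights for a four-word vocabulary
toy_weights = np.array([0.1, 5.0, 3.0, 0.2])
print(top_k_indices(toy_weights, k=2))  # [1 2]

# Each row of components_ can be normalized into a word distribution
toy_components = np.array([[0.5, 0.5, 0.5, 0.5],
                           [6.0, 2.0, 1.5, 0.5]])
word_probs = toy_components / toy_components.sum(axis=1, keepdims=True)
print(word_probs.sum(axis=1))  # each row now sums to 1
```

Normalized rows are easier to compare across topics than the raw weights printed above, which differ widely in scale between Topic 0 and Topic 1.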


## Predict topic of new text

👇 You can now use your LDA model to predict the topic of a new text. First, use your vectorizer to vectorize the example. Then, use your LDA model to predict the topic of the vectorized example.

Here is the text used for the prediction:

In [33]:
print(data["text"].iloc[0])

plot : two teen couples go to a church party , drink and then drive . 
they get into an accident . 
one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . 
what's the deal ? 
watch the movie and " sorta " find out . . . 
critique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . 
which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly . 
they seem to have taken this pretty neat concept , but executed it terribly . 
so what are the problems with the movie ? 
well , its main problem is that it's simply too jumbled . 
it starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience member , have no id

And here is the resulting prediction:

In [32]:
test_text = data["clean_text"].iloc[0]
vec = vectorizer.transform([test_text])
print(lda.transform(vec))

[[0.03558707 0.96441293]]
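
`lda.transform` returns one row per document with one weight per topic, and each row sums to 1. To reduce that to a single topic index, one option is to take the `argmax` of each row. A minimal sketch using the values printed above (the variable names are illustrative):

```python
import numpy as np

# Topic distribution copied from the output above
topic_probs = np.array([[0.03558707, 0.96441293]])

# Index of the most probable topic for each document
dominant_topic = topic_probs.argmax(axis=1)
print(dominant_topic)  # [1]
```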


We conclude that the model clearly assigns this text to topic 1 (the second topic). However, looking at the most representative words of each topic, we see that the topics do not correspond to the "positive" and "negative" labels present in our dataset. The model has therefore defined its two topics according to other criteria that remain opaque to us.
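
One way to check this kind of claim beyond inspecting top words is to cross-tabulate each document's dominant topic against its label. The sketch below uses a made-up four-row frame, so the counts are not results from this notebook; the column name `dominant_topic` is an assumption.

```python
import pandas as pd

# Made-up toy values, for illustration only
toy = pd.DataFrame({
    "target": ["neg", "neg", "pos", "pos"],
    "dominant_topic": [0, 1, 1, 1],
})

# Rows are labels, columns are topics, cells are document counts
table = pd.crosstab(toy["target"], toy["dominant_topic"])
print(table)
```

On the real data, the topic column could be filled with something like `lda.transform(data_vec).argmax(axis=1)`. If topics tracked sentiment, each label would concentrate in a different column; similar proportions across rows would support the conclusion above.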