### What is Topic Modeling
In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear in documents about cats, and "the" and "is" will appear approximately equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. The "topics" produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is.
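
The mixing intuition above can be sketched in a few lines of plain Python. This is an illustration only: the two topics and their word probabilities are made-up values, not something learned from the dataset, and `word_probabilities` is a hypothetical helper.

```python
# Hypothetical per-topic word distributions (each sums to 1).
topics = {
    "dogs": {"dog": 0.4, "bone": 0.4, "the": 0.1, "is": 0.1},
    "cats": {"cat": 0.4, "meow": 0.4, "the": 0.1, "is": 0.1},
}

def word_probabilities(topic_mix):
    """Mix the topic word distributions by the document's topic proportions."""
    probs = {}
    for topic, weight in topic_mix.items():
        for word, p in topics[topic].items():
            probs[word] = probs.get(word, 0.0) + weight * p
    return probs

# A document that is 90% about dogs and 10% about cats.
doc_probs = word_probabilities({"dogs": 0.9, "cats": 0.1})
dog_mass = doc_probs["dog"] + doc_probs["bone"]
cat_mass = doc_probs["cat"] + doc_probs["meow"]
print(dog_mass / cat_mass)  # roughly 9: about nine dog words per cat word
print(doc_probs["the"])     # shared words keep the same probability in any mix
```

A topic model works in the opposite direction: it only observes the word counts and tries to infer both the topic word distributions and each document's mix.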

## Import Libraries

In [None]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from sklearn.manifold import TSNE
import matplotlib.colors as mcolors
import gensim
from gensim.utils import simple_preprocess
from gensim.models import Phrases, phrases, ldamodel, CoherenceModel
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
import spacy
import gensim.corpora as corpora
from pprint import pprint
import pyLDAvis
# pyLDAvis 3.3 renamed this module to pyLDAvis.gensim_models; adjust the import on newer installs
import pyLDAvis.gensim
from collections import Counter

In [None]:
data_df = pd.read_csv('/kaggle/input/spooky-author-identification/train.zip')
data_df.head()

In [None]:
# shape of the dataset
data_df.shape

In [None]:
# number of texts for each author
sns.countplot(x='author', data=data_df)

In [None]:
# count null values in each column
data_df.isnull().sum()

We don't need the id column, so we drop it.

In [None]:
# drop the id column (columns= already implies axis=1)
data_df = data_df.drop(columns=['id'])
data_df.head()

## Visualization

In [None]:
data_df['Number_of_words'] = data_df['text'].apply(lambda x:len(str(x).split()))
data_df

In [None]:
plt.style.use('ggplot')
plt.figure(figsize=(12,6))
# histplot replaces the deprecated distplot (requires seaborn >= 0.11)
sns.histplot(data_df['Number_of_words'], kde=False, color="red", bins=200)
plt.title("Distribution of the number of words per text", size=20)

In [None]:
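# join every row; str(data_df["text"]) would only yield pandas' truncated preview of the column
all_text = " ".join(data_df["text"].astype(str))
cloud=WordCloud(colormap="winter",width=600,height=400).generate(all_text)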
fig=plt.figure(figsize=(13,18))
plt.axis("off")
plt.imshow(cloud,interpolation='bilinear')

Now it's time to clean the text

## Data Cleaning

Clean the text by lowercasing it and removing punctuation, other special characters, and stopwords. Digits are kept by the code below.

In [None]:
# strip basic punctuation (raw string avoids the invalid "\." escape warning)
data_df['text_processed'] = data_df['text'].map(lambda x: re.sub(r'[,.!?]', '', x))
data_df['text_processed'] = data_df['text_processed'].map(lambda x:x.lower())
print(data_df['text_processed'].head())

In [None]:
# replace every run of non-alphanumeric characters with a single space (letters and digits are kept)
def cleanText(input_string):
    modified_string = re.sub('[^A-Za-z0-9]+', ' ', input_string)
    return modified_string
data_df['text_processed'] = data_df.text_processed.apply(cleanText)
data_df['text_processed'][150]

In [None]:
# remove stopwords (a set makes the membership test below much faster than a list)
stopWords = set(stopwords.words('english'))
def removeStopWords(stopWords, rvw_txt):
    newtxt = ' '.join([word for word in rvw_txt.split() if word not in stopWords])
    return newtxt
data_df['text_processed'] = [removeStopWords(stopWords,x) for x in data_df['text_processed']]
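
As a quick check of what these cleaning steps do to a single sentence, here is a condensed sketch that uses only the standard library. It is not the notebook's exact code: `clean_text` is a hypothetical helper that folds the steps into one function, and `STOP_WORDS` is a tiny illustrative set standing in for NLTK's full English list.

```python
import re

# Tiny illustrative stopword set; the notebook uses NLTK's English list instead.
STOP_WORDS = {"the", "a", "of", "and", "was", "it", "in"}

def clean_text(text, stop_words=STOP_WORDS):
    """Lowercase, replace non-alphanumeric runs with spaces, and drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return " ".join(word for word in text.split() if word not in stop_words)

print(clean_text("It was the 3rd night, and the House of Usher fell!"))
# 3rd night house usher fell
```

Note that digits survive the cleaning ("3rd"), which matches the regular expression used in the cell above.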

## WordCloud

In [None]:
# join the texts with spaces so that words from adjacent rows are not fused when the string is split later
longText = ' '.join(list(data_df['text_processed'].values))
# generate the word cloud
wordcloud = WordCloud(background_color="black",
                      max_words=600,
                      contour_width=10,
                      contour_color="steelblue",
                      collocations=False).generate(longText)
# visualize the word cloud
fig = plt.figure(1, figsize = (12, 12))
plt.axis('off')
plt.imshow(wordcloud)
plt.show()


In [None]:
fig = plt.figure(1, figsize = (20,10))
# split() returns a list of all the words in the string
split_it = longText.split()
# count the words; use a new name so the Counter class is not overwritten
# (rebinding Counter would make this cell fail on a second run)
word_counts = Counter(split_it)
# most_common(k) returns the k most frequent words with their counts
most_occur = word_counts.most_common(30)
x_df = pd.DataFrame(most_occur, columns=("words","count"))
sns.barplot(x = 'words', y = 'count', data = x_df)

## Preparing data for Topic Modelling

First we tokenize the text, then we lemmatize the tokens.

In [None]:
nltk.download("punkt")
# word_tokenize 
data_df["tokenized"] = data_df["text_processed"].apply(lambda x: nltk.word_tokenize(x))
data_df["tokenized"] = data_df["tokenized"].apply(lambda words: [word for word in words if word.isalnum()])
data_df


In [None]:
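from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()  # create once and reuse for every token
def word_lemmatizer(text):
    # pos='v' lemmatizes every token as a verb (e.g. "running" -> "run")
    lem_text = [lemmatizer.lemmatize(i, pos='v') for i in text]
    return lem_text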
data_df["lemmatized"] = data_df["tokenized"].apply(lambda x: word_lemmatizer(x))
data_df["lemmatize_joined"] = data_df["lemmatized"].apply(lambda x: ' '.join(x))
pd.set_option('display.max_colwidth', 100)
data_df.head()

Now let's look at the 30 most frequent words.

In [None]:
plt.style.use('ggplot')
plt.figure(figsize=(14,6))
freq=pd.Series(" ".join(data_df["lemmatize_joined"]).split()).value_counts()[:30]
freq.plot(kind="bar", color = "orangered")
plt.title("30 most frequent words",size=20)

**Vectorization using Word2Vec**

In [None]:
tokens = data_df["lemmatize_joined"].apply(lambda x: nltk.word_tokenize(x))
tokens

In [None]:
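from gensim.models import Word2Vec
w2v_model = Word2Vec(tokens,
                     min_count=20,      # ignore words that occur fewer than 20 times
                     window=10,         # context window size
                     size=250,          # embedding dimension (gensim < 4.0; renamed vector_size in 4.0)
                     alpha=0.03,        # initial learning rate
                     min_alpha=0.0007,  # final learning rate
                     workers=4,
                     seed=42)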


## Model

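LDA takes a document-term matrix as input. The code below builds it by mapping each token to an integer id with a gensim `Dictionary` and converting every document into a bag-of-words list of (token id, count) pairs.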

In [None]:
dictionary = corpora.Dictionary(data_df["lemmatized"])
doc_term_matrix = [dictionary.doc2bow(rev) for rev in data_df["lemmatized"]]

LDA = gensim.models.ldamodel.LdaModel

# Build LDA model
lda_model = LDA(corpus=doc_term_matrix, id2word=dictionary, num_topics=3, random_state=100,
                chunksize=200, passes=100)
lda_model.print_topics()
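
To make the bag-of-words format concrete, here is a minimal pure-Python sketch of the idea behind a token-to-id dictionary and a `doc2bow`-style conversion. The two toy documents, `token2id`, and `to_bow` are illustrative assumptions; gensim's own `Dictionary` may assign ids in a different order.

```python
from collections import Counter

# Two toy tokenized documents (illustrative, not from the dataset).
docs = [["raven", "night", "raven"], ["night", "sea"]]

# Assign each distinct token an integer id in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

bow_corpus = [to_bow(doc) for doc in docs]
print(token2id)    # {'raven': 0, 'night': 1, 'sea': 2}
print(bow_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Each inner list is one row of the document-term matrix in sparse form: only tokens that actually occur in the document are stored.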

In [None]:
# Coherence score (c_v)
coherence_model_lda = CoherenceModel(model=lda_model,
                                     texts=data_df["lemmatized"], dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

# log_perplexity returns the per-word likelihood bound, not the perplexity itself
print('\nPer-word likelihood bound: ', lda_model.log_perplexity(doc_term_matrix))
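
gensim's `log_perplexity` returns a per-word likelihood bound (usually a negative number) rather than the perplexity itself; gensim reports the perplexity as 2 raised to the negative bound. A minimal sketch of that conversion follows. The helper name `perplexity_from_bound` and the value `-8.0` are illustrative assumptions, not results from this notebook.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word likelihood bound into a perplexity value."""
    return 2 ** (-per_word_bound)

# Hypothetical bound, chosen only to make the arithmetic easy to follow.
print(perplexity_from_bound(-8.0))  # 256.0
```

Lower perplexity (a bound closer to zero) means the model assigns higher likelihood to the corpus, but it does not always track how interpretable the topics are, which is why the coherence score is reported alongside it.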

In [None]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, doc_term_matrix, dictionary)
pyLDAvis.display(vis)