<a href="https://colab.research.google.com/github/rishuatgithub/MLPy/blob/master/Topic_Modelling_LDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modelling using Latent Dirichlet Allocation

- Blog URL : [Topic Modelling : Latent Dirichlet Allocation, an introduction](https://anotherreeshu.wordpress.com)
- Author   : Rishu Shrivastava

In [0]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

### Step 1: Loading Data

In this step we load the data into a pandas DataFrame.

- Dataset Kernel : https://www.kaggle.com/hengzheng/news-category-classifier-val-acc-0-65

In [30]:
filename = '/content/drive/My Drive/Colab Notebooks/data/news_data/News_Category_Dataset_v2.json'
data = pd.read_json(filename, lines=True)
data.head()

| | category | headline | authors | link | short_description | date |
|---|---|---|---|---|---|---|
| 0 | CRIME | There Were 2 Mass Shootings In Texas Last Week... | Melissa Jeltsen | https://www.huffingtonpost.com/entry/texas-ama... | She left her husband. He killed their children... | 2018-05-26 |
| 1 | ENTERTAINMENT | Will Smith Joins Diplo And Nicky Jam For The 2... | Andy McDonald | https://www.huffingtonpost.com/entry/will-smit... | Of course it has a song. | 2018-05-26 |
| 2 | ENTERTAINMENT | Hugh Grant Marries For The First Time At Age 57 | Ron Dicker | https://www.huffingtonpost.com/entry/hugh-gran... | The actor and his longtime girlfriend Anna Ebe... | 2018-05-26 |
| 3 | ENTERTAINMENT | Jim Carrey Blasts 'Castrato' Adam Schiff And D... | Ron Dicker | https://www.huffingtonpost.com/entry/jim-carre... | The actor gives Dems an ass-kicking for not fi... | 2018-05-26 |
| 4 | ENTERTAINMENT | Julianna Margulies Uses Donald Trump Poop Bags... | Ron Dicker | https://www.huffingtonpost.com/entry/julianna-... | The "Dietland" actress said using the bags is ... | 2018-05-26 |
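
The `lines=True` argument above tells pandas to treat the file as JSON Lines, with one JSON object per line. A minimal sketch of that behaviour, using a hypothetical two-record in-memory sample rather than the real dataset file:

```python
import io

import pandas as pd

# Hypothetical two-record sample in JSON Lines form (one JSON object per line).
# The field values are made up for illustration.
sample = io.StringIO(
    '{"category": "CRIME", "short_description": "First description."}\n'
    '{"category": "ENTERTAINMENT", "short_description": "Second description."}\n'
)

# lines=True makes pandas parse each line as a separate record (row).
sample_df = pd.read_json(sample, lines=True)

print(sample_df.shape)
print(list(sample_df.columns))
```

Each line becomes one row, and the keys of the JSON objects become the columns.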


- For the purpose of this blog, we will look only at the `short_description` column and try to generate topics from it using topic modelling (LDA).

In [0]:
# Select only the column of interest: short_description (a pandas Series)

df = data['short_description']

In [51]:
df.head()

0    She left her husband. He killed their children...
1                             Of course it has a song.
2    The actor and his longtime girlfriend Anna Ebe...
3    The actor gives Dems an ass-kicking for not fi...
4    The "Dietland" actress said using the bags is ...
Name: short_description, dtype: object

In [52]:
df.shape

(200853,)

### Step 2: Pre-processing data

**a. Applying Count Vectorizer to pre-process the data into vectors.**

In the parameters of CountVectorizer(), we define max_df, min_df and stop_words.

- max_df : Ignore words that occur in more than 95% of the documents in the corpus.
- min_df : Keep only words that occur in at least this many documents. The code below sets min_df=1, which keeps every word; raising it to 2 would drop words that appear in only one document.
- stop_words : Remove English stop words. This can be done as a separate step or, as here, in the same step.
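
To make the effect of these parameters concrete, here is a minimal sketch on a hypothetical four-document corpus (not taken from the news dataset). It assumes `min_df=2` so that the document-frequency filter is visible on such a small example; the variable names `toy_docs`, `toy_cv` and `toy_counts` are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, made up for illustration.
toy_docs = [
    "the market rallied sharply",
    "the market fell sharply",
    "the team won the match",
    "a rare comet appeared",
]

# min_df=2 keeps only words found in at least 2 documents.
# max_df=0.95 drops words found in more than 95% of the documents.
# stop_words="english" removes common words such as "the".
toy_cv = CountVectorizer(max_df=0.95, min_df=2, stop_words="english")
toy_counts = toy_cv.fit_transform(toy_docs)

# Words that survived the filters, and the resulting matrix shape.
print(sorted(toy_cv.vocabulary_))
print(toy_counts.shape)
```

Words that appear in a single document (for example "comet") are filtered out by `min_df=2`, and "the" is removed as a stop word.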

In [0]:
df_cv = CountVectorizer(max_df=0.95, min_df=1, stop_words='english')

In [0]:
df_cv_transformed = df_cv.fit_transform(df)

In [55]:
df_cv_transformed

<200853x73729 sparse matrix of type '<class 'numpy.int64'>'
	with 1922096 stored elements in Compressed Sparse Row format>

*Note that the transformed dataset is a sparse matrix of dimension 200853x73729, where 200853 is the number of rows (documents) and 73729 is the vocabulary size (the number of distinct words kept).*
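
The same row and column interpretation can be checked on a tiny example. This is a sketch on a hypothetical three-document corpus, with illustrative variable names, showing that rows correspond to documents and columns to distinct words:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, made up for illustration.
toy_docs = ["cats chase mice", "dogs chase cats", "mice eat cheese"]

toy_cv = CountVectorizer()
toy_matrix = toy_cv.fit_transform(toy_docs)

n_docs, n_terms = toy_matrix.shape
print(n_docs, n_terms)       # rows = documents, columns = distinct words
print(toy_matrix.nnz)        # number of stored (non-zero) counts
print(toy_matrix.toarray())  # dense view; only practical for tiny examples
```

Only the non-zero counts are stored, which is why a matrix with roughly 200853 x 73729 cells can hold just 1922096 elements.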

### Step 3: Building the Latent Dirichlet Allocation model using scikit-learn

In [0]:
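# 10 topics; random_state fixes the seed. batch_size only takes effect with learning_method='online'.
lda_model = LatentDirichletAllocation(n_components=10, batch_size=128, random_state=42)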

In [0]:
lda_model.fit(df_cv_transformed)
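
As a rough illustration of what fitting produces, here is a minimal sketch on a hypothetical six-document corpus. It assumes `learning_method="online"`, since the scikit-learn documentation describes `batch_size` as used only in online learning; the toy data, the choice of two topics and the variable names are illustrative, not the notebook's settings.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, made up for illustration.
toy_docs = [
    "stocks market trading shares",
    "market shares investors stocks",
    "football match goal team",
    "team players football season",
    "investors trading market profit",
    "season goal match players",
]
toy_counts = CountVectorizer(stop_words="english").fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(
    n_components=2,             # number of topics (assumed for the toy data)
    learning_method="online",   # batch_size is only used in online learning
    batch_size=2,
    random_state=42,
)
# fit_transform fits the model and returns the document-topic distribution.
doc_topics = toy_lda.fit_transform(toy_counts)

print(toy_lda.components_.shape)  # (n_components, vocabulary size)
print(doc_topics.shape)           # (number of documents, n_components)
print(doc_topics.sum(axis=1))     # each row is a normalized topic distribution
```

The fitted `components_` array holds one row of word weights per topic, and each row of `doc_topics` gives a document's mixture over topics.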