<a href="https://colab.research.google.com/github/revanks/Xeeva_Task_Files/blob/main/Task_2_Topic_Modeling_LDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task 2: Text-based clustering (NLP): Unsupervised topic modelling of unlabeled text descriptions with Latent Dirichlet Allocation

### What is Topic Modelling?

In my own words, topic modelling is the process of extracting the major themes from a given corpus of text data.

**Wikipedia Definition** <br>
In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents.
<br><br>
**Usage**<br>
In the age of information, the amount of written material we encounter each day is simply beyond our processing capacity. Topic models help organize large collections of unstructured text and offer insights into what they contain.<br>
Originally developed as a text-mining tool, topic models have also been used to detect instructive structures in data such as genetic information, images, and networks.<br>



## LDA - Latent Dirichlet Allocation

In [None]:
!pip install pyLDAvis



In [None]:
!pip uninstall openpyxl
!pip install openpyxl 

Found existing installation: openpyxl 2.5.9
Uninstalling openpyxl-2.5.9:
  Would remove:
    /usr/local/lib/python3.7/dist-packages/openpyxl-2.5.9.dist-info/*
    /usr/local/lib/python3.7/dist-packages/openpyxl/*
Proceed (y/n)? y
  Successfully uninstalled openpyxl-2.5.9
Collecting openpyxl
  Downloading openpyxl-3.0.9-py2.py3-none-any.whl (242 kB)
Installing collected packages: openpyxl
Successfully installed openpyxl-3.0.9


In [None]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
!pip install gensim
!pip install spacy==2.2.0

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
#Dependencies
import pandas as pd
import gensim #the library for Topic modelling
from gensim.models.ldamulticore import LdaMulticore
from gensim import corpora, models
import pyLDAvis #LDA visualization library

from nltk.corpus import stopwords
import string
from nltk.stem.wordnet import WordNetLemmatizer

import warnings
warnings.simplefilter('ignore')
from itertools import chain

In [None]:
from google.colab import files
data=files.upload()

Saving bert_sample.xlsx to bert_sample.xlsx


In [None]:
# read_excel already returns a DataFrame, so no extra conversion is needed
df = pd.read_excel('bert_sample.xlsx')
df.head()

Unnamed: 0,ITEM_NAME,CATEGORY_ID
0,CALIBRACION TRANSDUCER 75 nm,CAPITAL ASSEMBLY
1,for pusher whskey,CAPITAL ASSEMBLY
2,Stat 40B Press Head Cup to Carrier from Stati...,CAPITAL ASSEMBLY
3,TRANSD. Cable (4145097103) scrw,CAPITAL ASSEMBLY
4,"ZT200 7,5BAR,13BAR60HZ NUMERO DE SERIE: AIF09...",CAPITAL ASSEMBLY


In [None]:
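from textblob import TextBlob
# Spell-correct each item description; the corrected copies are stored in df1
# and df itself is left unchanged
df1 = [TextBlob(text).correct() for text in df['ITEM_NAME']]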

**Cleaning the data**

In [None]:
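# Clean the data
stop = set(stopwords.words('english'))  # English stop words only
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

def clean(text):
    """Lowercase, drop stop words and punctuation, lemmatize, return tokens."""
    # Remove English stop words from the whitespace-split, lowercased text
    stop_free = ' '.join([word for word in text.lower().split() if word not in stop])
    # Delete punctuation characters (adjacent pieces are joined, not split)
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    # Lemmatize each remaining word with WordNet
    normalized = ' '.join([lemma.lemmatize(word) for word in punc_free.split()])
    return normalized.split()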

In [None]:
df['ITEM_NAME']=df['ITEM_NAME'].apply(clean)
df

Unnamed: 0,ITEM_NAME,CATEGORY_ID
0,"[calibracion, transducer, 75, nm]",CAPITAL ASSEMBLY
1,"[pusher, whskey]",CAPITAL ASSEMBLY
2,"[stat, 40b, press, head, cup, carrier, station...",CAPITAL ASSEMBLY
3,"[transd, cable, 4145097103, scrw]",CAPITAL ASSEMBLY
4,"[zt200, 75bar13bar60hz, numero, de, serie, aif...",CAPITAL ASSEMBLY
...,...,...
9995,"[export, freight, charge, road]",LOGISTICS SERVICE
9996,"[export, packing]",LOGISTICS SERVICE
9997,"[express, delivery, charge]",LOGISTICS SERVICE
9998,"[express, delivery, charge, pmf, chmf, despatc...",LOGISTICS SERVICE
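
The token `75bar13bar60hz` in row 4 comes from deleting punctuation without leaving a space behind. The sketch below reproduces that effect on a shortened version of the row and compares it with a hypothetical variant, `strip_punct_to_space`, which is not part of this notebook. Whether splitting is preferable depends on the data: `7,5` may well be a decimal comma that splitting would break apart.

```python
import string

# Same punctuation set that clean() uses.
PUNCT = set(string.punctuation)

def strip_punct_delete(text):
    """Delete punctuation outright, which is what clean() does."""
    return ''.join(ch for ch in text.lower() if ch not in PUNCT).split()

def strip_punct_to_space(text):
    """Hypothetical variant: replace punctuation with spaces before splitting."""
    return ''.join(' ' if ch in PUNCT else ch for ch in text.lower()).split()

# Shortened version of the item description in row 4.
raw = "ZT200 7,5BAR,13BAR60HZ NUMERO DE SERIE"

deleted = strip_punct_delete(raw)
spaced = strip_punct_to_space(raw)
print(deleted)
print(spaced)
```

The first call joins the pieces around each comma into one token, while the second keeps them as separate tokens.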


**Creating a dictionary from the item descriptions**

In [None]:
#create dictionary
dictionary = corpora.Dictionary(df['ITEM_NAME'])
#Total number of non-zeroes in the BOW matrix (sum of the number of unique words per document over the entire corpus).
print(dictionary.num_nnz)

68228
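
To make the meaning of `num_nnz` concrete, here is a minimal pure-Python sketch of the same count on a toy corpus. The first two documents are taken from the cleaned output above, and the third is a made-up example with a repeated token. This illustrates the definition in the comment and is not gensim's own implementation.

```python
# Toy tokenized corpus; the third document is hypothetical.
docs = [
    ["export", "freight", "charge", "road"],
    ["express", "delivery", "charge"],
    ["drill", "drill", "tool"],
]

# Each distinct token counts once per document, summed over all documents.
nnz = sum(len(set(doc)) for doc in docs)

# Vocabulary size counts each distinct token once for the whole corpus.
vocab_size = len({token for doc in docs for token in doc})

print(nnz, vocab_size)
```

The repeated `drill` adds only one non-zero entry, and `charge` counts twice because it occurs in two documents.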


**Create document term matrix**

In [None]:
#create document term matrix
doc_term_matrix = [dictionary.doc2bow(doc) for doc in df['ITEM_NAME'] ]
print(len(doc_term_matrix))

10000
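
Each entry of `doc_term_matrix` is a bag of words: a list of `(token_id, count)` pairs for one document. The sketch below mimics that shape in plain Python. The helpers `build_token2id` and `to_bow` are hypothetical stand-ins, and the ids they assign will not match the ones gensim's `Dictionary` produces.

```python
from collections import Counter

def build_token2id(docs):
    """Assign an integer id to each distinct token in first-seen order."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Toy corpus; the second document is hypothetical.
docs = [["express", "delivery", "charge"], ["drill", "drill", "tool"]]
token2id = build_token2id(docs)

bow = to_bow(docs[1], token2id)
bow_unknown = to_bow(["drill", "unseen"], token2id)
print(bow)
print(bow_unknown)
```

Tokens missing from the mapping are dropped silently, which is why the dictionary has to be built from the same cleaned text that is later converted.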


**Select the LDA model class**

In [None]:
# Alias the model class; the model itself is created and trained in the next cell
lda = gensim.models.ldamodel.LdaModel

**Fit LDA model on the dataset**

In [None]:
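# Number of topics to extract
num_topics = 4
# passes=50 sweeps the whole corpus 50 times during training;
# minimum_probability=0 keeps every topic in each document's distribution
%time ldamodel = lda(doc_term_matrix, num_topics=num_topics, id2word=dictionary, passes=50, minimum_probability=0)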

CPU times: user 1min 25s, sys: 1.56 s, total: 1min 27s
Wall time: 1min 25s


**Print the topics identified by the LDA model**

In [None]:
ldamodel.print_topics(num_topics=num_topics)

[(0,
  '0.032*"insert" + 0.014*"print" + 0.013*"serial" + 0.013*"new" + 0.012*"boring" + 0.011*"diamond" + 0.011*"engrave" + 0.010*"make" + 0.010*"s" + 0.009*"thru"'),
 (1,
  '0.010*"drum" + 0.007*"bol" + 0.006*"nf" + 0.006*"oil" + 0.005*"chemical" + 0.005*"55" + 0.005*"gallon" + 0.005*"2" + 0.005*"d2" + 0.005*"gal"'),
 (2,
  '0.027*"tool" + 0.022*"x" + 0.021*"drill" + 0.013*"repair" + 0.012*"desc" + 0.011*"type" + 0.011*"diam" + 0.010*"mfg" + 0.009*"pn" + 0.009*"mill"'),
 (3,
  '0.063*"de" + 0.013*"para" + 0.008*"1" + 0.007*"en" + 0.007*"charge" + 0.006*"seco" + 0.005*"air" + 0.005*"freight" + 0.004*"fabricacion" + 0.004*"con"')]
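
Topic 3 is led by `de`, `para` and `en`, which are Spanish function words that the English-only stop list does not remove. As a rough sketch, the topic string can be parsed into `(word, weight)` pairs and checked against a word list. The `SPANISH_FUNCTION_WORDS` set below is a small hand-written assumption for illustration and not a complete stop list, and `parse_topic` is a hypothetical helper.

```python
import re

# First five terms of topic 3, copied from the output above.
topic_3 = '0.063*"de" + 0.013*"para" + 0.008*"1" + 0.007*"en" + 0.007*"charge"'

# Hand-written illustrative set; an assumption, not a full Spanish stop list.
SPANISH_FUNCTION_WORDS = {"de", "para", "en", "con"}

def parse_topic(topic_string):
    """Turn '0.063*"de" + ...' into a list of (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

terms = parse_topic(topic_3)
flagged = [word for word, _ in terms if word in SPANISH_FUNCTION_WORDS]
print(terms)
print(flagged)
```

If such words dominate a topic, one option is to add Spanish stop words to the `stop` set before cleaning and retrain the model.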

**Visualize the LDA model results**

In [None]:
# Note: there is some issue with pyLDAvis plotting
import pyLDAvis
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()
lda_display = pyLDAvis.gensim_models.prepare(ldamodel, doc_term_matrix, dictionary, sort_topics=False, mds='mmds')
pyLDAvis.display(lda_display)

**Find which item descriptions were assigned to which topic**

In [None]:
# Compute the topic distribution for every document in the corpus
lda_corpus = ldamodel[doc_term_matrix]
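
Each document gets a list of `(topic_id, probability)` pairs, so one simple way to assign a single cluster is to take the topic with the highest probability. The sketch below does this for three sample rows copied from this notebook's printed output. The helper name `dominant_topic` is hypothetical, and taking the argmax is only one possible assignment rule.

```python
# Three per-document topic distributions copied from the notebook's output.
doc_topics = [
    [(0, 0.05001124), (1, 0.05064016), (2, 0.84933823), (3, 0.05001036)],
    [(0, 0.08674667), (1, 0.74632007), (2, 0.08346369), (3, 0.08346955)],
    [(0, 0.018473106), (1, 0.4878755), (2, 0.09831548), (3, 0.39533594)],
]

def dominant_topic(distribution):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(distribution, key=lambda pair: pair[1])

assignments = [dominant_topic(dist) for dist in doc_topics]
topic_ids = [topic_id for topic_id, _ in assignments]
print(assignments)
print(topic_ids)
```

The third row splits its mass between topics 1 and 3, so a probability threshold may be worth adding when a confident assignment matters.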

In [None]:
list(lda_corpus)

[[(0, 0.05001124), (1, 0.05064016), (2, 0.84933823), (3, 0.05001036)],
 [(0, 0.08674667), (1, 0.74632007), (2, 0.08346369), (3, 0.08346955)],
 [(0, 0.018473106), (1, 0.4878755), (2, 0.09831548), (3, 0.39533594)],
 [(0, 0.050088506), (1, 0.05008638), (2, 0.8497464), (3, 0.05007869)],
 [(0, 0.019347789), (1, 0.019334096), (2, 0.57140315), (3, 0.389915)],
 [(0, 0.8739922), (1, 0.04214328), (2, 0.042144004), (3, 0.041720524)],
 [(0, 0.042151004), (1, 0.041705996), (2, 0.8743225), (3, 0.041820485)],
 [(0, 0.027934792), (1, 0.027809013), (2, 0.9158384), (3, 0.028417738)],
 [(0, 0.019528124), (1, 0.15582845), (2, 0.21409062), (3, 0.61055285)],
 [(0, 0.010209481), (1, 0.6494688), (2, 0.20056206), (3, 0.13975964)],
 [(0, 0.022737121), (1, 0.022736803), (2, 0.023079038), (3, 0.931447)],
 [(0, 0.12525757), (1, 0.6243313), (2, 0.12518267), (3, 0.1252285)],
 [(0, 0.1252835), (1, 0.12527727), (2, 0.1252009), (3, 0.6242383)],
 [(0, 0.12525775), (1, 0.62433004), (2, 0.1251829), (3, 0.12522928)],
 [(0,