# Topic Modeling
## This notebook outlines the concepts involved in Topic Modeling


Topic modeling is a statistical technique for **discovering** the abstract "topics" that occur in a collection of documents.

It is most commonly applied to text documents, and it is increasingly used in social media analysis, where it is an active research area.

One of the most popular algorithms is **Latent Dirichlet Allocation** (LDA), which was proposed by
David Blei, Andrew Ng, and Michael Jordan in 2003.
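
LDA assumes that each document is generated by first drawing a mixture over topics and then drawing every word from one of those topics. The sketch below walks through that generative story using only the standard library. The two toy topics, the `alpha` values, and the helper names `sample_dirichlet` and `generate_document` are illustrative assumptions; they are not part of gensim or of the dataset used in this notebook.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate one document following the LDA generative process."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic mixture
    words = []
    for _ in range(length):
        # pick a topic for this word position according to theta
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # pick a word from that topic's word distribution
        vocab, probs = zip(*topics[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Toy topics (assumed for illustration): word -> probability within the topic
toy_topics = [
    {"player": 0.5, "team": 0.3, "match": 0.2},
    {"price": 0.5, "company": 0.3, "year": 0.2},
]
rng = random.Random(0)
theta, doc = generate_document(toy_topics, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta)
print(doc)
```

Training an LDA model runs this story in reverse: given only the words, it estimates the topic mixtures and the per-topic word distributions.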

Dataset: 
https://raw.githubusercontent.com/subashgandyer/datasets/main/kaggledatasets.csv

### Steps
- Install the necessary library
- Import the necessary libraries
- Download the dataset
- Load the dataset
- Pre-process the dataset
    - Tokenize
    - Stop words removal
    - Non-alphabetic words removal
    - Lowercase
- Create a dictionary for the document
- Filter low- and high-frequency words
- Create an Index to word dictionary
- Train the Topic Model
- Predict on the dataset
- Visualize the topics

### Install the necessary library

In [1]:
# ! pip install gensim

In [2]:
import nltk
# nltk.download is a Python function, so call it directly;
# a leading "!" would send the line to the shell instead
nltk.download('stopwords')
nltk.download('wordnet')  # required by WordNetLemmatizer, which is used below


### Import the necessary libraries

In [3]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from gensim.models import LdaModel
from gensim.corpora import Dictionary
from pprint import pprint
import pandas as pd
from nltk.tokenize import RegexpTokenizer
from nltk.stem.wordnet import WordNetLemmatizer
from gensim import corpora, models
import gensim

### Download the dataset

In [4]:
! wget https://raw.githubusercontent.com/subashgandyer/datasets/main/kaggledatasets.csv

--2021-03-06 20:18:04--  https://raw.githubusercontent.com/subashgandyer/datasets/main/kaggledatasets.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3881130 (3.7M) [text/plain]
Saving to: ‘kaggledatasets.csv.2’


2021-03-06 20:18:05 (4.30 MB/s) - ‘kaggledatasets.csv.2’ saved [3881130/3881130]



### Load the dataset

In [5]:
df = pd.read_csv("kaggledatasets.csv")
df.head()

Unnamed: 0,Title,Subtitle,Owner,Votes,Versions,Tags,Data Type,Size,License,Views,Download,Kernels,Topics,URL,Description
0,Credit Card Fraud Detection,Anonymized credit card transactions labeled as...,Machine Learning Group - ULB,1241,"Version 2,2016-11-05|Version 1,2016-11-03",crime\nfinance,CSV,144 MB,ODbL,"442,136 views","53,128 downloads","1,782 kernels",26 topics,https://www.kaggle.com/mlg-ulb/creditcardfraud,The datasets contains transactions made by cre...
1,European Soccer Database,"25k+ matches, players & teams attributes for E...",Hugo Mathien,1046,"Version 10,2016-10-24|Version 9,2016-10-24|Ver...",association football\neurope,SQLite,299 MB,ODbL,"396,214 views","46,367 downloads","1,459 kernels",75 topics,https://www.kaggle.com/hugomathien/soccer,The ultimate Soccer database for data analysis...
2,TMDB 5000 Movie Dataset,"Metadata on ~5,000 movies from TMDb",The Movie Database (TMDb),1024,"Version 2,2017-09-28",film,CSV,44 MB,Other,"446,255 views","62,002 downloads","1,394 kernels",46 topics,https://www.kaggle.com/tmdb/tmdb-movie-metadata,Background\nWhat can we say about the success ...
3,Global Terrorism Database,"More than 170,000 terrorist attacks worldwide,...",START Consortium,789,"Version 2,2017-07-19|Version 1,2016-12-08",crime\nterrorism\ninternational relations,CSV,144 MB,Other,"187,877 views","26,309 downloads",608 kernels,11 topics,https://www.kaggle.com/START-UMD/gtd,"Context\nInformation on more than 170,000 Terr..."
4,Bitcoin Historical Data,Bitcoin data at 1-min intervals from select ex...,Zielak,618,"Version 11,2018-01-11|Version 10,2017-11-17|Ve...",history\nfinance,CSV,119 MB,CC4,"146,734 views","16,868 downloads",68 kernels,13 topics,https://www.kaggle.com/mczielinski/bitcoin-his...,Context\nBitcoin is the longest running and mo...


### Explore the dataset

### Extract the data for topic modeling

In [6]:
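# Series.iteritems() was removed in pandas 2.0; Series.items() is the replacement.
# Printing every description produces a lot of output and can hit the
# notebook's output rate limit.
for _, description in df['Description'].items():
    raw = str(description).lower()
    print(raw)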

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



context
the [sentiment polarity dataset version 2.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/ ) is created by bo pang and lillian lee. this dataset is redistributed with nltk with permission from the authors.
this corpus is also used in the document classification section of chapter 6.1.3 of the nltk book.
content
this dataset contains 1000 positive and 1000 negative processed reviews.
citation
bo pang and lillian lee. 2004. a sentimental education: sentiment analysis 
using subjectivity summarization based on minimum cuts. in acl.
bibtex:
@inproceedings{pang+lee:04a,
  author =       {bo pang and lillian lee},
  title =        {a sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts},
  booktitle =    "proceedings of the acl",
  year =         2004
}
context
the corpus consists of one million words of american english texts printed in 1961.
the canonical metadata on nltk:
<package id="brown" name="brown corpus"
         author

### Pre-process the dataset
- Tokenize
- Stop words removal
- Non-alphabetic words removal
- Lowercase
- First, define the tools used for these steps

### Define the pattern, tokenizer, stop words and lemmatizer

In [7]:
pattern = r'\b[^\d\W]+\b'
tokenizer = RegexpTokenizer(pattern)
en_stop = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
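
To see what the pattern keeps, the sketch below applies it with the standard `re` module, which should behave like `RegexpTokenizer` for this simple case. The sample sentence and the three-word stop list are assumptions made up for illustration; the real notebook uses NLTK's English stop words.

```python
import re

pattern = r'\b[^\d\W]+\b'  # same pattern as in the cell above

# Sample text (assumed for illustration), lowercased as in the notebook
sample = "284,807 Transactions in 2 days, V28 pca_features".lower()
tokens = re.findall(pattern, sample)
print(tokens)

# Tiny stand-in for NLTK's English stop word list (assumption)
toy_stop_words = {"in", "the", "of"}
kept = [t for t in tokens if t not in toy_stop_words and len(t) > 1]
print(kept)
```

The pattern matches whole words made of letters or underscores. Pure numbers are dropped, and so is a mixed token such as `v28`, because no word boundary separates the letter from the digits.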

### Preprocess

In [8]:
texts = []


# Series.iteritems() was removed in pandas 2.0; use Series.items() instead
for _, description in df['Description'].items():
    # clean and tokenize document string
    raw = str(description).lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    stopped_tokens = [token for token in tokens if token not in en_stop]

    # lemmatize tokens
    lemma_tokens = [lemmatizer.lemmatize(token) for token in stopped_tokens]

    # remove tokens that consist of a single character
    new_lemma_tokens = [token for token in lemma_tokens if len(token) > 1]

    # add tokens to list
    texts.append(new_lemma_tokens)


print(texts[0])

['datasets', 'contains', 'transaction', 'made', 'credit', 'card', 'september', 'european', 'cardholder', 'dataset', 'present', 'transaction', 'occurred', 'two', 'day', 'fraud', 'transaction', 'dataset', 'highly', 'unbalanced', 'positive', 'class', 'fraud', 'account', 'transaction', 'contains', 'numerical', 'input', 'variable', 'result', 'pca', 'transformation', 'unfortunately', 'due', 'confidentiality', 'issue', 'cannot', 'provide', 'original', 'feature', 'background', 'information', 'data', 'feature', 'principal', 'component', 'obtained', 'pca', 'feature', 'transformed', 'pca', 'time', 'amount', 'feature', 'time', 'contains', 'second', 'elapsed', 'transaction', 'first', 'transaction', 'dataset', 'feature', 'amount', 'transaction', 'amount', 'feature', 'used', 'example', 'dependant', 'cost', 'senstive', 'learning', 'feature', 'class', 'response', 'variable', 'take', 'value', 'case', 'fraud', 'otherwise', 'given', 'class', 'imbalance', 'ratio', 'recommend', 'measuring', 'accuracy', 'usi

### Create a dictionary

In [9]:
dictionary = Dictionary(texts)

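### Filter low- and high-frequency words

In [10]:
# drop tokens that appear in fewer than 10 documents or in more than 50% of all documents
dictionary.filter_extremes(no_below=10, no_above=0.5)
# convert each tokenized document into a bag-of-words vector of (token_id, count) pairs
corpus = [dictionary.doc2bow(text) for text in texts]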
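
The two calls above can be pictured with a small pure-Python approximation. This is a sketch under stated assumptions, not gensim's implementation: the helper names `filter_vocabulary` and `to_bow` are hypothetical, the toy documents are made up, and details such as gensim's `keep_n` cap and its id assignment order are ignored.

```python
from collections import Counter

def filter_vocabulary(docs, no_below, no_above):
    """Keep tokens whose document frequency lies within the two limits."""
    n_docs = len(docs)
    # count each token once per document (document frequency)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    kept = sorted(
        token for token, df in doc_freq.items()
        if no_below <= df <= no_above * n_docs
    )
    return {token: idx for idx, token in enumerate(kept)}

def to_bow(doc, token2id):
    """Turn a token list into sorted (token_id, count) pairs."""
    counts = Counter(token for token in doc if token in token2id)
    return sorted((token2id[token], count) for token, count in counts.items())

# Toy documents (assumed for illustration)
toy_docs = [
    ["data", "csv", "player"],
    ["data", "csv", "team"],
    ["data", "image"],
    ["data", "csv", "player", "player"],
]
toy_token2id = filter_vocabulary(toy_docs, no_below=2, no_above=0.75)
toy_corpus = [to_bow(doc, toy_token2id) for doc in toy_docs]
print(toy_token2id)
print(toy_corpus)
```

Here `data` is removed because it occurs in every document, while `team` and `image` are removed because each occurs in only one. A document whose tokens are all filtered out becomes an empty bag of words.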

### Create an index to word dictionary

In [11]:
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token
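
The `dictionary[0]` lookup is needed because gensim builds the reverse `id2token` mapping lazily, on the first lookup by id. The class below is a hypothetical stand-in that imitates this behaviour in plain Python; `TinyDictionary` is not a gensim class.

```python
class TinyDictionary:
    """Hypothetical stand-in that mimics a lazily built id -> token mapping."""

    def __init__(self, tokens):
        self.token2id = {t: i for i, t in enumerate(sorted(set(tokens)))}
        self.id2token = {}  # reverse mapping starts out empty

    def __getitem__(self, token_id):
        # build the reverse mapping only when it is first needed
        if len(self.id2token) != len(self.token2id):
            self.id2token = {i: t for t, i in self.token2id.items()}
        return self.id2token[token_id]

toy = TinyDictionary(["team", "csv", "player"])
before = dict(toy.id2token)  # empty before any lookup
first = toy[0]               # triggers construction of id2token
after = dict(toy.id2token)
print(before, first, after)
```

Without the initial lookup, `id2token` would still be empty when it is passed to the model as `id2word`.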

### Train the Topic model

In [12]:
ldamodel = LdaModel(corpus, num_topics=15, id2word = id2word, passes=20)

### Display the topics

In [13]:
pprint(ldamodel.top_topics(corpus,topn=5))

[([(0.036358196, 'player'),
   (0.030080475, 'game'),
   (0.028001763, 'team'),
   (0.0168796, 'match'),
   (0.014556797, 'season')],
  -1.1403886634491778),
 ([(0.01799047, 'company'),
   (0.017193004, 'price'),
   (0.017030902, 'name'),
   (0.015550443, 'year'),
   (0.015041812, 'http')],
  -1.19307918338903),
 ([(0.045464605, 'image'),
   (0.044691578, 'model'),
   (0.037053447, 'trained'),
   (0.02627516, 'feature'),
   (0.024118548, 'pre')],
  -1.3335498778266384),
 ([(0.025536627, 'city'),
   (0.020053511, 'new'),
   (0.019710885, 'others'),
   (0.018813342, 'inspiration'),
   (0.017759247, 'world')],
  -1.4411230120715441),
 ([(0.028886251, 'language'),
   (0.018455138, 'file'),
   (0.017737139, 'question'),
   (0.016496379, 'used'),
   (0.015745243, 'kaggle')],
  -1.4907629027727687),
 ([(0.031437837, 'text'),
   (0.018139252, 'would'),
   (0.018001739, 'like'),
   (0.015858034, 'movie'),
   (0.015505705, 'one')],
  -1.5601442522557547),
 ([(0.014332684, 'information'),
   (0.0
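
Each entry returned by `top_topics` pairs a topic's top words with a coherence score, which is the negative number after each word list. By default gensim uses the `u_mass` measure, which is based on how often the top words co-occur in the same documents. The sketch below computes a UMass-style score on toy data. The function name `umass_coherence`, the toy documents, and the `+1` smoothing (taken from the original formulation of the measure) are assumptions, so the numbers will not match gensim's output exactly.

```python
import math

def umass_coherence(top_words, docs):
    """Average of log((D(w_m, w_l) + 1) / D(w_l)) over ordered word pairs."""
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        # number of documents that contain all of the given words
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for m in range(1, len(top_words)):
        for l in range(m):
            joint = doc_count(top_words[m], top_words[l])
            scores.append(math.log((joint + 1) / doc_count(top_words[l])))
    return sum(scores) / len(scores)

# Toy documents (assumed for illustration)
toy_docs = [
    ["player", "game", "team"],
    ["player", "game"],
    ["player", "csv"],
    ["csv", "file"],
]
score = umass_coherence(["player", "game", "team"], toy_docs)
print(score)
```

Scores closer to zero indicate that a topic's top words tend to appear together, which is why the most coherent topics are listed first.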

### Display the 15 topics with words

In [14]:
for idx in range(15):
    print("Topic #%s:" % idx, ldamodel.print_topic(idx, 10))

Topic #0: 0.038*"csv" + 0.022*"review" + 0.021*"numeric" + 0.019*"score" + 0.017*"class" + 0.012*"time" + 0.012*"attack" + 0.011*"pokemon" + 0.011*"activity" + 0.011*"new"
Topic #1: 0.030*"csv" + 0.017*"row" + 0.013*"column" + 0.012*"txt" + 0.011*"facility" + 0.011*"coordinate" + 0.010*"name" + 0.010*"http" + 0.010*"number" + 0.010*"file"
Topic #2: 0.014*"information" + 0.014*"column" + 0.010*"file" + 0.010*"csv" + 0.009*"activity" + 0.009*"database" + 0.008*"feature" + 0.007*"contains" + 0.007*"datasets" + 0.007*"record"
Topic #3: 0.036*"player" + 0.030*"game" + 0.028*"team" + 0.017*"match" + 0.015*"season" + 0.011*"back" + 0.011*"sport" + 0.009*"set" + 0.008*"goal" + 0.008*"result"
Topic #4: 0.027*"instance" + 0.025*"cell" + 0.019*"number" + 0.017*"group" + 0.016*"class" + 0.015*"attribute" + 0.011*"set" + 0.010*"woman" + 0.010*"car" + 0.010*"vehicle"
Topic #5: 0.212*"description" + 0.185*"yet" + 0.059*"weapon" + 0.044*"integer" + 0.027*"damage" + 0.026*"strongly" + 0.014*"enjoy" + 0

### LSI Model

In [15]:
from gensim.models import LsiModel
lsamodel = LsiModel(corpus, num_topics=10, id2word = id2word)
pprint(lsamodel.print_topics(num_topics=10, num_words=10))

[(0,
  '0.970*"university" + 0.174*"state" + 0.076*"college" + 0.051*"texas" + '
  '0.049*"california" + 0.039*"institute" + 0.031*"new" + 0.028*"technology" + '
  '0.027*"florida" + 0.027*"north"'),
 (1,
  '0.389*"player" + 0.247*"team" + 0.221*"shot" + 0.200*"number" + '
  '0.177*"time" + 0.173*"file" + 0.159*"year" + 0.156*"csv" + 0.146*"goal" + '
  '0.126*"ice"'),
 (2,
  '-0.437*"player" + -0.307*"shot" + -0.259*"team" + 0.250*"integer" + '
  '0.224*"strongly" + -0.175*"ice" + -0.174*"goal" + 0.154*"file" + '
  '-0.151*"attempt" + 0.133*"csv"'),
 (3,
  '-0.595*"integer" + -0.535*"strongly" + -0.263*"interested" + -0.261*"enjoy" '
  '+ -0.119*"much" + -0.116*"player" + 0.098*"file" + 0.093*"year" + '
  '-0.090*"shot" + 0.088*"csv"'),
 (4,
  '0.402*"year" + -0.325*"date" + -0.265*"element" + -0.198*"tag" + '
  '-0.192*"registration" + -0.186*"zero" + -0.180*"end" + -0.174*"start" + '
  '-0.171*"one" + -0.165*"application"'),
 (5,
  '-0.535*"csv" + 0.436*"year" + 0.193*"number" + -0.1

In [16]:
for idx in range(10):
    print("Topic #%s:" % idx, lsamodel.print_topic(idx, 10))
print("=" * 20)

Topic #0: 0.970*"university" + 0.174*"state" + 0.076*"college" + 0.051*"texas" + 0.049*"california" + 0.039*"institute" + 0.031*"new" + 0.028*"technology" + 0.027*"florida" + 0.027*"north"
Topic #1: 0.389*"player" + 0.247*"team" + 0.221*"shot" + 0.200*"number" + 0.177*"time" + 0.173*"file" + 0.159*"year" + 0.156*"csv" + 0.146*"goal" + 0.126*"ice"
Topic #2: -0.437*"player" + -0.307*"shot" + -0.259*"team" + 0.250*"integer" + 0.224*"strongly" + -0.175*"ice" + -0.174*"goal" + 0.154*"file" + -0.151*"attempt" + 0.133*"csv"
Topic #3: -0.595*"integer" + -0.535*"strongly" + -0.263*"interested" + -0.261*"enjoy" + -0.119*"much" + -0.116*"player" + 0.098*"file" + 0.093*"year" + -0.090*"shot" + 0.088*"csv"
Topic #4: 0.402*"year" + -0.325*"date" + -0.265*"element" + -0.198*"tag" + -0.192*"registration" + -0.186*"zero" + -0.180*"end" + -0.174*"start" + -0.171*"one" + -0.165*"application"
Topic #5: -0.535*"csv" + 0.436*"year" + 0.193*"number" + -0.174*"file" + 0.166*"date" + 0.155*"total" + 0.122*"ele

## Visualize the topics and documents with the trained Topic Model
- Use pyLDAvis, a separate package that provides an adapter for gensim models

In [17]:
# In pyLDAvis 3.0 and later this module is named pyLDAvis.gensim_models
import pyLDAvis.gensim

### Enable the notebook for visualization

In [18]:
pyLDAvis.enable_notebook()



### Visualize the Topic model

In [19]:

pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)

