# Traditional NLP - Topic Modelling

## Learning Objectives
- What is NLP and what can we do with it?
- Topic modelling
- Cleaning text data
- Tokenization
- Stemming
- Lemmatization
- Stop Words
- Bag of Words
- LSA (or LSI)
- TF-IDF
- LDA

**Natural Language Processing** (NLP) is a branch of AI which, as the name suggests, focuses on processing natural language (English, French, Chinese etc.). Recently, there has been a massive surge in the popularity of NLP due to the expressiveness of deep learning models. Prior to this rise in popularity, however, statistical methods were commonly used to parse and gain insights from languages. Here, we'll be focusing on these more traditional/'simpler' techniques. It is a completely valid question to ask *why* we are focusing on these when more modern and more powerful technologies exist. The answer depends on *how* you're going to use NLP. If your task inherently requires you to build an expressive model of language, then the deep learning approach is favoured. However, if you just need to use NLP as a tool in part of a larger project, then traditional NLP may be ample. That said, because we'll be using 'clean' datasets in the subsequent NLP chapter, we'll demonstrate effective EDA techniques for NLP in this module.

Concretely, NLP, despite being an interdisciplinary field, is best thought of as applying algorithms to text to extract insights. With the current state of NLP, almost every language problem imaginable has research and methodologies behind it. On the traditional NLP front, common problems include:
- **Topic Modelling:** An unsupervised methodology to group documents into potential topic groups
- **Sentiment Analysis:** Given some text data, what is the sentiment of this data? Good, bad?
- **Named Entity Recognition:** Given some data, extract out named entities (e.g. John, Apple, GSK etc)
- **Parts of Speech:** Given a sentence, what are its parts of speech (based on the use of each word in that sentence)? Common parts of speech are nouns, adjectives, verbs etc
- **Dependency Parsing:** Given a sentence, analyse its grammatical structure. What are the relationships between the words? Phrase structure trees are heavily related.

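These problems are hard partly because language is ambiguous. Each of the following sentences, for example, can be read in two ways: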
- Scientists study whales from space
- She killed the man with the tie

In this chapter, we'll be focusing on the first two problems. These problems provide great introductions to NLP and can also be tackled with deep learning methodologies. Parts of Speech and Dependency Parsing, on the other hand, have fallen out of favour now due to the power of deep learning. Practitioners have realised that deep learning models perform better without having text augmented by the syntactic information the aforementioned concepts provide. If it helps, you can think of the difference between traditional and neural NLP like this: neural NLP aims to *understand* the meaning of the language the model sees, whereas traditional NLP attempts to just find patterns between words.

In this notebook, we won't so much focus on EDA of text data. The next notebook will cover that. Rather, we'll introduce important and relevant techniques to deal with text data, and how to represent text data to a model.

## Topic Modelling

As (very) briefly mentioned, **topic modelling** is the unsupervised act of assigning *documents* into potential topic groups. While the above sentence is seemingly straightforward, let's break it down slightly. There are two components here: "documents", and "potential topic groups". The first thing to note, however, is that topic modelling is an unsupervised methodology, and we will be looking at two techniques here that produce these potential topic groups for us. I highlighted the term *document* because it is a common term in NLP. Let's have a very brief discussion about some NLP terminology before discussing what I mean by "potential topic groups".

Regarding the terminology, there are three main abstraction terms you'll come across:
- **Corpus:** Your entire collection of text data for the problem (i.e. your dataset)
- **Document:** An item from your corpus (e.g. a single tweet or a single email)
- **Token:** The atomic unit of a piece of language (usually a word)
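
To make these three terms concrete, here is a minimal sketch using a made-up three-document corpus. The example strings and the variable names (`toy_corpus`, `toy_document`, `toy_tokens`) are invented purely for illustration, and splitting on whitespace stands in for a real tokenizer.

```python
# A toy corpus: the whole dataset is simply a list of documents (made-up strings)
toy_corpus = [
    "The rocket launched into orbit",
    "Selling a used graphics card",
    "Is there any evidence for that claim",
]

# A document is a single item from the corpus
toy_document = toy_corpus[0]

# Tokens are the atomic units of a document; here we naively split on whitespace
toy_tokens = toy_document.split()

print(len(toy_corpus))  # number of documents in the corpus
print(toy_tokens)       # tokens of the first document
```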

Putting the pieces together, this means that we'll be dealing with a corpus (i.e. dataset). In topic modelling, we'll be assigning each individual element in our corpus (i.e. a document) into a group. Let's look at the second component I mentioned earlier on: potential topic groups. I specifically used the word potential because:
1) We will be unsure how many topics we may have
2) The computer will be unsure how many topics to model

What?? How can we be unsure of how many topics we have? Well.. remember that topic modelling is an unsupervised problem. We're only given a corpus: a collection of documents; and we need to categorise each document into a topic (or into multiple topics - which we'll look at later). We don't know how many topics will exist in this corpus, so we need to choose a hyperparameter which will dictate how many topics we want the computer to create. Our job as humans is then to look at the output of the algorithm and label each of the topics that have been output. Let's look at this visually.

*[Image of topic modelling]*

So we see that a pre-processing step is required before we feed the documents into a model. In the case of NLP, what does this involve?
- Text cleaning
- Tokenization
- Stemming/Lemmatization
- Stop word removal
- Representing our text data (Bag of Words or TF-IDF)

Now that we understand what topic modelling is trying to do, let's load and preprocess some data before looking at the algorithms we can use for topic modelling! The above list is roughly the order we should be tackling these NLP problems in, so let's get started.

In this notebook, we'll be working with an sklearn dataset called 20newsgroups. I've chosen this because this NLP module is in two parts. We'll look at some simpler variants of some of the above concepts here, and in the next notebook, look at a slightly more complicated challenge which builds on the concepts outlined here.

The 20newsgroups dataset contains around 18000 posts from (what can be thought of as) a forum board. These posts are split into 20 different topics. We'll work with 4 of those topics here just so it's more manageable as an introduction.

In [1]:
from sklearn.datasets import fetch_20newsgroups

categories = ["alt.atheism", "misc.forsale", "sci.space", "comp.graphics"]
corpus = fetch_20newsgroups(subset='train', categories=categories)
print("\n".join(corpus.data[:3]))

From: zeno@ccwf.cc.utexas.edu (S. Hsieh)
Subject: Video/Audio/Computer equipment for sale..
Organization: The University of Texas at Austin, Austin TX
Lines: 49
Distribution: na
Reply-To: zeno@ccwf.cc.utexas.edu (S. Hsieh)
NNTP-Posting-Host: mickey.cc.utexas.edu
Originator: zeno@mickey.cc.utexas.edu

Time for some spring cleaning, so the following items are up
for sale:

Roland MT-32 Multi-Timbre Sound module.  
  LA synthesis, upto 32 simultaneous voices, 128 preset timbres,
  20-char backlit LCD display, MIDI in/out/thru, reference card,
  stereo output, etc

  Great for games that support it (music on the MT32 is far
  superior to any sound card), experimenting with MIDI, or
  for adding additional sounds to your MIDI setup.

  $235 + shipping

Canon RC-250 Xapshot still video camera system.
  Includes: camera, carrying pouch, battery pack, battery charger,
  ac adapter, video cables, two 2.5" floppies (each disk holds
  50 pictures for 100 pics total), manuals, etc

  Video output 

In [2]:
import numpy as np
np.array(corpus.target_names)[corpus.target[:3]]

array(['misc.forsale', 'misc.forsale', 'comp.graphics'], dtype='<U13')

In [3]:
corpus.filenames.shape, corpus.target.shape

((2242,), (2242,))

## Cleaning text data
What we've printed above is easily consumable and understandable by humans. However, many nuances of language exist purely for our convenience, and providing this information to a computer is unnecessary, and perhaps even detrimental. Some examples of this include capitalisation of letters and punctuation. With the text data returned above, can you think of what else we might need to "clean"?

This subsection provides an introduction to how we can clean text data. In previous notebooks, we've been introduced to some simple string manipulations on dataframes. Here, we'll be using similar methods (and then some). Let's create a function which takes in one document at a time, and cleans the document. From what I see, we'll need to do the following:
- Lowercase everything
- Remove emails
- Remove punctuation
- Remove numbers
- Replace newline characters (`\n`) with a normal space (`" "`)

Different use cases and different practitioners will recommend different strategies for cleaning text data. Some would argue that removing numbers isn't a requirement. Personally, I believe the same thing, but I've added it here to test your googling and/or regex knowledge ;).

## Tokenization
It's worth killing two birds with one stone and mentioning what a token is. A token is simply the atomic unit of text. In most cases, this'll be a word, but it could also be a piece of punctuation or an emoji. When we say to "tokenize" something, we mean that we take in a continuous piece of text, and we split (hint 😉) it into a list of tokens. That is, `The quick brown fox jumped over the lazy dog` --> `["The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"]`.
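
As a quick sketch of the idea, Python's built-in `str.split` (called with no arguments) already gives a naive whitespace tokenizer. The variable names below are just for illustration. Note how punctuation stays glued to the neighbouring word, which is one reason we clean the text first.

```python
example_text = "The quick brown fox jumped over the lazy dog"

# Splitting on whitespace yields one token per word
example_tokens = example_text.split()
print(example_tokens)

# Punctuation is not separated by a plain whitespace split:
# the last token here still carries its full stop
punctuated_tokens = "Time for some spring cleaning.".split()
print(punctuated_tokens)
```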

In the function we're defining below, clean the data and return its tokens. You may find it easier to perform the regex over tokens as opposed to the whole document, but that's an implementation detail.

In [4]:
import re
import string

## Create a function which takes in a document and fulfills the above operations
# Feel free to Google around





clean_data(corpus.data[0])

['from',
 's',
 'hsieh',
 'subject',
 'videoaudiocomputer',
 'equipment',
 'for',
 'sale',
 'organization',
 'the',
 'university',
 'of',
 'texas',
 'at',
 'austin',
 'austin',
 'tx',
 'lines',
 'distribution',
 'na',
 'replyto',
 's',
 'hsieh',
 'nntppostinghost',
 'mickeyccutexasedu',
 'originator',
 'time',
 'for',
 'some',
 'spring',
 'cleaning',
 'so',
 'the',
 'following',
 'items',
 'are',
 'up',
 'for',
 'sale',
 'roland',
 'mt',
 'multitimbre',
 'sound',
 'module',
 'la',
 'synthesis',
 'upto',
 'simultaneous',
 'voices',
 'preset',
 'timbres',
 'char',
 'backlit',
 'lcd',
 'display',
 'midi',
 'inoutthru',
 'reference',
 'card',
 'stereo',
 'output',
 'etc',
 'great',
 'for',
 'games',
 'that',
 'support',
 'it',
 'music',
 'on',
 'the',
 'mt',
 'is',
 'far',
 'superior',
 'to',
 'any',
 'sound',
 'card',
 'experimenting',
 'with',
 'midi',
 'or',
 'for',
 'adding',
 'additional',
 'sounds',
 'to',
 'your',
 'midi',
 'setup',
 'shipping',
 'canon',
 'rc',
 'xapshot',
 'still'
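
If you get stuck, here is one possible sketch of such a function. It is named `clean_data_sketch` to make clear that it is not the notebook's reference solution, and the regular expressions are assumptions about what counts as an "email" or a "number"; other choices are equally valid.

```python
import re
import string

# Assumed patterns: anything of the form something@something counts as an email,
# and every run of digits counts as a number
EMAIL_PATTERN = re.compile(r"\S+@\S+")
DIGIT_PATTERN = re.compile(r"\d+")
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def clean_data_sketch(document):
    """Lowercase a document, strip emails, punctuation and digits, and return its tokens."""
    text = document.lower()
    text = EMAIL_PATTERN.sub(" ", text)        # remove emails before punctuation, while '@' still exists
    text = text.translate(PUNCTUATION_TABLE)   # remove punctuation characters
    text = DIGIT_PATTERN.sub("", text)         # remove digits
    return text.split()                        # split() treats newlines as whitespace too


sample = "From: zeno@ccwf.cc.utexas.edu (S. Hsieh)\nSubject: Video/Audio/Computer equipment for sale..\nLines: 49"
print(clean_data_sketch(sample))
```

Applying it over the corpus, as the next cell asks, could then be a list comprehension such as `[clean_data_sketch(doc) for doc in corpus.data]`.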

In [5]:
## Apply this function over our corpus


corpus.data[:3]

[['from',
  's',
  'hsieh',
  'subject',
  'videoaudiocomputer',
  'equipment',
  'for',
  'sale',
  'organization',
  'the',
  'university',
  'of',
  'texas',
  'at',
  'austin',
  'austin',
  'tx',
  'lines',
  'distribution',
  'na',
  'replyto',
  's',
  'hsieh',
  'nntppostinghost',
  'mickeyccutexasedu',
  'originator',
  'time',
  'for',
  'some',
  'spring',
  'cleaning',
  'so',
  'the',
  'following',
  'items',
  'are',
  'up',
  'for',
  'sale',
  'roland',
  'mt',
  'multitimbre',
  'sound',
  'module',
  'la',
  'synthesis',
  'upto',
  'simultaneous',
  'voices',
  'preset',
  'timbres',
  'char',
  'backlit',
  'lcd',
  'display',
  'midi',
  'inoutthru',
  'reference',
  'card',
  'stereo',
  'output',
  'etc',
  'great',
  'for',
  'games',
  'that',
  'support',
  'it',
  'music',
  'on',
  'the',
  'mt',
  'is',
  'far',
  'superior',
  'to',
  'any',
  'sound',
  'card',
  'experimenting',
  'with',
  'midi',
  'or',
  'for',
  'adding',
  'additional',
  'sounds'

So there are some issues with the way that we've processed the text (generally speaking). A major issue that I see is that quotes/replies of previous messages were marked in the document string with ">" or ">>" etc. With the removal of this signal, a model wouldn't be able to distinguish between an author's own post and the text they are replying to. In the case of sentiment classification, it may classify an author's opinion as negative when in fact the author's opinion was positive, but the messages they were replying to were of negative sentiment. In such a situation, removing the quotes/replies may be necessary. Here, however, we just want to find topics, and quotes/replies are most likely in the same domain as the post itself, since the post is part of a message thread.

Another issue comes from our preprocessing strategy. We removed punctuation, so some tokens are just several words or address fragments fused together once the punctuation between them was stripped, for example "rutgersremusrutgersedukaldis". Also, due to the loose structure of the email, we have some potentially redundant words which may be detrimental to a model (e.g. "subject", "organization"). We'll continue as is for now, and if we see such words causing an issue, we'll know to remove them 👍.

## Stop Word removal
Stop words are words which are common to a natural language but provide very little information towards the meaning of a sentence. Obvious examples are "the, and, a, is" etc. With traditional NLP, a stop list is typically defined (i.e. a set of stop words), and document tokens which are present in this stop list are removed. If we wanted to, we could also add domain-specific stop words to our stop list (e.g. "subject", "organization"). It used to be that large stop lists were defined (e.g. 200-300 words); however, stop lists (within traditional NLP) are now typically just the super common words (e.g. 7-12 words). Neural NLP typically does not remove stop words at all. Even within traditional NLP, there has been debate about whether stop word removal is necessary or not. Different practitioners will say different things, and I am of the opinion that just the super common and domain-specific stop words should be removed. Here, we'll define a stop list and subsequently remove the tokens from all our documents if the token appears in this stop list.


In [6]:
# stop_list obtained from: https://gist.github.com/sebleier/554280#gistcomment-2892081
stop_list = ["a", "about", "above", "after", "again", "against", "ain", "all", "am", "an", "and", "any", "are", "aren", "aren't", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "can", "couldn", "couldn't", "d", "did", "didn", "didn't", "do", "does", "doesn", "doesn't", "doing", "don", "don't", "down", "during", "each", "few", "for", "from", "further", "had", "hadn", "hadn't", "has", "hasn", "hasn't", "have", "haven", "haven't", "having", "he", "her", "here", "hers", "herself", "him", "himself", "his", "how", "i", "if", "in", "into", "is", "isn", "isn't", "it", "it's", "its", "itself", "just", "ll", "m", "ma", "me", "mightn", "mightn't", "more", "most", "mustn", "mustn't", "my", "myself", "needn", "needn't", "no", "nor", "not", "now", "o", "of", "off", "on", "once", "only", "or", "other", "our", "ours", "ourselves", "out", "over", "own", "re", "s", "same", "shan", "shan't", "she", "she's", "should", "should've", "shouldn", "shouldn't", "so", "some", "such", "t", "than", "that", "that'll", "the", "their", "theirs", "them", "themselves", "then", "there", "these", "they", "this", "those", "through", "to", "too", "under", "until", "up", "ve", "very", "was", "wasn", "wasn't", "we", "were", "weren", "weren't", "what", "when", "where", "which", "while", "who", "whom", "why", "will", "with", "won", "won't", "wouldn", "wouldn't", "y", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves", "could", "he'd", "he'll", "he's", "here's", "how's", "i'd", "i'll", "i'm", "i've", "let's", "ought", "she'd", "she'll", "that's", "there's", "they'd", "they'll", "they're", "they've", "we'd", "we'll", "we're", "we've", "what's", "when's", "where's", "who's", "why's", "would", "able", "abst", "accordance", "according", "accordingly", "across", "act", "actually", "added", "adj", "affected", "affecting", "affects", "afterwards", "ah", "almost", "alone", "along", "already", "also", "although", "always", "among", 
"amongst", "announce", "another", "anybody", "anyhow", "anymore", "anyone", "anything", "anyway", "anyways", "anywhere", "apparently", "approximately", "arent", "arise", "around", "aside", "ask", "asking", "auth", "available", "away", "awfully", "b", "back", "became", "become", "becomes", "becoming", "beforehand", "begin", "beginning", "beginnings", "begins", "behind", "believe", "beside", "besides", "beyond", "biol", "brief", "briefly", "c", "ca", "came", "cannot", "can't", "cause", "causes", "certain", "certainly", "co", "com", "come", "comes", "contain", "containing", "contains", "couldnt", "date", "different", "done", "downwards", "due", "e", "ed", "edu", "effect", "eg", "eight", "eighty", "either", "else", "elsewhere", "end", "ending", "enough", "especially", "et", "etc", "even", "ever", "every", "everybody", "everyone", "everything", "everywhere", "ex", "except", "f", "far", "ff", "fifth", "first", "five", "fix", "followed", "following", "follows", "former", "formerly", "forth", "found", "four", "furthermore", "g", "gave", "get", "gets", "getting", "give", "given", "gives", "giving", "go", "goes", "gone", "got", "gotten", "h", "happens", "hardly", "hed", "hence", "hereafter", "hereby", "herein", "heres", "hereupon", "hes", "hi", "hid", "hither", "home", "howbeit", "however", "hundred", "id", "ie", "im", "immediate", "immediately", "importance", "important", "inc", "indeed", "index", "information", "instead", "invention", "inward", "itd", "it'll", "j", "k", "keep", "keeps", "kept", "kg", "km", "know", "known", "knows", "l", "largely", "last", "lately", "later", "latter", "latterly", "least", "less", "lest", "let", "lets", "like", "liked", "likely", "line", "little", "'ll", "look", "looking", "looks", "ltd", "made", "mainly", "make", "makes", "many", "may", "maybe", "mean", "means", "meantime", "meanwhile", "merely", "mg", "might", "million", "miss", "ml", "moreover", "mostly", "mr", "mrs", "much", "mug", "must", "n", "na", "name", "namely", "nay", "nd", 
"near", "nearly", "necessarily", "necessary", "need", "needs", "neither", "never", "nevertheless", "new", "next", "nine", "ninety", "nobody", "non", "none", "nonetheless", "noone", "normally", "nos", "noted", "nothing", "nowhere", "obtain", "obtained", "obviously", "often", "oh", "ok", "okay", "old", "omitted", "one", "ones", "onto", "ord", "others", "otherwise", "outside", "overall", "owing", "p", "page", "pages", "part", "particular", "particularly", "past", "per", "perhaps", "placed", "please", "plus", "poorly", "possible", "possibly", "potentially", "pp", "predominantly", "present", "previously", "primarily", "probably", "promptly", "proud", "provides", "put", "q", "que", "quickly", "quite", "qv", "r", "ran", "rather", "rd", "readily", "really", "recent", "recently", "ref", "refs", "regarding", "regardless", "regards", "related", "relatively", "research", "respectively", "resulted", "resulting", "results", "right", "run", "said", "saw", "say", "saying", "says", "sec", "section", "see", "seeing", "seem", "seemed", "seeming", "seems", "seen", "self", "selves", "sent", "seven", "several", "shall", "shed", "shes", "show", "showed", "shown", "showns", "shows", "significant", "significantly", "similar", "similarly", "since", "six", "slightly", "somebody", "somehow", "someone", "somethan", "something", "sometime", "sometimes", "somewhat", "somewhere", "soon", "sorry", "specifically", "specified", "specify", "specifying", "still", "stop", "strongly", "sub", "substantially", "successfully", "sufficiently", "suggest", "sup", "sure", "take", "taken", "taking", "tell", "tends", "th", "thank", "thanks", "thanx", "thats", "that've", "thence", "thereafter", "thereby", "thered", "therefore", "therein", "there'll", "thereof", "therere", "theres", "thereto", "thereupon", "there've", "theyd", "theyre", "think", "thou", "though", "thoughh", "thousand", "throug", "throughout", "thru", "thus", "til", "tip", "together", "took", "toward", "towards", "tried", "tries", "truly", "try", 
"trying", "ts", "twice", "two", "u", "un", "unfortunately", "unless", "unlike", "unlikely", "unto", "upon", "ups", "us", "use", "used", "useful", "usefully", "usefulness", "uses", "using", "usually", "v", "value", "various", "'ve", "via", "viz", "vol", "vols", "vs", "w", "want", "wants", "wasnt", "way", "wed", "welcome", "went", "werent", "whatever", "what'll", "whats", "whence", "whenever", "whereafter", "whereas", "whereby", "wherein", "wheres", "whereupon", "wherever", "whether", "whim", "whither", "whod", "whoever", "whole", "who'll", "whomever", "whos", "whose", "widely", "willing", "wish", "within", "without", "wont", "words", "world", "wouldnt", "www", "x", "yes", "yet", "youd", "youre", "z", "zero", "a's", "ain't", "allow", "allows", "apart", "appear", "appreciate", "appropriate", "associated", "best", "better", "c'mon", "c's", "cant", "changes", "clearly", "concerning", "consequently", "consider", "considering", "corresponding", "course", "currently", "definitely", "described", "despite", "entirely", "exactly", "example", "going", "greetings", "hello", "help", "hopefully", "ignored", "inasmuch", "indicate", "indicated", "indicates", "inner", "insofar", "it'd", "keep", "keeps", "novel", "presumably", "reasonably", "second", "secondly", "sensible", "serious", "seriously", "sure", "t's", "third", "thorough", "thoroughly", "three", "well", "wonder", "a", "about", "above", "above", "across", "after", "afterwards", "again", "against", "all", "almost", "alone", "along", "already", "also", "although", "always", "am", "among", "amongst", "amoungst", "amount", "an", "and", "another", "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are", "around", "as", "at", "back", "be", "became", "because", "become", "becomes", "becoming", "been", "before", "beforehand", "behind", "being", "below", "beside", "besides", "between", "beyond", "bill", "both", "bottom", "but", "by", "call", "can", "cannot", "cant", "co", "con", "could", "couldnt", "cry", "de", "describe", 
"detail", "do", "done", "down", "due", "during", "each", "eg", "eight", "either", "eleven", "else", "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone", "everything", "everywhere", "except", "few", "fifteen", "fify", "fill", "find", "fire", "first", "five", "for", "former", "formerly", "forty", "found", "four", "from", "front", "full", "further", "get", "give", "go", "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter", "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his", "how", "however", "hundred", "ie", "if", "in", "inc", "indeed", "interest", "into", "is", "it", "its", "itself", "keep", "last", "latter", "latterly", "least", "less", "ltd", "made", "many", "may", "me", "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly", "move", "much", "must", "my", "myself", "name", "namely", "neither", "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone", "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on", "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our", "ours", "ourselves", "out", "over", "own", "part", "per", "perhaps", "please", "put", "rather", "re", "same", "see", "seem", "seemed", "seeming", "seems", "serious", "several", "she", "should", "show", "side", "since", "sincere", "six", "sixty", "so", "some", "somehow", "someone", "something", "sometime", "sometimes", "somewhere", "still", "such", "system", "take", "ten", "than", "that", "the", "their", "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "therefore", "therein", "thereupon", "these", "they", "thickv", "thin", "third", "this", "those", "though", "three", "through", "throughout", "thru", "thus", "to", "together", "too", "top", "toward", "towards", "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us", "very", "via", "was", "we", "well", "were", "what", "whatever", "when", "whence", "whenever", "where", "whereafter", 
"whereas", "whereby", 
             "wherein", "whereupon", "wherever", "whether", "which", "while", "whither", "who", "whoever", "whole", "whom", "whose", "why", "will", "with", "within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves", "the", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z", "co", "op", "research-articl", "pagecount", "cit", "ibid", "les", "le", "au", "que", "est", "pas", "vol", "el", "los", "pp", "u201d", "well-b", "http", "volumtype", "par", "0o", "0s", "3a", "3b", "3d", "6b", "6o", "a1", "a2", "a3", "a4", "ab", "ac", "ad", "ae", "af", "ag", "aj", "al", "an", "ao", "ap", "ar", "av", "aw", "ax", "ay", "az", "b1", "b2", "b3", "ba", "bc", "bd", "be", "bi", "bj", "bk", "bl", "bn", "bp", "br", "bs", "bt", "bu", "bx", "c1", "c2", "c3", "cc", "cd", "ce", "cf", "cg", "ch", "ci", "cj", "cl", "cm", "cn", "cp", "cq", "cr", "cs", "ct", "cu", "cv", "cx", "cy", "cz", "d2", "da", "dc", "dd", "de", "df", "di", "dj", "dk", "dl", "do", "dp", "dr", "ds", "dt", "du", "dx", "dy", "e2", "e3", "ea", "ec", "ed", "ee", "ef", "ei", "ej", "el", "em", "en", "eo", "ep", "eq", "er", "es", "et", "eu", "ev", "ex", "ey", "f2", "fa", "fc", "ff", "fi", "fj", "fl", "fn", "fo", "fr", "fs", "ft", "fu", "fy", "ga", "ge", "gi", "gj", "gl", "go", "gr", "gs", "gy", "h2", "h3", "hh", "hi", "hj", "ho", "hr", "hs", "hu", "hy", "i", "i2", "i3", "i4", "i6", "i7", "i8", "ia", "ib", "ic", "ie", "ig", "ih", "ii", "ij", "il", "in", "io", "ip", "iq", "ir", "iv", "ix", "iy", "iz", "jj", "jr", "js", "jt", "ju", "ke", "kg", "kj", "km", "ko", "l2", "la", "lb", "lc", "lf", "lj", "ln", "lo", "lr", "ls", "lt", "m2", "ml", "mn", "mo", "ms", "mt", "mu", "n2", "nc", "nd", "ne", "ng", "ni", "nj", "nl", "nn", "nr", "ns", "nt", "ny", "oa", "ob", "oc", "od", "of", "og", "oi", "oj", "ol", "om", 
"on", "oo", "oq", "or", "os", "ot", "ou", "ow", "ox", "oz", "p1", "p2", "p3", "pc", "pd", "pe", "pf", "ph", "pi", "pj", "pk", "pl", "pm", "pn", "po", "pq", "pr", "ps", "pt", "pu", "py", "qj", "qu", "r2", "ra", "rc", "rd", "rf", "rh", "ri", "rj", "rl", "rm", "rn", "ro", "rq", "rr", "rs", "rt", "ru", "rv", "ry", "s2", "sa", "sc", "sd", "se", "sf", "si", "sj", "sl", "sm", "sn", "sp", "sq", "sr", "ss", "st", "sy", "sz", "t1", "t2", "t3", "tb", "tc", "td", "te", "tf", "th", "ti", "tj", "tl", "tm", "tn", "tp", "tq", "tr", "ts", "tt", "tv", "tx", "ue", "ui", "uj", "uk", "um", "un", "uo", "ur", "ut", "va", "wa", "vd", "wi", "vj", "vo", "wo", "vq", "vt", "vu", "x1", "x2", "x3", "xf", "xi", "xj", "xk", "xl", "xn", "xo", "xs", "xt", "xv", "xx", "y2", "yj", "yl", "yr", "ys", "yt", "zi", "zz"]
stop_list.extend(["from", "subject", "organization", "lines", "distribution", "replyto", "nntppostinghost", "originator", "_"])
    
## Remove the tokens in a document if the token is present in stop_list
documents = []


corpus.data = documents
corpus.data[:3]

[['hsieh',
  'videoaudiocomputer',
  'equipment',
  'sale',
  'university',
  'texas',
  'austin',
  'austin',
  'hsieh',
  'mickeyccutexasedu',
  'time',
  'spring',
  'cleaning',
  'items',
  'sale',
  'roland',
  'multitimbre',
  'sound',
  'module',
  'synthesis',
  'upto',
  'simultaneous',
  'voices',
  'preset',
  'timbres',
  'char',
  'backlit',
  'lcd',
  'display',
  'midi',
  'inoutthru',
  'reference',
  'card',
  'stereo',
  'output',
  'great',
  'games',
  'support',
  'music',
  'superior',
  'sound',
  'card',
  'experimenting',
  'midi',
  'adding',
  'additional',
  'sounds',
  'midi',
  'setup',
  'shipping',
  'canon',
  'xapshot',
  'video',
  'camera',
  'includes',
  'camera',
  'carrying',
  'pouch',
  'battery',
  'pack',
  'battery',
  'charger',
  'adapter',
  'video',
  'cables',
  'floppies',
  'disk',
  'holds',
  'pictures',
  'pics',
  'total',
  'manuals',
  'video',
  'output',
  'standard',
  'ntsc',
  'composite',
  'ntsc',
  'device',
  'televisio
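
One possible way to fill in the removal step is sketched below. The helper name `remove_stop_words` and the small demo lists are hypothetical and only there for illustration. The sketch converts the stop list to a `set` once, since membership checks against a set are much cheaper than against a long list.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens that do not appear in stop_words, preserving their order."""
    return [token for token in tokens if token not in stop_words]


# Hypothetical, tiny stop list for the demo; in the notebook this would be stop_list
demo_stop_set = set(["from", "subject", "for", "the", "s"])

demo_tokens = ["from", "s", "hsieh", "subject", "videoaudiocomputer", "equipment", "for", "sale"]
demo_filtered = remove_stop_words(demo_tokens, demo_stop_set)
print(demo_filtered)
```

With the notebook's own variables, the same idea would be a loop or list comprehension over `corpus.data` that appends each filtered document to `documents`.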

## Stemming and Lemmatization
**Stemming** and **Lemmatization** are two distinct word normalization techniques. Essentially this means that, given our corpus, we wish to reduce variants of a word to a 'normal' form. For example, [playing, plays, played] may be normalised to "play". The sentence "the boy's cars are different colours" may be normalised to "the boy car be differ colour".

### Stemming
In the case of stemming, we want to normalise all words to their stem (or root). The stem is the part of the word to which affixes (suffixes or prefixes) are attached. Stemming a word may produce something that is not actually a word. For example, some stemming algorithms may stem [trouble, troubling, troubled] as "troubl".
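
To get a feel for what a stemmer does, here is a deliberately crude suffix-stripping sketch. It is a toy for illustration only, not the Porter algorithm or any other published stemmer; the suffix list and the `min_stem_length` parameter are arbitrary assumptions.

```python
# Suffixes to strip, checked in order (an arbitrary, hand-picked list)
TOY_SUFFIXES = ("ing", "ed", "es", "s", "e")


def toy_stem(word, min_stem_length=3):
    """Strip the first matching suffix, provided enough of the word is left over."""
    for suffix in TOY_SUFFIXES:
        candidate = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(candidate) >= min_stem_length:
            return candidate
    return word  # no suffix matched, so leave the word untouched


trouble_stems = [toy_stem(w) for w in ["trouble", "troubling", "troubled"]]
play_stems = [toy_stem(w) for w in ["plays", "playing", "played"]]
print(trouble_stems)
print(play_stems)
```

Because it only chops characters off the end, a rule-based approach like this has no way of knowing whether the result is a real word, which is exactly how a non-word such as "troubl" comes about.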

### Lemmatization
Lemmatization attempts to properly reduce unnormalized tokens to a word that belongs in the language. The root word is called a lemma, and is the canonical form of a set of words. For example, [runs, running, ran] are all forms of the word "run".
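
Conceptually, lemmatization behaves more like a dictionary lookup than like suffix stripping. The sketch below uses a tiny hand-written table (`TOY_LEMMAS`, a hypothetical name) purely to illustrate this; real lemmatizers rely on a full vocabulary resource such as WordNet together with morphological rules.

```python
# Hypothetical hand-written lemma table, for illustration only
TOY_LEMMAS = {
    "runs": "run",
    "running": "run",
    "ran": "run",    # irregular form: no suffix rule would recover "run" from "ran"
    "feet": "foot",  # irregular plural
}


def toy_lemmatize(word):
    """Look the word up in the lemma table; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word.lower(), word)


run_lemmas = [toy_lemmatize(w) for w in ["runs", "running", "ran"]]
print(run_lemmas)
print(toy_lemmatize("feet"))
```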

Before we get round to implementing this, it's worth talking about two popular NLP libraries: `nltk` and `SpaCy`. `nltk` is a general purpose library which provides many functions applicable to natural language. `SpaCy` is a more modern NLP library and is opinionated about the way language should be processed. Where possible, I recommend using `SpaCy` (especially when working with neural-based NLP). However, because of its opinionated nature, it doesn't provide a stemmer. A linguist has been quoted saying: "Stemming is the poor-man's lemmatization" (Noah Smith, 2011).

Here we will use `nltk` to demonstrate stemming and lemmatizing, and then we will load in SpaCy and use its lemmatizer to lemmatize our corpus.

In [7]:
import nltk
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Nihir\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [8]:
from nltk import stem

porter = stem.porter.PorterStemmer()
wnl = stem.WordNetLemmatizer()

word_list = ["feet", "foot", "foots", "footing"]

In [9]:
[porter.stem(word) for word in word_list]

['feet', 'foot', 'foot', 'foot']

In [10]:
[wnl.lemmatize(word) for word in word_list]

['foot', 'foot', 'foot', 'footing']

Try stemming and lemmatizing the following words. What do you notice?
- "fly", "flies", "flying"
- "organize", "organizes", "organizing"
- "universe", "university"

In [11]:
word_list1 = ["fly", "flies", "flying"]
word_list2 = ["organize", "organizes", "organizing"]
word_list3 = ["universe", "university"]

In [12]:
[porter.stem(word) for word in word_list3]

['univers', 'univers']

In [13]:
[wnl.lemmatize(word) for word in word_list1]

['fly', 'fly', 'flying']

Now, stem and lemmatize the following sentence:

`sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."`

In [14]:
sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."

In [15]:
## Stem the sentence


['He',
 'wa',
 'run',
 'and',
 'eat',
 'at',
 'same',
 'time.',
 'He',
 'ha',
 'bad',
 'habit',
 'of',
 'swim',
 'after',
 'play',
 'long',
 'hour',
 'in',
 'the',
 'sun.']

In [16]:
## Lemmatize the sentence


['He',
 'wa',
 'running',
 'and',
 'eating',
 'at',
 'same',
 'time.',
 'He',
 'ha',
 'bad',
 'habit',
 'of',
 'swimming',
 'after',
 'playing',
 'long',
 'hour',
 'in',
 'the',
 'Sun.']

Woah! Why did the above barely do anything? It's because the lemmatizer requires part-of-speech (POS) context about the word it is currently parsing, and without it `nltk`'s `WordNetLemmatizer` treats every word as a noun. We would need to use a POS model to identify the POS of each token in its context. We will use `nltk`'s built-in POS tagger to tag the parts of speech, and then feed the word and POS to the lemmatizer.

In [17]:
nltk.download('averaged_perceptron_tagger')
words_with_pos = nltk.pos_tag(sentence.split())
words_with_pos

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Nihir\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[('He', 'PRP'),
 ('was', 'VBD'),
 ('running', 'VBG'),
 ('and', 'CC'),
 ('eating', 'VBG'),
 ('at', 'IN'),
 ('same', 'JJ'),
 ('time.', 'NN'),
 ('He', 'PRP'),
 ('has', 'VBZ'),
 ('bad', 'JJ'),
 ('habit', 'NN'),
 ('of', 'IN'),
 ('swimming', 'NN'),
 ('after', 'IN'),
 ('playing', 'VBG'),
 ('long', 'JJ'),
 ('hours', 'NNS'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('Sun.', 'NNP')]

In [18]:
# From: https://stackoverflow.com/a/15590384/3297011
from nltk.corpus import wordnet
def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

In [19]:
[wnl.lemmatize(word, pos=get_wordnet_pos(pos)) for word, pos in words_with_pos]

['He',
 'be',
 'run',
 'and',
 'eat',
 'at',
 'same',
 'time.',
 'He',
 'have',
 'bad',
 'habit',
 'of',
 'swimming',
 'after',
 'play',
 'long',
 'hour',
 'in',
 'the',
 'Sun.']

Alternatively, we can just use `spaCy` and save ourselves the headache of doing this :). We need to install a language model for spaCy first, though.

```
pip install spacy
sudo python -m spacy download en
```

(If you're on Windows, don't use `sudo`; run your shell as an administrator instead.)

In [20]:
import spacy

nlp = spacy.load("en")

doc = nlp(sentence)
for token in doc:
    print(token.text, "\t\t", token.lemma_)

He 		 -PRON-
was 		 be
running 		 run
and 		 and
eating 		 eat
at 		 at
same 		 same
time 		 time
. 		 .
He 		 -PRON-
has 		 have
bad 		 bad
habit 		 habit
of 		 of
swimming 		 swimming
after 		 after
playing 		 play
long 		 long
hours 		 hour
in 		 in
the 		 the
Sun 		 Sun
. 		 .


<details>
    <summary>
        <b>> What is "-PRON-"?</b>
    </summary>
    Source: <a href="https://spacy.io/api/annotation">https://spacy.io/api/annotation</a>

spaCy adds a special case for English pronouns: all English pronouns are lemmatized to the special token -PRON-. Unlike verbs and common nouns, there’s no clear base form of a personal pronoun. Should the lemma of “me” be “I”, or should we normalize person as well, giving “it” — or maybe “he”? spaCy’s solution is to introduce a novel symbol, -PRON-, which is used as the lemma for all personal pronouns.
</details>

Great! Now, using the `spaCy` lemmatizer, lemmatize our corpus. Keep pronouns in their original form rather than replacing them with "-PRON-". Represent the new `corpus.data` as a list of documents, where each document is a string.
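
As a rough illustration of the pronoun rule, the sketch below joins lemmas back into a string and falls back to the original text whenever the lemma is "-PRON-". The `Token` tuple and the `lemmatize_keep_pronouns` function are hypothetical names, used here as a stand-in for spaCy tokens (which expose the same `text` and `lemma_` attributes), so the snippet runs without a language model installed.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token: only the two attributes we need.
Token = namedtuple("Token", ["text", "lemma_"])


def lemmatize_keep_pronouns(doc):
    """Join the lemmas of an iterable of tokens, keeping pronouns as written."""
    return " ".join(
        token.text if token.lemma_ == "-PRON-" else token.lemma_
        for token in doc
    )


# Made-up tokens mimicking the output shown earlier for "He was running".
fake_doc = [Token("He", "-PRON-"), Token("was", "be"), Token("running", "run")]
result = lemmatize_keep_pronouns(fake_doc)
print(result)
```

With a real model loaded, the same function could be applied to `nlp(text)` for each document in the corpus, assuming the model emits "-PRON-" for pronouns as in the output above.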

In [21]:
from tqdm import tqdm

## Lemmatize the corpus
documents = []

    
corpus.data = documents
corpus.data[:3]

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2242/2242 [00:41<00:00, 53.65it/s]


['hsieh videoaudiocomputer equipment sale university texas austin austin hsieh mickeyccutexasedu time spring cleaning items sale roland multitimbre sound module synthesis upto simultaneous voice preset timbre char backlit lcd display midi inoutthru reference card stereo output great games support music superior sound card experimenting midi add additional sound midi setup ship canon xapshot video camera include camera carry pouch battery pack battery charger adapter video cable floppy disk hold picture pic total manual video output standard ntsc composite ntsc device television direct view picture vcr record slideshow computer video digitizer savemanipulate picture computer shipping ambico video enhanceraudio mixer threeline stereo audio mixer microphone input master volume slider wvideo enhancer boost sharpen video image dub vcrvcr camcordervcr shipping baud internal modem ship quantum mb internal prodrive hard disk unit turn unreliable erratic usage simple easily fix problem major pr

## Bag Of Words
Taking stock, we currently have some processed data. We can't feed raw text straight into a model, so we face the challenge of how to represent this data to a computer. In traditional NLP, two common techniques are used: **Bag of Words** (BoW) and **TF-IDF** (which we'll look at later in this notebook). BoW is an extremely simple approach which works surprisingly well for representing language. Essentially, we create a **document-term matrix**. This is simply a matrix with as many rows as documents, and as many columns as unique words/tokens/terms in the whole corpus. For every document-term pair, the matrix stores a count of how many times that term appears in that document. Visually,

[Image: Bag of Words document-term matrix]

The vector for one document is thus just the row of that document! We could imagine building this ourselves relatively trivially (and we'll do that in a bit). For now, however, we'll use a helper function provided to us by sklearn - the CountVectorizer.
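
To make the document-term matrix concrete, here is a minimal hand-rolled sketch on two made-up sentences. The variable names and the whitespace tokenisation are assumptions for illustration only; `CountVectorizer` tokenises differently and returns a sparse matrix.

```python
from collections import Counter

# Two made-up documents, purely for illustration.
docs = ["the cat sat", "the cat saw the dog"]

# Lowercase and split on whitespace (a deliberately naive tokeniser).
tokenised = [doc.lower().split() for doc in docs]

# One column per unique term in the whole corpus, in alphabetical order.
vocab = sorted({token for doc in tokenised for token in doc})

# One row per document: how many times each vocabulary term appears in it.
matrix = []
for doc in tokenised:
    counts = Counter(doc)
    matrix.append([counts[term] for term in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Each row is the vector for one document, and its length equals the number of unique terms in the corpus.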

In [22]:
from sklearn.feature_extraction.text import CountVectorizer

## Create a document-term matrix using CountVectorizer.

dt_matrix.shape

(2242, 27067)

In [23]:
print(dt_matrix[:2])

  (0, 10739)	3
  (0, 25631)	1
  (0, 7548)	1
  (0, 20640)	2
  (0, 24977)	1
  (0, 23727)	1
  (0, 1749)	2
  (0, 14645)	1
  (0, 24014)	2
  (0, 22390)	1
  (0, 4060)	1
  (0, 11984)	1
  (0, 20356)	1
  (0, 15432)	1
  (0, 22085)	3
  (0, 15044)	1
  (0, 23315)	1
  (0, 25163)	1
  (0, 21663)	1
  (0, 25786)	1
  (0, 18417)	1
  (0, 24013)	1
  (0, 3734)	1
  (0, 1914)	1
  (0, 13100)	1
  :	:
  (1, 24977)	1
  (1, 23727)	1
  (1, 3374)	1
  (1, 8494)	1
  (1, 14277)	2
  (1, 10055)	1
  (1, 16798)	2
  (1, 1735)	2
  (1, 16182)	1
  (1, 25586)	2
  (1, 4317)	1
  (1, 15108)	1
  (1, 6829)	1
  (1, 2626)	1
  (1, 12295)	1
  (1, 15272)	1
  (1, 19331)	1
  (1, 6764)	1
  (1, 20047)	1
  (1, 8001)	1
  (1, 22026)	1
  (1, 2165)	1
  (1, 18314)	1
  (1, 12229)	1
  (1, 1475)	1


## LSA
**Latent Semantic Analysis**, otherwise known as LSA or LSI (Latent Semantic Indexing), is a methodology for finding 'hidden' meanings between a set of documents and terms. Mathematically, it's given by:
$$
X \approx USV^T
$$

Seem familiar...? Yes! LSA is actually SVD applied to NLP! As a quick recap of shapes:
- $U \in \mathbb{R}^{m \times k}$
- $S \in \mathbb{R}^{k \times k}$
- $V \in \mathbb{R}^{n \times k}$

$k$ is the 'hyperparameter' we choose to represent our desired number of topics. $m$ is the number of documents in our matrix while $n$ is the number of terms. 
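
The shapes above can be checked on a small example. The sketch below uses NumPy's full SVD and keeps only the top $k$ singular values; the toy matrix is made up, and this is an illustration of the decomposition rather than of how `TruncatedSVD` computes it internally.

```python
import numpy as np

# Made-up document-term matrix: m = 4 documents (rows), n = 5 terms (columns).
X = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 1],
    [0, 0, 3, 1, 0],
    [0, 0, 1, 2, 1],
], dtype=float)

k = 2  # number of topics to keep

# Thin SVD; singular values come back sorted from largest to smallest.
U_full, s_full, Vt_full = np.linalg.svd(X, full_matrices=False)

U = U_full[:, :k]        # (m, k): documents x topics
S = np.diag(s_full[:k])  # (k, k): strength of each topic
V = Vt_full[:k, :].T     # (n, k): terms x topics

X_approx = U @ S @ V.T   # rank-k approximation of X

print(U.shape, S.shape, V.shape, X_approx.shape)
```

In sklearn terms, the output of `fit_transform` has the same shape as $US$ (documents by topics), and `components_` has the same shape as $V^T$ (topics by terms).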

In [24]:
from sklearn.decomposition import TruncatedSVD

lsa = TruncatedSVD(n_components=6, n_iter=200)
fitted = lsa.fit_transform(dt_matrix)
fitted.shape

(2242, 6)

Think about what this shape means. What is represented as the number of rows? What about the number of columns?

In [25]:
print(len(vectorizer.get_feature_names_out()))  # use get_feature_names() on scikit-learn < 1.0
print(lsa.components_.shape)

27067
(6, 27067)


We can gather that 27067 is the number of unique terms/tokens in our corpus (after all the pre-processing steps). With this knowledge, what does `lsa.components_.shape` imply? We interpret the matrix as having `n_components` rows and 27067 columns. Every row represents a topic, and every column holds the weighting of one term for that topic.

So if we wanted to get the most influential terms per topic, we would need to loop over all of the rows in `lsa.components_`. What would be the `len` of one row then? We need some way of 'combining' the terms and the component values so that we can sort the component values from highest to lowest. This gives us a list of the most influential words; we then just need to extract however many we want and print them!

In [26]:
## Find the top 10 most influential words for a concept
terms = vectorizer.get_feature_names_out()  # use get_feature_names() on scikit-learn < 1.0

for i, comp in enumerate(lsa.components_):
    # Zip the components and terms together
    # Sort the zipped terms (on what?), and only consider the top 10 highest value terms

    
    print("Concept {}".format(i))
    for term in sorted_terms:
        print(term[0])
        
    print()

Concept 0
image
jpeg
file
format
gif
program
color
version
display
software

Concept 1
space
launch
satellite
year
god
mission
nasa
market
include
commercial

Concept 2
god
atheist
not
people
jesus
do
religion
atheism
exist
religious

Concept 3
jpeg
launch
space
satellite
gif
commercial
quality
market
year
venture

Concept 4
jesus
matthew
prophecy
gd
day
psalm
speak
isaiah
prophet
messiah

Concept 5
launch
image
satellite
tool
market
commercial
processing
plot
analysis
venture
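
One possible way to do the zip-and-sort step is sketched below on made-up weights. The `top_terms` helper, the five terms and the two weight rows are all hypothetical; in the notebook the rows would come from `lsa.components_` and the terms from the vectorizer.

```python
# Made-up vocabulary and topic weights, standing in for the real
# vectorizer terms and the rows of lsa.components_.
terms = ["image", "jpeg", "space", "launch", "god"]
components = [
    [0.9, 0.7, 0.1, 0.0, 0.05],
    [0.1, 0.0, 0.8, 0.6, 0.2],
]


def top_terms(terms, weights, n=3):
    """Return the n terms with the highest weight for one topic."""
    paired = zip(terms, weights)  # (term, weight) pairs
    ranked = sorted(paired, key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in ranked[:n]]


for i, comp in enumerate(components):
    print("Concept {}: {}".format(i, top_terms(terms, comp)))
```

Sorting on the weight (the second element of each pair) rather than on the term is what keeps the terms attached to their scores.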



Ok... loosely we can distinguish between some of the topics, but the separation is clearly not perfect. There are two methods we'll use in an attempt to improve on this: **TF-IDF** and **Latent Dirichlet Allocation** (LDA). TF-IDF is a way to represent our data to a computer (instead of Bag of Words), whereas LDA is the algorithm we'll use to allocate our topics. Although the acronyms LSA and LDA differ by only one letter, the mathematical formulation of LDA is considerably more complex. We'll discuss this in a bit, but first let's talk about TF-IDF.

## TF-IDF
TF-IDF stands for Term Frequency - Inverse Document Frequency. It is a measure of how important a word/term is to a document, offset by how frequently the term occurs across the corpus. A [survey](http://nbn-resolving.de/urn:nbn:de:bsz:352-0-311312) conducted in 2015 showed that 83% of text-based recommender systems in digital libraries use tf–idf. Intuitively, if a word appears many times across the corpus, it is probably not an important word. However, if a word appears many times in a document, but *not* in the rest of the corpus, it probably is an important word for that document.

TF-IDF is composed of two terms:
1) **Term Frequency:** The normalised number of times a word appears in a document. It is normalised because a term is likely to appear more times in a longer document than in a shorter one.

$$
\text{TF(t)} = \frac{\text{Number of times term appears in a document}}{\text{Total number of terms in document}}
$$

2) **Inverse Document Frequency:** Measures how important a term is relative to the *corpus*. Common words (e.g. stop words) appear many times but provide little information about the topic of a document. IDF therefore scales down terms that occur in many documents of the corpus and scales up terms that occur in only a few.

$$
\text{IDF(t)} = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents which contain term}}\right)
$$

TF-IDF is then just:
$$
\text{TF-IDF(t)} = \text{TF(t)} * \text{IDF(t)}
$$

Note that we end up with a document-term matrix, similar to what we had with BoW - the rows are the document and the columns are the terms. Each entry in this matrix will be the tf-idf score for a given token for a given document.

As an example (we'll implement this below), assume we have a corpus of 1000 documents, and we want the tf-idf score for the word "dog" in the first document. Let's say "dog" appears 3 times and the document is 150 terms long. The TF score is thus $3/150 = 0.02$. Now let's say "dog" appears in 400 of our documents. Using the natural logarithm, the IDF score is $\log(1000/400) = \log(2.5) \approx 0.916$. For our first document, the TF-IDF score for "dog" is therefore $0.02 \times 0.916 \approx 0.018$.
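
The "dog" example can be checked in a few lines. This sketch applies the TF and IDF formulas given earlier with the natural logarithm (an assumption, since the formula does not name a base); the variable names are illustrative.

```python
import math

n_docs = 1000          # documents in the corpus
docs_with_term = 400   # documents containing "dog"
term_count = 3         # occurrences of "dog" in the first document
doc_length = 150       # total terms in the first document

tf = term_count / doc_length              # term frequency
idf = math.log(n_docs / docs_with_term)   # inverse document frequency (natural log)
tf_idf = tf * idf

print(round(tf, 4), round(idf, 4), round(tf_idf, 4))  # 0.02 0.9163 0.0183
```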

Below, we'll implement this from scratch. Feel free to implement it however you think appropriate, but purely for explicitness (and not efficiency), I would implement this as follows. The first step would be to create a document-term matrix/dataframe (remember to lower-case everything!). As a reminder, the shape of this will be document x unique_terms. The entries will be the number of times the term has appeared in the document. I would also create a list of length number_of_documents, where each entry is the length of the document. Then, I would loop over every row in the matrix, and for each term in that row, work out the TF score for it.

In [27]:
import pandas as pd

toy_corpus = ['This is the first document', 
              'This document is the second document', 
              'And this is the third one', 
              'Is this the first document']

## Lowercase the toy_corpus


## Create a document-term dataframe with entries as zeros. Rows should be the document and columns the terms

print(doc_lens)
dt_matrix

[5, 6, 6, 5]


|  | document | second | is | first | this | and | one | third | the |
|---|---|---|---|---|---|---|---|---|---|
| this is the first document | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| this document is the second document | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| and this is the third one | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| is this the first document | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [28]:
## Populate the document term matrix with the counts of each word


dt_matrix

|  | document | second | is | first | this | and | one | third | the |
|---|---|---|---|---|---|---|---|---|---|
| this is the first document | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| this document is the second document | 2.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| and this is the third one | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| is this the first document | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |


In [29]:
## Populate the matrix with TF scores

        
dt_matrix

|  | document | second | is | first | this | and | one | third | the |
|---|---|---|---|---|---|---|---|---|---|
| this is the first document | 0.2 | 0.0 | 0.2 | 0.2 | 0.2 | 0.0 | 0.0 | 0.0 | 0.2 |
| this document is the second document | 0.333333 | 0.166667 | 0.166667 | 0.0 | 0.166667 | 0.0 | 0.0 | 0.0 | 0.166667 |
| and this is the third one | 0.0 | 0.0 | 0.166667 | 0.0 | 0.166667 | 0.166667 | 0.166667 | 0.166667 | 0.166667 |
| is this the first document | 0.2 | 0.0 | 0.2 | 0.2 | 0.2 | 0.0 | 0.0 | 0.0 | 0.2 |


In [30]:
## Populate the matrix with TF-IDF scores. This can be done in one pass or multiple passes
        
dt_matrix

|  | document | second | is | first | this | and | one | third | the |
|---|---|---|---|---|---|---|---|---|---|
| this is the first document | 0.057536 | 0.0 | 0.0 | 0.138629 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| this document is the second document | 0.095894 | 0.231049 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| and this is the third one | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.231049 | 0.231049 | 0.231049 | 0.0 |
| is this the first document | 0.057536 | 0.0 | 0.0 | 0.138629 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
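
For reference, one way to arrive at the scores above without pandas is sketched here. It follows the TF and IDF formulas given earlier with the natural logarithm; the variable names and the use of plain dictionaries instead of a dataframe are choices made for this illustration, not the only way to solve the exercise.

```python
import math
from collections import Counter

toy_corpus = ['This is the first document',
              'This document is the second document',
              'And this is the third one',
              'Is this the first document']

# Lowercase and split each document into terms.
docs = [doc.lower().split() for doc in toy_corpus]
vocab = sorted({term for doc in docs for term in doc})
n_docs = len(docs)

# Number of documents that contain each term.
doc_freq = {term: sum(term in doc for doc in docs) for term in vocab}

tfidf = []
for doc in docs:
    counts = Counter(doc)
    row = {}
    for term in vocab:
        tf = counts[term] / len(doc)               # normalised count
        idf = math.log(n_docs / doc_freq[term])    # natural log
        row[term] = tf * idf
    tfidf.append(row)

print(round(tfidf[0]["first"], 6), round(tfidf[1]["document"], 6))
```

Terms such as "the" and "is" appear in every document, so their IDF is $\log(4/4) = 0$ and their score vanishes in every row.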


There are variations of the tf-idf formula demonstrated above, and different variations will give different results. A comprehensive list can be found [here](https://en.wikipedia.org/wiki/Tf%E2%80%93idf). Sklearn's implementation uses one such variant, hence its returned matrix shows different results.

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
# Convert the sparse result to a dense array so pandas can display it
pd.DataFrame(vectorizer.fit_transform(toy_corpus).toarray())

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.469791 | 0.580286 | 0.384085 | 0.0 | 0.0 | 0.384085 | 0.0 | 0.384085 |
| 1 | 0.0 | 0.687624 | 0.0 | 0.281089 | 0.0 | 0.538648 | 0.281089 | 0.0 | 0.281089 |
| 2 | 0.511849 | 0.0 | 0.0 | 0.267104 | 0.511849 | 0.0 | 0.267104 | 0.511849 | 0.267104 |
| 3 | 0.0 | 0.469791 | 0.580286 | 0.384085 | 0.0 | 0.0 | 0.384085 | 0.0 | 0.384085 |
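
To see where sklearn's different numbers come from, the sketch below reimplements what I understand to be the default `TfidfVectorizer` settings by hand: raw counts as TF, a smoothed IDF of $\ln\frac{1 + N}{1 + \text{df}} + 1$, and L2 normalisation of each row. The `sklearn_style_row` helper is a hypothetical name, and the whitespace tokenisation is a simplification that happens to be equivalent for this toy corpus.

```python
import math
from collections import Counter

toy_corpus = ['this is the first document',
              'this document is the second document',
              'and this is the third one',
              'is this the first document']

docs = [doc.split() for doc in toy_corpus]
vocab = sorted({term for doc in docs for term in doc})  # alphabetical columns
n_docs = len(docs)
doc_freq = {term: sum(term in doc for doc in docs) for term in vocab}


def sklearn_style_row(doc):
    """Raw count x smoothed IDF, then scale the row to unit L2 norm."""
    counts = Counter(doc)
    raw = [
        counts[term] * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term in vocab
    ]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


rows = [sklearn_style_row(doc) for doc in docs]
print([round(value, 6) for value in rows[0]])
```

Because of the `+ 1` in the IDF, terms that appear in every document (such as "the") keep a non-zero weight, unlike in the from-scratch version.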


Let's fit an LSA model to a tf-idf vectorized corpus.

In [32]:
## Implement an LSA model on a tf-idf'd corpus with 6 topics. Return the top 10 words for each of the topics.


Concept 0
not
do
write
space
god
article
people
university
thing
time

Concept 1
god
keith
atheist
morality
not
moral
people
do
schneider
atheism

Concept 2
space
moon
launch
nasa
orbit
lunar
billion
shuttle
henry
spencer

Concept 3
keith
morality
moral
objective
schneider
sale
allan
political
jon
goal

Concept 4
file
image
keith
format
polygon
graphic
program
morality
color
tiff

Concept 5
polygon
routine
group
split
newsgroup
islamic
algorithm
islam
aspect
fast



Great! We can also see that this has loosely categorised our documents into some topics. These topics are still a bit 'mixed', although we can see some general trends within some of the displayed concepts. We'll now look at a slightly more powerful algorithm known as LDA.

## LDA
LDA stands for Latent Dirichlet Allocation and is actually a generative model which we can use to assign a document to a collection of topics. That is, one document could contain multiple topics. For example, "The government has increased their space exploration budget by £500 million" could be assigned to the topics of [politics, space, money]. LDA works under the notion that every document could be a mixture of topics, and words within a document are the reason a document has been attributed to the aforementioned mixture.
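
The "mixture of topics" idea can be illustrated with a toy version of the generative story: draw a topic mixture for the document, then draw each word by first picking a topic from that mixture. Everything in this sketch is made up for illustration (the three topics, their word lists, the `alpha` value and the function names), and each topic's words are drawn uniformly, which is a simplification. It only generates documents; fitting LDA means inferring the mixtures and topics from real documents, which this does not attempt.

```python
import random

random.seed(0)

# Made-up topics, each with a tiny made-up vocabulary.
topics = {
    "politics": ["government", "budget", "vote"],
    "space": ["orbit", "launch", "moon"],
    "money": ["million", "price", "cost"],
}
topic_names = list(topics)


def sample_dirichlet(alpha, size):
    """Draw from a symmetric Dirichlet by normalising gamma samples."""
    draws = [random.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [draw / total for draw in draws]


def generate_document(n_words, alpha=0.5):
    """Pick a topic mixture, then pick a topic and a word for each position."""
    theta = sample_dirichlet(alpha, len(topic_names))  # document's topic mixture
    words = []
    for _ in range(n_words):
        topic = random.choices(topic_names, weights=theta)[0]
        words.append(random.choice(topics[topic]))  # uniform within the topic
    return theta, words


theta, words = generate_document(n_words=8)
print([round(weight, 2) for weight in theta])
print(" ".join(words))
```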

My personal opinion is that understanding how the underlying LDA model works is not really relevant to effectively use this algorithm - because the concepts behind it are not really going to come up in a ML or DS related job. To effectively teach this requires knowledge of concepts outside of the scope of this course. However, I can understand that this might not be satisfactory for some people, and for those who are interested in learning more, this video provides an awesome intuitive and comprehensive description: https://www.youtube.com/watch?v=T05t-SqKArY

In [33]:
from sklearn.decomposition import LatentDirichletAllocation

vectorizer = CountVectorizer()
# vectorizer = TfidfVectorizer()
dt_matrix = vectorizer.fit_transform(corpus.data)

lda = LatentDirichletAllocation(n_components=6)
fitted = lda.fit_transform(dt_matrix)

terms = vectorizer.get_feature_names_out()  # use get_feature_names() on scikit-learn < 1.0

for i, comp in enumerate(lda.components_):
    comp_terms = zip(terms, comp)
    sorted_terms = sorted(comp_terms, key=lambda x: x[1], reverse=True)[:10]
    
    print("Concept {}".format(i))
    for term in sorted_terms:
        print(term[0])
        
    print()

Concept 0
image
file
program
format
write
color
university
jpeg
software
not

Concept 1
not
write
do
god
people
article
atheist
thing
time
religion

Concept 2
sale
university
offer
sell
computer
include
price
not
do
software

Concept 3
write
article
university
good
polygon
orbit
earth
group
graphic
not

Concept 4
space
launch
nasa
year
satellite
mission
write
orbit
article
moon

Concept 5
university
do
dos
space
email
high
usa
war
not
offer



In [34]:
import pyLDAvis
import pyLDAvis.sklearn
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

In [35]:
pyLDAvis.sklearn.prepare(lda, dt_matrix, vectorizer)

In [36]:
from gensim.corpora import Dictionary
from gensim import models

split_corpus = [doc.split() for doc in corpus.data]

# Create a dictionary representation of the documents.
dictionary = Dictionary(split_corpus)

# Filter out words that occur in fewer than 20 documents, or in more than 50% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.5)

c = [dictionary.doc2bow(doc) for doc in split_corpus]
# tfidf = models.TfidfModel(c)
# c = tfidf[c]

from gensim.models import LdaModel

# Set training parameters.
num_topics = 6
chunksize = 2000
passes = 20
iterations = 400
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make an index-to-word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

model = LdaModel(
    corpus=c,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)

model.top_topics(c)



[([(0.035153322, 'space'),
   (0.013273308, 'launch'),
   (0.012037246, 'nasa'),
   (0.011985789, 'orbit'),
   (0.01145318, 'not'),
   (0.011386162, 'article'),
   (0.009257541, 'moon'),
   (0.008314573, 'work'),
   (0.008229328, 'cost'),
   (0.007979864, 'do'),
   (0.00795668, 'satellite'),
   (0.0075671254, 'year'),
   (0.007200616, 'earth'),
   (0.0065132966, 'high'),
   (0.006312601, 'station'),
   (0.006093175, 'mission'),
   (0.0058015073, 'shuttle'),
   (0.0057993117, 'lunar'),
   (0.005447027, 'time'),
   (0.005143147, 'flight')],
  -1.6514235853055401),
 ([(0.021958634, 'not'),
   (0.021418056, 'god'),
   (0.017552536, 'do'),
   (0.014741828, 'people'),
   (0.011937079, 'argument'),
   (0.0113763, 'atheist'),
   (0.009907505, 'article'),
   (0.007874088, 'thing'),
   (0.0072623566, 'religion'),
   (0.007114652, 'true'),
   (0.0068466114, 'time'),
   (0.0061029624, 'post'),
   (0.006034848, 'exist'),
   (0.005759047, 'bible'),
   (0.005688219, 'atheism'),
   (0.0054405476, 'isl

In [37]:
pyLDAvis.gensim.prepare(model, c, dictionary)