# Unsupervised NLP

Unsupervised Natural Language Processing (NLP) means analysing text without providing any labels at all. The machine learning algorithm finds the groups on its own. This is very useful when you do not know in advance what common themes a corpus of documents contains. It lets you find patterns that you could not easily see yourself, for example because there are too many documents to read.

In this tutorial we clean the text of the book summaries in the CMU books dataset, the same dataset used in the KNN and Bayes tutorials, and then let an unsupervised algorithm group the summaries into topics. The acronym NLP is the same as in the supervised notebook on purpose: both notebooks do Natural Language Processing, and only the learning approach (supervised or unsupervised) differs.

## Learning objectives

Average time to complete: 60min

By the end of this tutorial you should be able to:

* Clean your data and describe why this is important for machine learning
* Reduce words to their core meaning
* Apply Latent Dirichlet Allocation to create groups
* Interpret those groups using a variety of different ways of looking at the data
* Describe what unsupervised learning is and how we use it
* Describe what Natural Language Processing is and how to use it

## What you will need for this tutorial

* See the [introduction document](https://uottawa-it-research-teaching.github.io/machinelearning/) for general requirements and how Jupyter notebooks work.
* We'll need Pandas for convenient data handling. It's a very powerful Python package that can read CSV and Excel files. It also has very good data manipulation capabilities, which come in handy for data cleaning.
* We will use NLTK for the text processing and gensim for the machine learning model.
* The data files that should have come with this notebook.

Comment: I cannot see any data files in the main folder of NLP. Plz add them along with their licenses. 

## RDM best practices

Good data handling for machine learning begins with good Research Data Management (RDM). The quality of your source data will impact the outcome of your results, just like the reproducibility of your results will depend on the quality of your data sources, in addition to how you organize the data so that other people (and machines!) can understand and reuse it. 

We also need to respect a few research data management best practices along the way. These best practices are recommended by the [Digital Research Alliance of Canada](https://zenodo.org/records/4000989).

SAVE YOUR RAW DATA IN ORIGINAL FORMAT
* Don't overwrite your original data with a cleaned version.
* Protect your original data by locking them or making them read-only.
* Refer to this original data if things go wrong (as they often do).
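
One way to make a raw data file read-only from Python is sketched below. It is a minimal illustration using only the standard library. The helper name `make_read_only` and the file name are made up for the example, and on shared or institutional storage the locking tools of that platform may be the better choice.

```python
import stat
import tempfile
from pathlib import Path


def make_read_only(path):
    """Remove all write permission bits from a file (hypothetical helper)."""
    path = Path(path)
    mode = path.stat().st_mode
    # Clear the write bits for owner, group, and others
    path.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
    return path


# Demonstrate on a throwaway file standing in for the raw data
with tempfile.TemporaryDirectory() as folder:
    raw = Path(folder) / "raw_data.txt"
    raw.write_text("original, untouched data\n")
    make_read_only(raw)
    is_writable = bool(raw.stat().st_mode & stat.S_IWUSR)
    print(is_writable)
    # Restore write permission so the temporary folder can be cleaned up
    raw.chmod(raw.stat().st_mode | stat.S_IWUSR)
```

After this, an accidental attempt to overwrite the file as a regular user should fail instead of silently replacing your original data.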

BACKUP YOUR DATA
* Use the 3-2-1 rule: Save three copies of your data, on two different storage media, with one copy off site. The off-site storage can be OneDrive, Google Drive, or whatever your institution provides.
* We are using Open Data, so it does not contain any personally identifiable data or data that needs to be restricted or protected in any way. However, if your data contains confidential information, it is important to take steps to restrict access and encrypt your data.

There are a few more RDM best practices that will help you in your project management, and we will highlight them at the beginning of each tutorial.

## Cleaning data for NLP

There are a number of things you typically need to do to clean up data meant for NLP. It does however depend on the task you want to do.

* Removing stopwords
* Removing punctuation
* Lemmatization
* Synonym substitution

Stopwords are words that are so common that they appear everywhere. These are words like "a", "the", "is", and "have". Thus they hold no value when trying to categorize text. You instead want to focus on words that are more or less unique to the categories. English in particular also suffers from contractions like "you'll" and such which need to be expanded first.

Punctuation doesn't add much so we can get rid of that too.

Lemmatization is the technique of reducing conjugations of verbs to their base form, as well as removing plurals. For example, "buy", "buys", and "bought" are all forms of the same verb, but for computers they are different words. By replacing all of these conjugations with just "buy", the computer can recognize them as having the same meaning.

Synonyms are different words that mean the same thing. Ideally, you want "help" and "aid" to be counted as the same word. This is a very tricky one though, because synonyms rarely mean exactly the same thing and a word can have other meanings as well. So you may or may not want to do this step.

NLP packages usually come with a list of stopwords of their own. They are not always the same though. There is quite a bit of ambiguity as to what constitutes a stopword. NLTK, SpaCy, Gensim, and Scikit-learn are all Python packages that have different lists. Here, we will use NLTK.


Comment: update the info below plz. 

We need the "contractions" package, which may not be installed by default. So similar to the first notebook, we need to install it with conda:

    conda install contractions

or pip:

    !pip install contractions

In [2]:
!pip install contractions



In [4]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk
import contractions
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# The lemmatizer used further down needs the WordNet data
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\FAR4\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\FAR4\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

### Stopwords

These are the first ten words in the English and French stopword lists:

The stopword lists are a separate NLTK resource. If `stopwords.words` raises a `LookupError` saying that the resource `stopwords` was not found, download it once with `nltk.download('stopwords')`, as the cell below does.

In [6]:
# Download the stopword lists (only needed once per machine)
nltk.download('stopwords')
stopwords.words('english')[:10]


In [3]:
stopwords.words('french')[:10]

['au', 'aux', 'avec', 'ce', 'ces', 'dans', 'de', 'des', 'du', 'elle']

You'll notice that all of the words are lower case too. It's good practice to make all words lower case, so you don't have to worry about case sensitivity when comparing words. So let's filter this sentence.

In [4]:
text = "You'll notice that all of the words are lower case too. It's good practise to make all words lower case, so you don't have to worry about case sensitivity when comparing words. Let's filter this sentence."

# Make it all lower case
text = text.lower()

# Expand contractions such as "you'll" into "you will"
text = contractions.fix(text)

# Tokenize the text, which splits it into words and punctuation
tokens = word_tokenize(text)

# Remove the stopwords (the text is already lower case)
english_stopwords = set(stopwords.words('english'))
filtered_text = [token for token in tokens if token not in english_stopwords]

# Show result in a readable way
" ".join(filtered_text)

'notice words lower case . good practise make words lower case , worry case sensitivity comparing words . let us filter sentence .'

### Removing punctuation

This is fairly easy to do with the `isalpha` string method, which is true only for tokens made up entirely of letters. So digits and punctuation will disappear.

In [5]:
tokens = [token for token in filtered_text if token.isalpha()]

# Show result in a readable way
" ".join(tokens)

'notice words lower case good practise make words lower case worry case sensitivity comparing words let us filter sentence'
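
To see exactly what `isalpha` keeps and drops, here is a small sketch with made-up tokens (they are not taken from the dataset). Note that accented letters count as letters, while a hyphen or a digit anywhere in a token is enough to drop it.

```python
# Hypothetical tokens chosen to show what isalpha() keeps and drops
sample_tokens = ["words", ".", "42", "co-op", "bäumer", "2nd", ","]

kept = [token for token in sample_tokens if token.isalpha()]
dropped = [token for token in sample_tokens if not token.isalpha()]

print(kept)     # ['words', 'bäumer']
print(dropped)  # ['.', '42', 'co-op', '2nd', ',']
```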

### Lemmatization
Unfortunately, the lemmatizer in NLTK is limited to English only. There is however a separate project for a French version, which works as a drop-in replacement for the one from NLTK. It's called the [French LEFFF Lemmatizer](https://github.com/ClaudeCoulombe/FrenchLefffLemmatizer). We do not use it in this tutorial because our text is in English; see the project page for its documentation.

The code below creates a WordNet lemmatizer, applies it to every token, and joins the result back into a readable string.

In [6]:
lemmatizer = WordNetLemmatizer()

tokens = [lemmatizer.lemmatize(token) for token in tokens]

" ".join(tokens)

'notice word lower case good practise make word lower case worry case sensitivity comparing word let u filter sentence'

However, there is something odd going on here! It's keeping "comparing" even though that should have been "compare". The reason is that the WordNet lemmatizer assumes that every word is a noun unless told otherwise.

We can check this by lemmatizing the single word "comparing" without any extra information:

In [7]:
lemmatizer.lemmatize('comparing')

'comparing'

NLTK interprets this as the noun "the comparing", so it leaves it alone. We want to mark it as a verb so that the lemmatizer can reduce it to its base form.

The `pos` (part of speech) argument in the code below does that:

In [8]:
# Same code but now marked as verb
lemmatizer.lemmatize('comparing', pos='v')

'compare'

Fortunately, we don't need to tag the words ourselves. Tagging here means part-of-speech tagging: labelling each word with its grammatical role, such as noun, verb, or adjective. We can use `pos_tag` to do it for us.

In [9]:
nltk.pos_tag(tokens)

[('notice', 'NN'),
 ('word', 'NN'),
 ('lower', 'JJR'),
 ('case', 'NN'),
 ('good', 'JJ'),
 ('practise', 'NN'),
 ('make', 'VBP'),
 ('word', 'NN'),
 ('lower', 'JJR'),
 ('case', 'NN'),
 ('worry', 'NN'),
 ('case', 'NN'),
 ('sensitivity', 'NN'),
 ('comparing', 'VBG'),
 ('word', 'NN'),
 ('let', 'NN'),
 ('u', 'JJ'),
 ('filter', 'NN'),
 ('sentence', 'NN')]

It classified all of the words using a pretrained model that comes with NLTK. `pos_tag` also accepts a `lang` argument, but the taggers bundled with NLTK only cover English (`'eng'`) and Russian (`'rus'`), so French text needs a different tagger.

The codes that it outputs, like `NN` and `VBG`, mean "Noun, singular or mass" and "Verb, gerund or present participle", respectively. The codes depend on the tagset you use. The default English tagger in NLTK, which is the one we use here, produces [Penn Treebank tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).

For the tagging in WordNet, these codes need to be translated. We can write a function for that. This translation does not seem to be built into NLTK, possibly because the tags change with different tagsets. However, Penn Treebank tags for the same word class share their first letter: all verb tags (`VB`, `VBG`, `VBP`, and so on) start with `V`, all noun tags start with `N`, and all adjective tags start with `J`. So we can just look at the first letter of the tag. The function to translate from one tagging system to the other is then:

In [10]:
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return nltk.corpus.wordnet.ADJ
    elif tag.startswith('V'):
        return nltk.corpus.wordnet.VERB
    elif tag.startswith('N'):
        return nltk.corpus.wordnet.NOUN
    elif tag.startswith('R'):
        return nltk.corpus.wordnet.ADV
    elif tag.startswith('S'):
        return nltk.corpus.wordnet.ADJ_SAT
    else:
        # If it's something else, then just use the default value for the lemmatizer
        return nltk.corpus.wordnet.NOUN
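
To make the first-letter trick concrete, here is a small standalone sketch that does not need NLTK. It assumes that the WordNet part-of-speech constants are the single letters `'a'`, `'v'`, `'n'`, and `'r'` (the values behind `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN`, and `wordnet.ADV`). The dictionary and function names are illustrative only.

```python
# First letter of a Penn Treebank tag -> WordNet part-of-speech letter.
# Assumption: these letters match the wordnet.ADJ/VERB/NOUN/ADV constants.
FIRST_LETTER_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}


def simple_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    return FIRST_LETTER_TO_WORDNET.get(tag[:1], "n")


# All verb tags (VB, VBD, VBG, ...) collapse to 'v', all adjective tags to 'a'
for tag in ["VBG", "VBP", "JJR", "NN", "RB", "CD"]:
    print(tag, "->", simple_wordnet_pos(tag))
```

Tags that match none of the four letters, such as `CD` for numbers, fall back to the noun default, just like the `else` branch above.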

Technically, it is better to tag whole sentences before removing the punctuation, because punctuation carries meaning that the tagger can use. We used the other order here because it is easier to follow step by step. In the real example further down, we'll do it in the proper order. Anyway, let's see what we get with the tagged words.

In [11]:
tokens = [lemmatizer.lemmatize(token, pos=get_wordnet_pos(tag)) for token, tag in nltk.pos_tag(tokens)]

" ".join(tokens)

'notice word low case good practise make word low case worry case sensitivity compare word let u filter sentence'

Almost every word has now been standardized to its root. It also kind of sounds like a pirate now.

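The standalone "u" in the result is an artifact, not a real root. The word "us" survived the stopword filter, and the lemmatizer then treated it as a plural noun and stripped the final "s". If leftovers like this matter for your analysis, you can add them to your own stopword list.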

## Back to the books

We'll use the CMU books dataset that we used before with KNN and Bayes tutorials. This time, we will use an unsupervised classification algorithm to see what it can find. Unsupervised means that we will not be using any of the labels that come with the dataset but instead we let our machine learning algorithm figure things out for itself. It will try to find book summaries that can be grouped together and then tell us why it thought so.

However, since unsupervised learning does not have any understanding of the material, it is still up to you to figure out what the genres it found are.

### Reading the data
In the same way as in previous material for KNN, we use Pandas and JSON for loading the dataset:

In [12]:
import pandas as pd
import json

In [13]:
books = pd.read_csv('../data/booksummaries.txt', sep="\t", header=0, names=['wikipedia', 'freebase', 'title', 'author', 'publicationdate', 'genres', 'summary'])

In [14]:
books = books.dropna()

In [15]:
def genre(row):
    g = json.loads(row.genres)
    return list(g.values())

#genresperbook = books.apply(genre, axis=1)
#books = books.assign(genres=genresperbook)

You'll notice that we deliberately commented out the lines that read the genres into our Pandas DataFrame. We won't use them.

### Preparing the data

Let's follow the NLP steps as we did before using the first book. This time, we'll start with the tagging before we remove the punctuation.

In [16]:
# Which book are we looking at?
books.loc[0]['title']

'A Clockwork Orange'

Let's get started! We change everything to lowercase and we also remove the contractions. Then we tokenize the text.

In [17]:
# Make it all lower case
text = books.loc[0]['summary'].lower()

# Expand contractions
text = contractions.fix(text)

# Tokenize the text
tokens = word_tokenize(text)

Now do the tagging of words and then feed that into the lemmatizer to get to the base.

In [18]:
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(token, pos=get_wordnet_pos(tag)) for token, tag in nltk.pos_tag(tokens)]

How many tokens do we have now?

In [19]:
len(tokens)

1149

Remove the stop words and punctuation.

In [20]:
# Punctuation
tokens = [token for token in tokens if token.isalpha()]

# Stopwords
tokens = [token for token in tokens if token not in stopwords.words('english')]

Well, that likely got rid of some junk! Let's see how many tokens we are left with.

In [21]:
len(tokens)

545

We can temporarily join the remaining tokens back to get a readable text again. This is just to see what a filtered book summary looks like.

In [22]:
" ".join(tokens)

'alex teenager living england lead gang nightly orgy opportunistic random alex friend droogs novel slang nadsat dim bruiser gang muscle georgie ambitious pete mostly play along droogs indulge taste characterize sociopath harden juvenile delinquent alex also intelligent sophisticated taste music particularly fond beethoven lovely ludwig van novel begin droogs sit favorite hangout korova milkbar drink cocktail call hype night mayhem assault scholar walk home public library rob store leave owner wife bloody unconscious stomp panhandling derelict scuffle rival gang joyride countryside stolen car break isolated cottage maul young couple living beat husband rap wife metafictional touch husband writer work manuscript call clockwork orange alex contemptuously read paragraph state novel main theme shred manuscript back milk bar alex punishes dim crude behaviour strain within gang become apparent home dreary flat alex play classical music top volume fantasizing even orgiastic violence alex skips

That seems to have worked. We got a nice word salad. One thing we might also want to do is filter out names, but let's not overcomplicate things! If you are interested, search for NLTK and NER (Named Entity Recognition).

Now we can do this for all of the books. We'll put all of the above in a single function that we can then apply to all of the rows. This will take a few minutes. There are a lot of books (9292)!

In [23]:
# Build these once instead of once per token, which saves a lot of time
english_stopwords = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def prepare(text):
    # Make it all lower case and expand contractions
    text = contractions.fix(text.lower())

    # Tokenize the text
    tokens = word_tokenize(text)

    # Lemmatize, using the part-of-speech tag of each token
    tokens = [lemmatizer.lemmatize(token, pos=get_wordnet_pos(tag)) for token, tag in nltk.pos_tag(tokens)]

    # Remove punctuation and numbers
    tokens = [token for token in tokens if token.isalpha()]

    # Remove stopwords
    tokens = [token for token in tokens if token not in english_stopwords]

    return tokens

In [24]:
# Apply the prepare function to all of the books and store it in a new column.
books['prepared'] = books['summary'].apply(prepare)

This previous step takes a long time. In such cases, you might want to save the prepared data to a file that you can read back later without running all of the preparation steps again. This can be done with `to_pickle` and `read_pickle`. Pickle is a Python-specific format that is fast and compact. It's not just for DataFrames either: almost any Python object can be stored in a pickle.

In [25]:
# Save the prepared DataFrame to file
books.to_pickle('books.pkl')

In [26]:
# Read the prepared DataFrame
books = pd.read_pickle('books.pkl')
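
Outside of Pandas, the same round trip can be done with Python's built-in `pickle` module. The sketch below uses a made-up list of token lists and a temporary folder, so the file name and data are illustrative only.

```python
import pickle
import tempfile
from pathlib import Path

# A made-up stand-in for prepared data: a list of token lists
prepared = [["alex", "teenager", "gang"], ["text", "plague", "town"]]

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "prepared.pkl"

    # Write the object to disk in binary mode
    with open(path, "wb") as handle:
        pickle.dump(prepared, handle)

    # Read it back into a new variable
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == prepared)  # True
```

Only load pickle files that you created yourself or that come from a source you trust, because loading a pickle can execute code.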

We now have a new column in our `books` DataFrame that contains the filtered clean text as a list of tokens.

In [27]:
books

| | wikipedia | freebase | title | author | publicationdate | genres | summary | prepared |
|---|---|---|---|---|---|---|---|---|
| 0 | 843 | /m/0k36 | A Clockwork Orange | Anthony Burgess | 1962 | {"/m/06n90": "Science Fiction", "/m/0l67h": "N... | Alex, a teenager living in near-future Englan... | [alex, teenager, living, england, lead, gang, ... |
| 1 | 986 | /m/0ldx | The Plague | Albert Camus | 1947 | {"/m/02m4t": "Existentialism", "/m/02xlf": "Fi... | The text of The Plague is divided into five p... | [text, plague, divide, five, part, town, oran,... |
| 4 | 2152 | /m/0x5g | All Quiet on the Western Front | Erich Maria Remarque | 1929-01-29 | {"/m/098tmk": "War novel", "/m/016lj8": "Roman... | The book tells the story of Paul Bäumer, a Ge... | [book, tell, story, paul, bäumer, german, sold... |
| 5 | 2890 | /m/011zx | A Wizard of Earthsea | Ursula K. Le Guin | 1968 | {"/m/0dwly": "Children's literature", "/m/01hm... | Ged is a young boy on Gont, one of the larger... | [ged, young, boy, go, one, large, island, nort... |
| 7 | 4081 | /m/01b4w | Blade Runner 3: Replicant Night | K. W. Jeter | 1996-10-01 | {"/m/06n90": "Science Fiction", "/m/014dfn": "... | Living on Mars, Deckard is acting as a consul... | [live, mar, deckard, act, consultant, movie, c... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16548 | 36372465 | /m/02vqwsp | The Third Lynx | Timothy Zahn | 2007 | {"/m/06n90": "Science Fiction"} | The story starts with former government agent... | [story, start, former, government, agent, fran... |
| 16550 | 36534061 | /m/072y44 | Remote Control | Andy McNab | 1997 | {"/m/01jfsb": "Thriller", "/m/02xlf": "Fiction... | The series follows the character of Nick Ston... | [series, follow, character, nick, stone, man, ... |
| 16554 | 37054020 | /m/04f1nbs | Transfer of Power | Vince Flynn | 2000-06-01 | {"/m/01jfsb": "Thriller", "/m/02xlf": "Fiction"} | The reader first meets Rapp while he is doing... | [reader, first, meet, rapp, covert, operation,... |
| 16555 | 37122323 | /m/0n5236t | Decoded | Jay-Z | 2010-11-16 | {"/m/0xdf": "Autobiography"} | The book follows very rough chronological ord... | [book, follow, rough, chronological, order, sw... |


### The algorithm

Now for the machine learning algorithm. It's called Latent Dirichlet Allocation (LDA). It's a statistical model that is able to find words that often go together within a sample and then combine that with all of the samples to form groups where certain terms are just used very often.

The downside of this unsupervised technique is that it won't tell you what the categories that it found are. Subject matter expertise is required to be able to discern that. Additionally, the model doesn't know how many different topics there are. It's a number you need to give it in advance. It requires a bit of trial and error.

The LDA model is part of the gensim package, so we need to load that first.

In [28]:
import gensim

Then we need to take the prepared text that we cleaned up earlier and create a dictionary out of it. This dictionary will contain all of the unique words found in the entire corpus.

In [29]:
common_dictionary = gensim.corpora.Dictionary(list(books['prepared']))

We can check how many unique words there now are in this dictionary.

In [30]:
len(common_dictionary)

76292

One more step we will want to do is to filter out the words that are extremely common and words that hardly appear at all. Removing the common words is similar to the step we had before where we removed the stop words, but this one will be over the entire corpus and will catch a few more that we missed before.

The words that hardly appear should go too because if a word just appears a few times in the entire corpus, then there isn't much value in using this word to define a group of words that fit together. LDA will try to create groups based on words that are often found together after all.

In [31]:
common_dictionary.filter_extremes(no_below=20, no_above=0.5)
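
The idea behind `no_below` and `no_above` can be sketched in plain Python. This is only an illustration of the two thresholds on a made-up corpus, not gensim's actual implementation, and the function name is hypothetical. `no_below` is an absolute number of documents, while `no_above` is a fraction of the corpus.

```python
from collections import Counter


def filter_by_document_frequency(documents, no_below, no_above):
    """Return the set of tokens that pass both document-frequency limits."""
    # Count in how many documents each token occurs (not how often in total)
    doc_freq = Counter(token for doc in documents for token in set(doc))
    max_docs = no_above * len(documents)
    return {
        token
        for token, count in doc_freq.items()
        if no_below <= count <= max_docs
    }


# Four tiny made-up documents
docs = [
    ["ship", "planet", "go"],
    ["ship", "king", "go"],
    ["murder", "police", "go"],
    ["planet", "ship", "go"],
]

# Keep tokens found in at least 2 documents but in no more than 75% of them
print(sorted(filter_by_document_frequency(docs, no_below=2, no_above=0.75)))
# ['planet', 'ship']
```

Here "go" is dropped for appearing in every document, and "king", "murder", and "police" are dropped for appearing in only one.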

Now we need to do a rather weird thing. We need to look up the first (or really any) element of the `common_dictionary` we just filtered. gensim only builds the `id2token` mapping the first time an entry is looked up, so this forces it to be filled in. If you don't do this, `id2token` will be empty and training the model will fail. The `id2token` mapping translates word IDs from the dictionary back into actual words, which makes the results easier for us to interpret.

In [32]:
common_dictionary[0]

'accidentally'

Now that we have a filtered dictionary, we can use it to transform our corpus of book summaries into something the LDA algorithm can understand. We need to transform the book summaries (documents) into a bag of words using the `doc2bow` function. A bag of words is just a list of words and the number of times that each word appeared in the document. Instead of the word itself, it stores the word ID from the dictionary, since this is much more memory efficient and faster too.

In [33]:
common_corpus = [common_dictionary.doc2bow(text) for text in books['prepared']]
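
What a bag of words looks like can be shown with a few lines of plain Python. This is a simplified sketch with a made-up three-word vocabulary, not gensim's own code. As with `doc2bow`, words that are not in the dictionary are ignored.

```python
from collections import Counter

# Made-up vocabulary mapping each known word to an integer ID
token2id = {"alex": 0, "gang": 1, "music": 2}


def to_bag_of_words(tokens, token2id):
    """Count known tokens and return sorted (word ID, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


document = ["alex", "gang", "alex", "droogs", "music", "alex"]
print(to_bag_of_words(document, token2id))  # [(0, 3), (1, 1), (2, 1)]
```

The word order is lost: only which words occur, and how often, is kept.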

At this point, we will want to enable logging. This will let us see what the LDA model is doing while it is training. Otherwise, it will be too much of a black box. We will instruct Python to keep a log in "lda_model.log" and save everything from debug messages and upwards. Upwards in this context means information messages, warning messages, and error messages.

In [34]:
import logging
logging.basicConfig(filename='lda_model.log', format='%(asctime)s : %(levelname)s : %(message)s', filemode='w', level=logging.DEBUG)
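
How the logging level acts as a threshold can be seen in this small standalone sketch. It writes to an in-memory buffer instead of a file, and the logger name `level_demo` is made up, so it does not interfere with the log file configured above.

```python
import io
import logging

# Send log records to an in-memory buffer so the result can be inspected
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)s : %(message)s"))

demo_logger = logging.getLogger("level_demo")  # hypothetical logger name
demo_logger.addHandler(handler)
demo_logger.propagate = False  # keep the demo out of the root logger's file

# With the threshold at WARNING, debug and info messages are discarded
demo_logger.setLevel(logging.WARNING)
demo_logger.debug("not recorded")
demo_logger.info("not recorded either")
demo_logger.warning("recorded")

# Lowering the threshold to DEBUG lets everything through
demo_logger.setLevel(logging.DEBUG)
demo_logger.debug("now recorded")

print(buffer.getvalue())
```

Only the warning and the second debug message end up in the buffer.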

Finally, everything is set up to do the training. Now we need to choose our model parameters! There are a lot of things you can tell the LDA model on how it should train. The most important one to choose is the number of topics.

We need to tell the LDA model in advance how many groups we want it to create. Here we just say 10. It's not too high and it's not too low. There's really not much more rationale than that.

We also set `eval_every` to `1` so that the model estimates its log perplexity after every update and writes it to the log. This is handy for following the training, but it does slow the training down.

The number of `passes` tells the algorithm how many times it should go through the entire corpus. With each pass, it gets better at finding which words belong together and forming groups out of them. Loosely speaking, the LDA model keeps adjusting which words belong to which topic, and which topics make up each book summary, until the assignments stop changing much. We'll be able to see in the log how well it's doing. We'll just use 20 passes for now and see what we get.

Then there is also `chunksize`, which tells the algorithm how many books to process at a time. Setting this higher can speed up the training, as long as the chunk of books fits comfortably in memory.

Finally, the `random_state` just makes it so that every time you run this training with the above parameters, you will always get the same answer. It fixes that seed for the random number generator, which means that it will always generate the same "random" sequence of numbers used by the model. It makes everything reproducible. We pick the number `42` because why not! It only matters that it's fixed.
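
The effect of fixing a seed can be illustrated with Python's built-in `random` module. This is only an analogy for what `random_state` does, since gensim uses its own random number generator internally.

```python
import random

# Two generators created with the same seed produce identical sequences
first = random.Random(42)
second = random.Random(42)
same_seed_a = [first.randint(0, 9) for _ in range(5)]
same_seed_b = [second.randint(0, 9) for _ in range(5)]

# A generator without a fixed seed will usually give a different sequence
unseeded = random.Random()
other = [unseeded.randint(0, 9) for _ in range(5)]

print(same_seed_a == same_seed_b)  # True
```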

In [35]:
lda = gensim.models.ldamodel.LdaModel(
    corpus=common_corpus,
    num_topics=10,
    id2word=common_dictionary.id2token,
    eval_every=1,
    passes=20,
    chunksize=3000,
    random_state=42
)

Ok, our model is done training now. Let's open the file "lda_model.log" and look for the lines that say
```
DEBUG : 1453/3000 documents converged within 50 iterations
```
This number needs to be as high as possible. Ideally we want all documents to be converged but we will never get there because the groupings will be imperfect as these book summaries will be talking about a lot of different things and sometimes use common language.
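
Rather than reading the log by eye, the convergence lines can be pulled out with a regular expression. The sketch below assumes the message format shown above. The function name and the sample log lines are made up for illustration.

```python
import re

# Matches messages such as "1453/3000 documents converged within 50 iterations"
CONVERGED = re.compile(r"(\d+)/(\d+) documents converged")


def convergence_fractions(lines):
    """Return the fraction of converged documents for each matching log line."""
    fractions = []
    for line in lines:
        match = CONVERGED.search(line)
        if match:
            converged, total = int(match.group(1)), int(match.group(2))
            fractions.append(converged / total)
    return fractions


# Made-up log lines in the format configured earlier, for illustration only
sample_log = [
    "2024-01-01 10:00:00,000 : DEBUG : 1453/3000 documents converged within 50 iterations",
    "2024-01-01 10:00:05,000 : INFO : some other message",
    "2024-01-01 10:00:09,000 : DEBUG : 2700/3000 documents converged within 50 iterations",
]
print(convergence_fractions(sample_log))
```

To use it on the real log, pass an open file instead of the sample list, for example `convergence_fractions(open('lda_model.log'))`, and check that the fractions rise from pass to pass.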

Once the model has been trained, we might want to save it to a file. That way when we reopen our Jupyter session, we don't need to retrain, but can just load it. You'll also be able to open it in a different notebook that's just focused on using the model if you want or share it with other researchers.

Earlier we saved the DataFrame with pickle. The `save` method of the gensim model also relies on pickle internally, so the same caveat applies: a saved model may depend on the versions of Python and gensim that wrote it. When you share a model, record those versions alongside it. Where a standard data format is available, using it is a best practice in data management that increases interoperability and reproducibility in research.

In [36]:
# Saving the model
lda.save('bookmodel')

In [37]:
# Loading the model
lda = gensim.models.ldamodel.LdaModel.load('bookmodel')

### Analysis

Time to look at the results. There are a number of things we can look at. For the ten topics we told the LDA to find, we can print the words it used and how strongly each word is associated with that topic.

In [38]:
for t in lda.print_topics(num_words=10):
    print(t)

(0, '0.011*"murder" + 0.009*"kill" + 0.009*"go" + 0.008*"police" + 0.008*"tell" + 0.007*"get" + 0.006*"case" + 0.005*"call" + 0.005*"day" + 0.005*"death"')
(1, '0.009*"kill" + 0.008*"king" + 0.007*"return" + 0.005*"help" + 0.005*"use" + 0.005*"attack" + 0.005*"tell" + 0.005*"world" + 0.005*"back" + 0.005*"power"')
(2, '0.011*"go" + 0.008*"get" + 0.008*"tell" + 0.007*"make" + 0.006*"leave" + 0.006*"see" + 0.006*"come" + 0.006*"island" + 0.005*"back" + 0.005*"say"')
(3, '0.017*"novel" + 0.016*"book" + 0.012*"story" + 0.008*"character" + 0.007*"life" + 0.006*"world" + 0.005*"also" + 0.005*"first" + 0.005*"chapter" + 0.005*"include"')
(4, '0.019*"john" + 0.012*"four" + 0.012*"go" + 0.011*"thomas" + 0.011*"henry" + 0.011*"sam" + 0.008*"luke" + 0.008*"tell" + 0.007*"sarah" + 0.007*"time"')
(5, '0.012*"ship" + 0.011*"human" + 0.010*"earth" + 0.009*"planet" + 0.007*"time" + 0.006*"new" + 0.006*"world" + 0.005*"use" + 0.005*"year" + 0.005*"destroy"')
(6, '0.011*"family" + 0.010*"father" + 0.010

The first one seems very crime oriented with words like murder, kill, police, case, but you can see there are also some words in there that are not quite right. If you look at topic number 5, there definitely seems to be a Sci-Fi theme going on there with the words human, earth, planet, and world.

There are however also some topics where it's not clear which genre they represent. For example, topic 4 is mainly just names. Keep in mind that we are the ones looking for genres, and that might not be what the LDA algorithm found. It was just looking for any way to group these book summaries into ten topics.

Another thing to look at is the top words per topic together with the coherence of each topic. This is similar to `print_topics`, but `top_topics` also returns a coherence score for each topic and sorts the topics from most to least coherent. Coherence measures how often the top words of a topic appear together in the same documents. With the default `u_mass` measure the scores are typically negative, and a score closer to zero is better.

In [39]:
lda.top_topics(common_corpus)

[([(0.0110494755, 'family'),
   (0.010247785, 'father'),
   (0.009740395, 'mother'),
   (0.008020293, 'love'),
   (0.007960088, 'life'),
   (0.007780727, 'become'),
   (0.007565873, 'child'),
   (0.007225185, 'leave'),
   (0.006368225, 'friend'),
   (0.0059221974, 'year'),
   (0.005589299, 'go'),
   (0.0055839773, 'young'),
   (0.005487799, 'return'),
   (0.0051857894, 'live'),
   (0.0051689884, 'home'),
   (0.0051486026, 'new'),
   (0.004815906, 'meet'),
   (0.004654479, 'daughter'),
   (0.0044921557, 'marry'),
   (0.004474495, 'make')],
  -0.852357147559029),
 ([(0.009047911, 'kill'),
   (0.0076246485, 'king'),
   (0.006640664, 'return'),
   (0.0054942826, 'help'),
   (0.0052796644, 'use'),
   (0.005254651, 'attack'),
   (0.0051531107, 'tell'),
   (0.0051329145, 'world'),
   (0.004670613, 'back'),
   (0.004652856, 'power'),
   (0.0043452256, 'make'),
   (0.004283357, 'leave'),
   (0.004097437, 'go'),
   (0.0039577885, 'magic'),
   (0.0037244798, 'become'),
   (0.0037072925, 'give'),


For our first book with index 0, "A Clockwork Orange", we can see the groups that LDA thinks it belongs to.

In [40]:
print(books.iloc[0].title)
lda[common_corpus[0]]

A Clockwork Orange


[(0, 0.44291747), (3, 0.26078534), (6, 0.21150394), (8, 0.07269716)]

The numbers here show the group number followed by the probability that the model assigns to that group for this book. Here the LDA thinks that "A Clockwork Orange" belongs to group 0 with a probability of 44.3%, which is not a whole lot.
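
Picking out the dominant group from such a list takes only a couple of lines. The sketch below copies the output from the cell above into a plain list, so it runs without the model. The remark about a reporting threshold assumes gensim's default `minimum_probability` setting.

```python
# Output copied from the cell above: (topic number, share of the document)
topic_shares = [(0, 0.44291747), (3, 0.26078534), (6, 0.21150394), (8, 0.07269716)]

# The dominant topic is the pair with the largest share
dominant_topic, dominant_share = max(topic_shares, key=lambda pair: pair[1])
print(dominant_topic, round(dominant_share, 3))  # 0 0.443

# Topics below the model's reporting threshold are left out, so the shares
# shown add up to slightly less than 1
print(round(sum(share for _, share in topic_shares), 3))  # 0.988
```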

Another thing we can do is look at books that most definitely belong to a group. We'll have to do some Python magic. First we create a function that takes the above output and transforms it into something that is easier to work with, by adding zeroes for all the groups that are not part of the output. That is to say, for "A Clockwork Orange" we have the groups 0, 3, 6, and 8, so we'll create new entries for the missing 1, 2, 4, 5, 7, and 9 and set those to zero.

In [41]:
def densify(sparse, num_topics=10):
    full = [0]*num_topics
    for s in sparse:
        full[s[0]] = s[1]
    return full

Using that function, we can now convert the entire output of the LDA model for every book into a 9292 (number of books) by 10 (number of groups) matrix and load that into a Pandas DataFrame.

In [42]:
topicmatrix=pd.DataFrame([densify(lda[c]) for c in common_corpus])

That in turn we can join to the original `books` DataFrame. We just need to reset the index on the `books` DataFrame because earlier we threw away a whole bunch of books for having missing values. Resetting the index will make sure the index runs from 0 to 9291 which is then the same as our `topicmatrix` DataFrame.

In [43]:
joined = books.reset_index().join(topicmatrix)

Now that we have that, we can use Pandas to find the books in group 2 with a probability higher than 95%.

In [44]:
joined[joined[2] > 0.95]["title"]

3695                           Punk Farm
5340     The Moomins and the Great Flood
5699    The Story of A Fierce Bad Rabbit
5943            Five Go Off In A Caravan
6013         Curious George Flies a Kite
6437                          Fox's Feud
8034                   The Missing Piece
8095                      Curious George
8359                 Battle for the Park
8737                              Bedlam
9003                     The Sly Old Cat
9035         Curious George Gets a Medal
9141                       The Gathering
9150                         Troll Blood
9210                      Little Red Cap
9211                When the Moon Forgot
Name: title, dtype: object

Now it's up to us again to think about what these books might have in common. Judging from the titles, most of these sound like children's books!

What do we have for group 1?

In [45]:
joined[joined[1] > 0.95]["title"]

855                             The Dragon Reborn
1123                     The Wishsong of Shannara
1546                         To Green Angel Tower
2029                          The Source of Magic
3496                                 Resurrection
3751                          Dawn of the Dragons
3777                                   Stronghold
3804                       The Kingdoms of Terror
3805                                 Castle Death
3808                       The Dungeons of Torgar
3896                        The Hunger of Sejanoz
3909                                    Vampirium
3910                   The Fall of Blood Mountain
3963                         The Hour of the Gate
4597                           Chosen of the Gods
4860                         The Bone Doll's Twin
4879                                Deryni Rising
4966                               Rise of a Hero
4979                                   Shadowplay
5063                      Dark Wraith of Shannara


Hmm, fantasy maybe?

It's not always very clear though. For example, what would be the theme for group 9?

In [46]:
joined[joined[9] > 0.95]["title"]

8782    Ishmael and the Return of the Dugongs
9017                               Bloodlines
9103                       Soccer Comes First
9105                                     Goal
9202                        Succubus Revealed
9237                The Luck of Ginger Coffey
9268                  Big Nate: Strikes Again
Name: title, dtype: object

There are not a lot of books here, so even the LDA isn't too sure about what ties these books together.

## Conclusion
That's the trouble with unsupervised learning. The algorithm will find things that belong together but it doesn't tell you why. It can only tell you which words it used to group these books but as far as the model is concerned, these words are just indexes in a dictionary. All it sees are numbers and the words have no meaning.

At this point, you would start playing with the parameters. For example, maybe we chose too many topics or too few. Other parameters to play with are the number of training passes and the chunk size. In the [manual for the LDA model](https://radimrehurek.com/gensim/models/ldamodel.html) there are also a few other parameters you can use that we didn't set here. It will be a balance between your computational resources and accuracy. You will need to keep a close eye on the log file that the LDA model produces to see if your parameters have the desired effect.

References:
- https://towardsdatascience.com/topic-modelling-in-python-with-nltk-and-gensim-4ef03213cd21
- https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html