# Gensim
----------
1.	Gensim ("topic modelling for humans"): the name is short for "generate similar"
2.	Gensim is useful for
    - Building document or word vectors
    - Performing topic identification and document comparison
3.	Word vectors
    - Word vectors are multi-dimensional mathematical representations of words, typically learned from text with neural-network models (for example, word2vec). They give us insight into relationships between words in a corpus.
4.	An example of word vectors:
![word vectors.png](https://github.com/rritec/datahexa/blob/dev/images/word_vector.png?raw=true)
5.	Gensim models can be easily saved, updated, and reused
6.	A gensim dictionary can also be updated with new documents
7.	This more advanced, feature-rich bag-of-words representation is used in the exercises that follow

- If time permits, walk through:
    - https://radimrehurek.com/gensim/ 
    - http://tlfvincent.github.io/2015/10/23/presidential-speech-topics/

## Exercise 1: Creating a gensim dictionary and corpus

In [1]:
from gensim.corpora.dictionary import Dictionary
from nltk.tokenize import word_tokenize



In [2]:
my_documents = ['The movie was about. a spaceship and aliens.',
'I really liked the movie!',
'Awesome action scenes, but boring characters.',
'The movie was awful! I hate alien films.',
'Space is cool! I liked the movie.',
'More space films, please!',]

In [3]:
tokenized_docs = [word_tokenize(doc.lower()) for doc in my_documents]

In [4]:
tokenized_docs

[['the', 'movie', 'was', 'about', '.', 'a', 'spaceship', 'and', 'aliens', '.'],
 ['i', 'really', 'liked', 'the', 'movie', '!'],
 ['awesome', 'action', 'scenes', ',', 'but', 'boring', 'characters', '.'],
 ['the', 'movie', 'was', 'awful', '!', 'i', 'hate', 'alien', 'films', '.'],
 ['space', 'is', 'cool', '!', 'i', 'liked', 'the', 'movie', '.'],
 ['more', 'space', 'films', ',', 'please', '!']]

In [5]:
len(tokenized_docs)

6

In [6]:
dictionary = Dictionary(tokenized_docs)

In [7]:
dictionary

<gensim.corpora.dictionary.Dictionary at 0x2336303f0f0>

In [8]:
dictionary.token2id

{'.': 0,
 'a': 1,
 'about': 2,
 'aliens': 3,
 'and': 4,
 'movie': 5,
 'spaceship': 6,
 'the': 7,
 'was': 8,
 '!': 9,
 'i': 10,
 'liked': 11,
 'really': 12,
 ',': 13,
 'action': 14,
 'awesome': 15,
 'boring': 16,
 'but': 17,
 'characters': 18,
 'scenes': 19,
 'alien': 20,
 'awful': 21,
 'films': 22,
 'hate': 23,
 'cool': 24,
 'is': 25,
 'space': 26,
 'more': 27,
 'please': 28}

In [9]:
dictionary.doc2bow?

In [10]:
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

In [11]:
print(corpus)

[[(0, 2), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)], [(5, 1), (7, 1), (9, 1), (10, 1), (11, 1), (12, 1)], [(0, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1)], [(0, 1), (5, 1), (7, 1), (8, 1), (9, 1), (10, 1), (20, 1), (21, 1), (22, 1), (23, 1)], [(0, 1), (5, 1), (7, 1), (9, 1), (10, 1), (11, 1), (24, 1), (25, 1), (26, 1)], [(9, 1), (13, 1), (22, 1), (26, 1), (27, 1), (28, 1)]]
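
To make the bag-of-words output above less opaque, here is a rough pure-Python sketch of what the dictionary and `doc2bow` steps produce. It is not gensim's implementation: `build_token2id` and `simple_doc2bow` are hypothetical helpers, and the id-assignment rule (new tokens of each document get the next free ids in sorted order) is an assumption chosen because it matches the `token2id` output shown earlier.

```python
from collections import Counter

# The tokenized documents, copied from the output of cell In [4].
tokenized_docs = [
    ['the', 'movie', 'was', 'about', '.', 'a', 'spaceship', 'and', 'aliens', '.'],
    ['i', 'really', 'liked', 'the', 'movie', '!'],
    ['awesome', 'action', 'scenes', ',', 'but', 'boring', 'characters', '.'],
    ['the', 'movie', 'was', 'awful', '!', 'i', 'hate', 'alien', 'films', '.'],
    ['space', 'is', 'cool', '!', 'i', 'liked', 'the', 'movie', '.'],
    ['more', 'space', 'films', ',', 'please', '!'],
]


def build_token2id(docs):
    """Map each distinct token to an integer id (hypothetical helper)."""
    token2id = {}
    for doc in docs:
        # Unseen tokens of this document get the next free ids, in sorted order.
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def simple_doc2bow(doc, token2id):
    """Turn a token list into sorted (token_id, count) pairs."""
    counts = Counter(doc)
    return sorted((token2id[tok], n) for tok, n in counts.items() if tok in token2id)


token2id = build_token2id(tokenized_docs)
simple_corpus = [simple_doc2bow(doc, token2id) for doc in tokenized_docs]
print(simple_corpus[0])
```

Each pair is `(token_id, count)`. In the first document the period (id 0) appears twice, which is why its entry is `(0, 2)` while every other token has a count of 1.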


## Tf-idf with gensim
1.	Tf-idf stands for term frequency-inverse document frequency
2.	It allows you to determine the most important words in each document
3.	Each corpus may have shared words beyond just stopwords
4.	These shared words should be down-weighted in importance
5.	Example: in an astronomy corpus, "sky" appears in almost every document, so it says little about any single one
6.	Tf-idf ensures the most common words don't show up as keywords
7.	It keeps frequent, document-specific words weighted high
8.	Tf-idf formula  
![tf-idf formula](https://github.com/rritec/datahexa/blob/dev/images/tfidf.png?raw=true)

    - $w_{i,j}$ = tf-idf weight for token $i$ in document $j$
    - $tf_{i,j}$ = number of occurrences of token $i$ in document $j$
    - $df_i$ = number of documents that contain token $i$
    - $N$ = total number of documents
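
Written out with the symbols defined above, the common textbook form of tf-idf is given below. This is an assumption about what the figure shows; the logarithm base and any normalization differ between implementations.

```latex
w_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_i}\right)
```

For example, a token that appears in every document has $N / df_i = 1$, so its weight is $\log(1) = 0$, however often it occurs.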


## Exercise 2: Tf-idf with gensim

In [12]:
from gensim.models.tfidfmodel import TfidfModel

In [13]:
tfidf = TfidfModel(corpus)

In [14]:
corpus[1]

[(5, 1), (7, 1), (9, 1), (10, 1), (11, 1), (12, 1)]

In [15]:
tfidf[corpus[1]]

[(5, 0.1746298276735174),
 (7, 0.1746298276735174),
 (9, 0.1746298276735174),
 (10, 0.29853166221463673),
 (11, 0.47316148988815415),
 (12, 0.7716931521027908)]
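
As a rough check on the numbers above, the weights can be reproduced by hand from the definitions of tf, df, and N. This sketch assumes gensim's default settings (a base-2 logarithm for the idf term, then scaling each document vector to unit length); `tfidf_weights` is a hypothetical helper written for this example, not a gensim function.

```python
import math

# Bag-of-words corpus, copied from the output of print(corpus) above.
corpus = [
    [(0, 2), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)],
    [(5, 1), (7, 1), (9, 1), (10, 1), (11, 1), (12, 1)],
    [(0, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1)],
    [(0, 1), (5, 1), (7, 1), (8, 1), (9, 1), (10, 1), (20, 1), (21, 1), (22, 1), (23, 1)],
    [(0, 1), (5, 1), (7, 1), (9, 1), (10, 1), (11, 1), (24, 1), (25, 1), (26, 1)],
    [(9, 1), (13, 1), (22, 1), (26, 1), (27, 1), (28, 1)],
]


def tfidf_weights(bow, corpus):
    """Compute unit-length tf-idf weights for one bag-of-words document."""
    n_docs = len(corpus)
    # df[token_id] = number of documents that contain the token.
    df = {}
    for doc in corpus:
        for token_id, _ in doc:
            df[token_id] = df.get(token_id, 0) + 1
    # Raw weight: term frequency times log2(N / df).
    raw = [(token_id, tf * math.log2(n_docs / df[token_id])) for token_id, tf in bow]
    # Scale the vector to unit (L2) length, dropping zero weights.
    norm = math.sqrt(sum(w * w for _, w in raw))
    if norm == 0:
        return []
    return [(token_id, w / norm) for token_id, w in raw if w > 0]


weights = tfidf_weights(corpus[1], corpus)
print([(token_id, round(w, 4)) for token_id, w in weights])
```

The token `really` (id 12) occurs in only one of the six documents, so it receives the highest weight, while `movie`, `the`, and `!` each occur in four documents and are weighted lowest.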