# Text Mining in Python
#### In our last lecture we learned about the bag of words (BOW) representation for transforming unstructured text into a document-term matrix that we could use with machine learning algorithms. Today, we'll learn about another way of representing the presence of terms in a document by reweighting the counts based on the importance of the terms.

## TF-IDF (Term Frequency - Inverse Document Frequency)

From [Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf):<br>

In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf-idf value increases proportionally to the number of times a word appears in the document and is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general. Nowadays, tf-idf is one of the most popular term-weighting schemes; 83% of text-based recommender systems in the domain of digital libraries use tf-idf.

#### Term Frequency: the number of times a term (token, e.g. a word) appears in a document<br>Document Frequency: the number of documents that a word appears in

So let's stop and think about this for a second. Say our goal is to find all relevant documents from a corpus given a search phrase. Say we're only allowed to search for documents using one word of the search phrase at a time. You've created a dictionary where all of the terms present in the documents are the keys and the values are sets of tuples (document, term frequency). We'll need to consider each of the words in the search phrase to determine the relevance of each of the documents found.<br>

example search phrase: <em>the fast fourier transform</em><br>

Your first thought might be to take the intersection of the sets of documents that contain each word. But how would you go about ordering those results? What if the corpus included a French version of Pinocchio about a doll named Fourier who wanted to transform into a real boy (and do so fast)? How would you determine that it was irrelevant?
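The dictionary and the intersection idea described above can be sketched in a few lines. This is only an illustration: the toy corpus, the document ids, and the helper `docs_containing` are made up for this example.

```python
from collections import Counter, defaultdict

# Toy corpus; document ids and texts are invented for illustration.
docs = {
    "dsp_notes": "the fast fourier transform makes the fourier transform fast",
    "pinocchio": "the doll named fourier wanted to transform into a real boy fast",
    "cooking": "the soup is ready fast",
}

# term -> set of (document id, term frequency) tuples
index = defaultdict(set)
for doc_id, doc_text in docs.items():
    for term, count in Counter(doc_text.split()).items():
        index[term].add((doc_id, count))

def docs_containing(term):
    """Return the ids of the documents that contain `term`."""
    return {doc_id for doc_id, _ in index.get(term, set())}

query = "the fast fourier transform".split()
matches = set.intersection(*(docs_containing(t) for t in query))
print(sorted(matches))
```

Both the signal-processing notes and the Pinocchio story survive the intersection, which is exactly the ordering problem raised above: set membership alone cannot say which match is more relevant.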

Let's consider each word:

<b>the</b>: this word probably appears in every document, so a document containing <em>the</em> tells us nothing about its relevance

<b>fast</b>: this word probably appears in many documents that have nothing to do with the Fourier transform, in addition to those about the fast Fourier transform, so it's not as useless as <em>the</em> but still fairly uninformative

<b>fourier</b>: this word will appear in far fewer documents than <b>fast</b>, so it should be more relevant to our query

<b>transform</b>: this word will appear in more documents than <b>fourier</b> but fewer documents than <b>fast</b>, and its relevance should reflect that

We also care how many times the word is mentioned. In a document about the fast Fourier transform, we would expect each of those words to occur frequently. However, we care more about the relative frequency than the raw count. Therefore we should normalize the term frequency by the number of words in the document, so that we don't assign higher relevance to a document merely because it is longer.

Notice that the relevance of any document is directly proportional to the normalized term frequency and inversely proportional to how many documents the term appears in. This is the motivation behind tf-idf.

We're looking for something like:

\begin{equation*}
\text{tf-idf} = f(\text{term freq}) \times g({\frac{1}{\text{doc freq}}})
\end{equation*}

which could be as simple as:

\begin{equation*}
\text{tf-idf}_{word} = \langle\text{term freq}_{word}\rangle \times \log \left( \frac{N_{doc}}{\text{doc freq}_{word}} \right)
\end{equation*}

using the log for smaller overall values in case $N_{doc}$ is large.

There are several variants of the tf-idf calculation; the Wikipedia page lists a number of them. Refer to the scikit-learn documentation to see which one it uses and why.
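To make the simple formula above concrete, here is a minimal sketch that computes it by hand with the standard library. The toy corpus and the function name `simple_tfidf` are assumptions made for this example, and this is the simple variant from the equation, not the variant scikit-learn implements.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration.
corpus = [
    "the fast fourier transform is fast",
    "the boy wanted to transform fast",
    "the soup is hot",
]
tokenized = [doc.split() for doc in corpus]
n_docs = len(tokenized)

# Document frequency: the number of documents each term appears in.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def simple_tfidf(tokens):
    """Length-normalized term frequency times log(N_doc / doc_freq)."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }

scores = simple_tfidf(tokenized[0])
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {score:.3f}")
```

In the first document, <em>fourier</em> gets the highest score even though <em>fast</em> occurs twice, and <em>the</em> scores exactly zero because it appears in every document.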

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
example_sentences = ['The dog is a good dog.', 
                     'The boy is bad.', 
                     'The girl is good.',
                    ]

In [3]:
tfidf = TfidfVectorizer(lowercase=True, norm=None, stop_words='english', use_idf=False)
tfidf.fit_transform(example_sentences).toarray()

array([[0., 0., 2., 0., 1.],
       [1., 1., 0., 0., 0.],
       [0., 0., 0., 1., 1.]])

In [4]:
tfidf.get_feature_names()

['bad', 'boy', 'dog', 'girl', 'good']

In [5]:
tfidf = TfidfVectorizer(lowercase=True, norm='l1', stop_words='english', use_idf=False)
tfidf.fit_transform(example_sentences).toarray()

array([[0.        , 0.        , 0.66666667, 0.        , 0.33333333],
       [0.5       , 0.5       , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.5       , 0.5       ]])

In [6]:
tfidf = TfidfVectorizer(lowercase=True, norm=None, stop_words='english', use_idf=True)
tfidf.fit_transform(example_sentences).toarray()

array([[0.        , 0.        , 3.38629436, 0.        , 1.28768207],
       [1.69314718, 1.69314718, 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 1.69314718, 1.28768207]])

In [7]:
tfidf = TfidfVectorizer(lowercase=True, norm='l2', stop_words='english', use_idf=True)
tfidf.fit_transform(example_sentences).toarray()

array([[0.        , 0.        , 0.93470196, 0.        , 0.35543247],
       [0.70710678, 0.70710678, 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.79596054, 0.60534851]])
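The numbers in the two `use_idf=True` outputs above can be reproduced by hand. With the default `smooth_idf=True`, scikit-learn computes the idf as ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the document frequency. The sketch below starts from the raw counts shown earlier; the variable names are chosen just for this example.

```python
import numpy as np

# Raw term counts from the use_idf=False, norm=None output above.
# Columns: bad, boy, dog, girl, good
counts = np.array([[0., 0., 2., 0., 1.],
                   [1., 1., 0., 0., 0.],
                   [0., 0., 0., 1., 1.]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)

# Smoothed idf (the smooth_idf=True default): ln((1 + n) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

tfidf_raw = counts * idf
# norm='l2' rescales every row to unit Euclidean length.
tfidf_l2 = tfidf_raw / np.linalg.norm(tfidf_raw, axis=1, keepdims=True)

print(tfidf_raw.round(8))
print(tfidf_l2.round(8))
```

Because of the added 1, a term that appears in every document still keeps a non-zero weight, unlike in the simple log formula.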

From the [Scikit-learn docs](http://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting):
> In a large text corpus, some words will be very present (e.g. “the”, “a”, “is” in English) hence carrying very little meaningful information about the actual contents of the document. If we were to feed the direct count data directly to a classifier those very frequent terms would shadow the frequencies of rarer yet more interesting terms.<br><br>
In order to re-weight the count features into floating point values suitable for usage by a classifier it is very common to use the tf–idf transform.

#### Learn more about tf-idf: <br> http://blog.christianperone.com/2011/09/machine-learning-text-feature-extraction-tf-idf-part-i/ <br> http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

#### We also learned in the previous lecture about n-grams, and that one of the problems with calculating n-grams is that the number of features will explode. What if we came up with a way to identify meaningful/significant n-grams and only used those instead? Lucky for us, some people have already figured out ways to do just that.

## Collocations

From [Wikipedia](https://en.wikipedia.org/wiki/Collocation):

In corpus linguistics, a collocation is a sequence of words or terms that co-occur more often than would be expected by chance.

I won't go into the details of the calculations here. But if you would like to work with collocations in your project, here are resources to learn more about them:<br>
https://nlp.stanford.edu/fsnlp/promo/colloc.pdf <br>
http://www.scielo.org.mx/scielo.php?pid=S1405-55462016000300327&script=sci_arttext
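To give a feel for what "more often than would be expected by chance" means, here is a minimal sketch of one common association measure, pointwise mutual information (PMI). It is only an illustration on a made-up corpus; it is not necessarily the scoring function gensim uses, and the helper `pmi` is defined just for this example.

```python
import math
from collections import Counter

# Toy tokenized sentences, invented for illustration.
sentences = [
    ["a", "waste", "of", "time"],
    ["special", "effects", "were", "a", "waste"],
    ["great", "special", "effects"],
    ["a", "great", "time"],
]

unigrams = Counter(tok for sent in sentences for tok in sent)
bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
n_unigrams = sum(unigrams.values())
n_bigrams = sum(bigrams.values())

def pmi(first, second):
    """log2 of the observed bigram probability over the one expected by chance.

    Assumes the bigram occurs at least once in the corpus.
    """
    p_pair = bigrams[(first, second)] / n_bigrams
    p_first = unigrams[first] / n_unigrams
    p_second = unigrams[second] / n_unigrams
    return math.log2(p_pair / (p_first * p_second))

print(pmi("special", "effects"))
print(pmi("a", "great"))
```

The pair ('special', 'effects') scores higher than ('a', 'great') because its two words almost always occur together, whereas 'a' shows up next to many different words.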

In [8]:
import bs4
import pandas as pd
import spacy

from gensim.models.phrases import Phrases, Phraser
from spacy.pipeline import Pipe

So now, instead of just using the frequency of the word in the document, you're reweighting that frequency by how important the term should be according to its tf-idf score.

In [9]:
movie_data = pd.read_csv('../Lecture_10/labeledTrainData.tsv/labeledTrainData.tsv', sep='\t')
text = movie_data.sample(10000, random_state=42).loc[:, 'review'].apply(lambda t: bs4.BeautifulSoup(t, 'lxml').get_text())

#### Note: using lxml instead of html5lib will significantly speed up the html parsing

In [10]:
print(text.iloc[0])

I read that \There's a Girl in My Soup\" came out during Peter Sellers's low period. Watching the movie, I'm not surprised. Almost nothing happens in the movie. Seemingly, the very presence of Sellers and Goldie Hawn should help the movie; it doesn't. The whole movie seems like they just randomly filmed whatever happened without scripting anything. Maybe I haven't seen every movie about middle-aged to elderly people trying to be hippies, but this one gives such movies a pretty bad name.All in all, both Sellers and Hawn have starred in much better movies than this, so don't waste your time on this. Pretty worthless."


#### Let's find collocations at the sentence level instead of the review level so we don't find collocations between words at the end of sentences and the beginning of others.

In [11]:
nlp = spacy.load('en')

In [12]:
%%time
token_text = []

for doc in nlp.pipe(text):
    for sent in doc.sents:
        token_text.append([t.lower_ for t in sent if not t.is_punct])

CPU times: user 10min 18s, sys: 3min 10s, total: 13min 29s
Wall time: 8min 6s


In [13]:
print(token_text[0])

['i', 'read', 'that', '\\there', "'s", 'a', 'girl', 'in', 'my', 'soup\\', 'came', 'out', 'during', 'peter', 'sellers', "'s", 'low', 'period']


In [199]:
from sklearn.feature_extraction import stop_words

In [17]:
common_terms = list(stop_words.ENGLISH_STOP_WORDS) + ["'m", "'re", "'ll", "'s", "'ve", "'d", 'ca', 'is']

common_terms.remove('not')
common_terms.remove('nothing')
common_terms.remove('never')

In [18]:
sorted(common_terms)

["'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amoungst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'bill',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'cant',
 'co',
 'con',
 'could',
 'couldnt',
 'cry',
 'de',
 'describe',
 'detail',
 'do',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eg',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'etc',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',


In [19]:
phrases = Phrases(token_text, common_terms=common_terms)

In [20]:
colloc = Phraser(phrases)

In [21]:
colloc_text = colloc[token_text]

In [22]:
colloc_text[0]

['i',
 'read',
 'that',
 '\\there',
 "'s",
 'a',
 'girl',
 'in',
 'my',
 'soup\\',
 'came',
 'out',
 'during',
 'peter_sellers',
 "'s",
 'low',
 'period']

In [23]:
colloc_text[1999]

['an',
 'unassuming',
 'subtle',
 'and',
 'lean',
 'film',
 '\\the',
 'man',
 'in',
 'the',
 'white',
 'suit\\',
 'is',
 'yet',
 'another',
 'breath_of_fresh',
 'air',
 'in',
 'filmic',
 'format',
 'from',
 'ealing',
 'studios']

In [24]:
tri_phrases = Phrases(colloc_text, common_terms=common_terms)

In [25]:
tri_colloc = Phraser(tri_phrases)

In [26]:
tri_colloc_text = tri_colloc[colloc_text]

In [27]:
tri_colloc_text[1999]

['an',
 'unassuming',
 'subtle',
 'and',
 'lean',
 'film',
 '\\the',
 'man',
 'in',
 'the',
 'white',
 'suit\\',
 'is',
 'yet',
 'another',
 'breath_of_fresh_air',
 'in',
 'filmic',
 'format',
 'from',
 'ealing',
 'studios']

#### Now you can use this Phraser to convert a list of tokens into a list of tokens that groups together collocations

In [28]:
tri_colloc[['it', 'was', 'a', 'waste', 'of', 'time']]

['it', 'was', 'a', 'waste_of_time']

See the gensim docs for more info: https://radimrehurek.com/gensim/models/phrases.html

## Word2Vec

For some of your projects, your goal is to figure out the sentiment expressed on specific aspects of an object. In order to do that, you'd have to account for all of the different ways a person could refer to that aspect.

Say you're looking at product reviews for cell phones and you've noticed one aspect of cell phones that reviewers seem to care about is the battery life. But you've noticed that sometimes they talk about that aspect using different words such as: 'battery life', 'battery', and 'battery power.' You now know how to find collocations such as 'battery_life' and 'battery_power.' But how would you know that those are all used to refer to the same thing? This is an unsupervised learning problem. You don't have labels for each of those terms telling you that they refer to "battery life." So you need a way to learn from the text that those terms are used to refer to the same aspect. Word2Vec can do this for you.

![](https://deeplearning4j.org/img/word2vec_diagrams.png)

The gist:

By using surrounding words (context) to predict a word or by using a word to predict the surrounding words, you can use the hidden layer of the NN to map words to a lower dimensional vector space (instead of the original vector space that had the same number of dimensions as the number of words in your corpus vocabulary). In order to shrink the vector space, the NN has to learn to recognize patterns in the text (representations) to compress the information.
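The "use a word to predict the surrounding words" setup boils down to turning each sentence into (center word, context word) training pairs. The sketch below shows only that pair-generation step; `skipgram_pairs` and its `window` parameter are hypothetical names for this example, and real implementations typically add refinements such as subsampling frequent words and negative sampling.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs for every token within `window` positions."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                yield center, tokens[j]

example_tokens = ["the", "battery_life", "is", "great"]
pairs = list(skipgram_pairs(example_tokens, window=1))
print(pairs)
```

Words that keep appearing with the same context words end up with similar training pairs, which is what pushes their vectors close together.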

What you get from word2vec are vectors for each word, where the position of a word in the lower dimensional vector space represents some concept, and similar words (words used in similar contexts in the training data) are close to each other.
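"Close to each other" is usually measured with cosine similarity, the cosine of the angle between two vectors. Here is a minimal sketch with NumPy; the three-dimensional vectors are invented for this example and are not output from a trained model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-dimensional "word vectors", invented for this example.
toy_vectors = {
    "battery": np.array([0.9, 0.1, 0.0]),
    "battery_life": np.array([0.8, 0.2, 0.1]),
    "screen": np.array([0.0, 0.3, 0.9]),
}

print(cosine_similarity(toy_vectors["battery"], toy_vectors["battery_life"]))
print(cosine_similarity(toy_vectors["battery"], toy_vectors["screen"]))
```

A value near 1 means the vectors point in almost the same direction; a value near 0 means they are nearly orthogonal.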

Using these vectors, you can cluster the words together.

In [29]:
from gensim.models import Word2Vec

In [30]:
%%time

model = Word2Vec(tri_colloc_text, size=100, workers=8)

CPU times: user 1min 26s, sys: 1.93 s, total: 1min 27s
Wall time: 1min 15s


In [31]:
model.wv['cinematography']

array([-1.08353972e-01,  1.40819073e-01,  3.88194174e-01,  1.96526852e-03,
       -8.26176286e-01, -2.41572540e-02,  3.40961754e-01, -1.12723410e+00,
       -1.24302141e-01, -6.08248949e-01,  3.92841041e-01,  1.13429093e+00,
       -6.65594101e-01, -7.96273947e-01, -1.73878884e+00,  1.49672046e-01,
        2.32765928e-01,  5.76645970e-01,  7.62504876e-01,  1.84046459e+00,
       -7.46206045e-01,  6.51902080e-01,  1.16231477e+00,  3.65196377e-01,
        1.46791473e-01,  4.28103358e-01,  4.50814366e-01,  3.60439390e-01,
        1.19752860e+00, -1.00293875e+00, -2.06143737e+00,  9.02431756e-02,
        3.82894397e-01,  5.03197089e-02, -1.72153533e-01,  2.12108687e-01,
        1.50363934e+00, -1.40860423e-01,  9.75907803e-01,  1.31152973e-01,
       -3.28705341e-01,  7.03209937e-01,  8.88488218e-02,  3.74101818e-01,
       -1.25820339e+00,  1.34960458e-01, -6.26231432e-01,  7.85148263e-01,
        1.02406454e+00,  1.19477010e+00,  5.92695594e-01,  1.05428958e+00,
       -1.82161462e-02, -

In [32]:
model.wv.most_similar('cinematography')

[('photography', 0.9435670971870422),
 ('camera_work', 0.9348364472389221),
 ('lighting', 0.9125988483428955),
 ('direction', 0.9095066785812378),
 ('music', 0.909207284450531),
 ('soundtrack', 0.8958258032798767),
 ('editing', 0.8878849148750305),
 ('pacing', 0.8850436210632324),
 ('scenery', 0.8823689818382263),
 ('dialog', 0.8796185255050659)]

In [33]:
model.wv.most_similar('plot')

[('storyline', 0.8430126905441284),
 ('story', 0.8208246231079102),
 ('story_line', 0.8002089262008667),
 ('dialogue', 0.7878210544586182),
 ('script', 0.7815348505973816),
 ('ending', 0.7605075836181641),
 ('concept', 0.7239733934402466),
 ('message', 0.7239457964897156),
 ('dialog', 0.7191160917282104),
 ('sound', 0.707491934299469)]

In [34]:
model.wv.most_similar('character')

[('role', 0.73625248670578),
 ('performance', 0.7231549620628357),
 ('voice', 0.6991214156150818),
 ('villain', 0.6930806040763855),
 ('main_character', 0.682639479637146),
 ('portrayal', 0.6545627117156982),
 ('situation', 0.6361498832702637),
 ('presence', 0.6356627941131592),
 ('relationship', 0.6308934092521667),
 ('actor', 0.6215326189994812)]

In [35]:
model.wv.most_similar('director')

[('writer', 0.7605269551277161),
 ('actor', 0.6439990401268005),
 ('filmmaker', 0.641372561454773),
 ('screenplay', 0.5950239896774292),
 ('author', 0.5923200845718384),
 ('cast', 0.5767263770103455),
 ('producer', 0.5612565875053406),
 ('casting', 0.5563101768493652),
 ('performance', 0.5456812381744385),
 ('role', 0.5445694923400879)]

In [None]:
vocab_set = set()

In [None]:
for sent in tri_colloc_text:
    vocab_set.update(sent)

In [38]:
vocab = pd.Series(list(model.wv.vocab))
vocab_vectors = []

for word in vocab:
    try:
        vec = model.wv[word]
        vocab_vectors.append(vec)
    except KeyError:
        # skip any word that has no vector in the model
        pass

In [39]:
len(vocab), len(vocab_vectors)

(22838, 22838)

In [40]:
vocab_vectors[0]

array([ 0.02460557,  0.47760966, -2.9226177 ,  0.13891673,  0.6100589 ,
       -0.09417142, -2.1427715 , -0.39144024, -0.25622603, -1.268989  ,
       -1.6281679 ,  0.0811056 ,  1.0130559 , -0.60660034,  0.571093  ,
       -1.9916303 , -2.333888  ,  2.3250675 , -0.9635863 ,  1.505168  ,
       -2.632644  ,  3.0234292 ,  0.23095961, -2.334893  ,  0.4685145 ,
       -2.6683495 , -1.2293535 , -1.0694523 , -0.93073964,  0.43642136,
       -1.4611065 , -2.4278991 , -1.8014826 , -0.4825394 , -0.21139567,
       -1.0011889 , -1.2318618 , -0.7314823 , -0.25001726,  1.7385334 ,
       -0.31799376,  0.4863861 , -0.7905135 , -1.7888017 ,  0.523053  ,
        0.46604916,  0.67809457, -2.8256426 , -0.08586459, -3.0361795 ,
        1.0235425 , -1.2746819 , -3.718552  , -1.9863592 , -0.6691114 ,
       -1.0657921 , -2.314538  ,  4.4069295 , -0.4689855 ,  0.616245  ,
       -1.0528014 ,  1.9347531 ,  0.9588505 ,  3.0499837 ,  0.03572845,
        0.18023825, -0.51651186,  1.2198809 ,  1.1035601 ,  2.13

In [42]:
import numpy as np

from sklearn.preprocessing import normalize

In [104]:
vector_array = np.concatenate(vocab_vectors, axis=0).reshape(-1, 100)

vec_array_l1 = normalize(vector_array, norm='l1')
vec_array_l2 = normalize(vector_array, norm='l2')

In [105]:
vector_array[0, :]

array([ 0.02460557,  0.47760966, -2.9226177 ,  0.13891673,  0.6100589 ,
       -0.09417142, -2.1427715 , -0.39144024, -0.25622603, -1.268989  ,
       -1.6281679 ,  0.0811056 ,  1.0130559 , -0.60660034,  0.571093  ,
       -1.9916303 , -2.333888  ,  2.3250675 , -0.9635863 ,  1.505168  ,
       -2.632644  ,  3.0234292 ,  0.23095961, -2.334893  ,  0.4685145 ,
       -2.6683495 , -1.2293535 , -1.0694523 , -0.93073964,  0.43642136,
       -1.4611065 , -2.4278991 , -1.8014826 , -0.4825394 , -0.21139567,
       -1.0011889 , -1.2318618 , -0.7314823 , -0.25001726,  1.7385334 ,
       -0.31799376,  0.4863861 , -0.7905135 , -1.7888017 ,  0.523053  ,
        0.46604916,  0.67809457, -2.8256426 , -0.08586459, -3.0361795 ,
        1.0235425 , -1.2746819 , -3.718552  , -1.9863592 , -0.6691114 ,
       -1.0657921 , -2.314538  ,  4.4069295 , -0.4689855 ,  0.616245  ,
       -1.0528014 ,  1.9347531 ,  0.9588505 ,  3.0499837 ,  0.03572845,
        0.18023825, -0.51651186,  1.2198809 ,  1.1035601 ,  2.13

In [109]:
vec_array_l1[5].sum()

0.207269
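The L1-normalized row above does not sum to 1 because L1 normalization divides by the sum of the absolute values, and word vectors contain negative entries. L2 normalization gives every row unit length, which makes the squared Euclidean distance between two rows equal to 2 - 2 * cos(a, b), so Euclidean k-means on L2-normalized vectors groups words by cosine similarity. The sketch below checks both facts with plain NumPy on random stand-in data.

```python
import numpy as np

rng = np.random.default_rng(0)
demo = rng.normal(size=(4, 5))  # random stand-in for word vectors

# The same row-wise divisions that L1 and L2 normalization perform.
demo_l1 = demo / np.abs(demo).sum(axis=1, keepdims=True)
demo_l2 = demo / np.linalg.norm(demo, axis=1, keepdims=True)

# L1: the absolute values sum to 1; the signed sum generally does not.
print(np.abs(demo_l1).sum(axis=1))
print(demo_l1.sum(axis=1))

# L2: rows have unit length, so ||a - b||^2 == 2 - 2 * cos(a, b).
a, b = demo_l2[0], demo_l2[1]
squared_dist = np.sum((a - b) ** 2)
cosine = np.dot(a, b)
print(squared_dist, 2 - 2 * cosine)
```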

In [110]:
vocab[0]

'i'

In [111]:
from sklearn.cluster import KMeans

In [122]:
km = KMeans(n_clusters=1000, init='random', n_jobs=1, max_iter=1000)

In [123]:
%%time
km.fit(vec_array_l2)

CPU times: user 1min 36s, sys: 11.2 s, total: 1min 48s
Wall time: 1min 48s


KMeans(algorithm='auto', copy_x=True, init='random', max_iter=1000,
    n_clusters=1000, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [124]:
%matplotlib inline
pd.value_counts(km.labels_)

576    137
237    108
379    106
508    103
794    103
280     97
54      97
506     89
20      89
136     85
221     83
269     81
691     79
499     77
662     76
480     76
617     74
17      73
358     73
582     73
498     73
6       71
873     71
900     69
93      69
843     68
981     68
309     68
664     67
92      67
      ... 
401      1
584      1
552      1
668      1
777      1
303      1
232      1
255      1
960      1
120      1
88       1
24       1
896      1
985      1
130      1
471      1
439      1
111      1
327      1
95       1
167      1
362      1
71       1
514      1
594      1
926      1
486      1
779      1
470      1
475      1
Length: 1000, dtype: int64

In [125]:
words_to_lookup = ['cinematography', 'plot', 'character', 'filmmaker']
series_filter = vocab.isin(words_to_lookup)
word_labels = km.labels_[series_filter]
words = vocab[series_filter]
words, word_labels

(163               plot
 223          character
 1018    cinematography
 2337         filmmaker
 dtype: object, array([286, 188, 694, 752], dtype=int32))

In [126]:
for cluster_num, word in zip(word_labels, words):
    print('============{}============='.format(word))
    print(vocab[km.labels_ == cluster_num])

92               story
107           material
163               plot
421            message
598             comedy
714         story_line
1166             piece
1779           quality
1865           concept
1976          thriller
2637             drama
3487         storyline
3871            effect
4410    subject_matter
4428           premise
dtype: object
223          character
1190             voice
6730             truth
7806    main_character
dtype: object
133            direction
164                 cast
205                music
258               acting
501         performances
545           soundtrack
712           screenplay
747             dialogue
987     rest_of_the_cast
1006             editing
1018      cinematography
1088              dialog
1403         photography
1441              script
1525             scenery
2295               sound
2298               score
2658     supporting_cast
2723             effects
2864     special_effects
3230           directing
3768      

## Dimensionality Reduction and Topic Modeling

### Singular Value Decomposition

From the [Wikipedia page](https://en.wikipedia.org/wiki/Singular-value_decomposition):<br>
![](https://upload.wikimedia.org/wikipedia/commons/e/e9/Singular_value_decomposition.gif)

<div style="font-size: 200%;">$$\mathbf{M} =\mathbf{U} {\boldsymbol {\Sigma}}\mathbf{V}^{*}$$</div>

Read these:<br>
http://www.ams.org/publicoutreach/feature-column/fcarc-svd<br>
https://www.quora.com/What-is-an-intuitive-explanation-of-singular-value-decomposition-SVD

Basically, we are trying to compress the information stored in the N-dimensional data matrix down to a K-dimensional form $(K < N)$. To accomplish that, we assume that each of the N column vectors of the original matrix can be represented as a linear combination of K vectors.
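The compression idea can be tried out directly with NumPy: decompose a matrix, keep only the largest singular values, and rebuild an approximation from them. The matrix below is random stand-in data and the number of kept components is chosen arbitrarily for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 4))  # random stand-in for a small data matrix

# Thin SVD: M = U @ diag(s) @ Vt, singular values sorted in descending order.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 2  # number of components to keep
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The Frobenius error equals the root of the sum of squared dropped singular values.
error = np.linalg.norm(M - M_k)
print(error, np.sqrt(np.sum(s[k:] ** 2)))
```

The more singular values are kept, the smaller the reconstruction error, at the cost of a less compact representation.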

### How does this help us with a BOW matrix?

Instead of having a column for each word, we can have columns that represent relationships between words. By summing those columns in the right way, we get back most of the information contained in the document.

### Latent Semantic Analysis (Latent Semantic Indexing)
http://matpalm.com/lsa_via_svd/intro.html<br>
https://web.archive.org/web/20150823005532/http://www.puffinwarellc.com:80/index.php/news-and-articles/articles/33.html?start=3

Create a new set of latent features to describe the composition of a document. By grouping words into concepts (topics), we can represent each document in a lower dimensional vector space.

First get a tfidf BOW representation:

In [144]:
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

In [156]:
num_of_topics = 100

In [215]:
stops = list(stop_words.ENGLISH_STOP_WORDS) + ['movie', 'film', 'just', 'like', 've', "'ve", "'m", "'s", "'ll", "ll",
                                               'really',
                                              ]

In [248]:
pipeline = Pipeline([('tfidf', TfidfVectorizer(min_df=0.001, stop_words=stops)),
                     ('tsvd', TruncatedSVD(n_components=num_of_topics, n_iter=10)),
                     ('norm', Normalizer())
                    ])

In [249]:
new_data = pipeline.fit_transform(' '.join(sent) for sent in tri_colloc_text)

In [250]:
tfidf = pipeline.named_steps['tfidf']
tsvd = pipeline.named_steps['tsvd']

In [251]:
tsvd.explained_variance_.sum()

0.2501219382518581

In [256]:
# repurposed from https://de.dariah.eu/tatom/topic_model_python.html#using-non-negative-matrix-factorization
def print_topic_words(components, vocab, num_topics=10, num_of_word_per_topic=10):
    """Print the highest-weighted words for the first `num_topics` components."""
    vocab_array = np.array(vocab)

    for i, topic in enumerate(components[:num_topics], start=1):
        # indices of the largest weights, in descending order
        word_idx = np.argsort(topic)[::-1][:num_of_word_per_topic]
        print('Topic_{}'.format(i), vocab_array[word_idx].tolist())

In [257]:
print_topic_words(tsvd.components_, tfidf.get_feature_names())

Topic_1 ['good', 'bad', 'time', 'story', 'great', 'people', 'watch', 'movies', 'way', 'think']
Topic_2 ['good', 'points', 'humour', 'taste', 'performances', 'scenery', 'fairly', 'story_line', 'equally', 'evil']
Topic_3 ['bad', 'movies', 'acting', 'thing', 'script', 'worse', 'ca', 'funny', 'writing', 'plot']
Topic_4 ['great', 'acting', 'story', 'bad', 'cast', 'actors', 'music', 'performance', 'actor', 'idea']
Topic_5 ['time', 'great', 'bad', 'watch', 'good', 'waste', 'saw', 'watched', 'worth', 'favorite']
Topic_6 ['watch', 'people', 'movies', 'think', 'say', 'better', 'seen', 'want', '10', 'did']
Topic_7 ['story', 'watch', 'time', 'bad', 'great', 'acting', 'told', 'interesting', 'good', 'worth']
Topic_8 ['seen', 'movies', 'best', 'films', 'better', 'worst', 'acting', '10', 'times', 'love']
Topic_9 ['think', 'way', 'better', 'did', 'plot', 'make', 'characters', 'did_n', 'funny', 'end']
Topic_10 ['think', 'story', 'seen', 'best', 'watch', 'time', 'great', 'better', 'bad', 'worst']


#### Looks like some more preprocessing is needed. I'll leave that for you.

In [254]:
len(tfidf.get_feature_names())

1187

## Latent Dirichlet Allocation

http://www.cs.cornell.edu/courses/cs6784/2010sp/lecture/30-BleiEtAl03.pdf

![](LDA.png)

In [262]:
from sklearn.decomposition import LatentDirichletAllocation

In [263]:
lda_pipe = Pipeline([('tfidf', TfidfVectorizer(min_df=0.001, stop_words=stops)),
                     ('lda', LatentDirichletAllocation(n_components=num_of_topics)),
                    ])

In [264]:
lda_pipe.fit_transform(' '.join(sent) for sent in tri_colloc_text)
tfidf = lda_pipe.named_steps['tfidf']
lda = lda_pipe.named_steps['lda']



In [265]:
print_topic_words(lda.components_, tfidf.get_feature_names())

Topic_1 ['wonderful', 'sad', 'likely', 'jane', 'flying', 'finding', 'finds', 'fine', 'fit', 'flat']
Topic_2 ['way', 'going', 'performance', 'excellent', 'direction', 'cheap', 'boys', 'german', 'island', 'think']
Topic_3 ['love', 'course', 'loved', 'remember', 'women', 'child', 'important', 'silly', 'parents', 'realistic']
Topic_4 ['oh', 'british', 'setting', 'lady', 'surely', 'thinks', 'plan', 'pain', 'finds', 'fine']
Topic_5 ['evil', 'dark', 'killer', 'police', 'keeps', 'mysterious', 'prison', 'soldiers', 'richard', 'government']
Topic_6 ['hell', 'happens', 'matter', 'english', 'situation', 'meant', 'convincing', 'present', 'knowing', 'focus']
Topic_7 ['night', 'average', 'beauty', 'casting', 'class', 'paul', 'showed', 'super', 'common', 'adds']
Topic_8 ['hope', 'black', 'add', 'lots', 'waste', 'wasted', 'dance', 'story', 'flying', 'focus']
Topic_9 ['dead', 'able', 'directed', 'blood', 'murder', 'question', 'suspense', 'comment', 'mystery', 'note']
Topic_10 ['beautiful', 'unfortunatel

Check out LDA2vec

https://multithreaded.stitchfix.com/blog/2016/05/27/lda2vec/#topic=38&lambda=1&term=

### Also check out Non-negative Matrix Factorization

http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html