<font color = green >

# Text classification: topic modeling 

</font>

<font color = green >

### Latent Dirichlet allocation (LDA)

</font>

LDA is typically used to detect the underlying topics in a collection of text documents.

**Input** : text documents and number of topics 
<br>
**Output**: a topic distribution for each document (so that each document can be assigned the topic with the highest probability) and a word distribution for each topic 

**Assumptions**:
- Documents with similar topics use similar groups of words 
- Each document is a probability distribution over latent topics 
- Each topic is a probability distribution over words


<font color = green >

#### Generative process
<br>
</font>

LDA assumes that every document is created in the following way:

1) Define the number of words in the document
<br>
2) Choose the topic mixture over the fixed set of topics (e.g. 20% of topic A, 30% of topic B, and 50% of topic C)
<br>
3) Generate the words by:
<br>
   - picking a topic based on the document's multinomial distribution over topics 
<br>
   - picking a word based on that topic's multinomial distribution over words 
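
The generative story above can be sketched in a few lines of Python. This is only a toy illustration of the three steps, not how LDA is fitted. The topics, their word probabilities, and the helper `generate_document` are invented for the example, and the topic mixture is set by hand rather than drawn from a Dirichlet prior.

```python
import random

# Hypothetical topics: each one is a probability distribution over words.
topic_word_probs = {
    "food":   {"broccoli": 0.5, "eat": 0.3, "good": 0.2},
    "family": {"mother": 0.5, "brother": 0.4, "good": 0.1},
    "health": {"pressure": 0.4, "health": 0.4, "blood": 0.2},
}

def generate_document(n_words, topic_mixture, rng):
    """Generate a bag of words by following the generative steps."""
    topics = list(topic_mixture)
    weights = [topic_mixture[t] for t in topics]
    words = []
    for _ in range(n_words):
        # pick a topic from the document's distribution over topics
        topic = rng.choices(topics, weights=weights, k=1)[0]
        # pick a word from that topic's distribution over words
        vocab = list(topic_word_probs[topic])
        probs = [topic_word_probs[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=probs, k=1)[0])
    return words

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# step 1: number of words; step 2: topic mixture of this document
doc = generate_document(10, {"food": 0.2, "family": 0.3, "health": 0.5}, rng)
print(doc)
```

Fitting LDA goes in the opposite direction: only the words are observed, and the topic mixtures and word distributions have to be inferred.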

<img src = "img/topics_modeling.png" height=500 width= 800 align="left">



<font color = green >

#### Recall
</font>


#### Binomial distribution

$$p(k \mid n)\quad =\quad C^{ k }_{ n }\cdot p^{ k }(1-p)^{ n-k }\quad =\quad \frac { n! }{ k!(n-k)! } p^{ k }(1-p)^{ n-k }$$

Example: probability of 6 heads in 10 tosses of a fair coin: 
$$p(6 \mid 10)\quad =\quad C^{ 6 }_{ 10 }\cdot {0.5}^{ 6 }(0.5)^{ 4 }\quad = 210 \cdot 0.015625 \cdot 0.0625 = 0.205078125$$
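
As a quick check of the coin example, the binomial formula can be evaluated with the standard library. The function name `binomial_pmf` is chosen here for illustration only.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

p_six_heads = binomial_pmf(6, 10, 0.5)
print(p_six_heads)  # 0.205078125
```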


#### Multinomial distribution

$$p(n_{ 1 }, n_{ 2 }, \dots, n_{ k })\quad =\quad \frac { n! }{ n_{ 1 }!\, n_{ 2 }! \cdots n_{ k }! }\, p^{ n_{ 1 } }_{ 1 }\, p^{ n_{ 2 } }_{ 2 } \cdots p^{ n_{ k } }_{ k }$$

Example (three outcomes): <br>
n = 12 (12 games are played),<br>
n1 = 7 (number won by Player A),<br>
n2 = 2 (number won by Player B),<br>
n3 = 3 (number of draws),<br>
p1 = 0.4 (probability Player A wins),<br>
p2 = 0.35 (probability Player B wins),<br>
p3 = 0.25 (probability of a draw)<br>
$$p(7,2,3)\quad =\quad \frac {12!}{ 7! \cdot 2! \cdot3 ! }  \cdot 0.4^{7} \cdot 0.35^{2} \cdot0.25^{3} = 0.0248$$
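
The three-outcome example can be verified in the same way. This is a small sketch using only the standard library, and `multinomial_pmf` is an illustrative name.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Probability of observing the given counts for each outcome."""
    coefficient = factorial(sum(counts))
    for c in counts:
        coefficient //= factorial(c)  # n! / (n1! n2! ... nk!)
    return coefficient * prod(p**c for p, c in zip(probs, counts))

p_games = multinomial_pmf([7, 2, 3], [0.4, 0.35, 0.25])
print(round(p_games, 4))  # 0.0248
```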




<font color = green >

#### Maximum Likelihood Estimation

</font>

<br>

**Recall** 
<br> The text documents and the number $K$ of topics are known 

**Target**:
<br>Among all possible topic distributions for the documents and all possible word distributions for the topics, choose the ones that maximize the probability of the observed text documents.

**Approach** :
<br>
1) Randomly assign each word of each document to one of the $K$ topics 
<br>
2) Iterate the following process until convergence (the assignments of words to topics stop changing) 
<br>$\quad\quad$For each document $d$: 
<br>
    $\quad\quad\bullet$ Assume that all topic assignments except the current one are correct     
    $\quad\quad\bullet$ For each word $w$ in $d$:           
    $\quad\quad\quad$ - For every topic $t$ compute the score for the hypothesis that $w$ belongs to topic $t$:
   <br>$\quad\quad\quad\quad\quad score (t) =  p(t | d) \cdot p (w |t),$
   <br>$\quad\quad\quad\quad p(t|d)$ is the proportion of words in $d$ currently assigned to $t$,
    <br>$\quad\quad\quad\quad p(w|t)$ is the share of word $w$ among all words assigned to topic $t$.  
    $\quad\quad\quad$ - Assign the word $w$ to the topic with the maximum score
    <br>$\quad\quad\bullet$ Repeat for all $w$ in $d$           
$\quad\quad$Repeat for all $d$

The result is a matrix of word distributions over topics  
Note: the computed topics are just word distributions, so they still need to be summarized (named) somehow. 
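
The iterative procedure above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not what gensim does internally (gensim's `LdaModel` relies on online variational Bayes). The function `fit_topics`, the small smoothing constant `smooth` that keeps scores from becoming zero, and the toy documents are all invented for the example. A Gibbs sampler would sample the new topic in proportion to the score instead of taking the maximum.

```python
import random
from collections import Counter

def fit_topics(docs, num_topics, n_iter=20, smooth=0.01, seed=0):
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # 1) random initial topic for every word occurrence
    assign = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [Counter(a) for a in assign]             # topic counts per document
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics
    for doc, a in zip(docs, assign):
        for w, t in zip(doc, a):
            topic_word[t][w] += 1
            topic_total[t] += 1

    # 2) iterate until no assignment changes (or n_iter passes are done)
    for _ in range(n_iter):
        changed = False
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                old = assign[d][i]
                # treat all other assignments as correct: remove the current one
                doc_topic[d][old] -= 1
                topic_word[old][w] -= 1
                topic_total[old] -= 1

                def score(t):
                    p_t_d = (doc_topic[d][t] + smooth) / (len(doc) - 1 + num_topics * smooth)
                    p_w_t = (topic_word[t][w] + smooth) / (topic_total[t] + vocab_size * smooth)
                    return p_t_d * p_w_t

                new = max(range(num_topics), key=score)  # topic with the max score
                assign[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][w] += 1
                topic_total[new] += 1
                changed = changed or new != old
        if not changed:
            break
    return assign, topic_word

toy_docs = [["broccoli", "eat", "good", "eat"],
            ["mother", "brother", "drive"],
            ["broccoli", "good", "health"]]
assignments, topic_word = fit_topics(toy_docs, num_topics=2)
print(assignments)
```

Dividing each row of `topic_word` by its total gives the word distribution of that topic.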


<font color = green >

## Gensim LDA 

</font>



In [1]:
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
from nltk.corpus import stopwords

<font color = green >

### Define the text documents 

</font>



In [2]:
doc_a = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc_e = "Health professionals say that brocolli is good for your health."
doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]


<font color = green >

### Tokenize, clean, and stem

</font>



In [3]:
en_stop  = set(stopwords.words('english'))
p_stemmer = PorterStemmer()

def tokenize(doc_set):
    texts = []
    for doc in doc_set:
        # tokenize document string
        raw = doc.lower()
        tokens = word_tokenize(raw)

        # remove stop words from tokens
        tokens = [token for token in tokens if token not in en_stop]

        # stem tokens
        tokens = [p_stemmer.stem(token) for token in tokens]

        # add tokens to list
        texts.append(tokens)
    return texts

texts = tokenize(doc_set)
texts[0]

['brocolli',
 'good',
 'eat',
 '.',
 'brother',
 'like',
 'eat',
 'good',
 'brocolli',
 ',',
 'mother',
 '.']

<font color = green >

### Convert tokenized documents into an "id <-> term" dictionary

</font>



In [4]:
dictionary = corpora.Dictionary(texts) # maps each token to an integer id; an alternative to using CountVectorizer
print (type(dictionary), dictionary)
for k,w in dictionary.items():
    print (k,w)

<class 'gensim.corpora.dictionary.Dictionary'> Dictionary(34 unique tokens: [',', '.', 'brocolli', 'brother', 'eat']...)
0 ,
1 .
2 brocolli
3 brother
4 eat
5 good
6 like
7 mother
8 around
9 basebal
10 drive
11 lot
12 practic
13 spend
14 time
15 blood
16 caus
17 expert
18 health
19 increas
20 may
21 pressur
22 suggest
23 tension
24 better
25 feel
26 never
27 often
28 perform
29 school
30 seem
31 well
32 profession
33 say


<font color = green >

### Create gensim corpus

</font>



In [5]:
print ('\nconvert tokenized documents into a document-term matrix')
corpus = [dictionary.doc2bow(text) for text in texts] # id and count
for item in corpus:
    print (item)


convert tokenized documents into a document-term matrix
[(0, 1), (1, 2), (2, 2), (3, 1), (4, 2), (5, 2), (6, 1), (7, 1)]
[(1, 1), (3, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1)]
[(1, 1), (10, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1)]
[(0, 1), (1, 1), (3, 1), (7, 1), (10, 1), (21, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1)]
[(1, 1), (2, 1), (5, 1), (18, 2), (32, 1), (33, 1)]
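
To make the bag-of-words format concrete, here is a pure-Python sketch of the idea behind `doc2bow`: each document becomes a sorted list of `(token_id, count)` pairs. The helpers `build_token_ids` and `to_bow` are illustrative stand-ins, not gensim functions, and they number tokens in first-seen order, so the ids will not match the ones gensim assigned above.

```python
from collections import Counter

def build_token_ids(texts):
    """Assign an integer id to every distinct token, in first-seen order."""
    token_ids = {}
    for tokens in texts:
        for token in tokens:
            token_ids.setdefault(token, len(token_ids))
    return token_ids

def to_bow(tokens, token_ids):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(token_ids[t] for t in tokens if t in token_ids)
    return sorted(counts.items())

sample_texts = [["brocolli", "good", "eat", "eat"], ["good", "health", "health"]]
sample_ids = build_token_ids(sample_texts)
sample_bow = [to_bow(tokens, sample_ids) for tokens in sample_texts]
print(sample_bow)  # [[(0, 1), (1, 1), (2, 2)], [(1, 1), (3, 2)]]
```

Tokens that are missing from the dictionary are skipped, which is also why unseen words in a new document do not contribute to its topic distribution.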


<font color = green >

### Generate LDA model

</font>



In [6]:
ldamodel = gensim.models.ldamodel.LdaModel(
    corpus, num_topics=2, id2word=dictionary, passes=20, random_state= 0)


<font color = green >

### Review topics 

</font>



In [7]:
ldamodel.print_topics(num_topics=2,num_words=10)

[(0,
  '0.098*"." + 0.076*"brocolli" + 0.076*"good" + 0.055*"mother" + 0.055*"brother" + 0.054*"health" + 0.054*"eat" + 0.033*"," + 0.033*"like" + 0.033*"spend"'),
 (1,
  '0.060*"drive" + 0.059*"pressur" + 0.059*"." + 0.036*"," + 0.036*"never" + 0.036*"often" + 0.036*"increas" + 0.036*"perform" + 0.036*"seem" + 0.036*"well"')]

<font color = green >

### Classify the new text 

</font>



In [8]:
test_doc_list = ["Some experts suggest that car may cause increased blood pressure. professionals say that brocolli is good "]
test_texts = tokenize(test_doc_list)
test_corpus = [dictionary.doc2bow(text) for text in test_texts ]
test_doc_topics = ldamodel.get_document_topics(test_corpus)
print ('\nget topics:')
for el in test_doc_topics: # loop over all texts in the provided list
    print(el)


get topics:
[(0, 0.37417653), (1, 0.62582344)]
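
Turning such a distribution into a single label means taking the pair with the highest probability. A minimal sketch, with the printed values above hard-coded and `dominant_topic` as an illustrative helper name:

```python
def dominant_topic(topic_probs):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_probs, key=lambda pair: pair[1])

topic_id, prob = dominant_topic([(0, 0.37417653), (1, 0.62582344)])
print(topic_id, round(prob, 3))  # 1 0.626
```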


<font color = green >

### Sample of topic modeling on large dataset

</font>



In [9]:
from sklearn.feature_extraction.text import CountVectorizer
import pickle

<font color = green >

#### Load "News" data 

</font>



In [10]:
import os
cwd= os.getcwd()
path = os.path.join(cwd,)
fn=  os.path.join(path , 'newsgroups')

with open(fn, 'rb') as f:
    newsgroup_data = pickle.load(f)

<font color = green >

#### Review data

</font>



In [11]:
print (type(newsgroup_data))
print ('len of documents = {:,}\n'.format(len(newsgroup_data)))

newsgroup_data[0]

<class 'list'>
len of documents = 2,000



"The best group to keep you informed is the Crohn's and Colitis Foundation\nof America.  I do not know if the UK has a similar organization.  The\naddress of\nthe CCFA is \n\nCCFA\n444 Park Avenue South\n11th Floor\nNew York, NY  10016-7374\nUSA\n\nThey have a lot of information available and have a number of newsletters.\n \nGood Luck."

<font color = green >

#### Define custom vectorizer

</font>



In [12]:
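# keep only tokens made of three or more word characters
three_words_pattern = r"\b\w{3,}\b"
vectorizer = CountVectorizer(
    min_df=20,  # ignore terms that appear in fewer than 20 documents
    stop_words='english',
    token_pattern=three_words_pattern) 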
vectorizer.fit(newsgroup_data)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=20,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='\\b\\w{3,}\\b', tokenizer=None,
        vocabulary=None)
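
The effect of the token pattern can be checked on its own with the `re` module. The sample sentence is made up, and the sketch covers only the tokenization step: stop-word removal and the `min_df` filter are applied separately by the vectorizer.

```python
import re

three_chars_pattern = r"\b\w{3,}\b"
sample = "My 486 PC has a SCSI drive, is it OK?"
# the vectorizer lowercases text by default (lowercase=True), so do the same here
tokens = re.findall(three_chars_pattern, sample.lower())
print(tokens)  # ['486', 'has', 'scsi', 'drive']
```

Digits count as word characters, which is why numeric tokens such as '000' and '1990' show up among the features.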

<font color = green >

#### Review features 

</font>



In [13]:
# note: newer scikit-learn releases replace get_feature_names() with get_feature_names_out()
print ('len of features = {:,}\n'.format(len(vectorizer.get_feature_names())))
print (vectorizer.get_feature_names()[:40])


len of features = 902

['000', '100', '1990', '1992', '1993', '200', '2nd', '300', '400', '486', '500', '800', 'ability', 'able', 'accept', 'accepted', 'access', 'according', 'actual', 'actually', 'add', 'addition', 'additional', 'address', 'advance', 'advice', 'age', 'ago', 'agree', 'ahead', 'air', 'allow', 'alt', 'america', 'american', 'answer', 'answers', 'anybody', 'apparently', 'appears']


<font color = green >

#### Vectorize data set

</font>



In [14]:
newsgroup_data_vectorized= vectorizer.transform(newsgroup_data)
print (newsgroup_data_vectorized)

  (0, 23)	1
  (0, 33)	1
  (0, 58)	1
  (0, 76)	1
  (0, 326)	1
  (0, 335)	1
  (0, 386)	1
  (0, 409)	1
  (0, 451)	1
  (0, 456)	1
  (0, 515)	1
  (0, 529)	1
  (0, 545)	1
  (0, 727)	1
  (0, 843)	1
  (0, 900)	1
  (1, 33)	1
  (1, 34)	1
  (1, 84)	1
  (1, 184)	1
  (1, 201)	1
  (1, 214)	1
  (1, 231)	2
  (1, 241)	1
  (1, 324)	1
  :	:
  (1998, 622)	1
  (1998, 625)	3
  (1998, 688)	1
  (1998, 698)	2
  (1998, 726)	1
  (1998, 804)	1
  (1998, 805)	1
  (1998, 810)	10
  (1998, 813)	2
  (1998, 814)	1
  (1998, 816)	1
  (1998, 818)	1
  (1998, 844)	1
  (1998, 882)	2
  (1998, 899)	1
  (1999, 171)	1
  (1999, 194)	1
  (1999, 205)	1
  (1999, 213)	1
  (1999, 276)	2
  (1999, 308)	1
  (1999, 344)	1
  (1999, 669)	1
  (1999, 832)	1
  (1999, 874)	1


<font color = green >

#### Create gensim corpus

</font>



In [15]:
corpus = gensim.matutils.Sparse2Corpus(newsgroup_data_vectorized, documents_columns=False)
# comparing to using corpora.Dictionary:
# corpus = [dictionary.doc2bow(text) for text in texts] 
[item for item in corpus][:5]


[[(23, 1),
  (33, 1),
  (58, 1),
  (76, 1),
  (326, 1),
  (335, 1),
  (386, 1),
  (409, 1),
  (451, 1),
  (456, 1),
  (515, 1),
  (529, 1),
  (545, 1),
  (727, 1),
  (843, 1),
  (900, 1)],
 [(33, 1),
  (34, 1),
  (84, 1),
  (184, 1),
  (201, 1),
  (214, 1),
  (231, 2),
  (241, 1),
  (324, 1),
  (332, 1),
  (359, 1),
  (363, 1),
  (365, 1),
  (409, 1),
  (430, 3),
  (451, 1),
  (475, 1),
  (492, 2),
  (525, 2),
  (605, 1),
  (633, 2),
  (642, 1),
  (674, 1),
  (688, 1),
  (709, 1),
  (750, 1),
  (777, 1),
  (823, 1),
  (838, 1),
  (874, 1),
  (896, 1)],
 [(25, 1),
  (26, 1),
  (63, 1),
  (120, 1),
  (231, 1),
  (297, 1),
  (326, 1),
  (344, 1),
  (373, 1),
  (423, 1),
  (442, 1),
  (444, 1),
  (448, 2),
  (465, 1),
  (572, 1),
  (653, 1),
  (659, 1),
  (714, 1),
  (777, 1),
  (779, 1),
  (781, 1),
  (818, 1),
  (836, 1),
  (855, 1),
  (890, 1),
  (898, 1)],
 [(4, 1),
  (17, 2),
  (18, 1),
  (22, 1),
  (42, 1),
  (48, 2),
  (68, 1),
  (78, 1),
  (86, 1),
  (94, 1),
  (117, 1),
  (119, 1)

<font color = green >

#### Create id2word dictionary

</font>



In [16]:
id_map = {idx: term for term, idx in vectorizer.vocabulary_.items()}  # invert term -> id into id -> term

<font color = green >

#### Generate LDA model

</font>



In [17]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=6, id2word=id_map, passes=25, random_state=34)
# Comparing to corpora.Dictionary
# ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20, random_state= 0)


<font color = green >

#### Review topics

</font>



In [18]:
ldamodel.print_topics(num_topics=6,num_words=10)

[(0,
  '0.025*"edu" + 0.019*"com" + 0.018*"use" + 0.018*"thanks" + 0.016*"does" + 0.015*"know" + 0.011*"mail" + 0.010*"apple" + 0.009*"help" + 0.008*"want"'),
 (1,
  '0.061*"drive" + 0.039*"disk" + 0.030*"scsi" + 0.027*"drives" + 0.027*"hard" + 0.025*"controller" + 0.021*"card" + 0.018*"rom" + 0.016*"cable" + 0.016*"floppy"'),
 (2,
  '0.024*"people" + 0.022*"god" + 0.012*"atheism" + 0.012*"think" + 0.012*"believe" + 0.012*"don" + 0.010*"does" + 0.010*"just" + 0.009*"argument" + 0.009*"say"'),
 (3,
  '0.023*"game" + 0.021*"year" + 0.020*"team" + 0.013*"games" + 0.013*"play" + 0.011*"good" + 0.011*"don" + 0.010*"think" + 0.010*"season" + 0.010*"players"'),
 (4,
  '0.035*"space" + 0.019*"nasa" + 0.018*"data" + 0.013*"information" + 0.013*"available" + 0.013*"center" + 0.011*"ground" + 0.010*"research" + 0.010*"000" + 0.010*"new"'),
 (5,
  '0.017*"just" + 0.017*"like" + 0.016*"don" + 0.012*"car" + 0.012*"time" + 0.011*"think" + 0.011*"good" + 0.010*"know" + 0.008*"way" + 0.008*"people"')]

<font color = green >

#### Name topics

</font>



In [19]:
topics_names= ['Education', 'Computers & IT', 'Religion', 'Sports', 'Science','Society & Lifestyle']

<font color = green >

#### Classify the new text 

</font>



In [20]:
import numpy as np

In [142]:
new_doc = ["\n\nIt's my understanding that the freezing will start to occur because \
of the\ngrowing distance of Pluto and Charon from the Sun, due to it's\nelliptical orbit. \
It is not due to shadowing effects. \n\n\nPluto can shadow Charon, and vice-versa.\n\nGeorge \
Krumins\n-- "] 


In [143]:
doc_vectorized= vectorizer.transform(new_doc) # input param is list
new_doc_corpus = gensim.matutils.Sparse2Corpus(doc_vectorized, documents_columns=False)
doc_topics = ldamodel.get_document_topics(new_doc_corpus)
list(doc_topics)

[[(0, 0.11149516), (1, 0.8885048)]]

In [144]:
doc_vectorized

<1x902 sparse matrix of type '<class 'numpy.int64'>'
	with 4 stored elements in Compressed Sparse Row format>

In [23]:
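def elicit_topic_name(doc_topics):
    # each document yields (topic_id, probability) pairs; low-probability topics
    # may be omitted, so look the name up by topic id, not by list position
    topic_id, _ = max(list(doc_topics)[0], key=lambda pair: pair[1])
    return topics_names[topic_id]
elicit_topic_name(doc_topics)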

'Science'

<font color = green >

## Home Task 

</font>


<font color = green >

### Topic Modeling 

</font>

[voted-kaggle-dataset](https://www.kaggle.com/canggih/voted-kaggle-dataset/version/2#voted-kaggle-dataset.csv)

In [119]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
from nltk.corpus import stopwords
import pandas as pd
import numpy as np
import gensim
import re

fn=  os.path.join(path , 'voted-kaggle-dataset.csv')
df = pd.read_csv(fn)

In [120]:
df.head(50)

Unnamed: 0,Title,Subtitle,Owner,Votes,Versions,Tags,Data Type,Size,License,Views,Download,Kernels,Topics,URL,Description
0,Credit Card Fraud Detection,Anonymized credit card transactions labeled as...,Machine Learning Group - ULB,1241,"Version 2,2016-11-05|Version 1,2016-11-03",crime\nfinance,CSV,144 MB,ODbL,"442,136 views","53,128 downloads","1,782 kernels",26 topics,https://www.kaggle.com/mlg-ulb/creditcardfraud,The datasets contains transactions made by cre...
1,European Soccer Database,"25k+ matches, players & teams attributes for E...",Hugo Mathien,1046,"Version 10,2016-10-24|Version 9,2016-10-24|Ver...",association football\neurope,SQLite,299 MB,ODbL,"396,214 views","46,367 downloads","1,459 kernels",75 topics,https://www.kaggle.com/hugomathien/soccer,The ultimate Soccer database for data analysis...
2,TMDB 5000 Movie Dataset,"Metadata on ~5,000 movies from TMDb",The Movie Database (TMDb),1024,"Version 2,2017-09-28",film,CSV,44 MB,Other,"446,255 views","62,002 downloads","1,394 kernels",46 topics,https://www.kaggle.com/tmdb/tmdb-movie-metadata,Background\nWhat can we say about the success ...
3,Global Terrorism Database,"More than 170,000 terrorist attacks worldwide,...",START Consortium,789,"Version 2,2017-07-19|Version 1,2016-12-08",crime\nterrorism\ninternational relations,CSV,144 MB,Other,"187,877 views","26,309 downloads",608 kernels,11 topics,https://www.kaggle.com/START-UMD/gtd,"Context\nInformation on more than 170,000 Terr..."
4,Bitcoin Historical Data,Bitcoin data at 1-min intervals from select ex...,Zielak,618,"Version 11,2018-01-11|Version 10,2017-11-17|Ve...",history\nfinance,CSV,119 MB,CC4,"146,734 views","16,868 downloads",68 kernels,13 topics,https://www.kaggle.com/mczielinski/bitcoin-his...,Context\nBitcoin is the longest running and mo...
5,"Kaggle ML and Data Science Survey, 2017",A big picture view of the state of data scienc...,Kaggle,574,"Version 4,2017-10-28|Version 3,2017-10-03|Vers...",employment\nsociology\nartificial intelligence,CSV,28 MB,ODbL,"95,587 views","9,390 downloads",244 kernels,10 topics,https://www.kaggle.com/kaggle/kaggle-survey-2017,"Context\nFor the first time, Kaggle conducted ..."
6,Iris Species,Classify iris plants into three species in thi...,UCI Machine Learning,512,"Version 2,2016-09-27|Version 1,2016-01-12",botany,SQLite,15 KB,CC0,"162,706 views","24,361 downloads","3,394 kernels",14 topics,https://www.kaggle.com/uciml/iris,The Iris dataset was used in R.A. Fisher's cla...
7,World Development Indicators,Explore country development indicators from ar...,World Bank,468,"Version 2,2017-05-02|Version 1,2016-01-28",economics\ninternational relations,CSV,2 GB,Other,"134,038 views","20,364 downloads",389 kernels,5 topics,https://www.kaggle.com/worldbank/world-develop...,The World Development Indicators from the Worl...
8,Daily News for Stock Market Prediction,Using 8 years daily news headlines to predict ...,Aaron7sun,438,"Version 1,2016-08-25",news agencies\nfinance,CSV,14 MB,CC4,"95,770 views","10,820 downloads",293 kernels,8 topics,https://www.kaggle.com/aaron7sun/stocknews,"Actually, I prepare this dataset for students ..."
9,Pokemon with stats,721 Pokemon with stats and types,Alberto Barradas,428,"Version 2,2016-08-29|Version 1,2016-08-23",popular culture\ngames and toys\nvideo games,CSV,43 KB,CC0,"133,256 views","16,610 downloads",706 kernels,13 topics,https://www.kaggle.com/abcsds/pokemon,"This data set includes 721 Pokemon, including ..."


In [121]:
print ('len of texts= {:,}'.format(len(df)))
index = 10 
df.loc[index, 'Description']

len of texts= 2,150


'These files contain complete loan data for all loans issued through the 2007-2015, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. The file containing loan data through the "present" contains complete loan data for all loans issued through the previous completed calendar quarter. Additional features include credit scores, number of finance inquiries, address including zip codes, and state, and collections among others. The file is a matrix of about 890 thousand observations and 75 variables. A data dictionary is provided in a separate file. k'

In [122]:
df_data_set = df["Description"]
# df["Description"].apply(pd.value_counts) //Note: not working for non numbers type 

In [123]:
en_stop  = set(stopwords.words('english'))
most_frequent = {'data', 'dataset'}  # very frequent corpus-specific words to drop
p_stemmer = PorterStemmer()

def tokenize(df_data_set):
    texts = []
    for doc in df_data_set:
        # tokenize document string
        doc = str(doc)
        raw = doc.lower()
        tokens = word_tokenize(raw)

        # remove stop words and the corpus-specific frequent words;
        # this runs before stemming, so a plural such as 'datasets' passes and is later stemmed to 'dataset'
        tokens = [token for token in tokens if token not in en_stop]
        tokens = [token for token in tokens if token not in most_frequent]

        # remove tokens of three characters or fewer

        tokens =  [token for token in tokens if len(token) >3]
                
        # stem tokens
        tokens = [p_stemmer.stem(token) for token in tokens]

        # add tokens to list
        texts.append(tokens)
      
    return texts

texts = tokenize(df_data_set)
texts[0]

['dataset',
 'contain',
 'transact',
 'made',
 'credit',
 'card',
 'septemb',
 '2013',
 'european',
 'cardhold',
 'present',
 'transact',
 'occur',
 'day',
 'fraud',
 '284,807',
 'transact',
 'highli',
 'unbalanc',
 'posit',
 'class',
 'fraud',
 'account',
 '0.172',
 'transact',
 'contain',
 'numer',
 'input',
 'variabl',
 'result',
 'transform',
 'unfortun',
 'confidenti',
 'issu',
 'provid',
 'origin',
 'featur',
 'background',
 'inform',
 'featur',
 'princip',
 'compon',
 'obtain',
 'featur',
 'transform',
 "'time",
 "'amount",
 'featur',
 "'time",
 'contain',
 'second',
 'elaps',
 'transact',
 'first',
 'transact',
 'featur',
 "'amount",
 'transact',
 'amount',
 'featur',
 'use',
 'example-depend',
 'cost-senst',
 'learn',
 'featur',
 "'class",
 'respons',
 'variabl',
 'take',
 'valu',
 'case',
 'fraud',
 'otherwis',
 'given',
 'class',
 'imbal',
 'ratio',
 'recommend',
 'measur',
 'accuraci',
 'use',
 'area',
 'precision-recal',
 'curv',
 'auprc',
 'confus',
 'matrix',
 'accuraci'

In [124]:
# moby_frequencies = FreqDist(nltk.word_tokenize(texts))


In [125]:
dictionary = corpora.Dictionary(texts) # maps each token to an integer id; an alternative to using CountVectorizer
print (type(dictionary), dictionary)
for k,w in dictionary.items():
    print (k,w)

<class 'gensim.corpora.dictionary.Dictionary'> Dictionary(32458 unique tokens: ["'amount", "'class", "'time", '//mlg.ulb.ac.b', '//mlg.ulb.ac.be/artml']...)
0 'amount
1 'class
2 'time
3 //mlg.ulb.ac.b
4 //mlg.ulb.ac.be/artml
5 //mlg.ulb.ac.be/bruf
6 0.172
7 2013
8 2015
9 284,807
10 account
11 accuraci
12 amount
13 analys
14 andrea
15 area
16 auprc
17 avail
18 background
19 bontempi
20 bruxel
21 caelen
22 calibr
23 card
24 cardhold
25 case
26 cidm
27 cite
28 class
29 classif
30 collabor
31 collect
32 compon
33 comput
34 confidenti
35 confus
36 contain
37 cost-senst
38 credit
39 current
40 curv
41 dataset
42 day
43 detail
44 detect
45 elaps
46 european
47 example-depend
48 featur
49 first
50 fraud
51 gianluca
52 given
53 group
54 highli
55 http
56 ieee
57 imbal
58 inform
59 input
60 intellig
61 issu
62 johnson
63 learn
64 libr
65 machin
66 made
67 matrix
68 meaning
69 measur
70 mine
71 numer
72 obtain
73 occur
74 olivi
75 origin
76 otherwis
77 past
78 pleas
79 posit
80 pozzolo
81 precisi

786 conduct
787 convers
788 conversionrates.csv
789 correspond
790 cut
791 day-to-day
792 decod
793 describ
794 edg
795 email
796 employ
797 establish
798 exclud
799 flag
800 folk
801 form
802 free
803 freeform
804 freeformresponses.csv
805 happen
806 ident
807 industry-wid
808 initi
809 interact
810 kind
811 less
812 live
813 median
814 month
815 multipl
816 multiplechoiceresponses.csv
817 non-random
818 novemb
819 open-end
820 option
821 packag
822 primarili
823 proceed
824 protect
825 pud
826 quantmod
827 rank
828 read
829 receiv
830 regard
831 relev
832 respond
833 respondenttypereadme.txt
834 salari
835 schema.csv
836 scientist
837 segment
838 shown
839 singl
840 spam
841 survey
842 survey-tak
843 territori
844 total
845 usabl
846 week
847 window
848 write
849 1936
850 classic
851 fisher
852 flower
853 iri
854 linearli
855 paper
856 petallengthcm
857 petalwidthcm
858 problem
859 r.a.
860 repositori
861 sampl
862 sepallengthcm
863 sepalwidthcm
864 speci
865 taxonom
866 along
867 an

1477 favor
1478 fourth
1479 freedom
1480 gain
1481 gallup
1482 generos
1483 higher
1484 highest
1485 imaginari
1486 incom
1487 increasingli
1488 ladder
1489 landmark
1490 larg
1491 least
1492 least-happi
1493 level
1494 main
1495 march
1496 occasion
1497 over-
1498 policy-mak
1499 poll
1500 poorli
1501 psycholog
1502 quit
1503 recognit
1504 residu
1505 seen
1506 smaller
1507 sub-bar
1508 succeed
1509 support
1510 therefor
1511 third
1512 today
1513 un-reli
1514 under-explain
1515 unexplain
1516 unpleas
1517 utopia
1518 well-b
1519 whole
1520 width
1521 //github.com/zalandoresearch/fashion-mnist
1522 //pjreddie.com/projects/mnist-in-csv/
1523 //tech.zalando.com
1524 10,000
1525 255.
1526 28x28
1527 60,000
1528 ai/ml/data
1529 and/or
1530 ankl
1531 aris
1532 ascii-diagram
1533 assign
1534 boot
1535 charg
1536 claim
1537 cloth
1538 coat
1539 commun
1540 connect
1541 convert
1542 copi
1543 copyright
1544 dark
1545 darker
1546 decompos
1547 direct
1548 dress
1549 drop-in
1550 fail
1551 fash

2785 satisfactori
2786 solid
2787 storag
2788 straight
2789 summar
2790 superior
2791 terroir
2792 transcend
2793 unpalat
2794 vision
2795 14.3
2796 2014.
2797 affiliated_base_num
2798 american_b01362.csv
2799 b02512
2800 b02598
2801 b02617
2802 b02682
2803 b02764
2804 b02765
2805 b02835
2806 b02836
2807 borough
2808 carmel_b00256.csv
2809 coarse-grain
2810 commiss
2811 danach-ni
2812 date/tim
2813 dial7_b00887.csv
2814 diplo_b01196.csv
2815 dispatch
2816 dispatching_base_num
2817 dreist
2818 drinnen
2819 driver
2820 essenti
2821 federal_02216.csv
2822 fine-grain
2823 firstclass_b01536.csv
2824 fivethirtyeight
2825 foil
2826 for-hir
2827 grun
2828 highclass_b01717.csv
2829 hinter
2830 januari
2831 latitud
2832 limousin
2833 locationid
2834 longitud
2835 lyft_b02510.csv
2836 non-ub
2837 other-fhv-data-jan-aug-2015.csv
2838 outer
2839 pickup
2840 pickup_d
2841 prestige_b01338.csv
2842 ride
2843 rush-hour
2844 schmecken
2845 sept.
2846 skyline_b00111.csv
2847 taxi
2848 taxi-zone-lookup.cs

3535 clubpictures.zip
3536 compress
3537 doubl
3538 emblem
3539 fanat
3540 fastest/slowest
3541 fifaindex.com
3542 fulldata.csv
3543 globe
3544 interchang
3545 liter
3546 logo
3547 organis
3548 outfield
3549 pick
3550 pictures.zip
3551 pictures_f.zip
3552 playernames.csv
3553 playground..
3554 pressur
3555 proxi
3556 realis
3557 reli
3558 scout
3559 stamina
3560 tough
3561 uniform
3562 url
3563 young
3564 youngster
3565 .txt
3566 //archive.org/download/nf_prize_dataset.tar
3567 //netflixprize.com/index.html
3568 //www.netflixprize.com/faq
3569 1,000,000
3570 1,2,3,4
3571 1234,2005-05-26
3572 17770
3573 1890
3574 19th
3575 23rd
3576 2649429
3577 3245
3578 3245,2005-12-19
3579 3456,2005-11-07
3580 480189
3581 5666
3582 5666,2005-12-23
3583 6789,2005-03-14
3584 bellkor
3585 chao
3586 cinematch
3587 colon
3588 combined_data_
3589 customerid
3590 customerid11
3591 customerid12
3592 customerid21
3593 customerid22
3594 date11
3595 date12
3596 date21
3597 date22
3598 decemeb
3599 empti
3600 ga

4285 burger
4286 calor
4287 calori
4288 chicken
4289 crispi
4290 dessert
4291 egg
4292 filet-o-fish
4293 fish
4294 franchise
4295 fri
4296 grill
4297 intak
4298 kroc
4299 mcdonald
4300 mcmuffin
4301 meal
4302 menu
4303 milkshak
4304 path
4305 persuad
4306 restaur
4307 salad
4308 sandwich
4309 soda
4310 supplier
4311 opendataphilli
4312 philadelphia
4313 philli
4314 19-48
4315 2.56
4316 216-223.
4317 24-26
4318 3-axial
4319 50hz
4320 561-featur
4321 acceler
4322 acceleromet
4323 alessandro
4324 ambient
4325 andreu
4326 anguita
4327 angular
4328 arithmet
4329 artifici
4330 assit
4331 belgium
4332 bracket
4333 bruge
4334 butterworth
4335 cabestani
4336 català
4337 constant
4338 david
4339 decemb
4340 effici
4341 embed
4342 esann
4343 fixed-point
4344 galaxi
4345 ghio
4346 gravit
4347 graviti
4348 gyroscop
4349 hardware-friendli
4350 inerti
4351 iwaal
4352 joan
4353 jorg
4354 lay
4355 lectur
4356 luca
4357 lui
4358 motion
4359 multiclass
4360 oneto
4361 overlap
4362 parra
4363 parra-llana


5035 ne=nebraska
5036 ng=nation
5037 ng=non-govern
5038 nh=new
5039 nj=new
5040 nm=new
5041 nnsa=nat
5042 nonfeder
5043 nps=nation
5044 nr=northern
5045 nuclear
5046 nv=nevada
5047 nw=northwest
5048 nwcg
5049 nwcg_reporting_ag
5050 nwcg_reporting_unit_id
5051 nwcg_reporting_unit_nam
5052 nwcg_unitidactive_20170109
5053 nws=nation
5054 ny=new
5055 occurr
5056 oes=offic
5057 oh=ohio
5058 ok=oklahoma
5059 or=oregon
5060 owner
5061 owner_cod
5062 owner_descr
5063 pa=pennsylvania
5064 plss
5065 pr=puerto
5066 pri=priv
5067 ranger
5068 reclam
5069 redund
5070 ri=rhod
5071 rico
5072 rm=rocki
5073 rocki
5074 sa=southern
5075 sc=south
5076 sd=south
5077 sf=state
5078 sfs=state
5079 source_reporting_unit
5080 source_reporting_unit_nam
5081 source_system
5082 source_system_typ
5083 sp=state
5084 st/c
5085 st/l=state
5086 stat_cause_cod
5087 stat_cause_descr
5088 state/territori
5089 subgroup
5090 sw=southwest
5091 three-digit
5092 tn=tennesse
5093 tnc=the
5094 tribal
5095 tribe
5096 tribe=trib
50

5784 3*2039
5785 4292,000
5786 70,000
5787 abbr
5788 cdk2
5789 cdk2.h5
5790 chembl
5791 chembl_id
5792 chemotherapi
5793 cyclin-depend
5794 egfr_erbb1
5795 endotheli
5796 epiderm
5797 erbb1
5798 glycogen
5799 gsk3b
5800 h5pi
5801 h5py.fil
5802 hdf5
5803 hepatocyt
5804 hgfr
5805 ic50
5806 indptr
5807 inhibitor
5808 kinas
5809 kinase-3
5810 map_k_p38a
5811 non-inhibitor
5812 non-trivi
5813 pubchem
5814 receptor
5815 scipi
5816 shape=
5817 spars
5818 sparse.csr_matrix
5819 sparse.hstack
5820 synthas
5821 tpk_lck
5822 tpk_src
5823 tyrosine-protein
5824 vascular
5825 vegfr2
5826 comfort
5827 cuisin
5828 mercyhurst
5829 unclean
5830 18000
5831 aman
5832 ban
5833 birth_dat
5834 body_typ
5835 dummi
5836 eur_release_claus
5837 height_cm
5838 international_reput
5839 pre-clean
5840 preferred_foot
5841 real_fac
5842 shrivastava
5843 skill_mov
5844 weak_foot
5845 weight_kg
5846 work_rate_att
5847 work_rate_def
5848 non-relev
5849 ohio
5850 ten
5851 abstract
5852 authorid
5853 authors.csv
5854 even

6534 widespread
6535 //ergast.com/mrd/
6536 1950
6537 1950.
6538 auto
6539 constructor
6540 fédération
6541 inaugur
6542 international
6543 l'automobil
6544 newel
6545 sanction
6546 single-seat
6547 'brand_name_rx_count
6548 'cms_prescription_count
6549 'gender
6550 'generic_rx_count
6551 'midwest
6552 'non-urban
6553 'northeast
6554 'npi
6555 'provider_vari
6556 'region
6557 'settlement_typ
6558 'south
6559 'specialti
6560 'urban'
6561 'west
6562 'years_pract
6563 239,930
6564 brand/gener
6565 brand_name_rx_count
6566 drug_nam
6567 generic_rx_count
6568 heurist
6569 jsonl
6570 prescription-bas
6571 roam
6572 0n/0e
6573 =-180
6574 =180
6575 =2016
6576 =860
6577 antarctica
6578 bulletin
6579 cartogram
6580 chemic
6581 coast
6582 comma-separ
6583 composit
6584 degrad
6585 dplyr
6586 fallen
6587 fell
6588 gram
6589 high-resolut
6590 incorrect
6591 meteorit
6592 meteorites.al
6593 meteorites.geo
6594 nametyp
6595 overlaid
6596 parentheses-enclos
6597 primer
6598 recclass
6599 reclassif
660

7284 manuel
7285 nike
7286 offshor
7287 panama
7288 paradis
7289 politician
7290 queen
7291 ross
7292 santo
7293 schwarzman
7294 shaukat
7295 sheldon
7296 siemen
7297 singer
7298 steve
7299 tillerson
7300 trillion
7301 untrain
7302 walmart
7303 wilbur
7304 wynn
7305 encompass
7306 inequ
7307 overst
7308 pdf-friendli
7309 photograph
7310 re-format
7311 sizabl
7312 stronger
7313 1400
7314 gaug
7315 osmi
7316 tech/it
7317 1960
7318 2013-2014
7319 5,000
7320 5-part
7321 accultur
7322 albumin
7323 antibodi
7324 antigen
7325 apolipoprotein
7326 aromat
7327 arsen
7328 aspirin
7329 assay
7330 becam
7331 biochemistri
7332 cadmium
7333 cardiovascular
7334 cholesterol
7335 cigarett
7336 coba
7337 copper
7338 creatinin
7339 dentit
7340 depress
7341 dermatolog
7342 diet
7343 dietari
7344 differenti
7345 endomys
7346 ethyl
7347 exposur
7348 fluorid
7349 formerli
7350 genotyp
7351 glucos
7352 glycohemoglobin
7353 grip
7354 hcv-rna
7355 health-rel
7356 hepat
7357 herp
7358 hydrocarbon
7359 iga-ttg
736

8034 decision-mak
8035 demo
8036 duel
8037 explod
8038 first-person
8039 grenad
8040 guy
8041 highlight
8042 lulz
8043 map_data.csv
8044 match-level
8045 miscellan
8046 mm_grenades_demos.csv
8047 mm_master_demos.csv
8048 notabl
8049 parser
8050 pit
8051 radar
8052 self-driv
8053 static
8054 teamwork
8055 theft
8056 valv
8057 acs2015_census_tract_data.csv
8058 acs2015_county_data.csv
8059 dp03
8060 dp05
8061 factfind
8062 subdivis
8063 1962
8064 aviat
8065 factual
8066 narr
8067 ntsb
8068 anderson
8069 blackard
8070 cartograph
8071 charl
8072 colorado
8073 dean
8074 etcetera
8075 jock
8076 nearbi
8077 roosevelt
8078 shadow
8079 soil
8080 topographi
8081 exot
8082 market.cap
8083 scam
8084 //github.com/tonmcg/county_level_election_results_12-16
8085 //www.kaggle.com/benhamner/2016-us-elect
8086 romney
8087 //arxiv.org/abs/1710.00888
8088 //github.com/orioli/emoji-writer-senti
8089 asonam2017
8090 awok
8091 easd
8092 emoji
8093 happier
8094 lewi
8095 muuch
8096 pydatabcn2017
8097 pydatadu

8783 emot
8784 era-net
8785 jin
8786 miniatur
8787 ncbir
8788 occulograph
8789 sasan.jafarnejad
8790 smartglass
8791 smartwatch
8792 snsf
8793 swipe
8794 timelogg
8795 50,000
8796 baron
8797 epoch
8798 herald
8799 rift
8800 summon
8801 team1
8802 team2
8803 victori
8804 alive.
8805 meetup
8806 unifi
8807 commod
8808 sheep
8809 steel
8810 typewrit
8811 -help
8812 -identifi
8813 -make
8814 -requir
8815 -understand
8816 //groups.yahoo.com/neo/groups/careercenter-am
8817 //www.slideshare.net/habetmadoyan/it-skills-analysis-63686238
8818 19,000
8819 2004-2015
8820 aboutc
8821 announcementcod
8822 applicationp
8823 armenia
8824 armenian
8825 careercent
8826 curriculum
8827 deadlin
8828 elig
8829 jobdescript
8830 jobpost
8831 jobrequir
8832 openingd
8833 part-tim
8834 profess
8835 quicker
8836 requiredqu
8837 richer
8838 0,800000
8839 720,000
8840 720k
8841 airdrop
8842 ammo
8843 armor
8844 battleground
8845 bluezon
8846 corral
8847 erangel
8848 first-aid
8849 first/third-person
8850 fpp/tpp


9533 ratingscount
9534 reviewernam
9535 reviewerr
9536 reviewscount
9537 webmin
9538 1076
9539 20°c
9540 46°f
9541 68°f
9542 comfi
9543 deduc
9544 discount
9545 downtown
9546 notif
9547 pavic
9548 raini
9549 services/act
9550 sunni
9551 warm
9552 //github.com/minimaxir/get-all-hacker-news-submissions-com
9553 eastern
9554 minimaxir
9555 num_com
9556 num_point
9557 edstat
9558 enrol
9559 illiteraci
9560 pirl
9561 pisa
9562 pre-primari
9563 projection/attain
9564 teacher
9565 timss
9566 vocat
9567 patrol
9568 1979
9569 250k
9570 features/column
9571 dados.gov.br
9572 tourist
9573 2022.
9574 2024.
9575 debt-rel
9576 //www.spc.noaa.gov/gis/svrgis/
9577 //www.spc.noaa.gov/wcm/data/spc_severe_database_description.pdf
9578 nina
9579 nino
9580 temp/el
9581 tornado
9582 states/u.t
9583 doi:10.7910/dvn/ilat5b
9584 earliest
9585 feed_cod
9586 outag
9587 publish_tim
9588 source_url
9589 utf8
9590 yyyymmddhhmm
9591 2,000
9592 displac
9593 geophys
9594 impuls
9595 ncei
9596 noaa/wd
9597 oceanograph


10283 time_series.csv
10284 unhcr
10285 uproot
10286 '../input/accepted_2007_to_2016.csv.gz
10287 '../input/rejected_2007_to_2016.csv.gz
10288 'accepted_2007_to_2016.csv.gz
10289 'rejected_2007_to_2016.csv.gz
10290 //github.com/nategeorge/preprocess_lending_club_data
10291 500mb
10292 acc_dt
10293 accept_df
10294 accept_df.info
10295 accepted_def
10296 as.data.t
10297 compression='gzip
10298 data.t
10299 gzfile
10300 int_rat
10301 lame
10302 na.strings=
10303 null_counts=tru
10304 pd.read_csv
10305 read.csv
10306 rej_dt
10307 reject
10308 reject_df
10309 rejected_def
10310 revol_util
10311 verbose=tru
10312 //101innovations.wordpress.com
10313 //dashboard101innovations.silk.co/
10314 //dx.doi.org/10.12688/f1000research.8414.1
10315 //dx.doi.org/10.5281/zenodo.49583
10316 //orcid.org/0000-0001-5796-2727
10317 //orcid.org/0000-0002-5965-6560
10318 20,663
10319 b.m.r.kramer
10320 bianca
10321 bosman
10322 dashboard
10323 disciplin
10324 editor
10325 f1000
10326 j.bosman
10327 jeroen
10328

11033 thunderstorm
11034 updraft
11035 12583347
11036 18:09:48
11037 2016-10-12.
11038 2016-10-13
11039 3654954
11040 csv.gz
11041 deletiond
11042 overhead
11043 sqllite
11044 'failur
11045 'smart
11046 backblaz
11047 capacity_byt
11048 manufacturer-assign
11049 model/manufactur
11050 out-of-bound
11051 s.m.a.r.t
11052 serial_numb
11053 65,500
11054 nypd
11055 opendata
11056 aguah
11057 cubic
11058 //syntagmatic.github.io/exposedata/marvel/
11059 alberich
11060 balear
11061 cesc
11062 edges.csv
11063 fantast
11064 hero-edge.csv
11065 hulk
11066 marvel
11067 miro
11068 nodes.csv
11069 nowaday
11070 relaunch
11071 revit
11072 rosselló
11073 spectacular
11074 spider-man
11075 superhero
11076 x-men
11077 'catrom
11078 'hermit
11079 'mitchel
11080 'robidoux
11081 'spline
11082 .pixel784
11083 0-25
11084 0-255.
11085 1704
11086 200-year-old
11087 25=z
11088 27,455
11089 7172
11090 brightness/contrast
11091 deaf
11092 enlarg
11093 gestur
11094 hands-onli
11095 hard-of-hear
11096 image-bas
110

11782 topshop
11783 victoria
11784 wherev
11785 //edgarlopez.net
11786 //www.researchgate.net/publication/265736405_banksim_a_bank_payment_simulation_for_fraud_detection_research
11787 144–152
11788 587443
11789 594643
11790 7200
11791 9788897999324.
11792 academia
11793 agent-bas
11794 alonso
11795 banksim
11796 bleking
11797 bordeaux
11798 dime
11799 emss
11800 genoa
11801 inproceed
11802 randomis
11803 stefan
11804 thiev
11805 ww.bth.se
11806 2,685
11807 5/2017
11808 6/2014
11809 7/2016
11810 9/2016
11811 abdulbasit
11812 asma
11813 cleric
11814 counter-terror
11815 english-bas
11816 faith
11817 hadeeth
11818 hadi
11819 ideolog
11820 jewish
11821 kassam
11822 nigeria
11823 propaganda
11824 qur'an
11825 rebutt
11826 refut
11827 rumiyah
11828 saif
11829 shah
11830 suraqah
11831 'aaba
11832 'aapl
11833 'amzn
11834 'axp
11835 'cat
11836 'csco
11837 'cvx
11838 'di
11839 'googl
11840 'ibm
11841 'intc
11842 'jnj
11843 'jpm
11844 'mcd
11845 'mmm
11846 'mrk
11847 'msft
11848 'nke
11849 'pand

12532 crucial
12533 zipcod
12534 //handbooks.bulbagarden.net/pokemongo/pokemon-index
12535 bulbagarden
12536 bulbagarden.net
12537 galleri
12538 infring
12539 bad_tweets.csv
12540 exception.respons
12541 favorite_count
12542 rrggbb
12543 type'
12544 user.created_at
12545 user.default_profil
12546 user.default_profile_imag
12547 user.descript
12548 user.favourites_count
12549 user.follow
12550 user.followers_count
12551 user.friends_count
12552 user.geo_en
12553 user.id
12554 user.lang
12555 user.listed_count
12556 user.loc
12557 user.nam
12558 user.profile_background_color
12559 user.profile_image_url
12560 user.screen_nam
12561 user.statuses_count
12562 user.time_zon
12563 user.url
12564 user.utc_offset
12565 user.verifi
12566 //childes.talkbank.org/
12567 //en.wikipedia.org/wiki/flesch
12568 //www.speech-language-therapy.com/index.php
12569 0.8709
12570 1-gram
12571 10-repeated-10-k-fold
12572 1038
12573 1163
12574 13.10
12575 15.90.
12576 161–170
12577 2-gram
12578 2007.
12579 3-gra

13282 surveymanual.pdf
13283 waha
13284 wt.ec_id=sdata-201605
13285 zambia
13286 zipf
13287 congest
13288 dont
13289 gtf
13290 siri
13291 'subreddit
13292 'worldnew
13293 'worldnews'
13294 over_18
13295 time_cr
13296 //missingmigrants.iom.int/
13297 //www.iom.int/about-iom
13298 0.095
13299 1951
13300 adequ
13301 apathi
13302 asylum-seek
13303 broadli
13304 closur
13305 countless
13306 covered-up
13307 dehuman
13308 deport
13309 despit
13310 digniti
13311 evad
13312 exagger
13313 extrapol
13314 fate
13315 government
13316 habitu
13317 horn
13318 humanitarian
13319 imped
13320 imper
13321 indirect
13322 inter-government
13323 intergovernment
13324 involuntari
13325 lampedusa
13326 meantim
13327 mediterranean
13328 migrant
13329 migratori
13330 moral
13331 neglect
13332 ngo
13333 non-government
13334 orderli
13335 projectdata
13336 relay
13337 sheriff
13338 shipwreck
13339 spotti
13340 survivor
13341 tailor
13342 traceabl
13343 tragedi
13344 turkey
13345 u.s./mexico
13346 under-repres
13

14032 qreldfa60
14033 qreldff60
14034 rela/60
14035 relc
14036 relc/60
14037 reldf/60
14038 relf
14039 relf/60
14040 relpct
14041 relz
14042 righthand
14043 ropfo
14044 ropfow
14045 s.bkhd
14046 s.dflct
14047 s.slap
14048 s.snap
14049 s.tip
14050 s.wrap
14051 s.wrst
14052 sdist
14053 shootout
14054 slap
14055 sogdg
14056 takeaway
14057 tmca60
14058 tmcf60
14059 tmfa60
14060 tmff60
14061 tmga
14062 tmga60
14063 tmgf
14064 tmgf60
14065 tmsa60
14066 tmsf60
14067 tmxga
14068 tmxgf
14069 toi.qoc
14070 toi.qot
14071 toi/gp
14072 unblock
14073 usat
14074 wraparound
14075 xgf.qoc
14076 xgf.qot
14077 //en.wikipedia.org/wiki/list_of_towns_and_cities_with_100,000_or_more_inhabitants/citynam
14078 //www.movehub.com/city-rank
14079 cappuccino
14080 cities.csv
14081 movehub
14082 movehubcostofliving.csv
14083 movehubqualityoflife.csv
14084 38,269
14085 'now
14086 'special
14087 'top
14088 tracklist
14089 volume_numb
14090 //data.gov.in/
14091 2001-2014
14092 agricutur
14093 agricuturc
14094 cultivat

14781 croix
14782 databank
14783 dinger
14784 doug
14785 entireti
14786 eric
14787 fischler
14788 forman
14789 goaliessc
14790 goaliesshootout
14791 goaltend
14792 grave
14793 hockey-reference.com
14794 hockey.net
14795 holzman
14796 home/road
14797 hornick
14798 kubatko
14799 lloyd
14800 losthockey.com
14801 macmillan
14802 mcmurray
14803 morey
14804 norri
14805 on-lin
14806 paradi
14807 postseason
14808 pucker
14809 ralph
14810 reeth
14811 rick
14812 roger
14813 scoringsc
14814 scoringshootout
14815 scoringsup
14816 scott
14817 seriespost
14818 shirley
14819 shutout
14820 styer
14821 surgent
14822 teamshalf
14823 teamsplit
14824 teamspost
14825 teamssc
14826 teamvsteam
14827 trophi
14828 walton
14829 xaler
14830 bonus
14831 payrol
14832 1930
14833 campus
14834 larceny-theft
14835 legaci
14836 manslaught
14837 nonneglig
14838 refin
14839 university/colleg
14840 dynasti
14841 emperor
14842 reign
14843 zonin
14844 59000
14845 challang
14846 colum
14847 exil
14848 harbing
14849 hardcor
1

15531 dwayn
15532 eaten
15533 farther
15534 friendship
15535 hang
15536 honest
15537 kale
15538 landlin
15539 pee-pe
15540 racist
15541 shower
15542 unaccept
15543 unequ
15544 //arxiv.org/abs/1704.05579
15545 1:100
15546 authors=
15547 cap
15548 elong
15549 ital
15550 jest
15551 khodak
15552 kiran
15553 mikhail
15554 nikunj
15555 non-sarcast
15556 racism
15557 readme.txt
15558 sarc
15559 sarcast
15560 saunshi
15561 self-annot
15562 slang
15563 title=
15564 train-balanced-sarcasm.csv
15565 url=
15566 vodrah
15567 yeahhh
15568 year=2017
15569 //fotokite.com
15570 /mav
15571 /street
15572 1080
15573 120499
15574 1920
15575 5-15
15576 ag—http
15577 air-ground
15578 albers-schoenberg
15579 autopilot
15580 barometr
15581 barometricpressure.csv
15582 cadastr
15583 ego-mot
15584 fotokit
15585 google-street-view
15586 gopro
15587 gps-deni
15588 ground-level
15589 groundtruthagl.csv
15590 groundtruthagm.csv
15591 icra
15592 ieee/rsj
15593 images/
15594 img/
15595 iro
15596 karl
15597 majdik
1559

16281 -1.7069
16282 -1.7823
16283 -1.7972
16284 -1.819
16285 -1.9129
16286 -1.921
16287 -2.01
16288 -2.0398
16289 -2.1051
16290 -2.18
16291 -2.2359
16292 -2.245
16293 -2.3
16294 -2.4103
16295 -2.6433
16296 -2.7295
16297 -2.8083
16298 -2.9705
16299 -3.1119
16300 -3.13
16301 -3.4837
16302 -3.7365
16303 -4.1563
16304 0.027
16305 0.09
16306 0.0914
16307 0.1002
16308 0.1044
16309 0.1368
16310 0.16
16311 0.17
16312 0.1756
16313 0.1863
16314 0.21
16315 0.3239
16316 0.3387
16317 0.4089
16318 0.4166
16319 0.46
16320 0.4627
16321 0.5071
16322 0.5837
16323 0.6031
16324 0.659
16325 0.6678
16326 0.7012
16327 0.7378
16328 0.7681
16329 0.784
16330 0.9262
16331 0.9384
16332 095003
16333 1.1001
16334 1.1503
16335 1.21
16336 1.2523
16337 1.3299
16338 1.417
16339 1.4879
16340 1.5832
16341 1.6404
16342 1.7028
16343 1.75
16344 1.7551
16345 1.7699
16346 1.8428
16347 100,000,000
16348 134101
16349 17:490-519
16350 2.09
16351 2.1935
16352 2.3478
16353 2.5213
16354 2.55
16355 2.901
16356 20:720-729
16357 259.6

17031 technicolor
17032 towel
17033 ukwac
17034 wackypedia
17035 1840-1908
17036 67,000
17037 colonia
17038 cuadro
17039 disambigu
17040 endofarticle.
17041 gemma
17042 lluí
17043 lrec'10
17044 malta
17045 medium-s
17046 monts
17047 padró
17048 rees
17049 rigau
17050 spanish-languag
17051 stop-word
17052 valleta
17053 widely-spoken
17054 word-sens
17055 //deepseacoraldata.noaa.gov/
17056 //deepseacoraldata.noaa.gov/website/agsviewers/deepseacorals/mapsites.htm
17057 20170324-0
17058 azooxanthel
17059 bathymetri
17060 binomi
17061 catalognumb
17062 coral
17063 dataprovid
17064 deep-sea
17065 depthinmet
17066 depthmethod
17067 diaphu
17068 dixon
17069 entranc
17070 hierarchi
17071 identificationqualifi
17072 longlin
17073 microb
17074 non-expert
17075 observationd
17076 pulley
17077 recordtyp
17078 ridg
17079 sample/observ
17080 samplingequip
17081 scientificnam
17082 scuba
17083 semicolon
17084 shout
17085 spong
17086 submers
17087 symbiot
17088 taxonrank
17089 tow
17090 trawl
17091 ver

17780 //www.poetryfoundation.org/
17781 folklor
17782 mytholog
17783 renaiss
17784 knoema
17785 left-bal
17786 pandimi
17787 player_batting_styl
17788 player_bowling_styl
17789 player_dob
17790 powerplay
17791 toss
17792 toss_tabl
17793 'utf-8
17794 analyzing/us
17795 iso-8859-1
17796 state_party_strength.csv
17797 state_party_strength_cleaned.csv
17798 wiki-cit
17799 'pwrindx
17800 //www.globalfirepower.com/countries-listing.asp
17801 artilleri
17802 corvett
17803 firepow
17804 fit-for-servic
17805 frigat
17806 globalfirepower.com
17807 globalfirepower.csv
17808 globalfirepower_multiindex.csv
17809 helicopt
17810 interceptor
17811 land-lock
17812 leadership
17813 lesser-develop
17814 militaryfactory.com
17815 minesweep
17816 nato
17817 naval
17818 powerindex
17819 projector
17820 roadway
17821 self-propel
17822 stockpil
17823 submarin
17824 technologically-advanc
17825 waterway
17826 www.globalfirepower.com
17827 ©2003-2017
17828 ©2017
17829 //www.thenews.com.pk
17830 dognam
17831 exp

18530 'victim
18531 //en.wikipedia.org/wiki/diplomacy_
18532 //nlp.stanford.edu/sentiment/
18533 //politeness.mpi-sws.org/
18534 //usak.asciiking.com/
18535 //www.anthology.aclweb.org/w/w03/w03-1014.pdf
18536 //www.floc.net/dpjudge/
18537 //www.seas.upenn.edu/~pdtb/
18538 //www.ukp.tu-darmstadt.de/data/argumentation-mining/
18539 0.8320
18540 1906
18541 1906.
18542 1906.0
18543 1906.5
18544 28game
18545 betray
18546 boyd-grab
18547 diplomaci
18548 diplomacy_data.json
18549 disc_expans
18550 essay
18551 frequent_word
18552 gurevych
18553 json.load
18554 lexicon_word
18555 msg
18556 n_request
18557 n_sentenc
18558 n_word
18559 nicula
18560 penn
18561 persuas
18562 psycholinguist
18563 reciproc
18564 riloff
18565 srijan
18566 treebank
18567 u'at
18568 u'support
18569 vlad
18570 wieb
18571 //www.cs.biu.ac.il/~schlerj/schler_springsymp06.pdf
18572 13-17
18573 160,000
18574 19,320
18575 23-27
18576 2994
18577 33-47
18578 681,288
18579 7250
18580 8240
18581 argamon
18582 astrolog
18583 blogge

19280 leap
19281 netbean
19282 tremor
19283 //starmine.ai
19284 columns/featur
19285 inhous
19286 starmine.ai
19287 sympathet
19288 //maps.london.gov.uk/webmaps/nextgenbroadband/
19289 broadband
19290 isp
19291 kcom
19292 ofcom
19293 talktalk
19294 57380690
19295 57390689.
19296 blackelo
19297 blackid
19298 blitz
19299 game_datetim
19300 gameinfo.csv
19301 playok
19302 playok.com
19303 redelo
19304 redid
19305 rosino
19306 xiangqi
19307 xqinenglish.com
19308 acident
19309 aeronáutico
19310 cenipa
19311 centro
19312 chapecoens
19313 colombia
19314 envolv
19315 investigação
19316 lamia
19317 prevenção
19318 //www.jc-bingo.com/
19319 user-ag
19320 /r/watch
19321 amazon_link
19322 reddits/
19323 reddits/w/watches.csv
19324 rudolph
19325 subreddit_ment
19326 thingsonreddit
19327 total_ment
19328 kirubi
19329 //www.bra.se/bra/bra-in-english/home/crime-and-statistics/crime-statistics.html
19330 100.000
19331 crimes.penal.cod
19332 crimes.person
19333 crimes.tot
19334 criminal.damag
19335 drun

20030 'stand
20031 010-000-024-033
20032 010-000-030-096
20033 020-000-032-221
20034 020-000-033-111
20035 ami-10
20036 ankle_left
20037 ankle_right
20038 belt
20039 dd.mm.yyyy
20040 dovgan
20041 gam
20042 kaluza
20043 lustrek
20044 malaga
20045 mirchevska
20046 50-100
20047 92697-3425
20048 bands-
20049 goat
20050 ics.uci.edu
20051 pazzani
20052 'field
20053 porsch
20054 leather
20055 shovel
20056 tracked-month
20057 cultiv
20058 data.mil
20059 painstakingli
20060 southeast
20061 unexplod
20062 h.10
20063 //github.com/nltk/nltk/blob/develop/nltk/test/gensim.doctest
20064 wnyc
20065 socrata.com
20066 1.89million
20067 age_cod
20068 ahs_house_unit
20069 ani_milk_month
20070 anthropometr
20071 bio-chem
20072 bp_diastol
20073 bp_diastolic_2read
20074 bp_systol
20075 bp_systolic_2_read
20076 date_survey
20077 day_or_month_for_breast_feed
20078 day_or_month_for_breast_feeding_cod
20079 diabetes_test
20080 district_cod
20081 duration_pregnanaci
20082 fasting_blood_glucos
20083 fasting_blood_

20779 ml/min
20780 ramon
20781 rhuerta
20782 sensorsâ€™
20783 tgs-2600
20784 tgs-2602
20785 tgs-2610
20786 tgs-2620
20787 tgs2600
20788 tgs2602
20789 tgs2610
20790 tgs2620
20791 ucsd.edu
20792 versatil
20793 *health
20794 *room*
20795 *total*
20796 academic.year
20797 doi:10.7910/dvn/lv0ysq
20798 doi:10.7910/dvn/mss2b
20799 fee*
20800 fyxnny+kbtglx+dzewzefg==
20801 medical/dent
20802 tuition
20803 tuition_graduate.csv
20804 undergraduate_package.csv
20805 bed
20806 one-b
20807 three-b
20808 two-b
20809 babb
20810 buffalo
20811 kent
20812 oakland
20813 orlean
20814 twenty-nin
20815 accuracy_matrix.csv
20816 defense_results.csv
20817 error_matrix.csv
20818 hit_target_class_matrix.csv
20819 kaggleteamid
20820 maxevaltim
20821 meanevaltim
20822 medianevaltim
20823 merger
20824 minevaltim
20825 non_targeted_attack_results.csv
20826 normalizedscor
20827 rows/column
20828 sum-up
20829 targeted_attack_results.csv
20830 teamnam
20831 .tif
20832 //data.broadinstitute.org/bbbc/bbbc010/
20833 //dx

21279 0-180
21280 0-720
21281 1,559
21282 4-categori
21283 barbequ
21284 blindli
21285 brisket
21286 category-specif
21287 cbj
21288 cbj_percentag
21289 clever
21290 contest_key
21291 is_championsip
21292 is_standard
21293 kcb
21294 location_str
21295 non-standard
21296 pork
21297 resultsar
21298 rib
21299 sample-by-sampl
21300 state_ful
21301 team_nam
21302 goods/servic
21303 third-parti
21304 discover
21305 donnelli
21306 fortune.com
21307 arab_emirates.csv
21308 australia.csv
21309 caldonia
21310 china.csv
21311 emir
21312 iceland.csv
21313 israel.csv
21314 japan.csv
21315 kazakhstan.csv
21316 kuwait.csv
21317 new_caldonia.csv
21318 new_zealand.csv
21319 oceania
21320 qatar
21321 qatar.csv
21322 saudiarabia.csv
21323 singapore.csv
21324 south_korea.csv
21325 taiwan.csv
21326 //www.histdata.com/download-free-forex-data/
21327 *star
21328 88000+
21329 answer_d
21330 asterisk
21331 quesion
21332 question/session/ministri
21333 question_bi
21334 question_descript
21335 question_no
21336

22029 inpj
22030 intu
22031 investec
22032 iplj
22033 ituj
22034 kioj
22035 kumba
22036 lhcj
22037 mndj
22038 mondi
22039 mtnj
22040 nasper
22041 nedbank
22042 nedj
22043 non-curr
22044 npnjn
22045 omlj
22046 p.csv
22047 pharmacar
22048 r.csv
22049 rainbow
22050 reij
22051 reinet
22052 remgro
22053 remj
22054 repurchas
22055 richemont
22056 rmhj
22057 sa40
22058 sanlam
22059 sasol
22060 sbkj
22061 shoprit
22062 shpj
22063 slmj
22064 snhj
22065 solj
22066 steinhoff
22067 tbsj
22068 tiger
22069 vodacom
22070 vodj
22071 whlj
22072 woolworth
22073 //github.com/wh0801/nba-shooting-rationality-2016-17-regular-season
22074 //haowang204.wordpress.com/2017/06/03/shooting-rationality-of-nba-teams/
22075 //www.kaggle.com/drgilermo/irrational-shot-select
22076 //www.kaggle.com/selfishgene/kobe-bryant-shot-selection/psychology-of-a-professional-athlet
22077 //www.mysportsfeeds.com/
22078 16-17
22079 irration
22080 ration
22081 emmi
22082 *thi
22083 4.0.
22084 figure.nz
22085 to/from
22086 //www.ine

22779 player/team
22780 player_detail
22781 player_nam
22782 player_position_info
22783 player_position_valu
22784 player_stat
22785 possession_percentag
22786 team_detail
22787 team_id
22788 team_rat
22789 total_pass
22790 total_tackl
22791 total_throw
22792 won_contest
22793 1-10
22794 intermix
22795 invect
22796 200x33
22797 6600
22798 lettercolorimages3.h5
22799 letters3.csv
22800 letters3.txt
22801 letters3.zip
22802 paper=
22803 'forb
22804 1983-
22805 1984-
22806 1985-rank:51
22807 1990-1995
22808 o'brien
22809 richest
22810 trumpnat
22811 undefin
22812 yet-to-b
22813 agradeço
22814 antun
22815 ben
22816 candidato
22817 característica
22818 conclusão
22819 criado
22820 curso
22821 declarado
22822 deputado
22823 disponibilizar
22824 doaçõ
22825 eleito
22826 eleiçõ
22827 engenheiro
22828 estadu
22829 felip
22830 heitor
22831 idad
22832 melhorando
22833 melhoria
22834 obtido
22835 pela
22836 previsão
22837 raça/cor
22838 realizar
22839 recebida
22840 resultado
22841 revisor
22842 s

23528 //github.com/kaggle/docker-python/issues/73
23529 //www.kaggle.com/ekkus93/keras-models-as-datasets-test
23530 applications.vgg16
23531 include_top=fals
23532 keras_models_dir
23533 model.load_weight
23534 s/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
23535 weights=non
23536 //github.com/udacity/machine-learn
23537 //takeout.google.com/settings/takeout/custom/location_histori
23538 2013-2015-2017.
23539 assests-
23540 company-
23541 country-
23542 forbes.com
23543 profit-
23544 rank-
23545 sales-
23546 sector-
23547 values-
23548 *there
23549 humong
23550 iniquitouspsych
23551 review*
23552 trivago
23553 ***i
23554 **to
23555 *for
23556 //calgaryherald.com/news/local-news/in-naming-streets-strict-rules-dictate-roads-rises-trails-and-mor
23557 //catalog.data.gov/dataset/nyc-street-centerline-cscl
23558 //data.cityofnewyork.us/api/views/exjm-f27b/files/cba8af99-6cd5-49fd-9019-b4a6c2d9dff7
23559 //en.wikipedia.org/wiki/cycling_in_new_york_c
23560 //imgur.com/a/uflwx
23561 //ww

24278 betfair
24279 first_taken
24280 full_descript
24281 in_play
24282 latest_taken
24283 number_bet
24284 pre-ev
24285 scheduled_off
24286 selection_id
24287 settled_d
24288 sports_id
24289 volume_match
24290 win_flag
24291 //www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.pag
24292 //www1.nyc.gov/site/planning/data-maps/open-data/dwn-pluto-mappluto.pag
24293 2003-2017
24294 block/lot
24295 combo
24296 de-dup
24297 gentrif
24298 nycpluto
24299 //github.com/alvations/quot
24300 2,297
24301 36,165
24302 878,450
24303 bruce
24304 eendra
24305 mona
24306 quotabl
24307 two-column
24308 //alt.qcri.org/semeval2015/task17/
24309 //alt.qcri.org/semeval2016/task13/
24310 //support.google.com/merchants/answer/6324436
24311 //www.en.advertisercommunity.com/t5/google-shopping-and-merchant/taxonomy-list-countries-some-missing/td-p/599656
24312 //www.google.com/basepages/producttype/taxonomy-with-ids.
24313 brazillian
24314 copyrights/licens
24315 edho
24316 language_cod
24317 pratama
2

25028 eagle.surgery.wisc.edu
25029 madison
25030 math-prog/cpo-dataset/machine-learn/wpbc/
25031 method-tre
25032 msm-t
25033 multisurfac
25034 nonrecur
25035 nonrecurr
25036 olvi
25037 wolberg
25038 'loc
25039 'misc
25040 'org
25041 'per
25042 11692
25043 14000
25044 167112
25045 4195
25046 8431
25047 9736
25048 607-619
25049 alam
25050 demand/
25051 energy-manag
25052 iot-bas
25053 musharraf
25054 seung
25055 //en.wikipedia.org/wiki/list_of_monster_energy_nascar_cup_series_champion
25056 nascar
25057 7000
25058 afrob
25059 afrobaromet
25060 alwin
25061 amerab
25062 americasbaromet
25063 arabb
25064 ase
25065 asiab
25066 asianb
25067 baromet
25068 caucab
25069 caucasu
25070 cceb
25071 ceeb
25072 cgfe
25073 closed-end
25074 cnep
25075 cross-cultur
25076 cross-nat
25077 cse
25078 eass
25079 endem
25080 eunb
25081 eurob
25082 eurobaromet
25083 evidence-bas
25084 facto
25085 gobo
25086 grove
25087 hark
25088 harmon
25089 hnseo
25090 intun
25091 issp
25092 latinb
25093 latinobarometro
2509

25778 ahmedevelop
25779 ahmetaksoy
25780 ahsan
25781 aibi
25782 aifirst
25783 aimeshangula
25784 aina
25785 aishwarya
25786 aivar
25787 aj_2017
25788 ajana
25789 ajaxfb
25790 ajaykumarmanimala
25791 ajinkya
25792 ajitbrar
25793 ajmartinezm
25794 ajmera
25795 akash
25796 akashpatel
25797 akhil
25798 akhilesh
25799 akhileshwarreddychennu
25800 akhunov
25801 aki
25802 akil
25803 akira.i
25804 akki
25805 akshatupp
25806 akshay
25807 akshayaradhya
25808 akson
25809 albertjiang
25810 albyati
25811 alec
25812 aleenahkhan
25813 aleksandr
25814 aleksey
25815 aleman
25816 alex_deng
25817 alexanderglulkhovtsev
25818 alexandru
25819 alexattia
25820 alexeev
25821 alexisglennespina
25822 alexlight
25823 alexnavarret
25824 alexstrasza
25825 alexzhang
25826 alfiansyah
25827 alfonsorey
25828 alfredoquintana
25829 alifarsi
25830 alifatemi
25831 aliia
25832 alimbekovkz
25833 alin
25834 alishan
25835 aljabari
25836 aljumiah
25837 allam
25838 allbe
25839 allhailsammi
25840 allsmil
25841 alluri
25842 almon


26527 davtyan
26528 dayana
26529 dazhangyu
26530 dddhiraj
26531 deathmood
26532 debanjan
26533 debashish
26534 debayandasgupta
26535 debdootsheet
26536 debirath
26537 debrun
26538 dechavez005
26539 deeley
26540 deena
26541 deepak
26542 deepakgupta
26543 deepakkandasami
26544 deepakmitt
26545 deepanalyt
26546 delepp
26547 delforouzi
26548 dell
26549 delord
26550 deltoix
26551 deluca
26552 delux
26553 demetri
26554 dempsey
26555 denesha
26556 denisafonin
26557 denni
26558 derrick
26559 derrin
26560 desai
26561 deshmukh
26562 deshpand
26563 deu
26564 devansh
26565 devashismohapatra
26566 deveshmaheshwari
26567 devin
26568 deviou
26569 devji
26570 devjyotichandra
26571 dexterita
26572 dgoke1
26573 dhafermalouch
26574 dhainj
26575 dhameliya
26576 dhananjay
26577 dhatt
26578 dhiman
26579 dhruvmangtani
26580 dhyani
26581 dian
26582 diaz-bã©rrio
26583 digitalcowboy
26584 digiulio
26585 digmi
26586 dilday
26587 dileep
26588 dilzeem
26589 dimarco
26590 dimitrif
26591 ding
26592 diogo
26593 dipan

27277 jigarkumar
27278 jihan
27279 jiji
27280 jill_m
27281 jimmymarguerit
27282 jin-hwachiu
27283 jindong
27284 jinesh
27285 jingdazh
27286 jingjuewang
27287 jingli
27288 jingwang
27289 jingyuan-é™è¿œ
27290 jinner
27291 jinsoo
27292 jinz
27293 jiri
27294 jirka
27295 jitendra
27296 jitendrakumarbans
27297 jiuzhang
27298 jjjooo1
27299 jluca
27300 jmataya
27301 jo-team
27302 joejo
27303 joeland209
27304 joerg
27305 joeymey
27306 joffl
27307 johannesbuchn
27308 johanneslapoutr
27309 john2
27310 johncurcio
27311 johnd
27312 johndebugg
27313 johnheyrich
27314 johnjaych
27315 johnnyha
27316 johnworn
27317 johnx
27318 joi
27319 jolhe006
27320 joli
27321 jolli
27322 jomend
27323 jon.bil
27324 jonahelisio
27325 jonatan
27326 jonathanphoon
27327 jongh
27328 jonh
27329 jonihoppen
27330 jonsson
27331 joostlubach
27332 jorgezazueta
27333 josephba
27334 josh777
27335 joshkyh
27336 joshmckenney
27337 joshua
27338 joshuaherman
27339 joss
27340 jossssss
27341 josã©
27342 josã©prado
27343 josä—andrä—sal

28027 mouli
28028 moura
28029 mouton
28030 movahhedinia
28031 moxiou
28032 moya
28033 mozilla
28034 mphogodfreynkadimeng
28035 mrak
28036 mrdeed
28037 mridulsharma
28038 mritunjaymohitesh
28039 mrjazz
28040 mrnasalhazel
28041 mrpantherson
28042 mrsanto
28043 mrverd
28044 mrzzheng
28045 msiebold
28046 msjass
28047 mszombi
28048 mubarak
28049 mudit
28050 mufti
28051 muhamad
28052 muhammadmahadtariq
28053 muhammadyasiradnan
28054 mujtaba
28055 mukarram
28056 mukesh
28057 muneeb
28058 munjal
28059 munkey
28060 munshi
28061 muonneutrino
28062 murali_munna
28063 muralidhar
28064 mureren
28065 murilo
28066 murphi
28067 murthi
28068 mushkin
28069 mushtaq
28070 muskanbararia
28071 musskopf
28072 mustafiz
28073 mustakim
28074 muthukumar.j
28075 muttaqi
28076 mypapit
28077 mysaturdayself
28078 mã¤lick
28079 n01z3
28080 nabae
28081 nabeel
28082 nada
28083 nader
28084 nadi
28085 nadimp
28086 nadin
28087 nagabhushan
28088 nagaraj
28089 nagarajan
28090 nagendra
28091 nagyfi
28092 naik
28093 nailo
280

28777 sasikumar
28778 saswata
28779 satadru5
28780 satavisha
28781 satheeshperepu
28782 sathu79
28783 sathyajit
28784 satish
28785 satya
28786 satyaki
28787 satyasai
28788 saudal-zakwani
28789 saurabh
28790 saurabhbhagvatula
28791 saurav
28792 sauro
28793 savannahlogan
28794 savasyä±ldä±rä±m
28795 savchenko
28796 savioz
28797 sawayaka
28798 saxena
28799 saxin
28800 sazidurrahman
28801 schale
28802 scharf
28803 schenck
28804 schiller
28805 schirmer
28806 schmidt
28807 schnessl
28808 schreier
28809 schroeder
28810 schwartz
28811 schwertfeg
28812 scottfre
28813 scotthendrickson
28814 seagoat
28815 seagullbird
28816 seankim
28817 seanlahman
28818 sebastian
28819 sebastianmarkow
28820 sebastianzanabria
28821 sebel
28822 secareanu
28823 security3test
28824 securityteamvictim4
28825 seetharam
28826 sehgal
28827 seibert
28828 seiei
28829 sekar
28830 sekhar
28831 sekkaf
28832 selah
28833 selvakumar
28834 semenyk
28835 semin
28836 semionkorchevskiy
28837 seong-ja
28838 seregni
28839 serena
28840

29526 xaliap
29527 xavierbay
29528 xaviermartinezbartra
29529 xavya
29530 xgan
29531 xiao
29532 xiaocongsonia
29533 xiaojingli
29534 xiaotong
29535 xiaoxiao
29536 xiaozhou
29537 xime
29538 xingzhangren
29539 xiong
29540 xjtushilei
29541 xss'
29542 xtyscut
29543 xuan
29544 xuetao
29545 xuleiyang
29546 xuseniayu
29547 xuy2
29548 xwang
29549 yabir
29550 yachuncheng
29551 yadav
29552 yagana
29553 yagnesh
29554 yahyacivelek
29555 yalitsai
29556 yamaji
29557 yamuuu
29558 yanir
29559 yanni
29560 yannmallegol
29561 yannsar
29562 yaohsiao
29563 yaoseny
29564 yaoxiang
29565 yapi
29566 yarden
29567 yasar
29568 yash
29569 yashjain
29570 yashna
29571 yashu
29572 yasmeenw
29573 yassin
29574 yassineameur
29575 yasuhiro_121
29576 yaswanth
29577 yatishbn
29578 yazi
29579 yeomyungro
29580 yeongchan
29581 yeongseok
29582 yetukuri
29583 yevgeniya
29584 yexiaofeng
29585 yichenâ€œeddieâ€shen
29586 yifan
29587 yijiezhuang
29588 yilmaz
29589 yime
29590 ying
29591 yinghan
29592 yingzhu
29593 yiqizhang
29594 y

30276 tulpenwoed
30277 tulpomani
30278 well.
30279 上证综合指数前复权日线数据
30280 'decad
30281 'mysteri
30282 //www.fisheaters.com/rosary.html
30283 //www.preces-latinae.org/thesaurus/bvm/rosarium.html
30284 dominican
30285 joy
30286 novic
30287 pray
30288 prayer
30289 rosari
30290 sorrow
30291 disburs
30292 extet
30293 feaf
30294 sorti
30295 squadron
30296 broadsid
30297 eighteenth
30298 fugit
30299 pamphlet
30300 autobiographi
30301 egalitarian
30302 evangel
30303 metaphor
30304 post-revolutionari
30305 renov
30306 errant
30307 for-excel
30308 doha
30309 //data.betfair.com/sportids.htm
30310 //www.kaggle.com/zygmunt/betfair-hors
30311 104049
30312 1306731
30313 1477
30314 2152880
30315 26420387
30316 3503
30317 3988
30318 4339
30319 468328
30320 61420
30321 6231
30322 6422
30323 6423
30324 7511
30325 7522
30326 7524
30327 998917
30328 gaelic
30329 greyhound
30330 handbal
30331 rugbi
30332 snooker
30333 volleybal
30334 //archive.spectator.co.uk/article/24th-september-2005/57/doctor-in-the-hous
3

31026 article-id
31027 article-titl
31028 beta-1
31029 bioavail
31030 bsa/pbst
31031 cane
31032 cluster_id
31033 cripto
31034 cytokin
31035 de-orphan
31036 degf
31037 diod
31038 elisa
31039 eno
31040 enzym
31041 evolutionarili
31042 extracellular
31043 g-protein
31044 gnb3
31045 il-17
31046 journal-titl
31047 l-shape
31048 luciferas
31049 lxml
31050 meta-analysi
31051 mg/ml
31052 monoclon
31053 mutant
31054 ncbi
31055 ncbi_data.json
31056 ncbi_doc_titles_map.json
31057 ncbi_lines_map.json
31058 non-fit
31059 non-specif
31060 phenix.refin
31061 pias3
31062 plagiar
31063 polymorph
31064 pymol
31065 rmsd
31066 solubl
31067 supernat
31068 tgf-
31069 tgf-b
31070 tgfb1
31071 therapeut
31072 tumorigen
31073 tyro3
31074 tyrosin
31075 uspto_data.json
31076 uspto_doc_titles_map.json
31077 uspto_lines_map.json
31078 well-character
31079 xpath
31080 card_detail.csv
31081 corner_detail.csv
31082 cross_detail.csv
31083 datadictionary.xlsx
31084 foulcommit
31085 foulcommit_detail.csv
31086 goal_detai

31776 employee_agreements.csv
31777 income_sources_and_assets.csv
31778 jobs_before_joining_admin.csv
31779 names_and_job_titles.csv
31780 billi
31781 depodesta
31782 hillsid
31783 on-bas
31784 oobp
31785 oslg
31786 rankplayoff
31787 rankseason
31788 sports-reference.com
31789 unsettl
31790 1.25kg
31791 20-28
31792 bicep
31793 curl
31794 dumbbel
31795 elbow
31796 hip
31797 puc-rio
31798 unilater
31799 wearer
31800 weight-loss
31801 wugulino
31802 'surrog
31803 0.95
31804 1.05
31805 dbot
31806 econc
31807 esoil
31808 gammag
31809 mr_c
31810 mr_t
31811 mt_c
31812 mt_t
31813 parametr
31814 predicted_valu
31815 self-mad
31816 true_valu
31817 //www.asx.com.au/
31818 //www.asxhistoricaldata.com/
31819 1997-01-02
31820 2,228
31821 6,475,470
31822 asx-equity-price.csv
31823 asx-tickers.csv
31824 gaint
31825 at_confer
31826 livetweet
31827 lower-ti
31828 modern-world
31829 //www1.nyc.gov/site/planning/data-maps/open-data/districts-download-metadata.pag
31830 openphish.com
31831 phish
31832 dhh


In [126]:
print('\nconvert tokenized documents into a document-term matrix')
# doc2bow maps each tokenized document to a sparse list of (token_id, count) pairs
corpus = [dictionary.doc2bow(text) for text in texts]
for item in corpus:
    print(item)


convert tokenized documents into a document-term matrix
[(0, 2), (1, 1), (2, 2), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 2), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 2), (29, 2), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 3), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 7), (49, 1), (50, 4), (51, 1), (52, 1), (53, 1), (54, 1), (55, 3), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 2), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 2), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1), (89, 1), (90, 1), (91, 1), (92, 1), (93, 1), (94, 1), (95, 1), (96, 1), (97, 1), (98, 1), (99, 7), (100, 2), (101, 3), (102, 1), (103, 1), (104, 1), (1

[(138, 1), (189, 1), (199, 2), (206, 2), (257, 1), (670, 1), (675, 1), (1726, 1), (1783, 1), (5252, 1), (5253, 1), (5254, 1), (5255, 1), (5256, 1), (5257, 1)]
[(36, 1), (41, 1), (52, 1), (55, 1), (121, 1), (139, 1), (142, 2), (164, 2), (172, 2), (186, 1), (189, 1), (194, 1), (204, 1), (223, 1), (245, 2), (255, 2), (267, 1), (270, 1), (275, 1), (296, 1), (314, 1), (325, 1), (337, 1), (348, 2), (351, 1), (359, 1), (365, 1), (375, 1), (376, 1), (383, 7), (399, 1), (423, 1), (430, 1), (461, 1), (475, 2), (500, 1), (501, 1), (530, 1), (533, 1), (606, 1), (632, 1), (655, 1), (711, 1), (718, 1), (743, 1), (756, 1), (767, 1), (802, 1), (815, 1), (846, 2), (892, 2), (903, 6), (914, 1), (933, 2), (962, 1), (964, 1), (1106, 2), (1261, 1), (1423, 1), (1604, 1), (1634, 1), (1670, 1), (1720, 1), (1776, 1), (2113, 1), (2262, 1), (2281, 1), (3033, 1), (3166, 2), (3446, 1), (3543, 1), (3872, 1), (3883, 1), (4139, 1), (4221, 1), (4230, 2), (4737, 1), (5258, 1), (5259, 1), (5260, 1), (5261, 1), (5262, 1)

[(31, 1), (195, 1), (362, 1), (2124, 1)]
[(7, 1), (8, 3), (13, 1), (58, 4), (77, 1), (78, 1), (86, 1), (91, 1), (98, 1), (105, 3), (125, 1), (141, 1), (158, 1), (166, 1), (180, 2), (184, 1), (187, 1), (195, 3), (197, 1), (204, 1), (211, 1), (212, 1), (222, 1), (223, 2), (228, 1), (255, 1), (270, 2), (274, 1), (275, 1), (288, 1), (290, 1), (301, 1), (328, 1), (335, 1), (343, 2), (346, 1), (359, 1), (424, 1), (430, 2), (452, 1), (471, 1), (480, 1), (500, 1), (501, 1), (512, 1), (513, 1), (518, 1), (534, 1), (542, 1), (546, 1), (547, 1), (598, 1), (606, 1), (611, 2), (615, 1), (617, 2), (623, 1), (639, 1), (647, 1), (673, 1), (730, 1), (736, 2), (743, 2), (749, 1), (769, 3), (832, 1), (841, 5), (861, 1), (933, 1), (954, 1), (975, 2), (997, 2), (1015, 1), (1051, 1), (1312, 1), (1333, 2), (1343, 3), (1399, 1), (1458, 1), (1509, 1), (1539, 4), (1658, 1), (1667, 2), (1779, 1), (1802, 1), (1842, 1), (1867, 1), (1879, 1), (1938, 1), (2019, 1), (2053, 1), (2135, 1), (2170, 1), (2184, 1), (2267, 

[(12, 2), (36, 3), (42, 1), (55, 1), (82, 1), (98, 2), (139, 1), (149, 1), (157, 1), (176, 1), (193, 2), (197, 2), (201, 1), (210, 1), (223, 1), (227, 2), (241, 1), (258, 1), (274, 1), (275, 1), (337, 1), (343, 1), (349, 2), (359, 1), (372, 1), (383, 1), (395, 2), (448, 2), (453, 4), (492, 1), (500, 1), (501, 1), (517, 2), (563, 1), (595, 1), (660, 5), (661, 1), (682, 1), (686, 1), (718, 2), (730, 3), (738, 1), (741, 3), (760, 2), (767, 1), (901, 1), (923, 5), (932, 1), (933, 1), (981, 1), (982, 1), (1032, 5), (1055, 1), (1493, 2), (1599, 2), (1732, 1), (1815, 4), (1840, 1), (1862, 2), (1911, 1), (2044, 1), (2064, 1), (2106, 1), (2124, 1), (2601, 1), (2615, 1), (3316, 2), (3407, 2), (3773, 2), (6365, 1), (9330, 1), (15210, 2), (20499, 1), (20500, 1), (20501, 1), (20502, 1), (20503, 3), (20504, 1), (20505, 1), (20506, 1), (20507, 1)]
[(36, 1), (58, 1), (258, 1), (741, 1), (1140, 1), (1152, 1), (7476, 1), (20508, 1)]
[(8, 1), (12, 1), (25, 3), (28, 1), (36, 1), (41, 1), (42, 1), (52, 1),

[(895, 1)]
[(895, 1)]
[(31, 3), (55, 2), (78, 1), (85, 3), (149, 1), (169, 2), (172, 1), (175, 1), (180, 1), (241, 1), (255, 1), (267, 1), (275, 1), (346, 3), (359, 1), (423, 1), (495, 1), (500, 1), (501, 1), (580, 1), (659, 2), (706, 1), (713, 1), (736, 1), (751, 1), (955, 3), (982, 2), (1058, 1), (1106, 1), (1333, 1), (1399, 4), (1550, 1), (1582, 1), (2026, 1), (2050, 1), (2220, 1), (2955, 1), (3916, 1), (5287, 1), (5387, 1), (10996, 1), (25499, 1), (25500, 1), (25501, 1)]
[(55, 2), (63, 1), (99, 2), (100, 1), (232, 2), (275, 1), (359, 2), (401, 1), (478, 1), (500, 1), (501, 1), (892, 1), (925, 1), (1368, 1), (2248, 1), (2278, 2), (2625, 2), (25502, 1), (25503, 1)]
[(17, 2), (36, 1), (41, 1), (98, 1), (105, 1), (106, 2), (125, 1), (132, 1), (142, 1), (143, 1), (180, 1), (181, 1), (189, 2), (205, 1), (210, 1), (221, 1), (238, 1), (246, 1), (261, 2), (270, 1), (337, 2), (360, 1), (389, 4), (466, 1), (533, 1), (649, 1), (666, 1), (694, 3), (700, 2), (709, 1), (718, 1), (730, 1), (736, 1

[(895, 1)]
[(895, 1)]
[(895, 1)]
[(231, 2), (267, 1), (275, 1), (337, 3), (346, 2), (359, 1), (500, 1), (501, 1), (802, 1), (871, 1), (2608, 2), (4073, 2), (7768, 1)]
[(895, 1)]
[(28, 1), (31, 1), (71, 1), (105, 1), (124, 1), (129, 2), (189, 1), (201, 1), (222, 2), (244, 1), (245, 1), (372, 1), (374, 2), (380, 1), (384, 1), (420, 2), (424, 1), (489, 5), (507, 1), (623, 1), (708, 1), (858, 3), (886, 1), (981, 1), (1003, 1), (1015, 1), (1036, 1), (1062, 1), (1121, 1), (1276, 1), (1305, 11), (1490, 2), (1506, 1), (1529, 1), (1660, 1), (1730, 7), (1782, 1), (1921, 1), (2024, 1), (2085, 1), (2188, 1), (2189, 1), (2968, 1), (3018, 1), (3028, 1), (3118, 1), (3270, 1), (3505, 7), (3650, 1), (3972, 1), (4476, 2), (4477, 2), (4478, 1), (4479, 1), (4480, 3), (4481, 1), (4482, 1), (4483, 1), (4484, 1), (4485, 1), (4486, 1), (4487, 1), (4488, 3), (4489, 1), (4490, 1), (4491, 1), (4492, 1), (4493, 1), (4494, 4), (4495, 1), (4496, 3), (4497, 1), (4498, 1), (4499, 1), (4500, 1), (4501, 1), (4502, 1), 
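
To make the `(id, count)` pairs above concrete, here is a minimal pure-Python sketch of a bag-of-words conversion. It only imitates the kind of output `dictionary.doc2bow` produces and is not gensim's implementation; the toy documents and the helper names `build_vocab` and `to_bow` are hypothetical.

```python
from collections import Counter

def build_vocab(texts):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for tokens in texts:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs for the tokens present in vocab."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

# Hypothetical documents, already tokenized and stemmed.
toy_texts = [
    ["univers", "colleg", "univers", "state"],
    ["file", "content", "file", "file"],
]
toy_vocab = build_vocab(toy_texts)
toy_corpus = [to_bow(tokens, toy_vocab) for tokens in toy_texts]
for bow in toy_corpus:
    print(bow)
```

Each document becomes a sparse row that lists only the tokens it contains, which is why very short documents in the output above reduce to a single pair such as `[(895, 1)]`.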

In [107]:
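# Train LDA with 2 topics; passes=20 makes 20 passes over the corpus,
# and random_state=0 makes the result reproducible
ldamodel = gensim.models.ldamodel.LdaModel(
    corpus, num_topics=2, id2word=dictionary, passes=20, random_state=0)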

In [109]:
ldamodel.print_topics(num_topics=2, num_words=15)

[(0,
  '0.047*"univers" + 0.010*"state" + 0.006*"descript" + 0.006*"use" + 0.006*"colleg" + 0.006*"model" + 0.005*"contain" + 0.005*"featur" + 0.005*"column" + 0.005*"imag" + 0.004*"http" + 0.004*"learn" + 0.004*"context" + 0.004*"time" + 0.003*"california"'),
 (1,
  '0.007*"content" + 0.007*"use" + 0.006*"contain" + 0.006*"acknowledg" + 0.006*"context" + 0.006*"time" + 0.006*"file" + 0.005*"inform" + 0.005*"includ" + 0.005*"inspir" + 0.005*"number" + 0.004*"year" + 0.004*"http" + 0.004*"name" + 0.003*"collect"')]
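
Each topic is printed as a weighted sum of words, where a weight is the word's probability within that topic. As a sketch, such a string can be split back into `(word, weight)` pairs with the standard library; the helper name `parse_topic` is hypothetical, and the sample string is the beginning of topic 0 above.

```python
import re

def parse_topic(topic_string):
    """Split a '0.047*"word" + ...' string into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

sample = '0.047*"univers" + 0.010*"state" + 0.006*"descript"'
top_words = parse_topic(sample)
print(top_words)
```

In gensim itself, `ldamodel.show_topic(topic_id, topn=15)` returns such pairs directly, so parsing is only needed when the printed string is all that is available.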

In [148]:
# Manual labels for topics 0 and 1, chosen by inspecting their top words
topics_names = ['Education', 'Communication']

In [151]:
corpus

[[(0, 2),
  (1, 1),
  (2, 2),
  (3, 1),
  (4, 1),
  ...

In [152]:
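def elicit_topic_name(doc_topics):
    # doc_topics: one document's list of (topic_id, probability) pairs.
    # Take the max pair directly; np.squeeze breaks [:, 1] on a single pair.
    topic_id, _ = max(doc_topics, key=lambda pair: pair[1])
    return topics_names[topic_id]
elicit_topic_name(doc_topics)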
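
A list of `(topic_id, probability)` pairs is easy to mishandle as an array: with a single pair, `np.squeeze(np.array(...))` collapses shape `(1, 2)` to `(2,)`, and `[:, 1]` then raises an `IndexError`. The sketch below avoids the array conversion. The function name `dominant_topic` and the sample values are hypothetical, and it assumes the input is one document's list of pairs, such as `ldamodel.get_document_topics(bow)` returns.

```python
def dominant_topic(doc_topics, names):
    """Return the name of the most probable topic for one document."""
    # Use the topic id stored in the pair, not the pair's position in the list:
    # low-probability topics may be missing, so positions and ids can differ.
    topic_id, _ = max(doc_topics, key=lambda pair: pair[1])
    return names[topic_id]

names = ["Education", "Communication"]
print(dominant_topic([(0, 0.12), (1, 0.88)], names))  # both topics reported
print(dominant_topic([(1, 0.99)], names))             # only topic 1 reported
```

Looking up the name by topic id matters in the second call: a positional `argmax` would return index 0 and mislabel the document as the first topic.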

<font color = green >

## Learn more
</font>

Latent Dirichlet allocation
<br>
https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation


<font color = green >

## Next lesson: Clustering 
</font>

