<font color = green >

# Text classification: topic modeling 

</font>

<font color = green >

### Latent Dirichlet allocation (LDA)

</font>

LDA is typically used to detect the underlying topics in a collection of text documents.

**Input**: text documents and the number of topics 
<br>
**Output**: a topic distribution for each document (so that each document can be assigned the topic with the highest probability) and a word distribution for each topic 

**Assumptions**:
- Documents with similar topics use similar groups of words 
- Each document is a probability distribution over latent topics 
- Each topic is a probability distribution over words


<font color = green >

#### Generative process
<br>
</font>

LDA assumes that every document is created in the following way:

1) Choose the number of words in the document
<br>
2) Choose the topic mixture over the fixed set of topics (e.g. 20% of topic A, 30% of topic B, and 50% of topic C)
<br>
3) Generate each word by:
<br>
   -picking a topic based on the document's multinomial distribution over topics 
<br>
   -picking a word based on that topic's multinomial distribution over words 
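
The three steps above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the topic names, words, and probabilities are made up, and the topic mixture is fixed by hand, whereas full LDA draws it from a Dirichlet prior. The function name `generate_document` is hypothetical.

```python
import random

def generate_document(n_words, topic_mixture, topic_word_probs, seed=0):
    """Sample one document word by word, following the generative steps."""
    rng = random.Random(seed)
    topics = list(topic_mixture)
    topic_weights = [topic_mixture[t] for t in topics]
    words = []
    for _ in range(n_words):
        # Step 3a: pick a topic from the document's topic mixture
        topic = rng.choices(topics, weights=topic_weights)[0]
        # Step 3b: pick a word from that topic's word distribution
        vocab = list(topic_word_probs[topic])
        word_weights = [topic_word_probs[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=word_weights)[0])
    return words

# Hypothetical toy topics (illustrative numbers only)
topic_word_probs = {
    "A": {"brocolli": 0.5, "eat": 0.3, "health": 0.2},
    "B": {"drive": 0.6, "car": 0.4},
    "C": {"school": 0.7, "pressure": 0.3},
}
topic_mixture = {"A": 0.2, "B": 0.3, "C": 0.5}  # step 2: topic mixture

doc = generate_document(10, topic_mixture, topic_word_probs)  # step 1: 10 words
print(doc)
```

Fitting LDA runs this process in reverse: given only the words, it estimates the topic mixtures and the word distributions that most plausibly produced them.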

<img src = "img/topics_modeling.png" height=500 width= 800 align="left">



<font color = green >

#### Recall
</font>


#### Binomial distribution

$$p(k \mid n)\quad =\quad C^{ k }_{ n }\cdot p^{ k }(1-p)^{ n-k }\quad =\quad \frac { n! }{ k!(n-k)! } p^{ k }(1-p)^{ n-k }$$

Example: probability of 6 heads in 10 tosses of a fair coin: 
$$p(6 \mid 10)\quad =\quad C^{ 6 }_{ 10 }\cdot {0.5}^{ 6 }(0.5)^{ 4 }\quad = 210 \cdot 0.015625 \cdot 0.0625 = 0.205078125$$
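
The coin example can be checked numerically. A minimal sketch using only the standard library (the helper name `binomial_pmf` is an assumption, not part of the original notebook):

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

prob = binomial_pmf(6, 10, 0.5)
print(prob)  # 0.205078125
```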


#### Multinomial distribution

$$p(n_{ 1 }, n_{ 2 }, \ldots, n_{ k })\quad =\quad \frac { n! }{ n_{ 1 }!\,n_{ 2 }!\ldots n_{ k }! } p^{ n_{ 1 } }_{ 1 }p^{ n_{ 2 } }_{ 2 }\ldots p^{ n_{ k } }_{ k }$$

Example (three outcomes): <br>
n = 12 (12 games are played),<br>
n1 = 7 (number won by Player A),<br>
n2 = 2 (number won by Player B),<br>
n3 = 3 (number of draws),<br>
p1 = 0.4 (probability Player A wins),<br>
p2 = 0.35 (probability Player B wins),<br>
p3 = 0.25 (probability of a draw)<br>
$$p(7,2,3)\quad =\quad \frac {12!}{ 7! \cdot 2! \cdot3 ! }  \cdot 0.4^{7} \cdot 0.35^{2} \cdot0.25^{3} = 0.0248$$
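
The same kind of check works for the three-outcome example. This is a small sketch with a hypothetical helper `multinomial_pmf`, using only the standard library:

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Probability of observing the given outcome counts in sum(counts) trials."""
    coefficient = factorial(sum(counts))
    for c in counts:
        coefficient //= factorial(c)  # each partial quotient is an exact integer
    return coefficient * prod(p**c for p, c in zip(probs, counts))

prob = multinomial_pmf([7, 2, 3], [0.4, 0.35, 0.25])
print(round(prob, 4))  # 0.0248
```

With two outcomes the formula reduces to the binomial distribution above.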




<font color = green >

#### Maximum Likelihood Estimation

</font>

<br>

**Recall** 
<br> The text documents and the number $K$ of topics are known.

**Target**:
<br>Among all possible topic distributions for the documents and all possible word distributions for the topics, choose the ones that maximize the probability of all the text documents.

**Approach** :
<br>
1) Randomly assign each word of each document to one of the $K$ topics 
<br>
2) Iterate the following process until convergence (the assignments of words to topics stop changing) 
<br>$\quad\quad$For each document $d$: 
<br>
    $\quad\quad\bullet$ For each word $w$ in $d$:           
    <br>$\quad\quad\quad$ - Assume that all topic assignments except the current one are correct
    <br>$\quad\quad\quad$ - For every topic $t$, compute the score of the hypothesis that $w$ belongs to topic $t$:
   <br>$\quad\quad\quad\quad\quad score (t) =  p(t | d) \cdot p (w |t),$
   <br>$\quad\quad\quad\quad$ where $p(t|d)$ is the proportion of words in $d$ currently assigned to $t$,
    <br>$\quad\quad\quad\quad p(w|t)$ is the share of word $w$ among all words assigned to topic $t$.  
    <br>$\quad\quad\quad$ - Assign the word $w$ to the topic with the maximum score

The result is a matrix of word distributions over topics.  
Note: the computed topics are just word distributions, so each topic still has to be summarized (for example, given a name) by inspecting its most probable words. 
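
The iterative procedure above can be sketched as follows. This is a toy illustration of the hard-assignment loop as described, not how gensim fits its model (gensim's `LdaModel` uses online variational Bayes). The function name `assign_topics`, the `smoothing` constant (added so that no score is exactly zero), and the `max_iters` cap are assumptions made for this sketch.

```python
import random
from collections import Counter

def assign_topics(docs, num_topics, max_iters=50, smoothing=0.1, seed=0):
    """Assign every word occurrence to a topic by repeated greedy re-scoring."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # Step 1: random initial topic for every word occurrence
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [Counter(zd) for zd in z]                # words of d assigned to t
    topic_word = [Counter() for _ in range(num_topics)]  # occurrences of w in t
    topic_total = [0] * num_topics
    for doc, zd in zip(docs, z):
        for w, t in zip(doc, zd):
            topic_word[t][w] += 1
            topic_total[t] += 1

    # Step 2: repeat until no assignment changes (or max_iters is reached)
    for _ in range(max_iters):
        changed = False
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                old = z[d][i]
                # Treat all other assignments as correct: remove the current one
                doc_topic[d][old] -= 1
                topic_word[old][w] -= 1
                topic_total[old] -= 1

                def score(t):
                    # Smoothed p(t|d) * p(w|t)
                    p_t_d = (doc_topic[d][t] + smoothing) / (
                        len(doc) - 1 + num_topics * smoothing)
                    p_w_t = (topic_word[t][w] + smoothing) / (
                        topic_total[t] + vocab_size * smoothing)
                    return p_t_d * p_w_t

                new = max(range(num_topics), key=score)
                z[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][w] += 1
                topic_total[new] += 1
                changed = changed or new != old
        if not changed:
            break
    return z, topic_word

# Tiny made-up corpus of already tokenized documents
docs = [["brocolli", "eat", "good"],
        ["drive", "car", "drive"],
        ["eat", "brocolli", "health"]]
assignments, topic_word = assign_topics(docs, num_topics=2)
print(assignments)
```

Samplers used in practice typically draw the new topic at random in proportion to the scores instead of always taking the maximum, which makes them less likely to get stuck in a poor assignment.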


<font color = green >

## Gensim LDA 

</font>



In [1]:
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
from nltk.corpus import stopwords

<font color = green >

### Define the text documents 

</font>



In [2]:
doc_a = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc_e = "Health professionals say that brocolli is good for your health."
doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]


<font color = green >

### Tokenize, clean, and stem

</font>



In [3]:
en_stop  = set(stopwords.words('english'))
p_stemmer = PorterStemmer()

def tokenize(doc_set):
    texts = []
    for doc in doc_set:
        # tokenize document string
        raw = doc.lower()
        tokens = word_tokenize(raw)

        # remove stop words from tokens
        tokens = [token for token in tokens if token not in en_stop]

        # stem tokens
        tokens = [p_stemmer.stem(token) for token in tokens]

        # add tokens to list
        texts.append(tokens)
    return texts

texts = tokenize(doc_set)
texts[0]

['brocolli',
 'good',
 'eat',
 '.',
 'brother',
 'like',
 'eat',
 'good',
 'brocolli',
 ',',
 'mother',
 '.']

<font color = green >

### Convert tokenized documents into an "id <-> term" dictionary

</font>



In [4]:
dictionary = corpora.Dictionary(texts) # builds the id <-> term mapping; an alternative to using CountVectorizer
print (type(dictionary), dictionary)
for k,w in dictionary.items():
    print (k,w)

<class 'gensim.corpora.dictionary.Dictionary'> Dictionary(34 unique tokens: [',', '.', 'brocolli', 'brother', 'eat']...)
0 ,
1 .
2 brocolli
3 brother
4 eat
5 good
6 like
7 mother
8 around
9 basebal
10 drive
11 lot
12 practic
13 spend
14 time
15 blood
16 caus
17 expert
18 health
19 increas
20 may
21 pressur
22 suggest
23 tension
24 better
25 feel
26 never
27 often
28 perform
29 school
30 seem
31 well
32 profession
33 say


<font color = green >

### Create gensim corpus

</font>



In [5]:
print ('\nconvert tokenized documents into a document-term matrix')
corpus = [dictionary.doc2bow(text) for text in texts] # id and count
for item in corpus:
    print (item)


convert tokenized documents into a document-term matrix
[(0, 1), (1, 2), (2, 2), (3, 1), (4, 2), (5, 2), (6, 1), (7, 1)]
[(1, 1), (3, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1)]
[(1, 1), (10, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1)]
[(0, 1), (1, 1), (3, 1), (7, 1), (10, 1), (21, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1)]
[(1, 1), (2, 1), (5, 1), (18, 2), (32, 1), (33, 1)]
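
Each row above is a list of `(token id, count)` pairs. A rough pure-Python sketch of what `doc2bow` computes for the first document, with a hypothetical helper `to_bag_of_words` and the ids copied from the dictionary printed earlier:

```python
from collections import Counter

def to_bag_of_words(tokens, token2id):
    """Return sorted (token_id, count) pairs, ignoring tokens without an id."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Ids of the tokens in the first document, as printed by the dictionary above
token2id = {",": 0, ".": 1, "brocolli": 2, "brother": 3,
            "eat": 4, "good": 5, "like": 6, "mother": 7}
tokens = ["brocolli", "good", "eat", ".", "brother", "like",
          "eat", "good", "brocolli", ",", "mother", "."]

bow = to_bag_of_words(tokens, token2id)
print(bow)
# [(0, 1), (1, 2), (2, 2), (3, 1), (4, 2), (5, 2), (6, 1), (7, 1)]
```

Tokens that are missing from the dictionary are skipped, so a new document only contributes the words the model already knows.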


<font color = green >

### Generate LDA model

</font>



In [6]:
ldamodel = gensim.models.ldamodel.LdaModel(
    corpus, num_topics=2, id2word=dictionary, passes=20, random_state= 0)


<font color = green >

### Review topics 

</font>



In [7]:
ldamodel.print_topics(num_topics=2,num_words=10)

[(0,
  '0.098*"." + 0.076*"brocolli" + 0.076*"good" + 0.055*"mother" + 0.055*"brother" + 0.054*"health" + 0.054*"eat" + 0.033*"," + 0.033*"like" + 0.033*"spend"'),
 (1,
  '0.060*"drive" + 0.059*"pressur" + 0.059*"." + 0.036*"," + 0.036*"never" + 0.036*"often" + 0.036*"increas" + 0.036*"perform" + 0.036*"seem" + 0.036*"well"')]
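
Both topics give noticeable weight to `"."` and `","`, because the `tokenize` function keeps punctuation tokens. One possible remedy, shown here as a sketch with a hypothetical helper `drop_punctuation`, is to filter such tokens out before stemming:

```python
def drop_punctuation(tokens):
    """Keep only tokens that contain at least one letter or digit."""
    return [t for t in tokens if any(ch.isalnum() for ch in t)]

tokens = ["brocolli", "good", "eat", ".", "brother", "like",
          "eat", "good", "brocolli", ",", "mother", "."]
cleaned = drop_punctuation(tokens)
print(cleaned)
# ['brocolli', 'good', 'eat', 'brother', 'like', 'eat', 'good', 'brocolli', 'mother']
```

Retraining on the filtered tokens would change the topic weights shown above, so the printed output would no longer match exactly.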

<font color = green >

### Classify the new text 

</font>



In [8]:
test_doc_list = ["Some experts suggest that car may cause increased blood pressure. professionals say that brocolli is good "]
test_texts = tokenize(test_doc_list)
test_corpus = [dictionary.doc2bow(text) for text in test_texts ]
test_doc_topics = ldamodel.get_document_topics(test_corpus)
print ('\nget topics:')
for el in test_doc_topics: # loop over all tests in provided list
    print(el)


get topics:
[(0, 0.37417653), (1, 0.62582344)]


<font color = green >

### Sample of topic modeling on large dataset

</font>



In [9]:
from sklearn.feature_extraction.text import CountVectorizer
import pickle

<font color = green >

#### Load "News" data 

</font>



In [10]:
import os
cwd = os.getcwd()
path = cwd  # data files are expected in the current working directory
fn = os.path.join(path, 'newsgroups')

with open(fn, 'rb') as f:
    newsgroup_data = pickle.load(f)

<font color = green >

#### Review data

</font>



In [11]:
print (type(newsgroup_data))
print ('len of documents = {:,}\n'.format(len(newsgroup_data)))

newsgroup_data[0]

<class 'list'>
len of documents = 2,000



"The best group to keep you informed is the Crohn's and Colitis Foundation\nof America.  I do not know if the UK has a similar organization.  The\naddress of\nthe CCFA is \n\nCCFA\n444 Park Avenue South\n11th Floor\nNew York, NY  10016-7374\nUSA\n\nThey have a lot of information available and have a number of newsletters.\n \nGood Luck."

<font color = green >

#### Define custom vectorizer

</font>



In [12]:
# despite the name, this pattern matches tokens of at least three word characters
three_words_pattern = r"\b\w{3,}\b"
vectorizer = CountVectorizer(
    min_df=20, 
    stop_words='english',
    token_pattern=three_words_pattern) 
vectorizer.fit(newsgroup_data)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=20,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='\\b\\w{3,}\\b', tokenizer=None,
        vocabulary=None)
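
To see what the `token_pattern` keeps, the regular expression can be tried on its own. This sketch uses a made-up sentence and only reproduces the tokenization step; `CountVectorizer` additionally removes English stop words and drops terms that occur in fewer than `min_df` documents.

```python
import re

pattern = r"\b\w{3,}\b"  # runs of at least three word characters
sample = "It is a 486 PC and the 2nd drive"

tokens = re.findall(pattern, sample.lower())
print(tokens)
# ['486', 'and', 'the', '2nd', 'drive']
```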

<font color = green >

#### Review features 

</font>



In [13]:
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
print ('len of features = {:,}\n'.format(len(vectorizer.get_feature_names())))
print (vectorizer.get_feature_names()[:40])


len of features = 902

['000', '100', '1990', '1992', '1993', '200', '2nd', '300', '400', '486', '500', '800', 'ability', 'able', 'accept', 'accepted', 'access', 'according', 'actual', 'actually', 'add', 'addition', 'additional', 'address', 'advance', 'advice', 'age', 'ago', 'agree', 'ahead', 'air', 'allow', 'alt', 'america', 'american', 'answer', 'answers', 'anybody', 'apparently', 'appears']


<font color = green >

#### Vectorize data set

</font>



In [14]:
newsgroup_data_vectorized= vectorizer.transform(newsgroup_data)
print (newsgroup_data_vectorized)

  (0, 23)	1
  (0, 33)	1
  (0, 58)	1
  (0, 76)	1
  (0, 326)	1
  (0, 335)	1
  (0, 386)	1
  (0, 409)	1
  (0, 451)	1
  (0, 456)	1
  (0, 515)	1
  (0, 529)	1
  (0, 545)	1
  (0, 727)	1
  (0, 843)	1
  (0, 900)	1
  (1, 33)	1
  (1, 34)	1
  (1, 84)	1
  (1, 184)	1
  (1, 201)	1
  (1, 214)	1
  (1, 231)	2
  (1, 241)	1
  (1, 324)	1
  :	:
  (1998, 622)	1
  (1998, 625)	3
  (1998, 688)	1
  (1998, 698)	2
  (1998, 726)	1
  (1998, 804)	1
  (1998, 805)	1
  (1998, 810)	10
  (1998, 813)	2
  (1998, 814)	1
  (1998, 816)	1
  (1998, 818)	1
  (1998, 844)	1
  (1998, 882)	2
  (1998, 899)	1
  (1999, 171)	1
  (1999, 194)	1
  (1999, 205)	1
  (1999, 213)	1
  (1999, 276)	2
  (1999, 308)	1
  (1999, 344)	1
  (1999, 669)	1
  (1999, 832)	1
  (1999, 874)	1


<font color = green >

#### Create gensim corpus

</font>



In [15]:
corpus = gensim.matutils.Sparse2Corpus(newsgroup_data_vectorized, documents_columns=False)
# comparing to using corpora.Dictionary:
# corpus = [dictionary.doc2bow(text) for text in texts] 
[item for item in corpus][:5]


[[(23, 1),
  (33, 1),
  (58, 1),
  (76, 1),
  (326, 1),
  (335, 1),
  (386, 1),
  (409, 1),
  (451, 1),
  (456, 1),
  (515, 1),
  (529, 1),
  (545, 1),
  (727, 1),
  (843, 1),
  (900, 1)],
 [(33, 1),
  (34, 1),
  (84, 1),
  (184, 1),
  (201, 1),
  (214, 1),
  (231, 2),
  (241, 1),
  (324, 1),
  (332, 1),
  (359, 1),
  (363, 1),
  (365, 1),
  (409, 1),
  (430, 3),
  (451, 1),
  (475, 1),
  (492, 2),
  (525, 2),
  (605, 1),
  (633, 2),
  (642, 1),
  (674, 1),
  (688, 1),
  (709, 1),
  (750, 1),
  (777, 1),
  (823, 1),
  (838, 1),
  (874, 1),
  (896, 1)],
 [(25, 1),
  (26, 1),
  (63, 1),
  (120, 1),
  (231, 1),
  (297, 1),
  (326, 1),
  (344, 1),
  (373, 1),
  (423, 1),
  (442, 1),
  (444, 1),
  (448, 2),
  (465, 1),
  (572, 1),
  (653, 1),
  (659, 1),
  (714, 1),
  (777, 1),
  (779, 1),
  (781, 1),
  (818, 1),
  (836, 1),
  (855, 1),
  (890, 1),
  (898, 1)],
 [(4, 1),
  (17, 2),
  (18, 1),
  (22, 1),
  (42, 1),
  (48, 2),
  (68, 1),
  (78, 1),
  (86, 1),
  (94, 1),
  (117, 1),
  (119, 1)

<font color = green >

#### Create id2word dictionary

</font>



In [16]:
id_map = dict((v, k) for k, v in vectorizer.vocabulary_.items()) 

<font color = green >

#### Generate LDA model

</font>



In [17]:
ldamodel = gensim.models.ldamodel.LdaModel (corpus, num_topics=6, id2word=id_map, passes=25, random_state=34)
# Comparing to corpora.Dictionary
# ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20, random_state= 0)


<font color = green >

#### Review topics

</font>



In [18]:
ldamodel.print_topics(num_topics=6,num_words=10)

[(0,
  '0.025*"edu" + 0.019*"com" + 0.018*"use" + 0.018*"thanks" + 0.016*"does" + 0.015*"know" + 0.011*"mail" + 0.010*"apple" + 0.009*"help" + 0.008*"want"'),
 (1,
  '0.061*"drive" + 0.039*"disk" + 0.030*"scsi" + 0.027*"drives" + 0.027*"hard" + 0.025*"controller" + 0.021*"card" + 0.018*"rom" + 0.016*"cable" + 0.016*"floppy"'),
 (2,
  '0.024*"people" + 0.022*"god" + 0.012*"atheism" + 0.012*"think" + 0.012*"believe" + 0.012*"don" + 0.010*"does" + 0.010*"just" + 0.009*"argument" + 0.009*"say"'),
 (3,
  '0.023*"game" + 0.021*"year" + 0.020*"team" + 0.013*"games" + 0.013*"play" + 0.011*"good" + 0.011*"don" + 0.010*"think" + 0.010*"season" + 0.010*"players"'),
 (4,
  '0.035*"space" + 0.019*"nasa" + 0.018*"data" + 0.013*"information" + 0.013*"available" + 0.013*"center" + 0.011*"ground" + 0.010*"research" + 0.010*"000" + 0.010*"new"'),
 (5,
  '0.017*"just" + 0.017*"like" + 0.016*"don" + 0.012*"car" + 0.012*"time" + 0.011*"think" + 0.011*"good" + 0.010*"know" + 0.008*"way" + 0.008*"people"')]

<font color = green >

#### Name topics

</font>



In [19]:
topics_names= ['Education', 'Computers & IT', 'Religion', 'Sports', 'Science','Society & Lifestyle']

<font color = green >

#### Classify the new text 

</font>



In [20]:
import numpy as np

In [21]:
new_doc = ["\n\nIt's my understanding that the freezing will start to occur because \
of the\ngrowing distance of Pluto and Charon from the Sun, due to it's\nelliptical orbit. \
It is not due to shadowing effects. \n\n\nPluto can shadow Charon, and vice-versa.\n\nGeorge \
Krumins\n-- "] 


In [22]:
doc_vectorized= vectorizer.transform(new_doc) # input param is list
new_doc_corpus = gensim.matutils.Sparse2Corpus(doc_vectorized, documents_columns=False)
doc_topics = ldamodel.get_document_topics(new_doc_corpus)
list(doc_topics)

[[(0, 0.033417594),
  (1, 0.033340912),
  (2, 0.03351659),
  (3, 0.033779576),
  (4, 0.8323054),
  (5, 0.03363992)]]

In [23]:
def elicit_topic_name(doc_topics):    
    return topics_names[np.squeeze(np.array(doc_topics))[:,1].argmax()]
elicit_topic_name(doc_topics)

'Science'
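
`elicit_topic_name` uses the row position of the largest probability as the topic index. That matches the topic id only when every topic is returned; `get_document_topics` leaves out topics whose probability falls below `minimum_probability`. A sketch of a variant that reads the topic id from the pairs instead (the helper name `topic_name_by_id` and the `filtered` example are hypothetical):

```python
def topic_name_by_id(doc_topic_pairs, names):
    """Return the name of the most probable topic from (topic_id, prob) pairs."""
    topic_id, _ = max(doc_topic_pairs, key=lambda pair: pair[1])
    return names[topic_id]

topics_names = ['Education', 'Computers & IT', 'Religion', 'Sports',
                'Science', 'Society & Lifestyle']

# Pairs copied from the output above
full = [(0, 0.033417594), (1, 0.033340912), (2, 0.03351659),
        (3, 0.033779576), (4, 0.8323054), (5, 0.03363992)]
# Hypothetical result in which low-probability topics were left out
filtered = [(2, 0.1), (4, 0.9)]

print(topic_name_by_id(full, topics_names))      # Science
print(topic_name_by_id(filtered, topics_names))  # Science
```

In the notebook this would be called on one document's result, for example `topic_name_by_id(list(doc_topics)[0], topics_names)`.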

<font color = green >

## Home Task 

</font>


<font color = green >

### Topic Modeling 

</font>

[voted-kaggle-dataset](https://www.kaggle.com/canggih/voted-kaggle-dataset/version/2#voted-kaggle-dataset.csv)

In [84]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
from nltk.corpus import stopwords
import pandas as pd
import gensim
import re

fn=  os.path.join(path , 'voted-kaggle-dataset.csv')
df = pd.read_csv(fn)

In [85]:
df.head(50)

Unnamed: 0,Title,Subtitle,Owner,Votes,Versions,Tags,Data Type,Size,License,Views,Download,Kernels,Topics,URL,Description
0,Credit Card Fraud Detection,Anonymized credit card transactions labeled as...,Machine Learning Group - ULB,1241,"Version 2,2016-11-05|Version 1,2016-11-03",crime\nfinance,CSV,144 MB,ODbL,"442,136 views","53,128 downloads","1,782 kernels",26 topics,https://www.kaggle.com/mlg-ulb/creditcardfraud,The datasets contains transactions made by cre...
1,European Soccer Database,"25k+ matches, players & teams attributes for E...",Hugo Mathien,1046,"Version 10,2016-10-24|Version 9,2016-10-24|Ver...",association football\neurope,SQLite,299 MB,ODbL,"396,214 views","46,367 downloads","1,459 kernels",75 topics,https://www.kaggle.com/hugomathien/soccer,The ultimate Soccer database for data analysis...
2,TMDB 5000 Movie Dataset,"Metadata on ~5,000 movies from TMDb",The Movie Database (TMDb),1024,"Version 2,2017-09-28",film,CSV,44 MB,Other,"446,255 views","62,002 downloads","1,394 kernels",46 topics,https://www.kaggle.com/tmdb/tmdb-movie-metadata,Background\nWhat can we say about the success ...
3,Global Terrorism Database,"More than 170,000 terrorist attacks worldwide,...",START Consortium,789,"Version 2,2017-07-19|Version 1,2016-12-08",crime\nterrorism\ninternational relations,CSV,144 MB,Other,"187,877 views","26,309 downloads",608 kernels,11 topics,https://www.kaggle.com/START-UMD/gtd,"Context\nInformation on more than 170,000 Terr..."
4,Bitcoin Historical Data,Bitcoin data at 1-min intervals from select ex...,Zielak,618,"Version 11,2018-01-11|Version 10,2017-11-17|Ve...",history\nfinance,CSV,119 MB,CC4,"146,734 views","16,868 downloads",68 kernels,13 topics,https://www.kaggle.com/mczielinski/bitcoin-his...,Context\nBitcoin is the longest running and mo...
5,"Kaggle ML and Data Science Survey, 2017",A big picture view of the state of data scienc...,Kaggle,574,"Version 4,2017-10-28|Version 3,2017-10-03|Vers...",employment\nsociology\nartificial intelligence,CSV,28 MB,ODbL,"95,587 views","9,390 downloads",244 kernels,10 topics,https://www.kaggle.com/kaggle/kaggle-survey-2017,"Context\nFor the first time, Kaggle conducted ..."
6,Iris Species,Classify iris plants into three species in thi...,UCI Machine Learning,512,"Version 2,2016-09-27|Version 1,2016-01-12",botany,SQLite,15 KB,CC0,"162,706 views","24,361 downloads","3,394 kernels",14 topics,https://www.kaggle.com/uciml/iris,The Iris dataset was used in R.A. Fisher's cla...
7,World Development Indicators,Explore country development indicators from ar...,World Bank,468,"Version 2,2017-05-02|Version 1,2016-01-28",economics\ninternational relations,CSV,2 GB,Other,"134,038 views","20,364 downloads",389 kernels,5 topics,https://www.kaggle.com/worldbank/world-develop...,The World Development Indicators from the Worl...
8,Daily News for Stock Market Prediction,Using 8 years daily news headlines to predict ...,Aaron7sun,438,"Version 1,2016-08-25",news agencies\nfinance,CSV,14 MB,CC4,"95,770 views","10,820 downloads",293 kernels,8 topics,https://www.kaggle.com/aaron7sun/stocknews,"Actually, I prepare this dataset for students ..."
9,Pokemon with stats,721 Pokemon with stats and types,Alberto Barradas,428,"Version 2,2016-08-29|Version 1,2016-08-23",popular culture\ngames and toys\nvideo games,CSV,43 KB,CC0,"133,256 views","16,610 downloads",706 kernels,13 topics,https://www.kaggle.com/abcsds/pokemon,"This data set includes 721 Pokemon, including ..."


In [86]:
print ('len of texts= {:,}'.format(len(df)))
index = 10 
df.loc[index, 'Description']

len of texts= 2,150


'These files contain complete loan data for all loans issued through the 2007-2015, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. The file containing loan data through the "present" contains complete loan data for all loans issued through the previous completed calendar quarter. Additional features include credit scores, number of finance inquiries, address including zip codes, and state, and collections among others. The file is a matrix of about 890 thousand observations and 75 variables. A data dictionary is provided in a separate file. k'

In [87]:
df_data_set = df["Description"]
# df["Description"].apply(pd.value_counts)  # Note: does not work for non-numeric types

In [90]:
en_stop  = set(stopwords.words('english'))
p_stemmer = PorterStemmer()

def tokenize(df_data_set):
    texts = []
    for doc in df_data_set:
        # tokenize document string
        doc = str(doc)
        raw = doc.lower()
        tokens = word_tokenize(raw)

        # remove stop words from tokens
        tokens = [token for token in tokens if token not in en_stop]
        
        # remove tokens of 3 characters or fewer
        tokens = [token for token in tokens if len(token) > 3]
                
        # stem tokens
        tokens = [p_stemmer.stem(token) for token in tokens]

        # add tokens to list
        texts.append(tokens)
      
    return texts

texts = tokenize(df_data_set)
texts[0]

['dataset',
 'contain',
 'transact',
 'made',
 'credit',
 'card',
 'septemb',
 '2013',
 'european',
 'cardhold',
 'dataset',
 'present',
 'transact',
 'occur',
 'day',
 'fraud',
 '284,807',
 'transact',
 'dataset',
 'highli',
 'unbalanc',
 'posit',
 'class',
 'fraud',
 'account',
 '0.172',
 'transact',
 'contain',
 'numer',
 'input',
 'variabl',
 'result',
 'transform',
 'unfortun',
 'confidenti',
 'issu',
 'provid',
 'origin',
 'featur',
 'background',
 'inform',
 'data',
 'featur',
 'princip',
 'compon',
 'obtain',
 'featur',
 'transform',
 "'time",
 "'amount",
 'featur',
 "'time",
 'contain',
 'second',
 'elaps',
 'transact',
 'first',
 'transact',
 'dataset',
 'featur',
 "'amount",
 'transact',
 'amount',
 'featur',
 'use',
 'example-depend',
 'cost-senst',
 'learn',
 'featur',
 "'class",
 'respons',
 'variabl',
 'take',
 'valu',
 'case',
 'fraud',
 'otherwis',
 'given',
 'class',
 'imbal',
 'ratio',
 'recommend',
 'measur',
 'accuraci',
 'use',
 'area',
 'precision-recal',
 'curv'

In [56]:
# moby_frequencies = FreqDist(nltk.word_tokenize(texts))


In [91]:
dictionary = corpora.Dictionary(texts) # builds the id <-> term mapping; an alternative to using CountVectorizer
print (type(dictionary), dictionary)
for k,w in dictionary.items():
    print (k,w)

<class 'gensim.corpora.dictionary.Dictionary'> Dictionary(32459 unique tokens: ["'amount", "'class", "'time", '//mlg.ulb.ac.b', '//mlg.ulb.ac.be/artml']...)
0 'amount
1 'class
2 'time
3 //mlg.ulb.ac.b
4 //mlg.ulb.ac.be/artml
5 //mlg.ulb.ac.be/bruf
6 0.172
7 2013
8 2015
9 284,807
10 account
11 accuraci
12 amount
13 analys
14 andrea
15 area
16 auprc
17 avail
18 background
19 bontempi
20 bruxel
21 caelen
22 calibr
23 card
24 cardhold
25 case
26 cidm
27 cite
28 class
29 classif
30 collabor
31 collect
32 compon
33 comput
34 confidenti
35 confus
36 contain
37 cost-senst
38 credit
39 current
40 curv
41 data
42 dataset
43 day
44 detail
45 detect
46 elaps
47 european
48 example-depend
49 featur
50 first
51 fraud
52 gianluca
53 given
54 group
55 highli
56 http
57 ieee
58 imbal
59 inform
60 input
61 intellig
62 issu
63 johnson
64 learn
65 libr
66 machin
67 made
68 matrix
69 meaning
70 measur
71 mine
72 numer
73 obtain
74 occur
75 olivi
76 origin
77 otherwis
78 past
79 pleas
80 posit
81 pozzolo
82

764 unintuit
765 unix
766 variou
767 viewer
768 volum
769 wait
770 weight
771 -freeformresponses.csv
772 1000
773 16,000
774 16,716
775 16.4
776 25th
777 allow
778 attempt
779 august
780 break
781 calcul
782 channel
783 choic
784 column-wis
785 compens
786 comprehens
787 conduct
788 convers
789 conversionrates.csv
790 correspond
791 cut
792 day-to-day
793 decod
794 describ
795 edg
796 email
797 employ
798 establish
799 exclud
800 flag
801 folk
802 form
803 free
804 freeform
805 freeformresponses.csv
806 happen
807 ident
808 industry-wid
809 initi
810 interact
811 kind
812 less
813 live
814 median
815 month
816 multipl
817 multiplechoiceresponses.csv
818 non-random
819 novemb
820 open-end
821 option
822 packag
823 primarili
824 proceed
825 protect
826 pud
827 quantmod
828 rank
829 read
830 receiv
831 regard
832 relev
833 respond
834 respondenttypereadme.txt
835 salari
836 schema.csv
837 scientist
838 segment
839 shown
840 singl
841 spam
842 survey
843 survey-tak
844 territori
845 total


1513 today
1514 un-reli
1515 under-explain
1516 unexplain
1517 unpleas
1518 utopia
1519 well-b
1520 whole
1521 width
1522 //github.com/zalandoresearch/fashion-mnist
1523 //pjreddie.com/projects/mnist-in-csv/
1524 //tech.zalando.com
1525 10,000
1526 255.
1527 28x28
1528 60,000
1529 ai/ml/data
1530 and/or
1531 ankl
1532 aris
1533 ascii-diagram
1534 assign
1535 boot
1536 charg
1537 claim
1538 cloth
1539 coat
1540 commun
1541 connect
1542 convert
1543 copi
1544 copyright
1545 dark
1546 darker
1547 decompos
1548 direct
1549 dress
1550 drop-in
1551 fail
1552 fashion-mnist
1553 fit
1554 furnish
1555 grayscal
1556 handwritten
1557 height
1558 herebi
1559 holder
1560 images—consist
1561 impli
1562 integ
1563 intend
1564 left
1565 licens
1566 light
1567 merchant
1568 merg
1569 mnist
1570 noninfring
1571 permit
1572 pixel
1573 pixel-valu
1574 pixel31
1575 pullov
1576 remain
1577 rest
1578 restrict
1579 sandal
1580 seek
1581 sell
1582 shirt
1583 size
1584 sneaker
1585 sublicens
1586 substanti
1587

2263 librari
2264 neutral
2265 reformat
2266 rude
2267 tweet
2268 u.s.
2269 0hz-280hz
2270 140hz
2271 3,168
2272 absolut
2273 accumul
2274 acoust
2275 adjac
2276 appar
2277 baselin
2278 carnegi
2279 cart
2280 centroid
2281 cmu_arct
2282 corpu
2283 data-set
2284 dfrang
2285 diagram
2286 domin
2287 energi
2288 entropi
2289 falsetto
2290 festvox
2291 flat
2292 forest
2293 harvard-haskin
2294 incorrectli
2295 interquantil
2296 kurt
2297 kurtosi
2298 logist
2299 maxdom
2300 maxfun
2301 mcgill
2302 meandom
2303 meanfreq
2304 meanfun
2305 mellon
2306 mindom
2307 minfun
2308 mode
2309 modindx
2310 peakf
2311 pre-process
2312 quantil
2313 regular
2314 regularly-tim
2315 reson
2316 seewav
2317 signal
2318 skew
2319 sp.ent
2320 specprop
2321 spectral
2322 speech
2323 telecommun
2324 threshold
2325 tuner
2326 vocal
2327 voic
2328 voxforg
2329 xgboost
2330 0.2882
2331 0.3843
2332 0.3881
2333 0.5438
2334 0.7813
2335 03/01/2018
2336 0x375923bf82f0b728d23a5704261a6e16341fd860
2337 1,382
2338 105.21
23

3013 suffici
3014 whizzo
3015 yeah
3016 ..lub……….dub………..………
3017 //www.peterjbentley.com/heartchallenge/index.html
3018 32gb
3019 abnorm
3020 adult
3021 aistat
3022 anywher
3023 artifact
3024 asterisk*
3025 audio
3026 automat
3027 beat
3028 bentley
3029 bodi
3030 breath
3031 brush
3032 calm
3033 chsc2011
3034 citat
3035 clear
3036 clinic
3037 clip
3038 coimbra
3039 consider
3040 cycl
3041 devic
3042 digiscop
3043 discern
3044 diseas
3045 distinguish
3046 distort
3047 domain-specif
3048 dub-dub
3049 dub.dub………………….lub.……….dub.dub………………….lub……..…….dub
3050 dub…**….lub
3051 dub…*….lub
3052 dub……
3053 dub……………
3054 earlier
3055 eart
3056 echo
3057 enabl
3058 esult
3059 extrasystol
3060 fluid
3061 fname
3062 fourier
3063 fragment
3064 gold-standard
3065 halleng
3066 heartbeat
3067 hospit
3068 howpublish
3069 ignor
3070 illustr
3071 ipad
3072 istethoscop
3073 lassifi
3074 low-pass
3075 lub-lub
3076 lub..*..dub
3077 lub……….dub…
3078 lub……….dub……………
3079 lub……….dub…………….lub.lub……..…….dub……
30

3763 pay_4
3764 pay_5
3765 pay_6
3766 pay_amt1
3767 pay_amt2
3768 pay_amt3
3769 pay_amt4
3770 pay_amt5
3771 pay_amt6
3772 repay
3773 statement
3774 taiwan
3775 montcoalert.org
3776 montgomeri
3777 walk-through
3778 1988
3779 adap
3780 constraint
3781 diagnost
3782 dickson
3783 digest
3784 everhart
3785 heritag
3786 indian
3787 j.e.
3788 j.w.
3789 johann
3790 kidney
3791 knowler
3792 mellitu
3793 onset
3794 pima
3795 press
3796 w.c.
3797 'candid
3798 'confirm
3799 'dim
3800 .fit
3801 2016.
3802 3198
3803 5050
3804 5087
3805 artefact
3806 astrophysicist
3807 blue
3808 boost
3809 datafil
3810 de-nois
3811 embark
3812 emit
3813 exoplanet
3814 exoplanet-star
3815 felt
3816 flux
3817 imagin
3818 kepler
3819 late-summ
3820 lessmann
3821 mikulski
3822 multi-planet
3823 non-exoplanet-star
3824 orbit
3825 partial
3826 planet
3827 satellit
3828 solidifi
3829 starlight
3830 telescop
3831 testset
3832 trainset
3833 undiscov
3834 vector
3835 wavelength
3836 wrongli
3837 author_flair_css_class
3838 a

4512 cave
4513 combat
4514 intelygenz
4515 intelygenz.com
4516 t7pokemonchalleng
4517 weedl
4518 //www.jeannicholashould.com/python-web-scraping-tutorial-for-craft-beers.html
4519 2410
4520 beer
4521 blogpost
4522 breweri
4523 craftcans.com
4524 tidi
4525 1971.
4526 24,000
4527 chrismel
4528 closest
4529 farthest
4530 oversea
4531 pike
4532 proxim
4533 roaster
4534 spice
4535 starbuck
4536 subsidiari
4537 //www.kaggle.com/zohaib1111
4538 6,000+
4539 afghanistan
4540 allah
4541 bangla
4542 berg
4543 bhasha
4544 chinese/madarin
4545 contemporari
4546 dutch
4547 east
4548 einar
4549 finest
4550 geo-polit
4551 hamidullah
4552 hayek
4553 heaven
4554 hell
4555 humil
4556 indoenisan
4557 indonesia
4558 italian
4559 jalandhari
4560 japanes
4561 kalifa
4562 keyzer
4563 makarim
4564 malay
4565 malayalam
4566 messeng
4567 muham
4568 muslim
4569 norwegian
4570 ozturk
4571 peac
4572 persian
4573 piccardo
4574 prophet
4575 rashad
4576 recit
4577 russian
4578 sacr
4579 selv
4580 sheerazi
4581 swedish

5262 2.6m
5263 2001-2017
5264 2003-02-19
5265 2017-12-31
5266 ascii
5267 australia
5268 australian
5269 clickbait
5270 corp.
5271 crisi
5272 disast
5273 doi:10.7910/dvn/sybgzl
5274 early-2003
5275 ecolog
5276 end-2017
5277 fairli
5278 headline_text
5279 india
5280 jsoup
5281 kernal
5282 kulkarni
5283 lowercas
5284 noteworthi
5285 publish_d
5286 reput
5287 rohit
5288 shape
5289 summaris
5290 yyyymmdd
5291 'sudoku.csv
5292 'sudoku_quizzes.npi
5293 'sudoku_solutions.npi
5294 .read
5295 .splitlin
5296 //github.com/kyubyong/sudoku
5297 //www.ocf.berkeley.edu/~arel/sudoku/main.html
5298 1000000
5299 arel
5300 cnn
5301 cordero
5302 crack
5303 dozen
5304 enumer
5305 inclin
5306 line.split
5307 np.int32
5308 np.load
5309 np.zero
5310 numpi
5311 puzzl
5312 quiz
5313 quizz
5314 quizzes.reshap
5315 reinforc
5316 restor
5317 solutions.reshap
5318 subgrid
5319 sudoku
5320 sudoku-solv
5321 25,000
5322 accident
5323 cms.gov
5324 credenti
5325 non-opioid
5326 opiat
5327 opioid
5328 opioid.prescrib
5329

6012 1957
6013 24-hour
6014 8-hour
6015 aircompar
6016 airdata
6017 assur
6018 back-end
6019 bigquery-public-data.epa_historical_air_quality.
6020 cent
6021 data-submitt
6022 exempt
6023 gateway
6024 mart
6025 naaq
6026 percentil
6027 pm2.5
6028 readi
6029 routin
6030 sampler
6031 web-bas
6032 wider
6033 400,000
6034 canon
6035 knowledge-shar
6036 logic
6037 non-commerci
6038 principl
6039 quora
6040 readership
6041 seeker
6042 semant
6043 truli
6044 writer
6045 1,663
6046 1-1-16
6047 1-1-16_5-31-17_weather
6048 11-28-17
6049 135.1978525
6050 139.668220854814
6051 1663
6052 2-6-18
6053 34.6951242
6054 35.6436746642265
6055 4690
6056 5-31-17
6057 _store_info_with_nearest_active_st
6058 add
6059 air_station_dist
6060 air_station_distances.csv
6061 air_store_info
6062 air_store_info.csv
6063 air_store_info_with_nearest_active_station.csv
6064 antarct
6065 avg_humid
6066 avg_local_pressur
6067 avg_sea_pressur
6068 avg_temperatur
6069 avg_vapor_pressur
6070 avg_wind_spe
6071 calendar_d
6072

6762 multidisciplinari
6763 n-bodi
6764 nake
6765 nbody6
6766 neighbour
6767 outskirt
6768 particl
6769 position-veloc
6770 pure
6771 realist
6772 s.~j
6773 seoul
6774 so-cal
6775 spheric
6776 springer
6777 stelar
6778 stellar
6779 supercomput
6780 symmetr
6781 three-dimension
6782 uniti
6783 workstat
6784 xxxx
6785 yonsei
6786 shark
6787 1,333
6788 2,539
6789 2004-2016
6790 2004.
6791 3,558
6792 administ
6793 aggrav
6794 al-qaeeda
6795 authent
6796 bashir
6797 bomb
6798 brown
6799 bush
6800 canvass
6801 drone
6802 fata
6803 hira
6804 increase/decreas
6805 khyber
6806 milit
6807 newspap
6808 obama
6809 oldest
6810 outfit
6811 pakhtunkhwa
6812 pakistan
6813 pakistanbodycount.org
6814 pakistandroneattacks.csv
6815 predat
6816 reaper
6817 strike
6818 suicid
6819 taliban
6820 tank
6821 tenur
6822 usmani
6823 www.pakistanbodycount.org
6824 zeeshan-ul-hassan
6825 +70k
6826 pre-procces
6827 10,265
6828 12.2
6829 8.3-percent
6830 9,943
6831 alcohol-impair
6832 far
6833 motorcyclist
6834 nhtsa


7511 de-identif
7512 dicom
7513 doi:10.1088/0031-9155/60/14/5471
7514 fat-suppress
7515 fdg-pet
7516 fdg-pet/ct
7517 freeman
7518 gmail.com
7519 histolog
7520 isotrop
7521 issam
7522 mart.vallier
7523 metastas
7524 montreal
7525 naqa
7526 pre-treat
7527 proven
7528 radiom
7529 rtstruct
7530 sarcoma
7531 skamen
7532 soft
7533 soft-tissu
7534 sonia
7535 stss
7536 t1-weight
7537 t2-weight
7538 vallièr
7539 voxel
7540 chief
7541 economist
7542 humphri
7543 industry-lead
7544 mortgage-rel
7545 nationwid
7546 regardless
7547 stan
7548 user-gener
7549 zestim
7550 5490
7551 bigquery-public-data.openaq.
7552 near
7553 openaq
7554 openaq.org
7555 pollution.
7556 422,937
7557 alphanumer
7558 hostnam
7559 midnight
7560 cultur
7561 datablog
7562 endanger
7563 extinct
7564 grandpar
7565 infrequ
7566 romani
7567 sicilian
7568 spoken
7569 tongu
7570 vulner
7571 youngest
7572 coal
7573 non-convent
7574 renew
7575 undata
7576 //go.nature.com/2fcox5
7577 //go.nature.com/2fz4afr
7578 //www.istat.it/en/
75

8261 premis
8262 procur
8263 production/process
8264 prostitut
8265 railway
8266 ransom
8267 rape
8268 received/cas
8269 recidiv
8270 remand
8271 restraint
8272 reveng
8273 river
8274 robberi
8275 safest
8276 santion
8277 sati
8278 sc/st
8279 scooter
8280 scs/st
8281 setup
8282 sex-wis
8283 slaveri
8284 stolen
8285 sub-inspecto
8286 sub-inspector
8287 subordin
8288 terrorists/extremist
8289 tortur
8290 trucks/tempo
8291 unidentifi
8292 viii
8293 wire
8294 xiii
8295 xv-a
8296 xvii
8297 xviii
8298 appetit
8299 carbo
8300 carbohydr
8301 cereal
8302 cold
8303 dragicev
8304 fiber
8305 fruiti
8306 isenberg
8307 jansen
8308 kellogg
8309 mill
8310 milligram
8311 nabisco
8312 oat
8313 ounc
8314 pebbl
8315 petra
8316 pierr
8317 potass
8318 purina
8319 quaker
8320 ralston
8321 ruin
8322 shelf
8323 sugari
8324 yvonn
8325 5-10
8326 tv-14
8327 tv-g
8328 tv-i
8329 tv-ma
8330 vast
8331 wife
8332 2050.
8333 chunk
8334 east-west
8335 entrepreneur
8336 factori
8337 farm
8338 harder
8339 leader
8340 lives

9011 inappropri
9012 joke
9013 //developer.spotify.com/web-api/get-audio-features/
9014 //opendatascience.com/blog/a-machine-learning-deep-dive-into-my-spotify-data/
9015 danceabl
9016 duration_m
9017 speechi
9018 spotipi
9019 tempo
9020 time_signatur
9021 valenc
9022 wrapper
9023 //pubgtracker.com/
9024 //www.playbattlegrounds.com/main.pu
9025 85,000
9026 87,898
9027 addict
9028 bluehol
9029 game-play
9030 pan
9031 pubgtracker.com
9032 subdivid
9033 trademark
9034 walking/rid
9035 //basketball.realgm.com/info/glossari
9036 //www.basketball-reference.com/about/glossary.html
9037 //www.nbastuffer.com/team-evaluation-metrics/
9038 forcast
9039 officialboxscore.csv
9040 playerboxscore.csv
9041 standing.csv
9042 teamboxscore.csv
9043 xmlstat
9044 //ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/temporaryaccess/
9045 //github.com/stephanerappeneau/scienceofmovi
9046 //github.com/stephanerappeneau/scienceofmovies/tree/master/pycharmprojects/getallmovi
9047 //www.kaggle.com/netflix-inc/netflix-pri

9761 year-end
9762 american_indian_alaskan_n
9763 asian
9764 black_or_african_american
9765 eeo-1
9766 executive/senior
9767 first/mid
9768 helper
9769 job_categori
9770 latino
9771 mgr
9772 native_hawaiian_or_pacific_island
9773 no-data
9774 non-binari
9775 overall_tot
9776 previous_tot
9777 rangarajan
9778 revealnews.org
9779 sinduja
9780 srangarajan
9781 two_or_more_rac
9782 740k
9783 by-igo
9784 destabil
9785 famin
9786 policymak
9787 refuge
9788 stapl
9789 upheav
9790 //factfinder.census.gov/faces/nav/jsf/pages/community_facts.xhtml
9791 jan.
9792 unreport
9793 'abnorm
9794 biomechan
9795 column_2c_weka.csv
9796 column_3c_weka.csv
9797 lordosi
9798 pelvi
9799 pelvic
9800 sacral
9801 slope
9802 spondylolisthesi
9803 tilt
9804 -9999.
9805 //www.ncdc.noaa.gov/ghcnm/v3.php
9806 180°e
9807 180°w
9808 1880
9809 1997
9810 2,592
9811 3.2.1
9812 85-90
9813 85-90n
9814 90°
9815 90°n
9816 anomali
9817 departur
9818 easterl
9819 ghcn
9820 ghcn-m
9821 ghcn-monthli
9822 homogen
9823 menn
9824 m

10510 pbystand
10511 pfiet
10512 pgezong
10513 pinbo
10514 pleven
10515 pmotsco
10516 porchless
10517 postal
10518 ppersaut
10519 ppersong
10520 pplezier
10521 provinci
10522 ptractor
10523 putten
10524 pvraaut
10525 pwabedr
10526 pwaland
10527 pwaoreg
10528 pwapart
10529 pwerkt
10530 pzeilpl
10531 relig
10532 sentient
10533 smr.nl
10534 socio-demograph
10535 someren
10536 surfboard
10537 tractor
10538 trailer
10539 downvot
10540 state-by-st
10541 tourism
10542 depriv
10543 dimensions—reproduct
10544 disadvantag
10545 empower
10546 gender-bas
10547 ihdi
10548 inequality-adjust
10549 market—for
10550 micro
10551 multidimension
10552 absorb
10553 absorpt
10554 biocapac
10555 built-up
10556 conceiv
10557 cotton
10558 cropland
10559 deficit
10560 emiss
10561 exce
10562 graze
10563 greatest
10564 hectar
10565 mathi
10566 meat
10567 overfish
10568 overshoot
10569 plant-bas
10570 ree
10571 sea
10572 timber
10573 unharvest
10574 wackernagel
10575 wast
10576 wood
10577 //coinmarketcap.com/
1057

11260 slava
11261 state-of-the-union
11262 syntact
11263 ungdc
11264 admin1
11265 ambigu
11266 datapackag
11267 geonam
11268 geonameid
11269 github/kaggl
11270 guaranti
11271 lexman
11272 slight
11273 subcountri
11274 toponym
11275 vatican
11276 3-letter
11277 3209
11278 4-letter
11279 59036
11280 8859-1
11281 codeshar
11282 iata
11283 icao
11284 latin-1
11285 mapper
11286 openflight
11287 openflights.org
11288 openflights/airlin
11289 //stats.nba.com/lead
11290 //www.basketball-reference.com/teams/
11291 20season
11292 cf=min*g*2
11293 permode=tot
11294 season=2014-15
11295 seasontype=regular
11296 statcategory=min
11297 //arxiv.org/abs/1702.05373v1
11298 //www.nist.gov/itl/iad/image-group/emnist-dataset
11299 103,600
11300 112,800
11301 116,323
11302 131,600
11303 14,800
11304 18,800
11305 240,000
11306 26-class
11307 2751
11308 280,000
11309 40,000
11310 697,932
11311 814,255
11312 88,800
11313 acknowldg
11314 afshar
11315 byclass
11316 bymerg
11317 class_label
11318 cohen
11319 com

12010 ref_university_type.csv
12011 regional_center.csv
12012 scholarship.csv
12013 staff_quarter.csv
12014 standalone_institution.csv
12015 standalone_institution_accreditation.csv
12016 standalone_institution_department.csv
12017 standalone_institution_faculty.csv
12018 standalone_institution_non_teaching_staff_count.csv
12019 standalone_institution_student_hostel.csv
12020 standalone_institution_teaching_staff.csv
12021 standalone_institution_teaching_staff_sanctioned_strength.csv
12022 student_hostel.csv
12023 teaching_staff.csv
12024 teaching_staff_count.csv
12025 teaching_staff_sanctioned_strength.csv
12026 university.csv
12027 university_accreditation.csv
12028 university_department.csv
12029 university_enrolled_distance_student.csv
12030 university_faculty.csv
12031 university_non_teaching_staff_count.csv
12032 university_private_students_result.csv
12033 university_student_hostel.csv
12034 university_teaching_staff.csv
12035 university_teaching_staff_sanctioned_strength.csv
12

12760 relief
12761 steadi
12762 typhoon
12763 user.anon
12764 14-65
12765 dalia
12766 n=9.649
12767 neopoli
12768 perus
12769 region/countri
12770 zurich
12771 4,700
12772 hscic
12773 majesti
12774 smoking-rel
12775 //rhythm-of-food.net/
12776 moritz
12777 128x128x3
12778 aquilegia
12779 bellflow
12780 calendula
12781 flower-labels.csv
12782 goldquel
12783 laciniata
12784 leucanthemum
12785 peoni
12786 phlox
12787 rudbeckia
12788 trivial
12789 viola
12790 //www.bdfutbol.com
12791 //www.resultados-futbol.com
12792 1970-71
12793 1977-78
12794 247014000
12795 30/10/1977
12796 4808
12797 4808,1977-78,1,8
12798 división
12799 fecha
12800 localgo
12801 localteam
12802 madrid
12803 madrid,3,2,30/10/1977,247014000
12804 rayo
12805 vallecano
12806 visitorgo
12807 visitorteam
12808 //jmcauley.ucsd.edu/data/amazon/
12809 5-core
12810 982619
12811 a3sptokdg7wbln
12812 asin
12813 b000fa64pk
12814 kindl
12815 review/
12816 reviewerid
12817 reviews/
12818 reviewtext
12819 reviewtim
12820 ucsd
12821 u

13509 lovel
13510 newsmak
13511 newsworthi
13512 stalin
13513 statesmen
13514 unforgett
13515 .json
13516 amirhossein
13517 circuit
13518 cuff
13519 cuff-less
13520 cuffless
13521 data_bp
13522 data_ecg
13523 data_fsr
13524 data_pcg
13525 data_ppg
13526 diastol
13527 electrocardiogram
13528 esmaili
13529 fingertip
13530 force-sens
13531 iscas'17
13532 kachue
13533 mahdi
13534 non-invas
13535 phonocardiogram
13536 photoplethysmogram
13537 puls
13538 resistor
13539 shabani
13540 sharif
13541 systol
13542 tehran
13543 'colt
13544 'filly'
13545 'geld
13546 'hors
13547 'mare
13548 'rig
13549 1,104
13550 2-charact
13551 actual_weight
13552 behind_sec1
13553 behind_sec2
13554 behind_sec3
13555 behind_sec4
13556 behind_sec5
13557 behind_sec6
13558 benter
13559 bolton
13560 chapman
13561 computeris
13562 config
13563 declared_weight
13564 dirt
13565 finish_tim
13566 gear
13567 giant
13568 hkjc
13569 horse_ag
13570 horse_countri
13571 horse_gear
13572 horse_no
13573 horse_r
13574 horse_typ
13575

14259 goldberg
14260 hausdorff
14261 health-car
14262 high-accuraci
14263 iscas'15
14264 ivanov
14265 j.mietu
14266 kiani
14267 mmhg
14268 mohammadzad
14269 mohammadzadeh
14270 moodi
14271 photoplethysmograph
14272 physionet
14273 physionet.org
14274 physiotoolkit
14275 stanley
14276 â€œphysiobank
14277 'recip
14278 /u/stuck_in_the_matrix
14279 pm_me_pix_of_ur_cat
14280 pm_me_your_success
14281 pmmeyourrgb
14282 wholesom
14283 //datacenter.aucegypt.edu/shazeem/
14284 //link.springer.com/chapter/10.1007/978-3-319-48308-5_54
14285 //www.kaggle.com/mloey1/ahcd1
14286 566-575
14287 60000
14288 99.15
14289 lenet-5
14290 madbas
14291 //medium.freecodecamp.com/we-asked-20-000-people-who-they-are-and-how-theyre-learning-to-code-fff5d668969
14292 //api.nginov
14293 //earthquake.usgs.gov/
14294 1986.05.04
14295 2016.05.04
14296 6.1+
14297 arc
14298 astroposit
14299 azimuth
14300 dextrorotatori
14301 geographical
14302 humanread
14303 nginov
14304 param
14305 ueki
14306 usg
14307 yyyy.mm.dd
14308

15009 //www.chadwick-bureau.com
15010 //www.retrosheet.org
15011 2015-01-24
15012 allstarful
15013 awardsmanag
15014 awardssharemanag
15015 awardsshareplay
15016 battingpost
15017 birthdat
15018 chadwick
15019 collegeplay
15020 fieldingof
15021 fieldingpost
15022 fieldingpost-
15023 halloffam
15024 homegam
15025 homegames.csv
15026 managershalf
15027 parks.csv
15028 pitchingpost
15029 playerid
15030 retrosheet
15031 teamfranchis
15032 978-0-470-47565-2
15033 auction.csv
15034 auction_typ
15035 auctionid
15036 bead
15037 bidder
15038 bidder.volum
15039 bidderr
15040 bidtim
15041 cartier
15042 galit
15043 jank
15044 m515
15045 openbid
15046 palm
15047 pda
15048 seller.volum
15049 shmueli
15050 son
15051 swarovski.csv
15052 swarowski
15053 wiley
15054 wolfgang
15055 wristwatch
15056 xbox
15057 'angri
15058 austria
15059 authoritarian
15060 brainpow
15061 bremain
15062 brexit
15063 dilig
15064 europuls
15065 forgotten
15066 fuell
15067 haidt
15068 laid
15069 n=1.052
15070 n=11.283
15071 om

15759 1200
15760 24000
15761 8000
15762 no-plan
15763 nose
15764 planesnet
15765 planesnet.json
15766 planesnet.zip
15767 //fred.stlouisfed.org/series/ipg3113n
15768 confectioneri
15769 frenet
15770 ipg3113n
15771 nondur
15772 short-liv
15773 stun
15774 //cdd.publicsafety.gc.ca
15775 //www.publicsafety.gc.ca/cnt/ntcs/trms-en.aspx
15776 //www.publicsafety.gc.ca/cnt/rsrcs/cndn-dsstr-dtbs/index-en.aspx
15777 1900
15778 affected/injured/infected/evacu
15779 costs/loss
15780 costs/pay
15781 damage/interrupt
15782 dfaa
15783 distast
15784 evacue
15785 injured/infect
15786 interrupted/affect
15787 national/intern
15788 natural/artif
15789 non-comer
15790 non-natru
15791 pertin
15792 province/territori
15793 provincial/territori
15794 roll-up
15795 1986-2016
15796 6820
15797 interven
15798 autonom
15799 deer
15800 opensprayer.com
15801 spray
15802 sprayer
15803 weed
15804 132,850
15805 address_id
15806 dcgisopendata
15807 etim
15808 issue_tim
15809 month_of_year
15810 rowid_
15811 rp_plate_st


16508 alfr
16509 diploma
16510 laureat
16511 laurel
16512 riksbank
16513 stockholm
16514 sverig
16515 2004-mid
16516 caer
16517 cfsan
16518 cosmet
16519 food-rel
16520 foodstuff
16521 meddra
16522 readme.pdf
16523 terminolog
16524 'high-accuraci
16525 'tappi
16526 //journals.plos.org/plosone/articl
16527 ethic
16528 h17013
16529 id=10.1371/journal.pone.0188226
16530 keystrok
16531 parkinson
16532 sec008
16533 sturt
16534 tappi
16535 trove
16536 warwick
16537 //csc.lsu.edu/~saikat/deepsat/
16538 100000x4
16539 28x28x4x100000
16540 28x28x4x400000
16541 330,000
16542 400000x4
16543 ame
16544 barren
16545 basu
16546 conu
16547 deepsat
16548 dibiano
16549 doqq
16550 ganguli
16551 geotiff
16552 grassland
16553 infrar
16554 institute/nasa
16555 interclass
16556 karki
16557 louisiana
16558 manohar
16559 mukhopadhyay
16560 naip
16561 nemani
16562 non-overlap
16563 one-hot
16564 ortho
16565 photo-identifi
16566 pseudo-random
16567 quad
16568 quadrangl
16569 ramakrishna
16570 saikat
16571 sangram

17258 right-sid
17259 s001
17260 s002
17261 s051
17262 s057
17263 sessionindex
17264 ud.key1.key2
17265 ud.period.t
17266 up-key
17267 1968.
17268 matches/play
17269 xiam
17270 bind
17271 chromosome1
17272 non-bid
17273 non-bind
17274 nucleobas
17275 1958
17276 undergon
17277 asynchron
17278 blackboard
17279 microkernel
17280 prof.dr
17281 qamar
17282 refinements.ani
17283 usman
17284 //www.kaggle.com/nathanto/d/nathanto/seds-1960-2014f/data-wrangling-code-for-seds-1960-2014f
17285 2014f
17286 bbtu
17287 coursera/john
17288 mnemon
17289 sed
17290 specilis
17291 states.abb
17292 units_cod
17293 writng
17294 //www.superdatascience.com/deep-learning/
17295 10.000
17296 a-z™
17297 eremenko
17298 hadelin
17299 hands-on
17300 kiril
17301 pontev
17302 self-explain
17303 techniques/skil
17304 udemi
17305 '1920
17306 1920.
17307 293-299.
17308 3,500
17309 5186
17310 chapter1
17311 chapter10
17312 chapter11
17313 chapter12
17314 chapter13
17315 chapter14
17316 chapter15
17317 chapter16
17318 cha

18008 white-board
18009 //creativecommons.org/licenses/by-nc-sa/3.0/
18010 1,650
18011 1,818
18012 15,632
18013 15,649
18014 adverb
18015 bear
18016 german-languag
18017 remu
18018 ressourc
18019 sentimentwortschatz
18020 sentiw
18021 utf8-encod
18022 v1.8b
18023 //en.wikipedia.org/wiki/plagiar
18024 ean/upc
18025 furnitur
18026 gtin
18027 n.d.
18028 symbolog
18029 upc-a
18030 22050hz
18031 amplitud
18032 bedtim
18033 boi
18034 by-nd
18035 disfluenc
18036 englebretson
18037 gossip
18038 on-the-job
18039 phone-align
18040 recover
18041 sandra
18042 sermon
18043 thompson
18044 time-align
18045 two-channel
18046 ucsb
18047 umass
18048 unrecogniz
18049 wed
18050 *austin_bikeshare_stations.csv*
18051 *austin_bikeshare_trips.csv*
18052 2013-2017.
18053 649k
18054 acl-onli
18055 checkout_tim
18056 duration_minut
18057 end_station_id
18058 end_station_nam
18059 high-us
18060 start_station_id
18061 start_station_nam
18062 subscriber_typ
18063 //www.osha.gov/severeinjury/index.html
18064 2015-fe

18758 curs
18759 daimler
18760 greener
18761 june-juli
18762 merced
18763 mercedes-benz
18764 permut
18765 persever
18766 shakeup
18767 2455
18768 aspell.txt
18769 big.txt
18770 birkbeck.txt
18771 corrector
18772 industrial-strength
18773 snippet
18774 spele
18775 spell-testset1.txt
18776 spell-testset2.txt
18777 spell.pi
18778 told
18779 transcontinent
18780 wikipedia.txt
18781 arthur
18782 conan
18783 doyl
18784 holm
18785 insens
18786 theoriz
18787 twist
18788 5547
18789 epidemium
18790 janowczyk
18791 ac_no
18792 breakup
18793 eciresults.nic.in
18794 forfeit
18795 seat_allot
18796 10-29
18797 100-time
18798 17-band
18799 2800a
18800 3,462
18801 30-55
18802 56-65
18803 63,501
18804 abscissa
18805 apdrmag
18806 apdrmag=0
18807 b-band
18808 bang
18809 best-fit
18810 bimod
18811 brighter
18812 chandra
18813 chi-squar
18814 chi2r
18815 cosmic
18816 dispers
18817 dropoff
18818 e.mcz
18819 heteroscedast
18820 high-absolute-magnitud
18821 high-redshift
18822 hole
18823 interstellar
18824 l

19507 flights_domest
19508 flights_international*
19509 outstand
19510 passenger-mil
19511 passengers_domest
19512 passengers_international*
19513 placehold
19514 rpm_domest
19515 rpm_international*
19516 seat-mil
19517 stochast
19518 tractabl
19519 transtat
19520 workabl
19521 //k-db.com/
19522 //www.meetup.com/machine-learning-meetup-by-team-ai/events/242154425/
19523 everyweek
19524 fastretail
19525 hackathon
19526 hakathon
19527 uniqlo
19528 の株価予測モデルをオープンイノベーションで構築します。
19529 の題材として、
19530 ファーストリテイリング
19531 公開されたモデルを実際の取引に使う場合は十分注意ください。弊社側やコミュニティメンバー側では損失の責任は持てません。
19532 毎週水曜日東京・渋谷で開催している、team
19533 身近なユニクロ
19534 game_typ
19535 yyyy/mm/dd
19536 //arxiv.org/abs/1602.00134
19537 //arxiv.org/abs/1707.09416
19538 //github.com/limi44/parkinson-s-pose-estimation-dataset
19539 120-130
19540 1349–1354
19541 39th
19542 agil
19543 aquino
19544 arxiv:1707.09416
19545 beaulieu-boir
19546 capsit
19547 data_import_demo.ipynb
19548 doi:10.1016/j.parkreldis.2015.09.044
19549 dyskinesia
19550 embc
1

20007 thaigov.go.th
20008 thailand
20009 tonal
20010 hmda
20011 anti-corrupt
20012 ern
20013 evas
20014 llp
20015 lough
20016 overwhelm
20017 reaffirm
20018 2010-2014
20019 columnar
20020 facility-level
20021 industry-rel
20022 uniqueid
20023 //www.reddit.com/r/datasets/comments/47a7wh/ufc_fights_and_fighter_data/
20024 2/23/2016
20025 geyg
20026 sherdog
20027 u/geyg
20028 s.j.re
20029 'li
20030 'sit
20031 'stand
20032 010-000-024-033
20033 010-000-030-096
20034 020-000-032-221
20035 020-000-033-111
20036 ami-10
20037 ankle_left
20038 ankle_right
20039 belt
20040 dd.mm.yyyy
20041 dovgan
20042 gam
20043 kaluza
20044 lustrek
20045 malaga
20046 mirchevska
20047 50-100
20048 92697-3425
20049 bands-
20050 goat
20051 ics.uci.edu
20052 pazzani
20053 'field
20054 porsch
20055 leather
20056 shovel
20057 tracked-month
20058 cultiv
20059 data.mil
20060 painstakingli
20061 southeast
20062 unexplod
20063 h.10
20064 //github.com/nltk/nltk/blob/develop/nltk/test/gensim.doctest
20065 wnyc
20066 socrat

20690 preprint
20691 stare
20692 sunglass
20693 tab-separ
20694 tetreault
20695 tgif
20696 tgif-v1.0.tsv
20697 train/test
20698 yale
20699 yuncheng
20700 131552
20701 480421
20702 champid2nam
20703 gold-rank
20704 goldsummdata2016
20705 summids2016
20706 up/down
20707 year-old
20708 attack*attack
20709 attackbasicdp
20710 basicdp
20711 chargedp
20712 dragonit
20713 gem
20714 oneshotpotenti
20715 pokemongodb.net
20716 potentialdamag
20717 powerhous
20718 serebii.net
20719 tanki
20720 //apps.who.int/gho/data/view.main.57020mp
20721 angellist
20722 apriori
20723 biscuit
20724 bournvita
20725 cornflak
20726 maggi
20727 //www.reddit.com/search
20728 1/20/17
20729 amalinow1973
20730 latlon
20731 q=inaugur
20732 www.reddit.com
20733 0-153
20734 25-2500
20735 ddelta_p
20736 decoupl
20737 ef2009
20738 ef2013
20739 efconstotgha
20740 efdelta
20741 efdelta_p
20742 gdp2009
20743 gdp2013
20744 gdpdelta
20745 gdpdelta_p
20746 min-max
20747 record-typ
20748 //opendata.minneapolismn.gov
20749 1-99
207

21257 carboni
21258 clionadh
21259 flare-up
21260 hegr
21261 helen
21262 håvard
21263 instabl
21264 joakim
21265 karlsen
21266 kishi
21267 margaux
21268 matt
21269 militia
21270 pinaud
21271 raleigh
21272 rebel
21273 regim
21274 roudabeh
21275 south-east
21276 sussex
21277 tanoff
21278 vannic
21279 wigmore-shepherd
21280 0-180
21281 0-720
21282 1,559
21283 4-categori
21284 barbequ
21285 blindli
21286 brisket
21287 category-specif
21288 cbj
21289 cbj_percentag
21290 clever
21291 contest_key
21292 is_championsip
21293 is_standard
21294 kcb
21295 location_str
21296 non-standard
21297 pork
21298 resultsar
21299 rib
21300 sample-by-sampl
21301 state_ful
21302 team_nam
21303 goods/servic
21304 third-parti
21305 discover
21306 donnelli
21307 fortune.com
21308 arab_emirates.csv
21309 australia.csv
21310 caldonia
21311 china.csv
21312 emir
21313 iceland.csv
21314 israel.csv
21315 japan.csv
21316 kazakhstan.csv
21317 kuwait.csv
21318 new_caldonia.csv
21319 new_zealand.csv
21320 oceania
21321 qat

22006 bgaj
22007 bidvest
22008 bilj
22009 billiton
22010 btij
22011 bvtj
22012 ccoj
22013 cfrj
22014 compagni
22015 copy-pasta
22016 dsyj
22017 exxaro
22018 exxj
22019 f.csv
22020 financier
22021 firstrand
22022 fsrj
22023 ftse/js
22024 growthpoint
22025 grtj
22026 impala
22027 imperi
22028 impj
22029 inflow/outflow
22030 inpj
22031 intu
22032 investec
22033 iplj
22034 ituj
22035 kioj
22036 kumba
22037 lhcj
22038 mndj
22039 mondi
22040 mtnj
22041 nasper
22042 nedbank
22043 nedj
22044 non-curr
22045 npnjn
22046 omlj
22047 p.csv
22048 pharmacar
22049 r.csv
22050 rainbow
22051 reij
22052 reinet
22053 remgro
22054 remj
22055 repurchas
22056 richemont
22057 rmhj
22058 sa40
22059 sanlam
22060 sasol
22061 sbkj
22062 shoprit
22063 shpj
22064 slmj
22065 snhj
22066 solj
22067 steinhoff
22068 tbsj
22069 tiger
22070 vodacom
22071 vodj
22072 whlj
22073 woolworth
22074 //github.com/wh0801/nba-shooting-rationality-2016-17-regular-season
22075 //haowang204.wordpress.com/2017/06/03/shooting-rationality

22710 alexnet
22711 cut-off
22712 dropout
22713 fully-connect
22714 geoffrey
22715 hinton
22716 ilsvrc-2012
22717 ilya
22718 krizhevski
22719 lsvrc-2010
22720 max-pool
22721 non-satur
22722 recently-develop
22723 second-best
22724 sutskev
22725 //arxiv.org/abs/1608.06993
22726 cifar-100
22727 densenet
22728 densenet-201
22729 embrac
22730 feature-map
22731 kilian
22732 lauren
22733 maaten
22734 vanishing-gradi
22735 zhuang
22736 resnet-34
22737 'tie'
22738 //cricsheet.org
22739 //github.com/shoaibnajeeb/cricsheet-data-prepar
22740 //www.flickr.com/photos/badgerswan/173466044
22741 batting_team
22742 bowl_out
22743 cricsheet
22744 dismiss
22745 dreamt
22746 headach
22747 invok
22748 match_numb
22749 match_refere
22750 metrics/statist
22751 moneybal
22752 neutralvenu
22753 non-miss
22754 non_strik
22755 player_of_match
22756 reserve_umpir
22757 run_scor
22758 stadium/ground
22759 toss_decis
22760 toss_winn
22761 tv_umpir
22762 twenty20
22763 watching/read
22764 winner_run
22765 winner_wi

23256 emitido
23257 entregu
23258 especificação
23259 ess
23260 essa
23261 esta
23262 exemplo
23263 exercício
23264 federai
23265 federação
23266 financeira
23267 financeiro
23268 fiscai
23269 forenam
23270 formada
23271 fornecedor
23272 franqueada
23273 funcionamento
23274 gerará
23275 glosa
23276 governo
23277 goza
23278 idecadastro
23279 idem
23280 identifica
23281 identificador
23282 ideologia
23283 implantar
23284 incidirá
23285 indicativo
23286 indtipodocumento
23287 interess
23288 início
23289 issue_d
23290 juntament
23291 junto
23292 jurídica
23293 juízo
23294 leg_of_the_trip
23295 legislativa
23296 legislatura
23297 liberdad
23298 liderança
23299 localização
23300 lote
23301 líquido
23302 mai
23303 mandato
23304 medida
23305 meio
23306 montli
23307 muda
23308 necessita
23309 negativo
23310 net_valu
23311 nova
23312 nucarteiraparlamentar
23313 nudeputadoid
23314 nulegislatura
23315 numano
23316 numespecificacaosubcota
23317 numlot
23318 numm
23319 numparcela
23320 numressarcime

24006 million-pixel
24007 pixel-by-pixel
24008 r/place
24009 rate-limit
24010 re-paint
24011 u/drunken_economist
24012 '1860
24013 '1865
24014 '1865-ish
24015 '1901-1903
24016 'carlin
24017 'citi
24018 'confeder
24019 '~1880
24020 -readopt
24021 -remodel
24022 //books.google.com/book
24023 //docsouth.unc.edu/commland/monument/15/
24024 //www.amap1.org/images/2008
24025 //www.arkansaspreservation.com/national-register-listings/pdf/pu4770s.nr.pdf
24026 //www.drbronsontours.com/bronsonconfederatememorial.html
24027 1879.
24028 1903.
24029 1911-1920
24030 1916.
24031 2012-08.pdf
24032 20folder/amap
24033 20newslett
24034 22+confederate+memorial+1906
24035 22caroline+counti
24036 carolin
24037 cartodb_id
24038 constitucion
24039 f=fals
24040 feature_nam
24041 field_1
24042 fountain
24043 hl=en
24044 id=xxvhymoh3usc
24045 lpg=pa276
24046 ots=qyrgtv13j
24047 pg=pa276
24048 q=monument
24049 reded
24050 reopen
24051 sa=x
24052 secondary_class_for_internal_us
24053 sig=ii-ka__3bhzg9wztad5vckhb3b

24755 drift
24756 metan
24757 multisensor
24758 sen
24759 value.
24760 vito
24761 ogwr
24762 ogwr.com
24763 move.csv
24764 pokemon.csv
24765 ups*
24766 1072
24767 add-on
24768 arel-bundock
24769 conventions/trend
24770 rdataset
24771 vincentarelbundock
24772 -how
24773 -when
24774 27035
24775 bernoulli
24776 clap
24777 clap-word-clap-word
24778 contina
24779 fireant
24780 //apics-online.info/contributions/58
24781 2017-07-28
24782 apic
24783 autoglossonym
24784 creoles/pidgin
24785 fully-fledg
24786 huber
24787 kikongo-kituba
24788 langauge_t
24789 language_data
24790 language_id
24791 language_sourc
24792 lexifi
24793 magnu
24794 maurer
24795 melanesia
24796 micha
24797 mufwen
24798 salikoko
24799 -999
24800 -9999
24801 celciu
24802 cond
24803 dewpoint
24804 dewpti
24805 dewptm
24806 heatindexi
24807 heatindexm
24808 inhg
24809 mbar
24810 precipi
24811 precipm
24812 pressurei
24813 pressurem
24814 tempi
24815 tempm
24816 visi
24817 vism
24818 vivibl
24819 wdird
24820 wdire
24821 wgust

25505 contractu
25506 job_board
25507 location/sector
25508 phone_numb
25509 //arxiv.org/abs/1602.07261
25510 3.08
25511 alemi
25512 inception-resnet
25513 inception-resnet-v2
25514 inception-v4
25515 non-residu
25516 single-fram
25517 streamlin
25518 //arxiv.org/abs/1610.02357
25519 depthwis
25520 pointwis
25521 densenet-121
25522 //arxiv.org/abs/1602.07360
25523 0.5mb
25524 510x
25525 alexnet-level
25526 ashraf
25527 dalli
25528 dnn
25529 forrest
25530 fpga
25531 iandola
25532 keutzer
25533 khalid
25534 moskewicz
25535 vgg-19
25536 9035
25537 9035e
25538 icert
25539 nonimmigr
25540 my/op_build
25541 my/op_common
25542 my/op_ep
25543 my/op_legendari
25544 my/op_name_of_card
25545 my/op_rar
25546 my/op_spel
25547 my/op_troop
25548 glove.6b.100d.txt
25549 glove.6b.200d.txt
25550 glove.6b.300d.txt
25551 glove.6b.50d.txt
25552 googlenews-vectors-negative300.bin
25553 two-lay
25554 '-73.960876'
25555 '-73.98997825'
25556 '17109'
25557 '40.710451'
25558 '40.72019576'
25559 '401'
25560 '532'

26255 bryce
26256 bryn
26257 bshivaani
26258 bssasikanth
26259 btolar1
26260 budiaji
26261 buetti
26262 bugg
26263 buggs23
26264 bughunt
26265 buja
26266 buket
26267 bukun
26268 bulblight
26269 bung
26270 burachyonok
26271 burakh
26272 burra
26273 buryburyzymon
26274 bustamant
26275 busto
26276 butthead
26277 byrn
26278 caio
26279 caiquecassemiro
26280 caitlin
26281 caleb
26282 calebfackl
26283 calloway
26284 camara
26285 cameron
26286 camil
26287 camilasampaio
26288 campanelli
26289 campbel
26290 canario
26291 candocia
26292 canuto
26293 caparrini
26294 caramba
26295 carlesbalsach
26296 carlosmoncayo
26297 carra
26298 carreman
26299 carrillo
26300 carsten
26301 carter
26302 casadei
26303 castro
26304 catara
26305 cathi
26306 cauim
26307 cccheung
26308 cclark
26309 ceccon
26310 cecil
26311 cedrikfd
26312 celio
26313 cenk
26314 cerdan
26315 cesarini
26316 ceshin
26317 cgaet
26318 chabreuil
26319 chaitanya
26320 chamberunderground
26321 chan
26322 chandan
26323 chandler
26324 chandlervan

27005 haroon
27006 harpiecrispi
27007 harrand
27008 harrison
27009 harryquak
27010 harrytan
27011 harsh
27012 harsha
27013 harshavardhan
27014 harshit
27015 harshitagupta
27016 harshitmehta
27017 harshitsrivastava
27018 harshoday
27019 harshpandya
27020 harshvardhan
27021 hartono
27022 hashu
27023 hasil
27024 hassanaftabmugh
27025 hassankhanyusufzai
27026 hatem
27027 hatton
27028 haubner
27029 haumont
27030 havera
27031 hazrat
27032 hdkim
27033 heatingsmok
27034 hectopasc
27035 hedi
27036 hefen
27037 heidogsdf
27038 heihei
27039 heiko
27040 hellenandreea
27041 hellrid
27042 hemani
27043 hemant
27044 hemanth
27045 hemanthgowda
27046 hena
27047 hendrik
27048 henrikheggland
27049 henriqu
27050 henrywconklin
27051 heraldo
27052 heri
27053 herimanitra
27054 heriningtya
27055 hersak
27056 hervind
27057 heymeredith
27058 hhl028
27059 hidark
27060 hidayatullah
27061 hidehisa
27062 hieuvt
27063 hilger
27064 hilleck
27065 himanshu
27066 himanshu0113
27067 himanshurai
27068 hioki
27069 hiro
27070

28254 pakshal
28255 palacio
28256 palak
28257 palanisami
28258 palashshah
28259 pallav
28260 pallavi
28261 pallec
28262 palley
28263 pambudi
28264 panagioti
28265 panano
28266 panchal
28267 panchicor
28268 pancho
28269 panda974
28270 pandatadelta
28271 pandey
28272 pandeya
28273 pandian
28274 pang
28275 panizzo
28276 pano
28277 paosheng
28278 pappa
28279 paresh
28280 parichart
28281 parihar
28282 parindsheel
28283 parmanand
28284 parmar
28285 parseltung
28286 parth
28287 parthmaheshwari
28288 pasha
28289 paslay
28290 pasquato
28291 pastor
28292 patanajirasit
28293 patata
28294 patel
28295 patil
28296 patit
28297 patra
28298 patryk
28299 paul-loui
28300 paula
28301 paultimothymooney
28302 paulzh
28303 pavansubhash
28304 pavel
28305 paveltroshenkov
28306 pavlin
28307 pavlo
28308 pavlukhin
28309 pawan
28310 pazookii
28311 pbcquoc
28312 pcminer
28313 pearc
28314 pedro
28315 pedrofrantz
28316 peinado
28317 peirano
28318 peng-hsuan
28319 pengm
28320 pengxu
28321 pengzha
28322 penkov
28323 pe

29004 sohinibhattacharya
29005 solanki
29006 soldo
29007 soler-compani
29008 somasundaram
29009 somat
29010 somavarapu
29011 somesh
29012 sommenoob
29013 somnath
29014 sonali
29015 sonamsrivastava
29016 sonavan
29017 sonawan
29018 songsong
29019 sonneland
29020 soojung
29021 soorajm
29022 soroosh
29023 sotopia
29024 souci
29025 soufian
29026 soufianeorama
29027 souhaiel
29028 souhail
29029 soukaina
29030 souliman
29031 souman
29032 soumitra
29033 sourabhmitt
29034 sourav
29035 souravmaharana
29036 sousa
29037 sovboc2018
29038 sowhit
29039 sowmiya
29040 soywu
29041 spadafora
29042 spadim
29043 spencer
29044 sprabakar
29045 sreeram
29046 sreyansh
29047 sridhar
29048 sridharan
29049 srigiriraju
29050 srihari
29051 sriharirao
29052 srikanth
29053 srilakshmi
29054 srilakshminandamuri
29055 srilbg
29056 srinath
29057 srinivasan
29058 srinivasrao
29059 srinivinnakota
29060 srivastava
29061 srk16113
29062 ssvitian
29063 stamat
29064 stanciu
29065 stanford-taylor
29066 starconf
29067 stawari
29

29754 is_respons
29755 cant
29756 ml/dp
29757 motherboard
29758 science~
29759 solder
29760 //stats.espncricinfo.com/ci/engine/stats/index.html
29761 all-round
29762 statsguru
29763 1800-2020.
29764 11/14/2017
29765 sci-kit
29766 //www.epa.gov/tsca-inventory/how-access-tsca-inventori
29767 1975.
29768 62,000
29769 accno
29770 casregno
29771 number/form
29772 pmnacc_062017.csv
29773 pmnno
29774 tsca
29775 tscainv_062017.csv
29776 uvcb
29777 crypto-curr
29778 mosquito-blueprint
29779 vericoin
29780 brainchild
29781 elsewher
29782 erect
29783 ewart
29784 plaqu
29785 plauq
29786 //www.dds.ca.gov
29787 developmentally-dis
29788 mickel
29789 non-hispan
29790 paradox
29791 sacremento
29792 www.amstat.org/publications/jse/v22n1/mickel.pdf
29793 applicatin
29794 feauter
29795 seper
29796 10,502
29797 all_annotated.tsv
29798 blodgett
29799 code-switch
29800 geotag
29801 johnni
29802 o'connor
29803 tian-zheng
29804 978-9077381-39-7
29805 gutenbergr
29806 janeaustenr
29807 page-by-pag
29808 word-c

30503 multi-bleu.perl
30504 newstest
30505 newstest-2015-100sents.en-ru.google.ru
30506 newstest-2015-100sents.en-ru.ref.en
30507 newstest-2015-100sents.en-ru.ref.ru
30508 newstest-2015-100sents.en-ru.src.en
30509 newstest-2015.enru.ref.ru.sgm
30510 newstest-2015.enru.src.en.sgm
30511 newstest2015-100sents-enru-google.ru.sgm
30512 newstest2015-100sents-enru-ref.ru.sgm
30513 newstest2015-100sents-enru-src.en.sgm
30514 newstest2015-enru-ref.ru.sgm
30515 newstest2015-enru-src.en.sgm
30516 nltk.translat
30517 plaintext
30518 ratio=1.074
30519 ref.ru
30520 ref_len=1852
30521 src.en
30522 testb
30523 tokenizer.perl
30524 wmt15
30525 wmt15_eval
30526 wmt16
30527 wrap-xml.perl
30528 ~/mosesdecoder/scripts/ems/support/wrap-xml.perl
30529 ~/mosesdecoder/scripts/generic/mteval-v13a.pl
30530 ~/mosesdecoder/scripts/generic/multi-bleu.perl
30531 ~/mosesdecoder/scripts/tokenizer/tokenizer.perl
30532 7.0.0
30533 close_punctuation.txt
30534 currency_symbol.txt
30535 isalnum.txt
30536 isalpha.txt
30537 

31253 hausa
31254 isizulu
31255 iso_codes.csv
31256 kiswahili
31257 lugbarati
31258 open-licens
31259 sesotho
31260 somali
31261 stoplist
31262 storybook
31263 yoruba
31264 dlabal
31265 garageband
31266 midi
31267 wedeen
31268 bermuda.csv
31269 canada.csv
31270 curaçao
31271 curaçao.csv
31272 jamaica.csv
31273 mexico.csv
31274 argentina.csv
31275 brazil.csv
31276 chile
31277 chile.csv
31278 columbia.csv
31279 uraguay
31280 uraguay.csv
31281 continuo
31282 mush
31283 unhappi
31284 abolish
31285 admidst
31286 circumscrib
31287 confront
31288 darwinist
31289 enlist
31290 ex-slav
31291 memoir
31292 modernist
31293 oppress
31294 ravag
31295 re-creat
31296 segreg
31297 sharecropp
31298 tumultu
31299 turmoil
31300 //catalog.archives.gov/id/2240988
31301 2240988.
31302 burial_statu
31303 ethnicity_1
31304 ethnicity_2
31305 fatality_d
31306 fatality_year
31307 home_c
31308 home_counti
31309 home_st
31310 hostility_condit
31311 pay_grad
31312 service_cod
31313 service_typ
31314 //twitter.com/lin

32003 fortnit
32004 projectil
32005 reload
32006 soumydev
32007 underutil
32008 //www.usnews.com
32009 acceptance-r
32010 act-avg
32011 adelphi
32012 akron
32013 albani
32014 alliant
32015 argosi
32016 ashland
32017 auburn
32018 augusta
32019 azusa
32020 baptist
32021 baylor
32022 benedictin
32023 biola
32024 bloomington
32025 bois
32026 brandei
32027 brigham
32028 brunswick
32029 canyon
32030 capella
32031 carbondal
32032 charlott
32033 cincinnati
32034 clarkson
32035 clemson
32036 cumberland
32037 dame
32038 dartmouth
32039 depaul
32040 dominion
32041 drexel
32042 duke
32043 duquesn
32044 edgewood
32045 emori
32046 fairbank
32047 fordham
32048 fresno
32049 fullerton
32050 gardner-webb
32051 greensboro
32052 hartford
32053 hofstra
32054 huntsvil
32055 immaculata
32056 indianapoli
32057 kennesaw
32058 kingsvil
32059 lafayett
32060 lamar
32061 lehigh
32062 lesley
32063 lindenwood
32064 lipscomb
32065 lowel
32066 loyola
32067 manoa
32068 marquett
32069 maryvil
32070 merc
32071 mercer
320

In [92]:
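# Convert each tokenized document into a bag-of-words vector:
# a list of (token_id, count) pairs for the tokens present in that document
print('\nconvert tokenized documents into a document-term matrix')
corpus = [dictionary.doc2bow(text) for text in texts]
for item in corpus:
    print(item)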


convert tokenized documents into a document-term matrix
[(0, 2), (1, 1), (2, 2), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 2), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 2), (29, 2), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 3), (37, 1), (38, 1), (39, 1), (40, 1), (41, 3), (42, 5), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 7), (50, 1), (51, 4), (52, 1), (53, 1), (54, 1), (55, 1), (56, 3), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 2), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 2), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1), (89, 1), (90, 1), (91, 1), (92, 1), (93, 1), (94, 1), (95, 1), (96, 1), (97, 1), (98, 1), (99, 1), (100, 7), (101, 2), (102, 3), (103, 1), (104, 1), (1

[(36, 1), (41, 5), (42, 3), (49, 2), (53, 1), (67, 1), (70, 1), (76, 3), (80, 1), (83, 1), (106, 1), (107, 3), (124, 1), (125, 2), (140, 1), (143, 6), (150, 2), (158, 1), (170, 1), (173, 1), (176, 1), (181, 1), (190, 3), (240, 1), (251, 3), (256, 3), (276, 1), (282, 1), (306, 5), (329, 1), (347, 1), (367, 4), (400, 1), (410, 1), (412, 1), (421, 2), (423, 1), (430, 8), (441, 1), (475, 1), (539, 2), (596, 1), (601, 1), (633, 1), (634, 1), (639, 2), (650, 1), (722, 1), (725, 1), (794, 1), (815, 1), (872, 1), (887, 1), (891, 1), (896, 1), (934, 1), (936, 1), (964, 1), (982, 5), (990, 1), (993, 1), (1040, 1), (1052, 1), (1053, 2), (1059, 1), (1069, 2), (1100, 3), (1276, 1), (1306, 1), (1400, 1), (1402, 1), (1444, 1), (1492, 1), (1496, 1), (1541, 1), (1544, 1), (1566, 5), (1598, 2), (1657, 1), (1663, 1), (1706, 1), (1777, 1), (1789, 1), (1828, 2), (1847, 1), (2044, 1), (2114, 2), (2185, 1), (2285, 1), (2313, 1), (2453, 1), (2703, 7), (2771, 4), (2916, 1), (2930, 1), (3029, 1), (3035, 1), (32

[(36, 1), (41, 3), (42, 1), (54, 1), (73, 1), (165, 1), (170, 1), (185, 1), (196, 1), (224, 1), (232, 1), (239, 1), (329, 1), (399, 1), (425, 1), (431, 2), (561, 1), (605, 1), (624, 1), (741, 1), (829, 1), (1364, 1), (1369, 1), (1540, 2), (1601, 1), (1668, 2), (1680, 1), (2884, 1), (3447, 5), (3465, 1), (3776, 1), (4313, 5), (4532, 1), (7788, 2), (9884, 4), (9893, 2), (11676, 1), (20066, 1)]
[(10, 1), (31, 1), (32, 2), (36, 2), (41, 6), (42, 3), (44, 2), (54, 1), (59, 2), (87, 1), (106, 1), (108, 2), (133, 1), (143, 2), (151, 1), (182, 1), (194, 1), (201, 1), (202, 2), (203, 1), (211, 3), (262, 3), (271, 1), (276, 1), (302, 1), (338, 2), (376, 3), (390, 2), (432, 1), (464, 1), (472, 1), (501, 2), (502, 1), (513, 1), (534, 1), (535, 1), (648, 4), (665, 1), (669, 1), (741, 1), (757, 1), (787, 1), (842, 6), (845, 2), (862, 3), (870, 1), (949, 3), (976, 1), (982, 1), (1107, 1), (1236, 1), (1350, 3), (1355, 1), (1484, 1), (1494, 1), (1565, 1), (1629, 1), (1750, 1), (1796, 1), (2016, 1), (22

[(42, 1), (896, 1)]
[(12, 1), (41, 5), (59, 3), (72, 1), (87, 4), (93, 3), (100, 4), (106, 1), (108, 1), (125, 1), (126, 2), (176, 1), (196, 1), (202, 1), (234, 1), (276, 1), (360, 1), (363, 1), (453, 1), (501, 1), (502, 1), (553, 1), (722, 1), (757, 1), (833, 1), (887, 1), (893, 2), (934, 1), (1107, 1), (1494, 1), (1540, 1), (1791, 4), (2703, 2), (2902, 1), (3271, 1), (3355, 1), (3407, 1), (3704, 1), (5187, 1), (32152, 1), (32153, 1)]
[(36, 1), (56, 1), (59, 3), (64, 1), (106, 1), (118, 1), (146, 2), (158, 1), (203, 1), (276, 1), (360, 1), (501, 1), (502, 1), (509, 1), (543, 1), (660, 1), (1045, 5), (2045, 1), (3407, 1), (3452, 2), (3465, 1), (3697, 1), (3834, 1), (4342, 6), (4670, 1), (6193, 5), (8118, 1), (8759, 1), (8765, 1), (9006, 1), (14793, 1), (18111, 1), (18116, 1), (18119, 1), (18120, 1), (18122, 1), (21694, 2), (32154, 1), (32155, 1), (32156, 1)]
[(41, 2), (42, 1), (67, 1), (232, 1), (395, 1), (501, 1), (502, 1), (523, 1), (742, 1), (990, 1), (1052, 1), (1534, 1), (2853, 1)
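
Each row printed above is one document in bag-of-words form: a list of `(token_id, count)` pairs sorted by token id, with absent tokens simply left out. The sketch below mimics that output format in plain Python on a toy corpus. It is only an illustration: `build_vocabulary`, `doc2bow`, and the `toy_*` names are hypothetical helpers, and ids are assigned by first appearance here, which is an assumption and not necessarily how `gensim` numbers tokens.

```python
from collections import Counter


def build_vocabulary(texts):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc2bow(text, token2id):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    # Tokens missing from the vocabulary are skipped
    counts = Counter(token2id[token] for token in text if token in token2id)
    return sorted(counts.items())


# Toy tokenized documents (hypothetical data, not the notebook's corpus)
toy_texts = [
    ["topic", "model", "topic", "word"],
    ["word", "document", "model"],
]
toy_token2id = build_vocabulary(toy_texts)
toy_corpus = [doc2bow(text, toy_token2id) for text in toy_texts]

for bow in toy_corpus:
    print(bow)
# [(0, 2), (1, 1), (2, 1)]
# [(1, 1), (2, 1), (3, 1)]
```

The sparse representation explains why the short documents above produce short lists even though the dictionary holds tens of thousands of tokens.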

In [None]:
# Train LDA with 3 topics; passes=20 sweeps the whole corpus 20 times
ldamodel = gensim.models.ldamodel.LdaModel(
    corpus, num_topics=3, id2word=dictionary, passes=20, random_state=0)

In [None]:
# The model was trained with 3 topics, so show all 3 with their 15 top words
ldamodel.print_topics(num_topics=3, num_words=15)
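
Besides the word distribution per topic, LDA also yields a topic distribution per document, and a document can be assigned the topic with the highest probability. A minimal sketch of that last step follows. It assumes the distribution is given as `(topic_id, probability)` pairs, which is the shape `ldamodel.get_document_topics(bow)` returns in `gensim`; the function name `dominant_topic` and the numbers are hypothetical and do not come from the model trained above.

```python
def dominant_topic(topic_distribution):
    """Return the (topic_id, probability) pair with the highest probability."""
    if not topic_distribution:
        raise ValueError("empty topic distribution")
    return max(topic_distribution, key=lambda pair: pair[1])


# Hypothetical per-document topic distributions (made-up numbers).
# Low-probability topics may be missing from a list, as in the second document.
example_distributions = [
    [(0, 0.05), (1, 0.90), (2, 0.05)],
    [(0, 0.62), (2, 0.38)],
]

assigned = [dominant_topic(dist)[0] for dist in example_distributions]
print(assigned)  # [1, 0]
```

With a trained model, the same function could be applied to each item of `corpus` after converting it with `ldamodel.get_document_topics`.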

<font color = green >

## Learn more
</font>

Latent Dirichlet allocation
<br>
https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation


<font color = green >

## Next lesson: Clustering 
</font>

