<font color = green >

# Text classification: topic modeling 

</font>

<font color = green >

### Latent Dirichlet allocation (LDA)

</font>

Typically used to detect the underlying topics in a collection of text documents.

**Input**: text documents and the number of topics 
<br>
**Output**: the distribution of topics for each document (which allows assigning the document to the topic with the highest probability) and the word distribution for each topic 

**Assumptions**:
- Documents with similar topics use similar groups of words 
- Each document is a probability distribution over latent topics 
- Each topic is a probability distribution over words


<font color = green >

#### Generative process
<br>
</font>

LDA assumes that every document is created in the following way:

1) Define the number of words in the document
<br>
2) Choose the topic mixture over the fixed set of topics (e.g. 20% of topic A, 30% of topic B, and 50% of topic C)
<br>
3) Generate the words by:
<br>
   -picking a topic based on the document's multinomial distribution over topics 
<br>
   -picking a word based on that topic's multinomial distribution over words 

<img src = "img/topics_modeling.png" height=500 width= 800 align="left">
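The generative process above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the vocabulary, the two topic-word distributions, the Dirichlet parameter `alpha`, and the function name `generate_document` are all made up for the example and do not come from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and per-topic word distributions (each row sums to 1)
vocabulary = ["broccoli", "eat", "health", "drive", "pressure", "school"]
topic_word = np.array([
    [0.40, 0.30, 0.20, 0.04, 0.03, 0.03],  # a "food"-like topic
    [0.03, 0.03, 0.04, 0.40, 0.30, 0.20],  # a "stress"-like topic
])


def generate_document(n_words, alpha=(0.5, 0.5)):
    """Generate one document following steps 1-3 of the generative process."""
    # Step 2: draw the document's topic mixture (alpha is an assumed prior)
    topic_mixture = rng.dirichlet(alpha)
    words = []
    for _ in range(n_words):  # Step 1: the document length is fixed up front
        # Step 3a: pick a topic according to the document's topic mixture
        topic = rng.choice(len(topic_mixture), p=topic_mixture)
        # Step 3b: pick a word according to that topic's word distribution
        word_id = rng.choice(len(vocabulary), p=topic_word[topic])
        words.append(vocabulary[word_id])
    return topic_mixture, words


mixture, words = generate_document(n_words=8)
print(mixture, words)
```

Fitting LDA runs this story in reverse: only the words are observed, and the topic mixture and the topic-word distributions have to be inferred.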



<font color = green >

#### Recall
</font>


#### Binomial distribution

$$p(k/n)\quad =\quad C^{ k }_{ n }\cdot p^{ k }(1-p)^{ n-k }\quad =\quad \frac { n! }{ k!(n-k)! } p^{ k }(1-p)^{ n-k }$$

Example: probability of exactly 6 heads in 10 tosses of a fair coin: 
$$p(6/10)\quad =\quad C^{ 6 }_{ 10 }\cdot {0.5}^{ 6 }(0.5)^{ 4 }\quad = 210 \cdot 0.015625 \cdot 0.0625 = 0.205078125$$
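The coin example can be checked numerically. A minimal sketch using only the standard library; the helper name `binomial_pmf` is chosen for this illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


print(binomial_pmf(6, 10, 0.5))  # 0.205078125
```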


#### Multinomial distribution

$$p(n_{ 1 }n_{ 2 }...n_{ k })\quad =\quad \frac { n! }{ n_{ 1 }!n_{ 2 }!...n_{ k }! } p^{ n_{ 1 } }_{ 1 }p^{ n_{ 2 } }_{ 2 }...p^{ n_{ k } }_{ k }$$

Example (three outcomes): <br>
n = 12 (12 games are played),<br>
n1 = 7 (number won by Player A),<br>
n2 = 2 (number won by Player B),<br>
n3 = 3 (the number drawn),<br>
p1 = 0.4 (probability Player A wins)<br>
p2 = 0.35(probability Player B wins)<br>
p3 = 0.25(probability of a draw)<br>
$$p(7,2,3)\quad =\quad \frac {12!}{ 7! \cdot 2! \cdot3 ! }  \cdot 0.4^{7} \cdot 0.35^{2} \cdot0.25^{3} = 0.0248$$
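The same check works for the multinomial example. Again a small standard-library sketch; `multinomial_pmf` is a name introduced here for illustration.

```python
from math import factorial, prod


def multinomial_pmf(counts, probs):
    """Probability of observing the given outcome counts in sum(counts) trials."""
    coefficient = factorial(sum(counts))
    for c in counts:
        coefficient //= factorial(c)  # n! / (n1! * n2! * ... * nk!)
    return coefficient * prod(p**c for p, c in zip(probs, counts))


p = multinomial_pmf([7, 2, 3], [0.4, 0.35, 0.25])
print(round(p, 4))  # 0.0248
```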




<font color = green >

#### Maximum Likelihood Estimation

</font>

<br>

**Recall** 
<br> The text documents and the number $K$ of topics are known 

**Target**:
<br>Among all possible topic distributions for the documents and all possible word distributions for the topics, choose the ones that maximize the probability of the observed text documents.

**Approach** :
<br>
1) Randomly assign each word of each document to one of the $K$ topics 
<br>
2) Iterate the following process until convergence (the assignments of words to topics stop changing) 
<br>$\quad\quad$For each document $d$: 
<br>
    $\quad\quad\bullet$ Assume that all topic assignments except the current one are correct     
    $\quad\quad\bullet$ For each word $w$ in $d$:           
    $\quad\quad\quad$ - For every topic $t$ compute the score for the hypothesis that $w$ belongs to topic $t$:
   <br>$\quad\quad\quad\quad\quad score (t) =  p(t | d) \cdot p (w |t),$
   <br>$\quad\quad\quad\quad p(t|d)$ is the proportion of words in $d$ currently assigned to $t$,
    <br>$\quad\quad\quad\quad p(w|t)$ is the share of word $w$ among all words assigned to topic $t$.  
    $\quad\quad\quad$ - Assign the word $w$ to the topic with the maximum score
    <br>$\quad\quad\bullet$ Repeat for all $w$ in $d$           
$\quad\quad$Repeat for all $d$

The result is a matrix of the distribution of words over topics.  
Note: the computed topics are just word distributions, so they still need to be interpreted and given a name. 
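The iterative approach described above can be sketched in plain Python. This is a toy illustration, not the training procedure of any particular library: the function name `assign_topics`, the `smoothing` constant (added so that empty topics do not get a score of exactly zero), and the `max_iters` cap are assumptions made for the example.

```python
import random
from collections import Counter


def assign_topics(docs, num_topics, max_iters=100, smoothing=0.1, seed=0):
    """Greedy topic re-assignment for tokenized documents (lists of words)."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})

    # Step 1: randomly assign every word occurrence to one of the topics
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]

    # Count tables behind p(t|d) and p(w|t)
    doc_topic = [Counter(doc_assignments) for doc_assignments in assignments]
    topic_word = [Counter() for _ in range(num_topics)]
    for doc, doc_assignments in zip(docs, assignments):
        for word, topic in zip(doc, doc_assignments):
            topic_word[topic][word] += 1

    # Step 2: iterate until no assignment changes (or max_iters is reached)
    for _ in range(max_iters):
        changed = False
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                old_topic = assignments[d][i]
                # Treat all other assignments as correct: drop the current one
                doc_topic[d][old_topic] -= 1
                topic_word[old_topic][word] -= 1

                scores = []
                for t in range(num_topics):
                    topic_size = sum(topic_word[t].values())
                    p_t_d = (doc_topic[d][t] + smoothing) / (len(doc) - 1 + smoothing * num_topics)
                    p_w_t = (topic_word[t][word] + smoothing) / (topic_size + smoothing * vocab_size)
                    scores.append(p_t_d * p_w_t)  # score(t) = p(t|d) * p(w|t)

                # Assign the word to the topic with the maximum score
                new_topic = scores.index(max(scores))
                assignments[d][i] = new_topic
                doc_topic[d][new_topic] += 1
                topic_word[new_topic][word] += 1
                changed = changed or new_topic != old_topic
        if not changed:
            break
    return assignments, topic_word


docs = [["eat", "good", "eat"], ["drive", "pressure", "drive"], ["eat", "drive"]]
assignments, topic_word = assign_topics(docs, num_topics=2)
print(assignments)
```

Taking the maximum score at every step keeps the sketch close to the description above; sampling the topic in proportion to the scores would be a common variation.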


<font color = green >

## Gensim LDA 

</font>



In [2]:
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
from nltk.corpus import stopwords

<font color = green >

### Define the text documents 

</font>



In [3]:
doc_a = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc_e = "Health professionals say that brocolli is good for your health."
doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]


<font color = green >

### Tokenize, clean, and stem

</font>



In [4]:
en_stop  = set(stopwords.words('english'))
p_stemmer = PorterStemmer()

def tokenize(doc_set):
    texts = []
    for doc in doc_set:
        # tokenize document string
        raw = doc.lower()
        tokens = word_tokenize(raw)

        # remove stop words from tokens
        tokens = [token for token in tokens if token not in en_stop]

        # stem tokens
        tokens = [p_stemmer.stem(token) for token in tokens]

        # add tokens to list
        texts.append(tokens)
    return texts

texts = tokenize(doc_set)
texts[0]

['brocolli',
 'good',
 'eat',
 '.',
 'brother',
 'like',
 'eat',
 'good',
 'brocolli',
 ',',
 'mother',
 '.']
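The output above still contains punctuation tokens (`'.'` and `','`), and they later show up among the top words of the topics. One possible extra cleaning step, sketched here as an assumption rather than as part of the original pipeline, keeps only tokens with at least one letter or digit; `drop_punctuation` is a hypothetical helper name.

```python
def drop_punctuation(tokens):
    """Keep only tokens that contain at least one letter or digit."""
    return [token for token in tokens if any(ch.isalnum() for ch in token)]


tokens = ['brocolli', 'good', 'eat', '.', 'brother', 'like',
          'eat', 'good', 'brocolli', ',', 'mother', '.']
print(drop_punctuation(tokens))
# ['brocolli', 'good', 'eat', 'brother', 'like', 'eat', 'good', 'brocolli', 'mother']
```

It could be applied inside `tokenize` right after the stop-word filter; the outputs shown in this notebook were produced without it.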

<font color = green >

### Convert tokenized documents into a "id <-> term" dictionary

</font>



In [5]:
dictionary = corpora.Dictionary(texts)  # id <-> term mapping built by gensim (an alternative to CountVectorizer)
print(type(dictionary), dictionary)
for k, w in dictionary.items():
    print(k, w)

<class 'gensim.corpora.dictionary.Dictionary'> Dictionary(34 unique tokens: [',', '.', 'brocolli', 'brother', 'eat']...)
0 ,
1 .
2 brocolli
3 brother
4 eat
5 good
6 like
7 mother
8 around
9 basebal
10 drive
11 lot
12 practic
13 spend
14 time
15 blood
16 caus
17 expert
18 health
19 increas
20 may
21 pressur
22 suggest
23 tension
24 better
25 feel
26 never
27 often
28 perform
29 school
30 seem
31 well
32 profession
33 say


<font color = green >

### Create gensim corpus

</font>



In [6]:
print ('\nconvert tokenized documents into a document-term matrix')
corpus = [dictionary.doc2bow(text) for text in texts] # id and count
for item in corpus:
    print (item)


convert tokenized documents into a document-term matrix
[(0, 1), (1, 2), (2, 2), (3, 1), (4, 2), (5, 2), (6, 1), (7, 1)]
[(1, 1), (3, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1)]
[(1, 1), (10, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1)]
[(0, 1), (1, 1), (3, 1), (7, 1), (10, 1), (21, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1)]
[(1, 1), (2, 1), (5, 1), (18, 2), (32, 1), (33, 1)]
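Each document in the corpus above is a list of `(token id, count)` pairs. The pure-Python sketch below mimics that bag-of-words encoding to show what the pairs mean. It assumes ids are handed out in sorted order per document, which is what the printed dictionary suggests but is not something this notebook verifies; `build_bow_corpus` is a name made up for the illustration.

```python
from collections import Counter


def build_bow_corpus(texts):
    """Map tokens to integer ids and encode each document as (id, count) pairs."""
    token2id = {}
    corpus = []
    for tokens in texts:
        # Give unseen tokens the next free ids, in sorted order within the document
        for token in sorted(set(tokens)):
            token2id.setdefault(token, len(token2id))
        counts = Counter(token2id[token] for token in tokens)
        corpus.append(sorted(counts.items()))
    return token2id, corpus


first_doc = ['brocolli', 'good', 'eat', '.', 'brother', 'like',
             'eat', 'good', 'brocolli', ',', 'mother', '.']
token2id, bow = build_bow_corpus([first_doc])
print(bow[0])
# [(0, 1), (1, 2), (2, 2), (3, 1), (4, 2), (5, 2), (6, 1), (7, 1)]
```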


<font color = green >

### Generate LDA model

</font>



In [7]:
ldamodel = gensim.models.ldamodel.LdaModel(
    corpus, num_topics=2, id2word=dictionary, passes=20, random_state= 0)


<font color = green >

### Review topics 

</font>



In [8]:
ldamodel.print_topics(num_topics=2,num_words=10)

[(0,
  '0.098*"." + 0.076*"brocolli" + 0.076*"good" + 0.055*"mother" + 0.055*"brother" + 0.054*"health" + 0.054*"eat" + 0.033*"," + 0.033*"like" + 0.033*"spend"'),
 (1,
  '0.060*"drive" + 0.059*"pressur" + 0.059*"." + 0.036*"," + 0.036*"never" + 0.036*"often" + 0.036*"increas" + 0.036*"perform" + 0.036*"seem" + 0.036*"well"')]

<font color = green >

### Classify the new text 

</font>



In [9]:
test_doc_list = ["Some experts suggest that car may cause increased blood pressure. professionals say that brocolli is good "]
test_texts = tokenize(test_doc_list)
test_corpus = [dictionary.doc2bow(text) for text in test_texts ]
test_doc_topics = ldamodel.get_document_topics(test_corpus)
print ('\nget topics:')
for el in test_doc_topics: # loop over all tests in provided list
    print(el)


get topics:
[(0, 0.37417653), (1, 0.62582344)]
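To turn the topic distribution above into a single label, pick the pair with the highest probability. A minimal sketch; `dominant_topic` is a hypothetical helper name.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])


print(dominant_topic([(0, 0.37417653), (1, 0.62582344)]))
# (1, 0.62582344)
```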


<font color = green >

### Sample of topic modeling on large dataset

</font>



In [10]:
from sklearn.feature_extraction.text import CountVectorizer
import pickle

<font color = green >

#### Load "News" data 

</font>



In [39]:
import os
cwd = os.getcwd()
path = cwd  # the data files are expected in the current working directory
fn = os.path.join(path, 'newsgroups')

with open(fn, 'rb') as f:
    newsgroup_data = pickle.load(f)

<font color = green >

#### Review data

</font>



In [40]:
print (type(newsgroup_data))
print ('len of documents = {:,}\n'.format(len(newsgroup_data)))

newsgroup_data[0]

<class 'list'>
len of documents = 2,000



"The best group to keep you informed is the Crohn's and Colitis Foundation\nof America.  I do not know if the UK has a similar organization.  The\naddress of\nthe CCFA is \n\nCCFA\n444 Park Avenue South\n11th Floor\nNew York, NY  10016-7374\nUSA\n\nThey have a lot of information available and have a number of newsletters.\n \nGood Luck."

<font color = green >

#### Define custom vectorizer

</font>



In [41]:
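# despite the variable name, the pattern matches tokens of three or more word characters
three_words_pattern = r"\b\w{3,}\b"
vectorizer = CountVectorizer(
    min_df=20,  # keep only terms that occur in at least 20 documents
    stop_words='english',
    token_pattern=three_words_pattern) 
vectorizer.fit(newsgroup_data)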

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=20,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='\\b\\w{3,}\\b', tokenizer=None,
        vocabulary=None)
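The effect of the token pattern can be seen in isolation with the `re` module. The sample sentence is adapted from the first newsgroup document; stop-word removal and the `min_df` filter are separate steps of the vectorizer and are not reproduced in this sketch.

```python
import re

three_chars_pattern = r"\b\w{3,}\b"  # tokens of at least three word characters
text = "The address of the CCFA is 444 Park Avenue South"
print(re.findall(three_chars_pattern, text.lower()))
# ['the', 'address', 'the', 'ccfa', '444', 'park', 'avenue', 'south']
```

The two-letter words "of" and "is" are dropped by the pattern alone, while digits such as `444` are kept because `\w` also matches them.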

<font color = green >

#### Review features 

</font>



In [42]:
# Note: newer scikit-learn releases replace get_feature_names() with get_feature_names_out()
print('len of features = {:,}\n'.format(len(vectorizer.get_feature_names())))
print(vectorizer.get_feature_names()[:40])


len of features = 902

['000', '100', '1990', '1992', '1993', '200', '2nd', '300', '400', '486', '500', '800', 'ability', 'able', 'accept', 'accepted', 'access', 'according', 'actual', 'actually', 'add', 'addition', 'additional', 'address', 'advance', 'advice', 'age', 'ago', 'agree', 'ahead', 'air', 'allow', 'alt', 'america', 'american', 'answer', 'answers', 'anybody', 'apparently', 'appears']


<font color = green >

#### Vectorize data set

</font>



In [43]:
newsgroup_data_vectorized= vectorizer.transform(newsgroup_data)
print (newsgroup_data_vectorized)

  (0, 23)	1
  (0, 33)	1
  (0, 58)	1
  (0, 76)	1
  (0, 326)	1
  (0, 335)	1
  (0, 386)	1
  (0, 409)	1
  (0, 451)	1
  (0, 456)	1
  (0, 515)	1
  (0, 529)	1
  (0, 545)	1
  (0, 727)	1
  (0, 843)	1
  (0, 900)	1
  (1, 33)	1
  (1, 34)	1
  (1, 84)	1
  (1, 184)	1
  (1, 201)	1
  (1, 214)	1
  (1, 231)	2
  (1, 241)	1
  (1, 324)	1
  :	:
  (1998, 622)	1
  (1998, 625)	3
  (1998, 688)	1
  (1998, 698)	2
  (1998, 726)	1
  (1998, 804)	1
  (1998, 805)	1
  (1998, 810)	10
  (1998, 813)	2
  (1998, 814)	1
  (1998, 816)	1
  (1998, 818)	1
  (1998, 844)	1
  (1998, 882)	2
  (1998, 899)	1
  (1999, 171)	1
  (1999, 194)	1
  (1999, 205)	1
  (1999, 213)	1
  (1999, 276)	2
  (1999, 308)	1
  (1999, 344)	1
  (1999, 669)	1
  (1999, 832)	1
  (1999, 874)	1


<font color = green >

#### Create gensim corpus

</font>



In [44]:
# rows of the sparse matrix are documents, hence documents_columns=False
corpus = gensim.matutils.Sparse2Corpus(newsgroup_data_vectorized, documents_columns=False)
# for comparison, the corpora.Dictionary route used earlier was:
# corpus = [dictionary.doc2bow(text) for text in texts] 
list(corpus)[:5]


[[(23, 1),
  (33, 1),
  (58, 1),
  (76, 1),
  (326, 1),
  (335, 1),
  (386, 1),
  (409, 1),
  (451, 1),
  (456, 1),
  (515, 1),
  (529, 1),
  (545, 1),
  (727, 1),
  (843, 1),
  (900, 1)],
 [(33, 1),
  (34, 1),
  (84, 1),
  (184, 1),
  (201, 1),
  (214, 1),
  (231, 2),
  (241, 1),
  (324, 1),
  (332, 1),
  (359, 1),
  (363, 1),
  (365, 1),
  (409, 1),
  (430, 3),
  (451, 1),
  (475, 1),
  (492, 2),
  (525, 2),
  (605, 1),
  (633, 2),
  (642, 1),
  (674, 1),
  (688, 1),
  (709, 1),
  (750, 1),
  (777, 1),
  (823, 1),
  (838, 1),
  (874, 1),
  (896, 1)],
 [(25, 1),
  (26, 1),
  (63, 1),
  (120, 1),
  (231, 1),
  (297, 1),
  (326, 1),
  (344, 1),
  (373, 1),
  (423, 1),
  (442, 1),
  (444, 1),
  (448, 2),
  (465, 1),
  (572, 1),
  (653, 1),
  (659, 1),
  (714, 1),
  (777, 1),
  (779, 1),
  (781, 1),
  (818, 1),
  (836, 1),
  (855, 1),
  (890, 1),
  (898, 1)],
 [(4, 1),
  (17, 2),
  (18, 1),
  (22, 1),
  (42, 1),
  (48, 2),
  (68, 1),
  (78, 1),
  (86, 1),
  (94, 1),
  (117, 1),
  (119, 1)
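The conversion above turns every row of the document-term matrix into a list of `(term id, count)` pairs, skipping zero entries. A small NumPy sketch of the same idea on a dense toy matrix; `rows_to_corpus` and the matrix values are made up for illustration and this is not how gensim implements it.

```python
import numpy as np


def rows_to_corpus(count_matrix):
    """Convert a documents-by-terms count matrix into (term id, count) lists."""
    corpus = []
    for row in np.asarray(count_matrix):
        # Keep only the terms that actually occur in the document
        corpus.append([(int(j), int(row[j])) for j in np.flatnonzero(row)])
    return corpus


counts = np.array([[0, 2, 0, 1],
                   [1, 0, 0, 3]])
print(rows_to_corpus(counts))
# [[(1, 2), (3, 1)], [(0, 1), (3, 3)]]
```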

<font color = green >

#### Create id2word dictionary

</font>



In [45]:
id_map = {idx: term for term, idx in vectorizer.vocabulary_.items()}  # invert term -> id into id -> term

<font color = green >

#### Generate LDA model

</font>



In [46]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=6, id2word=id_map, passes=25, random_state=34)
# for comparison, the corpora.Dictionary version used earlier was:
# ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20, random_state=0)


<font color = green >

#### Review topics

</font>



In [47]:
ldamodel.print_topics(num_topics=6,num_words=10)

[(0,
  '0.025*"edu" + 0.019*"com" + 0.018*"use" + 0.018*"thanks" + 0.016*"does" + 0.015*"know" + 0.011*"mail" + 0.010*"apple" + 0.009*"help" + 0.008*"want"'),
 (1,
  '0.061*"drive" + 0.039*"disk" + 0.030*"scsi" + 0.027*"drives" + 0.027*"hard" + 0.025*"controller" + 0.021*"card" + 0.018*"rom" + 0.016*"cable" + 0.016*"floppy"'),
 (2,
  '0.024*"people" + 0.022*"god" + 0.012*"atheism" + 0.012*"think" + 0.012*"believe" + 0.012*"don" + 0.010*"does" + 0.010*"just" + 0.009*"argument" + 0.009*"say"'),
 (3,
  '0.023*"game" + 0.021*"year" + 0.020*"team" + 0.013*"games" + 0.013*"play" + 0.011*"good" + 0.011*"don" + 0.010*"think" + 0.010*"season" + 0.010*"players"'),
 (4,
  '0.035*"space" + 0.019*"nasa" + 0.018*"data" + 0.013*"information" + 0.013*"available" + 0.013*"center" + 0.011*"ground" + 0.010*"research" + 0.010*"000" + 0.010*"new"'),
 (5,
  '0.017*"just" + 0.017*"like" + 0.016*"don" + 0.012*"car" + 0.012*"time" + 0.011*"think" + 0.011*"good" + 0.010*"know" + 0.008*"way" + 0.008*"people"')]
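When naming topics it helps to look at the top words as plain lists. The sketch below parses one of the formatted topic strings shown above with a regular expression; `topic_words` is a hypothetical helper, and the sample string is shortened from topic 4.

```python
import re


def topic_words(topic_string):
    """Parse a string like '0.035*"space" + 0.019*"nasa"' into (word, weight) pairs."""
    pairs = re.findall(r'([\d.]+)\*"([^"]*)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]


topic = '0.035*"space" + 0.019*"nasa" + 0.018*"data"'
print(topic_words(topic))
# [('space', 0.035), ('nasa', 0.019), ('data', 0.018)]
```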

<font color = green >

#### Name topics

</font>



In [48]:
topics_names= ['Education', 'Computers & IT', 'Religion', 'Sports', 'Science','Society & Lifestyle']

<font color = green >

#### Classify the new text 

</font>



In [53]:
import numpy as np

In [54]:
new_doc = ["\n\nIt's my understanding that the freezing will start to occur because \
of the\ngrowing distance of Pluto and Charon from the Sun, due to it's\nelliptical orbit. \
It is not due to shadowing effects. \n\n\nPluto can shadow Charon, and vice-versa.\n\nGeorge \
Krumins\n-- "] 


In [55]:
doc_vectorized= vectorizer.transform(new_doc) # input param is list
new_doc_corpus = gensim.matutils.Sparse2Corpus(doc_vectorized, documents_columns=False)
doc_topics = ldamodel.get_document_topics(new_doc_corpus)
list(doc_topics)

[[(0, 0.033417623),
  (1, 0.033340912),
  (2, 0.03351664),
  (3, 0.03380149),
  (4, 0.83228344),
  (5, 0.03363986)]]

In [56]:
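def elicit_topic_name(doc_topics):    
    # use the topic id itself, since low-probability topics may be missing from the list
    topic_id, _ = max(list(doc_topics)[0], key=lambda pair: pair[1])
    return topics_names[topic_id]
elicit_topic_name(doc_topics)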

'Science'

<font color = green >

## Home Task 

</font>


<font color = green >

### Topic Modeling 

</font>

[voted-kaggle-dataset](https://www.kaggle.com/canggih/voted-kaggle-dataset/version/2#voted-kaggle-dataset.csv)

In [18]:
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
from nltk.corpus import stopwords
import pandas as pd
import gensim

# `os` and `path` come from the 'Load "News" data' cell above
fn = os.path.join(path, 'voted-kaggle-dataset.csv')
df = pd.read_csv(fn)

In [17]:
print ('len of texts= {:,}'.format(len(df)))
index = 10 
df.loc[index, 'Description']

len of texts= 2,150


'These files contain complete loan data for all loans issued through the 2007-2015, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. The file containing loan data through the "present" contains complete loan data for all loans issued through the previous completed calendar quarter. Additional features include credit scores, number of finance inquiries, address including zip codes, and state, and collections among others. The file is a matrix of about 890 thousand observations and 75 variables. A data dictionary is provided in a separate file. k'

In [29]:
df_data_set = df["Description"]

In [30]:
en_stop  = set(stopwords.words('english'))
p_stemmer = PorterStemmer()

def tokenize(df_data_set):
    texts = []
    for doc in df_data_set:
        # tokenize document string
        doc = str(doc)
        raw = doc.lower()
        tokens = word_tokenize(raw)

        # remove stop words from tokens
        tokens = [token for token in tokens if token not in en_stop]

        # stem tokens
        tokens = [p_stemmer.stem(token) for token in tokens]

        # add tokens to list
        texts.append(tokens)
    return texts

texts = tokenize(df_data_set)
texts[0]

['dataset',
 'contain',
 'transact',
 'made',
 'credit',
 'card',
 'septemb',
 '2013',
 'european',
 'cardhold',
 '.',
 'dataset',
 'present',
 'transact',
 'occur',
 'two',
 'day',
 ',',
 '492',
 'fraud',
 '284,807',
 'transact',
 '.',
 'dataset',
 'highli',
 'unbalanc',
 ',',
 'posit',
 'class',
 '(',
 'fraud',
 ')',
 'account',
 '0.172',
 '%',
 'transact',
 '.',
 'contain',
 'numer',
 'input',
 'variabl',
 'result',
 'pca',
 'transform',
 '.',
 'unfortun',
 ',',
 'due',
 'confidenti',
 'issu',
 ',',
 'provid',
 'origin',
 'featur',
 'background',
 'inform',
 'data',
 '.',
 'featur',
 'v1',
 ',',
 'v2',
 ',',
 '...',
 'v28',
 'princip',
 'compon',
 'obtain',
 'pca',
 ',',
 'featur',
 'transform',
 'pca',
 "'time",
 "'",
 "'amount",
 "'",
 '.',
 'featur',
 "'time",
 "'",
 'contain',
 'second',
 'elaps',
 'transact',
 'first',
 'transact',
 'dataset',
 '.',
 'featur',
 "'amount",
 "'",
 'transact',
 'amount',
 ',',
 'featur',
 'use',
 'example-depend',
 'cost-senst',
 'learn',
 '.',
 '
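The token list above mixes informative stems with punctuation, numbers and one-off identifiers, which inflates the vocabulary. One option, shown here only as a sketch and not as part of the original task, is to drop tokens that occur in too few documents, similar in spirit to the `min_df` setting used with `CountVectorizer` earlier. The helper name `prune_by_document_frequency` and the toy data are assumptions.

```python
from collections import Counter


def prune_by_document_frequency(texts, min_docs=2):
    """Remove tokens that appear in fewer than min_docs documents."""
    document_frequency = Counter()
    for tokens in texts:
        document_frequency.update(set(tokens))  # count each token once per document
    return [[t for t in tokens if document_frequency[t] >= min_docs]
            for tokens in texts]


toy_texts = [["dataset", "fraud", "492"], ["dataset", "loan"], ["loan", "fraud", "v28"]]
print(prune_by_document_frequency(toy_texts))
# [['dataset', 'fraud'], ['dataset', 'loan'], ['loan', 'fraud']]
```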

In [31]:
dictionary = corpora.Dictionary(texts)  # id <-> term mapping built by gensim (an alternative to CountVectorizer)
print(type(dictionary), dictionary)
for k, w in dictionary.items():
    print(k, w)

<class 'gensim.corpora.dictionary.Dictionary'> Dictionary(35583 unique tokens: ['%', "'", "'amount", "'class", "'time"]...)
0 %
1 '
2 'amount
3 'class
4 'time
5 (
6 )
7 ,
8 .
9 ...
10 //mlg.ulb.ac.b
11 //mlg.ulb.ac.be/artml
12 //mlg.ulb.ac.be/bruf
13 0
14 0.172
15 1
16 2013
17 2015
18 284,807
19 492
20 :
21 a.
22 account
23 accuraci
24 amount
25 analys
26 andrea
27 area
28 auprc
29 avail
30 background
31 big
32 bontempi
33 bruxel
34 caelen
35 calibr
36 card
37 cardhold
38 case
39 cidm
40 cite
41 class
42 classif
43 collabor
44 collect
45 compon
46 comput
47 confidenti
48 confus
49 contain
50 cost-senst
51 credit
52 current
53 curv
54 dal
55 data
56 dataset
57 day
58 de
59 detail
60 detect
61 due
62 elaps
63 european
64 example-depend
65 featur
66 first
67 fraud
68 gianluca
69 given
70 group
71 highli
72 http
73 ieee
74 imbal
75 inform
76 input
77 intellig
78 issu
79 johnson
80 learn
81 libr
82 machin
83 made
84 matrix
85 meaning
86 measur
87 mine
88 numer
89 obtain
90 occur
91 olivi
92

833 send
834 serv
835 sha-256
836 share
837 soon
838 spare
839 student
840 technic
841 thank
842 thu
843 timestamp
844 togeth
845 trade
846 trust
847 unforseen
848 unintuit
849 unix
850 variou
851 via
852 viewer
853 volum
854 wait
855 way
856 weight
857 -freeformresponses.csv
858 1000
859 14
860 16,000
861 16,716
862 16.4
863 171
864 25th
865 5
866 50
867 5th
868 7th
869 allow
870 attempt
871 august
872 break
873 calcul
874 channel
875 choic
876 column-wis
877 compens
878 comprehens
879 conduct
880 convers
881 conversionrates.csv
882 correspond
883 day-to-day
884 decod
885 describ
886 edg
887 email
888 employ
889 establish
890 exclud
891 flag
892 folk
893 form
894 free
895 freeform
896 freeformresponses.csv
897 happen
898 ident
899 industry-wid
900 initi
901 interact
902 kind
903 less
904 live
905 median
906 month
907 multipl
908 multiplechoiceresponses.csv
909 non-random
910 novemb
911 open-end
912 option
913 packag
914 primarili
915 proceed
916 protect
917 pud
918 quantmod
919 r
920 

1583 n
1584 needl
1585 none
1586 nuclei
1587 nucleu
1588 o.
1589 optim
1590 p.
1591 perimet
1592 perimeter^2
1593 portion
1594 radiu
1595 real-valu
1596 recod
1597 robust
1598 se
1599 server
1600 signific
1601 smooth
1602 softwar
1603 standard
1604 symmetri
1605 textur
1606 uw
1607 'reviews'
1608 1999
1609 2012.
1610 256,059
1611 260
1612 568,454
1613 74,258
1614 amateur
1615 amazon
1616 connoisseur
1617 evolut
1618 expertis
1619 j.
1620 leskovec
1621 mcauley
1622 octob
1623 onlin
1624 plain
1625 pull
1626 queri
1627 quick
1628 reviews.csv
1629 span
1630 www
1631 ~500,000
1632 +
1633 1.85
1634 155
1635 2.1
1636 2.2
1637 2013-2016
1638 2014-
1639 2014-2016
1640 20th
1641 absenc
1642 add
1643 alway
1644 assess
1645 bar
1646 benchmark
1647 cantril
1648 capita
1649 celebr
1650 character
1651 civil
1652 contrast
1653 corrupt
1654 dystopia
1655 effect
1656 equal
1657 equat
1658 exceed
1659 experi
1660 expert
1661 explain
1662 extent
1663 factor
1664 favor
1665 fourth
1666 freedom
1667 gain
1

2332 vari
2333 8600
2334 axi
2335 behaviour
2336 brigg
2337 carl
2338 categoris
2339 clung
2340 code/typ
2341 cognit
2342 distinct
2343 extrovers
2344 googl
2345 intp
2346 introvers
2347 intuit
2348 judg
2349 jung
2350 jungian
2351 later
2352 letter
2353 mbti
2354 mind
2355 myer
2356 personalitycaf
2357 pipe
2358 reveal
2359 scientif
2360 sens
2361 short
2362 suggest
2363 surround
2364 thought
2365 typolog
2366 unreli
2367 use/valid
2368 written
2369 |||
2370 2003
2371 academ
2372 alma
2373 arabia
2374 barro-le
2375 china
2376 compris
2377 controversi
2378 critic
2379 disagre
2380 educ
2381 expenditur
2382 fair
2383 focus
2384 gross
2385 human
2386 influenti
2387 institut
2388 instruct
2389 kingdom
2390 levi
2391 mater
2392 non-english-instruct
2393 practic
2394 privat
2395 saudi
2396 shanghai
2397 supplementari
2398 undermin
2399 unesco
2400 abus
2401 advantag
2402 analyz
2403 attitud
2404 bring
2405 care
2406 care_opt
2407 choos
2408 company/organ
2409 consequ
2410 cowork
2411 disord

3082 flaw
3083 forward
3084 genet
3085 harvest
3086 highest-r
3087 ideal
3088 impress
3089 intens
3090 linger
3091 manhattan
3092 melt
3093 narrowli
3094 ordinari
3095 passabl
3096 pound
3097 praiseworthi
3098 premium
3099 profici
3100 profil
3101 promin
3102 puriti
3103 resid
3104 satisfactori
3105 solid
3106 storag
3107 straight
3108 summar
3109 superior
3110 terroir
3111 transcend
3112 unpalat
3113 vision
3114 14.3
3115 2014.
3116 329
3117 4.5
3118 affiliated_base_num
3119 american_b01362.csv
3120 b02512
3121 b02598
3122 b02617
3123 b02682
3124 b02764
3125 b02765
3126 b02835
3127 b02836
3128 borough
3129 carmel_b00256.csv
3130 coarse-grain
3131 commiss
3132 danach-ni
3133 date/tim
3134 dial7_b00887.csv
3135 diplo_b01196.csv
3136 dispatch
3137 dispatching_base_num
3138 dreist
3139 drinnen
3140 driver
3141 essenti
3142 federal_02216.csv
3143 fhv
3144 fine-grain
3145 firstclass_b01536.csv
3146 fivethirtyeight
3147 flv
3148 foil
3149 for-hir
3150 grun
3151 hfv
3152 highclass_b01717.csv


3832 mountain
3833 pilot
3834 plane
3835 poor
3836 punctuat
3837 rout
3838 runway
3839 shortli
3840 socrata
3841 solut
3842 struck
3843 takeoff
3844 trends/behavior
3845 welcom
3846 yearli
3847 1,996
3848 1966-2017
3849 1996
3850 2,488
3851 2488
3852 39
3853 398
3854 48
3855 515
3856 58
3857 audienc
3858 church
3859 consult
3860 crime
3861 crowd
3862 deadli
3863 duti
3864 eight
3865 enrich
3866 gun
3867 hate
3868 in-depth
3869 injur
3870 lat-long
3871 open/clos
3872 ownership
3873 policeman
3874 prone
3875 safeti
3876 serial
3877 stanford
3878 texa
3879 u.
3880 ver
3881 wikipedia
3882 wit
3883 y/n
3884 •
3885 backord
3886 deck_risk
3887 forecast_3_month
3888 forecast_6_month
3889 forecast_9_month
3890 in_transit_qti
3891 lead_tim
3892 local_bo_qti
3893 min_bank
3894 national_inv
3895 oe_constraint
3896 overdu
3897 perf_12_month_avg
3898 perf_6_month_avg
3899 pieces_past_du
3900 potential_issu
3901 ppap_risk
3902 react
3903 rev_stop
3904 sales_1_month
3905 sales_3_month
3906 sales_6_mon

4582 tonyplaysguitar
4583 usd_goal_r
4584 usd_pledg
4585 usd_pledged_r
4586 'femal
4587 'male
4588 //www.ibrahimaljarah.com
4589 11-
4590 119-136.
4591 17
4592 172
4593 175
4594 188
4595 191
4596 2016-11-8
4597 210
4598 235
4599 245
4600 270
4601 289
4602 292
4603 305
4604 360
4605 360.
4606 480
4607 69
4608 70
4609 89
4610 90-100.
4611 above-7
4612 abu
4613 aeect
4614 aljarah
4615 amman
4616 amrieh
4617 announc
4618 announcements-how
4619 arab
4620 architectur
4621 birth-
4622 classroom
4623 days-th
4624 e-learn
4625 egypt
4626 elaf
4627 electr
4628 french
4629 g-01
4630 g-02
4631 g-03
4632 g-04
4633 g-05
4634 g-06
4635 g-07
4636 g-08
4637 g-09
4638 g-10
4639 g-11
4640 g-12
4641 geolog
4642 grade/mark
4643 groups-
4644 hamtini
4645 hand-
4646 high-level
4647 highschool
4648 ibrahim
4649 id-
4650 integer/categor
4651 iran
4652 iraq
4653 jordan
4654 kalboard
4655 kuwait
4656 leading-edg
4657 learner
4658 lebanon
4659 levels-
4660 libya
4661 lm
4662 low-level
4663 lowerlevel
4664 lybia
4

5331 unifrom
5332 unlaw
5333 ward
5334 //
5335 1-cooc
5336 100/200/etc
5337 100m
5338 151
5339 293,000
5340 400
5341 5850m
5342 800
5343 90-5850m
5344 =200
5345 =400
5346 afternoon
5347 appearedday/appearedmonth/appearedyear
5348 appeareddayofweek
5349 appearedhour/appearedminut
5350 appearedlocaltim
5351 appearedtimeofday
5352 appl
5353 atmospher
5354 boolean
5355 breezi
5356 breezyandmostlycloudi
5357 breezyandovercast
5358 breezyandpartlycloudi
5359 capabl
5360 cellid
5361 clear-day
5362 clear-night
5363 closetowat
5364 cloudi
5365 co-occurr
5366 contin
5367 cooc
5368 creatur
5369 datadump
5370 day/month/year
5371 delet
5372 densiti
5373 dri
5374 drizzl
5375 drizzleandbreezi
5376 dryandmostlycloudi
5377 dryandpartlycloudi
5378 foggi
5379 friday
5380 glcf
5381 gp
5382 gym
5383 gym/pokestop
5384 gymdistancekm
5385 gymin100m-pokestopin5000m
5386 heavyrain
5387 hour/minut
5388 humid
5389 humidandovercast
5390 humidandpartlycloudi
5391 io
5392 km
5393 km/h
5394 lightrain
5395 lightrainan

6081 tree
6082 variant
6083 verd
6084 vinho
6085 -5
6086 //keras.io/applications/
6087 0.715
6088 0.727
6089 0.759
6090 0.788
6091 0.790
6092 0.804
6093 0.901
6094 0.910
6095 0.929
6096 0.944
6097 0.945
6098 0.953
6099 awesom
6100 chollet
6101 françoi
6102 gpu
6103 inceptionresnetv2
6104 inceptionv3
6105 pretrain
6106 resnet50
6107 top-1
6108 vgg16
6109 vgg19
6110 xception
6111 1965.
6112 5.5
6113 cooper
6114 dissemin
6115 earthquak
6116 facil
6117 magnitud
6118 modern
6119 neic
6120 seismic
6121 seismograph
6122 15,000
6123 bigger
6124 camp
6125 codenewbie.org
6126 nonprofit
6127 20,000
6128 _golden
6129 _last_judgment_at
6130 _trusted_judg
6131 _unit_id
6132 _unit_st
6133 ai
6134 contributor-judg
6135 existence/non-exist
6136 fav_numb
6137 forev
6138 gender_gold
6139 golden
6140 hex
6141 judgment
6142 link_color
6143 non-golden
6144 non-human
6145 non-individu
6146 profile_yn
6147 profile_yn_gold
6148 profileimag
6149 retweet_count
6150 sidebar
6151 sidebar_color
6152 stylist
6153 ti

6831 thusli
6832 tldr
6833 total_snowfal
6834 vincenti
6835 visitor
6836 weather_st
6837 weather_stations.csv
6838 380,000+
6839 artist/year/song
6840 cocain
6841 ec2
6842 eminem
6843 genre.txt
6844 hennesi
6845 jack
6846 jasoria
6847 liquor
6848 multi-thread
6849 musician
6850 pot
6851 rapstar
6852 rockstar
6853 shruti
6854 sjasoria
6855 biweekli
6856 bullet
6857 due_dat
6858 effective_d
6859 loan_id
6860 loan_statu
6861 one-tim
6862 paidoff_tim
6863 pastdue_day
6864 schedul
6865 //serebii.net/
6866 802
6867 against_
6868 base_egg_step
6869 base_happi
6870 capture_r
6871 eighteen
6872 experience_growth
6873 genderless
6874 hatch
6875 impervi
6876 inflict
6877 is_legendari
6878 japanese_nam
6879 judici
6880 metr
6881 moon
6882 percentage_mal
6883 pokedex
6884 pokedex_numb
6885 sp_attack
6886 sp_defens
6887 sun
6888 type1
6889 type2
6890 weakest
6891 -input
6892 -output
6893 ./fasttext
6894 0.916
6895 1-
6896 2-star
6897 3-star
6898 5-star
6899 __label__
6900 __label__1
6901 __label__2


7581 administ
7582 aggrav
7583 al-qaeeda
7584 authent
7585 bashir
7586 bomb
7587 brown
7588 bush
7589 canvass
7590 drone
7591 fata
7592 hira
7593 increase/decreas
7594 khyber
7595 kpk
7596 milit
7597 newspap
7598 obama
7599 oldest
7600 outfit
7601 pakhtunkhwa
7602 pakistan
7603 pakistanbodycount.org
7604 pakistandroneattacks.csv
7605 predat
7606 reaper
7607 strike
7608 suicid
7609 taliban
7610 tank
7611 tenur
7612 usmani
7613 www.pakistanbodycount.org
7614 zeeshan-ul-hassan
7615 +70k
7616 pre-procces
7617 10,265
7618 12.2
7619 8.3-percent
7620 9,943
7621 alcohol-impair
7622 motorcyclist
7623 nhtsa
7624 pedalcyclist
7625 vmt
7626 'hord
7627 //github.com/myles-oneill/wowah-pars
7628 //mmnet.iis.sinica.edu.tw/dl/wowah/
7629 //wowwiki.wikia.com/wiki/zones_by_level_
7630 chines
7631 competitit
7632 dmi3kno
7633 faction
7634 guild
7635 inhabit
7636 knight
7637 lich
7638 loyalti
7639 preserv
7640 qo
7641 unicod
7642 unsubscrib
7643 warcraft
7644 workload
7645 wotlk
7646 wow
7647 wrangl
7648 w

8330 'tim-moy
8331 'tim-nam
8332 'tim-yoc
8333 107
8334 11404
8335 1146068
8336 12786
8337 1354149
8338 146
8339 20680
8340 23790
8341 2413
8342 38
8343 4262
8344 48034
8345 5290
8346 58388
8347 602
8348 8152
8349 866
8350 891
8351 9800
8352 datafram
8353 eve
8354 geo
8355 geo-nam
8356 gmb
8357 gpe
8358 groningen
8359 nat
8360 phenomenon
8361 cpu
8362 cpus.csv
8363 game-deb
8364 gpus.csv
8365 manufactur
8366 resolut
8367 audio-visu
8368 brainwav
8369 brainwave-sens
8370 cheaper
8371 chosen
8372 consumer-grad
8373 exercis
8374 headset
8375 icon
8376 immers
8377 inconspicu
8378 indra_tim
8379 john
8380 maillart
8381 merril
8382 mindwav
8383 nick
8384 packet
8385 stimuli
8386 stimulu
8387 stimulus-times.csv
8388 stimulus-timing.csv
8389 subject-metadata.csv
8390 superbowl
8391 uc
8392 versa
8393 //doi.org/10.1088/0031-9155/60/14/5471
8394 //doi.org/10.7937/k9/tcia.2015.7go2gsk
8395 2004
8396 5471-5496.
8397 anatom
8398 carolyn
8399 crf
8400 de-identif
8401 dicom
8402 doi:10.1088/0031-9155

9080 decommission_d
9081 dock
9082 from_station_id
9083 from_station_nam
9084 install_d
9085 install_dockcount
9086 membership
9087 modification_d
9088 pronto
9089 pst
9090 rider
9091 starttim
9092 stoptim
9093 to_station_id
9094 to_station_nam
9095 trip_id
9096 tripdur
9097 usertyp
9098 //nsidc.org/data/nsidc-0051.html
9099 10^6
9100 1978-2015
9101 cryospher
9102 frozen
9103 glacier
9104 hemispher
9105 nsidc
9106 realm
9107 snow
9108 sq
9109 //data.worldbank.org/indicator/sl.uem.totl.z
9110 15-24
9111 ilo
9112 labour
9113 *xx
9114 1-11
9115 1-13
9116 10.
9117 11.
9118 12.
9119 1929
9120 1956
9121 1961
9122 2001-14.
9123 2010.
9124 302
9125 315
9126 316
9127 317
9128 360,361,363,363-a
9129 363
9130 366
9131 366-a
9132 367
9133 369
9134 372
9135 373
9136 376
9137 384
9138 4.1
9139 4.2
9140 5.1
9141 5.2
9142 5.3
9143 6.
9144 abandon
9145 abduct
9146 abet
9147 action/punish
9148 acual
9149 anti
9150 antiqu
9151 apprehend
9152 arson
9153 asi
9154 atroc
9155 autho
9156 auxilliari
9157 beg
9

9830 swipe
9831 timelogg
9832 uni
9833 3rd
9834 50,000
9835 baron
9836 epoch
9837 euw
9838 herald
9839 lol
9840 rift
9841 summon
9842 team1
9843 team2
9844 victori
9845 182
9846 280
9847 alive.
9848 eer
9849 meetup
9850 unifi
9851 commod
9852 sheep
9853 steel
9854 typewrit
9855 -help
9856 -identifi
9857 -make
9858 -requir
9859 -understand
9860 //groups.yahoo.com/neo/groups/careercenter-am
9861 //www.slideshare.net/habetmadoyan/it-skills-analysis-63686238
9862 19,000
9863 2004-2015
9864 aboutc
9865 announcementcod
9866 applicationp
9867 armenia
9868 armenian
9869 careercent
9870 curriculum
9871 deadlin
9872 elig
9873 jobdescript
9874 jobpost
9875 jobrequir
9876 openingd
9877 part-tim
9878 profess
9879 quicker
9880 requiredqu
9881 richer
9882 0,800000
9883 2gb
9884 720,000
9885 720k
9886 airdrop
9887 ammo
9888 armor
9889 battleground
9890 bluezon
9891 corral
9892 erangel
9893 first-aid
9894 first/third-person
9895 fpp/tpp
9896 hide
9897 miramar
9898 playerunknown
9899 pubg
9900 pubg.op.g

10580 readjust
10581 thid
10582 week_number_block_weekday.sqlit
10583 liquid
10584 oil
10585 spill
10586 1999.10.14
10587 286:531-537
10588 34
10589 aml
10590 bloomfield
10591 c.d
10592 caligiuri
10593 coller
10594 d.k
10595 gaasenbeek
10596 golub
10597 huard
10598 j.p.
10599 j.r.
10600 lander
10601 leukemia
10602 loh
10603 lymphoblast
10604 m.a
10605 marrow
10606 mesirov
10607 microarray
10608 myeloid
10609 peripher
10610 proof-of-concept
10611 re-scal
10612 slonim
10613 t.r
10614 tamayo
10615 //www2.camara.leg.br/transparencia/cota-para-exercicio-da-atividade-parlamentar/dados-abertos-cota-parlamentar
10616 140.000r
10617 brave
10618 brazil
10619 brazilian
10620 deputi
10621 entitl
10622 fashion
10623 formal
10624 monitoring/data
10625 refund
10626 scandal
10627 senat
10628 thrive
10629 u'company_id
10630 u'company_nam
10631 u'deputy_nam
10632 u'deputy_st
10633 u'political_parti
10634 u'refund_date'
10635 unusu
10636 //github.com/sunnysai12345/news_summari
10637 4515
10638 author_nam

11329 -infrastructur
11330 -labor
11331 -poverti
11332 -privat
11333 -public
11334 -scienc
11335 -social
11336 -trade
11337 //data.worldbank.org/data-catalog/world-development-ind
11338 prosper
11339 2012-2013
11340 coach
11341 determinist
11342 hey
11343 partli
11344 whoscored.com
11345 //github.com/rojour/boston_result
11346 //registration.baa.org/2017/cf/public/iframe_resultssearch.cfm
11347 juic
11348 marathon
11349 quest
11350 trail
11351 veteran
11352 //crimemap.dc.gov/crimemapsearch.aspx
11353 assault
11354 sidewalk
11355 tabs-geooth
11356 //github.com/planetsig/ufo-report
11357 1949.
11358 _of_encount
11359 alien
11360 cheeseburg
11361 date_docu
11362 date_tim
11363 defenc
11364 described_dur
11365 earthl
11366 extraterrestri
11367 immin
11368 length_of_encounter_second
11369 planetari
11370 pub
11371 swap
11372 ufo_shap
11373 //github.com/jeffsackmann/tennis_atp
11374 1stin
11375 2ndwon
11376 ace
11377 bpface
11378 bpsave
11379 df
11380 jeff
11381 sackmann
11382 svgm
11383 svp

12079 approv
12080 bold
12081 case_numb
12082 case_statu
12083 case_status*
12084 case_submit
12085 certified_withdrawn
12086 decision_d
12087 doleta
12088 employer_address
12089 employer_c
12090 employer_countri
12091 employer_nam
12092 employer_phon
12093 employer_phone_ext
12094 employer_postal_cod
12095 employer_provinc
12096 employer_st
12097 employment_end_d
12098 employment_start_d
12099 full_time_posit
12100 h-1b_depend
12101 ina
12102 job_titl
12103 naics_cod
12104 original_cert_d
12105 prevailing_wag
12106 pw_sourc
12107 pw_source_oth
12108 pw_source_year
12109 pw_unit_of_pay
12110 soc_cod
12111 soc_nam
12112 total_work
12113 visa_class
12114 wage_rate_of_pay_from
12115 wage_rate_of_pay_to
12116 wage_unit_of_pay
12117 weka
12118 willful_viol
12119 worksite_c
12120 worksite_counti
12121 worksite_postal_cod
12122 worksite_st
12123 //gleason.case.edu/webdata/jpi-dl-tutorial/idc_regular_ps50_idx5.zip
12124 //spie.org/publications/proceedings/paper/10.1117/12.2043872
12125 //www.n

12829 balls_remain
12830 bilater
12831 bowl-outs/sup
12832 d/l
12833 dd/mm/yyyy
12834 home/away
12835 inning
12836 innings1
12837 innings1_ov
12838 innings1_overs_bat
12839 innings1_run
12840 innings1_wicket
12841 innings2
12842 innings2_ov
12843 innings2_overs_bat
12844 innings2_run
12845 innings2_wicket
12846 match_detail
12847 rain-adjust
12848 semi-fin
12849 t20
12850 t20i
12851 tie
12852 vjb
12853 wicket
12854 win_by_run
12855 win_by_wicket
12856 arrest-id
12857 arrestd
12858 arrestloc
12859 arresttim
12860 baltimor
12861 chargedescript
12862 deprat
12863 incidentloc
12864 incidentoffens
12865 location1
12866 offenc
12867 1/1/2015
12868 10/19/2016
12869 congression
12870 fec.gov
12871 fundrais
12872 //www.fhfa.gov/policyprogramsresearch/research/paperdocuments/1996-03_hpi_techdescription_n508.pdf
12873 1991.
12874 msa
12875 non-season
12876 refinanc
12877 countrystat
12878 faostat
12879 sub-nat
12880 two-way
12881 zipfil
12882 0–4
12883 5mg
12884 abdomin
12885 ate
12886 autoimmun


13579 reprint
13580 trafficway
13581 almanac
13582 demographi
13583 factbook
13584 mix-in
13585 400000
13586 aggregater
13587 datepublish
13588 files
13589 numdownloadsmin
13590 packagenam
13591 ratingcount
13592 softwarevers
13593 //luna16.grand-challenge.org
13594 //www.kaggle.com/c/data-science-bowl-2017
13595 hounsfield
13596 lesion
13597 luna
13598 amazoncloudwatch
13599 ambient_temperature_system_failure.csv
13600 anomal
13601 artif
13602 artifically-gener
13603 artificialnoanomaly/
13604 artificialwithanomaly/
13605 bucket
13606 catastroph
13607 cost-per-click
13608 cpc
13609 cpm
13610 cpu_utilization_asg_misconfiguration.csv
13611 datacent
13612 ec2_request_latency_system_failure.csv
13613 fourteenth
13614 github.com/numenta/nab
13615 lavin
13616 machine_temperature_system_failure.csv
13617 mahcin
13618 minnesota
13619 nab
13620 numenta
13621 numenta.org
13622 nyc_taxi.csv
13623 publicly-trad
13624 realadexchange/
13625 realawscloudwatch/
13626 realknowncause/
13627 realtraffic

14328 heyvaert
14329 interlink
14330 japanese-dissemin
14331 n-tripl
14332 pieter
14333 pronunci
14334 queried.
14335 rdf
14336 themes.
14337 vibrant
14338 advissuedyndesc
14339 bite_d
14340 bitten
14341 breed
14342 breediddesc
14343 dispositioniddesc
14344 genderiddesc
14345 head_sent_d
14346 kentucki
14347 louisvil
14348 personal/identifi
14349 quarantin
14350 quarantine_d
14351 rabi
14352 rabid
14353 resultsiddesc
14354 speciesiddesc
14355 vaccination_d
14356 vaccination_yr
14357 victim_zip
14358 wherebitteniddesc
14359 'a
14360 'in
14361 //stats.oecd.org/index.aspx
14362 datasetcode=bl
14363 non-oecd
14364 oecd
14365 russia
14366 ugm3
14367 //www.bls.gov/data/
14368 1.000.
14369 100.0
14370 1913.
14371 1966.
14372 1977.
14373 1982
14374 1982-1984
14375 1984
14376 1988.
14377 2285
14378 8,300
14379 aaor
14380 bimonthli
14381 city-s
14382 cmsa
14383 cu
14384 cu.txt
14385 cw
14386 earner
14387 eleven
14388 handbook
14389 interarea
14390 region/city-s
14391 saor
14392 semi-annu
14393 t

15078 nelson
15079 octfip
15080 ofip
15081 ostfip
15082 otrfip
15083 pop10
15084 usp
15085 erupt
15086 gvp
15087 holocen
15088 mall
15089 smithsonian
15090 unrest
15091 volcan
15092 volcano
15093 1939
15094 col
15095 jenn
15096 live-act
15097 robertson
15098 thor
15099 warfar
15100 wwii
15101 1942
15102 adolf
15103 ander
15104 apollo
15105 ayatullah
15106 borman
15107 fdr
15108 grace
15109 guidelin
15110 hitler
15111 honore
15112 indel
15113 khomeini
15114 lovel
15115 newsmak
15116 newsworthi
15117 stalin
15118 statesmen
15119 unforgett
15120 .json
15121 amirhossein
15122 circuit
15123 cuff
15124 cuff-less
15125 cuffless
15126 data_bp
15127 data_ecg
15128 data_fsr
15129 data_pcg
15130 data_ppg
15131 dbp
15132 diastol
15133 ecg
15134 electrocardiogram
15135 esmaili
15136 fingertip
15137 force-sens
15138 fsr
15139 iscas'17
15140 kachue
15141 mahdi
15142 non-invas
15143 pcg
15144 phonocardiogram
15145 photoplethysmogram
15146 ppg
15147 puls
15148 resistor
15149 sbp
15150 shabani
15151 sha

15828 attacks/attempt
15829 cyber
15830 honeypot
15831 jacob
15832 jay
15833 rudi
15834 +100
15835 -104
15836 -104dbm
15837 -7299.786516730871000
15838 -7695.9387549299299000
15839 //www.ipin2014.org/wp/pdf/4a-3.pdf
15840 //www.uji.
15841 0dbm
15842 110.000m2
15843 1111
15844 19937
15845 4864745.7450159714
15846 4865017.3646842018.
15847 521
15848 522
15849 523
15850 524
15851 525
15852 526
15853 527
15854 528
15855 529
15856 adolfo
15857 arnau
15858 avariento
15859 balagu
15860 belmont
15861 benedito-bordonau
15862 building_error
15863 building_pen
15864 building_penalti
15865 buildingid
15866 carlo
15867 castelló
15868 coordinates_error
15869 corridor
15870 door
15871 estimated_latitud
15872 estimated_longitude-real_longitud
15873 eva
15874 fingerprint
15875 fingerprint-bas
15876 floor_error
15877 floor_pen
15878 floor_penalti
15879 garcia-martí
15880 gargallo
15881 gonzalez
15882 huerta
15883 ianiss
15884 inside/outsid
15885 ipin2015
15886 iren
15887 jaum
15888 javier
15889 joaquín


16578 hickey
16579 nougat
16580 peanut
16581 peanutalmondi
16582 pluribu
16583 priceperc
16584 sugarperc
16585 wafer
16586 walt
16587 winperc
16588 //hockeygoalies.org
16589 //hsp.flyershistory.com
16590 //jeays.net/shootout/index.htm
16591 //nhltradeshistory.blogspot.com
16592 //sports.groups.yahoo.com/group/hockey-databank
16593 //sports.groups.yahoo.com/group/hockey_summary_project/
16594 //sports.yahoo.com/nhl
16595 //www.espn.com/nhl/statist
16596 //www.findagrave.com
16597 //www.hockey-reference.com
16598 //www.hockeydb.com
16599 //www.hockeyresearch.com/stat
16600 //www.legendsofhockey.net/html/search.htm
16601 //www.losthockey.com
16602 //www.nhl.com
16603 //www.nhlshootouts.com
16604 //www.ottawavalleyonline.com/sites/tomking_01/index.html
16605 //www.puckerings.com
16606 //www.shrpsports.com/hsp
16607 //www.sihrhockey.org
16608 //www.sportsnetwork.com
16609 //www.usatoday.com/sports/hockey
16610 //www.usatoday.com/sports/hockey/archive.htm
16611 1917-18
16612 1917-64
16613 19

17327 lemma
17328 lesser
17329 loct
17330 masc
17331 natalia
17332 neut
17333 nikolaeva
17334 nomn
17335 ostapuk
17336 plur
17337 plur,1per
17338 plur,2per
17339 plur,3per
17340 protopopova
17341 rev_id
17342 shimorina
17343 sing,1per
17344 sing,2per
17345 sing,3per
17346 slavic
17347 soviet
17348 stepanova
17349 surikov
17350 svetlana
17351 tfr
17352 unrecognis
17353 vasili
17354 victor
17355 ежа
17356 ежам
17357 еже
17358 ежей
17359 ежи
17360 ежом
17361 ежу
17362 сват
17363 ёж
17364 ёрничаем
17365 ёрничает
17366 ёрничаете
17367 ёрничаешь
17368 ёрничай
17369 ёрничайте
17370 ёрничал
17371 ёрничала
17372 ёрничали
17373 ёрничало
17374 ёрничаю
17375 ёрничают
17376 dot_traffic_2015.txt.gz
17377 dot_traffic_stations_2015.txt.gz
17378 heaviest
17379 ~2gb
17380 0-23
17381 10k-12k
17382 bcn
17383 carrer
17384 causa
17385 codi
17386 conductor
17387 coordenada
17388 d'expedi
17389 del
17390 desc
17391 descripció
17392 dia
17393 dona
17394 edat
17395 ferit
17396 festiv
17397 greu
17398 hood
17399

18077 tumulti
18078 undercard
18079 univis
18080 univision-host
18081 univision/telemundo
18082 upper-cas
18083 video.end
18084 video.start
18085 vt
18086 walker
18087 wallac
18088 wi
18089 wmur-tv
18090 woodruff
18091 yuuuug
18092 biodivers
18093 millennia
18094 movebank
18095 ornitholog
18096 citywid
18097 self-measur
18098 2012/2013
18099 500m
18100 isu
18101 luckiest
18102 skate
18103 skater
18104 smartest
18105 surpass
18106 toe
18107 .zip
18108 //donnees.ville.montreal.qc.ca/dataset/bixi-etat-des-st
18109 //donnees.ville.montreal.qc.ca/dataset/bixi-historique-des-deplac
18110 //donnees.ville.montreal.qc.ca/dataset/polygones-arrondiss
18111 //en.wikipedia.org/wiki/bixi_montr
18112 a9al
18113 aaaa-mm-jj
18114 ahunts
18115 bixi
18116 biximontrealrentals2014.zip
18117 bk
18118 bx
18119 choropleth
18120 côte-des-neiges–notre-dame-de-grâc
18121 duration_sec
18122 dx
18123 end_dat
18124 end_station_cod
18125 geodes
18126 hochelaga-maisonneuv
18127 is_memb
18128 lc
18129 montréal
18130 n

18827 discretitzar
18828 distribucion
18829 dividid
18830 domini
18831 dòlar
18832 econòm
18833 efect
18834 eliminar
18835 empresa
18836 enfocat
18837 entr
18838 entramat
18839 equilibri
18840 espanyol
18841 especial
18842 estan
18843 estem
18844 estigui
18845 estudi
18846 estudiat
18847 està
18848 euro
18849 extracció
18850 fet
18851 financera
18852 fitxer
18853 forma
18854 formulació
18855 força
18856 freqüènci
18857 garanteix
18858 generar
18859 generat
18860 grafica
18861 grau
18862 gràfic
18863 gt
18864 guardar
18865 han
18866 hem
18867 homegeneitat
18868 homogeneitat
18869 identificar
18870 igual
18871 instància
18872 integritat
18873 intercanvi
18874 interessa
18875 interessà
18876 iteració
18877 kaggle.com
18878 kmean
18879 l'índex
18880 l'última
18881 leven
18882 lineal
18883 llicència
18884 mateix
18885 mercat
18886 mitja
18887 mitjà
18888 molt_alt
18889 moned
18890 moviment
18891 mé
18892 normalitat
18893 normalitzar-l
18894 nou
18895 observa
18896 observacion
18897 observac

19577 2015-01-04
19578 2017-10-01.
19579 corn
19580 gann
19581 jason
19582 quantopian
19583 //www.folha.uol.com.br/
19584 167.053
19585 folha
19586 1971-2017
19587 categorialdataset.csv
19588 continousdataset.csv
19589 cricinfo
19590 dtclassifi
19591 labelleddataset.csv
19592 mlpclassifi
19593 odi
19594 originaldataset.csv
19595 30n
19596 available_bik
19597 available_bike_stand
19598 ayto
19599 bike-shar
19600 bike_lanes.csv
19601 bike_stand
19602 bikes.csv
19603 closedor
19604 compatbil
19605 contract_nam
19606 datast
19607 ed50
19608 ejecutado
19609 estado
19610 etalab
19611 jcdecaux
19612 lane
19613 last_upd
19614 monto
19615 odc-bi
19616 santand
19617 tiia
19618 tusbic
19619 wkt
19620 wkt_wsg84
19621 //unstats.un.org/unsd/gender/data.html
19622 awareness-rais
19623 customiz
19624 shed
19625 unfpa
19626 //archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008
19627 1999-2008
19628 beata
19629 cerner
19630 cio
19631 clore
19632 commonwealth
19633 ctsa
19634 de

20576 kitchen_avail
20577 land_possess
20578 leading/
20579 lighting_sourc
20580 m_expall_statu
20581 m_id
20582 m_serial_no
20583 member_ident
20584 month_of_birth
20585 month_of_death
20586 month_of_marriag
20587 months_of_pregn
20588 mortality_data_dictionary.xlsx
20589 mother_serial_no
20590 nearest_medical_facil
20591 no_of_dwelling_room
20592 occupation_statu
20593 order_of_birth
20594 owner_statu
20595 place_of_death
20596 psu_id
20597 reason_for_not_attending_school
20598 regular_treat
20599 regular_treatment_sourc
20600 relation_to_head
20601 residancial_statu
20602 schedule_id
20603 serial_no
20604 serial_num_of_infant_moth
20605 social_group_cod
20606 sought_medical_car
20607 stratum_cod
20608 symptoms_of_death
20609 symptoms_pertaining_il
20610 time_between_onset_of_complicati
20611 toilet_us
20612 treatment_sourc
20613 usual_resid
20614 viz
20615 water_filter
20616 year_of_birth
20617 year_of_death
20618 year_of_marriag
20619 //nlp.stanford.edu/projects/glove/
20620 //www.

21326 80-84
21327 85+
21328 dear
21329 same-sex
21330 statistician
21331 //tatoeba.org/
21332 tatoeba
21333 //data.seoul.go.kr/openinf/sheetview.jsp
21334 //english.seoul.go.kr/policy-information/key-policies/informatization/seoul-open-data-plaza/
21335 //www.ft.com/content/b49a9878-141b-11e7-80f4-13e067d5072c
21336 25-50
21337 blow
21338 infid=oa-2275
21339 jeonguk
21340 plaza
21341 tmenu=11
21342 5-day
21343 intro
21344 multiple-choic
21345 nope
21346 shuffl
21347 yep
21348 🐱🐶
21349 🐶
21350 🙅
21351 //github.com/mokosan/fsharpadv
21352 lord
21353 lotr
21354 //aligulac.com/
21355 1-0
21356 addon
21357 lotv
21358 match_dat
21359 player_1
21360 player_1_match_statu
21361 player_1_rac
21362 player_2
21363 player_2_match_statu
21364 player_2_rac
21365 tournament_typ
21366 wol-
21367 zerg
21368 //data.gov.uk/dataset/national-statistics-postcode-lookup-uk
21369 //download.companieshouse.gov.uk/en_output.html
21370 //www.gov.uk/government/publications/standard-industrial-classification-of-eco

22076 14,373
22077 219,280
22078 25:1
22079 504
22080 63,412
22081 8,319
22082 869,378
22083 fulltim
22084 hec
22085 interest/expertis
22086 m.phil/m
22087 p.g.d
22088 post-gradu
22089 science/it
22090 sector/govern
22091 specialization/research
22092 //archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphon
22093 //www.kaggle.com/t/5c27656d61ec4808bcbddd67ac1fdc5a
22094 08800
22095 activityrecognit
22096 barcelonatech
22097 catalunya
22098 cetpd
22099 degli
22100 diten
22101 geltrú
22102 genova
22103 i-16145
22104 introductori
22105 non-linear
22106 palermo
22107 politècnica
22108 simplif
22109 smartlab
22110 smartlab.w
22111 universidad
22112 università
22113 vilanova
22114 //www.bart.gov/about/reports/ridership
22115 112
22116 423,000
22117 date-hour-soo-dest-2016.csv
22118 date-hour-soo-dest-2017.csv
22119 fifth-busiest
22120 inter-st
22121 ridership
22122 station-to-st
22123 station_info.csv
22124 ../generatenoise.sh
22125 //espeak.sourceforge.net/
22126 //githu

22826 nicklebi
22827 pg1023.txt
22828 pg1392.txt
22829 pg1407.txt
22830 pg19337.txt
22831 pg23344.txt
22832 pg32241.txt
22833 pg676.txt
22834 pg699.txt
22835 pg730.txt
22836 pickwick
22837 rainbird
22838 rectifi
22839 rudg
22840 uncommerci
22841 //www.ptt.cc/bbs/stock/index.html
22842 2015/10/01
22843 2017/10/06
22844 2017/7/17~2017/10/06
22845 ^twii
22846 daychat_push_60d_1006.csv
22847 ppt
22848 ptt
22849 ptt_stock_p3000_p3718.csv
22850 topic_url
22851 twii_20151001_20171006.csv
22852 car_sal
22853 //github.com/kjkjava/garmin-connect-export
22854 bulk-downloaded-not-cleaned-dataset
22855 colmun
22856 cross-road
22857 gadet
22858 givem
22859 gps-sport-smartwatch
22860 gpx
22861 oder
22862 rosskopf
22863 runn
22864 tue
22865 undersuppli
22866 untest
22867 vivoact
22868 wed.
22869 classnbr
22870 courseid
22871 cross-regist
22872 department/program
22873 doi:10.7910/dvn/9mwtyo
22874 gsa
22875 hcol
22876 mehta
22877 neel
22878 non-degree-seek
22879 nondgr
22880 number/lett
22881 pa8a+2yr3

23575 habitat
23576 nsw
23577 parrot
23578 pezopor
23579 sm2
23580 spuriou
23581 sq1
23582 sq1.wav
23583 sq1_gt.txt
23584 sq2
23585 sq3
23586 sq4
23587 sub-fold
23588 vocalis
23589 wallicu
23590 de:0009-7-35977
23591 dejan
23592 eurovis
23593 ivković
23594 language-rel
23595 urn
23596 *chi
23597 *exp
23598 -um
23599 .cha
23600 //alpha.talkbank.org/clan/
23601 //childes.talkbank.org/access/biling/paradis.html
23602 1-25.
23603 172-187.
23604 aoe
23605 clan
23606 crago
23607 interrupt/overlap
23608 longituin
23609 non-linguist
23610 pseudonym
23611 spece
23612 video-­‐tap
23613 184
23614 anniversari
23615 commemor
23616 statehood
23617 when/wher
23618 40.
23619 ami
23620 axz
23621 bryan
23622 bryan.culbertson
23623 calthorp
23624 coars
23625 coarsediscours
23626 context/background
23627 csail
23628 culbertson
23629 in_reply_to
23630 is_first_post
23631 is_self_post
23632 kawong
23633 link_to_post
23634 main_typ
23635 majority_link
23636 majority_typ
23637 mit.edu
23638 paritosh
23639 pkp

24325 retouch
24326 1926
24327 712.
24328 80-plu
24329 codif
24330 counsel
24331 non-posit
24332 predecessor
24333 proclam
24334 restat
24335 subchapt
24336 subclaus
24337 subparagraph
24338 subpart
24339 boro
24340 buddi
24341 categorynam
24342 companynam
24343 cop
24344 courthous
24345 eventendd
24346 eventnam
24347 eventstartd
24348 exterior
24349 film/show
24350 film/televis
24351 metrocosm
24352 mome
24353 mozart
24354 prep/shoot/wrap
24355 projectid
24356 projecttitl
24357 reshoot
24358 shoot/ev
24359 shorthand
24360 state/feder
24361 subcategorynam
24362 sunsetpark-010815
24363 untitl
24364 //www.voterstudygroup.org/
24365 11,168
24366 35,408
24367 8,000
24368 8,637
24369 ccap
24370 charit
24371 non-demograph
24372 olsen
24373 panelist
24374 pre-elect
24375 strata
24376 stratifi
24377 yougov
24378 accross
24379 candidateid
24380 candidatenam
24381 candidateparti
24382 cn
24383 cochis
24384 coconino
24385 contestid
24386 contesttitl
24387 contesttot
24388 desing
24389 earlyturnou

25075 dreamwork
25076 fu
25077 glenn
25078 kung
25079 kung-fu-panda
25080 manti
25081 martial
25082 mystic
25083 oogway
25084 shufi
25085 spongebob
25086 tai
25087 taoism
25088 tigress
25089 viper
25090 'mask
25091 1024x1280
25092 840
25093 aluminium
25094 fatigu
25095 sem
25096 striation
25097 x200
25098 x400
25099 內容
25100 歐洲足球
25101 歐洲足球分析
25102 歐洲足球資料庫
25103 背景
25104 2010-2015.
25105 1899
25106 breakneck
25107 reap
25108 untold
25109 defeat
25110 gerrymand
25111 princeton
25112 12525
25113 953
25114 arrest-rel
25115 bodili
25116 ccapo
25117 cjsc
25118 dagli
25119 doj
25120 lea
25121 leoka
25122 meric
25123 probat
25124 ursu
25125 2355-0.txt
25126 2485-0.txt
25127 34967-0.txt
25128 36520-0.txt
25129 46408-0.txt
25130 autobiograph
25131 beagl
25132 cirripedia
25133 darwin
25134 insectivor
25135 monograph
25136 mould
25137 pg1227.txt
25138 pg2087.txt
25139 pg2088.txt
25140 pg22728.txt
25141 pg22764.txt
25142 pg2690.txt
25143 pg2739.txt
25144 pg2740.txt
25145 pg2871.txt
25146 pg2872.tx

25825 exercício
25826 federai
25827 federação
25828 fez
25829 fim
25830 financeira
25831 financeiro
25832 fiscai
25833 foi
25834 forenam
25835 formada
25836 fornecedor
25837 franqueada
25838 funcionamento
25839 gerará
25840 glosa
25841 governo
25842 goza
25843 há
25844 idecadastro
25845 idem
25846 identifica
25847 identificador
25848 ideologia
25849 implantar
25850 incidirá
25851 indicativo
25852 indtipodocumento
25853 interess
25854 início
25855 issue_d
25856 juntament
25857 junto
25858 jurídica
25859 juízo
25860 leg_of_the_trip
25861 legislativa
25862 legislatura
25863 liberdad
25864 liderança
25865 localização
25866 lote
25867 líquido
25868 mai
25869 mandato
25870 medida
25871 meio
25872 montli
25873 muda
25874 mê
25875 necessita
25876 negativo
25877 net_valu
25878 nova
25879 nucarteiraparlamentar
25880 nudeputadoid
25881 nulegislatura
25882 numano
25883 numespecificacaosubcota
25884 numlot
25885 numm
25886 numparcela
25887 numressarcimento
25888 numsubcota
25889 observado
25890 oca

26574 probabilist
26575 raw-data
26576 raw-fil
26577 tang:12tkd
26578 titlesim
26579 xuezhi
26580 xxx
26581 xxx.txt
26582 xxx.xml
26583 xxx_txt
26584 xxx_xml
26585 //creativecommons.org/licenses/by/4.0/
26586 //docsouth.unc.edu/neh
26587 //voyant-tools.org/
26588 allan
26589 appendic
26590 bain
26591 chancellor
26592 chapel
26593 co-edit
26594 docsouth
26595 named-ent
26596 nineteenth-centuri
26597 paratextu
26598 poe
26599 unc
26600 unc-chapel
26601 unc.edu
26602 uncorrect
26603 virginian
26604 voyant
26605 wilsonlibrari
26606 world-renown
26607 world-wid
26608 guilti
26609 misbrand
26610 non-voluntari
26611 self-polic
26612 viol
26613 //www.propublica.org/datastore/dataset/federal-air-marshal-misconduct-databas
26614 5,214
26615 disciplinari
26616 extran
26617 hijack
26618 mete
26619 newsroom
26620 undercov
26621 'is_pati
26622 *ag_ratio*
26623 *direct_bilirubin*
26624 *tot_bilirubin*
26625 *tot_proteins*
26626 alkpho
26627 is_pati
26628 records.th
26629 selector
26630 sgot
26631 sgp

27324 yes/no
27325 1,437
27326 additon
27327 argonne-meus
27328 wwi
27329 2013-2017
27330 blasio
27331 self-identifi
27332 small-to-larg
27333 //github.com/zarkonnen/aimeetup_coin
27334 1fr
27335 50rp
27336 coint
27337 fr_back
27338 pile
27339 rappen
27340 rp_back
27341 brace
27342 curli
27343 distance=x
27344 end_address
27345 end_loc
27346 ft/mi
27347 start_address
27348 start_location=
27349 step1
27350 step2
27351 mare
27352 previsioni
27353 segnalazioni
27354 venezia
27355 2017-06-19
27356 ccsub
27357 colbert
27358 fixat
27359 stubborn
27360 ssi
27361 'rare
27362 disesas
27363 sentimi
27364 //en.wikipedia.org/wiki/support_and_resist
27365 backtest
27366 ta-lib
27367 +10.0
27368 -10.0
27369 36710
27370 36711
27371 73422
27372 joketext.csv
27373 userratings1.csv
27374 userratings2.csv
27375 //capitalbikeshare.com/system-data
27376 //dchr.dc.gov/page/holiday-schedul
27377 //dx.doi.org/10.1007/s13748-013-0040-3
27378 //www.freemeteo.com
27379 1-15
27380 10.1007/s13748-013-0040-3
27381

28074 0.625gb
28075 0.655gb
28076 0.684gb
28077 0.699gb
28078 0.881gb
28079 0.911gb
28080 0.990gb
28081 1.049gb
28082 1.127gb
28083 1.205gb
28084 1.300gb
28085 2.134gb
28086 2.158gb
28087 3.248gb
28088 4.452gb
28089 4.471gb
28090 4.863gb
28091 5.160gb
28092 7.725gb
28093 abap
28094 actionscript
28095 ada
28096 cobol
28097 coffeescript
28098 crystal
28099 csharp
28100 css
28101 cuda
28102 dart
28103 delphi
28104 erlang
28105 fortran
28106 haskel
28107 hi-r
28108 js
28109 lisp
28110 loader
28111 lua
28112 perl
28113 prolog
28114 rust
28115 scala
28116 tex
28117 vbnet
28118 //www.unicode.org/copyright.html
28119 1991-2017
28120 6.1.0
28121 bidirection
28122 code-point
28123 codepoint
28124 decomposit
28125 ideograph
28126 linebreak
28127 name-alia
28128 name_alia
28129 nushu
28130 propertyaliases.txt
28131 propertyvaluealiases.txt
28132 special_case_condit
28133 tangut
28134 ucd
28135 unihan
28136 100,000.
28137 1950-2011
28138 198
28139 2013-2015
28140 livabl
28141 lucki
28142 suicidolog

28824 bakheet
28825 balani
28826 balbo
28827 baldassarr
28828 baligh
28829 banik
28830 banki
28831 bansal
28832 bao
28833 bapat
28834 baptisteamato
28835 barba
28836 bargava
28837 bari
28838 barillaro
28839 barney
28840 baronchen
28841 barrena
28842 bartczak
28843 bartel
28844 barton.new
28845 bartulov
28846 barudwal
28847 barã³n
28848 basil
28849 basit
28850 bastien
28851 batanga
28852 batham
28853 batzig
28854 batzner
28855 bauer
28856 bauman
28857 baur
28858 bayarjarg
28859 baylã©
28860 bayowa
28861 bazako
28862 bazinga
28863 beavi
28864 bec14
28865 bedi
28866 begiev
28867 behr
28868 behzadgolshan
28869 beili
28870 belhaoua
28871 belitskaya
28872 belliappa
28873 bello
28874 belov
28875 beltrã¡n
28876 belvederethecat
28877 belyaev
28878 benavid
28879 benben
28880 beneschi
28881 benf
28882 benist
28883 benjaminswedlov
28884 benkhelif
28885 benmeida
28886 benner
28887 bequet
28888 bera
28889 berenguer
28890 berhan
28891 bernadac
28892 bert
28893 besain
28894 betancourt
28895 bethtseng


29573 fujiwara
29574 fungyuehoi
29575 funnymango
29576 furbi
29577 fuzzyfroghunt
29578 g1ng0
29579 gabara
29580 gabrielacaesar
29581 gabrielavellaneda
29582 gabro
29583 gael
29584 gagan
29585 gaganbhatia
29586 gagneja
29587 gagrai
29588 gajendrabadw
29589 gajulavarthi
29590 galvi
29591 ganesan
29592 ganesh
29593 ganiyu
29594 ganji
29595 gantal
29596 gaoweiwang
29597 garag
29598 garcã­a
29599 garg
29600 gargeya
29601 garnier
29602 garrard
29603 garrykevin
29604 gasimov
29605 gaspar
29606 gauchi
29607 gaurav
29608 gauravjain
29609 gautam
29610 gauthamsenthil
29611 gavankar
29612 gavin
29613 gavinarmstrong
29614 gawlik
29615 gaz113
29616 gbadebo
29617 geco
29618 gecodavid
29619 gehlot
29620 gellowmellow
29621 genacri
29622 geneburin
29623 genexpr
29624 genin
29625 gennadii
29626 geoffnobl
29627 geonsoo
29628 georgemcintir
29629 georgii
29630 georginaros
29631 gerardo
29632 gerardosegura
29633 gere
29634 gerg
29635 gerosa
29636 getting_start
29637 gevault
29638 gevorg
29639 gfan
29640 ggze

30323 kocal
30324 kocaman
30325 kokatjuhha
30326 kola
30327 kolh
30328 kolli
30329 kollmer
30330 kolodni
30331 komo
30332 kompella
30333 konan
30334 kondalarao
30335 kondo
30336 kongmane
30337 konshin
30338 konstantin
30339 konstantino
30340 konuk
30341 korablev
30342 korakot
30343 korchagin
30344 kori
30345 korzeniewicz
30346 koshki
30347 kosiewmm
30348 kossa
30349 kostako
30350 kosterin
30351 kostiantyn
30352 kostya
30353 kote42
30354 kothari
30355 kotian
30356 kotobotov
30357 kouassi
30358 koulibali
30359 kozlova
30360 koã§
30361 kp
30362 kpapamih
30363 kr
30364 kravdiy
30365 kreimer
30366 krishna
30367 krishnadheeraj
30368 krishnamurthi
30369 krishnapraveen
30370 krishnathiyagarajan
30371 krismurphi
30372 kristian
30373 kristjan
30374 kristofferhess
30375 kristoph
30376 krithel
30377 krizsã³
30378 krsimon
30379 krã¶ger
30380 ksayantani
30381 ksenia
30382 kso
30383 kukunuru
30384 kulshreshtha
30385 kumaran
30386 kumarbhrgv
30387 kumarhalak
30388 kun
30389 kunal
30390 kunalkumawat
30

31073 palashshah
31074 pallav
31075 pallavi
31076 pallec
31077 palley
31078 pambudi
31079 panagioti
31080 panano
31081 panchal
31082 panchicor
31083 pancho
31084 panda974
31085 pandatadelta
31086 pandey
31087 pandeya
31088 pandian
31089 pang
31090 panizzo
31091 pano
31092 paosheng
31093 pappa
31094 paresh
31095 parichart
31096 parihar
31097 parindsheel
31098 parmanand
31099 parmar
31100 parseltung
31101 parth
31102 parthmaheshwari
31103 pasha
31104 paslay
31105 pasquato
31106 pastor
31107 patanajirasit
31108 patata
31109 patel
31110 patil
31111 patit
31112 patra
31113 patryk
31114 paul-loui
31115 paula
31116 paultimothymooney
31117 paulzh
31118 pavansubhash
31119 pavel
31120 paveltroshenkov
31121 pavlin
31122 pavlo
31123 pavlukhin
31124 pawan
31125 pazookii
31126 pbcquoc
31127 pcminer
31128 pearc
31129 pedro
31130 pedrofrantz
31131 peinado
31132 peirano
31133 peng-hsuan
31134 pengm
31135 pengxu
31136 pengzha
31137 penkov
31138 penovi
31139 peppermintshak
31140 perastiko
31141 peretiatk

31823 soar
31824 sobrado
31825 sofiya
31826 sohaib
31827 sohaibomar
31828 soham
31829 sohel
31830 sohier
31831 sohinibhattacharya
31832 sol
31833 solanki
31834 soldo
31835 soler-compani
31836 somasundaram
31837 somat
31838 somavarapu
31839 somesh
31840 sommenoob
31841 somnath
31842 sonali
31843 sonamsrivastava
31844 sonavan
31845 sonawan
31846 songsong
31847 sonneland
31848 soo
31849 soojung
31850 soorajm
31851 soroosh
31852 sotopia
31853 souci
31854 soufian
31855 soufianeorama
31856 souhaiel
31857 souhail
31858 soukaina
31859 souliman
31860 souman
31861 soumitra
31862 sourabhmitt
31863 sourav
31864 souravmaharana
31865 sousa
31866 sovboc2018
31867 sowhit
31868 sowmiya
31869 soywu
31870 spadafora
31871 spadim
31872 spencer
31873 sprabakar
31874 sreeram
31875 sreyansh
31876 sridhar
31877 sridharan
31878 srigiriraju
31879 srihari
31880 sriharirao
31881 srikanth
31882 srilakshmi
31883 srilakshminandamuri
31884 srilbg
31885 srinath
31886 srinivasan
31887 srinivasrao
31888 srinivinnakota
31

32572 æžç
32573 æž—æ¹§æ£®
32574 ç
32575 ç§‹ä¹‹çµç¾½
32576 éœœé›ªåƒå¹´
32577 ê²½ë¦¼
32578 ê³½
32579 ë¯¼
32580 ë²¤ìž
32581 î¹î´î·ï‚
32582 îœî±ïî¹î¿ï‚
32583 îœî¹ï‡î±î·î
32584 ðð
32585 ðð½ð´ñ€ðµð¹ð¢ð¸ð¼ð¾ñ
32586 ðñ€ñ‚ðµð¼
32587 ð´ñ‹ð·ð¶ð°ð½ð¡ð°ð³ð¸ð¼ð±ð°ðµð²
32588 ðµðµð²
32589 ðµðºñð°ð½ð´ñ€
32590 ð¸ñð°
32591 ðÿñƒð³ð°ñ‡ðµð²ð°
32592 ðœð°ðºññžñ‚ðµð½ðºð¾
32593 ðœð¸ñ…ð°ð¸ð
32594 ð–ñƒð
32595 ð›ñð½
32596 ø§ø¨
32597 ø·ùšùø£ø­ù…ø¯øºù
32598 ø¹ø¨ø¯ø§ù
32599 ù
32600 „
32601 ‡m
32602 ‹ch
32603 ‹å³°
32604 ™i
32605 'hands-on
32606 //github.com/ageron/handson-ml/tree/master/datasets/h
32607 291-297.
32608 33.3
32609 aurélien
32610 géron
32611 housing_median_ag
32612 luí
32613 median_house_valu
32614 median_incom
32615 ocean_proxim
32616 rudimentari
32617 torgo
32618 total_bedroom
32619 total_room
32620 toyish
32621 10year
32622 added/consid
32623 data.go.kr
32624 buildings/structur
32625 dicsplin
32626 browser_us
32627 device_us
32628 is_respons
32629 323
32630 cant
32631 ml/dp
32632 motherboa

33322 22:00u
33323 2°
33324 365day
33325 3week
33326 4933
33327 60225
33328 66000
33329 9:00am
33330 9:00u-22:00
33331 black-point
33332 blackpoint
33333 collid
33334 day..
33335 day/night
33336 dayblock
33337 daytim
33338 feast
33339 fr-sa-so
33340 fr/so
33341 giss
33342 google-api
33343 guard-capit
33344 guard-pharmaci
33345 halv
33346 haversin
33347 kalend
33348 mid/week
33349 midweek
33350 midweek/weekend
33351 mo-tu-we-th
33352 mse
33353 night-guard
33354 nightblock
33355 nightshift
33356 pharmaci
33357 pharmacist
33358 sample..
33359 school/holiday
33360 sinksen
33361 sun/moon
33362 superimpos
33363 weekend/midweek
33364 weekendblock
33365 'flat
33366 2011-2016
33367 analytics/forecasting/consolid
33368 worldsteel
33369 www.worldsteel.org
33370 htsa
33371 merchandis
33372 usitc
33373 3a
33374 80-20
33375 benwing.com
33376 delozi
33377 document-geoloc
33378 geoannot
33379 geojson
33380 grantdelozi
33381 polygon-centroid
33382 rebellion
33383 size-2
33384 stand-off
33385 training/t

34072 shoton
34073 shoton_detail.csv
34074 0.0004
34075 0.0005
34076 0.0200
34077 39.9
34078 3seri
34079 co_tailpip
34080 g/
34081 hc_tailpip
34082 nox_tailpip
34083 sedan
34084 tailpip
34085 these
34086 vehicle_spe
34087 𝑎𝑛𝑑
34088 resumpt
34089 sona
34090 //www.imcb.osaka-u.ac.jp/nakai/psort.html
34091 109-115.
34092 11:95-110
34093 14:897-911
34094 336
34095 alm1
34096 alm2
34097 alom
34098 amino
34099 chg
34100 cleavabl
34101 cs.berkeley.edu
34102 cytoplasm
34103 e.coli
34104 gram-neg
34105 gvh
34106 heijn
34107 hoc
34108 horton
34109 imcb.osaka-u.ac.jp
34110 kanehisa
34111 kenta
34112 lipoprotein
34113 mcg
34114 mcgeoch
34115 membran
34116 minoru
34117 n-terminu
34118 nakai
34119 om
34120 oml
34121 paulh
34122 peptidas
34123 periplasm
34124 perisplasm
34125 probablist
34126 suita
34127 swiss-prot
34128 sytem
34129 uncleav
34130 yamada-oka
34131 +-
34132 //archive.ics.uci.edu/ml/datasets/forest+fir
34133 //www.dsi.uminho.pt/~pcortez/fires.pdf
34134 0.00
34135 0.40
34136 10-fold
3413

35072 rumrat
35073 rumratings.com
35074 //www.census.gov/data/tables/1995/demo/educational-attainment/p20-489.html
35075 //www.census.gov/data/tables/2005/demo/educational-attainment/cps-detailed-tables.html
35076 //www.census.gov/data/tables/2015/demo/education-attainment/p20-578.html
35077 //www.census.gov/topics/education/educational-attainment/data/tables.all.html
35078 18-24
35079 18-24.
35080 18_24
35081 25-34
35082 29-34
35083 coarser
35084 bypass
35085 dentist
35086 misinform
35087 read_delim
35088 ~2012
35089 //www.fortnitechests.info/
35090 2.1.0
35091 crit
35092 fortnit
35093 mag
35094 projectil
35095 reload
35096 soumydev
35097 underutil
35098 //www.usnews.com
35099 acceptance-r
35100 act-avg
35101 adelphi
35102 akron
35103 albani
35104 alliant
35105 argosi
35106 ashland
35107 auburn
35108 augusta
35109 azusa
35110 baptist
35111 baylor
35112 benedictin
35113 biola
35114 bloomington
35115 bois
35116 brandei
35117 brigham
35118 brunswick
35119 canyon
35120 capella
35121 carbo

In [32]:
print('\nconvert tokenized documents into a document-term matrix')
# Each document becomes a list of (token_id, count) pairs
corpus = [dictionary.doc2bow(text) for text in texts]
for item in corpus:
    print(item)


convert tokenized documents into a document-term matrix
[(0, 1), (1, 5), (2, 2), (3, 1), (4, 2), (5, 5), (6, 5), (7, 13), (8, 14), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 4), (21, 1), (22, 1), (23, 2), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 2), (42, 2), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 3), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 3), (56, 5), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 7), (66, 1), (67, 4), (68, 1), (69, 1), (70, 1), (71, 1), (72, 3), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 2), (81, 1), (82, 1), (83, 1), (84, 1), (85, 1), (86, 1), (87, 2), (88, 1), (89, 1), (90, 1), (91, 1), (92, 1), (93, 1), (94, 1), (95, 3), (96, 1), (97, 1), (98, 1), (99, 1), (100, 1), (101, 1), (102, 1), (103, 1), (104, 1), 

[(5, 2), (6, 2), (7, 3), (8, 10), (20, 1), (55, 2), (56, 2), (60, 1), (75, 1), (92, 1), (125, 1), (133, 1), (136, 1), (151, 1), (158, 2), (164, 1), (182, 1), (192, 1), (230, 3), (249, 1), (254, 1), (292, 1), (307, 3), (314, 1), (328, 1), (331, 1), (337, 1), (355, 2), (403, 2), (409, 1), (420, 1), (575, 1), (576, 1), (577, 1), (586, 1), (654, 1), (691, 1), (694, 1), (704, 1), (719, 1), (890, 3), (1059, 2), (1248, 1), (1325, 1), (1662, 1), (1741, 1), (1816, 2), (1819, 1), (1946, 1), (2035, 1), (2255, 2), (3197, 1), (3369, 2), (3410, 2), (3680, 1), (5424, 2), (5470, 1), (5899, 1), (6016, 1), (6682, 2), (6701, 1), (7913, 1), (8447, 1), (9318, 2), (10233, 1), (10694, 1), (12135, 1), (12917, 2), (17267, 1), (17269, 1), (17588, 1), (19122, 1), (21699, 1), (23226, 1), (23575, 3), (25154, 2), (25162, 1), (25163, 2), (25164, 1), (25165, 1), (25166, 1), (25167, 1), (25168, 1), (25169, 1), (25170, 1), (25171, 1), (25172, 1), (25173, 1), (25174, 8), (25175, 1)]
[(5, 1), (6, 1), (7, 4), (8, 4), (9, 
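
Each list printed above holds `(token_id, count)` pairs for one document. The idea behind this bag-of-words conversion can be sketched in plain Python. This is a minimal illustration, not gensim's implementation: the helper names `build_vocabulary` and `doc_to_bow` and the toy documents are hypothetical, and ids are assumed to be assigned in first-seen order, which may differ from the ids gensim assigns.

```python
from collections import Counter


def build_vocabulary(texts):
    """Assign an integer id to every distinct token, in first-seen order (assumption)."""
    token2id = {}
    for tokens in texts:
        for token in tokens:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def doc_to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Toy tokenized documents (hypothetical, not the notebook's data)
toy_texts = [
    ["data", "dataset", "data", "use"],
    ["univers", "state", "dataset"],
]
toy_token2id = build_vocabulary(toy_texts)
toy_corpus = [doc_to_bow(tokens, toy_token2id) for tokens in toy_texts]
for bow in toy_corpus:
    print(bow)
```

Only tokens that occur in a document appear in its list, so each document is stored as a sparse row of the document-term matrix.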

In [37]:
# Train LDA with 2 topics; a fixed random_state makes the run reproducible
ldamodel = gensim.models.ldamodel.LdaModel(
    corpus, num_topics=2, id2word=dictionary, passes=20, random_state=0)

In [38]:
# The model was trained with 2 topics, so at most 2 can be shown
ldamodel.print_topics(num_topics=2, num_words=10)

[(0,
  '0.057*"," + 0.050*"." + 0.027*":" + 0.022*")" + 0.021*"(" + 0.014*"data" + 0.010*"dataset" + 0.007*"use" + 0.005*"?" + 0.005*"\'\'"'),
 (1,
  '0.048*"univers" + 0.032*"." + 0.015*"dataset" + 0.012*"-" + 0.012*"," + 0.009*"state" + 0.009*"--" + 0.007*"descript" + 0.006*"data" + 0.006*"yet"')]
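
In the topics printed above, most of the highest-weighted terms are punctuation tokens such as `","`, `"."` and `":"`, which makes the topics hard to interpret. One possible remedy, which is not part of the original notebook, is to drop tokens that contain no letter or digit before building the dictionary. The sketch below uses hypothetical helper names (`is_informative`, `drop_punctuation_tokens`) and a toy token list.

```python
def is_informative(token):
    """Keep a token only if it contains at least one letter or digit."""
    return any(ch.isalnum() for ch in token)


def drop_punctuation_tokens(texts):
    """Remove pure-punctuation tokens from every tokenized document."""
    return [[t for t in tokens if is_informative(t)] for tokens in texts]


# Toy tokenized documents (hypothetical) mixing words and punctuation
sample_texts = [
    ["data", ",", "dataset", ".", "(", "use", ")", "''", "--"],
    ["univers", "-", "state", "?", "mo-tu-we-th"],
]
filtered_texts = drop_punctuation_tokens(sample_texts)
print(filtered_texts)
```

Hyphenated tokens such as `mo-tu-we-th` are kept because they still contain letters. To apply this idea in the notebook, `texts` would be filtered first and `dictionary` and `corpus` rebuilt from the filtered version before retraining the model.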

<font color = green >

## Learn more
</font>

Latent Dirichlet allocation
<br>
https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation


<font color = green >

## Next lesson: Clustering 
</font>

