# This notebook demonstrates how to use the Python NLTK package for text cleaning and preparation. It also shows how to use cosine similarity to find similar documents.

## 2017 Dec Shilpa Jain

# Install Python NLTK package

NLTK is a natural language toolkit for building programs in Python that work with natural language text.
We will use NLTK for this course.

In [5]:
!pip list --isolated

alabaster (0.7.8)
anaconda-client (1.4.0)
anaconda-navigator (1.2.1)
argcomplete (1.0.0)
asn1crypto (0.24.0)
astropy (1.2.1)
astunparse (1.5.0)
Babel (2.5.0)
backports.shutil-get-terminal-size (1.0.0)
backports.weakref (1.0rc1)
beautifulsoup4 (4.6.0)
biopython (1.66)
bitarray (0.8.1)
bkcharts (0.2)
blaze (0.10.1)
bleach (2.0.0)
bokeh (0.12.7)
boto (2.40.0)
boto3 (1.4.4)
botocore (1.5.50)
Bottleneck (1.1.0)
brunel (2.3)
bz2file (0.98)
cdsax-jupyter-extensions (0.1)
certifi (2017.11.5)
cffi (1.11.2)
chardet (3.0.4)
chest (0.2.3)
click (6.6)
cloudpickle (0.2.1)
clyent (1.2.2)
cognitive-assistant (1.0.52)
colorama (0.3.7)
conda (4.3.27)
conda-build (1.21.3)
configobj (5.0.6)
configparser (3.5.0)
contextlib2 (0.5.3)
cryptography (2.1.4)
cycler (0.10.0)
Cython (0.24)
cytoolz (0.8.0)
dask (0.10.0)
datashape (0.5.2)
debtcollector (1.17.0)
decorator (4.0.10)
dill (0.2.5)
docutils (0.12)
dynd (0.7.3.dev1)
enum34 (1.1.5)
et-xmlfile (1.0.1)
extension-utils (0.1.57)
fastcache (1.0.2)
Flask (0.11.1)

In [6]:
!pip install nltk --upgrade

Requirement already up-to-date: nltk in /gpfs/global_fs01/sym_shared/YPProdSpark/user/sfbc-20c2d955c74628-3c618564d05f/.local/lib/python3.5/site-packages
Requirement already up-to-date: six in /gpfs/global_fs01/sym_shared/YPProdSpark/user/sfbc-20c2d955c74628-3c618564d05f/.local/lib/python3.5/site-packages (from nltk)


## Import NLTK and download NLTK book collection

In [7]:
import nltk
nltk.download()


NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> q


True

## Load the NLTK book module
The cell below loads all the items in the book module that you have just downloaded. The output shows that
nine texts (text1 to text9) and nine sentences (sent1 to sent9) are loaded. For example, typing text1
displays the title of the first text, and typing sent3 displays the tokens of the third sentence.

In [8]:
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [9]:
(text1)

<Text: Moby Dick by Herman Melville 1851>

In [10]:
sent3

['In',
 'the',
 'beginning',
 'God',
 'created',
 'the',
 'heaven',
 'and',
 'the',
 'earth',
 '.']

##### NLTK texts have a method called concordance that shows every occurrence of a word together with its surrounding context.
##### The count method returns the number of times a word occurs in a text.

In [11]:
text1.concordance("America")
text1.count("America")

Displaying 12 of 12 matches:
 of the brain ." -- ULLOA ' S SOUTH AMERICA . " To fifty chosen sylphs of speci
, in spite of this , nowhere in all America will you find more patrician - like
hree pirate powers did Poland . Let America add Mexico to Texas , and pile Cuba
 , how comes it that we whalemen of America now outnumber all the rest of the b
mocracy in those parts . That great America on the other side of the sphere , A
f age ; though among the Red Men of America the giving of the white belt of wam
 and fifty leagues from the Main of America , our ship felt a terrible shock , 
, in the land - locked heart of our America , had yet been nurtured by all thos
 some Nor ' West Indian long before America was discovered . What other marvels
d universally applicable . What was America in 1492 but a Loose - Fish , in whi
w those noble golden coins of South America are as medals of the sun and tropic
od of the last one must be grown in America ." " Aye , aye ! a strange sight th


11
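
Note that the concordance lists 12 matches while count returns 11. As far as I know, concordance matches words case-insensitively whereas count compares tokens exactly, so the all-caps AMERICA in the first match would account for the difference. The sketch below illustrates the two kinds of counting in plain Python; the token list is a hypothetical stand-in for a tokenized text, not data taken from text1.

```python
# Hypothetical token list standing in for a tokenized text such as text1.
tokens = ["SOUTH", "AMERICA", ".", "Let", "America", "add", "Mexico", "to", "Texas"]

# Exact, case-sensitive count: this is how list.count() compares tokens.
exact = tokens.count("America")

# Case-insensitive count: lower-case each token before comparing.
folded = sum(1 for t in tokens if t.lower() == "america")

print(exact, folded)
```

If a case-insensitive count is what you want, lower-casing the tokens first is one simple option.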

## Using Word Counts to Obtain an Overview of a Collection
Assume that you have a large document collection. For example, it could be all the email enquiries
from the customers of a company in a particular month. It could be all the tweets published by a particular
user. It could also be all the fiction written by a particular author. Without going through all the documents
inside the collection, how can you quickly get an idea about the major topics or themes covered by these
documents?

In NLTK, there is a built-in function called FreqDist() that makes our task very easy.
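
To get a feel for what a frequency distribution holds, here is a minimal sketch using collections.Counter from the standard library, which FreqDist builds on. The token list is a hypothetical toy example; in the notebook the tokens come from text2.

```python
from collections import Counter

# Hypothetical toy token list; the notebook uses the tokens of text2 instead.
tokens = ["the", "cat", "sat", "on", "the", "mat", ",", "the", "end", "."]

# Map each distinct token to the number of times it occurs.
freq = Counter(tokens)

# The single most frequent token together with its count.
top = freq.most_common(1)

print(freq["the"], top)
```

Punctuation marks are counted like any other token, which is why they can dominate the counts of a real text.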

In [12]:
fdist=FreqDist(text2)
fdist

FreqDist({'permit': 1,
          'household': 3,
          'circle': 2,
          'instantaneously': 1,
          'objections': 7,
          'afford': 15,
          'weeks': 23,
          'disagreeable': 2,
          'gained': 15,
          'anticipated': 2,
          'concealment': 5,
          'Can': 7,
          'spoken': 15,
          'thirteen': 1,
          'encouraged': 5,
          'any': 389,
          'brief': 6,
          'embarrassment': 9,
          'degree': 14,
          'contented': 4,
          'bind': 1,
          'fall': 6,
          'why': 25,
          'affliction': 16,
          'preceding': 6,
          'friends': 62,
          'trembled': 2,
          'luckily': 6,
          'fertile': 1,
          'Hanover': 2,
          'furniture': 10,
          'Jenning': 5,
          'debated': 1,
          'did': 211,
          'able': 46,
          'regain': 1,
          'wild': 4,
          'arranged': 8,
          'avoided': 7,
          'overspreading': 1,
          'S

##### There is a method called most_common() that can be conveniently used to show the most frequent words in a frequency distribution.

In [13]:
fdist.most_common(10)

[(',', 9397),
 ('to', 4063),
 ('.', 3975),
 ('the', 3861),
 ('of', 3565),
 ('and', 3350),
 ('her', 2436),
 ('a', 2043),
 ('I', 2004),
 ('in', 1904)]

#### The most frequent words are not very informative: punctuation and function words such as "to" and "the" are common in any English text, so they reveal nothing about the particular document or collection we are looking at. There are a number of ways to address this problem.

#### We will create a new list, text2_long_words, that keeps only words with at least 5 characters.

In [14]:
text2_long_words=[w for w in text2 if len(w)>=5]
text2_long_words[:10]

['Sense',
 'Sensibility',
 'Austen',
 'CHAPTER',
 'family',
 'Dashwood',
 'settled',
 'Sussex',
 'Their',
 'estate']

#### Recomputing the frequency distribution on the new list gives more sensible results: the most common words now include the major characters of the book.

In [15]:
fdist2=FreqDist(text2_long_words)
m=fdist2.most_common(10)

## Import Brunel library for visualization

In [16]:
import brunel

### Create a dataframe to visualize the common words as a tag cloud using the Brunel package.

In [17]:
import pandas as pd
df = pd.DataFrame(columns=['word', 'freq'])
for i in m:
    df.loc[len(df)] = i
    
print (df)
        

       word   freq
0    Elinor  684.0
1     which  592.0
2     could  568.0
3  Marianne  566.0
4     would  507.0
5     their  463.0
6     every  361.0
7    sister  282.0
8    Edward  262.0
9    mother  258.0


## Tag cloud of most common words

In [18]:
%%brunel cloud color(freq) size(freq) sort(freq)
label(word) style('font-size:200px;font-family:Impact') legends(none) :: width = 600, height=600

<IPython.core.display.Javascript object>

## Stop Word Removal
NLTK also has a built-in stop word list for English that can come in handy when we need to remove stop
words from a text collection.

In [19]:
from nltk.corpus import stopwords
stop_list=stopwords.words('english')
text2_stopremoved=[w for w in text2_long_words if w not in stop_list]
text2_stopremoved

['Sense',
 'Sensibility',
 'Austen',
 'CHAPTER',
 'family',
 'Dashwood',
 'settled',
 'Sussex',
 'Their',
 'estate',
 'large',
 'residence',
 'Norland',
 'centre',
 'property',
 'generations',
 'lived',
 'respectable',
 'manner',
 'engage',
 'general',
 'opinion',
 'surrounding',
 'acquaintance',
 'owner',
 'estate',
 'single',
 'lived',
 'advanced',
 'years',
 'constant',
 'companion',
 'housekeeper',
 'sister',
 'death',
 'happened',
 'years',
 'produced',
 'great',
 'alteration',
 'supply',
 'invited',
 'received',
 'house',
 'family',
 'nephew',
 'Henry',
 'Dashwood',
 'legal',
 'inheritor',
 'Norland',
 'estate',
 'person',
 'intended',
 'bequeath',
 'society',
 'nephew',
 'niece',
 'children',
 'Gentleman',
 'comfortably',
 'spent',
 'attachment',
 'increased',
 'constant',
 'attention',
 'Henry',
 'Dashwood',
 'wishes',
 'proceeded',
 'merely',
 'interest',
 'goodness',
 'heart',
 'every',
 'degree',
 'solid',
 'comfort',
 'could',
 'receive',
 'cheerfulness',
 'children',
 'add
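
Notice that 'Their' survives the filter in the output above. NLTK's English stop word list is all lower-case, so a case-sensitive membership test misses capitalized forms. The sketch below shows the effect with a hypothetical miniature stop list; it is an illustration, not the notebook's actual data.

```python
# Hypothetical miniature stop list; NLTK's English list is likewise lower-case.
stop_list = {"their", "the", "and", "of"}
words = ["Their", "estate", "and", "the", "family"]

# Case-sensitive filter, as in the cell above: "Their" is kept.
kept_sensitive = [w for w in words if w not in stop_list]

# Lower-casing each word before the membership test removes it as well.
kept_folded = [w for w in words if w.lower() not in stop_list]

print(kept_sensitive)
print(kept_folded)
```

Whether to lower-case before filtering is a design choice: it catches sentence-initial stop words but also discards the capitalization that distinguishes proper nouns.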

## Stemming
NLTK also has a built-in Porter stemmer we can use.

In [30]:
from nltk.stem.porter import *
stemmer=PorterStemmer()

text2_stemmed=[stemmer.stem(w) for w in text2_stopremoved]
    
    

print (text2_stemmed)

['sens', 'sensibl', 'austen', 'chapter', 'famili', 'dashwood', 'settl', 'sussex', 'their', 'estat', 'larg', 'resid', 'norland', 'centr', 'properti', 'gener', 'live', 'respect', 'manner', 'engag', 'gener', 'opinion', 'surround', 'acquaint', 'owner', 'estat', 'singl', 'live', 'advanc', 'year', 'constant', 'companion', 'housekeep', 'sister', 'death', 'happen', 'year', 'produc', 'great', 'alter', 'suppli', 'invit', 'receiv', 'hous', 'famili', 'nephew', 'henri', 'dashwood', 'legal', 'inheritor', 'norland', 'estat', 'person', 'intend', 'bequeath', 'societi', 'nephew', 'niec', 'children', 'gentleman', 'comfort', 'spent', 'attach', 'increas', 'constant', 'attent', 'henri', 'dashwood', 'wish', 'proceed', 'mere', 'interest', 'good', 'heart', 'everi', 'degre', 'solid', 'comfort', 'could', 'receiv', 'cheer', 'children', 'ad', 'relish', 'exist', 'former', 'marriag', 'henri', 'dashwood', 'present', 'three', 'daughter', 'steadi', 'respect', 'young', 'ampli', 'provid', 'fortun', 'mother', 'larg', 'dev

## Gensim

Gensim is a Python library that provides built-in functions for converting
documents to vectors and computing cosine similarities.

In [21]:
!pip install gensim



## Sparse Vectors
To convert a piece of text into a vector, we need to first determine the dimension of the vector, or in other
words the number of components of the vector. Generally, the dimension of the vectors representing documents
(and thus the dimension of the vector space) is the same as the vocabulary size, that is, each unique
word corresponds to an entry of the vectors. Here vocabulary refers to the set of all unique words in a corpus.
While finding out the vocabulary size is not a problem—you may want to think about how to do this—
the real problem is that for a real world corpus, the vocabulary is usually very large. It is not uncommon to
have millions of words in a vocabulary. If we represent each document as a very high-dimensional vector,
we will need a lot of space to store these vectors either in memory or on disk. In reality, however, each
document contains only a relatively small set of words, so the vector used to represent a document has
most entries equal to zero and a small subset of entries with non-zero values. When we store such
vectors in a computer, we typically use a sparse vector representation.

For example, suppose our vocabulary has 10 words. Let us look at the following vector:

v = (1, 0, 0, 2, 0, 0, 0, 0, 0, 0.5)

To store this vector in its original form, we need to store 10 numbers, each corresponding to one entry of the
vector. A sparse vector representation stores only the non-zero entries as follows:

v = ((0, 1), (3, 2), (9, 0.5))

Here the sparse vector is a list of pairs. In each pair, the first number is an ID or index identifying an
entry of the original vector, and the second number is the value of that entry. For example, (0, 1) indicates
that the first entry of the original v has a value of 1, and (9, 0.5) indicates that the tenth entry has a value
of 0.5. The space needed to store the sparse vector is twice the number of non-zero entries of the
original vector: here 6 numbers instead of 10, with far larger savings for realistic vocabulary sizes.
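
The conversion described above can be sketched in a few lines of plain Python. The helper names to_sparse and to_dense are hypothetical and chosen for illustration only; they are not part of Gensim.

```python
def to_sparse(dense):
    """Keep only the non-zero entries as (index, value) pairs."""
    return [(i, x) for i, x in enumerate(dense) if x != 0]


def to_dense(sparse, dim):
    """Rebuild the full vector of length dim from (index, value) pairs."""
    dense = [0] * dim
    for i, x in sparse:
        dense[i] = x
    return dense


v = [1, 0, 0, 2, 0, 0, 0, 0, 0, 0.5]
sparse_v = to_sparse(v)
print(sparse_v)
```

Note that the dimension has to be stored or known separately, because the sparse form by itself does not record how many trailing zeros the original vector had.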

## Creating a Dictionary from a Corpus

Now think about converting all documents in a corpus into vectors. We need to map each unique word in
the vocabulary of this corpus to an ID or index first. These mappings from words to IDs are represented by
a class called Dictionary in Gensim.
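
Conceptually, such a mapping can be built by handing out the next free integer to each word the first time it is seen. The sketch below is a plain-Python illustration with a hypothetical helper build_token_ids; Gensim's Dictionary may assign IDs in a different order and keeps additional statistics.

```python
def build_token_ids(docs):
    """Assign each unique token the next free integer ID, in order of first appearance."""
    token_to_id = {}
    for doc in docs:
        for token in doc:
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
    return token_to_id


# Hypothetical toy corpus: a list of documents, each a list of tokens.
docs = [["rain", "haze", "rain"], ["train", "rain"]]
ids = build_token_ids(docs)
print(ids)
```

The size of the resulting mapping is the vocabulary size, which is also the dimension of the document vectors.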


In [22]:
import gensim
from gensim import corpora

# Dictionary expects a list of documents, each of which is a list of tokens
doc=[text2_stemmed]
dictionary=corpora.Dictionary(doc)
# print (dictionary) would show a summary of the unique tokens

# dictionary.token2id is a dict that contains all the token-to-ID mappings
token_to_id=dictionary.token2id
token_to_id

{'household': 0,
 'face': 1,
 'outrun': 2,
 'anim': 3,
 'gratif': 4,
 'vari': 5,
 'gloom': 2832,
 'afford': 6,
 'mad': 7,
 'unexpectedli': 8,
 'undesir': 9,
 'insult': 10,
 'spoken': 11,
 'thirteen': 12,
 'addit': 13,
 'proof': 14,
 'loiter': 15,
 'brief': 16,
 'pastur': 17,
 'malic': 3382,
 'forsak': 1163,
 'impos': 19,
 'fall': 20,
 'exuber': 21,
 'dawlish': 22,
 'brandon': 23,
 'shoe': 24,
 'mutter': 25,
 'deject': 26,
 'resid': 27,
 'right': 28,
 'halloo': 29,
 'proprietor': 30,
 'somebodi': 31,
 'apologis': 32,
 'patron': 34,
 'regain': 35,
 'wild': 36,
 'convinc': 37,
 'complain': 243,
 'appeas': 3511,
 'necess': 2361,
 'absent': 40,
 'asund': 3454,
 'troublesom': 41,
 'dissatisfi': 42,
 'alon': 2783,
 'dread': 43,
 'chang': 44,
 'sooth': 45,
 'bowl': 46,
 'qualifi': 47,
 'moder': 48,
 'mix': 2366,
 'apricot': 50,
 'injur': 51,
 'fear': 52,
 'quarter': 54,
 'curat': 55,
 'compass': 56,
 'resist': 57,
 'profit': 58,
 'palmer': 59,
 'endeavor': 60,
 'liber': 61,
 'refrain': 62,
 'b

## Converting document into a Vector

We use the function doc2bow to convert a document into another list, which we call vec (to indicate that it is a vector). Here bow
stands for bag of words, meaning that the order of the words in the original document is ignored and we
treat a document as a bag of words without any order.
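
A bag-of-words conversion can be sketched in plain Python as follows. The helper doc_to_bow and the small token_to_id mapping are hypothetical; the sketch skips words that are not in the mapping, which is also what doc2bow does by default as far as I know.

```python
from collections import Counter


def doc_to_bow(doc, token_to_id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(token_to_id[t] for t in doc if t in token_to_id)
    return sorted(counts.items())


# Hypothetical token-to-ID mapping for illustration.
token_to_id = {"rain": 0, "haze": 1, "train": 2}

bow = doc_to_bow(["haze", "rain", "rain", "snow"], token_to_id)
reordered = doc_to_bow(["rain", "snow", "rain", "haze"], token_to_id)
print(bow)
```

Both calls give the same vector, which shows that word order is discarded.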

In [23]:
vec=dictionary.doc2bow(text2_stemmed)
vec

[(0, 3),
 (1, 1),
 (2, 1),
 (3, 13),
 (4, 2),
 (5, 5),
 (6, 25),
 (7, 1),
 (8, 1),
 (9, 1),
 (10, 4),
 (11, 15),
 (12, 1),
 (13, 13),
 (14, 18),
 (15, 1),
 (16, 6),
 (17, 1),
 (18, 3),
 (19, 1),
 (20, 6),
 (21, 1),
 (22, 5),
 (23, 144),
 (24, 2),
 (25, 1),
 (26, 4),
 (27, 12),
 (28, 32),
 (29, 1),
 (30, 1),
 (31, 3),
 (32, 3),
 (33, 22),
 (34, 1),
 (35, 1),
 (36, 2),
 (37, 46),
 (38, 8),
 (39, 1),
 (40, 3),
 (41, 3),
 (42, 6),
 (43, 29),
 (44, 39),
 (45, 9),
 (46, 1),
 (47, 1),
 (48, 6),
 (49, 1),
 (50, 1),
 (51, 12),
 (52, 30),
 (53, 14),
 (54, 9),
 (55, 1),
 (56, 21),
 (57, 16),
 (58, 2),
 (59, 87),
 (60, 3),
 (61, 13),
 (62, 1),
 (63, 20),
 (64, 1),
 (65, 6),
 (66, 3),
 (67, 44),
 (68, 1),
 (69, 1),
 (70, 2),
 (71, 36),
 (72, 1),
 (73, 4),
 (74, 6),
 (75, 65),
 (76, 12),
 (77, 3),
 (78, 4),
 (79, 1),
 (80, 1),
 (81, 15),
 (82, 9),
 (83, 32),
 (84, 1),
 (85, 6),
 (86, 74),
 (87, 1),
 (88, 34),
 (89, 33),
 (90, 1),
 (91, 53),
 (92, 1),
 (93, 31),
 (94, 30),
 (95, 4),
 (96, 2),
 (97, 4

In [24]:
# Check that the vector matches the document: the count of 'fall' and its ID should appear together as an (ID, count) pair in vec
print (text2_stemmed.count('fall'))
print (dictionary.token2id['fall'])

6
20


## Computing similarities between documents

Gensim has built-in functions to compute cosine similarities between documents.

In [25]:
# Define a corpus with 2 documents represented as sparse vectors

mycorpus=[[(0,1),(1,1),(2,1)],[(1,2),(3,1)]]
from gensim import similarities

# Build an index for efficient computation. The second argument, 4, is the number of features (the dimension of the vectors)
index=similarities.SparseMatrixSimilarity(mycorpus,4)
index

<gensim.similarities.docsim.SparseMatrixSimilarity at 0x7f54fc83b080>

In [26]:
# Compute the similarity of a new document to every document in the index

test_doc=[(0,1)]
sims=index[test_doc]
print (list(enumerate(sims)))

[(0, 0.57735026), (1, 0.0)]


#### The results above show that test_doc has a cosine similarity of about 0.58 to the 1st document in mycorpus and 0.0 to the 2nd document, with which it shares no terms.
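
To see where these numbers come from, the cosine similarity can be computed by hand: it is the dot product of two vectors divided by the product of their norms. The sketch below uses a hypothetical helper cosine on sparse vectors and reuses the values of mycorpus and test_doc from the cells above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors given as (id, weight) pairs."""
    da, db = dict(a), dict(b)
    dot = sum(w * db.get(i, 0.0) for i, w in da.items())
    norm_a = math.sqrt(sum(w * w for w in da.values()))
    norm_b = math.sqrt(sum(w * w for w in db.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # define similarity with an empty vector as zero
    return dot / (norm_a * norm_b)


mycorpus = [[(0, 1), (1, 1), (2, 1)], [(1, 2), (3, 1)]]
test_doc = [(0, 1)]
sims = [cosine(test_doc, doc) for doc in mycorpus]
print(sims)
```

For the first document the dot product is 1 and the norms are 1 and the square root of 3, which gives 1/sqrt(3), roughly 0.577.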

## TF-IDF Weighting
We can use Gensim to compute inverse document frequencies (IDFs) and use them to re-weight
document vectors. To do this, we first need to import models from Gensim.

The tfidf object automatically normalizes the vectors it transforms. When a vector is normalized, each entry is
divided by the norm of the vector, so that the norm of the new vector is exactly 1.

In [27]:
from gensim import models
#tfidf object is created on entire corpus
tfidf=models.TfidfModel(mycorpus)
#test new documents using the tfidf object
tfidf[test_doc]

[(0, 1.0)]
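
The result above can be reproduced by hand. The sketch below assumes the weighting tf × log2(N / df) followed by L2 normalization, which I believe matches Gensim's default settings but should be treated as an assumption. The helper tfidf_transform is hypothetical and is not part of Gensim.

```python
import math


def tfidf_transform(vec, corpus):
    """Weight a sparse vector by tf * log2(N / df), then scale it to unit length."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each term ID occurs.
    df = {}
    for doc in corpus:
        for term_id, _ in doc:
            df[term_id] = df.get(term_id, 0) + 1
    weighted = [(t, tf * math.log2(n_docs / df[t])) for t, tf in vec if t in df]
    # Terms that occur in every document get weight 0 and are dropped.
    weighted = [(t, w) for t, w in weighted if w > 0]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(t, w / norm) for t, w in weighted] if norm else []


mycorpus = [[(0, 1), (1, 1), (2, 1)], [(1, 2), (3, 1)]]
single = tfidf_transform([(0, 1)], mycorpus)
pair = tfidf_transform([(0, 1), (2, 1)], mycorpus)
print(single, pair)
```

Term 0 occurs in one of the two documents, so its IDF is log2(2) = 1, and normalizing a one-entry vector yields exactly 1.0.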

## Document Retrieval
Next, we see how cosine similarity with TF-IDF weighting can be used for document retrieval.

Given a query represented as a set of words, the goal of document retrieval is to find the documents in a corpus that are most relevant to the
query.

This can be done by ranking all the documents in the corpus by their cosine similarity to the
query.
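
The ranking step can be sketched in plain Python. To stay short, this illustration uses raw word counts rather than TF-IDF weights, and the document names, texts, and query are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two Counters of word counts."""
    dot = sum(a[t] * b[t] for t in a)  # a Counter returns 0 for missing words
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical two-document corpus and query.
documents = {
    "weather": "rain and haze expected in singapore",
    "transport": "train services end early on sunday",
}
query = "singapore haze"

doc_vecs = {name: Counter(text.split()) for name, text in documents.items()}
query_vec = Counter(query.split())

# Score every document against the query and sort from most to least similar.
ranking = sorted(
    ((cosine(query_vec, vec), name) for name, vec in doc_vecs.items()),
    reverse=True,
)
print(ranking)
```

The first entry of the sorted list is the most relevant document for the query.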

In [31]:
import os
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
string1="""Singapore can expect more rain and less haze in the coming weeks with the south-west monsoon season transitioning into inter-monsoon conditions.The inter-monsoon season typically lasts from October to November and the weather during the period is characterised by more rainfall and light and variable winds.The Meteorological Service Singapore said on Monday in an advisory that this transition signals the end of traditional dry season in the region, and the likelihood of transboundary haze affecting Singapore for the rest of the year will be low.This is because the increased rainfall will help alleviate the hotspot and haze situation in Sumatra and Kalimantan in Indonesia."""
string2="""Train services between Admiralty and Jurong East MRT stations will end half an hour earlier from Sunday, Nov 2, to end March next year, due to rail works.SMRT will end services at nine North-South Line stations by 12.30am on Sunday to Thursday nights, except on the eve of public holidays.The stations are: Admiralty, Woodlands, Marsiling, Kranji, Yew Tee, Choa Chu Kang, Bukit Gombak, Bukit Batok and Jurong East.From Nov 2, commuters who board trains after 11.15pm on the North-South Line are advised to plan their journey and consider alternative transport arrangements such as bus services to get to their final destination, said SMRT in a statement to remind the public on Monday."""
corpus = [string1,string2]
corpusdir = 'newcorpus/'
if not os.path.isdir(corpusdir):
    os.mkdir(corpusdir)
    
filename = 0
for text in corpus:
    filename+=1
    with open(corpusdir+str(filename)+'.txt','w') as fout:
        print (text,fout)
        fout.write(text)




Singapore can expect more rain and less haze in the coming weeks with the south-west monsoon season transitioning into inter-monsoon conditions.The inter-monsoon season typically lasts from October to November and the weather during the period is characterised by more rainfall and light and variable winds.The Meteorological Service Singapore said on Monday in an advisory that this transition signals the end of traditional dry season in the region, and the likelihood of transboundary haze affecting Singapore for the rest of the year will be low.This is because the increased rainfall will help alleviate the hotspot and haze situation in Sumatra and Kalimantan in Indonesia. <_io.TextIOWrapper name='newcorpus/1.txt' mode='w' encoding='UTF-8'>
Train services between Admiralty and Jurong East MRT stations will end half an hour earlier from Sunday, Nov 2, to end March next year, due to rail works.SMRT will end services at nine North-South Line stations by 12.30am on Sunday to Thursday nights,

In [34]:
newcorpus = PlaintextCorpusReader('/gpfs/global_fs01/sym_shared/YPProdSpark/user/sfbc-20c2d955c74628-3c618564d05f/notebook/work/newcorpus', '.*')
fids= (newcorpus.fileids())
docs=[newcorpus.words(f) for f in fids]
print (docs)
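# The helper functions used below (tolower, removestop, stemwords, fetchdictionary,
# convertToVec, buildindex, createtdif) are defined in cell In [33], which must be run first.
# Change words to lowercase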
docs=tolower(docs)
#print(docs)
#Remove stop words
docs=removestop(docs)
#Perform stemming
docs=stemwords(docs)

#Create dictionary
dictionary=fetchdictionary(docs)
token_to_id=dictionary.token2id
#Convert to vector
print (type(docs))
vecs=convertToVec(docs,dictionary)
print (vecs)
#Build index for finding similarity
index=buildindex(vecs)
#print(index)

tdif=createtdif(vecs)
print (tdif)


[['Singapore', 'can', 'expect', 'more', 'rain', 'and', ...], ['Train', 'services', 'between', 'Admiralty', 'and', ...]]
<class 'list'>
[[(0, 1), (1, 1), (2, 1), (3, 3), (4, 2), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 4), (16, 2), (17, 1), (18, 1), (19, 1), (20, 1), (21, 2), (22, 3), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 3), (32, 1), (33, 1), (34, 3), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 3), (49, 1), (50, 1), (51, 1)], [(2, 1), (3, 2), (7, 1), (10, 13), (15, 6), (37, 1), (39, 2), (47, 3), (50, 3), (52, 1), (53, 1), (54, 2), (55, 2), (56, 2), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 2), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 2), (76, 1), (77, 2), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1), (84, 3), (85, 1), (86, 1), (87, 2), 

In [33]:
def tolower(docs):
    # Lower-case every token in every document
    return [[w.lower() for w in doc] for doc in docs]

def fetchdictionary(docs):
    # Map each unique token in the corpus to an integer ID
    return corpora.Dictionary(docs)

def removestop(docs):
    # Drop tokens that appear in the NLTK English stop word list
    return [[w for w in doc if w not in stop_list] for doc in docs]

def stemwords(docs):
    # Reduce each token to its Porter stem
    return [[stemmer.stem(w) for w in doc] for doc in docs]

def convertToVec(docs,dictionary):
    # Convert each document to a sparse bag-of-words vector
    return [dictionary.doc2bow(doc) for doc in docs]

def buildindex(docs):
    # 110 is the number of features; it should be no smaller than the number of tokens in the dictionary
    return similarities.SparseMatrixSimilarity(docs,110)

def createtdif(docs):
    # Fit a TF-IDF model on the bag-of-words vectors
    return models.TfidfModel(docs)


In [36]:
# Retrieve documents from newcorpus based on the query
query1=[stemmer.stem('singapore'),stemmer.stem('indonesia')]
# Convert query to sparse vector
query1_vec=dictionary.doc2bow(query1)
print (query1_vec)
# Apply TF-IDF weighting to the query. 'singapore' and 'indonesia' each appear only in document 1, so they have the same IDF value and receive equal weights.
query1_vec_tdif=tdif[query1_vec]
query1_vec_tdif

[(25, 1), (31, 1)]


[(25, 0.7071067811865475), (31, 0.7071067811865475)]

In [37]:
# Next is to find cosine similarity between the query and all documents in the corpus.
sims=index[query1_vec_tdif]

# Sort in descending order by similarity score.
sorted_sims=sorted(enumerate(sims),key=lambda item: -item[1])
print (sorted_sims[0:10])

# Look up the file ID of the top-ranked document, whose index is the first element of sorted_sims[0]
newcorpus.fileids()[sorted_sims[0][0]]

[(0, 0.26261285), (1, 0.0)]


'1.txt'