## Lede Algorithms 2018 Week 2 class 1 - Text analysis

Before running this notebook you will need to install a few things:

```
pip3 install textblob
python3 -m textblob.download_corpora
pip3 install scipy
pip3 install scikit-learn
```

For more text analysis goodness, check out Jonathan Soma's 2017 notebooks:
- [TextBlob spaCy sklearn lemmas stems and vectorization](http://jonathansoma.com/lede/algorithms-2017/classes/text-analysis/textblob-spacy-sklearn-lemmas-stems-and-vectorization/)
- [Counting and Stemming](http://jonathansoma.com/lede/algorithms-2017/classes/more-text-analysis/counting-and-stemming/)


In [1]:
# Import the packages we will be using
import pandas as pd
import numpy as np
from textblob import TextBlob
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
import math


Load some press releases to play around with. These were scraped from NJ Senator Menendez' site in 2012.

In [2]:
pr = pd.read_csv('menendez-press-releases.csv')
len(pr)

1530

In [3]:
pr.head()

Unnamed: 0,url,text
0,http://menendez.senate.gov/newsroom/press/rele...,Menendez Statement on Black History Month\n ...
1,http://menendez.senate.gov/newsroom/press/rele...,Menendez Praises Susan G. Komen For Reversing ...
2,http://menendez.senate.gov/newsroom/press/rele...,Menendez Applauds Dentists’ Pro-Bono Work For ...
3,http://menendez.senate.gov/newsroom/press/rele...,Senator Menendez Applauds Passage of STOCK Act...
4,http://menendez.senate.gov/newsroom/press/rele...,Menendez Hails Banking Committee Passage of Ir...


These press releases are on all sorts of topics. Take a look at a few, like this:

In [4]:
print(pr.text[8])

Senator Menendez Slams Unfair Imprisonment of Former Ukrainian Prime Minister Yulia Tymoshenko
                    
                      February 1, 2012
                     WASHINGTON – United States Senator Robert Menendez (D-NJ) participated in the Senate Foreign Relations Committee hearing on Ukraine today, giving him the opportunity to meet Eugenia Tymoshenko and to discuss the inhumane detention of her mother, the former Prime Minister Yulia Tymoshenko.  Last October, a Ukrainian court sentenced Yulia Tymoshenko to seven years in prison after she was found guilty of abuse of office when brokering a 2009 gas deal with Russia.  The Senator expressed his sympathy and support to Ms. Tymoshenko and vowed to assist in her efforts to have her mother freed from prison.

 “Your mother is a pioneering and incredibly strong woman,” the Senator told the younger Tymoshenko. “Yulia is an example for all people who care so much about their country that they are willing to endure extraordinary

There are press releases about government programs, holidays, foreign policy, and more. Can an algorithm tell us something about which topics there are?


### Sentences and tokens

The first step of text analysis is typically breaking the text into sentences and words, or more accurately "tokens", which are basically words but can also be punctuation and numbers.

We'll use the TextBlob package, which has many easy and useful text processing methods -- though as we will see it's also kind of dumb in many cases.

Let's start by trying to analyze the sentences of one press release from the set.

In [5]:
press_release_text = pr.text[212]
doc = TextBlob(press_release_text)
doc.sentences

[Sentence("Menendez and Lautenberg Applaud USDOT’s $4.3 Million Award to National Transit Institute at Rutgers
                     
                             Funding will provide job training and education for public transportation workers
                     
                       August 3, 2011
                      WASHINGTON, D.C. – Today, U.S."),
 Sentence("Senators Robert Menendez (D-NJ) and Frank R. Lautenberg (D-NJ) applauded Secretary Ray LaHood on a $4.3 million grant from the U.S. Department of Transportation in support of the National Transit Institute (NTI) at Rutgers."),
 Sentence("Since 1991, NTI has served as the premiere research, training, and educational institute in the country dedicated to public transportation."),
 Sentence("This award will enable critical enhancements to NTI’s ongoing work, including safety and security training, procurement, planning and advanced technologies."),
 Sentence("“Transit is a critical element of our transportation network and r

As you can see, TextBlob is a little naive about what counts as a sentence. Or perhaps the problem is that real text contains lots of things that aren't really sentences. What should we do with the title and the dateline? Even so, it seems to have problems with quotes and newlines.

Anyway, let's take one of these sentences and play with it further. 

In [6]:
s = doc.sentences[9]
s

Sentence("“This grant will ensure that the existing and importantly, the new generation of public transportation workers receive the training that they need to comply with federal regulations and operate safe and efficient transit services,” said NTI’s director, Paul Larrousse.NTI works cooperatively through partnerships with industry, government, institutions, and associations to develop high quality programs that build careers and help make our communities healthier and more livable.")

This `Sentence` object acts just like a Python string. Let's try to break it into words for further analysis.

In [7]:
s.split(' ')

WordList(['“This', 'grant', 'will', 'ensure', 'that', 'the', 'existing', 'and', 'importantly,', 'the', 'new', 'generation', 'of', 'public', 'transportation', 'workers', 'receive', 'the', 'training', 'that', 'they', 'need', 'to', 'comply', 'with', 'federal', 'regulations', 'and', 'operate', 'safe', 'and', 'efficient', 'transit', 'services,”', 'said', 'NTI’s', 'director,', 'Paul', 'Larrousse.NTI', 'works', 'cooperatively', 'through', 'partnerships', 'with', 'industry,', 'government,', 'institutions,', 'and', 'associations', 'to', 'develop', 'high', 'quality', 'programs', 'that', 'build', 'careers', 'and', 'help', 'make', 'our', 'communities', 'healthier', 'and', 'more', 'livable.'])

Notice that all of the punctuation and capitalization is still there. If we're counting occurrences of the word "nation" we will miss "Nation’s". We need a smarter way to extract words. This process is called tokenization.

In [8]:
s.tokens

WordList(['“', 'This', 'grant', 'will', 'ensure', 'that', 'the', 'existing', 'and', 'importantly', ',', 'the', 'new', 'generation', 'of', 'public', 'transportation', 'workers', 'receive', 'the', 'training', 'that', 'they', 'need', 'to', 'comply', 'with', 'federal', 'regulations', 'and', 'operate', 'safe', 'and', 'efficient', 'transit', 'services', ',', '”', 'said', 'NTI', '’', 's', 'director', ',', 'Paul', 'Larrousse.NTI', 'works', 'cooperatively', 'through', 'partnerships', 'with', 'industry', ',', 'government', ',', 'institutions', ',', 'and', 'associations', 'to', 'develop', 'high', 'quality', 'programs', 'that', 'build', 'careers', 'and', 'help', 'make', 'our', 'communities', 'healthier', 'and', 'more', 'livable', '.'])
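
To get a feel for what a tokenizer does under the hood, here is a rough sketch using only Python's `re` module. It is not how TextBlob does it (TextBlob's tokenizer handles many more cases), and `simple_tokens` is a made-up name for illustration.

```python
import re

def simple_tokens(text):
    # \w+ grabs runs of letters, digits and underscores;
    # [^\w\s] grabs any single character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokens("services,” said NTI’s director, Paul"))
```

Like `s.tokens`, this separates punctuation from the words it is attached to, which is why "NTI’s" comes apart into three pieces.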

But what about different forms of the same word? Suppose we want to count "education" and "educate" as the same thing? This is where _lemmatization_ and _stemming_ come in. 

In [9]:
for word in s.words:
    print("ORIGINAL:", word, "| LEMMA:", word.lemmatize(), "| STEM:", word.stem())

ORIGINAL: “ | LEMMA: “ | STEM: “
ORIGINAL: This | LEMMA: This | STEM: thi
ORIGINAL: grant | LEMMA: grant | STEM: grant
ORIGINAL: will | LEMMA: will | STEM: will
ORIGINAL: ensure | LEMMA: ensure | STEM: ensur
ORIGINAL: that | LEMMA: that | STEM: that
ORIGINAL: the | LEMMA: the | STEM: the
ORIGINAL: existing | LEMMA: existing | STEM: exist
ORIGINAL: and | LEMMA: and | STEM: and
ORIGINAL: importantly | LEMMA: importantly | STEM: importantli
ORIGINAL: the | LEMMA: the | STEM: the
ORIGINAL: new | LEMMA: new | STEM: new
ORIGINAL: generation | LEMMA: generation | STEM: gener
ORIGINAL: of | LEMMA: of | STEM: of
ORIGINAL: public | LEMMA: public | STEM: public
ORIGINAL: transportation | LEMMA: transportation | STEM: transport
ORIGINAL: workers | LEMMA: worker | STEM: worker
ORIGINAL: receive | LEMMA: receive | STEM: receiv
ORIGINAL: the | LEMMA: the | STEM: the
ORIGINAL: training | LEMMA: training | STEM: train
ORIGINAL: that | LEMMA: that | STEM: that
ORIGINAL: they | LEMMA: they | STEM: they
O

In this example `lemmatize()` mostly just removes pluralization. `stem()` goes further and produces many non-words like "ensur" and "receiv". It's just a set of rules that try to parse out English morphology. The rule set here is called [Porter stemming](https://snowballstem.org/algorithms/porter/stemmer.html), and there are similar algorithms for [many languages](https://snowballstem.org/algorithms/). Lemmatization is actually smarter because it uses a dictionary, so it can undo common verb inflections... if you tell TextBlob that the word is a verb. (To be fair, it can [figure out parts of speech](https://textblob.readthedocs.io/en/dev/quickstart.html#part-of-speech-tagging) if you ask it to.)

In [10]:
TextBlob("running").words[0].lemmatize('v')

'run'

In [11]:
TextBlob("ran").words[0].lemmatize('v')

'run'
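
To make the "just a set of rules" idea concrete, here is a toy suffix-stripping function. This is emphatically not the Porter algorithm, which has many more rules and conditions; the suffix list and the name `toy_stem` are assumptions chosen only to show the flavor of rule-based stemming.

```python
# Suffixes are tried in order; longer ones come first so "ation" wins over "s" or "e"
SUFFIXES = ("ation", "ing", "ly", "s", "e")

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip if at least 3 characters of stem remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["transportation", "training", "workers", "receive", "the"]:
    print(w, "->", toy_stem(w))
```

Notice that no dictionary is involved: the function happily produces non-words, which is exactly the trade-off between stemming and lemmatization.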

For our purposes we're going to use TextBlob to make a basic tokenizing function that just makes everything lowercase and throws out tokens with fewer than 3 characters, which throws out punctuation tokens too.

In [12]:
def tokenize(s):
    blob = TextBlob(s.lower())
    words = [token for token in blob.words if len(token)>2]
    return words

### Document vectors

We're going to develop ways of turning a document into a vector in a high dimensional space -- that is, a list of hundreds or thousands of numbers.

Why? Well, we're going to make each of the numbers correspond to the score or importance of a word. This is sort of an abstract word cloud, and it can help us to summarize documents.

We're also going to use vectors to compare documents to each other. This is useful for many things:
- matching search queries against documents
- classifying documents (think of this as sorting them into piles)
- clustering documents by topic

To turn documents into vectors, we need to write a function that takes a string and returns a list of numbers. Our first attempt at this will be by counting the number of tokens of each kind.

In [13]:
def doc2vec_count(s):
    tokens = tokenize(s)
    vec = {}
    for t in tokens:
        vec[t] = vec.get(t, 0) + 1
    return vec

In [14]:
doc2vec_count("the cat and the mat")

{'and': 1, 'cat': 1, 'mat': 1, 'the': 2}
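
The same counting can be done with `collections.Counter` from the standard library. The sketch below uses a stand-in tokenizer (`simple_tokenize`, a hypothetical helper that only lowercases and splits on whitespace) so that it runs without TextBlob; it is an alternative way to write the idea, not a replacement for the function above.

```python
from collections import Counter

def simple_tokenize(s):
    # Stand-in for the notebook's tokenize(): lowercase, split on whitespace,
    # keep tokens longer than 2 characters
    return [t for t in s.lower().split() if len(t) > 2]

def count_vector(s):
    # Counter is a dict subclass that counts how often each item appears
    return Counter(simple_tokenize(s))

vec = count_vector("the cat and the mat")
print(vec)
print(vec["dog"])  # words that never appeared count as 0 instead of raising KeyError
```

The missing-key behavior is handy for document vectors, because any word outside a document's vocabulary really does have a count of zero.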

We can use document vectors to summarize documents by imagining them as a list of top words. Let's sort document vectors by decreasing value to try to get an idea of what the entire press release document is "about", and print the top 20.

In [15]:
def print_sorted_vector(v):
    # this "lambda" thing is an anonymous function, google it to unlock bonus coding knowledge
    sorted_list = sorted(v.items(), key=lambda x: (x[1],x[0]), reverse=True) 
    sorted_list = sorted_list[:20]
    print('\n'.join([str(x) for x in sorted_list]))

In [16]:
print_sorted_vector(doc2vec_count(press_release_text))

('and', 23)
('the', 13)
('transportation', 7)
('transit', 7)
('nti', 7)
('with', 5)
('training', 5)
('that', 5)
('public', 5)
('this', 4)
('our', 4)
('for', 4)
('workers', 3)
('work', 3)
('will', 3)
('said', 3)
('rutgers', 3)
('menendez', 3)
('lautenberg', 3)
('institute', 3)


Not too bad... but is this press release really "about" the words like "the" and "and"? We're going to need something better. That something for us is TF-IDF term weighting, and we'll talk about it below.

### Comparing document vectors
The simplest way to compare two word count vectors is to count the number of overlapping words. Each word can appear more than once, so we'll multiply together the counts of the same word in each document and add up the results. Why? Because this leads us to a Euclidean feature vector with natural geometric properties, which makes it possible to think about what is going on with spatial analogies.

In [17]:
def doc_similarity(a_vec,b_vec):
    total = 0
    for word in a_vec:
        if word in b_vec:
            total += a_vec[word]*b_vec[word]
    return total

In [18]:
a = doc2vec_count(str(doc.sentences[6]))  # need str to convert Sentence object to string
b = doc2vec_count(str(doc.sentences[9]))

In [19]:
print(a)

{'nti': 1, 'work': 1, 'essential': 1, 'for': 1, 'the': 1, 'industry': 1, 'continue': 1, 'create': 1, 'good': 1, 'long': 1, 'term': 1, 'jobs': 2, 'provide': 1, 'families': 1, 'with': 1, 'access': 1, 'opportunity': 1, 'and': 6, 'help': 1, 'our': 1, 'communities': 1, 'grow': 1, 'ways': 1, 'that': 1, 'are': 1, 'smart': 1, 'efficient': 2, 'millions': 1, 'people': 1, 'count': 1, 'reliable': 1, 'public': 1, 'transportation': 1, 'get': 1, 'from': 1, 'their': 2, 'homes': 1, 'transit': 1, 'workers': 1, 'make': 1, 'this': 1, 'possible': 1, 'said': 1, 'senator': 1, 'lautenberg': 1}


In [20]:
print(b)

{'this': 1, 'grant': 1, 'will': 1, 'ensure': 1, 'that': 3, 'the': 3, 'existing': 1, 'and': 6, 'importantly': 1, 'new': 1, 'generation': 1, 'public': 1, 'transportation': 1, 'workers': 1, 'receive': 1, 'training': 1, 'they': 1, 'need': 1, 'comply': 1, 'with': 2, 'federal': 1, 'regulations': 1, 'operate': 1, 'safe': 1, 'efficient': 1, 'transit': 1, 'services': 1, 'said': 1, 'nti': 1, 'director': 1, 'paul': 1, 'larrousse.nti': 1, 'works': 1, 'cooperatively': 1, 'through': 1, 'partnerships': 1, 'industry': 1, 'government': 1, 'institutions': 1, 'associations': 1, 'develop': 1, 'high': 1, 'quality': 1, 'programs': 1, 'build': 1, 'careers': 1, 'help': 1, 'make': 1, 'our': 1, 'communities': 1, 'healthier': 1, 'more': 1, 'livable': 1}


In [21]:
doc_similarity(a,b)

58

Ok, but what does "58" mean? One problem we are going to have is that longer documents will tend to be more similar to everything else. More words mean more words can match. We will solve this problem by normalizing each document vector so that it has length 1, meaning that the sum of the _squares_ of the elements is one -- this is Pythagoras, so we can think of a document as a unit vector now, or a direction, in a space that has as many dimensions as the vocabulary size.


In [22]:
def doc2vec_normalized(s):
    tokens = tokenize(s)
    vec = {}
    for t in tokens:
        vec[t] = vec.get(t, 0) + 1 # get from dict with a default of 0 if missing
        
    length = math.sqrt(sum([x*x for x in vec.values()]))  # length of a vector, according to Pythagoras
    for word,value in vec.items():
        vec[word] /= length
        
    return vec

In [23]:
a = doc2vec_normalized(str(doc.sentences[6]))  # need str to convert Sentence object to string
b = doc2vec_normalized(str(doc.sentences[9]))
c = doc2vec_normalized(str(doc.sentences[5]))

In [24]:
print(c)

{'today': 0.1690308509457033, 'with': 0.3380617018914066, 'gas': 0.1690308509457033, 'prices': 0.1690308509457033, 'around': 0.1690308509457033, 'gallon': 0.1690308509457033, 'and': 0.3380617018914066, 'oil': 0.1690308509457033, 'companies': 0.1690308509457033, 'reaping': 0.1690308509457033, 'record': 0.1690308509457033, 'profits': 0.1690308509457033, 'the': 0.3380617018914066, 'threat': 0.1690308509457033, 'climate': 0.1690308509457033, 'change': 0.1690308509457033, 'growing': 0.1690308509457033, 'wealth': 0.1690308509457033, 'disparity': 0.1690308509457033, 'transit': 0.1690308509457033, 'part': 0.1690308509457033, 'solution': 0.1690308509457033, 'for': 0.1690308509457033, 'number': 0.1690308509457033, 'interconnected': 0.1690308509457033, 'challenges': 0.1690308509457033}


Now our similarity function says that the first two sentences are the most similar, because they have words like "industry", "efficient", and "transit" in common.

In [25]:
print(doc_similarity(a,b))
print(doc_similarity(b,c))
print(doc_similarity(a,c))

0.5943484047696529
0.37583907018239515
0.32251021858460976
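
The dot product of two length-1 vectors is usually called cosine similarity. Here is a small sketch, with made-up word counts, of why normalizing fixes the "longer documents match everything" problem: pasting a document twice doubles its raw score but leaves the normalized score unchanged. The helper names `normalize` and `dot` are assumptions for this example.

```python
import math

def normalize(vec):
    # Divide every count by the vector's Euclidean length
    length = math.sqrt(sum(v * v for v in vec.values()))
    return {word: v / length for word, v in vec.items()}

def dot(a, b):
    # Same idea as doc_similarity: multiply matching words, add up the products
    return sum(a[w] * b[w] for w in a if w in b)

short = {"transit": 1, "funding": 2}
doubled = {"transit": 2, "funding": 4}   # the same text pasted twice
other = {"transit": 1, "iran": 3}

raw_short, raw_doubled = dot(short, other), dot(doubled, other)
cos_short = dot(normalize(short), normalize(other))
cos_doubled = dot(normalize(doubled), normalize(other))

print(raw_short, raw_doubled)    # raw scores grow with document length
print(cos_short, cos_doubled)    # normalized scores agree up to rounding
```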


### TF-IDF weighting
As we discussed in class, term frequency / inverse document frequency is a word weighting scheme that tries to give less weight to words that appear in many documents. This will solve our "the" problem, and it will also help drop out topic words that are common to the entire corpus. Rather than writing it ourselves, we're going to use the implementation in the `scikit-learn` library.
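
For intuition only, here is a sketch of the textbook formula: weight = (count of the word in this document) × log(number of documents / number of documents containing the word). The tiny corpus and the function name `tfidf_vector` are made up for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed variant of the IDF term by default and then normalizes each document vector, so its numbers will not match this sketch.

```python
import math

# A made-up corpus of three already-tokenized "documents"
docs = [
    ["the", "transit", "grant"],
    ["the", "iran", "sanctions"],
    ["the", "transit", "bill"],
]

def tfidf_vector(doc, corpus):
    n_docs = len(corpus)
    vec = {}
    for term in set(doc):
        tf = doc.count(term)                       # term frequency in this document
        df = sum(1 for d in corpus if term in d)   # number of documents containing the term
        vec[term] = tf * math.log(n_docs / df)     # rarer terms get larger weights
    return vec

weights = tfidf_vector(docs[0], docs)
print(weights)
```

A word that appears in every document, like "the" here, gets weight log(1) = 0, while a word unique to one document gets the largest weight.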

Scikit-learn includes a bunch of built-in vectorizers, such as classic counting. Let's turn the first ten press releases into vectors.

In [26]:
# Make a new Count Vectorizer!!!! It will conveniently remove stop words if we tell it what language we're using
vectorizer = CountVectorizer(stop_words='english', tokenizer=tokenize)

# Use the vectorizer we just made. The name fit_transform will be clearer later when we use it for machine learning
matrix = vectorizer.fit_transform(pr.text[0:10])

# The easiest way to see what happened is to make a dataframe
# (scikit-learn 1.0+ calls this get_feature_names_out(); get_feature_names() was removed in 1.2)
results = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names())
results

Unnamed: 0,"1,105,000","1,160,000",100,1099,"13,381",142nd,15th,170,"170,000","179,550",...,yanukovych,year,yearly,years,yesterday,york,young,younger,yulia,–through
0,0,0,2,0,0,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,1,0,...,0,1,0,1,1,0,0,0,0,0
2,0,0,1,0,0,0,0,0,0,0,...,0,3,1,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,3,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,4,0,2,0,0,0,0,0,0
6,0,0,0,0,1,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
7,0,0,1,1,0,0,0,1,0,0,...,0,1,0,3,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,1,0,0,1,0,0,0,1,6,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


Each row is a document, and each column corresponds to a single vocabulary word or token. And there are a lot of columns, which means we can think of these as points in 1,261 dimensional space.


 

In [27]:
vectorizer.get_feature_names()

['1,105,000',
 '1,160,000',
 '100',
 '1099',
 '13,381',
 '142nd',
 '15th',
 '170',
 '170,000',
 '179,550',
 '1903',
 '1980s',
 '1983',
 '1996',
 '1998',
 '20,000',
 '2008',
 '2009',
 '2010',
 '2011',
 '2012',
 '265,950',
 '3.8',
 '315,000',
 '335',
 '350',
 '351,554',
 '4,600',
 '47,200',
 '5,000',
 '51.5m',
 '519',
 '5:00',
 '5million',
 '6,400',
 '6.5',
 '650,000',
 '65m',
 '7.8',
 '750,000',
 '770,000',
 '771',
 '929,088',
 '96-3',
 'able',
 'abuse',
 'access',
 'according',
 'accountability',
 'acquisition',
 'act',
 'action',
 'add',
 'added',
 'additionally',
 'address',
 'addresses',
 'adopted',
 'advance',
 'adversely',
 'advocate',
 'affect',
 'affected',
 'affiliates',
 'afford',
 'afg',
 'afin',
 'african',
 'african-american',
 'african-americans',
 'agencies',
 'agency',
 'agenda',
 'agents',
 'ahmadinejad',
 'aid',
 'air',
 'alarm',
 'ali',
 'allow',
 'allows',
 'ambassador',
 'amendment',
 'amendments',
 'america',
 'american',
 'americans',
 'anniversary',
 'announce',


Look at all those columns with number names! We didn't remove them in the tokenizer, so they're tokens. Do we want that? It depends! Sometimes the numbers in the text can be interesting information.

Now we try this again with tf-idf weighting.

In [28]:
vectorizer = TfidfVectorizer(stop_words='english', tokenizer=tokenize)

matrix = vectorizer.fit_transform(pr.text)

# The easiest way to see what happened is to make a dataframe
tfidf = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names())
tfidf.head()

Unnamed: 0,'01,'activist,'acts,'disappointing,'e-verify,'em,'liberty,'ll,'no,'re,...,…oil-drilling,…struggling,…tarp,…that,…the,…then,…there,…these,…this,…we
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now let's look at the top words in the press release again. The stop words have gone.

In [29]:
print_sorted_vector(tfidf.iloc[212])


('nti', 0.56487288350283282)
('transit', 0.30992020013592608)
('transportation', 0.23490806362768965)
('training', 0.20020090792648149)
('rutgers', 0.16379220882186274)
('4.3', 0.15701624958670632)
('institute', 0.15030236459784593)
('public', 0.13468667783892418)
('efficient', 0.13247842020300446)
('workers', 0.11513111737966622)
('grant', 0.099558443055257573)
('award', 0.098396096503811856)
('critical', 0.095532368614887939)
('larrousse.nti', 0.091692332362515894)
('premiere', 0.086826433625999941)
('industry', 0.083008913035465287)
('lautenberg', 0.081471107009427737)
('interconnected', 0.080696126214690411)
('element', 0.080696126214690411)
('1991', 0.080696126214690411)


Compare this to the headline:

In [30]:
doc.sentences[0]

Sentence("Menendez and Lautenberg Applaud USDOT’s $4.3 Million Award to National Transit Institute at Rutgers
                    
                            Funding will provide job training and education for public transportation workers
                    
                      August 3, 2011
                     WASHINGTON, D.C. – Today, U.S.")

### Finding similar documents
Next, let's use tf-idf to summarize a group of documents. We'll do this by simply summing together the vectors for each document. Dividing by the number of documents will give us an "average" vector, but just a simple sum will point in the same direction (the ratios of the components will be the same). You can think of this as choosing a direction in the middle of all these documents.

So what do the first 100 documents discuss?

In [31]:
docs = tfidf.iloc[:100,:]
total = docs.sum(axis=0)
print_sorted_vector(total)

('new', 4.6380084052119663)
('jersey', 4.3405154266179391)
('menendez', 3.3653048234731422)
('iran', 3.3369335810395864)
('program', 2.8096238918970617)
('tax', 2.7953701626577092)
('2012', 2.7558566301624596)
('funding', 2.7341462492374515)
('million', 2.5937783756758925)
('veterans', 2.5615692819926918)
('senator', 2.4405760771314737)
('safety', 2.3674326591909973)
('year', 2.284279574844704)
('2011', 2.2828391557370171)
('federal', 2.2330684973455455)
('lautenberg', 2.1523007201349902)
('senate', 2.1362927146719057)
('families', 2.098229232266049)
('communities', 1.9405729234847227)
('iranian', 1.864004829716446)
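
The claim that the sum and the average point in the same direction is easy to check. A minimal sketch with three made-up 3-dimensional vectors: once both are scaled to length 1, the sum and the average are the same vector (up to floating point rounding). The helper name `unit` is an assumption for this example.

```python
import math

def unit(vec):
    # Scale a vector to length 1 so that only its direction remains
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

doc_vectors = [
    [0.8, 0.6, 0.0],
    [0.0, 0.6, 0.8],
    [0.6, 0.0, 0.8],
]

total = [sum(column) for column in zip(*doc_vectors)]   # component-wise sum
average = [t / len(doc_vectors) for t in total]         # component-wise mean

print(unit(total))
print(unit(average))
```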


Another way we can use these document vectors is to compute the similarity between them.  Conveniently, the TfidfVectorizer already outputs normalized vectors so it's easy to calculate similarity just by taking a dot product.  Two identical documents have similarity 1 and two documents with no words in common have similarity 0. We will reverse this to be "distance" so that identical documents have distance 0, and we can look for document pairs which are "closest" in the vector space.

In [42]:
def doc_distance(a_vec,b_vec):
    # First we have to compute similarity. The idea is the same as doc_similarity, but
    # because we are using arrays and not dictionaries, we can just multiply the matching
    # elements together and add up the products. This is what numpy's dot function does
    similarity = a_vec.dot(b_vec)

    # Because the vectors are already normalized, similarity will be 1 if equal, 0 if disjoint
    # We want things the other way around
    return 1-similarity

# helpful little function for distance between documents i and j
def dij(i,j):
    return doc_distance(tfidf.iloc[i], tfidf.iloc[j])


In [44]:
dij(100,200)


0.85775589439211042

One thing we can do with this is find the documents that are most similar to any particular document. This is an example-driven search engine. Actually what we'll do here is find the distance to every other document, then sort. We'll use the same document as above for the query.

In [43]:
query_doc = 212

# create a list of (document index, distance to that document) tuples 
closest = [(i, dij(query_doc, i)) for i in range(len(tfidf))]

# sort by the second element of the tuple, that is, the distance
closest.sort(key=lambda x: x[1])

# top 5 closest?
closest[:5]

[(212, -4.4408920985006262e-16),
 (637, 0.43711053078542184),
 (5, 0.57345588413445436),
 (11, 0.58115728608200345),
 (282, 0.68107726734169893)]

Unsurprisingly, document 212 is closest to itself (that first distance should be a 0, but floating point dirt prevents it). Let's take a look at the others.

In [35]:
print(pr.text[closest[1][0]])

SENATORS MENENDEZ AND LAUTENBERG ANNOUNCE $4 MILLION FOR RUTGERS’ NATIONAL TRANSIT INSTITUTE
                    
                      June 3, 2010
                     WASHINGTON – Today, U.S. Senators Robert Menendez (D-NJ) and Frank R. Lautenberg (D-NJ) announced that the Department of Transportation has awarded $4,300,000 in funds to the National Transit Institute (NTI). The Institute was established in 1991 under the Intermodal Surface Transportation Efficiency Act (ISTEA) with the objective of developing, promoting, and delivering quality training, education, and clearinghouse services for the public transit industry. Its primary objective is to develop, deliver, and promote quality programs and materials via cooperative partnerships with industry, government, institutions, and associations around the country. “As we know well in New Jersey, public transit helps families save time and money and helps clear the air we breathe,” said Menendez. “An emphasis on public transit will b

In [36]:
print(pr.text[closest[2][0]])

Transportation Subcommittee Chair Says Bipartisan Senate Transit Bill will  Deliver $63 Million More Per Year to NJ
                    
                            More funding means more jobs, easier travel
                    
                      February 2, 2012
                     WASHINGTON – Following today’s approval by the Senate Banking Committee of the Federal Public Transportation Act of 2012, U.S. Senator Robert Menendez (D-NJ), Chairman of the Subcommittee on Housing, Transportation and Community Development, called the bipartisan legislation “an incredible boon to New Jersey.”  By cutting waste and eliminating earmarks, the bill will provide New Jersey $519 million in federal transit funding, an increase of over $63 million per year.  If passed by the full Senate and House, New Jersey would receive more federal transit funding per year than ever before -- without increased overall federal spending.

“For over two years now, I have worked across party lines to help cra

In [37]:
print(pr.text[closest[3][0]])

Menendez: Bipartisan Senate Transit Bill Will Deliver NJ over $62 Million More Per Year
                    
                            Chair of Banking’s Transit Subcommittee Delivers Urgently Needed Transit Funding in a Flat Funded Bill
                    
                      January 30, 2012
                     WASHINGTON - Senator Robert Menendez (D-NJ) today announced that Senate Banking Committee’s transit reauthorization mark, which he helped craft, will be a boon to New Jersey.  By cutting waste and eliminating earmarks, the bill will provide New Jersey $519 million in federal transit funding, an increase of over $62 million per year.  If passed, New Jersey would receive more federal transit funding per year than ever before and without increased overall federal spending.

“For over two years now, we have worked to craft a bill to improve public transportation in New Jersey and across America,” said Menendez, who worked closely with former Chairman Dodd, Chairman Johnson, 

These are all on the subject of transit funding!
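
The search steps above can be wrapped into one reusable function. The sketch below is a standalone version that works on plain dictionaries rather than the `tfidf` dataframe, so the corpus, the word weights, and the names `normalize`, `distance` and `closest_docs` are all made up for illustration. It also clamps the distance at zero, which hides the tiny negative value that floating point rounding produced earlier, and it skips the query document itself.

```python
import heapq
import math

def normalize(vec):
    length = math.sqrt(sum(v * v for v in vec.values()))
    return {w: v / length for w, v in vec.items()}

def distance(a, b):
    similarity = sum(a[w] * b[w] for w in a if w in b)
    return max(0.0, 1.0 - similarity)   # clamp away floating point noise below zero

def closest_docs(query_index, vectors, k=3):
    # Score every document except the query, then keep the k smallest distances
    scored = (
        (distance(vectors[query_index], v), i)
        for i, v in enumerate(vectors)
        if i != query_index
    )
    return [(i, d) for d, i in heapq.nsmallest(k, scored)]

corpus = [normalize(v) for v in [
    {"transit": 3, "funding": 2, "rutgers": 1},
    {"iran": 4, "sanctions": 2},
    {"transit": 2, "funding": 3, "senate": 1},
    {"veterans": 3, "funding": 1},
]]

result = closest_docs(0, corpus, k=2)
print(result)
```

Using `heapq.nsmallest` avoids sorting the whole list when only a handful of neighbors are needed, though for 1,530 documents a full sort is perfectly fine too.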