# Distributional Semantics


For this notebook, we'll be using the 500-document Brown corpus included in NLTK.

In [1]:
from nltk.corpus import brown

This notebook is divided up into two independent parts: the first uses PMI for distinguishing good collocations, and the second involves building a vector space model for document retrieval.

For the PMI portion, we'll use a function that extracts the information we need for a particular two-word collocation (the count of each word individually, the count of the collocation, and the total number of word tokens in the corpus) and then calculates the PMI:

In [2]:
import math

def get_PMI_for_collocation_brown(word1,word2):
    word1_count = 0
    word2_count = 0
    both_count = 0
    total_count = 0.0 # so that division results in a float
    for sent in brown.sents():
        sent = [word.lower() for word in sent]
        for i in range(len(sent)):
            total_count += 1
            if sent[i] == word1:
                word1_count += 1
                if i < len(sent) - 1 and sent[i + 1] == word2:
                    both_count += 1
            if sent[i] == word2:  # separate check, so word2 is still counted when word1 == word2
                word2_count += 1
    return math.log((both_count/total_count)/((word1_count/total_count)*(word2_count/total_count)), 2)
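
For reference, the value returned by the function above is the base-2 PMI, with each probability estimated as a count divided by the total number of tokens. The symbols $c(\cdot)$ and $N$ are notation introduced here for illustration; they correspond to `both_count`, `word1_count`, `word2_count` and `total_count` in the code:

$$
\mathrm{PMI}(w_1, w_2) = \log_2 \frac{P(w_1 w_2)}{P(w_1)\,P(w_2)}, \qquad P(w_1 w_2) = \frac{c(w_1 w_2)}{N}, \quad P(w_i) = \frac{c(w_i)}{N}
$$

A positive value means the two words occur together more often than they would if they were independent.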
                
        

Note that in a typical use case we probably wouldn't do it this way: we'd likely want to calculate PMI across many different words, and the statistics for all of them can be collected in a single pass over the corpus, with the PMI then calculated in a separate function. Anyway, let's compare the PMI for two phrases, "hard work" and "some work":

In [3]:
print(get_PMI_for_collocation_brown("hard","work"))
print(get_PMI_for_collocation_brown("some","work"))

5.23724453167
1.9135320271


Based on PMI, "hard work" appears to be a much better collocation than "some work", which matches our intuition. Go ahead and try this out on some other collocations.
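
The single-pass approach mentioned earlier could be sketched roughly as follows. This is only an illustration: the helper names `collect_counts` and `pmi` and the toy sentences are made up for this example, and returning negative infinity for unseen pairs is one possible convention, not something the function above does.

```python
import math
from collections import Counter

def collect_counts(sents):
    """One pass over the corpus: count lower-cased words and adjacent word pairs."""
    unigrams = Counter()
    bigrams = Counter()
    for sent in sents:
        words = [w.lower() for w in sent]
        unigrams.update(words)
        # pairs never cross sentence boundaries, as in the function above
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def pmi(word1, word2, unigrams, bigrams):
    """Base-2 PMI computed from the precollected counts."""
    total = float(sum(unigrams.values()))
    both = bigrams[(word1, word2)]
    if both == 0:
        return float("-inf")  # assumed convention for pairs that never occur
    return math.log((both / total) / ((unigrams[word1] / total) * (unigrams[word2] / total)), 2)

# Toy stand-in for brown.sents()
toy_sents = [["Hard", "work", "pays"], ["some", "work", "is", "hard"]]
unigrams, bigrams = collect_counts(toy_sents)
print(pmi("hard", "work", unigrams, bigrams))
```

With the real corpus, the counts would be collected once with `collect_counts(brown.sents())` and then reused for as many word pairs as needed.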

For the second part of the notebook, let's create a sparse document-term matrix using scikit-learn. We build a document-term matrix rather than a term-document matrix because we will be performing SVD dimensionality reduction to produce dense document representations for document retrieval. Note that this is actually identical to creating a BOW feature representation for each document; the difference lies in how we use the representation.

In [4]:
from sklearn.feature_extraction import DictVectorizer

def get_BOW(text):
    BOW = {}
    for word in text:
        BOW[word.lower()] = BOW.get(word.lower(),0) + 1
    return BOW

texts = []
for fileid in brown.fileids():
    texts.append(get_BOW(brown.words(fileid)))

vectorizer = DictVectorizer()
brown_matrix = vectorizer.fit_transform(texts)
print(brown_matrix)


  (0, 49)	1.0
  (0, 58)	1.0
  (0, 169)	1.0
  (0, 181)	1.0
  (0, 205)	1.0
  (0, 238)	1.0
  (0, 322)	33.0
  (0, 373)	3.0
  (0, 374)	3.0
  (0, 393)	87.0
  (0, 395)	4.0
  (0, 405)	88.0
  (0, 454)	4.0
  (0, 465)	1.0
  (0, 695)	1.0
  (0, 720)	1.0
  (0, 939)	1.0
  (0, 1087)	1.0
  (0, 1103)	1.0
  (0, 1123)	1.0
  (0, 1159)	1.0
  (0, 1170)	1.0
  (0, 1173)	1.0
  (0, 1200)	3.0
  (0, 1451)	1.0
  :	:
  (499, 49161)	1.0
  (499, 49164)	1.0
  (499, 49242)	1.0
  (499, 49253)	1.0
  (499, 49275)	1.0
  (499, 49301)	1.0
  (499, 49313)	1.0
  (499, 49369)	1.0
  (499, 49385)	1.0
  (499, 49386)	4.0
  (499, 49390)	2.0
  (499, 49410)	2.0
  (499, 49446)	1.0
  (499, 49576)	1.0
  (499, 49590)	1.0
  (499, 49613)	3.0
  (499, 49691)	42.0
  (499, 49694)	3.0
  (499, 49697)	3.0
  (499, 49698)	1.0
  (499, 49707)	17.0
  (499, 49708)	1.0
  (499, 49710)	4.0
  (499, 49711)	1.0
  (499, 49797)	1.0


Our matrix is sparse: for instance, columns 0-48 in row 0 are empty and are simply left out. Only the row and column positions with non-zero values are displayed.
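
To make the sparse layout more concrete, here is a small sketch on made-up data. The names `toy_bows`, `toy_vectorizer` and `toy_matrix` are invented for this example; `vocabulary_`, `nnz` and `toarray()` are standard parts of `DictVectorizer` and SciPy sparse matrices.

```python
from sklearn.feature_extraction import DictVectorizer

# Two tiny "documents" already in BOW form
toy_bows = [{"the": 2, "cat": 1}, {"the": 1, "dog": 3}]
toy_vectorizer = DictVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_bows)

# vocabulary_ maps each word to its column index
print(toy_vectorizer.vocabulary_)
# shape is (documents, distinct words); nnz is the number of stored entries
print(toy_matrix.shape)
print(toy_matrix.nnz)
# toarray() fills in the zeros; fine for a toy matrix, wasteful for a large one
print(toy_matrix.toarray())
```

The same attributes can be inspected on `vectorizer` and `brown_matrix` to see which word a given column index stands for.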

Rather than removing stopwords as we did for text classification, let's add some idf weighting to this matrix. Scikit-learn has a built-in tf-idf transformer for just this purpose.

In [5]:
from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer(smooth_idf=False,norm=None)

brown_matrix = transformer.fit_transform(brown_matrix)

print(brown_matrix)

  (0, 49646)	1.72981116493
  (0, 49613)	1.36816932336
  (0, 49596)	3.70663186543
  (0, 49386)	9.98833379406
  (0, 49378)	8.73162901565
  (0, 49313)	2.62964061975
  (0, 49301)	7.37407593121
  (0, 49292)	2.18417017703
  (0, 49224)	3.38596670193
  (0, 49147)	6.0
  (0, 49041)	3.40794560865
  (0, 49003)	22.2100968809
  (0, 49001)	5.74160535314
  (0, 48990)	16.8467729363
  (0, 48951)	4.72970144863
  (0, 48950)	4.93935194012
  (0, 48932)	3.9565115604
  (0, 48867)	7.04612032287
  (0, 48777)	1.41855034766
  (0, 48771)	13.6942100975
  (0, 48769)	6.23642898412
  (0, 48753)	1.29571424415
  (0, 48749)	3.19841940751
  (0, 48720)	1.16487464319
  (0, 48670)	2.19743194588
  :	:
  (499, 2710)	3.1202635362
  (499, 2688)	2.04412410338
  (499, 2670)	3.9565115604
  (499, 2611)	4.27016911926
  (499, 2468)	6.52146091786
  (499, 2439)	4.1700856607
  (499, 2415)	4.12263300785
  (499, 2413)	2.32033750431
  (499, 2388)	2.09661428601
  (499, 2358)	6.11599580975
  (499, 2290)	61.0
  (499, 2289)	7.55330245138
  (499
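
The whole-number weights in this output (such as 6.0) are what we would expect for words that occur in every document. With `smooth_idf=False` and `norm=None`, scikit-learn computes each weight as tf * (ln(n / df) + 1), where n is the number of documents and df is the number of documents containing the word, so a word found in all documents keeps its raw count. The sketch below checks this on a made-up count matrix; `toy_counts` and `toy_tfidf` are illustrative names only.

```python
import math
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# 3 documents x 2 terms: term 0 occurs in every document, term 1 in only one
toy_counts = csr_matrix(np.array([[2.0, 0.0], [1.0, 0.0], [3.0, 4.0]]))
toy_tfidf = TfidfTransformer(smooth_idf=False, norm=None).fit_transform(toy_counts).toarray()

print(toy_tfidf)
# Term 1 has df = 1 out of n = 3 documents, so its count of 4 is scaled by ln(3) + 1
print(4 * (math.log(3) + 1))
```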

Next, let's apply SVD. We don't need to work with the internals of the decomposition here; we just use the TruncatedSVD class directly to get a matrix with k dimensions. Since the Brown corpus is fairly small, we'll use k=100.

In [6]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=100)
brown_matrix = svd.fit_transform(brown_matrix)

print(brown_matrix)

[[  2.42558208e+02   2.29467260e+01  -9.13295899e+00 ...,  -3.93194521e+00
   -4.31070980e+00  -4.48405648e+00]
 [  2.48385446e+02   2.50211712e+01  -2.11564668e+01 ...,   3.95161813e+00
   -1.47451093e+01   7.71966768e+00]
 [  2.36701807e+02   2.40132907e+01  -1.11425082e+01 ...,  -1.37672303e+01
   -2.03296052e+00  -1.55838356e+00]
 ..., 
 [  2.58648884e+02  -1.13703798e+02   2.65677713e+01 ...,   1.90243669e-01
   -6.42072245e-01  -1.86653074e+01]
 [  2.91346128e+02   1.29999362e+01  -2.67548983e+01 ...,  -5.44686900e-01
   -1.49584992e+00  -1.29129229e+00]
 [  2.73315461e+02  -3.19074823e+01  -1.77859498e+01 ...,  -3.05565816e+00
   -3.66808138e+00  -3.93672626e+00]]


Note that this matrix is not sparse.
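
The change from sparse to dense can be seen on a small made-up matrix. This is a sketch with illustrative names (`toy_sparse`, `toy_svd`, `toy_dense`); `random_state=0` is set only to make the toy run repeatable.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# 3 documents x 4 terms, stored sparsely
toy_sparse = csr_matrix(np.array([[1.0, 0.0, 2.0, 0.0],
                                  [0.0, 3.0, 0.0, 1.0],
                                  [2.0, 0.0, 4.0, 0.0]]))
toy_svd = TruncatedSVD(n_components=2, random_state=0)
toy_dense = toy_svd.fit_transform(toy_sparse)

# The result is an ordinary NumPy array with one row per document and k columns
print(type(toy_dense))
print(toy_dense.shape)
# The fitted object keeps the k x terms projection that transform() reuses on new input
print(toy_svd.components_.shape)
```

Because the fitted object stores this projection, a new document can later be mapped into the same k-dimensional space with `transform`, which is what the retrieval step relies on.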

The last thing we'll do is build a very simple document retrieval system based on the vector space model we've built: it will take some query input, apply all the transformations we have defined above, and then find the Brown document with the highest cosine similarity to the query document. Here we are using SciPy's cosine distance function, so we actually look for the smallest distance rather than the largest similarity.

In [7]:
from scipy.spatial.distance import cosine as cos_distance

def transform_query(query_text):
    return svd.transform(transformer.transform(vectorizer.transform([get_BOW(query_text.split())])))[0]

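def get_best_doc_num(query):
    # cosine distance from the query to document x
    f = lambda x: cos_distance(query,brown_matrix[x])
    # search every row of the matrix instead of hard-coding 500
    best_doc = min(range(brown_matrix.shape[0]),key=f)
    return best_doc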
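
To see why the smallest distance corresponds to the best match, here is a small sketch on made-up vectors. SciPy's cosine distance is 1 minus the cosine similarity, so vectors pointing the same way are at distance 0 (whatever their lengths) and orthogonal vectors are at distance 1. The helper `closest_row` and the toy arrays are hypothetical and only mirror the logic of `get_best_doc_num`.

```python
import numpy as np
from scipy.spatial.distance import cosine as cos_distance

def closest_row(query, matrix):
    """Hypothetical helper: index of the row with the smallest cosine distance to query."""
    distances = [cos_distance(query, row) for row in matrix]
    return int(np.argmin(distances))

toy_docs = np.array([[1.0, 0.0], [0.0, 3.0], [2.0, 2.0]])
toy_query = np.array([5.0, 0.5])

# Same direction, different length: distance is (numerically) zero
print(cos_distance(toy_docs[0], 2 * toy_docs[0]))
# Orthogonal vectors: distance is one
print(cos_distance(toy_docs[0], toy_docs[1]))
# The query points almost along the first axis, so row 0 is the closest
print(closest_row(toy_query, toy_docs))
```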

Let's test this out with a couple of sets of keywords, with the idea of getting a religious text in the first example and a mathematics text in the second (the Brown corpus has both). We'll also look at the specific vectors and distances involved.

In [8]:
def try_query(query_text):
    query = transform_query(query_text)
    doc_num = get_best_doc_num(query)
    print("query text")
    print(query_text)
    print("query vector")
    print(query)
    print("best document vector")
    print(brown_matrix[doc_num])
    print("cosine distance from query to document")
    print(cos_distance(query,brown_matrix[doc_num]))
    print("best document sample")
    print(brown.words(brown.fileids()[doc_num])[:50])

try_query("heaven hell devil lord")
try_query("matrix algebra eigenvalue")

query text
heaven hell devil lord
query vector
[  2.54452687e-02  -7.56679993e-02   1.41123872e-02   3.04769437e-05
   4.26648781e-02  -1.00752509e-01  -4.57414331e-02   3.56242823e-02
   5.54915690e-02   4.06544466e-02   2.70829739e-02  -4.90315816e-02
   7.17051626e-02   4.50726583e-03  -4.11722678e-02   4.48082691e-02
  -1.23657908e-01   1.43152195e-02   3.52936044e-03   2.81564695e-02
  -7.83756573e-02   6.89396599e-03   9.52306268e-02  -3.11698230e-02
   7.49723686e-02  -1.62267507e-02   9.59333787e-05  -3.17927457e-02
   1.28658660e-01  -5.29471293e-02  -6.72129251e-02   1.40567515e-02
   5.33132038e-02   3.78057823e-02  -8.75384899e-02   1.72355447e-02
  -8.21910451e-03  -4.86292667e-04   2.37387285e-02  -5.49583988e-02
   6.83240447e-02  -1.32568515e-01  -4.13863783e-02  -6.41750572e-02
  -4.63857753e-02  -8.93651530e-03   2.75541168e-02  -3.28616310e-02
  -6.63452416e-02  -9.62220804e-02  -5.75416343e-02   8.52153878e-02
   1.15379908e-01  -1.07925220e-01  -5.59246119e-02   7.