# Text Mining

# Lecture 3: Vector Space Retrieval 

![img](https://3.bp.blogspot.com/_tOOi3R89e74/TUeyueig7ZI/AAAAAAAAAJQ/QHL-VLEWook/s1600/vector_space.png)

## Word Vectors

### Recap

Data Collection -> Information Selection -> Pre-processing -> ??? -> Classification

Plain Text -> Tokens -> Vocabulary -> 'Word' Vectors

Word - Token - Term

### Document to Vector

In [1]:
document = "this is a text this is"
tokens = document.split()
tokens

['this', 'is', 'a', 'text', 'this', 'is']

In [2]:
vocabulary = set(tokens)
vocabulary

{'a', 'is', 'text', 'this'}

In [3]:
# Warning: this example is simple but slow on real data, because
# tokens.count() rescans the whole token list for every vocabulary term.
document_vector = [tokens.count(term) for term in sorted(vocabulary)]
document_vector

[1, 2, 1, 2]
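A minimal sketch of a faster variant, assuming the standard library's `collections.Counter` is acceptable here. It counts all tokens in one pass, so the vector can be read off without rescanning the token list per term. The variable names mirror the cells above.

```python
from collections import Counter

document = "this is a text this is"
tokens = document.split()

# Count every token in a single pass instead of calling tokens.count() per term.
counts = Counter(tokens)

# Sorting the vocabulary fixes the column order, as in the cell above.
vocabulary = sorted(counts)
document_vector = [counts[term] for term in vocabulary]

print(vocabulary)       # ['a', 'is', 'text', 'this']
print(document_vector)  # [1, 2, 1, 2]
```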

### Documents to Vectors

In [4]:
documents = ["this is a text this is", "and here is another text"]
doc_tokens = [document.split() for document in documents]
vocabulary = {term for tokens in doc_tokens for term in tokens}
vocabulary

{'a', 'and', 'another', 'here', 'is', 'text', 'this'}

In [5]:
import numpy as np

document_vectors = [[tokens.count(term) for term in sorted(vocabulary)]
                    for tokens in doc_tokens]
np.matrix(document_vectors)

matrix([[1, 0, 0, 0, 2, 1, 2],
        [0, 1, 1, 1, 1, 1, 0]])

### Easier

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
cv

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [7]:
document_vectors = cv.fit_transform(documents)
document_vectors.todense()

matrix([[0, 0, 0, 2, 1, 2],
        [1, 1, 1, 1, 1, 0]], dtype=int64)

In [8]:
cv.vocabulary_

{'and': 0, 'another': 1, 'here': 2, 'is': 3, 'text': 4, 'this': 5}
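Note that 'a' is missing from this vocabulary. The `token_pattern` in the `CountVectorizer` repr above only matches tokens of two or more word characters. The sketch below applies that same regular expression with the standard `re` module to show the effect. `relaxed_pattern` is a hypothetical alternative, not something the notes use.

```python
import re

document = "this is a text this is"

# Pattern copied from the CountVectorizer repr above:
# a token needs at least two word characters, so "a" is dropped.
default_pattern = r"(?u)\b\w\w+\b"
# Hypothetical alternative that also keeps single-character tokens.
relaxed_pattern = r"(?u)\b\w+\b"

default_tokens = re.findall(default_pattern, document.lower())
relaxed_tokens = re.findall(relaxed_pattern, document.lower())

print(sorted(set(default_tokens)))  # ['is', 'text', 'this']
print(sorted(set(relaxed_tokens)))  # ['a', 'is', 'text', 'this']
```

Passing a pattern like `relaxed_pattern` as the `token_pattern` argument of `CountVectorizer` should keep 'a' in the vocabulary, if single-character terms matter for the task.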

## Vector Retrieval

### Boolean Vectors

In [9]:
A = "text about stuff"
B = "stuff about text"
C = "text about vectors"
D = "vectors are handy"

Which document is D most similar to?

| doc  | about | are | handy | stuff | text | vectors |
| ---- | ----- | --- | ----- | ----- | ---- | ------- |
| A    | 1     |     |       | 1     | 1    |         |
| B    | 1     |     |       | 1     | 1    |         |
| C    | 1     |     |       |       | 1    | 1       |
| D    |       | 1   | 1     |       |      | 1       |

- `Document * Term` or `Term * Document` Matrix
- Bag of Words (BoW)
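The table above can be reproduced with a few lines of plain Python. This is a sketch with whitespace tokenisation; the `docs` dictionary and the `boolean_vectors` name are introduced here only for illustration.

```python
docs = {
    "A": "text about stuff",
    "B": "stuff about text",
    "C": "text about vectors",
    "D": "vectors are handy",
}

# Sorted vocabulary over all documents gives the column order of the table.
vocabulary = sorted({term for text in docs.values() for term in text.split()})

# 1 if the term occurs in the document, 0 otherwise (counts are ignored).
boolean_vectors = {
    name: [int(term in text.split()) for term in vocabulary]
    for name, text in docs.items()
}

print(vocabulary)
for name, vector in boolean_vectors.items():
    print(name, vector)
```

A and B end up with identical rows: a bag of words keeps which terms occur, not where they occur.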

### Jaccard Coefficient - $J(A, B) = \frac{| A \cap B |}{| A \cup B |}$

| doc  | about | are | handy | stuff | text | vectors |
| ---- | ----- | --- | ----- | ----- | ---- | ------- |
| A    | 1     |     |       | 1     | 1    |         |
| B    | 1     |     |       | 1     | 1    |         |
| C    | 1     |     |       |       | 1    | 1       |
| D    |       | 1   | 1     |       |      | 1       |

In [10]:
As = set(A.split())
Bs = set(B.split())
Cs = set(C.split())
Ds = set(D.split())

print("As Ds:", len(As & Ds) / len(As | Ds))
print("As Bs:", len(As & Bs) / len(As | Bs))
print("Cs Ds:", len(Cs & Ds) / len(Cs | Ds))

As Ds: 0.0
As Bs: 1.0
Cs Ds: 0.2
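To answer "which document is D most similar to?" for every candidate at once, the coefficient can be wrapped in a function and used for ranking. A sketch under the same whitespace tokenisation; `jaccard`, `token_sets` and `ranking` are names chosen for this example.

```python
def jaccard(a, b):
    """Jaccard coefficient of two token sets; 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


docs = {
    "A": "text about stuff",
    "B": "stuff about text",
    "C": "text about vectors",
    "D": "vectors are handy",
}
token_sets = {name: set(text.split()) for name, text in docs.items()}

# Score every other document against D and sort from most to least similar.
query = "D"
scores = {
    name: jaccard(token_sets[query], tokens)
    for name, tokens in token_sets.items()
    if name != query
}
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(ranking)  # [('C', 0.2), ('A', 0.0), ('B', 0.0)]
```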


- Order
- Context
- Information?

### Term Frequencies - $tf_{t,d}$

t = term, d = document

In [11]:
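from glob import glob
from pathlib import Path

cv = CountVectorizer()
# Path.read_text() opens and closes each file, so no file handles are left open.
documents = {doc: Path(doc).read_text() for doc in glob('../Week 1 - Introduction/data/*.txt')}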
D = cv.fit_transform(documents.values())
D

<6x5361 sparse matrix of type '<class 'numpy.int64'>'
	with 9245 stored elements in Compressed Sparse Row format>

In [28]:
print([x.replace('../Week 1 - Introduction/data/', '') for x in documents.keys()])

['artificial-intelligence.txt', 'natural-language-processing.txt', 'information-retrieval.txt', 'text-mining.txt', 'computer-vision.txt', 'machine-learning.txt']


In [13]:
cv.vocabulary_['learning']

2970

In [14]:
D[:,2970].todense()

matrix([[ 46],
        [ 27],
        [  2],
        [  6],
        [ 10],
        [134]])

### Issues

- A term with frequency 100 is more important than one with frequency 10, but is it also 10 times as important?

In [15]:
np.log10(D.todense() + 1)[:,2970]

matrix([[ 1.67209786],
        [ 1.44715803],
        [ 0.47712125],
        [ 0.84509804],
        [ 1.04139269],
        [ 2.13033377]])
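The dampening effect of the log is easier to see on a few round numbers. A small sketch using the same `log10(tf + 1)` transform as the cell above; `log_tf` is a helper name introduced here.

```python
import math


def log_tf(tf):
    """Log-scaled term frequency; the +1 keeps a count of 0 at weight 0."""
    return math.log10(tf + 1)


# Each tenfold increase in raw count adds roughly 1 to the weight.
for tf in (0, 1, 10, 100, 1000):
    print(tf, round(log_tf(tf), 3))
```

Going from 10 to 100 occurrences raises the weight from about 1.04 to about 2.00, so ten times the count gives less than twice the weight.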

- Longer documents shouldn't be more likely to be interesting.
- Rare terms should be informative (amongst all documents).
    - If D1 and D2 both have 'cross-validation' in their vectors, and all the other documents don't, that should be a very strong similarity indication.

The latter is captured by document frequency.

### Document Frequency - $idf_t = \log_{10} (N / df_{t})$

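Document frequency $df_t$ is the number of documents that contain term $t$. It is an inverse measure of informativeness: the more documents contain a term, the less that term says about any one of them. The inverse document frequency $idf_t$ therefore gives rare terms a high weight.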

* N = number of documents
* $df_t$ = document frequency of term $t$

In [16]:
def idf(term):
    return np.log10(len(documents) / sum([term in d.split() for d in documents.values()]))

print(idf("learning"))
print(idf("parsing"))
print(idf("naive"))

0.0
0.47712125472
0.778151250384
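The same calculation on the four toy documents from the Boolean Vectors slide, as a sketch. Unlike the cell above, this version takes the tokenised documents as an argument and returns `None` for a term that occurs nowhere, where the division would otherwise fail. That guard is a choice made for this example, not part of the lecture code.

```python
import math


def idf(term, tokenized_docs):
    """Inverse document frequency, log10(N / df); None if the term is unseen."""
    n_docs = len(tokenized_docs)
    df = sum(term in tokens for tokens in tokenized_docs)
    if df == 0:
        return None  # log10(N / 0) is undefined
    return math.log10(n_docs / df)


toy_docs = [
    text.split()
    for text in (
        "text about stuff",
        "stuff about text",
        "text about vectors",
        "vectors are handy",
    )
]

print(idf("text", toy_docs))     # in 3 of 4 documents: low weight
print(idf("handy", toy_docs))    # in 1 of 4 documents: high weight
print(idf("missing", toy_docs))  # None
```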


### TF\*IDF Weighting

Also tf-idf, tf.idf, etc.

$w_{t,d} = (1 + \log_{10} tf_{t,d} ) \cdot \log_{10} (N / df_{t})$

Increases with the rarity of the term across documents and with its number of occurrences within the document.
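A direct transcription of the weighting formula, assuming base-10 logarithms for both factors and a weight of 0 when the term does not occur. `tf_idf` is a helper name introduced here. The first call uses the 'learning' counts from the term frequency cells (134 occurrences, present in all 6 documents); the second uses made-up counts.

```python
import math


def tf_idf(tf, df, n_docs):
    """w = (1 + log10 tf) * log10(N / df); 0.0 if the term is absent."""
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)


# 'learning': frequent, but in every document, so the idf factor is 0.
print(tf_idf(tf=134, df=6, n_docs=6))  # 0.0

# Hypothetical term: 10 occurrences, found in only 1 of 6 documents.
print(tf_idf(tf=10, df=1, n_docs=6))
```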

In [42]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()

D = tfidf.fit_transform(documents.values())
D.todense()[:,2970]

matrix([[ 0.05364644],
        [ 0.06741781],
        [ 0.00425381],
        [ 0.01783667],
        [ 0.02036652],
        [ 0.36599158]])
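The 'learning' column is non-zero here, even though `idf("learning")` printed 0.0 earlier. That is consistent with scikit-learn's documented defaults, which differ from the slide formula: raw term counts (`sublinear_tf=False`), a smoothed idf of $\ln\frac{1 + N}{1 + df_t} + 1$ (`smooth_idf=True`), and L2 normalisation of each document row (`norm='l2'`). The sketch below re-implements those two steps in plain Python, assuming the defaults are unchanged. `smooth_idf` and `l2_normalize` are helper names for this example, not scikit-learn functions.

```python
import math


def smooth_idf(df, n_docs):
    """Smoothed idf as described in the scikit-learn docs (natural log, plus 1)."""
    return math.log((1 + n_docs) / (1 + df)) + 1


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length; leave an all-zero vector as is."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector] if length else vector


# A term found in all 6 documents still gets idf 1.0, not 0.
print(smooth_idf(df=6, n_docs=6))  # 1.0
# A term found in a single document gets a higher idf.
print(smooth_idf(df=1, n_docs=6))

# Row normalisation makes weights comparable across document lengths.
print(l2_normalize([3.0, 4.0]))
```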

### TF\*IDF Weighting II

In [27]:
print([x.replace('../Week 1 - Introduction/data/', '') for x in documents.keys()])

['artificial-intelligence.txt', 'natural-language-processing.txt', 'information-retrieval.txt', 'text-mining.txt', 'computer-vision.txt', 'machine-learning.txt']


In [24]:
tfidf.vocabulary_['vision']

5200

In [41]:
D.todense()[:,5200]

matrix([[ 0.00467589],
        [ 0.        ],
        [ 0.0056851 ],
        [ 0.        ],
        [ 0.34296306],
        [ 0.01095084]])

So how do we know if they are similar?

## Vector Space Retrieval

### Vectors in Space

- Features
- Points

Below: sentence = document, term = word

* term 1 = "a"
* term 2 = "sentence"
* term n = "that"


* sentence 1 = "a that"
* sentence 2 = "that sentence"
* sentence n = "a sentence"

![img](https://3.bp.blogspot.com/_tOOi3R89e74/TUeyueig7ZI/AAAAAAAAAJQ/QHL-VLEWook/s1600/vector_space.png)
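The three sentences can be written out as points in the three-dimensional term space, with one axis per term. A sketch using boolean term presence; the axis order (a, sentence, that) follows the term list above and `points` is a name chosen here.

```python
terms = ["a", "sentence", "that"]
sentences = ["a that", "that sentence", "a sentence"]

# One coordinate per term: 1 if the sentence contains it, 0 otherwise.
points = [
    tuple(int(term in sentence.split()) for term in terms)
    for sentence in sentences
]

for sentence, point in zip(sentences, points):
    print(repr(sentence), "->", point)
```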

### Common Display ... Mining

- 2D
- Vector endpoints
- Vectors have a label

![img2](http://nlp.stanford.edu/IR-book/html/htmledition/img1087.png)

### Common Display Information Retrieval

- 2D
- Full vectors
- q = query

![img3](http://nlp.stanford.edu/IR-book/html/htmledition/img411.png)
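With documents and a query in the same space, retrieval comes down to ranking documents by how close they are to q. The notes so far do not fix a similarity measure. The sketch below assumes cosine similarity (the angle between the vectors), which is a common choice in information retrieval, and reuses the boolean vectors of documents A to D. The query "handy vectors" is made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Columns: about, are, handy, stuff, text, vectors
doc_vectors = {
    "A": [1, 0, 0, 1, 1, 0],
    "B": [1, 0, 0, 1, 1, 0],
    "C": [1, 0, 0, 0, 1, 1],
    "D": [0, 1, 1, 0, 0, 1],
}
q = [0, 0, 1, 0, 0, 1]  # hypothetical query: "handy vectors"

# Rank documents from most to least similar to the query.
ranking = sorted(
    ((name, cosine(q, vector)) for name, vector in doc_vectors.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, score in ranking:
    print(name, round(score, 3))
```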