# Text Mining
# Lecture 4: Distributed Semantics

![img](https://www.tensorflow.org/versions/r0.10/images/linear-relationships.png)

## Recap & Issues

### Last Week

- Text can be tokenized, tokens can be counted.
- Each document can be represented as a vector of token counts.
- Together, these vectors form a Vector Space Model.
- Vectors can be weighted to be more informative.
- Through simple Vector Space calculations we can do some classification.
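
The steps above can be sketched from scratch with only the standard library. This is a toy illustration, not the weighting scikit-learn uses: the three documents, the query, and the plain `log(N / df)` idf are all assumptions made for the example (scikit-learn's `TfidfVectorizer` applies a smoothed idf and normalizes the vectors by default).

```python
import math
from collections import Counter

# Toy corpus (made up for this sketch).
docs = [
    "the rocket launch was delayed",
    "the shuttle launch was a success",
    "the screen turned blue",
]

# Tokenize and count: one Counter of token frequencies per document.
counts = [Counter(doc.split()) for doc in docs]
vocab = sorted(set(tok for c in counts for tok in c))

# Document frequency: in how many documents does each token occur?
df = {tok: sum(tok in c for c in counts) for tok in vocab}


def tfidf_vector(count):
    """Weight raw counts by a simple idf = log(N / df); unseen tokens get 0."""
    n_docs = len(docs)
    return [count[tok] * math.log(n_docs / df[tok]) for tok in vocab]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


vectors = [tfidf_vector(c) for c in counts]
query = tfidf_vector(Counter("shuttle launch".split()))
sims = [cosine(query, v) for v in vectors]

# Index of the most similar document.
best = max(range(len(docs)), key=sims.__getitem__)
print(best, docs[best])
```

Note that "the" occurs in every document, so its idf is `log(3 / 3) = 0` and it contributes nothing to any similarity.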

### In Practice

In [1]:
import numpy as np

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

newsgroups = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

vectorizer = TfidfVectorizer()
vectors = np.asarray(vectorizer.fit_transform(newsgroups.data).todense())

In [2]:
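query = ["USA space launch"]
qv = np.asarray(vectorizer.transform(query).todense())[0]
cos = {}

# TfidfVectorizer L2-normalizes its vectors by default, so the dot product is the cosine similarity.
# Caveat: the dict is keyed by score, so documents with identical scores overwrite each other.
for i, dv in enumerate(vectors):
    cos[np.dot(qv, dv)] = i

scores = sorted(cos)[::-1]
scores[:5]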

[0.3911516433798255,
 0.29651994760185252,
 0.25581635272977149,
 0.24790204776245364,
 0.21241493770390013]

In [3]:
newsgroups.target_names[newsgroups.target[cos[scores[0]]]]

'sci.space'
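
One caveat of the lookup used here: a dictionary keyed by score keeps only one document per distinct score, so ties (most commonly the many documents scoring 0) are silently dropped. A minimal sketch of a tie-safe alternative, using made-up scores and plain Python; with NumPy arrays, `np.argsort` would do the same job.

```python
# Toy similarity scores for five documents (values are made up):
# documents 1 and 3 tie, and documents 2 and 4 both score 0.0.
scores = [0.10, 0.39, 0.0, 0.39, 0.0]

# Keying a dict by score keeps only the last document seen for each score.
by_score = {}
for i, s in enumerate(scores):
    by_score[s] = i
print(len(by_score))  # fewer entries than documents

# Sorting document indices by score keeps every document.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
top = [(i, scores[i]) for i in ranking[:3]]
print(top)
```

For reading off the single best hit the dictionary approach is fine as long as the top score is unique; the index-based ranking matters once more than the first result is inspected.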

### In Practice II

In [5]:
query = ["blue screen of death"]
qv = np.asarray(vectorizer.transform(query).todense())[0]
cos = {}

for i, dv in enumerate(vectors):
    cos[np.dot(qv, dv)] = i
    
scores = sorted(cos)[::-1]

newsgroups.target_names[newsgroups.target[cos[scores[0]]]]

'comp.windows.x'

In [16]:
query = ["John kicked the ball"]
qv = np.asarray(vectorizer.transform(query).todense())[0]
cos = {}

for i, dv in enumerate(vectors):
    cos[np.dot(qv, dv)] = i
    
scores = sorted(cos)[::-1]

newsgroups.target_names[newsgroups.target[cos[scores[0]]]]

'sci.crypt'

### Left-over Issues

- TF\*IDF works quite well, but the only linguistic information it uses is word occurrence.
- It does not take into account word order, context, semantics, or any other properties of the words.
- Both parts, tf (Luhn, 1957) and df (Jones, 1972), are fairly dated and lack strong theoretical grounding.
- It needs a lot of preprocessing to fix noisy features (e.g. misspellings).
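
Two of these issues are easy to see in a few lines. A small sketch with made-up sentences: a bag-of-words representation cannot tell apart two sentences that differ only in word order, and a misspelled word becomes an unrelated feature.

```python
from collections import Counter

# Word order is lost: both sentences map to the same bag of words.
a = Counter("dog bites man".split())
b = Counter("man bites dog".split())
same_bag = (a == b)
print(same_bag)

# A misspelling is just a different token: the two variants share no feature.
shared = set("one occurence".split()) & set("one occurrence".split())
print(shared)
```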


### Fix 1: Change the Input

$n$-grams to the rescue!

- Up until now: word frequencies and bag-of-words.
- What if we use sequences of adjacent words as features? For example: `"the cat", "cat jumped", "jumped on", "on the", "the table"`.
- Do the occurrences of these pairs model language well?
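
As a rough sketch of what such features look like, the helper below (the name `ngrams` is just a choice for this example) slides a window of size $n$ over the tokens and counts the resulting tuples.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams (as tuples) over a token sequence."""
    return list(zip(*[tokens[i:] for i in range(n)]))


tokens = "the cat jumped on the table".split()

bigrams = ngrams(tokens, 2)
print([" ".join(g) for g in bigrams])

# n-gram counts can be used as features in the same way as word counts.
bigram_counts = Counter(bigrams)
trigrams = ngrams(tokens, 3)
print(len(bigrams), len(trigrams))
```

In scikit-learn the same effect can be obtained with the `ngram_range` parameter of the vectorizers, e.g. `TfidfVectorizer(ngram_range=(1, 2))` for unigrams plus bigrams.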

### Language Modelling

Very easy approach: Markov Models

In [6]:
s = "I am Sam. Sam I am. I do not like green eggs and ham."
tokens = s.split()
bigrams = list(zip(*[tokens[i:] for i in range(2)]))
print(bigrams)

[('I', 'am'), ('am', 'Sam.'), ('Sam.', 'Sam'), ('Sam', 'I'), ('I', 'am.'), ('am.', 'I'), ('I', 'do'), ('do', 'not'), ('not', 'like'), ('like', 'green'), ('green', 'eggs'), ('eggs', 'and'), ('and', 'ham.')]


![img2](https://sookocheff.com/img/nlp/ngram-modeling-with-markov-chains/transitions-from-I.png)

![img3](https://sookocheff.com/img/nlp/ngram-modeling-with-markov-chains/following-transitions-from-I.png)
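
In a first-order Markov model the next word depends only on the current one, and the transition probabilities can be estimated from bigram counts as $P(w_2 \mid w_1) = \frac{\text{count}(w_1, w_2)}{\text{count}(w_1, \cdot)}$. A minimal sketch on the example sentence, assuming the full stops are stripped so that "am" and "am." count as the same state (the cell above keeps the punctuation, so its states differ slightly):

```python
from collections import Counter, defaultdict

s = "I am Sam. Sam I am. I do not like green eggs and ham."
# Assumption for this sketch: drop the full stops before splitting.
tokens = s.replace(".", "").split()

# transitions[w1][w2] = number of times w2 directly follows w1.
transitions = defaultdict(Counter)
for current, following in zip(tokens, tokens[1:]):
    transitions[current][following] += 1


def transition_probs(state):
    """Relative-frequency estimate of P(next word | state)."""
    counts = transitions[state]
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}


probs_from_I = transition_probs("I")
print(probs_from_I)
```

Here "I" is followed by "am" twice and by "do" once, which gives probabilities of 2/3 and 1/3.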

In [7]:
import random


class MarkovChain:

    def __init__(self):
        self.memory = {}

    def _learn_key(self, key, value):
        if key not in self.memory:
            self.memory[key] = []

        self.memory[key].append(value)

    def learn(self, text):
        tokens = text.split(" ")
        bigrams = zip(*[tokens[i:] for i in range(2)])
        for bigram in bigrams:
            self._learn_key(bigram[0], bigram[1])

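    def _next(self, current_state):
        next_possible = self.memory.get(current_state)

        # Unseen state (or the empty start state): fall back to any known word.
        if not next_possible:
            next_possible = list(self.memory.keys())

        # random.choice needs a sequence; random.sample on dict keys fails on recent Python versions.
        return random.choice(next_possible)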

    def babble(self, amount, state=''):
        if not amount:
            return state

        next_word = self._next(state)
        return state + ' ' + self.babble(amount - 1, next_word)

### Language Modelling II

In [8]:
from glob import glob
import re

files = glob('../Week 1 - Introduction/data/*.txt')
chain = MarkovChain()

for f in files:
    # Close each file after reading; keep only word characters, spaces and hyphens.
    with open(f) as fh:
        text = fh.read().lower()
    chain.learn(re.sub(r'[^\w \-]', '', text))

In [9]:
chain.babble(50)

'    text-proofing  disbelief   a label to build a specific problems reasoning using computer vision and vicarious to neural networks were then using guidance space odyssey released as often with restricted blocks worlds with hand-written rules similar to roc curve this with our particular query '

```'  this measure of the 1940s alan turings proposal to is strongly np-hard and retrieval and job offers related to the ability to take one or other terms is an algorithm basic techniques should artificial neuronsthe field of artificial beings head and applications include swarm intelligence ai effecthigh-profile examples of'```

### Further Expansion

- Can use any $n$-gram: trigrams, tetragrams (4-grams), etc.
    - The longer the $n$-gram, the less likely it is to reappear in test data.
- Use character grams to capture spelling variations.
    - Very effective in stylometry.
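
To see why character $n$-grams help with spelling variation, compare a word with a misspelling of it. This is a small sketch: the `_` padding, the trigram size, and the Jaccard overlap are choices made for the example, not a fixed recipe.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word, padded with '_' at both ends."""
    padded = "_" + word + "_"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)


right = char_ngrams("occurrence")
wrong = char_ngrams("occurence")

# As whole words the two strings share nothing; as trigram sets they overlap heavily.
overlap = jaccard(right, wrong)
print(sorted(right & wrong))
print(round(overlap, 3))
```

The two spellings share 8 of their 11 distinct trigrams, so a model built on character grams still treats them as closely related.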

$n$-grams have proven very effective in many text mining applications (and are hard-to-beat baselines); however, they cannot capture long-range dependencies or intuitive relations.

## Distributed Semantics