### Today, we're teaching computers how to read

Computers are great at crunching numbers. Crunching words, not so much. So today we're going to send our computer to school and teach it to read. How? By converting words to numbers.

In this tutorial, I'll cover the most basic parts of 'language processing'. There is a lot to language processing, or text analytics, and this is only a start, but you can do a lot with the things we cover here.

### What's a Corpus?

Let's start with a brief corpus of documents. A corpus is just a collection of documents.

In [2]:
docA = "the cat sat on my face"
docB = "the dog sat on my bed"

### Tokenizing

Most of the time when we work on text, we can use the 'bag of words' (BOW) model to represent a document. In the BOW model, each document can be thought of as a bag of words: we keep track of which words appear and how often, and ignore the order they came in.

In [3]:
bowA = docA.split(" ")
bowB = docB.split(" ")

In [4]:
bowB

['the', 'dog', 'sat', 'on', 'my', 'bed']
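Splitting on a single space works for our tidy examples, but real text comes with capital letters and punctuation. Here is a minimal sketch of a slightly more forgiving splitter, assuming we only care about lowercase runs of letters, digits and apostrophes. The `tokenize` helper and its regex are my own illustration, not a standard.

```python
import re

def tokenize(text):
    # Hypothetical helper: lowercase the text, then keep runs of
    # letters, digits and apostrophes as the tokens.
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize("The cat sat on my face!")
print(tokens)
```

On our two clean documents this gives the same result as `split(" ")`, so the rest of the tutorial sticks with the simpler version.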

Splitting a document up into its component words like this is called 'tokenizing'.
OK, so the documents are tokenized, but how do we convert a tokenized BOW into numbers?
There are a few strategies. One simple strategy is to create a vector of all possible words, and for each document count how many times each word appears.


In [5]:
wordSet = set(bowA).union(set(bowB))

In [6]:
#all words in all bags/documents
wordSet

{'bed', 'cat', 'dog', 'face', 'my', 'on', 'sat', 'the'}

In [7]:
#I'll create dictionaries to keep my word counts and set all the values of the items in it to 0
wordDictA = dict.fromkeys(wordSet, 0)
wordDictB = dict.fromkeys(wordSet, 0)

In [8]:
#This is what one of them looks like
wordDictA

{'bed': 0, 'cat': 0, 'dog': 0, 'face': 0, 'my': 0, 'on': 0, 'sat': 0, 'the': 0}

In [10]:
#Now I'll count the words in my bags
for word in bowA:
    wordDictA[word] += 1

for word in bowB:
    wordDictB[word] += 1

In [11]:
wordDictA

{'bed': 0, 'cat': 1, 'dog': 0, 'face': 1, 'my': 1, 'on': 1, 'sat': 1, 'the': 1}

In [12]:
#Lastly I'll stick those into a matrix
import pandas as pd
pd.DataFrame([wordDictA, wordDictB])

   bed  cat  dog  face  my  on  sat  the
0    0    1    0     1   1   1    1    1
1    1    0    1     0   1   1    1    1
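The counting steps above can also be written more compactly. This is a sketch using `collections.Counter` from the standard library in place of the hand-built dictionaries. The names `countsA`, `vocab` and `vectorA` are mine, and I sort the vocabulary only so that both vectors share a predictable column order.

```python
from collections import Counter

docA = "the cat sat on my face"
docB = "the dog sat on my bed"

# Counter does the tallying for us
countsA = Counter(docA.split(" "))
countsB = Counter(docB.split(" "))

# Shared, sorted vocabulary so both vectors use the same column order
vocab = sorted(set(countsA) | set(countsB))

# A Counter returns 0 for words it has never seen
vectorA = [countsA[word] for word in vocab]
vectorB = [countsB[word] for word in vocab]

print(vocab)
print(vectorA)
print(vectorB)
```

Each list is one row of the matrix, with one column per word in the vocabulary.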


Boom, we just converted words into a linear algebra problem! Computers can handle linear algebra, so mission accomplished.

### Not so fast

Mission almost accomplished. The problem with our counting strategy is that we use a lot of words that just don't mean much on their own. In fact, the most commonly used word in the English language, 'the', makes up about 7% of the words we use, roughly double the frequency of the next most popular word ('of'). Word frequencies in natural language follow a power-law distribution, an observation known as Zipf's law.

So, if we construct our document matrix out of counts, then we end up with numbers that don't contain much information, unless our goal was to see who uses 'the' most often.

### TF-IDF - A Better Strategy

Rather than just count, we can use the TF-IDF score of a word to rank its importance.

The TF-IDF score of a word 'w' in a document is **tf(w) * idf(w)**

where tf(w) = (Number of times the word appears in the document)/(Total number of words in the document)

and where idf(w) = log((Number of documents)/(Number of documents that contain w))
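To make the formulas concrete, here is a small worked example on our two documents. It is a sketch that assumes the natural logarithm; the short `tf` and `idf` helpers are my own and exist only for this illustration.

```python
import math

docA = "the cat sat on my face".split(" ")
docB = "the dog sat on my bed".split(" ")
docs = [docA, docB]

def tf(word, doc):
    # Share of the document's words that are `word`
    return doc.count(word) / len(doc)

def idf(word, docs):
    # Assumes `word` appears in at least one document,
    # otherwise this would divide by zero.
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)

tfidf_cat = tf("cat", docA) * idf("cat", docs)  # (1/6) * log(2/1)
tfidf_the = tf("the", docA) * idf("the", docs)  # (1/6) * log(2/2)

print(round(tfidf_cat, 6), tfidf_the)
```

'cat' appears in only one of the two documents, so it keeps a positive score. 'the' appears in both, so its idf is log(1) = 0 and its score vanishes.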

In [17]:
def computeTF(wordDict, bow):
    tfDict = {}
    bowCount = len(bow)
    for word, count in wordDict.items():
        tfDict[word] = count / float(bowCount)
    return tfDict

In [23]:
tfBowA = computeTF(wordDictA, bowA)
tfBowB = computeTF(wordDictB, bowB)

In [24]:
import math

def computeIDF(docList):
    #assumes every dict in docList has the same keys (the full wordSet)
    N = len(docList)

    #count the number of documents that contain each word w
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    for doc in docList:
        for word, val in doc.items():
            if val > 0:
                idfDict[word] += 1

    #divide N by the document count above and take the natural log of that
    for word, val in idfDict.items():
        idfDict[word] = math.log(N / float(val))
    return idfDict

In [25]:
idfs = computeIDF([wordDictA, wordDictB])

In [26]:
def computeTFIDF(tfBow, idfs):
    tfidf = {}
    for word, val in tfBow.items():
        tfidf[word] = val * idfs[word]
    return tfidf    


In [27]:
tfidfBowA = computeTFIDF(tfBowA, idfs)
tfidfBowB = computeTFIDF(tfBowB, idfs)

In [28]:
#Lastly, I'll stick those into a matrix
import pandas as pd
pd.DataFrame([tfidfBowA, tfidfBowB])

        bed       cat       dog      face   my   on  sat  the
0  0.000000  0.115525  0.000000  0.115525  0.0  0.0  0.0  0.0
1  0.115525  0.000000  0.115525  0.000000  0.0  0.0  0.0  0.0
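Notice what happened: every word that shows up in both documents ('the', 'sat', 'on', 'my') now scores 0, and only the words that tell the documents apart keep any weight. To put all the steps in one place, here is a compact sketch of the whole pipeline using only the standard library. The `tfidf_rows` helper is hypothetical and uses the same tf and idf definitions as above (natural log, no smoothing), so its numbers will not necessarily match what a library implementation gives you.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    # Hypothetical helper: `docs` is a list of strings. Returns one
    # {word: score} dict per document.
    bags = [doc.split(" ") for doc in docs]
    vocab = sorted(set(word for bag in bags for word in bag))
    n_docs = len(bags)

    # Number of documents that contain each word
    doc_freq = {w: sum(1 for bag in bags if w in bag) for w in vocab}

    rows = []
    for bag in bags:
        counts = Counter(bag)
        rows.append({
            w: (counts[w] / len(bag)) * math.log(n_docs / doc_freq[w])
            for w in vocab
        })
    return rows

rows = tfidf_rows(["the cat sat on my face", "the dog sat on my bed"])

# Show only the words that kept a non-zero score in the first document
print({w: round(s, 6) for w, s in rows[0].items() if s > 0})
```

Passing `rows` to `pd.DataFrame` should give the same kind of matrix as the one above.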
