# Text Feature Extraction

This notebook illustrates the translation of complex objects (plain text documents) into a set of features suitable for training a machine learning algorithm.

We will use the simplest such model, the **Bag of Words**, on the [Reuters 50](https://archive.ics.uci.edu/ml/datasets/Reuter_50_50) collection of texts.

## Preliminaries

#### Imports

In [2]:
import string
from collections import Counter
import os
import pickle

import numpy as np
import pandas as pd

from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer


#### First Time Use

This notebook downloads some data and generates files that are needed later.

Set the `first_time` flag below to `True` **once**; after that, the notebook runs faster
with the flag set to `False`.

In [3]:
first_time=True

#### Data Directories

In [5]:
raw_data_dir=r"../../raw/C50/C50train" # original data set used for training
test_dir    =r"../../raw/C50/C50test"  # original test data set
data_dir    =r"../../data/C50/"  # directory to save intermediate results
if not os.path.exists(data_dir):
    os.mkdir(data_dir)

## Data Set

Because documents can be large, we will keep only a list of their file names in memory.

We will try hard not to read them all into memory at the same time.

Each document is labeled by its author. We will use that label later in the course, but not today.
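
As a minimal sketch of the "one document at a time" idea, a generator can yield the text of each file lazily. The helper name `iter_texts` and the two temporary files are hypothetical stand-ins, not part of this notebook's pipeline.

```python
import os
import tempfile

def iter_texts(filenames):
    """Yield the text of one file at a time, so only one document is held in memory."""
    for filename in filenames:
        with open(filename) as handle:
            yield handle.read()

# Two small temporary files stand in for the corpus (illustration only).
tmp_dir = tempfile.mkdtemp()
paths = []
for name, text in [("a.txt", "first document"), ("b.txt", "second one")]:
    path = os.path.join(tmp_dir, name)
    with open(path, "w") as handle:
        handle.write(text)
    paths.append(path)

# Each text is processed and discarded before the next file is opened.
lengths = [len(text) for text in iter_texts(paths)]
print(lengths)
```

Only the list of paths lives in memory for the whole run, which mirrors keeping a data frame of file names rather than the documents themselves.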

In [6]:
def author_labels(directory):
    doc_labels=[]
    for author in os.listdir(directory):
        for filename in os.listdir(directory+"/"+author):
            filename=directory+"/"+author+"/"+filename
            doc_labels.append([filename,author])
    data=pd.DataFrame(doc_labels,columns=["filename","label"])
    return data

In [7]:
documents_filename=data_dir+"/C50_documents.csv"

if first_time:
    documents=author_labels(raw_data_dir)
    documents.to_csv(documents_filename,index_label="document_id")



documents=pd.read_csv(documents_filename,index_col="document_id")
documents.head()

FileNotFoundError: [Errno 2] No such file or directory: '../../raw/C50/C50train'

We also generate a separate list of documents that we will use for testing.

In [6]:
test_documents_filename=data_dir+"/C50_test_documents.csv"
if first_time:
    test_documents=author_labels(test_dir)
    test_documents.to_csv(test_documents_filename,index_label="document_id")
test_documents=pd.read_csv(test_documents_filename,index_col="document_id")
test_documents.head()

| document_id | filename | label |
|---|---|---|
| 0 | ../../raw/C50/C50test/AaronPressman/421829news... | AaronPressman |
| 1 | ../../raw/C50/C50test/AaronPressman/424074news... | AaronPressman |
| 2 | ../../raw/C50/C50test/AaronPressman/42764newsM... | AaronPressman |
| 3 | ../../raw/C50/C50test/AaronPressman/43033newsM... | AaronPressman |
| 4 | ../../raw/C50/C50test/AaronPressman/433558news... | AaronPressman |


## The Bag Of Words Document Model

One of the simplest models of a document is the **Bag of Words** model:

we **ignore word ordering** and represent a
document by the list of words that it contains.
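
A small illustration of what ignoring word order means, using two made-up sentences (they are not taken from the Reuters collection): both produce the same bag even though their meanings differ.

```python
from collections import Counter

# Two hypothetical sentences with the same words in a different order.
sentence_a = "the dog bit the man"
sentence_b = "the man bit the dog"

# A bag of words only records which words occur and how often.
bag_a = Counter(sentence_a.split())
bag_b = Counter(sentence_b.split())

print(bag_a == bag_b)
print(sorted(bag_a.items()))
```

The comparison shows the price of the model's simplicity: any information carried by word order is lost.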

We still have a lot of choices we could make:

* Does punctuation "*.*", "*?*" count as a word?
* Is "*New York*" one word or two? What about "*U.S.*"?
* Do "*car*" and "*cars*" count as different words?
* What about "*safe*" and "*safely*"?
* What do we do about misspelled words? Do we try to fix them?
* Do we consider different capitalizations of the same word:  "*Car*" versus "*car*"?
* Do we include high frequency, low information content words such as "*a*" and "*the*" in our bag of words?

All these choices are **problem dependent**. If we have a particular ML problem to solve, we will use our **domain** knowledge to make a decision in the context of that specific task.

Today we just illustrate how to put together a **data pre-processing** pipeline. 

The pipeline is designed in such a way that it will be easier later to change our answer to any of those questions.



## Document Preprocessing

For this exercise we will use defaults that are sensible and simple to implement:
* We remove punctuation.
* "*New York*" counts as two words but "*U.S.*" counts as one.
* "*car*" and "*cars*" are the same word; the same goes for "*safe*" and "*safely*".
* Misspelled words count as different words.
* We remove capitalization so "*Car*" and "*car*" count as the same word.
* We remove high-frequency words such as "*a*" and "*the*".

We will rely on Python's **NLTK** (Natural Language Toolkit) to perform tasks that require **domain** knowledge about
text processing:
* **tokenization**: breaking character streams into words
* **stemming**: normalizing words into their roots: "*cars*" -> "*car*", "*safely*" -> "*safe*"
* **stop word removal**: high frequency, low information words that we will consider just *noise* and ignore.
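
To make the three steps concrete before bringing in NLTK, here is a deliberately crude, standard-library-only sketch. Everything in it is a hypothetical stand-in: `TOY_STOP_WORDS` is far smaller than NLTK's list, `toy_stem` is naive suffix stripping and not the Porter algorithm, and the regular expression would split "*U.S.*" into two tokens, unlike the defaults chosen above.

```python
import re

# Hypothetical, tiny stop word list (NLTK's English list is much longer).
TOY_STOP_WORDS = {"a", "the", "is", "and"}

def toy_stem(word):
    """Crude suffix stripping for illustration; NOT the Porter algorithm."""
    for suffix in ("ly", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_pipeline(text):
    # Tokenization: keep runs of letters, which also drops punctuation.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Stemming: map each token to its (approximate) root.
    stems = [toy_stem(token) for token in tokens]
    # Stop word removal: discard high-frequency, low-information words.
    return [stem for stem in stems if stem not in TOY_STOP_WORDS]

result = toy_pipeline("The car is safe, and the cars drive safely.")
print(result)
```

Because each step is a separate function call, any one of them can be swapped for a better implementation without touching the others, which is the property the pipeline in this notebook aims for.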

### Linguistic Data required for Preprocessing

In [7]:
# This only needs to be run once, to get access to data used by nltk
if first_time:
    import nltk
    nltk.download('punkt') # for tokenizer
    nltk.download('stopwords') # for stop words

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\manel\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\manel\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


#### Punctuation

In [8]:
punctuation=list(string.punctuation)
punctuation[:10]

['!', '"', '#', '$', '%', '&', "'", '(', ')', '*']

#### Stop Words

In [9]:
stop=stopwords.words("english")+punctuation+['``',"''"]
stop[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

#### Stemmer

A **Stemmer** normalizes different morphological forms of the same **root word** (stem).

We use one provided by NLTK; there are quite a few others.

In [10]:
porter_stemmer = PorterStemmer()

In [11]:
words=["car","cars","safe","safely","perform","performed","study","studies"]
for word in words:
    print(word,porter_stemmer.stem(word))

car car
cars car
safe safe
safely safe
perform perform
performed perform
study studi
studies studi


### From character stream to stem stream

We read a file from disk line by line and:
* Concatenate the lines together
* Remove the newline character from the end of each line
* Replace apostrophes with spaces

We end up with one long stream of characters.

In [12]:
filename=documents["filename"][0]
print("reading", filename,"...")
with open(filename) as file: # the context manager closes the file for us
    lines=file.readlines()
char_stream=" ".join(lines).replace("\n"," ").replace("'"," ")
char_stream[:100]+"..."

reading ../../raw/C50/C50train/AaronPressman/106247newsML.txt ...


'The Internet may be overflowing with new technology but crime in cyberspace is still of the old-fash...'

We **lower case** the text and **tokenize** it (break the stream of characters into words).

In [13]:
raw_tokens=word_tokenize(char_stream.lower())
raw_tokens[:10]

['the',
 'internet',
 'may',
 'be',
 'overflowing',
 'with',
 'new',
 'technology',
 'but',
 'crime']

We now replace words with their stems to normalize word morphology.

In [14]:
stem_list=[porter_stemmer.stem(token) for token in raw_tokens]
stem_list[:10]

['the',
 'internet',
 'may',
 'be',
 'overflow',
 'with',
 'new',
 'technolog',
 'but',
 'crime']

We can write a function that goes from text to stems, for reuse later.

In [15]:
def stem_tokenizer(text):
    return [porter_stemmer.stem(token) for token in word_tokenize(text.lower().replace("'"," "))]

### Removing Stop Words

We remove non-informative (stop) words.

Note that the words "*the*", "*be*" and "*but*" have been filtered out of the stream.

In [16]:
used_list=[token for token in stem_list if token not in stop]
used_list[:10]

['internet',
 'may',
 'overflow',
 'new',
 'technolog',
 'crime',
 'cyberspac',
 'still',
 'old-fashion',
 'varieti']

Let's collect all the transformations together

In [17]:
def process_text(filename,stop):
    # read the whole file; the context manager closes it afterwards
    with open(filename) as file:
        lines=file.readlines()
    text=" ".join(lines).replace("\n"," ")
    stem_list=stem_tokenizer(text) # lower-cases, tokenizes and stems
    used_list=[token for token in stem_list if token not in stop]
    return used_list

Let's check that the function produces exactly the same result as the step-by-step processing above.

In [18]:
stems=process_text(filename,stop)
used_list==stems

True

## Document Similarity Definitions

Given a representation of a document as a stream of tokens (stems), we need to define the concept of a document distance metric.

In text processing the quantity usually described is the normalized similarity $0 \le s(t_1,t_2) \le 1$; the translation to a distance is simply

$$
    d(t_1,t_2) = 1-s(t_1,t_2)
$$

Again, many different definitions of similarity are possible; we will consider three here:
* Set intersection similarity.
* Vector of counts similarity.
* TF-IDF (Term Frequency, Inverse Document Frequency) similarity.

There are many more choices and, as usual, which one works best depends on the problem at hand.

### Set Intersection Similarity Measure

We can consider our **bag of words** as just the set of stems contained in the document.

Document similarity is then the normalized intersection of these two sets.

$$
    S_{\textrm{set}}(S_1,S_2)= \frac{|S_1 \cap S_2|}{\sqrt{|S_1|\cdot |S_2|}}
$$
where $|\cdot|$ is the set cardinality.

In words: the ratio of the number of words in common to the geometric mean of the two documents' vocabulary sizes.

This is the same as considering $S_i$ as a **one-hot-encoded** vector of words: each word in the vocabulary is a dimension, and a component of the vector is 1 if that word is present in the document, 0 otherwise.

With that interpretation
$$
    S_{\textrm{set}}(S_1,S_2)= \frac{S_1 \cdot S_2}{\sqrt{|S_1|\cdot |S_2|}}=\cos(S_1,S_2)
$$
where $\cdot$ is the regular scalar product of vectors and $|S_i| = S_i \cdot S_i$ is the squared Euclidean norm of the vector, which for a 0/1 vector is numerically identical to the set's cardinality.
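
The equivalence between the set form and the cosine of 0/1 vectors can be checked numerically. The sketch below uses two small made-up word sets (not taken from the corpus) and only the standard library.

```python
import math

# Two hypothetical bags of stems, viewed as sets.
set1 = {"internet", "crime", "fraud", "web"}
set2 = {"internet", "fraud", "bank"}

# Set form: size of the intersection over the geometric mean of the set sizes.
set_similarity = len(set1 & set2) / math.sqrt(len(set1) * len(set2))

# Vector form: one dimension per vocabulary word, 1 if present, 0 otherwise.
vocabulary = sorted(set1 | set2)
vec1 = [1 if word in set1 else 0 for word in vocabulary]
vec2 = [1 if word in set2 else 0 for word in vocabulary]

dot = sum(a * b for a, b in zip(vec1, vec2))
norm1 = math.sqrt(sum(a * a for a in vec1))
norm2 = math.sqrt(sum(b * b for b in vec2))
cosine = dot / (norm1 * norm2)

print(set_similarity, cosine)
```

The scalar product counts the shared words, and each squared norm counts the words in one set, so both expressions reduce to the same number.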

Python has a `set` class that allows us to implement $S_{\textrm{set}}$ efficiently.

In [19]:
stems=set(used_list)
for word in ["week","internet","hola","Aristotle"]:
     print(word,",",word in stems)


week , False
internet , True
hola , False
Aristotle , False


In [20]:
# convenience function
def text_2_set(filename,stop):
    stems=process_text(filename,stop)
    return set(stems)

We create a few document word sets to test the similarity measure

In [21]:
# we save the document index for reuse later
document1=0 
document2=1
document3=105


In [22]:
filename1=documents["filename"][document1]
filename2=documents["filename"][document2]
filename3=documents["filename"][document3] # this will be from a different author
print("processing documents:")
print("\t",filename1)
print("\t",filename2)
print("\t",filename3)

set1=text_2_set(filename1,stop)
set2=text_2_set(filename2,stop)
set3=text_2_set(filename3,stop)

processing documents:
	 ../../raw/C50/C50train/AaronPressman/106247newsML.txt
	 ../../raw/C50/C50train/AaronPressman/120600newsML.txt
	 ../../raw/C50/C50train/AlexanderSmith/134595newsML.txt


This is the implementation of set similarity

In [23]:
def product_set(set1,set2):
   return  len(set1.intersection(set2))/np.sqrt(len(set1)*len(set2))

A set has a similarity of 1 to itself.

In [24]:
product_set(set1,set1)

1.0

Now we can compare how similar different documents are to each other

In [25]:
product_set(set1,set2),product_set(set1,set3),product_set(set2,set3)

(0.19230769230769232, 0.15555760103384325, 0.11045510132580585)

Note how the two documents from the same author are more similar to each other than either is to the document from a second author.

### Word Count Similarity Measure

We can include a bit more information in the **Bag of Words** model by keeping track of how many times each word appears in a document:
$$
    S_{\textrm{count}}(C_1,C_2) = \frac{C_1 \cdot C_2}{\sqrt{|C_1|\cdot |C_2|}}=\cos(C_1,C_2)
$$
This is the same cosine similarity used in the set case, but applied to a **count feature vector**: rather than being only 0 or 1, each dimension $w$ of $C_i$ contains the number of times word $w$ of the vocabulary appears in document $i$. Here $|C_i| = C_i \cdot C_i$ is the squared Euclidean norm of the count vector.
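
A worked example on two tiny, made-up count vectors may help check the formula by hand. The counts below are hypothetical and chosen so the arithmetic is easy to follow.

```python
import math
from collections import Counter

# Hypothetical word counts for two short documents.
count_a = Counter({"internet": 2, "fraud": 1})
count_b = Counter({"internet": 1, "bank": 2})

# Scalar product: a Counter returns 0 for words it does not contain.
dot = sum(count_a[word] * count_b[word] for word in count_a)   # 2*1 + 1*0 = 2

# Squared Euclidean norms of the two count vectors.
norm_a_sq = sum(value ** 2 for value in count_a.values())      # 4 + 1 = 5
norm_b_sq = sum(value ** 2 for value in count_b.values())      # 1 + 4 = 5

similarity = dot / math.sqrt(norm_a_sq * norm_b_sq)            # 2 / 5
print(similarity)
```

Only "*internet*" is shared, so it alone contributes to the numerator, while every word contributes to the norms in the denominator.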

Python's `collections` module has a `Counter` class that computes the counts for us.

In [26]:
counts=Counter(used_list)
counts.most_common(10)

[('internet', 9),
 ('consum', 6),
 ('leagu', 6),
 ('scam', 6),
 ('site', 6),
 ('said', 4),
 ('fraud', 4),
 ('may', 3),
 ('investor', 3),
 ('web', 3)]

In [27]:
# convenience function again
def text_2_counts(filename,stop):
    stems=process_text(filename,stop)
    return Counter(stems)

Again, we build a few test counts.

In [28]:
count1=text_2_counts(filename1,stop)
count2=text_2_counts(filename2,stop)
count3=text_2_counts(filename3,stop)

A simple-minded implementation of the count similarity:

In [29]:
def product_count(count1,count2):
    sum1=0.0
    sum_cross=0.0
    for key in count1: # key will be each word on the document represented by count1
        sum1+=count1[key]**2
        sum_cross+=count1[key]*count2[key]
    sum2=0.0
    for key in count2:
        sum2+=count2[key]**2
    return sum_cross/np.sqrt(sum1*sum2)


Again, a document has similarity of 1 to itself

In [30]:
product_count(count1,count1)

1.0

In [31]:
print(product_count(count1,count2),product_count(count1,count3),product_count(count2,count3))

0.244075282625 0.0970998008041 0.119030339834


Again, documents from the same author are closer to each other. But now the differences look more pronounced.

### TF-IDF Document Similarity Measure

See the [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) Wikipedia page for the definition and a long list of alternative weighting schemes.
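
As a rough sketch of the idea, the code below computes one common textbook variant, raw term count times $\log(N/\textrm{df})$, on a three-document toy corpus. The corpus and the helper name `tf_idf` are hypothetical, and this is only one of the many variants listed on that page; library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed formula by default, so their values will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is already a list of stems.
corpus = [
    ["internet", "fraud", "internet"],
    ["bank", "fraud"],
    ["internet", "bank", "web"],
]
n_documents = len(corpus)

# Document frequency: in how many documents does each word appear?
document_frequency = Counter()
for document in corpus:
    document_frequency.update(set(document))

def tf_idf(document):
    """Weight each word by (count in document) * log(N / document frequency)."""
    counts = Counter(document)
    return {
        word: count * math.log(n_documents / document_frequency[word])
        for word, count in counts.items()
    }

weights = tf_idf(corpus[0])
print(weights)
```

Words that appear in many documents get a small inverse document frequency, so they contribute less to the similarity than rare, more distinctive words.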

In [32]:
# This function counts, for each word, the number of documents in the corpus that contain it
def corpus_word_counts(documents,stop):
    counts=Counter()
    for filename in documents["filename"]:   
        print("processing...",filename)
        bag=text_2_set(filename,stop)
        for word in bag:
            counts[word]+=1
    return pd.DataFrame.from_dict(counts,orient="index")

In [33]:
word_counts_filename=data_dir+"corpus_word_counts.csv"

if first_time:
    word_counts=corpus_word_counts(documents,stop)
    word_counts = word_counts.rename(columns={'index':'word', 0:'count'})
    word_counts.to_csv(word_counts_filename,index_label="word")

word_counts=pd.read_csv(word_counts_filename,index_col="word")
word_counts.describe()

processing... ../../raw/C50/C50train/AaronPressman/106247newsML.txt
processing... ../../raw/C50/C50train/AaronPressman/120600newsML.txt
processing... ../../raw/C50/C50train/AaronPressman/120683newsML.txt
processing... ../../raw/C50/C50train/AaronPressman/136958newsML.txt
processing... ../../raw/C50/C50train/AaronPressman/137498newsML.txt
processing... ../../raw/C50/C50train/AaronPressman/14014newsML.txt
processing... ../../raw/C50/C50train/AaronPressman/156814newsML.txt
processing... ../../raw/C50/C50train/AaronPressman/182596newsML.txt
processing... ../../raw/C50/C50train/AaronPressman/186392newsML.txt
processing... ../../raw/C50/C50train/AaronPressman/193495newsML.txt
processing... ../../raw/C50/C50train/AaronPressman/196805newsML.txt
processing... ../../raw/C50/C50train/AaronPressman/197734newsML.txt
processing... ../../raw/C50/C50train/AaronPressman/206838newsML.txt
processing... ../../raw/C50/C50train/AaronPressman/231479newsML.txt
processing... ../../raw/C50/C50train/AaronPressma

processing... ../../raw/C50/C50train/ToddNissen/146468newsML.txt
processing... ../../raw/C50/C50train/ToddNissen/146741newsML.txt
processing... ../../raw/C50/C50train/ToddNissen/148408newsML.txt
processing... ../../raw/C50/C50train/ToddNissen/149287newsML.txt
processing... ../../raw/C

| | count |
|---|---|
| count | 28060.0 |
| mean | 17.699073 |
| std | 77.33924 |
| min | 1.0 |
| 25% | 1.0 |
| 50% | 2.0 |
| 75% | 6.0 |
| max | 2482.0 |


In [34]:
# Vocabulary size
V=len(word_counts)
print("Vocabulary Size",V)
# Corpus size
C=len(documents)
print("Corpus Size",C)

Vocabulary Size 28060
Corpus Size 2500


In [35]:
idfs=np.log((1+C)/(1+word_counts))+1
idfs.head()

| word | count |
|---|---|
| enforc | 5.040256 |
| equip | 3.972416 |
| third | 2.894857 |
| memori | 5.135566 |
| 100000 | 4.520381 |
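The expression `np.log((1+C)/(1+word_counts))+1` is a smoothed inverse document frequency: the `1+` terms avoid division by zero, and the trailing `+1` keeps the weight of very common words from dropping to zero. As a minimal sketch, assuming a hypothetical toy corpus of 4 documents and treating each count as the number of documents that contain the word (`smoothed_idf` and `toy_document_frequencies` are illustrative names, not part of the notebook):

```python
import math

def smoothed_idf(n_documents, document_frequency):
    """Smoothed inverse document frequency: ln((1 + C) / (1 + df)) + 1."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1

# Hypothetical toy corpus of 4 documents; values are made-up document frequencies.
toy_document_frequencies = {"the": 4, "market": 2, "overflow": 1}
toy_idfs = {word: smoothed_idf(4, df) for word, df in toy_document_frequencies.items()}

for word, idf in toy_idfs.items():
    print(f"{word:10} {idf:.4f}")
```

A word that appears in every document gets a weight of exactly 1, and rarer words get progressively larger weights.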


In [36]:
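def product_tfidf(count1,count2,idfs):
    """Cosine similarity of two word-count tables, with each count weighted by its IDF."""
    sum1=0.0       # squared norm of the first weighted vector
    sum_cross=0.0  # scalar product of the two weighted vectors
    for key in count1:
        if key not in idfs.index:
            idf=0  # words outside the vocabulary get zero weight
            print(f"key {key} not found")
        else:
            idf=idfs.loc[key]["count"]
        w1=idf*count1[key]
        w2=idf*count2[key]  # relies on Counter semantics: a missing word counts as 0
        sum1+=(w1)**2
        sum_cross+=w1*w2
    sum2=0.0       # squared norm of the second weighted vector
    for key in count2:
        if key not in idfs.index:
            idf=0
            print(f"key {key} not found")
        else:
            idf=idfs.loc[key]["count"]
        w2=idf*count2[key]
        sum2+=w2**2
    return sum_cross/np.sqrt(sum1*sum2)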

In [37]:
product_tfidf(count1,count1,idfs)

1.0

In [38]:
product_tfidf(count1,count2,idfs),product_tfidf(count1,count3,idfs),product_tfidf(count2,count3,idfs)

(0.098638168899128104, 0.031445199375524771, 0.035286391159521348)
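The same TF-IDF weighted similarity can be written more compactly when the weights are held in a plain dictionary, which avoids a table lookup for every word. The sketch below is an illustrative alternative, not the notebook's own function: `tfidf_cosine`, `toy_idf` and the toy counters are hypothetical, and a dictionary of weights could presumably be obtained from the table with something like `idfs["count"].to_dict()`.

```python
import math
from collections import Counter

def tfidf_cosine(count1, count2, idf):
    """Cosine similarity of two word-count Counters under IDF weights.

    `idf` is a plain dict; words missing from it get weight 0.
    """
    def weighted(counts):
        return {w: idf.get(w, 0.0) * c for w, c in counts.items()}

    v1, v2 = weighted(count1), weighted(count2)
    cross = sum(x * v2.get(w, 0.0) for w, x in v1.items())
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    return cross / (norm1 * norm2)

# Hypothetical toy data, not taken from the Reuters corpus.
toy_idf = {"stock": 1.0, "bank": 2.0, "rate": 1.5}
a = Counter({"stock": 2, "bank": 1})
b = Counter({"bank": 1, "rate": 3})
c = Counter({"rate": 1})

print(tfidf_cosine(a, a, toy_idf))  # a document compared with itself
print(tfidf_cosine(a, b, toy_idf))  # documents sharing one word
print(tfidf_cosine(a, c, toy_idf))  # documents with no words in common
```

As in the cells above, a document is maximally similar to itself, and documents with no shared words have similarity zero.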

## Text Feature Extraction with `sklearn`

`sklearn` provides convenient methods to convert text into a set of features suitable for machine learning.


### Represent Text as Word Counts

We configure a `CountVectorizer` that will
* take a list of file names as input (`input="filename"`) and read each file to get its text
* process the text using `stem_tokenizer`
* remove the stop words in our `stop` list

It returns a matrix of word counts arranged as `(document index)` x `(word index)`.

Each word is mapped to its own column.

In [39]:
countVectorizer=CountVectorizer(input="filename",tokenizer=stem_tokenizer,stop_words=stop)

We have a count vectorizer, but we still need to learn a vocabulary by **fitting** it to a corpus of documents:

In [40]:
X=countVectorizer.fit_transform(documents["filename"])

In [41]:
print("Total Words",X.sum())

Total Words 799731


#### Inspecting the Document Representation

The `CountVectorizer` agrees with our `word_counts` table on *vocabulary size*:

In [42]:
len(countVectorizer.vocabulary_),len(word_counts)

(28060, 28060)

Each word in a document gets mapped to a vocabulary index:

In [43]:
words=list(countVectorizer.vocabulary_.keys())[:5] # only get the first 5 words
for word in words:
    print( word,countVectorizer.vocabulary_[word])

internet 15609
may 17800
overflow 20017
new 19054
technolog 25282


Let's print the word counts for a couple of documents:

In [44]:
print("doc","word"+" "*11,"dim","count",sep="\t")
for i1 in range(2):
    for word in words:
        document=X[i1]
        dimension=countVectorizer.vocabulary_[word]
        print(i1,f"{word:15}",dimension,document[0,dimension],sep="\t")

doc	word           	dim	count
0	internet       	15609	9
0	may            	17800	3
0	overflow       	20017	1
0	new            	19054	1
0	technolog      	25282	1
1	internet       	15609	5
1	may            	17800	0
1	overflow       	20017	0
1	new            	19054	3
1	technolog      	25282	1


So X is just a **matrix**:
* **rows** are **documents**
* **columns** are **words**, each word has its own column

There are 28k columns, but the representation is **sparse** to save memory: only non-zero entries are stored in `X`.
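The layout and the sparse storage are easier to see on a tiny example. This is a minimal sketch, assuming a hypothetical two-sentence corpus held in memory and `CountVectorizer`'s default tokenizer (no stemming, no stop-word list), rather than the notebook's file-based setup:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical in-memory corpus; the notebook reads files with input="filename" instead.
toy_corpus = ["the cat sat on the mat", "the dog sat"]
toy_vectorizer = CountVectorizer()  # default tokenizer, no stemming or stop words
toy_X = toy_vectorizer.fit_transform(toy_corpus)

print(type(toy_X).__name__, toy_X.shape)  # sparse matrix of shape (documents, words)
print("stored entries:", toy_X.nnz, "of", toy_X.shape[0] * toy_X.shape[1])
print(toy_X.toarray())                    # dense view, only sensible for tiny examples
```

Only the non-zero counts are stored, so memory use grows with the number of distinct words per document rather than with the vocabulary size.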

#### Comparing to the Hand-Written Word Count Representation

Let's compute the squared vector norms using numpy's usual linear algebra functions:

In [45]:
x1_sqr=np.dot(X[document1],X[document1].T)[0,0] # np.dot returns a 1x1 matrix, extract value
x2_sqr=np.dot(X[document2],X[document2].T)[0,0]
x3_sqr=np.dot(X[document3],X[document3].T)[0,0]

The normalized scalar product matches our `product_count` similarity measure:

In [46]:
np.dot(X[document1],X[document3].T)[0,0]/np.sqrt(x1_sqr*x3_sqr),product_count(count1,count3)

(0.097099800804124392, 0.097099800804124392)
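The normalized scalar product is the cosine similarity, which `sklearn` also provides directly for sparse rows. The following is a minimal sketch on a hypothetical toy corpus with the default tokenizer; it compares `cosine_similarity` with the same quantity computed by hand:

```python
import math
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus, not the Reuters documents.
toy_corpus = ["the cat sat on the mat", "the dog sat"]
toy_X = CountVectorizer().fit_transform(toy_corpus)

# cosine_similarity accepts sparse rows and returns a dense 2-D array
similarity = cosine_similarity(toy_X[0], toy_X[1])[0, 0]

# same quantity by hand: scalar product divided by the product of the norms
dot = toy_X[0].multiply(toy_X[1]).sum()
norm0 = math.sqrt(toy_X[0].multiply(toy_X[0]).sum())
norm1 = math.sqrt(toy_X[1].multiply(toy_X[1]).sum())
by_hand = dot / (norm0 * norm1)

print(similarity, by_hand)
```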

#### Representing new text

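We call `transform`, not `fit_transform`, so that the vocabulary learned on the training set is reused.

Words in new documents will get mapped to the columns learned on the training set.
New words not in the original vocabulary will be ignored.

If we called `fit_transform` instead, the vectorizer would learn a new vocabulary from the new documents, and the column indices would no longer match those of the training set.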
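A small sketch of this behaviour, assuming hypothetical in-memory sentences and the default tokenizer (all variable names here are illustrative and chosen not to clash with the notebook's):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy texts standing in for the training and test documents.
train_texts = ["stocks rose on strong earnings", "stocks fell on weak earnings"]
new_texts = ["bonds rose on strong demand"]

toy_vectorizer = CountVectorizer()
X_train_toy = toy_vectorizer.fit_transform(train_texts)  # learns the vocabulary
X_new_toy = toy_vectorizer.transform(new_texts)          # reuses it unchanged

print(X_train_toy.shape, X_new_toy.shape)     # same number of columns
print("bonds" in toy_vectorizer.vocabulary_)  # unseen word has no column
print(X_new_toy.sum())                        # only known words are counted

# Refitting on the new text learns a different, incompatible vocabulary.
refit = CountVectorizer().fit(new_texts)
print(refit.vocabulary_["strong"], toy_vectorizer.vocabulary_["strong"])
```

The last line shows the same word landing in different columns once the vocabulary is relearned, which is why features for new documents should come from `transform`.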

In [47]:
X_test=countVectorizer.transform(test_documents["filename"])

Let's check the word counts of a few new documents:

In [48]:
print("doc","word"+" "*11,"dim","count",sep="\t")
for i1 in range(2):
    for word in words:
        document=X_test[i1]
        dimension=countVectorizer.vocabulary_[word]
        print(i1,f"{word:15}",dimension,document[0,dimension],sep="\t")

doc	word           	dim	count
0	internet       	15609	0
0	may            	17800	1
0	overflow       	20017	0
0	new            	19054	3
0	technolog      	25282	0
1	internet       	15609	0
1	may            	17800	1
1	overflow       	20017	0
1	new            	19054	0
1	technolog      	25282	0


### Represent Text as a Set of Words

To represent text as a set of words (instead of a vector of counts) we pass the extra flag `binary=True` to
`CountVectorizer`, so that every non-zero count is reported as 1.

In [49]:
setVectorizer=CountVectorizer(input="filename",binary=True,tokenizer=stem_tokenizer,stop_words=stop)

In [50]:
X_set=setVectorizer.fit_transform(documents["filename"])

In [51]:
print("doc","word"+" "*11,"dim","count",sep="\t")
for i1 in range(2):
    for word in words:
        document=X_set[i1]
        dimension=setVectorizer.vocabulary_[word]
        print(i1,f"{word:15}",dimension,document[0,dimension],sep="\t")

doc	word           	dim	count
0	internet       	15609	1
0	may            	17800	1
0	overflow       	20017	1
0	new            	19054	1
0	technolog      	25282	1
1	internet       	15609	1
1	may            	17800	0
1	overflow       	20017	0
1	new            	19054	1
1	technolog      	25282	1


In [52]:
X_set_test=setVectorizer.transform(test_documents["filename"])
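The effect of `binary=True` can be checked on a tiny example. This is a minimal sketch on a hypothetical two-sentence corpus with the default tokenizer; `counts` and `presence` are illustrative names:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; "the" occurs twice in the first sentence.
toy_corpus = ["the cat sat on the mat", "the dog sat"]

counts = CountVectorizer().fit_transform(toy_corpus)
presence = CountVectorizer(binary=True).fit_transform(toy_corpus)

print(counts.toarray())    # word counts, including a 2 for the repeated word
print(presence.toarray())  # the same positions, clipped to 0 or 1
```

Both matrices share the same vocabulary and column order; only the stored values differ.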

### Represent Text as TF-IDF Weighted Counts

`sklearn` also has a vectorizer that weights columns by their inverse document frequencies.

In [53]:
tfidfVectorizer=TfidfVectorizer(input="filename",tokenizer=stem_tokenizer,stop_words=stop)

In [54]:
Xi=tfidfVectorizer.fit_transform(documents["filename"])

In [55]:
Xi_test=tfidfVectorizer.transform(test_documents["filename"])

The TF-IDF vectorizer already returns vectors normalized to unit length, so the plain scalar product matches our `product_tfidf` similarity measure:

In [56]:
np.dot(Xi[document1],Xi[document2].T)[0,0],product_tfidf(count1,count2,idfs)

(0.098638168899128187, 0.098638168899128104)
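Both properties, unit-length rows and the smoothed IDF weights, can be verified on a small example. This is a minimal sketch, assuming a hypothetical three-sentence corpus and `TfidfVectorizer`'s default settings:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not the Reuters documents.
toy_corpus = ["the cat sat on the mat", "the dog sat", "a dog barked"]
toy_tfidf = TfidfVectorizer()  # defaults: norm="l2", smooth_idf=True
toy_Xi = toy_tfidf.fit_transform(toy_corpus)

# Euclidean norm of every row
row_norms = np.sqrt(np.asarray(toy_Xi.multiply(toy_Xi).sum(axis=1))).ravel()
print(row_norms)

# "dog" occurs in 2 of the 3 documents
dog_idf = toy_tfidf.idf_[toy_tfidf.vocabulary_["dog"]]
print(dog_idf, np.log((1 + 3) / (1 + 2)) + 1)
```

The learned weight for "dog" agrees with the `log((1+C)/(1+count))+1` expression used for the hand-built `idfs` table.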

### Using Digrams as Features

Instead of using *word* counts as features, we can use counts of *pairs of consecutive words*.
These are called **digrams** (also known as bigrams).
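Forming digrams from a token list takes only a few lines. This is a minimal sketch using the standard library only; the helper name `digrams_of` is hypothetical, and the token list is assumed for illustration, chosen to reproduce the first digrams printed further below:

```python
from collections import Counter

def digrams_of(tokens):
    """Return the list of consecutive token pairs, joined by a space."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

# Assumed token sequence (already stemmed, stop words removed).
toy_tokens = ["internet", "may", "overflow", "new", "technolog", "crime"]

toy_digrams = digrams_of(toy_tokens)
print(toy_digrams)
print(Counter(toy_digrams))  # digram counts play the role of word counts
```

A sequence of `n` tokens yields `n - 1` digrams, and because far more distinct pairs than distinct words occur in a corpus, the digram vocabulary is much larger than the word vocabulary.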


In [57]:
digramVectorizer=CountVectorizer(input="filename",tokenizer=stem_tokenizer,stop_words=stop,ngram_range=(2,2))

In [58]:
X_digram=digramVectorizer.fit_transform(documents["filename"])

In [59]:
X_digram_test=digramVectorizer.transform(test_documents["filename"])

In [60]:
digrams=list(digramVectorizer.vocabulary_.keys())[:5] # only get the first 5 digrams
for digram in digrams:
    print( digram,digramVectorizer.vocabulary_[digram])

internet may 182960
may overflow 215852
overflow new 246048
new technolog 233347
technolog crime 343655


In [61]:
V_digram=X_digram.shape[1]
V_digram

388209

In [62]:
print("doc","digram"+" "*11,"dim","count",sep="\t")
for i1 in range(2):
    for digram in digrams:
        document=X_digram[i1]
        dimension=digramVectorizer.vocabulary_[digram]
        print(i1,f"{digram:15}",dimension,document[0,dimension],sep="\t")

doc	digram           	dim	count
0	internet may   	182960	1
0	may overflow   	215852	1
0	overflow new   	246048	1
0	new technolog  	233347	1
0	technolog crime	343655	1
1	internet may   	182960	0
1	may overflow   	215852	0
1	overflow new   	246048	0
1	new technolog  	233347	1
1	technolog crime	343655	0


### Saving Trained models for Reuse

Training a text model (calling `fit_transform`) is slow because we need to process all the documents in the training corpus.

Sometimes it is convenient to save pre-trained models to disk for reuse later. We can also save the pre-processed test documents:

In [63]:
set_vectorizer_filename=   data_dir+"/set_vectorizer.p"
set_features_filename=     data_dir+"/set_features.p"
set_test_features_filename=data_dir+"/set_test_features.p"

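# use context managers so each file is flushed and closed after writing
with open(set_vectorizer_filename, "wb") as f:
    pickle.dump(setVectorizer, f)
with open(set_features_filename, "wb") as f:
    pickle.dump(X_set, f)
with open(set_test_features_filename, "wb") as f:
    pickle.dump(X_set_test, f)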


In [64]:
count_vectorizer_filename=   data_dir+"/count_vectorizer.p"
count_features_filename=     data_dir+"/count_features.p"
count_test_features_filename=data_dir+"/count_test_features.p"

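# use context managers so each file is flushed and closed after writing
with open(count_vectorizer_filename, "wb") as f:
    pickle.dump(countVectorizer, f)
with open(count_features_filename, "wb") as f:
    pickle.dump(X, f)
with open(count_test_features_filename, "wb") as f:
    pickle.dump(X_test, f)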


In [65]:
tfidf_vectorizer_filename=   data_dir+"/tfidf_vectorizer.p"
tfidf_features_filename=     data_dir+"/tfidf_features.p"
tfidf_test_features_filename=data_dir+"/tfidf_test_features.p"

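# use context managers so each file is flushed and closed after writing
with open(tfidf_vectorizer_filename, "wb") as f:
    pickle.dump(tfidfVectorizer, f)
with open(tfidf_features_filename, "wb") as f:
    pickle.dump(Xi, f)
with open(tfidf_test_features_filename, "wb") as f:
    pickle.dump(Xi_test, f)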


In [66]:
digram_vectorizer_filename=   data_dir+"/digram_vectorizer.p"
digram_features_filename=     data_dir+"/digram_features.p"
digram_test_features_filename=data_dir+"/digram_test_features.p"

# use context managers so each file is flushed and closed after writing
with open(digram_vectorizer_filename, "wb") as f:
    pickle.dump(digramVectorizer, f)
with open(digram_features_filename, "wb") as f:
    pickle.dump(X_digram, f)
with open(digram_test_features_filename, "wb") as f:
    pickle.dump(X_digram_test, f)
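
Reading a saved object back uses `pickle.load` on a file opened in binary mode. The following is a minimal round-trip sketch with a hypothetical stand-in object and a temporary directory, so it does not touch the files written above:

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model; any picklable object works the same way.
toy_model = {"vocabulary": {"internet": 0, "may": 1}, "binary": True}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.p")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(toy_model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == toy_model)
```

Since pickle stores functions by reference rather than by value, loading one of the vectorizers saved above in a new session will presumably require `stem_tokenizer` to be defined there before calling `pickle.load`.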