# Text Feature Extraction Homework

In this exercise, we will implement another slight variation of the `tfidf` document distance definition, this time using **sublinear** term counts.

We will then compare it to the `sklearn` implementation.

## Preliminaries

#### Imports

In [1]:
import string
import os
import pickle
from collections import Counter 

import numpy as np
import pandas as pd

from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import TfidfVectorizer

import sys 
sys.path.append("../../")
from E4525_ML import text # you must have saved the file text.py into the E4525_ML directory

#### Data Directories

In [2]:
raw_data_dir=r"../../raw/C50/C50train" # original data set used for training
data_dir    =r"../../data/C50/"  # directory to save intermediate results

#### Convenience Function Definitions

A few functions carried over from the Text_Features notebook that we will need during this exercise.

In [3]:
def process_text(filename,stop): 
    # read the whole file; the context manager closes it even if reading fails
    with open(filename) as file:
        lines=file.readlines()
    text_str=" ".join(lines).replace("\n"," ").lower()
    stem_list=text.stem_tokenizer(text_str) # tokenize and stem
    used_list=[token for token in stem_list if token not in stop] # drop stop words
    return used_list

In [4]:
def text_2_set(filename,stop_words):
    stems=process_text(filename,stop_words)
    return set(stems)

In [5]:
def text_2_counts(filename,stop_words):
    stems=process_text(filename,stop_words)
    return Counter(stems)

In [6]:
def corpus_word_counts(documents,stop):
    counts=Counter()
    for filename in documents["filename"]:   
        print("processing...",filename)
        bag=text_2_set(filename,stop)
        for word in bag:
            counts[word]+=1
    return pd.DataFrame.from_dict(counts,orient="index")
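The difference between `text_2_set` and `text_2_counts` is what separates document frequencies from term frequencies. The sketch below illustrates it on a made-up, already tokenized corpus; `toy_docs` and the other names are hypothetical and not part of the notebook.

```python
from collections import Counter

# Hypothetical toy corpus: each document is already a list of tokens.
toy_docs = [
    ["market", "market", "share"],
    ["market", "profit"],
    ["share", "share", "share"],
]

# Term frequency: every occurrence of a word counts.
term_counts = Counter(token for doc in toy_docs for token in doc)

# Document frequency: each document contributes at most once per word,
# which is what iterating over set(doc) achieves.
doc_counts = Counter(token for doc in toy_docs for token in set(doc))

print("share:", term_counts["share"], "occurrences in", doc_counts["share"], "documents")
```

Because `corpus_word_counts` iterates over the set of stems of each file, it produces counts of the second kind.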

## Environment Preparation

<div class="alert alert-block alert-info"> Problem 0 </div>

1. Download the [Reuters 50](https://archive.ics.uci.edu/ml/datasets/Reuter_50_50) collection of texts. Save it in the `raw` data directory.

    You should end up with this directory structure:
    
    raw/
        C50/
            C50train/
            C50test/
            
1. Run the [Text Feature Extraction](./Text_Features.ipynb) notebook to completion. This will generate the document lists and word count statistics. Make sure to also run the sections that are meant to be run only once.
1. Save the text.py python module into the `E4525_ML` directory.

## Implement TF-IDF document Distance with Sublinear Growth 

<div class="alert alert-block alert-info"> Problem 1.1 </div>

Read the list of documents in the file `C50_documents.csv`  from the data directory `data_dir` into a `documents` variable

In [7]:
documents = pd.read_csv(data_dir + 'C50_documents.csv')
documents.head()

Unnamed: 0,document_id,filename,label
0,0,../../raw/C50/C50train/RobinSidel/147604newsML...,RobinSidel
1,1,../../raw/C50/C50train/RobinSidel/196812newsML...,RobinSidel
2,2,../../raw/C50/C50train/RobinSidel/219316newsML...,RobinSidel
3,3,../../raw/C50/C50train/RobinSidel/251225newsML...,RobinSidel
4,4,../../raw/C50/C50train/RobinSidel/177958newsML...,RobinSidel


<div class="alert alert-block alert-info"> Problem 1.2 </div>

Create a list of stop words by calling the function `text.stop_words` from the `E4525_ML.text` python module.

In [8]:
stopwords = text.stop_words()
stopwords

{'!',
 '#',
 '$',
 '%',
 '&',
 "'",
 "'d",
 "'ll",
 "'re",
 "'s",
 "'ve",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 '[',
 '\\',
 ']',
 '^',
 '_',
 '`',
 '``',
 'a',
 'about',
 'abov',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'ani',
 'are',
 'aren',
 'as',
 'at',
 'be',
 'becaus',
 'been',
 'befor',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'could',
 'couldn',
 'd',
 'did',
 'didn',
 'do',
 'doe',
 'doesn',
 'don',
 'down',
 'dure',
 'each',
 'few',
 'for',
 'from',
 'further',
 'ha',
 'had',
 'hadn',
 'hasn',
 'have',
 'haven',
 'he',
 'her',
 'here',
 'herself',
 'hi',
 'him',
 'himself',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 'it',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'might',
 'mightn',
 'more',
 'most',
 'must',
 'mustn',
 'my',
 'myself',
 "n't",
 'need',
 'needn',
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'onc',
 'onli',
 'or',
 'other',
 'our',
 

<div class="alert alert-block alert-info"> Problem 1.3 </div>

Using pandas, read the word count file generated by the Text_Features notebook.
The file is called "corpus_word_counts.csv". For each word it stores the number of documents in which the word appears (document frequencies).

In [9]:
corpus = pd.read_csv(data_dir + '/corpus_word_counts.csv', index_col='word')
corpus.head()

word,count
paid,169
declin,572
new,1472
difficulti,67
market,1473


<div class="alert alert-block alert-info"> Problem 1.4 </div>
Create a variable $V$ with the vocabulary size  and a variable named $C$ with the total number of documents

In [10]:
V = corpus.shape[0]
C = documents.shape[0]
print('the vocabulary size is: ' + str(V))
print('the total number of documents: ' + str(C))

the vocabulary size is: 28131
the total number of documents: 2500


<div class="alert alert-block alert-info"> Problem 1.5 </div>
Compute the smoothed inverse document counts, defined as
$$
    \textrm{idf}_i =  \log\left( \frac{1+C}{1+\textrm{n}_i}\right) + 1
$$

where $n_i$ is the number of documents in the corpus in which word $i$ appears.

In [11]:
idf = np.log((1+C)/(1+corpus)) + 1 # Using the dataframe directly
print(idf.head())

               count
word                
paid        3.688647
declin      2.473560
new         1.529390
difficulti  4.604938
market      1.528711
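As a sanity check, the formula can be evaluated by hand for single words. The sketch below uses only the standard library and takes the document counts and corpus size from the outputs shown above; the helper name `smoothed_idf` is hypothetical.

```python
import math

def smoothed_idf(n_docs_with_word, n_docs):
    """Smoothed idf: log((1 + C) / (1 + n_i)) + 1, using the natural log."""
    return math.log((1 + n_docs) / (1 + n_docs_with_word)) + 1

C_docs = 2500  # corpus size reported in Problem 1.4
for word, n_i in [("paid", 169), ("declin", 572), ("market", 1473)]:
    print(word, round(smoothed_idf(n_i, C_docs), 6))
```

The printed values should agree with the `idf` dataframe above. Note that a word appearing in every document gets an idf of exactly 1 rather than 0, which is the purpose of the trailing `+ 1`.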


<div class="alert alert-block alert-success"> We set up a few documents for comparison</div>

[HINT] Code below assumes that the variable `documents`  is the list of documents you read in problem 1.1

In [12]:
# document indexes we will use for comparison
document1=0 
document2=1
document3=105

# document filenames
filename1=documents["filename"][document1]
filename2=documents["filename"][document2]
filename3=documents["filename"][document3] # this will be from a different author

<div class="alert alert-block alert-info"> Problem 1.6 </div>
Compute the word counts for `document1`, `document2` and `document3`, using the `text_2_counts` function defined at the beginning of the notebook.

In [13]:
wc_d1 = text_2_counts(filename1,stopwords)
wc_d2 = text_2_counts(filename2,stopwords)
wc_d3 = text_2_counts(filename3,stopwords)
print(wc_d1)
print(wc_d2)
print(wc_d3)

Counter({'revco': 20, 'big': 20, 'b': 19, 'said': 11, 'share': 9, 'offer': 8, 'store': 8, 'compani': 8, 'inc.': 7, 'buy': 5, 'combin': 5, 'profit': 5, 'chain': 4, 'stock': 4, 'per': 4, "''": 4, 'sale': 4, 'oper': 4, 'drugstor': 3, 'month': 3, 'deal': 3, 'rite': 3, 'aid': 3, 'sign': 3, 'unit': 3, 'although': 3, 'trade': 3, 'new': 3, 'margin': 3, 'distribut': 3, 'centr': 3, 'monday': 2, 'agre': 2, 'region': 2, 'million': 2, 'transact': 2, '15': 2, 'reject': 2, 'last': 2, 'abl': 2, 'hoven': 2, 'drug': 2, 'board': 2, 'latest': 2, 'tender': 2, 'acquisit': 2, 'industri': 2, 'corp.': 2, 'agreement': 2, 'billion': 2, 'view': 2, 'cent': 2, 'financi': 2, 'togeth': 2, 'base': 2, 'potenti': 2, 'like': 2, 'manag': 2, 'care': 2, 'overlap': 2, 'georgia': 2, 'group': 2, 'year': 2, 'price': 2, 'invest': 2, 'strauss': 2, 'also': 2, 'two': 2, 'earlier': 2, 'eckerd': 2, 'southeastern': 2, 'state': 2, 'giant': 1, 'd.': 1, 'sweeten': 1, 'takeov': 1, 'valu': 1, '380': 1, 'call': 1, 'twinsburg': 1, 'ohio-bas'

### Classical Tf-Idf

The function below computes the normalized product (cosine similarity) of two `tfidf` vectors,
where each component of a `tfidf` vector is defined as
$$
    w_{k} = \textrm{idf}_k \cdot c_{k}
$$
and $c_{k}$ is the number of times that word $k$ appears in the document.

In [14]:
def product_tfidf(count1,count2,idfs):
    sum1=0.0
    sum_cross=0.0
    for key in count1:
        if key not in idfs.index:
            idf=0
            print(f"key {key} not found")
        else:
            idf=idfs.loc[key]["count"]
        w1=idf*count1[key]
        w2=idf*count2[key]
        sum1+=(w1)**2
        sum_cross+=w1*w2
    sum2=0.0
    for key in count2:
        if key not in idfs.index:
            idf=0
            print(f"key {key} not found")
        else:
            idf=idfs.loc[key]["count"]
        w2=idf*count2[key]
        sum2+=w2**2
    return sum_cross/np.sqrt(sum1*sum2)
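To see the normalized product at work without loading the corpus, here is a minimal self-contained sketch of the same computation. It assumes the idf values live in a plain dict rather than a dataframe, and the words and idf values are made up for illustration; `cosine_tfidf` is a hypothetical name.

```python
import math
from collections import Counter

def cosine_tfidf(count1, count2, idf):
    """Cosine similarity of the tf-idf vectors built from two Counters.

    `idf` is a plain dict in this sketch (an assumption); words missing
    from it get weight 0, mirroring the function above.
    """
    def weights(counts):
        return {w: idf.get(w, 0.0) * c for w, c in counts.items()}

    w1, w2 = weights(count1), weights(count2)
    cross = sum(w1[w] * w2.get(w, 0.0) for w in w1)
    norm1 = math.sqrt(sum(v * v for v in w1.values()))
    norm2 = math.sqrt(sum(v * v for v in w2.values()))
    return cross / (norm1 * norm2)

toy_idf = {"revco": 3.0, "store": 2.0, "bank": 2.5}  # made-up idf values
doc_a = Counter({"revco": 2, "store": 1})
doc_b = Counter({"store": 3, "bank": 1})

print(cosine_tfidf(doc_a, doc_a, toy_idf))  # a document compared with itself
print(cosine_tfidf(doc_a, doc_b, toy_idf))  # documents sharing only "store"
```

Only words present in both documents contribute to the cross term, while every word contributes to the norms, so documents with little shared vocabulary score close to 0.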

### Sub-Linear Tf-Idf

It seems unlikely that 20 occurrences of a term in a document truly carry $20\times$ the significance of a single occurrence. An alternative (see the [Information Retrieval book](https://nlp.stanford.edu/IR-book/html/htmledition/sublinear-tf-scaling-1.html)) is to use a function
that *dampens* the growth of the word counts.

<div class="alert alert-block alert-info"> Problem 1.6 </div>
Create a function named `sublinear_product_tfidf`.
It should compute the normalized product of `tfidf` vectors as above, but using a **sublinear** measure of the word counts, defined as:
\begin{align}
    w_k  &= \textrm{idf}_k \cdot (1+\log c_k)  &\textrm{if}\,\, c_k &>0 \\
    w_k  &= 0                    &\textrm{if}\,\, c_k &=0
\end{align}
where $c_k$ is the raw word count for word $k$.

[HINT] It is probably easiest to copy the function `product_tfidf` above and modify it slightly.
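To get a feel for how strongly this weighting dampens large counts, the short sketch below tabulates $1+\log c$ for a few raw counts. The helper name `sublinear_tf` is hypothetical, and the natural logarithm is assumed.

```python
import math

def sublinear_tf(c):
    """Return 1 + ln(c) for a positive count and 0 otherwise."""
    return 1.0 + math.log(c) if c > 0 else 0.0

for c in [0, 1, 2, 20, 200]:
    print(f"raw count {c:>3} -> sublinear weight {sublinear_tf(c):.3f}")
```

A word that appears 20 times ends up with roughly 4 times the weight of a word that appears once, instead of 20 times.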

In [15]:
def count(c): # sublinear weight of a raw word count: 1+log(c) if c>0, else 0
    if c>0:
        return 1+np.log(c)
    else:
        return 0
    
def product_sl_tfidf(count1,count2,idfs):
    sum1=0.0
    sum_cross=0.0
    for key in count1:
        if key not in idfs.index:
            idf=0
            print(f"key {key} not found")
        else:
            idf=idfs.loc[key]["count"]
        w1=idf*count(count1[key])
        w2=idf*count(count2[key])
        sum1+=(w1)**2
        sum_cross+=w1*w2
    sum2=0.0
    for key in count2:
        if key not in idfs.index:
            idf=0
            print(f"key {key} not found")
        else:
            idf=idfs.loc[key]["count"]
        w2=idf*count(count2[key])
        sum2+=w2**2
    return sum_cross/np.sqrt(sum1*sum2)

<div class="alert alert-block alert-info"> Problem 1.7 </div>
Compute the sublinear normalized product (similarity) for `document1` with itself, verify that the product is 1

In [16]:
product_sl_tfidf(wc_d1,wc_d1,idf) # compute the similarity of document1 with itself

1.0

<div class="alert alert-block alert-info"> Problem 1.8 </div>
Compute the sublinear normalized products between 
1. `document1` and `document2`
2. `document1` and `document3`
3. `document2` and `document3`

In [17]:
snp1 = product_sl_tfidf(wc_d1,wc_d2,idf) # words missing from one document have count 0, which count() maps to weight 0
snp2 = product_sl_tfidf(wc_d1,wc_d3,idf)
snp3 = product_sl_tfidf(wc_d2,wc_d3,idf)
print(snp1)
print(snp2)
print(snp3)

0.12787626323711282
0.04982868267860495
0.0352982411561606


## Comparison to  `sklearn`

<div class="alert alert-block alert-info"> Problem 2.1 </div>
Store the function `text.stem_tokenizer` from the module `text.py` in a variable named `tokenizer`.

In [18]:
tokenizer = text.stem_tokenizer # keep a reference to the function itself; it is not called here

<div class="alert alert-block alert-info"> Problem 2.2 </div>

Set up an instance of [`sklearn.TfidfVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) so that it generates `tfidf` vectors using sublinear growth.

[Hint] 
1. Read carefully the long list of options on the constructor of `TfidfVectorizer`
2. Do not forget to set the `input`, `tokenizer` and `stop_words` arguments.
    

In [19]:
TfidV = TfidfVectorizer(input = 'filename', tokenizer = tokenizer, stop_words = stopwords, sublinear_tf = True)
# sublinear_tf=True makes the vectorizer replace each term count c with 1+log(c)

<div class="alert alert-block alert-info"> Problem 2.3 </div>
Generate the matrix $X$ of `tfidf` representations for each document in our corpus (this may take a bit of time)

In [20]:
X = TfidV.fit_transform(documents['filename']) # learn vocabulary and idf from the files, then return their tfidf matrix



<div class="alert alert-block alert-info"> Problem 2.4 </div>
Compute the dot product between `document1` and `document2` using their vector (`X`) representation. 

Compare to the result produced by the `sublinear_product_tfidf`
function you just wrote. They should be nearly identical.

In [21]:
# both print(X[document1].shape), print(X[document2].shape) give (1, 28131)
print((X[document1] @ X[document2].T)[0,0]) # sparse matrix product; [0,0] extracts the single entry of the 1x1 result
print(product_sl_tfidf(wc_d1,wc_d2,idf))

0.1278762632371129
0.12787626323711282
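The plain dot product agrees with the normalized product because the vectorizer normalizes every row to unit length (the printed vectorizer settings further below show `norm='l2'`). The minimal sketch below checks this identity on two made-up dense vectors, using only the standard library; the helper names are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector so that its Euclidean norm is 1."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

u = [3.0, 0.0, 4.0]  # made-up weight vectors
v = [1.0, 2.0, 2.0]

# Cosine similarity computed directly from the raw vectors...
cosine = dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))
# ...and as a plain dot product of the pre-normalized vectors.
normalized_dot = dot(l2_normalize(u), l2_normalize(v))

print(cosine, normalized_dot)
```

The two numbers coincide up to floating point rounding, which is also why the `sklearn` result and the hand-written function differ only in the last digits.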


### Saving Trained models for Reuse

<div class="alert alert-block alert-info"> Problem 3.1 </div>
In the data directory `data_dir`:
1. Save vectorizer to a `pickle` called "tfidf_sublinear_vectorizer.p"
2. Save sublinear `tfidf1` features to a file called "tfidf_sublinear_features.p"

In [26]:
# 1. Save vectorizer to a `pickle` called "tfidf_sublinear_vectorizer.p"
with open(os.path.join(data_dir, 'tfidf_sublinear_vectorizer.p'), 'wb') as f:
    pickle.dump(TfidV, f) # the context manager closes the file after writing

# 2. Save sublinear `tfidf1` features to a file called "tfidf_sublinear_features.p"
with open(os.path.join(data_dir, 'tfidf_sublinear_features.p'), 'wb') as f:
    pickle.dump(X, f)

<div class="alert alert-block alert-info"> Problem 3.2 </div>
Make sure you can read those files again

In [27]:
import os
import pickle

with open(os.path.join(data_dir, 'tfidf_sublinear_vectorizer.p'), 'rb') as f:
    vectorizer = pickle.load(f)
with open(os.path.join(data_dir, 'tfidf_sublinear_features.p'), 'rb') as f:
    tfidf1 = pickle.load(f)
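The save and reload steps can be rehearsed without touching the real data directory. The sketch below round-trips a stand-in object through a temporary directory; `payload` and the file name `example.p` are hypothetical placeholders for the fitted vectorizer or the feature matrix.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the vectorizer or the matrix X.
payload = {"sublinear_tf": True, "vocabulary_size": 28131}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.p")  # hypothetical file name
    with open(path, "wb") as f:   # binary mode is required for pickle
        pickle.dump(payload, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == payload)
```

One caveat worth keeping in mind: pickle stores functions by reference, so reloading the real vectorizer in a fresh session requires that the module defining its tokenizer be importable.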

In [29]:
print(vectorizer)
tfidf1

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='filename', lowercase=True, max_df=1.0, max_features=None,
                min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True,
                stop_words={'!', '#', '$', '%', '&', "'", "'d", "'ll", "'re",
                            "'s", "'ve", '(', ')', '*', '+', ',', '-', '.', '/',
                            ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']',
                            '^', ...},
                strip_accents=None, sublinear_tf=True,
                token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=<function stem_tokenizer at 0x1a1af67d90>,
                use_idf=True, vocabulary=None)


<2500x28131 sparse matrix of type '<class 'numpy.float64'>'
	with 483542 stored elements in Compressed Sparse Row format>