## Topics

* vectorizer in sklearn
* sentiment review classification with pytorch and sklearn

## Vectorizer

A vectorizer turns a text into a numerical representation. 

* the vocabulary is built: each word occupies its own dimension
* each text is transformed into a vector
* only those dimension of words that appear in the text get a value
* the value is the frequency of that word (count vectorizer)
* if the vocabulary is large, the vectors are sparse (many dimension with zeros)

we use sklearn

In [3]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

corpus = [
  'die Kiefer wird gefällt',
  'gefällt mir nicht dass die Kiefer gefällt wird',
  'mein Kiefer pocht',
  'jeder pocht darauf dass die Kiefer gefällt wird'
  ]

X = vectorizer.fit_transform(corpus)

# alphabetically ordered
vectorizer.get_feature_names() == (
    ['darauf','dass','die','gefällt','jeder','kiefer','mein','mir','nicht','pocht','wird'])

# the vectors

X.toarray()

array([[0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1],
       [0, 1, 1, 2, 0, 1, 0, 1, 1, 0, 1],
       [0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0],
       [1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1]], dtype=int64)

## tfidf representation (recap)

If we understand a document as a set of terms (lemmas of content words), then this is called the *bag-of-words* approach. Not all terms are equally important. Those that occur in all documents are not very meaningful. There is a central metric: term frequency inverse document frequency (tf-idf)

* term frequency  $tf $: how frequent is a term $t$ in a dokument $d$
* document frequency $df$: in how many dokuments is the term $t$ (given $n_d$ dokuments)


Inverse Document Frequency:
      $$idf(t)=log\frac{n_d}{df(t)}$$


some properties: $log(1) = 0$ since e.g. $2^0=1$

* $log(\frac{n_d}{df(t)})= log(n_d) - log(df(t))$


* full weight for words (terms), which appear in just a single document ($log~n_d-log~df(t)=log~n_d-log~1=log~n_d)$


* no weight for words (terms) to are in all documents($log~n_d-log ~df(t)=log~n_d-log~n_d= 0$)

In [4]:
from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer(smooth_idf=False)
tfidf = transformer.fit_transform(X)
tfidf.toarray()   

array([[0.        , 0.        , 0.52682017, 0.52682017, 0.        ,
        0.40912286, 0.        , 0.        , 0.        , 0.        ,
        0.52682017],
       [0.        , 0.33725422, 0.25649053, 0.51298106, 0.        ,
        0.19918778, 0.        , 0.47532067, 0.47532067, 0.        ,
        0.25649053],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.32340369, 0.77173641, 0.        , 0.        , 0.54757005,
        0.        ],
       [0.49653437, 0.35230598, 0.26793778, 0.26793778, 0.49653437,
        0.20807759, 0.        , 0.        , 0.        , 0.35230598,
        0.26793778]])

## Sentiment classification of reviews (yelp data)

* we use the perceptron to classify a review as positive (0) or negative (1)
* we use sklearn
* we use (our own) pytorch version (simplified)

  
 * we are better than sklearn (but it is faster)
 * the question is: are we significantly better?

In [6]:
import collections
import numpy as np
import pandas as pd
import re

from argparse import Namespace

In [7]:
# you should adopt the paths

args = Namespace(
    raw_train_dataset_csv="/home/klenner/applications/jupyter/PyTorchNLPBook-master/yelp/train.csv",
    raw_test_dataset_csv="/home/klenner/applications/jupyter/PyTorchNLPBook-master/yelp/test.csv",
    train_proportion=0.7,
    val_proportion=0.3,
    output_munged_csv="/home/klenner/applications/jupyter/PyTorchNLPBook-master/yelp/reviews_with_splits_full.csv",
    seed=1337
)

In [8]:
# Read raw data
train_reviews = pd.read_csv(args.raw_train_dataset_csv, header=None, names=['rating', 'review'])
train_reviews = train_reviews[~pd.isnull(train_reviews.review)]
test_reviews = pd.read_csv(args.raw_test_dataset_csv, header=None, names=['rating', 'review'])
test_reviews = test_reviews[~pd.isnull(test_reviews.review)]

In [9]:
train_reviews.head()

Unnamed: 0,rating,review
0,1,"Unfortunately, the frustration of being Dr. Go..."
1,2,Been going to Dr. Goldberg for over 10 years. ...
2,1,I don't know what Dr. Goldberg was like before...
3,1,I'm writing this review to give you a heads up...
4,2,All the food is great here. But the best thing...


In [10]:
# clean the data, e.g. punctuation encapsulated by blanks

def preprocess_text(text):
    if type(text) == float:
        print(text)
    text = text.lower()
    text = re.sub(r"([.,!?])", r" \1 ", text)
    text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)
    return text

### vectorize with sklearn

* create an array of arrays - the texts
* keep train data and test data distinct
* fit with the whole corpus to avoid oov (out of vocabulary) items
* transform both data set separately

In [12]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

corpus=[]
y=[]
i=1
e=6000
for x, row in train_reviews.iterrows():
    text=preprocess_text(row.review)
    corpus.append(text)
    if row.rating==1:
        y.append(1)
    else:
        y.append(0)
    i+=1
    if i == e:
        break
    
test_corpus=[]
test_y=[]

i=1
for x, row in test_reviews.iterrows():
    text=preprocess_text(row.review)
    test_corpus.append(text)
   # test_y.append(row.rating)
    if row.rating==1:
        test_y.append(1)
    else:
        test_y.append(0)
    i+=1
    if i == e:
        break
    
vectorizer.fit(corpus+test_corpus)
    
X=vectorizer.transform(corpus)
testX=vectorizer.transform(test_corpus)

# print the vectors
print(X.toarray())

# list the vocabulary
vectorizer.get_feature_names() 

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


['aa',
 'aaa',
 'aaaaaa',
 'aaaaaah',
 'aaaagh',
 'aaaand',
 'aaah',
 'aaahhh',
 'aabc',
 'aacywtm',
 'aah',
 'aahhh',
 'aahperd',
 'aamco',
 'aaren',
 'aaron',
 'aarp',
 'aas',
 'aasin',
 'ab',
 'aback',
 'abalone',
 'abandon',
 'abandoned',
 'abba',
 'abbey',
 'abbreviated',
 'abc',
 'abd',
 'abducted',
 'abercrombie',
 'abhor',
 'abhorred',
 'abhorrent',
 'abide',
 'abiding',
 'abilities',
 'ability',
 'abit',
 'abject',
 'ablaze',
 'able',
 'abnormally',
 'aboard',
 'abode',
 'abominable',
 'abomination',
 'aboration',
 'abosultely',
 'abound',
 'abounded',
 'abounds',
 'about',
 'above',
 'abp',
 'abra',
 'abrasion',
 'abrasive',
 'abroad',
 'abrupt',
 'abruptly',
 'abs',
 'abscessed',
 'absence',
 'absent',
 'absentee',
 'absently',
 'absinthe',
 'abslutely',
 'absofrigginamazeballs',
 'absoltely',
 'absolut',
 'absolute',
 'absolutely',
 'absolutley',
 'absolutly',
 'absolves',
 'absorb',
 'absorbed',
 'absorbent',
 'absorbers',
 'absorbing',
 'absorbs',
 'abstract',
 'absurd',


In [13]:
# tfidf fit and transfomation

from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer(smooth_idf=False)

transformer.fit(X+testX)

X=transformer.transform(X)

input_dim=len(X.toarray()[0])   # input layer of perceptron
output_dim=1

len(X.toarray()[0]),len(testX.toarray()[0])

(31307, 31307)

In [14]:
# define the model

import torch
import torch.nn as nn

class Net(nn.Module):   
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(input_dim,1)  
        
    def forward(self, x):
        x = torch.sigmoid(self.fc1(x))  
        return x
    
net=Net()    

print(list(net.parameters()))

[Parameter containing:
tensor([[-0.0040, -0.0048,  0.0031,  ...,  0.0038,  0.0049, -0.0043]],
       requires_grad=True), Parameter containing:
tensor([-0.0049], requires_grad=True)]


In [20]:
# train

import torch.optim as optim
from torch.autograd import Variable
import torch.nn.functional as F

optimizer = optim.Adam(net.parameters(), lr=0.05)

loss_func = nn.BCELoss()

for epoch in range(2):
    for text, label  in zip(X,y):
        optimizer.zero_grad()
        
        Y = Variable(torch.FloatTensor([label]), requires_grad=False)
        X2= torch.tensor(text.toarray(), dtype=torch.float32, requires_grad=True)

        outputs = net(X2)
        loss = loss_func(outputs, Y)
        loss.backward()
 #       print(outputs,Y,loss)
        optimizer.step()

# but: have we trained long enough?
# if we do it once again without reinitializing the model, we get improved performance!
# a third run throws us back: performance drops

In [None]:
test_corpus[19]

In [21]:
#Sample = vectorizer.transform([test_corpus[19]])
#Sample=transformer.transform(Sample).toarray()
#print(net(Variable(torch.Tensor(Sample))))

def step(x):  
    if x < 0.5:
        return 0
    else:
        return 1
   
i=0
tp=0
all=0

# do the prediction, evaluate
for text in testX:
    true = test_y[i]
    label=net(Variable(torch.Tensor(text.toarray())))
#    print(true,step(label),test_corpus[i])   
    i+=1
    if true==step(label):
        tp+=1

    all+=1

a=tp/all

print(a)

#print("\nthe gradients:",net.fc1.weight.grad[0][13000:], net.fc1.bias.grad)

0.8611435239206534


In [17]:
from sklearn.metrics import accuracy_score 

from sklearn.linear_model import Perceptron

clf = Perceptron(tol=1e-5, random_state=0)

clf.fit(X, y)

y_test_predict=clf.predict(testX)

#clf.coef_,clf.intercept_, clf.classes_
accuracy_score(test_y,y_test_predict)

0.8596432738789799