## Topics

* vectorizers in sklearn
* sentiment classification of reviews with PyTorch and sklearn

## Vectorizer

A vectorizer turns a text into a numerical representation. 

* the vocabulary is built: each word occupies its own dimension
* each text is transformed into a vector
* only the dimensions of words that appear in the text get a non-zero value
* the value is the frequency of that word in the text (count vectorizer)
* if the vocabulary is large, the vectors are sparse (many dimensions with zeros)
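
The steps above can be sketched in a few lines of plain Python. This is only an illustration: the function names `build_vocabulary` and `count_vector` and the toy texts are made up for this example, and tokenization is reduced to lowercasing and splitting on whitespace, which is simpler than what sklearn does.

```python
from collections import Counter

def build_vocabulary(texts):
    """Map each distinct word to its own dimension (alphabetical order)."""
    words = sorted({word for text in texts for word in text.lower().split()})
    return {word: index for index, word in enumerate(words)}

def count_vector(text, vocab):
    """Return the word counts of `text`, one entry per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts.get(word, 0) for word in vocab]  # dicts keep insertion order

texts = ["the cat sat", "the cat saw the dog"]
vocab = build_vocabulary(texts)
vectors = [count_vector(text, vocab) for text in texts]

print(list(vocab))   # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)       # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```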

We use sklearn.

In [1]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

In [2]:
vectorizer = CountVectorizer()

corpus = [
  'die Kiefer wird gefällt',
  'gefällt mir nicht dass die Kiefer gefällt wird',
  'mein Kiefer pocht',
  'jeder pocht darauf dass die Kiefer gefällt wird'
  ]

X = vectorizer.fit_transform(corpus)

# alphabetically ordered
# older sklearn versions: vectorizer.get_feature_names()
list(vectorizer.get_feature_names_out()) == (
    ['darauf','dass','die','gefällt','jeder','kiefer','mein','mir','nicht','pocht','wird'])

# the vectors

X.toarray()

array([[0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1],
       [0, 1, 1, 2, 0, 1, 0, 1, 1, 0, 1],
       [0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0],
       [1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1]])
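
`fit_transform` does not return a dense array but a SciPy sparse matrix that stores only the non-zero entries; `toarray()` is used above for display only. A short sketch of how one might inspect this for the same corpus (the variable names `counts` and `density` are chosen here for illustration):

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'die Kiefer wird gefällt',
    'gefällt mir nicht dass die Kiefer gefällt wird',
    'mein Kiefer pocht',
    'jeder pocht darauf dass die Kiefer gefällt wird',
]

counts = CountVectorizer().fit_transform(corpus)

print(issparse(counts))   # True
print(counts.shape)       # (4, 11): 4 texts, 11 vocabulary entries
print(counts.nnz)         # 22 stored (non-zero) entries

# share of non-zero cells; it drops quickly as the vocabulary grows
density = counts.nnz / (counts.shape[0] * counts.shape[1])
print(density)            # 0.5
```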

## tf-idf representation (recap)

If we treat a document as a collection of terms (lemmas of content words) and ignore word order, this is called the *bag-of-words* approach. Not all terms are equally important: those that occur in all documents are not very informative. The central metric is term frequency inverse document frequency (tf-idf).

* term frequency $tf$: how frequent is a term $t$ in a document $d$
* document frequency $df$: in how many documents does the term $t$ occur (given $n_d$ documents)

Inverse document frequency:
      $$idf(t)=\log\frac{n_d}{df(t)}$$

Some properties (recall that $\log(1) = 0$, since e.g. $2^0=1$):

* $\log\frac{n_d}{df(t)} = \log(n_d) - \log(df(t))$
* full weight for terms that appear in just a single document: $\log n_d-\log df(t)=\log n_d-\log 1=\log n_d$
* no weight for terms that appear in all documents: $\log n_d-\log df(t)=\log n_d-\log n_d= 0$
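
The two extreme cases can be checked numerically. The sketch below uses the natural logarithm and the small corpus from the vectorizer example, where $n_d = 4$; the helper name `idf` is chosen here for illustration.

```python
import math

def idf(n_docs, doc_freq):
    """Inverse document frequency: log(n_d / df(t))."""
    return math.log(n_docs / doc_freq)

n_d = 4
print(idf(n_d, 1))   # term in a single document, e.g. 'mein': log(4) ≈ 1.386
print(idf(n_d, 3))   # term in three documents, e.g. 'die': ≈ 0.288
print(idf(n_d, 4))   # term in all documents, e.g. 'kiefer': 0.0
```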

In [3]:
transformer = TfidfTransformer(smooth_idf=False)
tfidf = transformer.fit_transform(X)
tfidf.toarray()   

array([[0.        , 0.        , 0.52682017, 0.52682017, 0.        ,
        0.40912286, 0.        , 0.        , 0.        , 0.        ,
        0.52682017],
       [0.        , 0.33725422, 0.25649053, 0.51298106, 0.        ,
        0.19918778, 0.        , 0.47532067, 0.47532067, 0.        ,
        0.25649053],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.32340369, 0.77173641, 0.        , 0.        , 0.54757005,
        0.        ],
       [0.49653437, 0.35230598, 0.26793778, 0.26793778, 0.49653437,
        0.20807759, 0.        , 0.        , 0.        , 0.35230598,
        0.26793778]])
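
Note that 'kiefer', which occurs in all four documents, still gets a non-zero weight in this output. According to the sklearn documentation, `TfidfTransformer` with `smooth_idf=False` computes $idf(t)=\ln\frac{n_d}{df(t)}+1$ and then scales each row to unit Euclidean length (`norm='l2'` is the default). The following sketch recomputes the matrix by hand under that assumption; `manual_tfidf` and the other variable names are chosen here for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = [
    'die Kiefer wird gefällt',
    'gefällt mir nicht dass die Kiefer gefällt wird',
    'mein Kiefer pocht',
    'jeder pocht darauf dass die Kiefer gefällt wird',
]

sparse_counts = CountVectorizer().fit_transform(corpus)
counts = sparse_counts.toarray()

n_d = counts.shape[0]
df = (counts > 0).sum(axis=0)      # document frequency per term
idf = np.log(n_d / df) + 1         # assumed sklearn variant for smooth_idf=False
weighted = counts * idf            # tf * idf

# scale every row to Euclidean length 1
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfTransformer(smooth_idf=False).fit_transform(sparse_counts).toarray()
print(np.allclose(manual_tfidf, reference))   # True
```

Because of the added 1, terms that occur in every document are down-weighted but not removed.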

## Sentiment classification of reviews (yelp data)

* we use the perceptron to classify a review as positive (0) or negative (1)
* we use sklearn as a baseline
* we use our own (simplified) PyTorch version
  * our version reaches a higher accuracy than sklearn (but sklearn is faster)
  * the question is: is it significantly better?

In [4]:
import collections
import pandas as pd
import re

from argparse import Namespace

In [5]:
# adapt these paths to your own setup

args = Namespace(
    raw_train_dataset_csv="/home/klenner/applications/jupyter/PyTorchNLPBook-master/yelp/train.csv",
    raw_test_dataset_csv="/home/klenner/applications/jupyter/PyTorchNLPBook-master/yelp/test.csv")

In [6]:
# Read raw data
train_reviews = pd.read_csv(args.raw_train_dataset_csv, header=None, names=['rating', 'review'])
train_reviews = train_reviews[~pd.isnull(train_reviews.review)]
test_reviews = pd.read_csv(args.raw_test_dataset_csv, header=None, names=['rating', 'review'])
test_reviews = test_reviews[~pd.isnull(test_reviews.review)]

In [None]:
train_reviews.head()

In [None]:
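# clean the data, e.g. surround punctuation with blanks

def preprocess_text(text):
    # missing reviews show up as float NaN: report them and treat them as empty text
    if not isinstance(text, str):
        print(text)
        return ""
    text = text.lower()
    text = re.sub(r"([.,!?])", r" \1 ", text)      # put blanks around punctuation
    text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)   # replace everything else by one blank
    return text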
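
To see what the two substitutions do, they can be applied step by step to a single string. The sample sentence is made up for this illustration.

```python
import re

sample = "Great food!!! 5 stars, really."

lowered = sample.lower()
# step 1: surround each punctuation mark with blanks
spaced = re.sub(r"([.,!?])", r" \1 ", lowered)
# step 2: collapse every run of other characters (blanks, digits, ...) into one blank
cleaned = re.sub(r"[^a-zA-Z.,!?]+", r" ", spaced)

tokens = cleaned.split()
print(tokens)
# ['great', 'food', '!', '!', '!', 'stars', ',', 'really', '.']
```

Digits are dropped, and every punctuation mark becomes a token of its own.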

### vectorize with sklearn

* collect the texts in a list of strings
* keep train data and test data distinct
* fit on the whole corpus (train and test) to avoid OOV (out-of-vocabulary) items
* transform both data sets separately
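
The effect of the fitting choice can be seen on a toy example. The texts below are made up; the sketch only illustrates that `transform` silently ignores words it has not seen during `fit`.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["good food", "good service"]
test_texts = ["bad food"]

# vocabulary from the training texts only: 'bad' is out of vocabulary
train_only = CountVectorizer().fit(train_texts)
print(sorted(train_only.vocabulary_))                # ['food', 'good', 'service']
print(train_only.transform(test_texts).toarray())    # [[1 0 0]]

# vocabulary from both sets: 'bad' gets its own dimension
both = CountVectorizer().fit(train_texts + test_texts)
print(sorted(both.vocabulary_))                      # ['bad', 'food', 'good', 'service']
print(both.transform(test_texts).toarray())          # [[1 1 0 0]]
```

Note that the extra dimension for 'bad' is always zero in the training vectors, so a classifier cannot learn a weight for it from the training data alone.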

In [None]:
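length=6000

def load_sentiment(data_iterrows,length):
    """Collect the texts and labels of at most `length` reviews."""
    y,X=[],[]

    # count the rows ourselves: the DataFrame index has gaps after dropping empty reviews
    for n,(_,row) in enumerate(data_iterrows):
        if n == length:
            break

        X.append(preprocess_text(row.review))

        # rating 1 is the negative class (label 1), everything else gets label 0
        y.append(1 if row.rating==1 else 0)

    return X,y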
            

# create training set 
(train_corpus,train_y)=load_sentiment(train_reviews.iterrows(),length)

# create test set
(test_corpus,test_y)=load_sentiment(test_reviews.iterrows(),length)

# count vectorize fit the data: determine vocabulary
vectorizer.fit(train_corpus+test_corpus)
    
# count vectorize transform the data
trainX=vectorizer.transform(train_corpus)
testX=vectorizer.transform(test_corpus)

# print the vectors (sparse vectors)
#print(trainX.toarray())

# list the vocabulary 
#vectorizer.get_feature_names_out()


In [None]:
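# tfidf fit and transformation

from scipy.sparse import vstack

# stack the rows of both sets; trainX+testX would add the two matrices element-wise
transformer.fit(vstack([trainX, testX]))

trainX=transformer.transform(trainX)
testX=transformer.transform(testX)

input_dim=trainX.shape[1]   # input layer of perceptron: one unit per vocabulary entry
output_dim=1

trainX.shape[1],testX.shape[1]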

In [None]:
# get a baseline from sklearn :-)

from sklearn.metrics import accuracy_score 
from sklearn.linear_model import Perceptron

clf = Perceptron(tol=1e-5, random_state=0)

clf.fit(trainX, train_y)

y_test_predict=clf.predict(testX)

#clf.coef_,clf.intercept_, clf.classes_
accuracy_score(test_y,y_test_predict)
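
For reference, the textbook perceptron learning rule fits in a few lines of NumPy. This is a sketch on made-up toy data (the logical AND), not sklearn's implementation, which according to its documentation shares its code with `SGDClassifier`; the names `train_perceptron` and `predict` are chosen here for illustration.

```python
import numpy as np

def train_perceptron(X, y, epochs=100, lr=1.0):
    """Textbook perceptron rule: change the weights only on misclassified samples."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = 1 if x_i @ w + b > 0 else 0
            update = lr * (y_i - y_hat)   # 0 if correct, +lr or -lr otherwise
            w += update * x_i
            b += update
    return w, b

def predict(X, w, b):
    return (X @ w + b > 0).astype(int)

# toy data: logical AND, which is linearly separable
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_toy = np.array([0, 0, 0, 1])

w, b = train_perceptron(X_toy, y_toy)
print(predict(X_toy, w, b))   # [0 0 0 1]
```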

In [None]:
# define the model

import torch
import torch.nn as nn

class Net(nn.Module):
    """A single linear layer followed by a sigmoid."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, output_dim)  # one weight per vocabulary entry plus a bias
        # self.init_weights()   # optional: custom weight initialization

    def init_weights(self):
        initrange = 0.5
        self.fc1.weight.data.uniform_(-initrange, initrange)

    def forward(self, x):
        # the sigmoid squashes the score into the interval (0, 1)
        return torch.sigmoid(self.fc1(x))

net=Net()

print(list(net.parameters()))
len(train_y),trainX.shape[0]

In [None]:
from torch.utils.data import Dataset, DataLoader

class SentimentData(Dataset):
    def __init__(self, data, labels):
        'Initialization'
        self.labels = labels
        self.data = data

    def __len__(self):
        'Denotes the total number of samples'
        return len(self.data)

    def __getitem__(self, index):
        'Generates one sample of data'
        X = self.data[index]
        y = self.labels[index]

        # the inputs need no gradients; only the model parameters are trained
        return torch.tensor(X,dtype=torch.float32), torch.tensor(y,dtype=torch.float32)

data=SentimentData(trainX.toarray(),train_y)
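
Calling `toarray()` on the whole matrix can need a lot of memory when the vocabulary is large. One possible alternative, sketched here under the assumption that the features are a SciPy sparse matrix, is to keep the matrix sparse and convert only one row at a time. The class name `SparseRowDataset` and the toy data are hypothetical.

```python
import numpy as np
import torch
from scipy.sparse import csr_matrix
from torch.utils.data import Dataset, DataLoader

class SparseRowDataset(Dataset):
    """Keeps the features sparse and densifies a single row per sample."""

    def __init__(self, sparse_matrix, labels):
        self.data = csr_matrix(sparse_matrix)   # CSR format allows fast row access
        self.labels = labels

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, index):
        row = self.data[index].toarray().ravel()   # dense 1-d array for this row only
        x = torch.tensor(row, dtype=torch.float32)
        y = torch.tensor(self.labels[index], dtype=torch.float32)
        return x, y

# toy data standing in for the tf-idf matrix and the labels
toy_X = csr_matrix(np.array([[0, 1, 0], [2, 0, 0], [0, 0, 3], [1, 1, 0]]))
toy_y = [1, 0, 0, 1]

loader = DataLoader(SparseRowDataset(toy_X, toy_y), batch_size=2)
batch_x, batch_y = next(iter(loader))
print(batch_x.shape, batch_y.shape)   # torch.Size([2, 3]) torch.Size([2])
```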


In [None]:
import torch.optim as optim

dataloader=DataLoader(dataset=data,batch_size=20,shuffle=True)

net=Net()
optimizer = optim.SGD(net.parameters(), lr=0.01)

loss_func = nn.BCELoss()   # binary cross-entropy on the sigmoid output

for epoch in range(400):
    for batch_of_tensors, batch_labels in dataloader:
        optimizer.zero_grad()                 # reset the gradients of the previous step

        outputs = net(batch_of_tensors)       # shape: (batch_size, 1)

        # the labels have shape (batch_size,): reshape them to match the outputs
        loss = loss_func(outputs, batch_labels.view(-1,1))

        loss.backward()                       # compute the gradients
        optimizer.step()                      # update the weights

In [None]:
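# evaluate with testdata

def step(x):
    # threshold the sigmoid output at 0.5
    if x < 0.5:
        return 0
    else:
        return 1

correct,total=0,0
y_test_predict=[]

# do the prediction, evaluate
with torch.no_grad():                      # no gradients needed for evaluation
    for i, text in enumerate(testX):
        true = test_y[i]
        label=step(net(torch.Tensor(text.toarray())).item())
        y_test_predict.append(label)
        if true==label:
            correct+=1
        total+=1

accuracy=correct/total

accuracy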

In [None]:
# use dataloader for predictions: batch_size = testdata size

testdata=SentimentData(testX.toarray(),test_y)

testdataloader=DataLoader(dataset=testdata,batch_size=len(testdata))

test_batch,_=next(iter(testdataloader))  # only one batch, labels not needed since test_y are gold

with torch.no_grad():                    # no gradients needed for evaluation
    y_test_predict = net(test_batch)     # make the prediction
y_test_predict = [step(val) for val in y_test_predict.squeeze().numpy()]  # step function

accuracy_score(test_y,y_test_predict)  # sklearn scorer
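
Whether a difference in accuracy is significant cannot be read off the two scores alone. One common option for two classifiers evaluated on the same test set is McNemar's test, which looks only at the items where exactly one of the two is correct. The sketch below is an assumption about how this could be done, not part of the notebook's method: it uses the chi-square approximation with continuity correction, and the function name `mcnemar_test` and the toy predictions are hypothetical.

```python
import numpy as np
from scipy.stats import chi2

def mcnemar_test(y_true, pred_a, pred_b):
    """McNemar's test (chi-square approximation with continuity correction)."""
    y_true, pred_a, pred_b = (np.asarray(v) for v in (y_true, pred_a, pred_b))
    a_ok = pred_a == y_true
    b_ok = pred_b == y_true
    only_a = int(np.sum(a_ok & ~b_ok))   # items only classifier A gets right
    only_b = int(np.sum(~a_ok & b_ok))   # items only classifier B gets right
    if only_a + only_b == 0:
        return 0.0, 1.0                  # the classifiers never disagree in correctness
    statistic = (abs(only_a - only_b) - 1) ** 2 / (only_a + only_b)
    p_value = float(chi2.sf(statistic, df=1))
    return statistic, p_value

# toy example: A is right on all 10 items, B on only 2 of them
gold = [1] * 10
pred_a = [1] * 10
pred_b = [0] * 8 + [1] * 2

statistic, p_value = mcnemar_test(gold, pred_a, pred_b)
print(statistic)        # 6.125
print(p_value < 0.05)   # True
```

For the notebook, `pred_a` and `pred_b` would be the test predictions of the sklearn perceptron and of the PyTorch model, stored under two different variable names. With few disagreements, an exact binomial variant of the test is usually preferred over the approximation.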