<a href="https://colab.research.google.com/github/rahiakela/natural-language-processing-case-studies/blob/master/nlp-fundamental-mechanism/01_conventional_methods_for_text_representation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Conventional Methods for Text Representation

**In conventional machine learning methods, we have to create features from the text ourselves.** There are many representations available for this. Let us go through them one by one.

- Bag of Words - Countvectorizer Features
- TFIDF Features
- Hashing Features
- Word2vec Features


Reference:

https://mlwhiz.com/blog/2019/02/08/deeplearning_nlp_conventional_methods/

https://www.kaggle.com/mlwhiz/conventional-methods-for-quora-classification/

## Setup

In [1]:
%%shell

wget -q https://github.com/rahiakela/natural-language-processing-case-studies/raw/master/nlp-fundamental-mechanism/preprocesssing.py
wget -q https://github.com/rahiakela/natural-language-processing-case-studies/raw/master/nlp-fundamental-mechanism/utils.py



In [None]:
import random
import copy
import time
import pandas as pd
import numpy as np
import gc
import re
import torch
from torchtext import data

import os 
import nltk

# cross validation and metrics
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score

#import spacy
from tqdm import tqdm_notebook, tnrange
from tqdm.auto import tqdm
from sklearn.model_selection import train_test_split
tqdm.pandas(desc='Progress')
from collections import Counter
from textblob import TextBlob
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.tokenize.toktok import ToktokTokenizer
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer, HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB

#from unidecode import unidecode
import nltk
from sklearn.preprocessing import StandardScaler
from textblob import TextBlob
from multiprocessing import  Pool
from functools import partial
import numpy as np
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import LinearSVC
import lightgbm as lgb

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

In [None]:
from preprocesssing import *
from utils import *

Let's download the Quora Insincere Questions Classification dataset from [Kaggle](https://www.kaggle.com/c/quora-insincere-questions-classification/data).

Reference: https://github.com/Kaggle/kaggle-api

In [4]:
from google.colab import files
files.upload() # upload kaggle.json file

Saving kaggle.json to kaggle.json



In [5]:
%%shell

mkdir -p ~/.kaggle
mv kaggle.json ~/.kaggle/
ls ~/.kaggle
chmod 600 /root/.kaggle/kaggle.json

# download dataset from kaggle
kaggle competitions download -c quora-insincere-questions-classification -f train.csv
kaggle competitions download -c quora-insincere-questions-classification -f test.csv
kaggle competitions download -c quora-insincere-questions-classification -f embeddings.zip

unzip -qq train.csv.zip
unzip -qq test.csv.zip
rm -rf train.csv.zip
rm -rf test.csv.zip
unzip -qq embeddings.zip
rm -rf embeddings.zip

kaggle.json
Downloading train.csv.zip to /content
 76% 42.0M/54.9M [00:00<00:00, 78.4MB/s]
100% 54.9M/54.9M [00:00<00:00, 136MB/s] 
Downloading test.csv.zip to /content
 63% 10.0M/15.8M [00:00<00:00, 97.0MB/s]
100% 15.8M/15.8M [00:00<00:00, 101MB/s] 




In [22]:
%%shell

kaggle competitions download -c quora-insincere-questions-classification -f embeddings.zip
unzip -qq embeddings.zip
rm -rf embeddings.zip

Downloading embeddings.zip to /content
100% 5.96G/5.96G [02:09<00:00, 35.9MB/s]
100% 5.96G/5.96G [02:09<00:00, 49.5MB/s]




## Text Preprocessing

In [6]:
SEED = 1029

In [7]:
train_df = pd.read_csv("train.csv")#[:400000]
test_df = pd.read_csv("test.csv")#[:20000]
print("Train shape : ", train_df.shape)
print("Test shape : ", test_df.shape)

Train shape :  (1306122, 3)
Test shape :  (375806, 2)


In [8]:
train_df.head()

| | qid | question_text | target |
|---|---|---|---|
| 0 | 00002165364db923c7e6 | How did Quebec nationalists see their province... | 0 |
| 1 | 000032939017120e6e44 | Do you have an adopted dog, how would you enco... | 0 |
| 2 | 0000412ca6e4628ce2cf | Why does velocity affect time? Does velocity a... | 0 |
| 3 | 000042bf85aa498cd78e | How did Otto von Guericke used the Magdeburg h... | 0 |
| 4 | 0000455dfa3e01eae3af | Can I convert montra helicon D to a mountain b... | 0 |


In [9]:
# clean the sentences (clean_sentence comes from preprocesssing.py)
train_df['cleaned_text'] = train_df['question_text'].apply(clean_sentence)
test_df['cleaned_text'] = test_df['question_text'].apply(clean_sentence)

## BoW model using Count Vectorizer

Suppose we have a series of sentences (documents):

In [None]:
X = [
  "This is good",
  "This is bad",
  "This is awesome"
]

<img src='https://github.com/rahiakela/natural-language-processing-case-studies/blob/master/nlp-fundamental-mechanism/images/countvectorizer.png?raw=1' width='800'/>

Bag of words builds a dictionary that maps each word found across the sentences to an index (on a large corpus, usually only the most common words are kept). For the example above the dictionary would look like:

In [None]:
word_index = {
    'this':0,
    'is':1,
    'good':2,
    'bad':3,
    'awesome':4
}

And then encode the sentences using the above dict.

```
This is good    - [1,1,1,0,0]
This is bad     - [1,1,0,1,0]
This is awesome - [1,1,0,0,1]
```
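
To make the encoding step concrete, here is a minimal pure-Python sketch that reproduces the three vectors above. The `encode` function is a hypothetical helper written for illustration, and it splits on whitespace only, which is simpler than what a real vectorizer does.

```python
# Hand-rolled bag-of-words encoding for the three example sentences.
sentences = ["This is good", "This is bad", "This is awesome"]

word_index = {"this": 0, "is": 1, "good": 2, "bad": 3, "awesome": 4}

def encode(sentence, word_index):
    """Return a count vector with one slot per vocabulary word."""
    vector = [0] * len(word_index)
    for token in sentence.lower().split():
        if token in word_index:  # words outside the vocabulary are ignored
            vector[word_index[token]] += 1
    return vector

encoded = [encode(s, word_index) for s in sentences]
for s, v in zip(sentences, encoded):
    print(f"{s:<16}- {v}")
```

Each position in a vector counts how often the corresponding dictionary word occurs, so a word that appeared twice in a sentence would get a 2 rather than a 1.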

We can do this pretty simply in Python by using the `CountVectorizer` class from scikit-learn. Don’t worry much about the heavy name; it just does what I explained above. It has a lot of parameters, the most significant of which are:

- **ngram_range**: I specify (1, 3) in the code. This means that unigrams, bigrams, and trigrams are all taken into account when creating features.

- **min_df**: The minimum number of documents in which an n-gram must appear in the corpus to be used as a feature.
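
As a rough illustration of these two parameters, the sketch below extracts 1- to 3-grams by hand and keeps only those whose document frequency reaches a threshold. The `ngrams` helper and the toy `docs` list are assumptions made for this example; they are not how `CountVectorizer` is implemented internally.

```python
from collections import Counter

def ngrams(tokens, ngram_range=(1, 3)):
    """Return all n-grams of the token list for n in the given inclusive range."""
    lo, hi = ngram_range
    return [" ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]

docs = ["this is good", "this is bad", "this is awesome"]

# Document frequency: in how many documents does each n-gram occur?
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(ngrams(doc.split())))

# Keep only n-grams that occur in at least min_df documents.
min_df = 3
vocabulary = sorted(g for g, df in doc_freq.items() if df >= min_df)
print(vocabulary)
```

With `min_df = 3` only the n-grams shared by all three toy sentences survive, which is why a higher `min_df` shrinks the feature space on a real corpus.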

In [None]:
cnt_vectorizer = CountVectorizer(dtype=np.float32, strip_accents="unicode", analyzer="word", token_pattern=r'\w{1,}', ngram_range=(1, 3), min_df=3)

# Fit the vocabulary on both training and test text so that n-grams seen only in the test set are covered
cnt_vectorizer.fit(list(train_df.cleaned_text.values) + list(test_df.cleaned_text.values))
x_train = cnt_vectorizer.transform(train_df.cleaned_text.values)
# xtest_cntv = cnt_vectorizer.transform(test_df.cleaned_text.values)
y_train = train_df.target.values

We could then use these features with any machine learning classification model like Logistic Regression, Naive Bayes, SVM or LightGBM as we would like. 

For example:

In [None]:
# Fitting a simple Logistic Regression on CountVectorizer Model
train_oof_preds = model_train_cv(x_train, y_train, 5, LogisticRegression(C=1.0))
print("F1 Score: %0.3f " % best_thresshold(y_train, train_oof_preds))


F1 Score: 0.603 
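
The helpers `model_train_cv` and `best_thresshold` come from `utils.py`, which is not shown in this notebook. Judging by the name, `best_thresshold` picks the probability cutoff that maximizes F1 on the out-of-fold predictions. The sketch below shows one way such a search could work; `f1`, `search_threshold`, and the toy data are hypothetical and may differ from the actual helper.

```python
def f1(y_true, y_pred):
    """Binary F1 score computed from true/false positives and false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def search_threshold(y_true, probs, steps=100):
    """Try cutoffs 0.01, 0.02, ..., 0.99 and return (best cutoff, best F1)."""
    best_t, best_score = 0.0, 0.0
    for i in range(1, steps):
        t = i / steps
        score = f1(y_true, [1 if p >= t else 0 for p in probs])
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Toy labels and predicted probabilities, made up for illustration.
y_true = [0, 0, 1, 1, 0, 1]
probs = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90]
best_t, best_score = search_threshold(y_true, probs)
print(best_t, round(best_score, 3))
```

Tuning the cutoff matters on this dataset because the classes are imbalanced, so the default 0.5 is rarely the cutoff that gives the best F1.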


We get a local CV F1 score of 0.603 with a fairly simple model that just counts how many times each n-gram appears in a sentence. That is pretty good.

Let us try Multinomial NB.

In [None]:
# fitting a simple Naive Bayes model in place of logistic regression using the same features
train_oof_preds = model_train_cv(x_train, y_train, 5, MultinomialNB())
print("F1 Score: %0.3f " % best_thresshold(y_train, train_oof_preds))



F1 Score: 0.543 


Multinomial Naive Bayes scores 0.543 on the same count features, somewhat below Logistic Regression. You can also try SVMs, which were used extensively for text models, but they are pretty slow, so I am not using them here.

Let's try LightGBM as well.

In [None]:
# fitting a LightGBM model in place of logistic regression using the same features
train_oof_preds = lgb_model_train_cv(x_train, y_train, 5, lgb)

In [None]:
print("F1 Score: %0.3f " % best_thresshold(y_train, train_oof_preds))



F1 Score: 0.439 


In [None]:
# clean up: drop the count features and the vectorizer to free memory
del x_train
#del xtest_cntv
del cnt_vectorizer
gc.collect()

## BoW model using TF-IDF features

TF-IDF is a simple technique for deriving features from sentences. With count features we take the count of every word/n-gram present in a document; with TF-IDF we weight those counts so that the significant words stand out. How do we do that? For a document in a corpus, we consider two things about any word in that document:

<img src='https://github.com/rahiakela/natural-language-processing-case-studies/blob/master/nlp-fundamental-mechanism/images/tfidf.png?raw=1' width='800'/>

- **Term Frequency**: How important is the word in the document?

$$TF(\text{word in a document}) = \dfrac{\text{number of occurrences of that word in the document}}{\text{number of words in the document}}$$

- **Inverse Document Frequency**: How rare is the word across the whole corpus?

$$IDF(\text{word in a corpus}) = -\log(\text{fraction of documents that include the word})$$

TF-IDF is then just the product of these two scores.

Intuitively, a word is important if it occurs many times in a document. But that creates a problem: words like “a” and “the” occur many times in almost every sentence, so their TF score will always be high. We solve that by using Inverse Document Frequency, which is high if the word is rare and low if the word is common across the corpus.

In essence, we want to find words that are important in a document and also not very common across the corpus.
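
The two formulas above can be checked on a tiny corpus. This is a minimal sketch that follows the definitions in this section literally; the toy `docs` and the function names are made up for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row, so its numbers will differ from these.

```python
import math

# Toy corpus: each document is a list of tokens.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "the bird flew away".split(),
]

def tf(word, doc):
    """Occurrences of the word divided by the number of words in the document."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """-log(fraction of documents containing the word), written as log(N / df).

    Assumes the word occurs in at least one document.
    """
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / containing)

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

for word in ["the", "sat", "cat"]:
    print(word, round(tfidf(word, docs[0], docs), 4))
```

"the" appears in every document, so its IDF is zero and it drops out despite having the highest term frequency, while "cat", which appears in only one document, gets the highest score of the three.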

We can do this pretty simply in Python by using the `TfidfVectorizer` class from scikit-learn. It has a lot of parameters, the most significant of which are:

- **ngram_range**: I specify (1, 3) in the code. This means that unigrams, bigrams, and trigrams are all taken into account when creating features.
- **min_df**: The minimum number of documents in which an n-gram must appear in the corpus to be used as a feature.

In [None]:
# Always start with these features. They work (almost) every time!
# Boolean flags are passed as True rather than 1, since recent scikit-learn versions validate parameter types
tfv = TfidfVectorizer(dtype=np.float32, min_df=3, max_features=None, strip_accents="unicode", analyzer="word", token_pattern=r'\w{1,}', 
                      ngram_range=(1, 3), use_idf=True, smooth_idf=True, sublinear_tf=True, stop_words="english")

# Fit TF-IDF on both training and test text so the vocabulary and IDF weights cover both
tfv.fit(list(train_df.cleaned_text.values) + list(test_df.cleaned_text.values))
x_train = tfv.transform(train_df.cleaned_text.values)
# x_test = tfv.transform(test_df.cleaned_text.values)
y_train = train_df.target.values

Again, we could use these features with any machine learning classification model like Logistic Regression, Naive Bayes, SVM or LightGBM as we would like.

In [None]:
# Fitting a simple Logistic Regression on TFIDF Features
train_oof_preds = model_train_cv(x_train,y_train,5,LogisticRegression(C=1.0))

print ("F1 Score: %0.3f " % best_thresshold(y_train,train_oof_preds))

In [None]:
# fitting a simple Naive Bayes model in place of logistic regression using the same features
train_oof_preds = model_train_cv(x_train,y_train,5,MultinomialNB())
print ("F1 Score: %0.3f " % best_thresshold(y_train,train_oof_preds))



F1 Score: 0.502 


In [None]:
# fitting a LightGBM model in place of logistic regression using the same features
train_oof_preds = lgb_model_train_cv(x_train,y_train,5,lgb)

In [None]:
print ("F1 Score: %0.3f " % best_thresshold(y_train,train_oof_preds))



F1 Score: 0.442 


## BoW model using Hashing Vectorizer

Normally there will be a lot of n-grams in a document corpus. Our TfidfVectorizer generated in excess of 200,000 features. This can become a problem on very large datasets, as we have to hold a very large vocabulary dictionary in memory. One way to counter this is to use the hashing trick.

<img src='https://github.com/rahiakela/natural-language-processing-case-studies/blob/master/nlp-fundamental-mechanism/images/hashfeats.png?raw=1' width='800'/>

One can think of hashing as a single function that maps any n-gram to a number in a fixed range, for example 0 to 1023. Now we don’t have to store our n-grams in a dictionary. We can just use the function to get the index of any n-gram, rather than looking the index up in a dictionary.

Since there can be more than 1024 n-grams, different n-grams might map to the same number; this is called a collision. The larger the range we give our hashing function, the lower the chance of collisions.
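
Here is a minimal sketch of the hashing trick in plain Python. It uses MD5 from `hashlib` purely because it is deterministic across runs; scikit-learn's `HashingVectorizer` uses MurmurHash3 instead, so the actual indices will differ. The names `hash_index` and `hash_vectorize` are hypothetical.

```python
import hashlib

N_FEATURES = 1024  # size of the hash range, i.e. indices 0..1023

def hash_index(ngram, n_features=N_FEATURES):
    """Map an n-gram to a bucket index without storing any vocabulary."""
    digest = hashlib.md5(ngram.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_features

def hash_vectorize(tokens, n_features=N_FEATURES):
    """Count tokens into a fixed-length vector indexed by their hash."""
    vector = [0] * n_features
    for token in tokens:
        vector[hash_index(token, n_features)] += 1
    return vector

vec = hash_vectorize("this is good".split())
print(len(vec), sum(vec))

# With only 2 buckets, three distinct words must share at least one index.
tiny = {hash_index(w, n_features=2) for w in ["this", "is", "good"]}
print(len(tiny) <= 2)
```

The vector length is fixed up front no matter how many distinct n-grams the corpus contains; the price is that colliding n-grams become indistinguishable to the model.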

We can do this pretty simply in Python by using the `HashingVectorizer` class from scikit-learn. It has a lot of parameters, the most significant of which are:

- **ngram_range**: I specify (1, 3) in the code. This means that unigrams, bigrams, and trigrams are all taken into account when creating features.
- **n_features**: The number of features you want to consider, that is, the size of the hash range described above.

In [17]:
# 2 ** 10 = 1024 hash buckets, matching the example range above
hv = HashingVectorizer(dtype=np.float32, strip_accents="unicode", analyzer="word", ngram_range=(1, 3), n_features=2 ** 10, alternate_sign=False)

# HashingVectorizer is stateless (it has no vocabulary to learn), so fit() learns nothing from the data; it is kept for symmetry with the other vectorizers
hv.fit(list(train_df.cleaned_text.values) + list(test_df.cleaned_text.values))
x_train = hv.transform(train_df.cleaned_text.values)
# xtest_cntv = hv.transform(test_df.cleaned_text.values)
y_train = train_df.target.values

Again, we could use these features with any machine learning classification model like Logistic Regression, Naive Bayes, SVM or LightGBM as we would like.

In [18]:
# Fitting a simple Logistic Regression on hashing features
train_oof_preds = model_train_cv(x_train,y_train,5,LogisticRegression(C=1.0))

print ("F1 Score: %0.3f " % best_thresshold(y_train,train_oof_preds))



F1 Score: 0.374 


In [19]:
# fitting a simple Naive Bayes model in place of logistic regression using the same features
train_oof_preds = model_train_cv(x_train, y_train, 5, MultinomialNB())
print ("F1 Score: %0.3f " % best_thresshold(y_train, train_oof_preds))



F1 Score: 0.234 


In [None]:
# fitting a LightGBM model in place of logistic regression using the same features
train_oof_preds = lgb_model_train_cv(x_train,y_train,5,lgb)

In [21]:
print ("F1 Score: %0.3f " % best_thresshold(y_train,train_oof_preds))



F1 Score: 0.307 


## Word2vec Embeddings

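We can also use pretrained word vectors to create sentence-level features. The section title says word2vec, but the code below loads GloVe vectors; the idea is the same for either. We want a d-dimensional vector for each sentence, so we combine the embeddings of all its words: conceptually an average, implemented below as a sum followed by L2 normalization, which points in the same direction as the average.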

<img src='https://github.com/rahiakela/natural-language-processing-case-studies/blob/master/nlp-fundamental-mechanism/images/word2vec_feats.png?raw=1' width='800'/>
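
Before loading the real vectors, here is a toy sketch of the averaging idea. The 3-dimensional `toy_embeddings` are invented for illustration (real GloVe vectors here have 300 dimensions), and `average_embedding` is a hypothetical helper, not the `sent2vec` function used in this notebook.

```python
# Toy 3-dimensional embeddings, made up for illustration only.
toy_embeddings = {
    "good": [1.0, 0.0, 2.0],
    "movie": [3.0, 2.0, 0.0],
}

def average_embedding(tokens, embeddings, dim=3):
    """Average the vectors of all tokens that have an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:  # no known words: fall back to a zero vector
        return [0.0] * dim
    # zip(*vectors) groups the values of each dimension together
    return [sum(column) / len(vectors) for column in zip(*vectors)]

sentence_vector = average_embedding(["good", "movie", "unknownword"], toy_embeddings)
print(sentence_vector)
```

Words without an embedding are simply skipped, so every sentence ends up with a vector of the same length regardless of how many words it contains.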



In [23]:
# load the GloVe vectors into a dictionary mapping word -> 300-d vector
def load_glove_index():
  EMBEDDING_FILE = "glove.840B.300d/glove.840B.300d.txt"

  def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype="float32")[:300]

  # use a context manager so the file handle is closed after reading
  with open(EMBEDDING_FILE, encoding="utf8") as f:
    embeddings_index = dict(get_coefs(*emb.split(" ")) for emb in f)

  return embeddings_index

In [24]:
embeddings_index = load_glove_index()
print("Found %s word vectors." % len(embeddings_index))

Found 2196016 word vectors.


In [27]:
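stop_words = set(stopwords.words('english'))  # a set makes membership tests fast

def sent2vec(sent):
  # lowercase, tokenize, then drop stop words and non-alphabetic tokens
  words = word_tokenize(str(sent).lower())
  words = [w for w in words if w not in stop_words and w.isalpha()]

  # collect the embedding of every word that exists in the GloVe index
  M = [embeddings_index[w] for w in words if w in embeddings_index]

  # no known words: fall back to a zero vector
  if not M:
    return np.zeros(300)

  # sum the word vectors and L2-normalize the result
  v = np.array(M).sum(axis=0)
  return v / np.sqrt((v ** 2).sum())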

In [30]:
# create sentence vectors for the training set using the above function
x_train = [sent2vec(x) for x in tqdm(train_df.cleaned_text.values)]
#x_test = [sent2vec(x) for x in tqdm(test_df.cleaned_text.values)]





In [31]:
# clean up: the embedding dictionary is large, so free it once the sentence vectors are built
del embeddings_index
# del x_test
gc.collect()

1002

Again, we could use these features with a machine learning classification model like Logistic Regression, SVM or LightGBM. Multinomial Naive Bayes is left out here because it expects non-negative features, and embedding vectors contain negative values.

In [32]:
x_train = np.array(x_train)
# x_valid = np.array(x_test)
y_train = train_df.target.values

In [33]:
# Fitting a simple Logistic Regression on the sentence embedding features
train_oof_preds = model_train_cv(x_train, y_train, 5, LogisticRegression(C=1.0))

print ("F1 Score: %0.3f " % best_thresshold(y_train,train_oof_preds)) 


F1 Score: 0.552 


In [None]:
# fitting a LightGBM model in place of logistic regression using the same features
train_oof_preds = lgb_model_train_cv(x_train, y_train, 5, lgb)

In [35]:
print ("F1 Score: %0.3f " % best_thresshold(y_train, train_oof_preds))



F1 Score: 0.451 


## Conclusion

Here are the results of the different approaches on the Kaggle dataset, using 5-fold stratified cross-validation.

<img src='https://github.com/rahiakela/natural-language-processing-case-studies/blob/master/nlp-fundamental-mechanism/images/results_conv.png?raw=1' width='400'/>

While deep learning works a lot better for NLP classification tasks, it still makes sense to understand how these problems were solved in the past, so that we can appreciate the nature of the problem. I have tried to provide a perspective on the conventional methods, and one should experiment with them too, to create baselines before moving to deep learning methods.

