<a href="https://colab.research.google.com/github/thanhnguyen2612/diveintocode-ml/blob/master/ML_Sprint21_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing

In [2]:
#Download IMDB to the current folder
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
# Unzip
!tar zxf aclImdb_v1.tar.gz
# aclImdb / train / unsup is unlabeled and removed
!rm -rf aclImdb/train/unsup
# Show IMDB dataset description
!cat aclImdb/README

--2021-11-16 14:02:09--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2021-11-16 14:02:23 (5.87 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]

Large Movie Review Dataset v1.0

Overview

This dataset contains movie reviews along with their associated binary
sentiment polarity labels. It is intended to serve as a benchmark for
sentiment classification. This document outlines how the dataset was
gathered, and how to use the files provided. 

Dataset 

The core dataset contains 50,000 reviews split evenly into 25k train
and 25k test sets. The overall distribution of labels is balanced (25k
pos and 25k neg). We also include an additional 50,000 unlabeled
documents for unsupervised learning. 

In the entire 

In [9]:
from sklearn.datasets import load_files

train_review = load_files('aclImdb/train/', encoding='utf-8')
test_review = load_files('aclImdb/test/', encoding='utf-8')

X_train, y_train = train_review.data, train_review.target
X_test, y_test = test_review.data, test_review.target

print(train_review.target_names)
print(f"X: {X_train[0]}")

['neg', 'pos']
X: Zero Day leads you to think, even re-think why two boys/young men would do what they did - commit mutual suicide via slaughtering their classmates. It captures what must be beyond a bizarre mode of being for two humans who have decided to withdraw from common civility in order to define their own/mutual world via coupled destruction.<br /><br />It is not a perfect movie but given what money/time the filmmaker and actors had - it is a remarkable product. In terms of explaining the motives and actions of the two young suicide/murderers it is better than 'Elephant' - in terms of being a film that gets under our 'rationalistic' skin it is a far, far better film than almost anything you are likely to see. <br /><br />Flawed but honest with a terrible honesty.


# Example

In [11]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

mini_dataset = [
    'This movie is very good.',
    'This film is a good',
    'Very bad. Very, very bad.'
]

vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
bow = (vectorizer.fit_transform(mini_dataset)).toarray()

df = pd.DataFrame(bow, columns=vectorizer.get_feature_names_out())
df.head()

Unnamed: 0,a,bad,film,good,is,movie,this,very
0,0,0,0,1,1,1,1,1
1,1,0,1,1,1,0,1,0
2,0,2,0,0,0,0,0,3


In [13]:
vectorizer_2gram = CountVectorizer(ngram_range=(2, 2), token_pattern=r'(?u)\b\w+\b')

bow_train = (vectorizer_2gram.fit_transform(mini_dataset)).toarray()

df = pd.DataFrame(bow_train, columns=vectorizer_2gram.get_feature_names_out())
display(df)

Unnamed: 0,a good,bad very,film is,is a,is very,movie is,this film,this movie,very bad,very good,very very
0,0,0,0,0,1,1,0,1,0,1,0
1,1,0,1,1,0,0,1,0,0,0,0
2,0,1,0,0,0,0,0,0,2,0,1


# [Problem 1] Scratch implementation of BoW

In [23]:
texts = [
    'This movie is SOOOO funny !!!',
    'What a movie! I never',
    'best movie ever!!!!! this movie'
]

split_texts = [text.lower().replace('!', '').split() for text in texts]
split_words = [item for sublist in split_texts for item in sublist]
split_words

['this',
 'movie',
 'is',
 'soooo',
 'funny',
 'what',
 'a',
 'movie',
 'i',
 'never',
 'best',
 'movie',
 'ever',
 'this',
 'movie']

In [24]:
import numpy as np

unigram = pd.DataFrame(np.zeros((len(split_texts), len(set(split_words)))).astype('int'), columns=set(split_words))

for i, ss in enumerate(split_texts):
    for s in ss:
        n = ss.count(s)
        unigram[s][i] = n
        
unigram

Unnamed: 0,movie,what,this,never,ever,a,funny,i,soooo,best,is
0,1,0,1,0,0,0,1,0,1,0,1
1,1,1,0,1,0,1,0,1,0,0,0
2,2,0,1,0,1,0,0,0,0,1,0


In [25]:
bigram_vocab = []
bigram_list = []
for s in split_texts:
    lst = []
    for i in range(len(s) - 1):
        words = f"{s[i]} {s[i + 1]}"
        bigram_vocab.append(words)
        lst.append(words)
    bigram_list.append(lst)

bigram = pd.DataFrame(np.zeros((len(bigram_list), len(set(bigram_vocab)))).astype('int'), columns=set(bigram_vocab))

for i, ss in enumerate(bigram_list):
    for s in ss:
        n = ss.count(s)
        bigram[s][i] = n
        
bigram

Unnamed: 0,this movie,movie i,movie ever,what a,soooo funny,movie is,best movie,ever this,a movie,i never,is soooo
0,1,0,0,0,1,1,0,0,0,0,1
1,0,1,0,1,0,0,0,0,1,1,0
2,1,0,1,0,0,0,1,1,0,0,0


# [Problem 2] TF-IDF calculation

In [29]:
import nltk
stop_words = nltk.download('stopwords')

from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'bo

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer_train = TfidfVectorizer(stop_words=stop_words, max_features=5000)
X_train = vectorizer_train.fit_transform(X_train)
vec = vectorizer_train.get_feature_names_out()

vectorizer_test = TfidfVectorizer(stop_words=stop_words, max_features=5000, vocabulary=vec)
X_test = vectorizer_test.fit_transform(X_test)

X_train.shape, X_test.shape

((25000, 5000), (25000, 5000))

# [Problem 3] Learning using TF-IDF

In [37]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

def train_evaluate(model, X_train, y_train, X_val, y_val):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)

    print(model.score(X_val, y_val))
    print(precision_score(y_val, y_pred))
    print(recall_score(y_val, y_pred))
    print(f1_score(y_val, y_pred))
    print(confusion_matrix(y_val, y_pred))

In [39]:
import lightgbm as lgb

lgb = lgb.LGBMClassifier()

train_evaluate(lgb, X_train, y_train, X_test, y_test)

0.86004
0.853118870145155
0.86984
0.8613982966924144
[[10628  1872]
 [ 1627 10873]]


# [Problem 4] Scratch mounting of TF-IDF

## Standard formula

In [40]:
tf = unigram.copy()
idf = unigram.copy()

for i in range(len(tf)):
    tf.iloc[i, :] = tf.iloc[i, :] / tf.iloc[i, :].sum()
    
for t in idf.columns:
    idf[t][:] = np.log(len(idf.columns) / idf[t].sum())

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [41]:
tf

Unnamed: 0,movie,what,this,never,ever,a,funny,i,soooo,best,is
0,0.2,0.0,0.2,0.0,0.0,0.0,0.2,0.0,0.2,0.0,0.2
1,0.2,0.2,0.0,0.2,0.0,0.2,0.0,0.2,0.0,0.0,0.0
2,0.4,0.0,0.2,0.0,0.2,0.0,0.0,0.0,0.0,0.2,0.0


In [42]:
idf

Unnamed: 0,movie,what,this,never,ever,a,funny,i,soooo,best,is
0,1.011601,2.397895,1.704748,2.397895,2.397895,2.397895,2.397895,2.397895,2.397895,2.397895,2.397895
1,1.011601,2.397895,1.704748,2.397895,2.397895,2.397895,2.397895,2.397895,2.397895,2.397895,2.397895
2,1.011601,2.397895,1.704748,2.397895,2.397895,2.397895,2.397895,2.397895,2.397895,2.397895,2.397895


In [43]:
tf * idf

Unnamed: 0,movie,what,this,never,ever,a,funny,i,soooo,best,is
0,0.20232,0.0,0.34095,0.0,0.0,0.0,0.479579,0.0,0.479579,0.0,0.479579
1,0.20232,0.479579,0.0,0.479579,0.0,0.479579,0.0,0.479579,0.0,0.0,0.0
2,0.40464,0.0,0.34095,0.0,0.479579,0.0,0.0,0.0,0.0,0.479579,0.0


## Scikit-learn formula

In [44]:
tf_2 = unigram.copy()
idf_2 = unigram.copy()

for t in idf.columns:
    idf_2[t][:] = np.log((1 + len(idf_2.columns)) / (1 + idf[t].sum())) + 1

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """


In [45]:
tf_2 * idf_2

Unnamed: 0,movie,what,this,never,ever,a,funny,i,soooo,best,is
0,2.089949,0.0,1.674285,0.0,0.0,0.0,1.381543,0.0,1.381543,0.0,1.381543
1,2.089949,1.381543,0.0,1.381543,0.0,1.381543,0.0,1.381543,0.0,0.0,0.0
2,4.179898,0.0,1.674285,0.0,1.381543,0.0,0.0,0.0,0.0,1.381543,0.0


# [Problem 5] Corpus pretreatment

In [46]:
!pip install gensim



In [95]:
X_train, y_train = train_review.data, train_review.target
X_test, y_test = test_review.data, test_review.target

In [96]:
import re

def preprocess(row):
    after_preprocessing1 = re.sub(r'https?://[\w/:%#\$&\?\(\)~\.=\+\-…]+', '', row) 
    after_preprocessing2 = re.sub(r'<[^>]+>', ' ', after_preprocessing1)
    after_preprocessing3 = re.sub(r'[^0-9a-zA-Z ]', '', after_preprocessing2)
    after_preprocessing = after_preprocessing3.lower()
    return after_preprocessing

X_train = [preprocess(x).split() for x in X_train]
X_test = [preprocess(x).split() for x in X_test]

In [98]:
from gensim.models import Word2Vec

model = Word2Vec(min_count=1, size=10)
model.build_vocab(X_train)

model.train(X_train, total_examples=model.corpus_count, epochs=model.epochs)

(21879328, 28814035)