<a href="https://colab.research.google.com/github/rajeev03/python0/blob/master/NLP_Demo_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



## Naive Bayes Classification of Sentiments of Text

#### Probabilistic Model of Classification

In a probabilistic classification model we want to estimate the value of 
$P(c|x)$
, the probability of a sample x being of class c. Naive Bayes is one such probabilistic classifier that uses Bayes' Rule to classify samples. And Naive Bayes is _"Naive"_ because it assumes strong independence among all the features of sample x.

#### Bayes Rule:

$P(c|x) = \frac{P(x|c)P(c)}{P(x)}$

#### Text Classification using Naive Bayes classifier

Consider the task of classifying textual documents into having positive or negative sentiments. We will design the Naive Bayes classifier for this problem as follows:

Samples are text documents, and their features are the words that comprises these documents.

- Each document $d$ is a sequence of words, $d = w_1w_2...w_n$, where $w_i$ are the tokens of the document and $n$ is the total number of tokens in the document $d$.

- The training dataset consists of many document, sentiment pairs, ${d_i, s_i}$

- Each document $d_i$ is associated with a sentiment $s_i \in \{0,1\}$, $0$ being negative sentiment and $1$ being positive sentiment.

Using **Bayes' Rule** we have 

$p(s|d) = \frac{p(d|s)p(s)}{p(d|s)p(s) + p(d|\bar{s})p(\bar{s})}$

And from the **independence assumption** of features

$p(d|s) = p(w_1,w_2,..., w_n|s) = p(w_1|s)p(w_2|s)...p(w_n|s)$

Also in the **IMDb reviews dataset** that we are considering here have equal number of positive and negative datasets.

We have $p(s) = 0.5$ and $p(\bar{s})=0.5$.

This simplifies our formulation for 
$p(s|d)$

$ p(s|d) = \frac{p(d|s)}{p(d|s) + p(d|\bar{s})} $

If we assign threshold of 
$p_T(s|d) = 0.5$
for deciding the final label, the model simplifies to,

$y=
    \begin{cases}
      1, & \text{if } p(d|s=1) \geq p(d|s=0)\\
      0, & \text{otherwise}
    \end{cases}$
#### A measure for numerical stability

$p(w_i)$ will be very small in magnitude, and when we take a product of such very small numbers to compute $p(d|s)$
, even double precision floating points fail to store such small numbers and becomes zero. Hence, for numerical stability, we will convert the probabilities to log probability,

$\log p(d|s) = \log p(w_1,w_2,..., w_n|s) = \log p(w_1|s) + \log p(w_2|s) + ...+ \log p(w_n|s)$

### Basic NLP Tasks


In [0]:
# Import 'os' for preliminary tasks like directory listing etc.
import os

# Import re for regex string matching
import re

# Import nltk for nlp
import nltk

# Import library providing high-performance, easy-to-use data structures and data analysis tools
import pandas as pd

# Import Python's native data structures Counter and defaultdict
# Counter - maintains count of element
# defaultdict - dictionary data structure with exception handling for missing keys
from collections import Counter, defaultdict

# Import tqdm for fancy progressbars!
from tqdm import tqdm_notebook

# Import numpy for different mathematical operations on arrays / matrices
import numpy as np

from nltk.tokenize import word_tokenize # import tokenizer
from nltk.corpus import stopwords # import stopwords
from nltk.stem.porter import PorterStemmer #import stemmer
from nltk.stem import WordNetLemmatizer #lemmatizer
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import TfidfTransformer #
from sklearn.feature_extraction.text import TfidfVectorizer # from text to vector
from sklearn.naive_bayes import MultinomialNB #import naive bayes classifier
from sklearn import svm #import SVM classifier
from  sklearn.metrics  import accuracy_score # accuracy measure
from sklearn.tree import DecisionTreeClassifier # Decision tree classfier
from sklearn.ensemble import RandomForestClassifier # Random Forest Classfier 

In [0]:
# Install the nltk component for several tasks
nltk.download('punkt')     
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

In [0]:
#sentence for testing
#sentence= "The quick brown fox jumps over the lazy dog"
#sentence= "Backgammon is one of the oldest known board games"
sentence= "It is better to use the corpora about the rocks"

#function to split text into word
tokens = word_tokenize(sentence)
print (tokens)

#POS_Tagging
nltk.pos_tag(tokens)

In [0]:
#stop words removal
stop_words = set(stopwords.words('english'))

print (stop_words)
tokens = [w for w in tokens if not w in stop_words]
print(tokens)

In [0]:
# stemming
porter = PorterStemmer()
stems = []
for t in tokens:    
    stems.append(porter.stem(t))
print(stems)

In [0]:
#lemmatizing
lemmatizer = WordNetLemmatizer()
lemmas=[]
for t in tokens:
  lemmas.append(lemmatizer.lemmatize(t))
print (lemmas)

print("better:", lemmatizer.lemmatize("better",pos ="a" ))

### Downloading the data

In [2]:
!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz -P data/

--2019-07-20 06:37:50--  https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘data/aclImdb_v1.tar.gz’


2019-07-20 06:37:52 (45.5 MB/s) - ‘data/aclImdb_v1.tar.gz’ saved [84125825/84125825]



### Extract data
Please wait for ~15s

In [3]:
%%time
!tar -xzf data/aclImdb_v1.tar.gz -C data/

CPU times: user 81 ms, sys: 14 ms, total: 95 ms
Wall time: 8.31 s


### Data Samples
- Dataset is split into two parts for training and testing
- Positive and negative samples are organized in individual folders 
- Each sample document is stored in a .txt file

In [0]:
#convert the dataset from files to a python DataFrame
folder = 'data/aclImdb/'
labels = {'pos': 1, 'neg': 0} 
df = pd.DataFrame()
revList = list()
for f in ('test', 'train'):    
    for l in ('pos', 'neg'):
        path = os.path.join(folder, f, l)
        for file in os.listdir (path) :
            with open(os.path.join(path, file),'r', encoding='utf-8') as infile:
                txt = infile.read()
                revList.append((txt,labels[l]))
            #df = df.append([[txt, labels[l]]],ignore_index=True)
df = pd.DataFrame.from_records(revList)
df.columns = ['review', 'sentiment']
#df.head()
#df.tail(50)
#df.loc[27000, 'review']
#df.loc[27000, 'sentiment']
#df.loc[27500, 'sentiment']

### Build Vocabulary


In [0]:
reviews = df.review.str.cat(sep=' ')#function to split text into word
tokens = word_tokenize(reviews)
vocabulary = set(tokens)
print(len(vocabulary))
#frequency_dist = nltk.FreqDist(tokens)
#sorted(frequency_dist,key=frequency_dist.__getitem__, reverse=True)[0:50]



In [0]:
stop_words = set(stopwords.words('english'))
vocabulary = [w for w in vocabulary if not w in stop_words]
print (len (vocabulary))

###Build Classifier



In [0]:
#building a classifier
X_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values
vectorizer = TfidfVectorizer()
#v=vectorizer.fit_transform(vocabulary)
train_vectors=vectorizer.fit_transform(X_train)
#train_vectors = vectorizer.transform(X_train)
test_vectors = vectorizer.transform(X_test)
print(train_vectors.shape, test_vectors.shape)

In [0]:
#fit the classifier Naive Bayes
clf = MultinomialNB().fit(train_vectors, y_train)
predicted = clf.predict(test_vectors)
print(accuracy_score(y_test,predicted))

In [0]:
#clf = svm.SVC().fit(train_vectors, y_train)
clf = DecisionTreeClassifier(max_depth=5).fit(train_vectors, y_train)
#clf=RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1).fit(train_vectors, y_train)
predicted = clf.predict(test_vectors)
print(accuracy_score(y_test,predicted))

### Implementation of Naive Bayes Classifier without Using Library Function


In [0]:
data_folder = 'data/aclImdb/'

rp = os.path.join(data_folder, 'train/pos')
train_positive = [os.path.join(rp, f) for f in os.listdir(rp)]
rp = os.path.join(data_folder, 'train/neg')
train_negative = [os.path.join(rp, f) for f in os.listdir(rp)]

rp = os.path.join(data_folder, 'test/pos')
test_positive = [os.path.join(rp, f) for f in os.listdir(rp)]
rp = os.path.join(data_folder, 'test/neg')
test_negative = [os.path.join(rp, f) for f in os.listdir(rp)]

### Regex for cleaning html tags
- Pattern <.*?> means "anything within two angular brackets". The qualifier *? denotes "as few times as possible". This makes sure we match only one html tag at a time.

In [0]:
re_html_cleaner = re.compile(r"<.*?>")

#### Limit number of samples
To quickly train a small model, consider setting n_train and n_test to some relatively small numbers e.g. `1000`. Set, 
`n_train = n_test = -1` to use all the samples available.

In [0]:
n_train = 2500
n_test = 2500

### (Conditional) Unigram Counter
- Calculates the distribution $p(w|s=1)$ and $p(w|s=0)$, empirically, from training data.

In [0]:
# Distribution of word tokens in positive samples
positive_word_counts = Counter()

for _fname in tqdm_notebook(train_positive[:n_train], desc="Crunching +ve samples: "):
    with open(_fname) as f:
        text = f.read().strip()
        text = re_html_cleaner.sub(" ", text)
        positive_word_counts += Counter(nltk.word_tokenize(text))

# Distribution of word tokens in negative samples
negative_word_counts = Counter()

for _fname in tqdm_notebook(train_negative[:n_train], desc="Crunching -ve samples: "):
    with open(_fname) as f:
        text = f.read().strip()
        text = re_html_cleaner.sub(" ", text)
        negative_word_counts += Counter(nltk.word_tokenize(text))

#### Unigram counts to probability distribution

$p(w|s) = \frac{N_{s,w}}{N_{s,*}} = \frac{N_{s,w}}{\sum_{w' \in W}N_{s,w'}}$

#### Additive Smoothing
- Note that, if some token, $u$, unseen in training documents, occurrs in a test document, $p(doc_{test}|s)$ becomes $0$ as $N_{s,u}$ for that token is $0$.
- We apply _Additive Smoothing_ to prevent probability from going to zero.

$p(w|s) = \frac{\alpha + N_{s,w}}{\sum_{w' \in W}(\alpha + N_{s,w'})} = \frac{\alpha + N_{s,w}}{\alpha V + \sum_{w' \in W}N_{s,w'}}$

where V is the total vocab size.

In [0]:
len_corpus_pos = sum(positive_word_counts.values())
len_corpus_neg = sum(negative_word_counts.values())
V_pos = len(positive_word_counts)
V_neg = len(negative_word_counts)
alpha = 0.1
log_p_vocab_pos = defaultdict(
    lambda: np.log(alpha/len_corpus_pos), 
    {w:np.log((alpha + c)/(V_pos*alpha + len_corpus_pos)) for w,c in positive_word_counts.items()}
)
log_p_vocab_neg = defaultdict(
    lambda: np.log(alpha/len_corpus_neg), 
    {w:np.log((alpha + c)/(V_neg*alpha + len_corpus_neg)) for w,c in negative_word_counts.items()}
)

In [0]:
p_data_pos = len(train_positive)/(len(train_positive) + len(train_negative))
print(f"Prob. of +ve sentiment in our dataset: {p_data_pos}")

Prob. of +ve sentiment in our dataset: 0.5


#### get_prob_pos(doc)

A function that accepts a document string as input, tokenizes it and computes the probability 
$p(d|s=1)$
and 
$p(d|s=0)$
. It returns 1 if 
$p(d|s=1) \geq p(d|s=0)$ 
otherwise 0.

In [0]:
def get_prob_pos(doc):
    text = doc.strip()
    text = re_html_cleaner.sub(" ", text)
    tokens = nltk.word_tokenize(text)
    p_pos = 1
    p_neg = 1
    for token in tokens:
        p_pos += log_p_vocab_pos[token]
        p_neg += log_p_vocab_neg[token]
        
    return 1.0*(p_pos >= p_neg) #/(p_pos+p_neg)

In [0]:
results = []
for _fname in tqdm_notebook(test_positive[:n_test], desc="Classifying test data: "):
    with open(_fname) as f:
        results.append((1, get_prob_pos(f.read())))
        

for _fname in tqdm_notebook(test_negative[:n_test], desc="Classifying test data: "):
    with open(_fname) as f:
        results.append((0, get_prob_pos(f.read())))

### Performance evaluation of our model

**Accuracy:** Overall performance of our model, fraction of samples that were labelled correctly

**Recall:** Out of all +ve data samples in test set, what fraction of it was labelled correctly

**Precision:** How precise is the model? Out of all samples that were tagged +ve by the model, how many were actually positive.

In [0]:
true_pos = 0
false_pos = 0
true_neg = 0
false_neg = 0
for true_label, pred_label in results:
    if true_label == 1 and pred_label == 1:
        true_pos += 1
    elif true_label == 1 and pred_label == 0:
        false_neg += 1
    elif true_label == 0 and pred_label == 1:
        false_pos += 1
    elif true_label == 0 and pred_label == 0:
        true_neg += 1

In [0]:
print(f"Accuracy: {(true_pos + true_neg)/(true_pos + true_neg + false_pos + false_neg):0.4F}")
print(f"Recall: {(true_pos)/(true_pos + false_neg):0.4F}")
print(f"Precision: {(true_pos)/(true_pos + false_pos):0.4F}")

Accuracy: 0.7904
Recall: 0.7416
Precision: 0.8218
