# Tutorial for NLP - Text classification using Naive Bayes and NLP

Naive Bayes is a popular algorithm that is widely used for text classification. Together with the powerful NLTK library, we will walk through the standard procedures used for sentiment classification of labelled text.
In this kernel we will see how to create a model to classify a text input.

In [1]:
import pandas as pd, numpy as np, matplotlib.pyplot as plt
import nltk
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
train = pd.read_csv('../input/train.csv',encoding='iso-8859-1')

In [2]:
train.shape

In [3]:
train.head(20)

In [4]:
train['SentimentText'][400]

In [5]:
lens = train.SentimentText.str.len()
lens.mean(), lens.std(), lens.max()

In [6]:
lens.hist();
plt.show()

In [7]:
labels = ['0', '1']
sizes = [train['Sentiment'].value_counts()[0],
         train['Sentiment'].value_counts()[1]
        ]
fig1, ax1 = plt.subplots()
ax1.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True)
ax1.axis('equal')
plt.title('Sentiment Proportion', fontsize=20)
plt.show()

So this is a fairly balanced dataset without much skew; for educational purposes we will use it as is.

## StopWords

We can ignore words that carry little meaning on their own, such as articles, conjunctions, and prepositions, to make the input data more meaningful to the algorithm. NLTK provides a built-in stop-word corpus that we can use to filter them out.

In [8]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
 
data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
stopWords = set(stopwords.words('english'))
words = word_tokenize(data)
wordsFiltered = []
 
for w in words:
    if w.lower() not in stopWords:  # NLTK's stop words are lowercase, so compare case-insensitively
        wordsFiltered.append(w)
 
print(wordsFiltered)

In [9]:
stopwords.words('english')

In [10]:
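#nltk.download("stopwords")
from nltk.corpus import stopwords
stop_set = set(stopwords.words('english'))  # build the set once for fast lookups
# Filter stop words token by token; comparing whole texts to the stop list removes nothing useful
train.SentimentText = train.SentimentText.apply(
    lambda text: ' '.join(w for w in text.split() if w.lower() not in stop_set))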

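## Stemmer
A word stem is the base part of a word that remains after affixes such as -ing, -ed, or -s are removed. Stemming is a form of linguistic normalization: it maps related word forms to a common stem. Given a word, NLTK can find its stem.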
![title](https://pythonspot-9329.kxcdn.com/wp-content/uploads/2016/08/word-stem.png)

In [None]:
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
 
words = ["game","gaming","gamed","games"]
stemmer = PorterStemmer()
 
for word in words:
    print(stemmer.stem(word))

In [None]:
plurals = ['caresses', 'flies', 'dies', 'mules', 'denied',
            'died', 'agreed', 'owned', 'humbled', 'sized',
            'meeting', 'stating', 'siezing', 'itemization',
            'sensational', 'traditional', 'reference', 'colonizer',
            'plotted'] 
for word in plurals:
    print(stemmer.stem(word))

In [None]:
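#nltk.download("wordnet")  # not required: PorterStemmer is rule-based and needs no corpus
ps = nltk.PorterStemmer()
# Stem token by token; passing a whole text to ps.stem() would treat it as one long word
train.SentimentText = train.SentimentText.apply(
    lambda text: ' '.join(ps.stem(w) for w in text.split()))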

## Split Test and Train

In [None]:
X = train.SentimentText
y = train.Sentiment
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)

In [None]:
train1=pd.concat([X_train,y_train], axis=1)
train1.shape

## Tokenization

The goal of tokenization is to break up a sentence or paragraph into specific tokens or words. We basically want to convert human language into a more abstract representation that computers can work with.

Sometimes you want to split sentence by sentence and other times you just want to split words.

In [None]:
import re, string
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')
def tokenize(s): return re_tok.sub(r' \1 ', s).split()
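
To make the behaviour of this tokenizer concrete, here is a small usage sketch. It repeats the same pattern and function so that it runs on its own; the sample sentence is made up for illustration.

```python
import re
import string

# Same pattern as in the cell above: capture each punctuation character
# so that it can be padded with spaces and split off as its own token.
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')

def tokenize(s):
    """Surround punctuation with spaces, then split on whitespace."""
    return re_tok.sub(r' \1 ', s).split()

# Hypothetical sample text, not taken from the dataset.
tokens = tokenize("I can't wait, it's great!")
print(tokens)
```

Note that apostrophes count as punctuation here, so contractions such as "can't" are split into three tokens.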

## Ngrams

In [None]:
from nltk import ngrams
sentence = 'this is a foo bar sentences and i want to ngramize it'
n = 2
bigrams = ngrams(sentence.split(), n)
for grams in bigrams:
    print(grams)

## Term Frequency-Inverse Document Frequency (TF-IDF)
Term frequency-inverse document frequency (TF-IDF) is another way to judge the topic of an article by the words it contains. With TF-IDF, each word is given a weight that measures relevance rather than raw frequency. That is, word counts are replaced with TF-IDF scores computed across the whole dataset.

First, TF-IDF measures the number of times that a word appears in a given document (that’s the term frequency). But because words such as “and” or “the” appear frequently in all documents, they are systematically discounted. That’s the inverse document frequency part. The more documents a word appears in, the less valuable that word is as a signal. This is intended to leave only the frequent AND distinctive words as markers. Finally, the scores are normalized: with scikit-learn's default settings, each document's TF-IDF vector is scaled to unit Euclidean (L2) length, so its squared weights sum to one.

![title](https://deeplearning4j.org/img/tfidf.png)
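
As a rough illustration of the weighting described above, the following sketch computes TF-IDF by hand on a made-up three-document corpus. It assumes raw counts for the term frequency, the smoothed IDF `ln((1 + n) / (1 + df)) + 1` that scikit-learn documents for `smooth_idf=True`, and L2 normalization. The corpus and the helper names `idf` and `tfidf` are hypothetical and exist only for this example.

```python
import math
from collections import Counter

# Toy corpus (hypothetical data, chosen only for illustration).
docs = [
    "the movie was good",
    "the movie was bad",
    "the plot was good and the cast was good",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents in which each term occurs.
df = Counter(term for doc in tokenized for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency (assumed variant, see the text above).
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    tf = Counter(doc)
    raw = {t: count * idf(t) for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf(tokenized[1])
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:6s} {w:.3f}")
```

In the second document, "bad" occurs in no other document and so receives the largest weight, while "the" and "was" occur everywhere and receive the smallest.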

In [None]:
n = train1.shape[0]
vec = TfidfVectorizer(ngram_range=(1,2),  # lower and upper bounds on n: use unigrams and bigrams
                      tokenizer=tokenize,
                      min_df=3,      # ignore terms that appear in fewer than 3 documents
                      max_df=0.9,    # ignore terms that appear in more than 90% of documents (corpus-specific stop words)
                      strip_accents='unicode',  # remove accents during the preprocessing step
                      use_idf=True,
                      smooth_idf=True,  # smooth idf weights by adding one to document frequencies,
                                        # as if an extra document was seen containing every term in
                                        # the collection exactly once. Prevents zero divisions.
                      sublinear_tf=True,  # apply sublinear tf scaling, i.e. replace tf with 1 + log(tf)
                      max_features=40000  # keep only the 40,000 most frequent terms in the corpus
                     )
trn_term_doc = vec.fit_transform(train1['SentimentText'])
test_term_doc = vec.transform(X_test)

In [None]:
#This creates a sparse matrix with only a small number of non-zero elements (stored elements in the representation below).
trn_term_doc, test_term_doc

In [None]:
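# Basic naive Bayes feature equation: the smoothed per-term feature sum for class y_i.
# Relies on the global sparse matrix x (the training term-document matrix), assigned in a later cell.
def pr(y_i, y):
    p = x[y==y_i].sum(0)
    return (p+1) / ((y==y_i).sum()+1)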

In [None]:
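def get_mdl(y):
    y = y.values
    r = np.log(pr(1,y) / pr(0,y))  # log-count ratio of each term between the two classes
    m = LogisticRegression(C=3,solver='newton-cg')
    x_nb = x.multiply(r)  # scale every feature by its log-count ratio
    return m.fit(x_nb, y), r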
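
To see what the log-count ratio `r` looks like, here is a minimal sketch on a tiny dense matrix. It mirrors the smoothing used in `pr` above but takes the matrix as an argument instead of reading a global. The data and the names `x_toy`, `y_toy`, `pr_toy`, and `r_toy` are hypothetical.

```python
import numpy as np

# Toy document-term matrix (hypothetical): rows are documents,
# columns are the terms "good", "bad", "movie".
x_toy = np.array([
    [2.0, 0.0, 1.0],  # positive document
    [1.0, 0.0, 1.0],  # positive document
    [0.0, 2.0, 1.0],  # negative document
    [0.0, 1.0, 1.0],  # negative document
])
y_toy = np.array([1, 1, 0, 0])

def pr_toy(x, y, y_i):
    """Smoothed per-term feature sum for class y_i, mirroring pr() above."""
    p = x[y == y_i].sum(axis=0)
    return (p + 1) / ((y == y_i).sum() + 1)

# Log-count ratio: positive for terms typical of class 1,
# negative for terms typical of class 0, zero for neutral terms.
r_toy = np.log(pr_toy(x_toy, y_toy, 1) / pr_toy(x_toy, y_toy, 0))
print(r_toy)

# Scaling each column by its ratio gives the features fed to logistic regression.
x_nb_toy = x_toy * r_toy
```

Here "good" gets a positive ratio, "bad" a negative one, and "movie", which is equally common in both classes, gets zero and therefore contributes nothing to the scaled features.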

In [None]:
x = trn_term_doc
test_x = test_term_doc

label_cols=['Sentiment']
preds = np.zeros((len(X_test), len(label_cols)))
preds

In [None]:
for i, j in enumerate(label_cols):
    print('fit', j)
    m,r = get_mdl(train1[j])
    preds[:,i] = m.predict_proba(test_x.multiply(r))[:,1]

In [None]:
y_pred=pd.DataFrame(preds.round(decimals=0), columns = label_cols)

In [None]:
accuracy_score(y_test, y_pred)

## References:
- https://stackoverflow.com/questions/24647400/what-is-the-best-stemming-method-in-python
- http://billchambers.me/tutorials/2014/12/21/tf-idf-explained-in-python.html
- http://billchambers.me/tutorials/2015/01/14/python-nlp-cheatsheet-nltk-scikit-learn.html
- https://www.kaggle.com/jhoward/nb-svm-strong-linear-baseline
- https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf