# Text Classification - Naive Bayes

*(For more information on the Naive Bayes classifier, please check the PowerPoint slides.)*

We're working on a **classification problem**. There are **different machine learning algorithms** available for building a predictive model.

<img src ="http://amueller.github.io/sklearn_tutorial/cheat_sheet.png">

The **fetch_20newsgroups()** function allows the loading of filenames and data from the 20 newsgroups dataset. It has 20 classes, 18846 observations, and features in the form of strings. It downloads the dataset from the original 20 newsgroups website and caches it locally.

The 20 newsgroups dataset is split into two subsets: one for training (or development) and the other for testing (or performance evaluation). The split between the train and test sets is based on whether a message was posted before or after a specific date.

https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html

In [1]:
#import libraries
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

#Load the filenames and data from the 20 newsgroups dataset (classification).
from sklearn.datasets import fetch_20newsgroups

In [None]:
# credit: http://qwone.com/~jason/20Newsgroups/
data = fetch_20newsgroups()
data.target_names

In [None]:
# get the 20 classes data
categories = ['alt.atheism', 'comp.graphics', 
              'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 
              'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 
              'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 
              'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 
              'sci.space', 'soc.religion.christian', 'talk.politics.guns', 
              'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)

In [None]:
# an example news article in train (index 1, i.e. the second article)
print(train.data[1])

In [None]:
# an example news article in test (index 1, i.e. the second article)
print(test.data[1])

In [None]:
# how many articles in train and test
print(len(train.data)) #60% of the total data
print(len(test.data)) #40% of the total data

## TF-IDF 

TF-IDF, short for **term frequency–inverse document frequency**, is a numerical statistic intended to reflect how important a word is to a document in a collection, or corpus. A corpus is a language resource consisting of a large, structured set of texts (nowadays usually stored and processed electronically).

- TF: measures how frequently a term appears in a document
    - = number of times the word appears in the document / total number of words in the document
<br><br>    
- IDF: measures how informative a word is across the corpus. For example, words such as "at" and "of" appear frequently but carry little meaning. IDF **weights down such frequent terms while scaling up rare words**
    - = log(total number of documents / number of documents containing the word in question)
<br><br> 
- TF-IDF: the importance of a word or token (a feature) in a document
    - = tf * idf
    - these scores are the features used for classification

TF-IDF is often used as a **weighting factor** in:
- information retrieval
- text mining
- user modeling

The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. tf–idf is one of the most popular term-weighting schemes today.
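To make the formulas above concrete, here is a minimal sketch that computes TF, IDF, and TF-IDF by hand. The three-document corpus and the helper names `tf`, `idf`, and `tfidf` are made up for illustration; they are not part of scikit-learn.

```python
import math

# Hypothetical toy corpus: each document is a list of lowercase tokens.
docs = [
    ["the", "rocket", "reached", "orbit"],
    ["the", "team", "won", "the", "game"],
    ["the", "rocket", "engine", "failed"],
]

def tf(term, doc):
    # Number of times the term appears in the document / total words in the document.
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # log(total documents / documents containing the term).
    # Assumes the term occurs in at least one document (otherwise this divides by zero).
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# Score three words from the first document.
for term in ["the", "rocket", "orbit"]:
    print(term, round(tfidf(term, docs[0], docs), 4))
```

"the" appears in every document, so its IDF is log(1) = 0 and its score vanishes. "orbit" appears in only one document, so it scores higher than "rocket", which appears in two.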

sources: 
- https://en.wikipedia.org/wiki/Tf%E2%80%93idf
- https://en.wikipedia.org/wiki/Text_corpus

In [None]:
# Import the library for TFIDF
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Show the top words in the first training document, ranked by TF-IDF score
tfIdfVectorizer = TfidfVectorizer(use_idf=True, stop_words='english')

tfIdf = tfIdfVectorizer.fit_transform(train.data)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (available since 1.0)
df = pd.DataFrame(tfIdf[0].T.toarray(), index=tfIdfVectorizer.get_feature_names_out(), columns=["TF-IDF"])
df = df.sort_values('TF-IDF', ascending=False)
print(df.head(20))
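Note that the scores `TfidfVectorizer` prints will generally not match the textbook formula exactly. With its default settings, scikit-learn smooths the IDF as ln((1 + n) / (1 + df)) + 1 and then scales each document vector to unit L2 length. The sketch below checks both points on a small made-up corpus (`toy_docs` is hypothetical, not the newsgroups data).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, used only to keep the check fast.
toy_docs = [
    "the rocket reached orbit",
    "the team won the game",
    "the rocket engine failed",
]

tfidf_vec = TfidfVectorizer()  # defaults: smooth_idf=True, norm="l2"
tfidf_matrix = tfidf_vec.fit_transform(toy_docs)

# Document frequency: the number of documents each term appears in.
counts = CountVectorizer().fit_transform(toy_docs).toarray()
doc_freq = (counts > 0).sum(axis=0)

# Smoothed IDF, as described in the scikit-learn documentation.
n_docs = len(toy_docs)
expected_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Each row of the TF-IDF matrix should have unit L2 norm.
row_norms = np.linalg.norm(tfidf_matrix.toarray(), axis=1)

print(np.allclose(tfidf_vec.idf_, expected_idf))
print(np.allclose(row_norms, 1.0))
```

Because of the `+ 1` term, a word that appears in every document still gets a non-zero weight, unlike in the plain log formula.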

## Naive Bayes Classifier 

**Multinomial Naive Bayes** calculates the likelihood of each label for a given sample and outputs the label with the highest probability.
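The idea can be sketched from scratch in a few lines. The version below is a simplified illustration that works on raw word counts with Laplace (add-one) smoothing; it is not scikit-learn's implementation, and the training set and the names `train_nb` and `predict_nb` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set: (tokens, label) pairs.
training = [
    (["goal", "team", "win"], "sport"),
    (["team", "match", "goal"], "sport"),
    (["rocket", "orbit", "launch"], "space"),
    (["orbit", "moon", "rocket"], "space"),
]

def train_nb(examples):
    # Count documents per label and word occurrences per label.
    label_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    for tokens, label in examples:
        word_counts[label].update(tokens)
    vocab = {word for tokens, _ in examples for word in tokens}
    return label_counts, word_counts, vocab

def predict_nb(tokens, label_counts, word_counts, vocab, alpha=1.0):
    total_docs = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log prior of the label
        total_words = sum(word_counts[label].values())
        for word in tokens:
            if word not in vocab:
                continue  # skip words never seen in training
            # Smoothed log likelihood of the word given the label.
            numerator = word_counts[label][word] + alpha
            denominator = total_words + alpha * len(vocab)
            score += math.log(numerator / denominator)
        scores[label] = score
    # Return the label with the highest log probability, plus all scores.
    return max(scores, key=scores.get), scores

params = train_nb(training)
label, scores = predict_nb(["rocket", "launch", "team"], *params)
print(label)
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term `alpha` keeps a single unseen word from forcing a label's probability to zero.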

In [None]:
# Libraries
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

In [None]:
# Creating a model based on Multinomial Naive Bayes using make_pipeline
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

In [None]:
# Training the model with the train data
model.fit(train.data, train.target) #The target attribute is the integer index of the category

In [None]:
# Predicting labels for the test data
labels = model.predict(test.data)
print(labels[1])

### How did we get the labels for the testing data?

<img src="http://www.nltk.org/images/supervised-classification.png">
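In code, the pipeline hides two steps: the vectorizer learns its vocabulary and IDF weights from the training texts only, and the same fitted vectorizer then transforms the test texts before the classifier predicts. The sketch below unrolls those steps on a small made-up dataset (all `toy_*` names are hypothetical stand-ins for `train.data`, `train.target`, and `test.data`) and compares the result with the one-call pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the newsgroups train/test sets.
toy_train_texts = [
    "the team scored a late goal",
    "the goalkeeper saved the match",
    "the rocket reached orbit",
    "the probe landed on the moon",
]
toy_train_labels = [0, 0, 1, 1]  # 0 = sport, 1 = space
toy_test_texts = ["a rocket to the moon", "the team won the match"]

# One call: the pipeline vectorizes the text and then classifies it.
toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(toy_train_texts, toy_train_labels)
pipeline_labels = toy_model.predict(toy_test_texts)

# The same steps, unrolled.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(toy_train_texts)  # learn vocabulary and IDF from train only
classifier = MultinomialNB().fit(X_train, toy_train_labels)
X_test = vectorizer.transform(toy_test_texts)        # reuse the training vocabulary
manual_labels = classifier.predict(X_test)

print(pipeline_labels, manual_labels)
```

Calling `transform` rather than `fit_transform` on the test texts matters: refitting on the test set would produce a different vocabulary, and the classifier's learned weights would no longer line up with the columns.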

## Confusion Matrix

In the field of machine learning and specifically the problem of statistical classification, a confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix). Each row of the matrix represents the instances in an actual class while each column represents the instances in a predicted class, or vice versa – both variants are found in the literature. The name stems from the fact that it makes it easy to see whether the system is confusing two classes (i.e. commonly mislabeling one as another).

Source: https://en.wikipedia.org/wiki/Confusion_matrix
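As a small illustration, the sketch below builds a confusion matrix by hand for a made-up three-class problem (the labels and predictions are hypothetical). Rows are actual classes and columns are predicted classes, which is also the layout `sklearn.metrics.confusion_matrix` returns.

```python
# Hypothetical true and predicted labels for a 3-class problem.
classes = ["autos", "hockey", "space"]
y_true = ["autos", "autos", "hockey", "hockey", "space", "space", "space"]
y_pred = ["autos", "space", "hockey", "hockey", "space", "autos", "space"]

# Map each class name to its row/column position.
index = {name: i for i, name in enumerate(classes)}

# matrix[i][j] counts samples whose actual class is i and predicted class is j.
matrix = [[0] * len(classes) for _ in classes]
for actual, predicted in zip(y_true, y_pred):
    matrix[index[actual]][index[predicted]] += 1

for name, row in zip(classes, matrix):
    print(f"{name:>7}", row)
```

Off-diagonal cells show which classes get mixed up: here one "autos" message was predicted as "space" and one "space" message as "autos", while "hockey" was never confused with anything.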

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
mat = confusion_matrix(test.target, labels)
# T means Transpose
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=train.target_names, yticklabels=train.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label');

## Calculate the accuracy of the Naive Bayes Classifier

In [None]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(test.target, labels)
accuracy
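Accuracy is the fraction of predictions that match the true label, which is the same as the diagonal of the confusion matrix divided by the total count. A single number can hide differences between classes, so the sketch below also computes per-class recall. The matrix values are made up for illustration.

```python
# Hypothetical 3-class confusion matrix (rows: actual, columns: predicted).
toy_matrix = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]

# Accuracy: correctly classified samples (the diagonal) over all samples.
correct = sum(toy_matrix[i][i] for i in range(len(toy_matrix)))
total = sum(sum(row) for row in toy_matrix)
toy_accuracy = correct / total

# Per-class recall: correct predictions for a class / all actual members of that class.
recall = [toy_matrix[i][i] / sum(toy_matrix[i]) for i in range(len(toy_matrix))]

print(toy_accuracy, recall)
```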

## Write a function for future use to predict news category

In [None]:
# Predicting category on new data based on trained model
def predict_category(s, train=train, model=model):  # s: a single text string; model: the fitted pipeline
    pred = model.predict([s])  # the pipeline vectorizes the text (TF-IDF), then applies Naive Bayes
    return train.target_names[pred[0]]  # map the predicted integer index back to the category name

## Let's test our trained model with a new dataset

In [None]:
news = pd.read_csv("articles1.csv",header=0)
news.head()

In [None]:
news.info()

In [None]:
# Take the first 10 rows as a smaller test dataset
news_test = news.iloc[:10]
news_test

In [None]:
# Use our NB model to classify the news

cate = []

# The loop variable is named `article` so it does not overwrite the `news` DataFrame
for article in news_test["content"]:
    category = predict_category(article)
    cate.append(category)

catedf = pd.DataFrame(cate, columns=["Category"])
catedf

In [None]:
# merge the two dataframes
news_test = pd.concat([news_test, catedf], axis=1, join="inner")
news_test

In [None]:
pd.set_option("display.max_colwidth", 500)

print(news_test[:5])

# Actions: Create a Spam filter for the text messages

### Instructions:

1. Clean the SMS texts
2. Conduct the feature engineering (Words to Vectors)
    - Tokenization
    - Word Frequency
    - Stemming
    - Lemmatization
    - Remove stopwords
3. Calculate the TF-IDF
4. Split the data into training and testing dataset
5. Build a Spam filter using Naive Bayes classifier

In [None]:
import pandas as pd

In [None]:
SMS = pd.read_csv('SpamSMStraining.txt', sep = '\t', header=None, names=["label", "sms"])
SMS.head()

In [None]:
# remove the stopwords from sms

In [None]:
# stemming the sms texts

In [None]:
# Lemmatization

In [None]:
# Tokenization 

In [None]:
# Regular word frequency (count how often each word appears)

In [None]:
# TF-IDF

In [None]:
# Split the dataset into training and testing dataset
from sklearn.model_selection import train_test_split

# Creating training and test sets (80-20): X = corpus; y = classifications
x_train, x_test, y_train, y_test = train_test_split(SMS["sms"], SMS["label"], test_size=0.2, random_state=10)
len(x_train), len(y_train), len(x_test), len(y_test)

In [None]:
print(x_test[:5])
print(y_test[:5])

In [None]:
# Creating a model based on Multinomial Naive Bayes using make_pipeline

In [None]:
# Training the model with the train data

In [None]:
# Predicting labels for the test data

In [None]:
# calculate the accuracy of your Naive Bayes classifier

In [None]:
# Write a function for future use to predict the spam messages

In [None]:
# test your function with five new SMS messages

# This is what happens when you reply to spam email

In [None]:
from IPython.display import YouTubeVideo

YouTubeVideo('4o5hSxvN_-s')

## References:

- https://jakevdp.github.io/PythonDataScienceHandbook/05.05-naive-bayes.html
- https://www.youtube.com/watch?v=l3dZ6ZNFjo0