# Week 5 Classifications Using Text Data

### Connect Colab to your Google Drive

In [None]:
#connect Colab to your Google Drive.
from google.colab import drive
import os
drive.mount('/content/gdrive')

# Part 1: News Categorization using Multinomial Naive Bayes

Classification is the task of choosing the correct class label for a given input. In basic classification tasks, each input is considered in isolation from all other inputs, and the set of labels is defined in advance. Some examples of classification tasks are:

* Deciding whether an email is spam or not.
*  **Deciding what the topic of a news article is, from a fixed list of topic areas such as "sports," "technology," and "politics."**
* Deciding whether a given occurrence of the word *bank* refers to a river bank, a financial institution, the act of tilting to the side, or the act of depositing something in a financial institution.

A classifier is called supervised if it is built based on **training data** containing the correct label for each input. 

In [None]:
from IPython.display import Image
Image(url="http://www.nltk.org/images/supervised-classification.png")

(a) During training, we have a set of input cases, for which we know their correct label. 
Then we take each input and we extract a set of _features_, which capture the basic information 
about each input. 
Pairs of feature sets and labels are fed into the machine learning algorithm to generate a model. 

(b) During prediction, we need to classify input for which we do not have the correct label. 
For that, we extract the same set of features from the input. We feed these features into the model, 
which generates predicted labels.
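
The two phases above can be sketched on a tiny, made-up example. This is only a minimal sketch: the four headlines and their labels are invented for illustration, and `CountVectorizer` and `MultinomialNB` stand in for the generic feature extractor and machine learning algorithm in the diagram.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labeled inputs (invented for this sketch).
train_texts = [
    "stocks fall as markets slide",
    "bank profits rise",
    "new flu vaccine trial",
    "doctors warn about flu",
]
train_labels = ["business", "business", "health", "health"]

# (a) Training: extract features from each input, then fit a model
# on the (feature set, label) pairs.
extractor = CountVectorizer()
train_features = extractor.fit_transform(train_texts)
model = MultinomialNB().fit(train_features, train_labels)

# (b) Prediction: extract the same features from unlabeled input,
# then ask the model for predicted labels.
new_texts = ["flu vaccine approved", "markets rise"]
new_features = extractor.transform(new_texts)
predicted_labels = model.predict(new_features)
print(list(predicted_labels))
```

Note that the extractor is fitted only once, during training, and reused unchanged during prediction so that both phases see the same feature set.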


The objective of this notebook is to show how to use the Multinomial Naive Bayes method to classify news headlines into predefined classes. (adapted from Andres Soto Villaverde)

The News Aggregator Data Set comes from the UCI Machine Learning Repository. 

* Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. 

### Data Source:
This specific dataset can be found in the UCI ML Repository at this URL: http://archive.ics.uci.edu/ml/datasets/News+Aggregator

### About Data:
This dataset contains headlines, URLs, and categories for 422,937 news stories collected by a web aggregator between March 10th, 2014 and August 10th, 2014. News categories in this dataset are labelled:

* b: business; 
* t: science and technology; 
* e: entertainment; and 
* m: health. 

### Our Goal:
Using the Multinomial Naive Bayes method, we will try to predict the category (business, entertainment, etc.) of a news article given its headline.

## Naive Bayes Classifier (Optional)

The Naive Bayes classifier is a probabilistic classifier that, given a set of features, tries to find the class with the highest probability. It is based on applying Bayes' theorem and is called naive because of its strong independence assumption between features: given the class, the presence or absence of each feature is assumed to be independent of every other feature. The posterior probability of a class is then proportional to the product of the prior probability of the class and the joint probability of all features given that class:

$$ P(y|x_1,\ldots,x_n) \propto P(y) \prod^n_{i=1} P(x_i|y) $$

Classification is based on the *maximum a posteriori* (MAP) decision rule, which simply picks the class (or author, in the example that follows) that is most probable:

$$ predict(x_1, \ldots, x_n) = \arg\max_y P(y) \prod^n_{i=1} P(x_i|y) $$

If you're unfamiliar with reading formulas, this might all seem quite daunting. To better understand what is going on, let's work out a small example. Say we have a small corpus of five very short books, and the author of the fifth book is unknown. The total vocabulary $V$ is 10 words long. For each book we store how often each word $w_i \in V$ occurs:


<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>the</th>
      <th>poetry</th>
      <th>society</th>
      <th>america</th>
      <th>realism</th>
      <th>a</th>
      <th>harry</th>
      <th>magic</th>
      <th>health</th>
      <th>system</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>David Foster Wallace</th>
      <td>  6</td>
      <td>  4</td>
      <td> 3</td>
      <td> 3</td>
      <td> 0</td>
      <td> 10</td>
      <td>  0</td>
      <td>  0</td>
      <td>  1</td>
      <td> 4</td>
    </tr>
    <tr>
      <th>David Foster Wallace</th>
      <td>  8</td>
      <td>  1</td>
      <td> 2</td>
      <td> 4</td>
      <td> 0</td>
      <td>  7</td>
      <td>  0</td>
      <td>  0</td>
      <td> 10</td>
      <td> 3</td>
    </tr>
    <tr>
      <th>Walt Whitman</th>
      <td> 12</td>
      <td> 10</td>
      <td> 1</td>
      <td> 8</td>
      <td> 0</td>
      <td>  4</td>
      <td>  0</td>
      <td>  0</td>
      <td>  0</td>
      <td> 4</td>
    </tr>
    <tr>
      <th>J.K. Rowling</th>
      <td>  8</td>
      <td>  0</td>
      <td> 0</td>
      <td> 0</td>
      <td> 0</td>
      <td>  15</td>
      <td> 12</td>
      <td> 10</td>
      <td>  0</td>
      <td> 0</td>
    </tr>
    <tr>
      <th>?</th>
      <td>  7</td>
      <td>  4</td>
      <td> 0</td>
      <td> 0</td>
      <td> 0</td>
      <td>  12</td>
      <td> 6</td>
      <td> 8</td>
      <td>  3</td>
      <td> 0</td>
    </tr>
  </tbody>
</table>

What is the probability $P(y=\textrm{Walt Whitman}|x = [12, 10, 1, 8, 0, 4, 0, 0, 0, 4])$? And what is the probability $P(y=\textrm{J.K. Rowling}|x = [7, 4, 0, 0, 0, 12, 6, 8, 3, 0])$?

The probability of a word like *the* given some author is computed by dividing the number of occurrences of that word by the total number of words for that author. In the case of Walt Whitman, the probability of the word *poetry* is:

$$
\begin{array}{lll}
P(x_i=\textrm{poetry}|y=\textrm{Walt Whitman}) & = & \frac{10}{12 + 10 + 1 + 8 + 0 + 4 + 0 + 0 + 0 + 4}\\
                                         & = & \frac{10}{39} \\
                                         & = & 0.256 \\
\end{array}
$$

The posterior probability of a class combines the prior probability of that class with the joint probability of all features given that class. This means that for each word $w_1, w_2, \ldots, w_n$ with a nonzero count in our unknown book, we compute the probability of that word given a particular author $y$: $P(w_i|y)$. We then take the product of these word probabilities (their joint probability under the independence assumption) and multiply it by the prior probability of the author, which gives a value proportional to the posterior probability. 
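
The calculation described above can be sketched in plain Python using the counts from the table. Two choices here are assumptions of this sketch rather than part of the text: add-one (Laplace) smoothing with `ALPHA = 1.0`, so that a word an author never used (for example *harry* for Walt Whitman) does not force the whole product to zero, and sums of logarithms instead of raw products to avoid very small numbers. Each word's log probability is weighted by its count in the unknown book, which is the multinomial form of the model. The prior is taken as the share of known books written by each author.

```python
import math

# Word counts from the table above, in vocabulary order.
vocab = ["the", "poetry", "society", "america", "realism",
         "a", "harry", "magic", "health", "system"]
books = [
    ("David Foster Wallace", [6, 4, 3, 3, 0, 10, 0, 0, 1, 4]),
    ("David Foster Wallace", [8, 1, 2, 4, 0, 7, 0, 0, 10, 3]),
    ("Walt Whitman",         [12, 10, 1, 8, 0, 4, 0, 0, 0, 4]),
    ("J.K. Rowling",         [8, 0, 0, 0, 0, 15, 12, 10, 0, 0]),
]
unknown = [7, 4, 0, 0, 0, 12, 6, 8, 3, 0]

# Sum the word counts per author and count books per author (for the prior).
totals, n_books = {}, {}
for author, counts in books:
    previous = totals.get(author, [0] * len(vocab))
    totals[author] = [p + c for p, c in zip(previous, counts)]
    n_books[author] = n_books.get(author, 0) + 1

# Unsmoothed estimate from the text: P(poetry | Walt Whitman) = 10 / 39.
whitman = totals["Walt Whitman"]
p_poetry_whitman = whitman[vocab.index("poetry")] / sum(whitman)

ALPHA = 1.0  # assumed add-one smoothing, not specified in the text

def log_posterior(author):
    """Log of P(y) * prod P(w_i | y)^count_i, up to a constant."""
    counts = totals[author]
    denominator = sum(counts) + ALPHA * len(vocab)
    score = math.log(n_books[author] / len(books))  # log prior
    for x, c in zip(unknown, counts):
        score += x * math.log((c + ALPHA) / denominator)
    return score

scores = {author: log_posterior(author) for author in totals}
predicted_author = max(scores, key=scores.get)  # MAP decision rule
print(round(p_poetry_whitman, 3), predicted_author)
```

Under these assumptions the unknown book is attributed to J.K. Rowling, mainly because *harry* and *magic* occur often in it and only in her known book.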

Let's begin by importing the pandas (Python Data Analysis Library) module, along with the text vectorizers from scikit-learn. The import statement is the most common way to gain access to the code in another module. 

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
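#import the tab-separated file and put it into a Pandas dataframe.
#The file has no header row, so pass header=None and supply the column names;
#otherwise the first record would be consumed as the header.
news=pd.read_csv('/content/gdrive/My Drive/CIS NLP Data Sets/newsCorpora.csv', sep="\t", header=None,
                 names=["ID","TITLE","URL","PUBLISHER","CATEGORY","STORY","HOSTNAME","TIMESTAMP"])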

In [None]:
news.head()
#print (news.iloc[:10,:])

In [None]:
#How many columns and rows?
print ("Shape:", news.shape)

#Column names?
print ("Column Names",news.columns.values)

In [None]:
categories = news['CATEGORY']
titles = news['TITLE']


#What values exist within category?
#Sort them so the order is deterministic and matches the order LabelEncoder uses.
labels = sorted(set(categories))
print('possible categories',labels)

#b: business
#t: science & technology
#e: entertainment
#m: health

#cross-tabulation of category column.
num_freq=news['CATEGORY'].value_counts()
print (num_freq)


In [None]:
#Categories are string labels, but machine learning algorithms work more easily with numbers,
#so we encode them with LabelEncoder, which maps labels to integers between 0 and n_classes-1.

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
ncategories = encoder.fit_transform(categories)

#What values exist within category?
labels_category = list(set(ncategories))
print('possible encoded categories',labels_category)
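
To see which integer stands for which category, the fitted encoder can be inspected. The sketch below uses a small hypothetical list of category codes (`toy_categories`) rather than the real column; `classes_` and `inverse_transform` are standard `LabelEncoder` members.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of category codes, for illustration only.
toy_categories = ["t", "b", "m", "e", "b", "t"]

toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(toy_categories)

# classes_ is sorted, so the label at position i is the one encoded as i.
code_to_label = dict(enumerate(toy_encoder.classes_))

# inverse_transform maps integer codes back to the original labels.
decoded = toy_encoder.inverse_transform(toy_codes)

print(list(toy_codes))
print(code_to_label)
```

Because the classes are sorted alphabetically, the four news categories would be encoded as b=0, e=1, m=2, t=3, which matters later when naming the rows of a report or confusion matrix.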


Now we should split our data into two sets:

1. a training set (70%) used to discover potentially predictive relationships, and
2. a test set (30%) used to evaluate whether the discovered relationships hold and to assess the strength and utility of a predictive relationship.

In [None]:
#Shuffle the samples first, then split them into a train set and a test set.
#Titles and labels are shuffled together so that each title keeps its label.


from sklearn.utils import shuffle
titles, ncategories = shuffle(titles, ncategories, random_state=0) #shuffle both arrays with the same permutation

N = len(titles)
Ntrain = int(N * 0.7) #we use 70% of the data as training set

#You can use train_test_split function for this step too.
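
As the comment above notes, `train_test_split` can do the shuffle and the 70/30 split in one call. The following is a minimal sketch on made-up data (`toy_titles` and `toy_labels` are hypothetical); the `stratify` argument is an optional extra, not something the notebook uses, that keeps class proportions similar in both sets.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the titles and encoded categories.
toy_titles = [f"headline {i}" for i in range(20)]
toy_labels = [i % 4 for i in range(20)]  # four balanced classes

# test_size=0.3 holds out 30% of the samples; shuffling is on by default.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_titles,
    toy_labels,
    test_size=0.3,
    random_state=0,       # fixed seed for a reproducible split
    stratify=toy_labels,  # optional: preserve class balance
)

print(len(X_tr), len(X_te))
```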

In [None]:
X_train = titles[:Ntrain]
print('X_train.shape',X_train.shape)
y_train = ncategories[:Ntrain]
print('y_train.shape',y_train.shape)
X_test = titles[Ntrain:]
print('X_test.shape',X_test.shape)
y_test = ncategories[Ntrain:]
print('y_test.shape',y_test.shape)


In [None]:
print (X_test.shape[0]/(X_train.shape[0]+X_test.shape[0])) #fraction of the data held out for testing

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

In [None]:

#create CountVectorizer object.
coun_vect=CountVectorizer(stop_words="english", lowercase=True)

#generate training BoW vectors.
X_train_bow = coun_vect.fit_transform(X_train)

#generate test BoW vectors.
X_test_bow = coun_vect.transform(X_test)

In [None]:
print("Unique Vocabulary: ",len(coun_vect.vocabulary_))

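# Print the first 10 features of the CountVectorizer
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()).
print(coun_vect.get_feature_names_out()[:10])

#convert the sparse BoW counts to a dense numerical matrix (memory-heavy for a corpus of this size).
#X_train_bow_array=X_train_bow.toarray()

#print (X_train_bow_array.shape)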

In [None]:
#generate MultinomialNB object.
clf=MultinomialNB()

#Fit the Naive Bayes classifier to the train set.
text_clf = clf.fit(X_train_bow, y_train)

In [None]:
#apply the classifier to the test set and calculate the predicted values.
print('Predicting...')
predicted = text_clf.predict(X_test_bow)


In [None]:
from sklearn import metrics

print('accuracy_score',metrics.accuracy_score(y_test,predicted))
print('Reporting...')

In [None]:
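# Precision/Recall/F1-score measures for each class in the test data.
# target_names must follow the encoded label order, which is the encoder's sorted classes_.
print(metrics.classification_report(y_test, predicted, target_names=encoder.classes_))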

In [None]:
# Create a confusion matrix, which compares y_test with the predicted labels.
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

cm = confusion_matrix(y_test, predicted)
# Rows and columns follow the encoded label order 0..3,
# i.e. LabelEncoder's sorted classes b, e, m, t.
class_names = ['business','entertainment','health','science']
cm_df = pd.DataFrame(cm, index = class_names, columns = class_names)

#Plotting the confusion matrix
plt.figure(figsize=(8,6))
sns.heatmap(cm_df, annot=True , fmt=".0f")
plt.title('Confusion Matrix')
plt.ylabel('Actual Values')
plt.xlabel('Predicted Values')
plt.show()
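
When reading the matrix, rows are the actual classes and columns are the predicted classes, both in sorted label order. A small sketch with invented labels (`actual_toy` and `predicted_toy` are hypothetical) shows the layout and how per-class recall falls out of it:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted category codes.
actual_toy = ["b", "b", "e", "m", "t", "t"]
predicted_toy = ["b", "t", "e", "m", "t", "t"]
label_order = ["b", "e", "m", "t"]  # explicit row/column order

cm_toy = confusion_matrix(actual_toy, predicted_toy, labels=label_order)

# Diagonal = correct predictions; row sums = number of actual items per class.
recall_per_class = cm_toy.diagonal() / cm_toy.sum(axis=1)

print(cm_toy)
print(recall_per_class)
```

Here one of the two business headlines was predicted as science and technology, so it appears in row `b`, column `t`, and the recall for business is 0.5.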

# Part 2: Spam Classification

### About Data:
- There are 2 columns.

- The first column is the label: 'ham' signifies that the message is not spam, and 'spam' signifies that the message is spam.

- The second column: The text content of the message.

In [None]:
import pandas as pd

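#import the tab-separated file and put it into a Pandas dataframe.
#The file has no header row, so supply the column names here;
#otherwise the first message would be consumed as the header.
message=pd.read_table('/content/gdrive/My Drive/CIS NLP Data Sets/SMSSpamCollection', sep="\t",
                      header=None, names=['label', 'sms_message'])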

In [None]:
# Output printing out first 5 rows
message.head()

In [None]:
#How many columns and rows?
print ("Shape:", message.shape)

#Column names?
print ("Column Names",message.columns.values)

count=message['label'].value_counts()
print (count)

### 1: Data Preprocessing ###

Convert our labels to binary variables, 0 to represent 'ham'(i.e. not spam) and 1 to represent 'spam' for ease of computation. 

In [None]:
message['label'] = message.label.map({'ham':0, 'spam':1})

#How many columns and rows?
print ("Shape:", message.shape)

#Column names?
print ("Column Names",message.columns.values)



In [None]:
# Output printing out first 5 rows
message.head()

### 2: Bag of words ###

Convert a collection of documents to a matrix, with each document being a row and each word (token) being a column, and the corresponding (row, column) values being the frequency of occurrence of each word or token in that document.


**Please Note:** 

* By default, `CountVectorizer` converts all tokens to lower case so that it does not treat words like 'He' and 'he' differently. Here we set the `lowercase` parameter to `False`, on the assumption that many spam messages use all-capital words, so we keep the original casing.

* It also ignores all punctuation, so that a word followed by a punctuation mark (e.g. 'hello!') is treated the same as the bare word (e.g. 'hello'). This behavior comes from the `token_pattern` parameter, whose default regular expression selects tokens of 2 or more alphanumeric characters.

* The third parameter to take note of is `stop_words`. To remove common English stop words, set `stop_words` to `'english'`.
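
The effect of these three parameters can be checked on a couple of made-up messages. This is a small sketch; `toy_messages` is hypothetical and chosen so that the difference is easy to see.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical messages, for illustration only.
toy_messages = ["WIN the prize!", "win the game"]

# Defaults: lowercase=True, no stop-word removal.
default_vec = CountVectorizer()
default_vec.fit(toy_messages)

# Keep the original casing and drop English stop words such as "the".
cased_vec = CountVectorizer(lowercase=False, stop_words="english")
cased_vec.fit(toy_messages)

default_vocab = sorted(default_vec.vocabulary_)
cased_vocab = sorted(cased_vec.vocabulary_)

print(default_vocab)
print(cased_vocab)
```

With the defaults, 'WIN' and 'win' collapse into one token and 'the' is kept; with `lowercase=False` and `stop_words="english"`, the two casings stay separate and 'the' is removed. In both cases the exclamation mark is dropped by the default `token_pattern`.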


### 3: Training and testing sets (before we apply Count Vectorizer) ###


>>**Instructions:**
Split the dataset into a training and testing set by using the `train_test_split` function in sklearn.
* `X_train` is our training data for the 'sms_message' column.
* `y_train` is our training data for the 'label' column.
* `X_test` is our testing data for the 'sms_message' column.
* `y_test` is our testing data for the 'label' column.



In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(message['sms_message'], 
                                                    message['label'],
                                                    random_state=0, 
                                                    test_size=0.25
                                                    )

print('Number of rows in the total set: {}'.format(message.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))

In [None]:
print (X_test.shape[0]/message.shape[0]) #fraction of the data held out for testing

### Continue on #2, Applying Bag of Words processing to our dataset. ###

* First, fit the vectorizer on our training data (`X_train`) and return the matrix. The code in this section uses `TfidfVectorizer`, which builds the same bag-of-words counts as `CountVectorizer` and then applies TF-IDF weighting.
* Second, transform our testing data (`X_test`) with the already-fitted vectorizer to return the matrix. 

`X_train` is our training data for the 'sms_message' column in our dataset and we will be using this to train our model. 

`X_test` is our testing data for the 'sms_message' column. After transforming it to a matrix, we will use it to make predictions, and we will then compare those predictions with `y_test` in a later step. 
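
The difference between fitting on the training text and only transforming the test text can be seen on a toy example. The sketch uses `CountVectorizer` for readability and invented messages (`toy_train`, `toy_test`); the same fit/transform pattern applies to `TfidfVectorizer`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical messages, for illustration only.
toy_train = ["win cash now", "see you at lunch"]
toy_test = ["win lunch tomorrow"]

vec = CountVectorizer()

# fit_transform learns the vocabulary from the training text and encodes it.
train_matrix = vec.fit_transform(toy_train)

# transform reuses that vocabulary; "tomorrow" was never seen, so it is dropped.
test_matrix = vec.transform(toy_test)

print(train_matrix.shape, test_matrix.shape)
print(sorted(vec.vocabulary_))
print(test_matrix.toarray())
```

Both matrices have the same number of columns, which is what lets a model trained on one make predictions on the other. Fitting on the test text as well would leak information about the test set into the features.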

In [None]:
#apply the vectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer

#we are learning a vocabulary dictionary for the training data and then transforming the data into a document-term matrix.
#Then, for the testing data, we are only transforming the data into a document-term matrix.

# Instantiate TfidfVectorizer (bag-of-words counts reweighted by TF-IDF)
tf_vector = TfidfVectorizer(lowercase=False,
                            stop_words='english',
                            use_idf=True)

# Fit the training data and then return the matrix
training_data = tf_vector.fit_transform(X_train)

# Transform testing data and return the matrix. Note we are not fitting the vectorizer on the testing data
testing_data = tf_vector.transform(X_test)

#https://stackoverflow.com/questions/23838056/what-is-the-difference-between-transform-and-fit-transform-in-sklearn

In [None]:
print ("Shape of training set",training_data.shape)

print ("Shape of testing set",testing_data.shape)

### 4: Apply NB Algorithm ###

In [None]:
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()

naive_bayes.fit(training_data, y_train)



In [None]:
#predict the labels for testing set.
predictions = naive_bayes.predict(testing_data)

### 5: Evaluate the Model. ###

Accuracy, precision, recall, F1 score


- Accuracy  
measures how often the classifier makes the correct prediction. It’s the ratio of the number of correct predictions to the total number of predictions.

- Precision 
what proportion of the messages we classified as spam actually were spam.
It is the ratio of true positives (messages classified as spam that actually are spam) to all predicted positives (all messages classified as spam, whether or not that classification was correct).

`[True Positives/(True Positives + False Positives)]`

- Recall (sensitivity)
what proportion of the messages that actually were spam we classified as spam.<br>
It is the ratio of true positives (messages classified as spam that actually are spam) to all the messages that were actually spam.

`[True Positives/(True Positives + False Negatives)]`

For classification problems with skewed class distributions, as in our case, accuracy by itself is not a very good metric. Suppose, for example, that among 100 text messages only 2 were spam.

We could classify 90 messages as not spam (including the 2 that were actually spam, which would therefore be false negatives) and 10 as spam (all 10 false positives) and still get an accuracy of 0.88 without catching a single spam message. For such cases, precision and recall come in very handy. These two metrics can be combined into the F1 score, which is the harmonic mean of the precision and recall scores. This score can range from 0 to 1, with 1 being the best possible F1 score.

All four metrics range from 0 to 1, and a score as close to 1 as possible is a good indicator of how well our model is doing.
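
The skewed example above can be worked through numerically. The counts below are read directly from that example (10 false positives, 2 false negatives, no spam caught, and the remaining 88 messages correctly left as not spam); the guards against division by zero are a choice made for this sketch.

```python
# Confusion counts for the 100-message example, with spam as the positive class.
tp = 0    # spam correctly flagged as spam
fp = 10   # non-spam wrongly flagged as spam
fn = 2    # spam missed (classified as not spam)
tn = 88   # non-spam correctly left alone

total = tp + fp + fn + tn

accuracy = (tp + tn) / total
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

# F1 is the harmonic mean of precision and recall.
if precision + recall:
    f1 = 2 * precision * recall / (precision + recall)
else:
    f1 = 0.0

print(accuracy, precision, recall, f1)
```

Accuracy looks respectable at 0.88, while precision, recall, and F1 are all 0, which is exactly why accuracy alone is misleading on imbalanced data.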

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy score: ', format(accuracy_score(y_test, predictions)))
print('Precision score: ', format(precision_score(y_test, predictions)))
print('Recall score: ', format(recall_score(y_test, predictions)))
print('F1 score: ', format(f1_score(y_test, predictions)))

In [None]:
# Precision/Recall/F1-score measures for each class in the test data.
from sklearn import metrics
print(metrics.classification_report(y_test, predictions, target_names=['ham', 'spam']))

In [None]:
# Creating  a confusion matrix,which compares the y_test and y_pred.
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

cm = confusion_matrix(y_test, predictions)
cm_df = pd.DataFrame(cm,index = ['ham','spam'],
                     columns = ['ham','spam']  
                     )

#Plotting the confusion matrix
plt.figure(figsize=(6,4))
sns.heatmap(cm_df, annot=True , fmt=".0f")
plt.title('Confusion Matrix')
plt.ylabel('Actual Values')
plt.xlabel('Predicted Values')
plt.show()