# Text Classification with Naive Bayes
This week you learned about text classification with the Naive Bayes machine learning algorithm. In this lab, you will work through the steps of training a text classifier using some of the techniques covered in the lecture.

We will use [NLTK](https://www.nltk.org/) and [Scikit-learn](https://scikit-learn.org/stable/) as we go through the lab sheet.


## Import your data

The dataset you will be using can be downloaded from this [link](https://archive.ics.uci.edu/ml/machine-learning-databases/00228/). Download the zip file and make sure you read through the readme file. 

In [49]:
import pandas as pd
# Read the text file into a DataFrame
data = pd.read_csv('/home/xinpeng/UoB/NLP/Week5/smsspamcollection/SMSSpamCollection', header=None, sep='\t')
data.columns = ['Tag', 'SMS']
print(data)

       Tag                                                SMS
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...
...    ...                                                ...
5567  spam  This is the 2nd time we have tried 2 contact u...
5568   ham               Will ü b going to esplanade fr home?
5569   ham  Pity, * was in mood for that. So...any other s...
5570   ham  The guy did some bitching but I acted like i'd...
5571   ham                         Rofl. Its true to its name

[5572 rows x 2 columns]


In [50]:
print(f'We have a total of {len(data)} examples.')

We have a total of 5572 examples.


Now change the label column in the dataframe so that we have:
* 0 for ham
* 1 for spam

We also want to check whether the classes are imbalanced and get a feel for what the data looks like. Make sure you understand what you are classifying before diving in.

Use pandas to count the number of examples in each class.

In [51]:
# Map the string labels to integers: 0 for ham, 1 for spam
data['Tag'] = data['Tag'].map({'ham': 0, 'spam': 1})
data['Tag'].value_counts()

0    4825
1     747
Name: Tag, dtype: int64
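
The counts above show 747 spam messages out of 5572, roughly 13.4%, so the classes are clearly imbalanced. A minimal sketch of how relative class frequencies could be inspected is shown here; the `toy` DataFrame is a hypothetical stand-in for the real data so that the snippet runs on its own.

```python
import pandas as pd

# Hypothetical toy stand-in for the SMS DataFrame (the real one has 5572 rows).
toy = pd.DataFrame({
    "Tag": ["ham", "ham", "ham", "spam"],
    "SMS": ["see you at 5", "ok lar", "call me later", "WIN a free prize now"],
})

# Map the string labels to integers: 0 for ham, 1 for spam.
toy["Tag"] = toy["Tag"].map({"ham": 0, "spam": 1})

# Absolute counts and relative frequencies per class.
counts = toy["Tag"].value_counts()
shares = toy["Tag"].value_counts(normalize=True)
print(counts)
print(shares.round(3))
```

With an imbalance like this, accuracy alone can look good even for a classifier that always predicts ham, which is one reason to look at per-class metrics later on.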

## Preprocess your data
Now that we have seen what our data looks like, we can think about how to preprocess it. We can start with simple preprocessing and then, if needed, add more steps such as lemmatisation.

1. Lowercase the text
2. Tokenize
3. Remove stop words
4. Remove punctuation and non-alphabetic characters
5. Lemmatise the words in the text
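
The five steps above can be sketched as a single function. This is only an illustration under some assumptions: the function name `preprocess`, the injected `lemmatize` callable, and the tiny `demo_stop_words` set are all hypothetical, chosen so the snippet runs without any NLTK downloads. The regex handles lowercasing, tokenizing, and non-alphabetic removal in one pass, so the order differs slightly from the list.

```python
import re

def preprocess(text, stop_words, lemmatize=lambda w: w):
    """Lowercase, tokenize on alphabetic runs, drop stop words, lemmatise."""
    # Steps 1, 2 and 4: lowercase, then keep only runs of letters. This
    # tokenizes and removes digits and punctuation in one pass.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Step 3: stop-word removal.
    tokens = [t for t in tokens if t not in stop_words]
    # Step 5: lemmatisation through the injected callable (identity by default).
    return [lemmatize(t) for t in tokens]

# Hypothetical tiny stop-word set, for illustration only.
demo_stop_words = {"the", "to", "a", "in", "is"}
result = preprocess("Free entry in 2 a wkly comp to win!", demo_stop_words)
print(result)  # ['free', 'entry', 'wkly', 'comp', 'win']
```

In the lab itself you would likely pass `set(stopwords.words('english'))` and `WordNetLemmatizer().lemmatize` instead; the latter needs the NLTK `wordnet` resource to be downloaded. Note that this regex splits contractions such as "don't" into "don" and "t".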



In [52]:
import nltk

In [53]:
nltk.download('stopwords')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/xinpeng/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/xinpeng/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [54]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

Make everything lower case and tokenize.

Remove stop words, non-alphabetic characters and punctuation, and lemmatise the text.

Compare the original text to the preprocessed text.

In [76]:
import re
# Earlier word_tokenize-based variant, left commented out:
#data['tokenized_SMS'] = data.apply(lambda row: nltk.word_tokenize(row['SMS'].lower()), axis=1)
#data['removal'] = data.apply(lambda row: re.findall(r'[a-zA-Z]+', ' '.join(row['tokenized_SMS'])), axis=1)
# Lowercase each message and keep only runs of letters (drops digits and punctuation)
data['removal'] = data['SMS'].apply(lambda sms: re.findall(r'[a-zA-Z]+', sms.lower()))
# Remove English stop words; a set makes the membership test fast
nltk_stopwords = set(stopwords.words('english'))
data['removal2'] = data['removal'].apply(lambda words: [word for word in words if word not in nltk_stopwords])
print(data)

     Tag                                                SMS  \
0      0  Go until jurong point, crazy.. Available only ...   
1      0                      Ok lar... Joking wif u oni...   
2      1  Free entry in 2 a wkly comp to win FA Cup fina...   
3      0  U dun say so early hor... U c already then say...   
4      0  Nah I don't think he goes to usf, he lives aro...   
...   ..                                                ...   
5567   1  This is the 2nd time we have tried 2 contact u...   
5568   0               Will ü b going to esplanade fr home?   
5569   0  Pity, * was in mood for that. So...any other s...   
5570   0  The guy did some bitching but I acted like i'd...   
5571   0                         Rofl. Its true to its name   

                                          tokenized_SMS  \
0     [go, until, jurong, point, ,, crazy, .., avail...   
1              [ok, lar, ..., joking, wif, u, oni, ...]   
2     [free, entry, in, 2, a, wkly, comp, to, win, f...   
3     [

## Split the data
Now we will use sklearn to split our data into training and test sets.

In [79]:
from sklearn.model_selection import train_test_split
data_train, data_test = train_test_split(data)
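
By default, `train_test_split` holds out 25% of the rows and shuffles them differently on every run. Given the class imbalance, one option worth considering is a stratified, seeded split. The sketch below uses a hypothetical `toy` DataFrame with the same `Tag` column; the `test_size` and `random_state` values are arbitrary choices, not part of the lab.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with the same 'Tag' column as the lab data.
toy = pd.DataFrame({
    "Tag": [0] * 16 + [1] * 4,
    "SMS": [f"message {i}" for i in range(20)],
})

# Hold out 25% for testing. stratify keeps the ham/spam ratio the same in
# both parts, and random_state makes the split repeatable.
toy_train, toy_test = train_test_split(
    toy, test_size=0.25, stratify=toy["Tag"], random_state=0
)
print(len(toy_train), len(toy_test))
print(toy_test["Tag"].value_counts())
```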

## Feature selection

Let's experiment with feature selection. We will test two different methods for this:


1.   Word frequency
2.   TF-IDF weighting

Implement both. We will train and test our algorithms with both to see which one works better for our data.

Hint: this is where we are vectorizing our data




In [None]:
from sklearn.feature_extraction.text import CountVectorizer

Take a look to see what our feature names are

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

## Train the Naive Bayes Classifier

Train the model on the data vectorized with the count vectorizer.

In [None]:
from sklearn.naive_bayes import MultinomialNB

Train the model on the data vectorized with the tfidf vectorizer.
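
One way the two training steps might look, sketched on hypothetical toy data. The names `nb_count` and `nb_tfidf` are illustrative; `MultinomialNB` is used with its defaults, which include add-one smoothing (`alpha=1.0`).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data: 1 = spam, 0 = ham.
train_docs = ["free prize win", "win free entry", "call me later", "see you later"]
train_labels = [1, 1, 0, 0]

# One vectorizer and one classifier per feature type.
count_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()
nb_count = MultinomialNB().fit(count_vec.fit_transform(train_docs), train_labels)
nb_tfidf = MultinomialNB().fit(tfidf_vec.fit_transform(train_docs), train_labels)

# Each model must be fed features from its own vectorizer.
new_doc = ["win a free prize"]
pred_count = nb_count.predict(count_vec.transform(new_doc))
pred_tfidf = nb_tfidf.predict(tfidf_vec.transform(new_doc))
print(pred_count, pred_tfidf)
```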

## Test and Results
Now that we have a trained classifier, let's test whether it generalises well to unseen examples. Let's compare the F1 score for the two feature selection methods we used earlier.

Make sure to note which one seems to work better for our dataset.

In [None]:
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

Test the count model on the test data. Then print the confusion matrix and classification report comparing the gold labels to the model's predictions.

Test the tfidf model on the test data. Then print the confusion matrix and classification report comparing the gold labels to the model's predictions.
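
The evaluation calls can be sketched as follows. The `y_true` and `y_pred` lists are hypothetical values made up for illustration, not results from the SMS data; in the lab they would be the test-set gold labels and the output of each model's `predict`.

```python
from sklearn.metrics import classification_report, confusion_matrix, f1_score

# Hypothetical gold labels and predictions (0 = ham, 1 = spam).
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# Rows are gold labels, columns are predictions.
cm = confusion_matrix(y_true, y_pred)
# By default f1_score reports F1 for the positive class (label 1, spam).
f1 = f1_score(y_true, y_pred)

print(cm)
print(classification_report(y_true, y_pred, target_names=["ham", "spam"]))
print(f"F1 (spam): {f1:.3f}")
```

Here the matrix is `[[4, 1], [1, 2]]`: four ham messages classified correctly, one ham flagged as spam, one spam missed, and two spam caught, giving precision and recall of 2/3 each for the spam class.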

# Bonus: Multiclass classification

Now that you have implemented a binary classifier, let's do the same for multiple classes. We will be working with the [20 newsgroups dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html), available in sklearn's datasets library.

Implement Naive Bayes for 5 classes (out of the 20 available) on this dataset.
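
A possible starting point is sketched below. The five categories are an arbitrary choice for illustration, and the helper names `load_five_newsgroups` and `build_model` are hypothetical. The TF-IDF pipeline and the `remove` option are assumptions about a reasonable setup rather than the required solution.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Five of the 20 categories, chosen arbitrarily for illustration.
CATEGORIES = ["sci.space", "sci.med", "rec.autos", "rec.sport.hockey", "talk.politics.guns"]

def load_five_newsgroups():
    """Download (on first call) and return the train and test subsets."""
    # Stripping headers, footers and quotes makes the task less trivially easy.
    kwargs = dict(categories=CATEGORIES, remove=("headers", "footers", "quotes"))
    train = fetch_20newsgroups(subset="train", **kwargs)
    test = fetch_20newsgroups(subset="test", **kwargs)
    return train, test

def build_model():
    """TF-IDF features followed by multinomial Naive Bayes."""
    return make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())

# Usage (needs a network connection the first time the data is fetched):
# train, test = load_five_newsgroups()
# model = build_model().fit(train.data, train.target)
# print(model.score(test.data, test.target))
```

`MultinomialNB` handles any number of classes without changes, so the only differences from the binary case are the data loading and how you read the confusion matrix.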