# Text Classification with Naive Bayes
This week you learned about text classification using the Naive Bayes machine learning algorithm. In this lab you will go through the steps of training a text classifier, using some of the techniques covered in this week's lecture.

We will use [NLTK](https://www.nltk.org/) and [Scikit-learn](https://scikit-learn.org/stable/) as we go through the lab sheet.


## Import your data

The dataset you will be using can be downloaded from this [link](https://archive.ics.uci.edu/ml/machine-learning-databases/00228/). Download the zip file and make sure you read through the readme file. 

In [1]:
import pandas as pd

data = pd.read_csv('/content/dataset/SMSSpamCollection', sep = '\t', names=['label', 'text'], header=None)

In [2]:
data.head()

|   | label | text |
|---|-------|------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [3]:
print(f'We have a total of {len(data)} examples.')

We have a total of 5572 examples.


Now change the label column in the dataframe so that we have:
* 0 for ham
* 1 for spam

In [4]:
data.loc[data['label'] == 'ham', 'label'] = 0
data.loc[data['label'] == 'spam', 'label'] = 1

# to verify that it has been done correctly
data.head()

|   | label | text |
|---|-------|------|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |
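
As an alternative to the two `.loc` assignments above, the relabelling can be sketched with `Series.map`, which also yields an integer column directly. The tiny dataframe here is a made-up stand-in for the SMS data, not part of the dataset.

```python
import pandas as pd

# Hypothetical three-row stand-in for the SMS dataframe
toy = pd.DataFrame({
    'label': ['ham', 'spam', 'ham'],
    'text': ['see you soon', 'win a free prize', 'ok thanks'],
})

# Map each string label to its integer class in one pass
toy['label'] = toy['label'].map({'ham': 0, 'spam': 1})

print(toy['label'].tolist())
print(toy['label'].dtype)
```

Any label missing from the dictionary would become `NaN`, so it is worth checking the column after mapping.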


We also want to explore whether we have a class imbalance and what our data looks like. Make sure you understand what you are classifying before diving in.

In [5]:
data.groupby('label').count()

| label | text |
|-------|------|
| 0 | 4825 |
| 1 | 747 |


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   5572 non-null   object
 1   text    5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


We can see (as already stated in the readme file) that we have a class imbalance. Think about how this might affect your algorithm (if at all).
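
To get a feel for why the imbalance matters, here is a small sketch using the class counts printed above. A classifier that always predicts ham would already reach about 87% accuracy while catching no spam at all, which is one reason to look at per-class metrics such as F-1 rather than accuracy alone.

```python
# Class counts taken from the groupby output above
n_ham, n_spam = 4825, 747
total = n_ham + n_spam

spam_share = n_spam / total
# Accuracy of a trivial classifier that labels every message as ham
majority_baseline_accuracy = n_ham / total

print(f'Spam share: {spam_share:.3f}')
print(f'Always-ham accuracy: {majority_baseline_accuracy:.3f}')
```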

## Preprocess your data
Now that we have seen what our data looks like, we can think about how to pre-process it. We start with simple pre-processing and can add further steps later if needed. The steps are:

1. Convert the words to lower case
2. Tokenize the text
3. Remove stop words
4. Remove punctuation and non-alphabetic tokens
5. Lemmatise the words in the text
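
The steps above can be sketched without any downloads, as a rough illustration only. This version assumes a regex tokenizer and a tiny made-up stop-word set (`TOY_STOPWORDS`) in place of NLTK's, and it skips lemmatisation, so its output will differ from the NLTK pipeline used in this lab.

```python
import re

# Hypothetical, deliberately tiny stop-word set (NLTK's English list is much longer)
TOY_STOPWORDS = {'a', 'in', 'to', 'the', 'so', 'then', 'he', 'i'}

def toy_preprocess(sms):
    """Lower-case, tokenize, drop stop words and non-alphabetic tokens."""
    tokens = re.findall(r"\w+|[^\w\s]", sms.lower())        # steps 1 and 2
    tokens = [t for t in tokens if t not in TOY_STOPWORDS]  # step 3
    tokens = [t for t in tokens if t.isalpha()]             # step 4
    return ' '.join(tokens)

print(toy_preprocess('Free entry in 2 a wkly comp to win FA Cup'))
```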



In [7]:
import nltk

In [8]:
nltk.download('stopwords')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [9]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

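We make everything lower case and tokenize. Note that `word_tokenize` needs NLTK's `punkt` tokenizer data (`punkt_tab` on recent NLTK releases) and `WordNetLemmatizer` needs the `wordnet` corpus; fetch them with `nltk.download` if they are missing.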

In [10]:
tokenized_data = [word_tokenize(sms.lower()) for sms in data['text']] #text converted to lower case and tokenized

Remove stopwords, non-alphabetic characters and punctuation, and lemmatise the text.

In [11]:
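def remove_stopwords(tknzd_text):
  stop_words = set(stopwords.words('english')) #build the stop-word set once instead of reloading the list for every token
  tokens = [] #list of tokens w/o stopwords
  for token in tknzd_text:
    if token not in stop_words:
      tokens.append(token)
  return tokens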


In [12]:
def remove_non_alpha(tknzd_text):
  """
  This function removes any token that is not alphabetic. This means that punctuation and numeric tokens will be removed.
  This is something we can play around with, to see which pre-processing steps help us achieve a higher F-1 score.
  """
  alpha_tokens = [] #list of tokens that are only alphabetic 
  for token in tknzd_text:
    if token.isalpha():
      alpha_tokens.append(token)
  return alpha_tokens

In [13]:
def lemmatise(tknzd_text):
  lemma_tokens = [] #list of lemmatized tokens
  lemmatizer = WordNetLemmatizer()
  for token in tknzd_text:
    lemma_tokens.append(lemmatizer.lemmatize(token))
  
  lemmatized_text = " ".join(lemma_tokens)
  return lemmatized_text

In [14]:
def preprocess(tokenized_data):
  pp_data = []  #list of preprocessed sms. This is not tokenized text anymore
  for tknzd_sms in tokenized_data:
    pp_text = remove_stopwords(tknzd_sms)
    pp_text = remove_non_alpha(pp_text)
    pp_text = lemmatise(pp_text)
    pp_data.append(pp_text)
  return pp_data

In [15]:
preprocessed_data = preprocess(tokenized_data)

Compare the original text to the preprocessed text.

In [16]:
original_data = data['text']

In [17]:
print(f'Before preprocessing:\n {original_data[:5]}')

Before preprocessing:
 0    Go until jurong point, crazy.. Available only ...
1                        Ok lar... Joking wif u oni...
2    Free entry in 2 a wkly comp to win FA Cup fina...
3    U dun say so early hor... U c already then say...
4    Nah I don't think he goes to usf, he lives aro...
Name: text, dtype: object


In [18]:
data['text']= preprocessed_data

In [20]:
print(f'After preprocessing:\n {data["text"][:5]}')

After preprocessing:
 0    go jurong point crazy available bugis n great ...
1                              ok lar joking wif u oni
2    free entry wkly comp win fa cup final tkts may...
3                  u dun say early hor u c already say
4                  nah think go usf life around though
Name: text, dtype: object


## Split the data
Now we will use sklearn to split our data into train and test sets.

In [21]:
from sklearn.model_selection import train_test_split

In [22]:
X = data['text']
y = data['label']

In [23]:
data['text'][1978]

'reply win weekly fifa world cup held send stop end service'

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, shuffle=True)
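
Because the classes are imbalanced, one option worth trying is a stratified split, which keeps the ham/spam ratio the same in both sets. The sketch below uses made-up labels (80 zeros, 20 ones) rather than the SMS data, and `stratify` is an addition that the split above does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80% class 0, 20% class 1
toy_y = np.array([0] * 80 + [1] * 20)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)  # placeholder features

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# Both splits keep the original 20% share of class 1
print(toy_y_train.mean(), toy_y_test.mean())
```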

## Feature selection

Let's experiment with feature selection. We will test two different methods for this:


1.   Word frequency
2.   TF-IDF (term frequency-inverse document frequency)

Implement both. We will train and test our algorithms with both to see which one works better for our data.

Hint: this is where we are vectorizing our data




In [30]:
from sklearn.feature_extraction.text import CountVectorizer

In [60]:
count_vectorizer = CountVectorizer()

# we fit the count vectorizer with our train data
X_train_counts = count_vectorizer.fit_transform(X_train)

# we now transform the test set with the same vectorizer, which was fitted on our train data only
X_test_counts = count_vectorizer.transform(X_test)

Take a look to see what this looks like

In [61]:
count_vectorizer.get_feature_names_out()[:20]

array(['aah', 'aaniye', 'aaooooright', 'aathi', 'ab', 'abbey', 'abdomen',
       'aberdeen', 'abi', 'ability', 'abiola', 'abj', 'able',
       'abnormally', 'aboutas', 'abroad', 'absence', 'absolutly',
       'abstract', 'abt'], dtype=object)

In [None]:
# output vocabulary from vectorizer
count_vectorizer.vocabulary_

In [54]:
print(f'We have {len(count_vectorizer.vocabulary_)} vocabulary words in our vectorizer')

We have 5936 vocabulary words in our vectorizer
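
This vocabulary is fixed when the vectorizer is fitted on the train set. As a small sketch with made-up sentences, any word that was not in the fitted data is silently ignored by `transform`, so test-set words outside the fitted vocabulary contribute nothing to the features:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_vec = CountVectorizer()
toy_vec.fit_transform(['win free prize', 'see you later'])  # made-up "train" texts

# 'win' is in the fitted vocabulary; 'lottery' and 'tomorrow' are not
seen = toy_vec.transform(['win lottery'])
unseen = toy_vec.transform(['lottery tomorrow'])

print(seen.toarray())    # one non-zero entry, for 'win'
print(unseen.toarray())  # all zeros
```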


In [55]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [56]:
tfidf_vectorizer = TfidfVectorizer()

# we fit the tfidf vectorizer with our train data
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

# we now transform the test set with the same vectorizer, which was fitted on our train data only
X_test_tfidf = tfidf_vectorizer.transform(X_test)

In [None]:
# output vocabulary from vectorizer
tfidf_vectorizer.vocabulary_

In [64]:
print(f'We have {len(tfidf_vectorizer.vocabulary_)} vocabulary words in our vectorizer')

We have 5936 vocabulary words in our vectorizer


In [83]:
import numpy as np

y_train = np.array(y_train, dtype=int)
y_test = np.array(y_test, dtype=int)

## Train the Naive Bayes Classifier

Train the model on the data vectorized with the count vectorizer.

In [58]:
from sklearn.naive_bayes import MultinomialNB

In [84]:
count_nb = MultinomialNB()

# Fit the model
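count_nb.fit(X_train_counts, y_train)

# score() returns mean accuracy, not F-1; the value below equals the test-set accuracy (1094 of 1115 correct)
print(f'Accuracy: {count_nb.score(X_test_counts, y_test)}')

Accuracy: 0.9811659192825112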
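
For intuition about what `fit` estimates, the sketch below recomputes the Laplace-smoothed word likelihoods and class priors by hand on a made-up count matrix and checks them against scikit-learn. It assumes the default `alpha=1.0`; the toy counts and labels are illustrative only.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up word counts: 4 messages (rows) x 3 vocabulary words (columns)
toy_X = np.array([[2, 1, 0],
                  [1, 0, 0],
                  [0, 1, 3],
                  [0, 0, 2]])
toy_y = np.array([0, 0, 1, 1])
alpha = 1.0  # Laplace smoothing, the MultinomialNB default

toy_nb = MultinomialNB(alpha=alpha).fit(toy_X, toy_y)

# Manual estimate: P(w|c) = (count(w, c) + alpha) / (total count in c + alpha * |V|)
manual_log_likelihood = []
for c in np.unique(toy_y):
    word_counts = toy_X[toy_y == c].sum(axis=0)
    probs = (word_counts + alpha) / (word_counts.sum() + alpha * toy_X.shape[1])
    manual_log_likelihood.append(np.log(probs))
manual_log_likelihood = np.array(manual_log_likelihood)

# Manual prior: P(c) = (messages in class c) / (all messages)
manual_log_prior = np.log(np.bincount(toy_y) / len(toy_y))

print(np.allclose(manual_log_likelihood, toy_nb.feature_log_prob_))
print(np.allclose(manual_log_prior, toy_nb.class_log_prior_))
```

The smoothing term is what keeps a word that never occurs in one class from forcing that class's probability to zero.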


Train the model on the data vectorized with the tfidf vectorizer.

In [86]:
tfidf_nb = MultinomialNB()

# Fit the model
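tfidf_nb.fit(X_train_tfidf, y_train)

# score() returns mean accuracy, not F-1; the value below equals the test-set accuracy (1083 of 1115 correct)
print(f'Accuracy: {tfidf_nb.score(X_test_tfidf, y_test)}')

Accuracy: 0.9713004484304932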


## Test and Results
Now that we have a trained classifier, let's test to see if it generalises well to unseen examples. Let's compare the F-1 score for the two feature selection methods we used earlier.

Make sure to note which one seems to work better for our dataset.

In [88]:
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

Test the count model on the test data. Then print the confusion matrix and classification report, comparing the gold labels to the model's predictions.

In [90]:
y_pred_count = count_nb.predict(X_test_counts)

# Print the Confusion Matrix
cm = confusion_matrix(y_test, y_pred_count)
print("Confusion Matrix\n")
print(cm)

# Print the Classification Report
cr = classification_report(y_test, y_pred_count)
print("\n\nClassification Report\n")
print(cr)


Confusion Matrix

[[956  10]
 [ 11 138]]


Classification Report

              precision    recall  f1-score   support

           0       0.99      0.99      0.99       966
           1       0.93      0.93      0.93       149

    accuracy                           0.98      1115
   macro avg       0.96      0.96      0.96      1115
weighted avg       0.98      0.98      0.98      1115



Test the tfidf model on the test data. Then print the confusion matrix and classification report, comparing the gold labels to the model's predictions.

In [91]:
# Predict the labels
y_pred_tfidf = tfidf_nb.predict(X_test_tfidf)

# Print the Confusion Matrix
cm = confusion_matrix(y_test, y_pred_tfidf)
print("Confusion Matrix\n")
print(cm)

# Print the Classification Report
cr = classification_report(y_test, y_pred_tfidf)
print("\n\nClassification Report\n")
print(cr)

Confusion Matrix

[[965   1]
 [ 31 118]]


Classification Report

              precision    recall  f1-score   support

           0       0.97      1.00      0.98       966
           1       0.99      0.79      0.88       149

    accuracy                           0.97      1115
   macro avg       0.98      0.90      0.93      1115
weighted avg       0.97      0.97      0.97      1115
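
The spam-class figures in the two reports can be reproduced from the confusion matrices alone. Here is a short worked sketch (rows are gold labels, columns are predictions, so `cm[1, 1]` counts spam correctly flagged); the helper name `spam_metrics` is our own.

```python
import numpy as np

def spam_metrics(cm):
    """Return (precision, recall, f1) for class 1 from a 2x2 confusion matrix."""
    tp, fp, fn = cm[1, 1], cm[0, 1], cm[1, 0]
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Confusion matrices copied from the outputs above
cm_count = np.array([[956, 10], [11, 138]])
cm_tfidf = np.array([[965, 1], [31, 118]])

for name, cm in [('count', cm_count), ('tfidf', cm_tfidf)]:
    p, r, f1 = spam_metrics(cm)
    print(f'{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}')
```

On this split, TF-IDF trades recall for precision: it misses 31 spam messages against 11 for raw counts, while raising only 1 false alarm against 10.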



# Bonus: Multiclass classification

Now that you have implemented a binary classifier, let's try to do the same but this time for multiple classes. We will be working with the [20 newsgroups dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html), which is available in the sklearn datasets library.

Implement Naive Bayes for 5 classes (out of the 20 available) on this dataset.
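
One possible starting point is sketched below; it is not a reference solution. The five category names are an arbitrary choice of ours among the 20 available, `load_five_newsgroups` and `build_classifier` are hypothetical helper names, and stripping headers, footers and quotes is an assumption that you may want to revisit. Note that `fetch_20newsgroups` downloads the data on first use.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Arbitrary choice of 5 of the 20 categories
CATEGORIES = ['comp.graphics', 'rec.autos', 'rec.sport.hockey',
              'sci.space', 'talk.politics.misc']

def load_five_newsgroups(subset):
    """Fetch the 'train' or 'test' subset restricted to CATEGORIES."""
    return fetch_20newsgroups(subset=subset, categories=CATEGORIES,
                              remove=('headers', 'footers', 'quotes'))

def build_classifier():
    """Vectorize with word counts, then fit a multinomial Naive Bayes model."""
    return make_pipeline(CountVectorizer(stop_words='english'), MultinomialNB())

# Example usage (downloads the dataset on first call):
# train = load_five_newsgroups('train')
# test = load_five_newsgroups('test')
# clf = build_classifier().fit(train.data, train.target)
# print(clf.score(test.data, test.target))
```

`MultinomialNB` handles more than two classes without any change, so the binary workflow from this lab carries over directly; only the data loading and the per-class evaluation differ.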