# Text-Classification with Naive Bayes
This week you have learned about text classfication using ML algorithm Naive Bayes. In the lab this week, you will learn how to go through the steps of training a text-classifier using some of the techniques talked about in the lecture this week.

We will use [NLTK](https://www.nltk.org/) and [Scikit-learn](https://scikit-learn.org/stable/) as we go through the lab sheet.


## Import your data

The dataset you will be using can be downloaded from this [link](https://archive.ics.uci.edu/ml/machine-learning-databases/00228/). Download the zip file and make sure you read through the readme file. 

In [156]:
import pandas as pd
# Read the text file into a DataFrame
data = pd.read_csv('/home/xinpeng/UoB/NLP/Week5/smsspamcollection/SMSSpamCollection', header=None, sep='\t')
data.columns = ['Tag', 'SMS']
data

Unnamed: 0,Tag,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [157]:
print(f'We have a total of {len(data)} examples.')

We have a total of 5572 examples.


Now change the label column in the dataframe so that we have:
* 0 for ham
* 1 for spam

We want to also explore if we have a class imbalance and what our data looks like. Make sure you understand what you are classifying before diving in. 

Use the power of pandas dataframe to count the number of examples for each of the classes

In [158]:
data['Tag'] = data['Tag'].replace(['ham', 'spam'], ['0', '1'])
data['Tag'].value_counts()

0    4825
1     747
Name: Tag, dtype: int64

## Preprocess your data
Now that we have seen what our data looks like, we can begin to think about how we might pre-process our data. We can do a simple pre-processing and then, if needed, we can do more, such as lemmatise etc.

1. words are lower case
2. tokenize
3. stop-word removal
4. punctuation and non-alpha character removal
5. Lemmatise the words in the text



In [159]:
import nltk

In [160]:
nltk.download('stopwords')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/xinpeng/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/xinpeng/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [161]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

We make everything lower case and tokenize

Remove stopwords, non alphabetic characters and punctuation, and lemmatise the text

Compare the original text to the preprocessed text.

In [162]:
import re
from nltk import stem
def preprocess(data):
    #data['tokenized_SMS'] = data.apply(lambda row: nltk.word_tokenize(row['SMS'].lower()), axis=1)
    #data['removal'] = data.apply(lambda row: re.findall(r'[a-zA-Z]+', ' '.join(row['tokenized_SMS'])), axis=1)
    #tokenize and reserve only alphabetic characters
    data['preprocessed'] = data.apply(lambda row: re.findall(r'[a-zA-Z]+', row['SMS'].lower()), axis=1)
    #remove stopwords
    nltk_stopwords = stopwords.words('english')
    data['preprocessed'] = data.apply(lambda row: [word for word in row['preprocessed'] if not word in nltk_stopwords], axis=1)
    #lemmatization
    wnl = stem.WordNetLemmatizer()
    data['preprocessed'] = data.apply(lambda row: [wnl.lemmatize(word) for word in row['preprocessed']], axis=1)
    #convert list to string
    data['preprocessed'] = data.apply(lambda row: " ".join(row['preprocessed']), axis=1)
    return data

data = preprocess(data)
data

Unnamed: 0,Tag,SMS,preprocessed
0,0,"Go until jurong point, crazy.. Available only ...",go jurong point crazy available bugis n great ...
1,0,Ok lar... Joking wif u oni...,ok lar joking wif u oni
2,1,Free entry in 2 a wkly comp to win FA Cup fina...,free entry wkly comp win fa cup final tkts st ...
3,0,U dun say so early hor... U c already then say...,u dun say early hor u c already say
4,0,"Nah I don't think he goes to usf, he lives aro...",nah think go usf life around though
...,...,...,...
5567,1,This is the 2nd time we have tried 2 contact u...,nd time tried contact u u pound prize claim ea...
5568,0,Will ü b going to esplanade fr home?,b going esplanade fr home
5569,0,"Pity, * was in mood for that. So...any other s...",pity mood suggestion
5570,0,The guy did some bitching but I acted like i'd...,guy bitching acted like interested buying some...


## Split the data
Now we will split our data using sklearn into our train and test set

In [163]:
from sklearn.model_selection import train_test_split
data_train, data_test = train_test_split(data)

## Feature selection

Let's experiment with feature selection. We will test two different methods for this:


1.   Word frequency
2.   Mutual Information - Tfidf

Implement both. We will train and test our algorithms with both to see which one works better for our data.

Hint: this is where we are vectorizing our data




In [164]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X_WordFrequency = vectorizer.fit_transform(data['preprocessed'])
X_WordFrequency
# we now use the same vectorizer used and fit with our train data with test set
# X_test_counts = vectorizer.transform(X_test)   

<5572x7098 sparse matrix of type '<class 'numpy.int64'>'
	with 45138 stored elements in Compressed Sparse Row format>

Take a look to see what our feature names are

In [165]:
vectorizer.get_feature_names()

['aa',
 'aah',
 'aaniye',
 'aaooooright',
 'aathi',
 'ab',
 'abbey',
 'abdomen',
 'abeg',
 'abel',
 'aberdeen',
 'abi',
 'ability',
 'abiola',
 'abj',
 'able',
 'abnormally',
 'aboutas',
 'abroad',
 'absence',
 'absolutely',
 'absolutly',
 'abstract',
 'abt',
 'abta',
 'aburo',
 'abuse',
 'abuser',
 'ac',
 'academic',
 'acc',
 'accent',
 'accenture',
 'accept',
 'access',
 'accessible',
 'accidant',
 'accident',
 'accidentally',
 'accommodation',
 'accommodationvouchers',
 'accomodate',
 'accomodations',
 'accordin',
 'accordingly',
 'account',
 'accounting',
 'accumulation',
 'achan',
 'ache',
 'achieve',
 'acid',
 'acknowledgement',
 'acl',
 'acnt',
 'aco',
 'across',
 'act',
 'acted',
 'actin',
 'acting',
 'action',
 'activ',
 'activate',
 'active',
 'activity',
 'actor',
 'actual',
 'actually',
 'ad',
 'adam',
 'add',
 'addamsfa',
 'added',
 'addicted',
 'addie',
 'adding',
 'address',
 'adewale',
 'adi',
 'adjustable',
 'admin',
 'administrator',
 'admirer',
 'admission',
 'admit'

In [166]:
from sklearn.feature_extraction.text import TfidfVectorizer
def getFeatureVector2(data):
    vectorizer = TfidfVectorizer()
    X_Tfidf = vectorizer.fit_transform(data)
    return X_Tfidf

## Train the Naive Bayes Classifier

Train the model using the data vectorized using the count vectorizer

In [170]:
from sklearn.naive_bayes import MultinomialNB
X_train, X_test, y_train, y_test = train_test_split(X_WordFrequency, data['Tag'])
clf = MultinomialNB()
clf.fit(X_train, y_train)

MultinomialNB()

Train the model using the data vectorized using the tfidf vectorizer

In [171]:
# clf2 = MultinomialNB()
# clf2.fit(getFeatureVector2(data_train['preprocessed']), data_train['Tag'])

## Test and Results
Now that we have a trained classifier, let's test to see if it generalises well to unseen examples. Let's compare the F-1 score for the two feature selection methods we used earlier.

Make sure to note which one seems to work beter for our dataset.

In [172]:
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

Test the count model on the test data. Then print the confusion matrix and classification report comparing the gold labels to predictions from the model

In [173]:
#?????
y_pred = clf.predict(X_test)
confusion_matrix(y_true, y_pred)
target_names = ['ham', 'spam']
print(classification_report(y_true, y_pred, target_names=target_names))

              precision    recall  f1-score   support

         ham       0.86      0.87      0.86      1202
        spam       0.13      0.13      0.13       191

    accuracy                           0.77      1393
   macro avg       0.50      0.50      0.50      1393
weighted avg       0.76      0.77      0.76      1393



Test the tfidf model on the test data. Then print the confusion matrix and classification report comparing the gold labels to predictions from the model

# Bonus: Multiclass classification

Now that you have implemented a binary classifier, let's try to do the same but this time for multiple classes. We will be working with the ______ dataset. 

Use the [20 newsgroups dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html) available on sklearn dataset library.

Implement naive bayes but for 5 classes (out of the 20 avaialable) on this dataset. 