# Welcome to my notebook
**In this notebook, I will introduce some techniques for preprocessing text data and build models to classify emails as spam or ham.**
**This notebook will cover:**
1. Preprocessing text data
    * Lowercase the text, remove stopwords and punctuation, apply stemming and lemmatization, ...
    * Vectorize text data using term frequency-inverse document frequency (TfidfVectorizer)
2. Build model
    * Use Naive Bayes, logistic regression and SVC to classify spam email
    * Compare stemmed and lemmatized text in terms of model performance
3. Oversampling technique
    * We will use SMOTE (Synthetic Minority Over-sampling Technique) to oversample the minority class and see whether it improves model performance

In [None]:

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Import necessary libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
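nltk.download('stopwords')
#word_tokenize needs the punkt tokenizer models (recent NLTK releases look for 'punkt_tab' instead)
nltk.download('punkt')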
import re
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
#plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
#for oversampling minority class
from imblearn.over_sampling import SMOTE


**Read the data and look at some emails**

In [None]:
data = pd.read_csv('/kaggle/input/spam-mails-dataset/spam_ham_dataset.csv')
data.head()

In [None]:
sample = data.sample(5)
for i in range(5):
    print('Class: ', sample.iloc[i]['label'])
    print('Email:')
    print(sample.iloc[i]['text'])
    print('\n', '---'*45)
    

**We see that the emails contain many special characters, numbers, http fragments, and so on, which are not useful for classifying spam. We will remove them later.**

In [None]:
#drop the ID column
data.drop('Unnamed: 0', axis = 1, inplace = True)

**Plot the class distribution. Spam emails make up only about 29% of the dataset. This imbalance is a problem: if we feed the data directly to a model, it may be biased toward the majority class and not generalize well.**

In [None]:
print(data['label_num'].value_counts(normalize = True)*100)
#recent seaborn versions require the column to be passed as a keyword argument
sns.countplot(x = 'label_num', data = data)
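
To see why the skew matters, consider a hypothetical baseline that labels every email as ham. The sketch below uses made-up labels with roughly the same 71/29 split (not rows from the real dataset) and computes accuracy and spam recall by hand.

```python
# Toy labels mimicking the class split above: 71 ham (0) and 29 spam (1).
# These are illustrative values, not rows from the dataset.
y_true = [0] * 71 + [1] * 29

# A hypothetical "classifier" that always predicts ham.
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall for spam: spam emails caught / all spam emails.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
spam_recall = true_pos / sum(t == 1 for t in y_true)

print(f'accuracy    = {accuracy:.2f}')
print(f'spam recall = {spam_recall:.2f}')
```

The baseline reaches 71% accuracy while catching no spam at all, which is why the per-class precision and recall in the classification reports later on are more informative than accuracy alone.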

# Preprocessing text:
* Lowercase the text
* Remove numbers, punctuation, leading and trailing spaces, and stopwords
* I add 'subject' and 'http' to the stopwords because they do not look useful for predicting spam.

In [None]:
stopwords_set = set(stopwords.words('english'))
#uncomment the next line to keep 'not' in the text
#stopwords_set.remove('not')
#add 'subject' and 'http' to the stopwords
stopwords_set.add('subject')
stopwords_set.add('http')
def preprocessing_text(x):
    #lower case
    x = x.lower()
    #remove numbers
    x = re.sub(r'\d+','',x)
    #remove punctuation
    x = re.sub(r'[^\w\s]', '',x)
    #remove leading and trailing spaces
    x = x.strip()
    #remove stopwords
    x = ' '.join([word for word in word_tokenize(x) if word not in stopwords_set])
    return x
#apply the preprocessing to every email
data['text'] = data['text'].apply(preprocessing_text)
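
To make the effect of these steps visible, here is a minimal, self-contained sketch that runs the same sequence on one made-up email. It is an illustration only: `preprocess_demo`, `demo_stopwords` and the sample text are hypothetical, the stopword list is a tiny stand-in for NLTK's English list, and `str.split()` stands in for `word_tokenize`.

```python
import re

# Tiny stand-in stopword list (an assumption for this demo;
# the notebook itself uses NLTK's English stopwords).
demo_stopwords = {'subject', 'http', 'the', 'to', 'your', 'at', 'is', 'a'}

def preprocess_demo(text):
    text = text.lower()                  # lower case
    text = re.sub(r'\d+', '', text)      # remove numbers
    text = re.sub(r'[^\w\s]', '', text)  # remove punctuation
    # str.split() stands in for word_tokenize and also drops extra whitespace
    return ' '.join(w for w in text.split() if w not in demo_stopwords)

# Made-up email text, not taken from the dataset.
raw = 'Subject: WIN $1000 now!!! Claim your prize at http://example.com'
cleaned = preprocess_demo(raw)
print(cleaned)
```

Note that a URL written without spaces collapses into a single token once the punctuation is removed, so the `http` stopword only removes `http` when it appears as a separate token.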

**Split the dataset into train and test sets**

In [None]:
train, test = train_test_split(data, test_size = 0.2, random_state = 42)

# Draw word clouds of spam and ham emails
**A word cloud shows the most frequent words in each class, which gives us some insight into the data**


In [None]:
#draw wordcloud
from wordcloud import WordCloud
sns.set(style = None)
train_spam = train[train['label_num'] == 1]
train_spam = train_spam['text']
#turn series to string by join ' ' to it
train_spam = ' '.join(train_spam)
train_ham = train[train['label_num'] == 0]
train_ham = train_ham['text']
train_ham = ' '.join(train_ham)
wordcloud_spam = WordCloud(background_color = 'black', width = 2500, height = 2000 ).generate(train_spam)
plt.figure(figsize = (13,13))
print('Spam email wordcloud')
plt.imshow(wordcloud_spam)
plt.show()
wordcloud_ham = WordCloud(background_color = 'white', width = 2500, height = 2000).generate(train_ham)
print('Ham email wordcloud')
plt.figure(figsize = (13,13))

plt.imshow(wordcloud_ham)
plt.show()

# Stemming words
**Stemming reduces each word to its stem by stripping suffixes, e.g. surveys -> survey, started -> start. The stem is not always a dictionary word.**

**Email before stemming**

In [None]:

train.iloc[0]['text']

**Stem all emails**

In [None]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
stemmer = PorterStemmer()
#take a list of tokenized words, stem each one, and join them back into a string
def stemming_words(words):
    return ' '.join(stemmer.stem(word) for word in words)
#work on copies so the assignments below do not modify views of data
train, test = train.copy(), test.copy()
#tokenize, then stem; apply the same steps to test so its tokens match the vocabulary learned from train
train['text'] = train['text'].apply(lambda x: stemming_words(word_tokenize(x)))
test['text'] = test['text'].apply(lambda x: stemming_words(word_tokenize(x)))

**Email after stemming**

In [None]:
train.iloc[0]['text']

# Build model using stemming

**Before we feed the emails to our model, we have to vectorize them. Humans can understand text, but a model cannot: it only works with numbers.**

In [None]:
#limit max_features so that rare words, which are considered unimportant (e.g. personal names), are left out of the vocabulary
tfidf = TfidfVectorizer( strip_accents = 'ascii', max_df = 0.8, max_features = 27000)
train_vectorized = tfidf.fit_transform(train['text'])
test_vectorized = tfidf.transform(test['text'])
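
To make the vectorization step concrete, the sketch below computes TF-IDF weights by hand for three made-up documents. It follows the smoothed idf formula, idf = ln((1 + n) / (1 + df)) + 1, and the L2 normalization that scikit-learn documents as the `TfidfVectorizer` defaults. It is a simplified illustration, not the library's implementation: it splits on whitespace and ignores options such as `max_df` and `max_features`. The function name `tfidf_vectors` and the documents are assumptions for this example.

```python
import math
from collections import Counter

# Made-up documents, already preprocessed.
docs = ['free prize claim prize', 'meeting schedule attached', 'claim free meeting']

def tfidf_vectors(docs):
    tokenized = [d.split() for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for doc in tokenized for w in set(doc))
    # Smoothed inverse document frequency.
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        # Raw term count times idf, then scale the row to unit L2 norm.
        weights = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in weights))
        vectors.append([x / norm for x in weights])
    return vocab, vectors

vocab, vectors = tfidf_vectors(docs)
for doc, vec in zip(docs, vectors):
    nonzero = {w: round(x, 2) for w, x in zip(vocab, vec) if x > 0}
    print(doc, '->', nonzero)
```

In the first document, `prize` gets a larger weight than `free` because it occurs twice there and appears in fewer documents overall.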

**Oversample the spam class. This will give us a more balanced dataset**

In [None]:
sm = SMOTE(sampling_strategy = 1,random_state = 42)
X_resample,y_resample = sm.fit_resample(train_vectorized, train['label_num'])
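
The core idea behind SMOTE is to create a synthetic minority sample by interpolating between an existing minority sample and one of its nearest minority-class neighbours. The sketch below shows only that interpolation step on made-up vectors. It is a simplified illustration, not imbalanced-learn's implementation: the neighbour is passed in directly instead of being found with a nearest-neighbour search, and `smote_like_sample` is a hypothetical helper.

```python
import random

def smote_like_sample(point, neighbor, rng):
    """Return a synthetic point on the segment between point and neighbor."""
    gap = rng.random()  # random position along the segment, in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]

rng = random.Random(42)

# Two made-up minority-class (spam) feature vectors.
spam_a = [0.2, 0.0, 0.9]
spam_b = [0.4, 0.1, 0.5]

synthetic = smote_like_sample(spam_a, spam_b, rng)
print(synthetic)
```

Every coordinate of the synthetic point lies between the corresponding coordinates of the two original points, so new samples stay inside the region already occupied by the minority class.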

In [None]:
print(y_resample.value_counts(normalize = True)*100)
sns.countplot(x = y_resample)

**Looks pretty good. We have a balanced dataset now. Next, we will compare models trained on the non-resampled and the resampled datasets to see whether SMOTE improves performance.**

In [None]:

from sklearn.naive_bayes import MultinomialNB
#ConfusionMatrixDisplay replaces plot_confusion_matrix, which was removed in scikit-learn 1.2
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
nb = MultinomialNB()
nb.fit(train_vectorized, train['label_num'])
p = nb.predict(test_vectorized)
print('Naive Bayes on non-resample dataset\n\n')
print(classification_report(test['label_num'],p))
ConfusionMatrixDisplay.from_estimator(nb, test_vectorized, test['label_num'], cmap = 'Paired')

In [None]:

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
nb = MultinomialNB()
nb.fit(X_resample, y_resample)
p = nb.predict(test_vectorized)
print('Naive Bayes on resample dataset\n\n')
print(classification_report(test['label_num'],p))
ConfusionMatrixDisplay.from_estimator(nb, test_vectorized, test['label_num'], cmap = 'Paired')

Looks pretty good. Our Naive Bayes model has improved significantly on the resampled dataset. Next, we will check logistic regression.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay
lr = LogisticRegression()
lr.fit(train_vectorized, train['label_num'])
lr_p = lr.predict(test_vectorized)
print('Logistic regression on non-resample dataset')

print(classification_report(test['label_num'], lr_p))
ConfusionMatrixDisplay.from_estimator(lr, test_vectorized, test['label_num'], cmap = 'Paired')

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay
lr = LogisticRegression()
lr.fit(X_resample, y_resample)
lr_p = lr.predict(test_vectorized)
print('Logistic regression on resample dataset\n\n')
print(classification_report(test['label_num'], lr_p))
ConfusionMatrixDisplay.from_estimator(lr, test_vectorized, test['label_num'], cmap = 'Blues')


**This logistic regression model does not perform as well as expected on the resampled dataset. Let's check SVC.**

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import ConfusionMatrixDisplay
svc = SVC()
svc.fit(train_vectorized, train['label_num'])
svc_p = svc.predict(test_vectorized)
print('SVC on non-resample dataset')
print(classification_report(test['label_num'], svc_p))
ConfusionMatrixDisplay.from_estimator(svc, test_vectorized, test['label_num'])

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import ConfusionMatrixDisplay
svc = SVC()
svc.fit(X_resample, y_resample)
svc_p = svc.predict(test_vectorized)
print('SVC on resample dataset')
print(classification_report(test['label_num'], svc_p))
ConfusionMatrixDisplay.from_estimator(svc, test_vectorized, test['label_num'])

**Well, it looks like SVC does neither better nor worse with the resampled dataset**

# Build model using lemmatization

# Lemmatization
**The aim of lemmatization, like stemming, is to reduce inflectional forms to a common base form. As opposed to stemming, lemmatization does not simply chop off inflections. Instead it uses lexical knowledge bases to get the correct base forms of words**
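
The difference between the two approaches can be illustrated with a deliberately crude toy. In the sketch below, `toy_stem` strips a few suffixes and `toy_lemmatize` looks words up in a small hand-written dictionary. Both are hypothetical stand-ins written for this example: `toy_stem` is not the Porter algorithm, and `toy_lemmas` only imitates the role of a lexical knowledge base such as WordNet.

```python
# Toy suffix stripper: a crude stand-in for a stemmer (NOT the Porter algorithm).
def toy_stem(word):
    for suffix in ('ies', 'ed', 'es', 's'):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Toy lemma dictionary: a stand-in for a lexical knowledge base such as WordNet.
toy_lemmas = {'studies': 'study', 'surveys': 'survey', 'started': 'start', 'mice': 'mouse'}

def toy_lemmatize(word):
    # Unknown words are returned unchanged.
    return toy_lemmas.get(word, word)

for word in ['surveys', 'started', 'studies', 'mice']:
    print(f'{word:8} stem={toy_stem(word):8} lemma={toy_lemmatize(word)}')
```

Suffix stripping turns `studies` into the non-word `stud` and cannot handle the irregular plural `mice`, while the dictionary lookup returns `study` and `mouse`.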

In [None]:
#split the data again
train, test = train_test_split(data, test_size = 0.2, random_state = 42)

In [None]:
print('Email before lemmatization')
train.iloc[0]['text']

In [None]:
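from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
#the lemmatizer looks words up in WordNet, so the corpus has to be downloaded first
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
#take a list of tokenized words, lemmatize each one, and join them back into a string
#note: lemmatize() treats every word as a noun by default (pos='n'), so verb forms such as 'started' are left unchanged
def lemmatize_words(words):
    return ' '.join(lemmatizer.lemmatize(word) for word in words)
#work on copies so the assignments below do not modify views of data
train, test = train.copy(), test.copy()
#tokenize, then lemmatize; apply the same steps to test so its tokens match the vocabulary learned from train
train['text'] = train['text'].apply(lambda x: lemmatize_words(word_tokenize(x)))
test['text'] = test['text'].apply(lambda x: lemmatize_words(word_tokenize(x)))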

In [None]:
print('Email after lemmatization')
train.iloc[0]['text']

In [None]:
#Vectorize text 
tfidf = TfidfVectorizer( strip_accents = 'ascii', max_df = 0.8, max_features = 27000)
train_vectorized = tfidf.fit_transform(train['text'])
test_vectorized = tfidf.transform(test['text'])
#and oversampling data 
sm = SMOTE(sampling_strategy = 1,random_state = 42)
X_resample,y_resample = sm.fit_resample(train_vectorized, train['label_num'])

**Now we compare models trained on the non-resampled and the resampled data, as we did before**

In [None]:

from sklearn.metrics import ConfusionMatrixDisplay
nb = MultinomialNB()
nb.fit(train_vectorized, train['label_num'])
p = nb.predict(test_vectorized)
print('Naive bayes on non-resample dataset')
print(classification_report(test['label_num'],p))
ConfusionMatrixDisplay.from_estimator(nb, test_vectorized, test['label_num'], cmap = 'Paired')

**Naive Bayes on the lemmatized data is clearly better than on the stemmed data. Next, let's see Naive Bayes on the resampled lemmatized dataset**

In [None]:

from sklearn.metrics import ConfusionMatrixDisplay
nb = MultinomialNB()
nb.fit(X_resample, y_resample)
p = nb.predict(test_vectorized)
print('Naive bayes on resample dataset')
print(classification_report(test['label_num'],p))
ConfusionMatrixDisplay.from_estimator(nb, test_vectorized, test['label_num'], cmap = 'Paired')

**This model is slightly better than the one trained on the resampled stemmed dataset. Next, let's see logistic regression**

In [None]:

from sklearn.metrics import ConfusionMatrixDisplay
lr = LogisticRegression()
lr.fit(train_vectorized, train['label_num'])
lr_p = lr.predict(test_vectorized)
print('Logistic regression on non-resample data')
print(classification_report(test['label_num'], lr_p))
ConfusionMatrixDisplay.from_estimator(lr, test_vectorized, test['label_num'], cmap = 'Paired')

**Slightly better than the earlier logistic regression. Let's see logistic regression trained on the resampled dataset**

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
lr = LogisticRegression()
lr.fit(X_resample, y_resample)
lr_p = lr.predict(test_vectorized)
print('Logistic regression on resample data')
print(classification_report(test['label_num'], lr_p))
ConfusionMatrixDisplay.from_estimator(lr, test_vectorized, test['label_num'], cmap = 'Blues')

**Well, it looks worse than the one trained on the non-resampled data. But we cannot be sure: only by testing on more samples would we know which one is better.**

In [None]:

from sklearn.metrics import ConfusionMatrixDisplay
svc = SVC()
svc.fit(train_vectorized, train['label_num'])
svc_p = svc.predict(test_vectorized)
print('SVC on non-resample data\n\n')
print(classification_report(test['label_num'], svc_p))
ConfusionMatrixDisplay.from_estimator(svc, test_vectorized, test['label_num'], cmap = 'Blues')

**It looks like SVC does the best job. Let's see how it performs on the resampled dataset**

In [None]:

from sklearn.metrics import ConfusionMatrixDisplay
svc = SVC()
svc.fit(X_resample, y_resample)
svc_p = svc.predict(test_vectorized)
print('SVC on resample data\n\n')
print(classification_report(test['label_num'], svc_p))
ConfusionMatrixDisplay.from_estimator(svc, test_vectorized, test['label_num'], cmap = 'Blues')

**Nothing changes. If we had more test data, we could test our model further and see how well it generalizes.**

# Conclusion
**Finally, we come to the end of this notebook. Let's see what we have learned so far.**
1. We know how to preprocess text data and vectorize it using TfidfVectorizer
2. We know how to use Naive Bayes, logistic regression and SVC to classify spam emails
3. We know how to use stemming and lemmatization and compare model performance on each, and we saw that lemmatization gave better performance in these runs

**Well, that's it for this notebook. See you next time.**