<a href="https://colab.research.google.com/github/tutsilianna/Automatic_Text_Processing_and_Image_Processing/blob/main/Text%20Classification/Task_3_%7C_Text_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Text classification: Spam or Ham

This example builds a spam filter with the scikit-learn library, based on the classical Spambase Dataset (https://archive.ics.uci.edu/ml/datasets/spambase). The dataset is a text corpus of 5,574 text messages labeled "spam" or "ham".

### Data

The data are attached to the task description for your convenience.

In [15]:
import pandas as pd
df = pd.read_csv('3_data.csv', encoding='latin-1')

We keep only the two columns of interest, the text messages and the labels:

In [16]:
df = df[['v1', 'v2']]
df = df.rename(columns = {'v1': 'label', 'v2': 'text'})
df.head()

| | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


Delete duplicates:

In [17]:
df = df.drop_duplicates('text')

Change labels to binary:

In [18]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df = df.assign(label=df['label'].map({'ham': 0, 'spam': 1}))
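`Series.map` with a dictionary returns `NaN` for any value that is not a key of the dictionary, so a stray label (different casing, a trailing space) would silently become a missing value. A minimal sketch of a sanity check, using a small made-up Series rather than the real data:

```python
import pandas as pd

# Hypothetical labels; "Ham " (capitalised, trailing space) is not a key in the mapping.
labels = pd.Series(["ham", "spam", "Ham ", "ham"])
mapped = labels.map({"ham": 0, "spam": 1})

# Values without a matching key become NaN, so count them before training.
n_unmapped = int(mapped.isna().sum())
print(n_unmapped)  # 1
```

If the count is not zero, the labels need cleaning before they are passed to a classifier.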


### Text pre-processing (Task)

Complete the text pre-processing function so that it:
* converts the text to lowercase;
* removes stop words;
* removes punctuation marks;
* normalizes the text using the Snowball stemmer.

We recommend using the NLTK library so that you do not have to compile a stop-word list or implement the stemming algorithm yourself. Examples of stemmer usage are available at https://www.nltk.org/howto/stem.html.

In [40]:
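import re

import nltk
from nltk import stem
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt')  # only needed if nltk.word_tokenize is used instead of str.split

stemmer = stem.SnowballStemmer('english')
# Use a separate name so the imported `stopwords` module is not overwritten
# (otherwise re-running this cell fails).
stop_words = set(stopwords.words('english'))

def preprocess(text):
    text = re.sub(r'[^\w\s]', '', text)  # remove punctuation marks
    text = text.lower()                  # convert to lowercase
    tokens = text.split()                # alternative: nltk.word_tokenize(text)
    tokens = [token for token in tokens if token not in stop_words]  # remove stop words
    tokens = [stemmer.stem(token) for token in tokens]               # Snowball stemming
    return ' '.join(tokens)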

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Check that the function works correctly:

In [43]:
assert preprocess("I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.") == "im gonna home soon dont want talk stuff anymor tonight k ive cri enough today"
assert preprocess("Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...") == "go jurong point crazi avail bugi n great world la e buffet cine got amor wat"
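The expected strings in these assertions contain tokens such as `dont` and `ive`, which suggests that punctuation is stripped before stop words are filtered: once the apostrophe is gone, `dont` no longer matches a stop-word entry like `don't`. The sketch below illustrates the effect of the order with a tiny hypothetical stop-word set (not NLTK's list) and no stemming:

```python
import re

# Tiny hypothetical stop-word list; NLTK's English list is much longer.
STOP_WORDS = {"i", "don't", "to", "about", "this"}

def strip_punct(text):
    # Drop everything except word characters and whitespace.
    return re.sub(r"[^\w\s]", "", text)

def drop_stop_words(text):
    return " ".join(t for t in text.split() if t not in STOP_WORDS)

msg = "i don't want to talk about this"

punct_first = drop_stop_words(strip_punct(msg))  # "don't" -> "dont", not in the list
stops_first = strip_punct(drop_stop_words(msg))  # "don't" is removed as a stop word

print(punct_first)  # dont want talk
print(stops_first)  # want talk
```

Either order can be reasonable; what matters for the self-check is that the function reproduces the expected strings exactly.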

Apply to the text:

In [44]:
df['text'] = df['text'].apply(preprocess)
df['text']

0       go jurong point crazi avail bugi n great world...
1                                   ok lar joke wif u oni
2       free entri 2 wkli comp win fa cup final tkts 2...
3                     u dun say earli hor u c alreadi say
4               nah dont think goe usf live around though
                              ...                        
5567    2nd time tri 2 contact u u å750 pound prize 2 ...
5568                             ì_ b go esplanad fr home
5569                              piti mood soani suggest
5570    guy bitch act like id interest buy someth els ...
5571                                       rofl true name
Name: text, Length: 5169, dtype: object

### Split the data into training and test sets

In [45]:
y = df['label'].values

Now we need to split the data into training and test sets. The scikit-learn library provides ready-to-use tools for this.

In [46]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['text'], y, test_size=0.3, random_state=51)
X_test

634     dear voucher holder 2 claim week offer pc go h...
1149                                            drop tank
2375    thanx 4 2day u r goodmat think ur rite sari as...
845     meanwhil shit suit xavier decid give us ltgt s...
155                                      aaooooright work
                              ...                        
1977    repli win å100 week 2006 fifa world cup held s...
481              yo carlo friend alreadi ask work weekend
1632    hello littl parti anim thought id buzz friend ...
5280                           vikki come around lttimegt
3566    collect valentin weekend pari inc flight hotel...
Name: text, Length: 1551, dtype: object
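The test set length of 1551 can be checked by hand. The sketch below assumes that, for a fractional `test_size`, the test set receives the product rounded up and the training set receives the remaining rows; the helper `split_sizes` is hypothetical and only does the arithmetic.

```python
import math

n_samples = 5169  # length of df['text'] after removing duplicates

def split_sizes(n, test_size):
    # Assumption: the test set gets ceil(test_size * n) rows, the training set the rest.
    n_test = math.ceil(test_size * n)
    return n - n_test, n_test

print(split_sizes(n_samples, 0.3))   # (3618, 1551)
print(split_sizes(n_samples, 0.25))  # (3876, 1293)
```

The second call uses the `test_size = 0.25` setting of the individual task.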

### Classifier training

Now we move on to training the classifier.

First we extract features from the texts. It is strongly recommended to try several methods to see how each one influences the result (more information on different text representation methods is available at https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction).

Then we train the classifier. We use SVM, but you can try different algorithms.

In [47]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# extract features from the texts
vectorizer = TfidfVectorizer(decode_error='ignore')
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
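Two details of this step are easy to miss. The vectorizer is fitted on the training texts only, so words that appear only in the test texts are ignored at transform time. Also, with its default token pattern `TfidfVectorizer` keeps only tokens of two or more word characters, so single-character tokens such as `u` or `2` do not become features. A small sketch on made-up messages:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up pre-processed messages for training, one for testing.
train_texts = ["free entri 2 wkli comp win", "u dun say earli hor u c alreadi say"]
test_texts = ["win free prize u"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # learns the vocabulary and IDF weights
X_test = vectorizer.transform(test_texts)        # reuses them; unseen words are ignored

vocab = vectorizer.vocabulary_
print(sorted(vocab))
print("u" in vocab, "prize" in vocab)  # False False
print(X_train.shape, X_test.shape)     # (2, 10) (1, 10)
```

If single-character tokens should count as features, the `token_pattern` parameter can be changed; whether that helps on this data is something to test rather than assume.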

In [48]:
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

# train the SVM model

model = LinearSVC(random_state = 51, C = 1)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
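To see what the linear SVM does at prediction time, it can help to look at `decision_function`, which returns one signed score per message; for a binary problem, `predict` picks the second class when the score is positive. A minimal sketch on made-up toy messages (not the course data):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Made-up toy messages (already pre-processed); 1 = spam, 0 = ham.
texts = ["win free prize call", "free cash prize claim", "come home soon", "see tonight home"]
labels = np.array([1, 1, 0, 0])

X = TfidfVectorizer().fit_transform(texts)
clf = LinearSVC(C=1, random_state=51).fit(X, labels)

scores = clf.decision_function(X)  # one signed score per message
preds = clf.predict(X)

# For a binary problem, a positive score selects the second class in clf.classes_ (here 1).
print(np.array_equal(preds, (scores > 0).astype(int)))  # True
```

The magnitude of the score gives a rough idea of how far a message lies from the decision boundary, which can be useful when inspecting borderline messages.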

Self-check. If the `preprocess` function is completed correctly, you should get the following model evaluation results.

In [49]:
print(classification_report(y_test, predictions, digits=3))

              precision    recall  f1-score   support

           0      0.984     0.993     0.988      1355
           1      0.946     0.888     0.916       196

    accuracy                          0.979      1551
   macro avg      0.965     0.940     0.952      1551
weighted avg      0.979     0.979     0.979      1551
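The `macro avg` row is the unweighted mean of the per-class values, while `weighted avg` weights each class by its support. The sketch below recomputes both from the rounded per-class numbers in the report, so the last digit may differ slightly from what scikit-learn prints from unrounded values.

```python
# Per-class values copied from the report: (precision, recall, f1, support)
report = {
    0: (0.984, 0.993, 0.988, 1355),
    1: (0.946, 0.888, 0.916, 196),
}

def macro_avg(col):
    # Unweighted mean over classes.
    return sum(row[col] for row in report.values()) / len(report)

def weighted_avg(col):
    # Mean over classes weighted by support (number of true samples per class).
    total = sum(row[3] for row in report.values())
    return sum(row[col] * row[3] for row in report.values()) / total

for name, col in [("precision", 0), ("recall", 1), ("f1-score", 2)]:
    print(f"{name:>9}  macro={macro_avg(col):.3f}  weighted={weighted_avg(col):.3f}")
```

Because spam is the minority class here, the macro average is pulled down by the lower spam recall more than the weighted average is.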



Let's predict the class of the following text:

In [50]:
txt = "As a valued customer, I am pleased to advise you that following recent review of your Mob No. you are awarded with a å£1500 Bonus Prize, call 09066364589"
txt = preprocess(txt)
txt = vectorizer.transform([txt])

In [51]:
model.predict(txt)

array([1])

The message is classified as spam.

# Individual task

Using the Spambase Dataset as the original dataset (the same as in the example), build a model that checks whether an email is spam.

Pre-process the text (as in the example) and split the dataset into training and test sets with the parameters `test_size = 0.25` and `random_state = 9`. Train an SVM classifier with `C = 1.2` and `random_state = 9` on the training set and evaluate the resulting model on the test set.




In [61]:
X_train, X_test, y_train, y_test = train_test_split(df['text'], y, test_size=0.25, random_state=9)

In [62]:
# extract features from the texts

vectorizer = TfidfVectorizer(decode_error='ignore')
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

In [63]:
# train the SVM model

model = LinearSVC(random_state = 9, C = 1.2)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

Model evaluation results on the test data.

1. Enter the Precision value (macro avg)
2. Enter the Recall value (macro avg)
3. Enter the F-score value (macro avg)

In [64]:
print(classification_report(y_test, predictions, digits=3))

              precision    recall  f1-score   support

           0      0.976     0.996     0.986      1123
           1      0.966     0.841     0.899       170

    accuracy                          0.975      1293
   macro avg      0.971     0.918     0.943      1293
weighted avg      0.975     0.975     0.975      1293



**Make the prediction for the following messages:**

1. *Call 8890909838 to inquire about our degree programs. Whether you are seeking a Bachelors, Masters, Ph.D. or MBA*
2. *I think this book is a must read for anyone who wants an insight into the Middle East.*
3. *Excellent collection of articles and speeches.*
4. *URGENT! We are trying to contact U.Todays draw shows that you have won a 2000 prize GUARANTEED. Call 090 5809 4507 from a landline. Claim 3030. Valid 12hrs only.*

In [66]:
txt = "Call 8890909838 to inquire about our degree programs. Whether you are seeking a Bachelors, Masters, Ph.D. or MBA"
txt = preprocess(txt)
txt = vectorizer.transform([txt])
model.predict(txt)

array([0])

In [67]:
txt = "I think this book is a must read for anyone who wants an insight into the Middle East."
txt = preprocess(txt)
txt = vectorizer.transform([txt])
model.predict(txt)

array([0])

In [68]:
txt = "Excellent collection of articles and speeches."
txt = preprocess(txt)
txt = vectorizer.transform([txt])
model.predict(txt)

array([0])

In [69]:
txt = "URGENT! We are trying to contact U.Todays draw shows that you have won a 2000 prize GUARANTEED. Call 090 5809 4507 from a landline. Claim 3030. Valid 12hrs only."
txt = preprocess(txt)
txt = vectorizer.transform([txt])
model.predict(txt)

array([1])
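The four cells above repeat the same three steps: pre-process, vectorize, predict. One possible way to avoid the repetition is a scikit-learn pipeline that accepts raw strings. The sketch below is self-contained, so it trains on a few made-up messages and uses a hypothetical `simple_preprocess` stand-in instead of the notebook's `preprocess`; its predictions are not meant to match the results above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def simple_preprocess(text):
    # Stand-in for the notebook's preprocess(); only lowercases, so no NLTK is needed here.
    return text.lower()

# Made-up training messages; 1 = spam, 0 = ham.
train_texts = ["URGENT you have won a prize call now", "Claim your free prize today",
               "Are you coming home tonight", "I liked the book you gave me"]
train_labels = [1, 1, 0, 0]

# The pipeline applies pre-processing, vectorization and classification in one call.
spam_filter = make_pipeline(
    TfidfVectorizer(preprocessor=simple_preprocess),
    LinearSVC(C=1.2, random_state=9),
)
spam_filter.fit(train_texts, train_labels)

messages = ["Excellent collection of articles and speeches.",
            "URGENT! Call now to claim your prize."]
predictions = spam_filter.predict(messages)
print(dict(zip(messages, predictions)))
```

In the notebook itself, passing `preprocess` as the `preprocessor` argument and fitting on the raw training texts would give a single object that classifies any list of messages.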