#From Scratch Using nltk

##Data Preprocessing

In [14]:
import pandas as pd
df = pd.read_csv('data.csv')
df.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [15]:
df.groupby(by='Category').describe()

Unnamed: 0_level_0,Message,Message,Message,Message
Unnamed: 0_level_1,count,unique,top,freq
Category,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
ham,4825,4516,"Sorry, I'll call later",30
spam,747,641,Please call our customer service representativ...,4


In [16]:
df.Category.value_counts()

ham     4825
spam     747
Name: Category, dtype: int64

The dataset is imbalanced: 4825 ham messages against 747 spam messages. There are several ways to deal with this kind of problem.
(1) Oversampling: add examples to the class that has fewer samples, typically by duplicating or resampling them.
(2) Downsampling: remove examples from the class that has more samples.

We will use the second option here.
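
Both options can be sketched on a toy frame. This is only an illustration, not the notebook's data: the frame `toy`, its eight rows, and the `random_state=0` seed are assumptions made up for the example.

```python
import pandas as pd

# Toy frame standing in for the SMS data: 6 ham rows, 2 spam rows (made-up values)
toy = pd.DataFrame({
    "Category": ["ham"] * 6 + ["spam"] * 2,
    "Message": [f"msg {i}" for i in range(8)],
})
ham = toy[toy["Category"] == "ham"]
spam = toy[toy["Category"] == "spam"]

# Downsampling: keep only as many majority rows as there are minority rows
downsampled = pd.concat([ham.sample(n=len(spam), random_state=0), spam])

# Oversampling: draw minority rows with replacement until they match the majority
oversampled = pd.concat([ham, spam.sample(n=len(ham), replace=True, random_state=0)])

down_counts = downsampled["Category"].value_counts().to_dict()
over_counts = oversampled["Category"].value_counts().to_dict()
print(down_counts)
print(over_counts)
```

Downsampling throws data away but keeps every remaining row unique; oversampling keeps all rows but repeats minority examples.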

In [17]:
df_spam = df[df['Category'] == 'spam']
df_spam.shape

(747, 2)

In [18]:
df_ham = df[df['Category'] == 'ham']
df_ham.shape

(4825, 2)

In [19]:
df_ham_downsampled = df_ham.sample(df_spam.shape[0])
df_ham_downsampled.shape

(747, 2)

In [20]:
df_balanced = pd.concat([df_ham_downsampled, df_spam])
df_balanced.shape

(1494, 2)

In [21]:
df_balanced = df_balanced.sample(frac=1)

In [22]:
df_balanced['spam'] = df_balanced['Category'].apply(lambda x: 1 if x=='spam' else 0)
len(df_balanced)

1494

In [23]:
df_balanced.reset_index(drop=True)  # returns a new DataFrame; df_balanced itself keeps its old index

Unnamed: 0,Category,Message,spam
0,ham,Are you there in room.,0
1,spam,Your free ringtone is waiting to be collected....,1
2,ham,What time. I‘m out until prob 3 or so,0
3,ham,It is only yesterday true true.,0
4,ham,Went to pay rent. So i had to go to the bank t...,0
...,...,...,...
1489,spam,You won't believe it but it's true. It's Incre...,1
1490,spam,"If you don't, your prize will go to another cu...",1
1491,ham,O shore are you takin the bus,0
1492,ham,How much did ur hdd casing cost.,0


In [24]:
df_balanced.sample(5)

Unnamed: 0,Category,Message,spam
5409,ham,There is a first time for everything :),0
3651,ham,"We are hoping to get away by 7, from Langport....",0
787,ham,It does it on its own. Most of the time it fix...,0
128,ham,Are you there in room.,0
3577,ham,The sign of maturity is not when we start sayi...,0


##Text Preprocessing

In [25]:
import matplotlib.pyplot as plt
import nltk

In [26]:
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/mohammedsohilshaikh/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/mohammedsohilshaikh/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [27]:
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Build the helpers once instead of on every iteration
tokenizer = RegexpTokenizer(r'\w+')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
my_corpus = []

for text in df_balanced['Message']:
  # Lowercase and tokenize, drop stopwords, lemmatize what remains
  tokenized_words = tokenizer.tokenize(text.lower())
  words_without_stopwords = [lemmatizer.lemmatize(word) for word in tokenized_words if word not in stop_words]
  my_corpus.append(' '.join(words_without_stopwords))
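
To see what the cleaning step does to a single message, here is a dependency-free sketch. It assumes a tiny hand-picked stopword set in place of NLTK's full English list and skips lemmatization; `STOP_WORDS` and `preprocess` are hypothetical names used only for this illustration.

```python
import re

# Stand-in stopword set (an assumption): a handful of entries from NLTK's English list
STOP_WORDS = {"are", "you", "there", "in", "it", "is", "only", "to", "the"}

def preprocess(text):
    """Lowercase, split on runs of word characters, drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(token for token in tokens if token not in STOP_WORDS)

first = preprocess("Are you there in room.")
second = preprocess("It is only yesterday true true.")
print(first)
print(second)
```

Punctuation disappears because `\w+` matches only letters, digits, and underscores, and a message made up mostly of stopwords can shrink to a single word.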

In [28]:
print(my_corpus)

['room', 'free ringtone waiting collected simply text password mix 85069 verify get usher britney fml po box 5249 mk17 92h 450ppw 16', 'time prob 3', 'yesterday true true', 'went pay rent go bank authorise payment', 'guy leaving', 'havent still waitin usual ü come back sch oredi', 'sent jd customer service cum account executive ur mail id detail contact u', '4 costa del sol holiday 5000 await collection call 09050090044 toclaim sae tc pobox334 stockport sk38xh cost 1 50 pm max10mins', 'sm ac sptv new jersey devil detroit red wing play ice hockey correct incorrect end reply end sptv', 'so amount get pls', 'okey doke home dressed co laying around ill speak later bout time stuff', 'congrats 2 mobile 3g videophones r call 09063458130 videochat wid mate play java game dload polyph music noline rentl', 'oh thanks lot already bought 2 egg', 'use foreign stamp country good lecture', 'k text way', 'contacted dating service someone know find call land line 09050000878 pobox45w2tg150p', 'urgent m

##Building + Evaluating a model

In [29]:
from sklearn.feature_extraction.text import CountVectorizer
import joblib
cv = CountVectorizer()
X = cv.fit_transform(my_corpus).toarray()
y = df_balanced.iloc[:, -1].values
joblib.dump(cv, 'CountVectorizer')

['CountVectorizer']
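
The vectorizer is saved because new messages must be mapped onto the same columns that were learned from the training corpus. The sketch below imitates that bag-of-words idea in plain Python; it is not scikit-learn's implementation (for instance, `CountVectorizer` by default also ignores one-character tokens), and `vocabulary` and `to_counts` are hypothetical names.

```python
from collections import Counter

corpus = ["free ringtone free", "room"]  # toy, already-preprocessed messages

# "Fit": collect the vocabulary and give each word a fixed column, in sorted order
vocabulary = {
    word: col
    for col, word in enumerate(sorted({w for doc in corpus for w in doc.split()}))
}

def to_counts(doc):
    """Count vector of doc over the fitted vocabulary; unseen words are dropped."""
    counts = Counter(doc.split())
    return [counts.get(word, 0) for word in vocabulary]

X_toy = [to_counts(doc) for doc in corpus]
new_row = to_counts("free prize room")  # "prize" was never seen during fitting
print(list(vocabulary))
print(X_toy)
print(new_row)
```

A word that was not in the fitted vocabulary contributes nothing to the vector, which is why prediction code should call `transform` on the loaded vectorizer rather than fit a new one.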

In [30]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0, stratify=y)

In [31]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
joblib.dump(classifier, 'Gaussian_Classifier')

['Gaussian_Classifier']

In [32]:
y_pred = classifier.predict(X_test)

In [33]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[145   5]
 [  9 140]]


0.9531772575250836
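
The accuracy can be recomputed by hand from the confusion matrix, along with precision and recall for the spam class. The sketch relies on scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions
cm = [[145, 5],
      [9, 140]]
(tn, fp), (fn, tp) = cm

accuracy = (tn + tp) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of the spam messages, how many were flagged

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
```

Here 5 ham messages were flagged as spam and 9 spam messages slipped through as ham.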

In [34]:
reviews = [
    'Enter a chance to win $5000, hurry up, offer valid until march 31, 2021',
    'You are awarded a SiPix Digital Camera! call 09061221061 from landline. Delivery within 28days. T Cs Box177. M221BP. 2yr warranty. 150ppm. 16 . p p£3.99',
    'it to 80488. Your 500 free text messages are valid until 31 December 2005.',
    'Hey Sam, Are you coming for a cricket game tomorrow',
    "Why don't you wait 'til at least wednesday to see if you get your .",
    "I will not be available for today."
    ]

In [35]:
CV = joblib.load('CountVectorizer')
X_to_be_predicted = CV.transform(reviews).toarray()
y_predicted = classifier.predict(X_to_be_predicted)
print(y_predicted)

[1 1 1 0 0 0]
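
The printed array is easier to read when each prediction is paired with its message. A small sketch, using shortened copies of the six inputs (`previews` and `label_names` are hypothetical names):

```python
# Predictions printed above, paired with shortened versions of the six inputs
y_predicted = [1, 1, 1, 0, 0, 0]
previews = [
    "Enter a chance to win $5000",
    "You are awarded a SiPix Digital Camera!",
    "Your 500 free text messages are valid",
    "Hey Sam, Are you coming for a cricket game",
    "Why don't you wait 'til at least wednesday",
    "I will not be available for today.",
]
label_names = {0: "ham", 1: "spam"}  # same encoding as the `spam` column

labelled = [(label_names[label], text) for label, text in zip(y_predicted, previews)]
for name, text in labelled:
    print(f"{name:>4}  {text}")
```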


##Training using an ANN

In [36]:
import tensorflow as tf

In [37]:
from sklearn.model_selection import train_test_split
import numpy as np
# 80% for training; the remaining 299 rows are split into test and validation below
X_train, X_rem, y_train, y_rem = train_test_split(X, y, train_size = 0.80, random_state = 0, stratify=y)
# Drop one row (index 1) so that the 298 remaining rows split evenly in half
X_rem = np.delete(X_rem, 1, 0)
y_rem = np.delete(y_rem, 1, 0)
print(X_rem.shape, y_rem.shape)
X_test, X_valid, y_test, y_valid = train_test_split(X_rem, y_rem, train_size = 0.50, random_state = 0, stratify=y_rem)

(298, 4249) (298,)
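
Dropping a row is not strictly required: `train_test_split` also accepts an odd number of rows, in which case the two halves simply differ by one. A sketch on made-up data (the 34 toy samples and all variable names are assumptions for illustration):

```python
from sklearn.model_selection import train_test_split

# Toy data (made up): 34 samples, 17 per class, so the held-out part has an odd size
X_toy = [[i] for i in range(34)]
y_toy = [0] * 17 + [1] * 17

# First split: 80% train, 20% held out, class ratio preserved by stratify
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_toy, y_toy, train_size=0.80, random_state=0, stratify=y_toy)

# Second split: the held-out rows are divided into test and validation
X_te, X_va, y_te, y_va = train_test_split(
    X_hold, y_hold, train_size=0.50, random_state=0, stratify=y_hold)

sizes = (len(X_tr), len(X_te), len(X_va))
print(sizes)
```

Every sample ends up in exactly one of the three sets, so nothing is discarded.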


In [38]:
model_ann = tf.keras.Sequential()
model_ann.add(tf.keras.layers.Dense(256, activation='relu'))
model_ann.add(tf.keras.layers.Dense(128, activation='relu'))
model_ann.add(tf.keras.layers.Dense(64, activation='relu'))
model_ann.add(tf.keras.layers.Dropout(0.2))
model_ann.add(tf.keras.layers.Dense(1, activation='sigmoid'))

Metal device set to: Apple M1


2021-11-07 13:10:24.753343: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2021-11-07 13:10:24.753502: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
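
The size of this network follows from the layer widths. The sketch below counts trainable parameters by hand, assuming the input has 4249 features (the width of `X` printed earlier); the `Dropout` layer adds none.

```python
# Layer widths: 4249 input features (the width of X), then the four Dense layers
widths = [4249, 256, 128, 64, 1]

# A Dense layer with n_in inputs and n_out units has n_in * n_out weights plus n_out biases
params_per_layer = [n_in * n_out + n_out for n_in, n_out in zip(widths, widths[1:])]
total_params = sum(params_per_layer)

print(params_per_layer)
print(total_params)
```

Almost all of the parameters sit in the first layer, because the bag-of-words input is so wide.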


In [39]:
model_ann.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

In [40]:
model_ann.fit(X_train, y_train, validation_data=(X_valid, y_valid), batch_size = 32, epochs = 10)

2021-11-07 13:10:28.353984: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
2021-11-07 13:10:28.354635: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz


Epoch 1/10


Epoch 2/10

Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x297038220>

In [41]:
y_pred = model_ann.predict(X_test)
y_pred = np.where(y_pred > 0.5, 1, 0)

2021-11-07 13:10:35.269193: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
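
The thresholding step turns the sigmoid outputs into class labels. A minimal sketch with made-up probabilities, shaped `(n_samples, 1)` as a single-unit output layer produces:

```python
import numpy as np

# Made-up sigmoid outputs with shape (n_samples, 1)
probs = np.array([[0.91], [0.12], [0.50]])

labels = np.where(probs > 0.5, 1, 0)  # strictly greater: exactly 0.5 maps to 0
flat_labels = labels.ravel()          # 1-D array, one label per message

print(labels.shape, flat_labels.tolist())
```

Raising the threshold above 0.5 would make the classifier more reluctant to flag a message as spam; 0.5 is simply the value used in this notebook.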


In [42]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[72  3]
 [ 6 68]]


0.9395973154362416

In [4]:
reviews = [
    'Enter a chance to win $5000, hurry up, offer valid until march 31, 2021',
    'You are awarded a SiPix Digital Camera! call 09061221061 from landline. Delivery within 28days. T Cs Box177. M221BP. 2yr warranty. 150ppm. 16 . p p£3.99',
    'it to 80488. Your 500 free text messages are valid until 31 December 2005.',
    'Hey Sam, Are you coming for a cricket game tomorrow',
    "Why don't you wait 'til at least wednesday to see if you get your .",
    "I will not be available for today."
    ]


In [5]:
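import joblib
import tensorflow as tf
import numpy as np
def get_prediction(reviews):
  # Reload the fitted vectorizer and the saved network from disk
  CV = joblib.load('CountVectorizer')
  model_ann = tf.keras.models.load_model('ann_classifier.h5')
  # Reuse the training vocabulary; transform() does not refit
  X_to_be_predicted = CV.transform(reviews).toarray()
  y_predicted = model_ann.predict(X_to_be_predicted)
  # Sigmoid outputs above 0.5 are labelled spam (1), the rest ham (0)
  y_predicted = np.where(y_predicted > 0.5, 1, 0)
  return y_predicted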

result = get_prediction(reviews)
print(result)

2021-11-07 13:27:40.060935: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.


[[1]]


In [46]:
model_ann.save('ann_classifier.h5')