# 9.0 Email Spam Classification with Scikit-Learn

**Email spam detection is perhaps one of the most popular machine learning projects for beginners. In this video we will use Scikit-learn to build an SVM classifier that classifies emails as spam or ham (not spam).**

https://www.youtube.com/watch?v=exHwwy9kVcg&ab_channel=CodeHeroku

In [3]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn import svm
from sklearn.model_selection import GridSearchCV

# Step 1: Load Dataset

In [4]:
data = pd.read_csv('/Users/yuliabezginova/PycharmProjects/deep_learning/spam_detection/spam.csv')

In [6]:
data.head()

| | Label | EmailText |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [7]:
data.describe()

| | Label | EmailText |
|---|---|---|
| count | 5572 | 5572 |
| unique | 2 | 5169 |
| top | ham | Sorry, I'll call later |
| freq | 4825 | 30 |


# Step 2: Split the Dataset into Training and Test Data

In [8]:
X = data['EmailText']
y = data['Label']

In [9]:
X_train, y_train = X[0:4457], y[0:4457]
X_test, y_test = X[4457:], y[4457:]
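The slice above takes the first 4457 rows for training and the rest for testing, so the result depends on the row order of the file. As an alternative sketch (not the approach used in this notebook), scikit-learn's `train_test_split` can shuffle the rows and keep the ham/spam ratio the same in both parts via `stratify`. The toy DataFrame and the names `toy`, `X_tr`, `X_te`, `y_tr`, `y_te` are made up for illustration; only the column names match the dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for spam.csv (same column names).
toy = pd.DataFrame({
    'EmailText': [f'message {i}' for i in range(20)],
    'Label': ['spam' if i % 4 == 0 else 'ham' for i in range(20)],
})

X_toy, y_toy = toy['EmailText'], toy['Label']

# 80/20 split, shuffled, with the class ratio preserved in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print(len(X_tr), len(X_te))
print(y_te.value_counts().to_dict())
```

With the real data, the same call on `X` and `y` would give roughly the same 80/20 proportion as the manual slice.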

# Step 3: Extract Features

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html


![Screen%20Shot%202022-11-22%20at%202.49.19%20PM.png](attachment:Screen%20Shot%202022-11-22%20at%202.49.19%20PM.png)

![Screen%20Shot%202022-11-22%20at%202.50.33%20PM.png](attachment:Screen%20Shot%202022-11-22%20at%202.50.33%20PM.png)

In [10]:
cv = CountVectorizer()

In [12]:
features = cv.fit_transform(X_train)

# Step 4: Build a Model

Using SVM as a classifier.

In [13]:
model = svm.SVC()

In [14]:
model.fit(features, y_train)

SVC()

## 4.1 Finding the best parameters

#### The default model's accuracy looks decent, but could we make it better? Scikit-learn's GridSearchCV class tries every combination in a parameter grid with cross-validation and helps find better parameters for our model.

In [17]:
tuned_parameters = {'kernel': ['linear', 'rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10, 100, 1000]}

In [18]:
# asking GridSearch to find best parameters
model = GridSearchCV(svm.SVC(), tuned_parameters)

In [20]:
# running the grid search; the best parameter combination is then refit on the full training set
model.fit(features, y_train)

GridSearchCV(estimator=SVC(),
             param_grid={'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001],
                         'kernel': ['linear', 'rbf']})

In [21]:
# printing the best parameters provided by GridSearchCV
print(model.best_params_)

{'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
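One detail about the grid used here: `gamma` has no effect when `kernel='linear'`, so the single dictionary trains equivalent linear candidates more than once. A possible refinement, shown as a sketch rather than as the method used in this notebook, is to pass a list of dictionaries so that `gamma` is only varied for the RBF kernel. `ParameterGrid` is used here only to count the candidates; the names `original_grid`, `split_grid`, and `search` are illustrative.

```python
from sklearn import svm
from sklearn.model_selection import GridSearchCV, ParameterGrid

# The grid from the notebook: every kernel is combined with every gamma.
original_grid = {'kernel': ['linear', 'rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10, 100, 1000]}

# Alternative: vary gamma only for the RBF kernel, where it matters.
split_grid = [
    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]},
    {'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10, 100, 1000]},
]

n_original = len(ParameterGrid(original_grid))  # number of candidate settings
n_split = len(ParameterGrid(split_grid))
print(n_original, n_split)

# GridSearchCV accepts the list of dictionaries directly.
search = GridSearchCV(svm.SVC(), split_grid)
```

After `search.fit(features, y_train)`, `search.best_score_` holds the mean cross-validated accuracy of the best candidate, which is a useful number to compare against the test accuracy.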


# Step 5: Test the Best Model Quality

In [23]:
# transforming test dataset for modeling
features_test = cv.transform(X_test)

In [24]:
print("Accuracy of the model is:", model.score(features_test, y_test))

Accuracy of the model is: 0.9874439461883409
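Because most messages in this dataset are ham, accuracy alone can hide how many spam messages slip through. A confusion matrix together with precision and recall for the spam class gives a fuller picture. The sketch below uses small made-up label lists (`y_true`, `y_pred`) so that it runs on its own; in this notebook the corresponding inputs would be `y_test` and `model.predict(features_test)`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration only.
y_true = ['ham', 'ham', 'ham', 'spam', 'spam', 'ham']
y_pred = ['ham', 'ham', 'spam', 'spam', 'ham', 'ham']

# Rows are true classes, columns are predicted classes, in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=['ham', 'spam']).tolist()

# Precision: share of predicted spam that really is spam.
# Recall: share of real spam that was caught.
spam_precision = precision_score(y_true, y_pred, pos_label='spam')
spam_recall = recall_score(y_true, y_pred, pos_label='spam')

print(cm)
print(spam_precision, spam_recall)
```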


### Conclusion:
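The tuned SVM reaches about 98.7% accuracy on the test set, which looks good. For context, 4825 of the 5572 messages are ham, so a model that always predicts ham would already be right on about 87% of the full dataset.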

***Thank you for going through this project. Your comments are more than welcome at ybezginova2021@gmail.com***

***Best wishes,***

***Yulia***