**Why We Need a Spam Filter:**

**Business Problem:**

When you run a business, getting rid of spam matters all the more: spam eats up inbox space and costs you time whenever you clear it out. These emails can also carry malware and viruses that compromise company security and data. What can you do to stop them from inundating your work email and, by extension, from compromising your company’s security? You can use spam filters.

Spam filtering is an important tool for keeping these unwanted messages out of your inboxes and for keeping people from clicking on potentially harmful emails. According to studies, more than half of the emails you receive are classified as junk or spam. This alone points to a large potential for security issues, not to mention the drop in productivity from the time people spend deleting such emails.

**Let's use our machine learning skills to solve the business problem:**

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report 
import seaborn as sns

In [None]:
df = pd.read_csv("../input/sms-spam-collection-dataset/spam.csv",encoding='latin-1')
df.head()

**Exploratory Data Analysis**

In [None]:
df.info()

In [None]:
df.drop(["Unnamed: 2","Unnamed: 3","Unnamed: 4"],axis=1,inplace=True)

In [None]:
df.head()

In [None]:
df.rename(columns={"v1":"Label","v2":"Message"},inplace=True)

In [None]:
df.columns

In [None]:
df.Label.value_counts()

In [None]:
sns.countplot(x=df.Label)

In [None]:
df["Label"]=df.Label.map({"ham":0,"spam":1})

In [None]:
df.head()

**Modelling**

In [None]:
# Define the independent (X) and dependent (y) variables.
X=df["Message"]
y=df["Label"]

In [None]:
Count_vec=CountVectorizer()
X=Count_vec.fit_transform(X) #fit and transform the data
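
To make the bag-of-words step concrete, here is a minimal sketch of what `CountVectorizer` produces, assuming two made-up messages rather than the real dataset. The names `toy_messages`, `toy_vectorizer`, `vocabulary`, and `dense_counts` are illustrative only. Each column corresponds to one word of the learned vocabulary and each row holds the word counts for one message.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up messages, used only to show the shape of the output.
toy_messages = ["free prize free entry", "see you at lunch"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_messages)  # sparse matrix, one row per message

# vocabulary_ maps each word to its column index; sort the words by that index.
vocabulary = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
dense_counts = toy_counts.toarray().tolist()

print(vocabulary)
print(dense_counts)
```

With the default settings the text is lowercased and single-character tokens are dropped, which is worth keeping in mind when reading the vocabulary.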

In [None]:
#train test split
X_train, X_test, y_train, y_test= train_test_split(X,y,test_size=0.33,random_state=42)
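
One caveat with the cells above: the vectorizer is fitted on every message before the split, so its vocabulary also reflects the test messages. A possible alternative, sketched here on a small made-up list of messages instead of `spam.csv`, is to split the raw text first and let a scikit-learn `Pipeline` fit the vectorizer on the training part only. The toy data and the names `texts`, `labels`, and `spam_pipeline` are assumptions for illustration, not part of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data standing in for the "Message" and "Label" columns (made up).
texts = [
    "win a free prize now", "free cash offer call now",
    "claim your free ringtone", "urgent prize waiting call",
    "are we meeting for lunch", "see you at home tonight",
    "can you call me later", "thanks for the lunch today",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = spam, 0 = ham

# Split the raw text first, so test messages never influence the vocabulary.
text_train, text_test, label_train, label_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

spam_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])
spam_pipeline.fit(text_train, label_train)  # fits the vectorizer on training text only

test_predictions = spam_pipeline.predict(text_test)
vocabulary_size = len(spam_pipeline.named_steps["vectorizer"].vocabulary_)
```

A side benefit of this design is that the pipeline accepts raw strings at prediction time, so the vectorizer and classifier cannot drift apart.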

In [None]:
# Let's fit our Naive Bayes classifier
NB=MultinomialNB()
NB.fit(X_train,y_train)
print(NB.score(X_test,y_test)) # accuracy on the test set; print it, since only a cell's last expression is displayed
y_pred=NB.predict(X_test)
print(classification_report(y_test,y_pred))

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,y_pred)
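
The raw matrix is easier to read once its four cells are named. The sketch below uses made-up labels (not the notebook's predictions) to show how the cells map to true negatives, false positives, false negatives, and true positives, and how precision and recall for the spam class follow from them. The variable names are illustrative.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 0 = ham, 1 = spam.
toy_true = [0, 0, 0, 0, 1, 1, 1, 0]
toy_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, rows are true classes and columns are predicted classes,
# so ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

spam_precision = tp / (tp + fp)  # share of flagged messages that really are spam
spam_recall = tp / (tp + fn)     # share of spam messages that were caught

print(tn, fp, fn, tp, spam_precision, spam_recall)
```

For a spam filter, a false positive (a legitimate message flagged as spam) is often the more costly error, so precision on the spam class deserves particular attention.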

Let's try some other models.

**Logistic Regression**

In [None]:
from sklearn.linear_model import LogisticRegression
LR=LogisticRegression()
LR.fit(X_train,y_train)
print(LR.score(X_test,y_test)) # accuracy on the test set
y_pred_LR=LR.predict(X_test)
print(classification_report(y_test,y_pred_LR))
confusion_matrix(y_test,y_pred_LR)

**K Neighbours Classifier**

In [None]:
from sklearn.neighbors import KNeighborsClassifier
Knn=KNeighborsClassifier(n_neighbors=1)
Knn.fit(X_train,y_train)
print(Knn.score(X_test,y_test)) # accuracy on the test set
y_pred_knn=Knn.predict(X_test)
print(classification_report(y_test,y_pred_knn))
confusion_matrix(y_test,y_pred_knn)

**Ensemble Classifier:** **Random Forest**

In [None]:
from sklearn.ensemble import RandomForestClassifier
Rf=RandomForestClassifier(n_jobs=-1)
Rf.fit(X_train,y_train)
print(Rf.score(X_test,y_test)) # accuracy on the test set
y_pred_Rf=Rf.predict(X_test)
print(classification_report(y_test,y_pred_Rf))
confusion_matrix(y_test,y_pred_Rf)

**AdaBoost:**

In [None]:
from sklearn.ensemble import AdaBoostClassifier
Ada=AdaBoostClassifier()
Ada.fit(X_train,y_train)
print(Ada.score(X_test,y_test)) # accuracy on the test set
y_pred_Ada=Ada.predict(X_test)
print(classification_report(y_test,y_pred_Ada))
confusion_matrix(y_test,y_pred_Ada)

In [None]:
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
# Plot calibration plots

plt.figure(figsize=(10, 10))
ax1 = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
ax2 = plt.subplot2grid((3, 1), (2, 0))

ax1.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")
for clf, name in [(LR, 'Logistic'),
                  (NB, 'Naive Bayes'),
                  (Ada, 'Adaboost'),
                  (Rf, 'Random Forest'),
                  (Knn,"K Neighbours")]:
    clf.fit(X_train, y_train)
    if hasattr(clf, "predict_proba"):
        prob_pos = clf.predict_proba(X_test)[:, 1]
    else:  # use decision function
        prob_pos = clf.decision_function(X_test)
        prob_pos = \
            (prob_pos - prob_pos.min()) / (prob_pos.max() - prob_pos.min())
    fraction_of_positives, mean_predicted_value = \
        calibration_curve(y_test, prob_pos, n_bins=10)

    ax1.plot(mean_predicted_value, fraction_of_positives, "s-",
             label="%s" % (name, ))

    ax2.hist(prob_pos, range=(0, 1), bins=10, label=name,
             histtype="step", lw=2)

ax1.set_ylabel("Fraction of positives")
ax1.set_ylim([-0.05, 1.05])
ax1.legend(loc="lower right")
ax1.set_title('Calibration plots (reliability curve)')

ax2.set_xlabel("Mean predicted value")
ax2.set_ylabel("Count")
ax2.legend(loc="upper center", ncol=2)

plt.tight_layout()
plt.show()

We can see that Multinomial Naive Bayes is the best at classifying the spam and ham messages. We also tried several other classification and ensemble models, but none did better than Multinomial Naive Bayes.
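
The comparison above rests on a single train/test split. One way to put the models on a more equal footing, sketched here under the assumption of a tiny made-up dataset, is to score each model with cross-validation on the F1 of the spam class. The scores from this toy data mean nothing in themselves; the names `candidate_models` and `mean_f1` are illustrative, and only two of the models are included to keep the sketch short.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data (made up) standing in for df["Message"] and df["Label"].
toy_messages = [
    "win a free prize now", "free cash offer call now",
    "claim your free ringtone", "urgent prize waiting call",
    "are we meeting for lunch", "see you at home tonight",
    "can you call me later", "thanks for the lunch today",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = spam, 0 = ham

candidate_models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic": LogisticRegression(),
}

# Mean F1 on the spam class over stratified folds.
# The vectorizer sits inside the pipeline, so it is refit within each fold.
mean_f1 = {}
for name, model in candidate_models.items():
    pipeline = make_pipeline(CountVectorizer(), model)
    scores = cross_val_score(pipeline, toy_messages, toy_labels, cv=2, scoring="f1")
    mean_f1[name] = float(scores.mean())

print(mean_f1)
```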

**Deployment: Work in Progress**

After training the model, it is desirable to persist it for future use so that we do not have to retrain. To achieve this, we add the following lines to save our model as a .pkl file for later use.

In [None]:
#import joblib  # sklearn.externals.joblib was removed in newer scikit-learn versions
#joblib.dump(NB,"Spam_detection_proj.pkl")

We can then load and use the saved model later like so:

In [None]:
#NB=joblib.load("Spam_detection_proj.pkl")  # joblib.load accepts a file path, so no open() is needed
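
A saved classifier alone cannot score raw text, because new messages have to pass through the same fitted `CountVectorizer` that produced the training features. The following is a minimal sketch of persisting both objects together with `joblib`, assuming a tiny made-up training set and a hypothetical file name in a temporary directory. None of the names below come from the notebook.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data (made up); in the notebook this would be df["Message"] and df["Label"].
toy_messages = ["win a free prize now", "free cash offer call now",
                "are we meeting for lunch", "see you at home tonight"]
toy_labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

toy_vectorizer = CountVectorizer()
toy_model = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_messages), toy_labels)

# Save the vectorizer and the classifier in one file so they stay in sync.
model_path = os.path.join(tempfile.mkdtemp(), "spam_detection_demo.pkl")  # hypothetical path
joblib.dump({"vectorizer": toy_vectorizer, "model": toy_model}, model_path)

# Later (for example in another process): load both and score a new message.
new_message = ["free prize call now"]
bundle = joblib.load(model_path)
loaded_prediction = int(bundle["model"].predict(bundle["vectorizer"].transform(new_message))[0])
original_prediction = int(toy_model.predict(toy_vectorizer.transform(new_message))[0])

print(loaded_prediction, original_prediction)
```

Pickle-based files should only be loaded from trusted sources and with a scikit-learn version compatible with the one that wrote them.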

The above process is called “persisting the model in a standard format”; that is, models are persisted in a format specific to the language used in development.

The model will then be served by a microservice that exposes endpoints to receive requests from clients. This is the next step.
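
Since this step is still open, the following is only a rough sketch of what such a service could look like, using Python's standard-library `http.server` rather than any particular web framework. The `/predict` endpoint, the JSON field `message`, the port, and all function and class names are assumptions made for illustration. The model is trained on made-up data at start-up, where a real service would load a persisted model instead.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy model trained at start-up (made-up data); a real service would load a saved model.
demo_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
demo_pipeline.fit(
    ["win a free prize now", "free cash offer call now",
     "are we meeting for lunch", "see you at home tonight"],
    [1, 1, 0, 0],  # 1 = spam, 0 = ham
)


def classify_payload(payload):
    """Turn a request body like {"message": "..."} into (status code, response dict)."""
    message = payload.get("message")
    if not isinstance(message, str) or not message.strip():
        return 400, {"error": "field 'message' must be a non-empty string"}
    label = int(demo_pipeline.predict([message])[0])
    return 200, {"label": "spam" if label == 1 else "ham"}


class SpamHandler(BaseHTTPRequestHandler):
    """Handles POST /predict with a JSON body (hypothetical endpoint)."""

    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except ValueError:  # covers malformed JSON and undecodable bytes
            payload = {}
        status, body = classify_payload(payload if isinstance(payload, dict) else {})
        encoded = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(encoded)))
        self.end_headers()
        self.wfile.write(encoded)


def run_server(port=8000):
    """Start the service on localhost; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), SpamHandler).serve_forever()
```

Calling `run_server()` starts the service, after which a client could POST `{"message": "free prize call now"}` to `/predict`. Keeping the prediction logic in `classify_payload` separates it from the HTTP handling, so it can be tested without starting a server.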