## Technohacks Intern Task - 2

## Email Spam Filtering

In [1]:
import pandas as pd

In [2]:
# Read the file. Note: with sep='\t', each comma-separated line loads as a single column ('v1,v2,,,')
df = pd.read_csv("spam.csv", sep='\t', encoding='ISO-8859-1')
df.head()

Unnamed: 0,"v1,v2,,,"
0,"ham,""Go until jurong point, crazy.. Available ..."
1,"ham,Ok lar... Joking wif u oni...,,,"
2,"spam,Free entry in 2 a wkly comp to win FA Cup..."
3,"ham,U dun say so early hor... U c already then..."
4,"ham,""Nah I don't think he goes to usf, he live..."
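
The single column named `v1,v2,,,` in the output above suggests that the file is comma-delimited rather than tab-delimited. A minimal sketch of letting pandas split on commas directly, assuming the header row is `v1,v2,,,` as shown; the `sample` text is a hypothetical two-row stand-in for `spam.csv`, and `raw` is an illustrative name:

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for spam.csv; the real file is not reproduced here.
sample = (
    "v1,v2,,,\n"
    'ham,"Ok lar, joking",,,\n'
    "spam,Free entry in a weekly comp,,,\n"
)

# The default separator is a comma; usecols keeps only the first two fields.
raw = pd.read_csv(io.StringIO(sample), usecols=[0, 1])
raw.columns = ["label", "message"]
print(raw)
```

With the real file, the same call would presumably also need the `encoding='ISO-8859-1'` argument used in the cell above. Quoted messages that contain commas stay in one field, so no manual string splitting is needed.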


In [3]:
# Split each raw line at the first comma: the part before it is the label, the rest is the message
df[['label', 'message']] = df['v1,v2,,,'].str.split(',', n=1, expand=True)

In [4]:
df.drop('v1,v2,,,', axis=1, inplace=True)

In [5]:
df.head(10)

Unnamed: 0,label,message
0,ham,"""Go until jurong point, crazy.. Available only..."
1,ham,"Ok lar... Joking wif u oni...,,,"
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"""Nah I don't think he goes to usf, he lives ar..."
5,spam,"""FreeMsg Hey there darling it's been 3 week's ..."
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


In [6]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# The stop-word list must be downloaded once before first use
nltk.download('stopwords')

In [7]:
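ps = PorterStemmer()
# Build the stop-word set once; calling stopwords.words() for every word is slow
stop_words = set(stopwords.words('english'))
corpus = []
for message in df['message']:
    # Keep letters only, lowercase, and split into tokens
    review = re.sub('[^a-zA-Z]', ' ', message)
    review = review.lower()
    review = review.split()

    # Drop stop words and stem the remaining tokens
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)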

In [8]:
corpus[0]

'go jurong point crazi avail bugi n great world la e buffet cine got amor wat'
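
To make the cleaning steps easier to follow, here is a minimal sketch of the same pipeline applied to one message. It uses only the standard library: `STOP_WORDS` is a tiny hypothetical stand-in for NLTK's English list, `clean` is an illustrative helper, and stemming is left out.

```python
import re

# Hypothetical, tiny stop-word set for illustration; the notebook uses NLTK's English list.
STOP_WORDS = {"in", "a", "to", "the"}

def clean(message: str) -> str:
    # Digits and punctuation become spaces, so only letters survive.
    letters_only = re.sub("[^a-zA-Z]", " ", message)
    tokens = letters_only.lower().split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)

cleaned = clean("Free entry in 2 a wkly comp to win FA Cup")
print(cleaned)  # free entry wkly comp win fa cup
```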

### Vectorization

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=5000)
bag_of_words = cv.fit_transform(corpus).toarray()
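
As a small illustration of what the bag-of-words matrix contains, the sketch below vectorizes a hypothetical three-message corpus. The names `tiny_corpus`, `tiny_cv`, `counts`, and `vocab` are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-message corpus, already cleaned.
tiny_corpus = ["free entry win", "ok see you", "win free prize"]

tiny_cv = CountVectorizer()
counts = tiny_cv.fit_transform(tiny_corpus).toarray()

# vocabulary_ maps each token to its column index (assigned alphabetically).
vocab = sorted(tiny_cv.vocabulary_, key=tiny_cv.vocabulary_.get)
print(vocab)
print(counts)
```

Each row is one message and each column counts one vocabulary word. In the notebook, `max_features=5000` limits the vocabulary to the 5000 most frequent words.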

In [10]:
spam_or_not = pd.get_dummies(df['label'])
# Take the second dummy column by position; this is the spam indicator
# only if the labels are exactly 'ham' and 'spam'
spam_or_not = spam_or_not.iloc[:, 1].values
spam_or_not

array([0, 0, 0, ..., 0, 0, 0], dtype=uint8)
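
Row 2 in `df.head(10)` is labelled `spam`, yet the array above begins with `0, 0, 0`, which suggests that the second dummy column may not be the spam indicator. One possible cause is a malformed label that adds an extra column. A minimal sketch of the difference between selecting by position and by name, using hypothetical labels (the `ham"` value is invented for illustration):

```python
import pandas as pd

# Hypothetical labels, including one malformed value, to show the difference.
labels = pd.Series(["ham", "spam", 'ham"', "spam"], name="label")

# Columns are sorted, so position 1 is the malformed label, not 'spam'.
by_position = pd.get_dummies(labels).iloc[:, 1].astype(int).tolist()

# Comparing against the label name does not depend on column order.
by_name = (labels == "spam").astype(int).tolist()

print(labels.value_counts())
print(by_position)  # [0, 0, 1, 0]
print(by_name)      # [0, 1, 0, 1]
```

Running `df['label'].value_counts()` on the real data would show whether any unexpected labels are present.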

### Splitting the dataset

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    bag_of_words, spam_or_not, test_size=0.20, random_state=0
)
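
Spam datasets are usually imbalanced, so it can help to keep the class ratio the same in both splits. The sketch below uses the `stratify` argument of `train_test_split`, which the notebook's call does not use. The data (`X_demo`, `y_demo`) are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 negatives, 20 positives.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

# stratify=y_demo keeps the 80/20 class ratio in both splits.
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0, stratify=y_demo
)
print(y_tr.mean(), y_te.mean())
```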

### Naive Bayes Model

Naive Bayes is a simple, widely used family of probabilistic classifiers for tasks such as text classification, spam detection, and sentiment analysis. It is called "naive" because it assumes that the features (here, word counts) are conditionally independent given the class.

At its core is Bayes' theorem, which gives the probability of an event from prior knowledge of conditions related to it. In classification, it is used to compute the probability of each class given the observed features, and the class with the highest probability is predicted.

![](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*CnoTGGO7XeUpUMeXDrIfvA.png)
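
As a worked example of Bayes' theorem for a single word, the sketch below computes P(spam | "free") = P("free" | spam) · P(spam) / P("free"). All counts are hypothetical and chosen only to keep the arithmetic simple.

```python
# Hypothetical counts, chosen only to make the arithmetic easy to follow.
n_spam, n_ham = 20, 80                 # messages per class
spam_with_free, ham_with_free = 12, 4  # messages containing the word "free"

p_spam = n_spam / (n_spam + n_ham)           # prior P(spam)
p_free_given_spam = spam_with_free / n_spam  # likelihood P(free | spam)
p_free_given_ham = ham_with_free / n_ham     # likelihood P(free | ham)

# Evidence P(free), by the law of total probability.
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Bayes' theorem: posterior P(spam | free).
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 2))  # 0.75
```

The result matches a direct count: 16 of the 100 messages contain "free", and 12 of those 16 are spam. A multinomial Naive Bayes classifier combines such per-word likelihoods across all words in a message, under the independence assumption.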


In [12]:
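from sklearn.naive_bayes import MultinomialNB

# Fit a multinomial Naive Bayes classifier on the training counts
spam_detect_model = MultinomialNB().fit(X_train, y_train)

# Predict labels for the held-out test set
spam_pred = spam_detect_model.predict(X_test)
spam_pred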

array([0, 0, 0, ..., 0, 0, 0], dtype=uint8)

### Confusion Matrix

In [13]:
from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes
confusion_m = confusion_matrix(y_test, spam_pred)
confusion_m

array([[1112,    3],
       [   0,    0]], dtype=int64)
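
In scikit-learn's layout, rows are true classes and columns are predicted classes. The second row of the matrix above sums to zero, which indicates that `y_test` contains no samples of class 1. A minimal sketch of reading the matrix by hand, with the four values copied from the output above:

```python
# Values copied from the confusion matrix above.
# scikit-learn's layout: rows are true classes, columns are predicted classes.
tn, fp = 1112, 3
fn, tp = 0, 0

actual_positives = fn + tp
accuracy = (tn + tp) / (tn + fp + fn + tp)
# Recall is undefined when there are no positive samples to find.
recall = tp / actual_positives if actual_positives else None

print(actual_positives)
print(accuracy)
print(recall)
```

With no positive samples in the test set, a high accuracy says little about how well spam is detected, so it may be worth checking how the labels were encoded before relying on this score.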

### Accuracy Score

In [14]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, spam_pred)
accuracy

0.9973094170403587