<a href="https://colab.research.google.com/github/rajkumar1325/CODSOFT/blob/main/SmsSpamDetection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Implementation Steps:
1. Import the necessary libraries
2. Load the dataset
3. Preprocess the dataset
4. Split the data into training and testing sets
5. Convert the text into numerical features using TF-IDF
6. Train the model (using Logistic Regression)
7. Measure the accuracy of the model
8. Review the classification report
9. Test the model on custom messages
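
The steps above can be sketched end to end on a handful of made-up messages. This is only an illustration under stated assumptions: the toy messages and the names `toy_texts`, `toy_labels`, and `toy_pipeline` are hypothetical, and `make_pipeline` is a scikit-learn convenience that the notebook itself does not use.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages standing in for spam.csv (1 = spam, 0 = ham)
toy_texts = [
    "Free entry win cash prize now",
    "Win a free prize, claim cash now",
    "Free cash prize, call now to claim",
    "Are we still meeting for lunch today",
    "See you at home after lunch",
    "Can you call me after the meeting today",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# TF-IDF vectorizing and logistic regression chained into one estimator
toy_pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(),
)
toy_pipeline.fit(toy_texts, toy_labels)

# Predict on two unseen messages
toy_predictions = toy_pipeline.predict(
    ["Claim your free cash prize", "Lunch at home today"]
)
print(toy_predictions)
```

Chaining the vectorizer and the classifier means raw text can be passed straight to `fit` and `predict`, which is one way to avoid forgetting the transform step on new messages.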


# Importing Necessary **Libraries**

In [9]:
import pandas as pd

In [10]:
# for splitting the dataset
from sklearn.model_selection import train_test_split


In [11]:
# For converting text to a numerical representation
from sklearn.feature_extraction.text import TfidfVectorizer


In [12]:
# For training the logistic regression classifier
from sklearn.linear_model import LogisticRegression


In [13]:
# For evaluating the model
from sklearn.metrics import accuracy_score, classification_report


# Loading the dataset

In [35]:
# First attempt: reading the file with the default UTF-8 encoding
df = pd.read_csv('/content/spam.csv')




UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 606-607: invalid continuation byte

In [36]:
# Reading the file with an explicit encoding
df = pd.read_csv('/content/spam.csv', encoding='latin-1')

# Without the encoding argument, pandas assumes UTF-8 and the read fails with:
# UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 606-607: invalid continuation byte
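
The failure can be reproduced in isolation with a minimal sketch. The sample text below is hypothetical (the notebook does not show which bytes in `spam.csv` trigger the error); it only illustrates that a byte sequence can be valid Latin-1 while being invalid UTF-8.

```python
# Hypothetical sample: the pound sign is a single byte (0xA3) in Latin-1,
# and that byte on its own is not valid UTF-8
raw_bytes = "Win £1000 cash".encode("latin-1")

try:
    raw_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Latin-1 maps every possible byte to a character, so decoding always succeeds
latin1_text = raw_bytes.decode("latin-1")
print(utf8_ok)
print(latin1_text)
```

Because Latin-1 never raises a decode error, it can also silently produce wrong characters if the file was actually written in some other encoding, so it is worth spot-checking a few messages after loading.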

In [37]:
df.head()

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |


# **Preprocessing**

In [38]:
# Renaming the columns to descriptive names
df = df.rename(columns={"v1": "label", "v2": "text"})


In [39]:
df.head()

| | label | text | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |


### Converting Label into Binary

*   'ham' to 0
*   'spam' to 1



In [40]:
df['label'] = df['label'].map({'ham': 0, 'spam': 1})


In [41]:
df.head(6)

| | label | text | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... | | | |
| 1 | 0 | Ok lar... Joking wif u oni... | | | |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | 0 | U dun say so early hor... U c already then say... | | | |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... | | | |
| 5 | 1 | FreeMsg Hey there darling it's been 3 week's n... | | | |
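
The renaming and label-mapping steps can be tightened a little. The sketch below works on a hypothetical three-row frame (`toy_df`) that mimics the column layout shown above; dropping the unused `Unnamed` columns and checking for unmapped labels are suggestions, not steps the notebook performs.

```python
import pandas as pd

# Hypothetical miniature of the raw file layout
toy_df = pd.DataFrame({
    "v1": ["ham", "spam", "ham"],
    "v2": ["Ok lar...", "Free entry...", "See you soon"],
    "Unnamed: 2": [None, None, None],
})

toy_df = toy_df.rename(columns={"v1": "label", "v2": "text"})

# Keep only the two columns the model actually uses
toy_df = toy_df[["label", "text"]]

toy_df["label"] = toy_df["label"].map({"ham": 0, "spam": 1})

# map() turns any label missing from the dictionary into NaN,
# so count those explicitly before training
unmapped_count = int(toy_df["label"].isna().sum())
print(list(toy_df.columns), unmapped_count)
```

A non-zero `unmapped_count` would point to a label spelled differently from `'ham'` or `'spam'`, which would otherwise surface later as an error during training.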


### Splitting data into **Training** and **Testing** set

In [42]:
X = df['text']  # Input features (the SMS messages)
y = df['label']  # Target labels (0 = ham, 1 = spam)

In [43]:
# Splitting the data: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
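
Since spam is the minority class in this dataset, one optional refinement is a stratified split, which keeps the class ratio the same in both parts. The sketch below uses hypothetical toy data; the `stratify` argument is a real `train_test_split` parameter, but the notebook does not use it.

```python
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 16 ham (0) and 4 spam (1)
toy_texts = [f"message {i}" for i in range(20)]
toy_labels = [0] * 16 + [1] * 4

toy_train_x, toy_test_x, toy_train_y, toy_test_y = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# With stratify, the 20% spam share is preserved in both parts
print(sum(toy_train_y), len(toy_train_y))
print(sum(toy_test_y), len(toy_test_y))
```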


In [44]:
df.head()

| | label | text | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... | | | |
| 1 | 0 | Ok lar... Joking wif u oni... | | | |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | 0 | U dun say so early hor... U c already then say... | | | |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... | | | |


# Transforming text into Numerical Format
#### using TF-IDF (scikit-learn's `TfidfVectorizer`)

In [45]:
# Initializing
vectorizer = TfidfVectorizer(stop_words='english')


In [46]:
X_train_tfidf = vectorizer.fit_transform(X_train)  # Fit and transform the training data
X_test_tfidf = vectorizer.transform(X_test)  # Transform the test data
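
The reason for calling `fit_transform` on the training data and only `transform` on the test data is that the vocabulary must come from the training set alone. A small sketch with hypothetical messages (`toy_train`, `toy_test`) shows the effect: words unseen during fitting are simply ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical messages; "tomorrow" appears only in the test message
toy_train = ["free prize waiting", "lunch meeting today"]
toy_test = ["free lunch tomorrow"]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_train_matrix = toy_vectorizer.fit_transform(toy_train)  # learns the vocabulary
toy_test_matrix = toy_vectorizer.transform(toy_test)        # reuses that vocabulary

vocabulary = sorted(toy_vectorizer.vocabulary_)
print(vocabulary)
print(toy_train_matrix.shape, toy_test_matrix.shape)

# Only "free" and "lunch" are known words, so the test row has two non-zero entries
print(toy_test_matrix.nnz)
```

Both matrices have the same number of columns, which is what allows a model trained on one to make predictions on the other.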

In [47]:
print("Well done")

Well done


# Training the model (i.e., *Logistic Regression model*)

In [48]:
# Initialisation
model = LogisticRegression()


In [49]:
# Training
model.fit(X_train_tfidf, y_train)

# Testing the model

In [50]:
# Make predictions on the test data
y_pred = model.predict(X_test_tfidf)

In [51]:
# Calculating the accuracy
accuracy = accuracy_score(y_test, y_pred)

In [52]:
print(f"Accuracy: {accuracy:.4f}")


Accuracy: 0.9525


# Classification Report

In [53]:
# print() renders the report as a table instead of a raw string
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.95      1.00      0.97       965
           1       0.97      0.67      0.79       150

    accuracy                           0.95      1115
   macro avg       0.96      0.83      0.88      1115
weighted avg       0.95      0.95      0.95      1115
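
The accuracy figure is easier to interpret next to a trivial baseline. The arithmetic below uses only the support and recall values from the report above; the variable names are illustrative, and the recall is the rounded value as printed, so the spam count is approximate.

```python
# Support values taken from the classification report
ham_support = 965
spam_support = 150
total = ham_support + spam_support

# Accuracy of a trivial model that labels every message as ham
baseline_accuracy = ham_support / total
print(f"Always-ham baseline accuracy: {baseline_accuracy:.4f}")

# Reported recall on the spam class (rounded to two decimals in the report)
spam_recall = 0.67
approx_spam_caught = int(spam_recall * spam_support)
print(f"Spam messages caught: roughly {approx_spam_caught} of {spam_support}")
```

So the model improves on the always-ham baseline, but the spam recall shows that about a third of the spam messages in the test set are still classified as ham.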

# Testing with **custom SMS input**

In [54]:
messages = ["Free entry in a contest!", "Hi, how are you?"]
messages_tfidf = vectorizer.transform(messages)
predictions = model.predict(messages_tfidf)
print(predictions)
# Report the result for every message, not only the first one
for prediction in predictions:
    if prediction == 1:
        print("This is a Spam message, Open at your own Risk")
    else:
        print("This is not a spam bro! Go through it")


[1 0]
This is a Spam message, Open at your own Risk
This is not a spam bro! Go through it


In [55]:
messages = ["""
Hello dear,
My name is Raj Kumar and I'm Inviting you to my grand party
Please be there.!
"""]
messages_tfidf = vectorizer.transform(messages)
predictions = model.predict(messages_tfidf)
print(predictions)
if predictions[0] == 1:
    print("This is a Spam message, Open at your own Risk")
else:
    print("This is not a spam bro! Go through it")


[0]
This is not a spam bro! Go through it
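
The two custom-message cells repeat the same transform, predict, and print steps, so they could be wrapped in a small helper. The function `describe_messages` below is a hypothetical sketch, trained here on toy data so that it runs on its own; it also reports the spam probability from `predict_proba`, which the notebook does not show.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def describe_messages(messages, vectorizer, model):
    """Return (message, label, spam probability) for each message."""
    features = vectorizer.transform(messages)
    # Locate the probability column that belongs to class 1 (spam)
    spam_column = list(model.classes_).index(1)
    spam_probabilities = model.predict_proba(features)[:, spam_column]
    labels = model.predict(features)
    return [
        (message, "spam" if label == 1 else "ham", float(probability))
        for message, label, probability in zip(messages, labels, spam_probabilities)
    ]


# Hypothetical toy training data (1 = spam, 0 = ham), not taken from spam.csv
toy_texts = [
    "Free entry win cash prize now",
    "Win a free prize, claim cash now",
    "Free cash prize, call now to claim",
    "Are we still meeting for lunch today",
    "See you at home after lunch",
    "Can you call me after the meeting today",
]
toy_labels = [1, 1, 1, 0, 0, 0]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_model = LogisticRegression()
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

results = describe_messages(
    ["Claim your free prize", "Lunch at home today"], toy_vectorizer, toy_model
)
for message, label, probability in results:
    print(f"{label:4s} (spam probability {probability:.2f}): {message}")
```

With the notebook's own objects, the call would be `describe_messages(messages, vectorizer, model)`.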
