## About the Dataset
This dataset contains **83,448 email records** (the row count reported when the file is loaded below), each labeled as spam or not-spam (ham). It was created by combining the **2007 TREC Public Spam Corpus** and the **Enron-Spam Dataset**.

### Columns

- **label**  
  - `1`: Spam email  
  - `0`: Legitimate (ham) email

- **text**  
  - The actual content of the email message.

### Sources

- **2007 TREC Public Spam Corpus**  
  - [Original link](https://plg.uwaterloo.ca/~gvcormac/treccorpus07/)  


**Importing Dependencies**

In [33]:
import pandas as pd 
import numpy as np 
import joblib
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score 
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer 


### Data Collection and Preprocessing

In [16]:
# Load the dataset into a pandas dataframe
df = pd.read_csv("C:/Users/USER/Desktop/Datasets/mail_data.csv")
df.head()

|   | label | text |
|---|-------|------|
| 0 | 1 | ounce feather bowl hummingbird opec moment ala... |
| 1 | 1 | wulvob get your medircations online qnb ikud v... |
| 2 | 0 | computer connection from cnn com wednesday es... |
| 3 | 1 | university degree obtain a prosperous future m... |
| 4 | 0 | thanks for all your answers guys i know i shou... |


In [17]:
# Overview of the dataset 
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83448 entries, 0 to 83447
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   83448 non-null  int64 
 1   text    83448 non-null  object
dtypes: int64(1), object(1)
memory usage: 1.3+ MB


In [18]:
# Checking for missing values 
df.isna().sum()

label    0
text     0
dtype: int64

In [19]:
# Checking the number of rows and columns 
df.shape

(83448, 2)

In [20]:
# Distribution of Labels
print(df['label'].value_counts())

label
1    43910
0    39538
Name: count, dtype: int64
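
The counts above show the two classes are close to balanced. As a rough sketch, the class shares and the accuracy of a trivial "always predict the majority class" baseline can be worked out from those two numbers alone (the variable names here are illustrative and not part of the notebook):

```python
# Counts copied from the value_counts() output above.
label_counts = {1: 43910, 0: 39538}

total = sum(label_counts.values())
spam_share = label_counts[1] / total
ham_share = label_counts[0] / total

# A majority-class baseline always predicts the most frequent label,
# so its accuracy equals the share of that label.
baseline_accuracy = max(label_counts.values()) / total

print(f"total rows: {total}")
print(f"spam share: {spam_share:.4f}, ham share: {ham_share:.4f}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

In pandas, `df['label'].value_counts(normalize=True)` should give the same shares directly.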


**Separating Features and Target**

In [21]:
X = df['text']

y = df['label']

In [22]:
print(X)

0        ounce feather bowl hummingbird opec moment ala...
1        wulvob get your medircations online qnb ikud v...
2         computer connection from cnn com wednesday es...
3        university degree obtain a prosperous future m...
4        thanks for all your answers guys i know i shou...
                               ...                        
83443    hi given a date how do i get the last date of ...
83444    now you can order software on cd or download i...
83445    dear valued member canadianpharmacy provides a...
83446    subscribe change profile contact us long term ...
83447    get the most out of life ! viagra has helped m...
Name: text, Length: 83448, dtype: object


In [23]:
print(y)

0        1
1        1
2        0
3        1
4        0
        ..
83443    0
83444    1
83445    1
83446    0
83447    1
Name: label, Length: 83448, dtype: int64


**Splitting the data for Training and Testing**

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, stratify=y, random_state=13)

In [25]:
print(X.shape, X_train.shape, X_test.shape)

(83448,) (70930,) (12518,)
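
The shapes above follow from `test_size=0.15`: scikit-learn rounds the test count up, so 0.15 × 83448 = 12517.2 becomes 12518 test rows and the remaining 70930 go to training. The sketch below illustrates what `stratify` adds, using a made-up toy dataset (the texts and the 60/40 label mix are assumptions for illustration only):

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-in for the email data: 60 spam (1) and 40 ham (0) messages.
texts = [f"message {i}" for i in range(100)]
labels = [1] * 60 + [0] * 40

# Same arguments as the notebook cell, applied to the toy data.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.15, stratify=labels, random_state=13
)

print(len(train_texts), len(test_texts))
print("train label counts:", Counter(train_labels))
print("test label counts:", Counter(test_labels))
```

With stratification, both splits keep the 60/40 label ratio of the full toy set, which is the property the notebook relies on for its own split.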


**Feature Extraction**

In [26]:
# Transforming the text data into feature vectors that can be used as model input
feaure_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)

X_train_features = feaure_extraction.fit_transform(X_train)

X_test_features = feaure_extraction.transform(X_test)


In [27]:
print(X_train_features)

  (0, 183954)	0.08754979972284799
  (0, 60212)	0.09799943920431366
  (0, 74753)	0.07039102117660517
  (0, 118053)	0.11100140581522831
  (0, 114938)	0.06773804711064588
  (0, 236455)	0.07004903270626044
  (0, 195017)	0.13174060829705148
  (0, 116300)	0.13001992397122278
  (0, 43665)	0.09133665799716531
  (0, 69301)	0.11957028448975711
  (0, 116586)	0.058780540487749916
  (0, 235085)	0.07284798355537027
  (0, 52485)	0.0294715087755479
  (0, 189705)	0.12850217283181586
  (0, 54839)	0.07569484785409025
  (0, 245749)	0.08791280939037796
  (0, 209640)	0.056234313093813576
  (0, 233956)	0.08778429682461272
  (0, 53827)	0.05655867329814715
  (0, 53208)	0.1228087199549927
  (0, 117003)	0.1038429053122085
  (0, 88315)	0.07080725086098363
  (0, 64348)	0.06928949972157673
  (0, 118574)	0.13889915240536985
  (0, 160469)	0.12591633205274314
  :	:
  (70929, 137359)	0.06714283228408376
  (70929, 69256)	0.24505520286332158
  (70929, 23599)	0.07006186953130562
  (70929, 234497)	0.07654877781603868
  (70

In [28]:
print(X_test_features)

  (0, 925)	0.11816979803972225
  (0, 2230)	0.056767647436147185
  (0, 2759)	0.05303384309002133
  (0, 3403)	0.04413790346575052
  (0, 4045)	0.11842272043764224
  (0, 4853)	0.06963862505344807
  (0, 4870)	0.06354585148592368
  (0, 10740)	0.07206508920459255
  (0, 11098)	0.0626884757772671
  (0, 12391)	0.06682510159126231
  (0, 12971)	0.15926823104824103
  (0, 15645)	0.042756545833470196
  (0, 16898)	0.048980091212768126
  (0, 27355)	0.11143650195198934
  (0, 27356)	0.05303384309002133
  (0, 29807)	0.07804630007173785
  (0, 30498)	0.037940213378149235
  (0, 30538)	0.07665232277094333
  (0, 30539)	0.05002948767292065
  (0, 31136)	0.04454068862193913
  (0, 33847)	0.025449483431728284
  (0, 35092)	0.05178215898645322
  (0, 38904)	0.0811632793913832
  (0, 40091)	0.0456266742450803
  (0, 41467)	0.027317458982151006
  :	:
  (12517, 129583)	0.06439623136059924
  (12517, 140950)	0.06413539532931418
  (12517, 140996)	0.07576216409674771
  (12517, 145923)	0.07312210213208861
  (12517, 145941)	0.07

### Model Training and Evaluation

**Model Training: LogisticRegression**

In [29]:
model = LogisticRegression()

In [30]:
model.fit(X_train_features, y_train)

**Evaluation**

In [31]:
# Prediction on training data
prediction_on_training_data = model.predict(X_train_features)

# Accuracy on training data (accuracy_score expects y_true first, then y_pred)
accuracy_on_training_data = accuracy_score(y_train, prediction_on_training_data)

print("Accuracy on training data :", accuracy_on_training_data)

Accuracy on training data : 0.9897363597913436


In [32]:
# Prediction on test data
prediction_on_test_data = model.predict(X_test_features)

# Accuracy on test data (accuracy_score expects y_true first, then y_pred)
accuracy_on_test_data = accuracy_score(y_test, prediction_on_test_data)

print("Accuracy on test data :", accuracy_on_test_data)

Accuracy on test data : 0.984023006870107
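
Accuracy alone does not say which kind of mistake the classifier makes. For a spam filter, a ham message flagged as spam (false positive) and a spam message let through (false negative) have different costs. As a hedged sketch on made-up labels (the `y_true` and `y_pred` lists below are illustrative, not the notebook's predictions), a confusion matrix with precision and recall separates the two:

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Toy ground truth and predictions; 1 = spam, 0 = ham.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0]

# With labels=[0, 1], ravel() yields: true neg, false pos, false neg, true pos.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)  # of messages flagged spam, share truly spam
recall = recall_score(y_true, y_pred)        # of true spam, share that was flagged

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

All of these functions take the true labels first and the predictions second. Accuracy is unaffected if the two are swapped, but precision and recall are not.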


In [34]:
# Saving the model locally (this stores only the classifier, not the fitted TfidfVectorizer)
joblib.dump(model, 'spam_mail_prediction')

['spam_mail_prediction']
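
The saved file holds only the classifier, so a fresh session would also need the fitted vectorizer before it could score new text. One possible approach, sketched here on toy data, is to bundle both steps in a scikit-learn `Pipeline` and persist that single object. This is an alternative to the notebook's cell rather than what it does, and the file name `spam_pipeline.joblib` and the toy messages are assumptions:

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "cheap meds online",
    "cheap pills order",
    "win prize cheap",
    "meeting agenda attached",
    "meeting project tomorrow",
    "lunch friday meeting",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = ham

# The pipeline applies TF-IDF and then logistic regression, on raw text input.
pipeline = make_pipeline(
    TfidfVectorizer(min_df=1, stop_words="english", lowercase=True),
    LogisticRegression(),
)
pipeline.fit(texts, labels)

# Round-trip through a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "spam_pipeline.joblib")  # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

original_predictions = pipeline.predict(texts)
restored_predictions = restored.predict(texts)
print(list(restored_predictions))
```

The restored object accepts raw strings directly, so no separate vectorizer variable has to survive between sessions.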

**Building a Predictive System**

In [36]:
mail = "dear valued member its your therapists assistant writing to you i just wanted to give you some really useful advice on how to shop for drugs online its not a secret that many web pharmacies are trying to make profits by selling fake drugs that not only prove to be totally useless but also can cause serious health problems usdrugs is one of very few internet drugstores that always offer only escapenumber generic meds dont hesitate to contact us if you have any questions concerning the information provided if you have any more questions please contact to me please include all previous messages in your email's thank you and best regards rosa arnold email escapelong toshiba eis com www http wgimja superplusnob com gmoilmrxyaix"

In [37]:
# Load the saved model
loaded_model = joblib.load('spam_mail_prediction')

input_mail = [mail]

# Convert text to feature vectors 
input_mail = feaure_extraction.transform(input_mail)

# making prediction 
prediction = loaded_model.predict(input_mail)

# Label 1 means spam and 0 means ham (see the Columns section)
if prediction[0] == 1:
    print("Email is Spam")
else:
    print("Email is Valid(ham)")

Email is Spam
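
The prediction step can be wrapped in a small helper that keeps the label encoding from the Columns section (`1` = spam, `0` = ham) in one place. The sketch below trains on made-up messages so that it runs on its own; `classify_mail`, `LABEL_NAMES`, and the toy data are hypothetical names introduced for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Label encoding from the dataset description: 1 = spam, 0 = ham.
LABEL_NAMES = {1: "Spam", 0: "Ham"}


def classify_mail(mail_text, vectorizer, model):
    """Return 'Spam' or 'Ham' for a single raw email string."""
    features = vectorizer.transform([mail_text])
    predicted_label = int(model.predict(features)[0])
    return LABEL_NAMES[predicted_label]


# Toy training data standing in for the real corpus.
texts = [
    "cheap meds online",
    "cheap pills order",
    "win prize cheap",
    "meeting agenda attached",
    "meeting project tomorrow",
    "lunch friday meeting",
]
labels = [1, 1, 1, 0, 0, 0]

toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression().fit(toy_vectorizer.fit_transform(texts), labels)

print(classify_mail("cheap meds online", toy_vectorizer, toy_model))
print(classify_mail("lunch friday meeting", toy_vectorizer, toy_model))
```

Keeping the mapping in a dictionary makes it harder to swap the two branches by accident than with a hand-written `if`/`else`.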
