<a href="https://colab.research.google.com/github/samiha-mahin/A-Machine-Learning-Models-Repo/blob/main/Naive_Byes_Spam_Email_Filter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



### 🧠 What is Naive Bayes?

Naive Bayes is a **simple and fast machine learning algorithm** used for **classification**.

It is based on **probability** and uses **Bayes’ Theorem**.

---

### 💡 Why "Naive"?

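Because it **naively assumes** that all features (words, symptoms, etc.) are **independent** of each other once the class is known. This is rarely true in practice (words like "cheap" and "offer" tend to appear together), but the classifier often works well anyway.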
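
As a sketch of what this assumption buys us, write $y$ for the class and $x_1, \dots, x_n$ for the features (this notation is introduced here for illustration and is not used elsewhere in the notebook). Bayes' theorem gives the first equality, and the independence assumption turns the joint likelihood into a simple product:

$$
P(y \mid x_1, \dots, x_n)
= \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}
\;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

The denominator is the same for every class, so it can be ignored when we only need to pick the most probable label.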

---

### 📚 Real-life Example: Spam Detection

Suppose we want to detect whether an email is **spam** or **not spam**.

We train the model with examples like:

| Email Text            | Label    |
| --------------------- | -------- |
| "Buy now cheap offer" | Spam     |
| "Let's meet today"    | Not Spam |
| "Limited offer today" | Spam     |

Now a new email comes:
**"cheap offer today"**

---

### ✅ What Naive Bayes does:

1. Looks at each word in the email.
2. Calculates how often that word appears in **spam** and **not spam** emails.
3. Uses **Bayes’ theorem** to calculate the **probability** of the email being spam or not.
4. Picks the label (Spam/Not Spam) with the **higher probability**.
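
The steps above can be sketched by hand on the three training emails from the table. This is a minimal illustration, not how a library implements it: the whitespace tokenizer, the helper names (`tokenize`, `score`), and the add-one (Laplace) smoothing are assumptions made here for the example.

```python
from collections import Counter

# Toy training set from the table above.
train = [
    ("Buy now cheap offer", "Spam"),
    ("Let's meet today", "Not Spam"),
    ("Limited offer today", "Spam"),
]

def tokenize(text):
    # Simplifying assumption: lowercase and split on whitespace.
    return text.lower().split()

# Count emails and words per label.
doc_counts = Counter(label for _, label in train)
word_counts = {label: Counter() for label in doc_counts}
for text, label in train:
    word_counts[label].update(tokenize(text))

vocabulary = {word for counts in word_counts.values() for word in counts}

def score(text, label, alpha=1.0):
    # Prior: share of training emails with this label.
    result = doc_counts[label] / len(train)
    total_words = sum(word_counts[label].values())
    for word in tokenize(text):
        # Add-one smoothing keeps an unseen word from zeroing the product.
        result *= (word_counts[label][word] + alpha) / (
            total_words + alpha * len(vocabulary)
        )
    return result

new_email = "cheap offer today"
scores = {label: score(new_email, label) for label in doc_counts}
prediction = max(scores, key=scores.get)
print(scores)      # unnormalized scores, one per label
print(prediction)  # label with the higher score
```

The scores are proportional to the class probabilities rather than probabilities themselves; dividing each by their sum would normalize them. On this toy data the "Spam" score comes out higher, mainly because "cheap" and "offer" only occur in spam emails.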

---

### 🧾 Summary:

* Naive Bayes = classification based on probability
* Great for text (like spam detection or sentiment analysis)
* Very fast, works well even on small data


---

### 1. **Multinomial Naive Bayes**

* **Used for**: Text classification (e.g., spam detection)
* **Works with**: Word counts or term frequencies

✅ **Example**:
Classify emails as **spam** or **not spam** using word counts.

```python
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
```
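
A short sketch of how this model might be used end to end, reusing the three toy emails from the spam example earlier. The variable names and the 1/0 label encoding are choices made for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["Buy now cheap offer", "Let's meet today", "Limited offer today"]
labels = [1, 0, 1]  # 1 = spam, 0 = not spam

# Turn each email into a vector of word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

model = MultinomialNB()
model.fit(X, labels)

# New text must go through the same fitted vectorizer.
pred = model.predict(vectorizer.transform(["cheap offer today"]))
print(pred)
```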

---

### 2. **Bernoulli Naive Bayes**

* **Used for**: Binary features (word present or not)
* **Works with**: 0s and 1s (not actual counts)

✅ **Example**:
Detect if a review is **positive or negative** based on presence of certain words (e.g., "good", "bad").

```python
from sklearn.naive_bayes import BernoulliNB
model = BernoulliNB()
```
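
As a rough sketch of the review example, the snippet below feeds presence/absence features into the model. The four reviews and their labels are made up for illustration; `binary=True` tells `CountVectorizer` to record only whether a word occurs.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy reviews (1 = positive, 0 = negative).
reviews = ["good movie", "really good acting", "bad plot", "bad and boring"]
sentiment = [1, 1, 0, 0]

# binary=True stores 1 if a word is present and 0 otherwise.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(reviews)

model = BernoulliNB()
model.fit(X, sentiment)

pred = model.predict(vectorizer.transform(["good acting"]))
print(pred)
```

Unlike the multinomial variant, Bernoulli Naive Bayes also uses the *absence* of a word as evidence, so a review that lacks "bad" counts slightly toward the positive class.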

---

### 3. **Gaussian Naive Bayes**

* **Used for**: Continuous data (like age, height)
* **Works with**: Real numbers, assumes normal distribution

✅ **Example**:
Predict if a person has a **disease** based on **age, blood pressure, temperature**.

```python
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
```
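
A minimal sketch of the disease example follows. The patient measurements and labels are invented for illustration and are not medical data; the column order (age, systolic blood pressure, temperature in °C) is also an assumption of this example.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical patients: [age, systolic blood pressure, temperature in °C].
X = np.array([
    [25, 115, 36.6],
    [30, 118, 36.7],
    [35, 120, 36.5],
    [28, 117, 36.8],
    [60, 150, 38.5],
    [65, 155, 38.9],
    [70, 160, 39.1],
    [62, 152, 38.7],
])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # 1 = disease, 0 = healthy

# The model fits one mean and one variance per feature and per class.
model = GaussianNB()
model.fit(X, y)

new_patients = np.array([[68, 158, 39.0], [27, 116, 36.6]])
preds = model.predict(new_patients)
print(preds)
```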

---

### 🧾 Summary Table:

| Type        | Used For             | Input Type   | Example Use Case              |
| ----------- | -------------------- | ------------ | ----------------------------- |
| Multinomial | Text classification  | Word counts  | Spam detection                |
| Bernoulli   | Binary text features | 0/1          | Sentiment analysis            |
| Gaussian    | Continuous features  | Real numbers | Disease prediction by metrics |




# **Spam Email Detection With Naive Bayes Classifier**

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('/content/spam.csv')

In [3]:
df.head()

|     | Category | Message                                             |
| --- | -------- | --------------------------------------------------- |
| 0   | ham      | Go until jurong point, crazy.. Available only ...   |
| 1   | ham      | Ok lar... Joking wif u oni...                       |
| 2   | spam     | Free entry in 2 a wkly comp to win FA Cup fina...   |
| 3   | ham      | U dun say so early hor... U c already then say...   |
| 4   | ham      | Nah I don't think he goes to usf, he lives aro...   |


In [4]:
df.groupby('Category').describe()

| Category | Message count | Message unique | Message top                                       | Message freq |
| -------- | ------------- | -------------- | ------------------------------------------------- | ------------ |
| ham      | 4825          | 4516           | Sorry, I'll call later                            | 30           |
| spam     | 747           | 641            | Please call our customer service representativ... | 4            |


In [5]:
df['spam'] = (df['Category'] == 'spam').astype(int)
df.head()

|     | Category | Message                                             | spam |
| --- | -------- | --------------------------------------------------- | ---- |
| 0   | ham      | Go until jurong point, crazy.. Available only ...   | 0    |
| 1   | ham      | Ok lar... Joking wif u oni...                       | 0    |
| 2   | spam     | Free entry in 2 a wkly comp to win FA Cup fina...   | 1    |
| 3   | ham      | U dun say so early hor... U c already then say...   | 0    |
| 4   | ham      | Nah I don't think he goes to usf, he lives aro...   | 0    |



### 💡 What this does:

* `df['Category'] == 'spam'` → gives `True` for spam and `False` for others
* `.astype(int)` → converts `True` to `1`, `False` to `0`
* Stores the result in a new column called `'spam'`


### ✅ Result:

You’ll get a new column like this:

| Category | spam |
| -------- | ---- |
| ham      | 0    |
| spam     | 1    |
| ham      | 0    |


In [9]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

In [11]:
# Default split: 75% train, 25% test. No random_state is set,
# so the split (and the score below) varies slightly between runs.
X_train, X_test, y_train, y_test = train_test_split(df.Message, df.spam)

In [12]:
clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])

In [13]:
clf.fit(X_train, y_train)

In [14]:
# Mean accuracy on the held-out test set
clf.score(X_test, y_test)

0.9885139985642498
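
Accuracy alone can flatter a spam filter here: the `groupby` output above shows 4825 ham and 747 spam messages, so a model that always predicts ham would already be right about 86.6% of the time. A sketch of complementary metrics is below. The labels are hypothetical and hard-coded so the block runs on its own; in the notebook one would presumably pass `y_test` and `clf.predict(X_test)` instead.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical labels for illustration only (1 = spam, 0 = ham); not from spam.csv.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
# Precision: of the messages flagged as spam, how many really are spam.
precision = precision_score(y_true, y_pred)
# Recall: of the real spam messages, how many were caught.
recall = recall_score(y_true, y_pred)
# Rows are true labels, columns are predicted labels.
matrix = confusion_matrix(y_true, y_pred)

print(accuracy, precision, recall)
print(matrix)
```

In this made-up case the accuracy looks high even though half of the spam slips through, which is exactly what recall exposes.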

In [16]:
emails = [
    'Hey mohan, can we get together to watch footbal game tomorrow?',
    'Upto 20% discount on parking, exclusive offer just for you. Dont miss this reward!'
]

In [17]:
clf.predict(emails)

array([0, 1])