### Bag of Words

In [39]:
import numpy as np
import pandas as pd

In [40]:
emails = pd.read_csv("spam.csv")

emails.head(5)

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [41]:
emails['Category'].value_counts()

Category
ham     4825
spam     747
Name: count, dtype: int64

In [42]:
# Encode the label as an integer: 1 for spam, 0 for ham
emails["Spam"] = (emails["Category"] == "spam").astype(int)

In [43]:
emails.head(5)

Unnamed: 0,Category,Message,Spam
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


In [44]:
from sklearn.model_selection import train_test_split

In [49]:
X_train, X_test, y_train, y_test = train_test_split(
    emails.Message, emails.Spam, test_size=0.2, random_state=42)

In [50]:
X_train.shape

(4457,)

In [55]:
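# X_train keeps the original DataFrame index after shuffling, so [1978] is a
# label lookup within the first four rows, not a positional one.
try:
    print(X_train[:4][1978])
except KeyError:
    print("Key error")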

Reply to win £100 weekly! Where will the 2006 FIFA World Cup be held? Send STOP to 87239 to end service


In [56]:
from sklearn.feature_extraction.text import CountVectorizer

In [57]:
v = CountVectorizer()

X_train_cv = v.fit_transform(X_train.values)
X_train_cv

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 59275 stored elements and shape (4457, 7701)>

In [58]:
X_train_cv.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], shape=(4457, 7701))

In [59]:
X_train_cv.shape
# vocabulary size: 7701 distinct tokens

(4457, 7701)
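To make the shape above concrete, here is a minimal pure-Python sketch of what a bag-of-words count matrix contains. The toy corpus and the helper names `tokenize` and `to_counts` are illustrative assumptions, not part of the notebook. The tokenizer only approximates `CountVectorizer`'s defaults (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter

# Toy corpus (hypothetical, not taken from spam.csv)
docs = ["Win a free cup", "free entry to win", "see you at home"]

def tokenize(text):
    # Lowercase, then keep tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct token, sorted, mapped to a column index
all_tokens = sorted({tok for d in docs for tok in tokenize(d)})
vocab = {tok: i for i, tok in enumerate(all_tokens)}

def to_counts(text):
    # One row of the count matrix; tokens outside the vocabulary are dropped
    row = [0] * len(vocab)
    for tok, n in Counter(tokenize(text)).items():
        if tok in vocab:
            row[vocab[tok]] = n
    return row

matrix = [to_counts(d) for d in docs]
print(vocab)
print(matrix)
```

Each row is one message and each column is one vocabulary entry, so the real matrix has 4457 rows and 7701 columns, almost all of them zero.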

In [60]:
v.get_feature_names_out()

array(['00', '000', '000pes', ..., 'zyada', 'èn', 'ú1'],
      shape=(7701,), dtype=object)

In [61]:
v.vocabulary_

{'reply': 5687,
 'to': 6888,
 'win': 7474,
 '100': 258,
 'weekly': 7396,
 'where': 7437,
 'will': 7471,
 'the': 6773,
 '2006': 354,
 'fifa': 2805,
 'world': 7555,
 'cup': 2106,
 'be': 1271,
 'held': 3364,
 'send': 5980,
 'stop': 6460,
 '87239': 694,
 'end': 2568,
 'service': 5999,
 'hello': 3369,
 'sort': 6304,
 'of': 4854,
 'out': 4976,
 'in': 3603,
 'town': 6959,
 'already': 924,
 'that': 6770,
 'so': 6252,
 'dont': 2395,
 'rush': 5825,
 'home': 3441,
 'am': 934,
 'eating': 2504,
 'nachos': 4650,
 'let': 4057,
 'you': 7662,
 'know': 3926,
 'eta': 2627,
 'how': 3487,
 'come': 1902,
 'guoyang': 3258,
 'go': 3140,
 'tell': 6711,
 'her': 3384,
 'then': 6784,
 'told': 6904,
 'hey': 3392,
 'sathya': 5881,
 'till': 6856,
 'now': 4808,
 'we': 7372,
 'dint': 2323,
 'meet': 4392,
 'not': 4797,
 'even': 2638,
 'single': 6147,
 'time': 6857,
 'can': 1629,
 'saw': 5894,
 'situation': 6164,
 'orange': 4945,
 'brings': 1509,
 'ringtones': 5763,
 'from': 2993,
 'all': 911,
 'chart': 1733,
 'heroes':

In [62]:
X_train_np = X_train_cv.toarray()

X_train_np[0]

array([0, 0, 0, ..., 0, 0, 0], shape=(7701,))

In [63]:
np.where(X_train_np[0] != 0)

(array([ 258,  354,  694, 1271, 2106, 2568, 2805, 3364, 5687, 5980, 5999,
        6460, 6773, 6888, 7396, 7437, 7471, 7474, 7555]),)

In [64]:
X_train[1133]

'Good morning princess! How are you?'

In [65]:
v.get_feature_names_out()[3185]

'gotbabes'

### Naive Bayes Model

In [66]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()

model.fit(X_train_cv, y_train)

In [67]:
# Reuse the vocabulary learned on the training set; do not refit on test data
X_test_cv = v.transform(X_test)

In [68]:
from sklearn.metrics import classification_report

y_pred = model.predict(X_test_cv)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      1.00      1.00       966
           1       1.00      0.94      0.97       149

    accuracy                           0.99      1115
   macro avg       1.00      0.97      0.98      1115
weighted avg       0.99      0.99      0.99      1115
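For the spam class (label 1), the report shows precision 1.00 and recall 0.94, both rounded to two decimals: almost nothing flagged as spam was actually ham, while roughly 6% of the 149 spam messages were missed. The sketch below shows how these metrics follow from true-positive, false-positive, and false-negative counts. The labels are made up for illustration and the function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or absent
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels: 4 spam messages, one of which is missed
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(toy_true, toy_pred)
print(precision, recall, f1)
```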



In [69]:
email = [
    'Hi Shashank! You have won a lottery of $500000. Call us back.'
]

In [70]:
email_cv = v.transform(email)

model.predict(email_cv)

array([0])
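The model labels this message as ham (0) even though it reads like spam. The notebook does not establish why. One thing worth checking is that `transform` silently ignores any token that is not in the fitted vocabulary, so unseen words contribute nothing to the prediction. The sketch below demonstrates that behavior on a toy vocabulary. The corpus and the name `toy_v` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy training corpus (hypothetical); "lottery" and "500000" never appear
toy_v = CountVectorizer()
toy_v.fit(["free prize now", "see you at lunch"])

# Only "free" is in the vocabulary, so the other tokens are dropped
row = toy_v.transform(["free lottery of 500000"]).toarray()[0]
known = [w for w in ["free", "lottery", "500000"] if w in toy_v.vocabulary_]

print(row, known)
```

In the notebook, the equivalent check would be to test which words of the email appear in `v.vocabulary_`, and to inspect `model.predict_proba(email_cv)` to see how close the decision was.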

In [71]:
from sklearn.pipeline import Pipeline

clf = Pipeline(
    [
        ("vectorizer", CountVectorizer()),
        ("nb", MultinomialNB())
    ]
)

In [72]:
clf.fit(X_train, y_train)

The pipeline is now fitted: the vectorizer has learned its vocabulary and the classifier has learned its parameters. When we call `predict` on `X_test`, the text is first vectorized and then passed to the Naive Bayes classifier.

In [75]:
clf[0].get_feature_names_out()

array(['00', '000', '000pes', ..., 'zyada', 'èn', 'ú1'],
      shape=(7701,), dtype=object)

In [73]:
y_predicted = clf.predict(X_test)
y_predicted

array([0, 0, 0, ..., 0, 0, 0], shape=(1115,))

In a machine learning pipeline, the `fit` and `predict` methods serve distinct purposes:

### `fit`
- **Purpose**: Trains the model by learning patterns from the training data.
- **Pipeline Behavior**: Steps run in order. Each intermediate step (here `CountVectorizer`) is fitted and then transforms the data; the final step (here `MultinomialNB`) is fitted on the transformed output.
  - For example, `CountVectorizer` learns the vocabulary from the training data, and `MultinomialNB` learns the probabilities for classification.
- **Input**: Training data (`X_train`) and corresponding labels (`y_train`).
- **Output**: The pipeline is updated with learned parameters (e.g., the vocabulary and the class and feature log-probabilities).

### `predict`
- **Purpose**: Uses the trained model to make predictions on new, unseen data.
- **Pipeline Behavior**: Each intermediate step applies `transform` with its learned parameters, and the final step applies `predict`.
  - For example, `CountVectorizer` transforms the input text into a feature matrix, and `MultinomialNB` predicts the class labels.
- **Input**: Test data (`X_test` or new data).
- **Output**: Predicted labels or values.

### Key Differences
1. **Training vs. Inference**:
   - `fit` is for training the model.
   - `predict` is for inference (making predictions).

2. **Data Dependency**:
   - `fit` requires both features (`X`) and labels (`y`) for a supervised pipeline like this one.
   - `predict` only requires features (`X`).

3. **State Changes**:
   - `fit` modifies the internal state of the pipeline (e.g., vocabulary, model parameters).
   - `predict` does not change the state; it uses the already learned parameters.

In this notebook's pipeline, the two calls are:


In [None]:
# clf.fit(X_train, y_train)  # Trains the vectorizer and Naive Bayes model
# y_predicted = clf.predict(X_test)  # Predicts labels for test data
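The `fit`/`predict` split described above can be checked on a small example: a fitted `Pipeline` should give the same predictions as calling the vectorizer and classifier by hand. This is a sketch on a made-up four-message corpus. The variable names (`train_texts`, `pipe`, `vec`, `nb`) are illustrative and not the notebook's.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training data: 1 = spam, 0 = ham
train_texts = ["win free prize now", "free entry win cash",
               "see you at lunch", "are you home yet"]
train_labels = [1, 1, 0, 0]
new_texts = ["free cash prize", "lunch at home"]

# Pipeline: fit runs fit_transform on the vectorizer, then fit on the classifier
pipe = Pipeline([("vectorizer", CountVectorizer()), ("nb", MultinomialNB())])
pipe.fit(train_texts, train_labels)
pipe_pred = pipe.predict(new_texts)

# The same steps written out manually
vec = CountVectorizer()
nb = MultinomialNB()
nb.fit(vec.fit_transform(train_texts), train_labels)
manual_pred = nb.predict(vec.transform(new_texts))

print(pipe_pred, manual_pred)
```

Both paths use `transform`, not `fit_transform`, on the new texts, so the vocabulary learned during training is left unchanged at prediction time.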