This section breaks down **Bag of Words (BoW)**, focusing on the roughly 20% of concepts that cover about 80% of what you need to understand and implement it.

### What is Bag of Words?
Bag of Words is a method to represent text data numerically by focusing on **word frequencies** in the text. This method ignores grammar and word order, treating text as a "bag" of words.

### Key Concepts to Implement BoW:
1. **Vocabulary Creation**: Identify all unique words across your text data to create a vocabulary.
2. **Word Counting**: For each document (sentence, paragraph, etc.), count how often each word from the vocabulary appears.
3. **Resulting Matrix**: Represent each document as a vector where each entry is the count of a word from the vocabulary.
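
The three concepts above can be sketched in plain Python before reaching for a library. This is a minimal illustration, not a production implementation: the whitespace tokenizer, the two sample sentences, and the names `docs`, `tokenize`, and `to_vector` are assumptions made for this sketch. Library tokenizers may split text differently.

```python
from collections import Counter

# Two sample sentences, chosen for illustration only
docs = [
    "Data science is amazing",
    "I love learning about data",
]

def tokenize(text):
    # Naive tokenizer (an assumption): lowercase, then split on whitespace
    return text.lower().split()

# 1. Vocabulary creation: every unique token, sorted for a stable column order
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})

def to_vector(text):
    # 2. Word counting: how often each vocabulary word occurs in this document
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

# 3. Resulting matrix: one count vector per document
manual_bow = [to_vector(doc) for doc in docs]

print(vocabulary)
print(manual_bow)
```

Each row of `manual_bow` has one entry per vocabulary word, so every document becomes a vector of the same length regardless of how many words it contains.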

### Steps to Implement BoW
The easiest way to implement BoW in Python is to use `CountVectorizer` from scikit-learn's `sklearn.feature_extraction.text` module.

In [27]:
from sklearn.feature_extraction.text import CountVectorizer

# Example text data
documents = [
    "Data science is amazing",
    "I love learning about data",
    "Data science and machine learning go hand in hand"
]

# 1. Initialize CountVectorizer
vectorizer = CountVectorizer()

# 2. Fit and transform the text data to BoW
bow_matrix = vectorizer.fit_transform(documents)

# 3. Convert to array to see the matrix
bow_array = bow_matrix.toarray()

# 4. Get vocabulary (unique words)
print("Vocabulary:", vectorizer.get_feature_names_out())

# 5. Display BoW matrix
print("BoW Matrix:\n", bow_array)

Vocabulary: ['about' 'amazing' 'and' 'data' 'go' 'hand' 'in' 'is' 'learning' 'love'
 'machine' 'science']
BoW Matrix:
 [[0 1 0 1 0 0 0 1 0 0 0 1]
 [1 0 0 1 0 0 0 0 1 1 0 0]
 [0 0 1 1 1 2 1 0 1 0 1 1]]


### Explanation of the Output:
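1. **Vocabulary**: List of unique words in all documents, sorted alphabetically. Note that "I" is missing: by default `CountVectorizer` lowercases the text and keeps only tokens of two or more alphanumeric characters.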
2. **BoW Matrix**: Each row represents a document, and each column represents the count of a word from the vocabulary in that document.

### When to Use BoW:
Use Bag of Words when:
- You need a quick, simple numerical representation of text.
- You’re working on smaller datasets where word order isn’t crucial, like in text classification tasks.

### Limitations:
BoW doesn’t capture word order or meaning relationships between words. For more advanced applications, consider using **TF-IDF** or **Word Embeddings**.

With just these steps, you have the core of BoW covered for implementation and understanding.

### Prediction Model

In [2]:
import pandas as pd

In [10]:
messages = pd.read_csv('SMSSpamCollection.txt', sep='\t', names=['label', 'message'], encoding='utf8')

In [12]:
messages

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |


In [15]:
# Data cleaning and preprocessing
import re
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
# Build the stopword set once; calling stopwords.words() for every word is very slow
stop_words = set(stopwords.words('english'))

corpus = []
for message in messages['message']:
    # Keep letters only, lowercase, and split into tokens
    review = re.sub('[^a-zA-Z]', ' ', message).lower().split()
    # Drop stopwords and reduce the remaining words to their stems
    review = [ps.stem(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))
    
    
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=2500)
X = cv.fit_transform(corpus).toarray()

# Encode the labels as integers: 1 = spam, 0 = ham
y = (messages['label'] == 'spam').astype(int).values


# Train Test Split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

# Training the model using a Naive Bayes classifier

from sklearn.naive_bayes import MultinomialNB
spam_detect_model = MultinomialNB().fit(X_train, y_train)

y_pred = spam_detect_model.predict(X_test)


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sayan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
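
The cell above stops at `y_pred` without checking how good the predictions are. A natural next step is to compare them with the true labels using `accuracy_score` and `confusion_matrix` from `sklearn.metrics`. The sketch below is self-contained and therefore uses a tiny hand-written corpus instead of the SMS dataset; the texts, labels, and all `toy_` names are hypothetical stand-ins, and the numbers it prints say nothing about performance on the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the preprocessed corpus (1 = spam, 0 = ham)
toy_train_texts = [
    "free entri win cash prize",
    "win free ticket call now",
    "claim free prize call now",
    "ok see you at lunch",
    "are you come home tonight",
    "call me when you get home",
]
toy_train_labels = [1, 1, 1, 0, 0, 0]
toy_test_texts = ["free cash prize", "see you tonight"]
toy_test_labels = [1, 0]

toy_cv = CountVectorizer()
toy_X_train = toy_cv.fit_transform(toy_train_texts)
# Reuse the fitted vocabulary for unseen text; do not refit on the test set
toy_X_test = toy_cv.transform(toy_test_texts)

toy_model = MultinomialNB().fit(toy_X_train, toy_train_labels)
toy_pred = toy_model.predict(toy_X_test)

# Rows of the confusion matrix are true classes, columns are predicted classes
print("Accuracy:", accuracy_score(toy_test_labels, toy_pred))
print("Confusion matrix:\n", confusion_matrix(toy_test_labels, toy_pred))
```

With the notebook's own variables, the equivalent calls would be `accuracy_score(y_test, y_pred)` and `confusion_matrix(y_test, y_pred)`. For spam filtering the confusion matrix is usually more informative than accuracy alone, since the dataset contains far more ham than spam.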
