# Bag-of-Words (BoW)

## Prerequisites:

* **spaCy** for tokenization
* **scikit-learn** for the Bag-of-Words model and classification
* **pandas** for data handling

In [33]:
!pip install spacy scikit-learn pandas



In [34]:
# make sure the required python packages are installed

# install nltk (we'll use 3.6.7)
!pip install nltk==3.6.7 --upgrade

# install spacy (we'll use 3.2.1)
!pip install spacy==3.2.1 --upgrade

# download the spacy en_core_web_sm model (3.2.0 version)
!python -m spacy download en_core_web_sm-3.2.0 --direct

Collecting spacy==3.2.1
  Using cached spacy-3.2.1.tar.gz (1.1 MB)
  error: subprocess-exited-with-error

  × pip subprocess to install build dependencies did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
  Installing build dependencies ... error
Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)

## Step 1: Import the required libraries

In [3]:
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd


## Step 2: Load or Create Sample Data
We’ll create a small dataset of SMS messages for simplicity, where each message is labeled as either spam or ham (not spam).

In [4]:
# Sample dataset of SMS messages
data = {'message': ['Free money!!!', 'Hey, how are you?', 'Win a new car today', 'Call your mom now',
                    'Congratulations, you won a lottery!', 'Let\'s grab lunch tomorrow',
                    'Exclusive offer just for you', 'Meeting is scheduled at 3 PM'],
        'label': ['spam', 'ham', 'spam', 'ham', 'spam', 'ham', 'spam', 'ham']}

df = pd.DataFrame(data)

# View the dataset
print(df)

                               message label
0                        Free money!!!  spam
1                    Hey, how are you?   ham
2                  Win a new car today  spam
3                    Call your mom now   ham
4  Congratulations, you won a lottery!  spam
5            Let's grab lunch tomorrow   ham
6         Exclusive offer just for you  spam
7         Meeting is scheduled at 3 PM   ham


## Step 3: Preprocess the Text Using spaCy
We'll use spaCy to lowercase and tokenize the text, remove stop words and punctuation, and lemmatize the remaining tokens.

In [5]:
# Load the spaCy English model
nlp = spacy.load("en_core_web_sm")

# Function to preprocess messages
def preprocess_text(text):
    doc = nlp(text.lower())
    tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
    return " ".join(tokens)

# Apply the preprocessing function to the messages
df['cleaned_message'] = df['message'].apply(preprocess_text)

# View the cleaned dataset
print(df[['message', 'cleaned_message']])

                               message             cleaned_message
0                        Free money!!!                  free money
1                    Hey, how are you?                         hey
2                  Win a new car today           win new car today
3                    Call your mom now                         mom
4  Congratulations, you won a lottery!  congratulation win lottery
5            Let's grab lunch tomorrow     let grab lunch tomorrow
6         Exclusive offer just for you             exclusive offer
7         Meeting is scheduled at 3 PM       meeting schedule 3 pm


## Step 4: Convert Text Data into Bag-of-Words Features
We’ll use scikit-learn's CountVectorizer to convert the preprocessed text into numerical features, one column per vocabulary word and one row per message.

In [6]:
# Create Bag-of-Words features
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['cleaned_message'])

# Labels (1 for spam, 0 for ham)
y = df['label'].apply(lambda x: 1 if x == 'spam' else 0)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


## Step 5: Train a Naive Bayes Classifier
We’ll use a Multinomial Naive Bayes classifier, which is well-suited for text classification tasks like this.

In [7]:
# Train a Multinomial Naive Bayes model
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

## Step 6: Evaluate the Model
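After training, we’ll predict on the held-out test set and measure accuracy. With only eight messages, test_size=0.2 leaves just two test messages, so the score can only be 0.00, 0.50, or 1.00 and says little about real performance.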

In [8]:
# Predict on test data
y_pred = classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

Accuracy: 0.50
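A single two-message test set gives a very noisy estimate. One possible alternative, sketched here under the assumption that the cleaned messages from Step 3 are used as-is, is to wrap the vectorizer and classifier in a scikit-learn pipeline and score it with stratified cross-validation. The pipeline refits the vocabulary on each training fold, so no test-fold words leak into the features. The variable names and the choice of four folds are illustrative, not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Cleaned messages copied from the Step 3 output; 1 = spam, 0 = ham
cleaned_messages = [
    "free money", "hey", "win new car today", "mom",
    "congratulation win lottery", "let grab lunch tomorrow",
    "exclusive offer", "meeting schedule 3 pm",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# The vectorizer is fit inside each fold, on training messages only
model = make_pipeline(CountVectorizer(), MultinomialNB())

# Four folds of two messages each, one spam and one ham per fold
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
scores = cross_val_score(model, cleaned_messages, labels, cv=cv)

print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
```

With a dataset this small the scores will still vary a lot from fold to fold; the point is the evaluation pattern, not the number.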


## Step 7: Test the Model with New Data
You can test the spam filter by passing new messages through the trained model.

In [9]:
# Function to classify new messages
def predict_spam(message):
    cleaned_message = preprocess_text(message)
    bow = vectorizer.transform([cleaned_message])
    prediction = classifier.predict(bow)
    return 'spam' if prediction[0] == 1 else 'ham'

# Test new messages
new_message = "Congratulations, you've won a free vacation!"
result = predict_spam(new_message)
print(f'The message "{new_message}" is classified as {result}.')


The message "Congratulations, you've won a free vacation!" is classified as spam.


We created a simple spam filter using a **Bag-of-Words** model, with **spaCy** for text preprocessing and **Naive Bayes** for classification. You can extend it with a larger dataset or with richer features such as the TF-IDF weighting covered next.

# Term Frequency-Inverse Document Frequency

* **Term Frequency (TF)**: Measures how frequently a word (term) occurs in a document. If a term appears multiple times, its TF score is higher. It is calculated as:

TF(t,d) = (Number of times term *t* appears in a document *d*) / (Total number of terms in document *d*)

* **Inverse Document Frequency (IDF)**: Measures how important a term is in the entire set of documents (corpus). Rare terms across the corpus will have a higher score, while common terms (like "the", "is") will have a lower score. It is calculated as:

IDF(t) = log((Total number of documents in the corpus) / (Number of documents containing term *t*))

* **TF-IDF**: A product of TF and IDF, this metric captures how important a term is in a document relative to the entire corpus. The idea is to prioritize rare yet significant terms.
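The three definitions above can be worked through by hand. The following is a minimal sketch in plain Python, assuming whitespace tokenization, lowercase text, and the natural logarithm (the formulas do not fix a log base). The helper names `tf`, `idf`, and `tf_idf` are hypothetical, and the corpus is the same three-sentence example used in the cells that follow.

```python
import math

# Lowercased copy of the three-document sample corpus
corpus = [
    "the cat sat on the mat",
    "the dog sat on the mat",
    "the cat chased the dog",
]
tokenized = [doc.split() for doc in corpus]

def tf(term, tokens):
    # Share of the document's tokens that are this term
    return tokens.count(term) / len(tokens)

def idf(term, docs):
    # log(N / number of documents containing the term);
    # assumes the term occurs in at least one document
    n_containing = sum(1 for tokens in docs if term in tokens)
    return math.log(len(docs) / n_containing)

def tf_idf(term, tokens, docs):
    return tf(term, tokens) * idf(term, docs)

print(idf("the", tokenized))                      # in every document, so 0.0
print(idf("chased", tokenized))                   # in one document only
print(tf_idf("chased", tokenized[2], tokenized))  # TF 1/5 times IDF ln(3)
```

A word that appears in every document, such as "the", gets an IDF of zero and therefore contributes nothing, while "chased", which appears in only one document, gets the highest weight.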

In [10]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Sample corpus (3 documents)
documents = [
    "The cat sat on the mat",
    "The dog sat on the mat",
    "The cat chased the dog"
]

In [11]:
# Initialize the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

In [12]:
# Fit and transform the corpus
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)


In [13]:
# Convert the result to a DataFrame for easy viewing
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())


In [14]:
# Show the DataFrame
tfidf_df

        cat    chased       dog      mat      sat
0  0.577350  0.000000  0.000000  0.57735  0.57735
1  0.000000  0.000000  0.577350  0.57735  0.57735
2  0.517856  0.680919  0.517856  0.00000  0.00000
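These values do not match the plain TF × IDF formulas given earlier. With its default settings, scikit-learn's TfidfVectorizer uses raw counts as TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then scales each row to unit L2 length. The sketch below reproduces the matrix with NumPy to make those steps visible; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = [
    "The cat sat on the mat",
    "The dog sat on the mat",
    "The cat chased the dog",
]

# Raw term counts with English stop words removed
counts = CountVectorizer(stop_words="english").fit_transform(documents).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # documents containing each term

# Smoothed IDF used by scikit-learn's default settings
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then L2-normalize each row
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer(stop_words="english").fit_transform(documents).toarray()
print(np.allclose(manual, reference))  # True
```

In the first two documents all three remaining words have the same document frequency, so after normalization each gets 1/√3 ≈ 0.57735. In the third document "chased" is the rarest word and receives the largest weight.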


# Cosine Similarity
We will compare the similarity between two short text documents based on their term frequency (TF) vectors. For simplicity, let’s consider two sentences and calculate their cosine similarity:

## Example Sentences:
* Sentence 1 = "I love teaching data ethics"
* Sentence 2 = "Teaching data ethics is my passion"

## Steps:
1. Tokenize the sentences.
2. Build the term-frequency (TF) vectors for each sentence.
3. Calculate the cosine similarity between the two TF vectors.

In [15]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [16]:
# Example documents
doc1 = "I love teaching data ethics"
doc2 = "Teaching data ethics is my passion"

In [17]:
# Step 1: Convert the documents into a Bag of Words model
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform([doc1, doc2])

In [18]:
# Step 2: Create a DataFrame to visualize the word vectors
vector_df = pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names_out(), index=['doc1', 'doc2'])
print("Word Vector Representation:")
print(vector_df)

Word Vector Representation:
      data  ethics  is  love  my  passion  teaching
doc1     1       1   0     1   0        0         1
doc2     1       1   1     0   1        1         1


In [19]:
# Step 3: Compute the Cosine Similarity
cos_sim = cosine_similarity(vectors)

In [20]:
# Step 4: Display the cosine similarity between the two documents
print("\nCosine Similarity Matrix:")
print(pd.DataFrame(cos_sim, index=['doc1', 'doc2'], columns=['doc1', 'doc2']))


Cosine Similarity Matrix:
          doc1      doc2
doc1  1.000000  0.612372
doc2  0.612372  1.000000
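The off-diagonal value can be checked by hand. Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below applies that definition to the two count vectors printed above; the helper name `cosine` is hypothetical.

```python
import numpy as np

# Count vectors copied from the word vector table above
# (columns: data, ethics, is, love, my, passion, teaching)
doc1_vec = np.array([1, 1, 0, 1, 0, 0, 1])
doc2_vec = np.array([1, 1, 1, 0, 1, 1, 1])

def cosine(a, b):
    # dot(a, b) / (||a|| * ||b||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# 3 shared words / (sqrt(4) * sqrt(6))
print(round(cosine(doc1_vec, doc2_vec), 6))  # 0.612372
```

The word "I" is missing from the table because CountVectorizer's default token pattern ignores single-character tokens.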


# Testing Cosine Similarity on some real text data from Wikipedia

In [21]:
doc1 = "On 24 February 2022, Russia invaded Ukraine in a major escalation of the Russo-Ukrainian War, which started in 2014. The invasion, the largest conflict in Europe since World War II,[13][14][15] has caused hundreds of thousands of military casualties and tens of thousands of Ukrainian civilian casualties. As of 2024, Russian troops occupy about 20% of Ukraine. From a population of 41 million, about 8 million Ukrainians had been internally displaced and more than 8.2 million had fled the country by April 2023, creating Europe's largest refugee crisis since World War II. In late 2021, Russia massed troops near Ukraine's borders but denied any plan to attack. On 24 February 2022, Russian president Vladimir Putin announced a \"special military operation\", stating that it was to support the Russian-backed breakaway republics of Donetsk and Luhansk, whose paramilitary forces had been fighting Ukraine in the Donbas conflict since 2014. Putin espoused irredentist views challenging Ukraine's legitimacy as a state, falsely claimed that Ukraine was governed by neo-Nazis persecuting the Russian minority, and said that Russia's goal was to \"demilitarise and denazify\" Ukraine. Russian air strikes and a ground invasion were launched on a northern front from Belarus towards Kyiv, a southern front from Crimea, and an eastern front from the Donbas and towards Kharkiv. Ukraine enacted martial law, ordered a general mobilisation and severed diplomatic relations with Russia."

In [22]:
doc2 = "There are currently no diplomatic or bilateral relations between Russia and Ukraine. The two states have been at war since Russia invaded the Crimean peninsula in February 2014, and Russian-controlled armed groups seized Donbas government buildings in May 2014. Following the Ukrainian Euromaidan in 2014, Ukraine's Crimean peninsula was occupied by unmarked Russian forces, and later illegally annexed by Russia, while pro-Russia separatists simultaneously engaged the Ukrainian military in an armed conflict for control over eastern Ukraine; these events marked the beginning of the Russo-Ukrainian War. In a major escalation of the conflict on 24 February 2022, Russia launched a large scale military invasion across a broad front, causing Ukraine to sever all formal diplomatic ties with Russia.[1][2][3] After the collapse of the Soviet Union in 1991, the successor states' bilateral relations have undergone periods of ties, tensions, and outright hostility. In the early 1990s, Ukraine's policy was dominated by aspirations to ensure its sovereignty and independence, followed by a foreign policy that balanced cooperation with the European Union (EU), Russia, and other powerful polities.[4] Relations between the two countries became hostile after the 2014 Ukrainian revolution, which was followed by Russia's annexation of Crimea from Ukraine, and the war in Donbas, in which Russia backed the separatist fighters of the Donetsk People's Republic and the Luhansk People's Republic. The conflicts had killed over 13,000 people by early 2020, and brought international sanctions on Russia.[5] Numerous bilateral agreements have been terminated and economic ties severed."

In [23]:
X = vectorizer.fit_transform([doc1, doc2])

In [24]:
vector_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out(), index=['Doc1', 'Doc2'])

In [25]:
cos_sim = cosine_similarity(vector_df)

In [26]:
print('\nCosine Similarity between the two documents:')
print(pd.DataFrame(cos_sim, index = ['doc1', 'doc2'], columns = ['doc1', 'doc2']))


Cosine Similarity between the two documents:
          doc1      doc2
doc1  1.000000  0.688405
doc2  0.688405  1.000000


# Jaccard Similarity
Jaccard Similarity measures the similarity between two sets by dividing the size of the intersection by the size of the union of the sets.

In [27]:
# Sample sentences
doc1 = "Text mining finds useful patterns in data."
doc2 = "Mining data helps in finding useful patterns."

In [28]:
# Tokenize on whitespace into a set of lowercase words
# (punctuation stays attached, so "data." and "data" count as different tokens)
def tokenize(sentence):
    return set(sentence.lower().split())

In [29]:
# Convert documents to sets of words (tokens)
set1 = tokenize(doc1)
set2 = tokenize(doc2)

In [30]:
# Compute the Jaccard Similarity
def jaccard_similarity(set1, set2):
    intersection = set1.intersection(set2)
    union = set1.union(set2)
    return len(intersection) / len(union)

In [31]:
# Output the similarity score
similarity = jaccard_similarity(set1, set2)
print(f"Jaccard Similarity: {similarity:.2f}")

Jaccard Similarity: 0.27


In [32]:
# Show details of the sets
print(f"\nSet 1: {set1}")
print(f"Set 2: {set2}")
print(f"Intersection: {set1.intersection(set2)}")
print(f"Union: {set1.union(set2)}")


Set 1: {'mining', 'text', 'patterns', 'finds', 'data.', 'in', 'useful'}
Set 2: {'helps', 'finding', 'mining', 'patterns.', 'in', 'data', 'useful'}
Intersection: {'in', 'useful', 'mining'}
Union: {'helps', 'finding', 'mining', 'patterns', 'text', 'patterns.', 'finds', 'data.', 'data', 'in', 'useful'}
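The sets above treat 'data.' and 'data', and 'patterns.' and 'patterns', as different tokens because the tokenizer splits on whitespace only. As a sketch of one possible refinement, the variant below strips leading and trailing punctuation from each word before building the sets. The function name `tokenize_clean` is hypothetical and not part of the original notebook.

```python
import string

doc1 = "Text mining finds useful patterns in data."
doc2 = "Mining data helps in finding useful patterns."

def tokenize_clean(sentence):
    # Lowercase, split on whitespace, strip surrounding punctuation
    words = (word.strip(string.punctuation) for word in sentence.lower().split())
    return {word for word in words if word}

def jaccard_similarity(set_a, set_b):
    # |A ∩ B| / |A ∪ B|
    return len(set_a & set_b) / len(set_a | set_b)

clean1 = tokenize_clean(doc1)
clean2 = tokenize_clean(doc2)

# 5 shared words out of 9 distinct words
print(f"Jaccard Similarity: {jaccard_similarity(clean1, clean2):.2f}")  # 0.56
```

With punctuation removed, the intersection grows from three words to five (adding "data" and "patterns"), which roughly doubles the score. Words such as "finds" and "finding" still count as different; lemmatization, as in the spaCy preprocessing step earlier, would be needed to merge them.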
