# Sentiment predictions using Bag-of-Words features

We first import the packages, download the dataset, build Bag-of-Words (BoW) features, and finally train a logistic regression classifier on them.

In [None]:
from datasets import load_dataset # Hugging Face library for downloading datasets
from sklearn.feature_extraction.text import CountVectorizer # Builds BoW count vectors
import pandas as pd

In [3]:
# Download the IMDB dataset. Go to https://huggingface.co/datasets for more datasets
imdb_dataset = load_dataset("imdb")

In [None]:
# Inspect dataset structure 
imdb_dataset

In [None]:
# Let's check the first data point in the training split
imdb_dataset['train'][0]

Each training example consists of 'text', which is a user's review, and 'label'. The latter is binary: 0 (negative sentiment) or 1 (positive sentiment).

In [7]:
# The IMDB dataset already comes with training and test splits. You learned about such splits in the previous module!
train_data = imdb_dataset['train']
test_data = imdb_dataset['test']

# Extract the text reviews and their labels
train_reviews = train_data['text']

train_labels = train_data['label']

In [None]:
# Create a Bag-of-Words model using CountVectorizer
vectorizer = CountVectorizer(max_features=5000, stop_words='english')  # Limit to 5000 features and remove English stop words

# Fit the vectorizer on the training data and transform the reviews into BoW vectors
X_train_bow = vectorizer.fit_transform(train_reviews)

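# Convert only the first few rows to a dense DataFrame for inspection.
# Densifying all 25,000 reviews x 5,000 features would need roughly 1 GB of memory.
bow_df = pd.DataFrame(X_train_bow[:5].toarray(), columns=vectorizer.get_feature_names_out())

# Display the first few rows of the Bag-of-Words matrix
bow_df.head()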

Here we can see that each row holds word counts, and most entries are 0, which is why the full matrix is stored in sparse form. In other words, each review is represented as a vector of length 5000 (the number of words kept in the BoW vocabulary).

In [9]:
# Use the Bag-of-Words features in a machine learning model
# We will use logistic regression as a classifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [None]:


# Split training data for validation
X_train, X_val, y_train, y_val = train_test_split(X_train_bow, train_labels, test_size=0.2, random_state=42)

# Initialize the classifier and train it on the BoW vectors
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Evaluate the classifier on the validation set
y_pred = clf.predict(X_val)
accuracy = accuracy_score(y_val, y_pred)
print(f"Validation Accuracy: {accuracy:.4f}")
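
The `test_data` split loaded earlier is not used above. Scoring unseen text follows the same pattern, with one detail worth showing: new text should go through `transform`, not `fit_transform`, so that it is mapped onto the vocabulary learned from the training data. The sketch below illustrates this on a tiny invented corpus; the `toy_*` names and sentences are assumptions made for the example, not part of the notebook's data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data, used only to illustrate the fit/transform pattern
toy_train = ["great film", "wonderful acting", "terrible film", "awful acting"]
toy_train_labels = [1, 1, 0, 0]
toy_test = ["great acting", "terrible plot"]

toy_vec = CountVectorizer()
X_toy_train = toy_vec.fit_transform(toy_train)  # learns the vocabulary
X_toy_test = toy_vec.transform(toy_test)        # reuses it; unseen words are dropped

toy_clf = LogisticRegression(max_iter=1000)
toy_clf.fit(X_toy_train, toy_train_labels)
toy_preds = toy_clf.predict(X_toy_test)

print(X_toy_train.shape, X_toy_test.shape)  # both have 6 columns
print("plot" in toy_vec.vocabulary_)        # False: "plot" was never seen in training
```

With the real data, the same idea would be `vectorizer.transform(test_data['text'])` followed by `clf.predict(...)`.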


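BoW is a simple approach that is easy to implement, and in our example it reaches good validation accuracy. It also has drawbacks: it ignores word order and context, so two sentences built from the same words get the same vector, and words outside the fixed vocabulary are dropped. This simplicity limits its usefulness for tasks that depend on more than word counts.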
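
One concrete drawback is that BoW discards word order. The following minimal sketch uses two invented sentences (not taken from the dataset) that mean different things yet produce identical BoW vectors.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sentences: same words, different meaning
sentences = ["the dog bit the man", "the man bit the dog"]

order_vectorizer = CountVectorizer()
vectors = order_vectorizer.fit_transform(sentences).toarray()

# Both rows contain the same counts, so a BoW model cannot tell them apart
same_vector = bool(np.array_equal(vectors[0], vectors[1]))
print(vectors)
print(same_vector)  # True
```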