# Sentiment predictions using Bag-of-Words features

We will first import the packages, download the dataset, build Bag-of-Words (BoW) features, and finally train a logistic regression classifier on them.

In [20]:
from datasets import load_dataset # Hugging Face package for downloading datasets
from sklearn.feature_extraction.text import CountVectorizer # scikit-learn class that builds BoW features
import pandas as pd

In [21]:
# Download the IMDB dataset. Go to https://huggingface.co/datasets for more datasets
imdb_dataset = load_dataset("imdb")

In [22]:
# Inspect dataset structure 
imdb_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [23]:
# Inspect the first example in the training split
imdb_dataset['train'][[0]]

{'text': ['I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far b

The training set consists of 'text', which is the user's review, and 'label'. The latter is binary: it takes the value 0 (negative sentiment) or 1 (positive sentiment).

In [24]:
# Select the predefined training and test splits. You learned about splits in the previous module!
train_data = imdb_dataset['train']
test_data = imdb_dataset['test']

# Extract the text reviews and their labels
train_reviews = train_data['text']

train_labels = train_data['label']

In [25]:
# Create a Bag-of-Words model using CountVectorizer
vectorizer = CountVectorizer(max_features=5000, stop_words='english')  # Limit to 5000 features and remove English stop words

# Fit the vectorizer on the training data and transform the reviews into BoW vectors
X_train_bow = vectorizer.fit_transform(train_reviews)

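# Convert only the first rows of the BoW matrix to a dense DataFrame for display
# (densifying the full 25000 x 5000 matrix would take roughly 1 GB of memory)
bow_df = pd.DataFrame(X_train_bow[:5].toarray(), columns=vectorizer.get_feature_names_out())

# Display the first few rows of the Bag-of-Words matrix
bow_df.head()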

Unnamed: 0,00,000,10,100,11,12,13,13th,14,15,...,yesterday,york,young,younger,youth,zero,zizek,zombie,zombies,zone
0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0


Here we can see that each row holds word counts and that most entries are 0, so the result is a large sparse matrix. Each review is represented as a vector of length 5000, the number of words kept in the BoW vocabulary.

In [26]:
# Use the Bag-of-Words features in a machine learning model
# We will use logistic regression as a classifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [27]:


# Split training data for validation
X_train, X_val, y_train, y_val = train_test_split(X_train_bow, train_labels, test_size=0.2, random_state=42)

# Initialize the classifier and train it on the BoW vectors
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Evaluate the classifier on the validation set
y_pred = clf.predict(X_val)
accuracy = accuracy_score(y_val, y_pred)
print(f"Validation Accuracy: {accuracy:.4f}")


Validation Accuracy: 0.8594
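In the cells above the vectorizer is fitted before the train/validation split, and `test_data` is loaded but never scored. One way to keep vectorization and classification together, so that new text is always transformed with the vocabulary learned from the training rows only, is a scikit-learn pipeline. The following is a minimal sketch on made-up data; the `toy_` variables and `bow_clf` are illustrative names, not part of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, not taken from the IMDB dataset (1 = positive, 0 = negative)
toy_train_texts = [
    "great movie loved it",
    "wonderful acting great story",
    "terrible movie hated it",
    "awful acting boring story",
]
toy_train_labels = [1, 1, 0, 0]

# Unseen texts standing in for a held-out split
toy_new_texts = ["great wonderful", "terrible awful boring"]

# fit() learns the vocabulary and the classifier from the training texts only;
# predict() reuses that vocabulary to vectorize the new texts
bow_clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
bow_clf.fit(toy_train_texts, toy_train_labels)

toy_predictions = bow_clf.predict(toy_new_texts).tolist()
print(toy_predictions)
```

With the notebook's data, the same pattern would presumably be `bow_clf.fit(train_reviews, train_labels)` followed by `bow_clf.predict(test_data['text'])`, with the `CountVectorizer` arguments set as in the earlier cell.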


BoW is a simple approach, and in our example it reaches a validation accuracy of about 86%. Its simplicity also brings drawbacks: it only counts words and ignores their order and context, which limits how much of a review's meaning it can capture.
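One such drawback can be shown in a few lines of plain Python. The sketch below uses a hypothetical helper, `bow_counts`, and two made-up sentences with opposite meanings. Because both contain exactly the same words, their word counts are identical, so any classifier trained on such counts has to give them the same prediction.

```python
import re
from collections import Counter


def bow_counts(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Hypothetical reviews with opposite sentiment but the same set of words
review_a = "The acting was good, not bad at all"
review_b = "The acting was bad, not good at all"

# The two count dictionaries are equal, so BoW cannot tell the reviews apart
same_bow = bow_counts(review_a) == bow_counts(review_b)
print(same_bow)
```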

In [28]:

# Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_val)
rf_accuracy = accuracy_score(y_val, y_pred_rf)

print(f"Random Forest Validation Accuracy: {rf_accuracy:.4f}")


Random Forest Validation Accuracy: 0.8468


In [29]:

# XGBoost Classifier
from xgboost import XGBClassifier

# use_label_encoder is ignored by current XGBoost releases, and 'logloss' is the log-loss metric for binary labels
xgb_clf = XGBClassifier(eval_metric='logloss', random_state=42)
xgb_clf.fit(X_train, y_train)
y_pred_xgb = xgb_clf.predict(X_val)
xgb_accuracy = accuracy_score(y_val, y_pred_xgb)

print(f"XGBoost Validation Accuracy: {xgb_accuracy:.4f}")


XGBoost Validation Accuracy: 0.8524


In [31]:


results_df = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forest', 'XGBoost'],
    'Validation Accuracy': [accuracy, rf_accuracy, xgb_accuracy]
})

results_df


|   | Model | Validation Accuracy |
|---|---|---|
| 0 | Logistic Regression | 0.8594 |
| 1 | Random Forest | 0.8468 |
| 2 | XGBoost | 0.8524 |
