Sentiment Prediction from Restaurant Reviews

Classifying restaurant reviews as positive or negative using Support Vector Classification (SVC) — with text vectorisation via CountVectorizer.

Problem

Given 1,000 restaurant reviews with binary labels (liked / not liked), build a model that predicts sentiment from raw text. The challenge: converting free-form text into numerical features that an SVM can learn from.

Approach

1. Vectorise text using CountVectorizer: converts words into a bag-of-words matrix with English stop words removed
2. Split data 75/25 with stratified sampling
3. Train an SVC (RBF kernel, default parameters)
4. Evaluate accuracy on held-out test set
5. Test on unseen custom text
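
The steps above can be sketched end to end as follows. This is a minimal illustration, not the notebook's code: the eight reviews and their labels are hypothetical stand-ins for the real 1,000-review dataset, and `random_state=0` is an assumed seed added only for repeatability.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical stand-in data; the real project uses 1,000 labelled reviews.
reviews = [
    "Great food and friendly staff",
    "Loved the pasta",
    "Amazing service",
    "Delicious dessert, will return",
    "Terrible service and cold food",
    "Awful experience",
    "Bland pasta, rude staff",
    "Disgusting dessert, never again",
]
liked = [1, 1, 1, 1, 0, 0, 0, 0]

# 1. Vectorise: bag-of-words counts with English stop words removed.
vect = CountVectorizer(stop_words="english")
X = vect.fit_transform(reviews)

# 2. Split 75/25, keeping the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, liked, test_size=0.25, stratify=liked, random_state=0
)

# 3. Train an SVC with default parameters (RBF kernel).
model = SVC()
model.fit(X_train, y_train)

# 4. Evaluate accuracy on the held-out test set.
accuracy = accuracy_score(y_test, model.predict(X_test))

# 5. Predict on unseen text, reusing the fitted vectoriser.
prediction = model.predict(vect.transform(["Good customer service! The food was nice"]))
```

With so few examples the accuracy value is not meaningful; the sketch only shows how the pieces connect. Note that the vectoriser is fitted once and then reused via `transform` for new text, so the feature columns stay aligned with what the model was trained on.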

What's Inside

Notebook	What It Does
`1, CountVectorizer_demo.ipynb`	Standalone demo of unigram and bigram vectorisation — how text becomes numbers
`2, review_data_using_SVC.ipynb`	Full pipeline: load data → vectorise → train SVC → evaluate → predict on unseen text

Results

The model achieves strong accuracy on the test set. Class distribution is balanced (~50/50 positive/negative), so always guessing one class would score only about 50%, which makes accuracy a reliable metric here.
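
To make the balanced-classes argument concrete, the sketch below scores a majority-class baseline using scikit-learn's `DummyClassifier`. The labels and the placeholder feature column are illustrative assumptions, not the project's data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical balanced labels (5 positive, 5 negative); not the real dataset.
y = np.array([1] * 5 + [0] * 5)
# DummyClassifier ignores feature values, so a placeholder column is enough.
X = np.zeros((len(y), 1))

# Baseline that always predicts a single class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
baseline_accuracy = baseline.score(X, y)
```

On balanced labels this baseline sits at 0.5, so a test accuracy well above that reflects learned signal rather than class imbalance.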

Key output: the model can predict sentiment on completely unseen text:

unseen_text = vect.transform(["Good customer service! The food was nice"])
model.predict(unseen_text)  # => [1] (positive)

Setup

Google Colab

Click the badge above — no setup required.

Local

pip install scikit-learn pandas matplotlib seaborn
git clone https://github.com/wsamuelw/review-data-using-SVC.git
cd review-data-using-SVC
jupyter notebook "2, review_data_using_SVC.ipynb"

Data

Restaurant Reviews — 1,000 reviews scraped from restaurant listings. Tab-separated with two columns:

Column	Type	Description
`Review`	string	Free-text review
`Liked`	int (0/1)	Binary sentiment label

Class distribution: ~50% positive, ~50% negative.
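
A file in this layout can be read with pandas by setting the separator to a tab. The sketch below uses a hypothetical two-row in-memory sample instead of the real file, since the file name inside the `data` folder is not stated here; for the actual dataset, pass its path to `pd.read_csv` in place of the `StringIO` object.

```python
import io

import pandas as pd

# Hypothetical sample in the described layout: header row, tab-separated.
sample_tsv = "Review\tLiked\nThe food was great\t1\nService was slow\t0\n"

# sep="\t" tells pandas the columns are tab-separated.
df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# Share of each label, useful for checking the class balance.
balance = df["Liked"].value_counts(normalize=True)
```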

How CountVectorizer Works

Raw text → tokenise → remove stop words → build vocabulary → create word-count matrix:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['Great food', 'Terrible service']
vect = CountVectorizer(stop_words='english')
X = vect.fit_transform(corpus)

# Vocabulary: ['food', 'great', 'service', 'terrible']
# Matrix: [[1, 1, 0, 0],
#           [0, 0, 1, 1]]

Each row is a document, each column is a word in the vocabulary. The value is how many times that word appears.

Bigrams (2-word phrases) capture context that single words miss:

vect = CountVectorizer(ngram_range=(2, 2))
# 'not good' → single feature, captures negation

Why SVC for Text?

Works well in high dimensions: text vectors have thousands of features (one per vocabulary term), and SVMs handle this naturally
Kernel trick: the RBF kernel captures non-linear relationships without explicit feature engineering
Resists overfitting: margin maximisation tends to generalise well on small-to-medium datasets
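
To illustrate the points above, the sketch below vectorises a few hypothetical reviews, reports the shape of the feature matrix, and fits an SVC with the default RBF kernel alongside a linear kernel. The linear kernel is not part of this project; it is included only as a commonly used alternative for sparse text features, and the data is made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Hypothetical stand-in data, not the project's dataset.
reviews = [
    "Great food and friendly staff",
    "Loved the pasta",
    "Amazing service",
    "Delicious dessert, will return",
    "Terrible service and cold food",
    "Awful experience",
    "Bland pasta, rude staff",
    "Disgusting dessert, never again",
]
liked = [1, 1, 1, 1, 0, 0, 0, 0]

# One column per vocabulary term: even tiny corpora give wide matrices.
X = CountVectorizer(stop_words="english").fit_transform(reviews)
n_docs, n_features = X.shape

# SVC accepts the sparse matrix directly; only the kernel differs here.
rbf_model = SVC(kernel="rbf").fit(X, liked)
linear_model = SVC(kernel="linear").fit(X, liked)

# Training accuracy only; a real comparison needs a held-out test set.
rbf_train_acc = rbf_model.score(X, liked)
linear_train_acc = linear_model.score(X, liked)
```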
