<h1>Machine Learning Fundamentals:<br>
Sentiment Analysis</h1>

<h3>Import Libraries</h3>
<strong>Scikit-learn</strong>, often referred to as <strong>sklearn</strong>, is an open-source Python library widely used for machine learning. It implements a wide range of <i>supervised</i> and <i>unsupervised</i> learning algorithms for tasks such as <i>classification</i>, <i>regression</i>, and <i>clustering</i>.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

<h3>Prepare a Dataset</h3>

In [3]:
data = {
    'review': [
        "This movie was great and I loved the acting.",
        "The plot was confusing and the ending was terrible.",
        "A truly fantastic and enjoyable film.",
        "I would not recommend this to anyone.",
        "The special effects were good, but the story was weak.",
        "An absolutely horrible waste of my time."
    ],
    'sentiment': [
        'positive',
        'negative',
        'positive',
        'negative',
        'negative',
        'negative'
    ]
}
df = pd.DataFrame(data)
df

,review,sentiment
0,This movie was great and I loved the acting.,positive
1,The plot was confusing and the ending was terr...,negative
2,A truly fantastic and enjoyable film.,positive
3,I would not recommend this to anyone.,negative
4,"The special effects were good, but the story w...",negative
5,An absolutely horrible waste of my time.,negative
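
Before training, it can help to check how balanced the labels are. The sketch below rebuilds only the sentiment column from the table above (the names sentiment, counts and majority_share are illustrative, not part of the notebook) and counts each class.

```python
import pandas as pd

# Same six labels as the 'sentiment' column of the DataFrame above.
sentiment = pd.Series(
    ['positive', 'negative', 'positive', 'negative', 'negative', 'negative'],
    name='sentiment',
)

# Count how many reviews fall into each class.
counts = sentiment.value_counts()
print(counts)

# Share of the most frequent class: the accuracy of always guessing that class.
majority_share = counts.max() / counts.sum()
print(f"Majority-class share: {majority_share:.2f}")
```

Four of the six reviews are negative, so a model that always answers "negative" would already be right about two-thirds of the time on this data. That baseline is worth keeping in mind when reading the accuracy later on.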


<h3>Preprocessing and Feature Extraction using Bag-of-Words</h3>

In [4]:
# The CountVectorizer tokenises, removes stop words, and counts frequencies
vectorizer = CountVectorizer(stop_words='english', lowercase=True)

# Split before fitting the vectorizer so the test reviews cannot influence the learned vocabulary (data leakage)
X_train, X_test, y_train, y_test = train_test_split(df['review'], df['sentiment'], test_size=0.3, random_state=42)

# 'fit_transform' learns the vocabulary from the training data and transforms it into a numerical matrix
X_train_vec = vectorizer.fit_transform(X_train)
# 'transform' reuses that vocabulary for the test data; words not seen in training are ignored
X_test_vec = vectorizer.transform(X_test)

print("Bag-of-Words matrix created.")
print(f"Shape of training data matrix: {X_train_vec.shape}")

Bag-of-Words matrix created.
Shape of training data matrix: (4, 14)


<h3>Model Training</h3>

In [6]:
# Use a simple Logistic Regression classifier, a good starting point for text classification.
model = LogisticRegression()

# 'fit' the model with our vectorised training data and labels
model.fit(X_train_vec, y_train)

<h3>Prediction and Evaluation</h3>

In [7]:
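# Predict labels for the held-out test reviews
y_pred = model.predict(X_test_vec)

# Accuracy is the fraction of test reviews that were labelled correctly
accuracy = accuracy_score(y_test, y_pred)
print(f"The model's accuracy on the test set is: {accuracy:.2f}")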

The model's accuracy on the test set is: 0.50
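
With six reviews and test_size=0.3, the test set holds only two reviews, so accuracy can only be 0.00, 0.50 or 1.00. The sketch below uses hypothetical labels (not the notebook's actual split) to show how one miss out of two yields 0.50, and how a confusion matrix shows which class was missed.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for a two-review test set; NOT the notebook's actual split.
y_true_demo = ['positive', 'negative']
y_pred_demo = ['negative', 'negative']

# One of the two predictions is correct.
acc = accuracy_score(y_true_demo, y_pred_demo)
print(f"accuracy: {acc:.2f}")

# Rows are true labels, columns are predicted labels, in the order given by 'labels'.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=['negative', 'positive'])
print(cm)
```

In this hypothetical case the second row of the matrix shows the positive review being predicted as negative. A score from a test set this small says very little about how the model would perform on more data.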


<h3>Testing with Different Data</h3>

In [8]:
# Example of a new prediction on a single, unseen sentence
new_review = ["The movie was a complete failure."]
new_review_vec = vectorizer.transform(new_review)
new_review_prediction = model.predict(new_review_vec)

print(f"Prediction for '{new_review[0]}': {new_review_prediction[0]}")

Prediction for 'The movie was a complete failure.': negative
