# **Sentiment Analysis of Customer Reviews with TF-IDF and Logistic Regression**

Sentiment analysis (also known as opinion mining) uses natural language processing to identify and quantify opinions expressed in text. In this notebook, we apply sentiment analysis to customer reviews using TF-IDF vectorization and logistic regression for classification. TF-IDF (term frequency–inverse document frequency) converts text into numerical features that reflect how important a word is in a document relative to the entire corpus.

**Workflow Overview**

1. **Data Loading:** Prepare or load a dataset of labeled customer reviews (positive and negative).
2. **Preprocessing:** Clean the text (lowercase it, remove punctuation and non-alphabetic characters, remove stopwords, etc.).
3. **Feature Extraction:** Apply TF-IDF vectorization to transform text into numerical feature vectors.
4. **Modeling:** Train a logistic regression classifier on the TF-IDF features.
5. **Evaluation:** Evaluate the classifier using metrics like accuracy, precision, recall, and F1-score.
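
As a rough end-to-end illustration of the workflow steps above, the sketch below chains the vectorizer and the classifier in a scikit-learn `Pipeline`. The toy texts, the `toy_` variable names, and the `random_state` value are assumptions made only to keep the example self-contained; they are not the data used in the rest of this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy data, used only to make this sketch self-contained.
toy_texts = [
    "great product, works well",
    "terrible quality, very bad",
    "really happy with this purchase",
    "awful experience, do not buy",
    "excellent value and fast delivery",
    "poor build, broke quickly",
    "love it, highly recommend",
    "worst item, total waste",
]
toy_labels = [1, 0, 1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

# Hold out 25% of the reviews, keeping the class ratio in both splits.
toy_train, toy_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=0, stratify=toy_labels
)

# TfidfVectorizer lowercases by default, and its default token pattern
# ignores punctuation, so basic cleaning happens inside the pipeline.
sentiment_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression()),
])

# Fitting the pipeline learns the vocabulary from the training split only.
sentiment_pipeline.fit(toy_train, toy_y_train)
toy_predictions = sentiment_pipeline.predict(toy_test)
print(toy_predictions)
```

One possible reason to prefer a pipeline is that the TF-IDF vocabulary and IDF weights are then estimated from the training reviews alone, so nothing from the test reviews influences feature extraction.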

**Dataset**

Public benchmarks such as the popular IMDB movie review dataset, which contains 50,000 labeled reviews, are commonly used for this task. For this demonstration, we instead create a small synthetic dataset of 15 customer product reviews with binary sentiment labels (1 = positive, 0 = negative). Below we construct a pandas DataFrame `df` with the example reviews:

In [1]:
import pandas as pd
reviews = [
    "I absolutely loved this product, it exceeded my expectations!",
    "Worst purchase ever. I hate it and regret buying it.",
    "Great value for money. I am happy with it.",
    "Very disappointed. The quality was terrible.",
    "Not what I expected, quite bad overall.",
    "Excellent service and friendly staff. Highly recommend!",
    "Terrible! I will never buy this again.",
    "Fantastic product! Works like a charm.",
    "Bad quality, broke in a week.",
    "I am satisfied with the purchase.",
    "Loved it! Will buy again and tell friends.",
    "Horrible. The worst experience ever.",
    "Pleasantly surprised by how good this is.",
    "Very poor performance and unacceptable quality.",
    "Superb build and easy to use. Very happy!"
]
labels = [
    1, # positive
    0, # negative
    1,
    0,
    0,
    1,
    0,
    1,
    0,
    1,
    1,
    0,
    1,
    0,
    1
]
df = pd.DataFrame({'review': reviews, 'label': labels})
df.head()

|   | review | label |
|---|--------|-------|
| 0 | I absolutely loved this product, it exceeded m... | 1 |
| 1 | Worst purchase ever. I hate it and regret buyi... | 0 |
| 2 | Great value for money. I am happy with it. | 1 |
| 3 | Very disappointed. The quality was terrible. | 0 |
| 4 | Not what I expected, quite bad overall. | 0 |


The DataFrame `df` contains the review text and the corresponding sentiment labels. This small dataset is roughly balanced, with 8 positive and 7 negative examples.
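
The class balance can be checked directly by counting the labels. The sketch below uses the standard library on a copy of the label list from the cell above; inside the notebook, `df['label'].value_counts()` would give the same counts.

```python
from collections import Counter

# Same 15 labels as in the dataset cell (1 = positive, 0 = negative).
labels = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1]

label_counts = Counter(labels)
print("positive:", label_counts[1], "negative:", label_counts[0])
```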

**Data Preprocessing**

Preprocessing text typically includes lowercasing, removing punctuation or non-alphabetic characters, and eliminating common "stopwords" (like "the", "and") that carry little semantic value. Key steps include:

- Lowercase all text and remove non-letter characters.
- Remove stopwords to reduce noise.
- (Optionally) apply stemming or lemmatization to consolidate word forms.

Below, we demonstrate basic cleaning with a regular expression to keep only letters, then lowercase:

In [2]:
import re

# Remove non-letters and convert to lowercase
df['cleaned'] = df['review'].apply(lambda s: re.sub(r'[^a-zA-Z]', ' ', s).lower())
df[['review', 'cleaned']].head(5)

|   | review | cleaned |
|---|--------|---------|
| 0 | I absolutely loved this product, it exceeded m... | i absolutely loved this product it exceeded m... |
| 1 | Worst purchase ever. I hate it and regret buyi... | worst purchase ever i hate it and regret buyi... |
| 2 | Great value for money. I am happy with it. | great value for money i am happy with it |
| 3 | Very disappointed. The quality was terrible. | very disappointed the quality was terrible |
| 4 | Not what I expected, quite bad overall. | not what i expected quite bad overall |


After this step, each review is a lowercase string containing only letters (with spaces where punctuation was). Further steps like removing stopwords can be handled by the vectorizer or other NLP tools.
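
For readers who want to see stopword removal done explicitly rather than inside the vectorizer, here is a minimal sketch. The `STOPWORDS` set and the `clean_and_filter` helper are hypothetical and deliberately tiny; a real project would more likely rely on a curated list from an NLP library.

```python
import re

# Hypothetical, deliberately small stopword list for illustration only.
STOPWORDS = {"the", "and", "a", "it", "is", "was", "i", "with", "this"}


def clean_and_filter(text: str) -> str:
    """Lowercase, keep letters only, then drop tokens found in STOPWORDS."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text).lower()
    tokens = letters_only.split()  # split() also collapses repeated spaces
    return " ".join(tok for tok in tokens if tok not in STOPWORDS)


print(clean_and_filter("Very disappointed. The quality was terrible."))
# very disappointed quality terrible
```

Many general-purpose stopword lists include negations such as "not", which can matter for sentiment, so it may be worth reviewing whichever list is used before applying it.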

**TF-IDF Vectorization**

TF-IDF is a numerical representation of text that weighs terms by importance. We use scikit-learn’s `TfidfVectorizer` to transform the cleaned text into TF-IDF features. This produces a sparse matrix `X` where each row corresponds to a review and each column to a distinct word:

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['cleaned'])
print(X.shape)  # for example, (15, N), where N is vocabulary size

(15, 49)


Each feature in `X` is the TF-IDF weight of a word in a review. Words common to many reviews get lower weights, while distinctive words have higher weights, capturing term importance relative to the corpus.
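
To make the weighting concrete, the following sketch computes TF-IDF by hand for a three-document toy corpus. It assumes the smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalization of each document vector, which, to my understanding, matches the defaults of `TfidfVectorizer`. The corpus and helper names are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus; "product" appears in two documents, "bad" in one.
docs = [
    "good product good price",
    "bad product",
    "good service",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents containing each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term: str) -> float:
    """Smoothed inverse document frequency (assumed formula, see lead-in)."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0


def tfidf_vector(tokens: list[str]) -> dict[str, float]:
    """Term count times IDF, scaled so the vector has unit L2 norm."""
    counts = Counter(tokens)
    raw = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}


weights = tfidf_vector(tokenized[1])  # the document "bad product"
print(weights)
```

In the second document, "bad" receives a larger weight than "product" because "product" also occurs in another document and therefore has a lower IDF.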

**Model Training: Logistic Regression**

For modeling, we train a logistic regression classifier on the TF-IDF features. Logistic regression is a statistical method for binary classification (predicting one of two classes). It models the probability of the positive class using a linear combination of the input features and the logistic (sigmoid) function. We use scikit-learn to fit the model:

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Split data (e.g., 75% train, 25% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, df['label'], test_size=0.25, random_state=42, stratify=df['label']
)


# Train logistic regression
model = LogisticRegression()
model.fit(X_train, y_train)

The model learns coefficients for the TF-IDF features that best separate positive vs negative reviews.
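
The prediction step of logistic regression can be written out in a few lines. The sketch below is a plain-Python illustration, not scikit-learn's implementation, and the coefficient, bias, and feature values are hypothetical. In a fitted scikit-learn model, the learned values are exposed as `model.coef_` and `model.intercept_`.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def positive_probability(coefficients, bias, features):
    """Probability of the positive class for one feature vector."""
    z = bias + sum(w * x for w, x in zip(coefficients, features))
    return sigmoid(z)


# Hypothetical coefficients for three TF-IDF features.
coefficients = [2.0, -1.5, 0.3]
bias = -0.1
features = [0.6, 0.0, 0.8]  # TF-IDF weights of one review

p_positive = positive_probability(coefficients, bias, features)
predicted_label = 1 if p_positive >= 0.5 else 0
print(p_positive, predicted_label)
```

A positive coefficient pushes the probability toward the positive class when its word is present, and a negative coefficient pushes it toward the negative class.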

**Evaluation**

Finally, we evaluate the model on the test set using standard metrics. Accuracy measures the fraction of correctly classified reviews. We also compute precision (the ratio of correct positive predictions to all positive predictions), recall (the ratio of correct positive predictions to all actual positives), and the F1 score (the harmonic mean of precision and recall). A confusion matrix is used to visualize true vs predicted labels:

In [7]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix

# Predict on the held-out test set and compute each metric once
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Report the stored values instead of recomputing them
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 score:", f1)
print("\nClassification report:\n", classification_report(y_test, y_pred, zero_division=0))
print("\nConfusion matrix:\n", confusion_matrix(y_test, y_pred))

Accuracy: 0.5
Precision: 0.5
Recall: 1.0
F1 score: 0.6666666666666666

Classification report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.50      1.00      0.67         2

    accuracy                           0.50         4
   macro avg       0.25      0.50      0.33         4
weighted avg       0.25      0.50      0.33         4


Confusion matrix:
 [[0 2]
 [0 2]]
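
The reported metrics can be reproduced by hand from the confusion matrix above, which helps in reading it. The sketch below assumes scikit-learn's layout, with rows as true labels and columns as predicted labels, both in the order 0 then 1.

```python
# Confusion matrix from the output above: rows = true label, columns = predicted.
confusion = [[0, 2],
             [0, 2]]
tn, fp = confusion[0]  # true negatives, false positives
fn, tp = confusion[1]  # false negatives, true positives

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = (2 * precision * recall / (precision + recall)
      if (precision + recall) else 0.0)

print(accuracy, precision, recall, f1)
```

Read this way, the matrix shows that the classifier predicted the positive class for all four test reviews, which is why recall for class 1 is 1.0 while precision and accuracy are 0.5. With only four test examples, these numbers are best treated as illustrative rather than as a reliable estimate of performance.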
