<a href="https://colab.research.google.com/github/skeew0813/Text_Analytics/blob/main/Text_Analytics_Week_10.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Title**: Text Analytics: Week 10   
**Author**: Ryan Weeks  
**Date**: 5/17/2025  
**Description**:  This notebook builds a sentiment analysis model using hotel review data. The workflow includes text preprocessing, TF-IDF vectorization, and training a simple neural network using TensorFlow. The model is evaluated using a train/validation/test split, and performance is assessed through accuracy metrics. All steps are documented with a focus on thought process and observations, in line with assignment requirements.

In [1]:
from google.colab import files
uploaded = files.upload()

Saving hotel-reviews.csv to hotel-reviews.csv


In [2]:
from google.colab import files
uploaded = files.upload()

Saving text_normalizer.py to text_normalizer.py


In [3]:
import nltk
nltk.download('punkt', force=True)
nltk.download('stopwords', force=True)
nltk.download('wordnet', force=True)
nltk.download('omw-1.4', force=True)
nltk.download('averaged_perceptron_tagger', force=True)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

## 🔢 Step 1: Load Data and Label Encode Sentiment

The dataset includes hotel reviews and a sentiment label in the `Is_Response` column, which contains values like `"happy"` and `"not happy"`. For modeling, I converted these to numeric values using label encoding:

- `happy` → 0  
- `not happy` → 1

I also kept only the review text and the new encoded sentiment column for simplicity.
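
One thing worth checking after a dictionary-based `map` is whether any label failed to match, since pandas fills unmatched values with `NaN` rather than raising an error. The sketch below uses a small made-up frame to show the check; the `"Happy "` variant is a hypothetical example, not a value observed in the dataset.

```python
import pandas as pd

# Hypothetical sample; the real labels come from hotel-reviews.csv
sample = pd.DataFrame({"Is_Response": ["happy", "not happy", "Happy "]})
label_map = {"happy": 0, "not happy": 1}

# Labels missing from the dictionary become NaN instead of raising an error
encoded = sample["Is_Response"].map(label_map)
n_unmapped = int(encoded.isna().sum())

# Normalizing case and whitespace first lets every label match
cleaned = sample["Is_Response"].str.strip().str.lower().map(label_map)
n_unmapped_after = int(cleaned.isna().sum())

print(n_unmapped, n_unmapped_after)
```

If the count of unmapped labels is zero on the real data, the encoded column can safely be treated as an integer target.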


In [4]:
import pandas as pd

# Load the hotel reviews dataset
df = pd.read_csv("hotel-reviews.csv")

# Encode sentiment: happy = 0, not happy = 1
df['Sentiment'] = df['Is_Response'].map({'happy': 0, 'not happy': 1})

# Keep only necessary columns
df = df[['Description', 'Sentiment']]

# Preview the data
df.head()

|   | Description | Sentiment |
|---|---|---|
| 0 | The room was kind of clean but had a VERY stro... | 1 |
| 1 | I stayed at the Crown Plaza April -- - April -... | 1 |
| 2 | I booked this hotel through Hotwire at the low... | 1 |
| 3 | Stayed here with husband and sons on the way t... | 0 |
| 4 | My girlfriends and I stayed here to celebrate ... | 1 |


## 🧼 Step 2: Normalize the Review Text

For modeling, it's important to clean the raw review text. I used the custom `TextNormalizer` we built in Week 4, which applies multiple preprocessing steps:

- Removing HTML tags and special characters
- Expanding contractions (e.g., “can’t” → “cannot”)
- Lowercasing text (lemmatization is supported but turned off here with `text_lemmatization=False`)
- Removing stopwords

The result is stored in a new column called `Clean_Review`, which will be used for training the model.
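
`text_normalizer.py` itself is not shown in this notebook, so the following is only a minimal sketch of the kind of steps listed above, written with the standard library. The function name `simple_normalize`, the tiny `STOPWORDS` set, and the `CONTRACTIONS` table are illustrative assumptions, not the actual `TextNormalizer` implementation.

```python
import re

# Hypothetical, deliberately tiny lookup tables for illustration only
STOPWORDS = {"the", "was", "of", "but", "had", "a", "i"}
CONTRACTIONS = {"can't": "cannot", "didn't": "did not"}

def simple_normalize(text):
    """Strip tags, lowercase, expand contractions, drop symbols and stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)        # remove HTML tags
    text = text.lower().replace("’", "'")       # unify apostrophes
    for short, full in CONTRACTIONS.items():    # expand known contractions
        text = text.replace(short, full)
    text = re.sub(r"[^a-z\s]", " ", text)       # remove digits and punctuation
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

print(simple_normalize("<p>The room was kind of clean but I can't stay!</p>"))
```

On this input the sketch prints `room kind clean cannot stay`, which has the same shape as the `Clean_Review` values in the notebook.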


In [6]:
from text_normalizer import TextNormalizer
normalizer = TextNormalizer(text_lemmatization=False)
df['Clean_Review'] = normalizer.normalize_corpus(df['Description'])

df[['Description', 'Clean_Review']].head()

|   | Description | Clean_Review |
|---|---|---|
| 0 | The room was kind of clean but had a VERY stro... | room kind clean strong smell dogs generally av... |
| 1 | I stayed at the Crown Plaza April -- - April -... | stayed crown plaza april april staff friendly ... |
| 2 | I booked this hotel through Hotwire at the low... | booked hotel hotwire lowest price could find g... |
| 3 | Stayed here with husband and sons on the way t... | stayed husband sons way alaska cruise loved ho... |
| 4 | My girlfriends and I stayed here to celebrate ... | girlfriends stayed celebrate th birthdays plan... |


## 📊 Step 3: Train, Validation, and Test Split

To properly evaluate the model's performance, I split the dataset into three parts:

- **Training set (70%)**: used to fit the model
- **Validation set (15%)**: used during training to monitor accuracy on reviews the model is not fit on
- **Test set (15%)**: held out completely until the end for final evaluation

This three-way split makes overfitting easier to detect and gives a fair assessment of how the model might perform in the real world.
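
The 70/15/15 proportions come from two successive splits: 30% is held back first, and that held-back portion is then halved. A quick arithmetic sketch follows. It assumes each held-out size is rounded up to a whole sample, `split_sizes` is a helper written only for this illustration, and the total of 38,932 reviews is simply the sum of the three counts printed in the next cell.

```python
import math

def split_sizes(n_samples, holdout=0.3, test_share=0.5):
    """Return (train, validation, test) sizes for a two-stage split."""
    n_temp = math.ceil(holdout * n_samples)   # pool later split into val + test
    n_train = n_samples - n_temp
    n_test = math.ceil(test_share * n_temp)
    n_val = n_temp - n_test
    return n_train, n_val, n_test

print(split_sizes(38932))
```

Under that rounding assumption the sketch reproduces the sizes the notebook reports: 27,252 training, 5,840 validation, and 5,840 test reviews.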


In [7]:
from sklearn.model_selection import train_test_split

# First split: train vs. temp (which will be split into val and test)
X_train, X_temp, y_train, y_temp = train_test_split(
    df['Clean_Review'], df['Sentiment'], test_size=0.3, random_state=42, stratify=df['Sentiment'])

# Second split: temp → validation and test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)

# Confirm sizes
print(f"Training samples: {len(X_train)}")
print(f"Validation samples: {len(X_val)}")
print(f"Test samples: {len(X_test)}")

Training samples: 27252
Validation samples: 5840
Test samples: 5840


## 🧮 Step 4: Text Vectorization with TF-IDF

Before feeding the text into a machine learning or deep learning model, I need to convert it into numeric format. For this, I’m using **TF-IDF (Term Frequency–Inverse Document Frequency)**, which scores words based on how important they are in a review compared to the rest of the dataset.

Each review becomes a numeric vector that captures the significance of different terms. This allows the model to learn patterns in the word usage tied to either “happy” or “not happy” sentiment.
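
To make the scoring concrete, here is a small pure-Python sketch of TF-IDF on three toy reviews. It uses the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalization that scikit-learn documents as the `TfidfVectorizer` defaults, but it is a simplified re-implementation for illustration: whitespace tokenization and the `tfidf_vectors` helper are assumptions, and the toy reviews are made up.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        raw = {t: freq * idf[t] for t, freq in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors

docs = ["room clean staff friendly", "room dirty staff rude", "room clean quiet"]
vectors = tfidf_vectors(docs)
for term, weight in sorted(vectors[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

In the first toy review, `friendly` (which appears in only one document) receives the largest weight, while `room` (which appears in all three) receives the smallest. That is the behavior described above: common words are down-weighted relative to distinctive ones.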


In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize vectorizer
vectorizer = TfidfVectorizer(max_features=10000)

# Fit only on training data, then transform all
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)
X_test_vec = vectorizer.transform(X_test)

# Check shape of the result
print(f"TF-IDF matrix shape (train): {X_train_vec.shape}")

TF-IDF matrix shape (train): (27252, 10000)


## 🤖 Step 5: Train a Sentiment Classifier Using TensorFlow 2.x

To classify sentiment from reviews, I built a simple neural network using TensorFlow’s Keras API. This model takes the TF-IDF vectorized text as input and outputs a binary prediction: happy (0) or not happy (1).
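
For a sense of scale, the number of trainable parameters can be worked out by hand: a dense layer has one weight per input-output pair plus one bias per output unit. The sketch below applies that to the architecture defined in the code cell of this step (10,000 TF-IDF features, a 128-unit hidden layer, and a single sigmoid output); `dense_params` is a helper written only for this illustration.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

hidden = dense_params(10_000, 128)   # TF-IDF features -> hidden layer
output = dense_params(128, 1)        # hidden layer -> sigmoid output
print(hidden, output, hidden + output)
```

That is about 1.28 million parameters, almost all of them in the first layer, because its size scales directly with the TF-IDF vocabulary.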

To keep training time short and in line with assignment guidelines, I used:

- `TOTAL_STEPS = 100`
- `STEP_SIZE = 10`

This means the model is updated only 100 times, each time on a random mini-batch of 10 reviews (mini-batch training).
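
A short calculation shows how little of the training set these settings touch. It uses the 27,252 training reviews reported by the split in Step 3; the variable names other than `TOTAL_STEPS` and `STEP_SIZE` are introduced here for illustration.

```python
import math

TOTAL_STEPS = 100
STEP_SIZE = 10
N_TRAIN = 27252  # training set size reported in Step 3

samples_drawn = TOTAL_STEPS * STEP_SIZE           # reviews sampled in total
coverage = samples_drawn / N_TRAIN                # upper bound on the share seen
steps_per_epoch = math.ceil(N_TRAIN / STEP_SIZE)  # steps for one full pass

print(samples_drawn, f"{coverage:.1%}", steps_per_epoch)
```

So the run draws 1,000 reviews, at most about 3.7% of the training data, whereas a single full epoch at this batch size would take 2,726 steps.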


In [10]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import numpy as np

# Define constants
TOTAL_STEPS = 100
STEP_SIZE = 10
INPUT_DIM = X_train_vec.shape[1]

# Build model
model = Sequential([
    Dense(128, activation='relu', input_shape=(INPUT_DIM,)),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

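# Densify the validation matrix once instead of on every evaluation
X_val_dense = X_val_vec.toarray()

# Custom training loop that uses TOTAL_STEPS and STEP_SIZE
for step in range(1, TOTAL_STEPS + 1):
    # Draw STEP_SIZE random row indices (with replacement, unseeded, so runs differ)
    idx = np.random.choice(X_train_vec.shape[0], STEP_SIZE)
    batch_x = X_train_vec[idx].toarray()  # sparse rows -> dense array for Keras
    batch_y = y_train.values[idx]

    # One gradient update on this mini-batch
    model.train_on_batch(batch_x, batch_y)

    if step % 10 == 0 or step == 1:
        # "Train" metrics are measured on the 10-review batch just trained on, so they are noisy
        train_loss, train_acc = model.evaluate(batch_x, batch_y, verbose=0)
        val_loss, val_acc = model.evaluate(X_val_dense, y_val.values, verbose=0)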
        print(f"Step {step} — Loss: {train_loss:.4f} — Train Acc: {train_acc:.4f} — Val Acc: {val_acc:.4f}")

# Final test accuracy
test_loss, test_acc = model.evaluate(X_test_vec.toarray(), y_test.values, verbose=0)
print(f"\n✅ Final Test Accuracy: {test_acc:.4f}")

Step 1 — Loss: 0.6713 — Train Acc: 1.0000 — Val Acc: 0.6774
Step 10 — Loss: 0.6569 — Train Acc: 0.7000 — Val Acc: 0.6813
Step 20 — Loss: 0.6984 — Train Acc: 0.4000 — Val Acc: 0.6813
Step 30 — Loss: 0.6456 — Train Acc: 0.6000 — Val Acc: 0.6813
Step 40 — Loss: 0.4875 — Train Acc: 0.9000 — Val Acc: 0.6813
Step 50 — Loss: 0.5046 — Train Acc: 0.8000 — Val Acc: 0.6813
Step 60 — Loss: 0.4310 — Train Acc: 0.9000 — Val Acc: 0.6813
Step 70 — Loss: 0.6359 — Train Acc: 0.6000 — Val Acc: 0.6813
Step 80 — Loss: 0.4437 — Train Acc: 0.8000 — Val Acc: 0.6815
Step 90 — Loss: 0.5704 — Train Acc: 0.6000 — Val Acc: 0.7036
Step 100 — Loss: 0.5794 — Train Acc: 0.7000 — Val Acc: 0.7627

✅ Final Test Accuracy: 0.7563


## ✅ Evaluation Results and Observations

After training the model for 100 steps using small batches of 10 samples each, the results were reasonable for such a lightweight setup:

- **Final Validation Accuracy**: 76.27%  
- **Final Test Accuracy**: 75.63%

Validation accuracy stayed at 68.13% from step 10 through step 70, then rose to 70.36% at step 90 and 76.27% at step 100. A flat stretch like this is consistent with a model that predicts the same class for almost every review, so most of the improvement came in the last 20 steps. This suggests the model was still learning and could benefit from additional training steps or a slightly larger batch size.

The test accuracy (75.63%) is close to the validation accuracy (76.27%), which indicates that the model generalized to unseen data about as well as it did to the validation set, even without using more complex techniques like dropout, embeddings, or lemmatization (which had to be skipped in Colab). Overall, this suggests that even a simple feedforward neural network can pick up useful signal on text classification tasks with proper preprocessing and clean input features.
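
One way to put an accuracy figure in context is to compare it with the majority-class baseline, that is, the accuracy of always predicting the most common label. The sketch below uses a made-up label list. `majority_baseline` is a helper written for this illustration, and in this notebook it could be applied to `y_val` or `y_test`.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

toy_labels = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]  # hypothetical: 7 zeros, 3 ones
print(majority_baseline(toy_labels))
```

A model is only adding value to the extent that its accuracy exceeds this number, so reporting both side by side makes the result easier to interpret.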


## 📌 Final Reflection and Thought Process

My overall goal was to build a sentiment analysis model that could classify hotel reviews as either "happy" or "not happy." I knew that working with raw text requires multiple transformation steps before the data is ready for modeling, so I started by focusing on text normalization. Originally, I planned to include lemmatization, but after encountering persistent compatibility issues with Colab, I decided to turn it off. This was a trade-off — I lost a small amount of precision in text preprocessing, but it allowed the workflow to continue smoothly.

When it came to splitting the data, I wanted to ensure fairness and consistency, so I used stratified sampling to preserve the label distribution across train, validation, and test sets. I made sure to fit the TF-IDF vectorizer only on the training set to prevent data leakage — something I’ve learned can quietly skew results if not handled correctly.

For model training, I intentionally followed the assignment’s instruction to use `TOTAL_STEPS = 100` and `STEP_SIZE = 10`. I knew this would drastically reduce the amount of data the model saw during training, so I lowered my expectations accordingly. What surprised me was how well the model still performed, reaching a **final test accuracy of 75.63%**. Validation accuracy sat near 68% for most of the run and only rose in the last 20 steps, ending at its highest value (76.27%), which suggests that the model was still learning and hadn’t yet overfit, despite the small training size.

Overall, this assignment helped me practice balancing technical goals with practical limitations. I made several intentional decisions along the way — skipping lemmatization, using small batches, limiting steps — and still ended up with a model that generalized well. If I were to continue improving it, I’d look into using word embeddings instead of TF-IDF and training across full epochs with early stopping.
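
As a rough illustration of the early-stopping idea mentioned above, the sketch below implements the usual patience rule in plain Python: stop once validation accuracy has failed to improve for `patience` consecutive epochs. The function `early_stopping_epoch` and the score history are made up for this example. In Keras the same idea is available through the `tf.keras.callbacks.EarlyStopping` callback.

```python
def early_stopping_epoch(val_scores, patience=2):
    """Return the epoch index where training would stop, or None if it never does."""
    best = float("-inf")
    waited = 0
    for epoch, score in enumerate(val_scores):
        if score > best:
            best = score      # new best score: reset the counter
            waited = 0
        else:
            waited += 1       # no improvement this epoch
            if waited >= patience:
                return epoch
    return None

history = [0.70, 0.76, 0.78, 0.77, 0.78, 0.77]  # hypothetical validation accuracies
print(early_stopping_epoch(history))
```

With this history and a patience of 2, the rule stops at epoch index 4: the best score (0.78) was reached at index 2 and the next two epochs did not beat it.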
