
**Title**: Text Analytics: Week 11  
**Author**: Ryan Weeks  
**Date**: 5/25/2025  
**Description**: This notebook uses a deep learning approach to classify hotel reviews as either positive ("happy") or negative ("not happy"). Each review is embedded with the Universal Sentence Encoder (USE) from TensorFlow Hub, and the embeddings are passed through a dense neural network for classification. Performance is evaluated using accuracy, AUC, precision, and recall, reported in a format consistent with Chapter 10 of the textbook.

In [2]:
from google.colab import files

# Upload the CSV file
uploaded = files.upload()

Saving hotel-reviews.csv to hotel-reviews (2).csv


## 📌 Step 1: Load and Prepare the Dataset  
After uploading the CSV file, we’ll load it using pandas and convert the sentiment label into binary form (1 = happy, 0 = not happy).  
We'll also split the data into training and test sets using stratification to maintain class balance.


In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split

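# Load dataset. Note: Colab saved this session's upload as "hotel-reviews (2).csv",
# so the filename below refers to an earlier upload with that name.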
df = pd.read_csv("hotel-reviews.csv")
df = df[['Description', 'Is_Response']].copy()
df['Is_Response'] = df['Is_Response'].map({'happy': 1, 'not happy': 0})
df.dropna(inplace=True)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    df['Description'], df['Is_Response'],
    test_size=0.2, stratify=df['Is_Response'], random_state=42
)
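
As a quick sanity check on the stratified split, the share of positive labels should be nearly identical in both partitions. The sketch below uses a hypothetical helper name, `positive_share`, and made-up toy labels; with the notebook's data you would pass `y_train` and `y_test` instead.

```python
def positive_share(labels):
    """Return the fraction of labels equal to 1 (the 'happy' class)."""
    labels = list(labels)
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(1 for y in labels if y == 1) / len(labels)


# Toy labels for illustration only (not taken from the hotel review data)
toy_train = [1, 1, 1, 0, 1, 1, 0, 1]
toy_test = [1, 0, 1, 1]

print(f"train share: {positive_share(toy_train):.2f}")
print(f"test share:  {positive_share(toy_test):.2f}")
```

If the two shares differ noticeably on the real data, the `stratify` argument was probably not applied to the label column.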


## 📌 Step 2: Embed the Text Using Universal Sentence Encoder (with Batching)  
Because we're working with a large number of reviews, embedding them all at once can exceed Colab's memory limit.  
To avoid this, we'll embed the reviews in smaller batches using a loop and stack the results together.


In [4]:
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np

# Load the USE model
use_embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Helper function to embed in batches
def embed_text_in_batches(text_list, batch_size=512):
    embeddings = []
    for i in range(0, len(text_list), batch_size):
        batch = text_list[i:i+batch_size]
        batch_embeddings = use_embed(batch)
        embeddings.append(batch_embeddings)
    return tf.concat(embeddings, axis=0)

# Embed training and testing data with batching
X_train_embed = embed_text_in_batches(X_train.tolist())
X_test_embed = embed_text_in_batches(X_test.tolist())
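
The slicing pattern used for batching can be checked on its own, without loading the encoder. This is a minimal sketch with a hypothetical generator name, `iter_batches`, and placeholder review strings; it only shows how a list is cut into chunks of at most `batch_size` items, with a shorter final chunk.

```python
import math


def iter_batches(items, batch_size=512):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder texts standing in for real reviews
reviews = [f"review {i}" for i in range(1200)]
batches = list(iter_batches(reviews, batch_size=512))

print([len(b) for b in batches])
print(len(batches) == math.ceil(len(reviews) / 512))
```

A smaller `batch_size` lowers peak memory at the cost of more calls to the encoder.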

## 📌 Step 3: Build and Train the Neural Network  
Now that we’ve converted our reviews into 512-dimensional embeddings, we’ll define a simple feedforward neural network using TensorFlow/Keras.  
The network will use a dense hidden layer with ReLU activation, dropout for regularization, and a sigmoid output layer for binary classification.
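
To get a feel for how small this network is, the trainable parameter count can be worked out by hand. The sketch below assumes the layer sizes described above (512 inputs, 128 hidden units, 1 output); `dense_params` is a hypothetical helper, not a Keras function.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units


hidden = dense_params(512, 128)  # 512-dim embedding -> 128 hidden units
output = dense_params(128, 1)    # 128 hidden units -> 1 sigmoid output
total = hidden + output          # dropout adds no trainable parameters

print(hidden, output, total)
```

Almost all of the parameters sit in the hidden layer, so changing its width is the main lever on model capacity.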


In [5]:
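from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Input
from tensorflow.keras.callbacks import EarlyStopping

# Build the model
model = Sequential([
    Input(shape=(512,)),            # USE embeddings are 512-dimensional
    Dense(128, activation='relu'),  # hidden layer
    Dropout(0.3),                   # randomly drop 30% of hidden activations during training
    Dense(1, activation='sigmoid')  # probability that the review is "happy"
])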

# Compile the model
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Early stopping: halt once val_loss has not improved for 3 consecutive epochs,
# then restore the weights from the best epoch
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

# Train the model
history = model.fit(
    X_train_embed, y_train,
    epochs=10,
    batch_size=32,
    validation_split=0.2,
    callbacks=[early_stop],
    verbose=1
)

Epoch 1/10
779/779 ━━━━━━━━━━━━━━━━━━━━ 5s 4ms/step - accuracy: 0.8130 - loss: 0.4156 - val_accuracy: 0.8666 - val_loss: 0.3211
Epoch 2/10
779/779 ━━━━━━━━━━━━━━━━━━━━ 5s 4ms/step - accuracy: 0.8629 - loss: 0.3239 - val_accuracy: 0.8616 - val_loss: 0.3181
Epoch 3/10
779/779 ━━━━━━━━━━━━━━━━━━━━ 4s 5ms/step - accuracy: 0.8650 - loss: 0.3135 - val_accuracy: 0.8708 - val_loss: 0.3118
Epoch 4/10
779/779 ━━━━━━━━━━━━━━━━━━━━ 4s 4ms/step - accuracy: 0.8676 - loss: 0.3073 - val_accuracy: 0.8676 - val_loss: 0.3152
Epoch 5/10
779/779 ━━━━━━━━━━━━━━━━━━━━ 3s 4ms/step - accuracy: 0.8721 - loss: 0.3018 - val_accuracy: 0.8658 - val_loss: 0.3164
Epoch 6/10
779/779 ━━━━━━━━━━━━━━━━━━━━ 6s 6ms/step - accuracy: 0.8718 - loss: 0.2924 - val_accuracy: 0.8690 - val_loss: 0.3128

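Training stopped after epoch 6 of 10 because of the early stopping callback. The sketch below replays the patience rule on the logged `val_loss` values to show why. It is a simplified re-implementation for illustration (the function name `early_stopping_trace` is hypothetical), not the Keras source.

```python
def early_stopping_trace(val_losses, patience=3):
    """Return (stopped_epoch, best_epoch), both 1-indexed.

    Stops once `patience` consecutive epochs fail to beat the best
    validation loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# val_loss values copied from the training log in this notebook
logged_val_loss = [0.3211, 0.3181, 0.3118, 0.3152, 0.3164, 0.3128]
stopped, best = early_stopping_trace(logged_val_loss, patience=3)
print(stopped, best)  # → 6 3
```

Because the callback was created with `restore_best_weights=True`, the model evaluated in the next step carries the weights from epoch 3, the epoch with the lowest validation loss.
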

## 📌 Step 4: Evaluate Model Performance  
Now that the model is trained, we’ll evaluate it on both the training and test sets.  
We'll collect metrics like accuracy, AUC, precision, and recall, and format the results in a markdown table similar to the one shown in Chapter 10 of the textbook.


In [6]:
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Predict probabilities
y_train_prob = model.predict(X_train_embed).flatten()
y_test_prob = model.predict(X_test_embed).flatten()

# Convert probabilities to binary predictions
y_train_pred = (y_train_prob > 0.5).astype(int)
y_test_pred = (y_test_prob > 0.5).astype(int)

# Accuracy
train_acc = model.evaluate(X_train_embed, y_train, verbose=0)[1]
test_acc = model.evaluate(X_test_embed, y_test, verbose=0)[1]

# AUC
train_auc = roc_auc_score(y_train, y_train_prob)
test_auc = roc_auc_score(y_test, y_test_prob)

# Precision
train_prec = precision_score(y_train, y_train_pred)
test_prec = precision_score(y_test, y_test_pred)

# Recall
train_rec = recall_score(y_train, y_train_pred)
test_rec = recall_score(y_test, y_test_pred)

# Print results for the markdown table
print(f"Training Accuracy: {train_acc:.6f}")
print(f"Test Accuracy: {test_acc:.6f}")
print(f"Training AUC: {train_auc:.6f}")
print(f"Test AUC: {test_auc:.6f}")
print(f"Training Precision: {train_prec:.6f}")
print(f"Test Precision: {test_prec:.6f}")
print(f"Training Recall: {train_rec:.6f}")
print(f"Test Recall: {test_rec:.6f}")

974/974 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step
244/244 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step
Training Accuracy: 0.871247
Test Accuracy: 0.870425
Training AUC: 0.933290
Test AUC: 0.930245
Training Precision: 0.880563
Test Precision: 0.881528
Training Recall: 0.938254
Test Recall: 0.935533
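
To make the thresholding step and the precision and recall definitions concrete, here is a small hand-rolled version on toy data. The labels and probabilities are invented for illustration, and `confusion_counts` is a hypothetical helper; the notebook itself relies on scikit-learn for these metrics.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Return (tp, fp, fn, tn) after thresholding predicted probabilities."""
    tp = fp = fn = tn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p > threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Toy values for illustration only
toy_true = [1, 1, 1, 0, 0, 1]
toy_prob = [0.9, 0.6, 0.4, 0.7, 0.2, 0.8]

tp, fp, fn, tn = confusion_counts(toy_true, toy_prob)
precision = tp / (tp + fp)           # of the reviews predicted happy, how many were happy
recall = tp / (tp + fn)              # of the truly happy reviews, how many were found
accuracy = (tp + tn) / len(toy_true)

print(tp, fp, fn, tn)
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```

Raising the threshold above 0.5 would typically trade recall for precision, and lowering it would do the reverse.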


## 📊 Model Evaluation Metrics

| Model Dir | Training Accuracy | Test Accuracy | Training AUC | Test AUC | Training Precision | Test Precision | Training Recall | Test Recall |
|-----------|-------------------|---------------|--------------|----------|---------------------|----------------|------------------|-------------|
| use-512   | 0.871247          | 0.870425      | 0.933290     | 0.930245 | 0.880563            | 0.881528       | 0.938254         | 0.935533    |
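
The AUC columns can be read as the probability that a randomly chosen happy review receives a higher score than a randomly chosen unhappy one. The sketch below computes that quantity directly by comparing every positive/negative pair on toy data. `pairwise_auc` is a hypothetical name, and the quadratic loop is meant for illustration rather than for datasets of this notebook's size.

```python
def pairwise_auc(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy values for illustration only
print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # → 0.75
```

Unlike accuracy, precision, and recall, this measure does not depend on the 0.5 threshold, which is why it is computed from the raw probabilities.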


## 🧠 Final Thoughts

The deep learning model performed well across all key evaluation metrics, with both training and test accuracy above 87% and AUC above 0.93. Training and test scores differ by less than 0.005 on every metric, which suggests the model generalizes rather than overfits. Recall (about 0.94) is higher than precision (about 0.88), so at the 0.5 threshold the model misses few happy reviews but labels some unhappy reviews as happy.
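
One way to summarize precision and recall in a single number is the F1 score, their harmonic mean. It is not part of the table above; the sketch below simply derives it from the precision and recall values already reported, using a hypothetical helper name.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall values copied from the results reported in this notebook
train_f1 = f1_from(0.880563, 0.938254)
test_f1 = f1_from(0.881528, 0.935533)

print(f"Training F1: {train_f1:.4f}")
print(f"Test F1: {test_f1:.4f}")
```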

Using the Universal Sentence Encoder allowed us to skip traditional text preprocessing while still capturing meaningful semantic information. For future improvements, it may be worth experimenting with:

- Different embedding models (e.g., NNLM or BERT variants)
- Hyperparameter tuning (dropout rate, learning rate, hidden layer size)
- More advanced architectures (e.g., LSTM or transformer layers)
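
As a starting point for the hyperparameter tuning idea, the candidate settings can be enumerated as a simple grid. The values below are illustrative assumptions, not recommendations drawn from this notebook's results, and `iter_configs` is a hypothetical helper; each configuration would still need to be trained and scored on the validation split.

```python
from itertools import product

# Hypothetical search space for illustration
search_space = {
    "dropout_rate": [0.2, 0.3, 0.5],
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "hidden_units": [64, 128, 256],
}


def iter_configs(space):
    """Yield one dict per combination of hyperparameter values."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


configs = list(iter_configs(search_space))
print(len(configs))
print(configs[0])
```

A full grid grows multiplicatively with each added option, so a random sample of configurations may be more practical on Colab.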

Overall, this workflow showcases how powerful pretrained embeddings can be when combined with a straightforward neural network architecture.
