# Sentiment Analysis with a Recurrent Neural Network (RNN)

**Recurrent Neural Networks (RNNs)** are widely used for sequence tasks such as sentiment analysis because they can capture context from sequential data. In this article we will apply an RNN to analyze the sentiment of customer reviews from the Swiggy food delivery platform. The goal is to classify reviews as positive or negative in order to provide insight into customer experiences.

### 1. Importing Libraries and Dataset
We will import NumPy, pandas, the regular expression module (`re`), scikit-learn and TensorFlow.

In [7]:
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense, Embedding

### 2. Loading Dataset
We will use the Swiggy dataset of customer reviews.

In [8]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [9]:
data = pd.read_csv('/content/drive/MyDrive/Dataset/swiggy.csv')
print("Columns in the dataset:")
print(data.columns.tolist())

Columns in the dataset:
['ID', 'Area', 'City', 'Restaurant Price', 'Avg Rating', 'Total Rating', 'Food Item', 'Food Type', 'Delivery Time', 'Review']


After mounting Google Drive, you can access your files under the path `/content/drive/MyDrive/`. To get the exact path for your file:

1.  **Open the File Browser**: Click the folder icon on the left sidebar in Colab.
2.  **Navigate to your file**: Find your CSV file within `drive > MyDrive`.
3.  **Copy the path**: Right-click on your CSV file and select 'Copy path'.
4.  **Paste the path**: Use the copied path in `pd.read_csv()`.

In [10]:
import pandas as pd

file_path = '/content/drive/MyDrive/Dataset/swiggy.csv'
data_from_drive = pd.read_csv(file_path)

print("Successfully loaded CSV from Google Drive.")
display(data_from_drive.head())

Successfully loaded CSV from Google Drive.


|   | ID | Area | City | Restaurant Price | Avg Rating | Total Rating | Food Item | Food Type | Delivery Time | Review |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Suburb | Ahmedabad | 600 | 4.2 | 6198 | Sushi | Fast Food | 30-40 min | Good, but nothing extraordinary. |
| 1 | 2 | Business District | Pune | 200 | 4.7 | 4865 | Pepperoni Pizza | Non-Vegetarian | 50-60 min | Good, but nothing extraordinary. |
| 2 | 3 | Suburb | Bangalore | 600 | 4.7 | 2095 | Waffles | Fast Food | 50-60 min | Late delivery ruined it. |
| 3 | 4 | Business District | Mumbai | 900 | 4.0 | 6639 | Sushi | Vegetarian | 50-60 min | Best meal I've had in a while! |
| 4 | 5 | Tech Park | Mumbai | 200 | 4.7 | 6926 | Spring Rolls | Gluten-Free | 20-30 min | Mediocre experience. |


In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8000 entries, 0 to 7999
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   ID                8000 non-null   int64  
 1   Area              8000 non-null   object 
 2   City              8000 non-null   object 
 3   Restaurant Price  8000 non-null   int64  
 4   Avg Rating        8000 non-null   float64
 5   Total Rating      8000 non-null   int64  
 6   Food Item         8000 non-null   object 
 7   Food Type         8000 non-null   object 
 8   Delivery Time     8000 non-null   object 
 9   Review            8000 non-null   object 
dtypes: float64(1), int64(3), object(6)
memory usage: 625.1+ KB


In [12]:
data.columns

Index(['ID', 'Area', 'City', 'Restaurant Price', 'Avg Rating', 'Total Rating',
       'Food Item', 'Food Type', 'Delivery Time', 'Review'],
      dtype='object')

In [13]:
data.describe()

|   | ID | Restaurant Price | Avg Rating | Total Rating |
|---|---|---|---|---|
| count | 8000.0 | 8000.0 | 8000.0 | 8000.0 |
| mean | 4000.5 | 544.5875 | 4.1299 | 4979.9775 |
| std | 2309.54541 | 287.968871 | 0.645791 | 2877.285148 |
| min | 1.0 | 100.0 | 3.0 | 51.0 |
| 25% | 2000.75 | 300.0 | 3.5 | 2476.0 |
| 50% | 4000.5 | 500.0 | 4.2 | 4989.5 |
| 75% | 6000.25 | 800.0 | 4.7 | 7498.0 |
| max | 8000.0 | 1000.0 | 5.0 | 10000.0 |


### 3. Text Cleaning and Sentiment Labeling
We will clean the review text, create a sentiment label from the ratings and drop any rows with missing values. Note that the label is derived from the `Avg Rating` column rather than from the review text itself, so it acts as a proxy for the sentiment expressed in each review.

* `data["Review"] = data["Review"].str.lower()` : Converts all text in the "Review" column to lowercase
* `data["Review"] = data["Review"].replace(r'[^a-z0-9\s]', '', regex=True)` : Removes every character except lowercase letters, digits and whitespace from the "Review" column
* `data['sentiment'] = data['Avg Rating'].apply(lambda x: 1 if x > 3.5 else 0)` : Creates a new "sentiment" column with 1 for ratings strictly above 3.5 and 0 otherwise (a rating of exactly 3.5 is labeled 0)
* `data = data.dropna()` : Removes rows that contain any missing values

In [14]:
data["Review"] = data["Review"].str.lower()
data["Review"] = data["Review"].replace(r'[^a-z0-9\s]', '', regex=True)

data['sentiment'] = data['Avg Rating'].apply(lambda x: 1 if x > 3.5 else 0)
data = data.dropna()
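
To make the cleaning and labeling rules concrete, here is a minimal sketch of the same logic applied to plain Python strings instead of a DataFrame. The helper names `clean_review` and `label_from_rating` are hypothetical, and the two sample ratings are made up to show both sides of the 3.5 boundary.

```python
import re

def clean_review(text):
    """Lowercase the text and drop everything except a-z, 0-9 and whitespace."""
    return re.sub(r"[^a-z0-9\s]", "", text.lower())

def label_from_rating(avg_rating, threshold=3.5):
    """Return 1 (positive) for ratings strictly above the threshold, else 0."""
    return 1 if avg_rating > threshold else 0

# Toy rows: (review text, illustrative rating). The ratings are invented.
rows = [
    ("Best meal I've had in a while!", 4.0),
    ("Late delivery ruined it.", 3.5),
]
processed = [(clean_review(review), label_from_rating(rating)) for review, rating in rows]
print(processed)
```

The apostrophe in "I've" is removed along with the punctuation, so the token becomes "ive". A rating of exactly 3.5 falls into the negative class because the comparison is strict.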

### 4. Tokenization and Padding
We will prepare the text data by tokenizing and padding it, then extract the target sentiment labels. `Tokenizer` converts words into integer sequences, restricted to the most frequent words via `num_words=max_features`, and `pad_sequences` ensures that all input sequences have the same length (`max_length`).

In [15]:
max_features = 5000
max_length = 200

tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(data["Review"])
X = pad_sequences(tokenizer.texts_to_sequences(
    data["Review"]), maxlen=max_length)
y = data['sentiment'].values
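
The following pure-Python sketch imitates, in simplified form, what the tokenizer and padding steps do. It is not the Keras implementation: the helpers `build_word_index`, `texts_to_sequences` and `pad_pre` are hypothetical names, and the sketch only assumes the Keras defaults of frequency-ranked indices starting at 1 and zero-padding at the front of each sequence.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common() sorts by descending count; ties keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words=None):
    """Map each text to a list of word indices, skipping unknown or rare words."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.split() if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences

def pad_pre(sequence, maxlen):
    """Keep the last `maxlen` ids and left-pad with zeros (assumes maxlen >= 1)."""
    sequence = sequence[-maxlen:]
    return [0] * (maxlen - len(sequence)) + sequence

texts = ["late delivery ruined it", "good but nothing extraordinary", "late again"]
word_index = build_word_index(texts)
sequences = texts_to_sequences(texts, word_index)
padded = [pad_pre(seq, maxlen=5) for seq in sequences]
print(padded[2])
```

Index 0 is reserved for padding, which is why real word indices start at 1.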

### 5. Splitting the Data
We will split the data into training, validation and test sets while maintaining the class distribution. Overall this yields roughly 72% training, 8% validation and 20% test data.

* `train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)` : Splits the data into 80% training and 20% test sets, preserving the sentiment class balance
* `train_test_split(X_train, y_train, test_size=0.1, random_state=42, stratify=y_train)` : Further splits the training data into 90% training and 10% validation sets, keeping the class distribution consistent

In [16]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.1, random_state=42, stratify=y_train
)
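
As a quick sanity check on the two-stage split, the arithmetic below estimates the resulting set sizes for the 8000-row dataset. It is a sketch that assumes the held-out size is rounded up, which I believe matches scikit-learn's behavior for a fractional `test_size`; the function name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, test_size=0.2, val_size=0.1):
    """Return (n_train, n_val, n_test) for a test split followed by a validation split."""
    n_test = math.ceil(test_size * n_samples)
    n_train_full = n_samples - n_test
    n_val = math.ceil(val_size * n_train_full)
    n_train = n_train_full - n_val
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(8000)
# Number of mini-batches per epoch with the batch size of 32 used in training.
steps_per_epoch = math.ceil(n_train / 32)
print(n_train, n_val, n_test, steps_per_epoch)
```

With 5760 training rows and a batch size of 32, each epoch has 180 steps, which is consistent with the `180/180` progress counter in the training log.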

### 6. Building RNN Model
We will build and compile a simple RNN model for binary sentiment classification.

* `Sequential([...])` : Creates a sequential neural network model
* `Embedding(input_dim=max_features, output_dim=16, input_length=max_length)` : Maps each input word index to a 16-dimensional vector (recent Keras versions flag `input_length` as deprecated, so it can be omitted there)
* `SimpleRNN(64, activation='tanh', return_sequences=False)` : Adds a recurrent layer with 64 units and tanh activation that returns only the final hidden state
* `Dense(1, activation='sigmoid')` : Adds an output layer with one neuron and sigmoid activation, producing a probability for the positive class
* `model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])` : Configures the model with binary cross-entropy loss, the Adam optimizer and the accuracy metric
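
To illustrate what the recurrent layer computes, the sketch below runs the basic recurrence `h_t = tanh(x_t W + h_{t-1} U + b)` in pure Python and feeds the final state through a sigmoid unit. The dimensions and all weight values are made up for illustration (2-dimensional inputs, a single recurrent unit), and the function names are hypothetical; a trained Keras model learns its own weights.

```python
import math

def simple_rnn_last_state(inputs, W, U, b):
    """Run a tanh RNN over `inputs` and return the final hidden state.

    inputs: list of T input vectors, each of length d
    W: d x units input weights, U: units x units recurrent weights, b: bias
    """
    units = len(b)
    h = [0.0] * units  # initial hidden state
    for x in inputs:
        h = [
            math.tanh(
                sum(x[i] * W[i][j] for i in range(len(x)))
                + sum(h[k] * U[k][j] for k in range(units))
                + b[j]
            )
            for j in range(units)
        ]
    return h

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy sequence of two 2-dimensional "embeddings" and invented weights.
inputs = [[1.0, 0.0], [0.0, 1.0]]
W = [[0.5], [-0.5]]
U = [[1.0]]
b = [0.0]

h = simple_rnn_last_state(inputs, W, U, b)
# Dense(1, sigmoid) on top of the last state, with an invented weight of 2.0.
prob = sigmoid(2.0 * h[0])
print(h, prob)
```

Because `return_sequences=False`, only the state after the last time step is passed on to the output layer, which is what `simple_rnn_last_state` returns here.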

In [17]:
model = Sequential([
    Embedding(input_dim=max_features, output_dim=16, input_length=max_length),
    SimpleRNN(64, activation='tanh', return_sequences=False),
    Dense(1, activation='sigmoid')
])

model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)
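
The size of this model can be estimated by hand. The sketch below counts trainable parameters per layer using the standard formulas for embedding, simple recurrent and dense layers; the helper names are hypothetical, and the result is meant to be compared against the output of `model.summary()` rather than taken as authoritative.

```python
def embedding_params(vocab_size, embed_dim):
    """One embed_dim-sized vector per vocabulary entry."""
    return vocab_size * embed_dim

def simple_rnn_params(input_dim, units):
    """Input weights + recurrent weights + one bias per unit."""
    return units * (input_dim + units + 1)

def dense_params(input_dim, units):
    """Weights + one bias per output unit."""
    return units * (input_dim + 1)

max_features, embed_dim, rnn_units = 5000, 16, 64
counts = {
    "embedding": embedding_params(max_features, embed_dim),
    "simple_rnn": simple_rnn_params(embed_dim, rnn_units),
    "dense": dense_params(rnn_units, 1),
}
total = sum(counts.values())
print(counts, total)
```

Most of the parameters sit in the embedding table, so lowering `max_features` or the embedding dimension is the most direct way to shrink the model.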



### 7. Training and Evaluating Model
We will train the model on the training data, validate it after each epoch, then evaluate its performance on the test data.

In [18]:
history = model.fit(
    X_train, y_train,
    epochs=5,
    batch_size=32,
    validation_data=(X_val, y_val),
    verbose=1
)

score = model.evaluate(X_test, y_test, verbose=0)
print(f"Test accuracy: {score[1]:.2f}")

Epoch 1/5
180/180 ━━━━━━━━━━━━━━━━━━━━ 11s 48ms/step - accuracy: 0.7057 - loss: 0.6109 - val_accuracy: 0.7156 - val_loss: 0.5973
Epoch 2/5
180/180 ━━━━━━━━━━━━━━━━━━━━ 8s 38ms/step - accuracy: 0.7215 - loss: 0.5916 - val_accuracy: 0.7156 - val_loss: 0.5966
Epoch 3/5
180/180 ━━━━━━━━━━━━━━━━━━━━ 9s 50ms/step - accuracy: 0.7134 - loss: 0.5998 - val_accuracy: 0.7156 - val_loss: 0.5978
Epoch 4/5
180/180 ━━━━━━━━━━━━━━━━━━━━ 7s 39ms/step - accuracy: 0.7057 - loss: 0.6070 - val_accuracy: 0.7156 - val_loss: 0.5971
Epoch 5/5
180/180 ━━━━━━━━━━━━━━━━━━━━ 9s 51ms/step - accuracy: 0.7118 - loss: 0.6017 - val_accuracy: 0.7156 - val_loss: 0.5968
Test accuracy: 0.72
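
The validation accuracy in the log above is identical in every epoch, so it may be worth comparing the model against a majority-class baseline, that is, the accuracy obtained by always predicting the most common label. The sketch below shows one way to compute it; `majority_baseline_accuracy` is a hypothetical helper and the labels are toy values, not the dataset's actual class distribution.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy achieved by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Toy labels for illustration only: three positives and one negative.
toy_labels = [1, 1, 1, 0]
baseline = majority_baseline_accuracy(toy_labels)
print(f"Majority-class baseline: {baseline:.2f}")
```

In the notebook, the same function could be called on `list(y_val)` or `list(y_test)`. If the model's accuracy is no higher than this baseline, it has likely not learned much from the review text.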


### 8. Predicting Sentiment
We will create a function to preprocess a single review, predict its sentiment and display the result.

The function returns "Positive" if the predicted probability is 0.5 or above and "Negative" otherwise, together with the probability score.

In [19]:
def predict_sentiment(review_text):
    text = review_text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)

    seq = tokenizer.texts_to_sequences([text])
    padded = pad_sequences(seq, maxlen=max_length)

    prediction = model.predict(padded)[0][0]
    return f"{'Positive' if prediction >= 0.5 else 'Negative'} (Probability: {prediction:.2f})"


sample_review = "Nothing special but edible."
print(f"Review: {sample_review}")
print(f"Sentiment: {predict_sentiment(sample_review)}")

Review: Nothing special but edible.
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 205ms/step
Sentiment: Positive (Probability: 0.74)
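
The thresholding step can also be separated from the model call, which makes it easy to test on its own and to experiment with a different cut-off. This is a small sketch; `format_prediction` and its `threshold` parameter are hypothetical names introduced for illustration.

```python
def format_prediction(probability, threshold=0.5):
    """Turn a predicted probability into a labeled, human-readable string."""
    label = "Positive" if probability >= threshold else "Negative"
    return f"{label} (Probability: {probability:.2f})"

print(format_prediction(0.74))
print(format_prediction(0.49))
```

A probability exactly equal to the threshold counts as positive, matching the `>=` comparison in `predict_sentiment`.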


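This model passes raw review text through an RNN to predict sentiment, which can help turn customer feedback into actionable insights. Note, however, that the validation accuracy stays at 0.7156 in every epoch, which suggests the model may be predicting mostly a single class; checking its accuracy against a majority-class baseline is advisable before relying on its predictions.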

