# NLP for multiple apps' user reviews

This notebook follows on from my earlier, smaller notebook, "NLP-Netflix-Notebook". The main difference is that here I concatenate review dataframes from several apps into one larger dataset, which should make the final model more robust.

### Datasets used:
Netflix:  https://www.kaggle.com/datasets/ashishkumarak/netflix-reviews-playstore-daily-updated

Spotify: https://www.kaggle.com/datasets/ashishkumarak/spotify-reviews-playstore-daily-update

ChatGPT: https://www.kaggle.com/datasets/ashishkumarak/chatgpt-reviews-daily-updated

Facebook: https://www.kaggle.com/datasets/ashishkumarak/play-store-reviews-facebook

Amazon: https://www.kaggle.com/datasets/ashishkumarak/amazon-shopping-reviews-daily-updated

## Imports

In [1]:
import pandas as pd
import numpy as np
import re 

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import MinMaxScaler

#Lots of models to compare
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score, confusion_matrix, precision_score
from sklearn.model_selection import cross_val_score

from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

# This dataset is large enough that training on the CPU with sklearn is very slow, so I use PyTorch on the GPU
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F

## Load Data

Time to load all of the review datasets and concatenate them into one large master dataset.

In [2]:
spotify_file = "spotify_reviews.csv"
spotify_df = pd.read_csv(spotify_file)

facebook_file = "facebook_reviews.csv"
facebook_df = pd.read_csv(facebook_file)

chatgpt_file = "chatgpt_reviews.csv"
chatgpt_df = pd.read_csv(chatgpt_file)

netflix_file = "netflix_reviews.csv"
netflix_df = pd.read_csv(netflix_file)


df = pd.concat([spotify_df, facebook_df, chatgpt_df, netflix_df], axis=0, ignore_index=True)
df.head()

Unnamed: 0,reviewId,userName,content,score,thumbsUpCount,reviewCreatedVersion,at,appVersion
0,437314fe-1b1d-4352-abea-12fec30fce58,Rajib Das,It's good,4,0,,2024-05-09 16:28:13,
1,4933ad2c-c70a-4a84-957d-d405439b2e0f,Mihaela Claudia Neagu,"I love this app so much, I've been using Spoti...",5,0,8.9.38.494,2024-05-09 16:27:18,8.9.38.494
2,1ab275fb-59bf-42c7-88ef-b85901f0445e,JONATHAN GRACIA,Perfect,5,0,8.9.36.616,2024-05-09 16:27:03,8.9.36.616
3,b38406eb-7b11-4ceb-a45c-d7f28fb5d382,Cam Rempel,Best all around music streaming app I have use...,5,0,8.9.38.494,2024-05-09 16:26:19,8.9.38.494
4,7be7999d-4cb6-47b9-8414-d7bdaa9df578,Your clowness (Her Clowness),Are y'all fr gatekeeping the play button on so...,1,0,8.9.38.494,2024-05-09 16:26:14,8.9.38.494


In [3]:
print(f"Spotify shape: {spotify_df.shape}")
print(f"Facebook shape: {facebook_df.shape}")
print(f"ChatGPT shape: {chatgpt_df.shape}")
print(f"Netflix shape: {netflix_df.shape}")
print(f"\nCombined datasets shape: {df.shape}")


Spotify shape: (84165, 8)
Facebook shape: (89458, 8)
ChatGPT shape: (137132, 8)
Netflix shape: (112271, 8)

Combined datasets shape: (423026, 8)
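Once the frames are concatenated, there is no record of which app a review came from. A minimal sketch of one way to keep that information, using tiny made-up frames in place of the CSV files; the `app` column is my own addition and is not part of the Kaggle datasets.

```python
import pandas as pd

# Tiny stand-in frames; in the notebook these would come from pd.read_csv.
frames = {
    "spotify": pd.DataFrame({"content": ["It's good", "Perfect"], "score": [4, 5]}),
    "netflix": pd.DataFrame({"content": ["Too expensive"], "score": [1]}),
}

# Tag each frame with its source app (hypothetical "app" column), then stack them.
tagged = [frame.assign(app=name) for name, frame in frames.items()]
combined = pd.concat(tagged, axis=0, ignore_index=True)

# Reviews per app after concatenation.
print(combined["app"].value_counts())
```

Keeping the source app makes it possible to check later whether the model performs equally well on every app.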


For sentiment analysis on the review text, most of these columns are not needed. The two that matter are "content" and "score".

In [4]:
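# .copy() gives an independent frame, so later in-place edits do not raise SettingWithCopyWarning
df = df[["content","score"]].copy()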
df.head()

Unnamed: 0,content,score
0,It's good,4
1,"I love this app so much, I've been using Spoti...",5
2,Perfect,5
3,Best all around music streaming app I have use...,5
4,Are y'all fr gatekeeping the play button on so...,1


In [5]:
df.isnull().sum()

content    16
score       0
dtype: int64

Only 16 of the 423,026 reviews are missing their content. That is such a small proportion that I am fine with simply dropping those rows.

## Feature Engineering

### Creating Sentiment

This is an easy step: 4- and 5-star ratings map to sentiment 2 (positive), 3-star ratings map to sentiment 1 (neutral), and 1- and 2-star ratings map to sentiment 0 (negative).

In [6]:
df.dropna(inplace=True)

In [7]:
def score_to_sentiment(score):
    if score in [4,5]:
        return 2
    elif score ==3:
        return 1
    elif score in [1,2]:
        return 0

df["sentiment"] = df["score"].apply(score_to_sentiment)
df.head()

Unnamed: 0,content,score,sentiment
0,It's good,4,2
1,"I love this app so much, I've been using Spoti...",5,2
2,Perfect,5,2
3,Best all around music streaming app I have use...,5,2
4,Are y'all fr gatekeeping the play button on so...,1,0
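The same mapping can also be written as a dictionary lookup, which makes it easy to inspect how the three classes are balanced. This is only a sketch on a made-up sample of ratings, not the real data; `score_to_label` and `class_share` are names I chose for illustration.

```python
import pandas as pd

# Hypothetical mini-sample of star ratings.
scores = pd.Series([5, 4, 3, 2, 1, 5])

# Dictionary lookup: 1-2 stars -> 0, 3 stars -> 1, 4-5 stars -> 2.
score_to_label = {1: 0, 2: 0, 3: 1, 4: 2, 5: 2}
sentiment = scores.map(score_to_label)

# Share of each class; a skewed split here means accuracy alone can mislead.
class_share = sentiment.value_counts(normalize=True).sort_index()
print(class_share)
```

Running `value_counts(normalize=True)` on the real `sentiment` column would show whether the neutral class is much smaller than the other two.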


### Cleaning the text

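Here the text is cleaned: HTML tags are stripped, the text is lowercased, runs of non-word characters (punctuation, emojis and other special characters) are replaced with a single space, and any text emoticons such as :) are moved to the end of the string.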

In [8]:
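def preprocessor(text):
    # Strip HTML tags
    text = re.sub(r"<[^>]*>", "", text)
    # Collect text emoticons such as :) ;-( =D before punctuation is removed
    emoticons = re.findall(r"(?::|;|=)(?:-)?(?:\)|\(|D|P)", text)
    # Lowercase, collapse non-word runs to a space, then re-append the emoticons without their "noses"
    text = re.sub(r"[\W]+", " ", text.lower()) + " ".join(emoticons).replace("-", "")
    return text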

In [9]:
print(f"Before processing: \n{df['content'].iloc[401]}")
print(f"After processing: \n {preprocessor(df['content'].iloc[401])}")

Before processing: 
Tooooo expensive!!
After processing: 
 tooooo expensive 
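The review above contains no markup or emoticons, so it does not exercise every branch of the function. Here is a small self-contained sketch on a made-up string that does; the function body repeats the preprocessor, written with raw strings.

```python
import re

def preprocessor(text):
    # Strip HTML tags.
    text = re.sub(r"<[^>]*>", "", text)
    # Collect text emoticons before punctuation is removed.
    emoticons = re.findall(r"(?::|;|=)(?:-)?(?:\)|\(|D|P)", text)
    # Lowercase, collapse non-word runs, re-append emoticons without "-".
    return re.sub(r"[\W]+", " ", text.lower()) + " ".join(emoticons).replace("-", "")

# Made-up example containing a tag, punctuation and two emoticons.
sample = "<b>Great</b> app :) but ads!! :-("
cleaned = preprocessor(sample)
print(repr(cleaned))
```

The emoticons survive at the end of the string, while the tag and the exclamation marks are gone.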


In [10]:
df["content"] = df["content"].apply(preprocessor)

### Word and character counts

Here I add two new columns to the dataframe: the number of words in each review and the number of characters in those words (whitespace excluded).

In [11]:
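def word_and_char_counts(text):
    # Return (word count, character count excluding whitespace) for one review
    words = text.split()
    return len(words), sum(len(word) for word in words)

In [12]:
# Build the columns from the returned tuples instead of module-level lists,
# so re-running this cell cannot leave stale or duplicated counts behind
counts = df["content"].apply(word_and_char_counts)
df["wordCount"] = [n_words for n_words, _ in counts]
df["charCount"] = [n_chars for _, n_chars in counts]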
df.head()

Unnamed: 0,content,score,sentiment,wordCount,charCount
0,it s good,4,2,3,7
1,i love this app so much i ve been using spotif...,5,2,82,308
2,perfect,5,2,1,7
3,best all around music streaming app i have use...,5,2,14,59
4,are y all fr gatekeeping the play button on so...,1,0,10,40
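On a dataset of this size the pandas string accessor may be a convenient alternative to a Python-level function. A minimal sketch on made-up strings, assuming the same definitions as above (words are whitespace-separated, characters exclude whitespace):

```python
import pandas as pd

# Made-up cleaned reviews, including an empty one.
content = pd.Series(["it s good", "perfect", ""])

# Words are whitespace-separated tokens.
word_count = content.str.split().str.len()

# Characters are counted without whitespace, matching the per-word sum.
char_count = content.str.replace(r"\s+", "", regex=True).str.len()

print(pd.DataFrame({"wordCount": word_count, "charCount": char_count}))
```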


## Train, Test split

In [13]:
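# Ordered 80/20 cut (rows are not shuffled). Note that .loc slices by index
# label and includes both endpoints, so the test set starts at i + 1.
i = int(df.shape[0]*0.8)
features = ["content","wordCount","charCount"]
X_train = df.loc[:i, features]
y_train = df.loc[:i, "sentiment"].values
X_test  = df.loc[i+1:, features]
y_test  = df.loc[i+1:, "sentiment"].values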
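Because the rows stay in file order, the last 20% comes entirely from the frame that was concatenated last, so the test set reflects a single app. One possible alternative is a shuffled, stratified split with scikit-learn's `train_test_split`. The sketch below uses a made-up toy frame and is not what produced the shapes reported in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 10 reviews per sentiment class, ordered by class.
toy = pd.DataFrame({
    "content": [f"review {n}" for n in range(30)],
    "sentiment": [0] * 10 + [1] * 10 + [2] * 10,
})

# Shuffle before splitting and keep the class proportions equal in both parts.
train_df, test_df = train_test_split(
    toy, test_size=0.2, shuffle=True, stratify=toy["sentiment"], random_state=42
)

print(len(train_df), len(test_df))
print(test_df["sentiment"].value_counts().sort_index())
```

Stratifying on `sentiment` keeps the smaller classes represented in the test set, and shuffling mixes reviews from all apps into both parts.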

## Scale the numerical features

In [14]:
sc = MinMaxScaler()
X_train[features[1:]] = sc.fit_transform(X_train[features[1:]])
X_test[features[1:]] = sc.transform(X_test[features[1:]])
X_train.head()

Unnamed: 0,content,wordCount,charCount
0,it s good,0.006198,0.00327
1,i love this app so much i ve been using spotif...,0.169421,0.143858
2,perfect,0.002066,0.00327
3,best all around music streaming app i have use...,0.028926,0.027557
4,are y all fr gatekeeping the play button on so...,0.020661,0.018683
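The scaler is fitted on the training rows only and then reused on the test rows, which avoids leaking test statistics into training. One consequence is that test values can fall outside [0, 1] when a test review is longer than any training review. A small sketch with made-up counts:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up word counts; the test set contains a longer review than any in train.
train_counts = np.array([[1.0], [3.0], [11.0]])
test_counts = np.array([[6.0], [21.0]])

scaler = MinMaxScaler()
# Learns min=1 and max=11 from the training data: (x - 1) / (11 - 1).
train_scaled = scaler.fit_transform(train_counts)
# Applies the same min and max to the test data.
test_scaled = scaler.transform(test_counts)

print(train_scaled.ravel(), test_scaled.ravel())
```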


## Vectorize the content column

In [15]:
cv = CountVectorizer(max_features=5000)

X_train_content = X_train["content"].values
X_train_other = X_train[["wordCount","charCount"]].values
X_content_vectorized = cv.fit_transform(X_train_content).toarray()
X_train = np.concatenate((X_content_vectorized, X_train_other), axis=1)
X_train.shape

(338393, 5002)

In [16]:
X_test_content = X_test["content"].values
X_test_other = X_test[["wordCount","charCount"]].values
X_content_vectorized = cv.transform(X_test_content).toarray()
X_test = np.concatenate((X_content_vectorized, X_test_other), axis=1)
X_test.shape

(84617, 5002)
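Calling `.toarray()` on a matrix of this size produces a very large dense array. If memory becomes a problem, one option is to keep the bag-of-words matrix sparse and append the two numeric columns with `scipy.sparse.hstack`. A minimal sketch on made-up data; it is an alternative, not what the cells above do:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# Made-up cleaned reviews and their scaled (wordCount, charCount) columns.
texts = ["it s good", "perfect", "too expensive"]
extra = np.array([[0.1, 0.2], [0.0, 0.1], [0.05, 0.3]])

vectorizer = CountVectorizer(max_features=5000)
bow = vectorizer.fit_transform(texts)  # stays a scipy sparse matrix

# Append the numeric columns without converting the counts to a dense array.
features = hstack([bow, csr_matrix(extra)], format="csr")

print(features.shape, sorted(vectorizer.vocabulary_))
```

With the default token pattern, `CountVectorizer` ignores single-character tokens, so the stray "s" left over from "it's" does not become a feature.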

## Define models

Set the device

In [17]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Define the model

In [18]:
input_size = X_train.shape[1]
model =nn.Sequential(
    nn.Linear(input_size,1024),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(1024,512),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(512,256),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(256,3),
).to(device)
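To get a feel for the size of this network, the trainable parameters can be counted directly. The sketch below rebuilds the same layer stack on the CPU, assuming `input_size` is 5002 as in the shapes printed earlier; `sketch` and `n_params` are names I chose for illustration.

```python
import torch.nn as nn

input_size = 5002  # 5000 bag-of-words columns + wordCount + charCount
sketch = nn.Sequential(
    nn.Linear(input_size, 1024), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(1024, 512), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(256, 3),  # one logit per sentiment class
)

# Sum the element counts of every weight matrix and bias vector.
n_params = sum(p.numel() for p in sketch.parameters() if p.requires_grad)
print(f"{n_params:,} trainable parameters")
```

Almost all of the parameters sit in the first layer, because its input is the 5002-wide feature vector.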

In [19]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)
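`nn.CrossEntropyLoss` expects raw logits and integer class labels, which is why the model ends in a plain `nn.Linear(256,3)` with no softmax. A small sketch on made-up logits showing that it matches log-softmax followed by negative log-likelihood:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up logits for two reviews over the three sentiment classes.
logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
labels = torch.tensor([0, 2])  # integer class indices, not one-hot vectors

# Built-in loss applied directly to the logits.
loss = nn.CrossEntropyLoss()(logits, labels)

# The same quantity computed in two explicit steps.
manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)

print(loss.item(), manual.item())
```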

In [20]:
class CustomDataset(Dataset):
    def __init__(self, features, labels):
        self.features = torch.tensor(features, dtype=torch.float32)
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

# Create Dataset objects
train_dataset = CustomDataset(X_train, y_train)
test_dataset = CustomDataset(X_test, y_test)

# Create DataLoader objects
trainloader = DataLoader(train_dataset, batch_size=128, shuffle=True)
testloader = DataLoader(test_dataset, batch_size=128, shuffle=False)

In [21]:
for epoch in range(100):
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device)
        
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        if i % 2000 == 1999:
            print(f'[Epoch {epoch + 1}] loss: {running_loss / 2000:.3f}')
            running_loss = 0.0

print('Finished Training')

# Evaluation loop
model.eval()  # switch Dropout off so the test predictions are deterministic
correct = 0
total = 0
with torch.no_grad():
    for data in testloader:
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model(inputs)
        _, predicted = torch.max(outputs, 1)  # index of the largest logit
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f'Accuracy of the network on the test data: {100 * correct / total}%')

[Epoch 1] loss: 0.986
[Epoch 2] loss: 0.881
[Epoch 3] loss: 0.826
[Epoch 4] loss: 0.759
[Epoch 5] loss: 0.728
[Epoch 6] loss: 0.710
[Epoch 7] loss: 0.695
[Epoch 8] loss: 0.676
[Epoch 9] loss: 0.648
[Epoch 10] loss: 0.622
[Epoch 11] loss: 0.598
[Epoch 12] loss: 0.581
[Epoch 13] loss: 0.569
[Epoch 14] loss: 0.558
[Epoch 15] loss: 0.550
[Epoch 16] loss: 0.543
[Epoch 17] loss: 0.534
[Epoch 18] loss: 0.531
[Epoch 19] loss: 0.526
[Epoch 20] loss: 0.522
[Epoch 21] loss: 0.516
[Epoch 22] loss: 0.514
[Epoch 23] loss: 0.511
[Epoch 24] loss: 0.508
[Epoch 25] loss: 0.505
[Epoch 26] loss: 0.504
[Epoch 27] loss: 0.501
[Epoch 28] loss: 0.498
[Epoch 29] loss: 0.495
[Epoch 30] loss: 0.494
[Epoch 31] loss: 0.491
[Epoch 32] loss: 0.491
[Epoch 33] loss: 0.489
[Epoch 34] loss: 0.487
[Epoch 35] loss: 0.485
[Epoch 36] loss: 0.484
[Epoch 37] loss: 0.483
[Epoch 38] loss: 0.482
[Epoch 39] loss: 0.480
[Epoch 40] loss: 0.479
[Epoch 41] loss: 0.479
[Epoch 42] loss: 0.476
[Epoch 43] loss: 0.476
[Epoch 44] loss: 0.4
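A single accuracy figure can hide weak performance on the smaller classes. A confusion matrix with per-class recall gives a fuller picture. The sketch below uses made-up labels and predictions, not results from this model; in the notebook the predictions would be collected inside the evaluation loop.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions for the three sentiment classes.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 2, 2, 1, 2, 2, 2, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

# Recall per class: correct predictions divided by the true count of that class.
per_class_recall = cm.diagonal() / cm.sum(axis=1)
accuracy = cm.diagonal().sum() / cm.sum()

print(cm)
print(per_class_recall, accuracy)
```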

## Save the model

In [22]:
torch.save(model, 'model.pth')
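Saving the whole model object pickles it together with its class definitions. The PyTorch documentation generally recommends saving only the `state_dict` and rebuilding the architecture before loading. A minimal sketch with a small stand-in network; `build_model` and the file name are hypothetical, not part of this notebook.

```python
import os
import tempfile

import torch
import torch.nn as nn

def build_model():
    # Small stand-in architecture; the real one is defined earlier in the notebook.
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))

model_a = build_model()
model_a.eval()

# Save only the learned weights and biases.
path = os.path.join(tempfile.mkdtemp(), "model_state.pth")
torch.save(model_a.state_dict(), path)

# Rebuild the same architecture, then load the saved parameters into it.
model_b = build_model()
model_b.load_state_dict(torch.load(path))
model_b.eval()

x = torch.ones(1, 4)
print(torch.allclose(model_a(x), model_b(x)))
```

To reuse the model on new reviews, the fitted `CountVectorizer` and `MinMaxScaler` would also need to be saved, since the network only understands inputs built with the same vocabulary and scaling.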