# NLP for Netflix app user reviews

Dataset used: https://www.kaggle.com/datasets/ashishkumarak/netflix-reviews-playstore-daily-updated

The goal of this notebook is to train a model that classifies the text of Netflix app reviews as "Happy", "Neutral", or "Sad".

## Imports

In [1]:
import pandas as pd
import numpy as np
import re 

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import MinMaxScaler

#Lots of models to compare
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score, confusion_matrix, precision_score
from sklearn.model_selection import cross_val_score

from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

## Load Data

In [2]:
file = "netflix_reviews.csv"
df = pd.read_csv(file)
df.head()

Unnamed: 0,reviewId,userName,content,score,thumbsUpCount,reviewCreatedVersion,at,appVersion
0,9eea76a3-1b8b-4cb5-a5f3-48cf00f786e4,Faker toz,i rly wanted to watch my favourite show but it...,2,0,,2024-06-05 22:09:02,
1,e81f843a-a8c9-40af-8c0e-f084fde157db,Prince Kpahn,I love this app,5,0,8.85.0 build 9 50502,2024-06-05 22:08:47,8.85.0 build 9 50502
2,be1149b6-730b-4910-810d-04c65800a6a0,Isaiah anis,Great love to enjoy watching latest movies and...,5,0,8.20.0 build 12 40171,2024-06-05 22:07:47,8.20.0 build 12 40171
3,5210b102-ea01-4186-aac2-141647bda83c,Gamalyel Hernandez,Please add all the MY little pony Seasons whic...,4,0,8.117.0 build 3 50695,2024-06-05 22:07:35,8.117.0 build 3 50695
4,1effc685-c8de-412f-8029-a8f0d47b3547,Humle Junior,Nice app,5,0,,2024-06-05 22:06:53,


For sentiment analysis of the review text, most of these columns are not needed. The two that matter are "content" and "score".

In [3]:
df = df[["content","score"]]
df.head()

Unnamed: 0,content,score
0,i rly wanted to watch my favourite show but it...,2
1,I love this app,5
2,Great love to enjoy watching latest movies and...,5
3,Please add all the MY little pony Seasons whic...,4
4,Nice app,5


In [4]:
df.isnull().sum()

content    2
score      0
dtype: int64

Only two values are missing, both in the "content" column. That is a negligible share of the dataset, so I am fine with simply dropping those rows.

## Feature Engineering

### Creating Sentiment

This is an easy step. A star rating of 4 or 5 maps to sentiment 2, a rating of 3 maps to sentiment 1, and a rating of 1 or 2 maps to sentiment 0.

In [5]:
df = df.dropna()  # reassign instead of inplace to avoid a SettingWithCopyWarning on the column subset

In [6]:
def score_to_sentiment(score):
    if score in [4,5]:
        return 2
    elif score ==3:
        return 1
    elif score in [1,2]:
        return 0

df["sentiment"] = df["score"].apply(score_to_sentiment)
df.head()

Unnamed: 0,content,score,sentiment
0,i rly wanted to watch my favourite show but it...,2,0
1,I love this app,5,2
2,Great love to enjoy watching latest movies and...,5,2
3,Please add all the MY little pony Seasons whic...,4,2
4,Nice app,5,2
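
The same mapping can also be written as a dictionary lookup, which makes it easy to check the class balance afterwards. This is a minimal sketch on a small made-up frame; `toy`, `score_map`, and the ratings in it are illustrative and not taken from the dataset.

```python
import pandas as pd

# Hypothetical star ratings standing in for the "score" column
toy = pd.DataFrame({"score": [1, 2, 3, 4, 5, 5]})

# 1-2 stars -> 0 (Sad), 3 stars -> 1 (Neutral), 4-5 stars -> 2 (Happy)
score_map = {1: 0, 2: 0, 3: 1, 4: 2, 5: 2}
toy["sentiment"] = toy["score"].map(score_map)

# Count how many reviews fall into each sentiment class
class_counts = toy["sentiment"].value_counts().sort_index()
print(class_counts)
```

Checking the counts per class on the real data is worthwhile before training, since a heavily skewed class distribution makes plain accuracy harder to interpret.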


### Cleaning the text

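Here the text is cleaned: HTML tags are stripped, the text is lowercased, and runs of non-word characters (punctuation, emojis, and other special characters) are replaced with a single space. Text emoticons such as :) are detected first and appended to the end of the cleaned text.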

In [7]:
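def preprocessor(text):
    # Strip HTML tags
    text = re.sub(r"<[^>]*>", "", text)
    # Collect text emoticons such as :) ;-( =D before punctuation is removed
    emoticons = re.findall(r"(?::|;|=)(?:-)?(?:\)|\(|D|P)", text)
    # Lowercase, collapse non-word runs to one space, then append the emoticons (noses removed)
    text = re.sub(r"[\W]+", " ", text.lower()) + " ".join(emoticons).replace("-", "")
    return text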

In [8]:
print(f"Before processing: \n{df['content'].iloc[401]}")
print(f"After processing: \n {preprocessor(df['content'].iloc[401])}")

Before processing: 
Its a very good app it lets me watch my favourite shows and all that but one thing some shows you dont finish and it gets anoying knoing you haven't watched the whole show
After processing: 
 its a very good app it lets me watch my favourite shows and all that but one thing some shows you dont finish and it gets anoying knoing you haven t watched the whole show
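
The sample review above contains no HTML or emoticons, so it does not show those two code paths. The sketch below repeats the preprocessing logic in a standalone function and runs it on a made-up string (the name `preprocess_demo` and the input are illustrative only) to show what happens to a tag and an emoticon.

```python
import re

def preprocess_demo(text):
    # Remove HTML tags
    text = re.sub(r"<[^>]*>", "", text)
    # Find text emoticons before punctuation is stripped
    emoticons = re.findall(r"(?::|;|=)(?:-)?(?:\)|\(|D|P)", text)
    # Lowercase, replace non-word runs with a space, append emoticons without "-"
    return re.sub(r"[\W]+", " ", text.lower()) + " ".join(emoticons).replace("-", "")

cleaned = preprocess_demo("<b>Great</b> app :-)")
print(repr(cleaned))
```

The tag disappears, the text is lowercased, and the emoticon survives at the end of the string with its "nose" removed.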


In [9]:
df["content"] = df["content"].apply(preprocessor)

### Word and character counts

Here I will create two new columns for the dataframe to contain the word and character counts for the content column.

In [10]:
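def word_and_char_counts(text):
    """Return (word count, character count excluding whitespace)."""
    words = text.split()
    char_len = sum(len(word) for word in words)  # Character count
    return (len(words), char_len)

In [11]:
# Compute both counts once per review, then split them into two columns
counts = [word_and_char_counts(text) for text in df["content"]]
df["wordCount"] = [word_count for word_count, _ in counts]
df["charCount"] = [char_count for _, char_count in counts]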
df.head()

Unnamed: 0,content,score,sentiment,wordCount,charCount
0,i rly wanted to watch my favourite show but it...,2,0,17,60
1,i love this app,5,2,4,12
2,great love to enjoy watching latest movies and...,5,2,10,54
3,please add all the my little pony seasons whic...,4,2,20,69
4,nice app,5,2,2,7


## Train, Test split

In [12]:
i = int(df.shape[0]*0.8)
features = ["content","wordCount","charCount"]
X_train = df.loc[:i, features]
y_train = df.loc[:i, "sentiment"]
X_test  = df.loc[i+1:, features]
y_test  = df.loc[i+1:, "sentiment"]
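
The split above takes the first 80% of the rows in file order, and the timestamps in the first preview suggest the file is ordered by review time. A shuffled, stratified split is a common alternative when the class proportions should match in both parts. The sketch below is not what this notebook does; it uses scikit-learn's `train_test_split` on a made-up frame (`toy` and its contents are assumptions for illustration).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the notebook's dataframe
toy = pd.DataFrame({
    "content": [f"review {n}" for n in range(20)],
    "sentiment": [0] * 8 + [1] * 4 + [2] * 8,
})

# Shuffle the rows and keep the class proportions similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["content"]], toy["sentiment"],
    test_size=0.2, stratify=toy["sentiment"], random_state=42,
)
print(X_tr.shape, X_te.shape)
```

Which split is more appropriate depends on the goal: an ordered split tests how the model handles reviews from a different time period, while a shuffled split gives a more even comparison between models.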

## Scale the numerical features

In [13]:
sc = MinMaxScaler()
X_train[features[1:]] = sc.fit_transform(X_train[features[1:]])
X_test[features[1:]] = sc.transform(X_test[features[1:]])
X_train.head()

Unnamed: 0,content,wordCount,charCount
0,i rly wanted to watch my favourite show but it...,0.050147,0.04351
1,i love this app,0.011799,0.008702
2,great love to enjoy watching latest movies and...,0.029499,0.039159
3,please add all the my little pony seasons whic...,0.058997,0.050036
4,nice app,0.0059,0.005076


## Vectorize the content column

In [14]:
cv = CountVectorizer(max_features=5000)

X_train_content = X_train["content"].values
X_train_other = X_train[["wordCount","charCount"]].values
X_content_vectorized = cv.fit_transform(X_train_content).toarray()
X_train = np.concatenate((X_content_vectorized, X_train_other), axis=1)
X_train.shape

(89814, 5002)

In [15]:
X_test_content = X_test["content"].values
X_test_other = X_test[["wordCount","charCount"]].values
X_content_vectorized = cv.transform(X_test_content).toarray()
X_test = np.concatenate((X_content_vectorized, X_test_other), axis=1)
X_test.shape

(22455, 5002)
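
Calling `.toarray()` turns the bag-of-words matrix into a dense array. At 89814 x 5002 with 8-byte values, that is about 3.6 GB for the training set alone. If memory is tight, the count matrix can stay sparse and the two numeric columns can be appended with `scipy.sparse.hstack`. This is a sketch on made-up inputs (`texts`, `other`, and `cv_demo` are illustrative names). Note that `GaussianNB` needs dense input, so this route only suits the models that accept sparse matrices.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack, issparse
from sklearn.feature_extraction.text import CountVectorizer

texts = ["i love this app", "nice app", "app keeps crashing"]
# Hypothetical, already-scaled wordCount and charCount columns
other = np.array([[0.4, 0.3], [0.2, 0.1], [0.3, 0.5]])

cv_demo = CountVectorizer(max_features=5000)
bow = cv_demo.fit_transform(texts)  # sparse bag-of-words counts

# Append the two numeric columns without converting to a dense array
combined = hstack([bow, csr_matrix(other)]).tocsr()
print(combined.shape)
```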

## Define models

In [16]:
bc = GaussianNB()
mnc = MultinomialNB()
bnc = BernoulliNB()
knn = KNeighborsClassifier()
dtc = DecisionTreeClassifier()
rfc = RandomForestClassifier(n_estimators=50, random_state=42)

In [17]:
classifiers = {
    'GaussianNB': bc,
    'MultinomialNB': mnc,
    'BernoulliNB': bnc,
    'KNN':knn,
    'DecisionTreeClassifier': dtc,
    'RandomForestClassifier':rfc
}

In [18]:
def train_model(model):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, average="macro")
    kfold_score = cross_val_score(estimator=model, X=X_train, y=y_train, cv=7)
    return [acc, prec, kfold_score.mean()]
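
`train_model` fits each model once on the full training set and then seven more times inside `cross_val_score`, which for classifiers reports accuracy by default. If several cross-validated metrics are wanted, scikit-learn's `cross_validate` can collect them in one pass. The sketch below runs on synthetic data from `make_classification` rather than on the review features, and the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the review features
X_demo, y_demo = make_classification(
    n_samples=90, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

# Collect accuracy and macro-averaged precision for each of the 3 folds
scores = cross_validate(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo,
    cv=3, scoring=["accuracy", "precision_macro"],
)
mean_acc = scores["test_accuracy"].mean()
mean_prec = scores["test_precision_macro"].mean()
print(mean_acc, mean_prec)
```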

Uncomment the code below if you want to test all the models against each other. The Random Forest model scores the best, though.

In [19]:
'''
precision = []
accuracy = []
k_fold = []

for name, model in classifiers.items():
    print(f"{name} is being trained.")
    acc,prec,kfold_score = train_model(model)
    print("Trained")
    accuracy.append(acc)
    precision.append(prec)
    k_fold.append(kfold_score)
    print(f"{name} complete")

print(accuracy)
'''

Train the random forest model on its own.

In [21]:
rfc.fit(X_train, y_train)

Here is a function for classifying new text.

In [30]:
def new_data(txt):
    txt = preprocessor(txt)
    word_len, char_len = word_and_char_counts(txt)
    # Bag-of-words counts from the vectorizer fitted on the training data
    text_features = cv.transform([txt]).toarray()
    # Scale the counts with the fitted scaler, in the training column order (wordCount, charCount)
    count_features = sc.transform(
        pd.DataFrame([[word_len, char_len]], columns=["wordCount", "charCount"])
    )
    features = np.hstack((text_features, count_features))
    print(features.shape)
    prediction = rfc.predict(features)[0]
    if prediction == 2:
        return "Happy"
    elif prediction == 1:
        return "Neutral"
    return "Sad"



In [31]:
prediction = new_data("Wow the app works great!") #Change the input text here to get new predictions
print(prediction)

(1, 5002)
Happy
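
Classifying new text requires repeating every training-time transformation in the same order, which is easy to get wrong by hand. One way to reduce that risk is to bundle the vectorizer and the model in a scikit-learn `Pipeline`, so a single `predict` call handles raw text. The sketch below is a simplified stand-in, not this notebook's model: it uses `MultinomialNB` on word counts only, without the word and character count features, and trains on four made-up reviews.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training reviews with sentiment labels (2 = Happy, 0 = Sad)
train_texts = [
    "love this app great",
    "great app works well",
    "terrible app keeps crashing",
    "worst app hate it",
]
train_labels = [2, 2, 0, 0]

labels_to_names = {0: "Sad", 1: "Neutral", 2: "Happy"}

# The pipeline vectorizes the text and classifies it in one step
pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])
pipe.fit(train_texts, train_labels)

predicted = pipe.predict(["great app"])[0]
print(labels_to_names[predicted])
```

Because the fitted vectorizer travels with the classifier, the pipeline can also be passed directly to `cross_val_score` or saved as a single object.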
