<a href="https://colab.research.google.com/github/usesnames/Reviews_analyzer/blob/main/Reviews_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -qq google-play-scraper

[?25l[K     |██████▋                         | 10kB 13.8MB/s eta 0:00:01[K     |█████████████▎                  | 20kB 10.8MB/s eta 0:00:01[K     |████████████████████            | 30kB 6.3MB/s eta 0:00:01[K     |██████████████████████████▌     | 40kB 6.7MB/s eta 0:00:01[K     |████████████████████████████████| 51kB 2.4MB/s 
[?25h  Building wheel for google-play-scraper (setup.py) ... [?25l[?25hdone


In [81]:
import os
import json
import pandas as pd
from tqdm import tqdm
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

import matplotlib.pyplot as plt

from google_play_scraper import Sort, reviews, app, reviews_all

In [None]:
app_name = 'com.latuabancaperandroid'

## Scraping App Information

Let's scrape the info for our app:

In [None]:
app_infos = []

info = app(app_name, lang='en', country='us')
del info['comments']

We got the info for our app. Let's write a helper function that prints JSON objects a bit better:

In [None]:
def print_json(json_object):
  json_str = json.dumps(
    json_object, 
    indent=2, 
    sort_keys=True, 
    default=str
  )
  print(json_str)

Here are the app information:

In [None]:
print_json(info)

{
  "adSupported": null,
  "androidVersion": "5.0",
  "androidVersionText": "5.0 and up",
  "appId": "com.latuabancaperandroid",
  "containsAds": false,
  "contentRating": "Everyone",
  "contentRatingDescription": null,
  "currency": "USD",
  "description": "Semplice, veloce, personale: \u201cIntesa Sanpaolo Mobile\u201d \u00e8 l'applicazione di Intesa Sanpaolo per la banca di tutti i giorni. \r\nCon \u201cIntesa Sanpaolo Mobile\u201d puoi:\r\n- consultare il conto e le carte, fare bonifici, ricariche, bollettini e altri pagamenti;\r\n- gestire le tue carte di credito, debito e prepagate nominative, configurando limiti, opzioni geocontrol, carte virtuali associate e sospendendo o bloccando le carte in caso di emergenza;\r\n- aggregare in un'unica vista tutti tuoi conti e le tue carte, anche presso le altre banche, con XME Banks;\r\n- ricevere assistenza personalizzata dalla filiale online con un semplice \"shake\" in ogni momento della navigazione;\r\n- pagare nei negozi convenzionati 

In [None]:
app_reviews = []

for score in list(range(1, 6)):
  rvs, _ = reviews(
    app_name,
    lang='it',
    country='it',
    sort=Sort.NEWEST,
    count= 4000, 
    filter_score_with=score
  )
  for r in rvs:
    r['appId'] = app_name
  app_reviews.extend(rvs)

In [None]:
print_json(app_reviews[0])

{
  "appId": "com.latuabancaperandroid",
  "at": "2021-03-23 16:22:25",
  "content": "A me piace perch\u00e9 mi consente di tenere costantemente aggiornato la mie situazione finanziaria",
  "repliedAt": null,
  "replyContent": null,
  "reviewCreatedVersion": "2.18.2",
  "reviewId": "gp:AOqpTOGtJEUZEiJym1GYT0FxJnoItJzngbCFDcTOPkwka6B6T9KMX3gUfbYa0cIhHD3yvRXcT92rgXEooirBNQ",
  "score": 1,
  "thumbsUpCount": 0,
  "userImage": "https://play-lh.googleusercontent.com/--JYWBYgNmA4/AAAAAAAAAAI/AAAAAAAAAAA/AMZuucla_xsuNyD0JzztPnDxxHGERfdcgg/photo.jpg",
  "userName": "Maria Rosa Ilardi"
}


`repliedAt` and `replyContent` contain the developer response to the review. Of course, they can be missing.

How many app reviews did we get?



In [None]:
len(app_reviews)

20000

Let's save the reviews to a CSV file:

In [40]:
if not os.path.isfile('balanced_reviews.csv'):
  balanced_reviews_df = pd.DataFrame(app_reviews)
  balanced_reviews_df.to_csv('balanced_reviews.csv', index=None, header=True, sep='\t')
  print('generato file balanced_reviews.csv')

balanced_reviews_df = pd.read_csv('balanced_reviews.csv', sep='\t')
balanced_reviews_df.drop(labels=['userName', 'userImage', 'replyContent', 'repliedAt', 
                                 'at', 'appId', 'reviewCreatedVersion', 'reviewId'], 
                         axis=1, inplace=True)

In [49]:
balanced_reviews_df.tail()

Unnamed: 0,content,score,thumbsUpCount
19995,Funzionale e sicura anche con impronta Digital...,5,0
19996,Ok,5,0
19997,Molto funzionale,5,0
19998,Molto pratica e valida. Mi trovo benissimo,5,0
19999,"Ottima, molto comoda e di facile utilizzo",5,0


split

In [93]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(balanced_reviews_df['content'], balanced_reviews_df['score'], 
                                                    test_size=0.33, random_state=42, stratify=balanced_reviews_df['score'])
y_train = y_train - 1
y_test = y_test - 1

In [99]:
tokenizer = Tokenizer(num_words=20000, oov_token="<OOV>")
tokenizer.fit_on_texts(X_train)
vocab_size = len(tokenizer.word_index)+1

In [100]:
train_seq = tokenizer.texts_to_sequences(X_train)
train_padded = pad_sequences(train_seq, maxlen=300, truncating='post')

test_seq = tokenizer.texts_to_sequences(X_test)
test_padded = pad_sequences(test_seq, maxlen=300, truncating='post')

In [101]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 16, input_length=300),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(5, activation='sigmoid')
])
model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['accuracy'])

In [102]:
num_epochs = 30
history = model.fit(train_padded, y_train, epochs=num_epochs, validation_data=(test_padded, y_test), verbose=2)

Epoch 1/50
419/419 - 2s - loss: 1.6042 - accuracy: 0.2350 - val_loss: 1.5807 - val_accuracy: 0.2780
Epoch 2/50
419/419 - 2s - loss: 1.4800 - accuracy: 0.3229 - val_loss: 1.4012 - val_accuracy: 0.3480
Epoch 3/50
419/419 - 2s - loss: 1.3654 - accuracy: 0.3548 - val_loss: 1.3363 - val_accuracy: 0.3742
Epoch 4/50
419/419 - 2s - loss: 1.3100 - accuracy: 0.3717 - val_loss: 1.2967 - val_accuracy: 0.3909
Epoch 5/50
419/419 - 2s - loss: 1.2653 - accuracy: 0.3910 - val_loss: 1.2721 - val_accuracy: 0.3876
Epoch 6/50
419/419 - 2s - loss: 1.2346 - accuracy: 0.4071 - val_loss: 1.2397 - val_accuracy: 0.3983
Epoch 7/50
419/419 - 2s - loss: 1.2084 - accuracy: 0.4183 - val_loss: 1.2246 - val_accuracy: 0.4130
Epoch 8/50
419/419 - 2s - loss: 1.1871 - accuracy: 0.4311 - val_loss: 1.2118 - val_accuracy: 0.4015
Epoch 9/50
419/419 - 2s - loss: 1.1678 - accuracy: 0.4401 - val_loss: 1.2065 - val_accuracy: 0.4271
Epoch 10/50
419/419 - 2s - loss: 1.1488 - accuracy: 0.4571 - val_loss: 1.1931 - val_accuracy: 0.4300

KeyboardInterrupt: ignored