<a href="https://www.kaggle.com/code/sayidheykal/sentiment-analysis-of-simcity-app-reviews?scriptVersionId=207087676" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# **PROJECT DESCRIPTION**

**Project Description**:<br>
This project analyzes the sentiment of user reviews for the SimCity BuildIt app on the Google Play Store to gain insight into users' experiences and overall satisfaction with the app. Each review is classified as either positive or negative, which lets us identify prevalent themes and common user concerns.

**Data Collection**:<br>
The data will be collected by scraping reviews from the Google Play Store using the google-play-scraper Python library. Because the reviews are fetched when the notebook runs, the dataset reflects the feedback available at that time.

**Methodology**:<br>
The project will involve the following components and methodologies:
1. **Sentiment Classification Models**:<br>
Three machine learning and deep learning models will be utilized for sentiment classification:
    * Naive Bayes
    * Decision Tree
    * LSTM (Long Short-Term Memory Neural Network)
2. **Feature Extraction Techniques**: <br>
    * TF-IDF (Term Frequency-Inverse Document Frequency) for transforming text data into numerical features.
    * Word2Vec Word Embeddings for capturing semantic relationships between words.
3. **Data Splitting Strategy**: <br>
To evaluate model performance, the data will be split into training and testing sets using two different ratios:
    * 80/20 split (80% training, 20% testing)
    * 70/30 split (70% training, 30% testing)
4. **Model Training Configurations**: <br>
The following combinations of models, feature extraction, and data splits will be tested:
    * `Naive Bayes` with `TF-IDF` and `80/20` split
    * `Decision Tree` with `TF-IDF` and `70/30` split
    * `LSTM` with `Word2Vec` and `80/20` split
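
To make the two split ratios concrete, here is a minimal, dependency-free sketch of a shuffled train/test split. The helper name `train_test_split_sketch`, the seed, and the placeholder review list are assumptions for illustration only; in practice a library routine such as scikit-learn's `train_test_split` would normally be used.

```python
import random

def train_test_split_sketch(items, test_ratio, seed=42):
    """Shuffle a copy of `items` and cut it into (train, test) parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)      # seeded for reproducibility
    n_test = round(len(shuffled) * test_ratio)  # size of the test part
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder data standing in for the scraped reviews (hypothetical).
reviews = [f"review {i}" for i in range(10)]

train_80, test_20 = train_test_split_sketch(reviews, test_ratio=0.2)
train_70, test_30 = train_test_split_sketch(reviews, test_ratio=0.3)

print(len(train_80), len(test_20))  # 8 2
print(len(train_70), len(test_30))  # 7 3
```

A larger test share gives a more stable performance estimate at the cost of less training data, which is the trade-off the two ratios are meant to probe.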

**Project Objective:** <br>
This analysis aims to uncover the user experience with the SimCity BuildIt app by analyzing sentiment patterns in reviews. The results will highlight common positive aspects as well as frequent areas of user dissatisfaction, which can inform app development, customer support strategies, and user engagement.

# **1. Import Library**

We scrape the review dataset with the google-play-scraper library, so we need to install it first.

In [1]:
!pip install google-play-scraper sastrawi

Collecting google-play-scraper
  Downloading google_play_scraper-1.2.7-py3-none-any.whl.metadata (50 kB)
Collecting sastrawi
  Downloading Sastrawi-1.0.1-py2.py3-none-any.whl.metadata (909 bytes)
Downloading google_play_scraper-1.2.7-py3-none-any.whl (28 kB)
Downloading Sastrawi-1.0.1-py2.py3-none-any.whl (209 kB)
Installing collected packages: sastrawi, google-play-scraper
Successfully installed google-play-scraper-1.2.7 sastrawi-1.0.1


In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import string
import re

from google_play_scraper import reviews_all, Sort, app
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# **2. Scraping Dataset**

The code below fetches and prints the Google Play Store details of the SimCity BuildIt application.

In [3]:
app_id = 'com.ea.game.simcitymobile_row'
appdesc = app(
    app_id,
    lang='id', country='id'
)

appdesc

{'title': 'SimCity BuildIt',
 'description': 'Selamat datang, Wali Kota! Jadilah pahlawan kotamu sendiri sambil merancang dan menciptakan metropolis ramai yang indah. Setiap keputusan ada di tanganmu untuk membuat kotamu semakin besar dan lengkap. Buat pilihan cerdas agar wargamu tetap bahagia dan kaki langitmu meninggi. Setelah itu berdagang, mengobrol, bersaing, dan bergabung dengan klub dengan sesama wali kota. Bangun jalan menuju kejayaan!\r\n\r\nHIDUPKAN KOTAMU\r\nBangun gedung pencakar langit, taman, jembatan, dan banyak lagi! Tempatkan bangunan secara strategis agar pajak tetap mengalir dan kotamu berkembang. Tuntaskan tantangan kehidupan nyata, seperti lalu lintas dan polusi. Sediakan berbagai layanan, seperti pembangkit listrik dan kantor polisi. Jagalah agar lalu lintas lancar dengan jalanan lebar dan trem.\r\n\r\nTUMPAHKAN IMAJINASIMU DI PETA\r\nBangun lingkungan bergaya Tokyo, London, Paris, dan buka bangunan penting eksklusif, seperti Menara Eiffel dan Patung Liberty. Ungk

Let's start **scraping**!

In [4]:
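scrap_app_reviews = reviews_all(
    app_id,
    lang='id',                # review language: Indonesian
    country='id',             # store region: Indonesia
    sort=Sort.MOST_RELEVANT,  # order reviews by relevance
    count=1000                # note: the run below still returned 144,652 reviews, so this did not cap the result
)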

# **3. Loading Dataset**

In [5]:
app_reviews_df = pd.DataFrame(scrap_app_reviews)
app_reviews_df.to_csv('simcity_reviews.csv', index=False)
num_of_reviews, num_of_columns = app_reviews_df.shape

print(f'Number of Reviews: {num_of_reviews}')
print(f'Number of Columns: {num_of_columns}')
app_reviews_df.head()

Number of Reviews: 144652
Number of Columns: 11


Unnamed: 0,reviewId,userName,userImage,content,score,thumbsUpCount,reviewCreatedVersion,at,replyContent,repliedAt,appVersion
0,153cae38-9708-4e35-a4c4-92d3078ba8b0,Pengguna Google,https://play-lh.googleusercontent.com/EGemoI2N...,Barangnya mahal2 dan sulit untuk dapet simleon...,4,642,1.57.2.129660,2024-10-16 07:42:13,,NaT,1.57.2.129660
1,ec85c1e0-ac51-4c18-958b-a90914a7e26f,Pengguna Google,https://play-lh.googleusercontent.com/EGemoI2N...,Yang paling disayangkan adalah saat ada tantan...,4,5,1.57.1.129081,2024-10-16 10:08:37,,NaT,1.57.1.129081
2,807166ba-f53a-44f1-a6fe-bb4834ed7d42,Pengguna Google,https://play-lh.googleusercontent.com/EGemoI2N...,Pendapatan uang nya terlalu kecil dan susah. b...,2,17,1.57.1.129081,2024-10-03 13:05:17,,NaT,1.57.1.129081
3,b64cc92b-ea52-4eef-84ef-34ec52326ccc,Pengguna Google,https://play-lh.googleusercontent.com/EGemoI2N...,"Offline dari mana'nya coba, udah numpuk resour...",3,20,1.57.1.129081,2024-10-07 09:01:38,,NaT,1.57.1.129081
4,1ae3aae2-ec03-473e-821d-741240df7e1a,Pengguna Google,https://play-lh.googleusercontent.com/EGemoI2N...,"""untuk saran"" 1.ringankan harga tanah 2.kenapa...",3,32,1.57.1.129081,2024-09-21 11:38:00,,NaT,1.57.1.129081


In [6]:
app_reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144652 entries, 0 to 144651
Data columns (total 11 columns):
 #   Column                Non-Null Count   Dtype         
---  ------                --------------   -----         
 0   reviewId              144652 non-null  object        
 1   userName              144652 non-null  object        
 2   userImage             144652 non-null  object        
 3   content               144645 non-null  object        
 4   score                 144652 non-null  int64         
 5   thumbsUpCount         144652 non-null  int64         
 6   reviewCreatedVersion  102137 non-null  object        
 7   at                    144652 non-null  datetime64[ns]
 8   replyContent          2 non-null       object        
 9   repliedAt             2 non-null       datetime64[ns]
 10  appVersion            102137 non-null  object        
dtypes: datetime64[ns](2), int64(2), object(7)
memory usage: 12.1+ MB


# **4. Data Cleansing**

## **4.1 Handling Missing Values**

In [7]:
# Count the missing values in each column
num_missing = app_reviews_df.isna().sum()

# Calculate the percentage of missing values in each column
percentage_missing = round(num_missing / app_reviews_df.shape[0] * 100, 3)

print(f'percentage of missing values \n{percentage_missing}\n\nNumber of missing values:\n{num_missing}')

percentage of missing values 
reviewId                 0.000
userName                 0.000
userImage                0.000
content                  0.005
score                    0.000
thumbsUpCount            0.000
reviewCreatedVersion    29.391
at                       0.000
replyContent            99.999
repliedAt               99.999
appVersion              29.391
dtype: float64

Number of missing values:
reviewId                     0
userName                     0
userImage                    0
content                      7
score                        0
thumbsUpCount                0
reviewCreatedVersion     42515
at                           0
replyContent            144650
repliedAt               144650
appVersion               42515
dtype: int64


The number of missing values differs across columns. The `content` column, which is the one we will analyze, is missing in only about `0.005%` of rows (7 reviews), so those rows can be safely removed. We will proceed as follows:

**STRATEGY**
1. Drop every column that contains missing values, except `content`, using the `drop()` method.
2. Remove the rows where `content` is missing using the `dropna()` method.

We drop those columns first because `replyContent` and `repliedAt` are missing in 99.999% of rows, so calling `dropna()` on the full table would remove almost the entire dataset.
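
The effect of the ordering can be seen on a toy table. This is a small sketch with made-up rows (the values and the four-row frame are assumptions, not data from the scrape); it only illustrates why sparse columns are dropped before `dropna()` is applied.

```python
import pandas as pd

# Toy frame: `replyContent` is almost always empty, `content` rarely is.
toy = pd.DataFrame({
    "content": ["bagus", None, "susah", "mahal"],
    "score": [5, 1, 2, 3],
    "replyContent": [None, None, None, "Terima kasih"],
})

# Rows first: a row is removed if ANY column is missing.
rows_first = toy.dropna()

# Sparse columns first, then only rows with a missing `content`.
cols_first = toy.drop(columns=["replyContent"]).dropna(subset=["content"])

print(len(rows_first))  # 1
print(len(cols_first))  # 3
```

Dropping rows first keeps only the single row that happens to have a reply, while dropping the sparse column first keeps every row that has review text.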

In [8]:
# columns that contain missing values and are not needed for the analysis
drop_columns = ['reviewCreatedVersion', 'replyContent', 'repliedAt', 'appVersion']

# drop the listed columns
df_cleaned = app_reviews_df.drop(columns=drop_columns)

# drop the rows where `content` is missing
df_cleaned = df_cleaned.dropna(subset=['content'])

# check the percentage of missing values after dropping
percentage_missing = df_cleaned.isna().sum() / df_cleaned.shape[0] * 100

print(f'Percentage of missing values: \n{percentage_missing}')

Percentage of missing values: 
reviewId         0.0
userName         0.0
userImage        0.0
content          0.0
score            0.0
thumbsUpCount    0.0
at               0.0
dtype: float64


The dataset is now free of missing values.

In [9]:
df_cleaned.head()

Unnamed: 0,reviewId,userName,userImage,content,score,thumbsUpCount,at
0,153cae38-9708-4e35-a4c4-92d3078ba8b0,Pengguna Google,https://play-lh.googleusercontent.com/EGemoI2N...,Barangnya mahal2 dan sulit untuk dapet simleon...,4,642,2024-10-16 07:42:13
1,ec85c1e0-ac51-4c18-958b-a90914a7e26f,Pengguna Google,https://play-lh.googleusercontent.com/EGemoI2N...,Yang paling disayangkan adalah saat ada tantan...,4,5,2024-10-16 10:08:37
2,807166ba-f53a-44f1-a6fe-bb4834ed7d42,Pengguna Google,https://play-lh.googleusercontent.com/EGemoI2N...,Pendapatan uang nya terlalu kecil dan susah. b...,2,17,2024-10-03 13:05:17
3,b64cc92b-ea52-4eef-84ef-34ec52326ccc,Pengguna Google,https://play-lh.googleusercontent.com/EGemoI2N...,"Offline dari mana'nya coba, udah numpuk resour...",3,20,2024-10-07 09:01:38
4,1ae3aae2-ec03-473e-821d-741240df7e1a,Pengguna Google,https://play-lh.googleusercontent.com/EGemoI2N...,"""untuk saran"" 1.ringankan harga tanah 2.kenapa...",3,32,2024-09-21 11:38:00


## **4.2 Handling Duplicated Rows**

In [10]:
# retrieve number of duplicate rows
num_duplicated = df_cleaned.duplicated().sum() 

# Print the number of duplicate
print(f'number of duplicate rows: {num_duplicated}')

number of duplicate rows: 0


# **5. Preprocessing**

For sentiment analysis, preprocessing should keep the words that carry sentiment and discard everything else. Punctuation such as `?`, `!`, and `#`, as well as numbers, carries little sentiment on its own, and uppercase and lowercase forms should be treated as the same word (for example, `Book` and `book`). The content therefore needs to be cleaned of these issues during preprocessing. We will apply the steps below:

**STRATEGY**
1. `Cleaning Text`: remove hashtags, links, mentions, numbers, and punctuation.
2. `Lowercasing` all the content.
3. `Tokenizing`: split each sentence into word tokens.
4. Remove all the `Stopwords`.
5. `Stemming`: a heuristic process that chops affixes off words to reduce them to a common base form, which works correctly most of the time.
6. Fix `Slangwords`.
7. Rejoin the tokens into a sentence.
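
The steps above can be sketched end to end in plain Python. This is only an illustration under stated assumptions: the tiny `STOPWORDS` set and `SLANG` dictionary are hypothetical stand-ins for real Indonesian resources, whitespace splitting stands in for `word_tokenize`, and stemming (step 5) is left out because it needs an external stemmer.

```python
import re
import string

# Hypothetical, deliberately tiny resources for illustration only.
STOPWORDS = {"dan", "yang", "untuk", "di"}
SLANG = {"yg": "yang", "banget": "sekali", "udah": "sudah"}

def preprocess(text):
    # 1. Cleaning: drop links, mentions, hashtags, digits, then punctuation.
    text = re.sub(r"http\S+|[@#]\w+|\d+", "", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 2. Lowercasing.
    text = text.lower()
    # 3. Tokenizing (whitespace split as a stand-in tokenizer).
    tokens = text.split()
    # 4. Stopword removal.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 5. Stemming is omitted in this sketch.
    # 6. Slang normalization via dictionary lookup.
    tokens = [SLANG.get(t, t) for t in tokens]
    # 7. Rejoin the tokens into a sentence.
    return " ".join(tokens)

sample = "Mahal banget dan susah, udah 2 hari! #kecewa http://example.com"
print(preprocess(sample))  # mahal sekali susah sudah hari
```

Note that the order of steps matters: with slang fixed after stopword removal, a slang form that normalizes to a stopword (such as `yg` to `yang` in this toy dictionary) would survive in the output.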

## **5.1 Cleaning Text**

In [11]:
def cleaning_content(content):
    # remove numbers
    content = re.sub(r'\d+', '', content)

    # remove hashtags
    content = re.sub(r'#\w+', '', content)

    # remove mentions
    content = re.sub(r'@\w+', '', content)

    # remove links
    content = re.sub(r'http\S+', '', content)

    # replace newlines with spaces
    content = content.replace('\n', ' ')

    # remove all punctuation characters
    content = content.translate(str.maketrans('', '', string.punctuation))

    # strip leading and trailing spaces
    content = content.strip(' ')

    return content

In [12]:
# apply the cleaning function to the `content` column and store the result in a new column called `content_cleaned`
df_cleaned['content_cleaned'] = df_cleaned['content'].apply(cleaning_content)

# print content example
print(f'\33[94m--------- | Content cleaned | ---------\n\33[0m')
print(f"{df_cleaned['content_cleaned'].iloc[0]}\n")

--------- | Content cleaned | ---------

Barangnya mahal dan sulit untuk dapet simleonnya Udah gitu harga jalanannya mahal banget buset Game nya seakan untuk yg bermodal gede aja wkwkw Dari segi grafik ya gamenya cukup bagus sih sebenernya Pas teken kereta kita bisa ikutin keretanya Coba aja kendaraan lain juga bisa Kadang masih suka ngebug Terus ekspansi wilayah sama perluas gudang susah bener dah terlalu repot dan banyak banget Pemain yang cheat di war clubs tolong ditindaklanjuti
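
One side effect of deleting punctuation outright is that words separated only by punctuation get glued together. The sketch below contrasts that behavior with a hypothetical variant, `space_punctuation`, that replaces punctuation with spaces instead; both helper names and the sample sentence are assumptions for illustration, not part of the notebook.

```python
import re
import string

def strip_punctuation(text):
    """Delete punctuation outright, as the cleaning step above does."""
    return text.translate(str.maketrans("", "", string.punctuation))

def space_punctuation(text):
    """Hypothetical variant: map punctuation to spaces, then collapse whitespace."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return re.sub(r"\s+", " ", text.translate(table)).strip()

sample = "grafiknya bagus,tapi mahal!"
print(strip_punctuation(sample))  # grafiknya bagustapi mahal
print(space_punctuation(sample))  # grafiknya bagus tapi mahal
```

Whether the variant is worth adopting depends on how often reviewers omit the space after punctuation in the scraped data.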



## **5.2 Case Folding**

In [13]:
def casefolding(content):
    return content.lower()

In [14]:
# lowercase the content
df_cleaned['content_lower'] = df_cleaned['content_cleaned'].apply(casefolding)

# print content example
print(f'\33[94m--------- | lowercase content | ---------\n\33[0m')
print(f"{df_cleaned['content_lower'].iloc[0]}\n")

--------- | lowercase content | ---------

barangnya mahal dan sulit untuk dapet simleonnya udah gitu harga jalanannya mahal banget buset game nya seakan untuk yg bermodal gede aja wkwkw dari segi grafik ya gamenya cukup bagus sih sebenernya pas teken kereta kita bisa ikutin keretanya coba aja kendaraan lain juga bisa kadang masih suka ngebug terus ekspansi wilayah sama perluas gudang susah bener dah terlalu repot dan banyak banget pemain yang cheat di war clubs tolong ditindaklanjuti



## **5.3 Tokenize**

In [15]:
def tokenize(content):
    return word_tokenize(content)

In [16]:
# tokenize each row of the lowercased content
df_cleaned['content_tokenized'] = df_cleaned['content_lower'].apply(tokenize)

# print content example
print(f'\33[94m--------- | Tokenized content | ---------\n\33[0m')
print(f"{df_cleaned['content_tokenized'].iloc[0]}\n")

--------- | Tokenized content | ---------

['barangnya', 'mahal', 'dan', 'sulit', 'untuk', 'dapet', 'simleonnya', 'udah', 'gitu', 'harga', 'jalanannya', 'mahal', 'banget', 'buset', 'game', 'nya', 'seakan', 'untuk', 'yg', 'bermodal', 'gede', 'aja', 'wkwkw', 'dari', 'segi', 'grafik', 'ya', 'gamenya', 'cukup', 'bagus', 'sih', 'sebenernya', 'pas', 'teken', 'kereta', 'kita', 'bisa', 'ikutin', 'keretanya', 'coba', 'aja', 'kendaraan', 'lain', 'juga', 'bisa', 'kadang', 'masih', 'suka', 'ngebug', 'terus', 'ekspansi', 'wilayah', 'sama', 'perluas', 'gudang', 'susah', 'bener', 'dah', 'terlalu', 'repot', 'dan', 'banyak', 'banget', 'pemain', 'yang', 'cheat', 'di', 'war', 'clubs', 'tolong', 'ditindaklanjuti']
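
Because punctuation has already been removed at this point, a plain whitespace split yields the same tokens for the opening words of this review. The snippet below is a dependency-free check of that observation on the first seven words only; it is not a claim that `str.split()` and `word_tokenize` agree on every input.

```python
# First seven words of the lowercased example review shown above.
cleaned = "barangnya mahal dan sulit untuk dapet simleonnya"

# Whitespace tokenization: split on runs of spaces.
tokens = cleaned.split()

print(tokens)
# ['barangnya', 'mahal', 'dan', 'sulit', 'untuk', 'dapet', 'simleonnya']
```

A dedicated tokenizer matters more when the text still contains punctuation or contractions that need to be separated from the words around them.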

