<a href="https://colab.research.google.com/github/raihankr/ml-sentiment-analysis/blob/main/sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dicoding - Sentiment Analysis Project
Created by: Raihan Khairul Rochman

**Objective:**  
Analyze the sentiment of user reviews of the **SATUSEHAT Mobile** app on the Play Store

# Import Libraries

In [1]:
!pip install google_play_scraper
!pip install Sastrawi

import tensorflow as tf
import pandas as pd
from google_play_scraper import Sort, reviews
import csv
import re
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
import json
import requests
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import PorterStemmer

Collecting google_play_scraper
  Downloading google_play_scraper-1.2.7-py3-none-any.whl.metadata (50 kB)
Downloading google_play_scraper-1.2.7-py3-none-any.whl (28 kB)
Installing collected packages: google_play_scraper
Successfully installed google_play_scraper-1.2.7
Collecting Sastrawi
  Downloading Sastrawi-1.0.1-py2.py3-none-any.whl.metadata (909 bytes)
Downloading Sastrawi-1.0.1-py2.py3-none-any.whl (209 kB)
Installing collected packages: Sastrawi
Successfully installed Sastrawi-1.0.1


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# *Data Scraping*

In [2]:
scraped_data, token = reviews(
    'com.pinterest',  # NOTE: Pinterest's package id, not SATUSEHAT Mobile (the stated target)
    lang='id',
    country='id',
    sort=Sort.MOST_RELEVANT,
    count=15000,
)

In [3]:
with open('drive/My Drive/reviews.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Review'])
    for review in scraped_data:
        writer.writerow([review['content']])

# Load & Clean Dataset

In [4]:
df = pd.read_csv('drive/My Drive/reviews.csv')

In [5]:
df = df.dropna().drop_duplicates()

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 14994 entries, 0 to 14999
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Review  14994 non-null  object
dtypes: object(1)
memory usage: 234.3+ KB


**Data description**:
I collected about 15,000 of the most relevant user reviews of the *SATUSEHAT Mobile* app on the Google Play Store *platform*; 14,994 rows remain after dropping missing values and duplicates.

# Text Preprocessing

In [7]:
# Additional stopwords for Indonesian
stopwords1 = pd.read_csv('https://raw.githubusercontent.com/ramaprakoso/analisis-sentimen/master/kamus/stopword.txt', header=None, names=['word'])
stopwords1 = stopwords1['word'].to_list()

In [8]:
response = requests.get('https://raw.githubusercontent.com/louisowen6/NLP_bahasa_resources/master/combined_slang_words.txt')
slangwords = json.loads(response.text)

In [9]:
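def cleanText(text):
    # Remove mentions, hashtags, URLs, digits, and punctuation
    result = re.sub(r"(([@#]|https?:\/\/)\S+|\d|[^\w\s])", "", text)
    # str.replace returns a new string, so the result must be reassigned
    result = result.replace("\n", " ")
    # Remove remaining punctuation (the underscore counts as \w above)
    result = result.translate(str.maketrans("", "", string.punctuation))
    result = result.strip(" ")
    return result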

casefoldingText = lambda text: text.lower()

def fixSlangWords(words):
    result = []
    for word in words:
        if word in slangwords:
            result.append(slangwords[word])
        else:
            result.append(word)
    return result

def filterWords(words):
    stopwords_list = set(stopwords.words('indonesian'))
    stopwords_list.update(stopwords1)
    stopwords_list.update(stopwords.words('english'))

    result = []
    for word in words:
        if word not in stopwords_list:
            result.append(word)
    return result

factory = StemmerFactory()
stemmer = factory.create_stemmer()
stemWords = lambda text: stemmer.stem(text)

toSentence = lambda words: ' '.join(words)
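The helpers above form a pipeline: clean, lowercase, tokenize, normalize slang, then drop stopwords. The sketch below runs that sequence on one toy review so each stage can be checked by hand. It is a minimal illustration only: `toy_slangwords` and `toy_stopwords` are hypothetical stand-ins for the downloaded resources, whitespace splitting replaces `word_tokenize`, and the Sastrawi stemming step is left out so the block runs without extra packages.

```python
import re
import string

# Hypothetical stand-ins for the downloaded slang dictionary and stopword list.
toy_slangwords = {"bgt": "banget", "apk": "aplikasi", "gk": "tidak"}
toy_stopwords = {"yang", "dan", "nya", "ini"}

def clean_text(text):
    # Drop mentions, hashtags, URLs, digits, and punctuation.
    result = re.sub(r"(([@#]|https?://)\S+|\d|[^\w\s])", "", text)
    # Newlines survive the regex (they match \s), so turn them into spaces.
    result = result.replace("\n", " ")
    result = result.translate(str.maketrans("", "", string.punctuation))
    return result.strip()

def preprocess(text):
    tokens = clean_text(text).lower().split()             # whitespace tokenization
    tokens = [toy_slangwords.get(t, t) for t in tokens]   # slang normalization
    return [t for t in tokens if t not in toy_stopwords]  # stopword filtering

tokens = preprocess("Apk nya bagus bgt!!\nTapi loading lama https://example.com 123")
print(tokens)
```

Because punctuation is replaced with an empty string rather than a space, words joined only by punctuation (for example `baik,hanya`) would be merged into one token; substituting a space instead is one possible alternative.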

In [16]:
df['Clean'] = df['Review'].apply(cleanText).apply(casefoldingText)
df['Tokenized'] = df['Clean'].apply(word_tokenize)
df['Formalized'] = df['Tokenized'].apply(fixSlangWords)
df['Stemmed'] = df['Formalized'].apply(toSentence).apply(stemWords)
# stemmer.stem returns a string, so split it back into tokens before filtering
df['Filtered'] = df['Stemmed'].str.split().apply(filterWords)
df['Final'] = df['Filtered'].apply(toSentence)

# Data Labeling

In [None]:
lexicon_positive = pd.read_csv('https://raw.githubusercontent.com/fajri91/InSet/master/positive.tsv', delimiter='\t', index_col=0).T.loc['weight'].to_dict()
lexicon_negative = pd.read_csv('https://raw.githubusercontent.com/fajri91/InSet/master/negative.tsv', delimiter='\t', index_col=0).T.loc['weight'].to_dict()

In [None]:
def sentiment_analysis(words):
    score = 0
    for word in words:
        if word in lexicon_positive:
            score += lexicon_positive[word]
        if word in lexicon_negative:
            score += lexicon_negative[word]
    polarity: str
    if score >= 2:
        polarity = 'positive'
    elif score <= -2:
        polarity = 'negative'
    else:
        polarity = 'neutral'
    return score, polarity
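The labeling rule above sums lexicon weights over a token list and thresholds the total at ±2. The sketch below reproduces that rule on tiny made-up lexicons so the arithmetic is easy to follow. `toy_positive`, `toy_negative`, and their weights are hypothetical and are not taken from InSet; the only assumption carried over from the function above is that negative words carry negative weights.

```python
# Hypothetical mini-lexicons; the weights are illustrative only.
toy_positive = {"bagus": 2, "cocok": 3, "suka": 1}
toy_negative = {"lama": -2, "bug": -3, "lambat": -4}

def score_tokens(words, positive, negative, threshold=2):
    # Add the weight of every token found in either lexicon.
    score = sum(positive.get(w, 0) + negative.get(w, 0) for w in words)
    if score >= threshold:
        return score, "positive"
    if score <= -threshold:
        return score, "negative"
    return score, "neutral"

examples = [
    ["aplikasi", "bagus", "cocok"],  # 2 + 3 = 5
    ["loading", "lama", "bug"],      # -2 - 3 = -5
    ["bagus", "lama"],               # 2 - 2 = 0
]
results = [score_tokens(words, toy_positive, toy_negative) for words in examples]
print(results)
```

Mixed reviews whose positive and negative weights cancel out land in the neutral band, which is worth keeping in mind when reading the class counts.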

In [None]:
labeled = df['Filtered'].apply(sentiment_analysis)
labeled = list(zip(*labeled))
(df['Score'], df['Polarity']) = labeled

In [None]:
df[['Filtered', 'Polarity', 'Score']].describe()

In [None]:
df['Polarity'].value_counts()

In [None]:
(X, y) = (df['Final'], df['Polarity'])

In [None]:
X

In [None]:
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=0)
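Random oversampling balances the classes by duplicating randomly chosen minority-class samples until every class matches the largest one. The sketch below shows that idea in plain Python on a toy dataset; `random_oversample` is a hypothetical helper written for illustration and is not imblearn's implementation. In the notebook itself, `ros.fit_resample` would presumably be called on a 2-D feature matrix (for example TF-IDF features) rather than on raw text.

```python
import random
from collections import Counter

def random_oversample(texts, labels, seed=0):
    """Duplicate random minority samples until all classes have equal size."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)
    # Every class is grown to the size of the largest class.
    target = max(len(group) for group in by_label.values())
    out_texts, out_labels = [], []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_texts.extend(group + extra)
        out_labels.extend([label] * target)
    return out_texts, out_labels

texts = ["bagus", "cocok", "mantap", "lama", "biasa"]
labels = ["positive", "positive", "positive", "negative", "neutral"]
X_res, y_res = random_oversample(texts, labels)
counts = Counter(y_res)
print(counts)
```

Since the added rows are exact copies, oversampling is usually applied to the training split only, so that duplicates of a training sample do not end up in the test set.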

In [None]:
factory = StemmerFactory()
stemmer = factory.create_stemmer()
stemmer.stem('menggunakan')

In [13]:
df['Clean'][0:100].apply(lambda t: stemmer.stem(t))

|   | Clean |
|---|---|
| 0 | bagus tapi belakang ini suka loading lama bang... |
| 1 | apk nya baguss baguss banget cocok buat altern... |
| 2 | aplikasi sebenernya bagus tapi telah saya upda... |
| 3 | untuk aplikasi sudah baik dan cukup tarik teta... |
| 4 | update terbarupinterest cukup baikhanya saja a... |
| 5 | ken nyari inspirasi di pinterest tapi baru mas... |
| 6 | semua udh bagus tapi tolong dong loadingnya ak... |
| 7 | turut pinterest nya udah bagus bisa nyari ide ... |
| 8 | tolong baik untuk bug dan loading yang lebih l... |
| 9 | saya sudah lama pakai app ini tapi saya kurang... |
