# Introduction
Hello everyone, and welcome to this kernel. Here I am going to classify game reviews collected from Steam, using a deep learning approach built with TensorFlow.

# Table of Contents
1. Preparing Environment
1. Data Overview and Data Preprocessing
1. Building Model
1. Loading Pre-trained Word Embeddings
1. Training Model | Displaying Results
1. Final Test
1. Conclusion

# Preparing Environment
* In this section we'll prepare our environment by importing the libraries and loading the data we'll use.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re
import time
import random

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,accuracy_score

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences 
from tensorflow.compat.v1.keras.layers import CuDNNGRU

import warnings as wrn
wrn.filterwarnings('ignore')

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
train_set = pd.read_csv("/kaggle/input/game-review-dataset/train_gr/train.csv")
test_set = pd.read_csv('/kaggle/input/game-review-dataset/test_gr/test.csv')
game_ov = pd.read_csv('/kaggle/input/game-review-dataset/train_gr/game_overview.csv')

In [None]:
train_set.head()

In [None]:
test_set.head()

* As you can see, the provided test set has no labels, so we can't use it to evaluate the model.

In [None]:
game_ov.head()

# Data Overview and Data Preprocessing
In this section we'll take a look at the data and then we'll process it to train a deep neural network.

In [None]:
train_set.info()

* We have 17k samples.
* Let's check class distribution.

In [None]:
# Pass the column as x= explicitly; recent seaborn versions no longer accept it positionally
sns.countplot(x=train_set["user_suggestion"])
plt.show()

* We can consider our classes reasonably balanced, which is great news!
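
A quick numeric check can complement the plot. The sketch below computes the share of each class; `class_proportions` is a hypothetical helper and the labels are toy values for illustration, not the real dataset. In the notebook you could pass `train_set["user_suggestion"]` instead.

```python
from collections import Counter

def class_proportions(labels):
    """Return {label: fraction of samples} for an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Toy labels for illustration only (not the real dataset).
toy_labels = [1, 0, 1, 1, 0, 1, 0, 1]
proportions = class_proportions(toy_labels)
print(proportions)
```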

Now let's start processing our dataset. We'll follow the steps below:

1. Cleaning and Lowering Data
1. Tokenizing and Padding
1. Train Test Splitting

### Step 1: Cleaning and Lowering The Data
In this section we'll keep only the columns we need (the review text and its label) and define a function that cleans the text.

In [None]:
# Dropping irrelevant features: keep only the review text and the label
x = train_set["user_review"]
y = train_set["user_suggestion"]



In [None]:
def cleanTexts(texts):
    cleaned = []
    pattern = "[^a-zA-Z0-9]"
    for text in texts:
        clrd = re.sub(pattern," ",text).lower().strip()
        cleaned.append(clrd)
    return cleaned



* Let's check our function.

In [None]:
cleanTexts(["If it works great, it will remove something  ()}12451235"])

In [None]:
x_cleaned = cleanTexts(x)
x_cleaned[0]

### Step 2: Tokenizing and Padding Data
In this section we'll convert our texts into sequences by mapping each word to an integer. Then we'll make sure every sequence has the same length by adding 0s to the short ones and trimming the long ones.
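
To make the idea concrete, here is a minimal pure-Python sketch of the two steps. It is a simplified stand-in, not the Keras implementation: `fit_word_index` and `to_padded_sequences` are hypothetical helpers, and the sketch only assumes the Keras defaults of ranking words by frequency and padding and trimming at the front of each sequence.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer rank, most frequent first (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_padded_sequences(texts, word_index, maxlen):
    """Encode texts as integers, then left-pad with 0 or keep only the last `maxlen` tokens."""
    padded = []
    for text in texts:
        seq = [word_index[w] for w in text.split() if w in word_index]
        seq = seq[-maxlen:]                             # trim long sequences from the front
        padded.append([0] * (maxlen - len(seq)) + seq)  # pad short sequences at the front
    return padded

# Toy reviews for illustration only.
toy_texts = ["great game", "great story but bad ending really bad"]
word_index = fit_word_index(toy_texts)
toy_padded = to_padded_sequences(toy_texts, word_index, maxlen=4)
print(toy_padded)
```

Every row of `toy_padded` has exactly 4 entries, which is the property the network needs.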

In [None]:
# Tokenizer 
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(x_cleaned)
x_tokens = tokenizer.texts_to_sequences(x_cleaned)

* And let's check our sequences.

In [None]:
print(x_tokens[0])
print()

print(len(x_tokens[0]))
print(len(x_tokens[1]))
print(len(x_tokens[2]))

* As you can see, the sequences have different lengths, but neural networks work with inputs of a constant shape. Let's solve this problem with padding.
* First we'll build a list of the sequence lengths and then find its third quartile (the 75th percentile).
* Our new sequences will all have that third-quartile length.
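
One detail worth keeping in mind: `np.percentile` takes its second argument on a 0 to 100 scale, so the third quartile corresponds to 75, not 0.75. The sketch below is a pure-Python illustration of linear-interpolation percentiles (the same idea as NumPy's default method); the `percentile` helper and the toy lengths are assumptions made for this example.

```python
def percentile(values, q):
    """Linear-interpolation percentile; q is on a 0-100 scale, as in np.percentile."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] + (ordered[hi] - ordered[lo]) * frac

# Toy sequence lengths for illustration only.
toy_lengths = [10, 20, 30, 40, 50, 60, 70, 80, 90]
third_quartile = percentile(toy_lengths, 75)   # 75th percentile
tiny_quantile = percentile(toy_lengths, 0.75)  # 0.75th percentile, close to the minimum
print(third_quartile, tiny_quantile)
```

Passing 0.75 returns a value just above the shortest sequence, which would truncate almost every review.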

In [None]:
len_arr = [len(s) for s in x_tokens]
# np.percentile expects q on a 0-100 scale, so the third quartile is 75
MAX_LEN = int(np.percentile(len_arr,75))

* We'll also save this value to a JSON file so we can reuse it later.

In [None]:
import json
with open("maxlen.json",mode="w") as F:
    json.dump({"maxlen":MAX_LEN},F)
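
When the value is needed again (for example, to pad new reviews at prediction time), it can be read back from the file. The sketch below shows a round trip in a temporary directory; `save_maxlen` and `load_maxlen` are hypothetical helpers, and 42 is a placeholder rather than the real `MAX_LEN`.

```python
import json
import os
import tempfile

def save_maxlen(path, maxlen):
    """Write the padding length to a small JSON file."""
    with open(path, mode="w") as f:
        json.dump({"maxlen": int(maxlen)}, f)

def load_maxlen(path):
    """Read the padding length back from the JSON file."""
    with open(path) as f:
        return json.load(f)["maxlen"]

# Round trip in a temporary directory; 42 is a placeholder value.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "maxlen.json")
    save_maxlen(path, 42)
    restored = load_maxlen(path)
print(restored)
```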
    

In [None]:
print(MAX_LEN)

In [None]:
x_tokens_pad = pad_sequences(x_tokens,maxlen=MAX_LEN)
x_tokens_pad.shape


### Step 3: Train Test Splitting
In this section we'll split our set into train and test parts. We won't use the test part during training.

In [None]:
x_train,x_test,y_train,y_test = train_test_split(x_tokens_pad,np.asarray(y),test_size=0.2,random_state=42)

In [None]:
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

# Building Model
In this section I am going to build a deep neural network using TensorFlow. But before starting the implementation, I want to talk about GRUs a bit.

In deep learning, when we work with sequences (music is a sequence of notes, and text is a sequence of words) we often use Recurrent Neural Networks, because they keep a memory of what they have already seen while processing each element of the sequence in order.

But when we use Recurrent Neural Networks (I'll call them RNNs from now on) we run into a big problem: **vanishing gradients**. Backpropagating through many time steps multiplies many small factors together, so the gradient that reaches the early steps shrinks toward zero and long-range dependencies become hard to learn.

LSTMs (Long Short-Term Memory networks, an improved version of simple RNNs) and GRUs (Gated Recurrent Units) greatly reduce this problem, because they have learned filters called **gates** that decide how much of the previous state to keep. An LSTM has input, forget, and output gates, while a GRU has an update gate and a reset gate.
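
To make the gating idea concrete, here is a minimal sketch of a single GRU step with a scalar input and a scalar hidden state. It is for illustration only: the parameter names and values are made up, real layers use weight matrices, and the blend `z * h_prev + (1 - z) * h_cand` follows one common convention (some texts swap the roles of `z` and `1 - z`).

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gru_step(x, h_prev, p):
    """One GRU time step for a scalar input x and scalar state h_prev."""
    z = sigmoid(p["w_z"] * x + p["u_z"] * h_prev + p["b_z"])  # update gate: how much old state to keep
    r = sigmoid(p["w_r"] * x + p["u_r"] * h_prev + p["b_r"])  # reset gate: how much old state feeds the candidate
    h_cand = math.tanh(p["w_h"] * x + p["u_h"] * (r * h_prev) + p["b_h"])  # candidate state
    return z * h_prev + (1.0 - z) * h_cand  # blend old state and candidate

# Made-up parameters for illustration only.
params = {"w_z": 0.5, "u_z": 0.5, "b_z": 0.0,
          "w_r": 0.5, "u_r": 0.5, "b_r": 0.0,
          "w_h": 1.0, "u_h": 1.0, "b_h": 0.0}

h = 0.0
for x in [1.0, -0.5, 0.25]:
    h = gru_step(x, h, params)
print(h)

# With a strongly positive update-gate bias the old state passes through almost unchanged.
keep_params = dict(params, b_z=50.0)
kept = gru_step(1.0, 0.3, keep_params)
print(kept)
```

When the update gate is close to 1, the state is copied forward nearly untouched, which is what lets gradients survive across many time steps.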

In [None]:
VOCAB_SIZE = len(tokenizer.word_index) + 1
# We add 1 because index 0 is reserved for padding

# Each word will be a 100D vector.
VECTOR_SIZE = 100

def buildModel(MAX_LEN,embedding_weights=None):
    
    model = keras.Sequential()
    if embedding_weights is not None:
        model.add(layers.Embedding(input_dim=VOCAB_SIZE,
                                   output_dim=VECTOR_SIZE,
                                   input_length=MAX_LEN,
                                   weights=[embedding_weights],
                                   trainable=True
                              ))
        
    else:
        model.add(layers.Embedding(input_dim=VOCAB_SIZE,
                                   output_dim=VECTOR_SIZE,
                                   input_length=MAX_LEN
                                  ))
    
    model.add(CuDNNGRU(512,return_sequences=True))
    model.add(CuDNNGRU(1024,return_sequences=True))
    model.add(CuDNNGRU(1024,return_sequences=False))
    model.add(layers.Dense(1,activation="sigmoid"))
    
    model.compile(optimizer="Adam",loss="binary_crossentropy",metrics=["accuracy"])
    return model

In [None]:
model = buildModel(MAX_LEN)
model.summary()

# Loading Pre-trained Word Embeddings 

In this section we'll load pre-trained word embeddings.

In [None]:
word2vec = {} # Trained glove model 
with open("../input/glove-global-vectors-for-word-representation/glove.6B.100d.txt",encoding="UTF-8") as f:
    for line in f:
        values = line.split() 
        word = values[0]
        vec = np.asarray(values[1:],dtype="float32")
        word2vec[word] = vec
        

* First we read the word vectors from the text file into a Python dictionary.

In [None]:
# Initialize with uniform random values so words missing from GloVe still get a vector
embedding_matrix = np.random.uniform(-1,1,(VOCAB_SIZE,VECTOR_SIZE))

for word,i in tokenizer.word_index.items():
    if i<VOCAB_SIZE: 
        embedding_vector = word2vec.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector


* Then we created our embedding matrix: every row starts as random uniform values, and if a word from our vocabulary has a pre-trained GloVe vector, we overwrite its row with that vector.
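
It can be useful to know how many vocabulary words actually received a pre-trained vector, since the rest keep their random initialization. The sketch below measures that coverage; `embedding_coverage` is a hypothetical helper and the dictionaries are toy data. In the notebook it could be called with `tokenizer.word_index` and `word2vec`.

```python
def embedding_coverage(word_index, pretrained):
    """Return (fraction of vocabulary found in the pre-trained vectors, list of missing words)."""
    missing = [word for word in word_index if word not in pretrained]
    found = len(word_index) - len(missing)
    coverage = found / len(word_index) if word_index else 0.0
    return coverage, missing

# Toy vocabulary and toy "pre-trained" vectors for illustration only.
toy_word_index = {"great": 1, "game": 2, "gg": 3, "lagfest": 4}
toy_pretrained = {"great": [0.1, 0.2], "game": [0.3, 0.4], "gg": [0.5, 0.6]}

coverage, missing = embedding_coverage(toy_word_index, toy_pretrained)
print(coverage, missing)
```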

In [None]:
model = buildModel(MAX_LEN,embedding_matrix)

In [None]:
model.summary()

# Training Model | Displaying Results
In this section I am going to train our model.

In [None]:
model.fit(x_train,y_train,epochs=3,validation_split=0.2)

# Final Test
In this section I am going to evaluate our model on the held-out test split.

In [None]:
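# predict_classes is not available in newer TensorFlow releases,
# so threshold the sigmoid output at 0.5 to get class labels
y_pred = (model.predict(x_test) > 0.5).astype("int32").ravel()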

accuracy_sc = round(accuracy_score(y_pred=y_pred,y_true=y_test)*100,2)
conf_matrix = confusion_matrix(y_pred=y_pred,y_true=y_test)


print("Accuracy score is {}% ".format(accuracy_sc))

plt.subplots()
sns.heatmap(conf_matrix,annot=True,linewidths=1.5,fmt=".1f")
plt.xlabel("Prediction")
plt.ylabel("Actual")
plt.show()

* The model's predictions are not skewed toward one class, so 73% accuracy is not a bad result.
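
One way to check the balance claim numerically is to compute the recall of each class from the confusion matrix. The sketch below assumes scikit-learn's layout (rows are actual classes, columns are predictions); `per_class_recall` is a hypothetical helper and the counts are made up for illustration, not this model's results.

```python
def per_class_recall(conf_matrix):
    """Recall per class for a square matrix with actual classes as rows."""
    recalls = []
    for i, row in enumerate(conf_matrix):
        total = sum(row)
        recalls.append(row[i] / total if total else 0.0)
    return recalls

# Made-up counts for illustration only (not the model's real confusion matrix).
toy_matrix = [[80, 20],
              [30, 70]]
recalls = per_class_recall(toy_matrix)
print(recalls)
```

If the recalls of the two classes are close to each other, the model is not favoring one class. In the notebook the helper could be called with `conf_matrix` directly.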