## **0.Starter code**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## **0.5.Info**
This dataset contains the top 100 games in each game category on the Google Play Store, along with their ratings and other data such as price and number of installs.<br>***Data as of April 9, 2021.***

## **1.Import libraries and a dataset**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS

In [None]:
df = pd.read_csv("/kaggle/input/top-play-store-games/android-games.csv")
print(df.info())
df.head(5)

This dataset has...
* 1730 rows
* 15 columns
* No missing values
* Proper dtypes, **except installs** (should be numeric)
* ***objective: EDA***

## **2.Data Preprocessing**

#### Convert installs data type from object to numeric

In [None]:
# Check unique values of installs
print(f"number of installs' unique values: {df['installs'].nunique()}")
print("\n")
print("unique values")
print(df['installs'].unique())

In [None]:
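# define a unit transform function to work with .apply()
def unit_transform(value):
    # the last character is treated as the unit suffix (M = millions, k = thousands)
    value = value.strip()
    if value[-1].lower() == 'm':
        return float(value[:-1].strip()) * 1000000
    elif value[-1].lower() == 'k':
        return float(value[:-1].strip()) * 1000
    # no recognized suffix: parse as a plain number instead of silently returning None
    return float(value)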

In [None]:
# transform installs column
df['installs'] = df['installs'].apply(unit_transform)
df['installs'][:3]
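
As a quick sanity check of the suffix-based conversion, here is a minimal standalone sketch. The helper name `parse_installs` and the sample strings are hypothetical (they are not taken from the dataset); the point is only to show the conversion on a few hand-written inputs, including one without a suffix.

```python
def parse_installs(text):
    """Convert strings such as '500.0 M' or '100.0 k' to a float count."""
    multipliers = {"k": 1_000, "m": 1_000_000}
    text = text.strip()
    suffix = text[-1].lower()
    if suffix in multipliers:
        return float(text[:-1].strip()) * multipliers[suffix]
    return float(text)  # no suffix: assume a plain number


# Hypothetical sample values, written by hand for illustration
samples = ["500.0 M", "100.0 k", "1.0 M", "750"]
parsed = [parse_installs(s) for s in samples]
print(parsed)  # [500000000.0, 100000.0, 1000000.0, 750.0]
```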

## **3.EDA**

In [None]:
# select all 1st rank games sorted by average rating
df[df['rank'] == 1].sort_values(by = 'average rating', ascending = False)

*As expected, there are many famous games like Candy Crush, Clash of Clans, Garena Free Fire, etc. on the list. But I didn't expect that the first-ranked game with the highest average rating would have a title I can't read.*<br>
[For more info](https://play.google.com/store/apps/details?id=com.zytoona.wordscrush&hl=th&gl=US)

In [None]:
# set plot style and font scale
sns.set(style = 'darkgrid', font_scale = 1)

In [None]:
# create text of game titles
titles = ', '.join(df['title'].to_list())
titles[:100]

In [None]:
# stopwords
stopwords = set(STOPWORDS)

In [None]:
# instantiate a word cloud object
title_wc = WordCloud(
    background_color='white',
    max_words=2000,
    stopwords=stopwords
)

# generate the word cloud
title_wc.generate(titles)

In [None]:
# display the word cloud
fig = plt.figure(figsize = (14,18))

# display the cloud
plt.imshow(title_wc, interpolation='bilinear')
plt.axis('off')
plt.show()

*The word cloud above shows that many titles contain the word 'game'. However, I don't think the word 'game' in the title of a game app is meaningful, so let's add Game and Games to the stopwords.*

In [None]:
# add the words [Game,Games] to stopwords
stopwords.add('Game')
stopwords.add('Games')

# re-generate the word cloud
title_wc.generate(titles)

# display the cloud
fig = plt.figure(figsize = (14,18))
plt.imshow(title_wc, interpolation='bilinear')
plt.axis('off')
plt.show()

*Alright, it seems that words like Free, Puzzle, World, Word and Online are widely used when naming an Android game.*
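
A word cloud only shows relative sizes, so it can help to look at actual counts as well. The sketch below is one possible way to do that with the standard library; the `titles` list and the stopword set are made up for illustration and are not the notebook's data.

```python
import re
from collections import Counter

# Hypothetical titles, written by hand for illustration
titles = [
    "Candy Crush Saga",
    "Word Crush - Free Word Game",
    "Puzzle World Online",
    "Free Puzzle Game",
]
stopwords = {"game", "games"}  # compared in lowercase

# Split each title into lowercase words and drop the stopwords
words = [
    word
    for title in titles
    for word in re.findall(r"[a-z]+", title.lower())
    if word not in stopwords
]
counts = Counter(words)
print(counts.most_common(5))
```

With the real data, `titles` could be replaced by `df['title'].to_list()`.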

In [None]:
# total number of unique game category
df['category'].nunique()

In [None]:
# Average installations by game category
df.groupby('category')['installs'].mean().sort_values(ascending = False).plot(kind = 'barh', title = 'average installs by category')

*Arcade, Casual and Action are the top three categories by average number of installations.*

In [None]:
# Number of free, paid games by category
df.pivot_table(index = 'category', columns = 'paid', values = 'rank',aggfunc='count').plot(kind = 'bar', stacked = True, title = 'number of free/paid games by category')
plt.legend(bbox_to_anchor = (1.05,1), title = 'paid')
# df.groupby(['category','paid']).count()['title']

*Most games are free to play. This holds for every category.*

In [None]:
# Average of average rating by type of game(free vs paid)
sns.barplot(x = 'paid', y = 'average rating', estimator = np.mean, data = df)
plt.title("Average rating (free vs paid games)")

*There is no big difference in average rating between paid games and free games.*

In [None]:
df.head(2)

In [None]:
# Mean of [average rating, growth 30 days, growth 60 days] by category
df[['category', 'average rating','growth (30 days)','growth (60 days)']].pivot_table(index = 'category',
                                                                                     values = ['average rating',
                                                                                               'growth (30 days)',
                                                                                               'growth (60 days)'],
                                                                                    aggfunc = 'mean')

*Wow, what's wrong with GAME ACTION and GAME WORD? Their 30-day growth is extremely high!*

In [None]:
# Check growth (30 days) of GAME ACTION and GAME WORD
strange = df[(df['category'] == 'GAME ACTION') | 
             (df['category'] == 'GAME WORD')][['category','growth (30 days)']].sort_values(by = 'growth (30 days)',
                                                                                           ascending = False).head().index

In [None]:
# show those strange games with abnormally high growth (30 days)
df.loc[strange]  # 'strange' holds index labels, so select with .loc rather than .iloc

*So, Fill-The-Words-word search puzzle and Garena AOV: Link Start are the cause of the extremely high 30-day growth of the GAME ACTION and GAME WORD categories.*
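
To see how a single game can distort a category average, here is a small sketch with invented growth values (not the dataset's numbers). It compares the mean with and without one extreme value and shows that the median barely moves.

```python
from statistics import mean, median

# Hypothetical 30-day growth values (%) for one category;
# the last value plays the role of an extreme outlier
growth = [0.5, 1.0, 1.5, 2.0, 2.5, 60000.0]

mean_all = mean(growth)            # dominated by the outlier
mean_trimmed = mean(growth[:-1])   # mean without the outlier
median_all = median(growth)        # robust to the outlier

print(mean_all)      # 10001.25
print(mean_trimmed)  # 1.5
print(median_all)    # 1.75
```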

In [None]:
# Average of average rating by category
df.groupby("category")['average rating'].mean().sort_values(ascending = False).plot(kind = 'barh', title = 'average rating by category')

*From the plot above, there is no clear difference in average rating between categories*. 

In [None]:
# average rating distribution by game category
g = sns.FacetGrid(df, row = 'category')
g = g.map(sns.histplot, 'average rating', kde = True)

*From the histograms above, some categories are negatively skewed. So, in this case, the **median** is likely to be a better measure of central tendency than the mean.*
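
The effect of negative skew on the mean can be illustrated numerically. The sketch below computes a simple moment-based skewness coefficient in plain Python; the `skewness` helper and the ratings are assumptions made up for this example, not values from the dataset.

```python
from statistics import mean, median, pstdev


def skewness(values):
    """Moment-based skewness: the average cubed z-score (population std)."""
    mu = mean(values)
    sigma = pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)


# Hypothetical ratings with a long tail toward low values
ratings = [4.6, 4.5, 4.5, 4.4, 4.4, 4.3, 3.2]

skew = skewness(ratings)
print(skew)                           # negative: the tail is on the left
print(mean(ratings), median(ratings)) # the mean is pulled below the median
```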

## **4.Fit Model**

*Next, we will try clustering the categories into groups with similar average rating and installs using KMeans.*

In [None]:
from scipy import stats
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

In [None]:
df.head(3)

In [None]:
# df_std = pd.DataFrame(np.abs(StandardScaler().fit_transform(df.drop(['title','rank','category','paid'], axis = 1))), 
#                       columns = df.drop(['title','rank','category','paid'], axis = 1).columns) 
# df_std.head(3)
# df_std[(df_std > 3).any(axis = 1)]

In [None]:
df_cluster = df[['category','installs','average rating']].groupby("category").agg(['mean','median'])

In [None]:
df_cluster

In [None]:
df_cluster = pd.concat([df_cluster.xs(('installs','mean') , axis = 1), df_cluster.xs(('average rating','median') , axis = 1)], axis = 1)
df_cluster.head(3)

In [None]:
df_cluster = df_cluster.droplevel(1, axis = 1)

In [None]:
scaler = StandardScaler().fit(df_cluster)
df_cluster_std = scaler.transform(df_cluster)
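
Scaling matters here because KMeans relies on Euclidean distance: installs are in the millions while ratings span only a few tenths, so unscaled installs would dominate the clustering. `StandardScaler` standardizes each column to zero mean and unit variance using the population standard deviation. A minimal pure-Python sketch of that transformation, on invented numbers and with a hypothetical `z_scores` helper:

```python
from statistics import mean, pstdev


def z_scores(column):
    """Standardize a column: subtract the mean, divide by the population std."""
    mu = mean(column)
    sigma = pstdev(column)
    return [(x - mu) / sigma for x in column]


# Hypothetical per-category values, not taken from the dataset
installs = [1_000_000.0, 5_000_000.0, 9_000_000.0]
ratings = [4.0, 4.2, 4.4]

z_installs = z_scores(installs)
z_ratings = z_scores(ratings)

# After scaling, both columns live on the same scale
print(z_installs)
print(z_ratings)
```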

In [None]:
error = []
for i in range(2,15): # we only have 17 categories, so this range is probably wider than needed
    km = KMeans(n_clusters = i, init = 'k-means++', random_state = 101).fit(df_cluster_std)
    error.append(km.inertia_)

In [None]:
plt.plot(range(2,15), error, marker = 'D', markerfacecolor = 'red', color = 'blue', markeredgecolor = 'red')
plt.title("Error VS K")
plt.xlabel('K')
plt.ylabel('Error')

*optimal K: 4*
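
The "error" plotted above is KMeans' `inertia_`: the sum of squared distances from each point to its nearest cluster center. It always decreases as K grows, which is why one looks for the point where the curve flattens rather than for the minimum. A small sketch of the quantity itself, on toy points with hand-picked centers (the `inertia` helper is hypothetical and does no fitting):

```python
def inertia(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    total = 0.0
    for point in points:
        total += min(
            sum((a - b) ** 2 for a, b in zip(point, centroid))
            for centroid in centroids
        )
    return total


# Toy data: two well-separated pairs of points
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 4.0), (4.0, 5.0)]

one_center = [(2.0, 2.5)]                # K = 1: the overall mean
two_centers = [(0.0, 0.5), (4.0, 4.5)]   # K = 2: one mean per pair

print(inertia(points, one_center))   # 33.0
print(inertia(points, two_centers))  # 1.0
```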

In [None]:
# Fit KMeans with optimal K
km = KMeans(n_clusters = 4, init = 'k-means++', random_state = 101).fit(df_cluster_std)
pred = km.labels_
centroids = km.cluster_centers_

## **5.Result**

In [None]:
# result
result = df_cluster.copy()
result.reset_index(inplace = True)
result['cluster'] = pred
# result = pd.DataFrame(np.column_stack((df_cluster.index, pred)), columns = ['category','cluster'])
result.head()

In [None]:
fig_r, ax_r = plt.subplots(figsize = (13,10))

sns.scatterplot(x = 'installs', y = 'average rating', hue = 'cluster', palette = 'gist_rainbow', 
                data = result, s = 200, ax = ax_r)

plt.legend(bbox_to_anchor = (1.1,1), title = 'cluster')

for i in range(result.shape[0]):
    
    if result.iloc[i,0] == 'GAME STRATEGY':
        plt.text(result.iloc[i,1], result.iloc[i,2] - 0.005, result.iloc[i,0][5:])
    else:
        plt.text(result.iloc[i,1], result.iloc[i,2], result.iloc[i,0][5:])

plt.title("Cluster by game category")
plt.show()

### Cluster Result
* Cluster 0 ***red***
    * (WORD, CARD, CASINO) 
    * Highest rating, but also the smallest average number of installations. 
<br>
* Cluster 1 ***green*** 
    * (TRIVIA, BOARD, ADVENTURE, STRATEGY, SIMULATION, PUZZLE) 
    * Medium-high rating and small-medium average number of installations.
<br>
* Cluster 2 ***blue*** 
    * (ROLE PLAYING, EDUCATIONAL, MUSIC, SPORTS)  
    * Lowest rating and small-medium average number of installations.
<br>
* Cluster 3 ***pink*** 
    * (RACING, CASUAL, ARCADE, ACTION)
    * Medium rating and high average number of installations.

## **6.Conclusion & Last word**
### Conclusion
After analyzing this dataset of **top-100 Play Store games by category**, which contains 17 unique game categories, I found that<br>
* 1.Words like Free, Puzzle, World, Word and Online are among the most common in game titles in this dataset.
* 2.Most games are free to play.
* 3.There is no big difference in average rating between paid games and free games.
* 4.The most popular categories by average number of installations are Arcade, Casual and Action.
* 5.Using KMeans clustering, we can assign the 17 categories to 4 groups.
<br>

### Last word
Thanks to **Dhruvil Dave** for providing this dataset. It was fun.