- This study performs an Exploratory Data Analysis (EDA) on the Top Games dataset.
- It is designed to be beginner-friendly.
- The dataset covers popular games from the Google Play Store.

As with every EDA study, let's start by importing the required libraries.

Because of its interactive and dynamic structure, plotly is preferred for data visualization.

In [None]:
import pandas as pd
import numpy as np
import plotly
import plotly.express as px
import plotly.graph_objects as go
import plotly.offline as py
from plotly.offline import iplot
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

# Knowing the Dataset

As in any field of study or profession, it is important to know your dataset before diving into analysis. First and foremost, we need to understand what we want to do with a given dataset and what can be done with it.

Let's read our CSV dataset and look at its basics.

In [None]:
df = pd.read_csv("../input/top-play-store-games/android-games.csv")

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.isnull().sum()

The four commands above show that our dataset has 1730 games and 15 variables. It appears to be quite clean in terms of missing values, which makes it something of a black swan among real-world datasets. We have eleven numeric columns, which means we can run a range of calculations on them.

In [None]:
df.price.value_counts()

- Our dataset has games from different categories, with different ratings and different numbers of installs.
- The "installs" column looks numeric, but the info() output shows that its type is object. Therefore, we will need to convert it before using it.
- There are no missing values, which means less work during the data cleaning stage.
- The 'category' column is a categorical variable, so it would be good to see whether there are any notable differences among the game categories.
- 'paid' and 'price' seem to have a lot in common: the output of the cell above shows that almost all games are free of charge, so both columns tell essentially the same story. Therefore, for the sake of simplicity, one of them might be dropped.

In [None]:
df.describe()

Let's focus on the "installs" column and make some changes to it.

In [None]:
df.installs.value_counts()

Let's convert the values given in thousands into millions.

In [None]:
def in_million(inst):
    if inst == "500.0 k":
        return "0.5 M"
    elif inst == "100.0 k":
        return "0.1 M"
    else:
        return inst

Now, let's apply this function to the whole "installs" column.

In [None]:
df.installs = df.installs.apply(in_million)

In [None]:
df.installs.value_counts()

Then, let's get rid of the "M"s and change the type to float.

In [None]:
df.installs = df["installs"].str.replace("M", "").str.strip().astype("float")

In [None]:
df.installs.value_counts()
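The function above handles the two "k" values explicitly. A more general conversion could be sketched as follows. It assumes that every value has the form "<number> <unit>" with a unit of "k" or "M", which is what the values shown above look like; the helper name `installs_to_millions` and the sample values are made up for illustration.

```python
import pandas as pd

def installs_to_millions(value: str) -> float:
    """Convert strings such as '500.0 k' or '10.0 M' to a float in millions."""
    number, unit = value.split()
    # Assumption: only thousands ("k") and millions ("M") occur
    divisor = {"k": 1000.0, "M": 1.0}[unit]
    return float(number) / divisor

# Hypothetical sample values in the same format as the "installs" column
sample = pd.Series(["500.0 k", "100.0 k", "1.0 M", "1000.0 M"])
converted = sample.apply(installs_to_millions)
print(converted.tolist())
```

An unexpected unit raises a KeyError instead of passing through silently, which makes format surprises easier to spot.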

Let's move on to the "price" and "paid" columns.
- The outputs of the following cells show that almost all games are free of charge, so "price" and "paid" tell essentially the same story. Therefore, for the sake of simplicity, let's drop the "price" column.

Note 1: As a rule of thumb, a sample of fewer than 30 observations is usually too small to represent the population.

Note 2: Dropping columns and deleting rows are decisions to be taken very cautiously and should be based on analysis and domain knowledge.

In [None]:
df.price.value_counts()

In [None]:
df.paid.value_counts()
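Before dropping a column, it can help to confirm directly that the two columns agree row by row. The sketch below uses a small made-up DataFrame and assumes that "price" is numeric and "paid" is boolean; the real columns may need converting first.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "price": [0.0, 0.0, 1.99, 0.0, 7.49],
    "paid": [False, False, True, False, True],
})

# A game "has a price" when its price is above zero
has_price = toy["price"] > 0

# True only if the two columns agree on every row
agreement = (has_price == toy["paid"]).all()

# Cross-tabulation: off-diagonal cells would reveal disagreements
table = pd.crosstab(has_price, toy["paid"])
print(agreement)
print(table)
```

If every count sits on the diagonal of the table, "paid" carries no information beyond "price > 0".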

In [None]:
df.drop("price", axis=1, inplace=True)

# df = df.drop("price", axis=1)

In [None]:
df.shape

In [None]:
df.info()

# Analyzing the Dataset

###### Let's see game categories first.
- With the "normalize=True" parameter, value_counts() returns the relative frequencies of the game categories. The same output can be reached with "df.category.value_counts() / df.shape[0]" (the second cell below also multiplies by 100 to show percentages).

In [None]:
df.category.value_counts(normalize=True) 

In [None]:
df.category.value_counts() / df.shape[0] * 100

All categories contain roughly the same number of games.

In [None]:
fig = px.histogram(df, x="category", title='Game Categories')
fig.update_layout(xaxis=go.layout.XAxis(tickangle=90))
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

###### Let's see total ratings

In [None]:
df.columns

In [None]:
df["total ratings"].describe()

- The output above suggests that the distribution is right-skewed, because the mean is higher than the median. We might also expect some outliers at the upper end of the distribution.

In [None]:
fig = px.histogram(data_frame=df, x="total ratings", title="Game Total Ratings")
fig.show()

- The histogram shows that the distribution is positively skewed. However, it cannot tell us anything definite about outliers. Let's check for them with a box plot.

In [None]:
fig = px.box(data_frame=df, x="total ratings", hover_data=["title", "category"])
fig.show()

- Most games have between 0 and 500,000 total ratings.
- The distribution is highly right-skewed.
- There are quite a number of outliers at the upper end, which pull the mean up and away from the median.
- In situations like this, a median-based approach is a good idea, since the median is more robust to outliers than the mean.
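A box plot typically marks points beyond 1.5 times the interquartile range (IQR) from the quartiles as outliers. The same rule can be sketched by hand as below. The numbers are invented for illustration, and the exact quartile method used by plotly may differ slightly from pandas' default interpolation.

```python
import pandas as pd

# Invented values with two very large observations
ratings = pd.Series([10, 12, 15, 18, 20, 22, 25, 30, 400, 900])

q1, q3 = ratings.quantile([0.25, 0.75])
iqr = q3 - q1

# Upper fence of the 1.5 * IQR rule
upper_fence = q3 + 1.5 * iqr
outliers = ratings[ratings > upper_fence]

print("mean:", ratings.mean(), "median:", ratings.median())
print("upper fence:", upper_fence)
print("outliers:", outliers.tolist())
```

In this toy series the two large values pull the mean far above the median, which is the same pattern the bullets above describe.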

###### Let's move on to the number of installs.

In [None]:
df.installs.describe()

- The output above suggests that the distribution is right-skewed, because the mean is higher than the median. We might also expect some outliers at the upper end of the distribution.

In [None]:
fig = px.histogram(data_frame=df, x="installs", title="Number of Game Installs")
fig.show()

- The histogram shows that the distribution is positively skewed. However, it cannot tell us anything definite about outliers. Let's check for them with a box plot.

In [None]:
fig = px.box(data_frame=df, x="installs", title="Number of Game Installs", hover_data=["title", "category"])
fig.show()

- We have a right-skewed distribution with possible outliers at the upper end.
- The box plot shows Candy Crush Saga with 1 billion installs and Clash of Clans with 500 million installs.
- Outliers of this size clearly affect the mean and the shape of the distribution.
- The gap between the mean and the median is large (mean = 29.1M, median = 10M).
- As with the previous distribution, a median-based approach is a good idea for the same reason.
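One way to put the median-based approach into practice is to compute the mean and the median side by side for each group. The sketch below uses made-up categories and install counts (in millions); it is an illustration of the idea, not a result from the dataset.

```python
import pandas as pd

# Made-up data: category "A" contains one extreme value
toy = pd.DataFrame({
    "category": ["A", "A", "A", "B", "B", "B"],
    "installs": [1.0, 5.0, 1000.0, 10.0, 10.0, 50.0],
})

# Mean and median per category in one table
summary = toy.groupby("category")["installs"].agg(["mean", "median"])
print(summary)
```

Here category "A" wins by mean because of a single extreme game, while category "B" has the higher median, so the two measures can rank groups differently.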

###### Now, let's check our dataset in terms of free and paid games

In [None]:
df.paid.value_counts(normalize=True)*100

In [None]:
df_paid_notpaid = df['paid'].value_counts()
# value_counts() sorts by frequency, so free games (the majority) come first
label = ['Free', 'Paid']
fig = px.pie(df_paid_notpaid, values=df_paid_notpaid.values, names=label,
             title='Paid & Free Games')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()

- About 99.6% of the games are free of charge (all but 7 of the 1730 games).
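Hard-coding the pie labels relies on value_counts() returning the free games first. A sketch that derives the labels from the values themselves is shown below. It assumes that "paid" is a boolean column where True means the game costs money, and it uses a made-up Series.

```python
import pandas as pd

# Made-up boolean column: True means the game is paid (assumption)
paid = pd.Series([False, False, True, False])

counts = paid.value_counts()

# Map each index value to a readable label, whatever the sort order is
labels = counts.index.map({False: "Free", True: "Paid"})
print(list(zip(labels, counts.values)))
```

The resulting `labels` and `counts.values` could then be passed to `px.pie` as `names` and `values`.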

###### Let's see what we have in terms of total ratings by category.

In [None]:
tot_rat_by_cat = df.groupby("category")["total ratings"].mean()
tot_rat_by_cat

In [None]:
fig = px.bar(data_frame=tot_rat_by_cat, x= tot_rat_by_cat.index, y=tot_rat_by_cat.values, labels={'y':'Total Ratings'})
fig.update_layout(xaxis=go.layout.XAxis(tickangle=90))
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- The bar chart above suggests that games in the action, casual, strategy, arcade and sports categories get considerably more ratings on average than games in the educational and music categories.

###### This time, let's see what we have in terms of number of installs by category

In [None]:
inst_by_cat = df.groupby("category")["installs"].mean()
inst_by_cat

In [None]:
fig = px.bar(data_frame=inst_by_cat, x= inst_by_cat.index, y=inst_by_cat.values, labels={'y':'Install'})
fig.update_layout(xaxis=go.layout.XAxis(tickangle=90))
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- The bar chart above suggests that games in the action, arcade and casual categories are installed considerably more on average than games in the trivia, casino and word categories.

###### Now, let's see the growth rates by category

In [None]:
growth_first30days_by_cat = df.groupby("category")["growth (30 days)"].mean()
growth_first30days_by_cat

In [None]:
fig = px.bar(data_frame=growth_first30days_by_cat, x=growth_first30days_by_cat.index, y=growth_first30days_by_cat.values, labels={'y':'Growth in 30 Days'})
fig.update_layout(xaxis=go.layout.XAxis(tickangle=90))
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- Even though games in the action category get more ratings and installs than games in the other categories, games in the casino category show more growth over 30 days.

###### And what about growth in 60 days? Let's explore it.

In [None]:
growth_first60days_by_cat = df.groupby("category")["growth (60 days)"].mean()
growth_first60days_by_cat

In [None]:
fig = px.bar(data_frame=growth_first60days_by_cat, x=growth_first60days_by_cat.index, y=growth_first60days_by_cat.values, labels={'y':'Growth in 60 Days'})
fig.update_layout(xaxis=go.layout.XAxis(tickangle=90))
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- The 60-day growth paints a totally different picture. At this stage it is hard to point to a definite reason for this phenomenon.

###### Let's see the top 3 ranked games in each category.

In [None]:
top_3_ranked = df[df["rank"] < 4][["rank", "category","total ratings", "title", "installs"]]
top_3_ranked

In [None]:
fig = px.scatter(data_frame=top_3_ranked, x='total ratings', y='title',
                 hover_data=['category', 'rank'], color='category',
                 title="Top 3 Games by Their Total Ratings")
fig.show()

- The scatter chart also suggests that games in the action, casual, strategy, arcade and sports categories get considerably more ratings than games in the educational and music categories.

In [None]:
fig = px.scatter(data_frame=top_3_ranked, x='installs', y='title',
                 hover_data=['category', 'rank'], color='category',
                 title="Top 3 Games by Their Installs")
fig.show()

- The same story holds for the number of installs.

###### And let's finalize our analysis with the top 10 games in terms of number of installs.

In [None]:
top_10 = df.sort_values("installs", ascending=False)[:10]
top_10

In [None]:
fig = px.bar(data_frame=top_10, x= 'title', y='installs', color="category")
fig.update_layout(xaxis={"categoryorder":"total descending"})
fig.show()

- The top 2 games have 1 billion installs, and the following 8 games have 500 million installs each.
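Because several games share the same install count, a fixed cut-off can include or exclude tied games arbitrarily. If that matters, one option is pandas' `nlargest` with `keep="all"`, which keeps every row tied with the last selected value. The sketch below uses made-up titles and install counts.

```python
import pandas as pd

# Made-up data with ties at 500 (million installs)
toy = pd.DataFrame({
    "title": ["G1", "G2", "G3", "G4", "G5"],
    "installs": [1000.0, 1000.0, 500.0, 500.0, 500.0],
})

# Fixed cut-off: exactly 3 rows, one of the tied games is chosen arbitrarily
top_cut = toy.sort_values("installs", ascending=False).head(3)

# keep="all": every game tied with the third-largest value is kept
top_with_ties = toy.nlargest(3, "installs", keep="all")

print(len(top_cut), len(top_with_ties))
```

With `keep="all"` the result can contain more rows than requested, so any chart title mentioning a "top 10" may need adjusting.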

In [None]:
fig = px.bar(data_frame=top_10, x= 'title', y='total ratings', color="category")
fig.update_layout(xaxis={"categoryorder":"total descending"})
fig.show()

- The chart above shows that even though Candy Crush Saga and Subway Surfers have 1 billion installs, that does not automatically mean they get the highest total number of ratings.

That is the end of our EDA. I hope you enjoy it and learn as much from it as I did while working on it.