- In this notebook, I am going to perform Exploratory Data Analysis (EDA) on the Top Games on Google Playstore dataset.
- The dataset contains the top 100 games in each game category on the Google Play Store, along with their ratings and other data such as price and number of installs. Data as of Jun 9, 2021.

- Let's start by importing the required libraries.

In [None]:
import pandas as pd
import numpy as np


import plotly
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff

- Now, let's read and check the dataset.

In [None]:
df = pd.read_csv("../input/top-play-store-games/android-games.csv")

In [None]:
df.head()

In [None]:
df.shape

- We have 15 columns (variables) and 1730 rows (games). What about missing values?

In [None]:
df.isnull().sum()

- Not a single missing value! We will be deprived of the joy of handling missing values!
- Let's check the general info of the dataset.

In [None]:
df.info()

- All looks OK except the `installs` column. Although I would have expected it to be numerical (int or float), it is of object type. I will look into that later.

In [None]:
df.describe()

Before going further, let's summarize what we have got from the dataset.

- Our dataset has games from different categories, with different ratings and different numbers of installs.
- The `installs` variable holds useful numerical information. It would be a good idea to convert it so that it can be used as a numerical variable.
- There are no missing values, which simplifies the data preparation stage.
- `category` is a categorical variable; it would be good to see whether there are any notable differences among the game categories.
- Numerical variables deserve special attention in further analysis.
- `paid` and `price` seem to have a lot in common. They need a closer look, and if necessary one of them can be dropped for simplicity.

- Let's look into the `installs` column.

In [None]:
df["installs"]

- Let's convert this column to a numerical type by defining a function.

In [None]:
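def numbers(value):
    # Split a string shaped like "<whole>.<fraction> <unit>" into the whole part and the rest.
    parts = value.split(".")
    unit = parts[1].split(" ")[1]
    if unit == "M":
        return int(parts[0])
    # Any other unit is treated as thousands, so convert it to millions.
    return int(parts[0]) / 1000

df["installs"] = df["installs"].apply(numbers)
df = df.rename(columns={'installs': 'installs_in_million'})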

In [None]:
df["installs_in_million"].head()
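The same conversion can be sketched a bit more defensively with a regular expression, so that an unexpected value raises an error instead of being silently treated as thousands. This is only an illustration: the helper name `installs_to_millions`, the unit table, and the sample strings are my own assumptions, not something taken from the dataset.

```python
import re

import pandas as pd

# Multipliers that convert each unit suffix to millions (assumed suffixes).
UNIT_TO_MILLIONS = {"k": 1e-3, "M": 1.0}


def installs_to_millions(text):
    """Parse strings such as '500.0 M' or '100.0 k' into millions of installs."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([kM])\s*", text)
    if match is None:
        raise ValueError(f"Unrecognized installs value: {text!r}")
    number, unit = match.groups()
    return float(number) * UNIT_TO_MILLIONS[unit]


# Made-up sample values, used only to exercise the helper.
sample = pd.Series(["500.0 M", "10.0 M", "100.0 k"])
converted = sample.apply(installs_to_millions)
print(converted.tolist())
```

Unlike the function above, this version keeps the fractional part of the number and fails loudly on strings it does not recognize.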

- Great! Now `installs_in_million` is a float type column.
- Let's check the `price` and `paid` columns.

In [None]:
df.price.value_counts(normalize=True)

In [None]:
df.paid.value_counts(normalize=True)

- More than 99% of the games in the dataset are free.
- The sample of paid games is too small to draw reliable conclusions about price ranges.
- Let's drop the `price` column.
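Before dropping a column, one way to confirm that it really is redundant is to rebuild `paid` from `price` and compare. The sketch below uses made-up rows and assumes that `price` is numeric and `paid` is boolean, which may not match the real column types.

```python
import pandas as pd

# Made-up rows for illustration; the real values come from the CSV.
toy = pd.DataFrame({
    "price": [0.0, 0.0, 1.99, 0.0, 7.49],
    "paid": [False, False, True, False, True],
})

# A game counts as paid exactly when its price is above zero.
derived_paid = toy["price"] > 0
is_redundant = bool((derived_paid == toy["paid"]).all())
print(is_redundant)

# Cross-tabulate the two columns to see any disagreement at a glance.
print(pd.crosstab(toy["paid"], derived_paid))
```

If the two columns ever disagreed, the off-diagonal cells of the cross-tabulation would be non-zero.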

In [None]:
df.drop("price", axis=1, inplace=True)

In [None]:
df.info()

Now, let's work on the *star ratings* columns.

- Since the numbers are really big, it is hard to compare them. I will create new columns by normalizing the existing ones.

In [None]:
star_cols = ["5 star ratings", "4 star ratings", "3 star ratings", "2 star ratings", "1 star ratings"]
# Total number of star ratings per game, computed once and reused for every column.
star_total = df[star_cols].sum(axis=1)
for col in star_cols:
    df[f"{col} %"] = (df[col] / star_total * 100).round(2)

In [None]:
df.head(1)
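A quick sanity check for this kind of normalization is that the five percentage columns add up to roughly 100 for every game. Here is a minimal sketch on made-up counts; the two rows are hypothetical games, not rows from the dataset.

```python
import pandas as pd

# Made-up rating counts for two hypothetical games.
toy = pd.DataFrame({
    "5 star ratings": [700, 50],
    "4 star ratings": [150, 20],
    "3 star ratings": [50, 10],
    "2 star ratings": [40, 10],
    "1 star ratings": [60, 10],
})

star_cols = list(toy.columns)
row_totals = toy[star_cols].sum(axis=1)

# Divide each row by its own total and express the result in percent.
shares = toy[star_cols].div(row_totals, axis=0).mul(100).round(2)
shares.columns = [f"{col} %" for col in star_cols]

# Each row should add up to roughly 100; rounding can leave a tiny residue.
print(shares.sum(axis=1))
```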

Great. Now it is clearer. The feature engineering part is done. Let's move on to the analysis part.

- First, let's look into the `category` column to see whether the categories have similar sample sizes.

## Category

In [None]:
df["category"].value_counts()

- The categories have almost the same number of samples.
- Let's visualize the `category` column.

In [None]:
fig = px.histogram(df, x="category", title="Game Categories", labels={"category": "Categories"})
fig.update_layout(xaxis={"categoryorder":"total descending"})
fig.show()

## Total Ratings

- It is time to check the `total ratings` column.

In [None]:
df["total ratings"].describe()

In [None]:
fig = px.histogram(df, x="total ratings", title="Total Ratings", labels={"total ratings": "Total Ratings"})
fig.update_layout()
fig.show()

In [None]:
fig = px.box(df, x="total ratings", title="Total Ratings", labels={"total ratings": "Total Ratings"},
             hover_data=["title", "category"])
fig.update_traces(quartilemethod="inclusive")
fig.show()

- Most games have between 0 and 500,000 total ratings.
- The mean is greater than the median.
- The distribution is highly right-skewed because of outliers at the high end.
- Because of the outliers, a median-based approach would be a good idea.
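To see why the median is the safer summary here, consider a small made-up series with one very large value. The numbers below are invented purely to mimic a long right tail.

```python
import pandas as pd

# Made-up rating counts with one very large value to mimic a long right tail.
ratings = pd.Series([1_000, 2_000, 3_000, 4_000, 5_000, 1_000_000])

summary = {
    "mean": ratings.mean(),
    "median": ratings.median(),
    "skew": ratings.skew(),
}
print(summary)
```

A single extreme game pulls the mean far above the median, while the median barely moves; a positive skew value signals the same right tail.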

## Installs in Million

- What about installs? Let's check the `installs_in_million` column.

In [None]:
df["installs_in_million"].describe()

In [None]:
df["installs_in_million"].value_counts().sort_index()

In [None]:
fig = px.histogram(df, x="installs_in_million", title="Installs in Millions", labels={"installs_in_million": "Installs in Millions"})
fig.update_layout()
fig.show()

In [None]:
fig = px.box(df, x="installs_in_million", title="Installs in Millions",
             labels={"installs_in_million": "Installs in Millions"},
             hover_data=["title", "category"])
fig.update_traces(quartilemethod="inclusive")
fig.show()

- We have a right-skewed distribution.
- Just like the `total ratings` column, the `installs_in_million` column also has outliers at the high end.
- By looking at the box plot alone, you might think that there are only two outliers, but that may be wrong. Even though `installs_in_million` looks numerical, it behaves like a categorical column because it takes only 9 distinct values. As a result, the outliers are stacked on top of each other in the box plot. The value counts of this column make this clearer. I suspect there are 14 outliers in this column.
- Most of the values lie between 1M and 100M.
- The size of the outliers definitely affects the mean and the shape of the distribution.
- The difference between the mean and the median is really large (mean = 29.1M, median = 10M).
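The number of stacked outliers can be counted directly with the usual 1.5 × IQR rule. The sketch below runs on made-up install counts, and it uses the default quantile interpolation in pandas, which may place the fence slightly differently from Plotly's `quartilemethod="inclusive"`.

```python
import pandas as pd

# Made-up install counts in millions; only a handful of distinct values occur.
installs = pd.Series([1] * 4 + [5] * 4 + [10] * 6 + [50] * 3 + [100] * 2 + [500] * 3 + [1000] * 2)

q1, q3 = installs.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Every point above the upper fence is an outlier, even if many share a value.
outliers = installs[installs > upper_fence]
print(upper_fence, len(outliers))
print(outliers.value_counts().sort_index())
```

In this toy series the box plot would show two outlier markers, while the count reveals that several games sit behind each marker.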

## Paid-Free Games

- Even though most of the games, more than 99%, are free, let's check the differences between paid and free games.

In [None]:
df.groupby("paid").mean(numeric_only=True)

In [None]:
df.paid.value_counts()

In [None]:
paid_free = df['paid'].value_counts()
# Labels follow the value_counts order: free games are the most frequent value.
label = ['Free', 'Paid']
fig = px.pie(values=paid_free.values, names=label,
             title='Paid & Free Games')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()

## Total Ratings by Category

In [None]:
total_ratings_by_category = df.groupby('category')['total ratings'].mean()
total_ratings_by_category

In [None]:
fig = px.bar(total_ratings_by_category, x=total_ratings_by_category.index, y=total_ratings_by_category.values, labels={'y': 'Mean Total Ratings'})
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- Games in the action, casual, strategy, arcade, and sports categories get considerably more ratings on average than games in the educational and music categories.

In [None]:
install_by_category = df.groupby('category')['installs_in_million'].mean()
install_by_category

In [None]:
fig = px.bar(install_by_category, x=install_by_category.index, y=install_by_category.values,
             labels={"y": "Mean Installs (Millions)"})
fig.update_layout(xaxis={"categoryorder": "total descending"})
fig.show()

- Games in the action, arcade, and casual categories are installed considerably more on average than games in the trivia, casino, and word categories.
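Since both columns have heavy outliers, it can help to put the per-category median next to the mean. Here is a minimal sketch with invented rows; the category names and numbers are placeholders only.

```python
import pandas as pd

# Made-up rows; category names and numbers are illustrative only.
toy = pd.DataFrame({
    "category": ["ACTION", "ACTION", "ACTION", "WORD", "WORD", "WORD"],
    "installs_in_million": [10, 50, 1000, 5, 10, 10],
})

# Named aggregation gives one column per statistic.
by_category = toy.groupby("category")["installs_in_million"].agg(
    mean="mean", median="median"
)
print(by_category.sort_values("median", ascending=False))
```

When the mean of a category is far above its median, a few very popular games are probably driving the category average.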

## Growth by Category

In [None]:
growth = df.groupby("category")[["growth (30 days)", "growth (60 days)"]].mean()

In [None]:
fig = px.bar(growth, y="growth (30 days)", labels={"category": "Category", "growth (30 days)": "Mean Growth (30 days)"})
fig.update_layout(xaxis={"categoryorder": "total descending"})
fig.show()

In [None]:
fig = px.bar(growth, y="growth (60 days)", labels={"category": "Category", "growth (60 days)": "Mean Growth (60 days)"})
fig.update_layout(xaxis={"categoryorder": "total descending"})
fig.show()

In [None]:
fig = px.line(growth, y=["growth (30 days)", "growth (60 days)"],
             labels={"category": "Category", "value": "Mean Growth"})
fig.show()

- Even though games in the action category get more ratings and installs than games in the other categories, games in the casino category show more growth over 30 days.
- For games in the casino, adventure, and role playing categories, growth over 60 days is considerably lower than growth over 30 days.
- With the given dataset we can only speculate; we cannot draw analytical conclusions from it. We would need more variables to explain the significant differences between 30-day and 60-day growth in some categories.
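One way to probe whether a category's growth figure is driven by a few games is to compare the mean with the median and to look at the gap between the two windows. The sketch below uses invented growth values; nothing in it comes from the real data.

```python
import pandas as pd

# Made-up growth figures; one hypothetical casino game has a very large 30-day value.
toy = pd.DataFrame({
    "category": ["CASINO", "CASINO", "PUZZLE", "PUZZLE"],
    "growth (30 days)": [12.0, 300.0, 1.5, 2.5],
    "growth (60 days)": [3.0, 5.0, 2.0, 4.0],
})

growth_cols = ["growth (30 days)", "growth (60 days)"]
stats = toy.groupby("category")[growth_cols].agg(["mean", "median"])

# Gap between the two windows, computed on the category means.
gap = stats[("growth (30 days)", "mean")] - stats[("growth (60 days)", "mean")]
print(stats)
print(gap.sort_values(ascending=False))
```

A large gap that shrinks when the median is used instead of the mean would point to a handful of extreme games rather than a category-wide trend.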

## Star Ratings by Category

In [None]:
stars = df.groupby("category")[["1 star ratings %", "5 star ratings %"]].mean()

In [None]:
fig = px.bar(stars, y="1 star ratings %")
fig.update_layout(xaxis={"categoryorder": "total descending"})
fig.show()

In [None]:
fig = px.bar(stars, y="5 star ratings %")
fig.update_layout(xaxis={"categoryorder": "total descending"})
fig.show()

In [None]:
df.groupby("category")[["1 star ratings %","5 star ratings %"]].mean().sort_values(by="1 star ratings %")

In [None]:
df.groupby("category")[["1 star ratings %","5 star ratings %"]].mean().sort_values(by="5 star ratings %")

- Casino games have the highest share of 5 star ratings at 75.4%. They also have the fourth-lowest share of 1 star ratings.
- Music games have the lowest share of 5 star ratings and the highest share of 1 star ratings. It doesn't look good for music games.
- The most installed categories, Action, Arcade, and Casual, have almost the same share of 5 star ratings. Among them, Casual has the lowest share of 1 star ratings at just 8.46%, Arcade is second lowest at 9.79%, and Action has the highest at 11.25%.
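Note that averaging per-game percentages gives every game the same weight, no matter how many ratings it has. A pooled share, where the counts are summed first, can look quite different. The sketch below shows the gap on two made-up games, and it assumes for simplicity that `total ratings` equals the sum of the five star columns.

```python
import pandas as pd

# Two hypothetical games in one category: a tiny one and a huge one.
toy = pd.DataFrame({
    "category": ["MUSIC", "MUSIC"],
    "5 star ratings": [90, 50_000],
    "total ratings": [100, 100_000],
})
toy["5 star ratings %"] = toy["5 star ratings"] / toy["total ratings"] * 100

# Unweighted: every game contributes equally, as in the groupby mean above.
unweighted = toy.groupby("category")["5 star ratings %"].mean()

# Pooled: sum the counts first, so games with more ratings weigh more.
sums = toy.groupby("category")[["5 star ratings", "total ratings"]].sum()
pooled = sums["5 star ratings"] / sums["total ratings"] * 100
print(unweighted["MUSIC"], round(pooled["MUSIC"], 2))
```

Neither number is wrong; they answer different questions (the typical game versus the typical rating), so it is worth stating which one a chart shows.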

## Top 3 Games by Category

In [None]:
top_3 = df[df["rank"]<4][['rank','title','category', 'total ratings', 'installs_in_million', '5 star ratings', "5 star ratings %"]]
top_3
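The `rank < 4` filter relies on ranks being 1, 2, 3 within every category. A hedged alternative is to sort by rank and take the first three rows per group, which would still return three games per category if the ranks had gaps. The toy frame below is made up for illustration.

```python
import pandas as pd

# Made-up titles and ranks for two hypothetical categories.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F", "G", "H"],
    "category": ["ACTION"] * 4 + ["WORD"] * 4,
    "rank": [1, 2, 3, 4, 1, 2, 3, 4],
})

# Filter on rank, mirroring the cell above.
by_filter = toy[toy["rank"] <= 3]

# Alternative: sort by rank and keep the first three rows of each category.
by_group = toy.sort_values("rank").groupby("category").head(3)

print(sorted(by_filter["title"]) == sorted(by_group["title"]))
```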

In [None]:
fig = px.scatter(top_3, x="5 star ratings %", 
                 hover_data=['category', 'rank'], color='category',
                 title = "Top 3 Games by Their % 5 Star Ratings")
fig.show()

- As mentioned above, games in the action, casual, strategy, arcade, and sports categories get considerably more ratings than games in the educational and music categories.
- The same holds even for the top-ranked games in these categories.

In [None]:
fig = px.scatter(top_3, x="5 star ratings", 
                 hover_data=['category', 'rank'], color='category',
                 title = "Top 3 Games by Their 5 Star Ratings")
fig.show()

In [None]:
fig = px.scatter(top_3, x="installs_in_million", 
                 hover_data=['category', 'rank'], color='category',
                 title = "Top 3 Games by Their Installs in Million")
fig.show()

- As mentioned above, games in the action, arcade, and casual categories are installed considerably more than games in the trivia, casino, and word categories.
- The same holds even for the top-ranked games in these categories.

It was a pleasure for me to work with this dataset. I would like to thank the dataset contributor for this data. I hope you enjoyed it too. If you liked my EDA on this dataset, feel free to check my other notebooks as well. Looking forward to your feedback. Thanks a lot.

Have a great day.