---

#  Top Games on Google Playstore - EDA

---

 - In this study, we perform an Exploratory Data Analysis (EDA) of the Top Games on Google Playstore dataset, with terse but clear explanations.

---

- We start by importing the libraries used in this study, and then explore the dataset.

- We are going to use both Seaborn and Plotly to have a variety of visualization options.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import plotly 
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
from plotly.offline import iplot
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

import warnings
warnings.filterwarnings("ignore")

## Overview Stage 

- Read the csv
- Use the necessary functions to get basic information about the dataset

In [None]:
df = pd.read_csv('../input/top-play-store-games/android-games.csv')

In [None]:
df.head()

- In short, the dataset describes the top games on Google Playstore, including the title, average rating, number of installs, rating counts and price of each game.

In [None]:
df.shape

- The dataset has 1730 rows and 15 columns.

- Whether a dataset contains null values, and how many, has a crucial effect on the analysis.
- To be aware of any missing values, I would like to check the null values in the dataset.

In [None]:
df.isnull().sum()

- Having no null values in the dataset makes me very happy, but it is a very rare situation in the real world.
- Since it is kind of a dream dataset, let's enjoy it together :)
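
- When a dataset does contain gaps, the raw counts from `isnull().sum()` are easier to judge as a share of each column. Below is a minimal sketch of that idea. The `toy` frame and the `missing_summary` helper are made up for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = frame.isnull().sum()
    percent = (counts / len(frame) * 100).round(1)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Show the columns with the most gaps first
    return summary.sort_values("missing", ascending=False)

# Hypothetical toy data, only to demonstrate the helper
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "price": [0.0, np.nan, 1.99, np.nan],
    "rank": [1, 2, 3, 4],
})
print(missing_summary(toy))
```

- On the real data the same call would be `missing_summary(df)`; with no null values every row of the summary would simply show zero.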

In [None]:
df.info()

- According to the output of the info function, most columns are integers or floats, so I can say that we have an easy-to-analyze dataset.
- Another point that immediately catches my attention is that the 'installs' column has the object dtype, even though it holds the number of installs.
- To avoid potential problems, we had better change its type to integer or float.

In [None]:
df.describe()

To summarize what we have got so far:
- We have a dataset with 1730 rows and 15 columns, containing detailed information about the top games on Google Playstore.
- Since we don't have any null values and most columns are numeric, we are not going to need too many adjustments.
- Even though the dataset looks quite all right, converting the installs column to a numeric type will make the analysis easier.
- Another point to take care of is that the price and paid columns have a lot in common. Working with one of them is most likely going to be enough, which means we should drop the other.
- For later steps, we should keep in mind the uneven distribution of the price column and the possible outliers in the rank column.
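
- Before dropping one of two overlapping columns, it helps to measure how much they actually overlap. The sketch below does this on made-up data. It assumes that `price` is numeric and `paid` is boolean, which should be confirmed against the `df.info()` output.

```python
import pandas as pd

# Hypothetical toy data; the real columns are assumed to be a numeric
# `price` and a boolean `paid`
toy = pd.DataFrame({
    "price": [0.0, 0.0, 1.99, 0.0, 4.99],
    "paid": [False, False, True, False, True],
})

# If `paid` only means "price is above zero", both columns agree on every row
derived_paid = toy["price"] > 0
agreement = (derived_paid == toy["paid"]).mean()
print(f"Share of rows where paid == (price > 0): {agreement:.0%}")

# A cross-tabulation shows where any disagreements would sit
print(pd.crosstab(derived_paid, toy["paid"],
                  rownames=["price > 0"], colnames=["paid"]))
```

- If the agreement is 100%, one column can be derived from the other and keeping both adds nothing.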

- Let's start by making the necessary adjustments.

In [None]:
df['installs'].value_counts()

In [None]:
def in_thousand(install):
    if install == '500.0 k':
        return '0.5 M'
    elif install == '100.0 k':
        return '0.1 M'
    else:
        return install

In [None]:
df['installs']= df['installs'].apply(in_thousand)

df['installs']= df['installs'].str.replace( 'M', '').str.strip().astype('float')

df= df.rename(columns={'installs': 'installs_in_million'})
df['installs_in_million'].value_counts()
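
- The conversion above handles the two 'k' values by name. As an alternative, here is a small sketch of a parser that converts any value by its suffix. The `installs_to_millions` helper and the sample values are illustrative, and it assumes every entry looks like `'<number> k'` or `'<number> M'`.

```python
import pandas as pd

# Multipliers that convert each suffix to millions (assumed suffixes: k and M)
_TO_MILLIONS = {"k": 1e-3, "M": 1.0}

def installs_to_millions(text: str) -> float:
    """Convert strings such as '500.0 k' or '10.0 M' to a float in millions."""
    number, suffix = text.split()
    return float(number) * _TO_MILLIONS[suffix]

# Hypothetical sample values in the same format as the `installs` column
sample = pd.Series(["500.0 k", "100.0 k", "1.0 M", "10.0 M", "1000.0 M"])
print(sample.apply(installs_to_millions).tolist())
```

- An unknown suffix raises a `KeyError`, so unexpected formats are noticed instead of being converted silently.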

- As a second step, let's look at the price and paid columns and decide which one is more useful to keep.

In [None]:
df['price'].value_counts()

In [None]:
df['paid'].value_counts()

- When we look at the prices, more than 99% of the games are free, and there are too few samples at each price point to compare games by price.
- For this reason, the price column adds little to the analysis and we can drop it.
- So, as the next step, we drop the price column.
- Dropping a column or a row is a decision that should always be made carefully.

In [None]:
df.drop('price', axis=1, inplace=True)

In [None]:
df.info()

- Let's move on to the **analysis part**.

## Analysis Part

- As a first step, I will look at the game categories.

In [None]:
df['category'].value_counts(normalize=True)

- Even though many of them are the same size, the Game Card and Game Word categories are slightly larger than the others.

- For visualization I will be using both Seaborn and Plotly.

In [None]:
#with Seaborn

plt.figure(figsize=(10,4))
sns.countplot(x = "category", data = df)
plt.xticks(rotation = 45);

In [None]:
#with Plotly

fig = px.histogram(df, x="category", title='Game Categories')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- As we said before, the categories are almost all the same size; only two of them are slightly larger than the others.

## Total Ratings

In [None]:
df['total ratings'].describe()

- The first thing that draws my attention is the difference between the mean and Q2 (the median).
- If we look at the max, we can easily say that possible outliers cause that difference.
- Even though a boxplot is the best way to see outliers, I will be using both a boxplot and a histogram.

In [None]:
#Histogram with Seaborn

plt.figure(figsize = (10,6))
sns.histplot(df['total ratings'], bins = 50);

In [None]:
#Histogram with Plotly


fig = px.histogram(df, x= 'total ratings', title='Total Ratings of the Games')

fig.show()

- The histogram gives some information about the values, but to see the outliers let's use a boxplot.

In [None]:
#Boxplot with Seaborn

plt.figure(figsize=(15,5))
sns.boxplot(data=df, x ="total ratings");

- Seaborn's static boxplot shows the overall shape, but we cannot read exact values for individual points from it. Plotly's interactive boxplot lets us hover over each point.

In [None]:
#Boxplot with Plotly

fig = px.box(df, x='total ratings', hover_data=['title', 'category'])
fig.update_traces(quartilemethod="inclusive")
fig.show()

- As we have seen in the histogram, most of the games fall in the range of 0 to 500,000 total ratings.
- On the other hand, we have quite a number of outliers, which increase the mean and push it further away from the median.
- The gap between the mean and the median, caused by outliers on the high side, is another sign that the distribution is highly right-skewed.
- Instead of using mean values, using the median is going to make much more sense for further analysis.
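
- The points above can also be checked with numbers instead of plots. The sketch below uses made-up rating counts and the usual 1.5 × IQR rule for flagging outliers. Note that pandas computes quartiles by linear interpolation, so the fence may differ slightly from the one Plotly draws.

```python
import pandas as pd

# Hypothetical rating counts with one very large value
ratings = pd.Series([100, 200, 300, 400, 500, 600, 700, 800, 900, 50_000])

mean, median = ratings.mean(), ratings.median()
q1, q3 = ratings.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)  # common boxplot rule for high outliers
outliers = ratings[ratings > upper_fence]

# A positive skew value indicates a right-skewed distribution
print(f"mean={mean:.0f}, median={median:.0f}, skew={ratings.skew():.2f}")
print(f"upper fence={upper_fence:.0f}, outliers={outliers.tolist()}")
```

- A single extreme value is enough to pull the mean far above the median, which is the same pattern we see in the `total ratings` column.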

## Number of Game Installs

In [None]:
df['installs_in_million'].describe()

In [None]:
plt.figure(figsize=(15,5))
sns.histplot(data=df, x ="installs_in_million");

In [None]:
fig = px.histogram(df, x= 'installs_in_million', title='Number of Game Install in Millions')

fig.show()

In [None]:
fig = px.box(df, x='installs_in_million', hover_data=['title', 'category'])
fig.update_traces(quartilemethod="inclusive")
fig.show()

- We have a right-skewed distribution with possible outliers.
- The box plot shows, for example, Candy Crush Saga with 1 billion installs and Clash of Clans with 500 million installs.
- The size of the outliers definitely affects the mean value and the distribution.
- As mentioned above, it would be a good idea to use a median-based approach.
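
- To see how much a median-based approach changes the picture, the mean and the median can be computed side by side per category. This is a minimal sketch on made-up numbers; the category names and install values are only illustrative.

```python
import pandas as pd

# Hypothetical toy data with one dominant game in the first category
toy = pd.DataFrame({
    "category": ["Game Action"] * 3 + ["Game Word"] * 3,
    "installs_in_million": [1000.0, 10.0, 5.0, 10.0, 10.0, 5.0],
})

# Compute both statistics side by side for each category
summary = toy.groupby("category")["installs_in_million"].agg(["mean", "median"])
print(summary)
```

- In this toy example both categories have the same median, while one very large game makes the mean of the first category look far bigger.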

## Paid & Free Games

In [None]:
df['paid'].value_counts(normalize=True)

In [None]:
# Counts from df['paid'].value_counts(): 7 paid games and 1723 free games
values = [7, 1723]
index = ["Paid", "Free"]

plt.figure(figsize=(7,5))
y = values
mylabels = index
myexplode = [0, 0]

plt.pie(y, labels = mylabels, labeldistance=1.1, explode = myexplode, startangle=0, autopct='%1.1f%%')

plt.show()

In [None]:
paid_free = df['paid'].value_counts()
# value_counts() sorts by count in descending order, so the majority class (Free) comes first
label = ['Free', 'Paid']
fig = px.pie(paid_free, values=paid_free.values, names=label,
             title='Paid & Free Games')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()

- All but 7 of the games in this dataset are free.
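
- The pie charts above rely on hand-written counts and label order. As a sketch of a safer alternative, the labels can be attached to the values before counting. The toy column below is made up, and `paid` is assumed to be boolean in the real data.

```python
import pandas as pd

# Hypothetical toy column; `paid` is assumed to be boolean in the real data
toy = pd.DataFrame({"paid": [False, False, True, False, False]})

# Map the booleans to readable labels before counting, so each count
# stays attached to its own label regardless of sort order
paid_counts = toy["paid"].map({True: "Paid", False: "Free"}).value_counts()
print(paid_counts)

# The index and values can then be passed to a pie chart, e.g.
# plt.pie(paid_counts.values, labels=paid_counts.index, autopct="%1.1f%%")
```

- This way the chart stays correct even if the data changes or the majority class flips.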

- Now it is time to dive deeper into our dataset.

## Total Ratings by Category

- To see the average number of total ratings for each category, we are going to use the groupby function in pandas.

In [None]:
total_ratings_by_category = df.groupby('category')['total ratings'].mean()
total_ratings_by_category

In [None]:
plt.figure(figsize=(10,6))

# x and y are passed as vectors, so `data` is not needed (recent Seaborn versions reject a length mismatch)
sns.barplot(x = total_ratings_by_category.index, y = total_ratings_by_category.values)
plt.xticks(rotation = 60);
plt.show()

In [None]:
fig = px.bar(total_ratings_by_category, x= total_ratings_by_category.index, y=total_ratings_by_category.values, labels={'y':'Total Ratings'})
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- On average, Game Action has far more total ratings than the other categories.

## Number of Installs by Category

In [None]:
install_by_category = df.groupby('category')['installs_in_million'].mean().sort_values(ascending =False)
install_by_category

In [None]:
plt.figure(figsize=(10,6))

# x and y are passed as vectors, so `data` is not needed (recent Seaborn versions reject a length mismatch)
sns.barplot(x = install_by_category.index, y = install_by_category.values)
plt.xticks(rotation = 60);
plt.show()

In [None]:
fig = px.bar(install_by_category, x= install_by_category.index, y=install_by_category.values, labels={'y':'Install in Millions'})
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- On average, games in the action, arcade and casual categories are installed significantly more than games in the trivia, casino and word categories.

In [None]:
growth_by_category_30 = df.groupby('category')['growth (30 days)'].mean().sort_values(ascending = True)
growth_by_category_30

In [None]:
fig = px.bar(growth_by_category_30, x= growth_by_category_30.index, y=growth_by_category_30, labels={'y':'Growth in 30 days'})
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- Even though games in the action category get more ratings and installs than games in the other categories, games in the casino category show more growth over 30 days.

## Top 5 Ranked Games By Category

In [None]:
top_ranked_games = df[df['rank']<6][['rank','title','category', 'total ratings', 'installs_in_million', '5 star ratings']]
top_ranked_games
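
- The filter above relies on the `rank` column restarting at 1 within every category. As a sketch under that same assumption, an equivalent selection can be written with groupby, which also makes it easy to check that each category contributes the same number of rows. The toy data is made up.

```python
import pandas as pd

# Hypothetical toy data with ranks that restart in each category
toy = pd.DataFrame({
    "category": ["A", "A", "A", "B", "B", "B"],
    "rank": [2, 1, 3, 1, 3, 2],
    "title": ["a2", "a1", "a3", "b1", "b3", "b2"],
})

# Sort by rank, then keep the first n rows of each category
n = 2
top_n = toy.sort_values("rank").groupby("category").head(n)
print(top_n.sort_values(["category", "rank"]))
```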

## Top 20 Games

In [None]:
top_20 = df.sort_values(by='installs_in_million', ascending=False).head(20)
top_20

In [None]:
fig = px.bar(top_20, x='title', y='installs_in_million', hover_data=['5 star ratings'], color='category')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- While the top 2 games have 1 billion installs each, the following 12 games have 500 million installs each.
- Since the number of installs for the top 2 games is very high, I would like to see the relation between installs and total ratings.

In [None]:
# The x-axis is numeric here, so no category ordering is needed
fig = px.scatter(top_20, x='installs_in_million', y='total ratings', hover_data=['5 star ratings'], color='category')
fig.show()

In [None]:
fig = px.bar(top_20, x='title', y='total ratings', hover_data=['5 star ratings'], color='category')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- As we see in both the scatter plot and the bar plot, having the most installs doesn't mean having the most ratings.
- To compare by the numbers: while Candy Crush Saga and Subway Surfers have 1 billion installs each, they have 31 and 35 million total ratings, respectively.
- On the other hand, Garena Free Fire - World Series has 500 million installs, which is half as many as either Candy Crush Saga or Subway Surfers, yet it has 86 million total ratings, roughly two and a half to three times as many as either of them.
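
- The comparison above can be turned into a single number per game: ratings received per install. The sketch below uses only the rounded figures quoted in the text; the column name `total_ratings_in_million` and the ratio column are introduced here for illustration.

```python
import pandas as pd

# Rounded figures quoted in the text (installs and total ratings, in millions)
games = pd.DataFrame({
    "title": ["Candy Crush Saga", "Subway Surfers",
              "Garena Free Fire - World Series"],
    "installs_in_million": [1000, 1000, 500],
    "total_ratings_in_million": [31, 35, 86],
})

# Ratings received per install; a rough measure of how often players rate
games["ratings_per_install"] = (
    games["total_ratings_in_million"] / games["installs_in_million"]
)
print(games.sort_values("ratings_per_install", ascending=False))
```

- With half the installs and more than double the ratings, Garena Free Fire - World Series ends up with a far higher ratio than the two games with 1 billion installs.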

---

 - That is all I wanted to mention about the dataset.
 - Thank you to the dataset contributor for sharing this data, which I had the pleasure of working on.
 - Since this is my first project on Kaggle, I'm so happy to share it with you.
 - Thank you for your time. I hope you all like it.
 


---

___All the best !___