# Exploratory Data Analysis (EDA) for Top Games on Google Play Store

## Introduction

The Google Play Store offers millions of applications across many categories, both free and paid, and users install large numbers of them every day.
To determine which games are most popular among users, we need to examine the ratings that users give them.
With this information, Google can see which of the games it offers are most popular, assess how a game's price affects its popularity, and make improvements.

In addition, developers can use the results to improve application performance and efficiency. The analysis is useful not only for developers but also for users. The content of Google Play Store apps will be characterized on a common scale. As a result of this analysis, it can also be determined whether placing advertisements in a game is likely to be profitable.

In [None]:
###########################################
# Importing the necessary libraries
###########################################


import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore') 

import plotly 
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
from plotly.offline import iplot
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

In [None]:
###########################################
# Load CSV data
###########################################

df = pd.read_csv('../input/top-play-store-games/android-games.csv')
df.head(201)

In [None]:
###########################################
# Detailed review for dataset
###########################################

df.info()

In [None]:
df.isnull().sum()

In [None]:
df.sample(10)

# As a result of this examination:

  ** it was seen that none of the columns in the dataset had null values. 
  
  ** In order to be sure of this, random samples were taken from our dataset and analyzed.
     It was observed that there were no abnormal values. 

  ** In addition, the types of columns in the dataset are examined.
     As a result, it was determined that there are 11 numeric columns and 4 categorical columns. 
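
The checks above can be sketched on a small stand-in frame. This is only an illustration: the `toy` DataFrame and its values are hypothetical, and the column names are assumed to resemble those of the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset (values are made up).
toy = pd.DataFrame({
    "title": ["Game A", "Game B", "Game C"],
    "installs": ["100.0 M", "500.0 k", "10.0 M"],
    "total ratings": [1200, 300, 4500],
    "average rating": [4.2, 3.9, 4.6],
    "paid": [False, True, False],
})

# Number of missing values per column.
null_counts = toy.isnull().sum()

# Split the columns into numeric and non-numeric ones.
# Boolean columns such as "paid" are not counted as numeric here.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(null_counts)
print("numeric:", numeric_cols)
print("non-numeric:", other_cols)
```

Note that "installs" lands in the non-numeric group because its values are strings, which is the issue raised in the next remark.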


At this point, it has been noticed that: 
  
  ** The "installs" column is stored as an object (string) column even though it holds numeric values. This column needs to be converted.
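
A minimal sketch of such a conversion, assuming the values look like "100.0 M" or "500.0 k" (a number, a space, then a unit suffix). The helper name `parse_installs` is hypothetical and is not part of the notebook.

```python
def parse_installs(text):
    """Convert a string such as '100.0 M' or '500.0 k' to an absolute count."""
    # Assumed suffixes: 'k' for thousands and 'M' for millions.
    multipliers = {"k": 1_000, "M": 1_000_000}
    parts = text.split()
    value = float(parts[0])
    suffix = parts[1] if len(parts) > 1 else ""
    if suffix and suffix not in multipliers:
        # Fail loudly rather than silently scaling by the wrong factor.
        raise ValueError(f"unexpected suffix in {text!r}")
    return value * multipliers.get(suffix, 1)


print(parse_installs("100.0 M"))
print(parse_installs("500.0 k"))
```

Raising on an unknown suffix is a design choice: a one-line conditional that treats everything without "M" as thousands would quietly mis-scale any value in another unit.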

In [None]:
###########################################
# Let's make 'installs' column a numerical variable by doing a small adjustment.
###########################################

In [None]:
# ALTERNATIVE CODE for doing a small adjustment in "installs" column


# def in_thousand(df):
#     if df.split(".")[1].split(" ")[1] == "k":
#         return int(df.split(".")[0])*1000
    
#     elif df.split(".")[1].split(" ")[1] == "M":
#         return int(df.split(".")[0])*1000000
    
#     else:
#         return df
    
# df['installs'] = df.installs.apply(in_thousand)
# df= df.rename(columns={'installs': 'installs_in_million'})

In [None]:
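# Convert strings such as "100.0 M" to absolute install counts (any value without "M" is treated as thousands).
# Note: despite the column name, the stored values are raw counts, not millions.
df["installs_in_million"] = df.installs.apply(lambda x: float(x.split(" ")[0])*1000000 if "M" in x else float(x.split(" ")[0])*1000 )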

In [None]:
df.drop('installs', axis = 1, inplace = True)

In [None]:
df.head(10)

In [None]:
df['installs_in_million'].value_counts()

In [None]:
df.info()


******************************************************************************************************************************
      ** Now, the "installs" column in our dataset has been brought to the desired format and is ready to be analyzed. **
******************************************************************************************************************************


In [None]:
fig = px.histogram(df, x="category", title='Game Categories', color_discrete_sequence=['indianred'])



fig.update_layout(xaxis={'categoryorder':'total descending'},
    title_text='Total Count of Games in Each Category', # title of plot
    xaxis_title_text='Category', # xaxis label
    yaxis_title_text='Count', # yaxis label
    bargap=0.2, # gap between bars of adjacent location coordinates
    bargroupgap=0.1 # gap between bars of the same location coordinates
)


fig.show()


******************************************************************************************************************************
#### This figure shows the number of games in each category in our dataset.
******************************************************************************************************************************

In [None]:
free = df[df['paid']==False][['installs_in_million',"category"]]

In [None]:
fig = px.bar(free, x="category", y="installs_in_million", labels={'installs_in_million': 'Installs', 'category': 'Category'})
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()


******************************************************************************************************************************
#### Among free games, the Game Action category has more installs than the other categories, as shown in the figure above.
******************************************************************************************************************************
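
The bar chart stacks one segment per game, so the height of each bar is the sum of installs in that category. A minimal sketch of the same aggregation in pandas, using a hypothetical `toy` frame with made-up values:

```python
import pandas as pd

# Hypothetical data; the real notebook uses the columns of the same names in df.
toy = pd.DataFrame({
    "category": ["GAME ACTION", "GAME ACTION", "GAME PUZZLE", "GAME PUZZLE", "GAME CARD"],
    "paid": [False, False, False, True, False],
    "installs_in_million": [500e6, 100e6, 100e6, 1e6, 10e6],
})

# Keep only free games, then total the installs per category.
free_toy = toy[toy["paid"] == False]
installs_by_category = (
    free_toy.groupby("category")["installs_in_million"]
    .sum()
    .sort_values(ascending=False)
)

print(installs_by_category)
```

Printing the aggregated series alongside the figure makes it easy to confirm which category leads and by how much.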

In [None]:
df['category'].value_counts()  #df['category'].value_counts(normalize=True)



In [None]:
fig = px.histogram(df, x= 'total ratings', title='Total Ratings of the Games', color_discrete_sequence=['indianred'])

fig.update_layout(
    title_text='Total Ratings of the Games', # title of plot
    xaxis_title_text='Total ratings', # xaxis label
    yaxis_title_text='Count', # yaxis label
    bargap=0.2, # gap between bars of adjacent location coordinates
    bargroupgap=0.1 # gap between bars of the same location coordinates
)


fig.show()

In [None]:
fig = px.box(df, x='total ratings', hover_data=['title', 'category'])
fig.update_traces(quartilemethod="inclusive")
fig.show()


******************************************************************************************************************************
#### Looking at the outliers in the Total Ratings box plot, the trend in the histogram of total ratings above becomes easier to understand.
******************************************************************************************************************************
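
As a rough illustration of how box-plot outliers are identified, the sketch below applies the common 1.5 × IQR rule to made-up numbers. It uses `statistics.quantiles` with `method="inclusive"` from the standard library; this is an assumption for illustration and may not match Plotly's quartile computation exactly on every dataset.

```python
import statistics

# Hypothetical total-rating counts with one extreme value.
total_ratings = [1, 2, 3, 4, 5, 6, 7, 8, 100]

# Quartiles computed with the inclusive method.
q1, _median, q3 = statistics.quantiles(total_ratings, n=4, method="inclusive")
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in total_ratings if v < lower_fence or v > upper_fence]

print(q1, q3, iqr)
print(lower_fence, upper_fence)
print(outliers)
```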

In [None]:
total_ratings_by_category = df.groupby('category')['total ratings'].mean()
total_ratings_by_category

In [None]:
install_by_category = df.groupby('category')['installs_in_million'].mean()
install_by_category

In [None]:
import plotly.graph_objs as go

data = [
    go.Bar(x= list(total_ratings_by_category.index), y=list(total_ratings_by_category.values), name='Total Ratings by Category',offsetgroup=0),
    go.Bar(x= list(total_ratings_by_category.index), y=list(install_by_category.values), name='Install by Category', yaxis='y2',offsetgroup=1)
]

# Add titles and color the font of the titles to match that of the traces
# 'SteelBlue' and 'DarkOrange' are the defaults of the first two colors.

# go.YAxis, go.Font and titlefont are deprecated; use go.layout.YAxis with a title dict instead.
y1 = go.layout.YAxis(title=dict(text='Total Ratings', font=dict(color='SteelBlue')))
y2 = go.layout.YAxis(title=dict(text='Installation', font=dict(color='DarkOrange')))

# Position the second y axis on the right, overlaying the first
y2.update(overlaying='y', side='right')

# Apply the pre-defined formatting to both y axes
layout = go.Layout(yaxis=y1, yaxis2=y2)



fig = go.Figure(data=data, layout=layout)
fig.update_layout(title='Total Ratings & Installs by Categories',xaxis={'categoryorder':'total descending'}, bargap=0.2, # gap between bars of adjacent location coordinates
    bargroupgap=0.1) # gap between bars of the same location coordinates)
fig.update_xaxes(title_text="Categories")
iplot(fig)

In [None]:
growth_by_category_30 = df.groupby('category')['growth (30 days)'].mean()
growth_by_category_60 = df.groupby('category')['growth (60 days)'].mean()

In [None]:
data = [
    go.Bar(x= list(total_ratings_by_category.index), y=list(growth_by_category_30.values), name='Growth by Category in 30 days',offsetgroup=0),
    go.Bar(x= list(total_ratings_by_category.index), y=list(growth_by_category_60.values), name='Growth by Category in 60 days', yaxis='y2',offsetgroup=1)
]

# Add titles and color the font of the titles to match that of the traces
# 'SteelBlue' and 'DarkOrange' are the defaults of the first two colors.

# go.YAxis, go.Font and titlefont are deprecated; use go.layout.YAxis with a title dict instead.
y1 = go.layout.YAxis(title=dict(text='Growth in 30 days', font=dict(color='SteelBlue')))
y2 = go.layout.YAxis(title=dict(text='Growth in 60 days', font=dict(color='DarkOrange')))

# Position the second y axis on the right, overlaying the first
y2.update(overlaying='y', side='right')

# Apply the pre-defined formatting to both y axes
layout = go.Layout(yaxis=y1, yaxis2=y2)



fig = go.Figure(data=data, layout=layout)
fig.update_layout(title='Growth Trend in 30 and 60 days',xaxis={'categoryorder':'total descending'}, bargap=0.2, # gap between bars of adjacent location coordinates
    bargroupgap=0.1) # gap between bars of the same location coordinates)
fig.update_xaxes(title_text="Categories")
iplot(fig)

In [None]:
top_ranked_games = df[df['rank']<6][['rank','title','category', 'total ratings', 'installs_in_million', '5 star ratings']]
top_ranked_games

## Analysis of Top 20 Games

In [None]:
top_20 = df.sort_values(by='installs_in_million', ascending=False).head(20)
top_20

In [None]:
data = [
    go.Scatter(x= list(top_20['title'].values), y=list(top_20['5 star ratings'].values), name='5 star ratings'),
    go.Bar(x= list(top_20['title'].values), y=list(top_20['installs_in_million'].values), name='Top 20 Games by Installation', yaxis='y2', opacity = 0.5)
]

# Add titles and color the font of the titles to match that of the traces
# 'SteelBlue' and 'DarkOrange' are the defaults of the first two colors.

# go.YAxis, go.Font and titlefont are deprecated; use go.layout.YAxis with a title dict instead.
y1 = go.layout.YAxis(title=dict(text='5 Star Ratings', font=dict(color='SteelBlue')))
y2 = go.layout.YAxis(title=dict(text='Installs', font=dict(color='DarkOrange')))

# Position the second y axis on the right, overlaying the first
y2.update(overlaying='y', side='right')

# Apply the pre-defined formatting to both y axes
layout = go.Layout(yaxis=y1, yaxis2=y2)



fig = go.Figure(data=data, layout=layout)
fig.update_layout( bargap=0.2, # gap between bars of adjacent location coordinates
    bargroupgap=0.1) # gap between bars of the same location coordinates)
fig.update_xaxes(title_text="Game Title")
iplot(fig)


******************************************************************************************************************************
#### This figure shows the install counts and the 5-star rating counts of the 20 most downloaded games in the same graph.
******************************************************************************************************************************

In [None]:
total_ratings_by_category = df.groupby('category')['total ratings'].mean()
total_ratings_by_category

In [None]:
trend = df[df['paid']==False][['average rating',"category"]].sort_values(by = 'average rating', ascending = False)
trend

In [None]:
fig = px.histogram(trend, x= "average rating", title='Average Rating Trends of Free Games', color_discrete_sequence=['indianred'])

fig.update_layout(
    title_text='Average Rating Trends of Free Games', # title of plot
    xaxis_title_text='Average Ratings', # xaxis label
    yaxis_title_text='Count', # yaxis label
    bargap=0.2, # gap between bars of adjacent location coordinates
    bargroupgap=0.1 # gap between bars of the same location coordinates
)


fig.show()

In [None]:
df["average rating"].value_counts()

In [None]:
df['paid'].value_counts(normalize=True)

In [None]:
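paid_free = df['paid'].value_counts()
# Derive the labels from the index so they cannot be swapped if the ordering changes
label = ['Paid' if is_paid else 'Free' for is_paid in paid_free.index]
fig = px.pie(values=paid_free.values, names=label,
             title='Paid & Free Games')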
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()


******************************************************************************************************************************
#### As we can see in this pie chart, most of the games in this dataset are available to users for free.
******************************************************************************************************************************

In [None]:
# Select the column as a Series so the index holds plain category names rather than tuples
dfx = df[df['paid']==False]['category'].value_counts()
dfx

## Pie Chart of Free Games by Categories

In [None]:
import plotly.graph_objects as go

labels = dfx.index
values = dfx.values

fig = go.Figure(data=[go.Pie(labels=labels, values=values, textinfo='label+percent',
                             insidetextorientation='radial'
                            )])
fig.show()

In [None]:
# Select the column as a Series so the index holds plain category names rather than tuples
dfxt = df[df['paid']==True]['category'].value_counts()
dfxt

## Pie Chart of Paid Games by Categories

In [None]:
import plotly.graph_objects as go

labels = dfxt.index
values = dfxt.values

fig = go.Figure(data=[go.Pie(labels=labels, values=values, textinfo='label+percent',
                             insidetextorientation='radial'
                            )])
fig.show()

## Conclusion

** There are two main types of apps in the Google Play Store: free and paid. Apps span games, movies, 
  education, video and other categories, all of which are available on the Google Play Store.

** The games in this dataset fall into the following categories: Word, Trivia, Simulation, Sports, Strategy, Racing, 
  Role_Playing, Puzzle, Music, Educational, Card, Casino, Casual, Board, Action, Adventure and Arcade.

** While the Game Casino category ranks first in 30-day growth, the Game Board category ranks first in 60-day 
  growth. Because the ranking changes with the time window, growth dynamics should be examined over longer 
  periods when making long-term predictions.
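
The comparison of growth leaders can be sketched as follows. The `toy` frame and its numbers are hypothetical and chosen only to mirror the pattern described above; the column names are assumed to match those used earlier in the notebook.

```python
import pandas as pd

# Hypothetical growth figures for two categories.
toy = pd.DataFrame({
    "category": ["GAME CASINO", "GAME CASINO", "GAME BOARD", "GAME BOARD"],
    "growth (30 days)": [12.0, 8.0, 3.0, 5.0],
    "growth (60 days)": [1.0, 3.0, 9.0, 11.0],
})

# Mean growth per category for each time window.
mean_growth = toy.groupby("category")[["growth (30 days)", "growth (60 days)"]].mean()

# Category with the highest mean growth in each window.
leaders = mean_growth.idxmax()

print(mean_growth)
print(leaders)
```

Comparing the leader per window in one table makes the dependence on the time window explicit.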

### THANKS!