# Top Games on Google Play Store - An EDA

# Problem Context

A mobile game developer is planning to develop an Android game and publish it on the [Google Play Store](https://play.google.com/store/apps). The developer wants to analyze the top existing games on the Play Store in order to get a better sense of what to develop. The main questions the developer wants answered are:

1. Which types of games are more successful in number of ratings?
2. Paid or free games? If paid, what is a good price to go for?
3. Which types of games are growing at the moment?
4. Which types of games have the highest overall ratings?

We'll answer these questions through an Exploratory Data Analysis (EDA) of the "*android-games.csv*" dataset, which can be found at [Top Games of Google Play Store](https://www.kaggle.com/dhruvildave/top-play-store-games).

For now, there won't be a data cleaning step.

# Data Exploration

### Import the Libraries Used

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')


sns.set_theme(style = 'white') # theme used for the seaborn graphs

### Load and Read Data

In [None]:
data_path = '/kaggle/input/top-play-store-games/android-games.csv'
raw_data = pd.read_csv(data_path)

### Basic Data Information

In [None]:
raw_data.head()

We can see that we have the following features:
* __rank__: Rank in a particular category
* __title__: Game title
* __total ratings__: Total number of ratings
* __installs__: Approximate install milestone
* __average rating__: Average rating out of 5
* __growth (30 days)__: Percent growth in 30 days
* __growth (60 days)__: Percent growth in 60 days
* __price__: Price in dollars
* __category__: Game category
* __5 star ratings__: Number of 5 star ratings
* __4 star ratings__: Number of 4 star ratings
* __3 star ratings__: Number of 3 star ratings
* __2 star ratings__: Number of 2 star ratings
* __1 star ratings__: Number of 1 star ratings
* __paid__: Whether the game is paid or not

There are not many features in this dataset, but every one of them appears to be useful (maybe not the "*title*" from a numerical standpoint, but we'll keep it in the dataset for now).

In [None]:
raw_data.info()

In [None]:
raw_data.isnull().sum()

There is no null data: every column is filled on every row. This is expected, since every game on the Play Store must have all this information filled in.

In [None]:
raw_data.describe()

# Price Analysis

From the `raw_data.describe()` output above, we can see that at least 75% of the top games are free, since the 75th percentile of "_price_" is 0.00. Let's check the exact share of free and paid games (as a percentage):

In [None]:
# total free and paid games as a percentage
raw_data['paid'].value_counts(normalize = True) * 100 

So only 0.40% of the games in this list are paid (that accounts for only 7 games).
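The percentage and the absolute count can be read from a single table. Below is a minimal sketch on a made-up `paid` column; the `toy` frame and the 'Paid'/'Free' labels are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset
toy = pd.DataFrame({'paid': [False, False, False, True]})

# Readable labels instead of True/False
labels = toy['paid'].map({True: 'Paid', False: 'Free'})

summary = pd.DataFrame({
    'count': labels.value_counts(),                          # absolute number of games
    'percent': labels.value_counts(normalize = True) * 100,  # share of the list
})
print(summary)
```

Applied to `raw_data['paid']`, the same pattern would show the 7 paid games next to their share of the list.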

To find the paid games on this list we use:

In [None]:
raw_data[raw_data['paid'] == True].sort_values(by = 'price', ascending = False)

Here we have __Minecraft__ as the most expensive game on this list, costing $7.49, with more than 10 million installs at the time this dataset was gathered. __Minecraft__ is also the most popular game among the paid ones, so a higher price does not necessarily prevent a game from being popular, although 7 paid games are too few to generalize from.

In [None]:
paid_games = raw_data[raw_data['paid'] == True]
paid_games.describe()

In [None]:
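# numeric_only skips text columns such as the title, which recent pandas versions refuse to aggregate
paid_games.median(numeric_only = True)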

In [None]:
f, ax = plt.subplots(figsize = (10, 5))

sns.countplot(x = 'price',
              data = paid_games,
              palette = 'cool_r')

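ax.set_title('Number of Paid Games per Price',
             fontsize = 25,
             y = 1.1)

ax.set(ylabel = 'Number of Games',
       xlabel = 'Price')

# the y-axis shows counts, so the mean price is shown as a note, not as a horizontal line
ax.text(0.98, 0.95,
        f"Mean Price = ${paid_games['price'].mean():.2f}",
        transform = ax.transAxes,
        ha = 'right', va = 'top',
        color = 'r')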

We only have 7 values to analyse, so it's easy to see it through a pandas DataFrame, but a graph is more visually pleasing for this.

By analysing the price, we see that the __average game price is &#0036;3.20__ and that the median price is &#0036;1.99, so at least half of the paid games cost &#0036;1.99 or less.
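The mean, the median and the most frequent value answer different questions about a price list. Below is a minimal sketch on hypothetical prices; the numbers are made up for illustration and are not the 7 prices in the dataset.

```python
import pandas as pd

# Hypothetical prices, chosen only to illustrate the three statistics
prices = pd.Series([0.99, 1.99, 1.99, 2.99, 7.49])

mean_price = prices.mean()       # pulled upwards by the single expensive game
median_price = prices.median()   # middle value of the sorted prices
mode_price = prices.mode()[0]    # most frequent price (first one if there is a tie)

print(mean_price, median_price, mode_price)
```

With a sample this small, one expensive game is enough to move the mean well above the median, which is why the median is often the safer summary.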

# Category Analysis

In [None]:
raw_data['category'].unique()

We have 17 different game categories in this dataset.

In [None]:
raw_data['category'].value_counts()

Let's create a new `pandas.DataFrame` by grouping the original dataset by game category and averaging each numeric column. We drop the "*rank*" column since it is not relevant for now: ranks simply run from 1 to about 100 in each category, so their aggregate only reflects how many games the category lists (some have more; "*GAME CARD*", for example, contains 122 games). We also drop the "*paid*" column, since aggregating its True or False values is not relevant at the moment either.

In [None]:
# average every numeric column per category, then drop the columns we don't need
categories_df = (raw_data.groupby('category', as_index = False)
                         .mean(numeric_only = True)
                         .drop(columns = ['rank', 'paid']))
categories_df

We can now see how the columns behave in this new dataset. Since we aggregated with `mean()`, every value is a per-game average within the category:

* __total ratings__: The average number of total ratings per game in the category.
* __average rating__: The mean of the average ratings of the games in the category.
* __growth (30 days)__: The average growth (30 days) of the games in the category.
* __growth (60 days)__: The average growth (60 days) of the games in the category.
* __price__: The average price of the games in the category.
* __5 star ratings__: The average number of 5 star ratings per game in the category.
* __4 star ratings__: The average number of 4 star ratings per game in the category.
* __3 star ratings__: The average number of 3 star ratings per game in the category.
* __2 star ratings__: The average number of 2 star ratings per game in the category.
* __1 star ratings__: The average number of 1 star ratings per game in the category.
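Whether a grouped column holds a per-category mean or a sum depends on the aggregation called after `groupby`. Below is a minimal sketch on a made-up frame; the categories 'A' and 'B' and their values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset
toy = pd.DataFrame({
    'category': ['A', 'A', 'B'],
    'total ratings': [100, 300, 500],
})

# Compute both aggregations side by side to see how they differ
grouped = toy.groupby('category')['total ratings'].agg(['mean', 'sum'])
print(grouped)
```

A category with many games can have a large sum and still a modest mean, so it helps to state which aggregation a chart is based on.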

Since the number of installs is not a numerical value but a range, we'll use the number of ratings as a metric of game popularity (here we are assuming that the higher the number of installs, the higher the number of ratings).
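The assumption that more installs go with more ratings could be checked by turning the install milestones into numbers. The sketch below assumes milestones are written like '10.0 M', with 'k' and 'M' suffixes; that format, the `installs_to_number` helper and the sample values are assumptions for illustration, not verified against the dataset.

```python
import pandas as pd

# Assumed multipliers for the suffixes; the real column may use other formats
SUFFIXES = {'k': 1e3, 'M': 1e6}

def installs_to_number(text):
    """Convert a milestone string such as '10.0 M' into a float."""
    value, suffix = text.split()
    return float(value) * SUFFIXES[suffix]

# Hypothetical toy data, not taken from the dataset
toy = pd.DataFrame({
    'installs': ['10.0 M', '10.0 M', '100.0 M', '500.0 k'],
    'total ratings': [200_000, 300_000, 2_000_000, 10_000],
})

toy['installs (numeric)'] = toy['installs'].apply(installs_to_number)

# Median number of ratings per install milestone, ordered by milestone
ratings_by_tier = toy.groupby('installs (numeric)')['total ratings'].median()
print(ratings_by_tier)
```

If the medians rise with the milestone on the real data, the number of ratings is a reasonable stand-in for popularity.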

In [None]:
f, ax = plt.subplots(figsize = (10, 5))

sns.barplot(x = 'average rating',
            y = 'category',
            data = categories_df,
            palette = 'cool_r',
            order = categories_df.sort_values('average rating', ascending = False).category)

ax.set_title('Average Rating per Game Category',
             fontsize = 25,
             x = 0.4,
             y = 1.1)

ax.set(xlim = (4, 4.5),
       ylabel = '',
       xlabel = 'Average Rating')

In [None]:
f, ax = plt.subplots(figsize = (10, 5))

sns.barplot(x = 'total ratings',
            y = 'category',
            data = categories_df,
            palette = 'cool_r',
            order = categories_df.sort_values('total ratings', ascending = False).category)

# categories_df holds per-category means, so the chart shows averages per game
ax.set_title('Average Number of Ratings per Game Category',
             fontsize = 25,
             y = 1.1)

ax.set_xlabel('Average Number of Ratings (x10⁶)', fontsize = 18)
ax.set_ylabel('')

It is easy to see that __Action__ games lead the market by average number of ratings per game.

# Growth Analysis

In [None]:
growth_30_days = raw_data.groupby('category', as_index=False)['growth (30 days)'].mean()
growth_30_days

In [None]:
growth_60_days = raw_data.groupby('category', as_index=False)['growth (60 days)'].mean()
growth_60_days

In [None]:
f, ax = plt.subplots(figsize = (10, 5))

sns.barplot(x = 'growth (30 days)',
            y = 'category',
            data = growth_30_days,
            palette = 'cool_r',
            order = growth_30_days.sort_values('growth (30 days)', ascending = False).category)

ax.set_title('Average 30 Day Growth per Game Category',
             fontsize = 20,
             x = 0.4,
             y = 1.1)

ax.set(ylabel = '',
       xlabel = 'Average 30 Day Growth')



f, ax = plt.subplots(figsize = (10, 5))

sns.barplot(x = 'growth (60 days)',
            y = 'category',
            data = growth_60_days,
            palette = 'cool_r',
            order = growth_60_days.sort_values('growth (60 days)', ascending = False).category)

ax.set_title('Average 60 Day Growth per Game Category',
             fontsize = 20,
             x = 0.4,
             y = 1.1)

ax.set(ylabel = '',
       xlabel = 'Average 60 Day Growth')

Considering the last 30 days (from the date this dataset was gathered), __Action__ and __Word__ games have the highest growth among the categories listed. Analysing the last 60 days (again, from the time this dataset was gathered), we see that __Educational__ games had the highest growth.
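Both growth windows can also be averaged in a single `groupby`, which makes it easy to compare the leading category in each window. Below is a minimal sketch on a made-up frame; the categories and growth values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset
toy = pd.DataFrame({
    'category': ['A', 'A', 'B'],
    'growth (30 days)': [1.0, 3.0, 5.0],
    'growth (60 days)': [2.0, 2.0, 1.0],
})

# Average both growth columns per category in one step
growth = toy.groupby('category')[['growth (30 days)', 'growth (60 days)']].mean()

# Category with the highest average growth in each window
top_30 = growth['growth (30 days)'].idxmax()
top_60 = growth['growth (60 days)'].idxmax()
print(growth)
print(top_30, top_60)
```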

In [None]:
a = [] # empty list

# average number of ratings of paid games

a.append( raw_data[raw_data['paid'] == True]['total ratings'].mean() )

# average number of ratings of free games

a.append( raw_data[raw_data['paid'] == False]['total ratings'].mean() )

a

In [None]:
f, ax = plt.subplots(figsize = (10, 5))

sns.barplot(x = ['Paid', 'Free'],
            y = a,
            palette = 'cool_r')

ax.set_title('Average Number of Ratings for Paid and Free games',
             fontsize = 20,
             y = 1.1)

ax.set_xlabel('')
ax.set_ylabel('Average Number of Ratings (x10⁶)', fontsize = 17)

So, we see that __Free Games__ have a higher average number of ratings than __Paid Games__.
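Because the paid group is very small, it can help to look at the group size and the median next to the mean. Below is a minimal sketch on a made-up frame; the values and the 'Paid'/'Free' labels are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset
toy = pd.DataFrame({
    'paid': [True, True, False, False, False],
    'total ratings': [1000, 3000, 2000, 4000, 60000],
})

# Readable group labels instead of True/False
toy['type'] = toy['paid'].map({True: 'Paid', False: 'Free'})

# Group size, mean and median of the number of ratings per group
comparison = toy.groupby('type')['total ratings'].agg(['count', 'mean', 'median'])
print(comparison)
```

In this toy example a single very popular free game lifts the free mean far above the free median, the kind of effect worth checking before comparing averages.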

# Answering the Questions

__1. Which types of games are more successful in number of ratings?__

Since we didn't use the number of installs in our analysis (because we don't have an exact value for each game, just a range), we infer the success of each type of game from the average number of ratings per game in each category. With this in mind, the "__ACTION__" category has the highest value, at about $4.13 \times 10^6$ ratings per game, which shows the popularity of action mobile games.

__2. Paid or free games? If paid, what is a good price to go for?__

We have seen that free games have a higher average number of ratings than paid games, so going for a free game could be the better option for a game developer. If the game is paid, &#0036;1.99 (the median price among the paid games) is a reasonable starting point, keeping in mind that only 7 paid games are on the list.

__3. Which types of games are growing at the moment?__

Considering the last 30 days (from the time this dataset was gathered), the categories that are growing most rapidly at the moment are __Action__ and __Word__ games.

__4. Which types of games have the highest overall ratings?__

Seeing as this is a list of the most popular games, the overall ratings are not too different from each other, but __Word__ games and __Casino__ games have the highest ratings amongst all categories.

In conclusion, __Action__ mobile Android games perform very well on the Google Play Store in almost every metric analysed.