**PREFACE**

In this Exploratory Data Analysis (EDA), we will examine the "Top Games on Google Playstore" dataset, provided as the 'android-games.csv' file on the [Kaggle website](https://www.kaggle.com/dhruvildave/top-play-store-games).

This study covers what a beginner can do to understand the given dataset better, both by examining its various aspects and by visualising it.

According to its description, "this is a dataset of top 100 games of each category of games on Google Play Store along with their ratings and other data like price and number of installs."

NOTE: To make the code easier to follow, the author will try to link to the official pandas documentation for the methods/attributes used.

Now it's time to jump into the dataset.

**The first step is to import the required libraries.**

For the visualizations, the study uses both Seaborn and Plotly's interactive environment, so that the same comparisons can be viewed in both libraries.

If the plotly module is not installed in your working environment, please install it by running the following cell.

In [None]:
pip install plotly==5.1.0

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import plotly 
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
from plotly.offline import iplot
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

**OVERVIEW THE DATASET**

1. Identification of variables and data types.
2. Examining/Deciding if columns/variables are appropriate for our analysis.
3. Analyzing the basic metrics.
4. Exploring missing values in Dataset.
5. Dealing with missing/invalid values.
6. Outlier treatment.
7. If needed, variable transformations.
8. Visualization.

Read the dataset with [pd.read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) and assign it to df. (You can use any name you like instead of df.)

In [None]:
df = pd.read_csv('../input/top-play-store-games/android-games.csv')
df

df.shape (an attribute, not a method) [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shape.html)

In [None]:
df.shape

**Explanation:** Both df itself and df.shape report 1730 observations (rows) and 15 attributes/features (columns).

df.columns [pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.columns.html)

In [None]:
df.columns

**Explanation:** The df.columns attribute returns all column labels of the given DataFrame.

Check how many records are in the dataset and whether it contains any missing (NA) values.

df.info() [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html)

df.isnull() [pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isnull.html)

df.sum() [pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html)

In [None]:
df.info()

Our dataset contains:

*   11 numeric variables: 8 of type int64 and 3 of type float64.

*   4 non-numeric variables: 3 of type object and 1 of type bool.

**Special Note:** Although the 'installs' column has the object data type, it is better to convert its values to an integer or float type for further analysis.

In [None]:
df.isnull().sum()

**Explanation:** df.isnull().sum() is one of the simplest ways to count the missing values in each column. Thankfully, this dataset has no missing values, which is a rarity for real-world data.

Series.value_counts() [pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html)
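Since this dataset has no gaps, the output above does not show what the method reports when values are missing. The following is a minimal sketch on an invented toy frame (the values are made up for illustration and are not taken from android-games.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with deliberately missing entries
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "price": [0.0, np.nan, 1.99],
    "category": ["GAME CARD", None, None],
})

# isnull() marks each missing cell as True; sum() counts the True values per column
missing_per_column = toy.isnull().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing cells:", total_missing)
```

A column with a count above zero would then need one of the treatments listed in the overview (dropping or imputing values).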

In [None]:
df['total ratings'].value_counts()

In [None]:
df['title'].value_counts()

**Explanation:** value_counts() returns a Series containing the count of each unique value. When necessary, we can apply this method to any column to see how often each of its values occurs.

In [None]:
df['paid'].value_counts(normalize=True)

**Explanation:** With normalize set to True, value_counts() returns relative frequencies, that is, each count divided by the total number of values. For example, for the 'paid' column in our case, free (unpaid) games make up about 99.6% of all games in the dataset.

pd.unique() [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.unique.html)

df.nunique() [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nunique.html)
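To make the normalisation step concrete, here is a small sketch that rebuilds only the 'paid' column from the counts reported further down in this notebook (1723 free and 7 paid games) and checks that normalize=True is the same as dividing the counts by the number of rows. The Series is reconstructed for illustration and is not read from the CSV file:

```python
import pandas as pd

# Reconstructed 'paid' column: 1723 free games (False) and 7 paid games (True)
paid = pd.Series([False] * 1723 + [True] * 7, name="paid")

shares = paid.value_counts(normalize=True)   # relative frequencies
manual = paid.value_counts() / len(paid)     # the same calculation done by hand

print((shares * 100).round(1))
```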

In [None]:
df.columns

In [None]:
df["category"].unique()

In [None]:
df["category"].nunique()

**Explanation:** The unique() and nunique() methods return the unique values and the number of unique values, respectively. For example, the 'category' column in our case has 17 unique values.

df.describe() [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html)

In [None]:
df.describe()

**Explanation:** For numerical variables, df.describe() generates descriptive statistics including count, mean, std, min, the first, second and third quartiles, and max. However, the analyst should be careful, since these statistics are not meaningful for every column (for example, the mean of an ordinal column such as 'rank').

**Sum Up Overview**

Before going further, let's summarize what we have got from the dataset.

The dataset describes Android games from different categories in terms of their ratings and number of installs.

'installs' is an attractive variable for drawing inferences about the other variables. Therefore, it is better to convert it and use it as a numerical variable.

There are no missing values, which simplifies the data preparation stage.

The 'category' column is a categorical variable and may reveal significant differences among the categories of Android games.

The numerical variables deserve special attention in further analysis.

'paid' and 'price' carry similar information and differ mainly in their data types. Depending on the analyst's needs, one of them can be either dropped or kept. To evaluate the difference between free games ("False") and paid games ("True"), this study keeps both in the analysis.

Now let's make 'installs' numerical.

**Method-1: converting the "installs" column with the replace(), split(), join() and astype() methods:**

In [None]:
df.sample(10)

In [None]:
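# '0.M' is a regular expression ('.' matches the space in e.g. '500.0 M'), so regex=True is required in pandas 2.0+
df['installs'] = df['installs'].str.replace('0.M', '000.000', regex=True).str.strip()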
df['installs']

In [None]:
# Same idea for thousands: '0.k' matches the trailing '0 k' in e.g. '500.0 k'
df['installs'] = df['installs'].str.replace('0.k', '000', regex=True).str.strip()
df['installs']

In [None]:
df['installs'].value_counts()

**Explanation:** In the 'installs' column, we replaced the patterns '0.M' and '0.k' with '000.000' and '000', respectively.

In [None]:
df['installation_in_million'] = df['installs'].str.split(".").str.join('').astype('int')/1000000
df['installation_in_million']

**Explanation:** The object (string) values in the 'installs' column were split on '.', joined back together and converted to 'int'. Dividing by 1,000,000 then gives the number of installs in millions (as a float), stored in a new column named 'installation_in_million'.

In [None]:
df['installation_in_million'].value_counts()

**Explanation:** we have checked the values, numbers and type of variable in our new column named 'installation_in_million'.

In [None]:
df.sample(10)

**Explanation:** we have checked our new column named 'installation_in_million' at the end of our DataFrame.

In [None]:
df.info()

In [None]:
df = df.drop("installs", axis=1)
df

**Explanation:** Since we don't need the former column 'installs' we have dropped it from our DataFrame.
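As a cross-check of the string manipulation above, the same conversion can be sketched by splitting each value into a number and a unit. The helper name `installs_to_millions` and the `UNIT_FACTORS` mapping are hypothetical, and the sample strings only assume the '<number> k' / '<number> M' format seen in the 'installs' column:

```python
import pandas as pd

# Assumed multipliers for the unit suffixes in the 'installs' column
UNIT_FACTORS = {"k": 1_000, "M": 1_000_000}

def installs_to_millions(text):
    """Convert a string such as '500.0 k' or '1000.0 M' to a number of millions."""
    number, unit = text.strip().split()
    return float(number) * UNIT_FACTORS[unit] / 1_000_000

# Sample strings in the assumed format (not the full column)
raw = pd.Series(["100.0 k", "500.0 k", "1.0 M", "500.0 M", "1000.0 M"])
in_millions = raw.apply(installs_to_millions)

print(in_millions.tolist())
```

This variant does not depend on regular-expression behaviour in `str.replace`, at the cost of raising a `KeyError` for any unit other than 'k' or 'M'.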

**Method-2: converting the "installs" column with a user-defined function and the apply(), replace(), strip() and astype() methods:**

In [None]:
# DO NOT RUN this cell: it is kept commented out so that it does not conflict with Method-1.
# It only shows an alternative way to do the conversion.

# def in_thousand (inst):
#     if inst == '500.0 k':
#         return '0.5 M' 
#     elif inst == '100.0 k':
#         return '0.1 M'
#     else:
#         return inst

# df['installs']= df['installs'].apply(in_thousand)

# df['installs']= df['installs'].str.replace( 'M', '').str.strip().astype('float')

# df= df.rename(columns={'installs': 'installation_in_million'})

# df['installation_in_million'].value_counts()

**The Genres of Android Games**

In [None]:
df.category.nunique()

In [None]:
df.category.value_counts()

**Explanation:** The 'category' column contains 17 differently named categories. 15 of the 17 categories contain the same number of games, while the Card and Word categories contain slightly more than the others.

In [None]:
df.groupby('category')[['average rating']].mean().sort_values(by = 'average rating', ascending = False)

As shown above, racing games are among the most popular, with an average rating of 3.96. Casino, casual, word and simulation games complete the top 5, with average ratings of 3.95, 3.95, 3.942308 and 3.94, respectively.

**The Growth of Android Games in Time**

In [None]:
growth_30 = df.groupby('category')[['growth (30 days)']].mean().sort_values(by = 'growth (30 days)', ascending = False)
growth_30.head()

**Explanation:** The 5 categories with the highest mean growth over 30 days are Casino, Trivia, Card, Adventure and Role Playing, respectively.

In [None]:
growth_60 = df.groupby('category')[['growth (60 days)']].mean().sort_values(by = 'growth (60 days)', ascending = False)
growth_60.head()

**Explanation:** Unlike the 30-day growth, the 5 categories with the highest mean growth over 60 days are Board, Card, Strategy, Action and Racing, respectively. Card is the only category that stays in the top 5 of both tables without much difference in its mean, so its growth looks more stable than that of the other categories.
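The two growth tables can also be compared side by side by aggregating both columns at once and ranking them. The sketch below uses a made-up two-category frame; the column names follow the dataset, while the values and the `rank_30`/`rank_60` names are invented for illustration:

```python
import pandas as pd

# Hypothetical toy data with the same column names as the dataset
toy = pd.DataFrame({
    "category": ["GAME CARD", "GAME CARD", "GAME BOARD", "GAME BOARD"],
    "growth (30 days)": [10.0, 30.0, 2.0, 4.0],
    "growth (60 days)": [5.0, 15.0, 40.0, 60.0],
})

# Mean growth per category for both periods in one table
growth = toy.groupby("category")[["growth (30 days)", "growth (60 days)"]].mean()

# Rank 1 = highest mean growth in that period
growth["rank_30"] = growth["growth (30 days)"].rank(ascending=False).astype(int)
growth["rank_60"] = growth["growth (60 days)"].rank(ascending=False).astype(int)

print(growth)
```

A category whose two ranks are close to each other behaves like Card in the tables above.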

**The Worth of Android Games**

In [None]:
df.price.value_counts() 

In [None]:
df.paid.value_counts() 

In [None]:
df.paid.value_counts(normalize=True) 

**Explanation:** The 'price' and 'paid' columns show that almost all games, 1723 out of 1730, are free (unpaid).

As shown above, there are only 7 paid games, so the data is heavily skewed (biased) towards free games; any conclusion about paid games should therefore be drawn with caution.

Price ranges from 0 (free) to 7.49, with the following counts: 0.00 (1723), 0.99 (1), 1.49 (1), 1.99 (3), 2.99 (1) and 7.49 (1). Free games make up 99.6% of all games in the dataset.

**SPECIAL NOTE ON SAMPLE SIZE:** The central limit theorem states that the sampling distribution of the mean of any independent, random variable will be normal or nearly normal, **if the sample size is large enough**. The question, however, is how large "large enough" is. The answer depends on two factors.

**Requirements for accuracy.** The more closely the sampling distribution needs to resemble a normal distribution, the more sample points will be required.

**The shape of the underlying population.** The more closely the original population resembles a normal distribution, the fewer sample points will be required.

In practice, some statisticians say that **a sample size of 30** is large enough when the population distribution is roughly bell-shaped. Others recommend a sample size of at least 40. But if the original population is distinctly not normal (e.g., is badly skewed, has multiple peaks, and/or has outliers), researchers like the sample size to be even larger. [Source: FSCJ statistics guide](https://guides.fscj.edu/Statistics/centrallimit)
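The effect of sample size can be illustrated with a short simulation. This is only a sketch under stated assumptions: the "population" is an exponential distribution chosen because it is clearly right-skewed, and the sample sizes (5, 30, 200), the number of repeats and the helper names are arbitrary choices, not part of the dataset or the analysis:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_means(n, repeats=5000):
    """Draw `repeats` samples of size n from a skewed population and return their means."""
    return rng.exponential(scale=1.0, size=(repeats, n)).mean(axis=1)

def skewness(values):
    """Simple (population) skewness: third central moment over std cubed."""
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Skewness of the distribution of sample means for growing sample sizes
skew_by_n = {n: skewness(sample_means(n)) for n in (5, 30, 200)}

for n, value in skew_by_n.items():
    print(f"n={n:>3}  skewness of sample means: {value:.2f}")
```

The skewness of the sample means shrinks towards 0 as n grows, which is the sense in which they become "nearly normal". With only 7 paid games, we are far from that regime.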

**The Average Ratings of Paid/Unpaid Games**

In [None]:
df.head()

As seen above, the 'paid' column has boolean values of True and False, which represent paid and free games, respectively. We replace these values with 'Paid' and 'Free' for better readability and clarity.

In [None]:
def paid_unpaid(paid):
    if paid == True:
        return 'Paid'
    elif paid == False:
        return 'Free'
    else:
        return paid

In [None]:
df['paid']= df['paid'].apply(paid_unpaid)

In [None]:
df.head()
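The same relabelling can also be written with a dictionary and `Series.map`, which avoids the explicit if/elif function. The sketch below runs on a small invented frame and writes to a separate, hypothetical 'paid_label' column so that it does not touch df:

```python
import pandas as pd

# Hypothetical toy frame with a boolean 'paid' column
toy = pd.DataFrame({"title": ["A", "B", "C"], "paid": [False, True, False]})

# Map each boolean to its label; values missing from the dict would become NaN
toy["paid_label"] = toy["paid"].map({True: "Paid", False: "Free"})

print(toy)
```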

In [None]:
df.groupby(['paid', 'average rating'])[['installation_in_million']].mean()

In [None]:
df.groupby(['paid', 'average rating'])[['installation_in_million']].sum()

**Explanation:** When we examine the 'installation_in_million' column grouped by 'paid' and 'average rating', free games with an average rating of 4 are downloaded the most by game lovers. In absolute numbers, free games also have more high ratings than paid games.

Moreover, paid games deserve special attention: there are too few of them to support robust conclusions, because the data for paid games is neither unbiased nor sufficient. Nevertheless, it is clear that all paid games in the dataset have a high average rating of 4.

**VISUALIZATION**

For a better understanding and comparison, Seaborn and Plotly libraries will be used in the following visualization part.

**The Most Popular Games in Each Category by Rank**

In [None]:
popular_games = df[df['rank'] <= 5][['rank','title','category', 'total ratings', 'installation_in_million', '5 star ratings']]
popular_games

**Top 10 Games By Total Ratings**

In [None]:
top_10_by_total_ratings = df.sort_values(by='total ratings', ascending=False)[['title','category', 'installation_in_million', '5 star ratings', 'total ratings']].head(10)
top_10_by_total_ratings

In [None]:
# Barplot with Plotly

fig = px.bar(top_10_by_total_ratings, x= 'title', y='total ratings', hover_data = top_10_by_total_ratings[['category']], color='category', title='Top 10 Games By Total Ratings')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

**Top 10 Games By Installation**

In [None]:
top_10_by_installation = df.sort_values(by='installation_in_million', ascending=False)[['title','category', 'installation_in_million', '5 star ratings', 'total ratings']].head(10)
top_10_by_installation

In [None]:
# Barplot with Plotly

fig = px.bar(top_10_by_installation, x= 'title', y='installation_in_million', hover_data = top_10_by_installation[['category']], color='category', title='Top 10 Games By Installation')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

**The Relation Between "total ratings" & "5 star ratings"**

In [None]:
# Scatter plot with Plotly for the whole data set

fig = px.scatter(df, x= 'total ratings', y='5 star ratings', hover_data = df[['title', 'installation_in_million']], color='category')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

In [None]:
# Scatter plot with Plotly for Top 10 Games By Total Ratings

fig = px.scatter(top_10_by_total_ratings, x= 'total ratings', y='5 star ratings', hover_data = top_10_by_total_ratings[['title', 'installation_in_million']], color='category')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

In [None]:
# Scatter plot with Plotly for Top 10 Games By Installation

fig = px.scatter(top_10_by_installation, x= 'total ratings', y='5 star ratings', hover_data = top_10_by_installation[['title', 'installation_in_million']], color='category')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

As seen in the scatter plot, there appears to be a positive relationship between "total ratings" and "5 star ratings" for the top 10 games by installation: as the total number of ratings increases, the number of 5-star ratings given by the players also increases. These players seem to enjoy the games; indeed, for each of the top 10 games, 5-star ratings make up more than half of the total ratings.

Moreover, Subway Surfers and Candy Crush Saga have the highest number of installations, with 1 billion each, while the others have 500 million.

On the other hand, Garena Free Fire-World Series, installed 500 million times, has the highest 5-star ratings and total ratings, with values of 63546766 and 86273129, respectively.
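The "more than half" observation can be checked with simple arithmetic on the two figures quoted above for the game with the most ratings. The variable names are illustrative only:

```python
# Figures quoted in the text above
five_star_ratings = 63_546_766
total_ratings = 86_273_129

# Share of all ratings that are 5-star ratings
five_star_share = five_star_ratings / total_ratings

print(f"5-star share: {five_star_share:.1%}")
```

The share comes out at roughly 74%, well above one half.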

**Unique Game Categories**

In [None]:
#with Seaborn

plt.figure(figsize=(10,4))
sns.countplot(x = "category", data = df)
plt.xticks(rotation = 45);

In [None]:
#withPlotly

fig = px.histogram(df, x="category", title='Game Categories')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

**The Growth of Android Games  in the First 30 Days by Category**

In [None]:
growth_30 = df.groupby('category')[['growth (30 days)']].mean().sort_values(by = 'growth (30 days)', ascending = False)
growth_30.head()

In [None]:
a = growth_30.index
a

In [None]:
plt.figure(figsize=(20,7))
sns.barplot(x='category', y='growth (30 days)', data=df, order=a).set(xlabel="Game Categories", ylabel='Growth (30 Days)')
plt.xticks(rotation = 60)
plt.title('The Growth of Android Games in the First 30 Days by Category', fontdict={'fontsize': 16})
plt.show();

In [None]:
# Plot the mean growth per category (growth_30) instead of stacking every game's value
fig = px.bar(growth_30.reset_index(), x='category', y='growth (30 days)', title='The Growth of Android Games in the First 30 Days by Category')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

**The Growth of Android Games in the First 60 Days by Category**

In [None]:
growth_60 = df.groupby('category')[['growth (60 days)']].mean().sort_values(by = 'growth (60 days)', ascending = False)
growth_60.head()

In [None]:
b = growth_60.index
b

In [None]:
plt.figure(figsize=(20,7))
sns.barplot(x='category', y='growth (60 days)', order=b, data=df).set(xlabel="Game Categories", ylabel='Growth (60 Days)')
plt.xticks(rotation = 60)
plt.title('The Growth of Android Games in the First 60 Days by Category', fontdict={'fontsize': 16})
plt.show();

**Explanation:** Comparing the two periods:

The top 5 categories by growth in the first 30-day period are 'CASINO', 'TRIVIA', 'CARD', 'ADVENTURE' and 'ROLE PLAYING', respectively, whereas the top 5 in the first 60-day period are 'BOARD', 'CARD', 'STRATEGY', 'ACTION' and 'RACING'. 'CARD' is the only category common to both groups. This may result either from its popularity among players or from other factors such as bug fixes, ongoing development or the needs of players. Drawing conclusions on this subject therefore requires a more detailed examination and/or domain experience.

**Total Ratings**

In [None]:
df['total ratings'].describe()

In [None]:
df["total ratings"].describe().apply(lambda x: format(x, 'f'))

**Explanation:** We have suppressed the scientific notation in the output of describe().

The first thing to note is the large difference between the mean (1064331.919653) and the median, Q2 (428606.5). This is most likely caused by outliers; indeed, the maximum value of 86273129.0 supports this assumption to some extent. Nevertheless, we need to examine the box plot.

In [None]:
#Boxplot with Seaborn

plt.figure(figsize=(15,5))
sns.boxplot(data=df, x ="total ratings");

In [None]:
#Boxplot with Plotly

fig = px.box(df, x= 'total ratings', hover_data = df[['title','category']])
fig.update_traces(quartilemethod="inclusive")
fig.show()

**Explanation:** Indeed, the box plot shows outliers, including extreme ones.
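The points drawn beyond the whiskers can also be listed numerically. A common convention, assumed here, flags values further than 1.5 times the interquartile range (IQR) from the quartiles. The sketch uses invented numbers and pandas' default quantile interpolation, which may differ slightly from the quartile method used in the plots:

```python
import pandas as pd

# Hypothetical toy ratings with one extreme value
ratings = pd.Series([10, 12, 13, 15, 16, 18, 20, 500])

q1, q3 = ratings.quantile([0.25, 0.75])
iqr = q3 - q1

# 1.5 * IQR fences (assumed convention)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = ratings[(ratings < lower_fence) | (ratings > upper_fence)]
print(outliers.tolist())
```

Applied to df['total ratings'], the same filter would return the games shown as individual points in the box plot.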

In [None]:
#Histogram with Seaborn by 'total ratings'

plt.figure(figsize = (10,6))
sns.histplot(df['total ratings'], bins = 50);

In [None]:
#Histogram with Plotly by 'total ratings'

fig = px.histogram(df, x= 'total ratings', title='Total Ratings of the Games')
fig.show()

df.skew() [pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.skew.html)

In [None]:
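# numeric_only=True skips the text columns, which would raise an error in pandas 2.0+
df.skew(numeric_only=True)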

As seen in the histogram, most games have between 0 and 500,000 total ratings.

Like the box plot, the histogram shows a number of outliers, including extreme ones, which increase the mean and create a large gap between the median and the mean.

As a general rule of thumb ([source](https://community.gooddata.com/metrics-and-maql-kb-articles-43/normality-testing-skewness-and-kurtosis-241)):

*   If skewness is less than -1 or greater than 1, the distribution is highly skewed.
*   If skewness is between -1 and -0.5 or between 0.5 and 1, the distribution is moderately skewed.
*   If skewness is between -0.5 and 0.5, the distribution is approximately symmetric.

So the extremely high skewness values, driven by outliers (especially the extreme ones in our case), indicate highly right-skewed distributions for all attributes.

Instead of the mean, using the median makes much more sense for further analysis.
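The rule of thumb above can be turned into a small labelling helper. The function name `describe_skew` and the toy columns are hypothetical; the thresholds follow the three bullet points, with values of exactly ±0.5 or ±1 assigned to the "moderately skewed" band as an arbitrary choice:

```python
import pandas as pd

def describe_skew(value):
    """Label a skewness value using the rule-of-thumb thresholds."""
    if abs(value) > 1:
        return "highly skewed"
    if abs(value) >= 0.5:
        return "moderately skewed"
    return "approximately symmetric"

# Hypothetical toy columns: one symmetric, one with a long right tail
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "right_tail": [1, 1, 2, 2, 100],
})

skew_labels = toy.skew(numeric_only=True).apply(describe_skew)
print(skew_labels)
```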

**The Number of Android Games Installed**

In [None]:
# With Seaborn

plt.figure(figsize=(15,5))
sns.histplot(data=df, x ="installation_in_million");

In [None]:
# With Plotly

fig = px.histogram(df, x= 'installation_in_million', title='The Number of Games Installed in Millions')
fig.show()

In [None]:
# Boxplot With Plotly

fig = px.box(df, x= 'installation_in_million', hover_data = df[['title','category']])
fig.update_traces(quartilemethod="inclusive")
fig.show()

The box plot again shows a highly right-skewed distribution, with possible outliers such as Clash of Clans with 500 million installations and Candy Crush Saga with 1 billion installations.

The size of the outliers clearly affects the mean and the distribution.
As mentioned above, it is reasonable to use the median for further analysis.
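To see how much a single extreme value can change a comparison, the mean and the median can be computed side by side per category. The frame below is invented for illustration (two placeholder categories "A" and "B"); only the column name follows the notebook:

```python
import pandas as pd

# Hypothetical toy data: category A contains one extreme value
toy = pd.DataFrame({
    "category": ["A", "A", "A", "B", "B", "B"],
    "installation_in_million": [1.0, 5.0, 1000.0, 5.0, 10.0, 10.0],
})

# Mean and median per category in one table
summary = toy.groupby("category")["installation_in_million"].agg(["mean", "median"])

top_by_mean = summary["mean"].idxmax()
top_by_median = summary["median"].idxmax()

print(summary)
print("Top by mean:", top_by_mean, "| top by median:", top_by_median)
```

Here the ranking flips depending on the statistic, which is the reason for preferring the median on skewed columns.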

**Free  & Paid Android Games**

In [None]:
df['paid'].value_counts(normalize=True)*100

In [None]:
values_paid = df.paid.value_counts()
values_paid

In [None]:
x = values_paid.values
x

In [None]:
y = values_paid.index
y

In [None]:
plt.figure(figsize=(9,6))
plt.pie(x, labels = y, autopct='%1.1f%%')
plt.show()

In [None]:
# Pie Chart With Plotly

paid_free = df['paid'].value_counts()
# Take the labels from the index so that each label always matches its count
fig = px.pie(values=paid_free.values, names=paid_free.index,
             title='Paid & Free Games')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()

**Total Ratings by Category**

In [None]:
category_ratings = df.groupby('category')['total ratings'].mean()
category_ratings

In [None]:
plot_order = df.groupby('category')['total ratings'].mean().sort_values(ascending=False).index.values

In [None]:
# Barplot With Seaborn

plt.figure(figsize=(10,6))
# x and y are passed as vectors, so data=df is dropped (its length does not match the 17 categories)
sns.barplot(x = category_ratings.index, y = category_ratings.values, order=plot_order).set(xlabel="Game Categories", ylabel='Total Ratings')
plt.xticks(rotation = 60)
plt.title('Game Categories by Total Ratings')
plt.show();

In [None]:
# Barplot With Plotly

fig = px.bar(category_ratings, x= category_ratings.index, y=category_ratings.values, labels={'x': 'Games Categories','y':'Total Ratings'})
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

**The Number of Installations by Category**

In [None]:
installation_by_category = df.groupby('category')['installation_in_million'].mean().sort_values(ascending =False)
installation_by_category

In [None]:
index1 = installation_by_category.index
index1

In [None]:
value1 = installation_by_category.values
value1

In [None]:
# Barplot with Seaborn

plt.figure(figsize=(12,6))
# x and y are passed as vectors, so data=df is dropped (its length does not match the 17 categories)
sns.barplot(x = index1, y = value1).set(xlabel="Game Categories", ylabel='Installation in Millions')
plt.xticks(rotation = 60)
plt.title('Game Categories by Installation')
plt.show();

In [None]:
# Barplot with Plotly

fig = px.bar(installation_by_category, x= index1, y=value1, labels={'x': 'Games Categories','y':'Install in Millions'})
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

In this Exploratory Data Analysis (EDA), we have examined the "Top Games on Google Playstore" dataset, provided as the 'android-games.csv' file on the Kaggle website.

This study has covered what a beginner can do to understand the given dataset better, both by examining its various aspects and by visualising it.

To make the code easier to follow, links to the official pandas documentation have been placed next to most of the methods/attributes used in the analysis.

Thank you for your time. I hope you like it and that it contributes to your knowledge.

**SPECIAL NOTE: "Thank you to all in Clarusway Cohort08-Data Science Path who have contributed in this work and to my knowledge"**