# Exploratory Data Analysis

### My Dataset
Games on Google Play Store
This dataset lists the top 100 games in each game category on the Google Play Store, along with their ratings and other data such as price and number of installs.

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### My Research Questions

#### 1. Which game category has the highest number of downloads?
#### 2. With respect to downloads and average ratings, which category holds the most popular games?
#### 3. What are the average ratings of free vs paid games?
#### 4. Which category of games has grown the most in the last 60 days? Is it similar to its progress in 30 days?
#### 5. Which category of games has the most negative reviews?
#### 6. Is there a relationship between the average rating of a game and its growth in 30 days and 60 days?

#### 1. Which game category has the highest number of downloads?

##### Importing the dataset

In [None]:
import numpy
import pandas as pd
import matplotlib.pyplot as plot
import seaborn as sns
import plotly.express as px

In [None]:
games = pd.read_csv('/kaggle/input/top-play-store-games/android-games.csv')

#### Let us take a glance at the dataset

In [None]:
games.head()

#### To find out the dimensions of our dataset, we can use the shape attribute

In [None]:
games.shape

#### This means we have 1730 rows and 15 columns in our dataset. Since we are not doing any Natural Language Processing, the game title is of no use to us.
#### For our reference, let us find out the datatype of each column

In [None]:
games.info()

#### From the above output, we can see that every column has the same number of non-null entries, so no values are missing, and the datatype of each column is now known to us. To check the number of null values in each column directly, we can use the following command

In [None]:
games.isnull().sum()

#### Now that we have understood the dataset, we can move on to analysing it

### EDA on Android games

#### Let us find out the number of paid games.

In [None]:
games[games['paid']==True] #selecting all rows in games, which are paid 

#### As we can see, there are a total of 7 paid games, so the vast majority of games are free
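The count of paid games can also be read off without displaying the filtered rows. The following is a minimal sketch on a made-up table (the `toy` frame and its values are hypothetical; only the boolean `paid` column mirrors the real dataset).

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; only the boolean 'paid' column matters here.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D', 'E'],
    'paid': [False, True, False, False, True],
})

# value_counts() tallies each distinct value, so it reports free and paid games at once.
paid_counts = toy['paid'].value_counts()
print(paid_counts)

# Summing a boolean column counts its True entries, i.e. the paid games.
n_paid = int(toy['paid'].sum())
print(n_paid)
```

On the real data the same two calls would be applied to `games['paid']`.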

In [None]:
games['category'].value_counts()

In [None]:
games['installs'].value_counts()

#### In the above table, we can see that the installs are of object type. To continue with our analysis we must convert them to a numerical type. We can do this with the map method of a pandas Series, which lets us add a numerical column whose values are looked up from the existing strings.

In [None]:
number_of_downloads = {'100.0 k' : 100000, '500.0 k' : 500000, '1.0 M' : 1000000, '5.0 M' : 5000000, '10.0 M' :10000000, '50.0 M' : 50000000, '100.0 M': 100000000,'500.0 M': 500000000, '1000.0 M': 1000000000,}
games['number_of_downloads'] = games['installs'].map(number_of_downloads)
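As an alternative to writing out every install label by hand, the strings could be parsed. This is only a sketch under the assumption that every label has the form `<number> <suffix>` with suffix `k` or `M`, as in the dictionary above; `parse_installs` and `SUFFIX_MULTIPLIER` are hypothetical names, not part of pandas.

```python
import pandas as pd

# Multipliers for the two suffixes seen in the 'installs' labels (assumed: only 'k' and 'M').
SUFFIX_MULTIPLIER = {'k': 1_000, 'M': 1_000_000}

def parse_installs(text):
    """Convert a label such as '500.0 k' or '1.0 M' to an integer count."""
    value, suffix = text.split()
    return int(float(value) * SUFFIX_MULTIPLIER[suffix])

# Toy labels in the same format as the real column.
toy_installs = pd.Series(['100.0 k', '5.0 M', '1000.0 M'])
toy_downloads = toy_installs.map(parse_installs)
print(toy_downloads.tolist())
```

One difference worth knowing: `map` with a dictionary silently produces NaN for any label missing from the dictionary, whereas this function raises an error on an unexpected label.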

#### Now let us check that our dataset has the new column

In [None]:
games.head(1)

#### Now that we have converted our column to a numerical type, the data manipulation part of our analysis is complete. We can now find the answer to our research question

In [None]:
rq1=games.groupby(by='category')['number_of_downloads'].sum() #gets the total number of downloads per category

In [None]:
rq1.head(2)

#### Let's sort our data according to number of downloads

In [None]:
rq1 = rq1.reset_index()
rq1 = rq1.sort_values(by= 'number_of_downloads')

In [None]:
rq1.head()

#### Now let us analyse the data using a barplot and find the game category with the highest total number of installs

In [None]:
sns.barplot(y='category',x='number_of_downloads', data=rq1)

#### According to our plot, we can conclude that Casual, Arcade, and Action are the most downloaded categories
#### Trivia and Casino are the least downloaded categories
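The group, sum, and sort steps above can also be chained and checked numerically instead of read off the plot. A minimal sketch on made-up numbers (the `toy` frame and its category labels are hypothetical):

```python
import pandas as pd

# Hypothetical data; category names and download counts are invented for illustration.
toy = pd.DataFrame({
    'category': ['ACTION', 'ACTION', 'CASUAL', 'TRIVIA', 'CASUAL'],
    'number_of_downloads': [100_000_000, 50_000_000, 500_000_000, 1_000_000, 10_000_000],
})

# Total downloads per category, largest first.
totals = (
    toy.groupby('category')['number_of_downloads']
       .sum()
       .sort_values(ascending=False)
)
print(totals)

# idxmax() returns the index label (here, the category) holding the largest total.
top_category = totals.idxmax()
print(top_category)
```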

#### 2. With respect to downloads and average ratings, which category holds the most popular games?

#### Let us sort the dataset by number of downloads and then by average rating, and keep the top 200 games

In [None]:
rq2 = games.sort_values(by= ['number_of_downloads', 'average rating'], ascending = False).head(200)


#### After the above step we group the top 200 games by category, count the games in each category, and sort the categories by that count (the number_of_downloads column now holds the count of games, not downloads)

In [None]:
rq2=rq2.groupby(by='category')['number_of_downloads'].count().reset_index().sort_values(by='number_of_downloads', ascending=False)

#### Now all we need to do is plot this to find the category with the most popular games under this new filter

In [None]:
rq2 = games.sort_values(by = ['number_of_downloads','average rating'], ascending=False).head(200)
rq2=rq2.groupby(by = 'category')['number_of_downloads'].count().reset_index().sort_values(by = 'number_of_downloads', ascending=False)
plot.scatter(rq2['number_of_downloads'], rq2['category'])
plot.plot(rq2['number_of_downloads'], rq2['category'])

#### As we can see, our result is not the same as it was before: now the category with the most games among the top 200 is Action.
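The same "take the top N games, then count them per category" idea can be sketched with `value_counts`, which avoids reusing the downloads column to hold a count. The data below is invented, and `top_n` is a hypothetical stand-in for the 200 used above.

```python
import pandas as pd

# Hypothetical games; values are invented for illustration.
toy = pd.DataFrame({
    'category': ['ACTION', 'CASUAL', 'ACTION', 'PUZZLE', 'CASUAL', 'ACTION'],
    'number_of_downloads': [500, 500, 100, 100, 50, 10],
    'average rating': [4.5, 4.1, 4.7, 4.6, 4.9, 3.9],
})

top_n = 4  # stand-in for the 200 used on the real data

# Rank by downloads first, breaking ties by average rating, and keep the top N rows.
top_games = toy.sort_values(
    by=['number_of_downloads', 'average rating'], ascending=False
).head(top_n)

# Count how many of the top N games fall into each category.
games_per_category = top_games['category'].value_counts()
print(games_per_category)
```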

#### 3. What are the average ratings of free vs paid games?

#### To find the average ratings of free games vs paid games, we must find the mean (average) rating of each type of game and compare them

In [None]:
rq3free=games[(games['paid']==False)]
rq3paid=games[(games['paid']==True)]

In [None]:
freemean=rq3free['average rating'].mean()
paidmean=rq3paid['average rating'].mean()
print("The average rating of free games is:", round(freemean, 2))
print("The average rating of paid games is:", round(paidmean, 2))

#### From the above results we can conclude that the paid games, although fewer in number, have a better average rating than free games
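Because there are so few paid games, it can help to show the group sizes next to the means. A minimal sketch on invented ratings, assuming the same `paid` and `average rating` column names:

```python
import pandas as pd

# Hypothetical ratings; three free games and two paid games.
toy = pd.DataFrame({
    'paid': [False, False, False, True, True],
    'average rating': [4.0, 4.2, 4.4, 4.5, 4.7],
})

# One groupby gives the mean rating and the number of games for each group.
summary = toy.groupby('paid')['average rating'].agg(['mean', 'count']).round(2)
print(summary)
```

A mean computed from only a handful of paid games is sensitive to each individual game, so the comparison should be read with that in mind.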

#### 4. Which category of games has grown the most in the last 60 days? Is it similar to its progress in 30 days?

In [None]:
rq430days = games.groupby(by = 'category')['growth (30 days)'].mean()
rq460days = games.groupby(by = 'category')['growth (60 days)'].mean()

In [None]:
rq4 = games.sort_values(by = 'growth (30 days)', ascending=False).head(200)
rq4=rq4.groupby(by = 'category')['growth (30 days)'].sum().reset_index().sort_values(by = 'growth (30 days)', ascending=False)
plot.scatter(rq4['growth (30 days)'], rq4['category'])
plot.plot(rq4['growth (30 days)'], rq4['category'])

In [None]:
rq4 = games.sort_values(by = 'growth (60 days)', ascending=False).head(200)
rq4=rq4.groupby(by = 'category')['growth (60 days)'].sum().reset_index().sort_values(by = 'growth (60 days)', ascending=False)
plot.scatter(rq4['growth (60 days)'], rq4['category'])
plot.plot(rq4['growth (60 days)'], rq4['category'])

#### On analysing the above graphs, we can see that not all categories keep the same growth over 60 days as they had over 30 days. Over 30 days the Action category had the maximum growth, whereas over 60 days the Educational category had the maximum growth
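The two growth windows can also be compared in one table rather than across two plots. Note that this sketch uses the per-category mean over all rows, not the sum over the top 200 rows used in the plots above, so it is a different summary; the numbers are invented.

```python
import pandas as pd

# Hypothetical growth figures for two categories.
toy = pd.DataFrame({
    'category': ['ACTION', 'ACTION', 'EDUCATIONAL', 'EDUCATIONAL'],
    'growth (30 days)': [10.0, 6.0, 2.0, 4.0],
    'growth (60 days)': [1.0, 3.0, 9.0, 11.0],
})

# Mean growth per category for both windows, side by side.
growth_by_category = toy.groupby('category')[
    ['growth (30 days)', 'growth (60 days)']
].mean()
print(growth_by_category)

# For each window (column), the category with the highest mean growth.
leaders = growth_by_category.idxmax()
print(leaders)
```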

#### 5. Which category of games has the most negative reviews?

In [None]:
games['negative reviews'] = games['1 star ratings']+games['2 star ratings']
rq5=games.groupby(by='category')['negative reviews'].sum()
rq5 = rq5.reset_index()
rq5 = rq5.sort_values(by= 'negative reviews')
sns.barplot(y='category',x='negative reviews', data=rq5)

#### We can see that Action games have the most negative ratings/reviews
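A raw count of negative reviews favours categories that simply receive more reviews overall. One way to check this is to also look at the share of negative reviews. The sketch below uses invented numbers and assumes a `total ratings` column holding the number of ratings per game; that column name is an assumption, not confirmed by the text above.

```python
import pandas as pd

# Hypothetical rating counts; 'total ratings' is an assumed column name.
toy = pd.DataFrame({
    'category': ['ACTION', 'ACTION', 'CASUAL'],
    '1 star ratings': [100, 50, 30],
    '2 star ratings': [20, 30, 10],
    'total ratings': [1000, 1000, 200],
})

# Sum the rating counts per category.
per_category = toy.groupby('category')[
    ['1 star ratings', '2 star ratings', 'total ratings']
].sum()

# Negative reviews are defined as in the cell above: 1-star plus 2-star ratings.
per_category['negative reviews'] = (
    per_category['1 star ratings'] + per_category['2 star ratings']
)

# Share of all ratings in the category that are negative.
per_category['negative share'] = (
    per_category['negative reviews'] / per_category['total ratings']
)
print(per_category[['negative reviews', 'negative share']])
```

In this toy table the category with the most negative reviews is not the one with the highest negative share, which is the situation the ratio is meant to reveal.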

#### 6. Is there a relationship between the average rating of a game and its growth in 30 days and 60 days?

In [None]:
columnA=games['average rating']
columnB=games['growth (30 days)']
correlation = columnA.corr(columnB)
print(correlation)

#### Since the correlation is negative, games with higher average ratings tend to show lower growth over 30 days. This conclusion is drawn from the Pearson correlation between the two columns: a value < 0 indicates a negative linear relationship. The sign alone says nothing about strength; a value close to 0 means the relationship is weak.

In [None]:
columnA=games['average rating']
columnB=games['growth (60 days)']
correlation = columnA.corr(columnB)
print(correlation)

#### Since the correlation is positive, games with higher average ratings tend to show higher growth over 60 days. This conclusion is drawn from the Pearson correlation between the two columns: a value > 0 indicates a positive linear relationship. As before, a value close to 0 means the relationship is weak.
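To make clear what the `corr` call computes, the Pearson coefficient can be worked out by hand and compared with the pandas result. This is a sketch on five invented data points, not on the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical values; higher ratings are paired with lower growth on purpose.
toy = pd.DataFrame({
    'average rating': [4.0, 4.2, 4.4, 4.6, 4.8],
    'growth (30 days)': [5.0, 3.0, 4.0, 1.0, 2.0],
})

x = toy['average rating'].to_numpy()
y = toy['growth (30 days)'].to_numpy()

# Deviations of each value from its column mean.
dx = x - x.mean()
dy = y - y.mean()

# Pearson r: sum of products of deviations, divided by the square root of the
# product of the summed squared deviations.
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Series.corr uses the Pearson method by default.
r_pandas = toy['average rating'].corr(toy['growth (30 days)'])

print(r_manual, r_pandas)
```

The coefficient always lies between -1 and 1, and it measures association only; it does not show that a rating causes growth or the reverse.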