
# Google Playstore Case Study

**Problem Statement**

The team at Google Play Store wants to develop a feature that boosts visibility for the most promising apps. This analysis requires a preliminary understanding of the features that define a well-performing app. You can ask questions like:
- Does a higher size or price necessarily mean that an app would perform better than the other apps? 
- Or does a higher number of installs give a clear picture of which app would have a better rating than others?


In [1]:
#import the libraries
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [2]:
#from google.colab import files
#uploaded = files.upload()

In [3]:
#read the dataset and check the first five rows

inp0 = pd.read_csv("../input/playstore/googleplaystore_v2.csv")
inp0.head()

In [4]:
#Check the shape of the dataframe
inp0.shape

### Data Handling and Cleaning

The first few steps make sure that there are no __missing values__ or __incorrect data types__ before we proceed to the analysis stage. These problems are handled as follows:

 - For Missing Values: Some common techniques to treat this issue are
    - Dropping the rows containing the missing values
    - Imputing the missing values
    - Keeping the missing values if they don't affect the analysis
 
    
 - Incorrect Data Types:
    - Clean certain values 
    - Clean and convert an entire column
 

In [5]:
#Check the datatypes of all the columns of the dataframe
inp0.info()

#### Missing Value Treatment

In [6]:
#Check the number of null values in the columns
inp0.isnull().sum()

Handling missing values for Rating
 - Rating is the target variable
 - Drop the records where it is missing
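
The drop step can also be written with `DataFrame.dropna`. The sketch below is only an illustration: the `toy` frame and its values are made up, and it assumes that `Rating` alone decides which rows are removed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the notebook itself works on inp0.
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.1, np.nan, 3.8, np.nan],
})

# Keep only the rows where the target column is present.
toy_clean = toy.dropna(subset=["Rating"]).copy()

print(toy_clean.shape)                      # shape after the drop
print(toy_clean["Rating"].isnull().sum())   # remaining nulls in Rating
```

Taking a `.copy()` after filtering keeps later column assignments from acting on a view of the original frame.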

In [7]:
#Drop the rows having null values in the Rating field (use ~ to negate a boolean mask)
inp1 = inp0[~inp0.Rating.isnull()].copy()

#Check the shape of the dataframe
inp1.shape

In [8]:
# Check the number of nulls in the Rating field again to cross-verify
inp1.Rating.isnull().sum()

In [9]:
#Check the number of nulls in the dataframe again and find the total number of null values
inp1.isnull().sum()


In [10]:
#Inspect the nulls in the Android Version column
inp1[inp1['Android Ver'].isnull()]

In [11]:
#Inspect the row having shifted values
inp1.loc[10472,:]

inp1[(inp1['Android Ver'].isnull()& (inp1.Category=="1.9"))]


In [12]:
#Drop the row having shifted values (use ~ to negate a boolean mask)
inp1 = inp1[~(inp1['Android Ver'].isnull() & (inp1.Category=="1.9"))]

In [13]:
inp1[inp1['Android Ver'].isnull()]

In [14]:
inp1['Android Ver'].value_counts()

Imputing Missing Values

- For numerical variables use the mean or the median
- For categorical variables use mode
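
As a rough sketch of these two rules, the example below imputes a numeric column with its median and a categorical column with its mode. The `toy` frame is invented for illustration, and choosing the median over the mean is an assumption made here because the toy `Size` column contains an outlier.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "Size": [10.0, np.nan, 30.0, 200.0],
    "Android Ver": ["4.1 and up", None, "4.1 and up", "5.0 and up"],
})

# Numeric column: the median is less sensitive to the 200.0 outlier than the mean.
size_median = toy["Size"].median()

# Categorical column: mode() returns a Series, so take its first entry.
ver_mode = toy["Android Ver"].mode()[0]

toy["Size"] = toy["Size"].fillna(size_median)
toy["Android Ver"] = toy["Android Ver"].fillna(ver_mode)

print(toy)
```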

In [15]:
#Check the most common value in the Android version column
inp1["Android Ver"].mode()[0]

In [16]:
#Fill up the nulls in the Android Version column with the above value
inp1["Android Ver"]=inp1["Android Ver"].fillna(inp1["Android Ver"].mode()[0])

In [17]:
#Check the nulls in the Android version column again to cross-verify
inp1["Android Ver"].isnull().sum()

In [18]:
#Check the nulls in the entire dataframe again
inp1.isnull().sum()

In [19]:
#Check the most common value in the Current version column
inp1["Current Ver"].mode()[0]

In [20]:
#Replace the nulls in the Current version column with the above value
inp1["Current Ver"]=inp1["Current Ver"].fillna(inp1["Current Ver"].mode()[0])


In [21]:
#Check null values in the Current version column
inp1['Current Ver'].isnull().sum()

#### Handling Incorrect Data Types 

In [22]:
#Check the datatypes of all the columns 
inp1.dtypes

In [23]:
#Price column can't be worked upon due to presence of $ sign and it being an object datatype


In [24]:
#Strip the leading $ sign and convert every price to float (free apps are stored as '0')
inp1.Price = inp1.Price.apply(lambda x: 0.0 if x=='0' else float(x[1:]))


In [25]:
#Verify the dtype of Price once again
inp1.Price.dtypes

In [26]:
#Analyse the Reviews column
inp1.Reviews.value_counts()

In [27]:
#Change the dtype of this column
inp1.Reviews = inp1.Reviews.astype("int32")
#Check the quantitative spread of this column
inp1.Reviews.describe()

In [28]:
#Analyse the Installs Column
inp1.Installs.head()


In [29]:
# Clean the Installs column so that the number of installs at the 50th percentile can be computed
inp1.Installs = inp1.Installs.apply(lambda x:int(x.replace(",","").replace("+","")))
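
The `Price` and `Installs` clean-up can also be expressed with pandas' vectorised string methods instead of `apply`. This is a minimal sketch on a made-up `toy` frame; it assumes the only non-numeric characters are a leading `$`, commas and a trailing `+`.

```python
import pandas as pd

# Hypothetical toy data mimicking the raw string formats.
toy = pd.DataFrame({
    "Price": ["0", "$4.99", "$0.99"],
    "Installs": ["1,000+", "50,000,000+", "10+"],
})

# Strip the leading "$" and convert to float.
toy["Price"] = toy["Price"].str.lstrip("$").astype(float)

# Remove every "," and "+" with a regex character class, then convert to int.
toy["Installs"] = (
    toy["Installs"].str.replace(r"[,+]", "", regex=True).astype("int64")
)

print(toy.dtypes)
```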


In [30]:
inp1.Installs

#### Sanity Checks



- Rating is between 1 and 5 for all the apps.
- Number of Reviews is less than or equal to the number of Installs.
- Free Apps shouldn’t have a price greater than 0.
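
The three checks can be combined into boolean masks, as in the hedged sketch below. The `toy` frame and the `violations` name are illustrative only; the column names follow the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: row 1 has more reviews than installs.
toy = pd.DataFrame({
    "Rating": [4.5, 3.0, 5.0],
    "Reviews": [10, 500, 20],
    "Installs": [100, 100, 1000],
    "Type": ["Free", "Paid", "Free"],
    "Price": [0.0, 1.99, 0.0],
})

rating_ok = toy["Rating"].between(1, 5)                    # 1 <= Rating <= 5
reviews_ok = toy["Reviews"] <= toy["Installs"]             # Reviews <= Installs
free_ok = ~((toy["Type"] == "Free") & (toy["Price"] > 0))  # free apps cost 0

# Rows that fail at least one check.
violations = toy[~(rating_ok & reviews_ok & free_ok)]
print(violations)
```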


In [31]:
#Perform the sanity checks on the Reviews column
inp1[(inp1.Reviews > inp1.Installs)].shape

In [32]:
inp1 = inp1[inp1.Reviews <= inp1.Installs]

In [33]:
#perform the sanity checks on prices of free apps 
inp1[(inp1.Type == "Free") & (inp1.Price>0)]

#### Outliers Analysis Using Boxplot

In [34]:
#Create a box plot for the price column
plt.boxplot(inp1.Price)
plt.show()

In [35]:
#Check the apps with price more than 200
inp1[inp1.Price > 200]

In [36]:
#Clean the Price column
inp1 = inp1[inp1.Price < 200]

In [37]:
#Create a box plot for paid apps
inp1[inp1.Price>0].Price.plot.box()

In [38]:
#Check the apps with price more than 30
inp1[inp1.Price>30]

In [39]:
#Clean the Price column again
inp1 = inp1[inp1.Price <= 30]
inp1.shape

### Histograms


In [40]:
#Create a histogram of the Reviews
plt.hist(inp1.Reviews)
plt.show()

In [41]:
#Create a boxplot of the Reviews column
plt.boxplot(inp1.Reviews)
plt.show()


In [42]:
#Check records with more than 1 million reviews
inp1[inp1.Reviews > 1000000]

In [43]:
#Drop the above records
inp1 = inp1[inp1.Reviews <= 1000000]
inp1.shape


In [44]:
# Creating a histogram again and checking the peaks

In [45]:
plt.hist(inp1.Reviews)
plt.show()

In [46]:
#Create a box plot for the Installs column, find IQR

plt.boxplot(inp1.Installs)
plt.show()

In [47]:
np.percentile(inp1.Installs,75)-np.percentile(inp1.Installs,25)
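
For reference, here is a small sketch of how the IQR is commonly turned into an outlier threshold. The 1.5 × IQR fence is the usual box plot convention and an assumption here, not a rule this notebook applies; the `installs` array is made up.

```python
import numpy as np

# Hypothetical install counts with one extreme value.
installs = np.array([100, 1_000, 5_000, 10_000, 50_000, 100_000, 1_000_000_000])

# Interquartile range: distance between the 75th and 25th percentiles.
q1, q3 = np.percentile(installs, [25, 75])
iqr = q3 - q1

# Conventional upper fence used for box plot whiskers.
upper_fence = q3 + 1.5 * iqr
outliers = installs[installs > upper_fence]

print(iqr, upper_fence, outliers)
```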

In [48]:
#Clean Installs by removing all the apps having more than 100 million installs
inp1 = inp1[inp1.Installs <= 100000000]
inp1.shape

In [49]:
#Plotting a histogram for Size as well.
plt.hist(inp1.Size)
plt.show()

In [50]:
#Create a boxplot for the Size column and report back the median value
plt.boxplot(inp1.Size)
plt.show()
inp1.Size.median()

#### Distribution Plots

In [51]:
#Create a distribution plot for rating
sns.distplot(inp1.Rating)
plt.show()

In [52]:
#Change the number of bins
sns.distplot(inp1.Rating, bins=20)
plt.show()


In [53]:
#Change the colour of bins to green
sns.distplot(inp1.Rating, bins=20, color="g")
plt.show()

In [54]:
#Apply matplotlib functionalities
sns.distplot(inp1.Rating, bins=20, color="g")
plt.title("Distribution of app ratings", fontsize=12)
plt.show()

#### Styling Options

Below are some styling options that are available in Seaborn.

In [55]:
#Check all the styling options
?sns.set_style
sns.set_style("dark")
sns.distplot(inp1.Rating, bins=20, color="g")
plt.title("Distribution of app ratings", fontsize=12)
plt.show()

In [56]:
sns.set_style("white")
sns.distplot(inp1.Rating, bins=20, color="g")
plt.title("Distribution of app ratings", fontsize=12)
plt.show()

In [57]:
plt.style.available

In [58]:
plt.style.use("tableau-colorblind10")

In [59]:
#Change the number of bins to 20
sns.distplot(inp1.Rating, bins=20)
plt.show()

In [60]:
plt.style.use("ggplot")

In [61]:
sns.distplot(inp1.Rating, bins=20)
plt.show()

In [62]:
plt.style.use("dark_background")

In [63]:
sns.distplot(inp1.Rating, bins=20)
plt.show()

In [64]:
plt.style.use("default")
%matplotlib inline

In [65]:
sns.distplot(inp1.Rating, bins=20)
plt.show()

#### Pie-Chart and Bar Chart

To analyse how a numeric variable changes across the categories of a categorical variable, you can use either a pie chart or a bar chart.

In [66]:
#Analyse the Content Rating column
inp1['Content Rating'].value_counts()

In [67]:
#Remove the rows with values which are less represented 
inp1 = inp1[~inp1['Content Rating'].isin(["Adults only 18+","Unrated"])]
inp1.shape

In [68]:
#Reset the index
inp1.reset_index(inplace=True, drop=True)

In [69]:
#Check the apps belonging to different categories of Content Rating 
inp1['Content Rating'].value_counts()

In [70]:
#Plot a pie chart
inp1['Content Rating'].value_counts().plot.pie()
plt.show()

In [71]:
#Plot a bar chart
inp1['Content Rating'].value_counts().plot.bar()
plt.show()

In [72]:
#Plot the 10 categories that contain the most apps
top_categories = inp0["Category"].value_counts().head(10)
fig, ax = plt.subplots(figsize=(10, 8))
sns.barplot(x=top_categories.index, y=top_categories.values, ax=ax)
ax.set_title('Top 10 App Categories on Google Play Store', fontdict={'fontsize': 14, 'fontweight': 'bold'})
plt.xticks(rotation=45)
ax.set_ylabel('Number of apps', size=12)
plt.show()

#### Scatter Plots

In [73]:
###Size vs Rating

##Plot a scatter-plot in the matplotlib way between Size and Rating
plt.scatter(inp1.Size, inp1.Rating)
plt.show()

In [74]:
##Plot the same relationship as a seaborn jointplot (x and y passed as keywords)
sns.jointplot(x=inp1.Size, y=inp1.Rating)
plt.show()

In [75]:
## Plot a jointplot for Price and Rating
sns.jointplot(x=inp1.Price, y=inp1.Rating)
plt.show()

**Reg Plots**

- These are an extension to the jointplots, where a regression line is added to the view 
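
To make the idea of a regression line concrete, the sketch below fits a straight line by least squares with `np.polyfit`. The data are synthetic stand-ins for Price and Rating, and the generating slope and intercept are arbitrary assumptions, so the numbers say nothing about the real dataset.

```python
import numpy as np

# Synthetic data: x stands in for Price, y for Rating.
rng = np.random.default_rng(0)
x = rng.uniform(0, 30, size=200)
y = 4.2 - 0.01 * x + rng.normal(0, 0.3, size=200)

# Degree-1 polynomial fit = ordinary least-squares straight line.
slope, intercept = np.polyfit(x, y, deg=1)

print(f"fitted line: Rating = {intercept:.2f} + {slope:.3f} * Price")
```

A slope close to zero means the fitted line is nearly flat, i.e. the x variable explains little of the variation in y.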

In [76]:
##Plot a reg plot for Price and Rating and observe the trend
sns.jointplot(x=inp1.Price, y=inp1.Rating, kind="reg")
plt.show()

In [77]:
#Plot a reg plot for Price and Rating again for only the paid apps.
sns.jointplot(x="Price", y="Rating", data=inp1[inp1.Price>0], kind="reg")
plt.show()

**Pair Plots**

In [78]:
sns.pairplot(inp1[['Reviews', 'Size', 'Price','Rating']])
plt.show()

In the pair plot, read the `Rating` row to see how Reviews, Size and Price each relate to Rating: a narrow, roughly linear band of points indicates a strong relationship, while a randomly scattered cloud indicates a weak one.

**Bar Charts using groupby & estimator parameters**

In [79]:
##Plot a bar plot of Content Rating vs Average Rating 
inp1.groupby(['Content Rating'])['Rating'].mean().plot.bar()

In [80]:
##Plot the bar plot again with Median Rating
inp1.groupby(['Content Rating'])['Rating'].median().plot.bar()

In [81]:
##Plot the above bar plot using the estimator parameter
sns.barplot(data=inp1, x="Content Rating", y="Rating", estimator=np.median)
plt.show()

In [82]:
##Plot the bar plot with only the 5th percentile of Ratings
sns.barplot(data=inp1, x="Content Rating", y="Rating", estimator=lambda x: np.quantile(x,0.05))
plt.show()

In [83]:
##Plot the bar plot with the minimum Rating
sns.barplot(data=inp1, x="Content Rating", y="Rating", estimator=np.min)
plt.show()
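
The bar heights produced with the `estimator` parameter correspond to a per-group aggregation, which can be checked with plain pandas. The sketch below uses a made-up `toy` frame; `bar_heights` is an illustrative name.

```python
import pandas as pd

# Hypothetical toy data with two content-rating groups.
toy = pd.DataFrame({
    "Content Rating": ["Everyone", "Everyone", "Everyone", "Teen", "Teen", "Teen"],
    "Rating": [4.0, 4.5, 2.0, 3.0, 4.8, 4.2],
})

grouped = toy.groupby("Content Rating")["Rating"]

# One column per estimator used in the bar plots above.
bar_heights = pd.DataFrame({
    "median": grouped.median(),
    "p05": grouped.quantile(0.05),
    "min": grouped.min(),
})

print(bar_heights)
```

Low percentiles and the minimum highlight the worst-rated apps in each group, which the mean and median tend to hide.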

__Box Plots for comparing the spread and analysing a numerical variable across several categories__


In [84]:
##Plot a box plot of Rating vs Content Rating
plt.figure(figsize=[9,7])
sns.boxplot(x=inp1['Content Rating'], y=inp1.Rating)
plt.show()

In [85]:
##Plot a box plot for the Rating column only
sns.boxplot(x=inp1.Rating)
plt.show()

In [86]:
#Find the most popular Genres before plotting Ratings across the top 4
inp1['Genres'].value_counts()

In [87]:
#Plot a box plot of Ratings across the 4 most popular Genres
c = ['Tools','Entertainment','Medical','Education']
inp5 = inp1[inp1['Genres'].isin(c)]
sns.boxplot(x=inp5['Genres'], y=inp5.Rating)
plt.show()

#### Heat Maps

In [88]:
##Ratings vs Size vs Content Rating

##Prepare buckets for the Size column using pd.qcut
inp1['Size_Bucket'] = pd.qcut(inp1.Size, [0, 0.2, 0.4, 0.6, 0.8, 1], ["VL","L","M","H","VH"])
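
`pd.qcut` cuts at quantiles, so each bucket receives roughly the same number of rows. A minimal sketch on an invented `sizes` series shows the effect:

```python
import pandas as pd

# Hypothetical sizes: ten evenly spaced values.
sizes = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], name="Size")

# Five quantile-based buckets, labelled from very low to very high.
buckets = pd.qcut(sizes, [0, 0.2, 0.4, 0.6, 0.8, 1],
                  labels=["VL", "L", "M", "H", "VH"])

# Number of rows that landed in each bucket.
counts = buckets.value_counts(sort=False)
print(counts)
```

By contrast, `pd.cut` would split the value range into equal-width intervals, which can leave some buckets nearly empty when the data are skewed.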


In [89]:
##Create a pivot table for Size_Bucket and Content Rating with values set to Rating
pd.pivot_table(data=inp1, index="Content Rating", columns="Size_Bucket", values="Rating")

In [90]:
##Change the aggregation to median
pd.pivot_table(data=inp1, index="Content Rating", columns="Size_Bucket", values="Rating", aggfunc=np.median)

In [91]:
##Change the aggregation to 20th percentile
pd.pivot_table(data=inp1,index="Content Rating",columns="Size_Bucket",values="Rating",aggfunc=lambda x: np.quantile(x,0.2))

In [92]:
##Store the pivot table in a separate variable
res = pd.pivot_table(data=inp1,index="Content Rating",columns="Size_Bucket",values="Rating",aggfunc=lambda x: np.quantile(x,0.2))

In [93]:
##Plot a heat map
sns.heatmap(res)
plt.show()

In [94]:
##Apply customisations
sns.heatmap(res, cmap = "Greens", annot=True)
plt.show()

In [95]:
#Exercise: replace Content Rating with Review_buckets in the above heat map
##Keep the aggregation at minimum value for Rating
#Start by checking the column dtypes
inp1.dtypes
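
One possible way to approach this exercise is sketched below on synthetic data. The `toy` frame, the `Review_Bucket` column and the `res_min` name are assumptions made for illustration; on the real data, `pd.qcut` may need adjusting if the quantile edges of `Reviews` are not unique.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cleaned dataframe.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "Reviews": rng.integers(1, 100_000, size=300),
    "Size": rng.uniform(1, 100, size=300),
    "Rating": rng.uniform(1, 5, size=300).round(1),
})

qs = [0, 0.2, 0.4, 0.6, 0.8, 1]
labels = ["VL", "L", "M", "H", "VH"]
toy["Size_Bucket"] = pd.qcut(toy["Size"], qs, labels=labels)
toy["Review_Bucket"] = pd.qcut(toy["Reviews"], qs, labels=labels)

# Minimum Rating for every (Review_Bucket, Size_Bucket) combination.
res_min = pd.pivot_table(data=toy, index="Review_Bucket", columns="Size_Bucket",
                         values="Rating", aggfunc="min")
print(res_min)
```

The resulting table can be passed to `sns.heatmap(res_min, cmap="Greens", annot=True)` in the same way as `res` above.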

#### Line Plots

In [96]:
## Extract the month from the Last Updated Date
inp1['Last Updated'].head()

In [97]:
inp1['updated_month'] = pd.to_datetime(inp1['Last Updated']).dt.month
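
As a small illustration of the month extraction, the sketch below parses a few made-up date strings. The `"%B %d, %Y"` format (for example `January 7, 2018`) is an assumption about how `Last Updated` is written; without `format`, pandas infers the layout instead.

```python
import pandas as pd

# Hypothetical date strings in "Month day, year" form.
dates = pd.Series(["January 7, 2018", "June 20, 2018", "August 1, 2018"])

# Parse to datetimes, then pull out the month number (1-12).
months = pd.to_datetime(dates, format="%B %d, %Y").dt.month
print(months.tolist())
```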

In [98]:
## Find the average Rating across all the months
inp1.groupby(['updated_month'])['Rating'].mean()

In [99]:
## Plot a line graph
plt.figure(figsize=[10,5])
inp1.groupby(['updated_month'])['Rating'].mean().plot()
plt.show()

#### Stacked Bar Charts

- A stacked bar chart breaks down each bar of the bar chart on the basis of a different category
- For example, a bar chart of monthly installs can be split by Content Rating, so that each bar also shows the contribution of every content-rating group

![Stacked](images/stacked.png)

In [100]:
## Create a pivot table for Content Rating and updated Month with the values set to Installs
pd.pivot_table(data=inp1, values="Installs", index="updated_month", columns="Content Rating", aggfunc=sum)

In [101]:
##Store the table in a separate variable
monthly = pd.pivot_table(data=inp1, values="Installs", index="updated_month", columns="Content Rating", aggfunc=sum)

In [102]:
##Plot the stacked bar chart.
monthly.plot(kind="bar", stacked=True, figsize=[10,6])
plt.show()

In [103]:
##Plot the stacked bar chart again with respect to the proportions.
monthly_perc = monthly[["Everyone","Everyone 10+","Mature 17+","Teen"]].apply(lambda x: x/x.sum(), axis=1)
monthly_perc.plot(kind="bar", stacked=True, figsize=[10,6])
plt.show()
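
The row-wise normalisation can also be written with `DataFrame.div`, which avoids the `apply` with a lambda. The sketch below uses an invented `monthly_toy` table to show that each row then sums to 1.

```python
import pandas as pd

# Hypothetical pivot: installs per month (rows) and content rating (columns).
monthly_toy = pd.DataFrame(
    {"Everyone": [60, 10], "Teen": [30, 30], "Mature 17+": [10, 60]},
    index=pd.Index([1, 2], name="updated_month"),
)

# Divide every row by its own total so the values become proportions.
shares = monthly_toy.div(monthly_toy.sum(axis=1), axis=0)
print(shares)
```

Plotting `shares` as a stacked bar chart gives bars of equal height, which makes the mix of content ratings comparable across months.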

#### Plotly

Plotly is a Python library used for creating interactive visual charts. You can use it to create good-looking plots with user-friendly features such as hover and zoom.

Check out this link for installation and documentation: https://plot.ly/python/getting-started/

In [104]:
#Install plotly
!pip install plotly

In [105]:
#Take the table you want to plot in a separate variable
res = inp1.groupby(["updated_month"])[['Rating']].mean()
res.reset_index(inplace=True)
res

In [106]:
#Import the plotly libraries
import plotly.express as px

In [107]:
#Prepare the plot
fig = px.line(res, x="updated_month",y="Rating",title="Monthly average rating")
fig.show()