<font size = 6.5 color = "black">  Google Play Store Data Preparation & EDA </font>

<font size = 6 color = "blue"> Introduction </font>

<font size=4 color = "black"> Exploratory data analysis of Apps from </font> 
 <a href="https://www.kaggle.com/gauthamp10/google-playstore-apps">Google Play Store Apps</a>

<font size=4 color = "grey"> Most businesses now offer their services through mobile applications. Because competition among apps is intense, it is difficult to get a product in front of its target customers. It is therefore important to research the app categories in detail and find out how we would stack up against our competitors.
 
This dataset contains information about more than 1 million apps on the Google Play Store and can be used for exploratory data analysis. This notebook compares the performance of free versus paid apps in the top 8 app categories on the Google Play Store (excluding games):

- Education
- Business
- Music & Audio
- Tools
- Entertainment
- Lifestyle
- Books & Reference
- Food & Drink
    
 </font>


<font size=5 color = "black">Contains: </font>
<font size=4 color = "grey"> 
- Data preparation and transformation
- EDA with matplotlib and Dataprep.eda
</font>

<font size = 4.5 color = "blue">Library Setup and Read in Data</font>

In [None]:
#To install dataprep library
! pip install dataprep


In [None]:
#Base libraries
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Visual libraries
import matplotlib.pyplot as plt
import seaborn as sns
from dataprep.eda import plot, plot_correlation, plot_missing

<font size = 4 color = "blue"> Load the data </font>

In [None]:
df = pd.read_csv('../input/google-playstore-apps/Google-Playstore.csv')

<font size = 6 color = "blue"> Basic Exploration & Data Cleaning </font>

<font size=4 color = "grey"> We will start with a basic exploration of the dataset and get to know what the data looks like.</font>


<font size=5 color = "black"> Initial Overview </font>

In [None]:
# To display the top 5 rows
df.head(5)

In [None]:
# To display the bottom 5 rows
df.tail(5)

<font size=4 color = "grey"> Let's create a custom column called "distribution" to analyze the "Free" apps vs "Paid" apps.</font>

In [None]:
#create a function to define distribution.
def distribution(row):
    if row['Price']== 0:
        return 'Free'
    else:
        return 'Paid'
    


In [None]:
#Apply to dataframe, use axis=1 to apply the function to every row.
df['Distribution'] = df.apply(distribution, axis=1)
df.head()
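
<font size=4 color = "grey"> Row-wise apply calls a Python function once per row, which can be slow on a table of this size. A vectorized alternative is sketched below on a small made-up frame (the prices are illustrative, not taken from the dataset).</font>

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real 'Price' column (values are made up)
sample = pd.DataFrame({'Price': [0.0, 1.99, 0.0, 4.5]})

# Label every row in one vectorized pass instead of one function call per row
sample['Distribution'] = np.where(sample['Price'] == 0, 'Free', 'Paid')
print(sample)
```

<font size=4 color = "grey"> As with the row-wise function, any price that is not exactly 0 (including a missing price) is labelled "Paid".</font>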


In [None]:
df.info()

<font size=4 color = "grey">
There are 24 columns and several of them have missing values. We can create a single cell to list all the issues that need to be addressed and deal with them separately after the exploring process. Then, we can cross out each issue one by one as we fix them.
</font>

In [None]:
#checking the null values
df.isnull().sum()
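
<font size=4 color = "grey"> Raw counts are hard to judge on a table with over a million rows. One possible complement, shown here on a made-up frame, is the percentage of missing values per column.</font>

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (values are made up)
sample = pd.DataFrame({
    'Rating': [4.1, np.nan, 3.8, np.nan],
    'Installs': ['100+', '1,000+', None, '10+'],
    'Price': [0.0, 0.0, 1.99, 0.0],
})

# isnull().mean() is the fraction of missing values in each column
missing_pct = (sample.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```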

<font size=4 color = "grey"> 
Two columns have incorrect data types: Released and Size. Released should be a datetime. Size is probably stored as a string because each value contains the letter 'M' to indicate megabytes. These issues will be added to the list.</font>

<font size=4 color = "grey"> 
Some columns will not be useful for us: App ID, minimum and maximum installs, minimum android version, developer ID, website, email, and privacy policy link.</font>

<font size=4 color = "grey"> 
Let's look at the 'Category' column.
</font>

In [None]:
df['Category'].value_counts()

<font size=5 color = "black"> Issues for this dataset </font>

<font size=4 color = "grey"> 

- Missing values in several columns: rating, rating count, installs, minimum and maximum installs, currency and more.

- Drop these columns: app ID, minimum and maximum installs, minimum android version, developer ID, website, email, and privacy policy link.
- Incorrect data types for released and size.
- Music and education are each represented by two separate labels.
- Drop all categories except 8 selected for analysis.
- Convert all columns to snake_case.
</font>

<font size = 6 color = "blue"> Data Cleaning </font>

<font size=5 color = "black"> Convert all the columns to snake_case </font>

In [None]:
df.rename(lambda x: x.lower().strip().replace(' ', '_'), 
            axis='columns', inplace=True)

In [None]:
df.head(1)
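
<font size=4 color = "grey"> To see what the rename does, here is a small sketch on an empty frame. The column names are examples chosen for illustration, and to_snake_case is a hypothetical helper that mirrors the lambda above.</font>

```python
import pandas as pd

def to_snake_case(name):
    # Same steps as the lambda above: lowercase, trim, spaces to underscores
    return name.lower().strip().replace(' ', '_')

# Example column names (illustrative only)
sample = pd.DataFrame(columns=['App Name', 'Rating Count', ' Last Updated '])
sample = sample.rename(columns=to_snake_case)
print(list(sample.columns))
```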

<font size=5 color = "black"> Drop the irrelevant columns </font>

In [None]:
to_drop = [
        'developer_website', 'developer_email', 'privacy_policy'
]

# Drop them
df.drop(to_drop, axis='columns', inplace=True)

<font size = 6 color = "blue"> Assigning Data Types </font>

In [None]:
# Keep price as float: casting to int would truncate prices such as 1.99 to 1
df['price'] = df['price'].astype(float)
df['category'] = df['category'].astype('category')

In [None]:
# None of the dropped columns should remain in the dataframe
assert not any(col in df.columns for col in to_drop)

<font size=5 color = "black"> Convert released & last_updated to datetime </font>

In [None]:
# Specifying the datetime format significantly reduces conversion time.
# With an explicit format, infer_datetime_format is redundant (and deprecated in pandas 2.x).
df['released'] = pd.to_datetime(df['released'], format='%b %d, %Y',
                                 errors='coerce')
df['last_updated'] = pd.to_datetime(df['last_updated'], format='%b %d, %Y',
                                 errors='coerce')
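
<font size=4 color = "grey"> With errors='coerce', any value that does not match the format silently becomes NaT, so it is worth counting how many entries were lost. A minimal sketch on made-up strings:</font>

```python
import pandas as pd

# Made-up examples: two valid dates, one malformed string, one missing value
raw = pd.Series(['Feb 26, 2020', 'May 21, 2020', 'not a date', None])

parsed = pd.to_datetime(raw, format='%b %d, %Y', errors='coerce')
print(parsed)

# Entries that could not be parsed (or were missing) show up as NaT
n_unparsed = parsed.isna().sum()
print('Unparsed entries:', n_unparsed)
```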

<font size=3.5 color = "black"> check </font>

In [None]:
# Show both dtypes (a cell only displays its last expression)
df[['released', 'last_updated']].dtypes

In [None]:
df.category.dtype

<font size=5 color = "black"> Convert size to float </font>

In [None]:
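# Strip off all letters and convert to numeric.
# regex=True is required: recent pandas versions treat the pattern as a literal string by default.
df['size'] = pd.to_numeric(df['size'].str.replace(r'[a-zA-Z]+', '', regex=True), 
                             errors='coerce')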

<font size=3.5 color = "black"> check </font>

In [None]:
assert df['size'].dtype == 'float64'

<font size=3.5 color = "black"> No output means passed</font>
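
<font size=4 color = "grey"> Stripping the letters keeps the number but drops the unit, so '512k' and '512M' would end up as the same value. If the raw sizes mix suffixes (assumed here: k, M and G), a unit-aware conversion keeps them comparable. The sketch below uses made-up strings; size_to_mb and UNIT_TO_MB are hypothetical names, not part of the original notebook.</font>

```python
import pandas as pd

# Assumed multipliers to megabytes; adjust if the raw data uses other suffixes
UNIT_TO_MB = {'k': 1 / 1024, 'M': 1.0, 'G': 1024.0}

def size_to_mb(sizes):
    """Convert strings such as '10M' or '512k' to megabytes; anything else becomes NaN."""
    cleaned = sizes.astype(str).str.replace(',', '', regex=False)
    # Capture the numeric part and the single-letter unit suffix
    parts = cleaned.str.extract(r'^\s*([\d.]+)\s*([kMG])\s*$')
    numbers = pd.to_numeric(parts[0], errors='coerce')
    factors = parts[1].map(UNIT_TO_MB)
    return numbers * factors

raw = pd.Series(['10M', '512k', '1.5G', 'Varies with device', None])
converted = size_to_mb(raw)
print(converted)
```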

<font size=5 color = "black"> Collapse multiple categories into one </font>

In [None]:
# Collapse 'Music' and 'Music & Audio' into 'Music'
df['category'] = df['category'].str.replace('Music & Audio', 'Music')

In [None]:
# Collapse 'Educational' and 'Education' into 'Education'
df['category'] = df['category'].str.replace('Educational', 'Education')

In [None]:
# Check the values: `in` on a Series tests the index labels, not the data
assert not df['category'].isin(['Educational', 'Music & Audio']).any()
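
<font size=4 color = "grey"> str.replace works on substrings. An alternative that matches whole values only is a mapping passed to Series.replace, sketched here on a made-up Series (label_map is a hypothetical name).</font>

```python
import pandas as pd

# Hypothetical mapping from duplicate labels to the label we keep
label_map = {'Music & Audio': 'Music', 'Educational': 'Education'}

sample = pd.Series(['Music', 'Music & Audio', 'Educational', 'Education', 'Tools'])

# Replace whole values according to the mapping; other labels are left untouched
merged = sample.replace(label_map)

# isin checks the values themselves
assert not merged.isin(list(label_map)).any()
print(merged.value_counts())
```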

<font size=5 color = "black"> Create a subset with only the top 8 categories </font>

In [None]:
top_8_list = [
    'Education', 'Music', 'Business', 'Tools', 
    'Entertainment', 'Lifestyle', 'Food & Drink', 
    'Books & Reference'
]
top = df[df['category'].isin(top_8_list)].reset_index(drop=True)

In [None]:
# Every remaining row must belong to one of the selected categories
assert top['category'].isin(top_8_list).all()

In [None]:
top.info()

<font size=5 color = "black"> Dealing with missing values </font>

<font size=3.5 color = "black"> Dropping the duplicate rows</font>

In [None]:
# Total number of rows and columns
top.shape

In [None]:
# Rows containing duplicate data
duplicate_rows_top = top[top.duplicated()]
print('Number of duplicate rows: ', duplicate_rows_top.shape[0])

<font size=4 color = "grey"> Duplicate rows carry no additional information, so we can safely remove them.</font>

In [None]:
# Dropping the duplicates 
top = top.drop_duplicates()
top.head(5)

In [None]:
# Counting the number of rows after removing duplicates.
top.count()
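
<font size=4 color = "grey"> By default, duplicated() only flags rows that are identical in every column. If one column identifies an app (assumed here to be called app_id), checking that column alone can reveal repeated apps whose other values differ. A sketch on made-up data:</font>

```python
import pandas as pd

# Made-up rows: 'com.a' is repeated exactly, 'com.b' is repeated with a different rating
sample = pd.DataFrame({
    'app_id': ['com.a', 'com.b', 'com.a', 'com.b'],
    'rating': [4.0, 3.5, 4.0, 3.9],
})

full_dupes = sample.duplicated().sum()                # identical in every column
key_dupes = sample.duplicated(subset='app_id').sum()  # same identifier, any other values

# Keep the first occurrence of each identifier
deduped = sample.drop_duplicates(subset='app_id', keep='first')
print(full_dupes, key_dupes, len(deduped))
```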

In [None]:
plot_missing(top)

<font size=4 color = "grey"> Dropping the missing or null values.</font>

In [None]:
# Finding the null values.
print(top.isnull().sum())

In [None]:
# Dropping the missing values.
top = top.dropna() 
top.count()

<font size=4 color = "grey">Now we have removed all the rows which contain the Null or N/A values.</font>

In [None]:
# After dropping the values
print(top.isnull().sum())
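
<font size=4 color = "grey"> Calling dropna() without arguments removes a row if any column is missing, even a column the analysis never uses. A gentler option, sketched on made-up data, is to restrict the check to the columns that matter (the choice of columns in needed is an assumption for illustration).</font>

```python
import numpy as np
import pandas as pd

# Made-up rows with gaps in different columns
sample = pd.DataFrame({
    'rating': [4.2, np.nan, 3.9],
    'size': [12.0, 8.5, np.nan],
    'developer_id': ['dev1', None, 'dev3'],
})

needed = ['rating']  # columns the analysis actually relies on (illustrative choice)

strict = sample.dropna()                  # drops a row if any column is missing
targeted = sample.dropna(subset=needed)   # drops a row only if 'rating' is missing
print(len(strict), len(targeted))
```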

In [None]:
# index=False avoids writing the row index as an extra unnamed column
top.to_csv('Play_store_final.csv', index=False)

In [None]:
plot(top)

<font size=5 color = "black"> Univariate Analysis</font>

In [None]:
plot(top, 'category')

<font size=4 color = "grey">Among all the categories, "Education" has the highest count while "Food & Drink" has the lowest count.</font>

In [None]:
plot(top,'content_rating')

<font size=4 color = "grey">Among all the content ratings, "Everyone" has the highest count while "Adults only 18+" has the lowest count.</font>

In [None]:
plot(top,'installs')

<font size=4 color = "grey">Among all the different types of installs "1000+" has the highest count.</font>

In [None]:
plot_correlation(top)

<font size=4 color = "grey">In the above chart, no correlation is found between the numeric columns.</font>

<font size=5 color = "black"> Bivariate Analysis</font>

In [None]:
plot(top, 'category', 'rating')

In [None]:
# While Books & Reference does not have the highest installs, it has the highest median rating of all categories.

In [None]:
plot(top, 'category', 'price')

In [None]:
# The Tools category has the highest price outlier.

In [None]:
plot(top, 'category', 'installs')

<font size=4 color = "grey">Among all the categories "Education" has the highest number of installs.</font>

In [None]:
plot(top, 'category', 'size')

In [None]:
# The Education category has the largest size outliers.

In [None]:
plot(top, 'category', 'released')

<font size=4 color = "grey">Among all the categories "Education" has the highest number of releases over the years while "Food & Drink" has the lowest.</font>

In [None]:
plot(top, 'category', 'last_updated')

<font size=4 color = "grey">Among all the categories, "Education" has the highest number of apps updated over the years while "Food & Drink" has the lowest.</font>

<font size=5 color = "black"> Comparison of Free apps vs Paid apps</font>

<font size=4 color = "grey"> In the distribution column, we have two types of data "Free" & "Paid". A detailed analysis can be conducted based upon these two types with other columns in the dataframe. </font>

In [None]:
plot(top,'distribution')

<font size=4 color = "grey">From the above bar-chart, we found that we have 96% "Free" apps and only 4% "Paid" apps.</font>
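
<font size=4 color = "grey"> The shares can also be computed directly rather than read off the chart. A minimal sketch on a made-up Series (the 75/25 split below is illustrative, not the dataset's result):</font>

```python
import pandas as pd

# Made-up labels standing in for the 'distribution' column
sample = pd.Series(['Free', 'Free', 'Free', 'Paid'], name='distribution')

# normalize=True returns fractions instead of counts
share = sample.value_counts(normalize=True).mul(100).round(1)
print(share)
```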

In [None]:
plot(top, 'distribution', 'installs')

<font size=4 color = "grey">The huge difference in the number of installs could be explained by the cost of the apps: "Free" or "Paid".</font>

In [None]:
plot(top, 'category', 'distribution')

<font size=4 color = "grey">Similarly, when we compared the "Free" vs "Paid" apps among the 8 selected categories, it is found that people only pay for categories like "Books & Reference", "Education" and "Tools". </font>
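
<font size=4 color = "grey"> A cross-tabulation gives the paid share per category as numbers. The sketch below uses made-up rows, so the percentages are illustrative only.</font>

```python
import pandas as pd

# Made-up rows standing in for the 'category' and 'distribution' columns
sample = pd.DataFrame({
    'category': ['Tools', 'Tools', 'Education', 'Education', 'Business', 'Business'],
    'distribution': ['Free', 'Paid', 'Free', 'Free', 'Free', 'Free'],
})

# normalize='index' turns each row (category) into fractions that sum to 1
table = pd.crosstab(sample['category'], sample['distribution'], normalize='index')
paid_share = table['Paid'].mul(100)
print(paid_share)
```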

In [None]:
plot(top, 'distribution', 'rating')

<font size=4 color = "grey">In the above box plot of rating by distribution, the "Free" apps have a median rating of 3.4 while the "Paid" apps have a median rating of 0.</font>
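
<font size=4 color = "grey"> A median of 0 suggests that at least half of the apps in that group have a rating of exactly 0. One way to check this is to compute the share of zero ratings per group, sketched here on made-up values.</font>

```python
import pandas as pd

# Made-up ratings; zeros are over-represented in the 'Paid' group on purpose
sample = pd.DataFrame({
    'distribution': ['Free', 'Free', 'Paid', 'Paid', 'Paid'],
    'rating': [4.1, 0.0, 0.0, 0.0, 4.5],
})

# Mean of a boolean Series is the fraction of True values within each group
zero_share = sample['rating'].eq(0).groupby(sample['distribution']).mean().mul(100)
print(zero_share)
```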

In [None]:
plot(top, 'distribution', 'released')

<font size=4 color = "grey">In the above line chart, there is a gradual increase in the total number of released "Free" apps while the "Paid" apps remained constant throughout the timeline.</font>

In [None]:
plot(top, 'distribution', 'last_updated')

<font size=4 color = "grey"> In the above line chart, there is a gradual increase in the total number of last updated "Free" apps while the "Paid" apps remained constant throughout the timeline.</font>

In [None]:
plot(top, 'distribution', 'size')

<font size=4 color = "grey"> The median size for "Free" apps is 8.2MB while the median size for "Paid" apps is 11MB.</font>
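
<font size=4 color = "grey"> The medians quoted from the box plots can be reproduced with a groupby. The sketch below runs on made-up values, so its output will not match the figures above.</font>

```python
import pandas as pd

# Made-up values standing in for the cleaned dataframe
sample = pd.DataFrame({
    'distribution': ['Free', 'Free', 'Free', 'Paid', 'Paid'],
    'rating': [3.0, 4.0, 5.0, 0.0, 4.0],
    'size': [5.0, 8.0, 20.0, 10.0, 12.0],
})

# Median rating and size for each distribution type
medians = sample.groupby('distribution')[['rating', 'size']].median()
print(medians)
```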