## **Introduction**
While many public datasets (on Kaggle and elsewhere) provide Apple App Store data, few comparable datasets exist for Google Play Store apps. Digging deeper, I found that the iTunes App Store page uses a neatly indexed, appendix-like structure that makes web scraping simple. The Google Play Store, on the other hand, relies on more sophisticated techniques (such as dynamic page loading with jQuery), which makes scraping more challenging.

## **Import dataset and libraries**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings("ignore")
sns.set_theme(style="whitegrid")

data = pd.read_csv('/kaggle/input/google-play-store-apps/googleplaystore.csv')
data_reviews = pd.read_csv('/kaggle/input/google-play-store-apps/googleplaystore_user_reviews.csv')


## **Data Preprocessing**

In [None]:
data.head()

In [None]:
# Data shape
print("Shape: %d rows, %d columns" % data.shape)

In [None]:
data.info()

In [None]:
# check for missing values
print(data.isnull().sum())

In [None]:
total_null = data.isnull().sum().sum()
# Share of missing cells among all cells (rows x columns), not among rows
missing_values_percentage = (total_null / data.size) * 100
print(f'Total missing values: {total_null} ({round(missing_values_percentage, 2)}%)')
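
A single overall figure hides where the gaps are. As a minimal sketch, assuming a small made-up frame in place of `data` (the `toy` frame and its values are hypothetical), the share of missing values can also be reported per column and per row:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data`; the values are made up for illustration.
toy = pd.DataFrame({
    "App": ["a", "b", "c", "d"],
    "Rating": [4.1, np.nan, 3.9, np.nan],
    "Type": ["Free", "Free", None, "Paid"],
})

# isnull() gives a boolean frame; the mean of booleans is the fraction of True values.
missing_pct_by_column = toy.isnull().mean() * 100

# Fraction of rows that contain at least one missing value.
rows_with_missing_pct = toy.isnull().any(axis=1).mean() * 100

print(missing_pct_by_column.round(2))
print(f"Rows with at least one missing value: {rows_with_missing_pct:.2f}%")
```

The per-row figure is the one that matters for `dropna(axis=0)`, since that call discards every row with at least one missing cell.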

In [None]:
# Handle missing values by dropping every row that contains at least one
data.dropna(axis=0, inplace=True)

In [None]:
# Drop duplicate apps, keeping the first occurrence of each app name
data = data.drop_duplicates(subset='App')
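
`drop_duplicates(subset='App')` keeps whichever duplicate row happens to come first. If the duplicates differ (for example in their review counts), a different row may be preferable. The sketch below is one possible alternative, not what this notebook does. The `toy` frame is made up, and it assumes `Reviews` is already numeric:

```python
import pandas as pd

# Toy frame; app names and review counts are made up for illustration.
toy = pd.DataFrame({
    "App": ["Alpha", "Alpha", "Beta"],
    "Reviews": [100, 250, 40],
})

# Default behaviour: the first row seen for each app name is kept.
first_kept = toy.drop_duplicates(subset="App")

# Alternative: sort by Reviews so the largest count is last, keep that row,
# then restore the original row order.
most_reviewed = (
    toy.sort_values("Reviews")
       .drop_duplicates(subset="App", keep="last")
       .sort_index()
)

print(first_kept)
print(most_reviewed)
```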

In [None]:
# Strip formatting symbols ("+", ",", "$") so the values can be parsed as numbers
columns = ['Installs', 'Price']
chars = ["+", ",", "$"]

for column in columns:
    for char in chars:
        # regex=False treats "+" and "$" as literal characters, not regex operators
        data[column] = data[column].str.replace(char, "", regex=False)

# Replace underscores with spaces in the Category column
data['Category'] = data['Category'].str.replace("_", " ", regex=False)

In [None]:
# change dtype of Reviews and Price columns
data['Reviews'] = data['Reviews'].astype('float')
data['Price'] = data['Price'].astype('float')

In [None]:
data['App'].nunique()

In [None]:
data.describe()

## **Data Exploration and Visualization**

### Paid vs Free Apps

In [None]:
app_free = data[data['Type']=='Free']
app_paid = data[data['Type']=='Paid']

print('Total Free Apps: %d' % len(app_free))
print('Total Paid Apps: %d' % len(app_paid))

sns.countplot(x=data['Type'], palette='coolwarm_r')
plt.title('Total Paid and Free Apps in Google Play Store')
plt.show()
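
The counts and the percentage split can also be read off in one step with `value_counts`. This is a small sketch on made-up data (the `toy` frame is hypothetical), not the notebook's actual output:

```python
import pandas as pd

# Toy column standing in for data['Type']; values are made up for illustration.
toy = pd.DataFrame({"Type": ["Free", "Free", "Free", "Paid"]})

# Absolute number of apps per type.
counts = toy["Type"].value_counts()

# normalize=True returns fractions instead of counts; scale to percent.
shares = toy["Type"].value_counts(normalize=True) * 100

print(counts)
print(shares.round(1))
```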

In [None]:
# Summary statistics (mean, min, max, ...) for paid apps
app_paid.describe()

In [None]:
app_paid.sort_values(by='Price', ascending=False)

### The distribution of apps across different categories

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x=data['Category'])

plt.title('Distribution of Apps Across Different Categories')
plt.xticks(rotation=90)
plt.show()

In [None]:
# Count apps per category; value_counts() already sorts in descending order
data_sorted = data['Category'].value_counts().rename_axis('Category').reset_index(name='count')
data_sorted.head(10)

### The distribution of user ratings

In [None]:
sns.histplot(data=data, x='Rating', bins=11)

plt.title('Distribution of User Ratings')
plt.show()

### Which apps have the highest number of installs and rating?

In [None]:
data_sorted_installs = data.sort_values(by=['Installs','Rating'], ascending=False).reset_index()
data_sorted_installs['App'].head(10)
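
One thing worth checking here: the preprocessing cells cast only `Reviews` and `Price` to numbers, so `Installs` appears to still be a string column at this point. Strings are compared character by character, which can put "500000000" ahead of "1000000000". The sketch below uses made-up rows (the `toy` frame and the `Installs_num` column are hypothetical) to show the difference and one way to sort numerically:

```python
import pandas as pd

# Toy rows; app names and values are made up for illustration.
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Installs": ["1000000000", "500000000", "50000"],
    "Rating": [4.3, 4.7, 4.9],
})

# Sorting the string column: "5..." sorts above "1...", regardless of length.
by_text = toy.sort_values(by=["Installs", "Rating"], ascending=False)

# Converting to numbers first gives the ordering by actual install count.
toy["Installs_num"] = pd.to_numeric(toy["Installs"])
by_number = toy.sort_values(by=["Installs_num", "Rating"], ascending=False)

print(by_text["App"].tolist())
print(by_number["App"].tolist())
```

If `Installs` is converted this way before sorting, the top of the ranking may differ from the one produced by the text-based sort.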

### The distribution of content rating

In [None]:
# Count apps per content rating; value_counts() already sorts in descending order
content_rating = data['Content Rating'].value_counts().rename_axis('Content Rating').reset_index(name='count')
print(content_rating)

plt.figure(figsize=(12,6))
sns.countplot(x=data['Content Rating'])

plt.title('Distribution of Apps Content Rating')
plt.show()

### Correlations between numerical variables, such as 'Rating', 'Reviews', and 'Price'.

In [None]:
sns.heatmap(data[['Rating', 'Reviews', 'Price']].corr(), annot=True, fmt=".2f")
plt.title('Correlation between Rating, Reviews and Price')
plt.show()
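
The heatmap relies on `DataFrame.corr()`, whose default method is Pearson's r, a measure of linear association. As a hedged cross-check, a rank-based coefficient such as Spearman's can be computed with the same call. The sketch below uses made-up numbers (the `toy` frame is hypothetical) to show how the two can differ when a relationship is monotonic but not linear:

```python
import pandas as pd

# Toy data: y grows with x, but not along a straight line (y = x ** 3).
x = [1, 2, 3, 4, 5]
toy = pd.DataFrame({"x": x, "y": [v ** 3 for v in x]})

# Pearson (the default) measures how well a straight line fits.
pearson = toy.corr().loc["x", "y"]

# Spearman works on ranks, so any strictly increasing relationship scores 1.
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Whether this changes the picture for `Rating`, `Reviews` and `Price` would need to be checked on the actual data.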

## **Summary**
* #### There are 7588 free apps and 602 paid apps, so free apps outnumber paid apps by roughly 12 to 1. "Free" does not always mean entirely free, though: some of these apps offer in-app purchases.
* #### The average rating is 4.17 across all 8190 apps, which suggests that users are generally happy with their apps.
* #### The minimum price of a paid app is \\$0.99 and the maximum is \\$400.
    > Wow?! I was wondering what kind of app costs \\$400; it turns out to be "I'm Rich - Trump Edition", with 10k installs.
* #### Top 5 categories: Family (1,607), Game (912), Tools (717), Finance (302), Lifestyle (301)
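* #### Which apps have the highest number of installs and rating? The top five rows returned by the sort above (note that `Installs` is still compared as text there, not as a number):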
    1. Clean Master- Space Cleaner & Antivirus
    2. Security Master - Antivirus, VPN, AppLock, Boo...
    3. Google Duo - High Quality Video Calls
    4. SHAREit - Transfer & Share
    5. UC Browser - Fast Download Private & Secure
* #### Based on the heatmap above, there is little to no correlation between the Rating, Reviews and Price features.