# Introduction

This is an investigation into which categories are more likely to make a 'free' app successful. I won't be including games, as most games are just rehashes of older ones.

In [None]:
from mpl_toolkits.mplot3d import Axes3D
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import math
import seaborn as sns

print(os.listdir('../input/google-playstore-apps/'))

In [None]:
nRowsRead = None
df = pd.read_csv('../input/google-playstore-apps/Google-Playstore.csv')

print(df.info())
df.head()

# Cleaning

## Removing irrelevant columns
The remaining columns are sufficient to identify each app and its characteristics.

In [None]:
df = df.drop(['App Id', 'Currency', 'Developer Website', 'Developer Email', 'Privacy Policy', 'Size', 'Price'], axis=1)

## Removing paid apps
I'm only investigating free apps, so paid apps are dropped.

In [None]:
df = df[df['Free'] != False]

## Removing irrelevant categories
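These categories are all related to gaming ('Music' and 'Educational' here are game genres, separate from the 'Music & Audio' and 'Education' app categories).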

In [None]:
irrelevant_cat = ['Puzzle', 
                  'Arcade', 
                  'Simulation', 
                  'Action', 
                  'Adventure', 
                  'Racing', 
                  'Role Playing', 
                  'Board', 
                  'Strategy', 
                  'Casino', 
                  'Card', 
                  'Word', 
                  'Sports', 
                  'Trivia',
                  'Casual',
                  'Music',
                  'Educational']

# Keep only the rows whose category is not in the gaming list
df = df[~df['Category'].isin(irrelevant_cat)]

## Dropping missing values

In [None]:
df.dropna(inplace=True)

## Converting date format

In [None]:
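# An explicit format makes infer_datetime_format redundant (and it is deprecated in pandas 2.x);
# dates that do not match become NaT because of errors='coerce'
df['Released'] = pd.to_datetime(df['Released'], format='%b %d, %Y', errors='coerce')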

# Plots

## Number of installations

I would consider an app that has achieved at least 1 million downloads a successful app. This figure shows that fewer than 3% of all free non-game apps achieve this.
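
The share itself can be read off as a single number rather than from the bars. Below is a minimal sketch on made-up data; the helper name `share_at_least` and the toy values are my own assumptions, not part of the dataset or of the analysis above.

```python
import pandas as pd

def share_at_least(installs: pd.Series, threshold: int = 1_000_000) -> float:
    """Return the fraction of apps whose install bucket is >= threshold."""
    # A boolean Series averages to the proportion of True values.
    return float((installs >= threshold).mean())

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'Minimum Installs': [10, 100, 500, 1_000, 5_000, 50_000, 1_000_000, 10_000_000],
})

toy_share = share_at_least(toy['Minimum Installs'])
print(f"{toy_share:.1%} of the toy apps reach the threshold")
```

On the real data, `share_at_least(df['Minimum Installs'])` should give the proportion that the green bars add up to.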

In [None]:
plt.rcParams.update({'font.size': 12, 'figure.figsize': (8, 8)})
plt.ylabel('Number of Installs')
plt.xlabel('Percentage')
plus_mill = ['tab:blue' if (x < 1000000.0) else 'tab:green' for x in df['Minimum Installs'].value_counts().sort_index().keys().tolist()]
df['Minimum Installs'].value_counts(normalize=True).sort_index().plot(kind="barh", 
                                                                      title='Proportion of the Number of Installs (Non-Games)', 
                                                                      color=plus_mill);
plt.gca().invert_yaxis()

## Correlation between ratings and installs
There is a positive correlation between app ratings and the number of installs. Looking at the apps with high ratings will help narrow down what makes an app successful.
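
The correlation used here is Spearman's, which only looks at the ordering of the values. That suits install counts, since they come in coarse buckets spanning many orders of magnitude. As a rough illustration on invented numbers (the toy frame below is an assumption, not real data), Spearman's coefficient is the same as Pearson's coefficient computed on the ranks:

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'Minimum Installs': [10, 100, 1_000, 1_000_000, 50_000_000],
    'Rating': [3.9, 4.1, 4.0, 4.4, 4.3],
})

# Spearman correlation, as in the heatmap cell.
spearman = toy.corr(method='spearman').loc['Minimum Installs', 'Rating']

# Pearson correlation of the ranks gives the same number.
pearson_on_ranks = toy.rank().corr(method='pearson').loc['Minimum Installs', 'Rating']

print(spearman, pearson_on_ranks)
```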

In [None]:
rating_installation = df[['Minimum Installs', 'Rating']]

corr = rating_installation.corr(method='spearman')

fig, ax = plt.subplots(figsize=(9, 4))
ax = sns.heatmap(
    corr, 
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True,
    annot=True,
    fmt=".1n",
    linewidths=.5
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right',
    
);
ax.set_yticklabels(
    ax.get_yticklabels(),
    rotation=0,
    horizontalalignment='right',
);
plt.title("Ratings and Installs Correlation Matrix")
ax

## Average rating per category

Not a lot of insight here, but with 937,442 apps there’s going to be a lot of variance.

In [None]:
all_apps = df.groupby('Category')['Rating'].mean()

plt.rcParams.update({'font.size': 12, 'figure.figsize': (8, 8)})
plt.ylabel('Category')
plt.xlabel('Average Rating')
all_apps.sort_values(ascending=False).plot(kind="barh", title='Average Rating per Category');
plt.gca().invert_yaxis()

Narrowing down to the apps that have at least 1 million installs, the average rating for those popular apps is around 4.3.

In [None]:
million = df[df['Minimum Installs'] >= 1000000]

plt.rcParams.update({'font.size': 12, 'figure.figsize': (10, 12)})
plt.ylabel('Rating')
plt.xlabel('Percentage')
million['Rating'].value_counts(normalize=True).sort_index().plot(kind="barh", title='Proportion of Ratings with Apps over 1 Million');
plt.gca().invert_yaxis()

## Average rating per category (Over 1 million installs)

Apps with a lot of installs are generally well received.

The lower-rated apps tend to be banking or TV apps. (Congrats to Android TV Home for being the lowest-rated app (1.1) that’s brave enough to display its rating.)

Apps with higher ratings tend to be in books & reference, health & fitness and education. (Congrats to Skill Academy by Ruangguru for having over 365,000 ratings and an average of 5.0.)
 
Of all the successful apps, each category seems to have a decent average rating.


In [None]:
average = million.groupby('Category')['Rating'].mean()

plt.rcParams.update({'font.size': 12, 'figure.figsize': (10, 12)})
plt.ylabel('Category')
plt.xlabel('Average Rating (Mean)')
average.sort_values(ascending=False).plot(kind="barh", title='The Average Rating of each Category with over a Million Installs');
plt.gca().invert_yaxis()

## Ratings over 4.0

In [None]:
pd.set_option("display.max_rows", 1020)

#v_good_rating = million[(million['Rating'] >= 4.8) & (million['Rating'] <= 5.0)]
good_rating = million[million['Rating'] >= 4.0]
#bad_rating = million[(million['Rating'] <= 2.2) & (million['Rating'] >= 0.1)]


'Tools' apps are the dominant category here, which suggests that when they are done right, they consistently perform really well.

In [None]:
plt.rcParams.update({'font.size': 12, 'figure.figsize': (10, 12)})
plt.ylabel('Category')
plt.xlabel('Percentage')
good_rating['Category'].value_counts(normalize=True).plot(kind="barh", title='Proportion of Categories (High Rating & over a Million Installs)');
plt.gca().invert_yaxis()

## Average number of installs per category

Apps under communication are the highest by far, with an average of 60 million installs. Apps under productivity are also notably high.

In [None]:
average_install = good_rating.groupby('Category')['Minimum Installs'].mean()
plt.axes().set_facecolor("white")
plt.rcParams.update({'font.size': 12, 'figure.figsize': (6, 8)})
plt.ylabel('Category')
plt.xlabel('Installs per 10 million')

average_install.sort_values().plot(kind="barh", title='Average Number of Installs per Category');


## Average installs per age rating

Apps that come under a "Teen" content rating tend to get a larger number of installs.

This is quite interesting, as there are nearly 10 times as many apps with the content rating "Everyone" as with "Teen".
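
Because the group sizes are so different, it helps to look at the count next to the average. The sketch below uses invented numbers (the toy frame is an assumption); I also include the median as an extra check of my own, since a mean over a small group can be pulled up by a handful of very large apps.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'Content Rating': ['Everyone', 'Everyone', 'Everyone', 'Everyone', 'Teen', 'Teen'],
    'Minimum Installs': [1_000_000, 1_000_000, 5_000_000, 1_000_000,
                         10_000_000, 50_000_000],
})

# One row per content rating: how many apps, and their mean and median installs.
summary = toy.groupby('Content Rating')['Minimum Installs'].agg(['count', 'mean', 'median'])
print(summary)
```

The same `agg` call on `good_rating` would put the counts and the averages in one table.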

In [None]:
age_install = good_rating.groupby('Content Rating')['Minimum Installs'].mean()

plt.axes().set_facecolor("white")
plt.rcParams.update({'font.size': 12, 'figure.figsize': (5, 4)})
plt.ylabel('Content Rating')
plt.xlabel('Installs per 10 million')
age_install.sort_index().plot(kind="barh", title='Average Number of Installs per Content Rating');
plt.gca().invert_yaxis()
plt.savefig("Age rating", transparent=False, bbox_inches="tight")

In [None]:
good_rating["Content Rating"].value_counts()

## Editors Choice

Of all apps with at least 1 million downloads and a rating of 4.0 or above, only 229 of 15,440 (about 1.5%) are labeled as Editors Choice. When an app manages to get selected as an Editors Choice, it’s pretty much guaranteed to be successful.

In [None]:
editor_install = good_rating.groupby('Editors Choice')['Minimum Installs'].mean()

plt.axes().set_facecolor("white")
plt.rcParams.update({'font.size': 12, 'figure.figsize': (10, 4)})
plt.ylabel('Editors Choice')
plt.xlabel('Installs per 10 million')
editor_install.sort_index().plot(kind="barh", title='Editors Choice Average Number of Installs');
plt.gca().invert_yaxis()
plt.savefig("Editors Choice", facecolor="white", transparent=False, bbox_inches="tight")

In [None]:
good_rating["Editors Choice"].value_counts()

## Editors choice apps

The data shows a bias in the categories of apps that get selected as Editors Choice: over 60% of all apps labeled as Editors Choice fall under the categories 'Education' and 'Health & Fitness'.
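
A combined share like this can be computed directly instead of being added up from the bars. This is only a sketch on made-up data; the helper `category_share` and the toy Series are my own assumptions.

```python
import pandas as pd

def category_share(categories: pd.Series, names: list) -> float:
    """Return the fraction of rows whose category is one of `names`."""
    # isin gives a boolean mask; its mean is the proportion of matches.
    return float(categories.isin(names).mean())

# Toy data, invented for illustration only.
toy = pd.Series([
    'Education', 'Education', 'Education',
    'Health & Fitness', 'Health & Fitness', 'Health & Fitness',
    'Tools', 'Productivity', 'Communication', 'Photography',
])

toy_share = category_share(toy, ['Education', 'Health & Fitness'])
print(f"{toy_share:.0%}")
```

On the real data the Series would be the 'Category' column of the Editors Choice subset.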

In [None]:
app_editors = good_rating[good_rating['Editors Choice'] != False]

plt.rcParams.update({'font.size': 12, 'figure.figsize': (8, 8)})
plt.ylabel('Category')
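plt.xlabel('Percentage')
# normalize=True so the bars show shares, matching the axis label and the other proportion plots
app_editors['Category'].value_counts(normalize=True).plot(kind="barh", title='Editors Choice (High Rating, over a Million Installs)');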
plt.gca().invert_yaxis()

## Ad supported

Apps without ads get a higher average number of installs.

In [None]:
ad_install = good_rating.groupby('Ad Supported')['Minimum Installs'].mean()
in_app_install = good_rating.groupby('In App Purchases')['Minimum Installs'].mean()

In [None]:
plt.rcParams.update({'font.size': 12, 'figure.figsize': (7, 4)})
plt.ylabel('Ad Supported')
plt.xlabel('Installs per 10 million')
ad_install.sort_index().plot(kind="barh", title='Ad Supported Average Number of Installs');
plt.gca().invert_yaxis()

## In App Purchases

Apps without in-app purchases also tend to get more installs, but the difference is smaller than the one between apps with and without ads.
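
To compare the two effects as numbers, the gap in average installs can be computed for each flag. A minimal sketch on invented data follows; the helper name `install_gap` and the toy values are assumptions for illustration.

```python
import pandas as pd

def install_gap(frame: pd.DataFrame, flag: str) -> float:
    """Mean installs of apps without `flag` minus mean installs of apps with it."""
    has_flag = frame[flag].astype(bool)
    without = frame.loc[~has_flag, 'Minimum Installs'].mean()
    with_flag = frame.loc[has_flag, 'Minimum Installs'].mean()
    return float(without - with_flag)

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'Ad Supported':     [True, True, False, False],
    'In App Purchases': [True, False, True, False],
    'Minimum Installs': [1_000_000, 3_000_000, 9_000_000, 11_000_000],
})

ad_gap = install_gap(toy, 'Ad Supported')
iap_gap = install_gap(toy, 'In App Purchases')
print(ad_gap, iap_gap)
```

A positive gap means apps without the feature average more installs; in the toy data the gap for ads is larger than the gap for in-app purchases.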

In [None]:
plt.rcParams.update({'font.size': 12, 'figure.figsize': (7, 4)})
plt.ylabel('In App Purchases')
plt.xlabel('Installs per 10 million')
in_app_install.sort_index().plot(kind="barh", title='In App Purchases Average Number of Installs');
plt.gca().invert_yaxis()

## Fast growing apps

These are all the apps that were released after 1 October 2020 (and before the end of that year) and have already achieved at least 1 million downloads.

In [None]:
recent_apps = million[(million['Released'] > '2020-10-01') & (million['Released'] < '2020-12-31')]
recent_apps.sort_values(by=['Released'], ascending=False)

# Conclusion

Overall, it looks like apps with elements of communication or productivity tend to get more downloads on average. However, going for education or health & fitness apps will make it more likely to get featured as Editors Choice.

Apps with a teen rating tend to be successful, but this may be because communication apps tend to require a teen rating.
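
One way this suspicion could be checked is a cross-tabulation of category against content rating. The sketch below runs on made-up data (the toy frame is an assumption), so it only shows the shape of the check, not a result.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'Category': ['Communication', 'Communication', 'Communication',
                 'Tools', 'Tools', 'Tools'],
    'Content Rating': ['Teen', 'Teen', 'Everyone',
                       'Everyone', 'Everyone', 'Teen'],
})

# normalize='index' turns each row into shares, so every category sums to 1.
shares = pd.crosstab(toy['Category'], toy['Content Rating'], normalize='index')
teen_share = shares['Teen']
print(teen_share)
```

If communication apps had a much higher "Teen" share than other categories on the real data, that would support the explanation above.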

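Apps with in-app purchases tend to be more successful than ad-supported apps: going without either is associated with more installs, but the gap is smaller for in-app purchases than for ads.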