# Introduction

In this notebook, we will do a comprehensive analysis of the Android app market by comparing thousands of apps in the Google Play store.

# About the Dataset of Google Play Store Apps & Reviews

**Data Source:** <br>
App and review data was scraped from the Google Play Store by Lavanya Gupta in 2018. Original files listed [here](https://www.kaggle.com/lava18/google-play-store-apps).

# Import Statements

In [1]:
import pandas as pd
import plotly.express as px


# Notebook Presentation

In [2]:
# Show numeric output in decimal format e.g., 2.15
pd.options.display.float_format = '{:,.2f}'.format

# Read the Dataset

In [3]:
df_apps = pd.read_csv('data/apps.csv')

# Data Cleaning

**Challenge**: How many rows and columns does `df_apps` have? What are the column names? Look at a random sample of 5 different rows with [.sample()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html).

In [4]:
print(df_apps.shape)

(10841, 12)


In [5]:
print(df_apps.head())

                                            App         Category  Rating  \
0                       Ak Parti Yardım Toplama           SOCIAL     NaN   
1                    Ain Arabic Kids Alif Ba ta           FAMILY     NaN   
2  Popsicle Launcher for Android P 9.0 launcher  PERSONALIZATION     NaN   
3                     Command & Conquer: Rivals           FAMILY     NaN   
4                                    CX Network         BUSINESS     NaN   

   Reviews  Size_MBs Installs  Type   Price Content_Rating           Genres  \
0        0      8.70        0  Paid  $13.99           Teen           Social   
1        0     33.00        0  Paid   $2.99       Everyone        Education   
2        0      5.50        0  Paid   $1.49       Everyone  Personalization   
3        0     19.00        0   NaN       0   Everyone 10+         Strategy   
4        0     10.00        0  Free       0       Everyone         Business   

     Last_Updated         Android_Ver  
0   July 28, 2017          4

In [6]:
print(df_apps.sample(5))

                                                   App           Category  \
3123  Listen to the story~The Story of the Fairy Tales          PARENTING   
1148                                         Alumni BJ             SOCIAL   
164                          CHOSEN - EV Smart Charger  AUTO_AND_VEHICLES   
366                     CL 2ne1 Wallpaper KPOP HD Best    PERSONALIZATION   
1813                           Diabetes & Diet Tracker            MEDICAL   

      Rating  Reviews  Size_MBs Installs  Type  Price Content_Rating  \
3123    4.70       69      5.80    5,000  Free      0       Everyone   
1148     NaN        2      5.40      100  Free      0       Everyone   
164      NaN        1     19.00       10  Free      0       Everyone   
366      NaN        3      3.40       10  Free      0       Everyone   
1813    4.60      395     19.00    1,000  Paid  $9.99       Everyone   

               Genres      Last_Updated   Android_Ver  
3123        Parenting     July 10, 2018  4.0.3 a

### Drop Unused Columns

**Challenge**: Remove the columns called `Last_Updated` and `Android_Ver` from the DataFrame. We will not use these columns.

In [7]:
df_apps.drop(['Last_Updated', 'Android_Ver'], axis=1, inplace=True)
print(df_apps.head())

                                            App         Category  Rating  \
0                       Ak Parti Yardım Toplama           SOCIAL     NaN   
1                    Ain Arabic Kids Alif Ba ta           FAMILY     NaN   
2  Popsicle Launcher for Android P 9.0 launcher  PERSONALIZATION     NaN   
3                     Command & Conquer: Rivals           FAMILY     NaN   
4                                    CX Network         BUSINESS     NaN   

   Reviews  Size_MBs Installs  Type   Price Content_Rating           Genres  
0        0      8.70        0  Paid  $13.99           Teen           Social  
1        0     33.00        0  Paid   $2.99       Everyone        Education  
2        0      5.50        0  Paid   $1.49       Everyone  Personalization  
3        0     19.00        0   NaN       0   Everyone 10+         Strategy  
4        0     10.00        0  Free       0       Everyone         Business  


### Find and Remove NaN values in Ratings

**Challenge**: How many rows have a NaN value (not-a-number) in the `Rating` column? Create a DataFrame called `df_apps_clean` that does not include these rows.

In [8]:
nan_rows = df_apps[df_apps.Rating.isna()]
print(nan_rows.shape)
print(nan_rows.head())

(1474, 10)
                                            App         Category  Rating  \
0                       Ak Parti Yardım Toplama           SOCIAL     NaN   
1                    Ain Arabic Kids Alif Ba ta           FAMILY     NaN   
2  Popsicle Launcher for Android P 9.0 launcher  PERSONALIZATION     NaN   
3                     Command & Conquer: Rivals           FAMILY     NaN   
4                                    CX Network         BUSINESS     NaN   

   Reviews  Size_MBs Installs  Type   Price Content_Rating           Genres  
0        0      8.70        0  Paid  $13.99           Teen           Social  
1        0     33.00        0  Paid   $2.99       Everyone        Education  
2        0      5.50        0  Paid   $1.49       Everyone  Personalization  
3        0     19.00        0   NaN       0   Everyone 10+         Strategy  
4        0     10.00        0  Free       0       Everyone         Business  


In [9]:
df_apps_clean = df_apps.dropna()
print(df_apps_clean.shape)

(9367, 10)
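
Note that `.dropna()` without arguments removes a row if *any* column is missing. In this dataset the row count works out the same (10841 - 1474 = 9367), but restricting the check to the `Rating` column states the intent more precisely. A minimal sketch on made-up toy data (the `toy` DataFrame and its values are hypothetical, not taken from `apps.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT the real apps.csv
toy = pd.DataFrame({
    'App': ['A', 'B', 'C', 'D'],
    'Rating': [4.5, np.nan, 3.9, np.nan],
    'Type': ['Free', 'Paid', None, 'Free'],
})

# Count the rows with a missing rating
missing_ratings = toy.Rating.isna().sum()

# Drops a row if ANY column is missing (row C goes too, because Type is missing)
any_col = toy.dropna()

# Drops a row only if Rating is missing (row C is kept)
rating_only = toy.dropna(subset=['Rating'])

print(missing_ratings)
print(any_col.shape, rating_only.shape)
```
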


### Find and Remove Duplicates

**Challenge**: Are there any duplicates in data? Check for duplicates using the [.duplicated()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html) function. How many entries can you find for the "Instagram" app? Use [.drop_duplicates()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html) to remove any duplicates from `df_apps_clean`. 


In [10]:
duplicated_rows = df_apps_clean[df_apps_clean.duplicated()]
print(duplicated_rows.shape)
print(duplicated_rows.head())

(476, 10)
                                             App Category  Rating  Reviews  \
946                       420 BZ Budeze Delivery  MEDICAL    5.00        2   
1133                                 MouseMingle   DATING    2.70        3   
1196  Cardiac diagnosis (heart rate, arrhythmia)  MEDICAL    4.40        8   
1231                                Sway Medical  MEDICAL    5.00        3   
1247              Chat Kids - Chat Room For Kids   DATING    4.70        6   

      Size_MBs Installs  Type   Price Content_Rating   Genres  
946      11.00      100  Free       0     Mature 17+  Medical  
1133      3.90      100  Free       0     Mature 17+   Dating  
1196      6.50      100  Paid  $12.99       Everyone  Medical  
1231     22.00      100  Free       0       Everyone  Medical  
1247      4.90      100  Free       0     Mature 17+   Dating  


In [11]:
print(df_apps_clean[df_apps_clean.App == 'Instagram'])

             App Category  Rating   Reviews  Size_MBs       Installs  Type  \
10806  Instagram   SOCIAL    4.50  66577313      5.30  1,000,000,000  Free   
10808  Instagram   SOCIAL    4.50  66577446      5.30  1,000,000,000  Free   
10809  Instagram   SOCIAL    4.50  66577313      5.30  1,000,000,000  Free   
10810  Instagram   SOCIAL    4.50  66509917      5.30  1,000,000,000  Free   

      Price Content_Rating  Genres  
10806     0           Teen  Social  
10808     0           Teen  Social  
10809     0           Teen  Social  
10810     0           Teen  Social  


In [12]:
# Dropping exact duplicate rows is not enough: entries that differ only in Reviews remain
df_apps_clean = df_apps_clean.drop_duplicates()
print(df_apps_clean[df_apps_clean.App == 'Instagram'])

             App Category  Rating   Reviews  Size_MBs       Installs  Type  \
10806  Instagram   SOCIAL    4.50  66577313      5.30  1,000,000,000  Free   
10808  Instagram   SOCIAL    4.50  66577446      5.30  1,000,000,000  Free   
10810  Instagram   SOCIAL    4.50  66509917      5.30  1,000,000,000  Free   

      Price Content_Rating  Genres  
10806     0           Teen  Social  
10808     0           Teen  Social  
10810     0           Teen  Social  


In [13]:
# We need to specify the subset of columns used for identifying duplicates
df_apps_clean = df_apps_clean.drop_duplicates(subset=['App', 'Type', 'Price'])
print(df_apps_clean[df_apps_clean.App == 'Instagram'])

             App Category  Rating   Reviews  Size_MBs       Installs  Type  \
10806  Instagram   SOCIAL    4.50  66577313      5.30  1,000,000,000  Free   

      Price Content_Rating  Genres  
10806     0           Teen  Social  


In [14]:
print(df_apps_clean.shape)

(8199, 10)
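
The difference between the two `.drop_duplicates()` calls above can be reproduced on a small scale. This is only a sketch with invented rows (the `toy` DataFrame is hypothetical), showing that exact-row matching keeps entries whose review counts differ, while matching on a `subset` of columns keeps just the first entry per app:

```python
import pandas as pd

# Hypothetical toy data: three Instagram rows, two of them identical
toy = pd.DataFrame({
    'App': ['Instagram', 'Instagram', 'Instagram', 'Maps'],
    'Reviews': [100, 100, 105, 50],
    'Type': ['Free', 'Free', 'Free', 'Free'],
    'Price': ['0', '0', '0', '0'],
})

# Only rows that match in EVERY column count as duplicates
exact = toy.drop_duplicates()

# Rows that match on App, Type and Price count as duplicates;
# the first occurrence is kept by default (keep='first')
by_key = toy.drop_duplicates(subset=['App', 'Type', 'Price'])

print(len(exact), len(by_key))
```
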


# Find Highest Rated Apps

**Challenge**: Identify which apps are the highest rated. What problem might you encounter if you rely on ratings alone to determine the quality of an app?

In [15]:
print(df_apps_clean.sort_values('Rating', ascending=False).head())

                      App     Category  Rating  Reviews  Size_MBs Installs  \
21    KBA-EZ Health Guide      MEDICAL    5.00        4     25.00        1   
1230         Sway Medical      MEDICAL    5.00        3     22.00      100   
1227    AJ Men's Grooming    LIFESTYLE    5.00        2     22.00      100   
1224       FK Dedinje BGD       SPORTS    5.00       36      2.60      100   
1223      CB VIDEO VISION  PHOTOGRAPHY    5.00       13      2.60      100   

      Type Price Content_Rating       Genres  
21    Free     0       Everyone      Medical  
1230  Free     0       Everyone      Medical  
1227  Free     0       Everyone    Lifestyle  
1224  Free     0       Everyone       Sports  
1223  Free     0       Everyone  Photography  
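
All of the apps listed above have a perfect rating but only a handful of reviews. One possible way to account for this, sketched here on invented data, is to require a minimum number of reviews before ranking by rating. The `MIN_REVIEWS` threshold is an arbitrary assumption chosen for illustration, not a value from the notebook:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'App': ['Tiny', 'Niche', 'Popular', 'Huge'],
    'Rating': [5.0, 5.0, 4.7, 4.5],
    'Reviews': [2, 4, 12000, 900000],
})

MIN_REVIEWS = 1000  # arbitrary cut-off, an assumption for this sketch

# Keep only apps with enough reviews, then rank by rating
trusted = toy[toy.Reviews >= MIN_REVIEWS].sort_values('Rating', ascending=False)
print(trusted)
```
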


# Find 5 Largest Apps in terms of Size (MBs)

**Challenge**: What's the size in megabytes (MB) of the largest Android apps in the Google Play Store? Based on the data, do you think there could be a limit in place or can developers make apps as large as they please?

In [16]:
print(df_apps_clean.sort_values('Size_MBs', ascending=False).head())

                                  App            Category  Rating  Reviews  \
9942   Talking Babsy Baby: Baby Games           LIFESTYLE    4.00   140995   
10687          Hungry Shark Evolution                GAME    4.50  6074334   
9943            Miami crime simulator                GAME    4.00   254518   
9944     Gangster Town: Vice District              FAMILY    4.30    65146   
3144                       Vi Trainer  HEALTH_AND_FITNESS    3.60      124   

       Size_MBs     Installs  Type Price Content_Rating  \
9942     100.00   10,000,000  Free     0       Everyone   
10687    100.00  100,000,000  Free     0           Teen   
9943     100.00   10,000,000  Free     0     Mature 17+   
9944     100.00   10,000,000  Free     0     Mature 17+   
3144     100.00        5,000  Free     0       Everyone   

                       Genres  
9942   Lifestyle;Pretend Play  
10687                  Arcade  
9943                   Action  
9944               Simulation  
3144         Hea

# Find the 5 Apps with the Most Reviews

**Challenge**: Which apps have the highest number of reviews? Are there any paid apps among the top 50?

In [17]:
print(df_apps_clean.sort_values('Reviews', ascending=False).head(50))

                                                     App            Category  \
10805                                           Facebook              SOCIAL   
10785                                 WhatsApp Messenger       COMMUNICATION   
10806                                          Instagram              SOCIAL   
10784           Messenger – Text and Video Chat for Free       COMMUNICATION   
10650                                     Clash of Clans                GAME   
10744            Clean Master- Space Cleaner & Antivirus               TOOLS   
10835                                     Subway Surfers                GAME   
10828                                            YouTube       VIDEO_PLAYERS   
10746  Security Master - Antivirus, VPN, AppLock, Boo...               TOOLS   
10584                                       Clash Royale                GAME   
10763                                   Candy Crush Saga                GAME   
10770        UC Browser - Fast Download 

# Plotly Pie and Donut Charts - Visualise Categorical Data: Content Ratings

In [18]:
ratings = df_apps_clean.Content_Rating.value_counts()
print(ratings)

Everyone           6621
Teen                912
Mature 17+          357
Everyone 10+        305
Adults only 18+       3
Unrated               1
Name: Content_Rating, dtype: int64


In [19]:
# `names` sets the slice labels; `labels` expects a dict for renaming columns
fig = px.pie(names=ratings.index, values=ratings.values)
fig.show()

In [22]:
fig = px.pie(names=ratings.index, values=ratings.values, title="Content Rating")
fig.update_traces(textposition='outside', textinfo='percent+label')
fig.show()

In [24]:
fig = px.pie(names=ratings.index, values=ratings.values, title="Content Rating", hole=0.6)
fig.update_traces(textposition='inside', textfont_size=15, textinfo='percent')
fig.show()

# Numeric Type Conversion: Examine the Number of Installs

**Challenge**: How many apps had over 1 billion (that's right - BILLION) installations? How many apps just had a single install? 

Check the datatype of the Installs column.

Count the number of apps at each level of installations. 

Convert the number of installations (the Installs column) to a numeric data type. Hint: this is a 2-step process. You'll have to make sure you remove non-numeric characters first.

In [25]:
print(df_apps_clean.Installs.describe())

count          8199
unique           19
top       1,000,000
freq           1417
Name: Installs, dtype: object


In [26]:
print(df_apps_clean.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8199 entries, 21 to 10835
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             8199 non-null   object 
 1   Category        8199 non-null   object 
 2   Rating          8199 non-null   float64
 3   Reviews         8199 non-null   int64  
 4   Size_MBs        8199 non-null   float64
 5   Installs        8199 non-null   object 
 6   Type            8199 non-null   object 
 7   Price           8199 non-null   object 
 8   Content_Rating  8199 non-null   object 
 9   Genres          8199 non-null   object 
dtypes: float64(2), int64(1), object(7)
memory usage: 704.6+ KB
None


In [27]:
print(df_apps_clean[['App', 'Installs']].groupby('Installs').count())

                App
Installs           
1                 3
1,000           698
1,000,000      1417
1,000,000,000    20
10               69
10,000          988
10,000,000      933
100             303
100,000        1096
100,000,000     189
5                 9
5,000           425
5,000,000       607
50               56
50,000          457
50,000,000      202
500             199
500,000         504
500,000,000      24


In [28]:
df_apps_clean.Installs = df_apps_clean.Installs.astype(str).str.replace(',', '')
df_apps_clean.Installs = pd.to_numeric(df_apps_clean.Installs)
print(df_apps_clean[['App', 'Installs']].groupby('Installs').count())

             App
Installs        
1              3
5              9
10            69
50            56
100          303
500          199
1000         698
5000         425
10000        988
50000        457
100000      1096
500000       504
1000000     1417
5000000      607
10000000     933
50000000     202
100000000    189
500000000     24
1000000000    20


# Find the Most Expensive Apps, Filter out the Junk, and Calculate a (ballpark) Sales Revenue Estimate

Let's examine the Price column more closely.

**Challenge**: Convert the price column to numeric data. Then investigate the top 20 most expensive apps in the dataset.

Remove all apps that cost more than $250 from the `df_apps_clean` DataFrame.

Add a column called 'Revenue_Estimate' to the DataFrame. This column should hold the price of the app times the number of installs. What are the top 10 highest grossing paid apps according to this estimate? Out of the top 10 highest grossing paid apps, how many are games?


In [31]:
print(df_apps_clean.Price.describe())

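# regex=False treats '$' as a literal character rather than a regex end-of-string anchor
df_apps_clean.Price = df_apps_clean.Price.astype(str).str.replace('$', '', regex=False)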
df_apps_clean.Price = pd.to_numeric(df_apps_clean.Price)

print(df_apps_clean.sort_values('Price', ascending=False).head(20))

count     8199
unique      73
top          0
freq      7595
Name: Price, dtype: object
                                 App   Category  Rating  Reviews  Size_MBs  \
3946        I'm Rich - Trump Edition  LIFESTYLE    3.60      275      7.30   
2461              I AM RICH PRO PLUS    FINANCE    4.00       36     41.00   
4606               I Am Rich Premium    FINANCE    4.10     1867      4.70   
3145              I am rich(premium)    FINANCE    3.50      472      0.94   
3554                      💎 I'm rich  LIFESTYLE    3.80      718     26.00   
5765                       I am rich  LIFESTYLE    3.80     3547      1.80   
1946  I am rich (Most expensive app)    FINANCE    4.10      129      2.70   
2775                   I Am Rich Pro     FAMILY    4.40      201      2.70   
3221                  I am Rich Plus     FAMILY    4.00      856      8.70   
3114                       I am Rich    FINANCE    4.30      180      3.80   
1331          most expensive app (H)     FAMILY    4.30


The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.



### The most expensive apps sub $250

In [33]:
df_apps_clean = df_apps_clean[df_apps_clean['Price'] < 250]
print(df_apps_clean.sort_values('Price', ascending=False).head(50))

                                                    App             Category  \
2281                          Vargo Anesthesia Mega App              MEDICAL   
1407                                       LTC AS Legal              MEDICAL   
2629                                   I am Rich Person            LIFESTYLE   
2481                            A Manual of Acupuncture              MEDICAL   
2463                                 PTA Content Master              MEDICAL   
2207                                           EMT PASS              MEDICAL   
4264                            Golfshot Plus: Golf GPS               SPORTS   
504                           AP Art History Flashcards               FAMILY   
4772   Human Anatomy Atlas 2018: Complete 3D Human Body              MEDICAL   
3241  Muscle Premium - Human Anatomy, Kinesiology, B...              MEDICAL   
2119                                         NewTek NDI          PHOTOGRAPHY   
4470                                  DR

### Highest Grossing Paid Apps (ballpark estimate)

In [34]:
df_apps_clean['Revenue_Estimate'] = df_apps_clean.Installs.mul(df_apps_clean.Price)
print(df_apps_clean.sort_values('Revenue_Estimate', ascending=False)[:10])

                                App     Category  Rating  Reviews  Size_MBs  \
9220                      Minecraft       FAMILY    4.50  2376564     19.00   
8825                  Hitman Sniper         GAME    4.60   408292     29.00   
7151  Grand Theft Auto: San Andreas         GAME    4.40   348962     26.00   
7477            Facetune - For Free  PHOTOGRAPHY    4.40    49553     48.00   
7977        Sleep as Android Unlock    LIFESTYLE    4.50    23966      0.85   
6594            DraStic DS Emulator         GAME    4.60    87766     12.00   
6082                   Weather Live      WEATHER    4.50    76593      4.75   
7954                    Bloons TD 5       FAMILY    4.60   190086     94.00   
7633        Five Nights at Freddy's         GAME    4.60   100805     50.00   
6746     Card Wars - Adventure Time       FAMILY    4.30   129603     23.00   

      Installs  Type  Price Content_Rating                     Genres  \
9220  10000000  Paid   6.99   Everyone 10+  Arcade;Action

# Plotly Bar Charts & Scatter Plots: Analysing App Categories

### Vertical Bar Chart - Highest Competition (Number of Apps)

### Horizontal Bar Chart - Most Popular Categories (Highest Downloads)

### Category Concentration - Downloads vs. Competition

**Challenge**: 
* First, create a DataFrame that has the number of apps in one column and the number of installs in another:

<img src=https://imgur.com/uQRSlXi.png width="350">

* Then use the [plotly express examples from the documentation](https://plotly.com/python/line-and-scatter/) alongside the [.scatter() API reference](https://plotly.com/python-api-reference/generated/plotly.express.scatter.html) to create a scatter plot that looks like this.

<img src=https://imgur.com/cHsqh6a.png>

*Hint*: Use the size, hover_name and color parameters in .scatter(). To scale the yaxis, call .update_layout() and specify that the yaxis should be on a log-scale like so: yaxis=dict(type='log') 

# Extracting Nested Data from a Column

**Challenge**: How many different types of genres are there? Can an app belong to more than one genre? Check what happens when you use .value_counts() on a column with nested values. See if you can work around this problem by using the .split() function and the DataFrame's [.stack() method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.stack.html).
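
A minimal sketch of the split-and-stack idea, using an invented Series rather than the real `Genres` column (the `genres` values below are hypothetical, modelled on the semicolon-separated entries seen in the output earlier):

```python
import pandas as pd

# Hypothetical toy data with semicolon-separated (nested) genres
genres = pd.Series([
    'Arcade;Action',
    'Action',
    'Lifestyle;Pretend Play',
    'Arcade',
    'Action',
])

# value_counts() treats each combined string as its own genre
raw_counts = genres.value_counts()

# Split into one column per genre, then stack the columns into a single Series
stacked = genres.str.split(';', expand=True).stack()

# Now every individual genre is counted separately
genre_counts = stacked.value_counts()
print(genre_counts)
```
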


# Colour Scales in Plotly Charts - Competition in Genres

**Challenge**: Can you create this chart with the Series containing the genre data? 

<img src=https://imgur.com/DbcoQli.png width=400>

Try experimenting with the built-in colour scales in Plotly. You can find a full list [here](https://plotly.com/python/builtin-colorscales/).

* Find a way to set the colour scale using the color_continuous_scale parameter. 
* Find a way to make the color axis disappear by using coloraxis_showscale. 

# Grouped Bar Charts: Free vs. Paid Apps per Category

**Challenge**: Use the plotly express bar [chart examples](https://plotly.com/python/bar-charts/#bar-chart-with-sorted-or-ordered-categories) and the [.bar() API reference](https://plotly.com/python-api-reference/generated/plotly.express.bar.html#plotly.express.bar) to create this bar chart: 

<img src=https://imgur.com/LE0XCxA.png>

You'll want to use the `df_free_vs_paid` DataFrame that you created above that has the total number of free and paid apps per category. 

See if you can figure out how to get the look above by changing the `categoryorder` to 'total descending' as outlined in the documentation [here](https://plotly.com/python/categorical-axes/#automatically-sorting-categories-by-name-or-total-value).

# Plotly Box Plots: Lost Downloads for Paid Apps

**Challenge**: Create a box plot that shows the number of Installs for free versus paid apps. How does the median number of installations compare? Is the difference large or small?

Use the [Box Plots Guide](https://plotly.com/python/box-plots/) and the [.box API reference](https://plotly.com/python-api-reference/generated/plotly.express.box.html) to create the following chart. 

<img src=https://imgur.com/uVsECT3.png>


# Plotly Box Plots: Revenue by App Category

**Challenge**: See if you can generate the chart below: 

<img src=https://imgur.com/v4CiNqX.png>

Looking at the hover text, how much does the median app earn in the Tools category? If developing an Android app costs $30,000 or thereabouts, does the average photography app recoup its development costs?

Hint: I've used 'min ascending' to sort the categories. 

# How Much Can You Charge? Examine Paid App Pricing Strategies by Category

**Challenge**: What is the median price for a paid app? Then compare pricing by category by creating another box plot. But this time examine the prices (instead of the revenue estimates) of the paid apps. I recommend using `{'categoryorder':'max descending'}` to sort the categories.
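
Before plotting, the numbers behind the box plot can be inspected with plain pandas. The sketch below uses invented prices (the `toy` and `df_paid` names and all values are hypothetical) and assumes `Price` has already been converted to a numeric column:

```python
import pandas as pd

# Hypothetical toy data with a numeric Price column
toy = pd.DataFrame({
    'Category': ['MEDICAL', 'MEDICAL', 'GAME', 'GAME', 'GAME', 'TOOLS'],
    'Type': ['Paid', 'Paid', 'Paid', 'Paid', 'Free', 'Paid'],
    'Price': [9.99, 24.99, 0.99, 2.99, 0.0, 1.99],
})

# Keep only paid apps so free apps do not drag the median down to zero
df_paid = toy[toy.Type == 'Paid']

# Overall median price of a paid app
median_price = df_paid.Price.median()

# Median price per category, most expensive category first
median_by_category = df_paid.groupby('Category').Price.median().sort_values(ascending=False)

print(median_price)
print(median_by_category)
```
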