# Introduction

In this notebook, we analyse the Android app market by comparing thousands of apps in the Google Play Store.

# About the Dataset of Google Play Store Apps & Reviews

**Data Source:** <br>
App and review data was scraped from the Google Play Store by Lavanya Gupta in 2018. The original files are listed [here](https://www.kaggle.com/lava18/google-play-store-apps).

# Import Statements

In [6]:
import pandas as pd
import plotly.express as px


# Notebook Presentation

In [49]:
# Show numeric output in decimal format e.g., 2.15
pd.options.display.float_format = '{:,.2f}'.format

# Read the Dataset

In [50]:
df_apps = pd.read_csv('apps.csv')

# Data Cleaning

**Challenge**: How many rows and columns does `df_apps` have? What are the column names? Look at a random sample of 5 different rows with [.sample()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html).

In [51]:
df_apps.shape

(10841, 12)

In [52]:
df_apps.columns

Index(['App', 'Category', 'Rating', 'Reviews', 'Size_MBs', 'Installs', 'Type',
       'Price', 'Content_Rating', 'Genres', 'Last_Updated', 'Android_Ver'],
      dtype='object')

In [53]:
df_apps.sample(5)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
3902,How do I care about my child?,PARENTING,,34,4.9,10000,Free,0,Everyone,Parenting,"July 3, 2018",4.0 and up
5136,Tennis 24 - tennis live scores,SPORTS,4.6,990,9.4,100000,Free,0,Everyone,Sports,"July 18, 2018",4.1 and up
10831,Google News,NEWS_AND_MAGAZINES,3.9,877635,13.0,1000000000,Free,0,Teen,News & Magazines,"August 1, 2018",4.4 and up
7584,Fertility Friend Ovulation App,HEALTH_AND_FITNESS,4.5,12955,7.2,1000000,Free,0,Teen,Health & Fitness,"June 27, 2018",4.4 and up
5514,XXX DISTILLERY,BUSINESS,4.2,509,3.2,100000,Free,0,Everyone,Business,"August 2, 2017",4.0 and up


### Drop Unused Columns

**Challenge**: Remove the columns called `Last_Updated` and `Android_Ver` from the DataFrame. We will not use these columns.

In [54]:
# drop() returns a new DataFrame; the result is not assigned here, so df_apps itself is unchanged
df_apps.drop(['Last_Updated','Android_Ver'], axis=1)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
0,Ak Parti Yardım Toplama,SOCIAL,,0,8.70,0,Paid,$13.99,Teen,Social
1,Ain Arabic Kids Alif Ba ta,FAMILY,,0,33.00,0,Paid,$2.99,Everyone,Education
2,Popsicle Launcher for Android P 9.0 launcher,PERSONALIZATION,,0,5.50,0,Paid,$1.49,Everyone,Personalization
3,Command & Conquer: Rivals,FAMILY,,0,19.00,0,,0,Everyone 10+,Strategy
4,CX Network,BUSINESS,,0,10.00,0,Free,0,Everyone,Business
...,...,...,...,...,...,...,...,...,...,...
10836,Subway Surfers,GAME,4.50,27723193,76.00,1000000000,Free,0,Everyone 10+,Arcade
10837,Subway Surfers,GAME,4.50,27724094,76.00,1000000000,Free,0,Everyone 10+,Arcade
10838,Subway Surfers,GAME,4.50,27725352,76.00,1000000000,Free,0,Everyone 10+,Arcade
10839,Subway Surfers,GAME,4.50,27725352,76.00,1000000000,Free,0,Everyone 10+,Arcade
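
To make the column removal stick, the result of `.drop()` has to be assigned to a variable. A minimal sketch, assuming a one-row toy DataFrame (`toy` and `trimmed` are hypothetical names, not part of the notebook):

```python
import pandas as pd

# Hypothetical one-row stand-in for df_apps
toy = pd.DataFrame({
    'App': ['Example App'],
    'Last_Updated': ['July 3, 2018'],
    'Android_Ver': ['4.0 and up'],
})

# drop() leaves `toy` untouched and returns a new DataFrame
trimmed = toy.drop(columns=['Last_Updated', 'Android_Ver'])

print(list(toy.columns))
print(list(trimmed.columns))
```

Using `columns=` instead of `axis=1` is only a readability choice; both forms select columns.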


### Find and Remove NaN values in Ratings

**Challenge**: How many rows have a NaN value (not-a-number) in the `Rating` column? Create a DataFrame called `df_apps_clean` that does not include these rows.

In [55]:
df_apps.isna().sum()

App                  0
Category             0
Rating            1474
Reviews              0
Size_MBs             0
Installs             0
Type                 1
Price                0
Content_Rating       0
Genres               0
Last_Updated         0
Android_Ver          2
dtype: int64

In [56]:
# dropna() without arguments removes rows with a NaN in any column, not only in Rating
df_apps_clean=df_apps.dropna()

In [58]:
df_apps_clean.isna().sum()

App               0
Category          0
Rating            0
Reviews           0
Size_MBs          0
Installs          0
Type              0
Price             0
Content_Rating    0
Genres            0
Last_Updated      0
Android_Ver       0
dtype: int64
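
If only the missing ratings should be removed, `dropna()` accepts a `subset` argument. A minimal sketch on made-up data (`toy` and `toy_clean` are hypothetical names):

```python
import pandas as pd

# Hypothetical toy data standing in for df_apps (not the real dataset)
toy = pd.DataFrame({
    'App': ['A', 'B', 'C', 'D'],
    'Rating': [4.5, None, 3.9, None],
    'Android_Ver': ['4.0 and up', '4.1 and up', None, '5.0 and up'],
})

# Count the missing values in the Rating column only
n_missing = toy['Rating'].isna().sum()

# Drop rows with a missing Rating; rows missing other fields are kept
toy_clean = toy.dropna(subset=['Rating'])

print(n_missing)
print(toy_clean)
```

Here app `C` survives even though its `Android_Ver` is missing, which a bare `dropna()` would not allow.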

### Find and Remove Duplicates

**Challenge**: Are there any duplicates in the data? Check for duplicates using the [.duplicated()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html) method. How many entries can you find for the "Instagram" app? Use [.drop_duplicates()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html) to remove any duplicates from `df_apps_clean`.


In [16]:
df_apps_clean.duplicated().sum()

474

In [59]:
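# drop_duplicates() returns a new DataFrame; without an assignment df_apps_clean keeps its duplicates
df_apps_clean.drop_duplicates()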

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
21,KBA-EZ Health Guide,MEDICAL,5.00,4,25.00,1,Free,0,Everyone,Medical,"August 2, 2018",4.0.3 and up
28,Ra Ga Ba,GAME,5.00,2,20.00,1,Paid,$1.49,Everyone,Arcade,"February 8, 2017",2.3 and up
47,Mu.F.O.,GAME,5.00,2,16.00,1,Paid,$0.99,Everyone,Arcade,"March 3, 2017",2.3 and up
82,Brick Breaker BR,GAME,5.00,7,19.00,5,Free,0,Everyone,Arcade,"July 23, 2018",4.1 and up
99,Anatomy & Physiology Vocabulary Exam Review App,MEDICAL,5.00,1,4.60,5,Free,0,Everyone,Medical,"August 2, 2018",4.0 and up
...,...,...,...,...,...,...,...,...,...,...,...,...
10835,Subway Surfers,GAME,4.50,27722264,76.00,1000000000,Free,0,Everyone 10+,Arcade,"July 12, 2018",4.1 and up
10836,Subway Surfers,GAME,4.50,27723193,76.00,1000000000,Free,0,Everyone 10+,Arcade,"July 12, 2018",4.1 and up
10837,Subway Surfers,GAME,4.50,27724094,76.00,1000000000,Free,0,Everyone 10+,Arcade,"July 12, 2018",4.1 and up
10838,Subway Surfers,GAME,4.50,27725352,76.00,1000000000,Free,0,Everyone 10+,Arcade,"July 12, 2018",4.1 and up


In [60]:
df_apps_clean[df_apps_clean.App=='Instagram']

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social,"July 31, 2018",Varies with device
10808,Instagram,SOCIAL,4.5,66577446,5.3,1000000000,Free,0,Teen,Social,"July 31, 2018",Varies with device
10809,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social,"July 31, 2018",Varies with device
10810,Instagram,SOCIAL,4.5,66509917,5.3,1000000000,Free,0,Teen,Social,"July 31, 2018",Varies with device
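
The Instagram rows differ only in their review counts, so an exact-row comparison does not catch all of them. A sketch on made-up rows, assuming that entries sharing `App`, `Type` and `Price` describe the same app (that choice of key columns is an assumption, not something the dataset guarantees):

```python
import pandas as pd

# Hypothetical rows modelled on the Instagram output above
toy = pd.DataFrame({
    'App': ['Instagram', 'Instagram', 'Instagram', 'Other'],
    'Type': ['Free', 'Free', 'Free', 'Free'],
    'Price': ['0', '0', '0', '0'],
    'Reviews': [66577313, 66577446, 66577313, 10],
})

# Exact-row duplicates only: the second row differs in Reviews, so it survives
exact = toy.drop_duplicates()

# Treat rows with the same App, Type and Price as one app; keep the first
by_app = toy.drop_duplicates(subset=['App', 'Type', 'Price'])

print(len(exact), len(by_app))
```

Assigning the result (here to `exact` and `by_app`) is what makes the removal take effect.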


# Find Highest Rated Apps

**Challenge**: Identify which apps are the highest rated. What problem might you encounter if you rely on ratings alone to determine the quality of an app?

In [61]:
df_apps_clean.sort_values(by='Rating',ascending=False).head()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
99,Anatomy & Physiology Vocabulary Exam Review App,MEDICAL,5.0,1,4.6,5,Free,0,Everyone,Medical,"August 2, 2018",4.0 and up
541,AJ RETAILS,SHOPPING,5.0,9,0.17,10,Free,0,Everyone,Shopping,"June 1, 2018",2.2 and up
382,DG OFF - 100% Free Coupons & Deals,SHOPPING,5.0,1,1.1,10,Free,0,Everyone,Shopping,"July 11, 2018",4.0 and up
392,Tic Tac CK,FAMILY,5.0,3,13.0,10,Free,0,Everyone,Puzzle,"July 3, 2018",4.0.3 and up
486,Easy Hotspot Ad Free,TOOLS,5.0,2,3.3,10,Paid,$0.99,Everyone,Tools,"July 26, 2018",4.0 and up
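
The top-rated apps above have only a handful of reviews each, so a perfect score says little. One possible workaround is to require a minimum number of reviews before ranking. A sketch on made-up data; the threshold `min_reviews` is an arbitrary, hypothetical value:

```python
import pandas as pd

# Hypothetical toy data: one perfect score from a single review
toy = pd.DataFrame({
    'App': ['Tiny', 'Popular', 'Solid'],
    'Rating': [5.0, 4.5, 4.7],
    'Reviews': [1, 2_000_000, 15_000],
})

min_reviews = 1000  # arbitrary cut-off chosen for illustration

# Rank by rating only among apps with enough reviews
top = toy[toy['Reviews'] >= min_reviews].sort_values('Rating', ascending=False)

print(top)
```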


# Find 5 Largest Apps in terms of Size (MBs)

**Challenge**: What is the size in megabytes (MB) of the largest Android apps in the Google Play Store? Based on the data, do you think there could be a limit in place, or can developers make apps as large as they please?

In [62]:
df_apps_clean.sort_values(by='Size_MBs', ascending=False).head()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
10689,Hungry Shark Evolution,GAME,4.5,6071542,100.0,100000000,Free,0,Teen,Arcade,"July 25, 2018",4.1 and up
10687,Hungry Shark Evolution,GAME,4.5,6074334,100.0,100000000,Free,0,Teen,Arcade,"July 25, 2018",4.1 and up
10688,Hungry Shark Evolution,GAME,4.5,6074627,100.0,100000000,Free,0,Teen,Arcade,"July 25, 2018",4.1 and up
9944,Gangster Town: Vice District,FAMILY,4.3,65146,100.0,10000000,Free,0,Mature 17+,Simulation,"May 31, 2018",4.0 and up
9942,Talking Babsy Baby: Baby Games,LIFESTYLE,4.0,140995,100.0,10000000,Free,0,Everyone,Lifestyle;Pretend Play,"July 16, 2018",4.0 and up


# Find the 5 Apps with the Most Reviews

**Challenge**: Which apps have the highest number of reviews? Are there any paid apps among the top 50?

In [63]:
df_apps_clean.sort_values(by='Reviews', ascending=False).head(50)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
10805,Facebook,SOCIAL,4.1,78158306,5.3,1000000000,Free,0,Teen,Social,"August 3, 2018",Varies with device
10811,Facebook,SOCIAL,4.1,78128208,5.3,1000000000,Free,0,Teen,Social,"August 3, 2018",Varies with device
10785,WhatsApp Messenger,COMMUNICATION,4.4,69119316,3.5,1000000000,Free,0,Everyone,Communication,"August 3, 2018",Varies with device
10789,WhatsApp Messenger,COMMUNICATION,4.4,69119316,3.5,1000000000,Free,0,Everyone,Communication,"August 3, 2018",Varies with device
10797,WhatsApp Messenger,COMMUNICATION,4.4,69109672,3.5,1000000000,Free,0,Everyone,Communication,"August 3, 2018",Varies with device
10808,Instagram,SOCIAL,4.5,66577446,5.3,1000000000,Free,0,Teen,Social,"July 31, 2018",Varies with device
10809,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social,"July 31, 2018",Varies with device
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social,"July 31, 2018",Varies with device
10810,Instagram,SOCIAL,4.5,66509917,5.3,1000000000,Free,0,Teen,Social,"July 31, 2018",Varies with device
10790,Messenger – Text and Video Chat for Free,COMMUNICATION,4.0,56646578,3.5,1000000000,Free,0,Everyone,Communication,"August 1, 2018",Varies with device
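
Rather than scanning the table by eye, the paid apps among the most-reviewed ones can be counted directly. A sketch on made-up data, using the top 3 instead of the top 50 to keep it short:

```python
import pandas as pd

# Hypothetical toy data (not taken from the dataset)
toy = pd.DataFrame({
    'App': ['A', 'B', 'C', 'D'],
    'Reviews': [100, 90, 80, 5],
    'Type': ['Free', 'Paid', 'Free', 'Paid'],
})

# Take the n most-reviewed apps, then count the paid ones
top_n = toy.sort_values('Reviews', ascending=False).head(3)
n_paid = (top_n['Type'] == 'Paid').sum()

print(n_paid)
```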


# Plotly Pie and Donut Charts - Visualise Categorical Data: Content Ratings

In [64]:
ratings=df_apps_clean.Content_Rating.value_counts()
ratings

Content_Rating
Everyone           7419
Teen               1084
Mature 17+          461
Everyone 10+        397
Adults only 18+       3
Unrated               1
Name: count, dtype: int64

In [66]:
# names sets the slice labels; px's labels argument expects a dict for renaming columns, so it is omitted
fig=px.pie(values=ratings.values,names=ratings.index,title='Content Rating')

fig.update_traces(textposition='outside', textinfo='percent+label')
fig.show()

# Numeric Type Conversion: Examine the Number of Installs

**Challenge**: How many apps had over 1 billion (that's right - BILLION) installations? How many apps just had a single install? 

Check the datatype of the Installs column.

Count the number of apps at each level of installations. 

Convert the number of installations (the Installs column) to a numeric data type. Hint: this is a 2-step process. You'll have to make sure you remove non-numeric characters first.

In [67]:
df_apps_clean.Installs.describe()



count          9365
unique           19
top       1,000,000
freq           1577
Name: Installs, dtype: object

In [68]:
#df_apps_clean[['App','Installs']].groupby('Installs').count()

In [69]:
#Replace the comma in order to change the installs values to numeric
df_apps_clean.Installs = df_apps_clean.Installs.astype(str).str.replace(',', "")
df_apps_clean.Installs = pd.to_numeric(df_apps_clean.Installs)
df_apps_clean[['App', 'Installs']].groupby('Installs').count()



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy






Installs,App
1,3
5,9
10,69
50,56
100,309
500,201
1000,713
5000,432
10000,1009
50000,467
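
The warning printed above comes from assigning to a DataFrame that pandas regards as derived from another one. A sketch of the same two-step conversion on an explicit copy, using made-up data (`raw` and `clean` are hypothetical names):

```python
import pandas as pd

# Hypothetical toy data with comma-formatted install counts
raw = pd.DataFrame({
    'App': ['A', 'B', 'C', 'D'],
    'Rating': [4.0, None, 4.5, 3.0],
    'Installs': ['1,000', '50', '1,000,000,000', '1'],
})

# An explicit copy makes later column assignments unambiguous
clean = raw.dropna().copy()

# Step 1: strip the commas; step 2: convert the strings to numbers
clean['Installs'] = pd.to_numeric(
    clean['Installs'].str.replace(',', '', regex=False)
)

# Number of apps at each install level
counts = clean.groupby('Installs')['App'].count()
print(counts)
```

Passing `regex=False` makes it explicit that the comma is matched literally.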


# Find the Most Expensive Apps, Filter out the Junk, and Calculate a (ballpark) Sales Revenue Estimate

Let's examine the Price column more closely.

**Challenge**: Convert the price column to numeric data. Then investigate the top 20 most expensive apps in the dataset.

Remove all apps that cost more than $250 from the `df_apps_clean` DataFrame.

Add a column called 'Revenue_Estimate' to the DataFrame. This column should hold the price of the app times the number of installs. What are the top 10 highest grossing paid apps according to this estimate? Out of the top 10 highest grossing paid apps, how many are games?


In [70]:
df_apps_clean.Price.describe()

count     9365
unique      73
top          0
freq      8719
Name: Price, dtype: object

In [71]:
df_apps_clean.Price.head()

21        0
28    $1.49
47    $0.99
82        0
99        0
Name: Price, dtype: object

In [72]:
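# Caution: stripping '.' drops the decimal point, so '$1.49' becomes 149 and Price ends up in cents, not dollars
df_apps_clean.Price = df_apps_clean.Price.astype(str).str.replace('.', "")
# Strip the dollar sign so the remaining string can be parsed as a number
df_apps_clean.Price = df_apps_clean.Price.astype(str).str.replace('$', "")
df_apps_clean.Price = pd.to_numeric(df_apps_clean.Price)
df_apps_clean[['App', 'Price']].groupby('Price').count()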



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy






Price,App
0,8719
99,107
100,2
120,1
129,1
...,...
29999,1
37999,1
38999,1
39999,11
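
The grouped counts above show prices such as 99 and 39999, which suggests the decimal point was stripped along with the dollar sign. A minimal sketch of a conversion that removes only the `$`, on made-up price strings (`prices` and `numeric` are hypothetical names):

```python
import pandas as pd

# Hypothetical price strings in the same format as the Price column
prices = pd.Series(['0', '$1.49', '$0.99', '$399.99'])

# Remove only the dollar sign (literal match), keep the decimal point
numeric = pd.to_numeric(prices.str.replace('$', '', regex=False))

print(numeric)

# With prices in dollars, a $250 cut-off behaves as the challenge intends
print((numeric < 250).sum())
```

With this form, a filter such as `Price < 250` compares against dollars rather than cents.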


In [73]:
df_apps_clean.sort_values('Price', ascending=False).head(20)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
3946,I'm Rich - Trump Edition,LIFESTYLE,3.6,275,7.3,10000,Paid,40000,Everyone,Lifestyle,"May 3, 2018",4.1 and up
1946,I am rich (Most expensive app),FINANCE,4.1,129,2.7,1000,Paid,39999,Teen,Finance,"December 6, 2017",4.0.3 and up
3221,I am Rich Plus,FAMILY,4.0,856,8.7,10000,Paid,39999,Everyone,Entertainment,"May 19, 2018",4.4 and up
1331,most expensive app (H),FAMILY,4.3,6,1.5,100,Paid,39999,Everyone,Entertainment,"July 16, 2018",7.0 and up
3114,I am Rich,FINANCE,4.3,180,3.8,5000,Paid,39999,Everyone,Finance,"March 22, 2018",4.2 and up
2775,I Am Rich Pro,FAMILY,4.4,201,2.7,5000,Paid,39999,Everyone,Entertainment,"May 30, 2017",1.6 and up
4606,I Am Rich Premium,FINANCE,4.1,1867,4.7,50000,Paid,39999,Everyone,Finance,"November 12, 2017",4.0 and up
2461,I AM RICH PRO PLUS,FINANCE,4.0,36,41.0,1000,Paid,39999,Everyone,Finance,"June 25, 2018",4.1 and up
3554,💎 I'm rich,LIFESTYLE,3.8,718,26.0,10000,Paid,39999,Everyone,Lifestyle,"March 11, 2018",4.4 and up
5765,I am rich,LIFESTYLE,3.8,3547,1.8,100000,Paid,39999,Everyone,Lifestyle,"January 12, 2018",4.0.3 and up


### The most expensive apps sub $250

In [74]:
df_apps_clean_250=df_apps_clean[df_apps_clean["Price"]<250]
df_apps_clean_250.sort_values('Price', ascending=False).head(5)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
4512,Day Night Live Wallpaper (All),PERSONALIZATION,4.7,4856,18.0,50000,Paid,249,Everyone,Personalization,"January 31, 2018",4.0 and up
248,The DG Buddy,BUSINESS,3.7,3,11.0,10,Paid,249,Everyone,Business,"June 30, 2014",2.2 and up
2277,Jump'N'Shoot Attack,GAME,4.1,155,32.0,1000,Paid,249,Everyone,Arcade,"May 26, 2018",4.1 and up
1621,Age of AI: War Strategy,FAMILY,4.6,71,5.7,500,Paid,249,Everyone,Strategy,"March 14, 2018",4.0 and up
6498,Beautiful Widgets Pro,PERSONALIZATION,4.2,97890,14.0,1000000,Paid,249,Everyone,Personalization,"August 24, 2016",2.3 and up


### Filtering the DataFrame & Dropping Indexed Rows

In [75]:
# Build an Index of the rows that match the conditions (note the trailing .index)
df_cut=df_apps_clean[(df_apps_clean['App'].str.contains('Rich',case=False)) & (df_apps_clean['Price']>250)].index

# Drop those rows from the original DataFrame
df_new=df_apps_clean.drop(df_cut)
df_new.sort_values('Price', ascending=False).head(20)


Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
1331,most expensive app (H),FAMILY,4.3,6,1.5,100,Paid,39999,Everyone,Entertainment,"July 16, 2018",7.0 and up
2282,Vargo Anesthesia Mega App,MEDICAL,4.6,92,32.0,1000,Paid,7999,Everyone,Medical,"June 18, 2018",4.0.3 and up
2281,Vargo Anesthesia Mega App,MEDICAL,4.6,92,32.0,1000,Paid,7999,Everyone,Medical,"June 18, 2018",4.0.3 and up
1407,LTC AS Legal,MEDICAL,4.0,6,1.3,100,Paid,3999,Everyone,Medical,"April 4, 2018",4.1 and up
2481,A Manual of Acupuncture,MEDICAL,3.5,214,68.0,1000,Paid,3399,Everyone,Medical,"October 2, 2017",4.0 and up
2482,A Manual of Acupuncture,MEDICAL,3.5,214,68.0,1000,Paid,3399,Everyone,Medical,"October 2, 2017",4.0 and up
2464,PTA Content Master,MEDICAL,4.2,64,41.0,1000,Paid,2999,Everyone,Medical,"December 22, 2015",2.2 and up
2208,EMT PASS,MEDICAL,3.4,51,2.4,1000,Paid,2999,Everyone,Medical,"October 22, 2014",4.0 and up
504,AP Art History Flashcards,FAMILY,5.0,1,96.0,10,Paid,2999,Mature 17+,Education,"January 19, 2016",4.0 and up
2207,EMT PASS,MEDICAL,3.4,51,2.4,1000,Paid,2999,Everyone,Medical,"October 22, 2014",4.0 and up


In [76]:
df_apps_clean['Revenue_Estimate']=df_apps_clean.Installs.mul(df_apps_clean.Price)
df_apps_clean.sort_values('Revenue_Estimate',ascending=False)[:10]



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver,Revenue_Estimate
9220,Minecraft,FAMILY,4.5,2376564,19.0,10000000,Paid,699,Everyone 10+,Arcade;Action & Adventure,"July 24, 2018",Varies with device,6990000000
9224,Minecraft,FAMILY,4.5,2375336,19.0,10000000,Paid,699,Everyone 10+,Arcade;Action & Adventure,"July 24, 2018",Varies with device,6990000000
5765,I am rich,LIFESTYLE,3.8,3547,1.8,100000,Paid,39999,Everyone,Lifestyle,"January 12, 2018",4.0.3 and up,3999900000
4606,I Am Rich Premium,FINANCE,4.1,1867,4.7,50000,Paid,39999,Everyone,Finance,"November 12, 2017",4.0 and up,1999950000
8825,Hitman Sniper,GAME,4.6,408292,29.0,10000000,Paid,99,Mature 17+,Action,"July 12, 2018",4.1 and up,990000000
7151,Grand Theft Auto: San Andreas,GAME,4.4,348962,26.0,1000000,Paid,699,Mature 17+,Action,"March 21, 2015",3.0 and up,699000000
7977,Sleep as Android Unlock,LIFESTYLE,4.5,23966,0.85,1000000,Paid,599,Everyone,Lifestyle,"June 27, 2018",4.0 and up,599000000
7479,Facetune - For Free,PHOTOGRAPHY,4.4,49553,48.0,1000000,Paid,599,Everyone,Photography,"July 25, 2018",4.1 and up,599000000
7478,Facetune - For Free,PHOTOGRAPHY,4.4,49553,48.0,1000000,Paid,599,Everyone,Photography,"July 25, 2018",4.1 and up,599000000
7477,Facetune - For Free,PHOTOGRAPHY,4.4,49553,48.0,1000000,Paid,599,Everyone,Photography,"July 25, 2018",4.1 and up,599000000
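
The challenge asks for the estimate after the overpriced apps have been removed. A sketch of that order of operations on made-up data, assuming prices are already numeric and expressed in dollars (all app names and figures are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with dollar prices
toy = pd.DataFrame({
    'App': ['Game X', 'I am rich clone', 'Tool Y', 'Free Z'],
    'Category': ['GAME', 'LIFESTYLE', 'TOOLS', 'TOOLS'],
    'Type': ['Paid', 'Paid', 'Paid', 'Free'],
    'Price': [6.99, 399.99, 0.99, 0.0],
    'Installs': [10_000_000, 100_000, 1_000_000, 5_000_000],
})

# Filter out apps priced at $250 or more, working on an explicit copy
reasonable = toy[toy['Price'] < 250].copy()

# Ballpark revenue: price times installs
reasonable['Revenue_Estimate'] = reasonable['Installs'] * reasonable['Price']

# Highest-grossing paid apps and how many of them are games
top_paid = (reasonable[reasonable['Type'] == 'Paid']
            .sort_values('Revenue_Estimate', ascending=False)
            .head(10))
n_games = (top_paid['Category'] == 'GAME').sum()

print(top_paid[['App', 'Revenue_Estimate']])
print(n_games)
```

The estimate is an upper bound at best, since it assumes every install was paid for at the current price.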


# Plotly Bar Charts & Scatter Plots: Analysing App Categories

In [77]:
df_apps_clean.Category.nunique()

33

In [78]:
df_top_categories=df_apps_clean.Category.value_counts()[:10]
df_top_categories

Category
FAMILY           1747
GAME             1097
TOOLS             734
PRODUCTIVITY      351
MEDICAL           350
COMMUNICATION     328
FINANCE           323
SPORTS            319
PHOTOGRAPHY       317
LIFESTYLE         315
Name: count, dtype: int64

### Vertical Bar Chart - Highest Competition (Number of Apps)

In [79]:
px.bar(x=df_top_categories.index, y=df_top_categories.values)

### Horizontal Bar Chart - Most Popular Categories (Highest Downloads)

In [90]:
# GROUPBY categories and add the Installs column with agg
category_installs=df_apps_clean.groupby('Category').agg({'Installs':pd.Series.sum})
category_installs.sort_values(by='Installs',ascending=False, inplace=True)
category_installs
h_bar=px.bar(x=category_installs.Installs, y=category_installs.index, orientation='h')
h_bar.show()

### Category Concentration - Downloads vs. Competition

**Challenge**: 
* First, create a DataFrame that has the number of apps in one column and the number of installs in another:



* Then use the [plotly express examples from the documentation](https://plotly.com/python/line-and-scatter/) alongside the [.scatter() API reference](https://plotly.com/python-api-reference/generated/plotly.express.scatter.html) to create a scatter plot that looks like this.



*Hint*: Use the size, hover_name and color parameters in .scatter(). To scale the yaxis, call .update_layout() and specify that the yaxis should be on a log-scale like so: yaxis=dict(type='log') 

In [81]:
category_installs_App=df_apps_clean.groupby(['Category']).agg({'Installs':pd.Series.sum})
category_num=df_apps_clean.groupby(['Category']).agg({'App':pd.Series.count})
cat_merged_df = pd.merge(category_num, category_installs_App, on='Category', how="inner")
print(f'The dimensions of the DataFrame are: {cat_merged_df.shape}')
cat_merged_df.sort_values('Installs', ascending=False)


The dimensions of the DataFrame are: (33, 2)


Category,App,Installs
GAME,1097,35085862717
COMMUNICATION,328,32647241530
PRODUCTIVITY,351,14176070180
SOCIAL,259,14069841475
TOOLS,734,11450724500
FAMILY,1747,10257701590
PHOTOGRAPHY,317,10088243130
NEWS_AND_MAGAZINES,233,7496210650
TRAVEL_AND_LOCAL,226,6868859300
VIDEO_PLAYERS,160,6221897200


In [82]:
import plotly.express as px

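# size scales each marker by the number of apps in the category, as suggested in the hint
fig = px.scatter(cat_merged_df, x="App", y="Installs", size='App', color='Installs', hover_name=cat_merged_df.index)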

fig.update_layout(xaxis_title="Number of Apps (Lower=More Concentrated)",yaxis_title="Installs",yaxis=dict(type='log'))
fig.show()

# Extracting Nested Data from a Column

**Challenge**: How many different types of genres are there? Can an app belong to more than one genre? Check what happens when you use .value_counts() on a column with nested values. See if you can work around this problem by using the .split() function and the DataFrame's [.stack() method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.stack.html).


In [83]:
df_apps_clean.Genres.value_counts()

Genres
Tools                          733
Entertainment                  533
Education                      468
Action                         358
Productivity                   351
                              ... 
Health & Fitness;Education       1
Music & Audio;Music & Video      1
Strategy;Education               1
Communication;Creativity         1
Lifestyle;Pretend Play           1
Name: count, Length: 115, dtype: int64

In [84]:
len(df_apps_clean.Genres.value_counts())

115

In [85]:
stack=df_apps_clean.Genres.str.split(";",expand=True).stack()
print(f'We now have a single column with shape: {stack.shape}')

num_genres = stack.value_counts()
print(f'Number of genres: {len(num_genres)}')

We now have a single column with shape: (9848,)
Number of genres: 53
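
An alternative to split-and-stack is `Series.explode()`, which turns each list element into its own row. A sketch comparing the two on made-up genre strings (`genres`, `stacked` and `exploded` are hypothetical names):

```python
import pandas as pd

# Hypothetical genre strings using the same ';' separator
genres = pd.Series([
    'Tools',
    'Arcade;Action & Adventure',
    'Lifestyle;Pretend Play',
    'Arcade',
])

# Approach used above: split into columns, then stack into one column.
# The explicit dropna() discards padding values regardless of pandas version.
stacked = genres.str.split(';', expand=True).stack().dropna()

# Alternative: split into lists, then give each element its own row
exploded = genres.str.split(';').explode()

print(exploded.value_counts())
print(f'Number of genres: {exploded.nunique()}')
```

Both approaches yield the same counts; `explode()` skips the intermediate wide DataFrame.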


# Colour Scales in Plotly Charts - Competition in Genres

**Challenge**: Can you create this chart with the Series containing the genre data? 



Try experimenting with the built in colour scales in Plotly. You can find a full list [here](https://plotly.com/python/builtin-colorscales/). 

* Find a way to set the colour scale using the color_continuous_scale parameter. 
* Find a way to make the color axis disappear by using coloraxis_showscale. 

In [86]:
bar = px.bar(x = num_genres.index[:15],y = num_genres.values[:15],title='Top Genres',hover_name=num_genres.index[:15],color=num_genres.values[:15],color_continuous_scale='Agsunset')
bar.update_layout(xaxis_title='Genre',
yaxis_title='Number of Apps',
coloraxis_showscale=False)

bar.show()

# Grouped Bar Charts: Free vs. Paid Apps per Category

**Challenge**: Use the plotly express bar [chart examples](https://plotly.com/python/bar-charts/#bar-chart-with-sorted-or-ordered-categories) and the [.bar() API reference](https://plotly.com/python-api-reference/generated/plotly.express.bar.html#plotly.express.bar) to create this bar chart: 



You'll want a `df_free_vs_paid` DataFrame that holds the total number of free and paid apps per category.

See if you can figure out how to get the look above by changing the `categoryorder` to 'total descending' as outlined in the documentation [here](https://plotly.com/python/categorical-axes/#automatically-sorting-categories-by-name-or-total-value).

In [87]:
df_free_vs_paid=df_apps_clean.groupby(['Category','Type'], as_index=False).agg({'App':pd.Series.count})
df_free_vs_paid.head()

Unnamed: 0,Category,Type,App
0,ART_AND_DESIGN,Free,59
1,ART_AND_DESIGN,Paid,3
2,AUTO_AND_VEHICLES,Free,72
3,AUTO_AND_VEHICLES,Paid,1
4,BEAUTY,Free,42


In [88]:
g_bar=px.bar(df_free_vs_paid,x='Category',y='App',title="Free vs Paid Apps",color="Type", barmode='group')

g_bar.update_layout(xaxis_title='Category', yaxis_title='Number of Apps',xaxis={'categoryorder':'total descending'},yaxis=dict(type='log'))
g_bar.show()

# Plotly Box Plots: Lost Downloads for Paid Apps

**Challenge**: Create a box plot that shows the number of Installs for free versus paid apps. How does the median number of installations compare? Is the difference large or small?
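
The centre line of each box in a box plot of installs by `Type` is the group median, so the comparison between free and paid apps can be checked numerically as well. A pandas-only sketch on made-up install counts (the figures are hypothetical, not results from the dataset):

```python
import pandas as pd

# Hypothetical install counts for free and paid apps
toy = pd.DataFrame({
    'Type': ['Free', 'Free', 'Free', 'Paid', 'Paid', 'Paid'],
    'Installs': [1_000, 500_000, 10_000_000, 100, 5_000, 10_000],
})

# Median installs per type: the same quantity a box plot draws as its centre line
medians = toy.groupby('Type')['Installs'].median()

print(medians)
```

Because install counts span several orders of magnitude, a log-scaled y-axis usually makes the corresponding box plot easier to read.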

# Plotly Box Plots: Revenue by App Category

**Challenge**: See if you can generate the chart below: 

<img src=https://imgur.com/v4CiNqX.png>

Looking at the hover text, how much does the median app earn in the Tools category? If developing an Android app costs $30,000 or thereabouts, does the average photography app recoup its development costs?

Hint: I've used 'min ascending' to sort the categories. 

In [89]:

# Keep only the paid apps; free apps have a revenue estimate of zero
df_paid_apps=df_apps_clean[df_apps_clean['Type']=='Paid']

box2 = px.box(df_paid_apps,x='Category',y='Revenue_Estimate',title='How Much Can Paid Apps Earn')
box2.update_layout(xaxis_title='Category',yaxis_title='Paid App Ballpark Revenue',xaxis={'categoryorder':'min ascending'},yaxis=dict(type='log'))
box2.show()