# Introduction

In this notebook, we will do a comprehensive analysis of the Android app market by comparing thousands of apps in the Google Play store.

# About the Dataset of Google Play Store Apps & Reviews

**Data Source:** <br>
App and review data was scraped from the Google Play Store by Lavanya Gupta in 2018. The original files are listed [here](https://www.kaggle.com/lava18/google-play-store-apps).

# Import Statements

In [118]:
import pandas as pd


# Notebook Presentation

In [119]:
# Show numeric output in decimal format e.g., 2.15
pd.options.display.float_format = '{:,.2f}'.format

# Read the Dataset

In [120]:
df_apps = pd.read_csv('../data/apps.csv')
df_apps

print(df_apps.shape)
df_apps.columns


(10841, 12)


Index(['App', 'Category', 'Rating', 'Reviews', 'Size_MBs', 'Installs', 'Type',
       'Price', 'Content_Rating', 'Genres', 'Last_Updated', 'Android_Ver'],
      dtype='object')

In [121]:
df_apps.head(10)
df_apps.sample(10)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
6428,OnePlus Icon Pack,PERSONALIZATION,4.8,440,0.9,500000,Free,0,Everyone,Personalization,"May 23, 2018",5.0 and up
933,BX Mobile TMC for SAP B1,BUSINESS,4.7,3,11.0,100,Free,0,Everyone,Business,"January 20, 2015",4.0.3 and up
7933,Tokyo Ghoul: Dark War,GAME,4.3,25094,82.0,1000000,Free,0,Teen,Role Playing,"July 6, 2018",4.3 and up
5289,AE GTO Racing,GAME,3.9,6988,35.0,100000,Free,0,Everyone 10+,Racing,"September 24, 2015",Varies with device
7062,Yahoo! Weather for SH Forecast for understandi...,WEATHER,4.2,7457,4.75,1000000,Free,0,Everyone,Weather,"August 2, 2018",Varies with device
2560,My Virtual Boyfriend,FAMILY,4.3,105,51.0,1000,Paid,$0.99,Teen,Casual,"February 17, 2017",2.3.3 and up
5534,BD CRICKET LIVE,SPORTS,4.6,1238,9.5,100000,Free,0,Everyone,Sports,"January 15, 2018",4.0.3 and up
10419,Tumblr,SOCIAL,4.4,2955326,5.3,100000000,Free,0,Mature 17+,Social,"August 1, 2018",Varies with device
1846,G-REMOTE,HOUSE_AND_HOME,4.5,30,25.0,1000,Free,0,Everyone,House & Home,"August 1, 2018",4.0.3 and up
3991,Study AP World History,FAMILY,4.6,513,1.6,10000,Free,0,Everyone,Education,"July 18, 2016",4.0.3 and up


## <---------- Dropping rows / cols based on Null ( Very IMPORTANT ) ------->

In [122]:
# Simple scenario: drop columns by name
df_app_clean = df_apps.drop(["Last_Updated", "Android_Ver"], axis="columns")



In [123]:
# Drop every row in which any column has a null value
df_apps = pd.read_csv('../data/apps.csv')
print(df_apps)

after_drop_rows_having_any_col_null = df_apps.dropna(axis=0, how="any")
after_drop_rows_having_any_col_null

# Check whether any column still has a null value
after_drop_rows_having_any_col_null.isnull().any()
after_drop_rows_having_any_col_null["Reviews"].isnull().any()

                                                App         Category  Rating  \
0                           Ak Parti Yardım Toplama           SOCIAL     nan   
1                        Ain Arabic Kids Alif Ba ta           FAMILY     nan   
2      Popsicle Launcher for Android P 9.0 launcher  PERSONALIZATION     nan   
3                         Command & Conquer: Rivals           FAMILY     nan   
4                                        CX Network         BUSINESS     nan   
...                                             ...              ...     ...   
10836                                Subway Surfers             GAME    4.50   
10837                                Subway Surfers             GAME    4.50   
10838                                Subway Surfers             GAME    4.50   
10839                                Subway Surfers             GAME    4.50   
10840                                Subway Surfers             GAME    4.50   

        Reviews  Size_MBs       Install

False

In [124]:
# Load the data again so that we start exploring from the original data
df_apps = pd.read_csv('../data/apps.csv')
#print(df_apps)

# First check whether "Rating" has any null values

df_apps["Rating"].isnull().any()
df_apps.loc[df_apps["Rating"].isna()]

print("\n before dropping_rows_with_rating_null ------> ")
print(df_apps.loc[df_apps["Rating"].isna()])

# Drop the rows whose Rating value is NaN
print("\n after_dropping_rows_with_rating_null ------> ")
after_dropping_rows_with_rating_null = df_apps.dropna(axis=0, how="any", subset=["Rating"])
print(after_dropping_rows_with_rating_null)
# This should now be empty: no remaining row has a null Rating
after_dropping_rows_with_rating_null.loc[after_dropping_rows_with_rating_null["Rating"].isna()]


 before dropping_rows_with_rating_null ------> 
                                               App            Category  \
0                          Ak Parti Yardım Toplama              SOCIAL   
1                       Ain Arabic Kids Alif Ba ta              FAMILY   
2     Popsicle Launcher for Android P 9.0 launcher     PERSONALIZATION   
3                        Command & Conquer: Rivals              FAMILY   
4                                       CX Network            BUSINESS   
...                                            ...                 ...   
5840                                Em Fuga Brasil              FAMILY   
5862                    Voice Tables - no internet           PARENTING   
6141                                Young Speeches  LIBRARIES_AND_DEMO   
7035                                SD card backup               TOOLS   
7175                     Android TV Remote Service               TOOLS   

      Rating  Reviews  Size_MBs   Installs  Type   Price Conte

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver


In [155]:
# Another good way to remove null records: don't drop, just build a new DataFrame by filtering for non-null values
df_apps = pd.read_csv('../data/apps.csv')
df_with_non_null = df_apps.loc[df_apps["Rating"].notnull()]
print(df_with_non_null)

df_with_non_null["Rating"].isnull().any()  # False: no nulls left in this column, so it worked

                                                   App Category  Rating  \
21                                 KBA-EZ Health Guide  MEDICAL    5.00   
28                                            Ra Ga Ba     GAME    5.00   
47                                             Mu.F.O.     GAME    5.00   
82                                    Brick Breaker BR     GAME    5.00   
99     Anatomy & Physiology Vocabulary Exam Review App  MEDICAL    5.00   
...                                                ...      ...     ...   
10836                                   Subway Surfers     GAME    4.50   
10837                                   Subway Surfers     GAME    4.50   
10838                                   Subway Surfers     GAME    4.50   
10839                                   Subway Surfers     GAME    4.50   
10840                                   Subway Surfers     GAME    4.50   

        Reviews  Size_MBs       Installs  Type  Price Content_Rating   Genres  \
21            4   

False

In [126]:
# So far we have dropped rows; now drop columns instead.
# With axis=1, subset lists row labels: drop every column that is null in row 1.
df_apps = pd.read_csv('../data/apps.csv')
after_drop_col_with_null = df_apps.dropna(axis=1, how="any", subset=[1])
after_drop_col_with_null

Unnamed: 0,App,Category,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
0,Ak Parti Yardım Toplama,SOCIAL,0,8.70,0,Paid,$13.99,Teen,Social,"July 28, 2017",4.1 and up
1,Ain Arabic Kids Alif Ba ta,FAMILY,0,33.00,0,Paid,$2.99,Everyone,Education,"April 15, 2016",3.0 and up
2,Popsicle Launcher for Android P 9.0 launcher,PERSONALIZATION,0,5.50,0,Paid,$1.49,Everyone,Personalization,"July 11, 2018",4.2 and up
3,Command & Conquer: Rivals,FAMILY,0,19.00,0,,0,Everyone 10+,Strategy,"June 28, 2018",Varies with device
4,CX Network,BUSINESS,0,10.00,0,Free,0,Everyone,Business,"August 6, 2018",4.1 and up
...,...,...,...,...,...,...,...,...,...,...,...
10836,Subway Surfers,GAME,27723193,76.00,1000000000,Free,0,Everyone 10+,Arcade,"July 12, 2018",4.1 and up
10837,Subway Surfers,GAME,27724094,76.00,1000000000,Free,0,Everyone 10+,Arcade,"July 12, 2018",4.1 and up
10838,Subway Surfers,GAME,27725352,76.00,1000000000,Free,0,Everyone 10+,Arcade,"July 12, 2018",4.1 and up
10839,Subway Surfers,GAME,27725352,76.00,1000000000,Free,0,Everyone 10+,Arcade,"July 12, 2018",4.1 and up


In [128]:
df_app_clean.duplicated()
df_app_clean.shape
df_apps.shape

(10841, 12)

## Find and drop null from DF and any specific col ( Note : Very Important)

In [3]:
import pandas as pd
df_apps = pd.read_csv('../data/apps.csv')
df_apps.head()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
0,Ak Parti Yardım Toplama,SOCIAL,,0,8.7,0,Paid,$13.99,Teen,Social,"July 28, 2017",4.1 and up
1,Ain Arabic Kids Alif Ba ta,FAMILY,,0,33.0,0,Paid,$2.99,Everyone,Education,"April 15, 2016",3.0 and up
2,Popsicle Launcher for Android P 9.0 launcher,PERSONALIZATION,,0,5.5,0,Paid,$1.49,Everyone,Personalization,"July 11, 2018",4.2 and up
3,Command & Conquer: Rivals,FAMILY,,0,19.0,0,,0,Everyone 10+,Strategy,"June 28, 2018",Varies with device
4,CX Network,BUSINESS,,0,10.0,0,Free,0,Everyone,Business,"August 6, 2018",4.1 and up


In [4]:
from pyspark.sql import SparkSession

ModuleNotFoundError: No module named 'pyspark'

# Data Cleaning

**Challenge**: How many rows and columns does `df_apps` have? What are the column names? Look at a random sample of 5 different rows with [.sample()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html).

In [156]:
# Simple scenario: drop columns by name
df_app_clean = df_apps.drop(["Last_Updated", "Android_Ver"], axis="columns")



0        False
1        False
2        False
3        False
4        False
         ...  
10836    False
10837    False
10838    False
10839     True
10840    False
Length: 10841, dtype: bool

In [166]:
# Find the duplicated rows

# .duplicated() flags every repeat of an earlier row (the first occurrence is not flagged),
# so this selects only the repeated rows.
complete_row_duplicated_rows = df_app_clean[df_app_clean.duplicated()]
print(complete_row_duplicated_rows.head())

df_app_clean.loc[df_app_clean["App"] == "RT 516 VET"]

                                App Category  Rating  Reviews  Size_MBs  \
190                      RT 516 VET  MEDICAL     nan        0     29.00   
741      Penn State Health OnDemand  MEDICAL     nan        0     40.00   
803                     Maricopa AH  MEDICAL     nan        0     29.00   
914  Breastfeeding Tracker Baby Log  MEDICAL     nan        6     23.00   
946          420 BZ Budeze Delivery  MEDICAL    5.00        2     11.00   

    Installs  Type Price Content_Rating   Genres  
190       10  Free     0       Everyone  Medical  
741       50  Free     0       Everyone  Medical  
803      100  Free     0       Everyone  Medical  
914      100  Free     0       Everyone  Medical  
946      100  Free     0     Mature 17+  Medical  


Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
189,RT 516 VET,MEDICAL,,0,29.0,10,Free,0,Everyone,Medical
190,RT 516 VET,MEDICAL,,0,29.0,10,Free,0,Everyone,Medical


In [164]:
#df_app_clean.iloc[10839]

df_app_clean.drop_duplicates(inplace=True)
#df_app_clean.shape


                  App Category  Rating   Reviews  Size_MBs       Installs  \
10835  Subway Surfers     GAME    4.50  27722264     76.00  1,000,000,000   
10836  Subway Surfers     GAME    4.50  27723193     76.00  1,000,000,000   
10837  Subway Surfers     GAME    4.50  27724094     76.00  1,000,000,000   
10838  Subway Surfers     GAME    4.50  27725352     76.00  1,000,000,000   
10839  Subway Surfers     GAME    4.50  27725352     76.00  1,000,000,000   
10840  Subway Surfers     GAME    4.50  27711703     76.00  1,000,000,000   

       Type Price Content_Rating  Genres  
10835  Free     0   Everyone 10+  Arcade  
10836  Free     0   Everyone 10+  Arcade  
10837  Free     0   Everyone 10+  Arcade  
10838  Free     0   Everyone 10+  Arcade  
10839  Free     0   Everyone 10+  Arcade  
10840  Free     0   Everyone 10+  Arcade  


### Drop Unused Columns

**Challenge**: Remove the columns called `Last_Updated` and `Android_Ver` from the DataFrame. We will not use these columns. 

### Find and Remove NaN values in Ratings

**Challenge**: How many rows have a NaN value (not-a-number) in the Rating column? Create a DataFrame called `df_apps_clean` that does not include these rows. 
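
As a minimal sketch of this step, the snippet below counts missing ratings and builds a cleaned copy. It uses a small made-up DataFrame (`df_toy`) instead of the real `apps.csv`, so the names and values are illustrative assumptions only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_apps (assumption: the real data comes from apps.csv).
df_toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.5, np.nan, 3.9, np.nan],
})

# Count the rows whose Rating is missing.
nan_count = df_toy["Rating"].isna().sum()

# Keep only the rows that have a Rating; the original frame is left untouched.
df_toy_clean = df_toy.dropna(subset=["Rating"])

print(nan_count)
print(df_toy_clean.shape)
```

Passing `subset=["Rating"]` limits the null check to that one column, so rows with missing values elsewhere are kept.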

### Find and Remove Duplicates

**Challenge**: Are there any duplicates in data? Check for duplicates using the [.duplicated()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html) function. How many entries can you find for the "Instagram" app? Use [.drop_duplicates()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html) to remove any duplicates from `df_apps_clean`. 
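
A full-row duplicate check can miss entries that differ only in a fast-changing column such as the review count. The sketch below, on made-up data, contrasts the default check with a `subset`-based one. The choice of `App`, `Type` and `Price` as the identifying columns is an assumption for illustration, not something the dataset prescribes.

```python
import pandas as pd

# Toy data (hypothetical values): three Instagram rows, two with identical content.
df_toy = pd.DataFrame({
    "App": ["Instagram", "Instagram", "Instagram", "Maps"],
    "Type": ["Free", "Free", "Free", "Free"],
    "Price": ["0", "0", "0", "0"],
    "Reviews": [66577313, 66577446, 66577313, 100],
})

# Rows identical in every column: only index 2 repeats index 0.
full_dupes = df_toy[df_toy.duplicated()]

# Treat rows as duplicates whenever App, Type and Price match, ignoring Reviews.
deduped = df_toy.drop_duplicates(subset=["App", "Type", "Price"])

print(full_dupes)
print(deduped)
```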


# Find Highest Rated Apps

**Challenge**: Identify which apps are the highest rated. What problem might you encounter if you rely exclusively on ratings alone to determine the quality of an app?
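
One way to see the problem is that an app with a handful of reviews can top a ranking by rating alone. The sketch below uses invented data and a hypothetical cut-off (`min_reviews`) to show how filtering on review count changes the ranking. The threshold is an arbitrary assumption.

```python
import pandas as pd

# Toy data (hypothetical): a perfect rating backed by only two reviews.
df_toy = pd.DataFrame({
    "App": ["Tiny App", "Big App", "Mid App"],
    "Rating": [5.0, 4.5, 4.7],
    "Reviews": [2, 1_000_000, 5_000],
})

# Ranking by rating alone puts the barely reviewed app first.
by_rating = df_toy.sort_values("Rating", ascending=False)

# Hypothetical cut-off: only trust ratings backed by enough reviews.
min_reviews = 1_000
trusted = df_toy[df_toy["Reviews"] >= min_reviews].sort_values("Rating", ascending=False)

print(by_rating)
print(trusted)
```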

In [131]:
df_app_clean.head(5)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
0,Ak Parti Yardım Toplama,SOCIAL,,0,8.7,0,Paid,$13.99,Teen,Social
1,Ain Arabic Kids Alif Ba ta,FAMILY,,0,33.0,0,Paid,$2.99,Everyone,Education
2,Popsicle Launcher for Android P 9.0 launcher,PERSONALIZATION,,0,5.5,0,Paid,$1.49,Everyone,Personalization
3,Command & Conquer: Rivals,FAMILY,,0,19.0,0,,0,Everyone 10+,Strategy
4,CX Network,BUSINESS,,0,10.0,0,Free,0,Everyone,Business


In [134]:
sorted_by_rating = df_app_clean.sort_values("Rating", ascending=False)
sorted_by_rating.head(15)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
21,KBA-EZ Health Guide,MEDICAL,5.0,4,25.0,1,Free,0,Everyone,Medical
1573,FHR 5-Tier 2.0,MEDICAL,5.0,2,1.2,500,Paid,$2.99,Everyone,Medical
1096,BG Guide,TRAVEL_AND_LOCAL,5.0,3,2.4,100,Free,0,Everyone,Travel & Local
1095,Morse Player,FAMILY,5.0,12,2.4,100,Paid,$1.99,Everyone,Education
1092,DG TV,NEWS_AND_MAGAZINES,5.0,3,5.7,100,Free,0,Everyone,News & Magazines
1083,A-Y Collection,SHOPPING,5.0,2,2.9,100,Free,0,Teen,Shopping
1076,Spring flowers theme couleurs d t space,ART_AND_DESIGN,5.0,1,2.9,100,Free,0,Everyone,Art & Design
1064,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6,100,Free,0,Everyone,Education
1060,Free coupons and vouchers,FINANCE,5.0,4,5.9,100,Free,0,Everyone,Finance
1049,Helping BD,LIFESTYLE,5.0,15,4.5,100,Free,0,Everyone,Lifestyle


In [135]:
sorted_by_size = df_app_clean.sort_values("Size_MBs", ascending=False)
sorted_by_size.head()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10295,SimCity BuildIt,FAMILY,4.5,4218587,100.0,50000000,Free,0,Everyone 10+,Simulation
9945,Ultimate Tennis,SPORTS,4.3,183004,100.0,10000000,Free,0,Everyone,Sports
10687,Hungry Shark Evolution,GAME,4.5,6074334,100.0,100000000,Free,0,Teen,Arcade
10688,Hungry Shark Evolution,GAME,4.5,6074627,100.0,100000000,Free,0,Teen,Arcade
3144,Vi Trainer,HEALTH_AND_FITNESS,3.6,124,100.0,5000,Free,0,Everyone,Health & Fitness


In [149]:
# Good example of filtering records on a condition computed from a column

sorted_by_reviews = df_app_clean.sort_values("Reviews", ascending=False)
print(sorted_by_reviews.head())

# Convert Price to a number; regex=False treats "$" as a literal character
price_as_float = sorted_by_reviews["Price"].str.replace("$", "", regex=False).astype("float")

# Check for any records with price > 0
paid_apps = sorted_by_reviews.loc[price_as_float > 0.0]
print(sorted_by_reviews)

# Check whether any of the top 50 most-reviewed records has price > 0 (there are none)

sorted_by_reviews.head(50).loc[price_as_float.head(50) > 0.0]


                      App       Category  Rating   Reviews  Size_MBs  \
10805            Facebook         SOCIAL    4.10  78158306      5.30   
10811            Facebook         SOCIAL    4.10  78128208      5.30   
10785  WhatsApp Messenger  COMMUNICATION    4.40  69119316      3.50   
10797  WhatsApp Messenger  COMMUNICATION    4.40  69109672      3.50   
10808           Instagram         SOCIAL    4.50  66577446      5.30   

            Installs  Type Price Content_Rating         Genres  
10805  1,000,000,000  Free     0           Teen         Social  
10811  1,000,000,000  Free     0           Teen         Social  
10785  1,000,000,000  Free     0       Everyone  Communication  
10797  1,000,000,000  Free     0       Everyone  Communication  
10808  1,000,000,000  Free     0           Teen         Social  
                                  App       Category  Rating   Reviews  \
10805                        Facebook         SOCIAL    4.10  78158306   
10811                        

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres


# Find 5 Largest Apps in terms of Size (MBs)

**Challenge**: What's the size in megabytes (MB) of the largest Android apps in the Google Play Store? Based on the data, do you think there could be a limit in place or can developers make apps as large as they please? 

# Find the 5 Apps with the Most Reviews

**Challenge**: Which apps have the highest number of reviews? Are there any paid apps among the top 50?

# Plotly Pie and Donut Charts - Visualise Categorical Data: Content Ratings

# Numeric Type Conversion: Examine the Number of Installs

**Challenge**: How many apps had over 1 billion (that's right - BILLION) installations? How many apps just had a single install? 

Check the datatype of the Installs column.

Count the number of apps at each level of installations. 

Convert the number of installations (the Installs column) to a numeric data type. Hint: this is a 2-step process. You'll have to make sure you remove non-numeric characters first. 
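
The two steps can be sketched as follows on a small made-up column, assuming install counts arrive as strings with comma thousands separators. `df_toy` and its values are illustrative only.

```python
import pandas as pd

# Toy data (assumption): install counts stored as strings with commas.
df_toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Installs": ["1,000,000,000", "1", "1,000", "1"],
})

# Step 1: remove the non-numeric characters.
# Note: Series.str.replace works on substrings; Series.replace matches whole values.
installs_clean = df_toy["Installs"].astype(str).str.replace(",", "", regex=False)

# Step 2: convert the cleaned strings to a numeric dtype.
df_toy["Installs"] = pd.to_numeric(installs_clean)

# Number of apps at each level of installations.
level_counts = df_toy.groupby("Installs")["App"].count()

# Apps with at least a billion installs, and apps with exactly one install.
billion_plus = (df_toy["Installs"] >= 1_000_000_000).sum()
single_install = (df_toy["Installs"] == 1).sum()

print(level_counts)
print(billion_plus, single_install)
```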

In [172]:
df_app_clean["Installs"].describe()
#df_app_clean.info

<bound method DataFrame.info of                                                 App         Category  Rating  \
0                           Ak Parti Yardım Toplama           SOCIAL     nan   
1                        Ain Arabic Kids Alif Ba ta           FAMILY     nan   
2      Popsicle Launcher for Android P 9.0 launcher  PERSONALIZATION     nan   
3                         Command & Conquer: Rivals           FAMILY     nan   
4                                        CX Network         BUSINESS     nan   
...                                             ...              ...     ...   
10836                                Subway Surfers             GAME    4.50   
10837                                Subway Surfers             GAME    4.50   
10838                                Subway Surfers             GAME    4.50   
10839                                Subway Surfers             GAME    4.50   
10840                                Subway Surfers             GAME    4.50   

       

In [205]:
# Step 1: strip the thousands separators (.str.replace works on substrings,
# whereas Series.replace only matches whole cell values)
df_app_clean["Installs"] = df_app_clean["Installs"].astype("str").str.replace(",", "", regex=False)
# Step 2: convert the cleaned strings to a numeric dtype
df_app_clean["Installs"] = pd.to_numeric(df_app_clean["Installs"])
#df_app_clean.groupby(by=["Installs"]).count()
df_app_clean[['App', 'Installs']].groupby('Installs').count()


# We can add a new column like this if required:

# df_app_clean["a"] = pd.to_numeric(df_app_clean["Installs"].astype("str").str.replace(",", "", regex=False))
# df_app_clean[["Installs", "a"]]

Unnamed: 0,Installs,a
0,0,0
1,0,0
2,0,0
3,0,0
4,0,0
...,...,...
10836,1000000000,1000000000
10837,1000000000,1000000000
10838,1000000000,1000000000
10839,1000000000,1000000000


In [194]:
category_installs = df_app_clean.groupby('Category').agg({'App': pd.Series.count}).sort_values(by="App" , ascending=False)
category_installs

Unnamed: 0_level_0,App
Category,Unnamed: 1_level_1
FAMILY,1972
GAME,1144
TOOLS,843
MEDICAL,463
BUSINESS,460
PRODUCTIVITY,424
PERSONALIZATION,392
COMMUNICATION,387
SPORTS,384
LIFESTYLE,383


In [199]:
# import plotly.express as px
# bar = px.bar(x = top10_category.index, # index = category name
#              y = top10_category.values)
 
# bar.show()

# h_bar = px.bar(x = category_installs.Installs,
#                y = category_installs.index,
#                orientation='h')
 
# h_bar.show()
# h_bar = px.bar(x = category_installs.Installs,
#                y = category_installs.index,
#                orientation='h',
#                title='Category Popularity')


In [169]:
df_app_clean.groupby(by="Installs").count().sort_values(by="App" , ascending=False)

Unnamed: 0_level_0,App,Category,Rating,Reviews,Size_MBs,Type,Price,Content_Rating,Genres
Installs,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
1000000,1579,1579,1577,1579,1579,1579,1579,1579,1579
10000000,1252,1252,1252,1252,1252,1252,1252,1252,1252
100000,1169,1169,1150,1169,1169,1169,1169,1169,1169
10000,1054,1054,1010,1054,1054,1054,1054,1054,1054
1000,908,908,714,908,908,908,908,908,908
5000000,752,752,752,752,752,752,752,752,752
100,719,719,309,719,719,719,719,719,719
500000,539,539,538,539,539,539,539,539,539
50000,479,479,467,479,479,479,479,479,479
5000,477,477,432,477,477,477,477,477,477


## How many different types of genres are there? Can an app belong to more than one genre? Check what happens when you use .value_counts() on a column with nested values? See if you can work around this problem by using the .split() function and the DataFrame's .stack() method.

In [203]:
stack = df_app_clean.Genres.str.split(';', expand=True).stack()
print(stack)
print(f'We now have a single column with shape: {stack.shape}')
num_genres = stack.value_counts()
print(f'Number of genres: {len(num_genres)}')

0      0             Social
1      0          Education
2      0    Personalization
3      0           Strategy
4      0           Business
                 ...       
10836  0             Arcade
10837  0             Arcade
10838  0             Arcade
10839  0             Arcade
10840  0             Arcade
Length: 11339, dtype: object
We now have a single column with shape: (11339,)
Number of genres: 53


# Find the Most Expensive Apps, Filter out the Junk, and Calculate a (ballpark) Sales Revenue Estimate

Let's examine the Price column more closely.

**Challenge**: Convert the price column to numeric data. Then investigate the top 20 most expensive apps in the dataset.

Remove all apps that cost more than $250 from the `df_apps_clean` DataFrame.

Add a column called 'Revenue_Estimate' to the DataFrame. This column should hold the price of the app times the number of installs. What are the top 10 highest grossing paid apps according to this estimate? Out of the top 10 highest grossing paid apps, how many are games?
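
A rough sketch of these steps on invented data is shown below. It assumes prices are strings with a leading `$` (as in the sample rows earlier in this notebook) and that `Installs` is already numeric. The app names and figures are made up.

```python
import pandas as pd

# Toy data (hypothetical apps and prices).
df_toy = pd.DataFrame({
    "App": ["Game X", "Pricey Joke App", "Tool Y", "Free Z"],
    "Price": ["$0.99", "$399.99", "$4.99", "0"],
    "Installs": [1_000_000, 100, 10_000, 5_000_000],
})

# Strip the currency symbol, then convert to a numeric dtype.
price_clean = df_toy["Price"].astype(str).str.replace("$", "", regex=False)
df_toy["Price"] = pd.to_numeric(price_clean)

# Remove everything that costs more than $250.
df_toy = df_toy[df_toy["Price"] <= 250].copy()

# Ballpark revenue: price times number of installs.
df_toy["Revenue_Estimate"] = df_toy["Price"] * df_toy["Installs"]

# Highest grossing paid apps according to this estimate.
top_paid = df_toy[df_toy["Price"] > 0].sort_values("Revenue_Estimate", ascending=False)

print(top_paid)
```

The estimate assumes every install was a purchase at the current price, which is why it is only a ballpark figure.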


### The most expensive apps sub $250

### Highest Grossing Paid Apps (ballpark estimate)

# Plotly Bar Charts & Scatter Plots: Analysing App Categories

### Vertical Bar Chart - Highest Competition (Number of Apps)

### Horizontal Bar Chart - Most Popular Categories (Highest Downloads)

### Category Concentration - Downloads vs. Competition

**Challenge**: 
* First, create a DataFrame that has the number of apps in one column and the number of installs in another:

<img src=https://imgur.com/uQRSlXi.png width="350">

* Then use the [plotly express examples from the documentation](https://plotly.com/python/line-and-scatter/) alongside the [.scatter() API reference](https://plotly.com/python-api-reference/generated/plotly.express.scatter.html) to create a scatter plot that looks like this. 

<img src=https://imgur.com/cHsqh6a.png>

*Hint*: Use the size, hover_name and color parameters in .scatter(). To scale the yaxis, call .update_layout() and specify that the yaxis should be on a log-scale like so: yaxis=dict(type='log') 
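
The first step, building one row per category with an app count and an install total, might look like the sketch below. The toy data and the output column names (`App`, `Installs`) are assumptions, and `Installs` is assumed to be numeric already.

```python
import pandas as pd

# Toy data (hypothetical); Installs is assumed to be numeric already.
df_toy = pd.DataFrame({
    "App": ["A", "B", "C", "D", "E"],
    "Category": ["GAME", "GAME", "TOOLS", "TOOLS", "TOOLS"],
    "Installs": [1000, 5000, 100, 100, 300],
})

# One row per category: number of apps and total installs.
cat_summary = df_toy.groupby("Category").agg(
    App=("App", "count"),
    Installs=("Installs", "sum"),
)

print(cat_summary)
```

A frame shaped like this could then be handed to `.scatter()` with the `size`, `hover_name` and `color` parameters mentioned in the hint.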

# Extracting Nested Data from a Column

**Challenge**: How many different types of genres are there? Can an app belong to more than one genre? Check what happens when you use .value_counts() on a column with nested values? See if you can work around this problem by using the .split() function and the DataFrame's [.stack() method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.stack.html). 
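
To illustrate the issue on a tiny made-up Series: `.value_counts()` on the raw column treats a combined entry such as `Casual;Education` as its own genre, whereas splitting on `;` and stacking yields one row per individual genre. The explicit `.dropna()` is a defensive assumption, since `.stack()` may or may not drop the padding values depending on the pandas version.

```python
import pandas as pd

# Toy genre column (hypothetical values); ";" separates nested genres.
genres = pd.Series(["Arcade", "Casual;Education", "Education", "Arcade"])

# Counting the raw column keeps "Casual;Education" as one combined label.
raw_counts = genres.value_counts()

# Split into one column per genre, stack into a single column,
# and drop the padding left where an app has fewer genres.
stacked = genres.str.split(";", expand=True).stack().dropna()
genre_counts = stacked.value_counts()

print(raw_counts)
print(genre_counts)
```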


# Colour Scales in Plotly Charts - Competition in Genres

**Challenge**: Can you create this chart with the Series containing the genre data? 

<img src=https://imgur.com/DbcoQli.png width=400>

Try experimenting with the built in colour scales in Plotly. You can find a full list [here](https://plotly.com/python/builtin-colorscales/). 

* Find a way to set the colour scale using the color_continuous_scale parameter. 
* Find a way to make the color axis disappear by using coloraxis_showscale. 

# Grouped Bar Charts: Free vs. Paid Apps per Category

**Challenge**: Use the plotly express bar [chart examples](https://plotly.com/python/bar-charts/#bar-chart-with-sorted-or-ordered-categories) and the [.bar() API reference](https://plotly.com/python-api-reference/generated/plotly.express.bar.html#plotly.express.bar) to create this bar chart: 

<img src=https://imgur.com/LE0XCxA.png>

You'll want to use the `df_free_vs_paid` DataFrame that you created above that has the total number of free and paid apps per category. 

See if you can figure out how to get the look above by changing the `categoryorder` to 'total descending' as outlined in the documentation [here](https://plotly.com/python/categorical-axes/#automatically-sorting-categories-by-name-or-total-value). 
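
Since the cell that builds `df_free_vs_paid` is not shown in this notebook, here is one plausible way such a frame could be assembled. It is a sketch on made-up data, and the name `df_free_vs_paid_toy` and its layout are assumptions.

```python
import pandas as pd

# Toy data (hypothetical apps).
df_toy = pd.DataFrame({
    "App": ["A", "B", "C", "D", "E"],
    "Category": ["GAME", "GAME", "GAME", "TOOLS", "TOOLS"],
    "Type": ["Free", "Paid", "Free", "Free", "Paid"],
})

# One row per (Category, Type) pair with the number of apps in it.
df_free_vs_paid_toy = df_toy.groupby(["Category", "Type"], as_index=False).agg(
    App=("App", "count")
)

print(df_free_vs_paid_toy)
```

Keeping `Category` and `Type` as ordinary columns (`as_index=False`) leaves the data in long format, which is the shape a grouped bar chart coloured by `Type` typically expects.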

# Plotly Box Plots: Lost Downloads for Paid Apps

**Challenge**: Create a box plot that shows the number of Installs for free versus paid apps. How does the median number of installations compare? Is the difference large or small?

Use the [Box Plots Guide](https://plotly.com/python/box-plots/) and the [.box API reference](https://plotly.com/python-api-reference/generated/plotly.express.box.html) to create the following chart. 

<img src=https://imgur.com/uVsECT3.png>
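
The median comparison can also be checked numerically alongside the chart. The sketch below uses invented install counts and assumes `Installs` is already numeric.

```python
import pandas as pd

# Toy data (hypothetical install counts).
df_toy = pd.DataFrame({
    "Type": ["Free", "Free", "Free", "Paid", "Paid", "Paid"],
    "Installs": [1_000, 100_000, 5_000_000, 100, 1_000, 10_000],
})

# Median installs for free versus paid apps.
median_installs = df_toy.groupby("Type")["Installs"].median()

print(median_installs)
```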


# Plotly Box Plots: Revenue by App Category

**Challenge**: See if you can generate the chart below: 

<img src=https://imgur.com/v4CiNqX.png>

Looking at the hover text, how much does the median app earn in the Tools category? If developing an Android app costs $30,000 or thereabouts, does the average photography app recoup its development costs?

Hint: I've used 'min ascending' to sort the categories. 

# How Much Can You Charge? Examine Paid App Pricing Strategies by Category

**Challenge**: What is the median price for a paid app? Then compare pricing by category by creating another box plot. But this time examine the prices (instead of the revenue estimates) of the paid apps. I recommend using `{'categoryorder':'max descending'}` to sort the categories.
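
As a quick numeric companion to the box plot, the sketch below computes the overall and per-category median price of paid apps. The data is made up and `Price` is assumed to have been converted to a number already.

```python
import pandas as pd

# Toy data (hypothetical); Price is assumed to be numeric already.
df_toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "GAME", "MEDICAL", "MEDICAL", "TOOLS"],
    "Type": ["Paid", "Paid", "Paid", "Paid", "Paid", "Free"],
    "Price": [0.99, 2.99, 4.99, 9.99, 24.99, 0.0],
})

# Restrict to paid apps before taking medians; free apps would drag them to zero.
paid = df_toy[df_toy["Type"] == "Paid"]

median_price = paid["Price"].median()
median_by_category = paid.groupby("Category")["Price"].median().sort_values(ascending=False)

print(median_price)
print(median_by_category)
```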