# Introduction

In this notebook, we will do a comprehensive analysis of the Android app market by comparing thousands of apps in the Google Play store.

# About the Dataset of Google Play Store Apps & Reviews

**Data Source:** <br>
App and review data was scraped from the Google Play Store by Lavanya Gupta in 2018. The original files are listed [here](https://www.kaggle.com/lava18/google-play-store-apps).

# Import Statements

In [1]:
import pandas as pd


# Show numeric output in decimal format e.g., 2.15
pd.options.display.float_format = '{:,.2f}'.format

dataset = pd.read_csv("apps.csv")


# Data exploration 

In [2]:
# Get the shape of the data
dataset.shape

# Check the columns
dataset.columns

# Look at a random sample of 5 rows
dataset.sample(5)

# Count the missing values in each column
dataset.isna().sum()

# Remove rows with missing values
clean_df = dataset.dropna()

# Remove unwanted columns
clean_df = clean_df.drop(['Last_Updated', 'Android_Ver'], axis=1)
clean_df.head()


# Check for duplicates: pull the rows that are duplicated in every column
clean_df[clean_df.duplicated()]
# drop_duplicates() cannot be applied straight away, because repeated entries
# of the same app may differ in the other columns.
# Therefore identify duplicates by a subset of the column names.
clean_df = clean_df.drop_duplicates(subset=["App", "Type", "Price"])
# Passing keep="last" would keep the last occurrence instead of the first.


# The same exploration steps can be run on the cleaned dataset if needed.




# Check the data type of each column
dataset.dtypes


# Get descriptive statistics. describe must be called with parentheses;
# without them the cell only echoes the bound method instead of the statistics.
dataset.describe()

# Can also work out the number of values that are outside 1.5x the IQR.
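The IQR check mentioned in the cell above could look roughly like the sketch below. The ratings are made-up values for illustration, not rows from `apps.csv`, and the variable names are hypothetical.

```python
import pandas as pd

# Made-up ratings for illustration only (not taken from apps.csv)
ratings = pd.Series([1.0, 4.1, 4.3, 4.4, 4.5, 4.6, 4.7], name="Rating")

# Interquartile range: distance between the 25th and 75th percentiles
q1 = ratings.quantile(0.25)
q3 = ratings.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5x the IQR below Q1 and above Q3
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Boolean mask of the values outside the fences
is_outlier = (ratings < lower_fence) | (ratings > upper_fence)
n_outliers = int(is_outlier.sum())
print(n_outliers)
```

On the real data the same mask could be built from `clean_df["Rating"]` or any other numeric column.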

     

# Questions 


<h3>Challenge:</h3><div>Identify which apps are the highest rated. What problem might you encounter if you rely exclusively on ratings to determine the quality of an app?</div>

In [3]:
# "Highest rated" means the apps whose rating is the maximum.

# idxmax returns the index label of the first maximum, so this finds a single app
highest_app_id = clean_df["Rating"].idxmax()
clean_df.loc[highest_app_id, "App"]


# Could also sort by rating in descending order:
clean_df.sort_values("Rating", ascending=False).head(10)


# Or use a boolean mask to get every app that shares the maximum rating
value = clean_df["Rating"].max()
highest_rated_items = clean_df[clean_df["Rating"] == value]
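One problem with relying on ratings alone is that an app with a handful of reviews can hold a perfect score. The sketch below shows one possible way to account for that by requiring a minimum number of reviews. The apps are invented, and the `MIN_REVIEWS` threshold is an arbitrary assumption rather than a value from the dataset.

```python
import pandas as pd

# Invented apps for illustration only
apps = pd.DataFrame({
    "App": ["Tiny App", "Popular App", "Niche App", "Big Game"],
    "Rating": [5.0, 4.6, 5.0, 4.5],
    "Reviews": [2, 150_000, 7, 2_000_000],
})

MIN_REVIEWS = 1_000  # assumed cut-off; pick whatever suits the analysis

# Ranking by rating alone puts the barely reviewed apps on top
top_by_rating = apps.sort_values("Rating", ascending=False)

# Keep only apps with enough reviews before ranking
trusted = (
    apps[apps["Reviews"] >= MIN_REVIEWS]
    .sort_values(["Rating", "Reviews"], ascending=False)
)
print(trusted[["App", "Rating", "Reviews"]])
```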



<h3>Challenge</h3><div>What's the size in megabytes (MB) of the largest Android apps in the Google Play Store? Based on the data, do you think there could be a limit in place or can developers make apps as large as they please?</div>

In [4]:
clean_df.sort_values("Size_MBs", ascending  = False).head(30)
    # The largest apps all sit at exactly 100 MB, which suggests a size limit is in place.

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10295,SimCity BuildIt,FAMILY,4.5,4218587,100.0,50000000,Free,0,Everyone 10+,Simulation
3144,Vi Trainer,HEALTH_AND_FITNESS,3.6,124,100.0,5000,Free,0,Everyone,Health & Fitness
4176,Car Crash III Beam DH Real Damage Simulator 2018,GAME,3.6,151,100.0,10000,Free,0,Everyone,Racing
7926,Post Bank,FINANCE,4.5,60449,100.0,1000000,Free,0,Everyone,Finance
8718,Mini Golf King - Multiplayer Game,GAME,4.5,531458,100.0,5000000,Free,0,Everyone,Sports
7927,The Walking Dead: Our World,GAME,4.0,22435,100.0,1000000,Free,0,Teen,Action
7928,Stickman Legends: Shadow Wars,GAME,4.4,38419,100.0,1000000,Paid,$0.99,Everyone 10+,Action
9945,Ultimate Tennis,SPORTS,4.3,183004,100.0,10000000,Free,0,Everyone,Sports
9942,Talking Babsy Baby: Baby Games,LIFESTYLE,4.0,140995,100.0,10000000,Free,0,Everyone,Lifestyle;Pretend Play
9943,Miami crime simulator,GAME,4.0,254518,100.0,10000000,Free,0,Mature 17+,Action


<h3>Challenge</h3><div>Which apps have the highest number of reviews? Are there any paid apps among the top 50?</div>

In [5]:
# Take the 50 most-reviewed apps, then count how many of them are free
sorted_dataset = clean_df.sort_values("Reviews", ascending=False).head(50)
sorted_dataset[sorted_dataset["Type"] == "Free"].count()
    # All 50 are free, so there are no paid apps in the top 50.

App               50
Category          50
Rating            50
Reviews           50
Size_MBs          50
Installs          50
Type              50
Price             50
Content_Rating    50
Genres            50
dtype: int64
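A more direct way to answer the question is to count the paid apps in the top slice instead of reading it off the free-app counts. The following is only a sketch on invented data, with a top 3 standing in for the top 50.

```python
import pandas as pd

# Invented apps for illustration only
apps = pd.DataFrame({
    "App": ["A", "B", "C", "D", "E"],
    "Reviews": [900, 50, 700, 10, 300],
    "Type": ["Free", "Paid", "Free", "Paid", "Free"],
})

# nlargest returns the n rows with the highest values, already sorted
top_n = apps.nlargest(3, "Reviews")

# Summing a boolean mask counts the rows where it is True
paid_in_top = int((top_n["Type"] == "Paid").sum())
print(paid_in_top)
```

With the notebook's data this would be `clean_df.nlargest(50, "Reviews")` followed by the same mask.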

<h3>Challenge</h3><div>All Android apps have a content rating like “Everyone” or “Teen” or “Mature 17+”. Let’s take a look at the distribution of the content ratings in our dataset and see how to visualise it with plotly - a popular data visualisation library that you can use alongside or instead of Matplotlib.

First, we’ll count the number of occurrences of each rating with .value_counts()</div>

In [6]:
clean_df["Content_Rating"].value_counts()

Content_Rating
Everyone           6619
Teen                912
Mature 17+          357
Everyone 10+        305
Adults only 18+       3
Unrated               1
Name: count, dtype: int64

### Creating a pie chart or donut chart using plotly

In [7]:
# Import Plotly 
import plotly.express as px 


# The parameters of the graph
fig = px.pie(data_frame=clean_df, names="Content_Rating", title="Content Rating")
fig.update_traces(textposition='outside', textinfo='percent+label')

fig.show()



In [8]:
# Or you could do this: 
fig  =  px.pie(data_frame= clean_df, names ="Content_Rating", title="Content Rating", hole=0.6)
fig.update_traces(textposition='inside', textfont_size=15, textinfo='percent')
 
fig.show()

## Conversions to different data types

<h3>Challenge</h3><div>How many apps had over 1 billion (that's right - BILLION) installations? How many apps just had a single install?

Check the datatype of the Installs column.

Count the number of apps at each level of installations.

Convert the number of installations (the Installs column) to a numeric data type. Hint: this is a 2-step process. You'll have to make sure you remove non-numeric characters first.</div>

In [9]:
# Check what type of data the column holds:
clean_df["Installs"].describe()
    # Or use:
clean_df["Installs"].info()


# To convert the data, we first need to remove the commas in the values.
# Cast the column to str, use the .str accessor, then replace the commas.
clean_df["Installs"] = clean_df["Installs"].astype(str).str.replace(",", "")

# Converting to numerical values:
clean_df["Installs"] = pd.to_numeric(clean_df["Installs"], errors="coerce")
# astype is stricter than to_numeric, which lets you choose what happens on errors:
# errors="raise" raises an exception and errors="coerce" turns invalid values into NaN.
# (errors="ignore" returns the input unchanged, but is deprecated in recent pandas versions.)

# Count the number of apps at each level of installs
clean_df[["App", "Installs"]].groupby("Installs").agg({"App": "count"})

<class 'pandas.core.series.Series'>
Index: 8197 entries, 21 to 10835
Series name: Installs
Non-Null Count  Dtype 
--------------  ----- 
8197 non-null   object
dtypes: object(1)
memory usage: 386.1+ KB


Installs,App
1,3
5,9
10,69
50,56
100,303
500,199
1000,697
5000,425
10000,987
50000,457
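The two-step conversion (strip the non-numeric characters, then convert) can be seen in isolation on a small Series. The values below are invented, and the `"unknown"` entry is there only to show what `errors="coerce"` does with something that cannot be parsed.

```python
import pandas as pd

# Invented install counts stored as text, as they would be before cleaning
raw = pd.Series(["1,000", "50,000", "unknown", "10"], name="Installs")

# Step 1: remove the thousands separators (regex=False means a literal comma)
cleaned = raw.str.replace(",", "", regex=False)

# Step 2: convert to numbers; values that cannot be parsed become NaN
numeric = pd.to_numeric(cleaned, errors="coerce")
print(numeric)
```

Because of the NaN, the result is a float column; the unparseable rows can be dropped or inspected separately before further analysis.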


<h3>Challenge</h3><div>
Convert the price column to numeric data. Then investigate the top 20 most expensive apps in the dataset.

Remove all apps that cost more than $250 from the df_apps_clean DataFrame.

Add a column called 'Revenue_Estimate' to the DataFrame. This column should hold the price of the app times the number of installs. What are the top 10 highest-grossing paid apps according to this estimate? Out of the top 10, how many are games?</div>

In [10]:
# Check the type of data in the Price column
clean_df["Price"]
# It holds dollar signs and numbers, so convert to string and remove the dollar signs.
# regex=False treats "$" as a literal character rather than a regex anchor.
clean_df["Price"] = clean_df["Price"].astype(str).str.replace("$", "", regex=False)

# Convert the price to numeric data
clean_df["Price"] = pd.to_numeric(clean_df["Price"], errors="coerce")

# Pull the top 20 most expensive apps
clean_df.sort_values("Price", ascending=False).head(20)

# Keep only the rows where the app costs less than 250 dollars
adjusted_price_dataset = clean_df[clean_df["Price"] < 250.00]
adjusted_price_dataset.head(20)

# Add a column called Revenue_Estimate: price x installs
# Note: this is computed on clean_df, which still contains the apps priced above $250.
clean_df["Revenue_Estimate"] = clean_df["Price"] * clean_df["Installs"]
# Sort to find the highest grossing (the name `sorted` would shadow the built-in)
sorted_by_revenue = clean_df.sort_values("Revenue_Estimate", ascending=False)

# Get the top 10 by revenue estimate
top_10 = sorted_by_revenue.head(10)
# Count how many of them are games
int(top_10[top_10["Category"] == "GAME"]["App"].count())

3
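For comparison, the sketch below applies the price filter before estimating revenue, so the very expensive apps cannot enter the ranking. It runs on four invented apps; the names, prices and install counts are assumptions made for illustration and say nothing about the real top 10.

```python
import pandas as pd

# Invented apps for illustration only
apps = pd.DataFrame({
    "App": ["Pricey Novelty", "Block Game", "Photo Tool", "Free Chat"],
    "Category": ["LIFESTYLE", "GAME", "PHOTOGRAPHY", "SOCIAL"],
    "Price": ["$399.99", "$6.99", "$2.99", "0"],
    "Installs": [10_000, 1_000_000, 50_000, 5_000_000],
})

# Strip the literal "$" and convert the prices to numbers
apps["Price"] = pd.to_numeric(
    apps["Price"].str.replace("$", "", regex=False), errors="coerce"
)

# Remove the expensive apps first, then estimate revenue on what is left
apps = apps[apps["Price"] < 250].copy()
apps["Revenue_Estimate"] = apps["Price"] * apps["Installs"]

# Rank by the estimate and count the games among the top 2
top = apps.sort_values("Revenue_Estimate", ascending=False)
games_in_top = int((top.head(2)["Category"] == "GAME").sum())
print(top[["App", "Revenue_Estimate"]])
```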

## Bar charts and Scatter Plots

In [11]:
# Work out the number of unique categories:
clean_df["Category"].nunique()

# Count how many apps there are in each category:
clean_df["Category"].value_counts()
# There are a lot of categories, so keep only the 10 largest
top_10_categories = clean_df["Category"].value_counts()[:10]

In [12]:
# ===== Visualising the data in a bar chart =====
import plotly.express as px
bar = px.bar(x=top_10_categories.index,
             y=top_10_categories.values)
# Can also pass the Series directly:
bar = px.bar(top_10_categories)
# Display the chart
bar.show()

In [13]:
# ===== Horizontal bar graph =====
# Find the number of installs per category
category_installs = clean_df.groupby("Category").agg({"Installs": "sum"})

# Sort ascending so that the largest category ends up at the top of the horizontal chart
category_installs.sort_values("Installs", ascending=True, inplace=True)

# Create a horizontal bar chart
h_bar = px.bar(category_installs,
               orientation="h",
               title='Category Popularity')

h_bar.update_layout(xaxis_title='Number of Downloads', yaxis_title='Category')

h_bar.show()

In [14]:
# ===== Creating a scatter plot =====

# Group by category: number of apps and total number of installs
# (summing Installs; counting it would just repeat the app count)
data = clean_df.groupby("Category").agg({"App": "count", "Installs": "sum"})
data
# Scatter plot:
import plotly.express as px
fig = px.scatter(data, x="App", y="Installs", color='Installs', size="App", hover_name=data.index)
fig.update_layout(xaxis_title="Number of Apps (Lower=More Concentrated)", yaxis_title="Installs", yaxis=dict(type='log'))
fig.show()


## Dealing with a stacked column

In [15]:
# Many apps fit into multiple genres, so the Genres column needs to be split.
stacked = clean_df["Genres"].str.split(";", expand=True).stack()
# Splits each genre string on ";" into separate columns, then stacks those columns into one long Series
stacked.value_counts()
# Counts the number of apps that fall into each genre
len(stacked.value_counts())
stacked = stacked.value_counts()
stacked


Tools                      719
Education                  587
Entertainment              502
Action                     304
Lifestyle                  303
Finance                    302
Productivity               301
Personalization            296
Medical                    292
Sports                     270
Photography                263
Business                   262
Communication              258
Health & Fitness           245
Casual                     216
News & Magazines           204
Social                     203
Simulation                 200
Travel & Local             187
Arcade                     185
Shopping                   180
Books & Reference          171
Video Players & Editors    150
Dating                     134
Puzzle                     124
Maps & Navigation          118
Role Playing               111
Racing                     103
Action & Adventure          96
Strategy                    95
Food & Drink                94
Educational                 93
Adventur
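The split-and-stack step can be checked on a few rows by hand. The sketch below uses invented genre strings and also shows `explode` as an alternative way to get one row per genre; both routes should give the same counts.

```python
import pandas as pd

# Invented genre strings for illustration only
genres = pd.Series(
    ["Action", "Lifestyle;Pretend Play", "Action;Action & Adventure", "Lifestyle"],
    name="Genres",
)

# Route 1: split into columns, then stack the columns into one long Series
stacked = genres.str.split(";", expand=True).stack()

# Route 2: split into lists, then give every list element its own row
exploded = genres.str.split(";").explode()

# value_counts ignores missing values by default
counts_stack = stacked.value_counts()
counts_explode = exploded.value_counts()
print(counts_stack)
```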

In [16]:
# Get the data ready for the bar chart
# Name the index "Genre", then move it into a column; the counts go into an "Apps" column.
# A new variable is used so the cell can be re-run without resetting the index twice.
genre_counts = stacked.rename_axis("Genre").reset_index(name="Apps")

import plotly.express as px

fig = px.bar(genre_counts[:15],                     # Looks at the first 15
             x="Genre",                             # Sets the x axis
             y="Apps",                              # Sets the y axis
             title="Top Genres",                    # Chart title
             hover_name="Genre",                    # What comes up when you hover
             color="Apps",                          # Each colour is based on the app count
             color_continuous_scale='Agsunset')     # The palette
fig.update_layout(xaxis_title='Genre',
                  yaxis_title='Number of Apps',
                  coloraxis_showscale=True)         # Shows the colour scale; set to False to hide it
fig.show()

## Grouped Bar Charts and Box Plots with Plotly

In [19]:
# Check how many apps are free and how many are paid.
clean_df["Type"].value_counts()

# Create a table with the categories and whether the apps are free or paid.
clean_df.groupby(["Category", "Type"], as_index=False).agg({"App": "count"})

Unnamed: 0,Category,Type,App
0,ART_AND_DESIGN,Free,58
1,ART_AND_DESIGN,Paid,3
2,AUTO_AND_VEHICLES,Free,72
3,AUTO_AND_VEHICLES,Paid,1
4,BEAUTY,Free,42
...,...,...,...
56,TRAVEL_AND_LOCAL,Paid,8
57,VIDEO_PLAYERS,Free,144
58,VIDEO_PLAYERS,Paid,4
59,WEATHER,Free,65


In [20]:
# ===== Converting the data above into a bar graph =====
# Create a table with the categories and whether the apps are free or paid.
cateogories_to_types = clean_df.groupby(["Category", "Type"], as_index=False).agg({"App": "count"})

# Adapted from the Plotly documentation:
import plotly.graph_objects as go

# Split the table by type so that x and y have the same length in each trace
free_counts = cateogories_to_types[cateogories_to_types["Type"] == "Free"]
paid_counts = cateogories_to_types[cateogories_to_types["Type"] == "Paid"]

fig = go.Figure()                       # Create a blank canvas
fig.add_trace(go.Bar(
    x=free_counts["Category"],          # State the x and y values
    y=free_counts["App"],
    name="Free Products",
    marker_color='indianred',
))

fig.add_trace(go.Bar(
    x=paid_counts["Category"],
    y=paid_counts["App"],
    name='Paid Products',
    marker_color='lightsalmon'
))

fig.update_layout(
    xaxis_title="Categories",
    yaxis_title="Apps",
    title="Free vs Paid apps"
)

fig.update_layout(barmode='group', xaxis_tickangle=-45)     # Group the bars and tilt the tick labels
# Order the categories by total value; 'category ascending' would give alphabetical order instead
fig.update_xaxes(categoryorder='total ascending')
fig.show()
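When a category has no paid apps, the Free and Paid rows of the grouped table no longer line up one to one. One possible way to make the alignment explicit is to pivot the counts into one row per category, as in the sketch below. The counts are invented for illustration.

```python
import pandas as pd

# Invented counts in the same long shape as the grouped table above
counts = pd.DataFrame({
    "Category": ["ART", "ART", "BEAUTY", "GAME", "GAME"],
    "Type": ["Free", "Paid", "Free", "Free", "Paid"],
    "App": [58, 3, 42, 800, 90],
})

# One row per category, one column per type; missing combinations become 0
wide = (
    counts.pivot(index="Category", columns="Type", values="App")
    .fillna(0)
    .astype(int)
)
print(wide)
```

Each column of `wide` then has exactly one value per category, which is the shape a grouped bar chart needs.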


## Boxplots

<h3>Challenge</h3>
<div>Create a box plot that shows the number of Installs for free versus paid apps. How does the median number of installations compare? Is the difference large or small?</div>

In [47]:
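import plotly.express as px

# Compare the installs of free versus paid apps, one row per app.
# A log axis keeps the very wide range of install counts readable.
fig = px.box(clean_df, x="Type", y="Installs")
fig.update_layout(width=900, height=1200, yaxis=dict(type='log'))
fig.show()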

In [67]:
# ===== How much can paid apps earn? =====
# Look at the paid app revenue through a box plot
paid_apps = clean_df[clean_df["Type"] == "Paid"].copy()     # Boolean indexing does not always create an explicit copy of the dataset, hence the use of copy()
    ## df.loc[df["Type"] == "Paid", "Revenue_Estimate"] = df["Price"] * df["Installs"]     # Could also do this from the original dataset


# Creating the diagram
fig = px.box(paid_apps,
             x="Category",
             y="Revenue_Estimate",
             title="How much can paid apps earn?")

fig.update_layout(xaxis_title='Category',
                  yaxis_title='Paid App Ballpark Revenue',
                  yaxis=dict(type='log'),
                  xaxis={'categoryorder': 'min ascending'})     # Orders the categories by their smallest revenue estimate, lowest first



fig.show()


In [79]:
# Work out the median price and median revenue estimate of a paid app
median_price = paid_apps["Price"].median()
median_revenue = paid_apps["Revenue_Estimate"].median()
print(f"${median_price}")

$2.99


In [93]:
# ===== How much can paid apps earn? (with the median marked) =====
# Look at the paid app revenue through a box plot
paid_apps = clean_df[clean_df["Type"] == "Paid"].copy()     # Boolean indexing does not always create an explicit copy of the dataset, hence the use of copy()
    ## df.loc[df["Type"] == "Paid", "Revenue_Estimate"] = df["Price"] * df["Installs"]     # Could also do this from the original dataset


# Creating the diagram
fig = px.box(paid_apps,
             x="Category",
             y="Revenue_Estimate",
             title="How much can paid apps earn?")

fig.update_layout(xaxis_title='Category',
                  yaxis_title='Paid App Ballpark Revenue',
                  yaxis=dict(type='log'),
                  xaxis={'categoryorder': 'min descending'})    # Orders the categories by their smallest revenue estimate, highest first

# Draw a dotted horizontal line at the median revenue estimate
fig.add_hline(y=median_revenue, line_color="red", line_dash="dot", annotation_text="Median Revenue Estimate", annotation_font_size=40,
              annotation_font_color="blue", annotation_position="bottom right")


fig.show()