<a href="https://colab.research.google.com/github/zuhairahzolkaply/Data_Analyst_and_Visualisation/blob/main/Google_App_Data_VIsualisation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

This notebook presents a comprehensive analysis of the Android app market by comparing thousands of apps in the Google Play Store.

# Import Statements

In [1]:
import pandas as pd
import plotly.express as px


# Notebook Presentation

In [2]:
# numeric output in decimal format e.g., 2.15
pd.options.display.float_format = '{:,.2f}'.format

# Read the Dataset

In [3]:
df_apps = pd.read_csv('apps.csv')

# Data Cleaning

In [4]:
df_apps.shape

(10841, 12)

In [None]:
df_apps.head()

In [None]:
df_apps.sample(5)

### Drop Unused Columns



In [None]:
# Remove the `Last_Updated` and `Android_Ver` columns
df_apps.drop(['Last_Updated', 'Android_Ver'], axis=1, inplace=True)
df_apps.head()

### Remove NaN values in Ratings


In [None]:
# Rows with a NaN value in the Rating column
nan_rows = df_apps[df_apps.Rating.isna()]
print(nan_rows.shape)
nan_rows.head()

In [10]:
# Create `df_apps_clean`, which excludes every row that contains a NaN value
df_apps_clean = df_apps.dropna()
df_apps_clean.shape

(9367, 10)
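
Note that `.dropna()` without arguments drops a row when any column is missing, not only `Rating`. The sketch below uses a small made-up DataFrame (the values are hypothetical, not taken from `apps.csv`) to show how the `subset` parameter could restrict the check to the `Rating` column, should that be the intent.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values: gaps in two different columns
toy = pd.DataFrame({
    'App': ['A', 'B', 'C', 'D'],
    'Rating': [4.5, np.nan, 3.9, 4.1],
    'Size_MBs': [12.0, 8.0, np.nan, 20.0],
})

# Drops B (missing Rating) and C (missing Size_MBs)
all_cols = toy.dropna()

# Drops only B, because only the Rating column is checked
rating_only = toy.dropna(subset=['Rating'])

print(all_cols.shape, rating_only.shape)
```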

### Remove Duplicates


In [17]:

duplicated_rows = df_apps_clean[df_apps_clean.duplicated()]
print(duplicated_rows.shape)
duplicated_rows.head()

(0, 10)


Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres


In [16]:
#entries in "Instagram" app
df_apps_clean[df_apps_clean.App == 'Instagram']

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social
10808,Instagram,SOCIAL,4.5,66577446,5.3,1000000000,Free,0,Teen,Social
10810,Instagram,SOCIAL,4.5,66509917,5.3,1000000000,Free,0,Teen,Social


In [15]:
# Remove rows that are identical across every column
df_apps_clean = df_apps_clean.drop_duplicates()

In [18]:
df_apps_clean[df_apps_clean.App == 'Instagram']

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social
10808,Instagram,SOCIAL,4.5,66577446,5.3,1000000000,Free,0,Teen,Social
10810,Instagram,SOCIAL,4.5,66509917,5.3,1000000000,Free,0,Teen,Social


In [19]:
# The rows differ only in Reviews, so specify the subset of columns used for identifying duplicates
df_apps_clean = df_apps_clean.drop_duplicates(subset=['App', 'Type', 'Price'])
df_apps_clean[df_apps_clean.App == 'Instagram']

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social
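
As a minimal illustration of why the plain call left several Instagram rows in place, the sketch below builds a toy DataFrame (hypothetical rows modelled on the output above) and compares `.drop_duplicates()` with and without `subset`.

```python
import pandas as pd

# Toy data: three Instagram rows, two of which are fully identical
toy = pd.DataFrame({
    'App': ['Instagram', 'Instagram', 'Instagram', 'Other'],
    'Type': ['Free', 'Free', 'Free', 'Paid'],
    'Price': ['0', '0', '0', '$1.99'],
    'Reviews': [66577313, 66577446, 66577313, 10],
})

# Without subset, only rows identical in every column are dropped
full_row = toy.drop_duplicates()

# With subset, rows matching on App, Type and Price count as duplicates;
# the first occurrence is kept by default
by_subset = toy.drop_duplicates(subset=['App', 'Type', 'Price'])

print(len(full_row), len(by_subset))
```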


In [20]:
df_apps_clean.shape

(8199, 10)

# Highest Rated Apps


In [None]:
# Find the highest rated apps with .sort_values()
df_apps_clean.sort_values('Rating', ascending=False).head()

5-star ratings are typically found on apps with few reviews and a low installation count, which suggests that many of these ratings were given by friends and family of the app creators.


Relying solely on ratings to assess the quality of an app would therefore be misleading.
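
One possible way to reduce this bias is to require a minimum number of reviews before ranking by rating. The sketch below uses made-up data, and the `MIN_REVIEWS` cutoff is a hypothetical value chosen only for illustration.

```python
import pandas as pd

# Toy data with hypothetical values
toy = pd.DataFrame({
    'App': ['Tiny', 'Small', 'Big', 'Huge'],
    'Rating': [5.0, 5.0, 4.7, 4.5],
    'Reviews': [2, 11, 250_000, 66_000_000],
})

MIN_REVIEWS = 1_000  # hypothetical cutoff, tune to the dataset

# Keep only apps with enough reviews, then rank by rating
well_reviewed = toy[toy.Reviews >= MIN_REVIEWS]
top = well_reviewed.sort_values('Rating', ascending=False)

print(top)
```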


# 5 Largest Apps in Terms of Size (MBs)



In [None]:
# Size in megabytes (MB) of the largest Android apps in the Google Play Store
df_apps_clean.sort_values('Size_MBs', ascending=False).head()

There is a noticeable cap on app size at 100 MB. A brief online search would confirm that this limit is set by the Google Play Store. Interestingly, several apps reach this limit exactly.



# 5 Apps with Most Reviews



In [None]:
df_apps_clean.sort_values('Reviews', ascending=False).head(50)

The most reviewed apps on the Google Play Store include familiar names like Facebook, WhatsApp, and Instagram. Notably, among the top 50 most reviewed apps, not a single one is a paid app!
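
The claim about paid apps can be checked in code rather than by eye. The sketch below shows the idea on a toy DataFrame (hypothetical values) with the top 3 rows; on the real data, the same pattern would apply to `df_apps_clean` with 50 in place of 3.

```python
import pandas as pd

# Toy data with hypothetical values
toy = pd.DataFrame({
    'App': ['A', 'B', 'C', 'D', 'E'],
    'Type': ['Free', 'Free', 'Paid', 'Free', 'Paid'],
    'Reviews': [100, 90, 5, 80, 1],
})

# Take the most reviewed apps, then count how many of them are paid
top_n = toy.nlargest(3, 'Reviews')
paid_in_top = (top_n.Type == 'Paid').sum()

print(paid_in_top)
```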

# Donut Charts - Visualise Categorical Data: Content Ratings

In [24]:
# Count the number of occurrences of each content rating with .value_counts()
ratings = df_apps_clean.Content_Rating.value_counts()
ratings

Everyone           6621
Teen                912
Mature 17+          357
Everyone 10+        305
Adults only 18+       3
Unrated               1
Name: Content_Rating, dtype: int64

In [26]:
# `names` sets the slice labels; in Plotly Express, `labels` is reserved for
# a dict that renames columns, so it is not used here
fig = px.pie(names=ratings.index,
             values=ratings.values,
             title="Content Rating",
             # a non-zero hole turns the pie chart into a donut chart
             hole=0.6,
)
# .update_traces() changes how the text on the chart is displayed
fig.update_traces(textposition='outside', textfont_size=15, textinfo='percent+label')

fig.show()





# Numeric Type Conversion:  Number of Installs



In [27]:
# Installs is stored as text, so .describe() reports count/unique/top/freq rather than numeric statistics
df_apps_clean.Installs.describe()

count          8199
unique           19
top       1,000,000
freq           1417
Name: Installs, dtype: object

In [28]:
df_apps_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8199 entries, 21 to 10835
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             8199 non-null   object 
 1   Category        8199 non-null   object 
 2   Rating          8199 non-null   float64
 3   Reviews         8199 non-null   int64  
 4   Size_MBs        8199 non-null   float64
 5   Installs        8199 non-null   object 
 6   Type            8199 non-null   object 
 7   Price           8199 non-null   object 
 8   Content_Rating  8199 non-null   object 
 9   Genres          8199 non-null   object 
dtypes: float64(2), int64(1), object(7)
memory usage: 704.6+ KB


The `Installs` column has a non-numeric data type: `object`.

In [None]:
# Installs is read as text because of the comma (,) thousands separator
# Count the number of entries per level of installations
df_apps_clean[['App', 'Installs']].groupby('Installs').count()

In [None]:
# Remove the comma (,) character by replacing it with an empty string
df_apps_clean.Installs = df_apps_clean.Installs.astype(str).str.replace(',', "")

# Convert the cleaned strings to numbers with pd.to_numeric()
df_apps_clean.Installs = pd.to_numeric(df_apps_clean.Installs)

df_apps_clean[['App', 'Installs']].groupby('Installs').count()

# Filter out the Junk



In [None]:
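# Remove the dollar sign; regex=False treats '$' as a literal character
# rather than the regular-expression end-of-string anchor
df_apps_clean.Price = df_apps_clean.Price.astype(str).str.replace('$', "", regex=False)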
df_apps_clean.Price = pd.to_numeric(df_apps_clean.Price)

df_apps_clean.sort_values('Price', ascending=False).head(20)
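
The price cleanup can be tried in isolation. The sketch below runs it on a few made-up price strings (hypothetical values) and passes `regex=False` explicitly, since `$` is a special character in regular expressions and the default for `regex` has differed between pandas versions.

```python
import pandas as pd

# Toy price strings with hypothetical values
prices = pd.Series(['0', '$4.99', '$399.99'])

# Treat '$' as a literal character, not as a regex anchor
cleaned = prices.str.replace('$', '', regex=False)

# Convert the cleaned strings to floats
numeric = pd.to_numeric(cleaned)

print(numeric)
```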

### The most expensive apps sub $250

In [None]:
# Keep apps priced under $250, which removes the 15 "I am Rich" apps in the Google Play Store
df_apps_clean = df_apps_clean[df_apps_clean['Price'] < 250]
df_apps_clean.sort_values('Price', ascending=False).head(5)

### Highest Grossing Paid Apps (ballpark estimate)

In [None]:
# Estimate revenue by multiplying the values in the Installs and Price columns
df_apps_clean['Revenue_Estimate'] = df_apps_clean.Installs.mul(df_apps_clean.Price)
df_apps_clean.sort_values('Revenue_Estimate', ascending=False)[:10]

# Plotly Bar Charts & Scatter Plots: Analysing App Categories

In [34]:
# Number of different categories
df_apps_clean.Category.nunique()

33

In [35]:
# Number of apps per category
top10_category = df_apps_clean.Category.value_counts()[:10]  # the 10 categories with the most apps
top10_category

FAMILY             1606
GAME                910
TOOLS               719
PRODUCTIVITY        301
PERSONALIZATION     298
LIFESTYLE           297
FINANCE             296
MEDICAL             292
PHOTOGRAPHY         263
BUSINESS            262
Name: Category, dtype: int64

### Vertical Bar Chart - Highest Competition (Number of Apps)

In [None]:
bar = px.bar(
        x = top10_category.index, # index = category name
        y = top10_category.values)

bar.show()

### Horizontal Bar Chart - Most Popular Categories (Highest Downloads)

In [38]:
# Group apps by category and then sum the number of installations
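# The string 'sum' is the idiomatic form and avoids the FutureWarning that
# recent pandas versions raise for pd.Series.sum
category_installs = df_apps_clean.groupby('Category').agg({'Installs': 'sum'})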
category_installs.sort_values('Installs', ascending=True, inplace=True)

In [39]:
h_bar = px.bar(
        x = category_installs.Installs,
        y = category_installs.index,
        orientation='h',
        title='Category Popularity')

h_bar.update_layout(xaxis_title='Number of Downloads', yaxis_title='Category')
h_bar.show()

### Category Concentration - Downloads vs. Competition



In [None]:
# Build a DataFrame with the number of apps in one column and the number of installs in another
# Count the number of apps in each category
cat_number = df_apps_clean.groupby('Category').agg({'App': 'count'})
# Combine the two DataFrames on their shared Category index with pd.merge()
cat_merged_df = pd.merge(cat_number, category_installs, on='Category', how="inner")
print(f'The dimensions of the DataFrame are: {cat_merged_df.shape}')
cat_merged_df.sort_values('Installs', ascending=False)

In [None]:
scatter = px.scatter(cat_merged_df, # data
                     x='App', # column name
                     y='Installs',
                     title='Category Concentration',
                     size='App',
                     hover_name=cat_merged_df.index,
                     color='Installs'
)

scatter.update_layout(xaxis_title="Number of Apps (Lower=More Concentrated)",
                      yaxis_title="Installs",
                      yaxis=dict(type='log'))

scatter.show()

Categories like Family, Tools, and Game have many different apps sharing a high number of downloads. In categories like Video Players and Entertainment, however, the downloads are concentrated in very few apps.
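
Concentration can also be expressed as a number, for example as installs per app in each category. This is only one possible measure, sketched below on toy data; the values and the `Installs_per_App` column name are hypothetical.

```python
import pandas as pd

# Toy data with hypothetical values
toy = pd.DataFrame({
    'Category': ['FAMILY', 'FAMILY', 'FAMILY', 'VIDEO_PLAYERS'],
    'App': ['a', 'b', 'c', 'd'],
    'Installs': [1_000_000, 2_000_000, 3_000_000, 1_000_000_000],
})

# Number of apps and total installs per category in one step
summary = toy.groupby('Category').agg(
    App=('App', 'count'),
    Installs=('Installs', 'sum'),
)

# A higher value means downloads are spread over fewer apps
summary['Installs_per_App'] = summary.Installs / summary.App

print(summary)
```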

# Extracting Nested Data from a Column



In [42]:
# Number of Genres?
len(df_apps_clean.Genres.unique())

114

This count is not accurate, because a single entry can contain more than one genre.

In [None]:
# The Genres column contains nested data
# Use .value_counts() and look at the values that have just a single entry
df_apps_clean.Genres.value_counts().sort_values(ascending=True)[:5]

Music & Audio;Music & Video    1
Racing;Pretend Play            1
Communication;Creativity       1
Parenting;Brain Games          1
Adventure;Brain Games          1
Name: Genres, dtype: int64

A semicolon (;) separates the genre names within a single entry.

In [43]:
# Separate the genre names to get a clear picture
# Split the strings on the semicolon, then combine them into a single column with .stack()
stack = df_apps_clean.Genres.str.split(';', expand=True).stack()
print(f'We now have a single column with shape: {stack.shape}')
num_genres = stack.value_counts()
print(f'Number of genres: {len(num_genres)}')

We now have a single column with shape: (8564,)
Number of genres: 53
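
As an alternative sketch (not the approach used above), `Series.explode()` flattens the split lists into one row per genre. The genre strings below are a made-up sample for illustration.

```python
import pandas as pd

# Toy genre strings with hypothetical values
genres = pd.Series(['Action', 'Education;Brain Games', 'Action;Education'])

# Split each string into a list, then give every list item its own row
exploded = genres.str.split(';').explode()

# Count how many apps carry each individual genre
counts = exploded.value_counts()

print(counts)
```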


# Plotly Charts - Competition in Genres

In [44]:
bar = px.bar(
        x = num_genres.index[:15], # index = genre name
        y = num_genres.values[:15], # count
        title='Top Genres',
        hover_name=num_genres.index[:15],
        color=num_genres.values[:15],
        color_continuous_scale='Agsunset'
)

bar.update_layout(xaxis_title='Genre',
                  yaxis_title='Number of Apps',
                  coloraxis_showscale=False)

bar.show()

# Grouped Bar Charts: Free vs. Paid Apps per Category

In [45]:
df_apps_clean.Type.value_counts()

Free    7595
Paid     589
Name: Type, dtype: int64

In [46]:
# Group the data first by Category and then by Type
# Count the number of apps of each type
# as_index=False pushes all the data into columns rather than using Category and Type as the index
df_free_vs_paid = df_apps_clean.groupby(["Category", "Type"],
                                      as_index=False).agg({'App': pd.Series.count})
df_free_vs_paid.sort_values('App')

Unnamed: 0,Category,Type,App
3,AUTO_AND_VEHICLES,Paid,1
24,FOOD_AND_DRINK,Paid,2
38,NEWS_AND_MAGAZINES,Paid,2
40,PARENTING,Paid,2
17,ENTERTAINMENT,Paid,2
...,...,...,...
31,LIFESTYLE,Free,284
21,FINANCE,Free,289
53,TOOLS,Free,656
25,GAME,Free,834


In [47]:
g_bar = px.bar(df_free_vs_paid,
               x='Category',
               y='App',
               title='Free vs Paid Apps by Category',
               color='Type',
               barmode='group',)

# pass a dictionary to the axis parameter in .update_layout()
g_bar.update_layout(xaxis_title='Category',
                    yaxis_title='Number of Apps',
                    xaxis={'categoryorder':'total descending'},
                    yaxis=dict(type='log'),
                    )

g_bar.show()

There are few paid applications on the Google Play Store, but certain categories have a higher prevalence of paid apps than others. Personalization, Medical, and Weather have a relatively large number of paid apps. The choice of releasing a paid app may therefore depend on the category you are targeting.
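
To compare categories of different sizes, the share of paid apps may be more telling than the raw count. The sketch below computes it with `pd.crosstab` on toy data (hypothetical values).

```python
import pandas as pd

# Toy data with hypothetical values
toy = pd.DataFrame({
    'Category': ['MEDICAL'] * 4 + ['GAME'] * 5,
    'Type': ['Paid', 'Paid', 'Free', 'Free',
             'Paid', 'Free', 'Free', 'Free', 'Free'],
})

# normalize='index' turns each row of counts into fractions that sum to 1
share = pd.crosstab(toy.Category, toy.Type, normalize='index')

# Fraction of paid apps per category, largest first
paid_share = share['Paid'].sort_values(ascending=False)

print(paid_share)
```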

# Plotly Box Plots: Lost Downloads for Paid Apps



In [48]:
box = px.box(df_apps_clean,
             y='Installs',
             x='Type',
             color='Type',
             notched=True,
             points='all',
             title='How Many Downloads are Paid Apps Giving Up?'
)

box.update_layout(yaxis=dict(type='log'))

box.show()



The chart's hover text shows that the median download count is 500,000 for free apps but only around 5,000 for paid apps, roughly a hundredfold difference.
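
The medians can also be computed directly instead of being read from the hover text. The sketch below does this on toy data; the install counts are hypothetical values chosen to mirror the figures quoted above.

```python
import pandas as pd

# Toy data with hypothetical values
toy = pd.DataFrame({
    'Type': ['Free', 'Free', 'Free', 'Paid', 'Paid', 'Paid'],
    'Installs': [100_000, 500_000, 10_000_000, 1_000, 5_000, 100_000],
})

# Median installs for each app type
medians = toy.groupby('Type')['Installs'].median()

# How many times more downloads the median free app gets
ratio = medians['Free'] / medians['Paid']

print(medians)
print(ratio)
```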

# Plotly Box Plots: Revenue by App Category



In [None]:
df_paid_apps = df_apps_clean[df_apps_clean['Type'] == 'Paid']

In [None]:
box = px.box(df_paid_apps,
             x='Category',
             y='Revenue_Estimate',
             title='How Much Can Paid Apps Earn?')

box.update_layout(xaxis_title='Category',
                  yaxis_title='Paid App Ballpark Revenue',
                  xaxis={'categoryorder':'min ascending'},
                  yaxis=dict(type='log'))


box.show()


If the development cost of an Android app is $30,000, the average app in only a few categories would generate enough revenue to cover that expense. The median earnings for paid photography apps, for instance, amounted to approximately $20,000. Many apps earned even less, so they would need alternative income streams such as advertising or in-app purchases to offset development costs. However, certain app categories feature a substantial number of outliers with considerably higher (estimated) revenue, notably Medical, Personalization, Tools, Games, and Family.
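
The break-even comparison can be sketched in code as follows. The revenue figures are made up, and `DEV_COST` simply restates the $30,000 assumption from the paragraph above.

```python
import pandas as pd

DEV_COST = 30_000  # assumed development cost in dollars

# Toy data with hypothetical revenue estimates
toy = pd.DataFrame({
    'Category': ['PHOTOGRAPHY'] * 3 + ['MEDICAL'] * 3,
    'Revenue_Estimate': [5_000, 20_000, 90_000, 10_000, 40_000, 400_000],
})

# Median estimated revenue per category
median_rev = toy.groupby('Category')['Revenue_Estimate'].median()

# Categories where the median paid app would cover the development cost
covers_cost = median_rev[median_rev >= DEV_COST]

print(covers_cost)
```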

# Examine Paid App Pricing Strategies by Category


In [None]:
df_paid_apps.Price.median()

2.99

The median price for a paid Android app is $2.99.

In [None]:
box = px.box(df_paid_apps,
             x='Category',
             y="Price",
             title='Price per Category')

box.update_layout(xaxis_title='Category',
                  yaxis_title='Paid App Price',
                  xaxis={'categoryorder':'max descending'},
                  yaxis=dict(type='log'))


box.show()

Certain categories, however, have higher median prices than others. Medical apps stand out with the most expensive apps and a median price of $5.49. In contrast, Personalization apps are relatively inexpensive, priced at $1.49 on average. Other categories with higher median prices include Business ($4.99) and Dating ($6.99). Customers in these categories appear to be less concerned about paying a slightly higher price for their apps.
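
The per-category medians quoted above are read from the box plot, but they could equally be computed with a `groupby`. The sketch below shows the pattern on toy data with hypothetical prices.

```python
import pandas as pd

# Toy data with hypothetical prices
toy = pd.DataFrame({
    'Category': ['MEDICAL'] * 3 + ['PERSONALIZATION'] * 3,
    'Price': [2.99, 5.49, 24.99, 0.99, 1.49, 1.99],
})

# Median price per category, most expensive first
median_price = (toy.groupby('Category')['Price']
                   .median()
                   .sort_values(ascending=False))

print(median_price)
```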