# Google Play Store Apps Analysis

<img src="https://storage.googleapis.com/kaggle-datasets-images/49864/90482/cd5596dc740da68fc6896566758e60b4/dataset-cover.jpg?t=2018-09-07-13-48-45">

The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market!

### About Dataset
The dataset comes from Kaggle and was published by **Lavanya Gupta (https://www.kaggle.com/lava18/google-play-store-apps)**.

Google Play, also known as the Google Play Store and formerly Android Market, is a digital distribution service operated and developed by Google.

Learn more about it https://en.wikipedia.org/wiki/Google_Play

# Introduction

### Exploratory Data Analysis using Python

Google Play is Google's digital distribution service, used by people across the world for entertainment. It has also served as a digital media store, offering games, music, books, movies, and television programs.
In this EDA, I explore the Google Play Store Apps dataset through visualizations and graphs using matplotlib and seaborn.

I got this dataset from Kaggle: https://www.kaggle.com/lava18/google-play-store-apps

You can also use the extended version of the original dataset: https://www.kaggle.com/datasets/neomatrix369/google-play-store-apps-extended



This project is part of the course [Data Analysis with Python: Zero to Pandas](https://zerotopandas.com) on `Jovian.ai`, taught by `Aakash N S`, one of the best instructors I have come across in the data science domain.

Learning new things while solving coding challenges and assignments was the most enjoyable part of the course, and the overall experience was great.

The course covered topics such as:

- 1. `First Steps with Python and Jupyter`
- 2. `A Quick Tour of Variables and Data Types`
- 3. `Branching using Conditional Statements and Loops`
- 4. `Writing Reusable Code Using Functions`
- 5. `Reading from and Writing to Files and Interacting with the filesystem using the os module`
- 6. `Numerical Computing with Python and Numpy`
- 7. `Analyzing Tabular Data using Pandas`
- 8. `Data Visualization using Matplotlib & Seaborn`

It also included many detailed tips and pieces of advice.

Let's learn exploratory data analysis with Python by exploring the Google Play Store Apps dataset through visualizations and graphs using matplotlib and seaborn. Along the way, we will learn how to handle null values in the data and split it into separate datasets. Follow along with the Python code for data preparation and cleaning.

## TABLE OF CONTENTS

### 1. Download and Read the Dataset
       
### 2. Data Preparation and Cleaning

### 3. Exploratory Data Analysis (EDA) and Visualization

### 4. Asking and Answering Questions

### 5. Inferences and Conclusion

### 6. References and Future Work

# Downloading the Dataset

In this project, we'll analyze the Google Play Store Apps dataset. You can find the raw data here: https://www.kaggle.com/datasets/neomatrix369/google-play-store-apps-extended

Alternatively, you can use the original dataset: https://www.kaggle.com/lava18/google-play-store-apps

There are several options for getting the dataset into Jupyter:

1. Download the CSV manually and upload it via Jupyter's GUI
2. Use the `urlretrieve` function from `urllib.request` to download CSV files from a raw URL
3. Use a helper library, e.g., `opendatasets`, which contains a collection of curated datasets and provides a helper function for direct download.

We'll go for the 3rd option to use the dataset in Jupyter.
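
For comparison, option 2 can be sketched with the standard library alone. This is only a minimal sketch: the function name `fetch_csv` and the example URL are hypothetical, and Kaggle downloads generally need credentials, so this approach suits files hosted at a plain raw URL.

```python
from pathlib import Path
from urllib.request import urlretrieve


def fetch_csv(url, dest="googleplaystore.csv"):
    """Download a CSV from a raw URL and return the local path."""
    local_path, _headers = urlretrieve(url, dest)
    return Path(local_path)


# Hypothetical usage; replace the placeholder with a real raw-CSV URL:
# fetch_csv("https://example.com/path/to/googleplaystore.csv")
```

The returned path can then be passed straight to `pd.read_csv()`.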

In [1]:
# !pip install jovian opendatasets --upgrade --quiet

In [2]:
!pip install opendatasets --upgrade --quiet

### Package Install and Import
First, we will install and import necessary packages.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Let's begin by downloading the data, and listing the files within the dataset.

In [4]:
# Downloading the dataset
# dataset_url = 'https://www.kaggle.com/datasets/neomatrix369/google-play-store-apps-extended'

dataset_url = 'https://www.kaggle.com/lava18/google-play-store-apps'

In [None]:
import opendatasets as od
od.download(dataset_url)

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username:

The dataset has been downloaded and extracted.

Let us save and upload our work to Jovian before continuing.

In [None]:
project_name = "google-play-store-apps-analysis" # change this (use lowercase letters and hyphens only)

In [None]:
# !pip install jovian --upgrade -q

In [None]:
# import jovian

In [None]:
# jovian.commit(project=project_name)

# Data Preparation and Cleaning

In any data-analysis project, data preparation is vital: it finds and handles features that could distort the quantitative analysis or lead to inefficient code. According to `Alivia Smith`, this step usually takes up to 80% of the entire time of a data analysis project.
In this section, missing, invalid, and inconsistent values are addressed.

As with any dataset, the first steps are data exploration and data cleaning. We need to get a better understanding of what we're dealing with:

> - Load the dataset into a data frame using Pandas
> - Explore the number of rows & columns, ranges of values etc.
> - Handle missing, incorrect and invalid data
> - Perform any additional steps (parsing dates, creating additional columns, merging multiple datasets etc.)

### Load the dataset into a data frame using Pandas

In [None]:
# You can use either of the following datasets
# df_apps = pd.read_csv('google-play-store-apps-extended/googleplaystore.csv')

df_apps = pd.read_csv('google-play-store-apps/googleplaystore.csv')

## Preliminary Data Exploration and Data Cleaning with Pandas

Now that we've got our data loaded into our dataframe, we need to take a closer look at it to help us understand what it is we are working with. This is always the first step with any data science project. Let's see if we can answer the following questions:

- 1. How many rows does the dataset contain?

- 2. How many columns does it have?

- 3. What are the labels for the columns? Do the columns have names?

- 4. Are there any missing values in our dataframe?

- 5. Does our dataframe contain any bad data?

- 6. Are there any NaN values present?

- 7. Are there any duplicate rows?

- 8. What are the data types of the columns?


The first step as always is getting a better idea about what we're dealing with.
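
Most of the questions above can be answered in one pass. The sketch below uses a hypothetical helper, `first_look`, and toy data standing in for the real file; it is one possible way to collect these facts, not the only one.

```python
import pandas as pd


def first_look(df):
    """Collect basic facts about a DataFrame in one dictionary."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "column_names": list(df.columns),
        "missing_cells": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "dtypes": df.dtypes.astype(str).to_dict(),
    }


# Toy data standing in for the real file
toy = pd.DataFrame({
    "App": ["A", "B", "B", "C"],
    "Rating": [4.5, None, None, 3.9],
})
summary = first_look(toy)
```

On the real data you would call `first_look(df_apps)` instead.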

### Explore the number of rows & columns, ranges of values etc.

To see the number of rows and columns we can use the `shape` attribute:

In [None]:
df_apps.shape

With 10841 rows and 13 columns, we are working with a fairly large DataFrame this time.

Now take a look at the Pandas dataframe we've just created with `.head()`.

In [None]:
df_apps.head()

As we can see, there is a '+' sign after every entry in the Installs column.

In [None]:
df_apps.tail()

In [None]:
# What are the labels for the columns? Do the columns have names?

df_apps.columns

### How many columns does our dataframe have?

In [None]:
len(df_apps.columns)

### Handle missing, incorrect and invalid data

#### Removing '+' from the Installs values and converting them to numeric

In [None]:
df_apps['Installs'] = df_apps['Installs'].map(lambda x: x.rstrip('+'))

In [None]:
df_apps['Installs'] = pd.to_numeric(df_apps['Installs'].str.replace(',',''))

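The cell above fails with `ValueError: Unable to parse string "Free" at position 10472`.
So yes, there is bad data in our dataframe: row 10472 has no Category value, so its remaining values are shifted one column to the left and the text "Free" ends up in the Installs column.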
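
To find every unparseable entry at once instead of stopping at the first error, `pd.to_numeric` accepts `errors="coerce"`. Here is a minimal sketch on toy values (the three strings are made up; "Free" mimics the bad row):

```python
import pandas as pd

installs = pd.Series(["1,000+", "500+", "Free"])  # toy values; "Free" mimics the bad row

cleaned = installs.str.rstrip("+").str.replace(",", "", regex=False)
as_number = pd.to_numeric(cleaned, errors="coerce")  # unparseable text becomes NaN

# Index labels of the entries that could not be parsed
bad_positions = as_number[as_number.isna()].index.tolist()
```

Inspecting the rows at `bad_positions` shows what went wrong before deciding whether to drop them.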

#### Removing row 10472 because of the ValueError

In [None]:
# Row 10472 is removed because its Category value is missing, which shifts the other columns
df_apps.drop(df_apps.index[10472], inplace=True)

In [None]:
df_apps.info()

We can answer many of the questions with a single command: **.describe()**.

In [None]:
df_apps.describe()

The typical app rating is about 4.3, which is encouraging.

Some apps are rated poorly, though. The minimum rating is 1, the lowest value on the 1-to-5 star scale. Apps that have not been rated at all show up as NaN rather than as a low rating.

On the other hand, the highest rating is 5. Holy smokes.

**So which apps are the lowest rated in the dataset?**

In [None]:
df_apps[df_apps.Rating == 1.000000]

**And the highest rated apps in the dataset are:**

In [None]:
df_apps[df_apps.Rating == 5.000000]

In [None]:
df_apps.sample()

In [None]:
df_apps.info()

In [None]:
df_apps.sample(5)

- Check the data types of the Reviews, Installs and Price columns.
- Convert the number of installations (the Installs column) to a numeric data type.

To check the data types you can either use `.describe()` on the column or `.info()` on the DataFrame.

In [None]:
df_apps.Reviews.describe()

In [None]:
df_apps.Installs.describe()

In [None]:
df_apps.Price.describe()

In [None]:
df_apps.info()

Both of these show that we are dealing with a non-numeric data type. In this case, the type is "object".

## Numeric Type Conversions for the Reviews, Installations & Price Data

We can remove the comma (,) character, or any other character, from a column using the `.str.replace()` method. Here we’re saying: “replace the , with an empty string”. This removes all the commas in the Installs column. We can then convert our data to a number using `pd.to_numeric()`.
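
One detail matters for the Price column: `$` is a special character in regular expressions, so it is safer to pass `regex=False`. A small sketch on toy prices (the values are made up):

```python
import pandas as pd

prices = pd.Series(["$4.99", "0", "$1.50"])  # toy values

# regex=False treats "$" as a literal character rather than the end-of-string anchor
price_numbers = pd.to_numeric(prices.str.replace("$", "", regex=False))
```

The comma needs no such care because it has no special meaning in a regular expression, but passing `regex=False` there does no harm either.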

### Converting the type of  Reviews column to number

In [None]:
df_apps['Reviews'] = pd.to_numeric(df_apps['Reviews'])

### Converting the type of Price column to number

In [None]:
# regex=False makes '$' a literal character instead of the end-of-string anchor
df_apps['Price'] = pd.to_numeric(df_apps['Price'].str.replace('$', '', regex=False))

In [None]:
df_apps['Price'] = pd.to_numeric(df_apps['Price'])

In [None]:
df_apps['Price']

### Converting the type of Installs column to number

The Installs column is still a non-numeric type, so sorting it is not meaningful yet. pandas does not recognise the installs as numbers because of the comma (,) characters, so we remove them before converting.

In [None]:
df_apps['Installs'] = pd.to_numeric(df_apps['Installs'].str.replace(',',''))

### Converting the type of Last Updated column to datetime

To convert the Last Updated column to a DateTime object, all we need to do is call the **to_datetime()** function.
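
As a small illustration on toy strings: the sketch below assumes the column holds dates written like "January 7, 2018", and passes an explicit `format`, which is optional but makes the expected layout clear.

```python
import pandas as pd

raw_dates = pd.Series(["January 7, 2018", "June 20, 2018"])  # toy values
parsed = pd.to_datetime(raw_dates, format="%B %d, %Y")

# The .dt accessor exposes the date parts once the column is a datetime
years = parsed.dt.year.tolist()
months = parsed.dt.month.tolist()
```

Once a column is a datetime, parts such as the year or month are available through `.dt`.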

In [None]:
df_apps['Last Updated']  = pd.to_datetime(df_apps['Last Updated'])

In [None]:
df_apps.head()

In [None]:
df_apps.info()

## Data Cleaning: Removing NaN Values and Duplicates


### Missing Values and Junk Data

Before we can proceed with our analysis we should try to figure out if there is any missing or junk data in our dataframe. That way we can avoid problems later on. In this case, we're going to look for NaN (Not a Number) values in our dataframe. NaN values mark missing entries, such as blank cells in the source file. Use the `.isna()` method.

In [None]:
df_apps.isna()

In [None]:
df_apps.isna().values.any()

In [None]:
df_apps.duplicated().values.any()

We can see the total number of duplicates by creating a subset and looking at the length of that subset

In [None]:
duplicated_rows = df_apps[df_apps.duplicated()]
len(duplicated_rows)

In [None]:
df_apps.info()

In [None]:
df_apps.nunique()

### Handling Null Values
The output above shows that some of the columns have a lot of different unique values.

In [None]:
df_apps.isnull().values.any()

In [None]:
df_apps.isnull().sum().sum()

In [None]:
df_apps.isnull().sum()

In [None]:
sns.heatmap(df_apps.isnull(), cbar=False)
plt.title('Null Values Heatmap')
plt.show()

As the table and the heatmap above show, the Rating column has the most null values (1474), followed by Current Ver, Android Ver and Type.
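
The same ranking can be read off a small report that sorts the columns by their number of missing values. This is a sketch with a hypothetical helper, `missing_report`, shown on toy data:

```python
import pandas as pd


def missing_report(df):
    """Return missing-value counts and percentages per column, largest first."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    return report.sort_values("missing", ascending=False)


toy = pd.DataFrame({
    "Rating": [4.1, None, None, 3.8],
    "Type": ["Free", "Paid", None, "Free"],
    "App": ["A", "B", "C", "D"],
})
report = missing_report(toy)
```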

### Dropping Unused Columns and Removing NaN Values

To find and remove the rows with the NaN values we can create a subset of the DataFrame based on where `.isna()` evaluates to True. We see that NaN values in ratings are associated with no reviews (and no installs). That makes sense.

#### Remove the columns called Current Ver and Android Ver from the DataFrame. We will not use these columns.

To remove the unwanted columns, we provide the column names `['Android Ver']` and `['Current Ver']` to the `.drop()` method. By setting `axis=1` we specify that we want to drop columns rather than rows.

In [None]:
df_apps.drop(['Android Ver'], axis=1, inplace = True)

In [None]:
df_apps.drop(['Current Ver'], axis=1, inplace = True)

In [None]:
df_apps.head()

In [None]:
df_apps.tail()

We can drop the NaN values with `.dropna()`:

In [None]:
df_apps_clean = df_apps.dropna()
df_apps_clean.shape

This leaves us with 9,366 entries in our DataFrame. But there may be other problems with the data too:

- `Are there any duplicates in the data?`
Check for duplicates using the `.duplicated()` method.

Use `.drop_duplicates()` to remove any duplicates from `df_apps_clean`.


### Accessing Columns and Individual Cells in a Dataframe

To access a particular column from a data frame we can use the square bracket notation:

In [None]:
df_apps_clean['Rating']

To find the highest rating we can simply chain the **.max()** method.

In [None]:
df_apps_clean['Rating'].max()

**Highest and Lowest Rated Apps**

In [None]:
print(df_apps_clean['Rating'].max())
print(f"Index for the max rated app: {df_apps_clean['Rating'].idxmax()}")

In [None]:
print(df_apps_clean['Rating'].min())
df_apps_clean.loc[df_apps_clean['Rating'].idxmin()]  # full row of the lowest-rated app

Let's do the same for the **Reviews** column:

In [None]:
df_apps_clean['Reviews']

In [None]:
df_apps_clean['Reviews'].max()

**The highest number of reviews is 78158306.**
But which app has this many reviews? For this, we need to know the row label or index so that we can look up the name of the app. The **.idxmax()** method gives us the index of the row with the largest value.
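
A minimal sketch of the idea on toy data: the index labels here are deliberately not 0, 1, 2, as happens after rows have been dropped.

```python
import pandas as pd

toy = pd.DataFrame(
    {"App": ["A", "B", "C"], "Reviews": [120, 9500, 40]},
    index=[10, 20, 30],  # non-default labels, as after dropping rows
)

top_label = toy["Reviews"].idxmax()   # index label, not position
top_row = toy.loc[top_label]          # full row for that label
```

Because `.idxmax()` returns a label, it pairs with `.loc` rather than `.iloc`.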

In [None]:
print(df_apps_clean['Reviews'].max())
print(f"Index for the max reviewed app: {df_apps_clean['Reviews'].idxmax()}")

In [None]:
print(df_apps_clean['Reviews'].min())
df_apps_clean.loc[df_apps_clean['Reviews'].idxmin()]  # full row of the least-reviewed app

In [None]:
print(f"Index for the minimum reviewed app: {df_apps_clean['Reviews'].idxmin()}")

### Finding and Removing Duplicates

There are indeed duplicates in the data. We can show them using the `.duplicated()` method

In [None]:
duplicate_exists = df_apps_clean['App'].duplicated().any()
duplicate_exists

In [None]:
df_apps_clean['App'].value_counts()

#### As the output above shows, some apps have multiple rows. Let's check whether their data is identical.

In [None]:
df_apps_clean[df_apps_clean['App']=='Candy Crush Saga']

#### As the dataframe above shows, the Candy Crush Saga rows are identical except for the number of reviews. The data for the same app may have been scraped at different points in time, so we keep the row with the maximum number of reviews, assuming it to be the latest one.

<img src="https://humornama.com/wp-content/uploads/2022/01/Saaley-Mera-Hi-Maal-Chura-Kar-Meme-Template-on-Phir-Hera-Pheri-364x205.jpg">

In [None]:
duplicated_rows = df_apps_clean[df_apps_clean.duplicated()]
print(duplicated_rows.shape)
duplicated_rows.head()

In [None]:
df_apps_clean.drop_duplicates()

The `.duplicated()` check above brings up 476 rows.

We can actually check for an individual app like ‘Instagram’ by looking up all the entries with that name in the App column.

In [None]:
df_apps_clean[df_apps_clean.App=='Instagram']

So how do we get rid of duplicates? Can we simply call `.drop_duplicates()`?

In [None]:
df_apps_clean = df_apps_clean.drop_duplicates()

In [None]:
df_apps_clean[df_apps_clean.App=='Instagram']

Not really. If we do this without specifying how to identify duplicates, we see that 3 copies of Instagram are retained because they have a different number of reviews. We need to provide the column names that should be used in the comparison to identify duplicates.

In [None]:
# Need to specify the subset for identifying duplicates
df_apps_clean = df_apps_clean.drop_duplicates(subset=['App', 'Type', 'Price'])

df_apps_clean[df_apps_clean.App == 'Instagram']

In [None]:
df_apps_clean.shape

This leaves us with 8,198 entries after removing duplicates. Woah! 💪
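
Note that `.drop_duplicates()` keeps the first row it meets for each app. If the goal is to keep the row with the most reviews, one option is to sort first. A sketch on toy data (the app names and counts are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "App": ["Instagram", "Instagram", "Maps", "Instagram"],
    "Reviews": [100, 300, 50, 200],
})

# Sort so the largest review count comes first, then keep the first row per app
latest = (
    toy.sort_values("Reviews", ascending=False)
       .drop_duplicates(subset="App", keep="first")
       .sort_index()
)
```

The final `.sort_index()` only restores the original row order and can be left out.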

In [None]:
# import jovian

In [None]:
# jovian.commit()

## Preliminary Exploration: The Highest Ratings, Most Reviews, and Largest Size


#### Some important questions to be answered
1. Which apps are the highest rated?
2. What problem might you encounter if you rely exclusively on ratings to determine the quality of an app?
3. What's the size in megabytes (MB) of the largest Android apps in the Google Play Store?
4. Based on the data, do you think there could be a limit in place or can developers make apps as large as they please?
5. Which apps have the highest number of reviews?
6. Are there any paid apps among the top 50?

In [None]:
df_apps_clean.sort_values('Rating', ascending=False).head()

**Only apps with very few reviews (and a low number of installs) have perfect 5 star ratings (most likely by friends and family).**

In [None]:
df_apps_clean.sort_values('Size', ascending=False).head()

Here we can see that there seems to be an upper bound of 100 MB for the size of an app. A quick Google search also reveals that this limit is imposed by the Google Play Store itself. It’s interesting to see that a number of apps actually hit that limit exactly.
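
One caveat: if the Size column is still stored as text at this point (values such as "19M" or "201k"), a plain sort orders it alphabetically. A sketch of sorting on a parsed numeric key instead, using toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Size": ["9.5M", "100M", "201k", "Varies with device"],
})

# Strip the "M" suffix and coerce everything else ("k" sizes, free text) to NaN;
# that is enough for finding the largest apps, which are all measured in MB
size_mb = pd.to_numeric(toy["Size"].str.replace("M", "", regex=False), errors="coerce")

# NaN values are placed last, so the first rows are the largest apps
largest = toy.loc[size_mb.sort_values(ascending=False).index].head(2)
```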

In [None]:
df_apps_clean.sort_values('Installs', ascending=False).head(50)

In [None]:
highest_potential_apps = df_apps_clean.sort_values('Installs', ascending=False).head(50)

In [None]:
highest_potential_apps

In [None]:
highest_potential_apps[['App', 'Category', 'Installs']].head()

In [None]:
df_apps_clean['Installs'].min(),df_apps_clean['Installs'].max()

So above we can see the top 5 most-installed apps, along with the smallest and largest install counts in the dataset.

In [None]:
df_apps_clean.sort_values('Reviews', ascending=False).head(50)

If you look at the number of reviews, you can find the most popular apps on the Google Play Store. These include the usual suspects: Facebook, WhatsApp, Instagram etc. What’s also notable is that the list of the top 50 most reviewed apps does not include a single paid app! 🤔

**Questions to be answered**

1. How many apps had over 1 billion installations?
2. How many apps just had a single install?



- Count the number of apps at each level of installations.

### Grouping and Pivoting Data with Pandas



If we take two of the columns, say Installs and the App name, we can count the number of entries per level of installations with `.groupby()` and `.count()`. Before counting, we keep a single row per app: the one with the highest number of installs.

In [None]:
df_apps_clean = df_apps_clean.loc[df_apps_clean.groupby(['App'])['Installs'].idxmax()]

In [None]:
df_apps_clean[['App', 'Installs']].groupby('Installs').count()

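Let's examine the table above. The index is the install tier and the App column is the number of apps in that tier: a 3 in the first row means that three apps sit in the lowest tier, not that an app has 3 installs. At the other end, some tiers are so large that it is hard to even count the zeros.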
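
The two questions above can also be answered directly with boolean masks. A sketch on toy data (the install counts are made up; since installs are recorded as tiers such as "1,000,000,000+", the billion check uses `>=`):

```python
import pandas as pd

toy = pd.DataFrame({
    "App": ["A", "B", "C", "D", "E"],
    "Installs": [1, 1, 1_000, 1_000_000_000, 1_000_000_000],
})

# Number of apps in each install tier, ordered from smallest to largest tier
apps_per_tier = toy.groupby("Installs")["App"].count()

single_install = int((toy["Installs"] == 1).sum())
billion_plus = int((toy["Installs"] >= 1_000_000_000).sum())
```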

In [None]:
df_apps_clean = df_apps_clean.loc[df_apps_clean.groupby(['App'])['Reviews'].idxmax()]

In [None]:
df_apps_clean[['App', 'Reviews']].groupby('Reviews').count()

In [None]:
df_apps_clean = df_apps_clean.loc[df_apps_clean.groupby(['App'])['Price'].idxmax()]

In [None]:
df_apps_clean[['App', 'Price']].groupby('Price').count()

In [None]:
# import jovian

In [None]:
# jovian.commit()

# Exploratory Analysis and Visualization

Data visualization is the graphic representation of data. It involves producing images that communicate relationships among the represented data to viewers. Visualizing data is an essential part of data analysis and machine learning. We'll use Python libraries Matplotlib and Seaborn to learn and apply some popular data visualization techniques.

> - **Compute the mean, sum, range and other interesting statistics for numeric columns**
> - **Explore distributions of numeric columns using histograms etc.**
> - **Explore relationship between columns using scatter plots, bar charts etc.**
> - **Make a note of interesting insights from the exploratory analysis**

### Computing the mean, sum, range and other interesting statistics for numeric columns

**For Reviews column**

In [None]:
# Calculating mean for Reviews column
df_apps_clean['Reviews'].mean()

In [None]:
# Calculating sum for Reviews column
df_apps_clean['Reviews'].sum()

In [None]:
# Calculating range for Reviews column
df_apps_clean['Reviews'].min(),df_apps_clean['Reviews'].max()

**For Installs column**

In [None]:
# Calculating mean for installations column
df_apps_clean['Installs'].mean()

In [None]:
# Calculating range for installations column
df_apps_clean['Installs'].min(),df_apps_clean['Installs'].max()

**For Price column**

In [None]:
# Calculating mean for Price column
df_apps_clean['Price'].mean()

In [None]:
# Calculating sum for price column
df_apps_clean['Price'].sum()

In [None]:
# Calculating range for Price column
df_apps_clean['Price'].min(),df_apps_clean['Price'].max()

We're now ready to visualize our data. For that we'll use another popular data visualization tool that you can use alongside Matplotlib: Seaborn. Seaborn is built on top of Matplotlib, and it makes creating certain visualizations very convenient.

Let's begin by importing `matplotlib.pyplot` and `seaborn`.

The first step is adding Seaborn to our notebook. By convention we'll use the name sns.

In [None]:
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

 **NOTE**: The special command `%matplotlib inline` will ensure that our plots are shown and embedded within the Jupyter notebook itself.

Now let's explore:

- How to visualise data and create charts with Matplotlib

- How to style and customise a chart to your liking


### Histogram for Content Rating

All Android apps have a content rating like “Everyone” or “Teen” or “Mature 17+”. Let’s take a look at the distribution of the content ratings in our dataset and see how to visualise it with seaborn, which we use alongside Matplotlib.

First, we’ll count the number of occurrences of each rating with `.value_counts()`:

In [None]:
ratings = df_apps_clean["Content Rating"].value_counts()
ratings

In [None]:
df_ratings = df_apps_clean["Content Rating"]

In [None]:
# Set the style
sns.set(style="whitegrid")
plt.figure(figsize=(10, 7))

# Create the histogram plot
ax = sns.histplot(data=df_ratings, bins=12, color='skyblue', edgecolor='black')

# Adding title and labels
plt.title("Distribution of Content Ratings", fontsize=16)
plt.xlabel("Content Rating", fontsize=14)
plt.ylabel("Count of Applications", fontsize=14)

# Adding grid lines for better readability
ax.grid(axis='y', linestyle='--', alpha=0.7)

# Adding a subtle gray background
ax.set_facecolor('#F2F2F2')

# Adding data labels on top of the bars
for bar in ax.patches:
    height = bar.get_height()
    ax.annotate('{}'.format(height),
                xy=(bar.get_x() + bar.get_width() / 2, height),
                xytext=(0, 3),  # 3 points vertical offset
                textcoords="offset points",
                ha='center', va='bottom', fontsize=12, color='black')

# Enhancing y-axis ticks with commas for better readability
ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))


# Removing the right and top spines
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)

plt.xticks(rotation=80, ha='right', fontsize=12)
plt.yticks(fontsize=12)

plt.tight_layout()
plt.show()

**The Everyone content rating accounts for the highest number of applications out of 8196, followed by Teen, Mature 17+, Everyone 10+, Adults only 18+ and Unrated.**

### Pie Chart for Content Rating

In [None]:
# Data
counts = ratings.values
names = ratings.index

# Colors for the pie chart
colors = ['#66b3ff', '#99ff99', '#ffcc99', '#c2c2f0', '#ffb3e6', '#ff6666']

# Adjusting font size and figure size
plt.rcParams['font.size'] = 14
plt.figure(figsize=(10, 10))

# Creating the pie chart
plt.pie(counts, labels=names, colors=colors, autopct='%1.1f%%', startangle=140, shadow=True, explode=(0.05, 0, 0, 0, 0, 0.2))

# Adding a title
plt.title('Percentage of Content Ratings', fontsize=18, pad=20)

# Adding a legend with the content rating names
plt.legend(labels=names, loc='upper left', bbox_to_anchor=(0.85, 1))

# Equal aspect ratio ensures that the pie is drawn as a circle.
plt.axis('equal')

# Display the pie chart
plt.show()

The chart above shows the following content rating counts and percentages:

- `Everyone`           6618   (80.75%)
- `Teen`                912   (11.13%)
- `Mature 17+`          357   (4.36%)
- `Everyone 10+`        305   (3.72%)
- `Adults only 18+`       3   (0.04%)
- `Unrated`               1   (0.01%)
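
These percentages can be reproduced from the counts. The sketch below rebuilds the counts by hand from the list above; on the real dataframe, `value_counts(normalize=True)` on the Content Rating column gives the same shares as fractions.

```python
import pandas as pd

# Counts copied from the list above
ratings = pd.Series({
    "Everyone": 6618,
    "Teen": 912,
    "Mature 17+": 357,
    "Everyone 10+": 305,
    "Adults only 18+": 3,
    "Unrated": 1,
})

# Share of each content rating as a percentage of all apps
percent = (ratings / ratings.sum() * 100).round(2)
```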

### Average rating

In [None]:
df_genres_ratings = df_apps_clean.groupby(['Genres'])[['Rating']].mean()

In [None]:
# rename_axis/reset_index(name=...) gives the same column names on old and new pandas versions
top_genres = df_apps_clean['Genres'].value_counts().rename_axis('Genres').reset_index(name='Count')

In [None]:
genres_installs = df_apps_clean.groupby(['Genres'])[['Installs']].sum()

In [None]:
top_genres_installs = pd.merge(top_genres, genres_installs, on='Genres')

In [None]:
genres_installs_ratings = pd.merge(top_genres_installs, df_genres_ratings, on='Genres')

In [None]:
genres_installs_ratings['Rating'].describe()

In [None]:
# Set style
sns.set(style="whitegrid")

# Create a figure with a larger size
plt.figure(figsize=(14, 7))

# Create the KDE plot (fill=True replaces the deprecated shade=True)
g = sns.kdeplot(data=genres_installs_ratings, x="Rating", color="green", fill=True)

# Customize the labels and title
g.set_xlabel("Rating (out of 5)", fontsize=14)
g.set_ylabel("Density", fontsize=14)
plt.title('Distribution of Ratings', size=20)

# Customize the tick parameters
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Add a grid for better readability
plt.grid(alpha=0.5)

# Adding a subtle background color
plt.gca().set_facecolor('#F2F2F2')

# Removing the right and top spines
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)

# Display the plot
plt.show()

**The distribution above, together with `.describe()`, shows that the average of the per-genre mean ratings is 4.248768.**
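
The genre-level table behind this plot can also be built in a single step with named aggregations instead of separate tables and merges. This is an alternative sketch on toy data, not the approach used in the cells above:

```python
import pandas as pd

toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Genres": ["Tools", "Tools", "Puzzle", "Puzzle"],
    "Installs": [100, 900, 50, 50],
    "Rating": [4.0, 5.0, 3.0, 4.0],
})

# One groupby with named aggregations replaces the separate tables and merges
genre_stats = (
    toy.groupby("Genres")
       .agg(Count=("App", "count"), Installs=("Installs", "sum"), Rating=("Rating", "mean"))
       .reset_index()
)
```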

### Histogram for Size Column

In [None]:
# Convert Size to megabytes: '19M' -> 19.0, '201k' -> 0.201, 'Varies with device' -> NaN
def size_to_mb(value):
    text = str(value).replace(',', '')
    if text.endswith('M'):
        return float(text[:-1])
    if text.endswith('k'):
        return float(text[:-1]) / 1000
    return pd.to_numeric(text, errors='coerce')  # numeric text passes through, other text becomes NaN

df_apps_clean['Size'] = df_apps_clean['Size'].apply(size_to_mb)

In [None]:
# Replace missing sizes (e.g. 'Varies with device') with 0
df_apps_clean.loc[df_apps_clean['Size'].isnull(),'Size']=0

In [None]:
# plt.figure(figsize=(14,7))
# plt.xlabel("Size")
# plt.ylabel("Count")
# plt.title("Distribution of Size")
# plt.hist(df_apps_clean['Size']);
# plt.show()

In [None]:
# Set style
sns.set(style="whitegrid")

# Create a figure with a larger size
plt.figure(figsize=(14, 7))

# Create the histogram
plt.hist(df_apps_clean['Size'], bins=30, color='#86bf91', alpha=0.7)

# Customize labels and title
plt.xlabel("Size", fontsize=14)
plt.ylabel("Count of Applications", fontsize=14)
plt.title("Distribution of App Sizes", fontsize=16)

# Customize tick parameters
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Add a grid for better readability
plt.grid(axis='y', alpha=0.7)

# Adding a subtle background color
plt.gca().set_facecolor('#F2F2F2')

# Removing the right and top spines
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)

# Display the plot
plt.show()

***From the histogram above, we can conclude that most of the applications in the dataset are small. Keep in mind that the bar at 0 also contains the apps whose size was missing and was set to 0 above.***

Here we can see that there seems to be an upper bound of 100 MB for the size of an app. It’s interesting to see that a number of apps actually hit that limit exactly.

### Distribution of Installs column

In [None]:
df_apps_clean.Installs.describe()

In [None]:
df_installs = df_apps_clean["Installs"]

In [None]:
df_apps_clean['Installs'].min(),df_apps_clean['Installs'].max()

<img src="https://indianmemetemplates.com/wp-content/uploads/maza-nahi-aa-raha-hai-800x600.jpg?crop=1">

#### Converting Installs column into log_installs

***As we can see, the number of installs spans many orders of magnitude. To compress this range, we add a new column to the dataframe holding the base-10 logarithm of the number of installs.***
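
A sketch of the transformation on toy values, including a zero to show why some care is needed (`log10(0)` is minus infinity):

```python
import numpy as np
import pandas as pd

installs = pd.Series([0, 1, 10, 1_000_000])  # toy values, including a zero

# Mask zeros before taking the log (log10(0) is -inf), then map them to 0
log_installs = np.log10(installs.where(installs > 0)).fillna(0)
```

With this choice, apps with 0 installs and apps with 1 install both end up at 0 on the log scale.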

In [None]:
df_apps_clean['log_installs'] = np.log10(df_apps_clean['Installs'])

In [None]:
df_apps_clean.loc[df_apps_clean['log_installs']==df_apps_clean['log_installs'].min(),'log_installs']=0

In [None]:
# Create a figure with a larger size
plt.figure(figsize=(18, 7))


# Customize labels and title
plt.xlabel("Log of Installs", fontsize=14)
plt.ylabel("Number of Applications", fontsize=14)
plt.title("Distribution of Logarithm of Installs (base 10)", fontsize=16)

plt.hist(df_apps_clean['log_installs']);

# Customize tick parameters
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Add a grid for better readability
plt.grid(axis='y', alpha=0.7)

# Adding a subtle background color
plt.gca().set_facecolor('#F2F2F2')

***The histogram shows how the apps are spread across the install tiers. On its own it cannot tell us whether size affects installs; one hypothesis worth checking is that bulky applications are installed less often.***
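
One simple way to check such a hypothesis is the correlation between Size and log_installs. The sketch below uses toy values that were deliberately constructed to be negatively related, so it says nothing about the real dataset:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Size": [5.0, 20.0, 50.0, 95.0],          # MB, toy values
    "Installs": [1_000_000, 100_000, 10_000, 1_000],
})
toy["log_installs"] = np.log10(toy["Installs"])

# Pearson correlation between size and log installs; a value near 0 would mean
# no linear relationship, a negative value that larger apps see fewer installs
size_vs_installs = toy["Size"].corr(toy["log_installs"])
```

On the real data, the same call would be `df_apps_clean['Size'].corr(df_apps_clean['log_installs'])`.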

In [None]:
df_category = df_apps_clean["Category"]

In [None]:
# plt.figure(figsize=(18, 7))
# plt.xticks(rotation=90, fontsize=12)

# # Create histogram
# plt.hist(df_category, bins=33, edgecolor='black')

# # Add labels and title
# plt.xlabel("Category")
# plt.ylabel("Number of application")
# plt.title("Top Categories")

# # Display the histogram
# plt.show()

In [None]:
# Create a figure with a larger size
plt.figure(figsize=(18, 7))

# Create the histogram
plt.hist(df_category, bins=33, edgecolor='black', color='#86bf91', alpha=0.7)

# Customize x-axis labels rotation and fontsize
plt.xticks(rotation=45, ha='right', fontsize=12)

# Customize labels and title
plt.xlabel("Category of Application", fontsize=14)
plt.ylabel("Number of Applications", fontsize=14)
plt.title("Distribution of Top Categories", fontsize=16)

# Customize tick parameters
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Add a grid for better readability
plt.grid(axis='y', alpha=0.7)

# Adding a subtle background color
plt.gca().set_facecolor('#F2F2F2')

# Removing the right and top spines
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)

# Display the plot
plt.tight_layout()
plt.show()

#### Distribution of App Prices

In [None]:
df_price = df_apps_clean["Price"]

plt.figure(figsize=(14,7))

# Create the KDE plot (fill=True replaces the deprecated shade=True)
g = sns.kdeplot(data=df_price, color="purple", fill=True)

# Customize labels and title
g.set_xlabel("Price ($)", fontsize=14)
g.set_ylabel("Density", fontsize=14)
plt.title('Distribution of App Prices', size=20)

# Customize tick parameters
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Add a grid for better readability
plt.grid(axis='y', alpha=0.7)

# Adding a subtle background color
plt.gca().set_facecolor('#F2F2F2')

# Display the plot
plt.show()

From the distribution above, we can see that most apps are free, while a smaller number are paid.

## Seaborn Data Visualisation
***Scatter Plots of different columns using Seaborn***

Let's get our hands dirty.

#### Scatter Plot for Category vs. Size

In [None]:
# Set style
sns.set(style="whitegrid")

# Create a figure with a larger size
plt.figure(figsize=(12, 8))

# Create the scatter plot
sns.scatterplot(data=df_apps_clean, x='Category', y='Size', palette='Set2', hue="Type")

# Customize labels and title
plt.xlabel("Category", fontsize=14)
plt.ylabel("Size of Application", fontsize=14)
plt.title("Scatter Plot of Category vs Size", fontsize=16)

# Customize tick parameters
plt.xticks(rotation=45, ha='right', fontsize=12)
plt.yticks(fontsize=12)

# Adding a legend
plt.legend(fontsize=12)

# Adding a subtle background color
plt.gca().set_facecolor('#F2F2F2')

# Removing the right and top spines
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)

# Display the plot
plt.tight_layout()
plt.show()

#### Scatter Plot for Size vs. Price

In [None]:
# Set style
sns.set(style="whitegrid")

# Create a figure with a larger size
plt.figure(figsize=(12, 8))

# Create the scatter plot
sns.scatterplot(data=df_apps_clean, x='Size', y='Price', palette='Set2', hue="Type")

# Customize labels and title
plt.xlabel("Size of Application(MBs)", fontsize=14)
plt.ylabel("Price of Application", fontsize=14)
plt.title("Scatter Plot of Size of Application vs Price", fontsize=16)

# Customize tick parameters
plt.xticks(rotation=45, ha='right', fontsize=12)
plt.yticks(fontsize=12)

# Adding a legend
plt.legend(fontsize=12)

# Adding a subtle background color
plt.gca().set_facecolor('#F2F2F2')

# Removing the right and top spines
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)

# Display the plot
plt.tight_layout()
plt.show()

From the scatter plot above, we can get a fair idea of how size relates to price. The highest-priced applications fall in the 0-40 MB range.

In other words, a bulky size may work against a paid application.

#### Scatter Plot for Size vs. Number of Installs

In [None]:
# Set style
sns.set(style="whitegrid")

# Create a figure with a larger size
plt.figure(figsize=(12, 8))

# Create the scatter plot
sns.scatterplot(data=df_apps_clean, x='Size', y='Installs', palette='Set2', hue="Type")

# Customize labels and title
plt.xlabel("Size of Application", fontsize=14)
plt.ylabel("Number of Installs", fontsize=14)
plt.title("Scatter Plot of Size vs Number of Installs", fontsize=16)


# Customize tick parameters
plt.xticks(rotation=45, ha='right', fontsize=12)
plt.yticks(fontsize=12)

# Adding a legend
plt.legend(fontsize=12)

# Adding a subtle background color
plt.gca().set_facecolor('#F2F2F2')

# Removing the right and top spines
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)

# Display the plot
plt.tight_layout()
plt.show()

#### The plot above suggests that size may impact the number of installations.

#### Scatter Plot for Size vs. log_installs

In [None]:
# Set style
sns.set(style="whitegrid")

# Create a figure with a larger size
plt.figure(figsize=(12, 8))

# Create the scatter plot
sns.scatterplot(data=df_apps_clean, x='Size', y='log_installs', palette='Set2', hue="Type")

# Customize labels and title
plt.xlabel("Size of Application (MBs)", fontsize=14)
plt.ylabel("Log of Installs", fontsize=14)
plt.title("Scatter Plot of Size vs Log Installs(base 10)", fontsize=16)

# Customize tick parameters
plt.xticks(rotation=45, ha='right', fontsize=12)
plt.yticks(fontsize=12)

# Adding a legend
plt.legend(fontsize=12)

# Adding a subtle background color
plt.gca().set_facecolor('#F2F2F2')

# Removing the right and top spines
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)

# Display the plot
plt.tight_layout()
plt.show()

#### Bulky applications are installed less often by users.

# Asking and Answering Questions

>- Asking and answering questions in data analysis is like a compass for navigating a vast sea of data. It defines our goals, guides data collection, and shapes analysis techniques. It uncovers patterns, informs decisions, and fuels effective communication. Ultimately, it transforms raw data into valuable insights, driving better understanding and smarter actions.

<img src="https://indian.memetemplates.in/uploads/1674731583.jpeg">

## Q1: What are the top 20 most expensive apps in the dataset ?

In [None]:
df_apps_clean.Price.describe()

In [None]:
df_apps_clean['Price'] = df_apps_clean['Price'].astype(str).str.replace('$', "", regex=False)

In [None]:
df_apps_clean.Price = pd.to_numeric(df_apps_clean.Price)

In [None]:
df_apps_clean.sort_values('Price', ascending=False).head(20)

What’s going on here? Apparently there are 15 I am Rich apps in the Google Play Store. They all cost `$300` or more, which is the main point of the app. The story goes that in 2008, Armin Heinrich released the very first I am Rich app in the iOS App Store for `$999.99`. The app does absolutely nothing. It just displays a picture of a gemstone and can be used to prove to your friends how rich you are. Armin reportedly made a total of 8 sales before the app was hastily removed by Apple. Nonetheless, it inspired a bunch of copycats on the Google Play Store, but if you search today, you’ll find all of these apps have disappeared as well. The high installation numbers were likely gamed by making the app available for free at some point to collect reviews and appear more legitimate.

<img src="https://humornama.com/wp-content/uploads/2020/12/25-Din-Mein-Paisa-Double-meme-template-of-Phir-Hera-Pheri-1024x576.jpg">

Leaving this bad data in our dataset will misrepresent our analysis of the most expensive 'real' apps. Here’s how we can remove these rows:

In [None]:
df_apps_clean = df_apps_clean[df_apps_clean['Price'] < 250]

In [None]:
df_apps_clean.sort_values('Price', ascending=False).head(5)

**When we look at the top 5 apps now, we see that 4 out of 5 are medical apps.**

We can work out the highest grossing paid apps now. All we need to do is multiply the values in the price and the installs column to get the number:

In [None]:
df_apps_clean['Revenue_Estimate'] = df_apps_clean.Installs.mul(df_apps_clean.Price)

In [None]:
df_apps_clean.sort_values('Revenue_Estimate', ascending=False)[:10]

This generously assumes of course that all the installs would have been made at the listed price, which is unlikely, as there are always promotions and free give-aways on the App Stores.
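
To make the arithmetic concrete, here is a minimal sketch on a made-up three-row table. The app names and numbers are hypothetical and are not taken from the dataset. It multiplies installs by price in the same way as above, so free apps end up with an estimate of zero.

```python
import pandas as pd

# Hypothetical toy data, for illustration only (not from the Play Store dataset)
toy = pd.DataFrame({
    "App": ["Alpha", "Beta", "Gamma"],
    "Installs": [1_000_000, 50_000, 10_000],
    "Price": [0.0, 2.99, 9.99],
})

# Upper-bound estimate: every install is assumed to be a sale at the listed price
toy["Revenue_Estimate"] = toy["Installs"].mul(toy["Price"])

# Rank the apps by estimated revenue, highest first
ranked = toy.sort_values("Revenue_Estimate", ascending=False)
print(ranked)
```

Note that the most-installed app in this toy table is free, so it lands at the bottom of the revenue ranking.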

<img src="https://humornama.com/wp-content/uploads/2021/01/150-Rupiya-Dega-Meme-Template-of-Kachra-Seth-1024x576.jpg">

The top spot of the highest-grossing paid app goes to … Minecraft at close to $70 million. It’s quite interesting that Minecraft (along with Bloons and Card Wars) is actually listed in the Family category rather than in the Game category. If we include these titles, we see that 7 out of the top 10 highest-grossing apps are games. The Google Play Store seems to be quite flexible with its category labels.

## Q2: What are the Most Competitive & Popular App Categories ?

### The Most Competitive & Popular App Categories

If you were to release an app, would you choose to go after a competitive category with many other apps? Or would you target a popular category with a high number of downloads? Or perhaps you could target a category that is both popular and one where the downloads are spread out among many different apps. That way, even if your app is harder to discover among all the others, it has a better chance of getting installed, right? Let’s analyse this with bar charts and scatter plots and figure out which categories are dominating the market.

> We can find the number of different categories using `.nunique()` function

In [None]:
df_apps_clean.Category.nunique()

There are 33 unique categories.

To calculate the number of apps per category we can use **.value_counts()**

In [None]:
top10_category = df_apps_clean.Category.value_counts()[:10]

In [None]:
top10_category

In [None]:
# plt.figure(figsize=(14,7))
# plt.xticks(rotation=65)
# plt.xlabel("Category")
# plt.ylabel("Number of application")
# plt.title("Top 10 Categories")
# plt.bar(top10_category.index,top10_category.values);

In [None]:

# Create a figure with a larger size
plt.figure(figsize=(14, 7))

# Create the bar plot
plt.bar(top10_category.index, top10_category.values, color='#86bf91', alpha=0.7)

# Customize x-axis labels rotation and fontsize
plt.xticks(rotation=45, ha='right', fontsize=12)

# Customize labels and title
plt.xlabel("Category", fontsize=14)
plt.ylabel("Number of Applications", fontsize=14)
plt.title("Top 10 App Categories", fontsize=16)

# Customize tick parameters
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Add a grid for better readability
plt.grid(axis='y', alpha=0.7)

# Adding a subtle background color
plt.gca().set_facecolor('#F2F2F2')

# Removing the right and top spines
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)

# Display the plot
plt.tight_layout()
plt.show()

Based on the number of apps, the **Family** and **Game** categories are the most competitive. Releasing yet another app into these categories will make it hard to get noticed.

But what if we look at it from a different perspective? What matters is not just the total number of apps in the category but how often apps are downloaded in that category. This will give us an idea of how popular a category is. First, we have to group all our apps by category and sum the number of installations:

In [None]:
category_installs = df_apps_clean.groupby('Category').agg({'Installs': pd.Series.sum})
category_installs.sort_values('Installs', ascending=True, inplace=True)

In [None]:
cat_number = df_apps_clean.groupby('Category').agg({'App': pd.Series.count})

In [None]:
cat_merged_df = pd.merge(cat_number, category_installs, on='Category', how="inner")
print(f'The dimensions of the DataFrame are: {cat_merged_df.shape}')
cat_merged_df.sort_values('Installs', ascending=False)
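
One way to read a table like this, sketched here on hypothetical numbers: dividing the installs by the number of apps gives the average installs per app in each category. The `Installs_per_App` column and the toy rows are illustrative additions and are not part of the original analysis.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "App": ["a1", "a2", "a3", "b1", "c1", "c2"],
    "Category": ["FAMILY", "FAMILY", "FAMILY", "TOOLS", "GAME", "GAME"],
    "Installs": [100, 200, 300, 5000, 1000, 3000],
})

# Number of apps and total installs per category
cat_number = toy.groupby("Category").agg({"App": "count"})
category_installs = toy.groupby("Category").agg({"Installs": "sum"})

# Both frames are indexed by Category, so they can be merged on their indexes
merged = pd.merge(cat_number, category_installs,
                  left_index=True, right_index=True, how="inner")

# Average installs per app: a rough measure of how concentrated the downloads are
merged["Installs_per_App"] = merged["Installs"] / merged["App"]
print(merged.sort_values("Installs", ascending=False))
```

In this toy table, FAMILY has the most apps but the lowest installs per app, while TOOLS has a single app that collects all of its category's installs.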

In [None]:
# plt.figure(figsize=(14,7))
# plt.title('Category Concentration')
# sns.scatterplot(x="App", # column name
#                 y="Installs",
#                 s=100,
#                 data=cat_merged_df);


In [None]:
# Set style
sns.set(style="whitegrid")

# Create a figure with a larger size
plt.figure(figsize=(14, 7))

# Create the scatter plot
sns.scatterplot(x="App", y="Installs", s=100, data=cat_merged_df)

# Customize labels and title
plt.xlabel("App", fontsize=14)
plt.ylabel("Installs", fontsize=14)
plt.title("Category Concentration", fontsize=16)

# Customize tick parameters
plt.xticks(rotation=45, ha='right', fontsize=12)
plt.yticks(fontsize=12)

# Add a grid for better readability
plt.grid(alpha=0.5)

# Adding a subtle background color
plt.gca().set_facecolor('#F2F2F2')

# Removing the right and top spines
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)

# Display the plot
plt.tight_layout()
plt.show()

In the above scatterplot, we can see 3 to 4 clusters of installs. Some of them are definitely outliers.

In [None]:
# Set style
sns.set(style="whitegrid")

# Create a figure with a larger size
plt.figure(figsize=(14, 7))

# Create the bar plot using seaborn
sns.barplot(x=top10_category.index, y=top10_category.values)

# Customize x-axis labels rotation and fontsize
plt.xticks(rotation=45, ha='right', fontsize=12)

# Customize labels and title
plt.xlabel("Category", fontsize=14)
plt.ylabel("Number of Applications", fontsize=14)
plt.title("Count of Applications for Each Category", fontsize=16)

# Customize tick parameters
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Add a grid for better readability
plt.grid(axis='y', alpha=0.7)

# Adding a subtle background color
plt.gca().set_facecolor('#F2F2F2')

# Removing the right and top spines
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)

# Display the plot
plt.tight_layout()
plt.show()

From the above barplot, we can confirm that the family category is dominating the play store categories.

In [None]:
# plt.figure(figsize=(14,7))
# plt.xticks(rotation=65)
# plt.xlabel("Category")
# plt.ylabel("Number of application")
# plt.title("Count of applications for each Category")
# sns.barplot(x=top10_category.index, y=top10_category.values)
# plt.show()

#---------------------------------Graph For Line plot------------------------------------------------

#The same conclusion can be drawn from the line chart.
# plt.figure(figsize=(14,7))
# plt.xticks(rotation=65)
# plt.xlabel("Category")
# plt.ylabel("Number of application")
# plt.title("Count of applications for each Category")
# sns.lineplot(x=top10_category.index, y=top10_category.values, marker='o', label='Line')
# plt.show()

In [None]:
df_category_installs = df_apps_clean.groupby(['Category','Type'])[['Installs']].sum().reset_index()
df_category_installs['log(Installs)'] = np.log10(df_category_installs['Installs'])

In [None]:
plt.figure(figsize=(25,12))
plt.xticks(rotation=75,fontsize=16)
plt.xlabel("Category", fontsize=16)
plt.ylabel("log10(Installs)", fontsize=14)
plt.title("Number of Installs (log10) by Category and Type")
sns.barplot(x='Category', y='log(Installs)', hue='Type', data=df_category_installs);
plt.show()

***We can see that the number of free applications installed by users is high when compared with the paid ones.***

Let us save and upload our work to Jovian before continuing

In [None]:
# import jovian

In [None]:
# jovian.commit()

## Q3: What are the top 20 Genres with their counts ?

Let’s turn our attention to the Genres column. This is quite similar to the categories column but more granular.

In [None]:
# Number of Genres
len(df_apps_clean.Genres.unique())

#### Working with Nested Column Data

In [None]:
# Problem : Have multiple categories separated by ;
df_apps_clean.Genres.value_counts().sort_values(ascending=True)[:5]

We somehow need to separate the genre names to get a clear picture.
This is where the string `.split()` method comes in handy. After we’ve separated our genre names based on the semi-colon, we can add them all into a single column with `.stack()` and then use `.value_counts()`.
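
Here is a minimal sketch of the split-and-stack idea on a hypothetical four-row series (the genre strings are made up for illustration). It also shows `Series.explode()`, an alternative I am adding for comparison that is not used in this notebook, and checks that both routes give the same counts.

```python
import pandas as pd

# Hypothetical genre strings; some rows hold two genres separated by ';'
genres = pd.Series([
    "Action",
    "Education;Creativity",
    "Action;Action & Adventure",
    "Education",
])

# expand=True yields one column per piece; stack() flattens them into one Series
stacked = genres.str.split(";", expand=True).stack()

# value_counts() ignores missing pieces by default
counts = stacked.value_counts()
print(counts)

# Alternative: keep the pieces as lists and give each piece its own row
counts_alt = genres.str.split(";").explode().value_counts()

print(counts.to_dict() == counts_alt.to_dict())
```

Six genre labels are spread over the four rows, and they collapse into four distinct genres once the rows are split.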

#### Extracting Nested Column Data using `.stack()`

In [None]:
# Split the strings on the semi-colon and then .stack them.
stack = df_apps_clean.Genres.str.split(';', expand=True).stack()
print(f'We now have a single column with shape: {stack.shape}')
num_genres = stack.value_counts()
print(f'Number of genres: {len(num_genres)}')

**This shows us we actually have 53 different genres.**

**Let's plot top 20 Genres with their counts**

In [None]:
# plt.figure(figsize=(14,7))
# plt.xticks(rotation=75)
# plt.xlabel("Genres")
# plt.ylabel("Counts")
# plt.title('Top Genres')
# sns.barplot(x = num_genres.index[:20],  y = num_genres.values[:20])
# plt.show()

In [None]:
plt.figure(figsize=(14, 7))

# Adjusting style
sns.set(style="whitegrid")
palette = sns.color_palette("viridis", len(num_genres.index[:20]))

# Creating the bar plot
ax = sns.barplot(x=num_genres.index[:20], y=num_genres.values[:20], palette=palette)

# Adding labels and title
plt.xticks(rotation=75)
plt.xlabel("Genres", fontsize=14)
plt.ylabel("Counts", fontsize=14)
plt.title('Top 20 Genres of Apps', fontsize=16)

# Adding annotations to the bars
for p in ax.patches:
    ax.annotate(format(p.get_height(), '.0f'),
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha = 'center', va = 'center',
                xytext = (0, 9),
                textcoords = 'offset points',
                fontsize=10, color='black')

plt.tight_layout()
plt.show()

Here we have our top genres, i.e., Tools, Education, Entertainment, Action, Lifestyle, Finance, and Productivity.

In [None]:
# Let's save the work

In [None]:
# import jovian

In [None]:
# jovian.commit()

## Q4: What is the count of applications in each category, differentiated by type? What are the top free and paid categories?

Now that we’ve looked at the total number of apps per category and the total number of apps per genre, let’s see what the split is between free and paid apps.

In [None]:
df_apps_clean.Type.value_counts()

We see that the majority of apps are free on the Google Play Store. But perhaps some categories have more paid apps than others. Let’s investigate. We can group our data first by Category and then by Type. Then we can add up the number of apps per each type. Using as_index=False we push all the data into columns rather than end up with our Categories as the index.
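
To see what `as_index=False` changes, here is a small sketch on a hypothetical four-row table (the app names and categories are made up). It runs the same grouping twice and compares where the group keys end up.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "App": ["a1", "a2", "a3", "b1"],
    "Category": ["FAMILY", "FAMILY", "FAMILY", "TOOLS"],
    "Type": ["Free", "Free", "Paid", "Free"],
})

# Default: the group keys become a MultiIndex on the result
indexed = toy.groupby(["Category", "Type"]).agg({"App": "count"})

# as_index=False: the group keys stay as ordinary columns
flat = toy.groupby(["Category", "Type"], as_index=False).agg({"App": "count"})

print(list(indexed.index.names))
print(list(flat.columns))
```

The flat version is usually easier to pass to plotting functions, because `Category` and `Type` can be referenced by column name.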

In [None]:
df_free_vs_paid = df_apps_clean.groupby(["Category", "Type"], as_index=False).agg({'App': pd.Series.count})
df_free_vs_paid.head()

In [None]:
df_free_vs_paid

Unsurprisingly, the biggest categories have the most paid apps. However, there might be some patterns if we put the numbers on a graph!

***The counts above show that Family, Game, and Tools are the categories with the most apps.***

### Play Store Apps Types: Free or Paid
It'd be interesting to see the comparison between the total number of free and paid applications in this dataset just to get an idea of which one is the majority.

In [None]:
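plt.figure(figsize=(7,5))
# Count individual apps; df_free_vs_paid holds only one row per (Category, Type) pair
g = sns.countplot(x=df_apps_clean.Type, palette="pastel")
# Set the labels after plotting so that seaborn does not overwrite them
plt.title("Count of Free vs Paid Apps")
plt.xlabel("Type (Free/Paid)")
plt.ylabel("Total Count")
plt.show()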

We can clearly see that the number of free apps dominates the Android market.

#### As the data and bar plot above show, the number of free applications installed by users is high compared with the paid ones.

In [None]:
plt.figure(figsize=(12,6))
plt.title("Percentages(%) of Play Store Apps that are either Free or Paid")
label = df_free_vs_paid.Type.value_counts().index
g = plt.pie(df_free_vs_paid.Type.value_counts(), explode=(0.025,0.025), labels=label, colors=['skyblue','navajowhite'],autopct='%1.1f%%', startangle=180);
plt.legend()
plt.show();

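Here 54.1% of the rows are Free and the rest are Paid. Note that `df_free_vs_paid` has one row per (Category, Type) pair, so this is the share of category and type combinations rather than the share of individual apps; the app-level split is the one given by `df_apps_clean.Type.value_counts()` earlier.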
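
The share of free versus paid apps can also be computed directly from the `Type` column with `value_counts(normalize=True)`. A minimal sketch on a hypothetical four-app series follows; on the real data the input would be `df_apps_clean['Type']`, and the numbers below are made up.

```python
import pandas as pd

# Hypothetical toy column: three free apps and one paid app
toy_types = pd.Series(["Free", "Free", "Free", "Paid"], name="Type")

# normalize=True returns fractions instead of raw counts; scale them to percentages
share = toy_types.value_counts(normalize=True).mul(100).round(1)
print(share)
```

Because every app contributes one row here, the percentages describe apps rather than category and type combinations.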

### Contrasting Free vs. Paid Apps per Category

What we see is that while there are very few paid apps on the Google Play Store, some categories have relatively more paid apps than others, including Personalization, Medical and Weather. So, depending on the category you are targeting, it might make sense to release a paid-for app.

In [None]:
app_count = df_apps_clean.groupby(['Category','Type'])[['App']].count().reset_index().rename(columns={'App':'Count','index':'App'})

In [None]:
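# Keyword arguments are required by recent pandas versions, where positional arguments to pivot() are no longer accepted
df_app_count = app_count.pivot(index='Category', columns='Type', values='Count').fillna(0).reset_index()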

In [None]:
# Set style
sns.set(style="whitegrid")

# Create the stacked bar plot (DataFrame.plot creates its own figure, so a separate plt.figure call would only leave an empty figure behind)
ax = df_app_count.set_index('Category').plot(kind='bar', stacked=True, figsize=(18, 9))

# labels and title
plt.xlabel("Category", fontsize=15)
plt.ylabel("Count of Applications", fontsize=15)
plt.title("Count of Applications in Each Category Differentiated by Their Type", fontsize=16)

# Add a grid for better readability
plt.grid(alpha=0.5)

# Adding a subtle background color
plt.gca().set_facecolor('#F2F2F2')

# Removing the right and top spines
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)

# Display the plot
plt.tight_layout()
plt.show();

It looks like certain app categories have more free apps available for download than others.

#### We can clearly see that the majority of apps in Family, Food & Drink and Tools, as well as Social categories were free to install. At the same time Family, Sports, Tools and Medical categories had the biggest number of paid apps available for download.

## Q5: How frequently are Play Store applications updated each year?

#### Content Updated each year

In [None]:
# rename_axis() and reset_index(name=...) give the 'Year' and 'Count' columns regardless of the pandas version
df_apps_year = df_apps_clean['Last Updated'].value_counts().rename_axis('Year').reset_index(name='Count')

df_apps_year

In [None]:
fig, ax = plt.subplots(figsize=(10, 8))
sns.lineplot(data=df_apps_year, x='Year', y='Count')

# Adding title and labels
plt.title("Total Apps Updated Each Year (up to 2019)", fontsize=16)
plt.ylabel("Count of Applications", fontsize=14)
plt.xlabel("Year", fontsize=14)

# Adding a background color
ax.set_facecolor('#f0f0f0')

# Adding a legend
ax.legend(['Count'], loc='upper left', fontsize=12)

plt.show()

The line plot above is hard to read because the x-axis holds individual update dates rather than years, but it does suggest that most apps were last updated recently.

<img src="https://humornama.com/wp-content/uploads/2022/01/Ye-Sab-Kya-Dekhna-Pad-Raha-Hai-Meme-Template-on-Deewane-Huye-Paagal-1024x410.jpg">

We need to draw one more line plot, this time by year, so that we can figure out what is happening.

In [None]:
df_last_updated_year = pd.DataFrame(df_apps_year['Year'])

In [None]:
df_last_updated_year['date_column'] = pd.to_datetime(df_last_updated_year['Year'])

In [None]:
df_last_updated_year['year'] =df_last_updated_year['date_column'].dt.year

In [None]:
print(df_last_updated_year)

In [None]:
# fig, ax = plt.subplots(figsize=(10, 8))
# sns.lineplot(data=df_apps_year, x=df_last_updated_year['year'], y='Count')


# plt.title("Total Apps updated each year (up to 2018)")
# plt.ylabel("Count")
# plt.xlabel("Year")
# plt.show()

In [None]:
# Create a figure and axes with a larger size
fig, ax = plt.subplots(figsize=(10, 8))

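# Create the line plot using seaborn; estimator='sum' adds up the per-date counts within each year
# (the default estimator would plot the mean count per date instead of the yearly total)
sns.lineplot(data=df_apps_year, x=df_last_updated_year['year'], y='Count', estimator='sum')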

# Customize labels and title
plt.title("Total Apps Updated Each Year (up to 2018)", fontsize=16)
plt.ylabel("Count of Applications", fontsize=14)
plt.xlabel("Year", fontsize=14)

# Customize tick parameters
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Add a grid for better readability
plt.grid(alpha=0.5)

# Adding a subtle background color
ax.set_facecolor('#F2F2F2')


# Display the plot
plt.tight_layout()
plt.show()

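From the above line plot, we can conclude that the number of updates rises sharply after 2016, which suggests that creators are constantly improving their products. Keep in mind that `Last Updated` stores only the most recent update of each app, so apps that are still maintained are counted in the latest years.
One more, rather speculative, reading: the low counts before 2014 may reflect more limited Internet availability and a smaller number of Android users at that time.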
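
As a cross-check on the plot, the yearly totals can also be computed directly. This is a minimal sketch on four hypothetical rows; it assumes the `Last Updated` strings look like "January 7, 2018", which is an assumption about the column format rather than something shown above.

```python
import pandas as pd

# Hypothetical toy rows; the date format is assumed, not confirmed by the notebook
toy = pd.DataFrame({
    "Last Updated": ["January 7, 2018", "August 1, 2018",
                     "June 20, 2017", "March 3, 2016"],
})

# Parse the strings into datetimes and keep only the year
years = pd.to_datetime(toy["Last Updated"], format="%B %d, %Y").dt.year

# Count how many apps were last updated in each year, in chronological order
per_year = years.value_counts().sort_index()
print(per_year)
```

The resulting series has one entry per year, which can be passed straight to a line or bar plot.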

In [None]:
# import jovian

In [None]:
# jovian.commit()

## Inferences and Conclusion

* In our dataset, we actually have **53 different genres**.
* Based on the number of apps, the **Family and Game categories** are the most competitive. Releasing yet another app into these categories will make it hard to get noticed.

* It looks like certain app categories have more free apps available for download than others.

* It can be concluded that the number of **free applications** installed by users is high compared with the **paid ones**.

* The majority of apps in the **Family**, **Food & Drink**, **Tools**, and **Social** categories were free to install.

* **At the same time Family, Sports, Tools and Medical categories had the biggest number of paid apps available for download.**

* What we see is that while there are very few paid apps on the Google Play Store, some categories have relatively more paid apps than others, including Personalization, Medical and Weather. So, depending on the category you are targeting, it might make sense to release a paid-for app.

In [None]:
# import jovian

In [None]:
# jovian.commit()

# References and Future Work


### References -

>- **Numerical computing with Numpy**: https://jovian.ai/aakashns/python-numerical-computing-with-numpy
>- **100 Numpy exercises**: https://jovian.ai/aakashns/100-numpy-exercises
>- **Working with OS & files**: https://jovian.ai/aakashns/python-os-and-filesystem
>- **Matplotlib & Seaborn tutorial:** https://jovian.ai/aakashns/python-matplotlib-data-visualization
>- **Data visualization cheat sheet:** https://jovian.ai/aakashns/dataviz-cheatsheet
> - Find an interesting dataset on this page: https://www.kaggle.com/datasets?fileType=csv
> - Download the dataset using the [`opendatasets` Python library](https://github.com/JovianML/opendatasets#opendatasets)
>- I got this dataset from Kaggle; here is the link to the dataset - https://www.kaggle.com/lava18/google-play-store-apps
>- You can also use the extended version of the original dataset - https://www.kaggle.com/datasets/neomatrix369/google-play-store-apps-extended

>- **1. Learn more about scientific computing with NumPy** - https://numpy.org/
>- **2. To learn more about pandas and its functions in detail** - https://pandas.pydata.org
>- **3. For more ideas on Matplotlib and its library** - https://matplotlib.org
>- **4. For many coding-related doubts, you can head over to** - https://www.w3schools.com
>- **5. For resolving doubts, we have our good old friend Stack Overflow** - https://stackoverflow.com

>- Apart from all of these I have done some paid courses on **Udemy** such as **100 Days of Code: The Complete Python Pro Bootcamp for 2023**, taught by **Dr. Angela Yu**, https://www.udemy.com/course/100-days-of-code/


>- **Special thanks to memes**
>- 1. https://humornama.com/wp-content/uploads/2020/12/25-Din-Mein-Paisa-Double-meme-template-of-Phir-Hera-Pheri-1024x576.jpg
>- 2. https://indian.memetemplates.in/uploads/1674731583.jpeg
>- 3. https://humornama.com/wp-content/uploads/2022/01/Ye-Sab-Kya-Dekhna-Pad-Raha-Hai-Meme-Template-on-Deewane-Huye-Paagal-1024x410.jpg






### Future Work-
>- *I want to work more on the topics of the **App Store** and **Play Store** and also do the analysis on the same type of dataset, but that will not be bound to any specific operating system.*
>- *I'll do the analysis on the same topic but with a more recently updated large dataset. I'm trying to find such a dataset for my future work because The Play Store app data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market!*
>- I'll try to Implement Machine Learning models on this dataset.
>- Prediction of the number of users by using the regression model.
>- Recommender System



/*
    **कर्मण्येवाधिकारस्ते मा फलेषु कदाचन।**
    **मा कर्मफलहेतुर्भूर्मा ते सङ्गोऽस्त्वकर्मणि॥**

    Karmanye vadhikaraste Ma Phaleshu Kadachana,
    Ma Karmaphalaheturbhurma Te Sangostvakarmani,

    The meaning of the verse is :—
    You have the right to work only but never to its fruits.
    Let not the fruits of action be your motive, nor let your attachment be to inaction
   
*/

In [None]:
# import jovian

In [None]:
# jovian.commit()