# 1.0 `About Dataset`
### Context
While many public datasets (on Kaggle and elsewhere) provide Apple App Store data, few counterpart datasets exist for Google Play Store apps. On digging deeper, I found that the iTunes App Store page uses a neatly indexed, appendix-like structure that makes web scraping simple. The Google Play Store, on the other hand, relies on techniques such as dynamic page loading with jQuery, which makes scraping more challenging.

### 1.1 Content
Each app (row) has values for category, rating, size, and more.

### 1.2 Acknowledgements
This information was scraped from the Google Play Store; the app data would not be available without it.

### 1.3 Inspiration
The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market!

### 1.4 Dataset link:
> The dataset was downloaded from Kaggle via the following [link](https://www.kaggle.com/datasets/lava18/google-play-store-apps/)

# 2.0 `Import Libraries:`

In [2]:
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# 3.0 `Exploring Data:`

In [3]:
# load data
df = pd.read_csv('../../datasets/googleplaystore.csv')

In [4]:
df.head()

| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up |


In [5]:
df.tail()

| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10836 | Sya9a Maroc - FR | FAMILY | 4.5 | 38 | 53M | 5,000+ | Free | 0 | Everyone | Education | July 25, 2017 | 1.48 | 4.1 and up |
| 10837 | Fr. Mike Schmitz Audio Teachings | FAMILY | 5.0 | 4 | 3.6M | 100+ | Free | 0 | Everyone | Education | July 6, 2018 | 1.0 | 4.1 and up |
| 10838 | Parkinson Exercices FR | MEDICAL | NaN | 3 | 9.5M | 1,000+ | Free | 0 | Everyone | Medical | January 20, 2017 | 1.0 | 2.2 and up |
| 10839 | The SCP Foundation DB fr nn5n | BOOKS_AND_REFERENCE | 4.5 | 114 | Varies with device | 1,000+ | Free | 0 | Mature 17+ | Books & Reference | January 19, 2015 | Varies with device | Varies with device |
| 10840 | iHoroscope - 2018 Daily Horoscope & Astrology | LIFESTYLE | 4.5 | 398307 | 19M | 10,000,000+ | Free | 0 | Everyone | Lifestyle | July 25, 2018 | Varies with device | Varies with device |


In [6]:
df.columns

Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
       'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
       'Android Ver'],
      dtype='object')

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10840 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  int64  
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10841 non-null  object 
 9   Genres          10840 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10839 non-null  object 
dtypes: float64(1), int64(1), object(11)
memory usage: 1.1+ MB
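
The `df.info()` output above implies missing values in several columns (for example, `Rating` has 9,367 non-null entries out of 10,841 rows). A minimal sketch of how those gaps could be tallied per column follows. It runs on a small hand-made frame named `sample`, which is hypothetical and stands in for `df` so that the block is self-contained.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; the values are made up for illustration only.
sample = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.1, np.nan, 3.9, np.nan],
    "Type": ["Free", "Paid", None, "Free"],
})

# Count missing values per column and express them as a percentage of all rows.
missing = sample.isnull().sum()
missing_pct = (missing / len(sample) * 100).round(2)

# Combine both views and list the columns with the most gaps first.
summary = pd.DataFrame({"missing": missing, "percent": missing_pct})
summary = summary.sort_values("missing", ascending=False)
print(summary)
```

On the real data, the same lines should work with `df` in place of `sample`.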


In [8]:
df.describe()

| | Rating | Reviews |
|---|---|---|
| count | 9367.0 | 10841.0 |
| mean | 4.191513 | 444111.9 |
| std | 0.515735 | 2927629.0 |
| min | 1.0 | 0.0 |
| 25% | 4.0 | 38.0 |
| 50% | 4.3 | 2094.0 |
| 75% | 4.5 | 54768.0 |
| max | 5.0 | 78158310.0 |


## 3.1 `Observations:`

In [9]:
# showing all rows
pd.set_option('display.max_rows', None)

In [10]:
# count the unique values in each column
df.nunique()

App               9660
Category            33
Rating              39
Reviews           6001
Size               461
Installs            21
Type                 2
Price               92
Content Rating       6
Genres             119
Last Updated      1377
Current Ver       2831
Android Ver         33
dtype: int64
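
For columns with only a handful of distinct values, such as `Type` (2) and `Content Rating` (6), listing the values themselves is often more informative than counting them. A minimal sketch, assuming a small made-up frame named `sample` in place of `df`:

```python
import pandas as pd

# Hypothetical miniature frame; the values are made up for illustration only.
sample = pd.DataFrame({
    "Type": ["Free", "Free", "Paid", None],
    "Content Rating": ["Everyone", "Teen", "Everyone", "Mature 17+"],
})

low_cardinality_cols = ["Type", "Content Rating"]

# dropna=False keeps missing entries visible in each tally.
tallies = {
    col: sample[col].value_counts(dropna=False)
    for col in low_cardinality_cols
}

for col, counts in tallies.items():
    print(f"{col}:\n{counts}\n")
```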

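`The following anomalies stand out in the data:`
1. Rating ranges from 1.0 to 5.0 in the `df.describe()` output above and has 1,474 missing values (9,367 non-null out of 10,841). A maximum of 19 appears only in copies of the dataset that still contain a misaligned row.
2. Size contains the text 'Varies with device' and unit suffixes such as 'M'; it should be numeric.
3. Installs contains '+' and ','; it should be numeric.
4. Type has two values, 'Free' and 'Paid', plus one missing entry.
5. Price is stored as text and contains '$'; it should be numeric. The stray value 'Everyone' comes from the same misaligned row as the rating of 19.
6. Content Rating includes 'Adults only 18+' and 'Unrated' among its 6 values.
7. Genres uses ';' to separate multiple genres within a single cell.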
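
The text-to-numeric issues listed above could be handled roughly as follows. This is only a sketch under stated assumptions, not the notebook's actual cleaning code: the frame `sample`, the helper `size_to_mb`, and the new column names `Size_MB` and `Primary Genre` are hypothetical, the 'k' suffix for kilobytes is assumed to occur in `Size`, and 1 MB is assumed to be 1024 kB.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame that imitates the raw text formats.
sample = pd.DataFrame({
    "Size": ["19M", "8.7M", "201k", "Varies with device"],
    "Installs": ["10,000+", "5,000,000+", "100+", "1,000+"],
    "Price": ["0", "$4.99", "0", "$0.99"],
    "Genres": ["Art & Design", "Art & Design;Pretend Play", "Education", "Medical"],
})


def size_to_mb(value):
    """Convert '19M' or '201k' to megabytes; anything else becomes NaN."""
    if isinstance(value, str):
        if value.endswith("M"):
            return float(value[:-1])
        if value.endswith("k"):  # assumed kilobyte suffix
            return float(value[:-1]) / 1024
    return np.nan


cleaned = sample.copy()

# Size: 'Varies with device' has no numeric meaning, so it maps to NaN.
cleaned["Size_MB"] = cleaned["Size"].apply(size_to_mb)

# Installs: strip '+' and ',' and convert what remains to integers.
cleaned["Installs"] = pd.to_numeric(
    cleaned["Installs"].str.replace("[+,]", "", regex=True)
)

# Price: strip the '$' sign; unexpected text is coerced to NaN.
cleaned["Price"] = pd.to_numeric(
    cleaned["Price"].str.replace("$", "", regex=False), errors="coerce"
)

# Genres: keep the part before ';' as the primary genre.
cleaned["Primary Genre"] = cleaned["Genres"].str.split(";").str[0]

print(cleaned)
```

Writing the converted size to a new column keeps the raw text available for checking; whether 'Varies with device' should remain NaN or be imputed is a separate decision.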

# 4.0 `Exploring and Dealing with Columns:`

## 4.1 `App`:
### 4.1.1 `Exploring:`

In [11]:
df.App.isnull().sum()

0

In [12]:
df.App.nunique()

9660

In [13]:
# check for duplicates
df.duplicated().sum()

483

In [19]:
# Datatype
df.dtypes

App                object
Category           object
Rating            float64
Reviews             int64
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

### 4.1.2 `Observations:`

In [14]:
# My observations about the App column:
# 1. There are 483 fully duplicated rows (df.duplicated() compares every column, not only App)
# 2. There are no missing values
# 3. There are 9660 unique app names
# 4. The data type is object


### 4.1.3 `Dealing With Duplicates:`
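
The exploration above found 483 fully duplicated rows and 9,660 unique app names among 10,841 rows, so 1,181 rows repeat a name that has already appeared. One possible way to handle both kinds of duplicates is sketched below. The frame `sample` is made up, and keeping the row with the most reviews for each app is an assumed policy, not something the notebook prescribes.

```python
import pandas as pd

# Hypothetical miniature frame; 'Alpha' is an exact copy, 'Beta' differs in Reviews.
sample = pd.DataFrame({
    "App": ["Alpha", "Alpha", "Beta", "Beta", "Gamma"],
    "Reviews": [100, 100, 50, 75, 10],
})

# Fully identical rows (what df.duplicated() counts by default).
exact_dupes = sample.duplicated().sum()

# Rows whose app name already appeared earlier, whatever the other columns hold.
name_dupes = sample.duplicated(subset="App").sum()

# Step 1: drop exact copies.
deduped = sample.drop_duplicates()

# Step 2 (assumed policy): for each app keep the row with the most reviews.
deduped = (
    deduped.sort_values("Reviews", ascending=False)
    .drop_duplicates(subset="App", keep="first")
    .sort_index()
    .reset_index(drop=True)
)

print(exact_dupes, name_dupes)
print(deduped)
```

If only exact copies should be removed, step 1 alone is enough.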

## 4.2 `Category`:

## 4.3 `Rating`:

## 4.4 `Reviews`:

## 4.5 `Size`:

## 4.6 `Installs`:

## 4.7 `Type`:

## 4.8 `Price`:

## 4.9 `Content Rating`:

## 4.10 `Genres`:

## 4.11 `Last Updated`:

## 4.12 `Current Version`:

## 4.13 `Android Version`: