
## <b> The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market. </b>

## <b> Each app (row) has values for category, rating, size, and more. Another dataset contains customer reviews of the Android apps.</b>

## <b> Explore and analyze the data to discover key factors responsible for app engagement and success. </b>

**Context :**
While many public datasets (on Kaggle and the like) provide Apple App Store data, few counterpart datasets are available for Google Play Store apps anywhere on the web. On digging deeper, I found that the iTunes App Store page uses a nicely indexed, appendix-like structure that makes web scraping simple. The Google Play Store, on the other hand, uses more sophisticated techniques (such as dynamic page loading) built on jQuery, which makes scraping more challenging.

**Content :**
Each app (row) has values for category, rating, size, and more.

**Acknowledgements :**
This information is scraped from the Google Play Store. This app information would not be available without it.

**Inspiration :**
The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market!

Despite the strange names I gave to the chapters, what we are doing in this kernel is something like:

1. **Understand the problem :** We'll look at each variable and think through its meaning and importance for this problem.
2. **Univariate study :** We'll focus on each variable individually and try to learn a little more about it.
3. **Multivariate study :** We'll try to understand how the variables relate to one another.
4. **Basic cleaning :** We'll clean the dataset and handle the missing data, outliers and categorical variables.
5. **Test assumptions :** We'll check if our data meets the assumptions required by most multivariate techniques.

In [1]:
# importing libraries
import pandas as pd               # for data manipulation
import numpy as np                # for mathematical operations and linear algebra
import matplotlib.pyplot as plt   # for data visualisation

### Read the Dataset into a dataframe

In [2]:
# reading the dataset from a local CSV file (path as mounted in Colab)
GPStore = pd.read_csv('/content/Play Store Data.csv')

In [3]:
# type of the variable GPStore
type(GPStore)

pandas.core.frame.DataFrame

In [4]:
# displaying the head or the first 10 rows of the dataframe, default 5
GPStore.head(10)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
5,Paper flowers instructions,ART_AND_DESIGN,4.4,167,5.6M,"50,000+",Free,0,Everyone,Art & Design,"March 26, 2017",1.0,2.3 and up
6,Smoke Effect Photo Maker - Smoke Editor,ART_AND_DESIGN,3.8,178,19M,"50,000+",Free,0,Everyone,Art & Design,"April 26, 2018",1.1,4.0.3 and up
7,Infinite Painter,ART_AND_DESIGN,4.1,36815,29M,"1,000,000+",Free,0,Everyone,Art & Design,"June 14, 2018",6.1.61.1,4.2 and up
8,Garden Coloring Book,ART_AND_DESIGN,4.4,13791,33M,"1,000,000+",Free,0,Everyone,Art & Design,"September 20, 2017",2.9.2,3.0 and up
9,Kids Paint Free - Drawing Fun,ART_AND_DESIGN,4.7,121,3.1M,"10,000+",Free,0,Everyone,Art & Design;Creativity,"July 3, 2018",2.8,4.0.3 and up


In [5]:
# displaying the tail or the bottom 10 rows of the dataframe, default 5
GPStore.tail(10)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10831,payermonstationnement.fr,MAPS_AND_NAVIGATION,,38,9.8M,"5,000+",Free,0,Everyone,Maps & Navigation,"June 13, 2018",2.0.148.0,4.0 and up
10832,FR Tides,WEATHER,3.8,1195,582k,"100,000+",Free,0,Everyone,Weather,"February 16, 2014",6.0,2.1 and up
10833,Chemin (fr),BOOKS_AND_REFERENCE,4.8,44,619k,"1,000+",Free,0,Everyone,Books & Reference,"March 23, 2014",0.8,2.2 and up
10834,FR Calculator,FAMILY,4.0,7,2.6M,500+,Free,0,Everyone,Education,"June 18, 2017",1.0.0,4.1 and up
10835,FR Forms,BUSINESS,,0,9.6M,10+,Free,0,Everyone,Business,"September 29, 2016",1.1.5,4.0 and up
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10838,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device
10840,iHoroscope - 2018 Daily Horoscope & Astrology,LIFESTYLE,4.5,398307,19M,"10,000,000+",Free,0,Everyone,Lifestyle,"July 25, 2018",Varies with device,Varies with device


In [6]:
# display the shape of the dataframe, i.e., the no. of rows and columns 
GPStore.shape

(10841, 13)

In [7]:
# prints a summary of the dataframe rows and columns, including information on the datatypes and non-null values
GPStore.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


In [8]:
GPStore.describe()

Unnamed: 0,Rating
count,9367.0
mean,4.193338
std,0.537431
min,1.0
25%,4.0
50%,4.3
75%,4.5
max,19.0


### Explore all columns one by one and check for invalid data and clean data accordingly. 

### App Column:
###### Check for duplicate values in the App column and, if any exist, drop those rows.

In [9]:
# get the count/frequency of all the unique values of the specified column
GPStore['App'].value_counts()

ROBLOX                                               9
CBS Sports App - Scores, News, Stats & Watch Live    8
8 Ball Pool                                          7
ESPN                                                 7
Duolingo: Learn Languages Free                       7
                                                    ..
Info BMKG                                            1
Cheat Codes for GTA V                                1
Trello                                               1
Jumia online shopping                                1
Android Auto - Maps, Media, Messaging & Voice        1
Name: App, Length: 9660, dtype: int64

In [10]:
# display the shape of the dataframe, i.e., the no. of rows and columns 
print(GPStore.shape)

# remove the duplicate values from the dataframe, specifying the column name in the subset parameter 
GPStore = GPStore.drop_duplicates(subset=['App'], keep = 'first')

# display the shape of the dataframe, i.e., the no. of rows and columns 
print(GPStore.shape)

(10841, 13)
(9660, 13)
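
Before dropping duplicates it can help to look at the repeated rows, since `keep='first'` silently decides which copy survives. The following is a minimal sketch on a toy frame; the app names and review counts are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the Play Store data (hypothetical values)
apps = pd.DataFrame({
    "App": ["App A", "App A", "App B", "App C", "App B"],
    "Reviews": ["100", "105", "20", "7", "21"],
})

# keep=False marks every member of a duplicated group, not just the repeats
dupes = apps[apps.duplicated(subset=["App"], keep=False)].sort_values("App")
print(dupes)

# Same call as in the notebook: keep the first row seen for each app name
deduped = apps.drop_duplicates(subset=["App"], keep="first")
print(deduped.shape)
```

If the duplicated rows differ (for example in review count), sorting before `drop_duplicates` is one way to control which copy is kept.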


### Category Column:
##### Check for unique categories. 

In [11]:
# get all the unique values present in the specified column
GPStore.Category.unique()

array(['ART_AND_DESIGN', 'AUTO_AND_VEHICLES', 'BEAUTY',
       'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMICS', 'COMMUNICATION',
       'DATING', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS', 'FINANCE',
       'FOOD_AND_DRINK', 'HEALTH_AND_FITNESS', 'HOUSE_AND_HOME',
       'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'GAME', 'FAMILY', 'MEDICAL',
       'SOCIAL', 'SHOPPING', 'PHOTOGRAPHY', 'SPORTS', 'TRAVEL_AND_LOCAL',
       'TOOLS', 'PERSONALIZATION', 'PRODUCTIVITY', 'PARENTING', 'WEATHER',
       'VIDEO_PLAYERS', 'NEWS_AND_MAGAZINES', 'MAPS_AND_NAVIGATION',
       '1.9'], dtype=object)

<b> The 'Category' column contains one value, '1.9', which looks invalid. Let's have a look at that entry. </b>

In [12]:
# dataframe filtering based on a condition
GPStore[GPStore.Category == '1.9']

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10472,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up,


In [13]:
# get the index of the dataframe
GPStore.index

Int64Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,
                9,
            ...
            10831, 10832, 10833, 10834, 10835, 10836, 10837, 10838, 10839,
            10840],
           dtype='int64', length=9660)

In [14]:
# remove the row with the specified index; axis 0 implies along the rows; axis 1 along the columns
GPStore=GPStore.drop([10472],axis=0)

In [15]:
# display the shape of the dataframe, i.e., the no. of rows and columns 
GPStore.shape

(9659, 13)
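
The malformed row can also be located by pattern rather than by reading the list of unique values. This sketch assumes that valid category labels consist only of upper-case letters and underscores, which holds for the labels listed above; the toy frame and its index labels are illustrative.

```python
import pandas as pd

# Toy frame with one malformed category (hypothetical rows)
toy = pd.DataFrame(
    {"App": ["App A", "App B", "App C"],
     "Category": ["ART_AND_DESIGN", "1.9", "TOOLS"]},
    index=[0, 10472, 10473],
)

# Assumption: valid labels contain only upper-case letters and underscores
is_valid = toy["Category"].str.match(r"^[A-Z_]+$")
bad_rows = toy[~is_valid]
print(bad_rows)

# Drop by the index labels found above instead of hard-coding them
cleaned = toy.drop(index=bad_rows.index)
print(cleaned.shape)
```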

### Rating Column:
###### Check for valid rating values

In [16]:
# statistical summary of the specified numerical variable
GPStore['Rating'].describe()

count    8196.000000
mean        4.173243
std         0.536625
min         1.000000
25%         4.000000
50%         4.300000
75%         4.500000
max         5.000000
Name: Rating, dtype: float64

<b> With the malformed row removed, all rating values lie within the valid range of 1 to 5, so the 'Rating' column contains no invalid data. However, only 8196 ratings are present whereas the dataset has 9659 entries, which means the 'Rating' column has missing values. Let's count them. </b>

In [17]:
# find the total no. of missing values present in the specified column
GPStore.Rating.isnull().sum()

1463
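
The range check on ratings can be made explicit instead of being read off `describe()`. This is a small sketch on a toy series, assuming the valid range is 1 to 5; the value 19.0 mirrors the outlier seen earlier in the malformed row.

```python
import numpy as np
import pandas as pd

# Toy ratings including one out-of-range value and one missing value
ratings = pd.Series([4.1, 19.0, np.nan, 5.0, 1.0], name="Rating")

# between() is inclusive on both ends by default; NaN evaluates to False
in_range = ratings.between(1, 5)

# Out of range means: a value is present, but it falls outside 1 to 5
out_of_range = ratings[ratings.notna() & ~in_range]
print(out_of_range)
```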

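<b> Drop the entries having null values. Because `dropna` is called without a `subset`, rows with a null in any column are removed, not only those missing a rating. </b>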

In [18]:
# remove all the rows having null values
GPStore.dropna(inplace=True)

In [19]:
# display the shape of the dataframe, i.e., the no. of rows and columns 
GPStore.shape

(8190, 13)
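
The shape falls to 8190 rather than 9659 - 1463 = 8196, so six further rows were dropped for nulls in columns other than 'Rating'. A minimal sketch of how to see where the nulls are, and how to restrict the drop to one column, is shown below; the toy frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with nulls in two different columns (hypothetical values)
toy = pd.DataFrame({
    "App": ["App A", "App B", "App C", "App D"],
    "Rating": [4.1, np.nan, 3.9, 4.5],
    "Current Ver": ["1.0", "2.1", np.nan, "3.0"],
})

# Null count per column shows where the missing values are
null_counts = toy.isnull().sum()
print(null_counts)

# Default behaviour: drop rows with a null in any column
drop_any = toy.dropna()

# Alternative: drop only rows whose Rating is missing
drop_rating_only = toy.dropna(subset=["Rating"])

print(drop_any.shape, drop_rating_only.shape)
```

Which variant is appropriate depends on whether the other columns are needed later in the analysis.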

### Reviews Column:
###### Check for non-numeric values; if any exist, replace them, then convert the column to a numeric type

In [20]:
# displays frequency measures for a non-numerical column
GPStore.Reviews.describe()    # The datatype for the reviews column is string 

count     8190
unique    5319
top          2
freq        82
Name: Reviews, dtype: object

In [21]:
# count the purely numeric strings; a total equal to the row count (8190) means there are no non-numeric values
GPStore.Reviews.str.isnumeric().sum()   

8190
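
When the count does not match the number of rows, the offending entries still have to be found. One possible approach, sketched here on a toy series, is to coerce unparseable strings to NaN and then select them; the value '3.0M' mirrors the entry seen in the malformed row earlier.

```python
import pandas as pd

# Toy review counts with one non-numeric entry
reviews = pd.Series(["159", "967", "3.0M", "87510"], name="Reviews")

# errors="coerce" turns anything unparseable into NaN instead of raising
as_number = pd.to_numeric(reviews, errors="coerce")

# Rows that failed to parse are the non-numeric entries
non_numeric = reviews[as_number.isna()]
print(non_numeric)
```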

In [22]:
# convert the 'Reviews' column to numeric
GPStore.Reviews=pd.to_numeric(GPStore.Reviews) 

In [23]:
# statistical summary of the specified numerical variable
GPStore.Reviews.describe()


count    8.190000e+03
mean     2.554354e+05
std      1.986309e+06
min      1.000000e+00
25%      1.260000e+02
50%      3.009500e+03
75%      4.391425e+04
max      7.815831e+07
Name: Reviews, dtype: float64

### Size Column:
#####  In the 'Size' column, values such as '20M' and '10k' give the app size in MB and KB respectively. Strip the 'M' and 'k' suffixes and convert every value to kilobytes (1 MB = 1024 KB).

In [24]:
# get the count/frequency of all the unique values of the specified column
GPStore.Size.value_counts()

Varies with device    1169
14M                    148
12M                    146
13M                    143
11M                    143
                      ... 
240k                     1
499k                     1
239k                     1
186k                     1
220k                     1
Name: Size, Length: 413, dtype: int64

In [25]:
GPStore['Size']

0                       19M
1                       14M
2                      8.7M
3                       25M
4                      2.8M
                ...        
10834                  2.6M
10836                   53M
10837                  3.6M
10839    Varies with device
10840                   19M
Name: Size, Length: 8190, dtype: object

In [26]:
# replace all the 'Varies with device' with 0
GPStore.Size = GPStore.Size.apply(lambda x: x.replace('Varies with device','0') if 'Varies with device' in x else x)

# strip the 'k' suffix; these values are already in kilobytes
GPStore.Size = GPStore.Size.apply(lambda x: x.replace('k','') if 'k' in x else x)

# strip the 'M' suffix and multiply by 1024 to convert megabytes to kilobytes
GPStore.Size = GPStore.Size.apply(lambda x: float(x.replace('M',''))*1024 if 'M' in x else x)


In [27]:
# convert to float datatype
GPStore.Size = GPStore.Size.apply(lambda x: float(x))

In [28]:
GPStore.Size

0        19456.0
1        14336.0
2         8908.8
3        25600.0
4         2867.2
          ...   
10834     2662.4
10836    54272.0
10837     3686.4
10839        0.0
10840    19456.0
Name: Size, Length: 8190, dtype: float64

In [29]:
# statistical summary of the specified numerical variable
GPStore.Size.describe()

count      8190.000000
mean      19108.073834
std       22918.997685
min           0.000000
25%        2867.200000
50%        9625.600000
75%       27648.000000
max      102400.000000
Name: Size, dtype: float64

In [30]:
GPStore=GPStore.rename(columns={'Size':'Size_in_KB'})   # rename the Size column to Size_in_KB 
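
The three `apply` passes above can also be written as a single helper. The sketch below is an alternative, not the notebook's method: the function name `size_to_kb` is hypothetical, and it maps 'Varies with device' to NaN rather than 0, on the assumption that unknown sizes should stay out of the summary statistics (with 0 as the placeholder, those rows are counted in the mean and quartiles reported by `describe()`).

```python
import numpy as np
import pandas as pd

def size_to_kb(value):
    """Convert a size string such as '19M' or '582k' to kilobytes."""
    if value == "Varies with device":
        return np.nan                      # assumption: unknown size is missing, not 0
    if value.endswith("M"):
        return float(value[:-1]) * 1024    # megabytes to kilobytes
    if value.endswith("k"):
        return float(value[:-1])           # already in kilobytes
    return float(value)

# Toy sample in the same formats as the 'Size' column
sizes = pd.Series(["19M", "582k", "Varies with device", "8.7M"])
size_in_kb = sizes.apply(size_to_kb)
print(size_in_kb)
```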

### Installs column:
###### The Installs column shows the number of installations for an app. The values contain '+' and ',' characters, so remove them and convert the column to numeric.

In [31]:
GPStore.Installs

0            10,000+
1           500,000+
2         5,000,000+
3        50,000,000+
4           100,000+
            ...     
10834           500+
10836         5,000+
10837           100+
10839         1,000+
10840    10,000,000+
Name: Installs, Length: 8190, dtype: object

In [32]:
# values are given as, for example, '1,000+'. Removes the '+' sign from the end of the string
GPStore.Installs=GPStore.Installs.apply(lambda x: x.strip('+'))

# numbers have commas in them, e.g., 100,000. Removes all the commas from the strings.
GPStore.Installs=GPStore.Installs.apply(lambda x: x.replace(',',''))

# get the count/frequency of all the unique values of the specified column
GPStore.Installs.value_counts()

1000000       1414
100000        1094
10000          986
10000000       937
1000           696
5000000        607
500000         503
50000          456
5000           424
100            303
50000000       202
500            199
100000000      188
10              69
50              56
500000000       24
1000000000      20
5                9
1                3
Name: Installs, dtype: int64

In [33]:
# convert to numeric datatype
GPStore.Installs=pd.to_numeric(GPStore.Installs)

In [34]:
GPStore['Installs'].describe()

count    8.190000e+03
mean     9.171613e+06
std      5.827170e+07
min      1.000000e+00
25%      1.000000e+04
50%      1.000000e+05
75%      1.000000e+06
max      1.000000e+09
Name: Installs, dtype: float64
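
The same cleaning can be done with pandas' vectorized string methods instead of two `apply` calls. This is a minimal sketch on a toy series; it should give the same result as the steps above for values in the '1,000+' format.

```python
import pandas as pd

# Toy sample in the same format as the 'Installs' column
installs = pd.Series(["10,000+", "500,000+", "5,000,000+", "500+"], name="Installs")

# Remove every '+' and ',' in one pass, then convert to integers
installs_clean = pd.to_numeric(installs.str.replace(r"[+,]", "", regex=True))
print(installs_clean.tolist())
```

Note that these figures are lower bounds ('10,000+' means at least 10,000 installs), so the converted column is best read as an ordered bucket rather than an exact count.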

### Type Column:
##### Each app is categorized as either "Free" or "Paid", and these are the only values present, so no cleaning is required for this column.

In [35]:
# get the count/frequency of all the unique values of the specified column
GPStore.Type.value_counts()

Free    7588
Paid     602
Name: Type, dtype: int64