# Visualisation Project Report - Google Play Store

## Installation and Imports

Running the following cell inside the notebook installs all required packages. If a package is already installed, pip reports "Requirement already satisfied" instead of installing it again.

In [None]:
!pip install numpy 
!pip install pandas
!pip install seaborn
!pip install matplotlib

In [11]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# import scipy
# import squarify

## Reading the Datasets

The two datasets need to be read in, and I will display the first 5 records of each.

In [39]:
google_store = pd.read_csv('googleplaystore.csv')
google_store.head()

| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up |


In [46]:
google_store_reviews = pd.read_csv('googleplaystore_user_reviews.csv')
google_store_reviews.head()

| | App | Translated_Review | Sentiment | Sentiment_Polarity | Sentiment_Subjectivity |
|---|---|---|---|---|---|
| 0 | 10 Best Foods for You | I like eat delicious food. That's I'm cooking ... | Positive | 1.0 | 0.533333 |
| 1 | 10 Best Foods for You | This help eating healthy exercise regular basis | Positive | 0.25 | 0.288462 |
| 2 | 10 Best Foods for You | NaN | NaN | NaN | NaN |
| 3 | 10 Best Foods for You | Works great especially going grocery store | Positive | 0.4 | 0.875 |
| 4 | 10 Best Foods for You | Best idea us | Positive | 1.0 | 0.3 |


## Data Cleaning 

Before analysing the data and drawing insights, each dataset is checked for duplicated records and for missing (NaN or null) values that need to be dealt with.

### google_store - Data Cleaning

In [40]:
google_store.duplicated().sum()

483

In [41]:
google_store_no_duplicates = google_store.drop_duplicates()
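
Note that `duplicated()` with no arguments only flags records that match an earlier record in every column. The sketch below uses made-up toy data (not taken from `googleplaystore.csv`) to show how that differs from counting repeated app names. Whether the real dataset contains such near-duplicates is an assumption that would need checking.

```python
import pandas as pd

# Toy data (made up for illustration, not taken from googleplaystore.csv)
apps = pd.DataFrame({
    "App": ["Alpha", "Alpha", "Alpha", "Beta"],
    "Reviews": [100, 100, 105, 7],
})

# Rows identical to an earlier row in every column: only the second "Alpha" row
exact_duplicates = int(apps.duplicated().sum())

# Rows repeating an earlier App name, whatever the other columns hold
name_duplicates = int(apps.duplicated(subset="App").sum())

print(exact_duplicates, name_duplicates)
```

If repeated app names do turn up, `drop_duplicates(subset="App")` would be one way to keep a single record per app.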

In [42]:
google_store_no_duplicates.isnull().sum()

App                  0
Category             0
Rating            1465
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       1
Genres               0
Last Updated         0
Current Ver          8
Android Ver          3
dtype: int64

In [43]:
google_store_no_duplicates.isna().sum()

App                  0
Category             0
Rating            1465
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       1
Genres               0
Last Updated         0
Current Ver          8
Android Ver          3
dtype: int64

In [44]:
google_store_cleaned = google_store_no_duplicates.dropna()

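The **google_store_cleaned** variable now contains the cleaned dataset, with duplicated records and records containing missing values removed. In pandas, `isna()` is an alias of `isnull()`, which is why the two checks above return identical counts. Most of the missing values (1465) are in the Rating column. It was important to drop the records with a missing rating instead of setting the rating to 0, because a 0 would be treated as a genuine rating and would skew the insights when making plots.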
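
As a minimal illustration of that skew, the sketch below compares the mean of a small set of ratings when the missing value is dropped and when it is replaced with 0. The ratings are made up for the example and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy ratings (made up); NaN marks an app with no rating
ratings = pd.Series([4.0, 5.0, np.nan, 3.0])

# Dropping the missing value averages only the real ratings: (4 + 5 + 3) / 3
mean_dropped = ratings.dropna().mean()

# Filling with 0 counts the missing value as a real rating: (4 + 5 + 0 + 3) / 4
mean_zero_filled = ratings.fillna(0).mean()

print(mean_dropped, mean_zero_filled)
```

If only missing ratings were a concern, `dropna(subset=['Rating'])` would be an alternative that keeps records missing other fields such as `Current Ver`.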

### google_store_reviews - Data Cleaning

In [48]:
google_store_reviews.duplicated().sum()

33616

In [49]:
google_store_reviews_no_duplicates = google_store_reviews.drop_duplicates()

In [50]:
google_store_reviews_no_duplicates.isnull().sum()

App                         0
Translated_Review         987
Sentiment                 982
Sentiment_Polarity        982
Sentiment_Subjectivity    982
dtype: int64

In [52]:
google_store_reviews_no_duplicates.isna().sum()

App                         0
Translated_Review         987
Sentiment                 982
Sentiment_Polarity        982
Sentiment_Subjectivity    982
dtype: int64

In [62]:
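# .copy() makes an independent DataFrame, so the assignments below cannot trigger SettingWithCopyWarning
google_store_reviews_cleaned = google_store_reviews_no_duplicates.dropna().copy()
# Round both sentiment scores to 2 decimal places for consistency
google_store_reviews_cleaned['Sentiment_Polarity'] = google_store_reviews_cleaned['Sentiment_Polarity'].round(2)
google_store_reviews_cleaned['Sentiment_Subjectivity'] = google_store_reviews_cleaned['Sentiment_Subjectivity'].round(2)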

The **google_store_reviews_cleaned** variable now contains the cleaned dataset. The reviews dataset holds the 100 'Most Relevant' reviews for each app. Some records contain only NaN values apart from the app name. One possible reason is that certain apps have fewer than 100 reviews in total, so placeholder records were entered to fill the gap. These records have to be removed. To keep the values uniform and consistent, the polarity and subjectivity floats are rounded to 2 decimal places.
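
One way to confirm how many records are empty placeholders, rather than records missing a single field, is sketched below. The toy rows are made up; only the column names match the reviews dataset, and the names `review_columns` and `is_placeholder` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Toy reviews (made up), using the same column names as the reviews dataset
reviews = pd.DataFrame({
    "App": ["Alpha", "Alpha", "Beta"],
    "Translated_Review": ["Great app", np.nan, "Too many ads"],
    "Sentiment": ["Positive", np.nan, "Negative"],
    "Sentiment_Polarity": [0.533333, np.nan, -0.288462],
    "Sentiment_Subjectivity": [0.875, np.nan, 0.3],
})

review_columns = [
    "Translated_Review",
    "Sentiment",
    "Sentiment_Polarity",
    "Sentiment_Subjectivity",
]
score_columns = ["Sentiment_Polarity", "Sentiment_Subjectivity"]

# True where every review-related field is missing (only App is filled in)
is_placeholder = reviews[review_columns].isna().all(axis=1)
placeholder_count = int(is_placeholder.sum())

# Keep rows that carry review data, then round the scores to 2 decimal places
cleaned = reviews[~is_placeholder].copy()
cleaned[score_columns] = cleaned[score_columns].round(2)

print(placeholder_count)
print(cleaned)
```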

## Installation Number vs Type of App (Free or Paid)

I will compare the number of installs for free and paid apps to see whether the data suggests that one type attracts more installs than the other.
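
The `Installs` column is stored as text such as `10,000+`, so it has to be converted to numbers before the two types can be compared. The following is a minimal sketch of one possible approach on made-up rows, not necessarily the method used for the plots in this report. The column name `Installs_Numeric` is hypothetical.

```python
import pandas as pd

# Toy rows (made up) in the same format as the Installs and Type columns
apps = pd.DataFrame({
    "Type": ["Free", "Free", "Paid", "Paid"],
    "Installs": ["10,000+", "5,000,000+", "1,000+", "100,000+"],
})

# Strip the thousands separators and the trailing "+", then convert to integers
apps["Installs_Numeric"] = (
    apps["Installs"]
    .str.replace(",", "", regex=False)
    .str.replace("+", "", regex=False)
    .astype(int)
)

# Installs are bucketed lower bounds, so treat these medians as approximate
median_installs = apps.groupby("Type")["Installs_Numeric"].median()

print(median_installs)
```

The median is used here rather than the mean because a few apps with very large install counts could otherwise dominate the comparison.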