<a href="https://colab.research.google.com/github/rohitme9798/Google-play-store-app-review-analysis/blob/main/New_Google_play_store_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

###The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market.

###Each app (row) has values for catergory, rating, size, and more. Another dataset contains customer reviews of the android apps.

###Explore and analyze the data to discover key factors responsible for app engagement and success

###Before deep-diving straight into the problem solution, we need to create a roadmap which we will be referring to throughout this exploratory data analysis.
###Google Play Store is a digital store managed and developed by Google, which provides services like installing applications for android and Chrome-based OS users.
###After installing applications some users give reviews and rating for the app they have used, this ratings and reviews defines the satisfaction of the customers, that's directly proportional to the performance of the app. So it is essential to analyze these ratings and reviews to improve the quality of service which will cater to the need of end-users.
###The objective of this project is to deliver insights to understand customer demands better and thus help developers to popularize the product. It is of 10k Play Store apps for analyzing the Android market. This dataset contains details of different applications and reviews from different users.
###Discussion of Google play store dataset will involve various steps such as
###1. loading the data into the data frame
###2. cleaning the data
###3. extracting statistics from the dataset
###4. exploratory analysis and visualizations
###5. questions that can be asked from the dataset
###6. conclusion

In [1]:
# importing the data file
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
sns.set(rc={'figure.figsize':(16,7)})


In [2]:
#Mount drive with google colab notebook
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


###Step 1. Loading the Database

In [69]:
# Create the directional path for data 
dir_path="/content/drive/MyDrive/Almabetter/Play Store Data.csv"

In [70]:
# Read data
gps_df=pd.read_csv(dir_path)

In [71]:
#  Writing a small for loop to convert uppercase to lowercase and replacing the spaces with underscore
gps_df.columns=[str(x).lower().replace(" ","_") for x in df.columns]

In [72]:
gps_df.head()

Unnamed: 0,app,category,rating,reviews,size,installs,type,price,content_rating,genres,last_updated,current_ver,android_ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [73]:
gps_df.tail()

Unnamed: 0,app,category,rating,reviews,size,installs,type,price,content_rating,genres,last_updated,current_ver,android_ver
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10838,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device
10840,iHoroscope - 2018 Daily Horoscope & Astrology,LIFESTYLE,4.5,398307,19M,"10,000,000+",Free,0,Everyone,Lifestyle,"July 25, 2018",Varies with device,Varies with device


In [74]:
#Checking the shape of play store data set.
print(f"The shape of google play data store is {gps_df.shape}, where number of rows and columns are {df.shape[0]} ,{gps_df.shape[1]} respectively")

The shape of google play data store is (10841, 13), where number of rows and columns are 10841 ,13 respectively


In [75]:
# Dropping the features that we are not using extensively
gps_df=gps_df.drop(['current_ver',"android_ver"],axis=1)

In [76]:
# To Know the null values in each row of column
gps_df.isnull().sum()

app                  0
category             0
rating            1474
reviews              0
size                 0
installs             0
type                 1
price                0
content_rating       1
genres               0
last_updated         0
dtype: int64

###As we can explicitly see that there are 1474 missing values in rating columns. We have to fill these values with proper calculation. . But the question is how to handle the rating columns because we can not assign those as median, mean, or mode as in real life ratings are given by the customers, We are now using forward linear interpolation to fill nan values inside the rating column.

In [77]:
#Handling Missing values using interpolate forward linear method
gps_df.interpolate(method='linear',limit_direction='forward',inplace=True)

In [78]:
# Recheck for Null Values
gps_df.isnull().sum()

app               0
category          0
rating            0
reviews           0
size              0
installs          0
type              1
price             0
content_rating    1
genres            0
last_updated      0
dtype: int64

###As we can see there is one missing value in the content rating feature and one missing value inside type features, now we will be filling those two values with appropriate values.

In [79]:
# Filling NAN values with suitable values
gps_df['content_rating'].fillna(value='everyone',inplace=True)
gps_df['type'].fillna(value='free',inplace=True)

In [80]:
# Final Check for NULL values
gps_df.isnull().sum().any()

False

In [81]:
# Look at the Information
gps_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   app             10841 non-null  object 
 1   category        10841 non-null  object 
 2   rating          10841 non-null  float64
 3   reviews         10841 non-null  object 
 4   size            10841 non-null  object 
 5   installs        10841 non-null  object 
 6   type            10841 non-null  object 
 7   price           10841 non-null  object 
 8   content_rating  10841 non-null  object 
 9   genres          10841 non-null  object 
 10  last_updated    10841 non-null  object 
dtypes: float64(1), object(10)
memory usage: 931.8+ KB


###Step 2 - Cleaning Start

###After looking at the information about all data features in given data set, there is demand to work around data and make it useable flawlessly, Let's get down to business guys!

In [82]:
# Converting the datatype into porper data structure ,through ratings are already float64,just confirming.
gps_df['rating']=gps_df['rating'].astype(str).astype(float)

In [83]:
# As we can see in play store info need to typecast this to the proper intiger
gps_df['reviews']=gps_df['reviews'].apply(lambda x:x.replace("3.0M","3000000"))
gps_df['reviews']=gps_df['reviews'].apply(lambda x:int(x))

In [84]:
# Converting Number of Installs into proper intiger
gps_df=gps_df[gps_df['installs']!='Free']
gps_df['installs']=gps_df['installs'].apply(lambda x:x.replace("+","") if "+" in str(x) else x)
gps_df['installs']=gps_df['installs'].apply(lambda x:x.replace(".","") if "," in str(x) else x)
gps_df['installs']=gps_df['installs'].apply (lambda x: int())

In [85]:
# price never be an object it should be float
gps_df['price']=gps_df['price'].apply(lambda x:x.replace ("$","") if "$" in str(x) else x)
gps_df['price']=gps_df['price'].apply(lambda x: float(x))

In [86]:
# Now fix the  Size Column
gps_df['size']=gps_df['size'].apply(lambda x: str(x).replace("varies with device","NaN") if "varies with device" in str(x) else x)
gps_df['size']=gps_df['size'].apply(lambda x :str(x).replace("M","") if "M" in str(x) else x)
gps_df['size']=gps_df['size'].apply(lambda x: float(str(x).replace("k",""))/1000 if "k" in str(x) else x)
gps_df['size']=gps_df['size'].apply(lambda x: float())

In [87]:
# Fixing last updated object to date time
gps_df['last_updated'].unique()
gps_df['last_updated']=pd.to_datetime(gps_df['last_updated'])

In [88]:
# Ubique catagory
len(gps_df['category'].unique())

33

###Step 3. Extracting statistics from the dataset

###Here we will be doing some statistical analysis of data using pandas built-in method called describe(). Describe only works for the numerical features

In [89]:
# Let's take some statistical taste of play store dataframe:
gps_df.describe()

Unnamed: 0,rating,reviews,size,installs,price
count,10840.0,10840.0,10840.0,10840.0,10840.0
mean,4.190567,444152.9,0.0,0.0,1.027368
std,0.517606,2927761.0,0.0,0.0,15.949703
min,1.0,0.0,0.0,0.0,0.0
25%,4.0,38.0,0.0,0.0,0.0
50%,4.3,2094.0,0.0,0.0,0.0
75%,4.5,54775.5,0.0,0.0,0.0
max,5.0,78158310.0,0.0,0.0,400.0
