<a href="https://colab.research.google.com/github/rohitme9798/Google-play-store-app-review-analysis/blob/main/Module_1_Copy_of_Play_Store_App_Review_Analysis_Capstone_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

###The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market.
###Each app (row) has values for category, rating, size, and more. Another dataset contains customer reviews of the Android apps.
###Explore and analyze the data to discover the key factors responsible for app engagement and success.

###Before deep-diving straight into the problem solution, we need to create a roadmap which we will be referring to throughout this exploratory data analysis.

###Google Play Store is a digital store developed and managed by Google that lets users of Android and Chrome OS devices install applications.
###After installing an application, some users leave a rating and a review. These ratings and reviews reflect customer satisfaction, which is closely tied to how well the app performs. Analyzing them is therefore essential for improving the quality of service and meeting the needs of end users.
###The objective of this project is to deliver insights that help developers understand customer demand and popularize their products. The dataset covers roughly 10k Play Store apps and contains details of the applications along with reviews from different users.
###The discussion of the Google Play Store dataset involves the following steps:
###1. loading the data into the data frame
###2. cleaning the data
###3. extracting statistics from the dataset
###4. exploratory analysis and visualizations
###5. questions that can be asked from the dataset
###6. conclusion

In [None]:
# Let's go importing all the store of weapons needed, just kidding!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
sns.set(rc={'figure.figsize':(16,7)})
from pylab import rcParams

In [None]:
# Mount drive with google colab notebook
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


###Step 1. As per the roadmap we have created, let's take the first step, i.e. loading the dataset into a dataframe

In [None]:
# Path to the data file on the mounted drive
dir_path="/content/drive/MyDrive/Almabetter/Play Store App Review Analysis/Play Store Data.csv"

In [None]:
# Let's read it
play_store_df=pd.read_csv(dir_path)

In [None]:
# Use a list comprehension to lowercase the column names and replace spaces with underscores
play_store_df.columns=[str(x).lower().replace(" ","_") for x in play_store_df.columns]
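###A minimal sketch of the same column-name cleanup using the vectorized string methods on the column index. The toy frame and its header names are made up for illustration and only assumed to resemble the raw CSV headers.

```python
import pandas as pd

# Toy frame with headers assumed to look like the raw CSV headers (illustrative only)
toy_df = pd.DataFrame(
    {"App": ["A"], "Content Rating": ["Everyone"], "Last Updated": ["January 7, 2018"]}
)

# Lowercase every header and replace spaces with underscores in one pass
toy_df.columns = toy_df.columns.str.lower().str.replace(" ", "_", regex=False)

print(list(toy_df.columns))
```

###Both forms give the same result; the list comprehension also copes with non-string headers because it calls str() first.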

In [None]:
# Check what the data looks like from the top!
play_store_df.head()

Unnamed: 0,app,category,rating,reviews,size,installs,type,price,content_rating,genres,last_updated,current_ver,android_ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [None]:
# Check it from bottom
play_store_df.tail()

Unnamed: 0,app,category,rating,reviews,size,installs,type,price,content_rating,genres,last_updated,current_ver,android_ver
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10838,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device
10840,iHoroscope - 2018 Daily Horoscope & Astrology,LIFESTYLE,4.5,398307,19M,"10,000,000+",Free,0,Everyone,Lifestyle,"July 25, 2018",Varies with device,Varies with device


In [None]:
# Checking the shape of the play store data set.
print(f"The shape of the google play store data set is {play_store_df.shape}, where number of rows are  {play_store_df.shape[0]}  and {play_store_df.shape[1]} columns")

The shape of the google play store data set is (10841, 13), where number of rows are  10841  and 13 columns


In [None]:
# Count the null values in each column:
play_store_df.isnull().sum()

app                  0
category             0
rating            1474
reviews              0
size                 0
installs             0
type                 1
price                0
content_rating       1
genres               0
last_updated         0
current_ver          8
android_ver          3
dtype: int64

###As we can see, there are 1474 missing values in the rating column, and they need to be filled sensibly. The question is how: in real life ratings are given by customers, so we cannot simply assign the mean, median, or mode to every missing entry. Instead we use forward linear interpolation, which fills each NaN in the rating column from the known ratings in the rows around it.

In [None]:
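# Handle missing values with forward linear interpolation.
# Only rating is numeric at this point, so interpolate that column explicitly
# (newer pandas versions complain when interpolate() is called on object columns).
play_store_df['rating']=play_store_df['rating'].interpolate(method='linear',limit_direction='forward')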
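###A minimal sketch of what forward linear interpolation does, on a made-up rating series (the values are illustrative, not taken from the dataset). Gaps between two known ratings are filled with evenly spaced values, and a trailing gap carries the last known rating forward.

```python
import numpy as np
import pandas as pd

# Made-up ratings with an interior gap and a trailing gap
toy_ratings = pd.Series([4.0, np.nan, np.nan, 4.6, np.nan])

# Same arguments as the notebook cell
filled = toy_ratings.interpolate(method="linear", limit_direction="forward")

print(filled.round(2).tolist())
```

###Note that the filled values depend on row order, so the result reflects neighboring rows in the file rather than any property of the app itself.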

In [None]:
# Rechecking for null values
play_store_df.isnull().sum()

app               0
category          0
rating            0
reviews           0
size              0
installs          0
type              1
price             0
content_rating    1
genres            0
last_updated      0
current_ver       8
android_ver       3
dtype: int64

###As we can see, there is one missing value in the content_rating feature and one in the type feature. We now fill those two with appropriate values.

In [None]:
# Fill the NaN entries with suitable values (assigning back avoids chained in-place modification)
play_store_df['content_rating']=play_store_df['content_rating'].fillna('Everyone')
play_store_df['type']=play_store_df['type'].fillna('Free')

In [None]:
# Final check: does any column still contain nulls? (True here, because current_ver and android_ver still have missing values.)
play_store_df.isnull().sum().any()

True

In [None]:
# Let's have a look at the information 
play_store_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   app             10841 non-null  object 
 1   category        10841 non-null  object 
 2   rating          10841 non-null  float64
 3   reviews         10841 non-null  object 
 4   size            10841 non-null  object 
 5   installs        10841 non-null  object 
 6   type            10841 non-null  object 
 7   price           10841 non-null  object 
 8   content_rating  10841 non-null  object 
 9   genres          10841 non-null  object 
 10  last_updated    10841 non-null  object 
 11  current_ver     10833 non-null  object 
 12  android_ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


###Step 2. Mission Cleaning Starts here!

####After looking at the information about all the features in the data set, it is clear the data needs some work before it is usable. Let's get down to business, guys!

In [None]:
# Converting rating to the proper data type. Ratings are already float64; this just confirms it.
play_store_df['rating']=play_store_df['rating'].astype(str).astype(float)

In [None]:
# The info() output shows reviews as object, so replace the non-numeric entry "3.0M" and cast the column to integer
play_store_df['reviews']=play_store_df['reviews'].apply(lambda x:x.replace("3.0M","3000000"))
play_store_df['reviews']=play_store_df['reviews'].apply(lambda x: int(x))

In [None]:
# Convert installs to integers: drop the row whose installs value is 'Free', then strip "+" and ","
play_store_df=play_store_df[play_store_df['installs']!='Free']
play_store_df['installs']=play_store_df['installs'].apply(lambda x : x.replace("+","")if "+" in str(x) else x)
play_store_df['installs']=play_store_df['installs'].apply(lambda x: x.replace(",","") if "," in str(x) else x)
play_store_df['installs']=play_store_df['installs'].apply(lambda x: int(x))

In [None]:
# Price should never be an object; strip the "$" sign and cast it to float
play_store_df['price']=play_store_df['price'].apply(lambda x: x.replace("$","") if "$" in str(x) else x)
play_store_df['price']=play_store_df['price'].apply(lambda x:float(x))
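###A minimal sketch of the same installs and price cleanup written with vectorized string methods instead of apply. The toy values are made up and only assumed to follow the formats seen in the data, such as "10,000+" and a leading "$".

```python
import pandas as pd

# Made-up values in the same textual formats as the installs and price columns
toy = pd.DataFrame(
    {"installs": ["10,000+", "500,000+", "100+"], "price": ["0", "$4.99", "$399.99"]}
)

# Remove "+" and "," in one regex pass, then cast to integer
toy["installs"] = toy["installs"].str.replace(r"[+,]", "", regex=True).astype(int)

# Remove the literal "$" (regex=False so it is not treated as an anchor), then cast to float
toy["price"] = toy["price"].str.replace("$", "", regex=False).astype(float)

print(toy.dtypes)
```

###As in the notebook, any row with a non-numeric installs value would still have to be dropped before the cast.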

In [None]:
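# Let's fix the size column: "Varies with device" becomes NaN, "M" values stay in megabytes,
# and "k" values are divided by 1000 to express kilobytes in megabytes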
play_store_df['size']=play_store_df['size'].apply(lambda x : str(x).replace("Varies with device","NaN") if "Varies with device" in str(x) else x)
play_store_df['size']=play_store_df['size'].apply(lambda x:  str(x).replace("M","") if "M" in str(x) else x)
play_store_df['size']=play_store_df['size'].apply(lambda x:  float(str(x).replace("k",""))/1000 if "k" in str(x) else x)
play_store_df['size']=play_store_df['size'].apply(lambda x:  float(x))
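###The four size steps can also be sketched as one helper. The function name parse_size_mb is hypothetical, and it assumes every size is either a number followed by "M", a number followed by "k", or the text "Varies with device".

```python
import math

def parse_size_mb(raw):
    """Return an app size in megabytes, or NaN when no fixed size is given."""
    text = str(raw).strip()
    if text.endswith("M"):
        # Already in megabytes: drop the suffix
        return float(text[:-1])
    if text.endswith("k"):
        # Kilobytes: divide by 1000, matching the notebook's conversion
        return float(text[:-1]) / 1000
    # "Varies with device" and anything else unparseable becomes NaN
    return math.nan

print(parse_size_mb("19M"), parse_size_mb("8.5k"), parse_size_mb("Varies with device"))
```

###It could be applied with play_store_df['size'].apply(parse_size_mb) on the raw column, which keeps the unit logic in one place.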

In [None]:
# Convert last_updated from object to a proper datetime
play_store_df['last_updated'].unique()
play_store_df['last_updated']=pd.to_datetime(play_store_df['last_updated'])
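###A small sketch of the same conversion with an explicit format string, assuming every date is written like "January 7, 2018" as in the rows shown earlier. Passing the format makes parsing stricter, so an unexpected date style raises an error instead of being guessed.

```python
import pandas as pd

# Dates copied in the style of the last_updated column
toy_dates = pd.Series(["January 7, 2018", "August 1, 2018"])

# %B = full month name, %d = day of month, %Y = four-digit year
parsed = pd.to_datetime(toy_dates, format="%B %d, %Y")

print(parsed)
```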

In [None]:
# Unique category 
len(play_store_df['category'].unique())

33

In [None]:
play_store_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10840 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   app             10840 non-null  object        
 1   category        10840 non-null  object        
 2   rating          10840 non-null  float64       
 3   reviews         10840 non-null  int64         
 4   size            9145 non-null   float64       
 5   installs        10840 non-null  int64         
 6   type            10840 non-null  object        
 7   price           10840 non-null  float64       
 8   content_rating  10840 non-null  object        
 9   genres          10840 non-null  object        
 10  last_updated    10840 non-null  datetime64[ns]
 11  current_ver     10832 non-null  object        
 12  android_ver     10838 non-null  object        
dtypes: datetime64[ns](1), float64(3), int64(2), object(7)
memory usage: 1.2+ MB


###Ah! Until now we have been thoroughly cleaning the data and fixing the data types as required. We haven't yet explored a single aspect of the features in the data set that might influence app performance.

###Step 3. Extracting statistics from the dataset

####Here we do some statistical analysis of the data using the pandas built-in method describe(). By default describe() summarizes only the numerical features; passing include='all' summarizes the non-numerical ones as well.

In [None]:
# Let's take some statistical taste of play store dataframe:
play_store_df.describe(include='all')

Unnamed: 0,app,category,rating,reviews,size,installs,type,price,content_rating,genres,last_updated,current_ver,android_ver
count,10840,10840,10840.0,10840.0,9145.0,10840.0,10840,10840.0,10840,10840,10840,10832,10838
unique,9659,33,,,,,2,,6,119,1377,2831,33
top,ROBLOX,FAMILY,,,,,Free,,Everyone,Tools,2018-08-03 00:00:00,Varies with device,4.1 and up
freq,9,1972,,,,,10040,,8714,842,326,1459,2451
first,,,,,,,,,,,2010-05-21 00:00:00,,
last,,,,,,,,,,,2018-08-08 00:00:00,,
mean,,,4.190567,444152.9,21.51653,15464340.0,,1.027368,,,,,
std,,,0.517606,2927761.0,22.588748,85029360.0,,15.949703,,,,,
min,,,1.0,0.0,0.0085,0.0,,0.0,,,,,
25%,,,4.0,38.0,4.9,1000.0,,0.0,,,,,


In [None]:
# Let's use describe only for numerical column
play_store_df.describe()

Unnamed: 0,rating,reviews,size,installs,price
count,10840.0,10840.0,9145.0,10840.0,10840.0
mean,4.190567,444152.9,21.51653,15464340.0,1.027368
std,0.517606,2927761.0,22.588748,85029360.0,15.949703
min,1.0,0.0,0.0085,0.0,0.0
25%,4.0,38.0,4.9,1000.0,0.0
50%,4.3,2094.0,13.0,100000.0,0.0
75%,4.5,54775.5,30.0,5000000.0,0.0
max,5.0,78158310.0,100.0,1000000000.0,400.0
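###A minimal sketch of the difference between the two describe() calls, on a made-up two-column frame. The default call keeps only the numeric column, while include="all" adds count, unique, top, and freq for the text column.

```python
import pandas as pd

# Made-up frame with one text column and one numeric column
toy = pd.DataFrame(
    {"category": ["FAMILY", "GAME", "FAMILY"], "rating": [4.1, 4.5, 3.9]}
)

numeric_summary = toy.describe()            # numeric columns only
full_summary = toy.describe(include="all")  # every column

print(list(numeric_summary.columns))
print(list(full_summary.columns))
print(full_summary.loc["top", "category"])
```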


###Sorry guys! I'm dumb! I can't find any meaningful insight just by looking at the data from the top or the bottom, and even describe() gave me only a little information. I think I need to update myself like the Google Play Store apps.
###Why are there so many unnecessary decimal places in the description output? If one value in a column needs decimal places to be shown accurately, pandas displays every value in that column with the same number of decimal places.
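###One possible way to tame the decimal places, sketched on a made-up frame: either round the summary itself or set a display format for the duration of a print. Both round() and the display.float_format option are standard pandas features; the toy values are illustrative.

```python
import pandas as pd

# Made-up numeric columns on very different scales
toy = pd.DataFrame(
    {"rating": [4.1, 4.5, 3.9], "installs": [1000, 500000, 100000000]}
)

# Option 1: round the summary values themselves
summary = toy.describe().round(2)
print(summary)

# Option 2: leave the values untouched and change only how floats are displayed
with pd.option_context("display.float_format", "{:,.2f}".format):
    print(toy.describe())
```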
###Now I'm going to use the superpower of pandas to visualize the data in a more rigorous way.

In [None]:
# installing the pandas profiling
! pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip 

In [None]:
# Importing profile report from pandas profiling.
from pandas_profiling import profile_report

In [None]:
# Collecting The Pandas Profile Report.
play_store_df.profile_report()


###Step 4: Exploratory Data Analysis

###In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods.

###Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns.

###Reference: https://en.wikipedia.org/wiki/Exploratory_data_analysis

In [None]:
# Importing my favorite library, plotly
import plotly.express as px

In [None]:
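# Median app rating per category and type (the median is the "average" used in the plot below).
# as_index=False already returns category and type as columns, so no reset_index() is needed.
average_rating=play_store_df.groupby(['category','type'],as_index=False)['rating'].median()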

In [None]:
# Round the rating column to one decimal place; keeping it numeric keeps the y-axis numeric in the plot
average_rating['rating']=average_rating['rating'].round(1)

In [None]:
# Plotting a bar plot using plotly for average rating per category
px.bar(data_frame=average_rating,x=average_rating['category'],y=average_rating['rating'],text='rating',
       title='Average Rating comparison Between Free vs Paid Applications In Each Category',color='type')

In [None]:
# Printing the overall mean rating across all apps in the play store data
print(f"The overall average rating across all apps is around {round(np.mean(play_store_df['rating']),1)} out of 5")

###As we can see from the bar plot above, almost every category has an average rating of around 4.2. The highest rating is 4.5, reached by three categories: Books_and_reference, Events, and Health_and_Fitness.

In [None]:
# Let's see how application size relates to rating
px.scatter(data_frame=play_store_df,x='rating',y='size',color='size',
           title="Scatter Plot Representing the effect of size on rating")

###The points are densest toward the bottom right, which suggests that smaller apps tend to have higher ratings.

In [None]:
# Let's have a look at reviews vs rating
px.scatter(data_frame=play_store_df[play_store_df['reviews']<100000],x='reviews',y='rating',trendline='ols',color='rating',
           title='Scatter Plot With Trendline Represents Reviews VS Rating',text='rating')

###Looking at the scatter plot and its trendline above, we can conclude that applications with fewer reviews tend to have lower ratings as well.
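###A hedged sketch of how the visual impression could be backed by a number: the Pearson correlation measures linear association, and the Spearman correlation (computed here as the Pearson correlation of the ranks) measures monotonic association. The toy values are made up and are not results from the dataset.

```python
import pandas as pd

# Made-up reviews and ratings that rise together (illustrative only)
toy = pd.DataFrame(
    {"reviews": [10, 200, 3000, 40000, 90000], "rating": [3.8, 4.0, 4.2, 4.4, 4.5]}
)

# Linear association between the raw values
pearson = toy["reviews"].corr(toy["rating"])

# Monotonic association: correlate the ranks instead of the raw values
spearman = toy["reviews"].rank().corr(toy["rating"].rank())

print(round(pearson, 2), round(spearman, 2))
```

###On the real data the same two lines would run on play_store_df['reviews'] and play_store_df['rating']; a value near zero would mean the trendline deserves a cautious reading.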

In [None]:
# Let's see how the price of an application relates to its rating
px.scatter(data_frame=play_store_df,x='price',y='rating',color='type',trendline='ols',text='reviews',
           title='Price Vs Rating')

###Inference: as the price of an application increases there are fewer downloads, and hence fewer reviews and ratings, as we can see in the scatter plot above.

In [None]:
# countplot for content rating
sns.countplot(x=play_store_df.content_rating,data=play_store_df,hue=play_store_df.category,palette='Set2_r')
plt.title("CountPlot Representing Content Rating Per Category")
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0)
plt.show()

###A count plot is like a histogram or bar graph for a categorical variable. It simply shows the number of occurrences of each item within a category.

###Content rating is another feature available in the given Google Play Store data set.

###Content rating describes the minimum maturity level of the content inside an application. It does not say that the application is designed for a specific age group. We used the count plot to understand the content rating within each category.

###Most applications on Google Play have a content rating of Everyone. However, the dating category stands out as being rated for the Mature 17+ age group.

In [None]:
# Let's see what kind of content rating applications are being downloaded most.
px.bar(data_frame=play_store_df,x='content_rating',y='installs',color='content_rating',
       title='Content rating Vs Installs')

###Applications with a content rating of Everyone are installed more than any others.
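###A minimal sketch of how the totals behind that statement could be computed directly, by summing installs per content rating before plotting. The toy rows are made up; on the real data the grouping would run on play_store_df.

```python
import pandas as pd

# Made-up rows in the shape of the content_rating and installs columns
toy = pd.DataFrame(
    {
        "content_rating": ["Everyone", "Teen", "Everyone", "Mature 17+"],
        "installs": [1000, 500, 4000, 100],
    }
)

# One row per content rating with its total installs, largest first
installs_by_rating = (
    toy.groupby("content_rating", as_index=False)["installs"]
    .sum()
    .sort_values("installs", ascending=False)
)

print(installs_by_rating)
```

###Plotting the aggregated frame gives one bar per content rating, which may be easier to read than a bar built from one segment per app.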

In [None]:
px.box(data_frame=play_store_df,x='category',y='last_updated',
       title='Box Plot Representing the Tendency of the Last Updated For Both Free And Paid Category',color='type')

In [None]:
# Let's check the number of free and paid applications for each category
unique_last_update=play_store_df.groupby(['category','type'],as_index=False)['last_updated'].count()

In [None]:
# Setting the column names (direct assignment also works on pandas versions where set_axis no longer accepts inplace)
unique_last_update.columns=['category','type','Total Number of applications']

In [None]:
# Plotly bar plot of the number of free and paid applications in each category.
px.bar(data_frame=unique_last_update,x='category',y='Total Number of applications',color='type',
       text='Total Number of applications',title='Total Number Of Free and Paid Applications In Each Category')

###The bar plot above shows the total number of free and paid applications in each category. The family category has 191 paid applications and 1781 free applications. The number of paid applications is shown in red and the number of free applications in sky blue.

###In the second step I named the column "Total Number of applications" instead of last_updated, because counting the last_updated values in each category gives the number of applications available in that category.
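###A minimal sketch of a more direct way to get the same numbers: groupby(...).size() counts rows per group, so no column has to be borrowed for counting. Counting last_updated gives the same result only because that column has no missing values. The toy rows are made up.

```python
import pandas as pd

# Made-up rows in the shape of the category, type, and last_updated columns
toy = pd.DataFrame(
    {
        "category": ["FAMILY", "FAMILY", "FAMILY", "GAME"],
        "type": ["Free", "Free", "Paid", "Free"],
        "last_updated": pd.to_datetime(
            ["2018-01-07", "2018-06-08", "2017-07-25", "2018-08-01"]
        ),
    }
)

# Rows per (category, type) pair, named directly
counts = (
    toy.groupby(["category", "type"])
    .size()
    .reset_index(name="Total Number of applications")
)

print(counts)
```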

In [None]:
# Calculating number of applications available in each category and storing it in variable categories.
categories=play_store_df['category'].value_counts().reset_index()

In [None]:
# Resetting the column names here (direct assignment also works on pandas versions where set_axis no longer accepts inplace)
categories.columns=['category','count']

In [None]:
# Plotting a bar plot representing the total count of applications in each category using plotly.
px.bar(data_frame=categories,x=categories['category'],y=categories['count'],text="count",title='Total Number of Application In Each Category')


###The bar plot above shows that the family category has the most applications on the Google Play Store, while very few apps are available in the beauty and parenting categories. I believe beauty comes from inside and parenting should come as naturally as possible.

In [None]:
# Let's calculate the most installed android version 
most_install_android_version=play_store_df.groupby(['android_ver','type'])['installs'].sum().reset_index().sort_values(by='installs',ascending=False)

In [None]:
# Converting sum of installs into log2 scale and creating a new column named log2 installs 
most_install_android_version['log2_installs']=np.log2(most_install_android_version['installs'])

In [None]:
# Round log2_installs to one decimal place; keeping it numeric keeps the y-axis numeric in the plot
most_install_android_version['log2_installs']=most_install_android_version['log2_installs'].round(1)

In [None]:
# Using a plotly bar plot, let's visualize installs per minimum Android version, split by type
px.bar(data_frame=most_install_android_version,x='android_ver',y='log2_installs',color='type',
       text='log2_installs',title='Most Installed Android Version With Respect to Type')

###From the bar plot above, the most installed applications are those whose required Android version varies with the device. The second most installed group requires Android 4.1 and up.
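###A small sketch of why the log2 scale helps and where it needs care. It compresses install totals that span many orders of magnitude, but np.log2(0) is minus infinity, so groups with zero installs are filtered out here first. The toy totals are made up, and the filtering step is an assumption rather than something the notebook does.

```python
import numpy as np
import pandas as pd

# Made-up install totals, including a zero
toy_installs = pd.Series([0, 1_000, 1_000_000, 1_000_000_000])

# Keep only positive totals so the logarithm stays finite
positive = toy_installs[toy_installs > 0]

# A million-fold range shrinks to a span of about 20 units on the log2 scale
log2_installs = np.log2(positive).round(1)

print(log2_installs.tolist())
```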

In [None]:
# Now I'm curious to know the number of applications in each genre
genres_count=play_store_df['genres'].value_counts().reset_index()

In [None]:
# Setting the column names (direct assignment also works on pandas versions where set_axis no longer accepts inplace)
genres_count.columns=['genres','count']

In [None]:
# Restrict the result to the top 25 genres
genres_count=genres_count.head(25)

In [None]:
# It's time to see the top 25 genres with most applications
px.bar(data_frame=genres_count,x='genres',y='count',text='count',color='count',
       title='TOP 25 Genres')

###From the bar plot above we can see that the Tools genre has the most applications, along with the total number of applications in each of the top 25 genres.
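###One refinement worth considering, sketched on made-up values: some genres entries combine two genres with a semicolon, such as "Art & Design;Pretend Play" in the rows shown earlier. Splitting on ";" before counting credits each part separately. This is an alternative way of counting, not what the plot above does.

```python
import pandas as pd

# Values in the style of the genres column
toy_genres = pd.Series(
    ["Art & Design", "Art & Design;Pretend Play", "Art & Design;Creativity", "Tools"]
)

# Split combined entries, give each part its own row, then count
sub_genre_counts = toy_genres.str.split(";").explode().value_counts()

print(sub_genre_counts)
```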