# **Project Name**    -  **Exploratory Data Analysis on Google Play Store Apps**

## **Project Type**    -  **EDA**
 
## **Contribution**    - **Individual**


# **Project Summary -**

**Google Play Store, formerly Android Market, is a digital distribution service developed and operated by Google. It is the official app store for Android and offers a variety of content such as apps, books, magazines, music, movies and television programs. It serves as a platform that allows users with 'Google certified' Android devices to download applications published on it, either free of cost or for a charge. With the rapid growth of Android devices and apps, it is interesting to analyze Play Store data to obtain valuable insights.**

**The tools used for this EDA are numpy, pandas, matplotlib and seaborn, which I have learnt from the course.**

# **GitHub Link -**

https://github.com/surajsv7607/Capstone-Project-1/blob/main/Capstone_Project_1.ipynb

# **Index Of Contents**




*   Introduction
*   Data Preparation and Cleaning
*   Exploratory Analysis and Visualization
*   Let’s have a look at the distribution of the ratings of the data frame.
*   Let’s plot a visualization graph to view what portion of the apps in the play store are paid and free.
*   Which category App’s have the most number of installs?
*   What are the Top 10 installed apps in any category?
*   Which are the top 10 expensive Apps in the play store?
*   Which are the Apps with the highest number of reviews?
*   What are the count of Apps in different genres?
*   Which are the apps that have made the highest-earning?
*   Inferences and Conclusion





# **Introduction**

There are more than 3.04 million apps on the Google Play Store. In this project, I will take you through an analysis of the apps found on the Play Store with the help of different Python libraries.

# ***Let's Begin !***

# **Import Libraries**

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# **Dataset Loading**

In [None]:
# Load Dataset
googlestore_df = pd.read_csv('/content/Play Store Data.csv')

*After loading the dataset we can start exploring it, but first we need to check whether it is ready for analysis. So let’s first look at its structure and how the data is organized.*

In [None]:
googlestore_df.head(10)


*To find out whether the dataset has any missing (NaN) values, we can use the isnull() function.*

In [None]:
googlestore_df.isnull().sum()

*So, we will need to prepare the dataset before performing exploratory data analysis on it.*

# **Data Preparation and Cleaning**

*Data preparation is the process of cleaning and transforming raw data prior to processing and analysis. It is an important step prior to processing and often involves reformatting data, making corrections to data, and the combining of data sets to enrich data. Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a recordset, table, or database and refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.*

*We saw that the dataset contains many null or missing values. The columns Rating, Type, Content Rating, Current Ver, and Android Ver contain 1474, 1, 1, 8, and 3 missing values respectively.*

*It would be better to define a function that returns more useful information about the different attributes of the dataset. A function is also reusable, and we are going to call it several times later on.*

In [None]:
def printinfo():
    temp = pd.DataFrame(index=googlestore_df.columns)
    temp['data_type'] = googlestore_df.dtypes
    temp['null_count'] = googlestore_df.isnull().sum()
    temp['unique_count'] = googlestore_df.nunique()
    return temp

*Let’s call the function and see what it returns:*

In [None]:
printinfo()

*We now have some useful information about the dataset: for every attribute we can see its data type, its number of missing values, and its unique count.*
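The helper above always reads the global `googlestore_df`. As a minimal sketch, the same summary can be written as a function that takes the data frame as a parameter, which makes it usable on any frame. The name `summarize_columns` and the toy data are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def summarize_columns(df):
    """Return dtype, null count and unique count for every column of df."""
    summary = pd.DataFrame(index=df.columns)
    summary['data_type'] = df.dtypes
    summary['null_count'] = df.isnull().sum()
    summary['unique_count'] = df.nunique()  # NaN is not counted as a value
    return summary

# Toy frame with made-up values, only to exercise the helper
toy_df = pd.DataFrame({
    'App': ['A', 'B', 'C'],
    'Rating': [4.1, np.nan, 4.1],
})
toy_summary = summarize_columns(toy_df)
print(toy_summary)
```

Calling `summarize_columns(googlestore_df)` should give the same table as `printinfo()`.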

*Now we can start the data cleaning process. Let’s start with the column Type:*

In [None]:
googlestore_df[googlestore_df.Type.isnull()]

*There is only one missing value in this column. After cross-checking on the Play Store, the app turns out to be free, so we can fill the missing value with Free.*

In [None]:
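# Assign the result back: fillna(inplace=True) on a selected column is unreliable in recent pandas versions
googlestore_df['Type'] = googlestore_df['Type'].fillna("Free")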

*After filling the value we can check and see if that has been correctly placed.*

In [None]:
googlestore_df.isnull().sum()

*Now we can move on to the column Content Rating:*

In [None]:
googlestore_df[googlestore_df['Content Rating'].isnull()]

*We can clearly see that row 10472 is missing its Category value, so every subsequent value has shifted one column to the left. The best option is to drop this row from our data frame.*

In [None]:
googlestore_df.dropna(subset = ['Content Rating'], inplace=True)

*Some columns will not be of much use in the analysis, so let’s drop them.*

In [None]:
googlestore_df.drop(['Current Ver','Last Updated', 'Android Ver'], axis=1, inplace=True)

*Now we can fix the Rating column, which contains a total of 1474 missing values, by replacing them with the mode of the column.*

In [None]:
modeValueRating = googlestore_df['Rating'].mode()

In [None]:
# mode() returns a Series, so take its first entry and assign the filled column back
googlestore_df['Rating'] = googlestore_df['Rating'].fillna(modeValueRating[0])
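To make the imputation step concrete, here is a small sketch on made-up ratings. It shows why `mode()` needs the `[0]` index, and compares the result with median imputation. The median is shown only as an alternative for comparison; it is not what this notebook uses.

```python
import numpy as np
import pandas as pd

# Made-up ratings with one missing value
ratings = pd.Series([4.5, 4.5, 3.0, np.nan, 2.0])

# mode() returns a Series because several values can tie; [0] picks the first one
mode_value = ratings.mode()[0]
filled_with_mode = ratings.fillna(mode_value)

# Alternative for comparison (not used in the notebook): the median of the known ratings
median_value = ratings.median()
filled_with_median = ratings.fillna(median_value)

print(mode_value, median_value)
```

Filling almost 1.5k gaps with a single value makes that value even more dominant in the rating distribution, which is worth keeping in mind when reading the rating plots later on.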

*Finally, after fixing all the missing values, let’s have a look at our data frame using the printinfo() function we defined earlier.*

In [None]:
printinfo()

*All the columns have the null_count as zero, which indicates that now the data frame doesn’t contain any missing values.*

*Now we are done with the data cleansing part and ready to start the data preparation.*

*Columns like Reviews, Size, Installs, and Price should have an int or float data type, but here they are of object type. So let’s convert them to their correct types.*

*Starting with the column Reviews, converting its type to int.*

In [None]:
googlestore_df['Reviews'] = googlestore_df.Reviews.astype(int)

*We can check whether the change has taken effect by calling our printinfo() function.*

In [None]:
printinfo()

*Now that the Reviews column has been converted to int, we can move on to the column Size.*

*This column needs to be converted from object to a numeric type, but it contains special characters such as the comma, + , M and k, and some of its values are Varies with device. We need to remove or replace all of these before converting the column to int or float.*

*Removing the + symbol:*

In [None]:
googlestore_df['Size'] = googlestore_df.Size.apply(lambda x: x.strip('+'))# Removing the + Sign

*Removing the , symbol:*

In [None]:
googlestore_df['Size'] = googlestore_df.Size.apply(lambda x: x.replace(',', ''))# For removing the `,`

*Replacing the M suffix with e+6, so that the value is multiplied by 1000000 when it is parsed as a number:*

In [None]:
googlestore_df['Size'] = googlestore_df.Size.apply(lambda x: x.replace('M', 'e+6'))# For converting the M to Mega

*Replacing the k suffix with e+3, so that the value is multiplied by 1000:*

In [None]:
googlestore_df['Size'] = googlestore_df.Size.apply(lambda x: x.replace('k', 'e+3'))# For converting the k to Kilo

*Replacing the Varies with device value with NaN:*

In [None]:
googlestore_df['Size'] = googlestore_df.Size.replace('Varies with device', np.nan)  # np.NaN was removed in NumPy 2.0

*Now, finally converting all these values to numeric type:*

In [None]:
googlestore_df['Size'] = pd.to_numeric(googlestore_df['Size']) # Converting the string to Numeric type
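The five steps above can also be collected into one small function, which makes the rules easier to read and test. This is only a sketch: `parse_size` is a hypothetical helper name and the sample strings are made up to cover each case handled above.

```python
import numpy as np
import pandas as pd

def parse_size(value):
    """Convert strings such as '19M' or '201k' to a size in bytes (float)."""
    value = value.strip().replace(',', '').rstrip('+')
    if value == 'Varies with device':
        return np.nan
    multipliers = {'M': 1_000_000, 'k': 1_000}
    suffix = value[-1]
    if suffix in multipliers:
        return float(value[:-1]) * multipliers[suffix]
    return float(value)

# Made-up sample values covering each case
sizes = pd.Series(['19M', '201k', 'Varies with device', '1,000+'])
parsed = sizes.apply(parse_size)
print(parsed)
```

On the real data this would be applied as `googlestore_df['Size'].apply(parse_size)`, assuming every value is a string in one of these formats.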

*After all of these operations, let’s take a detailed look at that column by calling our helper function **printinfo()** again.*

In [None]:
printinfo()

*Since we converted the Varies with device values to NaN, we have to deal with those rows. It is better to drop the rows where Size is NaN than to replace them with the mean or mode, because app sizes vary widely: some apps are very large and some are very small.*

In [None]:
googlestore_df.dropna(subset = ['Size'], inplace=True)

*Column: Installs*

*To convert this column from object to integer type, we first need to remove the + symbol from its values.*

In [None]:
googlestore_df['Installs'] = googlestore_df.Installs.apply(lambda x: x.strip('+'))

*and then let’s remove the , symbol from the numbers.*

In [None]:
googlestore_df['Installs'] = googlestore_df.Installs.apply(lambda x: x.replace(',', ''))

*Lastly, we can now convert it from string type to numeric type, and then have a look at our dataset.*

In [None]:
googlestore_df['Installs'] = pd.to_numeric(googlestore_df['Installs'])
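The same clean-up can be done without `apply`, using the vectorized string methods in pandas. A minimal sketch on made-up values:

```python
import pandas as pd

# Made-up install strings in the same format as the dataset
installs = pd.Series(['10,000+', '500+', '1,000,000+', '0'])

# Remove every '+' and ',' in one pass, then convert to numbers
cleaned = installs.str.replace(r'[+,]', '', regex=True)
as_numbers = pd.to_numeric(cleaned)
print(as_numbers)
```

Both approaches give the same result; the `.str` version avoids a Python-level lambda per row.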

In [None]:
printinfo()

*So, now we are only left with the Price column.*

*Column: Price*

*Converting this column from object to numeric type.*

In [None]:
googlestore_df['Price'].value_counts()

*The values contain a special symbol $ which can be removed and then converted to the numeric type.*

In [None]:
googlestore_df['Price'] = googlestore_df.Price.apply(lambda x: x.strip('$'))

In [None]:
googlestore_df['Price'] = pd.to_numeric(googlestore_df['Price'])
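As a quick check of the price conversion, the sketch below runs the same two steps on a few made-up price strings, using the vectorized `.str.lstrip` instead of `apply`.

```python
import pandas as pd

# Made-up price strings: free apps are stored as '0', paid apps with a leading '$'
prices = pd.Series(['0', '$4.99', '$399.99'])

# Drop the leading dollar sign, then convert to float
numeric_prices = pd.to_numeric(prices.str.lstrip('$'))
print(numeric_prices)
```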

*After fixing all the issues, we should have a final look at the data frame.*

In [None]:
printinfo()

*Now we are finally done with the Data Preparation and Cleaning section. The original dataset contained 10841 rows and 13 columns: App, Category, Rating, Reviews, Size, Installs, Type, Price, Content Rating, Genres, Last Updated, Current Ver, and Android Ver. After cleaning the dataset and dropping the unwanted columns and the rows containing null values or garbage data, we are left with 8434 rows and 10 columns.*

# **Exploratory Analysis and Visualization**

*In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. Data visualization is the graphic representation of data. It involves producing images that communicate relationships among the represented data to viewers of the images. This communication is achieved through the use of a systematic mapping between graphic marks and data values in the creation of the visualization. This mapping establishes how data values will be represented visually, determining how and to what extent the property of a graphic mark, such as size or color, will change to reflect changes in the value of a datum.*

**Let’s begin by importing matplotlib.pyplot and seaborn , and at the same time set our fig size, font size, etc.**

In [None]:
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

*Now it is time to unveil the real strength of data analysis, i.e., to get an insight, and learn the trend, pattern and get answers to some of the questions related to the dataset.*

**Which categories in the play store contain the highest number of apps?**  Let us find out.

In [None]:
# Count the apps per category once; value_counts() sorts in descending order
category_counts = googlestore_df['Category'].value_counts()
xsis = category_counts.tolist()        # app counts (x axis)
ysis = category_counts.index.tolist()  # category names (y axis)

*We have defined our x and y axis. Let us plot and see:-*

In [None]:
plt.figure(figsize=(18,13))
plt.xlabel("Count")
plt.ylabel("Category")

graph = sns.barplot(x = xsis, y = ysis, palette= "husl")
graph.set_title("Top categories on Google Playstore", fontsize = 25);



> There are 33 categories in total in the dataset. From the above output we can conclude that most apps in the play store fall under the Family and Game categories, and the fewest under the Beauty and Comics categories.



**Which ‘Content Rating’ category has the most apps on the play store?**

In [None]:
# Count the apps per content rating; positional indexing like y2[i] is deprecated for labelled Series
content_rating_counts = googlestore_df['Content Rating'].value_counts()
x2sis = content_rating_counts.index.tolist()  # content rating labels
y2sis = content_rating_counts.tolist()        # app counts

In [None]:
plt.figure(figsize=(12,10))
plt.bar(x2sis,y2sis,width=0.8,color=['#15244C','#FFFF48','#292734','#EF2920','#CD202D','#ECC5F2'], alpha=0.8);
plt.title('Content Rating',size = 20);
plt.ylabel('Apps(Count)');
plt.xlabel('Content Rating');



> From the above plot, we can see that the Everyone category has the highest number of apps.

# **Let’s have a look at the distribution of the ratings of the data frame.**



In [None]:
plt.figure(figsize=(15,9))
plt.xlabel("Rating")
plt.ylabel("Frequency")
graph = sns.kdeplot(googlestore_df.Rating, color="Blue", fill = True)  # fill replaces the deprecated shade argument
plt.title('Distribution of Rating',size = 20);



>From the above graph, we can conclude that most apps in the Google Play Store are rated between 3.5 and 4.8.

# **Let’s plot a visualization graph to view what portion of the apps in the play store are paid and free.**




In [None]:
plt.figure(figsize=(10,10))
labels = googlestore_df['Type'].value_counts(sort = True).index
sizes = googlestore_df['Type'].value_counts(sort = True)
colors = ["blue","lightgreen"]
explode = (0.2,0)
plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True, startangle=0)
plt.title('Percent of Free Vs Paid Apps in store',size = 20)
plt.show()



> From the above graph, we can see that approximately 92% of the apps in the Google Play Store are free and approximately 8% are paid.
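The percentages in the pie chart can also be computed directly, which is handy when exact numbers are needed. A minimal sketch on made-up data (nine free apps and one paid app):

```python
import pandas as pd

# Made-up Type column: 9 free apps and 1 paid app
types = pd.Series(['Free'] * 9 + ['Paid'])

# normalize=True returns fractions instead of counts
share_percent = types.value_counts(normalize=True).mul(100).round(1)
print(share_percent)
```

On the real data the equivalent call would be `googlestore_df['Type'].value_counts(normalize=True)`.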

# **Which category App’s have the most number of installs?**

*To answer this question we need to create a separate data frame from googlestore_df that holds the total Installs grouped by Category.*



In [None]:
highest_Installs_df = googlestore_df.groupby('Category')[['Installs']].sum().sort_values(by='Installs', ascending=False)

*Now, let us plot it out:*

In [None]:
# Total installs (x axis) and category names (y axis), already sorted in descending order
x2sis = highest_Installs_df['Installs'].tolist()
y2sis = highest_Installs_df.index.tolist()

plt.figure(figsize=(18,13))

plt.xlabel("Installs")
plt.ylabel("Category")
graph = sns.barplot(x = x2sis, y = y2sis, alpha =0.9, palette= "viridis")
graph.set_title("Installs", fontsize = 25);



> From the above visualization, it can be interpreted that the top categories with the highest installs are Game, Family, Communication, News & Magazines, & Tools.
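To see what the group-and-sum step does, here is a small sketch on a made-up frame with three categories. The column names match the dataset; the values are invented.

```python
import pandas as pd

# Made-up apps: two games, one tool, one family app
toy = pd.DataFrame({
    'Category': ['GAME', 'GAME', 'TOOLS', 'FAMILY'],
    'Installs': [100, 50, 120, 10],
})

# Sum the installs of all apps in each category, largest total first
totals = (
    toy.groupby('Category')[['Installs']]
       .sum()
       .sort_values(by='Installs', ascending=False)
)
print(totals)
```

Note that a category can lead this ranking either because it has a few hugely popular apps or simply because it contains many apps, so the totals are best read alongside the app counts per category.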

*We have done a fair amount of exploratory analysis so far and are now in a position to answer some of the most common questions that any app developer or business would love to have answered.*


# **What are the Top 10 installed apps in any category?**

*We should be able to answer this not just for a single category but for any of them, so we will define a function that takes a category name as an argument and plots the top 10 installed apps in that category.*

In [None]:
def findtop10incategory(category):
    # The parameter is named category so that it does not shadow the built-in str
    category = category.upper()
    top10 = googlestore_df[googlestore_df['Category'] == category]
    top10apps = top10.sort_values(by='Installs', ascending=False).head(10)
    plt.figure(figsize=(15,12))
    plt.title('Top 10 Installed Apps',size = 20);    
    graph = sns.barplot(x = top10apps.App, y = top10apps.Installs)
    graph.set_xticklabels(graph.get_xticklabels(), rotation= 45, horizontalalignment='right');

*With the function defined, let’s check that everything works by passing the Sports category to it.*

In [None]:
findtop10incategory('Sports')



> From the above graph, we can see that in the Sports category FIFA Soccer and Dream League Soccer 2018 have the highest installs. In the same way, by passing different category names to the function, we can get the top 10 installed apps of any category.
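When only the table is needed rather than the plot, the same selection can be sketched as a function that returns a data frame. `top_installed` is a hypothetical helper, and the toy data is made up; the upper-casing of the category name follows the function above.

```python
import pandas as pd

def top_installed(df, category, n=10):
    """Return the n most installed apps of one category as a data frame."""
    in_category = df[df['Category'] == category.upper()]
    return in_category.nlargest(n, 'Installs')[['App', 'Installs']]

# Made-up apps to exercise the helper
toy = pd.DataFrame({
    'App': ['A', 'B', 'C', 'D'],
    'Category': ['SPORTS', 'SPORTS', 'GAME', 'SPORTS'],
    'Installs': [500, 1000, 2000, 10],
})
top_sports = top_installed(toy, 'sports', n=2)
print(top_sports)
```

Returning the data frame keeps the selection logic separate from the plotting code, so the result can be inspected or plotted in different ways.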



# **Which are the top 10 expensive Apps in the play store?**



We will again need to create a separate data frame.

In [None]:
top10PaidApps = googlestore_df[googlestore_df['Type'] == 'Paid'].sort_values(by='Price', ascending=False).head(11)
# top10PaidApps

*From the above data frame, we need to drop one app, because its name would mess up the plot.*

In [None]:
top10PaidApps_df = top10PaidApps[['App', 'Installs']].drop(9934)
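Dropping the row by its hard-coded label 9934 only works for this exact file. As a hedged alternative, the sketch below filters out app names that contain non-ASCII characters, assuming that this is what makes the name unreadable in the plot. The app names and prices here are made up, and the filter would also remove names with accents or emoji.

```python
import pandas as pd

# Made-up paid apps; the second name stands in for a non-ASCII app name
paid = pd.DataFrame({
    'App': ['I am rich', '示例应用', 'I Am Rich Premium'],
    'Price': [400.0, 399.0, 399.99],
})

# str.isascii() is True only when every character of the name is ASCII
ascii_named = paid[paid['App'].map(str.isascii)]
print(ascii_named)
```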

**So Finally let’s plot and visualize the top 10 paid apps on the play store.**

In [None]:
plt.figure(figsize=(15,12));
plt.pie(top10PaidApps_df.Installs, explode=None, labels=top10PaidApps_df.App, autopct='%1.1f%%', startangle=0);
plt.title('Top Expensive Apps Distribution',size = 20);
plt.legend(top10PaidApps_df.App, 
           loc="lower right",
           title="Apps",
           fontsize = "xx-small"
          );



>From the above data, we can interpret that I am rich is the most expensive app in the Google Play Store, followed by I am Rich Premium. Note that the slices of the pie chart show each app’s share of installs among these ten most expensive apps, not its price. We also had to drop one row for this visualization because the app’s name was in Chinese and it was messing up the pie chart.



# **Which are the Apps with the highest number of reviews?**

In [None]:
Apps_with_Highest_rev = googlestore_df.sort_values(by='Reviews', ascending=False).head(20)

In [None]:
Apps_with_Highest_rev



>From the above data frame we can conclude that apps like Clash of Clans, Subway Surfers, Clash Royale, and Candy Crush Saga have the highest number of reviews on the Google Play Store.



# **What are the count of Apps in different genres?**



Let’s count the apps per genre and define our x and y axes, which will be required for plotting the graph.

In [None]:
topAppsinGenres = googlestore_df['Genres'].value_counts().head(50)

In [None]:
x3sis = topAppsinGenres.index.tolist()  # genre names
y3sis = topAppsinGenres.tolist()        # app counts

*Finally, we are in a state to plot and gain an insight into our raised question.*

In [None]:
plt.figure(figsize=(15,9))
plt.ylabel('Genres(App Count)')
plt.xlabel('Genres')
graph = sns.barplot(x=x3sis,y=y3sis,palette="deep")
graph.set_xticklabels(graph.get_xticklabels(), rotation=90, fontsize=12)
graph.set_title("Top Genres in the Playstore", fontsize = 20);



>From the above visualization, we can see that the highest number of apps is found in the Tools and Entertainment genres, followed by Education, Medical and many more.



The last question which we are going to answer is:

# **Which are the apps that have made the highest-earning?**


To answer this question we need to perform some extra operations on the data frame: we will create a separate data frame and multiply the Price column by the Installs column to estimate the earnings of each app. So, let's start the process.

In [None]:
Paid_Apps_df = googlestore_df[googlestore_df['Type'] == 'Paid']

*Now from the above data frame, we will need to separate out the columns which we will require.*

In [None]:
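# .copy() gives an independent frame, so adding a column later does not raise SettingWithCopyWarning
earning_df = Paid_Apps_df[['App', 'Installs', 'Price']].copy()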

*We can now add a new column Earnings to this data frame by multiplying the Price and Installs columns.*

In [None]:
earning_df['Earnings'] = earning_df['Installs'] * earning_df['Price'];

*Now let us take the 50 apps with the highest Earnings and then sort them by Price.*

In [None]:
earning_df_sorted_by_Earnings = earning_df.sort_values(by='Earnings', ascending=False).head(50)

In [None]:
earning_df_sorted_by_Price = earning_df_sorted_by_Earnings.sort_values(by='Price', ascending=False)
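The earnings calculation is simple arithmetic, so a tiny worked example may help. The sketch below uses made-up apps and prices. Keep in mind that the Installs values were stored as strings such as 10,000+ before cleaning, so they are presumably lower bounds, and the product is a rough estimate that assumes every install was bought at the current price.

```python
import pandas as pd

# Made-up paid apps
paid_apps = pd.DataFrame({
    'App': ['App A', 'App B', 'App C'],
    'Installs': [1000, 100000, 50],
    'Price': [2.0, 0.5, 10.0],
})

# Work on a copy so the original frame is left untouched
earnings = paid_apps.copy()
earnings['Earnings'] = earnings['Installs'] * earnings['Price']

# The two apps with the highest estimated earnings
top_earners = earnings.nlargest(2, 'Earnings')
print(top_earners)
```

Here the cheapest app earns the most because of its install count, which shows why price alone says little about earnings.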

Finally, we can plot the graph and find out which apps have the highest earnings.

In [None]:
# Plot a bar chart with earnings on the y axis and app names on the x axis
plt.figure(figsize=(15,9))
plt.bar(earning_df_sorted_by_Price.App, earning_df_sorted_by_Price.Earnings, width=1.1, label=earning_df_sorted_by_Price.Earnings)
plt.xlabel("Apps")
plt.ylabel("Earnings")
plt.tick_params(rotation=90)
plt.title("Top Earning Apps");

**The top five apps with the highest earnings found on google play store are:-**

*   I am Rich
*   I am Rich Premium
*   Hitman Sniper
*   Grand Theft Auto: San Andreas
*   Facetune - For Free






**We have finally come to the end of our analysis. If you have read this far, I hope it has been interesting or useful to you.**

# **Conclusion**

**After analyzing the dataset, we have answers to some of the serious and interesting questions that any Android user would love to know.**