<a href="https://colab.research.google.com/github/sachin21398/Play-Store-App-Review-Analysis/blob/main/Play_Store_App_Review_Analysis_Capstone_Project_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -  Play Store App Review Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual

# **Project Summary -**

In this project, I analyze Google Play Store data. This is my first data analysis project in Python.



Let's take a look at the data, which consists of two files:

**playstore data.csv:** contains all the details of the applications on Google Play. There are 13 features that describe a given app.
**user_reviews.csv:** contains 100 reviews for each app, most helpful first. The text in each review has been pre-processed and attributed with three new features: Sentiment (Positive, Negative or Neutral), Sentiment Polarity and Sentiment Subjectivity.


# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


1. What are the top categories on Play Store?
2. What is the ratio of paid apps to free apps?
3. How important is the rating of an application?
4. Which audience (content rating) should an app be aimed at?
5. Which category has the most installations?
6. How does the last update affect the rating?
7. How are ratings affected when the app is a paid one?
8. How are reviews and ratings correlated?
9. How are apps distributed by size?
10. What does the distribution of sentiment subjectivity look like?
11. Are subjectivity and polarity proportional to each other?
12. What is the percentage of each review sentiment?
13. How does sentiment polarity vary between paid and free apps?


#### **Define Your Business Objective?**

The objective is to analyze what users want, based on the reviews they leave in the feedback section and on app trends in the market, to help organizations and developers.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule. 

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns  # visualization tool
from datetime import datetime
# plotly
import plotly 
plotly.offline.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import warnings
#sns.set(font_scale=1.5)
warnings.filterwarnings("ignore")


### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
ps_file_path = '/content/drive/MyDrive/Colab Notebooks/Play Store Data.csv'
ur_data_file_path =  '/content/drive/MyDrive/Colab Notebooks/User Reviews.csv'
ps_df = pd.read_csv(ps_file_path)
ur_df = pd.read_csv(ur_data_file_path)

### Dataset First View

In [None]:
# Dataset First Look
ps_data = pd.concat([ps_df.head(),ps_df.tail()])
ps_data

In [None]:
ps_df.describe()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(ps_df.columns)
rows=ps_df.shape[0]
columns=ps_df.shape[1]
print(f"the no of rows is {rows} and no of columns is {columns}")

### Dataset Information

In [None]:
# Dataset Info
ps_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
ps_df['App'].value_counts()

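`value_counts()` lists how often each app name occurs, but it does not give a single duplicate count. A minimal sketch of one way to count them with `duplicated()`, using a made-up toy frame (`toy_df` and its values are assumptions, not rows from the dataset):

```python
import pandas as pd

# Toy stand-in for ps_df; the app names and review counts are invented.
toy_df = pd.DataFrame({
    "App": ["ROBLOX", "ROBLOX", "Notes", "Maps", "Maps", "Maps"],
    "Reviews": [10, 12, 3, 7, 7, 8],
})

# Rows whose 'App' value already appeared earlier in the frame
repeat_rows = toy_df.duplicated(subset="App").sum()

# Rows that are exact copies of an earlier row across every column
exact_copies = toy_df.duplicated().sum()

print(f"repeated app names: {repeat_rows}, fully identical rows: {exact_copies}")
```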
#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
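def playstoreinfo():
  # Per-column summary: dtype, non-null count, null count, null percentage and unique count
  temp=pd.DataFrame(index=ps_df.columns)
  temp["datatype"]=ps_df.dtypes
  temp["not null values"]=ps_df.count()
  temp["null value"]=ps_df.isnull().sum()
  # mean() of the boolean mask is a fraction, so scale it to a percentage
  temp["% of the null value"]=ps_df.isnull().mean().round(4)*100
  temp["unique count"]=ps_df.nunique()
  return temp
playstoreinfo()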

In [None]:
# Visualizing the missing values
ps_df[ps_df['Rating'].isnull()]

**Rating: This column contains 1470 NaN values.**

In [None]:
ps_df[ps_df['Type'].isnull()]

**Type: This column contains 1 NaN value.**

In [None]:
ps_df[ps_df['Content Rating'].isnull()]

**Content Rating: This column contains 1 NaN value.**

In [None]:
ps_df[ps_df['Current Ver'].isnull()]

**Current Ver: This column contains 8 NaN values.**

In [None]:
ps_df[ps_df['Android Ver'].isnull()]

**Android Ver: This column contains 3 NaN values.**

### What did you know about your dataset?

play_store dataframe has **10841 rows** and **13 columns**. The 13 columns are identified as below:

* **App** - The name of the application with a short description (optional).
* **Category** - The category the app belongs to.
* **Rating** - The average rating the app received from its users.
* **Reviews** - The total number of users who have given a review for the application.
* **Size** - The space the application occupies on the mobile phone.
* **Installs** - The total number of installs/downloads for the application.
* **Type** - Whether the app is free to use or paid.
* **Price** - The price payable to install the app. For free apps, the price is zero.
* **Content Rating** - Whether the app is suitable for all age groups or restricted to some.
* **Genres** - The various other categories to which the application can belong.
* **Last Updated** - When the application was last updated.
* **Current Ver** - The current version of the application.
* **Android Ver** - The Android version that can support the application on its platform.

## ***2. Understanding Your Variables and Variables Description***

*Rating:-*

In [None]:
# Median of the non-null ratings (median() skips NaN by default)
median_rating = ps_df['Rating'].median()
median_rating

* The `Rating` column contains 1470 NaN values, which account for approximately 13.5% of the rows in the entire sheet.
* The NaN values in this case can be imputed with an aggregate (the median) of the remaining values in the `Rating` column.

In [None]:
# Replacing the NaN values in the 'Rating' column with its median value
# (assigning the result back avoids relying on in-place changes to a column view)
ps_df['Rating'] = ps_df['Rating'].fillna(median_rating)

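To make the effect of median imputation concrete, here is a small sketch on an invented series (the values are assumptions chosen for illustration only). It shows that the filled values take the median and that the median itself does not move; the spread of the column does shrink, which is worth keeping in mind when reading later rating charts.

```python
import numpy as np
import pandas as pd

# Invented ratings with two gaps
ratings = pd.Series([4.5, np.nan, 3.0, 4.0, np.nan], name="Rating")

# median() ignores NaN, so this is the median of 4.5, 3.0 and 4.0
median_rating = ratings.median()

# Every gap receives the same value
filled = ratings.fillna(median_rating)

print(filled.tolist())
print("median before:", median_rating, "median after:", filled.median())
```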
*Type :-*

In [None]:
# Finding the different values the 'Type' column takes
ps_df["Type"].value_counts()

The `Type` column takes only two values, Free and Paid. For a paid app, the `Price` column holds the price of the app; for a free app it shows '0'. The single row with a missing `Type` has a price of '0', which means the app is free. Hence we can replace this NaN value with Free.

In [None]:
# Replacing the NaN value in 'Type' column corresponding to row index 9148 with 'Free'
ps_df.loc[9148,'Type']='Free'
ps_df[ps_df["Type"].isnull()]

*Android Ver :-*

In [None]:
# Finding the different values the 'Android Ver' column takes
ps_df["Android Ver"].value_counts()

Only 3 rows contain NaN values in the `Android Ver` column, which accounts for less than 0.03% of the total rows in the given dataset, so these rows can be dropped.

In [None]:
# dropping rows corresponding to the NaN values in the 'Android Ver' column.
ps_df=ps_df[ps_df['Android Ver'].notna()]
ps_df.shape

*Current Ver :-*

In [None]:
# Finding the different values the 'Current Ver' column takes
ps_df['Current Ver'].value_counts()

Only 8 rows contain NaN values in the `Current Ver` column, which accounts for just around 0.07% of the total rows in the given dataset. Since there is no particular value with which we can replace them, these rows can be dropped.

In [None]:
# dropping rows corresponding to the values which contain NaN in the column 'Current Ver'.
ps_df=ps_df[ps_df["Current Ver"].notna()]
# Shape of the updated dataframe
ps_df.shape

## ***3. Data Wrangling***

**Handling duplicate values and manipulating the dataset:**

**`1. Handling the duplicates in the App column`**

In [None]:
# Handling the error values in the Play store data
ps_df.head()

In [None]:
ps_df['App'].value_counts()


In [None]:
ps_df[ps_df.duplicated()]

In [None]:
# Dropping rows with a repeated 'App' value, keeping the first occurrence
ps_df = ps_df.drop_duplicates(subset='App')
ps_df.shape

In [None]:
# Checking whether the duplicates in the 'App' column are taken care of or not
ps_df[ps_df['App']=='ROBLOX']

I have successfully handled all the duplicate values in the `App` column. The resulting number of rows after dropping the duplicate rows in the `App` column comes out to be 9649.



**`2. Changing the datatype of the Last Updated column from string to datetime.`**

In [None]:
# Pandas to_datetime() function helps to convert string Date time into Python Date time object.
ps_df["Last Updated"] = pd.to_datetime(ps_df['Last Updated'])
ps_df.head()

**`3. Converting the datatype of values in the Reviews column from string to int.`**

In [None]:
# Converting the datatype of the values in the reviews column from string to int
ps_df['Reviews'] = ps_df['Reviews'].astype(int)
ps_df.head()

**`4. Changing the datatype of the Price column from string to float.`**

In [None]:
ps_df['Price'].value_counts()

To convert this column from string to float, we must first drop the $ symbol from all the values. Then we can assign float datatype to those values.

Applying the `convert_dollar` function to convert the values in the `Price` column from string datatype to float datatype.

In [None]:
def convert_dollar(val):
  '''
  This function drops the $ symbol if present and returns the value with float datatype.
  '''
  # replace() is a no-op when there is no $ in the string
  return float(val.replace('$', ''))

# The convert_dollar function applied to the Price column
ps_df['Price'] = ps_df['Price'].apply(convert_dollar)
ps_df.head()

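The same conversion can also be written without a row-by-row function. The following is a sketch on an invented series (the prices are assumptions), using the pandas string accessor and `pd.to_numeric`:

```python
import pandas as pd

# Invented price strings in the same shape as the Price column
prices = pd.Series(["0", "$4.99", "$399.99", "0"], name="Price")

# regex=False makes '$' a literal character rather than a regex anchor
price_float = pd.to_numeric(prices.str.replace("$", "", regex=False))

print(price_float.tolist())
```

Compared with `apply`, this form fails loudly on any entry that is not a number once the symbol is removed, which can help surface bad rows early.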

I have successfully converted the datatype of values in the Price column from string to float.

**`5. Converting the values in the Installs column from string datatype to integer datatype.`**

In [None]:
ps_df['Installs'].value_counts()

To convert all the values in the `Installs` column from string datatype to integer datatype, we must first drop the '+' symbol and the ',' separators from all the entries if present, and then we can change the datatype.

Applying the `convert_plus` function to convert the values in the `Installs` column from string datatype to integer datatype.

In [None]:
def convert_plus(val):
  '''
  This function drops the + symbol and the , separators if present and returns the value with int datatype.
  '''
  # replace() is a no-op when the character is absent, so no branching is needed.
  # (The earlier test `'+' and ',' in val` only checked for the comma.)
  return int(val.replace('+', '').replace(',', ''))

# The convert_plus function applied to the main dataframe

ps_df['Installs'] = ps_df['Installs'].apply(convert_plus)
ps_df.head()

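As a cross-check of the cleaning rule for install counts, here is a vectorized sketch on invented strings (the values are assumptions in the same format as the `Installs` column):

```python
import pandas as pd

# Invented install strings: with separators, without, and a plain number
installs = pd.Series(["10,000+", "500+", "1,000,000,000+", "0"], name="Installs")

installs_int = (
    installs.str.replace("+", "", regex=False)   # drop the trailing plus sign
            .str.replace(",", "", regex=False)   # drop thousands separators
            .astype("int64")                     # int64 holds values above one billion
)

print(installs_int.tolist())
```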
**`6. Converting the values in the Size column to the same unit of measure (MB).`**

In [None]:
ps_df['Size'].value_counts()

We know that 1MB = 1024KB, to convert KB to MB, we must divide all the values which are in KB by 1024.

In [None]:
# Defining a function to convert all the entries in KB to MB and then converting them to float datatype.

def convert_kb_to_mb(val):
  '''
  This function converts all the valid entries in KB to MB and returns the result in float datatype.
  Entries without an M or k suffix (such as 'Varies with device') are returned unchanged.
  '''
  try:
    if 'M' in val:
      return float(val[:-1])
    elif 'k' in val:
      return round(float(val[:-1])/1024, 4)
    else:
      return val
  except (TypeError, ValueError):
    # Non-string or unparsable entries are left as they are
    return val

# The convert_kb_to_mb function applied to the Size column

ps_df['Size'] = ps_df['Size'].apply(convert_kb_to_mb)
ps_df.head()

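A possible variant, not what this notebook does: mapping entries without a size to NaN gives a fully numeric column, so that `describe()` and correlation calculations can include it. The notebook keeps the text 'Varies with device' instead because the later size grouping reports it as its own group. In this sketch `size_to_mb` is a hypothetical helper and the sizes are invented.

```python
import pandas as pd

def size_to_mb(val):
    """Return the size in MB as a float, or NaN when the entry has no M/k suffix."""
    if isinstance(val, str) and val.endswith("M"):
        return float(val[:-1])
    if isinstance(val, str) and val.endswith("k"):
        # 1 MB = 1024 KB
        return round(float(val[:-1]) / 1024, 4)
    return float("nan")

# Invented size strings in the same format as the Size column
sizes = pd.Series(["19M", "512k", "Varies with device"], name="Size")
size_mb = sizes.map(size_to_mb)

print(size_mb.tolist())
```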
In [None]:
ps_df.describe()

### What all manipulations have you done and insights you found?

I have handled the errors and NaN values in the playstore data.csv file in the 6 steps above. Let's do the same for the user_reviews.csv file.

## ***Exploring the User_review dataframe*** (same steps as for the Play Store dataset)

In [None]:
# Checking the first 5 rows of the data (head() returns 5 rows by default)
ur_df.head()

In [None]:
ur_df.info()


user_reviews dataframe has 64295 rows and 5 columns. The 5 columns are identified as follows:

* **App:** Contains the name of the app with a short description (optional).
* **Translated_Review:** The pre-processed text of the review.
* **Sentiment:** It can be ‘Positive’, ‘Negative’, or ‘Neutral’.
* **Sentiment_Polarity:** It gives the polarity of the review. Its range is [-1,1], where 1 means ‘Positive statement’ and -1 means a ‘Negative statement’.
* **Sentiment_Subjectivity:** It gives a score for how subjective the review is.

In [None]:
# Checking shape and column in dataframe
print(ur_df.columns)
rows=ur_df.shape[0]
columns=ur_df.shape[1]
print(f"the no of rows is {rows} and no of columns is {columns}")

In [None]:
def Urinfo():
  temp1=pd.DataFrame(index=ur_df.columns)
  temp1["datatype"]=ur_df.dtypes
  temp1["not null values"]=ur_df.count()
  temp1["null value"]=ur_df.isnull().sum()
  temp1["% of the null value"]=ur_df.isnull().mean().round(4)*100
  temp1["unique count"]=ur_df.nunique()
  return temp1
Urinfo()

**Findings**

The number of null values are:
* **Translated_Review** has 26868 null values which contributes **41.79%** of the data.
* **Sentiment** has 26863 null values which contributes **41.78%** of the data.
* **Sentiment_Polarity**  has 26863 null values which contributes **41.78%** of the data.
* **Sentiment_Subjectivity** has 26863 null values which contributes **41.78%** of the data.

**Handling the error and NaN values in the User reviews**

In [None]:
# Finding the total no of NaN values in each column.
ur_df.isnull().sum()

In [None]:
# Checking the NaN values in the Translated_Review column
ur_df[ur_df['Translated_Review'].isnull()]

There are a total of 26868 rows containing NaN values in the `Translated_Review` column.

In the majority of cases, rows that do not have a review (NaN instead) also have NaN values in the `Sentiment`, `Sentiment_Polarity`, and `Sentiment_Subjectivity` columns.

Hence the rows with NaN values can be deleted altogether.

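The claim that missing reviews and missing sentiment values go together can be checked directly before dropping anything. A minimal sketch on an invented frame (`toy_reviews` and its rows are assumptions, not data from the file):

```python
import numpy as np
import pandas as pd

# Invented reviews: two complete rows and two rows with nothing filled in
toy_reviews = pd.DataFrame({
    "Translated_Review": ["Great app", np.nan, "Too many ads", np.nan],
    "Sentiment": ["Positive", np.nan, "Negative", np.nan],
    "Sentiment_Polarity": [0.8, np.nan, -0.4, np.nan],
})

no_review = toy_reviews["Translated_Review"].isnull()

# Among rows without review text, how many still carry a sentiment label?
labelled_without_text = toy_reviews.loc[no_review, "Sentiment"].notnull().sum()

# dropna() removes every row that has at least one NaN
cleaned = toy_reviews.dropna()

print(labelled_without_text, len(cleaned))
```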
In [None]:
# Deleting the rows containing NaN values
ur_df = ur_df.dropna()
# The shape of the updated df
ur_df.shape

In [None]:
# Inspecting the sentiment column
ur_df['Sentiment'].value_counts()

The User Reviews dataset has now been cleaned in the same way as the Play Store data. We can now examine this data and create user-friendly visuals.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***


We have successfully cleaned the dirty data. Now we can perform some data visualization and come up with insights on the given datasets.


## **`1). Correlation Heatmap`**

In [None]:
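# Finding correlation between the numeric columns in the play store data
# (recent pandas versions raise an error if text columns are passed to corr())
ps_df.select_dtypes(include='number').corr()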

In [None]:
# Heat map for play_store, restricted to the numeric columns
plt.figure(figsize = (20,10))
sns.heatmap(ps_df.select_dtypes(include='number').corr(), annot= True)
plt.title('Correlation Heatmap for Playstore Data', size=20)

##### 1. What is/are the insight(s) found from the chart?

**Answer :-** The `Rating` is slightly positively correlated with the `Installs` and `Reviews` columns. This indicates that as the average user rating increases, the app installs and the number of reviews also increase slightly.


##### 2. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

**Answer :-** There is a strong positive correlation between the `Reviews` and `Installs` columns. This is expected: the higher the number of installs, the larger the user base, and the higher the total number of reviews left by the users.

The `Price` is slightly negatively correlated with `Rating`, `Reviews` and `Installs`. This means that as the price of an app increases, the average rating, total number of reviews and installs fall slightly.

## **2). What is the ratio of number of Paid apps and Free apps?**


In [None]:
data = ps_df['Type'].value_counts()
# Take the labels from the counts so they always match the slices
labels = data.index

# create pie chart
plt.figure(figsize=(10,10))
colors = ["#00EE76","#7B8895"]
explode=(0.01,0.1)
plt.pie(data, labels = labels, colors = colors, autopct='%.2f%%',explode=explode,textprops={'fontsize': 15})
plt.title('Distribution of Paid and Free apps',size=15,loc='center')
plt.legend()

##### 1. Why did you pick the specific chart?

**Answer:-** A pie chart makes the ratio between the groups easy to see.

##### 2. What is/are the insight(s) found from the chart?

**Answer:-** From the above graph we can see that 92% of apps in the Google Play Store are free and 8% are paid.

## **3).  Which category of Apps from the Content Rating column are found more on playstore ?**

In [None]:
ps_df['Content Rating'].unique()

In [None]:
# Content rating of the apps
data = ps_df['Content Rating'].value_counts()
# value_counts() sorts by frequency, so take the labels from its index
# instead of hard-coding an order that may not match the slices
labels = data.index

#create pie chart
plt.figure(figsize=(10,10))
explode=(0,0.1,0.1,0.1,0.0,1.3)
colors = ['C4', 'r', 'c', 'g', 'm', 'k']
plt.pie(data, labels = labels, colors = colors, autopct='%.2f%%',explode=explode,textprops={'fontsize': 15})
plt.title('Content Rating',size=20,loc='center')
plt.legend()

##### 1. Why did you pick the specific chart?

**Answer:-** A pie chart makes the ratio between the groups easy to see.

##### 2. What is/are the insight(s) found from the chart?

**Answer:-** A majority of the apps (82%) in the Play Store can be used by everyone. The remaining apps have various age restrictions.

## **4). What are the top categories on the Play Store?**

In [None]:
ps_df.groupby("Category")["App"].count().sort_values(ascending= False)

In [None]:
counts = ps_df['Category'].value_counts()
x_list = counts.tolist()        # number of apps per category
y_list = counts.index.tolist()  # category names, in the same order

In [None]:
#Number of apps belonging to each category in the playstore
plt.figure(figsize=(20,10))
graph = sns.barplot(x = y_list, y = x_list, palette= "tab10")
# Categories are on the x-axis and app counts on the y-axis
graph.set_xlabel('App Categories', size=15)
graph.set_ylabel('Number of Apps', size=15)
graph.set_title("Top categories on Playstore", fontsize = 25)
graph.set_xticklabels(graph.get_xticklabels(), rotation= 45, horizontalalignment='right');

##### 1. Why did you pick the specific chart?

**Answer:-** A column chart is effective here because there are 33 categories to compare.

##### 2. What is/are the insight(s) found from the chart?

**Answer:-** There are 33 categories in total in the dataset. From the above output we can conclude that most of the apps in the Play Store fall under the FAMILY and GAME categories, and the fewest under the EVENTS and BEAUTY categories.


### **`5). Which category App's have most number of installs?`**

In [None]:
# total app installs in each category of the play store

a = ps_df.groupby(['Category'])['Installs'].sum().sort_values()
a.plot.barh(figsize=(15,10), color = 'c')
# barh draws the categories on the y-axis and the install totals on the x-axis
plt.xlabel('Total app Installs', fontsize = 15)
plt.ylabel('App Categories', fontsize = 15)
plt.title('Total app installs in each category', fontsize = 20)

**Findings:-** This shows which categories of apps have the most installs. The `Game`, `Communication` and `Tools` categories have the highest number of installs compared to other categories of apps.

### **6). Average rating of the apps**

In [None]:
# Average app ratings

ps_df['Rating'].value_counts().plot.bar(figsize=(20,8), color = 'm' )
plt.xlabel('Average rating',fontsize = 15 )
plt.ylabel('Number of apps', fontsize = 15)
plt.title('Average rating of apps in Playstore', fontsize = 20)
plt.legend()

We can represent the ratings in a better way if we group the ratings between certain intervals. Here, we can group the rating as follows:

* 4-5: Top rated
* 3-4: Above average
* 2-3: Average
* 1-2: Below average

**Lets create a new column `Rating group` in the main dataframe and apply these filters.**

In [None]:
def Rating_app(val):
  '''
  This function groups a rating from 1 to 5
  as Top rated, Above Average, Average or Below Average.
  '''
  # Each boundary value (4, 3, 2) belongs to the higher group, so a rating
  # of exactly 3.0 is 'Above Average' rather than falling through to the last branch
  if val >= 4:
    return 'Top rated'
  elif val >= 3:
    return 'Above Average'
  elif val >= 2:
    return 'Average'
  else:
    return 'Below Average'

# Applying the Rating_app function
ps_df['Rating_group'] = ps_df['Rating'].apply(Rating_app)

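The rating bands listed above can also be assigned with `pd.cut`. This is only a sketch on invented ratings, and it assumes that each boundary value (2, 3 and 4) belongs to the higher band, since the band list itself leaves that open:

```python
import numpy as np
import pandas as pd

# Invented ratings, including values that sit exactly on a band boundary
ratings = pd.Series([1.0, 2.0, 2.5, 3.0, 3.9, 4.0, 5.0], name="Rating")

rating_group = pd.cut(
    ratings,
    bins=[-np.inf, 2, 3, 4, np.inf],
    labels=["Below Average", "Average", "Above Average", "Top rated"],
    right=False,  # intervals are closed on the left: [2, 3), [3, 4), [4, inf)
)

print(rating_group.astype(str).tolist())
```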
In [None]:
# Average app ratings 
ps_df['Rating_group'].value_counts().plot.bar(figsize=(15,5), color = 'royalblue')
plt.xlabel('Rating Group', fontsize = 12)
plt.ylabel('Number of apps', fontsize = 12)
plt.title('Average app ratings', fontsize = 18)
plt.xticks(rotation=0)
plt.legend()

### **7). What are the Top 10 installed apps in any category?**

In [None]:
def findtop10incategory(category):
    # Category names are stored in upper case (for example 'GAME')
    category = category.upper()
    top10 = ps_df[ps_df['Category'] == category]
    top10apps = top10.sort_values(by='Installs', ascending=False).head(10)
    plt.figure(figsize=(15,6), dpi=100)
    plt.title('Top 10 Installed Apps',size = 20)  
    graph = sns.barplot(x = top10apps.App, y = top10apps.Installs, palette= "icefire")
    graph.set_xticklabels(graph.get_xticklabels(), rotation= 45, horizontalalignment='right')

In [None]:
findtop10incategory('GAME')

**Findings:**

From the above graph we can see that in the **`Game`** category, **`Subway Surfers, Candy Crush Saga and Temple Run 2`** have the highest installs. In the same way, by passing different category names to the function, we can get the top 10 installed apps in any category.

### **8). Top apps that are of free type.**

In [None]:
# Creating a df of the free apps that share the highest install count
free_df = ps_df[ps_df['Type'] == 'Free']
top_free_df = free_df[free_df['Installs'] == free_df['Installs'].max()]
top10free_apps=top_free_df.nlargest(10, 'Installs', keep='first')
top10free_apps.head(10)

In [None]:
# Categories in which the top 20 free apps belong to
top_free_df['Category'].value_counts().plot.bar(figsize=(20,6), color= ('darkcyan','blueviolet'))
plt.xlabel('Category', size=15)
plt.ylabel('Number of apps', size=15)
plt.title('Categories in which the top 20 free apps belong', size=19)
plt.xticks(rotation=45)
plt.legend()


### **9). Top apps that are of paid type.**

In [None]:
# Creating a df containing only paid apps
# (.copy() lets us add columns later without modifying a slice of ps_df)
paid_df=ps_df[ps_df['Type']=='Paid'].copy()

In [None]:
# Number of apps that can be installed at a particular price 
paid_df.groupby('Price')['App'].count().sort_values(ascending= False).plot.bar(figsize = (20,6), color = 'crimson')

* The paid apps charge the users a certain amount to download and install the app. This amount varies from one app to another.
* Here a better way to determine the top apps in the paid category is by finding the revenue it generated through app installs.
* This is given by:

 Revenue generated through installs = (Number of installs)x(Price to install the app)

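A worked example of the revenue formula on invented numbers (the app names, installs and prices below are assumptions for illustration). Since the raw install figures carry a trailing '+', they are lower bounds, so the product is best read as a rough lower-bound estimate rather than actual revenue.

```python
import pandas as pd

# Invented paid apps
toy_paid = pd.DataFrame({
    "App": ["App A", "App B", "App C"],
    "Installs": [1_000_000, 10_000, 500_000],
    "Price": [0.99, 24.99, 2.99],
})

# Revenue generated through installs = installs x price
toy_paid["Revenue"] = toy_paid["Installs"] * toy_paid["Price"]

# The app with the largest estimated revenue
top_app = toy_paid.nlargest(1, "Revenue")["App"].iloc[0]

print(toy_paid[["App", "Revenue"]])
print("top app:", top_app)
```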
In [None]:
# Creating a new column 'Revenue' in paid_df
paid_df['Revenue'] = paid_df['Installs']*paid_df['Price']
paid_df.head()

In [None]:
# Top app in the paid category
paid_df[paid_df['Revenue'] == paid_df['Revenue'].max()]

In [None]:
# Top 10 paid apps in the play store
top10paid_apps=paid_df.nlargest(10, 'Revenue', keep='first')
top10paid_apps['App']

In [None]:
# Categories in which the top 10 paid apps belong to
top10paid_apps['Category'].value_counts().plot.bar(figsize=(15,5), color= ["orange", "red", "green", "blue", "purple"])
plt.xlabel('Category',size=12)
plt.ylabel('Number of apps',size=12)
plt.title('Categories in which the top 10 paid apps belong', size=15)
plt.xticks(rotation=0)
plt.legend()

### **10). Distribution of apps based on its size**

Let's group the data in the size column into intervals of 10 MB each, as follows:

(Below 1, 1-10, 10-20, 20-30, ..., 80-90, 90 and above, 'Varies with device')


In [None]:
# Function to group the apps based on its size in MB

def size_apps(var):
  '''
  This function groups the size of an app 
  between ~0 to 100 MB into certain intervals.
  '''
  try:
    if var < 1:
      return 'Below 1'
    elif var >= 1 and var <10:
      return '1-10'
    elif var >= 10 and var <20:
      return '10-20'
    elif var >= 20 and var <30:
      return '20-30'
    elif var >= 30 and var <40:
      return '30-40'
    elif var >= 40 and var <50:
      return '40-50'
    elif var >= 50 and var <60:
      return '50-60'
    elif var >= 60 and var <70:
      return '60-70'
    elif var >= 70 and var <80:
      return '70-80'
    elif var >= 80 and var <90:
      return '80-90'
    else:
      return '90 and above'
  except TypeError:
    # Non-numeric entries such as 'Varies with device' are returned unchanged
    return var

ps_df['size_group']=ps_df['Size'].apply(lambda x : size_apps(x))
ps_df.head()    

In [None]:
# no of apps belonging to each size group
ps_df['size_group'].value_counts().plot.barh(figsize=(20,8),color='g').invert_yaxis()
plt.title("Number of apps in different size groups", size=20)
plt.ylabel('App size in MB', size=15)
plt.xlabel('No of apps', size=15)
plt.legend()

*   The sizes of the majority of the apps range in between 1 and 20 MB.
*   There are a good number of apps whose size varies with the device.

## **Data Visualization on User Reviews:**
### **`1). Percentage of Review Sentiments`**

In [None]:
ur_df.columns

In [None]:
import matplotlib
counts = list(ur_df['Sentiment'].value_counts())
labels = 'Positive Reviews', 'Negative Reviews','Neutral Reviews'
matplotlib.rcParams['font.size'] = 20
matplotlib.rcParams['figure.figsize'] = (10, 15)
plt.pie(counts, labels=labels, explode=[0.01, 0.05, 0.05], shadow=True, autopct="%.2f%%")
plt.title('Percentage of Review Sentiments', fontsize=20)
plt.axis('off')
plt.legend(bbox_to_anchor=(0.9, 0, 0.5, 1))
plt.show()

**Findings:**

1. Positive reviews are **64.30%**
2. Negative reviews are **22.80%**
3. Neutral reviews are **12.90%**

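Percentages like these can be read straight from `value_counts(normalize=True)`, without going through the pie chart. A minimal sketch on an invented series (the counts are assumptions and do not reproduce the figures above):

```python
import pandas as pd

# Invented sentiment labels: 6 positive, 3 negative, 1 neutral
sentiments = pd.Series(["Positive"] * 6 + ["Negative"] * 3 + ["Neutral"] * 1,
                       name="Sentiment")

# normalize=True returns fractions; scale to percentages and round
share = sentiments.value_counts(normalize=True).mul(100).round(2)

# The index holds the labels in the same order as the values,
# so it can be passed to a chart as-is instead of a hand-written list
print(share)
```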
### **`2). Apps with the highest number of positive reviews`**

In [None]:
# positive reviews
positive_ur_df=ur_df[ur_df['Sentiment']=='Positive']
positive_ur_df

In [None]:
positive_ur_df.groupby('App')['Sentiment'].value_counts().nlargest(10).plot.barh(figsize=(10,8),color='seagreen').invert_yaxis()
plt.title("Top 10 positive review apps")
plt.xlabel('Total number of positive reviews')
plt.legend()

### **`3). Apps with the highest number of negative reviews.`**

In [None]:
negative_ur_df=ur_df[ur_df['Sentiment']=='Negative']
negative_ur_df

In [None]:
negative_ur_df.groupby('App')['Sentiment'].value_counts().nlargest(10).plot.barh(figsize=(15,8),color='r').invert_yaxis()
plt.title("Top 10 negative review apps")
plt.xlabel('Total number of negative reviews')
plt.legend()

### **`4). Histogram of Subjectivity`**

In [None]:
ur_df.Sentiment_Subjectivity.value_counts()

In [None]:
plt.figure(figsize=(18,9))
plt.xlabel("Subjectivity")
plt.title("Distribution of Subjectivity")
plt.hist(ur_df[ur_df['Sentiment_Subjectivity'].notnull()]['Sentiment_Subjectivity'])
plt.show()

**Findings:**

It can be seen that most sentiment subjectivity values lie between 0.4 and 0.7. From this we can conclude that most users write their reviews based on their own experience with the application.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ? 
In this project of analyzing play store applications, we have worked on several parameters which would help the client to do well in launching their apps on the play store.

Clients need to focus more on the following:
1. Most of the apps are free, so focusing on free apps is more important.
2. Focusing more on content available for Everyone will increase the chances of getting the highest installs.
3. They need to focus on updating their apps regularly, so that it will attract more users.
4. They need to keep in mind that the sentiments of users keep varying as they keep using the app, so they should focus more on users' needs and features.



# **Conclusion**

In the initial phase, we focused more on the problem statements and data cleaning, in order to ensure that we give them the best results out of our analysis.

* Percentage of free apps = ~92%
* Percentage of apps with no age restrictions = ~82%
* Most competitive category: Family
* Category with the highest average app installs: Game
* Percentage of apps that are top rated = ~80%
* Family, Game and Tools are the top three categories, with 1906, 926 and 829 apps respectively.
* There are 20 free apps that have been installed over a billion times.
* Category in which the paid apps have the highest average installation fee: Finance
* The apps whose size varies with device have the highest average number of installs.
* Overall review sentiment: about 64% positive, 23% negative and 13% neutral.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***