# **Capstone_EDA_Project-Playstore-App-Review-Analysis** 



#### **Project Type**    - EDA 
##### **Contribution**    - Individual
##### **Team Member 1 -** Dedwal Dipali Swarupchand



# **Project Summary -**

The Play Store App Review Analysis project is a data analysis project that involves analyzing user reviews of Android apps on the Google Play Store. The goal of the project is to gain insights into user sentiment, identify common issues or complaints, and provide recommendations for app developers to improve their products.
The project involves collecting data on user reviews using web scraping tools or APIs, cleaning and processing the data, and conducting various analyses, such as sentiment analysis, topic modeling, and clustering. The results of the analysis can be presented in the form of visualizations, reports, or dashboards.
The insights gained from the project can be used by app developers to improve their products and address user concerns, which can lead to increased user satisfaction and retention. Additionally, the project can help app developers identify opportunities for new features or products based on user needs and preferences.
This project uses two datasets: `Play Store Data.csv` and `User Reviews.csv`.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


1. What are the top categories on the Play Store?

2. Are the majority of apps paid or free?

3. How important is an app's rating?

4. Which audience categories should an app target?

5. Which category has the highest number of installs?

6. How does the number of apps vary by genre?

7. How does the last update affect the rating?

8. How are ratings affected when an app is paid?

9. How are reviews and ratings correlated?

10. What does the sentiment subjectivity of the reviews look like?

11. Are subjectivity and polarity proportional to each other?

12. What is the percentage breakdown of review sentiments?

13. How does sentiment polarity vary between paid and free apps?

14. How does content rating affect an app?



#### **Define Your Business Objective?**

The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market. Each app (row) has values for category, rating, size, and more. Another dataset contains customer reviews of the android apps. Explore and analyse the data to discover key factors responsible for app engagement and success.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule. 

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd               # for data manipulation
import numpy as np                # for mathematical operations and linear algebra
import matplotlib.pyplot as plt   # for data visualization
import seaborn as sns             # for data visualization 
import plotly.express as px       # for data visualization
from sklearn.impute import SimpleImputer
from datetime import datetime     # for datetime
import missingno as msno
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
#mounting the drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

In [None]:
# Load both datasets from the mounted Drive folder
play_store_data_path = '/content/drive/MyDrive/projectEDA/'
play_store_df = pd.read_csv(play_store_data_path + 'Play Store Data.csv')
user_reviews_df = pd.read_csv(play_store_data_path + 'User Reviews.csv')
     

### Dataset First View

In [None]:
# First 5 rows of the play store dataset
play_store_df.head()

In [None]:
# last 5 rows of play store data dataset
play_store_df.tail()

In [None]:
# First 5 rows of the user reviews dataset
user_reviews_df.head()

In [None]:
# last 5 rows of user reviews data dataset
user_reviews_df.tail()

### Dataset Rows & Columns count

In [None]:
#counting rows and column of play store data
print("The shape of play store is ",play_store_df.shape)

In [None]:
#counting rows and column of user reviews data
print("The shape of user reviews is ",user_reviews_df.shape)

### Dataset Information

In [None]:
# Dataset Info of play store.csv dataset
play_store_df.info()

In [None]:
# Dataset Info of user reviews.csv dataset
user_reviews_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count in play store.csv 
duplicate_value1 = play_store_df.copy()

duplicated_count_playstore = duplicate_value1.duplicated().sum()
#print duplicate values in play store
print("Duplicate rows in play store dataset:",duplicated_count_playstore)


In [None]:
#Dataset Duplicate Value Count in user reviews.csv
duplicate_value2 = user_reviews_df.copy()


duplicated_count_userreviews = duplicate_value2.duplicated().sum()

#print duplicate values in user reviews
print("Duplicate rows in user review dataset:",duplicated_count_userreviews)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count in play store
playstore_missvalues = duplicate_value1.isnull().sum()
print(playstore_missvalues)

In [None]:
# Missing Values/Null Values Count in user reviews
userreviwes_missvalues = duplicate_value2.isnull().sum()
print(userreviwes_missvalues)

In [None]:
# Visualizing the missing values of play store (one bar per column)
playstore_missvalues.plot(kind = 'bar')

In [None]:
# Visualizing the missing values of user reviews
userreviwes_missvalues.plot(kind = 'bar')

### What did you know about your dataset?

By exploring the Play Store dataset, we know that:

* There are 10841 rows and 13 columns.
* The 'Rating' column has 1474 missing values.
* Columns 'Reviews', 'Size', 'Installs' and 'Price' are of type 'object'.
* Values in 'Size' are strings that give the size with 'M' for megabytes or 'k' for kilobytes, or the text 'Varies with device'.
* Values in 'Installs' are strings that contain symbols such as ',' and '+'.
* Values in 'Price' are strings that contain the symbol '$'.

By exploring the User Reviews dataset, we know that:

* There are 64295 rows and 5 columns.
* Translated_Review has 26868 null values.
* Sentiment has 26868 null values.
* Sentiment_Polarity has 26868 null values.
* Sentiment_Subjectivity has 26868 null values.
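
Because 'Size' mixes 'M' values, 'k' values and the text 'Varies with device', one possible way to turn it into a numeric column is sketched below. The helper name `parse_size_mb` is hypothetical and is not part of this notebook. The sketch assumes 1 MB = 1024 kB and treats anything it cannot parse as missing.

```python
import math

def parse_size_mb(size):
    """Convert a size string such as '19M' or '201k' to megabytes.

    Hypothetical helper for illustration. Returns NaN for
    'Varies with device' and for any value that is not a string.
    """
    if not isinstance(size, str):
        return math.nan
    text = size.strip()
    if text.endswith('M'):
        return float(text[:-1])
    if text.endswith('k'):
        # Assumption: 1 MB = 1024 kB
        return float(text[:-1]) / 1024
    return math.nan

# Example values in the formats described above
examples = ['19M', '8.7M', '512k', 'Varies with device']
print([parse_size_mb(s) for s in examples])
```

In this notebook the helper could be applied with `playstore_data_copy['Size'].map(parse_size_mb)` once the working copy has been created.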


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns of play_store
play_store_df.columns

In [None]:
# Dataset Columns of user_reviews
user_reviews_df.columns

In [None]:
# Dataset Describe of play_store
play_store_df.describe()

In [None]:
# Dataset Describe of user_reviews
user_reviews_df.describe()

### Variables Description 

**Variables used in this dataset**

**Playstore variables:**

There are 13 columns. 
They are:

**1. App** - The name of the application.

**2. Category** - The category to which the application belongs.

**3. Rating** - The rating given by users for the application.

**4. Reviews** - The total number of users who have reviewed the application.

**5. Size** - The space the application occupies on the mobile phone.

**6. Installs** - The total number of installs/downloads of the application.

**7. Type** - Whether the application is free or paid.

**8. Price** - The price of the application.

**9. Content Rating** - The target audience for the application.

**10. Genres** - The other categories to which the application can belong.

**11. Last Updated** - The date when the application was last updated.

**12. Current Ver** - The current version of the application.

**13. Android Ver** - The Android version that can support the application.


**User review variables:**

There are 5 columns. 
They are:

**App** - The name or identifier of the app.

**Translated_Review** - The text of the review.

**Sentiment** - The overall sentiment of the review (e.g. positive or negative).

**Sentiment_Polarity** - A measure of how positive or negative the review is.

**Sentiment_Subjectivity** - A measure of how subjective or objective the review is.

These columns are useful for analyzing the sentiment of app reviews and understanding how users feel about different apps.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable of play store.
playstore_uniquevalues = duplicate_value1.nunique()
print(playstore_uniquevalues)

In [None]:
# Check Unique Values for each variable of user reviews.
userreviews_uniquevalues = duplicate_value2.nunique()
print(userreviews_uniquevalues)

**Exploring all the null values , NaN values and unique counts in Playstore Data Frame**

In [None]:
# creating a function to tabulate all the info on Null , NaN and unique counts in Playstore dataset
def PlayStoreinfo():
    temp = pd.DataFrame(index=duplicate_value1.columns)
    temp['data_type'] = duplicate_value1.dtypes
    temp["count of non null values"] = duplicate_value1.count()
    temp['NaN values'] = duplicate_value1.isnull().sum()
    temp['% NaN values'] = duplicate_value1.isnull().mean() *100
    temp['unique_count'] = duplicate_value1.nunique()
    return temp 
PlayStoreinfo()

The null value counts are:

* Rating has 1474 null values, which is 13.60% of the data.
* Type has 1 null value, which is 0.01% of the data.
* Content Rating has 1 null value, which is 0.01% of the data.
* Current Ver has 8 null values, which is 0.07% of the data.
* Android Ver has 3 null values, which is 0.03% of the data.

In [None]:
# creating a function to tabulate all the info on Null , NaN and unique counts in user reviews dataset
def UserReviewsinfo():
    temp = pd.DataFrame(index=duplicate_value2.columns)
    temp['data_type'] = duplicate_value2.dtypes
    temp["count of non null values"] = duplicate_value2.count()
    temp['NaN values'] = duplicate_value2.isnull().sum()
    temp['% NaN values'] = duplicate_value2.isnull().mean() *100
    temp['unique_count'] = duplicate_value2.nunique()
    return temp 
UserReviewsinfo()

The null value counts are:

1. Translated_Review has 26868 null values, which is about 41.8% of the data.

2. Sentiment has 26868 null values, which is about 41.8% of the data.

3. Sentiment_Polarity has 26868 null values, which is about 41.8% of the data.

4. Sentiment_Subjectivity has 26868 null values, which is about 41.8% of the data.
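
The two summary functions above read their dataframe from a global variable. A single reusable version that takes the dataframe as a parameter might look like the sketch below. The name `missing_summary` and the toy dataframe are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Tabulate dtype, null count, null percentage and unique count per column."""
    return pd.DataFrame({
        'data_type': df.dtypes,
        'null_count': df.isnull().sum(),
        'null_percent': (df.isnull().mean() * 100).round(2),
        'unique_count': df.nunique(),
    })

# Toy data for illustration only; not taken from the datasets
toy = pd.DataFrame({
    'App': ['a', 'b', 'c', 'd'],
    'Rating': [4.1, np.nan, 3.9, np.nan],
})
summary = missing_summary(toy)
print(summary)
```

The same function could then be called on either dataset, for example `missing_summary(duplicate_value1)` or `missing_summary(duplicate_value2)`.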

## 3. ***Data Wrangling***

### Cleaning the data

**Removing rows with insufficient data from the table**

> 1. Column - 'Type'




In [None]:
# Finding the rows with insufficient data: 'Type' is not 'Free' even though 'Price' is '0'
duplicate_value1[(duplicate_value1['Type'] != 'Free') & (duplicate_value1['Price'] == '0')]

In [None]:
# Dropping those rows from the data frame
insufficient_rows = duplicate_value1[(duplicate_value1['Type'] != 'Free') & (duplicate_value1['Price'] == '0')]
duplicate_value1.drop(insufficient_rows.index, inplace=True)
     

> 2. Column - 'Current Ver'

In [None]:
duplicate_value1.columns

In [None]:
# Finding the null values column
duplicate_value1[duplicate_value1['Current Ver'].isnull()]

In [None]:
# Removing/dropping rows with null values
duplicate_value1 = duplicate_value1[duplicate_value1['Current Ver'].notna()]
duplicate_value1.shape

> 3. Column - 'Android Ver'

In [None]:
duplicate_value1.columns

In [None]:
# Finding rows of null values
duplicate_value1[duplicate_value1['Android Ver'].isnull()]

In [None]:
duplicate_value1 = duplicate_value1[duplicate_value1['Android Ver'].notna()]
duplicate_value1.shape

In [None]:
duplicate_value1.boxplot()

The box plot is used to spot ratings that exceed the normal rating scale, i.e. values greater than 5.
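
The same check can be made without a plot. The sketch below separates out-of-range ratings from valid ones on a toy dataframe. The values are made up for illustration, and missing ratings are kept on the assumption that they are handled in a later step.

```python
import pandas as pd

# Toy data for illustration only; 19.0 stands in for an out-of-range rating
toy = pd.DataFrame({
    'App': ['a', 'b', 'c'],
    'Rating': [4.5, 19.0, None],
})

# Rows whose rating is above the 5-point scale
out_of_range = toy[toy['Rating'] > 5]

# Keep rows that are either missing or inside the valid 1 to 5 range
valid = toy[toy['Rating'].isna() | toy['Rating'].between(1, 5)]

print(out_of_range)
print(valid)
```

In this notebook the same masks could be applied to `duplicate_value1['Rating']`.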

In [None]:
duplicate_value1['Rating'].unique()

In [None]:
# Eliminating the rows with a null Sentiment from the dataframe
duplicate_value2 = duplicate_value2[~duplicate_value2['Sentiment'].isna()]

# Remaining null counts per column in user reviews
duplicate_value2.isna().sum()

In [None]:
duplicate_value2 = duplicate_value2.dropna(subset=['Translated_Review'],how='all')
duplicate_value2.shape

In [None]:
playstore_data_copy = duplicate_value1.copy()
user_review_copy = duplicate_value2.copy()

In [None]:
duplicate_cols_check = playstore_data_copy['App'].duplicated().any() 
duplicate_cols_check

In [None]:
# Data types of the play store columns before conversion
dtype = playstore_data_copy.dtypes
print(dtype)

In [None]:
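# Define a helper that strips formatting characters from a string and returns a number
def data_clean(num):
    # 'M' and 'k' are expanded by zero padding, which is only exact for whole-number values
    replace = {'+':'', ',':'', '$':'', 'M':'000000', 'k':'000', 'NaN':'0'}
    for old, new in replace.items():
        num = num.replace(old, new)
    try:
        cleaned_num = int(num)
    except ValueError:
        # Fall back to float for decimal values
        cleaned_num = float(num)
    return cleaned_num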

In [None]:
# Cleaning the unwanted characters and converting the required column values into valid numeric types for easier analysis
playstore_data_copy['Reviews'] = pd.to_numeric(playstore_data_copy['Reviews'])
# Mark sizes that vary by device as real missing values rather than the string 'NaN'
playstore_data_copy['Size'] = playstore_data_copy['Size'].replace('Varies with device', np.nan)

# Strip ',' and '+' from Installs and convert to numbers
playstore_data_copy['Installs'] = pd.to_numeric(playstore_data_copy['Installs'].map(data_clean))


In [None]:
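# Remove the '$' symbol before converting Price to a numeric type
playstore_data_copy['Price'] = pd.to_numeric(playstore_data_copy['Price'].str.replace('$', '', regex=False))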


In [None]:
playstore_data_copy.info()

In [None]:
# Selecting the last row of data for each app for the max review analysis
playstore_last_reviews = playstore_data_copy.groupby('App').tail(1).reset_index()

app_review_max = playstore_last_reviews.loc[playstore_last_reviews.groupby(['App'])['Reviews'].idxmax()]
app_review_max

In [None]:
app_review_max.max()

In [None]:
# Counting apps per genre (written so the column names are the same across pandas versions)
top_genres = app_review_max['Genres'].value_counts().rename_axis('Genres').reset_index(name='Count')
top_genres

In [None]:
# Displaying all free apps with max reviews
app_review_max[app_review_max['Price'] == 0]

In [None]:
#Preparing dataframe free app install counts
genres_free_apps_installs = app_review_max[app_review_max['Price'] == 0].groupby(['Genres'])[['Installs']].sum().rename(columns={'Installs':'free_app_installs'})
genres_free_apps_installs

In [None]:
#Preparing dataframe paid app install counts
genres_paid_apps_installs = playstore_data_copy[playstore_data_copy['Price']!= 0].groupby(['Genres'])[['Installs']].sum().rename(columns={'Installs':'Paid_app_installs'})
genres_paid_apps_installs

In [None]:
#Preparing dataframe  Rating
genre_ratings = app_review_max.groupby(['Genres'])[['Rating']].mean()
genre_ratings

In [None]:
# Merging all the previous dataframes for further analysis
top_genres_installs = pd.merge(top_genres, genres_free_apps_installs, on='Genres')
top_genres_installs

In [None]:
top_genres_apps_installs = pd.merge(top_genres_installs, genres_paid_apps_installs, on='Genres')
top_genres_apps_installs

In [None]:
top_genres_apps_installs_ratings= pd.merge(top_genres_apps_installs, genre_ratings, on='Genres')
top_genres_apps_installs_ratings

     

In [None]:
#Getting top 50 app data based on the Genres
top_50_genres = top_genres_apps_installs_ratings.head(50)
top_50_genres

### What all manipulations have you done and insights you found?

The main challenge in this project was data cleaning. 13.60% of the ratings were NaN values. Even after merging the dataframes, many rows and columns had insufficient data, so we had to drop them as well. Because of the null and NaN values, the analysis is not fully accurate. We removed unwanted characters and converted the required column values into valid numeric types for easier analysis, using functions such as count(), reset_index() and len(). One early insight is that free apps are preferred over paid ones.
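
As an alternative to a character-by-character helper, the same cleaning can be written with pandas string methods. The sketch below runs on toy values in the formats described earlier for 'Installs' and 'Price'. The dataframe is made up for illustration and is not the notebook's data.

```python
import pandas as pd

# Toy data for illustration only, using the string formats described earlier
toy = pd.DataFrame({
    'Installs': ['10,000+', '5,000,000+', '0'],
    'Price': ['0', '$4.99', '$0.99'],
})

# Remove ',' and '+' from Installs, then convert to integers
toy['Installs'] = pd.to_numeric(toy['Installs'].str.replace(r'[,+]', '', regex=True))

# Remove the literal '$' from Price, then convert to floats
toy['Price'] = pd.to_numeric(toy['Price'].str.replace('$', '', regex=False))

print(toy.dtypes)
print(toy)
```

Vectorized string methods avoid a Python-level loop over each value, which tends to matter more as the number of rows grows.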

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Heat map for play_store apps (correlation is computed on the numeric columns only)
plt.figure(figsize = (15,10))
sns.heatmap(playstore_data_copy.select_dtypes(include='number').corr(), annot= True)
plt.title('Correlation Heatmap for Playstore Data', size=20)

##### 1. Why did you pick the specific chart?

Heatmaps are a type of graphical representation that is commonly used in data analysis and visualization. Heatmaps are useful for displaying large amounts of data in a way that is easy to interpret and understand. Here are some reasons why heatmaps are commonly used in graphical representation:

Visualizing patterns and relationships: Heatmaps are useful for visualizing patterns and relationships in data. Heatmaps use color to represent values, so it is easy to see where the highest and lowest values are and how they are distributed across the data.

Identifying outliers: Heatmaps are also useful for identifying outliers or anomalies in data. Outliers are values that are significantly different from the rest of the data. Heatmaps can quickly highlight these values because they will stand out as different colors from the rest of the data.

Comparing variables: Heatmaps can be used to compare variables and their relationships with each other. By creating a heatmap that shows the correlation between two or more variables, you can easily see how they are related to each other.

Summarizing large datasets: Heatmaps are useful for summarizing large datasets. Instead of looking at a table of numbers, a heatmap can quickly provide an overview of the data by highlighting the most important information.

Overall, heatmaps are a useful tool for data analysis and visualization because they provide a quick and easy way to see patterns, relationships, outliers, and summarize large datasets.

##### 2. What is/are the insight(s) found from the chart?

The insights are:

* The strongest positive correlation is between the Installs and Reviews columns. The greater the number of installs, the larger the user base and the higher the total number of reviews posted by users.
* Price is slightly negatively correlated with Rating, Reviews and Installs. As the price rises, the average rating, the total number of reviews and the number of installs decrease gradually.
* Rating is marginally positively correlated with the Installs and Reviews columns. This points to the possibility that as the average user rating grows, the app installs and number of reviews also increase.
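
Install and review counts span several orders of magnitude, so a correlation on the raw values can be dominated by a few very large apps. The sketch below compares the Pearson correlation on raw values with the correlation after a log10 transform. The numbers are toy values chosen for illustration and are not results from this dataset.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only; not taken from the Play Store dataset
toy = pd.DataFrame({
    'Installs': [100, 1_000, 10_000, 100_000, 1_000_000],
    'Reviews': [5, 40, 600, 7_000, 45_000],
})

# Pearson correlation on the raw values
raw_corr = toy['Installs'].corr(toy['Reviews'])

# Pearson correlation after a log10 transform, which reduces the pull of very large apps
log_corr = np.log10(toy['Installs']).corr(np.log10(toy['Reviews']))

print(f"raw: {raw_corr:.3f}, log10: {log_corr:.3f}")
```

A log transform only works for strictly positive values, so apps with zero installs or zero reviews would need to be filtered or offset first.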

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

By understanding how app metrics relate to one another, developers can identify which apps are most popular and competitive, and tailor their app development and marketing strategies accordingly. For example, they might choose to focus on free apps in areas with high demand but low competition, or find ways to differentiate their apps within highly competitive areas.
Overall, the insights gained from analyzing the given chart can have both positive and negative impacts on business growth, depending on how they are interpreted and acted upon.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Plotting Top 50 Genres VS App counts chart
# Set the figure size and font before plotting so that they apply to this chart
plt.rcParams['figure.figsize'] = (20, 10)
plt.rc('font', size=14) 
top_50_genres[['Genres','Count']].set_index('Genres').plot(kind='bar')
plt.xticks(range(len(top_50_genres['Genres'])), top_50_genres['Genres'], rotation=90)
plt.title('Top 50 Genres VS App Counts')
plt.ylabel('Number of applications')
plt.xlabel('Genres')

##### 1. Why did you pick the specific chart?

A bar graph is used here to show the app count for each Play Store genre. Bar graphs are commonly used to display and compare data across categories.
They are a versatile tool for data analysis and visualization because they summarize data and make comparisons across categories, or trends over time, easy to see.

##### 2. What is/are the insight(s) found from the chart?

The insights are:

* The top 3 genres are Tools, Entertainment and Education.
* The genre with the highest app count is 'Tools'.
* The genre with the lowest app count is 'Board;Brain Games'.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

This has no specific impact on business because it is only a comparison of app counts across genres. It does show, however, that Tools apps are widely used.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Plotting Top 50 Genres VS Free apps install count chart
# Set the figure size and font before plotting so that they apply to this chart
plt.rcParams['figure.figsize'] = (15, 14)
plt.rc('font', size=14) 
top_50_genres[['Genres','free_app_installs']].set_index('Genres').plot(kind='bar')
plt.title('Top 50 Genres VS Free app install Counts')
plt.ylabel('Number of installations (100 millions)')
plt.xlabel('Genres')

##### 1. Why did you pick the specific chart?

The bar graph makes it easy to explore which genres have the highest number of free app installs.


##### 2. What is/are the insight(s) found from the chart?



The insights are: The genre with the highest number of free app installs is Communication. Genres such as Parenting and Medical show few installs, which may indicate that most people have low trust in them.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

This has no specific impact on business because it is only a comparison between genres. Free communication applications are widely used, and Social apps also contribute a large share of the installs.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
#Plotting Top 50 Genres VS Paid apps install count chart
top_50_genres[['Genres','Paid_app_installs']].set_index('Genres').plot(kind='bar')
plt.title('Top 50 Genres VS Paid app install Counts')
plt.ylabel('Number of paid installations (in 100 millions)')
plt.xlabel('Genres')


##### 1. Why did you pick the specific chart?

This bar chart is used to explore which genres have the most paid app installs.

##### 2. What is/are the insight(s) found from the chart?

Paid installs are highest for Arcade; Action & Adventure. Games are popular with the current generation, and people are willing to pay for them. Outside of games, most people are not interested in paid apps.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The negative sign is that this generation is more interested in free apps than in paid ones. Even so, some companies invest in paid game apps because people love games, and those companies see the most profit.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
#Category wise free and paid app installs count
category_installs = playstore_data_copy.groupby(['Category','Type'])[['Installs']].sum().unstack().reset_index()
category_installs = category_installs[~category_installs['Installs']['Paid'].isna()].set_index('Category')
category_installs

In [None]:
#Plot between Paid and Free installed app counts
ind = category_installs.index
column0 = category_installs['Installs']['Paid']
column1 = category_installs['Installs']['Free']
title0 = 'Paid app install Counts(in millions)'
title1 = 'Free app install Counts(in 100 millions)'

fig, axes = plt.subplots(figsize=(20,10), ncols=2, sharey=True)
fig.tight_layout()

axes[0].barh(ind, column0, align='center', color='orange', zorder=12)
axes[0].set_title(title0, fontsize=18, pad=15, color='orange')
axes[1].barh(ind, column1, align='center', color='g', zorder=12)
axes[1].set_title(title1, fontsize=18, pad=15, color='g')
    
plt.subplots_adjust(wspace=0.01, top=0.85, bottom=0.1, left=0.18, right=0.95)

##### 1. Why did you pick the specific chart?

To compare installation counts between free and paid apps in each category.

##### 2. What is/are the insight(s) found from the chart?

The Family category has the most paid app installs, and the Game category has the most free app installs.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

According to the stats, most paid app installs are in the Family and Game categories. Apart from these, no other category is doing well at making money. This also indicates that if the service provided is good, secure and of high quality, it will have a positive impact on business, and otherwise it will not.



#### Chart - 6

In [None]:
# Chart - 6 visualization code
#App pricing across categories for Paid apps
categrory_price_mean = playstore_data_copy[playstore_data_copy['Price'] !=  0].groupby(['Category'])['Price'].mean().reset_index(name='Price')
ax = sns.stripplot(x='Price', y='Category', data=categrory_price_mean, jitter=True, linewidth=2)
ax.set_title('App pricing trend across categories(in USD)')

##### 1. Why did you pick the specific chart?

To show pricing across categories for paid apps.

##### 2. What is/are the insight(s) found from the chart?

* Most categories have an average paid app price under 25 USD.
* The Finance category has the highest average price, followed by Lifestyle.
* The lowest priced category is Libraries and Demo.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

As this is only the pricing trend across categories, it may not decide the business outcome. What it does give is an idea of how much an app can charge in its particular category and what its competitors are charging in the market.

#### Chart - 7

In [None]:
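# Chart - 7 visualization code

# Plotting the histogram of Rating (rows with a missing rating are excluded)
rating_df = playstore_data_copy[~playstore_data_copy['Rating'].isna()]['Rating']

# histplot with kde=True is used in place of the deprecated distplot
sns.histplot(rating_df, kde=True, stat='density', color='g')
plt.title('Histogram of Rating')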
     

##### 1. Why did you pick the specific chart?

To analyse the distribution of app ratings, which summarizes how much apps are liked or disliked by users all around the world.



##### 2. What is/are the insight(s) found from the chart?

* Most apps have a rating between 4 and 5.
* This suggests that most apps are liked by many users, which may depend on ease of use, functionality, performance and fewer bugs.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Definitely a positive impact on business. Because these apps now have the users' trust, developers can experiment with scaling the app and with pricing, extend their functionality globally and reduce advertisement costs. They can also divert this traffic to their next projects, which helps them become a brand.



#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Plotting Category wise mean Rating
# Set the figure size and font before plotting so that they apply to this chart
plt.rcParams['figure.figsize'] = (20, 10)
plt.rc('font', size=14) 
category_mean_rating = playstore_data_copy.groupby(['Category'])['Rating'].mean().reset_index(name='Rating')
category_mean_rating.set_index('Category').plot(kind='bar' , color = 'c')
plt.title('Category VS Mean Rating')
plt.ylabel('Ratings out of 5')
plt.xlabel('Category')

##### 1. Why did you pick the specific chart?

To find the mean rating for each category.



##### 2. What is/are the insight(s) found from the chart?

* Most categories have a mean rating between 4 and 5.
* This suggests that most categories are liked by many users, which may depend on ease of use, functionality, performance and fewer bugs.


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Definitely a positive impact on business. Because these apps now have the users' trust, developers can experiment with scaling the app and with pricing, extend their functionality globally and reduce advertisement costs. They can also divert this traffic to their next projects, which helps them become a brand.



#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Number of apps in each content rating (target age group)
content_rating = playstore_data_copy['Content Rating'].value_counts()
content_rating

In [None]:
content_rating.plot(kind='pie', fontsize =15,explode= (0.1,0.2,0.3,0.4,0.5,0.1), autopct='%1.2f%%', pctdistance=1.1 , labeldistance= 1.8)

##### 1. Why did you pick the specific chart?

To show how apps are distributed across content ratings, i.e. the age groups they target.



##### 2. What is/are the insight(s) found from the chart?

* Most apps are rated for everyone, so they can be used by all age groups.
* The lowest share is adults only.
* Most apps were created with every age group in mind.
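
The share of each content rating can also be read as plain percentages rather than from the pie chart. The sketch below uses a toy series made up for illustration. In this notebook the input would be `playstore_data_copy['Content Rating']`.

```python
import pandas as pd

# Toy values for illustration only; not taken from the Play Store dataset
toy = pd.Series(
    ['Everyone', 'Everyone', 'Teen', 'Everyone', 'Mature 17+'],
    name='Content Rating',
)

# normalize=True returns fractions, which are scaled to percentages here
share = (toy.value_counts(normalize=True) * 100).round(2)
print(share)
```

A percentage table is easier to quote in a report than slice sizes, especially for very small groups that are hard to see in a pie chart.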


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Going by this content rating, most creators are targeting every age group, so that would be the wise choice for a positive business impact. Choosing teen content as a niche category can also bring growth. Mostly, family-friendly content categories will win the race.



#### Chart - 10

In [None]:
# Chart - 10 visualization code
#Getting the top downloaded 10 apps and their Ratings vs Reviews
top_downloaded_apps = playstore_data_copy.groupby('App').tail(1).sort_values(['Installs','Rating'], ascending=False).head(50)
top_10_downloaded_apps = top_downloaded_apps.head(10).set_index('App')[['Rating','Reviews']].sort_values(['Reviews'])
     

In [None]:
# Plotting the ratings and review counts of the top 10 downloaded apps
ind = top_10_downloaded_apps.index
column0 = top_10_downloaded_apps['Rating']          # left panel: rating
column1 = top_10_downloaded_apps['Reviews'] / 1e7   # right panel: reviews in units of 10 million
title0 = 'Rating out of 5'
title1 = 'Total reviews (in 10 millions)'

fig, axes = plt.subplots(figsize=(10,5), ncols=2, sharey=True)
fig.tight_layout()

axes[0].barh(ind, column0, align='center', color='g', zorder=10)
axes[0].set_title(title0, fontsize=18, pad=15, color='g')
axes[1].barh(ind, column1, align='center', color='b', zorder=10)
axes[1].set_title(title1, fontsize=18, pad=15, color='b')
    
# Invert the x-axis of the left plot so that its bars extend leftwards from the centre
axes[0].invert_xaxis() 

# To show data from highest to lowest
# plt.gca().invert_yaxis()
axes[0].set(yticks=ind, yticklabels=ind)
axes[0].yaxis.tick_left()

plt.subplots_adjust(wspace=0, top=0.85, bottom=0.1, left=0.18, right=0.95)

##### 1. Why did you pick the specific chart?

To compare the ratings and review counts of the top 10 most downloaded apps.



##### 2. What is/are the insight(s) found from the chart?

* The two top-rated apps are both from social media/communication, with ratings above 4 (out of 5).
* All the rest are rated below 3.
* The review counts of all 10 apps are at a similar level.
* There is an exceptional difference in ratings between the top 2 and the rest of the top 10.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

This shows the quality differences between the apps. Even though downloads are high, there is a huge difference in ratings. These statistics can definitely affect the business negatively, as people start to compare apps and choose the one that is good at both functionality and service. One of the best ways to improve these ratings is to go through the reviews, find where users are facing problems, and fix them.
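Going through the reviews to find problem areas could start with pulling out the negative reviews of a single app. The sketch below uses toy data; the column name `Translated_Review` and the helper `negative_reviews` are assumptions made for illustration, not something defined earlier in this notebook.

```python
import pandas as pd

# Toy stand-in for the review data; the column names are assumptions.
toy_reviews = pd.DataFrame({
    'App': ['AppA', 'AppA', 'AppA', 'AppB', 'AppB'],
    'Translated_Review': ['Crashes on start', 'Love it', 'Too many ads',
                          'Works fine', 'Login keeps failing'],
    'Sentiment': ['Negative', 'Positive', 'Negative', 'Positive', 'Negative'],
})

def negative_reviews(reviews, app_name):
    """Return the negative review texts for one app (hypothetical helper)."""
    mask = (reviews['App'] == app_name) & (reviews['Sentiment'] == 'Negative')
    return reviews.loc[mask, 'Translated_Review'].tolist()

app_a_complaints = negative_reviews(toy_reviews, 'AppA')
print(app_a_complaints)
```

Reading these texts app by app is one simple way to spot recurring complaints before deciding what to fix.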

#### Chart - 11

In [None]:
# Chart - 11 visualization code
#Plotting the distribution of Subjectivity
sentiment_subjectivity_df = user_review_copy['Sentiment_Subjectivity']
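# sns.distplot is deprecated; histplot with kde=True draws the same histogram plus density curve
sns.histplot(sentiment_subjectivity_df, kde=True, stat='density', color='c')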
plt.xlabel("Subjectivity")
plt.title('Distribution of Subjectivity')

##### 1. Why did you pick the specific chart?

A distribution plot (a histogram with a density curve) shows how the subjectivity scores of the reviews are spread.



##### 2. What is/are the insight(s) found from the chart?

Most sentiment subjectivity values lie between 0.4 and 0.7.
This suggests that most users write reviews based on their own experience with the app.
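The 0.4 to 0.7 range can also be checked numerically rather than read off the plot. This is a minimal sketch on made-up values; `toy_subjectivity` is a hypothetical stand-in for the `Sentiment_Subjectivity` column.

```python
import pandas as pd

# Hypothetical subjectivity scores, including one missing value.
toy_subjectivity = pd.Series([0.1, 0.45, 0.5, 0.6, 0.65, 0.9, None, 0.55])

# Drop missing scores, then flag values inside the range (bounds inclusive).
valid = toy_subjectivity.dropna()
in_range = valid.between(0.4, 0.7)

# The mean of a boolean Series is the fraction of True values.
share_in_range = in_range.mean()
print(f"Share of reviews with subjectivity in [0.4, 0.7]: {share_in_range:.1%}")
```

Applying the same steps to the real column would give the share of reviews that fall in this band.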

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Sentiment subjectivity refers to the degree to which a piece of writing expresses personal opinions, feelings, or biases, and everyone has their own point of view. Highly subjective reviews are therefore a less dependable basis for business decisions, which can work against business growth.



#### Chart - 12

**Analysing the Relationship between the Sentiment Subjectivity and Sentiment Polarity**

I will use a scatter plot to show the relationship between subjectivity and polarity. Different colors in the scatter plot clearly separate the negative, neutral and positive sentiments.

The scatter plot below suggests that sentiment subjectivity is not always proportional to sentiment polarity; in most cases the variation in polarity is either very high or very low.

In [None]:
# Chart - 12 visualization code
#plotting the relationship between sentiment_subjectivity and sentiment_polarity in scatter plot
sns.scatterplot(data=user_review_copy, x='Sentiment_Subjectivity', y='Sentiment_Polarity', hue='Sentiment', edgecolor='white', palette='coolwarm')
plt.title('Relationship between Sentiment_Subjectivity and Sentiment_Polarity')
plt.show()

From the plot above, we see that positive sentiments are in the majority, coming from reviews where users were satisfied. For greater app success, developers should focus on user needs, such as which features and details are most liked and appreciated, and enhance them with every update.



##### 1. Why did you pick the specific chart?

A scatter plot shows the relationship between sentiment subjectivity and sentiment polarity, with colors separating negative, neutral and positive reviews.



##### 2. What is/are the insight(s) found from the chart?

The scatter plot suggests that sentiment subjectivity is not always proportional to sentiment polarity.
In most cases the variation in polarity is either very high or very low.
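One way to put a number on "not always proportional" is the Pearson correlation coefficient between the two columns. The sketch below runs on made-up values; the DataFrame `toy` is hypothetical and only reuses the column names from this notebook.

```python
import pandas as pd

# Hypothetical scores; subjectivity is in [0, 1], polarity in [-1, 1].
toy = pd.DataFrame({
    'Sentiment_Subjectivity': [0.0, 0.3, 0.5, 0.6, 0.8, 1.0],
    'Sentiment_Polarity':     [0.0, 0.2, -0.4, 0.5, -0.1, 0.3],
})

# Series.corr computes the Pearson correlation by default.
corr = toy['Sentiment_Subjectivity'].corr(toy['Sentiment_Polarity'])
print(f"Pearson correlation: {corr:.2f}")
```

A value close to 0 indicates a weak linear relationship, while values near 1 or -1 indicate a strong one. Pearson correlation only measures linear association, so the scatter plot remains useful for spotting other patterns.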

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Based on the plot above, both yes and no. When polarity is positive and subjectivity is high, the product is good for that individual user. When subjectivity is low and polarity is negative, the product is poor by more objective standards.



#### Chart - 13

In [None]:
# Chart - 13 visualization code

#Counts of Review sentiments
user_review_copy['Sentiment'].value_counts()

In [None]:
#Percentage of Review sentiments
user_review_copy['Sentiment'].value_counts().plot(kind='pie', explode= (0.1,0.1,0.1), shadow=True, autopct='%1.2f%%', pctdistance=1.1, labeldistance=1.5)


##### 1. Why did you pick the specific chart?

A pie chart shows the percentage share of each review sentiment.



##### 2. What is/are the insight(s) found from the chart?

Most of the sentiment is positive.
Users review apps based on their own personal experience rather than an overall assessment.
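The percentages shown in the pie chart can also be obtained as a table. This is a minimal sketch using `value_counts(normalize=True)` on made-up labels; `toy_sentiment` is a hypothetical stand-in for the `Sentiment` column.

```python
import pandas as pd

# Hypothetical sentiment labels: 6 positive, 3 negative, 1 neutral.
toy_sentiment = pd.Series(['Positive'] * 6 + ['Negative'] * 3 + ['Neutral'] * 1)

# normalize=True returns fractions instead of counts; scale to percent.
sentiment_pct = toy_sentiment.value_counts(normalize=True).mul(100).round(2)
print(sentiment_pct)
```

Having the exact figures alongside the chart makes it easier to quote them consistently in a summary.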

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

This can definitely hinder business growth, as different people think and behave in different ways.



## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ? 
Explain Briefly.

In this project I analysed the Play Store based on the data provided. The analysis of the given parameters can help the client enhance app properties at launch and, most importantly, provides common ground for working on apps of different categories that need similar updates. Initially I focused on cleaning and preparing the data so that the analysis would give clear insights into Play Store apps. Based on the analysis and plots, my recommendations are:

• Focus on up-to-date and useful free apps, because they are the most commonly used.

• Update existing apps regularly to attract new users.

• There are some unexplored categories such as events, beauty and art. Developing apps in these categories would help users who are interested in these areas.

• Provide good content in apps, as this is a factor in growing the app market.

• Users are likely to have varying sentiments at each phase of using an app, so the focus should be on keeping the features that favour the majority of users.

# **Conclusion**

Through exploratory data analysis we have observed some trends and have made some assumptions that might lead to app success among the users in the play store.

• Percentage of free apps is ~92%

• Percentage of apps with no age restrictions is ~82%

• Most competitive category in paid and free: Family and Games

• Family, Game and Tools are the top three categories, with 1906, 926 and 829 apps respectively.

• 8783 apps are smaller than 50 MB. 7749 apps have a rating above 4.0, counting both free and paid apps.

• Category with the highest average app installs: Game

• Percentage of apps that are top rated is ~80%

• There are 20 free apps that have been installed over a billion times.

• Category in which the paid apps have the highest average installation fee: Finance

• The median size of all apps in the play store is 12 MB.

• Apps whose size varies with device have the highest average number of installs.

• Apps larger than 90 MB have the highest average number of user reviews, i.e., they are more popular than the rest.

• In the merged dataset, the overall sentiment split is 64% Positive, 22% Negative and 13% Neutral.

• Sentiment Polarity is not highly correlated with Sentiment Subjectivity

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***