# **Project Name**    - Play Store App Review Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member 1 - Siddesh Sakhalkar**

# **Project Summary -**



The Google Play Store, previously known as the Android Market, is an online platform operated and developed by Google. It serves as the official app store for certified devices running on the Android operating system and its derivatives, as well as ChromeOS. Users can access and download applications developed using the Android software development kit (SDK) and published through Google. Additionally, Google Play also functions as a digital media store, offering a wide range of content such as games, music, books, movies, and television programs.

In our current project, we have been provided with two datasets: Play Store data and User Reviews. Let's examine these datasets in detail:

1. playstore data.csv: This file contains comprehensive information about various applications available on Google Play. It includes 13 different features that describe each app.

2. user reviews.csv: This dataset consists of 100 reviews for each application, sorted in order of helpfulness. The reviews have undergone preprocessing and now include three new features: Sentiment (categorized as Positive, Negative, or Neutral), Sentiment Polarity, and Sentiment Subjectivity.

Before delving into the provided data, it is essential to understand the concept of Exploratory Data Analysis (EDA).

EDA involves generating summary statistics for numerical data and creating graphical representations in order to understand the structure of the dataset, spot anomalies, and surface patterns worth investigating.

The following are the key steps involved in the EDA process:

1. Problem Statement: We will analyze and understand the given dataset, studying the attributes it contains, and pondering their significance and relevance to the problem at hand.

2. Hypothesis: Based on our examination of the attributes, we will formulate some initial hypotheses to explore and manipulate the data, seeking diverse results.

3. Univariate Analysis: This basic form of data analysis focuses on a single attribute at a time, describing and summarizing the data to identify patterns.

4. Bivariate Analysis: This analysis investigates the relationships and dependencies between two attributes. An association found at this stage does not by itself establish a causal connection.

5. Multivariate Analysis: When dealing with more than two variables simultaneously, we perform multivariate analysis.

6. Data Cleaning: We will clean the dataset by handling missing data, outliers, and categorical variables.

7. Testing Hypothesis: Before applying multivariate techniques, we will verify if our data meets the necessary assumptions.

By following these steps, we can gain valuable insights from the data and draw meaningful conclusions for our project.
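As a rough illustration of how the univariate, bivariate, multivariate, and cleaning steps listed above translate into pandas calls, here is a minimal sketch. The `apps` frame and all of its values are invented for the example; only the column names mirror the Play Store file.

```python
import pandas as pd

# Hypothetical toy data: column names mirror the Play Store file, values are invented.
apps = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS", "TOOLS", "FAMILY"],
    "Type": ["Free", "Paid", "Free", "Free", "Paid"],
    "Rating": [4.5, 4.0, 3.8, None, 4.2],
    "Installs": [1000000, 5000, 100000, 50000, 10000],
})

# Univariate: summarise one attribute at a time.
rating_summary = apps["Rating"].describe()
type_counts = apps["Type"].value_counts()

# Bivariate: relate two attributes (numerical vs. categorical here).
mean_rating_by_type = apps.groupby("Type")["Rating"].mean()

# Multivariate: look at more than two attributes at once.
installs_pivot = apps.pivot_table(index="Category", columns="Type",
                                  values="Installs", aggfunc="sum")

# Data cleaning: drop the rows that have no rating.
apps_clean = apps.dropna(subset=["Rating"])
```

The same calls apply unchanged to the real dataframes once they are loaded.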


# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Write Problem Statement Here.**



   1) What are the top categories on the Play Store?

   2) Are the majority of the apps paid or free?

   3) How important is the rating of an application?

   4) Which audience categories (content ratings) should an app target?

   5) Which category has the highest number of installs?

   6) How does the count of apps vary by genre?

   7) How does the last update affect the rating?

   8) How are ratings affected when the app is a paid one?

   9) How are reviews and ratings correlated?

   10) What does the sentiment subjectivity of the reviews look like?

   11) Are subjectivity and polarity proportional to each other?

   12) What is the percentage of each review sentiment?

   13) How does sentiment polarity vary between paid and free apps?

   14) How does Content Rating affect an app?



#### **Define Your Business Objective?**


The data from the Play Store apps presents a significant opportunity for app-making enterprises to achieve success. Valuable insights can be extracted to guide developers in creating apps that can capture the Android market. The dataset includes crucial information such as category, rating, size, and more for each app listed. Additionally, there is another dataset containing customer reviews for the Android apps. By thoroughly exploring and analyzing this data, we can uncover the critical factors that contribute to app engagement and overall success.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following questions are answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import files
uploaded = files.upload()

### Dataset First View

In [None]:
# Dataset First Look
play_store_df = pd.read_csv('/content/Play Store Data (1).csv')
user_reviews_df = pd.read_csv('/content/User Reviews (1).csv')

In [None]:
play_store_df.head().T

In [None]:
user_reviews_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("shape of PlayStore dataframe is ", play_store_df.shape)

In [None]:
print("The shape of user review dataframe is ",user_reviews_df.shape)

### Dataset Information

In [None]:
# Dataset Info
play_store_df.info()

In [None]:
user_reviews_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

print("Duplicate values for play store dataset ",play_store_df.duplicated().sum())
print("Duplicate values for user reviews dataset ",user_reviews_df.duplicated().sum())

In [None]:
temp_playstore_data = play_store_df.copy()
temp_user_reviews = user_reviews_df.copy()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
plst_nvalues = temp_playstore_data.isnull().sum()
print("Play Store data : \n",plst_nvalues)

In [None]:
plst_nvalues.index

In [None]:
usrev_nvalues = temp_user_reviews.isnull().sum()
print("Users review data: \n",usrev_nvalues)

In [None]:
plst_nvalues[plst_nvalues > 0]

In [None]:
# Visualizing the missing values
#sns.barplot(data=plst_nvalues,x=plst_nvalues.index,y=plst_nvalues.values)
plt.barh(plst_nvalues[plst_nvalues > 0].index,plst_nvalues[plst_nvalues > 0].values)
plt.title('null values in play store data set')
plt.xlabel('values')
plt.ylabel('column having null values')

In [None]:
plt.barh(usrev_nvalues.index,usrev_nvalues.values)
plt.title('null values in user review data set')
plt.xlabel('values')
plt.ylabel('column having null values')

### What did you know about your dataset?

The PlayStore dataset consists of '10841' rows and '13' columns, while the UserReview dataset contains '64295' rows and '5' columns.

Both datasets have some missing values. In the PlayStore dataset, the Rating column has '1474' null values, the Type column has '1' null value, the Content Rating column has '1' null value, the Current Ver column has '8' null values, and the Android Ver column has '3' null values. Additionally, there are '483' duplicate rows in the PlayStore dataset.

In the UserReview dataset, the Translated_Review column has '26868' null values, the Sentiment column has '26863' null values, the Sentiment_Polarity column has '26863' null values, and the Sentiment_Subjectivity column has '26863' null values. Moreover, '33616' duplicate rows are present in the UserReview dataset.
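The figures quoted above (shape, duplicate rows, null counts) can be gathered in one place with a small helper. This is only a sketch: `dataset_overview` is a hypothetical name, and the toy frame stands in for the real files.

```python
import pandas as pd

def dataset_overview(df):
    """Return row/column counts, duplicate rows and per-column null counts."""
    null_counts = df.isnull().sum()
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "duplicate_rows": int(df.duplicated().sum()),
        # Keep only the columns that actually contain nulls.
        "null_counts": null_counts[null_counts > 0].to_dict(),
    }

# Hypothetical toy frame with one duplicated row and one missing rating.
toy = pd.DataFrame({
    "App": ["A", "A", "B", "C"],
    "Rating": [4.1, 4.1, None, 3.9],
})
overview = dataset_overview(toy)
```

Calling the helper on `play_store_df` and `user_reviews_df` would give the same kind of summary for the actual data.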

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
temp_playstore_data.columns

In [None]:
temp_user_reviews.columns

In [None]:
# Dataset Describe
temp_playstore_data.info()

In [None]:
temp_playstore_data.describe()

In [None]:
temp_user_reviews.info()

In [None]:
temp_user_reviews.describe()

### Variables Description

## **Play store variables:**

The dataset comprises 13 columns, each providing valuable information about the applications:

  1. App - Represents the name of the application.

  2. Category - Indicates the category to which an application belongs.

  3. Rating - Represents the user ratings given to a specific application.

  4. Reviews - Indicates the total number of user reviews for the application.

  5. Size - Specifies the size the application occupies on a mobile phone.

  6. Installs - Indicates the total number of installs/downloads for an application.

  7. Type - Specifies whether the application is free or paid.

  8. Price - Represents the price of the application.

  9. Content Rating - Indicates the target audience for the application.

  10. Genres - Specifies various other categories to which an application can belong.

  11. Last Updated - Indicates the date of the last update for the application.

  12. Current Ver - Represents the current version of the application.

  13. Android Ver - Specifies the minimum Android version required to support the application on its platform.

  

## **User review variables:**

The dataset consists of 5 columns that provide insights into app reviews and user sentiments:

  1. App - Contains the name or identifier of the app.

  2. Translated_Review - Contains the text of the review.

  3. Sentiment - Represents the overall sentiment of the review (Positive, Negative, or Neutral).

  4. Sentiment_Polarity - Measures the positivity or negativity of the review.

  5. Sentiment_Subjectivity - Measures the subjectivity or objectivity of the review.

These variables can be leveraged for sentiment analysis of app reviews, helping to understand how users feel about different applications.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print('Play Store data set')
print(temp_playstore_data.nunique())

In [None]:
print('User review data set')
print(temp_user_reviews.nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

1) type

In [None]:
# Write your code to make your dataset analysis ready.
# Finding the row with insufficient data: Type is not 'Free' although Price is '0'
mask = (temp_playstore_data['Type'] != 'Free') & (temp_playstore_data['Price'] == '0')
temp_playstore_data[mask]


In [None]:
# Dropping the row from the data frame
temp_playstore_data.drop(temp_playstore_data[mask].index, inplace=True)


2) Current Ver

In [None]:
# finding null values of this column
temp_playstore_data[temp_playstore_data['Current Ver'].isnull()].T

In [None]:
# Shape before dropping the null values
temp_playstore_data.shape

In [None]:
temp_playstore_data = temp_playstore_data[temp_playstore_data['Current Ver'].notna()]
temp_playstore_data.shape

3) Android Ver

In [None]:
# finding null values of this column
temp_playstore_data[temp_playstore_data['Android Ver'].isnull()].T

In [None]:
# Shape before dropping the null values
temp_playstore_data.shape

In [None]:
temp_playstore_data = temp_playstore_data[temp_playstore_data['Android Ver'].notna()]
temp_playstore_data.shape

In [None]:
temp_playstore_data.boxplot()

The box plot is used to check for ratings that exceed the normal rating scale, i.e. values greater than 5.
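The same check can be made numerically rather than by eye. The sketch below assumes that valid ratings lie between 1 and 5 inclusive; the series and its outlier value are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented values; 19.0 plays the role of an out-of-range rating.
ratings = pd.Series([4.3, 19.0, np.nan, 5.0, 1.0], name="Rating")

# between() is inclusive on both ends. NaN is excluded explicitly so that
# missing ratings are not reported as out-of-range values.
out_of_range = ratings.notna() & ~ratings.between(1, 5)
invalid_ratings = ratings[out_of_range]
```

Applied to `temp_playstore_data['Rating']`, an empty result would mean that no rating falls outside the scale.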

In [None]:
temp_playstore_data['Rating'].unique()

In [None]:
temp_user_reviews.info()

In [None]:
#Eliminating the null value rows from the database

temp_user_reviews = temp_user_reviews[~temp_user_reviews['Sentiment'].isna()]
temp_user_reviews.isna().sum()

In [None]:
temp_user_reviews = temp_user_reviews.dropna(subset = ['Translated_Review'],how='all')
temp_user_reviews.shape

In [None]:
temp_user_reviews.isna().sum()

Data preparation

In [None]:
playstore_data_copy = temp_playstore_data.copy()
user_review_copy = temp_user_reviews.copy()

In [None]:
#Filtering duplicate apps
duplicate_column_check = playstore_data_copy['App'].duplicated().any()
duplicate_column_check


In [None]:
# Apps and their counts
playstore_data_copy['App'].value_counts()

In [None]:
# Column data types
dtye = playstore_data_copy.dtypes
print(dtye)

In [None]:
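# function to clean data
def clean_data(num):
    '''Removes unwanted symbols and expands the 'M'/'k' size suffixes.'''
    if num == 'NaN':
        return '0'  # the placeholder used for 'Varies with device' sizes is treated as 0
    for symbol in ('+', ',', '$'):
        num = num.replace(symbol, '')
    # Multiply rather than append zeros, so that '1.5M' becomes 1500000.0 and not 1.5
    multipliers = {'M': 1_000_000, 'k': 1_000}
    if num and num[-1] in multipliers:
        return float(num[:-1]) * multipliers[num[-1]]
    return num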


In [None]:
# Removing the unwanted characters and converting the required column values to valid numeric types for easier analysis
playstore_data_copy['Reviews'] = pd.to_numeric(playstore_data_copy['Reviews'])
playstore_data_copy['Size'] = playstore_data_copy['Size'].apply(lambda x: 'NaN' if x == 'Varies with device' else x)
playstore_data_copy['Size'] = pd.to_numeric(playstore_data_copy['Size'].map(clean_data))
playstore_data_copy['Installs'] = pd.to_numeric(playstore_data_copy['Installs'].map(clean_data))
playstore_data_copy['Price'] = pd.to_numeric(playstore_data_copy['Price'].map(clean_data))

In [None]:
playstore_data_copy.info()

In [None]:
#selecting all the last rows of data for each app for max review analysis
playstore_last_rev = playstore_data_copy.groupby('App').tail(1).reset_index()

app_review_max = playstore_last_rev.loc[playstore_last_rev.groupby(['App'])['Reviews'].idxmax()]
app_review_max

In [None]:
app_review_max.max()

In [None]:
# Displaying Genres with their app counts (rename_axis/reset_index keeps the column names stable across pandas versions)
top_genres = app_review_max['Genres'].value_counts().rename_axis('Genres').reset_index(name='Count')
top_genres

In [None]:
# Displaying all free apps with max reviews
app_review_max[app_review_max['Price'] == 0]

In [None]:
#Preparing dataframe which contains free app install counts
genres_free_apps_installs = app_review_max[app_review_max['Price'] == 0].groupby(['Genres'])[['Installs']].sum().rename(columns={'Installs':'free_app_installs'})
genres_free_apps_installs

In [None]:
#Preparing dataframe which contains paid app install counts
genres_paid_apps_installs = app_review_max[app_review_max['Price'] != 0].groupby(['Genres'])[['Installs']].sum().rename(columns={'Installs':'Paid_app_installs'})
genres_paid_apps_installs

In [None]:
#Preparing dataframe which contains mean Rating
genre_ratings = app_review_max.groupby(['Genres'])[['Rating']].mean()
genre_ratings

In [None]:
#Merging all the previous dataframes for further analysis
top_genres_installs = pd.merge(top_genres, genres_free_apps_installs, on='Genres')
top_genres_apps_installs = pd.merge(top_genres_installs, genres_paid_apps_installs, on='Genres')
top_genres_apps_installs_ratings= pd.merge(top_genres_apps_installs, genre_ratings, on='Genres')

In [None]:
#Getting the top 50 genres by app count
top_50_genres = top_genres_apps_installs_ratings.head(50)
top_50_genres

### What all manipulations have you done and insights you found?

During our data analysis project, data cleaning emerged as a major challenge. Some reviews contained NaN (Not-a-Number) values, which required careful handling to ensure the integrity of our analysis. Even after merging both dataframes, the presence of these missing values impacted the accuracy of our results.

In addition to the NaN values, we encountered several rows and columns with insufficient data, which necessitated their removal from the dataset. While this step reduced the dataset size, it was crucial to maintain data quality and reliability.

One interesting finding during the analysis was the preference for free apps among users. This insight helped shape some of our conclusions and recommendations.

To improve the accuracy of our analysis, we handled null and NaN values explicitly: rows with a missing Current Ver, Android Ver, Sentiment, or Translated_Review were removed, and the 'Varies with device' entries in Size were replaced with 0 so that the column could be converted to a numeric type.

Furthermore, data cleaning involved eliminating unwanted characters and converting certain column values into valid numeric types. By doing so, we ensured that the data was in a consistent format, making it easier to perform various analytical tasks.
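A vectorised way to sketch this conversion is shown below. It assumes sizes look like `19M`, `1.5M`, `201k`, or `Varies with device`; the sample values are invented, and unlike the notebook's approach this variant leaves unparseable sizes as NaN instead of 0.

```python
import pandas as pd

# Invented examples of the formats the Size column is assumed to contain.
sizes = pd.Series(["19M", "1.5M", "201k", "Varies with device"])

# Split each value into its numeric part and its unit suffix.
parts = sizes.str.extract(r"^(?P<value>[\d.]+)(?P<unit>[Mk])$")
multiplier = parts["unit"].map({"M": 1_000_000, "k": 1_000})

# Rows that do not match the pattern (e.g. 'Varies with device') stay NaN.
size_numeric = pd.to_numeric(parts["value"]) * multiplier
```

Keeping unknown sizes as NaN means they are skipped by `mean()` and `corr()` rather than pulling those statistics toward zero.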

Overall, overcoming these data cleaning challenges was essential for conducting reliable and insightful analyses, ultimately leading to valuable conclusions and actionable insights.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Heat map for play_store
plt.figure(figsize = (10,5))
sns.heatmap(playstore_data_copy.corr(numeric_only=True), annot= True)
plt.title('Correlation Heatmap for Playstore Data', size=15)

##### 1. Why did you pick the specific chart?

The specific chart, a heat map, was chosen for graphical representation because it is effective in visualizing and exploring patterns and trends in data contained in a matrix, where individual values are represented as colors. Heat maps allow for a quick understanding of relationships and correlations between different variables, making it suitable for the given dataset.

##### 2. What is/are the insight(s) found from the chart?

The insights obtained from the heat map are as follows:

  1.There is a strong positive correlation between the "Installs" and "Reviews" columns. This indicates that as the number of installs increases, there is a higher user base, leading to a larger number of total reviews posted by users.

  2.The "Price" column shows a slight negative correlation with the "Rating," "Reviews," and "Installs." This suggests that as the prices of the apps increase, the average rating, total number of reviews, and installs tend to decrease gradually.

  3.The "Rating" column exhibits a marginal positive correlation with the "Installs" and "Reviews" columns. This indicates that as the average user rating of an app increases, there is a tendency for both the number of app installs and the number of reviews to increase as well.
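To read the strongest relationships off a correlation matrix without scanning the heat map, the matrix can be flattened into a ranked list of column pairs. The sketch below uses invented numbers; only the column names follow the Play Store data.

```python
import numpy as np
import pandas as pd

# Invented numbers for illustration only.
toy = pd.DataFrame({
    "Rating": [4.1, 4.5, 3.9, 4.7, 4.0],
    "Reviews": [100, 5000, 50, 20000, 300],
    "Installs": [1000, 100000, 500, 1000000, 5000],
    "Price": [0.0, 0.0, 2.99, 0.0, 4.99],
})

corr = toy.corr(numeric_only=True)

# Keep only the upper triangle so that each pair of columns appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to a Series indexed by (column, row) and rank by absolute strength.
pairs = upper.unstack().dropna().sort_values(key=lambda s: s.abs(), ascending=False)
```

Replacing `toy` with `playstore_data_copy` would rank the pairs shown in the heat map.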

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Based on the above observations, it can be concluded that the insights gained from the heat map will likely have a positive impact on the business. The strong positive correlation between the number of installs and reviews suggests that as more users install the app, there will be a higher likelihood of receiving reviews, which can potentially attract even more users due to positive social proof. Similarly, the marginal positive correlation between the app's rating and its installs and reviews indicates that maintaining or improving the app's rating could lead to increased user engagement and growth.

As for negative growth, the insights do not directly indicate any negative impact. However, the slight negative correlation between the app's price and its rating, reviews, and installs may suggest that increasing the app's price could lead to reduced user engagement and adoption. To validate this potential negative impact, further analysis and experimentation may be necessary to understand how users respond to price changes and whether it affects overall growth and revenue in the long run.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(12, 4))
sns.lineplot(data=top_50_genres, x='Genres', y='Count')
plt.xticks(range(len(top_50_genres['Genres'])), top_50_genres['Genres'], rotation=90)
plt.title('Top 50 Genres VS App Counts')
plt.ylabel('Number of Applications')
plt.xlabel('Genres')
plt.rc('font', size=14)
plt.show()

##### 1. Why did you pick the specific chart?

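A line plot over the genres, ordered by app count, shows how quickly the number of apps drops off from the most populated genres to the least populated ones.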

##### 2. What is/are the insight(s) found from the chart?

  1.The top three genres with the highest app counts are 'Tools,' 'Entertainment,' and 'Education.'

  2.Among all the genres, 'Tools' has the highest number of apps.
  
  3.Conversely, the 'Board' and 'Brain Games' genres have the lowest app counts.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The findings from the comparison of app genres may not have a direct impact on business strategy. However, it provides valuable insights into user preferences and the popularity of different genres. The dominance of 'Tools' suggests a high demand for productivity and utility apps, which can be relevant for businesses targeting this segment. Understanding that 'Entertainment' and 'Education' genres are also popular indicates potential areas for content development or advertising. While the specific analysis may not directly impact business decisions, it offers a deeper understanding of app usage patterns and user interests, which can be leveraged for marketing and product development strategies. Additionally, recognizing the popularity of 'Tools' indicates an opportunity for businesses to create or collaborate with app developers in this category, considering the wide appeal of such apps due to their ability to streamline work and enhance efficiency.

#### Chart - 3

In [None]:
playstore_data_copy.Category

In [None]:
plt.figure(figsize=(10, 7))
plt.style.use("fivethirtyeight")

# Count the occurrences of each category and get the top 10 categories by count
top_categories = playstore_data_copy['Category'].value_counts().head(10)

# Create the bar chart
plt.bar(top_categories.index, top_categories.values)

plt.xlabel("Category -------------->", fontsize=15)
plt.ylabel("Count -------------->", fontsize=15)
plt.title("Count of Applications in Each Category")
plt.xticks(rotation=90)

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

The bar plot was chosen because it is effective for visualizing and comparing the number of apps in each category. It presents the data in a clear and straightforward manner, making it easy to interpret. Each vertical bar represents the app count of one category, allowing a quick comparison between categories.

##### 2. What is/are the insight(s) found from the chart?

The 'Family' and 'Game' categories dominate the Play Store market, followed by Tools, Medical, and Business. This information helps us understand which everyday needs are already well served and where the market is filled with similar apps.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can potentially help create a positive business impact. Understanding that the 'Family' and 'Game' categories dominate the Play Store market indicates that there is a significant demand for apps in these categories. This knowledge can guide businesses to develop or invest in apps that cater to family-oriented entertainment and gaming, potentially leading to increased user engagement and revenue generation.

Furthermore, recognizing the popularity of the 'Tools,' 'Medical,' and 'Business' categories can inspire businesses to explore opportunities in these areas. Developing apps that fulfill daily requirements in these categories could address specific needs and attract a sizable user base. For instance, creating productivity tools, medical apps, or business-related utilities may prove to be profitable ventures, given their relevance to users' daily lives.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
scene_3=playstore_data_copy.groupby(["Category","Type"])["App"].count().reset_index()
point_3=pd.DataFrame(scene_3)
# we are plotting bar plot for above grouped dataframe
plt.figure(figsize=(10,6))
plt.style.use("fivethirtyeight")
sns.barplot(x="Category",y="App",hue="Type",data=point_3)
plt.xlabel("Category--------->", fontsize=11)
plt.ylabel("Count--------->", fontsize=11)
plt.title("Count of applications in each category differentiated by their type")
plt.xticks(rotation=90)
plt.show();

##### 1. Why did you pick the specific chart?

The specific chart, a bar plot, was chosen to visualize the relationship between the "Category" and "App" variables, with differentiation based on the "Type" variable. The bar plot is well suited for this purpose because it allows us to compare the number of apps (y-axis) across the various categories (x-axis) while also distinguishing whether an app is free or paid.

##### 2. What is/are the insight(s) found from the chart?

It looks like certain app categories have more free apps available for download than others. The majority of apps in the Family, Food & Drink, Tools, and Social categories were free to install.
At the same time, the Family, Sports, Tools, and Medical categories had the largest numbers of paid apps available for download.
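Raw counts favour large categories, so it can help to look at the share of paid apps per category as well. A minimal sketch with `pd.crosstab`, using invented rows:

```python
import pandas as pd

# Invented rows; only the column names follow the Play Store data.
toy = pd.DataFrame({
    "Category": ["FAMILY", "FAMILY", "FAMILY", "MEDICAL", "MEDICAL", "SOCIAL"],
    "Type": ["Free", "Free", "Paid", "Paid", "Free", "Free"],
})

# Rows are categories, columns are app types, cells are app counts.
type_counts = pd.crosstab(toy["Category"], toy["Type"])

# Share of paid apps per category, so that large and small categories are comparable.
paid_share = (type_counts["Paid"] / type_counts.sum(axis=1)).sort_values(ascending=False)
```

In this toy example FAMILY has the most apps overall, while MEDICAL has the highest paid share.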

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights can help create a positive business impact by guiding app development and monetization strategies. Offering free apps in Family, Food & Drink, Tools, and Social categories can attract a larger user base and revenue through ads or in-app purchases. Developing paid apps in Family, Sports, Tools, and Medical categories can target niche markets and generate revenue from app purchases.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
scene_1=(playstore_data_copy.groupby("Category").agg({"Installs":"mean"})
        .sort_values(by="Installs",ascending=False)
        .reset_index().head(7))
point_1=pd.DataFrame(scene_1)

# we are plotting bar plot for above grouped dataframe
plt.figure(figsize=(10,5))
plt.style.use("fivethirtyeight")
sns.barplot(y="Installs",x="Category",data=point_1)
plt.xlabel("Category-------------->", fontsize=15)
plt.ylabel("Installs-------------->", fontsize=15)
plt.title("Average number of installs per app in each category")
plt.xticks(rotation=90)
plt.show();

##### 1. Why did you pick the specific chart?

The bar plot was chosen because it makes it easy to compare a single numeric value, here the mean number of installs per app, across categories. Each vertical bar represents one of the seven categories with the highest mean installs, allowing a quick comparison between them.

##### 2. What is/are the insight(s) found from the chart?

The top 7 categories by average installs per app are Communication, Social, Video Player, Productivity, Game, Photography, and Travel.
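Because this chart aggregates `Installs` with a mean, a category with a few very popular apps can rank above a category with many apps. Putting the app count, the mean, and the total side by side makes that visible. The sketch below uses invented numbers.

```python
import pandas as pd

# Invented numbers: two large apps in one category, four small apps in the other.
toy = pd.DataFrame({
    "Category": ["COMMUNICATION", "COMMUNICATION", "FAMILY", "FAMILY", "FAMILY", "FAMILY"],
    "Installs": [1_000_000_000, 10_000_000, 1_000_000, 500_000, 100_000, 10_000],
})

# Named aggregation: one output column per statistic.
install_stats = (
    toy.groupby("Category")["Installs"]
       .agg(app_count="count", mean_installs="mean", total_installs="sum")
       .sort_values("mean_installs", ascending=False)
)
```

Here FAMILY has twice as many apps but a much lower mean, which is why a ranking by mean installs and a ranking by app count need not agree.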


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The categories with the highest average installs per app, namely Communication, Social, Video Player, Productivity, Game, Photography, and Travel, provide valuable guidance for businesses. High average installs indicate strong user demand for apps in these categories, making them potentially lucrative areas for app development and investment. Note that these are not the same as the categories with the most apps: of the five most populated categories reported for Chart 3, only Game also appears in this list.

#### Chart - 6

In [None]:
# Mean price of paid apps in each category
category_price_mean = playstore_data_copy[playstore_data_copy['Price'] !=  0].groupby(['Category'])['Price'].mean().reset_index(name='Price')
plt.figure(figsize=(10, 8))
ax = sns.stripplot(x='Price', y='Category', data=category_price_mean, jitter=True, linewidth=2)
ax.set_title('App pricing trend across categories(in USD)')

##### 1. Why did you pick the specific chart?

To show the average price of paid apps across categories.

##### 2. What is/are the insight(s) found from the chart?


 1) In most categories, the average price of paid apps is under 25 USD.

 2) The Finance category has the highest average price, followed by Lifestyle.

 3) The lowest-priced category is Libraries and Demo.
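A category mean can be dominated by a single very expensive app, so it may be worth checking the median and maximum next to it. A minimal sketch on invented prices:

```python
import pandas as pd

# Invented prices; a single 399.99 app dominates the FINANCE mean.
toy = pd.DataFrame({
    "Category": ["FINANCE", "FINANCE", "FINANCE", "FINANCE", "TOOLS", "TOOLS", "GAME"],
    "Price": [399.99, 2.99, 4.99, 0.0, 1.99, 0.0, 4.99],
})

# Restrict to paid apps, then summarise the price per category.
paid = toy[toy["Price"] > 0]
price_summary = paid.groupby("Category")["Price"].agg(["count", "median", "mean", "max"])
```

In this toy data the FINANCE mean is far above its median, which signals that one outlier, not the typical app, drives the average.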


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

As this is only the pricing trend across categories, it may not drive business decisions on its own. What it does provide is an idea of how much an app can charge in its particular category and what competitors are charging in the market.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
rating_df = playstore_data_copy[~playstore_data_copy['Rating'].isna()]['Rating']

plt.style.use("fivethirtyeight")

plt.figure(figsize=(10, 6))
plt.title('Histogram of Ratings')
sns.histplot(rating_df, kde=True, color='g')

plt.xlabel('Rating')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

To analyze the distribution of app ratings. The histogram summarizes how apps are rated by users around the world, from well liked to disliked.

##### 2. What is/are the insight(s) found from the chart?


  1) Most apps have a rating between 4 and 5.

  2) This suggests that most apps are well liked by their users, plausibly because of ease of use, functionality, performance, and fewer bugs, although the ratings alone do not show the cause.
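The visual impression that most ratings fall between 4 and 5 can be backed by a number. The sketch below computes that share on an invented series; the result for the real data is not reproduced here.

```python
import numpy as np
import pandas as pd

# Invented ratings, including one missing value.
ratings = pd.Series([4.5, 4.1, 3.2, 4.8, np.nan, 2.5, 4.0, 4.3])

# Ignore unrated apps, then take the fraction with 4 <= rating <= 5.
rated = ratings.dropna()
share_4_to_5 = rated.between(4, 5).mean()
```

Using `playstore_data_copy['Rating']` in place of `ratings` gives the share for the actual dataset.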


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Definitely a positive impact on business. Because these apps already have their users' trust, developers can experiment with scaling the app and with pricing, and extend its functionality globally. High ratings also reduce advertisement costs. Most importantly, developers can divert this traffic to their next projects, which helps them become a brand.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Set the style and figure size before plotting so that they apply to this chart
plt.style.use("fivethirtyeight")
plt.rcParams['figure.figsize'] = (12, 7)
category_mean_rating = playstore_data_copy.groupby(['Category'])['Rating'].mean().reset_index(name='Rating')
category_mean_rating.set_index('Category').plot(kind='bar', color='c')
plt.title('Category VS Mean Rating')
plt.ylabel('Ratings out of 5')
plt.xlabel('Category')
plt.xticks(rotation=90)  # Rotate the x-axis tick labels for better readability
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

To compare the mean rating across categories.

##### 2. What is/are the insight(s) found from the chart?


    
    


1.  Most categories have a mean rating between 4 and 5.
2.  This suggests that apps in most categories are well liked by their users, plausibly because of ease of use, functionality, performance, and fewer bugs.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Definitely a positive impact on business. Because apps in these categories already have their users' trust, developers can experiment with scaling the app and with pricing, and extend its functionality globally. High ratings also reduce advertisement costs. Most importantly, developers can divert this traffic to their next projects, which helps them become a brand.

#### Chart - 9

In [None]:
# Number of apps for each content rating (target age group)
content_rating = playstore_data_copy['Content Rating'].value_counts()
content_rating

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(10, 7))
plt.style.use("fivethirtyeight")
labels = content_rating.index
# One offset per wedge, cycling 0.1 to 0.5, so the length always matches the number of content ratings
explode = [0.1 * (i % 5 + 1) for i in range(len(content_rating))]

# Create the pie chart
plt.pie(content_rating, labels=labels, explode=explode, autopct='%1.2f%%', startangle=140,
        textprops={'fontsize': 15}, pctdistance=0.85, labeldistance=1.05)

plt.title('Content Rating Distribution')
plt.axis('equal')
plt.legend(labels, title="Content Rating", bbox_to_anchor=(1, 0.8))
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

To see how apps are distributed across content ratings (target age groups).

##### 2. What is/are the insight(s) found from the chart?

1.  Most apps are rated as suitable for all age groups; the smallest share is adults only.
2.  Most apps were created with every age group in mind.
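The percentages printed on the pie slices can also be produced as a table, which is easier to quote. This is a small sketch on hypothetical toy data (`toy_content` and its labels are illustrative, not the real column values), using `value_counts(normalize=True)`.

```python
import pandas as pd

# Hypothetical toy column standing in for playstore_data_copy['Content Rating']
toy_content = pd.Series(
    ["Everyone"] * 8 + ["Teen"] * 1 + ["Mature 17+"] * 1,
    name="Content Rating",
)

# normalize=True returns fractions instead of counts; convert to percentages
share = toy_content.value_counts(normalize=True).mul(100).round(1)

print(share)
```

Because `value_counts` sorts in descending order, the first index entry is the dominant content rating.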



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Given this distribution, most creators are targeting every age group, so that is the wise choice for a positive business impact. Choosing teen content as a niche can also bring growth. Overall, family-friendly content is likely to win the race.

#### Chart - 10

In [None]:
#Getting the top 10 downloaded apps and their Ratings vs Reviews
top_downloaded_apps = playstore_data_copy.groupby('App').tail(1).sort_values(['Installs','Rating'], ascending=False).head(50)
top_10_downloaded_apps = top_downloaded_apps.head(10).set_index('App')[['Rating','Reviews']].sort_values(['Reviews'])

In [None]:
# Chart - 10 visualization code
ind = top_10_downloaded_apps.index
# Left panel shows ratings, right panel shows review counts
column0 = top_10_downloaded_apps['Rating']
column1 = top_10_downloaded_apps['Reviews'] / 1e6  # In millions; assumes Reviews is numeric
title0 = 'Rating out of 5'
title1 = 'Total reviews in millions'
plt.style.use("fivethirtyeight")

fig, axes = plt.subplots(figsize=(10,5), ncols=2, sharey=True)
fig.tight_layout()

axes[0].barh(ind, column0, align='center', color='g', zorder=10)
axes[0].set_title(title0, fontsize=18, pad=15, color='g')
axes[1].barh(ind, column1, align='center', color='b', zorder=10)
axes[1].set_title(title1, fontsize=18, pad=15, color='b')

# Fix the rating scale at 0-5, then invert the left x-axis so bars grow leftwards
axes[0].set_xlim(0, 5)
axes[0].invert_xaxis()

# To show data from highest to lowest
# plt.gca().invert_yaxis()
axes[0].set(yticks=ind, yticklabels=ind)
axes[0].yaxis.tick_left()

plt.subplots_adjust(wspace=0, top=0.85, bottom=0.1, left=0.18, right=0.95)

plt.show()


##### 1. Why did you pick the specific chart?

To compare ratings and review counts for the 10 most-downloaded apps.

##### 2. What is/are the insight(s) found from the chart?

1.  The two top-rated apps are both social media/communication apps, with ratings above 4 (out of 5).
2.  The remaining apps are all rated below 3.
3.  Review counts for all 10 apps are at a similar level.
4.  There is a marked gap in ratings between the top 2 and the rest of the top 10.





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This shows the quality differences between the apps. Even though the downloads are high, there is a huge difference in ratings. These stats definitely affect the business negatively, as people start to compare apps and go with the one that is good at both functionality and service. One of the best ways to improve these ratings is to go through the reviews, find where users are facing problems, and fix them.
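Going through the reviews to find problem areas can start with a simple word count over an app's negative reviews. The sketch below is one possible approach, not a method used elsewhere in this notebook: `toy_reviews`, the `Translated_Review` column name, the `STOPWORDS` set, and the helper `top_complaint_words` are all assumptions made for illustration.

```python
import re
from collections import Counter

import pandas as pd

# Hypothetical toy reviews; the column names are assumed to mirror user_review_copy
toy_reviews = pd.DataFrame({
    "App": ["AppA", "AppA", "AppA", "AppB"],
    "Translated_Review": [
        "App keeps crashing after the update",
        "Crashing every time I open the camera",
        "Love it, works great",
        "Too many ads",
    ],
    "Sentiment": ["Negative", "Negative", "Positive", "Negative"],
})

# Tiny illustrative stopword list; a real analysis would use a fuller one
STOPWORDS = {"the", "a", "i", "it", "after", "every", "to", "too", "many", "app", "time"}


def top_complaint_words(reviews, app, n=3):
    """Return the n most frequent non-stopwords in an app's negative reviews."""
    mask = (reviews["App"] == app) & (reviews["Sentiment"] == "Negative")
    words = []
    for text in reviews.loc[mask, "Translated_Review"].dropna():
        tokens = re.findall(r"[a-z']+", text.lower())
        words.extend(t for t in tokens if t not in STOPWORDS)
    return Counter(words).most_common(n)


top_words = top_complaint_words(toy_reviews, "AppA")
print(top_words)
```

Frequent words such as "crashing" point to the area to investigate first; reading the underlying reviews is still needed to understand the actual problem.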

#### Chart - 11

In [None]:
# Chart - 11 visualization code
sentiment_subjectivity_df = user_review_copy['Sentiment_Subjectivity']
plt.style.use("fivethirtyeight")

# Create a single figure; a second plt.figure() call would leave an empty one behind
plt.figure(figsize=(10, 6))
sns.histplot(sentiment_subjectivity_df, color='c')

plt.xlabel("Subjectivity")
plt.ylabel("Frequency")
plt.title('Distribution of Subjectivity')

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

To know the distribution of Subjectivity

##### 2. What is/are the insight(s) found from the chart?

1. Sentiment subjectivity mostly lies in the range 0.4 to 0.7.
2. Most users write reviews that reflect their own experience with the app.
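The range read off the histogram can be quantified as the fraction of reviews that fall inside it. A minimal sketch, assuming a hypothetical toy Series `toy_subjectivity` in place of `user_review_copy['Sentiment_Subjectivity']`:

```python
import pandas as pd

# Hypothetical toy values; None mimics reviews with no sentiment score
toy_subjectivity = pd.Series(
    [0.45, 0.5, 0.6, 0.65, 0.1, 0.9, 0.55, None],
    name="Sentiment_Subjectivity",
)

# Drop missing scores so they do not count against the share
valid = toy_subjectivity.dropna()

# Fraction of scored reviews with subjectivity in [0.4, 0.7]
share_mid = valid.between(0.4, 0.7).mean()

print(f"{share_mid:.1%} of scored reviews fall between 0.4 and 0.7")
```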



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Sentiment subjectivity refers to the degree to which a piece of writing expresses personal opinions, feelings, or biases, and everyone has their own point of view. Highly subjective reviews are therefore a less reliable guide for business decisions, and in that sense higher sentiment subjectivity works against business growth.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
plt.style.use("fivethirtyeight")

plt.figure(figsize=(10, 6))  # Set the figure size

sns.scatterplot(x=user_review_copy['Sentiment_Subjectivity'], y=user_review_copy['Sentiment_Polarity'],
                hue=user_review_copy['Sentiment'], edgecolor='white', palette="coolwarm")

plt.title("Relationship between Sentiment Subjectivity and Sentiment Polarity")
plt.xlabel("Sentiment Subjectivity")
plt.ylabel("Sentiment Polarity")

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

To plot the relationship between sentiment_subjectivity and sentiment_polarity in scatter plot

##### 2. What is/are the insight(s) found from the chart?

*   The scatter plot shows that sentiment subjectivity is not consistently proportional to sentiment polarity.
*   At a given level of subjectivity, polarity can be either very high or very low.
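The visual impression from the scatter plot can be backed by a correlation coefficient. The sketch below uses hypothetical toy values (`toy_sentiment` stands in for `user_review_copy`) and pandas' `Series.corr`, which computes the Pearson correlation by default.

```python
import pandas as pd

# Hypothetical toy scores standing in for user_review_copy
toy_sentiment = pd.DataFrame({
    "Sentiment_Subjectivity": [0.1, 0.3, 0.5, 0.7, 0.9, 0.9],
    "Sentiment_Polarity": [0.0, 0.2, -0.4, 0.6, -0.8, 0.9],
})

# Pearson correlation between the two columns (the default method of Series.corr)
r = toy_sentiment["Sentiment_Subjectivity"].corr(toy_sentiment["Sentiment_Polarity"])

print(f"Pearson r = {r:.3f}")
```

A value of `r` near 0 would support the reading that subjectivity and polarity are not linearly related, while a value near 1 or -1 would contradict it.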



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Based on the plot above, the answer is both yes and no. When polarity is positive and subjectivity is high, the product is a good one for that individual user. When polarity is negative and subjectivity is low, the product is judged to be very poor.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
plt.style.use("fivethirtyeight")
user_review_copy['Sentiment'].value_counts().plot(kind='pie', explode=(0.1, 0.1, 0.1), shadow=True, autopct='%1.2f%%', pctdistance=1.1, labeldistance=1.5)

plt.title("Sentiment Distribution")
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

To plot the sentiment percentage analysis

##### 2. What is/are the insight(s) found from the chart?

1.  Most of the sentiment is positive.
2.  Users review apps based on their own personal experience rather than an overall assessment.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This definitely hinders business growth, as most people think and behave in different ways.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

In this project, we conducted an analysis of the Play Store using the provided data. Our analysis aimed to enhance app properties during the launch and identify areas for common updates across different categories. Cleaning and preparing the data were prioritized to ensure clear insights into Play Store apps.

Based on our analysis, we recommend the following actions for the client:


1.   Prioritize the development of more updated and useful free apps, as they are widely preferred by users.
2.   Regularly update existing apps to attract new users and maintain user engagement.
3.   Explore opportunities in unexplored categories like events, beauty, art, etc., to cater to users with specific interests in these areas.
4.   Focus on providing high-quality content in apps, as this can positively impact the app market and user satisfaction.
5.   Understand and cater to varying user sentiments throughout their app usage journey, emphasizing features that align with the majority of users' preferences.

By implementing these recommendations, the client can potentially enhance their app offerings, attract more users, and achieve better results in the competitive app market.



# **Conclusion**

Through exploratory data analysis, we have identified several trends and made key assumptions that can potentially lead to app success in the Play Store.


*   Approximately 92% of the apps are free, indicating a high demand for free apps among users.
*   Around 82% of apps have no age restrictions, broadening their potential user base.
*   The Family and Games categories are highly competitive in both paid and free apps.
*   The top three categories based on app count are Family, Game, and Tools.
*   A significant number of apps have a size less than 50 MB and a rating above 4.0, indicating user preference for compact yet high-quality apps.
*   The Game category has the highest average app installs, making it an attractive category for developers.
*   The Finance category has the highest average installation fee for paid apps.
*   Overall, sentiment analysis reveals positive sentiments dominate, with 64% positive, 22% negative, and 13% neutral sentiments.
*   Sentiment Polarity is not highly correlated with Sentiment Subjectivity, indicating complex user emotions and opinions.

Utilizing these insights, developers can focus on creating free apps in the Family and Games categories while considering factors like app size and user ratings. Additionally, exploring the Finance category for potential paid apps with higher installation fees might be beneficial. Improving app quality, especially for larger-sized apps, can lead to greater user engagement and success.
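Headline figures such as the share of free apps and the average price of paid apps per category can be recomputed in a few lines. This is a minimal sketch on hypothetical toy data: `toy_apps` and its values are invented, and the `Type` and `Price` column names are assumptions about the Play Store DataFrame, with `Price` assumed to be already numeric.

```python
import pandas as pd

# Hypothetical toy data; 'Type' and 'Price' column names are assumptions
toy_apps = pd.DataFrame({
    "Category": ["FINANCE", "FINANCE", "GAME", "GAME", "FAMILY"],
    "Type": ["Paid", "Free", "Paid", "Free", "Free"],
    "Price": [9.99, 0.0, 1.99, 0.0, 0.0],
})

# Fraction of apps that are free (mean of a boolean mask)
free_share = (toy_apps["Type"] == "Free").mean()

# Average price of paid apps per category, highest first
paid = toy_apps[toy_apps["Type"] == "Paid"]
mean_paid_price = paid.groupby("Category")["Price"].mean().sort_values(ascending=False)

print(f"Free apps: {free_share:.0%}")
print(mean_paid_price)
```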











### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***