<a href="https://colab.research.google.com/github/saurabhdroid/todo1/blob/master/_Final_Playstoreapp_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Play Store App Review Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual.


# **Project Summary -**

This project aims to harness the vast potential of Play Store app data and customer reviews to identify key factors driving app engagement and success in the Android market. By analyzing app attributes, ratings, and user sentiments, actionable insights will be derived for app developers to optimize their apps, increase engagement, and capture a larger share of the Android market.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**





This project aims to conduct an in-depth analysis of Android apps and their customer reviews to gain insights into factors that contribute to app engagement and success. The app dataset contains information about various apps, such as their categories, ratings, reviews, and installs. Additionally, the customer reviews dataset includes textual reviews along with corresponding sentiment scores.

The project seeks to address the following questions:

What are the critical factors that influence app engagement and success?

Can we identify specific patterns that distinguish successful apps in terms of their categories, ratings, and popularity?

How do user sentiments expressed in the reviews impact app engagement?

Is there a discernible relationship between positive reviews and higher app ratings or installations?

What are the most prevalent positive and negative aspects frequently mentioned in customer reviews?

Can these insights be leveraged to enhance user experience and overall app performance?

Is it possible to predict the sentiment of customer reviews using machine learning techniques based on the review text?

Based on the findings, what actionable recommendations can be offered to app developers to optimize user satisfaction and increase the likelihood of app success?

By addressing these questions, the project aims to provide valuable and practical insights that can be utilized to improve app development strategies, attract a larger user base, and ensure enhanced user satisfaction.



#### **Define Your Business Objective?**

Playstore App Engagement Optimization

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
#importing important libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
!git clone https://github.com/saurabhdroid/Play-Store-App-Review-Analysis.git

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load the User Reviews dataset
user_reviews_df = pd.read_csv("/content/drive/MyDrive/User Reviews.csv")

# Load the Play Store Data dataset
playstore_data_df = pd.read_csv("/content/drive/MyDrive/Play Store Data.csv")

### Dataset First View

In [None]:
# Viewing the first rows of the Play Store dataframe
# (head(-1) would return every row except the last one)
playstore_data_df.head()

In [None]:
#viewing info of playstore dataframe
playstore_data_df.info()

In [None]:
#viewing the available numeric column details
playstore_data_df.describe()

### Dataset Rows & Columns count

In [None]:

#viewing the columns name
playstore_data_df.columns

In [None]:
# Viewing the user reviews dataframe
user_reviews_df

In [None]:
# Viewing the user reviews dataframe info

user_reviews_df.info()

In [None]:
# Viewing summary statistics of the numeric columns

user_reviews_df.describe()

### Dataset Information

Observations:

*   The Play Store data has 12 object columns and 1 numeric column.
*   The columns Reviews, Size, Installs and Price need to be converted to numeric types.
*   The column Last Updated needs to be converted to date-time.
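The date-time conversion mentioned in the observations could be sketched as follows. This is a minimal illustration on made-up rows, assuming the column stores dates in the `Month D, YYYY` style (as in the value 'February 11, 2018' seen in the data); `sample_df` is a hypothetical stand-in for the real dataframe.

```python
import pandas as pd

# Toy rows in the "Month D, YYYY" style (illustrative values, not taken from the dataset)
sample_df = pd.DataFrame({"Last Updated": ["January 7, 2018", "August 1, 2018", "not a date"]})

# errors="coerce" turns entries that cannot be parsed into NaT instead of raising
sample_df["Last Updated"] = pd.to_datetime(
    sample_df["Last Updated"], format="%B %d, %Y", errors="coerce"
)

print(sample_df.dtypes)
print(sample_df)
```

Coercing instead of raising keeps the cell runnable in one go, and the resulting `NaT` entries can then be inspected like any other missing value.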

### What did you know about your dataset?

The dataset includes information about Android apps on the Play Store and user reviews for these apps. It contains details like app names, categories, ratings, reviews, installs, sizes, and more. Some rows have missing values, and the user reviews dataset includes sentiment labels and polarity for the reviews.

User Reviews Dataset has 64295 rows and 5 columns. Play Store Data Dataset has 10839 rows and 13 columns.

## ***2.1 Cleaning the data of playstore database***

The cleaning steps are:

*   Find garbage values in the columns.
*   Drop the rows that contain them.
*   Convert the 'Size' column into a valid numeric column.
*   Convert the 'Reviews' column into a valid numeric column.
*   Convert the 'Installs' column into a valid numeric column.
*   Convert the 'Price' column into a valid numeric column.

---



In [None]:
# Finding the rows with insufficient data: not marked as 'Free', yet priced at '0'
playstore_data_df[(playstore_data_df['Type'] != 'Free') & (playstore_data_df['Price'] == '0')]

In [None]:
# Dropping those rows from the dataframe
invalid_rows = (playstore_data_df['Type'] != 'Free') & (playstore_data_df['Price'] == '0')
playstore_data_df.drop(playstore_data_df[invalid_rows].index, inplace=True)

In [None]:
#Finding value mismatched row
playstore_data_df[playstore_data_df['Genres']=='February 11, 2018']

In [None]:
# Finding the index of the row that contains the garbage values
playstore_data_df[playstore_data_df['Genres']=='February 11, 2018'].index

In [None]:
# Dropping the garbage row from our dataframe
playstore_data_df.drop(playstore_data_df[playstore_data_df['Genres']=='February 11, 2018'].index, inplace=True)

In [None]:
# Clean string function
def clean_it(num):
  """Strip '+', ',' and '$' from a string, expand a 'k' or 'M' suffix into
  the full number, and map the placeholder 'NaN' to '0'."""
  num = num.replace('+', '').replace(',', '').replace('$', '')
  if 'M' in num:
    num = str(int(float(num.replace('M', '')) * 1000000))
  if 'k' in num:
    num = str(int(float(num.replace('k', '')) * 1000))
  if 'NaN' in num:
    num = '0'
  return num

In [None]:
# Cleaning the unwanted characters and converting the required column values into valid numeric types

# Changing the 'Reviews' column values into valid numeric values
playstore_data_df['Reviews'] = pd.to_numeric(playstore_data_df['Reviews'])

# Changing the 'Size' column values into valid numeric values
# ('Varies with device' becomes 'NaN', which clean_it then maps to 0)
playstore_data_df['Size'] = playstore_data_df['Size'].replace('Varies with device', 'NaN')
playstore_data_df['Size'] = pd.to_numeric(playstore_data_df['Size'].map(clean_it))

# Changing the 'Installs' column values into valid numeric values
playstore_data_df['Installs'] = pd.to_numeric(playstore_data_df['Installs'].map(clean_it))

# Changing the 'Price' column values into valid numeric values
playstore_data_df['Price'] = pd.to_numeric(playstore_data_df['Price'].map(clean_it))
playstore_data_df.info()
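As an alternative to the per-value cleaning above, the same kind of conversion can be sketched with vectorized pandas string methods. This is only an illustration under stated assumptions: `to_number` is a hypothetical helper, the sample values are made up, and unlike the notebook's approach it leaves unknown sizes as `NaN` rather than 0.

```python
import pandas as pd

def to_number(values: pd.Series) -> pd.Series:
    """Hypothetical helper: strip '+', ',' and '$', expand 'k'/'M' suffixes, return floats."""
    text = values.astype(str).str.replace(r"[+,$]", "", regex=True)
    # Multiplier implied by a trailing suffix; 1 when there is none
    factor = text.str[-1].map({"k": 1_000, "M": 1_000_000}).fillna(1)
    digits = text.str.replace(r"[kM]$", "", regex=True)
    # Anything that is still not a number (e.g. 'Varies with device') becomes NaN
    return pd.to_numeric(digits, errors="coerce") * factor

# Made-up values in the styles found in the Installs, Price and Size columns
sample = pd.Series(["10,000+", "$4.99", "19M", "201k", "Varies with device"])
cleaned = to_number(sample)
print(cleaned.tolist())
```

Keeping unknown sizes as `NaN` means they are skipped by `mean()` and `corr()` instead of being counted as zero-sized apps; whether that is preferable depends on the analysis.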

### What did you know about your dataset?

The dataset includes information about Android apps on the Play Store and user reviews for these apps. It contains details like app names, categories, ratings, reviews, installs, sizes, and more. Some rows have missing values, and the user reviews dataset includes sentiment labels and polarity for the reviews.

User Reviews Dataset has 64295 rows and 5 columns.
Play Store Data Dataset has 10841 rows and 13 columns.

## ***2.2. Cleaning the data of User reviews database***

Eliminating the rows with null values from the dataset

In [None]:
# Eliminating the rows whose 'Sentiment' is null

non_null_user_reviews_df = user_reviews_df.dropna(subset=['Sentiment'])
non_null_user_reviews_df.info()

In [None]:
non_null_user_reviews_df

### Variables Description

**User Reviews Dataset:**

*   **App**: Name of the app the review belongs to.
*   **Translated_Review**: User review text.
*   **Sentiment**: Review sentiment label.
*   **Sentiment_Polarity**: Sentiment polarity score.
*   **Sentiment_Subjectivity**: Sentiment subjectivity score.

**Play Store Data Dataset:**

*   **App**: Name of the Android app.
*   **Category**: App category or genre.
*   **Rating**: Average rating on the Play Store.
*   **Reviews**: Number of user reviews.
*   **Size**: App size.
*   **Installs**: Number of app installations.
*   **Type**: App type (Free or Paid).
*   **Price**: App price (0 for Free apps).
*   **Content Rating**: Age group suitability.
*   **Genres**: Additional app genres or tags.
*   **Last Updated**: Date of last app update.
*   **Current Ver**: Current app version.
*   **Android Ver**: Minimum required Android version.

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Checking whether any app appears more than once
dupli_col_chk = playstore_data_df.duplicated(subset='App').any()
print(dupli_col_chk)


In [None]:
#Apps and their counts
playstore_data_df['App'].value_counts()


In [None]:
# Taking the last row of data for each app, then ordering the result by app name
ps_last_r_df = playstore_data_df.groupby('App').tail(1).reset_index()
app_rev_max_df = ps_last_r_df.loc[ps_last_r_df.groupby(['App'])['Reviews'].idxmax()]
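The "last row per app" step can also be expressed with `drop_duplicates`. The sketch below uses a made-up frame (`apps` is hypothetical) to show that both forms select the same rows.

```python
import pandas as pd

# Toy data: apps "A" and "B" each appear twice
apps = pd.DataFrame({
    "App": ["A", "B", "A", "C", "B"],
    "Reviews": [10, 5, 12, 7, 5],
})

# groupby(...).tail(1) keeps the last row seen for each app, in original row order
last_rows = apps.groupby("App").tail(1)

# drop_duplicates with keep="last" expresses the same selection directly
deduped = apps.drop_duplicates(subset="App", keep="last")

print(deduped)
```

Either form leaves exactly one row per app, so later counts and sums are not inflated by repeated listings.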

In [None]:
# Getting the number of apps per genre
# (column names are set explicitly so the result is the same across pandas versions)
top_genres_df = app_rev_max_df['Genres'].value_counts().rename_axis('Genres').reset_index(name='Count')

In [None]:
top_genres_df

In [None]:
app_rev_max_df[app_rev_max_df['Price'] == 0]

In [None]:
#Preparing dataframe which contains free app install counts
genres_free_apps_installs_df = app_rev_max_df[app_rev_max_df['Price'] == 0].groupby(['Genres'])[['Installs']].sum()
genres_free_apps_installs_df

Explanation: The resulting DataFrame genres_free_apps_installs_df is indexed by 'Genres' and has a single column, 'Installs'. Each row represents a genre, and 'Installs' holds the total number of installs for all free apps in that genre.

Overall, this code prepares a DataFrame that shows the total install counts for each genre, considering only free apps from the deduplicated DataFrame app_rev_max_df.
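The free and paid totals per genre can also be computed in one step with a pivot table. This is a sketch on made-up data; the `Kind` column and the `apps` frame are hypothetical names introduced only for the illustration.

```python
import pandas as pd

# Toy data with the same column names as the cleaned Play Store dataframe
apps = pd.DataFrame({
    "Genres": ["Tools", "Tools", "Action", "Action", "Puzzle"],
    "Price": [0.0, 2.99, 0.0, 0.0, 0.99],
    "Installs": [1000, 50, 5000, 100, 10],
})

# Label each row as Free or Paid, then sum installs per genre and label
apps["Kind"] = apps["Price"].eq(0).map({True: "Free", False: "Paid"})
installs_by_genre = apps.pivot_table(
    index="Genres", columns="Kind", values="Installs", aggfunc="sum", fill_value=0
)

print(installs_by_genre)
```

With `fill_value=0`, a genre that has only free or only paid apps still gets a row, with 0 in the missing column.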

In [None]:
# Preparing dataframe which contains paid app install counts
# (uses the deduplicated dataframe, like the free-app counts, so repeated apps are not counted twice)
genres_paid_apps_installs_df = app_rev_max_df[app_rev_max_df['Price'] != 0].groupby(['Genres'])[['Installs']].sum().rename(columns={'Installs':'Paid_app_installs'})

In [None]:
#Preparing dataframe which contains mean Rating
genres_ratings_df = app_rev_max_df.groupby(['Genres'])[['Rating']].mean()
genres_ratings_df

In [None]:
# Merging all the previous dataframes (inner joins keep only genres present in every dataframe)
top_genres_installs_df = pd.merge(top_genres_df, genres_free_apps_installs_df, on='Genres')
top_genres_apps_installs_df = pd.merge(top_genres_installs_df, genres_paid_apps_installs_df, on='Genres')
top_genres_apps_installsr_df = pd.merge(top_genres_apps_installs_df, genres_ratings_df, on='Genres')

# Getting the top 50 genres by app count
top_50_genres_df = top_genres_apps_installsr_df.head(50)
top_50_genres_df
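Because `pd.merge` defaults to an inner join, a genre with no paid apps disappears from the merged table. A left join keeps it, as the sketch below shows on made-up frames (`counts` and `paid` are hypothetical stand-ins for the dataframes above).

```python
import pandas as pd

counts = pd.DataFrame({"Genres": ["Tools", "Action", "Comics"], "Count": [30, 20, 5]})
paid = pd.DataFrame({"Genres": ["Tools", "Action"], "Paid_app_installs": [50, 10]})

# Default inner join: 'Comics' is dropped because it has no paid apps
inner = pd.merge(counts, paid, on="Genres")

# Left join: every genre from `counts` is kept; missing paid installs become 0
left = pd.merge(counts, paid, on="Genres", how="left")
left["Paid_app_installs"] = left["Paid_app_installs"].fillna(0)

print(inner)
print(left)
```

Which join is appropriate depends on the question: the inner join restricts the charts to genres that have both free and paid apps, while the left join shows every genre.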

### What all manipulations have you done and insights you found?

Manipulations:

*   Removed duplicate app rows, keeping one row per app, to ensure data cleanliness and avoid counting the same app multiple times.
*   Handled missing values by dropping the affected rows (for example, user reviews without a sentiment label).

Insights (hypothetical examples):

*   Average Rating by Category: Identify which app categories have the highest and lowest average ratings, helping developers understand which categories are rated best by users.
*   Distribution of Installs by Content Rating: Analyze the spread of app installs based on the content rating, potentially revealing which age groups show higher interest in different types of apps.
*   Relationship between App Size and Rating: Explore whether there is any correlation between app size and user ratings, which might indicate whether users prefer smaller or larger apps.
*   Distribution of App Prices by Category: Discover the price distribution across different app categories, providing insights into which categories tend to have more expensive apps.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 Exploring app 'Genres' and Count

This bar chart answers the following questions:

*   Which genres have the highest number of apps?
*   Which genres have the lowest number of apps?

In [None]:
plt.figure(figsize=(20, 10))
# Plot only the 50 genres with the most apps, matching the chart title
sns.barplot(data=top_genres_df.head(50), x='Genres', y='Count')
plt.xticks(rotation=90)
plt.title('Top 50 Genres VS App Counts')
plt.ylabel('Number of applications')
plt.xlabel('Genres')

plt.show()







##### 1. Why did you pick the specific chart?

The apps in the dataset can be grouped by one feature, or by a combination of features, to view the information from different angles. The Play Store provides a feature called 'Genres', which denotes the style or category an app belongs to. One app can have more than one genre. A bar chart is used here because it compares the number of apps across genres directly.

This bar chart answers the following questions:

*   Which genres have the highest number of apps?
*   Which genres have the lowest number of apps?

#### Chart - 2 Exploring Installed Free and Paid Apps: A Comparative Analysis

In this analysis, we will explore and compare the installation trends of free and paid apps across different genres. The main objectives are as follows:

Identify the genres with the highest number of installations for free apps: We will analyze which genres attract the most installations for free apps, helping us understand user preferences in terms of freely available applications.

Identify the genres with the highest number of installations for paid apps: We will examine which genres garner the most installations for paid apps, providing insights into the market demand for paid applications.

Analyze the preferred purposes of paid apps: We will study the genres of paid apps that are most preferred by users, shedding light on the types of applications users are willing to pay for.

Compare free and paid apps within the same genre: We will assess which genres have free apps that offer better service or features compared to their corresponding paid versions, providing valuable insights into user behavior and willingness to invest in premium versions of applications.

This analysis aims to provide a comprehensive understanding of user preferences and trends in the free and paid app market across different genres.







In [None]:
#Plotting Top 50 Genres VS Free apps install count chart

plt.figure(figsize=(20, 10))
sns.barplot(data=top_50_genres_df, x='Genres', y='Installs')
plt.xticks(rotation=90)
plt.title('Top 50 Genres VS Free App Install Counts')
plt.ylabel('Number of installations (1000 millions)')
plt.xlabel('Genres')

plt.show()

In [None]:
#Plotting Top 50 Genres VS Paid apps install count chart

plt.figure(figsize=(20, 10))
sns.barplot(data=top_50_genres_df, x='Genres', y='Paid_app_installs')
plt.xticks(rotation=90)  # rotate the genre labels so they do not overlap
plt.title('Top 50 Genres VS Paid App Install Counts')
plt.ylabel('Number of paid installations (1000 millions)')
plt.xlabel('Genres')

plt.show()








##### 1. Why did you pick the specific chart?

People download, install and use apps based on their needs or interests. On the Play Store, both free and paid apps are available in every genre or category. People generally prefer free apps over paid apps, and turn to paid apps only when the free apps in a genre or category do not fulfil their purpose. So we have again plotted bar charts, this time of the top genres against install counts for both free and paid apps.

##### 2. What is/are the insight(s) found from the chart?

This gives the following insights:

*   Genres with the highest number of free apps installed
*   Genres with the highest number of paid apps installed
*   Preferred purposes of paid apps
*   Free apps vs. paid apps within the same genre
*   Popularity of freemium models



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Targeted Marketing and User Acquisition: Understanding the genres with the highest number of free and paid app installations can help businesses target their marketing efforts more effectively. They can tailor their marketing campaigns to attract users who are more likely to be interested in their apps.

App Monetization Strategies: Identifying the genres where users are willing to pay for apps can guide businesses in developing premium or subscription-based versions of their apps. This can lead to increased revenue and higher profitability.

Product Development and Enhancements: Analyzing user preferences and popular genres can help businesses identify areas for improvement in their existing apps or guide the development of new apps that align with market demand.

User Experience and Retention: By understanding the purposes and preferences of users for paid apps, businesses can enhance the user experience and implement features that keep users engaged and satisfied. This can lead to higher user retention and loyalty.

Negative Growth:

While insights can mostly lead to positive business outcomes, there are certain scenarios where they might not result in favorable outcomes:

Lack of Monetization Opportunities: If the insights reveal that users are primarily interested in free apps and are less willing to pay for apps in certain genres, it might limit revenue generation opportunities for businesses operating in those genres.

Saturated Markets: Insights indicating high competition and saturation in certain genres may make it challenging for new entrants to gain traction and grow. Businesses might face difficulty standing out in overcrowded app markets.

Misinterpretation or Mismanagement of Insights: Incorrect interpretation or ineffective implementation of insights can lead to wasted resources and efforts. It's essential for businesses to carefully analyze and act upon the data to achieve positive results.

Overall, the impact of insights on business growth depends on the ability of the business to leverage the data effectively, adapt to market dynamics, and respond to user preferences with meaningful strategies and actions. Effective decision-making, user-centric approaches, and continuous adaptation to changing market demands are key factors in converting insights into positive business impact.

#### Chart - 3 Analysing the relationship between the free apps, paid apps and their price

As we saw earlier, people generally go with free apps, but when they really need an app they buy a paid one. The price of an app may vary based on its features and uses. Here we compare paid and free apps and look at how prices vary by app category.


1.  The 1st plot compares the installation counts of Free and Paid apps.
2.  The 2nd plot shows the price ranges of paid apps as a histogram.
3.  The 3rd plot shows the mean price of paid apps in each category.




In [None]:
# Chart - 3 visualization code
#Category wise free and paid app installs count
categoty_type_installs_df = playstore_data_df.groupby(['Category','Type'])[['Installs']].sum().unstack().reset_index()
categoty_type_installs_df = categoty_type_installs_df[~categoty_type_installs_df['Installs']['Paid'].isna()].set_index('Category')
categoty_type_installs_df

In [None]:
#Plot between Paid and Free installed app counts
color_paid = '#4472c4'  # blue
color_free = '#ed7d31'  # orange

ind = categoty_type_installs_df.index
column0 = categoty_type_installs_df['Installs']['Paid']
column1 = categoty_type_installs_df['Installs']['Free']
title0 = 'Paid app install Counts(in millions)'
title1 = 'Free app install Counts(in 100 millions)'

fig, axes = plt.subplots(figsize=(20,10), ncols=2, sharey=True)
fig.tight_layout()

axes[0].barh(ind, column0, align='center', color=color_paid, zorder=10)
axes[0].set_title(title0, fontsize=18, pad=15, color=color_paid)
axes[1].barh(ind, column1, align='center', color=color_free, zorder=10)
axes[1].set_title(title1, fontsize=18, pad=15, color=color_free)

# If you have positive numbers and want to invert the x-axis of the left plot
axes[0].invert_xaxis()

# To show data from highest to lowest
# plt.gca().invert_yaxis()
axes[0].set(yticks=ind, yticklabels=ind)
axes[0].yaxis.tick_left()

plt.subplots_adjust(wspace=0, top=0.85, bottom=0.1, left=0.18, right=0.95)

In [None]:
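# Histogram of the price range of paid apps
# (each comparison is parenthesised because & binds more tightly than !=)
paid_mask = playstore_data_df['Price'].notna() & (playstore_data_df['Price'] != 0)
price_df = playstore_data_df[paid_mask]['Price']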
plt.hist(price_df.values, color='Red')
plt.xlabel('Price(USD)')
plt.ylabel('Frequency')
plt.title('Histogram of Price')

In [None]:
# Mean app price across categories for Paid apps
category_price_mean_df = playstore_data_df[playstore_data_df['Price'] != 0].groupby(['Category'])['Price'].mean().reset_index(name='Price')
ax = sns.stripplot(x='Price', y='Category', data=category_price_mean_df, jitter=True, linewidth=1)
ax.set_title('App pricing trend across categories(in USD)')

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4 Based On Rating analysis

Analysing app ratings is important because the rating shows how much people like or dislike an app. The first plot is a histogram of the ratings, and the second plot shows the average rating per category.

In [None]:
# Chart - 4 visualization code
# Plotting the histogram of Rating
rating_df = playstore_data_df[~playstore_data_df['Rating'].isna()]['Rating']

# histplot with kde=True replaces the deprecated distplot
sns.histplot(rating_df, kde=True)
plt.title('Histogram of Rating')

In [None]:
# Plotting category-wise mean Rating
category_mean_rating_df = playstore_data_df.groupby(['Category'])['Rating'].mean().reset_index(name='Rating')
# Set the figure and font size before plotting so they apply to this chart as well
plt.rcParams['figure.figsize'] = (20, 10)
plt.rc('font', size=14)
category_mean_rating_df.set_index('Category').plot(kind='bar')
plt.xticks(rotation=90)
plt.title('Category VS Mean Rating')
plt.ylabel('Ratings out of 5')
plt.xlabel('Category')

In [None]:
# Number of apps in each content rating (target age group)
content_rating_df = playstore_data_df['Content Rating'].value_counts()
content_rating_df

In [None]:
content_rating_df.plot(kind='pie', fontsize=10, explode= (0.1,0.2,0.3,0.4,0.5,0.1), autopct='%1.2f%%', pctdistance=1.1, labeldistance=1.2)

In [None]:
# Getting the 10 most downloaded apps with their Ratings and Reviews
top_downloaded_apps = playstore_data_df.groupby('App').tail(1).sort_values(['Installs','Rating'], ascending=False).head(50)
top_10_downloaded_apps = top_downloaded_apps.head(10).set_index('App')[['Rating','Reviews']].sort_values(['Reviews'])

In [None]:
# Plotting reviews and ratings of the 10 most downloaded apps
color_reviews = '#4472c4'  # blue
color_rating = '#ed7d31'   # orange

ind = top_10_downloaded_apps.index
column0 = top_10_downloaded_apps['Reviews']
column1 = top_10_downloaded_apps['Rating']
# Titles match the plotted columns: reviews on the left, ratings on the right
title0 = 'Total reviews'
title1 = 'Rating out of 5'

fig, axes = plt.subplots(figsize=(10,5), ncols=2, sharey=True)
fig.tight_layout()

axes[0].barh(ind, column0, align='center', color=color_reviews, zorder=10)
axes[0].set_title(title0, fontsize=18, pad=15, color=color_reviews)
axes[1].barh(ind, column1, align='center', color=color_rating, zorder=10)
axes[1].set_title(title1, fontsize=18, pad=15, color=color_rating)

# If you have positive numbers and want to invert the x-axis of the left plot
axes[0].invert_xaxis()

# To show data from highest to lowest
# plt.gca().invert_yaxis()
axes[0].set(yticks=ind, yticklabels=ind)
axes[0].yaxis.tick_left()

# Tick labels are left to matplotlib so that they match the plotted values

plt.subplots_adjust(wspace=0, top=0.85, bottom=0.1, left=0.18, right=0.95)

#### Chart - 5 - Analysing the user Subjectivity

In [None]:
# Chart - 5 visualization code
# Plotting the distribution of Subjectivity (histplot with kde=True replaces the deprecated distplot)
sentiment_subjectivity_df = non_null_user_reviews_df['Sentiment_Subjectivity']
sns.histplot(sentiment_subjectivity_df, kde=True)
plt.xlabel("Subjectivity")
plt.title('Distribution of Subjectivity')

##### 1. Why did you pick the specific chart?



Most sentiment subjectivity values lie between 0.4 and 0.7. From this we can conclude that most users write reviews that reflect their personal experience with the application.
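A reading like this can be checked numerically by computing the share of values inside the band. The sketch below uses made-up subjectivity scores; in the notebook the series would be `non_null_user_reviews_df['Sentiment_Subjectivity']`.

```python
import pandas as pd

# Made-up subjectivity scores between 0 and 1 (illustrative only)
subjectivity = pd.Series([0.10, 0.45, 0.50, 0.62, 0.70, 0.90, 0.55, 0.30])

# between() is inclusive on both ends by default
in_band = subjectivity.between(0.4, 0.7)

# The mean of a boolean series is the fraction of True values
share = in_band.mean()

print(f"{share:.1%} of the scores lie between 0.4 and 0.7")
```

Reporting the share alongside the histogram makes the statement about "most" reviews verifiable.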

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6  Sentiment Subjectivity and Sentiment Polarity Relationship analysis

The scatter plot below suggests that sentiment subjectivity is not always proportional to sentiment polarity, although in most cases the two show a proportional relationship when the variance is very high or very low.

In [None]:
# Chart - 6 visualization code
# Plotting the relationship between sentiment_subjectivity and sentiment_polarity in scatter plot
sns.scatterplot(data=non_null_user_reviews_df, x='Sentiment_Subjectivity', y='Sentiment_Polarity')
plt.title("Relationship between sentiment_subjectivity and sentiment_polarity")
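How closely the two scores move together can be quantified with a correlation coefficient. The sketch below uses made-up values and compares subjectivity against both the signed polarity and its absolute value; looking at the absolute value is an assumption made for this illustration, not a step taken in the notebook.

```python
import pandas as pd

# Made-up scores (illustrative only)
reviews = pd.DataFrame({
    "Sentiment_Subjectivity": [0.0, 0.2, 0.5, 0.8, 1.0, 0.9],
    "Sentiment_Polarity": [0.0, 0.1, -0.3, 0.6, -0.9, 0.8],
})

# Pearson correlation with the signed polarity
signed = reviews["Sentiment_Subjectivity"].corr(reviews["Sentiment_Polarity"])

# Pearson correlation with the strength of the polarity, ignoring its direction
strength = reviews["Sentiment_Subjectivity"].corr(reviews["Sentiment_Polarity"].abs())

print(f"signed: {signed:.2f}, strength: {strength:.2f}")
```

In this toy data, positive and negative opinions cancel out in the signed correlation, while the absolute value shows that more subjective reviews carry stronger opinions.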

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7 - Sentiment percentage analysis


The plot below shows that Positive reviews have the largest share, so it can be concluded that the overall sentiment of the reviews is positive.

In [None]:
# Chart - 7 visualization code
#Counts of Review sentiments
non_null_user_reviews_df['Sentiment'].value_counts()

In [None]:
#Percentage of Review sentiments
non_null_user_reviews_df['Sentiment'].value_counts().plot(kind='pie', explode= (0.1,0.1,0.1), shadow=True, autopct='%1.2f%%', pctdistance=1.1, labeldistance=1.2)
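The percentages shown on the pie chart can also be obtained as a table with `value_counts(normalize=True)`. A minimal sketch on made-up labels:

```python
import pandas as pd

# Made-up sentiment labels (illustrative only)
sentiment = pd.Series(["Positive", "Negative", "Positive", "Neutral", "Positive"])

# normalize=True returns fractions instead of counts; scale them to percentages
share = sentiment.value_counts(normalize=True).mul(100).round(2)

print(share)
```

A table of percentages is easier to quote in the written insights than values read off a chart.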

##### 1. Why did you pick the specific chart?

To figure out which review sentiments occur more often than the others.

#### Chart - 8 **Heatmap**

In [None]:
# Finding correlation between the numeric columns in the play store data
playstore_data_df.corr(numeric_only=True)

In [None]:
# Heat map for play_store
plt.figure(figsize = (20,10))
sns.heatmap(playstore_data_df.corr(numeric_only=True), annot= True)
plt.title('Correlation Heatmap for Playstore Data', size=20)

*   There is a strong positive correlation between the Reviews and Installs columns. This is expected: the higher the number of installs, the larger the user base, and the higher the total number of reviews left by users.
*   Price is slightly negatively correlated with Rating, Reviews and Installs. This means that as the price of an app increases, the average rating, total number of reviews and installs fall slightly.
*   Rating is slightly positively correlated with Installs and Reviews. This indicates that as the average user rating increases, app installs and the number of reviews also increase.
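Install and review counts span several orders of magnitude, so a Pearson coefficient can be dominated by a few very large apps. A rank-based (Spearman) correlation or a log transform is a common cross-check; the sketch below applies both to made-up numbers and is not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Made-up counts spanning several orders of magnitude (illustrative only)
apps = pd.DataFrame({
    "Installs": [100, 1_000, 10_000, 1_000_000, 100_000_000],
    "Reviews": [3, 40, 250, 30_000, 2_000_000],
})

# Linear correlation on the raw values
pearson = apps.corr(method="pearson").loc["Installs", "Reviews"]

# Rank-based correlation, insensitive to the size of the largest values
spearman = apps.corr(method="spearman").loc["Installs", "Reviews"]

# Linear correlation after compressing the scale with log10
log_pearson = np.log10(apps).corr().loc["Installs", "Reviews"]

print(f"pearson={pearson:.3f}, spearman={spearman:.3f}, log_pearson={log_pearson:.3f}")
```

If the three values agree, the relationship is unlikely to be an artifact of a handful of very large apps.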

Let us check whether there is any correlation across the two dataframes.

In [None]:
merged_df = pd.merge(playstore_data_df, user_reviews_df, on='App', how = "inner")
def merged_dfinfo():
  temp = pd.DataFrame(index=merged_df.columns)
  temp['data_type'] = merged_df.dtypes
  temp["count of non null values"] = merged_df.count()
  temp['NaN values'] = merged_df.isnull().sum()
  # isnull().mean() is a fraction, so scale it to a percentage
  temp['% NaN values'] = merged_df.isnull().mean() * 100
  temp['unique_count'] = merged_df.nunique()
  return temp
merged_dfinfo()

In [None]:
# Heat Map for the merged data frame
plt.figure(figsize=(30, 20))  # Adjust the figure size as needed
heatmap = sns.heatmap(merged_df.corr(numeric_only=True), annot=True, cmap='Greens', fmt='.2f', annot_kws={"size": 10})
plt.title('Heatmap for merged Dataframe', size=20)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.show()

In [None]:
merged_df = merged_df.dropna(subset=['Sentiment', 'Translated_Review'])

In [None]:
merged_df.head()

##### 1. Why did you pick the specific chart?

There is a strong positive correlation between the Reviews and Installs columns. This is expected: the higher the number of installs, the larger the user base, and the higher the total number of reviews left by users. Price is slightly negatively correlated with Rating, Reviews and Installs. This means that as the price of an app increases, the average rating, total number of reviews and installs fall slightly. Rating is slightly positively correlated with Installs and Reviews. This indicates that as the average user rating increases, app installs and the number of reviews also increase.

#### Chart - 9 Top categories on Google Playstore

In [None]:
playstore_data_df.groupby("Category")["App"].count().sort_values(ascending= False)

In [None]:
# Number of apps per category (x_list) and the matching category names (y_list)
category_counts = playstore_data_df['Category'].value_counts()
x_list = category_counts.tolist()
y_list = category_counts.index.tolist()

In [None]:
#Number of apps belonging to each category in the playstore
plt.figure(figsize=(20,10))
graph = sns.barplot(y = x_list, x = y_list, palette= "tab10")
# Categories are on the x-axis and app counts on the y-axis
plt.xlabel('App Categories', size=15)
plt.ylabel('Number of Apps', size=15)
graph.set_title("Top categories on Playstore", fontsize = 25)
graph.set_xticklabels(graph.get_xticklabels(), rotation= 45, horizontalalignment='right');

In [None]:
# Percentage of apps belonging to each category in the playstore
plt.figure(figsize=(18,18))
plt.pie(playstore_data_df.Category.value_counts(), labels=playstore_data_df.Category.value_counts().index, autopct='%1.2f%%')
my_circle = plt.Circle( (0,0), 0.50, color='white')
p=plt.gcf()
p.gca().add_artist(my_circle)
plt.title('% of apps share in each Category', fontsize = 25)
plt.show()

##### 1. Why did you pick the specific chart?

To find the categories that contain the most apps.

##### 2. What is/are the insight(s) found from the chart?

There are 33 categories in total in the dataset. From the above output we can conclude that most apps on the Play Store fall under the FAMILY and GAME categories, and the fewest under the EVENTS and BEAUTY categories.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This information helps developers understand which categories are most crowded in the market, which they can weigh against popularity and profitability from a business point of view.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Based on the analysis conducted and the insights gained from the project, here are some recommendations for the client to achieve their business objective of "App Engagement Optimization":

*  Focus on Top Genres and Categories: The client should prioritize developing apps in the top-performing genres and categories identified in the analysis. By aligning their app offerings with popular user preferences, they can increase the chances of app engagement and success.
*   Improve App Ratings: Given the importance of app ratings in user decision-making, the client should focus on improving app quality, user experience, and addressing user feedback. Positive reviews and high ratings can significantly boost app visibility and engagement.
*   Consider Pricing Strategies: Analyzing the variation between free and paid apps can help the client decide on the most suitable pricing model for their apps. They should carefully consider the value they offer to users and strike a balance between monetization and user satisfaction.
*  Continuous Improvement: To ensure long-term success, the client should adopt a data-driven approach and regularly analyze user feedback and app performance. Continuous improvement and updates based on user preferences will help maintain engagement and retain users.
* Track the Top Categories on the Google Play Store: Use them to figure out which kinds of apps customers like and use the most.








# **Conclusion**

Through a comprehensive analysis of Play Store app data and customer reviews, this project successfully uncovered valuable insights into factors that influence app engagement and success in the Android market. By examining app attributes, ratings, and user sentiments, we identified key aspects that developers can leverage to enhance their app's appeal and performance.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***