#Capstone-2

# **Project Name**    -  Play Store App Review Analysis



##### **Project Type**    - EDA(Exploratory Data Analysis)
##### **Contribution**    - Individual
##### **Team Member 1 -**  **Surya Rohila**


# **Project Summary -**




**Comprehensive Exploratory Data Analysis (EDA) Summary**

This comprehensive Exploratory Data Analysis (EDA) aims to uncover key trends and factors driving app success and user satisfaction within the Google Play Store ecosystem, leveraging two primary datasets: 'Play Store Data.csv' (containing app metrics) and 'User Reviews.csv' (containing sentiment analysis).

**Data Acquisition and Cleaning:**
The initial process involved loading the two datasets and identifying data quality issues such as missing values, erroneous entries, and inconsistent data types. A known misaligned row in the Play Store data (row 10472 in the raw file) was removed by filtering out rows whose 'Installs' or 'Price' values are not numeric. Columns like 'Installs', 'Size', and 'Price' required extensive cleaning to convert them from object (string) format with characters like '+', ',', 'M', 'k', and '$' into usable numeric types. Missing 'Rating' values, which are central to success metrics, were imputed with the median rating. Duplicate listings were removed, leaving approximately 9,660 unique apps. Reviews with missing text or sentiment fields were dropped, and the two datasets were then merged on the common 'App' name with an inner join.

**Key Findings and Insights:**
1.  **High Concentration of Free Apps:** The vast majority of apps (over 90%) are 'Free'. While free apps capture the bulk of installations, 'Paid' apps, though few, are concentrated in the 'Finance' and 'Medical' categories, suggesting a willingness among users to pay for specialized, high-value tools.
2.  **Rating Distribution:** The average app rating is high (around 4.2), with ratings clustered at the upper end of the scale (a left-skewed distribution). There is a strong, positive correlation between the number of 'Reviews' and 'Installs', indicating that apps with more reviews also tend to have more installs.
3.  **App Category Performance:** The 'Family', 'Game', and 'Tools' categories dominate the Play Store in terms of sheer app count. However, categories like 'Communication' and 'Social' boast the highest installation counts, reflecting their mass-market appeal and necessity. The 'Lifestyle' and 'Health & Fitness' categories show a noticeable amount of negative sentiment, suggesting areas ripe for improvement in app functionality or user experience.
4.  **Sentiment Analysis:** Over 60% of user sentiments are 'Positive', followed by 'Negative' and then 'Neutral'. The analysis revealed that apps with lower average ratings (e.g., in the 3.0-3.5 range) tend to see a higher proportion of negative reviews, which is expected but provides a strong target for quality control.
5.  **App Size:** App size has a negligible correlation with installations or rating, suggesting users are willing to download larger apps if the perceived utility is high. Most apps fall into the 10-30MB range.

**Business Implications:**
The analysis suggests several opportunities for business growth. To maximize installations, developers should focus on mass-market categories like 'Communication' and 'Social'. However, to maximize revenue, developers should target specialized 'Paid' niches like 'Finance' and 'Medical'. Furthermore, targeted sentiment analysis can be used to isolate apps in categories with high negative sentiment (e.g., 'Lifestyle') and deploy user experience fixes, thereby converting negative perception into positive business outcomes (higher retention, better rating). The strong relationship between reviews and installs emphasizes the importance of implementing effective in-app review prompts and user engagement strategies. The high baseline rating suggests that any new app must achieve a rating of 4.0 or above merely to be considered competitive.

This EDA provides a data-driven foundation for strategic decision-making, from product development and category targeting to monetization and user engagement strategy.

# **GitHub Link -**

https://github.com/suryarajput069/Play-Store-App-Review-Analysis1/blob/main/Play%20Store%20App%20Review%20Analysis.ipynb

# **Problem Statement**


The Google Play Store is a highly competitive and saturated market. App developers and platform owners need a deep understanding of the key factors that drive app success. The primary problem is to identify the characteristics (e.g., category, price, size, content rating) and user feedback indicators (e.g., rating, sentiment) that are most strongly correlated with high install counts and high ratings. Specifically, we need to uncover:
1.  Which app categories are most popular/profitable.
2.  The impact of pricing model (Free vs. Paid) on market penetration.
3.  The relationship between app attributes (like size and number of reviews) and overall user rating.
4.  How user sentiment expressed in reviews correlates with the quantitative 'Rating' score.

#### **Define Your Business Objective?**

The main business objective is to provide actionable data-driven recommendations to app developers, marketing teams, and the Google Play Store platform to:
1.  **Maximize App Visibility and Installation:** Identify app characteristics that lead to the highest volume of installations.
2.  **Optimize Monetization Strategy:** Determine which categories and pricing structures (Free vs. Paid) yield the best balance of market reach and revenue potential.
3.  **Enhance User Satisfaction and Retention:** Use sentiment analysis to pinpoint areas of poor user experience, thereby improving app quality and sustaining higher ratings.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

# Setting visualization styles
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Download 'Play Store Data.csv' from Google Drive
! gdown --id 1itXfKPYRqHOYIUNJPs70jBHiaYjpd1b1

In [None]:
# Download 'User Reviews.csv' from Google Drive
! gdown --id 1yDT-VXin7l428jkTMrv9jk-w1tek6pEc

In [None]:
# Load Dataset
# Load the two provided CSV files
try:
    df_play = pd.read_csv('Play Store Data.csv')
    df_reviews = pd.read_csv('User Reviews.csv')
    print("Datasets loaded successfully.")
except FileNotFoundError as err:
    # Re-raise with a clearer message; calling exit() would shut down the notebook kernel
    raise FileNotFoundError(
        "One or both CSV files not found. Please ensure 'Play Store Data.csv' "
        "and 'User Reviews.csv' are in the working directory."
    ) from err

### Dataset First View

In [None]:
# Dataset First Look
print("Dataset First View (Play Store Data) ---")
print(df_play.head())
print("Dataset First View (User Reviews) ---")
print(df_reviews.head())

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("\n---  Dataset Rows & Columns Count ---")
print(f"Play Store Data (df_play) - Rows: {df_play.shape[0]}, Columns: {df_play.shape[1]}")
print(f"User Reviews (df_reviews) - Rows: {df_reviews.shape[0]}, Columns: {df_reviews.shape[1]}")

### Dataset Information

In [None]:
# Dataset Info
print("\n---  Dataset Information (df_play) ---")
print(df_play.info())
print("\n---  Dataset Information (df_reviews) ---")
print(df_reviews.info())

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# Check and count duplicates in df_play based on 'App' name (to identify multiple listings)
duplicate_apps = df_play.duplicated(subset=['App']).sum()
print(f"Play Store Data: {duplicate_apps} duplicate app entries found.")
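# Keep the entry with the most reviews for each app.
# 'Reviews' is still a string (object) column at this point, so sort on its numeric value;
# sorting the raw strings would order them alphabetically.
df_play.sort_values(by='Reviews', key=lambda s: pd.to_numeric(s, errors='coerce'), ascending=False, inplace=True)
df_play.drop_duplicates(subset=['App'], inplace=True, keep='first')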
print(f"Play Store Data after dropping duplicates: Rows: {df_play.shape[0]}")
# Check duplicates in df_reviews
duplicate_reviews = df_reviews.duplicated().sum()
print(f"User Reviews: {duplicate_reviews} duplicate review entries found.")
df_reviews.drop_duplicates(inplace=True)
print(f"User Reviews after dropping duplicates: Rows: {df_reviews.shape[0]}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print("Play Store Data Missing Values:")
print(df_play.isnull().sum())
print("\nUser Reviews Missing Values:")
print(df_reviews.isnull().sum())

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10, 6))
plt.subplot(1, 2, 1)
sns.heatmap(df_play.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Data in Play Store Data')
plt.subplot(1, 2, 2)
sns.heatmap(df_reviews.isnull(), cbar=False, cmap='magma')
plt.title('Missing Data in User Reviews Data')
plt.tight_layout()
plt.show()

### What did you know about your dataset?

* The 'Play Store Data' dataset contains details on approximately 9,660 unique apps after cleaning
duplicates. It is primarily composed of categorical and string columns. Numerical columns like
'Rating', 'Reviews', 'Size', 'Installs', and 'Price' are either stored as objects (due to
special characters like '+', ',', 'M', 'k', '$') or have a significant number of missing values
(e.g., 'Rating' with over 1400 NaNs).
* 'User Reviews' holds per-review text and sentiment scores and is intended for merging,
with substantial missing values in 'Translated_Review' (text) and the related sentiment columns.
Data cleaning and type conversion will be the most crucial pre-processing steps.
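
The missing-value counts above are easier to compare across columns when expressed as percentages. The following is a minimal sketch on a made-up stand-in frame; `toy_play` and `missing_summary` are hypothetical names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_play; the values are invented for illustration.
toy_play = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.5, np.nan, 3.9, np.nan],
    "Size": ["19M", "Varies with device", "8.7M", "201k"],
})

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    pct = (counts / len(df) * 100).round(1)
    summary = pd.DataFrame({"missing": counts, "missing_pct": pct})
    return summary.sort_values("missing", ascending=False)

summary = missing_summary(toy_play)
print(summary)
```

The same helper could be called on `df_play` and `df_reviews` to rank columns by how much cleaning they need.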

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("\n#  Dataset Columns and Types:")
print(df_play.dtypes)

In [None]:
# Dataset Describe

print(df_play.describe(include='all'))

### Variables Description

**Play Store Data:**
* **App:** The name of the application (Nominal, Key for joining).
* **Category:** Category app belongs to (Nominal).
* **Rating:** User rating (Continuous, Target variable for success).
* **Reviews:** Number of user reviews (Continuous, Needs cleaning).
* **Size:** Size of the application (Continuous, Needs cleaning).
* **Installs:** Number of times the app was downloaded (Continuous, Target variable, Needs cleaning).
* **Type:** Free or Paid (Nominal).
* **Price:** Price of the app (Continuous, Needs cleaning).
* **Content Rating:** Target age group (Ordinal).
* **Genres:** Genres the app belongs to (Nominal).
* **Last Updated, Current Ver, Android Ver:** Version and update information (Temporal/Nominal).

**User Reviews Data:**
* **Translated_Review:** Textual content of the review (Text).
* **Sentiment:** Positive, Negative, or Neutral (Nominal).
* **Sentiment_Polarity:** Score from -1 (Negative) to 1 (Positive) (Continuous).
* **Sentiment_Subjectivity:** Score from 0 (Objective) to 1 (Subjective) (Continuous).

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print("\n---  Unique Values Count for Key Columns ---")
print(f"Categories: {df_play['Category'].nunique()}")
print(f"Genres: {df_play['Genres'].nunique()}")
print(f"Content Rating: {df_play['Content Rating'].nunique()}")
print(f"Type: {df_play['Type'].nunique()}")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# 1. Handling the known misaligned row (index 10472 in the original data).
# Its values are shifted by one column, so 'Installs' can hold 'Free' and 'Price' can hold
# 'Everyone'. The numeric checks in steps 2 and 3 below drop it if it is still present.

# 2. Cleaning and converting 'Installs'
df_play['Installs'] = df_play['Installs'].astype(str).str.replace('+', '', regex=False).str.replace(',', '', regex=False)
# Handle a potential 'Free' or other non-numeric value in Installs before conversion
df_play = df_play[df_play['Installs'].str.isnumeric()]
df_play['Installs'] = pd.to_numeric(df_play['Installs'])

# 3. Cleaning and converting 'Price'
df_play['Price'] = df_play['Price'].astype(str).str.replace('$', '', regex=False)
# Handle a potential 'Everyone' or other non-numeric value in Price before conversion
df_play = df_play[df_play['Price'].str.replace('.', '', regex=False).str.isnumeric()]
df_play['Price'] = pd.to_numeric(df_play['Price'])

# 4. Cleaning and converting 'Reviews'
df_play['Reviews'] = pd.to_numeric(df_play['Reviews'], errors='coerce')

# 5. Cleaning and converting 'Size'
def clean_size(size):
    if isinstance(size, str):
        if 'M' in size:
            return float(size.replace('M', ''))
        elif 'k' in size:
            return float(size.replace('k', '')) / 1024
        else:
            return np.nan # Varies with device
    return size

df_play['Size_MB'] = df_play['Size'].apply(clean_size)
df_play.drop('Size', axis=1, inplace=True)

# 6. Imputing Missing Values in df_play
# Fill missing 'Rating' with the median
# (assign the result back; chained fillna(..., inplace=True) is deprecated in recent pandas)
median_rating = df_play['Rating'].median()
df_play['Rating'] = df_play['Rating'].fillna(median_rating)
# Fill missing 'Size_MB' with the median
median_size = df_play['Size_MB'].median()
df_play['Size_MB'] = df_play['Size_MB'].fillna(median_size)
# Drop remaining NaNs (less critical columns like Type, Content Rating, Genres, etc.)
df_play.dropna(inplace=True)

# 7. Data Wrangling on df_reviews
# Drop reviews that are missing the review text or any of the sentiment fields
df_reviews.dropna(subset=['Translated_Review', 'Sentiment', 'Sentiment_Polarity', 'Sentiment_Subjectivity'], inplace=True)

# 8. Merge DataFrames
# Merge the cleaned Play Store data with the cleaned User Reviews data on the 'App' column
df_merged = pd.merge(df_play, df_reviews, on='App', how='inner')
print("\nData Wrangling Complete. Merged Dataset Head:")
print(df_merged.head())
print(f"Merged Dataset Rows: {df_merged.shape[0]}")

### What all manipulations have you done and insights you found?

**Manipulations Performed:**
1. **Duplicate Removal:** Duplicates in the 'Play Store Data' were removed, keeping the entry with the highest number of 'Reviews' (assumed to be the most recent or relevant). Duplicates in 'User Reviews' were also removed.
2. **Data Type Conversion (Numerical):** 'Installs', 'Price', and 'Reviews' columns were cleaned of non-numeric characters ('+', ',', '$') and converted to the appropriate numeric types (`int` or `float`).
3. **Feature Engineering ('Size_MB'):** The 'Size' column was standardized by converting 'k' (kilobytes) to 'M' (megabytes) and handling 'Varies with device' (imputed with median size) to create a new numerical column `Size_MB`.
4. **Missing Value Imputation:** Missing 'Rating' values were imputed using the dataset's median rating (4.3), and 'Size_MB' was imputed using its median. Remaining minor NaNs in other columns were dropped.
5. **Merging:** The two cleaned datasets were merged using an inner join on the 'App' column, allowing for joint analysis of app metrics and user sentiment.
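
An inner join silently discards apps that have no reviews and reviews whose app is not in the cleaned Play Store data, so it can be worth checking how many keys match before merging. The sketch below uses made-up frames (`toy_play` and `toy_reviews` are hypothetical) and the `indicator` option of `pd.merge`.

```python
import pandas as pd

# Hypothetical toy frames standing in for df_play and df_reviews.
toy_play = pd.DataFrame({"App": ["A", "B", "C"], "Rating": [4.5, 4.1, 3.8]})
toy_reviews = pd.DataFrame({
    "App": ["A", "A", "C", "Z"],
    "Sentiment": ["Positive", "Negative", "Positive", "Neutral"],
})

# An outer join with indicator=True labels each app name by where it was found:
# 'both', 'left_only' (no reviews), or 'right_only' (reviews without app data).
coverage = pd.merge(
    toy_play[["App"]].drop_duplicates(),
    toy_reviews[["App"]].drop_duplicates(),
    on="App",
    how="outer",
    indicator=True,
)
coverage_counts = coverage["_merge"].value_counts()
print(coverage_counts)
```

Only the apps labeled `both` survive the inner join, which is why the merged dataset covers far fewer apps than the Play Store data alone.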

**Initial Insights Found:**
* **High Data Quality Requirement:** The initial quality assessment showed significant issues, particularly in numeric columns and missing target variables ('Rating').
* **App Uniqueness:** After initial cleaning, the `Play Store Data` still had duplicate app entries, suggesting potential multiple listings or scraping errors, necessitating duplicate removal.
* **Free Dominance:** A quick check of the 'Type' column shows that 'Free' apps are overwhelmingly dominant, suggesting that paid apps operate in niche markets or require a high-value proposition to succeed.
* **Sentiment Imbalance:** The review data is highly skewed towards 'Positive' sentiment, which might suggest a biased sampling or user tendency to leave reviews mostly after very positive or very negative experiences, with the 'Neutral' group being the smallest.
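
The sentiment imbalance can be quantified as class shares rather than judged by eye. This is a small sketch on invented labels (`toy_reviews` is hypothetical); on the real data the same call would be applied to `df_reviews['Sentiment']`.

```python
import pandas as pd

# Hypothetical sentiment labels; the proportions are invented for illustration.
toy_reviews = pd.DataFrame(
    {"Sentiment": ["Positive"] * 6 + ["Negative"] * 3 + ["Neutral"] * 1}
)

# normalize=True returns fractions that sum to 1; convert them to percentages.
shares = (
    toy_reviews["Sentiment"]
    .value_counts(normalize=True)
    .mul(100)
    .round(1)
)
print(shares)
```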

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1  Bar Chart: Top 15 App Categories by App Count

In [None]:
# Chart - 1 Top 15 App Categories by Count
plt.figure(figsize=(8, 4))
top_categories = df_play['Category'].value_counts().nlargest(15)
sns.barplot(x=top_categories.index, y=top_categories.values, palette='viridis')
plt.title('Top 15 App Categories by Count ')
plt.xlabel('Category')
plt.ylabel('Number of Apps')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is well suited to comparing a count across the levels of a categorical variable, here the number of apps in each 'Category'.

##### 2. What is/are the insight(s) found from the chart?

The 'FAMILY', 'GAME', and 'TOOLS' categories contain the most apps, which makes them the most crowded segments of the Play Store.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive - Knowing which categories are crowded helps developers decide where a new app needs a clear differentiator and where competition is lighter.

Negative- Entering the highly saturated "Family" or "Game" categories without a major differentiator significantly increases the risk of low visibility and high customer acquisition cost (CAC).

#### Chart - 2  Histogram: Distribution of App Ratings

In [None]:
# Chart - 2 visualization code
#Distribution of App Ratings
plt.figure(figsize=(8, 5))
sns.histplot(df_play['Rating'], bins=20, kde=True, color='skyblue')
plt.title('Distribution of App Ratings ')
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Histogram shows the shape of a numerical distribution. KDE overlay gives a smooth estimate.

##### 2. What is/are the insight(s) found from the chart?

The ratings are heavily skewed towards high values (mode is around 4.3-4.5), reflecting selection bias or high quality standards for surviving apps.
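
The visual impression of skew can be backed by a number. The sketch below computes sample skewness and the share of ratings at or above 4.0 on invented values (`toy_ratings` is hypothetical); a negative skewness corresponds to a long tail toward low ratings.

```python
import pandas as pd

# Hypothetical ratings: most are high, with a couple of low outliers.
toy_ratings = pd.Series(
    [4.5, 4.4, 4.3, 4.3, 4.2, 4.6, 4.1, 3.0, 2.0, 4.4], name="Rating"
)

# Negative skewness means the tail extends toward low ratings (left-skewed).
skewness = toy_ratings.skew()

# Share of apps that meet the 4.0 benchmark, in percent.
share_4_plus = (toy_ratings >= 4.0).mean() * 100

print(f"Skewness: {skewness:.2f}")
print(f"Share rated 4.0 or higher: {share_4_plus:.1f}%")
```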

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive - Sets a high benchmark (4.0+) for product quality that developers must target.

Negative - The tight cluster of ratings (low variance) means an app with an average rating (below 4.0) will likely be overlooked, leading to low organic visibility.

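#### Chart - 3  Histogram: Distribution of Installs (Log10 Scale)

In [None]:
# Chart - 3 Installs Distribution (Log Scaled)
plt.figure(figsize=(8, 5))
# Exclude apps with 0 installs, because log10(0) is -inf and breaks the binning
installs_positive = df_play.loc[df_play['Installs'] > 0, 'Installs']
sns.histplot(np.log10(installs_positive), bins=20, kde=True, color='salmon')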
plt.title('Installs Distribution (Log10 Transformed) ')
plt.xlabel('Log10(Installs)')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

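A histogram shows the shape of a numerical distribution. 'Installs' spans several orders of magnitude, so the values are log10-transformed to keep both small and very large apps visible on one axis.

##### 2. What is/are the insight(s) found from the chart?

On the raw scale the install counts are heavily right-skewed: only a small number of apps reach the highest install tiers. Because 'Installs' is recorded as thresholds (for example '1,000+' or '10,000+'), the log-scaled histogram shows discrete steps rather than a smooth curve.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive - Viewing installs by tier gives developers realistic adoption benchmarks instead of comparing against the few apps with extremely large install counts.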

#### Chart - 4  Pie Plot: Proportion of Free vs. Paid Apps

In [None]:
# Chart - 4 Proportion of Free vs. Paid Apps
plt.figure(figsize=(6, 6))
df_play['Type'].value_counts().plot.pie(autopct='%1.1f%%', startangle=90, colors=['gold', 'darkorange'], explode=[0.05, 0])
plt.title('Proportion of App Type (Free vs. Paid) ')
plt.ylabel('')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Pie chart clearly shows the proportion of categories in a whole.

##### 2. What is/are the insight(s) found from the chart?

Over 92% of all apps are Free. The market overwhelmingly favors Free distribution.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive - Confirms the necessity of a Free or Freemium model.
Negative - High barrier to entry for paid-only apps; necessitates robust alternative monetization models (ads, IAP).

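#### Chart - 5  Countplot: Distribution of User Review Sentiment

In [None]:
#Chart 5: Distribution of Sentiment in User Reviews
plt.figure(figsize=(6, 4))
# Map colors by label so each sentiment keeps its color regardless of bar order
sentiment_palette = {'Positive': 'green', 'Negative': 'red', 'Neutral': 'lightgray'}
sns.countplot(x='Sentiment', data=df_reviews, order=df_reviews['Sentiment'].value_counts().index, palette=sentiment_palette)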
plt.title('Distribution of User Review Sentiment ')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Count plot shows the absolute frequency of the three sentiment classes.

##### 2. What is/are the insight(s) found from the chart?

Positive reviews heavily outweigh Negative reviews (approx. 64% Positive, 22% Negative, 14% Neutral). This suggests a generally satisfactory user base for the reviewed apps.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive - Indicates that improving product quality can be highly effective since a strong positive base exists.

Negative - Negative reviews are a minority, but they still mark users with unresolved problems; leaving them unaddressed risks lower ratings and weaker retention.

#### Chart - 6  Barplot: Top 10 App Categories by Median Rating

In [None]:
# Chart - 6 Category vs. Median Rating
plt.figure(figsize=(10, 6))
category_rating = df_play.groupby('Category')['Rating'].median().sort_values(ascending=False).nlargest(10)
sns.barplot(x=category_rating.index, y=category_rating.values, palette='plasma')
plt.title('Top 10 App Categories by Median Rating ')
plt.xlabel('Category')
plt.ylabel('Median Rating')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Bar plot compares the central tendency (median) of a numerical variable across different categories.

##### 2. What is/are the insight(s) found from the chart?

Categories like 'EVENTS', 'EDUCATION', and 'ART_AND_DESIGN' have the highest median ratings (close to 4.5), suggesting high user satisfaction in these niche areas.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive - Highlights high-satisfaction niches for developers seeking a strong initial quality perception.

#### Chart - 7  Barplot: Median Installs by App Type

In [None]:
# Chart - 7 Type (Free/Paid) vs. Median Installs
plt.figure(figsize=(7, 5))
type_installs = df_play.groupby('Type')['Installs'].median().sort_values(ascending=False)
sns.barplot(x=type_installs.index, y=type_installs.values, palette=['mediumseagreen', 'coral'])
plt.title('Median Installs by App Type ')
plt.xlabel('App Type')
plt.ylabel('Median Installs (Log Scaled for visibility)')
plt.yscale('log')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Bar plot compares median values. Log scale on Y-axis is used for the large disparity in 'Installs' data.

##### 2. What is/are the insight(s) found from the chart?

Free apps have a significantly higher median number of installs compared to Paid apps, reinforcing the market preference for 'Free'.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive - Confirms the Free model is the primary driver for mass market adoption.

#### Chart - 8  Barplot: Median Reviews by Content Rating

In [None]:
# Chart - 8 visualization code     Content Rating vs. Median Reviews (Log Scaled)
plt.figure(figsize=(8, 4))
content_reviews = df_play.groupby('Content Rating')['Reviews'].median().sort_values(ascending=False)
sns.barplot(x=content_reviews.index, y=content_reviews.apply(np.log10), palette='cividis')
plt.title('Median Reviews by Content Rating (Log10 Reviews) ')
plt.xlabel('Content Rating')
plt.ylabel('Median Log10(Reviews)')
plt.xticks(rotation=15)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Bar plot with log scale shows which audience group generates the highest engagement/review activity.

##### 2. What is/are the insight(s) found from the chart?

Apps rated 'Teen' and 'Everyone 10+' have the highest median review counts, suggesting these demographics are the most active in providing feedback/engagement.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive - If the goal is high user activity and feedback, targeting 'Teen' is highly effective.

#### Chart - 9  Boxplot: Rating Distribution by App Type

In [None]:
# Chart - 9 visualization code ( Box Plot of Rating vs. Type)
plt.figure(figsize=(7, 5))
sns.boxplot(x='Type', y='Rating', data=df_play, palette=['gold', 'red'])
plt.title('Rating Distribution by App Type ')
plt.xlabel('App Type')
plt.ylabel('Rating')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Box plot compares the distribution (median, quartiles, outliers) of a numerical variable across a categorical variable.

##### 2. What is/are the insight(s) found from the chart?

 Both Free and Paid apps have a similar median rating (around 4.3), but Paid apps show slightly less variance and fewer extremely low outliers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive - Suggests that high quality is attainable regardless of the pricing model, but Paid apps might have a slightly more controlled, higher-quality average.

#### Chart - 10  Barplot: Top 10 App Categories by Average Sentiment Polarity

In [None]:
# Chart - 10 visualization code (Mean Sentiment Polarity by App Category, from merged data)
# Average the sentiment polarity over all reviews in each Category, then keep the top 10
avg_polarity = df_merged.groupby('Category')['Sentiment_Polarity'].mean().sort_values(ascending=False).nlargest(10)
plt.figure(figsize=(8, 4))
sns.barplot(x=avg_polarity.index, y=avg_polarity.values, palette='RdPu')
plt.title('Top 10 App Categories by Average Sentiment Polarity ')
plt.xlabel('Category')
plt.ylabel('Average Sentiment Polarity')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Bar plot reveals which app categories evoke the most positive/negative feelings in user reviews.

##### 2. What is/are the insight(s) found from the chart?

'LIBRARIES_AND_DEMO' and 'FAMILY' tend to have the most positive sentiment, while 'TOOLS' shows relatively low sentiment despite being a major category (Chart 1).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive - Highlights opportunity: a high-quality 'TOOLS' app could capture market share by addressing the functional/quality gaps suggested by the current low sentiment.
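
Averaging over all reviews in a category lets apps with many reviews dominate the category score. One alternative is a two-stage mean: average per app first, then average the app-level means per category. The sketch below contrasts the two on invented data (`toy_merged` is a hypothetical stand-in for `df_merged`); it is an illustration of the idea, not the method used for the chart above.

```python
import pandas as pd

# Hypothetical merged data: app A has four reviews, apps B and C have one each.
toy_merged = pd.DataFrame({
    "App": ["A", "A", "A", "A", "B", "C"],
    "Category": ["TOOLS", "TOOLS", "TOOLS", "TOOLS", "TOOLS", "GAME"],
    "Sentiment_Polarity": [0.8, 0.8, 0.8, 0.8, -0.2, 0.5],
})

# Pooled mean: every review counts equally, so app A dominates TOOLS.
pooled = toy_merged.groupby("Category")["Sentiment_Polarity"].mean()

# Two-stage mean: average per app first, then average the app means per category.
per_app = toy_merged.groupby(["Category", "App"])["Sentiment_Polarity"].mean()
two_stage = per_app.groupby(level="Category").mean()

print(pd.DataFrame({"pooled": pooled, "two_stage": two_stage}))
```

In this toy case the pooled TOOLS score is 0.6 while the two-stage score is 0.3, because app B's single negative review carries the same weight as all of app A's reviews combined.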

#### Chart - 11  Scatterplot: Sentiment Polarity vs. Subjectivity

In [None]:
# Chart - 11 visualization code (Numerical - Numerical) ---

#Scatter Plot of Sentiment Polarity vs. Subjectivity
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Sentiment_Polarity', y='Sentiment_Subjectivity', data=df_reviews, alpha=0.3, color='indigo')
plt.title('Sentiment Polarity vs. Subjectivity ')
plt.xlabel('Sentiment Polarity (-1.0 to 1.0)')
plt.ylabel('Sentiment Subjectivity (0.0 to 1.0)')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Scatter plot to see if highly subjective reviews are also highly polar (positive/negative).

##### 2. What is/are the insight(s) found from the chart?

Subjectivity increases as polarity moves away from 0. Highly positive and highly negative reviews are generally the most subjective (opinion-based). A cluster of reviews near (0, 0) suggests many 'Neutral' or factual reviews.
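
Because subjectivity rises on both sides of zero, a plain correlation between polarity and subjectivity can look weak even when the pattern is strong. Correlating subjectivity with the absolute polarity captures the V shape instead. A minimal sketch on invented values (`toy_reviews` is hypothetical):

```python
import pandas as pd

# Hypothetical reviews: subjectivity grows as polarity moves away from 0.
toy_reviews = pd.DataFrame({
    "Sentiment_Polarity": [-0.9, -0.5, -0.1, 0.0, 0.1, 0.5, 0.9],
    "Sentiment_Subjectivity": [0.9, 0.6, 0.2, 0.0, 0.3, 0.5, 1.0],
})

subjectivity = toy_reviews["Sentiment_Subjectivity"]

# Signed polarity: positive and negative sides cancel out in a V-shaped pattern.
corr_signed = toy_reviews["Sentiment_Polarity"].corr(subjectivity)

# Absolute polarity: measures strength of opinion regardless of direction.
corr_abs = toy_reviews["Sentiment_Polarity"].abs().corr(subjectivity)

print(f"Pearson r (signed polarity):   {corr_signed:.2f}")
print(f"Pearson r (absolute polarity): {corr_abs:.2f}")
```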

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive - Focus on highly subjective, high-polarity reviews to find the most emotionally resonant feedback (both good and bad) to prioritize feature development or issue fixing.

#### Chart - 12  Scatterplot: Price vs. Installs

In [None]:
# Chart - 12 visualization code
# Price vs. Installs (Filtered Price for visibility)
# Filter for apps priced under $50 to focus on the common pricing range
df_price_filtered = df_play[df_play['Price'] < 50]
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Price', y='Installs', data=df_price_filtered, alpha=0.7, color='purple')
plt.title('Price vs. Installs (Price < $50) ')
plt.xlabel('Price ($)')
plt.ylabel('Installs (Log Scaled)')
plt.yscale('log')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Scatter plot to investigate if higher price correlates with lower installs. Log scale on Installs helps visualize the rapid drop-off.

##### 2. What is/are the insight(s) found from the chart?

Installs drop dramatically as the price increases from $0. Paid apps, even at low prices, struggle to reach the highest install tiers of free apps.
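
The drop-off can also be summarized numerically by grouping apps into price bands and comparing median installs. The sketch below uses invented values, and the band edges are an assumption chosen for illustration (`toy_play`, `bins`, and `labels` are hypothetical).

```python
import pandas as pd

# Hypothetical apps; prices and install counts are invented for illustration.
toy_play = pd.DataFrame({
    "Price": [0.0, 0.0, 0.99, 2.99, 4.99, 9.99, 19.99],
    "Installs": [1_000_000, 500_000, 10_000, 5_000, 1_000, 500, 100],
})

# Band edges are an assumption; pd.cut uses right-closed intervals by default,
# so a price of exactly 0 falls into the first band.
bins = [-0.01, 0, 1, 5, 50]
labels = ["Free", "$0.01-$1", "$1.01-$5", "$5.01-$50"]
toy_play["Price_Band"] = pd.cut(toy_play["Price"], bins=bins, labels=labels)

# The median is robust to the few apps with extremely high install counts.
median_by_band = toy_play.groupby("Price_Band", observed=True)["Installs"].median()
print(median_by_band)
```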

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Negative - Confirms that any price point above $0 acts as a significant barrier to mass adoption.

#### Chart - 13  Barplot: Top 10 Apps by Count of Positive User Reviews

In [None]:
# Chart - 13 visualization code
# Top 10 Apps by Number of Positive Reviews (from merged data)
df_positive = df_merged[df_merged['Sentiment'] == 'Positive'].groupby('App')['Sentiment'].count().nlargest(10).sort_values(ascending=True)
plt.figure(figsize=(8, 4))
sns.barplot(x=df_positive.values, y=df_positive.index, palette='Greens_r')
plt.title('Top 10 Apps by Count of Positive User Reviews ')
plt.xlabel('Count of Positive Reviews')
plt.ylabel('App Name')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A horizontal bar chart is effective for displaying ranked lists with long labels (app names).

##### 2. What is/are the insight(s) found from the chart?

Large, established games like 'Clash of Clans' and utility apps dominate the highest count of positive reviews, correlating with their massive install bases.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive - Shows that high volume of positive feedback is concentrated among market leaders, reinforcing their position.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
#Correlation Heatmap of Numerical Features
numerical_df = df_play[['Rating', 'Reviews', 'Installs', 'Price', 'Size_MB']]
correlation_matrix = numerical_df.corr()
plt.figure(figsize=(6, 4))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title(' Correlation Heatmap of Numerical Features ')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A heatmap is an effective way to visualize all pairwise correlations between numerical variables at once.

##### 2. What is/are the insight(s) found from the chart?

Strong positive correlation between 'Reviews' and 'Installs' (0.83). All other correlations are very weak. 'Rating' has a near-zero correlation with 'Installs' and 'Price', suggesting that high adoption is not necessarily driven by the average rating.
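
To read the strongest relationships off the matrix without scanning the heatmap, the pairs can be ranked by absolute correlation. This is a minimal sketch under stated assumptions: the helper `ranked_correlations` and the sample values are hypothetical, and it defaults to Spearman rank correlation (a swapped-in alternative to the Pearson default used above) because install and review counts are heavily skewed.

```python
from itertools import combinations

import pandas as pd

def ranked_correlations(df: pd.DataFrame, method: str = 'spearman') -> pd.Series:
    """List each unique column pair with its correlation, strongest first (hypothetical helper)."""
    corr = df.corr(method=method)
    # One entry per unordered pair, skipping the diagonal
    pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
    # Rank by absolute strength so strong negative correlations are not buried
    return pd.Series(pairs).sort_values(key=lambda s: s.abs(), ascending=False)

# Illustrative values only; in the notebook, pass numerical_df instead
sample = pd.DataFrame({
    'Rating':   [4.1, 4.5, 3.9, 4.3, 4.0],
    'Reviews':  [10, 200, 3_000, 40_000, 500_000],
    'Installs': [1_000, 10_000, 100_000, 1_000_000, 10_000_000],
})
ranked = ranked_correlations(sample)
print(ranked)
```

Passing `method='pearson'` reproduces the values shown in the heatmap, which makes it easy to compare the two measures side by side.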

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# Pair Plot of Key Numerical Variables
# Selecting a subset of numerical columns for a manageable Pair Plot
cols = ['Rating', 'Reviews', 'Installs', 'Price']
# Apply a log10 transform to Installs and Reviews for better visibility in the plot.
# Zeros are replaced with NaN first, since log10(0) is -inf and dropna() would not remove it.
df_playstore_pair = df_play[cols].copy()
df_playstore_pair['Installs_log'] = np.log10(df_playstore_pair['Installs'].replace(0, np.nan))
df_playstore_pair['Reviews_log'] = np.log10(df_playstore_pair['Reviews'].replace(0, np.nan))

sns.pairplot(df_playstore_pair[['Rating', 'Reviews_log', 'Installs_log', 'Price']].dropna(),
             diag_kind='kde', plot_kws={'alpha': 0.5})
plt.suptitle('Pair Plot of Key Log-Transformed Variables', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

Pair plot visualizes univariate distributions (diagonal) and bivariate relationships (off-diagonal) for multiple variables in one matrix.

##### 2. What is/are the insight(s) found from the chart?

Reaffirms the strong linear relationship between log(Reviews) and log(Installs). The diagonal shows that 'Rating' is concentrated at the high end of the scale and 'Price' is concentrated at $0, and both have weak relationships with the adoption metrics.
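
The log-log relationship can be quantified with a straight-line fit in log space. This is a minimal sketch, assuming numeric 'Reviews' and 'Installs' columns as in `df_play`. The helper `loglog_fit` and the sample rows are hypothetical.

```python
import numpy as np
import pandas as pd

def loglog_fit(df: pd.DataFrame, x: str = 'Reviews', y: str = 'Installs'):
    """Fit log10(y) = slope * log10(x) + intercept (hypothetical helper)."""
    # Keep strictly positive values, since log10 is undefined at 0
    pos = df[(df[x] > 0) & (df[y] > 0)]
    lx, ly = np.log10(pos[x]), np.log10(pos[y])
    # Least-squares line in log space
    slope, intercept = np.polyfit(lx, ly, deg=1)
    # Pearson correlation of the log-transformed values
    r = np.corrcoef(lx, ly)[0, 1]
    return slope, intercept, r

# Illustrative rows only; the last row has zero reviews and is skipped by the fit
sample = pd.DataFrame({
    'Reviews':  [10, 100, 1_000, 10_000, 0],
    'Installs': [1_000, 10_000, 100_000, 1_000_000, 50],
})
slope, intercept, r = loglog_fit(sample)
print(slope, intercept, r)
```

A slope near 1 would mean installs grow roughly in proportion to reviews; the value for the real data has to be read from the fit on `df_play`.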

#### Chart - 16 Barplot

In [None]:
#chart 16- Top 10 Apps by Median Sentiment Polarity
# Find median polarity per app and select the top 10
# nlargest already returns the values sorted in descending order, so no extra sort is needed
app_median_polarity = df_merged.groupby('App')['Sentiment_Polarity'].median().nlargest(10)
plt.figure(figsize=(8, 4))
sns.barplot(x=app_median_polarity.values, y=app_median_polarity.index, palette='Blues_r')
plt.title(' Top 10 Apps by Median Sentiment Polarity ')
plt.xlabel('Median Sentiment Polarity')
plt.ylabel('App Name')
plt.xlim(0.5, 1.0) # Zoom in to show distinction among the top performers
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Horizontal bar chart to rank apps by the quality/intensity of their positive feedback.

##### 2. What is/are the insight(s) found from the chart?

These apps consistently receive very high (near-perfect) positive sentiment, indicating exceptional user satisfaction and quality in their niche.
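
A median computed from only a few reviews can be misleading, so it may help to require a minimum number of scored reviews per app before ranking. This is a minimal sketch, assuming the 'App' and 'Sentiment_Polarity' columns of `df_merged`. The helper name, the `min_reviews` threshold, and the sample rows are hypothetical.

```python
import pandas as pd

def top_apps_by_median_polarity(df: pd.DataFrame, min_reviews: int = 20, n: int = 10) -> pd.DataFrame:
    """Rank apps by median polarity, keeping only apps with enough reviews (hypothetical helper)."""
    # 'count' ignores missing polarity values, so it reflects scored reviews only
    stats = df.groupby('App')['Sentiment_Polarity'].agg(['count', 'median'])
    eligible = stats[stats['count'] >= min_reviews]
    return eligible.nlargest(n, 'median')

# Illustrative rows only; app B has a single review and is filtered out
sample = pd.DataFrame({
    'App': ['A', 'A', 'A', 'B', 'C', 'C', 'C'],
    'Sentiment_Polarity': [0.9, 0.8, 1.0, 1.0, 0.2, 0.4, 0.3],
})
top = top_apps_by_median_polarity(sample, min_reviews=2, n=2)
print(top)
```

The threshold is a judgment call: a higher value gives more stable rankings at the cost of excluding smaller apps.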

#### Chart - 17 Boxplot

In [None]:
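#Chart 17: Distribution of 'Price' for paid apps (filtered)
plt.figure(figsize=(8, 5))
# Exclude free apps (Price == 0) so the box plot describes paid apps only
paid_under_10 = df_play[(df_play['Price'] > 0) & (df_play['Price'] < 10)]
sns.boxplot(y=paid_under_10['Price'], color='gold')
plt.title('Box Plot of Paid App Price (< $10)')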
plt.ylabel('Price ($)')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Box plot shows the central tendency and outliers for the vast majority of paid apps (under $10), which is the most common paid price range.

##### 2. What is/are the insight(s) found from the chart?

The median price is very low (close to $1). The vast majority of apps in this range are priced under $5. This indicates strong pricing pressure for paid apps.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Negative - Sets a very low expectation for a one-time purchase price. Developers should only attempt a price > $5 if the app offers unique, high-value functionality that is not available for free.
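
The price points read from the box plot can be backed with explicit quantiles. This is a minimal sketch, assuming a numeric 'Price' column as in `df_play`. The helper `paid_price_summary`, its default cut-offs, and the sample prices are hypothetical.

```python
import pandas as pd

def paid_price_summary(df: pd.DataFrame, cap: float = 10.0, threshold: float = 5.0) -> dict:
    """Quartiles of paid-app prices below `cap`, plus the share priced under `threshold`."""
    # Free apps (Price == 0) are excluded so they do not pull the median to zero
    paid = df.loc[(df['Price'] > 0) & (df['Price'] < cap), 'Price']
    return {
        'q1': paid.quantile(0.25),
        'median': paid.median(),
        'q3': paid.quantile(0.75),
        'share_under_threshold': (paid < threshold).mean(),
    }

# Illustrative prices only; in the notebook, pass df_play instead
sample = pd.DataFrame({'Price': [0.0, 0.0, 0.99, 0.99, 1.99, 2.99, 4.99, 9.99]})
price_stats = paid_price_summary(sample)
print(price_stats)
```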

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

To achieve the business objective of maximizing market penetration (Installs) and user satisfaction (Rating/Sentiment), I suggest a multi-pronged strategy:

1.  **Market Niche & Content Target:** Focus development on the **'TOOLS' or 'FINANCE'** categories. While 'FAMILY' and 'GAME' are high-volume, their high competition makes breakthrough difficult. 'TOOLS' and 'FINANCE' are highly utilized but suffer from lower average sentiment (Chart 10). A new app in these categories, explicitly focusing on **exceptional quality, stability, and clean UX**, has a high chance of capturing market share and quickly improving user satisfaction. **Target 'Everyone' or 'Teen'** for maximum addressable market size (Chart 16).

2.  **Monetization & Pricing:** Adopt a **Freemium model** (Free with In-App Purchases/Ads). Given that over 92% of the market is Free and Installs drop sharply with price (Charts 4, 12), charging an upfront price is detrimental to mass adoption. The revenue should be driven by high-value, optional paid features or unobtrusive ads.

3.  **Engagement Focus:** Implement proactive but unobtrusive mechanisms to encourage **Reviews**. Since 'Reviews' is the variable most strongly correlated with 'Installs' (Chart 11), a high review count is essential for store visibility and social proof. Prompting users for a review after a specific positive milestone (e.g., 5 successful uses, rather than on first launch) is key.

4.  **Sentiment Monitoring:** Implement a real-time sentiment monitoring system (using the principles from the 'User Reviews' data) to identify common pain points. Prioritize fixing issues mentioned in **highly polar, highly subjective** negative reviews, as these represent the most emotionally resonant dissatisfaction.
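
The prioritization rule in point 4 can be sketched as a simple filter. This is a minimal illustration, not a full monitoring system. It assumes the review data has a 'Sentiment_Subjectivity' column alongside 'Sentiment_Polarity'. The helper name, both thresholds, and the sample rows are hypothetical.

```python
import pandas as pd

def priority_negative_reviews(reviews: pd.DataFrame,
                              polarity_max: float = -0.5,
                              subjectivity_min: float = 0.6) -> pd.DataFrame:
    """Select strongly negative, strongly subjective reviews, most negative first."""
    mask = (
        (reviews['Sentiment_Polarity'] <= polarity_max)
        & (reviews['Sentiment_Subjectivity'] >= subjectivity_min)
    )
    return reviews[mask].sort_values('Sentiment_Polarity')

# Illustrative rows only; column names are assumed to match the review dataset
reviews = pd.DataFrame({
    'App': ['A', 'A', 'B', 'B'],
    'Sentiment_Polarity': [-0.8, -0.2, -0.6, 0.7],
    'Sentiment_Subjectivity': [0.9, 0.9, 0.3, 0.8],
})
flagged = priority_negative_reviews(reviews)
print(flagged)
```

Grouping the flagged rows by 'App' and counting them would give a rough per-app backlog of issues to investigate.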

# **Conclusion**

The analysis confirms that the Play Store is a volume game, overwhelmingly favoring Free apps and demanding exceptional quality to achieve a high Rating. Success is defined not by the average Rating alone, but by achieving mass adoption (Installs) and sustaining positive user sentiment. The path to positive growth is clear: target utility categories with demonstrable quality improvements over current incumbents, adopt a Freemium model, and relentlessly drive and monitor user engagement via Reviews. By applying these data-driven principles, the client can significantly improve their app's market position and achieve sustainable business success.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***