<a href="https://colab.research.google.com/github/sundusfirdous/EDA-Capstone-Project-on-Play-Store-App-Review-Analysis/blob/main/Playstore_App_Review_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#               **Play Store App Review Analysis**











### **Unlocking Insights from Google Play Store Data for App Development Success**

The data obtained from Google Play Store applications holds significant potential to drive success in mobile app development. By analyzing this data, developers can uncover valuable insights that help them effectively engage users and navigate the competitive Android app market.

### **Overview of the Datasets**

This analysis utilizes two datasets:

1. **App Metadata** – Contains essential information such as app category, price, genre, type, number of installs, number of reviews, and more.
2. **User Reviews and Sentiment** – Includes translated user reviews along with sentiment labels, sentiment polarity, and sentiment subjectivity scores.

### **Objective**

The primary goal of this analysis is to explore and interpret the data to identify the key factors that influence user engagement and contribute to the overall success of apps on the Google Play Store.



# **Project Name**    -



##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual
##### **Team Member 1 -**
##### **Team Member 2 -**
##### **Team Member 3 -**
##### **Team Member 4 -**

# **Project Summary -**


### **Exploratory Data Analysis of Google Play Store Applications**

The data science workflow is typically structured around five key phases: capture, maintain, process, analyze, and communicate. In this project, we applied this methodology to perform an in-depth exploratory data analysis (EDA) of Google Play Store applications.

With thousands of new mobile apps being launched daily on platforms like Google Play and an ever-growing global developer community, the competition for user attention is intense. The success of an app is often measured more by user engagement metrics—such as the number of installs and average ratings—than by revenue. In light of this, our primary objective was to identify trends and features that contribute to an app’s success and visibility, using insights drawn from EDA.

---

### **Data Preparation and Cleaning**

The initial phase of the project focused heavily on data cleaning and preprocessing to ensure the accuracy and reliability of our analysis. One of the major challenges encountered was the presence of missing data:

* **Ratings**: Approximately 13.6% of the data in the *Rating* column was missing. Despite attempts to merge with other datasets, we could not recover these values from another source, so we imputed them with the column median rather than dropping the rows.
* **User Reviews**: The *User Reviews* dataset contained around 42% missing values. Although sentiment analysis of reviews could have informed imputation strategies for the *Rating* column, the limited overlap between datasets restricted this approach.
* **Merged Dataset**: After merging the Play Store data with user reviews, only 816 apps were found to be common—representing just 10% of the cleaned dataset. A higher overlap (ideally 70–80%) would have enhanced the depth and reliability of our analysis.

---

### **Exploratory Data Analysis (EDA)**

With the cleaned dataset, we conducted various analyses to uncover patterns and insights:

* **Categorical and Numerical Distributions**: We analyzed the frequency of categorical variables (e.g., genre, content rating) and the distribution of numerical variables (e.g., installs, reviews, rating).
* **Correlations**: We explored correlations between key features, such as:

  * *Category vs. Installs*
  * *Rating vs. Reviews*
  * *Sentiment Polarity vs. Subjectivity*

---

### **Key Findings**

* **App Pricing**: Free apps dominate the Play Store and are significantly more preferred by users.
* **App Size**: The majority of apps are relatively small in size. We found that app size does not significantly influence user behavior or success metrics.
* **Ratings Distribution**: Most apps are rated between 4.0 and 5.0, indicating a generally positive user experience.
* **Content Rating and Genre**: The most common content rating is "Everyone," and "Communication" emerged as the most frequent app genre.
* **Top Apps**: Based on install count and user reviews, **Facebook** emerged as the most popular app, closely followed by **WhatsApp**.

---

### **Sentiment Analysis Insights**

* **Review Sentiment**: The majority of user reviews express positive sentiment. Negative and neutral sentiments constitute a smaller portion of the dataset.
* **Polarity Range**: Most sentiment polarity scores fall within the range of \[-0.50, 0.75], suggesting that extreme sentiments are rare.
* **Subjectivity vs. Polarity**: While subjectivity and polarity are not always directly proportional, in cases of high or low variance, a linear relationship was observed.

---

### **Conclusion**

This project successfully achieved its objective of identifying trends and key success factors for applications on the Google Play Store. Through extensive data cleaning, analysis, and visualization, we uncovered meaningful relationships between app characteristics and performance indicators. While data limitations constrained some aspects of our analysis, especially in terms of dataset overlap, the findings provide valuable guidance for developers seeking to optimize their app’s visibility and engagement on the platform.

Future work could focus on expanding the dataset and incorporating additional features (e.g., user demographics, time-based trends) to further refine predictive models and recommendations.


# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


#### **Define Your Business Objective?**



The Google Play Store stands as one of the leading digital app marketplaces globally, widely used by millions of users. As the mobile app ecosystem continues to expand, the demand for skilled developers is also on the rise. The Play Store offers users access to a vast array of applications across various categories, along with the ability to rate and review apps based on their personal experiences.

To design and develop successful and engaging applications, it is essential for developers to understand user preferences and behavior. Key factors such as app size, pricing model, required Android version, and last update date can significantly influence user adoption and engagement.

Additionally, understanding user sentiment towards similar existing apps provides valuable insights that can guide development decisions. Sentiment analysis of user reviews allows developers to anticipate user expectations and avoid common pitfalls.

For this analysis, two datasets have been provided:

1. **App Metadata** – containing general information about the apps available on the Play Store.
2. **User Reviews and Sentiment** – comprising user-generated reviews and associated sentiment scores for individual apps.

By thoroughly examining both datasets, we aim to identify the key factors that influence app performance, user engagement, and overall success in the competitive app market.


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
import missingno as msno
%matplotlib inline

### Dataset Loading

In [None]:
#import google drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
ps_data_path='/content/drive/MyDrive/AlmaBetter/Data Scientist/Capstone_Project/Play Store Data.csv'
df_ps=pd.read_csv(ps_data_path)

### Dataset First View

In [None]:
#let's look at the first four rows
df_ps.head(4)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df_ps.columns

In [None]:
#find no of rows and columns in the dataset
rows=df_ps.shape[0]
column=df_ps.shape[1]
print(f"no of rows is {rows} and no of columns is {column}")

### Dataset Information

In [None]:
# Dataset concise summary
df_ps.info()

### **Initial Observations from the Dataset**

Based on our preliminary examination of the dataset, two key observations emerge:

1. **Data Composition**: Out of the 13 available columns, only one contains numerical values, while the remaining columns consist of categorical data. This indicates a need for appropriate encoding and handling of categorical variables during the analysis.

2. **Missing Values**: The dataset contains some missing (null) values. A detailed inspection will be conducted to assess the extent and distribution of these nulls, which will inform our data cleaning and preprocessing strategies.
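
The split between numerical and non-numerical columns can be checked directly from the dtypes. The sketch below uses a small made-up frame in place of `df_ps` (the values are illustrative, not taken from the dataset) and shows that columns such as `Reviews` and `Installs` count as non-numerical only because they are stored as text.

```python
import pandas as pd

# Toy stand-in for df_ps; column names follow the Play Store data, values are made up
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Rating": [4.1, None, 4.5],                         # float column with a missing value
    "Reviews": ["159", "967", "87510"],                 # numbers stored as text
    "Installs": ["10,000+", "500,000+", "5,000,000+"],  # text with '+' and ','
})

# Columns pandas already treats as numbers
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything else (text columns, including numbers stored as text)
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("non-numeric:", other_cols)
```

Running the same two `select_dtypes` calls on `df_ps` would list which columns need conversion before any statistics are computed.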

#### Duplicate Values

In [None]:
# find the duplicate dataframe according to the app column
duplicate=df_ps[df_ps.duplicated(subset='App')]
duplicate.head(4)

In [None]:
#find the count of the duplicate data in the dataset
rows_duplicate=duplicate.shape[0]
print(f"no of duplicate rows in the dataset is {rows_duplicate}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df_ps.isna().sum()


### **Missing Data Overview**

Upon reviewing the dataset, it appears that only five columns contain missing values:

1. **Rating** – Represents the average user rating for each app. Missing values here may impact analysis related to user satisfaction.
2. **Current Ver** – Indicates the current version of the app. Missing data could hinder insights about app maintenance or updates.
3. **Android Ver** – Specifies the minimum Android version required to run the app. Null values may affect compatibility analysis.
4. **Type** – Defines whether the app is *Free* or *Paid*. Missing entries in this column could skew pricing-related insights.
5. **Content Rating** – Reflects the age-appropriateness of the app (e.g., Everyone, Teen). Missing values may interfere with audience segmentation.

These columns will be carefully examined during the data cleaning process to determine the best approach for handling their missing values—whether through imputation, removal, or other techniques.


In [None]:
# Visualizing the missing values
msno.matrix(df_ps)


### **Visual Analysis of Missing Data**

From the matrix plot visualization, we can draw the following observations regarding the distribution of missing values:

* **Rating**: The missing values in the *Rating* column are dispersed throughout the dataset, indicating that they occur randomly across different app entries.
* **Current Ver**: The missing values in the *Current Ver* column are concentrated toward the end of the dataset. This could suggest issues with more recently added or less frequently updated apps.
* **Other Columns** (*Android Ver*, *Type*, *Content Rating*): The missing values in these columns are less visually prominent in the matrix plot, likely due to their relatively smaller number. However, they are still present and will need to be addressed during data preprocessing.

This visual inspection reinforces the importance of applying appropriate missing data handling techniques tailored to the distribution and significance of each column.


### What did you know about your dataset?

### **Understanding the Dataset: Column Descriptions**

The Play Store DataFrame (`df_ps`) consists of **10,841 rows** and **13 columns**, each representing specific attributes of mobile applications available on the Google Play Store. Below is a detailed description of each column based on our inspection:

1. **App**
   Represents the **name of the application**.

2. **Category**
   Indicates the **primary category** to which the app belongs, such as *Education*, *Sports*, *Games*, etc.

3. **Rating**
   Contains the **average user rating** for the app, reflecting overall user satisfaction.

4. **Reviews**
   Shows the **total number of user reviews** submitted for the application.

5. **Size**
   Specifies the **storage size** of the app, indicating how much space it occupies on a user's device.

6. **Installs**
   Represents the **total number of downloads or installations** of the app.

7. **Type**
   States whether the app is **Free** or **Paid**.

8. **Price**
   Denotes the **cost to install** the app. For free apps, the value is zero.

9. **Content Rating**
   Indicates the **age suitability** of the app (e.g., *Everyone*, *Teen*, *Mature 17+*).

10. **Genres**
    Lists **additional categories or genres** associated with the app beyond the main category.

11. **Last Updated**
    Specifies the **date on which the app was last updated**.

12. **Current Ver**
    Displays the **current version** of the app available on the Play Store.

13. **Android Ver**
    Indicates the **minimum Android OS version required** to run the app on a device.


## ***2. Understanding Your Variables***

### **Primary Key Identification**

In this dataset, the **`App`** column serves as the **primary identifier**, as it contains the **name of each application**. Ideally, each app name should be **unique**, allowing it to act as a primary key for identifying and distinguishing individual records.

However, to ensure data integrity, it's important to verify the uniqueness of this column and check for any duplicate entries, especially considering that some apps may appear more than once due to updates, different versions, or multiple entries in the dataset.
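
A minimal sketch of such a uniqueness check, run here on a made-up frame rather than on `df_ps` (the app names are hypothetical):

```python
import pandas as pd

# Toy data with one repeated app name
toy = pd.DataFrame({
    "App": ["Maps", "Chat", "Maps", "Notes"],
    "Reviews": [10, 5, 12, 3],
})

# True only if every app name appears exactly once
is_unique = toy["App"].is_unique

# Names that occur more than once (keep=False marks every occurrence)
dup_names = toy.loc[toy["App"].duplicated(keep=False), "App"].unique().tolist()

# Number of distinct app names
n_unique = toy["App"].nunique()

print(is_unique, dup_names, n_unique)
```

If `is_unique` is `False`, the column can only serve as a key after the repeated rows are removed or aggregated.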


In [None]:
# Dataset Columns
df_ps.columns

In [None]:
# Dataset Describe
df_ps.describe()

Only the rating column is numerical; the others contain categorical data.

### Variables Description

In this dataset, only the **`Rating`** column contains **numerical values**, while all other columns are **categorical**.

Upon inspection, we observed that the **maximum rating value is 19**, which is clearly **inappropriate**, as user ratings on the Google Play Store are typically **on a scale from 1.0 to 5.0**. The **minimum value**, however, is correctly recorded as **1.0**.

The presence of such **outliers or corrupted values** in the `Rating` column compromises the **reliability of statistical summaries** and may skew any analysis derived from this field. Therefore, these anomalies must be addressed—either by removing or correcting the affected records—before conducting any meaningful statistical or predictive analysis.

In [None]:
#let's look at the number of corrupted data in the Rating column
df_ps[(df_ps['Rating']>5.0) | (df_ps['Rating']<1.0)]

In the Rating column, there is only one corrupted value, and for the same app, there is also a null value for Content Rating. We can drop this row.

In [None]:
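#drop the row(s) whose rating falls outside the valid 1.0-5.0 range
invalid_rating=df_ps[(df_ps['Rating']>5.0) | (df_ps['Rating']<1.0)].index
df_ps.drop(index=invalid_rating,inplace=True)
df_ps.shape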

In [None]:
#statistical description of the numerical column
df_ps.describe()

Most ratings fall between 4.0 and 4.5, and the average rating is around 4.2.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df_ps.nunique()

### **Key Dataset Insights**

* The dataset contains **9,659 unique apps**, with **`App`** serving as the primary identifier.
* There are **33 distinct categories** (e.g., Game, Family, Medical).
* **App types** include **Free** and **Paid**.
* **6 unique content ratings** are present (e.g., *Everyone*, *Teen*).
* The dataset spans **118 different genres**.
* Apps support **33 different Android versions**.


## 3. ***Data Wrangling***

### Data Wrangling Code

3.1 Drop the Duplicate value

In [None]:
#drop the duplicate values based on app column
df_ps.drop_duplicates(subset='App',keep='first',inplace=True)
#check the no of rows in the dataset after dropping the duplicates
df_ps.shape


3.2 Dealing with the missing value

In [None]:
#let's look at the missing data in more detail with a helper function
def missingdata_info(df):
  '''
  Return, for every column that has missing values, its datatype,
  total row count, non-null count, null count and percentage of nulls
  '''
  missing_df=pd.DataFrame(index=df.columns)
  missing_df['datatypes']=df.dtypes
  missing_df['total no of value']=[len(df)]*len(missing_df)
  missing_df['not null value']=df.count()
  missing_df['null values']=df.isna().sum()
  missing_df['percentage of null value']=(missing_df['null values']/len(df))*100
  #keep only the columns that actually contain nulls
  ms_df=missing_df[missing_df['null values']>0]
  return ms_df

In [None]:
#apply the function to the dataframe to understand the missing values
missingdata_info(df_ps)

**Missing Data Summary**

* Approximately **15% of the values in the `Rating` column** are missing, accounting for around **1,463 entries**.
* The remaining columns with missing values—**`Type`**, **`Content Rating`**, **`Current Ver`**, and **`Android Ver`**—have **less than 1%** missing data, making their impact relatively minimal.


3.2.1 Dealing with Android Ver column

In [None]:
#check out the android ver missing rows
df_ps[df_ps['Android Ver'].isnull()]

In [None]:
#finding the number of different value in Android Ver
df_ps['Android Ver'].value_counts()

 **Handling Missing Values in `Android Ver`**

The `Android Ver` column contains a **broad range of unique values**, making it challenging to accurately impute missing entries. However, since there are **only 3 missing rows**, accounting for just **0.02%** of the dataset, it is both practical and justifiable to **remove these rows** without impacting the overall analysis.


In [None]:
#drop the null value of the Android Ver column
df_ps=df_ps[df_ps['Android Ver'].notnull()]
#check the shape of the dataframe after dropping missing Android ver column
df_ps.shape

3.2.2 Dealing with Current Ver column

In [None]:
#check out the current ver missing rows
df_ps[df_ps['Current Ver'].isnull()]

In [None]:
#finding the number of different value in Current Ver
df_ps['Current Ver'].value_counts()

**Handling Missing Values in `Current Ver`**

The `Current Ver` column contains a **large number of unique values**, with entries varying significantly between apps. This wide variability makes it difficult to accurately impute the missing data. However, since there are **only 8 missing values**, representing just **0.08%** of the dataset, it is reasonable to **remove these rows** to maintain data quality without affecting the analysis.


In [None]:
#Dropping the missing current ver data
df_ps=df_ps[df_ps['Current Ver'].notnull()]
#check out the shape of the dataframe  after dropping current ver missing data
df_ps.shape

3.2.3 Dealing with type column

In [None]:
#finding the null value in Type column
df_ps[df_ps['Type'].isnull()]

In [None]:
#find the type column data  spread
df_ps['Type'].value_counts()

**Handling Missing Values in `Type` Column**

The `Type` column contains only **two unique values**: **Free** and **Paid**. By definition:

* If an app is **Free**, its **Price** is `0`.
* If an app is **Paid**, the **Price** is greater than `0`.

For rows where the `Type` value is missing, we can **infer the type based on the Price**. Since all missing `Type` entries have a Price of `0`, we can confidently **replace the missing values with "Free"**.
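
The same inference can be written so that it does not rely on every missing row being free. This is a sketch on made-up rows, assuming the raw `Price` strings look like `'0'` or `'$4.99'`; the name `inferred` is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows: the second app has no Type recorded
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Type": ["Free", None, "Paid"],
    "Price": ["0", "0", "$4.99"],
})

# Strip the dollar sign and parse the price as a number
price = pd.to_numeric(toy["Price"].str.replace("$", "", regex=False))

# Derive the type from the price: anything above zero is Paid
inferred = pd.Series(np.where(price > 0, "Paid", "Free"), index=toy.index)

# Fill only the missing entries; existing labels are left untouched
toy["Type"] = toy["Type"].fillna(inferred)
print(toy)
```

Because `fillna` aligns on the index, only rows with a missing `Type` receive the inferred label.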


In [None]:
#replace the NaN value of the Type column with 'Free'
df_ps['Type']=df_ps['Type'].fillna('Free')

3.2.4 Dealing with the Rating column

In [None]:
#find out the missing rating data or rating is 0 in the dataset
df_ps[(df_ps['Rating'].isnull()) | (df_ps['Rating']==0)]

In [None]:
#find out the rating rows whose value is more than 5
df_ps[(df_ps['Rating']>5)]

 **Handling Missing Values in `Rating` Column**

The `Rating` column contains a **significant number of missing values**, making it impractical to drop the affected rows without losing valuable data. Since `Rating` is a **numerical feature**, we need to carefully examine its **distribution** to determine the most appropriate strategy for imputing missing values—such as using the **mean**, **median**, or **mode**, depending on whether the distribution is **normal or skewed**.


In [None]:
#try to understand the distribution of the data with histplot and boxplot
fig, ax = plt.subplots(2,1, figsize=(15,8))
plt.grid()
sns.histplot(df_ps['Rating'], color ='red', bins = 10,kde=True,ax=ax[0])
sns.boxplot(x=df_ps['Rating'],data=df_ps,ax=ax[1])


**Distribution Analysis of `Rating`**

Based on the distribution plot, the `Rating` column is **negatively skewed**, indicating that a higher number of apps have ratings closer to the upper end (e.g., 4 to 5), while a few outliers exist on the **lower end** of the scale.

Due to this skewness, using the **mean** to impute missing values may not be appropriate, as it could be influenced by the **low-end outliers**. Instead, a more robust approach would be to use the **median**, which is less sensitive to extreme values and better represents the central tendency in a skewed distribution.


In [None]:
#find the mean, median and mode of the rating column (NaN values are skipped by default)
mean_value=round(df_ps['Rating'].mean(),1)
median_value=df_ps['Rating'].median()
#mode() returns a Series, so take its first entry
mode_value=df_ps['Rating'].mode().iloc[0]
print(f"The mean value is {mean_value}, median value is {median_value} and mode value is {mode_value}")

**Imputation Strategy for `Rating` Column**

After analyzing the distribution of the `Rating` column, we found that the **median and mode have the same value**. This consistency indicates a strong central tendency in the data.

Given the **negative skew** and presence of **outliers**, replacing the missing values with the **median** is the most appropriate choice, as it is **robust to skewness and extreme values**.
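
The reasoning behind choosing the median can be illustrated numerically. The ratings below are invented for the example: a single low outlier pulls the mean down while the median barely moves.

```python
import pandas as pd

# Made-up ratings: mostly high, one low outlier, two missing
ratings = pd.Series([4.5, 4.4, 4.3, 4.3, 4.2, 4.0, 1.5, None, None])

skewness = ratings.skew()        # negative value means a long tail on the low side
mean_value = ratings.mean()      # dragged down by the 1.5 outlier
median_value = ratings.median()  # middle value, unaffected by the outlier

# Impute the missing ratings with the median
filled = ratings.fillna(median_value)

print(f"skew={skewness:.2f}, mean={mean_value:.2f}, median={median_value}")
```

With a left-skewed column the mean sits below the median, so filling with the mean would understate a typical rating.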


In [None]:
#replace the NaN values of the Rating column with the median value
df_ps['Rating']=df_ps['Rating'].fillna(median_value)

In [None]:
#check once again if there is any null value
df_ps.isnull().sum()

3.3 Reshaping the data

**Data Type Conversion for Analysis**

To ensure accurate analysis, the following data type conversions are necessary:

* **`Reviews`**, **`Size`**, **`Price`**, and **`Installs`** should be converted to **numeric** types for statistical computation and modeling.
* **`Last Updated`** should be converted to **datetime format** to facilitate time-based analysis.

These conversions are essential to maintain data consistency and enable meaningful insights.

3.3.1 Change the datatype of Review column

In [None]:
#changing the review datatype from object to integer
df_ps['Reviews']=df_ps['Reviews'].astype(int)
#check the datatype of the column after change
df_ps['Reviews'].dtype

3.3.2 Change the datatype of Size column

In [None]:
#look at the format of the data of size column
df_ps['Size'].unique()

**Here M denotes the size in MB (megabytes) and k denotes kB (kilobytes). Before converting the data, all sizes must be expressed in a single unit, and the 'k' and 'M' suffixes must be removed.**

In [None]:
# create a function to convert the size to a number in MB
def size_convert(app_size):
  '''
  Convert a size string such as '19M' or '512k' to a float in MB.
  Values that cannot be parsed (e.g. 'Varies with device') are returned unchanged.
  '''
  try:
    if app_size[-1]=='M':
      return float(app_size[:-1])
    elif app_size[-1]=='k':
      #1 MB = 1024 kB
      return round((float(app_size[:-1])/1024),2)
    else:
      return float(app_size)
  except (ValueError, TypeError, IndexError):
    return app_size

In [None]:
#convert the column as suggested
df_ps['Size']=df_ps['Size'].apply(lambda size:size_convert(size))
#check the datatype of the column after applying function on the column
df_ps['Size'].dtype

**The datatype is still object because the value 'Varies with device' is present in the dataset, so the entire column cannot be converted to numbers.**
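
If a fully numeric column is needed later (for example for a histogram or a correlation), one option is to turn the unparseable entries into `NaN`. This is a sketch of that alternative, not a step taken in this notebook; the series below is a made-up stand-in for the converted `Size` column.

```python
import pandas as pd

# Stand-in for the Size column after size_convert: floats mixed with a leftover string
mixed = pd.Series([19.0, 0.5, "Varies with device"], dtype=object)

# errors="coerce" replaces anything that is not a number with NaN
size_mb = pd.to_numeric(mixed, errors="coerce")

print(size_mb.dtype)
print(size_mb)
```

The trade-off is that "Varies with device" is no longer distinguishable from a value that was missing from the start.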

3.3.3 Dealing with Installs columns






In [None]:
#look at the format of the column
df_ps['Installs'].unique()

**In the Install column, there are '+' and ',' which need to be filtered before being converted to numeric.**

In [None]:
#create a function to convert the Installs column to numeric data
def install_to_numeric(column_data):
  '''
  Strip the '+' and ',' characters from an install count and return it as an int
  '''
  #remove both characters wherever they appear, then cast to int
  return int(column_data.replace('+','').replace(',',''))

In [None]:
#apply the function to convert install data into numeric
df_ps['Installs']=df_ps['Installs'].apply(lambda x:install_to_numeric(x))
df_ps.head(4)

**3.3.4 Dealing the Price Column**

The price column must be converted to numeric data. The price values contain a '$' sign, which we need to remove.

In [None]:
def price_to_numeric(price):
  '''
  This function is to convert price column datatype into numeric
  '''
  if '$' in price:
    new_price=float(price.replace('$',''))
    return new_price
  else:
    return float(price)

In [None]:
# apply the function into  price column
df_ps['Price']=df_ps['Price'].apply(lambda x:price_to_numeric(x))
#check the price column of the data which have non zero price
df_ps[df_ps['Price']!=0].head(4)
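
The `Installs` and `Price` clean-up can also be written with pandas' vectorized string methods instead of row-wise functions. The sketch below runs on made-up raw values and assumes the strings contain only digits plus the `+`, `,` and `$` characters.

```python
import pandas as pd

# Made-up raw values in the same shape as the Play Store columns
raw = pd.DataFrame({
    "Installs": ["10,000+", "500+", "0"],
    "Price": ["0", "$4.99", "$0.99"],
})

# Remove every '+' and ',' with one regex character class, then cast to int
installs = raw["Installs"].str.replace(r"[+,]", "", regex=True).astype(int)

# Remove the literal '$' (regex=False so it is not treated as an end-of-line anchor)
price = raw["Price"].str.replace("$", "", regex=False).astype(float)

print(installs.tolist(), price.tolist())
```

Both approaches give the same values; the vectorized form is shorter and avoids a Python-level call per row.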

**3.3.5 Dealing with Last Updated Column**




The Last Updated column needs to be converted into datetime format.

In [None]:
# Pandas to_datetime() function applied to the values in the last updated column
df_ps['Last Updated']=pd.to_datetime(df_ps['Last Updated'])
#check the dataframe after converting to datetime
df_ps.head(4)

#### User Review Data load and Data preparation for Analysis

In [None]:
#load the user review dataset
ur_path='/content/drive/MyDrive/AlmaBetter/Data Scientist/Capstone_Project/User Reviews.csv'
df_ur=pd.read_csv(ur_path)
df_ur.head()

In [None]:
#user review dataset information
df_ur.info()

**User Reviews Dataset Overview**

* The dataset consists of **5 columns** in total.
* **2 columns** are **numerical**:

  * `Sentiment_Polarity`
  * `Sentiment_Subjectivity`
* **3 columns** are **categorical**:

  * `App`
  * `Translated_Review`
  * `Sentiment`
* A **significant number of null values** are present, particularly in the review-related fields, which warrants further inspection and appropriate handling during preprocessing.

In [None]:
# view the basic statistical details
df_ur.describe()

**User Reviews Dataset: Column Descriptions**

The user reviews DataFrame (`df_ur`) contains **64,295 rows** and **5 columns**, each serving a specific purpose:

* **`App`**:
  Indicates the name of the application. It may also contain a brief description in some cases.

* **`Translated_Review`**:
  Provides the **English translation** of the original user review submitted for the app.

* **`Sentiment`**:
  Represents the **emotional tone** of the review. Values include:

  * `Positive`
  * `Negative`
  * `Neutral`

* **`Sentiment_Polarity`**:
  A **numerical measure** ranging from **-1 to 1**, where:

  * `1` represents a strong positive sentiment
  * `-1` represents a strong negative sentiment

* **`Sentiment_Subjectivity`**:
  Ranges from **0 to 1**, indicating how **subjective** or **factual** the review is:

  * Values closer to **1** suggest a **highly subjective** review
  * Values closer to **0** suggest a **more objective/factual** review
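
These documented ranges can be verified with a quick sanity check. The sketch below uses three invented reviews; on the real data it would be run on `df_ur` after the null rows are removed, since `between` returns `False` for `NaN`.

```python
import pandas as pd

# Made-up rows with the same column names as the user reviews data
toy = pd.DataFrame({
    "Sentiment": ["Positive", "Negative", "Neutral"],
    "Sentiment_Polarity": [0.8, -0.4, 0.0],
    "Sentiment_Subjectivity": [0.9, 0.6, 0.0],
})

# Polarity is expected in [-1, 1] and subjectivity in [0, 1]
polarity_ok = toy["Sentiment_Polarity"].between(-1, 1).all()
subjectivity_ok = toy["Sentiment_Subjectivity"].between(0, 1).all()

# Only the three documented labels should appear
labels_ok = toy["Sentiment"].isin(["Positive", "Negative", "Neutral"]).all()

print(polarity_ok, subjectivity_ok, labels_ok)
```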



In [None]:
#columns of the data set
print(df_ur.columns)
rows=df_ur.shape[0]
cols=df_ur.shape[1]
print(f"The user review data set has {rows} rows and {cols} columns")

## Data Cleaning and Preparation


In [None]:
# Understand the missing data in details
missingdata_info(df_ur)


 **Missing Data Overview in User Reviews Dataset**

A significant portion of the dataset contains missing values:

* Approximately **42% of the entries** have missing data across several columns.
* The **`Translated_Review`** column has **26,868 missing values**, accounting for **41.79%** of the total records.
* The following three columns each have **26,863 missing values** (**41.78%** of the data):

  * **`Sentiment`**
  * **`Sentiment_Polarity`**
  * **`Sentiment_Subjectivity`**

This indicates that the **missing data is aligned across multiple columns**, likely due to reviews not being available or processed for certain apps.


In [None]:
#find the rows where all four review columns are null
df_na=df_ur[df_ur['Translated_Review'].isna() & df_ur['Sentiment'].isna() & df_ur['Sentiment_Polarity'].isna() & df_ur['Sentiment_Subjectivity'].isna()]
df_na.head()


In [None]:
#check the length of the dataframe
print(len(df_na))



*   26,863 rows have null values in all four review columns.
*   These rows need to be dropped.



In [None]:
#copy the na dataframe
df_na_1=df_na.copy()
#drop the rows which have all null values, selecting them by index
df_ur.drop(index=df_na_1.index,inplace=True)
#check the dataframe shape after the drop
df_ur.shape

In [None]:
#check if any null value remains in the dataframe
df_ur.isna().sum()

Now only `Translated_Review` has null values (5 rows).

In [None]:
#check the rows which have null values of translated review
df_ur[df_ur['Translated_Review'].isna()]

In [None]:
#check the unique values of translated review
df_ur['Translated_Review'].unique()

In [None]:
#drop the null values
df_ur.dropna(inplace=True)
df_ur.shape

In [None]:
#check the last 5 rows of the dataframe after data cleaning
df_ur.tail()

**Merge the two Dataframe**

In [None]:
#create a groupby of app and sentiment and form a new dataframe
g_df=df_ur.groupby(['App','Sentiment']).size().unstack(level=1)
g_df.reset_index( inplace=True)
g_names=['App','Negative_Sentiment', 'Neutral_Sentiment','Positive_Sentiment']
g_df.columns=g_names
g_df.head()

In [None]:
#find any null value in new g_df
g_df.isna().sum()

In [None]:
#fill the null value with zero
g_df.fillna(0,inplace=True)

In [None]:
#merge two dataframe by inner join
merged_df=pd.merge(df_ps,g_df,on='App',how='inner')
merged_df.head(4)

In [None]:
#find the shape of the merged dataframe
merged_df.shape
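
How many apps the two datasets share can be measured before committing to an inner join. The sketch below uses two tiny made-up frames and the `indicator` option of `pd.merge`; the frame names `apps` and `reviews` are placeholders for `df_ps` and `g_df`.

```python
import pandas as pd

# Made-up stand-ins: four apps in the store data, three in the review summary
apps = pd.DataFrame({"App": ["A", "B", "C", "D"]})
reviews = pd.DataFrame({"App": ["B", "D", "E"], "Positive_Sentiment": [3, 1, 2]})

# A left join keeps every store app; _merge records where each row was found
check = pd.merge(apps, reviews, on="App", how="left", indicator=True)

# Rows marked "both" are the ones an inner join would keep
matched = int((check["_merge"] == "both").sum())
overlap_pct = 100 * matched / len(apps)

print(f"{matched} of {len(apps)} apps have reviews ({overlap_pct:.1f}%)")
```

Reporting this percentage alongside the merged shape makes the limited overlap between the two datasets explicit.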

Now we have a total of 3 DataFrames for analysis.


### What all manipulations have you done and insights you found?

 **Data Cleaning and Preprocessing Summary**

* **Dropped** the rows with null values in the `Android Ver` and `Current Ver` columns due to their **low percentage of missing data**.
* **Filled** the missing `Type` value with **"Free"**, since the corresponding `Price` is `0`.
* **Imputed** missing values in the `Rating` column using the **median**, as it is more robust to skewness and outliers.
* **Converted** the data types of key columns:

  * `Reviews` and `Installs` → **Integer**
  * `Price` → **Float**
  * `Size` → **Float** (in MB), except entries labeled **"Varies with device"**, so the column as a whole remains of object type
  * `Last Updated` → **Datetime** format
* **Dropped** rows with missing values in:

  * `Translated_Review`
  * `Sentiment`
  * `Sentiment_Polarity`
  * `Sentiment_Subjectivity`
* **Filtered** the `user_reviews` dataset to remove duplicate app entries before merging, and formatted columns to align with the structure of the `play_store` dataset.
* The **data pipeline** is now fully prepared, enabling effective **exploratory data analysis (EDA)** and **visualization** for drawing meaningful insights and comparisons.
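
A minimal sketch of the imputation and type-conversion steps listed above, on a toy frame. The raw string formats (`"1,000+"`, `"$4.99"`, `"January 7, 2018"`) are assumptions about how the raw columns look, and the values are made up; `Price` is kept as a float here so that cents are not lost.

```python
import pandas as pd

# Hypothetical raw rows (formats assumed, values illustrative)
raw = pd.DataFrame({
    "Rating": [4.1, None, 4.7, 3.9],
    "Installs": ["1,000+", "500,000+", "10+", "0"],
    "Price": ["0", "$4.99", "0", "$0.99"],
    "Last Updated": ["January 7, 2018", "August 1, 2018", "June 20, 2018", "March 3, 2017"],
})

clean = raw.copy()
# Median imputation is less sensitive to skew and outliers than the mean
clean["Rating"] = clean["Rating"].fillna(clean["Rating"].median())
# Strip the thousands separators and the trailing "+" before casting
clean["Installs"] = clean["Installs"].str.replace(r"[+,]", "", regex=True).astype(int)
# Strip the currency symbol before casting
clean["Price"] = clean["Price"].str.replace("$", "", regex=False).astype(float)
# Parse the date strings into datetimes
clean["Last Updated"] = pd.to_datetime(clean["Last Updated"], format="%B %d, %Y")
print(clean.dtypes)
```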


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1


###**4.1 Univariate Analysis**
Univariate analysis is the simplest form of data analysis. “Uni” means “one”, so each analysis looks at a single variable. Its main purpose is to describe: it summarizes the data and finds patterns in it.


In [None]:
#Visualisation of No of app based on each category
fig,ax=plt.subplots(figsize=(18,10))
cat_app=df_ps['Category'].value_counts()
cat_app_vis=sns.barplot(x=cat_app.index,y=cat_app,palette="bright")
plt.xticks(rotation=60)
plt.xlabel('Category')
plt.ylabel('No of Apps')
plt.show()

##### 1. Why did you pick the specific chart?

The Category column is categorical and has 33 unique values. A bar plot compares the count of each category, so it was chosen here.

##### 2. What is/are the insight(s) found from the chart?

This plot shows the most and least common app categories in the Play Store. The Family category has the highest number of apps, above 1750. The Beauty category has the lowest presence, below 150.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This visualisation helps to understand how many apps each category contains. Family, Games, and Tools have the highest competition, while Parenting, Comics, and Beauty have the lowest.

#### Chart - 2

###4.1.2 **Analysis of Reviews and Installs Columns**






In [None]:
#statistics describe data of reviews and installs
df_ps[['Reviews','Installs']].describe()


The Installs and Reviews columns span several orders of magnitude, so it is sensible to plot them on a logarithmic scale. A few rows contain 0, whose logarithm is undefined, so 1 is added to every value first; this keeps the result finite and has a negligible effect on the analysis.
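
A small worked example of the shift-then-log idea, using made-up install counts rather than the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical install counts, including a zero
installs = pd.Series([0, 9, 99, 999_999])

# Adding 1 before taking log10 maps 0 to 0 instead of -inf
log_installs = np.log10(installs + 1)
print(log_installs.tolist())
```

For large counts the shift is negligible, which is why it barely affects the analysis.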

In [None]:
#visualise the distribution of Reviews and Installs with box plots
#use log10(x+1) so that zero counts stay finite; the original columns are left unchanged
Reviews=np.log10(df_ps['Reviews']+1)
Installs=np.log10(df_ps['Installs']+1)
fig,ax=plt.subplots(1,2,figsize=(15,6))
sns.boxplot(x=Reviews,color='red',ax=ax[0])
sns.boxplot(x=Installs,color='orange',ax=ax[1])
ax[0].set_xlabel('log10(Reviews + 1)')
ax[1].set_xlabel('log10(Installs + 1)')
plt.show()

##### 1. Why did you pick the specific chart?

A box plot uses boxes and whiskers to depict the distribution of numeric data, which makes the median, spread, and outliers of Reviews and Installs easy to see.

##### 2. What is/are the insight(s) found from the chart?

1.   Most apps have between 250 and 30,000 reviews.
2.   Most apps have between 1,000 and 1,000,000 installs.

3.   The median number of reviews is 969.

4.   The median number of installs is 10,000.
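
The box edges and the centre line of a box plot correspond to the quartiles and the median. A sketch of how such figures could be read back from log-scaled data, using hypothetical review counts and a hypothetical helper `to_count`:

```python
import numpy as np
import pandas as pd

# Hypothetical review counts spanning several orders of magnitude
reviews = pd.Series([0, 9, 99, 999, 9_999, 99_999, 999_999])

# Same transform as the plot, then the three quartiles on the log scale
log_reviews = np.log10(reviews + 1)
q1, median, q3 = log_reviews.quantile([0.25, 0.5, 0.75])

def to_count(log_value):
    """Undo log10(x + 1) to express a box-plot edge as a raw count."""
    return 10 ** log_value - 1

print(to_count(q1), to_count(median), to_count(q3))
```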





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These data help to understand the relevance of the apps to customers. If the number of reviews for the app is above 30,000 and the number of installs is above 10,000,000, then the app is popular with users.

#### Chart - 3

###4.1.3 **Analysis of Type and Content Rating Column**

In [None]:
#pie chart visualisation of type and content rating
fig,ax=plt.subplots(1,2,figsize=(12,7))
ax[0].pie(x=df_ps['Type'].value_counts(),labels=df_ps['Type'].value_counts().index,explode=[0.2,0],autopct='%1.0f%%')
ax[0].title.set_text('Type')
ax[1].pie(x=df_ps['Content Rating'].value_counts(),labels=df_ps['Content Rating'].value_counts().index,explode=[0,0,0,0,0,1.4],autopct='%1.2f%%')
ax[1].title.set_text('Rating')
plt.show()

##### 1. Why did you pick the specific chart?

Type and Content Rating are categorical data and have 2 and 6 unique values, respectively. So a pie chart is an appropriate way to understand them.

##### 2. What is/are the insight(s) found from the chart?

1.   Most of the apps are free (92%).

2.   Most apps have a content rating of "Everyone" (81.8%).

3.  Very few apps are restricted to ages 18+ or are unrated.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

When developing a new app, the developer should consider making it free and suitable for the Everyone content rating.

#### Chart - 4

###4.1.4  **Analysis of  Price Column**

Free apps have a price of 0, so the dataframe needs to be filtered to paid apps before the analysis.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Filter for paid apps
df_price = df_ps[df_ps['Type'] == 'Paid']

# Display statistical description
print(df_price['Price'].describe())

In [None]:
# Visualize the price distribution
fig, ax = plt.subplots(1, 1, figsize=(10, 5))
sns.violinplot(x='Price', data=df_price, color='red', ax=ax)
plt.title('Price Distribution of Paid Apps')
plt.show()

##### 1. Why did you pick the specific chart?

Price is numerical data, and violinplot is an appropriate approach to visualise the spread of the data.

##### 2. What is/are the insight(s) found from the chart?

Most paid apps are priced below 5 dollars. Very few apps have a high price, near $400.
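
A claim of this kind can be checked numerically as well as visually. A minimal sketch with made-up prices; the 5-dollar cut-off is taken from the text above:

```python
import pandas as pd

# Hypothetical prices of paid apps, in dollars
price = pd.Series([0.99, 1.99, 2.99, 4.99, 9.99, 399.99])

# The mean of a boolean mask is the fraction of True values
share_below_5 = (price < 5).mean() * 100
print(f"{share_below_5:.1f}% of these apps cost less than 5 dollars")

# The median is barely affected by the single extreme price; the mean is
print(price.median(), round(price.mean(), 2))
```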

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

If developers want to launch a new paid app, then the price should be below $5.

#### Chart - 5

###4.1.5  **Analysis of  Size Column**

The Size column has two kinds of entries: apps whose size varies with device and apps with a fixed size.

In [None]:
#find the distribution of fixed size and variable size
var_count=len(df_ps[df_ps['Size']=='Varies with device'])
var_per=round((var_count/len(df_ps))*100,2)
fix_per=100-var_per
print(f"In the dataset {var_per} percentage has variable size and {fix_per} percentage has fixed size")

In [None]:
#statistical analysis of size of fix sized app
#filter the fix sized apps
df_fix=df_ps[df_ps['Size']!='Varies with device']
df_1=df_fix.copy()
#convert it into numerical
df_1['Size']=df_1['Size'].apply(lambda x:float(x))
df_1['Size'].describe()


In [None]:
#visualisation of the size column
fig,ax=plt.subplots(1,1,figsize=(10,6))
sns.histplot(x=df_1['Size'],kde=True,bins=10,color='green',ax=ax)
plt.show()

##### 1. Why did you pick the specific chart?

Histograms are ideal for visualising the distribution of numeric data.

##### 2. What is/are the insight(s) found from the chart?

The data is positively skewed. App sizes lie between 0 and 100 MB, and most of them are below 30 MB.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This data does not have enough information about business impact. But the developer should try to keep the app size below 100MB.

#### Chart - 6

###4.1.6 **Analysis of  Rating Column**

To get a clearer picture of the ratings, the Rating column can be converted into categorical groups as follows.

In [None]:
#create a function that groups ratings
def app_rate(rating):
  '''
  This function categorizes an app rating into one of four groups
  '''
  if rating>=4.0:
    return 'Top Rated'
  elif rating>=3.0:
    return 'Above Average'
  elif rating>=2.0:
    return 'Average'
  else:
    return 'Below Average'

In [None]:
#create a column in the dataframe and apply the function in the column
df_ps['Rating Group']=df_ps['Rating'].apply(lambda x:app_rate(x))
df_ps.head(4)

In [None]:
#visualize the rating group
fig,ax=plt.subplots(1,2,figsize=(10,5))
sns.histplot(x=df_ps['Rating'],kde=True,bins=30,ax=ax[0])
ax[1].pie(x=df_ps['Rating Group'].value_counts(),labels=df_ps['Rating Group'].value_counts().index,autopct='%1.1f%%')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram is a classical tool for describing the spread of numerical data, and a pie chart is appropriate for categorical data with fewer than 10 unique values.

##### 2. What is/are the insight(s) found from the chart?


*  The majority of the apps have a rating of 4.0 to 5.0, indicating that they are top rated, and they account for 80% of the data.

*   A few apps belong to the average and below average ratings.

*  The median value of the rating is 4.3.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Rating is one of the factors associated with an app's popularity, so achieving a top rating is important for a successful app. Because most apps are top-rated, an app that falls outside the top-rated group is likely to lose popularity.

#### Chart - 7

###4.1.7  **Analysis of Sentiment, Sentiment Polarity and Sentiment Subjectivity**

In [None]:
#statistical describe of sentiment_polarity and sentiment_subjectivity
df_ur[['Sentiment_Polarity','Sentiment_Subjectivity']].describe()

In [None]:
#visualisation of sentiment and sentiment polarity and sentiment subjectivity
fig,ax=plt.subplots(3,1,figsize=(20,15))
ax[0].pie(x=df_ur['Sentiment'].value_counts(),labels=df_ur['Sentiment'].value_counts().index,autopct='%1.1f%%')
ax[0].title.set_text('Sentiment Distribution')
sns.histplot(x=df_ur['Sentiment_Polarity'],kde=True,bins=30,ax=ax[1],color='red')
ax[1].title.set_text('Sentiment_Polarity Distribution')
sns.histplot(x=df_ur['Sentiment_Subjectivity'],kde=True,bins=30,color='green',ax=ax[2])
ax[2].title.set_text('Sentiment_Subjectivity Distribution')
plt.show()

##### 1. Why did you pick the specific chart?

Sentiment is categorical data, and it has only three unique values, so a pie chart is appropriate. Sentiment polarity and sentiment subjectivity are numerical data, so a histogram is a classical approach to understanding the distribution of the data.

##### 2. What is/are the insight(s) found from the chart?

*   64.1 percent of the data is positive sentiment, 22.1 percent of the data is negative sentiment, and the rest, 13 percent, is neutral.
*   Most of the sentiment polarity lies between -0.5 and 0.75.
*   Most of the sentiment subjectivity is between 0.2 and 0.8.
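
The percentages shown on a pie chart like this can also be computed directly. A sketch on a made-up `sentiment` series, not the actual `df_ur['Sentiment']` column:

```python
import pandas as pd

# Hypothetical sentiment labels (6 positive, 3 negative, 1 neutral)
sentiment = pd.Series(["Positive"] * 6 + ["Negative"] * 3 + ["Neutral"] * 1)

# normalize=True returns fractions instead of counts
share = sentiment.value_counts(normalize=True).mul(100).round(1)
print(share)
```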






##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Most users have given positive reviews. Negative reviews therefore deserve attention if the app is to stay relevant in the market.

#### Chart - 8

##4.2 **Bivariate and Multivariate Analysis**

“Bi” means two and “variate” means variable, so bivariate analysis examines the relationship between two variables. There are three types of bivariate analysis, depending on the variables involved: numerical vs. numerical, categorical vs. categorical, and numerical vs. categorical.


Multivariate analysis is required when more than two variables have to be analysed simultaneously. It is a tremendously hard task for the human brain to visualise a relationship among four variables in a graph, and thus multivariate analysis is used to study more complex sets of data.

###4.2.1 **Distribution of install column in each Category**

In [None]:
#copy the original dataframe
df_ps_1=df_ps.copy()
#put the Installs column on a log10(x+1) scale so that zero installs stay finite
df_ps_1["Installs"]=np.log10(df_ps_1["Installs"]+1)
#visualize the distribution of installs for each category
g=sns.FacetGrid(df_ps_1,col='Category',col_wrap=5,height=4)
g.map(plt.hist,"Installs",bins=10,color='g')
plt.show()


##### 1. Why did you pick the specific chart?

FacetGrid helps visualise the distribution of one variable as well as the relationship between multiple variables separately within subsets of your dataset using multiple panels. So for an analysis of installs for each category, facetgrid is ideal for that.

##### 2. What is/are the insight(s) found from the chart?

*   All categories have a wide spread of installs, whether they contain many apps or few.

*   Very low and very high install counts are less frequent, and mid-range install counts are most frequent, in Tools, Photography, Personalization, Games, Productivity, and Communication.

*   Arts and Design, Beauty, Comics, Education, Entertainment, House and Home, Libraries and Demo, Parenting, and Weather have lower numbers of installs.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

An app in the Game or Family category has a high chance of reaching a high number of installs, while an app in the Parenting, Weather, or Maps and Navigation category is more likely to see a low number of installs.

#### Chart - 9

###4.2.2 **Size vs Rating for each Category**

Before visualisation, the apps should be divided into two groups: variable size and fixed size.

In [None]:
#keep only the apps with a fixed size and make sure Size is numeric
df_ps_size=df_ps[df_ps['Size']!='Varies with device'].copy()
df_ps_size['Size']=df_ps_size['Size'].astype(float)
#visualize size vs rating of fixed-size apps for each category
g=sns.FacetGrid(df_ps_size,col='Category',hue='Type',palette="Set1",col_wrap=5,height=4)
g.map(plt.scatter,"Size","Rating").add_legend()


In [None]:
#visualise the rating of variable-sized apps by category and type
df_ps_var=df_ps[df_ps['Size']=='Varies with device']
fig,ax=plt.subplots(1,1,figsize=(15,35))
sns.barplot(data=df_ps_var,y='Category',x='Rating',hue='Type',ax=ax);

##### 1. Why did you pick the specific chart?

FacetGrid can be drawn with up to three dimensions: row, col, and hue. The first two have obvious correspondence with the resulting array of axes; think of the hue variable as a third dimension along a depth axis, where different levels are plotted with different colours. So for size vs. rating in each category, a facet grid is appropriate.


##### 2. What is/are the insight(s) found from the chart?


*   Large apps tend to get high ratings, while small apps get mixed ratings in most categories. The exceptions are Medical, Lifestyle, Tools, and Family, which have mixed ratings across all sizes.

* Paid apps are generally better rated than free apps.

* Among apps whose size varies with device, most are highly rated except in Parenting, where paid apps are low-rated.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Rating does not have much dependency on app size.

#### Chart - 10

###4.2.3 **Installation and Android Version Relation**

In [None]:
#create a function to group the Android versions
def min_andr(version):
  '''
  This function reduces an Android version string to its major version number,
  leaving 'Varies with device' unchanged
  '''
  if version=='Varies with device':
    return version
  return version[0]

In [None]:
#apply the function to the Android Ver column
df_ps['Minimum Android Version']=df_ps['Android Ver'].apply(lambda x:min_andr(x))
#create groupby data with minimum android version and Total installs
and_ins=df_ps.groupby(['Minimum Android Version'])['Installs'].sum()
#visualize the plot
fig,ax=plt.subplots(1,1,figsize=(10,8))
sns.barplot(x=and_ins.index,y=and_ins)

plt.show()

##### 1. Why did you pick the specific chart?

Bar charts are mainly used to compare counts or totals across categories, so a bar chart is appropriate for plotting total installs for each minimum Android version.

##### 2. What is/are the insight(s) found from the chart?

*  Apps whose minimum Android version varies with device have the most installs; apps with a minimum version of Android 4 are also highly installed.

*   Apps with other minimum Android versions have far fewer installs.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The minimum Android version is one of the main characteristics of a new app. Based on this chart, the developer should favour "Varies with device" or Android 4 as the minimum version, since other minimum versions are associated with fewer installs.

#### Chart - 11

###4.2.4 **Installs vs App size**

For better analysis, we first group the apps by size.
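
One possible way to build such size groups is `pd.cut` with half-open bins, which makes the bin edges explicit. The sketch below uses made-up sizes in MB and assumes that non-numeric entries such as "Varies with device" should be kept as their own group; the names `edges`, `labels`, and `size_group` are illustrative.

```python
import pandas as pd

# Hypothetical Size column: numbers in MB mixed with a text entry
size = pd.Series([0.5, 3.0, 5.0, 14.9, 15.0, 45.0, 99.0, 120.0, "Varies with device"])

# Non-numeric entries become NaN so that the numeric ones can be binned
numeric = pd.to_numeric(size, errors="coerce")

edges = [0, 1, 5, 15, 30, 60, 100, float("inf")]
labels = ["Below 1MB", "Below 5MB", "5-15MB", "15-30MB",
          "30-60MB", "60-100MB", "Above 100MB"]

# right=False gives bins of the form [low, high), so 5.0 falls in '5-15MB'
binned = pd.cut(numeric, bins=edges, labels=labels, right=False)

# Keep the original text wherever the size was not numeric
size_group = binned.astype(object).where(numeric.notna(), size)
print(size_group.tolist())
```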

In [None]:
#create app_size group
def app_size_group(size):
  '''
  This function groups app sizes (in MB) into size ranges.
  Non-numeric entries such as 'Varies with device' are returned unchanged.
  '''
  try:
    if size<1.0:
      return 'Below 1MB'
    elif size<5.0:
      return 'Below 5MB'
    elif size<15.0:
      return '5-15MB'
    elif size<30.0:
      return '15-30MB'
    elif size<60.0:
      return '30-60MB'
    elif size<100.0:
      return '60-100MB'
    else:
      return 'Above 100MB'
  except TypeError:
    #comparing a string with a number raises TypeError
    return size

In [None]:
#apply the function to the dataframe
df_ps['appsize_category']=df_ps['Size'].apply(lambda x:app_size_group(x))
df_ps.head()

In [None]:
#groupby the appsize_category and type with install
appsize_cat_ins= df_ps.groupby(['appsize_category','Type'])['Installs'].sum().unstack()
appsize_cat_ins

In [None]:
#visualization code most installed app size
fig,ax=plt.subplots(1,2,figsize=(20,6))
sns.barplot(data=df_ps[df_ps['Type']=='Free'],x='appsize_category',y='Installs',hue='Rating Group',ax=ax[0],palette='icefire');
ax[0].set_title('Size vs installs in Free App')
sns.barplot(data=df_ps[df_ps['Type']=='Paid'],x='appsize_category',y='Installs',hue='Rating Group',ax=ax[1],palette='gnuplot');
ax[1].set_title('Size vs installs in Paid App')

##### 1. Why did you pick the specific chart?

The size column is a mix of categorical and numerical data. So for better understanding, it is converted into categorical data. Bar chart is appropriate for categorical and numerical visualisation.

##### 2. What is/are the insight(s) found from the chart?

*   For both free and paid apps, apps whose size varies with device have the highest number of installs.

* For free apps, sizes between 5 MB and 30 MB have high installs. For paid apps, sizes of 5–15 MB have decent installs.

* Free apps close to 100 MB are poorly rated, but paid apps close to 100 MB are highly rated.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

When designing a new app, it is best if the app size varies with device. For fixed-size apps, the preferred size is 5–30 MB for free apps and 5–15 MB for paid apps. Large apps are more likely to be highly rated if they are paid.

#### Chart - 12

###4.2.5 **Sentiment Polarity vs Sentiment Subjectivity**

In [None]:
#plot the sentiment polarity vs sentiment subjectivity

g = sns.JointGrid(data=df_ur, y='Sentiment_Polarity', x='Sentiment_Subjectivity',hue='Sentiment')
g.plot(sns.scatterplot, sns.kdeplot);

##### 1. Why did you pick the specific chart?

Joint plots allow plotting a relationship between two variables (also known as a bivariate relationship) while simultaneously exploring the distribution of each underlying variable. So,to understand sentiment subjectivity and sentiment polarity relations, a joint plot is ideal.

##### 2. What is/are the insight(s) found from the chart?

*   Lower sentiment subjectivity means lower sentiment polarity, and higher sentiment subjectivity has widely spread polarity.

*   Neutral sentiment is independent of subjectivity.

*   Negative sentiment polarity has a lower frequency than positive sentiment polarity.

* Most of the sentiment polarity lies between -0.5 and 0.75.






##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Translated reviews with low sentiment polarity and low subjectivity carry little information about the popularity of the app.

#### Chart - 13

### **Does Sentiment Polarity Affect Rating?**

In [None]:
# Step 1: Create 'Rating Group' column using your app_rate() function
merged_df['Rating Group'] = merged_df['Rating'].apply(lambda x: app_rate(x))

# Step 2: Create Sentiment Polarity column (Positive - Negative)
merged_df['Sentiment_Polarity'] = merged_df['Positive_Sentiment'] - merged_df['Negative_Sentiment']

# Optional: Drop NaNs if any (prevents plotting errors)
merged_df = merged_df.dropna(subset=['Rating Group', 'Sentiment_Polarity'])

# Step 3: Visualize sentiment polarity distribution by rating group
sns.catplot(
    x='Rating Group',
    y='Sentiment_Polarity',
    data=merged_df,
    kind='box',
    height=5,
    aspect=2
)
plt.title('Sentiment Polarity Distribution by Rating Group')
plt.show()


##### 1. Why did you pick the specific chart?

A catplot is well suited to visualising a numerical variable across categories, so it is used here (as a box plot) to analyse sentiment polarity by rating group.

##### 2. What is/are the insight(s) found from the chart?

*   The top-rated group has the maximum positive sentiment polarity. Only a few outliers have negative sentiment polarity.

*   In the average-rated group, the majority of sentiment polarity values are negative.
*   The above-average-rated group has few negative sentiments.





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive sentiment polarity tends to go with the above-average and top-rated groups, and a high rating can in turn lead to more installs, and vice versa.

###**Sentiment Subjectivity distribution of each Rating group**

In [None]:
# Calculate subjectivity
merged_df['Sentiment_Subjectivity'] = merged_df['Positive_Sentiment'] + merged_df['Negative_Sentiment']

# Plot
sns.catplot(
    x='Rating Group',
    y='Sentiment_Subjectivity',
    data=merged_df,
    kind='box',
    height=5,
    aspect=2
)
plt.title('Sentiment Subjectivity Distribution by Rating Group')
plt.show()


#####  What is/are the insight(s) found from the chart?



*   Most of the sentiment subjectivity of the top-rated and above-average apps lies between 0.3 and 0.7. Outliers are also present at higher and lower magnitudes of subjectivity.

#### Chart - 14 - Correlation Heatmap

### **Correlation of Play Store data**

In [None]:
# Select only numeric columns
numeric_df = df_ps.select_dtypes(include='number')

# Plot correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap for Play Store Data', fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

Correlation analysis measures the strength of the linear relationship between two variables. The result shows how closely a change in one variable is associated with a change in the other; it does not by itself establish cause and effect. Correlation analysis is widely used in predictive analytics, and a heatmap makes the whole correlation matrix readable at a glance.
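
As a small illustration of what the heatmap displays, the sketch below computes a Pearson correlation matrix for a made-up frame. The column names mirror the Play Store data, but the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical app statistics (illustrative values only)
df = pd.DataFrame({
    "Installs": [10, 100, 1_000, 10_000, 100_000],
    "Reviews": [1, 8, 120, 900, 11_000],
    "Rating": [4.1, 3.8, 4.4, 4.0, 4.2],
})

# Pairwise Pearson correlation; the matrix is symmetric with 1.0 on the diagonal
corr = df.corr(method="pearson")

# On heavily skewed counts a few large apps dominate, so a log scale can be informative too
log_corr = np.log10(df[["Installs", "Reviews"]]).corr()

print(corr.round(2))
print(log_corr.round(2))
```

A value near 1 or -1 indicates a strong linear association, and a value near 0 indicates a weak one.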

##### 2. What is/are the insight(s) found from the chart?

* Installs and Reviews are strongly positively correlated, meaning they tend to increase together.

*  The rest of the columns have low correlation.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

We should focus on getting a high number of reviews, since reviews and installs rise together, and we should encourage users to leave reviews as often as they can.

### **Correlation of merged dataframe**

In [None]:
# Step 1: Select numeric columns only
numeric_merged_df = merged_df.select_dtypes(include='number')

# Step 2: Plot the heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(numeric_merged_df.corr(), annot=True, cmap="icefire", fmt=".2f")
plt.title('Correlation Heatmap for Merged DataFrame', fontsize=12)
plt.show()

##### 1. Why did you pick the specific chart?

Correlation analysis measures the statistical relationship between the variables of the merged dataframe. The result shows how closely a change in one variable is associated with a change in another. Correlation analysis is widely used in predictive analytics.

##### 2. What is/are the insight(s) found from the chart?



*   Besides installs and reviews, sentiment polarity, positive sentiment, and negative sentiment are highly correlated.

*   Positive sentiment and neutral sentiment have a medium-strength correlation.

* Mean sentiment subjectivity has a medium-strength correlation with rating.

* Reviews has a high correlation with positive sentiment.

* Mean sentiment polarity and negative sentiment are negatively correlated.



#### Chart - 15 - Pair Plot

###**Pairplot of playstore data**

In [None]:
#visualise the pairplot data
df_pair=df_ps[['Rating','Size','Installs','Reviews','Price','Type']].copy()
#non-numeric sizes ('Varies with device') become NaN so that Size can be plotted
df_pair['Size']=pd.to_numeric(df_pair['Size'],errors='coerce')
#put Installs and Reviews on a log10(x+1) scale
df_pair['Installs']=np.log10(df_pair['Installs']+1)
df_pair['Reviews']=np.log10(df_pair['Reviews']+1)
pair=sns.pairplot(df_pair,hue='Type')
#pair.fig.suptitle("Pairwise Plot - Rating, Size, Installs, Reviews, Price",x=0.5, y=1.0, fontsize=16)
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot is a data visualisation that plots pair-wise relationships between all the variables in a dataset. This helps to better understand the relationships visually.

##### 2. What is/are the insight(s) found from the chart?

*   Apps with more reviews tend to have higher ratings, but a highly rated app may or may not have many reviews.

*   Apps with more installs tend to have high ratings, but a highly rated app may or may not have many installs.

*   Price has very little relationship with ratings, reviews, or installs.

*  Reviews and Installs have a roughly linear relationship.

*   Paid apps do not have high numbers of reviews, while free apps have a wide spread of reviews.

*   Paid apps do not have a large number of installations compared to free apps.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


We should focus on getting a high number of reviews, since reviews and installs rise together, and we should encourage users to leave reviews as often as they can.

### 4.3. **Asking and Answering Questions**

1) **Is the average rating significantly different between Tools and Medical apps?**

In [None]:
#list out the rating data of two category
tools = list(df_ps[df_ps.Category == "TOOLS"].Rating)
medical = list(df_ps[df_ps.Category == "MEDICAL"].Rating)
#check whether the rating distributions look approximately normal
sns.kdeplot(tools)
sns.kdeplot(medical)
plt.title("Rating distributions")


In [None]:
#make numpy array of the data
tools_array = np.asarray(tools)
medical_array = np.asarray(medical)
#calculate the standard deviation
print("Standard deviation of tools app ratings:", tools_array.std())
print("Standard deviation of medical app ratings:", medical_array.std())


In [None]:
#import the two-sample t-test from scipy
from scipy.stats import ttest_ind
#confidence level: 95%
#this sets our alpha (threshold value) = 1-0.95 = 0.05

#Null Hypothesis: any difference in the mean rating of Tools and Medical apps is due to random chance
#Alternative Hypothesis: the mean ratings of Tools and Medical apps are significantly different

#p-value = assuming the null hypothesis is correct, the probability of
#getting a sample result at least as extreme as the one observed.

#run the 2 sample test:

_, pvalue = ttest_ind(tools, medical)
if pvalue<=0.05:
    print("Reject Null Hypothesis")
else:
    print("Fail to reject Null Hypothesis")
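
The standard deviations of the two groups were printed earlier because the default `ttest_ind` assumes equal variances. When that assumption is doubtful, Welch's variant (`equal_var=False`) is a common alternative. The sketch below uses simulated ratings, not the dataset, and the sample sizes, means, and spreads are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import ttest_ind

# Simulated ratings standing in for the Tools and Medical groups (hypothetical)
rng = np.random.default_rng(0)
tools_sample = rng.normal(loc=4.0, scale=0.6, size=300)
medical_sample = rng.normal(loc=4.2, scale=0.7, size=200)

# Welch's t-test does not assume the two groups share one variance
stat, pvalue = ttest_ind(tools_sample, medical_sample, equal_var=False)

alpha = 0.05
decision = "reject" if pvalue <= alpha else "fail to reject"
print(f"t = {stat:.2f}, p = {pvalue:.4f}: {decision} the null hypothesis")
```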

#### 1)**Which categories have the most and the fewest apps?**


In [None]:
#find out top 3 category
fig,ax=plt.subplots(1,2,figsize=(12,6))
top=df_ps['Category'].value_counts().head(3)
sns.barplot(x=top.index,y=top,ax=ax[0])
ax[0].set_title('Top 3 Category')
#find out least 3 category
low=df_ps['Category'].value_counts().tail(3)
sns.barplot(x=low.index,y=low,ax=ax[1])
ax[1].set_title('Least 3 Category')

The family category of the app has the highest presence, and it has 1829 apps. Other top categories include games and tools, which have 959 and 825 apps, respectively. The beauty category has the fewest apps, with only 53. Other less popular categories include comics and parenting, which have 56 and 60 apps, respectively.

###2)**Which category has the maximum and minimum number of installations?**

In [None]:
# Top 3 categories by installs
cg_ins_max = df_ps.groupby('Category')['Installs'].sum().sort_values(ascending=False).head(3)

# Bottom 3 categories by installs
cg_ins_min = df_ps.groupby('Category')['Installs'].sum().sort_values().head(3)

# Create side-by-side bar plots
fig, ax = plt.subplots(1, 2, figsize=(12, 5))

# Max installed categories
sns.barplot(x=cg_ins_max.index, y=cg_ins_max.values, ax=ax[0], palette='Greens_r')
ax[0].set_title('Top 3 Categories by Installs')
ax[0].set_ylabel('Total Installs')
ax[0].set_xlabel('Category')

# Min installed categories
sns.barplot(x=cg_ins_min.index, y=cg_ins_min.values, ax=ax[1], palette='Reds')
ax[1].set_title('Bottom 3 Categories by Installs')
ax[1].set_ylabel('Total Installs')
ax[1].set_xlabel('Category')

plt.tight_layout()
plt.show()


Games, communication, and tools have the maximum number of installations. Events, Beauty, and Parenting have a minimum number of installations.

###3)**Find out the most popular app**

To find the most popular app, we apply the following steps in order:


1. Find the apps with the highest number of installs.


2. Among the most installed apps, find which app has the most reviews.


3. If there is a tie, then find which app has the highest rating.
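
The three-step tie-break above can also be expressed as a single multi-key sort. A sketch on a made-up frame; the app names and numbers are hypothetical:

```python
import pandas as pd

# Hypothetical apps: three of them tie on installs, two of those tie on reviews
apps = pd.DataFrame({
    "App": ["X", "Y", "Z", "W"],
    "Installs": [1_000_000_000, 1_000_000_000, 500_000_000, 1_000_000_000],
    "Reviews": [78_000_000, 69_000_000, 90_000_000, 78_000_000],
    "Rating": [4.1, 4.4, 4.5, 4.3],
})

# Sort by installs first, then reviews, then rating, all descending
ranked = (
    apps.sort_values(["Installs", "Reviews", "Rating"], ascending=False)
    .reset_index(drop=True)
)
print(ranked)
```

Later keys only matter when the earlier ones tie, which matches the order of the steps.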

In [None]:
# find out the maximum installed app
max_app=df_ps[df_ps['Installs']==df_ps.Installs.max()]
#plot the maximum installed apps
fig,ax=plt.subplots(figsize=(20,8))
top_visual=sns.barplot(x=max_app.App,y=max_app.Installs)
#rotate the xticklabels
top_visual.set_xticklabels(top_visual.get_xticklabels(), rotation= 75, horizontalalignment='right')
plt.show()

In [None]:
#among the most-installed apps, find the app with the most reviews
fig,ax=plt.subplots(1,1,figsize=(12,8))
max_review=sns.barplot(x=max_app.App,y=max_app.Reviews)
#rotate the xticklabels of this plot
max_review.set_xticklabels(max_review.get_xticklabels(), rotation= 75, horizontalalignment='right')
plt.show()

**Facebook is the most popular app in the Play Store, and after Facebook, WhatsApp is the most popular app in the Play Store.**


From the univariate analysis section, we have seen that 92% of the apps in the dataset are free and 8% are paid.


### **4)Which category has the most paid apps?**

In [None]:
# form a groupby of type for each category
type_cat=df_ps.groupby('Category')['Type'].value_counts().unstack()
#fill the null values with zero
type_cat.fillna(0,inplace=True)
#sort the category in order of paid app
paid_cat=type_cat.sort_values(by='Paid',ascending=False).head(3)
paid_cat


In [None]:
#visualise the plot
paid_cat.plot(kind="bar",figsize=(8, 5))

The Family, Medical, and Game categories have the most paid apps, with 182, 83, and 82 paid apps respectively.
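
As an alternative to the `groupby`/`unstack`/`fillna` chain, the same kind of count table can be sketched with `pd.crosstab`. The data below is a hypothetical stand-in for `df_ps`, and this is only one possible way to build the table.

```python
import pandas as pd

# Hypothetical toy data standing in for df_ps.
apps = pd.DataFrame({
    "Category": ["FAMILY", "FAMILY", "FAMILY", "MEDICAL", "MEDICAL", "GAME"],
    "Type": ["Paid", "Paid", "Free", "Paid", "Free", "Free"],
})

# crosstab counts rows per (Category, Type) pair and fills missing
# combinations with 0, so no separate fillna step is needed.
type_counts = pd.crosstab(apps["Category"], apps["Type"])

# Rank categories by their number of paid apps.
top_paid = type_counts.sort_values(by="Paid", ascending=False)
print(top_paid)
```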

### **5) Find the top 5 paid apps by revenue**

In [None]:
#separate the paid apps from the dataframe
df_paid=df_ps[df_ps['Type']=='Paid']
df_pay=df_paid.copy()
#create a new column with the estimated revenue (installs x price)
df_pay['Revenue']=df_pay['Installs']*df_pay['Price']
#sort the dataframe by revenue and keep the top 5
apps_=df_pay.sort_values(by='Revenue',ascending=False).head()
#visualise the data
fig,ax=plt.subplots(figsize=(12,6))
sns.barplot(x=apps_.App,y=apps_.Revenue)
ax.set_title('Top 5 paid apps by estimated revenue')

The top 5 paid apps by estimated revenue are Minecraft, I am Rich, I am Rich Premium, Hitman Sniper, and Grand Theft Auto: San Andreas.
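
The revenue figure used here is installs multiplied by price, which is only a rough estimate. The sketch below illustrates the calculation on made-up apps (all names and numbers are hypothetical) and uses `nlargest` as a compact alternative to sorting and taking the head.

```python
import pandas as pd

# Hypothetical toy data; names and numbers are made up for illustration.
paid = pd.DataFrame({
    "App": ["Game X", "Luxury Y", "Tool Z"],
    "Installs": [10_000_000, 100_000, 1_000_000],
    "Price": [6.99, 399.99, 0.99],
})

# Estimated revenue = installs x current price. This assumes every
# install was a purchase at today's list price, so treat it as a rough
# estimate rather than actual earnings.
paid["Revenue"] = paid["Installs"] * paid["Price"]

# Keep the two highest-revenue apps.
top = paid.nlargest(2, "Revenue")[["App", "Revenue"]]
print(top)
```

Note how a very expensive app with few installs can still rank close to a cheap app with many installs, which may explain why novelty apps appear in such rankings.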

### **6) What are the largest apps, and which is the most popular among them?**

In [None]:
#find out the largest apps
max_size=df_1[df_1['Size']==df_1['Size'].max()]
max_size.sort_values(by=['Installs','Reviews','Rating'],ascending=False).head()


Hungry Shark Evolution is the most popular large-size app; after that, Simcity Buildit and Miami Crime Simulator are the most popular apps in the dataset.

### **7) What are the top genres of apps with the most installations?**

In [None]:
genre_app=df_ps.groupby(['Genres'])['Installs'].sum().sort_values(ascending=False).head()
plt.subplots(figsize=(10,5))
sns.barplot(x=genre_app,y=genre_app.index)
plt.ylabel('Genre')
plt.xlabel('No of Installs')
plt.title('Top 5 genres by number of installs');

The Communication, Tools, and Productivity genres have the highest number of installations.

### **8) Are there any apps that have not been updated in years?**

For this, we are going to add two more columns, named Year and Month.

In [None]:
#extract the year and month from the 'Last Updated' column
d = pd.DatetimeIndex(df_ps['Last Updated'])
df_ps['year'] = d.year
df_ps['month'] = d.month
df_ps.head()

Around 3,000 apps have not been updated in years. These apps may no longer be actively maintained or in service.
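
One way to arrive at such a count is to filter on the extracted year. The following is a minimal sketch on made-up data; the date format and the cutoff year are assumptions, not values taken from the notebook.

```python
import pandas as pd

# Hypothetical toy data; 'Last Updated' mirrors the column used above.
apps = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Last Updated": ["January 7, 2018", "March 15, 2015",
                     "July 31, 2018", "May 2, 2013"],
})

# Parse the date strings (format assumed here) and extract the year.
apps["year"] = pd.to_datetime(
    apps["Last Updated"], format="%B %d, %Y"
).dt.year

# Assumed cutoff: treat apps last updated before 2017 as stale.
CUTOFF_YEAR = 2017
stale = apps[apps["year"] < CUTOFF_YEAR]

print(len(stale), "apps last updated before", CUTOFF_YEAR)
print(apps["year"].value_counts().sort_index())
```

The per-year counts from `value_counts` make it easy to see how sensitive the result is to the chosen cutoff.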

### **9) Find the apps with the most positive and most negative reviews, along with their installs and ratings**

In [None]:
r_p=merged_df.sort_values(by='Positive_Sentiment',ascending=False).head()
r_p

Helix Jump is the most positively reviewed app in the Play Store.

In [None]:
r_j=merged_df.sort_values(by='Negative_Sentiment',ascending=False).head()
r_j_new=r_j.reset_index()
r_j_new

Angry Birds Classic is the most negatively reviewed app in the Play Store.

### **10) How does sentiment polarity affect rating?**

In [None]:
#plot the sentiment polarity
fig,ax=plt.subplots(1,1,figsize=(8,8))
sns.scatterplot(data=merged_df,x='Negative_Sentiment',y='Positive_Sentiment',hue='Rating Group');

When positive and negative sentiment are both low, the rating is mixed (it may be high or low), but when both are high, the rating is high. When positive sentiment is high and negative sentiment is low, the rating is also high.
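
The visual reading of the scatter plot can be cross-checked numerically by grouping apps into low/high sentiment quadrants and comparing ratings. This is only a sketch: the data is made up, and the fixed threshold of 50 is a hypothetical choice rather than anything used in the notebook.

```python
import pandas as pd

# Hypothetical toy data using the merged_df column names from above.
merged = pd.DataFrame({
    "Positive_Sentiment": [5, 8, 120, 150, 140, 10],
    "Negative_Sentiment": [3, 6, 90, 110, 5, 4],
    "Rating": [3.2, 4.6, 4.4, 4.5, 4.6, 4.0],
})

# Assumed threshold separating "low" from "high" sentiment counts.
THRESHOLD = 50
for col in ["Positive_Sentiment", "Negative_Sentiment"]:
    merged[col + "_level"] = (merged[col] >= THRESHOLD).map(
        {True: "high", False: "low"}
    )

# Mean rating and number of apps in each sentiment quadrant.
summary = merged.groupby(
    ["Positive_Sentiment_level", "Negative_Sentiment_level"]
)["Rating"].agg(["mean", "count"])
print(summary)
```

On real data, a wide spread of ratings in the low/low quadrant would correspond to the "mixed" pattern described above.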

### **11) Find the 10 most expensive apps**

In [None]:
#find the paid app sorted by price
paid_app=df_ps[df_ps['Type']=='Paid'].sort_values(by='Price',ascending=False).head(10)

#visualize the apps
fig,ax=plt.subplots(figsize=(15,8))
app_type_visuals=sns.barplot(data=paid_app,x=paid_app['App'],y=paid_app['Installs']);
app_type_visuals.set_xticklabels(app_type_visuals.get_xticklabels(),rotation=75);
plt.show()

The most expensive paid apps include I'm Rich: Trump Edition, I Am Rich Premium, I Am Rich Pro Plus, I am Rich, I am Rich (Most Expensive App), I Am Rich Pro, Most Expensive App (H), I am Rich (Premium), and I am Rich!, nearly all of them variants of "I am Rich".

### **12) Find the top 10 games**

In [None]:
# helper to plot the most installed apps of a category
def cat_top10(category):
  '''
  Plot the top 10 most installed apps of the given category.
  '''
  df_top=df_ps[df_ps['Category']==category]
  data=df_top[['App','Installs']].sort_values('Installs',ascending=False).head(10)
  fig,ax=plt.subplots(figsize=(15,5))
  #ax.set_title('Top 10 installed App ')
  visuals=sns.barplot(x=data.App,y=data.Installs,palette='husl')
  visuals.set_xticklabels(visuals.get_xticklabels(), rotation= 75, horizontalalignment='right')

In [None]:
cat_top10('GAME')


## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.



### **Key Insights and Recommendations for App Developers**

* **Pricing Strategy:**
  Approximately **92% of apps are free**, indicating that users prefer free applications. Developers aiming for a large user base should consider releasing free versions. If opting for a paid model, it is advisable to keep the **price below \$50** to remain competitive.

* **Category Selection:**
  Underrepresented categories such as **Events**, **Beauty**, and **Parenting** show promising potential. These categories have **fewer apps but relatively high installation numbers**, suggesting an opportunity for growth.

* **App Size Optimization:**
  App size significantly impacts installation rates. Ideally:

  * Make the app size **variable with the device** when possible.
  * If fixed, keep it between **5–15 MB** for general apps.
  * For categories like **Games** or **Family**, larger sizes are acceptable.

* **Content and Genre Preferences:**

  * Preferred **content rating**: “**Everyone**”
  * Preferred **genres**: **Communication**, **Tools**, **Photography**, and **Social** tend to attract more users.

* **Android Version Compatibility:**
  It is recommended to either:

  * Set the Android version to be **compatible with the device**, or
  * Use a **minimum requirement of Android 4.0** to maximize reach.

* **Game Category Caution:**
  Although popular, **Games** receive the **most negative reviews**. Developers should be particularly attentive to **user feedback** and focus on **game performance and stability**.

* **Regular Updates & User Feedback Integration:**
  Continuous improvement is crucial. Apps should be **regularly updated** to reflect user preferences and to resolve issues raised in reviews. This contributes to better ratings and user satisfaction.

* **Target Audience Metrics:**
  Successful apps typically have:

  * A **rating above 4.0**
  * **Installations of 1 million (10 lakh)** or more

  Developers should aim for these benchmarks when designing and marketing their apps.

* **Importance of Exploratory Data Analysis (EDA):**
  Conducting EDA prior to development provides valuable insights that can **reduce the risk of app failure**. It helps in making data-driven decisions regarding app design, features, and marketing strategies.
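
The rating and installation benchmarks listed above can be checked against a dataset with a simple boolean filter. The following is a minimal sketch on made-up data (the `apps` frame is a hypothetical stand-in for `df_ps`); the thresholds are the ones named in the recommendations.

```python
import pandas as pd

# Hypothetical toy data standing in for df_ps.
apps = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.5, 3.9, 4.2, 4.8],
    "Installs": [5_000_000, 10_000_000, 500_000, 1_000_000],
})

# Thresholds taken from the recommendations above.
MIN_RATING = 4.0
MIN_INSTALLS = 1_000_000

# Keep apps that clear both benchmarks.
meets = apps[(apps["Rating"] > MIN_RATING) & (apps["Installs"] >= MIN_INSTALLS)]
share = len(meets) / len(apps)
print(meets)
print(f"{share:.0%} of apps meet both benchmarks")
```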


# **Conclusion**

* **Facebook** remains the most popular application on the Google Play Store in terms of installations.

* **Helix Jump** holds the highest number of positive user reviews, while **Angry Birds Classic** has received the most negative feedback.

* Applications that receive a significant number of both positive and negative reviews tend to maintain a high overall rating. This suggests that after receiving initial negative feedback, updates may have addressed user concerns effectively.

* **I Am Rich Premium** is one of the most installed paid applications, indicating a notable interest despite its premium pricing.

* Approximately **3,000 applications** have not been updated in recent years, implying that a substantial portion may no longer be actively maintained or in service.

* In most categories, **paid applications** generally receive higher user ratings compared to free ones, with the **Parenting** category being the primary exception.

* While applications with **large file sizes** are relatively few, they tend to have a high number of installations, possibly reflecting the value users associate with more content-rich or feature-intensive apps.

* A major challenge in this project was **data cleaning**. Around **15% of Play Store data** and **42% of user review data** were missing, necessitating careful preprocessing and validation.

* Only **816 applications** have matching records in both the Play Store dataset and the user review dataset. This represents about **10% of the total apps**, but is sufficient to perform a meaningful analysis on the merged dataset.

* Features such as **current version**, **last update date**, and **sentiment subjectivity** should be analyzed further to assess their impact on **ratings** and **installation numbers**.

* Understanding the **evolution of an app's metrics**—including installation count, user rating, and review volume—over time can provide valuable insights into app performance and business trends.
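
The overlap between the two datasets mentioned above can be measured by comparing unique app names before merging. The sketch below uses made-up tables (`store` and `reviews` are hypothetical names) and is not necessarily how the notebook computed its figure.

```python
import pandas as pd

# Hypothetical toy tables; the real notebook merges the Play Store
# data with the user review data.
store = pd.DataFrame({"App": ["A", "B", "C", "D", "E"]})
reviews = pd.DataFrame({
    "App": ["B", "B", "D", "Z"],
    "Sentiment": ["Positive", "Negative", "Positive", "Neutral"],
})

# Compare unique app names so repeated reviews do not inflate the count.
store_apps = set(store["App"])
review_apps = set(reviews["App"].dropna())
common = store_apps & review_apps

coverage = len(common) / len(store_apps)
print(f"{len(common)} shared apps ({coverage:.0%} of store apps)")

# An inner merge keeps only rows belonging to those shared apps.
merged = store.merge(reviews, on="App", how="inner")
print(merged)
```

Counting unique names first matters because the merged frame has one row per review, not one row per app.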

