<a href="https://colab.research.google.com/github/shakirsayeed/PlayStore_DataAnalysis/blob/main/EDA_Project_Work_03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Play Store App Review Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Member  -** Syed Shakir Sayeed


# **Project Summary -**

The "Play Store App Review Analysis using Exploratory Data Analysis - EDA" project focuses on extracting valuable insights and patterns from app reviews available on the Google Play Store. The project aims to uncover trends, sentiments, and user feedback related to various mobile applications. By employing exploratory data analysis techniques, the project seeks to provide a comprehensive understanding of user sentiments, popular features, and potential areas for improvement for app developers.

Steps involved in developing the project:

1. Data Collection
2. Data Cleaning and Preprocessing
3. Implementing Exploratory Data Analysis
4. Data Visualization

# **GitHub Link -**

https://github.com/shakirsayeed/PlayStore_DataAnalysis.git

# **Problem Statement**


Explore and analyze the Google Play Store dataset to discover the key factors responsible for app engagement and success.


#### **Define Your Business Objective?**

## <b><i>The main objective of this project is to help app developers and businesses understand what factors make their apps successful. By analyzing the play_store_data and user_review_data datasets, we will identify relevant KPIs (Key Performance Indicators).</i></b>
## <b><i>This information will be used to provide insights about the data and recommendations on how to improve app engagement, retain app users, increase app revenue, and enhance marketing strategies.</i></b>
## <b><i>My mission is to empower businesses in developing app solutions that ensure customer satisfaction and contribute to business growth.</i></b>

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd # used for Data Processing
import numpy as np # used to access builtin Numerical Methods and Functions
import matplotlib.pyplot as plt # Used for Data Visualization
import seaborn as sns # Used for Data Visualization
from datetime import datetime

# **1st Step: Data Cleaning on Play Store  Dataset**

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')


In [None]:
psd_path="/content/drive/MyDrive/DS_Notes/EDA_Project/Dataset/Play_Store_Data.csv"

psd_df=pd.read_csv(psd_path)



### Dataset First View

In [None]:
# Dataset First Look of PlayStore Data
dataview_playstore= pd.concat([psd_df.head(),psd_df.tail()])
dataview_playstore

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count of PlayStore Data
print(psd_df.columns)
rows=psd_df.shape[0]
cols= psd_df.shape[1]
print(f"The number of rows are {rows} and columns are {cols}")

### Dataset Information

In [None]:
# Dataset Info PlayStore Data
print("=*=*=*=*=*=*=*=*=*Data information=*=*=*=*=*=*=*=*=*=*=*=*=")
psd_df.info()
print("=*=*=*=*=*=*=*=*=*Data Describe=*=*=*=*=*=*=*=*=*=*=*=*=")
psd_df.describe(include='all')

#### Duplicate Values

In [None]:
# duplicated() must be called; it returns a boolean mask whose sum is the number of duplicate rows
duplicate_values=psd_df.duplicated().sum()
print("The number of Duplicated Values are",duplicate_values)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count PlayStore Data
print(psd_df.isnull().sum())

In [None]:
# Visualizing the missing values
# Create the bar plot
missing = psd_df.isnull()
missing_sum = missing.sum().sort_values(ascending=False)

missing_sum.plot(kind='bar')
plt.ylabel('Number of Missing Values')
plt.xlabel('Columns Name')
plt.title(' Missing Values by Column')
plt.xticks(rotation=90)  # Rotate x-axis labels for better readability if needed
plt.tight_layout()  # Ensures the labels fit within the plot area
plt.show()

### What did you know about your dataset?

In the above dataset we have seen that


1.   The Rating column has 1474 missing values
2.   The Type column has 1 missing value
3.   The Content Rating column has 1 missing value
4.   The Current Ver column has 8 missing values
5.   The Android Ver column has 3 missing values

Since these columns contain missing values, we need to handle them before analyzing the dataset.
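
The missing-value counts above can also be tabulated together with their share of all rows, which helps when deciding whether to impute or drop. The following is a minimal sketch on a made-up toy frame (`toy` and `missing_summary` are illustrative names, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for psd_df (values are made up for illustration)
toy = pd.DataFrame({
    "App": ["a", "b", "c", "d"],
    "Rating": [4.1, np.nan, 3.9, np.nan],
    "Type": ["Free", "Paid", None, "Free"],
})

# Count and percentage of missing values per column, largest first
missing_summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": (toy.isnull().mean() * 100).round(2),
}).sort_values("missing", ascending=False)

print(missing_summary)
```

The same two expressions should work unchanged on `psd_df`, assuming it is loaded as in the cells above.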



## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns of PlayStore Data
psd_df.columns


 The 13 columns in the playstore dataset are identified as below:
1. **App** - The name of the application, with a short description.
2. **Category** - The category the application belongs to.
3. **Rating** - The average rating the app received from its users.
4. **Reviews** - The total number of users who have reviewed the application.
5. **Size** - The space the application occupies on the mobile phone.
6. **Installs** - The total number of downloads of the application.
7. **Type** - Whether the app is free to use or paid.
8. **Price** - The price payable to install the app. For free apps, the price is zero.
9. **Content Rating** - The age group the app is suitable for.
10. **Genres** - The other categories to which the application belongs.
11. **Last Updated** - When the application was last updated.
12. **Current Ver** - The current version of the application.
13. **Android Ver** - The Android version required to run the application.

In [None]:
# Dataset Describe
psd_df.describe()
# psd_df.describe(include='all')# It is used to display  statistical information of all the columns

### Variables Description

Here is a short description of the statistical information that describe() returns for the Rating column:


*   **Count:** The number of non-null values is 9367
*   **Mean:** The mean value of the column is 4.193338
*   **Standard Deviation:** The std of the column is 0.537431
*   **Minimum Value:** The min value of the column is 1.0
*   **25% value:** The 25th percentile of the column is 4.0
*   **50% value:** The 50th percentile of the column is 4.3
*   **75% value:** The 75th percentile of the column is 4.5
*   **Maximum Value:** The max value of the column is 19.0


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in psd_df.columns.to_list():
  print("Unique values in" ,i, "is",psd_df[i].nunique())

In the output above, the App column has the most unique values, around 9660.

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Dataset Information about various attributes of Dataset
def Playstore_data():
  temp=pd.DataFrame(index=psd_df.columns)
  temp['Datatypes']=psd_df.dtypes
  temp['Not Null Values']=psd_df.count()
  temp['Null Values']=psd_df.isnull().sum()
  temp['% ratio of Null Values']=psd_df.isnull().mean()
  temp['Unique Values']=psd_df.nunique()
  temp["Duplicate Values"]=psd_df.duplicated().sum()
  return temp
Playstore_data()


In [None]:
psd_df.boxplot()

Here the bulk of the Rating values lies between roughly 3.0 and 5.0, but the boxplot shows an outlier that exceeds the maximum valid rating of 5.0.

In [None]:
psd_df[psd_df['Rating']> 5]

**There is some problem in above row**
1. It is having Rating 19.0
2. NaN values in Content Rating and Android Ver.

So, here we are dropping this row

In [None]:
psd_df.drop(10472,inplace=True)

Now we are checking whether it is removed or not

In [None]:
psd_df.boxplot()
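
Dropping by the row label works for this particular file, but the label would change if the data were reloaded in a different order. As an alternative sketch, the invalid rating can be removed by condition instead. The toy frame and the names `valid` and `cleaned` are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Toy frame with one impossible rating (19.0) and one missing rating
toy = pd.DataFrame(
    {"App": ["a", "b", "c"], "Rating": [4.5, 19.0, np.nan]},
    index=[10, 20, 30],
)

# Ratings must lie in [1, 5]; keep missing ratings so they can be imputed later
valid = toy["Rating"].between(1, 5) | toy["Rating"].isna()
cleaned = toy[valid]

print(cleaned)
```

Filtering this way does not depend on knowing the index label of the bad row in advance.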

The dataset still has 1474 null values in the Rating column, so we compute the mean and median of that column to decide how to fill them.

In [None]:
# Finding Mean
rating_mean= psd_df['Rating'].mean()
print(f"Mean value of Rating is {rating_mean}")
# Finding Median
rating_median= psd_df['Rating'].median()
print(f"Median value of Rating is {rating_median}")

In the output above:
1.  The mean value is approximately 4.2
2.  The median value is 4.3

There is only a minute difference between the mean and the median.
So here we replace all null values with the median; 50% of the apps have a rating of 4.3 or above.


In [None]:
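# Assign the result back to the column; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
psd_df['Rating']=psd_df['Rating'].fillna(value=rating_median)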
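
To illustrate why the median is a reasonable fill value, here is a small sketch on made-up ratings in which a single low value pulls the mean down while the median barely moves. The series `ratings` and its values are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Made-up ratings: one low value and two missing entries
ratings = pd.Series([4.5, 4.3, 4.4, 1.0, np.nan, np.nan], name="Rating")

rating_mean = ratings.mean()      # pulled down by the single low rating
rating_median = ratings.median()  # less sensitive to that value

# Assign the result of fillna to a new variable instead of filling in place
filled = ratings.fillna(rating_median)

print(f"mean={rating_mean:.2f}, median={rating_median:.2f}")
print(filled)
```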

In [None]:
# Here, checking the null values are filled or not
psd_df.isnull().sum()

Now, there are still some problems in the columns that we have to fix

*   The Type has 1 NaN value.
*   The Current Ver contains 8 NaN values.
*   The Android Ver contains 2 NaN values.

# **Replacing Type NaN Values**

In [None]:
# Here we are checking Nan values in Type column
psd_df[psd_df['Type'].isnull()]

In the row above, the Price is 0, so it is a free app. So, we replace NaN with Free.

In [None]:
# Checking How many Apps are Free and Paid
psd_df['Type'].value_counts()

In [None]:
# Replacing NaN with Free
psd_df.loc[9148,'Type']='Free'

In [None]:
# Now, We are checking is it replaced with Free or not
psd_df[psd_df['Type'].isnull()]
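
The same fix can be expressed as a rule rather than a hard-coded row label: a missing Type is set to 'Free' only where the Price is '0'. This is a sketch on a toy frame, and it assumes Price is still stored as text at this point, as in the notebook:

```python
import pandas as pd

# Toy frame: two rows have a missing Type, but only one of them has a price of '0'
toy = pd.DataFrame({
    "App": ["a", "b", "c", "d"],
    "Type": ["Free", None, "Paid", None],
    "Price": ["0", "0", "$4.99", "$1.99"],
})

# Only label a row 'Free' when its Type is missing and its price is zero
is_free = toy["Type"].isna() & (toy["Price"] == "0")
toy.loc[is_free, "Type"] = "Free"

print(toy)
```

Row `d` keeps its missing Type, because a non-zero price does not justify the 'Free' label.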

# **Replacing Current Ver NaN Values**

In [None]:
# Here we are checking Nan values in Current Ver column
psd_df[psd_df['Current Ver'].isnull()]

We have 8 NaN values, so we are dropping all 8 rows.

In [None]:
# Dropping all 8 NaN values from Current Ver column
psd_df.drop([15,1553,6322,6803,7333,7407,7730,10342],axis=0,inplace=True)

In [None]:
# Now, We are checking Whether the rows have been dropped or not
psd_df[psd_df['Current Ver'].isnull()]

# **Replacing Android Ver NaN Values**

In [None]:
# Here we are checking Nan values in Android  Ver column
psd_df[psd_df['Android Ver'].isnull()]

Here, we have 2 NaN values, so we are dropping these 2 rows as well.

In [None]:
# Dropping 2 Nan values from Android Ver column
psd_df.drop([4453,4490],axis=0,inplace=True)

In [None]:
# Here we are checking whether the NaN values in Android Ver are dropped or not
psd_df[psd_df['Android Ver'].isnull()]
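
Instead of listing row labels by hand, the rows with a missing version can be dropped with `dropna(subset=...)`, which leaves missing values in other columns untouched. A minimal sketch on a toy frame (all values are made up):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "App": ["a", "b", "c", "d"],
    "Rating": [4.0, np.nan, 4.2, 3.8],
    "Current Ver": ["1.0", "2.1", None, "3.0"],
    "Android Ver": ["4.1 and up", "4.4 and up", "5.0 and up", None],
})

# Drop only the rows where one of the two version columns is missing
cleaned = toy.dropna(subset=["Current Ver", "Android Ver"])

print(cleaned)
```

Row `b` survives even though its Rating is missing, because Rating is not part of `subset`.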

## **Now it's Time to Handle the Data Types for Every Column**

In [None]:
psd_df['Size'].value_counts()

**The Size column has different units**

1. 'k' for KB
2. 'M' for MB

In [None]:
# Checking the values of Price
psd_df['Price'].value_counts()

Here, the $ symbol in Price prevents the column from being treated as a number, so we remove it.

In [None]:
# We are removing the $ symbol
def price(p):
  if type(p)==str and '$' in p:
    p=p.replace('$', '')
  return p


In [None]:
psd_df['Price']=psd_df['Price'].apply(lambda x: price(x))
psd_df['Price']
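
Removing the `$` still leaves the prices stored as text. A possible follow-up, sketched here on a made-up series, is to convert the cleaned strings to floats so that prices can be summed or compared. This assumes every remaining value is a plain number once the `$` is gone:

```python
import pandas as pd

# Made-up prices in the same text format as the Price column
prices = pd.Series(["0", "$4.99", "$0.99", "0"], name="Price")

# Strip the literal '$' (regex=False treats it as a plain character), then cast to float
price_num = prices.str.replace("$", "", regex=False).astype(float)

print(price_num)
print("Paid apps:", (price_num > 0).sum())
```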

In [None]:
# Checking datatype of Installs
psd_df['Installs'].value_counts()

Here, in Installs we need to remove (,) and (+) sign


In [None]:
# We are removing the comma (,) and + sign from the Installs column
def comma_plus(val):
  '''
  This function drops the + symbol and the commas if present and returns the value with int datatype.
  '''
  # `'+' and ',' in val` only tests for the comma, so strip both characters unconditionally
  return int(str(val).replace(',', '').replace('+', ''))


In [None]:
psd_df['Installs']=psd_df['Installs'].apply(lambda x: comma_plus(x))
psd_df.head(5)
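
As a side note, the same cleaning can be done on the whole column at once with pandas string methods. The sketch below uses a made-up series and also shows why a condition written as `'+' and ',' in val` does not test for both characters:

```python
import pandas as pd

# Made-up install counts in the same text format as the Installs column
installs = pd.Series(["10,000+", "500+", "1,000,000+", "0"], name="Installs")

# Remove every '+' and ',' in one vectorized step, then cast to int
installs_num = installs.str.replace(r"[+,]", "", regex=True).astype(int)
print(installs_num.tolist())

# Python parses this as '+' and (',' in "500+"); the non-empty string '+' is truthy,
# so only the comma test decides the result
print('+' and ',' in "500+")  # False, although '+' is present
```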

In [None]:
# Now we're converting Size to MB (values in KB are divided by 1024)
def clean_size(x):
    '''
    This function converts a Size entry to MB and returns it as a float.
    'Varies with device' becomes NaN, values ending in 'k' (KB) are divided by 1024,
    and values ending in 'M' (MB) are kept as they are.
    '''
    if 'Varies with device' in str(x):
        return np.nan
    elif 'k' in str(x):
        return float(str(x).replace('k', '')) / 1024
    else:
        return float(str(x).replace('M', ''))


In [None]:
psd_df['Size']=psd_df['Size'].apply(clean_size)
psd_df.head()
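
To make the unit handling concrete, here is a small worked sketch on made-up sizes. `size_to_mb` is a hypothetical helper written for this illustration; it follows the same rules as the conversion above (KB divided by 1024, MB kept, anything else treated as missing):

```python
import numpy as np
import pandas as pd

def size_to_mb(value):
    """Hypothetical helper: convert '19M' / '512k' style strings to MB as float."""
    text = str(value).strip()
    if text.endswith("M"):
        return float(text[:-1])
    if text.endswith("k"):
        return float(text[:-1]) / 1024
    # Anything else (for example 'Varies with device') is treated as missing
    return np.nan

sizes = pd.Series(["19M", "512k", "Varies with device", "8.7M"])
size_mb = sizes.apply(size_to_mb)

print(size_mb)
```

Here `512k` becomes 0.5 MB, and the 'Varies with device' entry becomes NaN, so it is ignored by aggregations such as `mean()`.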

# **Here, we verify that Size "Varies with device" was replaced with NaN values**

In [None]:
# clean_size() above already mapped 'Varies with device' to NaN, so these rows now have a missing Size
psd_df[psd_df['Size'].isnull()]

In [None]:
# Count the apps whose Size is missing after the conversion
psd_df['Size'].isnull().sum()

# **Removing Duplicate Values**

In [None]:
# Handling  Duplicate  data
psd_df.head()

In [None]:
psd_df['App'].value_counts()

Here we have found that ROBLOX appears 9 times, and other apps are duplicated in the same way.

In [None]:
psd_df[psd_df['App']=='ROBLOX']

In [None]:
psd_df[psd_df.duplicated()]

In [None]:
# We are dropping all the duplicate values
psd_df.drop_duplicates(subset='App',inplace=True)
psd_df.shape

In [None]:
# Checking whether duplicates are removed from App column
# (print the check, because a notebook cell only displays its last expression)
print(len(psd_df[psd_df['App']=='ROBLOX']))
psd_df.shape

**After dropping the duplicate apps from the App column, the total number of rows dropped to 9649**
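
`drop_duplicates(subset='App')` keeps the first row it encounters for each app. If one wanted to keep the most-reviewed row instead, a possible variant is to sort by review count first. This is only a sketch on made-up rows, and it assumes the Reviews column is converted to integers beforehand:

```python
import pandas as pd

toy = pd.DataFrame({
    "App": ["ROBLOX", "ROBLOX", "Maps", "ROBLOX"],
    "Reviews": ["4447388", "4449910", "100", "4447346"],  # made-up counts stored as text
})

# Reviews must be numeric for the sort to be meaningful
toy["Reviews"] = toy["Reviews"].astype(int)

# Sort so the most-reviewed row of each app comes first, keep it, then restore the original order
deduped = (
    toy.sort_values("Reviews", ascending=False)
       .drop_duplicates(subset="App", keep="first")
       .sort_index()
)

print(deduped)
```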

In [None]:
psd_df['App'].duplicated().sum()

In [None]:
psd_df.shape

In [None]:
psd_df.describe()

## **Summary**

*   All duplicate values were removed from the dataset.
*   All null values were removed or replaced.
*   The data types of the relevant columns were converted and all unwanted characters were removed.

# **2nd Step: Data Cleaning on User Reviews Dataset**

In [None]:
urd_path="/content/drive/MyDrive/DS_Notes/EDA_Project/Dataset/User_Reviews.csv"
urd_df=pd.read_csv(urd_path)

In [None]:
# Dataset First Look of userReview Data
dataview_user_review= pd.concat([urd_df.head(),urd_df.tail()])
dataview_user_review

**No of Rows and Columns in Dataset of User Review**

In [None]:
# Dataset Rows & Columns count of UserReview Data
print(urd_df.columns)
rows=urd_df.shape[0]
cols= urd_df.shape[1]
print(f"The number of rows are {rows} and columns are {cols}")

**Dataset Information of User Review**

In [None]:
# Dataset Info User Review Data
print("=*=*=*=*=*=*=*=*=*Data information=*=*=*=*=*=*=*=*=*=*=*=*=")
urd_df.info()
print("=*=*=*=*=*=*=*=*=*Data Describe=*=*=*=*=*=*=*=*=*=*=*=*=")
urd_df.describe(include='all')

**Duplicated Values**

In [None]:
# duplicated() must be called; it returns a boolean mask whose sum is the number of duplicate rows
duplicate_values=urd_df.duplicated().sum()
print("The number of Duplicated Values are",duplicate_values)

**Missing or Null Values**

In [None]:
# Missing Values/Null Values Count User Review Data
print(urd_df.isnull().sum())

In [None]:
# Visualizing the missing values
# Create the bar plot
missing = urd_df.isnull()
missing_sum = missing.sum().sort_values(ascending=False)

missing_sum.plot(kind='bar')
plt.ylabel('Number of Missing Values')
plt.xlabel('Columns Name')
plt.title(' Missing Values by Column')
plt.xticks(rotation=90)  # Rotate x-axis labels for better readability if needed
plt.tight_layout()  # Ensures the labels fit within the plot area
plt.show()

In the above dataset we have seen that

1. The Translated_Review column has 26868 missing values
2. The Sentiment column has 26863 missing values
3. The Sentiment_Polarity column has 26863 missing values
4. The Sentiment_Subjectivity column has 26863 missing values

Since these columns contain missing values, we need to handle them before analyzing the dataset.

# The dataset has 5 columns identified as below:


1.   **App:** Title of the application.
2.   **Translated_Review:** It contains the English translation of the review.
3.   **Sentiment:** It gives the emotion like ‘Positive’, ‘Negative’, or ‘Neutral’.
4.   **Sentiment_Polarity:** It gives the polarity of the review. Its range is [-1,1], where 1 means ‘Positive statement’ and -1 means a ‘Negative statement’.
5.    **Sentiment_Subjectivity:** This value gives how close a reviewer's opinion is to the opinion of the general public. Its range is [0,1].

In [None]:
# Dataset Describe
urd_df.describe()
# urd_df.describe(include='all')# It is used to display statistical information of all the columns

Checking Unique Values

In [None]:
# Check Unique Values for each variable.
for i in urd_df.columns.to_list():
  print("Unique values in" ,i, "is",urd_df[i].nunique())

In [None]:
# Dataset Information about various attributes of Dataset
def User_Review():
  temp=pd.DataFrame(index=urd_df.columns)
  temp['Datatypes']=urd_df.dtypes
  temp['Not Null Values']=urd_df.count()
  temp['Null Values']=urd_df.isnull().sum()
  temp['% ratio of Null Values']=urd_df.isnull().mean()
  temp['Unique Values']=urd_df.nunique()
  temp["Duplicate Values"]=urd_df.duplicated().sum()
  return temp
User_Review()


In [None]:
urd_df.boxplot()

## In the above dataset we have observed that there are many NaN values, so we analyze them before removing them

# Finding NaN Values in the dataset


In [None]:
urd_df[(urd_df['Translated_Review'].isnull())]

Here we can see that a total of 26868 rows contain NaN values in the Translated_Review column.

Rows that have a NaN review also tend to have NaN values in all the other sentiment columns.

In [None]:
# Check whether any rows with a NaN Translated_Review still have non-null values in the other columns
urd_df[urd_df['Translated_Review'].isnull() & urd_df['Sentiment_Polarity'].notna()]

From the above analysis, a few rows have a null Translated_Review but non-null values in the remaining columns, which is most likely an error, because Sentiment, Sentiment_Polarity and Sentiment_Subjectivity all depend on Translated_Review.

So, the corresponding rows are incorrect, and we can drop the corresponding rows

In [None]:
# Deleting the rows containing NaN values
urd_df.dropna(inplace=True)
urd_df.shape
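
The check for partially filled rows and the subsequent drop can be sketched on a toy frame as follows. The rows are made up for illustration; only the column names come from the dataset:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "App": ["a", "a", "b", "b"],
    "Translated_Review": ["Great app", None, None, "Too many ads"],
    "Sentiment": ["Positive", None, "Neutral", "Negative"],
    "Sentiment_Polarity": [0.8, np.nan, 0.0, -0.5],
})

# Rows whose review text is missing although a sentiment is present
text_missing = toy["Translated_Review"].isna()
partial = toy[text_missing & toy["Sentiment"].notna()]

# Keep only rows that have both a review text and a sentiment label
cleaned = toy.dropna(subset=["Translated_Review", "Sentiment"])

print(partial)
print(cleaned)
```

On this toy data, `partial` contains the single inconsistent row, and `cleaned` keeps the two complete reviews.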

In [None]:
# Checking all Null and corresponding non null Values Removed
urd_df[urd_df['Translated_Review'].isnull() & urd_df['Sentiment_Polarity'].notna()]

In [None]:
# Checking data is cleaned or not
urd_df.head(20)

**There are a total of 37427 rows in the UserReview Dataset.**

### What all manipulations have you done and insights you found?

**In the play_store dataset, we did these manipulations.**


* We checked the first 5 and last 5 rows of the dataset using head() and tail() to get an idea of the data, then used shape to check the dimensions of the dataset, columns to get the column names, and info() to see data types and any null values.

* We found and removed 483 duplicate values to ensure data accuracy.

* We discovered 1474 null values in the "Rating" column and replaced them with the median value, as it is a better representation of central tendency than the mean, which can be skewed by outliers.

* We also dropped one row that had a rating of 19, which was not possible as the maximum rating is 5.

* We replaced the 1 null value in the "Type" column with "Free" (its price is 0), and dropped the rows holding the 8 null values in the "Current Ver" column and the 2 null values in the "Android Ver" column.

* We noticed that there were 10,039 free apps that had a price of 0; the only app with a missing "Type" also had a price of 0, which is why it was labeled "Free".

* We removed unnecessary characters such as "M" and "k" from the "Size" column (converting all sizes to MB), the "$" sign from the "Price" column, and "+" and "," from the "Installs" column to make the data more consistent and easy to read.

* We discovered 1181 duplicate apps in the "App" column and removed them to avoid redundancy and ensure data accuracy.


**In the user_reviews dataset we did these manipulations.**

* We checked the first 5 and last 5 rows of the dataset using head() and tail() to get an idea of the data, then used shape to check the dimensions of the dataset, columns to get the column names, and info() to see data types and any null values.

* We found NaN values in the "Translated_Review", "Sentiment", "Sentiment_Polarity" and "Sentiment_Subjectivity" columns and dropped those rows using dropna, because a missing review and its sentiment cannot be meaningfully imputed.


**From the above manipulations, we found these insights.**

* From the play_store_data, we found that the majority of apps are free, with over 10039 apps being free and only 800 being paid. Additionally, we found that there were 1181 duplicate app entries, so we removed them from the dataset. We also discovered that the Rating column had 1474 missing values, which we replaced with the median value. Moreover, there were 483 duplicate rows that we removed.

* In terms of user_reviews_data, we found that the dataset contained 64295 rows and 5 columns. We observed missing values in the review and sentiment columns, and dropped those rows because they cannot be meaningfully imputed.

* In conclusion, we can say that the play_store_data and user_reviews_data are valuable datasets, and once we complete our analysis we can generate a final report that will be helpful for developers and businesses.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

# 1. Which Category has the most Installs?

In [None]:
# merge the datasets (inner join on App: one row per review)
merged_dataset = pd.merge(psd_df, urd_df, on='App')

# grouping the data by category and summing the installs
# note: after the merge, an app's installs are counted once per review row
category_installs = merged_dataset.groupby('Category')['Installs'].sum().reset_index()

# sorting the data by installs in descending order
# (named sorted_installs to avoid shadowing the built-in sorted())
sorted_installs = category_installs.sort_values('Installs', ascending=False)

# creating a BarPlot
sns.set(style="darkgrid")
fig, ax = plt.subplots(figsize=(10,10))
sns.barplot(x='Category', y='Installs', data=sorted_installs, ax=ax)
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", fontsize=12)
ax.set_xlabel('Category', fontsize=12)
ax.set_ylabel('Installs', fontsize=12)
ax.set_title('App Categories with Most Number of Installs', fontsize=14)

plt.show()






##### 1. Why did you pick the specific chart?


A bar chart is a good choice for visualizing the number of Installs for different categories of App because it allows us to compare easily between Categories.

##### 2. What is/are the insight(s) found from the chart?

In this visualization, we can see that the most downloaded applications belong to the GAME category, which indicates that this category is in high demand. After GAME, there is strong competition between PHOTOGRAPHY and COMMUNICATION applications, because their download counts differ only slightly.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Companies operating in the GAME category can leverage the high demand to develop and market their games more effectively, potentially leading to higher revenue and market share. Similarly, companies operating in the PHOTOGRAPHY and COMMUNICATION categories can recognize the strong competition and innovate on their products to capture more customers. This can help them improve their market position and increase their revenue.
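
One caveat when reading this chart: merging the app table with the review table on `App` yields one row per review, so summing `Installs` on the merged frame counts an app's installs once for every review it has. The sketch below shows the effect on made-up data and one possible way to count each app only once (`apps`, `reviews`, `inflated` and `per_app` are illustrative names):

```python
import pandas as pd

apps = pd.DataFrame({
    "App": ["g1", "p1"],
    "Category": ["GAME", "PHOTOGRAPHY"],
    "Installs": [1000, 5000],
})
reviews = pd.DataFrame({
    "App": ["g1", "g1", "g1", "p1"],
    "Translated_Review": ["a", "b", "c", "d"],
})

merged = pd.merge(apps, reviews, on="App")  # 4 rows: three for g1, one for p1

# Summing on the merged frame weights each app by its number of reviews
inflated = merged.groupby("Category")["Installs"].sum()

# Keeping one row per app before summing counts each app's installs once
per_app = merged.drop_duplicates(subset="App").groupby("Category")["Installs"].sum()

print(inflated)
print(per_app)
```

In this toy example GAME appears to have 3000 installs in the merged sum, although the single game has only 1000.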

#### Chart - 2

# 2. Which Category has the most Reviews?

In [None]:
# Merging was already done for Chart 1, so merged_dataset is reused here
# Calculate the number of reviews by category
review_cat = merged_dataset.groupby('Category')['Translated_Review'].count().sort_values(ascending=False)
# creating a BarPlot
plt.figure(figsize=(10, 10))
sns.barplot(x=review_cat.values, y=review_cat.index, palette="Greens_r")
plt.xlabel("Number of Reviews")
plt.ylabel("Category")
plt.title("Number of Reviews by Category")


plt.show()


##### 1. Why did you pick the specific chart?

I chose a horizontal bar chart to visualize the distribution of reviews across different app categories because it allows for easy comparison of the number of reviews for each category. The horizontal layout also makes it easier to read the category labels.

##### 2. What is/are the insight(s) found from the chart?

Users generally leave a review after using an app. Here we can see that the GAME category has the highest number of reviews, which suggests that it has a large and engaged user base. HEALTH_AND_FITNESS and FAMILY are the next highest categories in terms of reviews received. The DATING and TRAVEL_AND_LOCAL categories follow with a similar number of reviews, and SPORTS, PRODUCTIVITY, TOOLS and FINANCE have almost equal numbers of reviews.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

It provides insights into which categories have a large and engaged user base, which could be useful for app developers and marketers to target their efforts towards these categories. It also highlights the importance of providing a positive user experience in order to encourage users to leave reviews, which can ultimately lead to increased visibility and downloads for an app.

#### Chart - 3

# 3. Which Categories have Maximum Free vs. Paid Apps?

In [None]:
# Grouping the data by category and type
cat_count=psd_df.groupby(['Category', 'Type']).count()['App'].unstack()

# Here we use a stacked bar plot (stacked expects the boolean True, not the string 'True')
cat_count.plot(kind='bar', stacked=True)
# Creating title and x/y labels
plt.title("Distribution of Category based on Free Vs Paid")
plt.xlabel("Category")
plt.ylabel("No of Apps")
plt.show()


##### 1. Why did you pick the specific chart?

I picked the stacked bar chart because it allows a clear comparison between the number of free and paid apps in each category, as well as the overall distribution of free vs. paid apps across all categories.

##### 2. What is/are the insight(s) found from the chart?

Businesses and developers publish most of their apps under popular categories like "FAMILY", "GAME", and "TOOLS". Because these categories contain a high number of free apps, it is challenging to generate income solely through app purchases, which pushes developers to explore other revenue options.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

It can help developers and businesses decide which categories to focus on. If they want to create a paid app, they might focus on the "FAMILY" or "GAME" categories to increase revenue. If they want to create a free app with less competition, they could consider the "BEAUTY", "ART AND DESIGN," and "COMICS" categories, because these have very few free apps.

#### Chart - 4

# 4. What is the Average or Mean Rating of Free vs. Paid Apps?

In [None]:
# The data was already merged for Chart 1; merged_dataset is reused here

free_paid=merged_dataset[merged_dataset['Type'].isin(['Free','Paid'])]

# Violin Plot
sns.violinplot(x="Type",y="Rating",data=free_paid)
# plt.hist(x="Type",y="Rating",data=free_paid)
plt.title("Average Rating of Free Vs Paid Apps")
plt.show()


##### 1. Why did you pick the specific chart?

I chose a violin plot because it shows the full distribution of ratings for both free and paid apps, along with the median and quartiles of each group.

##### 2. What is/are the insight(s) found from the chart?

Both "FREE" and "PAID" apps have a similar typical rating of about 4.3. However, "FREE" apps have a wider range of ratings than "PAID" apps, meaning there is more variation in their ratings. "PAID" apps, on the other hand, are rated more consistently, with fewer extreme ratings. This information can be helpful for users when deciding whether to download a "FREE" or a "PAID" app.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This information can help companies decide whether to offer their apps for "FREE" or as "PAID". If they want consistent ratings, they may choose to charge for the app. If they want a wider audience, they may choose to offer the app for "FREE", even if there may be more variability in the reviews.
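
The visual impression from the violin plot can be backed with numbers by aggregating the ratings per type. A minimal sketch on made-up ratings (the values are not taken from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "Type": ["Free", "Free", "Free", "Paid", "Paid"],
    "Rating": [4.5, 3.0, 4.0, 4.4, 4.2],
})

# Mean, median and spread of the ratings for each app type
summary = toy.groupby("Type")["Rating"].agg(["mean", "median", "std"])

print(summary)
```

A larger `std` for one group corresponds to a wider violin, while `mean` and `median` locate its center.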

#### Chart - 5

# 5. What is the distribution of app ratings?

In [None]:
# Merging data is already done; we only need to use merged_dataset here
plt.hist(merged_dataset['Rating'], bins=30)
# Labels
plt.title("Distribution of App Ratings")
plt.xlabel("App Rating")
plt.ylabel("Frequency of the App")
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

# 6. What is the Average App size in different Categories?

#### Chart - 6

In [None]:
# Grouping PlayStore Data by category and finding the Mean of Size
# Size was already converted to MB by clean_size(), so no further scaling is needed
mean_size_cat = psd_df.groupby("Category")["Size"].mean()
# creating bar chart
plt.figure(figsize=(12,8))
mean_size_cat.plot(kind='bar')
plt.xticks(rotation=90)
plt.title("Average App Size by Category")
plt.xlabel("Category")
plt.ylabel("Average App Size in MB")

# time to print
plt.show()


##### 1. Why did you pick the specific chart?

I have chosen bar plot because it is a simple and effective way to compare the average App size in different Categories.

##### 2. What is/are the insight(s) found from the chart?

Apps in different categories have different average sizes. The biggest apps are in the "GAME" and "FAMILY" categories, while the smallest are in "TOOLS". "SPORTS" and "TRAVEL AND LOCAL" have almost the same average size.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

It helps app developers optimize the size of their apps according to their target category. It also reminds them to make sure their apps are not too big and do not take up too much space on users' devices.

## 7. Which category of Apps from the ‘Content Rating’ column is found more on the play store?

#### Chart - 7

In [None]:
# content rating of the apps
data = psd_df['Content Rating'].value_counts()
# take the labels from the counted data so that every slice is paired with its own name
labels = data.index

# using pie chart
plt.figure(figsize=(12,18))
explode=(0,0.1,0.1,0.1,0.0,1.3)
colors = ['y', 'r', 'b', 'g', 'm', 'k']
plt.pie(data, labels = labels, colors = colors, autopct='%.2f%%',explode=explode,textprops={'fontsize': 15})
plt.title('Content Rating',size=20,loc='center')
plt.legend()
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart is a good choice for visualizing the distribution of categories in the 'Content Rating' column, because pie charts suit a single categorical variable and show each category's share of the whole.

##### 2. What is/are the insight(s) found from the chart?

In the above pie chart, most of the apps on the Play Store have a content rating of 'Everyone' (81.80%), followed by 'Teen' (10.74%). 'Everyone 10+' and 'Mature 17+' make up the next two slices (4.07% and 3.34%), while 'Adults only 18+' (0.03%) and 'Unrated' (0.02%) are negligible.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This tells app developers that the majority of apps carry an 'Everyone' rating, which allows for a broad audience. For businesses that target specific age groups, such as 'Mature 17+' or 'Adults only 18+', it is also useful to know that far fewer apps compete in those segments.

## 8. How many Apps are Paid or Free?

#### Chart - 8

In [None]:
# Here, we create a variable and store the count of Type of Apps value
paid_free= psd_df['Type'].value_counts()
labels= psd_df['Type'].value_counts().index
# creating a pie chart
plt.figure(figsize=(12,10))
colors = ['#11a6d4','#e3480b']
explode = (0.01,0.1)
plt.pie(paid_free,labels=labels,colors=colors,autopct='%.2f%%',explode=explode,shadow=True,textprops={'fontsize': 15})
plt.title('Free Vs Paid Apps')


plt.show()

##### 1. Why did you pick the specific chart?

Pie charts are a good choice for visualizing categorical data like the distribution of Free vs. Paid Apps.

##### 2. What is/are the insight(s) found from the chart?

Most apps in the Play Store dataset are free (92.20%), while only a small percentage are paid (7.80%). Free apps clearly dominate the store, which is consistent with users preferring free apps, although the chart counts apps rather than downloads.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The high percentage of Free Apps in the Play Store suggests that developing Free Apps can be a successful business strategy to attract users.

## 9. What is the correlation between App size and the number of downloads?

#### Chart - 9

In [None]:
# Convert the Size column to a number
# strip the 'M' suffix from string values
merged_dataset['Size']=merged_dataset['Size'].apply(lambda x: x.strip('M') if isinstance(x, str) else x)
# any value that is still not numeric becomes NaN
merged_dataset['Size']=pd.to_numeric(merged_dataset['Size'], errors='coerce')


#Scatter Plot
plt.figure(figsize=(10,8))
sns.set_theme(style="white")
sns.scatterplot(x="Size", y="Installs", data=merged_dataset)

# adding labels and a title
plt.xlabel("App Size")
plt.ylabel("Number of Downloads")
plt.title("Correlation between App Size and Number of Downloads")

# time to print
plt.show()

##### 1. Why did you pick the specific chart?

I chose a scatter plot because it is an effective way to show the relationship between two numeric variables, here app size and number of downloads, along with how the data points are distributed.

##### 2. What is/are the insight(s) found from the chart?

The plot suggests that as app size decreases, the number of downloads also tends to decrease.
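A scatter plot shows the direction of a relationship but not its strength, so a correlation coefficient is a useful companion. The following is a minimal sketch on made-up numbers (`toy` is hypothetical, not the notebook's `merged_dataset`). It computes Pearson on the raw values and Spearman as Pearson on the ranks, which is less affected by the heavy skew typical of install counts.

```python
import pandas as pd

# Hypothetical values for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "Size": [2.0, 5.0, 12.0, 25.0, 40.0, None],  # MB, NaN where size was not numeric
    "Installs": [1_000, 10_000, 50_000, 1_000_000, 500_000, 100],
})

# Drop rows where either value is missing before measuring the relationship
clean = toy.dropna(subset=["Size", "Installs"])

# Pearson measures linear association on the raw values
pearson = clean["Size"].corr(clean["Installs"])

# Spearman is Pearson computed on the ranks of each column
spearman = clean["Size"].rank().corr(clean["Installs"].rank())

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

A coefficient near 0 would mean the trend seen in the plot is weak, while values near 1 or -1 would indicate a strong monotonic relationship.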

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Based on this, app developers can consider offering apps in a range of sizes to reach a wider set of users and improve their chances of getting more downloads. It can also help them tune app size to user needs and preferences.

## 10. What does Sentiment Analysis Based on Category show in the Reviews Data?

#### Chart - 10

In [None]:
# merge the data on the app by using inner join
merged_df = pd.merge(psd_df, urd_df, on='App', how = 'inner')

# set the plot style before drawing so that it applies to this figure
sns.set_theme(style="white")
f = plt.figure(figsize=(15,8))
ax = f.add_subplot(1,1,1)

# using seaborn lib to show stacked histogram
sns.histplot(
    data=merged_df,
    x="Category", hue="Sentiment",
    bins=34,
    ax=ax,
    stat="count",
    multiple="stack",
    palette="light:r_r",
    edgecolor=".3",
    linewidth=1.5,
    legend=True
    )


ax.set_title("Sentiment Analysis Based on Category",fontsize=12,fontweight='bold')
plt.xticks(rotation='vertical')
ax.set_xlabel("Category",fontsize=14)
ax.set_ylabel("Review Counts",fontsize=14)

# the y-axis shows raw review counts (stat="count"), so no percent formatter is applied
# plt.grid()
plt.show()

##### 1. Why did you pick the specific chart?

I picked a stacked histogram because it shows, for each category in the reviews data, both the total number of reviews and how they split by sentiment.

##### 2. What is/are the insight(s) found from the chart?

The chart tells us which app categories have the most positive and negative reviews. The "GAME" category appears to have the most reviews, both positive and negative. However, there are more negative reviews than positive ones, so the developers of those apps should find out what causes the negative experiences and address it. On the other hand, the "COMICS" category has very few reviews, both positive and negative.
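Counts alone favour categories with many reviews, so a share-based view helps when comparing sentiment across categories. The following is a minimal sketch on made-up reviews (`toy_reviews` is hypothetical and stands in for the merged frame). It builds the raw counts with `pd.crosstab` and then the per-category shares with `normalize="index"`.

```python
import pandas as pd

# Hypothetical reviews for illustration only; not the real merged data.
toy_reviews = pd.DataFrame({
    "Category": ["GAME", "GAME", "GAME", "GAME", "COMICS", "COMICS"],
    "Sentiment": ["Positive", "Negative", "Negative", "Neutral", "Positive", "Positive"],
})

# Raw review counts per category and sentiment (what stacked bars display)
counts = pd.crosstab(toy_reviews["Category"], toy_reviews["Sentiment"])

# Row-normalised shares make categories with different review volumes comparable
shares = pd.crosstab(
    toy_reviews["Category"], toy_reviews["Sentiment"], normalize="index"
)

print(counts)
print(shares.round(2))
```

Each row of `shares` sums to 1, so a category with few reviews can still be compared fairly with a heavily reviewed one.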

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Businesses can identify the areas that need improvement in their Apps. This can help them improve their Apps and provide better user experience, which can lead to increased customer satisfaction and loyalty. As a result, it can have a positive impact on their business's reputation and gradually increase their revenue.

## 11. Correlation Heatmap of Play Store Data

#### Chart - 11

In [None]:
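# generating heatmap for correlation of the numeric Play Store columns
# numeric_only=True skips text columns; the parameter assumes pandas 1.5 or newer
corr = psd_df.corr(numeric_only=True)
plt.figure(figsize=(15,10))
sns.heatmap(corr, annot=True, cmap="Accent_r")
plt.title("Correlation of PlayStore Data", size=20)
plt.show()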


##### 1. Why did you pick the specific chart?

I used a correlation heatmap because it shows the pairwise correlation between the numeric columns of the Play Store data at a glance.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Answer Here.

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***