<a href="https://colab.research.google.com/github/shakirsayeed/PlayStore_DataAnalysis/blob/main/EDA_Project_Work_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Member  -** Syed Shakir Sayeed


# **Project Summary -**

The "Play Store App Review Analysis using Exploratory Data Analysis - EDA" project focuses on extracting valuable insights and patterns from app reviews available on the Google Play Store. The project aims to uncover trends, sentiments, and user feedback related to various mobile applications. By employing exploratory data analysis techniques, the project seeks to provide a comprehensive understanding of user sentiments, popular features, and potential areas for improvement for app developers.
Steps Involved in developing the Project:
1. Data Collection
2. Data Cleaning and Preprocessing
3. Implementing Exploratory Data Analysis
4. Data Visualization

# **GitHub Link -**

https://github.com/shakirsayeed/PlayStore_DataAnalysis.git

# **Problem Statement**


Explore and analyze the Google Play Store dataset to discover the key factors responsible for app engagement and success.


#### **Define Your Business Objective?**

## <b><i>The main objective of this project is to help app developers and businesses understand what factors make their apps successful. By analyzing the play_store_data and user_review_data datasets, we will identify relevant KPIs (Key Performance Indicators).</i></b>
## <b><i>This information will be used to provide insights about the data and recommendations on how to improve app engagement, retain app users, increase app revenue, and enhance marketing strategies.</i></b>
## <b><i>My mission is to empower businesses in developing app solutions that ensure customer satisfaction and contribute to business growth.</i></b>

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each and every chart, the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd # used for Data Processing
import numpy as np # used to access builtin Numerical Methods and Functions
import matplotlib.pyplot as plt # Used for Data Visualization
import seaborn as sns # Used for Data Visualization
from datetime import datetime

# **1st Step: Data Cleaning on Play Store  Dataset**

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')


In [None]:
psd_path="/content/drive/MyDrive/DS_Notes/EDA_Project/Dataset/Play_Store_Data.csv"

psd_df=pd.read_csv(psd_path)



### Dataset First View

In [None]:
# Dataset First Look of PlayStore Data
dataview_playstore= pd.concat([psd_df.head(),psd_df.tail()])
dataview_playstore

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count of PlayStore Data
print(psd_df.columns)
rows=psd_df.shape[0]
cols= psd_df.shape[1]
print(f"The number of rows are {rows} and columns are {cols}")

### Dataset Information

In [None]:
# Dataset Info PlayStore Data
print("=*=*=*=*=*=*=*=*=*Data information=*=*=*=*=*=*=*=*=*=*=*=*=")
psd_df.info()
print("=*=*=*=*=*=*=*=*=*Data Describe=*=*=*=*=*=*=*=*=*=*=*=*=")
psd_df.describe(include='all')

#### Duplicate Values

In [None]:
# Count the rows that are exact duplicates of an earlier row
duplicate_values=psd_df.duplicated().sum()
print("The number of Duplicated Values are",duplicate_values)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count PlayStore Data
print(psd_df.isnull().sum())

In [None]:
# Visualizing the missing values
# Create the bar plot
missing = psd_df.isnull()
missing_sum = missing.sum().sort_values(ascending=False)

missing_sum.plot(kind='bar')
plt.ylabel('Number of Missing Values')
plt.xlabel('Columns Name')
plt.title(' Missing Values by Column')
plt.xticks(rotation=90)  # Rotate x-axis labels for better readability if needed
plt.tight_layout()  # Ensures the labels fit within the plot area
plt.show()

### What did you know about your dataset?

In the above dataset we have seen that:


1.   The Rating column has 1474 missing values
2.   Type has 1 missing value
3.   Content Rating has 1 missing value
4.   Current Ver has 8 missing values
5.   Android Ver has 3 missing values

Since these columns contain missing values, they must be handled before the dataset can be analyzed.



## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns of PlayStore Data
psd_df.columns


The 13 columns in the Play Store dataset are described below:
1. **App** - It tells us the name of the application with a short description.
2. **Category** - It gives the category of the application.
3. **Rating** - It contains the average rating the app has received from its users.
4. **Reviews** - It tells us the total number of users who have given a review for the application.
5. **Size** - It tells us the size occupied by the application on the mobile phone.
6. **Installs** - It tells us the total number of downloads of the application.
7. **Type** - It states whether an app is free to use or paid.
8. **Price** - It gives the price payable to install the app. For free apps, the price is zero.
9. **Content Rating** - It states whether an app is suitable for all age groups.
10. **Genres** - It tells us the various other categories to which an application belongs.
11. **Last Updated** - It tells us when the application was last updated.
12. **Current Ver** - It tells us the current version of the application.
13. **Android Ver** - It tells us the Android version that supports the application.

In [None]:
# Dataset Describe
psd_df.describe()
# psd_df.describe(include='all')# It is used to display  statistical information of all the columns

### Variables Description

Here is a short description of the statistical information shown above for the Rating column:


*   **Count:** The number of non-null values is 9367
*   **Mean:** The mean value of the column is 4.193338
*   **Standard Deviation:** The std of the column is 0.537431
*   **Minimum Value:** The min value of the column is 1.0
*   **25% value:** The 25th percentile of the column is 4.0
*   **50% value:** The 50th percentile of the column is 4.3
*   **75% value:** The 75th percentile of the column is 4.5
*   **Maximum Value:** The max value of the column is 19.0, which lies outside the 1 to 5 rating scale


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in psd_df.columns.to_list():
  print("Unique values in" ,i, "is",psd_df[i].nunique())

In the output above, the App column has the most unique values, around 9660.

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Dataset Information about various attributes of Dataset
def Playstore_data():
  temp=pd.DataFrame(index=psd_df.columns)
  temp['Datatypes']=psd_df.dtypes
  temp['Not Null Values']=psd_df.count()
  temp['Null Values']=psd_df.isnull().sum()
  temp['% ratio of Null Values']=psd_df.isnull().mean()
  temp['Unique Values']=psd_df.nunique()
  temp["Duplicate Values"]=psd_df.duplicated().sum()
  return temp
Playstore_data()


In [None]:
psd_df.boxplot()

In the boxplot above, the lower whisker of Rating is around 3.0 and the maximum possible rating is 5.0, but there is an outlier that exceeds this maximum.

In [None]:
psd_df[psd_df['Rating']> 5]

**There are some problems in the above row**
1. It has a Rating of 19.0
2. It has NaN values in Content Rating and Android Ver.

So, here we are dropping this row
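A range check is a slightly more general way to locate such rows than reading them off the boxplot. The sketch below is only an illustration: `toy_df` and its values are made up and are not taken from the Play Store data. It keeps ratings that are either missing (they are handled later) or within the 1 to 5 scale.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Play Store frame
toy_df = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.1, 19.0, np.nan, 3.8],
})

# A rating is acceptable if it is missing or lies within [1, 5]
valid_rating = toy_df["Rating"].isna() | toy_df["Rating"].between(1, 5)

invalid_rows = toy_df[~valid_rating]                      # rows to inspect before dropping
cleaned_df = toy_df[valid_rating].reset_index(drop=True)  # rows that pass the check

print(invalid_rows)
print(cleaned_df)
```

Inspecting `invalid_rows` first keeps the decision to drop a row visible, instead of relying on a hard-coded index.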

In [None]:
psd_df.drop(10472,inplace=True)

Now we are checking whether it is removed or not

In [None]:
psd_df.boxplot()

Now, the dataset still has 1474 null values in the Rating column. So here we find the mean and median of that column in order to fill them.

In [None]:
# Finding Mean
rating_mean= psd_df['Rating'].mean()
print(f"Mean value of Rating is {rating_mean}")
# Finding Median
rating_median= psd_df['Rating'].median()
print(f"Median value of Rating is {rating_median}")

In the output above:
1.  The mean value is approximately 4.2
2.  The median value is 4.3

There is only a minute difference between the mean and the median.
So here we replace all null values with the median, which is less affected by extreme ratings than the mean; 50% of the apps have a rating of 4.3 or above.
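To see why the median is the more robust choice, here is a minimal sketch on made-up ratings (the `ratings` series is hypothetical, not the real column). One very low rating pulls the mean down, while the median stays with the bulk of the values.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: mostly high, one very low, two missing
ratings = pd.Series([4.5, 4.3, 4.4, 1.0, 4.2, np.nan, np.nan])

rating_mean = ratings.mean()      # pulled down by the single 1.0
rating_median = ratings.median()  # stays with the bulk of the values

# Assign the result back rather than filling in place on a column selection
filled = ratings.fillna(rating_median)

print(f"mean={rating_mean:.2f}, median={rating_median:.2f}")
print(filled)
```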


In [None]:
# Fill the missing ratings with the median and assign the result back to the column
psd_df['Rating']=psd_df['Rating'].fillna(rating_median)

In [None]:
# Here, checking the null values are filled or not
psd_df.isnull().sum()

Now, there are still some problems in the columns that we have to fix:

*   The Type column has 1 NaN value.
*   The Current Ver column contains 8 NaN values.
*   The Android Ver column contains 2 NaN values.

# **Replacing Type NaN Values**

In [None]:
# Here we are checking Nan values in Type column
psd_df[psd_df['Type'].isnull()]

In the row above, we can see that Price is 0, so it is a free app. So, we replace NaN with Free.
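The same rule can be written as a condition instead of a fixed row label. This is only a sketch under assumptions: `toy_df` is made up, and Price is assumed to still be stored as text such as `"0"` or `"$4.99"`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; Price is still a string column at this stage
toy_df = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Type": ["Free", np.nan, "Paid"],
    "Price": ["0", "0", "$4.99"],
})

missing_type = toy_df["Type"].isna()  # rows where Type is unknown
is_free = toy_df["Price"] == "0"      # a price of zero implies a free app

# Fill Type only where it is missing and the price confirms the app is free
toy_df.loc[missing_type & is_free, "Type"] = "Free"

print(toy_df)
```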

In [None]:
# Checking How many Apps are Free and Paid
psd_df['Type'].value_counts()

In [None]:
# Replacing NaN with Free
psd_df.loc[9148,'Type']='Free'

In [None]:
# Now, We are checking is it replaced with Free or not
psd_df[psd_df['Type'].isnull()]

# **Dropping Current Ver NaN Values**

In [None]:
# Here we are checking Nan values in Current Ver column
psd_df[psd_df['Current Ver'].isnull()]

We have 8 NaN values, so we are dropping all 8 rows

In [None]:
# Dropping all 8 NaN values from the Current Ver column
psd_df.drop([15,1553,6322,6803,7333,7407,7730,10342],axis=0,inplace=True)

In [None]:
# Now, We are checking Whether the rows have been dropped or not
psd_df[psd_df['Current Ver'].isnull()]

# **Dropping Android Ver NaN Values**

In [None]:
# Here we are checking Nan values in Android  Ver column
psd_df[psd_df['Android Ver'].isnull()]

Here, we have 2 NaN values, so we are dropping these 2 rows as well
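Typing the row labels by hand works, but it breaks if the data changes. As an alternative sketch (with a made-up `toy_df`, not the real data), `dropna` with `subset` removes every row that is missing a value in the listed columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in both version columns
toy_df = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Current Ver": ["1.0", np.nan, "2.3", "1.1"],
    "Android Ver": ["4.1 and up", "4.0 and up", np.nan, "5.0 and up"],
})

before = len(toy_df)
# Drop rows where either version column is missing
toy_df = toy_df.dropna(subset=["Current Ver", "Android Ver"])
dropped = before - len(toy_df)

print(f"Dropped {dropped} rows")
```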

In [None]:
# Dropping 2 Nan values from Android Ver column
psd_df.drop([4453,4490],axis=0,inplace=True)

In [None]:
# Here we are checking whether the NaN values in Android Ver are dropped or not
psd_df[psd_df['Android Ver'].isnull()]

## **Now It's Time to Handle the Data Types of Every Column**

In [None]:
psd_df['Size'].value_counts()

**The Size column has different units**

1. 'k' for KB
2. 'M' for MB

In [None]:
# Checking the values of Price
psd_df['Price'].value_counts()

Here, the $ symbol in Price will create a problem, so we remove it

In [None]:
# We are removing the $ symbol
def price(p):
  if type(p)==str and '$' in p:
    p=p.replace('$', '')
  return p


In [None]:
psd_df['Price']=psd_df['Price'].apply(lambda x: price(x))
psd_df['Price']

In [None]:
# Checking the values of Installs
psd_df['Installs'].value_counts()

Here, in Installs we need to remove the comma (,) and the plus (+) sign
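Removing the symbols still leaves text values, so a numeric conversion is needed before any arithmetic. A minimal sketch, assuming values shaped like the ones shown above (the `installs` and `price` series here are made up):

```python
import pandas as pd

# Hypothetical samples of the raw text values
installs = pd.Series(["10,000+", "500+", "0", "1,000,000+"])
price = pd.Series(["0", "$4.99", "$0.99"])

# Strip commas and plus signs, then convert the text to integers
installs_num = pd.to_numeric(installs.str.replace(r"[,+]", "", regex=True))

# Strip the dollar sign (treated literally, not as a regex), then convert to floats
price_num = pd.to_numeric(price.str.replace("$", "", regex=False))

print(installs_num.dtype, price_num.dtype)
```

`pd.to_numeric` raises an error on any value it cannot parse, which helps to catch unexpected entries early.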


In [None]:
# We are removing the comma (,) and the + sign
def comma_plus(val):
  '''
  This function removes the + symbol and any commas if present and returns the cleaned value as a string.
  '''
  # Strip every '+' and ',' so that values such as '1,000+' become '1000'
  return val.replace('+','').replace(',','')


In [None]:
psd_df['Installs']=psd_df['Installs'].apply(lambda x: comma_plus(x))
psd_df.head(5)

In [None]:
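# Now we're converting 'Size' to MB
def convertKB_MB(val):
  '''
  This function converts all the valid entries in KB to MB and returns the result in float datatype.
  Entries that cannot be converted (for example 'Varies with device') are returned unchanged.
  '''
  try:
    if 'M' in val:
      return float(val[:-1])
    elif 'k' in val:
      return round(float(val[:-1])/1024, 4)
    else:
      return val
  except (TypeError, ValueError):
    # TypeError: val is not a string; ValueError: the numeric part cannot be parsed
    return val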


In [None]:
psd_df['Size']=psd_df['Size'].apply(lambda x: convertKB_MB(x))
psd_df.head()

# **Here, We are replacing Size "Varies with Device" with NaN Values**

In [None]:
psd_df[psd_df['Size']=='Varies with device']

In [None]:
# Replace the 'Varies with device' placeholder with a real missing value (np.nan)
psd_df['Size']=psd_df['Size'].apply(lambda x: np.nan if x=='Varies with device' else x)
psd_df[psd_df['Size'].isnull()]

# **Removing Duplicate Values**

In [None]:
# Handling  Duplicate  data
psd_df.head()

In [None]:
psd_df['App'].value_counts()

Here we have found that ROBLOX appears 9 times, and other apps are duplicated in a similar way
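When an app appears several times, it matters which copy is kept. One option, shown here only as a sketch with made-up review counts, is to keep the entry with the highest number of reviews, on the assumption that it is the most recent snapshot of the app.

```python
import pandas as pd

# Hypothetical toy data; the review counts are illustrative only
toy_df = pd.DataFrame({
    "App": ["ROBLOX", "ROBLOX", "Other", "ROBLOX"],
    "Reviews": ["4447388", "4449910", "120", "4447346"],
})

# Reviews must be numeric so that sorting is by value, not alphabetical
toy_df["Reviews"] = pd.to_numeric(toy_df["Reviews"])

deduped = (
    toy_df.sort_values("Reviews", ascending=False)       # largest review count first
          .drop_duplicates(subset="App", keep="first")   # keep that row per app
          .sort_index()                                  # restore the original order
)

print(deduped)
```

Calling `drop_duplicates(subset='App')` without sorting keeps whichever copy comes first in the file, which is simpler but arbitrary.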

In [None]:
psd_df[psd_df['App']=='ROBLOX']

In [None]:
psd_df[psd_df.duplicated()]

In [None]:
# We are dropping all the duplicate values
psd_df.drop_duplicates(subset='App',inplace=True)
psd_df.shape

In [None]:
# Checking whether duplicates are removed from App column
psd_df[psd_df['App']=='ROBLOX']
# psd_df.shape

**After dropping the duplicates from the App column, the total number of rows dropped to 9660**

In [None]:
psd_df['App'].duplicated().sum()

In [None]:
psd_df.shape

In [None]:
psd_df.describe()

## **Summary**

*   All duplicate values are removed from the dataset.
*   All null values are removed or replaced.
*   Converted the datatypes of the particular columns and also removed all the unwanted characters.

# **2nd Step: Data Cleaning on User Reviews Dataset**

In [None]:
urd_path="/content/drive/MyDrive/DS_Notes/EDA_Project/Dataset/User_Reviews.csv"
urd_df=pd.read_csv(urd_path)

In [None]:
# Dataset First Look of userReview Data
dataview_user_review= pd.concat([urd_df.head(),urd_df.tail()])
dataview_user_review

**No of Rows and Columns in Dataset of User Review**

In [None]:
# Dataset Rows & Columns count of UserReview Data
print(urd_df.columns)
rows=urd_df.shape[0]
cols= urd_df.shape[1]
print(f"The number of rows are {rows} and columns are {cols}")

**Dataset Information of User Review**

In [None]:
# Dataset Info User Review Data
print("=*=*=*=*=*=*=*=*=*Data information=*=*=*=*=*=*=*=*=*=*=*=*=")
urd_df.info()
print("=*=*=*=*=*=*=*=*=*Data Describe=*=*=*=*=*=*=*=*=*=*=*=*=")
urd_df.describe(include='all')

**Duplicated Values**

In [None]:
# Count the rows that are exact duplicates of an earlier row
duplicate_values=urd_df.duplicated().sum()
print("The number of Duplicated Values are",duplicate_values)

**Missing or Null Values**

In [None]:
# Missing Values/Null Values Count User Review Data
print(urd_df.isnull().sum())

In [None]:
# Visualizing the missing values
# Create the bar plot
missing = urd_df.isnull()
missing_sum = missing.sum().sort_values(ascending=False)

missing_sum.plot(kind='bar')
plt.ylabel('Number of Missing Values')
plt.xlabel('Columns Name')
plt.title(' Missing Values by Column')
plt.xticks(rotation=90)  # Rotate x-axis labels for better readability if needed
plt.tight_layout()  # Ensures the labels fit within the plot area
plt.show()

In the above dataset we have seen that:

1. The Translated_Review column has 26868 missing values
2. Sentiment has 26863 missing values
3. Sentiment_Polarity has 26863 missing values
4. Sentiment_Subjectivity has 26863 missing values

Since these columns contain missing values, they must be handled before the dataset can be analyzed.
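The counts above differ slightly between Translated_Review and the sentiment columns, so it can help to check how the gaps overlap before dropping anything. The sketch below uses a made-up `toy_reviews` frame with the same column names. Keeping only rows that have both a review text and a sentiment label is one possible rule, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same columns as the user review dataset
toy_reviews = pd.DataFrame({
    "App": ["A", "A", "B", "B"],
    "Translated_Review": ["Great", np.nan, np.nan, "Bad"],
    "Sentiment": ["Positive", np.nan, "Positive", "Negative"],
    "Sentiment_Polarity": [0.8, np.nan, 0.5, -0.7],
    "Sentiment_Subjectivity": [0.75, np.nan, 0.6, 0.9],
})

review_cols = ["Translated_Review", "Sentiment",
               "Sentiment_Polarity", "Sentiment_Subjectivity"]

# Rows where every review-related field is empty carry no information
all_missing = toy_reviews[review_cols].isna().all(axis=1)

# Keep only rows that have both the review text and its sentiment label
usable = toy_reviews.dropna(subset=["Translated_Review", "Sentiment"])

print(f"Rows with no review information: {all_missing.sum()}")
print(usable)
```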

# The dataset has 5 columns identified as below:


1.   **App:** Title of the application.
2.   **Translated_Review:** It contains the English translation of the review.
3.   **Sentiment:** It gives the emotion like ‘Positive’, ‘Negative’, or ‘Neutral’.
4.   **Sentiment_Polarity:** It gives the polarity of the review. Its range is [-1,1], where 1 means ‘Positive statement’ and -1 means a ‘Negative statement’.
5.    **Sentiment_Subjectivity:** This value gives how close a reviewer's opinion is to the opinion of the general public. Its range is [0,1].

In [None]:
# Dataset Describe
urd_df.describe()
# urd_df.describe(include='all') # It is used to display statistical information of all the columns

Checking Unique Values

In [None]:
# Check Unique Values for each variable.
for i in urd_df.columns.to_list():
  print("Unique values in" ,i, "is",urd_df[i].nunique())

In [None]:
# Dataset Information about various attributes of Dataset
def User_Review():
  temp=pd.DataFrame(index=urd_df.columns)
  temp['Datatypes']=urd_df.dtypes
  temp['Not Null Values']=urd_df.count()
  temp['Null Values']=urd_df.isnull().sum()
  temp['% ratio of Null Values']=urd_df.isnull().mean()
  temp['Unique Values']=urd_df.nunique()
  temp["Duplicate Values"]=urd_df.duplicated().sum()
  return temp
User_Review()


In [None]:
urd_df.boxplot()

### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
# Chart - 2 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# Chart - 3 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?
Explain Briefly.

Answer Here.

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***