<a href="https://colab.research.google.com/github/sachinwandale1994/Capstone-project/blob/main/EDA_on_Play_Store_app_Review_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Play Store App Review Analysis**

<img src="https://drive.google.com/uc?id=1kj5p5-n7GLkgkJe-if6Sd_Z4MFZf2uPT" alt="drawing" width="400"/>


##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member 1 -** Sachin Wandale

# **Project Summary -**

In [None]:

"""The objective of this project is to perform Exploratory Data Analysis (EDA) on Play Store app reviews to gain insights into user opinions, sentiments, and preferences.
By analyzing a vast dataset of app reviews, we aim to uncover patterns, trends, and actionable information that can be useful for app developers, marketers, and decision-makers.

The project will involve the following steps:

1. Data Collection: Gathering a substantial volume of Play Store app reviews, including text reviews, ratings, app metadata,
   and other relevant information.

2. Data Cleaning and Preprocessing: Performing necessary data cleaning and preprocessing tasks to ensure the data is in a suitable format for analysis.
   This may involve removing duplicates, handling missing values, standardizing text, and converting ratings into a consistent numerical scale.

3. Exploratory Data Analysis: Conducting a comprehensive exploratory analysis to understand the characteristics of the dataset.
   This includes analyzing distributions, correlations, and statistical measures of the review ratings, app categories, review lengths and other relevant factors.

4. Visualizations and Insights: Presenting the findings of the EDA through meaningful visualizations such as histograms, bar plots, etc.
   These visual representations will help identify popular app categories, frequently mentioned features, and any other noteworthy patterns.

By conducting this EDA on Play Store app reviews, this project aims to provide valuable insights that can drive decision-making processes and
support the development of user-centric apps. The analysis will enable stakeholders to make informed decisions regarding app enhancements, feature prioritization,
marketing strategies, and overall app performance improvements."""


# **GitHub Link -**


https://github.com/sachinwandale1994/Capstone-project/blob/main/EDA_on_Play_Store_app_Review_Analysis.ipynb

# **Problem Statement**



Question 1
What is the relationship between an application's rating and its number of installs?

Question 2
What is the rating of an application with a given number of reviews and installs?
1. How can we identify the most common issues reported by users in Play Store app reviews?
2. What are the key factors that influence user ratings and reviews of apps on the Play Store?
3. Is there a correlation between the length of app reviews and the overall rating given by users on the Play Store?
4. Are there any specific keywords or phrases in app reviews that indicate user satisfaction or dissatisfaction?
5. How can we detect and analyze spam or fake reviews in the Play Store app review dataset?
6. What are the main reasons behind users uninstalling apps, as indicated by their reviews on the Play Store?
7. Can we identify patterns or trends in app reviews based on factors such as app category, release date, or developer reputation?
8. How do app ratings and reviews vary across different versions of the same app on the Play Store?
9. Are there any specific features or functionalities that consistently receive positive or negative feedback from users in Play Store app reviews?
10. Can sentiment analysis techniques be used to classify app reviews into positive, negative, or neutral categories, and how accurate are these classifications?
11. Are there any differences in user ratings and reviews based on the geographic location of the users, as indicated by their Play Store profiles?
12. How do the sentiments expressed in app reviews change over time, and can we identify any external events or app updates that correlate with significant changes in sentiment?


#### **Define Your Business Objective?**


To analyze Google Play Store apps by applying the data science process.

To determine trends and patterns in Google Play Store apps.

# **General Guidelines** : -

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.

     The additional credits will have advantages over other students during Star Student selection.

             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.


```
# Chart visualization code
```


*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')   # Google drive mounting


In [None]:
# load first Dataset
df_data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Python for Data Science /Capstone Project/Python/Play Store app Review/Play Store Data.csv')

In [None]:
# load Second Dataset
df_user_review = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Python for Data Science /Capstone Project/Python/Play Store app Review/User Reviews.csv')

### Dataset First View

In [None]:
# Dataset First Look
df_data.head() # head() returns the first 5 rows of the df_data DataFrame by default

In [None]:
df_user_review.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
rows1 = df_data.shape[0]
cols1 = df_data.shape[1]

# Printing the number of Dataset rows and columns
print("Rows in df_data dataset: ", rows1)
print("Columns in df_data dataset: ", cols1)

# Similarly for another Dataset

# Dataset Rows & Columns count
rows2 = df_user_review.shape[0]
cols2 = df_user_review.shape[1]

# Printing the number of Dataset rows and columns
print("Rows in df_user_review dataset: ", rows2)
print("Columns in df_user_review dataset: ", cols2)

In [None]:
print(df_data.shape)        # shape is an attribute (not a method) that holds the (rows, columns) dimensions of pandas and NumPy objects.
print(df_user_review.shape)

### Dataset Information

In [None]:
# Dataset Info
df_data.info() #This method prints information about a DataFrame including the index dtype and columns, non-null values and memory usage

In [None]:
df_user_review.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print("Shape of df_data DataFrame:",df_data.shape)    # dataset shape before removing duplicate values.
print("Shape of df_user_review DataFrame:",df_user_review.shape)

In [None]:
print("Duplicate entries in df_data:", df_data.duplicated().sum())  # duplicated() flags repeated rows; sum() counts them
print("Duplicate entries in df_user_review:", df_user_review.duplicated().sum())

In [None]:
duplicate_df = df_data[df_data.duplicated(keep = 'last') ] # all duplicate rows from the df_data dataset
duplicate_df

In [None]:
duplicate_df2 = df_user_review[df_user_review.duplicated(keep = 'last') ] # all duplicate rows from the df_user_review dataset
duplicate_df2

In [None]:
# Remove duplicates from the main dataset
main_data_1 = df_data.drop_duplicates().copy() # remove duplicates and save an independent copy as main_data_1
main_data_1.shape #shape after removing duplicates

In [None]:
# Remove duplicates from the user reviews dataset
user_reviews_df1 = df_user_review.drop_duplicates().copy() # remove duplicates and save an independent copy as user_reviews_df1
user_reviews_df1.shape #shape after removing duplicates
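
As a small illustration of the duplicate handling above, the sketch below uses a made-up three-row frame (the app names and ratings are hypothetical, not taken from the dataset) to show how `duplicated()` flags repeats and what `drop_duplicates()` keeps.

```python
import pandas as pd

# Hypothetical toy data: the second and third rows are exact copies.
toy = pd.DataFrame({
    "App": ["Alpha", "Beta", "Beta"],
    "Rating": [4.1, 3.9, 3.9],
})

# duplicated() marks every repeat after the first occurrence by default.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence; reset_index tidies the row labels.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduped)
```

Passing `keep='last'` or `keep=False` to either method changes which copies are flagged, which is worth checking when the row order carries meaning.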

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
main_data_1.isnull().sum().sum()
# Total number of missing values in main_data_1 is 1478

In [None]:
user_reviews_df1.isnull().sum().sum()
# Total number of missing values in user_reviews_df1 is 3933

In [None]:
# Handling the missing values
# Define the inpute_median function, which fills the null values of a numeric column with that column's median.
def inpute_median(series):
  return series.fillna(series.median())



In [None]:
main_data_1['Rating'].unique() # check null values in Rating column


In [None]:
main_data_1.Rating = main_data_1['Rating'].transform(inpute_median)

In [None]:
main_data_1.isnull().sum()
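
The following sketch, using made-up numbers rather than the real `Rating` column, shows one reason a median fill can be preferable to a mean fill: a single extreme value pulls the mean much further than the median.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings with one missing value and one extreme value (19.0).
ratings = pd.Series([4.0, 4.5, np.nan, 3.5, 19.0])

# Fill the gap with the median and, for comparison, with the mean.
median_filled = ratings.fillna(ratings.median())
mean_filled = ratings.fillna(ratings.mean())

print("median fill:", median_filled.loc[2])
print("mean fill:  ", mean_filled.loc[2])
```

With these numbers the median fill stays inside the usual rating range, while the mean fill does not.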

In [None]:
# mode of categorical data
print(main_data_1['Type'].mode())
print(main_data_1['Current Ver'].mode())
print(main_data_1['Android Ver'].mode())

In [None]:
# fill the missing categorical values with the mode (assign the result back instead of using inplace on a column)
main_data_1['Type'] = main_data_1['Type'].fillna(main_data_1['Type'].mode().iloc[0])
main_data_1['Current Ver'] = main_data_1['Current Ver'].fillna(main_data_1['Current Ver'].mode().iloc[0])
main_data_1['Android Ver'] = main_data_1['Android Ver'].fillna(main_data_1['Android Ver'].mode().iloc[0])

In [None]:
main_data_1.isnull().sum()
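
A minimal sketch of a mode fill on a hypothetical categorical column. `mode()` returns a Series because several values can tie for most frequent, so the first entry is picked explicitly here.

```python
import numpy as np
import pandas as pd

# Hypothetical 'Type' values with one gap.
types = pd.Series(["Free", "Paid", np.nan, "Free"])

# mode() ignores NaN and returns a Series; take its first entry.
most_common = types.mode().iloc[0]
filled = types.fillna(most_common)

print(most_common)
print(filled.tolist())
```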

### What did you know about your dataset?

The project uses two datasets. `Play Store Data.csv` holds one row per app with its category, rating, number of reviews, size, installs, type, price, content rating, genres, last update date, and version details. `User Reviews.csv` holds translated user reviews together with a sentiment label, a sentiment polarity, and a sentiment subjectivity score.

Duplicate rows were checked for and dropped in both datasets. After removing duplicates, `main_data_1` has 1478 missing values in total and `user_reviews_df1` has 3933.

Missing `Rating` values were filled with the column median, and missing `Type`, `Current Ver`, and `Android Ver` values were filled with the column mode.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
list(main_data_1.columns)

In [None]:
# Dataset Describe
main_data_1.describe(include='all')

In [None]:
df_user_review.describe()

### Variables Description

Details of each column of the main_data_1 DataFrame

| Column name | Description |
|---|---|
| App | The app name. |
| Category | Categorical label that describes which broad category the app belongs to. |
| Rating | Continuous variable with a range from 0.0 to 5.0 that describes the average rating the app has received from users. |
| Reviews | Numeric variable describing the number of reviews that the app received. |
| Size | The size of the app. The suffix M is used for megabytes, while the suffix k is used for kilobytes. |
| Installs | Categorical label that describes the number of installs. |
| Type | Label that indicates whether the app is free or paid. |
| Price | The price value for the paid apps. |
| Content Rating | Categorical rating that indicates the age group for which the app is suitable. |
| Genres | Semicolon-separated list of genres to which the app belongs. |
| Last Updated | The date the app was last updated. |
| Current Ver | The current version of the app as specified by the developers. |
| Android Ver | The Android operating system version the app is compatible with. |

Details of each column of the user_reviews_df1 DataFrame

| Column name | Description |
|---|---|
| App | The app name. |
| Translated_Review | Review text in English. |
| Sentiment | Sentiment of the review, which can be positive, neutral, or negative. |
| Sentiment_Polarity | Sentiment in numerical form, ranging from -1.00 to 1.00. |
| Sentiment_Subjectivity | Measure of the expression of opinions, evaluations, feelings, and speculations. |

### Check Unique Values for each variable.

In [None]:
for i in main_data_1.columns.tolist():
  print("No. of unique values in",i,"is",main_data_1[i].nunique(),".")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Check Unique Values for each variable.
# Category
main_data_1['Category'].unique()

In [None]:
# '1.9' is not a valid category, so look up which row it belongs to.
main_data_1[main_data_1['Category'] == '1.9']


In [None]:
# The values in row 10472 sit one column too far to the left. Shift the row one column to the right with the
# pandas shift() method, put the app name back into 'App', and set the unknown 'Category' to NaN.

main_data_1.loc[10472] = main_data_1.loc[10472].shift()
main_data_1.loc[10472, 'App'] = main_data_1.loc[10472, 'Category']
main_data_1.loc[10472, 'Category'] = np.nan
main_data_1.loc[10472]
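
The row repair above can be tried on a tiny frame first. The sketch below is an illustration with hypothetical values: row 1 lacks its category in the raw data, so every later value sits one column too far to the left and the last column is empty.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is misaligned by one column.
toy = pd.DataFrame({
    "App": ["Alpha", "Photo Frame"],
    "Category": ["TOOLS", "1.9"],
    "Rating": ["4.2", "19"],
    "Reviews": ["150", "3.0M"],
    "Size": ["12M", np.nan],
})

row = toy.loc[1].shift()       # move every value one column to the right
row["App"] = row["Category"]   # the app name landed in Category; put it back
row["Category"] = np.nan       # the real category is unknown
toy.loc[1] = row               # write the repaired row back

print(toy.loc[1])
```

Using `.loc[row, column]` or a whole-row assignment, as here, avoids chained indexing such as `df['App'].loc[1] = ...`, which may write to a temporary copy.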

In [None]:
# Rating
main_data_1['Rating'].unique()

In [None]:
# The data type of Rating is object. Converting it from string to numeric makes it usable in calculations.

main_data_1['Rating'] = pd.to_numeric(main_data_1['Rating'], errors='coerce')
main_data_1['Rating'].dtype

In [None]:
# Review
main_data_1['Reviews'].unique()

In [None]:
main_data_1['Reviews'] = main_data_1.Reviews.replace("0.0",0)
main_data_1['Reviews'] = main_data_1.Reviews.replace("3.0M",3000000.0)
main_data_1['Reviews'] = main_data_1['Reviews'].astype(float)
main_data_1['Reviews'].dtype

In [None]:
# Size
main_data_1['Size'].unique()


In [None]:
# Size is stored as text ('19M', '8.7M', '201k', 'Varies with device'), so it is converted to a float in kilobytes.
# 'Varies with device' becomes NaN; the 'M'/'k' suffixes, ',' and '+' are stripped; megabyte values are multiplied by 1000.
size = main_data_1['Size'].replace("Varies with device", np.nan)
is_mb = size.str.endswith("M", na=False)  # rows given in megabytes
size = pd.to_numeric(size.str.replace("[Mk,+]", "", regex=True), errors='coerce')
main_data_1['Size'] = size.where(~is_mb, size * 1000)  # multiply instead of appending '000' so decimals such as '8.7M' convert correctly
main_data_1['Size'].dtype
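
A per-value helper is another way to express the conversion to kilobytes. This is only a sketch: `size_to_kb` is a hypothetical name, the sample strings are made up, and 1 MB is assumed to equal 1000 kB. Note that a decimal value such as `'1.5M'` has to be multiplied by 1000; appending zeros to the text would give the wrong number.

```python
import numpy as np
import pandas as pd

def size_to_kb(value):
    """Convert a size string such as '1.5M' or '201k' to kilobytes (float)."""
    if not isinstance(value, str):
        return np.nan
    text = value.strip().replace(",", "").rstrip("+")
    if text.endswith("M"):
        return float(text[:-1]) * 1000  # assumption: 1 MB = 1000 kB
    if text.endswith("k"):
        return float(text[:-1])
    try:
        return float(text)
    except ValueError:
        return np.nan  # e.g. 'Varies with device'

# Hypothetical sample of raw Size values.
sizes = pd.Series(["19M", "1.5M", "201k", "Varies with device", "1,000+"])
sizes_kb = sizes.map(size_to_kb)
print(sizes_kb)
```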

In [None]:
#Installs
main_data_1['Installs'].unique()


In [None]:
# The data type of Installs is object. Strip the ',' and '+' characters, then convert to float, as was done for 'Size'.

main_data_1['Installs'] = main_data_1.Installs.str.replace(",","", regex=False)
main_data_1['Installs'] = main_data_1.Installs.str.replace("+","", regex=False)  # regex=False: '+' is a literal character here
main_data_1['Installs'] = main_data_1.Installs.replace("Free",np.nan)
main_data_1['Installs'] = main_data_1['Installs'].astype(float)
main_data_1['Installs'].dtype

In [None]:
# Price
main_data_1['Price'].unique()

In [None]:
# The data type of Price is object. Set the stray 'Everyone' value to NaN, strip the '$' sign, then convert to float.
main_data_1['Price'] = main_data_1.Price.replace("Everyone",np.nan)
main_data_1['Price'] = main_data_1.Price.str.replace("$","", regex=False).astype(float)  # regex=False: '$' is a literal character here
main_data_1['Price'].dtype
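
Both `+` and `$` are regular-expression metacharacters, so it is safest to tell `str.replace` to treat them literally. The sketch below uses made-up values to show the cleaning of install counts and prices with `regex=False`.

```python
import pandas as pd

# Hypothetical raw values in the same style as the Installs and Price columns.
installs = pd.Series(["10,000+", "500+", "1,000,000+"])
prices = pd.Series(["0", "$4.99", "$0.99"])

# regex=False makes ',' '+' and '$' plain characters rather than patterns.
installs_num = (installs.str.replace(",", "", regex=False)
                        .str.replace("+", "", regex=False)
                        .astype(float))
prices_num = prices.str.replace("$", "", regex=False).astype(float)

print(installs_num.tolist())
print(prices_num.tolist())
```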

In [None]:
# Last Update
main_data_1['Last Updated'].unique()

In [None]:
# The data type of Last Updated is object. Convert it from string to datetime.
main_data_1['Last Updated'] = pd.to_datetime(main_data_1['Last Updated'])
main_data_1['Last Updated']
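
Once the column is a datetime, the `.dt` accessor gives access to parts such as the year. The sketch below uses made-up dates and assumes they are written like `'January 7, 2018'`; the explicit `format` string is that assumption spelled out.

```python
import pandas as pd

# Hypothetical date strings, assumed to follow the 'Month day, year' pattern.
dates = pd.Series(["January 7, 2018", "August 1, 2018", "June 20, 2017"])

parsed = pd.to_datetime(dates, format="%B %d, %Y")

# Count how many entries fall in each year, ordered by year.
updates_per_year = parsed.dt.year.value_counts().sort_index()
print(updates_per_year)
```

A per-year count like this can feed a bar chart directly when exact yearly totals are wanted rather than histogram bins.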

In [None]:
# Write your code to make your dataset analysis ready.
# Drop the rows whose Translated_Review is null
df_user_review1 = df_user_review.dropna(subset=["Translated_Review"])
df_user_review1.isnull().sum()


In [None]:
df_user_review1.info()


In [None]:
df_user_review1.describe()

In [None]:
main_data_1.info()

### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Number of apps in each Play Store category
plt.figure(figsize=(15,10))

category_counts = main_data_1['Category'].value_counts()
graph = sns.barplot(x=category_counts.values, y=category_counts.index)  # x and y must be passed as keyword arguments
plt.xlabel("Count")
plt.ylabel("Category")
graph.set_title("Top categories on Playstore", fontsize = 20);

##### 1. Why did you pick the specific chart?

In [None]:
# The box plot organizes large amounts of data and visualizes outlier values.
# Before cleaning the dataset, the box plot showed an outlier at index 10472,
# whose Rating is 19, whereas for our analysis the rating varies from 1 to 5.

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

In [None]:

# From the above chart, most apps are rated between 4 and 4.5.

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Ten most common genres
top_genres = main_data_1['Genres'].value_counts().head(10)

plt.figure(figsize=(15,5))
graph = sns.barplot(x=top_genres.values, y=top_genres.index)
plt.xlabel("Count")
plt.ylabel("Genres")
graph.set_title("Top Genres on Playstore", fontsize = 20);

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# Chart - 3 visualization code: Category and Reviews

# Mean number of reviews per category, sorted from highest to lowest
sorted_data = (main_data_1.groupby('Category')['Reviews']
               .mean()
               .sort_values(ascending=False)
               .reset_index()
               .rename(columns={'Category': 'category', 'Reviews': 'review'}))

# visualization
plt.figure(figsize=(15,10))
sns.barplot(x=sorted_data['category'], y=sorted_data['review'])
plt.xticks(rotation=80)
plt.xlabel("Category")
plt.ylabel("Reviews")
plt.title("Category and Reviews")
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code: Category and Installs

# Mean number of installs per category, sorted from highest to lowest
sorted_data = (main_data_1.groupby('Category')['Installs']
               .mean()
               .sort_values(ascending=False)
               .reset_index()
               .rename(columns={'Category': 'category', 'Install': 'install', 'Installs': 'install'}))

# visualization
plt.figure(figsize=(15,10))
sns.barplot(x=sorted_data['category'], y=sorted_data['install'])
plt.xticks(rotation=80)
plt.xlabel("Category")
plt.ylabel("Install")
plt.title("Category and Install")
plt.show()
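
The per-category averages behind the two charts above can be computed in one step with `groupby`. The sketch below uses a hypothetical five-row frame to show the idea.

```python
import pandas as pd

# Hypothetical apps with their category and install count.
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS", "TOOLS", "TOOLS"],
    "Installs": [1000.0, 3000.0, 100.0, 200.0, 300.0],
})

# Mean installs per category, highest first.
avg_installs = (toy.groupby("Category")["Installs"]
                   .mean()
                   .sort_values(ascending=False))
print(avg_installs)
```

`groupby` skips rows whose category is missing and `mean()` skips missing values by default, so no special case is needed for empty groups.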


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code Content Rating
plt.figure(figsize=(10,7))
sns.countplot(data=main_data_1, x='Content Rating')
plt.xticks(rotation=80)
plt.title('Content Rating',color = 'blue',fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(15,5))
graph = sns.kdeplot(main_data_1.Rating, color="#4B0751", fill=True)  # fill replaces the deprecated shade argument
plt.xlabel("Rating")
plt.ylabel("Density")  # a KDE plot shows an estimated density, not raw frequencies
plt.title('Most Frequent Rating',size = 20);

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Paid and Free Apps Ratio From All Apps
plt.figure(figsize=(15,10))
x=main_data_1.Type.value_counts()
# Take the labels from the counts themselves so that they always match the slice order
plt.pie(x,labels=x.index,autopct="%1.2f%%",shadow=True, explode=[0,.21], startangle=45)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
main_data_1.columns

In [None]:
# Chart - 8 visualization code
# App update details by year
plt.figure(figsize=(15,5))
plt.title("App updates by year", fontsize=20)
ax = plt.hist(main_data_1['Last Updated'], color="#4B0751")  # bracket access is required because the column name contains a space
plt.tick_params(left=True, bottom=True)
plt.xlabel("Year")
plt.ylabel("Number of apps updated");
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Merge both datasets for further analysis

merged_df = main_data_1.merge(df_user_review1, on="App")
merged_df.info()
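
`merge` performs an inner join by default, which affects how many rows the merged frame has. The sketch below uses hypothetical apps and reviews to show that apps without reviews (and reviews without a matching app) are dropped, while apps with several reviews appear once per review.

```python
import pandas as pd

# Hypothetical app table and review table sharing the 'App' key.
apps = pd.DataFrame({"App": ["Alpha", "Beta", "Gamma"],
                     "Rating": [4.1, 3.9, 4.7]})
reviews = pd.DataFrame({
    "App": ["Alpha", "Alpha", "Beta", "Delta"],
    "Sentiment": ["Positive", "Negative", "Positive", "Neutral"],
})

# Inner join: only apps present in both tables survive.
merged = apps.merge(reviews, on="App")
print(merged)
```

Passing `how="left"` instead would keep every app and leave the review columns empty where no review exists.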

In [None]:
# Sentiment distribution across all reviews
plt.figure(figsize=(15,10))
df_user_review1["Sentiment"].value_counts().plot(kind = 'pie',  autopct='%1.2f%%',shadow=True, explode=[0, 0.05, 0.05], startangle=45 )
plt.title("Sentiment data across database",size=20)
plt.show()
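
The percentages printed on the pie can also be obtained as numbers. A minimal sketch with made-up sentiment labels:

```python
import pandas as pd

# Hypothetical sentiment labels.
sentiments = pd.Series(["Positive", "Positive", "Negative", "Neutral"])

# normalize=True returns fractions; multiply by 100 for percentages.
share = sentiments.value_counts(normalize=True) * 100
print(share)
```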

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# corr() returns the pairwise correlation between columns; numeric_only=True skips the text and date columns.
main_data_1.corr(numeric_only=True)

In [None]:
#correlation map
f,ax = plt.subplots(figsize=(6, 6))
sns.heatmap(main_data_1.corr(numeric_only=True), annot=True, linewidths=.5, fmt= '.1f',ax=ax)
plt.show()
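
To make the numbers in a correlation matrix easier to read, the sketch below builds one from a hypothetical four-row frame in which one pair of columns rises together and another pair moves in opposite directions.

```python
import pandas as pd

# Hypothetical data: Installs grows with Reviews, Rating falls as Reviews grows.
toy = pd.DataFrame({
    "App": ["Alpha", "Beta", "Gamma", "Delta"],
    "Reviews": [10.0, 20.0, 30.0, 40.0],
    "Installs": [100.0, 200.0, 300.0, 400.0],
    "Rating": [4.5, 4.0, 3.5, 3.0],
})

# numeric_only=True leaves the text column 'App' out of the calculation.
corr = toy.corr(numeric_only=True)
print(corr.round(2))
```

Values near 1 indicate that two columns rise together, values near -1 that one falls as the other rises, and values near 0 that there is little linear relationship.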

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Answer Here.

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***