<a href="https://colab.research.google.com/github/sarswat4/project-Repository/blob/main/Play_Store_App_Review_Analysis_M_2_Capstone_Project_ipynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Play Store App Review Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member 1 -**  Diksha sarswat


# **Project Summary -**

Data Preprocessing :

1. Importing libraries
2. Getting the dataset
3. Importing datasets
4. Reading the data
5. Finding Missing Data
6. Data Cleaning
7. Data Visualisation
8. Conclusion



# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**



Android is the dominant mobile operating system today with about 85% of all mobile devices running Google’s OS. The Google Play Store is the largest and most popular Android app store.

The purpose of our project was to gather and analyze detailed information on apps in the Google Play Store in order to provide insights on app features and the current state of the Android app market.

The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market.

Each app (row) has values for category, rating, size, and more. Another dataset contains customer reviews of the Android apps.

Explore and analyze the data to discover key factors responsible for app engagement and success.



# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart is presented in the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from matplotlib import rcParams

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

In [None]:
path='/content/drive/MyDrive/Play Store Data.csv'
data=pd.read_csv(path,encoding="ISO-8859-1")

### Dataset First View

In [None]:
# Dataset First Look
# head() shows the first 5 rows of the DataFrame by default
data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
data.shape

### Dataset Information

In [None]:
# Dataset Info
data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
data.duplicated().sum()

In [None]:
print(f'We have {data.duplicated().sum()} duplicate rows in the dataset.')

In [None]:
# Count the distinct app categories
x = data['Category'].unique()
print('Total number of app categories in the Play Store:', len(x))

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
data.isna().sum()

In [None]:
# Visualizing the missing values
plt.figure(figsize=(15,5))
sns.heatmap(data.isnull(),cmap='plasma',annot=False,yticklabels=False)
plt.title("Visualising Missing Values")
plt.show()

### What did you know about your dataset?

**Dataset Overview:**

The dataset contains 13 columns with 10,841 rows of data entries.

**Data Type Issues:**

Columns such as 'Reviews', 'Size', 'Installs', and 'Price' are currently stored as strings (object data type), but they represent numerical or categorical information. These need to be converted into appropriate formats for analysis.

**Column-Specific Observations:**

**Size**: The values in this column are inconsistent. Some represent size in megabytes (M) or kilobytes (k), while others have entries like "Varies with device", which must be addressed.

**Installs:** This column uses string formatting, including symbols such as `+` and `,`, which should be removed to extract numeric values.

**Price:** The price values include the currency symbol $, which needs to be removed for conversion to numeric format.

**Action Required:**
Data cleaning is necessary to standardize these columns and ensure they are in the correct data types for meaningful analysis.
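
The cleaning steps listed above can be sketched on a few toy rows. This is a minimal illustration rather than the notebook's own wrangling code: the sample values and the helper name `size_to_mb` are assumptions made for the example, and unlike the functions used later it keeps the cents in `Price` and marks "Varies with device" as missing instead of 0.

```python
import pandas as pd

# Toy rows mimicking the raw string formats described above (illustrative values only)
raw = pd.DataFrame({
    "Installs": ["10,000+", "500,000+", "1,000,000+"],
    "Price": ["0", "$4.99", "$0.99"],
    "Size": ["19M", "201k", "Varies with device"],
})

clean = pd.DataFrame({
    # Strip thousands separators and the trailing '+', then cast to int
    "Installs": raw["Installs"].str.replace(r"[,+]", "", regex=True).astype(int),
    # Strip the currency symbol and cast to float so the cents are kept
    "Price": raw["Price"].str.replace("$", "", regex=False).astype(float),
})


def size_to_mb(text):
    """Hypothetical helper: return the size in megabytes, or NaN when no fixed size is given."""
    text = text.strip()
    if text.endswith("M"):
        return float(text[:-1])
    if text.endswith("k"):
        return float(text[:-1]) / 1024
    return float("nan")  # e.g. 'Varies with device'


clean["Size_MB"] = raw["Size"].map(size_to_mb)
print(clean)
print(clean.dtypes)
```

Keeping unknown sizes as NaN means they are skipped by aggregates such as `mean()`, whereas a placeholder of 0 would pull those aggregates down.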




In [None]:
# Dataset Describe
data.describe(include='all')

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data.columns



### Variables Description

**1-App:** The name of the application available on the Google Play Store.

**2-Category:** The classification or category the app belongs to on the Play Store.

**3-Rating:** The average user rating of the app as provided on the Play Store.

**4-Reviews:** The total number of user reviews received by the app.

**5-Size:** The storage size of the application.

**6-Installs:** The total number of times the app has been downloaded or installed.

**7-Type:** Indicates whether the app is free to download or requires payment (Free/Paid).

**8-Price:** The cost of the app in USD (0 for free apps).

**9-Content Rating:** The target age group or audience suitability of the app (e.g., Everyone, Teen, Mature).

**10-Genres:** The specific genre or subcategory of the app.

**11-Last Updated:** The date when the app was most recently updated.

**12-Current Ver:** The version number of the app that is currently available.

**13-Android Ver:** The minimum Android operating system version required for the app to function.







### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
variables_df = data.columns.to_list()
for i in variables_df:
  print('The Unique Values of', i, 'are:', data[i].unique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Helper that converts the string values in 'Installs', 'Price' and 'Reviews' to integers

def convert_int(x):
    """Strip formatting characters from x and return it as an int."""
    if isinstance(x, int):
        return x
    x = str(x)
    # Remove thousands separators, '+', size suffixes and the currency symbol
    for ch in (',', '+', 'M', 'k', '$'):
        x = x.replace(ch, '')
    # Non-numeric placeholders are mapped to 0
    if x in ('Free', 'Varies with device', 'Everyone'):
        return 0
    # int() truncates, so a price such as '$4.99' becomes 4
    return int(float(x))

In [None]:
# Convert Installs, Price and Reviews to integers so they can be used in calculations
data['Installs']=data['Installs'].apply(convert_int)
data['Size_int']=data['Size'].apply(convert_int)
data['Price']=data['Price'].apply(convert_int)
data['Reviews']=data['Reviews'].apply(convert_int)

In [None]:
# Sizes come in mixed units, so convert everything to megabytes (1 MB = 1024 kB)
def convert_int2(x):
    """Return the app size in MB; 0 for 'Varies with device', -100 for unrecognised values."""
    x = str(x).lower()
    if x == 'varies with device':
        return 0
    # Strip thousands separators, '+' and '$' before parsing the number
    number = x.replace(',', '').replace('+', '').replace('$', '')
    if 'm' in x:
        # Already in megabytes, e.g. '19m' -> 19.0
        return float(number.replace('m', '').replace('k', ''))
    if 'k' in x:
        # Kilobytes, e.g. '512k' -> 0.5
        return float(number.replace('k', '')) / 1024
    # Sentinel for entries with no recognisable unit
    return -100

data['Size_int'] = data['Size'].apply(convert_int2)

In [None]:
# Drop rows whose rating is above 5 (invalid entries, since ratings cannot exceed 5)
r=data[data['Rating']>5].index

data=data.drop(r)

In [None]:
# Fill missing ratings with the median of the available ratings
x = data['Rating'].median()
data['Rating'] = data['Rating'].fillna(x)
data.isna().sum()

In [None]:
# Drop the rows that have no value in the 'Type' column

y=data[data['Type'].isna()].index
data=data.drop(index=y)
data.isnull().sum()

### What all manipulations have you done and insights you found?

We observed that 1,474 rows in the dataset have missing values in the 'Rating' column. To address this, we chose to replace the null values with the median of the overall 'Rating' values.

**Median:** In statistics and probability theory, the median is the value separating the higher half from the lower half of a data sample, a population, or a probability distribution. For a data set, it may be thought of as "the middle" value.
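
As a small worked example of median imputation, the sketch below fills the gaps in a toy `Rating` series. The values are made up for illustration; only `median()` and `fillna()` mirror what the notebook does.

```python
import pandas as pd

# Toy ratings with two missing values (illustrative only)
ratings = pd.Series([4.5, 3.0, None, 4.0, None, 5.0], name="Rating")

# median() skips missing values: the sorted values are 3.0, 4.0, 4.5, 5.0,
# so the median is the average of the two middle values
median_rating = ratings.median()

# Replace every missing rating with that median
filled = ratings.fillna(median_rating)

print(median_rating)
print(filled.tolist())
```

One thing to keep in mind is that every imputed row receives the same value, so a rating histogram drawn afterwards may show an extra spike at the median.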



**Next, we sort the data by review count and keep, for each app, only the row with the highest number of reviews.**

In [None]:
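# Sort by review count (highest first), then keep the first row seen for each app,
# so every duplicated app is represented by its most-reviewed entry
df1 = data.sort_values(by='Reviews', ascending=False)
df1 = df1.drop_duplicates(subset=['App'])
df1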
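
The sort-then-deduplicate pattern used in this cell can be checked on a toy table. The app names and review counts below are invented for the example.

```python
import pandas as pd

# Toy data: 'Alpha' and 'Beta' each appear twice with different review counts
apps = pd.DataFrame({
    "App": ["Alpha", "Beta", "Alpha", "Beta", "Gamma"],
    "Reviews": [120, 40, 300, 35, 10],
})

# After sorting in descending order, the first occurrence of each app
# is the row with the most reviews, which is the one drop_duplicates keeps
deduped = (
    apps.sort_values("Reviews", ascending=False)
        .drop_duplicates(subset=["App"], keep="first")
        .reset_index(drop=True)
)

print(deduped)
```

Sorting first matters: without it, `drop_duplicates` would keep whichever row happened to come first in the file.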

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
sns.set_style('darkgrid')
plt.figure(figsize=(10, 5))
sns.countplot(x='Category', data=df1)
plt.title('Number of Apps Per Category')
plt.xticks(rotation=90)
plt.ylabel('Number of Apps')
plt.show()

##### 1. Why did you pick the specific chart?

A count plot shows the number of apps in each category.



##### 2. What is/are the insight(s) found from the chart?

The Family category has the highest number of apps, followed by Game.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Knowing how many apps each category already holds shows where competition is heaviest, which can guide the choice of category for a new app.



#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Total installs per category
users = df1.groupby(['Category'])['Installs'].sum().reset_index()
sns.set_style('darkgrid')
plt.figure(figsize=(10, 5))
sns.barplot(x='Category', y='Installs', data=users)
plt.title('Total Installs per Category')
plt.xticks(rotation=90)
plt.ylabel('Number of Installs')
plt.show()

##### 1. Why did you pick the specific chart?

A bar plot shows which categories have the highest number of installs.

##### 2. What is/are the insight(s) found from the chart?

From the graph above, we conclude that users show the most interest in Game apps, followed by Communication apps. Game is the most installed category and Communication is second.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Install totals per category show where user demand is concentrated, which can guide where to invest development effort.



#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(10, 5))
sns.countplot(x='Rating', data=df1)
plt.title('Rating Distribution')
plt.xticks(rotation=90)
plt.ylabel('Number of Apps')
plt.show()

##### 1. Why did you pick the specific chart?

To show the distribution of ratings.

##### 2. What is/are the insight(s) found from the chart?

The distribution shows that most apps in the Play Store are rated above 4, mainly in the range 4 to 4.7.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The rating distribution gives a benchmark: since most apps are rated above 4, a new app likely needs a rating in that range to stay competitive.



#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(14, 9))

# Box plot of ratings per category, with the overall mean rating as a red reference line
val = sns.boxplot(data=df1, x="Category", y="Rating")
val.axhline(df1['Rating'].mean(), ls="-", color="red")
plt.xticks(rotation=90)
print('Average rating:', df1['Rating'].mean())
plt.show()

##### 1. Why did you pick the specific chart?

A box plot lets us compare ratings across categories and see which categories sit above or below the overall average rating.



##### 2. What is/are the insight(s) found from the chart?

The red line is the overall average rating.

Most app categories perform decently. The strongest are Health and Fitness and Books and Reference, where 50% of apps are rated above 4.5, which is extremely high.

However, 50% of the apps in the Dating category are rated below the overall average.
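
The "50% of apps" statements above are statements about each category's median, which is the middle line of each box. A minimal sketch of how those medians could be compared with the overall mean, using invented ratings and only two categories:

```python
import pandas as pd

# Toy ratings for two categories (illustrative values, not taken from the dataset)
toy = pd.DataFrame({
    "Category": ["HEALTH_AND_FITNESS"] * 3 + ["DATING"] * 3,
    "Rating": [4.6, 4.7, 4.2, 3.9, 4.0, 4.4],
})

# Overall mean rating: the red reference line in the box plot
overall_mean = toy["Rating"].mean()

# Median rating per category: half of the category's apps are rated above it
summary = toy.groupby("Category")["Rating"].median().to_frame("median_rating")
summary["above_overall_mean"] = summary["median_rating"] > overall_mean

print(round(overall_mean, 2))
print(summary)
```

A category whose median lies below the overall mean has at least half of its apps rated below that mean.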



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Most Game apps are rated above the average, and almost all Health and Fitness apps are rated above it, which counts as positive feedback for those categories.



#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Sum the numeric columns for each rating value (text columns are skipped)
rating = df1.groupby(['Rating']).sum(numeric_only=True).reset_index()
fig, axes = plt.subplots(1, 4, figsize=(14, 4))

axes[0].plot(rating['Rating'], rating['Reviews'])
axes[0].set_xlabel('Rating')
axes[0].set_ylabel('Reviews')
axes[0].set_title('Reviews Per Rating')

axes[1].plot(rating['Rating'], rating['Size_int'] )
axes[1].set_xlabel('Rating')
axes[1].set_ylabel('Size')
axes[1].set_title('Size Per Rating')

axes[2].plot(rating['Rating'], rating['Installs'] )
axes[2].set_xlabel('Rating')
axes[2].set_ylabel('Installs')
axes[2].set_title('Installs Per Rating')

axes[3].plot(rating['Rating'], rating['Price'] )
axes[3].set_xlabel('Rating')
axes[3].set_ylabel('Price')
axes[3].set_title('Price Per Rating')

plt.tight_layout(pad=2)
plt.show()

##### 1. Why did you pick the specific chart?

To see how rating relates to the other factors (reviews, size, installs, and price).



##### 2. What is/are the insight(s) found from the chart?

From the graphs above, apps rated between 4.0 and 4.7 account for the highest totals of reviews, size, and installs. Price shows no direct relationship with rating: it fluctuates even in the high-rating range.
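
Because these line plots use per-rating sums, a rating value shared by many apps will show a high total even if each app is modest. The sketch below, on made-up numbers, puts the sum next to the mean and the app count so the two effects can be told apart.

```python
import pandas as pd

# Toy data: three apps rated 4.5 and a single app rated 3.0 (illustrative only)
toy = pd.DataFrame({
    "Rating": [4.5, 4.5, 4.5, 3.0],
    "Reviews": [100, 200, 300, 400],
})

# Sum, mean and number of apps for each rating value
per_rating = toy.groupby("Rating")["Reviews"].agg(["sum", "mean", "count"])

print(per_rating)
```

Here the 4.5 group has the larger total only because it contains more apps; its average per app is lower than that of the single 3.0 app.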

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, customers tend to install the apps that have the highest number of reviews.



#### Chart - 6

In [None]:
# Chart - 6 visualization code
fig, ax = plt.subplots(figsize=(7, 7), subplot_kw=dict(aspect="equal"))

number_of_apps = df1['Type'].value_counts()

labels = number_of_apps.index
sizes = number_of_apps.values

ax.pie(sizes,labeldistance=2,autopct='%1.1f%%')
ax.legend(labels=labels,loc="right",bbox_to_anchor=(0.9, 0, 0.5, 1))
ax.axis("equal")
plt.title('Type Distribution')
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart shows how the apps split across the values of a single column, here `Type`.



##### 2. What is/are the insight(s) found from the chart?

From the plot we can infer that the majority of the apps in the Play Store are free.
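
The percentages printed on the pie chart can also be obtained as plain numbers. A minimal sketch on a toy `Type` column, where the 9-to-1 split is invented for the example:

```python
import pandas as pd

# Toy 'Type' column: nine free apps and one paid app (illustrative only)
types = pd.Series(["Free"] * 9 + ["Paid"], name="Type")

# normalize=True returns fractions instead of counts; multiply by 100 for percentages
share = (types.value_counts(normalize=True) * 100).round(1)

print(share)
```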



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Not much, because most of the apps are free apps.



#### Chart - 7

In [None]:
# Chart - 7 visualization code
fig, ax = plt.subplots(figsize=(7, 7), subplot_kw=dict(aspect="equal"))

number_of_apps = df1['Content Rating'].value_counts()

labels = number_of_apps.index
sizes = number_of_apps.values

ax.pie(sizes,labeldistance=2,autopct='%1.1f%%')
ax.legend(labels=labels,loc="right",bbox_to_anchor=(0.9, 0, 0.5, 1))
ax.axis("equal")
plt.title('Content Rating Distribution')
plt.show()

##### 1. Why did you pick the specific chart?

Here we want to see the different content ratings present in the Play Store.



##### 2. What is/are the insight(s) found from the chart?

Most of the content is rated for Everyone.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Not much, because the content rating does not really affect the growth of an app.



#### Chart - 8

In [None]:
# Chart - 8 visualization code
fig, axes = plt.subplots(figsize=(8, 8))
sns.heatmap(df1.select_dtypes(include=np.number).corr(), ax=axes, annot=True, linewidths=0.1, fmt='.2f', square=True)
plt.show()

##### 1. Why did you pick the specific chart?

A heatmap shows the correlation between every pair of numeric columns at a glance.



##### 2. What is/are the insight(s) found from the chart?

Installs and reviews are positively correlated with each other.

A moderate positive correlation of 0.63 exists between the number of reviews and the number of installs. This suggests that customers tend to download an app more often when it has been reviewed by a larger number of people.

It may also mean that many active users who download an app go on to leave a review or feedback.
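
Both readings above are possible because a correlation coefficient is symmetric: it measures how strongly two columns move together, not which one drives the other. A small sketch with invented install and review counts:

```python
import pandas as pd

# Toy install and review counts (illustrative values, not taken from the dataset)
toy = pd.DataFrame({
    "Installs": [1_000, 5_000, 10_000, 50_000, 100_000],
    "Reviews": [20, 90, 210, 1_100, 1_900],
})

# Pearson correlation, the default method of corr()
pearson = toy["Installs"].corr(toy["Reviews"])

# Swapping the two columns gives the same coefficient
symmetric = toy["Reviews"].corr(toy["Installs"])

print(round(pearson, 3), round(symmetric, 3))
```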



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, getting your app reviewed by more people may be a good way to increase its share of the market.







# Sentiment Analysis (2nd DataSet)

This file contains the result of the sentiment analysis conducted by the dataset creator. It has 64,295 rows of data with the following columns:

App : Name of the app.

Translated_Review: Either the original review in English, or a translated version if the original review is in another language.

Sentiment: The result of the sentiment analysis conducted on a review. The value is either Positive, Neutral, or Negative.

Sentiment_Polarity: A value indicating the positivity or negativity of the sentiment, values range from -1 (most negative) to 1 (most positive).

Sentiment_Subjectivity: A value from 0 to 1 indicating the subjectivity of the review. Lower values indicate the review is based on factual information, and higher values indicate the review is based on personal or public opinions or judgements.



In [None]:
# Second dataset
from google.colab import drive
drive.mount('/content/drive/')

In [None]:
path= '/content/drive/MyDrive/User Reviews.csv'

In [None]:
# Load the second dataset (user reviews)
data1=pd.read_csv(path,encoding="ISO-8859-1")

In [None]:
data1

In [None]:
# Merge the Play Store dataset with the reviews dataset on the app name
df4=pd.merge(df1,data1,how='inner',on='App')
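
To make the effect of `how='inner'` concrete, here is a sketch on two toy tables. The app names and values are invented; only the `pd.merge` call mirrors the cell above.

```python
import pandas as pd

# Toy app table: one row per app (illustrative only)
apps = pd.DataFrame({
    "App": ["Alpha", "Beta", "Gamma"],
    "Category": ["GAME", "TOOLS", "FAMILY"],
})

# Toy review table: several rows per app, plus one app missing from the app table
reviews = pd.DataFrame({
    "App": ["Alpha", "Alpha", "Beta", "Delta"],
    "Sentiment": ["Positive", "Negative", "Positive", "Neutral"],
})

# An inner merge keeps only apps present in both tables,
# and repeats the app's columns once for every matching review
merged = pd.merge(apps, reviews, how="inner", on="App")

print(merged)
```

In this toy case 'Gamma' (no reviews) and 'Delta' (not in the app table) drop out, so the merged table covers only the apps that have reviews.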

**Chart - 9**

Sentiment Analysis on different Category basis

In [None]:
# Chart - 9 visualization code
from matplotlib.ticker import PercentFormatter

f = plt.figure(figsize=(15,8))
ax = f.add_subplot(1,1,1)

sns.histplot(
    data=df4,
    x="Category", hue="Sentiment",
    bins=34,
    ax=ax,
    stat="count",
    multiple="stack",
    palette="light:m_r",
    edgecolor=".3",
    linewidth=.5,
    legend=True
    )
ax.set_title("Sentiment Analysis Based on Category",fontsize=15,fontweight='bold')
plt.xticks(rotation='vertical')
ax.set_xlabel("Category",fontsize=14)
ax.set_ylabel("Share of All Reviews",fontsize=14)

# Show each tick as a percentage of all rows in the merged review data
plt.gca().yaxis.set_major_formatter(PercentFormatter(xmax=len(df4)))
sns.set(style="ticks")
plt.grid()
plt.show()


##### 1. Why did you pick the specific chart?

To see which app categories perform best in terms of review sentiment.

##### 2. What is/are the insight(s) found from the chart?

Family, Sports, and Health & Fitness apps perform best, with more than 50% positive reviews. Game and Social apps perform decently, with about 50% positive and 50% negative reviews.

The Game category has far more reviews than any other category.
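
Percentages such as "more than 50% positive" can be read more reliably from a table than from stacked bars. One possible way to compute them is a row-normalised cross-tabulation, sketched here on invented reviews:

```python
import pandas as pd

# Toy merged reviews (illustrative only)
toy = pd.DataFrame({
    "Category": ["GAME"] * 4 + ["SPORTS"] * 4,
    "Sentiment": [
        "Positive", "Negative", "Positive", "Negative",
        "Positive", "Positive", "Positive", "Neutral",
    ],
})

# normalize="index" makes every row (category) sum to 1; multiply by 100 for percentages
share = pd.crosstab(toy["Category"], toy["Sentiment"], normalize="index") * 100

print(share)
```

Each row then gives the sentiment split within one category, independent of how many reviews that category has.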



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, it indicates which app categories are likely to perform well.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

The average rating of active apps on the Google Play Store is 4.17. Communication apps like Facebook and WhatsApp are highly reviewed, indicating that users are regularly active on these platforms and frequently provide feedback.

Medical and Family apps are among the most expensive, with some costing up to $80. Users are more likely to download an app if it has a large number of reviews.

More than half of the users rate Family, Sports, and Health & Fitness apps positively. In contrast, apps for games and social media receive mixed feedback, with about 50% positive and 50% negative reviews.

# **Conclusion**

The Google Play Store Apps report highlights key trends among popular apps. Based on the visual data, categories like GAME, COMMUNICATION, and TOOL dominate in terms of user installs, even though the number of apps available in these categories is significantly smaller compared to FAMILY apps. This popularity is likely due to their ability to entertain or assist users effectively. Developers in these categories appear to prioritize quality over quantity.

The charts also indicate that apps with high ratings (above 4.0) often have a large number of reviews and user installs. However, app size and price are not strong indicators of high ratings, as these attributes are influenced by a minority of apps. Categories such as SOCIAL, COMMUNICATION, and GAME—featuring apps like Facebook, WhatsApp, Instagram, Messenger, Clash of Clans, and Google apps—tend to have the most reviews and user engagement.

Despite their popularity, apps from GAME, SOCIAL, COMMUNICATION, and TOOL categories do not appear in the top 5 most expensive apps on the Play Store. In conclusion, the current trends in the Android app market revolve around apps that provide entertainment, communication, or assistance, making these categories the most influential among users.








### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***