# **Project Name**    - Analyzing Google Play Store Data for App Success




##### **Project Type**    - Exploratory Data Analysis (EDA) and Regression
##### **Contribution**    - Individual
##### **Team Member 1 -** Arjun Sharma
##### **Team Member 2 -** n/a
##### **Team Member 3 -** n/a
##### **Team Member 4 -** n/a

# **Project Summary -**

The project aims to analyze data from the Google Play Store to extract actionable insights for app developers. By exploring various attributes such as category, rating, size, and customer reviews, the goal is to identify key factors responsible for app engagement and success. The analysis will involve exploratory data analysis (EDA) techniques to understand the distribution and relationships within the dataset. Furthermore, regression analysis will be conducted to predict app success metrics based on relevant features. The insights derived from this analysis can guide developers in making informed decisions to capture the Android market effectively.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


*The problem statement revolves around understanding the factors that contribute to the success of apps on the Google Play Store. This involves analyzing a dataset containing information about various apps, including their categories, ratings, sizes, and customer reviews. The objective is to identify patterns and correlations within the data that can help developers optimize their app strategies for better engagement and success.*

#### **Define Your Business Objective?**

*The business objective is to provide app developers with actionable insights derived from the analysis of Google Play Store data. By understanding the key factors influencing app success, developers can make informed decisions regarding app development, marketing strategies, and user engagement tactics. Ultimately, the goal is to help developers maximize their app's visibility, downloads, and user satisfaction on the Android platform.*

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart follows the format below and answers the listed questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import math  # numpy.math was an alias of the standard module and was removed in NumPy 2.0
from numpy import loadtxt
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib import rcParams

!pip install pymysql
import pymysql
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')




In [None]:
# Access files in Google Drive
data_path = '/content/drive/MyDrive/eda playstore/Play Store Data.csv'

In [None]:
# Load the dataset into a DataFrame
df = pd.read_csv(data_path)
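
The guidelines above ask for exception handling, so the load step could be wrapped as in the sketch below. The function name `load_play_store_csv` and its error messages are hypothetical and not part of the original notebook; it only assumes that `data_path` points to a CSV file.

```python
from pathlib import Path

import pandas as pd


def load_play_store_csv(path):
    """Read the CSV at `path`, failing with a clear message if it is missing or empty.

    Hypothetical helper: the name and messages are illustrative only.
    """
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"Dataset not found at {csv_path}. Is Google Drive mounted?"
        )
    try:
        frame = pd.read_csv(csv_path)
    except pd.errors.EmptyDataError as exc:
        # read_csv raises EmptyDataError when the file has no columns to parse
        raise ValueError(f"Dataset at {csv_path} is empty.") from exc
    if frame.empty:
        raise ValueError(f"Dataset at {csv_path} has a header but no rows.")
    return frame
```

In the notebook it would be called as `df = load_play_store_csv(data_path)`, so a wrong path stops the run with a readable message instead of failing in a later cell.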

### Dataset First View

In [None]:
# Display the first 10 rows of the dataset
# (only the last expression in a cell is rendered, so a single head() call is enough)
df.head(10)

In [None]:
# @title Type vs Content Rating

from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
plt.subplots(figsize=(8, 8))
df_2dhist = pd.DataFrame({
    x_label: grp['Content Rating'].value_counts()
    for x_label, grp in df.groupby('Type')
})
sns.heatmap(df_2dhist, cmap='viridis')
plt.xlabel('Type')
_ = plt.ylabel('Content Rating')

In [None]:
# @title Content Rating

from matplotlib import pyplot as plt
import seaborn as sns
df.groupby('Content Rating').size().plot(kind='barh', color=sns.palettes.mpl_palette('Dark2'))
plt.gca().spines[['top', 'right',]].set_visible(False)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
rows, columns = df.shape
print("Number of rows:", rows)
print("Number of columns:", columns)


### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:

# Dataset Duplicate Value Count
duplicate_count = df.duplicated().sum()
print("Number of duplicate rows:", duplicate_count)


#### Missing Values/Null Values

In [None]:

# Missing Values/Null Values Count
missing_values_count = df.isnull().sum()
print("Missing values count:")
print(missing_values_count)




In [None]:
# Visualizing the missing values
import seaborn as sns
sns.heatmap(df.isnull(), cbar=False)
plt.title('Missing Values Heatmap')
plt.show()

# **What did you know about your dataset?**

Based on the provided dataset information:

The dataset contains information about mobile apps available on the Google Play Store.

It consists of **9360 entries (rows) and 13 columns.**

The columns include:

*  App: Name of the app
*  Category: Category to which the app belongs
*  Rating: Average rating of the app
*  Reviews: Number of reviews for the app
*  Size: Size of the app
*  Installs: Number of installs of the app
*  Type: Type of the app (Free or Paid)
*  Price: Price of the app
*  Content Rating: Content rating of the app (e.g., Everyone, Teen, Mature 17+, etc.)
*  Genres: Genre(s) of the app
*  Last Updated: Date when the app was last updated
*  Current Ver: Current version of the app
*  Android Ver: Required Android version for the app

The 'Rating' column has a mean of approximately 4.19 and a standard deviation of about 0.54. The minimum rating is 1.0 and the maximum is 19.0; since ratings run from 1 to 5, that maximum is most likely a data error and needs further investigation.

# **Initial Observations:**

The dataset seems to provide comprehensive information about various attributes of mobile apps.
There are some potential data quality issues that need to be addressed, such as the presence of missing values (indicated by non-null count less than the total number of entries) and the unusual maximum rating value of 19.0.
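
One way to follow up on the out-of-range rating is to list every row whose rating falls outside the expected scale. The sketch below is an illustration only: `flag_invalid_ratings` and the small `sample` frame are hypothetical, and the 1 to 5 bounds are an assumption based on the usual Play Store rating scale.

```python
import pandas as pd


def flag_invalid_ratings(frame, low=1.0, high=5.0):
    """Return rows whose Rating is present but outside [low, high]."""
    rating = pd.to_numeric(frame["Rating"], errors="coerce")
    # Missing ratings are a separate issue, so they are not flagged here
    return frame[rating.notna() & ~rating.between(low, high)]


# Hypothetical sample that mimics the suspicious value seen in the summary
sample = pd.DataFrame(
    {"App": ["a", "b", "c", "d"], "Rating": [4.1, 19.0, None, 1.0]}
)
suspect = flag_invalid_ratings(sample)
print(suspect)
```

Running `flag_invalid_ratings(df)` on the real data before any rows are dropped would show which records need to be corrected or removed.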

**Hypotheses:**

*  There may be correlations between app ratings and other variables such as the number of reviews, app size, and content rating.
*  Certain app categories or genres might be more popular or tend to have higher ratings than others.
*  There could be differences in ratings between free and paid apps.
*  The required Android version for an app might influence its rating or popularity.

*These initial observations and hypotheses provide a foundation for further exploration and analysis of the dataset. Further data cleaning, preprocessing, and analysis will help validate or refute these hypotheses and uncover additional insights about the dataset.*

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(df.columns)

In [None]:

# Dataset Describe
print(df.describe())


### Variables Description

Based on the provided dataset information, here's a description of each variable:

*  App: Name of the mobile application.
*  Category: Category to which the app belongs (e.g., "Game", "Social", "Tools").
*  Rating: Average rating of the app. Ratings are typically on a scale of 1 to 5, with higher values indicating better user satisfaction.
*  Reviews: Number of user reviews for the app. This indicates the level of user engagement and feedback received by the app.
*  Size: Size of the app. This could be in various units such as megabytes (MB) or kilobytes (KB).
*  Installs: Number of times the app has been installed/downloaded. This provides insight into the popularity and reach of the app.
*  Type: Type of the app, indicating whether it is free or paid.
*  Price: Price of the app. For free apps, this would typically be "0".
*  Content Rating: Content rating of the app, indicating the target audience or age group for which the app is suitable (e.g., "Everyone", "Teen", "Mature 17+").
*  Genres: Genre(s) of the app, which provides additional categorization beyond the main category.
*  Last Updated: Date when the app was last updated. This indicates how recently the app has been maintained or improved.
*  Current Ver: Current version of the app. This helps users and developers track software updates.
*  Android Ver: Minimum required Android version for the app. This specifies the operating system compatibility of the app.

These variables provide comprehensive information about various aspects of mobile applications, including their characteristics, user feedback, and maintenance status. Analyzing these variables can help understand factors influencing app popularity, user satisfaction, and market trends.
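
Because Size mixes units, it has to be normalised before it can be used numerically. The sketch below assumes the values look like `'19M'`, `'201k'`, or a text label such as `'Varies with device'`, and that 1 MB equals 1024 KB; both are assumptions, and `size_to_mb` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def size_to_mb(value):
    """Convert a size string such as '19M' or '201k' to megabytes.

    Anything that does not match those two patterns becomes NaN.
    """
    if not isinstance(value, str):
        return np.nan
    text = value.strip()
    try:
        if text.endswith("M"):
            return float(text[:-1])
        if text.endswith("k"):
            return float(text[:-1]) / 1024  # assumes 1 MB = 1024 KB
    except ValueError:
        return np.nan
    return np.nan


# Hypothetical sample values for illustration
sample_sizes = pd.Series(["19M", "201k", "Varies with device"])
size_mb = sample_sizes.map(size_to_mb)
print(size_mb)
```

Applied to the notebook's data this would be `df['Size_MB'] = df['Size'].map(size_to_mb)`, where `Size_MB` is an illustrative column name.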

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in df.columns:
    unique_values = df[column].unique()
    print(f"Unique values for {column}: {unique_values}")


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# 1. Handling missing values
# Drop every row that contains a missing value in any column.
# Note: this modifies df in place, so all later cells see only complete rows.
df.dropna(inplace=True)





### What all manipulations have you done and insights you found?

During the data wrangling process and exploratory data analysis (EDA), several manipulations were performed on the dataset to prepare it for analysis and derive insights. Here's a summary of the manipulations and insights found:

Data Manipulations:


**Handling Missing Values:**
Identified and handled missing values in various columns by either dropping rows with missing values or imputing missing values using appropriate methods.
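
The wrangling cell above drops incomplete rows. As an illustration of the imputation alternative, the sketch below fills missing ratings with the median of the app's category and falls back to the overall median. The choice of the median and the helper name `impute_rating_by_category` are assumptions, not the notebook's method.

```python
import pandas as pd


def impute_rating_by_category(frame):
    """Fill missing Rating values with the category median, then the overall median."""
    out = frame.copy()
    category_median = out.groupby("Category")["Rating"].transform("median")
    overall_median = out["Rating"].median()
    out["Rating"] = out["Rating"].fillna(category_median).fillna(overall_median)
    return out


# Hypothetical sample with one missing rating per category
sample = pd.DataFrame(
    {
        "Category": ["GAME", "GAME", "GAME", "TOOLS", "TOOLS"],
        "Rating": [4.0, None, 5.0, 3.0, None],
    }
)
imputed = impute_rating_by_category(sample)
print(imputed)
```

Imputation keeps more rows than `dropna`, at the cost of making the rating distribution look slightly more concentrated around the median.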

**Transforming Data Types:**
Converted data types of columns to appropriate formats, such as converting string representations of dates to datetime objects and converting categorical variables to numerical representations.
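
A minimal sketch of such conversions is shown below. It assumes that 'Last Updated' looks like `January 7, 2018`, that paid prices carry a leading `$`, and that 'Reviews' is stored as text; these formats and the helper name `convert_types` are assumptions for illustration. Values that cannot be parsed become `NaT` or `NaN` instead of raising an error.

```python
import pandas as pd


def convert_types(frame):
    """Return a copy with date, price, and review columns converted from text."""
    out = frame.copy()
    out["Last Updated"] = pd.to_datetime(
        out["Last Updated"], format="%B %d, %Y", errors="coerce"
    )
    price_text = out["Price"].astype(str).str.replace("$", "", regex=False)
    out["Price"] = pd.to_numeric(price_text, errors="coerce")
    out["Reviews"] = pd.to_numeric(out["Reviews"], errors="coerce")
    return out


# Hypothetical sample, including values that cannot be parsed
sample = pd.DataFrame(
    {
        "Last Updated": ["January 7, 2018", "not a date"],
        "Price": ["0", "$4.99"],
        "Reviews": ["159", "3.0M"],
    }
)
typed = convert_types(sample)
print(typed.dtypes)
```

Using `errors="coerce"` makes malformed rows visible as missing values, so they can be counted and handled together with the other nulls.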

**Cleaning the Data:**
Cleaned text data by removing special characters, converting text to lowercase, and handling inconsistent formatting.
Addressed outliers or anomalies in the data that could affect analysis results.

**Handling Duplicates:**
Identified and removed duplicate rows or entries in the dataset to avoid biasing analysis results.
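
Duplicate handling could be sketched as follows. The rule used here, keeping one row per app name and preferring the row with the most reviews, is an assumption chosen for illustration, and `drop_duplicate_apps` is a hypothetical helper.

```python
import pandas as pd


def drop_duplicate_apps(frame):
    """Keep one row per App, preferring the row with the highest review count."""
    out = frame.copy()
    out["_reviews_num"] = pd.to_numeric(out["Reviews"], errors="coerce")
    out = out.sort_values("_reviews_num", ascending=False, na_position="last")
    out = out.drop_duplicates(subset="App", keep="first")
    # Remove the helper column and restore the original row order
    return out.drop(columns="_reviews_num").sort_index()


# Hypothetical sample with app "A" listed twice
sample = pd.DataFrame({"App": ["A", "A", "B"], "Reviews": ["10", "25", "7"]})
deduped = drop_duplicate_apps(sample)
print(deduped)
```

A plain `df.drop_duplicates()` removes only fully identical rows; the variant above also collapses repeated listings of the same app that differ in their review counts.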

# Insights Found:
**Distribution of App Installs:**
The histogram of app installs revealed the popularity and reach of apps on the Google Play Store. It showed the frequency distribution of app installs, indicating how many apps fall into different install count ranges.

**App Ratings by Content Rating:**
The boxplot of app ratings by content rating provided insights into the variation of app ratings across different content rating categories. It helped identify whether certain content rating categories tend to have higher or lower-rated apps.

**Top 10 App Categories with Highest Average Ratings:**
The bar chart of top 10 app categories with the highest average ratings identified categories that are highly rated by users. It helped prioritize app development efforts in categories likely to receive positive feedback and user satisfaction.

*Overall, these manipulations and insights provided a comprehensive understanding of the dataset and helped guide decision-making and strategy development for app developers and businesses in the Android market.*

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
import matplotlib.pyplot as plt

# Plotting the distribution of app ratings
plt.figure(figsize=(10, 6))
plt.hist(df['Rating'], bins=20, color='skyblue', edgecolor='black', alpha=0.7)
plt.title('Distribution of App Ratings')
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

***I chose a histogram to visualize the distribution of app ratings because it provides insights into the overall pattern of ratings and helps identify any skewness or outliers in the data.***

##### 2. What is/are the insight(s) found from the chart?

***The histogram shows that most app ratings are concentrated around the higher end of the scale, indicating that the majority of apps have relatively high ratings. However, there may be a few apps with exceptionally low or high ratings that need further investigation.***
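
The visual impression of a concentration at the high end can be checked numerically. The sketch below compares the mean with the median and computes the sample skewness; a negative skew together with a mean below the median indicates a tail of low ratings. The helper `describe_shape` and the sample values are hypothetical.

```python
import pandas as pd


def describe_shape(series):
    """Summarise the centre and asymmetry of a numeric series."""
    clean = pd.to_numeric(series, errors="coerce").dropna()
    return {
        "mean": float(clean.mean()),
        "median": float(clean.median()),
        "skew": float(clean.skew()),
    }


# Hypothetical ratings: mostly high, with one low outlier
sample_ratings = pd.Series([4.5, 4.4, 4.6, 4.3, 4.7, 2.0])
shape = describe_shape(sample_ratings)
print(shape)
```

On the notebook's data the call would be `describe_shape(df['Rating'])`.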

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

*The insights from this chart can help developers understand the distribution of ratings for apps in the Google Play Store. Identifying factors contributing to low ratings can guide developers in improving app quality and user satisfaction, potentially leading to a positive impact on business by increasing user engagement and retention.*

#### Chart - 2

In [None]:
import seaborn as sns

# Plotting boxplot of app categories vs. ratings
plt.figure(figsize=(12, 8))
sns.boxplot(x='Category', y='Rating', data=df, palette='viridis')
plt.title('App Categories vs. Ratings')
plt.xlabel('Category')
plt.ylabel('Rating')
plt.xticks(rotation=90)
plt.show()


##### 1. Why did you pick the specific chart?

I chose a boxplot to visualize the distribution of ratings across different app categories.

**Boxplots** are effective for comparing the central tendency, spread, and potential outliers of numerical data across multiple categories.

##### 2. What is/are the insight(s) found from the chart?

The boxplot shows the distribution of ratings for each app category, highlighting any variations in ratings across different categories.

It can help identify categories with consistently high or low ratings and those with wide variability in ratings.

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding how app ratings vary across categories can help developers focus their efforts on improving apps in categories with lower ratings or capitalizing on strengths in categories with higher ratings. This targeted approach can lead to a positive business impact by enhancing user satisfaction and app performance.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
import matplotlib.pyplot as plt

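# Convert Installs to integers by stripping the '+' suffix and the thousands separators.
# regex=False treats '+' as a literal character instead of a regex quantifier,
# and astype(str) keeps the cell safe to re-run after the column is already numeric.
df['Installs'] = (
    df['Installs'].astype(str)
    .str.replace('+', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(int)
)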

# Plotting the distribution of app installs
plt.figure(figsize=(10, 6))
plt.hist(df['Installs'], bins=20, color='lightgreen', edgecolor='black', alpha=0.7)
plt.title('Distribution of App Installs')
plt.xlabel('Installs')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

I chose a histogram to visualize the distribution of app installs because it provides insights into the popularity and reach of apps on the Google Play Store. The histogram shows the frequency distribution of app installs, indicating how many apps fall into different install count ranges. The x-axis represents the number of installs, while the y-axis represents the frequency of apps in each install range.



##### 2. What is/are the insight(s) found from the chart?

The insight gained from this chart is the distribution pattern of app installs, which helps understand the popularity and competitiveness of different apps. For example, it can reveal whether most apps have a low number of installs or if there are a significant number of highly-installed apps. This insight can inform app developers and businesses about market trends and user preferences, guiding their strategies for app development, marketing, and monetization.
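
Install counts usually span several orders of magnitude, so a linear histogram tends to pile almost every app into the first bin. A minimal sketch of an alternative view is to count the apps at each distinct install level and attach a log10 scale. The helper `install_bucket_counts` and the sample values are hypothetical, and the sketch assumes the column has already been converted to numbers.

```python
import numpy as np
import pandas as pd


def install_bucket_counts(installs):
    """Count apps per distinct install level and add a log10 scale column."""
    clean = pd.to_numeric(installs, errors="coerce").dropna()
    table = clean.value_counts().sort_index().rename("apps").to_frame()
    # Clip to 1 so that a level of 0 installs maps to log10 = 0 instead of -inf
    levels = table.index.to_numpy(dtype=float).clip(min=1)
    table["log10_installs"] = np.log10(levels)
    return table


# Hypothetical install counts for illustration
sample_installs = pd.Series([1000, 1000, 1_000_000, 10, 1000])
table = install_bucket_counts(sample_installs)
print(table)
```

Plotting `table['apps']` against `table['log10_installs']` as a bar chart shows how many apps sit at each install level without the largest apps flattening the rest.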

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code
import seaborn as sns
import matplotlib.pyplot as plt

# Plotting boxplot of app ratings by content rating
plt.figure(figsize=(10, 6))
sns.boxplot(x='Content Rating', y='Rating', data=df, palette='muted')
plt.title('App Ratings by Content Rating')
plt.xlabel('Content Rating')
plt.ylabel('Rating')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

I chose a boxplot to visualize the distribution of app ratings by content rating because it helps compare the ratings of apps across different content rating categories. Each box represents the distribution of ratings within a content rating category, with the horizontal line inside the box indicating the median rating. The whiskers extend to show the range of ratings, while any outliers are displayed as individual points.


##### 2. What is/are the insight(s) found from the chart?


The insight gained from this chart is the variation in app ratings across different content rating categories. It can help identify whether certain content rating categories tend to have higher or lower-rated apps, providing insights into user satisfaction and preferences based on content suitability. This information can guide developers in optimizing app content and features to better align with user expectations and content guidelines.
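
The quantities a boxplot draws can also be tabulated, which makes the comparison between groups explicit. The sketch below reports the count, median, and interquartile range (IQR) of ratings per group; `rating_spread_by` and the sample frame are hypothetical.

```python
import pandas as pd


def rating_spread_by(frame, group_col="Content Rating"):
    """Return count, median, and IQR of Rating for each value of group_col."""
    grouped = frame.groupby(group_col)["Rating"]
    summary = grouped.agg(count="count", median="median")
    summary["iqr"] = grouped.quantile(0.75) - grouped.quantile(0.25)
    return summary.sort_values("median", ascending=False)


# Hypothetical sample with two content rating groups
sample = pd.DataFrame(
    {
        "Content Rating": ["Everyone"] * 4 + ["Teen"] * 2,
        "Rating": [4.0, 4.2, 4.4, 4.6, 3.0, 5.0],
    }
)
spread = rating_spread_by(sample)
print(spread)
```

The count column matters when reading the result: a group with only a handful of apps can show an extreme median or IQR that says little about the group as a whole.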

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code


# Calculate average rating for each category
avg_rating_by_category = df.groupby('Category')['Rating'].mean().sort_values(ascending=False)

# Plotting bar chart of top 10 app categories with highest average ratings
plt.figure(figsize=(12, 6))
avg_rating_by_category.head(10).plot(kind='bar', color='skyblue')
plt.title('Top 10 App Categories with Highest Average Ratings')
plt.xlabel('Category')
plt.ylabel('Average Rating')
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()


##### 1. Why did you pick the specific chart?

I chose a bar chart to visualize the top 10 app categories with the highest average ratings because it provides a clear comparison of average ratings across different categories. Each bar represents the average rating of apps within a category, allowing for easy identification of categories with the highest average ratings.


##### 2. What is/are the insight(s) found from the chart?


The insight gained from this chart is the identification of app categories that are highly rated by users. It can help developers and businesses prioritize app development efforts in categories that are likely to receive positive feedback and user satisfaction. Understanding which categories have the highest average ratings can also inform market positioning and competitive strategies, guiding decisions on app features, marketing, and monetization.
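
A ranking by average rating can be dominated by categories that contain very few apps. As a hedged refinement, the sketch below ranks only categories with at least `min_apps` rated apps; the threshold, the helper name `top_categories_by_rating`, and the sample data are assumptions for illustration.

```python
import pandas as pd


def top_categories_by_rating(frame, min_apps=50, top_n=10):
    """Rank categories by mean Rating, ignoring categories with few rated apps."""
    stats = frame.groupby("Category")["Rating"].agg(apps="count", avg_rating="mean")
    eligible = stats[stats["apps"] >= min_apps]
    return eligible.sort_values("avg_rating", ascending=False).head(top_n)


# Hypothetical sample: category "B" has a perfect score but only one app
sample = pd.DataFrame(
    {
        "Category": ["A", "A", "A", "B", "C", "C", "C"],
        "Rating": [4.0, 4.2, 4.4, 5.0, 3.0, 3.5, 4.0],
    }
)
ranked = top_categories_by_rating(sample, min_apps=2)
print(ranked)
```

Without the threshold, category "B" would top the chart on the strength of a single app.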

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
import seaborn as sns
import matplotlib.pyplot as plt

# Selecting numerical columns for correlation analysis
# ('number' matches every numeric dtype, including platform-dependent integer widths)
numerical_columns = df.select_dtypes(include='number')

# Calculating the correlation matrix
correlation_matrix = numerical_columns.corr()

# Plotting the correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()


##### 1. Why did you pick the specific chart?

The correlation heatmap provides insights into the relationships between numerical variables in the dataset. Positive correlations are indicated by warmer colors (tending towards red), while negative correlations are indicated by cooler colors (tending towards blue). A correlation value close to 1 or -1 indicates a strong linear relationship between the variables, while a value close to 0 indicates a weak or no relationship.

Analyzing the correlation heatmap can help identify potential patterns or dependencies between variables, which can be valuable for further analysis and decision-making.
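
To read the heatmap systematically, the correlation matrix can be flattened into a list of variable pairs ordered by the strength of the relationship. The sketch below is one way to do that; `ranked_correlations` and the sample frame are hypothetical. Passing `method="spearman"` to the same function would give rank correlations instead of Pearson's.

```python
import numpy as np
import pandas as pd


def ranked_correlations(frame, method="pearson"):
    """List each pair of numeric columns once, strongest absolute correlation first."""
    corr = frame.select_dtypes(include="number").corr(method=method)
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna().reset_index()
    pairs.columns = ["var_a", "var_b", "r"]
    order = pairs["r"].abs().sort_values(ascending=False).index
    return pairs.loc[order].reset_index(drop=True)


# Hypothetical sample: y is an exact multiple of x, z tends to move the other way
sample = pd.DataFrame(
    {
        "name": ["a", "b", "c", "d", "e"],
        "x": [1, 2, 3, 4, 5],
        "y": [2, 4, 6, 8, 10],
        "z": [5, 3, 4, 1, 2],
    }
)
pairs = ranked_correlations(sample)
print(pairs)
```

Sorting by the absolute value keeps strong negative relationships near the top, where a sort on the raw coefficient would push them to the bottom.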

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
import seaborn as sns
import matplotlib.pyplot as plt

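# Selecting numerical columns for pair plot
numerical_columns = df.select_dtypes(include='number')

# Creating pair plot; the title is set on the whole figure,
# because plt.title() would only label the last subplot of the grid
pair_grid = sns.pairplot(numerical_columns)
pair_grid.fig.suptitle('Pair Plot of Numerical Variables', y=1.02)
plt.show()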


##### 1. Why did you pick the specific chart?

The pair plot allows us to visualize the relationships between multiple variables at once, making it easier to identify potential patterns or trends. For example, it can help us identify linear relationships, clusters, or outliers in the data.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the Business Objective?
Explain Briefly.


Based on the analysis conducted during the exploratory data analysis (EDA) process, as well as the insights gained from visualizing the dataset, here are some suggestions for the client to achieve their business objective:

1. **Focus on App Quality**: Encourage developers to prioritize app quality by improving user experience, addressing bugs, and responding to user feedback. Higher-quality apps are likely to receive better ratings and attract more users, leading to increased engagement and success in the Android market.

2. **Target Popular Categories**: Identify and invest in app categories that are popular among users. Analyze the distribution of app ratings across different categories to understand which categories have the highest average ratings and user satisfaction. Consider allocating resources towards developing apps in these categories to capitalize on their popularity.

3. **User Engagement Strategies**: Implement strategies to enhance user engagement and retention. This could include regular updates to existing apps, introducing new features based on user preferences, and fostering a sense of community through social features or user forums within the apps.

4. **Monetization Opportunities**: Explore monetization opportunities beyond the initial app purchase price. For example, consider offering in-app purchases, subscriptions, or advertisements within the apps. Analyze the impact of app type (free vs. paid) and pricing strategies on user engagement and revenue generation.

5. **Continuous Improvement**: Foster a culture of continuous improvement by regularly monitoring app performance metrics, analyzing user feedback, and iterating on app features based on data-driven insights. Encourage collaboration between developers, designers, and marketers to optimize app performance and user satisfaction over time.

6. **Market Research and Competition Analysis**: Stay informed about market trends and competitor strategies in the Android app market. Conduct regular market research to identify emerging trends, user preferences, and areas of opportunity. Analyze competitor apps to benchmark performance and identify areas for differentiation and innovation.

7. **Investment in Marketing and Promotion**: Allocate resources towards marketing and promotion efforts to increase app visibility and attract new users. Utilize targeted advertising campaigns, social media promotion, and collaborations with influencers or app review websites to reach a wider audience and drive app downloads.

By implementing these strategies and leveraging insights gained from data analysis, the client can enhance app engagement and success in the Android market, ultimately driving business growth and achieving their objectives. It's important to continuously evaluate and adjust these strategies based on evolving market dynamics and user preferences.


# **Conclusion**


---


In conclusion, the exploratory data analysis (EDA) of the Google Play Store apps dataset has provided valuable insights into factors influencing app engagement and success in the Android market. Through data visualization and analysis, we identified key patterns and relationships between variables that can guide decision-making and strategy development for app developers and businesses.

The analysis highlighted the importance of focusing on app quality, targeting popular categories, implementing user engagement strategies, exploring monetization opportunities, continuously improving app features, conducting market research and competition analysis, and investing in marketing and promotion efforts.

By leveraging these insights and implementing data-driven strategies, businesses can enhance app performance, increase user satisfaction, and drive growth in the Android market. It is essential to continually monitor market trends, user preferences, and app performance metrics to adapt and optimize strategies over time.

Overall, the EDA process has provided actionable insights that can help businesses achieve their objectives and succeed in the competitive landscape of the Google Play Store. Through informed decision-making and strategic execution, businesses can maximize app engagement and unlock opportunities for long-term success.

This concludes our EDA capstone project on analyzing the Google Play Store apps dataset. Thank you for your attention and engagement throughout the analysis process!

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***