## **Google Play Store EDA Project**

Exploratory Data Analysis (EDA) is a crucial step in data analysis that helps uncover patterns, detect anomalies, and understand data distributions before applying machine learning models or making business decisions. By performing EDA, I aim to gain practical exposure and enhance my hands-on experience with real-world datasets. This will allow me to apply SQL, Python (Pandas, Matplotlib, Seaborn), and ydata-profiling effectively, improving my ability to extract meaningful insights, clean data, and make data-driven decisions. Regular practice with EDA will strengthen my problem-solving skills and prepare me for data analytics and business intelligence roles in the future.

In [2]:
# Importing required Python libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt 

### **1. Importing Data**

In [20]:
# Importing Data
Playstore_Data = pd.read_csv('googleplaystore.csv')

* Important things to note
    * Set display options to show all rows and columns
    * Hide unnecessary warnings

In [21]:
# Show every column and every row when displaying a DataFrame
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Hiding Warnings
import warnings
warnings.filterwarnings('ignore')


## Exploration and Cleaning Data

#### **Defining columns**

**Here’s a breakdown of each column in the Google Play Store dataset:**

* **App** – The name of the mobile application.
* **Rating** – The average user rating of the app (typically between 1.0 and 5.0).
* **Category** – The category or genre of the app (e.g., "Games", "Business", "Education").
* **Reviews** – The total number of user reviews submitted for the app.
* **Size** – The size of the application file (in MB or KB). Some values may be listed as "Varies with device."
* **Installs** – The number of times the app has been installed (e.g., "10,000+", "1,000,000+").
* **Type** – Indicates whether the app is "Free" or "Paid".
* **Price** – The cost of the app (0 if free, otherwise in USD).
* **Content Rating** – The age group for which the app is suitable (e.g., "Everyone", "Teen", "Mature 17+").
* **Genres** – The genre(s) of the app, similar to Category but can have multiple values (e.g., "Action", "Puzzle").
* **Last Updated** – The date when the app was last updated on the Play Store.
* **Current Ver** – The current version of the app available on the Play Store.
* **Android Ver** – The minimum Android version required to install the app (e.g., "4.0 and up").

* Looking at the first 5 rows of the data

In [22]:
Playstore_Data.head(5)

|   | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|-----|----------|--------|---------|------|----------|------|-------|----------------|--------|--------------|-------------|-------------|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159.0 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967.0 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510.0 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644.0 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967.0 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up |


## 📈 Automated EDA with `ydata-profiling`

### 🎯 Purpose
To generate a **quick and comprehensive initial overview** of the dataset using descriptive statistics.  
This helps in:
- Identifying missing data
- Understanding distributions
- Spotting outliers
- Detecting correlations
- Reviewing variable types

All before starting any manual preprocessing or modeling.

### 🛠️ Tool Used: `ydata-profiling` (formerly `pandas-profiling`)

We used the `ydata-profiling` Python library to automatically generate an interactive Exploratory Data Analysis (EDA) report.

### ✅ Features of the Report
- **Summary Statistics** (mean, median, standard deviation, min/max, etc.)
- **Missing Values Heatmap**
- **Variable Type Detection**
- **Correlations (Pearson, Spearman, Kendall)**
- **Distribution Plots**
- **Duplicate Rows Detection**
- **High Cardinality Alerts**



In [23]:
import ydata_profiling as yd


profile = yd.ProfileReport(Playstore_Data)
profile.to_file('C:\\Users\\yogya\\Desktop\\EDA project\\Google playstore eda\\practicing\\output\\Google_aapStore.html')


In [19]:
Playstore_Data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10840 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10840 non-null  float64
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10841 non-null  object 
 9   Genres          10840 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10839 non-null  object 
dtypes: float64(2), object(11)
memory usage: 1.1+ MB


# 📊 Data Quality Report – Google Play Store Dataset

## 🔢 Dataset Overview
- **Total Rows:** 10,841  
- **Total Columns:** 13  
- **Memory Usage:** ~1.1 MB  

---

## 🧹 Null / Missing Values

| Column           | Non-Null Count | Missing Count | Remarks                           |
|------------------|----------------|----------------|------------------------------------|
| `App`            | 10,841         | 0              | ✅ Complete                        |
| `Category`       | 10,840         | 1              | ⚠️ 1 missing value                |
| `Rating`         | 9,367          | 1,474          | ⚠️ Significant missing values     |
| `Reviews`        | 10,840         | 1              | ⚠️ 1 missing value; should be an integer |
| `Size`           | 10,841         | 0              | ✅ Complete                        |
| `Installs`       | 10,841         | 0              | ✅ Complete                        |
| `Type`           | 10,840         | 1              | ⚠️ 1 missing value                |
| `Price`          | 10,841         | 0              | ✅ Complete                        |
| `Content Rating` | 10,841         | 0              | ✅ Complete                        |
| `Genres`         | 10,840         | 1              | ⚠️ 1 missing value                |
| `Last Updated`   | 10,841         | 0              | ✅ Complete                        |
| `Current Ver`    | 10,833         | 8              | ⚠️ 8 missing values               |
| `Android Ver`    | 10,839         | 2              | ⚠️ 2 missing values               |
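
The counts in the table above can be reproduced directly from the DataFrame rather than read off the `info()` output. The snippet below is a minimal sketch on a small made-up frame (`sample` and its values are hypothetical stand-ins for `Playstore_Data`); the `missing_pct` column is an extra added for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset, used only to illustrate the idea
sample = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.1, np.nan, 3.9, np.nan],
    "Type": ["Free", "Paid", None, "Free"],
})

# One row per column: how many values are present and how many are missing
missing_report = pd.DataFrame({
    "non_null": sample.notna().sum(),
    "missing": sample.isna().sum(),
})

# Share of missing values, to judge which columns matter most
missing_report["missing_pct"] = (
    missing_report["missing"] / len(sample) * 100
).round(2)

print(missing_report.sort_values("missing", ascending=False))
```

Running the same three statements on `Playstore_Data` instead of `sample` should give the per-column figures listed in the table.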

---

## 🔍 Data Type Issues

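- **`Reviews`** is of type `float64`, but review counts are whole numbers. 🔁 *Should convert to an integer type (the nullable `Int64` while the missing value remains, since plain `int` cannot hold `NaN`)*.
- **`Last Updated`** is of type `object`. 🔁 *Should convert to `datetime`*.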
- **`Size`**, **`Installs`**, and **`Price`** are object types and may contain inconsistent formatting like:
  - `'Varies with device'` in `Size`
  - `'1,000+'` in `Installs`
  - `'$4.99'` in `Price`

🛠️ These need cleaning and conversion:
- `Size` → numeric MB or KB (after removing "Varies with device")
- `Installs` → integer (remove `+` and `,`)
- `Price` → float (remove `$`)
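
A minimal sketch of these three conversions is shown below. It runs on a small made-up frame, not the real file, and it rests on a few assumptions that should be checked against the data: that `Size` values end in `M` or `k`, that 1 MB is treated as 1024 KB, and that anything else (such as "Varies with device") should become a missing value. The helper `size_to_mb` and the `cleaned` frame are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the formats described above
sample = pd.DataFrame({
    "Size": ["19M", "8.7M", "201k", "Varies with device"],
    "Installs": ["10,000+", "5,000,000+", "1,000+", "0"],
    "Price": ["0", "$4.99", "$0.99", "0"],
})


def size_to_mb(value):
    """Return the size in MB; unparseable values become NaN.

    Assumes an 'M' suffix for megabytes and a 'k' suffix for kilobytes.
    """
    if not isinstance(value, str):
        return np.nan
    try:
        if value.endswith("M"):
            return float(value[:-1])
        if value.endswith("k"):
            return float(value[:-1]) / 1024
    except ValueError:
        pass
    return np.nan


cleaned = pd.DataFrame({
    # 'Varies with device' turns into NaN instead of being dropped
    "Size_MB": sample["Size"].map(size_to_mb),
    # Strip '+' and ',' and keep a nullable integer dtype
    "Installs": pd.to_numeric(
        sample["Installs"].str.replace(r"[+,]", "", regex=True),
        errors="coerce",
    ).astype("Int64"),
    # Strip the literal '$' sign before converting to float
    "Price": pd.to_numeric(
        sample["Price"].str.replace("$", "", regex=False),
        errors="coerce",
    ),
})

print(cleaned)
print(cleaned.dtypes)
```

Using `errors="coerce"` means any unexpected string becomes a missing value instead of raising, so it is worth re-checking the null counts after the conversion.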

---

## ❗ Inconsistencies Summary

- ⚠️ **Missing values** in 7 columns (especially `Rating`)
- ⚠️ **Data type mismatches** in `Reviews`, `Size`, `Installs`, `Price`, and `Last Updated`
- ⚠️ **Inconsistent formats** like symbols and text in numeric fields
- ✅ Most categorical/text columns are clean but need review for uniformity

---

## ✅ Next Steps

1. **Handle missing values:**
   - Impute/drop based on column significance.
2. **Clean formatting in object-type numeric fields.**
3. **Convert datatypes** (`Reviews`, `Installs`, `Price`, `Last Updated`) to proper formats.
4. **Visualize nulls** using a heatmap to understand patterns.
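
Steps 1 and 3 might look roughly like the sketch below. It uses a small made-up frame, and the choices in it are assumptions for illustration only, not decisions taken in this notebook: dropping the rows where `Type` is missing, filling `Rating` with the median, and parsing `Last Updated` with the format `"%B %d, %Y"` inferred from the sample rows shown earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the real data is in Playstore_Data
sample = pd.DataFrame({
    "Rating": [4.1, 3.9, np.nan, 4.7],
    "Reviews": [159.0, 967.0, 87510.0, np.nan],
    "Type": ["Free", None, "Free", "Paid"],
    "Last Updated": [
        "January 7, 2018",
        "January 15, 2018",
        "August 1, 2018",
        "June 8, 2018",
    ],
})

# Step 1a: drop rows for a column with very few nulls (assumed choice)
prepared = sample.dropna(subset=["Type"]).copy()

# Step 1b: impute a heavily-null numeric column with its median (assumed choice)
prepared["Rating"] = prepared["Rating"].fillna(prepared["Rating"].median())

# Step 3a: nullable integer dtype keeps the remaining missing value as <NA>
prepared["Reviews"] = prepared["Reviews"].astype("Int64")

# Step 3b: parse dates written like 'January 7, 2018'
prepared["Last Updated"] = pd.to_datetime(
    prepared["Last Updated"], format="%B %d, %Y"
)

print(prepared)
print(prepared.dtypes)
```

Whether to impute or drop depends on how the column is used later, so the median fill here is only one reasonable option for `Rating`.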


✅ Converting `Size` to a numeric column