# Data Cleaning - Google Play Store Dataset

## Introduction

**Data Cleaning** is one of the most important and time-consuming steps in any data science project. Real-world data is messy, incomplete, inconsistent, and often contains errors. Before we can perform meaningful analysis or build models, we must clean and prepare our data.

### What is Data Cleaning?

Data Cleaning (also called Data Cleansing or Data Scrubbing) is the process of detecting and correcting (or removing) corrupt, inaccurate, or irrelevant records from a dataset. It includes:
- **Handling Missing Values**
- **Removing Duplicates**
- **Correcting Data Types**
- **Handling Outliers**
- **Standardizing Formats**
- **Fixing Inconsistencies**
- **Removing Irrelevant Data**

### Why Data Cleaning Matters

**"Garbage In, Garbage Out"** - Poor quality data leads to poor results!

- **Improves Accuracy**: Clean data = better predictions
- **Saves Time**: Prevents errors during analysis
- **Enables Analysis**: Many algorithms can't handle missing/inconsistent data
- **Better Decisions**: Reliable data leads to reliable insights
- **Increases Efficiency**: Reduces computational overhead

### Real-World Data Problems

1. **Missing Values**: Very common in real datasets, either as truly empty cells or as placeholder strings such as "NaN" or "null"
2. **Duplicates**: Can skew statistics and model performance
3. **Inconsistent Formats**: "5", "5.0", "5.00" for the same value
4. **Type Errors**: Numbers stored as strings
5. **Special Characters**: "$10.99" instead of 10.99
6. **Outliers**: Extreme values that may be errors
7. **Inconsistent Categories**: "Free", "free", "FREE", "0"
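
A couple of these problems can be reproduced on a toy table. The sketch below uses made-up values (the lowercase column names `price` and `type` are illustrative and are not the dataset's columns) to show one way special characters and inconsistent categories might be normalized with pandas.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "price": ["$10.99", "$0", "4.99"],
    "type": ["Free", "free ", "FREE"],
})

# Strip the literal "$"; regex=False stops "$" being read as a regex anchor
toy["price"] = toy["price"].str.replace("$", "", regex=False).astype(float)

# Trim whitespace and unify case so all spellings collapse to one category
toy["type"] = toy["type"].str.strip().str.capitalize()

print(toy)
print(toy.dtypes)
```

After these two steps `price` is a float column and `type` holds a single spelling, which is the state the later cleaning steps aim for on the full dataset.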

### Google Play Store Dataset

This dataset contains information about Android apps on the Google Play Store:
- **App**: Application name
- **Category**: App category
- **Rating**: User rating (1-5 stars)
- **Reviews**: Number of user reviews
- **Size**: App size
- **Installs**: Number of installs
- **Type**: Free or Paid
- **Price**: App price
- **Content Rating**: Target audience
- **Genres**: App genres
- **Last Updated**: Last update date
- **Current Ver**: Current version
- **Android Ver**: Required Android version

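The real dataset is notoriously messy, which makes it perfect for learning data cleaning. For demonstration, this notebook builds a small synthetic dataset that mimics the same kinds of problems.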

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import warnings
warnings.filterwarnings('ignore')

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)

print("=" * 70)
print("DATA CLEANING TOOLKIT LOADED")
print("=" * 70)
print("✓ pandas - Data manipulation")
print("✓ numpy - Numerical operations")
print("✓ matplotlib & seaborn - Visualization")
print("✓ re - Regular expressions for text processing")
print("=" * 70)

## 1. Load Data and Initial Inspection

The first step in data cleaning is understanding what we're working with - the "before" picture.

In [None]:
# Create a messy Google Play Store dataset for demonstration
np.random.seed(42)
n_samples = 300

categories = ['GAME', 'FAMILY', 'TOOLS', 'PRODUCTIVITY', 'MEDICAL', 'FINANCE']
types = ['Free', 'Paid', 'free', 'FREE', 'paid']
ratings_messy = ['4.5', '4.2', 'NaN', '3.8', 'null', '4.7', '', '19']  # Some errors
content_ratings = ['Everyone', 'Teen', 'Mature 17+', 'Everyone 10+']

# Create messy data
data = {
    'App': [f'App_{i}' for i in range(n_samples)],
    'Category': np.random.choice(categories, n_samples),
    'Rating': np.random.choice(ratings_messy + [str(round(np.random.uniform(1, 5), 1)) for _ in range(20)], n_samples),
    'Reviews': [str(np.random.randint(10, 1000000)) if np.random.random() > 0.05 else 'NaN' for _ in range(n_samples)],
    'Size': [f'{np.random.choice([str(np.random.randint(1, 100)) + "M", str(np.random.randint(1, 10)) + "k", "Varies with device"])}' 
             for _ in range(n_samples)],
    'Installs': [f'{np.random.choice(["1,000+", "10,000+", "100,000+", "1,000,000+", "10,000,000+", "Free"])}' 
                 for _ in range(n_samples)],
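    'Type': np.random.choice(types, n_samples),
    # Note: Price is generated from a separate random Type sample, so Price and Type
    # are not guaranteed to agree (a 'Free' row can carry a non-zero price)
    'Price': [f'${np.random.choice([0, 0.99, 2.99, 4.99, 9.99])}' if t in ['Paid', 'paid'] else '$0' 
              for t in np.random.choice(types, n_samples)],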
    'Content Rating': np.random.choice(content_ratings, n_samples),
}

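# np.random.choice returns NumPy arrays, which have no .append() and would store
# np.nan as the string 'nan'; convert every column to a plain Python list first
data = {col: np.asarray(values).tolist() for col, values in data.items()}

# Introduce some missing values
for col in ['Rating', 'Reviews', 'Size']:
    indices = np.random.choice(n_samples, size=int(n_samples * 0.1), replace=False)
    for idx in indices:
        # Rating gets a real NaN; Reviews and Size get the placeholder string 'NaN'
        data[col][idx] = np.nan if col == 'Rating' else 'NaN'

# Add some duplicates by re-appending randomly chosen existing rows
for i in range(10):
    dup_idx = np.random.randint(0, n_samples-50)
    for col in data.keys():
        data[col].append(data[col][dup_idx])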

df_dirty = pd.DataFrame(data)

# Display initial state
print("=" * 70)
print("INITIAL DATA INSPECTION - DIRTY DATASET")
print("=" * 70)
print(f"\nDataset Shape: {df_dirty.shape}")
print(f"Rows: {df_dirty.shape[0]:,} | Columns: {df_dirty.shape[1]}")

print("\n" + "=" * 70)
print("FIRST 10 ROWS:")
print("=" * 70)
display(df_dirty.head(10))

print("\n" + "=" * 70)
print("DATASET INFO:")
print("=" * 70)
df_dirty.info()

print("\n" + "=" * 70)
print("SAMPLE VALUES FROM EACH COLUMN:")
print("=" * 70)
for col in df_dirty.columns:
    print(f"\n{col}:")
    print(f"  Sample values: {df_dirty[col].head(5).tolist()}")
    print(f"  Unique count: {df_dirty[col].nunique()}")
    print(f"  Data type: {df_dirty[col].dtype}")

## 2. Identify and Document Data Quality Issues

Before cleaning, let's systematically identify all problems in our dataset.
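
One thing to watch for during this assessment: `isnull()` only counts real missing values, not placeholder strings. The sketch below, on an invented toy column, shows the difference. The list of placeholder strings is an assumption and would need adjusting for other data.

```python
import numpy as np
import pandas as pd

# Toy column, invented for illustration
s = pd.Series(["4.5", "NaN", np.nan, "null", "", "3.8"])

# isnull() only sees the one real NaN
true_missing = int(s.isnull().sum())

# Also treat common placeholder strings as missing (this list is an assumption)
placeholders = ["NaN", "null", ""]
all_missing = int((s.isnull() | s.isin(placeholders)).sum())

print(f"isnull() count: {true_missing}")
print(f"including placeholders: {all_missing}")
```

A report built only on `isnull()` can therefore understate how incomplete a text column really is.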

In [None]:
# Comprehensive data quality assessment
print("=" * 70)
print("DATA QUALITY ASSESSMENT REPORT")
print("=" * 70)

# 1. Missing Values
print("\n1. MISSING VALUES ANALYSIS:")
print("-" * 70)
missing_stats = pd.DataFrame({
    'Column': df_dirty.columns,
    'Missing_Count': df_dirty.isnull().sum(),
    'Missing_Percentage': (df_dirty.isnull().sum() / len(df_dirty) * 100).round(2),
    'Data_Type': df_dirty.dtypes
})
display(missing_stats[missing_stats['Missing_Count'] > 0])

# 2. Duplicates
print("\n2. DUPLICATE ROWS:")
print("-" * 70)
n_duplicates = df_dirty.duplicated().sum()
print(f"Number of duplicate rows: {n_duplicates}")
print(f"Percentage: {(n_duplicates/len(df_dirty)*100):.2f}%")

# 3. Data Type Issues
print("\n3. DATA TYPE ISSUES:")
print("-" * 70)
print("Issues found:")
print("  • Rating: Should be float, currently object")
print("  • Reviews: Should be int, currently object")
print("  • Size: Contains 'M', 'k', and text")
print("  • Installs: Contains '+' and ',' characters")
print("  • Price: Contains '$' symbol")
print("  • Type: Inconsistent case (Free, free, FREE)")

# 4. Invalid/Outlier Values
print("\n4. INVALID VALUES DETECTED:")
print("-" * 70)
# Check Rating for values that are non-numeric or outside the 1-5 range
rating_present = df_dirty['Rating'].notna()
rating_numeric = pd.to_numeric(df_dirty['Rating'], errors='coerce')  # Convert once and reuse
invalid_ratings = df_dirty[rating_present & (rating_numeric.isna() | ~rating_numeric.between(1, 5))]
print(f"  • Invalid ratings: {len(invalid_ratings)} entries")

# 5. Inconsistent Categories
print("\n5. INCONSISTENT CATEGORIES:")
print("-" * 70)
print(f"  • Type values: {df_dirty['Type'].unique()}")
print("    Issue: Inconsistent case (Free, free, FREE, Paid, paid)")

# Visualization of data quality
fig, axes = plt.subplots(2, 2, figsize=(16, 10))

# Missing values
missing_stats[missing_stats['Missing_Count'] > 0].plot(
    x='Column', y='Missing_Percentage', kind='bar', ax=axes[0, 0], color='red', alpha=0.7
)
axes[0, 0].set_title('Missing Values by Column (%)', fontweight='bold')
axes[0, 0].set_xlabel('')
axes[0, 0].set_ylabel('Missing %')
axes[0, 0].grid(axis='y', alpha=0.3)

# Duplicates
axes[0, 1].pie([len(df_dirty) - n_duplicates, n_duplicates], 
               labels=['Unique', 'Duplicates'], 
               autopct='%1.1f%%', colors=['lightgreen', 'lightcoral'])
axes[0, 1].set_title('Duplicate Rows Distribution', fontweight='bold')

# Data types
df_dirty.dtypes.value_counts().plot(kind='bar', ax=axes[1, 0], color='skyblue', alpha=0.7)
axes[1, 0].set_title('Data Types Distribution', fontweight='bold')
axes[1, 0].set_xlabel('Data Type')
axes[1, 0].set_ylabel('Count')
axes[1, 0].grid(axis='y', alpha=0.3)

# Completeness score
completeness = ((df_dirty.size - df_dirty.isnull().sum().sum()) / df_dirty.size) * 100
axes[1, 1].barh(['Dataset Completeness'], [completeness], color='green', alpha=0.7)
axes[1, 1].set_xlim(0, 100)
axes[1, 1].set_title('Overall Data Completeness', fontweight='bold')
axes[1, 1].set_xlabel('Completeness %')
axes[1, 1].text(completeness/2, 0, f'{completeness:.1f}%', 
                ha='center', va='center', fontsize=20, fontweight='bold', color='white')

plt.tight_layout()
plt.show()

print("\n" + "=" * 70)
print("SUMMARY OF ISSUES TO FIX:")
print("=" * 70)
print("✗ Missing values in multiple columns")
print("✗ Duplicate rows present")
print("✗ Incorrect data types (objects instead of numeric)")
print("✗ Special characters in numeric columns ($, +, ,, M, k)")
print("✗ Inconsistent text case in categorical variables")
print("✗ Invalid values (ratings > 5, 'NaN' strings)")
print("=" * 70)

## 3. Step-by-Step Data Cleaning Process

Now we'll systematically clean each issue we identified.
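
As a small self-contained illustration of the kind of unit conversion the Size column needs, here is a vectorized sketch on invented values. It is an alternative to a row-by-row function, not the approach this notebook uses, and it assumes 1 MB = 1024 kB and that sizes look like a number followed by `M` or `k`.

```python
import numpy as np
import pandas as pd

# Toy values, invented for illustration
sizes = pd.Series(["19M", "512k", "Varies with device", np.nan])

# Split each entry into a number and a unit suffix; non-matching rows become NaN
parts = sizes.str.extract(r"^(?P<value>\d+(?:\.\d+)?)(?P<unit>[Mk])$")

# Factor that converts each unit to megabytes (assumes 1 MB = 1024 kB)
to_mb = parts["unit"].map({"M": 1.0, "k": 1 / 1024})

size_mb = pd.to_numeric(parts["value"]) * to_mb
print(size_mb.tolist())
```

Entries that do not match the pattern, such as "Varies with device", come out as NaN and can then be imputed or dropped in a separate, explicit step.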

In [None]:
# Create a copy for cleaning
df_clean = df_dirty.copy()

print("=" * 70)
print("DATA CLEANING IN PROGRESS...")
print("=" * 70)

# Step 1: Remove Duplicates
print("\nStep 1: Removing Duplicates")
print("-" * 70)
before_dup = len(df_clean)
df_clean = df_clean.drop_duplicates()
after_dup = len(df_clean)
print(f"✓ Removed {before_dup - after_dup} duplicate rows")
print(f"  Before: {before_dup:,} rows | After: {after_dup:,} rows")

# Step 2: Clean Rating column
print("\nStep 2: Cleaning Rating Column")
print("-" * 70)
df_clean['Rating'] = pd.to_numeric(df_clean['Rating'], errors='coerce')
# Note: clipping maps an impossible value such as 19 to 5.0 rather than marking it missing
df_clean['Rating'] = df_clean['Rating'].clip(1, 5)  # Valid ratings are 1-5
invalid_before = df_clean['Rating'].isna().sum()
rating_median = df_clean['Rating'].median()  # Compute once, before filling
df_clean['Rating'] = df_clean['Rating'].fillna(rating_median)
print("✓ Converted to numeric")
print("✓ Clipped values to 1-5 range")
print(f"✓ Filled {invalid_before} missing values with median: {rating_median:.2f}")

# Step 3: Clean Reviews column
print("\nStep 3: Cleaning Reviews Column")
print("-" * 70)
df_clean['Reviews'] = df_clean['Reviews'].replace(['NaN', 'null', ''], np.nan)
df_clean['Reviews'] = pd.to_numeric(df_clean['Reviews'], errors='coerce')
reviews_missing = df_clean['Reviews'].isna().sum()
df_clean['Reviews'] = df_clean['Reviews'].fillna(0).astype(int)
print(f"✓ Converted to numeric")
print(f"✓ Filled {reviews_missing} missing values with 0")

# Step 4: Clean Size column
print("\nStep 4: Cleaning Size Column")
print("-" * 70)
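def clean_size(size):
    """Convert a size string such as '19M' or '512k' to megabytes (NaN if unknown)."""
    if pd.isna(size) or 'Varies' in str(size):
        return np.nan
    size_str = str(size).strip()
    if 'M' in size_str:
        return float(size_str.replace('M', ''))
    elif 'k' in size_str:
        return float(size_str.replace('k', '')) / 1024  # kB to MB
    return np.nan  # Placeholder strings like 'NaN' end up here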

df_clean['Size_MB'] = df_clean['Size'].apply(clean_size)
size_missing = df_clean['Size_MB'].isna().sum()
df_clean['Size_MB'] = df_clean['Size_MB'].fillna(df_clean['Size_MB'].median())
print(f"✓ Converted all sizes to MB")
print(f"✓ Handled 'Varies with device' entries")
print(f"✓ Filled {size_missing} missing values with median")

# Step 5: Clean Installs column
print("\nStep 5: Cleaning Installs Column")
print("-" * 70)
def clean_installs(install):
    if pd.isna(install) or 'Free' in str(install):
        return np.nan
    # Remove + and , characters
    return int(str(install).replace('+', '').replace(',', ''))

df_clean['Installs_Numeric'] = df_clean['Installs'].apply(clean_installs)
installs_missing = df_clean['Installs_Numeric'].isna().sum()
df_clean['Installs_Numeric'] = df_clean['Installs_Numeric'].fillna(0).astype(int)
print(f"✓ Converted to numeric")
print(f"✓ Removed special characters (+, ,)")
print(f"✓ Filled {installs_missing} missing values with 0")

# Step 6: Clean Price column
print("\nStep 6: Cleaning Price Column")
print("-" * 70)
# regex=False removes the literal '$'; as a regex, '$' would mean "end of string"
df_clean['Price_Numeric'] = df_clean['Price'].str.replace('$', '', regex=False).astype(float)
print(f"✓ Removed $ symbol")
print(f"✓ Converted to numeric")

# Step 7: Standardize Type column
print("\nStep 7: Standardizing Type Column")
print("-" * 70)
df_clean['Type'] = df_clean['Type'].str.capitalize()
print(f"✓ Standardized case: {df_clean['Type'].unique()}")

# Final cleaned dataset
df_clean = df_clean[['App', 'Category', 'Rating', 'Reviews', 'Size_MB', 
                     'Installs_Numeric', 'Type', 'Price_Numeric', 'Content Rating']]

df_clean.columns = ['App', 'Category', 'Rating', 'Reviews', 'Size_MB', 
                    'Installs', 'Type', 'Price', 'Content_Rating']

print("\n" + "=" * 70)
print("CLEANING COMPLETE!")
print("=" * 70)
print(f"Final dataset shape: {df_clean.shape}")
print(f"\nCleaned columns:")
for col in df_clean.columns:
    print(f"  • {col:20s}: {df_clean[col].dtype}")

# Compare before and after
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Before
axes[0].bar(['Complete', 'With Issues'], 
            [100-((df_dirty.isnull().sum().sum()/df_dirty.size)*100),
             (df_dirty.isnull().sum().sum()/df_dirty.size)*100],
            color=['lightcoral', 'red'], alpha=0.7)
axes[0].set_title('Data Quality: BEFORE Cleaning', fontweight='bold', fontsize=14)
axes[0].set_ylabel('Percentage')
axes[0].set_ylim(0, 100)
axes[0].grid(axis='y', alpha=0.3)

# After
axes[1].bar(['Complete', 'With Issues'], 
            [100-((df_clean.isnull().sum().sum()/df_clean.size)*100),
             (df_clean.isnull().sum().sum()/df_clean.size)*100],
            color=['lightgreen', 'orange'], alpha=0.7)
axes[1].set_title('Data Quality: AFTER Cleaning', fontweight='bold', fontsize=14)
axes[1].set_ylabel('Percentage')
axes[1].set_ylim(0, 100)
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("\n✓ Dataset is now clean and ready for analysis!")
print("=" * 70)

## Summary: Data Cleaning Best Practices

### What We Learned

**1. Data Cleaning Steps:**
1. **Inspect Data**: Understand structure and identify issues
2. **Document Problems**: List all quality issues
3. **Create Cleaning Plan**: Prioritize fixes
4. **Clean Systematically**: Handle one issue at a time
5. **Validate Results**: Verify cleaning worked
6. **Document Changes**: Keep track of transformations
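
The "Validate Results" step can be made concrete with a few automated checks. The helper below is a hypothetical sketch (the name `validate_clean` and the specific checks are assumptions, chosen to match the cleaned columns in this notebook), not an exhaustive test suite.

```python
import pandas as pd

def validate_clean(df: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if df.duplicated().any():
        problems.append("duplicate rows remain")
    if df.isnull().any().any():
        problems.append("missing values remain")
    if not df["Rating"].between(1, 5).all():
        problems.append("Rating outside 1-5")
    if not pd.api.types.is_numeric_dtype(df["Price"]):
        problems.append("Price is not numeric")
    return problems

# Tiny hand-made frame standing in for the cleaned dataset
sample = pd.DataFrame({"Rating": [4.5, 3.8], "Price": [0.0, 2.99]})
print(validate_clean(sample))
```

Running a helper like this after each cleaning pass gives a quick signal when a later change reintroduces a problem.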

**2. Common Cleaning Operations:**

| Problem | Solution | Example |
|---------|----------|---------|
| **Missing Values** | Fill with mean/median/mode or drop | `fillna(median)` |
| **Duplicates** | Remove duplicate rows | `drop_duplicates()` |
| **Wrong Data Types** | Convert to correct type | `astype(float)` |
| **Special Characters** | Remove/replace characters | `str.replace('$', '', regex=False)` |
| **Inconsistent Case** | Standardize case | `str.lower()` or `.capitalize()` |
| **Outliers** | Clip, cap, or remove | `.clip(min, max)` |
| **Invalid Values** | Replace or remove | `pd.to_numeric(errors='coerce')` |
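
The last two rows of the table interact: once invalid strings are coerced to NaN, out-of-range numbers still need a decision. The sketch below, on invented ratings, contrasts clipping with masking. Which option is appropriate depends on the data, and neither is presented here as the single correct choice.

```python
import pandas as pd

# Toy ratings, invented for illustration
raw = pd.Series(["4.5", "null", "19", "3.8", ""])

# Unparseable strings become NaN instead of raising an error
ratings = pd.to_numeric(raw, errors="coerce")

# Option A: clip forces 19 down to 5.0 (keeps the row, but invents a top score)
clipped = ratings.clip(1, 5)

# Option B: treat out-of-range values as missing and decide later how to fill them
masked = ratings.where(ratings.between(1, 5))

print(clipped.tolist())
print(masked.tolist())
```

Clipping suits values that are plausible but extreme. Masking is often safer when the value is impossible, as a rating of 19 on a 1-5 scale is.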

**3. Key Principles:**
- **Never Modify Original Data**: Always work on a copy
- **Document Everything**: Record all transformations
- **Validate Changes**: Check results after each step
- **Consider Domain Knowledge**: Understand what values make sense
- **Be Consistent**: Apply same rules across dataset
- **Handle Missing Data Thoughtfully**: Don't blindly drop or fill

**4. When to Remove vs. Fix:**

**Remove When:**
- Rows are completely corrupted
- Duplicates exist
- More than 50% of a row's data is missing
- Data is clearly erroneous and can't be corrected

**Fix When:**
- Issue is systematic (e.g., all prices have $)
- Valid data with formatting issues
- Missing values can be reasonably imputed
- Outliers are legitimate but need handling
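
The remove-versus-fix rules above can be sketched on a toy frame. The data is invented, and the 50% cutoff is the rule of thumb from the list rather than a universal threshold.

```python
import math

import numpy as np
import pandas as pd

# Toy data, invented for illustration
df = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Rating": [4.5, np.nan, np.nan],
    "Reviews": [100, np.nan, 20],
    "Price": [0.0, np.nan, 1.99],
})

# Remove: keep only rows with at least half of their fields present,
# so rows missing more than 50% of their values are dropped
min_present = math.ceil(df.shape[1] / 2)
kept = df.dropna(thresh=min_present).copy()

# Fix: impute the remaining gap instead of dropping the row
kept["Rating"] = kept["Rating"].fillna(kept["Rating"].median())

print(kept)
```

Here row "B" is removed because three of its four fields are missing, while row "C" is kept and its single missing rating is imputed.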

**5. Tools and Techniques:**

```python
# Remove duplicates
df.drop_duplicates()

# Handle missing values
df.fillna(value)  # Fill with value
df.dropna()  # Drop rows with missing values
df.interpolate()  # Interpolate missing values

# Fix data types
pd.to_numeric(df['col'], errors='coerce')
df['col'].astype(float)

# Clean strings
df['col'].str.replace('$', '', regex=False)  # Literal match, not a regex
df['col'].str.strip()  # Remove whitespace
df['col'].str.lower()  # Lowercase

# Handle outliers
df['col'].clip(lower, upper)  # Cap values
df[df['col'].between(low, high)]  # Filter
```

### Impact of Data Cleaning

**Before Cleaning:**
- Missing values: Multiple columns
- Duplicates: Present
- Invalid data types: Many
- Special characters: Throughout
- Inconsistent formats: Multiple issues

**After Cleaning:**
- ✓ No missing values (handled appropriately)
- ✓ No duplicates
- ✓ Correct data types for all columns
- ✓ Numeric columns are truly numeric
- ✓ Consistent formatting
- ✓ Ready for analysis and modeling!

**Remember:** Data cleaning is iterative - you may need to revisit and refine your cleaning process as you learn more about your data through analysis!