# EDA on Cleaned Google Play Store Dataset

## Introduction

Now that we have a clean dataset (from the previous notebook), we can perform meaningful **Exploratory Data Analysis (EDA)**. This notebook demonstrates how a cleaned dataset enables better insights, visualizations, and decision-making.

### The Difference Clean Data Makes

**Before Cleaning:**
- Missing values prevent accurate statistics
- Wrong data types block numerical analysis  
- Special characters cause calculation errors
- Inconsistent formats create misleading categories
- Duplicates skew distributions

**After Cleaning:**
- Accurate statistical summaries
- Proper data type operations
- Clean numerical calculations
- Consistent categorical analysis
- True distributions revealed

### What You'll Learn

1. **Univariate Analysis on Clean Data**
2. **Bivariate and Multivariate Relationships**
3. **Category Analysis and Comparisons**
4. **Price and Revenue Insights**
5. **User Engagement Metrics**
6. **App Characteristics and Patterns**
7. **Business Insights and Recommendations**
8. **Advanced Visualizations**

### Business Questions We'll Answer

1. Which categories are most popular?
2. How does pricing affect installs and ratings?
3. What app characteristics correlate with high ratings?
4. Are paid apps rated higher than free apps?
5. What's the relationship between size and installs?
6. Which content rating categories perform best?
7. What insights can guide app developers?

Let's dive into the analysis!

In [None]:
# Import libraries and load cleaned data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

sns.set_style('whitegrid')
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = (14, 6)

print("=" * 70)
print("EDA ON CLEANED GOOGLE PLAY STORE DATASET")
print("=" * 70)

# Load the cleaned dataset (simulating from previous notebook)
np.random.seed(42)
n_samples = 300

categories = ['GAME', 'FAMILY', 'TOOLS', 'PRODUCTIVITY', 'MEDICAL', 'FINANCE']
# Draw Type once so that Price stays consistent with it (Free -> 0, Paid -> positive)
app_types = np.random.choice(['Free', 'Paid'], n_samples, p=[0.8, 0.2])
df_clean = pd.DataFrame({
    'App': [f'App_{i}' for i in range(n_samples)],
    'Category': np.random.choice(categories, n_samples),
    'Rating': np.random.uniform(3.0, 5.0, n_samples).round(1),
    'Reviews': np.random.randint(100, 100000, n_samples),
    'Size_MB': np.random.uniform(5, 100, n_samples).round(1),
    'Installs': np.random.choice([1000, 10000, 100000, 1000000, 10000000], n_samples),
    'Type': app_types,
    'Price': [0.0 if t == 'Free' else round(np.random.uniform(0.99, 19.99), 2)
              for t in app_types],
    'Content_Rating': np.random.choice(['Everyone', 'Teen', 'Mature 17+', 'Everyone 10+'], n_samples)
})

print("\n✓ Cleaned dataset loaded successfully!")
print(f"Shape: {df_clean.shape}")
print(f"All columns have correct data types")
print(f"No missing values")
print(f"Ready for analysis!")
print("=" * 70)
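
The cell above prints that the data types are correct and that nothing is missing, but it does not actually check. A minimal sketch of such checks follows; the helper name `check_clean` and the tiny `sample` frame are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd


def check_clean(df: pd.DataFrame, numeric_cols: list[str]) -> dict:
    """Return basic data-quality facts instead of assuming them."""
    return {
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "non_numeric_cols": [
            col for col in numeric_cols
            if not pd.api.types.is_numeric_dtype(df[col])
        ],
    }


# Hypothetical frame with one deliberate problem of each kind
sample = pd.DataFrame({
    "App": ["A", "B", "B", "C"],
    "Rating": [4.5, 3.9, 3.9, np.nan],                  # one missing value
    "Installs": ["1000", "10000", "10000", "100000"],   # strings, not numbers
})

report = check_clean(sample, numeric_cols=["Rating", "Installs"])
print(report)
```

On a genuinely clean frame such as `df_clean`, all three entries would be expected to come back as zero or empty.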

## 1. Category Analysis

Let's explore which app categories dominate the Play Store.

In [None]:
# Category distribution analysis
print("=" * 70)
print("CATEGORY ANALYSIS")
print("=" * 70)

category_stats = df_clean['Category'].value_counts()
category_pct = (category_stats / len(df_clean) * 100).round(2)

category_summary = pd.DataFrame({
    'Count': category_stats,
    'Percentage': category_pct,
    'Avg_Rating': df_clean.groupby('Category')['Rating'].mean().round(2),
    'Avg_Installs': df_clean.groupby('Category')['Installs'].mean().round(0).astype(int)
})

print("\nCategory Distribution:")
display(category_summary)

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Category distribution
category_stats.plot(kind='barh', ax=axes[0, 0], color='steelblue', alpha=0.7)
axes[0, 0].set_title('Number of Apps per Category', fontsize=14, fontweight='bold')
axes[0, 0].set_xlabel('Number of Apps')
axes[0, 0].grid(axis='x', alpha=0.3)

# Category pie chart
axes[0, 1].pie(category_stats, labels=category_stats.index, autopct='%1.1f%%',
               startangle=90)
axes[0, 1].set_title('Category Distribution (%)', fontsize=14, fontweight='bold')

# Average rating by category
df_clean.groupby('Category')['Rating'].mean().plot(kind='bar', ax=axes[1, 0], 
                                                     color='coral', alpha=0.7)
axes[1, 0].set_title('Average Rating by Category', fontsize=14, fontweight='bold')
axes[1, 0].set_ylabel('Average Rating')
axes[1, 0].set_ylim(0, 5)
axes[1, 0].grid(axis='y', alpha=0.3)
axes[1, 0].axhline(y=df_clean['Rating'].mean(), color='red', linestyle='--', 
                    label=f'Overall Avg: {df_clean["Rating"].mean():.2f}')
axes[1, 0].legend()
axes[1, 0].tick_params(axis='x', rotation=45)

# Average installs by category
df_clean.groupby('Category')['Installs'].mean().plot(kind='bar', ax=axes[1, 1], 
                                                       color='lightgreen', alpha=0.7)
axes[1, 1].set_title('Average Installs by Category', fontsize=14, fontweight='bold')
axes[1, 1].set_ylabel('Average Installs')
axes[1, 1].grid(axis='y', alpha=0.3)
axes[1, 1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

print("\n" + "=" * 70)
print("KEY INSIGHTS:")
print("=" * 70)
most_common = category_stats.index[0]
highest_rated = df_clean.groupby('Category')['Rating'].mean().idxmax()
most_installed = df_clean.groupby('Category')['Installs'].mean().idxmax()

print(f"• Most common category: {most_common} ({category_stats[most_common]} apps)")
print(f"• Highest rated category: {highest_rated}")
print(f"• Most installed category: {most_installed}")
print("=" * 70)
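
Install counts span several orders of magnitude, so a per-category mean can be dominated by a handful of very large apps. The sketch below compares the mean with the median and a geometric mean on a hypothetical mini-sample; the values are invented for illustration, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample: install counts are powers of ten, as in the notebook
apps = pd.DataFrame({
    "Category": ["GAME", "GAME", "GAME", "TOOLS", "TOOLS", "TOOLS"],
    "Installs": [1_000, 1_000, 10_000_000, 100_000, 100_000, 100_000],
})

summary = apps.groupby("Category")["Installs"].agg(
    mean_installs="mean",
    median_installs="median",
    # Geometric mean: average on the log10 scale, then transform back
    geo_mean_installs=lambda s: 10 ** np.log10(s).mean(),
)
print(summary.round(0))
```

Here a single 10-million-install app pulls the GAME mean far above its median, which is why the median or geometric mean may rank categories more fairly.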

## 2. Free vs Paid Apps Analysis

Understanding the distribution and performance of free versus paid apps.

In [None]:
# Free vs Paid analysis
print("=" * 70)
print("FREE VS PAID APPS ANALYSIS")
print("=" * 70)

type_stats = df_clean.groupby('Type').agg({
    'App': 'count',
    'Rating': 'mean',
    'Reviews': 'mean',
    'Installs': 'mean',
    'Price': 'mean'
}).round(2)

type_stats.columns = ['Count', 'Avg_Rating', 'Avg_Reviews', 'Avg_Installs', 'Avg_Price']
print("\nFree vs Paid Comparison:")
display(type_stats)

# Visualizations
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

# Type distribution
type_counts = df_clean['Type'].value_counts()
axes[0, 0].pie(type_counts, labels=type_counts.index, autopct='%1.1f%%',
               colors=['lightgreen', 'lightcoral'], startangle=90)
axes[0, 0].set_title('Free vs Paid Distribution', fontsize=13, fontweight='bold')

# Rating comparison
sns.boxplot(data=df_clean, x='Type', y='Rating', ax=axes[0, 1], palette='Set2')
axes[0, 1].set_title('Rating Distribution: Free vs Paid', fontsize=13, fontweight='bold')
axes[0, 1].set_ylabel('Rating')
axes[0, 1].grid(axis='y', alpha=0.3)

# Installs comparison
sns.boxplot(data=df_clean, x='Type', y='Installs', ax=axes[0, 2], palette='Set3')
axes[0, 2].set_title('Installs: Free vs Paid', fontsize=13, fontweight='bold')
axes[0, 2].set_ylabel('Installs')
axes[0, 2].set_yscale('log')
axes[0, 2].grid(axis='y', alpha=0.3)

# Reviews comparison
df_clean.groupby('Type')['Reviews'].mean().plot(kind='bar', ax=axes[1, 0], 
                                                  color=['green', 'red'], alpha=0.7)
axes[1, 0].set_title('Average Reviews: Free vs Paid', fontsize=13, fontweight='bold')
axes[1, 0].set_ylabel('Average Reviews')
axes[1, 0].set_xlabel('App Type')
axes[1, 0].grid(axis='y', alpha=0.3)
axes[1, 0].tick_params(axis='x', rotation=0)

# Price distribution for paid apps
paid_apps = df_clean[df_clean['Type'] == 'Paid']
if len(paid_apps) > 0:
    sns.histplot(paid_apps['Price'], bins=20, ax=axes[1, 1], color='coral', kde=True)
    axes[1, 1].set_title('Price Distribution of Paid Apps', fontsize=13, fontweight='bold')
    axes[1, 1].set_xlabel('Price ($)')
    axes[1, 1].set_ylabel('Frequency')
    axes[1, 1].grid(axis='y', alpha=0.3)

# Size comparison
sns.boxplot(data=df_clean, x='Type', y='Size_MB', ax=axes[1, 2], palette='pastel')
axes[1, 2].set_title('App Size: Free vs Paid', fontsize=13, fontweight='bold')
axes[1, 2].set_ylabel('Size (MB)')
axes[1, 2].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("\n" + "=" * 70)
print("KEY INSIGHTS:")
print("=" * 70)
free_pct = (df_clean['Type'] == 'Free').mean() * 100
print(f"• {free_pct:.1f}% of apps are free")
print(f"• Free apps average rating: {df_clean[df_clean['Type']=='Free']['Rating'].mean():.2f}")
print(f"• Paid apps average rating: {df_clean[df_clean['Type']=='Paid']['Rating'].mean():.2f}")
if len(paid_apps) > 0:
    print(f"• Average price of paid apps: ${paid_apps['Price'].mean():.2f}")
    print(f"• Price range: ${paid_apps['Price'].min():.2f} - ${paid_apps['Price'].max():.2f}")
print("=" * 70)
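
A small gap between two average ratings can easily arise by chance. One way to gauge this is a permutation test, sketched below with NumPy only. The function `permutation_test` and the simulated ratings are assumptions made for illustration; they are not part of the original analysis.

```python
import numpy as np


def permutation_test(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_permutations):
        # Shuffle group labels by permuting the pooled values
        shuffled = rng.permutation(pooled)
        diff = shuffled[:a.size].mean() - shuffled[a.size:].mean()
        if abs(diff) >= abs(observed):
            count += 1
    # Add-one correction keeps the estimated p-value strictly above zero
    return observed, (count + 1) / (n_permutations + 1)


# Hypothetical ratings drawn from the same distribution for both groups
rng = np.random.default_rng(42)
free_ratings = rng.uniform(3.0, 5.0, 240)
paid_ratings = rng.uniform(3.0, 5.0, 60)

diff, p_value = permutation_test(free_ratings, paid_ratings)
print(f"Observed difference: {diff:.3f}, p-value: {p_value:.3f}")
```

With the notebook's data, the same call could be made on `df_clean.loc[df_clean['Type'] == 'Free', 'Rating']` and its paid counterpart. A large p-value would suggest the free-versus-paid gap is within what random labelling would produce.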

## 3. Correlation Analysis and Relationships

Let's explore how different features relate to each other and to app success.

In [None]:
# Correlation analysis
numeric_cols = ['Rating', 'Reviews', 'Size_MB', 'Installs', 'Price']
correlation_matrix = df_clean[numeric_cols].corr()

print("=" * 70)
print("CORRELATION ANALYSIS")
print("=" * 70)
print("\nCorrelation Matrix:")
display(correlation_matrix.round(3))

# Visualizations
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Heatmap
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=1, ax=axes[0], cbar_kws={'label': 'Correlation'})
axes[0].set_title('Feature Correlation Heatmap', fontsize=14, fontweight='bold')

# Correlation with Rating
rating_corr = correlation_matrix['Rating'].drop('Rating').sort_values(ascending=False)
colors_corr = ['green' if x > 0 else 'red' for x in rating_corr]
rating_corr.plot(kind='barh', ax=axes[1], color=colors_corr, alpha=0.7)
axes[1].set_title('Correlation with App Rating', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Correlation Coefficient')
axes[1].axvline(x=0, color='black', linestyle='--', linewidth=1)
axes[1].grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

# Scatter plots for key relationships
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Reviews vs Rating
axes[0, 0].scatter(df_clean['Reviews'], df_clean['Rating'], alpha=0.5, color='blue')
axes[0, 0].set_xlabel('Number of Reviews')
axes[0, 0].set_ylabel('Rating')
axes[0, 0].set_title('Reviews vs Rating', fontweight='bold')
axes[0, 0].grid(alpha=0.3)

# Size vs Rating
axes[0, 1].scatter(df_clean['Size_MB'], df_clean['Rating'], alpha=0.5, color='green')
axes[0, 1].set_xlabel('App Size (MB)')
axes[0, 1].set_ylabel('Rating')
axes[0, 1].set_title('Size vs Rating', fontweight='bold')
axes[0, 1].grid(alpha=0.3)

# Installs vs Rating
axes[1, 0].scatter(df_clean['Installs'], df_clean['Rating'], alpha=0.5, color='orange')
axes[1, 0].set_xlabel('Installs')
axes[1, 0].set_ylabel('Rating')
axes[1, 0].set_xscale('log')
axes[1, 0].set_title('Installs vs Rating', fontweight='bold')
axes[1, 0].grid(alpha=0.3)

# Price vs Rating (for paid apps)
paid = df_clean[df_clean['Price'] > 0]
if len(paid) > 0:
    axes[1, 1].scatter(paid['Price'], paid['Rating'], alpha=0.5, color='red')
    axes[1, 1].set_xlabel('Price ($)')
    axes[1, 1].set_ylabel('Rating')
    axes[1, 1].set_title('Price vs Rating (Paid Apps Only)', fontweight='bold')
    axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\n" + "=" * 70)
print("KEY FINDINGS:")
print("=" * 70)
for feature in rating_corr.index:
    corr_val = rating_corr[feature]
    if abs(corr_val) > 0.1:
        direction = "positive" if corr_val > 0 else "negative"
        strength = "strong" if abs(corr_val) > 0.5 else "moderate" if abs(corr_val) > 0.3 else "weak"
        print(f"• {feature}: {strength} {direction} correlation ({corr_val:.3f})")
print("=" * 70)
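
The matrix above uses Pearson correlation, which measures linear association and can understate a monotonic relationship when one variable, such as `Installs`, grows by powers of ten. A rank-based (Spearman) correlation is a common complement. The sketch below computes it as the Pearson correlation of the ranks, on a hypothetical five-row sample.

```python
import pandas as pd

# Hypothetical sample: installs grow by powers of ten while rating rises steadily
apps = pd.DataFrame({
    "Rating":   [3.0, 3.5, 4.0, 4.5, 5.0],
    "Installs": [1_000, 10_000, 100_000, 1_000_000, 10_000_000],
})

# Pearson on the raw values measures linear association only
pearson = apps["Rating"].corr(apps["Installs"])

# Spearman correlation is the Pearson correlation of the ranks
spearman = apps["Rating"].rank().corr(apps["Installs"].rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Because the relationship in this toy sample is perfectly monotonic but not linear, the rank-based value is higher than the Pearson value. Applying a log transform to `Installs` before computing Pearson is another option.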

## 4. Business Insights and Recommendations

Now let's answer the key business questions using our analysis:

In [None]:
print("=" * 80)
print("BUSINESS INSIGHTS & RECOMMENDATIONS")
print("=" * 80)

print("\n1. Which categories are most popular on the Play Store?")
print("-" * 80)
top_categories = df_clean['Category'].value_counts().head(5)
for i, (cat, count) in enumerate(top_categories.items(), 1):
    percentage = (count / len(df_clean)) * 100
    print(f"   {i}. {cat}: {count:,} apps ({percentage:.1f}%)")
print(f"\n   💡 Recommendation: These categories have high competition. New apps should")
print(f"      offer unique features or target niche sub-segments.")

print("\n\n2. Is there a relationship between app size and rating?")
print("-" * 80)
size_rating_corr = correlation_matrix.loc['Size_MB', 'Rating']
print(f"   Correlation: {size_rating_corr:.3f}")
if abs(size_rating_corr) < 0.1:
    print(f"   💡 Finding: App size shows little linear association with ratings, so")
    print(f"      storage footprint alone is unlikely to explain rating differences.")
else:
    direction = "positively" if size_rating_corr > 0 else "negatively"
    print(f"   💡 Finding: App size is {direction} associated with ratings.")

print("\n\n3. Do free apps perform better than paid apps?")
print("-" * 80)
free_avg = df_clean[df_clean['Type'] == 'Free']['Rating'].mean()
paid_avg = df_clean[df_clean['Type'] == 'Paid']['Rating'].mean()
print(f"   Average Rating - Free Apps: {free_avg:.2f}")
print(f"   Average Rating - Paid Apps: {paid_avg:.2f}")
if free_avg > paid_avg:
    print(f"   💡 Finding: Free apps have higher ratings on average ({free_avg:.2f} vs {paid_avg:.2f}).")
    print(f"      Users may be more forgiving when an app costs nothing.")
else:
    print(f"   💡 Finding: Paid apps have ratings at least as high ({paid_avg:.2f} vs {free_avg:.2f}).")
    print(f"      Paying users may be more invested, or paid apps may be more polished.")

print("\n\n4. Which categories have the highest-rated apps?")
print("-" * 80)
category_ratings = df_clean.groupby('Category')['Rating'].mean().sort_values(ascending=False).head(5)
for i, (cat, rating) in enumerate(category_ratings.items(), 1):
    print(f"   {i}. {cat}: {rating:.2f} stars")
print(f"\n   💡 Recommendation: These categories maintain high quality standards.")
print(f"      Focus on user experience and quality over quantity.")

print("\n\n5. What's the typical price range for paid apps?")
print("-" * 80)
paid_apps = df_clean[df_clean['Price'] > 0]
if len(paid_apps) > 0:
    price_stats = paid_apps['Price'].describe()
    print(f"   Median Price: ${price_stats['50%']:.2f}")
    print(f"   Average Price: ${price_stats['mean']:.2f}")
    print(f"   Price Range: ${price_stats['min']:.2f} - ${price_stats['max']:.2f}")
    print(f"\n   💡 Recommendation: About 75% of paid apps are priced at or below ${price_stats['75%']:.2f}.")
    print(f"      Price competitively unless offering premium unique features.")

print("\n\n6. How do installs correlate with ratings?")
print("-" * 80)
installs_rating_corr = correlation_matrix.loc['Installs', 'Rating']
print(f"   Correlation: {installs_rating_corr:.3f}")
if installs_rating_corr > 0.1:
    print(f"   💡 Finding: Higher-rated apps tend to have more installs. This is an")
    print(f"      association, not proof that quality drives downloads.")
elif installs_rating_corr < -0.1:
    print(f"   💡 Finding: Apps with more installs tend to have lower ratings, possibly")
    print(f"      because a wider audience leaves more critical reviews.")
else:
    print(f"   💡 Finding: Installs and ratings are largely independent. Marketing and")
    print(f"      discoverability may matter as much as quality.")

print("\n\n7. What factors most influence app success?")
print("-" * 80)
print("   Top factors based on analysis:")
success_factors = []
for feature in ['Reviews', 'Size_MB', 'Installs', 'Price']:
    corr = abs(correlation_matrix.loc[feature, 'Rating'])
    if corr > 0.05:
        success_factors.append((feature, corr))
success_factors.sort(key=lambda x: x[1], reverse=True)
for i, (factor, corr) in enumerate(success_factors[:5], 1):
    print(f"   {i}. {factor}: correlation of {corr:.3f}")

print("\n\n" + "=" * 80)
print("ACTIONABLE RECOMMENDATIONS FOR APP DEVELOPERS")
print("=" * 80)
print("✓ Focus on quality and user experience - ratings directly impact visibility")
print("✓ Engage users to leave reviews - more reviews correlate with better ratings")
print("✓ Consider free model with in-app purchases - free apps reach wider audience")
print("✓ Choose category carefully - competition varies significantly across categories")
print("✓ Price competitively - research category-specific pricing benchmarks")
print("✓ Optimize app size - keep it reasonable without compromising functionality")
print("✓ Target high-rating categories - maintain quality standards of top performers")
print("=" * 80)
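
The introduction also asks how pricing relates to installs and ratings, and which content rating categories perform best. Neither question is covered by the cell above. A possible approach is sketched here on a hypothetical mini-sample that reuses the notebook's column names. The price band edges and the choice of median installs as the performance measure are illustrative assumptions.

```python
import pandas as pd

# Hypothetical mini-sample using the notebook's column names
apps = pd.DataFrame({
    "Price":          [0.0, 0.0, 0.99, 2.99, 4.99, 9.99, 14.99, 0.0],
    "Rating":         [4.1, 3.8, 4.4, 4.0, 4.6, 3.9, 4.2, 4.5],
    "Installs":       [1_000_000, 100_000, 10_000, 10_000,
                       1_000, 1_000, 1_000, 10_000_000],
    "Content_Rating": ["Everyone", "Teen", "Everyone", "Teen",
                       "Mature 17+", "Everyone", "Teen", "Everyone"],
})

# Bucket prices into bands; the bin edges are illustrative assumptions
apps["Price_Band"] = pd.cut(
    apps["Price"],
    bins=[-0.01, 0, 2.99, 9.99, float("inf")],
    labels=["Free", "Up to $2.99", "$3.00-$9.99", "$10+"],
)

# Rating and installs per price band
by_price = apps.groupby("Price_Band", observed=True).agg(
    n_apps=("Rating", "size"),
    avg_rating=("Rating", "mean"),
    median_installs=("Installs", "median"),
)

# The same summary per content rating
by_content = apps.groupby("Content_Rating").agg(
    n_apps=("Rating", "size"),
    avg_rating=("Rating", "mean"),
    median_installs=("Installs", "median"),
)

print(by_price.round(2))
print(by_content.round(2))
```

Swapping `apps` for `df_clean` would produce the same two tables for the notebook's data. Groups with very few apps should be read with caution.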

## 5. Summary: The Power of Clean Data in EDA

### What We Learned

This analysis demonstrates the value of working with clean, well-prepared data:

**Benefits of Clean Data:**
- **Accurate Analysis**: No distortions from missing values, duplicates, or incorrect formats
- **Reliable Visualizations**: Charts accurately represent the true distribution and relationships
- **Meaningful Insights**: Statistical measures and correlations are trustworthy
- **Efficient Processing**: Clean data loads faster and requires less memory
- **Better Decision Making**: Confident recommendations based on quality data

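**Key EDA Findings** (patterns typical of the real Play Store data; the simulated sample used in this notebook is random, so its numbers will differ):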
1. **Category Distribution**: Apps are concentrated in specific categories (Games, Tools, Education)
2. **Free vs Paid**: Free apps dominate the market (~92%) and have competitive ratings
3. **Rating Patterns**: Most apps cluster around 4.0-4.5 stars
4. **Size Considerations**: App size has minimal correlation with ratings
5. **Success Factors**: Reviews and installs show the strongest relationships with ratings

**EDA Best Practices Demonstrated:**
1. ✓ Always start with data overview (`info()`, `describe()`)
2. ✓ Check for missing values before analysis
3. ✓ Verify data types are correct
4. ✓ Use multiple visualization types for comprehensive understanding
5. ✓ Look for correlations and relationships between features
6. ✓ Segment analysis by categories (free vs paid, category-wise)
7. ✓ Derive business insights from statistical findings
8. ✓ Document findings clearly with interpretations

### The EDA + Data Cleaning Workflow

```
Data Collection → Data Cleaning → EDA → Feature Engineering → Modeling
                        ↓
                  (File 113)          (File 114)
```

**Remember**: The quality of your insights is directly proportional to the quality of your data. Always invest time in proper data cleaning before diving into analysis!

### Next Steps for Real Projects

After completing EDA on clean data, you would typically:
1. **Feature Engineering**: Create new features based on insights (covered in File 112)
2. **Feature Selection**: Choose most relevant features for modeling
3. **Data Preprocessing**: Scale, normalize, or transform features as needed
4. **Model Building**: Train machine learning models
5. **Model Evaluation**: Assess model performance
6. **Deployment**: Deploy the model for real-world use

---

**Congratulations!** You now understand how to perform comprehensive Exploratory Data Analysis on cleaned datasets and extract actionable business insights. This skill is fundamental to any data science project.