# Tutorial 1: Data Exploration & Analysis

## Overview
This tutorial shows how to load, explore, and understand datasets using Databricks Free Edition. It covers fundamental data exploration techniques, including:
- Loading and exploring datasets
- Data cleaning and quality checks
- Statistical analysis and profiling

### Learning Objectives
- Load data from various sources
- Explore data structure and characteristics
- Perform basic statistical analysis
- Handle common data quality issues

**Datasets Used:**
- customer_data.csv
- products.csv
- sales_data.csv
- web_traffic.csv

## 1. Basic Data Exploration

### Loading Data into Databricks

**Key Concepts:**
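- `spark.read.csv()`: Reads CSV files into a Spark DataFrame
- `header=True`: Treats the first row as column names
- `inferSchema=True`: Infers column data types from the data, at the cost of an extra pass over the file
- `.toPandas()`: Converts a Spark DataFrame to a pandas DataFrame by collecting all rows onto the driver, so use it only on data that fits in memory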

In [0]:
# Import required libraries
import pandas as pd
import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.types import *

# Load customer data
# SYNTAX: spark.read.csv("path", header=True, inferSchema=True)
customer_df = spark.read.csv("/Volumes/workspace/sample/datasets/customer_data.csv", header=True, inferSchema=True)

# Display first few rows
# SYNTAX: display(df) renders a DataFrame as an interactive table
display(customer_df.limit(10))
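
To make the effect of `header=True` and `inferSchema=True` concrete, here is a minimal plain-Python sketch of the idea. It is not Spark's actual implementation: the sample rows, the `infer_type` helper, and the three type names are assumptions made for illustration, and the real `customer_data.csv` may look different.

```python
import csv
import io

# Hypothetical sample data; the real customer_data.csv columns may differ.
raw = "customer_id,age,annual_income\n1,34,52000.5\n2,,61000\n3,45,not available\n"

def infer_type(values):
    """Pick the narrowest type that fits every non-empty value (illustrative helper)."""
    non_empty = [v for v in values if v != ""]
    for name, caster in (("int", int), ("double", float)):
        try:
            for v in non_empty:
                caster(v)
            return name
        except ValueError:
            continue  # at least one value did not parse, try the next wider type
    return "string"

rows = list(csv.reader(io.StringIO(raw)))
header, data = rows[0], rows[1:]      # header=True: the first row holds the column names
column_values = list(zip(*data))      # transpose rows into columns
schema = {name: infer_type(vals) for name, vals in zip(header, column_values)}
print(schema)
```

In this sketch a single non-numeric entry ("not available") is enough to turn `annual_income` into a string column, while the empty `age` value is treated as missing and does not affect the inferred type. It is worth checking the `printSchema()` output for the same kind of surprise.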

### Understanding Your Data

**Key Functions:**
- `.printSchema()`: Shows column names and data types
- `.count()`: Returns number of rows
- `.columns`: Lists all column names

In [0]:
# Check schema and structure
print("Customer Data Schema:")
customer_df.printSchema()

print(f"\nTotal Rows: {customer_df.count()}")
print(f"Total Columns: {len(customer_df.columns)}")
print(f"\nColumn Names: {customer_df.columns}")

In [0]:
# Load other datasets
products_df = spark.read.csv("/Volumes/workspace/sample/datasets/products.csv", header=True, inferSchema=True)
sales_df = spark.read.csv("/Volumes/workspace/sample/datasets/sales_data.csv", header=True, inferSchema=True)


print("All datasets loaded successfully!")
print(f"Products: {products_df.count()} rows")
print(f"Sales: {sales_df.count()} rows")
print(f"Customers: {customer_df.count()} rows")

### Quick Data Profiling

**Key Functions:**
- `.describe()`: Count, mean, standard deviation, min, and max for each column
- `.summary()`: The same statistics plus the 25%, 50%, and 75% percentiles

In [0]:
# Statistical summary of customer data
display(customer_df.describe())

In [0]:
# More detailed summary with percentiles
display(customer_df.summary())
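
The rows that `.summary()` reports can be reproduced by hand on a small list, which helps when reading the output. The sketch below uses Python's `statistics` module on made-up ages. Spark computes its percentiles approximately and on distributed data, so values on a real dataset can differ slightly from an exact calculation like this one.

```python
import statistics

ages = [23, 35, 41, 29, 52, 38, 47, 31]   # made-up values for illustration

q1, q2, q3 = statistics.quantiles(ages, n=4)   # the three quartile cut points
summary = {
    "count": len(ages),
    "mean": statistics.mean(ages),
    "stddev": statistics.stdev(ages),      # sample standard deviation (divides by n - 1)
    "min": min(ages),
    "25%": q1,
    "50%": q2,
    "75%": q3,
    "max": max(ages),
}

for name, value in summary.items():
    print(f"{name:>6}: {value}")
```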

## 2. Data Cleaning

### Handling Missing Values

**Key Concepts:**
- Count NULL values per column with `F.sum(F.when(F.col(c).isNull(), 1).otherwise(0))`
- Drop nulls with `.na.drop()`
- Fill nulls with `.na.fill()`
- Replace values with `.na.replace()`

In [0]:
# Check for missing values in customer data
from pyspark.sql.functions import col, when

# Count nulls for each column
# SYNTAX: F.sum(F.when(condition, 1).otherwise(0))
missing_counts = customer_df.select([
    F.sum(when(col(c).isNull(), 1).otherwise(0)).alias(c) 
    for c in customer_df.columns
])

print("Missing Values by Column:")
display(missing_counts)

In [0]:
# Handle missing values - Example strategies

# Strategy 1: Drop rows with any null values
customer_clean = customer_df.na.drop()
print(f"Rows after dropping nulls: {customer_clean.count()}")

# Strategy 2: Fill missing values with defaults
# SYNTAX: .na.fill({"column_name": default_value})
customer_filled = customer_df.na.fill({
    "phone": "Unknown",
    "email_subscribed": False,
    "annual_income": 0
})

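# Strategy 3: Fill with a column statistic (here the mean; a median works the same way)
# Stored under its own name so the Strategy 2 result is not overwritten
avg_age = customer_df.select(F.avg("age")).first()[0]
customer_age_filled = customer_df.na.fill({"age": int(avg_age)})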

print("Data cleaning strategies applied!")
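
Strategy 3 mentions both mean and median, but the two can give quite different fill values when a column contains extreme entries. The plain-Python sketch below contrasts them on made-up ages, where `None` stands in for a null. In Spark, a median could be obtained with `approxQuantile("age", [0.5], ...)`, the same call used for quartiles later in this tutorial.

```python
import statistics

ages = [25, 31, None, 29, 95, None, 27]   # made-up values; None marks a missing age

known = [a for a in ages if a is not None]
mean_age = statistics.mean(known)         # pulled upward by the 95
median_age = statistics.median(known)     # robust to the extreme value

filled_mean = [a if a is not None else round(mean_age) for a in ages]
filled_median = [a if a is not None else median_age for a in ages]

print(f"mean fill:   {round(mean_age)} -> {filled_mean}")
print(f"median fill: {median_age} -> {filled_median}")
```

Here the single age of 95 moves the mean to about 41, while the median stays at 29, so the median is often the safer default for skewed columns.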

### Handling Duplicates

**Key Functions:**
- `.dropDuplicates()`: Removes duplicate rows
- `.dropDuplicates([columns])`: Removes duplicates based on specific columns; which of the duplicate rows is kept is not guaranteed

In [0]:
# Check for duplicate customer records
print(f"Total rows: {customer_df.count()}")
print(f"Unique customer_ids: {customer_df.select('customer_id').distinct().count()}")

# Remove duplicates based on customer_id
customer_unique = customer_df.dropDuplicates(['customer_id'])
print(f"Rows after removing duplicates: {customer_unique.count()}")
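
When it matters which duplicate survives, the usual approach is to pick a rule, such as "keep the most recent record". The sketch below shows that rule in plain Python. The `updated_at` field and the sample records are hypothetical and are not known columns of `customer_data.csv`.

```python
# Hypothetical records; "updated_at" is an assumed column used only for illustration.
records = [
    {"customer_id": 1, "email": "a@old.example", "updated_at": "2024-01-05"},
    {"customer_id": 2, "email": "b@example.com", "updated_at": "2024-02-11"},
    {"customer_id": 1, "email": "a@new.example", "updated_at": "2024-03-20"},
]

latest = {}
for rec in records:
    key = rec["customer_id"]
    # ISO-formatted dates (YYYY-MM-DD) compare correctly as plain strings
    if key not in latest or rec["updated_at"] > latest[key]["updated_at"]:
        latest[key] = rec

deduped = list(latest.values())
duplicates_removed = len(records) - len(deduped)
print(f"Removed {duplicates_removed} duplicate(s); kept {len(deduped)} record(s)")
```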

### Data Quality Checks

**Best Practices:**
- Check for invalid values (negative prices, ages, etc.)
- Validate email formats
- Check date ranges
- Identify outliers

In [0]:
# Quality checks for customer data

# 1. Check for invalid ages
invalid_age = customer_df.filter((col("age") < 0) | (col("age") > 120))
print(f"Records with invalid age: {invalid_age.count()}")

# 2. Check annual income range
income_stats = customer_df.select(
    F.min("annual_income").alias("min_income"),
    F.max("annual_income").alias("max_income"),
    F.avg("annual_income").alias("avg_income")
)
display(income_stats)

# 3. Check email format (rows with a null email are not counted as valid)
email_pattern = "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
valid_emails = customer_df.filter(col("email").rlike(email_pattern))
print(f"Valid email addresses: {valid_emails.count()} out of {customer_df.count()}")
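
To see what the email pattern accepts and rejects, it can be tried on a few strings with Python's `re` module. This is a sketch with made-up addresses; Spark's `rlike` uses Java regular expressions, but a pattern this simple should behave the same way in both engines.

```python
import re

# Same pattern as in the cell above, written as a raw string
EMAIL_PATTERN = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

samples = [
    "ana@example.com",
    "bob.smith@mail.co.uk",
    "no-at-sign.example.com",
    "trailing@dot.",
    "spaces in@example.com",
    None,                      # stands in for a null email
]

def is_valid_email(value):
    """Return True only for non-missing strings that match the pattern."""
    return value is not None and EMAIL_PATTERN.match(value) is not None

results = {s: is_valid_email(s) for s in samples}
for address, ok in results.items():
    print(f"{address!r:28} -> {ok}")
```

The pattern checks the shape of the address only. It does not confirm that the domain exists or that the mailbox can receive mail.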

### Handling Outliers

**Methods:**
- IQR (Interquartile Range) method
- Z-score method
- Visual inspection

In [0]:
# Detect outliers in annual income using IQR method

# Calculate quartiles (approximate: the last argument is the allowed relative error)
quantiles = customer_df.approxQuantile("annual_income", [0.25, 0.75], 0.01)
Q1, Q3 = quantiles[0], quantiles[1]
IQR = Q3 - Q1

# Define outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Q1: {Q1:,.0f}, Q3: {Q3:,.0f}, IQR: {IQR:,.0f}")
print(f"Outlier bounds: [{lower_bound:,.0f}, {upper_bound:,.0f}]")

# Filter outliers
outliers = customer_df.filter(
    (col("annual_income") < lower_bound) | (col("annual_income") > upper_bound)
)
print(f"\nOutliers detected: {outliers.count()}")

# Remove outliers (this filter also drops rows where annual_income is null)
customer_no_outliers = customer_df.filter(
    (col("annual_income") >= lower_bound) & (col("annual_income") <= upper_bound)
)
print(f"Records after removing outliers: {customer_no_outliers.count()}")
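
The Z-score method listed above is not demonstrated in the cell, so here is a minimal plain-Python sketch of it on made-up incomes. A value is flagged when it lies more than a chosen number of standard deviations from the mean. A threshold of 3 is common on larger datasets; this sketch uses 2.5 because, with only ten values, a single point cannot reach |z| = 3.

```python
import statistics

# Made-up incomes with one obviously extreme value
incomes = [48_000, 52_000, 55_000, 61_000, 58_000,
           50_000, 53_000, 57_000, 49_000, 250_000]

mean = statistics.mean(incomes)
std = statistics.stdev(incomes)           # sample standard deviation

THRESHOLD = 2.5                           # assumed cutoff for this small sample
z_scores = [(x - mean) / std for x in incomes]
outliers = [x for x, z in zip(incomes, z_scores) if abs(z) > THRESHOLD]

print(f"mean={mean:,.0f}, std={std:,.0f}")
print(f"outliers: {outliers}")
```

Because the extreme value itself inflates both the mean and the standard deviation, the Z-score method tends to be less sensitive than the IQR method on small or heavily skewed data.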

## 3. Statistical Analysis

### Descriptive Statistics

**Key Metrics:**
- Central tendency: mean, median, mode
- Dispersion: variance, standard deviation, range
- Distribution: skewness, kurtosis

In [0]:
# Comprehensive statistical analysis

# Convert to Pandas for advanced statistics
customer_pd = customer_df.select("age", "annual_income").toPandas()

print("=== AGE STATISTICS ===")
print(f"Mean: {customer_pd['age'].mean():.2f}")
print(f"Median: {customer_pd['age'].median():.2f}")
print(f"Mode: {customer_pd['age'].mode()[0]:.2f}")
print(f"Std Dev: {customer_pd['age'].std():.2f}")
print(f"Variance: {customer_pd['age'].var():.2f}")
print(f"Skewness: {customer_pd['age'].skew():.2f}")

print("\n=== INCOME STATISTICS ===")
print(f"Mean: ${customer_pd['annual_income'].mean():,.2f}")
print(f"Median: ${customer_pd['annual_income'].median():,.2f}")
print(f"Std Dev: ${customer_pd['annual_income'].std():,.2f}")
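
Skewness and kurtosis are listed under Key Metrics, and both can be computed from central moments. The sketch below does this by hand on a made-up, right-skewed sample using the population (divide by n) formulas. pandas' `.skew()` and `.kurt()` apply a sample-size bias correction, so their results will differ somewhat from these values, especially on small samples.

```python
import statistics

values = [2, 3, 3, 4, 4, 4, 5, 5, 12]     # made-up, right-skewed sample

n = len(values)
mean = statistics.fmean(values)

# Central moments, population form (divide by n)
m2 = sum((x - mean) ** 2 for x in values) / n
m3 = sum((x - mean) ** 3 for x in values) / n
m4 = sum((x - mean) ** 4 for x in values) / n

skewness = m3 / m2 ** 1.5                 # > 0 means a longer right tail
excess_kurtosis = m4 / m2 ** 2 - 3        # 0 for a normal distribution

print(f"Skewness: {skewness:.3f}")
print(f"Excess kurtosis: {excess_kurtosis:.3f}")
```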

### Group-wise Analysis

**Key Functions:**
- `.groupBy()`: Group data by one or more columns
- `.agg()`: Apply aggregate functions
- Common aggregations: count, sum, avg, min, max

In [0]:
# Analyze customers by segment
segment_analysis = customer_df.groupBy("segment").agg(
    F.count("customer_id").alias("customer_count"),
    F.avg("age").alias("avg_age"),
    F.avg("annual_income").alias("avg_income"),
    F.sum(when(col("email_subscribed") == True, 1).otherwise(0)).alias("subscribed_count")
).orderBy(F.desc("customer_count"))

display(segment_analysis)

In [0]:
# Geographic analysis by state
state_analysis = customer_df.groupBy("state").agg(
    F.count("customer_id").alias("customer_count"),
    F.avg("annual_income").alias("avg_income")
).orderBy(F.desc("customer_count")).limit(10)

display(state_analysis)
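
Conceptually, `.groupBy().agg()` buckets rows by key and then reduces each bucket. The plain-Python sketch below mirrors that on a few made-up rows; the segment names are hypothetical. It also reflects one detail of Spark's behavior: `F.avg` skips null values, so a missing income does not drag the average down.

```python
from collections import defaultdict

# Made-up rows; segment names are assumptions for illustration
customers = [
    {"customer_id": 1, "segment": "Premium", "annual_income": 90_000},
    {"customer_id": 2, "segment": "Standard", "annual_income": 50_000},
    {"customer_id": 3, "segment": "Premium", "annual_income": 110_000},
    {"customer_id": 4, "segment": "Standard", "annual_income": None},
]

# Step 1: bucket rows by the grouping key
groups = defaultdict(list)
for row in customers:
    groups[row["segment"]].append(row)

# Step 2: reduce each bucket to its aggregates
segment_stats = {}
for segment, rows in groups.items():
    incomes = [r["annual_income"] for r in rows if r["annual_income"] is not None]
    segment_stats[segment] = {
        "customer_count": len(rows),
        "avg_income": sum(incomes) / len(incomes) if incomes else None,
    }

print(segment_stats)
```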

### Correlation Analysis

**Purpose:** Understand relationships between numeric variables

In [0]:
# Calculate correlation between age and income
correlation = customer_df.stat.corr("age", "annual_income")
print(f"Correlation between Age and Income: {correlation:.3f}")

# Create correlation matrix using Pandas
numeric_cols = ["age", "annual_income"]
correlation_matrix = customer_df.select(numeric_cols).toPandas().corr()
print("\nCorrelation Matrix:")
print(correlation_matrix)
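
`stat.corr` computes the Pearson correlation coefficient by default. As a worked example, the sketch below calculates the same quantity by hand on made-up age and income values, dividing the summed co-deviations by the square root of the product of the summed squared deviations.

```python
import math

ages = [25, 32, 41, 48, 56]                          # made-up values
incomes = [38_000, 52_000, 61_000, 75_000, 83_000]   # made-up values

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sq_dev_x = sum((x - mean_x) ** 2 for x in xs)
    sq_dev_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / math.sqrt(sq_dev_x * sq_dev_y)

r = pearson(ages, incomes)
print(f"Pearson r: {r:.3f}")
```

Values near +1 or -1 indicate a strong linear relationship and values near 0 a weak one. The coefficient only captures linear association, so a low value does not rule out a nonlinear relationship.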

### Product Data Analysis

In [0]:
# Analyze product data
product_summary = products_df.agg(
    F.count("product_id").alias("total_products"),
    F.avg("price").alias("avg_price"),
    F.avg("rating").alias("avg_rating"),
    F.sum("num_reviews").alias("total_reviews")
)

display(product_summary)

# Products by category
category_analysis = products_df.groupBy("category").agg(
    F.count("product_id").alias("product_count"),
    F.avg("price").alias("avg_price"),
    F.avg("rating").alias("avg_rating")
).orderBy(F.desc("product_count"))

display(category_analysis)

### Sales Data Profiling

In [0]:
# Sales performance metrics
sales_summary = sales_df.agg(
    F.count("transaction_id").alias("total_transactions"),
    F.sum("total_sales").alias("total_revenue"),
    F.avg("total_sales").alias("avg_transaction_value"),
    F.avg("customer_satisfaction").alias("avg_satisfaction"),
    F.sum("quantity").alias("total_units_sold")
)

display(sales_summary)

# Regional performance
regional_analysis = sales_df.groupBy("region").agg(
    F.count("transaction_id").alias("transactions"),
    F.sum("total_sales").alias("revenue"),
    F.avg("customer_satisfaction").alias("avg_satisfaction")
).orderBy(F.desc("revenue"))

display(regional_analysis)

## Key Takeaways

**Data Exploration:**
- Always start with `.printSchema()` and `.describe()` to understand your data
- Check data quality early: missing values, duplicates, outliers

**Data Cleaning:**
- Handle missing values appropriately (drop, fill, or impute)
- Validate data ranges and formats
- Remove or treat outliers based on business context

**Statistical Analysis:**
- Use descriptive statistics to understand distributions
- Group-wise analysis reveals patterns across categories
- Correlation helps identify relationships between variables

**Next Steps:** Continue with Tutorial 2 for data visualization techniques.