# EDA Integration - Part 2: Full EDA Workflow

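This notebook walks through a complete exploratory data analysis (EDA) workflow, combining data loading, inspection, summarization, visualization, and basic feature engineering.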

**Topics covered:**
- Data loading and inspection
- Missing data analysis
- Statistical summaries
- Visualization for EDA
- Feature engineering basics

**Problems:** 21 (Easy: 0-7, Medium: 8-14, Hard: 15-20)

In [None]:
# ============================================
# SETUP - Run this cell first!
# ============================================
import sys
sys.path.insert(0, '..')
from utils.checks import eda_02_full_eda_workflow as verify

# Dataset paths (provided for convenience)
TITANIC_PATH = '../datasets/public/titanic.csv'
TIPS_PATH = '../datasets/public/tips.csv'

print("Verification module loaded! Dataset paths defined.")
print(f"Titanic: {TITANIC_PATH}")
print(f"Tips: {TIPS_PATH}")
print("\nNow import the libraries you need and load the datasets.")

---
## Problem 0: Import Libraries and Load Data
**Difficulty:** Easy

### Concept
A complete EDA workflow requires importing data analysis and visualization libraries, then loading your datasets.

### Syntax
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

# Load datasets
titanic = pd.read_csv(TITANIC_PATH)
tips = pd.read_csv(TIPS_PATH)
```

### Task
1. Import NumPy as `np`, Pandas as `pd`, and matplotlib.pyplot as `plt`
2. Enable inline plotting with `%matplotlib inline`
3. Load the Titanic and Tips datasets using the paths provided in SETUP

### Expected Properties
- All libraries should be imported
- `titanic` and `tips` should be DataFrames

In [None]:
# Your solution:


In [None]:
# Verification
verify.p0(globals())

---
## Problem 1: Get Dataset Shape
**Difficulty:** Easy

### Concept
The first step in any EDA is understanding the size of your dataset. The `.shape` attribute returns a tuple of (rows, columns), giving you an immediate sense of the data scale.

### Syntax
```python
# Get shape of DataFrame
shape = df.shape  # Returns (rows, columns)
```

### Example
```python
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
>>> df.shape
(3, 2)  # 3 rows, 2 columns
```

### Task
Get the shape of the Titanic dataset. Store it in `shape`.

### Expected Properties
- `shape` should be a tuple
- Should have 2 elements (rows, columns)
- Number of rows should be greater than 0

In [None]:
# Your solution:
shape = None

In [None]:
# Verification
verify.p1(shape)

---
## Problem 2: Get Data Types
**Difficulty:** Easy

### Concept
Understanding the data types of each column helps you know what operations are valid and whether data needs type conversion. The `.dtypes` attribute returns a Series with column names as index and data types as values.

### Syntax
```python
# Get data types of all columns
dtypes = df.dtypes
```

### Example
```python
>>> df = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
>>> df.dtypes
A     int64
B    object
dtype: object
```
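
When a numeric column shows up as `object`, it usually needs conversion before arithmetic will work. The sketch below uses hypothetical toy data (the `price` and `qty` columns are not part of the Titanic or Tips datasets) and `pd.to_numeric` as one possible way to do it.

```python
import pandas as pd

# Hypothetical toy data: 'price' was read as text, so its dtype is object
df = pd.DataFrame({'price': ['10.5', '20.0', 'n/a'], 'qty': [1, 2, 3]})

# errors='coerce' turns entries that cannot be parsed into NaN instead of raising
df['price'] = pd.to_numeric(df['price'], errors='coerce')

print(df.dtypes)
```

After the conversion, `price` is a float column with one missing value where `'n/a'` used to be.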

### Task
Get the data types for all columns in the Titanic dataset. Store the result in `dtypes`.

### Expected Properties
- `dtypes` should be a pandas Series
- Should have an entry for each column in the dataset

In [None]:
# Your solution:
dtypes = None

In [None]:
# Verification
verify.p2(dtypes)

---
## Problem 3: Preview First Rows
**Difficulty:** Easy

### Concept
Looking at the first few rows gives you a quick sense of the data structure and sample values. The `.head()` method is one of the most commonly used methods in EDA.

### Syntax
```python
# Get first n rows (default is 5)
first_rows = df.head(n)

# Get last n rows
last_rows = df.tail(n)
```

### Example
```python
>>> df = pd.DataFrame({'A': range(10)})
>>> df.head(3)
   A
0  0
1  1
2  2
```

### Task
Get the first 3 rows and last 3 rows of the Titanic dataset. Store them in `first_rows` and `last_rows`.

### Expected Properties
- Both should be DataFrames
- Each should have exactly 3 rows
- Should have the same number of columns as the original dataset

In [None]:
# Your solution:
first_rows = None
last_rows = None

In [None]:
# Verification
verify.p3(first_rows, last_rows, titanic)

---
## Problem 4: Count Missing Values
**Difficulty:** Easy

### Concept
Missing data is common in real-world datasets. Identifying how much data is missing helps you decide on appropriate handling strategies (imputation, deletion, etc.).

### Syntax
```python
# Count missing values per column
missing_counts = df.isnull().sum()

# OR
missing_counts = df.isna().sum()
```

### Example
```python
>>> df = pd.DataFrame({'A': [1, None, 3], 'B': [4, 5, None]})
>>> df.isnull().sum()
A    1
B    1
dtype: int64
```
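
Once the counts are known, the two handling strategies mentioned above (deletion and imputation) might look like the following minimal sketch. The choice of the median as the fill value is only an illustrative assumption, not a recommendation for every column.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, None, 3.0], 'B': [4.0, 5.0, None]})

# Deletion: keep only rows with no missing values
dropped = df.dropna()

# Imputation: fill each column's gaps with that column's median
imputed = df.fillna(df.median())

print(dropped)
print(imputed)
```

Deletion keeps a single complete row here, while imputation keeps all three rows at the cost of inventing plausible values.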

### Task
Count the number of missing values in each column of the Titanic dataset. Store the result in `missing_counts`.

### Expected Properties
- `missing_counts` should be a pandas Series
- Should have non-negative integer values
- Should have an entry for each column

In [None]:
# Your solution:
missing_counts = None

In [None]:
# Verification
verify.p4(missing_counts, titanic)

---
## Problem 5: Calculate Missing Percentage
**Difficulty:** Easy

### Concept
While counts are useful, percentages give you a better sense of the proportion of missing data relative to the total dataset size. This helps prioritize which missing data to address first.

### Syntax
```python
# Calculate missing percentage
missing_pct = (df.isnull().sum() / len(df)) * 100
```

### Example
```python
>>> df = pd.DataFrame({'A': [1, None, 3, 4]})
>>> (df.isnull().sum() / len(df)) * 100
A    25.0
dtype: float64
```
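
To prioritize columns, it can help to rank them by missing percentage. The sketch below uses a hypothetical three-column frame and relies on the fact that the mean of a boolean mask equals the fraction of `True` values.

```python
import pandas as pd

# Hypothetical toy data with different amounts of missingness per column
df = pd.DataFrame({
    'A': [1, None, 3, 4],
    'B': [None, None, 3, 4],
    'C': [1, 2, 3, 4],
})

# Mean of the boolean mask = fraction missing; multiply by 100 for a percentage
missing_pct = df.isnull().mean() * 100

# Keep only columns with missing data, worst first
ranked = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(ranked)
```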

### Task
Calculate the percentage of missing values for each column in the Titanic dataset. Store the result in `missing_pct`.

### Expected Properties
- `missing_pct` should be a pandas Series
- All values should be between 0 and 100
- Should be numeric (float) values

In [None]:
# Your solution:
missing_pct = None

In [None]:
# Verification
verify.p5(missing_pct, titanic)

---
## Problem 6: Get Descriptive Statistics
**Difficulty:** Easy

### Concept
Descriptive statistics (mean, std, min, max, quartiles) provide a statistical summary of numerical columns. This is crucial for understanding the distribution and range of your data.

### Syntax
```python
# Get descriptive statistics for numerical columns
desc_stats = df.describe()
```

### Example
```python
>>> df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
>>> df.describe()
              A
count  5.000000
mean   3.000000
std    1.581139
min    1.000000
25%    2.000000
50%    3.000000
75%    4.000000
max    5.000000
```
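
By default `describe()` skips non-numeric columns when numeric ones are present. As a small sketch on hypothetical data, passing `exclude=['number']` summarizes the remaining columns instead, reporting `count`, `unique`, `top`, and `freq`.

```python
import pandas as pd

# Hypothetical toy data with one numeric and one text column
df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['x', 'y', 'x', 'x', 'z']})

numeric_summary = df.describe()                  # numeric columns only
text_summary = df.describe(exclude=['number'])   # everything that is not numeric

print(numeric_summary)
print(text_summary)
```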

### Task
Get descriptive statistics for the Titanic dataset. Store the result in `desc_stats`.

### Expected Properties
- `desc_stats` should be a DataFrame
- Should include statistics like 'mean', 'std', 'min', 'max'
- Should only include numerical columns

In [None]:
# Your solution:
desc_stats = None

In [None]:
# Verification
verify.p6(desc_stats)

---
## Problem 7: Value Counts for Categorical Column
**Difficulty:** Easy

### Concept
For categorical data, value counts show the frequency distribution of each category. This is essential for understanding the distribution of categorical variables.

### Syntax
```python
# Get value counts for a column
counts = df['column'].value_counts()
```

### Example
```python
>>> df = pd.DataFrame({'day': ['Mon', 'Tue', 'Mon', 'Wed', 'Mon']})
>>> df['day'].value_counts()
Mon    3
Tue    1
Wed    1
Name: day, dtype: int64
```
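
Raw counts are sometimes easier to compare as proportions. A minimal sketch, reusing the same toy data, with `normalize=True`:

```python
import pandas as pd

df = pd.DataFrame({'day': ['Mon', 'Tue', 'Mon', 'Wed', 'Mon']})

# normalize=True returns each category's share of the total instead of a count
shares = df['day'].value_counts(normalize=True)
print(shares)
```

Here `Mon` accounts for 0.6 of the rows, and the shares sum to 1.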

### Task
Get value counts for the 'day' column in the Tips dataset. Store the result in `day_counts`.

### Expected Properties
- `day_counts` should be a pandas Series
- Should have positive integer values
- Should have at least one entry

In [None]:
# Your solution:
day_counts = None

In [None]:
# Verification
verify.p7(day_counts)

---
## Problem 8: Count Unique Values
**Difficulty:** Medium

### Concept
Knowing the number of unique values in a column helps determine if it's truly categorical (few unique values) or continuous (many unique values). The `nunique()` method counts distinct values.

### Syntax
```python
# Get unique values
unique_vals = df['column'].unique()

# Count unique values
n_unique = df['column'].nunique()
```

### Example
```python
>>> df = pd.DataFrame({'sex': ['M', 'F', 'M', 'F']})
>>> df['sex'].unique()
array(['M', 'F'], dtype=object)
>>> df['sex'].nunique()
2
```
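
The categorical-versus-continuous heuristic from the Concept section can be sketched across a whole DataFrame with `DataFrame.nunique()`. The cutoff `MAX_CATEGORIES` below is a hypothetical threshold chosen for this toy data, not a standard value.

```python
import pandas as pd

# Hypothetical toy data: one low-cardinality and one high-cardinality column
df = pd.DataFrame({
    'sex': ['M', 'F', 'M', 'F'],
    'total_bill': [10.5, 20.0, 15.2, 30.1],
})

# Number of distinct values per column
cardinality = df.nunique()

# Assumed cutoff: columns with this many distinct values or fewer are treated as categorical
MAX_CATEGORIES = 2
likely_categorical = cardinality[cardinality <= MAX_CATEGORIES].index.tolist()

print(cardinality)
print(likely_categorical)
```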

### Task
For the 'sex' column in the Tips dataset, get the unique values and count of unique values. Store them in `unique_vals` and `n_unique`.

### Expected Properties
- `unique_vals` should be a NumPy array
- `n_unique` should be an integer
- `n_unique` should equal the length of `unique_vals`

In [None]:
# Your solution:
unique_vals = None
n_unique = None

In [None]:
# Verification
verify.p8(unique_vals, n_unique)

---
## Problem 9: Create Histogram
**Difficulty:** Medium

### Concept
Histograms visualize the distribution of numerical data by showing the frequency of values within bins. They're essential for identifying skewness, outliers, and the overall data distribution shape.

### Syntax
```python
# Create histogram
fig, ax = plt.subplots()
n, bins, patches = ax.hist(df['column'], bins=20)
ax.set_xlabel('Label')
ax.set_ylabel('Frequency')
plt.show()
```

### Example
```python
>>> fig, ax = plt.subplots()
>>> ax.hist([1, 2, 2, 3, 3, 3, 4], bins=4)
>>> plt.show()
```
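
The frequencies and bin edges that `ax.hist()` returns can also be computed without drawing anything. As a sketch, `np.histogram` on the same values gives the counts per bin and the bin boundaries:

```python
import numpy as np

values = [1, 2, 2, 3, 3, 3, 4]

# counts: how many values fall in each bin; edges: the bin boundaries (one more than counts)
counts, edges = np.histogram(values, bins=4)

print(counts)
print(edges)
```

With 4 bins there are 5 edges, spanning the minimum to the maximum of the data.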

### Task
Create a histogram of the 'total_bill' column in the Tips dataset. Store the histogram return values (n, bins, patches) in variables with those names.

### Expected Properties
- Should create a matplotlib figure
- `n`, `bins`, and `patches` should be returned from ax.hist()
- `n` should be an array of frequencies

In [None]:
# Your solution:
fig, ax = plt.subplots()
# Create histogram of total_bill
n, bins, patches = None, None, None

ax.set_xlabel('Total Bill')
ax.set_ylabel('Frequency')
ax.set_title('Distribution of Total Bill')
plt.show()

In [None]:
# Verification
verify.p9(n, bins, patches)

---
## Problem 10: Create Box Plot by Category
**Difficulty:** Medium

### Concept
Box plots show the distribution of data through quartiles and help identify outliers. When grouped by category, they allow comparison of distributions across different groups.

### Syntax
```python
# Box plot grouped by category
df.boxplot(column='value_col', by='category_col')

# OR using the plot accessor (requires pandas 1.4+)
df.plot.box(column='value_col', by='category_col')
```

### Example
```python
>>> df = pd.DataFrame({
...     'group': ['A', 'A', 'B', 'B'],
...     'value': [1, 2, 3, 4]
... })
>>> df.boxplot(column='value', by='group')
```
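
Box plots commonly draw points beyond 1.5 times the interquartile range (IQR) from the quartiles as outliers. That rule can be sketched numerically as follows, on hypothetical values; the 1.5 multiplier is the conventional default, assumed here.

```python
import pandas as pd

# Hypothetical values with one obvious extreme point
values = pd.Series([1, 2, 3, 4, 100])

q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1

# Points outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(outliers)
```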

### Task
Create box plots showing 'total_bill' grouped by 'day' in the Tips dataset. Store the result in `bp`.

### Expected Properties
- Should create a visualization
- `bp` should be the return value from the boxplot function

In [None]:
# Your solution:
fig, ax = plt.subplots(figsize=(10, 6))
bp = None  # Create box plots (hint: use tips.boxplot() and pass ax=ax so it draws on this figure)

plt.suptitle('')  # Remove default title
ax.set_title('Total Bill by Day')
plt.show()

In [None]:
# Verification
verify.p10(bp)

---
## Problem 11: Create Scatter Plot with Groups
**Difficulty:** Medium

### Concept
Scatter plots reveal relationships between two numerical variables. Coloring points by a categorical variable adds a third dimension to the analysis, showing how relationships differ across groups.

### Syntax
```python
# Scatter plot with different colors for groups
fig, ax = plt.subplots()
for group in df['category'].unique():
    subset = df[df['category'] == group]
    ax.scatter(subset['x'], subset['y'], label=group)
ax.legend()
```

### Example
```python
>>> fig, ax = plt.subplots()
>>> males = df[df['sex'] == 'Male']
>>> females = df[df['sex'] == 'Female']
>>> ax.scatter(males['x'], males['y'], label='Male')
>>> ax.scatter(females['x'], females['y'], label='Female')
>>> ax.legend()
```

### Task
Create a scatter plot of 'total_bill' vs 'tip', with different colors for 'Male' and 'Female'. Store the scatter plot return values in `scatter1` and `scatter2`.

### Expected Properties
- Should create two scatter plots (one for each sex)
- Both scatter objects should be returned
- Plot should have a legend

In [None]:
# Your solution:
fig, ax = plt.subplots()

male = tips[tips['sex'] == 'Male']
female = tips[tips['sex'] == 'Female']

scatter1 = None  # Plot male
scatter2 = None  # Plot female

ax.set_xlabel('Total Bill')
ax.set_ylabel('Tip')
ax.set_title('Total Bill vs Tip by Sex')
ax.legend()
plt.show()

In [None]:
# Verification
verify.p11(scatter1, scatter2)

---
## Problem 12: Group Statistics
**Difficulty:** Medium

### Concept
Grouping data and calculating statistics reveals patterns across categories. This is fundamental to comparative analysis and understanding how variables differ between groups.

### Syntax
```python
# Group by one or more columns and calculate statistics
grouped = df.groupby(['col1', 'col2'])['value_col'].mean()

# Multiple aggregations
grouped = df.groupby('col')['value'].agg(['mean', 'std', 'count'])
```

### Example
```python
>>> df = pd.DataFrame({
...     'category': ['A', 'A', 'B', 'B'],
...     'value': [10, 20, 30, 40]
... })
>>> df.groupby('category')['value'].mean()
category
A    15.0
B    35.0
Name: value, dtype: float64
```
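
Grouping by two columns produces a Series with a MultiIndex, which can be hard to scan. One option, sketched here on hypothetical toy data, is to pivot one index level into columns with `unstack`:

```python
import pandas as pd

# Hypothetical toy data with two grouping columns
df = pd.DataFrame({
    'day': ['Sat', 'Sat', 'Sun', 'Sun'],
    'time': ['Lunch', 'Dinner', 'Lunch', 'Dinner'],
    'tip': [1.0, 2.0, 3.0, 4.0],
})

grouped = df.groupby(['day', 'time'])['tip'].mean()  # Series with a (day, time) MultiIndex

# Move the 'time' level into columns: rows are days, columns are meal times
table = grouped.unstack('time')
print(table)
```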

### Task
Calculate the mean tip grouped by both 'day' and 'time' in the Tips dataset. Store the result in `grouped_mean`.

### Expected Properties
- `grouped_mean` should be a pandas Series
- Should have a MultiIndex (day and time)
- Values should be numeric (mean tips)

In [None]:
# Your solution:
grouped_mean = None  # Group by ['day', 'time'] and get mean tip

In [None]:
# Verification
verify.p12(grouped_mean)

---
## Problem 13: Create Bar Chart of Aggregated Data
**Difficulty:** Medium

### Concept
Bar charts are ideal for comparing values across categories. They make it easy to see which categories have higher or lower values at a glance.

### Syntax
```python
# Create bar chart from aggregated data
grouped_data = df.groupby('category')['value'].mean()
fig, ax = plt.subplots()
bars = ax.bar(grouped_data.index, grouped_data.values)
```

### Example
```python
>>> data = pd.Series([10, 20, 15], index=['A', 'B', 'C'])
>>> fig, ax = plt.subplots()
>>> ax.bar(data.index, data.values)
```

### Task
Create a bar chart showing the average tip by day. Store the bar chart return value in `bars`.

### Expected Properties
- Should create a bar chart
- `bars` should be the return value from ax.bar()
- Chart should show one bar per day

In [None]:
# Your solution:
avg_tip_by_day = tips.groupby('day')['tip'].mean()

fig, ax = plt.subplots()
bars = None  # Create bar chart

ax.set_ylabel('Average Tip')
ax.set_xlabel('Day')
ax.set_title('Average Tip by Day')
plt.show()

In [None]:
# Verification
verify.p13(bars)

---
## Problem 14: Create Pie Chart
**Difficulty:** Medium

### Concept
Pie charts show proportions of a whole, making them useful for displaying the composition of categorical data. They work best with a small number of categories.

### Syntax
```python
# Create pie chart
counts = df['category'].value_counts()
fig, ax = plt.subplots()
wedges, texts, autotexts = ax.pie(counts, labels=counts.index, autopct='%1.1f%%')
```

### Example
```python
>>> data = pd.Series([30, 70], index=['A', 'B'])
>>> fig, ax = plt.subplots()
>>> ax.pie(data, labels=data.index, autopct='%1.1f%%')
```

### Task
Create a pie chart showing the distribution of smokers vs non-smokers in the Tips dataset. Store the wedges in `wedges`.

### Expected Properties
- Should create a pie chart
- `wedges` should contain the pie wedge objects
- Should show percentages for each category

In [None]:
# Your solution:
smoker_counts = tips['smoker'].value_counts()

fig, ax = plt.subplots()
wedges, texts, autotexts = None, None, None  # Create pie chart

ax.set_title('Smoker Distribution')
plt.show()

In [None]:
# Verification
verify.p14(wedges)

---
## Problem 15: Create Derived Feature
**Difficulty:** Hard

### Concept
Feature engineering creates new variables from existing ones to reveal patterns or prepare data for modeling. Derived features often provide more insight than raw values.

### Syntax
```python
# Create derived feature
df['new_feature'] = df['col1'] / df['col2']
df['percentage'] = (df['part'] / df['total']) * 100
```

### Example
```python
>>> df = pd.DataFrame({'sales': [100, 200], 'cost': [80, 150]})
>>> df['profit_margin'] = ((df['sales'] - df['cost']) / df['sales']) * 100
>>> df
   sales  cost  profit_margin
0    100    80           20.0
1    200   150           25.0
```
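
Derived features are not limited to ratios. As another sketch, `pd.cut` turns a numeric column into labeled bins; the `age` data, bin edges, and labels below are hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({'age': [5, 17, 30, 70]})

# Bins are right-inclusive by default: (0, 18], (18, 65], (65, 120]
df['age_group'] = pd.cut(
    df['age'],
    bins=[0, 18, 65, 120],
    labels=['child', 'adult', 'senior'],
)

print(df)
```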

### Task
Create a new column 'tip_percentage' that calculates (tip / total_bill) * 100. Store the modified DataFrame in `tips_with_pct`.

### Expected Properties
- DataFrame should have a 'tip_percentage' column
- Values should be between 0 and 100
- Mean tip percentage should be between 10 and 20

In [None]:
# Your solution:
tips_with_pct = tips.copy()
tips_with_pct['tip_percentage'] = None

In [None]:
# Verification
verify.p15(tips_with_pct)

---
## Problem 16: Create Multi-Panel Figure
**Difficulty:** Hard

### Concept
Multi-panel figures allow you to display multiple related visualizations side-by-side, providing a comprehensive view of the data. This is essential for creating effective EDA reports.

### Syntax
```python
# Create subplots grid
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 10))

# Access individual subplots
axes[0, 0].plot(...)  # Top-left
axes[0, 1].plot(...)  # Top-right
axes[1, 0].plot(...)  # Bottom-left
axes[1, 1].plot(...)  # Bottom-right

plt.tight_layout()
```

### Example
```python
>>> fig, axes = plt.subplots(2, 2, figsize=(10, 8))
>>> axes[0, 0].hist(data['col1'])
>>> axes[0, 1].scatter(data['x'], data['y'])
>>> plt.tight_layout()
```

### Task
Create a 2x2 subplot figure showing:
- Top-left: Histogram of 'total_bill'
- Top-right: Histogram of 'tip'
- Bottom-left: Scatter plot of 'total_bill' vs 'tip'
- Bottom-right: Bar chart of count by 'day'

### Expected Properties
- `axes` should be a 2D array with shape (2, 2)
- All four subplots should be created

In [None]:
# Your solution:
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Top-left: Histogram of total_bill
# axes[0, 0].hist(...)

# Top-right: Histogram of tip
# axes[0, 1].hist(...)

# Bottom-left: Scatter of total_bill vs tip
# axes[1, 0].scatter(...)

# Bottom-right: Bar chart of count by day
# day_counts = tips['day'].value_counts()
# axes[1, 1].bar(...)

plt.tight_layout()
plt.show()

In [None]:
# Verification
verify.p16(axes)

---
## Problem 17: Multi-Group Summary Statistics
**Difficulty:** Hard

### Concept
Calculating multiple statistics across groups provides a comprehensive understanding of how data varies. The `agg()` method allows you to compute multiple aggregations simultaneously.

### Syntax
```python
# Multiple statistics with groupby
summary = df.groupby(['group1', 'group2'])['value'].agg(['mean', 'std', 'min', 'max'])
```

### Example
```python
>>> df = pd.DataFrame({
...     'category': ['A', 'A', 'B', 'B'],
...     'value': [10, 20, 30, 40]
... })
>>> df.groupby('category')['value'].agg(['mean', 'std', 'min', 'max'])
          mean       std  min  max
category                          
A         15.0  7.071068   10   20
B         35.0  7.071068   30   40
```
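
When the default column names (`mean`, `max`, ...) are not descriptive enough, named aggregation lets you choose them. A minimal sketch on the same toy data, where `avg_value` and `max_value` are arbitrary names picked for this example:

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B'],
    'value': [10, 20, 30, 40],
})

# Keyword names become the output column names; the strings name the aggregations
summary = df.groupby('category')['value'].agg(avg_value='mean', max_value='max')
print(summary)
```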

### Task
Create a summary table with mean, std, min, and max of 'total_bill' grouped by 'day' and 'time'. Store the result in `summary_table`.

### Expected Properties
- `summary_table` should be a DataFrame
- Should have columns for 'mean', 'std', 'min', 'max'
- Should have a MultiIndex from grouping

In [None]:
# Your solution:
summary_table = None  # Group by ['day', 'time'] and calculate multiple stats

In [None]:
# Verification
verify.p17(summary_table)

---
## Problem 18: Filter and Analyze Subset
**Difficulty:** Hard

### Concept
Filtering creates subsets of data based on conditions, allowing focused analysis of specific segments. Combining multiple filters with boolean operators enables complex selections.

### Syntax
```python
# Filter with multiple conditions
filtered = df[(df['col1'] == value1) & (df['col2'] > value2)]

# Use isin for multiple values
filtered = df[df['col'].isin(['val1', 'val2'])]
```

### Example
```python
>>> df = pd.DataFrame({
...     'day': ['Mon', 'Tue', 'Sat', 'Sun'],
...     'meal': ['Lunch', 'Dinner', 'Dinner', 'Dinner'],
...     'bill': [10, 20, 30, 40]
... })
>>> weekend_dinner = df[(df['day'].isin(['Sat', 'Sun'])) & (df['meal'] == 'Dinner')]
>>> weekend_dinner
   day    meal  bill
2  Sat  Dinner    30
3  Sun  Dinner    40
```
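
The same selection can be written as a string expression with `DataFrame.query`, which some readers find easier to scan than chained boolean masks. This is an alternative sketch, not a replacement for the syntax above.

```python
import pandas as pd

df = pd.DataFrame({
    'day': ['Mon', 'Tue', 'Sat', 'Sun'],
    'meal': ['Lunch', 'Dinner', 'Dinner', 'Dinner'],
    'bill': [10, 20, 30, 40],
})

# 'in' plays the role of isin(); 'and' plays the role of &
weekend_dinner = df.query("day in ['Sat', 'Sun'] and meal == 'Dinner'")
print(weekend_dinner)
```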

### Task
Filter the Tips dataset for dinner on weekends (Sat or Sun). Store the filtered data in `weekend_dinner` and its descriptive statistics in `stats`.

### Expected Properties
- `weekend_dinner` should be a DataFrame
- Should have fewer rows than the original dataset
- `stats` should be descriptive statistics of the filtered data

In [None]:
# Your solution:
weekend_dinner = None  # Filter for (Sat or Sun) AND (Dinner)
stats = None  # Get descriptive statistics

In [None]:
# Verification
verify.p18(weekend_dinner, stats, tips)

---
## Problem 19: Identify High Values
**Difficulty:** Hard

### Concept
Identifying exceptional cases (high/low values, outliers) is crucial in EDA. Boolean masks make it easy to find rows that meet specific criteria.

### Syntax
```python
# Create boolean mask and filter
mask = df['column'] > threshold
high_values = df[mask]

# Or in one line
high_values = df[df['column'] > threshold]
```

### Example
```python
>>> df = pd.DataFrame({'score': [85, 92, 78, 95, 88]})
>>> high_scorers = df[df['score'] > 90]
>>> high_scorers
   score
1     92
3     95
```
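
If no natural threshold exists, another option is to take the top few rows directly. A small sketch with `nlargest`, where the choice of 2 rows is arbitrary:

```python
import pandas as pd

df = pd.DataFrame({'score': [85, 92, 78, 95, 88]})

# The 2 rows with the highest 'score', sorted from highest to lowest
top_two = df.nlargest(2, 'score')
print(top_two)
```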

### Task
Identify all rows where the tip percentage is above 20%. First calculate tip percentage, then filter. Store the result in `high_tippers`.

### Expected Properties
- `high_tippers` should be a DataFrame
- Should only contain rows where tip percentage > 20
- Should have at least one row

In [None]:
# Your solution:
tip_pct = (tips['tip'] / tips['total_bill']) * 100
high_tippers = None  # Filter where tip_pct > 20

In [None]:
# Verification
verify.p19(high_tippers, tips)

---
## Problem 20: Create EDA Report Function
**Difficulty:** Hard

### Concept
Automating EDA by creating reusable functions saves time and ensures consistency. A good EDA function should provide shape, types, missing data, and basic statistics.

### Syntax
```python
def eda_report(df):
    report = {}
    report['shape'] = df.shape
    report['dtypes'] = df.dtypes
    report['missing'] = df.isnull().sum()
    report['stats'] = df.describe()
    return report
```

### Example
```python
>>> def basic_info(df):
...     return {
...         'rows': len(df),
...         'cols': len(df.columns),
...         'missing': df.isnull().sum().sum()
...     }
>>> basic_info(df)
{'rows': 100, 'cols': 5, 'missing': 3}
```

### Task
Complete the `basic_eda` function that returns a dictionary with:
- 'shape': DataFrame shape
- 'columns': List of column names
- 'dtypes': Data types
- 'missing_counts': Missing value counts
- 'missing_pct': Missing value percentages
- 'numerical_stats': Descriptive statistics
- 'categorical_columns': List of non-numeric columns

Test it on the Tips dataset.

### Expected Properties
- Function should return a dictionary
- All dictionary keys should be populated (not None)
- Should work on any DataFrame

In [None]:
# Your solution:
def basic_eda(df):
    """
    Generate a basic EDA report for a DataFrame.
    Returns a dictionary with various statistics.
    """
    report = {
        'shape': None,
        'columns': None,
        'dtypes': None,
        'missing_counts': None,
        'missing_pct': None,
        'numerical_stats': None,
        'categorical_columns': None
    }
    
    # Fill in the report dictionary
    # Hint: categorical_columns can be found with df.select_dtypes(exclude=[np.number]).columns.tolist()
    
    return report

# Test the function
eda_report = basic_eda(tips)

In [None]:
# Verification
verify.p20(eda_report)

---
## Summary

Run the cell below to see your overall progress on this notebook.

In [None]:
# The verification module was imported as `verify` in SETUP
verify.summary()