# Example Data Analysis with Standardized Visualization

This notebook demonstrates best practices for data analysis and visualization using seaborn and matplotlib.

## Table of Contents
1. [Package Installation](#installation)
2. [Setup and Imports](#setup)
3. [Data Loading](#loading)
4. [Data Summary and Exploration](#summary)
5. [Data Visualization](#visualization)
   - Bar Plot
   - Line Plot
   - Scatter Plot
6. [Conclusions](#conclusions)

<a id="installation"></a>
## 1. Package Installation

First, let's ensure all required packages are installed.

In [None]:
# Install required packages (recommended for Jupyter)
# This cell ensures all dependencies are available in the environment.
# If running in a restricted environment, comment out the following line.
%pip install -r requirements.txt
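
Before running the rest of the notebook, it can help to confirm that the packages are actually importable. The sketch below uses only the standard library. The helper name `find_missing_packages` and the package list are illustrative assumptions, since the contents of `requirements.txt` are not shown here.

```python
from importlib.util import find_spec
from typing import Iterable, List


def find_missing_packages(names: Iterable[str]) -> List[str]:
    """Return the import names from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Assumed list: the top-level import names used by the cells in this notebook.
required = ["pandas", "numpy", "seaborn", "matplotlib"]
missing = find_missing_packages(required)
if missing:
    print(f"Missing packages: {', '.join(missing)}")
else:
    print("All required packages are importable")
```

A check like this reports every missing package at once instead of failing on the first bad import.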

<a id="setup"></a>
## 2. Setup and Imports

Import necessary libraries and set up logging. We use seaborn and matplotlib for standardized visualizations.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import logging
from typing import Any, Dict, List, Tuple, Optional, Union

# Import custom utility functions
from src.utils import summarize_data

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
logger.info("Setting up analysis environment")

# Set default plot style for consistent visualizations
sns.set_theme(style="whitegrid")

# Set colorblind-friendly palette for accessibility
sns.set_palette("colorblind")

# Configure matplotlib defaults for consistency
plt.rcParams.update({
    'figure.figsize': (10, 6),
    'axes.labelsize': 12,
    'axes.titlesize': 14,
    'xtick.labelsize': 10,
    'ytick.labelsize': 10,
    'legend.fontsize': 10,
    'figure.dpi': 100
})
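
The `rcParams` update above changes the defaults for every later figure. When a single plot needs different settings, one option is matplotlib's `rc_context`, which applies overrides only inside a `with` block. This is a minimal sketch; the override values are arbitrary examples, not settings used elsewhere in this notebook.

```python
import matplotlib.pyplot as plt

# Remember the current default so we can confirm it is restored afterwards.
default_dpi = plt.rcParams["figure.dpi"]

# Settings inside the `with` block apply only until the block exits.
with plt.rc_context({"figure.dpi": 200, "axes.titlesize": 18}):
    dpi_inside = plt.rcParams["figure.dpi"]

dpi_after = plt.rcParams["figure.dpi"]
print(f"inside: {dpi_inside}, restored: {dpi_after == default_dpi}")
```

This keeps one-off styling from leaking into the standardized defaults.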

<a id="loading"></a>
## 3. Data Loading

We'll use the tips dataset from seaborn, which contains information about tips in a restaurant.

In [None]:
# Load the tips dataset from seaborn
df: pd.DataFrame = sns.load_dataset('tips')

# Log information about the loaded dataset
logger.info(f'Data loaded: {df.shape[0]} rows and {df.shape[1]} columns')

# Validate data was loaded correctly
assert df.shape[0] > 0, "Dataset is empty, no rows were loaded"
assert df.shape[1] > 0, "Dataset has no columns"
assert 'tip' in df.columns, "Expected 'tip' column not found in dataset"
assert 'total_bill' in df.columns, "Expected 'total_bill' column not found in dataset"

# Display the first few rows of the dataset
df.head()
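
The `assert` statements above stop at the first failure and are skipped entirely when Python runs with the `-O` flag. As a hedged alternative, the sketch below collects every missing column in one pass. The helper name `missing_columns` and the small stand-in frame are assumptions made for illustration.

```python
import pandas as pd
from typing import Iterable, List


def missing_columns(frame: pd.DataFrame, required: Iterable[str]) -> List[str]:
    """Return the required column names that are absent from `frame`."""
    return [col for col in required if col not in frame.columns]


# Tiny stand-in frame (made-up values) so the example runs without downloading data.
sample = pd.DataFrame({"total_bill": [10.0, 20.0], "tip": [1.5, 3.0]})

absent = missing_columns(sample, ["total_bill", "tip", "day"])
if absent:
    print(f"Missing expected columns: {absent}")
```

In the notebook itself, the same call could be made on `df` and followed by a `raise ValueError(...)` if the returned list is not empty.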

<a id="summary"></a>
## 4. Data Summary and Exploration

Let's explore the dataset to understand its structure and basic statistics.

In [None]:
def assess_data_quality(df: pd.DataFrame) -> Dict[str, Any]:
    """
    Perform a comprehensive data quality assessment on a DataFrame.
    
    Parameters:
        df: The DataFrame to assess
        
    Returns:
        A dictionary with data quality metrics
    """
    # Initialize results dictionary
    quality_metrics: Dict[str, Any] = {}
    
    # 1. Missing values
    quality_metrics['missing_values'] = df.isnull().sum().to_dict()
    quality_metrics['missing_percentage'] = (df.isnull().sum() / len(df) * 100).to_dict()
    
    # 2. Duplicate rows
    quality_metrics['duplicate_rows'] = df.duplicated().sum()
    quality_metrics['duplicate_percentage'] = (df.duplicated().sum() / len(df) * 100)
    
    # 3. Basic statistics for numerical columns
    numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
    quality_metrics['numerical_stats'] = {}
    
    for col in numerical_cols:
        quality_metrics['numerical_stats'][col] = {
            'min': df[col].min(),
            'max': df[col].max(),
            'mean': df[col].mean(),
            'median': df[col].median(),
            'std': df[col].std(),
            'zeros': (df[col] == 0).sum(),
            'negatives': (df[col] < 0).sum() if df[col].min() < 0 else 0
        }
    
    # 4. Categorical column analysis
    categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
    quality_metrics['categorical_stats'] = {}
    
    for col in categorical_cols:
        quality_metrics['categorical_stats'][col] = {
            'unique_values': df[col].nunique(),
            'most_common': df[col].value_counts().nlargest(3).to_dict()
        }
    
    return quality_metrics

# Run the data quality assessment
quality_results = assess_data_quality(df)

# Display the results in a readable format
print("\nData Quality Assessment Summary:")
print(f"Total rows: {df.shape[0]}, Total columns: {df.shape[1]}")

if sum(quality_results['missing_values'].values()) == 0:
    print("✓ No missing values detected")
else:
    print("⚠ Missing values detected")
    for col, count in quality_results['missing_values'].items():
        if count > 0:
            print(f"  - {col}: {count} missing ({quality_results['missing_percentage'][col]:.2f}%)")

if quality_results['duplicate_rows'] == 0:
    print("✓ No duplicate rows detected")
else:
    print(f"⚠ {quality_results['duplicate_rows']} duplicate rows detected ({quality_results['duplicate_percentage']:.2f}%)")

# Check for potential data issues
print("\nPotential Data Issues:")
issues_found = False

for col, stats in quality_results['numerical_stats'].items():
    if stats['negatives'] > 0 and col not in ['temperature', 'change', 'difference']:
        print(f"⚠ {stats['negatives']} negative values found in '{col}' column")
        issues_found = True
        
    if stats['zeros'] > 0 and col in ['total_bill', 'tip']:
        print(f"⚠ {stats['zeros']} zero values found in '{col}' column")
        issues_found = True
        
    # Check for potential outliers (values > 3 standard deviations)
    upper_bound = stats['mean'] + 3 * stats['std']
    lower_bound = stats['mean'] - 3 * stats['std']
    outliers_high = (df[col] > upper_bound).sum()
    outliers_low = (df[col] < lower_bound).sum()
    
    if outliers_high + outliers_low > 0:
        print(f"⚠ {outliers_high + outliers_low} potential outliers found in '{col}' column")
        issues_found = True

if not issues_found:
    print("✓ No obvious data issues detected")
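
The three-standard-deviation rule used above depends on the mean and standard deviation, which are themselves pulled toward extreme values. A commonly used alternative is Tukey's interquartile-range (IQR) rule. The sketch below is not part of the assessment function; the helper name `iqr_outlier_mask` and the sample values are assumptions for illustration.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up values: one point is far from the rest.
sample = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 50.0])
mask = iqr_outlier_mask(sample)
print(f"{mask.sum()} potential outlier(s): {sample[mask].tolist()}")
```

Because quartiles are less sensitive to a few extreme points, this rule can flag values that the standard-deviation rule misses in small or skewed samples.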

In [None]:
# Use our utility function to summarize the data
summarize_data(df)

In [None]:
# Test the summarize_data utility function
try:
    # First, assert the function exists and is callable
    assert callable(summarize_data), "summarize_data function is not callable"
    
    # Call the function
    summarize_data(df)
    
    logger.info("summarize_data function executed successfully")
except Exception as e:
    logger.error(f"Error testing summarize_data function: {e}")
    
    # Create a simple implementation if the imported one fails
    def local_summarize_data(df: pd.DataFrame) -> None:
        """
        Print summary statistics for a DataFrame.
        :param df: pandas DataFrame
        """
        logger.info(f"DataFrame shape: {df.shape}")
        display(df.head())
        display(df.describe())
    
    logger.info("Using local implementation of summarize_data")
    local_summarize_data(df)

In [None]:
# Check for missing values
missing_counts: pd.Series = df.isnull().sum()
total_missing: int = int(missing_counts.sum())

# Use logging instead of print statements
logger.info("Missing values check complete")

# Validate data quality
if total_missing > 0:
    logger.warning(f"Found {total_missing} missing values")
    display(missing_counts[missing_counts > 0])
    
    # Fail if too many values are missing (10% or more of all cells)
    total_cells: int = df.shape[0] * df.shape[1]
    missing_percentage: float = (total_missing / total_cells) * 100
    assert missing_percentage < 10, f"Too many missing values: {missing_percentage:.2f}% of data is missing"
else:
    logger.info("No missing values found")

### Data Distribution

Let's examine the correlations between the numerical columns in our dataset, then the distribution of each one.

In [None]:
def create_correlation_heatmap(df: pd.DataFrame) -> None:
    """
    Create a heatmap visualization of correlations between numeric variables.
    
    Parameters:
        df: DataFrame with numeric columns to analyze
    """
    # Select only numeric columns
    numeric_df = df.select_dtypes(include=['int64', 'float64'])
    
    # Calculate correlation matrix
    corr_matrix = numeric_df.corr()
    
    # Create heatmap
    plt.figure(figsize=(10, 8))
    mask = np.triu(np.ones_like(corr_matrix, dtype=bool))  # Create mask for upper triangle
    heatmap = sns.heatmap(
        corr_matrix, 
        annot=True,        # Show correlation values
        fmt=".2f",         # Format as 2 decimal places
        cmap="coolwarm",   # Color map (blue for negative, red for positive)
        mask=mask,         # Apply mask to hide upper triangle
        linewidths=0.5,    # Width of cell borders
        cbar_kws={"shrink": 0.8}  # Colorbar settings
    )
    plt.title('Correlation Matrix of Numerical Variables', fontsize=16)
    plt.tight_layout()
    plt.show()
    
    # Print strongest correlations
    # Get the upper triangle of the correlation matrix (excluding diagonal)
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    
    # Find the top 3 strongest correlations
    strongest_correlations = upper.abs().stack().nlargest(3)
    print("Strongest correlations:")
    for i, (idx, val) in enumerate(strongest_correlations.items(), 1):
        print(f"{i}. {idx[0]} — {idx[1]}: {val:.3f}")

# Create the correlation heatmap
create_correlation_heatmap(df)

In [None]:
def create_distribution_plots(dataframe: pd.DataFrame) -> None:
    """
    Create histogram plots for all numerical columns in the dataframe.
    
    Parameters:
        dataframe: The pandas DataFrame containing the data to visualize
        
    Returns:
        None - displays the plots directly
    """
    # Get list of numerical columns
    numerical_cols: List[str] = dataframe.select_dtypes(include=['float64', 'int64']).columns.tolist()
    
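    # Create a figure with subplots for each numerical column
    # squeeze=False keeps `axes` an array even when there is only one column
    fig, axes = plt.subplots(len(numerical_cols), 1, figsize=(10, 3*len(numerical_cols)), squeeze=False)
    axes = axes.ravel()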
    
    # Generate histogram with KDE for each numerical column
    for i, col in enumerate(numerical_cols):
        sns.histplot(data=dataframe, x=col, kde=True, ax=axes[i])
        axes[i].set_title(f'Distribution of {col}')
        
        # Add mean and median lines
        mean_val: float = dataframe[col].mean()
        median_val: float = dataframe[col].median()
        axes[i].axvline(mean_val, color='r', linestyle='--', label=f'Mean: {mean_val:.2f}')
        axes[i].axvline(median_val, color='g', linestyle='-.', label=f'Median: {median_val:.2f}')
        axes[i].legend()
    
    plt.tight_layout()
    plt.show()
    
# Create the distribution plots
create_distribution_plots(df)

<a id="visualization"></a>
## 5. Data Visualization

We'll create several visualizations to explore relationships in our data.

### Bar Plot: Average Tip by Day

This visualization shows the average tip amount for each day of the week.

In [None]:
def create_bar_plot(data: pd.DataFrame, x_col: str, y_col: str, title: str, 
                   order: Optional[List[str]] = None, palette: str = 'deep') -> plt.Figure:
    """
    Creates a standardized bar plot with error bars and value labels.
    
    Parameters:
        data: The pandas DataFrame containing the data
        x_col: The column name for the x-axis categories
        y_col: The column name for the y-axis values
        title: The title for the plot
        order: Optional list to specify the order of categories
        palette: The color palette to use
        
    Returns:
        The matplotlib Figure object
    """
    fig = plt.figure(figsize=(10, 6))
    # errorbar='sd' requires seaborn >= 0.12; older releases use the deprecated ci='sd'
    ax = sns.barplot(data=data, x=x_col, y=y_col, errorbar='sd', palette=palette, order=order)
    
    # Add mean values on top of each bar
    for i, p in enumerate(ax.patches):
        height = p.get_height()
        ax.text(p.get_x() + p.get_width()/2, height + 0.1, f'${height:.2f}', ha='center')
    
    plt.title(title, fontsize=16)
    plt.ylabel(f'{y_col} Amount ($)', fontsize=12)
    plt.xlabel(x_col, fontsize=12)
    plt.grid(True, axis='y', linestyle='--', alpha=0.7)
    plt.tight_layout()
    
    return fig

### Tip Percentage Analysis

Let's analyze the tip as a percentage of the total bill to normalize across different bill amounts.

In [None]:
# Calculate tip percentage
df['tip_pct'] = (df['tip'] / df['total_bill']) * 100

# Define a function to create tip percentage visualization
def analyze_tip_percentage(data: pd.DataFrame) -> None:
    """
    Analyze and visualize tip percentages across different factors.
    
    Parameters:
        data: DataFrame with tip_pct column
    """
    # Create figure for tip percentage by various factors
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    
    # 1. Tip percentage by day
    sns.boxplot(data=data, x='day', y='tip_pct', ax=axes[0, 0], order=['Thur', 'Fri', 'Sat', 'Sun'])
    axes[0, 0].set_title('Tip Percentage by Day')
    axes[0, 0].set_ylabel('Tip %')
    
    # 2. Tip percentage by gender
    sns.boxplot(data=data, x='sex', y='tip_pct', ax=axes[0, 1])
    axes[0, 1].set_title('Tip Percentage by Gender')
    axes[0, 1].set_ylabel('Tip %')
    
    # 3. Tip percentage by smoking status
    sns.boxplot(data=data, x='smoker', y='tip_pct', ax=axes[1, 0])
    axes[1, 0].set_title('Tip Percentage by Smoker Status')
    axes[1, 0].set_ylabel('Tip %')
    
    # 4. Tip percentage by time
    sns.boxplot(data=data, x='time', y='tip_pct', ax=axes[1, 1])
    axes[1, 1].set_title('Tip Percentage by Time')
    axes[1, 1].set_ylabel('Tip %')
    
    plt.tight_layout()
    plt.show()
    
    # Calculate average tip percentages by groups and log them
    logger.info("Average tip percentages by group:")
    for factor in ['day', 'time', 'sex', 'smoker']:
        # observed=True is passed explicitly because the grouping columns are categorical
        avg_by_group = data.groupby(factor, observed=True)['tip_pct'].mean()
        display(pd.DataFrame(avg_by_group).rename(columns={'tip_pct': f'Avg Tip % by {factor}'}))
        
# Run the tip percentage analysis
analyze_tip_percentage(df)

# Validate the tip percentages
assert df['tip_pct'].max() < 100, "Tip percentage should not exceed 100%"
assert df['tip_pct'].min() >= 0, "Tip percentage should not be negative"
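
The percentage calculation divides by `total_bill`, so a zero bill would produce `inf` or `NaN` and could trip the assertions above. The sketch below shows one way to guard against that on made-up data. The sample frame is an assumption for illustration and is not taken from the tips dataset.

```python
import numpy as np
import pandas as pd

# Made-up bills, including a zero bill that would otherwise divide by zero.
sample = pd.DataFrame({"total_bill": [20.0, 50.0, 0.0], "tip": [3.0, 10.0, 0.0]})

# Replace zero bills with NaN so the ratio is NaN rather than inf.
safe_bill = sample["total_bill"].replace(0, np.nan)
sample["tip_pct"] = sample["tip"] / safe_bill * 100

print(sample)
```

Rows with a `NaN` percentage can then be dropped or reported separately before plotting.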

In [None]:
plt.figure(figsize=(10, 6))
# errorbar='sd' requires seaborn >= 0.12; older releases use the deprecated ci='sd'
ax = sns.barplot(data=df, x='day', y='tip', errorbar='sd', palette='deep', order=['Thur', 'Fri', 'Sat', 'Sun'])

# Add mean tip values on top of each bar
for i, p in enumerate(ax.patches):
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2, height + 0.1, f'${height:.2f}', ha='center')

plt.title('Average Tip by Day of Week', fontsize=16)
plt.ylabel('Tip Amount ($)', fontsize=12)
plt.xlabel('Day of Week', fontsize=12)
plt.tight_layout()
plt.show()

In [None]:
# Create bar plot with days in correct order
day_order: List[str] = ['Thur', 'Fri', 'Sat', 'Sun']

# Use our utility function for consistent styling
fig = create_bar_plot(
    data=df,
    x_col='day',
    y_col='tip',
    title='Average Tip by Day of Week',
    order=day_order
)

# Add annotations to highlight weekend vs weekday difference
weekday_avg: float = df[df['day'].isin(['Thur', 'Fri'])]['tip'].mean()
weekend_avg: float = df[df['day'].isin(['Sat', 'Sun'])]['tip'].mean()
diff_pct: float = ((weekend_avg - weekday_avg) / weekday_avg) * 100

# Add annotation to highlight the weekend vs weekday difference
plt.annotate(
    f'Weekend tips are {diff_pct:.1f}% higher than weekdays', 
    xy=(0.5, 0.9), 
    xycoords='axes fraction',
    bbox=dict(boxstyle='round,pad=0.5', fc='yellow', alpha=0.5),
    ha='center'
)

# Show the plot
plt.show()

# Perform statistical verification with assertions
assert weekend_avg > weekday_avg, "Weekend tips should be higher than weekday tips"

### Line Plot: Average Total Bill by Day and Time

This visualization shows how the average bill varies by day and time (lunch vs dinner).

In [None]:
# Calculate average bill by day and time
# observed=True keeps only the day/time combinations that occur in the data
avg_bill = df.groupby(['day', 'time'], observed=True)['total_bill'].mean().reset_index()
avg_bill['day'] = pd.Categorical(avg_bill['day'], categories=['Thur', 'Fri', 'Sat', 'Sun'], ordered=True)
avg_bill = avg_bill.sort_values('day')

plt.figure(figsize=(10, 6))
sns.lineplot(data=avg_bill, x='day', y='total_bill', hue='time', marker='o', markersize=10)
plt.title('Average Total Bill by Day and Time', fontsize=16)
plt.ylabel('Average Total Bill ($)', fontsize=12)
plt.xlabel('Day of Week', fontsize=12)
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
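
In the tips data, `day` and `time` are categorical columns, and grouping by categoricals is sensitive to the `observed` argument of `groupby`. The default has changed across pandas versions, so passing it explicitly is the safer choice. The sketch below uses a made-up frame to show the difference.

```python
import pandas as pd

# Made-up records: no Thursday dinner and no Friday lunch.
sample = pd.DataFrame({
    "day": pd.Categorical(["Thur", "Thur", "Fri"], categories=["Thur", "Fri"]),
    "time": pd.Categorical(["Lunch", "Lunch", "Dinner"], categories=["Lunch", "Dinner"]),
    "total_bill": [10.0, 20.0, 30.0],
})

# observed=False keeps every category combination; unseen ones get NaN means.
all_combos = sample.groupby(["day", "time"], observed=False)["total_bill"].mean().reset_index()

# observed=True keeps only combinations that actually occur in the data.
seen_combos = sample.groupby(["day", "time"], observed=True)["total_bill"].mean().reset_index()

print(f"observed=False rows: {len(all_combos)}, NaN means: {all_combos['total_bill'].isna().sum()}")
print(f"observed=True rows: {len(seen_combos)}")
```

Counting rows to detect missing combinations only works with `observed=True`. With `observed=False`, the missing combinations are present as rows with `NaN` values.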

### Line Plot: Average Total Bill by Day and Time (Annotated)

This version of the line plot adds a value annotation to each point and logs a warning when a day/time combination is missing from the data.

In [None]:
def prepare_day_time_data(df: pd.DataFrame) -> pd.DataFrame:
    """
    Prepares data for day/time analysis by grouping and sorting appropriately.
    
    Parameters:
        df: Input DataFrame with day and time columns
        
    Returns:
        A grouped and sorted DataFrame with average bills by day and time
    """
    # Calculate average bill by day and time.
    # observed=True drops day/time combinations that never occur, so the
    # missing-combination check below can detect them by counting rows.
    grouped_data: pd.DataFrame = (
        df.groupby(['day', 'time'], observed=True)['total_bill'].mean().reset_index()
    )
    
    # Ensure days are in correct order (Thursday through Sunday)
    grouped_data['day'] = pd.Categorical(
        grouped_data['day'], 
        categories=['Thur', 'Fri', 'Sat', 'Sun'], 
        ordered=True
    )
    
    # Sort by the ordered day column
    return grouped_data.sort_values('day')

# Get the prepared data
avg_bill: pd.DataFrame = prepare_day_time_data(df)

# Create the line plot
plt.figure(figsize=(10, 6))
sns.lineplot(
    data=avg_bill, 
    x='day', 
    y='total_bill', 
    hue='time', 
    marker='o', 
    markersize=10,
    linewidth=3
)

# Customize the plot
plt.title('Average Total Bill by Day and Time', fontsize=16)
plt.ylabel('Average Total Bill ($)', fontsize=12)
plt.xlabel('Day of Week', fontsize=12)
plt.grid(True, linestyle='--', alpha=0.7)

# Add value annotations on each point
for day in ['Thur', 'Fri', 'Sat', 'Sun']:
    for time in ['Lunch', 'Dinner']:
        try:
            value = avg_bill[(avg_bill['day'] == day) & (avg_bill['time'] == time)]['total_bill'].values[0]
            plt.annotate(
                f'${value:.2f}',
                xy=(day, value),
                xytext=(0, 5),
                textcoords='offset points',
                ha='center'
            )
        except IndexError:
            # Skip if combination doesn't exist
            pass

# Check for missing day/time combinations
expected_combinations = 8  # 4 days x 2 times
actual_combinations = len(avg_bill)
if actual_combinations < expected_combinations:
    logger.warning(f"Missing {expected_combinations - actual_combinations} day/time combinations")

plt.tight_layout()
plt.show()

### Scatter Plot: Total Bill vs. Tip with Regression Line

This visualization shows the relationship between total bill and tip amount, colored by gender and with different markers for smokers and non-smokers.

In [None]:
plt.figure(figsize=(10, 6))
ax = sns.scatterplot(data=df, x='total_bill', y='tip', hue='sex', style='smoker', palette='viridis')

# Add regression line
sns.regplot(data=df, x='total_bill', y='tip', scatter=False, ax=ax, line_kws={'color': 'red', 'linewidth': 2})

# Calculate correlation coefficient
corr = df[['total_bill', 'tip']].corr().iloc[0, 1]
plt.annotate(f'Correlation: {corr:.2f}', xy=(0.05, 0.95), xycoords='axes fraction', 
             bbox=dict(boxstyle='round,pad=0.5', fc='yellow', alpha=0.5))

plt.title('Relationship Between Total Bill and Tip Amount', fontsize=16)
plt.ylabel('Tip Amount ($)', fontsize=12)
plt.xlabel('Total Bill ($)', fontsize=12)
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
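
The correlation annotated on the plot is the Pearson coefficient, which is what `DataFrame.corr()` computes by default. As a worked illustration, the sketch below computes it by hand on made-up numbers and compares the result with NumPy's built-in. The sample arrays are assumptions and are not taken from the tips dataset.

```python
import numpy as np

# Made-up bill and tip amounts for illustration only.
bill = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
tip = np.array([1.0, 3.5, 4.0, 7.0, 7.5])

# Pearson r = sum of centered products / sqrt(product of centered sums of squares).
bill_centered = bill - bill.mean()
tip_centered = tip - tip.mean()
r_manual = (bill_centered * tip_centered).sum() / np.sqrt(
    (bill_centered ** 2).sum() * (tip_centered ** 2).sum()
)

# NumPy's built-in result should agree to floating-point precision.
r_numpy = np.corrcoef(bill, tip)[0, 1]
print(f"manual: {r_manual:.4f}, numpy: {r_numpy:.4f}")
```

Pearson's coefficient measures linear association only, so a scatter plot remains useful for spotting curved or clustered patterns that a single number would hide.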

<a id="conclusions"></a>
## 6. Conclusions

From our analysis, we can draw several meaningful insights:

1. **Bill-Tip Correlation**: There is a statistically significant positive correlation between total bill amount and tip amount (r = 0.68), indicating that customers generally tip more when their bill is higher.

2. **Temporal Patterns**: Weekend days (Saturday and Sunday) consistently show higher average tips than weekdays (Thursday and Friday), suggesting different tipping behavior during different parts of the week.

3. **Meal Type Influence**: Dinner bills tend to be higher than lunch bills on the days where both meals appear in the data (Thursday and Friday). The dataset contains no lunch records for Saturday or Sunday, so no lunch/dinner comparison is possible for the weekend.

4. **Group Differences**: We observed some differences in tipping patterns between demographic groups, though further analysis would be needed to determine statistical significance of these differences.
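
One way to check whether a difference between two groups is statistically significant is Welch's t-test, available as `scipy.stats.ttest_ind` with `equal_var=False`. The sketch below runs it on synthetic data, not on the tips dataset. The group means, spreads, sizes, and the 0.05 threshold are all assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic tip amounts for two hypothetical groups (not the tips dataset).
rng = np.random.default_rng(seed=0)
group_a = rng.normal(loc=3.0, scale=1.0, size=80)
group_b = rng.normal(loc=3.1, scale=1.2, size=90)

# Welch's t-test: equal_var=False avoids assuming equal group variances.
result = stats.ttest_ind(group_a, group_b, equal_var=False)
t_stat, p_value = result.statistic, result.pvalue

alpha = 0.05  # conventional threshold, chosen here for illustration
verdict = "significant" if p_value < alpha else "not significant"
print(f"t = {t_stat:.3f}, p = {p_value:.3f} ({verdict} at alpha = {alpha})")
```

On the notebook's data, the two samples could be the `tip_pct` values for each level of `sex`, `smoker`, or `time`. Testing several factors at once would call for a multiple-comparison correction.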

### Data Quality Assessment

- The dataset is complete with no missing values
- Sample size (n=244) is sufficient for basic statistical analysis
- No extreme outliers were detected that would significantly distort our findings

### Next Steps

To further enhance our understanding of tipping behavior, we could:

1. Analyze the percentage tip rather than absolute amount to normalize for bill size
2. Investigate if party size affects tipping behavior
3. Build a predictive model for tip amount based on other variables
4. Conduct hypothesis tests to confirm if observed group differences are statistically significant
5. Collect additional data on customer satisfaction to correlate with tipping behavior

---

### Enhanced Scatter Plot Analysis

In [None]:
def create_enhanced_scatter(df: pd.DataFrame, x_col: str, y_col: str, 
                            hue_col: str, style_col: str) -> Tuple[plt.Figure, float]:
    """
    Creates an enhanced scatter plot with regression line and correlation statistics.
    
    Parameters:
        df: DataFrame containing the data
        x_col: Column name for x-axis values
        y_col: Column name for y-axis values
        hue_col: Column name for color grouping
        style_col: Column name for marker style grouping
        
    Returns:
        Figure object and correlation coefficient
    """
    # Create figure and plot scatter points
    fig = plt.figure(figsize=(12, 8))
    ax = sns.scatterplot(
        data=df, 
        x=x_col, 
        y=y_col, 
        hue=hue_col, 
        style=style_col, 
        palette='viridis',
        s=100,  # Larger point size
        alpha=0.7  # Semi-transparency
    )
    
    # Add regression line for all data
    sns.regplot(
        data=df, 
        x=x_col, 
        y=y_col, 
        scatter=False, 
        ax=ax, 
        line_kws={'color': 'red', 'linewidth': 2}
    )
    
    # Calculate and return correlation coefficient
    corr: float = df[[x_col, y_col]].corr().iloc[0, 1]
    return fig, corr

# Create the enhanced scatter plot
fig, correlation = create_enhanced_scatter(
    df=df,
    x_col='total_bill',
    y_col='tip',
    hue_col='sex',
    style_col='smoker'
)

# Add statistics and insights to the plot
plt.annotate(
    f'Correlation: {correlation:.2f}', 
    xy=(0.05, 0.95), 
    xycoords='axes fraction', 
    bbox=dict(boxstyle='round,pad=0.5', fc='yellow', alpha=0.5)
)

# Add linear formula annotation
# Calculate slope and intercept of regression line
from scipy import stats
slope, intercept, r_value, p_value, std_err = stats.linregress(df['total_bill'], df['tip'])
plt.annotate(
    f'Tip = {slope:.2f} × Bill + {intercept:.2f}\np-value: {p_value:.4f}', 
    xy=(0.05, 0.87), 
    xycoords='axes fraction', 
    bbox=dict(boxstyle='round,pad=0.5', fc='lightblue', alpha=0.5)
)

# Create group statistics
group_stats = df.groupby('sex')['tip'].agg(['mean', 'median', 'count'])
tip_percentage = df.assign(tip_pct=df['tip']/df['total_bill']*100)
avg_tip_pct = tip_percentage.groupby('sex')['tip_pct'].mean()

# Add group statistics annotation
stat_text = "Group Statistics:\n"
for group in group_stats.index:
    stat_text += f"{group}: ${group_stats.loc[group, 'mean']:.2f} avg tip ({avg_tip_pct[group]:.1f}%)\n"

plt.annotate(
    stat_text, 
    xy=(0.65, 0.85), 
    xycoords='axes fraction', 
    bbox=dict(boxstyle='round,pad=0.5', fc='lightgreen', alpha=0.5)
)

# Customize the plot
plt.title('Relationship Between Total Bill and Tip Amount', fontsize=16)
plt.ylabel('Tip Amount ($)', fontsize=12)
plt.xlabel('Total Bill ($)', fontsize=12)
plt.grid(True, linestyle='--', alpha=0.7)

# Add assertions to validate the analysis
assert correlation > 0, "Expected positive correlation between bill and tip"
assert p_value < 0.05, "Expected statistically significant relationship"

plt.tight_layout()
plt.show()

# Log statistical findings
logger.info(f"Correlation between total bill and tip: {correlation:.2f}")
logger.info(f"Linear regression: Tip = {slope:.2f} × Bill + {intercept:.2f}")
logger.info(f"Regression p-value: {p_value:.4f}")

---

*Note: This example notebook follows best practices for Jupyter notebooks including markdown documentation, type hints, proper cell organization, and standardized visualizations.*