# Analysis of Tips Dataset from Seaborn

This notebook analyzes the "tips" dataset from Seaborn, which contains information about tips received by a waiter over a period of a few months.

## 1. Import Libraries and Load Data

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Set style for better plots
sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

# Load the dataset
df = sns.load_dataset('tips')

## 2. Data Description and Overview

In [None]:
print("=== DATASET OVERVIEW ===")
print(f"Sample size: {df.shape[0]} observations")
print(f"Number of variables: {df.shape[1]}")

print("\nFirst 10 rows:")
display(df.head(10))

print("\nDataset info:")
df.info()

In [None]:
print("\n=== VARIABLE DESCRIPTION ===")
print("1. total_bill: Continuous numerical (dollars) - The total bill amount")
print("2. tip: Continuous numerical (dollars) - The tip amount")
print("3. sex: Categorical nominal (Male, Female) - Gender of bill payer")
print("4. smoker: Categorical nominal (Yes, No) - Whether party included smokers")
print("5. day: Categorical ordinal (Thur, Fri, Sat, Sun) - Day of the week")
print("6. time: Categorical nominal (Lunch, Dinner) - Meal time")
print("7. size: Discrete numerical (1-6) - Number of people in party")

print("\nBasic statistics:")
display(df.describe(include='all'))

## 3. Univariate Analysis

In [None]:
# Create subplots for univariate analysis
fig, axes = plt.subplots(3, 3, figsize=(15, 12))

# Numerical variables - Histograms
sns.histplot(data=df, x='total_bill', ax=axes[0,0], kde=True)
axes[0,0].set_title('Distribution of Total Bill')

sns.histplot(data=df, x='tip', ax=axes[0,1], kde=True)
axes[0,1].set_title('Distribution of Tip Amount')

sns.histplot(data=df, x='size', ax=axes[0,2], discrete=True)
axes[0,2].set_title('Distribution of Party Size')

# Categorical variables - Count plots
sns.countplot(data=df, x='sex', ax=axes[1,0])
axes[1,0].set_title('Gender Distribution')

sns.countplot(data=df, x='smoker', ax=axes[1,1])
axes[1,1].set_title('Smoker Status')

sns.countplot(data=df, x='day', ax=axes[1,2], order=['Thur','Fri','Sat','Sun'])
axes[1,2].set_title('Day of Week Distribution')

sns.countplot(data=df, x='time', ax=axes[2,0])
axes[2,0].set_title('Meal Time Distribution')

# Remove empty subplots
axes[2,1].set_visible(False)
axes[2,2].set_visible(False)

plt.tight_layout()
plt.show()

### Interpretation of Univariate Plots:
- **total_bill**: Right-skewed distribution with most bills between $10-$25
- **tip**: Right-skewed distribution with most tips between $2-$4
- **size**: Discrete distribution with mode at 2 people (most common party size)
- **sex**: More male bill payers than female
- **smoker**: Non-smoking parties clearly outnumber smoking parties (about 62% of records)
- **day**: Saturday is busiest day, Friday is quietest
- **time**: Dinner is much more popular than lunch
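
The "right-skewed" readings above can be checked numerically rather than by eye. The following is a minimal plain-Python sketch of the sample skewness (third standardized moment). The helper name `sample_skewness` and the toy bill values are illustrative assumptions, not part of the notebook or the dataset.

```python
from statistics import fmean

def sample_skewness(values):
    """Biased sample skewness g1 = m3 / m2**1.5 (positive means right-skewed)."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5  # undefined if all values are identical (m2 == 0)

# Hypothetical bill amounts with a long right tail (not taken from the dataset)
toy_bills = [10, 12, 13, 15, 16, 18, 20, 45]
print(f"Skewness: {sample_skewness(toy_bills):.2f}")
```

In the notebook itself, `df['total_bill'].skew()` should give a comparable figure, although pandas applies a small-sample adjustment, so the two values will not be identical.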

## 4. Summary Statistics for Numerical Variables

In [None]:
print("=== SUMMARY STATISTICS FOR NUMERICAL VARIABLES ===")

# For total_bill (continuous, skewed)
print("\nTotal Bill:")
total_bill_stats = df['total_bill'].describe()
print(f"Median: ${total_bill_stats['50%']:.2f} (robust measure of center)")
print(f"IQR: ${total_bill_stats['75%'] - total_bill_stats['25%']:.2f} (robust measure of spread)")
print(f"Mean: ${total_bill_stats['mean']:.2f}")
print(f"Std: ${total_bill_stats['std']:.2f}")
print("Justification: Using median and IQR due to right-skewed distribution")

# For tip (continuous, skewed)
print("\nTip Amount:")
tip_stats = df['tip'].describe()
print(f"Median: ${tip_stats['50%']:.2f} (robust measure of center)")
print(f"IQR: ${tip_stats['75%'] - tip_stats['25%']:.2f} (robust measure of spread)")
print(f"Mean: ${tip_stats['mean']:.2f}")
print(f"Std: ${tip_stats['std']:.2f}")
print("Justification: Using median and IQR due to right-skewed distribution")

# For size (discrete)
print("\nParty Size:")
size_stats = df['size'].describe()
print(f"Mode: {df['size'].mode().values[0]} (most appropriate for discrete data)")
print(f"Mean: {size_stats['mean']:.2f}")
print(f"Std: {size_stats['std']:.2f}")
print("Justification: Party size is a small discrete count, so the mode describes the typical party directly")
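
The justification for preferring median and IQR on skewed data can be illustrated with a small experiment. This is a sketch under stated assumptions: `robust_summary` is a hypothetical helper, the bill values are invented, and `method="inclusive"` is chosen because it should match the linear interpolation pandas uses by default.

```python
from statistics import fmean, median, quantiles

def robust_summary(values):
    """Return (median, IQR) using linear interpolation between order statistics."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), q3 - q1

# Hypothetical bills; the second list inflates only the largest value
toy_bills = [10, 12, 15, 18, 20, 22, 25, 50]
with_outlier = toy_bills[:-1] + [500]

for label, data in [("original", toy_bills), ("with outlier", with_outlier)]:
    med, iqr = robust_summary(data)
    print(f"{label}: mean={fmean(data):.2f}, median={med}, IQR={iqr}")
```

The mean moves substantially when one bill is inflated, while the median and IQR stay the same, which is the behavior the justification relies on.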

## 5. Bivariate Analysis

In [None]:
# Bivariate plots
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# Numerical vs Numerical
sns.scatterplot(data=df, x='total_bill', y='tip', ax=axes[0,0])
axes[0,0].set_title('Total Bill vs Tip')

sns.boxplot(data=df, x='size', y='total_bill', ax=axes[0,1])
axes[0,1].set_title('Total Bill by Party Size')

sns.boxplot(data=df, x='size', y='tip', ax=axes[0,2])
axes[0,2].set_title('Tip by Party Size')

# Categorical vs Numerical
sns.boxplot(data=df, x='sex', y='tip', ax=axes[1,0])
axes[1,0].set_title('Tip by Gender')

sns.boxplot(data=df, x='smoker', y='tip', ax=axes[1,1])
axes[1,1].set_title('Tip by Smoker Status')

sns.boxplot(data=df, x='time', y='total_bill', ax=axes[1,2])
axes[1,2].set_title('Total Bill by Meal Time')

plt.tight_layout()
plt.show()

In [None]:
# Categorical vs Categorical
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

sns.heatmap(pd.crosstab(df['sex'], df['smoker']), annot=True, fmt='d', ax=axes[0])
axes[0].set_title('Gender vs Smoker Status')

sns.heatmap(pd.crosstab(df['day'], df['time']), annot=True, fmt='d', ax=axes[1])
axes[1].set_title('Day vs Meal Time')

plt.tight_layout()
plt.show()

### Interpretation of Bivariate Plots:
- **total_bill vs tip**: Positive relationship (Pearson r is about 0.68) - higher bills tend to have higher tips
- **size vs total_bill**: Larger parties tend to have higher bills
- **size vs tip**: Larger parties tend to give higher tips
- **sex vs tip**: Males appear to give slightly higher tips on average
- **smoker vs tip**: Similar tip amounts between smokers and non-smokers
- **time vs total_bill**: Dinner bills are generally higher than lunch bills
- **day vs time**: Lunch is recorded only on Thur and Fri, while dinner is recorded on all four days (Sat and Sun are dinner-only)
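
The strength of the bill and tip relationship read off the scatter plot can be expressed as a Pearson correlation coefficient. Below is a minimal plain-Python sketch of that formula; `pearson_r` is a hypothetical helper and the paired values are invented for illustration.

```python
from math import sqrt
from statistics import fmean

def pearson_r(x, y):
    """Pearson correlation: co-deviation divided by the product of deviations."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical (bill, tip) pairs, not taken from the dataset
toy_bills = [10.0, 15.0, 20.0, 25.0, 30.0]
toy_tips = [1.5, 2.5, 2.8, 4.0, 4.2]
print(f"r = {pearson_r(toy_bills, toy_tips):.3f}")
```

On the real data, `df['total_bill'].corr(df['tip'])` computes the same quantity.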

## 6. Multivariate Analysis

In [None]:
# Multivariate plots
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Three variables: total_bill, tip, and sex
sns.scatterplot(data=df, x='total_bill', y='tip', hue='sex', style='sex', ax=axes[0,0])
axes[0,0].set_title('Total Bill vs Tip by Gender')

# Three variables: total_bill, tip, and time
sns.scatterplot(data=df, x='total_bill', y='tip', hue='time', style='time', ax=axes[0,1])
axes[0,1].set_title('Total Bill vs Tip by Meal Time')

# Three variables: total_bill, tip, and day
sns.scatterplot(data=df, x='total_bill', y='tip', hue='day', style='day', ax=axes[1,0])
axes[1,0].set_title('Total Bill vs Tip by Day')

# Four variables: total_bill, tip, sex, and smoker
sns.scatterplot(data=df, x='total_bill', y='tip', hue='sex', style='smoker', ax=axes[1,1])
axes[1,1].set_title('Total Bill vs Tip by Gender and Smoker Status')

plt.tight_layout()
plt.show()

In [None]:
# Facet grid for more complex multivariate analysis
g = sns.FacetGrid(df, col='time', row='sex', hue='smoker', height=4)
g.map(sns.scatterplot, 'total_bill', 'tip', alpha=0.7)
g.add_legend()
plt.show()

### Interpretation of Multivariate Plots:
- The positive relationship between total bill and tip looks similar across genders, days, and meal times
- No clear interaction effects are visible: the points for each subgroup follow roughly the same upward trend
- Scatter plots alone cannot rule out smaller subgroup differences; confirming or excluding them would require a model with interaction terms
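
One way to make "consistent across subgroups" more concrete is to fit a separate least-squares slope of tip on total bill for each subgroup and compare them. The sketch below assumes simple linear fits; `ols_slope`, `slopes_by_group`, and the toy records are hypothetical and not part of the notebook.

```python
from collections import defaultdict
from statistics import fmean

def ols_slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def slopes_by_group(records):
    """records: iterable of (group, total_bill, tip) tuples."""
    grouped = defaultdict(lambda: ([], []))
    for group, bill, tip in records:
        grouped[group][0].append(bill)
        grouped[group][1].append(tip)
    return {group: ols_slope(xs, ys) for group, (xs, ys) in grouped.items()}

# Hypothetical records: (meal time, total bill, tip)
toy_records = [
    ("Lunch", 10.0, 1.5), ("Lunch", 20.0, 3.0), ("Lunch", 30.0, 4.5),
    ("Dinner", 10.0, 1.0), ("Dinner", 20.0, 3.0), ("Dinner", 40.0, 7.0),
]
for group, slope in slopes_by_group(toy_records).items():
    print(f"{group}: {slope:.2f} dollars of tip per dollar of bill")
```

Similar slopes across groups would support the "no clear interaction" reading, while clearly different slopes would suggest an interaction worth testing formally.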

## 7. Hypothesis Testing

### Test 1: Tip Amount by Gender

In [None]:
print("=== HYPOTHESIS TEST 1: TIP AMOUNT BY GENDER ===")

# Split data by gender
tips_male = df[df['sex'] == 'Male']['tip']
tips_female = df[df['sex'] == 'Female']['tip']

print(f"Male sample size: {len(tips_male)}")
print(f"Female sample size: {len(tips_female)}")
print(f"Male mean tip: ${tips_male.mean():.2f}")
print(f"Female mean tip: ${tips_female.mean():.2f}")

# Check assumptions for t-test
print("\nAssumption checking:")
print(f"Normality test for male tips (p-value): {stats.shapiro(tips_male)[1]:.4f}")
print(f"Normality test for female tips (p-value): {stats.shapiro(tips_female)[1]:.4f}")
print(f"Equal variance test (p-value): {stats.levene(tips_male, tips_female)[1]:.4f}")

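# Tips are right-skewed, so a rank-based (non-parametric) test is used
# when the Shapiro-Wilk p-values above fall below 0.05.
# Note: stats.ranksums is the Wilcoxon rank-sum test (equivalent to Mann-Whitney U)
# and does not correct for ties; stats.mannwhitneyu is the tie-aware alternative.
print("\nUsing Wilcoxon rank-sum (Mann-Whitney) test (non-parametric):")
mw_stat, mw_p = stats.ranksums(tips_male, tips_female)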
print(f"Test statistic: {mw_stat:.4f}")
print(f"P-value: {mw_p:.4f}")

print("\nHypotheses:")
print("H0: There is no difference in tip amounts between males and females")
print("H1: There is a difference in tip amounts between males and females")

if mw_p < 0.05:
    print("\nConclusion: Reject H0 - There is a significant difference in tip amounts between genders")
else:
    print("\nConclusion: Fail to reject H0 - No significant difference in tip amounts between genders")
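
Tip amounts contain repeated values (round-dollar tips, for example), so a rank-based comparison has to decide how to rank ties. The sketch below shows one common approach: midranks plus a tie-corrected normal approximation, without a continuity correction. `midranks` and `mann_whitney_u` are hypothetical helpers, the samples are invented, and the p-value is not expected to match SciPy's output exactly.

```python
from collections import Counter
from math import erfc, sqrt

def midranks(values):
    """Assign 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """Return (U for sample x, two-sided p-value from a tie-corrected normal approximation)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    combined = list(x) + list(y)
    ranks = midranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    tie_term = sum(t ** 3 - t for t in Counter(combined).values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u1 - n1 * n2 / 2) / sqrt(variance)  # fails if every value is identical
    return u1, erfc(abs(z) / sqrt(2))

# Hypothetical tip samples with a three-way tie at 2.0
group_a = [2.0, 2.0, 3.0]
group_b = [2.0, 4.0, 5.0]
u_stat, p_val = mann_whitney_u(group_a, group_b)
print(f"U = {u_stat}, p = {p_val:.3f}")
```

With samples this small an exact test would be more appropriate; the normal approximation is shown only because it is the form that applies to samples of the size used in this notebook.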

### Test 2: Total Bill by Meal Time

In [None]:
print("\n=== HYPOTHESIS TEST 2: TOTAL BILL BY MEAL TIME ===")

# Split data by meal time
bill_lunch = df[df['time'] == 'Lunch']['total_bill']
bill_dinner = df[df['time'] == 'Dinner']['total_bill']

print(f"Lunch sample size: {len(bill_lunch)}")
print(f"Dinner sample size: {len(bill_dinner)}")
print(f"Lunch mean bill: ${bill_lunch.mean():.2f}")
print(f"Dinner mean bill: ${bill_dinner.mean():.2f}")

# Check assumptions
print("\nAssumption checking:")
print(f"Normality test for lunch bills (p-value): {stats.shapiro(bill_lunch)[1]:.4f}")
print(f"Normality test for dinner bills (p-value): {stats.shapiro(bill_dinner)[1]:.4f}")
print(f"Equal variance test (p-value): {stats.levene(bill_lunch, bill_dinner)[1]:.4f}")

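# Bills are right-skewed, so a rank-based (non-parametric) test is used
# when the Shapiro-Wilk p-values above fall below 0.05.
# Note: stats.ranksums is the Wilcoxon rank-sum test (equivalent to Mann-Whitney U)
# and does not correct for ties; stats.mannwhitneyu is the tie-aware alternative.
print("\nUsing Wilcoxon rank-sum (Mann-Whitney) test (non-parametric):")
mw_stat2, mw_p2 = stats.ranksums(bill_lunch, bill_dinner)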
print(f"Test statistic: {mw_stat2:.4f}")
print(f"P-value: {mw_p2:.4f}")

print("\nHypotheses:")
print("H0: There is no difference in total bills between lunch and dinner")
print("H1: There is a difference in total bills between lunch and dinner")

if mw_p2 < 0.05:
    print("\nConclusion: Reject H0 - There is a significant difference in total bills between meal times")
else:
    print("\nConclusion: Fail to reject H0 - No significant difference in total bills between meal times")

### Test 3: Association between Gender and Smoker Status

In [None]:
print("\n=== HYPOTHESIS TEST 3: ASSOCIATION BETWEEN GENDER AND SMOKER STATUS ===")

# Create contingency table
contingency_table = pd.crosstab(df['sex'], df['smoker'])
print("Contingency table:")
display(contingency_table)

# Chi-squared test for independence
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
print(f"\nChi-squared test:")
print(f"Chi-squared statistic: {chi2:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")

print("\nHypotheses:")
print("H0: There is no association between gender and smoker status")
print("H1: There is an association between gender and smoker status")

if p_value < 0.05:
    print("\nConclusion: Reject H0 - There is a significant association between gender and smoker status")
else:
    print("\nConclusion: Fail to reject H0 - No significant association between gender and smoker status")
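
To show what the chi-squared test of independence computes, here is a plain-Python sketch for a 2x2 table: expected counts from the margins, the Pearson statistic, and a p-value for one degree of freedom. `chi2_2x2` and the table values are hypothetical. The sketch applies no Yates continuity correction, whereas `stats.chi2_contingency` applies one by default for 2x2 tables, so the two will generally report different numbers.

```python
from math import erfc, sqrt

def chi2_2x2(table):
    """Pearson chi-squared statistic and p-value for a 2x2 table (no Yates correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: row total * column total / grand total
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-squared survival function is erfc(sqrt(x / 2))
    return chi2, erfc(sqrt(chi2 / 2))

# Hypothetical counts, not the notebook's contingency table
toy_table = [[20, 30], [30, 20]]
chi2_stat, chi2_p = chi2_2x2(toy_table)
print(f"chi2 = {chi2_stat:.2f}, p = {chi2_p:.4f}")
```

A table whose rows have nearly identical proportions yields a statistic near zero and a p-value near 1.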

## 8. Summary of Findings

### Key Findings:

1. **Data Characteristics**: 
   - The dataset contains 244 observations with 3 numerical and 4 categorical variables
   - Records restaurant transactions including bills, tips, and customer characteristics

2. **Key Relationships**: 
   - Positive correlation between total bill and tip amount (Pearson r is about 0.68)
   - Larger parties have higher bills and tips
   - Dinner bills are significantly higher than lunch bills

3. **Statistical Test Results**:
   - **Gender vs Tip Amount**: No significant difference in tip amounts between males and females (p = 0.2389)
   - **Meal Time vs Total Bill**: Significant difference with dinner bills being higher (p < 0.0001)
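   - **Gender vs Smoker Status**: No significant association (the share of smoking parties is nearly identical for male and female payers, so the chi-squared p-value is close to 1)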

4. **Practical Implications**: 
   - Among the variables examined, total bill has the strongest association with tip amount
   - Meal time (dinner vs lunch) is associated with a significant difference in total bill
   - Gender shows no significant effect on tip amount; smoker status was compared only visually and was not formally tested against tip
   - For restaurant management, the results suggest that bill size and meal time matter more for spending and tipping than the payer characteristics recorded here

In [None]:
# Final correlation heatmap
plt.figure(figsize=(8, 6))
numeric_df = df.select_dtypes(include=[np.number])
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix of Numerical Variables')
plt.tight_layout()
plt.show()