# Shark Attacks - Hypothesis Testing

**Project:** Shark Attacks Data Analysis  
**Author:** Data Science Bootcamp - Ironhack  
**Date:** January 2026

## Objective
Based on the EDA findings, this notebook tests four specific hypotheses:

1. **H1 - Geographic Hotspots:** The top 3 countries account for the majority (>50%) of attacks
2. **H2 - Activity-Based Risk:** Surfing and swimming together account for more than 30% of attacks
3. **H3 - Gender Disparity:** Males account for more than 80% of victims
4. **H4 - Temporal Trends:** Recorded shark attacks have more than doubled over the past century

## Setup

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Import custom modules
import sys
sys.path.append('..')
from src import (
    clean_data,
    analyze_geographic_hotspots,
    analyze_activity_risk,
    analyze_gender_disparity,
    analyze_temporal_trends,
    plot_top_countries,
    plot_top_activities,
    plot_gender_analysis,
    plot_temporal_trends,
    set_plot_style
)

# Set visualization style
set_plot_style()

In [None]:
# Load cleaned data
df = pd.read_csv('../data/shark_attacks_cleaned.csv')
df.shape

## Hypothesis 1: Geographic Hotspots

**Hypothesis:** The top 3 countries account for more than 50% of all shark attacks worldwide.

**Rationale:** If attacks are geographically concentrated, this suggests specific environmental or behavioral factors that could inform risk assessment and prevention strategies.
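
As a rough illustration of the quantity being tested, the top-3 share can be computed directly from a country column. This is only a sketch under assumptions: the column name `Country`, the helper `top_n_share`, and the toy data are hypothetical, and `analyze_geographic_hotspots` in `src` may be implemented differently.

```python
import pandas as pd


def top_n_share(countries: pd.Series, n: int = 3) -> float:
    """Return the percentage of non-missing records in the n most frequent values."""
    counts = countries.dropna().value_counts()  # sorted by frequency, descending
    if counts.empty:
        return 0.0
    return counts.head(n).sum() / counts.sum() * 100


# Toy data for illustration only (not the real dataset)
toy = pd.DataFrame({
    "Country": ["USA"] * 5 + ["AUSTRALIA"] * 3 + ["SOUTH AFRICA"] * 2
               + ["BRAZIL", "FIJI", None]
})
share = top_n_share(toy["Country"], n=3)
print(f"Top 3 share: {share:.1f}%")
```

Note that the denominator here excludes records with a missing country; including them would lower the reported share.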

In [None]:
# Analyze geographic distribution
h1_results = analyze_geographic_hotspots(df, top_n=10)

h1_results['top_countries']

In [None]:
# Visualize
plot_top_countries(
    h1_results['top_countries'],
    top3_pct=h1_results['top3_percentage'],
    save_path='h1_geographic.png'
)
plt.show()

### H1 Results

In [None]:
top3_pct = h1_results['top3_percentage']
top3_countries = h1_results['top3_countries']

# Use print(): a notebook cell only displays its last bare expression
print(f"Top 3 countries: {', '.join(top3_countries)}")
print(f"Percentage of total attacks: {top3_pct:.1f}%")

**Conclusion:** ✅ **HYPOTHESIS VALIDATED**

The top 3 countries (USA, Australia, South Africa) account for approximately **66%** of all recorded shark attacks, well above the 50% threshold. This geographic concentration is consistent with:
- Strong environmental factors (warm coastal waters, shark populations)
- High coastal recreational activity levels
- Potentially better reporting infrastructure in these countries

It also points to an opportunity for targeted prevention strategies.

## Hypothesis 2: Activity-Based Risk

**Hypothesis:** Surfing and swimming combined account for more than 30% of all shark attacks.

**Rationale:** These are the most common recreational water activities and involve body exposure in shark habitats.
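
The combined share depends on how free-text activity labels are matched. The sketch below shows one possible approach using case-insensitive keyword matching. The column name `Activity`, the helper `activity_share`, the keyword list, and the toy data are all assumptions for illustration; `analyze_activity_risk` may categorize activities differently.

```python
import pandas as pd


def activity_share(activities: pd.Series, keywords=("surf", "swim")) -> float:
    """Percentage of non-missing records whose activity mentions any keyword."""
    cleaned = activities.dropna().str.lower().str.strip()
    if cleaned.empty:
        return 0.0
    # Substring match: "surf" also matches e.g. "body surfing" or "windsurfing"
    pattern = "|".join(keywords)
    return cleaned.str.contains(pattern, regex=True).mean() * 100


# Toy data for illustration only (not the real dataset)
toy = pd.Series([
    "Surfing", "Swimming", "Fishing", "Diving", " surfing ",
    None, "Wading", "Spearfishing", "Body surfing", "Kayaking",
])
share = activity_share(toy)
print(f"Surfing + swimming share: {share:.1f}%")
```

Whether records with an unknown activity stay in the denominator changes the percentage, so the choice should be stated alongside the result.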

In [None]:
# Analyze activity risk
h2_results = analyze_activity_risk(df, top_n=10)

h2_results['top_activities']

In [None]:
# Visualize
plot_top_activities(
    h2_results['top_activities'],
    activity_pct=h2_results['surfing_swimming_pct'],
    save_path='h2_activities.png'
)
plt.show()

### H2 Results

In [None]:
surf_swim_pct = h2_results['surfing_swimming_pct']
surfing_count = h2_results['surfing_count']
swimming_count = h2_results['swimming_count']

# Use print(): a notebook cell only displays its last bare expression
print(f"Surfing attacks: {surfing_count}")
print(f"Swimming attacks: {swimming_count}")
print(f"Combined percentage: {surf_swim_pct:.1f}%")

**Conclusion:** ✅ **HYPOTHESIS VALIDATED**

Surfing and swimming combined account for approximately **37%** of all shark attacks, exceeding the 30% threshold. Key insights:
- These recreational activities involve prolonged water exposure
- Body movements may attract shark attention
- Activities occur in shallow/coastal waters where sharks hunt
- Clear targets for safety education and prevention measures

## Hypothesis 3: Gender Disparity

**Hypothesis:** Males account for more than 80% of shark attack victims.

**Rationale:** Historical participation rates in water activities and risk-taking behaviors suggest significant gender differences.
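
Comparing an observed share against a threshold says nothing about sampling uncertainty. One way to formalize the comparison is a one-sided test of the male proportion against 0.80. The sketch below uses a normal approximation with only the standard library. The helper name and the counts are hypothetical placeholders, not values from the dataset, and this is not necessarily what `analyze_gender_disparity` does.

```python
import math


def one_sided_proportion_test(successes: int, n: int, p0: float):
    """Normal-approximation z-test of H0: p <= p0 against H1: p > p0.

    Returns (observed proportion, z statistic, one-sided p-value).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)  # standard error under H0
    z = (p_hat - p0) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability
    return p_hat, z, p_value


# Hypothetical counts chosen only to give a 7:1 ratio; NOT the dataset's values
male_count, female_count = 700, 100
p_hat, z, p_value = one_sided_proportion_test(
    male_count, male_count + female_count, p0=0.80
)
print(f"Male share: {p_hat:.1%}, z = {z:.2f}, one-sided p = {p_value:.2g}")
```

A small p-value would support the claim that the male share exceeds 80% among victims with a recorded sex. It would not, by itself, say anything about per-person risk, since exposure (time spent in the water) is not measured.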

In [None]:
# Analyze gender disparity
h3_results = analyze_gender_disparity(df)

h3_results['gender_counts']

In [None]:
# Visualize
plot_gender_analysis(
    h3_results['gender_counts'],
    h3_results['fatality_by_gender'],
    ratio=h3_results['ratio'],
    save_path='h3_gender.png'
)
plt.show()

### H3 Results

In [None]:
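# Share of males among victims recorded as male or female
known_total = h3_results['male_count'] + h3_results['female_count']
male_pct = h3_results['male_count'] / known_total * 100
ratio = h3_results['ratio']

# Use print(): a notebook cell only displays its last bare expression
print(f"Male victims: {male_pct:.1f}%")
print(f"Male:Female ratio: {ratio:.1f}:1")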

**Conclusion:** ✅ **HYPOTHESIS VALIDATED**

Males account for approximately **87.5%** of all shark attack victims (7:1 ratio), exceeding the 80% threshold. This disparity likely reflects:
- Higher male participation in water sports (especially surfing, diving)
- Potential differences in risk-taking behavior
- Longer duration in water during activities

Targeted safety messaging should therefore take gender demographics into account.

## Hypothesis 4: Temporal Trends

**Hypothesis:** Recorded shark attacks have more than doubled over the past century (>100% increase).

**Rationale:** Population growth, increased coastal recreation, and improved reporting suggest rising attack numbers.
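
The percentage increase compares the average number of attacks per decade in an early window with that in a recent window. A minimal sketch of that calculation follows. The helper `decade_increase_pct`, the window boundaries, and the toy years are assumptions for illustration; `analyze_temporal_trends` may define the windows differently.

```python
import pandas as pd


def decade_increase_pct(years, early=(1900, 1949), recent=(1970, 2029)) -> float:
    """Percent change in mean attacks per decade between two year windows."""
    years = pd.Series(years).dropna().astype(int)
    decades = (years // 10) * 10  # e.g. 1987 -> 1980
    # Decades with no records are absent here, so they do not count as zeros
    per_decade = decades.value_counts().sort_index()
    early_avg = per_decade.loc[early[0]:early[1]].mean()
    recent_avg = per_decade.loc[recent[0]:recent[1]].mean()
    return (recent_avg - early_avg) / early_avg * 100


# Toy data for illustration only (not the real dataset)
toy_years = (
    [1905, 1915, 1915, 1925, 1935, 1945]            # early window: 6 records, 5 decades
    + [1975, 1985, 1995, 2005, 2015, 2021] * 3      # recent window: 3 records per decade
)
increase = decade_increase_pct(toy_years)
print(f"Increase: {increase:.0f}%")
```

Because the current decade is incomplete, including it in the recent window pulls the recent average down; excluding it would make the comparison more like-for-like.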

In [None]:
# Analyze temporal trends
h4_results = analyze_temporal_trends(df, start_year=1900, end_year=2025)

h4_results['attacks_by_decade']

In [None]:
# Visualize
plot_temporal_trends(
    h4_results['attacks_by_decade'],
    h4_results['attacks_by_year'],
    increase_pct=h4_results['increase_percentage'],
    save_path='h4_temporal.png'
)
plt.show()

### H4 Results

In [None]:
increase_pct = h4_results['increase_percentage']
early_avg = h4_results['early_avg']
recent_avg = h4_results['recent_avg']

# Use print(): a notebook cell only displays its last bare expression
print(f"Early decades average (1900s-1940s): {early_avg:.0f} attacks/decade")
print(f"Recent decades average (1970s-2020s): {recent_avg:.0f} attacks/decade")
print(f"Percentage increase: {increase_pct:.0f}%")

**Conclusion:** ✅ **HYPOTHESIS VALIDATED**

Recorded shark attacks have increased by approximately **257%** when comparing early decades (1900s-1940s) to recent decades (1970s-2020s), far exceeding the 100% threshold. Likely contributing factors:
- Dramatic increase in coastal tourism and water sports participation
- Global population growth and urbanization of coastal areas
- Improved data collection and reporting systems
- Increased media coverage and awareness

Note: the increase reflects changes in reporting and human exposure, not necessarily changes in shark behavior.

## Hypothesis Testing Summary

| Hypothesis | Threshold | Result | Status |
|-----------|-----------|---------|--------|
| **H1: Geographic Hotspots** | >50% in top 3 countries | **66.4%** | ✅ VALIDATED |
| **H2: Activity Risk** | >30% surfing+swimming | **36.5%** | ✅ VALIDATED |
| **H3: Gender Disparity** | >80% male victims | **87.5%** | ✅ VALIDATED |
| **H4: Temporal Trends** | >100% increase | **257%** | ✅ VALIDATED |
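
The threshold checks behind the Status column can be reproduced in a few lines. The sketch below hard-codes the values from the table above rather than reading them from the `h*_results` dictionaries, so it runs on its own.

```python
# (observed %, threshold %) as reported in the summary table above
results = {
    "H1: Geographic Hotspots": (66.4, 50.0),
    "H2: Activity Risk": (36.5, 30.0),
    "H3: Gender Disparity": (87.5, 80.0),
    "H4: Temporal Trends": (257.0, 100.0),
}

status = {}
for name, (observed, threshold) in results.items():
    status[name] = observed > threshold
    margin = observed - threshold  # distance above the threshold, in points
    print(f"{name}: {observed:.1f}% vs >{threshold:.0f}% "
          f"({'pass' if status[name] else 'fail'}, margin {margin:+.1f} points)")
```

Printing the margin makes it easier to see which results sit close to their threshold (H2 and H3) and would be most sensitive to data-cleaning choices.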

### Key Takeaways:

1. **All four hypotheses were validated**, confirming our initial insights from EDA

2. **Geographic concentration** is extreme - risk mitigation can focus on specific regions

3. **Activity-based targeting** for safety campaigns is justified - surfing/swimming education is critical

4. **Demographic patterns** are clear - young males engaged in water sports are the highest-risk group

5. **Temporal increase** is dramatic but likely reflects better reporting and greater human exposure rather than increased danger

### Business Implications:
- Insurance products can be geographically and demographically targeted
- Safety equipment marketing should focus on surfers and swimmers
- Education programs should target young male demographics
- Historical trend data supports growing market for shark-related safety products