# Comprehensive Exploratory Data Analysis
## Data Breach Patterns and Business Insights

**Author:** T. Spivey  
**Course:** BUS 761  
**Assignment:** 5 - Exploratory Data Analysis Module  
**Date:** October 2025

---

## Executive Summary

This notebook demonstrates a comprehensive exploratory data analysis of 35,378 data breach incidents reported in the United States from 2003-2025. Using our newly developed **modular EDA package**, we uncover critical patterns in breach frequency, severity, and industry vulnerabilities.

### Key Findings:
1. **Industry-Specific Vulnerabilities**: Healthcare experiences 43% more disclosure breaches than expected if breach type were independent of industry
2. **Financial Sector Risk**: Physical breaches in financial services are 169% above the expected count
3. **Retail Targeting**: Payment card breaches in retail are 400% above statistical expectation (five times the expected count)
4. **Impact Relationships**: Non-linear relationship between total and resident impact (Spearman ρ=0.52 vs Pearson r=0.32)
5. **Temporal Trends**: Breach frequency and severity show distinct time-based patterns

---

## 1. Setup and Data Loading

First, we'll import our modular EDA package and load the data from our SQLite database.

In [None]:
# Import our custom EDA package
import sys
sys.path.append('..')  # Add parent directory to path

from eda_package import BreachAnalyzer, BreachVisualizer, DataLoader
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.precision', 3)

print("✓ Packages imported successfully")

In [None]:
# Initialize data loader
loader = DataLoader('databreach.db')

# Load main breach dataset
df_breach = loader.load_breach_data()

print(f"Loaded {len(df_breach):,} breach records")
print(f"Columns: {df_breach.shape[1]}")
print(f"\nDate Range: {df_breach['breach_date'].min()} to {df_breach['breach_date'].max()}")

In [None]:
# Display database summary
table_info = loader.get_table_info()
print("\nDatabase Tables:")
print("="*60)
for table, count in table_info.items():
    print(f"{table:40s}: {count:>10,} rows")
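
The internals of `DataLoader` are not shown in this notebook. As a rough sketch of what loading from SQLite and counting rows per table could look like with `sqlite3` and pandas, the cell below uses a hypothetical `breaches` table and hypothetical helper names (`load_table`, `table_row_counts`); none of these are part of `eda_package`.

```python
import sqlite3
import pandas as pd

def load_table(conn, table):
    """Read a whole table into a DataFrame (the table name is trusted input here)."""
    return pd.read_sql_query(f'SELECT * FROM "{table}"', conn)

def table_row_counts(conn):
    """Return {table_name: row_count} for every table listed in sqlite_master."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    return {name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names}

# Tiny in-memory database standing in for databreach.db (hypothetical schema)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE breaches (breach_date TEXT, total_affected INTEGER)")
conn.executemany(
    "INSERT INTO breaches VALUES (?, ?)",
    [("2020-01-15", 1200), ("2021-06-30", 45000), ("2023-03-02", 310)],
)
conn.commit()

demo_df = load_table(conn, "breaches")
demo_counts = table_row_counts(conn)
print(demo_df)
print(demo_counts)
```

Dates stored as text in SQLite come back as strings, so a real loader would likely convert them with `pd.to_datetime` before any time-based analysis.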

## 2. Data Overview and Quality Assessment

Let's examine the structure and quality of our dataset.

In [None]:
# Display first few records
print("Sample Records:")
df_breach.head()

In [None]:
# Data quality assessment
print("\nData Quality Report")
print("="*60)
print(f"Total Records: {len(df_breach):,}")
print(f"\nMissing Values by Column:")
missing = df_breach.isnull().sum()
missing_pct = (missing / len(df_breach) * 100).round(2)
missing_df = pd.DataFrame({'Missing Count': missing, 'Percentage': missing_pct})
print(missing_df[missing_df['Missing Count'] > 0].sort_values('Percentage', ascending=False))

In [None]:
# Categorical variable distributions
print("\nOrganization Type Distribution:")
print(df_breach['organization_type'].value_counts())
print("\n" + "="*60)
print("\nBreach Type Distribution:")
print(df_breach['breach_type'].value_counts())

## 3. Statistical Analysis

Now we'll use our `BreachAnalyzer` class to conduct comprehensive statistical analyses.

In [None]:
# Initialize analyzer
analyzer = BreachAnalyzer(df_breach, alpha=0.05)
print(f"Analyzer initialized: {analyzer}")

### 3.1 Descriptive Statistics

First, let's examine the central tendencies and distributions of key numeric variables.

In [None]:
# Overall descriptive statistics
desc_stats_overall = analyzer.descriptive_statistics()
print("\nOverall Descriptive Statistics:")
print("="*80)
desc_stats_overall

In [None]:
# Descriptive statistics by organization type
desc_stats_by_org = analyzer.descriptive_statistics(group_by='organization_type')
print("\nDescriptive Statistics by Organization Type:")
print("="*80)
desc_stats_by_org

**Business Insight:** The data is strongly right-skewed (positive skewness values): most breaches are relatively small, but a few massive breaches pull the mean upward. The median is therefore a better measure of "typical" breach size.
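
To make the mean-versus-median point concrete, here is a small sketch with made-up breach sizes (the values are illustrative and not taken from the dataset). One very large incident is enough to move the mean far away from the median.

```python
import pandas as pd

# Hypothetical breach sizes: nine small incidents plus one mega-breach
sizes = pd.Series([120, 250, 300, 450, 600, 800, 1_000, 1_500, 2_000, 5_000_000])

mean_size = sizes.mean()      # pulled upward by the single outlier
median_size = sizes.median()  # middle of the sorted values, robust to the outlier
skewness = sizes.skew()       # positive for a long right tail

print(f"mean={mean_size:,.0f}  median={median_size:,.0f}  skew={skewness:.2f}")
```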

### 3.2 Correlation Analysis

Examine the relationship between total individuals affected and state residents affected.

In [None]:
# Correlation analysis
corr_results = analyzer.correlation_analysis()

print("\nCorrelation Analysis Results:")
print("="*80)
print(f"Variables: {corr_results['variable_1']} vs {corr_results['variable_2']}")
print(f"Sample Size: {corr_results['sample_size']:,} valid pairs\n")

print(f"Pearson Correlation (Linear):")
print(f"  r = {corr_results['pearson_r']:.4f}")
print(f"  p-value = {corr_results['pearson_p']:.6f}")
print(f"  Significant: {corr_results['pearson_significant']}\n")

print(f"Spearman Correlation (Monotonic):")
print(f"  ρ = {corr_results['spearman_rho']:.4f}")
print(f"  p-value = {corr_results['spearman_p']:.6f}")
print(f"  Significant: {corr_results['spearman_significant']}")

**Key Finding:** The gap between Pearson (r=0.32) and Spearman (ρ=0.52) suggests:
- A moderate monotonic (rank-order) relationship that is stronger than the linear one
- A non-linear pattern in the raw values
- Outliers that pull down the linear correlation
- A pattern typical of breach data, where mega-breaches distort linear measures
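
Spearman's ρ is Pearson's r computed on ranks, which is why it is less sensitive to extreme values. The sketch below uses synthetic data (not the breach dataset) in which `y` always increases with `x` but grows explosively, so the rank correlation is perfect while the linear correlation is much weaker.

```python
import numpy as np
import pandas as pd

# Synthetic, perfectly monotonic but strongly non-linear data
x = pd.Series(np.arange(1, 21), dtype=float)
y = np.exp(x)  # the largest values dwarf the rest, like a few mega-breaches

pearson_r = x.corr(y)                    # linear association on raw values
spearman_rho = x.rank().corr(y.rank())   # Pearson applied to the ranks

print(f"Pearson r = {pearson_r:.2f}, Spearman rho = {spearman_rho:.2f}")
```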

### 3.3 Chi-Squared Test: Industry vs Breach Type

Test whether organization type and breach type are independent.

In [None]:
# Chi-squared test
chi_results = analyzer.chi_squared_test('organization_type', 'breach_type')

print("\nChi-Squared Test Results:")
print("="*80)
print(f"Null Hypothesis: {chi_results['variable_1']} and {chi_results['variable_2']} are independent")
print(f"\nχ² Statistic: {chi_results['chi2_statistic']:.2f}")
print(f"p-value: {chi_results['p_value']:.8f}")
print(f"Degrees of Freedom: {chi_results['degrees_of_freedom']}")
print(f"Sample Size: {chi_results['sample_size']:,}")
print(f"\nConclusion: {'REJECT' if chi_results['significant'] else 'FAIL TO REJECT'} null hypothesis")
print(f"Interpretation: {'Strong evidence of' if chi_results['significant'] else 'No significant'} relationship")

In [None]:
# Display observed frequencies
print("\nObserved Frequencies (Contingency Table):")
chi_results['observed']

In [None]:
# Calculate deviations from expected
deviation = chi_results['observed'] - chi_results['expected']
deviation_pct = (deviation / chi_results['expected'] * 100).round(1)

print("\nDeviation from Expected (Percentage):")
print("Positive = More breaches than expected, Negative = Fewer")
deviation_pct

**Critical Business Insights:**

The significant chi-squared result (p < 0.001) confirms that different industries face different breach threats:

1. **Healthcare (MED)**: 43% more DISC (disclosure) breaches → Focus on access controls
2. **Financial (BSF)**: 169% more PHYS (physical) breaches → Strengthen document security
3. **Retail (BSR)**: 400% more CARD breaches → Enhanced POS security critical
4. **Business/Other (BSO)**: 15% more HACK attacks → Cyber defense priority
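
The percentages above are deviations from the counts expected under independence, where each expected cell count is (row total × column total) / grand total. A deviation of +400% therefore means the observed count is five times the expected count. The sketch below works through this on a hypothetical 2×2 table; the counts are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical contingency table (rows: organization type, columns: breach type)
observed = pd.DataFrame(
    {"CARD": [50, 10], "HACK": [50, 90]},
    index=["BSR", "MED"],
)

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
n = observed.to_numpy().sum()

# Expected count under independence: row total * column total / grand total
expected = pd.DataFrame(
    np.outer(row_totals, col_totals) / n,
    index=observed.index,
    columns=observed.columns,
)

# Percentage deviation of observed from expected, per cell
deviation_pct = ((observed - expected) / expected * 100).round(1)

# Pearson chi-squared statistic (no continuity correction)
chi2_stat = ((observed - expected) ** 2 / expected).to_numpy().sum()

print(expected)
print(deviation_pct)
print(f"chi2 = {chi2_stat:.3f}")
```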

### 3.4 ANOVA: Breach Impact Across Industries

Test whether breach severity differs significantly across organization types.

In [None]:
# ANOVA test
anova_results = analyzer.anova_test('organization_type', 'total_affected')

print("\nANOVA Results:")
print("="*80)
print(f"Question: Does breach impact vary across organization types?")
print(f"\nF-Statistic: {anova_results['f_statistic']:.4f}")
print(f"p-value: {anova_results['p_value']:.6f}")
print(f"Number of Groups: {anova_results['n_groups']}")
verdict = 'DOES' if anova_results['significant'] else 'DOES NOT'
print(f"\nConclusion: Breach impact {verdict} vary significantly by industry")

In [None]:
# Display group statistics
print("\nGroup Statistics:")
anova_results['group_statistics'].sort_values('mean', ascending=False)
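
How `anova_test` computes its F-statistic is not shown here. For reference, a one-way ANOVA F-statistic is the ratio of between-group to within-group mean squares, and the sketch below computes it by hand on a small made-up dataset (the values are illustrative only; the column names mirror the ones used above).

```python
import pandas as pd

# Made-up data: three organization types with three breaches each
demo = pd.DataFrame({
    "organization_type": ["MED"] * 3 + ["BSF"] * 3 + ["BSR"] * 3,
    "total_affected": [1, 2, 3, 2, 3, 4, 6, 7, 8],
})

groups = demo.groupby("organization_type")["total_affected"]
grand_mean = demo["total_affected"].mean()
k = groups.ngroups   # number of groups
n = len(demo)        # total number of observations

# Between-group sum of squares: group size * squared distance of group mean from grand mean
ss_between = (groups.count() * (groups.mean() - grand_mean) ** 2).sum()
# Within-group sum of squares: squared distances from each group's own mean
ss_within = groups.apply(lambda s: ((s - s.mean()) ** 2).sum()).sum()

f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
print(f"F = {f_stat:.2f} with ({k - 1}, {n - k}) degrees of freedom")
```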

### 3.5 Linear Regression: Predicting Resident Impact

Build a simple linear model to predict resident impact from total individuals affected.

In [None]:
# Simple linear regression
reg_results = analyzer.simple_linear_regression()

print("\nSimple Linear Regression Results:")
print("="*80)
print(f"Model: {reg_results['y_variable']} = {reg_results['slope']:.6f} * {reg_results['X_variable']} + {reg_results['intercept']:.2f}")
print(f"\nR² (Variance Explained): {reg_results['r_squared']:.4f}")
print(f"Sample Size: {reg_results['sample_size']:,}")
print(f"\nInterpretation: {reg_results['r_squared']*100:.1f}% of variance in resident impact")
print(f"                is explained by total individuals affected")

**Model Interpretation:**
- Slope ≈ 0.0027: Each additional 1,000 individuals affected in total is associated with ~2.7 more affected residents
- Low R² (0.099): The simple linear model explains only about 10% of the variance
- Implication: A more complex model or additional predictors are needed
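
For a simple linear regression with an intercept, R² equals the square of Pearson's r, which is why r ≈ 0.32 from Section 3.2 and R² ≈ 0.099 here are roughly consistent after rounding. The sketch below illustrates the slope interpretation and the R² = r² identity on made-up numbers using NumPy only; it is not the implementation behind `simple_linear_regression`.

```python
import numpy as np

# Made-up data: total individuals affected (x) and residents affected (y)
x = np.array([1_000, 2_000, 4_000, 8_000, 16_000], dtype=float)
y = np.array([5, 4, 15, 20, 48], dtype=float)

# Least-squares line y = slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)
predicted = slope * x + intercept

# R^2 = 1 - residual sum of squares / total sum of squares
ss_res = np.sum((y - predicted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

r = np.corrcoef(x, y)[0, 1]
per_thousand = slope * 1_000  # change in y per 1,000 additional units of x

print(f"slope per 1,000 = {per_thousand:.2f}, R^2 = {r_squared:.4f}, r^2 = {r**2:.4f}")
```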

### 3.6 Time Series Analysis

Analyze breach trends over time.

In [None]:
# Time series by year
time_series = analyzer.time_series_analysis(freq='Y')

print("\nTime Series Analysis (Yearly):")
print("="*80)
time_series.tail(10)  # Show last 10 years
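
The columns returned by `time_series_analysis` are not listed in this notebook. As one possible shape for a yearly summary, the sketch below groups made-up records by calendar year and reports count, total, and median impact. The `breach_date` and `total_affected` column names follow the ones used above; the output column names are assumptions.

```python
import pandas as pd

# Made-up breach records
demo = pd.DataFrame({
    "breach_date": pd.to_datetime(
        ["2021-02-01", "2021-09-15", "2022-03-10", "2022-07-04", "2022-12-25"]
    ),
    "total_affected": [500, 1_500, 200, 300, 100_000],
})

# Group by calendar year and summarize frequency and severity
yearly = (
    demo.groupby(demo["breach_date"].dt.year)["total_affected"]
        .agg(breach_count="count", total_affected="sum", median_affected="median")
)
yearly.index.name = "year"
print(yearly)
```

Reporting the median alongside the total keeps a single mega-breach (2022 in this toy data) from being read as a general rise in severity.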

### 3.7 Logistic Regression: Predicting Severe Breaches

Classify breaches as severe (>10,000 affected) or non-severe.

In [None]:
# Logistic regression
logit_results = analyzer.logistic_regression_severity(threshold=10000)

print("\nLogistic Regression Results:")
print("="*80)
print(f"Classification Threshold: {logit_results['threshold']:,} individuals")
print(f"Model Accuracy: {logit_results['accuracy']:.2%}")
print(f"\nSevere Breaches: {logit_results['severe_count']:,}")
print(f"Non-Severe Breaches: {logit_results['non_severe_count']:,}")
print(f"\nTop Coefficients (Highest Risk Factors):")

coef_df = pd.DataFrame.from_dict(logit_results['coefficients'], 
                                 orient='index', columns=['Coefficient'])
coef_df.sort_values('Coefficient', ascending=False).head(10)
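
Because severe breaches are probably a minority class, model accuracy is easier to judge against the accuracy of always predicting the majority class. The sketch below builds the severe/non-severe label from the same 10,000 threshold on made-up sizes and computes that baseline; it does not reproduce `logistic_regression_severity`.

```python
import pandas as pd

# Made-up breach sizes (individuals affected)
affected = pd.Series([120, 800, 2_500, 9_000, 15_000, 40, 600, 3_000, 250_000, 75])
threshold = 10_000

# Binary target: True when the breach exceeds the severity threshold
is_severe = affected > threshold
severe_rate = is_severe.mean()

# Accuracy achieved by always predicting the more common class
baseline_accuracy = max(severe_rate, 1 - severe_rate)

print(f"Severe: {is_severe.sum()} of {len(affected)}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.0%}")
```

A model is only adding value if its accuracy clearly exceeds this baseline.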

## 4. Data Visualization

Now we'll use our `BreachVisualizer` class to create publication-quality visualizations.

In [None]:
# Initialize visualizer
viz = BreachVisualizer(df_breach, output_dir='output/visualizations')
print(f"Visualizer initialized: {viz}")

In [None]:
# Generate comprehensive dashboard
# Load necessary data for visualizations
chi_observed = loader.load_statistical_results('chi_squared_observed')
time_series_data = loader.load_statistical_results('time_series_yearly')

# Prepare descriptive stats for visualization
desc_for_viz = desc_stats_by_org.reset_index()
desc_for_viz.columns = ['organization_type'] + desc_for_viz.columns[1:].tolist()

# Create all visualizations
viz.create_comprehensive_dashboard(
    chi_observed=chi_observed,
    desc_stats=desc_for_viz,
    time_series=time_series_data,
    correlation_stats=corr_results,
    regression_results=reg_results
)

## 5. Business Insights Summary

Let's generate actionable business insights from our analyses.

In [None]:
# Get business insights
insights = analyzer.get_business_insights()

print("\nBUSINESS INSIGHTS")
print("="*80)
for category, insight in insights.items():
    print(f"\n{category.replace('_', ' ').title()}:")
    print(f"  {insight}")

## 6. Strategic Recommendations

Based on our comprehensive analysis, here are the key strategic recommendations:

### For Healthcare Organizations (MED):
- **Priority:** Disclosure prevention
- **Action:** Implement stricter access controls and data handling procedures
- **Rationale:** 43% more disclosure breaches than expected

### For Financial Services (BSF):
- **Priority:** Physical document security
- **Action:** Enhanced document destruction protocols, secure storage
- **Rationale:** 169% more physical breaches than expected

### For Retail (BSR):
- **Priority:** Payment card security
- **Action:** POS system hardening, EMV compliance, fraud detection
- **Rationale:** 400% more card breaches than expected

### For All Organizations:
- **Trend Monitoring:** Breach frequency and severity evolving over time
- **Impact Planning:** Most breaches are small, but extreme outliers drive total impact
- **Resource Allocation:** Industry-specific security investments provide better ROI

---

## 7. Conclusion

This exploratory data analysis has revealed significant patterns in data breach vulnerabilities across industries. Our modular EDA package enables:

1. **Reusable Analysis:** Object-oriented design allows easy replication with new data
2. **Statistical Rigor:** Multiple hypothesis tests confirm significant relationships
3. **Business Focus:** Insights directly inform security investment decisions
4. **Visualization Quality:** Publication-ready charts for stakeholder communication

### Next Steps:
- Develop predictive models (Assignment 6)
- Create interactive dashboard (Assignment 7)
- Monitor trends with updated data
- Industry-specific deep dives
