# EDA: Credit Card Fraud Detection
This notebook contains exploratory data analysis of the credit card transaction data for fraud detection.

# Credit Card Fraud Detection - EDA To-Do List

The checklist below outlines the planned EDA steps for the credit card fraud detection data:

## 1. Data Understanding
- [ ] **Data Structure**
  - [ ] Check dataset dimensions (rows × columns)
  - [ ] List all features and their data types
  - [ ] Identify the target variable (Class: 0=legitimate, 1=fraud)

## 2. Data Quality Check
- [ ] **Missing Values**
  - [ ] Check for missing values in each column
  - [ ] Document any missing data patterns
- [ ] **Duplicates**
  - [ ] Check for duplicate transactions
  - [ ] Document findings
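
The duplicate check above can be sketched as follows. This is a minimal illustration, assuming the data sits in a pandas DataFrame with a `Class` column; the helper name `summarize_duplicates` and the toy frame are hypothetical and not part of the notebook.

```python
import pandas as pd

def summarize_duplicates(df: pd.DataFrame, target: str = "Class") -> dict:
    """Count fully duplicated rows and how many of them are labeled fraud."""
    # keep="first" flags every repeat after the first occurrence of a row
    dup_mask = df.duplicated(keep="first")
    return {
        "n_duplicates": int(dup_mask.sum()),
        "share_of_rows": float(dup_mask.mean()),
        "n_duplicate_fraud": int(df.loc[dup_mask, target].sum()),
    }

# Toy frame for illustration only (not the real data)
toy = pd.DataFrame({
    "Time": [0, 0, 1, 1],
    "Amount": [10.0, 10.0, 5.0, 7.0],
    "Class": [0, 0, 1, 1],
})
dup_summary = summarize_duplicates(toy)
print(dup_summary)
```

On the real data the call would be `summarize_duplicates(credit_data)`. Whether duplicates should be dropped is a separate decision, since identical rows may be genuine repeated transactions.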

## 3. Class Distribution Analysis
- [ ] **Class Imbalance**
  - [ ] Calculate the percentage of fraud vs. non-fraud cases
  - [ ] Visualize the class distribution
  - [ ] Note any data imbalance considerations

## 4. Time Feature Analysis
- [ ] **Time Column**
  - [ ] Understand the time unit (likely seconds since first transaction)
  - [ ] Convert to a timedelta or relative timestamp (an absolute datetime requires a known start date)
  - [ ] Analyze transaction patterns over time
  - [ ] Check for any time-based patterns in fraud

## 5. Amount Analysis
- [ ] **Transaction Amounts**
  - [ ] Basic statistics (min, max, mean, median, std)
  - [ ] Distribution visualization (consider log scale)
  - [ ] Compare amount distributions between fraud and non-fraud
  - [ ] Identify any amount thresholds for fraud
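
One way to compare amounts between the two classes is a grouped summary table. The sketch below assumes columns named `Amount` and `Class`; the helper `amount_stats_by_class` and the toy values are hypothetical.

```python
import pandas as pd

def amount_stats_by_class(df: pd.DataFrame,
                          amount: str = "Amount",
                          target: str = "Class") -> pd.DataFrame:
    """Return per-class summary statistics of the transaction amount."""
    return df.groupby(target)[amount].agg(
        ["count", "min", "max", "mean", "median", "std"]
    )

# Toy frame for illustration only (not the real data)
toy = pd.DataFrame({
    "Amount": [1.0, 3.0, 100.0, 200.0],
    "Class": [0, 0, 1, 1],
})
amount_summary = amount_stats_by_class(toy)
print(amount_summary)
```

With a heavily skewed amount distribution, the median is usually the more informative column to compare than the mean.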

## 6. PCA Features Analysis (V1-V28)
- [ ] **Feature Distributions**
  - [ ] Basic statistics for each PCA component
  - [ ] Visualize distributions of first few components
  - [ ] Compare distributions between fraud and non-fraud
- [ ] **Outlier Detection**
  - [ ] Identify potential outliers in PCA components
  - [ ] Document any extreme values
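
The checklist does not name an outlier rule. As one common option, the sketch below counts values outside the Tukey fences (1.5 × IQR beyond the quartiles) for each column. The helper `iqr_outlier_counts`, the multiplier `k`, and the toy data are assumptions for illustration.

```python
import pandas as pd

def iqr_outlier_counts(df: pd.DataFrame, columns: list, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    q1 = df[columns].quantile(0.25)
    q3 = df[columns].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Compare each column against its own fences (aligned on column names)
    mask = df[columns].lt(lower, axis="columns") | df[columns].gt(upper, axis="columns")
    return mask.sum().sort_values(ascending=False)

# Toy frame for illustration only: V1 has one extreme value, V2 has none
toy = pd.DataFrame({
    "V1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "V2": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})
outlier_counts = iqr_outlier_counts(toy, ["V1", "V2"])
print(outlier_counts)
```

In fraud data, extreme values may be exactly the cases of interest, so a high count is something to document rather than a reason to drop rows.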

## 7. Correlation Analysis
- [ ] **Feature Correlations**
  - [ ] Correlation matrix of all features
  - [ ] Identify highly correlated features
  - [ ] Correlations with the target variable (Class)
  - [ ] Visualize top correlations with heatmap
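
For the correlations with the target, a ranked list is often easier to read than a full heatmap. The sketch below uses Pearson correlation against the binary `Class` column and sorts by absolute value; the helper `target_correlations` and the toy frame are hypothetical.

```python
import pandas as pd

def target_correlations(df: pd.DataFrame, target: str = "Class", top_n: int = 10) -> pd.Series:
    """Pearson correlation of each feature with the target, strongest first."""
    corr = df.drop(columns=[target]).corrwith(df[target])
    # Rank by magnitude so strong negative correlations are not buried
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order).head(top_n)

# Toy frame for illustration only (not the real data)
toy = pd.DataFrame({
    "V1": [0, 1, 2, 3],
    "V2": [1, 0, 1, 0],
    "V3": [3, 2, 1, 0],
    "Class": [0, 0, 1, 1],
})
top_corr = target_correlations(toy)
print(top_corr)
```

Pearson correlation only captures linear association, so a feature with a low score here can still be useful to a nonlinear model.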

## 8. Time-Based Patterns
- [ ] **Temporal Analysis**
  - [ ] Convert time to hours of day
  - [ ] Analyze fraud frequency by time of day
  - [ ] Identify peak fraud periods
  - [ ] Visualize time-based patterns
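
The temporal steps above can be sketched as below, assuming `Time` holds seconds elapsed since the first transaction. Under that assumption the derived hour is relative to the first transaction and matches the clock hour only if recording began at midnight. The helper `fraud_rate_by_hour` and the toy values are hypothetical.

```python
import pandas as pd

def fraud_rate_by_hour(df: pd.DataFrame,
                       time_col: str = "Time",
                       target: str = "Class") -> pd.DataFrame:
    """Transaction count, fraud count, and fraud rate per (relative) hour of day."""
    # Whole hours elapsed, wrapped to a 0-23 cycle
    hour = (df[time_col] // 3600 % 24).astype(int).rename("hour")
    grouped = df.groupby(hour)[target]
    return pd.DataFrame({
        "n_transactions": grouped.size(),
        "n_fraud": grouped.sum(),
        "fraud_rate": grouped.mean(),
    })

# Toy frame for illustration only (not the real data)
toy = pd.DataFrame({
    "Time": [0, 1800, 3600, 90000, 90001],
    "Class": [0, 1, 0, 1, 1],
})
hourly = fraud_rate_by_hour(toy)
print(hourly)
```

Reporting the transaction count next to the rate helps avoid over-reading hours that contain only a few transactions.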

## 9. Bivariate Analysis
- [ ] **Amount vs. Time**
  - [ ] Scatter plot of transaction amount over time
  - [ ] Highlight fraud cases
- [ ] **Amount vs. PCA Features**
  - [ ] Scatter plots of amount vs. top PCA components
  - [ ] Look for patterns in fraud distribution

## 10. Feature Engineering Ideas
- [ ] **Potential New Features**
  - [ ] Time-based features (hour of day, day of week)
  - [ ] Transaction amount categories
  - [ ] Interaction terms between key features
  - [ ] Statistical aggregations
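
As an illustration of the amount-based ideas above, the sketch below adds a log-transformed amount and a coarse amount category. The bin edges, labels, and the helper name `add_amount_features` are arbitrary assumptions chosen for the example, not thresholds derived from the data.

```python
import numpy as np
import pandas as pd

def add_amount_features(df: pd.DataFrame, amount: str = "Amount") -> pd.DataFrame:
    """Return a copy of df with a log amount and a categorical amount bin."""
    out = df.copy()
    # log1p handles zero-amount transactions without producing -inf
    out["log_amount"] = np.log1p(out[amount])
    # Illustrative bin edges only; include_lowest keeps amounts of exactly 0
    out["amount_bin"] = pd.cut(
        out[amount],
        bins=[0, 10, 100, 1000, np.inf],
        labels=["0-10", "10-100", "100-1000", "1000+"],
        include_lowest=True,
    )
    return out

# Toy frame for illustration only (not the real data)
toy = pd.DataFrame({"Amount": [0.0, 10.0, 50.0, 5000.0]})
featured = add_amount_features(toy)
print(featured)
```

Working on a copy leaves the raw frame untouched, which keeps earlier cells reproducible when the notebook is re-run out of order.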

## 11. Documentation
- [ ] **Summary of Findings**
  - [ ] Document key insights
  - [ ] Note any data quality issues
  - [ ] List potential features for modeling
  - [ ] Document class imbalance and potential solutions

## 12. Next Steps
- [ ] **Modeling Preparation**
  - [ ] Feature scaling requirements
  - [ ] Class imbalance handling strategy
  - [ ] Feature selection considerations
  - [ ] Cross-validation strategy
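
One simple option for the class imbalance item is inverse-frequency class weighting, computed as `n_samples / (n_classes * class_count)`. This is the same heuristic scikit-learn documents for `class_weight='balanced'`; it is shown here as one candidate strategy, not the chosen one. The helper `balanced_class_weights` and the toy labels are hypothetical.

```python
import pandas as pd

def balanced_class_weights(y: pd.Series) -> dict:
    """Inverse-frequency weights: n_samples / (n_classes * count per class)."""
    counts = y.value_counts()
    weights = len(y) / (len(counts) * counts)
    return weights.to_dict()

# Toy labels for illustration only: 8 legitimate, 2 fraudulent
toy_labels = pd.Series([0] * 8 + [1] * 2)
class_weights = balanced_class_weights(toy_labels)
print(class_weights)
```

The resulting dictionary could be passed to an estimator that accepts per-class weights. Resampling approaches are an alternative worth comparing during modeling.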

## 13. Final Deliverables
- [ ] **Notebook with:**
  - [ ] Clear markdown explanations
  - [ ] Well-labeled visualizations
  - [ ] Reproducible code
  - [ ] Summary of key findings
  - [ ] Recommendations for modeling

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Set style
sns.set_style('whitegrid')
%matplotlib inline

In [None]:
# Load the data
data_dir = Path('../data/raw')
credit_data = pd.read_csv(data_dir / 'creditcard.csv')

# Display basic info
print("Credit Card Data Shape:", credit_data.shape)
print("\nCredit Card Data Info:")
credit_data.info()

In [None]:
# Basic statistics
credit_data.describe()

In [None]:
# Check for missing values
credit_data.isnull().sum()

In [None]:
# Class distribution (share of each class, in percent)
class_dist = credit_data['Class'].value_counts(normalize=True) * 100
print(class_dist.round(3))

plt.figure(figsize=(8, 6))
sns.barplot(x=class_dist.index, y=class_dist.values)
plt.title('Class Distribution')
plt.xlabel('Class (0: Legitimate, 1: Fraudulent)')
plt.ylabel('Percentage of transactions (%)')
plt.show()

In [None]:
# Distribution of transaction amounts by class
# Shared bin edges keep the two histograms directly comparable
bins = np.linspace(0, credit_data['Amount'].max(), 51)
plt.figure(figsize=(10, 6))
sns.histplot(credit_data.loc[credit_data['Class'] == 0, 'Amount'], bins=bins, label='Legitimate', alpha=0.5, color='blue')
sns.histplot(credit_data.loc[credit_data['Class'] == 1, 'Amount'], bins=bins, label='Fraudulent', alpha=0.5, color='red')
plt.title('Transaction Amount Distribution by Class')
plt.xlabel('Amount')
plt.ylabel('Count (log scale)')
plt.legend()
plt.yscale('log')
plt.show()