# 🧪 01 - Exploratory Data Analysis (EDA)

This notebook explores the **Credit Card Fraud Detection** dataset.

Goals:
- Understand dataset size & structure  
- Examine class imbalance  
- Inspect feature distributions  
- Explore correlations  
- Document insights useful for modeling  


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from src.config import RAW_DATA_PATH, TARGET_COL

In [None]:
df = pd.read_csv(RAW_DATA_PATH)
df.head()

In [None]:
df.info()
df.describe()

In [None]:
df.isnull().sum()

In [None]:
# Print the class proportions explicitly: a cell only displays its last expression
print(df[TARGET_COL].value_counts(normalize=True))
df[TARGET_COL].value_counts().plot(kind="bar")
plt.title("Fraud vs Non-Fraud Distribution")
plt.show()
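
The imbalance can also be summarized numerically rather than read off the bar chart. The following is a minimal sketch, assuming the target is coded 0 for legitimate and 1 for fraud; the helper `imbalance_summary`, the column name `Class`, and the toy labels are hypothetical and used only for illustration.

```python
import pandas as pd


def imbalance_summary(labels: pd.Series) -> dict:
    """Summarize a binary label column (assumed coding: 0 = legitimate, 1 = fraud)."""
    counts = labels.value_counts()
    n_pos = int(counts.get(1, 0))
    n_neg = int(counts.get(0, 0))
    return {
        "n_negative": n_neg,
        "n_positive": n_pos,
        "positive_rate": n_pos / len(labels),
        # How many legitimate rows exist per fraud row
        "neg_to_pos_ratio": n_neg / n_pos if n_pos else float("inf"),
    }


# Toy labels (made up for illustration): 995 legitimate, 5 fraudulent
toy_labels = pd.Series([0] * 995 + [1] * 5, name="Class")
summary = imbalance_summary(toy_labels)
print(summary)
```

On the real data the call would be `imbalance_summary(df[TARGET_COL])`. The negative-to-positive ratio is the kind of number that class-weighting options in many classifiers expect.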

In [None]:
plt.figure(figsize=(8,5))
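# A log-scaled axis cannot represent zero amounts, so keep strictly positive values only
sns.histplot(df.loc[df["Amount"] > 0, "Amount"], bins=100, log_scale=True)
plt.title("Amount Distribution (log scale, Amount > 0)")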
plt.show()
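
To check whether a log transform actually reduces the skew, the skewness can be compared before and after. This is a rough sketch on synthetic data, not the notebook's own result: the log-normal stand-in for `Amount` and the choice of `np.log1p` (which stays defined at zero) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for Amount: log-normal values plus a few exact zeros
amount = pd.Series(
    np.concatenate([rng.lognormal(mean=3.0, sigma=1.5, size=5000), np.zeros(50)]),
    name="Amount",
)

# log1p computes log(1 + x), so zero amounts map to 0 instead of -inf
log_amount = np.log1p(amount)

skew_raw = amount.skew()
skew_log = log_amount.skew()
print(f"skew raw: {skew_raw:.2f}, skew after log1p: {skew_log:.2f}")
```

On the real data, `np.log1p(df["Amount"]).skew()` compared against `df["Amount"].skew()` would give the same kind of before-and-after reading.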

In [None]:
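# "Time" is assumed to hold seconds elapsed since the first transaction, so "hour" is
# hours since the start of recording modulo 24, not necessarily the true clock hour
df["hour"] = (df["Time"] / 3600) % 24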
sns.boxplot(x=df[TARGET_COL], y=df["hour"])
plt.title("Time of Day vs Fraud")
plt.show()
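
A boxplot compares the hour distributions of the two classes, but it does not show the fraud rate within each hour. One possible complement is a grouped rate, sketched below on made-up data. The column names `Time` and `Class`, the 0/1 label coding, and the toy values are assumptions for illustration only.

```python
import pandas as pd

# Toy data: Time is seconds since the first transaction (values made up)
toy = pd.DataFrame({
    "Time":  [0, 1800, 3600, 7300, 86400, 90000, 172900],
    "Class": [0, 0, 1, 0, 0, 1, 0],
})

# Integer hour bin in [0, 23], wrapping around after each 24-hour period
toy["hour_bin"] = ((toy["Time"] // 3600) % 24).astype(int)

# Mean of a 0/1 label is the fraud rate; count gives the sample size behind it
fraud_rate_by_hour = toy.groupby("hour_bin")["Class"].agg(
    fraud_rate="mean", n_transactions="count"
)
print(fraud_rate_by_hour)
```

Keeping the transaction count next to the rate matters here: with so few frauds overall, an hour with a handful of rows can show an extreme rate by chance.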

In [None]:
plt.figure(figsize=(14,10))
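# numeric_only guards against non-numeric columns; center=0 keeps the diverging colormap symmetric
sns.heatmap(df.corr(numeric_only=True), cmap="coolwarm", center=0)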
plt.title("Correlation Matrix")
plt.show()
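
A full heatmap is hard to read with many columns, so it can help to rank the features by their correlation with the target alone. The sketch below uses synthetic data; the column names `signal`, `noise`, and `Class` are hypothetical. Note that Pearson correlation against a heavily imbalanced binary label tends to be small in absolute terms, so the ranking is usually more informative than the magnitudes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 2000

# Synthetic binary target with roughly 10% positives
target = (rng.random(n) < 0.1).astype(int)

demo = pd.DataFrame({
    "signal": target * 2.0 + rng.normal(size=n),  # shifted upward for positives
    "noise": rng.normal(size=n),                  # unrelated to the target
    "Class": target,
})

# Correlation of every feature with the target, ranked by absolute value
corr_with_target = (
    demo.drop(columns="Class")
    .corrwith(demo["Class"])
    .sort_values(key=np.abs, ascending=False)
)
print(corr_with_target)
```

On the real data the equivalent would be `df.drop(columns=TARGET_COL).corrwith(df[TARGET_COL])`.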

# 📌 EDA Conclusions

- The dataset has about **284k rows** and **492 frauds**, an extreme class imbalance (frauds are roughly 0.17% of rows).
- No missing values.
- The Amount distribution is highly right-skewed, so the feature would likely benefit from a log transformation.
- PCA features dominate the variance, which is common for anonymized fraud datasets.
- Some PCA components show separability between classes.
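
One way to put a number on the class separability mentioned in the list above is a standardized mean difference per feature: the gap between the class means divided by a pooled standard deviation. This is a hedged sketch and not the analysis behind the conclusions above. The helper `standardized_mean_diff`, the feature names `feat_sep` and `feat_flat`, and the synthetic data are all hypothetical.

```python
import numpy as np
import pandas as pd


def standardized_mean_diff(df: pd.DataFrame, target_col: str) -> pd.Series:
    """Per-feature (mean_pos - mean_neg) / pooled std, assuming a 0/1 target."""
    pos = df[df[target_col] == 1].drop(columns=target_col)
    neg = df[df[target_col] == 0].drop(columns=target_col)
    pooled_std = np.sqrt((pos.var() + neg.var()) / 2)
    smd = (pos.mean() - neg.mean()) / pooled_std
    # Largest absolute gap first
    return smd.sort_values(key=np.abs, ascending=False)


rng = np.random.default_rng(7)
n = 3000
y = (rng.random(n) < 0.02).astype(int)  # roughly 2% positives

demo = pd.DataFrame({
    "feat_sep": rng.normal(size=n) - 3.0 * y,  # shifted for the positive class
    "feat_flat": rng.normal(size=n),           # same distribution in both classes
    "Class": y,
})

smd = standardized_mean_diff(demo, "Class")
print(smd)
```

Features with a large absolute value are candidates for a closer look with per-class histograms or density plots. The statistic only captures shifts in the mean, so a feature that differs in spread or shape could still score near zero.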

These findings inform preprocessing & modeling decisions.