# EDA: E-commerce Fraud Detection
This notebook explores the e-commerce transaction data and prepares it for fraud detection.

## The steps

### **1. Data Loading & Initial Exploration**
   - Load and examine both datasets
   - Check basic info and class distribution

### **2. Data Cleaning**
   - Convert timestamps to datetime
   - Handle IP addresses (convert to integers)
   - Remove duplicates and validate data
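
The cleaning steps above could look roughly like the sketch below, run on a toy frame. The helper name `clean_fraud_frame` and the output column `ip_int` are hypothetical. Treating the IP as an IPv4 address stored as a float, and using "purchase must not precede signup" as the validation rule, are assumptions made for illustration.

```python
import pandas as pd

def clean_fraud_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: parse timestamps, integer-encode IPs, drop bad rows."""
    out = df.copy()
    # Parse both timestamp columns; unparseable values become NaT instead of raising.
    for col in ("signup_time", "purchase_time"):
        out[col] = pd.to_datetime(out[col], errors="coerce")
    # The IP column is stored as a float; truncate it to a 64-bit integer so it
    # can be compared against integer range bounds later (assumes no missing IPs).
    out["ip_int"] = out["ip_address"].astype("int64")
    # Drop exact duplicate rows, then rows where the purchase precedes the signup.
    out = out.drop_duplicates()
    out = out[out["purchase_time"] >= out["signup_time"]]
    return out.reset_index(drop=True)

toy = pd.DataFrame({
    "user_id": [1, 1, 2],
    "signup_time": ["2015-01-01 10:00:00", "2015-01-01 10:00:00", "2015-03-01 08:00:00"],
    "purchase_time": ["2015-01-02 12:30:00", "2015-01-02 12:30:00", "2015-02-01 08:00:00"],
    "ip_address": [732758368.79972, 732758368.79972, 350311387.865908],
})
cleaned = clean_fraud_frame(toy)
print(cleaned)
```

On the toy data, the duplicated row for user 1 is collapsed and the row for user 2 is dropped because its purchase time is earlier than its signup time.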

### **3. IP Address Integration**
   - Load and clean IP-to-country data
   - Convert IP ranges to integers
   - Merge with fraud data using IP ranges
   - Handle any unmatched IPs
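
One possible way to do the range merge is `pandas.merge_asof`, followed by an explicit upper-bound check. This is a sketch, not necessarily the method this notebook ends up using. The column names `lower_bound_ip_address`, `upper_bound_ip_address`, `country`, and `ip_int` are assumptions about the two files, and the `"Unknown"` label for unmatched IPs is an illustrative choice.

```python
import pandas as pd

# Toy stand-ins; the column names are assumptions about the two CSV files.
fraud = pd.DataFrame({"user_id": [1, 2, 3], "ip_int": [150, 999, 320]})
ranges = pd.DataFrame({
    "lower_bound_ip_address": [100, 300],
    "upper_bound_ip_address": [199, 399],
    "country": ["A", "B"],
})

# merge_asof needs both frames sorted on their join keys, with matching dtypes.
fraud_sorted = fraud.sort_values("ip_int")
ranges_sorted = ranges.sort_values("lower_bound_ip_address")

# For each IP, pick the range with the largest lower bound that is <= the IP.
merged = pd.merge_asof(
    fraud_sorted,
    ranges_sorted,
    left_on="ip_int",
    right_on="lower_bound_ip_address",
    direction="backward",
)

# A backward match only checks the lower bound, so enforce the upper bound too.
# IPs below every lower bound come back as NaN and also fail this check.
in_range = merged["ip_int"] <= merged["upper_bound_ip_address"]
merged["country"] = merged["country"].where(in_range, "Unknown")
merged = merged.sort_values("user_id").reset_index(drop=True)
print(merged[["user_id", "ip_int", "country"]])
```

Compared with a row-by-row lookup, this keeps the join vectorized, which matters when the transaction table has many rows.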

### **4. Basic EDA**
   - Time-based analysis
   - Categorical analysis (browser, source, gender)
   - Numerical analysis (purchase value, age)
   - Geolocation analysis (by country)

### **5. Feature Engineering**
   - Time-based features
   - User behavior features
   - Geolocation features
   - Device analysis
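
As a rough illustration of the time-based and device features listed above, the sketch below derives a few of them on toy data. The feature names (`time_since_signup_s`, `hour_of_day`, `day_of_week`, `users_per_device`) are hypothetical, and the timestamp columns are assumed to be parsed to datetimes already.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": [1, 2, 3],
    "device_id": ["D1", "D1", "D2"],
    "signup_time": pd.to_datetime(
        ["2015-01-01 10:00:00", "2015-01-05 09:00:00", "2015-02-01 00:00:00"]
    ),
    "purchase_time": pd.to_datetime(
        ["2015-01-01 10:00:01", "2015-01-20 23:30:00", "2015-02-03 12:00:00"]
    ),
})

# Time-based features: delay between signup and purchase, plus calendar parts.
toy["time_since_signup_s"] = (
    toy["purchase_time"] - toy["signup_time"]
).dt.total_seconds()
toy["hour_of_day"] = toy["purchase_time"].dt.hour
toy["day_of_week"] = toy["purchase_time"].dt.dayofweek  # Monday = 0

# Device feature: number of distinct users seen on each device.
toy["users_per_device"] = toy.groupby("device_id")["user_id"].transform("nunique")
print(toy)
```

A very short signup-to-purchase delay, or a device shared by many accounts, is the kind of pattern these columns are meant to expose. Whether they separate the classes in this dataset has to be checked in the EDA.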

### **6. Data Transformation**
   - Handle categorical variables
   - Scale numerical features
   - Create train/test splits
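
The transformation steps might be ordered as in the sketch below: encode, split, then scale. The toy values and the 50/50 split are arbitrary. Fitting the scaler on the training split only is a design choice assumed here so that test-set statistics do not leak into training.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({
    "purchase_value": [10, 45, 23, 60, 15, 33, 52, 8],
    "age": [22, 35, 41, 29, 50, 31, 27, 38],
    "browser": ["Chrome", "IE", "Chrome", "Safari", "IE", "Chrome", "Safari", "IE"],
    "class": [0, 0, 0, 1, 0, 0, 1, 0],
})
num_cols = ["purchase_value", "age"]

# One-hot encode the categorical column; cast numeric columns to float for scaling.
X = pd.get_dummies(toy.drop(columns="class"), columns=["browser"], dtype=int)
X = X.astype({col: "float64" for col in num_cols})
y = toy["class"]

# Stratify so both splits keep the same share of fraudulent rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=42
)

# Fit the scaler on the training split only, then reuse it on the test split.
scaler = StandardScaler().fit(X_train[num_cols])
X_train = X_train.copy()
X_test = X_test.copy()
X_train[num_cols] = scaler.transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
print(X_train)
```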

### **7. Handle Class Imbalance**
   - Apply SMOTE or class weights
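
For the class-weight option, the sketch below computes the common "balanced" heuristic by hand: each class weight is `n_samples / (n_classes * class_count)`, the same formula scikit-learn applies for `class_weight='balanced'`. The label vector is toy data, not this dataset's actual class ratio.

```python
import pandas as pd

# Toy labels: 8 legitimate (0) and 2 fraudulent (1) transactions.
y = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name="class")

# "Balanced" heuristic: weight_c = n_samples / (n_classes * count_c).
counts = y.value_counts().sort_index()
weights = len(y) / (len(counts) * counts)
class_weight = {int(c): float(w) for c, w in weights.items()}
print(class_weight)  # {0: 0.625, 1: 2.5}
```

A dictionary like this can be passed as the `class_weight` argument of estimators such as `RandomForestClassifier`. If SMOTE is used instead, it is usually applied to the training split only, so the test set keeps the original class ratio.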

### **8. Save & Document**
   - Save processed data
   - Document findings


## Importing the libraries and setting the display options

In [None]:
# Data manipulation
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
%matplotlib inline

# IP address handling
import ipaddress
import socket
import struct

# Feature engineering
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.utils import resample

# Machine learning
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (classification_report, 
                           confusion_matrix, 
                           roc_auc_score, 
                           precision_recall_curve,
                           average_precision_score)

# System and performance
import os
import sys
import time
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)
pd.set_option('display.float_format', '{:.2f}'.format)

# Set plotting style
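# The bare 'seaborn' style name was deprecated in matplotlib 3.6 and later removed;
# 'seaborn-v0_8' is the replacement name (requires matplotlib >= 3.6).
plt.style.use('seaborn-v0_8')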
sns.set_palette('viridis')

print("All libraries imported successfully!")

In [4]:
# Load the data (Path is needed to build the file paths below)
from pathlib import Path

data_dir = Path('../data/raw')
fraud_data = pd.read_csv(data_dir / 'Fraud_Data.csv')
ip_data = pd.read_csv(data_dir / 'IpAddress_to_Country.csv')

# Display basic info
print("Fraud Data Shape:", fraud_data.shape)
print("\nFraud Data Info:")
fraud_data.info()

Fraud Data Shape: (151112, 11)

Fraud Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 151112 entries, 0 to 151111
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   user_id         151112 non-null  int64  
 1   signup_time     151112 non-null  object 
 2   purchase_time   151112 non-null  object 
 3   purchase_value  151112 non-null  int64  
 4   device_id       151112 non-null  object 
 5   source          151112 non-null  object 
 6   browser         151112 non-null  object 
 7   sex             151112 non-null  object 
 8   age             151112 non-null  int64  
 9   ip_address      151112 non-null  float64
 10  class           151112 non-null  int64  
dtypes: float64(1), int64(4), object(6)
memory usage: 12.7+ MB


In [None]:
# Basic statistics
fraud_data.describe()

In [None]:
# Check for missing values
fraud_data.isnull().sum()

In [None]:
# Class distribution, converted to percentages to match the y-axis label
class_dist = fraud_data['class'].value_counts(normalize=True) * 100
plt.figure(figsize=(8, 6))
sns.barplot(x=class_dist.index, y=class_dist.values)
plt.title('Class Distribution')
plt.xlabel('Class (0: Legitimate, 1: Fraudulent)')
plt.ylabel('Percentage')
plt.show()