# Data Cleaning

In this notebook, we perform essential cleaning steps on the datasets before analysis and modeling. Specifically, we will:

- Handle missing values
- Remove duplicates
- Correct data types (especially datetime columns)
- Save cleaned datasets for later use



## Import Libraries and Load Data



In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set display options
pd.set_option('display.max_columns', 100)

# Load data
fraud_df = pd.read_csv("../data/Fraud_Data.csv")
ip_map_df = pd.read_csv("../data/IpAddress_to_Country.csv")
creditcard_df = pd.read_csv("../data/creditcard.csv")


## Handle Missing Values

We will:
- Check for missing values
- Decide whether to drop or impute, with justification
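
The drop-or-impute decision can be sketched as a small helper. This is a minimal illustration, not the procedure used in this notebook: the function name `handle_missing`, the 50% drop threshold, and the median/mode imputation rules are all assumptions chosen for the example, and the toy data is invented.

```python
import numpy as np
import pandas as pd


def handle_missing(df: pd.DataFrame, drop_threshold: float = 0.5) -> pd.DataFrame:
    """Hypothetical helper: drop sparse columns, impute the rest.

    Columns whose share of missing values exceeds `drop_threshold` are
    dropped. Remaining numeric gaps are filled with the column median and
    remaining non-numeric gaps with the most frequent value.
    """
    out = df.copy()

    # Share of missing values per column (0.0 to 1.0).
    missing_share = out.isnull().mean()
    out = out.drop(columns=missing_share[missing_share > drop_threshold].index)

    for col in out.columns:
        if not out[col].isnull().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


# Toy data, invented purely for illustration.
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "browser": ["Chrome", "Safari", None, "Chrome"],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
})
cleaned = handle_missing(toy)
print(cleaned.isnull().sum())
```

Whether a median or mode fill is appropriate depends on the column; for a fraud label or an identifier, dropping the affected rows is usually the safer choice.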

In [5]:
# Check for nulls in each dataset
print(" Missing values in Fraud_Data:")
print(fraud_df.isnull().sum(), "\n")

print(" Missing values in IP Mapping:")
print(ip_map_df.isnull().sum(), "\n")

print(" Missing values in Credit Card Data:")
print(creditcard_df.isnull().sum())


 Missing values in Fraud_Data:
user_id           0
signup_time       0
purchase_time     0
purchase_value    0
device_id         0
source            0
browser           0
sex               0
age               0
ip_address        0
class             0
dtype: int64 

 Missing values in IP Mapping:
lower_bound_ip_address    0
upper_bound_ip_address    0
country                   0
dtype: int64 

 Missing values in Credit Card Data:
Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

All three datasets report zero missing values, so no rows need to be dropped or imputed.


## Remove Duplicates

Duplicate records can skew analysis by over-weighting repeated observations, so we count the exact duplicate rows in each dataset and drop them.


In [6]:
# Check and remove duplicates
print(" Duplicates in Fraud_Data:", fraud_df.duplicated().sum())
fraud_df.drop_duplicates(inplace=True)

print("Duplicates in IP Map:", ip_map_df.duplicated().sum())
ip_map_df.drop_duplicates(inplace=True)

print("Duplicates in Credit Card Data:", creditcard_df.duplicated().sum())
creditcard_df.drop_duplicates(inplace=True)

print("✅ Duplicate records removed.")


 Duplicates in Fraud_Data: 0
Duplicates in IP Map: 0
Duplicates in Credit Card Data: 1081
✅ Duplicate records removed.


In [7]:
print("Duplicates in Credit Card Data after removal:", creditcard_df.duplicated().sum())

Duplicates in Credit Card Data after removal: 0
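
To make the counts above easier to interpret, the sketch below shows how `duplicated()` and `drop_duplicates()` behave on a toy frame (the data is invented for illustration). By default `duplicated()` flags every repeat after the first occurrence, so the printed count equals the number of rows that `drop_duplicates()` removes.

```python
import pandas as pd

# Toy data, invented for illustration: rows 0, 2 and 4 are identical.
toy = pd.DataFrame({
    "Time": [0, 1, 0, 2, 0],
    "Amount": [9.99, 5.00, 9.99, 7.50, 9.99],
})

# With the default keep="first", only the repeats are flagged.
n_repeats = toy.duplicated().sum()

# keep=False flags every member of a duplicate group, including the first.
n_in_groups = toy.duplicated(keep=False).sum()

# Returning a new frame (rather than using inplace=True) leaves `toy` intact;
# reset_index closes the gaps that dropped rows leave in the index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_repeats, n_in_groups, len(deduped))
```

Inspecting the rows selected by `keep=False` before dropping them is a cheap way to check whether the duplicates are genuine repeats or legitimate identical records.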


## Correct Data Types

We will:
- Convert `signup_time` and `purchase_time` in `fraud_df` to datetime objects
- Ensure all features have appropriate types


In [9]:
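# Inspect column dtypes. This cell was executed after the conversion cell
# below (In [9] follows In [8]), so the timestamps already show as datetime64[ns].
fraud_df.info()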

<class 'pandas.core.frame.DataFrame'>
Int64Index: 151112 entries, 0 to 151111
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   user_id         151112 non-null  int64         
 1   signup_time     151112 non-null  datetime64[ns]
 2   purchase_time   151112 non-null  datetime64[ns]
 3   purchase_value  151112 non-null  int64         
 4   device_id       151112 non-null  object        
 5   source          151112 non-null  object        
 6   browser         151112 non-null  object        
 7   sex             151112 non-null  object        
 8   age             151112 non-null  int64         
 9   ip_address      151112 non-null  float64       
 10  class           151112 non-null  int64         
dtypes: datetime64[ns](2), float64(1), int64(4), object(4)
memory usage: 13.8+ MB


In [8]:
# Convert timestamp columns to datetime
fraud_df['signup_time'] = pd.to_datetime(fraud_df['signup_time'])
fraud_df['purchase_time'] = pd.to_datetime(fraud_df['purchase_time'])

# Confirm types
print(fraud_df.dtypes)


user_id                    int64
signup_time       datetime64[ns]
purchase_time     datetime64[ns]
purchase_value             int64
device_id                 object
source                    object
browser                   object
sex                       object
age                        int64
ip_address               float64
class                      int64
dtype: object
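
If some timestamps might be malformed, a more defensive version of the conversion can be sketched as follows. The `%Y-%m-%d %H:%M:%S` format string and the toy values are assumptions for illustration; the actual layout of `signup_time` and `purchase_time` should be checked against the raw file.

```python
import pandas as pd

# Toy data, invented for illustration; the last signup_time is malformed.
toy = pd.DataFrame({
    "signup_time": ["2015-02-24 22:55:49", "2015-06-07 20:39:50", "not a date"],
    "purchase_time": ["2015-04-18 02:47:11", "2015-06-08 01:38:54", "2015-01-01 18:52:45"],
})

ASSUMED_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumption, not confirmed by the source

for col in ["signup_time", "purchase_time"]:
    # errors="coerce" turns unparseable strings into NaT instead of raising.
    toy[col] = pd.to_datetime(toy[col], format=ASSUMED_FORMAT, errors="coerce")

# Count values that failed to parse so they can be inspected or dropped.
n_bad = toy[["signup_time", "purchase_time"]].isna().sum()
print(toy.dtypes)
print(n_bad)
```

Coercing hides parse failures behind `NaT`, so the follow-up count matters: a non-zero value means the missing-value check should be rerun after conversion.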


## Save Cleaned Datasets

Finally, we write the cleaned datasets to `../data/clean/` for later use.

In [14]:
import os
os.makedirs("../data/clean", exist_ok=True)

fraud_df.to_csv("../data/clean/final_fraud_data.csv", index=False)
ip_map_df.to_csv("../data/clean/final_ip_map.csv", index=False)
creditcard_df.to_csv("../data/clean/final_creditcard.csv", index=False)

print("✅ Final cleaned datasets saved to '../data/clean/'")

✅ Final cleaned datasets saved to '../data/clean/'
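
One caveat: CSV stores timestamps as plain text, so the datetime dtypes set earlier are not preserved when a file is read back. A minimal sketch of restoring them on load with `parse_dates` follows; it uses an in-memory buffer and invented toy data in place of the real files.

```python
import io

import pandas as pd

# Toy data, invented for illustration, standing in for the cleaned fraud data.
toy = pd.DataFrame({
    "user_id": [1, 2],
    "signup_time": pd.to_datetime(["2015-02-24 22:55:49", "2015-06-07 20:39:50"]),
    "purchase_time": pd.to_datetime(["2015-04-18 02:47:11", "2015-06-08 01:38:54"]),
})

# Write to an in-memory buffer instead of disk to keep the example self-contained.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# A plain read_csv brings the timestamp columns back as strings.
buffer.seek(0)
plain = pd.read_csv(buffer)

# parse_dates restores them as datetime columns.
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["signup_time", "purchase_time"])

print(plain.dtypes["signup_time"], restored.dtypes["signup_time"])
```

A downstream notebook could pass the same `parse_dates` list when reading `../data/clean/final_fraud_data.csv`.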
