<a href="https://colab.research.google.com/github/thamadziripi/isolation-forest-ipqs-api/blob/main/fraud_detection_using_isolation_forest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fraud Detection Using Isolation Forest and IPQualityScore API

# Project summary

This project demonstrates the detection of potentially fraudulent transactions using the **Isolation Forest** algorithm, augmented with data from the **IPQualityScore** API. The aim is to build a system that identifies anomalous transactions by leveraging a combination of machine learning and external risk factors, such as fraud scores and user activity.


This project integrates the **Isolation Forest** algorithm with the **IPQualityScore** API to build an efficient fraud detection framework. By combining unsupervised machine learning with external risk data, this notebook demonstrates a robust approach to identifying potentially fraudulent transactions in real-time.

This notebook serves as a practical demonstration of how machine learning models can be utilised to enhance fraud detection in transaction datasets, making it a valuable resource for financial and e-commerce fraud prevention applications.

## Key Components:

### Data Sourcing
The project utilises transaction data sourced from the IPQualityScore API, which provides insights into the likelihood of fraud based on various features, including:
- **Fraud score**: A numeric risk score (0-100) indicating the probability of fraudulent activity.
- **Valid billing and shipping address**: Boolean indicators showing the legitimacy of address details.
- **User activity**: Categorical data measuring the frequency of legitimate user behaviour (e.g., high, medium, low).
- **Leaked user data**: Indicates if the user's data has been exposed in known breaches.

### Data Preprocessing
The transaction data undergoes a series of preprocessing steps before being fed into the Isolation Forest model:
- **Categorical Encoding**: Categorical fields such as `user_activity` are encoded into numerical values.
- **Boolean Conversion**: Binary fields (e.g., address validity) are converted into integers (`1` for `True`, `0` for `False`).
- **Normalisation**: Numerical features are normalised to bring them onto a similar scale, optimising the performance of the model.

### Isolation Forest Model
An **Isolation Forest** is trained to detect anomalous transactions. This unsupervised algorithm isolates observations by randomly selecting features and split values, with anomalous transactions requiring fewer splits to be isolated. Key parameters include:
- **`contamination`**: Controls the expected proportion of anomalous transactions in the data.
- **`n_estimators`**: The number of trees in the forest, which impacts model accuracy and complexity.

### Fraud Detection Process
Once trained, the Isolation Forest identifies transactions as either normal (`1`) or anomalous (`-1`). Anomalous transactions are flagged for further inspection based on IPQualityScore's fraud score and other risk metrics. This hybrid approach combines rule-based insights with machine learning to enhance fraud detection.

### Model Tuning and Evaluation
The model parameters are fine-tuned to strike a balance between detecting actual fraud and minimising false positives. The system's performance is evaluated by comparing the flagged anomalies against IPQualityScore's fraud metrics, offering a data-driven approach to improving fraud detection accuracy.




# Code

## Project set-up

In [6]:
import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import requests
import json
from google.colab import userdata

# Stored API key as a google colab secret (recommended)
API_KEY = userdata.get('IPQualitySocreAPIKey')
IP_TOKEN = userdata.get('IPInfoToken')
IP_ADDRESS = "82.5.130.47"

## Data preprocessing

In [16]:
def get_ipqualityscore_data(api_key, ip: str) -> dict:
  """
  Function used to lookup Payment & Transaction Fraud Prevention API

  Args:
      api_key (str): _description_
      ip (str): _description_

  Returns:
      dict: _description_
  """

  url = f"https://ipqualityscore.com/api/json/ip/{api_key}/{ip_address}"

  try:
      response = requests.get(url)
      data = response.json()

      if response.status_code != 200 or 'fraud_score' not in data:
          return {"error": "Failed to fetch data or fraud score missing"}

      result = {
          "fraud_score": data.get("fraud_score", None),
          "valid_billing_address": data.get("billing_address", {}).get("valid", None),
          "valid_shipping_address": data.get("shipping_address", {}).get("valid", None),
          "user_activity": data.get("user_activity", None),  # Could be 'high', 'medium', 'low'
          "leaked_user_data": data.get("leaked", None),  # Boolean, checks if user data was leaked
          "transaction_risk": data.get("transaction_risk", None),  # Overall transaction risk assessment
          "recent_abuse": data.get("recent_abuse", None),  # Tracks if IP has been involved in fraud
          "proxy": data.get("proxy", None),  # Checks if the user is using a proxy
          "vpn": data.get("vpn", None),  # Whether user is on a VPN
          "tor": data.get("tor", None),  # Whether user is using Tor
          "device_tracking": data.get("device_tracking", None),  # Tracking device ID if applicable
      }

      return result

  except Exception as e:
      return {"error": str(e)}

def iterate_ip_addresses(api_key: str, ip_addresses: str) -> pd.DataFrame:
  """
  A function that iterates through a list of IP addresses and returns a
  DataFrame.

  Args:
    api_key (str): _description_
    ip (str): _description_

  Returns:
      dict: _description_
  """

  results = []
  for ip in ip_addresses:
      result = get_ipqualityscore_data(api_key, ip)
      results.append(result)

  return pd.DataFrame(results)

ipv4s = ["8.8.8.8", "82.5.130.47", "123.45.67.89"]
data = iterate_ip_addresses(API_KEY, IP_ADDRESS)

In [17]:
data

Unnamed: 0,fraud_score,valid_billing_address,valid_shipping_address,user_activity,leaked_user_data,transaction_risk,recent_abuse,proxy,vpn,tor,device_tracking
0,100,,,,,,True,True,True,False,
1,100,,,,,,True,True,True,False,
2,100,,,,,,True,True,True,False,
3,100,,,,,,True,True,True,False,
4,100,,,,,,True,True,True,False,
5,100,,,,,,True,True,True,False,
6,100,,,,,,True,True,True,False,
7,100,,,,,,True,True,True,False,
8,100,,,,,,True,True,True,False,
9,100,,,,,,True,True,True,False,


## Isolation Forest Model

## Model tuning & evaluation

# Conclusion