# Challenge

Another approach to identifying fraudulent transactions is to look for outliers in the data. Standard deviation or quartiles are often used to detect outliers. Using this starter notebook, code two Python functions:

* One that uses standard deviation to identify anomalies for any cardholder.

* Another that uses interquartile range to identify anomalies for any cardholder.
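
As a quick illustration of the two rules before any database is involved, here is a minimal sketch using only the Python standard library. The `amounts` list is made-up data, and the thresholds (3 standard deviations, 1.5 times the interquartile range) are common conventions assumed here rather than values the challenge prescribes.

```python
import statistics

# Hypothetical transaction amounts: mostly small charges plus one large one
amounts = [3.5, 12.0, 7.25, 15.0, 9.8, 4.1, 11.3, 6.6, 13.9, 8.4,
           10.2, 5.7, 14.5, 2.9, 1500.0]

# Rule 1: flag values more than 3 population standard deviations from the mean
mean = statistics.fmean(amounts)
std = statistics.pstdev(amounts)
std_outliers = [a for a in amounts if abs(a - mean) > 3 * std]

# Rule 2: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, _, q3 = statistics.quantiles(amounts, n=4, method="inclusive")
iqr = q3 - q1
iqr_outliers = [a for a in amounts if a < q1 - 1.5 * iqr or a > q3 + 1.5 * iqr]

print("Standard deviation rule:", std_outliers)
print("Interquartile range rule:", iqr_outliers)
```

Both rules flag the single large charge in this toy data. Because one extreme value inflates the mean and the standard deviation, the standard deviation rule tends to be the less sensitive of the two on small samples.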

## Identifying Outliers Using Standard Deviation

In [3]:
# Initial imports
import os
import numpy as np
import pandas as pd
from dotenv import load_dotenv
from sqlalchemy import create_engine

In [4]:
# Load .env environment variables
load_dotenv()
postgress_user = os.getenv("POSTGRES_USER")
postgress_pass = os.getenv("POSTGRES_PASS")

In [5]:
# Create a connection to the database
engine = create_engine(f'postgresql://{postgress_user}:{postgress_pass}@localhost:5432/fraud_detection')
# Query through an open connection rather than the engine so the database link stays active
connection = engine.connect()

In [88]:
# Write a function that locates outliers using standard deviation
def outlier_std_identifier(ch_id):
    # Query the transactions for the given card holder ID
    query = f"""
            DROP VIEW if EXISTS transactions_by_ch_id;
            
            CREATE VIEW transactions_by_ch_id AS
            SELECT T.date,
                    C.cardholder_id,
                    CH.name as card_holder_name,
                    T.amount,
                    MC.name as merchant_type
            FROM transaction as T
            INNER JOIN credit_card as C
            ON T.card = C.card
            INNER JOIN card_holder as CH
            ON C.cardholder_id = CH.id
            INNER JOIN merchant as M
            ON T.id_merchant = M.id
            INNER JOIN merchant_category as MC
            ON M.id_merchant_category = MC.id
            WHERE C.cardholder_id = {ch_id}
            ORDER BY T.date;
            
            SELECT * FROM transactions_by_ch_id;
            """
    # Create a DataFrame from the query result
    ch_df = pd.read_sql(query, connection)
    # FOR DEBUGGING: View a sample of the DataFrame
    # display(ch_df.head())
    # Determine the normal range of values: mean +/- 3 standard deviations
    # (NumPy's std() is the population standard deviation, ddof=0)
    mean = round(ch_df['amount'].values.mean(), 2)
    std = round(ch_df['amount'].values.std(), 2)
    range_max = mean + 3 * std
    # Transaction amounts are never negative, so clamp the lower bound at 0
    range_min = max(0, mean - 3 * std)
    # Filter the DataFrame for outlier transactions outside the normal range
    outliers = ch_df.query('amount < @range_min or amount > @range_max')
    # Build a result string with the card holder's transaction statistics and any outliers
    if outliers.empty:
        result = f'mean: {mean}\nstd: {std}\nNo outliers identified'
    else:
        result = f'mean: {mean}\nstd: {std}\n{outliers}'
    # Return the result
    return result

In [92]:
# Find anomalous transactions for 3 random card holders

# Query the list of card holder IDs
query = """
        DROP VIEW if EXISTS card_holders_list;
        
        CREATE VIEW card_holders_list AS
        SELECT DISTINCT C.cardholder_id
        FROM transaction as T
        INNER JOIN credit_card as C
        ON T.card = C.card
        ORDER BY C.cardholder_id;
        
        SELECT * FROM card_holders_list;
        """
# Randomly select 3 card holder IDs
ch_IDs = pd.read_sql(query,connection).sample(3)['cardholder_id'].values.tolist()
# Call the outlier identifier function for the selected card holder IDs
for ch_id in ch_IDs:
    print(f'Outlier charges (potential fraud) for card holder {ch_id}:\n{outlier_std_identifier(ch_id)}\n')

Outlier charges (potential fraud) for card holder 7:
mean: 82.23
std: 313.42
                   date  cardholder_id card_holder_name  amount merchant_type
1   2018-01-04 03:05:18              7      Sean Taylor  1685.0    food truck
19  2018-02-19 16:00:43              7      Sean Taylor  1072.0    food truck
32  2018-04-18 23:23:29              7      Sean Taylor  1086.0   coffee shop
88  2018-08-07 11:07:32              7      Sean Taylor  1449.0    food truck
128 2018-12-13 15:51:59              7      Sean Taylor  2249.0    food truck
133 2018-12-18 17:20:33              7      Sean Taylor  1296.0           bar

Outlier charges (potential fraud) for card holder 1:
mean: 110.67
std: 359.75
                   date  cardholder_id card_holder_name  amount merchant_type
6   2018-01-24 13:17:19              1   Robert Johnson  1691.0   coffee shop
70  2018-07-31 05:15:17              1   Robert Johnson  1302.0   coffee shop
79  2018-09-04 01:35:39              1   Robert Johnson  1790.0 

## Identifying Outliers Using Interquartile Range

In [None]:
# Write a function that locates outliers using interquartile range
def outlier_interquartile_identifier(ch_id):
    return None

In [None]:
# Find anomalous transactions for 3 random card holders
for ch_id in ch_IDs:
    print(f'Interquartile outlier analysis for card holder {ch_id}: {outlier_interquartile_identifier(ch_id)}')
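
The function above is left as a stub in this starter notebook. One possible shape for its core logic is sketched below on an in-memory DataFrame, so it runs without the database. The helper name `iqr_outliers`, its `column` and `k` parameters, and the sample data are all hypothetical. The multiplier of 1.5 is the usual convention for the interquartile range rule, and the clamp of the lower bound at 0 mirrors the standard deviation function earlier in the notebook.

```python
import pandas as pd

def iqr_outliers(df, column="amount", k=1.5):
    """Return rows whose `column` value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    # Transaction amounts are never negative, so clamp the lower bound at 0
    lower = max(0, q1 - k * iqr)
    upper = q3 + k * iqr
    return df[(df[column] < lower) | (df[column] > upper)]

# Hypothetical sample data standing in for one card holder's transactions
sample = pd.DataFrame({"amount": [3.5, 12.0, 7.25, 15.0, 9.8, 4.1, 11.3, 1500.0]})
flagged = iqr_outliers(sample)
print(flagged)
```

Inside `outlier_interquartile_identifier`, a helper like this could be applied to the DataFrame returned by the same per-card-holder query used in `outlier_std_identifier`.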