# StyleMeUp - Fraud Detection in Online Retail 


#### Problem Description: 
A global retailer 'StyleMeUp' has been experiencing transaction fraud. To reduce costs related to fraudulent transactions, StyleMeUp wants to implement a fraud detection solution that leverages machine learning. 

This demo showcases how the Data Engineering and Data Science teams at StyleMeUp can use familiar programming concepts and APIs, together with the rich ecosystem of open source packages available through Snowpark for Python, to collaborate and build this solution.

### Data Engineering Notebook

As data engineers, we have been tasked not only with loading the orders and customer transaction details, but also with helping data scientists quickly identify whether a transaction could be fraudulent. To do that, we will analyze the originating IP address of each transaction and build features using third-party and second-party data sets obtained directly from the Snowflake Marketplace and Data Exchange.

We will use built-in functions, the Snowpark Python API, and UDFs to create enriched data and features.

#### Let's start by writing some helper functions that we will use later
We need helper functions to simplify the data pipeline. They will help when we join the orders data with the IPInfo data to build features.

In [15]:
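def to_join_key_func(df, col):
    # The orders table already carries a precomputed JOIN_KEY column,
    # so `col` is unused and the existing column is returned.
    return df.join_key

def builtin(function_name):
    import snowflake.snowpark.functions as sf
    # 'to_join_key' is handled by the local helper above;
    # every other name is resolved as a Snowflake built-in function.
    if function_name == 'to_join_key':
        return to_join_key_func
    return sf.builtin(function_name)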
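
The `JOIN_KEY` values that appear in the `orders` output later in this notebook are consistent with the IPv4 address converted to a 32-bit integer with its low 16 bits cleared (for example, `103.55.45.248` gives `1731657728`). The standalone sketch below reproduces that arithmetic in plain Python. It is only an illustration of the idea, not the IPInfo implementation, and the names `ip_to_int` and `join_key_for` are hypothetical.

```python
import ipaddress


def ip_to_int(ip: str) -> int:
    """Return an IPv4 address as an unsigned 32-bit integer."""
    return int(ipaddress.IPv4Address(ip))


def join_key_for(ip: str) -> int:
    """Clear the low 16 bits, keeping only the /16 network prefix.

    Assumption: this matches the JOIN_KEY column for IPv4 addresses only.
    """
    return ip_to_int(ip) & 0xFFFF0000


if __name__ == "__main__":
    for ip in ("103.55.45.248", "104.128.113.128"):
        print(ip, ip_to_int(ip), join_key_for(ip))
```

Joining on a coarse key like this lets the database use an equality join first and apply the more expensive range check only to the rows that share a prefix.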

In [16]:
def enrich_with_geocoordinates(df):
    
    location_df = session.table('ipinfo.public.location')
    orders_ip_location_df = df.join(location_df, to_join_key(df, 'ip_address') == location_df.join_key) \
        .where(parse_ip(df.ip_address, 'inet')['ipv4'].between(location_df.start_ip_int, location_df.end_ip_int)) \
        .select('trnx_id', 'ip_address', location_df.lat.alias('ip_order_loc_lat'), location_df.lng.alias('ip_order_loc_lng'))

    orders_shipping_location_all_locations_df = df.join(location_df, to_join_key(df, 'ip_address') == location_df.join_key) \
        .filter(df.shipping_zipcode == location_df.postal)  \
        .select('trnx_id', 'ip_address', location_df.lat.alias('shipping_lat'), location_df.lng.alias('shipping_lng'))

    orders_shipping_location_avg_lat_df = orders_shipping_location_all_locations_df \
        .groupBy(['trnx_id', 'ip_address']).agg(avg(col('shipping_lat')).alias('shipping_lat'))
    
    orders_shipping_location_avg_lng_df = orders_shipping_location_all_locations_df \
        .groupBy(['trnx_id', 'ip_address']).agg(avg(col('shipping_lng')).alias('shipping_lng'))

    orders_shipping_location_df = orders_shipping_location_avg_lat_df \
        .join(orders_shipping_location_avg_lng_df, ['ip_address', 'trnx_id'])
    
    orders_location_df = df \
        .select('trnx_id', 'ip_address', 'shipping_zipcode' ) \
        .join(orders_ip_location_df, [ 'trnx_id', 'ip_address']) \
        .join(orders_shipping_location_df, [ 'trnx_id', 'ip_address'])
 
    return orders_location_df
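
The IP lookup in `enrich_with_geocoordinates` works in two steps: an equality join on the join key, followed by a `between` filter on the range bounds. The sketch below mimics that pattern in plain Python. The ranges and coordinates are made up, the helper names are hypothetical, and it assumes every range fits inside a single /16 block (a wider range would need to be indexed under each key it covers).

```python
import ipaddress
from collections import defaultdict


def ip_to_int(ip: str) -> int:
    return int(ipaddress.IPv4Address(ip))


def join_key_for(ip_int: int) -> int:
    # Keep only the /16 prefix (assumed join-key definition).
    return ip_int & 0xFFFF0000


# Hypothetical location ranges: (start_ip, end_ip, lat, lng).
RANGES = [
    ("10.1.0.0", "10.1.0.255", 1.0, 2.0),
    ("10.1.1.0", "10.1.1.255", 3.0, 4.0),
    ("10.2.0.0", "10.2.255.255", 5.0, 6.0),
]


def build_index(ranges):
    """Group ranges by join key so a lookup only scans one bucket."""
    index = defaultdict(list)
    for start, end, lat, lng in ranges:
        start_int, end_int = ip_to_int(start), ip_to_int(end)
        index[join_key_for(start_int)].append((start_int, end_int, lat, lng))
    return index


def lookup(index, ip: str):
    """Step 1: equality match on the key. Step 2: range check."""
    ip_int = ip_to_int(ip)
    for start_int, end_int, lat, lng in index.get(join_key_for(ip_int), []):
        if start_int <= ip_int <= end_int:
            return (lat, lng)
    return None


INDEX = build_index(RANGES)

if __name__ == "__main__":
    print(lookup(INDEX, "10.1.1.77"))
    print(lookup(INDEX, "10.3.0.1"))
```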

#### Import Snowpark libraries

In [3]:
from snowflake.snowpark.session import Session
from snowflake.snowpark.functions import udf, avg, col
from snowflake.snowpark.types import IntegerType, FloatType, StringType, BooleanType
import pandas as pd
from config import snowfalke_conn_prop

In [4]:
from snowflake.snowpark import version
print(version.VERSION)

(0, 2, 0, None)


#### Connect to Snowflake

In [17]:
session = Session.builder.configs(snowfalke_conn_prop).create()
print(session.sql('create schema if not exists frauddemo').collect())
print(session.sql('use schema frauddemo').collect())
print(session.sql('select current_warehouse(), current_database(), current_schema()').collect())

[Row(status='FRAUDDEMO already exists, statement succeeded.')]
[Row(status='Statement executed successfully.')]
[Row(CURRENT_WAREHOUSE()='LAB_S_WH', CURRENT_DATABASE()='DEMO', CURRENT_SCHEMA()='FRAUDDEMO')]


#### Create dataframes for Snowflake tables

In [18]:
orders_df = session.table('orders')
orders_df.limit(10).toPandas()

|  | ISFRAUD | TRNX_ID | IP_ADDRESS | CITY | SHIPPING_ZIPCODE | SHIPPING_STATE | PAYMENT_NETWORK | PAYMENT_TYPE | TOTAL_TRNX_AMOUNT | JOIN_KEY |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | XSHNDTR1FH | 103.55.45.248 | Bellevue | 98006 | WA | Mastercard | Credit | 388.3 | 1731657728 |
| 1 | 0 | RN5JV38DSG | 104.128.113.128 | Los Angeles | 90009 | CA | Diners Club | Credit | 261.06 | 1753219072 |
| 2 | 1 | MTAKNRFPIV | 104.149.135.14 | Los Angeles | 90009 | CA | Other | Credit | 788.98 | 1754595328 |
| 3 | 0 | IPB02VY5ZH | 104.151.240.51 | Dearing | 67340 | KS | Amex | Credit | 300.62 | 1754726400 |
| 4 | 1 | KAFTHXMY6C | 104.156.237.244 | Dallas | 75270 | TX | Amex | Credit | 445.21 | 1755054080 |
| 5 | 1 | KWGNPAROUG | 104.168.23.0 | Los Angeles | 90009 | CA | Mastercard | Other | 189.78 | 1755840512 |
| 6 | 0 | 3FUXYETWFW | 104.169.163.107 | Monroe | 28111 | NC | Visa | Debit | 164.78 | 1755906048 |
| 7 | 1 | 5SKBRYDRIT | 104.219.251.112 | Phoenix | 85001 | AZ | Visa | Credit | 65.72 | 1759182848 |
| 8 | 0 | GWNUOMTDUP | 104.238.156.136 | Dearing | 67340 | KS | Visa | Debit | 286.49 | 1760428032 |
| 9 | 0 | GFFFV6H6ZC | 104.245.239.0 | Los Angeles | 90009 | CA | Visa | Credit | 61.37 | 1760886784 |


In [19]:
order_details_df = session.table('order_details')
order_details_df.limit(10).toPandas()

|  | TRNX_ID | ITEM | PRICE | QTY |
|---|---|---|---|---|
| 0 | FANGEBUUTE | JADE GREEN ENAMEL HAIR COMB | 72.71 | 4 |
| 1 | FANGEBUUTE | ASSORTED COLOUR LIZARD SUCTION HOOK | 16.5 | 1 |
| 2 | FANGEBUUTE | STRAWBERRY FAIRY CAKE TEAPOT | 12.74 | 1 |
| 3 | TWJSHYBFL1 | PAPER BUNTING PAISLEY PARK | 68.33 | 4 |
| 4 | TWJSHYBFL1 | MINI LADLE LOVE HEART RED | 69.95 | 1 |
| 5 | TWJSHYBFL1 | PINK UNION JACK PASSPORT COVER | 71.55 | 2 |
| 6 | S6OV6NCHQD | ENGLISH ROSE HOT WATER BOTTLE | 68.88 | 3 |
| 7 | S6OV6NCHQD | FELTCRAFT PRINCESS CHARLOTTE DOLL | 45.03 | 1 |
| 8 | S6OV6NCHQD | TRIPLE WIRE HOOK IVORY HEART | 31.86 | 5 |
| 9 | S6OV6NCHQD | POSY CANDY BAG | 63.83 | 3 |


#### Aggregate avg_price_per_item feature

In [20]:
avg_price_df = orders_df.join(order_details_df, 'trnx_id') \
                        .groupBy(orders_df.trnx_id) \
                        .agg(avg(order_details_df.price).alias('avg_price_per_item')) 

avg_price_df.limit(10).toPandas()

|  | TRNX_ID | AVG_PRICE_PER_ITEM |
|---|---|---|
| 0 | FANGEBUUTE | 33.983333 |
| 1 | S6OV6NCHQD | 47.714 |
| 2 | SPIX7QQSNF | 58.633333 |
| 3 | ONWCVWSCCS | 43.865 |
| 4 | N6SHHHU892 | 48.683333 |
| 5 | LOAHFAGB6A | 27.425 |
| 6 | NZTHF7KLVY | 43.555 |
| 7 | JW8TGLETFP | 40.56 |
| 8 | U5UCR5BXJO | 43.9 |
| 9 | 1XRGDTZYY6 | 49.734 |
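
Note that `avg(price)` averages over the line items of a transaction and does not take `QTY` into account. The plain-Python sketch below recomputes the feature from the `order_details` rows shown earlier, assuming those rows are the complete set for each transaction, and contrasts it with a quantity-weighted variant. The function names are hypothetical and the weighted version is not part of the original pipeline.

```python
from collections import defaultdict

# Rows copied from the order_details sample: (trnx_id, item, price, qty).
ORDER_DETAILS = [
    ("FANGEBUUTE", "JADE GREEN ENAMEL HAIR COMB", 72.71, 4),
    ("FANGEBUUTE", "ASSORTED COLOUR LIZARD SUCTION HOOK", 16.5, 1),
    ("FANGEBUUTE", "STRAWBERRY FAIRY CAKE TEAPOT", 12.74, 1),
    ("TWJSHYBFL1", "PAPER BUNTING PAISLEY PARK", 68.33, 4),
    ("TWJSHYBFL1", "MINI LADLE LOVE HEART RED", 69.95, 1),
    ("TWJSHYBFL1", "PINK UNION JACK PASSPORT COVER", 71.55, 2),
]


def avg_price_per_item(rows):
    """Unweighted mean of the line-item prices per transaction."""
    prices = defaultdict(list)
    for trnx_id, _item, price, _qty in rows:
        prices[trnx_id].append(price)
    return {trnx_id: sum(p) / len(p) for trnx_id, p in prices.items()}


def qty_weighted_avg_price(rows):
    """Alternative feature: total spend divided by total units."""
    spend = defaultdict(float)
    units = defaultdict(int)
    for trnx_id, _item, price, qty in rows:
        spend[trnx_id] += price * qty
        units[trnx_id] += qty
    return {trnx_id: spend[trnx_id] / units[trnx_id] for trnx_id in spend}


if __name__ == "__main__":
    print(avg_price_per_item(ORDER_DETAILS))
    print(qty_weighted_avg_price(ORDER_DETAILS))
```

For `FANGEBUUTE` the unweighted mean agrees with the `AVG_PRICE_PER_ITEM` value in the output above, which suggests the Snowpark aggregation behaves the same way.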


#### Enrich data with IPInfo Privacy dataset to determine if IP is masked

In [21]:
privacy_df = session.table('ipinfo.public.privacy')
parse_ip = builtin("parse_ip")
to_join_key = builtin("to_join_key")

orders_masked_df = orders_df \
    .join(privacy_df, to_join_key(orders_df, 'ip_address') == privacy_df.join_key) \
    .where(parse_ip(orders_df.ip_address, 'inet')['ipv4'].between(privacy_df.start_ip_int, privacy_df.end_ip_int)) \
    .select('trnx_id', 'ip_address', (privacy_df.proxy | privacy_df.tor | privacy_df.vpn).alias('is_masked'))  
 
#orders_masked_df.schema   
#orders_masked_df.collect()
orders_masked_df.sample(n=10).toPandas()

|  | TRNX_ID | IP_ADDRESS | IS_MASKED |
|---|---|---|---|
| 0 | 9H1MEPG7AN | 85.187.158.228 |  |
| 1 | R9L9MCP6FN | 34.176.46.0 |  |
| 2 | WFHRBMMGS2 | 66.219.54.72 |  |
| 3 | HDB5VBIO75 | 45.76.240.74 |  |
| 4 | JLPJ3QYLAQ | 18.99.252.0 |  |
| 5 | OZNA06ZWJK | 209.216.47.172 |  |
| 6 | RQRGUAOE4L | 216.130.0.68 |  |
| 7 | LJHL9JTL8S | 74.2.16.0 |  |
| 8 | 9IROSDKD2A | 159.100.29.0 |  |
| 9 | CX2MJAJXTL | 107.175.87.146 |  |


#### Enrich data with IPInfo Location dataset to get geo-coordinates

In [22]:
loc_df = enrich_with_geocoordinates(orders_df)
loc_df.sample(n=10).toPandas()

|  | TRNX_ID | IP_ADDRESS | SHIPPING_ZIPCODE | IP_ORDER_LOC_LAT | IP_ORDER_LOC_LNG | SHIPPING_LAT | SHIPPING_LNG |
|---|---|---|---|---|---|---|---|
| 0 | SU9GTFNCKZ | 76.81.101.216 | 96816 | 21.2887 | -157.8006 | 21.2887 | -157.8006 |
| 1 | SIBR0YFPSP | 209.235.254.208 | 18105 | 40.60843 | -75.49018 | 40.60843 | -75.49018 |
| 2 | 0KWMLLLJAD | 96.46.94.32 | 74015 | 36.23482 | -95.69109 | 36.23482 | -95.69109 |
| 3 | 9MBTMEJJDD | 174.208.37.128 | 14202 | 42.88645 | -78.87837 | 42.88645 | -78.87837 |
| 4 | W7L7IUKTLX | 184.74.142.232 | 12487 | 41.8651 | -73.9948 | 41.8651 | -73.9948 |
| 5 | GRC3HMPVB1 | 24.39.86.200 | 14485 | 42.90479 | -77.61139 | 42.90479 | -77.61139 |
| 6 | G8KUTUKXFX | 71.78.138.160 | 78701 | 33.08901 | -96.88639 | 30.26715 | -97.74306 |
| 7 | E31I94NWYU | 104.128.113.128 | 90009 | 34.05223 | -118.24368 | 34.05223 | -118.24368 |
| 8 | 43FMQPZDZL | 50.251.14.0 | 87102 | 35.08449 | -106.65114 | 35.08449 | -106.65114 |
| 9 | 7OVTZW0A1R | 75.147.78.128 | 19099 | 39.95233 | -75.16379 | 39.95233 | -75.16379 |


#### Calculate the distance between the order IP and shipping locations using Snowflake's built-in Geography functions

In [23]:
%%time
import snowflake.snowpark.functions as F

session.sql("alter session set geography_output_format='WKT'").collect()


distance_df = loc_df.select(loc_df.trnx_id, loc_df.ip_address, loc_df.shipping_zipcode, \
                        F.call_builtin("st_makepoint",loc_df.IP_ORDER_LOC_LNG,loc_df.IP_ORDER_LOC_LAT).alias('ipinfo_point') \
                       ,F.call_builtin("st_makepoint",loc_df.SHIPPING_LNG,loc_df.SHIPPING_LAT).alias('shipping_point') \
                       ,F.call_builtin("st_distance",col("ipinfo_point"),col("shipping_point")).alias("ip_to_shipping_distance") \
                       ,(col("ip_to_shipping_distance")/1609).alias("distance_in_miles") )
#distance_df.sample(n=10).toPandas()

CPU times: user 10.1 ms, sys: 1.01 ms, total: 11.2 ms
Wall time: 348 ms
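
As a sanity check on the distance feature, the great-circle distance can be approximated outside Snowflake with the haversine formula. The sketch below is plain Python and is not how Snowflake computes `st_distance`; the Earth radius constant is an assumption, so results should be close to, but not necessarily identical to, the values in `IP_TO_SHIPPING_DISTANCE`. Note that `st_makepoint` takes longitude first, whereas this hypothetical helper takes latitude first.

```python
import math

EARTH_RADIUS_M = 6_371_000.0   # assumed mean Earth radius in meters
METERS_PER_MILE = 1609.344     # exact; the notebook divides by 1609


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two (lat, lng) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def meters_to_miles(meters):
    return meters / METERS_PER_MILE


if __name__ == "__main__":
    # Coordinates taken from one row of the enriched output (Indianapolis).
    d = haversine_m(39.97837, -86.11804, 39.8082, -86.1014)
    print(d, meters_to_miles(d))
```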


#### Write enriched data back to a new Snowflake table

In [24]:
%%time
# Build the merged DataFrame first; saveAsTable() returns None, so chaining
# the write onto the assignment would leave orders_merged_df empty.
orders_merged_df = orders_df.join(orders_masked_df, ['trnx_id', 'ip_address'], 'left_outer') \
    .join(loc_df,['trnx_id', 'ip_address', 'shipping_zipcode'],  'left_outer') \
    .join(distance_df,['trnx_id', 'ip_address', 'shipping_zipcode'], 'left_outer') \
    .join(avg_price_df,'trnx_id', 'left_outer')

orders_merged_df.write.mode('overwrite').saveAsTable('enriched_data')

CPU times: user 63.4 ms, sys: 6.12 ms, total: 69.6 ms
Wall time: 39.7 s
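
Because every join above is a `left_outer` join, each order is kept even when no enrichment row matches, and the missing columns come back empty. The plain-Python sketch below illustrates that behavior on made-up rows. The helper `left_outer_join` is hypothetical and assumes at most one right-hand row per key.

```python
# Made-up sample data for illustration only.
ORDERS = [
    {"trnx_id": "A1", "amount": 10.0},
    {"trnx_id": "B2", "amount": 20.0},
]
LOCATIONS_BY_ID = {
    "A1": {"shipping_lat": 1.5, "shipping_lng": 2.5},
}


def left_outer_join(left_rows, right_by_key, key, right_columns):
    """Keep every left row; fill right-hand columns with None when unmatched."""
    joined = []
    for row in left_rows:
        match = right_by_key.get(row[key])
        extra = {c: (match[c] if match is not None else None) for c in right_columns}
        joined.append({**row, **extra})
    return joined


ENRICHED = left_outer_join(
    ORDERS, LOCATIONS_BY_ID, "trnx_id", ["shipping_lat", "shipping_lng"]
)

if __name__ == "__main__":
    for row in ENRICHED:
        print(row)
```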


In [13]:
enr_df = session.table('enriched_data').sample(n = 20000)
enr_df.sample(n=10).toPandas()


|  | TRNX_ID | IP_ADDRESS | SHIPPING_ZIPCODE | ISFRAUD | CITY | SHIPPING_STATE | PAYMENT_NETWORK | PAYMENT_TYPE | TOTAL_TRNX_AMOUNT | JOIN_KEY | IS_MASKED | IP_ORDER_LOC_LAT | IP_ORDER_LOC_LNG | SHIPPING_LAT | SHIPPING_LNG | IPINFO_POINT | SHIPPING_POINT | IP_TO_SHIPPING_DISTANCE | DISTANCE_IN_MILES | AVG_PRICE_PER_ITEM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | QOBKTPZM4L | 24.123.22.240 | 46218 | 0 | Indianapolis | IN | Amex | Credit | 368.36 | 410714112 |  | 39.97837 | -86.11804 | 39.8082 | -86.1014 | POINT(-86.11804 39.97837) | POINT(-86.1014 39.8082) | 18975.248284 | 11.793193 | 40.5 |
| 1 | 1G7OXLD6E4 | 202.94.129.176 | 10004 | 1 | New York City | NY | Mastercard | Credit | 103.98 | 3395158016 |  |  |  |  |  |  |  |  |  | 54.7625 |
| 2 | ALPO7PMR5B | 24.249.80.130 | 32566 | 0 | Navarre | FL | Mastercard | Credit | 571.97 | 418971648 |  |  |  |  |  |  |  |  |  | 27.553333 |
| 3 | RACMRMXRDU | 24.231.214.160 | 48602 | 0 | Saginaw | MI | Diners Club | Credit | 247.6 | 417792000 |  | 43.59781 | -84.76751 | 43.4248 | -83.9745 | POINT(-84.76751 43.59781) | POINT(-83.9745 43.4248) | 66781.306658 | 41.504852 | 38.535 |
| 4 | F1JWQYS4MY | 162.155.236.176 | 44240 | 0 | Brimfield | OH | Mastercard | Credit | 135.12 | 2728067072 |  | 37.98869 | -84.47772 | 41.1449 | -81.3498 | POINT(-84.47772 37.98869) | POINT(-81.3498 41.1449) | 441582.64517 | 274.445398 | 64.8825 |
| 5 | P7QPIODG9C | 208.185.145.9 | 60666 | 0 | Chicago | IL | Diners Club | Credit | 186.91 | 3501785088 |  | 41.85003 | -87.65005 | 41.85003 | -87.65005 | POINT(-87.65005 41.85003) | POINT(-87.65005 41.85003) | 0.0 | 0.0 | 36.842 |
| 6 | RRP9WFDPHR | 223.243.53.44 | 94088 | 0 | Sunnyvale | CA | Visa | Credit | 120.65 | 3757244416 |  |  |  |  |  |  |  |  |  | 53.8575 |
| 7 | P32EDRZWJT | 71.78.138.160 | 78701 | 0 | Austin | TX | Mastercard | Credit | 681.89 | 1196294144 |  | 33.08901 | -96.88639 | 30.26715 | -97.74306 | POINT(-96.88639 33.08901) | POINT(-97.74306 30.26715) | 324075.164216 | 201.414024 | 43.113333 |
| 8 | XLL3HAHEU0 | 209.173.40.0 | 38301 | 1 | Jackson | TN | Mastercard | Credit | 89.9 | 3517775872 |  | 35.25619 | -88.98784 | 35.61452 | -88.81395 | POINT(-88.98784 35.25619) | POINT(-88.81395 35.61452) | 42845.992016 | 26.628957 | 45.37 |
| 9 | WHZR3LOFCP | 209.160.114.244 | 60666 | 0 | Chicago | IL | Visa | Debit | 72.44 | 3516923904 |  | 41.85003 | -87.65005 | 41.85003 | -87.65005 | POINT(-87.65005 41.85003) | POINT(-87.65005 41.85003) | 0.0 | 0.0 | 44.211667 |


In [25]:
enr_df.write.mode('overwrite').saveAsTable('new_transaction_data')