# StyleMeUp - Fraud Detection in Online Retail 

### Data Science: Model Training, Pipeline, Deployment, and Serving

### In this notebook we use the data that our data engineers enriched and prepared with the IPINFO dataset.
##### We will follow these steps:

1. Use Snowpark Python to connect to Snowflake
2. Get the training dataset
3. Visualize features
4. Check feature importance
5. Split the training dataset into train and test sets
6. Set up transformations
7. Set up the classifier
8. Build the ML pipeline
9. Train and test the model
10. Check model accuracy
11. Deploy the model as a Python UDF in Snowflake

#### Finally, use the model deployed in Snowflake to score and predict data stored in Snowflake



In [None]:
!pip install matplotlib_venn

In [None]:
from snowflake.snowpark.session import Session
from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import IntegerType, FloatType, StringType, BooleanType

from matplotlib_venn import venn2
import sys
sys.path.append('..')
from utilities.creds import Credentials

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from snowflake.snowpark import version
print(f"Snowpark for Python version: {version.VERSION}")

#### Create a session to connect to Snowflake

In [None]:
session = Session.builder.configs(Credentials().__dict__).create()
print(session.sql('USE WAREHOUSE LEARNINGSNOWPARKVW').collect())
print(session.sql('USE DATABASE LEARNINGSNOWPARKDB').collect())
print(session.sql('USE SCHEMA FRAUDDEMO').collect())
print(session.sql('SELECT CURRENT_WAREHOUSE(), CURRENT_DATABASE(), CURRENT_SCHEMA()').collect())
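The `Credentials` helper imported from `utilities.creds` is not shown in this notebook. As a minimal sketch of what such a helper might look like, the class below reads connection parameters from environment variables. The class name `EnvCredentials` and the environment variable names are assumptions made for illustration; only the dictionary keys (`account`, `user`, `password`, `role`) are standard Snowpark connection parameters.

```python
import os
from dataclasses import dataclass, field


def _env(name: str) -> str:
    # Read one setting from the environment; empty string if it is not set.
    return os.environ.get(name, "")


@dataclass
class EnvCredentials:
    # Hypothetical stand-in for utilities.creds.Credentials.
    # The environment variable names below are illustrative assumptions.
    account: str = field(default_factory=lambda: _env("SNOWFLAKE_ACCOUNT"))
    user: str = field(default_factory=lambda: _env("SNOWFLAKE_USER"))
    password: str = field(default_factory=lambda: _env("SNOWFLAKE_PASSWORD"))
    role: str = field(default_factory=lambda: _env("SNOWFLAKE_ROLE"))


# The instance __dict__ is a plain dict of connection parameters.
connection_parameters = EnvCredentials().__dict__
print(sorted(connection_parameters))

# With real values in place, the dict could be passed to Snowpark like this:
# session = Session.builder.configs(connection_parameters).create()
```

Keeping secrets in environment variables rather than in the notebook avoids committing them to source control.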

In [None]:
train_dataset = session.table('enriched_data').sample(n = 20000)
df = train_dataset.toPandas()

## Data Exploration
### Masked IP feature visualization

In [None]:
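# venn2 expects (fraud only, masked only, both) rather than the set totals
n_fraud = (df['ISFRAUD'] == 1).sum()
n_masked = (df['IS_MASKED'] == 1).sum()
n_both = ((df['ISFRAUD'] == 1) & (df['IS_MASKED'] == 1)).sum()
venn2(subsets = (n_fraud - n_both, n_masked - n_both, n_both),
      set_labels = ('Fraud', 'Masked IP'))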
plt.show()
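The overlap in the Venn diagram can also be read as a number. The sketch below computes the fraud rate for masked and unmasked IPs on a small made-up frame; `toy` and its values are illustrative only and stand in for the sampled `df`.

```python
import pandas as pd

# Toy stand-in for the sampled enriched_data frame (values are made up).
toy = pd.DataFrame({
    "ISFRAUD":   [1, 1, 0, 0, 1, 0, 0, 0],
    "IS_MASKED": [1, 1, 1, 0, 0, 0, 0, 0],
})

# Share of fraudulent transactions within each IS_MASKED group.
fraud_rate_by_mask = toy.groupby("IS_MASKED")["ISFRAUD"].mean()
print(fraud_rate_by_mask)
```

A clearly higher rate in the masked group would suggest that `IS_MASKED` carries signal for the classifier.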

## Training fraud detection model

### Preparing training data

In [None]:
features = ['CITY', 'SHIPPING_ZIPCODE', 'SHIPPING_STATE', 'PAYMENT_NETWORK', 'PAYMENT_TYPE', 
            'IS_MASKED', 'AVG_PRICE_PER_ITEM', 'TOTAL_TRNX_AMOUNT', 'IP_TO_SHIPPING_DISTANCE']
encoded_features = ['CITY', 'SHIPPING_ZIPCODE', 'SHIPPING_STATE', 'PAYMENT_NETWORK', 'PAYMENT_TYPE','IS_MASKED']

num_feature_fill_na = ['IP_TO_SHIPPING_DISTANCE']


In [None]:
# Pull a fresh 10,000-row sample into a pandas DataFrame
data = session.table('enriched_data').sample(n = 10000)
df = data.toPandas()  # toPandas() already returns a pandas DataFrame

In [None]:
# setup pipeline

#transformations
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import FunctionTransformer

#Classifier
from sklearn.ensemble import RandomForestClassifier

#Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

#Model Accuracy
from sklearn.metrics import balanced_accuracy_score

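# Split features and target
X = df[features]
y = df['ISFRAUD'] == True
# Ratio of non-fraud to fraud rows (class imbalance); for reference only, not used below
weights = (y == 0).sum() / (1.0 * (y == 1).sum())

# Hold out a test set (default 25%) with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)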


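# Model pipeline
# Note: the stages below run in sequence over ALL columns, so every feature
# (including the numeric ones) is cast to string and ordinal-encoded first.
ord_pipe = make_pipeline(
    FunctionTransformer(lambda x: x.astype(str)),
    OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
    )

# Fill any remaining missing values with 0, then scale each column to [0, 1]
num_pipe = make_pipeline(
    SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0),
    MinMaxScaler()
    )

clf = make_pipeline(RandomForestClassifier(random_state=0, n_jobs=-1))

model = make_pipeline(ord_pipe, num_pipe, clf)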

# fit the model
model.fit(X_train, y_train)
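The pipeline above chains `ord_pipe` and `num_pipe` in sequence, so both are applied to every column, and the `encoded_features` and `num_feature_fill_na` lists are never used. One possible alternative, shown here only as a sketch on made-up data, is scikit-learn's `ColumnTransformer`, which routes categorical and numeric columns to separate transformers. The names `categorical_cols`, `numeric_cols`, `toy_X`, `toy_y`, and `sketch_model` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Hypothetical toy columns standing in for the notebook's feature lists.
categorical_cols = ["PAYMENT_TYPE", "IS_MASKED"]
numeric_cols = ["TOTAL_TRNX_AMOUNT", "IP_TO_SHIPPING_DISTANCE"]

toy_X = pd.DataFrame({
    "PAYMENT_TYPE": ["card", "wallet", "card", "card", "wallet", "card"],
    "IS_MASKED": [1, 0, 0, 1, 0, 1],
    "TOTAL_TRNX_AMOUNT": [120.0, 35.5, 60.0, 980.0, 15.0, 450.0],
    "IP_TO_SHIPPING_DISTANCE": [5400.0, 12.0, np.nan, 7200.0, 3.0, 6100.0],
})
toy_y = np.array([1, 0, 0, 1, 0, 1])

# Categorical columns are ordinal-encoded; numeric columns are imputed and scaled.
preprocess = ColumnTransformer([
    ("cat",
     OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
     categorical_cols),
    ("num",
     make_pipeline(SimpleImputer(strategy="constant", fill_value=0), MinMaxScaler()),
     numeric_cols),
])

sketch_model = make_pipeline(
    preprocess,
    RandomForestClassifier(n_estimators=20, random_state=0),
)
sketch_model.fit(toy_X, toy_y)

# Inspect the preprocessed matrix: two encoded columns followed by two scaled columns.
transformed = sketch_model[:-1].transform(toy_X)
print(transformed.shape)
```

With this layout the numeric features keep their magnitude instead of being replaced by the sort order of their string form. If such a model were deployed as a UDF, the input row would need matching column dtypes.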


#### Check model balanced accuracy

In [None]:
#Check Accuracy of our model on test dataset
y_pred = model.predict_proba(X_test)[:,1]
predictions = [round(value) for value in y_pred]
balanced_accuracy = balanced_accuracy_score(y_test, predictions)
print("Model testing completed.\n   - Model Balanced Accuracy: %.2f%%" % (balanced_accuracy * 100.0))
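Balanced accuracy is the mean of the recall on each class, which makes it more informative than plain accuracy when fraud is rare. The sketch below checks this on made-up labels; `y_true` and `y_hat` are illustrative and unrelated to the notebook's data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Made-up labels: 8 legitimate transactions and 2 fraudulent ones.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

sensitivity = recall_score(y_true, y_hat, pos_label=1)  # recall on the fraud class
specificity = recall_score(y_true, y_hat, pos_label=0)  # recall on the non-fraud class
manual_balanced = (sensitivity + specificity) / 2

plain = accuracy_score(y_true, y_hat)
print(f"plain accuracy:    {plain:.4f}")
print(f"balanced accuracy: {manual_balanced:.4f}")
```

Here only half of the fraud cases are caught, which the balanced score reflects while plain accuracy stays high.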

#### Check confusion matrix

In [None]:
#Confusion Matrix

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, predictions)

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
TN, FP, FN, TP = cm.ravel()

print('True Positive(TP)  = ', TP)
print('False Positive(FP) = ', FP)
print('True Negative(TN)  = ', TN)
print('False Negative(FN) = ', FN)

accuracy =  (TP+TN) /(TP+FP+TN+FN)

print('Accuracy of the classification = {:0.3f}'.format(accuracy))
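Because fraud is usually the minority class, precision and recall on the fraud class are often more telling than overall accuracy. A minimal sketch of how they follow from the confusion-matrix counts is given below; the function name `fraud_metrics` and the example counts are made up for illustration.

```python
def fraud_metrics(tp: int, fp: int, fn: int) -> tuple:
    """Return (precision, recall, f1) for the positive (fraud) class."""
    # Precision: of the transactions flagged as fraud, how many really were fraud.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the actual fraud cases, how many were flagged.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Example with made-up counts.
precision, recall, f1 = fraud_metrics(tp=30, fp=10, fn=20)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In the notebook, the same function could be called with the `TP`, `FP`, and `FN` values computed above.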

In [None]:
# Feature importance
from sklearn.inspection import permutation_importance
perm_importance = permutation_importance(model, X_test, y_test)
sorted_idx = perm_importance.importances_mean.argsort()
plt.barh(np.array(features)[sorted_idx], perm_importance.importances_mean[sorted_idx])
plt.xlabel("Feature Importance")


# Register model as UDF

## Note: the ORGADMIN must first accept the terms for third-party (Anaconda) packages in Snowflake

In [None]:
%%time

features = list(X_train.columns)

session.add_packages("scikit-learn==1.0.2", "pandas", "numpy")

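@udf(name='predict_retail_fraud', is_permanent=True, stage_location='@UDFSTAGE', replace=True)
def predict_retail_fraud(args: list) -> float:
    # Build a one-row DataFrame in the same column order used for training
    row = pd.DataFrame([args], columns=features)
    # predict() returns an array; return its single element as a Python float
    return float(model.predict(row)[0])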
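Before registering a UDF, it can help to exercise the same function body locally. The sketch below does this with a tiny made-up model; `toy_features`, `toy_model`, and `predict_one` are illustrative names, and the data is invented. It mirrors the notebook's approach of casting every input to a string before ordinal encoding.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Made-up training data with two features, all cast to strings.
toy_features = ["PAYMENT_TYPE", "IS_MASKED"]
toy_train = pd.DataFrame(
    [["card", 1], ["wallet", 0], ["card", 0], ["wallet", 1]],
    columns=toy_features,
).astype(str)
toy_labels = [1, 0, 0, 1]

toy_model = make_pipeline(
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
    RandomForestClassifier(n_estimators=10, random_state=0),
)
toy_model.fit(toy_train, toy_labels)


def predict_one(args: list) -> float:
    # Same shape as the UDF body: one list of feature values in, one float out.
    row = pd.DataFrame([args], columns=toy_features).astype(str)
    return float(toy_model.predict(row)[0])


score = predict_one(["card", 1])
print(type(score).__name__, score)
```

Checking that the return value is a plain Python `float` locally helps catch type mismatches with the declared UDF return type before deployment.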

In [None]:
new_df = session.table(name = 'new_transaction_data')

In [None]:
%%time
import snowflake.snowpark.functions as F
new_df.select(new_df.trnx_id, \
              F.call_udf("predict_retail_fraud", F.array_construct(*features)).alias('fraud_flag')) \
        .write.mode('overwrite').saveAsTable('fraud_detection')

# Predict fraud in new transactions

In [None]:
session.table('fraud_detection').sample(n=10).toPandas()

#### This demo shows how the Data Engineering and Data Science teams at StyleMeUp can collaborate on this solution using familiar programming concepts and APIs, together with the rich ecosystem of open source packages available through Snowpark for Python.

#### Snowflake Marketplace and data exchange offerings let you quickly build and test your models with 1st, 2nd, and 3rd party data sets for better accuracy, without worrying about the logistics of ingesting, transforming, and loading data into your own database.

#### Some notable features of this demo:

1. Using Snowflake's native GEOGRAPHY data type and the ST_DISTANCE geography function to calculate the IP-to-shipping distance (no need for GeoPandas)
2. Loading data using pandas DataFrames (new functionality in the Snowpark Python API)
3. Creating and deploying a UDF in Snowflake without pickle
4. Using Snowflake Marketplace to quickly use 3rd party datasets without any data engineering
5. Using scikit-learn, pandas, and NumPy
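To give a feel for the kind of value behind `IP_TO_SHIPPING_DISTANCE`, the sketch below computes a great-circle distance in meters with the haversine formula. This is a spherical approximation written for illustration only; it is not Snowflake's ST_DISTANCE implementation, and the function name `haversine_m` and the coordinates are assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (spherical approximation)


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Illustrative coordinates: an IP geolocated in Paris, a shipping address in London.
distance_m = haversine_m(48.8566, 2.3522, 51.5074, -0.1278)
print(f"{distance_m / 1000:.0f} km")
```

A large distance between the IP location and the shipping address is the kind of signal the model can pick up from this feature.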

# Close Connection

In [None]:
session.close()