# Project Title: FraudGuard ML

Team Members:<br>
Michelle Silver - supersilver1978@gmail.com<br>
Alex Valenzuela - axvalenzuela@gmail.com<br>
Dylan Johnston - dylanhjjohnston@gmail.com<br>
Rosalinda Olvera - rolvera98271@gmail.com<br>
James White - jswhite1992@gmail.com<br>
1. Data Scientist: Responsible for data preprocessing, exploratory data analysis, feature selection, and model building.<br>
2. Machine Learning Engineer: Works closely with the data scientist in model building, model evaluation, and integrating the model with the API.<br>
3. Back-End Developer: Responsible for setting up and maintaining the Flask API and its integration with the ML model.<br>
4. Project Manager: Oversees the project's progress, facilitates communication among team members, and ensures that the project is on track and within budget.<br>
5. QA Tester/Technical Writer: Handles testing of the machine learning model and the Flask API, and prepares comprehensive project documentation.

# Project Description/Outline:<br>
FraudGuard ML aims to make online transactions more secure by using machine learning to identify potentially fraudulent credit card transactions. The project combines data analytics with predictive modeling so that detection can keep pace as fraud patterns change.

The application is built in Python, using Scikit-learn for machine learning and Flask for the web-based API. The model is trained on a dataset that contains both fraudulent and non-fraudulent transactions, and it learns the trends and patterns in that data to flag potential fraud.

# Research Questions to Answer:<br>
1. What are the most significant indicators of a fraudulent transaction?<br>
2. How can machine learning improve the accuracy of fraud detection in comparison to traditional methods?<br>
3. How can the model be updated regularly to adapt to new fraud patterns?<br>
4. How can the model accuracy be maintained despite the imbalance of fraudulent to non-fraudulent transactions in the dataset?<br>
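
Question 4 matters because plain accuracy can look excellent on an imbalanced dataset even when no fraud is caught. The sketch below uses made-up labels (not project data) and a hypothetical helper, `confusion_counts`, to compare accuracy with precision and recall for two imaginary classifiers.

```python
# Toy illustration with made-up labels: 1 = fraud, 0 = legitimate.
# `confusion_counts` is a hypothetical helper, not part of the project code.

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for two equal-length lists of 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

# 1,000 transactions, of which 10 (1%) are fraudulent.
y_true = [1] * 10 + [0] * 990

# Classifier A never flags anything.
pred_a = [0] * 1000
tp_a, fp_a, fn_a, tn_a = confusion_counts(y_true, pred_a)
accuracy_a = (tp_a + tn_a) / len(y_true)
recall_a = tp_a / (tp_a + fn_a)

# Classifier B catches 8 of the 10 frauds but raises 20 false alarms.
pred_b = [1] * 8 + [0] * 2 + [1] * 20 + [0] * 970
tp_b, fp_b, fn_b, tn_b = confusion_counts(y_true, pred_b)
accuracy_b = (tp_b + tn_b) / len(y_true)
precision_b = tp_b / (tp_b + fp_b)
recall_b = tp_b / (tp_b + fn_b)

print(f"A: accuracy={accuracy_a:.3f} recall={recall_a:.3f}")
print(f"B: accuracy={accuracy_b:.3f} precision={precision_b:.3f} recall={recall_b:.3f}")
```

Classifier A scores higher on accuracy than classifier B while detecting no fraud at all, which is why precision and recall are worth tracking alongside accuracy here.
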

# Datasets to be Used:<br>
The datasets to be used will consist of anonymized credit card transactions. These datasets will have a mix of fraudulent and non-fraudulent transactions, making it possible for the model to learn and distinguish between them.

# Rough Breakdown of Tasks:<br>
1. Data Acquisition and Understanding: Gather the required dataset and understand the variables. (Data Scientist)<br>
2. Exploratory Data Analysis (EDA) and Preprocessing: Perform EDA, clean the data, manage missing values and outliers, and handle class imbalance. (Data Scientist)<br>
3. Flask API Structure: Set up the basic structure of the Flask API. (Back-End Developer)<br>
4. Feature Selection and Engineering: Determine the most relevant features for the machine learning model. (Data Scientist and Machine Learning Engineer)<br>
5. Model Building and Evaluation: Build, train, and evaluate the machine learning model. (Data Scientist and Machine Learning Engineer)<br>
6. Flask API and Model Integration: Integrate the machine learning model with the Flask API. (Back-End Developer and Machine Learning Engineer)<br>
7. Testing and Documentation: Test all parts of the project and complete the project documentation. (QA Tester/Technical Writer and Project Manager)<br>

# Exploratory Data Analysis (EDA)

### Step 1: Import the Libraries

In [86]:
# Importing libraries
from pathlib import Path
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [87]:
# Shows all the columns

pd.set_option('display.max_columns',None)

In [88]:
# Display float values with two decimal places

pd.set_option('display.float_format', lambda x: '%.2f' % x)

### Step 2: Read the data from the CSV file into a Pandas DataFrame and review the DataFrame.

In [89]:
test_dir = '/kaggle/input/fraud-transactions-dataset/fraudTest.csv'
train_dir = '/kaggle/input/fraud-transactions-dataset/fraudTrain.csv'

df_train = pd.read_csv(train_dir, index_col=0)
df_test = pd.read_csv(test_dir, index_col=0)

FileNotFoundError: [Errno 2] No such file or directory: '/kaggle/input/fraud-transactions-dataset/fraudTrain.csv'
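
The `FileNotFoundError` above is what `pd.read_csv` raises when the notebook runs outside Kaggle, where the `/kaggle/input/...` path does not exist. One possible guard, sketched here with a hypothetical helper named `read_first_existing`, is to try a list of candidate locations and read the first file that is present. The temporary file in the demo stands in for a local copy of the dataset; its contents are made up.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_first_existing(candidates, **read_csv_kwargs):
    """Read the first CSV in `candidates` that exists on disk.

    Hypothetical helper: raises FileNotFoundError listing every path tried
    when none of the candidates is a file.
    """
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return pd.read_csv(path, **read_csv_kwargs)
    tried = [str(candidate) for candidate in candidates]
    raise FileNotFoundError(f"None of these files exist: {tried}")


# Demo: the Kaggle path is tried first, then a local stand-in file.
with tempfile.TemporaryDirectory() as tmp:
    local_copy = Path(tmp) / "fraudTrain.csv"
    local_copy.write_text(",amt,is_fraud\n0,4.97,0\n1,107.23,0\n")
    df_train = read_first_existing(
        ["/kaggle/input/fraud-transactions-dataset/fraudTrain.csv", local_copy],
        index_col=0,
    )

print(df_train.head())
```
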

In [84]:
# Read the fraudTrainlarge.csv file from the Resources folder into a Pandas DataFrame

# fraud_df = pd.read_csv(Path("fraudTrainlarge.csv"))

In [74]:
# EDA (6 steps):
# discovering : identify the relevant columns, data types, and data discrepancies.
# joining : combine multiple tables.
# cleaning : remove the discrepancies.
# validating : check the data for errors.
# structuring : derive new columns, e.g. end_date - start_date = duration (feature engineering).
# presenting : justify each column that is kept, e.g. why 'merchant' needs to be included. Null-value rule of thumb: under 10% null values, use dropna(); over 15% null values, use drop(columns='').
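
The null-value rule of thumb in the notes above can be sketched as a small function. `handle_missing` and its parameter names are hypothetical. The notes do not say what to do with columns whose null share falls between 10% and 15%, so this sketch assumes they are left untouched. The toy DataFrame is made up.

```python
import numpy as np
import pandas as pd


def handle_missing(df, row_threshold=0.10, col_threshold=0.15):
    """Apply the 10% / 15% null-value rule of thumb (hypothetical helper).

    - Columns with more than `col_threshold` nulls are dropped.
    - For columns with some nulls but fewer than `row_threshold`,
      the rows containing those nulls are dropped.
    - Columns in between are left as they are (an assumption).
    """
    null_share = df.isnull().mean()
    cols_to_drop = null_share[null_share > col_threshold].index.tolist()
    cols_for_row_drop = null_share[
        (null_share > 0) & (null_share < row_threshold)
    ].index.tolist()

    out = df.drop(columns=cols_to_drop)
    if cols_for_row_drop:
        out = out.dropna(subset=cols_for_row_drop)
    return out


# Toy data: 'amt' is 5% null, 'merchant' is 50% null, 'is_fraud' has no nulls.
toy = pd.DataFrame({
    "amt": [np.nan] + [10.0] * 19,
    "merchant": [None] * 10 + ["fraud_Herzog Ltd"] * 10,
    "is_fraud": [0] * 20,
})
cleaned = handle_missing(toy)
print(cleaned.shape)
print(cleaned.columns.tolist())
```
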

### Step 3: Discovering the Relevant Columns, Data Types and Discrepancies

In [75]:
# Review the DataFrame

# display(fraud_df.head(10))
# display(fraud_df.tail(10))

Unnamed: 0.1,Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,city,state,zip,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
0,0,2019-01-01 00:00:18,2703186189652095,"fraud_Rippin, Kub and Mann",misc_net,4.97,Jennifer,Banks,F,561 Perry Cove,Moravian Falls,NC,28654,36.08,-81.18,3495,"Psychologist, counselling",1988-03-09,0b242abb623afc578575680df30655b9,1325376018,36.01,-82.05,0
1,1,2019-01-01 00:00:44,630423337322,"fraud_Heller, Gutmann and Zieme",grocery_pos,107.23,Stephanie,Gill,F,43039 Riley Greens Suite 393,Orient,WA,99160,48.89,-118.21,149,Special educational needs teacher,1978-06-21,1f76529f8574734946361c461b024d99,1325376044,49.16,-118.19,0
2,2,2019-01-01 00:00:51,38859492057661,fraud_Lind-Buckridge,entertainment,220.11,Edward,Sanchez,M,594 White Dale Suite 530,Malad City,ID,83252,42.18,-112.26,4154,Nature conservation officer,1962-01-19,a1a22d70485983eac12b5b88dad1cf95,1325376051,43.15,-112.15,0
3,3,2019-01-01 00:01:16,3534093764340240,"fraud_Kutch, Hermiston and Farrell",gas_transport,45.0,Jeremy,White,M,9443 Cynthia Court Apt. 038,Boulder,MT,59632,46.23,-112.11,1939,Patent attorney,1967-01-12,6b849c168bdad6f867558c3793159a81,1325376076,47.03,-112.56,0
4,4,2019-01-01 00:03:06,375534208663984,fraud_Keeling-Crist,misc_pos,41.96,Tyler,Garcia,M,408 Bradley Rest,Doe Hill,VA,24433,38.42,-79.46,99,Dance movement psychotherapist,1986-03-28,a41d7549acf90789359a9aa5346dcb46,1325376186,38.67,-78.63,0
5,5,2019-01-01 00:04:08,4767265376804500,"fraud_Stroman, Hudson and Erdman",gas_transport,94.63,Jennifer,Conner,F,4655 David Island,Dublin,PA,18917,40.38,-75.2,2158,Transport planner,1961-06-19,189a841a0a8ba03058526bcfe566aab5,1325376248,40.65,-76.15,0
6,6,2019-01-01 00:04:42,30074693890476,fraud_Rowe-Vandervort,grocery_net,44.54,Kelsey,Richards,F,889 Sarah Station Suite 624,Holcomb,KS,67851,37.99,-100.99,2691,Arboriculturist,1993-08-16,83ec1cc84142af6e2acf10c44949e720,1325376282,37.16,-100.15,0
7,7,2019-01-01 00:05:08,6011360759745864,fraud_Corwin-Collins,gas_transport,71.65,Steven,Williams,M,231 Flores Pass Suite 720,Edinburg,VA,22824,38.84,-78.6,6018,"Designer, multimedia",1947-08-21,6d294ed2cc447d2c71c7171a3d54967c,1325376308,38.95,-78.54,0
8,8,2019-01-01 00:05:18,4922710831011201,fraud_Herzog Ltd,misc_pos,4.27,Heather,Chase,F,6888 Hicks Stream Suite 954,Manor,PA,15665,40.34,-79.66,1472,Public affairs consultant,1941-03-07,fc28024ce480f8ef21a32d64c93a29f5,1325376318,40.35,-79.96,0
9,9,2019-01-01 00:06:01,2720830304681674,"fraud_Schoen, Kuphal and Nitzsche",grocery_pos,198.39,Melissa,Aguilar,F,21326 Taylor Squares Suite 708,Clarksville,TN,37040,36.52,-87.35,151785,Pathologist,1974-03-28,3b9014ea8fb80bd65de0b1463b00b00e,1325376361,37.18,-87.49,0


Unnamed: 0.1,Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,city,state,zip,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
1296665,1296665,2020-06-21 12:08:42,213193596103206,fraud_Gulgowski LLC,home,72.17,James,Hunt,M,7369 Gabriel Tunnel,Pointe Aux Pins,MI,49775,45.75,-84.45,95,Electrical engineer,1994-02-09,108c103b26f686c24c021aaf4210977e,1371816522,44.94,-84.0,0
1296666,1296666,2020-06-21 12:09:22,4587657402165341815,"fraud_Hyatt, Russel and Gleichner",health_fitness,7.3,Amber,Lewis,F,6296 John Keys Suite 858,Pembroke Township,IL,60958,41.06,-87.59,2135,"Psychotherapist, child",2004-05-08,37a18c6fb0c5c722b6339ffedc82f55a,1371816562,40.56,-88.09,0
1296667,1296667,2020-06-21 12:10:56,4822367783500458,"fraud_Hahn, Douglas and Schowalter",travel,19.71,Christopher,Farrell,M,97070 Anderson Land,Haines City,FL,33844,28.08,-81.59,33804,Exercise physiologist,1991-01-01,34e72e0a659a6c8f4a20ee65594f3a7d,1371816656,27.47,-81.51,0
1296668,1296668,2020-06-21 12:11:23,213141712584544,"fraud_Metz, Russel and Metz",kids_pets,100.85,Margaret,Curtis,F,742 Oneill Shore,Florence,MS,39073,32.15,-90.12,19685,Fine artist,1984-12-24,0d86d8c17638d7eff77db9c6a878b477,1371816683,31.38,-90.53,0
1296669,1296669,2020-06-21 12:11:36,4400011257587661852,fraud_Stiedemann Inc,misc_pos,37.38,Marissa,Powell,F,474 Allen Haven,North Loup,NE,68859,41.5,-98.79,509,"Nurse, children's",1980-09-15,9a7ea2625cf8303efe34e3c09546868f,1371816696,41.73,-99.04,0
1296670,1296670,2020-06-21 12:12:08,30263540414123,fraud_Reichel Inc,entertainment,15.56,Erik,Patterson,M,162 Jessica Row Apt. 072,Hatch,UT,84735,37.72,-112.48,258,Geoscientist,1961-11-24,440b587732da4dc1a6395aba5fb41669,1371816728,36.84,-111.69,0
1296671,1296671,2020-06-21 12:12:19,6011149206456997,fraud_Abernathy and Sons,food_dining,51.7,Jeffrey,White,M,8617 Holmes Terrace Suite 651,Tuscarora,MD,21790,39.27,-77.51,100,"Production assistant, television",1979-12-11,278000d2e0d2277d1de2f890067dcc0a,1371816739,38.91,-78.25,0
1296672,1296672,2020-06-21 12:12:32,3514865930894695,fraud_Stiedemann Ltd,food_dining,105.93,Christopher,Castaneda,M,1632 Cohen Drive Suite 639,High Rolls Mountain Park,NM,88325,32.94,-105.82,899,Naval architect,1967-08-30,483f52fe67fabef353d552c1e662974c,1371816752,33.62,-105.13,0
1296673,1296673,2020-06-21 12:13:36,2720012583106919,"fraud_Reinger, Weissnat and Strosin",food_dining,74.9,Joseph,Murray,M,42933 Ryan Underpass,Manderson,SD,57756,43.35,-102.54,1126,Volunteer coordinator,1980-08-18,d667cdcbadaaed3da3f4020e83591c83,1371816816,42.79,-103.24,0
1296674,1296674,2020-06-21 12:13:37,4292902571056973207,"fraud_Langosh, Wintheiser and Hyatt",food_dining,4.3,Jeffrey,Smith,M,135 Joseph Mountains,Sula,MT,59871,45.84,-113.87,218,"Therapist, horticultural",1995-08-16,8f7c8e4ab7f25875d753b422917c98c9,1371816817,46.57,-114.19,0
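
Both previews above include `Unnamed: 0.1` and `Unnamed: 0` columns that simply repeat the row number. Columns like these usually appear when a DataFrame's index is written to CSV and then read back without `index_col`. A minimal sketch for removing them, using a hypothetical helper `drop_unnamed_columns` and a toy frame built from the first two preview rows:

```python
import pandas as pd


def drop_unnamed_columns(df):
    """Return a copy of `df` without columns whose name starts with 'Unnamed'."""
    unnamed = [col for col in df.columns if str(col).startswith("Unnamed")]
    return df.drop(columns=unnamed)


# Toy frame using a few values from the preview above.
toy = pd.DataFrame({
    "Unnamed: 0.1": [0, 1],
    "Unnamed: 0": [0, 1],
    "amt": [4.97, 107.23],
    "is_fraud": [0, 0],
})
trimmed = drop_unnamed_columns(toy)
print(trimmed.columns.tolist())
```
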


In [76]:
# fraud_df.dtypes

Unnamed: 0                 int64
trans_date_trans_time     object
cc_num                     int64
merchant                  object
category                  object
amt                      float64
first                     object
last                      object
gender                    object
street                    object
city                      object
state                     object
zip                        int64
lat                      float64
long                     float64
city_pop                   int64
job                       object
dob                       object
trans_num                 object
unix_time                  int64
merch_lat                float64
merch_long               float64
is_fraud                   int64
dtype: object

In [47]:
# Check for total number of rows and columns

# fraud_df.shape

In [48]:
# Check for duplicate rows

# fraud_df.duplicated().sum()

In [49]:
# Check the data information 

# fraud_df.info()

In [50]:
# Convert the date columns to datetime and the text columns to str

#fraud_df['trans_date_trans_time'] = pd.to_datetime(fraud_df['trans_date_trans_time'])
#fraud_df['merchant'] = fraud_df['merchant'].astype(str)
#fraud_df['category'] = fraud_df['category'].astype(str)
#fraud_df['first'] = fraud_df['first'].astype(str)
#fraud_df['last'] = fraud_df['last'].astype(str)
#fraud_df['gender'] = fraud_df['gender'].astype(str)
#fraud_df['street'] = fraud_df['street'].astype(str)
#fraud_df['city'] = fraud_df['city'].astype(str)
#fraud_df['state'] = fraud_df['state'].astype(str)
#fraud_df['job'] = fraud_df['job'].astype(str)
#fraud_df['dob'] = pd.to_datetime(fraud_df['dob'])
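
Once the date columns are converted with `pd.to_datetime`, they can feed the "structuring" step from the EDA notes. The sketch below is illustrative only: the derived column names `trans_hour` and `age_years` are hypothetical, and the age uses a rough 365-day year. The two rows reuse timestamps and birth dates from the previews above.

```python
import pandas as pd

# Two rows taken from the head and tail previews above.
toy = pd.DataFrame({
    "trans_date_trans_time": ["2019-01-01 00:00:18", "2020-06-21 12:13:37"],
    "dob": ["1988-03-09", "1995-08-16"],
})

# Same conversions as in the cell above.
toy["trans_date_trans_time"] = pd.to_datetime(toy["trans_date_trans_time"])
toy["dob"] = pd.to_datetime(toy["dob"])

# Hypothetical derived features.
toy["trans_hour"] = toy["trans_date_trans_time"].dt.hour
# Approximate age at transaction time, ignoring leap days.
toy["age_years"] = (toy["trans_date_trans_time"] - toy["dob"]).dt.days // 365

print(toy.dtypes)
print(toy[["trans_hour", "age_years"]])
```
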




In [51]:
# Checking the data types 

fraud_df.dtypes

trans_date_trans_time     object
cc_num                     int64
merchant                  object
category                  object
amt                      float64
first                     object
last                      object
gender                    object
street                    object
city                      object
state                     object
zip                        int64
lat                      float64
long                     float64
city_pop                   int64
job                       object
dob                       object
trans_num                 object
unix_time                  int64
merch_lat                float64
merch_long               float64
is_fraud                   int64
dtype: object

In [52]:
# Obtaining total for null values

#fraud_df.isnull().sum()

In [53]:
# Generate summary statistics

#fraud_df.describe().apply(lambda x: x.apply('{0:.2f}'.format)).T

In [54]:
# Checking the unique values for all the columns

#fraud_df.nunique()

In [55]:
# Function to return the names of the categorical (object-dtype) columns

#def unique_values(data):
    #categorical_col = []
    #for i,x in data.dtypes.items():
        #if x == 'object':
            #categorical_col.append(i)
    #return categorical_col
    
        

In [56]:
# Get the names of all the categorical columns

#cat_cols = unique_values(fraud_df)

In [57]:
# List all the categorical columns

#cat_cols

In [58]:
# Count the distinct values in each categorical column

#df_dict = {}
#for i in cat_cols:
    #df_dict[i] = len(fraud_df[i].value_counts())

In [59]:
# Check the total number of unique values in the categorical columns

#df_dict

In [60]:
# Using list comprehensions as an alternative way to separate the categorical and numerical columns

# Categorical columns
#cat_col = [col for col in fraud_df.columns if fraud_df[col].dtype == 'object']
#print('Categorical columns :',cat_col)
# Numerical columns
#num_col = [col for col in fraud_df.columns if fraud_df[col].dtype != 'object']
#print('Numerical columns :',num_col)
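
A third option is `DataFrame.select_dtypes`, which pandas provides for exactly this kind of split. The sketch below treats every non-numeric column as categorical, which is an assumption that holds only while date columns have not yet been converted to datetime. The toy rows reuse a few values from the preview above.

```python
import pandas as pd

toy = pd.DataFrame({
    "merchant": ["fraud_Lind-Buckridge", "fraud_Herzog Ltd", "fraud_Herzog Ltd"],
    "category": ["entertainment", "misc_pos", "misc_pos"],
    "amt": [220.11, 4.27, 41.96],
    "is_fraud": [0, 0, 0],
})

# Numeric columns, and everything else treated as categorical.
num_col = toy.select_dtypes(include="number").columns.tolist()
cat_col = toy.select_dtypes(exclude="number").columns.tolist()

# Number of distinct values per categorical column.
unique_counts = toy[cat_col].nunique().to_dict()

print("Categorical columns :", cat_col)
print("Numerical columns :", num_col)
print("Distinct values :", unique_counts)
```
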


### Step 4: Presenting the Heat Map and Box Plot

In [61]:
# Correlation of numerical columns (numeric_only=True skips the text columns)

#corr = fraud_df.corr(numeric_only=True)

#plt.figure(dpi=130)
#sns.heatmap(corr, annot=True, fmt= '.2f')
#plt.show()


In [62]:
# Sorting all the correlated values

#corr['is_fraud'].sort_values(ascending = False)
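
The correlation-and-sort step can be tried on a small frame before running it on the full dataset. The values below are made up for illustration and say nothing about the real data. `numeric_only=True` is passed because recent pandas versions raise an error when `corr()` meets text columns.

```python
import pandas as pd

# Made-up toy data; 'category' is text and must be excluded from corr().
toy = pd.DataFrame({
    "amt": [5.0, 20.0, 900.0, 15.0, 1200.0, 8.0],
    "city_pop": [3495, 149, 4154, 1939, 99, 2158],
    "category": ["misc_net", "grocery_pos", "misc_net",
                 "gas_transport", "misc_pos", "grocery_net"],
    "is_fraud": [0, 0, 1, 0, 1, 0],
})

corr = toy.corr(numeric_only=True)

# Correlation of each numeric column with the target, strongest first.
fraud_corr = corr["is_fraud"].drop("is_fraud").sort_values(ascending=False)
print(fraud_corr)
```

Dropping the `is_fraud` entry removes the trivial self-correlation of 1.0 from the ranking.
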


In [63]:


# Box plot of all the numerical columns, with rotated axis labels

#fraud_df.boxplot(rot =45)

In [64]:
# Total number of columns

#len(fraud_df.columns)

In [65]:
# Container for columns for box plots

#cols = ['cc_num','amt','city_pop']

In [66]:
# Enumerate all the columns for the box plots

#for i in enumerate(cols):
    #print(i)

In [67]:
# One box plot per enumerated column

#fig,axs = plt.subplots(nrows=1, ncols=3)
#axs = axs.flatten()
#for i,x in enumerate(cols):
    #axs[i].boxplot(fraud_df[x],showfliers =False)
    #axs[i].set_xlabel(x)
#plt.show()

In [68]:
# Box plot of the numerical columns with the outlier points hidden (showfliers=False)

#fraud_df.boxplot(figsize=(20,10),showfliers = False)
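
The box plot calls above pass `showfliers=False`, which hides the outlier points themselves. To see which values a default box plot would mark as outliers, the usual 1.5 × IQR whisker rule can be applied directly. This is a minimal sketch with a hypothetical helper, `iqr_outlier_mask`. The ten amounts are the `amt` values from the head preview, so the result illustrates the rule only and is not representative of the full dataset.

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# 'amt' values from the ten-row head preview above.
amt = pd.Series([4.97, 107.23, 220.11, 45.0, 41.96,
                 94.63, 44.54, 71.65, 4.27, 198.39])

mask = iqr_outlier_mask(amt)
print("Outlier count:", int(mask.sum()))
print("Outlier values:", amt[mask].tolist())
```
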

# Data Preprocessing