**IEEE-CIS Fraud Detection**

The data comes from Vesta's real-world e-commerce transactions and contains a wide range of features from device type to product features.

**What is fraud detection?** Fraud detection protects personal information, assets, accounts and transactions through real-time or near-real-time analysis of activities by users and other defined entities. It uses background server-based processes that examine users' and other defined entities' access and behavior patterns, and typically compares this information to a profile of what's expected.


**Data Description**

**Transaction Table**
* TransactionDT: timedelta from a given reference datetime (not an actual timestamp)
* TransactionAMT: transaction payment amount in USD
* ProductCD: product code, the product for each transaction
* card1 - card6: payment card information, such as card type, card category, issue bank, country, etc.
* addr: address
* dist: distance
* P_emaildomain and R_emaildomain: purchaser and recipient email domains
* C1-C14: counting, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.
* D1-D15: timedelta, such as days between previous transaction, etc.
* M1-M9: match, such as names on card and address, etc.
* Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations.
* Categorical Features: ProductCD, card1 - card6, addr1, addr2, P_emaildomain, R_emaildomain, M1 - M9

**Identity Table**
Variables in this table are identity information – network connection information (IP, ISP, Proxy, etc) and digital signature (UA/browser/os/version, etc) associated with transactions.
They're collected by Vesta’s fraud protection system and digital security partners.
(The field names are masked, and a pairwise dictionary will not be provided, for privacy protection and contract agreement.)

* Categorical Features: DeviceType, DeviceInfo, id_12 - id_38

**There are three problems with this dataset.**
1. Column names are masked
2. Imbalanced dataset
3. Time series dataset
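
The second and third points affect how a model should be weighted and validated. A minimal sketch on synthetic data (the arrays `X` and `y` are made up and are not the competition data) that measures the class ratio, derives a starting value for a parameter such as `scale_pos_weight`, and splits the rows in time order with `TimeSeriesSplit`:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)

# Synthetic, time-ordered stand-in: 1,000 rows with about 3.5% positives.
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.035).astype(int)

# Imbalance: ratio of negatives to positives, a common starting value
# for the scale_pos_weight parameter of XGBoost and LightGBM classifiers.
n_pos = int(y.sum())
n_neg = int(len(y) - n_pos)
scale_pos_weight = n_neg / n_pos
print(f"fraud rate: {n_pos / len(y):.3%}, scale_pos_weight: {scale_pos_weight:.1f}")

# Time series: every validation fold lies strictly after its training fold,
# so the model is never evaluated on rows older than its training data.
tscv = TimeSeriesSplit(n_splits=4)
folds = list(tscv.split(X))
for train_idx, valid_idx in folds:
    print(f"train rows 0-{train_idx.max()}, valid rows {valid_idx.min()}-{valid_idx.max()}")
```

The ratio is only a starting point; whether reweighting helps on the real data would need to be checked against a time-ordered validation score.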

**Importing Libraries**

In [None]:
import datetime
import re
import gc
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import lightgbm as lgb
import xgboost as xgb

from sklearn.model_selection import cross_validate, GridSearchCV, validation_curve,RandomizedSearchCV
from sklearn.utils import resample
from sklearn.model_selection import KFold, StratifiedKFold,TimeSeriesSplit, GroupKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder, RobustScaler
from xgboost import XGBRegressor
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, \
    roc_auc_score, confusion_matrix, classification_report, RocCurveDisplay
from sklearn.model_selection import train_test_split


**Importing Train and Test datasets**

In [None]:
train_identity= pd.read_csv("../input/ieee-fraud-detection/train_identity.csv")
train_transaction= pd.read_csv('../input/ieee-fraud-detection/train_transaction.csv')                           

In [None]:
df_train=pd.merge(train_transaction, train_identity, on="TransactionID", how="left")

df_train.columns = [col.lower() for col in df_train.columns]
df_train.head(20)
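
Because the merge is a left join, transactions without an identity record keep NaN in every identity column. A small sketch with made-up toy tables (not the competition files) that uses the `indicator` option of `merge` to count how many transactions actually have identity information:

```python
import pandas as pd

# Toy stand-ins for the transaction and identity tables (values are made up).
transactions = pd.DataFrame({"TransactionID": [1, 2, 3, 4],
                             "TransactionAmt": [50.0, 20.0, 75.5, 10.0]})
identity = pd.DataFrame({"TransactionID": [2, 4],
                         "DeviceType": ["mobile", "desktop"]})

# how="left" keeps every transaction; indicator=True adds a "_merge" column
# that records whether a matching identity row was found.
merged = transactions.merge(identity, on="TransactionID", how="left", indicator=True)

has_identity = merged["_merge"] == "both"
print(merged)
print("share of transactions with identity info:", has_identity.mean())
```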

In [None]:
test_identity= pd.read_csv('../input/ieee-fraud-detection/test_identity.csv')
test_identity=test_identity.set_axis(['TransactionID', 'id_01', 'id_02', 'id_03', 'id_04', 'id_05', 'id_06', 'id_07', 'id_08', 'id_09', 'id_10', 'id_11', 'id_12', 'id_13', 'id_14', 'id_15', 'id_16', 'id_17', 'id_18', 'id_19', 'id_20', 'id_21', 'id_22', 'id_23', 'id_24', 'id_25', 'id_26', 'id_27', 'id_28', 'id_29', 'id_30', 'id_31', 'id_32', 'id_33', 'id_34', 'id_35', 'id_36', 'id_37', 'id_38', 'DeviceType', 'DeviceInfo'], axis=1)
test_transaction= pd.read_csv('../input/ieee-fraud-detection/test_transaction.csv')

df_test=pd.merge(test_transaction, test_identity, on="TransactionID", how="left")

df_test.columns = [col.lower() for col in df_test.columns]
df_test.head(20)

In [None]:
sample_sub = pd.read_csv('../input/ieee-fraud-detection/sample_submission.csv')

In [None]:
#function of missing values
def missing_values_table(dataframe, na_name=False):
    na_columns = [col for col in dataframe.columns if dataframe[col].isnull().sum() > 0]
    n_miss = dataframe[na_columns].isnull().sum().sort_values(ascending=False)
    ratio = (dataframe[na_columns].isnull().sum() / dataframe.shape[0] * 100).sort_values(ascending=False)
    missing_df = pd.concat([n_miss, np.round(ratio, 1)], axis=1, keys=['n_miss', 'ratio'])
    print(missing_df , end="\n")
    if na_name:
        return na_columns

#function of one hot encoding
def one_hot_encoder(dataframe, categorical_cols, drop_first=False):
    dataframe = pd.get_dummies(dataframe, columns=categorical_cols, drop_first=drop_first)
    return dataframe    


**Data Preprocessing**

In [None]:
#deal with missing values
missing_values_table(df_train, na_name=False)


The Vxx and id_xx features have a high ratio of missing values. Since we do not know exactly what these features represent, we reduce the number of Vxx columns.
Reference: https://www.kaggle.com/cdeotte/eda-for-columns-v-and-id
PCA is also a useful method for the Vxx features.
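
As an illustration of the PCA alternative, here is a minimal sketch on synthetic data. The frame `v_block`, the median imputation and the 95% variance target are all assumptions chosen for the example and are not taken from this notebook:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for a block of correlated V columns with some NaNs.
base = rng.normal(size=(500, 2))
noise = rng.normal(scale=0.1, size=(500, 6))
v_block = pd.DataFrame(base @ rng.normal(size=(2, 6)) + noise,
                       columns=[f"v{i}" for i in range(1, 7)])
v_block = v_block.mask(rng.random(v_block.shape) < 0.05)  # about 5% missing

# PCA cannot handle NaNs, so impute first (median here, an arbitrary choice),
# then standardize so that no single column dominates the components.
filled = v_block.fillna(v_block.median())
scaled = StandardScaler().fit_transform(filled)

# Keep as many components as needed to explain 95% of the variance.
pca = PCA(n_components=0.95, svd_solver="full")
components = pca.fit_transform(scaled)

print("columns before:", v_block.shape[1], "components after:", components.shape[1])
print("explained variance ratio:", np.round(pca.explained_variance_ratio_, 3))
```

On the real data the scaler and PCA would be fitted on the training rows only and then applied to the test rows with `transform`.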

In [None]:
#Reducing Vxx Columns
#Determining which columns are related by the number of NANs present
#Finding groups of Vs with similar NAN structure

nans_df = df_train.isna()
nans_group = {}
for col in df_train.columns:
    cur_group = nans_df[col].sum()
    # setdefault creates the list the first time a NaN count is seen
    nans_group.setdefault(cur_group, []).append(col)
del nans_df; x=gc.collect()

for k, v in nans_group.items():
    print("##NAN count=", k)
    print(v)

In [None]:
#Finding subsets within the groups that are highly correlated
##NAN count= 1269
#['d1', 'v281', 'v282', 'v283', 'v288', 'v289', 'v296', 'v300', 'v301', 'v313', 'v314', 'v315']
#Replacing all groups with one column from each subset.

vs=nans_group[1269]
vtitle="v281-v315, d1 "

def make_corr(vs, vtitle=""):
    cols= ["transactiondt"] + vs
    plt.figure(figsize=(15,15))
    sns.heatmap(df_train[cols].corr(), cmap="RdBu_r", annot=True, center=0.0)
    if vtitle!="" : plt.title(vtitle, fontsize=14)
    else: plt.title(vs[0] + " - " + vs[-1], fontsize=14)
    plt.show()
make_corr(vs,vtitle)



In [None]:
grps = [[281],[282,283],[288,289],[296],[300,301],[313,314,315]]

def reduce_group(grps, c="v"):
    use=[]
    for g in grps:
        mx=0; vx=g[0]
        for gg in g:
            n=df_train[c+ str(gg)].nunique()
            if n> mx:
                mx=n
                vx=gg
        use.append(vx)
    print("Use these", use)
    
reduce_group(grps)
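
The groups passed to `reduce_group` were read off the heatmap by eye. A rough sketch of deriving them programmatically is shown below. The helper names `correlated_groups` and `pick_most_unique`, the greedy strategy and the 0.75 threshold are hypothetical choices for illustration, and the toy frame is synthetic:

```python
import numpy as np
import pandas as pd

def correlated_groups(df, threshold=0.75):
    """Greedily group columns whose absolute correlation with a group's
    first member exceeds `threshold`."""
    corr = df.corr().abs()
    groups = []
    for col in df.columns:
        for group in groups:
            if corr.loc[col, group[0]] > threshold:
                group.append(col)
                break
        else:
            # No existing group is close enough: start a new one.
            groups.append([col])
    return groups

def pick_most_unique(df, groups):
    """From each group keep the column with the most distinct values."""
    return [max(group, key=lambda c: df[c].nunique()) for group in groups]

rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = rng.normal(size=300)
toy = pd.DataFrame({
    "v1": a,
    "v2": np.round(a, 1),                         # near-copy of v1, fewer distinct values
    "v3": b,
    "v4": b + rng.normal(scale=0.05, size=300),   # near-copy of v3
})

groups = correlated_groups(toy)
keep = pick_most_unique(toy, groups)
print(groups)
print(keep)
```

A greedy pass like this depends on column order, so its output is best treated as a starting point to compare against the heatmap rather than a replacement for it.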



In [None]:
#Vcols after reducing

V=  [1, 2, 5, 6, 8, 11]
V+= [13, 14, 16, 20, 24, 26, 27, 30]
V+= [36, 37, 40, 41, 45, 47, 49]
V+= [54, 56, 60, 62, 65, 67, 68, 70]
V+= [76, 77, 80, 82, 86, 88, 89, 91]
V+= [96, 98, 99, 104, 107, 114, 111, 115, 117, 120, 121, 123, 124, 127, 129, 130, 136]
V+= [138, 140, 142, 147, 156, 162]
V+= [165, 160, 166]
V+= [178, 176, 173, 182, 190, 203, 205, 207, 215]
V+= [169, 171, 175, 180, 185, 188, 198, 210, 209]
V+= [218, 223, 224, 226, 228, 229, 235, 240, 258, 246, 253, 252, 260, 261, 264, 266, 267, 274, 277]
V+= [220, 222, 234, 238, 250, 271]
V+= [294, 284, 285, 286, 291, 297, 303, 305, 307, 309, 310, 320]
V+= [332, 325, 335, 338]
V+= [281, 283, 289, 296, 300, 314]

print("reduced set has", len (V), "columns")

In [None]:
#New df_train and df_test
v_cols=["v" + str(i) for i in range (1,340)]
new_v_cols = ['v'+str(x) for x  in V]
droped_v_cols = [col for col in v_cols if col not in new_v_cols]
df_train.drop(columns=droped_v_cols, inplace=True)
df_test.drop(columns=droped_v_cols, inplace=True)


**Drop features with an 85% missing-value ratio**

In [None]:
# Keep only the column names; drop() needs labels, not a DataFrame slice
null_cols = ['id_01',"id_03","id_04",'id_07','id_08',"id_09","id_10",'id_21', 'id_22', 'id_23',
             'id_24', 'id_25', 'id_26', 'id_27','id_33', 'id_13',"id_34",
             'id_14','id_17', 'id_18','id_19', 'id_20', 'id_32',"dist2","d7", "d13", "d14","d12"]

df_train.drop(columns=null_cols, inplace=True)
df_test.drop(columns=null_cols, inplace=True)
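
The column list above is typed by hand. A sketch of selecting such columns by threshold instead is given below. The helper `high_missing_columns` is a hypothetical name and the toy frames are made up:

```python
import numpy as np
import pandas as pd

def high_missing_columns(df, threshold=0.85):
    """Return the names of columns whose share of NaNs is at least `threshold`."""
    ratio = df.isna().mean()
    return ratio[ratio >= threshold].index.tolist()

# Toy frames standing in for df_train and df_test.
train = pd.DataFrame({
    "id_07": [np.nan] * 9 + [1.0],   # 90% missing
    "dist2": [np.nan] * 10,          # 100% missing
    "card1": range(10),              # complete
})
test = pd.DataFrame({"id_07": [np.nan] * 5, "dist2": [np.nan] * 5, "card1": range(5)})

# Decide on the training data only, then drop the same columns from both frames.
to_drop = high_missing_columns(train)
train = train.drop(columns=to_drop)
test = test.drop(columns=to_drop)
print(to_drop)
```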

**Re-grouping of categorical features**

In [None]:
#p_emaildomain_train

df_train["p_emaildomain"].unique()
df_train["p_emaildomain"].value_counts()

df_train.loc[df_train["p_emaildomain"].isin(['gmail.com', "gmail"]), "p_emaildomain"]= "Google"
df_train.loc[df_train["p_emaildomain"].isin(['yahoo.com', "yahoo.com.mx", 'yahoo.fr', 'yahoo.de','yahoo.es','yahoo.co.uk','yahoo.co.jp']), "p_emaildomain"]= "Yahoo Mail"
df_train.loc[df_train["p_emaildomain"].isin(['outlook.com', 'msn.com', 'live.com','live.com.mx', 'outlook.es','live.fr','hotmail.com','hotmail.es', 'hotmail.fr', 'hotmail.de', 'hotmail.co.uk' ]), "p_emaildomain"]= "Microsoft"
df_train.loc[df_train["p_emaildomain"].isin(['aol.com' ]), "p_emaildomain"]= "Aol"
df_train.loc[df_train["p_emaildomain"].isin(['anonymous.com' ]), "p_emaildomain"]= "Anonymous"
df_train.loc[df_train["p_emaildomain"].isin(df_train["p_emaildomain"].value_counts()[df_train["p_emaildomain"].value_counts()<=8000].index), "p_emaildomain"]= "Others"
# the threshold used to define "Others" depends on the sample taken
df_train["p_emaildomain"].fillna("NoInf", inplace=True)
df_train["p_emaildomain"].value_counts()
p_email_domain_ratio = (df_train["p_emaildomain"].value_counts() / df_train.shape[0] * 100).sort_values(ascending=False)



#p_emaildomain_test

df_test["p_emaildomain"].unique()
df_test["p_emaildomain"].value_counts()

df_test.loc[df_test["p_emaildomain"].isin(['gmail.com', "gmail"]), "p_emaildomain"]= "Google"
df_test.loc[df_test["p_emaildomain"].isin(['yahoo.com', "yahoo.com.mx", 'yahoo.fr', 'yahoo.de','yahoo.es','yahoo.co.uk','yahoo.co.jp']), "p_emaildomain"]= "Yahoo Mail"
df_test.loc[df_test["p_emaildomain"].isin(['outlook.com', 'msn.com', 'live.com','live.com.mx', 'outlook.es','live.fr','hotmail.com','hotmail.es', 'hotmail.fr', 'hotmail.de', 'hotmail.co.uk' ]), "p_emaildomain"]= "Microsoft"
df_test.loc[df_test["p_emaildomain"].isin(['aol.com' ]), "p_emaildomain"]= "Aol"
df_test.loc[df_test["p_emaildomain"].isin(['anonymous.com' ]), "p_emaildomain"]= "Anonymous"
df_test.loc[df_test["p_emaildomain"].isin(df_test["p_emaildomain"].value_counts()[df_test["p_emaildomain"].value_counts()<=21000].index), "p_emaildomain"]= "Others"
# the threshold used to define "Others" depends on the sample taken
df_test["p_emaildomain"].fillna("NoInf", inplace=True)
p_email_domain_ratio = (df_test["p_emaildomain"].value_counts() / df_test.shape[0] * 100).sort_values(ascending=False)

#r_emaildomain_train

df_train["r_emaildomain"].unique()
df_train["r_emaildomain"].value_counts()

df_train.loc[df_train["r_emaildomain"].isin(['gmail.com', "gmail"]), "r_emaildomain"]= "Google"
df_train.loc[df_train["r_emaildomain"].isin(['yahoo.com', "yahoo.com.mx", 'yahoo.fr', 'yahoo.de','yahoo.es','yahoo.co.uk','yahoo.co.jp']), "r_emaildomain"]= "Yahoo Mail"
df_train.loc[df_train["r_emaildomain"].isin(['outlook.com', 'msn.com', 'live.com','live.com.mx', 'outlook.es','live.fr','hotmail.com','hotmail.es', 'hotmail.fr', 'hotmail.de', 'hotmail.co.uk' ]), "r_emaildomain"]= "Microsoft"
df_train.loc[df_train["r_emaildomain"].isin(['aol.com']), "r_emaildomain"]= "Aol"
df_train.loc[df_train["r_emaildomain"].isin(['anonymous.com']), "r_emaildomain"]= "Anonymous"
df_train.loc[df_train["r_emaildomain"].isin(df_train["r_emaildomain"].value_counts()[df_train["r_emaildomain"].value_counts()<=2000].index), "r_emaildomain"]= "Others"
df_train["r_emaildomain"].fillna("NoInf", inplace=True)
r_email_domain_ratio = (df_train["r_emaildomain"].value_counts() / df_train.shape[0] * 100).sort_values(ascending=False)


#r_emaildomain_test

df_test["r_emaildomain"].unique()
df_test["r_emaildomain"].value_counts()

df_test.loc[df_test["r_emaildomain"].isin(['gmail.com', "gmail"]), "r_emaildomain"]= "Google"
df_test.loc[df_test["r_emaildomain"].isin(['yahoo.com', "yahoo.com.mx", 'yahoo.fr', 'yahoo.de','yahoo.es','yahoo.co.uk','yahoo.co.jp']), "r_emaildomain"]= "Yahoo Mail"
df_test.loc[df_test["r_emaildomain"].isin(['outlook.com', 'msn.com', 'live.com','live.com.mx', 'outlook.es','live.fr','hotmail.com','hotmail.es', 'hotmail.fr', 'hotmail.de', 'hotmail.co.uk' ]), "r_emaildomain"]= "Microsoft"
df_test.loc[df_test["r_emaildomain"].isin(['aol.com']), "r_emaildomain"]= "Aol"
df_test.loc[df_test["r_emaildomain"].isin(['anonymous.com']), "r_emaildomain"]= "Anonymous"
df_test.loc[df_test["r_emaildomain"].isin(df_test["r_emaildomain"].value_counts()[df_test["r_emaildomain"].value_counts()<=3000].index), "r_emaildomain"]= "Others"
df_test["r_emaildomain"].fillna("NoInf", inplace=True)
r_email_domain_ratio = (df_test["r_emaildomain"].value_counts() / df_test.shape[0] * 100).sort_values(ascending=False)

#id_30 train
df_train["id_30"].unique()

df_train.loc[df_train["id_30"].str.contains("Windows", na=False), "id_30"] = "Windows"
df_train.loc[df_train["id_30"].str.contains("iOS", na=False), "id_30"] = "iOS"
df_train.loc[df_train["id_30"].str.contains("Mac OS", na=False), "id_30"] = "Mac"
df_train.loc[df_train["id_30"].str.contains("Android", na=False), "id_30"] = "Android"
df_train["id_30"].fillna("NAN", inplace=True)


#id_30 test
df_test["id_30"].unique()

df_test.loc[df_test["id_30"].str.contains("Windows", na=False), "id_30"] = "Windows"
df_test.loc[df_test["id_30"].str.contains("iOS", na=False), "id_30"] = "iOS"
df_test.loc[df_test["id_30"].str.contains("Mac OS", na=False), "id_30"] = "Mac"
df_test.loc[df_test["id_30"].str.contains("Android", na=False), "id_30"] = "Android"
df_test["id_30"].fillna("NAN", inplace=True)


#id_31 train
df_train["id_31"].unique()
df_train["id_31"].value_counts()

df_train.loc[df_train["id_31"].str.contains("chrome", na=False), "id_31"] = "Chrome"
df_train.loc[df_train["id_31"].str.contains("firefox", na=False), "id_31"] = "Firefox"
df_train.loc[df_train["id_31"].str.contains("safari", na=False), "id_31"] = "Safari"
df_train.loc[df_train["id_31"].str.contains("edge", na=False), "id_31"] = "Edge"
df_train.loc[df_train["id_31"].str.contains("ie", na=False), "id_31"] = "IE"
df_train.loc[df_train["id_31"].str.contains("samsung", na=False), "id_31"] = "Samsung"
df_train.loc[df_train["id_31"].str.contains("opera", na=False), "id_31"] = "Opera"
df_train.loc[df_train["id_31"].str.contains("google", na=False), "id_31"] = "Google Search Application"
df_train["id_31"].fillna("NAN", inplace=True)
df_train.loc[df_train["id_31"].isin(df_train["id_31"].value_counts()[df_train["id_31"].value_counts()< 500].index), "id_31"]= "Others"


#id_31 test

df_test["id_31"].unique()
df_test["id_31"].value_counts()

df_test.loc[df_test["id_31"].str.contains("chrome", na=False), "id_31"] = "Chrome"
df_test.loc[df_test["id_31"].str.contains("firefox", na=False), "id_31"] = "Firefox"
df_test.loc[df_test["id_31"].str.contains("safari", na=False), "id_31"] = "Safari"
df_test.loc[df_test["id_31"].str.contains("edge", na=False), "id_31"] = "Edge"
df_test.loc[df_test["id_31"].str.contains("ie", na=False), "id_31"] = "IE"
df_test.loc[df_test["id_31"].str.contains("samsung", na=False), "id_31"] = "Samsung"
df_test.loc[df_test["id_31"].str.contains("opera", na=False), "id_31"] = "Opera"
df_test["id_31"].fillna("NAN", inplace=True)
df_test.loc[df_test["id_31"].isin(df_test["id_31"].value_counts()[df_test["id_31"].value_counts()< 2001].index), "id_31"]= "Others"


#deviceinfo train

df_train["deviceinfo"].unique()
df_train.loc[df_train['deviceinfo'].str.contains('SM', na=False), 'deviceinfo'] = 'Samsung'
df_train.loc[df_train['deviceinfo'].str.contains('Moto', na=False), 'deviceinfo'] = 'Motorola'
df_train.loc[df_train['deviceinfo'].str.contains('moto', na=False), 'deviceinfo'] = 'Motorola'
df_train.loc[df_train['deviceinfo'].str.contains('HUAWEI', na=False), 'deviceinfo'] = 'Huawei'
df_train.loc[df_train['deviceinfo'].str.contains('LG', na=False), 'deviceinfo'] = 'LG'
df_train.loc[df_train['deviceinfo'].str.contains('GT-', na=False), 'deviceinfo'] = 'Samsung'
df_train.loc[df_train['deviceinfo'].str.contains('Trident', na=False), 'deviceinfo'] = 'Trident'
df_train.loc[df_train['deviceinfo'].str.contains('BLADE', na=False), 'deviceinfo'] = 'ZTE'
df_train.loc[df_train['deviceinfo'].isin(df_train['deviceinfo'].value_counts()[df_train['deviceinfo'].value_counts() < 7570].index), 'deviceinfo'] = "Others"
df_train["deviceinfo"].fillna("NAN", inplace=True)

#deviceinfo test

df_test["deviceinfo"].unique()
df_test.loc[df_test['deviceinfo'].str.contains('SM', na=False), 'deviceinfo'] = 'Samsung'
df_test.loc[df_test['deviceinfo'].str.contains('Moto', na=False), 'deviceinfo'] = 'Motorola'
df_test.loc[df_test['deviceinfo'].str.contains('moto', na=False), 'deviceinfo'] = 'Motorola'
df_test.loc[df_test['deviceinfo'].str.contains('HUAWEI', na=False), 'deviceinfo'] = 'Huawei'
df_test.loc[df_test['deviceinfo'].str.contains('LG', na=False), 'deviceinfo'] = 'LG'
df_test.loc[df_test['deviceinfo'].str.contains('GT-', na=False), 'deviceinfo'] = 'Samsung'
df_test.loc[df_test['deviceinfo'].str.contains('Trident', na=False), 'deviceinfo'] = 'Trident'
df_test.loc[df_test['deviceinfo'].str.contains('BLADE', na=False), 'deviceinfo'] = 'ZTE'
df_test.loc[df_test['deviceinfo'].isin(df_test['deviceinfo'].value_counts()[df_test['deviceinfo'].value_counts() < 6000].index), 'deviceinfo'] = "Others"
df_test["deviceinfo"].fillna("NAN", inplace=True)
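
The regrouping above repeats the same steps for train and test with separate count thresholds, so the two frames can end up with different category levels. One possible alternative, sketched here on toy series, learns the kept levels on the training data and reuses them for the test data. The helper `regroup`, the `PROVIDERS` subset and `min_count=2` are illustrative assumptions, not this notebook's settings:

```python
import pandas as pd

# A subset of the domain-to-provider mapping used above.
PROVIDERS = {
    "gmail.com": "Google", "gmail": "Google",
    "yahoo.com": "Yahoo Mail", "yahoo.fr": "Yahoo Mail",
    "hotmail.com": "Microsoft", "outlook.com": "Microsoft",
    "aol.com": "Aol", "anonymous.com": "Anonymous",
}

def regroup(series, mapping, keep_levels=None, min_count=2):
    """Map raw values to providers, collapse levels outside `keep_levels`
    into "Others", and label missing values as "NoInf".

    If `keep_levels` is None it is learned from `series` itself: levels
    with at least `min_count` rows are kept.
    """
    # Unmapped values come back as NaN from map(); restore the original value.
    out = series.map(mapping).fillna(series)
    if keep_levels is None:
        counts = out.value_counts()
        keep_levels = set(counts[counts >= min_count].index)
    out = out.where(out.isin(keep_levels) | out.isna(), "Others")
    return out.fillna("NoInf"), keep_levels

train_raw = pd.Series(["gmail.com", "gmail", "yahoo.fr", "yahoo.com", "rare.org", None])
test_raw = pd.Series(["hotmail.com", "gmail.com", None, "yahoo.com"])

train_grouped, levels = regroup(train_raw, PROVIDERS)
test_grouped, _ = regroup(test_raw, PROVIDERS, keep_levels=levels)
print(train_grouped.tolist())
print(test_grouped.tolist())
```

With this approach a level that is frequent only in the test data falls into "Others", which keeps the one-hot columns of the two frames consistent.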



**Feature Extraction**
Taking '2017-12-01' as the start date, we construct time variables. The competition's discussion tab contains some excellent solutions that are worth reading.

In [None]:
START_DATE = '2017-12-01'
startdate = datetime.datetime.strptime(START_DATE, "%Y-%m-%d")

#train
df_train["Date"] = df_train['transactiondt'].apply(lambda x: (startdate + datetime.timedelta(seconds=x)))

df_train['New_months'] = df_train['Date'].dt.month
df_train['New_weekdays'] = df_train['Date'].dt.dayofweek
df_train['New_hours'] = df_train['Date'].dt.hour
df_train['New_days'] = df_train['Date'].dt.day
df_train['New_is_month_end'] = df_train["Date"].dt.is_month_end.astype(int)
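# weekday // 4 is 1 for Friday-Sunday (weekday >= 4) and 0 otherwise, so Friday counts as weekend here
df_train["New_is_wknd"] = df_train["Date"].dt.weekday // 4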
df_train['New_is_month_start'] = df_train["Date"].dt.is_month_start.astype(int)
df_train['New_day_of_year'] = df_train["Date"].dt.dayofyear

#test
df_test["Date"] = df_test['transactiondt'].apply(lambda x: (startdate + datetime.timedelta(seconds=x)))

df_test['New_months'] = df_test['Date'].dt.month
df_test['New_weekdays'] = df_test['Date'].dt.dayofweek
df_test['New_hours'] = df_test['Date'].dt.hour
df_test['New_days'] = df_test['Date'].dt.day
df_test['New_is_month_end'] = df_test["Date"].dt.is_month_end.astype(int)
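# weekday // 4 is 1 for Friday-Sunday (weekday >= 4) and 0 otherwise, so Friday counts as weekend here
df_test["New_is_wknd"] = df_test["Date"].dt.weekday // 4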
df_test['New_is_month_start'] = df_test["Date"].dt.is_month_start.astype(int)
df_test['New_day_of_year'] = df_test["Date"].dt.dayofyear

df_train.drop("Date", axis=1, inplace=True)
df_test.drop("Date", axis=1, inplace=True)
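
The date column above is built with a row-wise `apply`. A vectorized sketch of the same conversion with `pd.to_timedelta` is shown below on made-up values. The column name `New_is_sat_sun` is hypothetical and gives a Saturday/Sunday-only flag, in contrast to `weekday // 4`, which is also 1 on Fridays:

```python
import pandas as pd

START_DATE = pd.Timestamp("2017-12-01")  # assumed reference date, as above

# Toy TransactionDT values in seconds: 1 day, 8 days 13 hours, and 31 days after the start.
toy = pd.DataFrame({"transactiondt": [86400, 86400 * 8 + 3600 * 13, 86400 * 31]})

# pd.to_timedelta converts the whole column at once, avoiding a Python-level
# lambda call per row.
dates = START_DATE + pd.to_timedelta(toy["transactiondt"], unit="s")

toy["New_hours"] = dates.dt.hour
toy["New_weekdays"] = dates.dt.dayofweek                        # Monday=0 ... Sunday=6
toy["New_is_sat_sun"] = (dates.dt.dayofweek >= 5).astype(int)   # Saturday/Sunday only
print(toy.assign(date=dates))
```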


Define the categorical columns for both the train and test dataframes. Since some categorical features are not of object dtype and some have many categories, we define cat_cols manually. You can also add other time variables to cat_cols.

In [None]:
cat_cols_train= df_train[['productcd',  'card4', 'card6','p_emaildomain', 'r_emaildomain',
              'devicetype', 'deviceinfo','id_12','id_15','id_16','id_28',
              'id_29',"id_30", 'id_31', 'id_35', 'id_36', 'id_37', 'id_38',"m4",'m1', 'm2', 'm3'
             , 'm5', 'm6', 'm7', 'm8', 'm9',"New_months", "New_hours"]]

cat_cols_test= df_test[['productcd',  'card4', 'card6','p_emaildomain', 'r_emaildomain',
              'devicetype', 'deviceinfo','id_12','id_15','id_16','id_28',
              'id_29',"id_30", 'id_31', 'id_35', 'id_36', 'id_37', 'id_38',"m4",'m1', 'm2', 'm3'
            , 'm5', 'm6', 'm7', 'm8', 'm9']]


After one-hot encoding, rename the columns and check the shapes of the train and test dataframes.

In [None]:
# ONE-HOT ENCODING TRAIN

ohe_cols_train=cat_cols_train.columns

def one_hot_encoder(dataframe, categorical_cols, drop_first=False):
    dataframe = pd.get_dummies(dataframe, columns=categorical_cols, drop_first=drop_first, dtype="float")
    return dataframe

df_train=one_hot_encoder(df_train, ohe_cols_train, drop_first=True)

# Rename the columns after encoding, since some names contain special characters such as ':' or '/'

columns= [df_train.columns]
df_train= df_train.rename(columns = lambda x:re.sub('[^A-Za-z0-9_]+', '', x))

# ONE-HOT ENCODING TEST

ohe_cols_test=cat_cols_test.columns

def one_hot_encoder(dataframe, categorical_cols, drop_first=False):
    dataframe = pd.get_dummies(dataframe, columns=categorical_cols, drop_first=drop_first, dtype="float")
    return dataframe

df_test=one_hot_encoder(df_test, ohe_cols_test, drop_first=True)

columns= [df_test.columns]
df_test= df_test.rename(columns = lambda x:re.sub('[^A-Za-z0-9_]+', '', x))

df_train.drop("card6_debitorcredit", axis=1, inplace=True)
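
Dropping `card6_debitorcredit` by hand fixes one mismatch between the encoded frames. A more general sketch, using made-up toy frames, reindexes the test frame to the training feature columns so that any level seen in only one of the two frames is handled automatically:

```python
import pandas as pd

# Toy frames: "card6" has a level in train that never appears in test, and vice versa.
train = pd.DataFrame({"amt": [1.0, 2.0, 3.0],
                      "card6": ["debit", "credit", "debit or credit"],
                      "isfraud": [0, 1, 0]})
test = pd.DataFrame({"amt": [4.0, 5.0], "card6": ["debit", "charge card"]})

train_ohe = pd.get_dummies(train, columns=["card6"], dtype="float")
test_ohe = pd.get_dummies(test, columns=["card6"], dtype="float")

# Reindex test to the training feature columns: dummies missing from test are
# added as all-zero columns, and test-only dummies are dropped.
feature_cols = [c for c in train_ohe.columns if c != "isfraud"]
test_aligned = test_ohe.reindex(columns=feature_cols, fill_value=0.0)

print(list(test_aligned.columns))
```

This keeps the training columns as the reference, which matches how a fitted model expects its inputs.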




In [None]:
#Train- Test Split#
train_index=df_train.index[:450000]
test_index=df_train.index[450000:]

train_cols=df_train.columns
X_train=df_train.loc[train_index,train_cols]
X_train.drop(["transactionid", "isfraud", "transactiondt"], axis=1, inplace=True)

X_test=df_train.loc[test_index,train_cols]
X_test.drop(["transactionid", "isfraud", "transactiondt"], axis=1, inplace=True)

y_train1 = df_train["isfraud"]
y_train=y_train1[train_index]
y_test=y_train1[test_index]



In [None]:
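model =xgb.XGBClassifier(
    n_estimators=500,
    max_depth=9,
    learning_rate=0.05,
    subsample=0.9,
    colsample_bytree=0.9,
    tree_method='gpu_hist',  # deprecated since XGBoost 2.0 in favor of tree_method='hist' with device='cuda'
    eval_metric=["error", "logloss"]  # set here: passing eval_metric to fit() is deprecated since XGBoost 1.6 and removed in later releases
     )
model.fit(X_train, y_train, verbose=True)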

y_pred = model.predict(X_test)
train_y= model.predict(X_train)

In [None]:
print('Train Accuracy score:',accuracy_score(y_train,train_y))
print('Train F1 score:',f1_score(y_train,train_y))
print('Train Precision score:',precision_score(y_train,train_y))
print('Train Recall score:',recall_score(y_train,train_y)) 

print('Test Accuracy score:',accuracy_score(y_test,y_pred))
print('Test F1 score:',f1_score(y_test,y_pred))
print('Test Precision score:',precision_score(y_test,y_pred))
print('Test Recall score:',recall_score(y_test,y_pred)) 
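
With so few fraud cases, accuracy on hard labels can look good even for a model that never predicts fraud, and the competition itself was scored on ROC AUC. The sketch below shows how probability-based metrics are computed. It uses synthetic data and a logistic regression as a lightweight stand-in for the XGBoost model, so the numbers say nothing about this notebook's results:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, average_precision_score, roc_auc_score

rng = np.random.default_rng(7)

# Synthetic imbalanced data: roughly 3% positives, one informative feature.
n = 5000
y = (rng.random(n) < 0.03).astype(int)
X = rng.normal(size=(n, 2))
X[:, 0] += 1.5 * y

# Any classifier with predict_proba works; logistic regression keeps the sketch light.
clf = LogisticRegression().fit(X[:4000], y[:4000])
proba = clf.predict_proba(X[4000:])[:, 1]   # probability of the positive class
y_hold = y[4000:]

# Predicting "not fraud" for every row already scores high on accuracy.
baseline_acc = accuracy_score(y_hold, np.zeros_like(y_hold))
auc = roc_auc_score(y_hold, proba)
print("all-negative accuracy:", round(baseline_acc, 3))
print("ROC AUC:", round(auc, 3))
print("average precision:", round(average_precision_score(y_hold, proba), 3))
```

For the model in this notebook the equivalent call would be `roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])`.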


In [None]:
def plot_confusion_matrix(y_test, y_pred):
    acc = round(accuracy_score(y_test, y_pred), 2)
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt=".0f")
    plt.xlabel('y_pred')
    plt.ylabel('y_test')
    plt.title('Accuracy Score: {0}'.format(acc), size=10)
    plt.show()

plot_confusion_matrix(y_test, y_pred)