### Usando datos de una competencia proveniente de Kaggle. El proyecto busca maximizar el valor de la metrica auc-ROC para clasificar con mayor exito usuarios que han realizado fraudes con las tarjetas de credito. El dataset contiene mas de 200 columnas y 500 mil filas. Si tomamos con valor 1 el hecho de que la transaccion sea fraudulenta, existen aproximadamente un 4% de las mismas, por lo que el dataset no esta balanceado. Hay que trabajar sobre esta situacion y ver como lidiar con el desbalanceo. Existe como posibilidad e undersampling, oversampling, unca combinacion de ambos o ponerle peso distinto a las muestras con target 1 cuando se está entrenando el modelo. Se debe destacar que la naturaleza del set de prueba puede ser distinta a la del de entrenamiento por lo que un buen score en la validación del train set podria devolver valores bajos en los resultados del test set una vez subido a Kaggle.

In [1]:
#Importo las librerias necesaria para el analisis de datos y los calculos matematicos
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
import numpy as np

# Descriptor de columnas:
### Transaction Table *
- TransactionDT: timedelta from a given reference datetime (not an actual timestamp)
- TransactionAMT: transaction payment amount in USD
- ProductCD: product code, the product for each transaction
- card1 - card6: payment card information, such as card type, card category, issue bank, country, etc.
- addr: address
- dist: distance
- P_ and (R__) emaildomain: purchaser and recipient email domain
- C1-C14: counting, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.
- D1-D15: timedelta, such as days between previous transaction, etc.
- M1-M9: match, such as names on card and address, etc.
- Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations.

Categorical Features:
- ProductCD
- card1 - card6
- addr1, addr2
- P_emaildomain
- R_emaildomain
- M1 - M9

### Identity Table *
Variables in this table are identity information – network connection information (IP, ISP, Proxy, etc) and digital signature (UA/browser/os/version, etc) associated with transactions.
They're collected by Vesta’s fraud protection system and digital security partners.
(The field names are masked and pairwise dictionary will not be provided for privacy protection and contract agreement)

Categorical Features:
- DeviceType
- DeviceInfo
- id_12 - id_38

In [4]:
#Cargo el dataset de train transacciones
train_trans = pd.read_csv('/content/gdrive/MyDrive/proyectito_santi/train_transaction.csv')

In [5]:
#Cargo el dataset de train de ids
train_id = pd.read_csv('/content/gdrive/MyDrive/proyectito_santi/train_identity.csv')

In [6]:
data_train=train_trans.merge(right=train_id, how='left', on='TransactionID')

In [7]:
#Borro datsets para no gastar tanta memoria
del train_trans
del train_id

In [8]:
#Cargo el dataset de test transacciones
test_trans = pd.read_csv('/content/gdrive/MyDrive/proyectito_santi/test_transaction.csv')
#Cargo el dataset de test de ids
test_id = pd.read_csv('/content/gdrive/MyDrive/proyectito_santi/test_identity.csv')

In [9]:
data_test=test_trans.merge(right=test_id, how='left', on='TransactionID')
del test_trans
del test_id


In [10]:
old_cols= list(data_test.columns[-40:-2])
new_cols= [x.replace('-','_') for x in old_cols]
dic=dict(zip(old_cols, new_cols))
data_test.rename(columns=dic, inplace= True)
del old_cols
del new_cols
del dic

#### Por inspeccion en jupyter ya se que la unica columna que no está en el test set es 'isFraud' (justamente el target) porque la idea es subir los resultados del ajuste del modelo a los test set a kaggle y ahi obtener el score.

In [11]:
#Defino la columna target para el trainset y la elimino del X_train
X_train=data_train.iloc[:,2:]
Y_train=data_train['isFraud']
#Defino el X_test
X_test=data_test.iloc[:,1:]
#Defino las series que contienen las ids de las operaciones
#id_train=data_train.iloc[:,0]
id_test=data_test.iloc[:,0]
#Elimino los datasets data_train y data_test
del data_train
del data_test

In [12]:
#Quiero eliminar del X_train las columnas con mas de 60% valores nulos, pero si las elimino del X_train tambien lo hago en el X_test
#Quiero eliminar las columnas que tengan mas de 60% de valores nulos
has_many_nans = []
for col in X_train.columns:
  if X_train[col].isna().sum() > 0:
    perc = 100*X_train[col].isna().sum()/X_train.shape[0]
    if perc > 60:
      has_many_nans.append(col)
#Borro del dataset las columnas con muchos nans
X_train.drop(columns=has_many_nans, inplace=True)
X_test.drop(columns=has_many_nans, inplace=True)

In [13]:
def Preproc(df_merged):
    #Voy a llenar los Nan de todas las columnas, numericas y categoricas de la misma manera
    #A las columnas categoricas con muchos valores (id card) no voy a hacerles one hot encoding, sino que las voy a mapear a los valores mas comunes
    for col in df_merged.columns:
        c=df_merged[col].value_counts(normalize=True)
        df_merged[col].fillna(value=np.random.choice(c.index, p=c.values),inplace=True)
  #Escribo las columnas categoricas a las cuales quiero pasar pasar a numericas, donde el valor de una clase sea igual a la cantidad 
  #de veces que aparece esa clase dentro de la columna categorica. De esa manera si un mismo id de tarjeta aparece varias veces tendra un orden
  #en importancia mayor que otras que no
    cats=['card1','card2','card3','card5','addr1','addr2','P_emaildomain']
    for col in cats:
        df_merged[col]=df_merged[col].map(df_merged[col].value_counts(normalize=True))
  #Defino a las columnas a las cuales voy a hacerles one hot encoding (uso getdummies)
    true_cats = ['ProductCD', 'card6', 'card4'] + ['M'+str(i) for i in range(1,10)]
  #Hago el hotencoding con el metodo get_dummies, en este caso me parece mas practico
    df_merged = pd.get_dummies(df_merged, prefix=true_cats, columns = true_cats, drop_first=True)
  #Ahora voya  usar Standar Scaler para escalear los datos
    scaler = StandardScaler()
    cols_to_transform=[col for col in df_merged.columns if df_merged[col].dtype != 'category']
    cols_not_to_transform=[col for col in df_merged.columns if df_merged[col].dtype == 'category']
    scaled_values= scaler.fit_transform(df_merged.loc[:,cols_to_transform])
    scaled_values = pd.DataFrame(scaled_values, columns=cols_to_transform)
    df_merged = pd.concat([df_merged[cols_not_to_transform], scaled_values],axis=1)
    return df_merged

In [14]:
#Defino los datasets de entranamiento y prueba
X_train=Preproc(X_train)
X_test=Preproc(X_test)

In [None]:
#Importo paquetes de sklearn, xgboost y imblearn para los modelos, el pipeline y para lidiar con el desbalanceo
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from imblearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import GridSearchCV, KFold, RandomizedSearchCV, train_test_split
from scipy.stats import uniform, randint
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier

# Random forest

## Opcion 1: Undersampleo

In [None]:
#Ahora quiero undersamplear mi dataset
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier



In [None]:
rus = RandomUnderSampler(sampling_strategy = 0.25, random_state = 42)
X_train_us, Y_train_us = rus.fit_resample(X_train, Y_train)

In [None]:
ros = RandomOverSampler(random_state = 42)
X_train_os, Y_train_os = ros.fit_resample(X_train_us, Y_train_us)

In [None]:
del X_train_us
del Y_train_us

In [None]:
Y_train_os.shape

(165304,)

## Modelo

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [None]:
forest = RandomForestClassifier(random_state = 5)
# Voy a probar con y sin undersampleo
params={'max_depth':[5,7,10], 'min_samples_split':[1000, 2500, 3500]}
grid_forest = GridSearchCV(estimator= forest, param_grid= params, scoring='roc_auc', n_jobs= -1, cv= 3)

In [None]:
grid_forest.best_estimator_

RandomForestClassifier(max_depth=10, min_samples_split=1000, random_state=5)

In [None]:
rf=RandomForestClassifier(max_depth=10, min_samples_split=1000, random_state=5)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight
from sklearn.model_selection import cross_val_score
from sklearn import metrics

In [None]:
scores=cross_val_score(rf, X_train_os, Y_train_os, cv=5, scoring='roc_auc')

In [None]:
print(scores)

[0.88156354 0.88116979 0.88242381 0.88179672 0.88284149]


In [None]:
rf.fit(X_train_os,Y_train_os)

RandomForestClassifier(max_depth=10, min_samples_split=1000, random_state=5)

In [None]:
#En el test set no aparece la opcion credito-debito asi que la agrego para que hayan las mismas columnas en el train y test set
X_test.insert(218, 'card6_debit or credit', 0)

In [None]:
test_predict = rf.predict_proba(X_test)

In [None]:
test_pred_df = pd.concat([id_test, pd.DataFrame({'isFraud': test_predict[:,1]})], axis=1)

In [None]:
test_pred_df.to_csv('/content/gdrive/MyDrive/proyectito_santi/random_forest_predict.csv', index=False)

## XGboost

#### Hago una prueba sin tunear hiperparametros

In [None]:
XGB = XGBClassifier(silent=False, 
                      scale_pos_weight=1,
                      learning_rate=0.01,  
                      colsample_bytree = 0.4,
                      subsample = 0.8,
                      objective='binary:logistic', 
                      n_estimators=100, 
                      reg_alpha = 0.3,
                      max_depth=4, 
                      gamma=10)

In [None]:
scores=cross_val_score(XGB, X_train_os, Y_train_os, cv=5, scoring='roc_auc')

In [None]:
print(scores)

[0.86497508 0.86633865 0.86675269 0.86595652 0.86673073]


In [None]:
XGB.fit(X_train_os,Y_train_os)

XGBClassifier(colsample_bytree=0.4, gamma=10, learning_rate=0.01, max_depth=4,
              reg_alpha=0.3, silent=False, subsample=0.8)

In [None]:
f_import=pd.DataFrame({'Features':X_train_os.columns, 'Importance':XGB.feature_importances_})
f_import= f_import.sort_values('Importance',ascending=False)

In [None]:
#Columnas con mas importancia para el modelo sin tunear hiperparametros
f_import.head(20)

Unnamed: 0,Features,Importance
105,V74,0.073769
64,V33,0.060591
17,C8,0.055471
121,V90,0.044289
104,V73,0.043152
185,V295,0.029528
100,V69,0.029093
23,C14,0.028706
13,C4,0.027998
14,C5,0.026255


In [None]:
#X_test.insert(218, 'card6_debit or credit', 0)
test_predict = XGB.predict_proba(X_test)
test_pred_df = pd.concat([id_test, pd.DataFrame({'isFraud': test_predict[:,1]})], axis=1)
test_pred_df.to_csv('XGB_predict.csv', index=False)

Los resultados no fueron mejores que son el randon forest. Voy a intentar tunear hiperparametros

In [None]:
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold, RandomizedSearchCV, train_test_split
from scipy.stats import uniform, randint

In [None]:
params = {
    "colsample_bytree": uniform(0.3, 0.7),
    "gamma": uniform(0, 0.5),
    "learning_rate": uniform(0.01, 0.3), # default 0.1 
    "max_depth": randint(2, 6), # default 3
    "n_estimators": randint(100, 150), # default 100
    "subsample": uniform(0.4, 0.9)
}

search = RandomizedSearchCV(XGBClassifier(random_satate=5), param_distributions=params, random_state=42, n_iter=200, cv=3, verbose=1, n_jobs=-1, return_train_score=True)

search.fit(X_train_os, Y_train_os)

report_best_scores(search.cv_results_, 1)

In [None]:
search.best_estimator_

XGBClassifier(colsample_bytree=0.7515723534213954, gamma=0.3344620298315498,
              learning_rate=0.26925026952157094, max_depth=5, n_estimators=147,
              random_satate=5, subsample=0.6526951261967702)

In [None]:
best_xgb=XGBClassifier(colsample_bytree=0.7515723534213954, gamma=0.3344620298315498,
              learning_rate=0.26925026952157094, max_depth=5, n_estimators=147,
              random_satate=5, subsample=0.6526951261967702)

In [None]:
scores=cross_val_score(best_xgb, X_train_os, Y_train_os, cv=5, scoring='roc_auc')

In [None]:
print(scores) #Valores de la validacion

[0.95252963 0.95238314 0.95734399 0.95501244 0.95865854]


In [None]:
#X_test.insert(218, 'card6_debit or credit', 0)
best_xgb.fit(X_train_os,Y_train_os)
test_predict = best_xgb.predict_proba(X_test)
test_pred_df = pd.concat([id_test, pd.DataFrame({'isFraud': test_predict[:,1]})], axis=1)
test_pred_df.to_csv('best_XGB_predict.csv', index=False)

### Se observa que el mejor modelo con la ingenieria de features utilizada es el random forest, aunque para valores de validacion el xgboost ha dado mejores resultados. Esto puede deberse a que la forma de hacer ingenieria de features en el conjunto de entrenamiento no es la misma que en el conjunto de prueba provisto por Kaggle.

<img src="score-fraude.png">