# **Hyperparameter Tuning**

**Data**

https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud/code


**Context**

It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items they did not purchase.

**Content**

The dataset contains transactions made with credit cards in September 2013 by European cardholders.

This dataset presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly imbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
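
To make the imbalance concrete, the short sketch below recomputes the fraud rate from the two counts quoted above. As an illustration that is not part of the original dataset description, it also computes the accuracy that a trivial classifier labelling every transaction as legitimate would reach.

```python
# Counts quoted in the dataset description.
n_fraud = 492
n_total = 284_807

fraud_rate = n_fraud / n_total

# Hypothetical baseline: always predict class 0 (no fraud).
baseline_accuracy = (n_total - n_fraud) / n_total

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Accuracy of an always-0 classifier: {baseline_accuracy:.4%}")
```

A baseline this high is the reason accuracy on the full dataset says little about how well frauds are detected.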

It contains only numerical input variables, which are the result of a PCA transformation.

Unfortunately, due to confidentiality issues, the original features and further background information about the data cannot be provided. Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are **Time and Amount**.


The feature **Time** contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature **Amount** is the transaction amount; it can be used for example-dependent cost-sensitive learning. **Class** is the response variable and takes the value 1 in case of fraud and 0 otherwise.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.model_selection import HalvingRandomSearchCV

import warnings
warnings.filterwarnings("ignore")

In [None]:
url = 'https://raw.githubusercontent.com/Geerdata/DS/main/Datacoder/Arc.%20Modelo/creeditCar.csv'
df = pd.read_csv(url, sep=';')
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0,"-1,35981E+13","-7,27812E+14","2,53635E+14","1,37816E+14","-3,38321E+14","4,62388E+14","2,39599E+14","9,86979E+14","3,63787E+14",...,"-1,83068E+13","2,77838E+14","-1,10474E+14","6,69281E+14","1,28539E+14","-1,89115E+14","1,33558E+14","-2,10531E+14",14962,0
1,0,"1,19186E+14","2,66151E+13","1,6648E+13","4,48154E+14","6,00176E+14","-8,23608E+14","-7,8803E+14","8,51017E+14","-2,55425E+14",...,"-2,25775E+14","-6,38672E+14","1,01288E+14","-3,39846E+14","1,6717E+14","1,25895E+14","-8,9831E+14","1,47242E+14",269,0
2,1,"-1,35835E+14","-1,34016E+14","1,77321E+14","3,7978E+14","-5,03198E+14","1,8005E+14","7,91461E+14","2,47676E+14","-1,51465E+14",...,"2,47998E+14","7,71679E+14","9,09412E+14","-6,89281E+14","-3,27642E+14","-1,39097E+14","-5,53528E+14","-5,97518E+14",37866,0
3,1,"-9,66272E+14","-1,85226E+14","1,79299E+14","-8,63291E+14","-1,03089E+14","1,2472E+14","2,37609E+13","3,77436E+14","-1,38702E+14",...,"-1,083E+14","5,2736E+14","-1,90321E+14","-1,17558E+14","6,47376E+14","-2,21929E+14","6,27228E+14","6,14576E+14",1235,0
4,2,"-1,15823E+14","8,77737E+14","1,54872E+12","4,03034E+14","-4,07193E+14","9,59215E+14","5,92941E+14","-2,70533E+14","8,17739E+14",...,"-9,4307E+14","7,98278E+13","-1,37458E+14","1,41267E+14","-2,0601E+14","5,02292E+14","2,19422E+14","2,15153E+14",6999,0
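
The quoted values in the preview above (for example `-1,35981E+13`) use a comma as the decimal separator, so a plain `read_csv` call will likely load those columns as text. A minimal sketch of how the `decimal` argument of `pd.read_csv` handles this, using a tiny made-up sample rather than the real file:

```python
import io

import pandas as pd

# Made-up two-row sample in the same style as the file:
# ';' separates the fields and ',' marks the decimals.
sample = "Time;V1;Amount;Class\n0;-1,35981E+13;149,62;0\n1;1,19186E+14;2,69;0\n"

as_text = pd.read_csv(io.StringIO(sample), sep=";")
as_numbers = pd.read_csv(io.StringIO(sample), sep=";", decimal=",")

print(as_text.dtypes["V1"])     # text column
print(as_numbers.dtypes["V1"])  # floating-point column
```

This only addresses the separator; whether the magnitudes stored in the file are themselves correct is a separate question.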


In [None]:
# Load the shared file into memory
url = "https://drive.google.com/uc?id="
ext = "102bkw-Z_mbpPPB3tVt7lyrGv3EqH6hFs"
df = pd.read_excel(url+ext)
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0,-13598071336738,-727811733098497,253634673796914,137815522427443,-338320769942518,462387777762292,239598554061257,986979012610507,363786969611213,...,-18306777944153,277837575558899,-110473910188767,669280749146731,128539358273528,-189114843888824,133558376740387,-210530534538215,14962,0
1,0,119185711131486,26615071205963,16648011335321,448154078460911,600176492822243,-823608088155687,-788029833323113,851016549148104,-255425128109186,...,-225775248033138,-638671952771851,101288021253234,-339846475529127,167170404418143,125894532368176,-898309914322813,147241691924927,269,0
2,1,-135835406159823,-134016307473609,177320934263119,379779593034328,-503198133318193,180049938079263,791460956450422,247675786588991,-151465432260583,...,247998153469754,771679401917229,909412262347719,-689280956490685,-327641833735251,-139096571514147,-553527940384261,-597518405929204,37866,0
3,1,-966271711572087,-185226008082898,179299333957872,-863291275036453,-103088796030823,124720316752486,23760893977178,377435874652262,-138702406270197,...,-108300452035545,527359678253453,-190320518742841,-117557533186321,647376034602038,-221928844458407,627228487293033,614576285006353,1235,0
4,2,-115823309349523,877736754848451,1548717846511,403033933955121,-407193377311653,959214624684256,592940745385545,-270532677192282,817739308235294,...,-943069713232919,79827849458971,-137458079619063,141266983824769,-206009587619756,502292224181569,219422229513348,215153147499206,6999,0


In [None]:
# Preprocess the DataFrame
# Drop the Time column
df = df.drop(columns='Time')
# Standardize the Amount column (zero mean, unit variance)
df['Amount'] = (df['Amount'] - np.mean(df['Amount'])) / np.std(df['Amount'])
df.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,-13598071336738,-727811733098497,253634673796914,137815522427443,-338320769942518,462387777762292,239598554061257,986979012610507,363786969611213,907941719789316,...,-18306777944153,277837575558899,-110473910188767,669280749146731,128539358273528,-189114843888824,133558376740387,-210530534538215,0.423089,0
1,119185711131486,26615071205963,16648011335321,448154078460911,600176492822243,-823608088155687,-788029833323113,851016549148104,-255425128109186,-166974414004614,...,-225775248033138,-638671952771851,101288021253234,-339846475529127,167170404418143,125894532368176,-898309914322813,147241691924927,-0.254305,0
2,-135835406159823,-134016307473609,177320934263119,379779593034328,-503198133318193,180049938079263,791460956450422,247675786588991,-151465432260583,207642865216696,...,247998153469754,771679401917229,909412262347719,-689280956490685,-327641833735251,-139096571514147,-553527940384261,-597518405929204,1.479036,0
3,-966271711572087,-185226008082898,179299333957872,-863291275036453,-103088796030823,124720316752486,23760893977178,377435874652262,-138702406270197,-549519224713749,...,-108300452035545,527359678253453,-190320518742841,-117557533186321,647376034602038,-221928844458407,627228487293033,614576285006353,-0.20977,0
4,-115823309349523,877736754848451,1548717846511,403033933955121,-407193377311653,959214624684256,592940745385545,-270532677192282,817739308235294,753074431976354,...,-943069713232919,79827849458971,-137458079619063,141266983824769,-206009587619756,502292224181569,219422229513348,215153147499206,0.055969,0
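
The transformation applied to `Amount` is a z-score: subtract the mean and divide by the standard deviation. The sketch below reproduces it on made-up amounts with the standard library and highlights one detail worth knowing: `np.std` computes the population standard deviation by default (`ddof=0`), whereas the pandas `Series.std` method defaults to the sample version (`ddof=1`).

```python
import statistics

amounts = [149.62, 2.69, 378.66, 123.50, 69.99]  # made-up values

mean = statistics.fmean(amounts)
std_population = statistics.pstdev(amounts)  # counterpart of np.std(x), ddof=0
std_sample = statistics.stdev(amounts)       # counterpart of Series.std(), ddof=1

# Z-scores using the population standard deviation, as in the cell above.
z_scores = [(a - mean) / std_population for a in amounts]

print([round(z, 3) for z in z_scores])
print(round(std_population, 3), round(std_sample, 3))
```

With hundreds of thousands of rows the two definitions give nearly the same result, so the choice matters mostly on small samples.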


In [None]:
df_ones = df[df['Class'] == 1]  # Rows of the positive class (fraud)
print(df_ones.shape)
df_zeros = df[df['Class'] == 0]  # Rows of the negative class (no fraud)
print(df_zeros.shape)

(222, 30)
(97777, 30)


In [None]:
df_zeros = df_zeros.sample(3 * df_ones.shape[0])  # Undersample: 3 negatives per positive
print(df_zeros.shape)
# Concatenate both classes
df_final = pd.DataFrame(np.concatenate([df_ones, df_zeros], axis=0), columns=df.columns)
print(df_final.shape)
df_final.head()

(666, 30)
(888, 30)


Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,-23122270000000.0,195199200000000.0,-160985100000000.0,39979060000000.0,-522187900000000.0,-142654500000000.0,-253738700000000.0,139165700000000.0,-277008900000000.0,-277227200000000.0,...,517232400000000.0,-350493700000000.0,-465211100000000.0,320198200000000.0,445191700000000.0,177839800000000.0,261145000000000.0,-143275900000000.0,-0.266707,1.0
1,-30435410000000.0,-315730700000000.0,108846300000000.0,22886440000000.0,135980500000000.0,-106482300000000.0,325574300000000.0,-677936500000000.0,-270952800000000.0,-838586600000000.0,...,661695900000000.0,435477200000000.0,137596600000000.0,-293803200000000.0,279798000000000.0,-145361700000000.0,-252773100000000.0,357642300000000.0,-0.242319,1.0
2,-230335000000000.0,1759247000000.0,-359744700000000.0,233024300000000.0,-821628300000000.0,-757875700000000.0,562319800000000.0,-399146600000000.0,-238253400000000.0,-152541200000000.0,...,-294166300000000.0,-932391100000000.0,172726300000000.0,-873295400000000.0,-156114300000000.0,-542627900000000.0,395659900000000.0,-153028800000000.0,0.839447,1.0
3,-439797400000000.0,135836700000000.0,-25928440000000.0,267978700000000.0,-112813100000000.0,-170653600000000.0,-349619700000000.0,-248777700000000.0,-24776790000000.0,-480163700000000.0,...,573574100000000.0,176967700000000.0,-436206900000000.0,-535018600000000.0,252405300000000.0,-657487800000000.0,-827135700000000.0,849573400000000.0,-0.263987,1.0
4,123423500000000.0,30197400000000.0,-430459700000000.0,473279500000000.0,362420100000000.0,-135774600000000.0,171344500000000.0,-496358500000000.0,-128285800000000.0,-244746900000000.0,...,-37906830000000.0,-704181000000000.0,-656804800000000.0,-163265300000000.0,148890100000000.0,566797300000000.0,-100162200000000.0,146792700000000.0,-0.266661,1.0
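
The undersampling step above keeps every fraud row and draws three non-fraud rows per fraud row. A minimal standard-library sketch of the same idea follows; the row identifiers and the seed are illustrative assumptions. Fixing a seed (the counterpart of passing `random_state` to `DataFrame.sample`) makes the draw repeatable.

```python
import random

# Illustrative row identifiers with the class sizes printed earlier.
positives = [("fraud", i) for i in range(222)]
negatives = [("legit", i) for i in range(97_777)]

rng = random.Random(42)  # assumed seed, only for reproducibility

# Draw, without replacement, three negatives for every positive.
sampled_negatives = rng.sample(negatives, 3 * len(positives))

balanced = positives + sampled_negatives
print(len(sampled_negatives), len(balanced))
```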


In [None]:
df_final.shape

(888, 30)

In [None]:
df_final.isnull().sum()

Unnamed: 0,0
V1,0
V2,0
V3,0
V4,0
V5,0
V6,0
V7,0
V8,0
V9,0
V10,0


In [None]:
# Split into features X and target y
y = df_final.Class
X = df_final.drop(columns='Class')
print(X.shape, y.shape)

(888, 29) (888,)


In [None]:
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(X_train.shape, X_test.shape)

(621, 29) (267, 29)
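
The split sizes printed above follow from simple arithmetic. The sketch below assumes that a fractional test size is rounded up and that the training set receives the remaining rows, which is consistent with the shapes shown.

```python
import math

n_rows = 888
test_fraction = 0.3

n_test = math.ceil(n_rows * test_fraction)  # assumption: fractional test size rounds up
n_train = n_rows - n_test

print(n_train, n_test)
```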


## **Manual selection**

In [None]:
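model = xgb.XGBClassifier(learning_rate=0.001)
# Candidate hyperparameter sets.
# NOTE: 'criterion' and 'splitter' are scikit-learn DecisionTreeClassifier options;
# XGBoost does not use them, so only 'max_depth' actually differs between the sets.
params_1 = {'criterion': 'gini', 'splitter': 'best', 'max_depth': 5}
params_2 = {'criterion': 'entropy', 'splitter': 'random', 'max_depth': 7}
params_3 = {'criterion': 'gini', 'splitter': 'random', 'max_depth': 10}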

## **These processes can take quite a while!**

In [None]:
# Model 1
model.set_params(**params_1).fit(X_train, y_train)
print(f'Accuracy for Model 1 = {round(accuracy_score(y_test, model.predict(X_test)), 5)}')
# Model 2
model.set_params(**params_2).fit(X_train, y_train)
print(f'Accuracy for Model 2 = {round(accuracy_score(y_test, model.predict(X_test)), 5)}')
# Model 3
model.set_params(**params_3).fit(X_train, y_train)
print(f'Accuracy for Model 3 = {round(accuracy_score(y_test, model.predict(X_test)), 5)}')

Accuracy for Model 1 = 0.77154
Accuracy for Model 2 = 0.77154
Accuracy for Model 3 = 0.77154
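
All three configurations report exactly the same accuracy, which suggests that the predictions do not change between them. One possible explanation, stated here as an assumption rather than something the notebook verifies, is that the model predicts a single class for every test row: 206 correct predictions out of the 267 test rows would give precisely this figure. The sketch below checks that arithmetic and shows, on made-up labels, why accuracy alone hides such behaviour while recall exposes it.

```python
n_test = 267
assumed_correct = 206  # assumption: size of the majority class in the test set

print(round(assumed_correct / n_test, 5))

# Made-up labels and a classifier that always answers 0.
y_true = [0] * assumed_correct + [1] * (n_test - assumed_correct)
y_pred = [0] * n_test

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_test
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)  # share of frauds that were detected

print(f"accuracy={accuracy:.5f} recall={recall:.5f}")
```

Comparing `model.predict(X_test)` against a constant prediction, or reporting recall alongside accuracy, would confirm or rule out this explanation.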


## **Grid Search**

In [None]:
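# NOTE: 'criterion' is a scikit-learn decision-tree option that XGBoost does not use,
# so it doubles the number of candidates without changing the fitted model.
params_grid = {
    'min_child_weight': [1, 5, 10],
    'gamma': [0.5, 1, 1.5, 2, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'max_depth': [5, 6, 7],
    'criterion': ['entropy', 'gini']
}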

In [None]:
# Number of combinations in the grid
3*5*3*3*3*2

810
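
The hand-computed product can also be derived from the grid itself, which stays correct if the grid is edited later. The sketch repeats the `params_grid` dictionary so that it runs on its own, and multiplies by the `cv=3` folds used in the grid search to estimate the number of cross-validation fits.

```python
from itertools import product

params_grid = {
    'min_child_weight': [1, 5, 10],
    'gamma': [0.5, 1, 1.5, 2, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'max_depth': [5, 6, 7],
    'criterion': ['entropy', 'gini'],
}

# Every combination of one value per hyperparameter.
combinations = list(product(*params_grid.values()))
n_candidates = len(combinations)
n_fits = n_candidates * 3  # one fit per candidate and per fold (cv=3)

print(n_candidates, n_fits)
```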

In [None]:
# Execution time: undetermined
grid_cv = GridSearchCV(model, params_grid, scoring="accuracy", n_jobs=-1, cv=3)
grid_cv.fit(X_train, y_train)

print("Best parameters", grid_cv.best_params_)
print("Best CV score", grid_cv.best_score_)
print(f'Model accuracy = {round(accuracy_score(y_test, grid_cv.predict(X_test)), 5)}')

Best parameters {'colsample_bytree': 0.6, 'criterion': 'entropy', 'gamma': 0.5, 'max_depth': 5, 'min_child_weight': 1, 'subsample': 0.6}
Best CV score 0.7407407407407408
Model accuracy = 0.77154


## **Randomized Search CV**

In [None]:
# Execution time: undetermined
random_cv = RandomizedSearchCV(model, params_grid, scoring="accuracy", n_jobs=-1, cv=3)
random_cv.fit(X_train, y_train)

print("Best parameters", random_cv.best_params_)
print("Best CV score", random_cv.best_score_)
print(f'Model accuracy = {round(accuracy_score(y_test, random_cv.predict(X_test)), 5)}')

Best parameters {'subsample': 1.0, 'min_child_weight': 5, 'max_depth': 5, 'gamma': 0.5, 'criterion': 'entropy', 'colsample_bytree': 0.8}
Best CV score 0.7407407407407408
Model accuracy = 0.77154
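
Unlike the exhaustive grid, a randomized search evaluates only a fixed number of candidates; `RandomizedSearchCV` uses `n_iter=10` by default, so the cell above tries 10 of the 810 combinations. A standard-library sketch of that sampling idea follows (the seed is an illustrative assumption, and this is not scikit-learn's own sampler):

```python
import random
from itertools import product

params_grid = {
    'min_child_weight': [1, 5, 10],
    'gamma': [0.5, 1, 1.5, 2, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'max_depth': [5, 6, 7],
    'criterion': ['entropy', 'gini'],
}

# Expand the grid into a list of candidate dictionaries.
all_candidates = [
    dict(zip(params_grid, values)) for values in product(*params_grid.values())
]

rng = random.Random(0)  # assumed seed, only for reproducibility
sampled = rng.sample(all_candidates, 10)  # 10 mirrors the default n_iter

print(f"{len(sampled)} of {len(all_candidates)} candidates, {len(sampled) * 3} fits with cv=3")
for candidate in sampled[:3]:
    print(candidate)
```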


## **Halving Grid Search**

In [None]:
# Execution time: about 125 s
halving_cv = HalvingGridSearchCV(model, params_grid, scoring="accuracy", factor=3)
halving_cv.fit(X_train, y_train)

print("Best parameters", halving_cv.best_params_)
print("Best CV score", halving_cv.best_score_)
print(f'Model accuracy = {round(accuracy_score(y_test, halving_cv.predict(X_test)), 5)}')

Best parameters {'colsample_bytree': 0.6, 'criterion': 'gini', 'gamma': 1.5, 'max_depth': 6, 'min_child_weight': 1, 'subsample': 1.0}
Best CV score 0.7444098303911388
Model accuracy = 0.77154
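
Successive halving starts by scoring every candidate on a small slice of the data, keeps roughly the best `1/factor` of them, and multiplies the slice size by `factor` each round. The sketch below computes such a schedule for 810 candidates and the 621 training rows with `factor=3`. The starting size of 20 samples is an assumption, and the loop is a simplification that does not reproduce scikit-learn's exact bookkeeping.

```python
import math


def halving_schedule(n_candidates, min_resources, max_resources, factor=3):
    """Return (candidates, samples) pairs for a simplified successive-halving run."""
    schedule = []
    resources = min_resources
    while resources <= max_resources:
        schedule.append((n_candidates, resources))
        if n_candidates == 1:
            break  # a single survivor is the winner
        n_candidates = math.ceil(n_candidates / factor)  # keep the best 1/factor
        resources *= factor                              # give survivors more data
    return schedule


schedule = halving_schedule(n_candidates=810, min_resources=20, max_resources=621)
for round_index, (candidates, samples) in enumerate(schedule):
    print(f"round {round_index}: {candidates} candidates on {samples} samples")
```

Under these assumptions most candidates are discarded after being scored on only a few dozen rows, which is where the time saving over an exhaustive grid search comes from.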


## **Halving Randomized Search**

In [None]:
# Execution time: about 13 s
halving_cv = HalvingRandomSearchCV(model, params_grid, scoring="accuracy", factor=3)
halving_cv.fit(X_train, y_train)

print("Best parameters", halving_cv.best_params_)
print("Best CV score", halving_cv.best_score_)
print(f'Model accuracy = {round(accuracy_score(y_test, halving_cv.predict(X_test)), 5)}')

Best parameters {'subsample': 0.6, 'min_child_weight': 10, 'max_depth': 5, 'gamma': 0.5, 'criterion': 'gini', 'colsample_bytree': 0.8}
Best CV score 0.7481308411214952
Model accuracy = 0.77154
