# TP 1 - Machine Learning

Student: Rodrigo Pazos

## Assignment

An online sales platform has hired us to build a model that detects
possible fraud for a given transaction. For this we have a dataset
with the following columns:

- step: a unit of time, where 1 step equals 1 hour
- type: type of online transaction
- amount: the amount of the transaction
- nameOrig: customer who initiates the transaction
- oldbalanceOrg: originator's balance before the transaction
- newbalanceOrig: originator's balance after the transaction
- nameDest: recipient of the transaction
- oldbalanceDest: recipient's balance before the transaction
- newbalanceDest: recipient's balance after the transaction
- isFraud: whether the transaction is fraudulent (1) or not (0)

Using the classification models covered so far, produce a notebook that
solves, if possible, the problem posed by the client.

IMPORTANT: We know that each approved transaction earns a profit of 20%,
and that each approved fraud loses 100% of the transaction amount.
Carry out an analysis and choose a model that maximizes the company's
profit.

## Solution

In [21]:
import pandas as pd
import numpy as np

### Data loading and preparation

In [2]:
df = pd.read_csv("./data/PS_20174392719_1491204439457_log.csv")
df

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.00,160296.36,M1979787155,0.00,0.00,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.00,19384.72,M2044282225,0.00,0.00,0,0
2,1,TRANSFER,181.00,C1305486145,181.00,0.00,C553264065,0.00,0.00,1,0
3,1,CASH_OUT,181.00,C840083671,181.00,0.00,C38997010,21182.00,0.00,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.00,29885.86,M1230701703,0.00,0.00,0,0
...,...,...,...,...,...,...,...,...,...,...,...
6362615,743,CASH_OUT,339682.13,C786484425,339682.13,0.00,C776919290,0.00,339682.13,1,0
6362616,743,TRANSFER,6311409.28,C1529008245,6311409.28,0.00,C1881841831,0.00,0.00,1,0
6362617,743,CASH_OUT,6311409.28,C1162922333,6311409.28,0.00,C1365125890,68488.84,6379898.11,1,0
6362618,743,TRANSFER,850002.52,C1685995037,850002.52,0.00,C2080388513,0.00,0.00,1,0


We use get_dummies to one-hot encode the type column.

In [3]:
df = pd.get_dummies(df, columns=["type"])
df.head(5)

Unnamed: 0,step,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,type_CASH_IN,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER
0,1,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0,0,0,0,1,0
1,1,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0,0,0,0,1,0
2,1,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0,0,0,0,0,1
3,1,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0,0,1,0,0,0
4,1,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0,0,0,0,1,0
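Depending on the pandas version, `get_dummies` may return boolean columns rather than the 0/1 integers shown above. A minimal sketch on a toy frame (the three rows are made up for illustration, not taken from the dataset) that requests integer output explicitly:

```python
import pandas as pd

# Toy frame for illustration only; the values are not from the dataset.
toy = pd.DataFrame({
    "amount": [100.0, 250.0, 75.5],
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT"],
})

# dtype=int asks for 0/1 integer columns regardless of the pandas default.
encoded = pd.get_dummies(toy, columns=["type"], dtype=int)
print(encoded)
```

Passing `dtype=int` keeps the encoded columns numeric in the same way as the output above, which is convenient when the frame is later converted with `.values`.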


In [4]:
len(df["nameOrig"].unique())

6353307

We cannot use get_dummies on the account names because there are too many distinct accounts. Moreover, since there are almost as many distinct origin accounts as rows, the column carries little information, or marginal information at best.

In [5]:
len(df["nameDest"].unique())

2722362

The situation improves slightly for the destination accounts, since there are fewer unique values, but one-hot encoding them is still not manageable or practical for this dataset. We therefore drop both columns, since in this form they add little value.
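The cardinality argument can be made explicit as the ratio of distinct values to rows. With the counts above and the 6,362,620 rows of the dataset, that ratio is roughly 0.998 for nameOrig and 0.43 for nameDest. A minimal sketch on a toy frame, where `cardinality_ratio` is a hypothetical helper introduced only for illustration:

```python
import pandas as pd


def cardinality_ratio(series: pd.Series) -> float:
    """Fraction of rows holding a distinct value (1.0 means all values are unique)."""
    return series.nunique() / len(series)


# Toy data for illustration only.
toy = pd.DataFrame({
    "nameOrig": ["C1", "C2", "C3", "C4"],  # every value is distinct
    "nameDest": ["M1", "M1", "C9", "C9"],  # values repeat
})

for col in ["nameOrig", "nameDest"]:
    print(col, cardinality_ratio(toy[col]))
```

A ratio close to 1 means the column behaves almost like a row identifier, so one-hot encoding it would add roughly one column per row.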

In [6]:
x_df = df.drop(columns=["nameDest", "nameOrig", "isFraud"])
y_df = df[["isFraud"]]

In [7]:
x_df.head(5)

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFlaggedFraud,type_CASH_IN,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER
0,1,9839.64,170136.0,160296.36,0.0,0.0,0,0,0,0,1,0
1,1,1864.28,21249.0,19384.72,0.0,0.0,0,0,0,0,1,0
2,1,181.0,181.0,0.0,0.0,0.0,0,0,0,0,0,1
3,1,181.0,181.0,0.0,21182.0,0.0,0,0,1,0,0,0
4,1,11668.14,41554.0,29885.86,0.0,0.0,0,0,0,0,1,0


In [8]:
y_df.head(5)

Unnamed: 0,isFraud
0,0
1,0
2,1
3,1
4,0


In [9]:
X = x_df.values
y = y_df.values

In [34]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(5090096, 12)
(1272524, 12)
(5090096, 1)
(1272524, 1)
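Because fraud is typically a rare class, a purely random split can leave the train and test sets with slightly different fraud rates. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. A minimal sketch on synthetic data (the arrays `X_toy` and `y_toy` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: 1000 rows, 1% positives.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.zeros(1000, dtype=int)
y_toy[:10] = 1

# stratify=y_toy keeps the share of positives equal in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```

Applying this to the notebook would change the split and therefore every reported metric, so it is shown only as an alternative to consider.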


### Logistic Regression

In [11]:
# Create the logistic regression model
log_reg = LogisticRegression()

# Fit on the training data; ravel() flattens y from shape (n, 1) to (n,),
# which avoids scikit-learn's DataConversionWarning about a column-vector y
log_reg.fit(X_train, y_train.ravel())


In [28]:
y_pred_log_reg = log_reg.predict(X_test)
y_pred_log_reg.shape

(1272524,)

In [18]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy = accuracy_score(y_test, y_pred_log_reg)
precision = precision_score(y_test, y_pred_log_reg)
recall = recall_score(y_test, y_pred_log_reg)
f1 = f1_score(y_test, y_pred_log_reg)

(accuracy, precision, recall, f1)

(0.998257007333457,
 0.35044064282011406,
 0.41194393662400974,
 0.3787114845938376)

### Decision tree

In [15]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(max_depth=3)

dt.fit(X_train, y_train)

In [27]:
y_pred_dt = dt.predict(X_test)
y_pred_dt.shape

(1272524,)

In [20]:
accuracy = accuracy_score(y_test, y_pred_dt)
precision = precision_score(y_test, y_pred_dt)
recall = recall_score(y_test, y_pred_dt)
f1 = f1_score(y_test, y_pred_dt)

(accuracy, precision, recall, f1)

(0.9990137710565773,
 0.9898477157360406,
 0.2376599634369287,
 0.38329238329238324)

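Judging only by these metrics, the decision tree, even at a very shallow depth, looks better than logistic regression on precision (0.99 versus 0.35). Its recall is lower, however (0.24 versus 0.41), and the F1 scores are almost identical (0.383 versus 0.379), so the metrics alone do not settle which model is preferable.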
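Both models score above 0.99 accuracy, but accuracy says little when the positive class is rare: a rule that approves every transaction also scores very high while catching no fraud. A minimal sketch with synthetic labels (the 995/5 class split is made up for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Synthetic labels for illustration: 995 legitimate, 5 fraudulent.
y_true = np.array([0] * 995 + [1] * 5)

# Trivial "model" that approves everything (always predicts 0).
y_all_legit = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_all_legit)
rec = recall_score(y_true, y_all_legit)

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_all_legit).ravel()

print(acc, rec)
print(tn, fp, fn, tp)
```

This is why precision, recall and, ultimately, the profit rule from the assignment are more informative here than accuracy.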

In [61]:
amount = X_test[:, 1]


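def calculate_revenue(amount, y_true, y_pred):
    """Net profit: 20% of approved legitimate amounts minus 100% of approved frauds."""
    # Pair each true label with its prediction; a prediction of 0 means "approved"
    zipped = np.stack((y_true, y_pred), axis=1)
    # Approved frauds: true label 1, predicted 0
    accepted_frauds = np.all(zipped == [1, 0], axis=1)
    # Approved legitimate transactions: true label 0, predicted 0
    accepted_legit = np.all(zipped == [0, 0], axis=1)
    revenue = amount * accepted_legit.astype(int) * 0.2
    loss = amount * accepted_frauds.astype(int)
    return (revenue - loss).sum()
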
print(calculate_revenue(amount, y_test.flatten(), y_pred_dt))
print(calculate_revenue(amount, y_test.flatten(), y_pred_log_reg))

44607490371.84802
44711268404.03599
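In these totals logistic regression comes out slightly ahead (about 44.71 billion versus 44.61 billion) despite its much lower precision. Both figures use the hard labels from `predict`, which for these binary classifiers amounts to a 0.5 cut-off on the predicted fraud probability. Under the assignment's rule, approving a transaction of amount a with fraud probability p has expected profit a(0.2(1 - p) - p) = a(0.2 - 1.2p), which is positive only when p < 1/6, so a lower cut-off may be worth exploring. The sketch below sweeps candidate thresholds on synthetic data; `profit_at_threshold` and `best_threshold` are hypothetical helpers, and the amounts, labels and probabilities are made up for illustration.

```python
import numpy as np


def profit_at_threshold(amount, y_true, fraud_proba, threshold):
    """Profit when transactions with fraud_proba >= threshold are rejected."""
    approved = fraud_proba < threshold
    gain = 0.2 * amount[approved & (y_true == 0)].sum()  # 20% on approved legit
    loss = amount[approved & (y_true == 1)].sum()        # 100% on approved fraud
    return gain - loss


def best_threshold(amount, y_true, fraud_proba, thresholds):
    """Return the (threshold, profit) pair with the highest profit."""
    profits = [profit_at_threshold(amount, y_true, fraud_proba, t) for t in thresholds]
    i = int(np.argmax(profits))
    return thresholds[i], profits[i]


# Synthetic example for illustration only.
amount = np.array([100.0, 200.0, 1000.0, 50.0])
y_true = np.array([0, 0, 1, 0])
fraud_proba = np.array([0.02, 0.32, 0.42, 0.12])

thresholds = np.linspace(0.05, 1.0, 20)
t_best, p_best = best_threshold(amount, y_true, fraud_proba, thresholds)

# Baseline: a threshold above 1 approves every transaction.
p_approve_all = profit_at_threshold(amount, y_true, fraud_proba, 1.01)

print(t_best, p_best, p_approve_all)
```

With the models in this notebook, the probabilities could be taken from `predict_proba(X_test)[:, 1]`. To keep the evaluation honest, the threshold would ideally be chosen on a separate validation split rather than on the test set.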
