# Fraud detection in ETH

This dataset contains rows of known fraud and valid transactions made over Ethereum, a type of cryptocurrency.

*This dataset is imbalanced.*

### Here is a *description* of the columns of the dataset:

- Index: the index number of a row

- Address: the address of the ethereum account

- **FLAG**: *whether the transaction is fraud (1) or not (0)*

- Avg min between sent tnx: Average time between sent transactions for account in minutes

- Avgminbetweenreceivedtnx: Average time between received transactions for account in minutes

- TimeDiffbetweenfirstand_last(Mins): Time difference between the first and last transaction in minutes

- Sent_tnx: Total number of sent normal transactions

- Received_tnx: Total number of received normal transactions

- NumberofCreated_Contracts: Total Number of created contract transactions

- UniqueReceivedFrom_Addresses: Total Unique addresses from which account received transactions

- UniqueSentTo_Addresses20: Total Unique addresses to which account sent transactions

- MinValueReceived: Minimum value in Ether ever received

- MaxValueReceived: Maximum value in Ether ever received

- AvgValueReceived: Average value in Ether ever received

- MinValSent: Minimum value of Ether ever sent

- MaxValSent: Maximum value of Ether ever sent

- AvgValSent: Average value of Ether ever sent

- MinValueSentToContract: Minimum value of Ether sent to a contract

- MaxValueSentToContract: Maximum value of Ether sent to a contract

- AvgValueSentToContract: Average value of Ether sent to contracts

- TotalTransactions(IncludingTnxtoCreate_Contract): Total number of transactions

- TotalEtherSent: Total Ether sent for account address

- TotalEtherReceived: Total Ether received for account address

- TotalEtherSent_Contracts: Total Ether sent to Contract addresses

- TotalEtherBalance: Total Ether Balance following enacted transactions

- TotalERC20Tnxs: Total number of ERC20 token transfer transactions

- ERC20TotalEther_Received: Total ERC20 token received transactions in Ether

- ERC20TotalEther_Sent: Total ERC20 token sent transactions in Ether

- ERC20TotalEtherSentContract: Total ERC20 token transfer to other contracts in Ether

- ERC20UniqSent_Addr: Number of ERC20 token transactions sent to Unique account addresses

- ERC20UniqRec_Addr: Number of ERC20 token transactions received from Unique addresses

- ERC20UniqRecContractAddr: Number of ERC20 token transactions received from Unique contract addresses

- ERC20AvgTimeBetweenSent_Tnx: Average time between ERC20 token sent transactions in minutes

- ERC20AvgTimeBetweenRec_Tnx: Average time between ERC20 token received transactions in minutes

- ERC20AvgTimeBetweenContract_Tnx: Average time between ERC20 token contract transactions in minutes

- ERC20MinVal_Rec: Minimum value in Ether received from ERC20 token transactions for account

- ERC20MaxVal_Rec: Maximum value in Ether received from ERC20 token transactions for account

- ERC20AvgVal_Rec: Average value in Ether received from ERC20 token transactions for account

- ERC20MinVal_Sent: Minimum value in Ether sent from ERC20 token transactions for account

- ERC20MaxVal_Sent: Maximum value in Ether sent from ERC20 token transactions for account

- ERC20AvgVal_Sent: Average value in Ether sent from ERC20 token transactions for account

- ERC20UniqSentTokenName: Number of Unique ERC20 tokens transferred

- ERC20UniqRecTokenName: Number of Unique ERC20 tokens received

- ERC20MostSentTokenType: Most sent token for account via ERC20 transaction

- ERC20MostRecTokenType: Most received token for account via ERC20 transactions
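
The names listed above are condensed. The raw CSV headers can differ, and the later cells reference `' ERC20 most sent token type'` with a leading space. A minimal sketch of normalizing such headers, using a made-up toy frame (only the two token-type header names come from this notebook; the values are hypothetical):

```python
import pandas as pd

# Toy frame mimicking the leading-space headers seen in the raw CSV.
# The values are made up for illustration.
toy = pd.DataFrame({
    "FLAG": [0, 1],
    " ERC20 most sent token type": ["TokenA", "TokenB"],
    " ERC20_most_rec_token_type": ["TokenB", "TokenA"],
})

# Strip surrounding whitespace so columns can be referenced without the space.
toy.columns = toy.columns.str.strip()
print(toy.columns.tolist())
```

The notebook itself keeps the original headers, so its cells continue to use the names with the leading space.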





In [None]:
# Data handling
import pandas as pd
import numpy as np

# Pre-processing
from sklearn.preprocessing import RobustScaler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Plotting
import seaborn as sns
import matplotlib.pyplot as plt

# List the input files available in the Kaggle environment
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Load the dataset and work on a copy so the raw frame stays untouched
df_eth_fraude = pd.read_csv('../input/ethereum-frauddetection-dataset/transaction_dataset.csv')
df_eth = df_eth_fraude.copy()
print(df_eth.shape)
df_eth.head()
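
The cleaning step further down drops a column called `'Unnamed: 0'`, which is what pandas names an unlabeled first CSV column. As a sketch of where that column comes from, the snippet below reads a small in-memory CSV (hypothetical contents) with and without `index_col=0`:

```python
import io
import pandas as pd

# Hypothetical CSV whose first header cell is empty, as happens when a
# DataFrame is saved together with its index.
csv_text = ",Index,FLAG\n0,1,0\n1,2,1\n"

sin_indice = pd.read_csv(io.StringIO(csv_text))               # default behaviour
con_indice = pd.read_csv(io.StringIO(csv_text), index_col=0)  # first column as index

print(sin_indice.columns.tolist())
print(con_indice.columns.tolist())
```

This notebook keeps the default read and removes the column later, which works equally well.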

In [None]:
df_eth.info()

In [None]:
# Convert the "object" columns to categorical dtype
categoricas = df_eth.select_dtypes('O').columns
df_eth[categoricas] = df_eth[categoricas].astype('category')
df_eth[categoricas]

In [None]:
# Check whether the dataset is imbalanced
cantidad = df_eth['FLAG'].value_counts()

plt.pie(cantidad, labels=cantidad)
plt.title('Fraud vs. valid counts')
plt.legend(cantidad.keys().tolist())
plt.show()
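
Alongside the pie chart, the imbalance can be expressed as numbers. A minimal sketch on hypothetical labels standing in for `df_eth['FLAG']` (the values below are made up, not taken from the dataset):

```python
import pandas as pd

# Hypothetical labels standing in for df_eth['FLAG'] (1 = fraud, 0 = valid).
flag = pd.Series([0, 0, 0, 0, 0, 0, 0, 1, 1, 0], name="FLAG")

conteo = flag.value_counts()                     # absolute counts per class
proporcion = flag.value_counts(normalize=True)   # share of each class
ratio = conteo.max() / conteo.min()              # majority-to-minority ratio

print(proporcion)
print(f"imbalance ratio: {ratio:.1f}")
```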

In [None]:
# Correlation matrix (numeric columns only)
corr = df_eth.select_dtypes('number').corr()

# Mask the upper triangle so each pair is drawn once
mascara = np.triu(np.ones_like(corr, dtype=bool))
with sns.axes_style('white'):
    fig, ax = plt.subplots(figsize=(18,10))
    sns.heatmap(corr, mask=mascara, annot=False, cmap='CMRmap', center=0, square=True)
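
A heatmap this large is hard to read pair by pair. One possible complement is to list the strongly correlated pairs explicitly. The sketch below uses synthetic data, and the 0.9 cutoff is an arbitrary assumption:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=200),  # nearly a rescaled copy of "a"
    "c": rng.normal(size=200),                      # independent noise
})

corr = toy.corr().abs()
# Keep only the strict upper triangle so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Flatten to (column, column) pairs; masked NaN cells fail the threshold below.
pares = upper.stack()
pares = pares[pares > 0.9].sort_values(ascending=False)
print(pares)
```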

### Cleaning the dataset

In [None]:
# Show the rows that contain at least one missing value
df_eth[df_eth.isnull().any(axis=1)]

In [None]:
df_eth.isnull().sum()
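
Raw null counts are easier to judge as a share of the rows. A small sketch on a made-up frame, showing only the columns that actually have gaps:

```python
import numpy as np
import pandas as pd

# Made-up frame: "x" has 1 of 4 values missing, "y" has 2 of 4, "z" none.
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 4.0],
    "y": [np.nan, np.nan, 1.0, 2.0],
    "z": [1, 2, 3, 4],
})

# The mean of a boolean mask is the fraction of True values per column.
faltantes = toy.isnull().mean().mul(100)
faltantes = faltantes[faltantes > 0].sort_values(ascending=False)
print(faltantes)
```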

In [None]:
# Split the rows by class: FLAG == 1 is fraud, FLAG == 0 is valid
fraud = df_eth[df_eth['FLAG'] == 1]
valid = df_eth[df_eth['FLAG'] == 0]

In [None]:
# Candidate columns with missing values (positions 26 to 48).
# Columns whose mean is exactly 0 will be filled with 0.
Columnas_nulas = df_eth.iloc[:, 26:49]
Columnas_fill_cero = []
for nombre in Columnas_nulas:
    p = df_eth[nombre].mean()
    maxv = df_eth[nombre].max()
    minv = df_eth[nombre].min()
    if p == 0:
        Columnas_fill_cero.append(nombre)
        print('Column: {}\n ===> Mean: {}\n ===> Max value: {}\n ===> Min value: {}\n'.format(nombre, p, maxv, minv))

In [None]:
# Columns whose mean is not 0 will be filled with their mean
Columnas_fill_mean = []
for nombre in Columnas_nulas:
    p = df_eth[nombre].mean()
    maxv = df_eth[nombre].max()
    minv = df_eth[nombre].min()
    if p != 0:
        Columnas_fill_mean.append(nombre)
        print('Column: {}\n ===> Mean: {}\n ===> Max value: {}\n ===> Min value: {}\n'.format(nombre, p, maxv, minv))

In [None]:
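# Assign the result back instead of using inplace=True on a column selection,
# which recent pandas versions warn about and may not apply to df_eth
for col in Columnas_fill_cero:
    df_eth[col] = df_eth[col].fillna(0)

for col in Columnas_fill_mean:
    df_eth[col] = df_eth[col].fillna(df_eth[col].mean())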
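
The fill rule from the previous cells can be collected into one helper. This is only a sketch: the function name `rellenar_nulos` and the toy columns are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def rellenar_nulos(df, columnas):
    """Fill NaNs with 0 where the column mean is 0, otherwise with the column mean."""
    df = df.copy()  # leave the caller's frame untouched
    for col in columnas:
        media = df[col].mean()  # NaNs are skipped when computing the mean
        df[col] = df[col].fillna(0 if media == 0 else media)
    return df

# Made-up columns: one whose mean is 0 and one whose mean is 2.
toy = pd.DataFrame({
    "todo_cero": [0.0, np.nan, 0.0],
    "mixta": [1.0, np.nan, 3.0],
})
limpio = rellenar_nulos(toy, ["todo_cero", "mixta"])
print(limpio)
```

Since a mean of 0 fills with 0 in either branch, the rule gives the same result as plain mean imputation. The branch is kept here only to mirror the cells above.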

In [None]:
# Drop any rows that still have missing values, then remove the two index-like columns
df_eth = df_eth.dropna(axis=0, how='any')
df_eth = df_eth.drop(columns=['Unnamed: 0','Index'])
df_eth.isnull().sum()

In [None]:
# Encode the text columns as integer codes (the CSV headers carry a leading space)
df_eth['Address_enc'] = LabelEncoder().fit_transform(df_eth['Address'])
df_eth['ERC20 most sent token type_enc'] = LabelEncoder().fit_transform(df_eth[' ERC20 most sent token type'])
df_eth['ERC20_most_rec_token_type_enc'] = LabelEncoder().fit_transform(df_eth[' ERC20_most_rec_token_type'])

# Drop the original text columns now that the encoded versions exist
df_eth = df_eth.drop(columns=['Address',' ERC20 most sent token type',' ERC20_most_rec_token_type'])
df_eth
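
To illustrate what `LabelEncoder` produces, here is a small sketch on made-up token names. It keeps the fitted encoder in a variable so the codes can be mapped back, whereas the cell above discards its encoders:

```python
from sklearn.preprocessing import LabelEncoder

tokens = ["TokenB", "TokenA", "TokenB", "TokenC"]  # made-up token names

enc = LabelEncoder()
codigos = enc.fit_transform(tokens)  # codes follow the sorted order of the classes

print(codigos.tolist())
print(enc.classes_.tolist())
print(enc.inverse_transform(codigos).tolist())
```

The codes are arbitrary integers assigned in sorted order, so their numeric order carries no meaning about the tokens themselves.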

In [None]:
# Find the columns with zero variance
varianzas = df_eth.var()
no_var = varianzas == 0
print(varianzas[no_var])
print('\n')

# Drop the zero-variance columns, since they do not help the model.
df_eth = df_eth.drop(columns=varianzas[no_var].index)
print(df_eth.var())
print(df_eth.shape)
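
An alternative to the manual variance check is scikit-learn's `VarianceThreshold`. This is a swapped-in technique, not what the notebook uses. A minimal sketch on a made-up frame:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Made-up frame with one constant column and one varying column.
toy = pd.DataFrame({
    "constante": [5, 5, 5, 5],
    "variable": [1, 2, 3, 4],
})

# threshold=0.0 keeps only the features whose variance is strictly above zero.
selector = VarianceThreshold(threshold=0.0)
selector.fit(toy)
conservadas = toy.columns[selector.get_support()]
print(conservadas.tolist())
```

`VarianceThreshold` uses the population variance while `DataFrame.var()` defaults to the sample variance, but both agree on which columns have exactly zero variance.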

In [None]:
df_eth.to_csv(r'./transaction_dataset_procesado.csv', index = False)
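
The first cell imports `train_test_split` and `RobustScaler`, but the cells above do not use them yet. A possible next step, sketched on synthetic data (the feature names `f1` and `f2`, the 80/20 class split, and the 25% test size are all assumptions), is a stratified split followed by scaling fitted on the training part only:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(42)
# Synthetic stand-in for the processed frame: 80 valid rows, 20 fraud rows.
toy = pd.DataFrame({
    "FLAG": [0] * 80 + [1] * 20,
    "f1": rng.normal(size=100),
    "f2": rng.exponential(size=100),
})
X = toy.drop(columns="FLAG")
y = toy["FLAG"]

# stratify=y keeps the fraud share equal in the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the scaler on the training part only, then apply it to both parts.
scaler = RobustScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

print(y_train.mean(), y_test.mean())
```

`RobustScaler` centers on the median and scales by the interquartile range, which makes it less sensitive to the extreme values that are common in transaction amounts.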