# Business Intelligence Project with ETL Processes

## Practice 1
## Group 11:
  - Malave Yela Roberto
  - Silva Naranjo Bryan Patricio
  - Ricardo Peñafiel Miño

### The data was extracted from the following Kaggle source:
 https://www.kaggle.com/datasets/computingvictor/transactions-fraud-datasets?resource=download

#### The project uses two CSV files, one JSON file, and one SQL table that was originally a CSV file and was converted to SQL

### This dataset is useful for analysis because it makes it possible to:

- Explore patterns of financial behavior in legitimate users versus fraudulent users.

- Identify relationships between the transaction type and the probability of fraud.

- Analyze temporal trends (the times of day or the sequence of transactions in which fraud occurs).

- Build predictive models and evaluate the impact of their different statistical variables.

# Installing the Docker Container and the Database

### From the Docker Desktop console, run the following commands. They create the container that hosts a PostgreSQL database named db_grupo11

docker pull postgres

docker run --name cont_int_grupo11 -e POSTGRES_USER=admin -e POSTGRES_PASSWORD=adminpass -e POSTGRES_DB=db_grupo11 -p 5432:5432 -d postgres

# Installing the Virtual Environment Packages

In [1]:
#!pip install pandas python-dotenv sqlalchemy
#!pip install psycopg2

# Importing Dependencies


In [2]:
import os

import pandas as pd
from dotenv import load_dotenv
from sqlalchemy import create_engine

load_dotenv()  # load the variables defined in the local .env file

## Reading the Database Environment Variables

One variable is printed to confirm that the values load correctly

In [3]:
DB_USER=os.getenv('DB_USER')
DB_PASS=os.getenv('DB_PASS')
DB_NAME=os.getenv('DB_NAME')
DB_HOST=os.getenv('DB_HOST')

print(DB_HOST)

localhost
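
If the `.env` file is missing or a name is misspelled, `os.getenv` returns `None` without raising. The sketch below fails early instead. The helper `read_db_settings` is a hypothetical name, and the example values are taken from the `docker run` command above; neither is part of the original notebook.

```python
import os

REQUIRED_VARS = ("DB_USER", "DB_PASS", "DB_NAME", "DB_HOST")


def read_db_settings(env=os.environ):
    """Return the database settings as a dict, failing early if any is missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_VARS}


# Example with an explicit mapping instead of the real environment.
settings = read_db_settings({
    "DB_USER": "admin",
    "DB_PASS": "adminpass",
    "DB_NAME": "db_grupo11",
    "DB_HOST": "localhost",
})
print(settings["DB_HOST"])
```

In the notebook, calling `read_db_settings()` with no argument after `load_dotenv()` would read the real environment.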


## Before the next section, the connection to the PostgreSQL database is created from DataSpell

### Loading the Database
### The engine (the object that connects to the database) is created

In [4]:
engine=create_engine(f'postgresql+psycopg2://{DB_USER}:{DB_PASS}@{DB_HOST}/{DB_NAME}')
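
The f-string above works for simple credentials. If the password contained characters such as `@` or `/`, the URL would need escaping first. A minimal sketch, where `build_pg_url` is a hypothetical helper and the password is made up; port 5432 matches the `-p 5432:5432` mapping in the `docker run` command:

```python
from urllib.parse import quote_plus


def build_pg_url(user, password, host, database, port=5432):
    """Build a SQLAlchemy-style PostgreSQL URL with escaped credentials."""
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Made-up password containing characters that are special inside a URL.
url = build_pg_url("admin", "p@ss/word", "localhost", "db_grupo11")
print(url)
```

The resulting string could then be passed to `create_engine` in place of the f-string.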

# Generating the DataFrames
## Loading the JSON and CSV Data

In [5]:
df_fraud=pd.read_json('data/train_fraud_labels.json')
df_cards=pd.read_csv('data/cards_data.csv')
df_transactions=pd.read_csv('data/transactions_data.csv')
df_users=pd.read_csv('data/users_data.csv')

# Converting the Users Dataset from CSV to SQL
### The DataFrame is then written to a SQL table

In [6]:
df_users.to_sql('users', engine, if_exists='replace', index=False)

1000

### A SQL query is run to verify that the table was created

In [7]:
df_users_sql=pd.read_sql('select * from users', engine)
#df_users_sql
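
The integer returned by `to_sql` is a driver-reported row count and, according to the pandas documentation, may not match the number of rows written. A count query is a more direct check. The sketch below uses an in-memory SQLite database and made-up rows purely so it runs without a server; the notebook itself targets PostgreSQL.

```python
import sqlite3

import pandas as pd

# Made-up rows standing in for users_data.csv.
users = pd.DataFrame({"id": [825, 1746, 1718], "credit_score": [787, 701, 698]})

# SQLite is an assumption for this sketch; the notebook uses a SQLAlchemy engine.
conn = sqlite3.connect(":memory:")
users.to_sql("users", conn, if_exists="replace", index=False)

# Compare the number of rows in the table with the DataFrame length.
n_rows = int(pd.read_sql("select count(*) as n from users", conn)["n"].iloc[0])
print(n_rows == len(users))
conn.close()
```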

# Viewing and Filtering the DataFrames
## DataFrame 1



In [8]:
df_fraud.head(10)

Unnamed: 0,target
10649266,No
23410063,No
9316588,No
12478022,No
9558530,No
12532830,No
19526714,No
9906964,No
13224888,No
13749094,No


In [9]:

df_fraud.describe()

Unnamed: 0,target
count,8914963
unique,2
top,No
freq,8901631


## DataFrame 2

In [10]:
df_cards.head(10)

Unnamed: 0,id,client_id,card_brand,card_type,card_number,expires,cvv,has_chip,num_cards_issued,credit_limit,acct_open_date,year_pin_last_changed,card_on_dark_web
0,4524,825,Visa,Debit,4344676511950444,12/2022,623,YES,2,$24295,09/2002,2008,No
1,2731,825,Visa,Debit,4956965974959986,12/2020,393,YES,2,$21968,04/2014,2014,No
2,3701,825,Visa,Debit,4582313478255491,02/2024,719,YES,2,$46414,07/2003,2004,No
3,42,825,Visa,Credit,4879494103069057,08/2024,693,NO,1,$12400,01/2003,2012,No
4,4659,825,Mastercard,Debit (Prepaid),5722874738736011,03/2009,75,YES,1,$28,09/2008,2009,No
5,4537,1746,Visa,Credit,4404898874682993,09/2003,736,YES,1,$27500,09/2003,2012,No
6,1278,1746,Visa,Debit,4001482973848631,07/2022,972,YES,2,$28508,02/2011,2011,No
7,3687,1746,Mastercard,Debit,5627220683410948,06/2022,48,YES,2,$9022,07/2003,2015,No
8,3465,1746,Mastercard,Debit (Prepaid),5711382187309326,11/2020,722,YES,2,$54,06/2010,2015,No
9,3754,1746,Mastercard,Debit (Prepaid),5766121508358701,02/2023,908,YES,1,$99,07/2006,2012,No


In [11]:
df_cards.describe()

Unnamed: 0,id,client_id,card_number,cvv,num_cards_issued,year_pin_last_changed
count,6146.0,6146.0,6146.0,6146.0,6146.0,6146.0
mean,3072.5,994.939636,4820426000000000.0,506.220794,1.503091,2013.436707
std,1774.341709,578.614626,1328582000000000.0,289.431123,0.519191,4.270699
min,0.0,0.0,300105500000000.0,0.0,1.0,2002.0
25%,1536.25,492.25,4486365000000000.0,257.0,1.0,2010.0
50%,3072.5,992.0,5108957000000000.0,516.5,1.0,2013.0
75%,4608.75,1495.0,5585237000000000.0,756.0,2.0,2017.0
max,6145.0,1999.0,6997197000000000.0,999.0,3.0,2020.0


Filter showing how many cards each client has

In [12]:
df_cards.groupby("client_id")["id"].count()

client_id
0       4
1       3
2       5
3       4
4       5
       ..
1995    4
1996    3
1997    7
1998    3
1999    2
Name: id, Length: 2000, dtype: int64

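Sort by credit limit in descending order. Note that `credit_limit` is still text at this point (values such as `$9998`), so the order is lexicographic rather than numeric.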

In [13]:
df_cards.sort_values("credit_limit", ascending=False)

Unnamed: 0,id,client_id,card_brand,card_type,card_number,expires,cvv,has_chip,num_cards_issued,credit_limit,acct_open_date,year_pin_last_changed,card_on_dark_web
2773,1026,743,Visa,Debit,4251505296439839,11/2023,630,YES,1,$9998,02/2003,2010,No
694,476,1804,Mastercard,Debit,5979460179212685,10/2022,565,YES,1,$9984,01/2020,2020,No
903,487,1424,Mastercard,Debit,5004994096233324,03/2020,489,NO,1,$9957,01/2020,2020,No
6106,2285,97,Mastercard,Debit,5447193146031175,12/2023,290,YES,2,$9956,03/2011,2011,No
3001,3746,1475,Visa,Debit,4818828811526445,05/2024,311,YES,2,$9956,07/2005,2010,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1621,3443,846,Visa,Credit,4518067619451768,06/2020,429,YES,2,$0,06/2009,2011,No
478,5957,1975,Mastercard,Credit,5320022308833354,12/2021,92,YES,1,$0,12/2009,2010,No
4633,265,37,Discover,Credit,6845375674595536,02/2024,943,YES,1,$0,01/2011,2011,No
221,4318,668,Mastercard,Credit,5764603958082866,08/2021,397,YES,1,$0,08/2010,2010,No
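
A numeric ordering needs the `$` stripped first. A minimal sketch with made-up rows; the column name `credit_limit_num` is an assumption and not part of the notebook:

```python
import pandas as pd

# Made-up cards in the same "$..." text format as cards_data.csv.
cards = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "credit_limit": ["$9998", "$24295", "$0", "$12400"],
})

# Strip the dollar sign and parse the remainder as a number.
cards["credit_limit_num"] = pd.to_numeric(
    cards["credit_limit"].str.replace("$", "", regex=False)
)

# Sorting the text column compares characters; sorting the numeric one compares values.
by_text = cards.sort_values("credit_limit", ascending=False)["credit_limit"].tolist()
by_number = cards.sort_values("credit_limit_num", ascending=False)["credit_limit"].tolist()
print(by_text)
print(by_number)
```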


Filter showing the card types present in this DataFrame

In [14]:
tipo_tarjeta=df_cards['card_type'].value_counts()
tipo_tarjeta

card_type
Debit              3511
Credit             2057
Debit (Prepaid)     578
Name: count, dtype: int64

## DataFrame 3

In [15]:
df_transactions.head(10)

Unnamed: 0,id,date,client_id,card_id,amount,use_chip,merchant_id,merchant_city,merchant_state,zip,mcc,errors
0,7475327,2010-01-01 00:01:00,1556,2972,$-77.00,Swipe Transaction,59935,Beulah,ND,58523.0,5499,
1,7475328,2010-01-01 00:02:00,561,4575,$14.57,Swipe Transaction,67570,Bettendorf,IA,52722.0,5311,
2,7475329,2010-01-01 00:02:00,1129,102,$80.00,Swipe Transaction,27092,Vista,CA,92084.0,4829,
3,7475331,2010-01-01 00:05:00,430,2860,$200.00,Swipe Transaction,27092,Crown Point,IN,46307.0,4829,
4,7475332,2010-01-01 00:06:00,848,3915,$46.41,Swipe Transaction,13051,Harwood,MD,20776.0,5813,
5,7475333,2010-01-01 00:07:00,1807,165,$4.81,Swipe Transaction,20519,Bronx,NY,10464.0,5942,
6,7475334,2010-01-01 00:09:00,1556,2972,$77.00,Swipe Transaction,59935,Beulah,ND,58523.0,5499,
7,7475335,2010-01-01 00:14:00,1684,2140,$26.46,Online Transaction,39021,ONLINE,,,4784,
8,7475336,2010-01-01 00:21:00,335,5131,$261.58,Online Transaction,50292,ONLINE,,,7801,
9,7475337,2010-01-01 00:21:00,351,1112,$10.74,Swipe Transaction,3864,Flushing,NY,11355.0,5813,


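Drop the `errors` column, which is NaN in every row previewed above. Because `inplace=False`, the call returns a new DataFrame and leaves `df_transactions` unchanged, so `errors` remains available for the filters further down.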

In [16]:
df_transactions.drop("errors", axis=1, inplace=False)


Unnamed: 0,id,date,client_id,card_id,amount,use_chip,merchant_id,merchant_city,merchant_state,zip,mcc
0,7475327,2010-01-01 00:01:00,1556,2972,$-77.00,Swipe Transaction,59935,Beulah,ND,58523.0,5499
1,7475328,2010-01-01 00:02:00,561,4575,$14.57,Swipe Transaction,67570,Bettendorf,IA,52722.0,5311
2,7475329,2010-01-01 00:02:00,1129,102,$80.00,Swipe Transaction,27092,Vista,CA,92084.0,4829
3,7475331,2010-01-01 00:05:00,430,2860,$200.00,Swipe Transaction,27092,Crown Point,IN,46307.0,4829
4,7475332,2010-01-01 00:06:00,848,3915,$46.41,Swipe Transaction,13051,Harwood,MD,20776.0,5813
...,...,...,...,...,...,...,...,...,...,...,...
13305910,23761868,2019-10-31 23:56:00,1718,2379,$1.11,Chip Transaction,86438,West Covina,CA,91792.0,5499
13305911,23761869,2019-10-31 23:56:00,1766,2066,$12.80,Online Transaction,39261,ONLINE,,,5815
13305912,23761870,2019-10-31 23:57:00,199,1031,$40.44,Swipe Transaction,2925,Allen,TX,75002.0,4900
13305913,23761873,2019-10-31 23:58:00,1986,5443,$4.00,Chip Transaction,46284,Daly City,CA,94014.0,5411


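Find the largest and smallest transaction amounts. Because `amount` is still text here, `max()` and `min()` compare strings character by character, which explains the result below.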

In [17]:
maximo = df_transactions["amount"].max()
minimo = df_transactions["amount"].min()
print("Mayor:", maximo, "Menor:", minimo)


Mayor: $999.97 Menor: $-0.00
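
To illustrate the difference between text and numeric comparison, here is a small sketch with made-up amounts, using `Decimal` from the standard library:

```python
from decimal import Decimal

# Made-up amounts in the same "$..." text format as the dataset.
amounts = ["$999.97", "$1500.00", "$-500.00", "$-0.00"]

# Text comparison goes character by character, so "$9..." beats "$1...".
text_max, text_min = max(amounts), min(amounts)

# Numeric comparison after stripping the dollar sign.
values = [Decimal(a.replace("$", "")) for a in amounts]
num_max, num_min = max(values), min(values)

print(text_max, text_min)
print(num_max, num_min)
```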


Filter for all online transactions

In [18]:
sinchip=df_transactions[df_transactions['use_chip']=='Online Transaction']
sinchip

Unnamed: 0,id,date,client_id,card_id,amount,use_chip,merchant_id,merchant_city,merchant_state,zip,mcc,errors
7,7475335,2010-01-01 00:14:00,1684,2140,$26.46,Online Transaction,39021,ONLINE,,,4784,
8,7475336,2010-01-01 00:21:00,335,5131,$261.58,Online Transaction,50292,ONLINE,,,7801,
18,7475346,2010-01-01 00:34:00,394,4717,$26.04,Online Transaction,39021,ONLINE,,,4784,
24,7475353,2010-01-01 00:43:00,301,3742,$10.17,Online Transaction,39021,ONLINE,,,4784,
26,7475356,2010-01-01 00:45:00,566,3439,$16.86,Online Transaction,16798,ONLINE,,,4121,
...,...,...,...,...,...,...,...,...,...,...,...,...
13305879,23761832,2019-10-31 23:22:00,1556,2972,$17.65,Online Transaction,88459,ONLINE,,,5311,
13305880,23761833,2019-10-31 23:22:00,1797,5660,$34.81,Online Transaction,15143,ONLINE,,,4784,
13305888,23761843,2019-10-31 23:33:00,1069,5167,$59.71,Online Transaction,39021,ONLINE,,,4784,
13305897,23761853,2019-10-31 23:39:00,1422,5696,$694.30,Online Transaction,70268,ONLINE,,,4722,


Filter for transactions that failed with an insufficient balance error

In [19]:
errores=df_transactions[df_transactions['errors']=='Insufficient Balance']
errores

Unnamed: 0,id,date,client_id,card_id,amount,use_chip,merchant_id,merchant_city,merchant_state,zip,mcc,errors
401,7475792,2010-01-01 07:02:00,1424,4710,$-72.00,Swipe Transaction,59935,Kingman,AZ,86401.0,5499,Insufficient Balance
483,7475881,2010-01-01 07:22:00,843,184,$37.54,Swipe Transaction,89462,Terre Haute,IN,47805.0,5300,Insufficient Balance
484,7475882,2010-01-01 07:22:00,1424,4710,$72.00,Swipe Transaction,59935,Kingman,AZ,86401.0,5499,Insufficient Balance
524,7475935,2010-01-01 07:37:00,319,248,$104.81,Swipe Transaction,9263,Fresno,CA,93727.0,5912,Insufficient Balance
577,7476004,2010-01-01 07:51:00,1190,5358,$90.10,Online Transaction,38958,ONLINE,,,7801,Insufficient Balance
...,...,...,...,...,...,...,...,...,...,...,...,...
13305329,23761138,2019-10-31 18:37:00,1727,5329,$101.82,Chip Transaction,18215,Columbia,SC,29229.0,5719,Insufficient Balance
13305367,23761185,2019-10-31 18:52:00,1383,4949,$161.35,Chip Transaction,83434,Somerville,MA,2143.0,7538,Insufficient Balance
13305757,23761675,2019-10-31 21:57:00,87,3607,$20.00,Chip Transaction,27092,Leander,TX,78641.0,4829,Insufficient Balance
13305803,23761735,2019-10-31 22:23:00,1851,3164,$166.38,Online Transaction,32480,ONLINE,,,4899,Insufficient Balance


## DataFrame 4

In [20]:
df_users.head(10)

Unnamed: 0,id,current_age,retirement_age,birth_year,birth_month,gender,address,latitude,longitude,per_capita_income,yearly_income,total_debt,credit_score,num_credit_cards
0,825,53,66,1966,11,Female,462 Rose Lane,34.15,-117.76,$29278,$59696,$127613,787,5
1,1746,53,68,1966,12,Female,3606 Federal Boulevard,40.76,-73.74,$37891,$77254,$191349,701,5
2,1718,81,67,1938,11,Female,766 Third Drive,34.02,-117.89,$22681,$33483,$196,698,5
3,708,63,63,1957,1,Female,3 Madison Street,40.71,-73.99,$163145,$249925,$202328,722,4
4,1164,43,70,1976,9,Male,9620 Valley Stream Drive,37.76,-122.44,$53797,$109687,$183855,675,1
5,68,42,70,1977,10,Male,58 Birch Lane,41.55,-90.6,$20599,$41997,$0,704,3
6,1075,36,67,1983,12,Female,5695 Fifth Street,38.22,-85.74,$25258,$51500,$102286,672,3
7,1711,26,67,1993,12,Male,1941 Ninth Street,45.51,-122.64,$26790,$54623,$114711,728,1
8,1116,81,66,1938,7,Female,11 Spruce Avenue,40.32,-75.32,$26273,$42509,$2895,755,5
9,1752,34,60,1986,1,Female,887 Grant Street,29.97,-92.12,$18730,$38190,$81262,810,1


Filter counting how many users are male and how many are female

In [21]:
df_users["gender"].value_counts()

gender
Female    1016
Male       984
Name: count, dtype: int64

Filter for the user with the highest credit score

In [22]:
df_users.loc[df_users["credit_score"].idxmax()]

id                                1884
current_age                         18
retirement_age                      64
birth_year                        2001
birth_month                          5
gender                            Male
address              660 Seventh Drive
latitude                         39.98
longitude                       -82.98
per_capita_income               $28092
yearly_income                   $57281
total_debt                      $89114
credit_score                       850
num_credit_cards                     1
Name: 30, dtype: object

In [23]:
maximo = df_users["credit_score"].max()
minimo = df_users["credit_score"].min()

print("Mayor puntaje:", maximo)
print("Menor puntaje:", minimo)

Mayor puntaje: 850
Menor puntaje: 480


Filter for users older than 30

In [24]:
mayores30=df_users[df_users["current_age"]>30]
mayores30

Unnamed: 0,id,current_age,retirement_age,birth_year,birth_month,gender,address,latitude,longitude,per_capita_income,yearly_income,total_debt,credit_score,num_credit_cards
0,825,53,66,1966,11,Female,462 Rose Lane,34.15,-117.76,$29278,$59696,$127613,787,5
1,1746,53,68,1966,12,Female,3606 Federal Boulevard,40.76,-73.74,$37891,$77254,$191349,701,5
2,1718,81,67,1938,11,Female,766 Third Drive,34.02,-117.89,$22681,$33483,$196,698,5
3,708,63,63,1957,1,Female,3 Madison Street,40.71,-73.99,$163145,$249925,$202328,722,4
4,1164,43,70,1976,9,Male,9620 Valley Stream Drive,37.76,-122.44,$53797,$109687,$183855,675,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1993,391,85,66,1934,7,Female,31 Hill Boulevard,33.69,-78.89,$19025,$35270,$1769,731,6
1995,986,32,70,1987,7,Male,6577 Lexington Lane,40.65,-73.58,$23550,$48010,$87837,703,3
1996,1944,62,65,1957,11,Female,2 Elm Drive,38.95,-84.54,$24218,$49378,$104480,740,4
1997,185,47,67,1973,1,Female,276 Fifth Boulevard,40.66,-74.19,$15175,$30942,$71066,779,3


# TASK 2

DATA TREATMENT FOR THE PROJECT DATAFRAMES

## TREATMENT OF THE USERS_DATA.CSV TABLE

We make a copy of the DataFrame so the information can be recovered if it is lost or damaged

In [29]:
df_users

Unnamed: 0,id,current_age,retirement_age,birth_year,birth_month,gender,address,latitude,longitude,per_capita_income,yearly_income,total_debt,credit_score,num_credit_cards
0,825,53,66,1966,11,Female,462 Rose Lane,34.15,-117.76,$29278,$59696,$127613,787,5
1,1746,53,68,1966,12,Female,3606 Federal Boulevard,40.76,-73.74,$37891,$77254,$191349,701,5
2,1718,81,67,1938,11,Female,766 Third Drive,34.02,-117.89,$22681,$33483,$196,698,5
3,708,63,63,1957,1,Female,3 Madison Street,40.71,-73.99,$163145,$249925,$202328,722,4
4,1164,43,70,1976,9,Male,9620 Valley Stream Drive,37.76,-122.44,$53797,$109687,$183855,675,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,986,32,70,1987,7,Male,6577 Lexington Lane,40.65,-73.58,$23550,$48010,$87837,703,3
1996,1944,62,65,1957,11,Female,2 Elm Drive,38.95,-84.54,$24218,$49378,$104480,740,4
1997,185,47,67,1973,1,Female,276 Fifth Boulevard,40.66,-74.19,$15175,$30942,$71066,779,3
1998,1007,66,60,1954,2,Male,259 Valley Boulevard,40.24,-76.92,$25336,$54654,$27241,618,1


In [67]:
df_users_copy = df_users.copy()

In [30]:
df_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 2000 non-null   int64  
 1   current_age        2000 non-null   int64  
 2   retirement_age     2000 non-null   int64  
 3   birth_year         2000 non-null   int64  
 4   birth_month        2000 non-null   int64  
 5   gender             2000 non-null   object 
 6   address            2000 non-null   object 
 7   latitude           2000 non-null   float64
 8   longitude          2000 non-null   float64
 9   per_capita_income  2000 non-null   object 
 10  yearly_income      2000 non-null   object 
 11  total_debt         2000 non-null   object 
 12  credit_score       2000 non-null   int64  
 13  num_credit_cards   2000 non-null   int64  
dtypes: float64(2), int64(7), object(5)
memory usage: 218.9+ KB


We check whether any column has missing values

In [31]:
print(df_users.isna().sum())

id                   0
current_age          0
retirement_age       0
birth_year           0
birth_month          0
gender               0
address              0
latitude             0
longitude            0
per_capita_income    0
yearly_income        0
total_debt           0
credit_score         0
num_credit_cards     0
dtype: int64


## Next we normalize the data types so the information is easier to work with, converting columns to integers, floats, strings, and so on.

Conversion to integers

In [32]:
enteros = ['id','current_age','retirement_age','birth_year','birth_month','credit_score','num_credit_cards']
for i in enteros:
    df_users[i] = pd.to_numeric(df_users[i], errors='coerce').astype('Int64')

In [34]:
df_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 2000 non-null   Int64  
 1   current_age        2000 non-null   Int64  
 2   retirement_age     2000 non-null   Int64  
 3   birth_year         2000 non-null   Int64  
 4   birth_month        2000 non-null   Int64  
 5   gender             2000 non-null   object 
 6   address            2000 non-null   object 
 7   latitude           2000 non-null   float64
 8   longitude          2000 non-null   float64
 9   per_capita_income  2000 non-null   object 
 10  yearly_income      2000 non-null   object 
 11  total_debt         2000 non-null   object 
 12  credit_score       2000 non-null   Int64  
 13  num_credit_cards   2000 non-null   Int64  
dtypes: Int64(7), float64(2), object(5)
memory usage: 232.6+ KB
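
The capitalized `Int64` is the nullable integer dtype in pandas. It matters when `errors='coerce'` produces a gap, because a plain `int64` column cannot hold missing values. A minimal sketch with a made-up column (the value `"unknown"` is invented for illustration):

```python
import pandas as pd

# Made-up column with one value that cannot be parsed.
raw = pd.Series(["53", "unknown", "81"])

# errors='coerce' turns the bad value into NaN (float64 at this point).
parsed = pd.to_numeric(raw, errors="coerce")

# The nullable 'Int64' dtype keeps integers and represents the gap as <NA>.
ages = parsed.astype("Int64")
print(ages.dtype, ages.isna().tolist())
```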


Conversion to floats

In [33]:
flotantes = ['latitude','longitude']
for i in flotantes:
    df_users[i] = pd.to_numeric(df_users[i], errors='coerce')

In [35]:
df_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 2000 non-null   Int64  
 1   current_age        2000 non-null   Int64  
 2   retirement_age     2000 non-null   Int64  
 3   birth_year         2000 non-null   Int64  
 4   birth_month        2000 non-null   Int64  
 5   gender             2000 non-null   object 
 6   address            2000 non-null   object 
 7   latitude           2000 non-null   float64
 8   longitude          2000 non-null   float64
 9   per_capita_income  2000 non-null   object 
 10  yearly_income      2000 non-null   object 
 11  total_debt         2000 non-null   object 
 12  credit_score       2000 non-null   Int64  
 13  num_credit_cards   2000 non-null   Int64  
dtypes: Int64(7), float64(2), object(5)
memory usage: 232.6+ KB


Conversion of the money fields: the dollar sign is removed and the values are parsed as numbers

In [36]:
money_cols = ['per_capita_income','yearly_income','total_debt']
for c in money_cols:
    df_users[c] = (df_users[c]
                   .astype(str)
                   .str.replace(r'[^\d\.\-]', '', regex=True)
                  )
    df_users[c] = pd.to_numeric(df_users[c], errors='coerce')
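
The same cleanup can also be written without an explicit loop. This is only a sketch of an alternative, using two made-up rows and the same regular expression idea (keep digits, dots, and minus signs):

```python
import pandas as pd

# Two made-up rows in the same format as users_data.csv.
users = pd.DataFrame({
    "yearly_income": ["$59696", "$77254"],
    "total_debt": ["$127613", "$191349"],
})
money_cols = ["yearly_income", "total_debt"]

# Strip everything except digits, dots and minus signs in all money columns.
cleaned = users[money_cols].replace(r"[^\d.\-]", "", regex=True)

# Parse each cleaned column as a number; unparseable values would become NaN.
users[money_cols] = cleaned.apply(pd.to_numeric, errors="coerce")
print(users)
```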

Conversion to strings

In [37]:
df_users['gender'] = df_users['gender'].astype('string')
df_users['address'] = df_users['address'].astype('string')

Preview of the DataFrame after the conversions

In [38]:
df_users.head()

Unnamed: 0,id,current_age,retirement_age,birth_year,birth_month,gender,address,latitude,longitude,per_capita_income,yearly_income,total_debt,credit_score,num_credit_cards
0,825,53,66,1966,11,Female,462 Rose Lane,34.15,-117.76,29278,59696,127613,787,5
1,1746,53,68,1966,12,Female,3606 Federal Boulevard,40.76,-73.74,37891,77254,191349,701,5
2,1718,81,67,1938,11,Female,766 Third Drive,34.02,-117.89,22681,33483,196,698,5
3,708,63,63,1957,1,Female,3 Madison Street,40.71,-73.99,163145,249925,202328,722,4
4,1164,43,70,1976,9,Male,9620 Valley Stream Drive,37.76,-122.44,53797,109687,183855,675,1


### Next we check for exact duplicate rows and for repeated ids

In [39]:
print('Duplicados exactos:', df_users.duplicated().sum())

duplicados_id = df_users[df_users.duplicated(subset=['id'], keep=False)].sort_values('id')
print('Ejemplos duplicados por id:')
duplicados_id.head()

Duplicados exactos: 0
Ejemplos duplicados por id:


Unnamed: 0,id,current_age,retirement_age,birth_year,birth_month,gender,address,latitude,longitude,per_capita_income,yearly_income,total_debt,credit_score,num_credit_cards


## Checking for null values

In [40]:
df_users.isnull().sum()

id                   0
current_age          0
retirement_age       0
birth_year           0
birth_month          0
gender               0
address              0
latitude             0
longitude            0
per_capita_income    0
yearly_income        0
total_debt           0
credit_score         0
num_credit_cards     0
dtype: int64

In [41]:
df_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 2000 non-null   Int64  
 1   current_age        2000 non-null   Int64  
 2   retirement_age     2000 non-null   Int64  
 3   birth_year         2000 non-null   Int64  
 4   birth_month        2000 non-null   Int64  
 5   gender             2000 non-null   string 
 6   address            2000 non-null   string 
 7   latitude           2000 non-null   float64
 8   longitude          2000 non-null   float64
 9   per_capita_income  2000 non-null   int64  
 10  yearly_income      2000 non-null   int64  
 11  total_debt         2000 non-null   int64  
 12  credit_score       2000 non-null   Int64  
 13  num_credit_cards   2000 non-null   Int64  
dtypes: Int64(7), float64(2), int64(3), string(2)
memory usage: 232.6 KB


## Applying a business rule with a lambda: using the credit score, total debt, and yearly income, a new column `riesgo` marks each user's risk as high (`alto`) when the credit score is below 500 or the total debt exceeds the yearly income, and as low (`bajo`) otherwise

In [42]:
df_users['riesgo'] = df_users.apply(lambda r: 'alto' if (r['credit_score'] < 500 or r['total_debt'] > (r['yearly_income'] if pd.notna(r['yearly_income']) else 0)) else 'bajo',axis=1)
df_users

Unnamed: 0,id,current_age,retirement_age,birth_year,birth_month,gender,address,latitude,longitude,per_capita_income,yearly_income,total_debt,credit_score,num_credit_cards,riesgo
0,825,53,66,1966,11,Female,462 Rose Lane,34.15,-117.76,29278,59696,127613,787,5,alto
1,1746,53,68,1966,12,Female,3606 Federal Boulevard,40.76,-73.74,37891,77254,191349,701,5,alto
2,1718,81,67,1938,11,Female,766 Third Drive,34.02,-117.89,22681,33483,196,698,5,bajo
3,708,63,63,1957,1,Female,3 Madison Street,40.71,-73.99,163145,249925,202328,722,4,bajo
4,1164,43,70,1976,9,Male,9620 Valley Stream Drive,37.76,-122.44,53797,109687,183855,675,1,alto
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,986,32,70,1987,7,Male,6577 Lexington Lane,40.65,-73.58,23550,48010,87837,703,3,alto
1996,1944,62,65,1957,11,Female,2 Elm Drive,38.95,-84.54,24218,49378,104480,740,4,alto
1997,185,47,67,1973,1,Female,276 Fifth Boulevard,40.66,-74.19,15175,30942,71066,779,3,alto
1998,1007,66,60,1954,2,Male,259 Valley Boulevard,40.24,-76.92,25336,54654,27241,618,1,bajo
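
The same rule can be expressed without a row-wise `apply`, which tends to be faster on large frames. A sketch under the same thresholds, using `numpy.where`; the first two rows copy values shown above and the third is made up to trigger the credit score condition:

```python
import numpy as np
import pandas as pd

users = pd.DataFrame({
    "credit_score": [787, 698, 480],
    "yearly_income": [59696, 33483, 50000],
    "total_debt": [127613, 196, 0],
})

# High risk when the score is below 500 or the debt exceeds yearly income
# (a missing income is treated as 0, as in the lambda above).
is_high = (users["credit_score"] < 500) | (
    users["total_debt"] > users["yearly_income"].fillna(0)
)
users["riesgo"] = np.where(is_high, "alto", "bajo")
print(users["riesgo"].tolist())
```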


We convert the new `riesgo` column to the string type

In [43]:
df_users['riesgo'] = df_users['riesgo'].astype('string')

# Loading mcc_codes.json (data dictionary)

In [49]:
import json
with open('data/mcc_codes.json', 'r') as f:
    mcc_dict = json.load(f)

In [50]:
print(type(mcc_dict))
print(list(mcc_dict.items())[:5])

<class 'dict'>
[('5812', 'Eating Places and Restaurants'), ('5541', 'Service Stations'), ('7996', 'Amusement Parks, Carnivals, Circuses'), ('5411', 'Grocery Stores, Supermarkets'), ('4784', 'Tolls and Bridge Fees')]


In [51]:

df_mcc = pd.DataFrame(list(mcc_dict.items()), columns=['mcc', 'mcc_description'])
df_mcc['mcc'] = df_mcc['mcc'].astype(int)
df_mcc.head()

Unnamed: 0,mcc,mcc_description
0,5812,Eating Places and Restaurants
1,5541,Service Stations
2,7996,"Amusement Parks, Carnivals, Circuses"
3,5411,"Grocery Stores, Supermarkets"
4,4784,Tolls and Bridge Fees


# MERGE of the MCC DataFrame with transactions_data

In [58]:
df_mcc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 109 entries, 0 to 108
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   mcc              109 non-null    int64 
 1   mcc_description  109 non-null    object
dtypes: int64(1), object(1)
memory usage: 1.8+ KB


In [59]:
df_transactions = df_transactions.merge(df_mcc, on='mcc', how='left')

In [60]:
df_transactions.head()

Unnamed: 0,id_trans,date,client_id,card_id,amount,use_chip,merchant_id,merchant_city,merchant_state,zip,...,address,latitude,longitude,per_capita_income,yearly_income,total_debt,credit_score,num_credit_cards,riesgo,mcc_description_y
0,7475327,2010-01-01 00:01:00,1556,2972,$-77.00,Swipe Transaction,59935,Beulah,ND,58523.0,...,594 Mountain View Street,46.8,-100.76,23679,48277,110153,740,4,alto,Miscellaneous Food Stores
1,7475328,2010-01-01 00:02:00,561,4575,$14.57,Swipe Transaction,67570,Bettendorf,IA,52722.0,...,604 Pine Street,40.8,-91.12,18076,36853,112139,834,5,alto,Department Stores
2,7475329,2010-01-01 00:02:00,1129,102,$80.00,Swipe Transaction,27092,Vista,CA,92084.0,...,2379 Forest Lane,33.18,-117.29,16894,34449,36540,686,3,alto,Money Transfer
3,7475331,2010-01-01 00:05:00,430,2860,$200.00,Swipe Transaction,27092,Crown Point,IN,46307.0,...,903 Hill Boulevard,41.42,-87.35,26168,53350,128676,685,5,alto,Money Transfer
4,7475332,2010-01-01 00:06:00,848,3915,$46.41,Swipe Transaction,13051,Harwood,MD,20776.0,...,166 River Drive,38.86,-76.6,33529,68362,96182,711,2,alto,Drinking Places (Alcoholic Beverages)
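
The preview above lists user columns and `mcc_description_y`, which suggests the merge cells had already been run earlier in the same session. One way to make the lookup safe to re-run, sketched here with made-up rows, is `Series.map` against the dictionary: it overwrites a single column instead of appending suffixed copies. The code 9999 is invented to show an unmatched value.

```python
import pandas as pd

# Made-up excerpt of mcc_codes.json and of the transactions table.
mcc_dict = {"5812": "Eating Places and Restaurants", "4784": "Tolls and Bridge Fees"}
tx = pd.DataFrame({"id": [1, 2, 3], "mcc": [4784, 5812, 9999]})

# JSON keys are strings, while the mcc column holds integers.
lookup = {int(code): description for code, description in mcc_dict.items()}

# Running the assignment twice still leaves a single mcc_description column.
for _ in range(2):
    tx["mcc_description"] = tx["mcc"].map(lookup)

print(tx)
```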


# MERGE of the client id from users_data with the client_id of transactions_data

After the merge, each transaction carries the client's information, such as age, gender, income, and credit score.

In [61]:
df_transactions = df_transactions.merge(df_users,left_on='client_id',right_on='id',how='left',suffixes=('_trans', '_user'))

In [62]:
df_transactions.head()

Unnamed: 0,id_trans,date,client_id,card_id,amount,use_chip,merchant_id,merchant_city,merchant_state,zip,...,gender_user,address_user,latitude_user,longitude_user,per_capita_income_user,yearly_income_user,total_debt_user,credit_score_user,num_credit_cards_user,riesgo_user
0,7475327,2010-01-01 00:01:00,1556,2972,$-77.00,Swipe Transaction,59935,Beulah,ND,58523.0,...,Female,594 Mountain View Street,46.8,-100.76,23679,48277,110153,740,4,alto
1,7475328,2010-01-01 00:02:00,561,4575,$14.57,Swipe Transaction,67570,Bettendorf,IA,52722.0,...,Male,604 Pine Street,40.8,-91.12,18076,36853,112139,834,5,alto
2,7475329,2010-01-01 00:02:00,1129,102,$80.00,Swipe Transaction,27092,Vista,CA,92084.0,...,Male,2379 Forest Lane,33.18,-117.29,16894,34449,36540,686,3,alto
3,7475331,2010-01-01 00:05:00,430,2860,$200.00,Swipe Transaction,27092,Crown Point,IN,46307.0,...,Female,903 Hill Boulevard,41.42,-87.35,26168,53350,128676,685,5,alto
4,7475332,2010-01-01 00:06:00,848,3915,$46.41,Swipe Transaction,13051,Harwood,MD,20776.0,...,Male,166 River Drive,38.86,-76.6,33529,68362,96182,711,2,alto
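
When joining transactions to users, it can help to assert that each `client_id` matches at most one user and to see which rows found no match. A sketch with made-up rows, using the `validate` and `indicator` options of `DataFrame.merge`; the client id 5000 is invented to produce an unmatched row:

```python
import pandas as pd

# Made-up transactions and users with the notebook's key columns.
tx = pd.DataFrame({"id": [10, 11, 12], "client_id": [825, 825, 5000]})
users = pd.DataFrame({"id": [825, 1746], "gender": ["Female", "Female"]})

merged = tx.merge(
    users,
    left_on="client_id",
    right_on="id",
    how="left",
    suffixes=("_trans", "_user"),
    validate="many_to_one",  # raises if a user id appears more than once
    indicator=True,          # adds a _merge column describing each row's origin
)

# Transactions whose client_id has no matching user.
unmatched = merged[merged["_merge"] == "left_only"]
print(list(merged.columns))
print(len(unmatched))
```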


### Adding a sequential index starting at 1 to df_users

We check the DataFrame's information and add an extra column so the original index is not altered

In [64]:
df_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 2000 non-null   Int64  
 1   current_age        2000 non-null   Int64  
 2   retirement_age     2000 non-null   Int64  
 3   birth_year         2000 non-null   Int64  
 4   birth_month        2000 non-null   Int64  
 5   gender             2000 non-null   string 
 6   address            2000 non-null   string 
 7   latitude           2000 non-null   float64
 8   longitude          2000 non-null   float64
 9   per_capita_income  2000 non-null   int64  
 10  yearly_income      2000 non-null   int64  
 11  total_debt         2000 non-null   int64  
 12  credit_score       2000 non-null   Int64  
 13  num_credit_cards   2000 non-null   Int64  
 14  riesgo             2000 non-null   string 
dtypes: Int64(7), float64(2), int64(3), string(3)
memory usage: 248.2 KB


In [65]:
df_users['id_secuencial'] = range(1, len(df_users) + 1)
df_users.head()

Unnamed: 0,id,current_age,retirement_age,birth_year,birth_month,gender,address,latitude,longitude,per_capita_income,yearly_income,total_debt,credit_score,num_credit_cards,riesgo,id_secuencial
0,825,53,66,1966,11,Female,462 Rose Lane,34.15,-117.76,29278,59696,127613,787,5,alto,1
1,1746,53,68,1966,12,Female,3606 Federal Boulevard,40.76,-73.74,37891,77254,191349,701,5,alto,2
2,1718,81,67,1938,11,Female,766 Third Drive,34.02,-117.89,22681,33483,196,698,5,bajo,3
3,708,63,63,1957,1,Female,3 Madison Street,40.71,-73.99,163145,249925,202328,722,4,bajo,4
4,1164,43,70,1976,9,Male,9620 Valley Stream Drive,37.76,-122.44,53797,109687,183855,675,1,alto,5
