## About the dataset

The BAF suite comprises six datasets generated from a real-world dataset for fraud detection in online bank account opening. This is a relevant application for Fair ML, because the model's predictions translate into granting or denying financial services to individuals, which can aggravate existing social inequalities. For example, systematically denying access to credit to people from one group can perpetuate or even widen existing wealth inequalities. Each variant of the dataset has predetermined, controlled types of data bias across multiple time steps. These variants, combined with the temporal distribution shifts inherent to the underlying data distribution, provide a novel means of stress-testing the performance and fairness of ML models intended to operate in dynamic environments.

The datasets in the suite were generated using state-of-the-art generative adversarial network (GAN) models. An important reason for choosing these methods was to account for the privacy of applicants, a growing concern in today's social and legislative landscape.

Each dataset consists of one million instances of individual applications, with thirty features in total. These features represent observed properties of the applications, either obtained directly from the applicant (for example, employment status) or derived from the information provided (for example, whether the phone number provided is valid), as well as aggregations of the data (for example, the frequency of applications in a given zip code). The data span eight months of applications, which can be identified through the "month" column. As for protected attributes, the dataset provides the applicant's age, personal income, and employment status. To provide some degree of differential privacy, we added noise to the instances of the original dataset and categorized the personal-information columns, such as income and age, before training the GAN model.

## Variables

• income (numeric): Annual income of the applicant (in decile form). Ranges between [0.1, 0.9].

• name_email_similarity (numeric): Metric of similarity between email and applicant’s name. Higher values represent higher similarity. Ranges between [0, 1].

• prev_address_months_count (numeric): Number of months in previous registered address of the applicant, i.e. the applicant’s previous residence, if applicable. Ranges between [−1, 380] months (-1 is a missing value).

• current_address_months_count (numeric): Months in currently registered address of the applicant. Ranges between [−1, 429] months (-1 is a missing value).

• customer_age (numeric): Applicant’s age in years, rounded to the decade. Ranges between [10, 90] years.

• days_since_request (numeric): Number of days passed since the application was made. Ranges between [0, 79] days.

• intended_balcon_amount (numeric): Initial transferred amount for application. Ranges between [−16, 114] (negatives are missing values).

• payment_type (categorical): Credit payment plan type. 5 possible (anonymized) values.

• zip_count_4w (numeric): Number of applications within same zip code in last 4 weeks. Ranges between [1, 6830].

• velocity_6h (numeric): Velocity of total applications made in last 6 hours i.e., average number of applications per hour in the last 6 hours. Ranges between [−175, 16818].

• velocity_24h (numeric): Velocity of total applications made in last 24 hours, i.e., average number of applications per hour in the last 24 hours. Ranges between [1297, 9586].

• velocity_4w (numeric): Velocity of total applications made in last 4 weeks, i.e., average number of applications per hour in the last 4 weeks. Ranges between [2825, 7020].

• bank_branch_count_8w (numeric): Number of total applications in the selected bank branch in last 8 weeks. Ranges between [0, 2404].

• date_of_birth_distinct_emails_4w (numeric): Number of emails for applicants with same date of birth in last 4 weeks. Ranges between [0, 39].

• employment_status (categorical): Employment status of the applicant. 7 possible (anonymized) values.

• credit_risk_score (numeric): Internal score of application risk. Ranges between [−191, 389].

• email_is_free (binary): Domain of application email (either free or paid).

• housing_status (categorical): Current residential status for applicant. 7 possible (anonymized) values.

• phone_home_valid (binary): Validity of provided home phone.

• phone_mobile_valid (binary): Validity of provided mobile phone.

• bank_months_count (numeric): Age of the previous account (if held) in months. Ranges between [−1, 32] months (-1 is a missing value).

• has_other_cards (binary): If applicant has other cards from the same banking company.

• proposed_credit_limit (numeric): Applicant’s proposed credit limit. Ranges between [200, 2000].

• foreign_request (binary): If origin country of request is different from bank’s country.

• source (categorical): Online source of application. Either browser (INTERNET) or app (TELEAPP).

• session_length_in_minutes (numeric): Length of user session in banking website in minutes. Ranges between [−1, 107] minutes (-1 is a missing value).

• device_os (categorical): Operating system of the device that made the request. Possible values are: Windows, macOS, Linux, X11, or other.

• keep_alive_session (binary): User option on session logout.

• device_distinct_emails (numeric): Number of distinct emails in banking website from the used device in last 8 weeks. Ranges between [−1, 2] emails (-1 is a missing value).

• device_fraud_count (numeric): Number of fraudulent applications with used device. Ranges between [0, 1].

• month (numeric): Month in which the application was made. Ranges between [0, 7].

• fraud_bool (binary): If the application is fraudulent or not.
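Several of the variables above encode missing values with a sentinel (-1, or any negative value for intended_balcon_amount). The following is a minimal sketch of how those sentinels could be turned into NaN before analysis. The helper name `mark_missing` and the toy data frame are hypothetical; the column names are taken from the list above. Note that velocity_6h also has a negative lower bound, but the description does not label those values as missing, so the sketch leaves them untouched.

```python
import pandas as pd

# Columns whose description lists -1 as the missing-value sentinel.
MINUS_ONE_COLS = [
    "prev_address_months_count",
    "current_address_months_count",
    "bank_months_count",
    "session_length_in_minutes",
    "device_distinct_emails",
]
# Column whose description says that any negative value is missing.
NEGATIVE_COLS = ["intended_balcon_amount"]


def mark_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the documented sentinels replaced by NaN."""
    out = df.copy()
    for col in MINUS_ONE_COLS:
        if col in out.columns:
            # mask() replaces values where the condition holds with NaN.
            out[col] = out[col].mask(out[col] == -1)
    for col in NEGATIVE_COLS:
        if col in out.columns:
            out[col] = out[col].mask(out[col] < 0)
    return out


# Toy data (hypothetical values) to illustrate the behaviour.
toy = pd.DataFrame({
    "prev_address_months_count": [-1, 12, 40],
    "intended_balcon_amount": [-0.5, 30.0, 101.2],
    "velocity_6h": [-175.0, 500.0, 900.0],
})
cleaned = mark_missing(toy)
print(cleaned.isna().mean())  # share of missing values per column
```

Applying the same function to the full data frame gives a per-column missing rate that is easier to read than counting sentinel values by hand.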


In [5]:
import pandas as pd

pd_data = pd.read_csv("Base.csv")
pd_data.shape

In [11]:
pd_data.dtypes.to_dict()

{'fraud_bool': dtype('int64'),
 'income': dtype('float64'),
 'name_email_similarity': dtype('float64'),
 'prev_address_months_count': dtype('int64'),
 'current_address_months_count': dtype('int64'),
 'customer_age': dtype('int64'),
 'days_since_request': dtype('float64'),
 'intended_balcon_amount': dtype('float64'),
 'payment_type': dtype('O'),
 'zip_count_4w': dtype('int64'),
 'velocity_6h': dtype('float64'),
 'velocity_24h': dtype('float64'),
 'velocity_4w': dtype('float64'),
 'bank_branch_count_8w': dtype('int64'),
 'date_of_birth_distinct_emails_4w': dtype('int64'),
 'employment_status': dtype('O'),
 'credit_risk_score': dtype('int64'),
 'email_is_free': dtype('int64'),
 'housing_status': dtype('O'),
 'phone_home_valid': dtype('int64'),
 'phone_mobile_valid': dtype('int64'),
 'bank_months_count': dtype('int64'),
 'has_other_cards': dtype('int64'),
 'proposed_credit_limit': dtype('float64'),
 'foreign_request': dtype('int64'),
 'source': dtype('O'),
 'session_length_in_minutes': dty

In [12]:
pd_data['fraud_bool'].value_counts(normalize=True)

0    0.988971
1    0.011029
Name: fraud_bool, dtype: float64
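Since the data span eight months (month values 0 to 7) and only about 1.1% of applications are fraudulent, it can be useful to look at the fraud rate per month and to split the data by time rather than at random. The sketch below uses a toy frame with hypothetical values; the function name `temporal_split` and the choice of month 6 as the first test month are assumptions made for illustration, not something prescribed by this notebook.

```python
import pandas as pd


def temporal_split(df: pd.DataFrame, month_col: str = "month",
                   first_test_month: int = 6):
    """Split df into (train, test) so that test holds only the latest months."""
    train = df[df[month_col] < first_test_month]
    test = df[df[month_col] >= first_test_month]
    return train, test


# Toy data (hypothetical): one application per month.
toy = pd.DataFrame({
    "month": [0, 1, 2, 3, 4, 5, 6, 7],
    "fraud_bool": [0, 0, 1, 0, 0, 0, 1, 0],
})

train, test = temporal_split(toy)

# Fraud prevalence per month; on the real data this shows how the
# positive rate drifts over time.
fraud_rate_by_month = toy.groupby("month")["fraud_bool"].mean()
print(len(train), len(test))
print(fraud_rate_by_month)
```

On the real data, the same calls would be `temporal_split(pd_data)` and `pd_data.groupby("month")["fraud_bool"].mean()`.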

## Libraries for automated EDA

https://towardsdatascience.com/4-libraries-that-can-perform-eda-in-one-line-of-python-code-b13938a06ae

https://pub.towardsai.net/5-python-packages-for-effortless-eda-94abddac3bc5

https://github.com/shivpalSW/EDA-with-AutoEDA-libraries

In [14]:
!pip install sweetviz

Collecting sweetviz
  Downloading https://files.pythonhosted.org/packages/7b/d7/b83a6a5548f6fd028c18e198f116e0be641c0db72cc12b0d6ddb836d0fa4/sweetviz-2.2.1-py3-none-any.whl (15.1MB)
Collecting importlib-resources>=1.2.0 (from sweetviz)
  Downloading https://files.pythonhosted.org/packages/38/71/c13ea695a4393639830bf96baea956538ba7a9d06fcce7cef10bfff20f72/importlib_resources-5.12.0-py3-none-any.whl
Installing collected packages: importlib-resources, sweetviz
Successfully installed importlib-resources-5.12.0 sweetviz-2.2.1


In [15]:
import sweetviz as sv
# https://www.analyticsvidhya.com/blog/2021/01/making-exploratory-data-analysis-sweeter-with-sweetviz-2-0/




Report BAF_EDA_report.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
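The exported cell above shows only the import, so the call that produced the report is not visible. The following is a minimal sketch of how a Sweetviz report with that file name could be generated, using Sweetviz's `analyze` and `show_html` functions. The wrapper name `build_eda_report`, the use of fraud_bool as the target feature, and `open_browser=False` are assumptions for illustration and may differ from what was actually run.

```python
import pandas as pd


def build_eda_report(df: pd.DataFrame, target: str = "fraud_bool",
                     path: str = "BAF_EDA_report.html") -> str:
    """Generate a Sweetviz HTML report for df and return the output path."""
    # Imported lazily so the function can be defined even where sweetviz
    # is not installed.
    import sweetviz as sv

    # analyze() profiles every column; target_feat adds per-feature
    # comparisons against the (numeric or boolean) target column.
    report = sv.analyze(df, target_feat=target)
    # Write the report to disk without trying to open a browser window.
    report.show_html(filepath=path, open_browser=False)
    return path


# Example call (assumes pd_data was loaded from Base.csv as in the earlier cell):
# build_eda_report(pd_data)
```

On a data frame with a million rows the analysis can take a while; running it on a sample such as `pd_data.sample(100_000, random_state=0)` is one way to get a quicker first look.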
