# DESCRIPTIVE DATA ANALYSIS (EDA)

This notebook carries out the descriptive analysis of the **Bank Customer Churn Prediction** dataset, that is, getting to know and understand the data we will be working with.

During this phase, we try to answer the following questions:
- How many records does the dataset have?
- Are all rows complete, or are there fields with null values?
- What kind of data are they (qualitative or quantitative)?
- Which is the target variable (the variable to predict)?
- What distribution do the variables follow?




## Initial setup

**Library imports**

In [7]:
import os
import numpy as np
import pandas as pd

from matplotlib import pyplot as plt

**Loading the dataset**

In [13]:
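# Build the path to the data folder, located in the parent of the current working directory
file_path = os.path.dirname(os.getcwd()) + '/data/'
print(file_path)
# file_path already ends with '/', so the file name needs no leading slash
data = pd.read_csv(file_path + 'Bank_Customer_Churn_Prediction.csv')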

c:\TFG\TFG_ADE/data/
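
The printed path mixes `\` and `/` because it is assembled by string concatenation. As a minimal sketch of an alternative, `pathlib` can build the same location without manual separators. The helper name `dataset_path` and the `/project/notebooks` layout are hypothetical, used only for illustration.

```python
from pathlib import Path


def dataset_path(working_dir, filename='Bank_Customer_Churn_Prediction.csv'):
    """Build <parent of working_dir>/data/<filename> without manual slashes."""
    return Path(working_dir).parent / 'data' / filename


# Hypothetical project layout, not the one used in this notebook.
csv_path = dataset_path('/project/notebooks')
print(csv_path.as_posix())
```

In the notebook itself this would be used as `pd.read_csv(dataset_path(Path.cwd()))`, assuming the notebook runs from a folder that sits next to `data/`.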


## Dataset overview

Overview of the dataset's variables

In [50]:
print('Number of rows and columns in the dataset: ', data.shape)
print('\n')
print('Column names: ')
data.columns.values

Number of rows and columns in the dataset:  (10000, 12)


Column names: 


array(['customer_id', 'credit_score', 'country', 'gender', 'age',
       'tenure', 'balance', 'products_number', 'credit_card',
       'active_member', 'estimated_salary', 'churn'], dtype=object)

In [42]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customer_id       10000 non-null  int64  
 1   credit_score      10000 non-null  int64  
 2   country           10000 non-null  object 
 3   gender            10000 non-null  object 
 4   age               10000 non-null  int64  
 5   tenure            10000 non-null  int64  
 6   balance           10000 non-null  float64
 7   products_number   10000 non-null  int64  
 8   credit_card       10000 non-null  int64  
 9   active_member     10000 non-null  int64  
 10  estimated_salary  10000 non-null  float64
 11  churn             10000 non-null  int64  
dtypes: float64(2), int64(8), object(2)
memory usage: 937.6+ KB
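
The dtypes reported by `data.info()` give a first answer to the qualitative versus quantitative question, but integer 0/1 flags such as `credit_card` or `churn` are categorical in meaning even though they are stored as `int64`. The sketch below separates the two groups on a few rows copied from this dataset. The helper `split_by_nature` and its `binary_flags` parameter are assumptions introduced for illustration, not part of the original notebook.

```python
import pandas as pd

# Five rows (a subset of columns) copied from this dataset's first records.
sample = pd.DataFrame({
    'country': ['France', 'Spain', 'France', 'France', 'Spain'],
    'gender': ['Female'] * 5,
    'age': [42, 41, 42, 39, 43],
    'balance': [0.0, 83807.86, 159660.8, 0.0, 125510.82],
    'credit_card': [1, 0, 1, 0, 1],
    'churn': [1, 0, 1, 0, 0],
})


def split_by_nature(df, binary_flags=()):
    """Return (qualitative, quantitative) lists of column names.

    Non-numeric columns are treated as qualitative, together with any
    numeric column explicitly listed in binary_flags.
    """
    qualitative = [c for c in df.columns
                   if not pd.api.types.is_numeric_dtype(df[c])]
    qualitative += [c for c in binary_flags
                    if c in df.columns and c not in qualitative]
    quantitative = [c for c in df.columns if c not in qualitative]
    return qualitative, quantitative


qualitative, quantitative = split_by_nature(
    sample, binary_flags=('credit_card', 'churn'))
print('Qualitative: ', qualitative)
print('Quantitative:', quantitative)
```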


In [25]:
# Show the first rows of the dataset
data.head()

|   | customer_id | credit_score | country | gender | age | tenure | balance | products_number | credit_card | active_member | estimated_salary | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15634602 | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 15647311 | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 15619304 | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 15701354 | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 15737888 | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |


In [35]:
# Check for null values in the data
print('Number of null values: ')
data.isnull().sum()

Number of null values: 


customer_id         0
credit_score        0
country             0
gender              0
age                 0
tenure              0
balance             0
products_number     0
credit_card         0
active_member       0
estimated_salary    0
churn               0
dtype: int64

In [40]:
# Check for duplicated rows
print('Number of duplicated rows: ', data.duplicated().sum())

Number of duplicated rows:  0


In [45]:
data.nunique()

customer_id         10000
credit_score          460
country                 3
gender                  2
age                    70
tenure                 11
balance              6382
products_number         4
credit_card             2
active_member           2
estimated_salary     9999
churn                   2
dtype: int64
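
The unique-value counts above already hint at the role of each column: a column with as many distinct values as rows behaves like an identifier, and a column with exactly two values is a binary flag. A minimal sketch of that reading follows, re-entering the counts from the output above by hand. The names `identifier_like` and `binary` are illustrative only.

```python
import pandas as pd

n_rows = 10000  # number of records reported by data.shape

# Counts copied from the data.nunique() output above.
unique_counts = pd.Series({
    'customer_id': 10000, 'credit_score': 460, 'country': 3, 'gender': 2,
    'age': 70, 'tenure': 11, 'balance': 6382, 'products_number': 4,
    'credit_card': 2, 'active_member': 2, 'estimated_salary': 9999,
    'churn': 2,
})

# One distinct value per row: behaves like an ID, carries no predictive signal.
identifier_like = unique_counts[unique_counts == n_rows].index.tolist()

# Exactly two distinct values: binary variables.
binary = unique_counts[unique_counts == 2].index.tolist()

print('Identifier-like columns:', identifier_like)
print('Binary columns:', binary)
```

On the real DataFrame the same idea would read `data.nunique() == len(data)`.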

The dataset contains **10,000** records and **12** columns, with no null values and no duplicated rows.

## Distribution of the variables

In [61]:
# Summary statistics for the numeric columns, one row per variable
stats_order = ['mean', 'std', 'max', 'min', '25%', '50%', '75%']
numeric_cols = ['age', 'tenure', 'credit_score', 'balance', 'products_number',
                'credit_card', 'active_member', 'estimated_salary', 'churn']
data[numeric_cols].describe().loc[stats_order].T

|   | mean | std | max | min | 25% | 50% | 75% |
|---|---|---|---|---|---|---|---|
| age | 38.9218 | 10.487806 | 92.0 | 18.0 | 32.0 | 37.0 | 44.0 |
| tenure | 5.0128 | 2.892174 | 10.0 | 0.0 | 3.0 | 5.0 | 7.0 |
| credit_score | 650.5288 | 96.653299 | 850.0 | 350.0 | 584.0 | 652.0 | 718.0 |
| balance | 76485.889288 | 62397.405202 | 250898.09 | 0.0 | 0.0 | 97198.54 | 127644.24 |
| products_number | 1.5302 | 0.581654 | 4.0 | 1.0 | 1.0 | 1.0 | 2.0 |
| credit_card | 0.7055 | 0.45584 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| active_member | 0.5151 | 0.499797 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| estimated_salary | 100090.239881 | 57510.492818 | 199992.48 | 11.58 | 51002.11 | 100193.915 | 149388.2475 |
| churn | 0.2037 | 0.402769 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
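
For a 0/1 variable, the mean equals the share of ones, so the `churn` mean of 0.2037 over 10,000 records corresponds to 2,037 churned customers and an imbalanced target. The sketch below rebuilds a synthetic column with those proportions to show how the class balance could be read with `value_counts`. The series is an assumption derived from the summary statistics, not the real column.

```python
import pandas as pd

# Synthetic target rebuilt from the summary statistics:
# 10,000 rows with mean 0.2037 means 2,037 ones and 7,963 zeros.
churn = pd.Series([1] * 2037 + [0] * 7963, name='churn')

counts = churn.value_counts()                 # absolute frequency per class
shares = churn.value_counts(normalize=True)   # relative frequency per class

print(counts.to_dict())
print(shares.round(4).to_dict())
```

On the real data the equivalent call would be `data['churn'].value_counts(normalize=True)`.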
