## Practice: Exploratory Analysis with Visualizations and a Custom Library


Objective: deepen exploratory data analysis by building a custom visualization library of reusable functions, applying good development and statistical-analysis practices, and documenting everything in a professional repository (GitHub) and an explanatory PDF.

## 1. Preprocessing

In [4]:
import pandas as pd
from libreria_modulo_1 import analysis, preprocessing, visualization 

In [5]:
datos = pd.read_csv('CTG.csv')
datos.head()

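| | FileName | Date | SegFile | b | e | LBE | LB | AC | FM | UC | ... | C | D | E | AD | DE | LD | FS | SUSP | CLASS | NSP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Variab10.txt | 12/1/1996 | CTG0001.txt | 240.0 | 357.0 | 120.0 | 120.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 9.0 | 2.0 |
| 1 | Fmcs_1.txt | 5/3/1996 | CTG0002.txt | 5.0 | 632.0 | 132.0 | 132.0 | 4.0 | 0.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 1.0 |
| 2 | Fmcs_1.txt | 5/3/1996 | CTG0003.txt | 177.0 | 779.0 | 133.0 | 133.0 | 2.0 | 0.0 | 5.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 1.0 |
| 3 | Fmcs_1.txt | 5/3/1996 | CTG0004.txt | 411.0 | 1192.0 | 134.0 | 134.0 | 2.0 | 0.0 | 6.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 1.0 |
| 4 | Fmcs_1.txt | 5/3/1996 | CTG0005.txt | 533.0 | 1147.0 | 132.0 | 132.0 | 4.0 | 0.0 | 5.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 1.0 |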


In [6]:
analysis.completitud_datos(datos)

FileName    0.0014
Date        0.0014
SegFile     0.0014
b           0.0014
e           0.0014
LBE         0.0014
LB          0.0014
AC          0.0014
Max         0.0014
Nmax        0.0014
Mode        0.0014
Nzeros      0.0014
Width       0.0014
Min         0.0014
AD          0.0014
E           0.0014
D           0.0014
C           0.0014
B           0.0014
A           0.0014
Tendency    0.0014
Variance    0.0014
Median      0.0014
Mean        0.0014
FS          0.0014
SUSP        0.0014
DE          0.0014
LD          0.0014
CLASS       0.0014
NSP         0.0014
UC          0.0009
FM          0.0009
ALTV        0.0009
MLTV        0.0009
ASTV        0.0009
MSTV        0.0009
DR          0.0005
DP          0.0005
DL          0.0005
DS          0.0005
dtype: float64
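
The internals of `analysis.completitud_datos` are not shown in this notebook. A minimal sketch of an equivalent report, assuming it simply returns the fraction of nulls per column sorted in descending order (the name `null_fraction` is hypothetical), could look like this:

```python
import numpy as np
import pandas as pd

def null_fraction(df: pd.DataFrame) -> pd.Series:
    """Fraction of null values per column, largest first (hypothetical helper)."""
    return df.isna().mean().sort_values(ascending=False).round(4)

# Tiny demo frame: column "a" has 1 null out of 4 rows, column "b" has none.
demo = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0], "b": ["x", "y", "z", "w"]})
report = null_fraction(demo)
print(report)
```

Read this way, a value of 0.0014 on 2129 rows corresponds to about 3 missing entries in a column, 0.0009 to 2, and 0.0005 to 1.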

Because the original data is nearly complete (at most 0.14% nulls in any column), a helper function was written to inject NaN values artificially. The function `analysis.completitud_datos` does not replace `check_data_completeness_JavierMartinezReyes`; it only reports the fraction of null values per column.

In [7]:
datos_sucios = preprocessing.agrega_nan(datos, min_frac=0.0, max_frac=0.3)
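
The implementation of `preprocessing.agrega_nan` is not shown. One plausible reading of `min_frac` and `max_frac` is that each column receives its own random fraction of NaN values drawn from that interval. The sketch below follows that assumption; the name `inject_nan` and the `seed` parameter are hypothetical.

```python
import numpy as np
import pandas as pd

def inject_nan(df: pd.DataFrame, min_frac: float = 0.0,
               max_frac: float = 0.3, seed: int = 0) -> pd.DataFrame:
    """Return a copy of df where each column loses a random share of its values."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    n_rows = len(out)
    for col in out.columns:
        # Assumption: one fraction per column, uniform in [min_frac, max_frac].
        frac = rng.uniform(min_frac, max_frac)
        n_missing = int(round(frac * n_rows))
        rows = rng.choice(n_rows, size=n_missing, replace=False)
        mask = np.zeros(n_rows, dtype=bool)
        mask[rows] = True
        # Series.mask replaces the selected positions with NaN and upcasts if needed.
        out[col] = out[col].mask(mask)
    return out

demo = pd.DataFrame({"value": np.arange(100, dtype=float),
                     "label": ["a", "b"] * 50})
dirty = inject_nan(demo, min_frac=0.0, max_frac=0.3)
print(dirty.isna().mean())
```

Working on a copy keeps the clean frame available for comparison, which is what the notebook does by keeping both `datos` and `datos_sucios`.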

In [8]:
analysis.completitud_datos(datos_sucios)

b           0.2898
Nmax        0.2884
SUSP        0.2729
FS          0.2724
LBE         0.2659
D           0.2588
FM          0.2442
e           0.2372
Tendency    0.2325
Median      0.2325
A           0.2306
LD          0.2283
AC          0.2217
DE          0.2123
C           0.2118
Max         0.2114
SegFile     0.2043
CLASS       0.2001
B           0.1822
DL          0.1813
Mean        0.1649
ALTV        0.1602
Variance    0.1512
Min         0.1404
MSTV        0.1395
UC          0.1249
Date        0.1165
ASTV        0.0944
AD          0.0930
NSP         0.0836
Nzeros      0.0747
Mode        0.0672
LB          0.0545
MLTV        0.0531
DS          0.0277
Width       0.0254
DR          0.0108
DP          0.0094
E           0.0080
FileName    0.0038
dtype: float64

### 1.1 Drop columns with more than 20% null values

In [9]:
df_limpio = preprocessing.delete_missing_values(datos_sucios, porcentage=0.2)
print("Forma del DataFrame limpio:", df_limpio.shape) 

Análisis inicial:
- Filas: 2129
- Columnas originales: 40
- Porcentaje máximo de NaN permitido por columna: 20.0%
- Máximo de NaN permitidos por columna: 425/2129

Análisis por columnas:
- Columnas con más de 20.0% de NaN: 18
- Columnas a eliminar: ['SegFile', 'b', 'e', 'LBE', 'AC', 'FM', 'Max', 'Nmax', 'Median', 'Tendency', 'A', 'C', 'D', 'DE', 'LD', 'FS', 'SUSP', 'CLASS']

Resultados:
- Columnas eliminadas: 18
- Columnas restantes: 22
- Forma final del DataFrame: (2129, 22)
Forma del DataFrame limpio: (2129, 22)
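
The rule reported above can be sketched in plain pandas. This is an illustration under the assumption that a column is dropped when its null fraction is strictly greater than the threshold; `drop_sparse_columns` is a hypothetical name, not the library function.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df: pd.DataFrame, max_null_frac: float = 0.2):
    """Drop columns whose null fraction exceeds max_null_frac."""
    null_frac = df.isna().mean()
    to_drop = null_frac[null_frac > max_null_frac].index.tolist()
    return df.drop(columns=to_drop), to_drop

demo = pd.DataFrame({
    "keep": [1.0, 2.0, 3.0, 4.0, 5.0],        # 0% null
    "edge": [1.0, np.nan, 3.0, 4.0, 5.0],     # exactly 20% null, kept
    "drop": [np.nan, np.nan, 3.0, 4.0, 5.0],  # 40% null, dropped
})
clean, dropped = drop_sparse_columns(demo, max_null_frac=0.2)
print(dropped, clean.shape)

# The "425/2129" figure in the log matches the integer part of 0.2 * 2129.
max_allowed = int(0.2 * 2129)
```

With 2129 rows, `CLASS` at a null fraction of 0.2001 has 426 missing values, one more than the 425 allowed, which is why it appears in the list of removed columns.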


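### 1.2 Impute the remaining missing values with suitable methods

Note that the call below imputes `datos_sucios` (all 40 columns) rather than `df_limpio`, so the columns dropped in 1.1 are still present in `datos_imputados`.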


In [10]:
datos_imputados = preprocessing.impute_missing_values(datos_sucios, method='knn')


Valores faltantes antes de imputar:
13374 en total
Columna categórica 'FileName': imputada con moda 'S8001034.dsp'
Columna categórica 'Date': imputada con moda '2/22/1995'
Columna categórica 'SegFile': imputada con moda 'CTG0001.txt'
Columnas numéricas: imputadas con KNN
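
The log suggests a two-step strategy: mode imputation for categorical columns and KNN for numeric ones. A minimal sketch of that strategy using scikit-learn's `KNNImputer` follows. It is an assumption that the library works this way; `impute_mode_knn` and its `n_neighbors` default are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

def impute_mode_knn(df: pd.DataFrame, n_neighbors: int = 5) -> pd.DataFrame:
    """Fill non-numeric columns with their mode and numeric columns with KNN."""
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    cat_cols = out.columns.difference(num_cols)
    for col in cat_cols:
        mode = out[col].mode(dropna=True)
        if not mode.empty:
            out[col] = out[col].fillna(mode.iloc[0])
    if len(num_cols) > 0:
        # KNNImputer averages the values of the nearest rows; it assumes that
        # no numeric column is entirely NaN.
        imputed = KNNImputer(n_neighbors=n_neighbors).fit_transform(out[num_cols])
        out[num_cols] = pd.DataFrame(imputed, columns=num_cols, index=out.index)
    return out

demo = pd.DataFrame({
    "file": ["a.txt", None, "a.txt", "b.txt"],
    "x": [1.0, 2.0, np.nan, 4.0],
    "y": [10.0, np.nan, 30.0, 40.0],
})
filled = impute_mode_knn(demo, n_neighbors=2)
print(filled)
```

Because KNN imputation averages neighbors, coded or binary columns can receive non-integer values. The `datos_imputados.head()` output shows this for `CLASS`, which becomes 5.4 in row 4, so such columns may need rounding or a categorical strategy before being used as grouping variables.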


In [11]:
datos_imputados.head()

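| | FileName | Date | SegFile | b | e | LBE | LB | AC | FM | UC | ... | C | D | E | AD | DE | LD | FS | SUSP | CLASS | NSP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Variab10.txt | 12/1/1996 | CTG0001.txt | 240.0 | 1292.6 | 125.8 | 120.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 9.0 | 2.0 |
| 1 | Fmcs_1.txt | 5/3/1996 | CTG0002.txt | 1176.6 | 632.0 | 133.0 | 132.0 | 5.4 | 0.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 1.0 |
| 2 | Fmcs_1.txt | 2/22/1995 | CTG0003.txt | 177.0 | 779.0 | 133.0 | 133.0 | 2.0 | 0.0 | 5.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 1.0 |
| 3 | Fmcs_1.txt | 5/3/1996 | CTG0004.txt | 1065.0 | 1766.8 | 134.0 | 134.0 | 2.0 | 0.0 | 6.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 1.0 |
| 4 | Fmcs_1.txt | 5/3/1996 | CTG0005.txt | 533.0 | 1147.0 | 132.0 | 132.0 | 4.0 | 0.0 | 5.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.4 | 1.0 |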


In [12]:
# Check for "nan" values stored as strings
print("Verificando valores 'nan' como strings:")
for col in datos_imputados.columns:
    if datos_imputados[col].dtype == 'object':
        nan_strings = (datos_imputados[col] == 'nan').sum()
        if nan_strings > 0:
            print(f"Columna '{col}': {nan_strings} valores 'nan' como string")

# Also check for real NaN values
print("\nValores NaN reales restantes:")
print(datos_imputados.isnull().sum().sum())

Verificando valores 'nan' como strings:

Valores NaN reales restantes:
0


In [13]:
analysis.completitud_datos(datos_imputados)

FileName    0.0
Date        0.0
SegFile     0.0
b           0.0
e           0.0
LBE         0.0
LB          0.0
AC          0.0
FM          0.0
UC          0.0
ASTV        0.0
MSTV        0.0
ALTV        0.0
MLTV        0.0
DL          0.0
DS          0.0
DP          0.0
DR          0.0
Width       0.0
Min         0.0
Max         0.0
Nmax        0.0
Nzeros      0.0
Mode        0.0
Mean        0.0
Median      0.0
Variance    0.0
Tendency    0.0
A           0.0
B           0.0
C           0.0
D           0.0
E           0.0
AD          0.0
DE          0.0
LD          0.0
FS          0.0
SUSP        0.0
CLASS       0.0
NSP         0.0
dtype: float64

In [14]:
datos_imputados.describe()

| | b | e | LBE | LB | AC | FM | UC | ASTV | MSTV | ALTV | ... | C | D | E | AD | DE | LD | FS | SUSP | CLASS | NSP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 | ... | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 |
| mean | 864.79319 | 1735.726814 | 133.391984 | 133.25758 | 2.783513 | 6.788348 | 3.683461 | 46.88914 | 1.32629 | 9.759013 | ... | 0.020302 | 0.034964 | 0.034023 | 0.159023 | 0.114382 | 0.044553 | 0.02707 | 0.091634 | 4.43547 | 1.299347 |
| std | 795.738809 | 871.829062 | 9.405495 | 9.759415 | 3.289762 | 33.630294 | 2.742041 | 16.810818 | 0.857145 | 17.506342 | ... | 0.134057 | 0.172606 | 0.180872 | 0.356747 | 0.298146 | 0.203097 | 0.153593 | 0.270834 | 2.855362 | 0.602501 |
| min | 0.0 | 287.0 | 106.0 | 106.0 | 0.0 | 0.0 | 0.0 | 12.0 | 0.2 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 25% | 179.0 | 1080.0 | 127.0 | 126.0 | 0.0 | 0.0 | 2.0 | 33.0 | 0.7 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 1.0 |
| 50% | 653.0 | 1500.0 | 133.0 | 133.0 | 2.0 | 0.0 | 3.4 | 48.0 | 1.2 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 1.0 |
| 75% | 1377.2 | 2371.0 | 140.0 | 140.0 | 4.2 | 3.0 | 5.0 | 61.0 | 1.7 | 11.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 1.0 |
| max | 3194.0 | 3599.0 | 160.0 | 159.0 | 21.0 | 564.0 | 23.0 | 87.0 | 7.0 | 91.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 10.0 | 3.0 |


### 1.3 Detect and treat outliers with IQR or z-score

In [15]:
outliers_iqr = preprocessing.detect_outliers_iqr(datos_imputados, factor=1.5)

In [16]:
outliers_iqr

| | columna | Q1 | Q3 | IQR | limite_inferior | limite_superior | num_outliers | porcentaje_outliers |
|---|---|---|---|---|---|---|---|---|
| 0 | b | 179.0 | 1377.2 | 1198.2 | -1618.3 | 3174.5 | 3 | 0.140911 |
| 1 | e | 1080.0 | 2371.0 | 1291.0 | -856.5 | 4307.5 | 0 | 0.0 |
| 2 | LBE | 127.0 | 140.0 | 13.0 | 107.5 | 159.5 | 6 | 0.281822 |
| 3 | LB | 126.0 | 140.0 | 14.0 | 105.0 | 161.0 | 0 | 0.0 |
| 4 | AC | 0.0 | 4.2 | 4.2 | -6.3 | 10.5 | 64 | 3.006106 |
| 5 | FM | 0.0 | 3.0 | 3.0 | -4.5 | 7.5 | 265 | 12.447158 |
| 6 | UC | 2.0 | 5.0 | 3.0 | -2.5 | 9.5 | 60 | 2.818225 |
| 7 | ASTV | 33.0 | 61.0 | 28.0 | -9.0 | 103.0 | 0 | 0.0 |
| 8 | MSTV | 0.7 | 1.7 | 1.0 | -0.8 | 3.2 | 61 | 2.865195 |
| 9 | ALTV | 0.0 | 11.0 | 11.0 | -16.5 | 27.5 | 291 | 13.668389 |
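
The limits in this table follow the usual IQR rule: `lower = Q1 - factor * IQR` and `upper = Q3 + factor * IQR`. For `b`, that gives 179.0 - 1.5 × 1198.2 = -1618.3 and 1377.2 + 1.5 × 1198.2 = 3174.5. A minimal sketch of the rule for one column follows; `iqr_outliers` is a hypothetical name and not necessarily how `detect_outliers_iqr` is written.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, factor: float = 1.5) -> dict:
    """Summarize IQR-based outliers for a single numeric column."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    # A value is flagged when it falls strictly outside [lower, upper].
    mask = (series < lower) | (series > upper)
    return {
        "Q1": q1, "Q3": q3, "IQR": iqr,
        "lower": lower, "upper": upper,
        "num_outliers": int(mask.sum()),
        "pct_outliers": 100 * mask.mean(),
    }

demo = pd.Series([1, 2, 3, 4, 100])
result = iqr_outliers(demo)
print(result)
```

The percentage column is the outlier count over the number of rows, for example 3 / 2129 × 100 ≈ 0.1409 for `b`.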


In [17]:
outliers_zscore = preprocessing.detect_outliers_zscore(datos_imputados, threshold=3.0)

In [18]:
outliers_zscore

| | columna | media | std | threshold | num_outliers | porcentaje_outliers |
|---|---|---|---|---|---|---|
| 0 | b | 864.79319 | 795.738809 | 3.0 | 0 | 0.0 |
| 1 | e | 1735.726814 | 871.829062 | 3.0 | 0 | 0.0 |
| 2 | LBE | 133.391984 | 9.405495 | 3.0 | 0 | 0.0 |
| 3 | LB | 133.25758 | 9.759415 | 3.0 | 0 | 0.0 |
| 4 | AC | 2.783513 | 3.289762 | 3.0 | 37 | 1.737905 |
| 5 | FM | 6.788348 | 33.630294 | 3.0 | 26 | 1.221231 |
| 6 | UC | 3.683461 | 2.742041 | 3.0 | 21 | 0.986379 |
| 7 | ASTV | 46.88914 | 16.810818 | 3.0 | 0 | 0.0 |
| 8 | MSTV | 1.32629 | 0.857145 | 3.0 | 27 | 1.268201 |
| 9 | ALTV | 9.759013 | 17.506342 | 3.0 | 56 | 2.630343 |
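
The z-score method flags far fewer points than IQR on skewed columns because extreme values inflate the standard deviation. For `FM`, the upper cut-off is about 6.79 + 3 × 33.63 ≈ 107.7, compared with 7.5 under the IQR rule. The sketch below shows the computation for one column, assuming the sample standard deviation (`ddof=1`), which matches the `std` values reported by `describe()`; `zscore_outliers` is a hypothetical name.

```python
import pandas as pd

def zscore_outliers(series: pd.Series, threshold: float = 3.0) -> dict:
    """Summarize z-score outliers for a single numeric column."""
    mean = series.mean()
    std = series.std()  # pandas default: sample standard deviation (ddof=1)
    z = (series - mean) / std
    mask = z.abs() > threshold
    return {
        "mean": mean, "std": std, "threshold": threshold,
        "num_outliers": int(mask.sum()),
        "pct_outliers": 100 * mask.mean(),
    }

# Thirty zeros and one extreme value: only the extreme value exceeds |z| > 3.
demo = pd.Series([0.0] * 30 + [100.0])
result = zscore_outliers(demo)
print(result)
```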


## 2. Data Analysis

Full analysis of data completeness and characteristics using the custom function.

In [19]:
# Full analysis of data completeness and characteristics
resultado_completo = analysis.check_data_completeness_JavierMartinezReyes(datos_imputados)

ANÁLISIS COMPLETO DE COMPLETITUD DE DATOS
Dimensiones del DataFrame: 2129 filas x 40 columnas
Total de valores: 85,160
Total de valores nulos: 0
Porcentaje general de completitud: 100.00%

CLASIFICACIÓN DE VARIABLES:
- Continua: 24 columnas
- Discreta: 13 columnas
- Categórica_Alta: 2 columnas
- Categórica_Media: 1 columnas

Columnas con mayor porcentaje de nulos:
- FileName: 0.0% (Categórica_Alta)
- Date: 0.0% (Categórica_Media)
- SegFile: 0.0% (Categórica_Alta)
- b: 0.0% (Continua)
- e: 0.0% (Continua)


In [20]:
# Show the general summary
print("📊 RESUMEN GENERAL:")
resultado_completo['resumen_general']

📊 RESUMEN GENERAL:


| | columna | tipo_dato | valores_totales | valores_no_nulos | valores_nulos | porcentaje_completitud | porcentaje_nulos | valores_unicos | clasificacion |
|---|---|---|---|---|---|---|---|---|---|
| 0 | FileName | object | 2129 | 2129 | 0 | 100.0 | 0.0 | 352 | Categórica_Alta |
| 1 | Date | object | 2129 | 2129 | 0 | 100.0 | 0.0 | 48 | Categórica_Media |
| 2 | SegFile | object | 2129 | 2129 | 0 | 100.0 | 0.0 | 1694 | Categórica_Alta |
| 3 | b | float64 | 2129 | 2129 | 0 | 100.0 | 0.0 | 1333 | Continua |
| 4 | e | float64 | 2129 | 2129 | 0 | 100.0 | 0.0 | 1360 | Continua |
| 5 | LBE | float64 | 2129 | 2129 | 0 | 100.0 | 0.0 | 175 | Continua |
| 6 | LB | float64 | 2129 | 2129 | 0 | 100.0 | 0.0 | 113 | Continua |
| 7 | AC | float64 | 2129 | 2129 | 0 | 100.0 | 0.0 | 63 | Continua |
| 8 | FM | float64 | 2129 | 2129 | 0 | 100.0 | 0.0 | 158 | Continua |
| 9 | UC | float64 | 2129 | 2129 | 0 | 100.0 | 0.0 | 50 | Continua |


In [21]:
# Show dispersion statistics
print("📈 ESTADÍSTICOS DE DISPERSIÓN (Variables Numéricas):")
resultado_completo['estadisticos_dispersion']

📈 ESTADÍSTICOS DE DISPERSIÓN (Variables Numéricas):


| | columna | tipo | media | mediana | desv_std | varianza | min | max | q25 | q75 | rango | coef_variacion |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | Continua | 864.7932 | 653.0 | 795.7388 | 633200.2517 | 0.0 | 3194.0 | 179.0 | 1377.2 | 3194.0 | 92.0149 |
| 1 | e | Continua | 1735.7268 | 1500.0 | 871.8291 | 760085.9134 | 287.0 | 3599.0 | 1080.0 | 2371.0 | 3312.0 | 50.2285 |
| 2 | LBE | Continua | 133.392 | 133.0 | 9.4055 | 88.4633 | 106.0 | 160.0 | 127.0 | 140.0 | 54.0 | 7.051 |
| 3 | LB | Continua | 133.2576 | 133.0 | 9.7594 | 95.2462 | 106.0 | 159.0 | 126.0 | 140.0 | 53.0 | 7.3237 |
| 4 | AC | Continua | 2.7835 | 2.0 | 3.2898 | 10.8225 | 0.0 | 21.0 | 0.0 | 4.2 | 21.0 | 118.1874 |
| 5 | FM | Continua | 6.7883 | 0.0 | 33.6303 | 1130.9967 | 0.0 | 564.0 | 0.0 | 3.0 | 564.0 | 495.412 |
| 6 | UC | Continua | 3.6835 | 3.4 | 2.742 | 7.5188 | 0.0 | 23.0 | 2.0 | 5.0 | 23.0 | 74.442 |
| 7 | ASTV | Continua | 46.8891 | 48.0 | 16.8108 | 282.6036 | 12.0 | 87.0 | 33.0 | 61.0 | 75.0 | 35.8523 |
| 8 | MSTV | Continua | 1.3263 | 1.2 | 0.8571 | 0.7347 | 0.2 | 7.0 | 0.7 | 1.7 | 6.8 | 64.6273 |
| 9 | ALTV | Continua | 9.759 | 0.0 | 17.5063 | 306.472 | 0.0 | 91.0 | 0.0 | 11.0 | 91.0 | 179.3864 |


In [22]:
# Show the variable classification
print("🏷️ CLASIFICACIÓN AUTOMÁTICA DE VARIABLES:")
resultado_completo['clasificacion_variables']

🏷️ CLASIFICACIÓN AUTOMÁTICA DE VARIABLES:


| | columna | clasificacion | criterio | tipo_original | es_numerica | es_categorica |
|---|---|---|---|---|---|---|
| 0 | FileName | Categórica_Alta | Valores únicos: 352 | object | False | True |
| 1 | Date | Categórica_Media | Valores únicos: 48 | object | False | True |
| 2 | SegFile | Categórica_Alta | Valores únicos: 1694 | object | False | True |
| 3 | b | Continua | Valores únicos: 1333 | float64 | True | False |
| 4 | e | Continua | Valores únicos: 1360 | float64 | True | False |
| 5 | LBE | Continua | Valores únicos: 175 | float64 | True | False |
| 6 | LB | Continua | Valores únicos: 113 | float64 | True | False |
| 7 | AC | Continua | Valores únicos: 63 | float64 | True | False |
| 8 | FM | Continua | Valores únicos: 158 | float64 | True | False |
| 9 | UC | Continua | Valores únicos: 50 | float64 | True | False |
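
The `criterio` column indicates that the classification is driven by the number of unique values, but the actual cut-offs are not shown. The sketch below illustrates the idea with made-up thresholds (`discrete_max`, `cat_medium_min`, `cat_high_min`) and a made-up `Categórica_Baja` label; none of these are taken from the library.

```python
import pandas as pd

def classify_column(series: pd.Series, discrete_max: int = 20,
                    cat_medium_min: int = 10, cat_high_min: int = 50) -> str:
    """Classify a column by dtype and cardinality (illustrative thresholds)."""
    n_unique = series.nunique(dropna=True)
    if pd.api.types.is_numeric_dtype(series):
        return "Discreta" if n_unique <= discrete_max else "Continua"
    if n_unique >= cat_high_min:
        return "Categórica_Alta"
    if n_unique >= cat_medium_min:
        return "Categórica_Media"
    return "Categórica_Baja"

continuous = classify_column(pd.Series(range(100), dtype=float))
binary = classify_column(pd.Series([0.0, 1.0, 0.0, 1.0]))
names = classify_column(pd.Series([f"file_{i}.txt" for i in range(60)]))
print(continuous, binary, names)
```

Cardinality-based rules are sensitive to imputation: averaging neighbors adds new fractional values to a column, which raises its unique count and can move a coded variable into the continuous class.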


## 3. Visualizations (raise the difficulty by adding interactivity, statistics, or multiple variables)

### 3.1 Histograms

In [23]:
visualization.plot_interactive_histogram(datos_imputados, column='LBE')

In [24]:
visualization.plot_interactive_histogram(datos_imputados, column='UC', group_by='D', bins=60)
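
The plotting code itself is not shown. When a histogram is split by a grouping column, the groups are only comparable if they share the same bin edges. The sketch below computes those counts with NumPy, leaving the drawing aside; `grouped_histogram` is a hypothetical helper and makes no claim about how `plot_interactive_histogram` bins its data.

```python
import numpy as np
import pandas as pd

def grouped_histogram(df: pd.DataFrame, column: str, group_by: str, bins: int = 10):
    """Histogram counts per group, computed on bin edges shared by all groups."""
    values = df[column].dropna()
    edges = np.histogram_bin_edges(values, bins=bins)
    counts = {}
    for key, group in df.groupby(group_by):
        counts[key], _ = np.histogram(group[column].dropna(), bins=edges)
    return edges, counts

demo = pd.DataFrame({
    "uc": [0, 1, 2, 3, 4, 5, 6, 7],
    "d": [0, 0, 0, 0, 1, 1, 1, 1],
})
edges, counts = grouped_histogram(demo, column="uc", group_by="d", bins=4)
print(edges)
print(counts)
```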

### 3.2 Boxplots

In [25]:
visualization.plot_interactive_boxplot(datos_imputados, column='b', group_by='NSP')

In [26]:
visualization.plot_interactive_boxplot(datos_imputados, column='b', target_class='D')

### 3.3 Horizontal Bars

In [36]:
visualization.plot_interactive_bar_horizontal(datos_imputados, column='LBE')

### 3.4 Lines

In [28]:
visualization.plot_interactive_line_timeseries(datos_imputados, x_column='e', y_column='DP')

### 3.5 Dot Plots

In [29]:
visualization.plot_interactive_dot_comparison(datos_imputados, column='b', group1='6/6/1998', group2='5/10/1998', group_column='Date')

### 3.6 Density

In [30]:
visualization.plot_interactive_density_multiclass(datos_imputados, column='b', class_column='DP')

### 3.7 Violin

In [31]:
visualization.plot_interactive_violin_swarm(datos_imputados, column='b', group_by='DP')

### 3.8 Heatmap

In [32]:
# Basic heatmap: Spearman correlation
visualization.plot_interactive_correlation_heatmap(
    datos_imputados, 
    method='spearman', 
    annot=True, 
    max_variables=10,
    show_only_significant=True, 
    threshold=0.4
)

⚠️  Mostrando 10 de 37 variables numéricas


In [33]:

visualization.plot_interactive_correlation_heatmap(
    datos_imputados, 
    method='pearson', 
    annot=True, 
    max_variables=10,
    show_only_significant=True, 
    threshold=0.4
)

⚠️  Mostrando 10 de 37 variables numéricas
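
The matrix behind such a heatmap can be sketched with pandas alone. The code below assumes that `max_variables` keeps the first N numeric columns and that `show_only_significant` with `threshold=0.4` hides cells whose absolute correlation is below 0.4. Both readings are guesses about the library's behavior, and `masked_correlation` is a hypothetical name.

```python
import pandas as pd

def masked_correlation(df: pd.DataFrame, method: str = "spearman",
                       threshold: float = 0.4, max_variables: int = 10) -> pd.DataFrame:
    """Correlation matrix with weak correlations replaced by NaN."""
    numeric = df.select_dtypes(include="number").iloc[:, :max_variables]
    corr = numeric.corr(method=method)
    # Keep only cells with |r| >= threshold; the rest become NaN.
    return corr.where(corr.abs() >= threshold)

demo = pd.DataFrame({
    "x": range(1, 11),
    "y": [v ** 3 for v in range(1, 11)],  # monotonic in x but not linear
    "z": [1, 2] * 5,                      # weakly related to x
})
masked = masked_correlation(demo, method="spearman", threshold=0.4)
print(masked)
```

Spearman works on ranks, so it gives a correlation of 1 for any strictly increasing relationship such as `y = x ** 3`, whereas Pearson would report a lower value for the same pair. That difference is the main reason to compare both methods on skewed columns.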
