## Practice: Exploratory Analysis with Visualizations and a Custom Library


Deepen the exploratory data analysis by building a custom visualization library of reusable functions, applying good development and statistical-analysis practices, and documenting everything in a professional repository (GitHub) and an explanatory PDF.

## 1. Preprocessing

In [1]:
import pandas as pd
from libreria_modulo_1 import analysis, preprocessing, visualization 

In [2]:
datos = pd.read_csv('CTG.csv')
datos.head()

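|  | FileName | Date | SegFile | b | e | LBE | LB | AC | FM | UC | ... | C | D | E | AD | DE | LD | FS | SUSP | CLASS | NSP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Variab10.txt | 12/1/1996 | CTG0001.txt | 240.0 | 357.0 | 120.0 | 120.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 9.0 | 2.0 |
| 1 | Fmcs_1.txt | 5/3/1996 | CTG0002.txt | 5.0 | 632.0 | 132.0 | 132.0 | 4.0 | 0.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 1.0 |
| 2 | Fmcs_1.txt | 5/3/1996 | CTG0003.txt | 177.0 | 779.0 | 133.0 | 133.0 | 2.0 | 0.0 | 5.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 1.0 |
| 3 | Fmcs_1.txt | 5/3/1996 | CTG0004.txt | 411.0 | 1192.0 | 134.0 | 134.0 | 2.0 | 0.0 | 6.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 1.0 |
| 4 | Fmcs_1.txt | 5/3/1996 | CTG0005.txt | 533.0 | 1147.0 | 132.0 | 132.0 | 4.0 | 0.0 | 5.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 1.0 |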


In [3]:
analysis.completitud_datos(datos)

FileName    0.0014
Date        0.0014
SegFile     0.0014
b           0.0014
e           0.0014
LBE         0.0014
LB          0.0014
AC          0.0014
Max         0.0014
Nmax        0.0014
Mode        0.0014
Nzeros      0.0014
Width       0.0014
Min         0.0014
AD          0.0014
E           0.0014
D           0.0014
C           0.0014
B           0.0014
A           0.0014
Tendency    0.0014
Variance    0.0014
Median      0.0014
Mean        0.0014
FS          0.0014
SUSP        0.0014
DE          0.0014
LD          0.0014
CLASS       0.0014
NSP         0.0014
UC          0.0009
FM          0.0009
ALTV        0.0009
MLTV        0.0009
ASTV        0.0009
MSTV        0.0009
DR          0.0005
DP          0.0005
DL          0.0005
DS          0.0005
dtype: float64

Because the source data is almost complete (at most 0.14% nulls per column), a helper function was created to inject NaN values artificially.
The function `analysis.completitud_datos` does not replace `check_data_completeness_JavierMartinezReyes`; it only reports the fraction of null values in each column.
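
As an illustration of what a per-column null report of this kind can look like, here is a minimal sketch. It is not the source of `analysis.completitud_datos`; the function name `null_fraction`, the rounding to four decimals, and the descending sort are assumptions chosen to resemble the output shown in this notebook.

```python
import numpy as np
import pandas as pd


def null_fraction(df: pd.DataFrame, decimals: int = 4) -> pd.Series:
    """Fraction of missing values per column, highest first (hypothetical helper)."""
    return df.isna().mean().sort_values(ascending=False).round(decimals)


# Tiny illustrative frame (not the CTG data)
demo = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, 4.0],
        "b": [np.nan, np.nan, 1.0, 2.0],
        "c": [1, 2, 3, 4],
    }
)
report = null_fraction(demo)
print(report)
```

`DataFrame.isna().mean()` works because booleans average to the share of `True` values, so no explicit division by the row count is needed.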

In [4]:
datos_sucios = preprocessing.agrega_nan(datos, min_frac=0.0, max_frac=0.3)
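
The internals of `preprocessing.agrega_nan` are not shown in this notebook. One plausible way to implement such a helper is to draw a random missing fraction per column between `min_frac` and `max_frac` and blank out that many randomly chosen rows. The sketch below follows that assumption; `inject_nan` and its `seed` parameter are hypothetical names.

```python
import numpy as np
import pandas as pd


def inject_nan(df, min_frac=0.0, max_frac=0.3, seed=None):
    """Return a copy of df with a random share of NaN in every column (sketch)."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    n_rows = len(out)
    for col in out.columns:
        # Each column gets its own missing fraction in [min_frac, max_frac)
        frac = rng.uniform(min_frac, max_frac)
        n_missing = int(round(frac * n_rows))
        idx = rng.choice(n_rows, size=n_missing, replace=False)
        # Integer columns cannot hold NaN, so cast them to float first
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = out[col].astype(float)
        out.iloc[idx, out.columns.get_loc(col)] = np.nan
    return out


# Illustrative data, not the CTG file
clean = pd.DataFrame({"x": np.arange(100), "y": np.linspace(0, 1, 100), "s": ["t"] * 100})
dirty = inject_nan(clean, min_frac=0.0, max_frac=0.3, seed=0)
print(dirty.isna().mean())
```

Passing a fixed `seed` makes the injected pattern reproducible, which helps when the downstream cleaning steps need to be compared across runs.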

In [5]:
analysis.completitud_datos(datos_sucios)

ALTV        0.2950
DP          0.2912
Mean        0.2649
Min         0.2466
MLTV        0.2395
Nmax        0.2273
e           0.2222
E           0.2161
SUSP        0.2151
FM          0.2048
Date        0.2029
Max         0.1982
LBE         0.1832
SegFile     0.1785
DR          0.1677
b           0.1555
D           0.1555
Tendency    0.1555
Nzeros      0.1489
Width       0.1442
UC          0.1442
B           0.1428
C           0.1419
A           0.1381
DS          0.1329
FileName    0.1301
LB          0.0784
MSTV        0.0761
DE          0.0629
FS          0.0625
CLASS       0.0601
Variance    0.0432
NSP         0.0404
AD          0.0380
ASTV        0.0343
DL          0.0146
LD          0.0117
Median      0.0070
AC          0.0047
Mode        0.0042
dtype: float64

### 1.1 Drop columns with more than 20% missing values

In [6]:
df_limpio = preprocessing.delete_missing_values(datos_sucios, porcentage=0.2)
print("Forma del DataFrame limpio:", df_limpio.shape) 

Análisis inicial:
- Filas: 2129
- Columnas originales: 40
- Porcentaje máximo de NaN permitido por columna: 20.0%
- Máximo de NaN permitidos por columna: 425/2129

Análisis por columnas:
- Columnas con más de 20.0% de NaN: 11
- Columnas a eliminar: ['Date', 'e', 'FM', 'ALTV', 'MLTV', 'DP', 'Min', 'Nmax', 'Mean', 'E', 'SUSP']

Resultados:
- Columnas eliminadas: 11
- Columnas restantes: 29
- Forma final del DataFrame: (2129, 29)
Forma del DataFrame limpio: (2129, 29)
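
The log above reports a cap of 425 NaN out of 2129 rows, which matches 0.2 × 2129 = 425.8 truncated to an integer. A threshold-based column drop can be sketched in plain pandas as follows. This is an assumption about the approach, not the code of `preprocessing.delete_missing_values`, and `drop_sparse_columns` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_null_frac=0.2):
    """Drop columns whose share of NaN is strictly above max_null_frac (sketch)."""
    null_frac = df.isna().mean()
    to_drop = null_frac[null_frac > max_null_frac].index.tolist()
    return df.drop(columns=to_drop), to_drop


# Illustrative data: "x" has 30% NaN, "y" has 10%, "z" has none
demo = pd.DataFrame(
    {
        "x": [np.nan, np.nan, np.nan, 4, 5, 6, 7, 8, 9, 10],
        "y": [np.nan, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "z": range(10),
    }
)
cleaned, dropped = drop_sparse_columns(demo, max_null_frac=0.2)
print(dropped, cleaned.shape)
```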


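### 1.2 Impute the remaining missing values with suitable methods

Imputation below runs on `datos_sucios` (all 40 columns) rather than on `df_limpio`, so the 11 columns dropped in 1.1 are still present in `datos_imputados`.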


In [7]:
datos_imputados = preprocessing.impute_missing_values(datos_sucios, method='knn')


Valores faltantes antes de imputar:
11669 en total
Columna categórica 'FileName': imputada con moda 'S8001034.dsp'
Columna categórica 'Date': imputada con moda '2/22/1995'
Columna categórica 'SegFile': imputada con moda 'CTG0001.txt'
Columnas numéricas: imputadas con KNN
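
The log shows two strategies: the mode for categorical columns and KNN for numeric ones. A minimal sketch of that combination, assuming scikit-learn's `KNNImputer` is an acceptable stand-in for whatever `preprocessing.impute_missing_values` uses internally, could look like this. `impute_mixed` and `n_neighbors=5` are hypothetical choices.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def impute_mixed(df, n_neighbors=5):
    """Mode for non-numeric columns, KNN for numeric columns (sketch)."""
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    cat_cols = out.columns.difference(num_cols)
    for col in cat_cols:
        mode = out[col].mode(dropna=True)
        if not mode.empty:
            out[col] = out[col].fillna(mode.iloc[0])
    if len(num_cols) > 0:
        # Assumes no numeric column is entirely NaN; KNNImputer would drop it
        imputer = KNNImputer(n_neighbors=n_neighbors)
        out[num_cols] = imputer.fit_transform(out[num_cols])
    return out


# Illustrative data, not the CTG file
demo = pd.DataFrame(
    {
        "site": ["a", "a", None, "b"],
        "x": [1.0, 2.0, np.nan, 4.0],
        "y": [10.0, np.nan, 30.0, 40.0],
    }
)
filled = impute_mixed(demo, n_neighbors=2)
print(filled)
```

Because KNN imputation averages the values of neighboring rows, it can produce numbers that never occurred in the raw data, which would explain fractional entries such as `b = 1537.4` or `UC = 3.6` in the `head()` output of the imputed frame.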


In [8]:
datos_imputados.head()

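|  | FileName | Date | SegFile | b | e | LBE | LB | AC | FM | UC | ... | C | D | E | AD | DE | LD | FS | SUSP | CLASS | NSP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Variab10.txt | 12/1/1996 | CTG0001.txt | 240.0 | 1856.0 | 120.0 | 120.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 9.0 | 2.0 |
| 1 | Fmcs_1.txt | 5/3/1996 | CTG0002.txt | 5.0 | 632.0 | 132.0 | 132.0 | 4.0 | 69.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 1.0 |
| 2 | Fmcs_1.txt | 5/3/1996 | CTG0003.txt | 1537.4 | 779.0 | 133.0 | 133.0 | 2.0 | 0.0 | 5.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 1.0 |
| 3 | Fmcs_1.txt | 5/3/1996 | CTG0004.txt | 411.0 | 1192.0 | 134.0 | 134.0 | 2.0 | 0.0 | 6.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 1.0 |
| 4 | S8001034.dsp | 2/22/1995 | CTG0005.txt | 533.0 | 1147.0 | 132.0 | 132.0 | 4.0 | 0.0 | 3.6 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 1.0 |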


In [9]:
# Check for "nan" stored as strings
print("Checking for 'nan' values stored as strings:")
for col in datos_imputados.columns:
    if datos_imputados[col].dtype == 'object':
        nan_strings = (datos_imputados[col] == 'nan').sum()
        if nan_strings > 0:
            print(f"Column '{col}': {nan_strings} 'nan' values stored as strings")

# Also check for real NaN values
print("\nRemaining real NaN values:")
print(datos_imputados.isnull().sum().sum())

Checking for 'nan' values stored as strings:

Remaining real NaN values:
0


In [10]:
analysis.completitud_datos(datos_imputados)

FileName    0.0
Date        0.0
SegFile     0.0
b           0.0
e           0.0
LBE         0.0
LB          0.0
AC          0.0
FM          0.0
UC          0.0
ASTV        0.0
MSTV        0.0
ALTV        0.0
MLTV        0.0
DL          0.0
DS          0.0
DP          0.0
DR          0.0
Width       0.0
Min         0.0
Max         0.0
Nmax        0.0
Nzeros      0.0
Mode        0.0
Mean        0.0
Median      0.0
Variance    0.0
Tendency    0.0
A           0.0
B           0.0
C           0.0
D           0.0
E           0.0
AD          0.0
DE          0.0
LD          0.0
FS          0.0
SUSP        0.0
CLASS       0.0
NSP         0.0
dtype: float64

In [11]:
datos_imputados.describe()

|  | b | e | LBE | LB | AC | FM | UC | ASTV | MSTV | ALTV | ... | C | D | E | AD | DE | LD | FS | SUSP | CLASS | NSP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 | ... | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 |
| mean | 902.318674 | 1686.537724 | 133.371958 | 133.300477 | 2.723498 | 8.458368 | 3.687401 | 47.06326 | 1.327673 | 10.133649 | ... | 0.021335 | 0.035621 | 0.033929 | 0.155169 | 0.118234 | 0.050282 | 0.031391 | 0.090695 | 4.481959 | 1.299907 |
| std | 852.519797 | 854.546418 | 9.54944 | 9.711305 | 3.557594 | 36.368589 | 2.72516 | 17.048415 | 0.881275 | 17.609825 | ... | 0.136469 | 0.178312 | 0.167453 | 0.358623 | 0.31695 | 0.218526 | 0.171765 | 0.269481 | 2.981694 | 0.604121 |
| min | 0.0 | 287.0 | 106.0 | 106.0 | 0.0 | 0.0 | 0.0 | 12.0 | 0.2 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 25% | 132.0 | 1062.2 | 127.0 | 126.0 | 0.0 | 0.0 | 2.0 | 32.0 | 0.7 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 1.0 |
| 50% | 650.0 | 1423.0 | 133.0 | 133.0 | 1.0 | 0.0 | 3.4 | 49.0 | 1.2 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 1.0 |
| 75% | 1514.0 | 2274.0 | 140.2 | 140.0 | 4.0 | 3.0 | 5.0 | 61.0 | 1.7 | 12.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 1.0 |
| max | 3296.0 | 3599.0 | 160.0 | 160.0 | 26.0 | 564.0 | 23.0 | 87.0 | 7.0 | 91.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 10.0 | 3.0 |


### 1.3 Detect and treat outliers with IQR or z-score

In [12]:
outliers_iqr = preprocessing.detect_outliers_iqr(datos_imputados, factor=1.5)

In [13]:
outliers_iqr

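|  | columna | Q1 | Q3 | IQR | limite_inferior | limite_superior | num_outliers | porcentaje_outliers |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | b | 132.0 | 1514.0 | 1382.0 | -1941.0 | 3587.0 | 0 | 0.0 |
| 1 | e | 1062.2 | 2274.0 | 1211.8 | -755.5 | 4091.7 | 0 | 0.0 |
| 2 | LBE | 127.0 | 140.2 | 13.2 | 107.2 | 160.0 | 8 | 0.375763 |
| 3 | LB | 126.0 | 140.0 | 14.0 | 105.0 | 161.0 | 0 | 0.0 |
| 4 | AC | 0.0 | 4.0 | 4.0 | -6.0 | 10.0 | 83 | 3.898544 |
| 5 | FM | 0.0 | 3.0 | 3.0 | -4.5 | 7.5 | 285 | 13.386566 |
| 6 | UC | 2.0 | 5.0 | 3.0 | -2.5 | 9.5 | 63 | 2.959136 |
| 7 | ASTV | 32.0 | 61.0 | 29.0 | -11.5 | 104.5 | 0 | 0.0 |
| 8 | MSTV | 0.7 | 1.7 | 1.0 | -0.8 | 3.2 | 70 | 3.287929 |
| 9 | ALTV | 0.0 | 12.0 | 12.0 | -18.0 | 30.0 | 266 | 12.494129 |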
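
The limits in this table follow the usual IQR rule. For `LBE`, for example, IQR = 140.2 − 127.0 = 13.2, so the lower limit is 127.0 − 1.5 × 13.2 = 107.2 and the upper limit is 140.2 + 1.5 × 13.2 = 160.0. The sketch below reproduces that rule in pandas; it is an illustration under the assumption that `detect_outliers_iqr` works this way, and `iqr_outlier_report` is a hypothetical name.

```python
import pandas as pd


def iqr_outlier_report(df, factor=1.5):
    """Per-column IQR limits and outlier counts for numeric columns (sketch)."""
    rows = []
    for col in df.select_dtypes(include="number").columns:
        s = df[col].dropna()
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - factor * iqr, q3 + factor * iqr
        n_out = int(((s < lower) | (s > upper)).sum())
        rows.append(
            {
                "columna": col,
                "Q1": q1,
                "Q3": q3,
                "IQR": iqr,
                "limite_inferior": lower,
                "limite_superior": upper,
                "num_outliers": n_out,
                "porcentaje_outliers": 100 * n_out / len(s) if len(s) else 0.0,
            }
        )
    return pd.DataFrame(rows)


# Illustrative data with one obvious outlier
demo = pd.DataFrame({"v": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]})
iqr_report = iqr_outlier_report(demo)
print(iqr_report)
```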


In [14]:
outliers_zscore = preprocessing.detect_outliers_zscore(datos_imputados, threshold=3.0)

In [15]:
outliers_zscore

|  | columna | media | std | threshold | num_outliers | porcentaje_outliers |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | b | 902.318674 | 852.519797 | 3.0 | 0 | 0.0 |
| 1 | e | 1686.537724 | 854.546418 | 3.0 | 0 | 0.0 |
| 2 | LBE | 133.371958 | 9.54944 | 3.0 | 0 | 0.0 |
| 3 | LB | 133.300477 | 9.711305 | 3.0 | 0 | 0.0 |
| 4 | AC | 2.723498 | 3.557594 | 3.0 | 35 | 1.643964 |
| 5 | FM | 8.458368 | 36.368589 | 3.0 | 33 | 1.550023 |
| 6 | UC | 3.687401 | 2.72516 | 3.0 | 22 | 1.033349 |
| 7 | ASTV | 47.06326 | 17.048415 | 3.0 | 0 | 0.0 |
| 8 | MSTV | 1.327673 | 0.881275 | 3.0 | 33 | 1.550023 |
| 9 | ALTV | 10.133649 | 17.609825 | 3.0 | 60 | 2.818225 |
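
The two methods disagree most on skewed columns: for `FM`, the IQR rule flags 285 rows (13.39%) while the z-score rule flags 33 (1.55%), because a few extreme values inflate the standard deviation. The `std` values in this table match those from `describe()`, so the sketch below uses the sample standard deviation (pandas default, `ddof=1`). It is an illustration, not the code of `detect_outliers_zscore`, and `zscore_outlier_report` is a hypothetical name.

```python
import pandas as pd


def zscore_outlier_report(df, threshold=3.0):
    """Count values whose absolute z-score exceeds threshold, per numeric column (sketch)."""
    rows = []
    for col in df.select_dtypes(include="number").columns:
        s = df[col].dropna()
        mean, std = s.mean(), s.std()  # pandas uses ddof=1 by default
        if std == 0 or pd.isna(std):
            n_out = 0  # constant or single-value column: no z-scores to compare
        else:
            n_out = int((((s - mean) / std).abs() > threshold).sum())
        rows.append(
            {
                "columna": col,
                "media": mean,
                "std": std,
                "threshold": threshold,
                "num_outliers": n_out,
                "porcentaje_outliers": 100 * n_out / len(s) if len(s) else 0.0,
            }
        )
    return pd.DataFrame(rows)


# Illustrative data: twenty zeros and one large value
demo = pd.DataFrame({"v": [0.0] * 20 + [100.0], "const": [5.0] * 21})
z_report = zscore_outlier_report(demo, threshold=3.0)
print(z_report)
```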


## 2. Data Analysis

Full analysis of data completeness and characteristics using the custom function.

In [16]:
# Full analysis of data completeness and characteristics
resultado_completo = analysis.check_data_completeness_JavierMartinezReyes(datos_imputados)

ANÁLISIS COMPLETO DE COMPLETITUD DE DATOS
Dimensiones del DataFrame: 2129 filas x 40 columnas
Total de valores: 85,160
Total de valores nulos: 0
Porcentaje general de completitud: 100.00%

CLASIFICACIÓN DE VARIABLES:
- Continua: 24 columnas
- Discreta: 13 columnas
- Categórica_Alta: 2 columnas
- Categórica_Media: 1 columnas

Columnas con mayor porcentaje de nulos:
- FileName: 0.0% (Categórica_Alta)
- Date: 0.0% (Categórica_Media)
- SegFile: 0.0% (Categórica_Alta)
- b: 0.0% (Continua)
- e: 0.0% (Continua)


In [17]:
# Show the general summary
print("📊 GENERAL SUMMARY:")
resultado_completo['resumen_general']

📊 GENERAL SUMMARY:


|  | columna | tipo_dato | valores_totales | valores_no_nulos | valores_nulos | porcentaje_completitud | porcentaje_nulos | valores_unicos | clasificacion |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | FileName | object | 2129 | 2129 | 0 | 100.0 | 0.0 | 348 | Categórica_Alta |
| 1 | Date | object | 2129 | 2129 | 0 | 100.0 | 0.0 | 48 | Categórica_Media |
| 2 | SegFile | object | 2129 | 2129 | 0 | 100.0 | 0.0 | 1749 | Categórica_Alta |
| 3 | b | float64 | 2129 | 2129 | 0 | 100.0 | 0.0 | 1186 | Continua |
| 4 | e | float64 | 2129 | 2129 | 0 | 100.0 | 0.0 | 1350 | Continua |
| 5 | LBE | float64 | 2129 | 2129 | 0 | 100.0 | 0.0 | 166 | Continua |
| 6 | LB | float64 | 2129 | 2129 | 0 | 100.0 | 0.0 | 128 | Continua |
| 7 | AC | float64 | 2129 | 2129 | 0 | 100.0 | 0.0 | 28 | Continua |
| 8 | FM | float64 | 2129 | 2129 | 0 | 100.0 | 0.0 | 177 | Continua |
| 9 | UC | float64 | 2129 | 2129 | 0 | 100.0 | 0.0 | 50 | Continua |


In [18]:
# Show the dispersion statistics
print("📈 DISPERSION STATISTICS (Numeric Variables):")
resultado_completo['estadisticos_dispersion']

📈 DISPERSION STATISTICS (Numeric Variables):


|  | columna | tipo | media | mediana | desv_std | varianza | min | max | q25 | q75 | rango | coef_variacion |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | b | Continua | 902.3187 | 650.0 | 852.5198 | 726790.0048 | 0.0 | 3296.0 | 132.0 | 1514.0 | 3296.0 | 94.481 |
| 1 | e | Continua | 1686.5377 | 1423.0 | 854.5464 | 730249.5801 | 287.0 | 3599.0 | 1062.2 | 2274.0 | 3312.0 | 50.6687 |
| 2 | LBE | Continua | 133.372 | 133.0 | 9.5494 | 91.1918 | 106.0 | 160.0 | 127.0 | 140.2 | 54.0 | 7.16 |
| 3 | LB | Continua | 133.3005 | 133.0 | 9.7113 | 94.3094 | 106.0 | 160.0 | 126.0 | 140.0 | 54.0 | 7.2853 |
| 4 | AC | Continua | 2.7235 | 1.0 | 3.5576 | 12.6565 | 0.0 | 26.0 | 0.0 | 4.0 | 26.0 | 130.6259 |
| 5 | FM | Continua | 8.4584 | 0.0 | 36.3686 | 1322.6743 | 0.0 | 564.0 | 0.0 | 3.0 | 564.0 | 429.9717 |
| 6 | UC | Continua | 3.6874 | 3.4 | 2.7252 | 7.4265 | 0.0 | 23.0 | 2.0 | 5.0 | 23.0 | 73.9046 |
| 7 | ASTV | Continua | 47.0633 | 49.0 | 17.0484 | 290.6485 | 12.0 | 87.0 | 32.0 | 61.0 | 75.0 | 36.2245 |
| 8 | MSTV | Continua | 1.3277 | 1.2 | 0.8813 | 0.7766 | 0.2 | 7.0 | 0.7 | 1.7 | 6.8 | 66.3774 |
| 9 | ALTV | Continua | 10.1336 | 1.0 | 17.6098 | 310.1059 | 0.0 | 91.0 | 0.0 | 12.0 | 91.0 | 173.7758 |
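
The derived columns in this table can be checked by hand: `rango` is `max − min` (for `e`, 3599.0 − 287.0 = 3312.0) and `coef_variacion` is `desv_std / media × 100` (for `b`, 852.5198 / 902.3187 × 100 ≈ 94.48). A minimal pandas sketch of such a table, assuming these definitions and using the hypothetical name `dispersion_stats`, is shown below.

```python
import pandas as pd


def dispersion_stats(df):
    """Dispersion statistics for numeric columns, one row per column (sketch)."""
    num = df.select_dtypes(include="number")
    stats = pd.DataFrame(
        {
            "media": num.mean(),
            "mediana": num.median(),
            "desv_std": num.std(),
            "varianza": num.var(),
            "min": num.min(),
            "max": num.max(),
            "q25": num.quantile(0.25),
            "q75": num.quantile(0.75),
        }
    )
    stats["rango"] = stats["max"] - stats["min"]
    # Coefficient of variation in percent; undefined when the mean is zero
    stats["coef_variacion"] = stats["desv_std"] / stats["media"] * 100
    return stats.round(4)


# Illustrative data, not the CTG file
demo = pd.DataFrame({"x": [2, 4, 4, 4, 5, 5, 7, 9]})
disp = dispersion_stats(demo)
print(disp)
```

A coefficient of variation far above 100, as seen for `FM` (429.97), signals that the standard deviation dwarfs the mean, which is typical of zero-inflated, right-skewed counts.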


In [19]:
# Show the variable classification
print("🏷️ AUTOMATIC VARIABLE CLASSIFICATION:")
resultado_completo['clasificacion_variables']

🏷️ AUTOMATIC VARIABLE CLASSIFICATION:


|  | columna | clasificacion | criterio | tipo_original | es_numerica | es_categorica |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | FileName | Categórica_Alta | Valores únicos: 348 | object | False | True |
| 1 | Date | Categórica_Media | Valores únicos: 48 | object | False | True |
| 2 | SegFile | Categórica_Alta | Valores únicos: 1749 | object | False | True |
| 3 | b | Continua | Valores únicos: 1186 | float64 | True | False |
| 4 | e | Continua | Valores únicos: 1350 | float64 | True | False |
| 5 | LBE | Continua | Valores únicos: 166 | float64 | True | False |
| 6 | LB | Continua | Valores únicos: 128 | float64 | True | False |
| 7 | AC | Continua | Valores únicos: 28 | float64 | True | False |
| 8 | FM | Continua | Valores únicos: 177 | float64 | True | False |
| 9 | UC | Continua | Valores únicos: 50 | float64 | True | False |


## 3. Visualizations (raise the difficulty by adding interactivity, statistics, or multiple variables)

### 3.1 Histograms

In [20]:
visualization.plot_interactive_histogram(datos_imputados, column='LBE')

📁 Directorio 'plots' creado para guardar gráficos
⚠️  Error guardando gráfico: 
Image export using the "kaleido" engine requires the Kaleido package,
which can be installed using pip:

    $ pip install --upgrade kaleido

💡 Instala kaleido para guardar imágenes: pip install kaleido


In [21]:
visualization.plot_interactive_histogram(datos_imputados, column='UC', group_by='D', bins=60)



### 3.2 Boxplots

In [22]:
visualization.plot_interactive_boxplot(datos_imputados, column='b', group_by='NSP')



In [23]:
visualization.plot_interactive_boxplot(datos_imputados, column='b', target_class='D')



### 3.3 Horizontal Bars

In [24]:
visualization.plot_interactive_bar_horizontal(datos_imputados, column='LBE')

### 3.4 Lines


In [25]:
visualization.plot_interactive_line_timeseries(datos_imputados, x_column='e', y_column='DP')

### 3.5 Dot Plots

In [26]:
visualization.plot_interactive_dot_comparison(datos_imputados, column='b', group1='6/6/1998', group2='5/10/1998', group_column='Date')

### 3.6 Density

In [27]:
visualization.plot_interactive_density_multiclass(datos_imputados, column='b', class_column='DP')

### 3.7 Violin

In [28]:
visualization.plot_interactive_violin_swarm(datos_imputados, column='b', group_by='DP')

### 3.8 Heatmap

In [29]:
# Basic heatmap: Spearman correlation
visualization.plot_interactive_correlation_heatmap(
    datos_imputados, 
    method='spearman', 
    annot=True, 
    max_variables=10,
    show_only_significant=True, 
    threshold=0.4
)

⚠️  Mostrando 10 de 37 variables numéricas
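
The heatmap call limits the plot to 10 of the 37 numeric variables and, with `show_only_significant=True` and `threshold=0.4`, presumably hides weak correlations. The matrix behind such a plot can be sketched in pandas as follows. Taking the first `max_variables` numeric columns and masking cells with |r| below the threshold are assumptions about the library's behavior, and `thresholded_corr` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def thresholded_corr(df, method="spearman", max_variables=10, threshold=0.4):
    """Correlation matrix with entries below |threshold| replaced by NaN (sketch)."""
    num = df.select_dtypes(include="number").iloc[:, :max_variables]
    corr = num.corr(method=method)
    return corr.where(corr.abs() >= threshold)


# Illustrative data: y is a monotonic function of x, z alternates independently of x
x = np.arange(50)
demo = pd.DataFrame({"x": x, "y": x**2, "z": np.tile([1, -1], 25)})
masked = thresholded_corr(demo, method="spearman", threshold=0.4)
print(masked)
```

Spearman works on ranks, so it captures the monotonic but non-linear link between `x` and `y` fully, whereas Pearson would report a value below 1 for the same pair.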


In [30]:

visualization.plot_interactive_correlation_heatmap(
    datos_imputados, 
    method='pearson', 
    annot=True, 
    max_variables=10,
    show_only_significant=True, 
    threshold=0.4
)

⚠️  Mostrando 10 de 37 variables numéricas
