## Practice: Exploratory Analysis with Visualizations and a Custom Library


The goal of this practice is to deepen exploratory data analysis by building a custom visualization library of reusable functions, applying good development and statistical-analysis practices, and documenting everything in a professional repository (GitHub) and an explanatory PDF.

## 1. Preprocessing

In [1]:
import pandas as pd
from libreria_modulo_1 import analysis, preprocessing, visualization 

In [2]:
datos = pd.read_csv('CTG.csv')
datos.head()

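|  | FileName | Date | SegFile | b | e | LBE | LB | AC | FM | UC | ... | C | D | E | AD | DE | LD | FS | SUSP | CLASS | NSP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Variab10.txt | 12/1/1996 | CTG0001.txt | 240.0 | 357.0 | 120.0 | 120.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 9.0 | 2.0 |
| 1 | Fmcs_1.txt | 5/3/1996 | CTG0002.txt | 5.0 | 632.0 | 132.0 | 132.0 | 4.0 | 0.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 1.0 |
| 2 | Fmcs_1.txt | 5/3/1996 | CTG0003.txt | 177.0 | 779.0 | 133.0 | 133.0 | 2.0 | 0.0 | 5.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 1.0 |
| 3 | Fmcs_1.txt | 5/3/1996 | CTG0004.txt | 411.0 | 1192.0 | 134.0 | 134.0 | 2.0 | 0.0 | 6.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 1.0 |
| 4 | Fmcs_1.txt | 5/3/1996 | CTG0005.txt | 533.0 | 1147.0 | 132.0 | 132.0 | 4.0 | 0.0 | 5.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 1.0 |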


In [3]:
analysis.completitud_datos(datos)

FileName    0.0014
Date        0.0014
SegFile     0.0014
b           0.0014
e           0.0014
LBE         0.0014
LB          0.0014
AC          0.0014
Max         0.0014
Nmax        0.0014
Mode        0.0014
Nzeros      0.0014
Width       0.0014
Min         0.0014
AD          0.0014
E           0.0014
D           0.0014
C           0.0014
B           0.0014
A           0.0014
Tendency    0.0014
Variance    0.0014
Median      0.0014
Mean        0.0014
FS          0.0014
SUSP        0.0014
DE          0.0014
LD          0.0014
CLASS       0.0014
NSP         0.0014
UC          0.0009
FM          0.0009
ALTV        0.0009
MLTV        0.0009
ASTV        0.0009
MSTV        0.0009
DR          0.0005
DP          0.0005
DL          0.0005
DS          0.0005
dtype: float64

Because the source data is nearly complete, a helper function was written to inject NaN values artificially.
The function `analysis.completitud_datos` does not replace `check_data_completeness_JavierMartinezReyes`; it only reports the fraction of null values in each column.
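The internals of `preprocessing.agrega_nan` and `analysis.completitud_datos` are not shown in this notebook. The following is a minimal sketch of how such helpers could work, assuming each column receives its own NaN fraction drawn uniformly between `min_frac` and `max_frac`. The names `add_random_nans` and `null_fraction` and the `seed` parameter are hypothetical.

```python
import numpy as np
import pandas as pd


def add_random_nans(df, min_frac=0.0, max_frac=0.3, seed=0):
    """Return a copy of df in which every column gets a random share of NaN.

    Hypothetical stand-in for preprocessing.agrega_nan: each column draws its
    own fraction uniformly from [min_frac, max_frac].
    """
    rng = np.random.default_rng(seed)
    dirty = df.copy()
    n_rows = len(dirty)
    for col in dirty.columns:
        frac = rng.uniform(min_frac, max_frac)
        n_missing = int(frac * n_rows)
        mask = np.zeros(n_rows, dtype=bool)
        mask[rng.choice(n_rows, size=n_missing, replace=False)] = True
        # Series.mask replaces the flagged rows with NaN (ints become floats)
        dirty[col] = dirty[col].mask(mask)
    return dirty


def null_fraction(df):
    """Fraction of null values per column, sorted from worst to best."""
    return df.isnull().mean().round(4).sort_values(ascending=False)


demo = pd.DataFrame({"a": np.arange(100.0), "b": ["x"] * 100})
dirty = add_random_nans(demo, min_frac=0.1, max_frac=0.3, seed=42)
print(null_fraction(dirty))
```

Working on a copy keeps the original DataFrame intact, so the clean and degraded versions can be compared side by side.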

In [4]:
datos_sucios = preprocessing.agrega_nan(datos, min_frac=0.0, max_frac=0.3)

In [5]:
analysis.completitud_datos(datos_sucios)

Nmax        0.2795
Mode        0.2790
Nzeros      0.2762
Min         0.2659
DE          0.2485
ASTV        0.2306
B           0.2264
e           0.2259
SegFile     0.1968
AC          0.1898
CLASS       0.1738
A           0.1705
AD          0.1696
LD          0.1606
LBE         0.1541
FS          0.1517
SUSP        0.1503
E           0.1498
DL          0.1428
DR          0.1273
MLTV        0.1179
NSP         0.1019
D           0.0939
Median      0.0935
FileName    0.0911
UC          0.0855
Width       0.0737
Tendency    0.0709
Max         0.0700
ALTV        0.0690
Date        0.0587
Variance    0.0399
FM          0.0385
DS          0.0272
LB          0.0160
DP          0.0132
MSTV        0.0108
b           0.0061
C           0.0056
Mean        0.0028
dtype: float64

### 1.1 Drop columns with more than 20% null values

In [6]:
df_limpio = preprocessing.delete_missing_values(datos_sucios, porcentage=0.2)
print("Forma del DataFrame limpio:", df_limpio.shape) 

Análisis inicial:
- Filas: 2129
- Columnas originales: 40
- Porcentaje máximo de NaN permitido por columna: 20.0%
- Máximo de NaN permitidos por columna: 425/2129

Análisis por columnas:
- Columnas con más de 20.0% de NaN: 8
- Columnas a eliminar: ['e', 'ASTV', 'Min', 'Nmax', 'Nzeros', 'Mode', 'B', 'DE']

Resultados:
- Columnas eliminadas: 8
- Columnas restantes: 32
- Forma final del DataFrame: (2129, 32)
Forma del DataFrame limpio: (2129, 32)
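The column filter reported in this log can be sketched in a few lines of pandas. This is an illustration under the assumption that a column is dropped when its NaN share is strictly greater than the threshold. `drop_sparse_columns` is a hypothetical name, not the library's `delete_missing_values`.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_null_frac=0.2):
    """Drop columns whose share of NaN is strictly above max_null_frac."""
    null_frac = df.isnull().mean()
    to_drop = null_frac[null_frac > max_null_frac].index.tolist()
    return df.drop(columns=to_drop), to_drop


demo = pd.DataFrame({
    # 1 NaN out of 10 rows: 10% missing, kept
    "keep": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    # 3 NaN out of 10 rows: 30% missing, dropped
    "drop": [np.nan, np.nan, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
})
clean, dropped = drop_sparse_columns(demo, max_null_frac=0.2)
print(dropped, clean.shape)
```

Returning the list of dropped columns alongside the filtered DataFrame makes it easy to log the decision, as the output above does.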


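### 1.2 Impute the remaining missing values with suitable methods

Note that the call below imputes `datos_sucios`, not `df_limpio`, so the eight columns identified in 1.1 are still present in `datos_imputados` (40 columns).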


In [7]:
datos_imputados = preprocessing.impute_missing_values(datos_sucios, method='knn')


Valores faltantes antes de imputar:
10763 en total
Columna categórica 'FileName': imputada con moda 'S8001034.dsp'
Columna categórica 'Date': imputada con moda '2/22/1995'
Columna categórica 'SegFile': imputada con moda 'CTG0002.txt'
Columnas numéricas: imputadas con KNN
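The log shows categorical columns filled with their mode and numeric columns filled with KNN. A minimal sketch of that strategy follows, assuming scikit-learn's `KNNImputer` with five neighbours. The library's actual implementation and neighbour count are not shown here, and `impute_knn` is a hypothetical name.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def impute_knn(df, n_neighbors=5):
    """Fill categorical columns with their mode and numeric ones with KNN."""
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    cat_cols = out.columns.difference(num_cols)

    # Categorical columns: replace missing entries with the most frequent value
    for col in cat_cols:
        if out[col].isnull().any():
            out[col] = out[col].fillna(out[col].mode().iloc[0])

    # Numeric columns: average the values of the nearest rows
    if len(num_cols) > 0:
        imputer = KNNImputer(n_neighbors=n_neighbors)
        out[num_cols] = imputer.fit_transform(out[num_cols])
    return out


demo = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0],
    "y": [10.0, 20.0, 30.0, np.nan, 50.0],
    "label": ["a", "a", None, "b", "a"],
})
filled = impute_knn(demo, n_neighbors=2)
print(filled)
```

Because KNN imputation averages neighbouring rows, it can produce fractional values in columns that were originally integer-coded. That may need rounding afterwards for class-like variables.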


In [8]:
datos_imputados.head()

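|  | FileName | Date | SegFile | b | e | LBE | LB | AC | FM | UC | ... | C | D | E | AD | DE | LD | FS | SUSP | CLASS | NSP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Variab10.txt | 12/1/1996 | CTG0002.txt | 240.0 | 357.0 | 120.0 | 120.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.4 | 0.0 | 1.0 | 0.0 | 4.0 | 2.0 |
| 1 | Fmcs_1.txt | 5/3/1996 | CTG0002.txt | 5.0 | 885.0 | 132.0 | 132.0 | 4.0 | 0.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 1.0 |
| 2 | Fmcs_1.txt | 5/3/1996 | CTG0003.txt | 177.0 | 779.0 | 133.0 | 133.0 | 2.0 | 0.0 | 5.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 1.0 |
| 3 | Fmcs_1.txt | 5/3/1996 | CTG0004.txt | 411.0 | 1192.0 | 134.0 | 134.0 | 2.0 | 0.0 | 6.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 1.0 |
| 4 | Fmcs_1.txt | 5/3/1996 | CTG0002.txt | 533.0 | 1147.0 | 131.4 | 132.0 | 4.0 | 0.0 | 5.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 1.0 |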


In [9]:
# Check for "nan" values stored as strings
print("Verificando valores 'nan' como strings:")
for col in datos_imputados.columns:
    if datos_imputados[col].dtype == 'object':
        nan_strings = (datos_imputados[col] == 'nan').sum()
        if nan_strings > 0:
            print(f"Columna '{col}': {nan_strings} valores 'nan' como string")

# Also check for real NaN values
print("\nValores NaN reales restantes:")
print(datos_imputados.isnull().sum().sum())

Verificando valores 'nan' como strings:

Valores NaN reales restantes:
0


In [10]:
analysis.completitud_datos(datos_imputados)

FileName    0.0
Date        0.0
SegFile     0.0
b           0.0
e           0.0
LBE         0.0
LB          0.0
AC          0.0
FM          0.0
UC          0.0
ASTV        0.0
MSTV        0.0
ALTV        0.0
MLTV        0.0
DL          0.0
DS          0.0
DP          0.0
DR          0.0
Width       0.0
Min         0.0
Max         0.0
Nmax        0.0
Nzeros      0.0
Mode        0.0
Mean        0.0
Median      0.0
Variance    0.0
Tendency    0.0
A           0.0
B           0.0
C           0.0
D           0.0
E           0.0
AD          0.0
DE          0.0
LD          0.0
FS          0.0
SUSP        0.0
CLASS       0.0
NSP         0.0
dtype: float64

In [11]:
datos_imputados.describe()

|  | b | e | LBE | LB | AC | FM | UC | ASTV | MSTV | ALTV | ... | C | D | E | AD | DE | LD | FS | SUSP | CLASS | NSP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 | ... | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 | 2129.0 |
| mean | 877.169427 | 1693.979341 | 133.156108 | 133.310066 | 2.754423 | 7.575643 | 3.652068 | 47.031471 | 1.336043 | 9.734592 | ... | 0.024436 | 0.036091 | 0.033835 | 0.163813 | 0.132981 | 0.046619 | 0.028197 | 0.090603 | 4.526707 | 1.288916 |
| std | 890.897094 | 897.566079 | 9.391424 | 9.804568 | 3.320319 | 38.694435 | 2.793298 | 15.635529 | 0.889418 | 17.9526 | ... | 0.154399 | 0.182233 | 0.173156 | 0.346989 | 0.300879 | 0.207762 | 0.161275 | 0.276297 | 2.872557 | 0.590773 |
| min | 0.0 | 287.0 | 106.0 | 106.0 | 0.0 | 0.0 | 0.0 | 12.0 | 0.2 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 25% | 55.0 | 1011.0 | 127.0 | 126.0 | 0.0 | 0.0 | 1.0 | 35.0 | 0.7 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 1.0 |
| 50% | 538.0 | 1320.0 | 133.0 | 133.0 | 1.8 | 0.0 | 3.0 | 47.6 | 1.2 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.2 | 1.0 |
| 75% | 1514.0 | 2383.6 | 140.0 | 140.0 | 4.0 | 2.0 | 5.0 | 59.6 | 1.7 | 11.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 1.0 |
| max | 3296.0 | 3599.0 | 160.0 | 160.0 | 26.0 | 564.0 | 23.0 | 87.0 | 7.0 | 91.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 10.0 | 3.0 |


### 1.3 Detect and treat outliers with IQR or z-score

In [12]:
outliers_iqr = preprocessing.detect_outliers_iqr(datos_imputados, factor=1.5)

In [13]:
outliers_iqr

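|  | columna | Q1 | Q3 | IQR | limite_inferior | limite_superior | num_outliers | porcentaje_outliers |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | b | 55.0 | 1514.0 | 1459.0 | -2133.5 | 3702.5 | 0 | 0.0 |
| 1 | e | 1011.0 | 2383.6 | 1372.6 | -1047.9 | 4442.5 | 0 | 0.0 |
| 2 | LBE | 127.0 | 140.0 | 13.0 | 107.5 | 159.5 | 8 | 0.375763 |
| 3 | LB | 126.0 | 140.0 | 14.0 | 105.0 | 161.0 | 0 | 0.0 |
| 4 | AC | 0.0 | 4.0 | 4.0 | -6.0 | 10.0 | 65 | 3.053077 |
| 5 | FM | 0.0 | 2.0 | 2.0 | -3.0 | 5.0 | 317 | 14.88962 |
| 6 | UC | 1.0 | 5.0 | 4.0 | -5.0 | 11.0 | 23 | 1.080319 |
| 7 | ASTV | 35.0 | 59.6 | 24.6 | -1.9 | 96.5 | 0 | 0.0 |
| 8 | MSTV | 0.7 | 1.7 | 1.0 | -0.8 | 3.2 | 70 | 3.287929 |
| 9 | ALTV | 0.0 | 11.0 | 11.0 | -16.5 | 27.5 | 301 | 14.138093 |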
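The limits in this table follow the usual IQR rule: lower fence = Q1 − factor × IQR and upper fence = Q3 + factor × IQR. The sketch below illustrates the rule. `iqr_bounds` and `iqr_outlier_report` are hypothetical names, and the report layout is only an approximation of what `detect_outliers_iqr` returns.

```python
import pandas as pd


def iqr_bounds(q1, q3, factor=1.5):
    """Return the (lower, upper) fences of the IQR rule."""
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr


def iqr_outlier_report(df, factor=1.5):
    """Quartiles, fences, and outlier counts for every numeric column."""
    rows = []
    for col in df.select_dtypes(include="number").columns:
        q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        lower, upper = iqr_bounds(q1, q3, factor)
        n_out = int(((df[col] < lower) | (df[col] > upper)).sum())
        rows.append({
            "column": col, "Q1": q1, "Q3": q3,
            "lower": lower, "upper": upper,
            "n_outliers": n_out,
            "pct_outliers": 100 * n_out / len(df),
        })
    return pd.DataFrame(rows)


# Fences for column b, using the quartiles reported in the table (55 and 1514)
print(iqr_bounds(55.0, 1514.0))  # (-2133.5, 3702.5)

demo = pd.DataFrame({"v": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]})
print(iqr_outlier_report(demo))
```

For heavily skewed, zero-inflated variables such as FM and ALTV, where Q1 is 0, the fences are narrow and the rule flags a large share of rows. That explains the percentages near 14% in the table.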


In [14]:
outliers_zscore = preprocessing.detect_outliers_zscore(datos_imputados, threshold=3.0)

In [15]:
outliers_zscore

|  | columna | media | std | threshold | num_outliers | porcentaje_outliers |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | b | 877.169427 | 890.897094 | 3.0 | 0 | 0.0 |
| 1 | e | 1693.979341 | 897.566079 | 3.0 | 0 | 0.0 |
| 2 | LBE | 133.156108 | 9.391424 | 3.0 | 0 | 0.0 |
| 3 | LB | 133.310066 | 9.804568 | 3.0 | 0 | 0.0 |
| 4 | AC | 2.754423 | 3.320319 | 3.0 | 38 | 1.784876 |
| 5 | FM | 7.575643 | 38.694435 | 3.0 | 31 | 1.456083 |
| 6 | UC | 3.652068 | 2.793298 | 3.0 | 14 | 0.657586 |
| 7 | ASTV | 47.031471 | 15.635529 | 3.0 | 0 | 0.0 |
| 8 | MSTV | 1.336043 | 0.889418 | 3.0 | 32 | 1.503053 |
| 9 | ALTV | 9.734592 | 17.9526 | 3.0 | 59 | 2.771254 |
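The z-score rule flags a value when its distance from the column mean exceeds `threshold` standard deviations. The `std` values in the table match those from `describe()`, which uses the sample standard deviation, so the sketch below makes the same choice. `zscore_outlier_report` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def zscore_outlier_report(df, threshold=3.0):
    """Mean, sample std, and count of |z| > threshold per numeric column."""
    rows = []
    for col in df.select_dtypes(include="number").columns:
        mean, std = df[col].mean(), df[col].std()  # pandas std uses ddof=1
        if std == 0 or np.isnan(std):
            n_out = 0  # constant column: no value can be an outlier
        else:
            z = (df[col] - mean) / std
            n_out = int((z.abs() > threshold).sum())
        rows.append({
            "column": col, "mean": mean, "std": std,
            "threshold": threshold, "n_outliers": n_out,
            "pct_outliers": 100 * n_out / len(df),
        })
    return pd.DataFrame(rows)


# 29 identical values plus one extreme value
demo = pd.DataFrame({"v": [10] * 29 + [100]})
print(zscore_outlier_report(demo))
```

The z-score rule is less sensitive than the IQR rule on skewed columns because extreme values inflate the standard deviation. This is consistent with the lower counts reported here for FM and ALTV.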


## 2. Data Analysis

A full analysis of data completeness and variable characteristics using the custom function.

In [16]:
# Full analysis of data completeness and variable characteristics
resultado_completo = analysis.check_data_completeness_JavierMartinezReyes(datos_imputados)

ANÁLISIS COMPLETO DE COMPLETITUD DE DATOS
Dimensiones del DataFrame: 2129 filas x 40 columnas
Total de valores: 85,160
Total de valores nulos: 0
Porcentaje general de completitud: 100.00%

CLASIFICACIÓN DE VARIABLES:
- Continua: 23 columnas
- Discreta: 14 columnas
- Categórica_Alta: 2 columnas
- Categórica_Media: 1 columnas

Columnas con mayor porcentaje de nulos:
- FileName: 0.0% (Categórica_Alta)
- Date: 0.0% (Categórica_Media)
- SegFile: 0.0% (Categórica_Alta)
- b: 0.0% (Continua)
- e: 0.0% (Continua)


In [17]:
# Show the general summary
print("📊 RESUMEN GENERAL:")
resultado_completo['resumen_general']

📊 RESUMEN GENERAL:


|  | columna | tipo_dato | valores_totales | valores_no_nulos | valores_nulos | porcentaje_completitud | porcentaje_nulos | valores_unicos | clasificacion |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | FileName | object | 2129 | 2129 | 0 | 100.0 | 0.0 | 350 | Categórica_Alta |
| 1 | Date | object | 2129 | 2129 | 0 | 100.0 | 0.0 | 48 | Categórica_Media |
| 2 | SegFile | object | 2129 | 2129 | 0 | 100.0 | 0.0 | 1710 | Categórica_Alta |
| 3 | b | float64 | 2129 | 2129 | 0 | 100.0 | 0.0 | 989 | Continua |
| 4 | e | float64 | 2129 | 2129 | 0 | 100.0 | 0.0 | 1345 | Continua |
| 5 | LBE | float64 | 2129 | 2129 | 0 | 100.0 | 0.0 | 144 | Continua |
| 6 | LB | float64 | 2129 | 2129 | 0 | 100.0 | 0.0 | 69 | Continua |
| 7 | AC | float64 | 2129 | 2129 | 0 | 100.0 | 0.0 | 57 | Continua |
| 8 | FM | float64 | 2129 | 2129 | 0 | 100.0 | 0.0 | 125 | Continua |
| 9 | UC | float64 | 2129 | 2129 | 0 | 100.0 | 0.0 | 45 | Continua |


In [18]:
# Show the dispersion statistics
print("📈 ESTADÍSTICOS DE DISPERSIÓN (Variables Numéricas):")
resultado_completo['estadisticos_dispersion']

📈 ESTADÍSTICOS DE DISPERSIÓN (Variables Numéricas):


|  | columna | tipo | media | mediana | desv_std | varianza | min | max | q25 | q75 | rango | coef_variacion |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | b | Continua | 877.1694 | 538.0 | 890.8971 | 793697.6318 | 0.0 | 3296.0 | 55.0 | 1514.0 | 3296.0 | 101.565 |
| 1 | e | Continua | 1693.9793 | 1320.0 | 897.5661 | 805624.8657 | 287.0 | 3599.0 | 1011.0 | 2383.6 | 3312.0 | 52.9857 |
| 2 | LBE | Continua | 133.1561 | 133.0 | 9.3914 | 88.1988 | 106.0 | 160.0 | 127.0 | 140.0 | 54.0 | 7.0529 |
| 3 | LB | Continua | 133.3101 | 133.0 | 9.8046 | 96.1296 | 106.0 | 160.0 | 126.0 | 140.0 | 54.0 | 7.3547 |
| 4 | AC | Continua | 2.7544 | 1.8 | 3.3203 | 11.0245 | 0.0 | 26.0 | 0.0 | 4.0 | 26.0 | 120.545 |
| 5 | FM | Continua | 7.5756 | 0.0 | 38.6944 | 1497.2593 | 0.0 | 564.0 | 0.0 | 2.0 | 564.0 | 510.7742 |
| 6 | UC | Continua | 3.6521 | 3.0 | 2.7933 | 7.8025 | 0.0 | 23.0 | 1.0 | 5.0 | 23.0 | 76.4854 |
| 7 | ASTV | Continua | 47.0315 | 47.6 | 15.6355 | 244.4698 | 12.0 | 87.0 | 35.0 | 59.6 | 75.0 | 33.2448 |
| 8 | MSTV | Continua | 1.336 | 1.2 | 0.8894 | 0.7911 | 0.2 | 7.0 | 0.7 | 1.7 | 6.8 | 66.5711 |
| 9 | ALTV | Continua | 9.7346 | 0.0 | 17.9526 | 322.2958 | 0.0 | 91.0 | 0.0 | 11.0 | 91.0 | 184.4207 |
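The derived columns in this table follow from the basic statistics: `rango` is max − min, and `coef_variacion` is consistent with 100 × std / mean. For column b, that is 100 × 890.8971 / 877.1694 ≈ 101.565. A minimal sketch, with hypothetical function names:

```python
import numpy as np
import pandas as pd


def coefficient_of_variation(mean, std):
    """Coefficient of variation in percent; NaN when the mean is zero."""
    return 100 * std / mean if mean != 0 else np.nan


def dispersion_stats(series):
    """Basic dispersion statistics for one numeric column."""
    mean, std = series.mean(), series.std()  # sample std (ddof=1)
    return {
        "mean": mean,
        "median": series.median(),
        "std": std,
        "variance": series.var(),
        "q25": series.quantile(0.25),
        "q75": series.quantile(0.75),
        "range": series.max() - series.min(),
        "cv_pct": coefficient_of_variation(mean, std),
    }


# Check against the values reported for column b (approximately 101.565)
print(round(coefficient_of_variation(877.1694, 890.8971), 3))

demo = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
stats = dispersion_stats(demo)
print(stats)
```

A coefficient of variation above 100%, as for b, AC, FM, and ALTV, means the standard deviation is larger than the mean. This is typical of strongly right-skewed variables.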


In [19]:
# Show the variable classification
print("🏷️ CLASIFICACIÓN AUTOMÁTICA DE VARIABLES:")
resultado_completo['clasificacion_variables']

🏷️ CLASIFICACIÓN AUTOMÁTICA DE VARIABLES:


|  | columna | clasificacion | criterio | tipo_original | es_numerica | es_categorica |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | FileName | Categórica_Alta | Valores únicos: 350 | object | False | True |
| 1 | Date | Categórica_Media | Valores únicos: 48 | object | False | True |
| 2 | SegFile | Categórica_Alta | Valores únicos: 1710 | object | False | True |
| 3 | b | Continua | Valores únicos: 989 | float64 | True | False |
| 4 | e | Continua | Valores únicos: 1345 | float64 | True | False |
| 5 | LBE | Continua | Valores únicos: 144 | float64 | True | False |
| 6 | LB | Continua | Valores únicos: 69 | float64 | True | False |
| 7 | AC | Continua | Valores únicos: 57 | float64 | True | False |
| 8 | FM | Continua | Valores únicos: 125 | float64 | True | False |
| 9 | UC | Continua | Valores únicos: 45 | float64 | True | False |


## 3. Visualizations (raise the difficulty by adding interactivity, statistics, or multiple variables)

### 3.1 Histograms

In [20]:
visualization.plot_interactive_histogram(datos_imputados, column='LBE')

💾 Gráfico guardado como JPG: plots\histogram_LBE_20251130_180327.jpg


In [21]:
visualization.plot_interactive_histogram(datos_imputados, column='UC', group_by='D', bins=60)

💾 Gráfico guardado como JPG: plots\histogram_UC_by_D_20251130_180329.jpg
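The plotting code behind `plot_interactive_histogram` is not shown here. When a histogram is split by a grouping column, one common design is to compute a single set of bin edges from the full column so that the groups remain directly comparable. The sketch below illustrates that idea with NumPy only. `grouped_histogram` is a hypothetical helper, not part of the library.

```python
import numpy as np
import pandas as pd


def grouped_histogram(df, column, group_by, bins=60):
    """Count values of `column` per group using one shared set of bin edges."""
    edges = np.histogram_bin_edges(df[column].dropna(), bins=bins)
    counts = {}
    for key, part in df.groupby(group_by):
        counts[key], _ = np.histogram(part[column].dropna(), bins=edges)
    # Rows are indexed by the left edge of each bin, columns by group
    return edges, pd.DataFrame(counts, index=edges[:-1])


demo = pd.DataFrame({
    "UC": [0, 1, 1, 2, 5, 5, 6, 10],
    "D": [0, 0, 0, 0, 1, 1, 1, 1],
})
edges, table = grouped_histogram(demo, column="UC", group_by="D", bins=5)
print(edges)
print(table)
```

The resulting table can be passed to any plotting backend as overlaid or stacked bars.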


### 3.2 Boxplots

In [22]:
visualization.plot_interactive_boxplot(datos_imputados, column='b', group_by='NSP')

💾 Gráfico guardado como JPG: plots\boxplot_b_by_NSP_20251130_180332.jpg


In [23]:
visualization.plot_interactive_boxplot(datos_imputados, column='b', target_class='D')

💾 Gráfico guardado como JPG: plots\boxplot_b_class_D_20251130_180334.jpg


### 3.3 Horizontal Bars

In [24]:
visualization.plot_interactive_bar_horizontal(datos_imputados, column='LBE')

### 3.4 Lines


In [25]:
visualization.plot_interactive_line_timeseries(datos_imputados, x_column='e', y_column='DP')

### 3.5 Dot Plots

In [26]:
visualization.plot_interactive_dot_comparison(datos_imputados, column='b', group1='6/6/1998', group2='5/10/1998', group_column='Date')

### 3.6 Density

In [27]:
visualization.plot_interactive_density_multiclass(datos_imputados, column='b', class_column='DP')

### 3.7 Violin

In [28]:
visualization.plot_interactive_violin_swarm(datos_imputados, column='b', group_by='DP')

### 3.8 Heatmap

In [29]:


# Basic heatmap: Spearman correlation
visualization.plot_interactive_correlation_heatmap(
    datos_imputados, 
    method='spearman', 
    annot=True, 
    max_variables=10,
    show_only_significant=True, 
    threshold=0.4
)

⚠️  Mostrando 10 de 37 variables numéricas
💾 Gráfico guardado como JPG: plots\correlation_heatmap_spearman_threshold_0.4_20251130_180337.jpg


In [30]:

visualization.plot_interactive_correlation_heatmap(
    datos_imputados, 
    method='pearson', 
    annot=True, 
    max_variables=10,
    show_only_significant=True, 
    threshold=0.4
)

⚠️  Mostrando 10 de 37 variables numéricas
💾 Gráfico guardado como JPG: plots\correlation_heatmap_pearson_threshold_0.4_20251130_180339.jpg
