[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/repos-especializacion-UdeA/data-raw/blob/main/notebooks/04_features_EDA.ipynb)

# Exploratory data analysis

This notebook takes a simple look at a dataset holding sensor data (ten EMG channels), loaded from `features_data_set.csv`.

In [None]:
# Run this only the first time, if these packages are not installed yet
import sys
!{sys.executable} -m pip install -U "ydata-profiling[notebook]"
!{sys.executable} -m pip install jupyter-contrib-nbextensions

In [1]:
!jupyter nbextension enable --py widgetsnbextension

Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: ok


## 1. Libraries and initial setup

In [2]:
import sys
import os
import zipfile

# Local path and remote URL of the dataset
data_set = "./datasets/features_data_set.csv"
url_data_set = 'https://raw.githubusercontent.com/repos-especializacion-UdeA/data-raw/refs/heads/main/notebooks/datasets/features_data_set.csv'
try:
    # Running in Google Colab: make sure scipy is available and read the data from the URL
    import google.colab
    try:
        import scipy.io
    except ImportError:
        !pip install scipy
    data_set = url_data_set
except ImportError:
    # Running locally: keep the local dataset path
    ruta_base = './'
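
The environment check above can also be wrapped in a small helper. The sketch below is only an illustration, not the notebook's own code: `in_colab` and `resolve_dataset` are hypothetical names, the path and URL are placeholders, and falling back to the URL when the local file is missing is an added assumption.

```python
import sys
from pathlib import Path


def in_colab() -> bool:
    """Return True when google.colab is already loaded (assumed to be the case in Colab kernels)."""
    return "google.colab" in sys.modules


def resolve_dataset(local_path: str, url: str) -> str:
    """Pick the dataset location: the URL in Colab or when the local file is absent."""
    if in_colab() or not Path(local_path).exists():
        return url
    return local_path


# Hypothetical usage with a placeholder path and URL
chosen = resolve_dataset("./no_such_dir/data.csv", "https://example.com/data.csv")
print(chosen)
```

Since `pd.read_csv` accepts both local paths and URLs, the returned string could be passed to it directly.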


In [3]:
# command to view figures in Jupyter notebook
# %matplotlib inline 

# Data handling
# ==============================================================================
import pandas as pd
import numpy as np
import scipy as sc
from ydata_profiling import ProfileReport

# Cache function results on disk
# ==============================================================================
import joblib


# Module management
# ==============================================================================
from importlib import reload

# Mathematics and statistics
# ==============================================================================
import math

# Plotting
# ==============================================================================
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns


# Warning configuration
# ==============================================================================
import warnings
warnings.filterwarnings('ignore')

# Formatting and style
# ==============================================================================
from IPython.display import Markdown, display

# scipy library and components
# ==============================================================================
import scipy.io
from scipy import signal


## 2. Functions

In [4]:
# Utility functions
# ==============================================================================

# To Do...


## 3. Loading the dataset

The full dataset is loaded below.

In [5]:
# Load the dataset
df = pd.read_csv(data_set)

Next, we verify that the dataset loaded correctly:

In [6]:
df.head()

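|   | s | emg_1 | emg_2 | emg_3 | emg_4 | emg_5 | emg_6 | emg_7 | emg_8 | emg_9 | emg_10 | rep | label |
|---|---|-------|-------|-------|-------|-------|-------|-------|-------|-------|--------|-----|-------|
| 0 | 1 | 0.05251 | 0.002414 | 0.002445 | 0.002417 | 0.0024 | 0.006204 | 0.0024 | 0.041218 | 0.0024 | 0.019526 | 0 | 0 |
| 1 | 1 | 0.038543 | 0.00244 | 0.002513 | 0.002443 | 0.002426 | 0.002803 | 0.0024 | 0.029789 | 0.0024 | 0.005035 | 0 | 0 |
| 2 | 1 | 0.035662 | 0.002448 | 0.002564 | 0.002446 | 0.002478 | 0.001975 | 0.0024 | 0.025287 | 0.0024 | 0.000813 | 0 | 0 |
| 3 | 1 | 0.037038 | 0.002425 | 0.002542 | 0.00242 | 0.002526 | 0.002129 | 0.0024 | 0.026216 | 0.0024 | 0.001485 | 0 | 0 |
| 4 | 1 | 0.035718 | 0.002404 | 0.002478 | 0.002401 | 0.002542 | 0.002346 | 0.0024 | 0.026433 | 0.0024 | 0.002234 | 0 | 0 |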


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46925 entries, 0 to 46924
Data columns (total 13 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   s       46925 non-null  int64  
 1   emg_1   46925 non-null  float64
 2   emg_2   46925 non-null  float64
 3   emg_3   46925 non-null  float64
 4   emg_4   46925 non-null  float64
 5   emg_5   46925 non-null  float64
 6   emg_6   46925 non-null  float64
 7   emg_7   46925 non-null  float64
 8   emg_8   46925 non-null  float64
 9   emg_9   46925 non-null  float64
 10  emg_10  46925 non-null  float64
 11  rep     46925 non-null  int64  
 12  label   46925 non-null  int64  
dtypes: float64(10), int64(3)
memory usage: 4.7 MB


There are 13 columns in total and none of them has missing values, so no data imputation is needed. The dataset is fairly large, though: 46,925 rows.
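
The absence of missing values can also be checked explicitly rather than read off the `df.info()` output. The following is a minimal sketch, assuming a pandas DataFrame; `missing_summary` is a hypothetical helper name and the toy frame (with made-up values and one deliberately missing entry) only stands in for `df`.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count missing values per column and express them as a percentage of rows."""
    counts = frame.isna().sum()
    percent = 100 * counts / len(frame)
    return pd.DataFrame({"missing": counts, "percent": percent})


# Toy frame standing in for `df` (values are made up for illustration)
toy = pd.DataFrame({
    "emg_1": [0.05, np.nan, 0.03, 0.04],
    "emg_2": [0.002, 0.002, 0.002, 0.002],
    "label": [0, 0, 1, 1],
})
summary = missing_summary(toy)
print(summary)
```

Applied to the real dataset, a `missing` column made entirely of zeros would confirm that imputation can be skipped.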

In [8]:
len(df.columns)

13

In [9]:
# Convert to categorical
df['s'] = pd.Categorical(df['s'])
df['rep'] = pd.Categorical(df['rep'])
df['label'] = pd.Categorical(df['label'])
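
The three conversions above could also be written as a single `astype` call with a column-to-dtype mapping. This is a sketch on a toy frame with made-up values, not the notebook's data; `cat_cols` and `converted` are names introduced here for illustration.

```python
import pandas as pd

# Toy frame standing in for `df` (made-up values)
toy = pd.DataFrame({
    "s": [1, 1, 1, 2, 2, 2],
    "emg_1": [0.05, 0.04, 0.03, 0.06, 0.05, 0.04],
    "rep": [0, 1, 2, 0, 1, 2],
    "label": [0, 0, 1, 1, 2, 2],
})

# Columns to treat as categorical, as in the cell above
cat_cols = ["s", "rep", "label"]

# astype with a dict converts only the listed columns and returns a new frame
converted = toy.astype({col: "category" for col in cat_cols})

print(converted.dtypes)
print(converted["label"].cat.categories.tolist())
```

Because `astype` returns a new frame, the original is left untouched unless the result is assigned back.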

We verify that the changes to the dataframe took effect.

In [10]:
# Lists of categorical and numeric variables
catCols = df.select_dtypes(include = ['object', 'category']).columns.tolist()
print(f"Categorical variables: {catCols}")
numCols = df.select_dtypes(include = ['float64','int32','int64']).columns.tolist()
print(f"Numeric variables: {numCols}")

Categorical variables: ['s', 'rep', 'label']
Numeric variables: ['emg_1', 'emg_2', 'emg_3', 'emg_4', 'emg_5', 'emg_6', 'emg_7', 'emg_8', 'emg_9', 'emg_10']
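
Listing numeric dtypes by name works for this dataset, but it would miss columns stored with other widths such as `int16` or `float32`. A hedged alternative is to pass `include="number"` to `select_dtypes`. The toy frame below is invented for illustration, and `n_samples` is a hypothetical column that does not exist in the real dataset.

```python
import pandas as pd

# Toy frame with a narrower integer dtype to show the difference
toy = pd.DataFrame({
    "s": pd.Categorical([1, 1, 2]),
    "emg_1": [0.05, 0.04, 0.03],
    "n_samples": pd.Series([3, 4, 5], dtype="int16"),  # hypothetical column
    "label": pd.Categorical([0, 1, 1]),
})

cat_cols = toy.select_dtypes(include=["object", "category"]).columns.tolist()

# "number" matches every numeric dtype, whatever its width
num_cols = toy.select_dtypes(include="number").columns.tolist()

# The explicit dtype list used above skips the int16 column
explicit_cols = toy.select_dtypes(include=["float64", "int32", "int64"]).columns.tolist()

print(cat_cols)
print(num_cols)
print(explicit_cols)
```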


## 4. EDA

In [11]:
profile = ProfileReport(df, title="Pandas Profiling Report")

In [12]:
profile.to_notebook_iframe()
# profile.to_widgets()  # This froze the machine
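
With tens of thousands of rows, building the report can be heavy. One possible way to lighten it, sketched here as an assumption rather than something the notebook does, is to profile a stratified subsample that keeps the same proportion of each `label`. The toy frame, the 10% fraction, and the seed are all illustrative choices.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `df`: 1000 rows, four equally sized labels
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "emg_1": rng.random(1000),
    "label": np.repeat([0, 1, 2, 3], 250),
})

# Draw 10% of the rows within each label so class proportions are preserved
subset = toy.groupby("label", observed=True).sample(frac=0.1, random_state=0)

print(len(subset))
print(subset["label"].value_counts().sort_index())

# The subset could then be profiled instead of the full frame, for example:
# profile = ProfileReport(subset, title="Pandas Profiling Report", minimal=True)
# (`minimal=True` is the ydata-profiling option that skips the costlier computations)
```

Whether a subsample is representative enough for the analysis is a judgment call that depends on the data.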


### Saving the EDA report

In [13]:
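# Make sure the output folders exist before writing the reports
os.makedirs("./html_report", exist_ok=True)
os.makedirs("./json_report", exist_ok=True)

# Export as HTML
profile.to_file("./html_report/report_EDA.html")

# As a JSON string
json_data = profile.to_json()

# As a file
profile.to_file("./json_report/report_EDA.json")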


## References

* https://github.com/chuawt/eda-starter
* https://www.kaggle.com/code/bextuychiev/my-6-part-powerful-eda-template
* https://community.ibm.com/community/user/ai-datascience/blogs/shivam-solanki1/2020/02/19/eda-exploratory-data-analysis-with-example-in-jupy
* https://github.com/Saba-Gul/Exploratory-Data-Analysis-and-Statistical-Analysis-Notebooks
* https://www.datacamp.com/es/tutorial/pandas-profiling-ydata-profiling-in-python-guide
* https://docs.profiling.ydata.ai/latest/
* https://github.com/Saba-Gul/Exploratory-Data-Analysis-and-Statistical-Analysis-Notebooks/blob/main/Statistics_for_ML.ipynb
* https://github.com/Saba-Gul/Exploratory-Data-Analysis-and-Statistical-Analysis-Notebooks/blob/main/Online_Ed_Adaptability.ipynb
* https://github.com/Saba-Gul/Exploratory-Data-Analysis-and-Statistical-Analysis-Notebooks/blob/main/Heart_Failure_Survival_Classification.ipynb
* https://github.com/akueisara/audio-signal-processing/blob/master/week%204/A4/A4Part2.py