[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/repos-especializacion-UdeA/data-raw/blob/main/notebooks/merge_databases.ipynb)

# Merging all subject databases

> ### Objective
> Combine the per-subject databases into a single one.

In [1]:
try:
    import scipy.io
except ImportError:
    !pip install scipy

## 1. Libraries and initial setup

In [79]:
import sys
import os
import zipfile

# Working directory of the notebook (a relative path, not an absolute one)
notebook_path = "."
print(notebook_path)
try:
    import google.colab
    !git clone https://github.com/repos-especializacion-UdeA/data-raw.git
    %cd /content/data-raw/notebooks
    %pwd
    ruta_base = '/content/data-raw/notebooks/'
    sys.path.append(ruta_base)
except ImportError:
    print("The notebook is not running on Google Colab.")
    ruta_base = './'

.
The notebook is not running on Google Colab.


In [3]:
# command to view figures in Jupyter notebook
%matplotlib inline 

# Data handling
# ==============================================================================
import pandas as pd
import numpy as np
import scipy as sc

# Caching function results on disk
# ==============================================================================
import joblib


# Module management
# ==============================================================================
from importlib import reload

# Math and statistics
# ==============================================================================
import math

# Plotting
# ==============================================================================
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns


# Warning configuration
# ==============================================================================
import warnings
warnings.filterwarnings('ignore')

# Formatting and style
# ==============================================================================
from IPython.display import Markdown, display

# scipy and its submodules
# ==============================================================================
import scipy.io
from scipy import signal

## 2. Loading the dataset

The files listed below (named `Sx_A1_E1.mat`, where `x` is the subject number) were selected from Ninapro database DB1 for further processing.

Since the datasets are

In [50]:
DATASETS_PATH = "./raw_datasets/"

archivos_mat = ['S1_A1_E1.mat', 
         'S2_A1_E1.mat', 
         'S3_A1_E1.mat', 
         'S4_A1_E1.mat', 
         'S5_A1_E1.mat', 
         'S6_A1_E1.mat', 
         'S7_A1_E1.mat', 
         'S8_A1_E1.mat', 
         'S9_A1_E1.mat', 
         'S10_A1_E1.mat', 
         'S11_A1_E1.mat', 
         'S12_A1_E1.mat', 
         'S13_A1_E1.mat', 
         'S14_A1_E1.mat', 
         'S15_A1_E1.mat', 
         'S16_A1_E1.mat', 
         'S17_A1_E1.mat', 
         'S18_A1_E1.mat', 
         'S19_A1_E1.mat', 
         'S20_A1_E1.mat', 
         'S21_A1_E1.mat', 
         'S22_A1_E1.mat', 
         'S23_A1_E1.mat', 
         'S24_A1_E1.mat', 
         'S25_A1_E1.mat', 
         'S26_A1_E1.mat',
         'S27_A1_E1.mat']
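
Since the file names follow the `Sx_A1_E1.mat` pattern for subjects 1 to 27, the same list could also be built programmatically. A minimal sketch, assuming the numbering is contiguous as in the list above (`N_SUBJECTS` and `generated_files` are illustrative names, not part of the original notebook):

```python
# Illustrative alternative: build the file list from the naming pattern.
N_SUBJECTS = 27  # assumption: subjects are numbered 1..27 with no gaps

generated_files = [f"S{n}_A1_E1.mat" for n in range(1, N_SUBJECTS + 1)]

print(generated_files[:3])
print(len(generated_files))
```

The explicit list has the advantage of making it obvious which subjects are included; the generated one is easier to adjust if the subject range changes.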

Inspecting the contents of the loaded MAT file. Only the first file is used as a sample.

In [51]:
archivo_mat = scipy.io.loadmat(DATASETS_PATH + archivos_mat[0])
# print(type(archivo_mat))
print(archivo_mat.keys())

dict_keys(['__header__', '__version__', '__globals__', 'emg', 'stimulus', 'glove', 'subject', 'exercise', 'repetition', 'restimulus', 'rerepetition'])
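
In the dictionary returned by `scipy.io.loadmat`, the keys wrapped in double underscores hold file metadata, while the remaining keys hold the variables saved in the MAT file. A small sketch of separating the two groups, using a plain dictionary with the same keys as a stand-in (the values are placeholders, not real data, and the variable names are illustrative):

```python
# Stand-in for the dictionary returned by scipy.io.loadmat (placeholder values).
keys = ['__header__', '__version__', '__globals__', 'emg', 'stimulus', 'glove',
        'subject', 'exercise', 'repetition', 'restimulus', 'rerepetition']
mock_mat = {key: None for key in keys}

# Keys wrapped in double underscores are metadata; the rest are stored variables.
metadata_keys = [k for k in mock_mat if k.startswith('__') and k.endswith('__')]
data_keys = [k for k in mock_mat if k not in metadata_keys]

print(metadata_keys)
print(data_keys)
```

Filtering by name this way avoids relying on the position of the metadata entries in the dictionary.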


#### Metadata

In [52]:
# Metadata
print("File metadata")
for key in list(archivo_mat.keys())[:3]:
    # print(key,": ",archivo_mat[key],sep = "")
    print(f"{key}: {archivo_mat[key]} | type:{type(archivo_mat[key])}")

File metadata
__header__: b'MATLAB 5.0 MAT-file, Platform: MACI64, Created on: Mon Jul 28 11:54:15 2014' | type:<class 'bytes'>
__version__: 1.0 | type:<class 'str'>
__globals__: [] | type:<class 'list'>


#### Sensor information

In [53]:
# Sensor information: shape of the EMG and glove arrays
print("Sensors")
for key in ['emg','glove']:
    f, c = archivo_mat[key].shape
    print(f"{key}: rows: {f}; cols: {c} | type:{type(archivo_mat[key])}")

Sensors
emg: rows: 101014; cols: 10 | type:<class 'numpy.ndarray'>
glove: rows: 101014; cols: 22 | type:<class 'numpy.ndarray'>


#### Exercise information

In [54]:
# Subject and exercise identifiers (stored as 2-D arrays, hence the [0,0] index)
for key in ['subject', 'exercise']:
    print(f"{key}: {archivo_mat[key][0,0]} | type:{type(archivo_mat[key])}")

subject: 1 | type:<class 'numpy.ndarray'>
exercise: 1 | type:<class 'numpy.ndarray'>


#### Stimuli

In [55]:
# Shape of the stimulus and repetition label arrays
for key in ['stimulus', 'repetition', 'restimulus', 'rerepetition']:
    f, c = archivo_mat[key].shape
    print(f"{key}: rows: {f}; cols: {c} | type:{type(archivo_mat[key])}")

stimulus: rows: 101014; cols: 1 | type:<class 'numpy.ndarray'>
repetition: rows: 101014; cols: 1 | type:<class 'numpy.ndarray'>
restimulus: rows: 101014; cols: 1 | type:<class 'numpy.ndarray'>
rerepetition: rows: 101014; cols: 1 | type:<class 'numpy.ndarray'>


Load each subject's database and combine them all into a single one.

In [62]:
## Loading the database
frames = []
for i in range(len(archivos_mat)):
    ruta_archivo_mat = DATASETS_PATH + archivos_mat[i]
    archivo_mat = scipy.io.loadmat(ruta_archivo_mat)
    # Extract and rename the columns of interest
    df_emg = pd.DataFrame(archivo_mat['emg'])
    df_emg.columns = ['emg_' + str(col + 1) for col in df_emg.columns]
    df_restimulus = pd.DataFrame(archivo_mat['restimulus']).rename(columns={0: 'label'})
    df_repetition = pd.DataFrame(archivo_mat['rerepetition']).rename(columns={0: 'rep'})
    # Subject id column: position in the file list + 1
    df_subject = pd.DataFrame({'s': [i + 1] * df_repetition.shape[0]}, dtype='int8')
    df_subject = pd.concat([df_subject, df_emg, df_repetition, df_restimulus], axis=1)
    print(f"{df_subject.shape[0]} samples ['subject','emg','rerepetition','restimulus'] subject {i + 1} added")
    frames.append(df_subject)
# Concatenate once at the end instead of growing the DataFrame on every iteration
data_base = pd.concat(frames, ignore_index=True)

101014 samples ['subject','emg','rerepetition','restimulus'] subject 1 added
100686 samples ['subject','emg','rerepetition','restimulus'] subject 2 added
100720 samples ['subject','emg','rerepetition','restimulus'] subject 3 added
100835 samples ['subject','emg','rerepetition','restimulus'] subject 4 added
100894 samples ['subject','emg','rerepetition','restimulus'] subject 5 added
101083 samples ['subject','emg','rerepetition','restimulus'] subject 6 added
100817 samples ['subject','emg','rerepetition','restimulus'] subject 7 added
100854 samples ['subject','emg','rerepetition','restimulus'] subject 8 added
100925 samples ['subject','emg','rerepetition','restimulus'] subject 9 added
100778 samples ['subject','emg','rerepetition','restimulus'] subject 10 added
100899 samples ['subject','emg','rerepetition','restimulus'] subject 11 added
100920 samples ['subject','emg','rerepetition','restimulus'] subject 12 added
100948 samples ['subject','emg','rere

## Database information

In [63]:
data_base.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2731393 entries, 0 to 2731392
Data columns (total 13 columns):
 #   Column  Dtype  
---  ------  -----  
 0   s       int8   
 1   emg_1   float64
 2   emg_2   float64
 3   emg_3   float64
 4   emg_4   float64
 5   emg_5   float64
 6   emg_6   float64
 7   emg_7   float64
 8   emg_8   float64
 9   emg_9   float64
 10  emg_10  float64
 11  rep     uint8  
 12  label   uint8  
dtypes: float64(10), int8(1), uint8(2)
memory usage: 216.2 MB
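
The reported memory usage follows directly from the row count and the column dtypes: ten `float64` columns at 8 bytes each plus three 1-byte integer columns. A quick check of that arithmetic, ignoring the small index overhead (the variable names are illustrative):

```python
# Reproduce the memory figure reported by data_base.info() from the dtypes.
n_rows = 2731393
bytes_per_row = 10 * 8 + 1 * 1 + 2 * 1  # 10 float64 + 1 int8 + 2 uint8

total_bytes = n_rows * bytes_per_row
total_mb = total_bytes / (1024 * 1024)  # pandas reports sizes in 1024-based units

print(f"{total_mb:.1f} MB")  # 216.2 MB
```

Almost all of the footprint comes from the `float64` EMG columns, so downcasting them would be the main lever if memory became a concern.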


### First records of the dataframe

A preview of the resulting dataframe is shown below.

In [65]:
# First records
data_base.head(3)

|   | s | emg_1 | emg_2 | emg_3 | emg_4 | emg_5 | emg_6 | emg_7 | emg_8 | emg_9 | emg_10 | rep | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0684 | 0.0024 | 0.0024 | 0.0024 | 0.0024 | 0.0098 | 0.0024 | 0.0488 | 0.0024 | 0.0342 | 0 | 0 |
| 1 | 1 | 0.0586 | 0.0024 | 0.0024 | 0.0024 | 0.0024 | 0.0049 | 0.0024 | 0.0415 | 0.0024 | 0.0293 | 0 | 0 |
| 2 | 1 | 0.0562 | 0.0024 | 0.0024 | 0.0024 | 0.0024 | 0.0049 | 0.0024 | 0.0391 | 0.0024 | 0.0244 | 0 | 0 |


In [67]:
# Last records
data_base.tail(3)

|   | s | emg_1 | emg_2 | emg_3 | emg_4 | emg_5 | emg_6 | emg_7 | emg_8 | emg_9 | emg_10 | rep | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2731390 | 27 | 0.0684 | 0.0049 | 0.0049 | 0.0024 | 0.0049 | 0.0024 | 0.0049 | 0.1416 | 0.0024 | 0.0024 | 0 | 0 |
| 2731391 | 27 | 0.0781 | 0.0024 | 0.0049 | 0.0049 | 0.0049 | 0.0024 | 0.0049 | 0.1538 | 0.0049 | 0.0024 | 0 | 0 |
| 2731392 | 27 | 0.083 | 0.0024 | 0.0024 | 0.0049 | 0.0049 | 0.0049 | 0.0049 | 0.1611 | 0.0049 | 0.0049 | 0 | 0 |


#### Basic dataset information

In [68]:
print(data_base.shape)
print(data_base.isna().sum())

(2731393, 13)
s         0
emg_1     0
emg_2     0
emg_3     0
emg_4     0
emg_5     0
emg_6     0
emg_7     0
emg_8     0
emg_9     0
emg_10    0
rep       0
label     0
dtype: int64
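
Beyond the global shape and missing-value counts, it can be useful to confirm that each subject contributed a plausible number of rows and labels. A sketch on a tiny synthetic frame, assuming the same column names (`s`, `rep`, `label`) as `data_base`; the values below are made up for illustration:

```python
import pandas as pd

# Tiny synthetic stand-in for data_base (made-up values, same column names).
demo = pd.DataFrame({
    's':     [1, 1, 1, 2, 2],
    'emg_1': [0.01, 0.02, 0.03, 0.04, 0.05],
    'rep':   [0, 1, 1, 0, 2],
    'label': [0, 3, 3, 0, 5],
})

# Rows contributed by each subject.
rows_per_subject = demo.groupby('s').size()
print(rows_per_subject)

# Number of distinct labels seen for each subject.
labels_per_subject = demo.groupby('s')['label'].nunique()
print(labels_per_subject)
```

On the real frame the same calls would be `data_base.groupby('s').size()` and `data_base.groupby('s')['label'].nunique()`.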


In [94]:
raw_dataset_name = "raw_dataset"
raw_dataset_csv = raw_dataset_name + ".csv"
dest_zip = raw_dataset_name + ".zip"
dest_dir_datasets = "./datasets/"
if not os.path.exists(dest_dir_datasets + dest_zip):
    # The archive does not exist yet: export the dataframe to a CSV file
    os.makedirs(dest_dir_datasets, exist_ok=True)
    print(f"Generating file {raw_dataset_csv}")
    data_base.to_csv(dest_dir_datasets + raw_dataset_csv, index=False)
    stat_dataset = os.stat(dest_dir_datasets + raw_dataset_csv)
    print(f"Size of dataset {raw_dataset_csv}: {stat_dataset.st_size / (1024 * 1024)} MB")
    # Create the compressed archive
    with zipfile.ZipFile(dest_dir_datasets + dest_zip, 'w', zipfile.ZIP_DEFLATED) as zipf:
        zipf.write(dest_dir_datasets + raw_dataset_csv)
    print(f"File {dest_zip} generated")
    # Remove the CSV only after the archive has been written and closed
    os.remove(dest_dir_datasets + raw_dataset_csv)
    print(f"File {raw_dataset_csv} removed")
else:
    print("Nothing to do: the archive already exists")

Generating file raw_dataset.csv
Size of dataset raw_dataset.csv: 201.359450340271 MB
File raw_dataset.zip generated
File raw_dataset.csv removed
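
To confirm an export like the one above, the archive contents can be listed and the CSV header read back without extracting anything to disk. A self-contained sketch that builds a small archive in a temporary directory (the file names mirror the ones used in the export cell; the CSV content is a made-up two-row sample):

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "raw_dataset.csv")
    zip_path = os.path.join(tmp_dir, "raw_dataset.zip")

    # Made-up sample standing in for the exported dataframe.
    with open(csv_path, "w", encoding="utf-8") as f:
        f.write("s,emg_1,rep,label\n1,0.0684,0,0\n1,0.0586,0,0\n")

    # arcname stores the file under its bare name instead of the full path.
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zipf:
        zipf.write(csv_path, arcname="raw_dataset.csv")

    # Read the archive back: list members and decode the header line.
    with zipfile.ZipFile(zip_path) as zipf:
        members = zipf.namelist()
        with zipf.open("raw_dataset.csv") as member:
            header = member.readline().decode("utf-8").strip()

print(members)
print(header)
```

The export cell calls `zipf.write` without `arcname`, so in that archive the CSV is stored with its directory prefix; passing `arcname` as in this sketch is one way to keep the member name flat.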

