# Notebook for selecting the data and generating the training, validation, and test sets

This notebook selects the records from the dataset that will be used in the project and splits them into the training, validation, and test sets needed to build the models and run the later analyses.

In [None]:
import numpy as np
import pandas as pd

In [None]:
df_original = pd.read_csv('public_dataset/metadata_compiled.csv')

In [None]:
df_original

Unnamed: 0,uuid,datetime,cough_detected,latitude,longitude,age,gender,respiratory_condition,fever_muscle_pain,status,...,quality_3,dyspnea_3,wheezing_3,stridor_3,choking_3,congestion_3,nothing_3,cough_type_3,diagnosis_3,severity_3
0,00039425-7f3a-42aa-ac13-834aaa2b6b92,2020-04-13T21:30:59.801831+00:00,0.9609,31.3,34.8,15.0,male,False,False,healthy,...,,,,,,,,,,
1,0009eb28-d8be-4dc1-92bb-907e53bc5c7a,2020-04-12T04:02:18.159383+00:00,0.9301,40.0,-75.1,34.0,male,True,False,healthy,...,,,,,,,,,,
2,0012c608-33d0-4ef7-bde3-75a0b1a0024e,2020-04-15T01:03:59.029326+00:00,0.0482,-16.5,-71.5,,,,,,...,,,,,,,,,,
3,001328dc-ea5d-4847-9ccf-c5aa2a3f2d0f,2020-04-13T22:23:06.997578+00:00,0.9968,,,21.0,male,False,False,healthy,...,,,,,,,,,,
4,001c85a8-cc4d-4921-9297-848be52d4715,2020-04-17T15:24:35.822355+00:00,0.0735,40.6,-3.6,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20067,fff13fa2-a725-49ef-812a-39c6cedda33d,2020-04-13T17:51:36.956822+00:00,0.7154,31.9,34.7,21.0,male,True,False,healthy,...,,,,,,,,,,
20068,fff3ff61-2387-4139-938b-539db01e6be5,2020-06-28T21:28:21.530881+00:00,0.5257,51.6,-0.2,,female,False,False,symptomatic,...,,,,,,,,,,
20069,fff474bf-39a4-4a61-8348-6b992fb5e439,2020-04-10T05:10:36.787070+00:00,0.1945,-39.0,-68.1,,,,,,...,,,,,,,,,,
20070,fffaa9f8-4db0-46c5-90fb-93b7b014b55d,2020-04-13T18:58:26.954663+00:00,0.0243,41.0,28.8,50.0,male,True,True,healthy,...,,,,,,,,,,


**Selecting the data from the dataset**

*   The first filter concerns the assessments of the three experts who analyzed the audio. At least one expert must have analyzed the recording, both to attest to the reliability of the diagnosis and to provide the extra annotated features about the audio.

*   The second filter concerns the variable cough_detected, a precomputed estimate of the probability that the recording contains a cough sound. A value of 0.0 means a 0% probability that the audio is a cough; the highest value found in the dataset is 0.9994. We set a threshold of 0.5: only records with cough_detected above 0.5 are considered in the later analyses.

In [None]:
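# Keep recordings annotated by at least one of the three experts
df_expert = df_original[df_original.cough_type_1.notna() | df_original.cough_type_2.notna() | df_original.cough_type_3.notna()]
# Drop records that have no status label
df_expert = df_expert[df_expert.status.notna()]
# Keep only records whose cough_detected score is above the 0.5 threshold
df_expert = df_expert[df_expert.cough_detected > 0.5]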
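
To see what each of these filters removes, the same three steps can be run on a handful of made-up rows. This is only an illustrative sketch: `df_demo` and its values are invented for the example, and only the columns used by the filters are included.

```python
import pandas as pd

# Made-up rows (not taken from the dataset); only the filtered columns are present.
df_demo = pd.DataFrame(
    {
        "cough_type_1": ["dry", None, None, None, "wet", None],
        "cough_type_2": [None, None, "wet", None, None, "dry"],
        "cough_type_3": [None, None, None, "dry", "wet", None],
        "status": ["healthy", "healthy", None, "symptomatic", "healthy", "symptomatic"],
        "cough_detected": [0.90, 0.95, 0.80, 0.30, 0.50, 0.51],
    }
)

# Step 1: at least one expert annotated the recording.
has_expert = (
    df_demo.cough_type_1.notna()
    | df_demo.cough_type_2.notna()
    | df_demo.cough_type_3.notna()
)
step1 = df_demo[has_expert]

# Step 2: the record has a status label.
step2 = step1[step1.status.notna()]

# Step 3: cough_detected is strictly above 0.5 (a value of exactly 0.5 is dropped).
step3 = step2[step2.cough_detected > 0.5]

counts = {
    "all": len(df_demo),
    "after_expert": len(step1),
    "after_status": len(step2),
    "after_cough": len(step3),
}
print(counts)
print(list(step3.index))  # rows 0 and 5 survive all three filters
```

Printing the row count after each step on the real data is a quick way to see which filter discards the most recordings.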

In [None]:
df_expert

Unnamed: 0,uuid,datetime,cough_detected,latitude,longitude,age,gender,respiratory_condition,fever_muscle_pain,status,...,quality_3,dyspnea_3,wheezing_3,stridor_3,choking_3,congestion_3,nothing_3,cough_type_3,diagnosis_3,severity_3
27,005b8518-03ba-4bf5-86d2-005541442357,2020-04-14T20:16:53.677536+00:00,0.9854,45.2,19.7,23.0,female,False,False,healthy,...,,,,,,,,,,
47,008ba489-31ad-44d8-856b-fcf72369dc46,2020-04-13T23:09:36.585124+00:00,0.9962,38.1,-122.2,28.0,female,False,False,healthy,...,good,False,False,True,False,False,False,wet,lower_infection,mild
48,008c1c9e-aeef-40c5-846c-24f1b964f884,2020-04-12T21:25:00.131353+00:00,0.9751,48.9,2.7,44.0,male,False,False,symptomatic,...,good,False,False,False,False,False,True,wet,healthy_cough,pseudocough
64,00bf9f83-2e8f-47cf-a4f2-97f2beceebc1,2020-04-13T19:08:23.388936+00:00,0.9815,41.1,29.0,37.0,male,True,False,healthy,...,good,False,False,False,False,False,False,wet,healthy_cough,pseudocough
68,00ce5b06-c302-4387-bbd7-86355a4a8c12,2020-04-13T20:14:58.986747+00:00,0.9900,5.1,-73.6,41.0,female,True,False,symptomatic,...,good,True,False,False,False,False,False,dry,upper_infection,severe
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20017,ff49f6b7-fa54-4780-a4be-b0594e628aae,2020-04-14T03:37:40.338837+00:00,0.9540,41.4,60.4,17.0,female,False,True,symptomatic,...,good,False,False,False,False,False,True,wet,lower_infection,severe
20023,ff5f97db-9b64-4e35-afe8-af463d5c2c60,2020-04-18T20:27:04.557378+00:00,0.7858,48.9,2.3,49.0,male,False,True,symptomatic,...,,,,,,,,,,
20035,ff8435f6-76b5-42c1-8f4c-7479710e71bf,2020-05-14T11:27:45.230404+00:00,0.9947,-16.5,-68.2,38.0,male,False,True,healthy,...,good,False,False,False,False,False,True,dry,upper_infection,mild
20037,ff8bfcc9-3df2-4752-8280-63f023fba31c,2020-04-13T15:45:10.722965+00:00,0.9830,,,,female,False,False,healthy,...,,,,,,,,,,


**Splitting the data into training, validation, and test sets**

The records are sampled at random: 20% go to the test set, and the remaining 80% is split again 80/20, which gives 64% of the total for training and 16% for validation.


In [None]:
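from numpy.random import RandomState
# Unseeded generator: every run yields a different split (pass an integer seed to make it reproducible)
rng = RandomState()

# Hold out 20% of the expert-annotated records for testing
train_val = df_expert.sample(frac=0.8, random_state=rng)
test = df_expert.loc[~df_expert.index.isin(train_val.index)]

# Split the remaining 80% again: 80% for training (64% overall), 20% for validation (16% overall)
train = train_val.sample(frac=0.8, random_state=rng)
val = train_val.loc[~train_val.index.isin(train.index)]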
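
A quick sanity check on a split like this is to confirm that the three parts are disjoint, cover every record, and have the expected proportions. The sketch below repeats the same two-stage sampling on a synthetic stand-in for `df_expert`; `df_demo`, its 1000 rows, and the seed 42 are assumptions made only so the example is self-contained and reproducible. Selecting the complement through the index relies on the index being unique.

```python
import numpy as np
import pandas as pd
from numpy.random import RandomState

# Hypothetical stand-in for df_expert: 1000 rows with a unique default index.
df_demo = pd.DataFrame({"cough_detected": np.linspace(0.51, 1.0, 1000)})

rng = RandomState(42)  # assumption: fixed seed, used here only for reproducibility

# Same two-stage split: 80/20 into train_val/test, then 80/20 into train/val.
train_val = df_demo.sample(frac=0.8, random_state=rng)
test = df_demo.loc[~df_demo.index.isin(train_val.index)]
train = train_val.sample(frac=0.8, random_state=rng)
val = train_val.loc[~train_val.index.isin(train.index)]

# The three index sets must not overlap and must cover every row.
train_idx, val_idx, test_idx = set(train.index), set(val.index), set(test.index)
disjoint = not (train_idx & val_idx or train_idx & test_idx or val_idx & test_idx)
complete = (train_idx | val_idx | test_idx) == set(df_demo.index)

# Share of the full data that ends up in each part.
fractions = {
    "train": len(train) / len(df_demo),
    "val": len(val) / len(df_demo),
    "test": len(test) / len(df_demo),
}
print(disjoint, complete, fractions)
```

With 1000 rows the parts contain 640, 160, and 200 records, matching the 64%, 16%, and 20% shares described above.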

In [None]:
train.to_csv('train.csv')

In [None]:
val.to_csv('val.csv')

In [None]:
test.to_csv('test.csv')
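
By default, `to_csv` writes the DataFrame index as an unnamed first column, which is presumably where the `Unnamed: 0` column in the tables above comes from. The sketch below shows, on a made-up three-row split written to a temporary directory, how `index_col=0` restores the index when a file is read back; `split_demo` and the file name are invented for the example.

```python
import os
import tempfile

import pandas as pd

# Made-up split with a non-contiguous index, as left behind by filtering and sampling.
split_demo = pd.DataFrame(
    {
        "status": ["healthy", "symptomatic", "healthy"],
        "cough_detected": [0.9854, 0.9751, 0.9900],
    },
    index=[27, 48, 68],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "split_demo.csv")

    # Default index=True: the index is written as a first column with an empty header.
    split_demo.to_csv(path)

    # Without index_col, the saved index comes back as a regular column.
    plain = pd.read_csv(path)

    # With index_col=0, the first column is used as the index again.
    restored = pd.read_csv(path, index_col=0)

print(list(plain.columns))
print(list(restored.index))
```

Reading the split files with `index_col=0` keeps the original row labels, so records can still be matched back to the source table.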

In addition to the training, validation, and test splits, we save the original data to a CSV file.

In [None]:
df_original.to_csv('original.csv')

To obtain more samples for training audio classifiers, we also select recordings that were not annotated by any expert but have a cough_detected value above 0.5.

In [None]:
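# Start from every record that was not selected into df_expert
df_out = df_original.loc[~df_original.index.isin(df_expert.index)]
# Keep only records with no annotation from any of the three experts
df_out = df_out[df_out.cough_type_1.isna()]
df_out = df_out[df_out.cough_type_2.isna()]
df_out = df_out[df_out.cough_type_3.isna()]
# Apply the same status and cough_detected filters used for df_expert
df_out = df_out[df_out.status.notna()]
df_out = df_out[df_out.cough_detected > 0.5]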
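
The "no expert annotation" condition can also be written as a single mask over the three `cough_type` columns, which makes it harder to leave one of them out. A minimal sketch on made-up rows follows; `df_demo`, `expert_cols`, and `df_out_demo` are names introduced only for this example.

```python
import pandas as pd

# Made-up rows (not taken from the dataset); only the filtered columns are present.
df_demo = pd.DataFrame(
    {
        "cough_type_1": ["dry", None, None, None],
        "cough_type_2": [None, None, None, None],
        "cough_type_3": [None, None, "wet", None],
        "status": ["healthy", "healthy", "symptomatic", None],
        "cough_detected": [0.97, 0.88, 0.91, 0.76],
    }
)

expert_cols = ["cough_type_1", "cough_type_2", "cough_type_3"]

# True only where all three expert columns are missing.
no_expert = df_demo[expert_cols].isna().all(axis=1)

# Combine with the status and cough_detected conditions in one selection.
df_out_demo = df_demo[
    no_expert & df_demo.status.notna() & (df_demo.cough_detected > 0.5)
]
print(list(df_out_demo.index))  # only row 1 has no annotation, a status, and a score above 0.5
```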

In [None]:
df_out

Unnamed: 0,uuid,datetime,cough_detected,latitude,longitude,age,gender,respiratory_condition,fever_muscle_pain,status,...,quality_3,dyspnea_3,wheezing_3,stridor_3,choking_3,congestion_3,nothing_3,cough_type_3,diagnosis_3,severity_3
0,00039425-7f3a-42aa-ac13-834aaa2b6b92,2020-04-13T21:30:59.801831+00:00,0.9609,31.3,34.8,15.0,male,False,False,healthy,...,,,,,,,,,,
1,0009eb28-d8be-4dc1-92bb-907e53bc5c7a,2020-04-12T04:02:18.159383+00:00,0.9301,40.0,-75.1,34.0,male,True,False,healthy,...,,,,,,,,,,
3,001328dc-ea5d-4847-9ccf-c5aa2a3f2d0f,2020-04-13T22:23:06.997578+00:00,0.9968,,,21.0,male,False,False,healthy,...,,,,,,,,,,
7,0028b68c-aca4-4f4f-bb1d-cb4ed5bbd952,2020-05-24T12:12:46.394647+00:00,0.8937,,,28.0,female,False,False,healthy,...,,,,,,,,,,
8,00291cce-36a0-4a29-9e2d-c1d96ca17242,2020-04-13T15:10:58.405156+00:00,0.9883,39.4,67.2,15.0,male,False,False,healthy,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20064,ffe5e2a4-ef67-464d-b1cd-b0e321f6a2dd,2020-04-22T05:40:51.730942+00:00,0.5591,13.0,77.6,26.0,male,False,False,healthy,...,,,,,,,,,,
20065,ffedc843-bfc2-4ad6-a749-2bc86bdac84a,2020-06-05T03:41:37.481463+00:00,0.9498,-34.5,-58.5,23.0,male,False,False,healthy,...,,,,,,,,,,
20066,ffeea120-92a4-40f9-b692-c3865c7a983f,2020-05-02T10:18:27.348859+00:00,0.9784,14.3,121.1,22.0,female,False,False,healthy,...,,,,,,,,,,
20067,fff13fa2-a725-49ef-812a-39c6cedda33d,2020-04-13T17:51:36.956822+00:00,0.7154,31.9,34.7,21.0,male,True,False,healthy,...,,,,,,,,,,


In [None]:
df_out.to_csv('train_additional_data.csv')