# PERINATAL ENCOUNTERS SIMULATION AND DATASET CREATION

<p>This notebook builds a dataset from the data that specialists would, in principle, record when a pregnant woman attends a follow-up appointment or an emergency visit. The synthetic data were generated with <a href="https://synthea.mitre.org/about">Synthea</a>.</p>

## DATASET DESCRIPTION

<ul>
    <li>Careplans: records care plans, with treatment codes and descriptions (e.g. antenatal care).</li>
    <li>Conditions: records the medical conditions diagnosed for each patient.</li>
    <li>Observations: contains clinical observations such as height, weight, BMI and vital signs.</li>
    <li>Patients: demographic and general information about the patients.</li>
</ul>


**Strategy**

We will filter the data down to _pregnant_ women, based on pregnancy-related conditions, and build a consolidated dataset with the requested features. It will include <code>patient_id</code>, clinical data and specific health conditions.

In [4]:
import pandas as pd
from datetime import datetime

careplans = pd.read_csv('careplans.csv')
conditions = pd.read_csv('conditions.csv')
observations = pd.read_csv('observations.csv')
patients = pd.read_csv('patients.csv')

### Reading the four datasets

In [6]:
careplans.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1822 entries, 0 to 1821
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Id                 1822 non-null   object 
 1   START              1822 non-null   object 
 2   STOP               961 non-null    object 
 3   PATIENT            1822 non-null   object 
 4   ENCOUNTER          1822 non-null   object 
 5   CODE               1822 non-null   int64  
 6   DESCRIPTION        1822 non-null   object 
 7   REASONCODE         571 non-null    float64
 8   REASONDESCRIPTION  571 non-null    object 
dtypes: float64(1), int64(1), object(7)
memory usage: 128.2+ KB


In [16]:
careplans.head()

Unnamed: 0,Id,START,STOP,PATIENT,ENCOUNTER,CODE,DESCRIPTION,REASONCODE,REASONDESCRIPTION
0,01612794-cb3e-02ac-1e6f-857a03b9bc69,2018-11-18,2018-12-13,d778d02c-1b71-582e-0c64-e9ce715996fc,f549ca01-9e59-e01c-0b2d-c583240c3ab1,225358003,Wound care (regime/therapy),284549007.0,Laceration of hand (disorder)
1,7ca9b0a8-2941-8c67-0778-0f132ea1a29e,2022-10-06,2023-02-02,d778d02c-1b71-582e-0c64-e9ce715996fc,aa60e9ca-6271-8103-97ad-052099369212,53950000,Respiratory therapy (procedure),,
2,2f17bca4-8478-8a83-b2bd-33a2d53485ce,2024-03-08,2024-04-19,958cbaed-4a32-40ff-f2ee-d55edc4f7611,e839b319-4afe-187b-219c-374d7529e79b,47387005,Head injury rehabilitation (regime/therapy),62564004.0,Concussion with loss of consciousness (disorder)
3,ae9b05d7-b88b-0a25-5f98-bcb2101baea9,2024-09-19,,d778d02c-1b71-582e-0c64-e9ce715996fc,0a723901-5534-ddd3-dfd3-4bee300feb41,134435003,Routine antenatal care (regime/therapy),,
4,c6e88795-3d78-21a4-3ffa-12bbf59e7ae9,2014-03-25,,ff3708b4-748c-ae52-2430-bc60dd9fb5dd,dbe4c0db-e925-25a1-d666-1a4b962aa43b,276239002,Therapy (regime/therapy),,


In [8]:
conditions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14480 entries, 0 to 14479
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   START        14480 non-null  object
 1   STOP         10628 non-null  object
 2   PATIENT      14480 non-null  object
 3   ENCOUNTER    14480 non-null  object
 4   SYSTEM       14480 non-null  object
 5   CODE         14480 non-null  int64 
 6   DESCRIPTION  14480 non-null  object
dtypes: int64(1), object(6)
memory usage: 792.0+ KB


In [18]:
conditions.head()

Unnamed: 0,START,STOP,PATIENT,ENCOUNTER,SYSTEM,CODE,DESCRIPTION
0,2014-10-27,2016-11-07,958cbaed-4a32-40ff-f2ee-d55edc4f7611,9c7664c0-33a3-6778-7845-7b08030a230c,http://snomed.info/sct,314529007,Medication review due (situation)
1,2014-12-18,2015-12-24,d778d02c-1b71-582e-0c64-e9ce715996fc,3682fee4-9f43-f8ae-bc5c-8a0b8b42528f,http://snomed.info/sct,314529007,Medication review due (situation)
2,2015-05-01,2015-05-11,d778d02c-1b71-582e-0c64-e9ce715996fc,c77234be-e24b-5ebf-0e7a-0c835da53642,http://snomed.info/sct,195662009,Acute viral pharyngitis (disorder)
3,2015-09-13,2015-10-02,d778d02c-1b71-582e-0c64-e9ce715996fc,dec60578-6d45-e7a1-ccbf-5c8a031f93c7,http://snomed.info/sct,444814009,Viral sinusitis (disorder)
4,2015-12-24,2015-12-31,d778d02c-1b71-582e-0c64-e9ce715996fc,99b291c0-18a3-afd1-f002-cb3bf4367b4e,http://snomed.info/sct,66383009,Gingivitis (disorder)


In [10]:
observations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280211 entries, 0 to 280210
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   DATE         280211 non-null  object
 1   PATIENT      280211 non-null  object
 2   ENCOUNTER    272606 non-null  object
 3   CATEGORY     272606 non-null  object
 4   CODE         280211 non-null  object
 5   DESCRIPTION  280211 non-null  object
 6   VALUE        280211 non-null  object
 7   UNITS        204565 non-null  object
 8   TYPE         280211 non-null  object
dtypes: object(9)
memory usage: 19.2+ MB


In [20]:
observations.head()

Unnamed: 0,DATE,PATIENT,ENCOUNTER,CATEGORY,CODE,DESCRIPTION,VALUE,UNITS,TYPE
0,2015-11-02T23:58:08Z,958cbaed-4a32-40ff-f2ee-d55edc4f7611,75f6d06e-01e4-999f-656e-a4def079a8f6,vital-signs,8302-2,Body Height,146.5,cm,numeric
1,2015-11-02T23:58:08Z,958cbaed-4a32-40ff-f2ee-d55edc4f7611,75f6d06e-01e4-999f-656e-a4def079a8f6,vital-signs,72514-3,Pain severity - 0-10 verbal numeric rating [Sc...,1.0,{score},numeric
2,2015-11-02T23:58:08Z,958cbaed-4a32-40ff-f2ee-d55edc4f7611,75f6d06e-01e4-999f-656e-a4def079a8f6,vital-signs,29463-7,Body Weight,41.8,kg,numeric
3,2015-11-02T23:58:08Z,958cbaed-4a32-40ff-f2ee-d55edc4f7611,75f6d06e-01e4-999f-656e-a4def079a8f6,vital-signs,39156-5,Body mass index (BMI) [Ratio],19.5,kg/m2,numeric
4,2015-11-02T23:58:08Z,958cbaed-4a32-40ff-f2ee-d55edc4f7611,75f6d06e-01e4-999f-656e-a4def079a8f6,vital-signs,59576-9,Body mass index (BMI) [Percentile] Per age and...,67.5,%,numeric


In [14]:
patients.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 364 entries, 0 to 363
Data columns (total 28 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Id                   364 non-null    object 
 1   BIRTHDATE            364 non-null    object 
 2   DEATHDATE            144 non-null    object 
 3   SSN                  364 non-null    object 
 4   DRIVERS              364 non-null    object 
 5   PASSPORT             336 non-null    object 
 6   PREFIX               354 non-null    object 
 7   FIRST                364 non-null    object 
 8   MIDDLE               292 non-null    object 
 9   LAST                 364 non-null    object 
 10  SUFFIX               6 non-null      object 
 11  MAIDEN               219 non-null    object 
 12  MARITAL              278 non-null    object 
 13  RACE                 364 non-null    object 
 14  ETHNICITY            364 non-null    object 
 15  GENDER               364 non-null    obj

In [22]:
patients.head()

Unnamed: 0,Id,BIRTHDATE,DEATHDATE,SSN,DRIVERS,PASSPORT,PREFIX,FIRST,MIDDLE,LAST,...,CITY,STATE,COUNTY,FIPS,ZIP,LAT,LON,HEALTHCARE_EXPENSES,HEALTHCARE_COVERAGE,INCOME
0,958cbaed-4a32-40ff-f2ee-d55edc4f7611,2003-10-13,,999-46-5780,S99985773,X43751887X,Ms.,Sol312,Ciara810,Baumbach677,...,Tyngsborough,Massachusetts,Middlesex County,,0,42.678887,-71.466652,83748.67,8386.12,847165
1,d778d02c-1b71-582e-0c64-e9ce715996fc,2004-12-09,,999-44-8153,S99922963,X17785689X,Ms.,Irma773,Terresa418,Shields502,...,Barnstable,Massachusetts,Barnstable County,25001.0,2648,41.712246,-70.45111,80934.7,70863.0,62412
2,ff3708b4-748c-ae52-2430-bc60dd9fb5dd,1991-09-05,,999-31-9506,S99963368,X48163920X,Mrs.,Rocio28,Ángela136,Bermúdez789,...,Chicopee,Massachusetts,Hampden County,25013.0,1020,42.175006,-72.570417,10853.87,707739.51,1348
3,8cc46582-8727-0024-6010-c5e7e2943578,2000-07-25,,999-55-8680,S99950926,X19763814X,Ms.,Lelah386,Crystal2,Leannon79,...,Worcester,Massachusetts,Worcester County,25027.0,1605,42.192066,-71.751869,122151.8,144375.81,95771
4,6da671b7-6462-2ced-5b86-5b0fced4308b,1974-02-04,,999-31-4155,S99970004,X2187110X,Mrs.,Olympia319,Pamula578,Huels583,...,Revere,Massachusetts,Suffolk County,25025.0,2151,42.461726,-71.000002,186763.75,853299.76,26068


As we can see, the tables share columns that we can use to map the data of interest and build our final dataset. Note that the records
cover dates earlier than the period handled in the other notebooks (September 2024), and that they contain NaN values, which we will
need to handle as we did previously.
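
As a quick check of the two points above, the sketch below lists the column names that two tables share and the fraction of missing values per column. It runs on small made-up frames rather than the real CSV files; the `*_demo` frames and the helpers `shared_columns` and `missing_share` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Toy stand-ins for three Synthea tables; all values are made up.
patients_demo = pd.DataFrame({
    "Id": ["p1", "p2"],
    "BIRTHDATE": ["1990-01-01", "1985-06-15"],
    "DEATHDATE": [None, "2020-03-01"],
})
conditions_demo = pd.DataFrame({
    "START": ["2024-01-10", "2019-05-02"],
    "STOP": [None, "2019-06-01"],
    "PATIENT": ["p1", "p2"],
    "ENCOUNTER": ["e1", "e2"],
})
observations_demo = pd.DataFrame({
    "DATE": ["2024-01-10T09:00:00Z", "2019-05-02T11:30:00Z"],
    "PATIENT": ["p1", "p2"],
    "ENCOUNTER": ["e1", None],
})


def shared_columns(left, right):
    """Column names present in both frames, i.e. candidate join keys."""
    return sorted(set(left.columns) & set(right.columns))


def missing_share(df):
    """Fraction of missing values per column."""
    return df.isna().mean()


keys = shared_columns(conditions_demo, observations_demo)
print(keys)  # ['ENCOUNTER', 'PATIENT']
print(missing_share(patients_demo))
```

The patients table shares no column name with the other tables: its key is `Id`, which corresponds to `PATIENT` elsewhere, so that join has to name both sides explicitly.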

**Filtering the data**

In [42]:
# 1. Filter pregnancy-related rows in conditions
pregnancy_conditions = conditions[conditions['DESCRIPTION'].str.contains(
    "pregnancy|antenatal|gestation", case=False, na=False)]

# 2. Get the IDs of patients with a pregnancy-related condition
pregnancy_patients = pregnancy_conditions['PATIENT'].unique()

# 3. Keep the patients that match those IDs and are alive
pregnant_patientsAlive = patients[(patients['Id'].isin(
    pregnancy_patients)) & (patients['DEATHDATE'].isna())]

# 4. Filter the observations of interest: vital signs, weight, height, BMI and glucose tests.
# NOTE: check these patterns against observations['DESCRIPTION'].unique(); a pattern such as
# "HeartRate" does not match a label written as "Heart rate".
relevant_observations = observations[observations['DESCRIPTION'].str.contains(
    "Body Height|Body Weight|Body mass index|HeartRate|diastolicBP|systolicBP|Glucose", case=False, na=False)]

# 5. Keep the observations that belong to the pregnancy patients
pregnancy_observations = relevant_observations[
    relevant_observations['PATIENT'].isin(pregnancy_patients)]

# 6. Merge the data to build the dataset (inner join, so deceased patients drop out)
encounter_data = pregnancy_observations.merge(
    pregnant_patientsAlive,
    left_on='PATIENT',
    right_on='Id',
    suffixes=('_observation', '_patient')
)

# 7. Select the columns relevant to the analysis
encounter_data_filtered = encounter_data[[
    'PATIENT', 'DESCRIPTION', 'VALUE', 'UNITS', 'DATE', 'BIRTHDATE', 'GENDER', 'RACE', 'ETHNICITY'
]]

# 8. Add the type of visit from careplans
reason = careplans[['ENCOUNTER', 'DESCRIPTION']].rename(columns={'DESCRIPTION': 'ENCOUNTER_TYPE'})
encounter_data_reason = encounter_data.merge(
    reason,
    on='ENCOUNTER',
    how='left'
)

# 9. Keep the relevant columns (note that ENCOUNTER_TYPE from step 8 is not kept here)
encounter_data_reasonFiltered = encounter_data_reason[[
    'PATIENT', 'DESCRIPTION', 'VALUE', 'UNITS', 'DATE', 'BIRTHDATE', 'GENDER', 'RACE', 'ETHNICITY'
]]
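
Step 4 relies on substring patterns, so it is worth checking how many rows each pattern matches. The sketch below does this on a made-up list of labels. The labels are an assumption about how vital signs may be written in a Synthea export, not a listing of the real file, and `matches_per_pattern` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical description labels, used only to illustrate the check.
descriptions = pd.Series([
    "Body Height",
    "Body Weight",
    "Body mass index (BMI) [Ratio]",
    "Heart rate",
    "Diastolic Blood Pressure",
    "Systolic Blood Pressure",
    "Glucose [Mass/volume] in Blood",
])

# The same alternatives used in step 4, split into a list.
patterns = ["Body Height", "Body Weight", "Body mass index",
            "HeartRate", "diastolicBP", "systolicBP", "Glucose"]


def matches_per_pattern(series, patterns):
    """Count, for each pattern, how many entries contain it (case-insensitive)."""
    return {
        p: int(series.str.contains(p, case=False, na=False).sum())
        for p in patterns
    }


counts = matches_per_pattern(descriptions, patterns)
unmatched = [p for p, n in counts.items() if n == 0]
print(unmatched)  # ['HeartRate', 'diastolicBP', 'systolicBP']
```

Running the same helper on `observations['DESCRIPTION']` would show whether any alternative in the step 4 pattern matches nothing in the real data.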


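A left merge such as the one in step 8 repeats an observation row once for every care plan attached to the same encounter. The sketch below shows the effect on made-up data, along with one possible guard: keep a single care plan per encounter and let `validate="many_to_one"` confirm the key is unique on the right side. Keeping the first care plan is an arbitrary choice made for the illustration.

```python
import pandas as pd

# Made-up observations: two rows for encounter e1, one for e2.
obs = pd.DataFrame({
    "ENCOUNTER": ["e1", "e1", "e2"],
    "DESCRIPTION": ["Body Height", "Body Weight", "Body Height"],
})
# Made-up care plans: encounter e1 has two care plans, e2 has none.
plans = pd.DataFrame({
    "ENCOUNTER": ["e1", "e1"],
    "ENCOUNTER_TYPE": ["Routine antenatal care (regime/therapy)",
                       "Therapy (regime/therapy)"],
})

# Naive left merge: each e1 observation is repeated once per care plan.
naive = obs.merge(plans, on="ENCOUNTER", how="left")

# Guarded merge: one care plan per encounter, checked by pandas.
plans_unique = plans.drop_duplicates(subset="ENCOUNTER", keep="first")
safe = obs.merge(plans_unique, on="ENCOUNTER", how="left",
                 validate="many_to_one")

print(len(obs), len(naive), len(safe))  # 3 5 3
```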


In [73]:
def calculate_age(birthdate, observation_date):
    """
    Compute the patient's age on the date of the visit.
    ARGS:
        - birthdate: date of birth.
        - observation_date: date of the visit.
    RETURNS:
        The computed age, or None if it is negative or cannot be computed.
    """
    try:
        if isinstance(birthdate, str):
            birthdate = datetime.strptime(birthdate.split('T')[0], '%Y-%m-%d')
        if isinstance(observation_date, str):
            observation_date = datetime.strptime(observation_date.split('T')[0], '%Y-%m-%d')
        age = observation_date.year - birthdate.year - ((observation_date.month, observation_date.day) < (birthdate.month, birthdate.day))
        return age if age >= 0 else None  # Return None if the age is negative
    except (ValueError, TypeError, AttributeError):
        return None  # Malformed or missing dates yield None
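
The same rule (year difference, minus one if the birthday has not yet occurred) can also be applied to whole columns at once. The sketch below is one possible vectorized variant, not the notebook's method; `age_in_years` is a hypothetical name, and missing or unparseable dates yield `<NA>`.

```python
import pandas as pd


def age_in_years(birthdates, observation_dates):
    """Whole-year age at each observation date, computed column-wise."""
    birth = pd.to_datetime(pd.Series(birthdates), errors="coerce", utc=True)
    seen = pd.to_datetime(pd.Series(observation_dates), errors="coerce", utc=True)
    # True when the birthday has already occurred in the observation year.
    had_birthday = (seen.dt.month > birth.dt.month) | (
        (seen.dt.month == birth.dt.month) & (seen.dt.day >= birth.dt.day)
    )
    age = seen.dt.year - birth.dt.year - (~had_birthday).astype(int)
    # Negative ages become missing; Int64 keeps integers alongside <NA>.
    return age.where(age >= 0).astype("Int64")


ages = age_in_years(
    ["2003-10-13", "2004-12-09", "2003-10-13", None],
    ["2015-11-02T23:58:08Z", "2015-12-24T14:59:14Z",
     "2018-11-19T23:58:08Z", "2020-01-01T00:00:00Z"],
)
print(ages.tolist())  # [12, 11, 15, <NA>]
```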


In [75]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
encounter_data_reasonFiltered = encounter_data_reasonFiltered.assign(
    Age=encounter_data_reasonFiltered.apply(
        lambda row: calculate_age(row['BIRTHDATE'], row['DATE']), axis=1))


In [77]:
# Build the set of patients with diabetes-related conditions.
# Note: the pattern matches any label containing the substring, e.g. "Prediabetes".
diabetes_patients = set(conditions[conditions['DESCRIPTION'].str.contains("diabetes", case=False, na=False)]['PATIENT'])

# Add a column indicating whether the patient has diabetes (1) or not (0);
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
encounter_data_reasonFiltered = encounter_data_reasonFiltered.assign(
    HAS_DIABETES=encounter_data_reasonFiltered['PATIENT'].isin(diabetes_patients).astype(int)
)


In [85]:
# Final dataset:
encounter_data_reasonFiltered.head()

Unnamed: 0,PATIENT,DESCRIPTION,VALUE,UNITS,DATE,BIRTHDATE,GENDER,RACE,ETHNICITY,Age,HAS_DIABETES
0,958cbaed-4a32-40ff-f2ee-d55edc4f7611,Body Height,146.5,cm,2015-11-02T23:58:08Z,2003-10-13,F,asian,nonhispanic,12,0
1,958cbaed-4a32-40ff-f2ee-d55edc4f7611,Body Weight,41.8,kg,2015-11-02T23:58:08Z,2003-10-13,F,asian,nonhispanic,12,0
2,958cbaed-4a32-40ff-f2ee-d55edc4f7611,Body mass index (BMI) [Ratio],19.5,kg/m2,2015-11-02T23:58:08Z,2003-10-13,F,asian,nonhispanic,12,0
3,958cbaed-4a32-40ff-f2ee-d55edc4f7611,Body mass index (BMI) [Percentile] Per age and...,67.5,%,2015-11-02T23:58:08Z,2003-10-13,F,asian,nonhispanic,12,0
4,d778d02c-1b71-582e-0c64-e9ce715996fc,Body Height,149.9,cm,2015-12-24T14:59:14Z,2004-12-09,F,white,nonhispanic,11,0


The output shows rows for patients aged 11 and 12, which is NOT plausible for pregnant patients. Let's clean these rows out.

In [91]:
encounter_data_reasonFiltered = encounter_data_reasonFiltered[
    (encounter_data_reasonFiltered['Age'] >= 15) & (encounter_data_reasonFiltered['Age'] <= 60)
]
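
Observations are kept for the whole history of each selected patient, including years without a pregnancy, which is one possible reason for the low ages. An alternative to an age cut-off, sketched below on made-up data, is to keep only observations dated inside a pregnancy-related condition interval (`START` to `STOP`, open-ended when `STOP` is missing). This is an assumption about how the filter could be tightened, not what the notebook does; `within_episodes` is a hypothetical helper.

```python
import pandas as pd

# Made-up observations for one patient across several years.
obs = pd.DataFrame({
    "PATIENT": ["p1", "p1", "p1"],
    "DATE": ["2015-11-02T23:58:08Z", "2024-03-01T10:00:00Z",
             "2024-10-05T09:30:00Z"],
    "DESCRIPTION": ["Body Weight", "Body Weight", "Body Weight"],
})
# Made-up pregnancy episode taken from a conditions-like table.
episodes = pd.DataFrame({
    "PATIENT": ["p1"],
    "START": ["2024-02-10"],
    "STOP": ["2024-09-20"],
})


def within_episodes(obs, episodes):
    """Keep observations dated inside any episode of the same patient."""
    o = obs.assign(DATE_TS=pd.to_datetime(obs["DATE"], utc=True))
    e = episodes.assign(
        START_TS=pd.to_datetime(episodes["START"], utc=True),
        STOP_TS=pd.to_datetime(episodes["STOP"], utc=True),
    )[["PATIENT", "START_TS", "STOP_TS"]]
    # Pair every observation with every episode of the same patient.
    joined = o.reset_index().merge(e, on="PATIENT", how="inner")
    inside = (joined["DATE_TS"] >= joined["START_TS"]) & (
        joined["STOP_TS"].isna() | (joined["DATE_TS"] <= joined["STOP_TS"])
    )
    keep = joined.loc[inside, "index"].unique()
    return obs.loc[keep]


during_pregnancy = within_episodes(obs, episodes)
print(during_pregnancy["DATE"].tolist())  # ['2024-03-01T10:00:00Z']
```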

In [93]:
encounter_data_reasonFiltered.head()

Unnamed: 0,PATIENT,DESCRIPTION,VALUE,UNITS,DATE,BIRTHDATE,GENDER,RACE,ETHNICITY,Age,HAS_DIABETES
24,958cbaed-4a32-40ff-f2ee-d55edc4f7611,Body Height,157.6,cm,2018-11-19T23:58:08Z,2003-10-13,F,asian,nonhispanic,15,0
25,958cbaed-4a32-40ff-f2ee-d55edc4f7611,Body Weight,54.3,kg,2018-11-19T23:58:08Z,2003-10-13,F,asian,nonhispanic,15,0
26,958cbaed-4a32-40ff-f2ee-d55edc4f7611,Body mass index (BMI) [Ratio],21.9,kg/m2,2018-11-19T23:58:08Z,2003-10-13,F,asian,nonhispanic,15,0
27,958cbaed-4a32-40ff-f2ee-d55edc4f7611,Body mass index (BMI) [Percentile] Per age and...,70.8,%,2018-11-19T23:58:08Z,2003-10-13,F,asian,nonhispanic,15,0
28,958cbaed-4a32-40ff-f2ee-d55edc4f7611,Body Height,158.2,cm,2019-11-25T23:58:08Z,2003-10-13,F,asian,nonhispanic,16,0


In [101]:
encounter_data_reasonFiltered.to_csv('synthea_data')
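
By default `to_csv` also writes the DataFrame index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. The sketch below shows a round trip with `index=False` on a made-up frame. The `.csv` file name is hypothetical, and a temporary directory is used so nothing is written next to the notebook.

```python
import os
import tempfile

import pandas as pd

# Made-up frame with a non-contiguous index, as left behind by row filtering.
demo = pd.DataFrame({"PATIENT": ["p1", "p2"], "Age": [15, 16]}, index=[24, 28])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "synthea_data.csv")  # hypothetical file name
    demo.to_csv(path, index=False)   # do not write the index column
    restored = pd.read_csv(path)

print(list(restored.columns))  # ['PATIENT', 'Age']
```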

## DATASET ANALYSIS

We will restrict the data to the date range it shares with the wearable data.

In [99]:
# First and last recorded dates
date_range = (
    encounter_data_reasonFiltered['DATE'].min(),
    encounter_data_reasonFiltered['DATE'].max()
)
print("Date range covered by the dataset:", date_range)

Date range covered by the dataset: ('2014-12-30T02:35:59Z', '2024-12-23T07:20:05Z')
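
One way to restrict rows to a shared period is to parse `DATE` into timestamps and keep those inside the window. The sketch below uses made-up rows and assumed bounds of 1 to 30 September 2024; the actual wearable date range is not given here, and `filter_by_period` is a hypothetical helper.

```python
import pandas as pd


def filter_by_period(df, start, end, date_col="DATE"):
    """Keep rows whose timestamp lies in [start, end], both bounds inclusive."""
    ts = pd.to_datetime(df[date_col], utc=True, errors="coerce")
    mask = ts.between(pd.Timestamp(start, tz="UTC"), pd.Timestamp(end, tz="UTC"))
    return df.loc[mask]


demo = pd.DataFrame({
    "PATIENT": ["p1", "p1", "p2"],
    "DATE": ["2014-12-30T02:35:59Z", "2024-09-15T08:00:00Z",
             "2024-12-23T07:20:05Z"],
})

# Assumed bounds for illustration only. A bare date means midnight UTC,
# so the end bound excludes anything later on 30 September.
in_period = filter_by_period(demo, "2024-09-01", "2024-09-30")
print(in_period["DATE"].tolist())  # ['2024-09-15T08:00:00Z']
```

Comparing parsed timestamps rather than raw strings keeps the filter correct even if the date columns do not all share the same text format.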
