# ESCUELA COLOMBIANA DE INGENIERÍA JULIO GARAVITO
# PRINCIPLES AND TECHNOLOGIES OF AI 2025-2
# DISEASE PREDICTION FROM CLINICAL RECORDS
# FINAL PROJECT

**OBJECTIVES**
1. Develop an artificial intelligence model capable of predicting diseases from clinical record data.

2. Analyze, clean, and structure the clinical dataset to ensure its quality and usefulness in the modeling process.

3. Apply and reinforce the theoretical and practical concepts covered in the course, especially those related to machine learning, data preprocessing, and model evaluation.


In [None]:
!pip install pyhealth "pandas==1.5.3" "numpy<2.0.0"
!pip install transformers torch

Collecting pyhealth
  Downloading pyhealth-1.1.6-py2.py3-none-any.whl.metadata (28 kB)
Collecting pandas==1.5.3
  Downloading pandas-1.5.3.tar.gz (5.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.2/5.2 MB 56.3 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting numpy<2.0.0
  Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.0/61.0 kB 5.0 MB/s eta 0:00:00
Collecting rdkit>=2022.03.4 (from pyhealth)
  Downloading rdkit-2025.9.3-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (4.2 kB)
Collecting pandarallel>=1.5.3 (from pyhealth)
  Downloading pandarallel-1.6.5.tar.gz (14 kB)
  Preparing metadata (setup.py) ... done
Collecting mne>=1.0.3 (from 
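
Because the install cell pins `pandas==1.5.3` and `numpy<2.0.0`, it can help to confirm which versions the runtime actually resolved before importing anything. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Return the installed version of each package, or None if it is missing."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Packages named in the install cell; the printed values depend on the runtime.
for name, ver in installed_versions(
    ["pyhealth", "pandas", "numpy", "transformers", "torch"]
).items():
    print(f"{name}: {ver or 'not installed'}")
```

If a pinned package reports a version other than the requested one, restarting the runtime after installation is usually the first thing to try in Colab.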



In [None]:
import pyhealth
import transformers
import pandas as pd
import torch
import numpy as np
import kagglehub
import os
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

In [None]:
path = kagglehub.dataset_download("tboyle10/medicaltranscriptions")
df = pd.read_csv(os.path.join(path,'mtsamples.csv'))
df.head()

Using Colab cache for faster access to the 'medicaltranscriptions' dataset.


Unnamed: 0.1,Unnamed: 0,description,medical_specialty,sample_name,transcription,keywords
0,0,A 23-year-old white female presents with comp...,Allergy / Immunology,Allergic Rhinitis,"SUBJECTIVE:, This 23-year-old white female pr...","allergy / immunology, allergic rhinitis, aller..."
1,1,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 2,"PAST MEDICAL HISTORY:, He has difficulty climb...","bariatrics, laparoscopic gastric bypass, weigh..."
2,2,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 1,"HISTORY OF PRESENT ILLNESS: , I have seen ABC ...","bariatrics, laparoscopic gastric bypass, heart..."
3,3,2-D M-Mode. Doppler.,Cardiovascular / Pulmonary,2-D Echocardiogram - 1,"2-D M-MODE: , ,1. Left atrial enlargement wit...","cardiovascular / pulmonary, 2-d m-mode, dopple..."
4,4,2-D Echocardiogram,Cardiovascular / Pulmonary,2-D Echocardiogram - 2,1. The left ventricular cavity size and wall ...,"cardiovascular / pulmonary, 2-d, doppler, echo..."


In [None]:
df = df.dropna(subset=['transcription', 'medical_specialty'])
top_categories = df['medical_specialty'].value_counts().nlargest(5).index.tolist()
df_filtered = df[df['medical_specialty'].isin(top_categories)].copy()

print(f"\nCategories selected for training: {top_categories}")
print(f"Total clinical records to process: {len(df_filtered)}")


Categories selected for training: [' Surgery', ' Consult - History and Phy.', ' Cardiovascular / Pulmonary', ' Orthopedic', ' Radiology']
Total clinical records to process: 2603
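
The category names printed above carry a leading space (for example `' Surgery'`), which is easy to miss when comparing labels later. A minimal sketch of stripping that whitespace before counting, using a toy DataFrame with made-up rows in place of `mtsamples.csv`; the helper `top_specialties` is hypothetical.

```python
import pandas as pd

# Toy frame standing in for mtsamples.csv; the rows are illustrative only.
toy = pd.DataFrame({
    "medical_specialty": [" Surgery", " Surgery", " Radiology",
                          " Orthopedic", " Surgery", " Radiology"],
    "transcription": ["a", "b", "c", "d", "e", "f"],
})


def top_specialties(frame: pd.DataFrame, k: int) -> list:
    """Strip surrounding whitespace from the labels and return the k most frequent."""
    cleaned = frame["medical_specialty"].str.strip()
    return cleaned.value_counts().nlargest(k).index.tolist()


top2 = top_specialties(toy, k=2)
print(top2)
```

Applying the same `.str.strip()` to the real column before `value_counts()` would leave the selection unchanged and only normalize the label text.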


In [None]:
print("\n2. Load BioBERT")
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")
model = AutoModel.from_pretrained("dmis-lab/biobert-v1.1")

def get_biobert_embedding(text):
    # Tokenize to a fixed length of 128 tokens; longer texts are truncated.
    inputs = tokenizer(str(text), return_tensors="pt", max_length=128, truncation=True, padding='max_length')

    # Inference only, so gradients are not needed.
    with torch.no_grad():
        outputs = model(**inputs)

    # Use the hidden state of the first token ([CLS]) as the document vector.
    return outputs.last_hidden_state[0][0].numpy()


print("3. Generate embeddings with BioBERT")
df_filtered['embedding'] = df_filtered['transcription'].apply(get_biobert_embedding)


2. Load BioBERT


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


3. Generate embeddings with BioBERT
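
The embedding function truncates every transcription to 128 tokens, so most of a long note never reaches the model. One common alternative, which is not what the notebook does, is to split the token ids into overlapping windows, embed each window, and average the resulting vectors. The sketch below shows only the windowing step in plain Python; `split_into_windows` and its defaults are assumptions (126 leaves room for the `[CLS]` and `[SEP]` tokens in a 128-token input).

```python
from typing import List


def split_into_windows(token_ids: List[int], window: int = 126,
                       overlap: int = 16) -> List[List[int]]:
    """Split a token-id sequence into overlapping windows of at most `window` ids."""
    if window <= overlap:
        raise ValueError("window must be larger than overlap")
    step = window - overlap
    windows: List[List[int]] = []
    for start in range(0, max(len(token_ids), 1), step):
        windows.append(token_ids[start:start + window])
        # Stop once the current window reaches the end of the sequence.
        if start + window >= len(token_ids):
            break
    return windows


# Hypothetical 300-token note: ids 0..299 stand in for real vocabulary ids.
demo_windows = split_into_windows(list(range(300)))
print([len(w) for w in demo_windows])
```

Each window would then be embedded on its own and the per-window vectors combined, for instance with `np.mean(np.stack(vectors), axis=0)`, at the cost of one forward pass per window.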

In [None]:
X = np.stack(df_filtered['embedding'].values) # BioBERT vectors
y = df_filtered['medical_specialty']          # Labels

print("\n4. Training the model")

# Convert the text labels to integers. LabelEncoder assigns codes in sorted order
# (' Cardiovascular / Pulmonary' -> 0, ..., ' Surgery' -> 4).
le = LabelEncoder()
y_encoded = le.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000, class_weight='balanced')
clf.fit(X_train, y_train)


4. Training the model
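
The classifier is trained with `class_weight='balanced'`, which scikit-learn documents as weighting each class by `n_samples / (n_classes * np.bincount(y))`. A minimal NumPy sketch of that formula on a hypothetical label vector; the helper name and the demo labels are illustrative, not taken from the notebook.

```python
import numpy as np


def balanced_class_weights(y: np.ndarray) -> np.ndarray:
    """Weight per class: n_samples / (n_classes * count_of_class)."""
    counts = np.bincount(y)
    return len(y) / (len(counts) * counts)


# Hypothetical label vector with 3 classes of sizes 6, 3, and 1.
y_demo = np.array([0] * 6 + [1] * 3 + [2] * 1)
weights = balanced_class_weights(y_demo)
print(weights)
```

Rare classes receive larger weights, so their errors count more in the loss. This tends to raise recall on small classes at some cost in precision.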


In [None]:
print("\n--- ANALYSIS RESULTS ---")
y_pred = clf.predict(X_test)

target_names = le.classes_
print(classification_report(y_test, y_pred, target_names=target_names))


--- ANALYSIS RESULTS ---
                             precision    recall  f1-score   support

 Cardiovascular / Pulmonary       0.28      0.37      0.32        68
 Consult - History and Phy.       0.75      0.72      0.73       109
                 Orthopedic       0.29      0.37      0.33        65
                  Radiology       0.46      0.53      0.49        55
                    Surgery       0.77      0.62      0.69       224

                   accuracy                           0.57       521
                  macro avg       0.51      0.52      0.51       521
               weighted avg       0.61      0.57      0.58       521
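
The two average rows of the report can be checked by hand: the macro average treats every class equally, while the weighted average scales each class by its support. The sketch below recomputes both F1 averages from the per-class values in the report, and adds the accuracy of always predicting the largest class as a reference point. Because it starts from rounded figures, small differences from the report are possible in general.

```python
# Per-class F1 and support copied from the classification report.
report = {
    "Cardiovascular / Pulmonary": (0.32, 68),
    "Consult - History and Phy.": (0.73, 109),
    "Orthopedic": (0.33, 65),
    "Radiology": (0.49, 55),
    "Surgery": (0.69, 224),
}

f1_scores = [f1 for f1, _ in report.values()]
supports = [n for _, n in report.values()]
total = sum(supports)

# Macro: plain mean over classes. Weighted: mean weighted by class support.
macro_f1 = sum(f1_scores) / len(f1_scores)
weighted_f1 = sum(f1 * n for f1, n in report.values()) / total

# Accuracy of a trivial model that always predicts the largest class.
majority_baseline = max(supports) / total

print(f"macro F1: {macro_f1:.2f}")
print(f"weighted F1: {weighted_f1:.2f}")
print(f"majority-class accuracy: {majority_baseline:.2f}")
```

The recomputed averages agree with the report (0.51 and 0.58). The majority-class baseline is about 0.43, so the reported accuracy of 0.57 is a real but modest improvement over guessing `Surgery` every time.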



In [None]:
print("\n--- PREDICTION TEST ---")
# Note: row 0 of df_filtered may have been part of the training split,
# so this is a sanity check rather than a held-out evaluation.
ejemplo_idx = 0
texto_real = df_filtered.iloc[ejemplo_idx]['transcription'][:100] + "..."
cat_real = df_filtered.iloc[ejemplo_idx]['medical_specialty']
vector_prueba = df_filtered.iloc[ejemplo_idx]['embedding'].reshape(1, -1)
prediccion_num = clf.predict(vector_prueba)[0]
prediccion_txt = le.inverse_transform([prediccion_num])[0]

print(f"Text: {texto_real}")
print(f"Actual: {cat_real}")
print(f"Model predicted: {prediccion_txt}")


--- PREDICTION TEST ---
Text: 2-D M-MODE: , ,1.  Left atrial enlargement with left atrial diameter of 4.7 cm.,2.  Normal size righ...
Actual:  Cardiovascular / Pulmonary
Model predicted:  Cardiovascular / Pulmonary
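
The single-example check reads row 0 of the full filtered table, which the classifier may have seen during training. To pick a row that is guaranteed to be held out, one option is to split the row positions with the same `test_size` and `random_state` as the training cell. Without stratification the shuffle depends only on the sample count and the seed, so this should reproduce the same partition. A minimal sketch, assuming the 2603 filtered records reported earlier:

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 2603  # number of filtered records reported by the notebook

# Split row positions instead of the data itself, with the same settings
# as the training cell (test_size=0.2, random_state=42, no stratify).
idx_train, idx_test = train_test_split(
    np.arange(n_samples), test_size=0.2, random_state=42
)

held_out_idx = int(idx_test[0])  # a row position from the test partition
print(len(idx_train), len(idx_test))
```

Using `df_filtered.iloc[held_out_idx]` in place of `df_filtered.iloc[0]` would then exercise the classifier on a record from the test partition.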


In [None]:
# --- STEP 7: TEST WITH A NEW NOTE ---

# Expected category: Cardiovascular / Pulmonary.
mi_nueva_nota = """CHIEF COMPLAINT: Shortness of breath and palpitations.
HISTORY OF PRESENT ILLNESS: The patient is a 68-year-old male with a known history of coronary artery disease and hypertension.
He presents today with worsening dyspnea on exertion and sensation of rapid heartbeat.
Review of systems is positive for orthopnea.
PHYSICAL EXAM: Irregularly irregular rhythm detected on auscultation. No murmurs.
IMPRESSION:
1. New onset Atrial Fibrillation with rapid ventricular response.
2. Congestive Heart Failure exacerbation.
3. Uncontrolled Hypertension.
"""

print(f"Analyzing note: {mi_nueva_nota}")

# Convert the text to a vector
vector_nota = get_biobert_embedding(mi_nueva_nota)

vector_nota_reshaped = vector_nota.reshape(1, -1)
prediccion_numerica = clf.predict(vector_nota_reshaped)

# Map the numeric code back to its label (e.g. 4 -> ' Surgery')
prediccion_texto = le.inverse_transform(prediccion_numerica)

print("-" * 30)
print(f"MODEL PREDICTION: {prediccion_texto[0]}")
print("-" * 30)

# --- PROBABILITIES ---
probabilidades = clf.predict_proba(vector_nota_reshaped)[0]

# Sort from highest to lowest probability
indices_ordenados = probabilidades.argsort()[::-1]

print("\nPredicted probability by category:")
for i in indices_ordenados:
    categoria = le.inverse_transform([i])[0]
    prob = probabilidades[i] * 100
    print(f"{categoria}: {prob:.2f}%")

Analyzing note: CHIEF COMPLAINT: Shortness of breath and palpitations.
HISTORY OF PRESENT ILLNESS: The patient is a 68-year-old male with a known history of coronary artery disease and hypertension.
He presents today with worsening dyspnea on exertion and sensation of rapid heartbeat.
Review of systems is positive for orthopnea.
PHYSICAL EXAM: Irregularly irregular rhythm detected on auscultation. No murmurs.
IMPRESSION:
1. New onset Atrial Fibrillation with rapid ventricular response.
2. Congestive Heart Failure exacerbation.
3. Uncontrolled Hypertension.

------------------------------
MODEL PREDICTION:  Cardiovascular / Pulmonary
------------------------------

Predicted probability by category:
 Cardiovascular / Pulmonary: 59.09%
 Consult - History and Phy.: 35.53%
 Radiology: 5.03%
 Surgery: 0.22%
 Orthopedic: 0.13%
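
The ranked probabilities show that the top category leads the runner-up by a fairly narrow gap. One simple way to act on that, sketched here as an illustration rather than as part of the original pipeline, is to compute the margin between the two highest probabilities and flag low-margin predictions for manual review. The helper `top_two_margin` and the 30-point threshold are hypothetical.

```python
from typing import Dict, Tuple


def top_two_margin(probabilities: Dict[str, float]) -> Tuple[str, float]:
    """Return the top label and its lead over the runner-up (same units as the input)."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    (best_label, best_p), (_, second_p) = ranked[0], ranked[1]
    return best_label, best_p - second_p


# Percentages copied from the notebook output for the new note.
probs = {
    "Cardiovascular / Pulmonary": 59.09,
    "Consult - History and Phy.": 35.53,
    "Radiology": 5.03,
    "Surgery": 0.22,
    "Orthopedic": 0.13,
}

label, margin = top_two_margin(probs)
needs_review = margin < 30.0  # hypothetical threshold, to be tuned on validation data
print(label, round(margin, 2), needs_review)
```

For this note the margin is about 23.6 percentage points, so under the assumed threshold the prediction would be flagged even though the top label matches the expected category.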
