<a href="https://colab.research.google.com/github/timkaaya/github-lab/blob/main/Timothy_ICS_3_2_B_November_10th.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**MAKE A COPY OF THIS COLAB NOTEBOOK BEFORE STARTING**

## Introduction


You've been given this notebook with code that **loads the Hugging Face medical abstracts dataset (train + test combined)** and then introduces several real-world data quality issues that you'll need to handle.




## Your Task
- Build a complete NLP classification pipeline based on what we've covered in class over the last 2 weeks
- Evaluate your model's performance with appropriate metrics
- Deploy the final model with a Gradio UI where users can input medical abstracts and get predictions, and make sure the title of your Gradio app is your first name
- You are free to use any algorithm and any feature extraction method you see fit, given the data and the context of this problem


## Deliverables
- Once done, use the class attendance [Google form](https://forms.gle/ThaqeLtnHB7ui4rE9) to upload your file and answer some questions about what you built
- All links close after 9:45 am, November 10th, 2025
- In case of any issues, any of the class reps can reach out via email / phone

In [20]:
'''
- DO NOT MODIFY ANY CODE IN THIS CELL.
- MAKING ANY CHANGES WILL RESULT IN FAILING THIS EXERCISE.

'''

import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings('ignore')

splits = {
    'train': 'data/train-00000-of-00001.parquet',
    'test': 'data/test-00000-of-00001.parquet'
}

df = pd.concat(
    [
        pd.read_parquet("hf://datasets/TimSchopf/medical_abstracts/" + splits[s])
        for s in ["train", "test"]
    ]
)

np.random.seed(42)

df.loc[np.random.choice(df.index, size=int(0.15 * len(df)), replace=False), 'medical_abstract'] = np.nan
df.loc[np.random.choice(df.index, size=int(0.05 * len(df)), replace=False), 'condition_label'] = np.nan
df['medical_abstract'] = df['medical_abstract'].str.replace(' ', '  ', regex=False)
df = pd.concat([df, df.sample(frac=0.05)], ignore_index=True)
df = df.sample(frac=1).reset_index(drop=True)
df['condition_label'] = (df['condition_label']).astype(str)
df['medical_abstract'] = df['medical_abstract'].astype(str)


df.head()

|   | condition_label | medical_abstract |
|---|---|---|
| 0 | 5.0 | Sudden death caused by coronary artery a... |
| 1 | 3.0 | Motor unit discharge characteristics and ... |
| 2 | 4.0 | Prevalence of coronary heart disease in ... |
| 3 |  | Light microscopic diagnosis of human micr... |
| 4 | 5.0 | Use of a knee-brace for control of tibi... |


# Start writing your code below; add more cells if needed.

In [21]:
df.shape, df.isnull().sum()


((15160, 2),
 condition_label     0
 medical_abstract    0
 dtype: int64)
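
The zero counts above do not mean the data is complete. The setup cell casts both columns to `str`, and the sample predictions later in this notebook include an abstract that reads `nan`, so missing values appear to be stored as the literal string `'nan'`. A minimal sketch of how to surface them, using a made-up `toy` frame rather than the real `df`:

```python
import pandas as pd

# Hypothetical rows that mimic the state of df after the setup cell:
# missing values are already stored as the string 'nan'.
toy = pd.DataFrame({
    "condition_label": ["1.0", "nan", "3.0"],
    "medical_abstract": ["some text", "nan", "more text"],
})

# isnull() only detects real NaN/None, so it reports nothing here.
null_counts = toy.isnull().sum()

# Comparing against the literal 'nan' finds the hidden gaps.
hidden_missing = toy.eq("nan").sum()

print(null_counts)
print(hidden_missing)
```

Running the same `eq("nan")` check on `df` would show how many abstracts and labels are actually missing.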

In [22]:
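# Both columns were cast to str in the setup cell, so missing values are the literal
# string 'nan'; fillna() on 'medical_abstract' therefore finds nothing to fill.
df['medical_abstract'] = df['medical_abstract'].fillna('No abstract provided')
# Turn the 'nan' label strings back into real NaN, then group them under 'unknown'.
df['condition_label'] = df['condition_label'].replace('nan', np.nan)
df['condition_label'] = df['condition_label'].fillna('unknown')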


In [23]:
df.isnull().sum()


|   | 0 |
|---|---|
| condition_label | 0 |
| medical_abstract | 0 |
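
One alternative to keeping placeholder values is to drop rows that cannot help a classifier: rows with no abstract, rows with no label, and the duplicates that the setup cell appends. The sketch below shows that idea on a made-up `toy` frame. It is not what this notebook does, and applying it to `df` would change every result reported further down.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one missing label, one missing abstract, one duplicate.
toy = pd.DataFrame({
    "condition_label": ["1.0", "nan", "3.0", "3.0", "5.0"],
    "medical_abstract": ["tumor growth", "heart failure", "nerve injury",
                         "nerve injury", "nan"],
})

clean = (
    toy.replace("nan", np.nan)       # placeholder strings back to real NaN
       .dropna(subset=["condition_label", "medical_abstract"])  # unusable rows
       .drop_duplicates()            # repeated rows add no new information
       .reset_index(drop=True)
)
print(clean)
```

Dropping duplicates before the train/test split also avoids the same abstract landing on both sides of the split.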


In [24]:
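import re
# Replace newlines with spaces and undo the doubled spaces added by the setup cell
df['medical_abstract'] = df['medical_abstract'].str.replace('\n', ' ')
df['medical_abstract'] = df['medical_abstract'].str.replace('  ', ' ')
df['medical_abstract'] = df['medical_abstract'].str.strip()
# Lowercase and keep only letters and spaces (this also drops digits and hyphens)
df['medical_abstract'] = df['medical_abstract'].apply(lambda x: re.sub(r'[^a-zA-Z ]', '', x.lower()))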
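
Deleting every non-letter character glues hyphenated terms together, which is why `hypoxiainduced` shows up in the sample output later on. A possible variant, sketched here with a hypothetical helper named `normalize_abstract`, replaces non-letters with a space and then collapses whitespace in one pass:

```python
import re

def normalize_abstract(text: str) -> str:
    """Lowercase, keep letters only, and collapse whitespace (illustrative helper)."""
    text = text.lower()
    # Swap each run of non-letters for a single space so that
    # 'hypoxia-induced' becomes 'hypoxia induced', not 'hypoxiainduced'.
    text = re.sub(r"[^a-z]+", " ", text)
    # Collapse any remaining whitespace runs and trim both ends.
    return " ".join(text.split())

print(normalize_abstract("Hypoxia-induced  in  vivo\nsickling (n=12)."))
# hypoxia induced in vivo sickling n
```

Using one helper for both training data and the Gradio input would keep the two preprocessing paths identical.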


In [25]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['condition_encoded'] = le.fit_transform(df['condition_label'])


In [26]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df['medical_abstract'], df['condition_encoded'],
    test_size=0.2, random_state=42
)
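
The class supports in the classification report further down are uneven, so a stratified split may be worth considering: it keeps the class proportions the same in the train and test sets. A minimal sketch on made-up data (the `toy` frame and its labels are illustrative only):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 15 rows of one class, 5 of another.
toy = pd.DataFrame({
    "medical_abstract": [f"abstract {i}" for i in range(20)],
    "condition_label": ["1.0"] * 15 + ["2.0"] * 5,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["medical_abstract"], toy["condition_label"],
    test_size=0.2, random_state=42,
    stratify=toy["condition_label"],  # preserve the 3:1 class ratio in both splits
)
print(y_te.value_counts())
```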


In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)


In [28]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=200)
model.fit(X_train_tfidf, y_train)
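
The vectorizer and the model can also be bundled into a single scikit-learn `Pipeline`, so that the same feature extraction is always applied at training and prediction time. The sketch below is one possible configuration, not the one used in this notebook: the bigram range, `sublinear_tf`, and `class_weight="balanced"` are assumptions chosen for illustration, and the four training sentences are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny hypothetical training set with two of the label values.
texts = [
    "coronary artery disease and myocardial infarction",
    "heart failure after coronary bypass",
    "tumor relapse in thyroid carcinoma",
    "carcinoma metastasis and tumor growth",
]
labels = ["4.0", "4.0", "1.0", "1.0"]

clf = Pipeline([
    # Unigrams + bigrams, with log-scaled term frequencies.
    ("tfidf", TfidfVectorizer(stop_words="english", ngram_range=(1, 2), sublinear_tf=True)),
    # 'balanced' reweights classes inversely to their frequency.
    ("logreg", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
clf.fit(texts, labels)

prediction = clf.predict(["coronary artery bypass in heart disease"])[0]
print(prediction)
```

With a pipeline, the Gradio function only needs to call `clf.predict([text])`, with no separate `vectorizer.transform` step.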


In [29]:
from sklearn.metrics import accuracy_score, classification_report
y_pred = model.predict(X_test_tfidf)
acc = accuracy_score(y_test, y_pred)
print("Model Accuracy:", round(acc * 100, 2), "%")
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Model Accuracy: 50.0 %

Classification Report:
               precision    recall  f1-score   support

           0       0.65      0.58      0.61       632
           1       0.55      0.37      0.44       291
           2       0.56      0.33      0.42       391
           3       0.62      0.53      0.57       579
           4       0.39      0.65      0.49       925
           5       1.00      0.00      0.01       214

    accuracy                           0.50      3032
   macro avg       0.63      0.41      0.42      3032
weighted avg       0.57      0.50      0.48      3032
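
In the report above, class 4 has low precision (0.39) with comparatively high recall (0.65), while class 5 is almost never predicted (recall 0.00). A confusion matrix shows which true classes are being absorbed by which predictions, and macro F1 summarises performance without letting the largest class dominate. A minimal sketch on made-up predictions (`y_true_demo` and `y_pred_demo` are illustrative, not results from this notebook):

```python
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical true and predicted labels, with '5.0' over-predicted.
y_true_demo = ["1.0", "1.0", "4.0", "4.0", "5.0", "5.0"]
y_pred_demo = ["1.0", "5.0", "4.0", "5.0", "5.0", "5.0"]
label_order = ["1.0", "4.0", "5.0"]

# Rows are true classes, columns are predicted classes, both in label_order.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=label_order)
# Macro F1 is the unweighted mean of the per-class F1 scores.
macro_f1 = f1_score(y_true_demo, y_pred_demo, average="macro")

print(cm)
print(round(macro_f1, 3))
```

In the notebook itself, the same two calls could be run on `y_test` and `y_pred`.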



In [30]:
for i in range(3):
    sample = X_test.sample(1).values[0]
    pred = le.inverse_transform(model.predict(vectorizer.transform([sample])))[0]
    print(f"\nAbstract: {sample[:150]}...")
    print(f"Predicted Condition: {pred}")



Abstract: nan...
Predicted Condition: 5.0

Abstract: hypoxiainduced in vivo sickling of transgenic mouse red cells to develop an animal model for sickle cell anemia we have created transgenic mice that e...
Predicted Condition: 5.0

Abstract: prognostic indices for tumor relapse and tumor mortality in follicular thyroid carcinoma to establish an objective basis for therapeutic decisions and...
Predicted Condition: 1.0


In [31]:
import gradio as gr

def predict_condition(abstract):

    text = re.sub(r'[^a-zA-Z ]', '', abstract.lower())
    text = ' '.join(text.split())
    vec = vectorizer.transform([text])
    pred = model.predict(vec)[0]
    label = le.inverse_transform([pred])[0]
    return f"Predicted Condition: {label}"

app = gr.Interface(
    fn=predict_condition,
    inputs=gr.Textbox(label="Enter Medical Abstract"),
    outputs="text",
    title="Timothy - Medical Abstract Classifier"
)

app.launch()


It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://fea37c29f4256f3657.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
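
The prediction function could also handle empty input and report per-class probabilities instead of a single label. The sketch below is self-contained and does not use the notebook's `model`: `demo_clf`, its four training sentences, and the helper name `predict_with_confidence` are all made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny stand-in classifier (hypothetical training data).
demo_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
demo_clf.fit(
    [
        "coronary artery disease and myocardial infarction",
        "heart failure after coronary bypass",
        "tumor relapse in thyroid carcinoma",
        "carcinoma metastasis and tumor growth",
    ],
    ["4.0", "4.0", "1.0", "1.0"],
)

def predict_with_confidence(abstract):
    """Return a {label: probability} dict, guarding against empty input."""
    if not abstract or not abstract.strip():
        return {"No abstract provided": 1.0}
    probs = demo_clf.predict_proba([abstract])[0]
    # classes_ is in the same order as the predict_proba columns.
    return {str(label): float(p) for label, p in zip(demo_clf.classes_, probs)}

print(predict_with_confidence("coronary artery disease"))
```

A function that returns a label-to-probability dictionary like this can be passed as `fn` to `gr.Interface` with `outputs=gr.Label()`, which displays the classes ranked by confidence.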


