<a href="https://colab.research.google.com/github/reddy7356/reddy7356/blob/main/Healthcare_Sentiment_Analysis_Model_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

"""
Healthcare Sentiment Analysis Model Architecture

This module defines the architecture for a sentiment analysis model
specifically designed for healthcare academic evaluations in Anesthesiology
and Cardiology specialties.

The architecture combines BERT-based models with domain-specific adaptations
for analyzing feedback from medical students, residents, and fellows.
"""


In [1]:
from IPython import get_ipython
from IPython.display import display

In [2]:
!pip install transformers
!pip install torch
!pip install numpy
!pip install pandas
!pip install scikit-learn
!pip install nltk
!pip install matplotlib
!pip install seaborn
!pip install gradio
!pip install flask
!pip install gunicorn
!pip install sqlalchemy
!pip install psycopg2-binary
!pip install pytest
!pip install black
!pip install flake8





In [3]:
import os
import numpy as np
import pandas as pd
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)

In [4]:

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
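
The scikit-learn metrics imported in this cell are not used anywhere else in the notebook. A minimal sketch of how they could feed evaluation, assuming a three-class setup and macro averaging (both are assumptions, and `compute_metrics` is a hypothetical helper name, not part of the original code):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def compute_metrics(eval_pred):
    """Hypothetical helper: turn (logits, labels) into summary metrics."""
    logits, labels = eval_pred
    # Predicted class is the index of the largest logit per row
    preds = np.argmax(logits, axis=-1)
    # Macro averaging weights the three classes equally (an assumption)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="macro", zero_division=0
    )
    return {
        "accuracy": float(accuracy_score(labels, preds)),
        "precision": float(precision),
        "recall": float(recall),
        "f1": float(f1),
    }
```

A function of this shape can be handed to `Trainer` through its `compute_metrics` argument, so that evaluation reports more than the loss.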

In [5]:
# Ensure NLTK resources are available
nltk.download('vader_lexicon')
nltk.download('punkt')
nltk.download('stopwords')


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True
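
VADER reports a `compound` score between -1 and 1, and the model class later in this notebook turns it into a label with ±0.05 cut-offs. A small standalone sketch of that mapping, where the helper name `vader_label` and the `threshold` parameter are illustrative and not part of the original code:

```python
def vader_label(compound, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    # Scores strictly inside (-threshold, threshold) are treated as neutral
    return "neutral"
```

Keeping the threshold as a parameter would make it straightforward to widen or narrow the neutral band for evaluation comments, which tend to be phrased more cautiously than social-media text.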

In [7]:
class HealthcareSentimentModel:
    """
    A sentiment analysis model for healthcare academic evaluations.

    This class implements a hybrid approach combining:
    1. BERT-based deep learning model fine-tuned for healthcare text
    2. Domain-specific lexicon adaptations for medical terminology
    3. Specialty-specific features for Anesthesiology and Cardiology
    """

    def __init__(self, model_name="emilyalsentzer/Bio_ClinicalBERT", device=None):
        """
        Initialize the healthcare sentiment analysis model.

        Args:
            model_name (str): The pre-trained model to use (default: Bio_ClinicalBERT)
            device (str): Device to use for computation ('cuda' or 'cpu')
        """
        self.model_name = model_name
        self.device = device if device else ('cuda' if torch.cuda.is_available() else 'cpu')
        self.tokenizer = None
        self.model = None
        self.vader_analyzer = SentimentIntensityAnalyzer()
        self.specialty_terms = {
            'anesthesiology': set(),
            'cardiology': set()
        }

    def load_specialty_terms(self, anesthesiology_path=None, cardiology_path=None):
        """
        Load specialty-specific terminology for improved domain adaptation.

        Args:
            anesthesiology_path (str): Path to file containing Anesthesiology terms
            cardiology_path (str): Path to file containing Cardiology terms
        """
        # Default specialty terms if paths not provided
        if not anesthesiology_path:
            self.specialty_terms['anesthesiology'] = {
                'intubation', 'extubation', 'laryngoscopy', 'anesthesia', 'sedation',
                'analgesia', 'airway', 'ventilation', 'epidural', 'spinal',
                'propofol', 'fentanyl', 'remifentanil', 'sevoflurane', 'desflurane',
                'regional', 'nerve block', 'post-operative', 'preoperative', 'intraoperative'
            }
        else:
            with open(anesthesiology_path, 'r') as f:
                # Skip blank lines and lowercase terms so they match text.lower()
                self.specialty_terms['anesthesiology'] = {
                    line.strip().lower() for line in f if line.strip()
                }

        if not cardiology_path:
            self.specialty_terms['cardiology'] = {
                'echocardiogram', 'electrocardiogram', 'cardiac', 'heart failure', 'myocardial',
                'infarction', 'arrhythmia', 'atrial', 'ventricular', 'fibrillation',
                'tachycardia', 'bradycardia', 'angiography', 'catheterization', 'stent',
                'pacemaker', 'defibrillator', 'valve', 'coronary', 'hypertension'
            }
        else:
            with open(cardiology_path, 'r') as f:
                # Skip blank lines and lowercase terms so they match text.lower()
                self.specialty_terms['cardiology'] = {
                    line.strip().lower() for line in f if line.strip()
                }

    def initialize_model(self):
        """Initialize and prepare the BERT-based model for fine-tuning."""
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(
            self.model_name,
            num_labels=3,  # Negative, Neutral, Positive
            output_attentions=False,
            output_hidden_states=False,
        ).to(self.device)

    def get_vader_sentiment(self, text, specialty=None):
        """
        Get VADER sentiment scores with domain adaptations.

        Args:
            text (str): Input text
            specialty (str): Medical specialty context

        Returns:
            dict: Modified VADER sentiment scores
        """
        # Preprocess text with domain-specific adaptations
        processed_text = self.preprocess_text(text, specialty)

        # Get base VADER scores
        scores = self.vader_analyzer.polarity_scores(processed_text)

        return scores

    def preprocess_text(self, text, specialty=None):
        """
        Preprocess text with domain-specific adaptations.

        Args:
            text (str): Input text
            specialty (str): Medical specialty context

        Returns:
            str: Processed text
        """
        # Placeholder for your domain-specific preprocessing logic
        # Example:
        if specialty:
            # Check if specialty-specific terms are in the text and possibly modify them.
            # For example, you could replace abbreviations with full terms
            for term in self.specialty_terms.get(specialty, []):
                if term in text.lower():
                    # Replace term with a modified version or do some other processing
                    pass

        # Additional preprocessing steps (e.g., lowercasing, removing punctuation) can be added here.
        return text  # Return the text after preprocessing



    def prepare_data(self, texts, labels=None, specialty=None, max_length=512):
        """
        Prepare data for the BERT model.

        Args:
            texts (list): List of text samples
            labels (list): Optional list of labels
            specialty (str): Medical specialty context
            max_length (int): Maximum sequence length

        Returns:
            dict: Tokenized inputs for the model
        """
        # Preprocess texts with domain-specific adaptations
        processed_texts = [self.preprocess_text(text, specialty) for text in texts]

        # Tokenize the texts
        encodings = self.tokenizer(
            processed_texts,
            truncation=True,
            padding=True,
            max_length=max_length,
            return_tensors="pt"
        )

        # Attach labels so the encodings can be used for supervised training
        if labels is not None:
            encodings['labels'] = torch.tensor(labels)

        return encodings

    def fine_tune(self, train_texts, train_labels, val_texts=None, val_labels=None,
                 specialty=None, epochs=3, batch_size=16, learning_rate=2e-5):
        """
        Fine-tune the BERT model on healthcare evaluation data.

        Args:
            train_texts (list): Training text samples
            train_labels (list): Training labels
            val_texts (list): Validation text samples
            val_labels (list): Validation labels
            specialty (str): Medical specialty context
            epochs (int): Number of training epochs
            batch_size (int): Batch size for training
            learning_rate (float): Learning rate

        Returns:
            dict: Training metrics
        """
        if not self.model:
            self.initialize_model()

        # Split the raw texts and labels when no validation set is provided;
        # the split has to happen before tokenization
        if not (val_texts and val_labels):
            train_texts, val_texts, train_labels, val_labels = train_test_split(
                train_texts, train_labels, test_size=0.2
            )

        def to_dataset(encodings):
            # Trainer needs an indexable dataset of per-example dicts
            n = len(encodings['input_ids'])
            return [{k: v[i] for k, v in encodings.items()} for i in range(n)]

        # Prepare datasets
        train_dataset = to_dataset(self.prepare_data(train_texts, train_labels, specialty))
        val_dataset = to_dataset(self.prepare_data(val_texts, val_labels, specialty))

        # Define training arguments
        training_args = TrainingArguments(
            output_dir='./results',
            num_train_epochs=epochs,
            per_device_train_batch_size=batch_size,
            per_device_eval_batch_size=batch_size,
            warmup_steps=500,
            weight_decay=0.01,
            logging_dir='./logs',
            logging_steps=10,
            evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
            save_strategy="epoch",
            load_best_model_at_end=True,
            learning_rate=learning_rate,
        )

        # Define trainer
        trainer = Trainer(
            model=self.model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=val_dataset,
        )

        # Train the model
        trainer.train()

        # Evaluate the model
        eval_results = trainer.evaluate()

        return eval_results

    def predict(self, texts, specialty=None):
        """
        Predict sentiment for new texts using the hybrid approach.

        Args:
            texts (list): List of text samples to analyze
            specialty (str): Medical specialty context

        Returns:
            list: Predicted sentiment labels and scores
        """
        if not self.model:
            raise ValueError("Model not initialized. Call initialize_model() first.")
        # Get BERT predictions
        dataset = self.prepare_data(texts, specialty=specialty)

        self.model.eval()
        with torch.no_grad():
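            # Move inputs to the same device as the model before the forward pass
            inputs = {k: v.to(self.device) for k, v in dataset.items() if k != 'labels'}
            outputs = self.model(**inputs)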
            logits = outputs.logits
            predictions = torch.argmax(logits, dim=1).cpu().numpy()

        # Get VADER predictions for ensemble approach
        vader_scores = [self.get_vader_sentiment(text, specialty) for text in texts]

        # Combine predictions (simple ensemble)
        results = []
        for i, (bert_pred, vader_score) in enumerate(zip(predictions, vader_scores)):
            # Convert BERT prediction (0, 1, 2) to sentiment label
            sentiment_map = {0: 'negative', 1: 'neutral', 2: 'positive'}
            bert_sentiment = sentiment_map[bert_pred]

            # Determine VADER sentiment
            if vader_score['compound'] >= 0.05:
                vader_sentiment = 'positive'
            elif vader_score['compound'] <= -0.05:
                vader_sentiment = 'negative'
            else:
                vader_sentiment = 'neutral'

            # Ensemble decision: BERT currently wins whether or not VADER agrees.
            # In a production system, the weighting would be tuned on validation data.
            final_sentiment = bert_sentiment

            results.append({
                'text': texts[i],
                'sentiment': final_sentiment,
                'bert_prediction': bert_sentiment,
                'vader_scores': vader_score,
                'confidence': float(torch.softmax(logits[i], dim=0)[bert_pred].cpu())
            })

        return results

    def save_model(self, path):
        """
        Save the fine-tuned model and tokenizer.

        Args:
            path (str): Directory path to save the model
        """
        os.makedirs(path, exist_ok=True)

        self.model.save_pretrained(path)
        self.tokenizer.save_pretrained(path)

        # Save specialty terms
        with open(os.path.join(path, 'specialty_terms.txt'), 'w') as f:
            for specialty, terms in self.specialty_terms.items():
                f.write(f"[{specialty}]\n")
                for term in terms:
                    f.write(f"{term}\n")
                f.write("\n")


    def load_model(self, path):
        """
        Load a fine-tuned model and tokenizer.

        Args:
            path (str): Directory path to load the model from
        """
        self.model = AutoModelForSequenceClassification.from_pretrained(path).to(self.device)
        self.tokenizer = AutoTokenizer.from_pretrained(path)

        # Load specialty terms if available
        specialty_terms_path = os.path.join(path, 'specialty_terms.txt')
        if os.path.exists(specialty_terms_path):
            current_specialty = None
            self.specialty_terms = {'anesthesiology': set(), 'cardiology': set()}

            with open(specialty_terms_path, 'r') as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue

                    if line.startswith('[') and line.endswith(']'):
                        current_specialty = line[1:-1]
                    elif current_specialty and current_specialty in self.specialty_terms:
                        self.specialty_terms[current_specialty].add(line)

# Example usage
if __name__ == "__main__":
    # Initialize model
    model = HealthcareSentimentModel()
    model.initialize_model()
    model.load_specialty_terms()

    # Example texts
    texts = [
        "The resident demonstrated excellent intubation technique and airway management skills.",
        "Student needs improvement in cardiac auscultation and ECG interpretation.",
        "Fellow showed outstanding knowledge of anesthetic pharmacology and patient monitoring.",
        "The medical student struggled with identifying heart murmurs and cardiac pathology."
    ]

    # Make predictions
    anesthesiology_results = model.predict(texts[:2], specialty='anesthesiology')
    cardiology_results = model.predict(texts[2:], specialty='cardiology')

    # Print results
    print("Anesthesiology Results:")
    for result in anesthesiology_results:
        print(f"Text: {result['text']}")
        print(f"Sentiment: {result['sentiment']}")
        print(f"Confidence: {result['confidence']:.4f}")
        print()

    print("Cardiology Results:")
    for result in cardiology_results:
        print(f"Text: {result['text']}")
        print(f"Sentiment: {result['sentiment']}")
        print(f"Confidence: {result['confidence']:.4f}")
        print()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at emilyalsentzer/Bio_ClinicalBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Anesthesiology Results:
Text: The resident demonstrated excellent intubation technique and airway management skills.
Sentiment: positive
Confidence: 0.4628

Text: Student needs improvement in cardiac auscultation and ECG interpretation.
Sentiment: positive
Confidence: 0.4438

Cardiology Results:
Text: Fellow showed outstanding knowledge of anesthetic pharmacology and patient monitoring.
Sentiment: positive
Confidence: 0.5270

Text: The medical student struggled with identifying heart murmurs and cardiac pathology.
Sentiment: positive
Confidence: 0.5316
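
The results above come from a classification head that has not been fine-tuned (see the warning before the results), so all four texts are labelled positive with confidences between 0.44 and 0.53. The `predict` method also always defers to BERT. Below is one possible shape for a weighted ensemble, as a sketch only: the names `vader_to_probs` and `weighted_ensemble`, the `sharpness` heuristic, and the default `bert_weight=0.7` are all illustrative assumptions that would need tuning on validation data.

```python
import math

LABELS = ("negative", "neutral", "positive")


def vader_to_probs(compound, sharpness=4.0):
    """Heuristic: spread a VADER compound score over the three labels."""
    scores = (-sharpness * compound, 0.0, sharpness * compound)
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]


def weighted_ensemble(bert_probs, compound, bert_weight=0.7):
    """Mix BERT class probabilities with a VADER-derived distribution."""
    vader_probs = vader_to_probs(compound)
    mixed = [
        bert_weight * b + (1.0 - bert_weight) * v
        for b, v in zip(bert_probs, vader_probs)
    ]
    best = max(range(len(LABELS)), key=mixed.__getitem__)
    return LABELS[best], mixed[best]
```

With `bert_weight=1.0` this reduces to the BERT-only decision. Lower weights let a strongly negative or positive VADER score override a low-confidence BERT prediction.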

