# Medical Transcriptions Data Exploration

This notebook explores the medical transcriptions dataset downloaded from Kaggle. We'll analyze the content and characteristics of the transcriptions to prepare for fine-tuning the Gemma 2B model.

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)
sns.set_theme()  # replaces plt.style.use('seaborn'), a style name newer matplotlib releases no longer accept

# Set up paths
data_dir = Path('../data/raw')
transcriptions_file = data_dir / 'mtsamples.csv'

ModuleNotFoundError: No module named 'pandas'
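
The `ModuleNotFoundError` above indicates that pandas was not installed in the kernel that ran the cell. A minimal sketch for checking the notebook's dependencies before importing them, using only the standard library. The helper name `find_missing` is hypothetical, and the suggested `pip install` line assumes the pip package names match the import names, which holds for these four packages.

```python
import importlib.util

# Import names used by this notebook
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn"]

def find_missing(modules):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]

missing = find_missing(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
    print("Try: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```

Running this in the first cell reports every missing package at once instead of failing on the first import.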

## 1. Load and Examine Dataset

Let's load the medical transcriptions dataset and examine its basic structure.

In [None]:
# Load the dataset
df = pd.read_csv(transcriptions_file)

# Display basic information
print("Dataset Shape:", df.shape)
print("\nFirst few rows:")
display(df.head())
print("\nLast few rows:")
display(df.tail())

## 2. Check Dataset Information

Let's examine the structure of our dataset, including column types and non-null counts.

In [None]:
# Display dataset information
print("Dataset Info:")
df.info()  # info() prints its report directly and returns None, so it needs no display()

print("\nColumn Names:")
print(df.columns.tolist())

print("\nDataset Description:")
display(df.describe(include='all'))

## 3. Handle Missing Values

Let's check for missing values in our dataset and visualize them.

In [None]:
# Check missing values
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100

print("Missing Values Count:")
display(pd.DataFrame({
    'Missing Count': missing_values,
    'Missing Percentage': missing_percentage
}))

# Visualize missing values
plt.figure(figsize=(12, 6))
sns.heatmap(df.isnull(), yticklabels=False, cbar=True, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.tight_layout()
plt.show()
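
The cell above reports missing values but does not yet handle them. One possible policy is sketched below on a toy DataFrame with invented values. It assumes that a row without a `transcription` is unusable for fine-tuning and that a missing `keywords` entry can safely become an empty string; both are assumptions to revisit once the real counts are known.

```python
import pandas as pd

# Toy data standing in for df; the values are invented for illustration
toy = pd.DataFrame({
    "transcription": ["Patient presents with cough.", None, "Follow-up visit."],
    "keywords": ["cough", "cardiology", None],
})

# Assumption: a row without a transcription has nothing to train on, so drop it
cleaned = toy.dropna(subset=["transcription"]).copy()

# Assumption: missing keywords are harmless and can become an empty string
cleaned["keywords"] = cleaned["keywords"].fillna("")

# Renumber the rows after dropping
cleaned = cleaned.reset_index(drop=True)
print(cleaned)
```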

## 4. Text Analysis

Let's analyze the characteristics of the medical transcriptions.

In [None]:
# Analyze text length distribution
df['text_length'] = df['transcription'].str.len()

print("Text Length Statistics:")
display(df['text_length'].describe())

# Plot text length distribution
plt.figure(figsize=(12, 6))
sns.histplot(data=df, x='text_length', bins=50)
plt.title('Distribution of Text Lengths')
plt.xlabel('Text Length (characters)')
plt.ylabel('Count')
plt.show()

# Most common medical specialties
plt.figure(figsize=(12, 6))
df['medical_specialty'].value_counts().head(15).plot(kind='bar')
plt.title('Top 15 Medical Specialties')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
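
Character counts are only a rough proxy for sequence length, since fine-tuning works on tokens. A hedged sketch of a slightly closer approximation using whitespace-separated word counts follows. The toy texts are invented, and `max_words` is a hypothetical cutoff chosen for the example, not a value taken from the dataset or from Gemma's tokenizer.

```python
import pandas as pd

# Toy data standing in for df; the last row has ten words, one row is missing
toy = pd.DataFrame({"transcription": ["one two three", "a b", None, "w " * 10]})

# Whitespace word count; a missing transcription counts as 0 words
toy["word_count"] = toy["transcription"].fillna("").str.split().str.len()

max_words = 5  # hypothetical cutoff for illustration only
share_over = (toy["word_count"] > max_words).mean()

print(toy["word_count"].tolist())
print(f"Share of rows over {max_words} words: {share_over:.0%}")
```

An exact count would require running the model's own tokenizer over each transcription.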

## 5. Sample Analysis

Let's look at some sample transcriptions to understand their content and structure.

In [None]:
# Display one random sample from each of several distinct medical specialties
n_samples = 3
# Sample from the unique specialties so the same one is not picked twice
random_specialties = df['medical_specialty'].dropna().drop_duplicates().sample(n=n_samples)

for specialty in random_specialties:
    sample = df[df['medical_specialty'] == specialty].sample(n=1).iloc[0]
    print(f"\nMedical Specialty: {specialty}")
    print(f"Description: {sample['description']}")
    print(f"Keywords: {sample['keywords']}")
    print("\nFirst 500 characters of transcription:")
    # str() guards against a missing (NaN) transcription, which cannot be sliced
    print(f"{str(sample['transcription'])[:500]}...")

## 6. Data Export

Save processed data for model training.
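
A minimal sketch of one way to do this, assuming JSON Lines as the output format (one JSON object per line). The notebook does not specify a format or a destination, so the temporary directory, the file name `transcriptions.jsonl`, and the toy rows are all placeholders.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Toy data standing in for the cleaned df; the values are invented
toy = pd.DataFrame({
    "medical_specialty": ["Cardiology", "Neurology"],
    "transcription": ["Chest pain evaluation.", "Headache follow-up."],
})

# Hypothetical output location; the notebook itself does not name one
out_dir = Path(tempfile.mkdtemp())
out_file = out_dir / "transcriptions.jsonl"

# Write one JSON object per row, keeping non-ASCII characters readable
toy.to_json(out_file, orient="records", lines=True, force_ascii=False)

# Read the file back to confirm every non-empty line is valid JSON
with open(out_file, encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(len(records), "records written to", out_file)
```

In the notebook, `toy` would be replaced by the processed `df` and `out_dir` by a directory of your choosing.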
