# Exploratory Data Analysis (EDA)

This notebook explores the three raw datasets used by the Healthcare Medical Coding Assistant project: claims, encounters, and clinical notes. The goal is to understand their structure, summary statistics, and missing values before moving on to preprocessing and modeling.

In [1]:
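import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display  # explicit import so the cells also run outside Jupyter

# Set visualization style (set_theme is the current name for sns.set)
sns.set_theme(style='whitegrid')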

In [2]:
# Load datasets
claims_df = pd.read_csv('../data/raw/claims.csv')
encounters_df = pd.read_csv('../data/raw/encounters.csv')
notes_df = pd.read_json('../data/raw/notes.jsonl', lines=True)

# Display the first few rows of each dataset
print('Claims Data:')
display(claims_df.head())

print('Encounters Data:')
display(encounters_df.head())

print('Clinical Notes Data:')
display(notes_df.head())
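
The notes file is read with `lines=True` because it is in JSON Lines format, where each line is a separate JSON object. A minimal sketch of that behavior on an in-memory string follows; the field names `note_id` and `note_text` are made up for illustration and may not match the real file.

```python
import io

import pandas as pd

# Two JSON objects, one per line (hypothetical fields, toy values)
jsonl_text = (
    '{"note_id": 1, "note_text": "Chest pain."}\n'
    '{"note_id": 2, "note_text": "Follow-up."}\n'
)

# lines=True parses each line as one record, giving one row per line
toy_notes = pd.read_json(io.StringIO(jsonl_text), lines=True)
print(toy_notes.shape)
print(toy_notes.columns.tolist())
```

Each line becomes one row, and the keys of the JSON objects become the columns.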

In [3]:
# Summary statistics
# Note: when a frame has numeric columns, describe() summarizes only those by default
print('Claims Data Summary:')
display(claims_df.describe())

print('Encounters Data Summary:')
display(encounters_df.describe())

print('Clinical Notes Data Summary:')
display(notes_df.describe())
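
Because `describe()` focuses on numeric columns, it says little about free-text notes. One possible complement is to summarize text length instead. The sketch below assumes a text column named `note_text`, which is a hypothetical name, and uses toy data rather than the project's notes.

```python
import pandas as pd

# Toy notes frame; the column name `note_text` is an assumption
toy_notes = pd.DataFrame({
    "note_id": [1, 2, 3],
    "note_text": [
        "Patient reports chest pain.",
        "Follow-up visit, no complaints.",
        None,  # a missing note
    ],
})

# Character counts and whitespace-delimited word counts; missing notes stay missing
char_len = toy_notes["note_text"].str.len()
word_count = toy_notes["note_text"].str.split().str.len()

# describe() on the derived numeric columns ignores the missing note
length_summary = pd.DataFrame({"chars": char_len, "words": word_count}).describe()
print(length_summary)
```

Very short or very long notes show up in the `min` and `max` rows and may deserve a manual look.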

In [4]:
# Visualize the distribution of claim amounts
plt.figure(figsize=(10, 6))
sns.histplot(claims_df['amount'], bins=30, kde=True)
plt.title('Distribution of Claim Amounts')
plt.xlabel('Amount')
plt.ylabel('Frequency')
plt.show()
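
A histogram can be backed up with a few numbers. The sketch below computes quartiles, skewness, and high outliers using Tukey's 1.5 × IQR rule on a toy series. The values are invented, and the rule is a common convention rather than something this project prescribes.

```python
import pandas as pd

# Toy amounts with one very large value (made-up data)
toy_amounts = pd.Series(
    [50.0, 75.0, 80.0, 120.0, 150.0, 200.0, 5000.0], name="amount"
)

# Quartiles and interquartile range
quantiles = toy_amounts.quantile([0.25, 0.5, 0.75])
iqr = quantiles[0.75] - quantiles[0.25]

# Tukey's upper fence: values above it are flagged as high outliers
upper_fence = quantiles[0.75] + 1.5 * iqr
high_outliers = toy_amounts[toy_amounts > upper_fence]

# Positive skewness indicates a long right tail
skewness = toy_amounts.skew()

print(quantiles)
print(f"upper fence: {upper_fence}, skewness: {skewness:.2f}")
print(high_outliers)
```

If the real `amount` column turns out to be strongly right-skewed, plotting it on a log scale may make the bulk of the distribution easier to read.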

In [5]:
# Checking for missing values
print('Missing values in Claims Data:')
print(claims_df.isnull().sum())

print('Missing values in Encounters Data:')
print(encounters_df.isnull().sum())

print('Missing values in Clinical Notes Data:')
print(notes_df.isnull().sum())
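
Raw counts are hard to compare across datasets of different sizes, so a percentage view can help. The helper below is a minimal sketch; `missing_report` is a hypothetical name, and the toy frame stands in for any of the three datasets.

```python
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing counts and percentages for columns with at least one gap."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "pct": (counts / len(df) * 100).round(1),
    })
    # Keep only columns with missing values, worst first
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame with made-up values
toy_df = pd.DataFrame({
    "claim_id": [1, 2, 3, 4],
    "amount": [120.0, None, None, 87.5],
    "code": ["A01", "B02", None, "C03"],
})

report = missing_report(toy_df)
print(report)
```

The same call could be applied to `claims_df`, `encounters_df`, and `notes_df` in turn.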

## Conclusion

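This notebook loaded the claims, encounters, and clinical notes datasets for the Healthcare Medical Coding Assistant project, previewed their first rows, reviewed summary statistics, plotted the distribution of claim amounts, and counted missing values per column. Further analysis and preprocessing will be required to prepare the data for modeling.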