# Data Quality Validation

This notebook validates medical code mappings by scoring the text similarity
between each original description and its mapped description.

## Methodology
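- Text preprocessing (lowercase, strip leading and trailing whitespace)
- Similarity scoring using sequence matching (`difflib.SequenceMatcher`, scaled to 0-100)
- Threshold-based quality assessment (80% threshold for a good match)
- Overall goodness score calculation (mean similarity score across all rows)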

In [None]:
# Compare original description vs. mapped description similarity
from difflib import SequenceMatcher

def calculate_similarity_score(desc1, desc2):
    """Return the similarity of two descriptions as a percentage (0-100)."""
    return SequenceMatcher(None, desc1, desc2).ratio() * 100
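
To make the score easier to interpret: `SequenceMatcher.ratio()` returns `2*M/T`, where `M` is the number of matched characters and `T` is the combined length of both strings. The sketch below repeats the same computation under a hypothetical name, `similarity_percent`, so that it runs on its own; the input strings are made up for illustration.

```python
from difflib import SequenceMatcher

def similarity_percent(a, b):
    # Same computation as calculate_similarity_score, repeated so this cell is self-contained
    return SequenceMatcher(None, a, b).ratio() * 100

# "bcd" is the only matching block: M = 3, T = 4 + 4 = 8, so 2*3/8 = 0.75
print(similarity_percent("abcd", "bcde"))  # 75.0

# Identical strings score 100, strings with no characters in common score 0
print(similarity_percent("type 2 diabetes mellitus", "type 2 diabetes mellitus"))  # 100.0
print(similarity_percent("abc", "xyz"))  # 0.0
```

Because the comparison is character-based, two descriptions with the same meaning but different wording can still score low.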

In [None]:
# Preprocess text data
df['Description'] = df['Description'].str.lower().str.strip()
df['DIAGNOSIS_DESCRIPTION'] = df['DIAGNOSIS_DESCRIPTION'].str.lower().str.strip()
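
If either column can contain missing values, the `.str` methods leave them as `NaN`, and `SequenceMatcher` expects sequences, so scoring such a row would likely fail. A minimal sketch of one way to guard against that, assuming a missing description should be treated as an empty string; the `toy` frame and its values are hypothetical:

```python
import pandas as pd

# Toy data; the values are made up for illustration
toy = pd.DataFrame({
    "Description": ["  Essential Hypertension ", None],
    "DIAGNOSIS_DESCRIPTION": ["essential hypertension", "Asthma"],
})

for col in ["Description", "DIAGNOSIS_DESCRIPTION"]:
    # Assumption: a missing description becomes an empty string before normalizing
    toy[col] = toy[col].fillna("").astype(str).str.lower().str.strip()

print(toy)
```

Whether an empty string is the right stand-in depends on the data; dropping or flagging such rows separately is an equally reasonable choice.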

In [None]:
# Score each row's Description against its DIAGNOSIS_DESCRIPTION and store the result in a new column
df['Similarity_Score'] = df.apply(
    lambda row: calculate_similarity_score(row['Description'], row['DIAGNOSIS_DESCRIPTION']),
    axis=1,
)

In [None]:
# Threshold (percent similarity) for considering a pair a good match
threshold = 80
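
The threshold is defined here but not applied in the cells shown. One possible way to use it is sketched below, on hypothetical scores standing in for `df['Similarity_Score']`; the names `is_good_match` and `share_good` are illustrative, and counting a score equal to the threshold as a good match is an assumption.

```python
import pandas as pd

threshold = 80

# Hypothetical scores standing in for df['Similarity_Score']
scores = pd.Series([100.0, 85.0, 80.0, 62.5])

# Assumption: a score equal to the threshold counts as a good match
is_good_match = scores >= threshold

# Percentage of rows that meet the threshold
share_good = is_good_match.mean() * 100

print(is_good_match.tolist())  # [True, True, True, False]
print(share_good)              # 75.0
```

The share of rows above the threshold complements the mean score, since a few very poor mappings can hide behind a high average.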

In [None]:
# Overall goodness score: the sum of similarity scores divided by the number of rows
overall_goodness_score = df['Similarity_Score'].sum() / len(df)

In [None]:
print("Overall Goodness Score:", overall_goodness_score)