# Exploratory Data Analysis on SASL Glossed Dataset

This tutorial walks you through basic exploratory data analysis (EDA) for a South African Sign Language (SASL) dataset.
We will load the data, inspect its structure, and visualise gloss frequency.

### Objectives:
- Understand the structure of a SASL gloss dataset
- Perform basic summary statistics
- Visualise common glosses and signer metadata


In [None]:
# Step 1: Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Enable inline plotting
%matplotlib inline


In [None]:
# Step 2: Load the dataset
# The path is relative to the notebook's working directory; update it if the CSV lives elsewhere

df = pd.read_csv('../data/sasl_dataset.csv')
df.head()
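
If the CSV is not where the notebook expects it, `pd.read_csv` fails with a fairly terse error. The sketch below wraps the load in a small helper that checks the path first. The helper name `load_sasl_csv` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_sasl_csv(csv_path):
    """Read the CSV at csv_path, failing early with a clear message if it is missing.

    Hypothetical helper for illustration; the notebook itself calls pd.read_csv directly.
    """
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(
            f"Dataset not found at {path.resolve()}; update the path and try again."
        )
    return pd.read_csv(path)


# Example call, using the path from Step 2:
# df = load_sasl_csv('../data/sasl_dataset.csv')
```

Resolving the path in the error message shows the absolute location that was tried, which helps when the notebook is launched from an unexpected working directory.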

In [None]:
# Step 3: Get a summary of the dataset
print("Dataset Shape:", df.shape)
print("\nColumn Names:", df.columns.tolist())
df.describe(include='all')
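
`describe(include='all')` mixes numeric and categorical statistics in one wide table, which can be hard to scan. As a complement, the sketch below builds a compact per-column overview. The function `column_overview` and the toy rows are made up for illustration; the real dataset's columns may differ.

```python
import pandas as pd


def column_overview(frame):
    """Return one row per column with its dtype, non-null count, and unique-value count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "non_null": frame.notna().sum(),
        "unique": frame.nunique(),  # nunique ignores missing values by default
    })


# Hypothetical toy rows, only to make the example self-contained
toy = pd.DataFrame({
    "gloss": ["HELLO", "THANK-YOU", "HELLO", None],
    "signer_id": ["s1", "s1", "s2", "s2"],
})
overview = column_overview(toy)
print(overview)
```

On the real data you would call `column_overview(df)` instead of using the toy frame.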

In [None]:
# Step 4: Check for missing values
df.isnull().sum()
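
Raw missing-value counts are easier to judge next to the share of rows they represent. The following is a minimal sketch that reports both and lists only the columns that actually have gaps. `missing_report` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def missing_report(frame):
    """Return missing-value counts and percentages for columns with at least one gap."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only columns with missing values, worst first
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical toy rows, only to make the example self-contained
toy = pd.DataFrame({
    "gloss": ["HELLO", None, "PLEASE", "HELLO"],
    "signer_id": ["s1", "s1", "s2", "s2"],
})
report = missing_report(toy)
print(report)
```

On the real data, `missing_report(df)` returns an empty table when no column has missing values.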

In [None]:
# Step 5: Visualise frequency of glosses
# This shows the most frequently occurring gloss labels in the dataset

plt.figure(figsize=(14,6))
gloss_counts = df['gloss'].value_counts().head(20)
sns.barplot(x=gloss_counts.values, y=gloss_counts.index, palette='viridis')
plt.title('Top 20 Most Frequent Glosses')
plt.xlabel('Frequency')
plt.ylabel('Gloss Label')
plt.tight_layout()
plt.show()
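
A bar chart of the top 20 glosses does not show how much of the dataset those glosses account for, or how long the tail of rare glosses is. The sketch below computes both numbers. The function names and the toy labels are hypothetical and chosen only for illustration.

```python
import pandas as pd


def gloss_coverage(glosses, top_k=20):
    """Return the share of all samples covered by the top_k most frequent glosses."""
    counts = glosses.value_counts()
    return counts.head(top_k).sum() / counts.sum()


def singleton_glosses(glosses):
    """Return the number of glosses that occur exactly once."""
    counts = glosses.value_counts()
    return int((counts == 1).sum())


# Hypothetical toy labels, only to make the example self-contained
toy_glosses = pd.Series(
    ["HELLO", "HELLO", "HELLO", "THANK-YOU", "THANK-YOU", "PLEASE"]
)
print(f"Top-1 coverage: {gloss_coverage(toy_glosses, top_k=1):.0%}")
print("Glosses seen once:", singleton_glosses(toy_glosses))
```

On the real data you would pass `df['gloss']`. A low top-20 coverage together with many singletons points to a long-tailed label distribution, which matters if the glosses are later used as classification targets.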

In [None]:
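# Step 6: (Optional) Plot the number of samples per signer, if a signer column is available
# Note: this shows how samples are distributed across signers, not demographic attributes
if 'signer_id' in df.columns:
    plt.figure(figsize=(8, 5))
    df['signer_id'].value_counts().plot(kind='bar')
    plt.title('Samples per Signer')
    plt.xlabel('Signer ID')
    plt.ylabel('Number of Samples')
    plt.tight_layout()
    plt.show()
else:
    print("No 'signer_id' column found; skipping signer plot.")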
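
Beyond the number of samples per signer, it can be useful to see how many distinct glosses each signer contributes. The sketch below assumes the dataset has both a `gloss` and a `signer_id` column; the latter is optional in this notebook. The function `signer_summary` and the toy rows are hypothetical.

```python
import pandas as pd


def signer_summary(frame, signer_col="signer_id", gloss_col="gloss"):
    """Return sample count and distinct-gloss count for each signer."""
    summary = frame.groupby(signer_col)[gloss_col].agg(["size", "nunique"])
    return summary.rename(columns={"size": "samples", "nunique": "unique_glosses"})


# Hypothetical toy rows, only to make the example self-contained
toy = pd.DataFrame({
    "gloss": ["HELLO", "THANK-YOU", "HELLO", "HELLO"],
    "signer_id": ["s1", "s1", "s1", "s2"],
})
per_signer = signer_summary(toy)
print(per_signer)
```

On the real data, guard the call with the same `'signer_id' in df.columns` check used in Step 6. Signers with many samples but few distinct glosses may dominate certain labels.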