The [american-sign-language-detection](https://github.com/ashgadala/american-sign-language-detection#) project contains a dataset that is about 40% reusable for NGT: 10 of its 26 letters can be kept as they are. The following confusions were found for the other 16 letters:

| True Label | Predicted As | Status |
|------------|--------------|--------|
| A | Y | Remove |
| D | Z, X | Remove |
| E | S, M | Remove |
| G | NONE | Remove |
| H | NONE | Remove |
| J | NONE | Remove |
| K | Z, X | Remove |
| M | Q | Remove |
| N | K | Remove |
| O | NONE | Remove |
| P | K | Remove |
| Q | K | Remove |
| R | M | Remove |
| S | E | Remove |
| T | F | Remove |
| W | K | Remove |

### Good Letters (Keep Existing Data)

| Letter | Samples | Status |
|--------|---------|--------|
| B | 1281 | Keep |
| C | 578 | Keep |
| F | 1024 | Keep |
| I | 1021 | Keep |
| L | 1417 | Keep |
| U | 1182 | Keep |
| V | 1227 | Keep |
| X | 1094 | Keep |
| Y | 2179 | Keep |
| Z | 1464 | Keep |

**Total Good Samples:** 12,467  
**Average per Letter:** 1,246  
**Target for New Letters:** 1,246 samples per letter
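
As a quick check of the arithmetic above, the sketch below recomputes the total and the average from the table and derives how many new samples would have to be recorded in total. The `good_counts` dict restates the table; `ALPHABET_SIZE`, the total of 19,936 new samples, and the "under half the target" threshold are assumptions made for illustration and are not part of the original notebook.

```python
# Sample counts for the kept letters, copied from the table above.
good_counts = {
    "B": 1281, "C": 578, "F": 1024, "I": 1021, "L": 1417,
    "U": 1182, "V": 1227, "X": 1094, "Y": 2179, "Z": 1464,
}
ALPHABET_SIZE = 26  # assumption: one class per letter A-Z

total_good = sum(good_counts.values())
# Integer average, rounded down, used as the per-letter target.
target_per_letter = total_good // len(good_counts)
# Every letter that is not kept has to be recorded again.
letters_to_collect = ALPHABET_SIZE - len(good_counts)
new_samples_needed = letters_to_collect * target_per_letter

print(f"Total good samples: {total_good}")              # 12467
print(f"Target per new letter: {target_per_letter}")    # 1246
print(f"New samples to collect: {new_samples_needed}")  # 19936

# Kept letters with fewer than half the target (illustrative threshold).
sparse = [letter for letter, n in good_counts.items() if n < target_per_letter / 2]
print("Well below target:", sparse)                     # ['C']
```

With only 578 samples, C sits well below the average, so it may be worth recording extra samples for it alongside the new letters.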

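The code below removes the samples of the confused letters from the ASL dataset and writes a cleaned dataset that contains only the good letters, with their label indices renumbered from 0 to 9.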

In [2]:
import csv
from collections import Counter



In [3]:
csv_path = '../data/dataset/asl_keypoint.csv'
label_path = '../data/dataset/keypoint_classifier_label.csv'

# Read labels
with open(label_path, 'r', encoding='utf-8-sig') as f:
    labels = [row[0] for row in csv.reader(f)]

print("Labels:", labels)

Labels: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']


In [4]:
# Count samples per label index
label_counts = Counter()

with open(csv_path, 'r', encoding='utf-8-sig') as f:
    reader = csv.reader(f)
    for row in reader:
        if row:
            label_index = int(row[0])
            label_counts[label_index] += 1

# Show counts
for idx in sorted(label_counts.keys()):
    letter = labels[idx]
    count = label_counts[idx]
    print(f"{letter}: {count}")

A: 4136
B: 1281
C: 578
D: 543
E: 2122
F: 1024
G: 2565
H: 1294
I: 1021
J: 1402
K: 1943
L: 1417
M: 1245
N: 1459
O: 477
P: 575
Q: 694
R: 1287
S: 1640
T: 1251
U: 1182
V: 1227
W: 1302
X: 1094
Y: 2179
Z: 1464


In [5]:
# Letters with confusion issues - REMOVE these
CONFUSED = ['A', 'D', 'E', 'G', 'H', 'J', 'K', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'W']

# Letters without issues - KEEP these
GOOD = ['B', 'C', 'F', 'I', 'L', 'U', 'V', 'X', 'Y', 'Z']

# Calculate stats
keep_samples = sum(label_counts[labels.index(letter)] for letter in GOOD)
remove_samples = sum(label_counts[labels.index(letter)] for letter in CONFUSED)

print(f"KEEP ({len(GOOD)} letters): {keep_samples} samples")
print(f"REMOVE ({len(CONFUSED)} letters): {remove_samples} samples")
print(f"\nAverage per good letter: {keep_samples // len(GOOD)}")

KEEP (10 letters): 12467 samples
REMOVE (16 letters): 23935 samples

Average per good letter: 1246
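
Because `GOOD` and `CONFUSED` are typed by hand, a letter could end up in both lists or in neither. The sketch below is one possible sanity check, not part of the original notebook. It assumes the label file lists A-Z in order, as printed by cell `In [3]`, and `check_partition` is a hypothetical helper name.

```python
import string

# Assumption: the label file lists A-Z in order, as printed by cell In [3].
labels = list(string.ascii_uppercase)

CONFUSED = ['A', 'D', 'E', 'G', 'H', 'J', 'K', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'W']
GOOD = ['B', 'C', 'F', 'I', 'L', 'U', 'V', 'X', 'Y', 'Z']


def check_partition(labels, good, confused):
    """Return (overlap, missing, unknown); three empty sets mean a clean split."""
    good_set, confused_set, all_labels = set(good), set(confused), set(labels)
    overlap = good_set & confused_set                  # letters in both lists
    missing = all_labels - good_set - confused_set     # letters in neither list
    unknown = (good_set | confused_set) - all_labels   # entries that are not labels
    return overlap, missing, unknown


overlap, missing, unknown = check_partition(labels, GOOD, CONFUSED)
if overlap or missing or unknown:
    raise ValueError(f"overlap={overlap}, missing={missing}, unknown={unknown}")
print("GOOD and CONFUSED split the label set cleanly")
```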


In [6]:
# Map old indices to new indices for good letters only
old_to_new = {}
for new_idx, letter in enumerate(GOOD):
    old_idx = labels.index(letter)
    old_to_new[old_idx] = new_idx

print("Index mapping:")
for old_idx, new_idx in sorted(old_to_new.items()):
    print(f"{labels[old_idx]}: {old_idx} -> {new_idx}")

Index mapping:
B: 1 -> 0
C: 2 -> 1
F: 5 -> 2
I: 8 -> 3
L: 11 -> 4
U: 20 -> 5
V: 21 -> 6
X: 23 -> 7
Y: 24 -> 8
Z: 25 -> 9
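
A classifier trained on the cleaned dataset will output the new indices 0 to 9, so something has to translate them back to letters. The sketch below inverts the mapping for that purpose. It again assumes labels A-Z in order, and `new_to_old` and `decode` are hypothetical names that do not appear in the original notebook.

```python
import string

# Assumption: the label file lists A-Z in order, as printed by cell In [3].
labels = list(string.ascii_uppercase)
GOOD = ['B', 'C', 'F', 'I', 'L', 'U', 'V', 'X', 'Y', 'Z']

# Same mapping as in the cell above, written as a dict comprehension.
old_to_new = {labels.index(letter): new_idx for new_idx, letter in enumerate(GOOD)}

# Inverse mapping: new (cleaned) index -> original index.
new_to_old = {new_idx: old_idx for old_idx, new_idx in old_to_new.items()}


def decode(new_idx):
    """Return the letter for a class index taken from the cleaned dataset."""
    return labels[new_to_old[new_idx]]


print(decode(0), decode(9))  # B Z
```

Since the new indices simply follow the order of `GOOD`, `GOOD[new_idx]` gives the same letter; the inverse dict is mainly useful when the original index is needed as well.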


In [7]:
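output_csv = '../data/dataset/cleaned_keypoint.csv'

kept = 0
removed = 0

# newline='' lets the csv module handle line endings itself
with open(csv_path, 'r', newline='', encoding='utf-8-sig') as infile, \
     open(output_csv, 'w', newline='', encoding='utf-8') as outfile:

    reader = csv.reader(infile)
    writer = csv.writer(outfile)

    for row in reader:
        if not row:
            continue  # skip blank lines

        old_label = int(row[0])  # first column holds the original label index

        if old_label in old_to_new:
            # Good letter: renumber the label, keep the keypoints unchanged
            row[0] = str(old_to_new[old_label])
            writer.writerow(row)
            kept += 1
        else:
            # Confused letter: drop the sample
            removed += 1

print(f"Kept: {kept}, Removed: {removed}")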

Kept: 12467, Removed: 23935
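
The cleaned CSV now uses indices 0 to 9, but the original label file still lists all 26 letters, so the two no longer match. The sketch below shows one way to write a matching label file and to recount the cleaned data per letter. The output file name `cleaned_keypoint_classifier_label.csv` and the helpers `write_label_file` and `count_by_letter` are hypothetical; the original notebook does not include this step.

```python
import csv
from collections import Counter
from pathlib import Path

GOOD = ['B', 'C', 'F', 'I', 'L', 'U', 'V', 'X', 'Y', 'Z']


def write_label_file(letters, path):
    """Write one letter per row, so that row i names the class with index i."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        csv.writer(f).writerows([letter] for letter in letters)


def count_by_letter(csv_path, letters):
    """Count rows per letter in a keypoint CSV whose first column is a label index."""
    counts = Counter()
    with open(csv_path, 'r', newline='', encoding='utf-8-sig') as f:
        for row in csv.reader(f):
            if row:  # skip blank lines
                counts[letters[int(row[0])]] += 1
    return counts


# Only runs when the cleaned file from the cell above is present.
cleaned_csv = Path('../data/dataset/cleaned_keypoint.csv')
if cleaned_csv.exists():
    # Hypothetical file name for the reduced label list.
    write_label_file(GOOD, cleaned_csv.with_name('cleaned_keypoint_classifier_label.csv'))
    for letter, count in sorted(count_by_letter(cleaned_csv, GOOD).items()):
        print(f"{letter}: {count}")
```

If the cleaning step worked, the per-letter counts printed here should equal the counts for the same letters in the original dataset and add up to the `Kept` total.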
