Python code to extract digit 'types' from the Kaggle/MNIST Digit Recognizer dataset. The training data is first grouped by label, then the images for each digit are run through a k-means clustering algorithm to extract different styles of handwriting.
It occurred to me that people write certain digits in different ways. For instance, there is what I call the 'z' two versus the 'loopy' two:
Or the 'closed' four versus the 'open' four:
So, using sklearn's KMeans class, I extracted four, ten, and twenty clusters for each digit in the labeled training set. Since KMeans initializes its centroids randomly, running with different seeds (which I did not do, for continuity) will slightly alter which digit attributes show up in each cluster set. Still, I found it interesting that the 'one with a hat' didn't come out clearly until the 20-cluster set, and that the 'seven with a mustache' is vaguely present in two of the 10-cluster means but very clear in only one of the 20-cluster means. Also, some people write really slanted ones!
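
The procedure described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual script: the function name `digit_styles`, its parameters, and the fixed `seed` are assumptions made here, and it expects the pixel data as a NumPy array with one flattened image per row alongside a matching array of labels.

```python
import numpy as np
from sklearn.cluster import KMeans


def digit_styles(pixels, labels, n_clusters=10, seed=0):
    """Cluster the images of each digit separately.

    Hypothetical helper: returns {digit: array of shape
    (n_clusters, n_pixels)} holding the k-means cluster centers.
    """
    styles = {}
    for digit in np.unique(labels):
        # Keep only the images carrying this label.
        subset = pixels[labels == digit]
        # A fixed seed keeps the extracted clusters the same between runs.
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
        km.fit(subset)
        # Each center is the mean image of one cluster, i.e. one 'style'.
        styles[int(digit)] = km.cluster_centers_
    return styles
```

Calling it once per cluster count (4, 10, and 20) would give the three sets discussed above. Assuming 28x28 images flattened to 784 values, each center can be turned back into a picture with `center.reshape(28, 28)`.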






