# *k*-NN classification

Sharon Kim<br>
9 April 2017

### A summary of the algorithm

A feature function evaluates a name based on a certain feature of the name. For example, the `length` function counts the number of letters in a name, and the `hardCons` function counts the number of hard consonants in a name. Each feature can be thought of as a separate axis in an `n`-dimensional space, where `n` is the number of features. Thus, each name can be plotted as a point in this `n`-dimensional space.

In many cases, it is possible to distinguish male names from female ones based on the criteria that a feature function provides; this is what the computer learns to do given one or more feature functions and a set of training names. For each name in a new set of test names, the k-NN classification algorithm finds the `k` closest training names and takes the most common gender among them. This is the predicted gender of the test name.
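To make the idea concrete before the full implementation, here is a minimal sketch of nearest-neighbor voting on a handful of made-up points. The function `knnPredict` and the data in `toyTrain` are hypothetical and only for illustration; the notebook's own functions appear in the steps that follow.

```python
from collections import Counter
from math import sqrt

def knnPredict(train, point, k):
    """Return the most common label among the k training points nearest to point.

    train is a list of (label, vector) pairs; vectors are plain lists of numbers.
    """
    def distance(v, w):
        # Euclidean distance between two equal-length vectors
        return sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))
    # Sort training points by distance to the query point and keep the k closest
    nearest = sorted(train, key=lambda pair: distance(pair[1], point))[:k]
    votes = Counter(label for label, _ in nearest)
    return votes.most_common(1)[0][0]

# Made-up training data: (label, [length, vowels])
toyTrain = [
    ('female', [4, 2]),
    ('female', [3, 2]),
    ('female', [6, 3]),
    ('male', [7, 2]),
    ('male', [7, 3]),
]

toyPrediction = knnPredict(toyTrain, [4, 3], 3)  # the three nearest points are all 'female'
```

With `k = 3`, the query point `[4, 3]` is closest to the three `'female'` points, so the vote is unanimous. A query at `[7, 2]` instead has two `'male'` points and one `'female'` point among its three nearest neighbors, so it comes out `'male'`.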

In [1]:
from math import sqrt
import string
import sys
import os

Here are the provided feature functions:

In [2]:
def length(name):
    """feature function with length of name"""
    return [('length', len(name))]

def vowels(name):
    """feature function with number of vowels in name"""
    def isVowel(c):
        return c in 'aeiouy'
    return [('vowels', len(filter(isVowel, name)))]

Test on some values:

In [3]:
length('andrea')

[('length', 6)]

In [4]:
vowels('andrea')

[('vowels', 3)]

## Step 1: Featurize

Some more functions that create tuples of feature names and their values.

In [5]:
def hardCons(name):
    """feature function with number of hard consonants in name"""
    def isHard(c):
        return c in 'bdgptk'
    return [('hard', len(filter(isHard, name)))]

def countLetters(name):
    """feature function with count of each letter in name"""
    def countLettersHelper(c):
        return ('count-'+c, name.count(c))
    return map(countLettersHelper, string.lowercase)

def lastLetter(name):
    """feature function indicating 1 for last letter of name, and 0 for other letters"""
    def lastLetterHelper(c):
        return ('last-'+c, int(name[-1]==c))
    return map(lastLetterHelper, string.lowercase)

def firstLetter(name):
    """feature function indicating 1 for first letter of name, and 0 for other letters"""
    def firstLetterHelper(c):
        return ('first-'+c, int(name[0]==c))
    return map(firstLetterHelper, string.lowercase)


Test the functions on different names:

In [6]:
hardCons('emily')

[('hard', 0)]

In [7]:
hardCons('dante')

[('hard', 2)]

In [8]:
hardCons('karina')

[('hard', 1)]

In [9]:
countLetters('andrea')

[('count-a', 2),
 ('count-b', 0),
 ('count-c', 0),
 ('count-d', 1),
 ('count-e', 1),
 ('count-f', 0),
 ('count-g', 0),
 ('count-h', 0),
 ('count-i', 0),
 ('count-j', 0),
 ('count-k', 0),
 ('count-l', 0),
 ('count-m', 0),
 ('count-n', 1),
 ('count-o', 0),
 ('count-p', 0),
 ('count-q', 0),
 ('count-r', 1),
 ('count-s', 0),
 ('count-t', 0),
 ('count-u', 0),
 ('count-v', 0),
 ('count-w', 0),
 ('count-x', 0),
 ('count-y', 0),
 ('count-z', 0)]

In [10]:
lastLetter('jack')

[('last-a', 0),
 ('last-b', 0),
 ('last-c', 0),
 ('last-d', 0),
 ('last-e', 0),
 ('last-f', 0),
 ('last-g', 0),
 ('last-h', 0),
 ('last-i', 0),
 ('last-j', 0),
 ('last-k', 1),
 ('last-l', 0),
 ('last-m', 0),
 ('last-n', 0),
 ('last-o', 0),
 ('last-p', 0),
 ('last-q', 0),
 ('last-r', 0),
 ('last-s', 0),
 ('last-t', 0),
 ('last-u', 0),
 ('last-v', 0),
 ('last-w', 0),
 ('last-x', 0),
 ('last-y', 0),
 ('last-z', 0)]

In [11]:
firstLetter('jack')

[('first-a', 0),
 ('first-b', 0),
 ('first-c', 0),
 ('first-d', 0),
 ('first-e', 0),
 ('first-f', 0),
 ('first-g', 0),
 ('first-h', 0),
 ('first-i', 0),
 ('first-j', 1),
 ('first-k', 0),
 ('first-l', 0),
 ('first-m', 0),
 ('first-n', 0),
 ('first-o', 0),
 ('first-p', 0),
 ('first-q', 0),
 ('first-r', 0),
 ('first-s', 0),
 ('first-t', 0),
 ('first-u', 0),
 ('first-v', 0),
 ('first-w', 0),
 ('first-x', 0),
 ('first-y', 0),
 ('first-z', 0)]

## Step 2: Combine Features

In [12]:
def combine(featureFunc1, featureFunc2):
    """returns a function that, when applied to a name, returns a feature vector
    combining the feature vectors from featureFunc1 and featureFunc2.
    """
    def combineHelper(name):
        return featureFunc1(name) + featureFunc2(name)
    return combineHelper

def combineMany(listOfFeatureFuncs):
    """returns a function that, when applied to a name, returns a feature vector
    combining the feature vectors from every function in listOfFeatureFuncs
    """
    return reduce(combine, listOfFeatureFuncs)
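The `reduce` call folds the list pairwise: `reduce(combine, [f, g, h])` builds `combine(combine(f, g), h)`, so the resulting function concatenates the feature vectors in list order. The sketch below restates this in a self-contained form. It imports `reduce` from `functools`, which is where it lives in Python 3 (the cells in this notebook rely on the Python 2 builtin), and the small feature functions are simplified stand-ins for the ones defined above.

```python
from functools import reduce  # builtin in Python 2, in functools in Python 3

def combine(featureFunc1, featureFunc2):
    """Return a function whose feature vector is featureFunc1's followed by featureFunc2's."""
    def combineHelper(name):
        return list(featureFunc1(name)) + list(featureFunc2(name))
    return combineHelper

def combineMany(listOfFeatureFuncs):
    """Fold combine over the list: combine(combine(f, g), h) for [f, g, h]."""
    return reduce(combine, listOfFeatureFuncs)

# Simplified stand-ins for the notebook's feature functions
def length(name):
    return [('length', len(name))]

def vowels(name):
    return [('vowels', sum(1 for c in name if c in 'aeiouy'))]

def hardCons(name):
    return [('hard', sum(1 for c in name if c in 'bdgptk'))]

lvh = combineMany([length, vowels, hardCons])
lvhAndrea = lvh('andrea')  # [('length', 6), ('vowels', 3), ('hard', 1)]
```

Wrapping each result in `list(...)` is a precaution for Python 3, where `map` returns an iterator rather than a list and cannot be concatenated with `+`.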

In [13]:
lengthVowels = combine(length, vowels)

In [14]:
type(lengthVowels)

function

In [15]:
lengthVowels('andrea')

[('length', 6), ('vowels', 3)]

In [16]:
lengthFirst = combine(length, firstLetter)

In [17]:
lengthFirst('emily')

[('length', 5),
 ('first-a', 0),
 ('first-b', 0),
 ('first-c', 0),
 ('first-d', 0),
 ('first-e', 1),
 ('first-f', 0),
 ('first-g', 0),
 ('first-h', 0),
 ('first-i', 0),
 ('first-j', 0),
 ('first-k', 0),
 ('first-l', 0),
 ('first-m', 0),
 ('first-n', 0),
 ('first-o', 0),
 ('first-p', 0),
 ('first-q', 0),
 ('first-r', 0),
 ('first-s', 0),
 ('first-t', 0),
 ('first-u', 0),
 ('first-v', 0),
 ('first-w', 0),
 ('first-x', 0),
 ('first-y', 0),
 ('first-z', 0)]

## Step 3: Classify


<img src="http://cs111.wellesley.edu/archive/cs111_fall15/public_html/assignments/ps11/lengthVowelsNew.png" width=560>

**Helper Function**

In [18]:
def euclideanDistanceFrom(v):
    """returns a function that takes a vector and computes the distance between that vector and v"""
    def euclideanDistanceHelper(w):
        """distance between vectors v and w"""
        def squareDiff(i):
            """square of difference between elements in ith position of vectors v and w"""
            return (v[i][1] - w[i][1])**2
        return sqrt(sum(map(squareDiff, range(len(v)))))
    return euclideanDistanceHelper

## Step 3a: Training

In [19]:
def buildTrainingVectors(featurefunc, trainfile):
    """read the names in trainfile, return a list of tuples,
    where each tuple in the list corresponds to a name in the data.
    Each tuple is of the form (genderlabel, featurevector),
    where genderlabel is the gender of the name,
    and featurevector is the feature vector of the name under featurefunc.
    """

    def lineToFeatures(line):
        """convert a line in trainfile to a tuple of genderlabel and feature vector"""
        name, gender = line.split()
        return gender, featurefunc(name)

    with open(trainfile) as f:
        genderFeatureList = map(lineToFeatures, f.readlines())
    return genderFeatureList

In [20]:
lengthVowels = combine(length, vowels)
lvTrain = buildTrainingVectors(lengthVowels, 'train.txt')
print len(lvTrain)

2806


In [21]:
lvTrain[:10]

[('male', [('length', 4), ('vowels', 2)]),
 ('male', [('length', 4), ('vowels', 2)]),
 ('male', [('length', 5), ('vowels', 2)]),
 ('male', [('length', 5), ('vowels', 2)]),
 ('male', [('length', 7), ('vowels', 3)]),
 ('male', [('length', 5), ('vowels', 2)]),
 ('male', [('length', 7), ('vowels', 3)]),
 ('male', [('length', 9), ('vowels', 4)]),
 ('male', [('length', 6), ('vowels', 3)]),
 ('male', [('length', 6), ('vowels', 3)])]

In [22]:
lvTrain[-10:]

[('female', [('length', 7), ('vowels', 3)]),
 ('female', [('length', 4), ('vowels', 2)]),
 ('female', [('length', 7), ('vowels', 3)]),
 ('female', [('length', 7), ('vowels', 3)]),
 ('female', [('length', 5), ('vowels', 2)]),
 ('female', [('length', 5), ('vowels', 2)]),
 ('female', [('length', 7), ('vowels', 3)]),
 ('female', [('length', 6), ('vowels', 2)]),
 ('female', [('length', 5), ('vowels', 3)]),
 ('female', [('length', 6), ('vowels', 3)])]

## Step 3b: Sort Distances

In [23]:
def labelsSortedByDistance(testvector, trainingVectors):
    """create a list of (genderlabel, distance) tuples from trainingVectors
    (trainingVectors is a list of (gender, featvec) tuples, of the type 
    returned by buildTrainingVectors).
    The distance is the euclideanDistance between featvec and testvector.
    Sort this list of tuples by distance
    """
    distFunc = euclideanDistanceFrom(testvector)
    def labelFeat2labelDistance(labelFeatTuple):
        """return a tuple consisting of the label and the distance of the 
        feature vector to testvector
        """
        return labelFeatTuple[0], distFunc(labelFeatTuple[1])
    distances = map(labelFeat2labelDistance, trainingVectors)
    distances.sort(key=lambda x:x[1])
    return distances


In [24]:
lvcl = combineMany([length, vowels, countLetters, lastLetter])

In [25]:
lvclTrain = buildTrainingVectors(lvcl, 'train.txt')

In [26]:
beyonceLvcl = lvcl('beyonce')

In [27]:
lvclDistances = labelsSortedByDistance(beyonceLvcl, lvclTrain)

In [29]:
lvclDistances[:21]

[('male', 2.23606797749979),
 ('female', 2.23606797749979),
 ('male', 2.449489742783178),
 ('male', 2.449489742783178),
 ('female', 2.449489742783178),
 ('female', 2.449489742783178),
 ('female', 2.449489742783178),
 ('female', 2.449489742783178),
 ('female', 2.449489742783178),
 ('female', 2.449489742783178),
 ('female', 2.449489742783178),
 ('female', 2.449489742783178),
 ('female', 2.449489742783178),
 ('female', 2.449489742783178),
 ('female', 2.449489742783178),
 ('female', 2.449489742783178),
 ('male', 2.6457513110645907),
 ('male', 2.6457513110645907),
 ('male', 2.6457513110645907),
 ('male', 2.6457513110645907),
 ('male', 2.6457513110645907)]

In [30]:
emmavec = lengthVowels('emma')

In [31]:
williamvec = lengthVowels('william')

In [32]:
distanceFromEmma = euclideanDistanceFrom(emmavec)

In [33]:
distanceFromEmma(williamvec)

3.1622776601683795

In [34]:
avavec = lengthVowels('ava')
distanceFromEmma(avavec)

1.0

In [35]:
avavec

[('length', 3), ('vowels', 2)]

In [36]:
emmavec

[('length', 4), ('vowels', 2)]

This is why the distance is 1: it is the distance between the two vectors [3, 2] and [4, 2].
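The same arithmetic can be checked by hand on bare number vectors. The sketch below uses a hypothetical helper `euclidean` that works on plain lists instead of the `(name, value)` tuples used by `euclideanDistanceFrom`; the vectors are the `[length, vowels]` values for 'emma', 'ava', and 'william'.

```python
from math import sqrt

def euclidean(v, w):
    """Euclidean distance between two equal-length lists of numbers."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))

emma = [4, 2]     # length 4, vowels e and a
ava = [3, 2]      # length 3, vowels a and a
william = [7, 3]  # length 7, vowels i, i and a

dAva = euclidean(emma, ava)          # sqrt(1**2 + 0**2) = 1.0
dWilliam = euclidean(emma, william)  # sqrt(3**2 + 1**2) = sqrt(10)
```

The second value is the square root of 10, which agrees with the 3.1622776601683795 printed above for 'william'.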

## Step 3c: Prediction

In [37]:
def predictGender(featurefunc, testname, trainingVectors, k):
    """find the k nearest training data points to testname in terms of feature 
    vectors, return the most common label among those points
    """
    testvector = featurefunc(testname)
    distances = labelsSortedByDistance(testvector, trainingVectors)
    nearest_labels = map(lambda x:x[0], distances[:k])
    label_counts = map(lambda label: (nearest_labels.count(label), label), set(nearest_labels))
    return max(label_counts)[1]
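One property of the `max(label_counts)` step is worth spelling out. The tuples are `(count, label)`, so when two labels have the same count, `max` falls back to comparing the label strings, and `'male'` sorts after `'female'`. With two labels and an odd `k` no tie can occur, but an even `k` can produce one. The sketch below isolates the voting step; the name `majorityLabel` is mine and not part of the assignment code.

```python
def majorityLabel(labels):
    """Return the most common label, using the same (count, label) max as predictGender."""
    counts = [(labels.count(label), label) for label in set(labels)]
    # max compares counts first; equal counts are broken by comparing the label strings
    return max(counts)[1]

clearWinner = majorityLabel(['female', 'female', 'male'])  # 'female' wins 2 to 1
tied = majorityLabel(['female', 'male'])                   # 1 to 1, resolved as 'male'
```

Choosing an odd `k`, as the calls in this notebook do (3 and 11), avoids this case entirely when there are only two labels.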

In [38]:
predictGender(lvcl, 'beyonce', lvclTrain, 11)

'female'

In [39]:
predictGender(lvcl, 'kumar', lvclTrain, 11)

'male'

In [40]:
predictGender(lvcl, 'theodore', lvclTrain, 11)

'female'

In [41]:
predictGender(lvcl, 'eni', lvclTrain, 11)

'male'

In [42]:
predictGender(lvcl, 'eniana', lvclTrain, 11)

'female'

## Step 3d: Evaluation (computing accuracy)

In [43]:
def computeAccuracy(featurefunc, testfile, trainingVectors, k):
    """Go through each name in testfile, predict its gender label,
    and print the name, actual label, predicted label, and CORRECT or WRONG for each example.
    Return the accuracy (the proportion of examples for which the labels are correctly predicted)."""
    numCorrect = 0.0
    numData = 0.0
    with open(testfile) as f:
        for line in f.readlines():
            numData += 1
            name, actual = line.split()
            print name, actual,
            predicted = predictGender(featurefunc, name, trainingVectors, k)
            print predicted,
            if predicted == actual:
                numCorrect += 1
                print 'CORRECT'
            else:
                print 'WRONG'
    return numCorrect/numData

In [44]:
computeAccuracy(lvcl, 'test.txt', lvclTrain, 3)

daniel male male CORRECT
joseph male male CORRECT
isaac male male CORRECT
jack male male CORRECT
levi male male CORRECT
adrian male male CORRECT
brandon male male CORRECT
ian male male CORRECT
nathaniel male female WRONG
juan male male CORRECT
max male male CORRECT
declan male male CORRECT
diego male male CORRECT
richard male male CORRECT
brian male male CORRECT
marcus male male CORRECT
theodore male male CORRECT
tucker male male CORRECT
kingston male male CORRECT
maximus male male CORRECT
devin male male CORRECT
eduardo male male CORRECT
zander male male CORRECT
chance male male CORRECT
finn male male CORRECT
erick male male CORRECT
beau male male CORRECT
johnathan male male CORRECT
dante male male CORRECT
gregory male male CORRECT
erik male male CORRECT
dawson male male CORRECT
desmond male male CORRECT
joaquin male male CORRECT
allen male male CORRECT
adan male male CORRECT
gideon male male CORRECT
dexter male male CORRECT
esteban male male CORRECT
ismael male male CORRECT
enzo male

0.8483870967741935

In [45]:
vhlf = combineMany([vowels, hardCons, lastLetter, firstLetter])
trainVhlf = buildTrainingVectors(vhlf, 'train.txt')

In [46]:
computeAccuracy(vhlf, 'test.txt', trainVhlf, 11)

daniel male male CORRECT
joseph male male CORRECT
isaac male male CORRECT
jack male male CORRECT
levi male male CORRECT
adrian male male CORRECT
brandon male male CORRECT
ian male male CORRECT
nathaniel male male CORRECT
juan male male CORRECT
max male male CORRECT
declan male male CORRECT
diego male male CORRECT
richard male male CORRECT
brian male male CORRECT
marcus male male CORRECT
theodore male male CORRECT
tucker male male CORRECT
kingston male male CORRECT
maximus male male CORRECT
devin male male CORRECT
eduardo male male CORRECT
zander male male CORRECT
chance male male CORRECT
finn male male CORRECT
erick male male CORRECT
beau male male CORRECT
johnathan male male CORRECT
dante male male CORRECT
gregory male male CORRECT
erik male male CORRECT
dawson male male CORRECT
desmond male male CORRECT
joaquin male male CORRECT
allen male male CORRECT
adan male male CORRECT
gideon male male CORRECT
dexter male male CORRECT
esteban male male CORRECT
ismael male male CORRECT
enzo male

0.8032258064516129

## Part 2

- Create some features (at least three new ones)
- Create at least 2 classifiers
- Measure the accuracy
- Discuss which combinations of features seem to do better. Does this fit with your previous intuition about the features?

<b>Step 1:</b> create some features (at least three new ones)

One idea I had for a feature function is one that could count the number of syllables in a name. However, I'm not sure this is within the scope of my abilities. I'm thinking that the extra -a at the end of many female names adds an extra syllable. By contrast, most male names end with a hard consonant that doesn't provide an extra syllable.
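A rough way to approximate the syllable idea, without any real phonetics, is to count groups of consecutive vowels. The sketch below is only a heuristic under that assumption, and the name `syllables` is hypothetical; it is not one of the features evaluated in this notebook.

```python
def syllables(name):
    """feature function estimating syllables as the number of vowel groups in name"""
    count = 0
    previousWasVowel = False
    for c in name.lower():
        isVowel = c in 'aeiouy'
        # A new group starts at a vowel that does not follow another vowel
        if isVowel and not previousWasVowel:
            count += 1
        previousWasVowel = isVowel
    return [('syllables', count)]

emmaSyllables = syllables('emma')    # [('syllables', 2)]
emilySyllables = syllables('emily')  # [('syllables', 3)]
```

The estimate is crude. It treats 'ea' in 'andrea' as one group and so returns 2 for a name usually pronounced with three syllables, and it counts the silent 'e' in 'chance' as a second syllable.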

In [47]:
def softCons(name):
    """feature function with number of soft consonants in name"""
    def isSoft(c):
        return c in 'cfhjlmnrsvwxyz'
    return [('soft', len(filter(isSoft, name)))]

def consonants(name):
    """feature function with number of consonants in name"""
    def isConsonant(c):
        return c in 'bcdfghjklmnpqrstvwxz'
    return [('consonants', len(filter(isConsonant, name)))]

def softened_consonant(name):
    '''feature function that checks for the presence of an h after a consonant'''
    def isSoftened(c):
        if 'h' in name:
            h = name.index('h')
            c = name[h-1]
            return c in 'cpst'
    return [('softened', len(filter(isSoftened, name)))]
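As written, `isSoftened` overwrites its argument `c` with the letter before the first 'h', so `filter` keeps either every letter of the name or none of them, and the feature value is either `len(name)` or 0. A per-position variant, which may be closer to the intent in the docstring, could look like the sketch below. The name `softenedPairs` is hypothetical, and the accuracies reported in this notebook come from `softened_consonant` as defined above, not from this variant.

```python
def softenedPairs(name):
    """hypothetical variant: count each 'h' that directly follows c, p, s, or t"""
    count = 0
    # Walk over adjacent letter pairs: (name[0], name[1]), (name[1], name[2]), ...
    for previous, current in zip(name, name[1:]):
        if current == 'h' and previous in 'cpst':
            count += 1
    return [('softened', count)]

nathanielPairs = softenedPairs('nathaniel')      # one 'th'
christopherPairs = softenedPairs('christopher')  # one 'ch' and one 'ph'
```

Pairing each letter with its successor also avoids the wrap-around that `name[h-1]` produces when the name starts with 'h'.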

<b>Step 2:</b> create at least 2 classifiers

In [48]:
c1 = combine(softCons, consonants)
c2 = combine(softCons, softened_consonant)
c3 = combineMany([softCons, consonants, softened_consonant])

<b>Step 3:</b> measure the accuracy

In [49]:
t1 = buildTrainingVectors(c1, 'train.txt')
computeAccuracy(c1, 'test.txt', t1, 11)

daniel male male CORRECT
joseph male male CORRECT
isaac male male CORRECT
jack male male CORRECT
levi male male CORRECT
adrian male male CORRECT
brandon male male CORRECT
ian male male CORRECT
nathaniel male male CORRECT
juan male male CORRECT
max male male CORRECT
declan male male CORRECT
diego male male CORRECT
richard male male CORRECT
brian male male CORRECT
marcus male male CORRECT
theodore male male CORRECT
tucker male male CORRECT
kingston male male CORRECT
maximus male male CORRECT
devin male male CORRECT
eduardo male male CORRECT
zander male male CORRECT
chance male male CORRECT
finn male male CORRECT
erick male male CORRECT
beau male female WRONG
johnathan male male CORRECT
dante male male CORRECT
gregory male male CORRECT
erik male male CORRECT
dawson male male CORRECT
desmond male male CORRECT
joaquin male male CORRECT
allen male male CORRECT
adan male male CORRECT
gideon male male CORRECT
dexter male male CORRECT
esteban male male CORRECT
ismael male male CORRECT
enzo male

0.5032258064516129

In [50]:
t2 = buildTrainingVectors(c2, 'train.txt')
computeAccuracy(c2, 'test.txt', t2, 11)

daniel male male CORRECT
joseph male male CORRECT
isaac male male CORRECT
jack male male CORRECT
levi male male CORRECT
adrian male male CORRECT
brandon male male CORRECT
ian male male CORRECT
nathaniel male male CORRECT
juan male male CORRECT
max male male CORRECT
declan male male CORRECT
diego male male CORRECT
richard male male CORRECT
brian male male CORRECT
marcus male male CORRECT
theodore male male CORRECT
tucker male male CORRECT
kingston male male CORRECT
maximus male male CORRECT
devin male male CORRECT
eduardo male male CORRECT
zander male male CORRECT
chance male male CORRECT
finn male male CORRECT
erick male male CORRECT
beau male male CORRECT
johnathan male male CORRECT
dante male male CORRECT
gregory male male CORRECT
erik male male CORRECT
dawson male male CORRECT
desmond male male CORRECT
joaquin male male CORRECT
allen male male CORRECT
adan male male CORRECT
gideon male male CORRECT
dexter male male CORRECT
esteban male male CORRECT
ismael male male CORRECT
enzo male

0.5096774193548387

In [51]:
t3 = buildTrainingVectors(c3, 'train.txt')
computeAccuracy(c3, 'test.txt', t3, 11)

daniel male male CORRECT
joseph male male CORRECT
isaac male male CORRECT
jack male male CORRECT
levi male male CORRECT
adrian male male CORRECT
brandon male male CORRECT
ian male male CORRECT
nathaniel male female WRONG
juan male male CORRECT
max male male CORRECT
declan male male CORRECT
diego male male CORRECT
richard male male CORRECT
brian male male CORRECT
marcus male male CORRECT
theodore male male CORRECT
tucker male male CORRECT
kingston male male CORRECT
maximus male male CORRECT
devin male male CORRECT
eduardo male male CORRECT
zander male male CORRECT
chance male male CORRECT
finn male male CORRECT
erick male male CORRECT
beau male female WRONG
johnathan male female WRONG
dante male male CORRECT
gregory male male CORRECT
erik male male CORRECT
dawson male male CORRECT
desmond male male CORRECT
joaquin male male CORRECT
allen male male CORRECT
adan male male CORRECT
gideon male male CORRECT
dexter male male CORRECT
esteban male male CORRECT
ismael male male CORRECT
enzo male

0.4967741935483871

Additionally, I try isolating each feature function to see which is most effective.

In [52]:
s_t = buildTrainingVectors(softCons, 'train.txt')
computeAccuracy(softCons, 'test.txt', s_t, 11)

daniel male male CORRECT
joseph male male CORRECT
isaac male male CORRECT
jack male male CORRECT
levi male male CORRECT
adrian male male CORRECT
brandon male male CORRECT
ian male male CORRECT
nathaniel male male CORRECT
juan male male CORRECT
max male male CORRECT
declan male male CORRECT
diego male male CORRECT
richard male male CORRECT
brian male male CORRECT
marcus male male CORRECT
theodore male male CORRECT
tucker male male CORRECT
kingston male male CORRECT
maximus male male CORRECT
devin male male CORRECT
eduardo male male CORRECT
zander male male CORRECT
chance male male CORRECT
finn male male CORRECT
erick male male CORRECT
beau male male CORRECT
johnathan male male CORRECT
dante male male CORRECT
gregory male male CORRECT
erik male male CORRECT
dawson male male CORRECT
desmond male male CORRECT
joaquin male male CORRECT
allen male male CORRECT
adan male male CORRECT
gideon male male CORRECT
dexter male male CORRECT
esteban male male CORRECT
ismael male male CORRECT
enzo male

0.5

In [53]:
c_t = buildTrainingVectors(consonants, 'train.txt')
computeAccuracy(consonants, 'test.txt', c_t, 11)

daniel male male CORRECT
joseph male male CORRECT
isaac male male CORRECT
jack male male CORRECT
levi male male CORRECT
adrian male male CORRECT
brandon male male CORRECT
ian male male CORRECT
nathaniel male male CORRECT
juan male male CORRECT
max male male CORRECT
declan male male CORRECT
diego male male CORRECT
richard male male CORRECT
brian male male CORRECT
marcus male male CORRECT
theodore male male CORRECT
tucker male male CORRECT
kingston male male CORRECT
maximus male male CORRECT
devin male male CORRECT
eduardo male male CORRECT
zander male male CORRECT
chance male male CORRECT
finn male male CORRECT
erick male male CORRECT
beau male male CORRECT
johnathan male male CORRECT
dante male male CORRECT
gregory male male CORRECT
erik male male CORRECT
dawson male male CORRECT
desmond male male CORRECT
joaquin male male CORRECT
allen male male CORRECT
adan male male CORRECT
gideon male male CORRECT
dexter male male CORRECT
esteban male male CORRECT
ismael male male CORRECT
enzo male

0.5

In [54]:
so_t = buildTrainingVectors(softened_consonant, 'train.txt')
computeAccuracy(softened_consonant, 'test.txt', so_t, 11)

daniel male male CORRECT
joseph male male CORRECT
isaac male male CORRECT
jack male male CORRECT
levi male male CORRECT
adrian male male CORRECT
brandon male male CORRECT
ian male male CORRECT
nathaniel male female WRONG
juan male male CORRECT
max male male CORRECT
declan male male CORRECT
diego male male CORRECT
richard male male CORRECT
brian male male CORRECT
marcus male male CORRECT
theodore male male CORRECT
tucker male male CORRECT
kingston male male CORRECT
maximus male male CORRECT
devin male male CORRECT
eduardo male male CORRECT
zander male male CORRECT
chance male male CORRECT
finn male male CORRECT
erick male male CORRECT
beau male male CORRECT
johnathan male male CORRECT
dante male male CORRECT
gregory male male CORRECT
erik male male CORRECT
dawson male male CORRECT
desmond male male CORRECT
joaquin male male CORRECT
allen male male CORRECT
adan male male CORRECT
gideon male male CORRECT
dexter male male CORRECT
esteban male male CORRECT
ismael male male CORRECT
enzo male

0.4967741935483871

### Discussion

Here was my intuition for all of these functions:
- `softCons`: the opposite of `hardCons`. If male names tend to have more hard consonants, would it be possible to say that female names have more soft consonants?
- `consonants`: my intuition was that male names have more consonants in general than female names. This was a tricky one, though, since female names are longer. It might be the case that it is more useful to break this function down to either soft consonants or hard consonants, as we have done, but I thought I would give this one a shot anyway.
- `softened_consonant`: male names have more hard consonants, but what if these hard consonants are softened by a following letter "h"? Would it then make the name more feminine, since a hard consonant + h makes a softer sound than the hard consonant alone? All of these consonants make a softer sound when an "h" is added: "c" to "ch", "p" to "ph", "s" to "sh", and "t" to "th".

`softCons` + `consonants` = 0.503<br>
`softCons` + `softened_consonant` = 0.510<br>
`softCons` + `softened_consonant` + `consonants` = 0.497

`softCons` = 0.50<br>
`consonants` = 0.50<br>
`softened_consonant` = 0.497<br>

My results are not very promising; all accuracies hover around 0.50, which is no better than randomly guessing for each name. A lot of the intuition I gathered from the first pass through this assignment is already captured by the given functions; therefore, I am going to try experimenting with those (maybe also a combination of the given functions and my own functions) instead.
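An overall accuracy near 0.50 can arise in more than one way: errors spread across both labels, or one label predicted for nearly every name. Breaking accuracy down by actual label would distinguish the two. The sketch below assumes the `(actual, predicted)` pairs have been collected into a list, which `computeAccuracy` as written does not do; the function name `accuracyByLabel` and the example pairs are made up for illustration.

```python
def accuracyByLabel(pairs):
    """pairs is a list of (actual, predicted) labels; return {label: fraction correct}."""
    totals = {}
    correct = {}
    for actual, predicted in pairs:
        totals[actual] = totals.get(actual, 0) + 1
        if actual == predicted:
            correct[actual] = correct.get(actual, 0) + 1
    # float() keeps the division exact under both Python 2 and Python 3
    return dict((label, correct.get(label, 0) / float(totals[label])) for label in totals)

# Made-up predictions: every name is predicted 'male'
examplePairs = [('male', 'male'), ('male', 'male'), ('female', 'male'), ('female', 'male')]
exampleResult = accuracyByLabel(examplePairs)  # {'male': 1.0, 'female': 0.0}
```

In this made-up case the overall accuracy is 0.5 even though one label is always right and the other always wrong, which is the kind of pattern a single accuracy number hides.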

Available functions:
- `length`
- `vowels`
- `hardCons`
- `countLetters`
- `lastLetter`
- `firstLetter`

Which of these given functions are the most influential? I train and test each of these feature functions one by one.

In [55]:
l_t = buildTrainingVectors(length, 'train.txt')
computeAccuracy(length, 'test.txt', l_t, 11)

daniel male male CORRECT
joseph male male CORRECT
isaac male male CORRECT
jack male male CORRECT
levi male male CORRECT
adrian male male CORRECT
brandon male male CORRECT
ian male male CORRECT
nathaniel male male CORRECT
juan male male CORRECT
max male male CORRECT
declan male male CORRECT
diego male male CORRECT
richard male male CORRECT
brian male male CORRECT
marcus male male CORRECT
theodore male male CORRECT
tucker male male CORRECT
kingston male male CORRECT
maximus male male CORRECT
devin male male CORRECT
eduardo male male CORRECT
zander male male CORRECT
chance male male CORRECT
finn male male CORRECT
erick male male CORRECT
beau male male CORRECT
johnathan male male CORRECT
dante male male CORRECT
gregory male male CORRECT
erik male male CORRECT
dawson male male CORRECT
desmond male male CORRECT
joaquin male male CORRECT
allen male male CORRECT
adan male male CORRECT
gideon male male CORRECT
dexter male male CORRECT
esteban male male CORRECT
ismael male male CORRECT
enzo male

0.5

In [56]:
v_t = buildTrainingVectors(vowels, 'train.txt')
computeAccuracy(vowels, 'test.txt', v_t, 11)

daniel male male CORRECT
joseph male male CORRECT
isaac male male CORRECT
jack male male CORRECT
levi male male CORRECT
adrian male male CORRECT
brandon male male CORRECT
ian male male CORRECT
nathaniel male male CORRECT
juan male male CORRECT
max male male CORRECT
declan male male CORRECT
diego male male CORRECT
richard male male CORRECT
brian male male CORRECT
marcus male male CORRECT
theodore male male CORRECT
tucker male male CORRECT
kingston male male CORRECT
maximus male male CORRECT
devin male male CORRECT
eduardo male male CORRECT
zander male male CORRECT
chance male male CORRECT
finn male male CORRECT
erick male male CORRECT
beau male male CORRECT
johnathan male male CORRECT
dante male male CORRECT
gregory male male CORRECT
erik male male CORRECT
dawson male male CORRECT
desmond male male CORRECT
joaquin male male CORRECT
allen male male CORRECT
adan male male CORRECT
gideon male male CORRECT
dexter male male CORRECT
esteban male male CORRECT
ismael male male CORRECT
enzo male

0.5

In [57]:
h_t = buildTrainingVectors(hardCons, 'train.txt')
computeAccuracy(hardCons, 'test.txt', h_t, 11)

daniel male male CORRECT
joseph male male CORRECT
isaac male male CORRECT
jack male male CORRECT
levi male male CORRECT
adrian male male CORRECT
brandon male male CORRECT
ian male male CORRECT
nathaniel male male CORRECT
juan male male CORRECT
max male male CORRECT
declan male male CORRECT
diego male male CORRECT
richard male male CORRECT
brian male male CORRECT
marcus male male CORRECT
theodore male male CORRECT
tucker male male CORRECT
kingston male male CORRECT
maximus male male CORRECT
devin male male CORRECT
eduardo male male CORRECT
zander male male CORRECT
chance male male CORRECT
finn male male CORRECT
erick male male CORRECT
beau male male CORRECT
johnathan male male CORRECT
dante male male CORRECT
gregory male male CORRECT
erik male male CORRECT
dawson male male CORRECT
desmond male male CORRECT
joaquin male male CORRECT
allen male male CORRECT
adan male male CORRECT
gideon male male CORRECT
dexter male male CORRECT
esteban male male CORRECT
ismael male male CORRECT
enzo male

0.5

In [58]:
c_t = buildTrainingVectors(countLetters, 'train.txt')
computeAccuracy(countLetters, 'test.txt', c_t, 11)

daniel male female WRONG
joseph male male CORRECT
isaac male female WRONG
jack male male CORRECT
levi male male CORRECT
adrian male female WRONG
brandon male male CORRECT
ian male male CORRECT
nathaniel male female WRONG
juan male male CORRECT
max male male CORRECT
declan male male CORRECT
diego male male CORRECT
richard male male CORRECT
brian male male CORRECT
marcus male male CORRECT
theodore male male CORRECT
tucker male male CORRECT
kingston male male CORRECT
maximus male male CORRECT
devin male male CORRECT
eduardo male male CORRECT
zander male male CORRECT
chance male male CORRECT
finn male male CORRECT
erick male male CORRECT
beau male male CORRECT
johnathan male male CORRECT
dante male male CORRECT
gregory male male CORRECT
erik male male CORRECT
dawson male male CORRECT
desmond male male CORRECT
joaquin male male CORRECT
allen male male CORRECT
adan male male CORRECT
gideon male male CORRECT
dexter male male CORRECT
esteban male male CORRECT
ismael male male CORRECT
enzo male

0.6935483870967742

In [59]:
la_t = buildTrainingVectors(lastLetter, 'train.txt')
computeAccuracy(lastLetter, 'test.txt', la_t, 11)

daniel male male CORRECT
joseph male male CORRECT
isaac male male CORRECT
jack male male CORRECT
levi male male CORRECT
adrian male male CORRECT
brandon male male CORRECT
ian male male CORRECT
nathaniel male male CORRECT
juan male male CORRECT
max male male CORRECT
declan male male CORRECT
diego male male CORRECT
richard male male CORRECT
brian male male CORRECT
marcus male male CORRECT
theodore male male CORRECT
tucker male male CORRECT
kingston male male CORRECT
maximus male male CORRECT
devin male male CORRECT
eduardo male male CORRECT
zander male male CORRECT
chance male male CORRECT
finn male male CORRECT
erick male male CORRECT
beau male male CORRECT
johnathan male male CORRECT
dante male male CORRECT
gregory male male CORRECT
erik male male CORRECT
dawson male male CORRECT
desmond male male CORRECT
joaquin male male CORRECT
allen male male CORRECT
adan male male CORRECT
gideon male male CORRECT
dexter male male CORRECT
esteban male male CORRECT
ismael male male CORRECT
enzo male

0.5

In [60]:
f_t = buildTrainingVectors(firstLetter, 'train.txt')
computeAccuracy(firstLetter, 'test.txt', f_t, 11)

daniel male male CORRECT
joseph male male CORRECT
isaac male male CORRECT
jack male male CORRECT
levi male male CORRECT
adrian male male CORRECT
brandon male male CORRECT
ian male male CORRECT
nathaniel male male CORRECT
juan male male CORRECT
max male male CORRECT
declan male male CORRECT
diego male male CORRECT
richard male male CORRECT
brian male male CORRECT
marcus male male CORRECT
theodore male male CORRECT
tucker male male CORRECT
kingston male male CORRECT
maximus male male CORRECT
devin male male CORRECT
eduardo male male CORRECT
zander male male CORRECT
chance male male CORRECT
finn male male CORRECT
erick male male CORRECT
beau male male CORRECT
johnathan male male CORRECT
dante male male CORRECT
gregory male male CORRECT
erik male male CORRECT
dawson male male CORRECT
desmond male male CORRECT
joaquin male male CORRECT
allen male male CORRECT
adan male male CORRECT
gideon male male CORRECT
dexter male male CORRECT
esteban male male CORRECT
ismael male male CORRECT
enzo male

0.5

`length` = 0.50<br>
`vowels` = 0.50<br>
`hardCons` = 0.50<br>
`countLetters` = 0.69<br>
`lastLetter` = 0.50<br>
`firstLetter` = 0.50

`countLetters` seems to be the most influential, with an accuracy of 0.69. This makes sense because this particular function does not distill the characteristics of each name into a single number; instead, it gives detailed information (letter by letter) about which letters are most frequently used. For example, there are many ways for both male and female names to have 3 vowels, but fewer ways for both male and female names to have a certain set of letters in common.

Out of curiosity, what would happen if I used all of these functions at once?

In [61]:
all_f = combineMany([length, vowels, hardCons, countLetters, lastLetter, firstLetter])
all_f_t = buildTrainingVectors(all_f, 'train.txt')
computeAccuracy(all_f, 'test.txt', all_f_t, 11)

daniel male male CORRECT
joseph male male CORRECT
isaac male male CORRECT
jack male male CORRECT
levi male male CORRECT
adrian male male CORRECT
brandon male male CORRECT
ian male male CORRECT
nathaniel male female WRONG
juan male male CORRECT
max male male CORRECT
declan male male CORRECT
diego male male CORRECT
richard male male CORRECT
brian male male CORRECT
marcus male male CORRECT
theodore male female WRONG
tucker male male CORRECT
kingston male male CORRECT
maximus male male CORRECT
devin male male CORRECT
eduardo male male CORRECT
zander male male CORRECT
chance male male CORRECT
finn male male CORRECT
erick male male CORRECT
beau male male CORRECT
johnathan male male CORRECT
dante male male CORRECT
gregory male male CORRECT
erik male male CORRECT
dawson male male CORRECT
desmond male male CORRECT
joaquin male male CORRECT
allen male male CORRECT
adan male male CORRECT
gideon male male CORRECT
dexter male male CORRECT
esteban male male CORRECT
ismael male male CORRECT
enzo male

0.8612903225806452

If all of the given functions are used at once, I get an accuracy of 0.86, the highest I've seen so far. Interestingly, although the functions were unimpressive individually, together they reach a much higher accuracy: 0.86, versus 0.69 for `countLetters` alone.
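
As a rough illustration of how combining features can separate names that a single feature cannot, the sketch below concatenates two feature functions and compares distances. `length_sketch`, `vowels_sketch`, `combine_sketch`, and `distance` are hypothetical names for this example; the notebook's `combineMany` may be implemented differently, and the Euclidean distance is again an assumption.

```python
from math import sqrt

def length_sketch(name):
    # Hypothetical stand-in for `length`.
    return [('length', len(name))]

def vowels_sketch(name):
    # Hypothetical stand-in for `vowels`.
    return [('vowels', sum(1 for c in name if c in 'aeiouy'))]

def combine_sketch(funcs):
    # Build a feature function whose output is the concatenation of each
    # function's (feature name, value) pairs.
    def combined(name):
        features = []
        for f in funcs:
            features.extend(f(name))
        return features
    return combined

def distance(f, a, b):
    # Euclidean distance between names a and b under feature function f.
    return sqrt(sum((x - y) ** 2 for (_, x), (_, y) in zip(f(a), f(b))))

both = combine_sketch([length_sketch, vowels_sketch])
print(both('andrea'))                          # [('length', 6), ('vowels', 3)]
print(distance(length_sketch, 'ian', 'max'))   # 0.0
print(distance(vowels_sketch, 'max', 'finn'))  # 0.0
print(distance(both, 'ian', 'max'))            # 1.0
print(distance(both, 'max', 'finn'))           # 1.0
```

`ian` and `max` tie on length, and `max` and `finn` tie on vowel count, but the combined vector separates both pairs. This only shows the mechanism; it does not by itself account for the jump to 0.86.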

What happens if I add one of my own functions, `softCons`, to this last classifier?

In [62]:
everything_f = combineMany([length, vowels, hardCons, countLetters, lastLetter, firstLetter, softCons])
everything_f_t = buildTrainingVectors(everything_f, 'train.txt')
computeAccuracy(everything_f, 'test.txt', everything_f_t, 11)

daniel male male CORRECT
joseph male male CORRECT
isaac male female WRONG
jack male male CORRECT
levi male male CORRECT
adrian male male CORRECT
brandon male male CORRECT
ian male male CORRECT
nathaniel male female WRONG
juan male male CORRECT
max male male CORRECT
declan male male CORRECT
diego male male CORRECT
richard male male CORRECT
brian male male CORRECT
marcus male male CORRECT
theodore male female WRONG
tucker male male CORRECT
kingston male male CORRECT
maximus male male CORRECT
devin male male CORRECT
eduardo male male CORRECT
zander male male CORRECT
chance male male CORRECT
finn male male CORRECT
erick male male CORRECT
beau male male CORRECT
johnathan male male CORRECT
dante male male CORRECT
gregory male male CORRECT
erik male male CORRECT
dawson male male CORRECT
desmond male male CORRECT
joaquin male male CORRECT
allen male male CORRECT
adan male male CORRECT
gideon male male CORRECT
dexter male male CORRECT
esteban male male CORRECT
ismael male male CORRECT
enzo male

0.8451612903225807

Unfortunately, the accuracy drops slightly, from 0.861 to 0.845. What about my two other functions?

In [63]:
e1_f = combineMany([length, vowels, hardCons, countLetters, lastLetter, firstLetter, softened_consonant])
e1_f_t = buildTrainingVectors(e1_f, 'train.txt')
computeAccuracy(e1_f, 'test.txt', e1_f_t, 11)

daniel male male CORRECT
joseph male male CORRECT
isaac male male CORRECT
jack male male CORRECT
levi male male CORRECT
adrian male male CORRECT
brandon male male CORRECT
ian male male CORRECT
nathaniel male female WRONG
juan male male CORRECT
max male male CORRECT
declan male male CORRECT
diego male male CORRECT
richard male male CORRECT
brian male male CORRECT
marcus male male CORRECT
theodore male female WRONG
tucker male male CORRECT
kingston male male CORRECT
maximus male male CORRECT
devin male male CORRECT
eduardo male male CORRECT
zander male male CORRECT
chance male female WRONG
finn male male CORRECT
erick male male CORRECT
beau male male CORRECT
johnathan male male CORRECT
dante male male CORRECT
gregory male male CORRECT
erik male male CORRECT
dawson male male CORRECT
desmond male male CORRECT
joaquin male male CORRECT
allen male male CORRECT
adan male male CORRECT
gideon male male CORRECT
dexter male male CORRECT
esteban male male CORRECT
ismael male male CORRECT
enzo male

0.832258064516129

In [64]:
e2_f = combineMany([length, vowels, hardCons, countLetters, lastLetter, firstLetter, consonants])
e2_f_t = buildTrainingVectors(e2_f, 'train.txt')
computeAccuracy(e2_f, 'test.txt', e2_f_t, 11)

daniel male male CORRECT
joseph male male CORRECT
isaac male female WRONG
jack male male CORRECT
levi male male CORRECT
adrian male male CORRECT
brandon male male CORRECT
ian male male CORRECT
nathaniel male female WRONG
juan male male CORRECT
max male male CORRECT
declan male male CORRECT
diego male male CORRECT
richard male male CORRECT
brian male male CORRECT
marcus male male CORRECT
theodore male male CORRECT
tucker male male CORRECT
kingston male male CORRECT
maximus male male CORRECT
devin male male CORRECT
eduardo male male CORRECT
zander male male CORRECT
chance male male CORRECT
finn male male CORRECT
erick male male CORRECT
beau male male CORRECT
johnathan male male CORRECT
dante male male CORRECT
gregory male male CORRECT
erik male male CORRECT
dawson male male CORRECT
desmond male male CORRECT
joaquin male male CORRECT
allen male male CORRECT
adan male male CORRECT
gideon male male CORRECT
dexter male male CORRECT
esteban male male CORRECT
ismael male male CORRECT
enzo male

0.8354838709677419

All given functions + `softened_consonant` = 0.832<br>
All given functions + `consonants` = 0.835

`consonants` gives slightly better results than `softened_consonant`, although both fall below the 0.861 of the given functions alone. So how about all given functions + `softCons` + `consonants`?

In [65]:
e3_f = combineMany([length, vowels, hardCons, countLetters, lastLetter, firstLetter, softCons, consonants])
e3_f_t = buildTrainingVectors(e3_f, 'train.txt')
computeAccuracy(e3_f, 'test.txt', e3_f_t, 11)

daniel male male CORRECT
joseph male male CORRECT
isaac male female WRONG
jack male male CORRECT
levi male male CORRECT
adrian male male CORRECT
brandon male male CORRECT
ian male male CORRECT
nathaniel male female WRONG
juan male male CORRECT
max male male CORRECT
declan male male CORRECT
diego male male CORRECT
richard male male CORRECT
brian male male CORRECT
marcus male male CORRECT
theodore male female WRONG
tucker male male CORRECT
kingston male male CORRECT
maximus male male CORRECT
devin male male CORRECT
eduardo male male CORRECT
zander male male CORRECT
chance male female WRONG
finn male male CORRECT
erick male male CORRECT
beau male male CORRECT
johnathan male male CORRECT
dante male male CORRECT
gregory male male CORRECT
erik male male CORRECT
dawson male male CORRECT
desmond male male CORRECT
joaquin male male CORRECT
allen male male CORRECT
adan male male CORRECT
gideon male male CORRECT
dexter male male CORRECT
esteban male male CORRECT
ismael male male CORRECT
enzo male

0.8419354838709677

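I get 0.842, which is worse than all given functions + `softCons` (0.845) and worse than the given functions alone (0.861). None of my added functions improved on the given set, so I guess name classification gets tricky past accuracies of about 0.85.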
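
For a sense of scale, the reported accuracies can be converted back into counts of correctly classified names. This is a sketch using the standard-library `fractions` module. It assumes the test set has at most 1000 names, which is a guess and not something stated above, so the result is only the smallest test-set size consistent with the printed values.

```python
from fractions import Fraction

# Accuracies exactly as printed by computeAccuracy in the cells above.
accuracies = [
    ('all given', 0.8612903225806452),
    ('+ softCons', 0.8451612903225807),
    ('+ softened_consonant', 0.832258064516129),
    ('+ consonants', 0.8354838709677419),
    ('+ softCons + consonants', 0.8419354838709677),
]

# Assumption (not from the notebook): the test set has at most 1000 names.
MAX_TEST_NAMES = 1000

def gcd(a, b):
    # Greatest common divisor, used to build a least common multiple.
    while b:
        a, b = b, a % b
    return a

# Recover each accuracy as a reduced fraction correct/total.
as_fractions = [(label, Fraction(acc).limit_denominator(MAX_TEST_NAMES))
                for label, acc in accuracies]

# Smallest test-set size consistent with every accuracy: LCM of denominators.
size = 1
for _, frac in as_fractions:
    size = size * frac.denominator // gcd(size, frac.denominator)

counts = [frac.numerator * size // frac.denominator for _, frac in as_fractions]
for (label, _), correct in zip(as_fractions, counts):
    print('%s: %d of %d' % (label, correct, size))
```

Under that assumption the smallest consistent test set has 310 names, and the gap between the best run (267 correct) and the `softCons` run (262 correct) is five names, so the differences between these classifiers are small.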