Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your collaborators below:

In [1]:
COLLABORATORS = ""

---

In [2]:
Collaborators = ""

In [3]:
import numpy as np

In this problem, you will write code that implements the prototype theory of categorization using Shepard's universal law to calculate the similarity between pairs of stimuli. 

## Part A (0.5 points)

Recall that the features of the prototype should be determined by a "majority rule" vote among the members of the category: if **half or more** of the members in a category have a particular feature, the category prototype should have that feature. Otherwise, it shouldn't. For example, let's again say we have features for some fruits, as in Problem 4:

|            | Sweet | Sour  | Bitter | Salty | Seeds |
|:-----------|:-----:|:-----:|:------:|:-----:|:-----:|
| Apple      | 1     | 0     | 0      | 0     | 1     |
| Orange     | 1     | 1     | 0      | 0     | 1     |
| Lemon      | 0     | 1     | 1      | 0     | 1     |
| Grapefruit | 1     | 1     | 1      | 0     | 1     |
| Banana     | 1     | 0     | 0      | 0     | 0     |
| Tomato     | 1     | 0     | 0      | 0     | 1     |

As a NumPy array, the features would look like this:

In [4]:
fruit_features = np.array([
    [True,  False, False, False, True ],
    [True,  True,  False, False, True ],
    [False, True,  True,  False, True ],
    [True,  True,  True,  False, True ],
    [True,  False, False, False, False],
    [True,  False, False, False, True ]])

The prototype for these features would then be:

| Sweet | Sour | Bitter | Salty | Seeds | 
|:-----:|:----:|:------:|:-----:|:-----:|
| 1     | 1    | 0      | 0     | 1     |

because 5/6 fruits have the "sweet" feature, 3/6 fruits have the "sour" feature, 2/6 fruits have the "bitter" feature, 0/6 fruits have the "salty" feature, and 5/6 fruits have the "seeds" feature. As a NumPy array, this would look like:

In [5]:
fruit_prototype = np.array([True, True, False, False, True])
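
As a rough sketch of the majority-rule vote (one possible approach, not necessarily the intended solution), the fraction of members with each feature is the column mean of the boolean array, which can then be compared against one half. The names `fractions` and `majority` are illustrative only.

```python
import numpy as np

# Fruit features from the table above (rows: fruits, columns: features)
fruit_features = np.array([
    [True,  False, False, False, True ],
    [True,  True,  False, False, True ],
    [False, True,  True,  False, True ],
    [True,  True,  True,  False, True ],
    [True,  False, False, False, False],
    [True,  False, False, False, True ]])

# Fraction of category members that have each feature (True counts as 1)
fractions = fruit_features.mean(axis=0)

# A feature belongs to the prototype when at least half the members have it
majority = fractions >= 0.5
print(fractions)
print(majority)
```

Note that "sour" sits exactly at one half (3/6), so the comparison has to be `>=` rather than `>` for it to be included.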

<div class="alert alert-success">Complete the function `prototype` so that it takes an $n\times m$ array of category members by features, and returns the features corresponding to the prototype of that category.</div>

In [6]:
def prototype(features):
    """
    Compute the prototype features, based on the given features of
    category members. The prototype should have a feature if half or
    more of the category members have that feature.

    Hint: this function is similar to the `threshold` function from
    Problem Set 0.
    
    Your solution can be done in 1 line of code (including the return 
    statement).
    
    Parameters
    ----------
    features : boolean numpy array with shape (n, m)
        The first dimension corresponds to n category members, and the
        second dimension to m features.
    
    Returns
    -------
    boolean numpy array with shape (m,) corresponding to the features
    of the prototype of the category members
    
    """
    # a feature is in the prototype when at least half the members have it
    return np.mean(features, axis=0) >= 0.5

Test your function on the fruit features, to see if it gives the right prototype:

In [7]:
print("Actual prototype:   " + str(fruit_prototype))
print("Computed prototype: " + str(prototype(fruit_features)))

Actual prototype:   [ True  True False False  True]
Computed prototype: [ True  True False False  True]


In [8]:
# add your own test cases here!


In [9]:
"""Test the prototype function."""
from nose.tools import assert_equal
from numpy.testing import assert_array_equal

# make sure they get features that are half or more
assert_array_equal(prototype(np.array([[0, 1], [0, 0]], dtype=bool)), np.array([0, 1]))

for i in range(10):
    # create a random array of features
    n, m = np.random.randint(10, 100, 2)
    features = np.random.randint(0, 2, (n, m)).astype(bool)
    
    # compute the prototype
    proto = prototype(features)
    
    # check the shape and type
    assert_equal(proto.shape, (m,), "incorrect shape for the prototype array")
    assert_equal(proto.dtype, np.bool_, "prototype is not a boolean array")
    
    # check that the prototype is correct
    for j in range(m):
        count = features[:, j].sum()
        if count >= (n / 2) and not proto[j]:
            raise AssertionError("prototype should have feature {}, but it doesn't".format(j))
        elif count < (n / 2) and proto[j]:
            raise AssertionError("prototype should NOT have feature {}, but it does".format(j))

print("Success!")

Success!


----

## Part B (0.5 points)

According to **Shepard's Universal Law of Generalization**, the similarity between two feature vectors is defined as follows:

> For binary feature vectors ${\bf a}$ and ${\bf b}$, define a function $d:\{0,1\}^n \times \{0,1\}^n \rightarrow \mathbb{Z}$ such that $d({\bf a},{\bf b})$ is the number of features (positions) by which ${\bf a}$ and ${\bf b}$ differ. The similarity between ${\bf a}$ and ${\bf b}$ may be calculated as $s({\bf a},{\bf b}) = e^{-d({\bf a},{\bf b})}$.

Because we are using binary feature representations, note that $d({\bf a},{\bf b})$ corresponds to the Hamming distance between ${\bf a}$ and ${\bf b}$.

Returning to our fruits example from earlier, let's take the example of computing the similarity between grapefruits and bananas:

|            | Sweet | Sour  | Bitter | Salty | Seeds |
|:-----------|:-----:|:-----:|:------:|:-----:|:-----:|
| Grapefruit | 1     | 1     | 1      | 0     | 1     |
| Banana     | 1     | 0     | 0      | 0     | 0     |

So according to Shepard's Universal Law of Generalization, we want to look at the number of places where the two feature vectors are different. In this case, there are three positions in which grapefruit and banana differ: on the "sour", "bitter", and "seeds" features. So, $d(\textbf{grapefruit},\textbf{banana})=3$. Thus, the similarity would be:

$$
\begin{align*}
s(\mathbf{grapefruit},\mathbf{banana})&=e^{-d(\mathbf{grapefruit},\mathbf{banana})}\\
&=e^{-3}\\
&=0.049787068367863944
\end{align*}
$$
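
The same arithmetic can be checked with plain Python, as a small sketch. The variable names are illustrative, and the feature order follows the table above.

```python
import math

# Feature order: sweet, sour, bitter, salty, seeds
grapefruit = [1, 1, 1, 0, 1]
banana     = [1, 0, 0, 0, 0]

# Hamming distance: count the positions where the two vectors disagree
distance = sum(g != b for g, b in zip(grapefruit, banana))

# Shepard similarity decays exponentially with distance
similarity = math.exp(-distance)
print(distance, similarity)
```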

<div class="alert alert-success">Complete the function `shepard_sim`, which takes as input two binary feature vectors of length $m$, `a` and `b`, and returns a value between 0 and 1 representing the similarity between the two inputs.</div>

**Hint:** Try looking up the logical operator `np.logical_xor` and thinking about how you could use it to complete `shepard_sim`. You can also use the `^` operator, which works similarly to `&` and `|` (recall that you used these in the truth tables problem of Problem Set 0).
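
As a small illustration of the hint, on boolean NumPy arrays `np.logical_xor` and `^` both mark the positions where exactly one input is `True`. The arrays `x` and `y` below are made up for the example.

```python
import numpy as np

x = np.array([True, True, False, False])
y = np.array([True, False, True, False])

# Both forms are True exactly where x and y disagree
via_function = np.logical_xor(x, y)
via_operator = x ^ y

# Summing a boolean array counts its True entries
n_different = int(via_function.sum())
print(via_function, via_operator, n_different)
```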

In [10]:
np.logical_xor?

In [11]:
def shepard_sim(a, b):
    """
    Computes the similarity between binary feature vectors a and b, using
    Shepard's law of generalization:
    
    S(a, b) = e^(-d(a, b))
    
    where d(a, b) corresponds to the number of locations where a and b differ
    (i.e., a=1 and b=0, or a=0 and b=1).
    
    Hint: your answer can be done in 1 line of code, including the return
    statement.
    
    Parameters
    ----------
    a, b : boolean numpy array with shape (m,)
    
    Returns
    -------
    similarity between a and b
    
    """
    # np.math is not available in recent NumPy releases; use np.exp and np.sum instead
    return float(np.exp(-np.sum(np.logical_xor(a, b))))

Test your function on the grapefruit and banana feature vectors:

In [12]:
grapefruit_features = np.array([True,  True,  True, False, True ])
banana_features  = np.array([True, False,  False,  False, False ])

shepard_sim(grapefruit_features, banana_features)

0.049787068367863944

In [13]:
# add your own test cases here!


In [14]:
"""Test the shepard_sim function."""
from numpy.testing import assert_almost_equal

a = np.array([ True, False,  True,  True,  True,  True], dtype=bool)
b = np.array([ True,  True,  True, False, False, False], dtype=bool)
assert_almost_equal(shepard_sim(a, b), 0.01831563888873418)

a = np.array([False,  True,  True,  True,  True, False,  True,  True,  True,
       False, False, False, False, False, False,  True], dtype=bool)
b = np.array([False,  True, False,  True, False, False, False, False,  True,
       False,  True,  True, False, False, False,  True], dtype=bool)
assert_almost_equal(shepard_sim(a, b), 0.0024787521766663585)

a = np.array([ True,  True, False,  True], dtype=bool)
b = np.array([ True,  True,  True, False], dtype=bool)
assert_almost_equal(shepard_sim(a, b), 0.1353352832366127)

a = np.array([ True], dtype=bool)
b = np.array([ True], dtype=bool)
assert_almost_equal(shepard_sim(a, b), 1.0)

a = np.array([False,  True,  True, False,  True,  True,  True], dtype=bool)
b = np.array([False,  True, False, False, False,  True,  True], dtype=bool)
assert_almost_equal(shepard_sim(a, b), 0.1353352832366127)

a = np.array([False, False], dtype=bool)
b = np.array([False, False], dtype=bool)
assert_almost_equal(shepard_sim(a, b), 1.0)

a = np.array([ True,  True,  True, False,  True, False,  True,  True,  True,
       False, False,  True,  True, False,  True], dtype=bool)
b = np.array([ True, False,  True,  True, False,  True, False, False,  True,
       False,  True, False,  True,  True, False], dtype=bool)
assert_almost_equal(shepard_sim(a, b), 4.5399929762484854e-05)

a = np.array([False,  True, False], dtype=bool)
b = np.array([ True,  True, False], dtype=bool)
assert_almost_equal(shepard_sim(a, b), 0.36787944117144233)

a = np.array([False,  True,  True,  True,  True,  True,  True], dtype=bool)
b = np.array([False, False,  True,  True,  True, False,  True], dtype=bool)
assert_almost_equal(shepard_sim(a, b), 0.1353352832366127)

a = np.array([ True, False, False,  True,  True,  True], dtype=bool)
b = np.array([False,  True, False, False, False, False], dtype=bool)
assert_almost_equal(shepard_sim(a, b), 0.006737946999085467)

print("Success!")

Success!


---

## Part C (1 point)

Now that we have a way of computing prototypes and a way of quantifying similarity, let's revisit our animal dataset from Problem 4.

First, let's load our data in. For convenience, we are going to convert the animal and feature names to lists:

In [15]:
data = np.load("data/50animals.npz")

# create variables out of the arrays
animal_features = data['animal_features']
feature_names = list(data['feature_names'])
animal_names = list(data['animal_names'])

Recall that `animal_features` corresponds to a $50\times 85$ boolean array of features:

In [16]:
animal_features

array([[ True,  True, False, ..., False,  True, False],
       [ True,  True, False, ..., False,  True,  True],
       [ True, False,  True, ..., False,  True, False],
       ..., 
       [ True,  True, False, ...,  True,  True, False],
       [ True,  True, False, ...,  True,  True, False],
       [ True,  True, False, ...,  True,  True, False]], dtype=bool)

And that `feature_names` corresponds to a list of length 85 of the feature names (only the first 10 are shown here, because the list is fairly long -- though feel free to take a look at the whole list if you want to!):

In [17]:
feature_names[:10]

['active',
 'agile',
 'aquatic',
 'arctic habitat',
 'big',
 'bipedal',
 'black',
 'blue',
 'brown',
 'builds nests']

And that `animal_names` corresponds to a list of length 50 of the animal names (again, only showing the first 10 here, because the list is long):

In [18]:
animal_names[:10]

['antelope',
 'bat',
 'beaver',
 'blue whale',
 'bobcat',
 'buffalo',
 'chihuahua',
 'chimpanzee',
 'collie',
 'cow']

<div class="alert alert-success">Complete the function `find_feature_prototype` to take the name of a feature and find the **prototype** of the animals that have that feature, using your function `prototype`.</div>
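
As a reminder of how boolean indexing selects rows, here is a minimal sketch on a made-up 4x3 array. The names `toy_features` and `toy_names`, and the feature names in them, are hypothetical and not part of the animal data.

```python
import numpy as np

toy_names = ["furry", "swims", "has tail"]
toy_features = np.array([
    [True,  False, True ],
    [False, True,  True ],
    [True,  True,  False],
    [False, False, True ]])

# Column of the feature of interest, used as a row mask
col = toy_names.index("swims")
mask = toy_features[:, col]

# Keep only the rows (category members) where the mask is True
members = toy_features[mask]
print(members.shape)
```

Only the second and third rows have the "swims" feature, so `members` keeps those two rows and all three columns.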

In [19]:
def find_feature_prototype(name, features, feature_names):
    """
    Computes the prototype of all animals with a given feature.
    
    Hint: your solution can be done in 4 lines of code, including the
    return statement.
    
    You should be using boolean indexing in your answer -- refer back to
    Problem Set 0 if you forget how to do this!
    
    Parameters
    ----------
    name : string
        the name of a feature
    features : boolean numpy array
        animals by features, with shape (n, m)
    feature_names : list of strings
        list of feature names with length m
    
    Returns
    -------
    boolean numpy array of the prototype's features, with shape (m,)
    
    """
    feat_index = feature_names.index(name)
    animals = features[:,feat_index] == True
    return prototype(features[animals,:])

Try running your function on a few different features, and see what the features are for the prototype:

In [20]:
claws = find_feature_prototype('claws', animal_features, feature_names)
print("The 'claws' prototype has the following features:")
print(np.array(feature_names)[claws])

The 'claws' prototype has the following features:
['active' 'agile' 'black' 'brown' 'claws' 'eats red meat' 'eats vegetation'
 'fast' 'flightless' 'forager' 'forest dweller' 'furry' 'gray' 'has paws'
 'has tail' 'hunter' 'intelligent' 'lean' 'nocturnal' 'quadrapedal'
 'scavenger' 'small' 'solitary' 'timid' 'white' 'wild']


In [21]:
domesticated_prototype = find_feature_prototype('domesticated', animal_features, feature_names)
print("The 'domesticated' prototype has the following features:")
print(np.array(feature_names)[domesticated_prototype])

The 'domesticated' prototype has the following features:
['active' 'agile' 'black' 'brown' 'claws' 'domesticated' 'eats red meat'
 'eats vegetation' 'fast' 'field dweller' 'flightless' 'furry' 'gray'
 'grazer' 'has long legs' 'has paws' 'has tail' 'intelligent' 'lean'
 'patches' 'quadrapedal' 'small' 'social' 'spots' 'swims' 'timid' 'white']


In [22]:
# add your own test cases here!

In [23]:
"""Test the find_feature_prototype function."""
from numpy.testing import assert_array_equal

# load the animal data
data = np.load("data/50animals.npz")
af = data['animal_features']
fn = list(data['feature_names'])
data.close()

# check the coastal prototype
coastal_prototype = find_feature_prototype('coastal habitat', af, fn)
assert_array_equal(coastal_prototype, np.array([ True,  True,  True, False,  True, False,  True, False,  True,
       False, False, False, False, False, False, False,  True, False,
       False, False,  True,  True, False, False, False,  True,  True,
       False, False, False,  True,  True, False,  True,  True, False,
        True,  True,  True, False, False, False, False, False, False,
        True,  True,  True, False, False,  True, False,  True, False,
        True, False, False, False, False, False, False, False,  True,
       False, False, False,  True,  True, False, False, False, False,
        True,  True, False, False, False, False,  True, False], dtype=bool))

for i in range(20):
    # create a random feature array, with some generic feature names
    n, m = np.random.randint(10, 100, 2)
    features = np.random.randint(0, 2, (n, m)).astype(bool)
    names = ["feature_{}".format(j) for j in range(m)]
    
    # check that the prototype is correct
    j = np.random.randint(0, m)
    true_proto = prototype(np.array([f for f in features if f[j]]))
    proto = find_feature_prototype('feature_{}'.format(j), features, names)
    assert_array_equal(proto, true_proto)

# check that the function uses prototype
old_prototype = prototype
del prototype
try:
    find_feature_prototype('tusks', af, fn)
except NameError:
    pass
else:
    raise AssertionError("find_feature_prototype does not call the prototype function")
finally:
    prototype = old_prototype
    del old_prototype

print("Success!")

Success!


---

## Part D (1 point)

<div class="alert alert-success">Now, using both your `find_feature_prototype` function and your `shepard_sim` function, complete the function `find_similar_animals` to find the **five most similar animals** to the prototype.</div>

Note: Just like in Problem 4, the `np.argsort()` function can come in handy here (take a look at Problem Set 0 if you forget how it's used). To keep ties in the original order, make sure to use mergesort (which is [stable](http://programmers.stackexchange.com/a/247441)) like so:

```
indices = np.argsort(array, kind='mergesort')
```
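
To see what stability means for ties, consider this sketch with made-up scores: a stable ascending sort keeps tied entries in their original order, so reversing the result lists them in reverse order.

```python
import numpy as np

scores = np.array([0.5, 0.9, 0.5, 0.1])

# Stable ascending sort: the tied 0.5 entries stay in index order (0 before 2)
ascending = np.argsort(scores, kind='mergesort')

# Reversing gives descending order, with the tied entries now reversed (2 before 0)
descending = ascending[::-1]
print(ascending, descending)
```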

In [24]:
def find_similar_animals(name, features, feature_names, animal_names):
    """
    Finds the five most similar animals to the prototype for the given feature.
    You should return the animals in order from most similar to least similar
    to the prototype. 
    
    If two animals have the same similarity score, find_similar_animals 
    should break ties in the REVERSE of the order they appear in animal_names 
    (e.g., if the first two entries in animal_names are A and B, and both animals 
    A and B have the same similarity to target animal C, find_similar_animals should 
    place B BEFORE A when ranking them in terms of their similarity to C.)
    
    Hint: your solution can be done in 4 lines of code, including the return
    statement.
    
    Parameters
    ----------
    name : string
        the name of a feature
    features : boolean numpy array
        animals by features, with shape (n, m)
    feature_names : list of strings
        list of feature names with length m
    animal_names : list of strings
        list of animal names with length n
    
    Returns
    -------
    a list of five animal names
    
    """
    # use `proto` rather than `prototype` to avoid shadowing the prototype function
    proto = find_feature_prototype(name, features, feature_names)

    # similarity of every animal to the prototype
    sims = [shepard_sim(proto, feat) for feat in features]

    # stable ascending sort, then take the last five in reverse order:
    # most similar first, with ties in the reverse of their original order
    top_five = np.argsort(sims, kind='mergesort')[-5:][::-1]
    return [animal_names[i] for i in top_five]

Look and see what the five most similar animals are for the "claws" and "domesticated" prototypes you calculated above:

In [25]:
print("The most similar animals to the claws prototype are:")
print(find_similar_animals('claws', animal_features, feature_names, animal_names))

The most similar animals to the claws prototype are:
['siamese cat', 'weasel', 'persian cat', 'wolf', 'raccoon']


In [26]:
print("The most similar animals to the domesticated prototype are:")
print(find_similar_animals('domesticated', animal_features, feature_names, animal_names))

The most similar animals to the domesticated prototype are:
['chihuahua', 'dalmatian', 'collie', 'persian cat', 'horse']


In [27]:
# add your own test cases here!

In [28]:
"""Test the find_similar_animals function."""
from numpy.testing import assert_array_equal

# load the animal data
data = np.load("data/50animals.npz")
af = data['animal_features']
fn = list(data['feature_names'])
an = list(data['animal_names'])
data.close()

# check the coastal animals
assert_equal(
    find_similar_animals('coastal habitat', af, fn, an), 
    ['dolphin', 'beaver', 'seal', 'otter', 'killer whale'])

# check the tunnels animals
assert_equal(
    find_similar_animals('digs tunnels', af, fn, an),
    ['weasel', 'mouse', 'rat', 'rabbit', 'fox'])

# check the tusks animals 
assert_equal(
    find_similar_animals('tusks', af, fn, an),
    ['elephant', 'walrus', 'hippopotamus', 'ox', 'rhinoceros'])

# check the aquatic animals with a different feature array
assert_equal(
    find_similar_animals('aquatic', af[:25, :40], fn[:40], an[:25]),
    ['humpback whale', 'dolphin', 'killer whale', 'blue whale', 'elephant'])

# check that the function uses find_feature_prototype
old_find_feature_prototype = find_feature_prototype
del find_feature_prototype
try:
    find_similar_animals('coastal habitat', af, fn, an)
except NameError:
    pass
else:
    raise AssertionError("find_similar_animals does not call the find_feature_prototype function")
finally:
    find_feature_prototype = old_find_feature_prototype
    del old_find_feature_prototype

# check that the function uses shepard_sim
old_shepard_sim = shepard_sim
del shepard_sim
try:
    find_similar_animals('coastal habitat', af, fn, an)
except NameError:
    pass
else:
    raise AssertionError("find_similar_animals does not call the shepard_sim function")
finally:
    shepard_sim = old_shepard_sim
    del old_shepard_sim

print("Success!")

Success!


---

## Part E (1.25 points)

Run your `find_feature_prototype` function and see what features the 'field dweller' prototype has:

In [29]:
field_dweller_prototype = find_feature_prototype('field dweller', animal_features, feature_names)
print("The 'field dweller' prototype has features:")
print(np.array(feature_names)[field_dweller_prototype])

The 'field dweller' prototype has features:
['active' 'agile' 'black' 'brown' 'eats fruit' 'eats vegetation' 'fast'
 'field dweller' 'flightless' 'forager' 'furry' 'gray' 'grazer'
 'has hooves' 'has tail' 'herbivore' 'intelligent' 'plains dweller'
 'quadrapedal' 'social' 'strong' 'timid' 'white' 'wild']


Now run your function `find_similar_animals` for the input 'field dweller':

In [30]:
find_similar_animals('field dweller', animal_features, feature_names, animal_names)

['zebra', 'horse', 'antelope', 'sheep', 'ox']

<div class="alert alert-success">What are the five most similar animals it returns to the prototype of the 'field dweller' animals? (**0.5 points**) </div>

For the 'field dweller' prototype, the five most similar animals are: zebra, horse, antelope, sheep, and ox. 

<div class="alert alert-success">
Do you agree that these animals are similar to the prototype? Do they match your intuitions for the *most* similar animals to the prototype (that is, if you were to intuitively pick out the five animals most similar to the 'field dweller' prototype, would you pick those five in that order)? (**0.25 points**) </div>

Yes, I agree that these animals are similar to the prototype. They match my intuitions for the most similar animals to the prototype; however, I would order them as zebra, antelope, ox, horse, sheep instead.

<div class="alert alert-success"> If you answered "yes" above, explain what it is either about Shepard's law, or about prototypes, that makes this a good similarity metric. If "no", what about Shepard's law or prototypes causes your intuitions to be violated? (**0.5 points**) </div>

I believe this model is a good similarity metric because, under Shepard's Law of Generalization, similarity depends on the number of positions where the feature vectors *differ*, not on the number they share. Even though it uses binary features for categorization (like Tversky's model), I think a prototype model based on differences rather than shared features is a stronger way to judge similarity.

---

Before turning this problem in, remember to do the following steps:

1. **Restart the kernel** (Kernel$\rightarrow$Restart)
2. **Run all cells** (Cell$\rightarrow$Run All)
3. **Save** (File$\rightarrow$Save and Checkpoint)

<div class="alert alert-danger">After you have completed these three steps, ensure that the following cell has printed "No errors". If it has <b>not</b> printed "No errors", then your code has a bug in it and has thrown an error! Make sure you fix this error before turning in your problem set.</div>

In [31]:
print("No errors!")

No errors!
