# Assignment 2: Prototype and exemplar models

**Please do not consult external resources other than the readings for this assignment. Prototype and exemplar models have been extended beyond the implementations of the models presented in this homework. We want you to think about the implications of the representations posited by these models and what they mean for providing explanations for human behavior.**

Make sure you have done the required readings for this homework (which were also required readings for class):

* Murphy, G. L. (2002). The big book of concepts. Cambridge, MA: MIT
  Press [chapters 2 & 3]


## Data representation

Both models in this assignment represent observations from the
world as a collection of attributes, each of which has a corresponding
value. For example, a ball can have a radius attribute and a weight
attribute with values 12 cm (4.72 inches) and 500 g (1.1 lbs) respectively.

We can have many examples of these items to learn from. We can
represent these items as rows of a matrix, and attributes as columns
of a matrix. For example, if we had 3 balls with the following values:

| Ball # | Radius (cm) | Weight (g) |
|--------|-------------|------------|
|      0 |           5 |        200 |
|      1 |           3 |        400 |
|      2 |          20 |       1000 |


We can represent this information in a matrix. The ball number is
implicitly encoded as the index of the row, and each column contains
the values of one attribute. The name of each attribute can be looked
up from the index of its column. The matrix looks like this:

$$
\begin{bmatrix}
5 & 200 \\
3 & 400 \\
20 & 1000 
\end{bmatrix}
$$
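As a minimal sketch, the same matrix can be written as a numpy array. The `attribute_names` list is a hypothetical helper (it is not part of the assignment) that shows how a column index can be mapped back to an attribute name.

```python
import numpy as np

# Rows are balls (row index = Ball #); columns are attributes.
balls = np.array([[5, 200],
                  [3, 400],
                  [20, 1000]])

# Hypothetical lookup from column index to attribute name.
attribute_names = ["radius_cm", "weight_g"]

ball_1 = balls[1]       # every attribute of ball 1
weights = balls[:, 1]   # the weight column for every ball

print(ball_1)                       # [  3 400]
print(attribute_names[1], weights)  # weight_g [ 200  400 1000]
```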

## Models

In this assignment, we will be replicating the modeling results from Medin and Schaffer (1978).

[Medin, D. L., & Schaffer, M. M. (1978). Context theory of
classification learning. Psychological Review, 85(3), 207–238.
https://doi.org/10.1037/0033-295X.85.3.207](https://groups.psych.northwestern.edu/medin/documents/MedinSchaffer1978PsychRev.pdf)

Here are the stimuli in the paper:

![](static/medin-and-schaffer-1978-stimuli.png)

Let's tabulate the stimuli from the paper into a matrix. The wonders of small data!

In [24]:
import numpy as np

raw_stimuli = [1, 1, 1, 0, 0,
               1, 0, 1, 0, 0,
               1, 0, 1, 1, 0,
               1, 1, 0, 1, 0,
               0, 1, 1, 1, 0,
               1, 1, 0, 0, 1,
               0, 1, 1, 0, 1,
               0, 0, 0, 1, 1,
               0, 0, 0, 0, 1]

raw_stimuli = np.array(raw_stimuli).reshape(9, -1)
raw_stimuli

array([[1, 1, 1, 0, 0],
       [1, 0, 1, 0, 0],
       [1, 0, 1, 1, 0],
       [1, 1, 0, 1, 0],
       [0, 1, 1, 1, 0],
       [1, 1, 0, 0, 1],
       [0, 1, 1, 0, 1],
       [0, 0, 0, 1, 1],
       [0, 0, 0, 0, 1]])

Note that the rows represent the various examples in
the training set. The first four columns are the features from the
paper, and the last column represents the category.

We are representing category $A$ using $0$ and category $B$ using $1$.
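The layout described above can be sketched as follows: the last column is split off as the labels and the remaining columns are kept as the features. The names `features`, `labels`, and `category_names` are illustrative assumptions, not names used elsewhere in the assignment.

```python
import numpy as np

raw_stimuli = np.array([[1, 1, 1, 0, 0],
                        [1, 0, 1, 0, 0],
                        [1, 0, 1, 1, 0],
                        [1, 1, 0, 1, 0],
                        [0, 1, 1, 1, 0],
                        [1, 1, 0, 0, 1],
                        [0, 1, 1, 0, 1],
                        [0, 0, 0, 1, 1],
                        [0, 0, 0, 0, 1]])

features = raw_stimuli[:, :-1]   # first four columns: the attributes
labels = raw_stimuli[:, -1]      # last column: 0 for A, 1 for B

# Hypothetical mapping from the numeric code to the category letter.
category_names = {0: "A", 1: "B"}

categories, counts = np.unique(labels, return_counts=True)
for category, count in zip(categories, counts):
    print(category_names[int(category)], count)   # A 5, then B 4
```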

## Prototype model

The prototype model posits that humans generate a prototype of the
various categories they encounter in the world, and use this prototype
to determine the category of an observation in the world.

These models are supervised, i.e. they need labelled data to function.
Our data is split into two categories, $A$ and $B$.

In this assignment, we will look at a simple prototype model that
assumes that people keep track of the mean values of attributes of a
category.

**Implement the `get_prototype()` function**

This function should take the dataset and the category for which we want to build a
prototype, and return the prototype for that category (the mean value of each attribute over the examples in that category).

Specifically, the return value for the items in the dataset for this assignment
should be a numpy vector of length 4.

Make sure the category column is not included in the prototype.


In [64]:
def get_prototype(dataset, category):
    # dataset: numpy matrix whose last column holds the category label
    # category: number
    # Keep only the rows that belong to the requested category.
    category_rows = dataset[dataset[:, -1] == category]
    # Average each attribute, leaving out the category column.
    return np.mean(category_rows[:, :-1], axis=0)

Let's display the prototypes of the categories

In [65]:
ALL_CATEGORIES = np.arange(0, np.max(raw_stimuli[:, 4]) + 1)
for category in ALL_CATEGORIES:
    prototype = get_prototype(raw_stimuli, category)
    print(f"Category: {category}, Prototype: {prototype}")

Category: 0, Prototype: [0.8 0.6 0.8 0.6]
Category: 1, Prototype: [0.25 0.5  0.25 0.25]


To infer a category from an example in the wild, we will need a
function that calculates the distance from a set of stimuli to a prototype.
This prototype model will use the Euclidean distance measure, which is

$$
d_{iP} = \sqrt{\sum_{k = 1}^{N} (x_{ik} - P_k)^2}
$$

where $i$ is the index of the item, $P$ represents a specific
prototype, $k$ represents an attribute, and $N$ is the total number of
attributes.
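As a worked example of this formula, the sketch below computes the distance from the first stimulus, `[1, 1, 1, 0]`, to the category $A$ prototype printed earlier. It is only a check of the arithmetic for a single item, assuming the prototype values shown above.

```python
import numpy as np

x = np.array([1, 1, 1, 0])                    # first training stimulus
prototype_a = np.array([0.8, 0.6, 0.8, 0.6])  # category A prototype from above

diff = x - prototype_a                        # [0.2, 0.4, 0.2, -0.6]
d_manual = np.sqrt(np.sum(diff ** 2))         # sqrt(0.04 + 0.16 + 0.04 + 0.36)
d_norm = np.linalg.norm(diff)                 # same quantity via numpy's norm

print(round(float(d_manual), 4))              # 0.7746
```

The sum of squared differences is 0.60, so the distance is $\sqrt{0.60} \approx 0.7746$.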

**Implement the `distance_from_prototype()` function**

The return value should be a vector of distances, one for each example.

In [66]:
def distance_from_prototype(observations, prototype):
    # prototype: the prototype of a category, excluding the category column.
    # observations: numpy matrix, excluding the category column.
    # Broadcasting subtracts the prototype from every row; summing the
    # squared differences along axis 1 gives one distance per observation.
    distances = np.sqrt(np.sum((observations - prototype) ** 2, axis=1))

    return distances

Let's see how far away the items in the dataset are from the prototype
of category $A$.

In [67]:
a_prototype = get_prototype(raw_stimuli, 0)
raw_stimuli_without_category = raw_stimuli[:,:4]
distance_from_prototype(raw_stimuli_without_category, a_prototype)

array([0.77459667, 0.89442719, 0.77459667, 1.        , 1.        ,
       1.09544512, 1.09544512, 1.34164079, 1.41421356])

Now we have all the information we need to calculate category
probabilities for a set of stimuli. To get probabilities from
distances from prototypes we use the following formula.

$$
P(A|x_i) = 1 - \frac{d_{iP_A}}{\sum_{c \in \text{Values(C)}}d_{iP_c}}
$$

where $C$ is the set of all categories.

Note that this is predicting the probability of category $A$, given
that we have stimulus $x_i$.
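To make the formula concrete, here is a small sketch for one stimulus, `[1, 1, 1, 0]`, using the two prototypes printed earlier. It is a check of the arithmetic under those values, not the full function asked for below.

```python
import numpy as np

x = np.array([1, 1, 1, 0])
prototypes = np.array([[0.8, 0.6, 0.8, 0.6],      # category A (0)
                       [0.25, 0.5, 0.25, 0.25]])  # category B (1)

# One distance per prototype: roughly [0.7746, 1.1990].
distances = np.linalg.norm(x - prototypes, axis=1)

# P(c | x) = 1 - d_c / (sum of the distances to all prototypes)
probabilities = 1 - distances / distances.sum()
print(probabilities.round(4))                     # [0.6075 0.3925]
```

With exactly two categories the two values sum to 1, because $(1 - d_A/S) + (1 - d_B/S) = 2 - 1$ where $S = d_A + d_B$.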

**Implement the `prototype_predict_category_probabilities(dataset, observations)` function**

You will have to, for each category:

* determine the prototype using `get_prototype()`
* determine the distances of the observations from the prototype using `distance_from_prototype()`

Using these functions, you should calculate $P(c|x_i)$ for each stimulus
in each category.

The return value should be a matrix of probabilities, where the rows
represent the observations and the columns represent the categories.

Do not worry about efficiency, just accuracy.

In [68]:
def prototype_predict_category_probabilities(dataset, observations):
    # dataset: numpy matrix
    # observations: numpy matrix with observations to predict the categories of
    # The global variable ALL_CATEGORIES is an array with all the category values
    # One column of distances per category, one row per observation.
    distances = np.zeros((observations.shape[0], len(ALL_CATEGORIES)))

    for column, category in enumerate(ALL_CATEGORIES):
        prototype = get_prototype(dataset, category)
        distances[:, column] = distance_from_prototype(observations, prototype)

    # P(c | x_i) = 1 - d_iPc / (sum of x_i's distances to all prototypes)
    total_distance = np.sum(distances, axis=1, keepdims=True)
    return 1 - distances / total_distance


Let's predict the categories of the training set itself.

**Test case**

The first four rows of the matrix should be

```python
[[0.6075119 , 0.3924881 ],
 [0.57273642, 0.42726358],
 [0.64247257, 0.35752743],
 [0.54523913, 0.45476087],
 # ...
 ]
```

This means that for the first observation, the $P(A)$ is 0.61, and the
$P(B)$ is 0.39. For the second observation, the $P(A)$ is 0.57, and the
$P(B)$ is 0.43, and so on.

In [76]:
result = prototype_predict_category_probabilities(raw_stimuli, raw_stimuli_without_category)
print(result)
print("Test results ==> ", result[0][0].round(2) == 0.61, result[0][1].round(2) == 0.39, result[1][0].round(2) == 0.57, result[1][1].round(2) == 0.43)

[[0.6075119  0.3924881 ]
 [0.57273642 0.42726358]
 [0.64247257 0.35752743]
 [0.54523913 0.45476087]
 [0.54523913 0.45476087]
 [0.46918161 0.53081839]
 [0.46918161 0.53081839]
 [0.41917462 0.58082538]
 [0.31866518 0.68133482]]
Test results ==>  True True True True


## Exemplar models

Exemplar models posit no abstraction process. They theorize that humans store
examples of concepts (along with their categories) and use similarity to those
examples to categorize a particular observation.

It's important to note that exemplar models have a fitting process,
where constants are fit using human data. Each attribute $k$ will have
a parameter $s_k$, where $0 \leq s_k \leq 1$. $s_k$ represents a
penalty for an attribute not matching. We will represent this in the
code as a `penalties` parameter array/vector, which we will later fit.

To determine the similarity between two items, the exemplar model uses this formula.

$$
  \text{sim}(x, y) = \prod_{k = 1}^{N} \begin{cases}
    1 & \text{if } x_k = y_k \\
    s_k & \text{if } x_k \neq y_k
  \end{cases}
$$
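One way to read this formula is as a product over attributes in which every mismatch contributes its penalty and every match contributes 1. The sketch below expresses that with `np.where`; the function name `similarity_vectorized` is a hypothetical alternative, not the function the assignment asks for.

```python
import numpy as np

def similarity_vectorized(x, y, penalties):
    """Product over attributes: 1 where x and y match, s_k where they differ."""
    x = np.asarray(x)
    y = np.asarray(y)
    penalties = np.asarray(penalties, dtype=float)
    factors = np.where(x == y, 1.0, penalties)
    return float(np.prod(factors))

# One mismatch (second attribute) with s_k = 0.5 everywhere.
print(similarity_vectorized([1, 1, 1, 0], [1, 0, 1, 0], [0.5] * 4))  # 0.5
# All four attributes mismatch: 0.5 ** 4.
print(similarity_vectorized([1, 1, 1, 0], [0, 0, 0, 1], [0.5] * 4))  # 0.0625
```

Because each $s_k$ lies between 0 and 1, every additional mismatch can only lower the similarity, and identical items always have similarity 1.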


**Implement `exemplar_similarity()`**

The return value should be a number representing the similarity.

In [78]:
def exemplar_similarity(item_a, item_b, penalties):
    # item_a: vector representing an observation
    # item_b: vector representing an observation
    # penalties: vector of s_k mentioned above
    similarity = 1.0

    # Multiply in the penalty s_k for every attribute k that does not match.
    for k in range(len(item_a)):
        if item_a[k] != item_b[k]:
            similarity *= penalties[k]

    return similarity


**Test case**

We can test the code using two items from the dataset, by using hard
coded penalty values of 0.5 for each attribute. Since these examples
differ by one attribute, the similarity should be 0.5 (the penalty value).

In [80]:
sim = exemplar_similarity(raw_stimuli_without_category[0], raw_stimuli_without_category[1], [0.5, 0.5, 0.5, 0.5])
print(f"sim({raw_stimuli_without_category[0]}, {raw_stimuli_without_category[1]}) = {sim}")

sim([1 1 1 0], [1 0 1 0]) = 0.5


Now we can use the similarity information to calculate the
probability that any observation is in category $A$:

$$
P(A | x_i) = \frac{\sum_{j = 1}^{M_A} \text{sim}(x_i, A_j)}
{\sum_{c \in \text{Values}(C)}\sum_{j = 1}^{M_c} \text{sim}(x_i, c_j)
}
$$

where

* $C$ is the set of all categories
* $M_c$ is the number of examples in category $c$
* $c_j$ is the $j$-th example in category $c$
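The formula above can be traced by hand for a single stimulus. The sketch below does so for `[1, 1, 1, 0]` with every penalty set to 0.5; the helper `summed_similarity` is an assumption made for illustration and is not part of the assignment's required interface.

```python
import numpy as np

category_a = np.array([[1, 1, 1, 0],
                       [1, 0, 1, 0],
                       [1, 0, 1, 1],
                       [1, 1, 0, 1],
                       [0, 1, 1, 1]])
category_b = np.array([[1, 1, 0, 0],
                       [0, 1, 1, 0],
                       [0, 0, 0, 1],
                       [0, 0, 0, 0]])
penalties = np.array([0.5, 0.5, 0.5, 0.5])
x = np.array([1, 1, 1, 0])

def summed_similarity(x, exemplars, penalties):
    # Per-attribute factor: 1 on a match, s_k on a mismatch (broadcast over rows).
    factors = np.where(exemplars == x, 1.0, penalties)
    # Product across attributes gives sim(x, exemplar); then sum over exemplars.
    return float(factors.prod(axis=1).sum())

sum_a = summed_similarity(x, category_a, penalties)  # 1 + 0.5 + 0.25 + 0.25 + 0.25
sum_b = summed_similarity(x, category_b, penalties)  # 0.5 + 0.5 + 0.0625 + 0.125
p_a = sum_a / (sum_a + sum_b)

print(sum_a, sum_b, round(p_a, 4))                   # 2.25 1.1875 0.6545
```

Note that the stimulus is itself one of the stored $A$ examples, so it contributes a similarity of 1 to the numerator.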

**Implement `exemplar_predict_category_probabilities()`**

Again, do not worry about efficiency, just accuracy. The return value
should be similar to that of `prototype_predict_category_probabilities()`: a
matrix of probabilities with rows representing observations and
columns representing the categories. Feel free to define and use
helper functions if you need them.



In [104]:
def exemplar_predict_category_probabilities(dataset, observations, penalties):
    # dataset: numpy matrix whose last column holds the category label
    # observations: numpy matrix with observations to predict the categories of
    # penalties: the penalties for the similarity function
    # The global variable ALL_CATEGORIES is an array with all the category values
    exemplar_probabilities = np.zeros((observations.shape[0], len(ALL_CATEGORIES)))

    for i, observation in enumerate(observations):
        # Summed similarity of this observation to the stored examples of each category.
        similarities_per_category = np.zeros(len(ALL_CATEGORIES))

        for example in dataset:
            similarity = exemplar_similarity(observation, example[:-1], penalties)
            similarities_per_category[int(example[-1])] += similarity

        # Normalize so that each row sums to 1.
        exemplar_probabilities[i, :] = similarities_per_category / np.sum(similarities_per_category)

    return exemplar_probabilities

Let's predict the categories of the dataset using itself. We will hard
code the penalties for now.

**Test case**

The first four rows of the matrix should be

```python
[[0.65454545, 0.34545455],
 [0.72      , 0.28      ],
 [0.7826087 , 0.2173913 ],
 [0.65217391, 0.34782609],
 # ...
 ]
```


In [105]:
result = exemplar_predict_category_probabilities(raw_stimuli, raw_stimuli_without_category, [0.5, 0.5, 0.5, 0.5])
print(result)
print("Test results ==> ", result[0][0].round(2) == 0.65, result[0][1].round(2) == 0.35, result[1][0].round(2) == 0.72, result[1][1].round(2) == 0.28)
print("Test results ==> ", result[2][0].round(2) == 0.78, result[2][1].round(2) == 0.22, result[3][0].round(2) == 0.65, result[3][1].round(2) == 0.35)

[[0.65454545 0.34545455]
 [0.72       0.28      ]
 [0.7826087  0.2173913 ]
 [0.65217391 0.34782609]
 [0.65217391 0.34782609]
 [0.48       0.52      ]
 [0.48       0.52      ]
 [0.34883721 0.65116279]
 [0.27272727 0.72727273]]
Test results ==>  True True True True
Test results ==>  True True True True


## Parameter fitting

We've been using hard-coded penalties (0.5 each) so far. The exemplar
model is specified such that these parameters should be fit to human
data. Let's tabulate the human data from the paper and fit the
parameters.

We will use `scipy`'s `curve_fit()` function to fit our parameters.
Called without bounds, as it is here, it uses the
[Levenberg–Marquardt](https://en.wikipedia.org/wiki/Levenberg%E2%80%93Marquardt_algorithm)
algorithm to find good parameter values. It returns a tuple containing the
fitted parameters and their estimated covariance matrix; we discard the covariance here.
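To illustrate the calling convention on something smaller than the exemplar model, here is a minimal sketch that fits a straight line to noise-free toy data. The function `line` and the data are made up for this example. `curve_fit()` calls the model as `f(xdata, *params)`, which is why a model written with `*params` needs an explicit `p0` so that the number of parameters is known.

```python
import numpy as np
from scipy.optimize import curve_fit

def line(x, *params):
    # curve_fit passes the current parameter guesses as positional arguments.
    slope, intercept = params
    return slope * x + intercept

# Toy data generated from slope 2 and intercept 1 (no noise).
x = np.linspace(0, 1, 20)
y = 2.0 * x + 1.0

# p0 sets the starting point and, for a *params model, the parameter count.
popt, pcov = curve_fit(line, x, y, p0=[1.0, 0.0])

print(popt.round(3))   # [2. 1.]
print(pcov.shape)      # (2, 2)
```

If the parameters had to stay inside a range such as $[0, 1]$, `curve_fit()` also accepts a `bounds` argument; in that case scipy uses a trust-region method by default rather than Levenberg–Marquardt.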

Install `scipy` using `pip3 install scipy` if it is not installed.

You will not need to write any code for this section, just use the
fitted penalties to answer the questions at the end of this assignment.


In [106]:
# Fit the exemplar model's penalties to the human data.
from scipy.optimize import curve_fit

# Human A probabilities for Medin and Schaffer (1978)

human_a_probs = np.array([0.78, 0.88, 0.81, 0.88, 0.81, 0.16, 0.16, 0.12, 0.03])

def exemplar_predict_a_prob(dataset, *penalties):
    dataset_without_categories = dataset[:, :4]
    predictions = exemplar_predict_category_probabilities(dataset, dataset_without_categories, penalties)
    a_predictions = predictions[:, 0]
    return a_predictions

penalties, _ = curve_fit(exemplar_predict_a_prob, raw_stimuli, human_a_probs, p0=[0.5, 0.5, 0.5, 0.5])
print(f"Penalties: {penalties}")

exemplar_predict_category_probabilities(raw_stimuli, raw_stimuli_without_category, penalties)


Penalties: [0.16065222 0.37070295 0.18720096 0.07054756]


array([[0.79800606, 0.20199394],
       [0.90000202, 0.09999798],
       [0.96740734, 0.03259266],
       [0.89120799, 0.10879201],
       [0.88263076, 0.11736924],
       [0.23400825, 0.76599175],
       [0.21197944, 0.78802056],
       [0.13042492, 0.86957508],
       [0.04188833, 0.95811167]])

## Questions

Respond to each question in one or two paragraphs. Remember, we are
looking for connections to psychological plausibility in humans.
  
* Create a table summarizing the predictions of the prototype model
  and the exemplar model for the 9 training stimuli (the 5 instances
  of category $A$ and the 4 instances of category $B$), and also the
  performance of humans during the test phase. Be sure you are
  consistent in either providing $P(A)$ for all 9 instances, or
  providing $P(A)$ for the first 5 and $P(B)$ for the
  last 4. The human data can be found in the lecture slides.

* How well does each of the models predict the human data? Describe
  patterns in the data. Then, answer this question quantitatively. For
  example, you might compute the correlation between each model's
  predictions and the human data, and report these two values. Or you
  might compute the root mean square deviation (RMSD) between each
  model's predictions and the human data. Or you might use another
  measure of model fit. Justify the measure you choose.

* As you learned in lecture, the key difference between the models is
  with regard to $A$ stimuli $\begin{bmatrix} 1 & 1 & 1 & 0
  \end{bmatrix}$ and $\begin{bmatrix} 1 & 0 & 1 & 0
  \end{bmatrix}$. Refer to the lecture slides about the
  differential predictions the models make on these stimuli. Was that
  the case in your results? Why does this pattern indicate that the
  exemplar model may better capture how people represent categories?

* When we have multiple models of human cognition, we typically want
  to determine which models are better than others. The prototype
  model presented performs poorly compared to the exemplar model. On
  the other hand, the prototype model is parameter-free and the
  exemplar model contains 4 parameters (i.e., the $s_k$ values) that need to
  be fit to human data. When evaluating these two models to determine
  which model is a better and more useful model of human cognition,
  how should we penalize the exemplar model for the free parameters it
  requires, and the flexibility that they provide? You can provide
  quantitative and/or verbal descriptions for your answer.

* What kind of human categorization behavior might be beyond either
  the prototype model or the exemplar model? You need only consider
  the limits of one of the two models.

# Answers

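1. The table below collates the predictions of the prototype model and the exemplar model with the human data from the test phase. The two human columns give the probability of choosing the correct category, i.e. P(A) for the A stimuli and P(B) for the B stimuli.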

| Stimuli | Prototype P(A) | Prototype P(B) | Exemplar P(A) | Exemplar P(B)  |  Human-Observed | Human-Predicted |
|---------|----------------|----------------|---------------|----------------|-----------------|-----------------|
|     4A  |       0.61     |       0.39     |       0.80    |       0.20     |       0.78      |       0.79      |
|     7A  |       0.57     |       0.43     |       0.90    |       0.10     |       0.88      |       0.94      |
|    15A  |       0.64     |       0.36     |       0.97    |       0.03     |       0.81      |       0.97      |
|    13A  |       0.55     |       0.45     |       0.89    |       0.11     |       0.88      |       0.86      |
|     5A  |       0.55     |       0.45     |       0.88    |       0.12     |       0.81      |       0.86      |
|    12B  |       0.47     |       0.53     |       0.23    |       0.77     |       0.84      |       0.76      |
|     2B  |       0.47     |       0.53     |       0.21    |       0.79     |       0.84      |       0.76      |
|    14B  |       0.42     |       0.58     |       0.13    |       0.87     |       0.88      |       0.93      |
|    10B  |       0.32     |       0.68     |       0.04    |       0.96     |       0.97      |       0.97      |


2. The prototype model's predictions are quite far from the human data, whereas the exemplar model's predictions track the human data much more closely. For example, for stimulus 4A the prototype model predicts a probability of 0.61, whereas the exemplar model predicts 0.80, which is much closer to the human test result of 0.78. For several stimuli (4A, 7A, 13A, 14B, 10B) the exemplar prediction is within 0.02 of the observed value, which is nearly an exact match. The largest exemplar miss is 15A (0.97 predicted vs. 0.81 observed). For a better quantitative view, let's calculate the root mean square deviation (RMSD) for both models, using the probability of the correct category for each stimulus.
   1. For the prototype model:
      1. Absolute deviations, i.e. how far each value is from the human observed value: |0.61 - 0.78|, |0.57 - 0.88|, |0.64 - 0.81|, |0.55 - 0.88|, |0.55 - 0.81|, |0.53 - 0.84|, |0.53 - 0.84|, |0.58 - 0.88|, |0.68 - 0.97| = 0.17, 0.31, 0.17, 0.33, 0.26, 0.31, 0.31, 0.30, 0.29
      2. Squared deviations: 0.0289, 0.0961, 0.0289, 0.1089, 0.0676, 0.0961, 0.0961, 0.09, 0.0841
      3. Mean of these squared deviations = Sum(0.0289, 0.0961, 0.0289, 0.1089, 0.0676, 0.0961, 0.0961, 0.09, 0.0841) / 9 = 0.6967 / 9 = 0.0774
      4. Root mean square deviation = square root of 0.0774, which is approximately 0.278, i.e. 27.8%.
   2. For the exemplar model:
      1. Absolute deviations, i.e. how far each value is from the human observed value: |0.80 - 0.78|, |0.90 - 0.88|, |0.97 - 0.81|, |0.89 - 0.88|, |0.88 - 0.81|, |0.77 - 0.84|, |0.79 - 0.84|, |0.87 - 0.88|, |0.96 - 0.97| = 0.02, 0.02, 0.16, 0.01, 0.07, 0.07, 0.05, 0.01, 0.01
      2. Squared deviations: 0.0004, 0.0004, 0.0256, 0.0001, 0.0049, 0.0049, 0.0025, 0.0001, 0.0001
      3. Mean of these squared deviations = Sum(0.0004, 0.0004, 0.0256, 0.0001, 0.0049, 0.0049, 0.0025, 0.0001, 0.0001) / 9 = 0.0390 / 9 = 0.00433
      4. Root mean square deviation = square root of 0.00433, which is approximately 0.066, i.e. 6.6%.
   3. The RMSD measures the typical size of the deviation between a model's predictions and the human data, in the same units as the data, and it weights large misses more heavily than small ones. For the prototype model the value is 27.8%, which is much higher than the 6.6% for the exemplar model. The higher the value, the larger the deviation. So we can say, quantitatively, that the exemplar model fits the human data better than the prototype model does.
3. The key difference between the models can be seen in stimuli 4A and 7A, as mentioned in the question: [1 1 1 0] and [1 0 1 0]. The prototype model categorizes a stimulus by how close it is to each category's prototype. Its predictions for these two stimuli are not far from an even split, 61% and 57% for category A, as if the model cannot confidently assign them to the correct category. The exemplar model, which works from the similarity between a stimulus and the stored examples, assigns them to A with 80% and 90% probability. The ordering of the two stimuli matters as well. The typical A pattern is [1 1 1 1], and 4A is closer to it than 7A is (its distance to the A prototype is 0.77 vs. 0.89), so the prototype model rates 4A above 7A. The exemplar model reverses this, giving 80% for 4A and 90% for 7A, and the human data show the same reversal (0.78 vs. 0.88). This was the case in my results too, and it agrees with the human test results given in the lecture and in the paper by Medin and Schaffer. This pattern suggests that the exemplar model captures how people categorize stimuli better than the prototype model does, because it reflects the fact that human categorization is often based on stored instances rather than on a single generalized prototype.
4. Both the prototype and exemplar models have their pros and cons. The prototype model is simple and has no free parameters, so it is unlikely to overfit, but it is less accurate and not flexible enough to capture the way a human might think about and process stimuli. The exemplar model, with its 4 parameters, is more accurate and flexible enough to capture nuanced variations in human cognition, as the RMSD exercise in the previous answer shows, but that flexibility comes at the cost of added complexity. To compare the models quantitatively we could use scores that are generally used to evaluate machine learning models, such as precision, recall, accuracy, or F1-score. Measures like the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) are especially useful here because they penalize a model for the number of parameters it uses. A lower BIC value means a better balance between model fit and model complexity. So even though we know that the exemplar model is closer to the human values, it should only be preferred if its improvement in fit is large enough to outweigh the penalty for its extra parameters.
5. Both the prototype and exemplar models are good at categorizing binary as well as multi-dimensional data, as was the case in our experiment, but all the data used was linear. The main limitations of the models would arise when dealing with non-linear data. One might think that these models are also poor at categorizing outliers, but the exemplar model handles them well even though the prototype model might not. Humans are more than capable of handling non-linear, complex, dynamic, context-sensitive data and can deal well with outliers, so the prototype model in particular is at a disadvantage. To address these limitations, at least in part, in the lecture we talked about other models such as RULEX, a hybrid model that better handles complex data.
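The RMSD arithmetic in answer 2 and the parameter penalty discussed in answer 4 can be checked with a short script. This is a sketch under stated assumptions: it uses the rounded probabilities of the correct category from the table, and `gaussian_bic` is a hypothetical helper that uses the least-squares form of BIC, $n \ln(\text{RSS}/n) + k \ln(n)$, which assumes normally distributed errors. Neither helper is part of the assignment code.

```python
import numpy as np

# Probability of the correct category for the 9 stimuli (rounded table values).
human = np.array([0.78, 0.88, 0.81, 0.88, 0.81, 0.84, 0.84, 0.88, 0.97])
prototype = np.array([0.61, 0.57, 0.64, 0.55, 0.55, 0.53, 0.53, 0.58, 0.68])
exemplar = np.array([0.80, 0.90, 0.97, 0.89, 0.88, 0.77, 0.79, 0.87, 0.96])

def rmsd(predicted, observed):
    """Root mean square deviation between predictions and observations."""
    return float(np.sqrt(np.mean((predicted - observed) ** 2)))

def gaussian_bic(predicted, observed, n_params):
    """Least-squares BIC (assumes Gaussian errors); lower is better."""
    n = len(observed)
    rss = float(np.sum((predicted - observed) ** 2))
    return n * np.log(rss / n) + n_params * np.log(n)

print(round(rmsd(prototype, human), 3))   # 0.278
print(round(rmsd(exemplar, human), 3))    # 0.066

# The prototype model has 0 free parameters; the exemplar model has 4 penalties.
bic_prototype = gaussian_bic(prototype, human, n_params=0)   # about -23.0
bic_exemplar = gaussian_bic(exemplar, human, n_params=4)     # about -40.2
print(bic_exemplar < bic_prototype)       # True
```

Under these assumptions the exemplar model keeps the lower BIC even after the penalty for its four parameters, although with only 9 data points the comparison should be read cautiously.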



