## Short answer grading

Professor Fund likes teaching, but she does not like grading students' work; she would much rather spend that time interacting directly with students.

You are going to help her out by developing a machine learning model to make this arduous task easier, at least for short answer questions.

In the attached workspace, you will load a dataset related to short answer grading, then explore some models that could streamline the grading.

| Name | Type | Section | Description |
| ---- | ---- | ---- | ---- |
|`acc_rf`|1d numpy array|1|Accuracy of random forest for each question ID.|
|`acc_rf_kf`|2d numpy array|2|Accuracy of random forest for each question ID in each fold.|
|`acc_rf_mean`|1d numpy array|2|Mean accuracy of random forest for each question ID across folds.|
|`cluster_ids`|1d numpy array|3|Cluster ID for each sample.|
|`cluster_correctness`|2d numpy array|3|Fraction of samples with a correct answer, for each question ID and each cluster.|

In [1]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import accuracy_score

**This question includes bonus points - you only need to earn 20 points of credit to get full credit, but you _can_ earn up to 30 points (and therefore, up to 110% on the overall exam). You may work on whichever sections you like to earn the first 20 points, but don't spend time trying to get to 30 points until you have first answered the rest of the exam.**

Use this `random_state` throughout your work: 

In [2]:
random_state = 0

You are working with a data set that has four different short answer questions, each with a reference answer and approximately 30 student answers.

First, read in the data:

In [3]:
df = pd.read_csv("answers.csv")

In [4]:
df["id"].unique() # four question IDs

array([0, 1, 2, 3])

In [5]:
df["question"].unique() # four questions

array(['Carrie wanted to find out which was harder, a penny or a nickel, so she did a scratch test. How would this tell her which is harder?',
       'Look at the picture on the right. Label the poles on each magnet. (The bottom 2 magnets are stuck together, the others are not.) What is the rule that explains why you labeled the poles the way you did?',
       'A solution is a type of mixture. What makes it different from other mixtures?',
       'Katie got a guitar for her birthday. She experimented with the strings and found she could change their sounds. One way Katie could change the sound of a string was to tighten it. Describe how the sound was different when the string was tightened.'],
      dtype=object)

In [6]:
df["reference_answer"].unique() # four reference answers (the "official" correct answer)

array(['The harder coin will scratch the other.',
       'Like poles repel and opposite poles attract.',
       'A solution is a mixture formed when a solid dissolves in a liquid.',
       'When the string was tighter, the pitch was higher.'], dtype=object)

In [7]:
df.head() # note many answers per question

| Unnamed: 0 | answer | correct | question | reference_answer | id |
| ---- | ---- | ---- | ---- | ---- | ---- |
| 0 | You could tell if it has the hardest if most o... | 0 | Carrie wanted to find out which was harder, a ... | The harder coin will scratch the other. | 0 |
| 1 | If just the penny could scratch and the nickel... | 1 | Carrie wanted to find out which was harder, a ... | The harder coin will scratch the other. | 0 |
| 2 | Whichever one was damaged most was less hard. | 1 | Carrie wanted to find out which was harder, a ... | The harder coin will scratch the other. | 0 |
| 3 | Rub them against a crystal. | 0 | Carrie wanted to find out which was harder, a ... | The harder coin will scratch the other. | 0 |
| 4 | Which had less scratches. | 1 | Carrie wanted to find out which was harder, a ... | The harder coin will scratch the other. | 0 |


### 1. Train a RandomForest

You will train and evaluate a separate model for each question, so in everything you do, you will start by iterating over the question IDs.

Then, for each question ID:

* get the data for that question ID
* divide the data into a training and test set, using the random state specified above, and leaving 10 random samples per question in the test set with the rest in the training set.
* Use a `CountVectorizer` with `stop_words = "english"` to create a numeric representation of the answers to that question, using the answer data in the training set to fit the `CountVectorizer`.
* train a `RandomForestClassifier` to predict whether or not the answer is correct. Use 20 trees in your forest, and use the random state specified above.
* save the accuracy on the test set (per question ID) in `acc_rf`.
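To make the vectorization step concrete, here is a minimal sketch of how a `CountVectorizer` fitted on training text handles test text. The toy sentences are hypothetical and are not taken from `answers.csv`; the point is only that English stop words are dropped and that words unseen during fitting (here, "dime") are ignored at transform time.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy answers, used only for illustration
train_answers = [
    "The harder coin will scratch the other coin.",
    "The penny is softer than the nickel.",
]
test_answers = ["The dime will scratch the penny."]

# Fit the vocabulary on the training answers only
vectorizer = CountVectorizer(stop_words="english")
X_train_vec = vectorizer.fit_transform(train_answers)

# Reuse the fitted vocabulary for the test answers
X_test_vec = vectorizer.transform(test_answers)

print(sorted(vectorizer.vocabulary_))  # words kept after removing stop words
print(X_train_vec.shape)               # (number of answers, vocabulary size)
print(X_test_vec.toarray())            # counts for known words only
```

Fitting on the training answers alone keeps information about the test answers out of the feature set, which is why the list above asks for the training set to be used when fitting.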

In [8]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
question_ids = df["id"].unique()
acc_rf = np.zeros(len(question_ids))
# implement here
for i, question_id in enumerate(question_ids):
    question_data = df[df["id"] == question_id]
    X = question_data["answer"]
    y = question_data["correct"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=10, random_state=random_state)
    vectorizer = CountVectorizer(stop_words="english")
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)
    model = RandomForestClassifier(n_estimators=20, random_state=random_state)
    model.fit(X_train_vec, y_train)
    y_pred = model.predict(X_test_vec)
    acc_rf[i] = accuracy_score(y_test, y_pred)
print(acc_rf)

[0.6 0.7 0.8 0.8]


### 2. Evaluate a RandomForest with KFold CV

Since you are working with a very small data set, you are concerned that perhaps the results are highly dependent on the random draw of training vs test samples. So, you repeat the analysis above, but with a KFold CV for evaluation:

* first, you iterate over question IDs
* then, you set up a KFold CV with 5 folds. Shuffle the data and use the random state specified above.
* inside each fold, 
  * Use a `CountVectorizer` with `stop_words = "english"` to create a numeric representation of the answers to that question, using the training set to fit the `CountVectorizer`.
  * train a `RandomForestClassifier` to predict whether or not the answer is correct. Use 20 trees in your forest, and use the random state specified above.
  * save the accuracy on the validation fold (per question ID and fold) in `acc_rf_kf`.
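As a rough illustration of what the 5-fold split does, the sketch below runs `KFold` on a hypothetical set of 30 sample indices (the sample count is an assumption chosen to match "approximately 30" answers per question). Each sample lands in exactly one validation fold, so every answer is used for evaluation once.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 30  # assumed count, for illustration only
indices = np.arange(n_samples)

kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_sizes = []
validated = []
for train_index, test_index in kf.split(indices):
    fold_sizes.append(len(test_index))   # size of this validation fold
    validated.extend(test_index.tolist())  # which samples were held out

print(fold_sizes)
print(sorted(validated) == list(range(n_samples)))
```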

In [11]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
question_ids = df["id"].unique()
n_folds = 5
acc_rf_kf = np.zeros((len(question_ids), n_folds))
# implement here
for i, question_id in enumerate(question_ids):
    question_data = df[df["id"] == question_id]
    X = question_data["answer"]
    y = question_data["correct"]
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=random_state)
    for j, (train_index, test_index) in enumerate(kf.split(X)):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        vectorizer = CountVectorizer(stop_words="english")
        X_train_vec = vectorizer.fit_transform(X_train)
        X_test_vec = vectorizer.transform(X_test)
        model = RandomForestClassifier(n_estimators=20, random_state=random_state)
        model.fit(X_train_vec, y_train)
        y_pred = model.predict(X_test_vec)
        acc_rf_kf[i, j] = accuracy_score(y_test, y_pred)

Then, get the average accuracy per question ID (across folds) and save it in `acc_rf_mean`.

In [13]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
acc_rf_mean = np.mean(acc_rf_kf, axis=1)
print(acc_rf_mean)

[0.61071429 0.81428571 0.77857143 0.74642857]
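One way to see how much the single train/test split in section 1 depended on the random draw is to compare it with the cross-validated mean. The sketch below copies the two printed outputs from this notebook into arrays (the variable names `acc_single_split` and `acc_cv_mean` are introduced here for illustration) and reports the per-question gap.

```python
import numpy as np

# Accuracies copied from the printed outputs in sections 1 and 2
acc_single_split = np.array([0.6, 0.7, 0.8, 0.8])
acc_cv_mean = np.array([0.61071429, 0.81428571, 0.77857143, 0.74642857])

# Positive values: the single-split estimate was lower than the CV mean
difference = acc_cv_mean - acc_single_split

# Question ID whose estimate changed the most between the two evaluations
largest_gap_question = int(np.argmax(np.abs(difference)))

print(np.round(difference, 3))
print(largest_gap_question)
```

With only about 30 answers per question, a test set of 10 samples changes accuracy in steps of 0.1, so gaps of this size are not surprising.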


### 3. Clustering similar answers

The supervised learning model above is interesting, but is only useful for questions on which Professor Fund already has labeled data from previous semesters. It won't help her grade new questions that have not been used in previous semesters.

To help with brand-new questions, you will also create an unsupervised learning model that will group together similar answers, making them easier to grade.

In the next cell, 

* first, you iterate over question IDs. For each question, 
  * Use a `CountVectorizer` with `stop_words = "english"` to create a numeric representation of the answers to that question, using the answer data to fit the `CountVectorizer`.
  * then use a `KMeans` model with 3 clusters to group similar answers. Use the random state specified above, and leave all other settings at their default values.
  * save the cluster ID of each sample in `cluster_ids`.

In [14]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
cluster_ids = np.zeros(len(df), dtype=int)
# implement here
for i, question_id in enumerate(question_ids):
    question_data = df[df["id"] == question_id]
    X = question_data["answer"]
    vectorizer = CountVectorizer(stop_words="english")
    X_vec = vectorizer.fit_transform(X)
    kmeans = KMeans(n_clusters=3, random_state=random_state)
    cluster_ids[df["id"] == question_id] = kmeans.fit_predict(X_vec)

If your clustering model is doing a good job of grouping together answers that will be graded similarly, we would expect that within a cluster, most of the answers should be either correct or incorrect (i.e. most answers within a cluster should have similar "correctness".)

For each question and each cluster, compute the average "correctness" within the cluster (i.e. the average value of "correct" among samples within that cluster). Save the result in `cluster_correctness`.

In [15]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
question_ids = df["id"].unique()
n_clusters = 3  
cluster_correctness = np.zeros((len(question_ids), n_clusters))
# implement here
for i, question_id in enumerate(question_ids):
    question_data = df[df["id"] == question_id]
    for j in range(n_clusters):
        cluster_data = question_data[cluster_ids[df["id"] == question_id] == j]
        # an empty cluster keeps its initial value of 0
        if len(cluster_data) > 0:
            cluster_correctness[i, j] = cluster_data["correct"].mean()
print(cluster_correctness)


[[1.         0.57142857 0.4       ]
 [0.05882353 0.75       0.14285714]
 [0.23333333 0.         0.5       ]
 [0.38095238 0.57142857 0.        ]]
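The correctness values printed above can be read against the earlier expectation that a good cluster is mostly correct or mostly incorrect. One possible summary, sketched here, is a "purity" score `max(p, 1 - p)` for each cluster, where `p` is the cluster's average correctness. The name `purity` and the unweighted per-question mean are assumptions made for illustration; the mean ignores cluster sizes, and a value of 0 in the printed output could in principle also come from an empty cluster.

```python
import numpy as np

# Values copied from the printed cluster_correctness output
cluster_correctness = np.array([
    [1.0, 0.57142857, 0.4],
    [0.05882353, 0.75, 0.14285714],
    [0.23333333, 0.0, 0.5],
    [0.38095238, 0.57142857, 0.0],
])

# Purity is 1.0 when a cluster is all correct or all incorrect,
# and 0.5 when it is an even mix
purity = np.maximum(cluster_correctness, 1 - cluster_correctness)

# Unweighted average purity per question (cluster sizes are not considered)
mean_purity_per_question = purity.mean(axis=1)

print(np.round(purity, 3))
print(np.round(mean_purity_per_question, 3))
```

Under this reading, a cluster with correctness near 0.5 gives Professor Fund little help, since its answers still need to be graded one by one.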
