## Dumb Predictor
#### Sam Berkson
#### CPSC 323

The goal of this assignment was to create a dumb predictor, a classifier that always predicts the most common class label, and run it over a real-life skewed dataset.  I chose the Titanic dataset, a list of attributes for every passenger on the Titanic during its last voyage.  I ran my model over the 'survived' attribute, predicting whether each passenger survived the journey.

First, I imported the necessary libraries.

In [1]:
import pandas as pd
import time

Next, I created a class containing my dumb predictor.  It has a constructor and a predict method.  The predict method takes in the data as a list and finds the most commonly occurring value.  It then fills the predictions list with this value X times, where X is the size of the input, and returns the predictions along with the time the prediction took.

In [2]:
class DumbPredictor:
    """Baseline model that predicts the most commonly occurring class label for every instance."""

    def __init__(self):
        """Initializes the DumbPredictor object."""
        self.most_common = None
        self.data = None

    def predict(self, data):
        """Returns majority-label predictions for data (a list) and the elapsed time in seconds."""
        start = time.time()
        self.data = data
        # Find the most common class label to use as the prediction for all instances
        self.most_common = max(set(self.data), key=self.data.count)

        # Predict that label once per input instance
        predictions = [self.most_common] * len(self.data)
        end = time.time()
        return predictions, (end - start)
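
As a point of comparison, the majority label can also be found with `collections.Counter`, which counts every label in a single pass rather than calling `list.count` once per distinct label.  The sketch below is an illustrative alternative, not the implementation used in this notebook; the function name `majority_label` and the toy labels are assumptions.

```python
from collections import Counter

def majority_label(labels):
    """Return the most frequent label in a non-empty list (hypothetical helper)."""
    # most_common(1) returns a list holding one (label, count) pair
    label, _count = Counter(labels).most_common(1)[0]
    return label

toy_labels = ['yes', 'no', 'yes', 'yes', 'no']  # made-up data for illustration
print(majority_label(toy_labels))  # prints yes
```

For a two-class target like 'survived' the difference is negligible, but the single-pass count scales better when there are many distinct labels.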

Next, I read in my dataset and separated the 'survived' column into a list.

In [3]:
df = pd.read_csv('titanic.csv')

# Create a list of the target values
input = df['survived'].tolist()

Next, I instantiated my model and passed in my input list.  I then walked through the input and the predictions in parallel, checking which instances were correctly and incorrectly classified and incrementing a counter for each outcome (TP, FP, TN, FN).
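
To illustrate the tally described above, here is a minimal sketch on made-up labels, assuming 'yes' is the positive class.  The helper name `tally_outcomes` and the toy lists are hypothetical; a dumb predictor can only ever populate two of the four counters, so the toy predictions are deliberately mixed to exercise every branch.

```python
def tally_outcomes(predicted, actual, positive='yes'):
    """Count TP, FP, TN, FN by walking both lists in parallel (hypothetical helper)."""
    counts = {'TP': 0, 'FP': 0, 'TN': 0, 'FN': 0}
    for p, a in zip(predicted, actual):
        if p == positive:
            # Predicted positive: correct if the actual label is also positive
            counts['TP' if a == positive else 'FP'] += 1
        else:
            # Predicted negative: correct if the actual label is also negative
            counts['TN' if a != positive else 'FN'] += 1
    return counts

toy_predicted = ['yes', 'yes', 'no', 'no']
toy_actual = ['yes', 'no', 'no', 'yes']
print(tally_outcomes(toy_predicted, toy_actual))
# prints {'TP': 1, 'FP': 1, 'TN': 1, 'FN': 1}
```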

In [4]:
model = DumbPredictor()
# Store the timing under a new name so the time module is not overwritten
results, elapsed = model.predict(input)

# Tally the confusion-matrix cells, treating 'yes' (survived) as the positive class
Positive, Negative, TP, FP, TN, FN = 0, 0, 0, 0, 0, 0

for predicted, actual in zip(results, input):
    if predicted == 'yes':
        Positive += 1
        if actual == 'yes':
            TP += 1
        else:
            FP += 1
    elif predicted == 'no':
        Negative += 1
        if actual == 'no':
            TN += 1
        else:
            FN += 1


Now, I will evaluate my model on five metrics:
* Accuracy
* Precision
* Recall
* F1
* Time

In [5]:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 * ((Precision * Recall) / (Precision + Recall))

print("Accuracy: ", Accuracy)
print("Precision: ", Precision)
print("Recall: ", Recall)
print("F1: ", F1)
print("Time: ", elapsed)

def confusion_matrix(total, TP, FP, TN, FN):
    # Derive the predicted totals from the counts passed in rather than from global counters
    print("Total:", total, "          | Predicted Survived:", TP + FP, "| Predicted Died:", TN + FN)
    print("Actual Survived:", TP + FN, "| TP: ", TP, "               | FN: ", FN, " ")
    print("Actual Died:", FP + TN, "     | FP: ", FP, "                | TN: ", TN, " ")
    
print("\nConfusion Matrix:")
confusion_matrix(len(results), TP, FP, TN, FN)

Accuracy:  0.6769650159018628
Precision:  0.6769650159018628
Recall:  1.0
F1:  0.8073692766188024
Time:  0.00021600723266601562

Confusion Matrix:
Total: 2201           | Predicted Survived: 2201 | Predicted Died: 0
Actual Survived: 1490 | TP:  1490                | FN:  0  
Actual Died: 711      | FP:  711                 | TN:  0  


Overall, my model performed as I expected.  'yes' was the most commonly occurring class label; it occurred in 1490 of the 2201 instances (about 67.70%), matching the accuracy of my model.  Precision equals accuracy for the same reason: every instance was predicted 'yes', so the true positives are exactly the correct predictions.  It follows that my recall is 1.0, since predicting 'yes' for every instance leaves no false negatives.  My F1 score is also in line with where it should be, as is the prediction time (there are only 2201 instances, so it shouldn't take long to predict).  My confusion matrix looks as expected and matches all of my other metrics.
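
The reported figures can be re-derived from the confusion-matrix counts alone.  The sketch below uses a hypothetical helper (`metrics_from_counts` is not part of the notebook) that also guards against division by zero, which would occur for precision if the majority class were 'no' and nothing were predicted positive; returning 0.0 in that case is an assumed convention.

```python
def metrics_from_counts(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    def safe_div(num, den):
        # Assumed convention: report 0.0 when the denominator is zero
        return num / den if den else 0.0

    accuracy = safe_div(tp + tn, tp + fp + tn + fn)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return accuracy, precision, recall, f1

# Counts taken from the confusion matrix above
accuracy, precision, recall, f1 = metrics_from_counts(tp=1490, fp=711, tn=0, fn=0)
print(round(accuracy, 4), round(precision, 4), recall, round(f1, 4))
# prints 0.677 0.677 1.0 0.8074
```

With the classes flipped (`tp=0, fp=0, tn=1490, fn=711`), the same helper would report the same accuracy but zero precision, recall, and F1, which shows how strongly these metrics depend on which label is treated as positive.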