# Assessing Model Accuracy

This notebook builds on an earlier example notebook by adding a held-out test set and demonstrating model evaluation.

At the end of this notebook, we can see how much better or worse the model does than simply guessing the mean of the criterion. This basic model performs roughly at the level of guessing the mean. The notebook is useful for demonstrating `train_test_split`, which sets aside a test set in Python, and the `.score()` method, which most scikit-learn models provide for evaluating performance against a holdout set. For regression models such as the one used here, `.score()` returns R² rather than classification accuracy.
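
As a rough illustration of what "guessing the mean" means for a regression score, the sketch below computes R² by hand. The `r_squared` helper and the score values are made up for this example and are not part of the notebook's data. Guessing the mean gives exactly 0, perfect predictions give 1, and predictions worse than the mean give a negative value.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up criterion scores, for illustration only
y_true = [3.0, 4.0, 5.0, 4.0]
mean_guess = [sum(y_true) / len(y_true)] * len(y_true)

r2_mean = r_squared(y_true, mean_guess)              # always guess the mean
r2_perfect = r_squared(y_true, y_true)               # perfect predictions
r2_worse = r_squared(y_true, [5.0, 3.0, 3.0, 5.0])   # worse than the mean

print(r2_mean, r2_perfect, r2_worse)
```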


In [1]:
import random
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

import nltk
from nltk.corpus import stopwords

from sklearn.linear_model import Ridge

In [7]:
# Import the data to a df
df = pd.read_csv('data/siop_ml_train_participant.csv')

# Limit the results to a single answer / score combination for demonstration
df = df.drop(['Respondent_ID', 'open_ended_1', 'open_ended_3', 'open_ended_4', 'open_ended_5', 
         'A_Scale_score', 'O_Scale_score', 'E_Scale_score', 'N_Scale_score'], axis=1)

# Confirm that the data has been imported and is formatted correctly
df.head(3)

| | open_ended_2 | C_Scale_score |
|---|---|---|
| 0 | I would reach out to my boss and ask him or he... | 3.75 |
| 1 | I would continue to work on the project that w... | 5.0 |
| 2 | I would start working on the project now and g... | 4.666667 |


In [11]:

# Count the raw whitespace-separated tokens in the first response
tokens = df['open_ended_2'][0].split()
nltk.FreqDist(tokens)


FreqDist({'I': 4, 'my': 3, 'boss': 2, 'to': 2, 'it': 2, 'this': 2, 'would': 2, 'results.': 1, 'gonna': 1, 'what': 1, ...})
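
The counts above come from a plain `split()`, so capitalized words and tokens with trailing punctuation (such as `'results.'`) are counted separately from their plain forms. A minimal sketch of how normalizing first changes the counts is shown below. It uses `collections.Counter` instead of `nltk.FreqDist` to stay dependency-free, and the sample sentence is hypothetical rather than taken from the data set.

```python
import re
from collections import Counter

# Hypothetical response text, standing in for df['open_ended_2'][0]
text = "I would ask my boss. My boss would know the results."

# Raw tokens: split on whitespace only
raw_tokens = text.split()

# Normalized tokens: lowercase, then keep only runs of letters and digits
clean_tokens = re.findall(r"[a-z0-9]+", text.lower())

raw_counts = Counter(raw_tokens)
clean_counts = Counter(clean_tokens)

print(raw_counts.most_common(3))
print(clean_counts.most_common(3))
```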

The two lists below name the available text inputs and the personality facet scores. Use them to choose which facet you are attempting to predict, and from which input.

In [3]:
training_columns = ['open_ended_1', 'open_ended_2', 'open_ended_3', 'open_ended_4', 'open_ended_5']
criterion_columns = ['E_Scale_score', 'A_Scale_score', 'O_Scale_score', 'C_Scale_score', 'N_Scale_score']

In [4]:
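def simple_prep(df):
    for col in training_columns:
        # Skip any text columns that were dropped earlier
        if col not in df.columns:
            continue

        # Lowercase it all (assign back: pandas returns a new Series)
        df[col] = df[col].str.lower()

        # Replace non-alphanumeric characters with spaces
        df[col] = df[col].replace('[^a-zA-Z0-9]', ' ', regex=True)

    return df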

prepped_data = simple_prep (df)
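
One detail worth keeping in mind for a helper like `simple_prep`: pandas string methods return a new Series instead of modifying the column in place, so the result has to be assigned back to take effect. The sketch below shows the difference on a tiny, made-up DataFrame (the `demo` frame and its `answer` column are assumptions for illustration only).

```python
import pandas as pd

# Tiny made-up frame; 'answer' is a hypothetical column name
demo = pd.DataFrame({'answer': ['Hello, World!', 'Ask my BOSS...']})

# Calling the method without assigning the result leaves the column untouched
demo['answer'].str.lower()
unchanged = demo['answer'].tolist()

# Assigning the result back actually updates the column
demo['answer'] = (
    demo['answer']
    .str.lower()
    .str.replace('[^a-z0-9]', ' ', regex=True)
)
cleaned = demo['answer'].tolist()

print(unchanged)
print(cleaned)
```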

In [5]:
# Generate our testing and training sets and show their relative sizes
train, test = train_test_split(prepped_data, test_size=0.3)
print(train.shape)
print(test.shape)

(761, 11)
(327, 11)
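
The split above draws rows at random, so each run produces a different partition and therefore a different score. With `test_size=0.3`, the test set is 30% of the 1,088 rows, rounded up to 327. If repeatable results are wanted, one option is to pass `random_state`, as in the sketch below. The ten-element array and the seed value `0` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten stand-in "rows"; the real notebook splits a DataFrame instead
rows = np.arange(10)

# The same seed gives the same partition on every call
train_a, test_a = train_test_split(rows, test_size=0.3, random_state=0)
train_b, test_b = train_test_split(rows, test_size=0.3, random_state=0)

print(len(train_a), len(test_a))
print(np.array_equal(test_a, test_b))
```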


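When hosted on GitHub, this sample shows only the most recent run; you can clone the repository to run it locally and sample the results yourself. Because the train/test split is random, the score changes from run to run. The results for this model are not great: in the run shown below the score is close to zero, which is roughly what guessing the mean would achieve. The standard deviation of the criterion is a useful point of comparison.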

In [6]:
predictor = 'open_ended_2'
regressor = 'C_Scale_score'


In [7]:
def vectorize_data (df_train, df_test, predictor, regressor):
    # Set the TF-IDF vectorization settings
    vectorizer = TfidfVectorizer(min_df=1)
    
    # Convert text into vectors
    X_train = vectorizer.fit_transform(df_train[predictor]) 

    # Also convert test data into vectors
    X_test = vectorizer.transform(df_test[predictor]) 
    
    # Set the criterion values for the training set
    y_train = df_train[regressor]

    # Set the criterion values for the test set
    y_test = df_test[regressor]
    
    # return the vectorizer object so we can use it later for evaluation
    return X_train, X_test, y_train, y_test, vectorizer
  
X_train, X_test, y_train, y_test, vectorizer = vectorize_data (train, test, predictor, regressor)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = vectorizer.get_feature_names_out().tolist()
print(feature_names[:10])

['10', '100', '30', '40', '45', '8n', '90', '95', 'abandon', 'abilities']


In [9]:
# Utilizing ridge regression
mod = Ridge(alpha=1.0, random_state=42)

# Fit the model using our training data and criterion column
mod.fit(X_train, y_train)

# For regressors, .score() returns R^2 on the held-out test set
print('Score for pred {0} from {1} is {2}'.format(regressor, predictor, mod.score(X_test, y_test)))


Score for pred C_Scale_score from open_ended_2 is -0.021640579428152007
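
To put a score like the one above in context, it can help to compare it against a baseline that always predicts the training mean. The sketch below uses scikit-learn's `DummyRegressor` on made-up scores. The `*_demo` names and values are illustrative assumptions, not the notebook's data. Because the training mean rarely matches the test mean exactly, such a baseline usually scores slightly below zero, and its RMSE comes out slightly above the standard deviation of the test scores.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Made-up criterion scores standing in for y_train / y_test
y_train_demo = np.array([3.0, 4.0, 5.0, 4.0, 3.5, 4.5])   # mean 4.0
y_test_demo = np.array([3.0, 5.0, 4.0, 4.4])              # mean 4.1

# DummyRegressor ignores the features, so placeholder zeros are enough
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

# Baseline: always predict the mean of the training scores
baseline = DummyRegressor(strategy='mean')
baseline.fit(X_train_demo, y_train_demo)

baseline_r2 = baseline.score(X_test_demo, y_test_demo)
baseline_rmse = np.sqrt(np.mean((baseline.predict(X_test_demo) - y_test_demo) ** 2))
test_std = np.std(y_test_demo)

print(baseline_r2, baseline_rmse, test_std)
```

A model is only adding value over guessing the mean if its R² is clearly above this baseline, or equivalently if its RMSE is clearly below the standard deviation of the test scores.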
