# Support Vector Regression and Feature Engineering

The first iteration of our model was limited by a lack of feature engineering. The only parameters being varied were those controlling the n-gram range used in vectorization, so each model simply predicted outcomes from the TF-IDF document-term matrix alone.
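To make the effect of the n-gram range concrete, here is a minimal sketch on a made-up three-document corpus (the sentences are illustrative only, not taken from the challenge data). It compares the vocabulary size with unigrams only against unigrams plus bigrams.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration
docs = [
    "I would talk to my colleague",
    "I would ask my manager",
    "I would talk to my manager",
]

# Unigrams only vs. unigrams + bigrams
unigrams = TfidfVectorizer(ngram_range=(1, 1)).fit(docs)
uni_bigrams = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)

n_unigrams = len(unigrams.vocabulary_)
n_uni_bigrams = len(uni_bigrams.vocabulary_)

# A wider n-gram range adds columns to the document-term matrix
print(n_unigrams, n_uni_bigrams)
```

Widening the range grows the feature space quickly, which is worth keeping in mind when the training set is small.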

This iteration does not currently use weighted voting; later iterations may reintroduce it.
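For reference, one possible form of weighted voting is sketched below. It assumes each model has a validation error available and weights its predictions by the inverse of that error. The helper name `weighted_vote` and the numbers are hypothetical, and this is not necessarily the scheme used in the earlier iteration.

```python
import numpy as np

def weighted_vote(predictions, val_errors):
    """Combine per-model predictions using weights proportional to 1 / validation error."""
    # predictions has shape (n_models, n_samples)
    predictions = np.asarray(predictions, dtype=float)
    # Lower error -> larger weight; np.average normalizes the weights
    weights = 1.0 / np.asarray(val_errors, dtype=float)
    return np.average(predictions, axis=0, weights=weights)

# Two hypothetical models predicting two respondents
preds = [[3.0, 4.0], [5.0, 2.0]]
combined = weighted_vote(preds, val_errors=[0.5, 1.0])
print(combined)
```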


In [1]:
import random
import pandas as pd
import numpy as np
import scipy as sp

from sklearn.svm import SVR
from sklearn.linear_model import Ridge
from sklearn.kernel_ridge import KernelRidge

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import metrics


In [2]:
# Import the data to a df
df = pd.read_csv('../data/siop_ml_train_participant.csv')

# Import the evaluation (dev) data to a new df
eval_df = pd.read_csv('../data/siop_ml_dev_participant.csv')

# Confirm that the data has been imported and is formatted correctly
df.head(3)


Unnamed: 0,Respondent_ID,open_ended_1,open_ended_2,open_ended_3,open_ended_4,open_ended_5,E_Scale_score,A_Scale_score,O_Scale_score,C_Scale_score,N_Scale_score
0,10446116527,"I would change my vacation week, because I am ...",I would reach out to my boss and ask him or he...,I would not go. I am a not a social person. I ...,I would ask my manager why he/she gave me such...,I would find this experience super enjoyable. ...,2.25,3.75,3.166667,3.75,2.916667
1,10440100535,I would talk to my colleague and see if they w...,I would continue to work on the project that w...,I would talk to my colleague and try to talk t...,I would feel upset about the negative feedback...,I would find this experience enjoyable. I feel...,4.666667,4.416667,4.583333,5.0,1.333333
2,10462850071,I would feel upset because perhaps I already b...,I would start working on the project now and g...,I would feel guilty about thinking about not g...,I would feel really defensive about it. I woul...,I would find it enjoyable because I would be r...,2.25,4.75,4.083333,4.666667,2.166667


In [3]:
eval_df.head(3)

Unnamed: 0,Respondent_ID,open_ended_1,open_ended_2,open_ended_3,open_ended_4,open_ended_5
0,10460010474,I would look into changing my vacation plans t...,I would work on the project little by little d...,I would probably still go. Just depending on h...,I would see what I could to do to improve the ...,I would absolutely enjoy being involved in thi...
1,10440103178,"I have always been a team player, but this wou...",I would first address my concerns with my boss...,I would be all in. While accompaniment would b...,I definitely would not be happy about this sit...,I would absolutely find this experience enjoya...
2,10440099430,I would try to come to a compromise with my co...,I would go to my boss and ask him if he has an...,"I would go to the event, it's possible that if...",I would pay attention and take an honest look ...,I would find it enjoyable because I love learn...


For clarity and simplicity, the training and criterion columns are declared in these hardcoded variables.

In [4]:
training_columns = ['open_ended_1', 'open_ended_2', 'open_ended_3', 'open_ended_4', 'open_ended_5']
criterion_columns = ['E_Scale_score', 'A_Scale_score', 'O_Scale_score', 'C_Scale_score', 'N_Scale_score']

In [5]:
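def simple_prep(df):
    # Work on a copy so the original frame is left untouched
    df = df.copy()

    for col in training_columns:

        # Lowercase it all (str methods return a new Series, so assign the result back)
        df[col] = df[col].str.lower()

        # Replace non-alphanumeric characters with spaces
        df[col] = df[col].str.replace('[^a-zA-Z0-9]', ' ', regex=True)

    return df

prepped_data = simple_prep(df)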
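A point worth checking in any cleaning step like this: pandas string methods such as `.str.lower()` return a new Series rather than modifying the column in place, so the result has to be assigned back. The short sketch below demonstrates this on a made-up Series.

```python
import pandas as pd

s = pd.Series(["Hello, World!", "It's 5pm."])

# Calling the method without assigning the result leaves the Series unchanged
s.str.lower()
unchanged = s.tolist()

# Assigning (or chaining) the result is what actually applies the cleaning
cleaned = s.str.lower().str.replace(r"[^a-z0-9]", " ", regex=True)
print(cleaned.tolist())
```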


In [6]:
# Generate our training and testing sets (70% train / 30% test)
train, test = train_test_split(prepped_data, test_size=0.3)
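Because the split above is not seeded, the rows in each set (and therefore the error figures) will vary from run to run. If reproducibility matters, passing `random_state` is one option. The sketch below uses a made-up ten-row frame and an arbitrary seed of 42.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame, invented for illustration
toy = pd.DataFrame({
    "text": [f"response {i}" for i in range(10)],
    "score": range(10),
})

# The same seed yields the same partition on every call
train_a, test_a = train_test_split(toy, test_size=0.3, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.3, random_state=42)

print(len(train_a), len(test_a))
print(test_a.index.equals(test_b.index))
```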


### Transformation and Training

First we'll need to define the function for vectorizing each input column.

In [7]:
# Allow a new model to be initialized for each column / regressor
def new_model ():
    # Define the model parameters we'll be using
#     mod = SVR(gamma='scale', C=1.0, epsilon=0.2)
#     mod = KernelRidge(alpha=1.0)
    mod = Ridge(alpha=1.0, random_state=42)
    return mod

# Abstract away as much as possible so we can reuse this general vectorizing and training function
def vectorize_and_train (df_train, y_train):
    
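    # Set the TF-IDF vectorization settings
    # (an integer max_df=3 drops any term that appears in more than 3 documents)
    vectorizer = TfidfVectorizer(min_df=1, max_df=3, ngram_range=(1,4))
    
    # Fit the training data and transform it into vectors
    X_train = vectorizer.fit_transform(df_train) 
    
    # NOTE: this re-transforms the training text, so X_test holds the same vectors as X_train;
    # held-out text is transformed later using the returned vectorizer
    X_test = vectorizer.transform(df_train) 
    
    # y_train is passed through unchanged from the training df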
    
    # Generate a new model instance
    mod = new_model()
    
    # Fit our model with the data
    mod.fit(X_train, y_train)
    
    # return the vectorizer object so we can use it later for evaluation
    return X_train, X_test, y_train, vectorizer, mod
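The `max_df` setting deserves a second look: an integer is read as an absolute document count, while a float is read as a proportion of documents. The sketch below, on a made-up five-document corpus, shows that `max_df=3` removes a term occurring in four documents, whereas the default (`max_df=1.0`) keeps it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration: "would" appears in 4 documents, "talk" in 3
docs = [
    "would talk colleague",
    "would ask manager",
    "would talk manager",
    "would go home",
    "talk later",
]

# Integer max_df: drop terms found in more than 3 documents
vocab_capped = set(TfidfVectorizer(max_df=3).fit(docs).vocabulary_)

# Default max_df=1.0: keep every term
vocab_default = set(TfidfVectorizer().fit(docs).vocabulary_)

print(sorted(vocab_default - vocab_capped))
```

On a full training set, a cap of 3 documents keeps only very rare terms and n-grams, so it may be worth comparing against a proportional setting.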


In [8]:
# Define our model sets
X_train_set = {}    
X_test_set = {}
y_train_set = {}
vect_set = {}
model_set = {}

# Iterate over the training columns
for tc in training_columns:
    
    # Generate a new dict for each column so we can access the models later
    X_train_set[tc] = {}
    X_test_set[tc] = {}
    y_train_set[tc] = {}
    vect_set[tc] = {}
    model_set[tc] = {}
    
    # Within the training columns, iterate over the outcome variables
    for cc in criterion_columns:    
        X_train_set[tc][cc], X_test_set[tc][cc], y_train_set[tc][cc], vect_set[tc][cc], model_set[tc][cc] = vectorize_and_train (train[tc], train[cc])


### Transforming and Predicting on Input Data

Since we've set up the nested model structures, we can move ahead and process the input file we received at the beginning of the notebook.

In [9]:
# Iterate over our outcome variables
for ec_column in criterion_columns:

    # Predict a value from a model trained on each column
    for sj_column in training_columns:

        # Transform the evaluation data into vectorized data on the correct vectorizer vocab
        eval_transformed = vect_set[sj_column][ec_column].transform(eval_df[sj_column])

        # Predict the values with the corresponding model
        y_pred = model_set[sj_column][ec_column].predict(eval_transformed)
        
        print(y_pred[0])

    # NOTE: y_pred is overwritten on each pass of the inner loop, so this assigns only the
    # predictions from the last training column (open_ended_5), not an average across columns
    eval_df[ec_column] = np.mean( np.array([ y_pred ]), axis=0 )


3.5140085753668444
3.3943966919725543
3.503780148443494
3.563645034902285
3.5904874915022553
4.112202502140089
4.1632494095217645
4.257150832443749
4.1932528477100925
4.194302974290611
3.9209795008062844
3.8508090539369375
3.9887728123336292
3.9313180681620983
3.8930188347211003
4.3670323097763495
4.390878934982989
4.444945696326962
4.4483035310711525
4.440098584325213
2.176183193403312
2.1190063575349507
1.9847714043467704
2.0339893045599333
2.003329717317797
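In the loop above, `y_pred` is reassigned for every training column, so the final `np.mean` call only sees the predictions of the last model. A minimal sketch of an unweighted average across columns is shown below. It collects each model's predictions first and then averages them per respondent. The dictionary name `per_column_preds` and the values are made up for illustration.

```python
import numpy as np

# Hypothetical predictions for two respondents from three per-column models
per_column_preds = {
    "open_ended_1": np.array([3.5, 4.0]),
    "open_ended_2": np.array([3.0, 4.5]),
    "open_ended_3": np.array([4.0, 3.5]),
}

# Stack into shape (n_models, n_respondents), then average over the model axis
stacked = np.vstack(list(per_column_preds.values()))
averaged = stacked.mean(axis=0)
print(averaged)
```

In the notebook's loop, the same effect could be had by appending each `y_pred` to a list inside the inner loop and averaging that list once the inner loop finishes.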


In [11]:
# # Generate a Dataframe from the results
# output = pd.concat([eval_df["Respondent_ID"].reset_index(drop=True), results.reset_index(drop=True)], axis=1)

# Drop the training columns so we report on only Respondent ID and their predicted values
output = eval_df.drop(columns=training_columns) 

# Set the headers asked for by the challenge organizers
correct_headers = ['Respondent_ID', 'E_Pred', 'A_Pred', 'O_Pred','C_Pred', 'N_Pred']

# Show the current column titles to ensure they're in the correct order
print('Replace column names from \n{0} to:\n{1}'.format(list(output), correct_headers))

# Rename our column headers so they match what folks expect
# (assign the flat list; wrapping it in another list would create a MultiIndex)
output.columns = correct_headers

# Send the output frame to a CSV, and exclude the indices with index=False
output.to_csv('../data/prediction_output.csv', encoding='utf-8', index=False)

Replace column names from 
['Respondent_ID', 'E_Scale_score', 'A_Scale_score', 'O_Scale_score', 'C_Scale_score', 'N_Scale_score'] to:
['Respondent_ID', 'E_Pred', 'A_Pred', 'O_Pred', 'C_Pred', 'N_Pred']
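Assigning headers by position relies on the columns being in the expected order. An alternative, sketched below on a made-up two-trait frame, is to rename through an explicit mapping so that each source column is tied to its target name. The `header_map` dictionary is an illustrative assumption.

```python
import pandas as pd

# Toy predictions frame, invented for illustration
preds = pd.DataFrame({
    "Respondent_ID": [1, 2],
    "E_Scale_score": [3.5, 3.4],
    "A_Scale_score": [4.1, 4.2],
})

# Map each criterion column to the header expected in the submission file
header_map = {"E_Scale_score": "E_Pred", "A_Scale_score": "A_Pred"}
renamed = preds.rename(columns=header_map)
print(list(renamed.columns))
```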


### Model Testing and Evaluation

Now that we've got our models trained, we'll want to evaluate their performance on the held-out test data.

In [12]:
# JK JK, I'm not actually doing that right now :sadpanda:

# If you want to modify this section to include test evaluation, this would be really helpful
# ---------------------------------------------------------------------------------

# Get the held-out test set
X_test = test[training_columns]
y_test = test[criterion_columns]

# print(X_test[:3])
# print(y_test[:3])


In [33]:
# Instantiate a new dataframe for evaluation only
y_predicted_df = pd.DataFrame()

# Iterate over our outcome variables
for ec_column in criterion_columns:

    # Predict a value from a model trained on each column
    for sj_column in training_columns:

        # Transform the evaluation data into vectorized data on the correct vectorizer vocab
        eval_transformed = vect_set[sj_column][ec_column].transform(X_test[sj_column])

        # Predict the values with the corresponding model
        y_pred = model_set[sj_column][ec_column].predict(eval_transformed)
        print(y_pred)
    # NOTE: y_pred is overwritten on each pass of the inner loop, so this stores only the
    # predictions from the last training column (open_ended_5), not an average across columns
    y_predicted_df[ec_column] = np.mean( np.array([ y_pred ]), axis=0 )
    
print(y_predicted_df.head())

[3.45957222 3.57452484 3.46428695 3.57588623 3.47425639 3.45920644
 3.54648871 3.38413119 3.42161343 3.53897192 3.54729584 3.47581406
 3.24420865 3.44539416 3.51527726 3.58274279 3.37241548 3.46448149
 3.31785898 3.58687715 3.5776391  3.48629105 3.43542475 3.42772713
 3.38032294 3.48209528 3.45360178 3.56265378 3.45677448 3.37136998
 3.51634048 3.61553179 3.51185249 3.37975743 3.58975139 3.49301983
 3.49655928 3.43375418 3.56862513 3.5687758  3.53561423 3.40259709
 3.54713023 3.47970957 3.32749007 3.55481201 3.36137696 3.50634915
 3.44312815 3.50215237 3.46010312 3.42464017 3.37410084 3.47854257
 3.58699831 3.47138943 3.54939898 3.39562751 3.48925846 3.50567788
 3.46041123 3.50246852 3.42842545 3.36545788 3.48625228 3.45590453
 3.58952787 3.44358417 3.57755314 3.53743917 3.64504321 3.5645781
 3.54781445 3.4344367  3.56807471 3.51115075 3.37678116 3.55161812
 3.3986722  3.54315558 3.41234554 3.42511865 3.47833481 3.38730806
 3.48803267 3.56297369 3.4498583  3.48064594 3.50066245 3.55478

[3.49284609 3.27389605 3.53999923 3.54527077 3.26574321 3.50970227
 3.51944879 3.50836014 3.57572544 3.55815122 3.54371758 3.56172658
 3.52793997 3.49952802 3.42943716 3.42760369 3.36245968 3.45508867
 3.53521805 3.37028192 3.3601598  3.49763081 3.35921261 3.42048035
 3.54756182 3.53978008 3.3859617  3.58623582 3.45984523 3.39917831
 3.42276796 3.68878945 3.47951827 3.43265404 3.63355524 3.49944613
 3.50187746 3.52251277 3.43997304 3.77979695 3.39000426 3.59262159
 3.68622011 3.49933923 3.47618846 3.43108848 3.46246708 3.50070435
 3.38828152 3.39738401 3.39900289 3.46445401 3.43964515 3.514665
 3.45581433 3.43943221 3.52005667 3.688843   3.48998015 3.38250443
 3.47686747 3.49551925 3.51103373 3.34346119 3.64561648 3.66666781
 3.23686332 3.44998719 3.52171504 3.49132607 3.37563423 3.55246238
 3.56166616 3.54836372 3.44406446 3.48549736 3.43848561 3.40447322
 3.33085149 3.406666   3.3803369  3.61491488 3.50379526 3.50329166
 3.43102701 3.48949857 3.54949263 3.53528381 3.58159765 3.607953

[4.18839351 4.1896538  4.15265964 4.12788386 4.13941211 4.21817863
 4.08734949 4.20401654 4.19257555 4.24182386 4.27066357 4.10591387
 4.14533476 4.16495995 4.13131153 4.13514411 4.07048256 4.14621436
 4.195706   4.220587   4.11295862 4.16623391 4.07045405 4.19622546
 4.21787933 4.2987076  4.25844707 4.08173504 4.156144   4.19349079
 4.10452393 4.1299351  4.16056401 4.17302168 4.09015435 4.03045835
 4.190934   4.09496648 4.11926724 4.14234286 4.23315116 4.11948943
 4.06381197 4.16493305 4.1544151  4.13636104 4.14362124 3.98599021
 4.11492877 4.17712451 4.19210981 4.20993495 4.16581548 4.1083817
 4.14300067 4.05704162 4.14233628 4.1267637  4.14616885 4.12549339
 4.19538238 4.16256454 4.27538948 4.22266711 4.19859035 4.07553429
 4.1013186  4.13709073 4.16376358 4.09872963 4.14773239 4.25510062
 4.11796973 4.13733543 4.16505662 4.20269011 4.12354702 4.1156152
 4.15742482 4.10902637 4.02428982 4.18503162 4.10305561 4.14292841
 4.28411736 4.0364781  4.0925449  4.17220463 4.31150478 4.133591

[4.03661456 3.95877138 3.9708443  3.86551426 3.91956523 3.88611946
 3.84769044 3.95848486 4.01698568 3.93811863 3.85056489 3.89536072
 3.86584528 3.9583766  3.94186    3.77483325 3.88663249 3.87651643
 3.99664439 3.82292549 3.85311066 3.96523475 3.93026806 4.04522083
 3.9030399  3.82627177 3.98064227 3.91180417 3.71002327 3.94578359
 3.84984271 4.05556538 3.93924284 3.87633156 3.83911921 3.88503624
 3.86886113 3.81791783 3.86986927 3.84385232 3.88739566 3.82593526
 3.89726789 3.83736928 3.91131464 4.02353826 3.91560583 3.89963996
 3.75237197 3.9361361  3.88673704 3.90399871 3.86267852 3.94262304
 3.93559756 3.91449292 3.85098148 4.01389435 3.89362981 3.94207947
 3.96140009 3.8749418  3.8777219  3.98419879 3.84982129 4.04075739
 3.84719753 3.84790468 3.95052044 3.89520509 3.91944373 3.95655968
 3.82394578 3.8609963  3.90329297 3.8855903  3.83073896 3.95684548
 3.78119363 3.96238819 3.89462836 3.90785612 3.82393466 3.93002739
 3.81342927 3.88422761 3.85252172 3.88790083 3.89381108 3.9351

[4.32279765 4.4243804  4.3885155  4.46506864 4.40895764 4.39505673
 4.34892638 4.39381931 4.45536591 4.47235829 4.29921321 4.40602414
 4.23380543 4.36308536 4.33211779 4.42439589 4.43643519 4.44264032
 4.27977236 4.4780426  4.4606727  4.36293645 4.35579021 4.40822539
 4.36508449 4.36958559 4.33796058 4.43241998 4.39872598 4.44638677
 4.48328052 4.41190051 4.43876931 4.38646741 4.43654459 4.43342239
 4.49478592 4.34274581 4.46601981 4.48456919 4.42883462 4.4414046
 4.44188426 4.40353165 4.26399043 4.41543572 4.40598023 4.39575782
 4.46599611 4.43170466 4.33381701 4.34027995 4.39183412 4.30528454
 4.37296532 4.39539155 4.46134057 4.37235258 4.42570102 4.39622348
 4.33623008 4.40907935 4.39591047 4.30150218 4.36697898 4.39842726
 4.47401239 4.42390438 4.46208766 4.34426283 4.49032713 4.40689878
 4.46979288 4.4111938  4.46430091 4.40429164 4.35898011 4.48779653
 4.37922052 4.3458809  4.34394313 4.38512095 4.33997072 4.33376034
 4.44850626 4.41730769 4.37039932 4.40661758 4.40923171 4.39715

[4.46076803 4.32692598 4.39308937 4.39587057 4.36173448 4.44965165
 4.33780403 4.2231463  4.37815574 4.52387624 4.47391952 4.40246123
 4.27506105 4.38315456 4.382149   4.47083557 4.4152446  4.50912992
 4.43896038 4.48490789 4.493106   4.37655385 4.36129031 4.43306919
 4.37065304 4.39954006 4.3218443  4.31384112 4.30235412 4.48502494
 4.49384796 4.28885106 4.45247217 4.48443469 4.38301414 4.42463136
 4.33377536 4.42895139 4.37380927 4.45261805 4.39760768 4.49442088
 4.36483269 4.37449272 4.34374008 4.47623967 4.41986116 4.4801177
 4.49790232 4.47143974 4.4566956  4.46291733 4.41633593 4.44420364
 4.3927158  4.39192053 4.47308988 4.41109561 4.40132264 4.42941785
 4.35405525 4.42351305 4.39197616 4.28244586 4.26826181 4.37899009
 4.42074077 4.40553755 4.38840697 4.40671513 4.42789253 4.39579813
 4.3256058  4.41034532 4.38455997 4.37746945 4.42261077 4.46088827
 4.45646372 4.34942797 4.489301   4.39485695 4.36256544 4.3916844
 4.31767633 4.37609233 4.36366492 4.42032964 4.34162237 4.391554

[1.95723271 2.13437081 2.07347855 2.09070383 2.06893752 2.14315833
 2.06491109 1.96014918 1.90293628 2.04884297 2.08664033 2.04447046
 2.02155888 2.05291445 2.06635022 2.19659528 2.03728783 2.16608018
 2.10591591 2.25328167 2.01940341 2.05152067 1.94834314 1.9215872
 2.12971184 2.13084382 2.03213864 2.25564433 2.15274627 2.05607641
 2.03112859 1.97633794 2.1533101  2.00906511 2.12503092 2.06548579
 2.05502471 2.0526277  2.05901997 2.04428899 2.03071352 2.13990128
 2.11724692 2.11986759 2.1149471  1.95796734 1.94825718 2.04449894
 2.25751803 2.07722853 1.9944546  2.16131337 2.09477945 1.99734653
 1.92369651 2.10997866 2.06675677 1.89318553 1.99954936 1.96336706
 1.94708083 2.02635208 2.03575972 2.03062375 1.94903231 1.9023104
 2.0893556  2.03317413 1.98428282 2.0244741  1.96506988 2.10220641
 2.20935685 2.12242341 2.04816327 2.06276912 1.96633647 1.97280756
 1.97950271 1.96759186 2.10921121 2.04354017 2.09007909 2.07148806
 2.08862067 1.94578994 1.99668277 2.0961256  2.10242578 2.074279

[2.01668608 2.11318401 1.95926836 2.03474283 2.15902173 1.74914557
 1.98910794 2.1359315  1.97893119 2.04490456 2.12578535 2.05212471
 1.86217848 2.01020233 2.16231384 1.95744319 2.00832434 2.02107534
 2.06706364 2.01366937 2.0837717  2.19322007 2.21931018 2.07465241
 2.01134152 2.20554385 2.13741888 2.04081756 2.03815373 2.08365023
 2.09597678 1.9017219  2.03933763 2.06677089 2.03217597 2.03717333
 2.01996756 2.04923053 2.0126961  1.91613853 1.99442021 1.91643434
 1.9665184  2.02977306 2.09900975 2.04988264 2.09336395 2.20588813
 2.19298925 1.96847182 1.96039771 2.10285191 2.04878028 1.99699796
 2.10961484 2.10979434 2.0960577  1.92989517 1.89072858 2.11348055
 2.12728035 2.18311197 2.06662919 2.21816529 1.95271682 1.9961784
 2.34198905 2.09932176 2.04999159 2.18453874 2.00588082 2.03122533
 1.97413957 2.08867456 2.1763171  2.02618507 2.18184325 2.1356404
 2.17321387 2.26747759 2.1754041  1.90571668 2.00253971 2.00963215
 1.9823312  2.18827313 1.96518751 2.02674948 2.04085311 1.976596

In [14]:
errors = np.array([])

# Calculate the mean squared error (MSE) for each column
for col in y_predicted_df:
    sq_error = metrics.mean_squared_error(y_test[col], y_predicted_df[col]).round(4)
    print('The error for {0} was {1}'.format(col, sq_error))
    errors = np.append(errors, sq_error)

# Average the errors
mean_error = np.mean( np.array( errors ), axis=0 )

# Show the error, averaged across the outcome variables
print('\nThe mean error for this model was {0}'.format(mean_error.round(4)))

The error for E_Scale_score was 0.5778
The error for A_Scale_score was 0.3623
The error for O_Scale_score was 0.4732
The error for C_Scale_score was 0.3388
The error for N_Scale_score was 0.6161

The mean error for this model was 0.4736
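The figures above are mean squared errors. If other views of fit are useful, such as RMSE (in the same units as the scale scores), R-squared, or the Pearson correlation between predicted and observed scores, they can be computed as in the sketch below. The arrays are made-up values, not results from this notebook.

```python
import numpy as np
from sklearn import metrics

# Toy observed and predicted scores, invented for illustration
y_true = np.array([2.0, 3.0, 4.0, 5.0])
y_hat = np.array([2.5, 3.0, 3.5, 4.0])

mse = metrics.mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)                            # same units as the scores
r2 = metrics.r2_score(y_true, y_hat)           # proportion of variance explained
pearson_r = np.corrcoef(y_true, y_hat)[0, 1]   # linear association

print(round(mse, 4), round(rmse, 4), round(r2, 4), round(pearson_r, 4))
```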
