# NLP Encodings and Support Vector Regression

This notebook summarizes some of the approaches we've been testing and applies a few hand-built NLP encodings as new features. SVR is the model we have in mind for this approach, although the cells below fit a Ridge regression while we pilot the features. The primary aim is to gauge the value of manually encoding some features.

In [1]:
import random
import pandas as pd
import numpy as np
import nltk

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.svm import SVR


First, we'll pull in our data and confirm that it loaded correctly.

In [2]:
# Import the data to a df
df = pd.read_csv('data/siop_ml_train_participant.csv')

# Import the evaluation (dev) data to a new df
eval_df = pd.read_csv('data/siop_ml_dev_participant.csv')

# Confirm that the data has been imported and is formatted correctly
df.head(3)


Unnamed: 0,Respondent_ID,open_ended_1,open_ended_2,open_ended_3,open_ended_4,open_ended_5,E_Scale_score,A_Scale_score,O_Scale_score,C_Scale_score,N_Scale_score
0,10446116527,"I would change my vacation week, because I am ...",I would reach out to my boss and ask him or he...,I would not go. I am a not a social person. I ...,I would ask my manager why he/she gave me such...,I would find this experience super enjoyable. ...,2.25,3.75,3.166667,3.75,2.916667
1,10440100535,I would talk to my colleague and see if they w...,I would continue to work on the project that w...,I would talk to my colleague and try to talk t...,I would feel upset about the negative feedback...,I would find this experience enjoyable. I feel...,4.666667,4.416667,4.583333,5.0,1.333333
2,10462850071,I would feel upset because perhaps I already b...,I would start working on the project now and g...,I would feel guilty about thinking about not g...,I would feel really defensive about it. I woul...,I would find it enjoyable because I would be r...,2.25,4.75,4.083333,4.666667,2.166667


In [3]:
training_columns = ['open_ended_1', 'open_ended_2', 'open_ended_3', 'open_ended_4', 'open_ended_5']
criterion_columns = ['E_Scale_score', 'A_Scale_score', 'O_Scale_score', 'C_Scale_score', 'N_Scale_score']

We'll be working with a collection of helper functions. They're kept in the Functions_For_Machine_Learning.ipynb notebook for readability. We may not use them all here, but they can be found and referenced there if needed.

In [7]:
functions = "Functions_For_Machine_Learning.ipynb"
%run $functions

For this first iteration, we'll use the wordLengths, presentTag, and pastTag functions to encode respondent input.
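
Since the helper notebook isn't reproduced here, the following is only a rough sketch of what a word-length encoder could look like. The name `count_long_words`, its `min_length` parameter, and the punctuation handling are all assumptions made for illustration (suggested by the `overfive` column name used below); the real `wordLengths` may behave differently.

```python
def count_long_words(text, min_length=6):
    """Count whitespace-separated tokens with at least `min_length` characters.

    Hypothetical stand-in for wordLengths; not the notebook's actual helper.
    """
    # Strip common punctuation so "week," and "week" have the same length
    words = (token.strip(".,;:!?\"'()") for token in str(text).split())
    return sum(1 for word in words if len(word) >= min_length)


example = "I would change my vacation week, because I am busy."
long_word_count = count_long_words(example)
print(long_word_count)
```

A function with this shape returns one number per response, so it can be passed straight to `.apply()` on a text column.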

In [None]:
# Using the .apply() function, we can run this user function on every row of the
# 'open_ended_1' column, then assign the output to a new column
df['overfive'] = df['open_ended_1'].apply(wordLengths)
print(type(df['overfive']))

# Let's check our shape here:
print(df.shape)

# Let's also view a single, randomly chosen respondent
print(df.iloc[np.random.randint(0, len(df))])

df.head(3)

In [None]:
# For simplicity, let's limit this to one input column, which we'll assign to df_train
df_train = df['open_ended_1']

# Keep the extra feature column separate for now so it's clear where it gets added; we'll call it 'df_train_extra'
df_train_extra = df['overfive']

# Our outcome variable should reflect the same item we chose to code for above
y_train = df['E_Scale_score']

# Check that the two have the same number of rows:
print(df_train.shape)
print(df_train_extra.shape)

In [None]:
# Here we're using a ridge regression model
mod = Ridge(alpha=1.0, random_state=42)

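# Set the TF-IDF vectorization settings. An integer max_df is a document count,
# so max_df=4 drops any term that appears in more than 4 responses
vectorizer = TfidfVectorizer(min_df=1, max_df=4, ngram_range=(1,2))

# Fit the vectorizer on the text inputs and transform them into a sparse TF-IDF matrix
main_transformer = vectorizer.fit_transform(df_train)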
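
The `max_df` setting is easy to misread, so here is a small illustration on an invented three-response corpus (the texts are made up, not taken from the dataset). It compares an integer `max_df`, which scikit-learn treats as an absolute document count, with a float `max_df`, which it treats as a proportion of documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented responses; "would" appears in all three of them
docs = [
    "I would talk to my boss",
    "I would talk to my colleague",
    "I would stay home",
]

# Integer max_df=2: drop terms found in more than 2 documents
vec_count = TfidfVectorizer(min_df=1, max_df=2)
vec_count.fit(docs)
vocab_count = set(vec_count.vocabulary_)

# Float max_df=1.0: keep terms found in up to 100% of documents
vec_ratio = TfidfVectorizer(min_df=1, max_df=1.0)
vec_ratio.fit(docs)
vocab_ratio = set(vec_ratio.vocabulary_)

# Terms that the integer cutoff removed
print(sorted(vocab_ratio - vocab_count))
```

With a small integer such as `max_df=4` on a corpus of many responses, only rare terms survive, which may or may not be the intent.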


In [None]:
# Convert our sparse matrix into a dense array, and set the feature names as columns
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
dense_transformer = pd.DataFrame(main_transformer.toarray(), columns=vectorizer.get_feature_names_out())

# Since we're manipulating the columns a bunch, let's make sure we haven't broken anything
print(dense_transformer.shape)
print(df_train_extra.shape)

In [None]:
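# Concatenate the two dataframes; drop=True keeps the old index from being added as a feature column
X_train = pd.concat([dense_transformer.reset_index(drop=True), df_train_extra.reset_index(drop=True)], axis=1)

# Convert text into vectors. Note that this reuses the training text, so X_test is not held-out data
X_test = vectorizer.transform(df_train)

# We're using the same y-values (E_Scale_score) defined from the training df above

# Fit our model with the data
mod.fit(X_train, y_train)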
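
Densifying the TF-IDF matrix works for a pilot, but it can get memory-hungry as the n-gram vocabulary grows. As one possible alternative, the extra feature can be appended while everything stays sparse, using `scipy.sparse.hstack`. The texts, counts, and scores below are invented for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

# Invented stand-ins for one text column, one hand-built feature, and one outcome
texts = [
    "I would talk to my colleague",
    "I would change my vacation week",
    "I would feel upset about it",
    "I would ask my manager for help",
]
extra_feature = np.array([1, 2, 0, 1])   # e.g. a long-word count per response
scores = np.array([4.5, 2.25, 2.0, 3.0])

tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(texts)

# Reshape the extra feature into a single column and stack it to the right of the TF-IDF columns
extra_column = csr_matrix(extra_feature.reshape(-1, 1))
X_sparse = hstack([tfidf, extra_column]).tocsr()

# Ridge accepts sparse input directly
sparse_model = Ridge(alpha=1.0, random_state=42)
sparse_model.fit(X_sparse, scores)
print(X_sparse.shape)
```

One thing to keep in mind with either approach: TF-IDF values fall between 0 and 1, while a raw count does not, so scaling the extra column may be worth trying.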


In [None]:
# Allow a new model to be initialized for each column / regressor
def new_model():
    # Define the model parameters we'll be using
    mod = Ridge(alpha=1.0, random_state=42)
    return mod

# Abstract away as much as possible so we can reuse this general vectorizing and training function
def vectorize_and_train(df_train, y_train):
    
    # Set the TF-IDF vectorization settings (integer max_df=3 drops terms found in more than 3 responses)
    vectorizer = TfidfVectorizer(min_df=1, max_df=3, ngram_range=(1,4))
    
    # Fit the training data and transform it into vectors
    X_train = vectorizer.fit_transform(df_train) 
    
    # Vectorize again with the fitted vectorizer. Note that this reuses the
    # training text, so X_test is not held-out data
    X_test = vectorizer.transform(df_train) 
    
    # Generate a new model instance
    mod = new_model()
    
    # Fit our model with the data
    mod.fit(X_train, y_train)
    
    # Return the matrices, targets, fitted vectorizer, and fitted model so we can use them later for evaluation
    return X_train, X_test, y_train, vectorizer, mod
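
As a hedged sketch of where this could go next, the example below holds out part of the data with `train_test_split`, fits the vectorizer on the training rows only, and trains a fresh model per criterion column. It also swaps in a linear-kernel `SVR` alongside Ridge. The toy texts and scores, the `model_factories` dictionary, and the hyperparameters are all assumptions for illustration rather than settings taken from this notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Toy stand-in for the training data (texts and scores are invented)
toy = pd.DataFrame({
    "open_ended_1": [
        "I would talk to my colleague about swapping weeks",
        "I would change my vacation week without complaint",
        "I would feel upset but keep it to myself",
        "I would ask my manager to decide for us",
        "I would happily trade weeks with my colleague",
        "I would refuse to change my plans",
    ],
    "E_Scale_score": [4.5, 2.25, 2.0, 3.0, 4.75, 2.5],
    "A_Scale_score": [4.25, 4.75, 3.5, 3.75, 4.5, 2.0],
})

# Hold out two rows so the test matrix is genuinely unseen text
train_df, test_df = train_test_split(toy, test_size=2, random_state=42)

# Fit the vocabulary on training text only, then reuse it for the held-out text
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_df["open_ended_1"])
X_test = vectorizer.transform(test_df["open_ended_1"])

# Each factory returns a fresh, unfitted model (hypothetical settings)
model_factories = {
    "ridge": lambda: Ridge(alpha=1.0, random_state=42),
    "svr": lambda: SVR(kernel="linear", C=1.0),
}

predictions = {}
for target in ["E_Scale_score", "A_Scale_score"]:
    for name, make_model in model_factories.items():
        model = make_model()
        model.fit(X_train, train_df[target])
        predictions[(name, target)] = model.predict(X_test)

print(sorted(predictions))
```

Using factories rather than shared model objects mirrors the idea behind `new_model()`: no fitted state leaks from one trait to the next.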
