# Predicting from Multiple Inputs

As we explore more options around encoded values, it'll be helpful to include more predictors and engineered features.  In order to do this, we'll need to be able to encode features and then factor them into models.  This notebook works through these examples with a sample encoding function and can be used as a reference for later.

In [1]:
import random
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge


In [2]:
# Import the data to a df
df = pd.read_csv('data/siop_ml_train_participant.csv')

# Confirm that the data has been imported and is formatted correctly
df.head(3)


Unnamed: 0,Respondent_ID,open_ended_1,open_ended_2,open_ended_3,open_ended_4,open_ended_5,E_Scale_score,A_Scale_score,O_Scale_score,C_Scale_score,N_Scale_score
0,10446116527,"I would change my vacation week, because I am ...",I would reach out to my boss and ask him or he...,I would not go. I am a not a social person. I ...,I would ask my manager why he/she gave me such...,I would find this experience super enjoyable. ...,2.25,3.75,3.166667,3.75,2.916667
1,10440100535,I would talk to my colleague and see if they w...,I would continue to work on the project that w...,I would talk to my colleague and try to talk t...,I would feel upset about the negative feedback...,I would find this experience enjoyable. I feel...,4.666667,4.416667,4.583333,5.0,1.333333
2,10462850071,I would feel upset because perhaps I already b...,I would start working on the project now and g...,I would feel guilty about thinking about not g...,I would feel really defensive about it. I woul...,I would find it enjoyable because I would be r...,2.25,4.75,4.083333,4.666667,2.166667


In most notebooks, this is where you'd separate your training data in order to evaluate model effectiveness.  Instead, we're just using the entire dataset and we'll be training on the entire input.

In [3]:
train = df

In [4]:
# This step simply adds a dummy column, representing the addition of a new feature

# We're adding a simple title here, however this is a dummy variable
val_name = "extraversion_encoding"

# We're creating a function that will encode a given value as either high or low
def code_high_or_low (value):
    # If our given value is over 3 (the average E score) they are 'high'
    if value > 3:
        coding = 1
    # Otherwise, give them a 'low' encoding
    else:
        coding = 0
    return coding

# Using the .apply() function, we can take this user function and specify the 'E_Scale_score' 
# column for our function.  Then, assign the output of that function to a new column in our training dataframe
train[val_name] = train.E_Scale_score.apply(code_high_or_low)

# Let's check our shape here:
print(train.shape)

# Let's also view a specific, random respondent.  If they are 
print (train.iloc[np.random.randint(0, len(train))])

(1088, 12)
Respondent_ID                                                  10440099152
open_ended_1             I would find out what everyone's plans are. Th...
open_ended_2             I would get this project done as quickly as po...
open_ended_3             I would still attend the networking meeting. I...
open_ended_4             I would discuss the feedback with my superviso...
open_ended_5             I would absolutely jump at the opportunity to ...
E_Scale_score                                                      3.16667
A_Scale_score                                                      4.16667
O_Scale_score                                                      3.91667
C_Scale_score                                                      4.66667
N_Scale_score                                                      1.91667
extraversion_encoding                                                    1
Name: 224, dtype: object


The cell above should have a randomly selected participant (`train.iloc[np.random.randint(0, len(train))]`) - if their extraversion score is above 3, their `extraversion_encoding` should be 1, otherwise it should be 0.

In [5]:
# For simplicity, let's limit this to one input column, which we'll assign to df_train
df_train = train['open_ended_1']

# To understand where we'll be adding the extra feature column here we'll call it 'df_train_extra'
df_train_extra = train['extraversion_encoding']

# Our outcome variable should reflect the same item we chose to code for above
y_train = train['E_Scale_score']

# Check to make sure that the two of our shapes are equal:
print(df_train.shape)
print(df_train_extra.shape)


(1088,)
(1088,)


In [6]:
# Here we're using a ridge regression model
mod = Ridge(alpha=1.0, random_state=42)

# Set the TF-IDF vectorization settings
vectorizer = TfidfVectorizer(min_df=1, max_df=4, ngram_range=(1,2))

# Here we start by fitting our vectors to the text inputs
main_transformer = vectorizer.fit_transform(df_train) 


In order to combine the vectorized inputs, we'll need to make them a dense array. Other vectorizers support todense() as described in the Towards Data Science post linked below, however the TFIDF Vectorizer instead uses toarray().

https://towardsdatascience.com/natural-language-processing-on-multiple-columns-in-python-554043e05308

Below, we expect to see a dense matrix with many features (columns) but they should have the same number of rows

In [7]:
# Condense our sparse matrix into an array, and set the feature names as columns
dense_transformer = pd.DataFrame(main_transformer.toarray(), columns=vectorizer.get_feature_names())

# Since we're manipulating the columns a bunch, let's make sure we haven't buggered anything
print (dense_transformer.shape)
print (df_train_extra.shape)

(1088, 16755)
(1088,)


Now that we've got a dense array and dataframe column for the encoded values, we can combine them with the pandas concat function.  In order to make sure that we don't mix up our indices, we'll need to reset them here, otherwise the .shapes get out of whack.

Lastly, we can transform the output with our fancy new combined dataframe, and train the regression model on that.

In [8]:
# Concatenate the two dataframes
X_train = pd.concat([dense_transformer.reset_index(), df_train_extra.reset_index()], axis=1)

# Convert text into vectors
X_test = vectorizer.transform(df_train) 

# We're using the same y-values from the training df
y_train = y_train

# # Fit our model with the data
mod.fit(X_train, y_train)


Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=42, solver='auto', tol=0.001)

Great.  Now the model is trained and we can make predictions with it.  Since we encoded all high extraversion values with a 1, this should give the model a strong weight for the value of this fictional 'extraversion value' we've generated.

The snippet below takes a standard chunk of text, however the 'extraversion value' is randomized.

In [9]:
# Select a random participant
respondent_id = np.random.randint(0, len(df))

# Randomly generate a high or low encoding
extraversion_encoding = np.random.randint(0, 2)

# Set the respondent's true value for 
true_extraversion_value = y_train[respondent_id]

# Select a random respondent's response for item #1
example_text = df['open_ended_1'][respondent_id]

# Create a new Dataframe so we can concatenate it for transformation later
extra_value_for_test = pd.DataFrame(columns=['ext_coding'])

# Assign the randomly generated value to the 0th row in the dataframe
extra_value_for_test.loc[0] = extraversion_encoding

# Transform the testing text and condense it
main_transformer = vectorizer.transform([example_text]) 
dense_transformer = pd.DataFrame(main_transformer.toarray(), columns=vectorizer.get_feature_names())

# Generate an example input by concatenating our text and encoded dataframes
example_test = pd.concat([dense_transformer.reset_index(), extra_value_for_test.reset_index()], axis=1)

# Generate the prediction for this example input
prediction = mod.predict(example_test)

print("Actual extraversion is {0}".format(true_extraversion_value.round(2)))
print("Predicted extraversion is {0}\nRespondent randomly assigned \"{2}\" as their E encoding".format(prediction[0].round(2), example_text, extraversion_encoding))
print("\nRespondent Input: \n{0}".format(example_text))

Actual extraversion is 3.67
Predicted extraversion is 3.72
Respondent randomly assigned "1" as their E encoding

Respondent Input: 
I would not give in either.  If I gave in, my colleague might think that I am weak and run over me in the future.  I would suggest that we either split up the days, or come to some sort of other compromise.


### Summary

If the example above is working correctly, you should generally see that the predicted value for extraversion is swayed by encoding for the `extraversion_encoding`. When this encoding is accurate, such as for cases where actual extraversion is high and their encoding is high as well, we should see that the prediction is closer.  

If the extraversion is low and the encoding is still high however (or the reverse), then we should expect to see a greater distance between the predicted and actual values.
