# TF-IDF and Linear Regression Demo
https://www.oreilly.com/library/view/feature-engineering-for/9781491953235/ch04.html

Proof of concept for one possible approach, using TF-IDF and ridge regression.

This approach uses no engineered features. The regression model is based exclusively on which words appear in the respondent's answer to a particular question, weighted by TF-IDF.  No other responses are considered, so the model attempts to predict one trait from a single response.  In a final product, we would leverage all of the responses when determining traits.

This notebook demonstrates the high-level concepts involved in converting text into features that can be used for regression.
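As a rough illustration of what "converting text into features" means, the sketch below runs `TfidfVectorizer` on three made-up responses. The sentences and the `toy_` names are hypothetical and are not taken from the SIOP data set. The default vectorizer settings are used here, rather than the `min_df=5` setting used later in this notebook, because the toy corpus is too small for that threshold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy responses, not taken from the SIOP data set
toy_responses = [
    "I would talk to my colleague",
    "I would change my vacation week",
    "I would ask my supervisor",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_responses)

# One row per response, one column per distinct word in the vocabulary
print(toy_matrix.shape)

# Each word maps to a column index; the default tokenizer lowercases the
# text and drops single-character tokens such as "I"
vocab = toy_vectorizer.vocabulary_
print(sorted(vocab))

# Within the first response, a word that appears in every response ("would")
# receives a lower weight than a word unique to that response ("colleague")
first_row = toy_matrix[0].toarray()[0]
print(first_row[vocab["would"]] < first_row[vocab["colleague"]])
```

Each row of the resulting sparse matrix is the feature vector the regression model sees for one response.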

In [1]:
import random
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge


In [2]:
# Import the data to a df
train = pd.read_csv('data/siop_ml_train_participant.csv')

# Limit the results to a single answer / score combination for demonstration
train = train.drop(['Respondent_ID', 'open_ended_2', 'open_ended_3', 'open_ended_4', 'open_ended_5', 
         'A_Scale_score', 'O_Scale_score', 'C_Scale_score', 'N_Scale_score'], axis=1)

# Confirm that the data has been imported and is formatted correctly
train.head(3)


Unnamed: 0,open_ended_1,E_Scale_score
0,"I would change my vacation week, because I am ...",2.25
1,I would talk to my colleague and see if they w...,4.666667
2,I would feel upset because perhaps I already b...,2.25


In [3]:
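def simple_prep (df, column):
    # Work on a copy so the caller's DataFrame keeps the raw text
    df = df.copy()
    
    # Lowercase it all (pandas returns a new Series, so assign it back)
    df[column] = df[column].str.lower()
    
    # Replace non-alphanumeric characters with spaces
    df[column] = df[column].replace('[^a-zA-Z0-9]', ' ', regex = True)
    
    return df

prepped_data = simple_prep (train, 'open_ended_1')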


In [4]:
def vectorize_training_data (df, column):
    # Set the TF-IDF vectorization settings
    vectorizer = TfidfVectorizer(min_df=5)
    
    # Convert text into vectors
    X = vectorizer.fit_transform(df[column]) 
    
    # return the vectorizer object so we can use it later for evaluation
    return X, vectorizer
    
X, vectorizer = vectorize_training_data (prepped_data, 'open_ended_1')


In [5]:
# Utilizing ridge regression for this PoC
mod = Ridge(alpha=1.0, random_state=241)

# Set the criterion column so we can fit the model
y = train['E_Scale_score']

# Fit the model using our training data and criterion column
mod.fit(X, y) 


Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=241, solver='auto', tol=0.001)

###  Viewing Sample Results

Typically, the accuracy of a model is assessed with hold-out groups: the model is trained on x% of the data, then tested on the remaining portion.  Multiple 'folds' of this evaluation can be performed by randomly assigning the data to the training or testing set.  This repeated process is known as cross-validation.
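The cross-validation process described above could be sketched as follows. This is a minimal sketch under assumptions, not part of the original notebook: the helper name `cross_validate_text_model` and its `n_folds` parameter are hypothetical, and it reuses the same `min_df=5` and `alpha=1.0` settings as the cells above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def cross_validate_text_model(texts, scores, n_folds=5):
    # Hypothetical helper: returns one RMSE value per fold.
    # Keeping the vectorizer inside the pipeline means it is refit on each
    # training fold, so the held-out fold does not influence the vocabulary
    pipeline = make_pipeline(TfidfVectorizer(min_df=5), Ridge(alpha=1.0))

    # scikit-learn reports negated errors so that higher is always better
    neg_mse = cross_val_score(pipeline, texts, scores, cv=n_folds,
                              scoring='neg_mean_squared_error')
    return np.sqrt(-neg_mse)
```

With the data loaded above, it could be called as `cross_validate_text_model(train['open_ended_1'], train['E_Scale_score'])`, and the per-fold RMSE values compared against the standard deviation of the criterion.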

For this model we are simply proving out the idea, so we assess accuracy informally by picking a random sample row and using the model to predict its score.  Note that the row is drawn from the training data, so this is a sanity check rather than a true test.

In [6]:
def sample_results (row_num, vectorizer):
    print ('User input for row [{0}]: {1}\n'.format(row_num, train['open_ended_1'][row_num]))
    print ('Actual Score: \t\t{0}'.format(train['E_Scale_score'][row_num].round(2)))
    
    # Use the same data transformation process on the sample row provided
    sample_test_data = vectorizer.transform([train['open_ended_1'][row_num]]) 
    
    # Run that transformed vector against the model by using .predict()
    rslt = mod.predict(sample_test_data)
    
    print ('Predicted Score: \t{0}'.format(rslt[0].round(2)))
    print ('\nCriterion StdDev: \t{0}'.format(train['E_Scale_score'].std().round(2)))


When hosted on GitHub, this sample will only show the most recent run; however, you can clone this repository to run it locally and sample the results yourself.  The results for this model are not great, but better than guessing.  The standard deviation of the criterion is provided for comparison.

In [21]:
# To see results for a specific row, change this value to a row index
# randrange excludes the upper bound, so the index is always a valid row
test_row_index = random.randrange(len(train['open_ended_1']))

sample_results (test_row_index, vectorizer)


User input for row [637]: I would take a look at my own schedule to see if it was more mobile. As it is likely that is is, I would change my schedule.  If I could not change it, I would try to work out a compromise. If that failed, I would ask my supervisor who requested first.

Actual Score: 		1.92
Predicted Score: 	3.2

Criterion StdDev: 	0.79
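A single sampled row says little about overall accuracy. One way to make the comparison against the standard deviation more systematic is to compare the model's root-mean-square error (RMSE) with that of a baseline that always guesses the mean score. The helper below is a hypothetical sketch and is not part of the original notebook.

```python
import numpy as np

def compare_to_mean_baseline(actual, predicted):
    # Hypothetical helper: returns (model RMSE, mean-baseline RMSE)
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)

    # RMSE of the model's predictions
    model_rmse = np.sqrt(np.mean((actual - predicted) ** 2))

    # RMSE of always guessing the mean; this equals the population
    # standard deviation (pandas .std() divides by n - 1 instead of n,
    # so it will differ slightly)
    baseline_rmse = np.sqrt(np.mean((actual - actual.mean()) ** 2))

    return model_rmse, baseline_rmse
```

In this notebook it could be called as `compare_to_mean_baseline(y, mod.predict(X))`. Because `X` is the training data, the model's RMSE would be optimistic, so held-out data would give a fairer comparison.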
