# Final Submission Notebook

Up to this point we've created a few different notebooks to pilot different approaches.  In this notebook, we'll be pulling these approaches together into a single submission.

This notebook works in conjunction with two input CSV's, which it will train from and evaluate on.  The first CSV is loaded into a dataframe known in the notebook as `df`.  As part of the data preparation process, it will encode the user inputs for each of the columns.  These preparation steps will be wrapped in a preparation function so they can be called for evaluation.

The training set includes 5 outcome variables we wish to predict for. We will iterate over these and train a distinct model for each.

Once trained on this input CSV, it will evaluate performance on the test dataset, known in the notebook as `eval_df`

In [1]:
import random
import pandas as pd
import numpy as np
import scipy as sp

from sklearn.linear_model import Ridge

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import metrics


In [2]:
# Import the data to a df
df = pd.read_csv('data/siop_ml_train_participant.csv')

# Import the test data to a new df
eval_df = pd.read_csv('data/siop_ml_dev_participant.csv')

# Confirm that the data has been imported and is formatted correctly
df.head(3)


Unnamed: 0,Respondent_ID,open_ended_1,open_ended_2,open_ended_3,open_ended_4,open_ended_5,E_Scale_score,A_Scale_score,O_Scale_score,C_Scale_score,N_Scale_score
0,10446116527,"I would change my vacation week, because I am ...",I would reach out to my boss and ask him or he...,I would not go. I am a not a social person. I ...,I would ask my manager why he/she gave me such...,I would find this experience super enjoyable. ...,2.25,3.75,3.166667,3.75,2.916667
1,10440100535,I would talk to my colleague and see if they w...,I would continue to work on the project that w...,I would talk to my colleague and try to talk t...,I would feel upset about the negative feedback...,I would find this experience enjoyable. I feel...,4.666667,4.416667,4.583333,5.0,1.333333
2,10462850071,I would feel upset because perhaps I already b...,I would start working on the project now and g...,I would feel guilty about thinking about not g...,I would feel really defensive about it. I woul...,I would find it enjoyable because I would be r...,2.25,4.75,4.083333,4.666667,2.166667


In [3]:
eval_df.head(3)

Unnamed: 0,Respondent_ID,open_ended_1,open_ended_2,open_ended_3,open_ended_4,open_ended_5
0,10460010474,I would look into changing my vacation plans t...,I would work on the project little by little d...,I would probably still go. Just depending on h...,I would see what I could to do to improve the ...,I would absolutely enjoy being involved in thi...
1,10440103178,"I have always been a team player, but this wou...",I would first address my concerns with my boss...,I would be all in. While accompaniment would b...,I definitely would not be happy about this sit...,I would absolutely find this experience enjoya...
2,10440099430,I would try to come to a compromise with my co...,I would go to my boss and ask him if he has an...,"I would go to the event, it's possible that if...",I would pay attention and take an honest look ...,I would find it enjoyable because I love learn...


For clarity and simplicity, the training and criterion columns are declared in these hardcoded variables

In [4]:
training_columns = ['open_ended_1', 'open_ended_2', 'open_ended_3', 'open_ended_4', 'open_ended_5']
criterion_columns = ['E_Scale_score', 'A_Scale_score', 'O_Scale_score', 'C_Scale_score', 'N_Scale_score']

### Data Preparation and Encoding Functions

There's a number of steps that will need to be applied to the training and evaluation sets.  This area covers the process of doing some light cleaning, use of the NLP-based functions found in another notebook, Functions_For_Machine_Learning.ipynb

In [None]:
def simple_prep (df):
    for col in training_columns:
        
        # Lowercase all the columns
        df[col] = df[col].str.lower()
    
        # Remove non-alphanumeric characters
        df[col] = df[col].replace('[^a-zA-Z0-9]', ' ', regex = True)
    
    return df

# Remove capitalization and special characters
prepped_data = simple_prep (df)

In [None]:
# We'll need these functions for some of the prep work
filename = "Functions_For_Machine_Learning.ipynb"
%run $filename

In [None]:
# Generate our testing and training sets and show their relative sizes
train, test = train_test_split(prepped_data, test_size=0.3)


### Transformation and Training

First we'll need to defind the function for vectorizing each input column.

In [None]:
# Allow a new model to be initialized for each column / regressor
def new_model ():
    # Define the model parameters we'll be using
    mod = Ridge(alpha=1.0, random_state=42)
    return mod

# Abstract away as much as possible so we can reuse this general vectorizing and training function
def vectorize_and_train (df_train, y_train, training_col):

    # For now, let's limit this to the wordLengths function only
    df_train['past_tense_{0}'.format(training_col)] = df_train[training_col].apply(pastTag) 
    df_train['present_tense_{0}'.format(training_col)] = df_train[training_col].apply(presentTag) 
    df_train['articles_{0}'.format(training_col)] = df_train[training_col].apply(articleTag) 
    df_train['word_lengths_{0}'.format(training_col)] = df_train[training_col].apply(wordLengths) 
    df_train['first_person_{0}'.format(training_col)] = df_train[training_col].apply(firstpersonTag) 
    df_train['negative_words_{0}'.format(training_col)] = df_train[training_col].apply(negAffect) 
    df_train['positive_words_{0}'.format(training_col)] = df_train[training_col].apply(posAffect) 
    df_train['tentative_words_{0}'.format(training_col)] = df_train[training_col].apply(tentAffect) 
    df_train['cause_words_{0}'.format(training_col)] = df_train[training_col].apply(causeAffect)

    # Set the TF-IDF vectorization settings
    vectorizer = TfidfVectorizer(min_df=1, max_df=3, ngram_range=(1,4))
    
    # Fit the training data and tranform it into vectors
    X_train = vectorizer.fit_transform(df_train[training_col])

    # Convert test text into vectors
    X_test = vectorizer.transform(df_train[training_col]) 
    
    # Condense our sparse matrix into an array, and set the feature names as columns
    dense_transformer = pd.DataFrame(X_train.toarray(), columns=vectorizer.get_feature_names())
    
    # Concatenate the two dataframes
    concat_df = pd.concat([dense_transformer.reset_index(), df[training_col].reset_index()], axis=1)

    # We're using the same y-values from the training df
    y_train = y_train
    
    # Generate a new model instance
    mod = new_model()

    # Fit our model with the data
    mod.fit(X_train, y_train)
    
    # return the vectorizer object so we can use it later for evaluation
    return X_train, X_test, y_train, vectorizer, mod


In [None]:
# Define our model sets
X_train_set = {}    
X_test_set = {}
y_train_set = {}
vect_set = {}
model_set = {}

# Iterate over the training columns
for tc in training_columns:
    
    # Generate a new dict for each column so we can access the models later
    X_train_set[tc] = {}
    X_test_set[tc] = {}
    y_train_set[tc] = {}
    vect_set[tc] = {}
    model_set[tc] = {}
    
    # Within the training columns, iterate over the outcome variables
    for cc in criterion_columns:    
        X_train_set[tc][cc], X_test_set[tc][cc], y_train_set[tc][cc], vect_set[tc][cc], model_set[tc][cc] = vectorize_and_train (train[[tc]], train[[cc]], tc)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy [ipykernel_launcher.py:11]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy [ipykernel_launcher.py:12]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy [ipykernel_launcher.py:13]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydat

In [None]:
# Iterate over our outcome variables
for ec_column in criterion_columns:

    # Predict a value from a model trained on each column
    for sj_column in training_columns:

        # Transform the evaluation data into vectorized data on the correct vectorizer vocab
        eval_transformed = vect_set[sj_column][ec_column].transform(eval_df[sj_column])

        # Predict the values with the corresponding model
        y_pred = model_set[sj_column][ec_column].predict(eval_transformed)

    # Average the voting results and assign them to the criterion columns in the results
    eval_df[ec_column] = np.mean( np.array([ y_pred ]), axis=0 )


In [None]:
# # Generate a Dataframe from the results
# output = pd.concat([eval_df["Respondent_ID"].reset_index(drop=True), results.reset_index(drop=True)], axis=1)

# Drop the training columns so we report on only Respondent ID and their predicted values
output = eval_df.drop(columns=training_columns) 

# Set the headers asked for by the challenge organizers
correct_headers = ['Respondent_ID', 'E_Pred', 'A_Pred', 'O_Pred','C_Pred', 'N_Pred']

# Show the current column titles to ensure they're in the correct order
print('Replace column names from \n{0} to:\n{1}'.format(list(output), correct_headers))

# Rename our column headers so they match what folks expect
output.columns = [correct_headers]

# Send the output frame to a CSV, and exclude the indices with index=False
output.to_csv('data/prediction_output.csv', encoding='utf-8', index=False)