
# Background

I am a psychology PhD student, and for my dissertation I recently collected data in which I asked participants to tell me:

*   1) What motivated them to get the last clothing item they purchased (*called Motivation_Description*)
*   2) How likely they are to get a new clothing item in the future, even if it is similar to something they already have in their closet (*called author*)

# Planned Analysis 

The purpose of this project is to analyze people's motivation to engage in hyperconsumption of clothing. Hyperconsumption is the consumption of material goods beyond one's needs.

Specifically, in the analysis below, I use Naive Bayes to test whether people's responses to 1) above could be used to predict their responses to 2) above. 

In other words, can people's descriptions of their motivation for making their last clothing item purchase be used to predict how likely they are to hyperconsume in the clothing domain in the future?

I split the data into a training set (about 70% of the data) and a testing set (about 30%). I used this ratio because it's desirable to build the model from a generous share of the data.

# Interpretation of Results

See the end of the document.

## Importing Libraries

In [0]:
import math
from typing import TypeVar, Callable

import numpy as np
from numpy.linalg import norm
import pandas as pd
import matplotlib.pyplot as plt

# Type aliases used in the function signatures below
dframe = TypeVar('pd.core.frame.DataFrame')
narray = TypeVar('numpy.ndarray')

## Importing Spacy

In [0]:
import spacy
!python -m spacy download en_core_web_md # download the medium English pipeline (includes word vectors)
import en_core_web_md
nlp = en_core_web_md.load()

## Importing UO Puddles

In [0]:
!rm -rf 'uo_puddles' # remove any earlier clone; -f avoids an error when the folder does not exist yet
my_github_name = 'uo-puddles' # can replace with your account name
clone_url = f'https://github.com/{my_github_name}/uo_puddles.git'
!git clone $clone_url
import uo_puddles.uo_puddles as up

## Importing Dataset from Github

In [0]:
url = 'https://raw.githubusercontent.com/saralieber/CS_Studio/master/Short_GS_Data_for_DS_Proj_new.csv'

df1 = pd.read_csv(url)

In [51]:
df1.head()

Unnamed: 0,Motivation_Description,author
0,I wanted to start running and needed a way to ...,3
1,I had lost some weight and needed new pants to...,3
2,"I really like the movie it portrays, I like we...",2
3,My mom was going to get rid of it and I though...,3
4,I got this item because it looked warm and sof...,4


In [6]:
len(df1)

351

## Randomly Shuffle the Data Set

In [52]:
set_seed = 1234

rsgen = np.random.RandomState(set_seed)

# Shuffled Dataset
shuffled_df1 = df1.sample(frac=1, random_state = rsgen).reset_index(drop=True)
len(shuffled_df1)

348
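As a small illustration of why the seed matters, the sketch below shuffles a toy table twice with the same seed and gets the same row order both times. The helper name `shuffle_rows` and the toy data are made up for this example; only `DataFrame.sample` and `np.random.RandomState` come from the cell above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df1 (hypothetical data, not the real responses)
toy = pd.DataFrame({'Motivation_Description': ['a', 'b', 'c', 'd', 'e'],
                    'author': [1, 2, 3, 4, 5]})

def shuffle_rows(df, seed):
    """Return all rows of df in a random order that is fixed by the seed."""
    rng = np.random.RandomState(seed)
    return df.sample(frac=1, random_state=rng).reset_index(drop=True)

first = shuffle_rows(toy, 1234)
second = shuffle_rows(toy, 1234)

# Same seed, same order; the rows themselves are unchanged
print(first.equals(second))
```

Because the order is fixed by the seed, the training/testing split that follows is the same on every run.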

## Calculate Size of Training Set and Testing Set

Note: Training set is 70% of the data, testing set is 30%.

In [53]:
# Calculating n's for Testing and Training Tables
n_testing = (len(shuffled_df1))*.3
n_testing # 104

104.39999999999999

In [54]:
n_training = (len(shuffled_df1)) - n_testing
n_training # 244

243.60000000000002
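The two calculations above return fractional sizes, which cannot be used directly as row counts. One possible way to get whole numbers is sketched below; the helper `split_sizes` is hypothetical and rounding to the nearest integer is an assumption, not necessarily how the split sizes in this notebook were chosen.

```python
def split_sizes(n_rows, test_fraction=0.3):
    """Return (n_training, n_testing) as whole numbers that add up to n_rows."""
    n_test = round(n_rows * test_fraction)  # nearest whole number of testing rows
    n_train = n_rows - n_test               # everything else goes to training
    return n_train, n_test

n_train, n_test = split_sizes(348)
print(n_train, n_test)
```

Computing the training size as the remainder guarantees that the two sets together cover every row exactly once.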

## Create the Training Set and Testing Set

And pull out the columns for the predictor variable
*   Called *Motivation_Description*, which is an explanation of what motivated each participant to get their last clothing item

and outcome variable 
*   Called *author*, which is participants' response to "Indicate the extent to which you agree with each of the following statements from 1 (Strongly disagree) to 5 (Strongly agree): - If I desire a new piece of clothing this year, there is a good chance I will get it, even if it is already similar to something else that I have in my closet."

In [0]:
# Training Set (first 246 shuffled rows)
training_table = shuffled_df1[:246].reset_index(drop=True)

# Testing Set (remaining rows)
testing_table = shuffled_df1[246:].reset_index(drop=True)


# Grab the Motivation_Description and author columns from the training and testing sets and convert them into lists (easier to work with text data this way)
training_text = training_table['Motivation_Description'].tolist()
training_outcome = training_table['author'].tolist()

testing_text = testing_table['Motivation_Description'].tolist()
testing_outcome = testing_table['author'].tolist()

## Convert the list items to strings

In [0]:
training_outcome_str = []
for item in training_outcome:
    training_outcome_str.append(str(item))

In [0]:
training_text_str = []
for item in training_text:
    training_text_str.append(str(item))

In [0]:
testing_outcome_str = []
for item in testing_outcome:
    testing_outcome_str.append(str(item))

In [0]:
testing_text_str = []
for item in testing_text:
    testing_text_str.append(str(item))
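The four loops above all do the same thing, so they could be written with one small helper. This is only a sketch: `to_str_list` is a made-up name, and the toy values stand in for the real columns.

```python
def to_str_list(values):
    """Convert every item of a list to a string, keeping the order."""
    return [str(v) for v in values]

# Outcome scores become the class labels '1' to '5'
print(to_str_list([3, 3, 2, 4]))

# A missing response (NaN) becomes the literal text 'nan',
# so it may be worth checking for blanks before this step
print(to_str_list(['I needed new pants', float('nan')]))
```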

## Build a word-bag from the training set

In [61]:
word_table = pd.DataFrame(columns=['word', '1', '2', '3', '4', '5']) 
word_table.head() # currently empty

Unnamed: 0,word,1,2,3,4,5


In [0]:
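# For every training response, count each kept word under that participant's outcome score
for i in range(len(training_text_str)):
  training_sentences = training_text_str[i].lower()
  doc = nlp(training_sentences)  # tokenize the response with spaCy
  author = training_outcome_str[i]  # outcome score ('1' to '5') for this response

  for token in doc:
    # keep alphabetic tokens that are not stop words
    if token.is_alpha and not token.is_stop:
      up.update_gothic_row(word_table, token.text, author)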
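The source of `up.update_gothic_row` is not shown here, so the sketch below is only an assumption about what the loop produces: a table with one row per word and one count column per outcome score. `build_word_table` and `simple_tokenize` are hypothetical names, and the whitespace tokenizer is a stand-in for spaCy so that the example runs on its own.

```python
from collections import Counter, defaultdict
import pandas as pd

CLASSES = ['1', '2', '3', '4', '5']

def simple_tokenize(text):
    """Stand-in tokenizer: lowercase, split on spaces, keep alphabetic words."""
    return [w for w in text.lower().split() if w.isalpha()]

def build_word_table(texts, labels, tokenize=simple_tokenize):
    """Count how often each word appears in responses with each outcome score."""
    counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        for word in tokenize(text):
            counts[word][label] += 1
    rows = {word: [c[label] for label in CLASSES] for word, c in counts.items()}
    table = pd.DataFrame.from_dict(rows, orient='index', columns=CLASSES)
    table.index.name = 'word'
    return table.sort_index()

toy_table = build_word_table(['Warm and soft', 'soft jacket'], ['4', '3'])
print(toy_table)
```

In this toy table, "soft" gets a count of 1 under both 3 and 4 because it appears in one response from each group.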

## Sort the table alphabetically by word

In [63]:
sorted_word_table = word_table.sort_values(by=['word'])
sorted_word_table = sorted_word_table.reset_index(drop=True)
sorted_word_table.head()

Unnamed: 0,word,1,2,3,4,5
0,able,0,0,0,1,0
1,accepted,0,1,0,0,0
2,account,0,0,1,0,0
3,activities,0,0,1,0,1
4,actually,0,0,0,1,1


## Set word column as the table index

In [64]:
sorted_word_table = sorted_word_table.set_index('word')  
sorted_word_table.head() 

word,1,2,3,4,5
able,0,0,0,1,0
accepted,0,1,0,0,0
account,0,0,1,0,0
activities,0,0,1,0,1
actually,0,0,0,1,1


## Perform Naive Bayes

I slightly modified the original function so that it works with the way my data is formatted.

In [0]:
def bayes_gothic(evidence:list, evidence_bag:dframe, training_table:dframe, laplace:float=1.0) -> tuple:
  assert isinstance(evidence, list), f'evidence not a list but instead a {type(evidence)}'
  assert all([isinstance(item, str) for item in evidence]), f'evidence must be list of strings (not spacy tokens)'
  assert isinstance(evidence_bag, pd.core.frame.DataFrame), f'evidence_bag not a dframe but instead a {type(evidence_bag)}'
  assert isinstance(training_table, pd.core.frame.DataFrame), f'training_table not a dataframe but instead a {type(training_table)}'
  assert 'author' in training_table, f'author column is not found in training_table'

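  # NOTE: this reads the global training_outcome_str rather than the training_table argument,
  # and set() keeps only the unique labels, so each label that occurs is counted once below
  # (the class counts are all 1 and the priors are equal across classes).
  author_list = set(training_outcome_str)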
  mapping = ['1', '2', '3', '4', '5']
  label_list = [mapping.index(auth) for auth in author_list]
  n_classes = len(set(label_list))
  #assert len(list(evidence_bag.values())[0]) == n_classes, f'Values in evidence_bag do not match number of unique classes ({n_classes}) in labels.'

  word_list = evidence_bag.index.values.tolist()

  evidence = list(set(evidence))  #remove duplicates
  counts = []
  probs = []
  for i in range(n_classes):
    ct = label_list.count(i)
    counts.append(ct)
    probs.append(ct/len(label_list))

  #now have counts and probs for all classes

  #CONSIDER CHANGING TO LN OF PRODUCTS. END UP SUMMING LOGS OF EACH ITEM. AVOIDS UNDERFLOW.
  results = []
  for a_class in range(n_classes):
    numerator = 1
    for ei in evidence:
      if ei not in word_list:
        #did not see word in training set
        the_value =  1/(counts[a_class] + len(evidence_bag))
      else:
        values = evidence_bag.loc[ei].tolist()
        the_value = ((values[a_class]+laplace)/(counts[a_class] + laplace*len(evidence_bag)))
      numerator *= the_value
    #if (numerator * probs[a_class]) == 0: print(evidence)
    results.append(max(numerator * probs[a_class], 2.2250738585072014e-308))

  return tuple(results)
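The comment inside the function suggests switching to sums of logs to avoid underflow. A minimal sketch of that idea follows. It is not a drop-in replacement for `bayes_gothic`: the name `log_scores` and its inputs (a plain dict of per-class word counts and a list of class sizes) are assumptions made to keep the example self-contained. With `laplace=1.0` an unseen word gets the same ratio as in the function above.

```python
import math

def log_scores(evidence, word_counts, class_counts, laplace=1.0):
    """Return one log-score per class; the largest one is the predicted class.

    word_counts:  dict mapping word -> list of counts, one per class
    class_counts: number of training responses in each class (all must be > 0)
    """
    vocab_size = len(word_counts)
    total = sum(class_counts)
    scores = []
    for k, n_k in enumerate(class_counts):
        score = math.log(n_k / total)          # log of the class prior
        for word in set(evidence):             # duplicates removed, as above
            seen = word_counts.get(word, [0] * len(class_counts))[k]
            score += math.log((seen + laplace) / (n_k + laplace * vocab_size))
        scores.append(score)
    return tuple(scores)

# Toy example with two classes
toy_counts = {'soft': [0, 3], 'warm': [2, 0]}
scores = log_scores(['soft'], toy_counts, class_counts=[2, 3])
print(scores.index(max(scores)))  # index of the class with the highest score
```

Because the logarithm is increasing, picking the largest log-score selects the same class as picking the largest product, but long responses no longer shrink the score to zero.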

### Build the result list

Each entry in the result list is a tuple of five scores, one for each response option on the outcome variable (1 - strongly disagree, 2 - disagree, 3 - neither agree nor disagree, 4 - agree, 5 - strongly agree). The scores are computed from the words in one testing-set response.

In [0]:
result_list = [] 

for i in range(len(testing_text_str)): 
  sentences = testing_text_str[i].lower() 
  
  word_list = [] 
  doc = nlp(sentences) 

  for token in doc:  # iterate over the tokens directly instead of reusing the outer loop's i
    if token.is_alpha and not token.is_stop:
      word_list.append(token.text) 

  result = bayes_gothic(word_list, sorted_word_table, training_table) 
  result_list.append(result) 

In [0]:
result_list[:30]

## Make predictions

That is, based on the scores computed above for each response in the testing set, which response on the outcome variable does our model predict that person gave?

In [89]:
authors = ['1', '2', '3', '4', '5']

predictions = []

for result in result_list:
  m = max(result)  # highest of the five class scores
  author_index = result.index(m)  # position of that score (the first one if there is a tie)
  pred = authors[author_index]  # map the position back to the label '1' to '5'
  predictions.append(pred) 

predictions[:10] 

['4', '4', '4', '2', '3', '4', '4', '3', '4', '4']

## Line up the predictions with the actual scores on the outcome variable from the testing set

In [94]:
cases = list(zip(predictions,testing_outcome_str))
print(cases[:10])

[('4', '5'), ('4', '2'), ('4', '4'), ('2', '4'), ('3', '4'), ('4', '4'), ('4', '2'), ('3', '3'), ('4', '3'), ('4', '2')]


## Calculate Correct Predictions

In [95]:
print(cases.count(('1', '1')))
print(cases.count(('2', '2')))
print(cases.count(('3', '3')))
print(cases.count(('4', '4')))
print(cases.count(('5', '5')))

0
1
3
38
0
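The counts above cover only the correct predictions. To also see which wrong answers the model gives, the (prediction, actual) pairs can be tabulated in full. The sketch below uses `pd.crosstab` on the ten pairs printed earlier; `confusion_table` is a hypothetical helper name, and the full `cases` list could be passed in the same way.

```python
import pandas as pd

LABELS = ['1', '2', '3', '4', '5']

def confusion_table(cases, labels=LABELS):
    """Rows are actual scores, columns are predicted scores."""
    predicted = pd.Series([p for p, _ in cases], name='predicted')
    actual = pd.Series([a for _, a in cases], name='actual')
    table = pd.crosstab(actual, predicted)
    # make sure every label has a row and a column, even if it never occurs
    return table.reindex(index=labels, columns=labels, fill_value=0)

# First ten (prediction, actual) pairs from the output above
sample_cases = [('4', '5'), ('4', '2'), ('4', '4'), ('2', '4'), ('3', '4'),
                ('4', '4'), ('4', '2'), ('3', '3'), ('4', '3'), ('4', '2')]
table = confusion_table(sample_cases)
print(table)
```

The diagonal of the table holds the correct predictions, so the five counts above are its diagonal entries.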


## Calculate Accuracy

In [96]:
accuracy = (1+3+38)/len(testing_outcome_str)
accuracy # .41

0.4117647058823529
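The accuracy above is typed in by hand from the five counts. A small sketch of computing it directly from the (prediction, actual) pairs is shown below, using the ten pairs printed earlier as sample input; `accuracy_from_cases` is a made-up helper name.

```python
def accuracy_from_cases(cases):
    """Share of (prediction, actual) pairs where the two values match."""
    correct = sum(1 for predicted, actual in cases if predicted == actual)
    return correct / len(cases)

# First ten (prediction, actual) pairs from the output above
sample_cases = [('4', '5'), ('4', '2'), ('4', '4'), ('2', '4'), ('3', '4'),
                ('4', '4'), ('4', '2'), ('3', '3'), ('4', '3'), ('4', '2')]
print(accuracy_from_cases(sample_cases))
```

Calling the function on the full `cases` list would avoid having to update the numbers by hand if the split or the model changes.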

# Interpreting Results

The results indicate that our model predicted participants' responses (from 1 to 5) about how likely they were to get new clothing in the future that they didn't need with 41% accuracy, based on their descriptions of what motivated them to get their last clothing item. Of the 42 correct predictions, 38 were for a response of 4.

We could possibly improve the accuracy of the model by using a different ML algorithm, such as a convolutional neural network (CNN), and/or by using cross-validation and tuning the hyperparameters used when developing the model.
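As a rough sketch of what cross-validation would involve, the code below splits row numbers into k folds so that each fold serves as the testing set once. It uses only NumPy; `kfold_indices`, the choice of k=5, and the seed are assumptions for illustration, and the step of building and scoring the model on each fold is left out.

```python
import numpy as np

def kfold_indices(n_rows, k=5, seed=1234):
    """Yield (train_idx, test_idx) pairs of row positions for k-fold cross-validation."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(n_rows)      # shuffled row positions
    folds = np.array_split(order, k)     # k nearly equal groups
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

for train_idx, test_idx in kfold_indices(348, k=5):
    print(len(train_idx), len(test_idx))
```

Averaging the accuracy over the k folds would give a more stable estimate than a single 70/30 split.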