# Introduction
Talkspace is an online, text-based therapy service. I have been using the service for about a year, and over that time, I've noticed it's sometimes hard for me to stay engaged. There have been long stretches of times where I have not responded to my therapist. [Patient engagement is critical to developing a therapeutic relationship](http://dx.doi.org/10.3389/fpsyg.2015.02013), so if I can increase my own responsiveness, I might increase the efficacy of my therapy.

Unfortunately, Talkspace hasn't provided me with any kind of patient engagemnt questionaire (such as the [Patient Activation Measure](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1361049/)), so I don't have any existing historical measures of enagement. The metrics most accessible to me that I might want to optimize for are my response time and how many words I write to my therapist per day I take to respond, i.e.

$$
\frac{\textrm{word count of message}}{\textrm{time it takes me to write the message}}
$$

These quantities will obviously vary, and my suspicion is that some of that variability can be explained. If I can find a statistically significant factor that partly explains the variability, I'll have a lever to pull to improve my therapeutic relationship and my own mental health.

# Getting the data (you can do this too!)
Prior to this analysis, I wrote an [tool](https://github.com/vaughn-johnson/talkspace-scraper) to allow anyone to scrape his or her Talkspace message history. If you are familiar with Javascript and NPM, you can use the tool to save that data to a database or export it directly as a file.

This is obviously _highly sensitive data_. You should __never__ give your Talkspace username and password to a third party you do not trust. If you choose to use this tool, please exercise caution by [reviewing the code first](https://github.com/vaughn-johnson/talkspace-scraper). I appreciate that this stipulation might make the tool less accessible, but I strongly discourage anyone from blindly using this tool without being sure that his or her username and password will be secure.

With that being said, here I am importing my own data from my own database. If you're unfamiliar with `dotenv`, I would _highly_ encourae reading about it [here](https://pypi.org/project/python-dotenv/).

# Data Cleaning and Preparation

In [3]:
library(mongolite)
library(ggplot2)
library(dplyr)
library(dotenv)
library(tidyr)
library(quanteda)
library(lubridate)
library(GGally)
library(rio)

load_dot_env()
options(repr.plot.width=15, repr.plot.height=8)

In [4]:
# If you don't have access to this private database, you can see access the results at the api called below
mongo_response = mongo(
    collection = "messages",
    url = Sys.getenv("MONGO_CONNECTION_STRING"),
    verbose = TRUE,
    options = ssl_options(weak_cert_validation = FALSE, allow_invalid_hostname = FALSE)
)$find(
    '{ "message_type": { "$in": [1] } }', #other message types include automated messages from Talkspace
    '{ "message": 1, "display_name": 1, "created_at": 1 }'
)

 Found 612 records... Imported 612 records. Simplifying into dataframe...


In [5]:
messages = mongo_response %>%
           mutate(created_at = as_datetime(created_at)) %>%

           # Clean up messages
           mutate(message = gsub('.*> ', '', message)) %>%
           mutate(message = gsub('(Vaughn,\n*|Respectfully,[\n ]*Dallas)', '', message)) %>%
           mutate(message = gsub('\n\n+', '\n', message)) %>%

           # Make this more usable by others
           mutate(display_name = recode(display_name, 'Vaughn' = 'Me', 'Dallas' = 'My Therapist')) %>%

           # Sort by send date
           arrange(created_at) %>%
        
           # Group consecutive messages from the same person into blocks
           mutate(message_block = cumsum(display_name != replace_na(lag(display_name), 'Me')))


In [None]:
MESSAGES_API = 'https://us-central1-talkspace-293821.cloudfunctions.net/talkspace-public-api?format=csv'

messages = rio::import(MESSAGES_API, format='csv')

In [None]:
message_blocks = messages %>%
                 # Concatenate consecutive messages
                 group_by(message_block) %>%
                 summarise(display_name = min(display_name),
                           created_at = min(created_at),
                           message = paste(message, collapse='')) %>%
                 
                 mutate(message_length = nchar(message)) %>%
                 mutate(word_count = stringr::str_count(message, ' ') + 1) %>%
                 mutate(question_count = stringr::str_count(message, '\\?')) %>%
                 mutate(readability = textstat_readability(message, measure=c('Flesch'))$Flesch) %>%
                 drop_na() %>%
                 mutate(response_time = time_length(lag(created_at) %--% created_at, unit='days')) %>%
                 mutate(words_per_day = word_count / time_length(response_time, unit='days')) %>%

                 mutate(prev_message_length = lag(message_length)) %>%
                 mutate(prev_word_count = lag(word_count)) %>%
                 mutate(prev_question_count = lag(question_count)) %>%
                 mutate(prev_readability = lag(readability))


## Exploration
First, let's just look at the distribution of our features.

In [None]:
ggplot(message_blocks, aes(x=readability, fill=display_name, title='Flesch Reading Ease Score of Messages')) +
    geom_histogram() +
    facet_wrap('display_name', dir = 'v') +
    xlab('Readability') +
    theme_minimal() +
    ggtitle('Flesch Reading Ease Score of Messages')

In [None]:
ggplot(message_blocks, aes(x=question_count, fill=display_name, title='Flesch Reading Ease Score of Messages')) +
    geom_histogram() +
    facet_wrap('display_name', dir = 'v') +
    xlab('Question Count') +
    theme_minimal() +
    ggtitle('Number of questions asked (measured by appearence of \"?\")')

These are the quantities I'm trying to optimize for.

In [None]:
ggplot(message_blocks, aes(x=words_per_day, fill=display_name, title='Flesch Reading Ease Score of Messages')) +
    geom_histogram() +
    facet_wrap('display_name', dir = 'v') +
    xlab('Words per Day') +
    theme_minimal() +
    ggtitle('How many words typed per day it took to respond')

In [None]:
ggplot(message_blocks, aes(x=response_time, fill=display_name, title='Flesch Reading Ease Score of Messages')) +
    geom_histogram() +
    facet_wrap('display_name', dir = 'v') +
    xlab('Response Time') +
    theme_minimal() +
    ggtitle('Time to respond (in days)')

(My therapist Dallas has much shorter response times than me)

Here are some pair-wise plots of some of the featues I've extracted above that I think might explain how I interact with my therapist

In [None]:
pair_cols = c(
    "words_per_day",
    "response_time",
    "prev_word_count",
    "prev_question_count",
    "prev_readability"
)


ggpairs(message_blocks[c('display_name', pair_cols)], aes(color=display_name, alpha=0.4)) +
    theme_minimal() +
    ggtitle('Response data (Each point is a message, "Prev" means the previous message the point is responding to)')

Nothing incredibly interesting jumps out here. We see some normally distributed varaibles plotted against uniformly distributed variables, which of course produces a bell shape. There isn't a clear relationship anywhere. 

# Do the characteristics of my therapist's messages explain my responsiveness?

Let's try to develop a model that tries to explain my responsiveness based on the messages I'm responding to. My hypothesis is that I might take longer to respond to more complex messages. Messages with a lower readability score, which are longer, or which ask me more question. I think if these were going to have have any affect, it would probably be linear, so fitting a generalized linear model with least squares seems appropriate.

To be clear, this data is _not_ well suited for linear regression. The response variable I'm interested in does not appear to be normally distributed with any of its covariates, and the covariants aren't perfectly non-collinear. The observations are obviously not independent (though they don't seem to show any obvious autocorrelation). However, I think it's ok to give it a look.

In [None]:
my_messages = message_blocks %>% filter(display_name == 'Me')  %>% drop_na()

In [None]:
acf((my_messages)$response_time, main = 'Auto correlation of my response times')

In [None]:
acf((my_messages)$words_per_day, main = 'Auto correlation of my words per day')

In [None]:
summary.glm(glm("response_time ~ prev_word_count + prev_question_count + prev_readability", family='gaussian', my_messages))

The summary of that regression _clearly_ shows a very poor fit. I feel comfortable continuing on with the hypothesis that my therapist's messages have no effect on my responses.

I'll fit and test a second model, but instead of predicting `words per day`, I'll predict how long it takes me to respond (using the same covariates).

In [None]:
summary.glm(glm("words_per_day ~ prev_word_count + prev_question_count + prev_readability", family='gaussian', my_messages))

This second model is a little more significant, and worlds apart a better fit than the previous model. It seems like the length of my therapist's previous messages might have an effect on how long it takes me to respond. The point estimate is `0.0037`, which with the units we're using, means for every additional word my therapist send me, it takes me something on the order of 5 and a half minutes longer respond.

# Results
There are two interacting conclusions from this
1. Nothing that I looked at today from my therapists messages makes me send more messages per unit time
2. The longer my therapist's messages are, the longer it takes me to respond

What this really means is that there is some _compensatory_ effect between how long it takes me to respond to my therapist, and how much I write back, and that compensatory effect washes out any measurable effect from the length of the message I'm responding to. If my therapist sends me a long message, I will generally take longer, but respond with a longer message. This actually shows up as a slight correlation between my response time and word count (`r = 0.325`)

In [None]:
cor(my_messages$response_time, my_messages$word_count)

In [None]:
ggplot(my_messages, aes(x=response_time, y=word_count)) +
    geom_point() +
    geom_smooth(method = "lm") +
    theme_minimal() +
    xlab('Respone Time (days)') +
    ylab('Word Count')

# Conclusions and next steps

I have a new rule of thumb for my therapist. For every additional paragraph you send me, you can expect an additional 18 hours or so for me to respond. However, when I finally do respond, my response will generally be slightly longer.

## Next steps
I would like to build out a language model of these messages. Sometimes my therapist characterizes some of the things I send to him as good or bad, and it would be fabulous to build a model that could predict that judgement on arbitrary text.