<a href="https://colab.research.google.com/github/shere-khan/machine_learning/blob/master/BespokeAI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Modeling Approach
#### Data: Features and formatting
The problem can be modeled as follows. For a given customer, let $x \in \mathbb{R}^p$ be the customer's feature vector, where $p$ is the number of features, and let $y \in \mathbb{R}$ be that customer's LTV, which the model predicts. Let $m$ be the time window: we consider only the first $m$ days of data after a customer signs up, and we use $x$ to predict the LTV $y$. As a first modeling approach, we consider the data only in aggregate, meaning each customer feature is computed across the full span of $m$ days. Example features are listed below, after the considerations on the prediction target; the list is not exhaustive. We will use a 90/10 train/test split.
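
As a rough illustration of the aggregate formulation, the sketch below builds a few entries of $x$ from one customer's first $m$ days. The `(timestamp, amount)` tuple layout, the function name, and the feature names are assumptions made for the example, not the real schema.

```python
from datetime import datetime, timedelta

def aggregate_features(signup, transactions, m_days=14):
    """Aggregate one customer's transactions from the first m_days after signup.

    `transactions` is a list of (timestamp, amount) tuples; this layout is an
    assumption made for illustration only.
    """
    cutoff = signup + timedelta(days=m_days)
    # Keep only transactions inside the window, in chronological order.
    in_window = sorted((ts, amt) for ts, amt in transactions if signup <= ts <= cutoff)
    amounts = [amt for _, amt in in_window]
    n = len(amounts)
    total = sum(amounts)
    return {
        "num_transactions": n,
        "total_spent": total,
        "avg_spent_per_transaction": total / n if n else 0.0,
        # Days from signup to first purchase; None if nothing was bought in the window.
        "days_to_first_transaction": (
            (in_window[0][0] - signup).total_seconds() / 86400 if n else None
        ),
    }

signup = datetime(2023, 1, 1)
txns = [
    (datetime(2023, 1, 3), 10.0),
    (datetime(2023, 1, 10), 30.0),
    (datetime(2023, 2, 1), 500.0),  # outside the 14-day window, so ignored
]
features = aggregate_features(signup, txns, m_days=14)
```

Note that the large purchase on day 31 never reaches the feature vector; it would only show up in the target $y$.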

#### Things to consider
Given that a user's LTV can change over time, it might be better to model the prediction as a tiered category $y \in \{Low, Middle, High\}$ that indicates spending potential. This is perhaps a more natural modeling approach: it categorizes a customer's spending potential relative to that of other customers, as opposed to simply outputting LTV as an unbounded value. The tiers encode a hierarchy that is agnostic to the actual amount spent and cares only about the amount spent relative to what others have spent. It seems like a silly distinction, but it might actually matter. If we were to take this approach, we would use (multinomial) logistic regression in place of linear regression, and the last layer of our neural network would use a softmax activation. The loss would change to cross entropy, and the metrics would change from $R^2$ to AUC, recall, precision, and F1 score.
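
One possible way to turn LTV into tiers is to cut at the terciles of the training LTVs, so that each tier is defined relative to other customers. This is only a sketch: the choice of terciles, the function name, and the toy LTV values are all assumptions.

```python
from statistics import quantiles

def make_tier_labeler(train_ltv):
    """Return a function mapping an LTV to a tier, using tercile cut points
    computed on the training LTVs (an assumed tiering rule)."""
    low_cut, high_cut = quantiles(train_ltv, n=3)

    def label(ltv):
        if ltv <= low_cut:
            return "Low"
        if ltv <= high_cut:
            return "Middle"
        return "High"

    return label

# Made-up LTV values for illustration.
train_ltv = [5, 8, 12, 20, 25, 40, 60, 150, 900]
label = make_tier_labeler(train_ltv)
tiers = [label(v) for v in train_ltv]
```

The cut points should come from the training split only and then be reused unchanged on the test split, so the tier definition does not leak test information.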

- Average dollar amount spent per transaction in time window.
<br>type: real number

- Average number of messages before transaction
<br>type: real number
- Total amount spent in time window
<br>type: real number
- Time to first transaction
<br>type: real number (days)
- Number of transactions
<br>type: real number
- Most common transaction time of day (morning, noon, evening, night, early morning). This should probably include some notion of time differential.
<br>type: categorical, with one category per division of the 24-hour day.
- Average time interval length between transactions
<br>type: real number
- Repeat customer (number of signups/cancellations to same creator)?
<br>type: real number
- Number of declined payments?
<br>type: real number
- Type of transaction (tip, mass message, custom message)
<br>type: categorical
- Number of views before transaction
<br>type: real number
- Total number of messages 
<br>type: real number
- Time subscription was made
<br>type: real number
- Number of transactions made during live video?
<br>type: real number
- Message content. Does message content provide a signal for identifying potential whales? You mentioned not using message text, but for the sake of completeness, it is worth thinking along these lines. For each message (document), use Word2Vec (W2V) or GloVe to get embeddings for its words, then average the embeddings to get one vector per document. You could then run k-means, with something like 10 centroids, over the document vectors across all customers, and pass the resulting cluster IDs in as a categorical variable. Alternatively, you could average all document vectors for a single customer and concatenate the result to that customer's feature vector. My theory is that the generalization introduced by averaging and clustering might wash out any potential signal, but it is still worth experimenting with.
<br>type: $\mathbb{R}^l$, where $l$ is the number of dimensions of a W2V feature vector, or the number of centroids you choose.
<br>
- Number of tips
- Number of mass messages
- Number of customer messages
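
The averaging variant of the message-content feature could look roughly like the sketch below. The tiny `EMBEDDINGS` table is made up for illustration; in practice the vectors would come from a trained Word2Vec or GloVe model, and the helper names are hypothetical.

```python
# Toy 3-dimensional word vectors with made-up values. A real pipeline would
# look these up in a trained Word2Vec or GloVe model instead.
EMBEDDINGS = {
    "love": [0.9, 0.1, 0.0],
    "your": [0.1, 0.1, 0.1],
    "content": [0.5, 0.4, 0.0],
    "how": [0.0, 0.2, 0.7],
    "much": [0.1, 0.3, 0.8],
}
DIM = 3

def mean_vector(vectors):
    """Component-wise mean of a list of vectors; zeros if the list is empty."""
    if not vectors:
        return [0.0] * DIM
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def message_embedding(text):
    """Average the vectors of the known words in one message (document)."""
    words = text.lower().split()
    return mean_vector([EMBEDDINGS[w] for w in words if w in EMBEDDINGS])

def customer_embedding(messages):
    """Average the per-message vectors to get one vector per customer."""
    return mean_vector([message_embedding(m) for m in messages])

customer_vec = customer_embedding(["Love your content", "how much"])
```

The resulting vector has $l$ entries and would be concatenated to the rest of the customer's features. For the clustering variant, the per-message vectors would instead be fed to k-means and the cluster IDs used as a categorical feature.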

### Select SQL examples
Below are some examples of queries I would write to create the features. I had no data to test with, so there are likely some syntax mistakes, but I hope they communicate the general idea.



In [None]:
# Messages:
# - customerId
# - creatorId (assuming this is the OnlyFans creator hosting the content)
# - createdAt 
# - sender (creator or customer) 
# - messageText

# Subscriptions:
# - customerId
# - creatorId
# - subscribedAt
# - isExpired (Boolean)

# Transactions:
# - customerId
# - creatorId 
# - createdAt
# - amount
# - transactionType (types: tip, mass message, custom message) 

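# Total number of messages each customer sent to a creator in the first
# 2 weeks after subscribing. Assumes sender holds 'customer' or 'creator'.
select
    s.customerId,
    s.creatorId,
    count(*) as num_messages
from subscriptions s join messages m
    on s.customerId = m.customerId
    and s.creatorId = m.creatorId
where m.sender = 'customer'
    and m.createdAt >= s.subscribedAt
    and m.createdAt <= date_add(s.subscribedAt, interval 2 week)
group by s.customerId, s.creatorId;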


# Average time between transactions. The window function computes the gap (in
# seconds) between consecutive transactions, and the outer query averages the gaps.
with time_diff_table as (
    select
        t.customerId,
        t.creatorId,
        timestampdiff(second, lag(t.createdAt) over (
            partition by t.creatorId, t.customerId
            order by t.createdAt), t.createdAt) as time_between_transactions
    from subscriptions s join transactions t
        on s.creatorId = t.creatorId
        and s.customerId = t.customerId
    where t.createdAt >= s.subscribedAt
        and t.createdAt <= date_add(s.subscribedAt, interval 2 week)
)
select
    customerId,
    creatorId,
    avg(time_between_transactions) as avg_seconds_bw_transactions
from time_diff_table
group by creatorId, customerId;

## Model selection pros vs cons
#### Linear regression
We can model the problem using linear regression as follows: $\hat{y} = w^\intercal x$.
<br>Pros
- Interpretable: each weight shows how its feature contributes to the prediction, which is largely why linear regression is still used.
- Simple: good for baselining
<br>

Cons
- Nonlinearity must be added manually. This applies both to interactions between features and to individual features with a nonlinear effect.

Things to watch out for:
- Make sure your assumptions are correct:
<br>
i) Linearity: plot the residuals and check for a curve. If there is one, transform your data, for example with a log or power transform.
<br>
ii) Homoscedasticity: make sure your residuals have constant variance (no heteroscedasticity) and are approximately normally distributed.
<br>
iii) Collinearity: remove highly correlated features; otherwise the standard errors of your parameters can be inflated, making the model hard to interpret.
<br>
iv) Outliers and high-leverage points: these can add noise to your data. Noise from high-leverage points can be mitigated either by removing the outlier or by robust scaling, i.e. subtracting the median from a dimension and dividing by the interquartile range (the difference between the 3rd and 1st quartiles).
<br>
v) Uncorrelated residuals: you can check this by plotting the residuals in order. They should not show trends the way time series data does.
- Features must be scaled, or the magnitude of the corresponding weights might be misleading. A weight's magnitude depends on its feature's units: a feature with small values relative to the others can receive a large weight (and a feature with large values a small one), which can signal undue importance or unimportance.
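
The robust scaling mentioned in iv) can be sketched in a few lines. The function name and the toy spend values are assumptions, and the quartile convention (`method="inclusive"`) is one of several reasonable choices.

```python
from statistics import median, quantiles

def robust_scale(values):
    """Subtract the median and divide by the interquartile range (Q3 - Q1)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    med = median(values)
    if iqr == 0:
        # Degenerate case: no spread in the middle half, so only center the data.
        return [v - med for v in values]
    return [(v - med) / iqr for v in values]

# Made-up "total spent" values with one extreme outlier.
total_spent = [10, 12, 15, 18, 20, 22, 25, 30, 5000]
scaled = robust_scale(total_spent)
```

Because the median and quartiles barely move when the 5000 is added, the typical customers keep a sensible scale; with mean/standard-deviation scaling the outlier would compress them toward a single value.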

#### Gradient Boosting
This is an ensemble method that fits successive models, each one correcting the residuals of the ensemble so far, then combines the models additively to yield a prediction.

Pros
- More powerful than standard linear regression
- No need for feature scaling because it uses decision trees
- Ensemble methods tend to be more robust to noise, which reduces the chance of overfitting
- Explainability: decision-tree-based methods can yield a feature importance ranking
- Learns nonlinear relationships by partitioning the feature space into smaller and smaller regions

Cons
- Possibly longer training time, though maybe not, since libraries such as XGBoost are highly optimized. That could easily be confirmed.
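
To make the "successive models correcting residuals" idea concrete, here is a minimal from-scratch sketch that uses one-split decision stumps and squared error. It is not how XGBoost is implemented; the function names, the toy data, and the hyperparameters are assumptions chosen for illustration.

```python
def fit_stump(X, residuals):
    """Find the single-feature threshold split that minimizes squared error."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            if not left or not right:
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = (sum((r - left_mean) ** 2 for r in left)
                     + sum((r - right_mean) ** 2 for r in right))
            if best is None or error < best[0]:
                best = (error, j, threshold, left_mean, right_mean)
    if best is None:
        # No valid split (all rows identical): fall back to a constant stump.
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]  # (feature index, threshold, left value, right value)

def predict_stump(stump, row):
    j, threshold, left_value, right_value = stump
    return left_value if row[j] <= threshold else right_value

def fit_gradient_boosting(X, y, n_rounds=50, learning_rate=0.1):
    base = sum(y) / len(y)  # start from the mean prediction
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # Each round fits a new stump to what the ensemble still gets wrong.
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, row)
                       for p, row in zip(predictions, X)]
    return base, stumps, learning_rate

def predict(model, row):
    base, stumps, learning_rate = model
    # Additive combination of the base value and all stumps.
    return base + learning_rate * sum(predict_stump(s, row) for s in stumps)

# Made-up data: two features per customer, LTV as the target.
X = [[1, 0], [2, 1], [3, 0], [4, 1], [5, 0], [6, 1]]
y = [10, 12, 30, 33, 80, 85]
model = fit_gradient_boosting(X, y, n_rounds=200, learning_rate=0.1)
train_mse = sum((predict(model, row) - yi) ** 2 for row, yi in zip(X, y)) / len(y)
baseline_mse = sum((sum(y) / len(y) - yi) ** 2 for yi in y) / len(y)
```

The splits depend only on the ordering of feature values, which is why tree-based boosting does not need feature scaling.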

#### Neural network
Pros
- Can learn powerful nonlinear relationships (nth order interactions).

Cons
- Potentially high training time (though it shouldn't be high for this data; a simple MLP should get the job done)
- More complexity could lead to overfitting. Dropout could be used to mitigate this.
- Not interpretable
- More design and tuning decisions to make with a complex model like a NN

Things to consider (just a few; the model here is simple, so I won't list everything, only what I would think to look for):
- Still need to scale the data. Batch normalization layers can help with this.
- Depth/width: one or two hidden layers should be more than enough, as I don't expect interactions between the features of a much higher order than that. Perhaps W2V embeddings would necessitate more layers in order to capture interactions, but I doubt it.
- Which activation functions to use. In this case, any of them should work fine, but watch out for vanishing/exploding gradients with certain datasets and architectures. ReLU is the safest bet. The last layer uses a linear activation, of course, since this is regression.
- Need to closely observe the training loss and take steps accordingly. Are we getting traction on the data? If not, what do we do? Consider the learning rate and the optimization method (batch or stochastic gradient descent, RMSProp, Adam). Did we overfit? If so, are we using dropout? How much data do we have?
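
For reference, the forward pass of the kind of small MLP described above (one ReLU hidden layer, dropout, linear output) can be sketched in NumPy as follows. This only illustrates the architecture; it has no training step, and the layer sizes, initialization, and function names are assumptions. A real implementation would use PyTorch as noted elsewhere in this write-up.

```python
import numpy as np

def init_mlp(n_features, n_hidden, rng):
    """He-style random initialization for a one-hidden-layer MLP."""
    return {
        "W1": rng.normal(0.0, np.sqrt(2.0 / n_features), size=(n_features, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, np.sqrt(2.0 / n_hidden), size=(n_hidden, 1)),
        "b2": np.zeros(1),
    }

def forward(params, X, dropout_rate=0.0, rng=None):
    hidden = np.maximum(0.0, X @ params["W1"] + params["b1"])  # ReLU hidden layer
    if dropout_rate > 0.0 and rng is not None:
        # Inverted dropout: zero out random units and rescale the rest.
        # Only used during training; evaluation calls leave dropout_rate at 0.
        keep = rng.random(hidden.shape) >= dropout_rate
        hidden = hidden * keep / (1.0 - dropout_rate)
    return (hidden @ params["W2"] + params["b2"]).ravel()  # linear output for regression

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))  # 8 made-up customers, 5 already-scaled features
params = init_mlp(n_features=5, n_hidden=16, rng=rng)
train_out = forward(params, X, dropout_rate=0.2, rng=rng)  # training-mode pass
eval_out = forward(params, X)                              # evaluation-mode pass
```

For the tiered-category variant of the problem, the output layer would have one unit per tier followed by a softmax instead of a single linear unit.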

## Offline Training
I would use a 90/10 train/test split for all models. A validation set can be used when tuning hyperparameters. I would use sklearn or R for training logistic/linear regression, as they have a lot built in, and XGBoost for gradient boosting. XGBoost is a highly optimized library, and I believe it has cross-validation built in (`xgboost.cv`). I would use PyTorch for the MLP, as that's the library I'm most comfortable with. This would require writing a training loop that runs a preset number of epochs, then calculating and plotting the train/test error manually to check for overfitting and model stability.
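
The split and the regression metric are simple enough to sketch without any library. The helper names and the fixed seed below are assumptions; in practice sklearn's equivalents would do the same job.

```python
import random

def train_test_split(rows, test_fraction=0.1, seed=0):
    """Shuffle with a fixed seed, then hold out test_fraction of the rows."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

customers = list(range(100))  # stand-in for 100 customer records
train, test = train_test_split(customers)
```

Computing `r_squared` on both splits after training gives a quick overfitting check: a train score far above the test score suggests the model has memorized the training set.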

## Online Training
Say we chose to put this model into production in order to inform creators about potential spenders. Because a customer's LTV can change over time, retraining daily on the entire set of customers would perhaps be best. We could either update the existing parameters or train a fresh model. In either case, we can keep a holdout set to validate the model's metrics before publishing it to prod, making sure there is no overfitting and that everything looks as we expect based on the offline metrics.

## Future work
I think the above is a fair baseline. There is of course much more we can do with this, but my hypothesis is that we should get traction with the above first. After getting decent results with the baseline, a possible next step is to use sequential models. Instead of using the data in aggregate, we would treat it as a time series and use a model that can take in multiple time steps for a single data point, such as an autoregressive model or an RNN. That is, one data point would span $m$ days, with a $p$-dimensional feature vector for each day, and instead of constructing features by aggregating over the $m$ days, we would consider each day individually.
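
As a sketch of what the sequential input could look like, the function below turns one customer's transactions into an $m \times p$ sequence with $p = 2$ features per day. The `(timestamp, amount)` tuple layout, the function name, and the two per-day features are assumptions for illustration.

```python
from datetime import datetime

def daily_sequence(signup, transactions, m_days=14):
    """Return m_days rows, one per day since signup: [num_transactions, total_spent]."""
    days = [[0, 0.0] for _ in range(m_days)]
    for ts, amount in transactions:
        day_index = (ts - signup).days  # whole days elapsed since signup
        if 0 <= day_index < m_days:
            days[day_index][0] += 1
            days[day_index][1] += amount
    return days

signup = datetime(2023, 1, 1)
txns = [
    (datetime(2023, 1, 1, 12), 5.0),
    (datetime(2023, 1, 1, 18), 7.0),
    (datetime(2023, 1, 3), 20.0),
    (datetime(2023, 2, 1), 500.0),  # outside the window, so ignored
]
sequence = daily_sequence(signup, txns, m_days=14)
```

Each row would be one time step fed to an RNN or used as a lagged input to an autoregressive model, in place of the single aggregated vector $x$.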