This notebook shows my solution to a Data Scientist task that I was asked to solve. Both datasets are published on my profile, together with an explanation of those datasets.

Anyone who wants to submit their own solution or discuss and improve these results is welcome!

In [None]:
# Loading Libraries Module #
import datetime
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import multiprocessing

from IPython.display import display, Markdown
from joblib import Parallel, delayed
from mlxtend.evaluate import permutation_test
from scipy import stats
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Loading Data Module #
# Paths should be modified to replicate #
ex_1 = pd.read_csv("/kaggle/input/two-distributions-comparison/data_exercise1.csv")
ex_2_train = pd.read_csv("/kaggle/input/lifetime-value/train.csv")
ex_2_test = pd.read_csv("/kaggle/input/lifetime-value/test.csv")

In [None]:
# Tuning Values #
NUMERICS = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
NUM_CPUS = max(1, multiprocessing.cpu_count() - 1)  # never drop to 0 workers on a single-core machine
RESAMPLE_SIZE = ex_1.groupby("type").size().min()  # size of the smaller group
RESAMPLES = 100000
pd.set_option('display.max_rows', 500)

In [None]:
# Functions #
def bp_hist_fun(df, title=None):
    """
    Display a boxplot and a histogram side by side. Each column of df holds the values of the
    target variable for one level of the categorical variable studied.
    """
    fig1, f1_axes = plt.subplots(ncols=2, nrows=1, sharey=True, figsize=(16, 8))
    df_aux = pd.melt(df)
    sns.boxplot(x="variable", y="value", data=df_aux, ax=f1_axes[0])
    for type_i in df_aux["variable"].unique():
        sns.distplot(df_aux[df_aux["variable"]==type_i]["value"],
                     label=type_i, hist=True, ax=f1_axes[1], vertical=True, kde=False)
    if title:
        fig1.suptitle(title)
    plt.legend()
    plt.show() 

def qcd(array):
    """
    Function that calculates the quartile coefficient of dispersion of an array
    """
    return (np.percentile(array, 75) - np.percentile(array, 25)) / (np.percentile(array, 75) + np.percentile(array, 25))

def preprocess_df(df, train=False):
    """
    Parse join_date and derive calendar features, fill missing is_cancelled values and,
    for training data, flag fraudulent rows (target < 0).
    """
    # Vectorised parsing; leaves the column unchanged if it is already datetime
    df["join_date"] = pd.to_datetime(df["join_date"], format='%Y-%m-%d %H:%M:%S')

    df["join_date_all_day"] = df["join_date"].dt.strftime("%Y-%m-%d")

    # Monday=0 ... Sunday=6
    df["join_date_day"] = df["join_date"].dt.weekday

    df["join_date_labour"] = np.where(df["join_date_day"] > 4, 'weekend', 'labour')

    df["join_date_month"] = df["join_date"].dt.month

    df["join_date_year"] = df["join_date"].dt.year

    df["is_cancelled"] = df["is_cancelled"].fillna("Not Acknowledged")
    
    if train:
        df["frau"] = np.where(df["target"] < 0, 'fraudulent', 'cool')
    
    return df

## First Exercise

### Question 1: Which metric would you use to compare the two distributions?

Before answering this question, let's check the distribution of the values, the main summary statistics and a normality test in order to gain insight into the distributions.

In [None]:
fig = plt.figure(figsize=(20,10))
for type_i in ex_1["type"].unique():
    sns.distplot(ex_1[ex_1["type"]==type_i]["value"], 
                 label=type_i, hist=False)
plt.legend()
plt.show()

In [None]:
display(ex_1.groupby("type").describe())

In [None]:
test = stats.jarque_bera(ex_1[ex_1["type"]=="type_1_500_samples"]["value"].values)

Markdown("The Statistic of the Jarque Bera normality test for type_1_500_samples is {:.2f}, with a p_value of {:.2f}".format(test[0], test[1]))

In [None]:
test = stats.jarque_bera(ex_1[ex_1["type"]=="type_2_10000_samples"]["value"].values)

Markdown("The Statistic of the Jarque Bera normality test for type_2_10000_samples is {:.2f}, with a p_value of {:.2f}".format(test[0], test[1]))

The main insights that may affect the choice of metric are:
- *Type 2* has many more values than *Type 1*.
- *Type 2* and *Type 1* have the same first quartile (25%), while *Type 2* has the larger maximum and *Type 1* is larger at the remaining quartiles.
- The standard deviation is slightly lower in *Type 2*, while the mean is slightly higher in *Type 1*.
- Neither distribution is consistent with a normal distribution. Transforming the data with logarithms or square roots is not possible because of the negative values.

The problem is that the mean and the standard deviation are sensitive to extreme values, which *Type 2* in particular has.

So,
- The **t-statistic** (whether from the standard t-test or from Welch's t-test, which handles unequal sample sizes and variances) is **discarded** because of non-normality.
- The **mean** and the **standard deviation** are **discarded** because of the extreme values.

Then, my options would be:
- The **Median**, because it is a central location metric that is more robust to the issues described above. Furthermore, it is easy to interpret: "50% of the values are this value or higher".
- The **Quartile Coefficient of Dispersion**, because it is a variability metric that is more robust to the same issues.

However, I would still use all the "discarded" metrics with permutation tests.
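
As a minimal sketch of why the median and the quartile coefficient of dispersion are preferred here, the toy example below (made-up numbers, not the exercise data) compares the four metrics with and without a single extreme value. The helper `qcd` mirrors the function defined earlier in this notebook.

```python
import numpy as np

def qcd(values):
    """Quartile coefficient of dispersion: (Q3 - Q1) / (Q3 + Q1)."""
    q1, q3 = np.percentile(values, [25, 75])
    return (q3 - q1) / (q3 + q1)

# Toy data (made up for illustration): ten regular values plus one extreme value.
base = np.array([8.0, 9.0, 10.0, 10.0, 11.0, 11.0, 12.0, 12.0, 13.0, 14.0])
with_outlier = np.append(base, 1000.0)

# For each metric, store (value without the outlier, value with the outlier).
summary = {
    name: (func(base), func(with_outlier))
    for name, func in [("mean", np.mean), ("std", np.std),
                       ("median", np.median), ("qcd", qcd)]
}

for name, (before, after) in summary.items():
    print(f"{name:>6}: without outlier = {before:.3f}, with outlier = {after:.3f}")
```

The mean and the standard deviation change by orders of magnitude, while the median stays the same and the coefficient of dispersion barely moves. One caveat: the denominator Q1 + Q3 can get close to zero when the data contain negative values, in which case the coefficient becomes unstable.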

### Question 2: Is there a graphical way to compare the distributions?

A combination of a histogram (shown in question 1 as a density plot) and a boxplot makes it possible to inspect the distribution of the values while showing the main location statistics. Therefore, it can be used to check differences between distributions.

In [None]:
fig = plt.figure(figsize=(10,10))
sns.boxplot(x="type", y="value", data=ex_1, showmeans=True)
plt.title("Boxplot comparison")
plt.show()

Nonetheless, as noted above, the sample sizes are unequal and there are extreme values. I would therefore get more out of the data with resampling methods. Resampling shows the sampling distribution of each statistic of interest, making it possible to see whether that statistic tends to be higher for one group than for the other. To account for the difference in sample sizes, I draw every resample with the size of the smaller group.
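
The idea can be sketched on synthetic data as follows. This is only an illustration under assumptions: the two groups are simulated normal samples rather than the exercise data, and `resampled_statistic` is a hypothetical helper that draws all resamples at once with NumPy instead of using `joblib`.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator so the sketch is reproducible

# Synthetic stand-ins for the two groups (assumed sizes, not the real data).
small_group = rng.normal(loc=0.0, scale=1.0, size=500)
large_group = rng.normal(loc=0.0, scale=1.0, size=10_000)

resample_size = min(len(small_group), len(large_group))  # size of the smaller group
n_resamples = 2_000  # kept small here; the notebook uses RESAMPLES

def resampled_statistic(values, statistic, size, n_resamples, rng):
    """Draw n_resamples samples with replacement, all of the same size,
    and reduce each one to a single statistic."""
    samples = rng.choice(values, size=(n_resamples, size), replace=True)
    return statistic(samples, axis=1)

small_medians = resampled_statistic(small_group, np.median, resample_size, n_resamples, rng)
large_medians = resampled_statistic(large_group, np.median, resample_size, n_resamples, rng)

# Central 95% of each sampling distribution of the median.
print(np.percentile(small_medians, [2.5, 97.5]))
print(np.percentile(large_medians, [2.5, 97.5]))
```

Because both groups are resampled with the same size, the spread of the two sampling distributions is comparable. With many resamples the `(n_resamples, size)` array can become large, so drawing in chunks may be needed.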

In [None]:
type1_resamples = Parallel(n_jobs=NUM_CPUS)(
    delayed(np.random.choice)(
        a = ex_1[ex_1["type"]=="type_1_500_samples"]["value"],
        size = RESAMPLE_SIZE
    ) for perm in range(RESAMPLES))

type2_resamples = Parallel(n_jobs=NUM_CPUS)(
    delayed(np.random.choice)(
        a = ex_1[ex_1["type"]=="type_2_10000_samples"]["value"],
        size = RESAMPLE_SIZE
    ) for perm in range(RESAMPLES))

In [None]:
bp_hist_fun(
    pd.DataFrame({"type_1_500_samples": map(np.mean, type1_resamples),
                  "type_2_10000_samples": map(np.mean, type2_resamples)}),
    title="Resampling distribution of means")

In [None]:
bp_hist_fun(
    pd.DataFrame({"type_1_500_samples": map(np.std, type1_resamples),
                  "type_2_10000_samples": map(np.std, type2_resamples)}),
    title="Resampling distribution of standard deviations")

In [None]:
bp_hist_fun(
    pd.DataFrame({"type_1_500_samples": map(np.median, type1_resamples),
                  "type_2_10000_samples": map(np.median, type2_resamples)}),
    title="Resampling distribution of medians")

In [None]:
bp_hist_fun(
    pd.DataFrame({"type_1_500_samples": map(qcd, type1_resamples),
                  "type_2_10000_samples": map(qcd, type2_resamples)}),
    title="Resampling distribution of quartile coefficients of dispersion")

Visually, all these plots suggest that type 1 tends to have a higher distribution of values. Nonetheless, statistical tests are required to check whether the difference is statistically significant.

### Question 3: How would you design a statistical test for this problem? How would you overcome the problem of the different sample size?

Following what has been discussed in the previous two questions, and to avoid this problem, a non-parametric test should be applied. For that, I'm using two approaches (several indicators help to reach better conclusions):
- "Classical" non-parametric tests
- Permutation tests

For all the hypothesis tests, the null hypothesis is that both distributions are equal.
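
To make the permutation approach concrete, here is a minimal from-scratch sketch on simulated data. It is not the `mlxtend` implementation used in this notebook; `permutation_p_value` and `abs_median_diff` are hypothetical helpers written only for this illustration.

```python
import numpy as np

def permutation_p_value(x, y, statistic, n_rounds=2_000, seed=0):
    """Approximate two-sample permutation test.

    The group labels are shuffled while the group sizes stay fixed,
    which is how the unequal sample sizes are taken into account.
    """
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    observed = statistic(x, y)
    at_least_as_extreme = 0
    for _ in range(n_rounds):
        shuffled = rng.permutation(pooled)
        if statistic(shuffled[:len(x)], shuffled[len(x):]) >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (at_least_as_extreme + 1) / (n_rounds + 1)

def abs_median_diff(a, b):
    """Two-sided statistic: absolute difference of medians."""
    return np.abs(np.median(a) - np.median(b))

# Simulated groups of unequal size (assumed data, for illustration only).
rng = np.random.default_rng(1)
group_a = rng.normal(0.0, 1.0, 200)
same_as_a = rng.normal(0.0, 1.0, 1_000)   # same distribution as group_a
shifted = rng.normal(1.0, 1.0, 1_000)     # location shifted by one unit

p_same = permutation_p_value(group_a, same_as_a, abs_median_diff)
p_shifted = permutation_p_value(group_a, shifted, abs_median_diff)
print(f"same distribution: p = {p_same:.4f}")
print(f"shifted location:  p = {p_shifted:.4f}")
```

Any statistic can be plugged in as long as it returns a single number, which is what makes the approach usable with the mean, the standard deviation, the median or the quartile coefficient of dispersion.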

In [None]:
test = stats.mannwhitneyu(
    x=ex_1[ex_1["type"]=="type_1_500_samples"]["value"],
    y=ex_1[ex_1["type"]=="type_2_10000_samples"]["value"], 
    alternative="two-sided")

Markdown("The Statistic of the Mann-Whitney-U test is {:.2f}, with a p_value of {:.4f}".format(test[0], test[1]))

In [None]:
test = stats.kruskal(
    ex_1[ex_1["type"]=="type_1_500_samples"]["value"].values,
    ex_1[ex_1["type"]=="type_2_10000_samples"]["value"].values)

Markdown("The Statistic of the Kruskal Wallis test is {:.2f}, with a p_value of {:.4f}".format(test[0], test[1]))

In [None]:
Markdown("The p_value of the permutation test comparing mean values is: {:.4f}".format(
        permutation_test(ex_1[ex_1["type"]=="type_1_500_samples"]["value"].values,
                         ex_1[ex_1["type"]=="type_2_10000_samples"]["value"].values,
                         method="approximate", num_rounds=RESAMPLES, seed=0,
                         func=lambda x, y: np.abs(np.mean(x) - np.mean(y))))
        )

In [None]:
Markdown("The p_value of the permutation test comparing standard deviation values is: {:.4f}".format(
        permutation_test(ex_1[ex_1["type"]=="type_1_500_samples"]["value"].values,
                         ex_1[ex_1["type"]=="type_2_10000_samples"]["value"].values,
                         method="approximate", num_rounds=RESAMPLES, seed=0,
                         func=lambda x, y: np.abs(np.std(x) - np.std(y))))
        )

In [None]:
Markdown("The p_value of the permutation test comparing median values is: {:.4f}".format(
        permutation_test(ex_1[ex_1["type"]=="type_1_500_samples"]["value"].values,
                         ex_1[ex_1["type"]=="type_2_10000_samples"]["value"].values,
                         method="approximate", num_rounds=RESAMPLES, seed=0,
                         func=lambda x, y: np.abs(np.median(x) - np.median(y))))
        )

In [None]:
Markdown("The p_value of the permutation test comparing quartile coefficient of dispersion values is: {:.4f}".format(
        permutation_test(ex_1[ex_1["type"]=="type_1_500_samples"]["value"].values,
                         ex_1[ex_1["type"]=="type_2_10000_samples"]["value"].values,
                         method="approximate", num_rounds=RESAMPLES, seed=0,
                         func=lambda x, y: np.abs(qcd(x) - qcd(y))))
        )

In [None]:
Markdown("The p_value of the permutation test using the Welch t-test statistic is: {:.4f}".format(
        permutation_test(ex_1[ex_1["type"]=="type_1_500_samples"]["value"].values,
                         ex_1[ex_1["type"]=="type_2_10000_samples"]["value"].values,
                         method="approximate", num_rounds=RESAMPLES, seed=0,
                         # ttest_ind returns (statistic, p-value); keep only the absolute statistic
                         func=lambda x, y: np.abs(stats.ttest_ind(x, y, equal_var=False)[0])))
        )

The tests point in the same direction (except for the median and quartile coefficient of dispersion permutation tests, which coincide with the irregular resampling distributions): there is no significant evidence that one of the distributions differs from the other. Nonetheless, all the conclusions are drawn from the initial sample. Replicating this analysis and/or collecting new samples of data should confirm whether this holds.

## Second Exercise

### Question 1: Exploratory Data Analysis

Since there is a timestamp variable, it may be interesting to check whether there are differences across months, years and days. In addition, *is_cancelled* contains NaN values, which are the cases that are not yet acknowledged; these NaN values become a new category called "Not Acknowledged".

In [None]:
## Data pre-processing #
ex_2_train = preprocess_df(ex_2_train, train=True)

cat = ["join_date_labour", "join_date_day", "join_date_month", "join_date_year", "credit_card_level", "aff_type", "country_segment", "product_type", "hidden", "is_cancelled", "product"]
num = ["STV"]

ex_2_train.head()

Except for STV and the target itself, there are no numerical variables. So, I'm showing boxplots for each level of the categorical variables to check whether there are visible differences between them.

In [None]:
for category in cat:
    fig = plt.figure(figsize=(10,10))
    sns.boxplot(x=category, y="target", data=ex_2_train)
    plt.title("Boxplot Comparison of LTV distribution in variable {}".format(category))
    plt.show()
    print("Summary Statistics of LTV according to {}".format(category))
    display(ex_2_train.groupby(category)["target"].describe())

Knowing that there are fraudulent users (target < 0), let's run a chi-squared test of independence between the fraud flag and each categorical variable. The results show that the distribution of fraudulent users is independent of the variable only for aff_type.

In [None]:
for category in cat:
    stat, p, degrees, perf = stats.chi2_contingency(
        pd.crosstab(ex_2_train[category],ex_2_train["frau"])
    )
    print("The p-value of the chi-squared test between the fraud flag and {} is {:.2f}, the statistic is {:.2f}".format(category, p, stat))
    

In [None]:
#Execute if you have time
#  p_value = permutation_test(ex_2_train["STV"], ex_2_train["target"],
#                            method='approximate', num_rounds=RESAMPLES,
#                            func=lambda x, y: np.corrcoef(x, y)[1][0],
#                            seed=0)

# Markdown("There's a significant (p_value = {:.2f}) but small positive correlation (Observed Pearson Correlation = {:.2f}) between STV and the target".format(p_value, np.corrcoef(ex_2_train["STV"], ex_2_train["target"])[1][0]))

fig = plt.figure(figsize=(10,10))
# sns.regplot(x="STV", y="target", data=ex_2_train) #Execute if you have time
sns.scatterplot(x="STV", y="target", data=ex_2_train) 
plt.show()

In brief:
- **STV** is correlated with **LTV**. For the *type_ex* product, **STV** and **LTV** are the same.
- Users who joined on a *weekend* tend to have a higher **LTV** distribution than those who joined on *labour* days.
- There are clear differences in the **LTV** distribution according to **credit_card_level**, **hidden** and **is_cancelled**.
- The **LTV** distribution when **aff_type** is *other* is much more robust than for the rest of the options.
- There are differences in the **LTV** distribution between **country_segment** levels.
- *Type_ex* has no fraud, while *type_u* and *type_x* reach the highest and lowest values of **LTV**. When **is_cancelled** is *Not Acknowledged*, there is no fraud.
- With some differences, the different **product_type** levels follow quite similar **LTV** distributions.
- *Fraud* is present and related to almost all features.

Challenges:
- The history only covers 6 months, so a study of seasonality would be incomplete. From an external point of view, it is something that should at least be studied.
- The information only covers the moment at which users purchase one of the products; there is no monitoring afterwards.
- There's a "blocking" effect: 1 to 4 instances (4 if they buy all four types of product) come from the same user. *credit_card_level*, *join_date*, *hidden*, *is_lp*, *aff_type* and *country_segment* are directly tied to the user_id.
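
The blocking effect can be checked with a short sketch like the one below. The rows are toy values invented for illustration (only the column names come from the data), and the check assumes that a user-level column never takes more than one value within the same `user_id`.

```python
import pandas as pd

# Toy rows (hypothetical values): two users, one of whom bought two products.
toy = pd.DataFrame({
    "user_id": [1, 1, 2],
    "product": ["type_u", "type_x", "type_ex"],
    "country_segment": ["A", "A", "B"],
    "aff_type": ["other", "other", "seo"],
    "target": [10.0, -3.0, 5.0],
})

user_level_cols = ["country_segment", "aff_type"]

# Number of rows contributed by each user.
rows_per_user = toy.groupby("user_id").size()

# A column is user-level if it has at most one distinct value within every user.
distinct_per_user = toy.groupby("user_id")[user_level_cols].nunique()
is_user_level = (distinct_per_user <= 1).all()

print(rows_per_user)
print(is_user_level)
```

Columns that pass this check carry no within-user information, which is the reason the user is treated as a blocking factor in the mixed-effects model.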


### Question 2: Modelling

To model the proposed levels of aggregation, I see four options that may be interesting:
- **Random Forest Model** using the descriptive variables of every row, with the aggregation done post hoc.
- A **Mixed-Effects Linear Model**, with the user as the blocking factor (so that the intrinsic variability of the data is measured without the effect of any individual user), and the product (the user's choice) together with the *STV* as inputs. *{High computational cost}*
- A **Multi-Objective Regression model** for each level of aggregation (or one regression per level), using the mean for the *LTV* output and the *STV* input, and the sum for the rest of the features. *{To implement}*
- **Recurrent Neural Network (RNN)** using the aggregation of the previous month. This option does not allow the model to be trained explicitly on the feature information, but it uses the intrinsic variability and trend of the data over time. *{To implement}*

To get an idea of the expected test error, I'm using June 2019 as a validation sample.

In [None]:
ex_2_train.head()

### Random Forest

In [None]:
hot_encoders = ["credit_card_level", "aff_type", "product", "is_cancelled"]
ex_2_train_encoded = pd.concat([ex_2_train[["STV", "hidden", "is_lp", "target", "join_date_month"]], 
                                pd.get_dummies(ex_2_train[hot_encoders])],
                               axis=1)

train = ex_2_train_encoded[ex_2_train_encoded["join_date_month"]!=6]
val = ex_2_train_encoded[ex_2_train_encoded["join_date_month"]==6]

In [None]:
regressor = RandomForestRegressor(n_estimators=100, random_state=0)
# Pass axis by keyword: the positional form is not accepted by recent pandas versions
regressor.fit(train.drop(["target", "join_date_month"], axis=1), train["target"])
y_pred = regressor.predict(val.drop(["target", "join_date_month"], axis=1))

In [None]:
Markdown("The expected mean absolute error is: {:.4f}".format(mean_absolute_error(val["target"], y_pred)))

In [None]:
fig = plt.figure(figsize=(20,10))
sns.distplot(val["target"] - y_pred)
plt.title("Distribution of validation error")
plt.show()

### Mixed-Effects Linear Model

In [None]:
train_lme = ex_2_train[ex_2_train["join_date_month"]!=6]
val_lme = ex_2_train[ex_2_train["join_date_month"]==6]

In [None]:
md = smf.mixedlm("target ~ STV * product_type", train_lme, groups=train_lme["user_id"])

mdf = md.fit()

print(mdf.summary())

In [None]:
y_pred = mdf.predict(val_lme)

In [None]:
Markdown("The expected mean absolute error is: {:.4f}".format(mean_absolute_error(val_lme["target"], y_pred)))

In [None]:
fig = plt.figure(figsize=(20,10))
sns.distplot(val_lme["target"] - y_pred)
plt.title("Distribution of validation error")
plt.show()

This model is computationally very expensive: the training sample alone contains 466,844 different *user_id* values. Nonetheless, it would generalize well. Once the user effect is removed from the estimates, we could see the effect of voluntary consumption. The variables that define the user could also be included in the model, but fitting it would require more computing time.

### Multi Regression

In [None]:
hot_encoders = ["credit_card_level", "aff_type", "product", "is_cancelled"]
ex_2_train_encoded = pd.concat([ex_2_train[["STV", "hidden", "is_lp", "target", "product_type", "country_segment", "join_date_month"]], 
                                pd.get_dummies(ex_2_train[hot_encoders])],
                               axis=1)
train = ex_2_train_encoded[ex_2_train_encoded["join_date_month"]!=6]
val = ex_2_train_encoded[ex_2_train_encoded["join_date_month"]==6]

In [None]:
# Sum the remaining features and average STV and target per (month, country_segment, product_type)
train_2 = pd.concat([
    train.drop(["STV", "target"], axis=1).groupby(["join_date_month", "country_segment", "product_type"], as_index=False).sum().reset_index(drop=True),
    train[["STV", "target", "join_date_month", "country_segment", "product_type"]].groupby(["join_date_month", "country_segment", "product_type"], as_index=False).mean().reset_index(drop=True)],
    axis=1)

val_2 = pd.concat([
    val.drop(["STV", "target"], axis=1).groupby(["join_date_month", "country_segment", "product_type"], as_index=False).sum().reset_index(drop=True),
    val[["STV", "target", "join_date_month", "country_segment", "product_type"]].groupby(["join_date_month", "country_segment", "product_type"], as_index=False).mean().reset_index(drop=True)],
    axis=1)

There would be one model for each level of aggregation. However, each regression would have just 7 instances for training, and only relationships within each level of aggregation would be taken into account.
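
As a sketch of the aggregation step only (not of the regression itself), the toy example below builds one row per month, country segment and product type in a single `groupby`, using the mean for *STV* and the target and the sum for the remaining features. The values and the `type_a` level are invented for illustration.

```python
import pandas as pd

# Toy rows (hypothetical values); column names follow the encoded training frame.
toy = pd.DataFrame({
    "join_date_month": [1, 1, 1, 2, 2, 2],
    "country_segment": ["A", "A", "B", "A", "B", "B"],
    "product_type": ["type_a"] * 6,
    "STV": [1.0, 3.0, 5.0, 2.0, 4.0, 6.0],
    "target": [2.0, 4.0, 6.0, 1.0, 3.0, 5.0],
    "hidden": [0, 1, 1, 0, 0, 1],
})

keys = ["join_date_month", "country_segment", "product_type"]
mean_cols = ["STV", "target"]
sum_cols = [c for c in toy.columns if c not in keys + mean_cols]

# One aggregation rule per column: mean for STV and target, sum for the rest.
agg_spec = {**{c: "mean" for c in mean_cols}, **{c: "sum" for c in sum_cols}}
aggregated = toy.groupby(keys, as_index=False).agg(agg_spec)

print(aggregated)
```

Doing both aggregations in one call avoids duplicating the key columns. It also makes the small number of rows per aggregation level easy to see before any regression is fitted.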

### RNN

The input would be the aggregated data as a matrix, fed sequentially for each month that has passed, so each new month would use the previous month to make its predictions. Furthermore, with LSTM cells, more months in the past may be considered by the model. In addition, the aggregates of the inputs may be passed as input to the same network.
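
The input shaping described above could look like the following sketch. It only prepares the data and does not define any network; `make_windows` is a hypothetical helper, and the monthly matrix is filled with placeholder numbers.

```python
import numpy as np

def make_windows(monthly, n_lags):
    """Turn a (n_months, n_features) matrix into sequence inputs.

    Returns X with shape (n_samples, n_lags, n_features), holding the
    previous n_lags months, and y with shape (n_samples, n_features),
    holding the month to predict.
    """
    X = np.stack([monthly[i:i + n_lags] for i in range(len(monthly) - n_lags)])
    y = monthly[n_lags:]
    return X, y

# Placeholder matrix: 6 months of history, 3 aggregated features per month.
monthly = np.arange(6 * 3, dtype=float).reshape(6, 3)

X, y = make_windows(monthly, n_lags=2)
print(X.shape, y.shape)
```

With six months of history and two lagged months per sample, only four training sequences are available, which is worth keeping in mind when sizing the network.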

### Question 3: Prediction

Using the **Random Forest** model as a predictor (which has the lowest validation error), the prediction for July 2019 is exported to the directory where this notebook is located.

In [None]:
ex_2_test = preprocess_df(ex_2_test)

ex_2_test_encoded = pd.concat([ex_2_test[["STV", "hidden", "is_lp", "join_date_month"]], 
                                pd.get_dummies(ex_2_test[hot_encoders])],
                               axis=1)


In [None]:
ex_2_test["y_pred"] = regressor.predict(ex_2_test_encoded.drop(["join_date_month"], axis=1))
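
One thing to watch in this step: `pd.get_dummies` creates columns only for the categories present in the frame it receives, so the encoded test set may not have the same columns as the training set. A possible safeguard, sketched here on toy frames with invented category values, is to reindex the test columns against the training columns.

```python
import pandas as pd

# Toy frames (hypothetical categories): "d" appears only in test, "b" and "c" only in train.
train_raw = pd.DataFrame({"aff_type": ["a", "b", "c"], "STV": [1.0, 2.0, 3.0]})
test_raw = pd.DataFrame({"aff_type": ["a", "a", "d"], "STV": [4.0, 5.0, 6.0]})

train_encoded = pd.get_dummies(train_raw, columns=["aff_type"])
test_encoded = pd.get_dummies(test_raw, columns=["aff_type"])

# Keep exactly the training columns, in the same order:
# missing dummies are filled with 0 and unseen categories are dropped.
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(test_aligned.columns))
```

Dropping unseen categories is a design choice in this sketch; whether it is acceptable depends on how often new levels appear in the test data.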

In [None]:
output = pd.pivot_table(ex_2_test, values='y_pred', index=['join_date_month', 'country_segment'],
                        columns=['product_type'], aggfunc=np.mean)

In [None]:
flattened = pd.DataFrame(output.to_records())
flattened["join_date_month"] = "2019-07-01"
flattened.columns = list(map(lambda x: x.replace("type", "mean_target_type"),flattened.columns))
display(flattened)
flattened.to_csv("prediction_july_19.csv") #The file is located in the same place as this notebook.

In addition, using the **Mixed-Effects Linear Model** as a predictor, the prediction for July 2019 is exported to the directory where this notebook is located.

In [None]:
ex_2_test["y_pred_lme"] = mdf.predict(ex_2_test)
output = pd.pivot_table(ex_2_test, values='y_pred_lme', index=['join_date_month', 'country_segment'],
                        columns=['product_type'], aggfunc=np.mean)
flattened = pd.DataFrame(output.to_records())
flattened["join_date_month"] = "2019-07-01"
flattened.columns = list(map(lambda x: x.replace("type", "mean_target_type"),flattened.columns))
display(flattened)
flattened.to_csv("prediction_july_19_lme.csv") #The file is located in the same place as this notebook.

# Personal Notes

Despite not being able to successfully implement what was required in the final exercise, I've pointed out the strategies that I would like to follow (in a working environment, discussing them with the rest of the team). I would like to ask for feedback on the aspects that you consider wrong or incomplete. I faced a challenge (which I'm interested in), and with your clarifications I will learn from my errors.

Of course, I'm fully aware that this document could be a lot prettier and could include references and explanations. However, with my current schedule, the time I was able to invest in this test was limited (1.5 workdays), and I tried to make it as good and clear as possible. I would have liked to explore some multivariate techniques to go beyond bivariate relationships, perform some clustering, and then implement the rest of the models, optimizing the hyperparameters with cross-validation.