# Project hypotheses

## Localization

The current data come from a supposedly wide audience, since Reddit is a well-known platform in the US.
The first hypothesis is that using Reddit doesn't strongly filter the relevant population, so the dataset isn't biased towards a subpopulation.
The second hypothesis is that our business context and target population are sufficiently similar to the dataset population, even if the location differs, such as France, where Reddit has less coverage.

## Time shift

Even if, as of 2022/12/26, the r/RAOP rules put strong emphasis on the relative legitimacy of applicants (Reddit account metadata) to avoid inappropriate requests, they cannot be taken into account here.
Indeed, they may have evolved over time, which already adds bias to the historical data, and they obviously cannot be applied retrospectively now, 9 years later.
Nonetheless, we assume that altruism stays constant over time across a wide population.
Shifts in the worldwide economic situation over time are neglected, since our business object is a vital food product that is popular and affordable enough for many people.

## Wisdom of crowds

Even if we disregard the rules process, the Reddit structure (comments, votes, account metadata) is treated as a soft-voting tool for influence.
That's why we assume that the donation process and request legitimacy are aligned, and we're confident about the transfer between the RAOP donation purpose and our business objective.
So if a request led to a donation, then the request was legitimate.

# Business context

## Marketing campaign

I'm running a fast-growing pizza restaurant with a few locations.
To promote our upcoming additional location, we're launching a marketing campaign to donate pizzas to people who request one.
It can serve several goals:
+ Expand our brand image
+ Minimize waste from unsold pizzas
+ Donate to people in need

Currently, we can't afford dedicated people to develop and run this kind of process. Lucky for me, I used to be a Data Scientist, and r/RAOP together with Kaggle gives me data to work with.

## Business objectives

1. Train a model to predict the legitimacy *(i.e. pizza donation)* of a request, using only information available at the moment of the request to avoid target leakage.
2. Find a process that doesn't contradict or lower that legitimacy once the data at the moment of retrieval are considered, if there's such a thing.

## Future concerns

The current design doesn't address whether legitimate requests actually receive a donation, nor marketing performance.
Indeed, legitimate requests could be fulfilled fully or partially depending on our selection process, volume, donation supply chain, seasonality, cost efficiency, and many other variables.
For now, the project focuses on modeling donation legitimacy.

# Data preparation

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

## Load dataset

In [2]:
pizza_raw_data = pd.read_json(
    '../data/pizza_data.json',
    dtype={
        "giver_username_if_known": str,
        "number_of_upvotes_of_request_at_retrieval": int,
        "post_was_edited": bool,
        "request_id": str,
        "request_number_of_comments_at_retrieval": int,
        "request_text": str,
        "request_text_edit_aware": str,
        "request_title": str,
        "requester_account_age_in_days_at_request": float,
        "requester_account_age_in_days_at_retrieval": float,
        "requester_days_since_first_post_on_raop_at_request": float,
        "requester_days_since_first_post_on_raop_at_retrieval": float,
        "requester_number_of_comments_at_request": int,
        "requester_number_of_comments_at_retrieval": int,
        "requester_number_of_comments_in_raop_at_request": int,
        "requester_number_of_comments_in_raop_at_retrieval": int,
        "requester_number_of_posts_at_request": int,
        "requester_number_of_posts_at_retrieval": int,
        "requester_number_of_posts_on_raop_at_request": int,
        "requester_number_of_posts_on_raop_at_retrieval": int,
        "requester_number_of_subreddits_at_request": int,
        "requester_received_pizza": bool,
        "requester_subreddits_at_request": list,
        "requester_upvotes_minus_downvotes_at_request": int,
        "requester_upvotes_minus_downvotes_at_retrieval": int,
        "requester_upvotes_plus_downvotes_at_request": int,
        "requester_upvotes_plus_downvotes_at_retrieval": int,
        "requester_user_flair": str,
        "requester_username": str,
        "unix_timestamp_of_request": int,
        "unix_timestamp_of_request_utc": int,
    },
)

In [3]:
pizza_raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4040 entries, 0 to 4039
Data columns (total 32 columns):
 #   Column                                                Non-Null Count  Dtype  
---  ------                                                --------------  -----  
 0   giver_username_if_known                               4040 non-null   object 
 1   number_of_downvotes_of_request_at_retrieval           4040 non-null   int64  
 2   number_of_upvotes_of_request_at_retrieval             4040 non-null   int32  
 3   post_was_edited                                       4040 non-null   bool   
 4   request_id                                            4040 non-null   object 
 5   request_number_of_comments_at_retrieval               4040 non-null   int32  
 6   request_text                                          4040 non-null   object 
 7   request_text_edit_aware                               4040 non-null   object 
 8   request_title                                         4040

In [4]:
pizza_raw_data.head()

Unnamed: 0,giver_username_if_known,number_of_downvotes_of_request_at_retrieval,number_of_upvotes_of_request_at_retrieval,post_was_edited,request_id,request_number_of_comments_at_retrieval,request_text,request_text_edit_aware,request_title,requester_account_age_in_days_at_request,...,requester_received_pizza,requester_subreddits_at_request,requester_upvotes_minus_downvotes_at_request,requester_upvotes_minus_downvotes_at_retrieval,requester_upvotes_plus_downvotes_at_request,requester_upvotes_plus_downvotes_at_retrieval,requester_user_flair,requester_username,unix_timestamp_of_request,unix_timestamp_of_request_utc
0,,0,1,False,t3_l25d7,0,Hi I am in need of food for my 4 children we a...,Hi I am in need of food for my 4 children we a...,Request Colorado Springs Help Us Please,0.0,...,False,[],0,1,0,1,,nickylvst,1317852607,1317849007
1,,2,5,False,t3_rcb83,0,I spent the last money I had on gas today. Im ...,I spent the last money I had on gas today. Im ...,"[Request] California, No cash and I could use ...",501.1111,...,False,"[AskReddit, Eve, IAmA, MontereyBay, RandomKind...",34,4258,116,11168,,fohacidal,1332652424,1332648824
2,,0,3,False,t3_lpu5j,0,My girlfriend decided it would be a good idea ...,My girlfriend decided it would be a good idea ...,"[Request] Hungry couple in Dundee, Scotland wo...",0.0,...,False,[],0,3,0,3,,jacquibatman7,1319650094,1319646494
3,,0,1,True,t3_mxvj3,4,"It's cold, I'n hungry, and to be completely ho...","It's cold, I'n hungry, and to be completely ho...","[Request] In Canada (Ontario), just got home f...",6.518438,...,False,"[AskReddit, DJs, IAmA, Random_Acts_Of_Pizza]",54,59,76,81,,4on_the_floor,1322855434,1322855434
4,,6,6,False,t3_1i6486,5,hey guys:\n I love this sub. I think it's grea...,hey guys:\n I love this sub. I think it's grea...,[Request] Old friend coming to visit. Would LO...,162.063252,...,False,"[GayBrosWeightLoss, RandomActsOfCookies, Rando...",1121,1225,1733,1887,,Futuredogwalker,1373657691,1373654091


## Data leakage prevention

Some features may lead to data leakage.
One is directly linked to the pizza donation: `giver_username_if_known`.
Others may leak because they have no at_request/at_retrieval split, such as `requester_user_flair` *(a requester badge obtained after receiving a pizza donation)*, `request_text` and `post_was_edited` *(some request posts are edited after getting a pizza donation)*.
So, these features are removed from our project.

In [14]:
# Drop the columns identified above as potential sources of leakage.
pizza_prevented_data = pizza_raw_data.drop(columns=["giver_username_if_known",
                                                    "requester_user_flair",
                                                    "request_text",
                                                    "post_was_edited"])
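
To illustrate why a column such as `requester_user_flair` can leak the target, here is a minimal sketch on a hypothetical toy frame. The rows and the flair value `some_badge` are invented for illustration and are not taken from the dataset. When a feature value almost perfectly determines the outcome, it was most likely recorded after the donation.

```python
import pandas as pd

# Hypothetical toy data: flair values and outcomes are invented for illustration.
toy = pd.DataFrame({
    "requester_user_flair": [None, "some_badge", None, "some_badge", None, None],
    "requester_received_pizza": [False, True, False, True, False, False],
})

# Share of successful requests per flair value (missing flair kept as its own group).
success_by_flair = (toy
                    .assign(flair=toy["requester_user_flair"].fillna("none"))
                    .groupby("flair")["requester_received_pizza"]
                    .mean())
print(success_by_flair)
```

On the real data, a comparable table would show whether the flair separates the two classes suspiciously well, which would support removing it.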

## Split training data

In [15]:
target_name = 'requester_received_pizza'
seed = 101
X = pizza_prevented_data.copy()
y = X.pop(target_name)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed, stratify=y)

In [16]:
print(f"From our {pizza_raw_data.shape[0]} samples, we'll use {X_train.shape[0]} of them to train the model.")

From our 4040 samples, we'll use 3232 of them to train the model.
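
The `stratify=y` argument is meant to keep the share of successful requests the same in both splits. A minimal sketch on a hypothetical imbalanced target (the 25% positive rate and the single dummy feature are assumptions for illustration) shows how this could be checked:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 25% positives, standing in for a minority class.
y_toy = pd.Series([True] * 250 + [False] * 750, name="target")
X_toy = pd.DataFrame({"feature": range(len(y_toy))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=101, stratify=y_toy
)

# With stratify, both splits keep the positive rate of the full data.
print(y_toy.mean(), y_tr.mean(), y_te.mean())
```

The same comparison could be run on `y`, `y_train` and `y_test` from the cell above.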


In [17]:
X_train.describe(include="all")

Unnamed: 0,number_of_downvotes_of_request_at_retrieval,number_of_upvotes_of_request_at_retrieval,request_id,request_number_of_comments_at_retrieval,request_text_edit_aware,request_title,requester_account_age_in_days_at_request,requester_account_age_in_days_at_retrieval,requester_days_since_first_post_on_raop_at_request,requester_days_since_first_post_on_raop_at_retrieval,...,requester_number_of_posts_on_raop_at_retrieval,requester_number_of_subreddits_at_request,requester_subreddits_at_request,requester_upvotes_minus_downvotes_at_request,requester_upvotes_minus_downvotes_at_retrieval,requester_upvotes_plus_downvotes_at_request,requester_upvotes_plus_downvotes_at_retrieval,requester_username,unix_timestamp_of_request,unix_timestamp_of_request_utc
count,3232.0,3232.0,3232,3232.0,3232.0,3232,3232.0,3232.0,3232.0,3232.0,...,3232.0,3232.0,3232,3232.0,3232.0,3232.0,3232.0,3232,3232.0,3232.0
unique,,,3232,,3148.0,3224,,,,,...,,,2399,,,,,3232,,
top,,,t3_m9dxg,,,[REQUEST],,,,,...,,,[],,,,,mindfragment,,
freq,,,1,,81.0,4,,,,,...,,,585,,,,,1,,
mean,2.413366,5.977413,,2.853032,,,250.563901,755.241885,16.177954,520.436075,...,1.246287,18.022587,,1144.773205,2682.154394,3636.832,7584.899,,1342695000.0,1342692000.0
std,3.009515,8.710719,,4.67757,,,296.577877,328.500959,68.784004,267.56107,...,0.61999,21.701225,,3712.847322,6341.838215,25018.16,40137.5,,23300230.0,23299570.0
min,0.0,0.0,,0.0,,,0.0,45.291562,0.0,0.0,...,0.0,0.0,,-173.0,-173.0,0.0,0.0,,1297723000.0,1297723000.0
25%,1.0,2.0,,0.0,,,3.637118,519.489132,0.0,284.515923,...,1.0,1.0,,3.0,21.0,9.0,50.0,,1320192000.0,1320189000.0
50%,2.0,4.0,,1.0,,,157.06717,753.834248,0.0,529.845451,...,1.0,11.0,,177.0,692.5,354.0,1278.5,,1342565000.0,1342561000.0
75%,3.0,7.0,,4.0,,,385.619852,897.417031,0.0,778.395281,...,1.0,27.0,,1154.0,3280.0,2287.75,6782.75,,1364039000.0,1364035000.0


In [18]:
y_train.describe()

count      3232
unique        2
top       False
freq       2437
Name: requester_received_pizza, dtype: object
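
The counts above imply an imbalanced target. As a quick worked example using only the numbers reported by `y_train.describe()`, the positive rate and the accuracy of a trivial "always predict `False`" rule can be derived as follows:

```python
# Counts reported by y_train.describe() above.
n_train = 3232
n_negative = 2437  # requests without a donation (top = False)
n_positive = n_train - n_negative

positive_rate = n_positive / n_train
majority_accuracy = n_negative / n_train  # accuracy of always predicting False

print(f"positives: {n_positive} ({positive_rate:.1%})")        # positives: 795 (24.6%)
print(f"majority-class accuracy: {majority_accuracy:.1%}")     # majority-class accuracy: 75.4%
```

Any model should therefore be judged against roughly 75% accuracy rather than 50%, or with a metric that is less sensitive to class imbalance.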

## Dissociate features at request from at retrieval

To avoid data leakage (for example, a request that received a donation could get a boost in upvotes afterwards), only features available at request time are used for our first objective, modeling the legitimacy of a request.

In [22]:
univariate_features = ["request_id",
                       "requester_username",
                       "unix_timestamp_of_request_utc",
                       "request_title",
                       "request_text_edit_aware"]

at_request_features = []
at_retrieval_features = []

for selected_time, selected_features in {"request": at_request_features, "retrieval": at_retrieval_features}.items():
    dataset_features = (X_train
                        .filter(regex=f'.*{selected_time}$')
                        .columns
                        .tolist())
    selected_features.extend(dataset_features)
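
As a small sketch of how the suffix-based selection behaves, the example below applies the same kind of pattern to an empty toy frame. The column names are copied from the dataset, but the frame itself is hypothetical. `DataFrame.filter(regex=...)` keeps labels for which `re.search` matches, so the leading `.*` is not strictly needed.

```python
import pandas as pd

# Column names copied from the dataset; the frame holds no data.
toy = pd.DataFrame(columns=[
    "requester_number_of_posts_at_request",
    "requester_number_of_posts_at_retrieval",
    "unix_timestamp_of_request",
    "unix_timestamp_of_request_utc",
    "request_title",
])

# Keep only the columns whose name ends with the given suffix.
at_request = toy.filter(regex=r"request$").columns.tolist()
at_retrieval = toy.filter(regex=r"retrieval$").columns.tolist()
print(at_request)
print(at_retrieval)
```

Note that `unix_timestamp_of_request` is caught by the `request$` pattern while `unix_timestamp_of_request_utc` is not, which is presumably why the latter is listed explicitly in `univariate_features`.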

In [23]:
# Index with a list: recent pandas versions do not accept a set as a column indexer.
X_train = X_train[univariate_features + at_request_features]
#pizza_retrieval_data = X_train[univariate_features + at_retrieval_features]


# Data exploration
## Let's start with non-textual data

To get a baseline reference, let's start with a very basic model, expecting underperforming results.

In [24]:
X_train.select_dtypes(exclude=["object"]).corrwith(y_train)

requester_days_since_first_post_on_raop_at_request    0.113513
requester_account_age_in_days_at_request              0.046025
requester_number_of_posts_at_request                  0.008033
requester_upvotes_minus_downvotes_at_request          0.032900
requester_upvotes_plus_downvotes_at_request           0.032593
requester_number_of_comments_at_request               0.022811
requester_number_of_posts_on_raop_at_request          0.145767
unix_timestamp_of_request                            -0.109959
unix_timestamp_of_request_utc                        -0.109957
requester_number_of_subreddits_at_request             0.024470
requester_number_of_comments_in_raop_at_request       0.136583
dtype: float64
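
The negative correlation of about -0.11 between the request timestamp and the target suggests that the success rate may have drifted over time, which relates to the time shift hypothesis. One way to look at this is to average the target per calendar year. The sketch below uses a hypothetical toy frame (timestamps and outcomes are invented for illustration), so its numbers say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy rows: timestamps and outcomes are invented for illustration.
toy = pd.DataFrame({
    "unix_timestamp_of_request_utc": [1317849007, 1322855434, 1332648824,
                                      1342561000, 1364035000, 1373654091],
    "requester_received_pizza": [True, True, False, True, False, False],
})

# Convert epoch seconds to calendar years, then average the boolean target per year.
year = pd.to_datetime(toy["unix_timestamp_of_request_utc"], unit="s").dt.year
rate_by_year = toy.groupby(year)["requester_received_pizza"].mean()
print(rate_by_year)
```

Applied to the training data, a clearly decreasing yearly rate would mean the constant-altruism assumption deserves a closer look.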