# Project hypotheses

## Localization

Current data come from a supposedly wide audience, since Reddit is well known in the US.
The first hypothesis is that using Reddit does not strongly filter the relevant population, which would otherwise bias the dataset towards a subpopulation.
The second hypothesis is that the business context and targeted population are sufficiently similar to the dataset population, even if the location differs, such as France, where Reddit has less coverage.

## Time shift

Although, as of 2022/12/26, the r/RAOP rules put strong emphasis on the relative legitimacy of applicants (Reddit account metadata) to avoid inappropriate requests, they cannot be taken into account here.
Indeed, they may have evolved over time, which already adds bias to the historical data, and they obviously cannot be applied retrospectively now, 9 years later.
Nonetheless, we assume that altruism is constant over time across a wide population.
Shifts in the worldwide economic situation over time are also neglected, since our business object is a vital food product that is very popular and still affordable.

## Wisdom of crowds

Even if we disregard the rules process, the Reddit structure (comments, votes, account metadata) is treated as a soft-voting tool driven by influence.
That's why we assume that the donation process and request legitimacy are aligned, and we're confident about the transfer between the RAOP donation purpose and our business objective.
So, if a request led to a donation, the request is considered legitimate.

# Business context

## Marketing campaign

I'm running a fast-growing pizza restaurant with a few locations.
To promote our upcoming additional location, we're launching a marketing campaign to donate pizzas to people who request one.
It can serve several goals:
+ Expand our brand image
+ Minimize waste from unsold pizzas
+ Donate to people in need

Currently, we can't afford dedicated staff to develop and run this kind of process. Lucky for me, I used to be a Data Scientist, and r/RAOP plus Kaggle give me data to work with.

## Business objectives

1. Train a model to predict legitimacy *(i.e. pizza donation)* of a request at the moment of request to avoid target leakage.
2. Find a process that doesn't disprove or lower the previous legitimacy of donation at the moment of data retrieval, if there's such a thing.

## Future concerns

The current design doesn't address whether legitimate requests lead to actual donations, nor marketing performance.
Indeed, legitimate requests could be fulfilled entirely or partially depending on our selection process, volume, donation supply chain, seasonality, cost efficiency, and many other variables.
For now, the project focuses on modelling donation legitimacy.

# Data preparation

In [1]:
import numpy as np
import pandas as pd
from datetime import datetime as dt
from scipy.stats import f_oneway
import seaborn as sns

from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import train_test_split, HalvingGridSearchCV
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import classification_report, f1_score, precision_score, confusion_matrix
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
import xgboost as xgb
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

## Load dataset

In [2]:
pizza_raw_data = pd.read_json('../data/pizza_data.json',
                              dtype={"giver_username_if_known": str,
                                     "number_of_upvotes_of_request_at_retrieval": int,
                                     "post_was_edited": bool,
                                     "request_id": str,
                                     "request_number_of_comments_at_retrieval": int,
                                     "request_text": str,
                                     "request_text_edit_aware": str,
                                     "request_title": str,
                                     "requester_account_age_in_days_at_request": float,
                                     "requester_account_age_in_days_at_retrieval": float,
                                     "requester_days_since_first_post_on_raop_at_request": float,
                                     "requester_days_since_first_post_on_raop_at_retrieval": float,
                                     "requester_number_of_comments_at_request": int,
                                     "requester_number_of_comments_at_retrieval": int,
                                     "requester_number_of_comments_in_raop_at_request": int,
                                     "requester_number_of_comments_in_raop_at_retrieval": int,
                                     "requester_number_of_posts_at_request": int,
                                     "requester_number_of_posts_at_retrieval": int,
                                     "requester_number_of_posts_on_raop_at_request": int,
                                     "requester_number_of_posts_on_raop_at_retrieval": int,
                                     "requester_number_of_subreddits_at_request": int,
                                     "requester_received_pizza": bool,
                                     "requester_subreddits_at_request": list,
                                     "requester_upvotes_minus_downvotes_at_request": int,
                                     "requester_upvotes_minus_downvotes_at_retrieval": int,
                                     "requester_upvotes_plus_downvotes_at_request": int,
                                     "requester_upvotes_plus_downvotes_at_retrieval": int,
                                     "requester_user_flair": str,
                                     "requester_username": str,
                                     "unix_timestamp_of_request": int,
                                     "unix_timestamp_of_request_utc": int})

In [3]:
pizza_raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4040 entries, 0 to 4039
Data columns (total 32 columns):
 #   Column                                                Non-Null Count  Dtype  
---  ------                                                --------------  -----  
 0   giver_username_if_known                               4040 non-null   object 
 1   number_of_downvotes_of_request_at_retrieval           4040 non-null   int64  
 2   number_of_upvotes_of_request_at_retrieval             4040 non-null   int32  
 3   post_was_edited                                       4040 non-null   bool   
 4   request_id                                            4040 non-null   object 
 5   request_number_of_comments_at_retrieval               4040 non-null   int32  
 6   request_text                                          4040 non-null   object 
 7   request_text_edit_aware                               4040 non-null   object 
 8   request_title                                         4040

The dataset has no missing values, so imputation processes will not be covered in this notebook, but should be considered in a full production-ready pipeline.
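For a production pipeline, a guard of the following kind could make the no-missing-values assumption explicit before any modelling. This is only a minimal sketch: the helper name `missing_value_report` and the toy frame are hypothetical and not part of the notebook.

```python
import pandas as pd

def missing_value_report(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing values per column, keeping only columns that have some."""
    counts = df.isna().sum()
    return counts[counts > 0]

# Hypothetical toy frame, for illustration only
toy = pd.DataFrame({"request_title": ["[REQUEST] a", None],
                    "requester_number_of_posts_at_request": [3, 0]})
report = missing_value_report(toy)
print(report)
```

An empty report means the imputation step can be skipped; a non-empty one tells which columns would need it.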

## Data leakage prevention

Some features may lead to data leakage.
One is directly linked to pizza donation: `giver_username_if_known`.
Others may leak because they are not split into at_request/at_retrieval versions, such as `requester_user_flair` *(a badge the requester obtains after receiving a pizza donation)*, `request_text` and `post_was_edited` *(some request posts are edited after getting a pizza donation)*.
So, these features are removed from our project.
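One way to spot this kind of leakage is to look at the donation rate per value of a suspicious feature. The sketch below uses an invented toy sample (the flair labels and outcomes are hypothetical, not taken from the dataset) purely to show the shape of the check.

```python
import pandas as pd

# Hypothetical toy sample: flair labels and outcomes are invented for illustration
toy = pd.DataFrame({
    "requester_user_flair": ["no_flair", "flair", "no_flair", "flair", "no_flair"],
    "requester_received_pizza": [False, True, False, True, False],
})

# Donation rate per flair value: rates of exactly 0 or 1 suggest the feature encodes the outcome
donation_rate = toy.groupby("requester_user_flair")["requester_received_pizza"].mean()
print(donation_rate)
```

A feature whose values split the target almost perfectly is more likely a consequence of the donation than a predictor available at request time.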

In [4]:
pizza_prevented_data = pizza_raw_data.loc[:, ~(pizza_raw_data
                                               .columns
                                               .isin(["giver_username_if_known",
                                                      "requester_user_flair",
                                                      "request_text",
                                                      "post_was_edited"]))
                       ]

## Split training data

In [5]:
target_name = 'requester_received_pizza'
seed = 101
X = pizza_prevented_data.copy()
y = X.pop(target_name)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed, stratify=y)

In [6]:
dataset_period = (dt.strptime("29/09/2013", "%d/%m/%Y") - dt.strptime("08/12/2010", "%d/%m/%Y")).days

print(f'''From our {pizza_raw_data.shape[0]} samples, we'll use {X_train.shape[0]} of them to train the model.
From {len(y_train)} requests, {y_train.sum()} led to a donation, which represents {round(y_train.sum()/len(y_train), 3)*100}%.
Thus, among {dataset_period} days, 1 pizza was donated every {round(dataset_period/y_train.sum(), 2)} days.''')

From our 4040 samples, we'll use 3232 of them to train the model.
From 3232 requests, 795 led to a donation, which represents 24.6%.
Thus, among 1026 days, 1 pizza was donated every 1.29 days.


One request out of four is therefore legitimate under our hypotheses.

## Dissociate features at request from at retrieval

To avoid data leakage (for example, a request that received a donation could get a boost in upvotes afterwards), only features available at request time are kept, in line with the first objective of modelling the legitimacy of a request.

In [7]:
univariate_features = ["request_id",
                       "requester_username",
                       "unix_timestamp_of_request_utc", #non-utc timestamp is redundant and less convenient
                       "request_title",
                       "request_text_edit_aware"]

at_request_features = []
at_retrieval_features = []

for selected_time, selected_features in {"at_request": at_request_features, "at_retrieval": at_retrieval_features}.items():
    dataset_features = (X_train
                        .filter(regex=f'.*{selected_time}$')
                        .columns
                        .tolist())
    selected_features.extend(dataset_features)

In [8]:
X_train = X_train[univariate_features + at_request_features]
#pizza_retrieval_data = X_train[univariate_features + at_retrieval_features]

In [9]:
print(f"{X_train.shape[1]} raw features can be used to predict legitimacy of a request.")

15 raw features can be used to predict legitimacy of a request.


In [10]:
X_train.describe(include="all")

Unnamed: 0,request_id,requester_username,unix_timestamp_of_request_utc,request_title,request_text_edit_aware,requester_account_age_in_days_at_request,requester_days_since_first_post_on_raop_at_request,requester_number_of_comments_at_request,requester_number_of_comments_in_raop_at_request,requester_number_of_posts_at_request,requester_number_of_posts_on_raop_at_request,requester_number_of_subreddits_at_request,requester_subreddits_at_request,requester_upvotes_minus_downvotes_at_request,requester_upvotes_plus_downvotes_at_request
count,3232,3232,3232.0,3232,3232.0,3232.0,3232.0,3232.0,3232.0,3232.0,3232.0,3232.0,3232,3232.0,3232.0
unique,3232,3232,,3224,3148.0,,,,,,,,2399,,
top,t3_m9dxg,mindfragment,,[REQUEST],,,,,,,,,[],,
freq,1,1,,4,81.0,,,,,,,,585,,
mean,,,1342692000.0,,,250.563901,16.177954,115.656869,0.673886,21.008045,0.068379,18.022587,,1144.773205,3636.832
std,,,23299570.0,,,296.577877,68.784004,194.069195,3.48407,48.62435,0.33835,21.701225,,3712.847322,25018.16
min,,,1297723000.0,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,-173.0,0.0
25%,,,1320189000.0,,,3.637118,0.0,0.0,0.0,0.0,0.0,1.0,,3.0,9.0
50%,,,1342561000.0,,,157.06717,0.0,24.0,0.0,5.0,0.0,11.0,,177.0,354.0
75%,,,1364035000.0,,,385.619852,0.0,140.0,0.0,22.0,0.0,27.0,,1154.0,2287.75


# Data exploration
## Let's start first with non-textual data ...

To get a baseline reference, let's start with a very basic model, even if it underperforms.

In [11]:
X_train_num = X_train.select_dtypes(exclude=["object"]).copy()  # copy so new columns can be added without SettingWithCopyWarning

In [12]:
X_train_num.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3232 entries, 3418 to 813
Data columns (total 10 columns):
 #   Column                                              Non-Null Count  Dtype  
---  ------                                              --------------  -----  
 0   unix_timestamp_of_request_utc                       3232 non-null   int32  
 1   requester_account_age_in_days_at_request            3232 non-null   float64
 2   requester_days_since_first_post_on_raop_at_request  3232 non-null   float64
 3   requester_number_of_comments_at_request             3232 non-null   int32  
 4   requester_number_of_comments_in_raop_at_request     3232 non-null   int32  
 5   requester_number_of_posts_at_request                3232 non-null   int32  
 6   requester_number_of_posts_on_raop_at_request        3232 non-null   int32  
 7   requester_number_of_subreddits_at_request           3232 non-null   int32  
 8   requester_upvotes_minus_downvotes_at_request        3232 non-null   int32  


### Feature engineering

Votes give great insights: the sum relates to visibility and the difference to consensus/polarisation. However, both variables here are absolute, and a relative one is missing.

In [13]:
num = 'requester_upvotes_minus_downvotes_at_request'
den = 'requester_upvotes_plus_downvotes_at_request'
# Relative consensus = (upvotes - downvotes) / (upvotes + downvotes), set to 0 when there are no votes
safe_den = X_train_num[den].where(X_train_num[den] != 0)  # zero totals become NaN
X_train_num['requester_relative_consensual_votes_at_request'] = (X_train_num[num] / safe_den).fillna(0)
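To make the relative feature concrete, here is a minimal standalone sketch of the same ratio on a few made-up accounts. The function name `relative_consensus` and the example vote counts are hypothetical; with non-negative vote counts the ratio stays within [-1, 1].

```python
def relative_consensus(votes_diff: float, votes_sum: float) -> float:
    """(upvotes - downvotes) / (upvotes + downvotes), or 0 when there are no votes."""
    if votes_sum == 0:
        return 0.0
    return votes_diff / votes_sum

# Hypothetical accounts given as (upvotes, downvotes)
for up, down in [(90, 10), (50, 50), (0, 0)]:
    print(up, down, relative_consensus(up - down, up + down))
```

A value close to 1 means mostly upvotes (consensus), a value close to 0 means either a polarised account or one without any votes.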

### Feature importances

In [14]:
def make_mi_scores(X, y):
    mi_results = mutual_info_classif(X, y, random_state=seed)
    mi_results = pd.Series(mi_results, name="MI Scores", index=X.columns)
    return mi_results

def make_f_oneway_scores(X, y):
    grp_anova = X.groupby(y)

    f_values = []
    for feat in X.columns:
        s, p = f_oneway(grp_anova.get_group(0)[feat],
                        grp_anova.get_group(1)[feat])
        s = round(s, 4)
        p = round(p, 4)
        f_values.append((s, p))

    f_scores = pd.Series(f_values, name="F_oneway Scores", index=X.columns)
    return f_scores

def make_corr_scores(X, y):
    return X.corrwith(y)

In [15]:
mi_scores = make_mi_scores(X_train_num, y_train)
f_oneway_scores = make_f_oneway_scores(X_train_num, y_train)
corr_scores = make_corr_scores(X_train_num, y_train)

pd_info = pd.concat([mi_scores, corr_scores, f_oneway_scores], axis='columns')
pd_info.columns =["Mutual Info", "Correlation", "F_oneway (s,p)"]
pd_info

Unnamed: 0,Mutual Info,Correlation,"F_oneway (s,p)"
unix_timestamp_of_request_utc,0.021659,-0.109957,"(39.5304, 0.0)"
requester_account_age_in_days_at_request,0.005067,0.046025,"(6.8567, 0.0089)"
requester_days_since_first_post_on_raop_at_request,0.010643,0.113513,"(42.1623, 0.0)"
requester_number_of_comments_at_request,0.0,0.022811,"(1.6815, 0.1948)"
requester_number_of_comments_in_raop_at_request,0.002449,0.136583,"(61.4012, 0.0)"
requester_number_of_posts_at_request,0.00303,0.008033,"(0.2084, 0.648)"
requester_number_of_posts_on_raop_at_request,0.006741,0.145767,"(70.1211, 0.0)"
requester_number_of_subreddits_at_request,0.012759,0.02447,"(1.9353, 0.1643)"
requester_upvotes_minus_downvotes_at_request,0.0,0.0329,"(3.4999, 0.0615)"
requester_upvotes_plus_downvotes_at_request,0.002027,0.032593,"(3.435, 0.0639)"


Most features carry some signal, with `requester_number_of_posts_at_request` being the least informative (correlation of 0.008 and an F-test p-value of 0.648).
The previously created `requester_relative_consensual_votes_at_request` feature looks like a good addition, beating the two features it was derived from.

### Modelling

Now that we're at the modelling step, one important question is which metric to use.
The underlying business question is which error is less harmful: donating to a mostly illegitimate request or, on the contrary, not donating to a legitimate one.
The first, a false positive, harms the brand image and has a financial cost; the second, a false negative, harms our reputation.
But most of all, as described earlier, this algorithm doesn't lead directly to a donation. It only outputs the so-called legitimacy, which will weigh heavily in the donation decision but can still be overridden by a downstream decision algorithm.
With that in mind, the main goal should be to minimize false positives with a metric such as precision, as long as false negatives are not too numerous.
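As a reference point for the precision figures that follow, the sketch below computes precision and recall from raw counts and applies them to a no-skill baseline that flags every request as legitimate. The two helper functions are hypothetical; the counts are the ones reported for the training split.

```python
def precision(tp: int, fp: int) -> float:
    """Share of predicted-legitimate requests that really led to a donation."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    """Share of truly legitimate requests that the model recovers."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Training split reported above: 795 donations out of 3232 requests
n_pos, n_total = 795, 3232
n_neg = n_total - n_pos

# No-skill baseline: flag every request as legitimate
baseline_precision = precision(tp=n_pos, fp=n_neg)
baseline_recall = recall(tp=n_pos, fn=0)
print(round(baseline_precision, 3), baseline_recall)
```

Any model worth keeping should therefore show a precision on the positive class clearly above the 24.6% prevalence.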

In [27]:
def create_impute_preproc(numerical_features, scaler_method = None):
    steps = [('scaler_method', scaler_method)]
    scaler_transformer = Pipeline(steps)

    # Preprocessor
    preprocessor = ColumnTransformer(
        transformers=[
            ('scaler_transformer', scaler_transformer, numerical_features)
        ], remainder='drop')

    # Model pipeline
    steps = [('preproc', preprocessor)]
    proc_pipe = Pipeline(steps)

    return proc_pipe

def create_pipeline(train_set, preproc, estimator, model_name, grid_strat, hyperparameters, n_folds,
                    eval_metric='precision', verbosity=3, n_jobs=10):
    X, y = train_set

    # Model pipeline: preprocessing lives inside the pipeline so that it is
    # refit on each CV fold and reapplied automatically at prediction time
    steps = [('preproc', preproc), (model_name, estimator)]
    model_pipe = Pipeline(steps)

    # Grid Search
    cv = grid_strat(model_pipe,
                    param_grid=hyperparameters,
                    cv=n_folds,
                    scoring=eval_metric,
                    n_jobs=n_jobs,
                    verbose=verbosity,
                    random_state=seed)

    grid_model = cv.fit(X, y)

    return grid_model

def print_pipe_results(train_set, model):
    X, y = train_set

    yhat = model.best_estimator_.predict(X)
    print(f'In-sample results:\n {classification_report(y, yhat)}')
    print()
    print(f'''Confusion matrix:
{pd.DataFrame(confusion_matrix(y, yhat, normalize='all'),
              columns=["PredNeg", "PredPos"],
              index=["Neg", "Pos"])}''')

#### Logistic Regression

In [17]:
model = LogisticRegression(solver='liblinear',
                           random_state=seed)

model_name = 'logreg'

parameters = {f'{model_name}__C': np.logspace(0.7, 1.3, 4),
              f'{model_name}__fit_intercept': (True, False),
              f'{model_name}__class_weight': (None, 'balanced'),
              f'{model_name}__penalty': ['l1', 'l2'],
              f'{model_name}__max_iter': np.linspace(20, 2000, 4).astype(int)
              }

logreg_gridCV = create_pipeline(train_set = [X_train_num, y_train],
                                preproc=create_impute_preproc(X_train_num.columns, scaler_method=None),
                                estimator = model,
                                model_name = model_name,
                                grid_strat=HalvingGridSearchCV,
                                hyperparameters = parameters,
                                n_folds = 4,
                                eval_metric='precision',
                                verbosity=1)

n_iterations: 5
n_required_iterations: 5
n_possible_iterations: 5
min_resources_: 39
max_resources_: 3232
aggressive_elimination: False
factor: 3
----------
iter: 0
n_candidates: 128
n_resources: 39
Fitting 4 folds for each of 128 candidates, totalling 512 fits
----------
iter: 1
n_candidates: 43
n_resources: 117
Fitting 4 folds for each of 43 candidates, totalling 172 fits
----------
iter: 2
n_candidates: 15
n_resources: 351
Fitting 4 folds for each of 15 candidates, totalling 60 fits
----------
iter: 3
n_candidates: 5
n_resources: 1053
Fitting 4 folds for each of 5 candidates, totalling 20 fits
----------
iter: 4
n_candidates: 2
n_resources: 3159
Fitting 4 folds for each of 2 candidates, totalling 8 fits
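The log above is consistent with the following successive-halving arithmetic for `factor=3`. This is a sketch reconstructed from the printed values, not scikit-learn's implementation: `halving_schedule` is a hypothetical helper, and it ignores the library's lower bound on the minimum number of resources.

```python
import math

def halving_schedule(n_candidates: int, max_resources: int, factor: int = 3):
    """Return (n_candidates, n_resources) for each successive-halving iteration."""
    # Number of iterations needed to narrow the candidates down to about one
    n_iterations = 1
    while factor ** n_iterations <= n_candidates:
        n_iterations += 1
    # Start small enough that the last iteration uses (almost) all samples
    min_resources = max_resources // factor ** (n_iterations - 1)
    schedule = []
    for i in range(n_iterations):
        schedule.append((n_candidates, min_resources * factor ** i))
        n_candidates = math.ceil(n_candidates / factor)  # keep the best third
    return schedule

# 128 logistic-regression candidates on the 3232 training samples
for n_cand, n_res in halving_schedule(128, 3232):
    print(n_cand, n_res)
```

Only the last iteration scores candidates on nearly the full training set, so early eliminations rest on very small samples (39 requests here).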




In [28]:
print_pipe_results(train_set = [X_train_num, y_train],
                   model = logreg_gridCV)

In-sample results:
               precision    recall  f1-score   support

       False       0.82      0.68      0.74      2437
        True       0.35      0.54      0.43       795

    accuracy                           0.64      3232
   macro avg       0.59      0.61      0.58      3232
weighted avg       0.70      0.64      0.66      3232


Confusion matrix:
      PredNeg   PredPos
Neg  0.512686  0.241337
Pos  0.113861  0.132116


Logistic regression doesn't perform well, whatever the tuning.
Some hyperparameter settings raise precision up to 65%, but recall drops catastrophically, so overall they are worse than the current state.
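The precision/recall trade-off mentioned here can be illustrated by moving the decision threshold on predicted scores. This is an illustration on hypothetical scores, not the tuning that was tried above; `precision_recall_at` and the `scored` pairs are invented for the example.

```python
# Hypothetical (score, label) pairs: predicted probability and whether a pizza was received
scored = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.5, 0), (0.4, 1), (0.3, 0), (0.2, 0)]

def precision_recall_at(threshold, scored_pairs):
    """Precision and recall when requests scoring at or above the threshold are flagged."""
    tp = sum(1 for s, label in scored_pairs if s >= threshold and label == 1)
    fp = sum(1 for s, label in scored_pairs if s >= threshold and label == 0)
    fn = sum(1 for s, label in scored_pairs if s < threshold and label == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for threshold in (0.5, 0.75):
    print(threshold, precision_recall_at(threshold, scored))
```

On this toy sample, raising the threshold from 0.5 to 0.75 lifts precision from 0.6 to 1.0 while recall falls from 0.75 to 0.5.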

#### Gaussian Naive Bayes

In [63]:
model = GaussianNB()

model_name = 'gaus_nb'

parameters = {}

gaussnb_gridCV = create_pipeline(train_set = [X_train_num, y_train],
                                 preproc=create_impute_preproc(X_train_num.columns, scaler_method=None),
                                 estimator = model,
                                 model_name = model_name,
                                 grid_strat=HalvingGridSearchCV,
                                 hyperparameters = parameters,
                                 n_folds = 4,
                                 eval_metric='precision',
                                 verbosity=1)

n_iterations: 1
n_required_iterations: 1
n_possible_iterations: 1
min_resources_: 3232
max_resources_: 3232
aggressive_elimination: False
factor: 3
----------
iter: 0
n_candidates: 1
n_resources: 3232
Fitting 4 folds for each of 1 candidates, totalling 4 fits


In [64]:
print_pipe_results(train_set = [X_train_num, y_train],
                   model = gaussnb_gridCV)

In-sample results:
               precision    recall  f1-score   support

       False       0.75      0.97      0.85      2437
        True       0.27      0.03      0.05       795

    accuracy                           0.74      3232
   macro avg       0.51      0.50      0.45      3232
weighted avg       0.63      0.74      0.65      3232


Confusion matrix:
      PredNeg   PredPos
Neg  0.735149  0.018874
Pos  0.239171  0.006807


#### Random Forest Classifier

In [75]:
model = RandomForestClassifier(random_state=seed, n_jobs=10)

model_name = 'rf'

parameters = {f'{model_name}__n_estimators': np.linspace(10, 300, 4).astype(int),
              f'{model_name}__max_depth': np.linspace(2, 9, 4).astype(int)}

rf_gridCV = create_pipeline(train_set = [X_train_num, y_train],
                            preproc=create_impute_preproc(X_train_num.columns, scaler_method=None),
                            estimator = model,
                            model_name = model_name,
                            grid_strat=HalvingGridSearchCV,
                            hyperparameters = parameters,
                            n_folds = 4,
                            eval_metric='f1', #better overall, little downgrade on precision high boost on recall
                            verbosity=1)

n_iterations: 3
n_required_iterations: 3
n_possible_iterations: 3
min_resources_: 359
max_resources_: 3232
aggressive_elimination: False
factor: 3
----------
iter: 0
n_candidates: 16
n_resources: 359
Fitting 4 folds for each of 16 candidates, totalling 64 fits
----------
iter: 1
n_candidates: 6
n_resources: 1077
Fitting 4 folds for each of 6 candidates, totalling 24 fits
----------
iter: 2
n_candidates: 2
n_resources: 3231
Fitting 4 folds for each of 2 candidates, totalling 8 fits


In [76]:
print_pipe_results(train_set = [X_train_num, y_train],
                   model = rf_gridCV)

In-sample results:
               precision    recall  f1-score   support

       False       0.81      1.00      0.89      2437
        True       0.95      0.28      0.44       795

    accuracy                           0.82      3232
   macro avg       0.88      0.64      0.67      3232
weighted avg       0.85      0.82      0.78      3232


Confusion matrix:
      PredNeg   PredPos
Neg  0.750619  0.003403
Pos  0.176052  0.069926


#### XGBoost Classifier

In [77]:
model = xgb.XGBClassifier(random_state=seed,
                          objective='binary:logistic',
                          eval_metric=precision_score,
                          tree_method='gpu_hist')

model_name = 'xgb_cl'

parameters = {f'{model_name}__n_estimators': np.linspace(20, 500, 3).astype(int),
              f'{model_name}__learning_rate': np.logspace(-1, 0, 2),
              f'{model_name}__max_depth': np.linspace(2, 9, 3).astype(int),
              f'{model_name}__booster': ['gbtree'],
              f'{model_name}__colsample_bytree': np.logspace(-0.7, 0, 2),
              f'{model_name}__subsample': np.logspace(-0.7, 0, 2)
             }

xgb_gridCV = create_pipeline(train_set = [X_train_num, y_train],
                             preproc=create_impute_preproc(X_train_num.columns, scaler_method=None),
                             estimator = model,
                             model_name = model_name,
                             grid_strat=HalvingGridSearchCV,
                             hyperparameters = parameters,
                             n_folds = 4,
                             eval_metric='precision', # f1 is bad -> 0.47 precision / 0.46 recall
                             verbosity=1)

n_iterations: 4
n_required_iterations: 4
n_possible_iterations: 4
min_resources_: 119
max_resources_: 3232
aggressive_elimination: False
factor: 3
----------
iter: 0
n_candidates: 72
n_resources: 119
Fitting 4 folds for each of 72 candidates, totalling 288 fits
----------
iter: 1
n_candidates: 24
n_resources: 357
Fitting 4 folds for each of 24 candidates, totalling 96 fits
----------
iter: 2
n_candidates: 8
n_resources: 1071
Fitting 4 folds for each of 8 candidates, totalling 32 fits
----------
iter: 3
n_candidates: 3
n_resources: 3213
Fitting 4 folds for each of 3 candidates, totalling 12 fits


In [78]:
print_pipe_results(train_set = [X_train_num, y_train],
                   model = xgb_gridCV)

In-sample results:
               precision    recall  f1-score   support

       False       0.79      0.99      0.88      2437
        True       0.83      0.20      0.32       795

    accuracy                           0.79      3232
   macro avg       0.81      0.59      0.60      3232
weighted avg       0.80      0.79      0.74      3232


Confusion matrix:
      PredNeg   PredPos
Neg  0.744121  0.009901
Pos  0.197092  0.048886


#### Conclusion

After comparing models with light hyperparameter tuning, the results aren't very conclusive. This was expected with only these variables *(as backed up by the feature importances part)*.
Something quite surprising nonetheless is that tree-based models (random forest, XGBoost) can achieve very high in-sample precision, but at the cost of a collapsing recall.
Let's move on to textual modelling to step up our game.
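One caveat on the figures above: they are computed on the training samples themselves. The sketch below, on purely synthetic noise (the data and variable names are hypothetical), shows why in-sample precision of a tree ensemble can look far better than its cross-validated counterpart, which is the fairer number to compare models on.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import cross_val_score

# Synthetic noise: labels are drawn independently of the features
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(400, 5))
y_noise = rng.integers(0, 2, size=400)

forest = RandomForestClassifier(n_estimators=50, random_state=0)

# Precision measured on the very samples the forest was fitted on
in_sample_precision = precision_score(y_noise, forest.fit(X_noise, y_noise).predict(X_noise))

# Mean precision over 4 held-out folds
cv_precision = cross_val_score(forest, X_noise, y_noise, cv=4, scoring="precision").mean()

print(f"in-sample: {in_sample_precision:.2f}, cross-validated: {cv_precision:.2f}")
```

For the searches run in this notebook, the fitted search object exposes the mean cross-validated score of the selected candidate as `best_score_`.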

## ... and get the final word.

In [79]:
X_train_text = X_train.select_dtypes(include=["object"])

In [80]:
X_train_text.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3232 entries, 3418 to 813
Data columns (total 5 columns):
 #   Column                           Non-Null Count  Dtype 
---  ------                           --------------  ----- 
 0   request_id                       3232 non-null   object
 1   requester_username               3232 non-null   object
 2   request_title                    3232 non-null   object
 3   request_text_edit_aware          3232 non-null   object
 4   requester_subreddits_at_request  3232 non-null   object
dtypes: object(5)
memory usage: 151.5+ KB


Only `request_title` and `request_text_edit_aware` are NLP-oriented features, so they'll be the main focus.