## Splitting data

In [3]:
import pandas as pd
import spacy
import umap
import numpy as np

from pathlib import Path
import sys
sys.path.append('..')

import warnings
warnings.filterwarnings('ignore')

from ml_editor.data_processing import (
    format_raw_df,
    get_random_train_test_split,
    get_vectorized_inputs_and_label,
    get_split_by_author,
)

data_path = Path('../data/writers.csv')
df = pd.read_csv(data_path)
df = format_raw_df(df.copy())

### Random Split

In [4]:
train_df_rand, test_df_rand = get_random_train_test_split(df[df['is_question']], test_size=0.3, random_state=42)

In [5]:
print('{} questions in training, {} in test'.format(len(train_df_rand), len(test_df_rand)))
train_owners = set(train_df_rand['OwnerUserId'].values)
test_owners = set(test_df_rand['OwnerUserId'].values)

print('{} different owners in the training set'.format(len(train_owners)))
print('{} different owners in the testing set'.format(len(test_owners)))
print('{} owners appear in both sets'.format(len(train_owners.intersection(test_owners))))

5579 questions in training, 2392 in test
2968 different owners in the training set
1496 different owners in the testing set
574 owners appear in both sets


### Author Split
Some authors may be more skilled at asking questions than others. If an author appears in both the training and test sets, a model could predict how well their questions perform simply by identifying the author. Note that removing the author identifier (`OwnerUserId` in this dataset) from the set of features does not fully solve this problem, as the way a question is phrased may be author-specific (especially if some authors include a signature).

To make sure we are accurately judging question quality, each author should appear in either the training set or the test set, but never in both. This guarantees that a model cannot score well on held-out questions by recognizing an author it has already seen during training.

To remove this potential source of bias, let's split data by author.

In [7]:
train_author, test_author = get_split_by_author(df[df['is_question']], test_size=0.3, random_state=42)

print('{} questions in training, {} in test.'.format(len(train_author), len(test_author)))
train_owners = set(train_author['OwnerUserId'].values)
test_owners = set(test_author['OwnerUserId'].values)

print('{} different owners in the training set.'.format(len(train_owners)))
print('{} different owners in the test set.'.format(len(test_owners)))
print('{} owners appear in both sets '.format(len(train_owners.intersection(test_owners))))

5495 questions in training, 2476 in test.
2723 different owners in the training set.
1167 different owners in the test set.
0 owners appear in both sets 
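
For intuition, here is a minimal sketch of one way an author-level split could be implemented with pandas and NumPy. It is not the code behind `get_split_by_author`; the helper name `split_by_group` and the toy data are hypothetical. Note that in this sketch `test_size` is the fraction of authors, not of questions, sent to the test set.

```python
import numpy as np
import pandas as pd


def split_by_group(frame, group_column, test_size=0.3, random_state=42):
    """Hypothetical helper: send every row of a given group to the same side."""
    rng = np.random.default_rng(random_state)
    # Shuffle the distinct group ids (authors), not the individual rows
    shuffled_groups = rng.permutation(frame[group_column].unique())
    n_test_groups = int(round(len(shuffled_groups) * test_size))
    test_groups = shuffled_groups[:n_test_groups]
    # A row goes to the test set only if its whole group was picked for it
    in_test = frame[group_column].isin(test_groups)
    return frame[~in_test], frame[in_test]


# Toy data for illustration only: 10 questions written by 6 authors
toy_questions = pd.DataFrame({
    "OwnerUserId": [1, 1, 2, 3, 3, 3, 4, 5, 5, 6],
    "Title": ["question {}".format(i) for i in range(10)],
})
toy_train, toy_test = split_by_group(toy_questions, "OwnerUserId")

shared_owners = set(toy_train["OwnerUserId"]) & set(toy_test["OwnerUserId"])
print("{} owners appear in both sets".format(len(shared_owners)))
```

Because whole authors are assigned to one side, the share of questions in the test set only approximates `test_size`. That would be consistent with the counts above, where the test set holds slightly more than 30% of the questions.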


Going forward we will use the author split. Other kinds of data call for other splitting methods. For example, we may want to use a time-based split to see whether training on questions written in a given period produces a model that works well on questions from a more recent period.
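
A time-based split could be sketched as follows. This is an illustration under assumptions: the helper `split_by_time` is hypothetical, and the date column is called `CreationDate` here, which may not match the column name in this dataset.

```python
import pandas as pd


def split_by_time(frame, date_column, test_size=0.3):
    """Hypothetical helper: train on the oldest rows, test on the most recent."""
    # A stable sort keeps the original order for rows with identical dates
    ordered = frame.sort_values(date_column, kind="mergesort")
    n_test = int(round(len(ordered) * test_size))
    cutoff = len(ordered) - n_test
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]


# Toy data for illustration only; `CreationDate` is an assumed column name
toy_dated = pd.DataFrame({
    "CreationDate": pd.to_datetime([
        "2019-03-01", "2018-01-15", "2019-07-20", "2018-06-30", "2019-11-05",
        "2018-09-12", "2019-01-02", "2018-03-03", "2019-05-17", "2018-12-25",
    ]),
    "Title": ["question {}".format(i) for i in range(10)],
})
time_train, time_test = split_by_time(toy_dated, "CreationDate")

print("{} questions in training, {} in test".format(len(time_train), len(time_test)))
print("Training ends {}, test starts {}".format(
    time_train["CreationDate"].max().date(), time_test["CreationDate"].min().date()))
```

Unlike the random and author splits, this one involves no randomness: the cutoff is determined entirely by the dates, so every training question predates every test question.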