# 11-divide-train-test
In this notebook, we subset the original merged dataset from the previous notebook. We keep the 1,000 most popular products and then the customers who have at least two transactions among those products. After subsetting the data, we divide it into training and test data using scikit-learn's `StratifiedShuffleSplit` class, stratifying on `customer_id`. Roughly 80% of each customer's transactions go to the training data and the rest go to the test data. With this method, the training and test data can share the same customers.

In [2]:
from google.colab import drive
drive.mount('/content/drive')                              

Mounted at /content/drive


In [None]:
# import data
import pandas as pd 
final_df = pd.read_csv('/content/drive/MyDrive/Colab_sized_data.csv')

In [None]:
# Drop unnecessary columns
final_df = final_df.drop(['Unnamed: 0', 'Unnamed: 0.1'], axis = 1)
final_df = final_df.reset_index(drop = True)

## Subset data

In [None]:
# Select the 1000 most popular products purchased during the 2 years.
article_count = final_df['article_id'].value_counts()

article_list = article_count.keys().tolist()
count_list = article_count.values.tolist()

article_count_df = pd.DataFrame({'id': article_list, 'count': count_list})
article_count_df = article_count_df.sort_values('count', ascending = False)
article_count_df = article_count_df[:1000]

used_id = article_count_df.id.unique()

final_df = final_df[final_df['article_id'].isin(used_id)]
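
The same top-N selection can be written more compactly, because `value_counts()` already returns counts sorted in descending order. The following is a minimal sketch on made-up data; `toy_df`, `keep_top_articles`, and the article IDs are hypothetical names used only for illustration, and ties at the cutoff are broken arbitrarily.

```python
import pandas as pd

# Toy transactions; the article_id values are made up for illustration.
toy_df = pd.DataFrame({
    "article_id": [101, 101, 101, 202, 202, 303, 404, 404, 404, 404],
})

def keep_top_articles(df, n):
    """Keep only rows whose article_id is among the n most frequent ones."""
    # value_counts() sorts by count in descending order by default,
    # so the first n index entries are the n most popular articles.
    top_ids = df["article_id"].value_counts().head(n).index
    return df[df["article_id"].isin(top_ids)]

toy_subset = keep_top_articles(toy_df, n=2)
top_article_ids = sorted(toy_subset["article_id"].unique().tolist())
print(top_article_ids, len(toy_subset))
```

In this notebook the equivalent call would use `n=1000` on `final_df`.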

In [None]:
# Select customers
id_count = final_df['customer_id'].value_counts()

id_list = id_count.keys().tolist()
count_list = id_count.values.tolist()

id_count_df = pd.DataFrame({'id': id_list, 'count': count_list})

# Keep only customers with at least 2 transactions.
# StratifiedShuffleSplit needs at least 2 rows per class (here, per customer).
used_id = id_count_df[id_count_df['count'] > 1].id.unique().tolist()

final_df = final_df[final_df['customer_id'].isin(used_id)]
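
As an alternative sketch, the per-customer filter can be expressed with a group-wise row count instead of building an intermediate DataFrame. The names `toy_df`, `keep_repeat_customers`, and `min_rows` are hypothetical and the data is made up.

```python
import pandas as pd

# Toy transactions for four hypothetical customers.
toy_df = pd.DataFrame({
    "customer_id": ["a", "a", "b", "c", "c", "c", "d"],
    "article_id": [1, 2, 1, 3, 1, 2, 3],
})

def keep_repeat_customers(df, min_rows=2):
    """Keep rows belonging to customers with at least min_rows rows."""
    # For every row, the number of rows that share its customer_id.
    rows_per_customer = df.groupby("customer_id")["customer_id"].transform("size")
    return df[rows_per_customer >= min_rows]

repeat_df = keep_repeat_customers(toy_df)
remaining_customers = sorted(repeat_df["customer_id"].unique().tolist())
print(remaining_customers, len(repeat_df))
```

Customers `b` and `d` have a single row each and are dropped, which mirrors the `count > 1` condition used above.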

In [None]:
final_df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,t_dat,customer_id,article_id,price,sales_channel_id,FN,Active,club_member_status,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
3,3,3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003,0.016932,2,1.0,1.0,ACTIVE,...,Campaigns,A,Ladieswear,1,Ladieswear,15,Womens Everyday Collection,1023,Special Offers,V-neck knitted jumper with long sleeves and ri...
4,4,4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004,0.016932,2,1.0,1.0,ACTIVE,...,Campaigns,A,Ladieswear,1,Ladieswear,15,Womens Everyday Collection,1023,Special Offers,V-neck knitted jumper with long sleeves and ri...
5,5,5,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687001,0.016932,2,1.0,1.0,ACTIVE,...,Campaigns,A,Ladieswear,1,Ladieswear,15,Womens Everyday Collection,1023,Special Offers,V-neck knitted jumper with long sleeves and ri...
7,7,7,2018-09-20,00083cda041544b2fbb0e0d2905ad17da7cf1007526fb4...,688873012,0.030492,1,,,ACTIVE,...,Blouse,A,Ladieswear,1,Ladieswear,11,Womens Tailoring,1010,Blouses,"Blouse in a soft weave with a narrow collar, c..."
11,11,11,2018-09-20,00083cda041544b2fbb0e0d2905ad17da7cf1007526fb4...,688873011,0.030492,1,,,ACTIVE,...,Blouse,A,Ladieswear,1,Ladieswear,11,Womens Tailoring,1010,Blouses,"Blouse in a soft weave with a narrow collar, c..."


## Split data into training and test

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit
import numpy as np

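# Stratify on customer_id so that each customer's rows are split roughly 80/20.
# With n_splits = 2 the loop below runs twice, so train/test hold the second split.
splitter = StratifiedShuffleSplit(n_splits = 2, test_size = .2, random_state = 7)

# X is only used for its length here, so an array of zeros serves as a placeholder.
for train_index, test_index in splitter.split(np.zeros(final_df.shape[0]), final_df['customer_id'].to_numpy()):
  train = final_df.iloc[train_index]
  test = final_df.iloc[test_index]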
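
To make the effect of stratifying on `customer_id` concrete, here is a small sketch on made-up data: four hypothetical customers with ten rows each. It uses a single split (`n_splits=1`) for simplicity, which is an assumption of this example and not the setting used in the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Four hypothetical customers with 10 transactions each (40 rows in total).
toy_df = pd.DataFrame({
    "customer_id": np.repeat(["a", "b", "c", "d"], 10),
    "article_id": np.arange(40),
})

# One stratified 80/20 split; customer_id plays the role of the class label.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=7)
train_idx, test_idx = next(splitter.split(toy_df, toy_df["customer_id"]))

toy_train = toy_df.iloc[train_idx]
toy_test = toy_df.iloc[test_idx]

# Rows per customer on each side of the split.
train_counts = toy_train["customer_id"].value_counts().to_dict()
test_counts = toy_test["customer_id"].value_counts().to_dict()
print(train_counts)
print(test_counts)
```

Each customer ends up with 8 rows in the training part and 2 in the test part. With real data, customers who have only a few transactions may contribute no rows to the test set, since 20% of a small count can round down to zero.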

## Sanity check
After splitting the data, let's check the size of the training and test data. We also need to check how many unique customers are in each dataset.

In [None]:
train.shape

(1781011, 35)

In [None]:
test.shape

(445253, 35)

In [None]:
len(test['customer_id'].unique())

286521

In [None]:
len(train['customer_id'].unique())

390826
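
A further check that may be useful here is whether every customer in the test data also appears in the training data. The sketch below shows one way to do that on made-up frames; `customers_only_in_test`, `toy_train`, and `toy_test` are hypothetical names, and on the real data the function would be called with `train` and `test`.

```python
import pandas as pd

def customers_only_in_test(train_df, test_df):
    """Return the set of customer_ids present in test_df but not in train_df."""
    train_ids = set(train_df["customer_id"])
    test_ids = set(test_df["customer_id"])
    return test_ids - train_ids

# Made-up frames: customer "z" appears only in the test part.
toy_train = pd.DataFrame({"customer_id": ["a", "a", "b", "c"]})
toy_test = pd.DataFrame({"customer_id": ["a", "b", "z"]})

unseen = customers_only_in_test(toy_train, toy_test)
print(unseen)
```

An empty result would indicate that the test set contains no customers unseen during training.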

## Save training and test data

In [None]:
train.to_csv('/content/drive/MyDrive/training_test/train_revise.csv')

In [None]:
test.to_csv('/content/drive/MyDrive/training_test/test_revise.csv')
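
One thing to be aware of when saving: by default `to_csv` also writes the DataFrame index, and reading such a file back with `read_csv` yields an extra `Unnamed: 0` column, like the ones dropped at the top of this notebook. The sketch below illustrates this with an in-memory buffer and made-up data; passing `index=False` is one possible way to avoid it, offered as a suggestion and not as what this notebook does.

```python
import io
import pandas as pd

toy_df = pd.DataFrame({"customer_id": ["a", "b"], "article_id": [1, 2]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy_df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# With index=False only the data columns are written.
without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

If later notebooks already drop `Unnamed: 0` after loading these files, changing the save call would require adjusting them as well.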