## Preprocessing

This step cleans the order data and preprocesses it for mining. It is based on the findings of the data check.

Cleaning is encapsulated in __utils.clean_orders__; its main purpose is to remove orders that have non-positive material quantities and orders that have multiple dates.
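
The body of `clean_orders` is not shown in this notebook, so the following is only a minimal sketch of the filtering described above. The function name `clean_orders_sketch` and the column names `order_id`, `material`, `quantity` and `order_date` are assumptions made for illustration, as is the choice to drop a whole order when any of its rows has a non-positive quantity.

```python
import pandas as pd

def clean_orders_sketch(rows: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for utils.clean_orders (column names are assumed)."""
    by_order = rows.groupby('order_id')
    # True on every row of an order whose smallest quantity is still positive
    all_positive = by_order['quantity'].transform('min') > 0
    # True on every row of an order that carries exactly one distinct date
    single_date = by_order['order_date'].transform('nunique') == 1
    return rows[all_positive & single_date]

# Toy data: order 2 has a zero quantity, order 3 has two different dates
rows = pd.DataFrame({
    'order_id':   [1, 1, 2, 2, 3, 3, 4],
    'material':   ['a', 'b', 'a', 'c', 'b', 'c', 'a'],
    'quantity':   [2, 1, 0, 5, 1, 1, 3],
    'order_date': ['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02',
                   '2020-01-03', '2020-01-04', '2020-01-05'],
})
cleaned = clean_orders_sketch(rows)
print(cleaned['order_id'].unique())
```

In this toy data only orders 1 and 4 survive the filter.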

Preprocessing is done by __utils.encode_orders__, which converts the data from a multiple-rows-per-order format into a single one-hot row per order. The result is a huge but sparse order-by-material matrix whose boolean cells indicate whether any quantity of the column's material is present in the row's order. Order date and organization, each of which has a single value per order, are joined to the order-by-material matrix.
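
The encoding can be illustrated on a toy frame. This is a sketch under assumptions, not the actual `encode_orders` code: the input column names `order_id`, `material` and `quantity` are hypothetical, and `pd.crosstab` builds a dense matrix, so data of the size used here would presumably need a sparse representation instead.

```python
import pandas as pd

# Toy long-format data: one row per (order, material) pair; column names are assumed
rows = pd.DataFrame({
    'order_id':   [1, 1, 2],
    'material':   ['a', 'b', 'b'],
    'quantity':   [2, 1, 4],
    'order_date': ['2020-01-01', '2020-01-01', '2020-01-02'],
    'org':        ['x', 'x', 'y'],
})

# Order-by-material matrix: True where the order contains the material at all
presence = pd.crosstab(rows['order_id'], rows['material']) > 0

# Date and organization are single-valued per order, so the first row is enough
meta = rows.groupby('order_id')[['order_date', 'org']].first()

# One row per order: boolean material columns plus order_date and org
encoded = presence.join(meta)
print(encoded)
```

Each order ends up as one row, with one boolean column per material followed by the `order_date` and `org` columns.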

Preprocessed orders are cached in a pickle file so that they can be quickly reloaded for mining.
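
As a small sketch of the caching round trip, the example below writes a toy frame to a temporary directory and reads it back. The frame contents and the temporary path are made up for illustration. pandas infers gzip compression from the `.gz` suffix on both write and read.

```python
import os
import tempfile
import pandas as pd

# Tiny stand-in for the encoded orders (illustrative values only)
frame = pd.DataFrame({'a': [True, False], 'org': ['x', 'y']})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'orders_p.pkl.gz')
    frame.to_pickle(path)            # gzip compression inferred from '.gz'
    restored = pd.read_pickle(path)  # decompression inferred the same way

# Values and dtypes are preserved by the round trip
same = restored.equals(frame)
print(same)
```

Unlike CSV, a pickle keeps the boolean dtypes of the one-hot columns, so no re-parsing is needed when the cache is loaded.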

In [1]:
from utils import read_orders, clean_orders, encode_orders

In [2]:
orders = clean_orders(read_orders())

In [3]:
orders_encoded = encode_orders(orders)

In [4]:
orders_encoded.to_pickle('data/orders_p.pkl.gz')

In [5]:
print('There are {} orders with one-hot encoded {} materials, cleaned and ready for mining'
    .format(
        orders_encoded.shape[0],
        len(set(orders_encoded.columns).difference({ 'order_date', 'org' }))
    )
)

There are 2291173 orders with one-hot encoded 21213 materials, cleaned and ready for mining
