## Pre-processing
The next segment performs the following steps:
1. Import the required libraries.
2. Set the working directory to the Starspace directory, so that other paths can be given relative to it.
3. Read the data file.
4. Remove cancelled invoices (invoice numbers starting with "C") and bank charges (stock codes starting with "BANK").
5. Subset the dataset so that we only use the attributes required for StarSpace.
6. Drop the rows with missing values. We have an abundance of quality data, so we can discard the rows with missing data.
7. Use the right data types for the attributes.


In [1]:
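import os
import pandas as pd

# Work from the Starspace directory so the data path can be built from it.
base_dir = "/home/admin123/Starspace/"
os.chdir(base_dir)
fp = os.path.join(base_dir, "data/Online_Retail.csv")
df = pd.read_csv(fp)
# Drop cancelled invoices (InvoiceNo starts with "C") and bank charges.
# Negate the boolean mask with ~; unary minus on a boolean mask is not
# supported by recent NumPy/pandas versions.
df = df[~df['InvoiceNo'].str.startswith("C")]
df = df[~df['StockCode'].str.startswith("BANK")]
# Keep only the attributes StarSpace needs, then drop incomplete rows.
req_cols = ["CustomerID", "InvoiceNo", "StockCode"]
df = df[req_cols]
df = df.dropna()
df["CustomerID"] = df["CustomerID"].astype(int)
df = df.sort_values(by=['CustomerID'])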

### Note:
An embedding is a representation of a data point as a vector in a Euclidean space. An embedding in $\mathbf{R}^p$ has $p$ real components. An embedding determines a latent representation of the data elements. This latent representation is unobserved and typically has a smaller dimension than the original dataset.
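
To make the definition concrete, here is a minimal sketch that is not part of the pipeline in this notebook. The item count, the dimension $p = 8$, and the randomly drawn vectors are all placeholder assumptions; a trained model such as StarSpace would learn the actual values.

```python
import numpy as np

# Hypothetical sizes: 1,000 distinct items embedded in R^p with p = 8.
n_items, p = 1000, 8

# Placeholder embeddings drawn at random; a trained model would learn these.
rng = np.random.default_rng(seed=0)
item_embeddings = rng.normal(size=(n_items, p))


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Each item is a point with p real components.
first_item = item_embeddings[0]
sim = cosine_similarity(item_embeddings[0], item_embeddings[1])
```

Each row holds only $p$ numbers, far fewer than the 1,000 entries a one-hot encoding of the same items would need, which is the sense in which the latent representation is of smaller dimension.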

## The StarSpace Embedding Model
We will use the PageSpace (page embedding) model described in https://github.com/facebookresearch/StarSpace to generate the embeddings associated with the items. As described in the link, a user is represented by the pages he or she fans. In this example, pages map to invoices, and the words in a page are the items purchased in that invoice. The page embedding model learns an embedding for each page; analogously, we will learn an embedding for each invoice. Recall that an invoice represents a collection of items purchased together. A user is represented by the average of the embeddings of the invoices associated with his or her purchases.
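
The averaging step in the last sentence can be sketched as follows. This is an illustration under stated assumptions, not StarSpace's own code: the invoice numbers, customer IDs, three-dimensional vectors, and the helper `user_embedding` are all hypothetical.

```python
import numpy as np

# Hypothetical invoice embeddings in R^3 (values chosen for illustration).
invoice_embeddings = {
    "536365": np.array([1.0, 0.0, 2.0]),
    "536366": np.array([3.0, 2.0, 0.0]),
    "536367": np.array([0.0, 4.0, 4.0]),
}

# Hypothetical mapping from a customer to that customer's invoices.
customer_to_invoices = {17850: ["536365", "536366"], 13047: ["536367"]}


def user_embedding(invoice_ids, embeddings):
    """Average the embeddings of the given invoices, component by component."""
    return np.mean([embeddings[i] for i in invoice_ids], axis=0)


user_vectors = {
    customer: user_embedding(invoices, invoice_embeddings)
    for customer, invoices in customer_to_invoices.items()
}
print(user_vectors[17850])  # [2. 1. 1.]
```

A customer with a single invoice simply inherits that invoice's embedding, while a customer with many invoices lands at their centroid.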

In [2]:
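# Map each CustomerID to the sub-frame holding that customer's rows.
# Grouping by a single column name (not a list) keeps the keys as scalars.
od1 = dict(tuple(df.groupby("CustomerID")))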

In [3]:
fpp = "/home/admin123/Starspace/data/Online_Retail_CustInvoices.txt"
prefix = "itemcode_"
sep = " "
customer_invoices = dict()
# Build one space-separated record per customer from the unique,
# prefixed stock codes that the customer purchased.
for customer, df_cust_invoice in od1.items():
    cust_invoices = df_cust_invoice["StockCode"].unique().tolist()
    cust_invoices = [prefix + str(inv) for inv in cust_invoices]
    customer_invoices[customer] = sep.join(cust_invoices)

# One line per customer.
all_lines = [record + "\n" for record in customer_invoices.values()]

# Write the training file; the with-block closes it automatically.
with open(fpp, "w") as fo:
    fo.writelines(all_lines)
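
To see what the lines in the output file look like, the same record-building logic can be run on a small toy frame. The customers, invoice numbers, and stock codes below are made up for illustration, and nothing is written to disk.

```python
import pandas as pd

# Toy data with hypothetical customers, invoices, and stock codes.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1"],
    "StockCode": ["85123A", "71053", "85123A", "22752"],
})

prefix = "itemcode_"
# One record per customer: unique stock codes, prefixed and space-separated.
toy_lines = [
    " ".join(prefix + str(code) for code in group["StockCode"].unique()) + "\n"
    for _, group in toy.groupby("CustomerID")
]
print("".join(toy_lines), end="")
# itemcode_85123A itemcode_71053
# itemcode_22752
```

Note that each line corresponds to a customer rather than to a single invoice: an item bought on several invoices by the same customer (here `85123A`) appears only once in that customer's record.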

In [None]:
df[df['StockCode'].str.startswith("BANK")].shape

In [4]:
df.shape

(397912, 3)
