# Step One: Cleaning the Dataset

Before we can train our word vectors, we need to preprocess our data into a suitable format. Custom-trained word vectors require a corpus: a collection of sentences (product descriptions, in our case). From these sentences, training learns the relationships between words and quantifies them as word vectors (also called word embeddings).
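
To make the idea of a corpus concrete, here is a minimal sketch that turns a few product descriptions from the dataset into a list of tokenized sentences. Lowercasing and splitting on whitespace are assumptions made for illustration only; they are not necessarily the tokenization used later in this project.

```python
# A few product descriptions taken from the dataset.
descriptions = [
    "WHITE HANGING HEART T-LIGHT HOLDER",
    "WHITE METAL LANTERN",
    "CREAM CUPID HEARTS COAT HANGER",
]

# Assumption: lowercase each description and split it on whitespace.
# Each description becomes one "sentence", i.e. a list of word tokens.
corpus = [description.lower().split() for description in descriptions]

for sentence in corpus:
    print(sentence)
```

Each inner list plays the role of one sentence, which is the general shape that word-vector training expects.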

### Requirements

- We are using an online retail dataset from the [UCI machine learning repository](https://archive.ics.uci.edu/ml/index.php) which can be downloaded [here](https://archive.ics.uci.edu/ml/machine-learning-databases/00352/). 

- We will clean our data and train our model in Python, the programming language of choice for many machine learning and data science projects. We will use the popular ```pandas``` library to manipulate our dataset.

Note: The training files are stored in this repo for convenience. However, it would be more appropriate to create a new directory and virtual environment, separate from the API directory, and follow these steps there.

In [1]:
import pandas as pd

### Importing the Data

Our data is stored as an Excel file, which we can read into a ```pandas DataFrame```. As mentioned, we only need the "Description" column, so we will import only that column.

As our dataset is large, this may take a few minutes. Once it completes, we should see the first few rows of the data.


In [2]:
df = pd.read_excel('Online_Retail.xlsx', usecols="C")
df.head()

|   | Description |
|---|---|
| 0 | WHITE HANGING HEART T-LIGHT HOLDER |
| 1 | WHITE METAL LANTERN |
| 2 | CREAM CUPID HEARTS COAT HANGER |
| 3 | KNITTED UNION FLAG HOT WATER BOTTLE |
| 4 | RED WOOLLY HOTTIE WHITE HEART. |


### Cleaning the Data

As our dataset has over half a million rows, it would not be practical to check manually for missing or nonsensical entries. Instead, we must find ways to manipulate and clean the dataset as a whole so that our model is as useful as possible. This was a trial-and-error process specific to this dataset, and there are likely further steps that could be added.

1. Remove any empty rows

Our dataset contains some empty rows, which we can count by checking for null values. Make sure you run this block *before* removing the empty rows!

In [4]:
print("Empty rows before cleaning:", df.isnull().sum())

Empty rows before cleaning: Description    1454
dtype: int64


Now we can remove these empty rows and check that no null-valued rows remain. Make sure to modify the DataFrame in place.

In [5]:
df.dropna(inplace=True)

print("Empty rows after cleaning:", df.isnull().sum())

Empty rows after cleaning: Description    0
dtype: int64
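
As a small, self-contained illustration of the same check-then-drop pattern, the sketch below uses a toy DataFrame (not the real dataset). It reassigns the result of `dropna` rather than using `inplace=True`; the two approaches have the same effect on the rows. Resetting the index afterwards is an optional extra assumed here for tidiness.

```python
import pandas as pd

# Toy stand-in for the real dataset: two descriptions and two missing values.
toy = pd.DataFrame(
    {"Description": ["WHITE METAL LANTERN", None, "CREAM CUPID HEARTS COAT HANGER", None]}
)

# Count the missing descriptions before cleaning.
missing_before = int(toy["Description"].isnull().sum())

# Drop rows with a missing description; reassigning has the same effect
# as calling dropna(inplace=True). reset_index is optional tidying.
cleaned = toy.dropna(subset=["Description"]).reset_index(drop=True)
missing_after = int(cleaned["Description"].isnull().sum())

print("Empty rows before cleaning:", missing_before)
print("Empty rows after cleaning:", missing_after)
```

Note that `dropna` only removes null values; strings that are empty or contain only whitespace are not treated as null.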


2. Remove non-product descriptions

The next step in cleaning our data involves checking the contents of the sentences. To analyse our data line by line, we will convert our DataFrame to a text file.
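
One possible way to do this conversion is sketched below on a toy DataFrame. The file name `descriptions.txt` and the temporary directory are hypothetical choices made for this illustration, not necessarily what this project uses.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the cleaned dataset.
toy = pd.DataFrame(
    {"Description": ["WHITE METAL LANTERN", "CREAM CUPID HEARTS COAT HANGER"]}
)

# Hypothetical output location; a temporary directory avoids touching the repo.
output_path = os.path.join(tempfile.mkdtemp(), "descriptions.txt")

# Write one description per line so the file can be inspected line by line.
with open(output_path, "w", encoding="utf-8") as handle:
    for description in toy["Description"]:
        handle.write(f"{description}\n")

# Read the file back to confirm each description landed on its own line.
with open(output_path, encoding="utf-8") as handle:
    lines = [line.rstrip("\n") for line in handle]

print(lines)
```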