# Step One: Cleaning the Dataset

Before we can train our word vectors, we need to preprocess our data into a suitable format. To create custom trained word vectors, we need a corpus: a collection of sentences (or product descriptions in our case). Using these sentences, our training will learn relationships between the words and quantify them as word vectors (or word embeddings).


### Requirements

- We are using an online retail dataset from the [UCI machine learning repository](https://archive.ics.uci.edu/ml/index.php) which can be downloaded [here](https://archive.ics.uci.edu/ml/machine-learning-databases/00352/).

- We will clean our data and train our model in Python, the programming language of choice for many machine learning and data science projects. We will make use of the popular `pandas` library to manipulate our dataset.

Note: The training files are stored in this repo for convenience. However, it would be more appropriate to create a new directory and virtual environment separate from the API directory and follow these steps there.


In [5]:
import pandas as pd

### Importing the Data

Our data is formatted as an Excel file, so we can read it into a `pandas DataFrame`. We only need the "Description" column for our purposes, so we will import only this column.

As our dataset is large, this may take a few minutes. Once this is completed, we should see the first few rows of the data.


In [6]:
df = pd.read_excel('Online_Retail.xlsx', usecols="C")
df.head()

|   | Description |
|---|---|
| 0 | WHITE HANGING HEART T-LIGHT HOLDER |
| 1 | WHITE METAL LANTERN |
| 2 | CREAM CUPID HEARTS COAT HANGER |
| 3 | KNITTED UNION FLAG HOT WATER BOTTLE |
| 4 | RED WOOLLY HOTTIE WHITE HEART. |


### Cleaning the Data

As our dataset has over half a million lines, it would not be practical to manually check for missing or nonsensical entries. Therefore, we must find ways to manipulate and clean the dataset as a whole so our model will be as useful as possible. This was a trial and error process specific to this dataset and there are likely further steps that could be added.

#### 1. Remove any empty rows

There are some empty rows in our dataset, which we can check for. Make sure you run this block _before_ removing the empty rows!


In [7]:
print("Empty rows before cleaning:", df.isnull().sum())

Empty rows before cleaning: Description    1454
dtype: int64


Now we can remove these empty rows and check that no null-valued rows remain. Make sure to modify the ```DataFrame``` in place.


In [8]:
df.dropna(inplace=True)

print("Empty rows after cleaning:", df.isnull().sum())

Empty rows after cleaning: Description    0
dtype: int64
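
The same check-and-drop pattern can be tried on a small, made-up `DataFrame` first. This is only a sketch: the `toy` frame and its values are invented for illustration. It also shows the non-in-place form of `dropna`, which returns a new `DataFrame` and leaves the original untouched.

```python
import pandas as pd

# Hypothetical toy data standing in for the "Description" column
toy = pd.DataFrame(
    {"Description": ["WHITE METAL LANTERN", None, "CREAM CUPID HEARTS COAT HANGER", None]}
)

# Count the missing values before cleaning
missing_before = int(toy["Description"].isnull().sum())

# Without inplace=True, dropna() returns a new DataFrame and `toy` is unchanged
toy_clean = toy.dropna(subset=["Description"])
missing_after = int(toy_clean["Description"].isnull().sum())

print("Missing before:", missing_before)  # Missing before: 2
print("Missing after:", missing_after)    # Missing after: 0
print("Rows kept:", len(toy_clean))       # Rows kept: 2
```

Assigning the result to a new name makes it easy to compare the frame before and after, which can help when re-running cells out of order.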


#### 2. Remove non-product descriptions

The next step in cleaning our data involves checking the contents of the sentences. After a brief look at the dataset, it was clear that some product descriptions indicated a product was missing or damaged. These lines would not be useful in our model as we are looking for useful keywords to use in search queries for related products. However, it is important to find the balance between removing problematic data and retaining as large a dataset as possible.

We found that these irrelevant lines often contained the following terms:


In [9]:
# Raw string so that "\?" reaches the regex engine as an escaped question mark
bad_lines = r"\?|damaged|damages|damage"

print("Rows before cleaning:", df.size)

Rows before cleaning: 540455


We can now manipulate our DataFrame object to drop any rows containing our unwanted terms. Comparing the two counts, we can see that 186 lines have been removed (so long as you haven't already run this block!).


In [10]:
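# Keep rows that do not match; na=True also drops non-string entries, as the "== False" comparison did
df = df[~df["Description"].str.contains(bad_lines, na=True)]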

print("Rows after cleaning:", df.size)

Rows after cleaning: 540269
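
Before dropping rows, it can be worth looking at what the pattern actually matches. The sketch below builds a boolean mask on a small, made-up `DataFrame` and splits it into flagged and kept rows. The sample descriptions are hypothetical; only the pattern comes from the step above.

```python
import pandas as pd

# Hypothetical examples, not rows copied from the dataset
toy = pd.DataFrame(
    {
        "Description": [
            "WHITE METAL LANTERN",
            "damaged",
            "wet damaged",
            "?",
            "RED WOOLLY HOTTIE WHITE HEART.",
        ]
    }
)

bad_lines = r"\?|damaged|damages|damage"

# True where the description contains any of the unwanted terms
mask = toy["Description"].str.contains(bad_lines, regex=True)

flagged = toy[mask]   # rows that would be dropped
kept = toy[~mask]     # rows that would be retained

print(flagged["Description"].tolist())  # ['damaged', 'wet damaged', '?']
print(kept["Description"].tolist())     # ['WHITE METAL LANTERN', 'RED WOOLLY HOTTIE WHITE HEART.']
```

Note that `str.contains` is case-sensitive by default. Passing `case=False` would also catch upper-case variants such as "DAMAGED", at the cost of removing more rows.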


#### 3. Remove unhelpful keywords

We've removed rows containing unhelpful product descriptions, but what about unhelpful keywords? On evaluating the dataset again, it was clear that many descriptions contained both relevant descriptive words and irrelevant or nonsensical words or punctuation. It would be wasteful to remove thousands more lines from our dataset; however, bad keywords could reduce the usefulness of our model.

In our case, we will remove any problematic words or punctuation from a line while retaining the relevant product description. This strikes a good balance between dataset size and usefulness. 

After considering examples from the data, our method is to: 

- remove words with fewer than 3 characters
- remove words containing `?` or `/`
- remove punctuation (`.`, `,`, `"`) from words, then add the word back to the sentence
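
To make these rules concrete, here is a minimal sketch that applies them to a single string. The function name `clean_line` and the sample descriptions are hypothetical and used only for illustration. The sketch assumes word length is measured before punctuation is stripped.

```python
BAD_CHARS = ["?", "/"]
PUNCT = [".", ",", '"']


def clean_line(line):
    """Apply the three rules to one description and return the cleaned string."""
    kept = []
    for word in line.split():
        # Rule 1: skip words with fewer than 3 characters
        if len(word) < 3:
            continue
        # Rule 2: skip words containing "?" or "/"
        if any(ch in word for ch in BAD_CHARS):
            continue
        # Rule 3: strip punctuation, then keep the word
        for ch in PUNCT:
            word = word.replace(ch, "")
        kept.append(word)
    return " ".join(kept)


print(clean_line("red woolly hottie white heart."))    # red woolly hottie white heart
print(clean_line("set of 3 cake tins pantry design"))  # set cake tins pantry design
print(clean_line("jumbo bag pink/blue dots"))          # jumbo bag dots
```

A description made up entirely of filtered words comes back as an empty string, so some lines in the final corpus may be blank.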


While this would likely be possible with the ```DataFrame``` itself, we will convert it to a text file for this last part of cleaning. Conveniently, our final output needs to be a text file anyway, so this step would have been necessary regardless.

Notes: 
- For clarity, we will convert our ```DataFrame``` to a text file separately from removing bad keywords; however, the two could be done in one step.
- We convert the text to lowercase at this point so our word embeddings are in the most useful format.

In [11]:
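# Write every description to a text file in lowercase, one per line.
# The "with" block closes the file even if an error occurs.
with open("dataset.txt", "w") as text_file:
    for column in df.columns:
        text_file.write((df[column].to_string(index=False) + '\n').lower())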

Now we can implement our method and remove or replace any remaining problematic keywords. Here, we delete the previous text file, but you can remove the first and last lines of the block (the `import os` and `os.remove` lines) if you would like to keep both files for comparison.

In [14]:
import os

bad_chars = ["?", "/"]
punct = [".", ",", "\""]

with open('dataset.txt') as oldfile, open('cleaned_dataset.txt', 'w') as newfile:
    for line in oldfile:
        clean_arr = []
        for word in line.split():
            # Skip words containing "?" or "/" and words shorter than 3 characters
            if any(chars in word for chars in bad_chars) or len(word) <= 2:
                continue
            # Strip punctuation, then add the word back to the sentence
            for item in punct:
                word = word.replace(item, "")
            clean_arr.append(word)
        newfile.write(' '.join(clean_arr) + '\n')

os.remove("dataset.txt")
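
A quick sanity check on the output file can confirm the cleaning behaved as intended. The sketch below is an illustrative helper, not part of the original workflow: the name `check_corpus` is hypothetical, and it runs on a small temporary file with made-up contents so that it works without the real dataset. To check the real output, pass `"cleaned_dataset.txt"` as the path.

```python
import os
import tempfile


def check_corpus(path, unwanted='?/.,"'):
    """Return (line count, empty-line count, tokens still containing unwanted characters)."""
    n_lines = 0
    n_empty = 0
    leftovers = []
    with open(path) as corpus:
        for line in corpus:
            n_lines += 1
            tokens = line.split()
            if not tokens:
                n_empty += 1
            # Collect any token that still contains an unwanted character
            leftovers.extend(t for t in tokens if any(ch in t for ch in unwanted))
    return n_lines, n_empty, leftovers


# Demonstration on a temporary file with hypothetical contents
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample_corpus.txt")
    with open(sample_path, "w") as f:
        f.write("white metal lantern\n\ncream cupid hearts coat hanger\n")
    n_lines, n_empty, leftovers = check_corpus(sample_path)

print("Lines:", n_lines)                       # Lines: 3
print("Empty lines:", n_empty)                 # Empty lines: 1
print("Tokens with leftover chars:", leftovers)  # Tokens with leftover chars: []
```

Empty lines correspond to descriptions where every word was filtered out. Whether to drop them before training is a judgment call.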


Finally, we have our cleaned dataset ```cleaned_dataset.txt``` in an appropriate corpus format! We can now make use of Word2Vec and begin training our word embeddings. Check out ```model_training.ipynb``` for the next step!