1. As a first step, you must pre-process the documents. In particular, for the text fields (title,
description) you should apply:
● Stop-word removal
● Tokenization
● Punctuation removal
● Stemming
● and... anything else you think is needed (bonus point)

In [3]:
import nltk
nltk.download('stopwords')
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\marti\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [7]:
import json

In [12]:
dataset_path = '../../data/fashion_products_dataset.json'
with open(dataset_path, 'r', encoding='utf-8') as file:  # explicit encoding; the platform default may not be UTF-8
    data = json.load(file)
print(f"Total products in dataset: {len(data)}")

data_sample_title = data[0]['title']
print(f"Sample product title: {data_sample_title}")

Total products in dataset: 28080
Sample product title: Solid Women Multicolor Track Pants


In [15]:
# Build the stemmer and the stop-word set once instead of on every call.
STEMMER = PorterStemmer()
STOP_WORDS = set(stopwords.words('english'))

def build_terms(document):
    """
    Preprocess the document text (title + description): convert it to lowercase,
    remove punctuation, tokenize, remove stop words, and stem the remaining tokens.

    Argument:
    document -- a dictionary with 'title' and 'description' keys

    Returns:
    tokens -- a list of tokens corresponding to the input text after the preprocessing
    """
    text = document['title'] + ' ' + document['description']
    text = text.lower()
    # Replace every character that is neither alphanumeric nor whitespace with a space.
    text = ''.join(char if char.isalnum() or char.isspace() else ' ' for char in text)
    # split() with no argument splits on any run of whitespace (spaces, tabs,
    # newlines) and never yields empty strings.
    tokens = text.split()
    tokens = [term for term in tokens if term not in STOP_WORDS]
    tokens = [STEMMER.stem(term) for term in tokens]
    return tokens

sample_document = data[0]
sample_terms = build_terms(sample_document)
print(f"Sample document terms: {sample_terms}")

Sample document terms: ['solid', 'women', 'multicolor', 'track', 'pant', 'yorker', 'trackpant', 'made', '100', 'rich', 'comb', 'cotton', 'give', 'rich', 'look', 'design', 'comfort', 'skin', 'friendli', 'fabric', 'itch', 'free', 'waistband', 'great', 'year', 'round', 'use', 'proudli', 'made', 'india']
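
The punctuation-removal and tokenization steps can also be written as a single regular expression. The cell below is a minimal sketch of that idea, not the notebook's method: the names `TOKEN_PATTERN` and `tokenize` are hypothetical, and the pattern assumes ASCII-only text, so accented or non-Latin letters that `str.isalnum()` would keep are dropped here.

```python
import re

# Hypothetical helper, not part of the notebook: keeps only runs of ASCII
# letters and digits, so punctuation and every kind of whitespace
# (spaces, tabs, newlines) act as token separators.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+")

def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens."""
    return TOKEN_PATTERN.findall(text.lower())

print(tokenize("Solid Women Multicolor Track-Pants,\n100% cotton!"))
# ['solid', 'women', 'multicolor', 'track', 'pants', '100', 'cotton']
```

Stop-word removal and stemming would still be applied to the resulting list afterwards, as in build_terms.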
