In [1]:
import sys
print(sys.version)

3.7.7 (default, Mar 10 2020, 15:43:03) 
[Clang 11.0.0 (clang-1100.0.33.17)]


# Introduction
This document outlines techniques for preprocessing text and representing it with word embeddings, so that similarity between chunks of text can be detected.

The motivation is to power intelligent inventory search in e-commerce, so the dataset will be specific to e-commerce.

## Preprocessing
Text is preprocessed to ensure the data is clean and correctly formatted. Common steps include:
* Uniform case - converting all text to a single case (usually lowercase). This step can lose information, such as the capitalization of proper nouns
* Remove stop words - stop words such as 'is', 'the', 'are', 'a', 'an' etc. are removed
* Remove punctuation - punctuation such as '.', '!', '?', '$' etc. is removed. Note: this can lead to further loss of information from the text
* Lemmatization - reducing the inflected forms of a word to its dictionary base form (lemma) e.g. eggs -> egg, dresses -> dress
* Stemming - heuristic method of removing affixes e.g. drawing -> draw, drawings -> draw
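
To make the stemming step concrete, here is a minimal sketch using NLTK's `PorterStemmer`. The choice of the Porter algorithm is an assumption made for illustration; this document does not prescribe a particular stemmer, and the word list simply reuses the examples from the bullet above.

```python
from nltk.stem import PorterStemmer

# Porter is one common rule-based stemmer; choosing it here is an assumption.
stemmer = PorterStemmer()

# Illustrative words taken from the stemming bullet above.
words = ["drawing", "drawings"]

# Map each word to its stem by heuristically stripping suffixes.
stems = {word: stemmer.stem(word) for word in words}
print(stems)
```

Unlike a lemmatizer, a stemmer does not consult a dictionary, so its output is not guaranteed to be a real word.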


## Word Embeddings
Word embeddings represent the words within a chunk of text (a document) as vectors that capture as much of its semantics (vocabulary and meaning) and syntax (grammatical structure) as possible. Techniques include:
* Bag of Words (BoW)
* TF-IDF
* Smooth Inverse Frequency (SIF)
* Word2Vec
    * Continuous BoW
    * Continuous Skip-gram
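
As a rough illustration of the first two techniques, the sketch below builds Bag of Words and TF-IDF vectors using only the standard library. The three documents are hypothetical, and the TF-IDF formula is the plain textbook variant (term frequency times `log(N / df)`); libraries such as scikit-learn apply smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical, already-preprocessed product descriptions.
docs = [
    "summer dress picnic park",
    "summer shirt beach",
    "winter coat park",
]
tokenized = [doc.split() for doc in docs]

# The vocabulary fixes the meaning of each vector position.
vocabulary = sorted({token for tokens in tokenized for token in tokens})


def bag_of_words(tokens):
    """Count how often each vocabulary term occurs in the document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def tf_idf(tokens):
    """Weight each term's frequency by how rare it is across documents."""
    counts = Counter(tokens)
    n_docs = len(tokenized)
    vector = []
    for term in vocabulary:
        tf = counts[term] / len(tokens)
        # Number of documents that contain the term (always >= 1 here).
        df = sum(1 for other in tokenized if term in other)
        idf = math.log(n_docs / df)
        vector.append(tf * idf)
    return vector


bow_vectors = [bag_of_words(tokens) for tokens in tokenized]
tfidf_vectors = [tf_idf(tokens) for tokens in tokenized]
print(vocabulary)
print(bow_vectors[0])
print(tfidf_vectors[0])
```

In the first document, 'dress' appears in only one document and therefore receives a higher TF-IDF weight than 'summer', which appears in two.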
    
## Similarity
Once vector representations of the text are calculated, we will use similarity (or distance) functions to determine how similar two texts are:
* Cosine similarity
* Jaccard distance
* Euclidean distance
* Word Mover's distance
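
The first three measures can be sketched in a few lines of standard-library Python. The function names and the example inputs are illustrative assumptions, not part of any library. Word Mover's distance is omitted because it requires pretrained word vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def jaccard_distance(tokens_a, tokens_b):
    """1 - |A intersect B| / |A union B| over the two token sets (0.0 = identical sets)."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical inputs: two count vectors and two token lists.
u, v = [1, 1, 0], [1, 0, 1]
print(cosine_similarity(u, v))
print(euclidean_distance(u, v))
print(jaccard_distance("summer dress park".split(), "summer coat park".split()))
```

Note that cosine similarity grows as texts become more alike, while the two distances shrink, so any threshold must be applied in the matching direction.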
    
## Methodology
We will be applying the following steps. Techniques applied within each step may vary.
* Text Preprocessing
* Feature Extraction
* Similarity Calculation
* Threshold/Decision Function
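
The four steps can be wired together as in the following sketch. Everything in it is a simplifying assumption chosen to keep the example dependency-free: the tiny stop-word list, token sets as features, Jaccard similarity as the measure, and a threshold of 0.5. The actual techniques applied in each step may differ.

```python
import string

# Tiny illustrative stop-word list (an assumption, not the NLTK list).
STOP_WORDS = {"a", "an", "the", "is", "are", "for", "in", "and"}


def preprocess(text):
    """Step 1: lowercase, strip punctuation, drop stop words."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [token for token in text.split() if token not in STOP_WORDS]


def extract_features(tokens):
    """Step 2: represent the text as its set of unique tokens."""
    return set(tokens)


def similarity(features_a, features_b):
    """Step 3: Jaccard similarity between two token sets."""
    union = features_a | features_b
    return len(features_a & features_b) / len(union) if union else 0.0


def is_match(text_a, text_b, threshold=0.5):
    """Step 4: decide whether the two texts are similar enough."""
    score = similarity(
        extract_features(preprocess(text_a)),
        extract_features(preprocess(text_b)),
    )
    return score >= threshold, score


print(is_match("A red summer dress for the beach.", "Red summer dress, ideal for the beach"))
print(is_match("A red summer dress for the beach.", "Leather winter boots"))
```

The threshold is the main tuning knob: raising it trades recall for precision, and a suitable value depends on the dataset.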

## Dataset
We'll be using a text similarity dataset from [Kaggle](https://www.kaggle.com/rishisankineni/text-similarity?select=train.csv). Download the dataset (`train.csv` and `test.csv`) to a directory named `data`. Once you have a satisfactory model, test it on an e-commerce product dataset, also from [Kaggle](https://www.kaggle.com/cclark/product-item-data). Download that data to the `data` directory and rename the file to `ecommerce_product_listing.csv`. You may need to update your methodology for the new dataset.

## Libraries
We will be using the following libraries. All input datasets will be processed according to English-language semantics and syntax.
### `nltk`
Download the supporting data packages (corpora and models) for the `nltk` module with the following command:
`python -m nltk.downloader all`

In [6]:
import pandas as pd
import nltk

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from unidecode import unidecode
import string

## Preprocessing

In [26]:
# list of stop words and punctuation in the English language
STOP_WORDS = stopwords.words('english') + list(string.punctuation)
lemmatizer = WordNetLemmatizer()

def pre_process(corpus):
    '''Normalizes input text: lowercases, converts to ASCII, tokenizes,
    removes stop words and punctuation, and lemmatizes.'''
    # uniform case
    corpus = corpus.lower()
    # transliterate non-ASCII characters to ASCII
    corpus = unidecode(corpus)
    # tokenize words
    tokens = word_tokenize(corpus)
    # remove stop words and punctuation tokens
    tokens = [token for token in tokens if token not in STOP_WORDS]
    # lemmatize (WordNetLemmatizer treats every token as a noun by default)
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return " ".join(tokens)



In [27]:
pre_process('Hi, my name is José. I live in Puñta Caña. I ate couple of eggs and went swimming yesterday')

'hi name jose live punta cana ate couple egg went swimming yesterday'

## Feature Extraction

### TF-IDF

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'summer dress ideal for picnics, long walks in the park or a BBQ', 
    'semi-casual attire, perfect for both work and a night out with friends'
]
for index, text in enumerate(corpus):
    corpus[index] = pre_process(text)
# show the preprocessed documents
print(corpus)

# create vocabulary using unigrams, bigrams and trigrams
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 3))
# fit and transform TF-IDF
tfidf_vectorizer.fit(corpus)
feature_vectors = tfidf_vectorizer.transform(corpus)

['summer dress ideal picnic long walk park bbq', 'semi-casual attire perfect work night friend']
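
With TF-IDF vectors in hand, the similarity step can be sketched using scikit-learn's `cosine_similarity`. The first two strings below are the preprocessed documents shown in the output above; the third is a hypothetical query added purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    'summer dress ideal picnic long walk park bbq',
    'semi-casual attire perfect work night friend',
    'summer dress perfect picnic park',  # hypothetical query text
]

# Same n-gram range as the vectorizer above.
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
vectors = vectorizer.fit_transform(docs)

# Pairwise cosine similarity: entry [i, j] compares docs[i] with docs[j].
similarity_matrix = cosine_similarity(vectors)
print(similarity_matrix.round(3))
```

The two original documents share no terms, so their similarity is zero, while the query overlaps more with the first document than with the second.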
