# Basic Similarity Model

This similarity model uses TF-IDF to map sentences to vector space. Calculating different similarity metrics (Euclidean, Cosine, Jaccard) on these embeddings, a KNN model is trained on the similarity metrics to determine a final similarity model

The overall process can be considered as follows:

- Preprocessing
    - Uniform casing,Non-Ascii Removal, Stop Word Removal and Lemmatization 
    - Null Values, Duplicates and extraneous data columns are dropped
- Feature Extraction
    - The corpus vocabulary is converted using TFIDF
- Similarity
    - Taking the sent2vec embeddings: Cosine, Euclidean and Jaccard similarity are found
- Model Application
    - A KNN model is applied on the similarity measurements to make predictions
    

 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import copy

import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from unidecode import unidecode
import string

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.metrics.pairwise import manhattan_distances
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import *

## Overall Setup

- Preprocessing: Casing, Non-Ascii Removal, Stop Word Removal Lemmatization
- Data Preparation: Create Train + Test feature, label sets
- Feature Extraction: TF-IDF, Word2Vec, Sent2Vec: Word2Vec w/TF-IDF Weight
- Similarity: Cosine, Euclidean, Jaccard
- Model Training: KNN
- Testing: Run KNN on validity dataset

### Load in Data

Here we are just viewing the data to understand what we are working with.

In [2]:
data = pd.read_csv("data/train.csv")
data.head()

Unnamed: 0.1,Unnamed: 0,description_x,description_y,ticker_x,ticker_y,same_security
0,0,first trust dow jones internet,first trust dj internet idx,FDN,FDN,True
1,1,schwab intl large company index etf,schwab strategic tr fundamental intl large co ...,FNDF,FNDF,True
2,2,vanguard small cap index adm,vanguard small-cap index fund inst,VSMAX,VSCIX,False
3,3,duke energy corp new com new isin #us4 sedol #...,duke energy corp new com new isin #us26441c204...,DUK,DUK,True
4,4,visa inc class a,visa inc.,V,V,True


In [3]:
test = pd.read_csv("data/test.csv")
test.head()

Unnamed: 0,test_id,description_x,description_y,same_security
0,0,semtech corp,semtech corporation,
1,1,vanguard mid cap index,vanguard midcap index - a,
2,2,spdr gold trust gold shares,spdr gold trust spdr gold shares,
3,3,vanguard total bond index adm,vanguard total bond market index,
4,4,oakmark international fund class i,oakmark international cl i,


### Preprocessing

Defining preprocessing functions

In [4]:
STOP_WORDS = stopwords.words('english') + list(string.punctuation)
lemmatizer = WordNetLemmatizer()

#taken from embeddings_understanding.ipynb
    
def pre_process(text):
    
    #remove different casing
    text = text.lower()
    
    #convert all non-ascii characters
    text = unidecode(text)
    
    #tokenize words
    text = word_tokenize(text)
    
    #Remove stop words
    text = [i for i in text if i not in STOP_WORDS]
    
    #lemmatize
    for index, word in enumerate(text):
        text[index] = lemmatizer.lemmatize(word)
        
    return " ".join(text)

Removing ticker columns and index column, along with duplicates and null rows

##### *Why remove the ticker columns?*

Matching ticker columns denote descriptions that have the same security, however we already have a 'same security' column that also tell us this. In addition, in a generalized context, we will only have access to the description - so here we will choose to remove the ticker columns for data simplicity.

In [5]:
data = pd.read_csv("data/train.csv")
data = data.drop(['Unnamed: 0','ticker_x','ticker_y'], 1)
data = data.drop_duplicates()
data = data.dropna(0)

Run basic preprocessing on data

In [6]:
data.loc[:,'description_x'] = data['description_x'].apply(pre_process, 1)
data.loc[:, 'description_y'] = data['description_y'].apply(pre_process, 1)
data.head()

Unnamed: 0,description_x,description_y,same_security
0,first trust dow jones internet,first trust dj internet idx,True
1,schwab intl large company index etf,schwab strategic tr fundamental intl large co ...,True
2,vanguard small cap index adm,vanguard small-cap index fund inst,False
3,duke energy corp new com new isin us4 sedol b7...,duke energy corp new com new isin us26441c2044...,True
4,visa inc class,visa inc,True


### Data Preparation

Here we are going to seperate the 'labels' (same security) from the 'features' (descriptions)

In [7]:
labels = copy.deepcopy(data.loc[:,'same_security'])
labels.head()

0     True
1     True
2    False
3     True
4     True
Name: same_security, dtype: bool

In [8]:
features = copy.deepcopy(data.loc[:,'description_x':'description_y'])
features.head()

Unnamed: 0,description_x,description_y
0,first trust dow jones internet,first trust dj internet idx
1,schwab intl large company index etf,schwab strategic tr fundamental intl large co ...
2,vanguard small cap index adm,vanguard small-cap index fund inst
3,duke energy corp new com new isin us4 sedol b7...,duke energy corp new com new isin us26441c2044...
4,visa inc class,visa inc


### Feature Extraction

#### TF-IDF

Now we are going to create the TFIDF matrix:

The matrix is setup as a n_description x n_words matrix, where each row is a one-hot style encoding where if a word is
present in a description, that cell contains the tfidf score of the word. This encoding style helps for similarity scoring between sentences.

In [9]:
#Given a corpus frame returns 1-D array of tf-idf score of each sentence in corpus
def calculate_tfidfscore(word_bank):
    
    #convert to list
    corpus = []
    corpus = word_bank.values.tolist()
    
    #we are flattening the list to create a single corpus
    #odd values = description_x
    #even values = description_y
    corpus = [val for sublist in corpus for val in sublist]

    #calculate matrix
    vectorizer = TfidfVectorizer(ngram_range=(1,1))
    vectorizer.fit(corpus)
    matrix = vectorizer.transform(corpus)

    return matrix

Finding actual TF-IDF scores

In [10]:
#calculate tfidfs matrix
tfidf_train = calculate_tfidfscore(features)

tfidf_train

<4274x1631 sparse matrix of type '<class 'numpy.float64'>'
	with 19932 stored elements in Compressed Sparse Row format>

In [11]:
print(tfidf_train.toarray()[0:5])
print(tfidf_train[0:5])

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
  (0, 1471)	0.3397349884643645
  (0, 882)	0.48510219835327667
  (0, 833)	0.5021212998408029
  (0, 651)	0.4107037877081583
  (0, 533)	0.4779671471704836
  (1, 1471)	0.34825191456420274
  (1, 833)	0.514709140802457
  (1, 773)	0.3280565663954867
  (1, 651)	0.4209998535465491
  (1, 524)	0.5735269103846407
  (2, 1277)	0.4756143956735665
  (2, 901)	0.4855076782134331
  (2, 835)	0.3917544613038878
  (2, 793)	0.2667747122137138
  (2, 599)	0.2787745153192131
  (2, 445)	0.4855076782134331
  (3, 1450)	0.3267552659771573
  (3, 1391)	0.38662914147414085
  (3, 1277)	0.3684276453202312
  (3, 901)	0.37609133006948864
  (3, 835)	0.3034667895564472
  (3, 793)	0.20665307851476256
  (3, 679)	0.45792116635316527
  (3, 599)	0.21594854821181492
  (3, 421)	0.27466730318179344
  (4, 1536)	0.32614020380025294
  (4, 1324)	0.5041327957765964
  (4, 793)	0.3855481960624046
  (4, 368)	0.43845

### Similarity Modeling

#### Cosine Similarity and Euclidean Distance

Using scikitlearn builtin functions to create a matrix of:
- Cosine similarity
- Euclidean Distance
- Manhatten Distance

In [12]:
def calculate_similarities(tfidf):
    i = 0
    similarities = []
    while i < tfidf.shape[0]:
        
        #each i, i+1 pair corresponds to a description_x and description_y pair
        similarities.append([cosine_similarity(tfidf[i], tfidf[i+1])[0], 
                             euclidean_distances(tfidf[i], tfidf[i+1])[0],
                            manhattan_distances(tfidf[i], tfidf[i+1])[0]])
        i += 2
        
    df = pd.DataFrame(similarities, columns=['cosine','euclidean', 'manhatten'])
    return df

In [13]:
similarities = calculate_similarities(tfidf_train)

data = data.join(similarities)
data.head()

Unnamed: 0,description_x,description_y,same_security,cosine,euclidean,manhatten
0,first trust dow jones internet,first trust dj internet idx,True,[0.5496660174729826],[0.9490352812482973],[1.8960536552037706]
1,schwab intl large company index etf,schwab strategic tr fundamental intl large co ...,True,[0.5920399564134528],[0.9032829496747377],[2.3593189262507597]
2,vanguard small cap index adm,vanguard small-cap index fund inst,False,[0.6451095555362751],[0.8424849487839233],[1.5563255256885102]
3,duke energy corp new com new isin us4 sedol b7...,duke energy corp new com new isin us26441c2044...,True,[0.624520027007332],[0.8665794516288371],[1.8267404110904208]
4,visa inc class,visa inc,True,[0.8589960308272581],[0.5310441962261561],[0.697373750857696]


### Model Training

#### K Nearest Neighbors Model

Here, we are going to train a KNN Classifier Model on the similarity statistics to create a final prediction algorithm.

In [14]:
KNN_model = KNeighborsClassifier(n_neighbors=4)
KNN_model.fit(similarities, labels)
KNN_model

KNeighborsClassifier(n_neighbors=4)

In [15]:
pred = KNN_model.predict(similarities)
data.insert(3, 'KNN_Predictions', pred)
data.head(15)

Unnamed: 0,description_x,description_y,same_security,KNN_Predictions,cosine,euclidean,manhatten
0,first trust dow jones internet,first trust dj internet idx,True,True,[0.5496660174729826],[0.9490352812482973],[1.8960536552037706]
1,schwab intl large company index etf,schwab strategic tr fundamental intl large co ...,True,True,[0.5920399564134528],[0.9032829496747377],[2.3593189262507597]
2,vanguard small cap index adm,vanguard small-cap index fund inst,False,False,[0.6451095555362751],[0.8424849487839233],[1.5563255256885102]
3,duke energy corp new com new isin us4 sedol b7...,duke energy corp new com new isin us26441c2044...,True,True,[0.624520027007332],[0.8665794516288371],[1.8267404110904208]
4,visa inc class,visa inc,True,True,[0.8589960308272581],[0.5310441962261561],[0.697373750857696]
5,ford motor co new div 0.600,ford motor co,True,True,[0.7160190990066819],[0.7536324050799807],[1.6602897899467042]
6,united state steel corp,united sts stl cp new,True,True,[0.18540072998906892],[1.2764006189366495],[3.320989054334075]
7,vanguard total international bond index etf,vanguard total intl bond index etf,True,True,[0.6920752796409834],[0.784760753808467],[1.1474067997463036]
8,schwab strategic tr u sml c,schwab u small cap etf,True,False,[0.31259955553693014],[1.172519035634876],[2.9929678094242465]
9,mf value fd cl,mf value fund cl,True,True,[0.8693665934139639],[0.5111426544244498],[0.7835784393081499]


In [16]:
print(confusion_matrix(labels, pred))
print(classification_report(labels, pred))

[[ 400  124]
 [ 232 1381]]
              precision    recall  f1-score   support

       False       0.63      0.76      0.69       524
        True       0.92      0.86      0.89      1613

    accuracy                           0.83      2137
   macro avg       0.78      0.81      0.79      2137
weighted avg       0.85      0.83      0.84      2137



We can see here based on our output, this model shows an 83% accuracy rate, with a KNN model where n=4.

### Testing

Preprocess the testing data

In [17]:
test = test.drop_duplicates()
test = test.drop(['test_id', 'same_security'], 1)
test = test.dropna(0)

test.iloc[:,0] = test.iloc[:,0].apply(pre_process, 1)
test.iloc[:,0] = test.iloc[:,1].apply(pre_process, 1)

Determine our predictions for the test data set

In [18]:
#tfidf
tfidf_test = calculate_tfidfscore(test)
#similarities
similarities_test = calculate_similarities(tfidf_test)
#predict
pred = KNN_model.predict(similarities_test)

test.insert(2, 'KNN_Predictions', pred)
test.head(15)


Unnamed: 0,description_x,description_y,KNN_Predictions
0,semtech corporation,semtech corporation,True
1,vanguard midcap index,vanguard midcap index - a,True
2,spdr gold trust spdr gold share,spdr gold trust spdr gold shares,True
3,vanguard total bond market index,vanguard total bond market index,True
4,oakmark international cl,oakmark international cl i,True
5,pfizer inc com,pfizer inc com,True
6,sptn glb xus idx adv,sptn glb xus idx adv,True
7,vanguard total bond market index fund investor...,vanguard total bond market index fund investor...,True
8,banco latinoamericano come-e,banco latinoamericano come-e,True
9,baidu inc spons ad repr 0.10 ord cl us0.00005,baidu inc spons ads repr 0.10 ord cls a us0.00005,True


In [19]:
true_count = 0
false_count = 0

for i in pred:
    if (i): true_count += 1
    else: false_count += 1

print("Counts:")
print("True Count: %d" % (true_count))
print("False Count: %d \n" % (false_count))

Counts:
True Count: 505
False Count: 11 



### Next Steps

We can see with an 83% accuracy rating, this model provides a good basis for a text similarity model. 

However, we have two main caveats:
1. TF-IDF disregards semantic meaning of text
2. Our accuracy is from rerunning on the training set

To combat this, some next steps include the integration of word2vec embedding (such as those in Spacy, Gensim) with our TF-IDF word embedding to account for semantic meaning. Another is to use an extra validation set - whether split from our training set or using a new set.