### Cosine Similarity

#### The cosine similarity is the cosine of the angle θ between two vectors. In text analysis, each vector can represent a document. As θ grows from 0° to 180°, cos θ falls from 1 to -1, so a larger angle means less similarity between the two documents.
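
As a minimal sketch of the definition above, the cosine can be computed from the dot product and the Euclidean norms of the two vectors. The function name `cosine` and the example vectors are assumptions made for illustration; they are not part of the original notebook.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical example vectors
print(cosine([1, 0], [1, 0]))    # same direction (angle of 0 degrees)
print(cosine([1, 0], [0, 1]))    # orthogonal (angle of 90 degrees)
print(cosine([1, 0], [-1, 0]))   # opposite direction (angle of 180 degrees)
```

The three calls cover the two ends and the midpoint of the range: identical direction, no shared direction, and opposite direction.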

### Extract Key Words from Financial Glossary List

In [1]:
# sample regulations
reg_1 = 'A minimum liquidation time that is five day for all other swaps'
reg_2 = 'A maximum liquidation time that is one days for all other swaps'
reg_3 = 'A maximum liquidation time that is six days for all other swaps'

In [2]:
print(reg_1)
print(reg_2)
print(reg_3)

A minimum liquidation time that is five day for all other swaps
A maximum liquidation time that is one days for all other swaps
A maximum liquidation time that is six days for all other swaps


#### Suppose our list includes [minimum, maximum, liquidation, swap]


#### We also extract the numbers from each sentence

#### reg_1 [minimum, liquidation, swap, five]

#### reg_2 [maximum, liquidation, swap, one]

#### reg_3 [maximum, liquidation, swap, six]

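### Convert to vectors

#### Each vector has the components [minimum, maximum, liquidation, swap, number]: a 1 or 0 marks whether the keyword is present, and the last component is the extracted number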

#### reg_1 [1, 0, 1, 1, 5]

#### reg_2 [0, 1, 1, 1, 1]

#### reg_3 [0, 1, 1, 1, 6]
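
The conversion from sentence to vector can be sketched as follows. This is only one possible approach: the `KEYWORDS` list, the `NUMBER_WORDS` lookup, and the `to_vector` helper are hypothetical names, and matching by prefix is a crude stand-in for proper stemming so that "swaps" counts as "swap".

```python
import re

KEYWORDS = ["minimum", "maximum", "liquidation", "swap"]   # assumed glossary terms
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

def to_vector(text):
    """Return [keyword flags..., number] for one regulation sentence."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # 1 if any token starts with the keyword, else 0
    flags = [int(any(tok.startswith(kw) for tok in tokens)) for kw in KEYWORDS]
    # First spelled-out number in the sentence, or 0 if there is none
    number = next((NUMBER_WORDS[tok] for tok in tokens if tok in NUMBER_WORDS), 0)
    return flags + [number]

reg_1 = 'A minimum liquidation time that is five day for all other swaps'
reg_2 = 'A maximum liquidation time that is one days for all other swaps'
reg_3 = 'A maximum liquidation time that is six days for all other swaps'

for reg in (reg_1, reg_2, reg_3):
    print(to_vector(reg))
```

For the three sample regulations this reproduces the vectors listed above.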

In [3]:
from sklearn.preprocessing import StandardScaler

# One row per regulation: [minimum, maximum, liquidation, swap, number]
list1 = [[1, 0, 1, 1, 5], [0, 1, 1, 1, 1], [0, 1, 1, 1, 6]]
print(list1)

[[1, 0, 1, 1, 5], [0, 1, 1, 1, 1], [0, 1, 1, 1, 6]]


In [4]:
# Standardize features: each column is rescaled to zero mean and unit variance
scaler = StandardScaler()
x = scaler.fit_transform(list1)
print(x)

[[ 1.41421356 -1.41421356  0.          0.          0.46291005]
 [-0.70710678  0.70710678  0.          0.         -1.38873015]
 [-0.70710678  0.70710678  0.          0.          0.9258201 ]]
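
To make the scaling step concrete, the sketch below recomputes the standardized values by hand with the standard library. It assumes the same conventions `StandardScaler` uses by default: the population standard deviation (dividing by n, not n - 1), and a scale of 1 for constant columns so they become 0 instead of causing a division by zero. The helper name `standardize` is hypothetical.

```python
from statistics import mean, pstdev

rows = [[1, 0, 1, 1, 5], [0, 1, 1, 1, 1], [0, 1, 1, 1, 6]]

def standardize(rows):
    """Column-wise z-scores: (value - column mean) / column std."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Population standard deviation; constant columns fall back to 1.0
    stds = [pstdev(col) or 1.0 for col in columns]
    return [[(value - m) / s for value, m, s in zip(row, means, stds)]
            for row in rows]

for row in standardize(rows):
    print([round(value, 8) for value in row])
```

The third and fourth columns (liquidation and swap) are identical across all three regulations, so they carry no information after standardization and come out as zeros.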


In [5]:
# Compute Cosine Similarity
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(x)

array([[ 1.        , -0.7522874 , -0.56170721],
       [-0.7522874 ,  1.        , -0.12251278],
       [-0.56170721, -0.12251278,  1.        ]])
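
Standardizing centers every column on zero, so the three rows sum to the zero vector and their pairwise cosines here come out negative. For comparison, the sketch below applies the same measure to the raw, unstandardized vectors. It is an illustration only, and the helper name `cosine` is an assumption.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

raw = {
    "reg_1": [1, 0, 1, 1, 5],
    "reg_2": [0, 1, 1, 1, 1],
    "reg_3": [0, 1, 1, 1, 6],
}

# Pairwise cosine similarity on the raw vectors
names = list(raw)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(a, b, round(cosine(raw[a], raw[b]), 4))
```

On the raw vectors all similarities are positive, and reg_1 and reg_3 score highest even though one says "minimum" and the other "maximum", because the unscaled number component (5 and 6) dominates the 0/1 keyword flags.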

### Utilizing NLP

In [6]:
# sample regulations
r1 = 'Reporting markets shall provide trade and supporting data reports to the Commission on a daily basis. Such reports shall include transaction-level trade data and related order information for each futures or options contract. Reports shall also include time and sales data, reference files and other information as the Commission or its designee may require. All reports must be submitted at the time, and in the manner and format, and with the specific content specified by the Commission or its designee. Upon request, such information shall be accompanied by data that identifies or facilitates the identification of each trader for each transaction or order included in a submitted trade and supporting data report if the reporting market maintains such data.'
r2 = 'Trade and supporting data reports should be provided to the Commission daily. The reports shall include the following information: transaction-level trade data and related order information for each futures or options contract; time and sales data, reference files and other information as the Commission or its designee may require. All reports must be submitted with the specific content specified by the Commission or its designee at the time, and in the manner and format. If requested, such information shall be accompanied by data that identifies or facilitates the identification of all the trader for each submitted trade and supporting data report if available'

In [7]:
print(r1)
print()
print(r2)

Reporting markets shall provide trade and supporting data reports to the Commission on a daily basis. Such reports shall include transaction-level trade data and related order information for each futures or options contract. Reports shall also include time and sales data, reference files and other information as the Commission or its designee may require. All reports must be submitted at the time, and in the manner and format, and with the specific content specified by the Commission or its designee. Upon request, such information shall be accompanied by data that identifies or facilitates the identification of each trader for each transaction or order included in a submitted trade and supporting data report if the reporting market maintains such data.

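Trade and supporting data reports should be provided to the Commission daily. The reports shall include the following information: transaction-level trade data and related order information for each futures or options contract; time and sales data, reference files and other information as the Commission or its designee may require. All reports must be submitted with the specific content specified by the Commission or its designee at the time, and in the manner and format. If requested, such information shall be accompanied by data that identifies or facilitates the identification of all the trader for each submitted trade and supporting data report if available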

### r1 can be viewed as an external regulation and r2 as an internal policy. We would like to measure the similarity between the two texts to see whether the policy matches the regulation.

#### a. Raw texts are preprocessed: punctuation and the most common words (stop words) are removed, the text is tokenized, and the tokens are stemmed (or lemmatized).

#### b. A dictionary of the unique terms found in the whole corpus is created. Texts are quantified first by calculating the term frequency (tf) of each term in each document. These counts form a vector for each document, where each component is the frequency of one term in that document. Let n be the number of documents and m be the number of unique terms. Then we have an n by m tf matrix.

#### c. The next step is to obtain a “term frequency-inverse document frequency” (tf-idf) matrix. Inverse document frequency is an adjustment to term frequency that accounts for the fact that some terms occur in many documents while others occur in only a few. Thus, tf-idf scales up the importance of rarer terms and scales down the importance of more frequent terms relative to the whole corpus.

#### d. The calculated tf-idf is normalized by the Euclidean norm so that each row vector has a length of 1. The normalized tf-idf matrix has the shape n by m. Because every row has unit length, a cosine similarity matrix (n by n) can be obtained by multiplying the tf-idf matrix (n by m) by its transpose (m by n).
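
Steps b to d can be sketched in plain Python on a toy corpus. The two token lists are hypothetical and assumed to be already preprocessed (step a). The idf formula used here, ln((1 + n) / (1 + df)) + 1, is the smoothed variant that scikit-learn's `TfidfTransformer` applies by default; other definitions of idf exist.

```python
import math

# Hypothetical, already-preprocessed documents (step a is assumed done)
docs = [["report", "trade", "data", "daily", "data"],
        ["report", "trade", "data", "available"]]

# Step b: vocabulary and n-by-m term-frequency matrix
vocab = sorted({term for doc in docs for term in doc})
tf = [[doc.count(term) for term in vocab] for doc in docs]

# Step c: smoothed idf per term, then tf-idf
n = len(docs)
df = [sum(1 for row in tf if row[j] > 0) for j in range(len(vocab))]
idf = [math.log((1 + n) / (1 + d)) + 1 for d in df]
tfidf = [[count * weight for count, weight in zip(row, idf)] for row in tf]

# Step d: scale each row to unit Euclidean length
def l2_normalize(row):
    norm = math.sqrt(sum(value * value for value in row))
    return [value / norm for value in row]

tfidf = [l2_normalize(row) for row in tfidf]

# (n by m) times (m by n) gives the n-by-n cosine similarity matrix
cos_sim = [[sum(a * b for a, b in zip(u, v)) for v in tfidf] for u in tfidf]

print(vocab)
print([round(weight, 8) for weight in idf])
print([[round(value, 4) for value in row] for row in cos_sim])
```

With n = 2, a term that appears in only one document gets idf = ln(3/2) + 1 ≈ 1.40546511, while a term that appears in both gets idf = 1.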

In [8]:
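# Pre-processing with nltk
# Normalize by stemming
# Note: nltk.word_tokenize requires NLTK's Punkt tokenizer data to be installed
import nltk, string, numpy
documents = [r1, r2]
stemmer = nltk.stem.porter.PorterStemmer()
def StemTokens(tokens):
    # Reduce each token to its stem
    return [stemmer.stem(token) for token in tokens]
# Translation table that deletes every punctuation character
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def StemNormalize(text):
    # Lowercase, strip punctuation, tokenize, then stem
    return StemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))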

In [9]:
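# Normalize by lemmatization (WordNetLemmatizer requires the NLTK WordNet corpus)
lemmer = nltk.stem.WordNetLemmatizer()
def LemTokens(tokens):
    # Map each token to its dictionary form (lemma)
    return [lemmer.lemmatize(token) for token in tokens]
# Translation table that deletes every punctuation character
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def LemNormalize(text):
    # Lowercase, strip punctuation, tokenize, then lemmatize
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))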

In [10]:
import warnings
warnings.filterwarnings("ignore")
# Turn text into vectors of term frequency:
from sklearn.feature_extraction.text import CountVectorizer
LemVectorizer = CountVectorizer(tokenizer=LemNormalize, stop_words='english')
LemVectorizer.fit_transform(documents)

<2x44 sparse matrix of type '<class 'numpy.int64'>'
	with 76 stored elements in Compressed Sparse Row format>

In [11]:
# The normalized (lemmatized) text of the two documents is tokenized and each term is indexed:
print(LemVectorizer.vocabulary_)

{'reporting': 29, 'market': 21, 'shall': 34, 'provide': 24, 'trade': 40, 'supporting': 38, 'data': 7, 'report': 28, 'commission': 3, 'daily': 6, 'basis': 2, 'include': 16, 'transactionlevel': 43, 'related': 27, 'order': 23, 'information': 18, 'future': 13, 'option': 22, 'contract': 5, 'time': 39, 'sale': 33, 'reference': 26, 'file': 10, 'designee': 8, 'require': 32, 'submitted': 37, 'manner': 20, 'format': 12, 'specific': 35, 'content': 4, 'specified': 36, 'request': 30, 'accompanied': 0, 'identifies': 15, 'facilitates': 9, 'identification': 14, 'trader': 41, 'transaction': 42, 'included': 17, 'maintains': 19, 'provided': 25, 'following': 11, 'requested': 31, 'available': 1}


In [12]:
# Tf-matrix
tf_matrix = LemVectorizer.transform(documents).toarray()
print (tf_matrix)

[[1 0 1 3 1 1 1 6 2 1 1 0 1 1 1 1 2 1 3 1 1 2 1 2 1 0 1 1 5 2 1 0 1 1 4 1
  1 2 2 2 3 1 1 1]
 [1 1 0 3 1 1 1 5 2 1 1 1 1 1 1 1 1 0 4 0 1 0 1 1 0 1 1 1 4 0 0 1 1 1 2 1
  1 2 2 2 3 1 0 1]]


In [13]:
# Should be 2 by 44 (n documents by m unique terms)
tf_matrix.shape

(2, 44)

In [14]:
# Calculate idf and turn tf matrix to tf-idf matrix
from sklearn.feature_extraction.text import TfidfTransformer
tfidfTran = TfidfTransformer(norm="l2")
tfidfTran.fit(tf_matrix)
print (tfidfTran.idf_)
print()
tfidf_matrix = tfidfTran.transform(tf_matrix)
print (tfidf_matrix.toarray())

[1.         1.40546511 1.40546511 1.         1.         1.
 1.         1.         1.         1.         1.         1.40546511
 1.         1.         1.         1.         1.         1.40546511
 1.         1.40546511 1.         1.40546511 1.         1.
 1.40546511 1.40546511 1.         1.         1.         1.40546511
 1.40546511 1.40546511 1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.40546511 1.        ]

[[0.0754519  0.         0.10604501 0.2263557  0.0754519  0.0754519
  0.0754519  0.45271139 0.1509038  0.0754519  0.0754519  0.
  0.0754519  0.0754519  0.0754519  0.0754519  0.1509038  0.10604501
  0.2263557  0.10604501 0.0754519  0.21209002 0.0754519  0.1509038
  0.10604501 0.         0.0754519  0.0754519  0.37725949 0.21209002
  0.10604501 0.         0.0754519  0.0754519  0.3018076  0.0754519
  0.0754519  0.1509038  0.1509038  0.1509038  0.2263557  0.0754519
  0.10604501 0.0754519 ]
 [0.08947804 0.12575827 0.         0.26843413 0.0

In [15]:
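# Calculate cosine similarity
# Rows are L2-normalized, so the matrix product with the transpose gives the cosines.
# The @ operator always means matrix multiplication, for sparse matrices and sparse arrays alike.
cos_similarity_matrix = (tfidf_matrix @ tfidf_matrix.T).toarray()
print(cos_similarity_matrix)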

[[1.         0.86416488]
 [0.86416488 1.        ]]
