## Similarity Measures

This notebook runs similarity measures for three stocks using manually scraped text, while the automated web scraping is developed in parallel.

The three stocks are EBAY, IBM, and INTC. I scraped the text for the current quarter and for the same quarter in each previous year (e.g., 7-25-17, 7-25-16, and so on).

In [4]:
# import the manually scraped text file (pipe-delimited)
import pandas as pd
df = pd.read_table('manualscrape.txt', sep="|")  # read_table already returns a pandas DataFrame

In [5]:
list(df) #list Dataframe columns

['cik ', ' ticker ', ' date ', ' year ', ' quarter ', ' text']
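The column names above carry the padding that surrounds each pipe in the source file, which is why later cells index with keys such as `' ticker '`. A minimal sketch of one way to strip that padding follows; the sample row is made up for illustration, the name `toy` is hypothetical, and if this were applied to `df` the later cells would need the stripped names instead.

```python
import io
import pandas as pd

# Hypothetical one-row sample in the same pipe-delimited layout as manualscrape.txt
sample = io.StringIO(
    "cik | ticker | date | year | quarter | text\n"
    "511143 | IBM | 2017-07-25 | 2017 | 2 | Except for the historical information\n"
)
toy = pd.read_csv(sample, sep="|")

raw_columns = list(toy.columns)  # names keep the spaces around each pipe
print(raw_columns)

# Strip the padding from the column names
toy.columns = toy.columns.str.strip()

# String values keep the padding too, so strip those columns as well
for col in ["ticker", "date", "text"]:
    toy[col] = toy[col].str.strip()

print(list(toy.columns))
```

With clean names, `toy["ticker"]` works without having to remember which side of each name is padded.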

In [6]:
df #see output

Unnamed: 0,cik,ticker,date,year,quarter,text
0,511143,IBM,2017-07-25,2017,2,Except for the historical information and dis...
1,511143,IBM,2016-07-26,2016,2,Except for the historical information and dis...
2,511143,IBM,2015-07-28,2015,2,Except for the historical information and dis...
3,511143,IBM,2014-07-29,2014,2,Except for the historical information and dis...
4,511143,IBM,2013-07-31,2013,2,Litigation Reform Act of 1995. Forward-lookin...
5,511143,IBM,2012-07-31,2012,2,Except for the historical information and dis...
6,511143,IBM,2011-07-26,2011,2,Except for the historical information and dis...
7,511143,IBM,2010-07-27,2010,2,Except for the historical information and dis...
8,511143,IBM,2009-07-28,2009,2,Except for the historical information and dis...
9,511143,IBM,2008-07-29,2008,2,Except for the historical information and dis...


In [152]:
# one list of text per (ticker, year) group; the grouping keys are passed as a list
df_lists = df.groupby([' ticker ', ' year '])[' text'].apply(list).tolist()
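A small sketch of what this grouping produces, using a made-up three-row frame (`toy`, its stripped column names, and its text values are assumptions for illustration only). The point to notice is that `groupby` sorts by the group keys, so the resulting list is ordered by ticker and then by ascending year, which need not match the row order of the DataFrame.

```python
import pandas as pd

# Hypothetical data: rows are listed newest first, as in the scraped file
toy = pd.DataFrame({
    "ticker": ["IBM", "IBM", "EBAY"],
    "year": [2017, 2016, 2017],
    "text": ["ibm risk text 2017", "ibm risk text 2016", "ebay risk text 2017"],
})

# One list of texts per (ticker, year) group
grouped = toy.groupby(["ticker", "year"])["text"].apply(list)
toy_lists = grouped.tolist()

print(grouped.index.tolist())  # group keys, sorted by ticker then year
print(toy_lists)               # each element is a list holding that group's text
```

Each element is itself a list, which is why the later loop indexes with `[0]` to get the text string.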

### Parsing and Similarity Functions

This code block parses the text into words and sets up the similarity functions.
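For reference, the two measures are commonly defined as follows. The notation is introduced here and is not from the notebook: $a_w$ and $b_w$ are the counts of word $w$ in the two texts, and $S_A$, $S_B$ are the sets of distinct words in each text.

$$
\text{cosine}(A, B) = \frac{\sum_{w} a_w \, b_w}{\sqrt{\sum_{w} a_w^2}\;\sqrt{\sum_{w} b_w^2}},
\qquad
\text{Jaccard}(A, B) = \frac{|S_A \cap S_B|}{|S_A \cup S_B|}
$$

Cosine similarity uses word counts, so repeated words carry more weight; Jaccard similarity only looks at whether a word appears at all.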

In [160]:
import re, math
from collections import Counter

WORD = re.compile(r'\w+')

# Cosine similarity function
def get_cosine(vec1, vec2):
    # dot product over the words that both vectors share
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)

    # product of the two vector norms
    sum1 = sum(v ** 2 for v in vec1.values())
    sum2 = sum(v ** 2 for v in vec2.values())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    return float(numerator) / denominator

# Jaccard Similarity Function
def get_jaccard(vec1, vec2):
    # vec1 and vec2 are sets of words
    union = vec1.union(vec2)
    if not union:
        return 0.0
    return float(len(vec1.intersection(vec2))) / len(union)

# Word Vector Format needed for cosine similarity (ie words with counts)
def text_to_vector(text):
    words = WORD.findall(text)
    return Counter(words)

# Word Vector Format needed for jaccard similarity (vector of words)
def text_to_vector_js(text):
    words = WORD.findall(text)
    return words

# function to calculate cosine similarity when two quarters are passed
def calc_cosine(quarterCurrent, quarterOld):
    text1 = quarterCurrent
    text2 = quarterOld
    vector1 = text_to_vector(text1)
    vector2 = text_to_vector(text2)
    cosine = get_cosine(vector1, vector2)
    return cosine

# function to calculate jaccard similarity when two quarters are passed
def calc_jaccard(quarterCurrent, quarterOld):
    textjs1 = quarterCurrent
    textjs2 = quarterOld
    vectorjs1 = set(text_to_vector_js(textjs1))
    vectorjs2 = set(text_to_vector_js(textjs2))
    jaccard = get_jaccard(vectorjs1, vectorjs2)
    return jaccard
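As a quick sanity check of the two measures, here is a self-contained sketch on two short strings. The helpers `cosine` and `jaccard` are compact stand-ins written for this example rather than the notebook's functions, and the sample sentences are made up.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")

def cosine(text_a, text_b):
    # Word-count vectors, then dot product over the product of norms
    a, b = Counter(WORD.findall(text_a)), Counter(WORD.findall(text_b))
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

def jaccard(text_a, text_b):
    # Sets of distinct words, then intersection size over union size
    a, b = set(WORD.findall(text_a)), set(WORD.findall(text_b))
    union = a | b
    return len(a & b) / len(union) if union else 0.0

old = "risk of loss risk of delay"
new = "risk of loss"

cos_value = cosine(old, new)
jac_value = jaccard(old, new)
print(cos_value, jac_value)
```

Here the texts share three of four distinct words, so the Jaccard similarity is 3/4 = 0.75. The count vectors give a dot product of 5 and norms of √10 and √3, so the cosine similarity is 5/√30, roughly 0.913. Note that the tokenizer is case-sensitive, so `Risk` and `risk` would count as different words.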

In [161]:
# empty lists for similarity metrics to be appended as columns later
cos_sim = []
jac_sim = []

# loop through the dataframe and calculate similarity metrics for adjacent entries of df_lists
# note: on the final iteration the first index is -1, which wraps around to the last entry
n = len(df)
for i in range(n):
    cos_value = calc_cosine(df_lists[n - 2 - i][0], df_lists[n - 1 - i][0])
    cos_sim.append(cos_value)
    j_value = calc_jaccard(df_lists[n - 2 - i][0], df_lists[n - 1 - i][0])
    jac_sim.append(j_value)
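The index arithmetic above compares neighbouring entries regardless of ticker. One possible alternative, sketched here and not the notebook's method, is to pair each row with the previous year's text for the same ticker using `groupby(...).shift(1)`, leaving the earliest year of each ticker as `NaN`. The frame `toy`, its stripped column names, the helper `cosine`, and the new column names are all assumptions for illustration.

```python
import math
import re
from collections import Counter

import pandas as pd

WORD = re.compile(r"\w+")

def cosine(text_a, text_b):
    # Compact cosine similarity on word counts (stand-in for the notebook's functions)
    a, b = Counter(WORD.findall(text_a)), Counter(WORD.findall(text_b))
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical data for two tickers
toy = pd.DataFrame({
    "ticker": ["IBM", "IBM", "IBM", "EBAY", "EBAY"],
    "year": [2017, 2016, 2015, 2017, 2016],
    "text": ["alpha gamma", "alpha beta", "alpha beta", "epsilon", "delta"],
})

# Sort so that, within each ticker, the previous row is the previous year
toy = toy.sort_values(["ticker", "year"]).reset_index(drop=True)
toy["prev_text"] = toy.groupby("ticker")["text"].shift(1)

def yoy_cosine(row):
    # No earlier year for this ticker: leave the similarity undefined
    if pd.isna(row["prev_text"]):
        return float("nan")
    return cosine(row["text"], row["prev_text"])

toy["cosine_vs_prev_year"] = toy.apply(yoy_cosine, axis=1)
print(toy[["ticker", "year", "cosine_vs_prev_year"]])
```

Because the shift happens inside each ticker group, no comparison crosses from one company to another, and no index wraps around.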

In [157]:
# append similarity lists as a column in dataframe
df['cosine_similarity'] = cos_sim
df['jaccard_similarity'] = jac_sim

In [158]:
# show dataframe
df

Unnamed: 0,cik,ticker,date,year,quarter,text,cosine_similarity,jaccard_similarity
0,511143,IBM,2017-07-25,2017,2,Except for the historical information and dis...,0.909603,0.647541
1,511143,IBM,2016-07-26,2016,2,Except for the historical information and dis...,0.927431,0.66129
2,511143,IBM,2015-07-28,2015,2,Except for the historical information and dis...,0.888949,0.709302
3,511143,IBM,2014-07-29,2014,2,Except for the historical information and dis...,0.971591,0.938462
4,511143,IBM,2013-07-31,2013,2,Litigation Reform Act of 1995. Forward-lookin...,0.604016,0.2723
5,511143,IBM,2012-07-31,2012,2,Except for the historical information and dis...,0.870876,0.129136
6,511143,IBM,2011-07-26,2011,2,Except for the historical information and dis...,0.997605,0.845714
7,511143,IBM,2010-07-27,2010,2,Except for the historical information and dis...,0.996986,0.866843
8,511143,IBM,2009-07-28,2009,2,Except for the historical information and dis...,0.996006,0.876289
9,511143,IBM,2008-07-29,2008,2,Except for the historical information and dis...,0.622013,0.10936


In [159]:
# export to text
df.to_csv('/Users/z013nx1/Documents/stocksWithSimilarityMetrics.txt')