# Congressional Committees w/ Stock Trades w/ String Algorithms

Exploring matching Congressional stock trade data and stock descriptions with Congressional Committee descriptions

reference: https://pythonspot.com/nltk-stop-words/

----

#### Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML



In [2]:
# pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)
# pd.set_option('display.width', None)
# pd.set_option('display.max_colwidth', None)
# pd.set_option('max_seq_item', None)

In [3]:
#string matching imports
from difflib import SequenceMatcher
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from thefuzz import fuzz
from thefuzz import process
import textdistance
import jaro
import jellyfish



In [4]:
#natural language processing imports
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.corpus import treebank
import string

In [5]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('treebank')

[nltk_data] Downloading package punkt to /Users/sm/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/sm/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/sm/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/sm/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /Users/sm/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package treebank to /Users/sm/nltk_data...
[nltk_data]   Package treebank is already up-to-date!


True

----

##### Reading Dataframes

Read in Stock trades by Congress members with Yahoo finance stock info

In [6]:
df_trades = pd.read_csv("..//data//processed//stock_watchers_w_yfinance_03_12_2022.csv", encoding="utf-8")

In [7]:
df_trades.head(1)

Unnamed: 0,transaction_date,disclosure_date,politician,owner,ticker,amount,asset_description,asset_type,transaction_type,comment,...,cap_gains,amount_low,amount_high,ticker2,name,sector,industry,longbusinesssummary,website,stock_description
0,02/24/2022,03/11/2022,Shelley M Capito,Spouse,NEE,1001 - 15000,"NextEra Energy, Inc. Common Stock",Stock,Sale (Partial),--,...,,1001,15000.0,NEE,"NextEra Energy, Inc.",Utilities,Utilities—Regulated Electric,"NextEra Energy, Inc., through its subsidiaries...",https://www.nexteraenergy.com,"Utilities, Utilities—Regulated Electric, NextE..."


In [8]:
df_trades['sector_industry'] = df_trades['sector'] + ' ' + df_trades['industry']

In [9]:
df_trades.head(1)

Unnamed: 0,transaction_date,disclosure_date,politician,owner,ticker,amount,asset_description,asset_type,transaction_type,comment,...,amount_low,amount_high,ticker2,name,sector,industry,longbusinesssummary,website,stock_description,sector_industry
0,02/24/2022,03/11/2022,Shelley M Capito,Spouse,NEE,1001 - 15000,"NextEra Energy, Inc. Common Stock",Stock,Sale (Partial),--,...,1001,15000.0,NEE,"NextEra Energy, Inc.",Utilities,Utilities—Regulated Electric,"NextEra Energy, Inc., through its subsidiaries...",https://www.nexteraenergy.com,"Utilities, Utilities—Regulated Electric, NextE...",Utilities Utilities—Regulated Electric


Read in Congress Committee Descriptions Extracted from Committee.gov sites (with a few exceptions)

In [10]:
df_subcomittees = pd.read_csv('..//data//handmade//congress_commitee_descriptions.csv')

In [11]:
df_subcomittees.head(1)

Unnamed: 0,committee,committee_fullname,committee_description,website
0,SSFR09,Africa and Global Health Policy,The subcommittee deals with all matters concer...,https://www.foreign.senate.gov/download/2021-1...


In [12]:
df_subcomittees['committee_description2'] = df_subcomittees['committee_fullname'] + ' ' + df_subcomittees['committee_description']

In [13]:
df_subcomittees.head(1)

Unnamed: 0,committee,committee_fullname,committee_description,website,committee_description2
0,SSFR09,Africa and Global Health Policy,The subcommittee deals with all matters concer...,https://www.foreign.senate.gov/download/2021-1...,Africa and Global Health Policy The subcommitt...


-----

### Cleaning the Stock Description Columns

In [14]:
df_trades['stock_description2'] = df_trades.stock_description
# df.head(1)

In [15]:
df_trades.stock_description2 = df_trades.stock_description2.astype(str).str.lower()

In [16]:
df_trades.stock_description2.head(1)

0    utilities, utilities—regulated electric, nexte...
Name: stock_description2, dtype: object

In [17]:
df_trades.sector_industry = df_trades.sector_industry.astype(str).str.lower()

In [18]:
df_trades.sector_industry.head(1)

0    utilities utilities—regulated electric
Name: sector_industry, dtype: object

##### Removing Punctuation from description

In [19]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [20]:
df_trades.stock_description2 = df_trades.stock_description2.str.replace('[{}]'.format(string.punctuation), '')

  df_trades.stock_description2 = df_trades.stock_description2.str.replace('[{}]'.format(string.punctuation), '')


In [21]:
df_trades.stock_description2.head(1)

0    utilities utilities—regulated electric nextera...
Name: stock_description2, dtype: object

In [24]:
df_trades.stock_description2 = df_trades.stock_description2.replace('—', ' ')

In [25]:
df_trades.stock_description2.head(1)

0    utilities utilities—regulated electric nextera...
Name: stock_description2, dtype: object

In [22]:
df_trades.sector_industry = df_trades.sector_industry.str.replace('[{}]'.format(string.punctuation), '')

  df_trades.sector_industry = df_trades.sector_industry.str.replace('[{}]'.format(string.punctuation), '')


In [23]:
df_trades.sector_industry.head(1)

0    utilities utilities—regulated electric
Name: sector_industry, dtype: object

##### Removing "Stop Words"

In [26]:
stops = set(stopwords.words('english'))
print(stops)

{'have', 'she', 'their', 'won', "you'd", 'doesn', 'it', 'mightn', 'with', 'had', 'yourself', 'where', 'while', 'all', 'out', 'd', 'until', 'or', 'am', 'ma', "you're", 'haven', 've', 'then', 'how', 'this', 'are', 'under', 'those', 'me', 'same', 'above', 'yourselves', 'themselves', 'shouldn', 'ours', 't', 'wasn', "that'll", 'most', 'be', "don't", 'he', 'shan', 's', 'they', 'between', 'from', 'we', 'its', 'did', 'over', 'at', 'because', 'down', 'more', "should've", 'if', 'by', 'only', 'hadn', 'than', 'being', 'his', "haven't", 'both', "needn't", 'himself', 'been', 'in', 're', 'before', 'but', 'on', 'further', 'hasn', 'each', "isn't", 'wouldn', 'other', 'will', 'so', 'here', 'yours', 'm', 'i', 'whom', 'just', 'any', "didn't", "mustn't", 'can', 'for', 'y', 'should', 'needn', 'which', "weren't", 'there', 'didn', 'has', 'my', 'not', 'having', "it's", 'hers', 'below', 'up', 'theirs', 'after', 'him', 'into', 'about', 'now', 'too', 'through', 'to', 'mustn', 'during', 'myself', 'few', 'that', 'do

In [27]:
df_trades['stock_description3'] = df_trades.stock_description2.apply(lambda x: ' '.join([word for word in x.split() if word not in (stops)]))

In [28]:
df_trades.stock_description3.head(1)

0    utilities utilities—regulated electric nextera...
Name: stock_description3, dtype: object

In [29]:
# df.head(2)

##### Removing numbers/digits from descriptions

In [30]:
df_trades.stock_description3 = df_trades.stock_description3.str.replace('\d+', '')

  df_trades.stock_description3 = df_trades.stock_description3.str.replace('\d+', '')


In [31]:
df_trades.stock_description3.head(1)

0    utilities utilities—regulated electric nextera...
Name: stock_description3, dtype: object

In [32]:
# df_trades

In [33]:
# stops2 = stopwords.words('english')
# print(stops2)

In [34]:
# stops2 = stopwords.words('english')

In [35]:
# Consider the word: Antinationalist, Morpheme
# https://www.analyticsvidhya.com/blog/2021/06/part-3-step-by-step-guide-to-nlp-text-cleaning-and-preprocessing/

Save a Copy

In [36]:
# df_trades.to_csv('..//data//processed//stock_watchers_w_yfinance_edited_03_13_2022.csv', index = False)

----

### Cleaning the Committee Description Column

In [37]:
df_subcomittees.committee_description2 = df_subcomittees.committee_description2.astype(str).str.lower()

In [38]:
df_subcomittees.committee_description2.head(1)

0    africa and global health policy the subcommitt...
Name: committee_description2, dtype: object

##### Removing Punctuation from description

In [39]:
df_subcomittees.committee_description2 = df_subcomittees.committee_description2.str.replace('[{}]'.format(string.punctuation), '')

  df_subcomittees.committee_description2 = df_subcomittees.committee_description2.str.replace('[{}]'.format(string.punctuation), '')


In [40]:
df_subcomittees.committee_description2.head(2)

0    africa and global health policy the subcommitt...
1    africa global health and global human rights t...
Name: committee_description2, dtype: object

##### Removing "Stop Words"

In [41]:
df_subcomittees['committee_description3'] = df_subcomittees.committee_description2.apply(lambda x: ' '.join([word for word in x.split() if word not in (stops)]))

In [42]:
df_subcomittees.committee_description3.head(1)

0    africa global health policy subcommittee deals...
Name: committee_description3, dtype: object

In [43]:
# df.stock_description2

##### Removing numbers/digits from descriptions

In [44]:
df_subcomittees.committee_description3 = df_subcomittees.committee_description3.str.replace('\d+', '')

  df_subcomittees.committee_description3 = df_subcomittees.committee_description3.str.replace('\d+', '')


In [45]:
df_subcomittees.committee_description3.head(1)

0    africa global health policy subcommittee deals...
Name: committee_description3, dtype: object

In [46]:
# df_subcomittees.head(1)

##### Step 3

In [47]:
stops = set(stopwords.words('english'))
# print(stops)

In [48]:
# pat = r'\b(?:{})\b'.format('|'.join(stop))
# test['tweet_without_stopwords'] = test['tweet'].str.replace(pat, '')
# test['tweet_without_stopwords'] = test['tweet_without_stopwords'].str.replace(r'\s+', ' ')
# # Same results.
# # 0              I love car
# # 1       This view amazing
# # 2    I feel great morning
# # 3       I excited concert
# # 4          He best friend

Save a copy

In [49]:
# df_subcomittees.to_csv('..//data//handmade//congress_commitee_descriptions_edited_03_13_22.csv', index = False)

-----

### Merging with algorithm on stock and committee descriptions

In [50]:
# merged = empty_df
# for trade in df_trades:
#     for each comittee in df_comitee:
#       # match ticker descp to comittee descp OR ANY OTHER ALGORITHM
#         flt_s_score = similar(ticker_desc, comittee_desc)
#         if flt_s_score > 0.6:
#             add this trade + comite descp to merged
# drop the rows where member comittee != ticker_comitee
# do analysis (edited) 

In [51]:
# relevant columns

# df.stock_description3
# df_subcomittees.committee_description3

In [52]:
empty_df = pd.DataFrame()   
empty_df.empty

True

In [172]:
#partial ratio
def similar(a,b):
    return fuzz.partial_ratio(a, b)

In [54]:
similar("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

91

In [175]:
# for column in df_trades[0:5]:
#     print(df_trades[column].values)

In [57]:
for trade in df_trades.stock_description3[0:1]:
    print(trade)

utilities utilities—regulated electric nextera energy inc subsidiaries generates transmits distributes sells electric power retail wholesale customers north america company generates electricity wind solar nuclear fossil fuel coal natural gas facilities also develops constructs operates longterm contracted assets focus renewable generation facilities electric transmission facilities battery storage projects owns develops constructs manages operates electric generation facilities wholesale energy markets december   company operated approximately  megawatts net generating capacity serves approximately  million people approximately  million customer accounts east lower west coasts florida approximately  circuit miles transmission distribution lines  substations company formerly known fpl group inc changed name nextera energy inc  nextera energy inc founded  headquartered juno beach florida


In [58]:
ls_stock_description3 = df_trades.stock_description3.values.tolist()

In [59]:
for committee in df_subcomittees.committee_description3[0:1]:
    print(committee)

africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response


In [60]:
ls_committee_description3 = df_subcomittees.committee_description3.values.tolist()

In [None]:
merged = empty_df
for trade in df_trades[0:20]:
    print(trade)
    for committee in df_subcomittees[0:20]:
        print(committee)
      # match ticker description to committee description with A TBD ALGORITHM
        flt_s_score = similar(df_trades.sector_industry, df_subcomittees.committee_description3)
        if flt_s_score > 25:
            print(flt_s_score)
#             add this trade + comite descp to merged


In [65]:
# drop the rows where member comittee != ticker_comittee


In [66]:
# do analysis (edited) 

-----

### Fine-tuning the Algorithms

Exploring types of algorithms on the columns

##### Simple Ratios

In [67]:
def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

In [68]:
# high match
similar("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

0.9090909090909091

In [69]:
# should match
similar("utilities utilities—regulated electric nextera energy inc subsidiaries generates transmits distributes sells electric power retail wholesale customers north america company generates electricity wind solar nuclear fossil fuel coal natural gas facilities also develops constructs operates longterm contracted assets focus renewable generation facilities electric transmission facilities battery storage projects owns develops constructs manages operates electric generation facilities wholesale energy markets december   company operated approximately  megawatts net generating capacity serves approximately  million people approximately  million customer accounts east lower west coasts florida approximately  circuit miles transmission distribution lines  substations company formerly known fpl group inc changed name nextera energy inc  nextera energy inc founded  headquartered juno beach florida", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

0.017437961099932932

In [70]:
# should match
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

0.01625072547881602

In [71]:
# should match using industry/sector and committee description
similar("technology consumer electronics", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

0.05188679245283019

In [72]:
# should match using industry/sector and committee description
similar("utilities utilities—regulated electric", "All matters relating to energy research, development, and demonstration projects therefor; commercial application of energy technology; Department of Energy research, development, and demonstration programs; Department of Energy laboratories; Department of Energy science activities; Department of Energy international research, development, and demonstration projects; energy supply activities; nuclear, solar, and renewable energy, and other advanced energy technologies; uranium supply and enrichment, and Department of Energy waste management; Department of Energy environmental management research, development, and demonstration; fossil energy research and development; clean coal technology; energy conservation research and development, including building performance, alternate fuels, distributed power systems, and industrial process improvements; pipeline research, development, and demonstration projects; energy standards; other appropriate matters as referred by the Chair; and relevant oversight.") 

0.005719733079122974

In [73]:
# shouldn't match 
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

0.015608740894901144

In [74]:
def similar(a,b):
    return fuzz.ratio(a, b)

In [75]:
# high match
similar("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

91

In [76]:
# should match
similar("utilities utilities—regulated electric nextera energy inc subsidiaries generates transmits distributes sells electric power retail wholesale customers north america company generates electricity wind solar nuclear fossil fuel coal natural gas facilities also develops constructs operates longterm contracted assets focus renewable generation facilities electric transmission facilities battery storage projects owns develops constructs manages operates electric generation facilities wholesale energy markets december   company operated approximately  megawatts net generating capacity serves approximately  million people approximately  million customer accounts east lower west coasts florida approximately  circuit miles transmission distribution lines  substations company formerly known fpl group inc changed name nextera energy inc  nextera energy inc founded  headquartered juno beach florida", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

2

In [77]:
# should match
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

2

In [78]:
# should match
similar("technology consumer electronics", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

5

In [79]:
# should match
similar("utilities utilities—regulated electric", "All matters relating to energy research, development, and demonstration projects therefor; commercial application of energy technology; Department of Energy research, development, and demonstration programs; Department of Energy laboratories; Department of Energy science activities; Department of Energy international research, development, and demonstration projects; energy supply activities; nuclear, solar, and renewable energy, and other advanced energy technologies; uranium supply and enrichment, and Department of Energy waste management; Department of Energy environmental management research, development, and demonstration; fossil energy research and development; clean coal technology; energy conservation research and development, including building performance, alternate fuels, distributed power systems, and industrial process improvements; pipeline research, development, and demonstration projects; energy standards; other appropriate matters as referred by the Chair; and relevant oversight.") 

1

In [80]:
# shouldn't match 
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

2

In [81]:
# shouldn't match 
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

2

In [82]:
def similar(a,b):
    return fuzz.token_set_ratio(a, b)

In [83]:
# high match
similar("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

100

In [84]:
# should match
similar("utilities utilities—regulated electric nextera energy inc subsidiaries generates transmits distributes sells electric power retail wholesale customers north america company generates electricity wind solar nuclear fossil fuel coal natural gas facilities also develops constructs operates longterm contracted assets focus renewable generation facilities electric transmission facilities battery storage projects owns develops constructs manages operates electric generation facilities wholesale energy markets december   company operated approximately  megawatts net generating capacity serves approximately  million people approximately  million customer accounts east lower west coasts florida approximately  circuit miles transmission distribution lines  substations company formerly known fpl group inc changed name nextera energy inc  nextera energy inc founded  headquartered juno beach florida", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

4

In [85]:
# should match using industry/sector and committee description
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

17

In [86]:
# should match using industry/sector and committee description
similar("technology consumer electronics", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

49

In [87]:
# should match
similar("utilities utilities—regulated electric", "All matters relating to energy research, development, and demonstration projects therefor; commercial application of energy technology; Department of Energy research, development, and demonstration programs; Department of Energy laboratories; Department of Energy science activities; Department of Energy international research, development, and demonstration projects; energy supply activities; nuclear, solar, and renewable energy, and other advanced energy technologies; uranium supply and enrichment, and Department of Energy waste management; Department of Energy environmental management research, development, and demonstration; fossil energy research and development; clean coal technology; energy conservation research and development, including building performance, alternate fuels, distributed power systems, and industrial process improvements; pipeline research, development, and demonstration projects; energy standards; other appropriate matters as referred by the Chair; and relevant oversight.") 

0

In [88]:
# shouldn't match 
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

9

##### Partial Ratio

In [89]:
def similar(a,b):
    return fuzz.partial_ratio(a, b)

In [90]:
# high match
similar("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

91

In [91]:
# should match
similar("utilities utilities—regulated electric nextera energy inc subsidiaries generates transmits distributes sells electric power retail wholesale customers north america company generates electricity wind solar nuclear fossil fuel coal natural gas facilities also develops constructs operates longterm contracted assets focus renewable generation facilities electric transmission facilities battery storage projects owns develops constructs manages operates electric generation facilities wholesale energy markets december   company operated approximately  megawatts net generating capacity serves approximately  million people approximately  million customer accounts east lower west coasts florida approximately  circuit miles transmission distribution lines  substations company formerly known fpl group inc changed name nextera energy inc  nextera energy inc founded  headquartered juno beach florida", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

14

In [92]:
# should match
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

4

!!!! High ratio with industry/sector and committee description

In [93]:
# should match using industry/sector and committee description
similar("technology consumer electronics", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

61

In [94]:
# should match using industry/sector and committee description
similar("utilities utilities—regulated electric", "All matters relating to energy research, development, and demonstration projects therefor; commercial application of energy technology; Department of Energy research, development, and demonstration programs; Department of Energy laboratories; Department of Energy science activities; Department of Energy international research, development, and demonstration projects; energy supply activities; nuclear, solar, and renewable energy, and other advanced energy technologies; uranium supply and enrichment, and Department of Energy waste management; Department of Energy environmental management research, development, and demonstration; fossil energy research and development; clean coal technology; energy conservation research and development, including building performance, alternate fuels, distributed power systems, and industrial process improvements; pipeline research, development, and demonstration projects; energy standards; other appropriate matters as referred by the Chair; and relevant oversight.") 

45

In [95]:
# shouldn't match 
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

18

##### Token Sort Ratio

In [96]:
def similar(a,b):
    return fuzz.token_sort_ratio(a, b)

In [97]:
# high match
similar("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

100

In [98]:
# should match
similar("utilities utilities—regulated electric nextera energy inc subsidiaries generates transmits distributes sells electric power retail wholesale customers north america company generates electricity wind solar nuclear fossil fuel coal natural gas facilities also develops constructs operates longterm contracted assets focus renewable generation facilities electric transmission facilities battery storage projects owns develops constructs manages operates electric generation facilities wholesale energy markets december   company operated approximately  megawatts net generating capacity serves approximately  million people approximately  million customer accounts east lower west coasts florida approximately  circuit miles transmission distribution lines  substations company formerly known fpl group inc changed name nextera energy inc  nextera energy inc founded  headquartered juno beach florida", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

2

In [99]:
# should match
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

2

In [100]:
# should match
similar("technology consumer electronics", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

1

In [101]:
# should match
similar("utilities utilities—regulated electric", "All matters relating to energy research, development, and demonstration projects therefor; commercial application of energy technology; Department of Energy research, development, and demonstration programs; Department of Energy laboratories; Department of Energy science activities; Department of Energy international research, development, and demonstration projects; energy supply activities; nuclear, solar, and renewable energy, and other advanced energy technologies; uranium supply and enrichment, and Department of Energy waste management; Department of Energy environmental management research, development, and demonstration; fossil energy research and development; clean coal technology; energy conservation research and development, including building performance, alternate fuels, distributed power systems, and industrial process improvements; pipeline research, development, and demonstration projects; energy standards; other appropriate matters as referred by the Chair; and relevant oversight.") 

1

In [102]:
# shouldn't match 
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

1

##### Token Set

In [103]:
def similar(a,b):
    return fuzz.token_set_ratio(a, b)

In [104]:
# high match
similar("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

100

In [105]:
# should match
similar("utilities utilities—regulated electric nextera energy inc subsidiaries generates transmits distributes sells electric power retail wholesale customers north america company generates electricity wind solar nuclear fossil fuel coal natural gas facilities also develops constructs operates longterm contracted assets focus renewable generation facilities electric transmission facilities battery storage projects owns develops constructs manages operates electric generation facilities wholesale energy markets december   company operated approximately  megawatts net generating capacity serves approximately  million people approximately  million customer accounts east lower west coasts florida approximately  circuit miles transmission distribution lines  substations company formerly known fpl group inc changed name nextera energy inc  nextera energy inc founded  headquartered juno beach florida", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

4

In [106]:
# should match
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

17

In [107]:
# should match
similar("technology consumer electronics", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

49

In [108]:
# should match
similar("utilities utilities—regulated electric", "All matters relating to energy research, development, and demonstration projects therefor; commercial application of energy technology; Department of Energy research, development, and demonstration programs; Department of Energy laboratories; Department of Energy science activities; Department of Energy international research, development, and demonstration projects; energy supply activities; nuclear, solar, and renewable energy, and other advanced energy technologies; uranium supply and enrichment, and Department of Energy waste management; Department of Energy environmental management research, development, and demonstration; fossil energy research and development; clean coal technology; energy conservation research and development, including building performance, alternate fuels, distributed power systems, and industrial process improvements; pipeline research, development, and demonstration projects; energy standards; other appropriate matters as referred by the Chair; and relevant oversight.") 

0

In [109]:
# shouldn't match 
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

9

##### Hamming Distance (finding the places where the strings vary)

In [110]:
textdistance.hamming.normalized_similarity('arrow', 'arow')

0.4

In [111]:
#the edit distance is 1 for only the difference being one letter different
def similar(a,b):
    return textdistance.hamming(a, b)

In [112]:
# high match
similar("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

2

In [113]:
# should match
similar("utilities utilities—regulated electric nextera energy inc subsidiaries generates transmits distributes sells electric power retail wholesale customers north america company generates electricity wind solar nuclear fossil fuel coal natural gas facilities also develops constructs operates longterm contracted assets focus renewable generation facilities electric transmission facilities battery storage projects owns develops constructs manages operates electric generation facilities wholesale energy markets december   company operated approximately  megawatts net generating capacity serves approximately  million people approximately  million customer accounts east lower west coasts florida approximately  circuit miles transmission distribution lines  substations company formerly known fpl group inc changed name nextera energy inc  nextera energy inc founded  headquartered juno beach florida", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

855

In [114]:
# should match
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

1305

In [115]:
# should match
similar("technology consumer electronics", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

391

In [116]:
# should match
similar("utilities utilities—regulated electric", "All matters relating to energy research, development, and demonstration projects therefor; commercial application of energy technology; Department of Energy research, development, and demonstration programs; Department of Energy laboratories; Department of Energy science activities; Department of Energy international research, development, and demonstration projects; energy supply activities; nuclear, solar, and renewable energy, and other advanced energy technologies; uranium supply and enrichment, and Department of Energy waste management; Department of Energy environmental management research, development, and demonstration; fossil energy research and development; clean coal technology; energy conservation research and development, including building performance, alternate fuels, distributed power systems, and industrial process improvements; pipeline research, development, and demonstration projects; energy standards; other appropriate matters as referred by the Chair; and relevant oversight.") 

1009

In [117]:
# shouldn't match 
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

1295

In [118]:
#75% similar between text and test
def similar(a,b):
    return textdistance.hamming.normalized_similarity(a, b)

In [119]:
# high match
similar("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

0.9090909090909091

In [120]:
# should match
similar("utilities utilities—regulated electric nextera energy inc subsidiaries generates transmits distributes sells electric power retail wholesale customers north america company generates electricity wind solar nuclear fossil fuel coal natural gas facilities also develops constructs operates longterm contracted assets focus renewable generation facilities electric transmission facilities battery storage projects owns develops constructs manages operates electric generation facilities wholesale energy markets december   company operated approximately  megawatts net generating capacity serves approximately  million people approximately  million customer accounts east lower west coasts florida approximately  circuit miles transmission distribution lines  substations company formerly known fpl group inc changed name nextera energy inc  nextera energy inc founded  headquartered juno beach florida", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

0.04894327030033374

In [121]:
# should match
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

0.018796992481203034

In [122]:
# should match
similar("technology consumer electronics", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

0.0050890585241730735

In [123]:
# should match
similar("utilities utilities—regulated electric", "All matters relating to energy research, development, and demonstration projects therefor; commercial application of energy technology; Department of Energy research, development, and demonstration programs; Department of Energy laboratories; Department of Energy science activities; Department of Energy international research, development, and demonstration projects; energy supply activities; nuclear, solar, and renewable energy, and other advanced energy technologies; uranium supply and enrichment, and Department of Energy waste management; Department of Energy environmental management research, development, and demonstration; fossil energy research and development; clean coal technology; energy conservation research and development, including building performance, alternate fuels, distributed power systems, and industrial process improvements; pipeline research, development, and demonstration projects; energy standards; other appropriate matters as referred by the Chair; and relevant oversight.") 

0.001978239366963397

In [124]:
# shouldn't match 
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

0.02631578947368418

##### Levenshtein Distance

In [125]:
#number of edits it will take to transform one to the other
textdistance.levenshtein('arrow', 'arow')

1

In [126]:
#number of edits it will take to transform one to the other
def similar(a,b):
    return textdistance.levenshtein(a, b)

In [127]:
# high match
similar("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

2

In [128]:
# should match
similar("utilities utilities—regulated electric nextera energy inc subsidiaries generates transmits distributes sells electric power retail wholesale customers north america company generates electricity wind solar nuclear fossil fuel coal natural gas facilities also develops constructs operates longterm contracted assets focus renewable generation facilities electric transmission facilities battery storage projects owns develops constructs manages operates electric generation facilities wholesale energy markets december   company operated approximately  megawatts net generating capacity serves approximately  million people approximately  million customer accounts east lower west coasts florida approximately  circuit miles transmission distribution lines  substations company formerly known fpl group inc changed name nextera energy inc  nextera energy inc founded  headquartered juno beach florida", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

639

In [129]:
# should match
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

1069

In [130]:
# should match
similar("technology consumer electronics", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

362

In [131]:
# should match
similar("utilities utilities—regulated electric", "All matters relating to energy research, development, and demonstration projects therefor; commercial application of energy technology; Department of Energy research, development, and demonstration programs; Department of Energy laboratories; Department of Energy science activities; Department of Energy international research, development, and demonstration projects; energy supply activities; nuclear, solar, and renewable energy, and other advanced energy technologies; uranium supply and enrichment, and Department of Energy waste management; Department of Energy environmental management research, development, and demonstration; fossil energy research and development; clean coal technology; energy conservation research and development, including building performance, alternate fuels, distributed power systems, and industrial process improvements; pipeline research, development, and demonstration projects; energy standards; other appropriate matters as referred by the Chair; and relevant oversight.") 

975

In [132]:
# shouldn't match 
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

1004

In [133]:
textdistance.levenshtein.normalized_similarity('arrow', 'arow')

0.8

In [134]:
def similar(a,b):
    return textdistance.levenshtein.normalized_similarity(a, b)

In [135]:
# high match
similar("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

0.9090909090909091

In [136]:
# should match
similar("utilities utilities—regulated electric nextera energy inc subsidiaries generates transmits distributes sells electric power retail wholesale customers north america company generates electricity wind solar nuclear fossil fuel coal natural gas facilities also develops constructs operates longterm contracted assets focus renewable generation facilities electric transmission facilities battery storage projects owns develops constructs manages operates electric generation facilities wholesale energy markets december   company operated approximately  megawatts net generating capacity serves approximately  million people approximately  million customer accounts east lower west coasts florida approximately  circuit miles transmission distribution lines  substations company formerly known fpl group inc changed name nextera energy inc  nextera energy inc founded  headquartered juno beach florida", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

0.289210233592881

In [137]:
# should match
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

0.19624060150375944

In [138]:
# should match
similar("technology consumer electronics", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

0.07888040712468192

In [139]:
# should match
similar("utilities utilities—regulated electric", "All matters relating to energy research, development, and demonstration projects therefor; commercial application of energy technology; Department of Energy research, development, and demonstration programs; Department of Energy laboratories; Department of Energy science activities; Department of Energy international research, development, and demonstration projects; energy supply activities; nuclear, solar, and renewable energy, and other advanced energy technologies; uranium supply and enrichment, and Department of Energy waste management; Department of Energy environmental management research, development, and demonstration; fossil energy research and development; clean coal technology; energy conservation research and development, including building performance, alternate fuels, distributed power systems, and industrial process improvements; pipeline research, development, and demonstration projects; energy standards; other appropriate matters as referred by the Chair; and relevant oversight.") 

0.035608308605341255

In [140]:
# shouldn't match 
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

0.24511278195488717

##### Jaccard Index 

(find the number of common tokens and divide it by the total number of unique tokens)

"We first tokenize the string by default space delimiter, to make words in the strings as tokens. Then we compute the similarity score." 

In [141]:
tokens_1 = "hello world".split()
tokens_2 = "world hello".split()

In [142]:
textdistance.jaccard(tokens_1 , tokens_2)

1.0

In [143]:
tokens_1 = "hello new world".split()
tokens_2 = "hello world".split()

In [144]:
textdistance.jaccard(tokens_1 , tokens_2)

0.6666666666666666

In [145]:
def similar(a,b):
    return textdistance.jaccard(a, b)

In [146]:
# high match
similar("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

1.0

In [147]:
# should match
similar("utilities utilities—regulated electric nextera energy inc subsidiaries generates transmits distributes sells electric power retail wholesale customers north america company generates electricity wind solar nuclear fossil fuel coal natural gas facilities also develops constructs operates longterm contracted assets focus renewable generation facilities electric transmission facilities battery storage projects owns develops constructs manages operates electric generation facilities wholesale energy markets december   company operated approximately  megawatts net generating capacity serves approximately  million people approximately  million customer accounts east lower west coasts florida approximately  circuit miles transmission distribution lines  substations company formerly known fpl group inc changed name nextera energy inc  nextera energy inc founded  headquartered juno beach florida", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

0.6493362831858407

In [148]:
# should match
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

0.28967065868263475

In [149]:
# should match
similar("technology consumer electronics", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

0.07888040712468193

In [150]:
# should match
similar("utilities utilities—regulated electric", "All matters relating to energy research, development, and demonstration projects therefor; commercial application of energy technology; Department of Energy research, development, and demonstration programs; Department of Energy laboratories; Department of Energy science activities; Department of Energy international research, development, and demonstration projects; energy supply activities; nuclear, solar, and renewable energy, and other advanced energy technologies; uranium supply and enrichment, and Department of Energy waste management; Department of Energy environmental management research, development, and demonstration; fossil energy research and development; clean coal technology; energy conservation research and development, including building performance, alternate fuels, distributed power systems, and industrial process improvements; pipeline research, development, and demonstration projects; energy standards; other appropriate matters as referred by the Chair; and relevant oversight.") 

0.036561264822134384

In [151]:
# shouldn't match 
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

0.4397003745318352

##### Sorensen-Dice

"Falling under set similarity, the logic is to find the common tokens, and divide it by the total number of tokens present by combining both sets." 

In [152]:
tokens_1 = "hello world".split()
tokens_2 = "world hello".split()

In [153]:
textdistance.sorensen(tokens_1 , tokens_2)

1.0

In [154]:
tokens_1 = "hello new world".split()
tokens_2 = "hello world".split()

In [155]:
textdistance.sorensen(tokens_1 , tokens_2)

0.8

In [156]:
def similar(a,b):
    return textdistance.sorensen(a, b)

In [157]:
# high match
similar("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

1.0

In [158]:
# should match
similar("utilities utilities—regulated electric nextera energy inc subsidiaries generates transmits distributes sells electric power retail wholesale customers north america company generates electricity wind solar nuclear fossil fuel coal natural gas facilities also develops constructs operates longterm contracted assets focus renewable generation facilities electric transmission facilities battery storage projects owns develops constructs manages operates electric generation facilities wholesale energy markets december   company operated approximately  megawatts net generating capacity serves approximately  million people approximately  million customer accounts east lower west coasts florida approximately  circuit miles transmission distribution lines  substations company formerly known fpl group inc changed name nextera energy inc  nextera energy inc founded  headquartered juno beach florida", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

0.7873910127431254

In [159]:
# should match
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

0.4492164828786999

In [160]:
# should match
similar("technology consumer electronics", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

0.14622641509433962

In [161]:
# should match
similar("utilities utilities—regulated electric", "All matters relating to energy research, development, and demonstration projects therefor; commercial application of energy technology; Department of Energy research, development, and demonstration programs; Department of Energy laboratories; Department of Energy science activities; Department of Energy international research, development, and demonstration projects; energy supply activities; nuclear, solar, and renewable energy, and other advanced energy technologies; uranium supply and enrichment, and Department of Energy waste management; Department of Energy environmental management research, development, and demonstration; fossil energy research and development; clean coal technology; energy conservation research and development, including building performance, alternate fuels, distributed power systems, and industrial process improvements; pipeline research, development, and demonstration projects; energy standards; other appropriate matters as referred by the Chair; and relevant oversight.") 

0.07054337464251668

In [162]:
# shouldn't match 
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

0.6108220603537982

##### Ratcliff-Obershelp similarity

In [163]:
string1, string2 = "i am going home", "gone home"

In [164]:
textdistance.ratcliff_obershelp(string1, string2)

0.6666666666666666

In [165]:
def similar(a,b):
    return textdistance.ratcliff_obershelp(a, b)

In [166]:
# high match
similar("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

0.9090909090909091

In [167]:
# should match 
similar("utilities utilities—regulated electric nextera energy inc subsidiaries generates transmits distributes sells electric power retail wholesale customers north america company generates electricity wind solar nuclear fossil fuel coal natural gas facilities also develops constructs operates longterm contracted assets focus renewable generation facilities electric transmission facilities battery storage projects owns develops constructs manages operates electric generation facilities wholesale energy markets december   company operated approximately  megawatts net generating capacity serves approximately  million people approximately  million customer accounts east lower west coasts florida approximately  circuit miles transmission distribution lines  substations company formerly known fpl group inc changed name nextera energy inc  nextera energy inc founded  headquartered juno beach florida", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

0.12206572769953052

In [168]:
# should match
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

0.06848520023215322

In [169]:
# should match
similar("technology consumer electronics", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

0.09433962264150944

In [170]:
# should match 
similar("utilities utilities—regulated electric", "All matters relating to energy research, development, and demonstration projects therefor; commercial application of energy technology; Department of Energy research, development, and demonstration programs; Department of Energy laboratories; Department of Energy science activities; Department of Energy international research, development, and demonstration projects; energy supply activities; nuclear, solar, and renewable energy, and other advanced energy technologies; uranium supply and enrichment, and Department of Energy waste management; Department of Energy environmental management research, development, and demonstration; fossil energy research and development; clean coal technology; energy conservation research and development, including building performance, alternate fuels, distributed power systems, and industrial process improvements; pipeline research, development, and demonstration projects; energy standards; other appropriate matters as referred by the Chair; and relevant oversight.") 

0.06291706387035272

In [171]:
# shouldn't match 
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

0.18106139438085328

----

-----

Data Notes:

1. combine committee fullname with committee description in new column
2. remove duplicate words in each description
3. (agriculture vs. agricultural)

5. remove numbers and words, punctuation 

* a 
* includes
* deals 
* shall
* jurisdiction
* policy 
* member
* ranking
* 


