# Congressional Committees w/ Stock Trades w/ String Algorithms

Exploring matching Congressional stock trade data and stock descriptions with Congressional Committee descriptions

reference: https://pythonspot.com/nltk-stop-words/

----

#### Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML



In [2]:
# pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)
# pd.set_option('display.width', None)
# pd.set_option('display.max_colwidth', None)
# pd.set_option('max_seq_item', None)

In [3]:
#string matching imports
from difflib import SequenceMatcher
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from thefuzz import fuzz
from thefuzz import process
import textdistance
import jaro
import jellyfish



In [4]:
#natural language processing imports
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.corpus import treebank
import string

In [5]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('treebank')

[nltk_data] Downloading package punkt to /Users/sm/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/sm/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/sm/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/sm/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /Users/sm/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package treebank to /Users/sm/nltk_data...
[nltk_data]   Package treebank is already up-to-date!


True

----

### Reading Dataframes

Read in Stock trades by Congress members with Yahoo finance stock info

In [6]:
df_trades = pd.read_csv("..//data//processed//stock_watchers_w_yfinance_03_12_2022.csv", encoding="utf-8")

In [7]:
df_trades.head(1)

Unnamed: 0,transaction_date,disclosure_date,politician,owner,ticker,amount,asset_description,asset_type,transaction_type,comment,...,cap_gains,amount_low,amount_high,ticker2,name,sector,industry,longbusinesssummary,website,stock_description
0,02/24/2022,03/11/2022,Shelley M Capito,Spouse,NEE,1001 - 15000,"NextEra Energy, Inc. Common Stock",Stock,Sale (Partial),--,...,,1001,15000.0,NEE,"NextEra Energy, Inc.",Utilities,Utilities—Regulated Electric,"NextEra Energy, Inc., through its subsidiaries...",https://www.nexteraenergy.com,"Utilities, Utilities—Regulated Electric, NextE..."


In [8]:
df_trades['sector_industry'] = df_trades['sector'] + ' ' + df_trades['industry']

In [9]:
df_trades.head(1)

Unnamed: 0,transaction_date,disclosure_date,politician,owner,ticker,amount,asset_description,asset_type,transaction_type,comment,...,amount_low,amount_high,ticker2,name,sector,industry,longbusinesssummary,website,stock_description,sector_industry
0,02/24/2022,03/11/2022,Shelley M Capito,Spouse,NEE,1001 - 15000,"NextEra Energy, Inc. Common Stock",Stock,Sale (Partial),--,...,1001,15000.0,NEE,"NextEra Energy, Inc.",Utilities,Utilities—Regulated Electric,"NextEra Energy, Inc., through its subsidiaries...",https://www.nexteraenergy.com,"Utilities, Utilities—Regulated Electric, NextE...",Utilities Utilities—Regulated Electric


In [10]:
# df_trades.columns

Read in Congress Committee Descriptions Extracted from Committee.gov sites (with a few exceptions)

In [11]:
df_subcomittees = pd.read_csv('..//data//handmade//congress_commitee_descriptions.csv')

In [12]:
df_subcomittees.head(1)

Unnamed: 0,committee,committee_fullname,committee_description,website
0,SSFR09,Africa and Global Health Policy,The subcommittee deals with all matters concer...,https://www.foreign.senate.gov/download/2021-1...


In [13]:
df_subcomittees['committee_description2'] = df_subcomittees['committee_fullname'] + ' ' + df_subcomittees['committee_description']

In [14]:
df_subcomittees.columns

Index(['committee', 'committee_fullname', 'committee_description', 'website',
       'committee_description2'],
      dtype='object')

In [15]:
df_subcomittees.head(1)

Unnamed: 0,committee,committee_fullname,committee_description,website,committee_description2
0,SSFR09,Africa and Global Health Policy,The subcommittee deals with all matters concer...,https://www.foreign.senate.gov/download/2021-1...,Africa and Global Health Policy The subcommitt...


Read in Congress Committee Assignments

In [16]:
df_committee_members = pd.read_csv("..//data//processed//congress_committees.csv", encoding="utf-8")

In [17]:
df_committee_members.head(2)

Unnamed: 0,committee,name,party,rank,bioguide
0,SSAF,Debbie Stabenow,majority,1,S000770
1,SSAF,Patrick J. Leahy,majority,2,L000174


In [18]:
df_committee_members.columns

Index(['committee', 'name', 'party', 'rank', 'bioguide'], dtype='object')

-----

### Cleaning the Stock Description Columns

In [19]:
df_trades['stock_description2'] = df_trades.stock_description
# df.head(1)

In [20]:
df_trades.stock_description2 = df_trades.stock_description2.astype(str).str.lower()

In [21]:
df_trades.stock_description2.head(1)

0    utilities, utilities—regulated electric, nexte...
Name: stock_description2, dtype: object

In [22]:
df_trades.sector_industry = df_trades.sector_industry.astype(str).str.lower()

In [23]:
df_trades.sector_industry.head(1)

0    utilities utilities—regulated electric
Name: sector_industry, dtype: object

Data Notes:

1. combine committee fullname with committee description in new column
2. remove duplicate words in each description
3. (agriculture vs. agricultural)
5. remove numbers and words, punctuation 

* a 
* includes
* deals 
* shall
* jurisdiction
* policy 
* member
* ranking
* 

-----

##### Removing Punctuation from description

In [24]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [25]:
df_trades.stock_description2 = df_trades.stock_description2.str.replace('[{}]'.format(string.punctuation), '')

  df_trades.stock_description2 = df_trades.stock_description2.str.replace('[{}]'.format(string.punctuation), '')


In [26]:
df_trades.stock_description2.head(1)

0    utilities utilities—regulated electric nextera...
Name: stock_description2, dtype: object

In [27]:
df_trades.stock_description2 = df_trades.stock_description2.replace('—', ' ')

In [28]:
df_trades.stock_description2.head(1)

0    utilities utilities—regulated electric nextera...
Name: stock_description2, dtype: object

In [29]:
df_trades.sector_industry = df_trades.sector_industry.str.replace('[{}]'.format(string.punctuation), '')

  df_trades.sector_industry = df_trades.sector_industry.str.replace('[{}]'.format(string.punctuation), '')


In [30]:
df_trades.sector_industry.head(1)

0    utilities utilities—regulated electric
Name: sector_industry, dtype: object

##### Removing "Stop Words"

In [31]:
stops = set(stopwords.words('english'))
print(stops)

{'we', "haven't", 'mightn', 'o', "mustn't", 'again', 'where', 'ourselves', 'not', 'just', 'me', "she's", 'who', 'same', 'been', "needn't", "mightn't", 'at', 'doesn', 'here', 'nor', 'about', 'of', "that'll", 'haven', 'shan', "hasn't", 'its', 'through', 'that', 'hadn', "should've", 'your', 'which', 're', 'an', 'ain', 'no', "don't", 'isn', 'those', 'any', "you're", 'there', 'my', 'd', 'wasn', 'but', 'will', 'i', 'can', 'once', 'should', 'mustn', 'both', 'myself', 'most', 'needn', 'weren', 'itself', 'himself', "wasn't", 'if', "wouldn't", 'herself', 'have', 'against', 'their', 'do', 'aren', 'was', 'very', 'couldn', 'are', 'other', "weren't", "hadn't", 'by', "couldn't", 'didn', 'up', 'shouldn', "doesn't", 'so', "didn't", 'for', 's', 'our', 'yourselves', 'until', 'won', 'off', 'll', 'her', 'were', 'then', 'down', 'themselves', 'some', 'when', 'is', 'and', "shouldn't", 'a', 'few', 'does', 'too', "won't", 'this', 'more', 'he', 'to', 'being', "you'll", 'yourself', "aren't", 'above', 'ma', 'she',

In [32]:
df_trades['stock_description3'] = df_trades.stock_description2.apply(lambda x: ' '.join([word for word in x.split() if word not in (stops)]))

In [33]:
df_trades.stock_description3.head(1)

0    utilities utilities—regulated electric nextera...
Name: stock_description3, dtype: object

In [34]:
# df.head(2)

##### Removing numbers/digits from descriptions

In [35]:
df_trades.stock_description3 = df_trades.stock_description3.str.replace('\d+', '')

  df_trades.stock_description3 = df_trades.stock_description3.str.replace('\d+', '')


In [36]:
df_trades.stock_description3.head(1)

0    utilities utilities—regulated electric nextera...
Name: stock_description3, dtype: object

In [37]:
# df_trades

In [38]:
# stops2 = stopwords.words('english')
# print(stops2)

In [39]:
# stops2 = stopwords.words('english')

In [40]:
# Consider the word: Antinationalist, Morpheme
# https://www.analyticsvidhya.com/blog/2021/06/part-3-step-by-step-guide-to-nlp-text-cleaning-and-preprocessing/

Save a Copy

In [41]:
# df_trades.to_csv('..//data//processed//stock_watchers_w_yfinance_edited_03_13_2022.csv', index = False)

----

### Cleaning the Committee Description Column

In [42]:
df_subcomittees.committee_description2 = df_subcomittees.committee_description2.astype(str).str.lower()

In [43]:
df_subcomittees.committee_description2.head(1)

0    africa and global health policy the subcommitt...
Name: committee_description2, dtype: object

##### Removing Punctuation from description

In [44]:
df_subcomittees.committee_description2 = df_subcomittees.committee_description2.str.replace('[{}]'.format(string.punctuation), '')

  df_subcomittees.committee_description2 = df_subcomittees.committee_description2.str.replace('[{}]'.format(string.punctuation), '')


In [45]:
df_subcomittees.committee_description2.head(2)

0    africa and global health policy the subcommitt...
1    africa global health and global human rights t...
Name: committee_description2, dtype: object

##### Removing "Stop Words"

In [46]:
df_subcomittees['committee_description3'] = df_subcomittees.committee_description2.apply(lambda x: ' '.join([word for word in x.split() if word not in (stops)]))

In [47]:
df_subcomittees.committee_description3.head(1)

0    africa global health policy subcommittee deals...
Name: committee_description3, dtype: object

In [48]:
# df.stock_description2

##### Removing numbers/digits from descriptions

In [49]:
df_subcomittees.committee_description3 = df_subcomittees.committee_description3.str.replace('\d+', '')

  df_subcomittees.committee_description3 = df_subcomittees.committee_description3.str.replace('\d+', '')


In [50]:
df_subcomittees.committee_description3.head(1)

0    africa global health policy subcommittee deals...
Name: committee_description3, dtype: object

In [51]:
# df_subcomittees.head(1)

##### Step 3

In [52]:
stops = set(stopwords.words('english'))
# print(stops)

In [53]:
# pat = r'\b(?:{})\b'.format('|'.join(stop))
# test['tweet_without_stopwords'] = test['tweet'].str.replace(pat, '')
# test['tweet_without_stopwords'] = test['tweet_without_stopwords'].str.replace(r'\s+', ' ')
# # Same results.
# # 0              I love car
# # 1       This view amazing
# # 2    I feel great morning
# # 3       I excited concert
# # 4          He best friend

Save a copy

In [54]:
# df_subcomittees.to_csv('..//data//handmade//congress_commitee_descriptions_edited_03_13_22.csv', index = False)

-----

### Merging with algorithm on stock and committee descriptions

In [55]:
# merged = empty_df
# for trade in df_trades:
#     for each comittee in df_comitee:
#       # match ticker descp to comittee descp OR ANY OTHER ALGORITHM
#         flt_s_score = similar(ticker_desc, comittee_desc)
#         if flt_s_score > 0.6:
#             add this trade + comite descp to merged
# drop the rows where member comittee != ticker_comitee
# do analysis (edited) 

In [56]:
# relevant columns

# df.stock_description3
# df_subcomittees.committee_description3

In [57]:
#define the algorithm being used
def similar(a,b):
    return fuzz.partial_ratio(a, b)

In [58]:
# checking that similar works
similar("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

91

In [59]:
# for column in df_trades[0:5]:
#     print(df_trades[column].values)

In [60]:
for trade in df_trades.stock_description3[0:1]:
    print(trade)

utilities utilities—regulated electric nextera energy inc subsidiaries generates transmits distributes sells electric power retail wholesale customers north america company generates electricity wind solar nuclear fossil fuel coal natural gas facilities also develops constructs operates longterm contracted assets focus renewable generation facilities electric transmission facilities battery storage projects owns develops constructs manages operates electric generation facilities wholesale energy markets december   company operated approximately  megawatts net generating capacity serves approximately  million people approximately  million customer accounts east lower west coasts florida approximately  circuit miles transmission distribution lines  substations company formerly known fpl group inc changed name nextera energy inc  nextera energy inc founded  headquartered juno beach florida


In [62]:
for committee in df_subcomittees.committee_description3[0:1]:
    print(committee)

africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response


In [64]:
len(df_trades)

23119

##### Test: Looping through lists

In [101]:
# ls_stock_description3 = df_trades.stock_description3.values.tolist()
ls_trades = df_trades.values.tolist()
ls_trades[0]

['02/24/2022',
 '03/11/2022',
 'Shelley M Capito',
 'Spouse',
 'NEE',
 '1001 - 15000',
 'NextEra Energy, Inc. Common Stock',
 'Stock',
 'Sale (Partial)',
 '--',
 'https://efdsearch.senate.gov/search/view/ptr/e7893c34-0761-4c2b-ac52-e303f166517f/',
 nan,
 nan,
 '1001',
 15000.0,
 'NEE',
 'NextEra Energy, Inc.',
 'Utilities',
 'Utilities—Regulated Electric',
 'NextEra Energy, Inc., through its subsidiaries, generates, transmits, distributes, and sells electric power to retail and wholesale customers in North America. The company generates electricity through wind, solar, nuclear, and fossil fuel, such as coal and natural gas facilities. It also develops, constructs, and operates long-term contracted assets with a focus on renewable generation facilities, electric transmission facilities, and battery storage projects; and owns, develops, constructs, manages and operates electric generation facilities in wholesale energy markets. As of December 31, 2020, the company operated approximately 

In [125]:
ls_trades[0][24]

'utilities utilities—regulated electric nextera energy inc subsidiaries generates transmits distributes sells electric power retail wholesale customers north america company generates electricity wind solar nuclear fossil fuel coal natural gas facilities also develops constructs operates longterm contracted assets focus renewable generation facilities electric transmission facilities battery storage projects owns develops constructs manages operates electric generation facilities wholesale energy markets december   company operated approximately  megawatts net generating capacity serves approximately  million people approximately  million customer accounts east lower west coasts florida approximately  circuit miles transmission distribution lines  substations company formerly known fpl group inc changed name nextera energy inc  nextera energy inc founded  headquartered juno beach florida'

In [103]:
# ls_committee_description3 = df_subcomittees.committee_description3.values.tolist()
ls_subcomittees = df_subcomittees.values.tolist()
ls_subcomittees[0]

['SSFR09',
 'Africa and Global Health Policy',
 'The subcommittee deals with all matters concerning U.S. relations with countries in Africa (except those, like the countries of North Africa, specifically covered by other subcommittees), as well as regional intergovernmental organizations like the African Union and the Economic Community of West African States. This subcommittee’s regional responsibilities include all matters within the geographic region, including matters relating to: (1) terrorism and non-proliferation; (2) crime and illicit narcotics; (3) U.S. foreign assistance programs; and (4) the promotion of U.S. trade and exports. In addition, this subcommittee has global responsibility for health-related policy, including disease outbreak and response.',
 'https://www.foreign.senate.gov/download/2021-117th-subcommittees',
 'africa and global health policy the subcommittee deals with all matters concerning us relations with countries in africa except those like the countries of

In [129]:
ls_subcomittees[0][5]

'africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response'

In [140]:
#testing list (time: 23:15)

#establish an empty list
ls_rows_test = []

# iterate through each trade of trades lsit
for trade in ls_trades[0:500]:
#     print(trade)
#     print(trade[24])
#     print('a')
    
    #Iterate through each committee in committee list for each element of trades lists
    for committee in ls_subcomittees:
#         print(committee[1])
# #         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade[24], committee[5])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_ls = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows_test.append(new_ls)

In [141]:
len(ls_rows_test)

12255

##### Looping through Dataframes

In [65]:
from itertools import chain

In [None]:
#working for loop!! (time: __)

#establish an empty list
ls_rows = []

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[0:500].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)

In [67]:
#loop 2

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[501:1000].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



KeyboardInterrupt: 

In [None]:
#loop 3

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[1001:1500].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 4

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[1501:2000].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 5

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[2001:2500].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 6

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[2501:3000].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 7

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[3001:3500].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 8

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[3501:4000].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 9

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[4001:4500].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 10

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[4501:5000].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 11

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[5001:5500].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 12

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[5501:6000].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 13

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[6001:6500].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 14

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[6501:7000].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 15

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[7001:7500].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 16

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[7501:8000].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 17

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[8001:8500].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 18

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[8501:9000].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 19

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[9001:9500].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 20

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[9501:10000].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 21

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[10001:10500].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 22

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[10501:11000].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 23

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[11001:11500].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 24

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[11501:12000].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 25

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[12001:12500].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 26

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[12501:13000].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 27

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[13001:13500].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 28

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[13501:14000].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 29

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[14001:14500].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 30

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[14501:15000].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 31

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[15001:15500].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 32

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[15501:16000].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 33

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[16001:16500].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 34

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[17501:18000].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 35

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[18001:18500].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 36

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[18501:19000].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 37

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[19001:19500].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 38

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[19501:20000].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 39

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[20001:20500].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 40

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[20501:21000].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 41

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[21001:21500].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 42

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[21501:22000].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 43

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[22001:22500].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 44

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[22501:23000].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [None]:
#loop 45

# iterate through each row (trade) of trades dataframe
for index, trade in df_trades[23001:].iterrows():
#     print(trade)
#     print(index, trade['stock_description3'])
#     print('a')
    
    #Iterate through each committee in committee dataframe for each row of trades dataframe
    for index, committee in df_subcomittees.iterrows():
#         print(index, committee['committee_description3'])
#         print('b')

      # match ticker description to committee description with ALGORITHM (which one TBD)
        flt_s_score = similar(trade['stock_description3'], committee['committee_description3'])
#         print(flt_s_score)
        if flt_s_score > 20:
#             print(flt_s_score)
            
#             # add this trade + commitee description to merged
            new_row = list(chain(trade, committee))
#             print(new_row)
#             print('c')
            ls_rows.append(new_row)



In [69]:
len(ls_rows)

13079

In [68]:
ls_rows[0:5]

[['02/24/2022',
  '03/11/2022',
  'Shelley M Capito',
  'Spouse',
  'NEE',
  '1001 - 15000',
  'NextEra Energy, Inc. Common Stock',
  'Stock',
  'Sale (Partial)',
  '--',
  'https://efdsearch.senate.gov/search/view/ptr/e7893c34-0761-4c2b-ac52-e303f166517f/',
  nan,
  nan,
  '1001',
  15000.0,
  'NEE',
  'NextEra Energy, Inc.',
  'Utilities',
  'Utilities—Regulated Electric',
  'NextEra Energy, Inc., through its subsidiaries, generates, transmits, distributes, and sells electric power to retail and wholesale customers in North America. The company generates electricity through wind, solar, nuclear, and fossil fuel, such as coal and natural gas facilities. It also develops, constructs, and operates long-term contracted assets with a focus on renewable generation facilities, electric transmission facilities, and battery storage projects; and owns, develops, constructs, manages and operates electric generation facilities in wholesale energy markets. As of December 31, 2020, the company ope

In [None]:
ls_rows[23115:23119]

In [None]:
# 23119

##### Make into Merged DataFrame

In [70]:
merged = pd.DataFrame(ls_rows)
merged.columns =['transaction_date', 'disclosure_date', 'politician', 'owner', 'ticker', 'amount', 'asset_description', 'asset_type', 'transaction_type', 'comment', 'ptr_link', 'location', 'cap_gains', 'amount_low', 'amount_high', 'ticker2', 'name', 'sector', 'industry', 'longbusinesssummary', 'website', 'stock_description','sector_industry', 'stock_description2', 'stock_description3', 'committee', 'committee_fullname', 'committee_description', 'website','committee_description2', 'committee_description3']

In [71]:
merged.head(10)

Unnamed: 0,transaction_date,disclosure_date,politician,owner,ticker,amount,asset_description,asset_type,transaction_type,comment,...,stock_description,sector_industry,stock_description2,stock_description3,committee,committee_fullname,committee_description,website,committee_description2,committee_description3
0,02/24/2022,03/11/2022,Shelley M Capito,Spouse,NEE,1001 - 15000,"NextEra Energy, Inc. Common Stock",Stock,Sale (Partial),--,...,"Utilities, Utilities—Regulated Electric, NextE...",utilities utilities—regulated electric,utilities utilities—regulated electric nextera...,utilities utilities—regulated electric nextera...,HSJU05,"Antitrust, Commercial, and Administrative Law","Subcommittee on Antitrust, Commercial and Admi...",https://judiciary.house.gov/subcommittees/subc...,antitrust commercial and administrative law su...,antitrust commercial administrative law subcom...
1,02/24/2022,03/11/2022,Shelley M Capito,Spouse,NEE,1001 - 15000,"NextEra Energy, Inc. Common Stock",Stock,Sale (Partial),--,...,"Utilities, Utilities—Regulated Electric, NextE...",utilities utilities—regulated electric,utilities utilities—regulated electric nextera...,utilities utilities—regulated electric nextera...,HSAG14,"Biotechnology, Horticulture, and Research","Policies, statutes, and markets relating to ho...",https://republicans-agriculture.house.gov/issu...,biotechnology horticulture and research polici...,biotechnology horticulture research policies s...
2,02/24/2022,03/11/2022,Shelley M Capito,Spouse,NEE,1001 - 15000,"NextEra Energy, Inc. Common Stock",Stock,Sale (Partial),--,...,"Utilities, Utilities—Regulated Electric, NextE...",utilities utilities—regulated electric,utilities utilities—regulated electric nextera...,utilities utilities—regulated electric nextera...,SSHR09,Children and Families,The Subcommittee has jurisdiction over a wide ...,https://www.help.senate.gov/about/subcommittees,children and families the subcommittee has jur...,children families subcommittee jurisdiction wi...
3,02/24/2022,03/11/2022,Shelley M Capito,Spouse,NEE,1001 - 15000,"NextEra Energy, Inc. Common Stock",Stock,Sale (Partial),--,...,"Utilities, Utilities—Regulated Electric, NextE...",utilities utilities—regulated electric,utilities utilities—regulated electric nextera...,utilities utilities—regulated electric nextera...,HSAP19,"Commerce, Justice, Science, and Related Agencies",DEPARTMENT OF COMMERCE DEPARTMENT OF JUSTICE N...,https://appropriations.house.gov/sites/democra...,commerce justice science and related agencies ...,commerce justice science related agencies depa...
4,02/24/2022,03/11/2022,Shelley M Capito,Spouse,NEE,1001 - 15000,"NextEra Energy, Inc. Common Stock",Stock,Sale (Partial),--,...,"Utilities, Utilities—Regulated Electric, NextE...",utilities utilities—regulated electric,utilities utilities—regulated electric nextera...,utilities utilities—regulated electric nextera...,JCSE,Commission on Security and Cooperation in Europe,The Commission on Security and Cooperation in ...,https://www.csce.gov/about-commission-security...,commission on security and cooperation in euro...,commission security cooperation europe commiss...
5,02/24/2022,03/11/2022,Shelley M Capito,Spouse,NEE,1001 - 15000,"NextEra Energy, Inc. Common Stock",Stock,Sale (Partial),--,...,"Utilities, Utilities—Regulated Electric, NextE...",utilities utilities—regulated electric,utilities utilities—regulated electric nextera...,utilities utilities—regulated electric nextera...,HSAG22,"Commodity Exchanges, Energy, and Credit","Policies, statutes, and markets relating to co...",https://republicans-agriculture.house.gov/issu...,commodity exchanges energy and credit policies...,commodity exchanges energy credit policies sta...
6,02/24/2022,03/11/2022,Shelley M Capito,Spouse,NEE,1001 - 15000,"NextEra Energy, Inc. Common Stock",Stock,Sale (Partial),--,...,"Utilities, Utilities—Regulated Electric, NextE...",utilities utilities—regulated electric,utilities utilities—regulated electric nextera...,utilities utilities—regulated electric nextera...,HSAG15,Conservation and Forestry,Policies and statutes relating to resource con...,https://republicans-agriculture.house.gov/issu...,conservation and forestry policies and statute...,conservation forestry policies statutes relati...
7,02/24/2022,03/11/2022,Shelley M Capito,Spouse,NEE,1001 - 15000,"NextEra Energy, Inc. Common Stock",Stock,Sale (Partial),--,...,"Utilities, Utilities—Regulated Electric, NextE...",utilities utilities—regulated electric,utilities utilities—regulated electric nextera...,utilities utilities—regulated electric nextera...,HSJU03,"Courts, Intellectual Property, and the Internet","The Subcommittee on Courts, Intellectual Prope...",https://judiciary.house.gov/subcommittees/subc...,courts intellectual property and the internet ...,courts intellectual property internet subcommi...
8,02/24/2022,03/11/2022,Shelley M Capito,Spouse,NEE,1001 - 15000,"NextEra Energy, Inc. Common Stock",Stock,Sale (Partial),--,...,"Utilities, Utilities—Regulated Electric, NextE...",utilities utilities—regulated electric,utilities utilities—regulated electric nextera...,utilities utilities—regulated electric nextera...,SSAP02,Defense,DEPARTMENT OF COMMERCE DEPARTMENT OF JUSTICE N...,https://appropriations.house.gov/sites/democra...,defense department of commerce department of j...,defense department commerce department justice...
9,02/24/2022,03/11/2022,Shelley M Capito,Spouse,NEE,1001 - 15000,"NextEra Energy, Inc. Common Stock",Stock,Sale (Partial),--,...,"Utilities, Utilities—Regulated Electric, NextE...",utilities utilities—regulated electric,utilities utilities—regulated electric nextera...,utilities utilities—regulated electric nextera...,HSVR09,Disability Assistance and Memorial Affairs,The Subcommittee on Disability Assistance and ...,https://veterans.house.gov/subcommittees/disab...,disability assistance and memorial affairs the...,disability assistance memorial affairs subcomm...


### Matching with Member Committee Assignments

In [72]:
df_committee_members.head(3)

Unnamed: 0,committee,name,party,rank,bioguide
0,SSAF,Debbie Stabenow,majority,1,S000770
1,SSAF,Patrick J. Leahy,majority,2,L000174
2,SSAF,Sherrod Brown,majority,3,B000944


In [73]:
similar("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

91

In [74]:
similar("Thomas H Tuberville", "Tommy Tuberville")

75

In [75]:
similar("Shelley M Capito", "Shelley Moore Capito")

75

In [76]:
#test

#establish an empty list
ls_rows2 = []

# iterate through each row of merged dataframe
for index, row in merged.iterrows():
#     print(row)
#     print(index, row['politician'])
#     print(index, row['committee'])
#     print('a')
    
    #Iterate through each row of member committee assignment dataframe for each row of merged dataframe
    for index, member in df_committee_members.iterrows():
#         print(index, member['name'])
#         print(index, member['committee'])
#         print('b')

      # match names of trades to members with algorithm
        name_score = similar(row['politician'], member['name'])
#         print(name_score)
        if row['committee'] == member['committee'] and name_score > 60:
#             print('eureka')
            
            # add this trade + commitee description to merged
            new_row2 = list(chain(row, member))
#             print(new_row2)
#             print('c')
            ls_rows2.append(new_row2)



KeyboardInterrupt: 

In [77]:
ls_rows2[0:2]

[['02/24/2022',
  '03/11/2022',
  'Shelley M Capito',
  'Spouse',
  'NEE',
  '1001 - 15000',
  'NextEra Energy, Inc. Common Stock',
  'Stock',
  'Sale (Partial)',
  '--',
  'https://efdsearch.senate.gov/search/view/ptr/e7893c34-0761-4c2b-ac52-e303f166517f/',
  nan,
  nan,
  '1001',
  15000.0,
  'NEE',
  'NextEra Energy, Inc.',
  'Utilities',
  'Utilities—Regulated Electric',
  'NextEra Energy, Inc., through its subsidiaries, generates, transmits, distributes, and sells electric power to retail and wholesale customers in North America. The company generates electricity through wind, solar, nuclear, and fossil fuel, such as coal and natural gas facilities. It also develops, constructs, and operates long-term contracted assets with a focus on renewable generation facilities, electric transmission facilities, and battery storage projects; and owns, develops, constructs, manages and operates electric generation facilities in wholesale energy markets. As of December 31, 2020, the company ope

In [78]:
edited = pd.DataFrame(ls_rows2)
edited.columns =['transaction_date', 'disclosure_date', 'politician', 'owner', 'ticker', 'amount', 'asset_description', 'asset_type', 'transaction_type', 'comment', 'ptr_link', 'location', 'cap_gains', 'amount_low', 'amount_high', 'ticker2', 'name', 'sector', 'industry', 'longbusinesssummary', 'website', 'stock_description','sector_industry', 'stock_description2', 'stock_description3', 'committee', 'committee_fullname', 'committee_description', 'website','committee_description2', 'committee_description3', 'committee', 'name', 'party', 'rank', 'bioguide']

In [80]:
len(edited)

671

In [81]:
edited

Unnamed: 0,transaction_date,disclosure_date,politician,owner,ticker,amount,asset_description,asset_type,transaction_type,comment,...,committee_fullname,committee_description,website,committee_description2,committee_description3,committee,name,party,rank,bioguide
0,02/24/2022,03/11/2022,Shelley M Capito,Spouse,NEE,1001 - 15000,"NextEra Energy, Inc. Common Stock",Stock,Sale (Partial),--,...,Senate Committee on Appropriations,The Senate Appropriations Committee is the lar...,https://www.appropriations.senate.gov/about/ju...,senate committee on appropriations the senate ...,senate committee appropriations senate appropr...,SSAP,Shelley Moore Capito,minority,10,C001047
1,01/14/2022,02/14/2022,Thomas H Tuberville,Joint,NEE,15001 - 50000,"NextEra Energy, Inc. Common Stock",Stock,Sale (Full),--,...,Children and Families,The Subcommittee has jurisdiction over a wide ...,https://www.help.senate.gov/about/subcommittees,children and families the subcommittee has jur...,children families subcommittee jurisdiction wi...,SSHR09,Tommy Tuberville,minority,7,T000278
2,01/14/2022,02/14/2022,Thomas H Tuberville,Joint,NEE,15001 - 50000,"NextEra Energy, Inc. Common Stock",Stock,Sale (Full),--,...,"Courts, Intellectual Property, and the Internet","The Subcommittee on Courts, Intellectual Prope...",https://judiciary.house.gov/subcommittees/subc...,courts intellectual property and the internet ...,courts intellectual property internet subcommi...,HSJU03,Thomas Massie,minority,7,M001184
3,01/14/2022,02/14/2022,Thomas H Tuberville,Joint,NEE,15001 - 50000,"NextEra Energy, Inc. Common Stock",Stock,Sale (Full),--,...,Employment and Workplace Safety,The Subcommittee Chairman is Senator John Hick...,https://www.help.senate.gov/about/subcommittees,employment and workplace safety the subcommitt...,employment workplace safety subcommittee chair...,SSHR11,Tommy Tuberville,minority,2,T000278
4,01/14/2022,02/14/2022,Thomas H Tuberville,Joint,NEE,15001 - 50000,"NextEra Energy, Inc. Common Stock",Stock,Sale (Full),--,...,Select Revenue Measures,The jurisdiction of the Subcommittee on Select...,https://waysandmeans.house.gov/subcommittees/S...,select revenue measures the jurisdiction of th...,select revenue measures jurisdiction subcommit...,HSWM05,Thomas R. Suozzi,majority,9,S001201
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
666,2020-11-17,12/09/2020,Hon. Josh Gottheimer,joint,MSFT,250001 - 500000,Microsoft Corporation,,sale_full,,...,"Tourism, Trade, and Export Promotion","The Subcommittee on Tourism, Trade, and Export...",https://www.commerce.senate.gov/tourism-trade-...,tourism trade and export promotion the subcomm...,tourism trade export promotion subcommittee to...,SSCM39,Ron Johnson,minority,4,J000293
667,2020-11-17,12/09/2020,Hon. Josh Gottheimer,joint,MSFT,250001 - 500000,Microsoft Corporation,,sale_full,,...,Intelligence and Counterterrorism,The Intelligence and Counterterrorism Subcommi...,https://republicans-homeland.house.gov/subcomm...,intelligence and counterterrorism the intellig...,intelligence counterterrorism intelligence cou...,HSHM05,Josh Gottheimer,majority,5,G000583
668,2020-11-17,12/09/2020,Hon. Josh Gottheimer,joint,MSFT,250001 - 500000,Microsoft Corporation,,sale_full,,...,"Tourism, Trade, and Export Promotion","The Subcommittee on Tourism, Trade, and Export...",https://www.commerce.senate.gov/tourism-trade-...,tourism trade and export promotion the subcomm...,tourism trade export promotion subcommittee to...,SSCM39,Ron Johnson,minority,4,J000293
669,2020-11-17,12/09/2020,Hon. Josh Gottheimer,joint,MSFT,500001 - 1000000,Microsoft Corporation,,sale_full,,...,Intelligence and Counterterrorism,The Intelligence and Counterterrorism Subcommi...,https://republicans-homeland.house.gov/subcomm...,intelligence and counterterrorism the intellig...,intelligence counterterrorism intelligence cou...,HSHM05,Josh Gottheimer,majority,5,G000583


In [82]:
# edited.to_csv('..//data//processed//members_stocks_and_committees_03_14_2022.csv', index = False)

In [68]:
# drop the rows where member comittee != ticker_comittee


In [69]:
# do analysis (edited) 

-----

### Fine-tuning the Algorithms

Exploring types of algorithms on the columns

##### Simple Ratios

In [70]:
def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

In [71]:
# high match
similar("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

0.9090909090909091

In [72]:
# should match
similar("utilities utilities—regulated electric nextera energy inc subsidiaries generates transmits distributes sells electric power retail wholesale customers north america company generates electricity wind solar nuclear fossil fuel coal natural gas facilities also develops constructs operates longterm contracted assets focus renewable generation facilities electric transmission facilities battery storage projects owns develops constructs manages operates electric generation facilities wholesale energy markets december   company operated approximately  megawatts net generating capacity serves approximately  million people approximately  million customer accounts east lower west coasts florida approximately  circuit miles transmission distribution lines  substations company formerly known fpl group inc changed name nextera energy inc  nextera energy inc founded  headquartered juno beach florida", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

0.017437961099932932

In [73]:
# should match
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

0.01625072547881602

In [74]:
# should match using industry/sector and committee description
similar("technology consumer electronics", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

0.05188679245283019

In [75]:
# should match using industry/sector and committee description
similar("utilities utilities—regulated electric", "All matters relating to energy research, development, and demonstration projects therefor; commercial application of energy technology; Department of Energy research, development, and demonstration programs; Department of Energy laboratories; Department of Energy science activities; Department of Energy international research, development, and demonstration projects; energy supply activities; nuclear, solar, and renewable energy, and other advanced energy technologies; uranium supply and enrichment, and Department of Energy waste management; Department of Energy environmental management research, development, and demonstration; fossil energy research and development; clean coal technology; energy conservation research and development, including building performance, alternate fuels, distributed power systems, and industrial process improvements; pipeline research, development, and demonstration projects; energy standards; other appropriate matters as referred by the Chair; and relevant oversight.") 

0.005719733079122974

In [76]:
# shouldn't match 
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

0.015608740894901144

In [77]:
def similar(a,b):
    return fuzz.ratio(a, b)

In [78]:
# high match
similar("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

91

In [79]:
# should match
similar("utilities utilities—regulated electric nextera energy inc subsidiaries generates transmits distributes sells electric power retail wholesale customers north america company generates electricity wind solar nuclear fossil fuel coal natural gas facilities also develops constructs operates longterm contracted assets focus renewable generation facilities electric transmission facilities battery storage projects owns develops constructs manages operates electric generation facilities wholesale energy markets december   company operated approximately  megawatts net generating capacity serves approximately  million people approximately  million customer accounts east lower west coasts florida approximately  circuit miles transmission distribution lines  substations company formerly known fpl group inc changed name nextera energy inc  nextera energy inc founded  headquartered juno beach florida", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

2

In [80]:
# should match
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

2

In [81]:
# should match
similar("technology consumer electronics", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

5

In [82]:
# should match
similar("utilities utilities—regulated electric", "All matters relating to energy research, development, and demonstration projects therefor; commercial application of energy technology; Department of Energy research, development, and demonstration programs; Department of Energy laboratories; Department of Energy science activities; Department of Energy international research, development, and demonstration projects; energy supply activities; nuclear, solar, and renewable energy, and other advanced energy technologies; uranium supply and enrichment, and Department of Energy waste management; Department of Energy environmental management research, development, and demonstration; fossil energy research and development; clean coal technology; energy conservation research and development, including building performance, alternate fuels, distributed power systems, and industrial process improvements; pipeline research, development, and demonstration projects; energy standards; other appropriate matters as referred by the Chair; and relevant oversight.") 

1

In [83]:
# shouldn't match 
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

2

In [84]:
# shouldn't match 
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

2

In [85]:
def similar(a,b):
    return fuzz.token_set_ratio(a, b)

In [86]:
# high match
similar("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

100

In [87]:
# should match
similar("utilities utilities—regulated electric nextera energy inc subsidiaries generates transmits distributes sells electric power retail wholesale customers north america company generates electricity wind solar nuclear fossil fuel coal natural gas facilities also develops constructs operates longterm contracted assets focus renewable generation facilities electric transmission facilities battery storage projects owns develops constructs manages operates electric generation facilities wholesale energy markets december   company operated approximately  megawatts net generating capacity serves approximately  million people approximately  million customer accounts east lower west coasts florida approximately  circuit miles transmission distribution lines  substations company formerly known fpl group inc changed name nextera energy inc  nextera energy inc founded  headquartered juno beach florida", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

4

In [88]:
# should match using industry/sector and committee description
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

17

In [89]:
# should match using industry/sector and committee description
similar("technology consumer electronics", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

49

In [90]:
# should match
similar("utilities utilities—regulated electric", "All matters relating to energy research, development, and demonstration projects therefor; commercial application of energy technology; Department of Energy research, development, and demonstration programs; Department of Energy laboratories; Department of Energy science activities; Department of Energy international research, development, and demonstration projects; energy supply activities; nuclear, solar, and renewable energy, and other advanced energy technologies; uranium supply and enrichment, and Department of Energy waste management; Department of Energy environmental management research, development, and demonstration; fossil energy research and development; clean coal technology; energy conservation research and development, including building performance, alternate fuels, distributed power systems, and industrial process improvements; pipeline research, development, and demonstration projects; energy standards; other appropriate matters as referred by the Chair; and relevant oversight.") 

0

In [91]:
# shouldn't match 
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

9

##### Partial Ratio

In [92]:
def similar(a,b):
    return fuzz.partial_ratio(a, b)

In [93]:
# high match
similar("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

91

In [94]:
# should match
similar("utilities utilities—regulated electric nextera energy inc subsidiaries generates transmits distributes sells electric power retail wholesale customers north america company generates electricity wind solar nuclear fossil fuel coal natural gas facilities also develops constructs operates longterm contracted assets focus renewable generation facilities electric transmission facilities battery storage projects owns develops constructs manages operates electric generation facilities wholesale energy markets december   company operated approximately  megawatts net generating capacity serves approximately  million people approximately  million customer accounts east lower west coasts florida approximately  circuit miles transmission distribution lines  substations company formerly known fpl group inc changed name nextera energy inc  nextera energy inc founded  headquartered juno beach florida", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

14

In [95]:
# should match
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

4

!!!! High ratio with industry/sector and committee description

In [96]:
# should match using industry/sector and committee description
similar("technology consumer electronics", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

61

In [97]:
# should match using industry/sector and committee description
similar("utilities utilities—regulated electric", "All matters relating to energy research, development, and demonstration projects therefor; commercial application of energy technology; Department of Energy research, development, and demonstration programs; Department of Energy laboratories; Department of Energy science activities; Department of Energy international research, development, and demonstration projects; energy supply activities; nuclear, solar, and renewable energy, and other advanced energy technologies; uranium supply and enrichment, and Department of Energy waste management; Department of Energy environmental management research, development, and demonstration; fossil energy research and development; clean coal technology; energy conservation research and development, including building performance, alternate fuels, distributed power systems, and industrial process improvements; pipeline research, development, and demonstration projects; energy standards; other appropriate matters as referred by the Chair; and relevant oversight.") 

45

In [98]:
# shouldn't match 
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

18

##### Token Sort Ratio

In [99]:
def similar(a,b):
    return fuzz.token_sort_ratio(a, b)

In [100]:
# high match
similar("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

100

In [101]:
# should match
similar("utilities utilities—regulated electric nextera energy inc subsidiaries generates transmits distributes sells electric power retail wholesale customers north america company generates electricity wind solar nuclear fossil fuel coal natural gas facilities also develops constructs operates longterm contracted assets focus renewable generation facilities electric transmission facilities battery storage projects owns develops constructs manages operates electric generation facilities wholesale energy markets december   company operated approximately  megawatts net generating capacity serves approximately  million people approximately  million customer accounts east lower west coasts florida approximately  circuit miles transmission distribution lines  substations company formerly known fpl group inc changed name nextera energy inc  nextera energy inc founded  headquartered juno beach florida", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

2

In [102]:
# should match
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

2

In [103]:
# should match
similar("technology consumer electronics", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

1

In [104]:
# should match
similar("utilities utilities—regulated electric", "All matters relating to energy research, development, and demonstration projects therefor; commercial application of energy technology; Department of Energy research, development, and demonstration programs; Department of Energy laboratories; Department of Energy science activities; Department of Energy international research, development, and demonstration projects; energy supply activities; nuclear, solar, and renewable energy, and other advanced energy technologies; uranium supply and enrichment, and Department of Energy waste management; Department of Energy environmental management research, development, and demonstration; fossil energy research and development; clean coal technology; energy conservation research and development, including building performance, alternate fuels, distributed power systems, and industrial process improvements; pipeline research, development, and demonstration projects; energy standards; other appropriate matters as referred by the Chair; and relevant oversight.") 

1

In [105]:
# shouldn't match 
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

1

##### Token Set

In [106]:
def similar(a,b):
    return fuzz.token_set_ratio(a, b)

In [107]:
# high match
similar("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

100

In [108]:
# should match
similar("utilities utilities—regulated electric nextera energy inc subsidiaries generates transmits distributes sells electric power retail wholesale customers north america company generates electricity wind solar nuclear fossil fuel coal natural gas facilities also develops constructs operates longterm contracted assets focus renewable generation facilities electric transmission facilities battery storage projects owns develops constructs manages operates electric generation facilities wholesale energy markets december   company operated approximately  megawatts net generating capacity serves approximately  million people approximately  million customer accounts east lower west coasts florida approximately  circuit miles transmission distribution lines  substations company formerly known fpl group inc changed name nextera energy inc  nextera energy inc founded  headquartered juno beach florida", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

4

In [109]:
# should match
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

17

In [110]:
# should match
similar("technology consumer electronics", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

49

In [111]:
# should match
similar("utilities utilities—regulated electric", "All matters relating to energy research, development, and demonstration projects therefor; commercial application of energy technology; Department of Energy research, development, and demonstration programs; Department of Energy laboratories; Department of Energy science activities; Department of Energy international research, development, and demonstration projects; energy supply activities; nuclear, solar, and renewable energy, and other advanced energy technologies; uranium supply and enrichment, and Department of Energy waste management; Department of Energy environmental management research, development, and demonstration; fossil energy research and development; clean coal technology; energy conservation research and development, including building performance, alternate fuels, distributed power systems, and industrial process improvements; pipeline research, development, and demonstration projects; energy standards; other appropriate matters as referred by the Chair; and relevant oversight.") 

0

In [112]:
# shouldn't match 
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

9

##### Hamming Distance (finding the places where the strings vary)

In [113]:
textdistance.hamming.normalized_similarity('arrow', 'arow')

0.4

In [114]:
#the edit distance is 1 for only the difference being one letter different
def similar(a,b):
    return textdistance.hamming(a, b)

In [115]:
# high match
similar("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

2

In [116]:
# should match
similar("utilities utilities—regulated electric nextera energy inc subsidiaries generates transmits distributes sells electric power retail wholesale customers north america company generates electricity wind solar nuclear fossil fuel coal natural gas facilities also develops constructs operates longterm contracted assets focus renewable generation facilities electric transmission facilities battery storage projects owns develops constructs manages operates electric generation facilities wholesale energy markets december   company operated approximately  megawatts net generating capacity serves approximately  million people approximately  million customer accounts east lower west coasts florida approximately  circuit miles transmission distribution lines  substations company formerly known fpl group inc changed name nextera energy inc  nextera energy inc founded  headquartered juno beach florida", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

855

In [117]:
# should match
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

1305

In [118]:
# should match
similar("technology consumer electronics", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

391

In [119]:
# should match
similar("utilities utilities—regulated electric", "All matters relating to energy research, development, and demonstration projects therefor; commercial application of energy technology; Department of Energy research, development, and demonstration programs; Department of Energy laboratories; Department of Energy science activities; Department of Energy international research, development, and demonstration projects; energy supply activities; nuclear, solar, and renewable energy, and other advanced energy technologies; uranium supply and enrichment, and Department of Energy waste management; Department of Energy environmental management research, development, and demonstration; fossil energy research and development; clean coal technology; energy conservation research and development, including building performance, alternate fuels, distributed power systems, and industrial process improvements; pipeline research, development, and demonstration projects; energy standards; other appropriate matters as referred by the Chair; and relevant oversight.") 

1009

In [120]:
# shouldn't match 
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

1295

In [121]:
#75% similar between text and test
def similar(a,b):
    return textdistance.hamming.normalized_similarity(a, b)

In [122]:
# high match
similar("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

0.9090909090909091

In [123]:
# should match
similar("utilities utilities—regulated electric nextera energy inc subsidiaries generates transmits distributes sells electric power retail wholesale customers north america company generates electricity wind solar nuclear fossil fuel coal natural gas facilities also develops constructs operates longterm contracted assets focus renewable generation facilities electric transmission facilities battery storage projects owns develops constructs manages operates electric generation facilities wholesale energy markets december   company operated approximately  megawatts net generating capacity serves approximately  million people approximately  million customer accounts east lower west coasts florida approximately  circuit miles transmission distribution lines  substations company formerly known fpl group inc changed name nextera energy inc  nextera energy inc founded  headquartered juno beach florida", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

0.04894327030033374

In [124]:
# should match
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

0.018796992481203034

In [125]:
# should match
similar("technology consumer electronics", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

0.0050890585241730735

In [126]:
# should match
similar("utilities utilities—regulated electric", "All matters relating to energy research, development, and demonstration projects therefor; commercial application of energy technology; Department of Energy research, development, and demonstration programs; Department of Energy laboratories; Department of Energy science activities; Department of Energy international research, development, and demonstration projects; energy supply activities; nuclear, solar, and renewable energy, and other advanced energy technologies; uranium supply and enrichment, and Department of Energy waste management; Department of Energy environmental management research, development, and demonstration; fossil energy research and development; clean coal technology; energy conservation research and development, including building performance, alternate fuels, distributed power systems, and industrial process improvements; pipeline research, development, and demonstration projects; energy standards; other appropriate matters as referred by the Chair; and relevant oversight.") 

0.001978239366963397

In [127]:
# shouldn't match 
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

0.02631578947368418

##### Levenshtein Distance

In [128]:
#number of edits it will take to transform one to the other
textdistance.levenshtein('arrow', 'arow')

1

In [129]:
#number of edits it will take to transform one to the other
def similar(a,b):
    return textdistance.levenshtein(a, b)

In [130]:
# high match
similar("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

2

In [131]:
# should match
similar("utilities utilities—regulated electric nextera energy inc subsidiaries generates transmits distributes sells electric power retail wholesale customers north america company generates electricity wind solar nuclear fossil fuel coal natural gas facilities also develops constructs operates longterm contracted assets focus renewable generation facilities electric transmission facilities battery storage projects owns develops constructs manages operates electric generation facilities wholesale energy markets december   company operated approximately  megawatts net generating capacity serves approximately  million people approximately  million customer accounts east lower west coasts florida approximately  circuit miles transmission distribution lines  substations company formerly known fpl group inc changed name nextera energy inc  nextera energy inc founded  headquartered juno beach florida", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

639

In [132]:
# should match
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

1069

In [133]:
# should match
similar("technology consumer electronics", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

362

In [134]:
# should match
similar("utilities utilities—regulated electric", "All matters relating to energy research, development, and demonstration projects therefor; commercial application of energy technology; Department of Energy research, development, and demonstration programs; Department of Energy laboratories; Department of Energy science activities; Department of Energy international research, development, and demonstration projects; energy supply activities; nuclear, solar, and renewable energy, and other advanced energy technologies; uranium supply and enrichment, and Department of Energy waste management; Department of Energy environmental management research, development, and demonstration; fossil energy research and development; clean coal technology; energy conservation research and development, including building performance, alternate fuels, distributed power systems, and industrial process improvements; pipeline research, development, and demonstration projects; energy standards; other appropriate matters as referred by the Chair; and relevant oversight.") 

975

In [135]:
# shouldn't match 
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

1004

In [136]:
textdistance.levenshtein.normalized_similarity('arrow', 'arow')

0.8

In [137]:
def similar(a,b):
    return textdistance.levenshtein.normalized_similarity(a, b)

In [138]:
# high match
similar("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

0.9090909090909091

In [139]:
# should match
similar("utilities utilities—regulated electric nextera energy inc subsidiaries generates transmits distributes sells electric power retail wholesale customers north america company generates electricity wind solar nuclear fossil fuel coal natural gas facilities also develops constructs operates longterm contracted assets focus renewable generation facilities electric transmission facilities battery storage projects owns develops constructs manages operates electric generation facilities wholesale energy markets december   company operated approximately  megawatts net generating capacity serves approximately  million people approximately  million customer accounts east lower west coasts florida approximately  circuit miles transmission distribution lines  substations company formerly known fpl group inc changed name nextera energy inc  nextera energy inc founded  headquartered juno beach florida", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

0.289210233592881

In [140]:
# should match
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

0.19624060150375944

In [141]:
# should match
similar("technology consumer electronics", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

0.07888040712468192

In [142]:
# should match
similar("utilities utilities—regulated electric", "All matters relating to energy research, development, and demonstration projects therefor; commercial application of energy technology; Department of Energy research, development, and demonstration programs; Department of Energy laboratories; Department of Energy science activities; Department of Energy international research, development, and demonstration projects; energy supply activities; nuclear, solar, and renewable energy, and other advanced energy technologies; uranium supply and enrichment, and Department of Energy waste management; Department of Energy environmental management research, development, and demonstration; fossil energy research and development; clean coal technology; energy conservation research and development, including building performance, alternate fuels, distributed power systems, and industrial process improvements; pipeline research, development, and demonstration projects; energy standards; other appropriate matters as referred by the Chair; and relevant oversight.") 

0.035608308605341255

In [143]:
# shouldn't match 
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

0.24511278195488717

##### Jaccard Index 

(find the number of common tokens and divide it by the total number of unique tokens)

"We first tokenize the string by default space delimiter, to make words in the strings as tokens. Then we compute the similarity score." 

In [144]:
tokens_1 = "hello world".split()
tokens_2 = "world hello".split()

In [145]:
textdistance.jaccard(tokens_1 , tokens_2)

1.0

In [146]:
tokens_1 = "hello new world".split()
tokens_2 = "hello world".split()

In [147]:
textdistance.jaccard(tokens_1 , tokens_2)

0.6666666666666666

In [148]:
def similar(a,b):
    return textdistance.jaccard(a, b)

In [149]:
# high match
similar("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

1.0

In [150]:
# should match
similar("utilities utilities—regulated electric nextera energy inc subsidiaries generates transmits distributes sells electric power retail wholesale customers north america company generates electricity wind solar nuclear fossil fuel coal natural gas facilities also develops constructs operates longterm contracted assets focus renewable generation facilities electric transmission facilities battery storage projects owns develops constructs manages operates electric generation facilities wholesale energy markets december   company operated approximately  megawatts net generating capacity serves approximately  million people approximately  million customer accounts east lower west coasts florida approximately  circuit miles transmission distribution lines  substations company formerly known fpl group inc changed name nextera energy inc  nextera energy inc founded  headquartered juno beach florida", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

0.6493362831858407

In [151]:
# should match
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

0.28967065868263475

In [152]:
# should match
similar("technology consumer electronics", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

0.07888040712468193

In [153]:
# should match
similar("utilities utilities—regulated electric", "All matters relating to energy research, development, and demonstration projects therefor; commercial application of energy technology; Department of Energy research, development, and demonstration programs; Department of Energy laboratories; Department of Energy science activities; Department of Energy international research, development, and demonstration projects; energy supply activities; nuclear, solar, and renewable energy, and other advanced energy technologies; uranium supply and enrichment, and Department of Energy waste management; Department of Energy environmental management research, development, and demonstration; fossil energy research and development; clean coal technology; energy conservation research and development, including building performance, alternate fuels, distributed power systems, and industrial process improvements; pipeline research, development, and demonstration projects; energy standards; other appropriate matters as referred by the Chair; and relevant oversight.") 

0.036561264822134384

In [154]:
# shouldn't match 
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

0.4397003745318352

##### Sorensen-Dice

"Falling under set similarity, the logic is to find the common tokens, and divide it by the total number of tokens present by combining both sets." 

In [155]:
tokens_1 = "hello world".split()
tokens_2 = "world hello".split()

In [156]:
textdistance.sorensen(tokens_1 , tokens_2)

1.0

In [157]:
tokens_1 = "hello new world".split()
tokens_2 = "hello world".split()

In [158]:
textdistance.sorensen(tokens_1 , tokens_2)

0.8

In [159]:
def similar(a,b):
    return textdistance.sorensen(a, b)

In [160]:
# high match
similar("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

1.0

In [161]:
# should match
similar("utilities utilities—regulated electric nextera energy inc subsidiaries generates transmits distributes sells electric power retail wholesale customers north america company generates electricity wind solar nuclear fossil fuel coal natural gas facilities also develops constructs operates longterm contracted assets focus renewable generation facilities electric transmission facilities battery storage projects owns develops constructs manages operates electric generation facilities wholesale energy markets december   company operated approximately  megawatts net generating capacity serves approximately  million people approximately  million customer accounts east lower west coasts florida approximately  circuit miles transmission distribution lines  substations company formerly known fpl group inc changed name nextera energy inc  nextera energy inc founded  headquartered juno beach florida", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

0.7873910127431254

In [162]:
# should match
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

0.4492164828786999

In [163]:
# should match
similar("technology consumer electronics", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

0.14622641509433962

In [164]:
# should match
similar("utilities utilities—regulated electric", "All matters relating to energy research, development, and demonstration projects therefor; commercial application of energy technology; Department of Energy research, development, and demonstration programs; Department of Energy laboratories; Department of Energy science activities; Department of Energy international research, development, and demonstration projects; energy supply activities; nuclear, solar, and renewable energy, and other advanced energy technologies; uranium supply and enrichment, and Department of Energy waste management; Department of Energy environmental management research, development, and demonstration; fossil energy research and development; clean coal technology; energy conservation research and development, including building performance, alternate fuels, distributed power systems, and industrial process improvements; pipeline research, development, and demonstration projects; energy standards; other appropriate matters as referred by the Chair; and relevant oversight.") 

0.07054337464251668

In [165]:
# shouldn't match 
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

0.6108220603537982

##### Ratcliff-Obershelp similarity

In [166]:
string1, string2 = "i am going home", "gone home"

In [167]:
textdistance.ratcliff_obershelp(string1, string2)

0.6666666666666666

In [168]:
def similar(a,b):
    return textdistance.ratcliff_obershelp(a, b)

In [169]:
# high match
similar("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

0.9090909090909091

In [170]:
# should match 
similar("utilities utilities—regulated electric nextera energy inc subsidiaries generates transmits distributes sells electric power retail wholesale customers north america company generates electricity wind solar nuclear fossil fuel coal natural gas facilities also develops constructs operates longterm contracted assets focus renewable generation facilities electric transmission facilities battery storage projects owns develops constructs manages operates electric generation facilities wholesale energy markets december   company operated approximately  megawatts net generating capacity serves approximately  million people approximately  million customer accounts east lower west coasts florida approximately  circuit miles transmission distribution lines  substations company formerly known fpl group inc changed name nextera energy inc  nextera energy inc founded  headquartered juno beach florida", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

0.12206572769953052

In [171]:
# should match
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

0.06848520023215322

In [172]:
# should match
similar("technology consumer electronics", "cybersecurity infrastructure protection innovation cyber security infrastructure protection innovation subcommittee jurisdiction cybersecurity infrastructure security agency cisa science technology directorate focuses efforts advance federal network security improve critical infrastructure security also oversees cisa‚Äôs chemical security programs crosscutting science technology initiatives")

0.09433962264150944

In [173]:
# should match 
similar("utilities utilities—regulated electric", "All matters relating to energy research, development, and demonstration projects therefor; commercial application of energy technology; Department of Energy research, development, and demonstration programs; Department of Energy laboratories; Department of Energy science activities; Department of Energy international research, development, and demonstration projects; energy supply activities; nuclear, solar, and renewable energy, and other advanced energy technologies; uranium supply and enrichment, and Department of Energy waste management; Department of Energy environmental management research, development, and demonstration; fossil energy research and development; clean coal technology; energy conservation research and development, including building performance, alternate fuels, distributed power systems, and industrial process improvements; pipeline research, development, and demonstration projects; energy standards; other appropriate matters as referred by the Chair; and relevant oversight.") 

0.06291706387035272

In [174]:
# shouldn't match 
similar("technology consumer electronics apple inc designs manufactures markets smartphones personal computers tablets wearables accessories worldwide also sells various related services addition company offers iphone line smartphones mac line personal computers ipad line multipurpose tablets airpods max overear wireless headphone wearables home accessories comprising airpods apple tv apple watch beats products homepod ipod touch provides applecare support services cloud services store services operates various platforms including app store allow customers discover download applications digital content books music video games podcasts additionally company offers various services apple arcade game subscription service apple music offers users curated listening experience ondemand radio stations apple news subscription news magazine service apple tv offers exclusive original content apple card cobranded credit card apple pay cashless payment service well licenses intellectual property company serves consumers small midsized businesses education enterprise government markets distributes thirdparty applications products app store company also sells products retail online stores direct sales force thirdparty cellular network carriers wholesalers retailers resellers apple inc incorporated  headquartered cupertino california", "africa global health policy subcommittee deals matters concerning us relations countries africa except like countries north africa specifically covered subcommittees well regional intergovernmental organizations like african union economic community west african states subcommittee’s regional responsibilities include matters within geographic region including matters relating  terrorism nonproliferation  crime illicit narcotics  us foreign assistance programs  promotion us trade exports addition subcommittee global responsibility healthrelated policy including disease outbreak response")

0.18106139438085328

----