### **Group**: Peter Endes-Nagy, Khawaja Hassan, Shah Ali, Viktoria Konya


# **Task2**

Use the senator speeches in the folder 105-extracted-date located in Inputs and do the following:

i) Text preprocessing: remove stopwords and punctuation. Justify your choices.

ii) Use cosine similarity to find which senator's speech is closest to Senator Biden's. Use sen105kh_fix.csv and/or Wikipedia to validate your findings (i.e., check whether the most similar speeches come from senators of the same state and/or party). Describe your findings.

iv) How do your results change if you apply stemming or lemmatization? In your opinion is it better to apply stemming or lemmatization? Why?

In [2]:
# Import libraries
import pandas as pd
import os
from bs4 import BeautifulSoup
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.metrics.pairwise import cosine_similarity

import warnings
warnings.filterwarnings('ignore')

In [3]:
# Change path
os.chdir("C:\\Users\\User\\Documents\\GitHub\\Python-Programming-and-Text-Analysis")
path_input = "Inputs\\105-extracted-date"

## 1. Data import

In [4]:
# Read in files and create df
speech = []
filename = []
for element in os.listdir(path_input):
    if element.endswith(".txt"):
        # Read the whole file at once; the context manager closes the handle
        with open(os.path.join(path_input, element)) as f:
            file = f.read()
        filename.append(element.replace(".txt", ""))
        speech.append(file)

# Put speeches into a dataframe
df = pd.DataFrame({'filename':filename,'speech':speech})

# Check
pd.set_option("display.max_columns", None)
df[0:10]

Unnamed: 0,filename,speech
0,105-abraham-mi,<DOC>\n<DOCNO>105-abraham-mi-1-19981112</DOCNO...
1,105-akaka-hi,<DOC>\n<DOCNO>105-akaka-hi-1-19981021</DOCNO>\...
2,105-allard-co,<DOC>\n<DOCNO>105-allard-co-1-19981009</DOCNO>...
3,105-ashcroft-mo,<DOC>\n<DOCNO>105-ashcroft-mo-1-19981021</DOCN...
4,105-baucus-mt,<DOC>\n<DOCNO>105-baucus-mt-1-19981021</DOCNO>...
5,105-bennett-ut,<DOC>\n<DOCNO>105-bennett-ut-1-19981112</DOCNO...
6,105-biden-de,<DOC>\n<DOCNO>105-biden-de-1-19981021</DOCNO>\...
7,105-bingaman-nm,<DOC>\n<DOCNO>105-bingaman-nm-1-19981021</DOCN...
8,105-bond-mo,<DOC>\n<DOCNO>105-bond-mo-1-19981012</DOCNO>\n...
9,105-boxer-ca,<DOC>\n<DOCNO>105-boxer-ca-1-19981021</DOCNO>\...


In [5]:
# Check one element
df['speech'].iloc[0]



## 2. Data preparation

First, we need to retrieve only the text from the files.

For this we are going to:
* Parse the documents with BeautifulSoup
* Remove the words that contain the filename
* Do additional data cleaning: remove the \n characters, extra spaces, and leading spaces.

In [6]:
# Define function to remove the filenames
def replaceName(document, filename):
    document_clean = re.sub(r"(?:\s){}[^, ]*".format(filename), "", document) 
    return str(document_clean)
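
To see what the pattern in replaceName removes, here is a minimal sketch run on a made-up string shaped like one parsed document. The sample text is an assumption modelled on the DOCNO values visible in the raw files, not a quote from the corpus. The sketch also wraps the filename in `re.escape`, a precaution that is not in the original function, in case a filename ever contains regex metacharacters.

```python
import re

def replace_name(document, filename):
    # Drop any whitespace-preceded token that starts with the filename,
    # e.g. " 105-abraham-mi-1-19981112"; re.escape guards special characters.
    pattern = r"\s{}[^, ]*".format(re.escape(filename))
    return re.sub(pattern, "", document)

# Hypothetical sample mimicking the text left after parsing one document
sample = " 105-abraham-mi-1-19981112 Mr. ABRAHAM. Mr. President, thank you."
cleaned = replace_name(sample, "105-abraham-mi")
print(repr(cleaned))
```

The document-number token disappears while the speech text is left untouched, including its leading space, which is why a strip step is still needed afterwards.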

In [7]:
my_dict = dict.fromkeys(df['filename'])

for ind in df.index:
    
    # Get filename
    filename = df['filename'][ind]
    
    # Get speech
    speech = df['speech'][ind]
    
    # Parse speech
    soup = BeautifulSoup(speech, 'html')
    
    # Remove \n
    s = soup.text.replace("\n", " ")

    # Remove extra spaces (raw string avoids an invalid escape sequence warning)
    s = re.sub(r"\s\s+", " ", s)
    
    # Remove word that contains filename
    s = replaceName(s, filename)
    
    # Remove leading and trailing spaces (strip() returns a new string, so reassign)
    s = s.strip()
    
    my_dict[filename] = s

In [8]:
# Check one element
my_dict['105-abraham-mi']



In [9]:
# Convert back to df
df = pd.DataFrame.from_dict(my_dict, orient='index').rename_axis('filename').reset_index()
df.rename(columns={ df.columns[1]: "speech" }, inplace = True)
df.head()

Unnamed: 0,filename,speech
0,105-abraham-mi,"Mr. ABRAHAM. Mr. President, during debate on ..."
1,105-akaka-hi,"Mr. AKAKA. Mr. President, I am pleased that t..."
2,105-allard-co,"Mr. ALLARD. Mr. President, I rise to make a f..."
3,105-ashcroft-mo,"Mr. ASHCROFT. Mr. President, the Senate is no..."
4,105-baucus-mt,Mr. BAUCUS. I understand that the House has s...


## 3. Data preprocessing and tokenization, applying TF-IDF vectorizer, calculating cosine similarity

In this step we did the following:

* We created the **text_preprocesser()** function, which removes all punctuation, non-alphabetical characters, stopwords, and words with fewer than 3 characters.
* We applied the preprocessor to the speeches.
* We then applied the **TF-IDF vectorizer**, dropping the least frequent terms (those appearing in fewer than 5% of the documents) and the most frequent terms (those appearing in more than 95% of the documents).

#### 3.1. Text preprocessor

In [10]:
# Build the stopword set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))

# Define preprocessor
def text_preprocesser(text):
    
    # Remove punctuation and other non-word characters
    text = re.sub(r'\W',' ', text)
    
    # Remove everything that is not a letter (digits, underscores)
    text = re.sub("[^a-zA-Z]+", " ", text)
    
    # Tokenize
    tokens = word_tokenize(text.lower())
    
    # Remove stopwords
    tokens = [token for token in tokens if token not in stop_words]
    
    # Keep only words with at least 3 characters
    tokens = [word for word in tokens if len(word)>=3]
    
    preprocessed_text = ' '.join(tokens)
    
    return preprocessed_text

In [11]:
# Check one element
text_preprocesser(df['speech'].iloc[0])



#### 3.2. Apply the text preprocessor to the speeches

In [12]:
# Apply to df
# df_short = df.head(10)
df_preprocessed = df['speech'].apply(text_preprocesser)

# Save
import pickle
with open('df_preprocessed.p', 'wb') as f:
    pickle.dump(df_preprocessed, f)

In [14]:
# Reload the pickled data (optional)
# with open(f'df_preprocessed.p', 'rb') as data:
#     df_preprocessed_loaded = pickle.load(data)
df_preprocessed[0]



In [15]:
len(df_preprocessed)

100

#### 3.3. Apply TF-IDF vectorizer

We will:
* ignore terms that appear in fewer than 5% of the documents (min_df=0.05) and
* ignore terms that appear in more than 95% of the documents (max_df=0.95)
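
The effect of the two thresholds is easier to see on a tiny corpus. The sketch below uses four invented documents (not senator speeches) and different cut-offs from the notebook, chosen only so that the filtering is visible with so few documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Four toy documents (invented for illustration, not senator speeches)
docs = [
    "budget tax senate",
    "budget health senate",
    "budget defense senate treaty",
    "budget treaty senate",
]

# With 4 documents: min_df=0.5 keeps terms found in at least 2 documents,
# max_df=0.75 drops terms found in more than 3 documents.
vectorizer = TfidfVectorizer(min_df=0.5, max_df=0.75)
vectorizer.fit(docs)
kept = sorted(vectorizer.vocabulary_)
print(kept)
```

Only the term that appears in exactly two documents survives: words used everywhere carry no discriminating signal, and words used once cannot contribute to any pairwise similarity. With the 100 speeches here, the notebook's settings translate to keeping terms that occur in at least 5 and at most 95 documents.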

In [13]:
# Apply TF-IDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(min_df=0.05, max_df= 0.95) # preprocessing was done before
tfidf = tfidf_vectorizer.fit_transform(df_preprocessed)

In [16]:
# Convert to df
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
df_tfidf = pd.DataFrame(tfidf.toarray().transpose(), index=tfidf_vectorizer.get_feature_names_out())
df_tfidf.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99
aaa,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027523,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.009506,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013372,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005864,0.0,0.00875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aapcc,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010124,0.0,0.003027,0.0,0.004333,0.007435,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020153,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01685,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.056279
aaron,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.009087,0.002231,0.0,0.0,0.001944,0.0,0.0,0.001355,0.0,0.001732,0.0,0.0,0.001886,0.0,0.0,0.0,0.0,0.001658,0.0,0.002737,0.0,0.006041,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002162,0.0,0.0,0.0,0.0,0.0,0.0,0.001475,0.002989,0.003962,0.000902,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001349,0.0,0.001498,0.0,0.0,0.0,0.003272,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013637,0.004084,0.002567,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aarp,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002094,0.0,0.0,0.0,0.0,0.0,0.0,0.002205,0.0,0.0,0.033106,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002536,0.0,0.005559,0.0,0.011108,0.002856,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001865,0.0,0.0,0.0,0.003438,0.0,0.004516,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000956,0.0,0.0,0.0,0.002333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005289,0.0,0.003248,0.0,0.002368,0.0,0.0,0.0,0.0,0.002085,0.0,0.003883,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aba,0.00453,0.0,0.0,0.004474,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01002,0.00226,0.0,0.0,0.0,0.0,0.004726,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.004534,0.0,0.0,0.0,0.0,0.0,0.005491,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.019891,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002834,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aback,0.0,0.0,0.0,0.001491,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.004521,0.0,0.0,0.0,0.0,0.0,0.0,0.001215,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002403,0.002371,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001047,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.007897,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003098,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001918,0.0,0.0
abandon,0.003513,0.0,0.001624,0.002313,0.002179,0.003345,0.002944,0.003983,0.002058,0.004936,0.0041,0.004691,0.001275,0.000922,0.005709,0.002452,0.002079,0.002438,0.00227,0.005573,0.001485,0.003663,0.0,0.0,0.003283,0.001354,0.006934,0.0,0.003665,0.000399,0.001413,0.00262,0.004657,0.001745,0.002019,0.001657,0.0,0.001367,0.004056,0.003773,0.001209,0.0,0.009194,0.004395,0.0,0.0,0.0,0.00108,0.0015,0.003193,0.00226,0.001424,0.0,0.0,0.001474,0.002987,0.0,0.013066,0.001224,0.008188,0.008622,0.002199,0.008888,0.002968,0.003247,0.002023,0.003188,0.000749,0.003651,0.002091,0.001662,0.002452,0.0,0.0027,0.0,0.002138,0.0,0.0,0.004555,0.000983,0.005547,0.00183,0.0,0.002001,0.0,0.0,0.0,0.003892,0.001136,0.0,0.001282,0.0,0.00059,0.0,0.0,0.001042,0.005127,0.002231,0.002789,0.0
abandoned,0.012813,0.006581,0.006771,0.003013,0.0,0.0,0.001227,0.004151,0.0,0.007202,0.002136,0.006518,0.006642,0.024972,0.0,0.002555,0.0,0.011854,0.002366,0.002323,0.001548,0.0,0.002024,0.001827,0.004562,0.004234,0.0,0.015335,0.0,0.003738,0.001473,0.00091,0.001618,0.001819,0.000701,0.000863,0.0,0.00285,0.0,0.003932,0.00252,0.0,0.007665,0.000916,0.001146,0.005007,0.00164,0.001689,0.0,0.001109,0.002355,0.0,0.0,0.001776,0.002304,0.0,0.0,0.001409,0.001275,0.000948,0.006739,0.0,0.0,0.007733,0.001269,0.000703,0.016609,0.0,0.001902,0.004358,0.001732,0.000852,0.004786,0.007503,0.008143,0.007798,0.0,0.000721,0.0,0.007169,0.00578,0.0,0.012317,0.003128,0.0,0.00229,0.0,0.0,0.001183,0.004253,0.001336,0.003581,0.001229,0.0,0.0,0.0,0.005343,0.00155,0.004068,0.0
abandoning,0.0,0.0,0.0,0.0,0.001999,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00225,0.0,0.0,0.0,0.002046,0.0,0.0,0.0,0.001608,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.006406,0.001235,0.00152,0.0,0.0,0.003722,0.0,0.0,0.0,0.005062,0.0,0.0,0.0,0.0,0.003965,0.0,0.001953,0.0,0.0,0.003105,0.0,0.0,0.0,0.0,0.003307,0.0,0.0,0.0,0.002018,0.0,0.001362,0.0,0.0,0.008775,0.001374,0.0,0.0,0.001525,0.0,0.0,0.0,0.002049,0.000981,0.0,0.0,0.005573,0.0,0.00509,0.0,0.0,0.0,0.0,0.004032,0.0,0.0,0.002084,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001023,0.0
abandonment,0.004225,0.0,0.002605,0.0,0.0,0.0,0.000944,0.002129,0.0,0.001583,0.0,0.01003,0.0,0.001478,0.0,0.0,0.0,0.003909,0.003641,0.001788,0.0,0.002937,0.0,0.0,0.00351,0.0,0.0,0.001388,0.0,0.000639,0.000756,0.0,0.0,0.0,0.001079,0.0,0.0,0.002193,0.0,0.007261,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000866,0.0,0.0,0.001812,0.002283,0.002713,0.0,0.0,0.0,0.0,0.0,0.0,0.001459,0.0,0.0,0.0,0.00238,0.0,0.0,0.0,0.0,0.0,0.0,0.003998,0.0,0.0,0.001443,0.00179,0.002572,0.0,0.0,0.002435,0.001576,0.0,0.0,0.009478,0.001604,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001671,0.002741,0.0,0.0,0.0


In [58]:
# Column names
senator_names = df['filename'].str.split('-').str[1] + '-' + df['filename'].str.split('-').str[2] 
senator_names.head()

0     abraham-mi
1       akaka-hi
2      allard-co
3    ashcroft-mo
4      baucus-mt
Name: filename, dtype: object

In [59]:
# Assign column names to matrix
df_tfidf.columns = senator_names
df_tfidf.head(10)

filename,abraham-mi,akaka-hi,allard-co,ashcroft-mo,baucus-mt,bennett-ut,biden-de,bingaman-nm,bond-mo,boxer-ca,breaux-la,brownback-ks,bryan-nv,bumpers-ar,burns-mt,byrd-wv,campbell-co,chafee-ri,cleland-ga,coats-in,cochran-ms,collins-me,conrad-nd,coverdell-ga,craig-id,damato-ny,daschle-sd,dewine-oh,dodd-ct,domenici-nm,dorgan-nd,durbin-il,enzi-wy,faircloth-nc,feingold-wi,feinstein-ca,ford-ky,frist-tn,glenn-oh,gorton-wa,graham-fl,gramm-tx,grams-mn,grassley-ia,gregg-nh,hagel-ne,harkin-ia,hatch-ut,helms-nc,hollings-sc,hutchinson-ar,hutchison-tx,inhofe-ok,inouye-hi,jeffords-vt,johnson-sd,kempthorne-id,kennedy-ma,kerrey-ne,kerry-ma,kohl-wi,kyl-az,landrieu-la,lautenberg-nj,leahy-vt,levin-mi,lieberman-ct,lott-ms,lugar-in,mack-fl,mccain-az,mcconnell-ky,mikulski-md,moseleybraun-il,moynihan-ny,murkowski-ak,murray-wa,nickles-ok,reed-ri,reid-nv,robb-va,roberts-ks,rockefeller-wv,roth-de,santorum-pa,sarbanes-md,sessions-al,shelby-al,smith-nh,smith-or,snowe-me,specter-pa,stevens-ak,thomas-wy,thompson-tn,thurmond-sc,torricelli-nj,warner-va,wellstone-mn,wyden-or
aaa,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027523,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.009506,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013372,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005864,0.0,0.00875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aapcc,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010124,0.0,0.003027,0.0,0.004333,0.007435,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020153,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01685,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.056279
aaron,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.009087,0.002231,0.0,0.0,0.001944,0.0,0.0,0.001355,0.0,0.001732,0.0,0.0,0.001886,0.0,0.0,0.0,0.0,0.001658,0.0,0.002737,0.0,0.006041,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002162,0.0,0.0,0.0,0.0,0.0,0.0,0.001475,0.002989,0.003962,0.000902,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001349,0.0,0.001498,0.0,0.0,0.0,0.003272,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013637,0.004084,0.002567,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aarp,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002094,0.0,0.0,0.0,0.0,0.0,0.0,0.002205,0.0,0.0,0.033106,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002536,0.0,0.005559,0.0,0.011108,0.002856,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001865,0.0,0.0,0.0,0.003438,0.0,0.004516,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000956,0.0,0.0,0.0,0.002333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005289,0.0,0.003248,0.0,0.002368,0.0,0.0,0.0,0.0,0.002085,0.0,0.003883,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aba,0.00453,0.0,0.0,0.004474,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01002,0.00226,0.0,0.0,0.0,0.0,0.004726,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.004534,0.0,0.0,0.0,0.0,0.0,0.005491,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.019891,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002834,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aback,0.0,0.0,0.0,0.001491,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.004521,0.0,0.0,0.0,0.0,0.0,0.0,0.001215,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002403,0.002371,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001047,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.007897,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003098,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001918,0.0,0.0
abandon,0.003513,0.0,0.001624,0.002313,0.002179,0.003345,0.002944,0.003983,0.002058,0.004936,0.0041,0.004691,0.001275,0.000922,0.005709,0.002452,0.002079,0.002438,0.00227,0.005573,0.001485,0.003663,0.0,0.0,0.003283,0.001354,0.006934,0.0,0.003665,0.000399,0.001413,0.00262,0.004657,0.001745,0.002019,0.001657,0.0,0.001367,0.004056,0.003773,0.001209,0.0,0.009194,0.004395,0.0,0.0,0.0,0.00108,0.0015,0.003193,0.00226,0.001424,0.0,0.0,0.001474,0.002987,0.0,0.013066,0.001224,0.008188,0.008622,0.002199,0.008888,0.002968,0.003247,0.002023,0.003188,0.000749,0.003651,0.002091,0.001662,0.002452,0.0,0.0027,0.0,0.002138,0.0,0.0,0.004555,0.000983,0.005547,0.00183,0.0,0.002001,0.0,0.0,0.0,0.003892,0.001136,0.0,0.001282,0.0,0.00059,0.0,0.0,0.001042,0.005127,0.002231,0.002789,0.0
abandoned,0.012813,0.006581,0.006771,0.003013,0.0,0.0,0.001227,0.004151,0.0,0.007202,0.002136,0.006518,0.006642,0.024972,0.0,0.002555,0.0,0.011854,0.002366,0.002323,0.001548,0.0,0.002024,0.001827,0.004562,0.004234,0.0,0.015335,0.0,0.003738,0.001473,0.00091,0.001618,0.001819,0.000701,0.000863,0.0,0.00285,0.0,0.003932,0.00252,0.0,0.007665,0.000916,0.001146,0.005007,0.00164,0.001689,0.0,0.001109,0.002355,0.0,0.0,0.001776,0.002304,0.0,0.0,0.001409,0.001275,0.000948,0.006739,0.0,0.0,0.007733,0.001269,0.000703,0.016609,0.0,0.001902,0.004358,0.001732,0.000852,0.004786,0.007503,0.008143,0.007798,0.0,0.000721,0.0,0.007169,0.00578,0.0,0.012317,0.003128,0.0,0.00229,0.0,0.0,0.001183,0.004253,0.001336,0.003581,0.001229,0.0,0.0,0.0,0.005343,0.00155,0.004068,0.0
abandoning,0.0,0.0,0.0,0.0,0.001999,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00225,0.0,0.0,0.0,0.002046,0.0,0.0,0.0,0.001608,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.006406,0.001235,0.00152,0.0,0.0,0.003722,0.0,0.0,0.0,0.005062,0.0,0.0,0.0,0.0,0.003965,0.0,0.001953,0.0,0.0,0.003105,0.0,0.0,0.0,0.0,0.003307,0.0,0.0,0.0,0.002018,0.0,0.001362,0.0,0.0,0.008775,0.001374,0.0,0.0,0.001525,0.0,0.0,0.0,0.002049,0.000981,0.0,0.0,0.005573,0.0,0.00509,0.0,0.0,0.0,0.0,0.004032,0.0,0.0,0.002084,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001023,0.0
abandonment,0.004225,0.0,0.002605,0.0,0.0,0.0,0.000944,0.002129,0.0,0.001583,0.0,0.01003,0.0,0.001478,0.0,0.0,0.0,0.003909,0.003641,0.001788,0.0,0.002937,0.0,0.0,0.00351,0.0,0.0,0.001388,0.0,0.000639,0.000756,0.0,0.0,0.0,0.001079,0.0,0.0,0.002193,0.0,0.007261,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000866,0.0,0.0,0.001812,0.002283,0.002713,0.0,0.0,0.0,0.0,0.0,0.0,0.001459,0.0,0.0,0.0,0.00238,0.0,0.0,0.0,0.0,0.0,0.0,0.003998,0.0,0.0,0.001443,0.00179,0.002572,0.0,0.0,0.002435,0.001576,0.0,0.0,0.009478,0.001604,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001671,0.002741,0.0,0.0,0.0


In [60]:
len(df_tfidf)

22530
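
The first rows of the matrix show that inflected forms such as abandon, abandoned, abandoning and abandonment each get their own feature. Part (iv) of the task asks what stemming or lemmatization would change, so here is a minimal sketch of stemming on exactly those four words. It is an illustration rather than a step the notebook ran, and it assumes NLTK's PorterStemmer, chosen because it needs no corpus download (the WordNetLemmatizer imported at the top would require the WordNet data).

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Four separate rows in the TF-IDF matrix above
words = ["abandon", "abandoned", "abandoning", "abandonment"]
stems = [stemmer.stem(word) for word in words]
print(stems)
```

All four forms collapse to a single stem, so stemming inside text_preprocesser() would shrink the vocabulary and let speeches that use different inflections of the same word count as sharing a term.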

#### 3.4. Apply cosine similarity

In [61]:
from sklearn.metrics.pairwise import cosine_similarity

In [82]:
# Get number of columns
num_col = len(df_tfidf.columns)
num_col

100

In [77]:
# Store each senator's TF-IDF column as a 1 x n_terms row vector (txt1 ... txt100)
for i in range(1, num_col+1):
    globals()["txt" + str(i)] = df_tfidf[df_tfidf.columns[i-1]].values.reshape(1, -1) 

In [78]:
# Biden
txt7

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.00168191]])

In [85]:
# Similarity between Biden and Roberts - sample
print("Similarity Biden and Roberts:", cosine_similarity(txt7, txt82))

Similarity Biden and Roberts: [[0.39314126]]
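
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand with NumPy on made-up vectors; the values are illustrative and unrelated to the speeches.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the L2 norms
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy TF-IDF-like vectors (made-up values for illustration)
a = np.array([0.0, 3.0, 4.0])
b = np.array([0.0, 4.0, 3.0])
c = np.array([5.0, 0.0, 0.0])

print(cosine(a, b))  # 24 / (5 * 5) = 0.96
print(cosine(a, c))  # no shared non-zero terms, so 0.0
```

Because TfidfVectorizer L2-normalizes every document vector by default, the denominator is 1 for the speech vectors, and the score is driven entirely by the terms two senators share.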


In [66]:
# Reshape to vector
import numpy as np
for i in range(1, num_col+1):
    globals()["txt" + str(i)] = np.squeeze(np.asarray(globals()["txt" + str(i)]))

In [67]:
txt7

array([0.        , 0.        , 0.        , ..., 0.        , 0.        ,
       0.00168191])

In [69]:
# Create list of the variable names that hold the vectors
list_of_txts = []
for i in range(1, num_col+1):
    element = "txt" + str(i)
    list_of_txts.append(element)
print(list_of_txts)

['txt1', 'txt2', 'txt3', 'txt4', 'txt5', 'txt6', 'txt7', 'txt8', 'txt9', 'txt10', 'txt11', 'txt12', 'txt13', 'txt14', 'txt15', 'txt16', 'txt17', 'txt18', 'txt19', 'txt20', 'txt21', 'txt22', 'txt23', 'txt24', 'txt25', 'txt26', 'txt27', 'txt28', 'txt29', 'txt30', 'txt31', 'txt32', 'txt33', 'txt34', 'txt35', 'txt36', 'txt37', 'txt38', 'txt39', 'txt40', 'txt41', 'txt42', 'txt43', 'txt44', 'txt45', 'txt46', 'txt47', 'txt48', 'txt49', 'txt50', 'txt51', 'txt52', 'txt53', 'txt54', 'txt55', 'txt56', 'txt57', 'txt58', 'txt59', 'txt60', 'txt61', 'txt62', 'txt63', 'txt64', 'txt65', 'txt66', 'txt67', 'txt68', 'txt69', 'txt70', 'txt71', 'txt72', 'txt73', 'txt74', 'txt75', 'txt76', 'txt77', 'txt78', 'txt79', 'txt80', 'txt81', 'txt82', 'txt83', 'txt84', 'txt85', 'txt86', 'txt87', 'txt88', 'txt89', 'txt90', 'txt91', 'txt92', 'txt93', 'txt94', 'txt95', 'txt96', 'txt97', 'txt98', 'txt99', 'txt100']


In [70]:
# Create similarity matrix
from scipy import sparse
# Stack the document vectors into a (documents x terms) matrix
A = np.array([globals()[name] for name in list_of_txts])
A_sparse = sparse.csr_matrix(A)
similarities = cosine_similarity(A_sparse)
print('pairwise dense output:\n {}\n'.format(similarities))

pairwise dense output:
 [[1.         0.11286872 0.1361684  ... 0.09720429 0.16259734 0.13949189]
 [0.11286872 1.         0.07545337 ... 0.04603068 0.07772243 0.06784303]
 [0.1361684  0.07545337 1.         ... 0.0607008  0.09026423 0.08167521]
 ...
 [0.09720429 0.04603068 0.0607008  ... 1.         0.08235814 0.08108126]
 [0.16259734 0.07772243 0.09026423 ... 0.08235814 1.         0.13055899]
 [0.13949189 0.06784303 0.08167521 ... 0.08108126 0.13055899 1.        ]]
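
The same matrix can be obtained without per-senator variables, since `cosine_similarity` accepts the sparse output of the vectorizer directly (one row per document). The sketch below shows this shorter route on three invented texts with placeholder labels; it is an alternative layout, not the code used above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for the preprocessed speeches (invented text and labels)
names = ["senator-a", "senator-b", "senator-c"]
texts = [
    "treaty defense foreign policy",
    "treaty defense nato policy",
    "farm subsidy wheat policy",
]

# fit_transform returns a sparse matrix with one row per document
matrix = TfidfVectorizer().fit_transform(texts)

# cosine_similarity accepts the sparse matrix directly and compares all rows
sim = pd.DataFrame(cosine_similarity(matrix), index=names, columns=names)
print(sim.round(3))
```

In the notebook, passing `tfidf` and `senator_names` in place of the toy inputs should give the same labelled matrix in one step.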



In [87]:
# Convert the similarity matrix to a dataframe labelled with senator names
df_similarity = pd.DataFrame(similarities)
df_similarity.columns = senator_names
df_similarity.insert(0,'senator',senator_names)
df_similarity.head()

filename,senator,abraham-mi,akaka-hi,allard-co,ashcroft-mo,baucus-mt,bennett-ut,biden-de,bingaman-nm,bond-mo,boxer-ca,breaux-la,brownback-ks,bryan-nv,bumpers-ar,burns-mt,byrd-wv,campbell-co,chafee-ri,cleland-ga,coats-in,cochran-ms,collins-me,conrad-nd,coverdell-ga,craig-id,damato-ny,daschle-sd,dewine-oh,dodd-ct,domenici-nm,dorgan-nd,durbin-il,enzi-wy,faircloth-nc,feingold-wi,feinstein-ca,ford-ky,frist-tn,glenn-oh,gorton-wa,graham-fl,gramm-tx,grams-mn,grassley-ia,gregg-nh,hagel-ne,harkin-ia,hatch-ut,helms-nc,hollings-sc,hutchinson-ar,hutchison-tx,inhofe-ok,inouye-hi,jeffords-vt,johnson-sd,kempthorne-id,kennedy-ma,kerrey-ne,kerry-ma,kohl-wi,kyl-az,landrieu-la,lautenberg-nj,leahy-vt,levin-mi,lieberman-ct,lott-ms,lugar-in,mack-fl,mccain-az,mcconnell-ky,mikulski-md,moseleybraun-il,moynihan-ny,murkowski-ak,murray-wa,nickles-ok,reed-ri,reid-nv,robb-va,roberts-ks,rockefeller-wv,roth-de,santorum-pa,sarbanes-md,sessions-al,shelby-al,smith-nh,smith-or,snowe-me,specter-pa,stevens-ak,thomas-wy,thompson-tn,thurmond-sc,torricelli-nj,warner-va,wellstone-mn,wyden-or
0,abraham-mi,1.0,0.112869,0.136168,0.22277,0.134141,0.123907,0.158303,0.143244,0.12753,0.162881,0.106249,0.140346,0.108282,0.126602,0.123742,0.151738,0.078532,0.125201,0.135361,0.175003,0.087042,0.126784,0.117626,0.158389,0.16031,0.118495,0.210736,0.176772,0.193032,0.102856,0.15284,0.185463,0.128237,0.14561,0.181447,0.218029,0.075938,0.1655,0.1565,0.142731,0.1483,0.146812,0.179198,0.143164,0.110481,0.135706,0.136606,0.207545,0.050482,0.165648,0.258783,0.129574,0.095932,0.09354,0.139071,0.118411,0.135085,0.224905,0.144719,0.171199,0.152855,0.196718,0.098884,0.139862,0.191531,0.259811,0.258826,0.186733,0.127116,0.152569,0.254194,0.14375,0.153657,0.132892,0.200711,0.141704,0.176935,0.107303,0.174815,0.117775,0.13527,0.126273,0.164989,0.102136,0.178014,0.113388,0.159714,0.098766,0.16162,0.111183,0.167588,0.135673,0.056847,0.113687,0.139094,0.083433,0.175991,0.097204,0.162597,0.139492
1,akaka-hi,0.112869,1.0,0.075453,0.078826,0.109133,0.073202,0.064699,0.108374,0.070092,0.101515,0.07919,0.051575,0.070959,0.090988,0.085608,0.08004,0.050489,0.102484,0.084109,0.070745,0.056415,0.05563,0.068405,0.044922,0.104363,0.060923,0.137096,0.073775,0.091576,0.052401,0.080564,0.078229,0.063998,0.074877,0.085277,0.110003,0.098905,0.079206,0.112095,0.103805,0.093976,0.050097,0.079194,0.070912,0.044968,0.061148,0.073818,0.090675,0.034393,0.096819,0.08341,0.06366,0.046898,0.367874,0.086635,0.074574,0.138343,0.102868,0.074207,0.086233,0.058906,0.075305,0.0634,0.066233,0.081747,0.068069,0.128439,0.086763,0.076385,0.074787,0.171675,0.059545,0.075109,0.061542,0.111407,0.174446,0.113375,0.047114,0.081079,0.073563,0.070318,0.06128,0.095037,0.054312,0.047152,0.069558,0.086855,0.051625,0.05917,0.058635,0.091361,0.077797,0.06371,0.130197,0.075711,0.046238,0.106356,0.046031,0.077722,0.067843
2,allard-co,0.136168,0.075453,1.0,0.096285,0.089875,0.097429,0.110033,0.116389,0.087289,0.086955,0.066927,0.07615,0.07845,0.090597,0.08448,0.099059,0.162403,0.089549,0.092877,0.092678,0.077663,0.067892,0.08666,0.075839,0.124364,0.085107,0.137095,0.078923,0.105982,0.084675,0.111564,0.095517,0.131119,0.099127,0.116429,0.099702,0.048927,0.081861,0.114084,0.095563,0.09011,0.08197,0.135867,0.084602,0.070411,0.138622,0.087637,0.115318,0.040959,0.123221,0.096576,0.076025,0.076996,0.063882,0.07916,0.078764,0.082895,0.103009,0.107424,0.117813,0.076138,0.158231,0.055835,0.106362,0.088577,0.098479,0.145382,0.141676,0.085377,0.099467,0.178677,0.103749,0.098914,0.077551,0.118817,0.09719,0.082444,0.079914,0.102951,0.0793,0.099673,0.096056,0.08524,0.092999,0.070413,0.093418,0.078393,0.085117,0.081972,0.075444,0.102453,0.094339,0.045962,0.125331,0.107462,0.057723,0.104104,0.060701,0.090264,0.081675
3,ashcroft-mo,0.22277,0.078826,0.096285,1.0,0.120479,0.120272,0.226275,0.180542,0.15945,0.16007,0.090807,0.185773,0.085601,0.111861,0.142458,0.149835,0.080298,0.10202,0.141938,0.187907,0.091325,0.115177,0.135818,0.152225,0.185826,0.101243,0.242532,0.183147,0.192048,0.108019,0.163379,0.185337,0.171343,0.139297,0.145098,0.181387,0.073602,0.215327,0.151447,0.141689,0.125553,0.166707,0.164305,0.160366,0.117136,0.14371,0.166338,0.251176,0.062968,0.151126,0.232417,0.158699,0.108867,0.078124,0.152119,0.106908,0.120492,0.299364,0.170936,0.206295,0.15812,0.203418,0.08913,0.136632,0.238891,0.125207,0.242089,0.224272,0.129943,0.12349,0.247305,0.129548,0.169356,0.12911,0.183996,0.103196,0.168589,0.123844,0.196554,0.104129,0.121895,0.16157,0.154283,0.122175,0.156725,0.092328,0.215206,0.087906,0.157137,0.1444,0.144412,0.174687,0.05043,0.115463,0.138321,0.090248,0.1615,0.10395,0.176826,0.163304
4,baucus-mt,0.134141,0.109133,0.089875,0.120479,1.0,0.094567,0.084955,0.110201,0.118031,0.112623,0.092247,0.068766,0.107518,0.14347,0.322914,0.176853,0.080644,0.232519,0.078913,0.078725,0.084261,0.084905,0.117779,0.060271,0.149323,0.074498,0.189076,0.078487,0.121661,0.093363,0.174252,0.103917,0.10362,0.085093,0.116304,0.108637,0.073479,0.08747,0.122969,0.135043,0.127204,0.145033,0.110237,0.107486,0.078333,0.100248,0.107739,0.124661,0.03663,0.116596,0.120329,0.075162,0.063941,0.093468,0.099922,0.128188,0.158524,0.118499,0.134036,0.126353,0.081185,0.108438,0.076444,0.133082,0.095956,0.086287,0.148293,0.147193,0.120607,0.095811,0.183442,0.101466,0.085803,0.094912,0.140394,0.262251,0.134457,0.08424,0.09679,0.121141,0.093928,0.11817,0.122108,0.098392,0.061454,0.097737,0.082457,0.071058,0.076421,0.096576,0.108764,0.090406,0.091628,0.173847,0.095451,0.046468,0.111071,0.131653,0.109849,0.124073


In [75]:
# Get most similar senator to Biden
df_similarity_nobiden = df_similarity[df_similarity['senator']!= 'biden-de']
df_similarity_nobiden[df_similarity_nobiden['biden-de'] == df_similarity_nobiden['biden-de'].max()]

filename,senator,abraham-mi,akaka-hi,allard-co,ashcroft-mo,baucus-mt,bennett-ut,biden-de,bingaman-nm,bond-mo,boxer-ca,breaux-la,brownback-ks,bryan-nv,bumpers-ar,burns-mt,byrd-wv,campbell-co,chafee-ri,cleland-ga,coats-in,cochran-ms,collins-me,conrad-nd,coverdell-ga,craig-id,damato-ny,daschle-sd,dewine-oh,dodd-ct,domenici-nm,dorgan-nd,durbin-il,enzi-wy,faircloth-nc,feingold-wi,feinstein-ca,ford-ky,frist-tn,glenn-oh,gorton-wa,graham-fl,gramm-tx,grams-mn,grassley-ia,gregg-nh,hagel-ne,harkin-ia,hatch-ut,helms-nc,hollings-sc,hutchinson-ar,hutchison-tx,inhofe-ok,inouye-hi,jeffords-vt,johnson-sd,kempthorne-id,kennedy-ma,kerrey-ne,kerry-ma,kohl-wi,kyl-az,landrieu-la,lautenberg-nj,leahy-vt,levin-mi,lieberman-ct,lott-ms,lugar-in,mack-fl,mccain-az,mcconnell-ky,mikulski-md,moseleybraun-il,moynihan-ny,murkowski-ak,murray-wa,nickles-ok,reed-ri,reid-nv,robb-va,roberts-ks,rockefeller-wv,roth-de,santorum-pa,sarbanes-md,sessions-al,shelby-al,smith-nh,smith-or,snowe-me,specter-pa,stevens-ak,thomas-wy,thompson-tn,thurmond-sc,torricelli-nj,warner-va,wellstone-mn,wyden-or
81,roberts-ks,0.126273,0.06128,0.096056,0.16157,0.11817,0.10251,0.393141,0.169133,0.089207,0.101143,0.068286,0.136421,0.07097,0.112377,0.156467,0.162916,0.074508,0.078247,0.13405,0.149908,0.144366,0.08472,0.177548,0.073833,0.163261,0.097997,0.250567,0.096586,0.164155,0.091846,0.230042,0.161682,0.09624,0.097036,0.178549,0.139657,0.069572,0.080799,0.168429,0.116394,0.093313,0.11074,0.158501,0.113884,0.079019,0.248101,0.197052,0.115123,0.062417,0.116736,0.096152,0.213137,0.153596,0.082842,0.088105,0.151941,0.120187,0.102637,0.192941,0.144978,0.114362,0.186868,0.095625,0.092214,0.110414,0.17613,0.260311,0.223188,0.323771,0.088567,0.223135,0.152674,0.150316,0.076375,0.139321,0.111355,0.115571,0.089378,0.121334,0.069218,0.150597,1.0,0.099154,0.149093,0.067108,0.073951,0.082782,0.100891,0.121676,0.193425,0.156933,0.122108,0.092232,0.105342,0.122252,0.134421,0.137328,0.163768,0.152826,0.099569
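
The lookup above filters the full dataframe and returns an entire row. When only the name and score are needed, a single column of scores is enough. The sketch below shows that pattern on a small made-up Series; the labels and values are placeholders.

```python
import pandas as pd

# Toy similarity scores to a reference senator (values invented for illustration)
scores = pd.Series(
    {"reference": 1.0, "senator-a": 0.39, "senator-b": 0.33, "senator-c": 0.28}
)

# Drop the self-similarity of 1.0, then rank the rest
others = scores.drop("reference")
closest = others.idxmax()
top_two = others.nlargest(2)
print(closest)
print(top_two)
```

In the notebook, `df_similarity.set_index('senator')['biden-de']` would play the role of `scores`, with 'biden-de' as the label to drop.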


Based on the speeches, the senator most similar to Biden is Pat Roberts, Republican senator from Kansas. The cosine similarity between the two senators' speeches is 0.393141. Biden was a Democratic senator from Delaware, so the closest match shares neither his party nor his state.

Let's check the 10 senators most similar to Biden.

In [92]:
df_similarity_nobiden.sort_values('biden-de', ascending = False).head(10)

filename,senator,abraham-mi,akaka-hi,allard-co,ashcroft-mo,baucus-mt,bennett-ut,biden-de,bingaman-nm,bond-mo,boxer-ca,breaux-la,brownback-ks,bryan-nv,bumpers-ar,burns-mt,byrd-wv,campbell-co,chafee-ri,cleland-ga,coats-in,cochran-ms,collins-me,conrad-nd,coverdell-ga,craig-id,damato-ny,daschle-sd,dewine-oh,dodd-ct,domenici-nm,dorgan-nd,durbin-il,enzi-wy,faircloth-nc,feingold-wi,feinstein-ca,ford-ky,frist-tn,glenn-oh,gorton-wa,graham-fl,gramm-tx,grams-mn,grassley-ia,gregg-nh,hagel-ne,harkin-ia,hatch-ut,helms-nc,hollings-sc,hutchinson-ar,hutchison-tx,inhofe-ok,inouye-hi,jeffords-vt,johnson-sd,kempthorne-id,kennedy-ma,kerrey-ne,kerry-ma,kohl-wi,kyl-az,landrieu-la,lautenberg-nj,leahy-vt,levin-mi,lieberman-ct,lott-ms,lugar-in,mack-fl,mccain-az,mcconnell-ky,mikulski-md,moseleybraun-il,moynihan-ny,murkowski-ak,murray-wa,nickles-ok,reed-ri,reid-nv,robb-va,roberts-ks,rockefeller-wv,roth-de,santorum-pa,sarbanes-md,sessions-al,shelby-al,smith-nh,smith-or,snowe-me,specter-pa,stevens-ak,thomas-wy,thompson-tn,thurmond-sc,torricelli-nj,warner-va,wellstone-mn,wyden-or
81,roberts-ks,0.126273,0.06128,0.096056,0.16157,0.11817,0.10251,0.393141,0.169133,0.089207,0.101143,0.068286,0.136421,0.07097,0.112377,0.156467,0.162916,0.074508,0.078247,0.13405,0.149908,0.144366,0.08472,0.177548,0.073833,0.163261,0.097997,0.250567,0.096586,0.164155,0.091846,0.230042,0.161682,0.09624,0.097036,0.178549,0.139657,0.069572,0.080799,0.168429,0.116394,0.093313,0.11074,0.158501,0.113884,0.079019,0.248101,0.197052,0.115123,0.062417,0.116736,0.096152,0.213137,0.153596,0.082842,0.088105,0.151941,0.120187,0.102637,0.192941,0.144978,0.114362,0.186868,0.095625,0.092214,0.110414,0.17613,0.260311,0.223188,0.323771,0.088567,0.223135,0.152674,0.150316,0.076375,0.139321,0.111355,0.115571,0.089378,0.121334,0.069218,0.150597,1.0,0.099154,0.149093,0.067108,0.073951,0.082782,0.100891,0.121676,0.193425,0.156933,0.122108,0.092232,0.105342,0.122252,0.134421,0.137328,0.163768,0.152826,0.099569
66,lieberman-ct,0.258826,0.128439,0.145382,0.242089,0.148293,0.154044,0.326581,0.25372,0.150841,0.178145,0.136742,0.193277,0.125295,0.148746,0.13037,0.221909,0.106513,0.143247,0.193833,0.341827,0.137798,0.142598,0.154611,0.130963,0.19426,0.152814,0.306443,0.170998,0.356972,0.123081,0.202561,0.226652,0.166731,0.145903,0.224967,0.203816,0.085166,0.212083,0.245912,0.16867,0.174671,0.142727,0.18524,0.160344,0.139319,0.205149,0.173839,0.210163,0.067336,0.185987,0.208261,0.189795,0.163107,0.13346,0.189216,0.158749,0.146561,0.279634,0.247565,0.256739,0.174676,0.260628,0.110719,0.185894,0.187547,0.227306,1.0,0.236852,0.206412,0.195715,0.312457,0.163489,0.214852,0.150073,0.256058,0.175871,0.19347,0.127895,0.228361,0.131291,0.171193,0.260311,0.234623,0.170744,0.135731,0.137003,0.157913,0.13584,0.1728,0.17561,0.212781,0.203276,0.07079,0.144483,0.208427,0.126568,0.248013,0.149132,0.200665,0.193752
89,smith-or,0.111183,0.058635,0.075444,0.1444,0.096576,0.064139,0.278569,0.137244,0.064015,0.095974,0.047391,0.063234,0.098237,0.090342,0.103733,0.094746,0.055281,0.068353,0.101655,0.104168,0.06387,0.103425,0.080503,0.056854,0.171952,0.081355,0.150599,0.080607,0.103461,0.058326,0.10062,0.119264,0.075802,0.068504,0.095722,0.129306,0.053173,0.072083,0.093115,0.145778,0.092433,0.06559,0.083338,0.069084,0.069402,0.090331,0.097422,0.12085,0.034801,0.077142,0.08138,0.135454,0.066878,0.050295,0.078843,0.0733,0.115241,0.092011,0.095486,0.105404,0.095802,0.126805,0.050568,0.075844,0.095894,0.102762,0.17561,0.13494,0.122355,0.066921,0.157751,0.09976,0.124858,0.065852,0.121205,0.103982,0.126085,0.065958,0.093437,0.0784,0.083576,0.193425,0.066003,0.124044,0.072199,0.05373,0.083458,0.071885,0.235206,1.0,0.109018,0.07996,0.049954,0.069132,0.103489,0.083608,0.097964,0.112582,0.100953,0.490846
68,lugar-in,0.127116,0.076385,0.085377,0.129943,0.120607,0.086746,0.276504,0.155083,0.090977,0.087945,0.078314,0.109951,0.084804,0.085454,0.132139,0.113305,0.057361,0.097215,0.112234,0.135316,0.166065,0.090861,0.150247,0.068066,0.14175,0.08333,0.218017,0.085547,0.1651,0.087218,0.188402,0.13025,0.078317,0.097684,0.176122,0.132074,0.129067,0.082294,0.199383,0.104684,0.099572,0.109506,0.175432,0.116566,0.076928,0.165083,0.227608,0.108558,0.071651,0.124426,0.105384,0.113802,0.095373,0.059972,0.100289,0.157877,0.105543,0.114867,0.160322,0.153493,0.110866,0.227128,0.091762,0.089163,0.121169,0.127799,0.206412,0.17975,1.0,0.083512,0.191202,0.122921,0.11689,0.082683,0.142601,0.098333,0.107966,0.08869,0.120212,0.064662,0.203443,0.323771,0.101709,0.125744,0.060859,0.086774,0.076525,0.082102,0.086213,0.122355,0.120193,0.107314,0.049846,0.085147,0.11422,0.080407,0.122502,0.095925,0.14325,0.091617
61,kyl-az,0.196718,0.075305,0.158231,0.203418,0.108438,0.134244,0.275706,0.197886,0.128838,0.145679,0.097388,0.16886,0.160388,0.13789,0.131256,0.156613,0.121069,0.095201,0.133289,0.177394,0.18381,0.136621,0.151824,0.110517,0.213907,0.102091,0.279459,0.136667,0.209539,0.126786,0.186945,0.18605,0.179427,0.168521,0.179072,0.248988,0.073178,0.172506,0.211065,0.147288,0.153262,0.154445,0.187808,0.215583,0.139703,0.184916,0.143054,0.236148,0.096183,0.164744,0.201399,0.13225,0.198653,0.128318,0.119114,0.12779,0.135632,0.210194,0.231963,0.277318,0.151137,1.0,0.077715,0.139946,0.220128,0.204396,0.260628,0.259246,0.227128,0.151019,0.273703,0.154914,0.160724,0.106024,0.176302,0.1464,0.148107,0.14241,0.194162,0.145454,0.134408,0.186868,0.170858,0.129939,0.141086,0.11293,0.208003,0.178034,0.168014,0.126805,0.166751,0.177352,0.071165,0.117107,0.172499,0.142137,0.20234,0.108592,0.167185,0.170935
26,daschle-sd,0.210736,0.137096,0.137095,0.242532,0.189076,0.141661,0.261798,0.228371,0.190508,0.190507,0.149772,0.21344,0.134924,0.191217,0.189009,0.246906,0.128533,0.16553,0.175937,0.206939,0.193051,0.157412,0.345041,0.15789,0.245313,0.155013,1.0,0.17963,0.260346,0.149294,0.345902,0.26653,0.173169,0.174338,0.209806,0.213695,0.197082,0.225101,0.199846,0.167978,0.205649,0.25879,0.221852,0.190546,0.154351,0.173742,0.254379,0.221117,0.067309,0.207974,0.166936,0.175264,0.143372,0.156937,0.182575,0.344374,0.180572,0.325254,0.2642,0.285257,0.189425,0.279459,0.133605,0.201264,0.217078,0.209339,0.306443,0.364015,0.218017,0.154605,0.358586,0.18149,0.211809,0.182746,0.223784,0.161245,0.207841,0.139677,0.224388,0.148346,0.187086,0.250567,0.219339,0.166785,0.131808,0.124329,0.16467,0.129981,0.164228,0.150599,0.211252,0.178739,0.085745,0.13676,0.173336,0.139878,0.245228,0.144444,0.23722,0.207716
51,hutchison-tx,0.129574,0.06366,0.076025,0.158699,0.075162,0.078017,0.256332,0.157708,0.075083,0.350205,0.159914,0.104402,0.051826,0.091991,0.085195,0.139421,0.054783,0.075853,0.109947,0.136839,0.093431,0.082987,0.090107,0.083747,0.117275,0.088054,0.175264,0.090185,0.11832,0.083015,0.107296,0.119431,0.07179,0.113857,0.118089,0.333307,0.0578,0.082424,0.101262,0.114219,0.077532,0.121186,0.116953,0.078598,0.078687,0.107548,0.105686,0.099628,0.057816,0.136436,0.095545,1.0,0.137701,0.073328,0.092717,0.077367,0.082843,0.114353,0.121381,0.128001,0.105585,0.13225,0.083056,0.104087,0.087048,0.129094,0.189795,0.193999,0.113802,0.081394,0.234806,0.079426,0.146405,0.086412,0.117394,0.102461,0.294176,0.077247,0.119067,0.0646,0.094117,0.213137,0.100616,0.155122,0.053295,0.065892,0.097426,0.080958,0.113173,0.135454,0.16236,0.101961,0.066503,0.062943,0.0775,0.104775,0.11745,0.123008,0.099764,0.075584
7,bingaman-nm,0.143244,0.108374,0.116389,0.180542,0.110201,0.108237,0.253516,1.0,0.107797,0.144066,0.096554,0.093001,0.1855,0.134962,0.102424,0.156199,0.1128,0.104288,0.136087,0.16547,0.12903,0.093409,0.136969,0.098952,0.186463,0.104644,0.228371,0.125524,0.184216,0.17659,0.159196,0.157191,0.129807,0.099251,0.130918,0.174735,0.075587,0.159994,0.156786,0.155836,0.137024,0.100553,0.156409,0.110784,0.106037,0.125986,0.157919,0.14501,0.066492,0.122176,0.123098,0.157708,0.106945,0.109885,0.149622,0.127941,0.142243,0.205856,0.160887,0.182347,0.123139,0.197886,0.109339,0.140927,0.123383,0.157506,0.25372,0.178799,0.155083,0.135122,0.228476,0.106464,0.16073,0.116622,0.160705,0.252212,0.188691,0.074849,0.184095,0.186586,0.116933,0.169133,0.149889,0.153445,0.072636,0.091591,0.111163,0.086611,0.140162,0.137244,0.140568,0.123799,0.066298,0.100778,0.124774,0.094491,0.160341,0.117306,0.145044,0.127455
83,roth-de,0.102136,0.054312,0.092999,0.122175,0.098392,0.06515,0.248345,0.153445,0.100697,0.073161,0.099373,0.087966,0.057549,0.068338,0.070663,0.098897,0.066036,0.099625,0.097128,0.107814,0.072052,0.094103,0.100005,0.100026,0.107558,0.08168,0.166785,0.076083,0.10525,0.08664,0.103254,0.120857,0.081148,0.083702,0.092509,0.097939,0.04476,0.085238,0.104113,0.082757,0.109702,0.107156,0.104186,0.117923,0.100918,0.105996,0.094286,0.113612,0.037681,0.109924,0.07907,0.155122,0.062269,0.07423,0.103262,0.080001,0.076481,0.109336,0.206761,0.134167,0.097891,0.129939,0.048482,0.111298,0.078258,0.107,0.170744,0.193447,0.125744,0.086355,0.207852,0.07139,0.135551,0.080065,0.216137,0.078328,0.093743,0.084678,0.128738,0.05722,0.081138,0.149093,0.127477,1.0,0.0502,0.066484,0.070948,0.085596,0.092407,0.124044,0.118176,0.094866,0.048721,0.069081,0.091172,0.062999,0.097078,0.094697,0.100445,0.093145
70,mccain-az,0.254194,0.171675,0.178677,0.247305,0.183442,0.168927,0.244909,0.228476,0.192535,0.190238,0.205908,0.157099,0.206503,0.186667,0.210612,0.236206,0.224506,0.190847,0.207957,0.263991,0.146484,0.18658,0.21501,0.177894,0.25318,0.138403,0.358586,0.192507,0.257872,0.17944,0.270299,0.265262,0.206606,0.210784,0.287172,0.226713,0.191958,0.247414,0.223076,0.259957,0.228793,0.215547,0.218322,0.200218,0.16427,0.173549,0.219754,0.272864,0.070597,0.323032,0.184909,0.234806,0.225576,0.274372,0.191859,0.21462,0.166732,0.276514,0.262816,0.328808,0.184583,0.273703,0.111724,0.241772,0.223265,0.222478,0.312457,0.363566,0.191202,0.180951,1.0,0.214891,0.206227,0.168558,0.222431,0.218162,0.208152,0.156915,0.272418,0.168607,0.237442,0.223135,0.260128,0.207852,0.137614,0.16679,0.182608,0.165612,0.171427,0.157751,0.291378,0.197928,0.139314,0.206272,0.230364,0.154497,0.24743,0.171817,0.207033,0.244417


According to Wikipedia, 7 of these 10 senators are Republicans and 3 are Democrats:
* Republicans: Roberts, Smith, Lugar, Hutchison, Roth, McCain, Kyl
* Democrats: Lieberman, Daschle, Bingaman
    
The results seem counter-intuitive at first, but this approach only compares word frequencies. If senators from both parties address the same items on the agenda, their speeches will not look very different: particular wordings may differ, but those differences are drowned out by the shared topic vocabulary.
The Kansas senator (Roberts) may rank first because he talks about the same topics as Biden with a similar vocabulary, even if the positions he takes differ. A sentiment analysis could shed light on such differences by checking for negative or positive sentiment, as could any other approach that moves beyond bag of words and takes sentence structure into account.
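
To illustrate the point about word frequencies, here is a minimal sketch with two made-up sentences (not taken from the speeches) and a hypothetical helper `bow_cosine`. The sentences take opposite positions but use exactly the same words, so a bag-of-words comparison cannot tell them apart.

```python
import math
from collections import Counter

def bow_cosine(text_a, text_b):
    """Cosine similarity between the raw word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Dot product over the shared vocabulary
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b)

# Toy sentences (illustrative only): same words, opposite positions
speech_a = "we support the treaty and oppose the amendment"
speech_b = "we oppose the treaty and support the amendment"

similarity = bow_cosine(speech_a, speech_b)
print(round(similarity, 3))
```

Both sentences produce identical word counts, so the similarity is maximal even though the stated positions are reversed. This is the kind of difference that a sentiment-based or structure-aware method would be needed to detect.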

## 4. Data preprocessing and lemmatization, applying TF-IDF vectorizer, calculating cosine similarity

To see how the results change, we chose lemmatization rather than stemming. The main reason is that the meaning of the words matters here: for example, 'america' and 'american' should remain different words.
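
As a small illustration of why stemming can blur meaning, the sketch below runs NLTK's `PorterStemmer` on a few example words of our own choosing (they are not drawn from the speeches). A stemmer strips suffixes by rule, so words with different meanings can collapse onto the same stem, whereas a dictionary-based lemmatizer is expected to keep such words apart.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words only
words = ["university", "universe", "organization", "organ"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word:>12} -> {stem}")

# Words whose stems coincide are merged into one feature by a stemming pipeline
merged_pairs = [
    (a, b) for i, a in enumerate(words) for b in words[i + 1:] if stems[a] == stems[b]
]
print(merged_pairs)
```

Any pair listed in `merged_pairs` would be counted as a single term by a TF-IDF vectorizer fitted on stemmed text, which can inflate the similarity between speeches on unrelated topics.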

In [110]:
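import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

lemmatizer = WordNetLemmatizer()
# Note: this rebinds the name `stopwords` from the NLTK corpus reader to a plain list of English stopwords
stopwords = nltk.corpus.stopwords.words('english')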

#### 4.1. Text preprocessor - Lemmatization

In [108]:
# Lemmatizer

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

def text_preprocesser_lemmatize(text):
    
    # Replace every run of non-alphabetic characters (punctuation, digits, symbols) with a space
    text = re.sub("[^a-zA-Z]+", " ", text)
    
    # Lowercase and tokenize
    tokens = [word.lower() for word in nltk.word_tokenize(text)]
    
    # Remove stopwords
    tokens = [token for token in tokens if token not in stopwords]
    
    # Remove words shorter than 3 characters
    tokens = [token for token in tokens if len(token)>=3]
    
    # Lemmatize each token using its part-of-speech tag
    lemma = [lemmatizer.lemmatize(token, get_wordnet_pos(token)) for token in tokens]
    
    # Join
    preprocessed_text = ' '.join(lemma)

    return preprocessed_text


In [109]:
# Check one element
text_preprocesser_lemmatize(df['speech'].iloc[0])

'abraham president debate final passage omnibus appropriation bill american competitiveness workforce improvement act include title subdivision ask unanimous consent number document print record include two document receive administration negotiation whose inclusion seek help illuminate meaning provision legislation one key point document change july version september version copy submit change marked redline marking unfortunately however submit copy version copy fax marking appear effect make september version unintelligible result printing garble text also contain marking show change accordingly ask correct version document submit appear final issue record congress copy september document submit material appear july version delete september version black bracket material include july version add september version print italic abraham president rise register serious concern provision omnibus appropriation bill include understand protest senate legislative provision append commerce jus

#### 4.2. Apply text preprocessor

In [111]:
# Apply to df
df_preprocessed_lemmatize = df['speech'].apply(text_preprocesser_lemmatize)

# Save
import pickle
with open('df_preprocessed_lemmatize.p', 'wb') as f:
    pickle.dump(df_preprocessed_lemmatize, f)

In [114]:
# Check one element
df_preprocessed_lemmatize[0]

'abraham president debate final passage omnibus appropriation bill american competitiveness workforce improvement act include title subdivision ask unanimous consent number document print record include two document receive administration negotiation whose inclusion seek help illuminate meaning provision legislation one key point document change july version september version copy submit change marked redline marking unfortunately however submit copy version copy fax marking appear effect make september version unintelligible result printing garble text also contain marking show change accordingly ask correct version document submit appear final issue record congress copy september document submit material appear july version delete september version black bracket material include july version add september version print italic abraham president rise register serious concern provision omnibus appropriation bill include understand protest senate legislative provision append commerce jus

In [115]:
len(df_preprocessed_lemmatize)

100

#### 4.3. Apply TF-IDF vectorizer

We will:
* ignore terms that appear in fewer than 5% of the documents (`min_df=0.05`) and
* ignore terms that appear in more than 95% of the documents (`max_df=0.95`)
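
The effect of these two document-frequency cut-offs can be sketched on a toy corpus. Everything below is illustrative: the four short "documents" are made up, and the thresholds (50% and 75%) are chosen only because 5% and 95% would not filter anything in such a tiny corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (illustrative only): 4 short "documents"
toy_docs = [
    "senate budget tax",
    "senate budget health",
    "senate defense tax",
    "senate treaty",
]

# Toy thresholds chosen for a 4-document corpus:
# keep terms found in at least 50% and at most 75% of the documents
toy_vectorizer = TfidfVectorizer(min_df=0.5, max_df=0.75)
toy_tfidf = toy_vectorizer.fit_transform(toy_docs)

kept_terms = sorted(toy_vectorizer.vocabulary_)
print(kept_terms)
print(toy_tfidf.shape)
```

Here "senate" is dropped because it occurs in every document, and "health", "defense" and "treaty" are dropped because each occurs in only one. In the real corpus the same mechanism removes near-universal procedural words and very rare terms.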

In [124]:
# Apply TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(min_df=0.05, max_df= 0.95) # preprocessing was done before
tfidf_lemma = tfidf_vectorizer.fit_transform(df_preprocessed_lemmatize)

In [126]:
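# Convert to df (terms as rows, documents as columns)
# Note: get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
df_tfidf_lemma = pd.DataFrame(tfidf_lemma.toarray().transpose(), index=tfidf_vectorizer.get_feature_names_out())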
df_tfidf_lemma.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99
aaa,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028858,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.009748,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.006132,0.0,0.008438,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aapcc,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010217,0.0,0.00307,0.0,0.004492,0.007736,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.024372,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016947,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058109
aaron,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.009305,0.002323,0.0,0.0,0.002016,0.0,0.0,0.001377,0.0,0.001699,0.0,0.0,0.001969,0.0,0.0,0.0,0.0,0.001679,0.0,0.002691,0.0,0.006164,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00225,0.0,0.0,0.0,0.0,0.0,0.0,0.001489,0.003029,0.004037,0.000901,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001389,0.0,0.001583,0.0,0.0,0.0,0.003357,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013859,0.004107,0.002589,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aarp,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002121,0.0,0.0,0.0,0.0,0.0,0.0,0.002196,0.0,0.0,0.03447,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00256,0.0,0.005643,0.0,0.011853,0.002925,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001903,0.0,0.0,0.0,0.003577,0.0,0.004631,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000955,0.0,0.0,0.0,0.002472,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005393,0.0,0.003375,0.0,0.002425,0.0,0.0,0.0,0.0,0.002116,0.0,0.004696,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aba,0.004618,0.0,0.0,0.00474,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01039,0.002327,0.0,0.0,0.0,0.0,0.004861,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.004626,0.0,0.0,0.0,0.0,0.0,0.00563,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020409,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00292,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aback,0.0,0.0,0.0,0.00158,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.004653,0.0,0.0,0.0,0.0,0.0,0.0,0.001269,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002533,0.002393,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001074,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.008206,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.004347,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001938,0.0,0.0
abandon,0.01448,0.005557,0.007081,0.004693,0.002841,0.002961,0.003554,0.00727,0.002406,0.010212,0.00527,0.009655,0.00666,0.022228,0.005348,0.005962,0.001762,0.011882,0.003958,0.007904,0.002516,0.00319,0.001715,0.002304,0.006654,0.004685,0.006189,0.012288,0.003209,0.003424,0.002512,0.003019,0.005486,0.006341,0.002934,0.002857,0.0,0.003434,0.005594,0.006555,0.003128,0.0,0.016587,0.004581,0.000949,0.004221,0.001389,0.004306,0.001998,0.004646,0.003906,0.001236,0.001483,0.001443,0.003167,0.002577,0.0,0.013791,0.002172,0.008073,0.013219,0.002976,0.007678,0.010215,0.0039,0.002364,0.021243,0.001346,0.004753,0.007414,0.003607,0.002856,0.004063,0.008582,0.007786,0.008831,0.0,0.000615,0.006719,0.006794,0.012072,0.001884,0.01052,0.004311,0.0,0.003856,0.0,0.003387,0.002947,0.003494,0.002202,0.002996,0.001516,0.0,0.0,0.000898,0.012083,0.003198,0.006345,0.0
abandonment,0.004307,0.0,0.002668,0.0,0.0,0.0,0.000957,0.002283,0.0,0.001603,0.0,0.010394,0.0,0.001551,0.0,0.0,0.0,0.00395,0.003728,0.001861,0.0,0.003005,0.0,0.0,0.003582,0.0,0.0,0.001362,0.0,0.000645,0.000789,0.0,0.0,0.0,0.001106,0.0,0.0,0.002157,0.0,0.007409,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000901,0.0,0.0,0.00184,0.002328,0.002793,0.0,0.0,0.0,0.0,0.0,0.0,0.001521,0.0,0.0,0.0,0.002405,0.0,0.0,0.0,0.0,0.0,0.0,0.004077,0.0,0.0,0.00147,0.001834,0.002627,0.0,0.0,0.002532,0.0016,0.0,0.0,0.00991,0.001624,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001692,0.002845,0.0,0.0,0.0
abate,0.0,0.0,0.0,0.001204,0.00656,0.0,0.001172,0.0,0.002777,0.0,0.0,0.0,0.002563,0.0,0.0,0.001251,0.0,0.001614,0.0,0.0,0.002904,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001933,0.001743,0.012664,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00386,0.0,0.0,0.0,0.0,0.0,0.002209,0.049822,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002507,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001554,0.0,0.0,0.001665,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.006205,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005085,0.0,0.0,0.0,0.0,0.002073,0.006974,0.0,0.0,0.0
abatement,0.0,0.0,0.0,0.001274,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003414,0.0,0.0,0.0,0.0,0.0,0.003751,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00335,0.0,0.0,0.001745,0.0,0.0,0.0,0.0,0.0,0.008169,0.0,0.0,0.0,0.0,0.0,0.001169,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.007957,0.003944,0.0,0.0,0.0,0.012476,0.0,0.0,0.0,0.0,0.0,0.0,0.005286,0.0,0.003308,0.0,0.0,0.0,0.0,0.0,0.016413,0.0,0.011795,0.0,0.0,0.0,0.0,0.016482,0.0,0.0,0.0,0.004267,0.00269,0.0,0.0,0.0,0.0,0.0,0.018447,0.0,0.0,0.0


In [127]:
# Assign column names to matrix
df_tfidf_lemma.columns = senator_names
df_tfidf_lemma.head(10)

filename,abraham-mi,akaka-hi,allard-co,ashcroft-mo,baucus-mt,bennett-ut,biden-de,bingaman-nm,bond-mo,boxer-ca,breaux-la,brownback-ks,bryan-nv,bumpers-ar,burns-mt,byrd-wv,campbell-co,chafee-ri,cleland-ga,coats-in,cochran-ms,collins-me,conrad-nd,coverdell-ga,craig-id,damato-ny,daschle-sd,dewine-oh,dodd-ct,domenici-nm,dorgan-nd,durbin-il,enzi-wy,faircloth-nc,feingold-wi,feinstein-ca,ford-ky,frist-tn,glenn-oh,gorton-wa,graham-fl,gramm-tx,grams-mn,grassley-ia,gregg-nh,hagel-ne,harkin-ia,hatch-ut,helms-nc,hollings-sc,hutchinson-ar,hutchison-tx,inhofe-ok,inouye-hi,jeffords-vt,johnson-sd,kempthorne-id,kennedy-ma,kerrey-ne,kerry-ma,kohl-wi,kyl-az,landrieu-la,lautenberg-nj,leahy-vt,levin-mi,lieberman-ct,lott-ms,lugar-in,mack-fl,mccain-az,mcconnell-ky,mikulski-md,moseleybraun-il,moynihan-ny,murkowski-ak,murray-wa,nickles-ok,reed-ri,reid-nv,robb-va,roberts-ks,rockefeller-wv,roth-de,santorum-pa,sarbanes-md,sessions-al,shelby-al,smith-nh,smith-or,snowe-me,specter-pa,stevens-ak,thomas-wy,thompson-tn,thurmond-sc,torricelli-nj,warner-va,wellstone-mn,wyden-or
aaa,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028858,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.009748,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.006132,0.0,0.008438,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aapcc,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010217,0.0,0.00307,0.0,0.004492,0.007736,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.024372,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016947,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058109
aaron,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.009305,0.002323,0.0,0.0,0.002016,0.0,0.0,0.001377,0.0,0.001699,0.0,0.0,0.001969,0.0,0.0,0.0,0.0,0.001679,0.0,0.002691,0.0,0.006164,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00225,0.0,0.0,0.0,0.0,0.0,0.0,0.001489,0.003029,0.004037,0.000901,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001389,0.0,0.001583,0.0,0.0,0.0,0.003357,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013859,0.004107,0.002589,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aarp,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002121,0.0,0.0,0.0,0.0,0.0,0.0,0.002196,0.0,0.0,0.03447,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00256,0.0,0.005643,0.0,0.011853,0.002925,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001903,0.0,0.0,0.0,0.003577,0.0,0.004631,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000955,0.0,0.0,0.0,0.002472,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005393,0.0,0.003375,0.0,0.002425,0.0,0.0,0.0,0.0,0.002116,0.0,0.004696,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aba,0.004618,0.0,0.0,0.00474,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01039,0.002327,0.0,0.0,0.0,0.0,0.004861,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.004626,0.0,0.0,0.0,0.0,0.0,0.00563,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020409,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00292,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aback,0.0,0.0,0.0,0.00158,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.004653,0.0,0.0,0.0,0.0,0.0,0.0,0.001269,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002533,0.002393,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001074,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.008206,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.004347,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001938,0.0,0.0
abandon,0.01448,0.005557,0.007081,0.004693,0.002841,0.002961,0.003554,0.00727,0.002406,0.010212,0.00527,0.009655,0.00666,0.022228,0.005348,0.005962,0.001762,0.011882,0.003958,0.007904,0.002516,0.00319,0.001715,0.002304,0.006654,0.004685,0.006189,0.012288,0.003209,0.003424,0.002512,0.003019,0.005486,0.006341,0.002934,0.002857,0.0,0.003434,0.005594,0.006555,0.003128,0.0,0.016587,0.004581,0.000949,0.004221,0.001389,0.004306,0.001998,0.004646,0.003906,0.001236,0.001483,0.001443,0.003167,0.002577,0.0,0.013791,0.002172,0.008073,0.013219,0.002976,0.007678,0.010215,0.0039,0.002364,0.021243,0.001346,0.004753,0.007414,0.003607,0.002856,0.004063,0.008582,0.007786,0.008831,0.0,0.000615,0.006719,0.006794,0.012072,0.001884,0.01052,0.004311,0.0,0.003856,0.0,0.003387,0.002947,0.003494,0.002202,0.002996,0.001516,0.0,0.0,0.000898,0.012083,0.003198,0.006345,0.0
abandonment,0.004307,0.0,0.002668,0.0,0.0,0.0,0.000957,0.002283,0.0,0.001603,0.0,0.010394,0.0,0.001551,0.0,0.0,0.0,0.00395,0.003728,0.001861,0.0,0.003005,0.0,0.0,0.003582,0.0,0.0,0.001362,0.0,0.000645,0.000789,0.0,0.0,0.0,0.001106,0.0,0.0,0.002157,0.0,0.007409,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000901,0.0,0.0,0.00184,0.002328,0.002793,0.0,0.0,0.0,0.0,0.0,0.0,0.001521,0.0,0.0,0.0,0.002405,0.0,0.0,0.0,0.0,0.0,0.0,0.004077,0.0,0.0,0.00147,0.001834,0.002627,0.0,0.0,0.002532,0.0016,0.0,0.0,0.00991,0.001624,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001692,0.002845,0.0,0.0,0.0
abate,0.0,0.0,0.0,0.001204,0.00656,0.0,0.001172,0.0,0.002777,0.0,0.0,0.0,0.002563,0.0,0.0,0.001251,0.0,0.001614,0.0,0.0,0.002904,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001933,0.001743,0.012664,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00386,0.0,0.0,0.0,0.0,0.0,0.002209,0.049822,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002507,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001554,0.0,0.0,0.001665,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.006205,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005085,0.0,0.0,0.0,0.0,0.002073,0.006974,0.0,0.0,0.0
abatement,0.0,0.0,0.0,0.001274,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003414,0.0,0.0,0.0,0.0,0.0,0.003751,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00335,0.0,0.0,0.001745,0.0,0.0,0.0,0.0,0.0,0.008169,0.0,0.0,0.0,0.0,0.0,0.001169,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.007957,0.003944,0.0,0.0,0.0,0.012476,0.0,0.0,0.0,0.0,0.0,0.0,0.005286,0.0,0.003308,0.0,0.0,0.0,0.0,0.0,0.016413,0.0,0.011795,0.0,0.0,0.0,0.0,0.016482,0.0,0.0,0.0,0.004267,0.00269,0.0,0.0,0.0,0.0,0.0,0.018447,0.0,0.0,0.0


Note that the vocabulary shrinks by about 6,000 terms compared to the raw tokenized dataset without lemmatization.

In [128]:
len(df_tfidf_lemma)

16275

#### 4.4. Apply cosine similarity

In [129]:
# Reshape each senator's TF-IDF column into a 1 x n_terms row vector (txt1, txt2, ...)
for i in range(1, num_col+1):
    globals()["txt" + str(i)] = df_tfidf_lemma[df_tfidf_lemma.columns[i-1]].values.reshape(1, -1) 

In [130]:
# Biden
txt7

array([[0.        , 0.        , 0.        , ..., 0.02007284, 0.        ,
        0.00170392]])

In [131]:
# Similarity between Biden and Roberts - sample
print("Similarity Biden and Roberts:", cosine_similarity(txt7, txt82))

Similarity Biden and Roberts: [[0.47130203]]


In [132]:
# Reshape to vector
import numpy as np
for i in range(1, num_col+1):
    globals()["txt" + str(i)] = np.squeeze(np.asarray(globals()["txt" + str(i)]))

In [133]:
txt7

array([0.        , 0.        , 0.        , ..., 0.02007284, 0.        ,
       0.00170392])

In [134]:
# Create list of the vector variable names (txt1, txt2, ...)
list_of_txts = ["txt" + str(i) for i in range(1, num_col+1)]
print(list_of_txts)

['txt1', 'txt2', 'txt3', 'txt4', 'txt5', 'txt6', 'txt7', 'txt8', 'txt9', 'txt10', 'txt11', 'txt12', 'txt13', 'txt14', 'txt15', 'txt16', 'txt17', 'txt18', 'txt19', 'txt20', 'txt21', 'txt22', 'txt23', 'txt24', 'txt25', 'txt26', 'txt27', 'txt28', 'txt29', 'txt30', 'txt31', 'txt32', 'txt33', 'txt34', 'txt35', 'txt36', 'txt37', 'txt38', 'txt39', 'txt40', 'txt41', 'txt42', 'txt43', 'txt44', 'txt45', 'txt46', 'txt47', 'txt48', 'txt49', 'txt50', 'txt51', 'txt52', 'txt53', 'txt54', 'txt55', 'txt56', 'txt57', 'txt58', 'txt59', 'txt60', 'txt61', 'txt62', 'txt63', 'txt64', 'txt65', 'txt66', 'txt67', 'txt68', 'txt69', 'txt70', 'txt71', 'txt72', 'txt73', 'txt74', 'txt75', 'txt76', 'txt77', 'txt78', 'txt79', 'txt80', 'txt81', 'txt82', 'txt83', 'txt84', 'txt85', 'txt86', 'txt87', 'txt88', 'txt89', 'txt90', 'txt91', 'txt92', 'txt93', 'txt94', 'txt95', 'txt96', 'txt97', 'txt98', 'txt99', 'txt100']


In [135]:
# Create similarity matrix
from scipy import sparse
# Stack the per-senator vectors into one matrix (one row per senator)
A = np.array([globals()[name] for name in list_of_txts])
A_sparse = sparse.csr_matrix(A)
similarities = cosine_similarity(A_sparse)
print('pairwise dense output:\n {}\n'.format(similarities))

pairwise dense output:
 [[1.         0.11048625 0.12188617 ... 0.08611824 0.15113374 0.12806296]
 [0.11048625 1.         0.06546759 ... 0.04187035 0.07262217 0.06461585]
 [0.12188617 0.06546759 1.         ... 0.05451684 0.08047845 0.07196163]
 ...
 [0.08611824 0.04187035 0.05451684 ... 1.         0.07501323 0.07624209]
 [0.15113374 0.07262217 0.08047845 ... 0.07501323 1.         0.12278158]
 [0.12806296 0.06461585 0.07196163 ... 0.07624209 0.12278158 1.        ]]
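
For reference, the pairwise matrix above is the cosine similarity cos(a, b) = (a · b) / (|a| |b|) evaluated for every pair of rows. The sketch below reproduces that computation with plain NumPy on a made-up 3 x 3 matrix. The values and the helper name `cosine_matrix` are illustrative assumptions, and the helper assumes that no row is all zeros.

```python
import numpy as np

def cosine_matrix(rows):
    """Pairwise cosine similarity between the rows of a 2-D array.

    Assumes every row has a non-zero norm.
    """
    rows = np.asarray(rows, dtype=float)
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    unit_rows = rows / norms          # scale every row to length 1
    return unit_rows @ unit_rows.T    # dot products of unit vectors

toy = np.array([
    [1.0, 0.0, 1.0],
    [2.0, 0.0, 2.0],   # same direction as row 0
    [0.0, 3.0, 0.0],   # no terms in common with rows 0 and 1
])
toy_similarities = cosine_matrix(toy)
print(np.round(toy_similarities, 3))
```

Because this operates on the whole matrix at once, the same idea should apply directly to `df_tfidf_lemma.T.values` (one row per senator), without creating a separate variable for each senator.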



In [136]:
# Define similarity matrix
df_similarity = pd.DataFrame(similarities)
df_similarity.columns = senator_names
df_similarity.insert(0,'senator',senator_names)
df_similarity.head()

filename,senator,abraham-mi,akaka-hi,allard-co,ashcroft-mo,baucus-mt,bennett-ut,biden-de,bingaman-nm,bond-mo,boxer-ca,breaux-la,brownback-ks,bryan-nv,bumpers-ar,burns-mt,byrd-wv,campbell-co,chafee-ri,cleland-ga,coats-in,cochran-ms,collins-me,conrad-nd,coverdell-ga,craig-id,damato-ny,daschle-sd,dewine-oh,dodd-ct,domenici-nm,dorgan-nd,durbin-il,enzi-wy,faircloth-nc,feingold-wi,feinstein-ca,ford-ky,frist-tn,glenn-oh,gorton-wa,graham-fl,gramm-tx,grams-mn,grassley-ia,gregg-nh,hagel-ne,harkin-ia,hatch-ut,helms-nc,hollings-sc,hutchinson-ar,hutchison-tx,inhofe-ok,inouye-hi,jeffords-vt,johnson-sd,kempthorne-id,kennedy-ma,kerrey-ne,kerry-ma,kohl-wi,kyl-az,landrieu-la,lautenberg-nj,leahy-vt,levin-mi,lieberman-ct,lott-ms,lugar-in,mack-fl,mccain-az,mcconnell-ky,mikulski-md,moseleybraun-il,moynihan-ny,murkowski-ak,murray-wa,nickles-ok,reed-ri,reid-nv,robb-va,roberts-ks,rockefeller-wv,roth-de,santorum-pa,sarbanes-md,sessions-al,shelby-al,smith-nh,smith-or,snowe-me,specter-pa,stevens-ak,thomas-wy,thompson-tn,thurmond-sc,torricelli-nj,warner-va,wellstone-mn,wyden-or
0,abraham-mi,1.0,0.110486,0.121886,0.214059,0.123617,0.110593,0.151989,0.129213,0.15021,0.157791,0.095588,0.134765,0.105546,0.119687,0.120104,0.141327,0.072466,0.116841,0.125299,0.176683,0.079305,0.121118,0.108866,0.136141,0.145726,0.105709,0.199346,0.168373,0.177827,0.09132,0.141605,0.1794,0.110488,0.13963,0.166749,0.214007,0.069387,0.152558,0.152071,0.127789,0.138701,0.13159,0.166242,0.129377,0.099104,0.126492,0.121284,0.196111,0.071877,0.151861,0.265876,0.116147,0.091634,0.08881,0.124504,0.112484,0.12956,0.203818,0.130722,0.156337,0.143031,0.195537,0.097383,0.128833,0.185429,0.254233,0.250136,0.177074,0.11766,0.140678,0.239406,0.131559,0.146161,0.122846,0.187485,0.130099,0.173255,0.106051,0.158631,0.113034,0.12609,0.132755,0.148182,0.089159,0.187348,0.099303,0.200946,0.090585,0.158143,0.105178,0.159254,0.126848,0.049765,0.108563,0.120178,0.076067,0.161172,0.086118,0.151134,0.128063
1,akaka-hi,0.110486,1.0,0.065468,0.072664,0.09857,0.064234,0.063162,0.093485,0.086077,0.094349,0.078915,0.048435,0.070707,0.070815,0.077662,0.07767,0.044663,0.101087,0.083577,0.066125,0.052789,0.05159,0.070232,0.041102,0.101871,0.052757,0.133835,0.065665,0.082087,0.046288,0.081157,0.073214,0.057167,0.072905,0.073599,0.105232,0.102397,0.068755,0.10934,0.095666,0.088423,0.047279,0.074221,0.066862,0.039577,0.055898,0.071746,0.083275,0.050783,0.093374,0.078034,0.061413,0.047155,0.38691,0.078851,0.074785,0.135439,0.087925,0.068755,0.079414,0.053345,0.075361,0.059408,0.06062,0.079275,0.059731,0.115228,0.083935,0.077414,0.060933,0.16078,0.054928,0.070322,0.054763,0.105119,0.156191,0.103375,0.04368,0.071026,0.068767,0.068111,0.070587,0.092647,0.048896,0.039737,0.060879,0.10293,0.048344,0.052164,0.05623,0.086882,0.071064,0.059385,0.084238,0.067905,0.043477,0.096029,0.04187,0.072622,0.064616
2,allard-co,0.121886,0.065468,1.0,0.092392,0.078192,0.080981,0.101882,0.103268,0.108509,0.076696,0.058329,0.06946,0.075152,0.074783,0.073945,0.086406,0.164342,0.083928,0.084419,0.084638,0.077257,0.059844,0.083738,0.070937,0.111811,0.075138,0.133717,0.070808,0.091461,0.07613,0.101601,0.081377,0.121183,0.091644,0.097902,0.088859,0.042602,0.070726,0.105416,0.084704,0.082921,0.072132,0.124386,0.073762,0.063567,0.125935,0.077832,0.101616,0.057995,0.112364,0.088149,0.068559,0.071576,0.065584,0.068781,0.075228,0.074843,0.086466,0.097245,0.101952,0.068368,0.147341,0.048064,0.098142,0.078851,0.085456,0.131475,0.132196,0.076805,0.086286,0.165757,0.09033,0.093694,0.067951,0.109765,0.081287,0.071838,0.074081,0.086208,0.071823,0.099806,0.096207,0.075495,0.085392,0.061778,0.080197,0.093924,0.081228,0.072344,0.070311,0.087425,0.085684,0.037038,0.104308,0.094633,0.051181,0.091434,0.054517,0.080478,0.071962
3,ashcroft-mo,0.214059,0.072664,0.092392,1.0,0.111123,0.104321,0.226768,0.163228,0.195251,0.153874,0.084892,0.166423,0.079495,0.100717,0.14682,0.140848,0.077476,0.09337,0.134395,0.181675,0.087047,0.115632,0.134773,0.135973,0.174711,0.089264,0.234666,0.174219,0.174967,0.097335,0.155381,0.183358,0.167183,0.138362,0.134596,0.175226,0.070592,0.209229,0.146728,0.124806,0.117348,0.154125,0.156386,0.148036,0.111201,0.139797,0.157188,0.261714,0.082818,0.145533,0.222038,0.139247,0.103867,0.071941,0.13969,0.101042,0.117318,0.271362,0.161475,0.197746,0.16041,0.205969,0.083343,0.133896,0.244812,0.114556,0.232832,0.223579,0.126972,0.104501,0.241002,0.122712,0.160964,0.121427,0.16603,0.094973,0.158286,0.125762,0.184543,0.099817,0.115282,0.188682,0.142479,0.113864,0.156466,0.080747,0.279985,0.081191,0.147918,0.142217,0.137912,0.171654,0.045895,0.118334,0.132782,0.085482,0.147667,0.095386,0.167347,0.158933
4,baucus-mt,0.123617,0.09857,0.078192,0.111123,1.0,0.081132,0.078265,0.096451,0.126748,0.104981,0.085839,0.063299,0.101078,0.125296,0.338919,0.162266,0.07501,0.228886,0.072356,0.070646,0.082784,0.079888,0.118288,0.052525,0.139724,0.068561,0.183162,0.068685,0.110021,0.081857,0.170368,0.092139,0.091967,0.075711,0.103888,0.099761,0.069881,0.07724,0.124119,0.12582,0.121063,0.11512,0.103585,0.099374,0.070076,0.096443,0.095958,0.114095,0.043049,0.107313,0.110929,0.067638,0.058667,0.083406,0.089778,0.123943,0.154837,0.104136,0.121932,0.114643,0.072253,0.103576,0.068896,0.117586,0.089525,0.0769,0.13375,0.1425,0.123686,0.087321,0.167053,0.095765,0.077956,0.083325,0.130921,0.248911,0.132655,0.078663,0.087425,0.110742,0.087313,0.138901,0.1131,0.089931,0.049414,0.087264,0.095424,0.064698,0.065495,0.088357,0.1021,0.082165,0.081829,0.15406,0.085559,0.040457,0.100667,0.125115,0.103735,0.117801


In [137]:
# Get most similar senator to Biden
df_similarity_nobiden = df_similarity[df_similarity['senator']!= 'biden-de']
df_similarity_nobiden[df_similarity_nobiden['biden-de'] == df_similarity_nobiden['biden-de'].max()]

filename,senator,abraham-mi,akaka-hi,allard-co,ashcroft-mo,baucus-mt,bennett-ut,biden-de,bingaman-nm,bond-mo,boxer-ca,breaux-la,brownback-ks,bryan-nv,bumpers-ar,burns-mt,byrd-wv,campbell-co,chafee-ri,cleland-ga,coats-in,cochran-ms,collins-me,conrad-nd,coverdell-ga,craig-id,damato-ny,daschle-sd,dewine-oh,dodd-ct,domenici-nm,dorgan-nd,durbin-il,enzi-wy,faircloth-nc,feingold-wi,feinstein-ca,ford-ky,frist-tn,glenn-oh,gorton-wa,graham-fl,gramm-tx,grams-mn,grassley-ia,gregg-nh,hagel-ne,harkin-ia,hatch-ut,helms-nc,hollings-sc,hutchinson-ar,hutchison-tx,inhofe-ok,inouye-hi,jeffords-vt,johnson-sd,kempthorne-id,kennedy-ma,kerrey-ne,kerry-ma,kohl-wi,kyl-az,landrieu-la,lautenberg-nj,leahy-vt,levin-mi,lieberman-ct,lott-ms,lugar-in,mack-fl,mccain-az,mcconnell-ky,mikulski-md,moseleybraun-il,moynihan-ny,murkowski-ak,murray-wa,nickles-ok,reed-ri,reid-nv,robb-va,roberts-ks,rockefeller-wv,roth-de,santorum-pa,sarbanes-md,sessions-al,shelby-al,smith-nh,smith-or,snowe-me,specter-pa,stevens-ak,thomas-wy,thompson-tn,thurmond-sc,torricelli-nj,warner-va,wellstone-mn,wyden-or
81,roberts-ks,0.132755,0.070587,0.096207,0.188682,0.138901,0.100805,0.471302,0.195276,0.131084,0.110772,0.074375,0.159208,0.083628,0.131326,0.181939,0.181148,0.085966,0.085364,0.155381,0.169284,0.16582,0.09018,0.200914,0.077455,0.180416,0.110204,0.287054,0.100705,0.179364,0.099173,0.255316,0.181508,0.100574,0.106002,0.192422,0.155171,0.071168,0.079236,0.205905,0.126696,0.105784,0.119414,0.172317,0.119841,0.085056,0.28847,0.213265,0.120448,0.107727,0.121769,0.099517,0.24471,0.173814,0.09958,0.089892,0.176404,0.144169,0.100878,0.222436,0.161458,0.125509,0.221076,0.107566,0.09926,0.116137,0.196712,0.308028,0.252676,0.391771,0.097395,0.248406,0.163773,0.172666,0.078592,0.153456,0.128075,0.12858,0.092521,0.129321,0.073731,0.163733,1.0,0.105288,0.164514,0.062325,0.075766,0.116758,0.111478,0.136524,0.225105,0.166806,0.134264,0.100378,0.134859,0.123464,0.153761,0.149487,0.188456,0.17122,0.108265


With lemmatization, the senator whose speeches are most similar to Biden's is still Patrick Roberts, Republican senator from Kansas. However, the cosine similarity between the two senators' speeches increased from 0.393141 to 0.471302.
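One plausible reason for the higher score is that lemmatization maps inflected forms to a shared base form, so two speeches share more vocabulary. The sketch below illustrates the effect on two toy token lists. It is not the notebook's pipeline: it uses raw term counts rather than the notebook's vectorizer, and `TOY_LEMMAS` is a hypothetical hand-written lookup standing in for `WordNetLemmatizer`.

```python
import math
from collections import Counter


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words token lists."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical lemma lookup (assumption): a stand-in for a real lemmatizer
TOY_LEMMAS = {"treaties": "treaty", "nations": "nation", "allies": "ally"}


def lemmatize(tokens):
    """Replace each token by its toy lemma, leaving unknown tokens unchanged."""
    return [TOY_LEMMAS.get(t, t) for t in tokens]


# Two toy "speeches" that use different inflections of the same words
doc_a = ["treaty", "nations", "security", "allies"]
doc_b = ["treaties", "nation", "security", "ally"]

raw_sim = cosine(doc_a, doc_b)                          # only "security" overlaps
lemma_sim = cosine(lemmatize(doc_a), lemmatize(doc_b))  # all four terms overlap
print(f"raw: {raw_sim:.3f}, lemmatized: {lemma_sim:.3f}")
```

On real speeches the change is far smaller than in this toy case, and it need not be positive for every pair of senators.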

Let's check the 10 most similar senators to Biden.

In [139]:
df_similarity_nobiden.sort_values('biden-de', ascending=False).head(10)

filename,senator,abraham-mi,akaka-hi,allard-co,ashcroft-mo,baucus-mt,bennett-ut,biden-de,bingaman-nm,bond-mo,boxer-ca,breaux-la,brownback-ks,bryan-nv,bumpers-ar,burns-mt,byrd-wv,campbell-co,chafee-ri,cleland-ga,coats-in,cochran-ms,collins-me,conrad-nd,coverdell-ga,craig-id,damato-ny,daschle-sd,dewine-oh,dodd-ct,domenici-nm,dorgan-nd,durbin-il,enzi-wy,faircloth-nc,feingold-wi,feinstein-ca,ford-ky,frist-tn,glenn-oh,gorton-wa,graham-fl,gramm-tx,grams-mn,grassley-ia,gregg-nh,hagel-ne,harkin-ia,hatch-ut,helms-nc,hollings-sc,hutchinson-ar,hutchison-tx,inhofe-ok,inouye-hi,jeffords-vt,johnson-sd,kempthorne-id,kennedy-ma,kerrey-ne,kerry-ma,kohl-wi,kyl-az,landrieu-la,lautenberg-nj,leahy-vt,levin-mi,lieberman-ct,lott-ms,lugar-in,mack-fl,mccain-az,mcconnell-ky,mikulski-md,moseleybraun-il,moynihan-ny,murkowski-ak,murray-wa,nickles-ok,reed-ri,reid-nv,robb-va,roberts-ks,rockefeller-wv,roth-de,santorum-pa,sarbanes-md,sessions-al,shelby-al,smith-nh,smith-or,snowe-me,specter-pa,stevens-ak,thomas-wy,thompson-tn,thurmond-sc,torricelli-nj,warner-va,wellstone-mn,wyden-or
81,roberts-ks,0.132755,0.070587,0.096207,0.188682,0.138901,0.100805,0.471302,0.195276,0.131084,0.110772,0.074375,0.159208,0.083628,0.131326,0.181939,0.181148,0.085966,0.085364,0.155381,0.169284,0.16582,0.09018,0.200914,0.077455,0.180416,0.110204,0.287054,0.100705,0.179364,0.099173,0.255316,0.181508,0.100574,0.106002,0.192422,0.155171,0.071168,0.079236,0.205905,0.126696,0.105784,0.119414,0.172317,0.119841,0.085056,0.28847,0.213265,0.120448,0.107727,0.121769,0.099517,0.24471,0.173814,0.09958,0.089892,0.176404,0.144169,0.100878,0.222436,0.161458,0.125509,0.221076,0.107566,0.09926,0.116137,0.196712,0.308028,0.252676,0.391771,0.097395,0.248406,0.163773,0.172666,0.078592,0.153456,0.128075,0.12858,0.092521,0.129321,0.073731,0.163733,1.0,0.105288,0.164514,0.062325,0.075766,0.116758,0.111478,0.136524,0.225105,0.166806,0.134264,0.100378,0.134859,0.123464,0.153761,0.149487,0.188456,0.17122,0.108265
66,lieberman-ct,0.250136,0.115228,0.131475,0.232832,0.13375,0.136649,0.326881,0.239909,0.184354,0.169713,0.128144,0.183956,0.121617,0.136527,0.12472,0.212809,0.113413,0.13459,0.184313,0.348708,0.134216,0.132401,0.146804,0.118434,0.174891,0.142368,0.298004,0.152042,0.352085,0.109662,0.194414,0.222346,0.154926,0.139691,0.205512,0.193112,0.080104,0.202497,0.235634,0.155635,0.16063,0.126758,0.165578,0.144386,0.128842,0.193534,0.160872,0.195574,0.09503,0.167105,0.201606,0.18714,0.160004,0.132845,0.181515,0.152753,0.139254,0.275581,0.236458,0.239398,0.168021,0.26015,0.103318,0.172775,0.177745,0.212771,1.0,0.229947,0.201471,0.182716,0.298792,0.148767,0.208681,0.139699,0.244065,0.164312,0.177393,0.127113,0.216627,0.121423,0.15757,0.308028,0.227167,0.159762,0.124174,0.119511,0.19073,0.128585,0.163789,0.168961,0.197497,0.192162,0.065774,0.135354,0.193561,0.120344,0.230036,0.140119,0.187314,0.179866
89,smith-or,0.105178,0.05623,0.070311,0.142217,0.088357,0.054417,0.277172,0.132375,0.07273,0.088831,0.042779,0.058738,0.097629,0.086906,0.109686,0.087907,0.050968,0.06761,0.096516,0.098821,0.058633,0.097074,0.076569,0.050038,0.172423,0.073941,0.141539,0.072119,0.095142,0.050533,0.090407,0.110844,0.065891,0.063218,0.083959,0.124313,0.049575,0.063268,0.086425,0.143523,0.086682,0.052756,0.07314,0.060532,0.065435,0.083582,0.091904,0.113387,0.051375,0.066038,0.079403,0.130125,0.063686,0.042005,0.070798,0.064117,0.11114,0.076779,0.087448,0.097698,0.091444,0.127663,0.0481,0.068925,0.089331,0.096562,0.168961,0.128515,0.122326,0.056966,0.149167,0.093124,0.119412,0.056441,0.114433,0.098071,0.11854,0.063115,0.085417,0.072263,0.0766,0.225105,0.059428,0.117685,0.066452,0.044396,0.092869,0.065979,0.234328,1.0,0.100768,0.074295,0.044367,0.069462,0.096123,0.078464,0.09254,0.107276,0.094106,0.509715
68,lugar-in,0.11766,0.077414,0.076805,0.126972,0.123686,0.079239,0.274742,0.145842,0.110765,0.085368,0.070018,0.111974,0.085374,0.084771,0.137591,0.100969,0.050962,0.090688,0.109138,0.129159,0.169637,0.08421,0.145385,0.061895,0.129066,0.079109,0.215033,0.075284,0.158237,0.081211,0.175034,0.126664,0.071401,0.089631,0.163936,0.121916,0.10628,0.068311,0.213157,0.096996,0.099904,0.101906,0.168408,0.109989,0.069494,0.160911,0.213709,0.098669,0.096419,0.120179,0.09741,0.107849,0.085319,0.056553,0.087371,0.155751,0.101944,0.098557,0.14968,0.136237,0.103536,0.222111,0.086851,0.081154,0.112726,0.1182,0.201471,0.172092,1.0,0.076918,0.181949,0.119683,0.114025,0.075017,0.138079,0.093332,0.102953,0.069696,0.104826,0.056731,0.188865,0.391771,0.090959,0.119029,0.054233,0.078735,0.090522,0.075471,0.080396,0.122326,0.110494,0.100151,0.045136,0.088135,0.107327,0.076879,0.115756,0.088765,0.137848,0.085267
26,daschle-sd,0.199346,0.133835,0.133717,0.234666,0.183162,0.127428,0.260803,0.216337,0.222679,0.185653,0.148583,0.192697,0.130205,0.181136,0.200047,0.24016,0.129343,0.165221,0.173978,0.197847,0.198395,0.157561,0.342865,0.148865,0.228819,0.148352,1.0,0.162146,0.252097,0.137303,0.348781,0.257241,0.165102,0.165912,0.187815,0.202635,0.19843,0.236289,0.201381,0.154717,0.204918,0.24618,0.215413,0.17934,0.148651,0.161328,0.234456,0.209304,0.095211,0.201228,0.162801,0.1707,0.140552,0.154377,0.17515,0.357816,0.170649,0.324053,0.254738,0.267159,0.18427,0.297059,0.126778,0.192993,0.202298,0.201189,0.298004,0.366189,0.215033,0.14025,0.350794,0.175236,0.206768,0.171436,0.217312,0.151816,0.190045,0.139308,0.216368,0.144423,0.178078,0.287054,0.220014,0.152682,0.118061,0.112721,0.197205,0.124378,0.154938,0.141539,0.201768,0.171212,0.079965,0.146182,0.165655,0.135472,0.231405,0.137978,0.22924,0.209386
61,kyl-az,0.195537,0.075361,0.147341,0.205969,0.103576,0.128219,0.257698,0.178381,0.164114,0.139545,0.089301,0.164202,0.156487,0.135247,0.132515,0.140894,0.121641,0.086888,0.129349,0.175789,0.204577,0.136821,0.157249,0.102594,0.188784,0.098338,0.297059,0.128668,0.193169,0.120119,0.178242,0.175679,0.169307,0.164844,0.168809,0.239239,0.058302,0.176342,0.205914,0.134833,0.149128,0.144358,0.182306,0.212472,0.134291,0.177494,0.13097,0.227093,0.119821,0.158334,0.199187,0.128942,0.164992,0.126569,0.110295,0.12743,0.126353,0.201269,0.217474,0.22391,0.150444,1.0,0.076885,0.120321,0.207077,0.182277,0.26015,0.248163,0.222111,0.149374,0.264531,0.145966,0.152443,0.09577,0.175624,0.141542,0.141199,0.141165,0.168919,0.133274,0.127872,0.221076,0.150436,0.123805,0.147003,0.09701,0.254532,0.174602,0.161765,0.127663,0.153622,0.162453,0.067102,0.125419,0.167331,0.135425,0.188615,0.103126,0.159059,0.160666
51,hutchison-tx,0.116147,0.061413,0.068559,0.139247,0.067638,0.065036,0.252216,0.155428,0.088102,0.349452,0.164067,0.091737,0.048842,0.083748,0.084436,0.133874,0.048971,0.067149,0.108255,0.127585,0.092736,0.080547,0.083916,0.078177,0.107011,0.081624,0.1707,0.079762,0.104522,0.075326,0.101575,0.11169,0.061419,0.109123,0.108835,0.330302,0.054234,0.075063,0.097939,0.106771,0.070777,0.107901,0.109076,0.068191,0.070495,0.098647,0.093313,0.092546,0.075063,0.13609,0.0762,1.0,0.13606,0.070651,0.07966,0.070071,0.074645,0.096535,0.110503,0.116301,0.099172,0.128942,0.076453,0.098761,0.076971,0.124155,0.18714,0.193924,0.107849,0.069778,0.231107,0.074978,0.14161,0.078406,0.103078,0.09829,0.29663,0.074317,0.108413,0.060449,0.088824,0.24471,0.09583,0.149199,0.043704,0.057636,0.103043,0.077062,0.105603,0.130125,0.157216,0.097048,0.062092,0.060623,0.067048,0.101035,0.109799,0.116374,0.091392,0.069192
7,bingaman-nm,0.129213,0.093485,0.103268,0.163228,0.096451,0.100313,0.248942,1.0,0.118624,0.134431,0.09094,0.084004,0.185778,0.12622,0.097584,0.140263,0.115711,0.096709,0.125371,0.139204,0.119666,0.088436,0.13364,0.089287,0.172833,0.085482,0.216337,0.100504,0.1602,0.170018,0.148431,0.147942,0.126662,0.088059,0.109982,0.149705,0.073622,0.136394,0.147984,0.124603,0.128413,0.082286,0.145288,0.102365,0.094813,0.106164,0.143588,0.13878,0.0952,0.112666,0.101932,0.155428,0.095402,0.106479,0.132243,0.118122,0.130373,0.164548,0.143551,0.148303,0.114008,0.178381,0.105799,0.124857,0.1149,0.140631,0.239909,0.17211,0.145842,0.096777,0.214626,0.097332,0.149469,0.09987,0.148995,0.250691,0.143589,0.071095,0.153309,0.186101,0.104008,0.195276,0.135912,0.147455,0.057149,0.079038,0.133395,0.078464,0.121779,0.132375,0.126187,0.113593,0.059805,0.097396,0.1119,0.089144,0.146179,0.115085,0.132648,0.119067
83,roth-de,0.089159,0.048896,0.085392,0.113864,0.089931,0.055541,0.244127,0.147455,0.094681,0.063997,0.089426,0.072068,0.051145,0.058907,0.068094,0.082736,0.061572,0.089713,0.092713,0.096736,0.063759,0.085231,0.092681,0.095052,0.097046,0.070821,0.152682,0.062877,0.093186,0.078388,0.087807,0.108935,0.071047,0.074061,0.079746,0.084466,0.039295,0.071023,0.095562,0.072305,0.099996,0.090016,0.093045,0.106055,0.095225,0.096654,0.082691,0.101036,0.051031,0.100971,0.068136,0.149199,0.057109,0.070308,0.092134,0.071664,0.069095,0.088859,0.200739,0.119008,0.088079,0.123805,0.043402,0.102305,0.069349,0.096833,0.159762,0.186238,0.119029,0.071813,0.196559,0.062976,0.124198,0.068473,0.210654,0.072184,0.080466,0.080614,0.114211,0.050441,0.07129,0.164514,0.115112,1.0,0.039579,0.054876,0.073759,0.078365,0.084762,0.117685,0.107327,0.086818,0.042816,0.068393,0.080228,0.056023,0.084005,0.087552,0.090453,0.085138
72,mikulski-md,0.146161,0.070322,0.093694,0.160964,0.077956,0.06843,0.238297,0.149469,0.277906,0.187081,0.099706,0.082226,0.072292,0.153656,0.104439,0.127458,0.062626,0.084988,0.136685,0.196564,0.089206,0.127184,0.127957,0.079678,0.107978,0.14417,0.206768,0.134617,0.187997,0.069576,0.136217,0.184695,0.084707,0.113785,0.115766,0.192646,0.105441,0.225299,0.171249,0.075826,0.124772,0.081705,0.134393,0.110863,0.099843,0.103743,0.165751,0.149079,0.066457,0.115598,0.125898,0.14161,0.088945,0.103322,0.189312,0.11159,0.098844,0.285345,0.136346,0.151326,0.131826,0.152443,0.094832,0.128508,0.124247,0.111404,0.208681,0.167675,0.114025,0.188286,0.19795,0.088788,1.0,0.133869,0.155828,0.088263,0.181517,0.10747,0.17941,0.10463,0.119042,0.172666,0.163227,0.124198,0.155772,0.227209,0.154638,0.065965,0.130659,0.119412,0.25544,0.173231,0.053222,0.062232,0.094108,0.073223,0.141973,0.109972,0.167218,0.113702


The picture is very similar to the simple tokenization case. The difference is that with lemmatization the top 10 most similar senators comprise 6 Republicans and 4 Democrats: Democratic senator Mikulski replaces McCain and ranks as the 10th most similar senator to Biden.
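The party tally, and the check for senators from Biden's own state, can be sketched as follows. The senator identifiers are copied from the top-10 table above. The party labels are entered by hand for illustration only; in the notebook they would presumably come from `sen105kh_fix.csv`, whose column layout is not shown here.

```python
from collections import Counter

# Top-10 identifiers from the lemmatized ranking, in rank order.
# Party labels are hand-entered for this sketch (assumption: not read
# from sen105kh_fix.csv, whose columns are not shown in this section).
top10_party = {
    "roberts-ks": "R",
    "lieberman-ct": "D",
    "smith-or": "R",
    "lugar-in": "R",
    "daschle-sd": "D",
    "kyl-az": "R",
    "hutchison-tx": "R",
    "bingaman-nm": "D",
    "roth-de": "R",
    "mikulski-md": "D",
}

# Count senators per party
party_counts = Counter(top10_party.values())

# The state is the suffix after the last hyphen; Biden's identifier is biden-de
same_state = [s for s in top10_party if s.rsplit("-", 1)[1] == "de"]

print(party_counts)
print(same_state)
```

Only one of the ten, Roth, shares Biden's state, and he belongs to the other party, so in this ranking neither state nor party alone accounts for the similarity.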