### Imports

In [1]:
import pandas as pd
import numpy as np #used to handle np.nan

In [2]:
raw = pd.read_csv('./Project 4/DSIR907-Project4-Group1/cleaned_debate.csv') #forward slashes avoid invalid escape sequences such as '\P'

In [3]:
raw.shape

(9051, 3)

Below is a list of candidate names that are correctly spelled.

In [4]:
candidate_list = ['Nixon', 'Kennedy', 'Carter', 'Ford', 'Reagan', 'Anderson', 'Mondale', 'Ferraro', 'Bush', 'Dukakis', 'Quayle', 'Bentsen', 'Clinton', 'Bush', 'Perot', 'Gore', 'Stockdale', 'Dole', 'Kemp', 'Lieberman', 'Cheney', 'Kerry', 'Edwards', 'McCain', 'Obama', 'Biden', 'Palin', 'Romney', 'Ryan', 'Trump', 'Pence', 'Kaine', 'Harris']

For our purposes, it will be useful to also have a list of some of the moderators' names, along with generic labels such as `audience` and `moderator`.

In [5]:
candidate_list.extend(['warner', 'cronkite', 'Lieberman', 'smith', 'kelly', 'thomas', 'brokaw', 'fowler', "O'brien", 'baltimore', 'audience', 'mondale', 'newman', 'david', 'lehrer', 'speakers', 'washington', 'holt', 'wallace', 'moderator',  'schieffer', 'participants', 'moderators', 'chancellor', 'crowley','fleck', 'giannotti', 'farley', 'niven', 'berkley','spivak', 'quijano', 'page', 'hubb', 'mashek','dube'])

Uncomment the following line to see all the speakers in the debate. This was used to assist in constructing the list above.

In [6]:
#raw['Speaker'].unique()

While running our algorithm, we found that one `Speaker` value was `NaN`. We locate it here and deal with it later.

In [7]:
raw[raw['Speaker'].isna()]

Unnamed: 0,Speaker,Text,Placeholder
6497,,"Well, I believe that we did the appropriate t...",25


Reading the text below, we find that this was a statement by Clinton.

In [8]:
raw.iloc[6497,1]

' Well, I believe that we did the appropriate thing under the circumstances. Saddam Hussein is under a U.N. resolution not to threaten his neighbors or threaten his own, repress his own citizens. Unfortunately, a lot of people, have never been as concerned about the Kurds as the United States has tried to be, and we’ve been flying an operation to protect them out of Turkey for many years now. What happened was one of the Turkish, one of the Kurdish leaders invited him to go up north, but we felt since the whole world community had told him not to do it, that once he did it we had to do something. We did not feel that I could commit. I certainly didn’t feel I should commit American troops to throw him out of where he had gone, and that was the only way to do that. So the appropriate thing strategically to do was to reduce his ability to threaten his neighbors. We did that by expanding what’s called the no-fly zone by increasing our allies’ control of the air space now from the Kuwait bo

Since the names from the transcripts are all uppercase, we convert our list to uppercase.

In [9]:
candidate_list=[candidate.upper() for candidate in candidate_list]

We are looking for misspellings of names, so we ignore the names we know are spelled correctly. We also ignore a few non-name values: times, years, and an occurrence of (CNN).

In [10]:
ignore_set = set(candidate_list).union(set(['7','8','9','10','11','1980','1986','(CNN)']))

The set of values we are considering imputing is the difference between the unique speakers and our `ignore_set`.

In [11]:
impute_set = set(raw['Speaker'].unique()) - ignore_set

In [12]:
impute_set = impute_set -{np.nan} #our code won't work with the "float" NaN, so we remove it here.
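
As a possible alternative to subtracting `{np.nan}`, the missing value can be dropped before the set is built. The sketch below uses a small invented Series (`speakers`) and an invented `ignore` set rather than the real data, so it only illustrates the idea.

```python
import numpy as np
import pandas as pd

# Invented stand-ins for raw['Speaker'] and ignore_set.
speakers = pd.Series(['OBAMA', 'OBAM', np.nan, '(CNN)'])
ignore = {'OBAMA', '(CNN)'}

# dropna() removes the NaN before unique(), so no float ever enters the set.
impute = set(speakers.dropna().unique()) - ignore
print(impute)  # {'OBAM'}
```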

In [13]:
def lev_dist_fast(str1,str2):
    """This is an implementation of Levenshtein distance: the distance between two strings is the minimal number of
    insertions, deletions and substitutions required to go from one string to the other.
    """
    #initialize a dataframe with characters of the strings on the axes and an extra empty row/column index
    dist_data = pd.DataFrame(index=list(' '+str1), columns = list(' '+str2))  
    dist_data.fillna(0, inplace=True) #Fill the na values with zero.
    #print(dist_data) #used to check that the code ran properly
    
    #record the lengths of our strings for use below.
    n=len(str2)
    m=len(str1)
    
    #Initialize the first column of the dataframe, the indices are 1 to m+1 since [0,0] is already 0
    for i in range(1, m+1):
        dist_data.iloc[i,0] = i
    
    #Initialize the first row of the dataframe similarly.
    for j in range(1, n+1):
        dist_data.iloc[0,j] = j
    
    #Loop over the entries of the dataframe
    for j in range(1, n+1): #column by column, left to right
        for i in range(1, m+1): #and within each column, top to bottom
            
            #If the characters of the string are the same, it costs nothing to substitute
            if str1[i-1] == str2[j-1]: #due to extra padding, indices are offset by 1.
                substitutionCost = 0
            else:
                substitutionCost = 1 #if the strings aren't the same it costs 1 to substitute
            
            dist_data.iloc[i, j] = min(dist_data.iloc[i-1, j] +1, #deletion
                                       dist_data.iloc[i, j-1] +1, #insertion
                                       dist_data.iloc[i-1,j-1]+substitutionCost) #substitution
            
    #print(dist_data) #if you want to see the resulting dataframe, uncomment this print statement.   
    return dist_data.iloc[-1,-1] 
#The dataframe stores the Levenshtein distance between the prefixes str1[:i] and str2[:j]; we are only interested in
#the distance between str1 and str2 as a whole, which is the last entry of the dataframe.

In [14]:
lev_dist_fast('kitten', 'sitting') #this cell is purely for testing purposes. Change the strings and see what happens!

3
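
For cross-checking the dataframe version, the same recurrence can be sketched with plain Python lists. The function name `lev_dist_rows` is hypothetical and not part of the notebook. It keeps only the previous and current rows of the table, which is enough because each entry depends only on its left, upper, and upper-left neighbours.

```python
def lev_dist_rows(str1, str2):
    """Levenshtein distance using two rows of the table (hypothetical helper)."""
    # previous[j] is the distance between the empty string and str2[:j].
    previous = list(range(len(str2) + 1))
    for i, char1 in enumerate(str1, start=1):
        current = [i]  # distance between str1[:i] and the empty string
        for j, char2 in enumerate(str2, start=1):
            substitution_cost = 0 if char1 == char2 else 1
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + substitution_cost))  # substitution
        previous = current
    return previous[-1]

print(lev_dist_rows('kitten', 'sitting'))  # 3
print(lev_dist_rows('ROMNEHY', 'ROMNEY'))  # 1
```

Like `lev_dist_fast`, this sketch is case-sensitive.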

In [16]:
speaker_score_dict={}
vetted_set = set(candidate_list) #this is the set of correctly spelled names.

for speaker in impute_set:
    distance=20 #initialize the distance to be larger than the length of any of the names
    best_candidate='' #initialize best candidate
    for candidate in vetted_set:
        new_dist=lev_dist_fast(speaker,candidate) #compare speakers with correctly spelled names.
        
        #the best candidate for who the speaker is should be the word of minimal distance.
        if(new_dist<distance):
            distance=new_dist
            best_candidate=candidate
            
    #at this point, we have a best_candidate for who the speaker is
    #group the (speaker, best_candidate) pairs in a dictionary keyed by their distance;
    #setdefault creates an empty list the first time a distance is seen.
    speaker_score_dict.setdefault(distance, []).append((speaker, best_candidate))
        
#We're not done yet, the following simple sanity check helps a lot!        

final_dict={} #initialize our final dictionary
for i in speaker_score_dict.keys(): #Loop over our earlier dictionary
    
    #Check if the speaker only occurred in debates where their candidate identity also occurred.
    #We are assuming few typos and that misspelled names occur with the correct name in the same transcript.
    
    for pair in speaker_score_dict[i]: 
        typo_loc = raw[raw['Speaker']==pair[0]].Placeholder.unique()
        fix_loc = raw[raw['Speaker']==pair[1]].Placeholder.unique()
        if(set(typo_loc).issubset(set(fix_loc))): #the occurrences of the typo only happen when the candidate name is also in the transcript.
            print(pair,i) #we print the pair together with their distance for a final eyeballing.
            
            #our final dictionary is any pair not filtered out by this sanity check.
            final_dict[pair[0]]=pair[1]

    
            

('[*]SCHIEFFER', 'SCHIEFFER') 3
('MR.FORD', 'FORD') 3
('[*]CROWLEY', 'CROWLEY') 3
('ROMNEHY', 'ROMNEY') 1
('REAGAV', 'REAGAN') 1
('OBAM', 'OBAMA') 1
('SM1TH', 'SMITH') 1
('KONDRACKE', 'MONDALE') 4
('JOHNSON', 'CLINTON') 4
('OREGONIAN', 'REAGAN') 5
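
The sanity check used above can be isolated as a small function. This is a minimal sketch on invented toy data: the helper name `occurs_only_with` and the rows of `toy` are assumptions for illustration, not values from the debate transcripts.

```python
import pandas as pd

def occurs_only_with(df, typo, fix):
    """True if every debate containing `typo` also contains `fix` (hypothetical helper)."""
    typo_debates = set(df.loc[df['Speaker'] == typo, 'Placeholder'])
    fix_debates = set(df.loc[df['Speaker'] == fix, 'Placeholder'])
    return typo_debates.issubset(fix_debates)

# Invented rows: 'OBAM' shares debate 1 with 'OBAMA'; 'KERY' never shares a debate with 'KERRY'.
toy = pd.DataFrame({
    'Speaker': ['OBAMA', 'OBAM', 'ROMNEY', 'KERRY', 'KERY'],
    'Placeholder': [1, 1, 1, 2, 3],
})

print(occurs_only_with(toy, 'OBAM', 'OBAMA'))  # True
print(occurs_only_with(toy, 'KERY', 'KERRY'))  # False
```

A pair that fails the check is not necessarily a real name. It only means the co-occurrence assumption does not hold for it, so it would need to be inspected by hand.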


From here, it is pretty clear which of these names are typos and which are simply two names that happen to share some letters. Kondracke, (Pamela) Johnson, and Dube are all journalists who appeared at these debates. Oregonian was the label attached to Hilliard, a reporter for that paper, when he first asked a question.

<b>Note:</b> These names cannot distinguish between Hillary and Bill Clinton, nor between George W. Bush and George H. W. Bush. Because the names are being imputed to the classes Republican, Democrat, and Other, the distinction shouldn't matter. Another minor note: Ross Perot ran as an independent, but he will be classified as a Republican for the purposes of our algorithm, as he was a right-leaning candidate.

We therefore delete the names which were already correct from our dictionary and update Oregonian to Hilliard.

In [17]:
del final_dict['KONDRACKE']
del final_dict['JOHNSON']
final_dict['OREGONIAN']='HILLIARD'

In [18]:
final_dict

{'[*]SCHIEFFER': 'SCHIEFFER',
 'MR.FORD': 'FORD',
 '[*]CROWLEY': 'CROWLEY',
 'ROMNEHY': 'ROMNEY',
 'REAGAV': 'REAGAN',
 'OBAM': 'OBAMA',
 'SM1TH': 'SMITH',
 'OREGONIAN': 'HILLIARD'}

In sum, our algorithm picked up 8 speaker labels to correct. Since the corrections are stored in a dictionary, we can simply call `.replace` on the raw data to update all the names.

In [19]:
cleaned_speakers = raw.replace(final_dict)
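
Note that calling `.replace` on the whole dataframe matches cells in every column. If that is a concern, one option is to restrict the replacement to the `Speaker` column, as sketched below. The dataframe `toy` and the dictionary `fixes` are invented for illustration, including a `Text` cell that happens to equal a dictionary key.

```python
import pandas as pd

toy = pd.DataFrame({
    'Speaker': ['OBAM', 'ROMNEY'],
    'Text': ['OBAM', 'Thank you.'],  # invented: a Text cell identical to a key
    'Placeholder': [1, 1],
})
fixes = {'OBAM': 'OBAMA'}

# Replace only within the Speaker column; Text is left untouched.
scoped = toy.copy()
scoped['Speaker'] = scoped['Speaker'].replace(fixes)

print(scoped['Speaker'].tolist())  # ['OBAMA', 'ROMNEY']
print(scoped['Text'].tolist())     # ['OBAM', 'Thank you.']
```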

However, we are not done yet. At least two anomalies remain. The `NaN` speaker needs to be updated to `'CLINTON'`. Additionally, there was a speaker named `W` who needs to be identified. 

In [20]:
raw[raw['Speaker']=='W'].Text[7810]

' Senator Quayle, all of us in our lifetime encounter an experience that helps shapes our adult philosophy in some form or another. Could you describe for this audience tonight what experience you may have had, and how it shaped our political philosophy?,'

This quote from `W` makes clear that they are likely a reporter of some sort. Checking the speakers in debate 33 yields the following:

In [21]:
raw[raw['Placeholder']==33].Speaker.unique()

array(['WOODRUFF', 'QUAYLE', 'BENTSEN', 'MARGOLIS', 'BROKAW', '1986',
       'HUME', 'W'], dtype=object)

So, `W` is likely `WOODRUFF` and we impute this below.

In [22]:
cleaned_speakers = cleaned_speakers.replace('W','WOODRUFF')

In [23]:
cleaned_speakers = cleaned_speakers.fillna('CLINTON')
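
The `fillna` call above fills missing values in every column. That is harmless when the only `NaN` is the one found earlier in `Speaker`, but a column-scoped variant is a cautious alternative. The sketch below uses an invented two-row dataframe with a missing `Text` cell to show the difference.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Speaker': ['CLINTON', np.nan],
    'Text': ['Good evening.', np.nan],  # invented: a missing Text cell
})

# Fill only the Speaker column, so a missing Text stays missing.
filled = toy.copy()
filled['Speaker'] = filled['Speaker'].fillna('CLINTON')

print(filled['Speaker'].tolist())       # ['CLINTON', 'CLINTON']
print(int(filled['Text'].isna().sum())) # 1
```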

This brings us to a few other odd Speaker values:

1) `1986`
2) `1980`
3) `(CNN)`

In [24]:
cleaned_speakers[cleaned_speakers['Speaker']=='1980']

Unnamed: 0,Speaker,Text,Placeholder
8226,1980,He would never reduce benefits. And of course...,37


In [25]:
cleaned_speakers.iloc[8225]

Speaker                                                  MONDALE
Text            Well, that’s exactly the commitment that was ...
Placeholder                                                   37
Name: 8225, dtype: object

In [26]:
cleaned_speakers.iloc[8226].Text

' He would never reduce benefits. And of course, what happened right after the election is they proposed to cut Social Security benefits by 25 percent — reducing the adjustment for inflation, cutting out minimum benefits for the poorest on Social Security, removing educational benefits for dependents whose widows were trying — with widows trying to get them through college. Everybody remembers that; people know what happened., There’s a difference. I have fought for Social Security and Medicare and for things to help people who are vulnerable all my life, and I will do it as President of the United States., MS.'

Here we see that 1980 was likely a part of a statement made by Mondale. So, we replace this in our dataframe as well.

In [27]:
cleaned_speakers = cleaned_speakers.replace('1980', 'MONDALE')

Now to figure out 1986.

In [28]:
cleaned_speakers[cleaned_speakers['Speaker']=='1986']

Unnamed: 0,Speaker,Text,Placeholder
7692,1986,six million working poor families got off the...,33


In [29]:
cleaned_speakers[cleaned_speakers['Speaker']=='1986'].Text[7692]

' six million working poor families got off the payroll; six million people are off the taxpaying payrolls because of that tax reform, and they are keeping the tax money there. To help the poor, we’ll have a commitment to the programs and those programs will go on. And we are spending more in poverty programs today than we were in 1981 – that is a fact. The poverty program we are going to concentrate on is creating jobs and opportunities, so that everyone will have the opportunities that they want.] (Scattered applause),'

In [30]:
cleaned_speakers.iloc[7691]

Speaker                                                   QUAYLE
Text            I have met with those people, and I met with ...
Placeholder                                                   33
Name: 7691, dtype: object

In [31]:
cleaned_speakers.iloc[7691].Text

' I have met with those people, and I met with them in Fort Wayne, Indiana, at a food bank. You may be surprised, Tom, they didn’t ask me those questions on those votes, because they were glad that I took time out of my schedule to go down and to talk about how we are going to get a food bank going and making sure that a food bank goes in Fort Wayne, Indiana. And I have a very good record and a commitment to the poor, to those that don’t have a family, that want to have a family. This administration, and a George Bush administration, will be committed to eradicating poverty. Poverty hasn’t gone up in this administration; it hasn’t gone down much either, and that means we have a challenge ahead of us. But let me tell you something, what we have done for the poor. What we have done for the poor is that we in fact – the homeless bill, the McKinney Act, which is the major piece of legislation that deals with homeless – the Congress has cut the funding that the administration has recommende

From here, we see that 1986 is likely a continuation of a statement by Quayle.

In [32]:
cleaned_speakers = cleaned_speakers.replace('1986', 'QUAYLE')
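
Both `1980` and `1986` were resolved by looking at the row immediately before the odd label. That lookup can be sketched as a helper. The name `preceding_speakers` is hypothetical, and `toy` is an invented four-row frame, not the real debate 33.

```python
import pandas as pd

def preceding_speakers(df, label):
    """Map the index of each row labelled `label` to the Speaker of the row before it."""
    previous = df['Speaker'].shift(1)  # Speaker of the preceding row
    return previous[df['Speaker'] == label].to_dict()

toy = pd.DataFrame({
    'Speaker': ['BROKAW', 'QUAYLE', '1986', 'BENTSEN'],
    'Placeholder': [33, 33, 33, 33],
})

print(preceding_speakers(toy, '1986'))  # {2: 'QUAYLE'}
```

The preceding speaker is only a suggestion. The text of both rows should still be read, as done above, before imputing a name.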

In [33]:
cleaned_speakers[cleaned_speakers['Placeholder']==33]

Unnamed: 0,Speaker,Text,Placeholder
7669,WOODRUFF,On behalf of the Commission on Presidential D...,33
7670,WOODRUFF,For the next 90 minutes we will be questionin...,33
7671,WOODRUFF,Your leader in the Senate Bob Dole said that ...,33
7672,QUAYLE,The question goes to whether I am qualified t...,33
7673,WOODRUFF,Senator Bentsen – I’m going to interrupt at t...,33
...,...,...,...
7817,WOODRUFF,We’re sorry about that if that’s the case. Th...,33
7818,QUAYLE,"Bigger government, higher taxes. They’ve alwa...",33
7819,WOODRUFF,"Senator Bentsen, your closing statement.,",33
7820,BENTSEN,"In just 34 days, America will elect new leade...",33


In [34]:
cleaned_speakers[cleaned_speakers['Speaker']=='(CNN)'].Text[6686]

' Mr. Perot, you’ve talked about going to Washington to do what the people who run this country want you to do. But it is the president’s duty to lead, and often lead alone. How can you lead if you are forever seeking consensus before you act?,'

`(CNN)` is likely another moderator, so we relabel it as `MODERATOR`.

In [35]:
cleaned_speakers = cleaned_speakers.replace('(CNN)', 'MODERATOR')

In [36]:
cleaned_speakers[cleaned_speakers['Speaker']=='SM1TH']

Unnamed: 0,Speaker,Text,Placeholder


In [37]:
cleaned_speakers.to_csv('cleaned_candidates.csv')

The code below is a slower, recursive function for computing the Levenshtein distance between two strings. Unlike `lev_dist_fast`, it lowercases both strings before comparing them.

In [None]:
def lev_dist(string1,string2):
    #print('processing')
    string1=string1.lower()
    string2=string2.lower()
    if(string1==''):
        return len(string2)
    elif(string2==''):
        return len(string1)
    elif(string1[0]==string2[0]):
        return lev_dist(string1[1:],string2[1:])
    else:
        return 1+min(lev_dist(string1[1:],string2),lev_dist(string1, string2[1:]), lev_dist(string1[1:],string2[1:]))
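
The recursion above recomputes the same suffix pairs many times. One way to keep the recursive structure while avoiding that repeated work is to memoize on suffix positions with `functools.lru_cache`. The sketch below is an assumption about how this could be done, and the name `lev_dist_memo` is hypothetical.

```python
from functools import lru_cache

def lev_dist_memo(string1, string2):
    """Recursive Levenshtein distance with memoization (hypothetical helper)."""
    string1 = string1.lower()
    string2 = string2.lower()

    @lru_cache(maxsize=None)
    def helper(i, j):
        # Distance between the suffixes string1[i:] and string2[j:].
        if i == len(string1):
            return len(string2) - j
        if j == len(string2):
            return len(string1) - i
        if string1[i] == string2[j]:
            return helper(i + 1, j + 1)
        return 1 + min(helper(i + 1, j),      # drop a character of string1
                       helper(i, j + 1),      # drop a character of string2
                       helper(i + 1, j + 1))  # substitute

    return helper(0, 0)

print(lev_dist_memo('kitten', 'sitting'))  # 3
```

Each pair `(i, j)` is evaluated at most once, so the number of calls grows with the product of the string lengths rather than exponentially.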