Trump's latest call with a world leader has given everyone a case of impeachment fever. The summarized transcript has been released, and it's fairly easy to see why impeachment is likely inevitable. Still, people are worked up about what may be missing from it. After a readout of the transcript, one senator suggested that 20 minutes of conversation might be missing ([link](https://www.newsweek.com/senator-king-suggests-least-20-minutes-are-missing-trump-ukraine-call-transcript-1462622)). I think the only thing that's missing is an allowance for translation time, which I will show below.

In [53]:
import os

import numpy as np
import pandas as pd

# Local working directory; change this to suit your own machine.
PATH = "C:/Users/sam purkiss/Documents/Code/Text Analysis/Trump"
os.chdir(PATH)

First I set up a dictionary with information on each transcript. Dictionaries are one of my favourite things in Python, so I like to throw them in when I can.

In [51]:
# The keys double as transcript labels later on, so 'zelensky' must match the spelling used below.
# All links point at the raw text files rather than the GitHub HTML pages.
files = {'nieto': {'link': 'https://raw.githubusercontent.com/sampurkiss/Misc/master/Trump/Data/call%20with%20nieto.txt', 'date': 'January 27, 2017, FROM 9:35', 'length in mins': 53},
         'turnbull': {'link': 'https://raw.githubusercontent.com/sampurkiss/Misc/master/Trump/Data/call%20with%20turnbull.txt', 'date': 'January 28, 2017 5:05 PM', 'length in mins': 24},
         'zelensky': {'link': 'https://raw.githubusercontent.com/sampurkiss/Misc/master/Trump/Data/call%20with%20zelenskyy.txt', 'date': 'July 25, 2019, 9:03 PM', 'length in mins': 30}}


The question I'm interested in is: what do we actually know about these calls? We know what the administration has claimed was said, and we know how long the conversation lasted. There were also [two leaked transcripts](https://www.washingtonpost.com/graphics/2017/politics/australia-mexico-transcripts/) published by the Washington Post in 2017, which give us context for what a normal Trump conversation with a world leader looks like. These two can be used to judge whether Trump's call with the Ukrainian president fits the pattern of a "normal" conversation.

I cleaned the transcripts of the calls for further analysis and pull them in below.

In [52]:
from urllib.request import urlopen

frames = []
for leader, info in files.items():
    # open() only reads local paths, so fetch the raw text over HTTP instead.
    with urlopen(info['link']) as f:
        lines = f.read().decode('latin-1').splitlines()
    # Drop the header line, then split each "SPEAKER: text" line on the colon.
    temp = pd.DataFrame({'transcript': leader, 'lines': lines[1:]})
    new = temp['lines'].str.split(':', n=2, expand=True)
    temp['speaker'] = new[0]
    temp['lines'] = new[1]
    frames.append(temp)
transcript = pd.concat(frames, ignore_index=True)

# Normalise the speaker labels, then count the words on each line.
transcript['speaker'] = transcript['speaker'].str.replace('The President', 'TRUMP', regex=False)
transcript['speaker'] = transcript['speaker'].str.replace('President Zelenskyy', 'ZELENSKYY', regex=False)
transcript['num of words'] = transcript['lines'].str.split().str.len()

If you look at line 4 of the Nieto transcript, you'll notice that Nieto switches to Spanish. To compare the conversations, we have to account for the fact that everything said on a translated call is effectively said twice: once by the speaker and once by the interpreter. This complicates things, but for simplicity I've assumed that the word counts for translated calls are simply doubled, which should be approximately correct. It also seems reasonable to expect that Zelenskyy used a translator as well (the Washington Post, at least, has confirmed this), so I apply the same doubling to that call.

In [44]:
# Double the word counts for the two calls that went through a translator.
translated = transcript['transcript'].isin(['nieto', 'zelensky'])
transcript.loc[translated, 'num of words'] *= 2

The easiest way to spot differences is to look at the number of words spoken per minute on each call. This should give us a sense of how chatty Trump and friends are.

In [45]:
# Total words per call (numeric column only), divided by the call length in minutes.
words = transcript.groupby('transcript')[['num of words']].sum()
minutes = [files[name]['length in mins'] for name in words.index]
words['words per min'] = (words['num of words'] / minutes).round(0).astype(int)
words

| transcript | num of words | words per min |
|---|---|---|
| nieto | 6902 | 130 |
| turnbull | 3198 | 133 |
| zelensky | 3912 | 130 |


As you can see, all the calls are remarkably similar: each clocks in at about 130 words per minute. Even more striking, the Nieto and Zelenskyy transcripts, both of which required a translator, come out at identical words per minute. Even if you remove the translation adjustment, the two still match, since both counts are simply halved.
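As a quick check on that last point, the sketch below recomputes words per minute for the two translated calls with the doubling undone. It copies the totals from the table above and assumes those totals already include the factor of two; the variable names are illustrative only.

```python
# Totals and call lengths copied from the table and the `files` dictionary above.
# Assumption: both totals already include the 2x translation adjustment.
adjusted_words = {'nieto': 6902, 'zelensky': 3912}
length_in_mins = {'nieto': 53, 'zelensky': 30}

# Undo the doubling, then divide by the call length.
unadjusted_rate = {
    name: round(adjusted_words[name] / 2 / length_in_mins[name])
    for name in adjusted_words
}
print(unadjusted_rate)
```

Both calls land on about 65 words per minute, so the match between them does not depend on the adjustment.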


Another way to approach this is to look at the number of words used by each leader to see if there are any differences.

In [46]:
# Total words per speaker on each call, divided by that call's length in minutes.
words_per_speaker = (transcript.groupby(['transcript', 'speaker'])['num of words']
                     .sum().reset_index())
minutes = words_per_speaker['transcript'].map(lambda name: files[name]['length in mins'])
words_per_speaker['words per min'] = (words_per_speaker['num of words'] / minutes).round(0).astype(int)
words_per_speaker

| transcript | speaker | num of words | words per min |
|---|---|---|---|
| nieto | PEÑA NIETO | 3126 | 59 |
| nieto | TRUMP | 3776 | 71 |
| turnbull | TRUMP | 1686 | 70 |
| turnbull | TURNBULL | 1512 | 63 |
| zelensky | TRUMP | 1508 | 50 |
| zelensky | ZELENSKYY | 2404 | 80 |


Strangely, in the Nieto and Turnbull calls, Trump manages about 70 words per minute and the other leader squeezes in about 60. In the Zelenskyy call, Trump only manages 50 words per minute while Zelenskyy speaks 80. Relative to the other two calls, that means that, for some reason, Trump spoke roughly 30% less and Zelenskyy roughly 30% more.
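The rough 30% figures can be checked with a little arithmetic. This is only a sketch: it uses the rounded words-per-minute values from the table above, and taking the average of the Nieto and Turnbull calls as the baseline is my own assumption about what "normal" means here.

```python
# Words per minute copied from the per-speaker table above.
trump = {'nieto': 71, 'turnbull': 70, 'zelensky': 50}
other_leader = {'nieto': 59, 'turnbull': 63, 'zelensky': 80}

def pct_change(value, baseline):
    """Percentage change of value relative to baseline."""
    return 100 * (value - baseline) / baseline

# Assumed baseline: the average of the two leaked calls (Nieto and Turnbull).
trump_baseline = (trump['nieto'] + trump['turnbull']) / 2
other_baseline = (other_leader['nieto'] + other_leader['turnbull']) / 2

trump_change = pct_change(trump['zelensky'], trump_baseline)
other_change = pct_change(other_leader['zelensky'], other_baseline)
print(round(trump_change), round(other_change))
```

On these numbers Trump comes out about 29% below his baseline and Zelenskyy about 31% above the other leaders' baseline.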

So, we already know that the transcript isn't the full transcript. We know it's been edited down somehow. My main question is: did Trump really speak a lot less? Did Zelenskyy really speak a lot more? Or have things been modified to hide one or the other?

As a next step, I want to translate the English words to get a sense of how many words were actually used. I think King et al. should look at the transcript with translation included to see whether it still seems like anything is missing. The other thing I hope to do is use sentiment analysis to dig into how each conversation actually went. Unfortunately, transcripts of calls between Trump and world leaders are notoriously hard to get (for good reason, probably), so even though I'd love to run some ML analysis, the training set is a bit too small.