### Fun with Web Scraping & Text Manipulation


## Statistics in Presidential Debates

Scraping presidential debate transcripts from the Commission on Presidential Debates website: https://www.debates.org/voter-education/debate-transcripts/

I am not allowed to look up the URLs I need manually; instead, I will scrape them. The root URL to scrape is the one listed above.

1. Using `requests` and `BeautifulSoup`, I will find all the links on the page that point to transcripts of the **First Presidential Debates** from the years [1988, 1984, 1976, 1960].

2. Once I have the list of URLs, I will create a DataFrame with some statistics for each debate.
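
The filtering in step 1 can be sketched without any network access. The snippet below is only an illustration, assuming the link titles and relative `href` values have already been extracted from the page; the helper name `first_debate_urls` and the second and third sample entries are hypothetical. It uses `urllib.parse.urljoin`, which avoids a double slash when an `href` already starts with `/`.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.debates.org/"

def first_debate_urls(links, years, base_url=BASE_URL):
    """Map title -> absolute URL for first debates held in the given years.

    `links` is an iterable of (title, href) pairs (hypothetical input shape).
    """
    selected = {}
    for title, href in links:
        # Skip links without text and links that are not a "first" debate
        if not title or "The First" not in title:
            continue
        if any(str(year) in title for year in years):
            selected[title] = urljoin(base_url, href)
    return selected

sample_links = [
    # Title and href as printed by the notebook
    ("September 25, 1988: The First Bush-Dukakis Presidential Debate",
     "/voter-education/debate-transcripts/september-25-1988-debate-transcript/"),
    # Hypothetical entries, included only to show what gets filtered out
    ("Some Date, 1988: The Second Presidential Debate", "/hypothetical/second/"),
    (None, "/hypothetical/no-text/"),
]

result = first_debate_urls(sample_links, [1988, 1984, 1976, 1960])
for title, url in result.items():
    print(title, "->", url)
```

Only the first sample entry survives the filter, and its URL contains a single slash after the host name.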

In [740]:
import requests
import bs4 as bs
import string
import pandas as pd
import re
from collections import Counter
from urllib.request import urlopen

In [741]:
source = requests.get("https://www.debates.org/voter-education/debate-transcripts/").content
soup = bs.BeautifulSoup(source, 'html.parser') 

In [742]:
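years = [1988, 1984, 1976, 1960]  # the debate years listed in step 1
fd = []   # link titles of the matching debates
fdl = []  # matching hrefs (relative to the site root)
for cont in soup.find(id="content-sm").find_all('a'):
    title = cont.string or ""  # .string is None when a link has no single text child
    if 'The First' in title and any(str(j) in title for j in years):
        fd.append(title)
        fdl.append(cont.get('href'))
        print("\nURL for {}: ".format(cont.text), "\nhttps://www.debates.org/{}".format(cont.get('href')))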


URL for September 25, 1988: The First Bush-Dukakis Presidential Debate:  
https://www.debates.org//voter-education/debate-transcripts/september-25-1988-debate-transcript/

URL for October 7, 1984: The First Reagan-Mondale Presidential Debate:  
https://www.debates.org//voter-education/debate-transcripts/october-7-1984-debate-transcript/

URL for September 23, 1976: The First Carter-Ford Presidential Debate:  
https://www.debates.org//voter-education/debate-transcripts/september-23-1976-debate-transcript/

URL for September 26, 1960: The First Kennedy-Nixon Presidential Debate:  
https://www.debates.org//voter-education/debate-transcripts/september-26-1960-debate-transcript/


In [743]:
df = pd.DataFrame(columns = fd, index = ["Debate char length","war_count","most_common_w","most_common_w_count"])

# Build the absolute transcript URLs from the relative hrefs collected above
aa = ["https://www.debates.org/{}".format(link) for link in fdl]

for c, link in enumerate(aa):
    source = requests.get(link)
    soup = bs.BeautifulSoup(source.content, features='lxml')
    # Collect the text of every paragraph in the transcript
    char = [p.text for p in soup.find_all('p')]
    # Rough character count: str(char) also counts list punctuation and escaped
    # newlines, so subtract an estimated 1000 characters to compensate
    char_length = len(str(char)) - 1000
    df.iloc[0,c] = char_length
    # Count tokens equal to "war" or "War" after stripping punctuation
    war_count = 0
    for word in str(char).split():
        for i in string.punctuation:
            word = word.replace(i,"")
        if word == "War" or word == "war":
            war_count += 1
    df.iloc[1,c] = war_count
    # Most common word (case-insensitive) and how often it occurs
    words = re.findall(r'\w+', str(char).lower())
    counts = Counter(words)
    aaa = counts.most_common(1)
    df.iloc[2,c] = aaa[0][0]
    df.iloc[3,c] = aaa[0][1]
df

| Statistic | September 25, 1988: The First Bush-Dukakis Presidential Debate | October 7, 1984: The First Reagan-Mondale Presidential Debate | September 23, 1976: The First Carter-Ford Presidential Debate | September 26, 1960: The First Kennedy-Nixon Presidential Debate |
|---|---|---|---|---|
| Debate char length | 87489 | 86799 | 80168 | 60274 |
| war_count | 7 | 2 | 7 | 3 |
| most_common_w | the | the | the | the |
| most_common_w_count | 805 | 868 | 858 | 780 |
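
The same three statistics can also be computed on a plain string rather than on the `str()` of a list. This is a minimal sketch, not the method used above: the helper name `word_stats` and the sample sentence are made up for illustration. Working on joined text means list brackets, quotes, and escaped newlines never enter the counts, so no fixed correction is needed, and the numbers it produces would differ somewhat from the table.

```python
import re
from collections import Counter

def word_stats(text, target="war"):
    """Return (char count, occurrences of `target`, most common word, its count)."""
    # Lowercase first so "War" and "war" are counted together
    words = re.findall(r"\w+", text.lower())
    counts = Counter(words)
    top_word, top_count = counts.most_common(1)[0]
    return len(text), counts[target], top_word, top_count

# Hypothetical sample text; a real run would pass " ".join(p.text for p in paragraphs)
sample = "The war ended. After the War, the peace held."
stats = word_stats(sample)
print(stats)
```

For the sample sentence, `war` is counted twice and `the` is the most common word with three occurrences.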


    
## Now, let's do something more interesting!
I will scrape the first 27 data sets from http://people.sc.fsu.edu/~jburkardt/datasets/regression/. Then I will save the 5th line of each data set, which should contain the name of the data set's author. I will also need to strip the `#` symbol, the surrounding whitespace, and the trailing comma.
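
The clean-up described here can be sketched on a single line before touching the network. The helper name `clean_author_line` is hypothetical, and the sample line assumes the author line has the form `#  Name,` followed by a newline, which is what the description implies.

```python
def clean_author_line(raw_line):
    """Strip the leading '#', surrounding whitespace, and the trailing comma."""
    # urlopen yields bytes, so decode when needed
    if isinstance(raw_line, bytes):
        raw_line = raw_line.decode("utf-8", errors="replace")
    return raw_line.strip().lstrip("#").strip().rstrip(",").strip()

# Assumed shape of the 5th line of a data set file
sample_line = b"#  Helmut Spaeth,\n"
author = clean_author_line(sample_line)
print(author)
```

Chaining the string methods keeps each clean-up step visible; a regular expression is an equally reasonable choice.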

In [744]:
source = requests.get("http://people.sc.fsu.edu/~jburkardt/datasets/regression/").content
soup = bs.BeautifulSoup(source, features='lxml')

In [746]:
links = soup.find('table').find_all('a')
# The slice [6:33] keeps the 27 links starting at x01.txt
full_links = ["http://people.sc.fsu.edu/~jburkardt/datasets/regression/" + i.get('href') for i in links][6:33]
full_links

['http://people.sc.fsu.edu/~jburkardt/datasets/regression/x01.txt',
 'http://people.sc.fsu.edu/~jburkardt/datasets/regression/x02.txt',
 'http://people.sc.fsu.edu/~jburkardt/datasets/regression/x03.txt',
 'http://people.sc.fsu.edu/~jburkardt/datasets/regression/x04.txt',
 'http://people.sc.fsu.edu/~jburkardt/datasets/regression/x05.txt',
 'http://people.sc.fsu.edu/~jburkardt/datasets/regression/x06.txt',
 'http://people.sc.fsu.edu/~jburkardt/datasets/regression/x07.txt',
 'http://people.sc.fsu.edu/~jburkardt/datasets/regression/x08.txt',
 'http://people.sc.fsu.edu/~jburkardt/datasets/regression/x09.txt',
 'http://people.sc.fsu.edu/~jburkardt/datasets/regression/x10.txt',
 'http://people.sc.fsu.edu/~jburkardt/datasets/regression/x11.txt',
 'http://people.sc.fsu.edu/~jburkardt/datasets/regression/x12.txt',
 'http://people.sc.fsu.edu/~jburkardt/datasets/regression/x13.txt',
 'http://people.sc.fsu.edu/~jburkardt/datasets/regression/x14.txt',
 'http://people.sc.fsu.edu/~jburkardt/datasets/r

In [747]:
authors = []
for j in full_links:
    # The 5th line (index 4) of each file holds the author; urlopen yields bytes lines
    with urlopen(j) as file:
        author = list(file)[4]
    authors.append(author)

In [748]:
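# Keep the text between '#' and the trailing comma, then trim the surrounding whitespace
authors = [a.strip() for a in re.findall(r'(?<=#)[^,]+(?=,)', str(authors))]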

In [749]:
df2 = pd.DataFrame(index = Counter(authors).keys())
df2['Counts'] = Counter(authors).values()
df2.index.name = 'Authors'
df2 = df2.sort_values(by = 'Counts', ascending = False)

In [750]:
df2

| Authors | Counts |
|---|---|
| Helmut Spaeth | 16 |
| S Chatterjee | 3 |
| R J Freund and P D Minton | 2 |
| D G Kleinbaum and L L Kupper | 2 |
| S C Narula | 2 |
| K A Brownlee | 1 |
| S Chatterjee and B Price | 1 |
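
As an aside, `Counter.most_common()` already returns the names sorted by descending count, so the counting and the sorting can happen in one step. The names below are placeholders, not the scraped authors.

```python
from collections import Counter

# Placeholder names for illustration only
sample_authors = ["A Author", "B Author", "A Author", "C Author", "A Author", "B Author"]

# List of (name, count) pairs, highest count first
ranked = Counter(sample_authors).most_common()
for name, count in ranked:
    print(name, count)
```

Passing `ranked` to `pd.DataFrame(ranked, columns=["Authors", "Counts"])` would give a two-column table similar to the one above.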
