## Option 1: BBC 21st Century's greatest films

Last year the BBC polled 177 film critics to get their picks for the best films of the century so far. While the BBC's aggregate poll is interesting, the long list that includes everyone who voted is perhaps more revealing from a data standpoint:

http://www.bbc.com/culture/story/20160819-the-21st-centurys-100-greatest-films-who-voted

Our goal for this project would be to scrape this page -- using Beautiful Soup and regular expressions -- in order to make a searchable database that would allow us to investigate all of the films listed by all of the critics.
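
As a starting point for the regular-expression side of this, here is a minimal sketch that parses a single ballot line. It assumes each entry looks like `1. Title (Director, Year)`; that format, the sample line, and the names `BALLOT_LINE` and `parse_ballot_line` are my own assumptions for illustration, so check them against the actual page before relying on them.

```python
import re

# Assumed line format: "<rank>. <title> (<director>, <year>)"
BALLOT_LINE = re.compile(
    r"^\s*(?P<rank>\d+)\.\s+(?P<title>.+)\s+"
    r"\((?P<director>[^()]+),\s*(?P<year>\d{4})\)\s*$"
)

def parse_ballot_line(line):
    """Return a dict for one ballot entry, or None if the line does not match."""
    match = BALLOT_LINE.match(line)
    if match is None:
        return None
    return {
        "rank": int(match.group("rank")),
        "title": match.group("title").strip(),
        "director": match.group("director").strip(),
        "year": int(match.group("year")),
    }

# Hypothetical sample line in the assumed format
entry = parse_ballot_line("1. Zodiac (David Fincher, 2007)")
print(entry)
```

Lines that do not fit the pattern return `None`, which makes it easy to collect them and see where the assumed format breaks down.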

If you choose this project, the main challenge will be to come up with a database schema that organizes information by film as well as by critic. 
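
One possible way to organize the data by film as well as by critic is a three-table layout: one table for critics, one for films, and a linking table with one row per vote. The sketch below uses the built-in `sqlite3` module; the table names, column names, and the placeholder rows are all assumptions for illustration, not a required design.

```python
import sqlite3

# Hypothetical schema: critics and films are linked through a votes table.
SCHEMA = """
CREATE TABLE critics (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    outlet TEXT,
    country TEXT
);
CREATE TABLE films (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    director TEXT,
    year INTEGER,
    UNIQUE (title, year)
);
CREATE TABLE votes (
    critic_id INTEGER NOT NULL REFERENCES critics(id),
    film_id INTEGER NOT NULL REFERENCES films(id),
    ballot_rank INTEGER NOT NULL,
    PRIMARY KEY (critic_id, film_id)
);
"""

conn = sqlite3.connect(":memory:")  # in-memory database for experimenting
conn.executescript(SCHEMA)

# Placeholder rows, not real poll data
conn.execute("INSERT INTO critics (name, country) VALUES (?, ?)",
             ("Example Critic", "US"))
conn.execute("INSERT INTO films (title, director, year) VALUES (?, ?, ?)",
             ("Example Film", "Example Director", 2007))
conn.execute("INSERT INTO votes (critic_id, film_id, ballot_rank) VALUES (1, 1, 1)")

# How many critics voted for each film?
rows = conn.execute("""
    SELECT films.title, COUNT(*) AS n_votes
    FROM votes JOIN films ON films.id = votes.film_id
    GROUP BY films.id
    ORDER BY n_votes DESC
""").fetchall()
print(rows)
```

Keeping votes in their own table means the same film appears only once in `films`, no matter how many critics listed it.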

From a geocoding standpoint, we could visualize this data set based on the country of the critics, and we could, with extra research, view the data set by the country of the filmmaker.

From an exploratory standpoint, here are some questions we could ask: 

1. Which countries have the most directors?
2. Which directors have the most movies selected?
3. Which year had the most movies selected?
4. What are some other questions you can think of?
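
Once the ballots are parsed, questions like these can be answered with simple counts. Here is a rough sketch using `collections.Counter`; the `ballot_entries` list is made-up placeholder data standing in for whatever the scraper produces.

```python
from collections import Counter

# Hypothetical parsed ballot entries; real rows would come from the scraped page.
ballot_entries = [
    {"title": "Film A", "director": "Director X", "year": 2001},
    {"title": "Film A", "director": "Director X", "year": 2001},
    {"title": "Film B", "director": "Director X", "year": 2007},
    {"title": "Film C", "director": "Director Y", "year": 2007},
    {"title": "Film D", "director": "Director Z", "year": 2007},
]

# Collapse repeated votes so each film is counted once.
unique_films = {(e["title"], e["director"], e["year"]) for e in ballot_entries}

# Number of distinct films selected per director and per year
films_per_director = Counter(director for _, director, _ in unique_films)
films_per_year = Counter(year for _, _, year in unique_films)

# Number of ballots each film appears on
votes_per_film = Counter(e["title"] for e in ballot_entries)

print(films_per_director.most_common(1))
print(films_per_year.most_common(1))
```

Whether "most movies selected" should count distinct films or total votes is a choice you would need to make; this sketch keeps both counts available.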

The overall challenge, once the initial page is properly formatted into a database, is to find a way to enhance this data set with another source, and integrate that information with the information from the BBC poll.

## Option 2: Supreme Court Arguments
One group project I am proposing would be to create a database of arguments before the US Supreme Court. The Supreme Court makes transcripts of all arguments available on its website:

https://www.supremecourt.gov/oral_arguments/argument_transcript.aspx

These arguments are available as PDFs. Downloading the PDFs and converting them to text is not trivial, but it is not a particularly interesting challenge either. The real challenge is what to do with the data once we have the text.

Below, I show the code that I used to download all the PDFs of the oral arguments before the Supreme Court from 2016, which I then convert to text files: that is where the fun starts. The goal of this project will be to use regular expressions to parse the text of the oral arguments, so that we can then search and measure the words spoken by the Supreme Court justices from case to case and across cases.
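
To give a sense of what that parsing might look like, here is a minimal sketch that splits a transcript into speaker turns. It assumes each turn starts at the beginning of a line with an upper-case label such as `JUSTICE KAGAN:` or `MR. SMITH:`, and that page and line numbers have already been stripped. The pattern, the sample text, and the function name `split_turns` are illustrative assumptions, not the format guaranteed by the real transcripts.

```python
import re

# Assumed speaker labels: a title followed by an upper-case surname and a colon.
SPEAKER = re.compile(
    r"^(?P<speaker>(?:CHIEF JUSTICE|JUSTICE|MR\.|MS\.|GENERAL)\s+[A-Z][A-Z'\-]+):\s*",
    re.MULTILINE,
)

def split_turns(text):
    """Return a list of (speaker, spoken_text) tuples in order of appearance."""
    matches = list(SPEAKER.finditer(text))
    turns = []
    for i, match in enumerate(matches):
        # A turn runs until the next speaker label (or the end of the text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        spoken = " ".join(text[match.end():end].split())  # collapse line breaks
        turns.append((match.group("speaker"), spoken))
    return turns

# Made-up sample in the assumed format
sample = (
    "CHIEF JUSTICE ROBERTS: We will hear argument\n"
    "this morning.\n"
    "MR. SMITH: Thank you, Mr. Chief Justice.\n"
    "JUSTICE KAGAN: Could you explain that?\n"
)
turns = split_turns(sample)
for speaker, spoken in turns:
    print(speaker, "->", spoken)
```

Real transcripts will have extra material (cover pages, indexes, numbered lines) that the pattern would need to be adjusted for.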

The end goal is open: we want to have well-structured data with the words of the justices entered. We also want to have a useful data set for each case that includes further information such as the final decision, the votes of each justice, and who wrote the decision (and perhaps the dissenting opinion too). It is up to you what should ultimately be included and how we will be able to search through the data and view results.
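
As one possible shape for a per-case record, here is a small sketch using a dataclass. The class name `CaseRecord`, every field name, and the placeholder values are assumptions for illustration; what to include is still up to you.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class CaseRecord:
    """Hypothetical container for one argued case."""
    docket_number: str
    case_name: str
    decision: Optional[str] = None          # final outcome, once known
    opinion_author: Optional[str] = None    # who wrote the decision
    dissent_author: Optional[str] = None    # who wrote the dissent, if any
    votes: Dict[str, str] = field(default_factory=dict)           # justice -> vote
    turns: List[Tuple[str, str]] = field(default_factory=list)    # (speaker, text)

# Placeholder values, not real case data
record = CaseRecord(docket_number="00-000", case_name="Example v. Example")
record.turns.append(("JUSTICE EXAMPLE", "A placeholder question?"))
print(record)
```

Fields that are not known at scraping time default to `None` or empty containers, so they can be filled in later from another source.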

From an exploratory standpoint, here are some questions we could ask: 

1. Which justice speaks the most from case to case?
2. What are the most frequent words used by each justice across cases?
3. Does frequency of speech during arguments have any relationship to the decision?
4. What are some other questions you can think of?
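
If the transcripts have been split into (speaker, text) pairs, questions like the first two come down to counting words per speaker. The sketch below shows one way to do that; the function name `words_per_speaker`, the simple word pattern, and the sample turns are assumptions for illustration.

```python
import re
from collections import Counter, defaultdict

def words_per_speaker(turns):
    """Map each speaker to a Counter of the lower-cased words they spoke."""
    totals = defaultdict(Counter)
    for speaker, text in turns:
        # Very simple tokenizer: runs of letters and apostrophes
        totals[speaker].update(re.findall(r"[a-z']+", text.lower()))
    return totals

# Made-up sample turns
sample_turns = [
    ("JUSTICE A", "Is that right? Is it?"),
    ("JUSTICE B", "Yes."),
]
totals = words_per_speaker(sample_turns)

for speaker, counts in totals.items():
    # Total words spoken, then the most frequent words
    print(speaker, sum(counts.values()), counts.most_common(3))
```

For the frequent-words question you would probably want to drop very common words ("the", "is", and so on) before comparing justices.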

As far as geocoding goes: one way we could arrange this data on the map would be by finding the state where each of these cases originated. I have not found the best source for this, so it is an open question--part of this project will entail researching data sets.


I begin by scraping the links to all the transcripts using **Beautiful Soup**--if you choose to do this project, you will also want to scrape this page for the rest of its information, such as the name of the case, the docket number, etc.

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [None]:
raw_html = urlopen("https://www.supremecourt.gov/oral_arguments/argument_transcript.aspx").read()
soup_doc = BeautifulSoup(raw_html, "html.parser")

In [None]:
the_table = soup_doc.find(class_="table datatables")


In [None]:
the_cases = the_table.find_all('td', attrs={'style': 'text-align:left'})

In [None]:
all_2016_pdfs = []
for the_link in the_cases:
    all_2016_pdfs.append(the_link.a['href'])

In [None]:
all_2016_pdfs

Next I used the **requests** library (which I installed via pip) to download all of the PDFs to a folder on my computer.

In [None]:
import requests

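for url in all_2016_pdfs:
    # Build the full address of the PDF from the relative link
    link = 'https://www.supremecourt.gov/oral_arguments/' + url
    # Save each PDF under its original file name
    book_name = "/Users/Jon/Documents/columbia_syllabus/pdf/" + link.split('/')[-1]
    a = requests.get(link, stream=True)
    a.raise_for_status()  # stop here if the download failed

    with open(book_name, 'wb') as book:
        # Write the PDF to disk in 512-byte blocks
        for block in a.iter_content(512):
            if not block:
                break
            book.write(block)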
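
The loop above builds each address by gluing the link onto a fixed prefix. A slightly more forgiving alternative is `urllib.parse.urljoin`, which resolves a relative link against the page it came from. This is only a sketch: the helper name `absolute_pdf_url` and the directory part of the sample links are assumptions, so check them against the links you actually scrape.

```python
from urllib.parse import urljoin

PAGE_URL = "https://www.supremecourt.gov/oral_arguments/argument_transcript.aspx"

def absolute_pdf_url(href, page_url=PAGE_URL):
    """Resolve a link found on the page into a full URL."""
    return urljoin(page_url, href)

# Hypothetical relative links of the kind the scraper might return
print(absolute_pdf_url("argument_transcripts/2016/16-605_2dp3.pdf"))
print(absolute_pdf_url("../oral_arguments/argument_transcripts/2016/16-605_2dp3.pdf"))
```

Both forms of the link resolve to the same address, which is the main advantage over plain string concatenation.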

In [None]:
#Here I make a list of the names of the PDFs
pdf_names = [url.split('/')[-1] for url in all_2016_pdfs]
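
The file names look like they begin with the docket number (for example `16-605_2dp3.pdf`), so a regular expression may be enough to pull it out. This is a sketch under that assumption; the helper name `docket_from_filename` is mine, and some dockets may not follow the `digits-digits` pattern.

```python
import re

# Assumed pattern: "<term>-<number>_" at the start of the file name
DOCKET = re.compile(r"^(?P<docket>\d+-\d+)_")

def docket_from_filename(name):
    """Return the docket number at the start of a file name, or None."""
    match = DOCKET.match(name)
    return match.group("docket") if match else None

print(docket_from_filename("16-605_2dp3.pdf"))
print(docket_from_filename("15-777_1b82.txt"))
```

Names that do not match return `None`, so they can be listed and handled by hand.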


Here I use the built-in **os** library to run command line actions from Python. I am using the command-line **xpdf** tool (specifically its **pdftotext** command), which converts PDF to text in a way that's faster and simpler than using Python libraries that deal with PDFs. (This is certainly not the only way to do this!)

In [None]:
import os
def pdf_to_text(name):
    folder = "/Users/Jon/Documents/columbia_syllabus/pdf/"
    input1 = folder + name
    txt_name = name.replace(".pdf",".txt")
    output1 = folder + txt_name
    os.system("pdftotext '%s' '%s'" % (input1, output1))

#Here's an example of a single command    
#os.system('pdftotext /Users/Jon/Documents/columbia_syllabus/pdf/16-605_2dp3.pdf /Users/Jon/Documents/columbia_syllabus/pdf/16-605_2dp3.txt')


In [None]:
#This loops through the names and sends them to the function
for pdf_file in pdf_names:
    pdf_to_text(pdf_file)

In [None]:
#Open one converted transcript and read it in (the file is closed automatically)
with open('/Users/Jon/Documents/columbia_syllabus/pdf/15-777_1b82.txt', 'r') as f:
    sample_transcript = f.read()

In [None]:
sample_transcript