In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import multiprocessing as mp
from multiprocessing import Process, Pool
import pycorenlp
import pandas as pd
from pandas import json_normalize  # public location since pandas 1.0; pandas.io.json.json_normalize is deprecated

# Most of the code for this assignment lives in the pdfscrape and cityscrape folders
from pdfscrape import pdf_pipeline_ns
from cityscrape import scrape

# LSDM Assignment 2

## Project Summary

PDF documents on municipal government websites hold a great deal of information, but many of them are forms for paper-based processes that residents and constituents are expected to download and print. This project is a first step toward parallelizing the identification, scraping, and eventual classification of PDF documents, with the goal of finding the PDF forms in what is often a large haystack of PDFs. Two main components have been developed so far:
- Parallelized web crawler to collect PDF URLs
- Parallelized PDF scraper, which also hits a Stanford CoreNLP service running on an AWS EC2 instance

### Parallelized Web Crawler
The web crawler starts from a single web page, collects all URLs on that page, and puts them into a queue. The next URL is then pulled from the queue, and each of the URLs on that second page is put into the queue in turn. If allowed to run to completion, the crawler would visit every URL in the queue on a First In, First Out basis. The web crawler has a few key features:
- limiting domain: Ability to limit visiting and scraping of a web page to a given domain
- limiting path: Ability to limit visiting and scraping of a web page to any URL with a given string in the web address (example: only visit URLs with '/dept/finance/' in the URL).
- number of pages: Limit the number of pages to be visited and scraped. The full scope of URLs collected can easily grow out of control.

See source code: https://github.com/vidal-anguiano/LSDM-Assignment-2/blob/master/cityscrape/scrape.py
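
The crawl loop described above can be sketched in a few lines. This is a simplified, single-process illustration and not the code in `scrape.py`: the function names `crawl` and `allowed`, the `get_links` callback, and the in-memory `SITE` dictionary are all assumptions made so the example runs without network access. It also assumes that PDF links are collected regardless of the limiting domain.

```python
from collections import deque
from urllib.parse import urlparse


def allowed(url, limit_domain=None, limit_path=None):
    """Apply the optional domain and path limits to a candidate URL."""
    if limit_domain and urlparse(url).netloc != limit_domain:
        return False
    if limit_path and limit_path not in url:
        return False
    return True


def crawl(start_url, get_links, limit_domain=None, limit_path=None, max_pages=10):
    """Breadth-first crawl; returns (pages visited in order, PDF links found)."""
    to_visit = deque([start_url])  # FIFO queue of pages still to visit
    queued = {start_url}           # every URL seen so far, to avoid duplicates
    visited, pdf_links = [], []

    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()
        visited.append(url)
        for link in get_links(url):
            if link in queued:
                continue
            queued.add(link)
            if link.lower().endswith(".pdf"):
                pdf_links.append(link)   # collect PDFs, but do not crawl them
            elif allowed(link, limit_domain, limit_path):
                to_visit.append(link)
    return visited, pdf_links


# Hypothetical three-page site standing in for real HTTP requests.
SITE = {
    "http://example.org/": ["http://example.org/a", "http://other.org/x",
                            "http://example.org/f.pdf"],
    "http://example.org/a": ["http://example.org/", "http://example.org/b",
                             "http://example.org/g.pdf"],
    "http://example.org/b": [],
}

visited, pdfs = crawl("http://example.org/", lambda u: SITE.get(u, []),
                      limit_domain="example.org")
```

Lowering `max_pages` stops the crawl early even when the queue is not empty, which is the safeguard against the URL set growing out of control.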

### Parallelized PDF Scraper
The parallelized PDF scraper takes PDF URLs from the shared mp.Queue, pdflink_q (see below), as soon as the web crawler makes one available. The PDF scraper accepts several parameters that let the user cap the number of pages scraped from each PDF and scrape a random subset of pages from it. This functionality was created so that, for especially large PDF documents, features can still be generated from pages deep in the document rather than only from the first couple of pages. In the next stage of development, where a classifier is built to separate forms from non-forms, this will be useful for feature generation.

See source code: https://github.com/vidal-anguiano/LSDM-Assignment-2/blob/master/pdfscrape/pdf_pipeline_ns.py
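
One way the page-selection idea could work is sketched below. The function `select_pages` and its exact rules are assumptions for illustration and may differ from what `pdf_pipeline_ns.py` does: here a PDF with at most `maxpages` pages is scraped in full, and a longer one contributes its first `base` pages plus `random_sample` pages drawn at random from the remainder.

```python
import random


def select_pages(n_pages, maxpages, base, random_sample, seed=None):
    """Return sorted 0-based page indices to scrape from a PDF with n_pages pages."""
    if n_pages <= maxpages:
        return list(range(n_pages))           # short PDF: scrape every page

    first = list(range(min(base, maxpages)))  # pages from the beginning
    rest = range(len(first), n_pages)         # candidates for random sampling
    k = min(random_sample, len(rest), maxpages - len(first))
    rng = random.Random(seed)                 # seed only to make runs repeatable
    return first + sorted(rng.sample(rest, k))


short_doc = select_pages(3, maxpages=10, base=5, random_sample=5)
long_doc = select_pages(200, maxpages=10, base=5, random_sample=5, seed=0)
```

With the values used later in this notebook (10, 5, 5), a 200-page PDF would yield ten pages: the first five and five sampled from the other 195.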



### Use of Stanford CoreNLP
In this initial iteration, I demonstrate the ability to take text and run it through Stanford CoreNLP. The output returned by the CoreNLP server describes the named entities found in the text. In future iterations, CoreNLP will be used heavily for further feature generation to be used in classifying PDFs.

Instantiate a connection to the running Stanford CoreNLP server. The pycorenlp package provides a Python wrapper around the Stanford CoreNLP server API.

In [None]:
# nlp = pycorenlp.StanfordCoreNLP("http://ec2-18-234-241-82.compute-1.amazonaws.com:9000")
nlp = pycorenlp.StanfordCoreNLP("http://localhost:9000")
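
As a rough illustration of how named entities might be pulled out of a CoreNLP response, the sketch below parses the JSON structure returned when the `ner` annotator is requested with `outputFormat` set to `json`. The helper `named_tokens` is hypothetical, and `sample` is a hand-written, heavily abbreviated stand-in for a real server response so that the example runs without a server.

```python
def named_tokens(annotation):
    """Return (word, NER tag) pairs for every token not tagged 'O' (outside)."""
    return [
        (tok["word"], tok["ner"])
        for sent in annotation.get("sentences", [])
        for tok in sent.get("tokens", [])
        if tok.get("ner", "O") != "O"
    ]


# With a live server, the annotation would come from a call such as:
#   annotation = nlp.annotate(text, properties={"annotators": "ner",
#                                               "outputFormat": "json"})
# Abbreviated, hand-written stand-in for that response:
sample = {
    "sentences": [
        {"tokens": [
            {"word": "Chicago", "ner": "LOCATION"},
            {"word": "issues", "ner": "O"},
            {"word": "permits", "ner": "O"},
        ]}
    ]
}

entities = named_tokens(sample)
```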

Each of the mp.Queue() instances below is either used in this implementation or created now for future use. Notes on each one are provided in the comments.

In [None]:
# URLs still to be visited, processed on a FIFO basis
tovisit_q = mp.Queue()

# NOT IMPLEMENTED: will eventually hold data to be written out by another process
writeto_q = mp.Queue()

# URLs that are dead or could not be reached
faildrd_q = mp.Queue()

# Links to PDF files to be scraped
pdflink_q = mp.Queue()

# Holds a single set, shared by all processes, of pages already visited;
# used to avoid re-queueing URLs that have already been visited
visited_qs = mp.Queue()
visited_qs.put(set())

# Like visited_qs, but the set tracks URLs already placed in tovisit_q,
# so that they are not added again
tovisit_qs = mp.Queue()
tovisit_qs.put(set())
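
The "set inside a queue" pattern used for visited_qs and tovisit_qs can be sketched as follows. The helper `claim_url` is hypothetical and only illustrates the idea: because the queue holds exactly one item, whichever process takes the set out has exclusive access to it until the set is put back, so the queue doubles as a lock.

```python
import multiprocessing as mp


def claim_url(shared_set_q, url):
    """Return True if url was not yet in the shared set; record it either way."""
    seen = shared_set_q.get()      # blocks while another process holds the set
    try:
        is_new = url not in seen
        seen.add(url)
        return is_new
    finally:
        shared_set_q.put(seen)     # always hand the set back


demo_qs = mp.Queue()
demo_qs.put(set())

first = claim_url(demo_qs, "http://example.org/a")    # not seen before
second = claim_url(demo_qs, "http://example.org/a")   # already recorded
```

One trade-off of this design is that the whole set is pickled on every `get` and `put`, so the cost of each check grows with the number of URLs recorded.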

**SKIP_URL** is a list of strings; if any of them is found in a URL, that URL is ignored and the page is not visited.

**w** instantiates a WebScrape object, which takes the above mp.Queue() objects in addition to a URL to start the scrape from and a limiting domain (lmt_doma); any URL outside the limiting domain is ignored.

**arr** and **arr2** set the arguments for the function that scrapes the PDFs found on the City of Chicago website. The parameters of the `scrape_pdfs` function are as follows (they can also be seen in the function docstring):
- pdflink_q: Described above
- maxpages: Maximum number of pages to be scraped from any given PDF
- base: Number of pages to be scraped from the beginning of a PDF
- random_sample: Number of pages to be scraped randomly from the entire PDF
- to_scrape: Number of PDF documents to scrape from those identified on the City of Chicago website
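- scrape_file: Name of file where outputs are to be stored.
- (seventh positional argument, not named in this list): the filename passed as 'temp1.pdf' and 'temp2.pdf' in the tuples below, presumably a temporary file where each process saves a PDF before scraping it.
- final: Boolean marking the process that is expected to finish last, so that the mp.Queue() object can be flushed before p.join() is called (join will hang otherwise).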
- nlp: Connection to running Stanford CoreNLP Server
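
The reason for the `final` flag is a documented behavior of `multiprocessing`: a process that has put items on a queue waits until its buffered items have been flushed to the underlying pipe before it terminates, so joining it while items remain unconsumed can hang. A minimal sketch of draining a queue is shown below; the helper `drain` and its timeout value are assumptions for illustration, not the code used in `scrape_pdfs`.

```python
import multiprocessing as mp
import queue


def drain(q, timeout=0.5):
    """Remove and return items from q until none arrives within timeout seconds."""
    items = []
    while True:
        try:
            items.append(q.get(timeout=timeout))
        except queue.Empty:        # mp.Queue raises the standard queue.Empty
            return items


demo_q = mp.Queue()
for i in range(3):
    demo_q.put(i)

leftover = drain(demo_q)           # after this, joining producers is safe
```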

In [None]:
SKIP_URL = ['department', 'spec', 'council', 'please', '311', 'phone', 'press', 'release']

w = scrape.WebScrape('http://cityofchicago.org',
                     tovisit_q,
                     writeto_q,
                     faildrd_q,
                     pdflink_q,
                     visited_qs,
                     tovisit_qs,
                     lmt_doma='www.cityofchicago.org')

arr = (pdflink_q, 10, 5, 5, 30, './scrape1.csv', 'temp1.pdf', False, nlp)
arr2 = (pdflink_q, 10, 5, 5, 60, './scrape2.csv', 'temp2.pdf', True, nlp)
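
For reference, the substring test that SKIP_URL implies could look like the sketch below. The function `should_skip` is a hypothetical stand-in, not the method used inside `WebScrape`, and the URLs are made up for the example.

```python
def should_skip(url, skip_terms):
    """Return True if any skip term occurs anywhere in the URL (case-sensitive)."""
    return any(term in url for term in skip_terms)


skip_terms = ["press", "311"]

skipped_press = should_skip("http://www.cityofchicago.org/press/2019.html", skip_terms)
kept_form = should_skip("http://www.cityofchicago.org/forms/permit.html", skip_terms)
skipped_by_accident = should_skip("http://www.cityofchicago.org/doc/13110.pdf", skip_terms)
```

Plain substring matching is simple and fast, but short terms such as '311' also match unrelated URLs that merely contain those characters, as the last call shows.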

### Implementation
There are three processes running below. The first (p1) crawls the web to collect PDF links. The second (p2) and third (p3) scrape PDFs, hit the Stanford CoreNLP server, and write results to a CSV file. Results include a flag indicating whether the PDF is "fillable" (editable electronically), the number of pages in the PDF, the text scraped from the PDF, and named entity results from CoreNLP.

In [None]:
p1 = Process(name="Web Crawler",
             target=w.scrape,
             args=(100, SKIP_URL))

p2 = Process(name="PDF Scraper 1",
             target=pdf_pipeline_ns.scrape_pdfs,
             args=arr)

p3 = Process(name="PDF Scraper 2",
             target=pdf_pipeline_ns.scrape_pdfs,
             args=arr2)


p1.start()
p2.start()
p3.start()


p1.join()
p2.join()
p3.join()

### Next Steps
Future iterations will make fuller use of Stanford CoreNLP for feature generation, in support of the classifier that will separate PDF forms from non-forms.