# Whereabout Streets Data Extraction
This notebook demonstrates how to access the Street and Bridge Operations PDF file and extract its data to create a work order plan template.

<div style="text-align:center"><img src="https://upload.wikimedia.org/wikipedia/en/9/94/Closeup_of_pavement_with_grass.JPG" /></div>

## Introduction
The purpose of this notebook is to create Street and Bridge work order plans based on segment IDs and additional comments on long line markings. The markings feature layers are published on the City of Austin ArcGIS Portal page and are also available for public view.

The schedule of streets where sealcoat and overlay work has been completed is sent by Street and Bridge Operations through email on a daily basis. It arrives as a PDF file that lists weather conditions and temperature and provides a table of streets where paving is completed.

<b>The only manual steps the user has to perform are:</b>
- Input Segment IDs
- Make comments on long line markings
- Specify MONTH, DAY, and YEAR, which determine the PDF name and file path of the report to read
- Create any missing markings assets that are not visible in aerial imagery

This replaces the previous process of manually editing a plan layout by copy-pasting imagery, writing in Location IDs, work groups, and the markings found, and then exporting plans one at a time. An Excel document is created from the user's input, and its segment IDs are read to find all short line and specialty point markings. Ideally, this generates multiple PDF plans in less time.

In the future, I would like to make this script more customizable so that it runs without manually entered Segment IDs, taking only specific long line markings as input and using the maintained streets feature layer.

## Imports
The packages used for this project are:
- [exchangelib](https://github.com/ecederstrand/exchangelib) to access the attachments sent by Street and Bridge Operations
- [pdfplumber](https://github.com/jsvine/pdfplumber) to extract tables from the whereabouts report
- [pandas](https://pandas.pydata.org/) to create dataframe of extracted table and transform the data
- [openpyxl](https://openpyxl.readthedocs.io/en/stable/) to edit excel files
- [arcgis](https://esri.github.io/arcgis-python-api/apidoc/html/) to search for markings feature layer dataset

In [1]:
from exchangelib import DELEGATE, Account, Credentials, Configuration, FileAttachment, ItemAttachment
import pdfplumber
import pandas as pd
from openpyxl import Workbook,load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows

from arcgis.gis import GIS
from arcgis.features import FeatureLayer

## Constants

The month and day constants determine the name of the PDF file that is read into a dataframe. The folder path, which depends on the year, determines where the plans are created. These constants are set at the top of the notebook so they can be changed as needed.

<i>The table below explains the purpose of each constant.</i>

| Constant | Description   |
|:--------:|----|
| <b>MONTH, DAY, YEAR</b> |Date used to find the PDF (named in month-day format) and to build the file path (based on year)|
|<b>FOLDER</b>      |Directory where SBO whereabouts reports from email are saved|
|<b>FILE_NAME</b>   |File path, without extension, used to extract the SBO whereabouts report|
|<b>SIGN_IN</b>   |Whether to prompt the user to sign in to Outlook email|
|<b>INPUT</b>|Whether to prompt the user to input segment IDs and comments to export to Excel|

In [2]:
MONTH,DAY,YEAR = ('July',str(18),str(2019))
FOLDER = (r"G:\ATD\Signs_and_Markings\MARKINGS\Whereabouts WORK ORDERS\{}\Whereabouts_Summary").format(YEAR)
FILE_NAME = "\\".join((FOLDER," ".join((MONTH,DAY))))
SIGN_IN = True
INPUT= True

%store MONTH
%store DAY
%store YEAR
%store FOLDER
%store FILE_NAME

Stored 'MONTH' (str)
Stored 'DAY' (str)
Stored 'YEAR' (str)
Stored 'FOLDER' (str)
Stored 'FILE_NAME' (str)
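
The path logic in the constants cell can also be written with `pathlib`, which avoids joining on `"\\"` by hand. This is only a minimal sketch: `build_file_name` and `FOLDER_TEMPLATE` are hypothetical names that are not part of the notebook, and `PureWindowsPath` is used so the result looks the same on any operating system.

```python
from pathlib import PureWindowsPath

# Folder template copied from the constants cell; {} is filled with the year
FOLDER_TEMPLATE = r"G:\ATD\Signs_and_Markings\MARKINGS\Whereabouts WORK ORDERS\{}\Whereabouts_Summary"

def build_file_name(month, day, year, template=FOLDER_TEMPLATE):
    """Return the report path without its extension (hypothetical helper)."""
    folder = PureWindowsPath(template.format(year))
    # The reports are named "<month> <day>", for example "July 18"
    return str(folder / "{} {}".format(month, day))

file_name = build_file_name("July", "18", "2019")
print(file_name + ".pdf")
```

Appending `".pdf"` or `".xlsx"` to the returned string gives the report and spreadsheet paths used later in the notebook.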


## Methods
These functions are used to extract the data and transform it into a usable format.

<i>The table below explains the purpose of each:</i>

| Method | Description   |
|:--------:|----|
|<b>lists_to_df</b> |Converts extracted nested list into a dataframe|
|<b>pdf_table_to_df</b> |Extracts table from PDF and then converts to dataframe|
|<b>input_form</b> |Prompts user to input segment IDs and long line specifications|
|<b>query_df</b>   |Query dataframe by segment IDs|

In [8]:
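# Returns dataframe of transformed extracted table
def lists_to_df(data,columns):
    # Flatten the list of tables into a single list of rows
    l = [item for sublist in data for item in sublist]
    # Drop empty cells, then drop empty rows and repeated header rows
    l = [[x for x in y if x is not None and x != ''] for y in l]
    l = [x for x in l if x and x[0] != 'ID#']
    for i in l:
        # Remove a leading non-numeric cell so the row starts with the Location ID
        if not i[0].isdigit():
            del i[0]
        # Keep only as many cells as there are columns
        del i[len(columns):]
    df = pd.DataFrame(l,columns=columns)
    return df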

# Opens PDF to extract table and convert to dataframe
def pdf_table_to_df(columns):
    # The with block closes the PDF automatically, so no explicit close() is needed
    with pdfplumber.open(FILE_NAME + ".pdf") as pdf:
        pg1 = pdf.pages[0]
        data = pg1.extract_tables(table_settings={})
    return lists_to_df(data,columns)

# Prompts user to input segment IDs and comments, then adds the input to the dataframe as new columns
def input_form(df,columns):
    segments, comments = [],[]
    for index,row in df.iterrows():
        location = "{} from {} to {}".format(row["Street"],row["From"],row["To"])
        console = input(location + "\nSegment ID list: ")
        try:
            # Validate that every tab-separated value is an integer
            list(map(int, console.split('\t')))
            segments.append(console)
        except ValueError:
            print("Skipping input...")
            segments.append(None)
        comment = input("Comment: ")
        comments.append(comment)
    df['Segment IDs'] = [s.replace('\t',',') if s is not None else None for s in segments]
    df['Comments'] = comments
    print("\nInput complete.")
    
# Returns df1 with the markings found in the listed segment IDs appended
def query_df(fc,index,f,df,df1):
    ids = df["Segment IDs"][index]
    # Skip rows with no segment IDs (None from input, NaN when read back from excel)
    if pd.isna(ids):
        return df1
    q = "SEGMENT_ID IN({})".format(ids)
    c = fc.query(where=q,return_count_only=True)
    if c != 0:
        sdf = fc.query(where=q).sdf.filter(items=f)
        sdf["Location ID"] = df["Location ID"][index]
        sdf["Comments"]= df["Comments"][index]
        # DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
        df1 = pd.concat([df1,sdf])
    return df1
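
To make the row cleaning in `lists_to_df` easier to follow, here is a pandas-free sketch of the same steps applied to a small sample. `clean_rows` is a hypothetical helper, not part of the notebook. The sample only approximates the nested-list shape that `extract_tables` returns, and its `Status` column and `*` cell are invented for illustration.

```python
def clean_rows(tables, n_columns, header_marker="ID#"):
    """Flatten extracted tables into clean rows (hypothetical helper)."""
    # Flatten the list of tables into one list of rows
    rows = [row for table in tables for row in table]
    # Drop empty cells from every row
    rows = [[c for c in row if c not in (None, "")] for row in rows]
    cleaned = []
    for row in rows:
        # Skip blank rows and header rows
        if not row or row[0] == header_marker:
            continue
        # Drop a leading non-numeric cell so the row starts with the ID
        if not row[0].isdigit():
            row = row[1:]
        # Keep only the wanted number of columns
        cleaned.append(row[:n_columns])
    return cleaned

# Illustrative sample: one table with a header, two data rows, and a blank row
sample = [[
    ["ID#", "Street", "From", "To", "Status"],
    ["62874", "DENELL CIR", None, "Walnut Bend Dr", "10638", "Done"],
    ["*", "62957", "HOLLYBLUFF ST", "", "Middle Fiskville Rd", "1138"],
    [None, ""],
]]
result = clean_rows(sample, 4)
print(result)
```

Each surviving row starts with the Location ID and has exactly four cells, which is the shape `pd.DataFrame(l, columns=columns)` expects.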

## Loading and Transforming Data

### Email Attachment Extraction

Attachments are extracted from the inbox. `getpass` prompts the user for the email password without displaying it.

If the attachments have already been exported to the folder, a sign-in is not required and `SIGN_IN` can be set to `False`.
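
Whether a sign-in is needed could be decided by checking the folder for the day's report first. The sketch below assumes reports are saved as `<MONTH> <DAY>.pdf`, as in the output listing of this notebook. `report_exists` is a hypothetical helper, and a temporary folder stands in for `FOLDER`.

```python
import tempfile
from pathlib import Path

def report_exists(folder, month, day):
    """Return True if '<month> <day>.pdf' is already in the folder (hypothetical helper)."""
    return (Path(folder) / "{} {}.pdf".format(month, day)).is_file()

# Demonstration with a temporary folder standing in for FOLDER
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "July 18.pdf").write_bytes(b"%PDF-1.4")
    found = report_exists(tmp, "July", "18")
    missing = report_exists(tmp, "July", "19")

print(found, missing)
```

In the notebook this could be used as `SIGN_IN = not report_exists(FOLDER, MONTH, DAY)`, so the email prompt only appears when the report still has to be downloaded.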

In [9]:
import getpass

# Email subject line used for Street and Bridge Whereabouts report
daily_subject = "S&B Whereabouts"

# This will try to prompt the user to input email and password if SIGN_IN is True
try:
    if SIGN_IN:
        email = input("Enter email: ")
        password = getpass.getpass("Enter password: ")
        credentials = Credentials(username = email,password = password)
        config = Configuration(server='outlook.office365.com', credentials=credentials)
        account = Account(
            primary_smtp_address=email,
            config=config,
            autodiscover=False,
            access_type=DELEGATE)
        print("\nFile attachments below are:")
        for item in account.inbox.filter(subject__contains=daily_subject):
            for attachment in item.attachments:
                if isinstance(attachment, FileAttachment):
                    file_path = "\\".join([FOLDER,attachment.name])
                    with open(file_path, 'wb') as f:
                        f.write(attachment.content)
                    print(file_path)
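except Exception as e:
    # Sign-in is the most likely failure, but report the actual error as well
    print("\nSign-in or download failed (check username and password):", e)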

Enter email:  Susanne.Gov@austintexas.gov
Enter password:  ···········



File attachments below are:
G:\ATD\Signs_and_Markings\MARKINGS\Whereabouts WORK ORDERS\2019\Whereabouts_Summary\Aug 2.pdf
G:\ATD\Signs_and_Markings\MARKINGS\Whereabouts WORK ORDERS\2019\Whereabouts_Summary\July 30.pdf
G:\ATD\Signs_and_Markings\MARKINGS\Whereabouts WORK ORDERS\2019\Whereabouts_Summary\July 25.pdf
G:\ATD\Signs_and_Markings\MARKINGS\Whereabouts WORK ORDERS\2019\Whereabouts_Summary\July 24.pdf
G:\ATD\Signs_and_Markings\MARKINGS\Whereabouts WORK ORDERS\2019\Whereabouts_Summary\July 23.pdf
G:\ATD\Signs_and_Markings\MARKINGS\Whereabouts WORK ORDERS\2019\Whereabouts_Summary\July 19.pdf
G:\ATD\Signs_and_Markings\MARKINGS\Whereabouts WORK ORDERS\2019\Whereabouts_Summary\July 18.pdf


### PDF tables to Excel

Now that the PDFs have been extracted and exported to the folder path, the next step is to extract the table from the PDF and export it as an Excel file.

An input form is generated so the user can enter the Segment IDs and a comment for each of the streets listed. The columns list keeps only the relevant columns from the extracted table. The `pdfplumber` package extracts the table from the PDF, and the user is then prompted to submit data.

The input is stored in a DataFrame that is saved to an Excel document. If the user already provided input in a previous session, the DataFrame is read from the existing Excel file instead.
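
The input form expects the Segment IDs for a street as tab-separated integers and stores them as a comma-separated string. A minimal sketch of that conversion is shown here, with `parse_segment_ids` as a hypothetical helper that is not part of the notebook.

```python
def parse_segment_ids(raw):
    """Turn tab-separated input into a comma-separated string, or None if invalid."""
    try:
        ids = [int(part) for part in raw.split("\t")]
    except ValueError:
        # Empty or non-numeric input is treated as "no segment IDs"
        return None
    return ",".join(str(i) for i in ids)

print(parse_segment_ids("2010596\t2010623\t2010658"))
print(parse_segment_ids(""))
```

Returning `None` for invalid input mirrors the "Skipping input..." branch of `input_form`, so streets without Segment IDs can be skipped when the feature layer is queried.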

In [12]:
from pathlib import Path

# Columns of extracted table
columns = ["Location ID", "Street", "From", "To"]
excel_file = FILE_NAME + ".xlsx"
        
# Will prompt input and export to excel unless the excel file already exists. In that case it will read excel file instead
if Path(excel_file).exists():
    df = pd.read_excel(excel_file,index_col=0)
else:
    if INPUT:
        df = pdf_table_to_df(columns)
        input_form(df,columns)
        df.to_excel(excel_file,sheet_name=" ".join((MONTH,DAY)))

In [13]:
display(df.fillna("N/A"))

Unnamed: 0,Location ID,Street,From,To,Segment IDs,Comments
0,62874,DENELL CIR,Walnut Bend Dr,10638,2039819,
1,63235,WALNUT BEND DR,Hollybluff St,10618,"2010596,2010623,2010658",
2,62957,HOLLYBLUFF ST,Middle Fiskville Rd,1138,2010546,
3,63171,SHAWN LEE CV,Stephanie Lee Ln,10921,2031940,
4,63172,SHERRY LEE CV,Stephanie Lee Ln,10919,2031939,
5,63184,STEPHANIE LEE LN,Jamie Glen Way,Claywood Dr,"2012417,2012418,2039782",
6,62976,JAMIE GLEN WAY,Stephanie Lee Ln,Collinwood West Dr,2010482,
7,63151,SALEM LN,Middle Fiskville Rd,Walnut Bend Dr,"2010599,2010624",
8,62796,BLUFF BEND DR,10200,Hollybluff St,"2010570,2012326,2012327,2012328,3195019,3195020",
9,62907,FLORADALE DR,Middle Fiskville Rd,Cy Ln,"2010663,2010714",


This file contains a table for the list of streets with the following columns:
- <i>Location ID</i>: unique identifier used for street paving
- <i>Street</i>: main street that is paved
- <i>From</i>: intersecting cross street
- <i>To</i>: intersecting cross street
- <i>Segment IDs</i>: list of segment IDs where the street is paved, separated by commas
- <i>Comments</i>: Notes on long line markings

### Feature Layer Data Query

The next task is to find the markings for the list of segment IDs the user has entered. The `arcgis` package is used to extract the markings available in each segment ID.

Since the markings datasets are publicly available, we can log in to ArcGIS Online anonymously.

Pass `client_id` instead of `None` to log in through an AGOL federated account. This prompts the user to enter a code, which can be obtained by following the on-screen instructions. Going through an AGOL federated account is useful if the user wishes to add their own layers as a reference, such as [NearMap](https://go.nearmap.com/) aerial imagery.

The code searches the markings feature layers based on the list of segment IDs provided by the Excel file.
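
The query sent to each feature layer is a SQL-style `where` clause built from the Segment IDs column. The sketch below shows one way to build it defensively. `build_where_clause` is a hypothetical helper, and rejecting non-numeric values is an added assumption rather than something the notebook does.

```python
def build_where_clause(segment_ids, field="SEGMENT_ID"):
    """Return a SQL-style where clause, or None when there is nothing to query."""
    if segment_ids is None:
        return None
    # A lone ID may be read back from Excel as a number, so normalise to text first
    parts = [p.strip() for p in str(segment_ids).split(",") if p.strip()]
    # Reject empty or non-numeric values (this also covers NaN, which becomes "nan")
    if not parts or not all(p.isdigit() for p in parts):
        return None
    return "{} IN ({})".format(field, ",".join(parts))

print(build_where_clause("2010599,2010624"))
print(build_where_clause(None))
```

A clause of this form can be passed as the `where` argument of the layer's `query` call, and a `None` result signals that the street should be skipped.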

In [14]:
# variables used to find and query feature layer in AGOL
client_id = "CrnxPfTcm7Y7ZGl7"
url = r"https://services.arcgis.com/0L95CJ0VTaxqcmED/arcgis/rest/services/TRANSPORTATION_markings_{}/FeatureServer/0"
sl,sp = (pd.DataFrame(),pd.DataFrame())

# Columns for data frame. Indexes: df (0-1), short line (2-4), specialty point (4-7)
cols = {'Location ID': 'LOCATION ID', 'Comments':'COMMENTS', 'MARKINGS_SHORT_LINE_ID': 'SHORTLINE ID',
        'SHORT_LINE_TYPE': 'SHORTLINE TYPE', 'SEGMENT_ID': 'SEGMENT ID', 'MARKINGS_SPECIALTY_POINT_ID':'SPECIALTY ID',
        'SPECIALTY_POINT_TYPE': 'SPECIALTY TYPE', 'SPECIALTY_POINT_SUB_TYPE': 'SPECIALTY SUBTYPE'}

# Access markings feature layers to query and append as a single new data frame
try:
    gis = GIS("https://austin.maps.arcgis.com/home/index.html",client_id=None)
    for index,row in df.iterrows():
        sl = query_df(FeatureLayer(url.format("short_line")),index,list(cols)[2:5],df,sl)
        sp = query_df(FeatureLayer(url.format("specialty_point")),index,list(cols)[4:],df,sp)
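    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    markings = pd.concat([sl,sp],sort=True).reindex(columns=list(cols))
    markings.columns = list(cols.values())
    display(markings)
except Exception as e:
    # Report the failure instead of silently ignoring it
    print("Markings query failed:", e)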

Unnamed: 0,LOCATION ID,COMMENTS,SHORTLINE ID,SHORTLINE TYPE,SEGMENT ID,SPECIALTY ID,SPECIALTY TYPE,SPECIALTY SUBTYPE
0,63235,,7052.0,STOP_LINE,2010623,,,
1,63235,,10759.0,STOP_LINE,2010658,,,
2,63235,,9773.0,STOP_LINE,2010596,,,
0,62957,,8667.0,STOP_LINE,2010546,,,
0,63151,,6715.0,STOP_LINE,2010624,,,
1,63151,,8517.0,STOP_LINE,2010624,,,
2,63151,,9022.0,STOP_LINE,2010599,,,
0,62796,,7133.0,STOP_LINE,2012328,,,
1,62796,,8516.0,STOP_LINE,3195020,,,
2,62796,,10757.0,STOP_LINE,2012326,,,


This dataframe lists pavement markings queried by segment IDs with the following columns:
- <i>LOCATION ID</i>: Unique identifier used for street paving
- <i>COMMENTS</i>: Notes on long line markings
- <i>SHORTLINE ID</i>: Unique identifier used for short line markings
- <i>SHORTLINE TYPE</i>: Type of short line (crosswalk, stop line, yield line, etc.)
- <i>SEGMENT ID</i>: Segment ID where the marking is located
- <i>SPECIALTY ID</i>: Unique identifier used for specialty point markings
- <i>SPECIALTY TYPE</i>: Type of specialty marking domain code (Arrow, Symbol, Word, etc.)
- <i>SPECIALTY SUBTYPE</i>: Subtype of specialty marking domain code (Left turn, Bicyclist, Stop, etc.)

The dataframe is saved to an Excel sheet so it can be used again to generate the template.

In [15]:
wb = load_workbook(filename = excel_file)
sheet_name = "markings list"
if sheet_name in wb:
    ws = wb[sheet_name]
else:
    ws = wb.create_sheet(sheet_name)
    for r in dataframe_to_rows(markings, index=False, header=True):
        ws.append(r)
    wb.save(excel_file)

## Generating Whereabouts Plans
Generating the whereabouts plans requires the `arcpy` package, which requires Python 2 and ArcMap 10.5. Eventually, this notebook will be able to use `arcpy` in Python 3.

[Click here to access notebook](PlansTemplate.ipynb)

# Create Spreadsheet of Completed Streets
This section builds a report of the streets extracted from all of the PDFs in the folder.

In [57]:
import os
import pandas as pd

# Columns of extracted table
columns = ["Location ID", "Street", "From", "To"]
df = pd.DataFrame()

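report_file = FOLDER + "\\SBO Street List.xlsx"
try:
    # Reuse the report if it was already generated
    df = pd.read_excel(report_file,index_col=0)
except FileNotFoundError:
    frames = []
    for foldername,subfolders,files in os.walk(FOLDER):
        for file in files:
            if file.endswith('.pdf'):
                # Extract the table from this PDF rather than the one named by FILE_NAME
                with pdfplumber.open(os.path.join(foldername,file)) as pdf:
                    df1 = lists_to_df(pdf.pages[0].extract_tables(),columns)
                df1["filename"] = file
                frames.append(df1)
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    df = pd.concat(frames,sort=True) if frames else pd.DataFrame(columns=columns)
    df.to_excel(report_file,sheet_name="Report")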
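
Because every report needs its own path when it is opened, it can help to collect the full PDF paths before extracting anything. The sketch below is one way to do that. `find_pdfs` is a hypothetical helper, and the temporary folder and file names only stand in for `FOLDER` and its contents.

```python
import os
import tempfile

def find_pdfs(folder):
    """Return sorted full paths of every .pdf under folder (hypothetical helper)."""
    paths = []
    for dirpath, _dirs, files in os.walk(folder):
        for name in files:
            if name.lower().endswith(".pdf"):
                # Join with dirpath so files in subfolders resolve correctly
                paths.append(os.path.join(dirpath, name))
    return sorted(paths)

# Demonstration with a temporary folder standing in for FOLDER
with tempfile.TemporaryDirectory() as tmp:
    for name in ("July 18.pdf", "July 19.pdf", "July 18.xlsx"):
        open(os.path.join(tmp, name), "wb").close()
    found = [os.path.basename(p) for p in find_pdfs(tmp)]

print(found)
```

Each returned path can then be passed to `pdfplumber.open`, with `os.path.basename` supplying the value for the `filename` column.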