# Whereabout Streets Data Extraction
This notebook demonstrates how to access the Street and Bridge Operations PDF report and extract its data to create a work order plan template.

<div style="text-align:center"><img src="https://upload.wikimedia.org/wikipedia/en/9/94/Closeup_of_pavement_with_grass.JPG" /></div>

## Introduction
The purpose of this notebook is to create Street and Bridge work order plans based on segment IDs and additional comments on long line markings. The markings feature layers are published on the City of Austin ArcGIS Portal and are available for public view.

The schedule of completed sealcoat and overlay streets is emailed daily by Street and Bridge Operations. It arrives as a PDF file that lists weather conditions and temperature and provides a table of streets where paving is complete.

<b>The only manual steps the user has to perform are:</b>
- Input Segment IDs
- Make comments on long line markings
- Specify MONTH/DAY/YEAR, which determine the PDF name and file path used to retrieve the table of completed streets
- Create any missing markings assets that are not visible in aerial imagery

This cuts down on the previous process of manually editing a plans layout by copy-pasting imagery, writing Location IDs, work groups, and markings found, and then exporting plans one at a time. An Excel document is created from this input, and its segment IDs are read to find all short line and specialty point markings. Ideally, this generates multiple PDF plans in less time.

In the future, I would like to make this script more customizable so that it runs without manually entered Segment IDs, taking only specific long line markings as input and using the maintained streets feature layer.

## Imports
The packages used for this project are:
- [exchangelib](https://github.com/ecederstrand/exchangelib) to access the attachments sent by Street and Bridge Operations
- [pdfplumber](https://github.com/jsvine/pdfplumber) to extract tables from the whereabouts report
- [pandas](https://pandas.pydata.org/) to create a dataframe from the extracted table and transform the data
- [openpyxl](https://openpyxl.readthedocs.io/en/stable/) to edit Excel files
- [arcgis](https://esri.github.io/arcgis-python-api/apidoc/html/) to search for markings feature layer dataset

In [2]:
from exchangelib import DELEGATE, Account, Credentials, Configuration, FileAttachment, ItemAttachment
import pdfplumber
import pandas as pd
from openpyxl import Workbook,load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows

from arcgis.gis import GIS
from arcgis.features import FeatureLayer

from functools import reduce
import numpy as np

## Constants

The month and day constants determine the name of the PDF file to load into a dataframe. The folder path, which depends on the year, determines where the plans are created. These constants are set at the top of the notebook so they can be changed as needed.

<i>The table below explains the purpose of each constant.</i>

| Constant | Description   |
|:--------:|----|
| <b>MONTH, DAY, YEAR</b> |Date used to find the PDF (named in month-day format) and to build the file path (based on year)|
|<b>FOLDER</b>      |Directory where SBO whereabouts reports from email are saved|
|<b>FILE_NAME</b>   |File path, without extension, of the SBO whereabouts report to extract|
|<b>EXCEL_FILE</b>   |File path of the Excel file that stores the extracted table and user input|
|<b>SIGN_IN</b>   |Whether to prompt the user to sign in to Outlook email|
|<b>INPUT</b>|Whether to prompt the user to input segment IDs and comments to export to Excel|
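
As a rough illustration of how these constants combine, the sketch below builds the PDF and Excel paths with `pathlib.PureWindowsPath` instead of string joins. The `BASE` folder is a hypothetical placeholder rather than the real share path, and `report_paths` is a helper name assumed for this example only.

```python
from pathlib import PureWindowsPath

# Hypothetical root folder; the notebook uses a network share instead.
BASE = PureWindowsPath(r"C:\Whereabouts WORK ORDERS")

def report_paths(month, day, year, base=BASE):
    """Return (pdf_path, excel_path) for one daily whereabouts report."""
    folder = base / str(year) / "Whereabouts_Summary"
    stem = f"{month} {day}"  # reports are named like "May 30"
    return folder / f"{stem}.pdf", folder / f"{stem}.xlsx"

pdf_path, excel_path = report_paths("May", 30, 2019)
print(pdf_path)
print(excel_path)
```

Using path objects avoids hand-written backslash joins, although the string-based constants in the next cell work equally well on Windows.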

In [3]:
MONTH,DAY,YEAR = ('May',str(30),str(2019))
FOLDER = (r"G:\ATD\Signs_and_Markings\MARKINGS\Whereabouts WORK ORDERS\{}\Whereabouts_Summary").format(YEAR)
FILE_NAME = "\\".join((FOLDER," ".join((MONTH,DAY))))
EXCEL_FILE = FILE_NAME + ".xlsx"
SIGN_IN = False # Bug in exchangelib because of update
INPUT= True

%store FOLDER
%store EXCEL_FILE

Stored 'FOLDER' (str)
Stored 'EXCEL_FILE' (str)


## Methods
These functions are used to extract the data and transform it into a usable format.

<i>The table below explains the purpose of each:</i>

| Method | Description   |
|:--------:|----|
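|<b>pdf_table_to_df</b> |Extracts table from PDF, then converts the nested lists to a dataframe|
|<b>input_form</b> |Prompts user to input segment IDs and long line specifications|
|<b>query_df</b>   |Queries a feature layer by segment IDs and appends the results to a dataframe|
|<b>markings_count</b> |Returns a pivot table of counts for each marking type|
|<b>specialty_markings</b> |Renames specialty point markings based on domain codes|
|<b>specifications</b> |Builds the specifications text listed for each row|
|<b>location_in_df</b> |Returns the markings count and page dataframes|
|<b>group_pivot</b> |Groups by a column and pivots to show counts|
|<b>create_cover</b> |Returns the cover page dataframe|
|<b>create_pages</b> |Returns the pages dataframe|
|<b>create_ws</b> |Writes a dataframe to a worksheet in the Excel file|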
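
To make the extraction step more concrete, here is a minimal sketch of the row cleanup that `pdf_table_to_df` performs once `pdfplumber` has returned nested lists: flatten the tables, drop empty cells, skip header rows, and trim each row to the expected number of columns. The `raw_tables` sample is made up for illustration, and `clean_rows` is a hypothetical helper that is not part of this notebook.

```python
def clean_rows(tables, n_columns, header_marker="ID#"):
    """Flatten extracted tables and normalize each row to n_columns cells."""
    cleaned = []
    for table in tables:
        for row in table:
            # Drop cells that came back as None or as empty strings.
            cells = [c for c in row if c not in (None, "")]
            if not cells or cells[0] == header_marker:
                continue  # skip blank rows and header rows
            if not cells[0].isdigit():
                cells = cells[1:]  # discard a non-numeric leading cell
            cleaned.append(cells[:n_columns])
    return cleaned

# Made-up sample shaped like the nested lists returned by extract_tables()
raw_tables = [[
    ["ID#", "Street", "From", "To", "Notes"],
    [None, "", None, None, None],
    ["62963", "HYMEADOW DR", "Woodlawn Village Dr", "12519", "done"],
]]
rows = clean_rows(raw_tables, n_columns=4)
print(rows)
```

The cleaned rows can then be passed to `pd.DataFrame(rows, columns=columns)`, which is what the function below does.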

In [48]:
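# Opens the PDF, extracts the tables on the first page, and converts them to a dataframe
def pdf_table_to_df(columns):
    # The context manager closes the PDF, so no explicit close() is needed
    with pdfplumber.open(FILE_NAME + ".pdf") as pdf:
        pg1 = pdf.pages[0]
        data = pg1.extract_tables(table_settings={})
    # Flatten the list of tables into a single list of rows
    rows = [row for table in data for row in table]
    # Drop empty cells, then drop empty rows and header rows
    rows = [[cell for cell in row if cell is not None and cell != ''] for row in rows]
    rows = [row for row in rows if len(row) != 0]
    rows = [row for row in rows if row[0] != 'ID#']
    for row in rows:
        # Remove a non-numeric leading cell, then trim extra trailing cells
        if not row[0].isdigit():
            del row[0]
        del row[len(columns):]
    df = pd.DataFrame(rows,columns=columns)
    return df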

# Prompts user to input segment IDs and comments, then adds the input to the dataframe in place
def input_form(df,columns):
    segments, comments = [],[]
    for index,row in df.iterrows():
        location = "{} from {} to {}".format(row["Street"],row["From"],row["To"])
        console = input(location + "\nSegment ID list: ").strip()
        if console:
            # Tab-separated IDs become comma-separated
            segments.append(console.replace('\t',','))
        else:
            # Blank input is stored as None so fillna("N/A") can mark it later
            print("Skipping input...")
            segments.append(None)
        comment = input("Comment: ").strip()
        comments.append(comment if comment else None)
    df['Segment IDs'], df['Comments'] = segments, comments
    print("\nInput complete.")
    
# Returns query dataframe appended if markings exist in the listed segment IDs
def query_df(fc,index,f,df,df1):
    q = "SEGMENT_ID IN({})".format(df["Segment IDs"][index])
    if q != "SEGMENT_ID IN(N/A)":
        c = fc.query(where=q,return_count_only=True) 
        if c != 0:
            sdf = fc.query(where=q).sdf.filter(items=f)
            sdf["Location ID"] = df["Location ID"][index]
            sdf["Comments"] = df["Comments"][index]
            df1 = pd.concat([df1,sdf],sort=True)
    df1['COUNTS'] = 1
    return df1

# Return pivot table dataframe of counts for each markings
def markings_count(df,group,columns,wg):
    df = df.groupby(group).count()[['SEGMENT_ID']].rename(columns={"SEGMENT_ID":'COUNTS'})
    df = df.pivot_table(values='COUNTS',index='Location ID',columns=columns,aggfunc='first').reset_index()
    df[wg] = wg
    return df

# Rename markings sp based on domain code
def specialty_markings(df,field):
    if field in df.columns:
        renameList = list(zip(list(df.SPECIALTY_POINT_TYPE),list(df.SPECIALTY_POINT_SUB_TYPE)))
        arrow = ["Through","Left","Right","Left/Right","Left/Right/Through",
                 "Left/Through","Right/Through","U-turn","Lane reduction","Wrong way","Bike"]
        other = ["Green pad", "Green launch pad", "Speed hump marking","Diagonal crosshatch", "Chevron crosshatch"]
        parking = ["Parking 'L'", "Parking 'T'", "Parking stall line", "Handicap symbol"]
        symbol = ["Bike","Shared lane (Sharrow)","Bicyclist","Railroad Crossing (RxR)","Chevron","Pedestrian","Diamond"]
        word = ["Stop","Yield","Ahead","Only","Merge","Ped", "X-ing","Bus Only","Keep Clear","Do Not Block","Ped X-ing"]
        rpm = ['blue','']
        t =['word','arrow','symbol','','','rpm']
        st = [word,arrow,symbol,other,parking,rpm]
        index = 0
        for i in renameList:
            x = list(map(int,list(i)))
            temp = st[x[0] - 1][x[1] - 1] + " " + t[x[0] - 1]
            renameList[index] = temp
            index += 1
        df['SPECIALTY_POINT_TYPE'] = renameList
        return df.drop('SPECIALTY_POINT_SUB_TYPE',axis=1)
    return pd.DataFrame()

# Return dataframe of the listed specifications
def specifications(df,i):
    df["SPECIFICATIONS"] = ''
    for index,row in df.iterrows():
        keys = list(row[i:])
        values = list(df.columns)[i:]
        spec = []
        for k,v in zip(keys,values):
            if k != 'N/A' and k != '' and v != 'WORK GROUPS':
                spec.append('{} {}'.format(int(k),v.lower().replace('_',' ')))
            if row['Comments'] != 'N/A':
                sentence = 'Install {}, '.format(row['Comments']) + ', '.join(word for word in spec)
            else:
                sentence = 'Install ' + ', '.join(word for word in spec)
        df.at[index,'SPECIFICATIONS'] = sentence
    if 'WORK GROUPS' in df.columns:
        df.loc[df.Street != None,'WORK GROUPS'] = df.loc[df.Street != None,'WORK GROUPS'].apply(str)
    return df

# Returns dataframe of markings count and pages
def location_in_df(df,markings_type,workgroup):
    if 'Location ID' in df:
        count = markings_count(df,['Location ID',markings_type],(markings_type),workgroup)
        page = group_pivot(df,markings_type).reset_index()
        return count,page

# Returns dataframe grouped by a column and pivoted to show counts
def group_pivot(df,col):
    df = df.groupby(['Location ID','SEGMENT_ID','Comments',col]).count()[['COUNTS']]
    return df.pivot_table(values='COUNTS',index=['Location ID','SEGMENT_ID','Comments'],columns=(col),aggfunc='first')

# Returns dataframe of cover page
def create_cover(cover,sl_count,sp_count,wg):
    cover.loc[cover.Comments != 'N/A', wg[2]] = wg[2]
    cover.loc[cover.Comments == 'N/A', wg[2]] = 'N/A' 
    cover['PAGE'] = 1
    if not sl_count.empty and not sp_count.empty:
        cover = reduce(lambda z,y: pd.merge_ordered(z,y,on='Location ID'), [cover,sl_count,sp_count])
    elif not sl_count.empty:
        cover = pd.merge_ordered(cover,sl_count,on='Location ID')
        wg.remove('SPECIALTY MARKINGS')
    elif not sp_count.empty:
        cover = pd.merge_ordered(cover,sp_count,on='Location ID')
        wg.remove('SHORT LINE')
    else:
        cover = specifications(cover,6)
        return cover
    cover = cover.dropna(how='all',subset=list(cover.columns)[6:]).fillna('N/A')
    cover['WORK GROUPS'] = cover[wg].apply(','.join,1).apply(lambda x: [s for s in x.split(',') if s != 'N/A'])
    cover = cover.drop(columns = wg).fillna('N/A')
    cover = specifications(cover,6)
    return cover

# Returns dataframe of pages
def create_pages(pages,sl_page,sp_page):
    if not sl_page.empty and not sp_page.empty:
        pages = pd.merge_ordered(sl_page,sp_page,on=('Location ID','SEGMENT_ID','Comments')).fillna("N/A")
        pages = specifications(pages,3)
        pages = pd.merge_ordered(pages,streets,on=('Location ID','SEGMENT_ID','Comments')).drop(columns='BLOCK')
        pages = pages.sort_values(by=['Location ID','PAGE']).reset_index(drop = True)
        return pages
    elif not sl_page.empty:
        pages = specifications(sl_page.fillna('N/A'),3)
        pages = pd.merge_ordered(pages,streets,on=('Location ID','SEGMENT_ID','Comments')).sort_values(
            by=['BLOCK','Location ID']).reset_index(drop = True).drop(columns='BLOCK')
    elif not sp_page.empty:
        pages = specifications(sp_page.fillna('N/A'),3)
        pages = pd.merge_ordered(pages,streets,on=('Location ID','SEGMENT_ID','Comments')).sort_values(
            by=['BLOCK','Location ID']).reset_index(drop = True).drop(columns='BLOCK')
    else:
        pages.loc[cover.Street != None,'PAGE'] = 2
        return pages
    pages = pages.dropna(subset=['SPECIFICATIONS'])
    page = 1
    for index, row in streets.iterrows():
        if index != 0 and (row['Location ID'] != pages['Location ID'][index - 1]):
            page = 2
            pages.at[index,'PAGE'] = page
        else:
            page += 1
            pages.at[index,'PAGE'] = page
    return pages

# Creates worksheet in Excel file, replacing the worksheet if it already exists
def create_ws(df,sheet_name):
    if sheet_name in wb:
        del wb[sheet_name]
    ws = wb.create_sheet(sheet_name)
    for r in dataframe_to_rows(df, index=False, header=True):
        ws.append(r)
    wb.save(EXCEL_FILE)

## Loading and Transforming Data

### Email Attachment Extraction

Attachments will be extracted from the inbox. The purpose of `getpass` is to prompt the user for a password to log in to email.

Since the attachments have already been exported to the folder, a sign-in is not required.

In [5]:
import getpass

# Email subject line used for Street and Bridge Whereabouts report
daily_subject = "S&B Whereabouts"

# This will try to prompt the user to input email and password if SIGN_IN is True
try:
    if SIGN_IN:
        email = input("Enter email: ")
        password = getpass.getpass("Enter password: ")
        credentials = Credentials(username = email,password = password)
        config = Configuration(server='outlook.office365.com', credentials=credentials)
        account = Account(primary_smtp_address=email,config=config,autodiscover=False,access_type=DELEGATE)
        print("\nFile attachments below are:")
        for item in account.inbox.filter(subject__contains=daily_subject):
            for attachment in item.attachments:
                if isinstance(attachment, FileAttachment):
                    file_path = "\\".join([FOLDER,attachment.name])
                    with open(file_path, 'wb') as f:
                        f.write(attachment.content)
                    print(file_path)
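except Exception as e:
    # May be a wrong username or password; print the underlying error too
    print("\nCould not retrieve attachments (check username and password):", e)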

### PDF tables to Excel

Now that the PDFs have been extracted and exported to the folder path, the next step is to extract the tables in the PDF and export them as an Excel file.

An input form is generated so the user can enter Segment ID and comment information for each street listed. The `columns` list keeps only the relevant columns from the extracted table. The `pdfplumber` package extracts the tables from the PDF before the user is prompted for input.

The input is stored as a DataFrame and saved to an Excel document. If the user already provided input in a previous session, the dataframe is read from the Excel file instead.
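
Because `input_form` turns tab characters in the segment ID input into commas, one possible refinement is to normalize and validate the input before it is stored. The sketch below is an assumption about how that could look; `normalize_segment_ids` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import re

def normalize_segment_ids(text):
    """Turn raw user input into a comma-separated string of numeric IDs.

    Returns None when the input contains no usable IDs.
    """
    # Split on commas, tabs, or any other whitespace.
    tokens = [t for t in re.split(r"[,\s]+", text.strip()) if t]
    bad = [t for t in tokens if not t.isdigit()]
    if bad:
        raise ValueError(f"Non-numeric segment IDs: {bad}")
    return ",".join(tokens) if tokens else None

print(normalize_segment_ids("3194305\t2038708\t2038719"))
print(normalize_segment_ids(" 3272671, 3272712 "))
```

Rejecting non-numeric tokens early would surface typos at input time rather than later, when the feature layer query is built.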

In [6]:
from pathlib import Path

# Columns of extracted table
columns = ["Location ID", "Street", "From", "To"]

# Will prompt input and export to excel unless the excel file already exists. In that case it will read excel file instead
if Path(EXCEL_FILE).exists():
    df = pd.read_excel(EXCEL_FILE,index_col=0)
    df = df.fillna("N/A")
else:
    if INPUT:
        df = pdf_table_to_df(columns)
        input_form(df,columns)
        df = df.fillna("N/A")
        df.to_excel(EXCEL_FILE,sheet_name=" ".join((MONTH,DAY)))

In [7]:
display(df)

Unnamed: 0,Location ID,Street,From,To,Segment IDs,Comments
0,62963,HYMEADOW DR,Woodlawn Village Dr,12519,319430520387082038719,
1,SG-13247,Pecan Park Blvd,S Lake Line Blvd,Lake Creek Blvd,"3272671,3272712,3272816,3272915,3272705,327278...","turn bays, lane lines, bike lanes"


This file contains a table for the list of streets with the following columns:
- <i>Location ID</i>: unique identifier used for street paving
- <i>Street</i>: main street that is paved
- <i>From</i>: intersecting cross street
- <i>To</i>: intersecting cross street
- <i>Segment IDs</i>: list of segment IDs where street is paved, separated by commas
- <i>Comments</i>: Notes on long line markings

### Feature Layer Data Query

The next task is to find the markings for the list of segment IDs the user has entered. For this task, the `arcgis` package is useful for extracting the markings available in each segment ID, since the dataset is already publicly available.

Since the markings datasets are publicly available, we can connect to ArcGIS Online anonymously.

Use `client_id` instead of `None` if you wish to log in through an AGOL federated account. Note that it will prompt the user to enter a code, which can be found by following the instructions. Going through an AGOL federated account is useful if the user wishes to add their own layers as a reference, such as [NearMap](https://go.nearmap.com/) aerial imagery.

The code searches the markings feature layers based on the list of segment IDs provided by the Excel file.
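
The query itself is a SQL `IN` clause built from the comma-separated segment IDs. As a minimal sketch of that step, the hypothetical helper `segment_where_clause` below builds the clause and returns `None` when there is nothing to query; it is an illustration under those assumptions, not the function used in this notebook.

```python
def segment_where_clause(segment_ids, field="SEGMENT_ID"):
    """Build a where clause such as 'SEGMENT_ID IN (1,2)'.

    Returns None when there is nothing to query (empty or 'N/A' input).
    """
    if segment_ids is None or str(segment_ids).strip() in ("", "N/A"):
        return None
    ids = [s.strip() for s in str(segment_ids).split(",") if s.strip()]
    if not ids:
        return None
    if not all(s.isdigit() for s in ids):
        raise ValueError(f"Segment IDs must be numeric: {segment_ids!r}")
    return f"{field} IN ({','.join(ids)})"

print(segment_where_clause("3272671, 3272712"))
print(segment_where_clause("N/A"))
```

The resulting string can be passed as the `where` argument of a feature layer query, which is how `query_df` uses its own clause.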

In [8]:
# variables used to find and query feature layer in AGOL
gis = GIS("https://austin.maps.arcgis.com/home/index.html")
url = r"https://services.arcgis.com/0L95CJ0VTaxqcmED/arcgis/rest/services/TRANSPORTATION_{}/FeatureServer/0"
sl,sp,streets = (pd.DataFrame(),pd.DataFrame(),pd.DataFrame())

# Columns for data frames. Slices: short line uses cols[:2], specialty point uses cols[1:]
cols = ['SHORT_LINE_TYPE','SEGMENT_ID','SPECIALTY_POINT_TYPE','SPECIALTY_POINT_SUB_TYPE']
s_col = ['LEFT_BLOCK_FROM','RIGHT_BLOCK_FROM','SEGMENT_ID']

for index,row in df.iterrows():
    streets = query_df(FeatureLayer(url.format("street_segment")),index,s_col,df,streets)      
    sl = query_df(FeatureLayer(url.format("markings_short_line")),index,cols[:2],df,sl)
    sp = query_df(FeatureLayer(url.format("markings_specialty_point")),index,cols[1:],df,sp)
sp = specialty_markings(sp,cols[2])
# Order table
streets['BLOCK'] = np.maximum(streets[s_col[0]],streets[s_col[1]])
streets = streets.sort_values(by=['BLOCK','Location ID']).reset_index(drop = True)
streets = streets.rename(columns={'COUNTS':'PAGE'}).drop(s_col[:2],axis=1)

page = 1
for index, row in streets.iterrows():
    if index != 0 and (row['Location ID'] != streets['Location ID'][index - 1]):
        page = 2
        streets.at[index,'PAGE'] = page
    else:
        page += 1
        streets.at[index,'PAGE'] = page

### Plans Table Creation

#### Cover Table

In [49]:
wg = ['SHORT LINE','SPECIALTY MARKINGS','LONGLINE']
sl_count,sl_page = location_in_df(sl,'SHORT_LINE_TYPE',wg[0])
sp_count,sp_page = location_in_df(sp,'SPECIALTY_POINT_TYPE',wg[1])
cover = create_cover(df.copy(),sl_count,sp_count,wg)
pages = create_pages(df.copy(),sl_page,sp_page)

This dataframe lists pavement markings queried by segment IDs with the following columns:
- <i>LOCATION ID</i>: Unique identifier used for street paving
- <i>COMMENTS</i>: Notes on long line markings
- <i>WORK GROUPS</i>: Type of markings work group assigned to work order
- <i>SPECIFICATIONS</i>: Lists all markings that need to be installed on work order.
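
To illustrate how a SPECIFICATIONS sentence can be assembled from marking counts and long line comments, here is a simplified sketch. The helper `build_specification` is hypothetical and works on a plain dictionary, whereas the notebook's `specifications` function works on dataframe rows.

```python
def build_specification(counts, comments="N/A"):
    """Compose an 'Install ...' sentence from marking counts.

    counts maps a marking name (e.g. 'STOP_LINE') to a count or 'N/A'.
    """
    parts = []
    if comments not in ("", "N/A", None):
        parts.append(comments)
    for name, count in counts.items():
        if count in ("", "N/A", None):
            continue  # marking not present at this location
        parts.append(f"{int(count)} {name.lower().replace('_', ' ')}")
    return "Install " + ", ".join(parts)

print(build_specification({"CROSSWALK": "N/A", "STOP_LINE": 3.0}))
print(build_specification({"CROSSWALK": 2.0, "Left arrow": 1.0},
                          comments="turn bays, lane lines"))
```

Like the notebook's output, this sketch does not pluralize marking names; it only joins the comment and the non-empty counts.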


The dataframe will be saved in an Excel sheet so it can be used again to generate the template.

In [50]:
display(cover)
display(pages) 

Unnamed: 0,Location ID,Street,From,To,Segment IDs,Comments,PAGE,CROSSWALK,STOP_LINE,YIELD_LINE,Bicyclist symbol,Bike arrow,Diagonal crosshatch,Left arrow,Left/Through arrow,Only word,Right arrow,Shared lane (Sharrow) symbol,WORK GROUPS,SPECIFICATIONS
0,62963,HYMEADOW DR,Woodlawn Village Dr,12519,319430520387082038719,,1,,3.0,,,,,,,,,,['SHORT LINE'],"Install 1 page, 3 stop line"
1,SG-13247,Pecan Park Blvd,S Lake Line Blvd,Lake Creek Blvd,"3272671,3272712,3272816,3272915,3272705,327278...","turn bays, lane lines, bike lanes",1,10.0,13.0,1.0,9.0,28.0,9.0,21.0,2.0,23.0,13.0,14.0,"['SHORT LINE', 'SPECIALTY MARKINGS', 'LONGLINE']","Install turn bays, lane lines, bike lanes, 1 p..."


Unnamed: 0,Location ID,SEGMENT_ID,Comments,CROSSWALK,STOP_LINE,YIELD_LINE,Bicyclist symbol,Bike arrow,Diagonal crosshatch,Left arrow,Left/Through arrow,Only word,Right arrow,Shared lane (Sharrow) symbol,SPECIFICATIONS,PAGE
0,62963,2038719,,,1.0,,,,,,,,,,Install 1 stop line,2
1,62963,2038708,,,1.0,,,,,,,,,,Install 1 stop line,3
2,62963,3194305,,,1.0,,,,,,,,,,Install 1 stop line,4
3,SG-13247,3272912,"turn bays, lane lines, bike lanes",,,,,,,,,,,4.0,"Install turn bays, lane lines, bike lanes, 4 s...",2
4,SG-13247,3272907,"turn bays, lane lines, bike lanes",1.0,1.0,,1.0,,,2.0,,2.0,2.0,2.0,"Install turn bays, lane lines, bike lanes, 1 c...",3
5,SG-13247,3272914,"turn bays, lane lines, bike lanes",,,,,,,,,,,4.0,"Install turn bays, lane lines, bike lanes, 4 s...",4
6,SG-13247,3272909,"turn bays, lane lines, bike lanes",,,,,,,,,,,4.0,"Install turn bays, lane lines, bike lanes, 4 s...",5
7,SG-13247,3272915,"turn bays, lane lines, bike lanes",2.0,1.0,,1.0,1.0,4.0,,,1.0,1.0,,"Install turn bays, lane lines, bike lanes, 2 c...",6
8,SG-13247,3272910,"turn bays, lane lines, bike lanes",,,1.0,,,5.0,,,,,,"Install turn bays, lane lines, bike lanes, 1 y...",7
9,SG-13247,2048238,"turn bays, lane lines, bike lanes",,,,,,,,,,,,,8


#### Pages Table

## Create Worksheets of DataFrames

In [10]:
wb = load_workbook(filename = EXCEL_FILE)
create_ws(cover,'Cover')
create_ws(pages,'Pages')

## Generating Whereabouts Plans
To generate whereabouts plans, we will have to use the `arcpy` package, which requires Python 2 and ArcMap 10.5. Eventually, this notebook will be able to use `arcpy` in Python 3.

[Click here to access notebook](PlansTemplate.ipynb)

# (Optional) Create Spreadsheet of Completed Streets
This is intended to report on the streets extracted from the PDFs.
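
One way to collect every report in the folder is to search it recursively for PDF files. The sketch below uses `pathlib` on a temporary directory with made-up file names; `find_pdf_stems` is a hypothetical helper shown only to illustrate the folder walk.

```python
import tempfile
from pathlib import Path

def find_pdf_stems(folder):
    """Return sorted paths (without the .pdf suffix) of all PDFs under folder."""
    return sorted(p.with_suffix("") for p in Path(folder).rglob("*.pdf"))

# Demonstration on a throwaway directory with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ("May 30.pdf", "May 31.pdf", "notes.txt"):
        (Path(tmp) / name).touch()
    stems = [p.name for p in find_pdf_stems(tmp)]
    print(stems)
```

Each result keeps its full parent path, so a report found in a subfolder would still resolve to the correct file.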

In [9]:
import os
import pandas as pd

# Columns of extracted table
columns = ["Location ID", "Street", "From", "To"]
df = pd.DataFrame()

try:
    # Reuse the existing street list if it has already been generated
    df = pd.read_excel(FOLDER + "\\SBO Street List.xlsx")
except FileNotFoundError:
    for foldername,subfolders,files in os.walk(FOLDER):
        for file in files:
            if file.endswith('.pdf'):
                # pdf_table_to_df reads the global FILE_NAME, so point it at this PDF
                FILE_NAME = "\\".join((FOLDER,file[:-4]))
                df1 = pdf_table_to_df(columns)
                df1["filename"] = file
                df = pd.concat([df,df1],sort=True)
    df.to_excel(FOLDER + "\\SBO Street List.xlsx",sheet_name="Report")