### This notebook calculates the estimated driving distance and time from each property to the nearest fire station
I use the following commands to set up the environment. If you are using the image pr-home-ds-m340-ce, some packages are pre-installed.

Env: conda create -n driving_distance python=3.12.11\
conda init bash \
source ~/.bashrc \
conda activate driving_distance\
<!---for read_data.py--->
conda install -c conda-forge fsspec s3fs=2023.6.0 pyarrow tqdm scipy aiohttp \
<!---for ask_online.py--->
conda install -c conda-forge pandas numpy boto3 requests geopy ipython




The file locations are as follows:\
Firestations: s3://pr-home-datascience/DSwarehouse/Datasources/FireStationLocation/FireStation/FireStations_0.csv \
Properties: s3://pr-home-datascience/Users/test/IntermediateDrive/Quantarium/AD/good_address/rundt=202504/ \

We will read the data from these locations.

### Import the libraries and load the data

In [2]:
# import the libraries
import numpy as np
import pandas as pd
import time
from tqdm import tqdm
import os
import read_data
import functions

cwd = os.getcwd()
print(cwd)

/home/sagemaker-user/DRIVING_DISTANCE


For now, we process the data one state at a time. Within each state, nearby properties are listed together, so a batch of 300 properties usually yields only a few dozen candidate nearest fire stations, and we can submit all of them to the OpenStreetMap website at once. If the number of candidate fire stations exceeds 100, we may need to reduce the batch size.
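
To make the batch-size reasoning concrete, the sketch below counts how many distinct fire stations appear among the k nearest (by straight-line haversine distance) for a small batch of properties. It is only an illustration: `haversine_km`, `candidate_stations`, and the sample coordinates are hypothetical and are not taken from `functions.py`, whose actual candidate selection is not shown in this notebook.

```python
import heapq
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def candidate_stations(properties, stations, k=5):
    """Return the set of station IDs that are among the k nearest for any property.

    properties: list of (lat, lon)
    stations:   list of (station_id, lat, lon)
    """
    candidates = set()
    for plat, plon in properties:
        nearest = heapq.nsmallest(
            k, stations, key=lambda s: haversine_km(plat, plon, s[1], s[2])
        )
        candidates.update(s[0] for s in nearest)
    return candidates

# Hypothetical sample data: three neighbouring properties and four stations.
sample_properties = [(40.7128, -74.0060), (40.7130, -74.0055), (40.7140, -74.0070)]
sample_stations = [
    ("A", 40.7150, -74.0080),
    ("B", 40.7000, -74.0100),
    ("C", 40.8000, -73.9500),
    ("D", 42.6526, -73.7562),  # far away from the sample properties
]

ids = candidate_stations(sample_properties, sample_stations, k=2)
print(sorted(ids), len(ids))
```

Because neighbouring properties share most of their nearest stations, the candidate set grows much more slowly than the batch itself, which is why a batch of 300 properties can stay under the 100-station limit.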

In [3]:
# The Paths and directories
states = ['NY']
#  CT, MA, NH, NJ, NY, PA


FS_path = "s3://pr-home-datascience/DSwarehouse/Datasources/FireStationLocation/FireStation/FireStations_0.csv"

# Use read_data.py to load the data
for state in states:
    property_path = "s3://pr-home-datascience/Users/test/IntermediateDrive/Quantarium/AD/good_address/rundt=202504/" + 'state=' + state + '/'
    df_FS = read_data.ReadData(FS_path, state = state, directory=False)
    df_property = read_data.ReadData(property_path, state=state, directory=True)
    print(f'There are {len(df_FS.data)} firestations in {state}. ')
    print(f'There are {len(df_property.data)} properties in {state}. ')


There are 2851 firestations in NY. 
There are 6389671 properties in NY. 


Check the keywords in the fire stations dataframe and properties dataframe.

In [5]:
print(f'There are {len(df_FS.data["ID"].unique())} unique fire stations')
print('The fire station dataframe has keywords. ')
print(df_FS.data.keys())
print('The property dataframe has keywords. ')
print(df_property.data.keys())
print('**************************')
print('The property data examples ')
print(df_property.data.head())

There are 2851 unique fire stations
The fire station dataframe has keywords. 
Index(['OBJECTID', 'ID', 'NAME', 'TELEPHONE', 'ADDRESS', 'ADDRESS2', 'CITY',
       'STATE', 'ZIP', 'ZIPP4', 'COUNTY', 'FIPS', 'DIRECTIONS', 'EMERGTITLE',
       'EMERGTEL', 'EMERGEXT', 'CONTDATE', 'CONTHOW', 'GEODATE', 'GEOHOW',
       'HSIPTHEMES', 'NAICSCODE', 'NAICSDESCR', 'GEOLINKID', 'X', 'Y',
       'ST_VENDOR', 'ST_VERSION', 'GEOPREC', 'PHONELOC', 'QC_QA', 'STATE_ID',
       'FDID', 'FRST_MBRS', 'EMS_MBRS', 'TOTALPERS', 'NUMTRKS', 'NUMABUL',
       'TOTAL_VEHI', 'NBR_STA', 'OWNER', 'LEVEL_', 'TYPE', 'SPECIALTY',
       'EMSLICENSE', 'EMERPHONE', 'EMS', 'PERM_ID', 'GNIS_ID', 'x2', 'y2'],
      dtype='object')
The property dataframe has keywords. 
Index(['QPID', 'State', 'PA_Latitude', 'PA_Longitude', 'Match_Code',
       'Location_Code', 'add_check', 'geo_check', 'source', 'rundt', 'state'],
      dtype='object')
**************************
The property data examples 
       QPID State  PA_Latitude  PA_

### Use functions.py to request routes online
Use the functions in functions.py to calculate the driving distance and time from each property to the nearest fire station.

Note that by default we calculate the routes to the 5 nearest fire stations and record the best route.
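
As an illustration of "record the best route", the sketch below picks the route with the shortest driving time from a list of candidate results. The record layout (`station_id`, `distance_m`, `duration_s`), the use of `None` for a failed request, and the choice of duration as the ranking criterion are all assumptions; the notebook does not say whether `functions.py` ranks by time or by distance.

```python
def best_route(routes):
    """Return the candidate with the smallest duration, or None if none succeeded.

    routes: list of dicts with keys 'station_id', 'distance_m', 'duration_s'.
            A failed request is represented by duration_s = None (assumed convention).
    """
    valid = [r for r in routes if r["duration_s"] is not None]
    if not valid:
        return None
    # Ties on duration are broken by preferring the shorter distance.
    return min(valid, key=lambda r: (r["duration_s"], r["distance_m"]))

# Hypothetical routing results for one property and its 5 nearest stations.
candidate_routes = [
    {"station_id": "A", "distance_m": 2100.0, "duration_s": 310.0},
    {"station_id": "B", "distance_m": 1800.0, "duration_s": 350.0},
    {"station_id": "C", "distance_m": 2500.0, "duration_s": 290.0},
    {"station_id": "D", "distance_m": None, "duration_s": None},  # request failed
    {"station_id": "E", "distance_m": 3900.0, "duration_s": 600.0},
]

best = best_route(candidate_routes)
print(best["station_id"], best["duration_s"])
```

In this made-up example station B is the closest by road distance, but station C is reached fastest, which is why several candidates are routed instead of only the single nearest one.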

Steps:\
1: Set the state and output path\
2: Define chunk and batch size\
3: Loop through chunk indices (adjust range to control concurrent jobs)\
4: Run the routing process using the nearest 5 firestations
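
The chunk and batch arithmetic in steps 2 and 3 can be sketched with plain integers. `chunk_bounds` and `batch_bounds` are hypothetical helpers written for this illustration; the total of 6,389,671 is the NY property count printed earlier in this notebook.

```python
def chunk_bounds(total, chunksize):
    """Yield (index, start, end) for consecutive chunks covering range(total)."""
    num_chunks = (total + chunksize - 1) // chunksize  # ceiling division
    for i in range(num_chunks):
        yield i, i * chunksize, min((i + 1) * chunksize, total)

def batch_bounds(start, end, batchsize):
    """Yield (batch_start, batch_end) pairs that split [start, end) into batches."""
    for b in range(start, end, batchsize):
        yield b, min(b + batchsize, end)

total = 6_389_671      # NY properties
chunksize = 300_000
batchsize = 300

chunks = list(chunk_bounds(total, chunksize))
print(len(chunks))     # number of chunks
print(chunks[-1])      # index, start, and end of the last chunk

# Batches inside the last chunk
_, start, end = chunks[-1]
batches = list(batch_bounds(start, end, batchsize))
print(len(batches), batches[-1])
```

With these numbers there are 22 chunks, and the last one holds only 89,671 rows, fewer than `chunksize`.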

In [None]:
# Use the functions in functions.py to calculate the distance and time from each property to the nearest fire station.
# Note that by default we calculate the routes to the 5 nearest fire stations and record the best route.


# Chunksize: 100000 for MA, 200000 for NJ, 300000 for NY, 300000 for other states.
# We can just use 300,000 later.
state = 'NY'
output_path = "./results/" + state +'/'
# or you can save it to Amazon S3
# output_path = 's3://pr-home-datascience/Projects/AdHoc/InternProjects/2025/2025InternSummer Driving distance for Prospect Table/'+state+'/'


# identify the total length, chunksize, and batchsize
total = len(df_property.data)
chunksize = 300000
process_batch_size = 300
num_chunks = (total + chunksize - 1) // chunksize

# This loop lets us split the work across concurrent jobs in the Pipelines:
# give each job a different range of chunk indices i.
for i in range(6, 10):
    print(f'Now running chunk {i}.\n')
    # start and end row indices of this chunk
    start = i * chunksize
    end = min((i + 1) * chunksize, total)
    df_chunk = df_property.data.iloc[start:end]

    # process the chunk of properties
    results_df = functions.process_all_batches(df_chunk, df_FS.data, batchsize=process_batch_size)
    output_file_path = output_path +  f"FS_batch_{i}_n{str(chunksize)}.csv"

    # Compare against the chunk length, not chunksize: the last chunk is usually shorter.
    if len(results_df) == len(df_chunk):
        # every batch returned results from OpenStreetMap
        results_df.to_csv(output_file_path, index=False)
        print(f'Finished writing the file for chunk {i}.\n')
    else:
        # mark the file name as failed if some batches did not return results
        output_file_path = output_path + f"FS_batch_{i}_n{chunksize}_failed.csv"
        results_df.to_csv(output_file_path, index=False)
        print(f'Missing data in chunk {i}.\n')
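
Because incomplete chunks are written with a `_failed` suffix, the chunks that need to be rerun can be recovered from the file names alone. The helper below is a hypothetical sketch that assumes a local results directory and the `FS_batch_{i}_n{chunksize}.csv` / `FS_batch_{i}_n{chunksize}_failed.csv` naming used in the cell above; it does not handle S3 paths.

```python
import re
import tempfile
from pathlib import Path

# Matches FS_batch_<i>_n<chunksize>.csv and FS_batch_<i>_n<chunksize>_failed.csv
PATTERN = re.compile(r"^FS_batch_(\d+)_n(\d+)(_failed)?\.csv$")

def chunk_status(result_dir):
    """Map chunk index -> 'ok' or 'failed' based on the result file names."""
    status = {}
    for path in Path(result_dir).glob("FS_batch_*.csv"):
        m = PATTERN.match(path.name)
        if m is None:
            continue
        idx = int(m.group(1))
        label = "failed" if m.group(3) else "ok"
        # A successful file takes precedence over a failed file for the same chunk.
        if status.get(idx) != "ok":
            status[idx] = label
    return status

# Demonstration on a temporary directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["FS_batch_6_n300000.csv",
                 "FS_batch_7_n300000_failed.csv",
                 "FS_batch_8_n300000.csv"]:
        (Path(tmp) / name).touch()
    status = chunk_status(tmp)

to_rerun = sorted(i for i, s in status.items() if s == "failed")
print(to_rerun)
```

The indices in `to_rerun` can then be used in place of `range(6, 10)` to reprocess only the incomplete chunks.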

 
