# Overview

We now clean more involved features. Due to the data set being relatively small we are able to manually preprerocess given features. We start with 'data/preprocessed_dates_time_data.csv' and work to obtain "data\cleaned.csv". In particular we preprocess:

- location
- details and tags
- crew
- pager code

# Data

In [12]:
import pandas as pd 

In [30]:
df = pd.read_csv("..\data\cleaned.csv")
# df.head()

In [None]:
# Filter the DataFrame to get rows where 'pager_code' 
value="333"
filtered_df = df[df['pager_code'] == value]
pd.set_option('display.max_colwidth', None)
filtered_df[['pager_code', 'shout_details']].head(3)

# Location

Based on misspellings of locations and vague descriptions (e.g., "south of Ardlui") we have to make some assumptions to give better location data. If the location is for say "south of Ardlui" we record the location "Ardlui"  and in adjustment_km we record (s,2) to mean south by 2km - we take 2km as a rough estimate of the distance south.

We manually input the longitude and latitude coordinates with google maps (DD format). We input the coordinates of the location (Ardlui) using Google Maps and record the core latitude longitude (core_long_lat). It's important to note that the specific location may not be precise during search operations.

## Automation of position

To-do:

    -[ ] Utilize geopy to obtain the coordinates of the location based on the shout's location (so we can automate the process).
    -[ ] Implement functionality to accept What3Words coordinates.


Provide a function that takes a location and provides core_long_lat value.

Given a map and location data need to be able to convert to longitude/latitude data.

https://geopy.readthedocs.io/en/stable/index.html?highlight=what%20three%20words#geopy.geocoders.What3WordsV3.geocode


In [29]:
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

In [35]:
simple_loc=[
 'Ardlui',
 'Duckbay']

# for df take only rows with location values in simple_loc
filtered_df = df[df['location_of_shout'].str.contains('|'.join(simple_loc), case=False, na=False)]
filtered_df.head()
df=filtered_df

In [37]:
# Initialize geocoder
geolocator = Nominatim(user_agent="my_geocoder")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

# Function to geocode a location
def geocode_location(location):
    try:
        # Attempt to geocode the location
        geocoded_location = geocode(f"{location}, Scotland",timeout = 5)
        return geocoded_location.latitude, geocoded_location.longitude
    except:
        # Return NA values if geocoding fails
        return pd.NA, pd.NA

In [38]:
# Apply geocoding to each location in the DataFrame
df['coords'] = df['location_of_shout'].apply(geocode_location)
df[['latitude', 'longitude']] = pd.DataFrame(df['coords'].tolist(), index=df.index)
df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['coords'] = df['location_of_shout'].apply(geocode_location)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[['latitude', 'longitude']] = pd.DataFrame(df['coords'].tolist(), index=df.index)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[['latitude', 'longitude']] = pd.DataFrame(df['coords'].

In [41]:

# Group by location and determine number of call outs per year.
aggregated_df = df.groupby('location_of_shout').agg({
    'latitude': 'first',  # Since all latitudes for the same location should be equal
    'longitude': 'first'  # Ditto for longitudes
}).reset_index()

aggregated_df.head()

Unnamed: 0,location_of_shout,latitude,longitude
0,Ardlui marina.,,
1,Ardlui,56.301844,-4.721605
2,Ardlui marina,,
3,Between Inchmurrin and Duckbay marina,,
4,Duckbay,,


# Details and tags

For each shout_detail item we provide shout_details_tags as a way to quickly understand the incident. Here are the tags and details they correspond to:

Tag categories:
- Mechanical
- Rescue
- Medical
- Environmental
- Mishap
- Assistance
- FalseAlarm
- Miscellaneous
- Search
- Transport

See docs/tag_categories.md for what into each category. To obtain shout_detail_tags we read each shout_detail and manually input the tags. Alternatively one could use a LLM to classify based on tags, see docs/detail_tag_prompt.md

Reading through the shout-detials we see that LLRB also work with the follwoing organisations:
- SAS
- LLTNP
- DMMS

# Weather

We parsed weather_at_time_of_shout manually into a comma separated list of weather conditions. 

We record these changes in:

"data\codes_preprocessed_dates_time_data.csv"

Sometimes they record what the waves (the chop/swell level),temperature, light level,wind direction and strength, and visability level.

We can ask the crew to record these details at the time of the incident.

# pager_code

Pager codes initially are formed of three leading terms these are:

- 999 = someone in water,
- 333 = search, 
- 222 = urgent but no threat to life

Sometimes the main-pager-codes are followed by a sub-pager-code. We do not know what these sub-pager-codes refer to (we leave these for now).

LLRB often gets called out without need for a pager code, or with AIRWAVE. This occurs while out on training or on a previous callout.


We will ask the crew to provide more information on these. If they are not able to provide more information we will will leave the pager-codes as they are.

Given pager_code take first three letters as the code, if there are more than three letters, take the next three letters as subcode.


In [25]:
df['pager_code'].fillna("Na", inplace=True)

Identify all rows that begin with 999,333,222 and have a subcode. Extract subcode and put into subcode.

In [27]:
#First put all subcodes as None
df['subcode']=None

#First: Identify all rows that are of form 999,333,222 only. 

# Identify all rows that are of form 999,333,222 only. For the subcode put None.
df_pager_main_only = df['pager_code'].apply(lambda x: x if x in ['999','333','222'] else None)
# df_pager_main_only.value_counts()

#Now get subcodes for the rest of the rows

df['subcode'] = df['pager_code'].apply(lambda x: str(x)[3:] if x not in df_pager_main_only else None)

Extract rows that do not start with a pager code of the form 999,333,222.

In [28]:
#Identify all rows that do not start with:  999,333,222.

df_non = df[~df['pager_code'].astype(str).str.match(r'^(999|333|222)')]
# print(df_non.shape) #69
df_non['pager_code'].value_counts()

(69, 7)


pager_code
Na                                     50
other / already out                     5
On water                                2
AIRWAVE                                 2
Na on water                             2
Nil                                     1
N/A   AB messaged                       1
N/a                                     1
3032                                    1
Na already on water at safety event     1
11:30                                   1
Na already out                          1
Na just returned from previous          1
Name: count, dtype: int64

In [None]:
#save those that begin with 999,333,222
df_main = df[df['pager_code'].astype(str).str.match(r'^(999|333|222)')]

# Once pager_code has be entered for those without (df_non), concate df_main and created df_non_entered.
# identify those pager_codes that need to be entered in for based on shout_details.
#See pager code descriptions

# save df_non to csv
# df_non.to_csv('../data/pager_codes_missing.csv', index=False)

# 'crew_on_board' and 'crew_on_shore'

Ask those recording to record the initials of the crew on board and on shore in a comma separated list. If there are no crew on board or on shore, record as "None".

In [None]:
{
    "RB": "Ronnie Britton",
    "RO": "Rennie Oliver",
    "IG": "Iain Gollan (Goz)",
    "AM": "Ally McLeod",
    "ABS": "Andy Biddulph Snr",
    "ABJ": "Andy Biddulph Jnr",
    "GD": "Gemma Dorran",
    "PBT": "Phils Brooks-Taylor",
    "DON": "David O'Neil",
    "CC": "Craig Clancy",
    "GH": "Gerry Heaney",
    "AJM": "Angus John MacDonald",
    "CMS": "Callum MacKenzie Stevens",
    "DS": "David Stuart",
    "TR": "Thomas Rogers",
    "EM": "Euan MciIwraith",
    "PD": "Paul Dorrian",
    "KM": "Kevin McPartland",
    "JB": "Jenna Biddulph",
    "VM": "Vicki Murphy",
    "JM": "John Mason",
    "AC": "Andy Connell",
    "FN": "Franny Nicol",
    "FR": "Frank Rogers",
    "CA": "Christine Allan",
    "CS": "Clinton Salter"
    "JT": "James Thomson",
    "TAM":"Tam (Cox)",
    "GERARD","Gerard",
    "DAVY","Davy",
    "LEE","Lee",
}

#TODO! 

Remember to convert weather, details and location to lower case.

Ensure crew spaces are removed


In [None]:
import pandas as pd 

df = pd.read_csv("..\data\codes_preprocessed_dates_time_data.csv")
# df.drop(columns=["shout_details"], inplace=True)
# df.columns

In [None]:
#Turn values into comma separated values.
# Keep as Initials for now.

In [None]:
df['crew_on_board'].fillna("Na", inplace=True)
df['crew_on_shore'].fillna("Na", inplace=True)

In [65]:
feats=['crew_on_board', 'crew_on_shore']
df[feats].head()
df_crew=df[feats]

In [None]:
df_crew.value_counts()
#For crew_on_board
# seperate comma serparated values for each row. 
#count the number of times a given name appears.

#For a given name get a list of pager_codes that they have been on board for.
# Similarly get the weather for this persons callouts.