# Overview

We now clean more involved features. Due to the data set being relatively small we are able to manually preprerocess given features. We start with 'data/preprocessed_dates_time_data.csv' and work to obtain "data\cleaned.csv". In particular we preprocess:

- location
- details and tags
- crew
- pager code

# Data

In [67]:
import pandas as pd 

In [68]:
df = pd.read_csv("..\data\cleaned.csv")
# df.head()

In [69]:
# Filter the DataFrame to get rows where 'pager_code' 
value="333"
filtered_df = df[df['pager_code'] == value]
pd.set_option('display.max_colwidth', None)
filtered_df[['pager_code', 'shout_details']].head(3)

Unnamed: 0,pager_code,shout_details
1,333,17' Fletcher speedboat with 1 male occupant had suffered mechanical failure and was drifting just south of Ardlui. The boat and occupant were put on a long tow and taken back to their berth at ROwardenNone
3,333,Reports of a small craft adrift out in fROnt of Duckbay. After a quick search the small tender was located on the shore line. Small craft was found to be damaged and was removed fROm the water.
5,333,"Report of a vessel drifted fROm beach with no persons onboard. Caller informed Police Scotland that he was on the beach and required assistance to retrieve the vessel. LLRB launched to support, once on scene crew quickly located the caller who advised local campers were able to assist retrieve his vessel, all being well LLRB stood down and returned to base."


# Location

Based on misspellings of locations and vague descriptions (e.g., "south of Ardlui") we make restrict to simpley "Ardlui" for now. We manually input the longitude and latitude coordinates using google maps (DD format). We input the coordinates of the location (Ardlui) using Google Maps and record the core latitude and longitude (core_lat_long). It's important to note that the specific location may not be precise during search operations.

## Automation of position (later)

The following is an idea for automation for entries with a easy to google location (most times however the location is imprecise).

To-do:

    -[ ] Utilize geopy to obtain the coordinates of the location based on the shout's location (so we can automate the process).
    -[ ] Implement functionality to accept What3Words coordinates.


Provide a function (using geopy) that takes a location and provides core_lat_long value.

https://geopy.readthedocs.io/en/stable/index.html?highlight=what%20three%20words#geopy.geocoders.What3WordsV3.geocode

In [29]:
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

In [35]:
simple_loc=[
 'Ardlui',
 'Duckbay']
# for df take only rows with location values in simple_loc
filtered_df = df[df['location_of_shout'].str.contains('|'.join(simple_loc), case=False, na=False)]
filtered_df.head()
df=filtered_df

In [37]:
# Initialize geocoder
geolocator = Nominatim(user_agent="my_geocoder")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

# Function to geocode a location
def geocode_location(location):
    try:
        # Attempt to geocode the location
        geocoded_location = geocode(f"{location}, Scotland",timeout = 5)
        return geocoded_location.latitude, geocoded_location.longitude
    except:
        # Return NA values if geocoding fails
        return pd.NA, pd.NA

In [38]:
# Apply geocoding to each location in the DataFrame
df['coords'] = df['location_of_shout'].apply(geocode_location)
df[['latitude', 'longitude']] = pd.DataFrame(df['coords'].tolist(), index=df.index)
df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['coords'] = df['location_of_shout'].apply(geocode_location)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[['latitude', 'longitude']] = pd.DataFrame(df['coords'].tolist(), index=df.index)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[['latitude', 'longitude']] = pd.DataFrame(df['coords'].

In [41]:

# Group by location and determine number of call outs per year.
aggregated_df = df.groupby('location_of_shout').agg({
    'latitude': 'first',  # Since all latitudes for the same location should be equal
    'longitude': 'first'  # Ditto for longitudes
}).reset_index()

aggregated_df.head()

Unnamed: 0,location_of_shout,latitude,longitude
0,Ardlui marina.,,
1,Ardlui,56.301844,-4.721605
2,Ardlui marina,,
3,Between Inchmurrin and Duckbay marina,,
4,Duckbay,,


# Details and tags

For each shout_detail item we provide shout_details_tags as a way to quickly understand the incident. Here are the tags and details they correspond to:

Tag categories:
- Mechanical
- Rescue
- Medical
- Environmental
- Mishap
- Assistance
- FalseAlarm
- Miscellaneous
- Search
- Transport

See docs/tag_categories.md for what into each category. To obtain shout_detail_tags we read each shout_detail and manually input the tags. Alternatively one could use a LLM to classify based on tags, see docs/detail_tag_prompt.md

Reading through the shout-details we see that LLRB also work with the following organisations:
- SAS
- LLTNP
- DMMS
We will enquire about cross organisational training and data sharing.

# Weather

We parsed weather_at_time_of_shout manually into a comma separated list of weather conditions. Though we did not collect weather into a shorted list of tags such as with shout-details-tags. We record these changes in "data\cleaned.csv". Sometimes they record what the waves (the chop/swell level), temperature, light level,wind direction and strength, and visability level. We will ask the crew to record these details at the time of the incident - see docs/data_recording.md.

# pager_code

Pager codes initially are formed of three leading terms these are:

- 999 = someone in water,
- 333 = search, 
- 222 = urgent but no threat to life

Sometimes the main-pager-codes are followed by a sub-pager-code. We do not know what these sub-pager-codes refer to (we leave these for now and ask the crew later).

LLRB often gets called out without need for a pager code, or with AIRWAVE. This occurs while out on training or on a previous callout. By analysing the shout-details we have retroactively assigned pager-codes to these entries (999,333,222).

Given pager_code take first three letters as the code, if there are more than three letters, take the next three letters as subcode.

In [25]:
# df['pager_code'].isna().sum()
# df['pager_code'].fillna("Na", inplace=True)

Identify all rows that begin with 999,333,222 and have a subcode. Extract subcode and put into subcode.

In [70]:
df['subcode'] = 'None'
df['subcode'].value_counts()

subcode
None    209
Name: count, dtype: int64

In [71]:

#First: Identify all rows that are of form 999,333,222 only. 

# Identify all rows that are of form 999,333,222 only. For the subcode put None.
df_pager_main_only = df['pager_code'].apply(lambda x: x if x in ['999','333','222'] else None)
# df_pager_main_only.value_counts()

#Now get subcodes for the rest of the rows

df['subcode'] = df['pager_code'].apply(lambda x: str(x)[3:] if x not in df_pager_main_only else "None")

Extract rows that do not start with a pager code of the form 999,333,222.

In [72]:
#Identify all rows that do not start with:  999,333,222.
# df_non = df[~df['pager_code'].astype(str).str.match(r'^(999|333|222)')]
# print(df_non.shape) #69
df['pager_code'].value_counts()

pager_code
333         72
999         55
222         45
training     2
9992167      1
333 3072     1
333 1846     1
999 1087     1
999 1573     1
3332563      1
999 2788     1
999 2465     1
333 1151     1
999 1448     1
333 1456     1
9992336      1
333 3439     1
3334906      1
9993743      1
3333630      1
999 1919     1
3331753      1
999 2784     1
333369       1
999 0210     1
999 1682     1
9991701      1
9994432      1
999 1324     1
333 2265     1
9992804      1
3332585      1
999 1977     1
222 4380     1
999 3665     1
333 2428     1
333 3077     1
333 2673     1
3331717      1
Name: count, dtype: int64

In [73]:
# if pager_code first three values begins with 999,333,222 then replace current value with 999,333,222
#for example if 9992127 replace with 999
df['pager_code'] = df['pager_code'].apply(lambda x: x[:3] if str(x)[:3] in ['999','333','222'] else x)

In [74]:
df['pager_code'].value_counts()

pager_code
333         88
999         73
222         46
training     2
Name: count, dtype: int64

In [82]:
# fill blanks in subcode with None
df['subcode'] = df['subcode'].replace(['', ' '], "None")

feats=['pager_code', 'subcode']

#For those of 'subcode' not None put into a dataframe
df[feats].head()

Unnamed: 0,pager_code,subcode
0,999,2167.0
1,333,
2,999,
3,333,
4,222,


In [88]:
#get rows that have pager_code as "training"
df_training=df[df['pager_code'] == "training"]
df_training.head()

Unnamed: 0,date_of_shout,time_of_shout,time_boat_launched,time_boat_returned,pager_code,what_three_words,location_of_shout,core_lat_long,shout_details,shout_details_tags,crew_on_board,crew_on_shore,weather_at_time_of_shout,subcode
85,23/08/2023,19:00,19:15,20:45,training,,Training,,Training,"FalseAlarm, Miscellaneous","ABS, GH, VM",,"Dry, calm",ining
90,06/08/2023,10:15,10:15,12:35,training,,Crew Training,,"Crew Training, Island Familiarisation and River Leven",Miscellaneous,"AM, PD ,DS, VM ,GH, EM",ABJ,"Good, clear",ining


In [89]:
#remove df_training from df
df = df[df['pager_code'] != "training"]

In [90]:
#get only those rows where subcode is not None with columns in feats
df_subcode_not_none = df[df['subcode'] != 'None'][feats]
df_subcode_not_none.head()
df_subcode_not_none.shape

#group these by pager_code and show me the subcodes for each pager code
# df_subcode_not_none.groupby('pager_code')['subcode'].value_counts()

#show these group separaltly for each pager code and list subcodes
df_subcode_not_none.groupby('pager_code')['subcode'].unique()




pager_code
222                                                                                                                     [ 4380]
333                    [ 2265, 369,  2428,  3077,  2673, 1753, 2585,  1846, 2563,  3072,  1151,  1456,  3439, 4906, 3630, 1717]
999    [2167,  0210,  1682, 1701, 4432,  1324, 2804,  1977,  3665,  2784,  1448,  1919,  1087,  1573,  2788,  2465, 2336, 3743]
Name: subcode, dtype: object

In [91]:
# save file 
df.to_csv("..\data\cleaned.csv", index=False)

# 'crew_on_board' and 'crew_on_shore'

Crew names where not consistant. We corrected these manually into initils. We recommend that the crew on board and on shore are recorded as initials, and as a comma separated list. If there are no crew on board or on shore, record as "None".

In [None]:
{
    "RB": "Ronnie Britton",
    "RO": "Rennie Oliver",
    "IG": "Iain Gollan (Goz)",
    "AM": "Ally McLeod",
    "ABS": "Andy Biddulph Snr",
    "ABJ": "Andy Biddulph Jnr",
    "GD": "Gemma Dorran",
    "PBT": "Phils Brooks-Taylor",
    "DON": "David O'Neil",
    "CC": "Craig Clancy",
    "GH": "Gerry Heaney",
    "AJM": "Angus John MacDonald",
    "CMS": "Callum MacKenzie Stevens",
    "DS": "David Stuart",
    "TR": "Thomas Rogers",
    "EM": "Euan MciIwraith",
    "PD": "Paul Dorrian",
    "KM": "Kevin McPartland",
    "JB": "Jenna Biddulph",
    "VM": "Vicki Murphy",
    "JM": "John Mason",
    "AC": "Andy Connell",
    "FN": "Franny Nicol",
    "FR": "Frank Rogers",
    "CA": "Christine Allan",
    "CS": "Clinton Salter"
    "JT": "James Thomson",
    "TAM":"Tam (Cox)",
    "GERARD","Gerard",
    "DAVY","Davy",
    "LEE","Lee",
}

In [92]:
df['crew_on_board'].fillna("Na", inplace=True)
df['crew_on_shore'].fillna("Na", inplace=True)
feats=['crew_on_board', 'crew_on_shore']
df[feats].head()
df_crew=df[feats]
df_crew.value_counts()