I cloned my fork of the GitHub repository to local storage and started the Docker environment as the instructions indicate. I manually loaded the provided CSV and JSON files into the Jupyter notebook path. Some extra libraries were used (folium, xgboost, hyperopt, mlflow, pyspark), so their installation was added to the Dockerfile.

For this architecture, I will assume the use of tools like Azure and Databricks for pipeline development, as they are platforms with which I am familiar. However, the general deployment approach would be similar on other platforms.

The ingestion process assumes the existence of paid data sources, which may come in various formats—batch data, JSON/CSV files, or streaming sources. These sources would be connected via an API (assuming one is available) to Azure Data Factory (ADF), a service designed to manage and orchestrate diverse data flows. Since the incoming files could have different formats and extensions, they would be routed to Azure Blob Storage, which supports the storage of multiple file types, such as Parquet, CSV, JSON, and more.

In ADF, I would design a pipeline to process and consolidate the CSV files into a single SQL table within a data warehouse structure. This consolidated table would then serve as the foundation for training a predictive model, allowing the extraction of valuable insights. A SQL-based data warehouse would be chosen due to its ability to scale efficiently with large datasets. For JSON files, the pipeline would be configured to organize and clean them, though they would remain stored in Blob Storage for further use. An automated trigger would execute the pipeline at regular intervals to keep the data updated and organized.
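The consolidation step described above can be sketched locally. This is only an illustration under stated assumptions: SQLite (from the Python standard library) stands in for the SQL data warehouse, inline CSV text stands in for the files in Blob Storage, and the table name `animal_tracks` and the helper `load_csv` are hypothetical. In ADF this logic would live in the pipeline definition rather than in hand-written code.

```python
import csv
import io
import sqlite3

# Hypothetical in-memory stand-ins for animals.csv and animal_events.csv.
ANIMALS_CSV = """animal_id,common_name,redlist_cat
A001,Wolf,Least Concern
A002,Bison,Vulnerable
"""
EVENTS_CSV = """animal_id,timestamp,latitude,longitude
A001,2024-09-01 12:00:00,45.2284,-110.7622
A002,2024-09-01 12:00:00,44.576,-110.6763
A001,2024-09-01 13:00:00,44.3946,-110.8218
"""


def load_csv(conn, table, text):
    """Create `table` from CSV text; column names come from the header row."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    columns = ", ".join(header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f"CREATE TABLE {table} ({columns})")
    conn.executemany(f"INSERT INTO {table} VALUES ({marks})", data)


conn = sqlite3.connect(":memory:")  # SQLite stands in for the warehouse
load_csv(conn, "animals", ANIMALS_CSV)
load_csv(conn, "animal_events", EVENTS_CSV)

# Consolidate both sources into one table, keeping every event row.
conn.execute("""
    CREATE TABLE animal_tracks AS
    SELECT e.animal_id, a.common_name, a.redlist_cat, e.timestamp,
           CAST(e.latitude AS REAL) AS latitude,
           CAST(e.longitude AS REAL) AS longitude
    FROM animal_events AS e
    LEFT JOIN animals AS a ON a.animal_id = e.animal_id
""")
consolidated = conn.execute(
    "SELECT * FROM animal_tracks ORDER BY timestamp, animal_id"
).fetchall()
```

The `LEFT JOIN` from events to animals keeps every tracked position even if an animal record were missing, which mirrors the `how='right'` merge used later in this notebook.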

Next, the Databricks environment would be linked to both the data warehouse and Blob Storage. This environment would host a computing cluster with the processing power required to build and train the predictive model. Within Databricks, notebooks would import the necessary libraries and contain the model development code, similar to the structure of this notebook. For modeling I would use PySpark for the code and Hyperopt for hyperparameter optimization, since both support distributed execution and are designed to scale. I would track experiments with MLflow, and manage versions and environments with the tools the Azure Databricks platform provides for this purpose.

Below is the diagram with the representation of what I describe.

![Project Architecture Diagram](architechture_diagram.jpg)

Exploratory Data Analysis 

In [1]:
import json

import pandas as pd
import geopandas as gpd
import shapely
import sqlalchemy
import psycopg2
import osgeo.gdal
import matplotlib.pyplot as plt
from shapely.geometry import box

In [2]:
# Read the JSON file with the satellite data
with open("satellites.json", "r") as file:
    sat_json = json.load(file)
sat_json

[{'satellite_id': 'SAT001',
  'start_time': '2018-09-01T12:00:00Z',
  'last_time': '2024-09-10T12:00:00Z',
  'frequency': 'daily',
  'bounding_box': {'xmin': -112.939131,
   'ymin': 42.596356,
   'xmax': -107.048726,
   'ymax': 46.142424},
  'cloud_cover_percentage': 12.5,
  'resolution': '10m'},
 {'satellite_id': 'SAT002',
  'start_time': '2004-09-01T12:00:00Z',
  'last_time': '2024-09-06T12:00:00Z',
  'frequency': 'bi-weekly',
  'bounding_box': {'xmin': -180, 'ymin': -90, 'xmax': 180, 'ymax': 90},
  'cloud_cover_percentage': 10.0,
  'resolution': '100m'},
 {'satellite_id': 'SAT003',
  'start_time': '2022-09-01T12:00:00Z',
  'last_time': '2024-09-10T12:00:00Z',
  'frequency': 'hourly',
  'bounding_box': {'xmin': -124.178099,
   'ymin': 30.738207,
   'xmax': -95.942831,
   'ymax': 51.538929},
  'cloud_cover_percentage': 10.0,
  'resolution': '20m'}]

In [3]:
# Read the JSON file with the protected areas data
gdf_areas = gpd.read_file("protected_areas.json")
gdf_areas

Unnamed: 0,name,category,protected_area_id,geometry
0,Yellowstone National Park,National Park,PA001,"POLYGON ((-110.839 44.4488, -110.7052 44.599, ..."
1,Yosemite National Park,National Park,PA002,"POLYGON ((-119.655 37.7244, -119.5964 37.6962,..."
2,Grand Canyon National Park,National Park,PA003,"POLYGON ((-112.1861 36.1336, -112.2156 36.2331..."


In [4]:
df1 = pd.read_csv('animals.csv')
df1.head()

Unnamed: 0,animal_id,common_name,scientific_name,redlist_cat,megafauna
0,A001,Wolf,Canis lupus,Least Concern,no
1,A002,Bison,Bison bison,Vulnerable,yes
2,A003,Elk,Cervus canadensis,Least Concern,yes
3,A004,Sierra Nevada bighorn sheep,Ovis canadensis sierrae,Endangered,no
4,A005,Sierra Nevada red fox,Vulpes vulpes necator,Critically Endangered,no


In [5]:
df2 = pd.read_csv('animal_events.csv')
df2.head()

Unnamed: 0,animal_id,timestamp,latitude,longitude
0,A001,2024-09-01 12:00:00,45.2284,-110.7622
1,A002,2024-09-01 12:00:00,44.576,-110.6763
2,A003,2024-09-01 12:00:00,44.4232,-111.1061
3,A004,2024-09-01 12:00:00,37.9058,-119.7857
4,A005,2024-09-01 12:00:00,37.7896,-119.6426


These two datasets will be joined into a single one. When scaling up, this step should be part of the ingestion and transformation pipeline that runs before analysis and modeling. Assuming the files are delivered to Blob Storage, an Azure Data Factory pipeline could build a data warehouse with the transformed data in a SQL Server database, which is well suited to managing large amounts of data.

In [6]:
df1 = pd.read_csv('animals.csv')
df2 = pd.read_csv('animal_events.csv')

merged_df = pd.merge(df1, df2, on='animal_id', how='right')
merged_df['timestamp_date'] = pd.to_datetime(merged_df['timestamp'])
merged_df

Unnamed: 0,animal_id,common_name,scientific_name,redlist_cat,megafauna,timestamp,latitude,longitude,timestamp_date
0,A001,Wolf,Canis lupus,Least Concern,no,2024-09-01 12:00:00,45.2284,-110.7622,2024-09-01 12:00:00
1,A002,Bison,Bison bison,Vulnerable,yes,2024-09-01 12:00:00,44.576,-110.6763,2024-09-01 12:00:00
2,A003,Elk,Cervus canadensis,Least Concern,yes,2024-09-01 12:00:00,44.4232,-111.1061,2024-09-01 12:00:00
3,A004,Sierra Nevada bighorn sheep,Ovis canadensis sierrae,Endangered,no,2024-09-01 12:00:00,37.9058,-119.7857,2024-09-01 12:00:00
4,A005,Sierra Nevada red fox,Vulpes vulpes necator,Critically Endangered,no,2024-09-01 12:00:00,37.7896,-119.6426,2024-09-01 12:00:00
5,A006,Bobcat,Lynx rufus,Least Concern,yes,2024-09-01 12:00:00,37.8829,-119.7608,2024-09-01 12:00:00
6,A007,Mule deer,Odocoileus hemionus,Least Concern,yes,2024-09-01 12:00:00,36.372,-113.1627,2024-09-01 12:00:00
7,A008,Desert bighorn sheep,Ovis canadensis nelsoni,Near Threatened,yes,2024-09-01 12:00:00,36.6193,-112.3388,2024-09-01 12:00:00
8,A009,Gray fox,Urocyon cinereoargenteus,Least Concern,yes,2024-09-01 12:00:00,36.3388,-112.119,2024-09-01 12:00:00
9,A001,Wolf,Canis lupus,Least Concern,no,2024-09-01 13:00:00,44.3946,-110.8218,2024-09-01 13:00:00


The available data corresponds to the positions, in latitude and longitude coordinates, of certain animals at different hours of the day, from 12pm to 2pm. These animals are the Wolf, Bison, Elk, Sierra Nevada bighorn sheep, Sierra Nevada red fox, Bobcat, Mule deer, Desert bighorn sheep, and Gray fox. Each animal also carries labels: its common name, scientific name, Red List category (vulnerability level), and whether or not it is considered megafauna.


The JSON files contain the coordinates that delimit the protected areas, plus information on the satellites that track them: the area each one covers, how frequently it receives data, the period over which it has operated, its cloud cover percentage, and its resolution.

I have already experimented with reading and visualizing the different files with GeoPandas and Folium, another library that I added to the Dockerfile. In the following blocks, the protected areas and the animal positions are overlaid on a world map loaded with the folium library.
I define three different color maps for different visualizations:

In [7]:
#Different colors depending on the Common Name category
color_map_common_name={
    'Wolf': 'black',
    'Bison': 'beige',
    'Elk': 'darkgreen',
    'Sierra Nevada bighorn sheep': 'lightgreen',      
    'Sierra Nevada red fox': 'lightred', 
    'Bobcat': 'orange',
    'Mule deer': 'darkblue',
    'Desert bighorn sheep': 'purple',
    'Gray fox': 'lightgray'
}

#Different colors depending on the Red List category
color_map_red_list = {
    'Least Concern': 'green',
    'Vulnerable': 'orange',
    'Endangered': 'red',
    'Critically Endangered': 'purple',
    'Near Threatened': 'blue'
}

#different colors according to the megafauna category
color_map_megafauna = {
    'yes': 'green',
    'no': 'blue'
}

Visualization of distribution of animals.

In [8]:
import geopandas as gpd
import folium
from shapely.geometry import Point

# Create the GeoDataFrame with coordinates
geometry = [Point(xy) for xy in zip(merged_df['longitude'], merged_df['latitude'])]
gdf = gpd.GeoDataFrame(merged_df, geometry=geometry)

# Define spatial reference system (CRS)
gdf.crs = "EPSG:4326"

# Create the map centered on a midpoint
map_center = [gdf['latitude'].mean(), gdf['longitude'].mean()]
m = folium.Map(location=map_center, zoom_start=6)


for idx, row in gdf.iterrows():
    folium.Marker(
    location=[row['latitude'], row['longitude']],
    popup=f"Common Name: {row['common_name']}<br>Scientific Name: {row['scientific_name']}<br>Red List Category: {row['redlist_cat']}<br>Timestamp: {row['timestamp']}",
    # icon=folium.Icon(color=color_map_red_list.get(row['redlist_cat'], 'gray'))
    # icon=folium.Icon(color=color_map_megafauna.get(row['megafauna'], 'gray'))
    icon=folium.Icon(color=color_map_common_name.get(row['common_name'], 'gray'))
    ).add_to(m)
        
####################################################################
# Read the GeoJSON file
gdf_areas = gpd.read_file("protected_areas.json")

# Convert the GeoDataFrame to GeoJSON
geojson_areas = gdf_areas.to_json()

# Add GeoJSON data to the map
folium.GeoJson(geojson_areas).add_to(m)


#Show the map
m

With the help of the folium map, it can be seen that the animals are found around three different areas: Grand Canyon National Park, Yosemite National Park, and Yellowstone National Park. The blue outlines delimit the protected areas, and most of the animals were rarely found within them, at least during the time they were tracked.
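The visual impression that few positions fall inside the protected areas can also be checked numerically. The sketch below is a plain-Python ray-casting test. The rectangle `area` is a hypothetical placeholder, not a real park boundary, because the polygon coordinates are truncated in the output above. The two wolf positions come from the events table.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside `polygon`.

    `polygon` is a list of (lon, lat) vertices; the ring is closed implicitly.
    Points exactly on an edge may be classified either way.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray going east from the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical rectangular area (NOT the real park boundary).
area = [(-111.0, 44.0), (-110.0, 44.0), (-110.0, 45.0), (-111.0, 45.0)]

# Two wolf positions (lon, lat) taken from the events table.
positions = {
    "2024-09-01 12:00:00": (-110.7622, 45.2284),
    "2024-09-01 13:00:00": (-110.8218, 44.3946),
}
inside_flags = {
    timestamp: point_in_polygon(lon, lat, area)
    for timestamp, (lon, lat) in positions.items()
}
```

In the notebook itself, shapely offers the same test through `polygon.contains(point)`, which could be applied to the geometries in `gdf_areas`.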

Assuming that the IDs correspond to unique animals, or unique groups of animals, their trajectories can be followed during the three hours in which they were tracked. I will use TimestampedGeoJson to show the paths the animals follow in the next time-interactive map, where the buttons help to step through the positions at the three hours, 12pm, 1pm, and 2pm:

In [9]:
import pandas as pd
from shapely.geometry import Point, LineString
import geopandas as gpd

df = merged_df

# Convert the 'timestamp' column to datetime format
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Convert the DataFrame into a GeoDataFrame
geometry = [Point(xy) for xy in zip(df['longitude'], df['latitude'])]
gdf = gpd.GeoDataFrame(df, geometry=geometry)


import folium
from folium.plugins import TimestampedGeoJson

# Create a map centered at the mean location of the points
m = folium.Map(location=[df['latitude'].mean(), df['longitude'].mean()], zoom_start=6)

# Create a color palette to differentiate the animals
color_map = {
    'Wolf': 'blue',
    'Bison': 'green',
    'Elk': 'red',
    'Sierra Nevada bighorn sheep': 'purple',
    'Sierra Nevada red fox': 'orange',
    'Bobcat': 'darkred',
    'Mule deer': 'darkblue',
    'Desert bighorn sheep': 'pink',
    'Gray fox': 'black'
}


# Create shifted columns (next_latitude and next_longitude)
gdf['next_latitude'] = gdf['latitude'].shift(-1)
gdf['next_longitude'] = gdf['longitude'].shift(-1)

# Create a GeoJSON with the points and add the corresponding time
features = []
for _, row in gdf.iterrows():
    feature = {
        'type': 'Feature',
        'geometry': {
            'type': 'Point',
            'coordinates': [row['longitude'], row['latitude']]
        },
        'properties': {
            'time': row['timestamp'].isoformat(),  # Now the timestamp is of datetime type
            'style': {'color': color_map[row['common_name']]},
            'popup': f"Animal: {row['common_name']}<br>Timestamp: {row['timestamp']}",
            'icon': 'star',
            'iconstyle': {
                'fillColor': color_map[row['common_name']],
                'fillOpacity': 0.7,
                'stroke': 'true',
                'radius': 6}
        }
    }
    features.append(feature)



# Create a GeoJSON compatible with Folium
geojson_data = {
    'type': 'FeatureCollection',
    'features': features
}

# Add the GeoJSON points to the map with timestamp support
TimestampedGeoJson(
    geojson_data,
    period='PT1H',  # The time period for the points to appear (can be adjusted)
    add_last_point=True,
    auto_play=True,  # Automatically plays
    loop=False,      # Do not repeat the cycle
    max_speed=1,     # Maximum playback speed
    loop_button=True,  # Option for looping the map
    date_options='YYYY-MM-DD HH:mm:ss',  # Time format
    time_slider_drag_update=True  # Allow dragging the time slider
).add_to(m)

# Add the trajectories (lines) for each animal
for animal in df['common_name'].unique():
    # Filter the data for each animal
    animal_data = gdf[gdf['common_name'] == animal].sort_values(by='timestamp')
    
    # Create a list of coordinates for the trajectories
    trajectory = list(zip(animal_data['latitude'], animal_data['longitude']))
    
    # Add the trajectory as a PolyLine to the map
    folium.PolyLine(trajectory, color=color_map[animal], weight=4, opacity=1).add_to(m)



####################################################################
# Read the GeoJSON file
gdf_areas = gpd.read_file("protected_areas.json")

# Convert the GeoDataFrame to GeoJSON
geojson_areas = gdf_areas.to_json()

# Add the GeoJSON data to the map
folium.GeoJson(geojson_areas).add_to(m)


# Display the map
m

In Yosemite, only the bighorn sheep crosses the protected area, at 2pm. The bobcat appears to approach but does not enter, and the red fox was nearby at 1pm but then moves away and appears to circle the area. So the bighorn sheep made the only entry into the area during the tracked period.

In Grand Canyon National Park, only the Mule deer enters the protected area, from the northwest at 1pm. The Desert bighorn sheep stays away, and the Gray fox was close at 12pm but then moves away to the northwest. Again, there was only one entry into the protected area, by the Mule deer.

In Yellowstone, the Elk approaches the protected area from the west but then turns back. The Wolf was in the north, near Dailey Lake, at 1pm, approached the protected area and crossed a small fraction of it, then headed east at 2pm, apparently skirting Yellowstone Lake. It probably stays near lakes. The Bison was inside the protected area at 12pm, moved through it, and by 2pm had already left, heading southwest. This is the area where the most animals are observed inside the protected area, with one entry and two exits by the Wolf and the Bison from 12pm to 2pm.
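To complement the visual reading of the trajectories, the distance covered between consecutive fixes can be estimated with the haversine formula. This is a minimal sketch, assuming a spherical Earth with a mean radius of 6371 km. The helper names are hypothetical, and the two wolf positions are the 12:00 and 13:00 rows from the events table.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def step_distances(track):
    """Distances between consecutive (lat, lon) fixes of one animal."""
    return [haversine_km(*p, *q) for p, q in zip(track, track[1:])]


# Wolf (A001) at 12:00 and 13:00, from the events table.
wolf_track = [(45.2284, -110.7622), (44.3946, -110.8218)]
wolf_steps = step_distances(wolf_track)
```

Step distances like these give a quick plausibility check on whether an ID really represents a single animal, since very large hourly jumps would be hard to explain otherwise.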

In [10]:
import pandas as pd
from shapely.geometry import Point, LineString
import geopandas as gpd

df = merged_df

# Convert the 'timestamp' column to datetime format
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Convert the DataFrame into a GeoDataFrame
geometry = [Point(xy) for xy in zip(df['longitude'], df['latitude'])]
gdf = gpd.GeoDataFrame(df, geometry=geometry)


import folium
from folium.plugins import TimestampedGeoJson

# Create a map centered at the mean location of the points
m = folium.Map(location=[df['latitude'].mean(), df['longitude'].mean()], zoom_start=6)

# Create a color palette to differentiate the animals
color_map = {
    'Wolf': 'blue',
    'Bison': 'green',
    'Elk': 'red',
    'Sierra Nevada bighorn sheep': 'purple',
    'Sierra Nevada red fox': 'orange',
    'Bobcat': 'darkred',
    'Mule deer': 'darkblue',
    'Desert bighorn sheep': 'pink',
    'Gray fox': 'lightgray'
}


# Create shifted columns (next_latitude and next_longitude)
gdf['next_latitude'] = gdf['latitude'].shift(-1)
gdf['next_longitude'] = gdf['longitude'].shift(-1)

# Create a GeoJSON with the points and add the corresponding time
features = []
for _, row in gdf.iterrows():
    feature = {
        'type': 'Feature',
        'geometry': {
            'type': 'Point',
            'coordinates': [row['longitude'], row['latitude']]
        },
        'properties': {
            'time': row['timestamp'].isoformat(),  # The timestamp is now of datetime type
            'style': {'color': color_map[row['common_name']]},
            'popup': f"Animal: {row['common_name']}<br>Timestamp: {row['timestamp']}"
        }
    }
    features.append(feature)

# Create a GeoJSON compatible with Folium
geojson_data = {
    'type': 'FeatureCollection',
    'features': features
}

# Add the GeoJSON points to the map with timestamp support
TimestampedGeoJson(
    geojson_data,
    period='PT1H',  # The time period for the points to appear (can be adjusted)
    add_last_point=True,
    auto_play=True,  # Automatically plays
    loop=False,      # Do not repeat the cycle
    max_speed=1,     # Maximum playback speed
    loop_button=True,  # Option for looping the map
    date_options='YYYY-MM-DD HH:mm:ss',  # Time format
    time_slider_drag_update=True  # Allow dragging the time slider
).add_to(m)




# Add the trajectories (lines) for each animal
for animal in df['common_name'].unique():
    # Filter the data for each animal
    animal_data = gdf[gdf['common_name'] == animal].sort_values(by='timestamp')

    # Create a list of coordinates for the trajectories
    trajectory = list(zip(animal_data['latitude'], animal_data['longitude']))

    # Add the trajectory as a PolyLine to the map
    folium.PolyLine(trajectory, color=color_map[animal], weight=2.5, opacity=0.8).add_to(m)





#####################################
# Add satellites data
#####################################

# Read the JSON file with satellite data
with open("satellites.json", "r") as file:
    data = json.load(file)
# Create a list to store the bounding boxes
geometries = []
satellite_ids= []
cloud_covers = []
resolutions = []
frequencies=[]
for feature in data:
    bbox = feature['bounding_box']
    # Create a box geometry from the bounding box
    min_lon, min_lat = bbox['xmin'], bbox['ymin']
    max_lon, max_lat = bbox['xmax'], bbox['ymax']
    geometries.append(box(min_lon, min_lat, max_lon, max_lat))
    satellite_ids.append(feature['satellite_id'])
    cloud_covers.append(feature['cloud_cover_percentage'])
    resolutions.append(feature['resolution'])
    frequencies.append(feature['frequency'])

# Create a GeoDataFrame with the bounding boxes
gdf_sat = gpd.GeoDataFrame(geometry=geometries)


# Add the bounding boxes to the map with popups
for geom, satellite_id, cloud_cover, resolution,frequency in zip(gdf_sat.geometry, satellite_ids,cloud_covers,resolutions,frequencies):
    folium.GeoJson(
        geom,
        popup=folium.Popup(f"Satellite ID: {satellite_id}<br>Cloud Cover Percentage: {cloud_cover} <br>Resolution: {resolution}<br>Frequency: {frequency} ")
    ).add_to(m)


# Add the GeoJSON data to the map
folium.GeoJson(geojson_areas).add_to(m)



# Display the map
m

Given that there is data on the animals every hour, the positions were plausibly sent by the SAT003 satellite, which collects data every hour and covers all the areas where the tracked animals are found. The satellite with the smallest coverage is SAT001, which covers only the Yellowstone area and has a cloud cover percentage of 12.5%, slightly higher than the 10% reported by the other two satellites. This could point to a slightly greater chance of rain in Yellowstone than in the other areas, although the difference is small. Overall, assuming the reported value is an average over the covered time period, cloud cover is low, which could indicate a low probability of bad weather.
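The coverage reasoning above can be reproduced with a simple bounding-box test. In this sketch, the bounding boxes and frequencies are copied from the `satellites.json` output shown earlier, and the sample positions are 12:00 rows from the events table, one per region. The helper names `covers` and `covering_satellites` are hypothetical.

```python
# Bounding boxes and frequencies copied from the satellites.json output.
SATELLITES = [
    {"satellite_id": "SAT001", "frequency": "daily",
     "bounding_box": {"xmin": -112.939131, "ymin": 42.596356,
                      "xmax": -107.048726, "ymax": 46.142424}},
    {"satellite_id": "SAT002", "frequency": "bi-weekly",
     "bounding_box": {"xmin": -180, "ymin": -90, "xmax": 180, "ymax": 90}},
    {"satellite_id": "SAT003", "frequency": "hourly",
     "bounding_box": {"xmin": -124.178099, "ymin": 30.738207,
                      "xmax": -95.942831, "ymax": 51.538929}},
]


def covers(bbox, lat, lon):
    """True if (lat, lon) falls inside the bounding box (edges included)."""
    return (bbox["xmin"] <= lon <= bbox["xmax"]
            and bbox["ymin"] <= lat <= bbox["ymax"])


def covering_satellites(lat, lon, satellites=SATELLITES):
    """IDs of the satellites whose bounding box contains the position."""
    return [s["satellite_id"] for s in satellites
            if covers(s["bounding_box"], lat, lon)]


# 12:00 positions (lat, lon) from the events table, one per region.
sample_positions = {
    "Wolf": (45.2284, -110.7622),
    "Sierra Nevada bighorn sheep": (37.9058, -119.7857),
    "Mule deer": (36.372, -113.1627),
}
coverage = {name: covering_satellites(lat, lon)
            for name, (lat, lon) in sample_positions.items()}

# Hourly satellites that cover every sampled position.
hourly_everywhere = [
    s["satellite_id"] for s in SATELLITES
    if s["frequency"] == "hourly"
    and all(covers(s["bounding_box"], lat, lon)
            for lat, lon in sample_positions.values())
]
```

Only a satellite that is both hourly and covers all three regions is consistent with the hourly animal events, which is the basis for attributing the data to SAT003.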

Megafauna vs. Non-Megafauna Map

In [11]:
# Create the map centered on a midpoint
m = folium.Map(location=map_center, zoom_start=6)

# Add the points to the map with different colors depending on whether they are megafauna
for idx, row in gdf.iterrows():
    color = 'green' if row['megafauna'] == 'yes' else 'blue'
    folium.Marker(
        location=[row['latitude'], row['longitude']],
        popup=f"Common Name: {row['common_name']}<br>Scientific Name: {row['scientific_name']}<br>Red List Category: {row['redlist_cat']}<br>Timestamp: {row['timestamp']}",
        icon=folium.Icon(color=color)
    ).add_to(m)

# Add the GeoJSON data to the map
folium.GeoJson(geojson_areas).add_to(m)

# Display the map
m

Regarding the size distribution of the animals, Grand Canyon National Park has only megafauna records, and these animals are almost never inside the protected area. This matters because such animals tend to be more vulnerable to hunting.

In Yosemite, most of the animals are not megafauna, except for the Bobcat, while in Yellowstone most of the animals are megafauna, except for the Wolf.

Vulnerability Map (Red List)

In [12]:
import geopandas as gpd
import folium
from shapely.geometry import Point

# Create the GeoDataFrame
geometry = [Point(xy) for xy in zip(merged_df['longitude'], merged_df['latitude'])]
gdf = gpd.GeoDataFrame(merged_df, geometry=geometry)

# Define the spatial reference system (CRS)
gdf.crs = "EPSG:4326"

# Create the map centered on a midpoint
map_center = [gdf['latitude'].mean(), gdf['longitude'].mean()]
m = folium.Map(location=map_center, zoom_start=6)

# Add the points to the map with different colors according to the Red List category
color_map = {
    'Least Concern': 'green',
    'Near Threatened': 'blue',
    'Vulnerable': 'orange',
    'Endangered': 'red',
    'Critically Endangered': 'black'
}

for idx, row in gdf.iterrows():
    folium.Marker(
        location=[row['latitude'], row['longitude']],
        popup=f"Common Name: {row['common_name']}<br>Scientific Name: {row['scientific_name']}<br>Red List Category: {row['redlist_cat']}<br>Timestamp: {row['timestamp']}",
        icon=folium.Icon(color=color_map.get(row['redlist_cat'], 'gray'))
    ).add_to(m)

# Add the GeoJSON data to the map
folium.GeoJson(geojson_areas).add_to(m)

# Display the map
m

From the map differentiating animals by vulnerability, the following can be noted:

There are no highly endangered tracked animals, or animals with a significant level of concern, in Grand Canyon National Park. Only the Desert bighorn sheep carries a mild level of concern, Near Threatened.

In Yellowstone, the Bison stands out with a higher vulnerability level (Vulnerable). This may be because, as a large animal (megafauna), it is more exposed to hunting. Its movement tends to cross the protected area from the north, then leave it and continue toward the south. However, the Elk is also considered megafauna and is not at a level of vulnerability. There are reports that threatened animals in this area have benefited from successful conservation programs.

The most striking area regarding the Red List is Yosemite, since it has two highly threatened species: the Sierra Nevada bighorn sheep, marked as Endangered, and the Sierra Nevada red fox, which is Critically Endangered. This is curious, since the animals that are not considered megafauna, that is, the small animals, are precisely the most vulnerable ones. The Bobcat, which the dataset labels as megafauna and which could therefore be more likely to be a victim of hunting, carries no level of concern. In this case, there are reports of significant climate changes in the area, and since many of its species are endemic, this is what puts them at risk. A decrease in snow due to warming has also been observed, which affects the red fox and puts it in critical condition.


Now, let's try to predict the animals' movement patterns. Let's use the merged dataframe:

In [13]:

df1 = pd.read_csv('animals.csv')
df2 = pd.read_csv('animal_events.csv')
merged_df = pd.merge(df1, df2, on='animal_id', how='right')
merged_df

Unnamed: 0,animal_id,common_name,scientific_name,redlist_cat,megafauna,timestamp,latitude,longitude
0,A001,Wolf,Canis lupus,Least Concern,no,2024-09-01 12:00:00,45.2284,-110.7622
1,A002,Bison,Bison bison,Vulnerable,yes,2024-09-01 12:00:00,44.576,-110.6763
2,A003,Elk,Cervus canadensis,Least Concern,yes,2024-09-01 12:00:00,44.4232,-111.1061
3,A004,Sierra Nevada bighorn sheep,Ovis canadensis sierrae,Endangered,no,2024-09-01 12:00:00,37.9058,-119.7857
4,A005,Sierra Nevada red fox,Vulpes vulpes necator,Critically Endangered,no,2024-09-01 12:00:00,37.7896,-119.6426
5,A006,Bobcat,Lynx rufus,Least Concern,yes,2024-09-01 12:00:00,37.8829,-119.7608
6,A007,Mule deer,Odocoileus hemionus,Least Concern,yes,2024-09-01 12:00:00,36.372,-113.1627
7,A008,Desert bighorn sheep,Ovis canadensis nelsoni,Near Threatened,yes,2024-09-01 12:00:00,36.6193,-112.3388
8,A009,Gray fox,Urocyon cinereoargenteus,Least Concern,yes,2024-09-01 12:00:00,36.3388,-112.119
9,A001,Wolf,Canis lupus,Least Concern,no,2024-09-01 13:00:00,44.3946,-110.8218


Intuitively, the most important fields for determining movement should be the animal (common_name), whether it is megafauna, its vulnerability level, and the time of day. These columns serve as features, while the pair (latitude, longitude) is the target.

Since the dataset is small, I propose a RandomForestRegressor from sklearn as a first approach to prediction.

In [14]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder

# Load the data
df = merged_df

def Preprocessing_function(df):
    # Encode categorical variables
    encoder = OneHotEncoder(drop='first', sparse_output=False)
    categorical_columns = ['common_name', 'scientific_name', 'redlist_cat', 'megafauna']
    encoded_cats = encoder.fit_transform(df[categorical_columns])
    
    # Create a DataFrame with the encoded columns
    encoded_df = pd.DataFrame(encoded_cats, columns=encoder.get_feature_names_out(categorical_columns))
    
    # Extract temporal features
    df['hour'] = pd.to_datetime(df['timestamp']).dt.hour
    df['day_of_week'] = pd.to_datetime(df['timestamp']).dt.dayofweek
    
    # Combine the encoded features with the original DataFrame
    df = pd.concat([df, encoded_df], axis=1)
    
    # Select features for the model
    X = df.drop(columns=['animal_id', 'timestamp', 'latitude', 'longitude', 'common_name', 'scientific_name', 'redlist_cat', 'megafauna'])
    y = df[['latitude', 'longitude']]
    return X, y

X, y = Preprocessing_function(df)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# Create and train the RandomForest model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

Mean Squared Error: 0.7888740612860278
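The reported value is easier to interpret as a root mean squared error in degrees. By default, sklearn's `mean_squared_error` averages the squared errors over both target columns (latitude and longitude). The sketch below reproduces that averaging by hand on a toy pair and converts the reported MSE to an RMSE. The factor of about 111 km per degree of latitude is a rough approximation and does not apply directly to longitude.

```python
import math


def mean_squared_error_2d(y_true, y_pred):
    """Average squared error over all rows and both outputs (lat, lon)."""
    squared = [
        (t - p) ** 2
        for row_true, row_pred in zip(y_true, y_pred)
        for t, p in zip(row_true, row_pred)
    ]
    return sum(squared) / len(squared)


# Toy check of the averaging: errors of 0.5 deg (lat) and 1.0 deg (lon).
toy_mse = mean_squared_error_2d([(44.0, -110.0)], [(44.5, -111.0)])

reported_mse = 0.7888740612860278   # value printed by the cell above
rmse_deg = math.sqrt(reported_mse)  # typical error size, in degrees

KM_PER_DEG_LAT = 111.0              # rough conversion, latitude only
rmse_km_lat = rmse_deg * KM_PER_DEG_LAT
```

Read this way, the error is on the order of the size of the protected areas themselves, so the predictions below are best treated as coarse indications of direction rather than precise positions.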


Let's use this model to predict the positions of the animals at untracked hours of the day. This will only be done for hours close to those included in the dataset, since it is small and has few records.

In [15]:
# create function that takes the time of day and returns the prediction of the animals' positions at that time
def predict_migration(hours,merged_df):
    for hour in hours:
        data = {
            'animal_id': ['A001', 'A002', 'A003', 'A004', 'A005', 'A006', 'A007', 'A008','A009'],
            'common_name': ['Wolf', 'Bison', 'Elk', 'Sierra Nevada bighorn sheep', 
                            'Sierra Nevada red fox', 'Bobcat', 'Mule deer', 'Desert bighorn sheep','Gray fox'],
            'scientific_name': ['Canis lupus', 'Bison bison', 'Cervus canadensis', 
                                'Ovis canadensis sierrae', 'Vulpes vulpes necator', 
                                'Lynx rufus', 'Odocoileus hemionus', 'Ovis canadensis nelsoni','Urocyon cinereoargenteus'],
            'redlist_cat': ['Least Concern', 'Vulnerable', 'Least Concern', 'Endangered', 
                            'Critically Endangered', 'Least Concern', 'Least Concern', 'Near Threatened','Least Concern'],
            'megafauna': ['no', 'yes', 'yes', 'no', 'no', 'yes', 'yes', 'yes','yes'],
            'timestamp': ['2024-09-01 {}:00:00'.format(hour), '2024-09-01 {}:00:00'.format(hour), '2024-09-01 {}:00:00'.format(hour), 
                          '2024-09-01 {}:00:00'.format(hour), '2024-09-01 {}:00:00'.format(hour), '2024-09-01 {}:00:00'.format(hour), 
                          '2024-09-01 {}:00:00'.format(hour), '2024-09-01 {}:00:00'.format(hour), '2024-09-01 {}:00:00'.format(hour)],
            'latitude': [45.2284, 44.5760, 44.4232, 37.9058, 37.7896, 37.8829, 36.3720, 36.6193, 36.3388],
            'longitude': [-110.7622, -110.6763, -111.1061, -119.7857, -119.6426, -119.7608, -113.1627, -112.3388, -112.1190]
        }
        df = pd.DataFrame(data)
        X,y = Preprocessing_function(df)
        y_pred = model.predict(X)
        
        df['latitude'] = y_pred[:, 0]
        df['longitude'] = y_pred[:, 1]
        # Append this hour's predictions inside the loop so that every requested hour is kept
        merged_df = pd.concat([merged_df, df.drop(['hour','day_of_week'], axis=1)], ignore_index=True)
    return merged_df

Predict for 15:00 and 16:00

In [16]:
predict_df = predict_migration(["15","16"], merged_df)

We could use the same blocks of code that we have already used for the visualizations to see movement paths:

In [17]:
import pandas as pd
from shapely.geometry import Point, LineString
import geopandas as gpd

df = predict_df

# Convert the 'timestamp' column to datetime format
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Convert the DataFrame into a GeoDataFrame

geometry = [Point(xy) for xy in zip(df['longitude'], df['latitude'])]
gdf = gpd.GeoDataFrame(df, geometry=geometry)


import folium
from folium.plugins import TimestampedGeoJson

# Create a map centered at the mean location of the points
m = folium.Map(location=[df['latitude'].mean(), df['longitude'].mean()], zoom_start=6)

# Create a color palette to differentiate the animals
color_map = {
    'Wolf': 'blue',
    'Bison': 'green',
    'Elk': 'red',
    'Sierra Nevada bighorn sheep': 'purple',
    'Sierra Nevada red fox': 'orange',
    'Bobcat': 'darkred',
    'Mule deer': 'darkblue',
    'Desert bighorn sheep': 'pink',
    'Gray fox': 'black'
}


# Create shifted columns (next_latitude and next_longitude)
gdf['next_latitude'] = gdf['latitude'].shift(-1)
gdf['next_longitude'] = gdf['longitude'].shift(-1)

# Create a GeoJSON with the points and add the corresponding time
features = []
for _, row in gdf.iterrows():
    feature = {
        'type': 'Feature',
        'geometry': {
            'type': 'Point',
            'coordinates': [row['longitude'], row['latitude']]
        },
        'properties': {
            'time': row['timestamp'].isoformat(),  # The timestamp is now of datetime type
            'style': {'color': color_map[row['common_name']]},
            'popup': f"Animal: {row['common_name']}<br>Timestamp: {row['timestamp']}"
        }
    }
    features.append(feature)

# Create a GeoJSON compatible with Folium
geojson_data = {
    'type': 'FeatureCollection',
    'features': features
}

# Add the GeoJSON points to the map with timestamp support
TimestampedGeoJson(
    geojson_data,
    period='PT1H',  # The time period for the points to appear (can be adjusted)
    add_last_point=True,
    auto_play=True,  # Automatically plays
    loop=False,      # Do not repeat the cycle
    max_speed=1,     # Maximum playback speed
    loop_button=True,  # Option for looping the map
    date_options='YYYY-MM-DD HH:mm:ss',  # Time format
    time_slider_drag_update=True  # Allow dragging the time slider
).add_to(m)

# Add the trajectories (lines) for each animal
for animal in df['common_name'].unique():
    # Filter the data for each animal
    animal_data = gdf[gdf['common_name'] == animal].sort_values(by='timestamp')

    # Create a list of coordinates for the trajectories
    trajectory = list(zip(animal_data['latitude'], animal_data['longitude']))

    # Add the trajectory as a PolyLine to the map
    folium.PolyLine(trajectory, color=color_map[animal], weight=4, opacity=1).add_to(m)



####################################################################
# Read the GeoJSON file
gdf_areas = gpd.read_file("protected_areas.json")

# Convert the GeoDataFrame to GeoJSON
geojson_areas = gdf_areas.to_json()

# Add the GeoJSON data to the map
folium.GeoJson(geojson_areas).add_to(m)


# Display the map
m

Scrolling through the positions with the forward button, the predictions for Yosemite suggest an eastward movement of the species and a departure of the bighorn sheep from the protected area during the following two hours, for which there is no data.

In Yellowstone, the predicted tendency is to move from north to south, away from the protected area.

In Grand Canyon, the trend is not very clear: the animals seem to stay around the park, with only a slight drift toward its north.
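Direction statements like these can be made quantitative by computing the initial bearing between two positions and mapping it to a compass point. This is a minimal sketch with hypothetical helper names. Since the predicted coordinates are not printed in this notebook, it is illustrated with the two tracked wolf positions (12:00 and 13:00) from the events table.

```python
import math


def initial_bearing_deg(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing from point 1 to point 2.

    Returned in degrees clockwise from north, in the range [0, 360).
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlmb))
    return math.degrees(math.atan2(y, x)) % 360.0


def compass_point(bearing):
    """Map a bearing in degrees to one of eight compass points."""
    names = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]
    return names[int((bearing + 22.5) // 45) % 8]


# Wolf (A001): 12:00 position followed by 13:00 position.
wolf_bearing = initial_bearing_deg(45.2284, -110.7622, 44.3946, -110.8218)
wolf_direction = compass_point(wolf_bearing)
```

Applied to the rows of `predict_df`, the same two functions would give one direction per animal and hour, which could then be compared with the visual impressions from the map.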


Given the need to consider the scalability and automation of the pipeline, a second model was developed (in the Jupyter notebook Machine Learning Model (Pyspark and Hyperopt)) using the same random forest algorithm, but implemented with PySpark and its machine learning library, along with Hyperopt for hyperparameter optimization. Both libraries are supported by platforms like Databricks and are designed for distributed use, enabling the parallelization of both training and hyperparameter optimization. In this case, two separate models are considered: one to predict latitude and the other to predict longitude. The features are common_name, scientific_name, redlist_cat, megafauna, hour, month, day, and year. Although the date features are unnecessary for the small datasets provided, they are included under the assumption that complete datasets might span many days or even years, in which case features like year or day could be significant.
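As an illustration of the date features listed above, the sketch below derives hour, day, month, and year from a timestamp string in the format used by `animal_events.csv`. It uses only the standard library, and the helper name `time_features` is hypothetical. The PySpark notebook would presumably do this with Spark's own date functions instead.

```python
from datetime import datetime


def time_features(timestamp):
    """Split a 'YYYY-MM-DD HH:MM:SS' string into model-ready date features."""
    ts = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {"hour": ts.hour, "day": ts.day, "month": ts.month, "year": ts.year}


# First timestamp from the events table.
example_features = time_features("2024-09-01 12:00:00")
```

With the provided data, only `hour` varies across rows, which is why the other three features add no information until longer time spans are ingested.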

Hyperopt tunes hyperparameters with a Bayesian approach, but with the small datasets provided I observed an overfitted model. This option should work better with larger datasets, so I decided to include it as an attached document.