# 67818 Applied Competitive Lab in Data Science

## Final Project

### Participants:

- **Name:** Matheus de Souza
- **Student ID:** 202000000

#### 1. Data Preparation and Feature Engineering:

We start by defining a function that computes the distance between two points on the Earth's surface from their latitude and longitude. We then use it to calculate the distance from each fire incident to the nearest nature point, city, and camp.

For our original data of about 571,000 rows, the calculations take approximately:

- Nature points: 10 minutes
- Camps: 5 minutes
- Cities: 5 minutes
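
To make the distance step concrete, here is a minimal sketch of a great-circle (haversine) distance and a brute-force nearest-point search. It is a spherical approximation, not the ellipsoidal `geodesic` from geopy used in the cell below, so results can differ slightly. The names `haversine_km` and `nearest_distance_km` and the sample coordinates are illustrative assumptions, not part of the project code.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; assumes a spherical Earth


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_distance_km(point, targets):
    """Distance from `point` to the closest of `targets`; inf if `targets` is empty."""
    lat, lon = point
    return min(
        (haversine_km(lat, lon, t_lat, t_lon) for t_lat, t_lon in targets),
        default=float("inf"),
    )


# Hypothetical coordinates, for illustration only
fire = (40.0, -105.0)
camps = [(40.5, -105.0), (41.0, -104.0), (39.0, -106.5)]
print(f"{nearest_distance_km(fire, camps):.1f} km")
```

A brute-force search like this compares every fire against every target, which is why a spatial index is worth using for hundreds of thousands of rows.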

In [5]:
from tqdm import tqdm
import pandas as pd
import geopandas as gpd
from geopy.distance import geodesic
import numpy as np


def calculate_distances_gdf(source_gdf, target_gdf, target_name):
    """Return the geodesic distance (km) from each source point to its nearest target point.

    Candidates come from the first search box that contains any target, so a
    slightly closer target just outside that box can be missed.
    """
    # Reproject to WGS 84 (returns copies, so the caller's GeoDataFrames are untouched)
    source_gdf = source_gdf.to_crs(epsg=4326)
    target_gdf = target_gdf.to_crs(epsg=4326)

    # Spatial index for target points
    target_sindex = target_gdf.sindex

    distances_km = []

    for idx, point in tqdm(source_gdf.geometry.items(), total=len(source_gdf),
                           desc="Calculating distances for " + target_name):
        # Grow the search box (in degrees) until it contains at least one target
        for buffer_deg in np.linspace(0.1, 30, num=300):
            bounds = point.buffer(buffer_deg).bounds
            possible_matches_index = list(target_sindex.intersection(bounds))
            if possible_matches_index:
                possible_matches = target_gdf.iloc[possible_matches_index]
                break
        else:
            # No target within 30 degrees: fall back to checking every target
            print("Calculating all possibilities for index ", idx)
            print("Target:", target_name)
            print("Point:", point)
            possible_matches = target_gdf

        # Geodesic distance (lat, lon order) to each candidate; keep the smallest
        point_coords = (point.y, point.x)
        min_distance = min(
            (geodesic(point_coords, (target.y, target.x)).kilometers
             for target in possible_matches.geometry),
            default=float('inf'),
        )
        distances_km.append(min_distance)

    return distances_km

# Load data from CSV files
data_df = pd.read_csv('data.csv')
nature_df = pd.read_csv('NaturePoints.csv')
cities_df = pd.read_csv('counties_cities.csv')
camp_df = pd.read_csv('camp.csv')

# Convert DataFrames to GeoDataFrames with WGS 84 CRS
data_gdf = gpd.GeoDataFrame(data_df, geometry=gpd.points_from_xy(data_df.LONGITUDE, data_df.LATITUDE), crs="EPSG:4326")
nature_gdf = gpd.GeoDataFrame(nature_df, geometry=gpd.points_from_xy(nature_df.Longitude, nature_df.Latitude), crs="EPSG:4326")
cities_gdf = gpd.GeoDataFrame(cities_df, geometry=gpd.points_from_xy(cities_df.lng, cities_df.lat), crs="EPSG:4326")
camp_gdf = gpd.GeoDataFrame(camp_df, geometry=gpd.points_from_xy(camp_df.lon, camp_df.lat), crs="EPSG:4326")

# Calculate distances
data_df['distance_to_nearest_nature_km'] = calculate_distances_gdf(data_gdf, nature_gdf, "nature")
data_df['distance_to_nearest_camp_km'] = calculate_distances_gdf(data_gdf, camp_gdf, "camp")
data_df['distance_to_nearest_city_km'] = calculate_distances_gdf(data_gdf, cities_gdf, "city")

data_df.to_csv('data_with_distances.csv', index=False)

Calculating distances for nature:  11%|█         | 60891/571425 [01:02<08:45, 971.47it/s] 


KeyboardInterrupt: 

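Now we load the saved distances and compare the average distance to the nearest nature point, camp, and city for each fire cause, as well as the average distance to the nearest city for each fire size class.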
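
The per-cause averages can also be computed with a single `groupby` instead of filtering the DataFrame once per cause. The following is a minimal sketch on a tiny made-up sample, assuming the column names used in this notebook; the values in `sample` are hypothetical and chosen only to show the shape of the result.

```python
import pandas as pd

# Tiny made-up sample; values are illustrative, not taken from the dataset
sample = pd.DataFrame({
    "STAT_CAUSE_DESCR": ["Arson", "Arson", "Campfire", "Campfire"],
    "distance_to_nearest_nature_km": [20.0, 30.0, 10.0, 14.0],
    "distance_to_nearest_camp_km": [18.0, 16.0, 12.0, 8.0],
    "distance_to_nearest_city_km": [6.0, 8.0, 9.0, 11.0],
})

distance_cols = [
    "distance_to_nearest_nature_km",
    "distance_to_nearest_camp_km",
    "distance_to_nearest_city_km",
]

# One row per cause, one column per mean distance
avg_by_cause = sample.groupby("STAT_CAUSE_DESCR")[distance_cols].mean().round(2)
print(avg_by_cause)
```

The result is a single table indexed by cause, which is easier to sort or plot than a series of printed lines.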

In [4]:
import pandas as pd

# Load the dataset
df = pd.read_csv("data_with_distances.csv")

# Get unique values in the STAT_CAUSE_DESCR column
unique_causes = df['STAT_CAUSE_DESCR'].unique()

for cause in unique_causes:
    # Filter the dataframe for the current cause
    cause_df = df[df['STAT_CAUSE_DESCR'] == cause]

    # Calculate the average distances
    avg_nature_distance = cause_df['distance_to_nearest_nature_km'].mean()
    avg_camp_distance = cause_df['distance_to_nearest_camp_km'].mean()
    avg_city_distance = cause_df['distance_to_nearest_city_km'].mean()

    # Print the averages for the current cause
    print(f"Average distances for {cause}:")
    print(f"  Nature: {avg_nature_distance:.2f} km")
    print(f"  Camp: {avg_camp_distance:.2f} km")
    print(f"  City: {avg_city_distance:.2f} km")
    
# Group by FIRE_SIZE_CLASS and compute the mean distance to the nearest city,
# reusing the DataFrame loaded above
avg_city_distance_by_fire_size = df.groupby('FIRE_SIZE_CLASS')['distance_to_nearest_city_km'].mean()

print("Average distance from the nearest city for each FIRE_SIZE_CLASS:")
print(avg_city_distance_by_fire_size)

Average distances for Miscellaneous:
  Nature: 29.79 km
  Camp: 16.78 km
  City: 6.43 km
Average distances for Arson:
  Nature: 26.06 km
  Camp: 18.05 km
  City: 6.69 km
Average distances for Debris Burning:
  Nature: 29.36 km
  Camp: 18.74 km
  City: 6.60 km
Average distances for Smoking:
  Nature: 27.84 km
  Camp: 17.35 km
  City: 6.55 km
Average distances for Campfire:
  Nature: 18.66 km
  Camp: 12.47 km
  City: 9.82 km
Average distances for Equipment Use:
  Nature: 29.73 km
  Camp: 16.88 km
  City: 7.27 km
Average distances for Powerline:
  Nature: 44.64 km
  Camp: 16.58 km
  City: 7.33 km
Average distances for Lightning:
  Nature: 19.43 km
  Camp: 19.34 km
  City: 16.21 km
Average distances for Railroad:
  Nature: 27.27 km
  Camp: 17.65 km
  City: 6.38 km
Average distances for Children:
  Nature: 26.48 km
  Camp: 19.20 km
  City: 4.50 km
Average distances for Fireworks:
  Nature: 26.48 km
  Camp: 20.01 km
  City: 4.75 km
Average distances for Structure:
  Nature: 34.12 km
  Camp: 