# 🔮 Traffic Infringement Data Processing

This notebook processes raw traffic infringement data and converts it into the GeoJSON format required for the heatmap visualization.

## Overview

1. Load raw data
2. Clean and preprocess
3. Geocode locations
4. Aggregate by location
5. Transform to GeoJSON and export
6. Copy to the web app


In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import json
import os
from pathlib import Path

## 1. Load Raw Data

First, let's load the traffic infringement data from the CSV file.


In [3]:
# Set paths
ROOT_DIR = Path('../')
DATA_DIR = ROOT_DIR / 'data'
OUTPUT_DIR = Path('../output')

# Create output directory if it doesn't exist
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Load the traffic infringements data
infringements_path = DATA_DIR / 'trafficinfringementsissued.csv'
df = pd.read_csv(infringements_path, header=0)

# Display the first few rows
df.head()

Unnamed: 0,Financial Year,Police Region,Police District,Offence Type,Breach or Ticket,Offence Code,Offence Description,Total
0,2013-14,BRISBANE,NORTH BRISBANE,1M/1.5M Passing Offence,T,3334,FAIL TO MAINTAIN 1M/1.5M WHEN PASSING A BICYCLE,1
1,2013-14,BRISBANE,NORTH BRISBANE,Accreditation,T,4699,ACCREDITED PERSON FAIL TO DISPLAY ACCREDITED D...,1
2,2013-14,BRISBANE,NORTH BRISBANE,Accreditation,T,4700,ACCREDITED PERSON FAIL TO CARRY/PRODUCE ACCRED...,2
3,2013-14,BRISBANE,NORTH BRISBANE,Accreditation,T,4704,DRIVE PILOT VEHICLE NOT IN ACCORDANCE WITH GUI...,1
4,2013-14,BRISBANE,NORTH BRISBANE,Accreditation,T,4705,DRIVE ESCORT VEHICLE NOT IN ACCORDANCE WITH GU...,4


In [4]:
# Basic data exploration
print(f"Dataset shape: {df.shape}")
df.info()

Dataset shape: (75310, 8)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75310 entries, 0 to 75309
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Financial Year       75309 non-null  object
 1   Police Region        75307 non-null  object
 2   Police District      75307 non-null  object
 3   Offence Type         75307 non-null  object
 4   Breach or Ticket     75307 non-null  object
 5   Offence Code         75307 non-null  object
 6   Offence Description  75307 non-null  object
 7   Total                75307 non-null  object
dtypes: object(8)
memory usage: 4.6+ MB


## 2. Clean and Preprocess Data

Before geocoding, we normalize the column names and check each column for missing values.


In [5]:
# Clean column names (strip whitespace, lowercase, replace spaces with underscores)
df.columns = [col.strip().lower().replace(' ', '_') for col in df.columns]

# Show columns after cleaning
df.columns

Index(['financial_year', 'police_region', 'police_district', 'offence_type',
       'breach_or_ticket', 'offence_code', 'offence_description', 'total'],
      dtype='object')

In [6]:
# Check for missing values
print("Missing values by column:")
df.isna().sum()

Missing values by column:


financial_year         1
police_region          3
police_district        3
offence_type           3
breach_or_ticket       3
offence_code           3
offence_description    3
total                  3
dtype: int64
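
The counts above point to a handful of rows that are partly or entirely empty, and `df.info()` reports `total` as `object` rather than a numeric type. One way to handle both is sketched below on a small made-up frame standing in for `df`. The sample rows and the `clean_infringements` helper are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def clean_infringements(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop rows without a district and coerce 'total' to integers."""
    cleaned = frame.dropna(subset=["police_district"]).copy()
    # Non-numeric strings become NaN instead of raising an error
    cleaned["total"] = pd.to_numeric(cleaned["total"], errors="coerce")
    cleaned = cleaned.dropna(subset=["total"])
    cleaned["total"] = cleaned["total"].astype(int)
    return cleaned.reset_index(drop=True)

# Hypothetical sample rows standing in for the real CSV
sample = pd.DataFrame({
    "police_district": ["NORTH BRISBANE", None, "LOGAN", "MACKAY"],
    "total": ["1", None, "4", "n/a"],
})
cleaned_sample = clean_infringements(sample)
print(cleaned_sample)
```

Whether to drop or keep rows with an unparseable `total` depends on how the heatmap should treat them; dropping is only one option.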

## 3. Geocoding

The dataset contains police district names but no coordinates. Rather than calling a geocoding service, we use a hand-built dictionary that maps each district to one approximate representative point.


In [11]:
# ...existing code...
# Example mapping of Queensland regions to approximate coordinates
qld_regions = {
    'NORTH BRISBANE': {'lat': -27.4075, 'lon': 153.0543},
    'SOUTH BRISBANE': {'lat': -27.4809, 'lon': 153.0167},
    'CAPRICORNIA': {'lat': -23.3791, 'lon': 150.5100},  # Centered on Rockhampton
    'MACKAY': {'lat': -21.1412, 'lon': 149.1868},
    'SUNSHINE COAST': {'lat': -26.6500, 'lon': 153.0667},
    'WIDE BAY BURNETT': {'lat': -25.2882, 'lon': 152.3423},  # Centered near Bundaberg
    'FAR NORTH': {'lat': -16.9186, 'lon': 145.7781},  # Centered on Cairns
    'MOUNT ISA': {'lat': -20.7256, 'lon': 139.4927},
    'TOWNSVILLE': {'lat': -19.2590, 'lon': 146.8169},
    'GOLD COAST': {'lat': -28.0167, 'lon': 153.4000},
    'LOGAN': {'lat': -27.6392, 'lon': 153.1086},
    'DARLING DOWNS': {'lat': -27.5598, 'lon': 151.9507},  # Centered on Toowoomba
    'IPSWICH': {'lat': -27.6161, 'lon': 152.7610},
    'MORETON': {'lat': -27.2360, 'lon': 153.1187},  # Centered near Redcliffe
    'SOUTH WEST': {'lat': -27.9500, 'lon': 151.9500},  # Approximate center of the region
}

In [12]:
# Check which locations we have in the data
locations = df['police_district'].unique()
print(f"Locations in the dataset: {len(locations)}")
locations[:20]  # Show the first 20

Locations in the dataset: 17


array(['NORTH BRISBANE', 'SOUTH BRISBANE', 'CAPRICORNIA', 'MACKAY',
       'SUNSHINE COAST', 'WIDE BAY BURNETT', 'FAR NORTH', 'MOUNT ISA',
       'TOWNSVILLE', 'GOLD COAST', 'LOGAN', 'DARLING DOWNS', 'IPSWICH',
       'MORETON', 'SOUTH WEST', 'UNKNOWN', nan], dtype=object)

In [13]:
# Function to add coordinates to the dataframe
def add_coordinates(row):
    location = row['police_district']
    if location in qld_regions:
        return pd.Series([qld_regions[location]['lat'], qld_regions[location]['lon']])
    else:
        # Leave coordinates empty for unmapped locations (e.g. 'UNKNOWN' or NaN) - you may want to handle this differently
        return pd.Series([None, None])

# Apply the function to add latitude and longitude columns
df[['latitude', 'longitude']] = df.apply(add_coordinates, axis=1)

# Check how many locations were successfully geocoded
print(f"Locations with coordinates: {df['latitude'].notna().sum()} out of {len(df)}")

# Display sample with coordinates
df[df['latitude'].notna()].head()

Locations with coordinates: 70102 out of 75310


Unnamed: 0,financial_year,police_region,police_district,offence_type,breach_or_ticket,offence_code,offence_description,total,latitude,longitude
0,2013-14,BRISBANE,NORTH BRISBANE,1M/1.5M Passing Offence,T,3334,FAIL TO MAINTAIN 1M/1.5M WHEN PASSING A BICYCLE,1,-27.4075,153.0543
1,2013-14,BRISBANE,NORTH BRISBANE,Accreditation,T,4699,ACCREDITED PERSON FAIL TO DISPLAY ACCREDITED D...,1,-27.4075,153.0543
2,2013-14,BRISBANE,NORTH BRISBANE,Accreditation,T,4700,ACCREDITED PERSON FAIL TO CARRY/PRODUCE ACCRED...,2,-27.4075,153.0543
3,2013-14,BRISBANE,NORTH BRISBANE,Accreditation,T,4704,DRIVE PILOT VEHICLE NOT IN ACCORDANCE WITH GUI...,1,-27.4075,153.0543
4,2013-14,BRISBANE,NORTH BRISBANE,Accreditation,T,4705,DRIVE ESCORT VEHICLE NOT IN ACCORDANCE WITH GU...,4,-27.4075,153.0543
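
Row-wise `apply` calls the Python function once per row, which can be slow on a frame of this size. The same lookup can be sketched with `Series.map`, shown here on a small stand-in frame. The names `coords`, `sample`, `lat_lookup` and `lon_lookup` are illustrative, and only two entries of `qld_regions` are copied in.

```python
import pandas as pd

# Two entries copied from qld_regions; the full dict works the same way
coords = {
    "NORTH BRISBANE": {"lat": -27.4075, "lon": 153.0543},
    "LOGAN": {"lat": -27.6392, "lon": 153.1086},
}

sample = pd.DataFrame(
    {"police_district": ["LOGAN", "UNKNOWN", None, "NORTH BRISBANE"]}
)

# Build one flat lookup per coordinate, then map the whole column at once
lat_lookup = {name: c["lat"] for name, c in coords.items()}
lon_lookup = {name: c["lon"] for name, c in coords.items()}
sample["latitude"] = sample["police_district"].map(lat_lookup)
sample["longitude"] = sample["police_district"].map(lon_lookup)

# Districts without a mapping end up as NaN and can be listed for review
unmapped = sample.loc[sample["latitude"].isna(), "police_district"].unique()
print(sample)
print("Unmapped districts:", unmapped)
```

Listing the unmapped values makes it easier to tell genuinely unknown districts from names that are merely missing from the dictionary.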


## 4. Aggregate Data

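Now, let's aggregate the data by location. Note that the code below counts the number of records per district rather than summing the `total` column, so `count` is the number of rows for that district, not the number of infringements issued.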


In [19]:
# Count number of records per location (not sum of 'total')
location_counts = df.groupby(['police_district', 'latitude', 'longitude']).size().reset_index(name='count')
location_counts = location_counts.sort_values('count', ascending=False)
location_counts.head(10)

Unnamed: 0,police_district,latitude,longitude,count
9,NORTH BRISBANE,-27.4075,153.0543,7541
10,SOUTH BRISBANE,-27.4809,153.0167,5553
3,GOLD COAST,-28.0167,153.4,5520
0,CAPRICORNIA,-23.3791,150.51,4964
12,SUNSHINE COAST,-26.65,153.0667,4892
2,FAR NORTH,-16.9186,145.7781,4768
5,LOGAN,-27.6392,153.1086,4742
1,DARLING DOWNS,-27.5598,151.9507,4694
13,TOWNSVILLE,-19.259,146.8169,4649
14,WIDE BAY BURNETT,-25.2882,152.3423,4473
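
If the heatmap should reflect infringements issued rather than row counts, one option is to sum `total` per district alongside the row count. The sketch below uses made-up rows and assumes `total` holds numeric strings. The `location_totals` name and the `infringements` column are hypothetical.

```python
import pandas as pd

# Hypothetical rows; coordinates copied from qld_regions
sample = pd.DataFrame({
    "police_district": ["LOGAN", "LOGAN", "MACKAY"],
    "latitude": [-27.6392, -27.6392, -21.1412],
    "longitude": [153.1086, 153.1086, 149.1868],
    "total": ["4", "10", "3"],
})
# 'total' is read as object dtype, so convert before summing
sample["total"] = pd.to_numeric(sample["total"], errors="coerce")

location_totals = (
    sample.groupby(["police_district", "latitude", "longitude"])
    .agg(count=("total", "size"), infringements=("total", "sum"))
    .reset_index()
    .sort_values("infringements", ascending=False)
)
print(location_totals)
```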


## 5. Transform to GeoJSON

Now let's convert our aggregated data to GeoJSON format for the heatmap.


In [20]:
# Function to normalize values to a 0-100 scale for intensity
def normalize_values(series):
    min_val = series.min()
    max_val = series.max()
    return 100 * (series - min_val) / (max_val - min_val)

# Normalize the counts to get intensity values between 0-100
location_counts['intensity'] = normalize_values(location_counts['count'])

# Remove rows with missing coordinates
geo_data = location_counts.dropna(subset=['latitude', 'longitude'])

# Create GeoJSON feature collection
features = []
for _, row in geo_data.iterrows():
    feature = {
        "type": "Feature",
        "properties": {
            "intensity": float(row['intensity']),
            "location": row['police_district'],
            "count": int(row['count'])
        },
        "geometry": {
            "type": "Point",
            "coordinates": [float(row['longitude']), float(row['latitude'])]
        }
    }
    features.append(feature)

# Create the GeoJSON structure
geojson_data = {
    "type": "FeatureCollection",
    "features": features
}

# Preview the first feature
geojson_data["features"][0]

{'type': 'Feature',
 'properties': {'intensity': 100.0,
  'location': 'NORTH BRISBANE',
  'count': 7541},
 'geometry': {'type': 'Point', 'coordinates': [153.0543, -27.4075]}}
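
Min-max scaling maps the smallest count to 0 and the largest to 100, so the district with the fewest records always receives an intensity of 0. The formula also evaluates 0/0 when every count is equal. A guarded variant might look like the sketch below. The name `normalize_values_safe` and the choice to return 100 for a constant series are assumptions made for illustration.

```python
import pandas as pd

def normalize_values_safe(series: pd.Series) -> pd.Series:
    """Min-max scale to 0-100; a constant series maps to 100 everywhere."""
    min_val = series.min()
    max_val = series.max()
    if max_val == min_val:
        # Avoid 0/0, which would produce NaN intensities
        return pd.Series(100.0, index=series.index)
    return 100 * (series - min_val) / (max_val - min_val)

counts = pd.Series([7541, 5553, 4473])  # three counts from the aggregation output
scaled = normalize_values_safe(counts)
constant = normalize_values_safe(pd.Series([5, 5, 5]))
print(scaled.round(2).tolist())
print(constant.tolist())
```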

In [21]:
# Save the GeoJSON data to a file
output_file = OUTPUT_DIR / 'infringements.json'
with open(output_file, 'w') as f:
    json.dump(geojson_data, f, indent=2)

print(f"GeoJSON data saved to {output_file}")

GeoJSON data saved to ../output/infringements.json
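
GeoJSON positions are ordered `[longitude, latitude]`, the reverse of the usual "lat, lon" convention, so it can be worth reading the file back and checking it before handing it to the web app. The following is a minimal sketch that writes a one-feature collection to a temporary file. The `check_point_features` helper is hypothetical and only performs basic range checks.

```python
import json
import tempfile
from pathlib import Path

def check_point_features(geojson: dict) -> list:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if geojson.get("type") != "FeatureCollection":
        problems.append("top-level type is not FeatureCollection")
    for i, feature in enumerate(geojson.get("features", [])):
        lon, lat = feature["geometry"]["coordinates"][:2]
        # Longitude must lie in [-180, 180] and latitude in [-90, 90]
        if not (-180 <= lon <= 180 and -90 <= lat <= 90):
            problems.append(f"feature {i}: coordinates out of range or swapped")
    return problems

# One feature in the same shape as the notebook's output
example = {
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "properties": {"intensity": 100.0, "location": "NORTH BRISBANE", "count": 7541},
        "geometry": {"type": "Point", "coordinates": [153.0543, -27.4075]},
    }],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "infringements.json"
    path.write_text(json.dumps(example, indent=2))
    reloaded = json.loads(path.read_text())

print(check_point_features(reloaded))
```

For Queensland the range check also catches swapped coordinates, because a longitude near 153 is not a valid latitude.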


In [22]:
# Also save a CSV version for compatibility
csv_output = OUTPUT_DIR / 'infringements.csv'
geo_data[['latitude', 'longitude', 'intensity']].to_csv(csv_output, index=False)
print(f"CSV data saved to {csv_output}")

CSV data saved to ../output/infringements.csv


## 6. Copy to Web App

Finally, let's copy the processed files to the client/data directory, renaming them to `infringements_qld.json` and `infringements_qld.csv`, so they can be used by the web application.


In [None]:
import shutil

# Define paths
client_data_dir = ROOT_DIR / '..' / 'client' / 'data'

# Ensure the client data directory exists
os.makedirs(client_data_dir, exist_ok=True)

# Copy the GeoJSON file
shutil.copy(output_file, client_data_dir / 'infringements_qld.json')

# Copy the CSV file
shutil.copy(csv_output, client_data_dir / 'infringements_qld.csv')

print(f"Files copied to web app directory: {client_data_dir}")

Files copied to web app directory: ../../client/data
