This notebook creates featureCollections of Sentinel image "snippets" that can be used to train an ML model to recognize industrial pig farms. The featureCollections are written to two geojson files at /content/drive/MyDrive/GEE_data. There is one file for images of areas around pig farms, and another for images of areas around random large buildings. Currently the Sentinel images contain only bands 2, 3, and 4 (RGB), and they encompass approximately a 240 x ?? m region centered around a (farm) building.

In its current incarnation, the notebook gathers images for pig farms in Duplin County, North Carolina, that have permitted liquid manure storage (hereafter lagoons, although they are not all technically lagoons in the strict sense of the word). The locations of those farms are derived from Montefiore et al. (2022), which gives a list of permitted pig manure lagoon locations. The random large buildings are obtained from the combined MS-Google global building footprint database. This set of random buildings does not/should not contain pig CAFOs with lagoons, but it may contain buildings on other types of CAFO.

In [1]:
import os
from google.colab import drive
import ee
import geemap.foliumap as geemap
import geopandas as gpd

In [2]:
ee.Authenticate()
ee.Initialize(project="215656163750")
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
min_building_size = 400
lagoon_distance_threshold = 200
sentinel_bands = ['B4', 'B3', 'B2'] # mainly to reduce data volume/get max resolution
                                    # keep in this order for visualization purposes
training_image_radius = 240 #to obtain (approx) 48 x ?? pixel images
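
As a rough check on the snippet size implied by these parameters, the arithmetic can be sketched in plain Python. This is a minimal sketch, not part of the notebook's workflow: the helper `snippet_size_pixels` is a hypothetical name, and it assumes the bounding box is square in a metric projection. Because `bounds()` is evaluated in geographic coordinates later in the notebook, the real snippets may not be exactly square, which is presumably why the second dimension is left as "??".

```python
import math

def snippet_size_pixels(radius_m: float, scale_m: float) -> int:
    """Pixels along one side of a square with half-width radius_m,
    sampled at scale_m metres per pixel (hypothetical helper)."""
    return math.ceil(2 * radius_m / scale_m)

training_image_radius = 240  # metres, as in the cell above
sentinel_scale = 10          # metres per pixel for bands B2, B3, B4

side = snippet_size_pixels(training_image_radius, sentinel_scale)
print(f"Approximate snippet width: {side} pixels")
```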

In [4]:
# US counties shapefile
# This notebook works with a single county to get the workflow right
# before trying to handle larger datasets
# To reproduce for a whole state, use this file:
# "/content/drive/MyDrive/Colab Notebooks/cb_2021_us_state_5m.shp"

counties = gpd.read_file(
    "/content/drive/MyDrive/Colab Notebooks/cb_2021_us_county_5m.shp"
) #should probably read directly as featureCollection

duplin = counties[counties['NAME'].str.match("Duplin")]
duplin = geemap.geopandas_to_ee(duplin[['geometry']])

In [5]:
# The MS-Google combined building footprint dataset,
# filtered to large buildings in Duplin County

country = 'USA'

buildings = (
    #get all buildings in USA
    ee.FeatureCollection(f"projects/sat-io/open-datasets/VIDA_COMBINED/{country}")
    #keep only buildings in NC --> Duplin County
    .filterBounds(duplin)
    #keep only buildings above min_building_size
    .filter(ee.Filter.gt('area_in_meters', min_building_size))
    )

num = buildings.size().getInfo()
print(f"There are {num} buildings >{min_building_size} sq m in Duplin County, NC")

There are 6672 buildings >400 sq m in Duplin County, NC


In [6]:
# Sentinel data for Duplin county
# Data for summer 2023 - is restricting to summer a good choice?
# May need to use better cloud masking:
# https://developers.google.com/earth-engine/tutorials/community/sentinel-2-s2cloudless

sentinel = (
    ee.ImageCollection('COPERNICUS/S2_SR_HARMONIZED')
    .filterDate('2023-06-01', '2023-09-30')
    .filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE', 10))
    .select(sentinel_bands)
    .median() #crude cloud filter
    .clip(duplin)
)
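
To illustrate why `.median()` acts as a crude cloud filter, here is a minimal pure-Python sketch of a per-pixel median over a stack of images. The reflectance values are made up for illustration (9000 stands in for a bright cloud, 150 for a cloud shadow), and `median_composite` is a hypothetical helper, not an Earth Engine call.

```python
from statistics import median

def median_composite(stack):
    """Per-pixel median over a list of equally sized 1-D 'images'."""
    return [median(values) for values in zip(*stack)]

# Five hypothetical acquisitions of the same three pixels (one band).
stack = [
    [820, 400, 300],
    [790, 410, 9000],   # cloud over pixel 2
    [9000, 395, 310],   # cloud over pixel 0
    [805, 9000, 305],   # cloud over pixel 1
    [150, 405, 295],    # cloud shadow over pixel 0
]

composite = median_composite(stack)
print(composite)
```

As long as a pixel is clear in most acquisitions, the outliers from clouds and shadows do not reach the composite. The approach breaks down where a pixel is cloudy in half or more of the images, which is one reason the s2cloudless approach linked above may be worth considering.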

In [7]:
# This is the Montefiore dataset of North Carolina permitted pig manure lagoon locations.

mf = gpd.read_file(
    '/content/drive/MyDrive/Colab Notebooks/Montefiore.shp'
)

mf.rename(columns={"field_1": "longitude", "field_2": "latitude", "field_3": "year"}, inplace=True)
print(f"There are {mf.shape[0]} lagoons in the Montefiore NC dataset")

#restrict to Duplin county
lagoons = geemap.geopandas_to_ee(mf[['geometry']]).filterBounds(duplin)
num = lagoons.size().getInfo()
print(f"There are {num} lagoons in Duplin County, NC")

There are 3405 lagoons in the Montefiore NC dataset
There are 819 lagoons in Duplin County, NC


In [8]:
# Determine which buildings are close to a lagoon

# Define a spatial filter
distFilter = ee.Filter.withinDistance(
    distance = lagoon_distance_threshold,
    leftField = '.geo',
    rightField =  '.geo',
    maxError = 10
)

# Define a saveBest join.
distSaveBest = ee.Join.saveBest(
    matchKey = 'points',
    measureKey = 'distance',
    outer = True
)

# Apply the join to create fc of buildings and distances to lagoons
spatialJoined = distSaveBest.apply(buildings, lagoons, distFilter)
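
The logic of this join can be sketched outside Earth Engine. The following is a simplified illustration under stated assumptions: buildings and lagoons are reduced to (longitude, latitude) points, distances are great-circle distances, and `haversine_m` and `save_best_join` are hypothetical helpers. Earth Engine measures distance between the actual geometries, so results near the threshold can differ.

```python
import math

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two lon/lat points."""
    r = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def save_best_join(buildings, lagoons, max_dist_m):
    """For each building keep the nearest lagoon within max_dist_m.
    Buildings with no match are kept with 'points' set to None,
    mimicking an outer join."""
    joined = []
    for b in buildings:
        best = None
        for lagoon in lagoons:
            d = haversine_m(*b, *lagoon)
            if d <= max_dist_m and (best is None or d < best[1]):
                best = (lagoon, d)
        joined.append({
            "building": b,
            "points": best[0] if best else None,
            "distance": best[1] if best else None,
        })
    return joined

# Made-up coordinates: one lagoon, one nearby building, one distant building.
lagoons = [(-77.93, 34.930)]
buildings = [(-77.93, 34.931), (-77.93, 34.940)]
joined = save_best_join(buildings, lagoons, max_dist_m=200)
for row in joined:
    print(row)
```

Keeping the unmatched buildings is what allows the same joined collection to supply both classes later: buildings with a match become farm candidates, and buildings without one become the pool of random large buildings.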

In [9]:
# Create the pig farm training/test set (label=1)

# Select buildings near lagoons
cafo_bldgs = spatialJoined.filter(ee.Filter.notNull(['points']))

# As there may be several buildings on a farm, group them all together
# and define a square region around the centroid of the group

# First, add a buffer around each building then dissolve all overlapping
# (buffered) buildings into a single polygon (actually, an element of a
# multipolygon that contains all such polygons in the farm dataset)
def buffer_features(feature):
  buffer_radius = 100  # meters
  return feature.centroid().buffer(buffer_radius, 2)

buffers = cafo_bldgs.map(buffer_features)
multipoly = buffers.geometry().dissolve()

# Convert that multipolygon into a featureCollection of polygons
fc = ee.FeatureCollection([
  ee.Feature(multipoly)
])

fc = ee.List(fc.toList(fc.size()).map(lambda feature:
  ee.List(ee.Feature(feature).geometry().geometries()).map(lambda geom:
    ee.Feature(ee.Geometry(geom))
  )
)).flatten()

fc = ee.FeatureCollection(fc)

# For each polygon (i.e., farm), define a square region around its centroid
def buffer_and_bound(feature, buffer_radius=training_image_radius):
  return feature.centroid().buffer(buffer_radius, 2).bounds()

pig_farms = fc.map(buffer_and_bound)

# Obtain Sentinel data for the farm polygons, save to file
pig_farm_pix = sentinel.sampleRegions(
    collection=pig_farms,
    scale=10,
    #projection="epsg:32119",
    geometries=True)

task = ee.batch.Export.table.toDrive(
    collection=pig_farm_pix,
    description='sentinelPigFarms',
    folder='GEE_data',
    fileFormat='GeoJSON',
)

#task.start() # Uncomment to actually write the file
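
The buffer-and-dissolve grouping in the cell above can be pictured as finding connected groups of buildings whose buffers overlap. Below is a minimal sketch of that idea in plain Python, assuming planar coordinates in metres and circular buffers; `group_by_overlapping_buffers` is a hypothetical helper. It approximates the grouping only, and does not reproduce the dissolved polygons or their centroids.

```python
import math

def group_by_overlapping_buffers(points, buffer_radius):
    """Group planar points whose circular buffers overlap, directly or
    through a chain of overlapping neighbours (union-find)."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            # Two buffers of equal radius overlap when their centres
            # are closer than twice the radius.
            if math.dist(points[i], points[j]) < 2 * buffer_radius:
                parent[find(i)] = find(j)

    groups = {}
    for i, p in enumerate(points):
        groups.setdefault(find(i), []).append(p)
    return list(groups.values())

# Made-up building centroids in metres: three in a row, one far away.
centroids = [(0, 0), (150, 0), (300, 0), (1000, 1000)]
farms = group_by_overlapping_buffers(centroids, buffer_radius=100)
print(f"{len(farms)} farms: {farms}")
```

Note that the first and third buildings are 300 m apart, so their buffers do not touch, yet they end up in the same farm because both overlap the middle one. The dissolve step in the notebook behaves the same way, so long rows of barns become a single polygon.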

In [10]:
# Create the not-farm training/test set (label = 0)

# Take a random selection of buildings not near lagoons
# Make it the same size as the final farm dataset
num_bldgs = pig_farms.size().getInfo()
fc = spatialJoined.filter(ee.Filter.eq('points', None))
fc = fc.randomColumn().sort('random').limit(num_bldgs)

# Create a buffer around each building. There is no need to group
# buildings like we did for the farms, because in general there will not
# be multiple adjacent buildings (and the few cases that will exist
# shouldn't matter). However, we do remove all polygon properties
# for consistency with the pig_farm dataset and to reduce data volume
random_buildings = fc.map(buffer_and_bound).select([])

# Obtain Sentinel data for the building polygons, save to file
random_pix = sentinel.sampleRegions(
    collection=random_buildings,
    scale=10,
    #projection="epsg:32119",
    geometries=True)

task = ee.batch.Export.table.toDrive(
    collection=random_pix,
    description='sentinelRandomBuildings',
    folder='GEE_data',
    fileFormat='GeoJSON',
)

#task.start() # Uncomment to actually write file
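
The `randomColumn().sort('random').limit(n)` pattern used above to balance the two classes has a direct plain-Python analogue, sketched here for illustration. The helper `random_subset` and the building names are hypothetical; the seed is made explicit so that the same subset is drawn on every run.

```python
import random

def random_subset(items, n, seed=0):
    """Attach a random key to each item, sort by it, and keep the first n."""
    rng = random.Random(seed)
    keyed = [(rng.random(), item) for item in items]
    keyed.sort(key=lambda pair: pair[0])
    return [item for _, item in keyed[:n]]

# Hypothetical pool of non-farm buildings and a target class size.
pool = [f"building_{i}" for i in range(20)]
sample = random_subset(pool, 5, seed=42)
print(sample)
```

Fixing the seed matters for reproducibility: if the not-farm set changes between runs, model results from different exports are harder to compare.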

In [11]:
# This map can't show the Sentinel snippets, but it does show the locations of the
# pig farms and random large buildings, and the areas around them that will be used
# as training/test data. It can take a while for all the layers to show up.

Map = geemap.Map()
Map.centerObject(duplin.first().geometry(), 13);

os.environ["HYBRID"] = 'https://mt1.google.com/vt/lyrs=y&x={x}&y={y}&z={z}'

sentinel_viz = {
    'min': 0,
    'max': 3000,
    'bands': sentinel_bands
}

generic_viz = {
    'color': 'red',
    'width': 2,
    'lineType': 'solid',
    'fillColor': '00000000' #transparent
}

cafo_viz = {
    'color': 'yellow',
    'width': 1,
    'fillColor': '00000000'
}

random_viz = {
    'color': 'cyan',
    'width': 2,
    'fillColor': '00000000'
}

Map.add_basemap("HYBRID")
Map.addLayer(sentinel, sentinel_viz, 'Sentinel')
Map.addLayer(buildings.style(**generic_viz), {}, "All buildings")
Map.addLayer(pig_farms.style(**cafo_viz), {}, "Pig farms")
Map.addLayer(random_buildings.style(**random_viz), {}, "Random large buildings")
Map.addLayer(duplin.style(**generic_viz), {}, 'Duplin County')

Map

In [12]:
# The file-writing tasks take a while to run (~10 mins);
# execute this cell to see their status

from datetime import datetime

tasks = ee.batch.Task.list()

# Print the first five tasks in the list along with their status
for task in tasks[:5]:
    status = task.status()
    if status['state'] in ['READY', 'RUNNING', 'COMPLETED']:
        # A task that is still READY may not have a start time yet
        ms = status.get('start_timestamp_ms')
        if ms:
            print(f"Task {status['id']} started at {datetime.fromtimestamp(ms/1000.0)}")
        else:
            print(f"Task {status['id']} has not started yet")
        print(f"Current status: {status['state']}")
    elif status['state'] == 'FAILED':
        print(f"Task {status['id']} FAILED")
        print("   Error Message:", status['error_message'])
    else:
        print(status)


Task 2TKZKF3BWNUPAQUZQ7ES4ULJ started at 2024-05-01 20:33:05.446000
Current status: COMPLETED
Task SNIKWNI2VTC37FBX4UQTAUAT started at 2024-05-01 20:31:43.707000
Current status: COMPLETED
Task TCOUMZRQKN5EOL7ND64DSX6C FAILED
   Error Message: Image.reduceRegions: Computed value is too large.
Task 6TPIA3HCK6TIPL3KRRK3DQ7L FAILED
   Error Message: Image.reduceRegions: Computed value is too large.
Task V3BGBYDCNFU5RJHEEDDIAYPL FAILED
   Error Message: Image.reduceRegions: Computed value is too large.


In [15]:
"""
%cd /content/drive/MyDrive/Colab\ Notebooks/

#!git init
#!git config --global init.defaultBranch main
!git config --global user.email "rachel.e.mason1@gmail.com"
!git config --global user.name "Rachel Mason"

!git add createTrainingData.ipynb
!git commit -m "Moved code that processes geojson files into traingCNN.ipynb"

#!git remote add origin <INCLUDES TOKEN, SEE NOTES FOR WHAT TO INSERT>
#!git config pull.rebase false
#!git pull origin main --allow-unrelated-histories

#!git branch -m main
!git push -u origin main
"""

/content/drive/MyDrive/Colab Notebooks
[main c91b72a] Moved code that processes geojson files into traingCNN.ipynb
 1 file changed, 1 insertion(+), 1 deletion(-)
 rewrite createTrainingData.ipynb (99%)
Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 2 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 3.39 KiB | 315.00 KiB/s, done.
Total 3 (delta 2), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
To https://github.com/rachelemason/CAFO-AI.git
   f4be364..c91b72a  main -> main
Branch 'main' set up to track remote branch 'main' from 'origin'.
