8/5/2025 Now we will convert our IRP Post Sites raw data file into one we can use, with lat/long, ZIP codes, and neighborhood names, reusing the code we wrote for the IRP Campsite data.

by Stephen Peters

In [1]:
!pip install pandas
print("pandas installed!")

pandas installed!


In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pathlib import Path
print("libraries imported!")

libraries imported!


In [3]:
# let's take a look at our current dataset
base_dir = Path("C:/Users/Steph/local/OIT-class/datasets/original/project_data")
df = pd.read_csv(base_dir / "IRP_Post_sites.csv")
#df = sns.load_dataset("datasets/original/IRP_Campsite_Reports")
df.head()

Unnamed: 0,X,Y,OBJECTID,inc_id,created_date,post_structure_qty,post_people_count,post_people_under_25_count,post_children_count,post_reason,waste_material_qty,dogs_present,post_location,email_time_utc,bureaus_detail,graffiti
0,-13642030.0,5696806.0,1,23-190779,2023/10/26 21:17:37+00,1,14,3,0,5,200,1.0,111-115th and SE FOSTER RD,2023/10/26 21:36:44+00,,1.0
1,-13640490.0,5704175.0,2,23-198050,2023/10/11 17:26:00+00,2,3,0,0,5,20,0.0,0-12599 E BURNSIDE ST,2023/10/11 18:36:17+00,,1.0
2,-13656230.0,5703145.0,3,23-185173,2023/09/01 17:36:25+00,0,0,0,0,5,0,0.0,200-299 SW SALMON ST,2023/09/01 17:36:47+00,,0.0
3,-13645210.0,5699758.0,4,22-96683,2023/02/02 19:03:30+00,13,17,0,0,5,80,1.0,SE 83rd - 84th and Bush,2023/02/02 19:36:48+00,,0.0
4,-13655980.0,5704290.0,5,24-14593,2024/03/15 19:40:53+00,3,10,0,0,5,20,0.0,1-3 NW 3RD AVE,2024/03/15 20:36:42+00,,1.0


In [4]:
# and let's check datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8140 entries, 0 to 8139
Data columns (total 16 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   X                           8140 non-null   float64
 1   Y                           8140 non-null   float64
 2   OBJECTID                    8140 non-null   int64  
 3   inc_id                      8138 non-null   object 
 4   created_date                8140 non-null   object 
 5   post_structure_qty          8140 non-null   int64  
 6   post_people_count           8140 non-null   int64  
 7   post_people_under_25_count  8140 non-null   int64  
 8   post_children_count         8140 non-null   int64  
 9   post_reason                 8140 non-null   int64  
 10  waste_material_qty          8140 non-null   int64  
 11  dogs_present                8139 non-null   float64
 12  post_location               8139 non-null   object 
 13  email_time_utc              8136 

In [5]:
# check for missing data
df.isnull().sum()

X                                0
Y                                0
OBJECTID                         0
inc_id                           2
created_date                     0
post_structure_qty               0
post_people_count                0
post_people_under_25_count       0
post_children_count              0
post_reason                      0
waste_material_qty               0
dogs_present                     1
post_location                    1
email_time_utc                   4
bureaus_detail                8130
graffiti                         1
dtype: int64

This shows we do have some nulls, but almost all of them are in "bureaus_detail" (whatever that is), which is not a column we care about. The other columns are missing at most four values each.

In [6]:
# let's take a peek at our summary statistics
df.describe()

Unnamed: 0,X,Y,OBJECTID,post_structure_qty,post_people_count,post_people_under_25_count,post_children_count,post_reason,waste_material_qty,dogs_present,graffiti
count,8140.0,8140.0,8140.0,8140.0,8140.0,8140.0,8140.0,8140.0,8140.0,8139.0,8139.0
mean,-13652110.0,5703928.0,4070.5,2.683784,3.880835,0.073342,0.007002,4.994963,21.267445,0.113282,0.255437
std,6232.554,4408.155,2349.959929,3.054529,4.312074,0.460134,0.153831,0.122829,29.765469,0.316956,0.436133
min,-13666500.0,5691491.0,1.0,-1.0,0.0,-2.0,-2.0,1.0,0.0,0.0,0.0
25%,-13656180.0,5701493.0,2035.75,1.0,1.0,0.0,0.0,5.0,7.0,0.0,0.0
50%,-13654460.0,5703755.0,4070.5,2.0,3.0,0.0,0.0,5.0,20.0,0.0,0.0
75%,-13646340.0,5705211.0,6105.25,4.0,5.0,0.0,0.0,5.0,20.0,0.0,1.0
max,-13634020.0,5722451.0,8140.0,45.0,55.0,16.0,6.0,5.0,730.0,1.0,1.0


Here we see some things that aren't quite right. In what world are there -2 people in a camp? Or -1 structures? I am going to remove the records that have these negative values, since they represent some kind of mistake. It's not a big one, but I can't think of a reason to leave them in.
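
Before dropping anything, it can help to count how many rows are affected. Here is a minimal sketch on a made-up four-row frame. The values are invented for illustration; only the column names come from the real file.

```python
import pandas as pd

# Toy data, invented for illustration; only the column names match the real file.
toy = pd.DataFrame({
    "post_structure_qty": [1, -1, 0, 4],
    "post_people_under_25_count": [0, 0, -2, 1],
    "post_children_count": [0, 0, -2, 0],
})

count_cols = list(toy.columns)

# Number of negative values in each column
negatives_per_column = (toy[count_cols] < 0).sum()

# Rows where at least one of the count columns is negative
bad_rows = (toy[count_cols] < 0).any(axis=1)

print(negatives_per_column)
print("rows with any negative count:", int(bad_rows.sum()))
```

Running the same two expressions on df would show how many records the filter in the next cell is about to remove.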


In [9]:
df_clean = df[
    (df["post_people_under_25_count"] >= 0) &
    (df["post_children_count"] >= 0) &
    (df["post_structure_qty"] >= 0)
]
print("This filters the DataFrame to only include rows where all three values are greater than or equal to zero.")
print("Let's see what we get now:")
df_clean.describe()

This filters the DataFrame to only include rows where all three values are greater than or equal to zero.
Let's see what we get now:


Unnamed: 0,X,Y,OBJECTID,post_structure_qty,post_people_count,post_people_under_25_count,post_children_count,post_reason,waste_material_qty,dogs_present,graffiti
count,8134.0,8134.0,8134.0,8134.0,8134.0,8134.0,8134.0,8134.0,8134.0,8133.0,8133.0
mean,-13652110.0,5703925.0,4070.757192,2.685026,3.8821,0.073764,0.007499,4.994959,21.215515,0.113365,0.255502
std,6230.801,4406.401,2349.984572,3.055112,4.313258,0.459037,0.151448,0.122875,29.297604,0.317058,0.43617
min,-13666500.0,5691491.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
25%,-13656180.0,5701491.0,2036.25,1.0,1.0,0.0,0.0,5.0,7.0,0.0,0.0
50%,-13654460.0,5703754.0,4069.5,2.0,3.0,0.0,0.0,5.0,20.0,0.0,0.0
75%,-13646340.0,5705208.0,6106.75,4.0,5.0,0.0,0.0,5.0,20.0,0.0,1.0
max,-13634020.0,5722451.0,8140.0,45.0,55.0,16.0,6.0,5.0,730.0,1.0,1.0


OK, our data is now "clean" - at least until we discover other problems.
Let's output our current data to a .csv file before we start adding our other fields.

In [10]:
# Define the base directory
base_dir_save = Path("C:/Users/Steph/local/OIT-class/datasets/processed")

# Ensure the directory exists (create it if it doesn't)
base_dir_save.mkdir(parents=True, exist_ok=True)

# Define the full output file path
output_file = base_dir_save / "IRP_Post_Sites_clean.csv"

# Save the DataFrame to CSV
df_clean.to_csv(output_file, index=False)

print(f"File saved to: {output_file}")

File saved to: C:\Users\Steph\local\OIT-class\datasets\processed\IRP_Post_Sites_clean.csv


And now we want to create a new dataframe that takes df_clean and adds the lat/long coordinates as new columns.

In [12]:
#import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
print("libraries imported!")

# Step 1: Load your cleaned DataFrame
# we already have this in memory
# df_clean = pd.read_csv("C:/Users/Steph/local/OIT-class/datasets/processed/IRP_Campsite_Reports_clean.csv")

# Step 2: Copy to a new DataFrame
df_latlong = df_clean.copy()

# Step 3: Create a GeoDataFrame using X/Y assuming EPSG:3857 (Web Mercator)
# (personally I don't know anything about this geometry bit and I suppose I don't need to at this time.)
geometry = [Point(xy) for xy in zip(df_latlong["X"], df_latlong["Y"])]
gdf = gpd.GeoDataFrame(df_latlong, geometry=geometry, crs="EPSG:3857")

# Step 4: Convert coordinates to WGS84 (latitude/longitude)
gdf = gdf.to_crs("EPSG:4326")

# Step 5: Extract lat/lon and assign to df_latlong
gdf["latitude"] = gdf.geometry.y
gdf["longitude"] = gdf.geometry.x

# Step 6: Drop geometry column if not needed
df_latlong = gdf.drop(columns="geometry")

df_latlong.head()
print("And now we have our lat/long!")

libraries imported!
And now we have our lat/long!


In [33]:
df_latlong.head()

Unnamed: 0,X,Y,OBJECTID,inc_id,created_date,post_structure_qty,post_people_count,post_people_under_25_count,post_children_count,post_reason,waste_material_qty,dogs_present,post_location,email_time_utc,bureaus_detail,graffiti,latitude,longitude
0,-13642030.0,5696806.0,1,23-190779,2023/10/26 21:17:37+00,1,14,3,0,5,200,1.0,111-115th and SE FOSTER RD,2023/10/26 21:36:44+00,,1.0,45.476217,-122.548456
1,-13640490.0,5704175.0,2,23-198050,2023/10/11 17:26:00+00,2,3,0,0,5,20,0.0,0-12599 E BURNSIDE ST,2023/10/11 18:36:17+00,,1.0,45.522615,-122.534633
2,-13656230.0,5703145.0,3,23-185173,2023/09/01 17:36:25+00,0,0,0,0,5,0,0.0,200-299 SW SALMON ST,2023/09/01 17:36:47+00,,0.0,45.516132,-122.675974
3,-13645210.0,5699758.0,4,22-96683,2023/02/02 19:03:30+00,13,17,0,0,5,80,1.0,SE 83rd - 84th and Bush,2023/02/02 19:36:48+00,,0.0,45.494809,-122.576996
4,-13655980.0,5704290.0,5,24-14593,2024/03/15 19:40:53+00,3,10,0,0,5,20,0.0,1-3 NW 3RD AVE,2024/03/15 20:36:42+00,,1.0,45.52334,-122.67372
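
As a sanity check on the conversion, the Web Mercator to lat/long math can also be done by hand. This is only a sketch of the standard spherical Mercator inverse; the function name is made up for this example, and geopandas remains the tool actually used in this notebook. It uses the X and Y of the first row as displayed above, which are already rounded, so expect agreement to a few decimal places rather than an exact match.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by Web Mercator (EPSG:3857)

def web_mercator_to_latlon(x, y):
    """Convert EPSG:3857 metres to (latitude, longitude) in degrees."""
    lon = math.degrees(x / EARTH_RADIUS_M)
    lat = math.degrees(2.0 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2.0)
    return lat, lon

# First row of the table above (X and Y as displayed, so already rounded)
lat, lon = web_mercator_to_latlon(-13642030.0, 5696806.0)
print(round(lat, 3), round(lon, 3))
```

If the numbers land in Portland (latitude around 45.5, longitude around -122.6), the EPSG:3857 assumption for X/Y is probably right.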


In [34]:
# time to save this as a CSV, even though we're going to be adding more stuff.  I like to be incremental.
# boy, it is easy to accidentally forget to save the right dataframe.  I keep doing that and wondering why the output is wrong.

# Define the full output file path
output_file = base_dir_save / "IRP_Post_Sites_clean-latlong.csv"
df_latlong.to_csv(output_file, index=False)

print(f"File saved to: {output_file}")

File saved to: C:\Users\Steph\local\OIT-class\datasets\processed\IRP_Post_Sites_clean-latlong.csv
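
Since saving the wrong dataframe is an easy mistake, one option is a small guard that refuses to write unless the expected columns are present. This is a sketch only; save_checked is a hypothetical helper, not something used elsewhere in this notebook, and it writes to a temporary folder so nothing real is touched.

```python
import tempfile
from pathlib import Path

import pandas as pd

def save_checked(df, path, required_columns):
    """Write df to CSV only if it has the expected columns, then read it back to confirm."""
    missing = [c for c in required_columns if c not in df.columns]
    if missing:
        raise ValueError(f"Refusing to save: missing columns {missing}")
    df.to_csv(path, index=False)
    written = pd.read_csv(path)
    if len(written) != len(df):
        raise RuntimeError("Row count changed between memory and disk")
    return written

# Toy frame, invented for illustration
toy = pd.DataFrame({"X": [1.0], "Y": [2.0], "latitude": [45.5], "longitude": [-122.6]})
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "toy_latlong.csv"
    check = save_checked(toy, out, ["latitude", "longitude"])
    print(list(check.columns))
```

Calling it with df_clean and ["latitude", "longitude"] would raise an error instead of quietly saving a file without the new columns.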


In [35]:
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
print("libraries imported!")

# Step 1: Load your df_latlong
# already did this
#df_latlong = pd.read_csv("C:/Users/Steph/local/OIT-class/datasets/processed/IRP_Campsite_Reports_latlong.csv")

# Step 2: Convert df_latlong to a GeoDataFrame
geometry = [Point(xy) for xy in zip(df_latlong["longitude"], df_latlong["latitude"])]
gdf = gpd.GeoDataFrame(df_latlong, geometry=geometry, crs="EPSG:4326")

# Step 3: Load shapefiles for ZIP codes and neighborhoods
zip_shapefile = "C:/Users/Steph/local/OIT-class/datasets/original/zip-code-extracted/portland-oregon-zip-code-boundaries.shp"
hood_shapefile = "C:/Users/Steph/local/OIT-class/datasets/original/Neighborhoods_regions-extracted/Neighborhoods_regions.shp"

gdf_zip = gpd.read_file(zip_shapefile).to_crs("EPSG:4326")
gdf_hood = gpd.read_file(hood_shapefile).to_crs("EPSG:4326")

print("...operation successful so far...")

libraries imported!
...operation successful so far...


In [36]:
# checking what columns are in our gdf_hood, as this was a problem I had to fix in the first script
print("gdf_hood columns:", list(gdf_hood.columns))


gdf_hood columns: ['OBJECTID', 'NAME', 'COMMPLAN', 'SHARED', 'COALIT', 'HORZ_VERT', 'MAPLABEL', 'ID', 'Shape_Leng', 'Shape_Area', 'nbh_distri', 'geometry']


In [37]:
# continue our script

#gdf_zip = gpd.read_file(zip_shapefile).to_crs("EPSG:4326")
#gdf_hood = gpd.read_file(hood_shapefile).to_crs("EPSG:4326")

# Optional: Print column names to check
print("ZIP columns:", gdf_zip.columns)
print("Neighborhood columns:", gdf_hood.columns)

# Step 4: Spatial join for ZIP codes (corrected column name)
gdf_zip_joined = gpd.sjoin(
    gdf,
    gdf_zip[["geometry", "Zip_Code"]],
    how="left",
    predicate="within"
)

# Fix: Remove index_right from previous join.  We had two index_right columns
if "index_right" in gdf_zip_joined.columns:
    gdf_zip_joined = gdf_zip_joined.drop(columns=["index_right"])

# Step 5: Spatial join for neighborhoods (the column is called 'NAME', all caps)
gdf_full = gpd.sjoin(
    gdf_zip_joined,
    gdf_hood[["geometry", "NAME"]],
    how="left",
    predicate="within"
)

# Step 6: Rename columns and clean up
# Note: the neighborhood column is "NAME" (all caps). A "Name" key in this mapping
# would not match it, so the column keeps the name "NAME" in the output.
gdf_full = gdf_full.rename(columns={
    "Zip_Code": "zip_code"
})

# Our neighborhood names are in all caps. Let's convert them to title case, so only the first letter of each word is capitalized
gdf_full["NAME"] = gdf_full["NAME"].str.title()

df_latlong_hood_zip = gdf_full.drop(columns=["geometry", "index_right"])
# this space intentionally left blank

print("Now, let's see what we've got in our new dataframe:")
df_latlong_hood_zip.head()

ZIP columns: Index(['SHAPE_Leng', 'Name', 'State', 'Type', 'Zip_Code', 'geometry'], dtype='object')
Neighborhood columns: Index(['OBJECTID', 'NAME', 'COMMPLAN', 'SHARED', 'COALIT', 'HORZ_VERT',
       'MAPLABEL', 'ID', 'Shape_Leng', 'Shape_Area', 'nbh_distri', 'geometry'],
      dtype='object')
Now, let's see what we've got in our new dataframe:


Unnamed: 0,X,Y,OBJECTID,inc_id,created_date,post_structure_qty,post_people_count,post_people_under_25_count,post_children_count,post_reason,waste_material_qty,dogs_present,post_location,email_time_utc,bureaus_detail,graffiti,latitude,longitude,zip_code,NAME
0,-13642030.0,5696806.0,1,23-190779,2023/10/26 21:17:37+00,1,14,3,0,5,200,1.0,111-115th and SE FOSTER RD,2023/10/26 21:36:44+00,,1.0,45.476217,-122.548456,97266.0,Powellhurst-Gilbert
1,-13640490.0,5704175.0,2,23-198050,2023/10/11 17:26:00+00,2,3,0,0,5,20,0.0,0-12599 E BURNSIDE ST,2023/10/11 18:36:17+00,,1.0,45.522615,-122.534633,97233.0,Hazelwood
2,-13656230.0,5703145.0,3,23-185173,2023/09/01 17:36:25+00,0,0,0,0,5,0,0.0,200-299 SW SALMON ST,2023/09/01 17:36:47+00,,0.0,45.516132,-122.675974,97204.0,Portland Downtown
3,-13645210.0,5699758.0,4,22-96683,2023/02/02 19:03:30+00,13,17,0,0,5,80,1.0,SE 83rd - 84th and Bush,2023/02/02 19:36:48+00,,0.0,45.494809,-122.576996,97266.0,Lents
4,-13655980.0,5704290.0,5,24-14593,2024/03/15 19:40:53+00,3,10,0,0,5,20,0.0,1-3 NW 3RD AVE,2024/03/15 20:36:42+00,,1.0,45.52334,-122.67372,97209.0,Old Town
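
The two joins above use predicate="within", which keeps a match when a point falls inside a polygon. To get a feel for what that test means, here is a plain-Python sketch using ray casting. It is an illustration only and not how geopandas does it internally; the rectangles and the lookup function are made up for this example.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) falls inside the polygon.

    polygon is a list of (x, y) vertices; the last vertex connects back to the first.
    Points exactly on an edge are not handled specially in this sketch.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray going right from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Hypothetical rectangles standing in for two ZIP code areas, as (lon, lat) vertices
toy_zips = {
    "A": [(-122.7, 45.5), (-122.6, 45.5), (-122.6, 45.6), (-122.7, 45.6)],
    "B": [(-122.6, 45.5), (-122.5, 45.5), (-122.5, 45.6), (-122.6, 45.6)],
}

def lookup(lon, lat):
    """Return the first area containing the point, or None (like a left join with no match)."""
    for name, poly in toy_zips.items():
        if point_in_polygon(lon, lat, poly):
            return name
    return None

print(lookup(-122.65, 45.55))
print(lookup(-122.55, 45.55))
print(lookup(-122.40, 45.55))
```

The last lookup falls outside both rectangles and returns None. With how="left", a point that matches no polygon is kept and simply gets a missing value in the joined column.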


In [38]:
df_latlong_hood_zip.head()


Unnamed: 0,X,Y,OBJECTID,inc_id,created_date,post_structure_qty,post_people_count,post_people_under_25_count,post_children_count,post_reason,waste_material_qty,dogs_present,post_location,email_time_utc,bureaus_detail,graffiti,latitude,longitude,zip_code,NAME
0,-13642030.0,5696806.0,1,23-190779,2023/10/26 21:17:37+00,1,14,3,0,5,200,1.0,111-115th and SE FOSTER RD,2023/10/26 21:36:44+00,,1.0,45.476217,-122.548456,97266.0,Powellhurst-Gilbert
1,-13640490.0,5704175.0,2,23-198050,2023/10/11 17:26:00+00,2,3,0,0,5,20,0.0,0-12599 E BURNSIDE ST,2023/10/11 18:36:17+00,,1.0,45.522615,-122.534633,97233.0,Hazelwood
2,-13656230.0,5703145.0,3,23-185173,2023/09/01 17:36:25+00,0,0,0,0,5,0,0.0,200-299 SW SALMON ST,2023/09/01 17:36:47+00,,0.0,45.516132,-122.675974,97204.0,Portland Downtown
3,-13645210.0,5699758.0,4,22-96683,2023/02/02 19:03:30+00,13,17,0,0,5,80,1.0,SE 83rd - 84th and Bush,2023/02/02 19:36:48+00,,0.0,45.494809,-122.576996,97266.0,Lents
4,-13655980.0,5704290.0,5,24-14593,2024/03/15 19:40:53+00,3,10,0,0,5,20,0.0,1-3 NW 3RD AVE,2024/03/15 20:36:42+00,,1.0,45.52334,-122.67372,97209.0,Old Town


If you see the output from df_latlong_hood_zip.head() and there are no error messages or crazy stuff in it, then it worked!

In [39]:
# But wait - I see at least one blank neighborhood name.  Let's remove those records.
# My cleanup attempt is not working, though.  Will have to fix this later.
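
# One possible reason the cleanup fails (a guess, since the attempt is not shown): after a
# left join, unmatched rows hold NaN rather than an empty string, so comparing against ""
# removes nothing. A sketch on invented toy data that treats NaN, "" and whitespace-only
# names all as blank:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; "NAME" matches the column in the joined frame.
toy = pd.DataFrame({
    "OBJECTID": [1, 2, 3, 4],
    "NAME": ["Lents", np.nan, "   ", "Old Town"],
})

# Turn NaN into "", strip whitespace, then keep rows with something left
has_name = toy["NAME"].fillna("").str.strip() != ""

toy_named = toy[has_name].copy()
print(len(toy), "->", len(toy_named))
```

# The same mask applied to df_latlong_hood_zip["NAME"] should work the same way, assuming
# the blanks there really are NaN or empty strings.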

In [40]:
# Time to save to CSV
output_path = "C:/Users/Steph/local/OIT-class/datasets/processed/IRP_Post_Sites_latlong_hood_zip.csv"
df_latlong_hood_zip.to_csv(output_path, index=False)
print(f"{output_path} saved to file!")
print("It worked! Let's do a happy dance!")

C:/Users/Steph/local/OIT-class/datasets/processed/IRP_Post_Sites_latlong_hood_zip.csv saved to file!
It worked! Let's do a happy dance!


In [42]:
# still going to have to go back and split out by year and see what to do about records that have blank neighborhood names.
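
# For the split by year, one possible approach is sketched below on three invented rows.
# The date strings copy the format seen in created_date. The trailing "+00" is cut off
# before parsing and the result is marked as UTC, so a record created late on Dec 31
# Pacific time would land in the following year.

```python
import pandas as pd

# Toy rows, invented for illustration; the date format matches created_date.
toy = pd.DataFrame({
    "inc_id": ["23-190779", "22-96683", "24-14593"],
    "created_date": [
        "2023/10/26 21:17:37+00",
        "2023/02/02 19:03:30+00",
        "2024/03/15 19:40:53+00",
    ],
})

# Drop the trailing "+00" offset, parse the rest, and mark the result as UTC
stamp = pd.to_datetime(
    toy["created_date"].str.slice(0, 19),
    format="%Y/%m/%d %H:%M:%S",
    utc=True,
)
toy["year"] = stamp.dt.year

# One DataFrame per year, keyed by the year
by_year = {int(year): group for year, group in toy.groupby("year")}
for year, group in by_year.items():
    print(year, len(group))
```

# Each entry in by_year could then be saved to its own CSV with to_csv.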