8/7/2025 Stephen.peters@gmail.com
In this notebook we step through cleaning up the IRP Campsites dataset into something we can use for our project.
Base data downloaded from here: https://gis-pdx.opendata.arcgis.com/datasets/b7965b3e95db40c0bcb92e36ab7d3357_1396/explore?location=45.516263%2C-122.670353%2C12.44 


In [1]:
# first, let's make sure the pandas library is installed so we have access to dataframes
!pip install pandas
print("pandas installed!")

pandas installed!


In [2]:
# now we'll import our libraries, including some graphing ones, just in case
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pathlib import Path
print("libraries imported!")

libraries imported!


In [4]:
# let's take a look at our current dataset
# first we set the path to our files:
# you'll need to edit this for your system
base_dir = Path("C:/Users/Steph/local/OIT-class/project-files/datasets/original/project_data")
df = pd.read_csv(base_dir / "IRP_Campsite_Reports.csv")
#df = sns.load_dataset("datasets/original/IRP_Campsite_Reports")
# The head() method will show us an excel-style display of our data with the columns across the top and the first five rows.
# "df" stands for "dataframe" and you'll see it's a very common generic variable name used in these cases.
df.head()

Unnamed: 0,X,Y,OBJECTID,inc_date_create,inc_id,duplicate,item_date_create,IS_VEHICLE,report_id
0,-13643600.0,5695238.0,1,2025/06/09 14:38:05+00,25-154473,0,20250609073114,No,883873
1,-13658290.0,5704987.0,2,2025/06/09 14:38:05+00,25-154472,1,20250609072843,No,883868
2,-13644290.0,5697240.0,3,2025/06/09 14:38:04+00,25-154471,0,20250609072825,No,883866
3,-13657650.0,5706088.0,4,2025/06/09 14:38:04+00,25-154470,1,20250609072750,No,883864
4,-13658280.0,5704874.0,5,2025/06/09 14:38:04+00,25-154469,0,20250609072647,No,883863


In [5]:
# and let's check datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 172556 entries, 0 to 172555
Data columns (total 9 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   X                 172556 non-null  float64
 1   Y                 172556 non-null  float64
 2   OBJECTID          172556 non-null  int64  
 3   inc_date_create   172556 non-null  object 
 4   inc_id            172556 non-null  object 
 5   duplicate         172556 non-null  int64  
 6   item_date_create  172556 non-null  int64  
 7   IS_VEHICLE        172556 non-null  object 
 8   report_id         172556 non-null  int64  
dtypes: float64(2), int64(4), object(3)
memory usage: 11.8+ MB


In [6]:
# everything looks in order so far, nothing weird sticks out
# check for missing data
df.isnull().sum() # this command will count up any nulls in our columns

X                   0
Y                   0
OBJECTID            0
inc_date_create     0
inc_id              0
duplicate           0
item_date_create    0
IS_VEHICLE          0
report_id           0
dtype: int64

In [7]:
# No nulls!  That's also good news.
# let's take a peek at our summary statistics
df.describe()

Unnamed: 0,X,Y,OBJECTID,duplicate,item_date_create,report_id
count,172556.0,172556.0,172556.0,172556.0,172556.0,172556.0
mean,-13650670.0,5703868.0,86278.5,0.625229,20243740000000.0,529269.928586
std,7016.749,4953.767,49812.770863,0.484065,4491545000.0,191128.422086
min,-13692660.0,5690102.0,1.0,0.0,20231230000000.0,215988.0
25%,-13656210.0,5700319.0,43139.75,0.0,20240600000000.0,355527.25
50%,-13652720.0,5703773.0,86278.5,1.0,20240930000000.0,523363.0
75%,-13644520.0,5705975.0,129417.25,1.0,20250210000000.0,694517.5
max,-13633590.0,5724518.0,172556.0,1.0,20250610000000.0,883873.0


We need to remove duplicates, since we have a field labeled "duplicate" containing 0 or 1.  Since the dataset didn't include any details on the fields, 
we have to make some assumptions.  Why name something "duplicate" if it doesn't mean a duplicate record for some reason?  If you look back at the output of df.head() you'll see the first two records have exactly the same inc_date_create (their item_date_create values differ by a few minutes).  The first record has a "0" and the second a "1".  That suggests to me that it is worthwhile to remove the records marked as duplicate.  WE CAN ALWAYS CHANGE THIS LATER.

How do we know if 0 or 1 is a duplicate?  Asking ChatGPT:
"I have this data file.  It has a field called "duplicate" that contains either a 1 or a 0.  How can I tell if 1 or 0 indicates a duplicate record I should remove?" Reply: 

From the data:

There are 107,887 records with duplicate = 1

And 64,669 records with duplicate = 0

Looking at the sample rows:

Each pair of duplicate = 1 and duplicate = 0 rows seem to have the same report_id or similar data, suggesting duplicate = 1 marks the later/extra copies.

✅ Interpretation:
duplicate = 1 likely indicates a duplicate record that should be removed.

duplicate = 0 appears to be the original or preferred version of the record.

(for now we're going to assume this is true and just go with it.)
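
One way to test this assumption instead of taking it on faith is to check whether the rows flagged 1 actually share something with a row flagged 0. The sketch below runs on a tiny made-up frame, not the real data, and it assumes that identical X/Y coordinates are what make two reports "the same". That matching key is a guess, since the dataset ships without field documentation.

```python
import pandas as pd

# Toy stand-in for the real data; every value here is invented for illustration.
toy = pd.DataFrame({
    "X": [-100.0, -100.0, -200.0, -300.0],
    "Y": [50.0, 50.0, 60.0, 70.0],
    "duplicate": [0, 1, 0, 1],
    "report_id": [1, 2, 3, 4],
})

# Assumption: a "duplicate" report sits at the same location as an original.
key = ["X", "Y"]
originals = toy.loc[toy["duplicate"] == 0, key].drop_duplicates()
flagged = toy[toy["duplicate"] == 1]

# Left-merge the flagged rows onto the original locations.
# The "_merge" column says whether each flagged row found a match.
check = flagged.merge(originals, on=key, how="left", indicator=True)
share_matched = (check["_merge"] == "both").mean()
print(f"Flagged rows that share a location with an original: {share_matched:.0%}")

# It is also worth checking whether report_id ever repeats.
print("Repeated report_id values:", toy["report_id"].duplicated().sum())
```

On the real data, a low matched share would hint that the flag means something other than what we assumed here.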

In [11]:
# remove duplicates by copying just the ones we want to a new dataframe.
df_clean = df[df["duplicate"] == 0].copy()
print("df copied to df_clean without any duplicate=1 rows")
print("the raw data is still in df, but now we'll be working with df_clean")

df copied to df_clean without any duplicate=1 rows
the raw data is still in df, but now we'll be working with df_clean


In [12]:
# and now we have 64669 records
df_clean.info()
df_clean.head()

<class 'pandas.core.frame.DataFrame'>
Index: 64669 entries, 0 to 172555
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   X                 64669 non-null  float64
 1   Y                 64669 non-null  float64
 2   OBJECTID          64669 non-null  int64  
 3   inc_date_create   64669 non-null  object 
 4   inc_id            64669 non-null  object 
 5   duplicate         64669 non-null  int64  
 6   item_date_create  64669 non-null  int64  
 7   IS_VEHICLE        64669 non-null  object 
 8   report_id         64669 non-null  int64  
dtypes: float64(2), int64(4), object(3)
memory usage: 4.9+ MB


Unnamed: 0,X,Y,OBJECTID,inc_date_create,inc_id,duplicate,item_date_create,IS_VEHICLE,report_id
0,-13643600.0,5695238.0,1,2025/06/09 14:38:05+00,25-154473,0,20250609073114,No,883873
2,-13644290.0,5697240.0,3,2025/06/09 14:38:04+00,25-154471,0,20250609072825,No,883866
4,-13658280.0,5704874.0,5,2025/06/09 14:38:04+00,25-154469,0,20250609072647,No,883863
6,-13654760.0,5703094.0,7,2025/06/09 14:38:04+00,25-154467,0,20250609072617,No,883861
9,-13656700.0,5704915.0,10,2025/06/09 14:38:03+00,25-154464,0,20250609072510,No,883857


Check to make sure the duplicate field only has zeros:

✅ What this does:

(df_clean["duplicate"] == 0) creates a boolean Series (True for 0s).

.all() returns True only if every value in the column is 0.

In [14]:
# is the duplicate column really just zeros?  Let's find out
only_zeros = (df_clean["duplicate"] == 0).all()
print("Only zeros in 'duplicate' column:", only_zeros)
print("let's see what values exist:")
print(df_clean["duplicate"].unique())
print("and what are the counts of the existing values:")
print(df_clean["duplicate"].value_counts(dropna=False))
# We should see a "count" of 64669 for "0", which means all our records contain 0 in the duplicate field.  Yay!

Only zeros in 'duplicate' column: True
let's see what values exist:
[0]
and what are the counts of the existing values:
duplicate
0    64669
Name: count, dtype: int64


In [15]:
# Let's save our current file as a .csv so we have it in case we need it for something.
# Define the base directory. Remember to update this for your machine.
base_dir_save = Path("C:/Users/Steph/local/OIT-class/project-files/datasets/processed")

# Ensure the directory exists (create it if it doesn't)
base_dir_save.mkdir(parents=True, exist_ok=True)

# Define the full output file path
output_file = base_dir_save / "IRP_Campsite_Reports_clean.csv"

# Save the DataFrame to CSV
df_clean.to_csv(output_file, index=False)

print(f"File saved to: {output_file}")

File saved to: C:\Users\Steph\local\OIT-class\project-files\datasets\processed\IRP_Campsite_Reports_clean.csv


In [17]:
# And now we want to create a new dataframe that takes df_clean and adds the lat/long coords in new columns
import geopandas as gpd
from shapely.geometry import Point
print("libraries imported!")

# Step 1: Load your cleaned DataFrame
# we already have this in memory, so I've commented it out.
# df_clean = pd.read_csv("C:/Users/Steph/local/OIT-class/datasets/processed/IRP_Campsite_Reports_clean.csv")

# Step 2: Copy to a fresh, clean new DataFrame.  That way we can always go back if we screw something up.
df_latlong = df_clean.copy()

# Step 3: Create a GeoDataFrame using X/Y assuming EPSG:3857 (Web Mercator)
geometry = [Point(xy) for xy in zip(df_latlong["X"], df_latlong["Y"])]
gdf = gpd.GeoDataFrame(df_latlong, geometry=geometry, crs="EPSG:3857")

# Step 4: Convert coordinates to WGS84 (latitude/longitude)
gdf = gdf.to_crs("EPSG:4326")

# Step 5: Extract lat/lon into new columns on gdf
gdf["latitude"] = gdf.geometry.y
gdf["longitude"] = gdf.geometry.x

# Step 6: Drop geometry column if not needed
df_latlong = gdf.drop(columns="geometry")
print("...and now we have our lat/longs:")
df_latlong.head()

libraries imported!
...and now we have our lat/longs:


Unnamed: 0,X,Y,OBJECTID,inc_date_create,inc_id,duplicate,item_date_create,IS_VEHICLE,report_id,latitude,longitude
0,-13643600.0,5695238.0,1,2025/06/09 14:38:05+00,25-154473,0,20250609073114,No,883873,45.466339,-122.5625
2,-13644290.0,5697240.0,3,2025/06/09 14:38:04+00,25-154471,0,20250609072825,No,883866,45.478952,-122.568712
4,-13658280.0,5704874.0,5,2025/06/09 14:38:04+00,25-154469,0,20250609072647,No,883863,45.527012,-122.694446
6,-13654760.0,5703094.0,7,2025/06/09 14:38:04+00,25-154467,0,20250609072617,No,883861,45.515808,-122.662782
9,-13656700.0,5704915.0,10,2025/06/09 14:38:03+00,25-154464,0,20250609072510,No,883857,45.527273,-122.680247
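
As a sanity check on the conversion, the inverse of the spherical Web Mercator projection is short enough to write by hand. This is only a cross-check of the geopandas result, not a replacement for it, and the function name is made up for this sketch. It uses the X/Y values as displayed in the table, which are already rounded.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by EPSG:3857


def web_mercator_to_latlon(x, y):
    """Invert the spherical Web Mercator projection; returns (lat, lon) in degrees."""
    lon = math.degrees(x / EARTH_RADIUS_M)
    lat = math.degrees(2.0 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2.0)
    return lat, lon


# First row of the table above (X and Y as printed, so slightly rounded).
lat, lon = web_mercator_to_latlon(-13643600.0, 5695238.0)
print(lat, lon)
```

The result should land very close to the latitude and longitude geopandas reported for that row (45.466339, -122.5625). If it didn't, the EPSG:3857 assumption in Step 3 would be the first thing to question.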


In [18]:
# And let's save the data for this step:

# Define the full output file path
output_file = base_dir_save / "IRP_Campsite_Reports_clean-latlong.csv"
df_latlong.to_csv(output_file, index=False)

print(f"File saved to: {output_file}")

File saved to: C:\Users\Steph\local\OIT-class\project-files\datasets\processed\IRP_Campsite_Reports_clean-latlong.csv


Now we are going to create a new dataframe that uses the lat/long to add the zip code and neighborhood name.
I had to create an account here to get a Portland zip codes shapefile.  Somehow data.gov and the Portland open data portal weren't giving me one.
https://koordinates.com/

In [22]:
import geopandas as gpd
from shapely.geometry import Point
print("libraries imported!")

# Step 1: Load your df_latlong
# already did this
#df_latlong = pd.read_csv("C:/Users/Steph/local/OIT-class/datasets/processed/IRP_Campsite_Reports_latlong.csv")

# Step 2: Convert df_latlong to a GeoDataFrame
geometry = [Point(xy) for xy in zip(df_latlong["longitude"], df_latlong["latitude"])]
gdf = gpd.GeoDataFrame(df_latlong, geometry=geometry, crs="EPSG:4326")

# Step 3: Load shapefiles for ZIP codes and neighborhoods
zip_shapefile = "C:/Users/Steph/local/OIT-class/datasets/original/zip-code-extracted/portland-oregon-zip-code-boundaries.shp"
hood_shapefile = "C:/Users/Steph/local/OIT-class/datasets/original/Neighborhoods_regions-extracted/Neighborhoods_regions.shp"

gdf_zip = gpd.read_file(zip_shapefile).to_crs("EPSG:4326")
gdf_hood = gpd.read_file(hood_shapefile).to_crs("EPSG:4326")
print("shape files loaded!")

libraries imported!
shape files loaded!


In [23]:
# I had to solve a bunch of problems related to column names, so let's see what columns are in our gdf_hood:
print("gdf_hood columns:", list(gdf_hood.columns))


gdf_hood columns: ['OBJECTID', 'NAME', 'COMMPLAN', 'SHARED', 'COALIT', 'HORZ_VERT', 'MAPLABEL', 'ID', 'Shape_Leng', 'Shape_Area', 'nbh_distri', 'geometry']


In [27]:
# continue our script

# Optional: Print column names to check
#print("ZIP columns:", gdf_zip.columns)
#print("Neighborhood columns:", gdf_hood.columns)

# Step 4: Spatial join for ZIP codes
gdf_zip_joined = gpd.sjoin(
    gdf,
    gdf_zip[["geometry", "Zip_Code"]],
    how="left",
    predicate="within"
)

# Fix: Remove the index_right column left over from the ZIP join.  Otherwise the next join would try to add a second index_right column.
if "index_right" in gdf_zip_joined.columns:
    gdf_zip_joined = gdf_zip_joined.drop(columns=["index_right"])

# Step 5: Spatial join for neighborhoods (the column is called 'NAME', as the column list above shows)
gdf_full = gpd.sjoin(
    gdf_zip_joined,
    gdf_hood[["geometry", "NAME"]],
    how="left",
    predicate="within"
)

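# Step 6: Rename columns and clean up
# Note: the neighborhood column is "NAME" (all caps), so a "Name" -> "neighborhood" mapping would
# silently match nothing.  We only rename the zip column here and leave "NAME" as it is.
gdf_full = gdf_full.rename(columns={
    "Zip_Code": "zip_code"
})

# Our neighborhood names are in all-caps.  Let's convert those values to "title case" so only the first letter of each word is capitalized
gdf_full["NAME"] = gdf_full["NAME"].str.title()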

df_latlong_hood_zips = gdf_full.drop(columns=["geometry", "index_right"])

# this space intentionally left blank
print("This means it worked!")

This means it worked!


In [28]:
print("Now, let's see what we've got in our new dataframe:")
df_latlong_hood_zips.head()


Now, let's see what we've got in our new dataframe:


Unnamed: 0,X,Y,OBJECTID,inc_date_create,inc_id,duplicate,item_date_create,IS_VEHICLE,report_id,latitude,longitude,zip_code,NAME
0,-13643600.0,5695238.0,1,2025/06/09 14:38:05+00,25-154473,0,20250609073114,No,883873,45.466339,-122.5625,97266.0,Lents
2,-13644290.0,5697240.0,3,2025/06/09 14:38:04+00,25-154471,0,20250609072825,No,883866,45.478952,-122.568712,97266.0,Lents
4,-13658280.0,5704874.0,5,2025/06/09 14:38:04+00,25-154469,0,20250609072647,No,883863,45.527012,-122.694446,97209.0,Northwest District
6,-13654760.0,5703094.0,7,2025/06/09 14:38:04+00,25-154467,0,20250609072617,No,883861,45.515808,-122.662782,97214.0,Buckman
9,-13656700.0,5704915.0,10,2025/06/09 14:38:03+00,25-154464,0,20250609072510,No,883857,45.527273,-122.680247,97209.0,Pearl District


In [29]:
# Wait - why are we getting a decimal in zip_code?
df_latlong_hood_zips.info()


<class 'pandas.core.frame.DataFrame'>
Index: 66306 entries, 0 to 172555
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   X                 66306 non-null  float64
 1   Y                 66306 non-null  float64
 2   OBJECTID          66306 non-null  int64  
 3   inc_date_create   66306 non-null  object 
 4   inc_id            66306 non-null  object 
 5   duplicate         66306 non-null  int64  
 6   item_date_create  66306 non-null  int64  
 7   IS_VEHICLE        66306 non-null  object 
 8   report_id         66306 non-null  int64  
 9   latitude          66306 non-null  float64
 10  longitude         66306 non-null  float64
 11  zip_code          66302 non-null  float64
 12  NAME              65856 non-null  object 
dtypes: float64(5), int64(4), object(4)
memory usage: 7.1+ MB
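
Something else to notice in that output: we now have 66306 rows, but df_clean only had 64669. A left spatial join returns one row per matching polygon and keeps the left index, so a likely (but unverified) explanation is that some points fell inside more than one ZIP or neighborhood polygon. The sketch below shows how to spot such rows through their repeated index labels. It uses a made-up frame, and keeping the first match is just one possible policy, not something the notebook does.

```python
import pandas as pd

# Toy stand-in for a joined result: index label 2 appears twice, as it would
# if that point matched two polygons.  All values are invented.
joined = pd.DataFrame(
    {"report_id": [10, 11, 12, 12], "NAME": ["Lents", "Buckman", "Kerns", "Buckman"]},
    index=[0, 1, 2, 2],
)

# Rows whose index label occurs more than once came from a one-to-many match.
multi = joined[joined.index.duplicated(keep=False)]
print(f"{joined.index.nunique()} unique source rows, {len(joined)} joined rows")
print(multi)

# One possible policy (an assumption): keep only the first match per source row.
deduped = joined[~joined.index.duplicated(keep="first")]
print(len(deduped), "rows after keeping the first match")
```

Running the same index check on df_latlong_hood_zips would show whether the extra rows really are repeats of the same reports.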


In [None]:
# Aha!  The datatype for zip_code is a floating point number, not an integer.  That's easy to fix.  Fasten your seat belts, here we go.
# df_latlong_hood_zips["zip_code"] = df_latlong_hood_zips["zip_code"].astype(int)
# dangit!  if we run the line above we get an error message.
# I'll save you the debugging, there are nulls in our zip_code column.  Feel free to run it if you wish.

In [35]:
# Let's show the rows where 'zip_code' is NaN, aka null
null_rows = df_latlong_hood_zips[df_latlong_hood_zips["zip_code"].isna()]
print(null_rows)


                   X             Y  OBJECTID         inc_date_create  \
11635  -1.365588e+07  5.709057e+06     11636  2025/05/07 18:08:04+00   
40661  -1.365988e+07  5.713015e+06     40662  2025/02/18 00:28:04+00   
55757  -1.365989e+07  5.713015e+06     55758  2024/12/27 16:18:04+00   
132075 -1.365680e+07  5.709013e+06    132076  2024/05/27 19:58:04+00   

           inc_id  duplicate  item_date_create IS_VEHICLE  report_id  \
11635   25-142838          0    20250507110008        Yes     824317   
40661   25-113812          0    20250217160638        Yes     704150   
55757   24-116745          0    20241227081305        Yes     644562   
132075   24-40427          0    20240527124128        Yes     343430   

         latitude   longitude  zip_code        NAME  
11635   45.553334 -122.672897       NaN       Boise  
40661   45.578221 -122.708820       NaN  Portsmouth  
55757   45.578221 -122.708844       NaN  Portsmouth  
132075  45.553055 -122.681167       NaN    Overlook  


In [37]:
# Ok, sure.  Four records did not get a zip code for some reason, and I'll make an executive decision to overlook this.  
# When we're going for a Nobel Prize, we'll look into this issue, but not now.
# Instead, we'll just remove the records where the nulls appear.
df_latlong_hood_zips.dropna(subset=["zip_code"], inplace=True)
# "inplace=True" means update the dataframe instead of copying it to a new one.
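
If we'd rather not throw those four records away, pandas has a nullable integer dtype that can hold whole numbers and missing values in the same column. A minimal sketch on a made-up Series (the values are just examples):

```python
import pandas as pd

# Two real-looking zip codes and one missing value, stored as floats.
zips = pd.Series([97266.0, float("nan"), 97209.0], name="zip_code")

# "Int64" (capital I) is the pandas nullable integer dtype; NaN becomes <NA>.
zips_int = zips.astype("Int64")
print(zips_int)
```

The trade-off is that some other libraries expect plain int64, so the column may need extra handling later.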

In [39]:
null_rows = df_latlong_hood_zips[df_latlong_hood_zips["zip_code"].isna()]
print("Now we should have no rows with nulls in zip_code")
print(null_rows)

Now we should have no rows with nulls in zip_code
Empty DataFrame
Columns: [X, Y, OBJECTID, inc_date_create, inc_id, duplicate, item_date_create, IS_VEHICLE, report_id, latitude, longitude, zip_code, NAME]
Index: []


In [43]:
# Ok, now we convert our zip_code to an integer:
df_latlong_hood_zips["zip_code"] = df_latlong_hood_zips["zip_code"].astype(int)
print("...and let's check:")
df_latlong_hood_zips.info()
print("If it says int64 instead of float64, then we did it!")

...and let's check:
<class 'pandas.core.frame.DataFrame'>
Index: 66302 entries, 0 to 172555
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   X                 66302 non-null  float64
 1   Y                 66302 non-null  float64
 2   OBJECTID          66302 non-null  int64  
 3   inc_date_create   66302 non-null  object 
 4   inc_id            66302 non-null  object 
 5   duplicate         66302 non-null  int64  
 6   item_date_create  66302 non-null  int64  
 7   IS_VEHICLE        66302 non-null  object 
 8   report_id         66302 non-null  int64  
 9   latitude          66302 non-null  float64
 10  longitude         66302 non-null  float64
 11  zip_code          66302 non-null  int64  
 12  NAME              65852 non-null  object 
dtypes: float64(4), int64(5), object(4)
memory usage: 7.1+ MB
If it says int64 instead of float64, then we did it!


In [44]:
print("Just to be double-sure, let's look at our dataframe:")
df_latlong_hood_zips.head()

Just to be double-sure, let's look at our dataframe:


Unnamed: 0,X,Y,OBJECTID,inc_date_create,inc_id,duplicate,item_date_create,IS_VEHICLE,report_id,latitude,longitude,zip_code,NAME
0,-13643600.0,5695238.0,1,2025/06/09 14:38:05+00,25-154473,0,20250609073114,No,883873,45.466339,-122.5625,97266,Lents
2,-13644290.0,5697240.0,3,2025/06/09 14:38:04+00,25-154471,0,20250609072825,No,883866,45.478952,-122.568712,97266,Lents
4,-13658280.0,5704874.0,5,2025/06/09 14:38:04+00,25-154469,0,20250609072647,No,883863,45.527012,-122.694446,97209,Northwest District
6,-13654760.0,5703094.0,7,2025/06/09 14:38:04+00,25-154467,0,20250609072617,No,883861,45.515808,-122.662782,97214,Buckman
9,-13656700.0,5704915.0,10,2025/06/09 14:38:03+00,25-154464,0,20250609072510,No,883857,45.527273,-122.680247,97209,Pearl District


In [45]:
# Save to CSV
# don't forget to adjust your path
output_path = "C:/Users/Steph/local/OIT-class/project-files/datasets/processed/IRP_Campsite_Reports_latlong_hood_zips.csv"
df_latlong_hood_zips.to_csv(output_path, index=False)
# use an f-string here; print({output_path}, ...) would print a Python set, curly braces and all
print(f"{output_path} saved to file!")
print("Let's do a happy dance!")

C:/Users/Steph/local/OIT-class/project-files/datasets/processed/IRP_Campsite_Reports_latlong_hood_zips.csv saved to file!
Let's do a happy dance!


In [46]:
# Now, there's a catch.  This one data file has all the years in one.  Usually one only uses a single year in an analysis, so we should split
# the file by year.
# First we slice the year and month out of the date string (e.g. "2025/06/09 ..." gives "2025" and "06")
df_latlong_hood_zips["year"] = df_latlong_hood_zips["inc_date_create"].str[0:4]
df_latlong_hood_zips["month"] = df_latlong_hood_zips["inc_date_create"].str[5:7]
# and then we convert them from strings to integers
df_latlong_hood_zips["year"] = df_latlong_hood_zips["year"].astype(int)
df_latlong_hood_zips["month"] = df_latlong_hood_zips["month"].astype(int)
# and let's see what we get:
df_latlong_hood_zips.head()

Unnamed: 0,X,Y,OBJECTID,inc_date_create,inc_id,duplicate,item_date_create,IS_VEHICLE,report_id,latitude,longitude,zip_code,NAME,year,month
0,-13643600.0,5695238.0,1,2025/06/09 14:38:05+00,25-154473,0,20250609073114,No,883873,45.466339,-122.5625,97266,Lents,2025,6
2,-13644290.0,5697240.0,3,2025/06/09 14:38:04+00,25-154471,0,20250609072825,No,883866,45.478952,-122.568712,97266,Lents,2025,6
4,-13658280.0,5704874.0,5,2025/06/09 14:38:04+00,25-154469,0,20250609072647,No,883863,45.527012,-122.694446,97209,Northwest District,2025,6
6,-13654760.0,5703094.0,7,2025/06/09 14:38:04+00,25-154467,0,20250609072617,No,883861,45.515808,-122.662782,97214,Buckman,2025,6
9,-13656700.0,5704915.0,10,2025/06/09 14:38:03+00,25-154464,0,20250609072510,No,883857,45.527273,-122.680247,97209,Pearl District,2025,6
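
String slicing works because every timestamp has the same layout. An alternative is to let pandas parse the dates, which also gives us real datetime values for later use. The sketch below assumes the trailing "+00" means UTC and simply drops it before parsing; the two sample timestamps are copied from the outputs in this notebook.

```python
import pandas as pd

stamps = pd.Series(["2025/06/09 14:38:05+00", "2024/12/31 23:58:04+00"])

# Keep the first 19 characters ("YYYY/MM/DD HH:MM:SS") and parse them.
# Assumption: the dropped "+00" suffix is a UTC offset.
parsed = pd.to_datetime(stamps.str[:19], format="%Y/%m/%d %H:%M:%S")

years = parsed.dt.year
months = parsed.dt.month
print(years.tolist(), months.tolist())
```

With an explicit format, a malformed timestamp raises a parsing error (or becomes NaT if errors="coerce" is passed).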


In [49]:
# ok, so we have two years
year_counts = df_latlong_hood_zips["year"].value_counts()
print(year_counts)

year
2024    45556
2025    20746
Name: count, dtype: int64


In [50]:
# and just to peek at our months
month_counts = df_latlong_hood_zips["month"].value_counts()
print(month_counts)

month
4     8038
5     8024
3     7531
2     6535
1     6341
6     4854
10    4708
8     4576
9     4424
7     4190
12    3547
11    3534
Name: count, dtype: int64


In [68]:
# That's interesting.  Let's sort by the count so we can see which months have the most campsites reported
month_counts_df = (
    df_latlong_hood_zips["month"]
    .value_counts()                     # gets counts, sorted by count by default
    .to_frame(name="count")              # turn into DataFrame
    .reset_index()                       # convert index to column
    .rename(columns={"index": "month"})   # name the month column (only needed on older pandas, where reset_index calls it "index")
    .sort_values(by="count", ascending=False)  # ensure sorted by counts
)
print("Here's our homeless camps count by month, sorted most to least for both years 2024 and 2025.")
print(month_counts_df)
print("\nI find it interesting that April and May have the most campsites.  Curious, eh?")

Here's our homeless camps count by month, sorted most to least for both years 2024 and 2025.
    month  count
0       4   8038
1       5   8024
2       3   7531
3       2   6535
4       1   6341
5       6   4854
6      10   4708
7       8   4576
8       9   4424
9       7   4190
10     12   3547
11     11   3534

I find it interesting that April and May have the most campsites.  Curious, eh?


In [57]:
# Now we make a new dataframe for just the year 2024
df_2024 = df_latlong_hood_zips[df_latlong_hood_zips['year'] == 2024]
df_2024.head()

Unnamed: 0,X,Y,OBJECTID,inc_date_create,inc_id,duplicate,item_date_create,IS_VEHICLE,report_id,latitude,longitude,zip_code,NAME,year,month
54526,-13654250.0,5704679.0,54527,2024/12/31 23:58:04+00,24-117976,0,20241231154906,Yes,649033,45.525788,-122.658239,97232,Kerns,2024,12
54528,-13654690.0,5701901.0,54529,2024/12/31 23:48:04+00,24-117974,0,20241231154115,No,649026,45.508302,-122.66219,97214,Hosford-Abernethy,2024,12
54529,-13654710.0,5702644.0,54530,2024/12/31 23:48:04+00,24-117973,0,20241231153619,No,649022,45.512978,-122.662362,97214,Buckman,2024,12
54531,-13662520.0,5709217.0,54532,2024/12/31 23:48:04+00,24-117971,0,20241231153057,Yes,649014,45.55434,-122.73253,97210,Mc Unclaimed #14,2024,12
54537,-13641980.0,5709683.0,54538,2024/12/31 23:38:04+00,24-117965,0,20241231152247,Yes,649000,45.557269,-122.547965,97220,Parkrose,2024,12


In [59]:
# so far, looks good, but let's double-check
year_counts = df_2024["year"].value_counts()
print(year_counts)
print("If it only shows the year 2024, we're good!")

year
2024    45556
Name: count, dtype: int64
If it only shows the year 2024, we're good!


In [60]:
# Now we do this again for 2025
df_2025 = df_latlong_hood_zips[df_latlong_hood_zips['year'] == 2025]
df_2025.head()


Unnamed: 0,X,Y,OBJECTID,inc_date_create,inc_id,duplicate,item_date_create,IS_VEHICLE,report_id,latitude,longitude,zip_code,NAME,year,month
0,-13643600.0,5695238.0,1,2025/06/09 14:38:05+00,25-154473,0,20250609073114,No,883873,45.466339,-122.5625,97266,Lents,2025,6
2,-13644290.0,5697240.0,3,2025/06/09 14:38:04+00,25-154471,0,20250609072825,No,883866,45.478952,-122.568712,97266,Lents,2025,6
4,-13658280.0,5704874.0,5,2025/06/09 14:38:04+00,25-154469,0,20250609072647,No,883863,45.527012,-122.694446,97209,Northwest District,2025,6
6,-13654760.0,5703094.0,7,2025/06/09 14:38:04+00,25-154467,0,20250609072617,No,883861,45.515808,-122.662782,97214,Buckman,2025,6
9,-13656700.0,5704915.0,10,2025/06/09 14:38:03+00,25-154464,0,20250609072510,No,883857,45.527273,-122.680247,97209,Pearl District,2025,6


In [61]:
# so far, looks good, but let's double-check
year_counts = df_2025["year"].value_counts()
print(year_counts)
print("If it only shows the year 2025, we're good!")

year
2025    20746
Name: count, dtype: int64
If it only shows the year 2025, we're good!


In [62]:
# Now that we think we're done massaging our data, let's save a separate file for each year.

# Define the full output file path
output_file = base_dir_save / "IRP_Campsite_Reports_latlong-hood-zips-2024.csv"

# Save the DataFrame to CSV
df_2024.to_csv(output_file, index=False)

print(f"File saved to: {output_file}")

# Define the full output file path
output_file = base_dir_save / "IRP_Campsite_Reports_latlong-hood-zips-2025.csv"

# Save the DataFrame to CSV
df_2025.to_csv(output_file, index=False)

print(f"File saved to: {output_file}")

File saved to: C:\Users\Steph\local\OIT-class\project-files\datasets\processed\IRP_Campsite_Reports_latlong-hood-zips-2024.csv
File saved to: C:\Users\Steph\local\OIT-class\project-files\datasets\processed\IRP_Campsite_Reports_latlong-hood-zips-2025.csv
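
If more years show up in a future download, the same save step can be written as a loop over groupby("year") so nothing has to be copied and pasted per year. A minimal sketch on a made-up frame; the temporary folder is only there so the example runs anywhere, and in the notebook it would be base_dir_save.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for df_latlong_hood_zips; values are invented.
toy = pd.DataFrame({"report_id": [1, 2, 3], "year": [2024, 2024, 2025]})

# A temporary folder stands in for base_dir_save.
out_dir = Path(tempfile.mkdtemp())

saved = []
for year, part in toy.groupby("year"):
    out_file = out_dir / f"IRP_Campsite_Reports_latlong-hood-zips-{year}.csv"
    part.to_csv(out_file, index=False)
    saved.append(out_file)
    print(f"File saved to: {out_file} ({len(part)} rows)")
```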


In [66]:
# and just to peek at our months
print("Let's remember that the 2025 year wasn't complete when this count was taken, so what months do we have?")
month_counts = df_2025["month"].value_counts()
print(month_counts)

Let's remember that the 2025 year wasn't complete when this count was taken, so what months do we have?
month
4    4226
3    4198
5    4100
1    3782
2    3371
6    1069
Name: count, dtype: int64
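
Since 2025 stops in June, the combined month ranking from earlier adds up two Januaries through Junes but only one of each later month. That alone could explain why the first half of the year looked busiest. Counting by year and month together keeps the two years apart. A sketch on made-up rows:

```python
import pandas as pd

# Toy stand-in for the year and month columns; values are invented.
toy = pd.DataFrame({
    "year":  [2024, 2024, 2024, 2025, 2025],
    "month": [4, 4, 11, 4, 5],
})

# Rows are years, columns are months, cells are report counts.
by_year_month = pd.crosstab(toy["year"], toy["month"])
print(by_year_month)
```

On the real data this would be pd.crosstab(df_latlong_hood_zips["year"], df_latlong_hood_zips["month"]), and comparing months within the 2024 row is the fairer way to look for a seasonal pattern.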
