## Weather Data Import and Cleaning

1. Access NOAA Climate Data Online (CDO):
   - Visit: https://www.ncei.noaa.gov/cdo-web/
   - CDO is the data portal of NOAA's National Centers for Environmental Information (NCEI)

2. Configure Data Request:
   - Dataset: Select "Daily Summaries"
   - Location: Pittsburgh International Airport (Station ID: GHCND:USW00094823)
   - Date Range: January 1, 2022 to current date
   - Variables to select:
     * Maximum temperature
     * Minimum temperature
     * Precipitation
     * Snowfall
     * Other relevant weather metrics (for example, average wind speed and snow depth)

3. Download Process:
   - After selecting parameters, request the data download
   - Save the downloaded CSV file as `weather_data.csv`
   - Place in `data/raw/weather_data.csv`
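
The request configured in steps 1 to 3 can also be written down as a query against the CDO Web Services v2 API. The sketch below only builds the request URL and does not download anything. The helper name `build_cdo_request_url` is hypothetical, the `limit` value is an assumed page size, and the API is understood to require a free access token sent in a `token` request header, so treat this as an illustration of the parameters rather than a replacement for the manual download.

```python
from urllib.parse import urlencode

# Base endpoint of the CDO Web Services v2 "data" resource
CDO_DATA_ENDPOINT = "https://www.ncei.noaa.gov/cdo-web/api/v2/data"


def build_cdo_request_url(start_date, end_date, datatypes,
                          station_id="GHCND:USW00094823"):
    """Return a request URL for daily summaries at one station (hypothetical helper)."""
    params = [
        ("datasetid", "GHCND"),      # the "Daily Summaries" dataset
        ("stationid", station_id),   # Pittsburgh International Airport
        ("startdate", start_date),   # ISO format, e.g. "2022-01-01"
        ("enddate", end_date),
        ("units", "standard"),       # assumption: non-metric units
        ("limit", 1000),             # assumption: page size per request
    ]
    # One datatypeid entry per requested variable
    params += [("datatypeid", code) for code in datatypes]
    return f"{CDO_DATA_ENDPOINT}?{urlencode(params)}"


url = build_cdo_request_url("2022-01-01", "2022-12-31",
                            ["TMAX", "TMIN", "PRCP", "SNOW"])
print(url)
```

Requesting one calendar year at a time mirrors the one-file-per-year layout used later in this notebook.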

Note: Weather data will be merged with 311 data during the data cleaning process in `notebooks/1.5_weather_data_cleaning.ipynb`

### Data Import and Concatenation

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sys
import warnings
warnings.filterwarnings('ignore')



In [28]:
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))  # Go up one level from notebooks/
data_path = os.path.join(project_root, 'data', 'raw')

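# Each file is one CDO export covering one calendar year (2025 is a partial year)
data_2025 = pd.read_csv(os.path.join(data_path, '4006235.csv'))
data_2024 = pd.read_csv(os.path.join(data_path, '4006233.csv'))
data_2023 = pd.read_csv(os.path.join(data_path, '4006242.csv'))
data_2022 = pd.read_csv(os.path.join(data_path, '4006243.csv'))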


In [30]:
# Check the shape of the data
data_2025.shape, data_2024.shape, data_2023.shape, data_2022.shape

((116, 23), (366, 23), (365, 23), (365, 23))
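
The four `read_csv` calls can also be written as a loop that fails early when a file is missing. The following is a minimal sketch, not the notebook's own code: `load_raw_files` is a hypothetical helper, and the demo uses two tiny stand-in files written to a temporary folder instead of the real exports.

```python
import os
import tempfile

import pandas as pd


def load_raw_files(data_dir, filenames):
    """Read each CSV in `filenames` from `data_dir` and return a dict of DataFrames."""
    frames = {}
    for name in filenames:
        path = os.path.join(data_dir, name)
        if not os.path.exists(path):
            raise FileNotFoundError(f"Expected raw weather file is missing: {path}")
        frames[name] = pd.read_csv(path)
    return frames


# Demo with two small stand-in files (illustrative values, not the real exports)
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"DATE": ["2022-01-01", "2022-01-02"], "TMAX": [58.0, 43.0]}).to_csv(
        os.path.join(tmp, "a.csv"), index=False)
    pd.DataFrame({"DATE": ["2023-01-01"], "TMAX": [50.0]}).to_csv(
        os.path.join(tmp, "b.csv"), index=False)
    frames = load_raw_files(tmp, ["a.csv", "b.csv"])
    shapes = {name: df.shape for name, df in frames.items()}
    print(shapes)
```

With the real data, the shape check above would then iterate over `frames` instead of naming each year's variable.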

In [31]:
# Combine all data together
data = pd.concat([data_2025, data_2024, data_2023, data_2022])
print(data.shape)

# Check the columns
print(data.columns)

# Sort chronologically by date
# Note: the per-file index is kept, so index labels repeat across years
data = data.sort_values(by='DATE')

(1212, 23)
Index(['STATION', 'NAME', 'DATE', 'AWND', 'PGTM', 'PRCP', 'SNOW', 'SNWD',
       'TAVG', 'TMAX', 'TMIN', 'WDF2', 'WDF5', 'WSF2', 'WSF5', 'WT01', 'WT02',
       'WT03', 'WT04', 'WT05', 'WT06', 'WT08', 'WT09'],
      dtype='object')
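
Because `pd.concat` keeps each file's own index by default, the combined frame carries repeated index labels. A possible variant, sketched here on made-up stand-in frames, passes `ignore_index=True` and also checks whether any date appears in more than one file. Whether the real exports overlap is not known from the output above, so the duplicate handling is an assumption.

```python
import pandas as pd

# Two tiny stand-in frames (values are illustrative only)
part_a = pd.DataFrame({"DATE": ["2023-01-01", "2023-01-02"], "TMAX": [50.0, 52.0]})
part_b = pd.DataFrame({"DATE": ["2022-12-31", "2023-01-01"], "TMAX": [41.0, 50.0]})

# ignore_index=True builds a fresh 0..n-1 index instead of repeating labels
combined = pd.concat([part_a, part_b], ignore_index=True)

# Count dates that appear more than once across the files
n_duplicates = int(combined["DATE"].duplicated().sum())

# Keep the first row per date, sort chronologically, and renumber the rows
combined = (
    combined.drop_duplicates(subset="DATE")
    .sort_values("DATE")
    .reset_index(drop=True)
)
print(n_duplicates)
print(combined)
```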


In [32]:
data.head()

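|  | STATION | NAME | DATE | AWND | PGTM | PRCP | SNOW | SNWD | TAVG | TMAX | ... | WSF2 | WSF5 | WT01 | WT02 | WT03 | WT04 | WT05 | WT06 | WT08 | WT09 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | USW00094823 | PITTSBURGH INTERNATIONAL AIRPORT, PA US | 2022-01-01 | 6.04 | NaN | 0.89 | 0.0 | 0.0 | 53 | 58.0 | ... | 19.9 | 23.9 | 1.0 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | USW00094823 | PITTSBURGH INTERNATIONAL AIRPORT, PA US | 2022-01-02 | 11.63 | NaN | 0.05 | 0.0 | 0.0 | 40 | 43.0 | ... | 18.1 | 21.9 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | USW00094823 | PITTSBURGH INTERNATIONAL AIRPORT, PA US | 2022-01-03 | 10.96 | NaN | 0.0 | 0.0 | 0.0 | 25 | 27.0 | ... | 19.9 | 23.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | USW00094823 | PITTSBURGH INTERNATIONAL AIRPORT, PA US | 2022-01-04 | 5.59 | NaN | 0.0 | 0.0 | 0.0 | 26 | 38.0 | ... | 14.1 | 16.1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | USW00094823 | PITTSBURGH INTERNATIONAL AIRPORT, PA US | 2022-01-05 | 14.76 | NaN | 0.0 | 0.0 | 0.0 | 40 | 49.0 | ... | 28.0 | 36.9 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |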


### Basic Data Cleaning

In [33]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1212 entries, 0 to 115
Data columns (total 23 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   STATION  1212 non-null   object 
 1   NAME     1212 non-null   object 
 2   DATE     1212 non-null   object 
 3   AWND     1209 non-null   float64
 4   PGTM     65 non-null     float64
 5   PRCP     1211 non-null   float64
 6   SNOW     1210 non-null   float64
 7   SNWD     1211 non-null   float64
 8   TAVG     1212 non-null   int64  
 9   TMAX     1211 non-null   float64
 10  TMIN     1211 non-null   float64
 11  WDF2     1209 non-null   float64
 12  WDF5     1209 non-null   float64
 13  WSF2     1209 non-null   float64
 14  WSF5     1209 non-null   float64
 15  WT01     493 non-null    float64
 16  WT02     26 non-null     float64
 17  WT03     140 non-null    float64
 18  WT04     18 non-null     float64
 19  WT05     8 non-null      float64
 20  WT06     9 non-null      float64
 21  WT08     109 non-nul

### Data Preprocessing 

In [34]:
# Convert DATE column to datetime format
print("Converting DATE column to datetime...")
data['DATE'] = pd.to_datetime(data['DATE'])

# Calculate percentage of missing values for each column
print("\nCalculating missing value percentages...")
missing_percentages = (data.isnull().sum() / len(data)) * 100

# Print missing value percentages
print("\nMissing value percentages:")
for col in data.columns:
    print(f"{col}: {missing_percentages[col]:.2f}%")

# Remove columns with more than 20% missing values
threshold = 20
columns_to_drop = missing_percentages[missing_percentages > threshold].index
print(f"\nDropping columns with more than {threshold}% missing values:")
print(f"Columns to drop: {list(columns_to_drop)}")

# Drop the columns
data_cleaned = data.drop(columns=columns_to_drop)

# Print final shape
print(f"\nOriginal shape: {data.shape}")
print(f"Final shape: {data_cleaned.shape}")

# Display remaining columns
print("\nRemaining columns:")
print(data_cleaned.columns.tolist())

Converting DATE column to datetime...

Calculating missing value percentages...

Missing value percentages:
STATION: 0.00%
NAME: 0.00%
DATE: 0.00%
AWND: 0.25%
PGTM: 94.64%
PRCP: 0.08%
SNOW: 0.17%
SNWD: 0.08%
TAVG: 0.00%
TMAX: 0.08%
TMIN: 0.08%
WDF2: 0.25%
WDF5: 0.25%
WSF2: 0.25%
WSF5: 0.25%
WT01: 59.32%
WT02: 97.85%
WT03: 88.45%
WT04: 98.51%
WT05: 99.34%
WT06: 99.26%
WT08: 91.01%
WT09: 99.26%

Dropping columns with more than 20% missing values:
Columns to drop: ['PGTM', 'WT01', 'WT02', 'WT03', 'WT04', 'WT05', 'WT06', 'WT08', 'WT09']

Original shape: (1212, 23)
Final shape: (1212, 14)

Remaining columns:
['STATION', 'NAME', 'DATE', 'AWND', 'PRCP', 'SNOW', 'SNWD', 'TAVG', 'TMAX', 'TMIN', 'WDF2', 'WDF5', 'WSF2', 'WSF5']
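
The threshold rule removes every `WT` column. As an alternative technique (not what this notebook does), those columns could be kept as 0/1 indicators. The sketch below assumes that a missing `WT` value means the weather type was simply not reported that day, which should be confirmed against the GHCN-Daily documentation before relying on it. The sample rows reuse the `WT01` and `WT02` values from the first three days shown earlier.

```python
import numpy as np
import pandas as pd

# First three days of WT01/WT02 as shown in the head() output
sample = pd.DataFrame({
    "DATE": ["2022-01-01", "2022-01-02", "2022-01-03"],
    "WT01": [1.0, 1.0, np.nan],
    "WT02": [1.0, np.nan, np.nan],
})

wt_columns = [col for col in sample.columns if col.startswith("WT")]

flags = sample.copy()
# Assumption: NaN means "condition not reported", so it maps to 0
flags[wt_columns] = flags[wt_columns].notna().astype(int)
print(flags)
```

Converted this way, the indicator columns have no missing values and would pass the 20% threshold, at the cost of relying on the assumption stated above.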


In [35]:
data_cleaned


|  | STATION | NAME | DATE | AWND | PRCP | SNOW | SNWD | TAVG | TMAX | TMIN | WDF2 | WDF5 | WSF2 | WSF5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | USW00094823 | PITTSBURGH INTERNATIONAL AIRPORT, PA US | 2022-01-01 | 6.04 | 0.89 | 0.0 | 0.0 | 53 | 58.0 | 43.0 | 320.0 | 320.0 | 19.9 | 23.9 |
| 1 | USW00094823 | PITTSBURGH INTERNATIONAL AIRPORT, PA US | 2022-01-02 | 11.63 | 0.05 | 0.0 | 0.0 | 40 | 43.0 | 27.0 | 320.0 | 320.0 | 18.1 | 21.9 |
| 2 | USW00094823 | PITTSBURGH INTERNATIONAL AIRPORT, PA US | 2022-01-03 | 10.96 | 0.00 | 0.0 | 0.0 | 25 | 27.0 | 20.0 | 10.0 | 10.0 | 19.9 | 23.0 |
| 3 | USW00094823 | PITTSBURGH INTERNATIONAL AIRPORT, PA US | 2022-01-04 | 5.59 | 0.00 | 0.0 | 0.0 | 26 | 38.0 | 18.0 | 130.0 | 120.0 | 14.1 | 16.1 |
| 4 | USW00094823 | PITTSBURGH INTERNATIONAL AIRPORT, PA US | 2022-01-05 | 14.76 | 0.00 | 0.0 | 0.0 | 40 | 49.0 | 29.0 | 240.0 | 200.0 | 28.0 | 36.9 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 111 | USW00094823 | PITTSBURGH INTERNATIONAL AIRPORT, PA US | 2025-04-22 | 7.16 | 0.00 | 0.0 | 0.0 | 61 | 70.0 | 49.0 | 290.0 | 280.0 | 18.1 | 25.1 |
| 112 | USW00094823 | PITTSBURGH INTERNATIONAL AIRPORT, PA US | 2025-04-23 | 2.01 | 0.00 | 0.0 | 0.0 | 62 | 79.0 | 46.0 | 190.0 | 190.0 | 10.1 | 13.0 |
| 113 | USW00094823 | PITTSBURGH INTERNATIONAL AIRPORT, PA US | 2025-04-24 | 4.47 | 0.00 | 0.0 | 0.0 | 69 | 85.0 | 52.0 | 110.0 | 160.0 | 10.1 | 17.0 |
| 114 | USW00094823 | PITTSBURGH INTERNATIONAL AIRPORT, PA US | 2025-04-25 | 4.92 | 0.36 | 0.0 | 0.0 | 70 | 76.0 | 63.0 | 160.0 | 160.0 | 15.0 | 21.0 |


- AWND : Average Wind Speed
- PRCP : Precipitation
- SNOW : Snowfall
- SNWD : Snow Depth
- TAVG : Average Temperature
- TMAX : Maximum Temperature
- TMIN : Minimum Temperature
- WDF2 : Direction of fastest 2-minute wind
- WDF5 : Direction of fastest 5-second wind
- WSF2 : Fastest 2-minute wind speed
- WSF5 : Fastest 5-second wind speed
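
If the short station codes are hard to read downstream, the list above can be turned into a rename mapping. This is a sketch only: the descriptive names in `COLUMN_LABELS` are hypothetical choices, and the notebook itself keeps the original codes.

```python
import pandas as pd

# Code -> descriptive name; the names on the right are hypothetical
COLUMN_LABELS = {
    "AWND": "avg_wind_speed",
    "PRCP": "precipitation",
    "SNOW": "snowfall",
    "SNWD": "snow_depth",
    "TAVG": "avg_temperature",
    "TMAX": "max_temperature",
    "TMIN": "min_temperature",
    "WDF2": "fastest_2min_wind_direction",
    "WDF5": "fastest_5sec_wind_direction",
    "WSF2": "fastest_2min_wind_speed",
    "WSF5": "fastest_5sec_wind_speed",
}

# One row taken from the 2022-01-01 record shown above
sample = pd.DataFrame({"DATE": ["2022-01-01"], "TMAX": [58.0],
                       "TMIN": [43.0], "PRCP": [0.89]})

# Columns without an entry in the mapping (here DATE) are left unchanged
renamed = sample.rename(columns=COLUMN_LABELS)
print(renamed.columns.tolist())
```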

In [36]:
data_cleaned.DATE.min(), data_cleaned.DATE.max()

(Timestamp('2022-01-01 00:00:00'), Timestamp('2025-04-26 00:00:00'))
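
The date range can be cross-checked against the row count. There are 1212 calendar days from 2022-01-01 through 2025-04-26, which equals the number of rows, so the data is consistent with one record per day provided no date is repeated. The sketch below shows that arithmetic plus a hypothetical helper, `find_missing_days`, demonstrated on a short made-up list of dates.

```python
import pandas as pd


def find_missing_days(dates):
    """Return the calendar days between the first and last date that are absent."""
    dates = pd.DatetimeIndex(pd.to_datetime(dates))
    expected = pd.date_range(dates.min(), dates.max(), freq="D")
    return expected.difference(dates)


# Number of calendar days in the range reported above
expected_rows = len(pd.date_range("2022-01-01", "2025-04-26", freq="D"))
print(expected_rows)

# Demo on a made-up list with one gap
gaps = find_missing_days(["2022-01-01", "2022-01-02", "2022-01-04"])
gap_labels = list(gaps.strftime("%Y-%m-%d"))
print(gap_labels)
```

On the real frame, `find_missing_days(data_cleaned['DATE'])` would be expected to return an empty index if every day is present.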

In [38]:
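processed_dir = os.path.join(project_root, 'data', 'processed')
os.makedirs(processed_dir, exist_ok=True)  # Create the output folder if it does not exist yet
processed_weather_path = os.path.join(processed_dir, 'weather_data.csv')
data_cleaned.to_csv(processed_weather_path, index=False)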
data_cleaned

