# Introduction

My approach to data cleaning and preparation builds on Brayden's, with the aim of producing a more accurate model. If this version turns out to be more accurate, we can use it; we will discuss as a team which model to go with. I'll start by importing the libraries I'll be using, which are mostly the same as Brayden's because I use the same standard methods to fill values and analyze the data.

## Imports and Data Path

In [None]:
# Use pip3 for installing these for python3
import numpy as np
import pandas as pd
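# Import the submodules explicitly; a bare `import sklearn` may not load them
import sklearn.linear_model
import sklearn.model_selection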

data_path = "../eug_weather/data.csv" # This file is under /eug_weather. Change as needed
data = pd.read_csv(data_path)

# Sorting the Null Values
Like Brayden, I list the null counts below, sorted from highest to lowest percent null. The percentage is the number of null values in a column divided by the number of rows.

In [None]:
# Null Analysis by Brayden but with sort
def display_number_of_null(data):
    data_is_null = data.isnull().sum()
    data_is_null = data_is_null.to_frame(name="Amount Null")
    data_is_null["Percent Null"] = ((data_is_null["Amount Null"] / len(data)) * 100).round(2)
    
    # Sort by Percent Null in descending order
    data_is_null = data_is_null.sort_values(by="Percent Null", ascending=False)

    print("\nNumber of data points: ", len(data), "\n\n")
    print(data_is_null)
    
display_number_of_null(data)

data['Observation_Date'] = pd.to_datetime(data['Observation_Date'])

data['year'] = data['Observation_Date'].dt.year
data['month'] = data['Observation_Date'].dt.month
data['day'] = data['Observation_Date'].dt.day
data = data.drop(columns='Observation_Date')
print(data.describe())

In [None]:
Number of data points:  9144 


                              Amount Null  Percent Null
Freezing_Drizzle                     9143         99.99
Wave_Height_Specific_Period          9143         99.99
Damaging_Winds                       9143         99.99
Percent_Sunshine                     9141         99.97
Tornado_Funnel_Cloud                 9137         99.92
Significant_Wave_Height              9134         99.89
Freezing_Rain                        9133         99.88
Dust_BlowingDust_VolcanicAsh         9125         99.79
Glaze_Rime                           9123         99.77
Ice_Pellets                          9108         99.61
Snow_IcePellets_OnGround             9088         99.39
Drizzle                              9078         99.28
Thunderstorms                        9025         98.70
Blowing_Drifting_Snow                9010         98.53
Ground_Fog                           8996         98.38
Snow                                 8959         97.98
Ice_Fog                              8955         97.93
Hail                                 8663         94.74
Snow_Depth                           8637         94.46
Smoke_Haze                           8242         90.14
Total_Sunshine                       8027         87.78
Heavy_Fog_Mist                       7716         84.38
Rain                                 6752         73.84
Blowing_Spray                        6726         73.56
Snowfall                             6406         70.06
Snow_Water_Equivalent                5126         56.06
Time_Peak_Gust                       4908         53.67
Time_Fastest_Mile                    4799         52.48
Fog_IceFog_HeavyFog                  4133         45.20
Avg_Temperature                      2812         30.75
Fastest_5Sec_Wind_Speed                89          0.97
Fastest_5Sec_Wind_Direction            89          0.97
Avg_Wind_Speed                          9          0.10
Fastest_2Min_Wind_Speed                 8          0.09
Fastest_2Min_Wind_Direction             8          0.09
Station_Name                            0          0.00
Observation_Date                        0          0.00
Max_Temperature                         0          0.00
Precipitation                           0          0.00
Min_Temperature                         0          0.00
Station_ID                              0          0.00
       Avg_Wind_Speed  Time_Fastest_Mile  ...        month          day
count     9135.000000        4345.000000  ...  9144.000000  9144.000000
mean         6.547559        1586.291369  ...     6.515311    15.718613
std          2.830560        1076.753596  ...     3.452524     8.802492
min          0.220000           0.000000  ...     1.000000     1.000000
25%          4.470000        1304.000000  ...     4.000000     8.000000
50%          6.040000        1537.000000  ...     7.000000    16.000000
75%          8.280000        1741.000000  ...    10.000000    23.000000
max         20.800000        9999.000000  ...    12.000000    31.000000

[8 rows x 41 columns]

# Analysis of the initial data

Non-null data: Station_ID, Min_Temperature, Precipitation, Max_Temperature, Observation_Date, Station_Name

### The following columns HAVE null data. I'll discuss the plan for each one.

**Freezing_Drizzle:** Null values should just be 0 because freezing drizzle is very unlikely. 

**Wave_Height_Specific_Period:** This value isn't important because this isn't a coastal city, so we can assume 0 for all values or simply drop the column.

**Damaging_Winds:** Drop the column. Not significant.

**Percent_Sunshine:** We would need more information to fill this in; the main problem is that we don't have cloud cover data. Drop the column. Not significant.

**Significant_Wave_Height, Dust_BlowingDust_VolcanicAsh, Glaze_Rime, Ice_Pellets, Snow_IcePellets_OnGround, Blowing_Drifting_Snow, Ground_Fog, Tornado_Funnel_Cloud, Snow, Ice_Fog, Hail, Smoke_Haze, Total_Sunshine, Heavy_Fog_Mist, Rain, Blowing_Spray, Snow_Water_Equivalent, Time_Peak_Gust, Time_Fastest_Mile, Fog_IceFog_HeavyFog, Fastest_5Sec_Wind_Speed, Fastest_5Sec_Wind_Direction:** Drop the column or fill nulls with 0. Not significant enough.

**Drizzle:** I want to infer this column from precipitation: if there is 0.01 in or 0.02 in of precipitation, we will assume 1 for drizzle. Otherwise we should set the value to 0.

**Thunderstorms:** We theoretically have enough data for a very rough estimate, but it would be very difficult to get close to accurate. We should assume 0 for the null values.
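
The Drizzle rule above could be sketched roughly as follows. This is only an illustration on toy data, not the code used later in this notebook: the helper name `fill_drizzle` is hypothetical, and it assumes that only null Drizzle entries are filled while any observed flags are left alone.

```python
import numpy as np
import pandas as pd

def fill_drizzle(df):
    """Fill null Drizzle flags from Precipitation (hypothetical helper)."""
    df = df.copy()
    # Days with roughly 0.01 in or 0.02 in of precipitation count as drizzle.
    # The bounds are widened slightly to avoid float comparison issues.
    light = df["Precipitation"].between(0.005, 0.025)
    # Assumption: observed (non-null) Drizzle values are kept as they are.
    df["Drizzle"] = df["Drizzle"].fillna(light.astype(int))
    return df

toy = pd.DataFrame({
    "Precipitation": [0.00, 0.01, 0.02, 0.35, 0.01],
    "Drizzle": [np.nan, np.nan, np.nan, np.nan, 1.0],
})
filled = fill_drizzle(toy)
print(filled)
```

The same `fillna(0)` pattern would cover Thunderstorms and the other columns whose nulls are simply set to 0.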

### Important columns

**Snowfall:** We will calculate snowfall similarly to Brayden, using the snow coefficient. However, I want to make sure it is only calculated when the minimum temperature is 32°F or lower, because snowfall above that is extremely unlikely.

**Snow_Depth:** I like Brayden's model, but it is missing a melting factor. I added one so that the data accounts for melting. Even in cold temperatures snow begins to melt, and I found a rough melting factor of 20% per day from the COMET program, which will probably make our data a little more accurate.

**Avg_Temperature:** Brayden's idea works perfectly, but I'm only going to apply it to null values.

**Avg_Wind_Speed, Fastest_2Min_Wind_Speed, Fastest_2Min_Wind_Direction:** These can be fixed by dropping the rows with null values.
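
As a compact way to read the melting factor described for Snow_Depth, the estimate can be written as a decayed sum of recent snowfall. The symbols are introduced here only for illustration: $D_t$ is the estimated depth on day $t$, $S_t$ is that day's snowfall, $m$ is the daily melt factor (0.2), and $w$ is the look-back window in days (5, taken from the default in `compute_snow_depth`, not from COMET).

```latex
D_t = \sum_{i=0}^{w} (1 - m)^{i} \, S_{t-i}
```

Under this sketch, snow that fell $i$ days ago contributes $(1 - m)^i$ of its original amount, and anything older than $w$ days is ignored.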

# Snowfall and Snow_Depth

In [None]:
def rain_to_snow_conversion(data):
    data = data.copy()  # avoid modifying the caller's DataFrame

    # Snow-to-liquid conversion ratios by minimum temperature band (°F)
    conditions = [
        (data['Min_Temperature'] >= 34) & (data['Min_Temperature'] < 45),
        (data['Min_Temperature'] >= 27) & (data['Min_Temperature'] < 34),
        (data['Min_Temperature'] >= 20) & (data['Min_Temperature'] < 27),
        (data['Min_Temperature'] >= 15) & (data['Min_Temperature'] < 20),
        (data['Min_Temperature'] >= 10) & (data['Min_Temperature'] < 15),
        (data['Min_Temperature'] >= 0) & (data['Min_Temperature'] < 10),
        (data['Min_Temperature'] >= -20) & (data['Min_Temperature'] < 0),
        (data['Min_Temperature'] < -20)
    ]
    conversion_factors = [0.1, 10, 15, 20, 30, 40, 50, 100]

    for condition, factor in zip(conditions, conversion_factors):
        data.loc[condition, "Snowfall"] = data['Precipitation'] * factor

    # No snowfall when the minimum temperature is above 32°F. This has to run
    # after the loop; otherwise the 27-34 and 34-45 bands overwrite the zeros.
    data.loc[data['Min_Temperature'] > 32, "Snowfall"] = 0

    return data

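def compute_snow_depth(df, col='Snowfall', window=5, melt_factor=0.2):
    # Start from a float column so the decayed sums are not truncated
    df['Snow_Depth'] = 0.0
    # Add each of the previous `window` days of snowfall, reduced by
    # melt_factor for every day that has passed since it fell
    for i in range(1, window + 1):
        df['Snow_Depth'] += df[col].shift(i, fill_value=0) * ((1 - melt_factor) ** i)

    df['Snow_Depth'] += df[col]  # Add current day's snowfall
    return df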
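
To see what the melting factor does, here is a small standalone sketch on toy data. The name `snow_depth_sketch` is hypothetical; it mirrors the logic of `compute_snow_depth` on a single Series so that it can be run on its own, and the input values are made up.

```python
import pandas as pd

def snow_depth_sketch(snowfall, window=5, melt_factor=0.2):
    """Decayed sum of today's snowfall plus the previous `window` days."""
    depth = snowfall.astype(float).copy()  # current day's snowfall
    for i in range(1, window + 1):
        # Snow from i days ago keeps (1 - melt_factor) ** i of its amount
        depth += snowfall.shift(i, fill_value=0) * (1 - melt_factor) ** i
    return depth

# Toy input: 2 inches of snow on the first day, nothing afterwards
snowfall = pd.Series([2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
depth = snow_depth_sketch(snowfall)
print(depth.round(3).tolist())
```

With these toy values the depth shrinks by 20% each day (2.0, 1.6, 1.28, and so on) and drops to 0 on the seventh day, once the snowfall is older than the 5-day window.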

# Implementing the Other Plans Stated Above

In [None]:
data_without_null_avg_wind_speed = data.dropna(subset=["Avg_Wind_Speed"])
display_number_of_null(data_without_null_avg_wind_speed)

In [None]:
cleaned_data = data_without_null_avg_wind_speed.drop(columns='Percent_Sunshine')

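# Fill missing values before transformations.
# Assign the result back instead of calling fillna(..., inplace=True) on a column,
# and use ffill() because fillna(method='ffill') is deprecated in recent pandas.
cleaned_data['Precipitation'] = cleaned_data['Precipitation'].fillna(0)
cleaned_data['Min_Temperature'] = cleaned_data['Min_Temperature'].ffill()
cleaned_data['Max_Temperature'] = cleaned_data['Max_Temperature'].ffill()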

# Compute Snowfall
cleaned_data = rain_to_snow_conversion(cleaned_data)

# Compute Snow Depth with a realistic melting factor
cleaned_data = compute_snow_depth(cleaned_data, window=5, melt_factor=0.2)

# Compute Avg Temperature (only for missing values)
cleaned_data.loc[cleaned_data['Avg_Temperature'].isnull(), 'Avg_Temperature'] = \
    (cleaned_data['Min_Temperature'] + cleaned_data['Max_Temperature']) / 2

# Drop unnecessary columns
cleaned_data.drop(columns=['Snow_Water_Equivalent', 'Time_Fastest_Mile', 'Time_Peak_Gust', 'Total_Sunshine'], inplace=True)

display_number_of_null(cleaned_data)

In [None]:
fastest_reg_model = sklearn.linear_model.LinearRegression()

# pulling out the values
rows_with_null_targets_direction = cleaned_data[cleaned_data['Fastest_2Min_Wind_Direction'].isnull()] # pulling out our missing values
rows_with_null_targets_speed = cleaned_data[cleaned_data['Fastest_2Min_Wind_Speed'].isnull()] # pulling out our missing values

# Dropping values
cleaned_data.dropna(subset=['Fastest_2Min_Wind_Direction'], inplace=True)
cleaned_data.dropna(subset=['Fastest_2Min_Wind_Speed'], inplace=True)

# pulling out train, validation, and test sets for Fastest_2Min_Wind_Direction
training_direction, non_train_data_direction = sklearn.model_selection.train_test_split(cleaned_data, test_size=0.3, random_state=42)
validation_direction, test_direction = sklearn.model_selection.train_test_split(non_train_data_direction, test_size=0.5, random_state=42)

columns_to_scale = ['Avg_Wind_Speed', 'Snow_Depth', 'Precipitation']

# pulling out train, validation, and test sets for Fastest_2Min_Wind_Speed
training_speed, non_train_data_speed = sklearn.model_selection.train_test_split(cleaned_data, test_size=0.3, random_state=42)
validation_speed, test_speed = sklearn.model_selection.train_test_split(non_train_data_speed, test_size=0.5, random_state=42)

print(cleaned_data)

In [None]:
Number of data points:  9144 


                              Amount Null  Percent Null
Fastest_5Sec_Wind_Direction             0          0.00
Fastest_5Sec_Wind_Speed                 0          0.00
Smoke_Haze                              0          0.00
Tornado_Funnel_Cloud                    0          0.00
Damaging_Winds                          0          0.00
Blowing_Spray                           0          0.00
Drizzle                                 0          0.00
Freezing_Drizzle                        0          0.00
Rain                                    0          0.00
Freezing_Rain                           0          0.00
Station_ID                              0          0.00
Snow                                    0          0.00
Snow_IcePellets_OnGround                0          0.00
Ground_Fog                              0          0.00
Ice_Fog                                 0          0.00
Wave_Height_Specific_Period             0          0.00
Significant_Wave_Height                 0          0.00
year                                    0          0.00
month                                   0          0.00
Blowing_Drifting_Snow                   0          0.00
Dust_BlowingDust_VolcanicAsh            0          0.00
Station_Name                            0          0.00
Min_Temperature                         0          0.00
Avg_Wind_Speed                          0          0.00
Precipitation                           0          0.00
Snowfall                                0          0.00
Snow_Depth                              0          0.00
Avg_Temperature                         0          0.00
Max_Temperature                         0          0.00
Fastest_2Min_Wind_Direction             0          0.00
Glaze_Rime                              0          0.00
Fastest_2Min_Wind_Speed                 0          0.00
Fog_IceFog_HeavyFog                     0          0.00
Heavy_Fog_Mist                          0          0.00
Thunderstorms                           0          0.00
Ice_Pellets                             0          0.00
Hail                                    0          0.00
day                                     0          0.00
    Station_ID                      Station_Name  Avg_Wind_Speed  ...  year  month  day
0  USW00024221  EUGENE MAHLON SWEET FIELD, OR US            8.28  ...  2000      1    1
1  USW00024221  EUGENE MAHLON SWEET FIELD, OR US            8.28  ...  2000      1    2
2  USW00024221  EUGENE MAHLON SWEET FIELD, OR US            9.84  ...  2000      1    3
3  USW00024221  EUGENE MAHLON SWEET FIELD, OR US           10.07  ...  2000      1    4
4  USW00024221  EUGENE MAHLON SWEET FIELD, OR US            4.25  ...  2000      1    5

[5 rows x 43 columns]