# Weather Prediction Using a Machine Learning Model

Using historical weather and climate data from NOAA's Climate Data Online database (https://www.ncdc.noaa.gov/cdo-web/search), we can predict the weather, specifically the temperature, within a given range.

This notebook focuses on the temperature in NYC from 1970 to around late 2022.

## __Set up__
* Import the necessary libraries and the dataset.

In [126]:
import pandas as pd
import matplotlib

weather = pd.read_csv('./lsda/data/weather.csv', index_col='DATE')

## __Preprocessing__

A first look at the dataset shows that many columns contain NULL values. These must be cleaned up before we can work with the data.

In [127]:
'''
Q: Why are there so many NULL values?
A: This dataset was retrieved from the National Oceanic and Atmospheric Administration (NOAA), a US government agency. It is plausible that some of these sensors did not exist at the time or simply had not been installed yet.
'''

# Null count of a column / total number of rows = null fraction of that column
null_percent = weather.apply(pd.isnull).sum() / weather.shape[0]
null_percent

STATION    0.000000
NAME       0.000000
ACMH       0.501478
ACSH       0.501426
AWND       0.265256
FMTM       0.475087
PGTM       0.363872
PRCP       0.000000
SNOW       0.000000
SNWD       0.000104
TAVG       0.680406
TMAX       0.000000
TMIN       0.000000
TSUN       0.998393
WDF1       0.501685
WDF2       0.498678
WDF5       0.502981
WDFG       0.734484
WDFM       0.999948
WESD       0.685228
WSF1       0.501530
WSF2       0.498678
WSF5       0.503033
WSFG       0.613055
WSFM       0.999948
WT01       0.630217
WT02       0.935034
WT03       0.933271
WT04       0.982579
WT05       0.981127
WT06       0.990615
WT07       0.994400
WT08       0.796962
WT09       0.992741
WT11       0.999274
WT13       0.886711
WT14       0.954010
WT15       0.997822
WT16       0.658993
WT17       0.996889
WT18       0.939493
WT21       0.999741
WT22       0.997459
WV01       0.999948
dtype: float64
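The null-fraction calculation can be checked on a small hand-made frame. The sketch below is only an illustration: the `toy` frame and its values are made up, and `isna().mean()` is shown as an equivalent shorthand, not as the notebook's own method.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the NOAA data (values are invented for illustration).
toy = pd.DataFrame({
    "TMAX": [50.0, 52.0, 49.0, 55.0],
    "SNWD": [0.0, np.nan, 0.0, 0.0],
    "TSUN": [np.nan, np.nan, np.nan, 10.0],
})

# isna() yields a boolean frame; the mean of a boolean column is its null fraction.
toy_null_fraction = toy.isna().mean()

# The same quantity, computed the way the notebook cell does it.
same_as_notebook = toy.apply(pd.isnull).sum() / toy.shape[0]

print(toy_null_fraction)
```

Both expressions give one value per column between 0 (no gaps) and 1 (entirely empty), which is how the output above should be read.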

### __Dealing with NULL values__
Keep only the columns whose `null_percent` is below `NULL_THRESHOLD`, and store their names in `valid_columns`.

The selected columns have no null values at all, except for **SNWD**, which is missing in about 0.01% of rows.

In [128]:
NULL_THRESHOLD = 0.01

valid_columns = weather.columns[null_percent < NULL_THRESHOLD]

weather = weather[valid_columns].copy()
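To see what the threshold filter does, here is a minimal sketch on invented data. The `toy` frame and the `TOY_THRESHOLD` value are assumptions chosen so the effect is visible on four rows. It selects columns with `.loc` and a boolean mask, which should be equivalent to indexing `weather.columns` as above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the NOAA data (values are invented for illustration).
toy = pd.DataFrame({
    "TMAX": [50.0, 52.0, 49.0, 55.0],
    "SNWD": [0.0, np.nan, 0.0, 0.0],
    "TSUN": [np.nan, np.nan, np.nan, 10.0],
})

# Hypothetical threshold, much looser than the notebook's 0.01, to suit 4 rows.
TOY_THRESHOLD = 0.30

# Fraction of missing values per column: TMAX 0.00, SNWD 0.25, TSUN 0.75.
toy_null_fraction = toy.isna().mean()

# Keep only the columns whose null fraction is below the threshold.
toy_clean = toy.loc[:, toy_null_fraction < TOY_THRESHOLD].copy()

print(list(toy_clean.columns))  # ['TMAX', 'SNWD']
```

The `.copy()` mirrors the notebook cell: it makes the filtered frame independent of the original, so later assignments do not trigger chained-assignment warnings.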

### __SNWD (Snow Depth) Fill In__

Snow is (or was) very common in New York, so it is important to keep this data for predictions. However, Snow Depth has a few NULL values, so we use `.ffill()` to estimate them from the most recent reading.

`.ffill()` fills NA/NaN values by propagating the last valid observation forward until the next valid one.

In [129]:
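# Forward-fill missing values; after filtering, only SNWD still has gaps.
weather = weather.ffill()

# Recompute the null fraction per column to confirm that no gaps remain.
null_percent = weather.apply(pd.isnull).sum() / weather.shape[0]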
null_percent

STATION    0.0
NAME       0.0
PRCP       0.0
SNOW       0.0
SNWD       0.0
TMAX       0.0
TMIN       0.0
dtype: float64
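The forward fill used for SNWD can be illustrated on a short series. This is a sketch with made-up readings and dates, not values from the NOAA file. It also shows one caveat: a gap at the very start has no earlier observation to copy, so it stays empty.

```python
import numpy as np
import pandas as pd

# Invented snow-depth readings with gaps (not taken from the NOAA file).
snow_depth = pd.Series(
    [np.nan, 2.0, np.nan, np.nan, 0.0],
    index=pd.date_range("2022-01-01", periods=5, freq="D"),
    name="SNWD",
)

# Each NaN takes the most recent earlier valid value.
filled = snow_depth.ffill()

print(filled)
# The two gaps after the 2.0 reading become 2.0.
# The leading NaN remains because nothing precedes it.
```

In the notebook the check after `.ffill()` reports 0.0 for every column, which suggests the first row has no missing SNWD value.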

## __Future Steps__

Because NOAA limits exports to 1.00 GB, requesting data for entire countries is not possible. Access to much larger datasets, on the order of 100 GB or more, would allow the model to be trained on far more data.