**IMPORTING DATA**

Kaggle Data: https://www.kaggle.com/datasets/grubenm/austin-weather/

This Kaggle dataset contains daily weather data for Austin, Texas: each row has a date and the relevant climate/weather measurements.  I uploaded **austin_weather.csv** into the content folder in Colab.


The dataset only gives the plain date, so I assigned each date to a season by month:

Winter: 12/1 to 2/28 (2/29 in leap years)

Spring: 3/1 to 5/31

Summer: 6/1 to 8/31

Autumn: 9/1 to 11/30

In [16]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [30]:
df = pd.read_csv('austin_weather.csv')


In [31]:
# convert the Date column from strings to pandas datetimes
df['Date'] = pd.to_datetime(df['Date'])

def convertToSeason(date):
  """Map a date to a season code based on its month."""
  if date.month >= 12 or date.month < 3:
    return 0 # winter (December to February)

  elif 3 <= date.month < 6:
    return 1 # spring (March to May)

  elif 6 <= date.month < 9:
    return 2 # summer (June to August)

  elif 9 <= date.month < 12:
    return 3 # autumn (September to November)

  else:
    return -1 # no valid month, e.g. a missing date (NaT)

# create a new column holding the season code for each row
df['Season'] = df['Date'].apply(convertToSeason)
df.head()

Unnamed: 0,Date,TempHighF,TempAvgF,TempLowF,DewPointHighF,DewPointAvgF,DewPointLowF,HumidityHighPercent,HumidityAvgPercent,HumidityLowPercent,...,SeaLevelPressureLowInches,VisibilityHighMiles,VisibilityAvgMiles,VisibilityLowMiles,WindHighMPH,WindAvgMPH,WindGustMPH,PrecipitationSumInches,Events,Season
0,2013-12-21,74,60,45,67,49,43,93,75,57,...,29.59,10,7,2,20,4,31,0.46,"Rain , Thunderstorm",0
1,2013-12-22,56,48,39,43,36,28,93,68,43,...,29.87,10,10,5,16,6,25,0,,0
2,2013-12-23,58,45,32,31,27,23,76,52,27,...,30.41,10,10,10,8,3,12,0,,0
3,2013-12-24,61,46,31,36,28,21,89,56,22,...,30.3,10,10,7,12,4,20,0,,0
4,2013-12-25,58,50,41,44,40,36,86,71,56,...,30.27,10,10,7,10,2,16,T,,0
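
The month-to-season mapping used above can also be written as a single arithmetic expression. This is a minimal sketch, not the notebook's own code: `season_from_month` and `SEASON_NAMES` are hypothetical names introduced here, and the sample dates are only for illustration.

```python
from datetime import date

SEASON_NAMES = ["winter", "spring", "summer", "autumn"]

def season_from_month(month: int) -> int:
    """Map a calendar month (1-12) to 0=winter, 1=spring, 2=summer, 3=autumn."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    # December wraps around to 0, so Dec/Jan/Feb fall in the first block of three.
    return (month % 12) // 3

# Illustrative dates, including a leap day and two season boundaries.
for d in (date(2013, 12, 21), date(2016, 2, 29), date(2016, 3, 1), date(2016, 11, 30)):
    code = season_from_month(d.month)
    print(d.isoformat(), code, SEASON_NAMES[code])
```

With pandas, the same arithmetic should work on the whole column at once, for example `(df['Date'].dt.month % 12) // 3`, which avoids calling a Python function per row.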


In [47]:
scaler = StandardScaler()

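# columns to standardize: the 12 numeric measurement columns that follow 'Date'
# (temperature, dew point, humidity and sea-level pressure)
scale_features = df.columns[1:13].tolist()

# convert non-numeric values in the scale_features columns to NaN
for feature in scale_features:
    df[feature] = pd.to_numeric(df[feature], errors='coerce')

# fill missing values with the column mean
# (simple, but might not be the best way to impute data)
df[scale_features] = df[scale_features].fillna(df[scale_features].mean())

# standardize each column to zero mean and unit variance
df[scale_features] = scaler.fit_transform(df[scale_features])

# build the model inputs: X holds the features, y holds the label (Season)
# get_dummies one-hot encodes the remaining text columns, such as Events;
# columns outside scale_features that still hold text (for example 'T' in
# PrecipitationSumInches) are encoded as categories too
X = pd.get_dummies(df.drop(['Date', 'Season', 'VisibilityHighMiles'], axis=1))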
y = df['Season']
X.head()

Unnamed: 0,TempHighF,TempAvgF,TempLowF,DewPointHighF,DewPointAvgF,DewPointLowF,HumidityHighPercent,HumidityAvgPercent,HumidityLowPercent,SeaLevelPressureHighInches,...,PrecipitationSumInches_T,Events_,Events_Fog,"Events_Fog , Rain","Events_Fog , Rain , Thunderstorm","Events_Fog , Thunderstorm",Events_Rain,"Events_Rain , Snow","Events_Rain , Thunderstorm",Events_Thunderstorm
0,-0.464929,-0.758011,-1.050594,0.404923,-0.514983,-0.491606,0.46504,0.668947,0.710161,-1.402099,...,0,0,0,0,0,0,0,0,1,0
1,-1.684364,-1.612677,-1.473568,-1.366924,-1.39211,-1.421071,0.46504,0.106673,-0.115539,1.655567,...,0,1,0,0,0,0,0,0,0,0
2,-1.548871,-1.826343,-1.967038,-2.252848,-1.999351,-1.730893,-1.07676,-1.178524,-1.059196,2.489476,...,0,1,0,0,0,0,0,0,0,0
3,-1.345632,-1.755121,-2.037533,-1.883713,-1.93188,-1.854821,0.102263,-0.857225,-1.354088,2.489476,...,0,1,0,0,0,0,0,0,0,0
4,-1.548871,-1.470232,-1.332577,-1.293097,-1.122224,-0.925356,-0.169819,0.347647,0.651182,1.655567,...,1,1,0,0,0,0,0,0,0,0
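
The `PrecipitationSumInches_T` column in the output above shows that some measurement columns are being one-hot encoded as text rather than treated as numbers. One possible alternative is to coerce every measurement column to numeric before encoding. The sketch below runs on a small made-up frame, not the real CSV. It assumes that `'T'` marks a trace amount and `'-'` marks a missing value, and the choice of `0.0` for a trace is also an assumption; `coerce_numeric` is a hypothetical helper.

```python
import pandas as pd

# Toy frame with made-up values that mimic text placeholders in numeric columns.
toy = pd.DataFrame({
    "WindAvgMPH": ["4", "6", "-", "3"],
    "PrecipitationSumInches": ["0.46", "0", "T", "-"],
})

def coerce_numeric(frame, trace_value=0.0):
    """Return a copy where every column is numeric.

    'T' is replaced by trace_value (an assumption about its meaning),
    any other non-numeric entry becomes NaN, and NaNs are then filled
    with the column mean.
    """
    cleaned = frame.copy()
    for col in cleaned.columns:
        as_text = cleaned[col].astype(str).str.strip()
        # Keep entries that are not 'T'; swap 'T' for the trace value.
        as_text = as_text.where(as_text != "T", str(trace_value))
        cleaned[col] = pd.to_numeric(as_text, errors="coerce")
    return cleaned.fillna(cleaned.mean())

cleaned = coerce_numeric(toy)
print(cleaned)
```

After a step like this, only genuinely categorical columns such as `Events` would need `pd.get_dummies`.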


In [5]:
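# make the training and testing data sets (80% train, 20% test)
# no random_state is set, so the split (and the final accuracy) changes on every run

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)
y_train.head()  # not displayed: a notebook cell only shows its last expression

X_train.head()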

Unnamed: 0,TempHighF,TempAvgF,TempLowF,DewPointHighF,DewPointAvgF,DewPointLowF,HumidityHighPercent,HumidityAvgPercent,HumidityLowPercent,SeaLevelPressureHighInches,...,PrecipitationSumInches_T,Events_,Events_Fog,"Events_Fog , Rain","Events_Fog , Rain , Thunderstorm","Events_Fog , Thunderstorm",Events_Rain,"Events_Rain , Snow","Events_Rain , Thunderstorm",Events_Thunderstorm
133,0.89,0.381543,-0.13415,-1.440751,-1.459581,-1.545,-3.978971,-3.186645,-2.061831,-0.290221,...,0,1,0,0,0,0,0,0,0,0
522,0.280282,0.381543,0.500311,0.995539,0.901914,0.809645,0.555734,1.23122,1.417903,-0.067845,...,1,0,0,0,0,0,0,0,1,0
849,-0.397182,-0.117012,0.218328,0.47875,0.564558,0.747681,1.099898,1.632844,1.653818,0.098937,...,0,0,0,0,0,0,0,0,1,0
471,0.280282,0.310321,0.359319,0.404923,0.564558,0.809645,1.099898,0.749271,0.35629,-0.56819,...,0,1,0,0,0,0,0,0,0,0
1182,0.144789,0.239099,0.359319,0.183442,0.362144,0.623752,-0.079125,0.186998,0.35629,0.376907,...,0,1,0,0,0,0,0,0,0,0
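
In the cells above the scaler is fitted on the full data set before the split, so the test rows influence the scaling statistics. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on randomly generated toy data, not the weather data; `random_state=42` and `stratify` are choices made for this example, not settings from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data: 200 rows, 3 numeric features, 4 balanced classes (made up).
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=70.0, scale=15.0, size=(200, 3))
y_toy = np.repeat([0, 1, 2, 3], 50)

# Split first; stratify keeps the class proportions equal in both parts,
# and random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Fit the scaler on the training rows only, then reuse it for the test rows.
toy_scaler = StandardScaler()
X_tr_scaled = toy_scaler.fit_transform(X_tr)
X_te_scaled = toy_scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0))  # close to zero by construction
print(np.bincount(y_te))         # class counts in the test part
```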


**DEPENDENCIES**

In [13]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import Dropout
from tensorflow.keras.optimizers import Adam, RMSprop, Adagrad  # optimizers to experiment with
from tensorflow.keras.callbacks import ReduceLROnPlateau  # lowers the learning rate when a monitored metric stops improving

In [7]:
y_train_encoded = to_categorical(y_train, num_classes=4)
y_test_encoded = to_categorical(y_test, num_classes=4)

model = Sequential()
model.add(Dense(units=32, activation='relu', input_dim=len(X_train.columns)))
model.add(Dropout(0.3))  # adding dropout
model.add(Dense(units=64, activation='relu'))
model.add(Dropout(0.3))
# The output layer needs one node per class (4 seasons), which makes sense
# softmax turns the layer's raw scores into probabilities that sum to 1;
# the predicted class is chosen later by taking the highest probability (argmax)
model.add(Dense(units=4, activation='softmax'))

# Adam optimizer with a small learning rate (the learning rate and other parameters can be adjusted)
# NOTE: this instance is not passed to compile() below, so the model actually trains with SGD
optimizer_instance = Adam(learning_rate=0.0002)

# categorical_crossentropy is the loss typically used for multi-class classification with one-hot labels
# SGD is the standard optimizer: weights are updated after each mini-batch
# Adam combines momentum with RMSProp-style per-parameter step sizes and is often recommended
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])

# I also tried a dynamic learning rate with an Adam optimizer, but it didn't improve the accuracy much
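
To make the difference between the two optimizers mentioned in the comments more concrete, here is a minimal sketch of their update rules on a one-dimensional toy loss, f(w) = (w - 3)^2. It is an illustration of the textbook formulas, not Keras's internal code. The hyperparameters are the commonly quoted Adam defaults, except for the learning rate of 0.1, which was picked for this toy problem.

```python
import math

def grad(w):
    # Gradient of the toy loss f(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)

def sgd_minimize(w, steps=100, lr=0.1):
    # Plain gradient descent: step against the gradient with a fixed rate.
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def adam_minimize(w, steps=1000, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m = 0.0  # running average of gradients (the momentum part)
    v = 0.0  # running average of squared gradients (the RMSProp part)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # Bias correction: both averages start at zero and are too small early on.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

w_sgd = sgd_minimize(0.0)
w_adam = adam_minimize(0.0)
print(w_sgd, w_adam)  # both should end near the minimum at w = 3
```

In Keras, switching the model to Adam would mean passing an optimizer object (such as `optimizer_instance`) to `compile()` instead of the string `'sgd'`.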


In [8]:
#fit the model
model.fit(X_train, y_train_encoded, epochs=100, batch_size=32, validation_data=(X_test, y_test_encoded))



Epoch 1/100
...

<keras.src.callbacks.History at 0x7d718834d090>

Training for many epochs gets to around 70 percent accuracy, which is better than the roughly 25% expected from guessing one of the four seasons at random, but obviously not ideal.

In [46]:
#Testing accuracy

from sklearn.metrics import accuracy_score

# predict class probabilities using my current model
y_prob = model.predict(X_test)

# converting probabilities to class labels
y_hat = y_prob.argmax(axis=1)

accuracy = accuracy_score(y_test, y_hat)
print(accuracy)

0.7196969696969697
