Task 1:

I wanted to do something related to weather, as there tends to be a wide variety of datasets to choose from, and because weather patterns are affected by what happened previously. I found a dataset with various weather information over a decade, and decided to use it to predict what temperature it would be in the next hour: 

[Link: https://www.kaggle.com/budincsevity/szeged-weather ]

In terms of EDA, this dataset was fairly well maintained, with no missing values or missing recordings. A few data categories were unusable, such as Loud Cover (which always had a value of 0) and Daily Summary (which tended to be too specific to turn into a meaningful feature). The other categories consisted of: the date and time of the recording; a summary of the individual recording (27 different values, with "Partly Cloudy" as the most common); the precipitation type (rain, snow, or none); humidity on a scale from 0 to 1, typically around 0.75; wind speed in km/h (usually around 5-10); wind direction in degrees; visibility in km (average: 10 km); and pressure (consistently around 1000 millibars). Most important, however, was the Temperature data, which had an average value of 11.9 degrees Celsius and a standard deviation of 9.55 degrees (Apparent Temperature was usually 1 or 2 degrees cooler).
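The kind of checks described above can be sketched with pandas. This is only an illustration on a tiny made-up table, not the real dataset: the "Summary" column name and all of the values are assumptions chosen for the example, and only "Temperature (C)" and "Loud Cover" are names taken from this notebook.

```python
import pandas as pd

# Toy stand-in for a few rows of the weather table (values are made up)
toy = pd.DataFrame({
    "Temperature (C)": [9.5, 9.4, 9.4, 8.3],
    "Loud Cover": [0.0, 0.0, 0.0, 0.0],
    "Summary": ["Partly Cloudy", "Partly Cloudy", "Mostly Cloudy", "Overcast"],
})

# Count missing values per column
missing = toy.isna().sum()

# Columns with a single distinct value carry no information
constant_cols = [c for c in toy.columns if toy[c].nunique() == 1]

# Most frequent category, plus mean and standard deviation of the temperature
most_common = toy["Summary"].value_counts().idxmax()
temp_mean = toy["Temperature (C)"].mean()
temp_std = toy["Temperature (C)"].std()

print(missing)
print("Constant columns:", constant_cols)
print("Most common summary:", most_common)
print("Temperature mean/std:", temp_mean, temp_std)
```

Running the same calls on the full DataFrame is how a constant column such as Loud Cover would show up.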

I originally planned to use most of the data for this task. However, the first few runs of my code, which used only the Temperature data, showed extremely promising results, so I opted to keep using only the Temperature data and save on runtime. Adding the extra data would likely not improve the accuracy significantly, and could potentially lead to overfitting.

For the model, I used TensorFlow's Keras, as it makes it easy to switch between SimpleRNN layers and LSTM/GRU layers. I started with a chronological train/dev/test split, then split each set into overlapping 24-hour segments so as to maximize the number of training samples the model has. I used ReLU activation, as the temperatures tended to be positive and not close to 0, meaning most of the information is kept when being passed through. After the recurrent layer, I added one hidden layer of 10 nodes, then a final output layer of 1 node (the predicted temperature). For my loss function, I used mean squared error: since the values are not bounded between 0 and 1, the goal is simply to get as close to the actual value as possible.
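As a small illustration of the overlapping 24-hour segments, here is a minimal sketch on a made-up series. The helper name make_windows and the toy values are assumptions for this example only; the actual code for this task is in the cell further down.

```python
import numpy as np

def make_windows(series, window=24):
    """Return (X, y): X[i] holds `window` consecutive readings, y[i] is the reading right after them."""
    series = np.asarray(series, dtype=float)
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

# Toy stand-in for hourly temperatures: 0.0, 1.0, ..., 29.0
toy = np.arange(30, dtype=float)
X, y = make_windows(toy, window=24)

print(X.shape, y.shape)  # number of windows x window length, and number of targets
print(X[0][-1], y[0])    # last input of the first window, and the value it should predict
```

Because consecutive windows are shifted by a single hour, a series of n readings yields n - 24 training samples instead of roughly n / 24 non-overlapping ones.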

The link below was useful in helping me get started with my RNN, even though I eventually used a very different dataset and hyperparameters.

[Link: https://www.datatechnotes.com/2018/12/rnn-example-with-keras-simplernn-in.html ]

In [None]:
# Import TensorFlow, Pandas, and Numpy
import tensorflow as tf
import pandas as pd
import numpy as np
import math
from keras.models import Sequential
from keras.layers import Dense, SimpleRNN, LSTM

#https://www.datatechnotes.com/2018/12/rnn-example-with-keras-simplernn-in.html
#https://www.kaggle.com/budincsevity/szeged-weather

#read in data
data = pd.read_csv('weatherHistory.csv')

#keep only the temperature column
data = data[["Temperature (C)"]]

#split into train/dev/test chronologically, NOT randomly
train, dev, test = data.values[0:80000,:], data.values[80000:90000,:], data.values[90000:len(data),:]

#split data into overlapping 24-hour windows; the target is the hour right after each window
def split(dat):
  X, Y = [], []
  for i in range(len(dat)-24):
    d = i+24
    X.append(dat[i:d])
    Y.append(dat[d])
  return np.array(X), np.array(Y)

train_X, train_Y = split(train)
dev_X, dev_Y = split(dev)
test_X, test_Y = split(test)

#reshape from (samples, 24, 1) to (samples, 1, 24): one timestep with 24 features
train_X = np.reshape(train_X, (train_X.shape[0], 1, train_X.shape[1]))
dev_X = np.reshape(dev_X, (dev_X.shape[0], 1, dev_X.shape[1]))
test_X = np.reshape(test_X, (test_X.shape[0], 1, test_X.shape[1]))

#define the model
model = Sequential()
model.add(SimpleRNN(units=32, input_shape=(1,24), activation="relu"))
model.add(Dense(10, activation="relu")) 
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='rmsprop')
model.summary()

model.fit(train_X,train_Y, epochs=20, batch_size=16, verbose=1)

#mean squared error on the dev set
devPredict = model.predict(dev_X)
dev_mse = np.mean((dev_Y - devPredict) ** 2)
print("Mean Sq. Err of Dev Set: " + str(dev_mse))

#mean squared error on the test set
testPredict = model.predict(test_X)
test_mse = np.mean((test_Y - testPredict) ** 2)
print("Mean Sq. Err of Test Set: " + str(test_mse))

#redefine the model with LSTM
model = Sequential()
model.add(LSTM(units=32, input_shape=(1,24), activation="relu"))
model.add(Dense(10, activation="relu")) 
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='rmsprop')
model.summary()

model.fit(train_X,train_Y, epochs=20, batch_size=16, verbose=1)

#mean squared error on the dev set
devPredict = model.predict(dev_X)
dev_mse = np.mean((dev_Y - devPredict) ** 2)
print("Mean Sq. Err of Dev Set: " + str(dev_mse))

#mean squared error on the test set
testPredict = model.predict(test_X)
test_mse = np.mean((test_Y - testPredict) ** 2)
print("Mean Sq. Err of Test Set: " + str(test_mse))

Model: "sequential_10"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
simple_rnn_8 (SimpleRNN)     (None, 32)                1824      
_________________________________________________________________
dense_18 (Dense)             (None, 10)                330       
_________________________________________________________________
dense_19 (Dense)             (None, 1)                 11        
Total params: 2,165
Trainable params: 2,165
Non-trainable params: 0
_________________________________________________________________
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Mean Sq. Err of Dev Set: 1.1067663871986384
Mean Sq. Err of Test Set: 1.0537921152838499
Model: "sequential_11"
______________________________________

After switching to an LSTM cell-based structure, the model actually did slightly worse. This surprised me initially, until I realized the non-LSTM model likely relied on the reading from exactly 24 hours earlier to inform its guess (as temperatures tend to oscillate over a 24-hour period). My guess is that, because this reading would sit in the long-term memory of the LSTM cell, it was less reliable to retrieve and so had a smaller overall impact on the result. However, the difference is barely noticeable, and both models usually get within 1 or 2 degrees of the actual value, which I consider fairly accurate.
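To relate the mean squared errors to an error in degrees, one option is to take the square root (the root mean squared error). The sketch below only re-uses the two values printed for the first model in the output above; the variable names are chosen for this example.

```python
import math

# Mean squared errors printed above for the first (SimpleRNN) model
dev_mse = 1.1067663871986384
test_mse = 1.0537921152838499

# The square root puts the error back in degrees Celsius
dev_rmse = math.sqrt(dev_mse)
test_rmse = math.sqrt(test_mse)

print(f"Dev RMSE:  {dev_rmse:.2f} C")
print(f"Test RMSE: {test_rmse:.2f} C")
```

Both values come out at roughly one degree, which is in line with the "within 1 or 2 degrees" impression.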

Task 2:

I opted to use a pre-trained embedding that was trained on CORD-19, a collection of COVID-19 research articles.

[Link: https://www.tensorflow.org/hub/tutorials/cord_19_embeddings_keras ]

For a dissimilarity score, I used the squared Euclidean distance between the two vectors. High values mean that the words are fairly far from each other, whereas low values mean they are closer. This has the added benefit that words which appear less often in the training set end up far away from words that appear often, as there is not yet enough data to place them accurately. From my testing, it seems to work fairly well, even though there are a few flaws (it rates two words it has never seen before as very similar).
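To show how the two scores behave, here is a minimal sketch on small hand-made vectors rather than real embeddings. The function names and the vectors u, v, and w are assumptions for this example only.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1 = same direction, 0 = orthogonal)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def squared_distance(a, b):
    """Squared Euclidean distance between a and b (0 = identical vectors)."""
    diff = a - b
    return float(np.dot(diff, diff))

u = np.array([1.0, 2.0, 2.0])
v = np.array([2.0, 4.0, 4.0])   # same direction as u, twice the length
w = np.array([2.0, -1.0, 0.0])  # orthogonal to u

print(cosine_similarity(u, v), squared_distance(u, v))
print(cosine_similarity(u, w), squared_distance(u, w))
```

The pair u, v illustrates the difference between the two measures: cosine similarity ignores vector length and treats them as identical in direction, while the distance score still reports a gap between them.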

In [12]:
import functools
import itertools
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd

import tensorflow as tf

import tensorflow_datasets as tfds
import tensorflow_hub as hub

from tqdm import trange

import math

#https://www.tensorflow.org/hub/tutorials/cord_19_embeddings_keras

#similarity (cosine) and dissimilarity (squared Euclidean distance)
def correlation(word1, word2):
  print("Cosine Similarity:      ", (np.inner(word1, word2)/(np.sqrt(np.inner(word1, word1))*np.sqrt(np.inner(word2, word2)))))
  print("Distance Dissimilarity: ", np.inner(word1-word2, word1-word2))

#load module
module = hub.load('https://tfhub.dev/tensorflow/cord-19/swivel-128d/3')

while True:
  word1 = input("Enter first word (Q to quit): ")
  if (word1=="Q"):
    break
  word2 = input("Enter second word:            ")
  correlation(module([word1]), module([word2]))
  #print(word1)
  #print(word2)

Enter first word (Q to quit): Spain
Enter second word:            Italy
Cosine Similarity:       [[0.5320724]]
Distance Dissimilarity:  [[12.1632224]]
Enter first word (Q to quit): SARS
Enter second word:            MERS
Cosine Similarity:       [[0.67628618]]
Distance Dissimilarity:  [[5.56547353]]
Enter first word (Q to quit): cough
Enter second word:            fever
Cosine Similarity:       [[0.44891569]]
Distance Dissimilarity:  [[14.94088086]]
Enter first word (Q to quit): Coronavirus
Enter second word:            throat
Cosine Similarity:       [[0.01351204]]
Distance Dissimilarity:  [[25.09213602]]
Enter first word (Q to quit): Europe
Enter second word:            Europe
Cosine Similarity:       [[1.]]
Distance Dissimilarity:  [[0.]]
Enter first word (Q to quit): Q
