# RNNs and LSTMs

Build an RNN model to classify text and an LSTM model for anomaly detection (also called outlier detection) on temperature sensor data.

---

## Task 1: Text Classification

This task trains a sentiment analysis model, built on a recurrent neural network (RNN), that classifies each given sentence as **positive or negative**.
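
As a rough illustration of what "recurrent" means here, the sketch below runs the standard simple-RNN update `h_t = tanh(W_x x_t + W_h h_{t-1} + b)` over a toy sequence in NumPy. The sizes, the random weights, and the `rnn_forward` helper are illustrative assumptions and not part of the assignment code; Keras layers such as `SimpleRNN` and `LSTM` build on the same idea of carrying a hidden state from one time step to the next.

```python
import numpy as np

def rnn_forward(inputs, w_x, w_h, b):
    """Run a simple (Elman) RNN over one sequence and return the final hidden state."""
    h = np.zeros(w_h.shape[0])
    for x_t in inputs:  # one time step per token vector
        h = np.tanh(w_x @ x_t + w_h @ h + b)
    return h

rng = np.random.default_rng(0)
seq_len, embed_dim, hidden_dim = 5, 4, 3  # toy sizes, chosen arbitrarily
inputs = rng.normal(size=(seq_len, embed_dim))
w_x = rng.normal(size=(hidden_dim, embed_dim))
w_h = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

final_state = rnn_forward(inputs, w_x, w_h, b)
print(final_state.shape)  # (3,)
```

The final hidden state summarizes the whole sequence, which is why a classifier head (a `Dense` layer with a sigmoid) can be placed directly on top of it.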

---

**Tasks**

1. Load data
2. Preprocess data
3. Build RNN model
4. Train model
5. Predict

In [None]:
# Write your code here

# Extracting training data
with open('task1_training_data.txt', encoding="utf8") as file:
    train_data = file.readlines()

data_labels = []
data_text = []
for lines in train_data:
    x = lines.split(' +++$+++ ')
    data_labels.append(int(x[0]))
    data_text.append(x[1])

# Splitting data into training and validation sets
l = int(0.8*len(data_text))
train_text = data_text[:l]
val_text = data_text[l:]
train_labels = data_labels[:l]
val_labels = data_labels[l:]
    
# Extracting test data
with open('task1_test_data.txt', encoding="utf8") as testfile:
    test_data = testfile.readlines()

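test_text = []
for line in test_data[1:]:  # skip the first (header) line
    # Each row starts with "<id>,": split on the first comma only, so that
    # digits or commas inside the sentence cannot break the parsing
    test_text.append(line.split(',', 1)[1])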

print('First 10 examples from training set:\n', train_text[:10])
print('Labels for first 10 examples from training set:\n', train_labels[:10])
print('First 10 examples from test set:\n', test_text[:10])
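
The parsing logic above can be checked without the data files. The following is a minimal sketch that assumes the two line layouts implied by the loading code (`<label> +++$+++ <sentence>` for training, `<id>,<sentence>` for test); the sample lines and the helper names `parse_training_line` and `parse_test_line` are made up for illustration.

```python
def parse_training_line(line, sep=" +++$+++ "):
    """Split '<label> +++$+++ <sentence>' into (int label, sentence)."""
    label, text = line.split(sep, 1)
    return int(label), text.strip()

def parse_test_line(line):
    """Split '<id>,<sentence>' on the first comma only."""
    _, text = line.split(",", 1)
    return text.strip()

# Made-up lines in the two layouts the loading code expects
train_lines = ["1 +++$+++ what a great day\n", "0 +++$+++ i am so tired , ugh\n"]
test_lines = ["0,my dog ate my homework , again\n", "1,10, 9, 8 ... liftoff\n"]

train_pairs = [parse_training_line(l) for l in train_lines]
test_sentences = [parse_test_line(l) for l in test_lines]
print(train_pairs)
print(test_sentences)
```

Limiting each split to one occurrence (`maxsplit=1`) keeps sentences intact even when they contain commas or the separator-like digits of the row id.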

In [None]:
from tensorflow.keras.layers import TextVectorization  # stable import path in TF 2.6+
import numpy as np
max_tokens = 2000
max_len = 50

# Text vectorization layer
v_layer = TextVectorization(max_tokens = max_tokens, output_mode="int",
                            output_sequence_length = max_len)

# Initializing the layer to create vocabulary
v_layer.adapt(train_text)

vocab = np.array(v_layer.get_vocabulary())
print('First 20 tokens in vocabulary:\n', vocab[:20])

# Encoded data example
ex_enc = v_layer(train_text).numpy()
print(ex_enc[:10,:20])
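
To make the integer encoding less of a black box, here is a pure-Python approximation of what the layer does in `output_mode="int"`: lowercase and strip punctuation, split on whitespace, keep the most frequent tokens, reserve index 0 for padding and index 1 for out-of-vocabulary words, then pad or truncate to a fixed length. The helpers `tokenize`, `build_vocab`, and `encode` are hypothetical names, and the punctuation handling is only an approximation of the layer's default standardization.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, drop punctuation, split on whitespace
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def build_vocab(texts, max_tokens):
    """Most frequent tokens first; index 0 is padding, index 1 is out-of-vocabulary."""
    counts = Counter(tok for t in texts for tok in tokenize(t))
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def encode(text, vocab, max_len):
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in tokenize(text)][:max_len]
    return ids + [0] * (max_len - len(ids))  # right-pad with zeros

texts = ["I love this movie!", "I hate this film."]  # made-up corpus
toy_vocab = build_vocab(texts, max_tokens=6)
encoded = encode("I love this book", toy_vocab, max_len=6)
print(toy_vocab)
print(encoded)
```

Because "book" never appeared in the toy corpus, it maps to index 1, and the two trailing zeros are padding up to `max_len`.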

In [None]:
import tensorflow as tf
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, SimpleRNN, Activation, Embedding, LSTM
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.optimizers import Adam, SGD
from tensorflow.keras.utils import plot_model

# Building the recurrent model (an LSTM layer is used as the recurrent layer)
model = Sequential([
    Input(shape=(1,), dtype="string"),
    v_layer,
    Embedding(max_tokens + 1, 128),
    LSTM(64),
    Dense(64, activation = "relu"),
    Dense(1, activation = "sigmoid")
])

#model.summary()
plot_model(model, show_shapes=True)

In [None]:
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Train the model
model_history = model.fit(train_text, train_labels, epochs=10, batch_size=128, 
                          validation_data = (val_text, val_labels))

In [None]:
import matplotlib.pyplot as plt

#plotting performance
fig = plt.figure(figsize=(12, 4))
plt.subplot(1,2,1)
plt.plot(model_history.history['accuracy'], c='b')
plt.plot(model_history.history['val_accuracy'], c='r')
plt.legend(['Training set', 'Validation set'])
plt.title('Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.grid()

plt.subplot(1,2,2)
plt.plot(model_history.history['loss'], c='b')
plt.plot(model_history.history['val_loss'], c='r')
plt.legend(['Training set', 'Validation set'])
plt.title('Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.grid()

plt.show()

In [None]:
import csv

predictions = model.predict(test_text)

for i in range(0,10):
    print("\nTest sentence: ", test_text[i])
    print("\nPredicted sentiment score (sigmoid output): ", predictions[i][0])
    
# Writing predicted data to csv file
field = ['Predicted value', 'Sentence']
row_data = []
for j in range(0,len(test_text)):
    # predictions has shape (n, 1): take the scalar so the file holds 0.93, not [0.93]
    row = [predictions[j][0], test_text[j].strip()]
    row_data.append(row)

# newline='' keeps the csv module from writing blank rows on Windows
with open('ResultsCSV', 'w', newline='', encoding="utf8") as file:
    write = csv.writer(file)
    
    write.writerow(field)
    write.writerows(row_data)
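
The model outputs a sigmoid score between 0 and 1 rather than a hard label. A minimal sketch of turning scores into labels follows; the 0.5 cut-off, the made-up scores, and the `to_labels` helper are assumptions for illustration, and which class the label 1 stands for depends on how the training file encodes sentiment.

```python
def to_labels(scores, threshold=0.5):
    """Map sigmoid scores in [0, 1] to hard 0/1 labels."""
    return [1 if s >= threshold else 0 for s in scores]

scores = [0.91, 0.12, 0.50, 0.49]  # made-up values
labels = to_labels(scores)
print(labels)  # [1, 0, 1, 0]
```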

## Task 2: Anomaly Detection

In manufacturing, anomaly detection is applied to sensor readings to predict abnormal machine behavior. In machine learning and data mining, anomaly detection is the task of identifying rare items, events, or observations that look suspicious because they differ from the majority of the data. In this task, you will predict possible system failures from temperature data. A failure is detected by checking whether the readings follow the trend of the majority of the data.

---

**Dataset**

The given dataset (`ambient_temperature_system_failure.csv`) is part of the Numenta Anomaly Benchmark (NAB) dataset, a novel benchmark for evaluating machine learning algorithms for anomaly detection.


1. Load data
2. Preprocess data
3. Feature Engineering
4. Prepare training and testing data
5. Build LSTM model
6. Train model
7. Find anomalies

In [None]:
# Write your code here

# Required libraries
import pandas as pd
import time

# Reading file - ambient_temperature_system_failure
df = pd.read_csv("ambient_temperature_system_failure.csv")
df.head(5)

In [None]:
# Visualizing time-series data
figsize=(10,5)
df.plot(x='timestamp', y='value', figsize=figsize, title='Temperature (F)')
plt.grid()
plt.show()

In [None]:
# Pre-processing data
df['timestamp'] = pd.to_datetime(df['timestamp'])  # converting timestamp strings into datetime values
df['value'] = (df['value'] - 32) * (5/9)  # converting temperature into celsius from fahrenheit
df.plot(x='timestamp', y='value', figsize=figsize, title='Temperature (C)')
plt.grid()
plt.show()


In [None]:
# Feature engineering: deriving time-based features from the timestamp
df['hours'] = df['timestamp'].dt.hour
df['dayornight'] = ((df['hours'] >= 7) & (df['hours'] <= 22)).astype(int)
df['dayoftheweek'] = df['timestamp'].dt.dayofweek
df['weekday'] = (df['dayoftheweek'] < 5).astype(int)

df.head(5)
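
The four derived features can be reproduced for a single timestamp with the standard library, which makes the encoding easy to verify by hand. This is a sketch only: `time_features` is a hypothetical helper and the sample timestamp is made up; the 7 to 22 o'clock "day" range and the Monday=0 convention mirror the pandas code above.

```python
from datetime import datetime

def time_features(ts):
    """Derive the four calendar features from one timestamp."""
    hour = ts.hour
    day_of_week = ts.weekday()  # Monday=0 ... Sunday=6, same convention as pandas dayofweek
    return {
        "hours": hour,
        "dayornight": int(7 <= hour <= 22),   # 1 = day, 0 = night
        "dayoftheweek": day_of_week,
        "weekday": int(day_of_week < 5),      # 1 = Monday to Friday
    }

sample = datetime(2013, 7, 4, 23, 0)  # made-up timestamp (a Thursday, 23:00)
features = time_features(sample)
print(features)
```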

In [None]:
from sklearn import preprocessing

# Normalizing data for LSTM model
data = df[['value', 'hours', 'dayornight', 'dayoftheweek', 'weekday']]
min_max_scaler = preprocessing.MinMaxScaler()
d_scaled = min_max_scaler.fit_transform(data)
data = pd.DataFrame(d_scaled)
data.head()
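
Min-max scaling maps each column to the range [0, 1] via `(x - min) / (max - min)`, computed per column. The NumPy sketch below shows the arithmetic on a made-up two-column array; `min_max_scale` is a hypothetical helper written for illustration, not the scikit-learn implementation.

```python
import numpy as np

def min_max_scale(columns):
    """Rescale each column to [0, 1] using its own minimum and maximum."""
    columns = np.asarray(columns, dtype=float)
    col_min = columns.min(axis=0)
    col_range = columns.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid division by zero for constant columns
    return (columns - col_min) / col_range

toy = np.array([[20.0, 0], [25.0, 12], [30.0, 23]])  # made-up [temperature, hour] rows
scaled = min_max_scale(toy)
print(scaled)
```

Putting all features on the same scale keeps the temperature column from being dwarfed by, or dominating, the hour and day-of-week columns during training.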

In [None]:
train_size = int(0.8*len(data))

# Training data
x_train = data[0:train_size].values
y_train = data[0:train_size][0].values

# Test data
x_test = data[train_size:].values
y_test = data[train_size:][0].values

# Defining sliding window function
def sliding_window(tempdata, window_len=24):
    res_data = []
    for i in range(0, (len(tempdata)-window_len)):
        # Slice the array that was passed in, not the global `data` frame
        res_data.append(tempdata[i: i+window_len])
    return np.asarray(res_data)

# Preparing data using sliding window with window length of 50
win_l = 50
x_train = sliding_window(x_train, win_l)
y_train = y_train[-x_train.shape[0]:]
x_test = sliding_window(x_test, win_l)
y_test = y_test[-x_test.shape[0]:]

# Shape of data
print("x_train", x_train.shape)
print("y_train", y_train.shape)
print("x_test", x_test.shape)
print("y_test", y_test.shape)
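
The alignment between windows and targets is easiest to see on a toy series: window `i` covers positions `i` to `i + window_len - 1`, and its target is the value at position `i + window_len`. The sketch below uses a made-up one-dimensional series and a hypothetical `make_windows` helper; the notebook applies the same idea to rows with five features.

```python
import numpy as np

def make_windows(series, window_len):
    """Return (windows, targets): each window of length window_len predicts the next value."""
    series = np.asarray(series)
    windows = np.asarray([series[i:i + window_len] for i in range(len(series) - window_len)])
    targets = series[window_len:]
    return windows, targets

toy_series = np.arange(10)  # 0, 1, ..., 9
windows, targets = make_windows(toy_series, window_len=3)
print(windows.shape, targets.shape)  # (7, 3) (7,)
print(windows[0], targets[0])        # [0 1 2] 3
```

A series of length `n` yields `n - window_len` samples, which is why the first `window_len` targets are dropped.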

In [None]:
# Required libraries to build LSTM
from tensorflow.keras.layers import Dropout
# other libraries already loaded

model2 = Sequential()

# Each sample is a sequence of shape (window length, number of features)
model2.add(Input(shape=x_train.shape[1:]))
model2.add(LSTM(50, return_sequences=True))
model2.add(Dropout(0.2))
model2.add(LSTM(100, return_sequences=False))
model2.add(Dropout(0.2))
model2.add(Dense(1, activation = "linear"))

# Model summary
plot_model(model2, show_shapes=True)
#model2.summary()

In [None]:
# Compiling and fitting the model
model2.compile(loss="mse", optimizer="rmsprop")

# Train the model
model2_history = model2.fit(x_train, y_train, epochs=20, batch_size=128, 
                          validation_data = (x_test, y_test))

plt.figure(figsize=(12, 5))
plt.plot(model2_history.history['loss'], c='b', label = 'training_loss')
plt.plot(model2_history.history['val_loss'], c='r', label = 'test_loss')
plt.title('Training Loss vs Test Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.grid()
plt.legend()
plt.show()

In [None]:
# Plotting predicted values vs real test values

y_predict = model2.predict(x_test)
plt.figure(figsize = (10,5))
plt.plot(y_predict, c = 'r', label = 'Prediction on test data')
plt.plot(y_test, c = 'b', label = 'Test values')
plt.title('Real test values vs predicted values')
plt.legend()
plt.grid()

In [None]:
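# Calculating threshold to detect anomalies
# y_predict has shape (n, 1): flatten it so the subtraction is element-wise
diff = pd.Series(np.abs(y_test - y_predict.ravel()))

outlier_fraction = 0.25
n_outliers = int(outlier_fraction*len(diff))
# Keep the errors as floats: the data is scaled to [0, 1], so casting to int
# would truncate every error to 0 and flag every point
threshold = diff.nlargest(n_outliers).min()

# Detecting anomalies
anomaly = (diff >= threshold).astype(int)

# The first prediction targets row train_size + win_l of df; earlier rows
# (training data and the first test window) have no prediction and stay 0
flags = np.zeros(len(df), dtype=int)
flags[train_size + win_l:] = anomaly.to_numpy()
df['anomaly'] = flags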
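
The thresholding step can be tried on a handful of numbers. This sketch flags the fraction of points with the largest absolute prediction error; the readings, the constant prediction, and the `flag_anomalies` helper are made up for illustration, and the outlier fraction is a tunable assumption rather than a property of the data.

```python
import numpy as np

def flag_anomalies(actual, predicted, outlier_fraction):
    """Flag the outlier_fraction of points with the largest absolute prediction error."""
    errors = np.abs(np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float))
    n_outliers = max(1, int(outlier_fraction * len(errors)))
    threshold = np.sort(errors)[-n_outliers]  # smallest error among the top n_outliers
    return (errors >= threshold).astype(int), threshold

actual = [0.50, 0.52, 0.51, 0.95, 0.49, 0.50, 0.10, 0.51]  # made-up scaled readings
predicted = [0.50] * len(actual)                            # made-up model output
toy_flags, toy_threshold = flag_anomalies(actual, predicted, outlier_fraction=0.25)
print(toy_flags)  # [0 0 0 1 0 0 1 0]
```

With eight points and a fraction of 0.25, exactly the two readings that stray furthest from the prediction are flagged.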


In [None]:
#Visualizing anomalies (Red Dots)
plt.figure(figsize=(15,7))
a = df.loc[df['anomaly'] == 1, ['timestamp', 'value']] #anomaly
plt.plot(df['timestamp'], df['value'], color='blue')
plt.scatter(a['timestamp'],a['value'], color='red', label = 'Anomaly')
#plt.axis([1.370*1e7, 1.405*1e7, 15,30])
plt.grid()
plt.legend()