 <div>
<img src="https://edlitera-images.s3.amazonaws.com/new_edlitera_logo.png" width="500"/>
</div>

# Take-home exercises Deep Learning

<br>
<br>
<br>
<br>
<br>


# First notebook - Introduction to Deep Learning model architecture

<br>
<br>
<br>
<br>
<br>

### `Take-home exercise`

**Using both the Sequential model API and the Functional model API, create a model that would work with a tabular dataset that contains 65 columns. 64 of them represent independent features, and one represents the dependent feature (the label). Use the ReLU activation function for the hidden layers. Assuming the task is binary classification, use the appropriate activation function for the output layer, and the appropriate number of neurons in the output layer.**

**Note:** when determining the number of layers and neurons, try to follow the general rules of thumb that we mentioned before.

**For the model created using the Sequential API print out a model summary. Plot the model created using the Functional API.**

**Solution:**

In [None]:
# Define the model architecture using the Sequential API

import keras
from keras.models import Sequential
from keras.layers import Dense


model = Sequential()
model.add(Dense(32, activation='relu', input_shape=(64,)))
model.add(Dense(16, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.summary()
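
As a sanity check on the summary, the parameter count of each `Dense` layer can be worked out by hand: every neuron has one weight per input plus one bias. Below is a minimal sketch in plain Python. The helper name `dense_params` is hypothetical and not part of Keras.

```python
# Hand-computed parameter counts for the 64 -> 32 -> 16 -> 1 network above.
# dense_params is a hypothetical helper, not a Keras function.

def dense_params(n_inputs, n_units):
    # One weight per (input, unit) pair plus one bias per unit
    return n_inputs * n_units + n_units

layer_sizes = [64, 32, 16, 1]  # input features followed by the units of each Dense layer

params_per_layer = [
    dense_params(n_in, n_out)
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

print(params_per_layer)       # [2080, 528, 17]
print(sum(params_per_layer))  # 2625
```

If `model.summary()` reports different numbers, the layer sizes or the input shape differ from the ones used here.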

In [None]:
# Define the model architecture using the Functional API


import keras
from keras.models import Model
from keras.layers import Input, Dense
from keras.utils import plot_model  # the keras.utils.vis_utils path is not available in newer Keras releases


# Define input layer

input_layer_1 = Input(shape=(64,))

# Define hidden layers

hidden_layer_1 = Dense(32, activation='relu')(input_layer_1)
hidden_layer_2 = Dense(16, activation='relu')(hidden_layer_1)

# Define output layer

prediction_layer = Dense(1, activation='sigmoid')(hidden_layer_2)


model = Model(inputs=input_layer_1, outputs=prediction_layer)


# Bonus: add the argument rankdir with the value 'LR' to ask for a horizontal plot (my personal preference)

plot_model(model, to_file='model_plot.png', show_shapes=True, show_layer_names=True, rankdir='LR')

<br>
<br>
<br>
<br>
<br><br>
<br>

# Second notebook - Training Deep Learning models with Keras

<br>
<br>
<br>
<br>
<br><br>
<br>

### `Take-home exercise`

**Using the dataset available at `https://edlitera-datasets.s3.amazonaws.com/breast_cancer_data.csv`, create a model that can predict whether a person has a benign tumor or a malignant tumor. The `diagnosis` column indicates whether a person has a benign tumor (0) or a malignant tumor (1). Try to achieve the best possible accuracy you can by modifying hyperparameters.**

**Solution:**

In [None]:
# Import necessary libraries

import pandas as pd
import numpy as np
import keras
from keras.layers import Dense
from keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from keras.losses import BinaryCrossentropy
from keras.metrics import BinaryAccuracy
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

In [None]:
# Load the data

df = pd.read_csv("https://edlitera-datasets.s3.amazonaws.com/breast_cancer_data.csv")

In [None]:
df

In [None]:
# Shuffle dataset

df = df.sample(frac=1).reset_index(drop=True)

In [None]:
# Separate features from the label

X = df.iloc[:, 1:]


# Flatten data

y = df["diagnosis"]
y = y.values.flatten()

In [None]:
# Split data into train and test data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.3, 
    random_state=42
)

# Split train data into train and validation data

X_train, X_valid, y_train, y_valid = train_test_split(
    X_train, y_train, 
    test_size=0.3, 
    random_state=42
)
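
Because the second split is taken from what remains after the first one, the final proportions are not 70/30. The short sketch below works out the shares implied by the two `test_size=0.3` values. It uses exact fractions, and the variable names are only for illustration.

```python
from fractions import Fraction

test_fraction = Fraction(3, 10)   # test_size=0.3 in the first split
valid_fraction = Fraction(3, 10)  # test_size=0.3 in the second split

# The second split only sees the rows left over after the test set is removed
remaining_after_test = 1 - test_fraction
train_share = remaining_after_test * (1 - valid_fraction)
valid_share = remaining_after_test * valid_fraction
test_share = test_fraction

print(float(train_share), float(valid_share), float(test_share))  # 0.49 0.21 0.3
```

scikit-learn rounds to whole rows, so the real row counts can differ slightly from these shares.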

In [None]:
# Define scaler and scale data

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)
X_test = scaler.transform(X_test)
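
The scaler is fitted on the training data only and then reused for the validation and test data, so no information from those splits leaks into training. Here is a minimal sketch of the same idea in plain Python for a single made-up column. It assumes standardization with the population standard deviation, which is what `StandardScaler` uses.

```python
from statistics import fmean, pstdev

train_values = [1.0, 2.0, 3.0]  # made-up training column
test_values = [4.0]             # made-up test column

# "fit": learn the statistics from the training data only
mean = fmean(train_values)
std = pstdev(train_values)  # population standard deviation

# "transform": reuse the training statistics for every split
train_scaled = [(v - mean) / std for v in train_values]
test_scaled = [(v - mean) / std for v in test_values]

print(train_scaled)
print(test_scaled)
```

The scaled training column has mean 0 and standard deviation 1. The test value does not, because it is measured against the training statistics.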

In [None]:
# Define input dimension

input_dimension = X_train.shape[1]

print(input_dimension)

In [None]:
# Define model

model = Sequential()

model.add(Dense(8, activation = 'relu', input_shape=(input_dimension, )))

model.add(Dense(1, activation = 'sigmoid'))


model.summary()

In [None]:
# Define optimizer

optim = Adam()


# Define loss

loss_function = BinaryCrossentropy()


# Define metric we will track 

metric = BinaryAccuracy()


# Compile model

model.compile(loss=loss_function,
              optimizer=optim,
              metrics=[metric])
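
To get a feel for what the loss measures, here is a rough sketch of the binary cross-entropy formula in plain Python. It is not the Keras implementation: the function name, the clipping constant `eps`, and the example values are assumptions made for illustration.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    # Average of -[y*log(p) + (1-y)*log(1-p)] over all samples
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true_example = [1, 0, 1, 0]     # made-up labels
confident = [0.9, 0.1, 0.8, 0.2]  # made-up predictions far from 0.5
hesitant = [0.6, 0.4, 0.6, 0.4]   # made-up predictions close to 0.5

confident_loss = binary_cross_entropy(y_true_example, confident)
hesitant_loss = binary_cross_entropy(y_true_example, hesitant)

print(confident_loss)
print(hesitant_loss)
```

Both sets of predictions are 100% accurate at a 0.5 threshold, but the hesitant one has the higher loss. This is why loss and accuracy can move differently during training.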

In [None]:
# Fit model

model.fit(
    X_train, 
    y_train,
    epochs=150, 
    batch_size=64, 
    validation_data=(X_valid, y_valid), 
    verbose=1
)

In [None]:
# Evaluate model

y_pred = model.predict(X_test)

score = model.evaluate(X_test, y_test, verbose=1)

print(score)

In [None]:
# Predict classes for the test examples (probabilities above 0.6 become class 1)

predicted_classes = np.where(y_pred > 0.6, 1,0)
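
The sketch below shows the same thresholding step on a few made-up probabilities, so the effect of the 0.6 cut-off is easy to see. The array values and variable names are hypothetical.

```python
import numpy as np

# Made-up sigmoid outputs with shape (4, 1), the shape model.predict returns
# for a network with a single output neuron
example_probs = np.array([[0.20], [0.61], [0.60], [0.95]])

# Strictly greater than 0.6 -> class 1, everything else -> class 0
example_classes = np.where(example_probs > 0.6, 1, 0)

print(example_classes.ravel())  # [0 1 0 1]
print(example_classes.shape)    # (4, 1)
```

Raising the threshold above 0.5 makes the model more reluctant to predict class 1, which usually trades fewer false positives for more false negatives.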

In [None]:
# Generate classification report

print(classification_report(y_test, predicted_classes, labels=[0, 1]))

<br>
<br>
<br>
<br>
<br>

# Third notebook - Word embeddings and the Keras Embedding layer

## No exercises assigned! 

<br>
<br>
<br>
<br>
<br>

# Fourth notebook - Gensim and SpaCy

<br>
<br>
<br>
<br>
<br>

### `Take-home exercise 1`

**Create a `Skip-gram model` using `Gensim` and train it on the data inside the `Review Text` column of the dataset stored in the `https://edlitera-datasets.s3.amazonaws.com/womens_clothing_reviews.csv` file. After training, look up the words most similar to the word `"shirt"` using that model.**

**Note: train the model both with, and without preprocessing your text data (removing stop words etc.) and compare the results.**

**Solution:**

In [None]:
# Import necessary libraries

import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import SnowballStemmer
import gensim
from gensim.models import Word2Vec

In [None]:
# Load in our data and create a DataFrame

df = pd.read_csv("https://edlitera-datasets.s3.amazonaws.com/womens_clothing_reviews.csv")


In [None]:
# Take a look at the first five rows of our DataFrame

df.head()

In [None]:
# Create function for preprocessing text data

def preprocess_reviews(text, stop_words=None, stemmer=None):
    text_data = text.lower() # lowercase text data
    tokens = tokenizer.tokenize(text_data) # tokenize data using the given tokenizer
    if stop_words is not None:
        tokens = [t for t in tokens if not t in stop_words] # remove stop words
    if stemmer is not None:
        tokens = [stemmer.stem(t) for t in tokens] # stem text data
    return " ".join(tokens)

In [None]:
# Define tokenizer

tokenizer = RegexpTokenizer(r"[a-zA-Z]{3,}")
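
To see what this pattern keeps, the same regular expression can be tried with Python's built-in `re` module. This is a stand-in for illustration only. It assumes that `RegexpTokenizer` in its default mode returns every non-overlapping match of the pattern, and the sample sentence is made up.

```python
import re

pattern = re.compile(r"[a-zA-Z]{3,}")  # same pattern as the tokenizer above

sample_review = "I love this shirt, it's so soft!"  # made-up review text

# Lowercase first, as preprocess_reviews does, then collect all matches
tokens = pattern.findall(sample_review.lower())

print(tokens)  # ['love', 'this', 'shirt', 'soft']
```

Words shorter than three letters and all punctuation are dropped, even when no stop word list is passed.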

In [None]:
# Define stemmer

stemmer = SnowballStemmer(language="english")

In [None]:
stop_words = set(stopwords.words("english"))

In [None]:
# Preprocess review text

df["tokenized_reviews_with_preprocessing"] = df["Review Text"].apply(preprocess_reviews, 
                                                                     args=(stop_words, stemmer))


df["tokenized_reviews_without_preprocessing"] = df["Review Text"].map(preprocess_reviews)

In [None]:
df

In [None]:
# Create list of lists
# every sublist contains the tokens of one preprocessed review

tokenized_reviews_with_preprocessing = [row.split() for row in df["tokenized_reviews_with_preprocessing"]]

In [None]:
# Create list of lists
# every sublist contains the tokens of one review

tokenized_reviews_without_preprocessing = [row.split() for row in df["tokenized_reviews_without_preprocessing"]]

In [None]:
# Train Skip-gram model

# Gensim 4 and newer name the parameter vector_size; older Gensim versions call it size
skip_gram_with_preprocessing = Word2Vec(tokenized_reviews_with_preprocessing, min_count=3, vector_size=300, window=3, sg=1)
#skip_gram_with_preprocessing = Word2Vec(tokenized_reviews_with_preprocessing, min_count=3, size=300, window=3, sg=1)

In [None]:
# Train Skip-gram model

# Gensim 4 and newer name the parameter vector_size; older Gensim versions call it size
skip_gram_without_preprocessing = Word2Vec(tokenized_reviews_without_preprocessing, min_count=3, vector_size=300, window=3, sg=1)
#skip_gram_without_preprocessing = Word2Vec(tokenized_reviews_without_preprocessing, min_count=3, size=300, window=3, sg=1)

In [None]:
# Look up the words most similar to "shirt" (model trained with preprocessing)

print(skip_gram_with_preprocessing.wv.most_similar("shirt"))

In [None]:
# Look up the words most similar to "shirt" (model trained without preprocessing)

print(skip_gram_without_preprocessing.wv.most_similar("shirt"))
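
For intuition about what `sg=1` and `window` mean in the training cells, here is a simplified sketch of how Skip-gram turns a tokenized review into (center word, context word) training pairs. The function name and the example sentence are hypothetical. Gensim's real training loop adds details such as random window shrinking and subsampling of frequent words, which are left out here.

```python
def skip_gram_pairs(tokens, window):
    # For each center word, pair it with every word at most `window` positions away
    pairs = []
    for center_idx, center in enumerate(tokens):
        start = max(0, center_idx - window)
        end = min(len(tokens), center_idx + window + 1)
        for context_idx in range(start, end):
            if context_idx != center_idx:
                pairs.append((center, tokens[context_idx]))
    return pairs

sentence = ["love", "this", "soft", "shirt"]  # made-up tokenized review

example_pairs = skip_gram_pairs(sentence, window=1)

for pair in example_pairs:
    print(pair)
```

Removing stop words changes which words end up inside each window, which is one reason the two models can return different neighbours for `"shirt"`.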

<br>
<br>
<br>
<br>
<br>

### `Take-home exercise 2`

**Load the dataset available at `https://edlitera-datasets.s3.amazonaws.com/womens_clothing_reviews.csv` and create a DataFrame from it. Using the 
data inside the `Review Text` column, try to predict the appropriate value for the `Recommended IND` column.**

**Note 1: perform undersampling to balance out your dataset**

**Note 2: since we haven't yet covered LSTMs, you can use the model we used in class**

**Solution:**

In [None]:
# Import necessary libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, Dense, LSTM, Dropout
from keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from keras.losses import BinaryCrossentropy
from keras.metrics import BinaryAccuracy

In [None]:
# Get rid of randomness (as much as possible)

from numpy.random import seed
seed(42) 
import tensorflow as tf
tf.random.set_seed(42)


In [None]:
# Load in our data and create a DataFrame

df = pd.read_csv("https://edlitera-datasets.s3.amazonaws.com/womens_clothing_reviews.csv")


In [None]:
df

In [None]:
df["Recommended IND"].value_counts()

In [None]:
# Divide by class

df_class_0 = df[df["Recommended IND"] == 0]
df_class_1 = df[df["Recommended IND"] == 1]

df_class_1 = df_class_1.sample(len(df_class_0))

In [None]:
df = pd.concat([df_class_0, df_class_1])
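
The undersampling idea can be shown on a tiny made-up dataset without pandas: keep every row of the smaller class and draw the same number of rows, without replacement, from the larger class. All names and values below are hypothetical.

```python
import random

# Made-up rows: 3 of class 0 and 7 of class 1
rows = [{"id": i, "label": 0} for i in range(3)] + [{"id": i, "label": 1} for i in range(3, 10)]

minority = [r for r in rows if r["label"] == 0]
majority = [r for r in rows if r["label"] == 1]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
majority_sampled = rng.sample(majority, len(minority))  # sample without replacement

balanced = minority + majority_sampled

counts = {label: sum(r["label"] == label for r in balanced) for label in (0, 1)}
print(counts)  # {0: 3, 1: 3}
```

The price of a balanced dataset is that the unsampled rows of the larger class are thrown away.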

In [None]:
df = df.sample(frac=1).reset_index(drop=True)
df

In [None]:
# Separate dependent feature from the independent feature

X = df["Review Text"]

y = df["Recommended IND"]

In [None]:
# Separate data into training data and testing data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

In [None]:
# Separate data into training data and validation data

X_train, X_valid, y_train, y_valid = train_test_split(
    X_train, y_train, test_size=0.20, random_state=42
)

In [None]:
# Define tokenizer
# Set number of words as 10 000
# Set value of oov token
# Leave everything else on default values

tokenizer = Tokenizer(
    num_words=10_000, 
    oov_token = "<OOV>")

In [None]:
# Fit tokenizer on train data

tokenizer.fit_on_texts(X_train)

In [None]:
# Convert into sequences of integers

X_train = tokenizer.texts_to_sequences(X_train)
X_valid = tokenizer.texts_to_sequences(X_valid)
X_test = tokenizer.texts_to_sequences(X_test)
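
The sketch below imitates, in a very simplified way, what `num_words` and `oov_token` do: words are ranked by frequency in the training texts, index 1 is reserved for the out-of-vocabulary token, and any word that is unknown or ranked at `num_words` or beyond is replaced by that token. The function names and texts are hypothetical. The real Keras `Tokenizer` also strips punctuation by default, which this sketch skips.

```python
from collections import Counter

def fit_word_index(texts, oov_token="<OOV>"):
    # Index 1 is reserved for the OOV token; remaining words are ranked by frequency
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequence(text, word_index, num_words, oov_token="<OOV>"):
    # Words that are unknown, or ranked at num_words or beyond, map to the OOV index
    oov_index = word_index[oov_token]
    sequence = []
    for word in text.lower().split():
        index = word_index.get(word)
        if index is None or index >= num_words:
            index = oov_index
        sequence.append(index)
    return sequence

train_texts = ["nice dress", "nice fit", "nice dress fabric"]  # made-up training reviews
toy_word_index = fit_word_index(train_texts)

print(toy_word_index)
print(to_sequence("nice dress zipper", toy_word_index, num_words=4))  # [2, 3, 1]
```

Because the index is built from the training texts only, words that appear only in the validation or test reviews always become `<OOV>`.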

In [None]:
# Define values important for padding

max_length = 100
trunc_type = "post"
padding_type = "post"

In [None]:
# Pad train, validation and test data

X_train = pad_sequences(X_train, padding=padding_type, maxlen=max_length, truncating=trunc_type)
X_valid = pad_sequences(X_valid, padding=padding_type, maxlen=max_length, truncating=trunc_type)
X_test = pad_sequences(X_test, padding=padding_type, maxlen=max_length, truncating=trunc_type)
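
With both `padding` and `truncating` set to `"post"`, short sequences get zeros appended at the end and long sequences lose their last items. Here is a simplified re-implementation for a single sequence. The helper `pad_post` is hypothetical and only mirrors this one configuration.

```python
def pad_post(sequence, maxlen, value=0):
    # Keep the first maxlen items (truncating="post"),
    # then fill the end with `value` (padding="post")
    truncated = sequence[:maxlen]
    return truncated + [value] * (maxlen - len(truncated))

print(pad_post([5, 8, 2], maxlen=5))           # [5, 8, 2, 0, 0]
print(pad_post([5, 8, 2, 9, 4, 7], maxlen=5))  # [5, 8, 2, 9, 4]
```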

In [None]:
# Define vocabulary size

vocab_size = len(tokenizer.word_index)+1

In [None]:
# Load pretrained embeddings

embedding_vector = dict()

word_embeddings = "glove.6B.100d.txt"

f = open(word_embeddings, encoding="utf8")

for line in f:
    values = line.split()
    word = values[0]
    coef = np.asarray(values[1:], dtype="float32")
    embedding_vector[word] = coef
    
f.close()

In [None]:
# Create an embedding matrix to use as weights for the embedding layer

embedding_matrix = np.zeros((vocab_size,100))
for word,i in tokenizer.word_index.items():
    embedding_value = embedding_vector.get(word)
    if embedding_value is not None:
        embedding_matrix[i] = embedding_value
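
The same loop on toy data shows what happens to words without a pretrained vector: their rows simply stay at zero. The dictionaries below are made-up stand-ins for the GloVe vectors and the tokenizer's word index, with 3 dimensions instead of 100.

```python
import numpy as np

# Made-up stand-ins for the GloVe dictionary and the tokenizer's word index
toy_vectors = {
    "dress": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "nice": np.array([0.4, 0.5, 0.6], dtype="float32"),
}
toy_word_index = {"<OOV>": 1, "nice": 2, "dress": 3, "zipperless": 4}

# One row per index, plus row 0, which is used for padding
toy_matrix = np.zeros((len(toy_word_index) + 1, 3))

for word, i in toy_word_index.items():
    vector = toy_vectors.get(word)
    if vector is not None:
        toy_matrix[i] = vector

# Rows whose word has no pretrained vector stay at zero
missing_rows = [i for i in range(len(toy_matrix)) if not toy_matrix[i].any()]
print(missing_rows)  # [0, 1, 4]
```

Row 0 (padding), row 1 (the OOV token) and row 4 (a word missing from the pretrained vectors) all remain zero vectors.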

In [None]:
# Define embedding layer with pretrained weights

embedding_layer = Embedding(vocab_size, 
                            100, 
                            weights=[embedding_matrix], 
                            input_length=100, 
                            trainable=False)

In [None]:
# Define model

model = Sequential()
model.add(embedding_layer)
model.add(LSTM(16, recurrent_dropout=0.2)) 
model.add(Dropout(0.5))
model.add(Dense(4, activation="relu"))
model.add(Dense(1,activation="sigmoid"))

model.summary()

In [None]:
# Compile model

model.compile(optimizer=Adam(),
              loss=BinaryCrossentropy(),
              metrics=[BinaryAccuracy()])

In [None]:
# Define training parameters

num_epochs = 10
batch_size = 64

# Train model

history = model.fit(X_train, 
                    y_train, 
                    batch_size=batch_size, 
                    epochs=num_epochs, 
                    verbose=1, 
                    validation_data=(X_valid, y_valid))

In [None]:
# Make predictions

y_pred = model.predict(X_test)

y_pred = y_pred > 0.5

In [None]:
# Create classification report

print(classification_report(y_test, y_pred))
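
The precision, recall and F1 columns of the report can be reproduced by hand from the counts of true positives, false positives and false negatives. Below is a minimal sketch for one class. The function name and the label lists are hypothetical, and returning 0.0 on a zero denominator is a choice made for this sketch.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

toy_true = [1, 1, 1, 0, 0, 0]  # made-up labels
toy_pred = [1, 1, 0, 1, 0, 0]  # made-up predictions

toy_scores = precision_recall_f1(toy_true, toy_pred)
print(toy_scores)
```

Here class 1 has two true positives, one false positive and one false negative, so precision, recall and F1 all come out at two thirds.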

<br>
<br>
<br>
<br>
<br>

### `Take-home exercise 3`

**In this exercise, you need to preprocess data stored in the `Review Text` column of the modified version of the `Womens clothing reviews` dataset. The dataset is available here: `https://edlitera-datasets.s3.amazonaws.com/womens_clothing_reviews.csv`**

**Preprocessing steps:**

* get rid of all the columns except the "Review Text" and "Class Name" columns
* select the first 1000 rows of the DataFrame
* remove punctuation
* remove stop words
* remove empty spaces
* lemmatize text data 

**Do the preprocessing using `spaCy`. After you finish preprocessing the data in the `Review Text` column, create word vectors for the text using the `Word2Vec CBOW` model and `Gensim`.**

**Solution:**

In [None]:
# import libraries

import pandas as pd
import spacy
from gensim.models import Word2Vec

In [None]:
# Load pipeline

nlp = spacy.load("en_core_web_lg")

In [None]:
# Load data and create DataFrame

df = pd.read_csv("https://edlitera-datasets.s3.amazonaws.com/womens_clothing_reviews.csv")

In [None]:
# Take a look at the first five rows

df.head()

In [None]:
# Select the first 1000 rows

df = df.iloc[:1000, :].copy()

In [None]:
# Leave only the two columns

df = df[["Review Text", "Class Name"]]

In [None]:
# Create a function that preprocesses data

def preprocess_text(data):
    document = nlp(data)
    cleaned_data_generator = (token.lemma_ for token in document if not (token.is_punct or token.is_space or token.is_stop))
    cleaned_text = " ".join(cleaned_data_generator).lower()
    return cleaned_text

In [None]:
# Preprocess data

df["Review Text"] = df["Review Text"].apply(preprocess_text)

In [None]:
# Take a look at the first five rows

df.head()

In [None]:
# Prepare data for CBOW
# every sublist contains the tokens of one preprocessed review

tokenized_reviews = [row.split() for row in df["Review Text"]]

In [None]:
tokenized_reviews

In [None]:
# Train CBOW model

#CBOW_model = Word2Vec(tokenized_reviews, min_count=3, size=300, window=3, sg=0)
CBOW_model = Word2Vec(tokenized_reviews, min_count=3, vector_size=300, window=3, sg=0)

In [None]:
# Check similarity between "small" and "petite" using CBOW

CBOW_model.wv.similarity("small", "petite")
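
The similarity score returned here is the cosine similarity between the two word vectors. The sketch below computes it by hand for made-up 3-dimensional vectors. These are not the vectors learned by the model.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

vec_small = [1.0, 2.0, 0.0]   # made-up vector, not taken from the model
vec_petite = [2.0, 4.0, 0.0]  # points in the same direction as vec_small
vec_other = [0.0, 0.0, 3.0]   # orthogonal to both

print(cosine_similarity(vec_small, vec_petite))  # close to 1
print(cosine_similarity(vec_small, vec_other))   # 0.0
```

Only the direction of the vectors matters, not their length: values near 1 mean the words appear in similar contexts, and values near 0 mean the vectors are unrelated.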

<br>
<br>
<br>
<br>
<br>

# Fifth notebook - RNNs and LSTMs

<br>
<br>
<br>
<br>
<br>

### `LSTM take-home exercise`

**Train an `LSTM model` to classify wines into great wines and superior wines (classes 0 and 1), using the dataset stored in the `https://edlitera-datasets.s3.amazonaws.com/wine_data_classification.csv` file. Do not clean your data in any way. Instead, use it as is to show that the model can outperform the classic Machine Learning models we trained earlier, even without text data preprocessing!**

**When done training the model, print the classification report.**

#### Solution

In [None]:
# Import necessary libraries

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.optimizers import Adam
from keras.losses import BinaryCrossentropy
from keras.metrics import BinaryAccuracy

In [None]:
# Import data and create a DataFrame
# Take a look at the first 5 rows

wine_data = pd.read_csv("https://edlitera-datasets.s3.amazonaws.com/wine_data_classification.csv")

wine_data.head()

In [None]:
# Encode the labels as integers (great_wine -> 0, superior_wine -> 1)

wine_data["wine_type"] = wine_data.wine_type.map({"great_wine": 0, "superior_wine": 1})

In [None]:
# Undersample class 0 to get a matching number of samples for both classes
# (this assumes class 0 is the larger class)

wine_data_class_0 = wine_data[wine_data['wine_type'] == 0]
wine_data_class_1 = wine_data[wine_data['wine_type'] == 1]

# Count the rows per class directly; value_counts() orders by frequency, not by label
count_class_1 = len(wine_data_class_1)

wine_data_class_0_under = wine_data_class_0.sample(count_class_1)
wine_data = pd.concat([wine_data_class_0_under, wine_data_class_1], axis=0)

In [None]:
wine_data["wine_type"].value_counts()

In [None]:
# Shuffle data

wine_data = wine_data.sample(frac=1).reset_index(drop=True)

In [None]:
# Define independent feature

X_wine = wine_data["description"]

# Define dependent feature

y_wine = wine_data["wine_type"]

In [None]:
# Separate data into training data and testing data

X_wine_train, X_wine_test, y_wine_train, y_wine_test = train_test_split(
    X_wine, y_wine, test_size=0.20, random_state=42
)

In [None]:
# Separate data into training data and validation data

X_wine_train, X_wine_valid, y_wine_train, y_wine_valid = train_test_split(
    X_wine_train, y_wine_train, test_size=0.20, random_state=42
)

In [None]:
# Define tokenizer

wine_tokenizer = Tokenizer(num_words=20_000, oov_token="<OOV>")

In [None]:
# Fit tokenizer on train data

wine_tokenizer.fit_on_texts(X_wine_train)

In [None]:
# Convert into sequences of integers

X_wine_train = wine_tokenizer.texts_to_sequences(X_wine_train)
X_wine_valid = wine_tokenizer.texts_to_sequences(X_wine_valid)
X_wine_test = wine_tokenizer.texts_to_sequences(X_wine_test)

In [None]:
# Define values important for padding

max_length = 100
trunc_type = "post"
padding_type = "post"

In [None]:
# Pad train, validation and test data

X_wine_train_padded = pad_sequences(X_wine_train, padding=padding_type, maxlen=max_length, truncating=trunc_type)
X_wine_valid_padded = pad_sequences(X_wine_valid, padding=padding_type, maxlen=max_length, truncating=trunc_type)
X_wine_test_padded = pad_sequences(X_wine_test, padding=padding_type, maxlen=max_length, truncating=trunc_type)

In [None]:
# Define vocabulary size

wine_vocab_size = len(wine_tokenizer.word_index) + 1

In [None]:
# Load pretrained embeddings

embedding_vector = dict()

word_embeddings = "glove.6B.100d.txt"

f = open(word_embeddings, encoding="utf8")

for line in f:
    values = line.split()
    word = values[0]
    coef = np.asarray(values[1:], dtype="float32")
    embedding_vector[word] = coef
    
f.close()

In [None]:
# Create an embedding matrix to use as weights for the embedding layer

embedding_matrix = np.zeros((wine_vocab_size, 100))

for word, i in wine_tokenizer.word_index.items():
    embedding_value = embedding_vector.get(word)
    
    if embedding_value is not None:
        embedding_matrix[i] = embedding_value

In [None]:
# Define embedding layer with pretrained weights

embedding_layer = Embedding(
    wine_vocab_size, 
    100, 
    weights=[embedding_matrix], 
    input_length=100, 
    trainable=False
)

In [None]:
# Define model

embedding_dim = 100
input_dim = wine_vocab_size

model = Sequential()

model.add(embedding_layer)
model.add(LSTM(64, recurrent_dropout=0.2)) 
model.add(Dropout(0.5))
model.add(Dense(1, activation="sigmoid"))

model.summary()
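
The trainable parameter counts in this summary can be checked by hand. An LSTM layer has four gates, and each gate has one weight per input feature, one weight per recurrent unit, and one bias. Here is a small sketch. The helper names are hypothetical, and the formula assumes the default LSTM configuration with biases enabled.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(n_inputs, n_units):
    # One weight per (input, unit) pair plus one bias per unit
    return n_inputs * n_units + n_units

glove_dim = 100  # size of each GloVe vector fed into the LSTM
lstm_units = 64

print(lstm_params(glove_dim, lstm_units))  # 42240
print(dense_params(lstm_units, 1))         # 65
```

The `Dropout` layer has no parameters, and the embedding layer's weights are frozen (`trainable=False`), so they show up as non-trainable parameters.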

In [None]:
# Compile model

loss_function = BinaryCrossentropy()

metric = BinaryAccuracy()

optim = Adam()

model.compile(loss=loss_function, optimizer=optim, metrics=[metric])

In [None]:
# Define training parameters

num_epochs = 10
batch_size = 128

# Train model

history = model.fit(
    X_wine_train_padded, 
    y_wine_train, 
    batch_size=batch_size, 
    epochs=num_epochs, 
    verbose=1, 
    validation_data=(X_wine_valid_padded, y_wine_valid))

In [None]:
# Make predictions using model

y_wine_pred = model.predict(X_wine_test_padded)

y_wine_pred = y_wine_pred > 0.5

In [None]:
# Create a classification report 

print(classification_report(y_wine_test, y_wine_pred))
