# Algorithms for Big Data - Exercise 10
This lecture focuses on using CNNs for object localization tasks.

You can download the dataset for this course from [Github](https://github.com/rasvob/2020-21-ARD/tree/master/datasets).

[Open in Google colab](https://colab.research.google.com/github/rasvob/2020-21-ARD/blob/master/abd_10.ipynb)
[Download from Github](https://github.com/rasvob/2020-21-ARD/blob/master/abd_10.ipynb)

In [4]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import matplotlib.pyplot as plt # plotting
import matplotlib.image as mpimg # images
import numpy as np #numpy
import seaborn as sns
import tensorflow as tf
# import tensorflow.compat.v2 as tf #use tensorflow v2 as a main 
import tensorflow.keras as keras # required for high level applications
from sklearn.model_selection import train_test_split # split for validation sets
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report
from sklearn.preprocessing import normalize # normalization of the matrix
import scipy
import pandas as pd

tf.version.VERSION

'2.3.0'

In [5]:
import requests
from typing import List, Tuple

In [6]:
def show_history(history):
    plt.figure()
    for key in history.history.keys():
        plt.plot(history.epoch, history.history[key], label=key)
    plt.legend()
    plt.tight_layout()

# What is Object Localization?
Object localization is the name of the task of “classification with localization”. Namely, given an image, classify the object that appears in it, and find its location in the image, usually by using a bounding-box. 

In Object Localization, only a single object can appear in the image. If more than one object can appear, the task is called “Object Detection”.

![model](https://github.com/rasvob/2020-21-ARD/raw/master/images/class_vs_loc.png)

Object Localization can be treated as a regression problem - predicting a continuous value, such as a weight or a salary. For instance, we can represent our output (a bounding-box) as a tuple of size 4, as follows:

- (x,y, height, width)
    - (x,y): the coordinates of the top-left corner of the bounding box
    - height: the height of the bounding box
    - width: the width of the bounding box
    
![model](https://github.com/rasvob/2020-21-ARD/raw/master/images/cat_bound.png)
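
The tuple representation above can be sketched in plain Python. The names `BoundingBox`, `to_corners` and `mean_squared_error` are hypothetical helpers introduced only for this illustration, and the sketch assumes image coordinates in which `y` grows downwards. The squared-error function shows one simple way to score a predicted box when localization is treated as regression.

```python
from typing import NamedTuple, Sequence, Tuple


class BoundingBox(NamedTuple):
    """Bounding box in the (x, y, height, width) layout (hypothetical helper)."""
    x: float       # horizontal coordinate of the top-left corner
    y: float       # vertical coordinate of the top-left corner
    height: float
    width: float


def to_corners(box: BoundingBox) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max), assuming y grows downwards."""
    return (box.x, box.y, box.x + box.width, box.y + box.height)


def mean_squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Average squared difference over the four box values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


predicted = BoundingBox(x=0, y=0, height=10, width=10)
annotated = BoundingBox(x=0, y=0, height=10, width=12)
print(to_corners(annotated))
print(mean_squared_error(predicted, annotated))
```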

# We need to download the data first and take a look at the dataset

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/rasvob/2020-21-ARD/master/datasets/titanic_train.csv')

In [None]:
df.head()

# Each column holds a piece of information about a specific passenger

In [None]:
txt_data = """Feature Name;Description
sex;Gender of passenger
age;Age of passenger
n_siblings_spouses;Number of siblings and partners aboard
parch;Number of parents and children aboard
fare;Fare passenger paid.
class;Passenger\'s class on ship
deck;Which deck passenger was on
embark_town;Which town passenger embarked from
alone;If passenger was alone"""
from io import StringIO
info = pd.read_csv(StringIO(txt_data), sep=';')
info

# Are there any missing values?

In [None]:
df.isna().sum()

# Our goal is to predict if the person survived the cruise or not
We use the 'survived' column as our target. The other columns serve as input variables.

In [None]:
df.dtypes

# We can start with a simple exploratory analysis of the data to see which pieces of information may matter the most in the decision-making process

In [None]:
plt.figure(figsize=(20, 10))
sns.countplot(data=df, x='survived')

### We can see that females had approx. 2 times higher chance to survive

In [None]:
plt.figure(figsize=(20, 10))
sns.countplot(data=df, x='sex', hue='survived')

### The median age is very similar in both groups, so age alone does not appear to be a very significant feature

In [None]:
plt.figure(figsize=(20, 10))
sns.boxplot(data=df, y='age', x='survived')

### Not being alone, on the other hand, matters, as we can see from the siblings/spouses and parents/children counts

In [None]:
plt.figure(figsize=(20, 10))
sns.countplot(data=df, x='n_siblings_spouses', hue='survived')

In [None]:
plt.figure(figsize=(20, 10))
sns.countplot(data=df, x='parch', hue='survived')

### The feature 'alone' is a so-called interaction variable because it combines the effect of both the 'parch' and 'n_siblings_spouses' features

In [None]:
plt.figure(figsize=(20, 10))
sns.countplot(data=df, x='alone', hue='survived')

## Money definitely mattered as well: passengers with more expensive tickets were more likely to survive
The most obvious difference is between the survival ratios of the third and first class.

In [None]:
plt.figure(figsize=(20, 10))
sns.boxplot(data=df, y='fare', x='survived')

In [None]:
plt.figure(figsize=(20, 10))
sns.countplot(data=df, x='class', hue='survived')

In [None]:
plt.figure(figsize=(20, 10))
sns.countplot(data=df, x='deck', hue='survived')

# Interestingly, the deck is mostly unknown for all passengers except those with first-class tickets.

In [None]:
sns.catplot(data=df, x='class', hue='survived', col='deck', kind='count', height=4, aspect=.7)

### Passengers in the 2nd, and especially in the 3rd, class were more likely to travel alone

In [None]:
plt.figure(figsize=(20, 10))
sns.countplot(data=df, x='class', hue='alone')

In [None]:
sns.catplot(data=df, x='class', hue='survived', col='alone', kind='count', height=4, aspect=.7)

### We can see that being a first-class passenger travelling with other people gave you a significant advantage

In [None]:
sns.heatmap(df.pivot_table(values='survived',index='alone', columns='class', aggfunc=np.sum), annot=True, fmt="d", linewidths=.5)

### Male passengers were more likely to travel alone; their survival chance was lower than that of the females

In [None]:
sns.heatmap(df.pivot_table(values='survived',index='alone', columns='sex', aggfunc=np.sum), annot=True, fmt="d", linewidths=.5)

## We can see that the dataset contains only a few towns where the passengers embarked, so there should be no issue with vectorizing this feature

In [None]:
plt.figure(figsize=(20, 10))
sns.countplot(data=df, x='embark_town', hue='survived')

## Let's prepare the features and create simple classification model

Estimators use a system called feature columns to describe how the model should interpret each of the raw input features. An Estimator expects a vector of numeric inputs, and feature columns describe how the model should convert each feature.

Selecting and crafting the right set of feature columns is key to learning an effective model. A feature column can be either one of the raw inputs in the original features dict (a base feature column), or a new column created using transformations defined over one or multiple base columns (a derived feature column).

The linear estimator uses both numeric and categorical features. Feature columns work with all TensorFlow estimators and their purpose is to define the features used for modeling. Additionally, they provide some feature engineering capabilities like one-hot-encoding, normalization, and bucketization.
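
To make the one-hot encoding and bucketization mentioned above concrete, here is a minimal plain-Python sketch of what these two transformations compute. The functions `one_hot` and `bucketize`, the class vocabulary and the age boundaries are illustrative assumptions; this is not the TensorFlow implementation.

```python
from bisect import bisect_right
from typing import List, Sequence


def one_hot(value: str, vocabulary: Sequence[str]) -> List[int]:
    """Encode `value` as a 0/1 vector with one slot per vocabulary entry.

    A value that is not in the vocabulary yields an all-zero vector.
    """
    return [1 if value == entry else 0 for entry in vocabulary]


def bucketize(value: float, boundaries: Sequence[float]) -> int:
    """Return the index of the bucket that `value` falls into.

    With boundaries [18, 40, 60] the buckets are
    (-inf, 18), [18, 40), [40, 60) and [60, +inf).
    """
    return bisect_right(boundaries, value)


# Illustrative vocabulary and boundaries (assumptions, not taken from the dataset).
class_vocabulary = ["First", "Second", "Third"]
age_boundaries = [18, 40, 60]

print(one_hot("Second", class_vocabulary))
print(bucketize(30.0, age_boundaries))
```

A bucketized numeric feature can then be one-hot encoded in the same way as a categorical one, which lets a linear model learn a separate weight per age range.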

In [None]:
CATEGORICAL_COLUMNS = ['sex', 'n_siblings_spouses', 'parch', 'class', 'deck',
                       'embark_town', 'alone']
NUMERIC_COLUMNS = ['age', 'fare']

feature_columns = []
for feature_name in CATEGORICAL_COLUMNS:
  vocabulary = df[feature_name].unique()
  feature_columns.append(tf.feature_column.categorical_column_with_vocabulary_list(feature_name, vocabulary))

for feature_name in NUMERIC_COLUMNS:
  feature_columns.append(tf.feature_column.numeric_column(feature_name, dtype=tf.float64))

In [None]:
feature_columns

## We need to define our own input_function

The input_function specifies how data is converted to a tf.data.Dataset that feeds the input pipeline in a streaming fashion. 

tf.data.Dataset can take in multiple sources such as a dataframe, a csv-formatted file, and more.

In [None]:
def make_input_fn(data_df, label_df, num_epochs=10, shuffle=True, batch_size=32):
  def input_function():
    ds = tf.data.Dataset.from_tensor_slices((dict(data_df), label_df))
    if shuffle:
      ds = ds.shuffle(1000)
    ds = ds.batch(batch_size).repeat(num_epochs)
    return ds
  return input_function
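
The order of operations in the input function (shuffle, then batch, then repeat) determines how many batches the model sees. The following plain-Python generator is a simplified stand-in for that pipeline, assuming the shuffle buffer is at least as large as the dataset; the name `batches` and its parameters are hypothetical.

```python
import random
from typing import Iterator, List, Sequence


def batches(data: Sequence[int], batch_size: int, num_epochs: int,
            shuffle: bool = True, seed: int = 0) -> Iterator[List[int]]:
    """Yield mini-batches, mimicking shuffle -> batch -> repeat."""
    rng = random.Random(seed)
    for _ in range(num_epochs):
        items = list(data)
        if shuffle:
            rng.shuffle(items)  # a fresh order in every epoch
        # The last batch of an epoch may be smaller than batch_size.
        for start in range(0, len(items), batch_size):
            yield items[start:start + batch_size]


sizes = [len(b) for b in batches(range(10), batch_size=4, num_epochs=2, shuffle=False)]
print(sizes)
```

Because batching happens before repeating, every epoch ends with its own (possibly smaller) final batch instead of batches that mix two epochs.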

In [None]:
X, y = df.drop('survived', axis=1), df.survived

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)

In [None]:
X_train.dtypes

In [None]:
train_input_fn = make_input_fn(X_train, y_train)
test_input_fn = make_input_fn(X_test, y_test, num_epochs=1, shuffle=False)

## We can take a look at the first batch of the data to see what it looks like

In [None]:
ds = make_input_fn(X, y, batch_size=10)()
for feature_batch, label_batch in ds.take(1):
  print('Some feature keys:', list(feature_batch.keys()))
  print()
  print('A batch of class:', feature_batch['class'].numpy())
  print()
  print('A batch of alone feature:', feature_batch['alone'].numpy())
  print()
  print('A batch of Labels:', label_batch.numpy())

In [None]:
linear_est = tf.estimator.LinearClassifier(feature_columns=feature_columns)
linear_est.train(train_input_fn)
result = linear_est.evaluate(test_input_fn)

In [None]:
result

## We usually want to get raw predictions from the model, either for further analysis or for other system components

In [None]:
predictions = linear_est.predict(test_input_fn)

In [None]:
predictions_list = list(predictions)
predictions_list[:3]

## We are interested only in class_ids

In [None]:
y_pred = [x['class_ids'][0] for x in predictions_list]

In [None]:
y_pred[:3]

## We can compute our own metrics which are missing from Keras

In [None]:
f1_score(y_true=y_test, y_pred=y_pred)
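
For reference, the F1 score for a binary problem is the harmonic mean of precision and recall. A minimal sketch of the computation in plain Python follows; `f1_binary` is a hypothetical helper for illustration, not the scikit-learn implementation, and it assumes labels are 0 and 1 with 1 as the positive class.

```python
from typing import Sequence


def f1_binary(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 score for 0/1 labels, with 1 treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0  # no correct positive prediction at all
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


print(f1_binary([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))
```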

## Derived Feature Columns
Using each base feature column separately may not be enough to explain the data. For example, the correlation between age and the label may differ between genders. If the model learns only a single weight for gender="Male" and a single weight for gender="Female", it cannot capture every age-gender combination (e.g. distinguishing gender="Male" AND age="30" from gender="Male" AND age="40").

To learn the differences between different feature combinations, you can add crossed feature columns to the model (you can also bucketize age column before the cross column):

In [None]:
age_x_gender = tf.feature_column.crossed_column(['age', 'sex'], hash_bucket_size=100)
derived_feature_columns = [age_x_gender]
linear_est = tf.estimator.LinearClassifier(feature_columns=feature_columns+derived_feature_columns)
linear_est.train(train_input_fn)
result = linear_est.evaluate(test_input_fn)
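
Conceptually, a crossed column turns each combination of the crossed values into a single categorical id by hashing it into `hash_bucket_size` buckets. The sketch below illustrates the idea in plain Python. It uses CRC32 as a stand-in hash, which is an assumption for illustration only (TensorFlow uses its own hash function), and `crossed_bucket` is a hypothetical name.

```python
import zlib
from typing import Sequence


def crossed_bucket(values: Sequence[object], hash_bucket_size: int) -> int:
    """Map a combination of feature values to one of `hash_bucket_size` ids."""
    # Join the values into one key so that the combination is hashed as a whole.
    key = "_X_".join(str(v) for v in values)
    # CRC32 is only a stand-in for the hash used by the real implementation.
    return zlib.crc32(key.encode("utf-8")) % hash_bucket_size


print(crossed_bucket([30, "male"], hash_bucket_size=100))
print(crossed_bucket([40, "male"], hash_bucket_size=100))
```

Different combinations can collide in the same bucket; a larger `hash_bucket_size` makes collisions less likely at the cost of more model weights.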

In [None]:
result

In [None]:
predictions = linear_est.predict(test_input_fn)
predictions_list = list(predictions)
y_pred = [x['class_ids'][0] for x in predictions_list]

In [None]:
f1_score(y_true=y_test, y_pred=y_pred)

# We can build even more complex models than plain-old Logistic regression
A very popular model nowadays is the gradient-boosted tree classifier (you have perhaps already heard about Light gradient boosting (LGB) or Extreme gradient boosting (XGB)).
TF2 provides its own implementation in the form of [BoostedTreesClassifier](https://www.tensorflow.org/api_docs/python/tf/estimator/BoostedTreesClassifier). Since the data fits into memory, use the entire dataset per layer. It will be faster.
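
The `n_batches_per_layer` argument of `BoostedTreesClassifier` is the number of batches used to collect statistics for each tree layer, so "the entire dataset per layer" means that batch size times `n_batches_per_layer` should cover all training rows. A small sketch of that arithmetic follows; `batches_per_layer` is a hypothetical helper and the row count of 500 is an illustrative assumption, not the actual size of the training split.

```python
import math


def batches_per_layer(num_examples: int, batch_size: int) -> int:
    """Number of batches needed to see every example once."""
    return math.ceil(num_examples / batch_size)


# Illustrative numbers only: 500 training rows is an assumption.
print(batches_per_layer(num_examples=500, batch_size=32))
print(batches_per_layer(num_examples=500, batch_size=500))
```

With the default `batch_size=32` of `make_input_fn`, a single batch per layer covers only 32 rows, so either the batch size should be raised to the size of the training set or `n_batches_per_layer` increased accordingly.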

## Try to change n_trees values - 5, 10, 50, 100

In [None]:
boosted_est = tf.estimator.BoostedTreesClassifier(feature_columns=feature_columns, n_trees=5, n_batches_per_layer=1)
boosted_est.train(train_input_fn)
result = boosted_est.evaluate(test_input_fn)

In [None]:
result

In [None]:
predictions = boosted_est.predict(test_input_fn)
predictions_list = list(predictions)
y_pred = [x['class_ids'][0] for x in predictions_list]

In [None]:
f1_score(y_true=y_test, y_pred=y_pred)

# An alternative to using Estimators for structured data is to use a classical neural network with fully connected layers
Keras provides multiple preprocessing layers for different types of columns. The goal of all the preprocessing is to make the dataset features suitable for the neural network.

See [https://www.tensorflow.org/guide/keras/preprocessing_layers](https://www.tensorflow.org/guide/keras/preprocessing_layers) for more information.


In [None]:
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing

# We will use tf.data again as in the estimator case
We will wrap the dataframes with tf.data, in order to shuffle and batch the data. If you were working with a very large CSV file (so large that it does not fit into memory), you would use tf.data to read it from disk directly.

### Note about the prefetch call
 - Prefetching overlaps the preprocessing and model execution of a training step. While the model is executing training step s, the input pipeline is reading the data for step s+1. Doing so reduces the step time to the maximum (as opposed to the sum) of the training and the time it takes to extract the data.
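
The "maximum as opposed to the sum" claim can be checked with a small idealized timing model in plain Python. The functions below are hypothetical and assume a lookahead of exactly one batch and no other overhead; the load and train times are made-up numbers for illustration.

```python
from typing import Sequence


def sequential_time(load: Sequence[float], train: Sequence[float]) -> float:
    """Total time when every step first loads its batch and then trains on it."""
    return sum(l + t for l, t in zip(load, train))


def prefetched_time(load: Sequence[float], train: Sequence[float]) -> float:
    """Total time when batch s+1 is loaded while step s is training."""
    total = load[0]  # the first batch cannot be overlapped with anything
    for step in range(len(train)):
        # While training step `step`, the pipeline loads the next batch (if any).
        next_load = load[step + 1] if step + 1 < len(load) else 0.0
        total += max(train[step], next_load)
    return total


load_times = [2.0, 2.0, 2.0]   # illustrative seconds per batch load
train_times = [3.0, 3.0, 3.0]  # illustrative seconds per training step
print(sequential_time(load_times, train_times))
print(prefetched_time(load_times, train_times))
```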

In [None]:
def df_to_dataset(df, labels, shuffle=True, batch_size=32):
  dataframe = df.copy()
  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)
  ds = ds.prefetch(batch_size)
  return ds

In [None]:
batch_size = 32
train_ds = df_to_dataset(X_train, y_train, batch_size=batch_size)
test_ds = df_to_dataset(X_test, y_test, shuffle=False, batch_size=batch_size)

The Keras preprocessing layers API allows you to build Keras-native input processing pipelines. You will use four preprocessing layers to demonstrate the feature preprocessing code.

- Normalization - Feature-wise normalization of the data.
    - For each numeric feature, you will use a Normalization() layer to make sure the mean of the feature is 0 and its standard deviation is 1.

- CategoryEncoding - Category encoding layer.
    - You cannot feed strings directly to a model. The preprocessing layer takes care of representing strings as a one-hot vector.

- StringLookup - Maps strings from a vocabulary to integer indices.

- IntegerLookup - Maps integers from a vocabulary to integer indices.
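
What these layers learn during `adapt()` and apply afterwards can be sketched in plain Python. The helper names below are hypothetical, and reserving index 0 for unknown values is an assumption of this sketch rather than a statement about the exact index layout used by Keras.

```python
import math
from typing import Dict, List, Sequence, Tuple


def fit_normalizer(values: Sequence[float]) -> Tuple[float, float]:
    """Learn the mean and standard deviation of a numeric feature."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


def apply_normalizer(values: Sequence[float], mean: float, std: float) -> List[float]:
    """Shift and scale the values to zero mean and unit standard deviation."""
    return [(v - mean) / std for v in values]


def fit_lookup(values: Sequence[str]) -> Dict[str, int]:
    """Assign each distinct string an index; 0 is kept for unknown values (assumption)."""
    return {token: i + 1 for i, token in enumerate(sorted(set(values)))}


def one_hot_index(index: int, depth: int) -> List[int]:
    """Turn an integer index into a one-hot vector of length `depth`."""
    return [1 if i == index else 0 for i in range(depth)]


mean, std = fit_normalizer([1.0, 3.0])
print(apply_normalizer([1.0, 3.0], mean, std))

lookup = fit_lookup(["Third", "First", "Third"])
index = lookup.get("Third", 0)  # unknown strings fall back to index 0
print(one_hot_index(index, depth=len(lookup) + 1))
```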


In [None]:
def get_normalization_layer(name, dataset):
  # Create a Normalization layer for our feature.
  normalizer = preprocessing.Normalization()

  # Prepare a Dataset that only yields our feature.
  feature_ds = dataset.map(lambda x, y: x[name])

  # Learn the statistics of the data.
  normalizer.adapt(feature_ds)

  return normalizer

In [None]:
age_col = df['age']
layer = get_normalization_layer('age', train_ds)
layer(age_col)[:10]

In [None]:
def get_category_encoding_layer(name, dataset, dtype, max_tokens=None):
  # Create a lookup layer which turns strings (StringLookup) or integers (IntegerLookup) into integer indices
  if dtype == 'string':
    index = preprocessing.StringLookup(max_tokens=max_tokens)
  else:
    index = preprocessing.IntegerLookup(max_values=max_tokens)

  # Prepare a Dataset that only yields our feature
  feature_ds = dataset.map(lambda x, y: x[name])

  # Learn the set of possible values and assign them a fixed integer index.
  index.adapt(feature_ds)

  # Create a CategoryEncoding layer for our integer indices.
  encoder = preprocessing.CategoryEncoding(max_tokens=index.vocab_size())

  # Prepare a Dataset that only yields our feature.
  feature_ds = feature_ds.map(index)

  # Learn the space of possible indices.
  encoder.adapt(feature_ds)

  # Apply one-hot encoding to our indices. The lambda function captures the
  # layer so we can use them, or include them in the functional model later.
  return lambda feature: encoder(index(feature))

In [None]:
class_col = df['class']
layer = get_category_encoding_layer('class', train_ds, 'string', max_tokens=3)
layer(class_col)

In [None]:
batch_size = 32
train_ds = df_to_dataset(X_train, y_train, batch_size=batch_size)
test_ds = df_to_dataset(X_test, y_test, shuffle=False, batch_size=batch_size)

# We will encode all our features now
## Some inputs can be left as raw integers
## We need to connect even these input layers to the model

In [None]:
all_inputs = []
encoded_features = []
raw_inputs = []

CATEGORICAL_COLUMNS_STR = ['sex', 'class', 'deck', 'embark_town', 'alone']
CATEGORICAL_COLUMNS_INT = ['n_siblings_spouses', 'parch']
NUMERIC_COLUMNS = ['age', 'fare']

for header in CATEGORICAL_COLUMNS_STR:
  categorical_col = tf.keras.Input(shape=(1,), name=header, dtype='string')
  encoding_layer = get_category_encoding_layer(header, train_ds, dtype='string', max_tokens=10)
  encoded_categorical_col = encoding_layer(categorical_col)
  all_inputs.append(categorical_col)
  encoded_features.append(encoded_categorical_col)

for header in CATEGORICAL_COLUMNS_INT:
  categorical_col = tf.keras.Input(shape=(1,), name=header)
  raw_inputs.append(categorical_col)  
  all_inputs.append(categorical_col)


for header in NUMERIC_COLUMNS:
  numeric_col = tf.keras.Input(shape=(1,), name=header)
  normalization_layer = get_normalization_layer(header, train_ds)
  encoded_numeric_col = normalization_layer(numeric_col)
  all_inputs.append(numeric_col)
  encoded_features.append(encoded_numeric_col)

# Create, compile, and train the model

In [None]:
all_features = tf.keras.layers.concatenate(encoded_features + raw_inputs, axis=1)
x = tf.keras.layers.Dense(32, activation="relu")(all_features)
x = tf.keras.layers.Dropout(0.5)(x)
output = tf.keras.layers.Dense(1, activation='sigmoid')(x)

model = tf.keras.Model(all_inputs, output)
# The output layer already applies a sigmoid, so the loss receives probabilities, not logits.
model.compile(optimizer='adam', loss=tf.keras.losses.BinaryCrossentropy(from_logits=False), metrics=["accuracy"])
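
The `from_logits` argument of the loss tells Keras whether the model outputs raw scores (logits) or probabilities that have already passed through a sigmoid. The plain-Python sketch below is an illustration with hypothetical helper names, not the Keras implementation. It shows that both formulations give the same loss when used consistently, and that treating a probability as if it were a logit gives a different value.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_probability(y_true: int, p: float) -> float:
    """Binary cross-entropy when the model outputs a probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


def bce_from_logit(y_true: int, z: float) -> float:
    """Binary cross-entropy when the model outputs a raw score z.

    Uses the numerically stable form max(z, 0) - z * y + log(1 + exp(-|z|)).
    """
    return max(z, 0.0) - z * y_true + math.log1p(math.exp(-abs(z)))


logit = 2.0
probability = sigmoid(logit)

print(bce_from_probability(1, probability))  # consistent: probability in
print(bce_from_logit(1, logit))              # consistent: logit in
print(bce_from_logit(1, probability))        # mismatch: probability treated as a logit
```

A model whose last layer uses a sigmoid activation outputs probabilities, so the probability form is the matching one.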

In [None]:
tf.keras.utils.plot_model(model, show_shapes=True, rankdir="LR")

In [None]:
model.fit(train_ds, epochs=100)

In [None]:
loss, accuracy = model.evaluate(test_ds)
print("Accuracy", accuracy)

In [None]:
y_pred = [1 if x >= 0.5 else 0 for x in model.predict(test_ds)]

In [None]:
f1_score(y_true=y_test, y_pred=y_pred)

# Task for the lecture
 - Choose another simple structured dataset - Iris for example
 - Choose either the Estimator or the Keras preprocessing layers approach - use Normalization layers for example
 - Build a classification model using the chosen approach
 - Experiment a little
 - Send me the Colab notebook with results and description of what you did and your final solution!

In [None]:
from sklearn.datasets import load_iris

iris = load_iris()
iX = iris['data']
iy = iris['target']
inames = iris['target_names']
ifeature_names = iris['feature_names']