<a href="https://colab.research.google.com/github/raj-vijay/dl/blob/master/06_Batch_Training_in_TensorFlow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Batch Training**

Large datasets often cannot fit into the memory of a GPU all at once. The dataset is therefore split into smaller subsets called batches, and the model parameters are updated after each batch. One complete pass over all of the batches is called an epoch. Training on a dataset one batch at a time is called batch training.
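To make the terminology concrete, the sketch below splits a toy dataset into batches and counts how many batches make up one epoch. The helper name `iter_batches` and the toy sizes are illustrative assumptions, not part of the notebook.

```python
import math

def iter_batches(data, batch_size):
    """Yield consecutive slices of at most batch_size items (hypothetical helper)."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

# Toy dataset of 10 samples, processed 4 at a time (illustrative sizes)
dataset = list(range(10))
batch_size = 4

# One epoch is one full pass over every batch
batches = list(iter_batches(dataset, batch_size))
batches_per_epoch = math.ceil(len(dataset) / batch_size)

print(batches)            # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(batches_per_epoch)  # 3
```

Note that the last batch is smaller than the others whenever the dataset size is not a multiple of the batch size.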

![alt text](https://raw.githubusercontent.com/raj-vijay/dl/master/images/Batch%20Training.png)

**King County Housing Dataset**

Online property companies offer valuations of houses using machine learning techniques. The aim of this report is to predict house sale prices in King County, Washington State, USA, using Multiple Linear Regression (MLR). The dataset consists of historic data on houses sold between May 2014 and May 2015. We will predict the sale prices of houses in King County with an accuracy of at least 75-80% and identify which factors are associated with higher property values ($650K and above).


The dataset consists of house prices from King County, an area in the US state of Washington that includes Seattle. It was obtained from Kaggle* and was released under CC0*: Public Domain. Unfortunately, the uploader has not indicated the original source of the data. The citation and database description are given in the Glossary and Bibliography.

The dataset consists of 21 variables and 21613 observations.

Install the Kaggle package to access the housing dataset from Kaggle.

In [0]:
!pip install kaggle



Create a .kaggle directory under the home directory (/root on Colab) to hold the Kaggle authentication JSON.

In [0]:
!mkdir -p ~/.kaggle


Copy the uploaded kaggle.json from /content to ~/.kaggle/kaggle.json, where the Kaggle command-line tool looks for it.

In [0]:
!cp /content/kaggle.json ~/.kaggle/kaggle.json

chmod 600 (equivalent to chmod a+rwx,u-x,g-rwx,o-rwx) sets the permissions so that the (U)ser/owner can read and write but not execute, while the (G)roup and (O)thers cannot read, write, or execute.
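As a quick check of what mode 600 means, the sketch below applies it to a throwaway file from Python and reads the permission bits back. The temporary file is a hypothetical stand-in for kaggle.json, and the printed values assume a POSIX system such as the Colab VM.

```python
import os
import stat
import tempfile

# Mode 600 is just "owner read" plus "owner write"
owner_rw = stat.S_IRUSR | stat.S_IWUSR
print(oct(owner_rw))  # 0o600

# Throwaway file standing in for kaggle.json (hypothetical stand-in)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    path = tmp.name

# Same effect as running `chmod 600 <path>` in the shell
os.chmod(path, 0o600)

# Read the permission bits back from the file system
mode = stat.S_IMODE(os.stat(path).st_mode)
symbolic = stat.filemode(os.stat(path).st_mode)
print(oct(mode), symbolic)  # on POSIX: 0o600 -rw-------

os.remove(path)
```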

In [0]:
!chmod 600 /root/.kaggle/kaggle.json

Download the housing dataset from Kaggle.

In [0]:
!kaggle datasets download -d shivachandel/kc-house-data

kc-house-data.zip: Skipping, found more recently modified local copy (use --force to force download)


**Load data using pandas**

- pd.read_csv() allows us to load data in batches
- This avoids loading the entire dataset into memory at once
- The chunksize parameter sets the batch size (number of rows per batch)

In [0]:
# Import pandas and numpy under their usual aliases
import pandas as pd
import numpy as np

# Assign the path to a string variable named data_path
data_path = '/content/kc-house-data.zip'

# Load data in batches of 100 rows
for batch in pd.read_csv(data_path, compression='zip', chunksize=100):
  # Extract price column
  price = np.array(batch['price'], np.float32)
  # Extract size column
  size = np.array(batch['sqft_living'], np.float32)
# Note: price and size are overwritten on every iteration,
# so after the loop they hold only the final batch


In [0]:
# Print the price column of the last batch that was loaded
print(price)

[1537000.  467000.  224000.  507250.  429000.  610685. 1007500.  475000.
  360000.  400000.  402101.  400000.  325000.]
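Only 13 prices are printed because the variable holds the final batch, and the dataset size is not a multiple of the chunk size. The arithmetic below uses the 21613 observations reported earlier and the chunksize of 100 from the loop.

```python
n_rows = 21613   # observations reported for the dataset
chunksize = 100  # rows per batch passed to pd.read_csv

# Number of complete batches and rows left over for the final batch
full_batches, last_batch_rows = divmod(n_rows, chunksize)
total_batches = full_batches + (1 if last_batch_rows else 0)

print(full_batches, last_batch_rows, total_batches)  # 216 13 217
```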


**Training a linear model in batches**

In [0]:
# Import tensorflow, pandas, and numpy
import tensorflow as tf
import pandas as pd
import numpy as np

In [0]:
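# Define trainable variables
# (dtype must be passed by keyword: the second positional argument of tf.Variable is trainable)
intercept = tf.Variable(0.1, dtype=tf.float32)
slope = tf.Variable(0.1, dtype=tf.float32)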

In [0]:
# Define the model
def linear_regression(intercept, slope, features):
  return intercept + features*slope

In [0]:
# Compute the predicted values and return the mean squared error (MSE) loss
def loss_function(intercept, slope, targets, features):
  predictions = linear_regression(intercept, slope, features)
  return tf.keras.losses.mse(targets, predictions)
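For intuition, here is a small hand-checkable version of the same model and loss in plain Python. The function names `linear_regression_py` and `mse_py` and the toy numbers are illustrative assumptions, chosen so they do not shadow the TensorFlow functions above.

```python
def linear_regression_py(intercept, slope, features):
    """Plain-Python counterpart of the model: intercept + slope * x."""
    return [intercept + slope * x for x in features]

def mse_py(targets, predictions):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)

# Toy data (illustrative values only)
features = [1.0, 2.0, 3.0]
targets = [2.0, 5.0, 9.0]

predictions = linear_regression_py(1.0, 2.0, features)
loss = mse_py(targets, predictions)

print(predictions)  # [3.0, 5.0, 7.0]
print(loss)         # (1 + 0 + 4) / 3, about 1.667
```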

In [0]:
# Define optimization operation
opt = tf.keras.optimizers.Adam()

In [0]:
# Load the data in batches from pandas
for batch in pd.read_csv(data_path, compression='zip', chunksize=100):
  # Extract the target and feature columns
  price_batch = np.array(batch['price'], np.float32)
  size_batch = np.array(batch['sqft_living'], np.float32)
  # Minimize the loss function
  opt.minimize(lambda: loss_function(intercept, slope, price_batch, size_batch),
               var_list=[intercept, slope])

In [0]:
# Print parameter values
print(intercept.numpy(), slope.numpy())

0.31799173 0.31615734
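Both parameters end up only about 0.22 above their starting value of 0.1, which suggests that a single pass over the data is far from convergence. One plausible reading, assuming the default Adam learning rate of 0.001 and gradients that keep the same sign on every batch, is that each Adam step moves a parameter by roughly the learning rate, so 217 batches move it by roughly 0.217. The sketch below is a textbook Adam update in plain Python, not Keras's exact implementation; the function name and the constant gradient are assumptions made for illustration.

```python
def adam_displacement(grads, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """Total parameter change after applying textbook Adam to a gradient sequence."""
    m = 0.0      # running mean of gradients
    v = 0.0      # running mean of squared gradients
    moved = 0.0  # accumulated change in the parameter
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
        moved -= lr * m_hat / (v_hat ** 0.5 + eps)
    return moved

# 217 batches with a large negative gradient (assumed constant for illustration)
moved = adam_displacement([-1.0e6] * 217)
print(round(moved, 4))  # 0.217
```

Under these assumptions the step size is almost independent of the gradient magnitude, so reaching a good fit would take many more updates (more epochs) or a larger learning rate.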


**Full Sample**
1. One update per epoch
2. Accepts dataset without modification
3. Limited by memory

**Batch Training**
1. Multiple updates per epoch
2. Requires division of dataset
3. No limit on dataset size
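The difference in update counts can be sketched with plain gradient descent on synthetic data. This is an illustration under stated assumptions (a one-parameter model y = slope * x, made-up data, and hypothetical helper names), not the notebook's TensorFlow training loop.

```python
def gradient(slope, xs, ys):
    """Gradient of the MSE loss with respect to slope for the model y = slope * x."""
    return sum(2 * (slope * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def train_one_epoch(slope, xs, ys, batch_size, lr=0.01):
    """Run one pass over the data, updating the slope once per batch."""
    updates = 0
    for start in range(0, len(xs), batch_size):
        xb = xs[start:start + batch_size]
        yb = ys[start:start + batch_size]
        slope -= lr * gradient(slope, xb, yb)
        updates += 1
    return slope, updates

# Synthetic data with a true slope of 2.0 (illustrative values only)
xs = [0.1 * i for i in range(1, 11)]
ys = [2.0 * x for x in xs]

# Full sample: the whole dataset is a single batch
full_slope, full_updates = train_one_epoch(0.0, xs, ys, batch_size=len(xs))
# Batch training: the dataset is divided into batches of 2
batch_slope, batch_updates = train_one_epoch(0.0, xs, ys, batch_size=2)

print(full_updates, batch_updates)  # 1 5
```

In this toy setup the batch-trained slope also ends the epoch closer to the true value, simply because it received five updates instead of one.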

**Preparing to batch train**

In [0]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import Variable, float32, keras

In [0]:
# Define the intercept and slope
# (dtype is passed by keyword: the second positional argument of Variable is trainable)
intercept = Variable(10.0, dtype=float32)
slope = Variable(0.5, dtype=float32)

# Define the model
def linear_regression(intercept, slope, features):
    # Define the predicted values
    return intercept + slope*features

# Define the loss function
def loss_function(intercept, slope, targets, features):
    # Define the predicted values
    predictions = linear_regression(intercept, slope, features)
    # Define the MSE loss
    return keras.losses.mse(targets, predictions)

In [0]:
# Initialize the Adam optimizer
opt = keras.optimizers.Adam()

# Load data in batches
for batch in pd.read_csv(data_path, compression='zip', chunksize=100):
    # Extract the size values for the current batch
    size_batch = np.array(batch['sqft_living'], np.float32)

    # Extract the price values for the current batch
    price_batch = np.array(batch['price'], np.float32)

    # Minimize the loss with respect to the intercept and slope
    opt.minimize(lambda: loss_function(intercept, slope, price_batch, size_batch), var_list=[intercept, slope])

# Print trained parameters
print(intercept.numpy(), slope.numpy())

10.217994 0.7161536
