
Machine learning is delivering growing gains in business value and efficiency.  However, we need to be alert to bias in machine learning models, which can produce inequalities that affect large portions of the population.  Biased models can discriminate by gender, race, age, income level, …   The effects can include lost opportunities for employment, financial services, housing, fair treatment in the judicial system, …

This bias and the resulting inequality can go unnoticed yet have a powerful impact.  It is up to us to build into our development and maintenance process a step that looks for bias and eradicates it.
Machine learning is biased by default, since it relies on statistical bias. This is required to make predictions, classifications, and correlations on new data the model has never seen before. However, the focus here needs to be on bias in the algorithms and in the training data used to create the models in the first place.

We will focus on research conducted by ProPublica, a non-profit investigative newsroom, which found bias in COMPAS, a machine learning algorithm used to estimate criminal defendants’ likelihood of reoffending.
We will:
1.	Get data 
2.	Initial - Exploratory data analysis (EDA)
3.	Initial – Data Wrangling
4.	Exploratory data analysis (EDA)
5.	Feature Engineering - Prepare the data for Machine Learning Algorithms
6.	Train, Evaluate, and Select a Model

Work in progress

7.	Using a Variational Autoencoder (VAE, TensorFlow)
   Build an ML transformer of the original dataset that removes the sensitive attribute (race) while keeping almost all of the remaining information
   
8.	Using a Variational Fair Autoencoder (VFAE, TensorFlow)
   Bias is removed


Data:
•	COMPAS dataset - tracks defendants in Broward County, Florida
•	US census data for some initial data comparisons



# Machine Bias
# 
"""
https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm

Context
COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is a popular commercial algorithm used by judges
and parole officers for scoring a criminal defendant’s likelihood of reoffending (recidivism). It has been shown that the algorithm
is biased in favor of white defendants, and against black defendants, based on a 2-year follow-up study (i.e. who actually committed
crimes or violent crimes within 2 years). The pattern of mistakes, as measured by precision/sensitivity, is notable.

Quoting from ProPublica: 

"Black defendants were often predicted to be at a higher risk of recidivism than they actually were. Our analysis found that black defendants
who did not recidivate over a two-year period were nearly twice as likely to be misclassified as higher risk compared to their white counterparts
(45 percent vs. 23 percent). White defendants were often predicted to be less risky than they were. Our analysis found that white defendants who
re-offended within the next two years were mistakenly labeled low risk almost twice as often as black re-offenders (48 percent vs. 28 percent).
The analysis also showed that even when controlling for prior crimes, future recidivism, age, and gender, black defendants were 45 percent more
likely to be assigned higher risk scores than white defendants.

Black defendants were also twice as likely as white defendants to be misclassified as being a higher risk of violent recidivism. And white violent
recidivists were 63 percent more likely to have been misclassified as a low risk of violent recidivism, compared with black violent recidivists.
The violent recidivism analysis also showed that even when controlling for prior crimes, future recidivism, age, and gender, black defendants were
77 percent more likely to be assigned higher risk scores than white defendants. "
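
The misclassification figures quoted above are group-wise false positive and false negative rates. A minimal sketch of how such rates can be computed with pandas follows. The toy rows, the column names `group`, `high_risk`, and `reoffended`, and the helper `error_rates` are all invented for illustration and are not taken from the COMPAS files or from ProPublica's code.

```python
import pandas as pd

# Toy data, invented for illustration only (not taken from the COMPAS files).
toy = pd.DataFrame({
    "group":      ["A", "A", "A", "A", "B", "B", "B", "B"],
    "high_risk":  [1,   1,   1,   0,   0,   0,   0,   1],   # predicted label
    "reoffended": [0,   0,   1,   1,   0,   0,   1,   1],   # observed outcome
})

def error_rates(frame):
    """Return the false positive and false negative rate for one group."""
    negatives = frame[frame["reoffended"] == 0]   # did not reoffend
    positives = frame[frame["reoffended"] == 1]   # did reoffend
    return pd.Series({
        # share of non-reoffenders who were labeled high risk
        "false_positive_rate": (negatives["high_risk"] == 1).mean(),
        # share of reoffenders who were labeled low risk
        "false_negative_rate": (positives["high_risk"] == 0).mean(),
    })

# One row per group, one column per error rate.
rates = pd.DataFrame(
    {name: error_rates(part) for name, part in toy.groupby("group")}
).T
print(rates)
```

In this toy example both groups have the same false negative rate, but only group A has false positives, which is the kind of asymmetry the quoted analysis describes.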

Content
The data contains variables used by the COMPAS algorithm in scoring defendants, along with their outcomes within 2 years of the decision, for over
10,000 criminal defendants in Broward County, Florida. Three subsets of the data are provided, including a subset of only violent
recidivism (as opposed to, e.g., being reincarcerated for non-violent offenses such as vagrancy or marijuana).

An in-depth analysis by ProPublica can be found in their data methodology article.



Each pretrial defendant received at least three COMPAS scores (column DisplayText):
“Risk of Recidivism,”
“Risk of Violence,” and
“Risk of Failure to Appear.”

COMPAS scores for each defendant ranged from 1 to 10, with 10 being the highest risk (column DecileScore). Scores were labeled as follows (column ScoreText):
1 to 4 were labeled by COMPAS as “Low”;
5 to 7 were labeled “Medium”; and
8 to 10 were labeled “High.”
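
The mapping from decile score to label can be reproduced with `pd.cut`. This is a small sketch on made-up scores; the bin edges simply restate the three ranges listed above.

```python
import pandas as pd

# Made-up decile scores for illustration.
deciles = pd.Series([1, 4, 5, 7, 8, 10])

# Bins are right-inclusive: (0, 4] -> Low, (4, 7] -> Medium, (7, 10] -> High.
labels = pd.cut(deciles, bins=[0, 4, 7, 10], labels=["Low", "Medium", "High"])
print(labels.tolist())
```

Comparing such derived labels with the ScoreText column is one way to check that the two columns agree after cleanup.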


Columns
0 - 4  : 'Person_ID','AssessmentID','Case_ID','Agency_Text', 'LastName',
5 - 9  : 'FirstName', 'MiddleName', 'Sex_Code_Text', 'Ethnic_Code_Text','DateOfBirth',
10 - 14: 'ScaleSet_ID', 'ScaleSet', 'AssessmentReason','Language', 'LegalStatus',
15 - 19: 'CustodyStatus', 'MaritalStatus','Screening_Date', 'RecSupervisionLevel', 'RecSupervisionLevelText',
20 - 24: 'Scale_ID', 'DisplayText', 'RawScore', 'DecileScore', 'ScoreText',
25 - 27: 'AssessmentType', 'IsCompleted', 'IsDeleted'

In [1]:
# loading libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# For preprocessing the data
from sklearn import preprocessing
# Standardizing
from sklearn.preprocessing import StandardScaler
# To split the dataset into train and test datasets
from sklearn.model_selection import train_test_split
# To calculate the accuracy score of the model
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
#
from datetime import datetime
from datetime import date
#
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
#
import collections


In [2]:
# load dataset
pthfnm = "./compas-scores-raw.csv"
df = pd.read_csv(pthfnm)

In [None]:
# Initial data cleanup

In [3]:
# update 'Ethnic_Code_Text' to have consistent values for African Americans
df.loc[df['Ethnic_Code_Text'] == 'African-Am', 'Ethnic_Code_Text'] = 'African-American'
print(df['Ethnic_Code_Text'].value_counts())

African-American    27069
Caucasian           21783
Hispanic             8742
Other                2592
Asian                 324
Native American       219
Arabic                 75
Oriental               39
Name: Ethnic_Code_Text, dtype: int64


In [4]:
# DecileScore should be between 1 & 10, delete otherwise
df.DecileScore.unique()
print((df['DecileScore'] < 1).sum())

45


In [5]:
# remove DecileScore < 1
df = df[df.DecileScore >= 1]
print(df['DecileScore'].value_counts())

1     18465
2      9192
3      8492
4      5338
5      4831
6      4319
7      3338
8      2799
9      2386
10     1638
Name: DecileScore, dtype: int64


# EDA  - looking at potential bias
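
A minimal sketch of one possible first check for this section: the share of each ScoreText label within each Ethnic_Code_Text group. The rows below are invented stand-ins for the cleaned `df`; only the two column names come from the dataset.

```python
import pandas as pd

# Toy rows standing in for the cleaned `df`; values are invented for illustration.
toy = pd.DataFrame({
    "Ethnic_Code_Text": ["African-American"] * 4 + ["Caucasian"] * 4,
    "ScoreText": ["High", "Medium", "Low", "High",
                  "Low", "Low", "Medium", "High"],
})

# Row-normalised crosstab: every row sums to 1, so groups of
# different sizes can be compared directly.
share = pd.crosstab(toy["Ethnic_Code_Text"], toy["ScoreText"], normalize="index")
print(share)
```

On the real data the same call would take `df["Ethnic_Code_Text"]` and `df["ScoreText"]`, ideally one DisplayText scale at a time, since each defendant appears once per scale.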



# Feature Engineering

In [6]:
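# Add column 'Age' (whole years as of today) from DateOfBirth
# NOTE: '%y' maps two-digit years 00-68 to 2000-2068, so birth years before 1969
# are parsed a century late; rows whose parsed year is not in the past get -1.
agelist = []
currdate = date.today()
for dte in df['DateOfBirth']:
    brthdte = datetime.strptime(dte, '%m/%d/%y')
    # True (counts as 1) when this year's birthday has not happened yet
    mnthday = (currdate.month, currdate.day) < (brthdte.month, brthdte.day)
    if currdate.year > brthdte.year:
        agelist.append(currdate.year - brthdte.year - (mnthday))
    else:
        agelist.append(-1)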
        

In [7]:
print(len(agelist), len(df))
df['Age'] = agelist
print(df.columns)

60798 60798
Index(['Person_ID', 'AssessmentID', 'Case_ID', 'Agency_Text', 'LastName',
       'FirstName', 'MiddleName', 'Sex_Code_Text', 'Ethnic_Code_Text',
       'DateOfBirth', 'ScaleSet_ID', 'ScaleSet', 'AssessmentReason',
       'Language', 'LegalStatus', 'CustodyStatus', 'MaritalStatus',
       'Screening_Date', 'RecSupervisionLevel', 'RecSupervisionLevelText',
       'Scale_ID', 'DisplayText', 'RawScore', 'DecileScore', 'ScoreText',
       'AssessmentType', 'IsCompleted', 'IsDeleted', 'Age'],
      dtype='object')


In [8]:
# clean up bad ages
# count rows with Age < 1
(df['Age'] < 1).sum()

12782

In [9]:
df = df[df.Age >= 1]
(df['Age'] < 1).sum()

0
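
The age step above drops every row whose birth date parses into the future. A minimal sketch of an alternative follows, under two assumptions that are not in the notebook: age is measured at a reference date (for example the screening date) rather than today, and nobody in the data is 100 or older, so a birth date later than the reference date must belong to the previous century. `age_at` is a hypothetical helper.

```python
from datetime import date, datetime

def age_at(dob_text, reference):
    """Age in whole years on `reference` for an 'MM/DD/YY' birth date string."""
    born = datetime.strptime(dob_text, "%m/%d/%y").date()
    # '%y' maps 00-68 to 2000-2068. A birth date after the reference date
    # is assumed to be a century too late, so shift it back 100 years.
    if born > reference:
        born = born.replace(year=born.year - 100)
    # True (counts as 1) if the birthday has not yet occurred in the reference year.
    before_birthday = (reference.month, reference.day) < (born.month, born.day)
    return reference.year - born.year - before_birthday

print(age_at("12/05/92", date(2013, 1, 1)))
print(age_at("09/16/52", date(2013, 10, 1)))
```

With this correction the second example is treated as a 1952 birth date instead of being flagged as invalid.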

In [10]:
# Slice by 'DisplayText' for Risk
RiskAppear = df.loc[df['DisplayText'] == 'Risk of Failure to Appear']
RiskViolence = df.loc[df['DisplayText'] == 'Risk of Violence']
RiskRecidivism = df.loc[df['DisplayText'] == 'Risk of Recidivism']
print('Appear:', RiskAppear.shape, ' Violence: ', RiskViolence.shape,  ' Recidivism:',RiskRecidivism.shape)

Appear: (16016, 29)  Violence:  (16010, 29)  Recidivism: (15990, 29)


In [11]:
# Define prepare_data_for_ml_model_1:
def prepare_data_for_ml_model_1(dfx, target_loc):
    # Create a new dataset of selected columns to prepare training and test data for the ML model
     
    """
    Columns
    0 - 4  : 'Person_ID','AssessmentID','Case_ID','Agency_Text', 'LastName',
    5 - 9  : 'FirstName', 'MiddleName', 'Sex_Code_Text', 'Ethnic_Code_Text','DateOfBirth',
    10 - 14: 'ScaleSet_ID', 'ScaleSet', 'AssessmentReason','Language', 'LegalStatus',
    15 - 19: 'CustodyStatus', 'MaritalStatus','Screening_Date', 'RecSupervisionLevel', 'RecSupervisionLevelText',
    20 - 24: 'Scale_ID', 'DisplayText', 'RawScore', 'DecileScore', 'ScoreText',
    25 - 28: 'AssessmentType', 'IsCompleted', 'IsDeleted','Age'
    """

    x_df = dfx.iloc[:, [7,8,14,15,16,19]] #features
    #x_df = dfx.iloc[:, [7,14,15,16,19]] #features
    tmp_age = dfx.iloc[:,28].values  # age feature as a NumPy array (as_matrix() was removed from pandas)
    x_age = tmp_age.reshape(tmp_age.size,1)
    

    y = dfx.iloc[:,target_loc].values  # target as a NumPy array


    #  label encoder: encodes each category as an integer
    le = LabelEncoder()


    Sex_Code_Text_cat = le.fit_transform(x_df.Sex_Code_Text)
    Ethnic_Code_Text_cat = le.fit_transform(x_df.Ethnic_Code_Text)
    LegalStatus_cat = le.fit_transform(x_df.LegalStatus)
    CustodyStatus_cat = le.fit_transform(x_df.CustodyStatus)
    MaritalStatus_cat = le.fit_transform(x_df.MaritalStatus)
    RecSupervisionLevelText_cat = le.fit_transform(x_df.RecSupervisionLevelText)

    Sex_Code_Text_cat = Sex_Code_Text_cat.reshape(len(Sex_Code_Text_cat),1)
    Ethnic_Code_Text_cat = Ethnic_Code_Text_cat.reshape(len(Ethnic_Code_Text_cat),1)
    LegalStatus_cat = LegalStatus_cat.reshape(len(LegalStatus_cat),1)
    CustodyStatus_cat = CustodyStatus_cat.reshape(len(CustodyStatus_cat),1)
    MaritalStatus_cat = MaritalStatus_cat.reshape(len(MaritalStatus_cat),1)
    RecSupervisionLevelText_cat = RecSupervisionLevelText_cat.reshape(len(RecSupervisionLevelText_cat),1)

    #  One-Hot encoder. It encodes the data into binary format
    onehote = OneHotEncoder(sparse=False)
    
    Sex_Code_Text_oh = onehote.fit_transform(Sex_Code_Text_cat)
    Ethnic_Code_Text_oh = onehote.fit_transform(Ethnic_Code_Text_cat)
    LegalStatus_oh = onehote.fit_transform(LegalStatus_cat)
    CustodyStatus_oh = onehote.fit_transform(CustodyStatus_cat)
    MaritalStatus_oh = onehote.fit_transform(MaritalStatus_cat)
    RecSupervisionLevelText_oh = onehote.fit_transform(RecSupervisionLevelText_cat)

    # Build out the feature dataset as a NumPy array, since the One-Hot encoder returns NumPy arrays
    # (Ethnic_Code_Text is appended last so the sensitive columns sit at the end)
    X_feature =  Sex_Code_Text_oh
    X_feature = np.concatenate((X_feature,LegalStatus_oh), axis=1)
    X_feature = np.concatenate((X_feature,CustodyStatus_oh), axis=1)
    X_feature = np.concatenate((X_feature,MaritalStatus_oh), axis=1)
    X_feature = np.concatenate((X_feature,RecSupervisionLevelText_oh), axis=1)
    X_feature = np.concatenate((X_feature,x_age), axis=1)
    X_feature = np.concatenate((X_feature,Ethnic_Code_Text_oh), axis=1)

    # Split data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X_feature, y, test_size=0.2)
    print('Length for X_train:', len(X_train), ' X_test:',len(X_test), ' y_train:',len(y_train) ,' y_test:',len(y_test))

    return X_train, X_test, y_train, y_test
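
Because the one-hot encoded Ethnic_Code_Text block is appended last, the number of sensitive columns equals the number of categories in that block. A compact sketch of the same layout using `pd.get_dummies` follows. The toy rows are invented, and this is an alternative illustration, not the function above.

```python
import pandas as pd

# Invented rows; only the column names come from the dataset.
toy = pd.DataFrame({
    "Sex_Code_Text": ["Male", "Female", "Male"],
    "Ethnic_Code_Text": ["Caucasian", "Hispanic", "African-American"],
    "Age": [31, 45, 27],
})

# Non-sensitive features: one-hot encoded sex plus numeric age.
non_sensitive = pd.get_dummies(toy[["Sex_Code_Text"]], dtype=float)
non_sensitive["Age"] = toy["Age"].astype(float)

# Sensitive attribute: one column per category.
sensitive = pd.get_dummies(toy[["Ethnic_Code_Text"]], dtype=float)

# Sensitive columns go last, so X_feature[:, -n_s:] selects the whole attribute.
X_feature = pd.concat([non_sensitive, sensitive], axis=1).to_numpy()
n_s = sensitive.shape[1]
print(X_feature.shape, n_s)
```

Deriving `n_s` from the encoded block keeps the later `X_batch[:, -n_s:]` slices aligned with the full sensitive attribute.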

# Preparing for VAE using tensorflow

In [12]:
class dataReader(object):
    # Code provided by Andrei Fajardo
    # to substitute for ...train.next_batch

    def __init__(self,*arrays,batch_size=1):
        self.arrays = arrays
        self.__check_equal_shape()
        self.num_examples = self.arrays[0].shape[0]
        self.batch_number = 0
        self.batch_size = batch_size
        self.num_batches = int(np.ceil(self.num_examples / batch_size))

    def __check_equal_shape(self):
        if any(self.arrays[0].shape[0] != arr.shape[0] for arr in self.arrays[1:]):
            raise ValueError("all arrays must be equal along first dimension")

    def next_batch(self):
        low_ix = self.batch_number*self.batch_size
        up_ix = (self.batch_number + 1)*self.batch_size
        if up_ix >= self.num_examples:
            up_ix = self.num_examples
            self.batch_number = 0 # reset batch_number to zero
        else:
            self.batch_number = self.batch_number + 1

        return [arr[low_ix:up_ix,:] for arr in self.arrays]
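
For a single pass over the data, the same batching idea can be written as a generator. This is a sketch only; `iterate_batches` is a hypothetical helper, not part of the notebook, and unlike `dataReader` it does not wrap around after the last batch.

```python
import numpy as np

def iterate_batches(*arrays, batch_size=1):
    """Yield aligned slices of every array, covering the data exactly once."""
    n = arrays[0].shape[0]
    if any(a.shape[0] != n for a in arrays[1:]):
        raise ValueError("all arrays must be equal along first dimension")
    for start in range(0, n, batch_size):
        # The final slice is shorter when n is not a multiple of batch_size.
        yield [a[start:start + batch_size] for a in arrays]

X_demo = np.arange(10).reshape(5, 2)      # 5 rows, 2 features
y_demo = np.arange(5)[:, np.newaxis]      # 5 rows, 1 target column
sizes = [len(xb) for xb, yb in iterate_batches(X_demo, y_demo, batch_size=2)]
print(sizes)
```

The last batch holds the remainder, which matches how `dataReader.next_batch` truncates `up_ix` at `num_examples`.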


In [13]:
#  Tensorflow  Implementation 

import tensorflow as tf
import os
import sys
from functools import partial
from sklearn.preprocessing import StandardScaler

In [14]:
# to make this notebook's output stable across runs
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

In [15]:
# RiskRecidivism dataset target RawScore (22)
# X_train.shape (12792, 35)
X_train, X_test, y_train, y_test = prepare_data_for_ml_model_1(RiskRecidivism,22)

Length for X_train: 12792  X_test: 3198  y_train: 12792  y_test: 3198


In [16]:
print(X_train.shape)

(12792, 35)


In [17]:
scaler = StandardScaler()
X_scaler = scaler.fit_transform(X_train)

# VFAE

In [18]:
from tensorflow.contrib.layers import fully_connected, batch_norm
from datetime import datetime



###  Functions

In [19]:
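# NOTE: helper carried over from an MNIST autoencoder example. It relies on
# `mnist` and `plot_image`, which this notebook does not define, and it is not called below.
def show_reconstructed_digits(X, outputs, model_path = None, n_test_digits = 2):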
    with tf.Session() as sess:
        if model_path:
            saver.restore(sess, model_path)
        X_test = mnist.test.images[:n_test_digits]
        outputs_val = outputs.eval(feed_dict={X: X_test})

    fig = plt.figure(figsize=(8, 3 * n_test_digits))
    for digit_index in range(n_test_digits):
        plt.subplot(n_test_digits, 2, digit_index * 2 + 1)
        plot_image(X_test[digit_index])
        plt.subplot(n_test_digits, 2, digit_index * 2 + 2)
        plot_image(outputs_val[digit_index])

### Construction Phase

We will construct the graph for the VFAE architecture:

    Input: X = [X_without_s, s], where s is the sensitive feature

    Middle Encodings: We're learning the parameters for the distribution of the encodings. What's different here is that we inject both the response y and the sensitive features in the middle layers.

    Output: X_copy
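
Each stochastic layer in the graph learns a mean and a log-variance (the `*_mean` and `*_gamma` tensors) and draws its encoding with the reparameterization trick. A minimal NumPy sketch of that sampling step follows; `sample_latent` and the demo values are illustrative and not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(mean, gamma, rng):
    """Draw z = mean + sigma * eps, with sigma = exp(0.5 * gamma) and eps ~ N(0, I)."""
    sigma = np.exp(0.5 * gamma)               # gamma is the log-variance
    eps = rng.standard_normal(mean.shape)     # noise independent of the parameters
    return mean + sigma * eps

mean = np.full((20000, 2), 3.0)
gamma = np.full((20000, 2), np.log(4.0))      # variance 4, so sigma = 2
z = sample_latent(mean, gamma, rng)
print(z.mean(axis=0), z.std(axis=0))
```

Because the randomness lives entirely in `eps`, gradients can flow through `mean` and `gamma`, which is why the graph samples its encodings this way.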

In [20]:
# Construction phase
# n_s = 10 # number of sensitive features
# n_inputs = 28*28 - n_s # number of non-sensitive features
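# NOTE: Ethnic_Code_Text is one-hot encoded into several trailing columns of
# X_feature, so n_s = 1 treats only the last of those columns as sensitive.
n_s = 1 # number of sensitive features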
n_inputs = 35 - n_s # number of non-sensitive features

# encoders
n_hidden1 = 50
n_hidden2 = 10 # codings
n_hidden3 = 50
n_hidden4 = 10

# decoders
n_hidden5 = 50
n_hidden6 = 10
n_hidden7 = 50

# final output can take a random sample from the posterior
n_outputs = n_inputs + n_s

In [21]:
### Training rates
alpha = 1
learning_rate = 0.0001

In [22]:
### Setting up the graph
tf.reset_default_graph()
with tf.contrib.framework.arg_scope(
        [fully_connected],
        activation_fn = tf.nn.elu,
        weights_initializer = tf.contrib.layers.variance_scaling_initializer()):
    X = tf.placeholder(tf.float32, shape = [None, n_inputs], name="X_wo_s")
    s = tf.placeholder(tf.float32, shape = [None, n_s], name="s")
    X_full = tf.concat([X,s], axis=1)
    y = tf.placeholder(tf.int32, shape = [None, 1], name="y") # for your example, switch this to tf.float32 bc you'll be doing reg
    # is_unlabelled = tf.placeholder(tf.bool, shape=(), name='is_training') # don't worry about this
    with tf.name_scope("X_encoder"):
        hidden1 = fully_connected(tf.concat([X, s], axis=1), n_hidden1)
        hidden2_mean = fully_connected(hidden1, n_hidden2, activation_fn = None)
        hidden2_gamma = fully_connected(hidden1, n_hidden2, activation_fn = None)
        hidden2_sigma = tf.exp(0.5 * hidden2_gamma)
    noise1 = tf.random_normal(tf.shape(hidden2_sigma), dtype=tf.float32)
    hidden2 = hidden2_mean + hidden2_sigma * noise1         # z1
    with tf.name_scope("Z1_encoder"):
        hidden3_ygz1 = fully_connected(hidden2, n_hidden4, activation_fn = tf.nn.tanh)
        hidden4_softmax_mean = fully_connected(hidden3_ygz1, 10, activation_fn = tf.nn.softmax)
   
        #if is_unlabelled == True:
            # impute by sampling from q(y|z1)
        #    y = tf.assign(y, tf.multinomial(hidden4_softmax_mean, 1,
        #                        output_type = tf.int32))
    
        hidden3 = fully_connected(tf.concat([hidden2, tf.cast(y, tf.float32)], axis=1),
                        n_hidden3, activation_fn=tf.nn.tanh)
        hidden4_mean = fully_connected(hidden3, n_hidden4, activation_fn = None)
        hidden4_gamma = fully_connected(hidden3, n_hidden4, activation_fn = None)
        hidden4_sigma = tf.exp(0.5 * hidden4_gamma)
    noise2 = tf.random_normal(tf.shape(hidden4_sigma), dtype=tf.float32)
    hidden4 = hidden4_mean + hidden4_sigma * noise2     # z2
    with tf.name_scope("Z1_decoder"):
        hidden5 = fully_connected(tf.concat([hidden4, tf.cast(y, tf.float32)], axis=1 ),
                    n_hidden5, activation_fn = tf.nn.tanh)
        hidden6_mean = fully_connected(hidden5, n_hidden6, activation_fn = None)
        hidden6_gamma = fully_connected(hidden5, n_hidden6, activation_fn = None)
        hidden6_sigma = tf.exp(0.5 * hidden6_gamma)
    noise3 = tf.random_normal(tf.shape(hidden6_sigma), dtype=tf.float32)
    hidden6 = hidden6_mean + hidden6_sigma * noise3     # z1 (decoded)
    with tf.name_scope("X_decoder"):
        hidden7 = fully_connected(tf.concat([hidden6, s], axis=1), n_hidden7,
                                 activation_fn = tf.nn.tanh)
        hidden8 = fully_connected(hidden7, n_outputs, activation_fn = None)
    outputs = tf.sigmoid(hidden8, name="decoded_X")

### Loss Function: ELBO

In [23]:
# evidence lower bound (ELBO)
with tf.name_scope("ELB"):
    kl_z2 = 0.5 * tf.reduce_sum(
                    tf.exp(hidden4_gamma)
                    + tf.square(hidden4_mean)
                    - 1
                    - hidden4_gamma
                    )

    kl_z1 = 0.5 * (tf.reduce_sum(
                    (1 / (1e-10 + tf.exp(hidden6_gamma))) * tf.exp(hidden2_gamma)
                    - 1
                    + hidden6_gamma
                    - hidden2_gamma
                    ) + tf.einsum('ij,ji -> i', # this might not work for you depending on version of tflow
                        (hidden6_mean-hidden2_mean) * (1 / (1e-10 + tf.exp(hidden6_gamma))),
                        tf.transpose((hidden6_mean-hidden2_mean))))

    indices = tf.range(tf.shape(y)[0])
    indices = tf.concat([indices[:, tf.newaxis], y], axis=1)
    eps = 1e-10
    log_q_y_z1 = tf.reduce_sum(tf.log(eps + tf.gather_nd(hidden4_softmax_mean, indices)))

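    # Bernoulli log-likelihood (assumes targets in [0, 1]; the StandardScaler
    # output fed in as X and s is not restricted to that range)
    reconstruction_loss = -(tf.reduce_sum(X_full * tf.log(outputs)
                            + (1 - X_full) * tf.log(1 - outputs)))
    # NOTE: log_q_y_z1 is a log-likelihood, so adding it to a minimized cost
    # pushes it down; to reward correct y predictions, subtract alpha * log_q_y_z1 instead.
    cost = kl_z2 + kl_z1 + reconstruction_loss + alpha * log_q_y_z1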

In [24]:
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(cost)
saver = tf.train.Saver()
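
Both KL terms in the loss are instances of the closed-form KL divergence between two diagonal Gaussians, parameterized by mean and log-variance. A NumPy sketch of that formula follows; `kl_diag_gaussians` is a hypothetical helper that returns one value per row rather than the summed scalar used in the graph.

```python
import numpy as np

def kl_diag_gaussians(mean_q, gamma_q, mean_p, gamma_p):
    """KL(q || p) per row for diagonal Gaussians; gamma denotes log-variance."""
    var_q, var_p = np.exp(gamma_q), np.exp(gamma_p)
    per_dim = (var_q / var_p
               + (mean_p - mean_q) ** 2 / var_p
               - 1.0
               + gamma_p - gamma_q)
    return 0.5 * per_dim.sum(axis=1)

mean_q = np.array([[1.0, 0.0]])
gamma_q = np.zeros((1, 2))
# Against a standard normal prior (mean 0, log-variance 0), as in kl_z2.
kl_to_prior = kl_diag_gaussians(mean_q, gamma_q, np.zeros((1, 2)), np.zeros((1, 2)))
# A distribution compared with itself has zero divergence.
kl_to_self = kl_diag_gaussians(mean_q, gamma_q, mean_q, gamma_q)
print(kl_to_prior, kl_to_self)
```

With a standard normal `p` the expression reduces to the `kl_z2` form, and with `q` taken from `hidden2_*` and `p` from `hidden6_*` it corresponds to `kl_z1`.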

### Initialize Graph & Load Data

In [25]:
upper_lim = 12600

In [26]:
# instantiate
y_bin = pd.cut(y_train, bins=10, labels=False) 
data_reader = dataReader(X_scaler,y_bin[:,np.newaxis], batch_size=150)
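
`pd.cut(..., bins=10)` derives ten equal-width bins from whichever array it is given, so binning train and test targets separately can assign the same RawScore to different classes. A sketch of reusing the training edges follows, on made-up values; clipping out-of-range test values to the outer edges is an assumption made here, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Made-up target values for illustration.
y_train_demo = np.array([-3.0, -1.2, 0.3, 1.1, 2.0])
y_test_demo = np.array([-2.7, 1.9])

# retbins=True also returns the bin edges computed from the training targets.
train_bins, edges = pd.cut(y_train_demo, bins=10, labels=False, retbins=True)

# Reuse those edges for the test targets; clip first so values outside
# the training range fall into the outermost bins instead of becoming NaN.
clipped = np.clip(y_test_demo, edges[0], edges[-1])
test_bins = pd.cut(clipped, bins=edges, labels=False, include_lowest=True)
print(train_bins, test_bins)
```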

In [27]:
init = tf.global_variables_initializer()

In [28]:
# Training
n_epochs = 50
batch_size = 100
n_digits = 60

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        # n_batches = mnist.train.num_examples // batch_size
        n_batches = data_reader.num_batches
        for iteration in range(n_batches):
            print("\r{}%".format(100 * iteration // n_batches), end="")
            X_batch, y_batch = data_reader.next_batch()
            sess.run(training_op, feed_dict={X: X_batch[:,:-n_s],
                                    s: X_batch[:,-n_s:],
                                    #you replace y_batch[:,np.newaxis] with just y_batch          
                                    #y: y_batch[:,np.newaxis],        
                                    y: y_batch,
                                   #is_unlabelled: False
                                            })
        kl_z2_val, kl_z1_val, log_q_y_z1_val, reconstruction_loss_val, loss_val = sess.run([
                kl_z2,
                kl_z1,
                log_q_y_z1,
                reconstruction_loss,
                cost],
                feed_dict={X: X_batch[:,:-n_s],
                        s: X_batch[:,-n_s:],
                        #you replace y_batch[:,np.newaxis] with just y_batch 
                        #y: y_batch[:,np.newaxis]
                        y: y_batch})
        saver.save(sess, "./my_model_all_layers.ckpt")
        print("\r{}".format(epoch), "Train total loss:", loss_val,
         "\tReconstruction loss:", reconstruction_loss_val,
          "\tKL-z1:", kl_z1_val,
          "\tKL-z2:", kl_z2_val,
          "\tlog_q(y|z1):", log_q_y_z1_val)

0 Train total loss: 3779.1228 	Reconstruction loss: 1166.8595 	KL-z1: 2279.962 	KL-z2: 452.30466 	log_q(y|z1): -120.00319
1 Train total loss: 3557.648 	Reconstruction loss: 1188.415 	KL-z1: 2060.1553 	KL-z2: 438.14215 	log_q(y|z1): -129.0644
2 Train total loss: 3421.7678 	Reconstruction loss: 1023.75366 	KL-z1: 2124.5479 	KL-z2: 418.4137 	log_q(y|z1): -144.94716
3 Train total loss: 4589.365 	Reconstruction loss: 943.99585 	KL-z1: 3338.1914 	KL-z2: 438.39874 	log_q(y|z1): -131.2208
4 Train total loss: 2777.9712 	Reconstruction loss: 949.4848 	KL-z1: 1515.62 	KL-z2: 449.7776 	log_q(y|z1): -136.9111
5 Train total loss: 3518.602 	Reconstruction loss: 809.0591 	KL-z1: 2391.942 	KL-z2: 440.69077 	log_q(y|z1): -123.08969
6 Train total loss: 3136.638 	Reconstruction loss: 821.1722 	KL-z1: 2008.4099 	KL-z2: 432.47833 	log_q(y|z1): -125.422356
7 Train total loss: 2457.142 	Reconstruction loss: 632.9341 	KL-z1: 1516.9862 	KL-z2: 455.12613 	log_q(y|z1): -147.90442
8 Train total loss: 3271.323 	Rec

In [29]:
def restore_session(path):
    saver = tf.train.Saver()
    sess = tf.Session()
    saver.restore(sess, path)
    return sess

In [30]:
data_reader = dataReader(X_scaler[:upper_lim,:],y_bin[:upper_lim,np.newaxis], batch_size=150)
sess = restore_session("./my_model_all_layers.ckpt")

INFO:tensorflow:Restoring parameters from ./my_model_all_layers.ckpt


In [31]:
n_batches = data_reader.num_batches
encodings = None
for iteration in range(n_batches):
    X_batch, y_batch = data_reader.next_batch()
    codings_rnd = np.random.normal(scale=2,size=[n_digits, n_hidden2])
    s_rnd = X_batch[:, -n_s:]
    if encodings is None:
        encodings = sess.run(hidden2,feed_dict={X: X_batch[:,:-n_s],
                                            s: s_rnd,
                                            y: y_batch,
                                            })
    else:
        new_encodings = sess.run(hidden2,feed_dict={X: X_batch[:,:-n_s],
                                            s: s_rnd,
                                            y: y_batch,
                                            })
        encodings = np.concatenate((encodings, new_encodings))
sess.close()

In [32]:


encodings # train regression model on encodings now

array([[ 1.0124654 , -0.68345124, -2.3004704 , ...,  0.12640376,
        -0.21199988, -0.1584189 ],
       [ 1.0563194 , -0.9180748 ,  2.2737067 , ...,  2.141169  ,
         0.65167594, -1.0904669 ],
       [-0.13431466,  0.06996489,  3.9947982 , ..., -2.2501543 ,
         0.84560174,  2.5790262 ],
       ...,
       [-1.6773155 , -2.1915684 , -2.7678638 , ...,  1.6158307 ,
        -1.4345584 ,  0.9333562 ],
       [ 0.5645692 ,  1.8341246 , -1.6351135 , ...,  1.209703  ,
         0.704249  ,  0.02950668],
       [ 0.66798586,  1.4976501 ,  0.95820415, ..., -0.7433622 ,
         1.8012375 ,  0.83766854]], dtype=float32)
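
The comment above points to the next step, fitting a regression model on the encodings. A minimal least-squares sketch follows, using synthetic stand-ins for the encodings and the target; on the real data the target would be the RawScore rows aligned with the encoded rows. All names and values here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 200 rows of 10-dimensional encodings and a linear target.
Z = rng.standard_normal((200, 10))
true_w = rng.standard_normal(10)
target = Z @ true_w + 0.01 * rng.standard_normal(200)

# Ordinary least squares with an intercept column appended.
Z1 = np.hstack([Z, np.ones((Z.shape[0], 1))])
coef, *_ = np.linalg.lstsq(Z1, target, rcond=None)

# Coefficient of determination on the fitted data.
pred = Z1 @ coef
r2 = 1.0 - ((target - pred) ** 2).sum() / ((target - target.mean()) ** 2).sum()
print(round(r2, 3))
```

A useful companion check, not shown here, is to fit the same kind of model with the sensitive attribute as the target: if the encodings predict it well, the attribute has not been removed.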

# Using Test data

In [33]:
# reuse the scaler fitted on the training data (In [17]) so that no test-set statistics are used
X_scaler = scaler.transform(X_test)

In [34]:
# instantiate
y_bin = pd.cut(y_test, bins=10, labels=False) 
data_reader = dataReader(X_scaler,y_bin[:,np.newaxis], batch_size=150)

In [35]:
init = tf.global_variables_initializer()

In [36]:
# Training  using Test Data
n_epochs = 50
batch_size = 100
n_digits = 60

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        # n_batches = mnist.train.num_examples // batch_size
        n_batches = data_reader.num_batches
        for iteration in range(n_batches):
            print("\r{}%".format(100 * iteration // n_batches), end="")
            X_batch, y_batch = data_reader.next_batch()
            sess.run(training_op, feed_dict={X: X_batch[:,:-n_s],
                                    s: X_batch[:,-n_s:],
                                    #you replace y_batch[:,np.newaxis] with just y_batch          
                                    #y: y_batch[:,np.newaxis],        
                                    y: y_batch,
                                   #is_unlabelled: False
                                            })
        kl_z2_val, kl_z1_val, log_q_y_z1_val, reconstruction_loss_val, loss_val = sess.run([
                kl_z2,
                kl_z1,
                log_q_y_z1,
                reconstruction_loss,
                cost],
                feed_dict={X: X_batch[:,:-n_s],
                        s: X_batch[:,-n_s:],
                        #you replace y_batch[:,np.newaxis] with just y_batch 
                        #y: y_batch[:,np.newaxis]
                        y: y_batch})
        saver.save(sess, "./mytest_model_all_layers.ckpt")
        print("\r{}".format(epoch), "Train total loss:", loss_val,
         "\tReconstruction loss:", reconstruction_loss_val,
          "\tKL-z1:", kl_z1_val,
          "\tKL-z2:", kl_z2_val,
          "\tlog_q(y|z1):", log_q_y_z1_val)

0 Train total loss: 4290.7095 	Reconstruction loss: 1317.669 	KL-z1: 2607.9128 	KL-z2: 506.19717 	log_q(y|z1): -141.06984
1 Train total loss: 4015.814 	Reconstruction loss: 1368.6458 	KL-z1: 2273.9487 	KL-z2: 524.29285 	log_q(y|z1): -151.07372
2 Train total loss: 4089.7366 	Reconstruction loss: 1316.7373 	KL-z1: 2441.7095 	KL-z2: 482.20908 	log_q(y|z1): -150.91922
3 Train total loss: 3745.1907 	Reconstruction loss: 1301.5696 	KL-z1: 2143.1902 	KL-z2: 439.28406 	log_q(y|z1): -138.85306
4 Train total loss: 4085.4014 	Reconstruction loss: 1243.2041 	KL-z1: 2452.7698 	KL-z2: 535.386 	log_q(y|z1): -145.95839
5 Train total loss: 3719.594 	Reconstruction loss: 1189.7146 	KL-z1: 2149.7273 	KL-z2: 532.7524 	log_q(y|z1): -152.60031
6 Train total loss: 3733.166 	Reconstruction loss: 1286.5837 	KL-z1: 2104.411 	KL-z2: 498.56818 	log_q(y|z1): -156.39668
7 Train total loss: 3729.8933 	Reconstruction loss: 1156.4026 	KL-z1: 2193.3066 	KL-z2: 535.4926 	log_q(y|z1): -155.30853
8 Train total loss: 3170.

In [38]:
data_reader = dataReader(X_scaler[:upper_lim,:],y_bin[:upper_lim,np.newaxis], batch_size=150)
#sess = restore_session("./my_model_all_layers.ckpt")
sess = restore_session("./mytest_model_all_layers.ckpt")

INFO:tensorflow:Restoring parameters from ./mytest_model_all_layers.ckpt


In [39]:
n_batches = data_reader.num_batches
encodings = None
for iteration in range(n_batches):
    X_batch, y_batch = data_reader.next_batch()
    codings_rnd = np.random.normal(scale=2,size=[n_digits, n_hidden2])
    s_rnd = X_batch[:, -n_s:]
    if encodings is None:
        encodings = sess.run(hidden2,feed_dict={X: X_batch[:,:-n_s],
                                            s: s_rnd,
                                            y: y_batch,
                                            })
    else:
        new_encodings = sess.run(hidden2,feed_dict={X: X_batch[:,:-n_s],
                                            s: s_rnd,
                                            y: y_batch,
                                            })
        encodings = np.concatenate((encodings, new_encodings))
sess.close()

In [40]:

encodings # train regression model on encodings now

array([[ 5.1388073e-01,  1.1245508e+00,  1.6173003e+00, ...,
         8.9260030e-01, -1.2240034e-01,  1.5899843e-01],
       [ 1.2821453e+00,  2.1456480e+00,  2.1543977e+00, ...,
         2.8752856e+00, -3.3987420e+00,  1.5216279e-01],
       [ 1.0283835e+00,  1.1371301e+00,  3.7215955e+00, ...,
         2.3477726e+00, -9.6792531e-01, -1.7790174e-01],
       ...,
       [ 2.8108841e-01,  8.6981964e-01,  1.7604111e-01, ...,
        -2.7629605e-01,  1.6088486e-03,  2.7655661e+00],
       [ 1.7274559e+00,  9.4036657e-01,  1.3196875e-01, ...,
        -2.6386590e+00,  4.7366416e-01, -4.4455914e+00],
       [ 9.1038942e-02,  8.9013159e-01, -3.6004763e+00, ...,
        -1.2782619e+00,  1.1992254e+01, -1.0526474e+00]], dtype=float32)