<h1>2b. Machine Learning using tf.estimator </h1>

In this notebook, we will create a machine learning model using tf.estimator and evaluate its performance. The dataset is small enough (about 160,000 rows after cleanup) to handle entirely in memory. We will also pass the data in largely as-is, with only basic cleanup and label encoding.

In [21]:
import tensorflow as tf
import pandas as pd
import numpy as np
import shutil
from sklearn import preprocessing

Clean up the data

In [22]:

def data_cleanup():
    # Load the tickets plus the address and lat/lon lookup tables.
    X = pd.read_csv('train.csv', encoding='ISO-8859-1', low_memory=False)
    addresses = pd.read_csv('addresses.csv', encoding='ISO-8859-1', low_memory=False)
    latlons = pd.read_csv('latlons.csv', encoding='ISO-8859-1', low_memory=False)

    # Attach lat/lon to each ticket via its address, then keep US tickets only.
    X = pd.merge(X, pd.merge(addresses, latlons, on='address'), on='ticket_id')
    X = X[X.country.isin(['USA'])]

    # Drop columns that are not used as model inputs.
    train_columns_del = ['agency_name', 'inspector_name', 'violator_name',
       'violation_street_number', 'violation_street_name',
       'violation_zip_code', 'mailing_address_str_number',
       'mailing_address_str_name', 'city', 'state', 'zip_code',
       'non_us_str_code', 'country', 'ticket_issued_date', 'hearing_date',
       'violation_description', 'fine_amount',
       'admin_fee', 'state_fee', 'late_fee', 'discount_amount',
       'clean_up_cost', 'judgment_amount', 'payment_amount', 'balance_due',
       'payment_date', 'payment_status', 'collection_status',
       'grafitti_status', 'compliance_detail', 'address']
    X.drop(train_columns_del, axis=1, inplace=True)

    # Keep rows with a known binary label and no missing values.
    valid_val = [0, 1]
    X_clean_data = X[X.compliance.isin(valid_val)]
    X_clean_data.dropna(inplace=True)

    # Encode the categorical columns as integer codes.
    labelEncoder = preprocessing.LabelEncoder()
    X_clean_data.loc[:, 'disposition'] = labelEncoder.fit_transform(X_clean_data.loc[:, 'disposition'])
    X_clean_data.loc[:, 'violation_code'] = labelEncoder.fit_transform(X_clean_data.loc[:, 'violation_code'])

    return X_clean_data
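
To make the encoding step concrete, here is a small sketch of what LabelEncoder.fit_transform does to a string column. The sample values are invented for illustration and are not taken from train.csv. The sketch also keeps one encoder per column, which is an assumption that differs from the cleanup function (it reuses a single encoder), so that each mapping can be inverted later.

```python
import pandas as pd
from sklearn import preprocessing

# Toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    'disposition': ['A', 'B', 'A', 'C'],
    'violation_code': ['x-2', 'x-1', 'x-2', 'x-1'],
})

# One encoder per column, so each integer code can be mapped back.
encoders = {}
for col in ['disposition', 'violation_code']:
    enc = preprocessing.LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

print(toy)
print(list(encoders['disposition'].classes_))
```

Codes are assigned in sorted order of the distinct values, so they carry no meaning beyond identity.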
        

In [23]:
df_train=data_cleanup()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
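
The warning above typically appears when a filtered DataFrame is modified in place, because pandas cannot tell whether the filtered result is a view or a copy. A minimal sketch of one common way to avoid it, using a toy frame with invented values: take an explicit copy after filtering and use the non-inplace form of dropna.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the merged ticket table (values invented).
X = pd.DataFrame({
    'compliance': [0.0, 1.0, np.nan, 1.0],
    'lat': [42.3, np.nan, 42.4, 42.5],
})

# Filter, then copy so later edits clearly target a new object.
clean = X[X.compliance.isin([0, 1])].copy()
clean = clean.dropna()

print(clean)
```

The original frame X is left untouched, and later column assignments on clean no longer trigger the warning.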


In [24]:
CSV_COLUMNS=['lat','lon','disposition','violation_code','compliance']
LABEL='compliance'
FEATURES = ['lat','lon','disposition','violation_code']

df_train.shape

(159867, 6)

<h2> Train and eval input functions to read from a pandas DataFrame </h2>

In [25]:
# Return an input function that feeds the DataFrame to the model in shuffled batches.
def make_train_input_fn(df, num_epochs):
  return tf.estimator.inputs.pandas_input_fn(
    x = df[FEATURES],   # pass only the feature columns, not the label
    y = df[LABEL],
    batch_size = 128,
    num_epochs = num_epochs,
    shuffle = True,
    queue_capacity = 100
  )
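
As a rough check on how long training runs, the number of training steps follows from the row count, the batch size, and num_epochs. The helper below is a hypothetical name, and the estimate assumes the input function emits every row once per epoch with a possibly smaller final batch.

```python
import math

def estimated_steps(n_rows, batch_size, num_epochs):
    """Approximate number of batches produced over all epochs."""
    return math.ceil(n_rows * num_epochs / batch_size)

# Row count reported by df_train.shape in this notebook.
print(estimated_steps(159867, 128, 1))
print(estimated_steps(159867, 128, 15))
```

With these numbers, one epoch is about 1,249 steps and 15 epochs about 18,735, which is consistent with the DNN training log passing step 15501 before it finishes.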

In [6]:
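# Evaluation input function: a single, unshuffled pass over the data.
def make_eval_input_fn(df):
  return tf.estimator.inputs.pandas_input_fn(
    x = df[FEATURES],   # pass only the feature columns, not the label
    y = df[LABEL],
    batch_size = 128,
    shuffle = False,
    queue_capacity = 100
  )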

Our input function for predictions is the same as the eval one, except that we don't provide a label.

In [7]:
def make_prediction_input_fn(df):
  return tf.estimator.inputs.pandas_input_fn(
    x = df[FEATURES],
    y = None,
    batch_size = 128,
    shuffle = False,   # keep row order so predictions line up with the input rows
    queue_capacity = 100
  )
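
Predictions come back as a plain sequence with no row identifiers, so they can only be matched to tickets if the input function preserves row order (shuffle = False). Below is a sketch of attaching them to ticket_id. It assumes the ticket_id column is still present and that each prediction is a dict with a 'probabilities' array, as the canned classifiers are expected to produce. The helper name and the stand-in predictions are invented for illustration.

```python
import numpy as np
import pandas as pd

def attach_predictions(df, predictions):
    """Pair each row's ticket_id with the predicted probability of class 1."""
    probs = [float(p['probabilities'][1]) for p in predictions]
    if len(probs) != len(df):
        raise ValueError('expected one prediction per row')
    return pd.Series(probs, index=df['ticket_id'], name='compliance')

# Stand-in inputs; real code would pass model.predict(...) instead.
toy_df = pd.DataFrame({'ticket_id': [101, 102, 103]})
toy_preds = [{'probabilities': np.array([0.9, 0.1])},
             {'probabilities': np.array([0.4, 0.6])},
             {'probabilities': np.array([0.7, 0.3])}]

print(attach_predictions(toy_df, toy_preds))
```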

<h3> Create feature columns for estimator </h3>

In [10]:
# Tell the estimator which DataFrame columns to read, each as a numeric input.
def make_feature_cols():
  input_columns = [tf.feature_column.numeric_column(k) for k in FEATURES] 
  return input_columns
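
Note that numeric_column treats the integer codes for disposition and violation_code as magnitudes, so the model sees code 4 as twice code 2 even though the codes are only labels. One alternative, sketched here in pandas with invented values, is to expand a categorical column into one indicator column per category before feeding it in. This illustrates the idea only; it is not what the notebook does.

```python
import pandas as pd

# Toy frame with an integer-coded categorical column (values invented).
toy = pd.DataFrame({'lat': [42.3, 42.4, 42.5],
                    'disposition': [0, 2, 1]})

# One indicator column per distinct code.
one_hot = pd.get_dummies(toy, columns=['disposition'], prefix='disposition')

print(one_hot.columns.tolist())
```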

<h3> Linear classifier with the tf.estimator framework </h3>

In [26]:
tf.logging.set_verbosity(tf.logging.INFO)

OUTDIR = 'ml_trained'
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time

model = tf.estimator.LinearClassifier(
      n_classes=2, feature_columns = make_feature_cols(), model_dir = OUTDIR)



I0730 12:12:16.061101 140258105837376 estimator.py:1790] Using default config.
I0730 12:12:16.066199 140258105837376 estimator.py:209] Using config: {'_model_dir': 'ml_trained', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9008313048>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


In [27]:
model.train(input_fn = make_train_input_fn(df_train, num_epochs = 1))

I0730 12:12:28.087406 140258105837376 estimator.py:1145] Calling model_fn.
I0730 12:12:29.540736 140258105837376 estimator.py:1147] Done calling model_fn.
I0730 12:12:29.581441 140258105837376 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
I0730 12:12:30.134399 140258105837376 monitored_session.py:240] Graph was finalized.
I0730 12:12:30.324037 140258105837376 session_manager.py:500] Running local_init_op.
I0730 12:12:30.362643 140258105837376 session_manager.py:502] Done running local_init_op.
I0730 12:12:31.705451 140258105837376 basic_session_run_hooks.py:606] Saving checkpoints for 0 into ml_trained/model.ckpt.
I0730 12:12:32.429701 140258105837376 basic_session_run_hooks.py:262] loss = 88.722855, step = 1
I0730 12:12:33.513151 140258105837376 basic_session_run_hooks.py:692] global_step/sec: 92.2634
I0730 12:12:33.518216 140258105837376 basic_session_run_hooks.py:260] loss = 15.751794, step = 101 (1.089 sec)
I0730 12:12:34.675566 140258105837376 basic_session_run_hooks

<tensorflow_estimator.python.estimator.canned.linear.LinearClassifier at 0x7f90083132e8>
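
The first logged loss is easier to interpret per example. Assume the canned estimator sums the cross-entropy over each batch (believed to be the default reduction for TensorFlow 1.x canned estimators) and starts from zero weights. Every example is then initially predicted at probability 0.5, which costs ln 2 each. The arithmetic below checks that assumption against the logged value.

```python
import math

batch_size = 128                # from make_train_input_fn
first_logged_loss = 88.722855   # "loss = 88.722855, step = 1" in the log above

# Cross-entropy of a 0.5 prediction, summed over one batch.
expected_initial_loss = batch_size * math.log(2)

print(expected_initial_loss)
print(first_logged_loss / batch_size)   # per-example loss at step 1
print(15.751794 / batch_size)           # per-example loss at step 101
```

The two values agree to about four decimal places, which supports reading the logged loss as a per-batch sum. Dividing by the batch size gives a per-example figure that is easier to compare across runs.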

In [None]:
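# NOTE: df_test must be defined first (a DataFrame containing the FEATURES columns).
# model.predict returns a generator, so results are produced as the loop consumes them.
predictions = model.predict(input_fn = make_prediction_input_fn(df_test))
for items in predictions:
  print(items)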
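
The prediction cell relies on a df_test that this notebook never creates. One way to obtain it, sketched here under the assumption that a random hold-out from the cleaned data is acceptable, is to sample a fraction of rows for testing and train on the rest. The function name, the 20% fraction, and the seed are arbitrary choices for illustration.

```python
import pandas as pd

def split_train_test(df, test_frac=0.2, seed=0):
    """Randomly hold out test_frac of the rows; the rest is for training."""
    test = df.sample(frac=test_frac, random_state=seed)
    train = df.drop(test.index)   # relies on the index being unique
    return train, test

# Toy frame standing in for the cleaned data (values invented).
toy = pd.DataFrame({'lat': range(10), 'compliance': [0, 1] * 5})
toy_train, toy_test = split_train_test(toy)

print(len(toy_train), len(toy_test))
```

For this to be a fair test, the split would need to happen before training, so that the model never sees the held-out rows.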

<h3> Deep Neural Network classifier </h3>
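
For a sense of scale, the network configured in the next cell is small. Below is a sketch of counting its trainable weights, assuming four numeric inputs (the FEATURES list), fully connected layers, and a single output logit for the two-class case. The helper name is hypothetical.

```python
def dense_param_count(n_inputs, hidden_units, n_outputs):
    """Weights plus biases for a stack of fully connected layers."""
    sizes = [n_inputs] + list(hidden_units) + [n_outputs]
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes[:-1], sizes[1:]))

# 4 inputs -> 32 -> 8 -> 2 -> 1 logit (the output size is an assumption).
print(dense_param_count(4, [32, 8, 2], 1))
```

Under these assumptions the count is 445, and the final hidden layer of width 2 is a narrow bottleneck just before the output.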

In [32]:
tf.logging.set_verbosity(tf.logging.INFO)
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time
model = tf.estimator.DNNClassifier(hidden_units = [32, 8, 2],
      feature_columns = make_feature_cols(), model_dir = OUTDIR)
model.train(input_fn = make_train_input_fn(df_train, num_epochs = 15));

I0730 12:24:54.130812 140258105837376 estimator.py:1790] Using default config.
I0730 12:24:54.134714 140258105837376 estimator.py:209] Using config: {'_model_dir': 'ml_trained', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f900a26e940>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
I0730 12:24:54.223767

I0730 12:25:39.406068 140258105837376 basic_session_run_hooks.py:260] loss = 39.209507, step = 3201 (1.419 sec)
I0730 12:25:40.362552 140258105837376 basic_session_run_hooks.py:692] global_step/sec: 104.346
I0730 12:25:40.364565 140258105837376 basic_session_run_hooks.py:260] loss = 23.313036, step = 3301 (0.958 sec)
I0730 12:25:41.481592 140258105837376 basic_session_run_hooks.py:692] global_step/sec: 89.3566
I0730 12:25:41.486921 140258105837376 basic_session_run_hooks.py:260] loss = 21.480968, step = 3401 (1.122 sec)
I0730 12:25:42.669611 140258105837376 basic_session_run_hooks.py:692] global_step/sec: 84.17
I0730 12:25:42.671102 140258105837376 basic_session_run_hooks.py:260] loss = 41.202595, step = 3501 (1.184 sec)
I0730 12:25:43.828720 140258105837376 basic_session_run_hooks.py:692] global_step/sec: 86.2771
I0730 12:25:43.830579 140258105837376 basic_session_run_hooks.py:260] loss = 71.273964, step = 3601 (1.159 sec)
I0730 12:25:44.806781 140258105837376 basic_session_run_hooks.

I0730 12:26:27.746583 140258105837376 basic_session_run_hooks.py:260] loss = 27.726315, step = 7201 (1.298 sec)
I0730 12:26:28.761475 140258105837376 basic_session_run_hooks.py:692] global_step/sec: 98.3234
I0730 12:26:28.765848 140258105837376 basic_session_run_hooks.py:260] loss = 49.920578, step = 7301 (1.019 sec)
I0730 12:26:29.840538 140258105837376 basic_session_run_hooks.py:692] global_step/sec: 92.6744
I0730 12:26:29.842231 140258105837376 basic_session_run_hooks.py:260] loss = 32.365913, step = 7401 (1.076 sec)
I0730 12:26:31.156415 140258105837376 basic_session_run_hooks.py:692] global_step/sec: 75.9934
I0730 12:26:31.158086 140258105837376 basic_session_run_hooks.py:260] loss = 43.50245, step = 7501 (1.316 sec)
I0730 12:26:32.516227 140258105837376 basic_session_run_hooks.py:692] global_step/sec: 73.5392
I0730 12:26:32.517657 140258105837376 basic_session_run_hooks.py:260] loss = 26.936308, step = 7601 (1.360 sec)
I0730 12:26:33.537895 140258105837376 basic_session_run_hooks

I0730 12:27:16.301893 140258105837376 basic_session_run_hooks.py:260] loss = 19.999195, step = 11201 (1.364 sec)
I0730 12:27:17.931656 140258105837376 basic_session_run_hooks.py:692] global_step/sec: 61.1258
I0730 12:27:17.934694 140258105837376 basic_session_run_hooks.py:260] loss = 26.949858, step = 11301 (1.633 sec)
I0730 12:27:18.979066 140258105837376 basic_session_run_hooks.py:692] global_step/sec: 95.4718
I0730 12:27:18.984104 140258105837376 basic_session_run_hooks.py:260] loss = 18.599442, step = 11401 (1.049 sec)
I0730 12:27:20.026684 140258105837376 basic_session_run_hooks.py:692] global_step/sec: 95.4586
I0730 12:27:20.028436 140258105837376 basic_session_run_hooks.py:260] loss = 58.581516, step = 11501 (1.044 sec)
I0730 12:27:20.944778 140258105837376 basic_session_run_hooks.py:692] global_step/sec: 108.916
I0730 12:27:20.946895 140258105837376 basic_session_run_hooks.py:260] loss = 11.143333, step = 11601 (0.918 sec)
I0730 12:27:22.099390 140258105837376 basic_session_run

I0730 12:28:05.188620 140258105837376 basic_session_run_hooks.py:692] global_step/sec: 88.6243
I0730 12:28:05.190448 140258105837376 basic_session_run_hooks.py:260] loss = 36.61597, step = 15201 (1.129 sec)
I0730 12:28:06.193393 140258105837376 basic_session_run_hooks.py:692] global_step/sec: 99.5073
I0730 12:28:06.195145 140258105837376 basic_session_run_hooks.py:260] loss = 30.66317, step = 15301 (1.005 sec)
I0730 12:28:07.329099 140258105837376 basic_session_run_hooks.py:692] global_step/sec: 88.0614
I0730 12:28:07.331417 140258105837376 basic_session_run_hooks.py:260] loss = 28.93161, step = 15401 (1.136 sec)
I0730 12:28:08.486600 140258105837376 basic_session_run_hooks.py:692] global_step/sec: 86.3839
I0730 12:28:08.488480 140258105837376 basic_session_run_hooks.py:260] loss = 37.907127, step = 15501 (1.157 sec)
I0730 12:28:09.755958 140258105837376 basic_session_run_hooks.py:692] global_step/sec: 78.7898
I0730 12:28:09.758725 140258105837376 basic_session_run_hooks.py:260] loss =