# Introduction

The H2O [Generalized Linear Model](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLM) can be used for supervised classification.

In [1]:
import numpy as np
import pandas as pd
import os
import h2o

ImportError: No module named pandas

In [3]:
h2o.init(max_mem_size = "2G")             #cap the memory available to H2O at 2 GB; uses all cores by default
h2o.remove_all()                          #clean slate, in case cluster was already running

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: java version "1.8.0_161"; Java(TM) SE Runtime Environment (build 1.8.0_161-b12); Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
  Starting server from /usr/local/lib/python3.5/dist-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpcbwp5um6
  JVM stdout: /tmp/tmpcbwp5um6/h2o_yarenty_started_from_python.out
  JVM stderr: /tmp/tmpcbwp5um6/h2o_yarenty_started_from_python.err
  Server is running at http://127.0.0.1:54323
Connecting to H2O server at http://127.0.0.1:54323... successful.


H2O cluster uptime:,09 secs
H2O cluster timezone:,Europe/Dublin
H2O data parsing timezone:,UTC
H2O cluster version:,3.18.0.4
H2O cluster version age:,5 days
H2O cluster name:,H2O_from_python_yarenty_vlfa2f
H2O cluster total nodes:,1
H2O cluster free memory:,1.778 Gb
H2O cluster total cores:,4
H2O cluster allowed cores:,4


In [4]:
help(h2o)

Help on package h2o:

NAME
    h2o - :mod:`h2o` -- module for using H2O services.

DESCRIPTION
    (please add description).

PACKAGE CONTENTS
    assembly
    astfun
    automl (package)
    backend (package)
    cross_validation
    demos
    display
    estimators (package)
    exceptions
    expr
    expr_optimizer
    frame
    grid (package)
    group_by
    h2o
    job
    model (package)
    schemas (package)
    transforms (package)
    two_dim_table
    utils (package)

SUBMODULES
    __init__

FUNCTIONS
    api(endpoint, data=None, json=None, filename=None, save_to=None)
        Perform a REST API request to a previously connected server.
        
        This function is mostly for internal purposes, but may occasionally be useful for direct access to
        the backend H2O server. It has same parameters as :meth:`H2OConnection.request <h2o.backend.H2OConnection.request>`.
    
    as_list(data, use_pandas=True, header=True)
        Convert an H2O data object into a python




In [5]:
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
help(H2OGeneralizedLinearEstimator)
help(h2o.import_file)

Help on class H2OGeneralizedLinearEstimator in module h2o.estimators.glm:

class H2OGeneralizedLinearEstimator(h2o.estimators.estimator_base.H2OEstimator)
 |  Generalized Linear Modeling
 |  
 |  Fits a generalized linear model, specified by a response variable, a set of predictors, and a
 |  description of the error distribution.
 |  
 |  A subclass of :class:`ModelBase` is returned. The specific subclass depends on the machine learning task
 |  at hand (if it's binomial classification, then an H2OBinomialModel is returned, if it's regression then a
 |  H2ORegressionModel is returned). The default print-out of the models is shown, but further GLM-specific
 |  information can be queried out of the object. Upon completion of the GLM, the resulting object has
 |  coefficients, normalized coefficients, residual/null deviance, aic, and a host of model metrics including
 |  MSE, AUC (for logistic regression), degrees of freedom, and confusion matrices.
 |  
 |  Method resolution order:
 |  


Help on function import_file in module h2o.h2o:

import_file(path=None, destination_frame=None, parse=True, header=0, sep=None, col_names=None, col_types=None, na_strings=None, pattern=None)
    Import a dataset that is already on the cluster.
    
    The path to the data must be a valid path for each node in the H2O cluster. If some node in the H2O cluster
    cannot see the file, then an exception will be thrown by the H2O cluster. Does a parallel/distributed
    multi-threaded pull of the data. The main difference between this method and :func:`upload_file` is that
    the latter works with local files, whereas this method imports remote files (i.e. files local to the server).
    If you running H2O server on your own maching, then both methods behave the same.
    
    :param path: path(s) specifying the location of the data to import or a path to a directory of files to import
    :param destination_frame: The unique hex key assigned to the imported file. If none is given, a key

In [14]:
covtype_df = h2o.import_file(os.path.realpath("../DATA/covtype.full.csv"))

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [57]:
covtype_df

Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,Wilderness_Area,Soil_Type,Cover_Type
3066,124,5,0,0,1533,229,236,141,459,area_0,type_22,class_1
3136,32,20,450,-38,1290,211,193,111,1112,area_0,type_28,class_1
2655,28,14,42,8,1890,214,209,128,1001,area_2,type_9,class_2
3191,45,19,323,88,3932,221,195,100,2919,area_0,type_39,class_2
3217,80,13,30,1,3901,237,217,109,2859,area_0,type_22,class_7
3119,293,13,30,10,4810,182,237,194,1200,area_0,type_21,class_1
2679,48,7,150,24,1588,223,224,136,6265,area_0,type_11,class_2
3261,322,13,30,5,5701,186,226,180,769,area_0,type_21,class_1
2885,26,9,192,38,3271,216,220,140,2643,area_0,type_28,class_2
3167,271,29,242,37,4700,133,235,234,3260,area_0,type_28,class_1




In [76]:
#split the data: 70% train, 15% validation, and the remaining 15% test
train, valid, test = covtype_df.split_frame([0.7, 0.15], seed=1234)

#Prepare predictors and response columns
covtype_X = covtype_df.col_names[:-1]     #every column except the last is a predictor
covtype_y = covtype_df.col_names[-1]      #last column is Cover_Type, our desired response variable

Unnamed: 0,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,Wilderness_Area,Soil_Type,Cover_Type
type,int,int,int,int,int,int,int,int,int,int,enum,enum,enum
mins,1859.0,0.0,0.0,0.0,-173.0,0.0,0.0,0.0,0.0,0.0,,,
mean,2959.365300544568,155.65680743254856,14.103703537964787,269.4282166289166,46.41885537648099,2350.14661142971,212.14604861861739,223.31871630878535,142.5282627553303,1980.2912263430007,,,
maxs,3858.0,360.0,66.0,1397.0,601.0,7117.0,254.0,254.0,254.0,7173.0,,,
sigma,279.9847342506385,111.91372100329549,7.488241814480136,212.54935559508115,58.295231626887215,1559.2548698976093,26.769888805282132,19.76869715366642,38.274529231410625,1324.1952097801095,,,
zeros,0,4914,656,24603,38665,124,13,5,1338,51,,,
missing,0,0,0,0,0,0,0,0,0,0,0,0,0
0,3066.0,124.0,5.0,0.0,0.0,1533.0,229.0,236.0,141.0,459.0,area_0,type_22,class_1
1,3136.0,32.0,20.0,450.0,-38.0,1290.0,211.0,193.0,111.0,1112.0,area_0,type_28,class_1
2,2655.0,28.0,14.0,42.0,8.0,1890.0,214.0,209.0,128.0,1001.0,area_2,type_9,class_2
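
The `split_frame([0.7, 0.15], ...)` call above names two ratios and returns three frames; the third frame receives whatever is left (here 15%). The following is a minimal pure-Python sketch of that idea, assuming rows are assigned to splits by independent random draws rather than by exact counts. The helper `split_rows` is hypothetical and is not part of the H2O API.

```python
import random

def split_rows(n_rows, ratios, seed=1234):
    """Assign each row index to one split using a uniform random draw.

    `ratios` lists the fractions for every split except the last;
    the last split receives the remaining probability mass.
    """
    rng = random.Random(seed)
    # Cumulative thresholds, e.g. [0.7, 0.15] -> [0.7, 0.85]
    thresholds = []
    total = 0.0
    for r in ratios:
        total += r
        thresholds.append(total)
    splits = [[] for _ in range(len(ratios) + 1)]
    for row in range(n_rows):
        u = rng.random()
        # Count how many thresholds the draw reaches; that is the split index.
        idx = sum(u >= t for t in thresholds)
        splits[idx].append(row)
    return splits

train_idx, valid_idx, test_idx = split_rows(10000, [0.7, 0.15])
print(len(train_idx), len(valid_idx), len(test_idx))
```

With this kind of assignment the split sizes are only approximately 70/15/15, which is worth keeping in mind when comparing row counts between runs.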


In [59]:
glm_multi_v1 = H2OGeneralizedLinearEstimator(
    model_id='glm_v1',            #allows us to easily locate this model in Flow
    family='multinomial',
    solver='L_BFGS')

In [60]:
glm_multi_v1.train(covtype_X, covtype_y, training_frame=train, validation_frame=valid)

glm Model Build progress: |███████████████████████████████████████████████| 100%


We can view information about the model in [Flow](http://localhost:54321/) or within Python. 
To find more information in Flow, enter `getModel "glm_v1"` into a cell and run it in place by pressing Ctrl-Enter. 



In Python, we can call the model itself to get an overview of its stats:

In [61]:
glm_multi_v1


Model Details
H2OGeneralizedLinearEstimator :  Generalized Linear Modeling
Model Key:  glm_v1


ModelMetricsMultinomialGLM: glm
** Reported on train data. **

MSE: 0.20948485767867384
RMSE: 0.457695158024065

ModelMetricsMultinomialGLM: glm
** Reported on validation data. **

MSE: 0.2101143165863004
RMSE: 0.45838228214700927
Scoring History: 


timestamp,duration,iterations,negative_log_likelihood,objective
2018-03-14 16:48:51,0.000 sec,0,490244.3713397,1.2048958
2018-03-14 16:48:51,0.254 sec,1,390188.1819345,0.9591928
2018-03-14 16:48:51,0.374 sec,2,364701.2026063,0.8967724
2018-03-14 16:48:51,0.627 sec,3,340249.4056551,0.8372369
2018-03-14 16:48:52,0.880 sec,4,323802.2278831,0.7975038
---,---,---,---,---
2018-03-14 16:49:01,10.699 sec,45,260029.3463549,0.6535750
2018-03-14 16:49:02,10.938 sec,46,260032.0054209,0.6534826
2018-03-14 16:49:02,11.177 sec,47,259914.4187691,0.6533524



See the whole table with table.as_data_frame()




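To find out a little more about its performance, we can look at its hit ratio table. Row k gives the fraction of validation rows whose actual class is among the model's k most probable classes: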

In [62]:
glm_multi_v1.hit_ratio_table(valid=True)

Top-7 Hit Ratios: 


k,hit_ratio
1,0.7217729
2,0.9671201
3,0.9934171
4,0.9982538
5,0.9997013
6,0.9999886
7,1.0
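
To make the hit ratio table concrete, here is a small sketch that computes top-k hit ratios from per-row class probabilities. The probabilities and labels are made up for illustration, and `hit_ratio_table` below is a hypothetical helper, not the H2O method of the same name.

```python
def hit_ratio_table(prob_rows, actual, max_k):
    """Return [(k, hit_ratio)]: a row counts as a hit at k when its actual
    class is among the k classes with the highest predicted probability."""
    hits = [0] * max_k
    for probs, truth in zip(prob_rows, actual):
        # Class names ordered from most to least probable
        ranked = sorted(probs, key=probs.get, reverse=True)
        rank = ranked.index(truth)
        # A hit at rank r is also a hit for every larger k
        for k in range(rank, max_k):
            hits[k] += 1
    return [(k + 1, hits[k] / len(actual)) for k in range(max_k)]

# Hypothetical predicted probabilities for four rows and three classes
prob_rows = [
    {"class_1": 0.6, "class_2": 0.3, "class_3": 0.1},
    {"class_1": 0.5, "class_2": 0.4, "class_3": 0.1},
    {"class_1": 0.2, "class_2": 0.3, "class_3": 0.5},
    {"class_1": 0.1, "class_2": 0.7, "class_3": 0.2},
]
actual = ["class_1", "class_2", "class_1", "class_2"]

table = hit_ratio_table(prob_rows, actual, max_k=3)
for k, ratio in table:
    print(k, ratio)
```

The k=1 row is ordinary accuracy, and the last row is always 1.0 because every class is included once k equals the number of classes.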




In [63]:
glm_multi_v2 = H2OGeneralizedLinearEstimator(
    model_id='glm_v2',  
    family='multinomial',
    solver='L_BFGS',
    Lambda=0.0001                 #default value 0.001
    )
glm_multi_v2.train(covtype_X, covtype_y, training_frame=train, validation_frame=valid)

glm Model Build progress: |███████████████████████████████████████████████| 100%


In [64]:
glm_multi_v2

Model Details
H2OGeneralizedLinearEstimator :  Generalized Linear Modeling
Model Key:  glm_v2


ModelMetricsMultinomialGLM: glm
** Reported on train data. **

MSE: 0.20779920667152163
RMSE: 0.45584998263850096

ModelMetricsMultinomialGLM: glm
** Reported on validation data. **

MSE: 0.20842800000336265
RMSE: 0.45653915495098846
Scoring History: 


timestamp,duration,iterations,negative_log_likelihood,objective
2018-03-14 16:49:16,0.000 sec,0,490244.3713397,1.2048958
2018-03-14 16:49:16,0.236 sec,1,390179.3904457,0.9590567
2018-03-14 16:49:16,0.349 sec,2,364681.3958329,0.8964892
2018-03-14 16:49:16,0.594 sec,3,340243.6026901,0.8366837
2018-03-14 16:49:17,0.828 sec,4,323818.0586213,0.7966307
---,---,---,---,---
2018-03-14 16:49:27,11.635 sec,48,258871.3363242,0.6445755
2018-03-14 16:49:28,11.867 sec,49,258771.3770813,0.6444419
2018-03-14 16:49:28,12.160 sec,50,258715.5687938,0.6443253



See the whole table with table.as_data_frame()




In [65]:
glm_multi_v2.hit_ratio_table(valid=True)

Top-7 Hit Ratios: 


k,hit_ratio
1,0.7220027
2,0.9670627
3,0.9936354
4,0.9983112
5,0.9997588
6,0.9999655
7,1.0




The improvement is small: validation MSE drops from 0.2101 to 0.2084, and the top-1 hit ratio stays at about 72% (0.7218 for `glm_v1` versus 0.7220 for `glm_v2`).

Let's look at the confusion matrix to see if we can gather any more insight on the errors in our multinomial classification.

In [66]:
glm_multi_v2.confusion_matrix(valid)

Confusion Matrix: Row labels: Actual class; Column labels: Predicted class



Actual,class_1,class_2,class_3,class_4,class_5,class_6,class_7,Error,Rate
class_1,22107.0,8964.0,2.0,0.0,0.0,10.0,604.0,0.3023322,"9,580 / 31,687"
class_2,7691.0,33942.0,529.0,2.0,33.0,236.0,20.0,0.2004805,"8,511 / 42,453"
class_3,0.0,532.0,4326.0,63.0,4.0,388.0,0.0,0.1857708,"987 / 5,313"
class_4,0.0,1.0,240.0,125.0,0.0,66.0,0.0,0.7106481,307 / 432
class_5,4.0,1396.0,35.0,0.0,5.0,9.0,0.0,0.9965493,"1,444 / 1,449"
class_6,0.0,643.0,1296.0,3.0,5.0,651.0,0.0,0.7494226,"1,947 / 2,598"
class_7,1392.0,30.0,0.0,0.0,0.0,0.0,1690.0,0.4569409,"1,422 / 3,112"
Total,31194.0,45508.0,6428.0,193.0,47.0,1360.0,2314.0,0.2779973,"24,198 / 87,044"
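
The Error and Rate columns can be recomputed from the counts themselves. The sketch below copies the `glm_v2` validation counts from the matrix above, derives the per-class and overall error, and measures how much of the total error comes from class_1 and class_2 being mistaken for each other. The variable names are illustrative only.

```python
labels = ["class_%d" % i for i in range(1, 8)]

# Rows are actual classes, columns are predicted classes (glm_v2, validation data)
matrix = [
    [22107, 8964, 2, 0, 0, 10, 604],
    [7691, 33942, 529, 2, 33, 236, 20],
    [0, 532, 4326, 63, 4, 388, 0],
    [0, 1, 240, 125, 0, 66, 0],
    [4, 1396, 35, 0, 5, 9, 0],
    [0, 643, 1296, 3, 5, 651, 0],
    [1392, 30, 0, 0, 0, 0, 1690],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))
errors = total - correct

# Per-class error: share of each actual class that was predicted as something else
per_class_error = {
    labels[i]: 1 - matrix[i][i] / sum(matrix[i]) for i in range(len(labels))
}
overall_error = errors / total

# Share of all errors that are class_1 predicted as class_2 or the reverse
c1_c2_share = (matrix[0][1] + matrix[1][0]) / errors

print("overall error:", round(overall_error, 7))
print("top-1 accuracy:", round(1 - overall_error, 7))
print("class_1/class_2 share of errors:", round(c1_c2_share, 3))
```

One minus the overall error reproduces the k=1 entry of the hit ratio table, and the class_1/class_2 mix-ups account for roughly two thirds of all validation errors.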




As the confusion matrix above shows, the model struggles to distinguish cover type classes 1 and 2. To learn more about this, let's narrow the problem to a binomial classification between these two classes.

In [67]:
c1 = covtype_df[covtype_df['Cover_Type'] == 'class_1']
c2 = covtype_df[covtype_df['Cover_Type'] == 'class_2']
df_b = c1.rbind(c2)

In [72]:
#split the data
train_b, valid_b, test_b = df_b.split_frame([0.7, 0.15], seed=1234)
#train_b.summary()
valid_b.summary()

Unnamed: 0,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,Wilderness_Area,Soil_Type,Cover_Type
type,int,int,int,int,int,int,int,int,int,int,enum,enum,enum
mins,2147.0,0.0,0.0,0.0,-166.0,0.0,0.0,0.0,0.0,0.0,,,
mean,3010.200258659015,152.64896469035017,13.335219388648646,276.8974659499657,44.4413908310768,2511.6484123455793,213.18706974363116,224.5172237265759,143.23633620283184,2103.0536852173664,,,
maxs,3686.0,360.0,63.0,1368.0,595.0,7078.0,254.0,254.0,253.0,7150.0,,,
sigma,202.9089483173579,111.3236125255111,6.986221645549852,212.71416776089166,57.00085845225439,1570.7547099886986,24.727003380921136,18.405880450701385,35.962843924715436,1344.928654620616,,,
zeros,0,672,102,2938,4861,11,2,1,111,5,,,
missing,0,0,0,0,0,0,0,0,0,0,0,0,0
0,3066.0,124.0,5.0,0.0,0.0,1533.0,229.0,236.0,141.0,459.0,area_0,type_22,class_1
1,2972.0,100.0,4.0,175.0,13.0,5031.0,227.0,234.0,142.0,6198.0,area_0,type_19,class_1
2,3301.0,205.0,4.0,190.0,-1.0,5750.0,217.0,243.0,162.0,573.0,area_0,type_37,class_1


In [77]:
glm_binom_v1 = H2OGeneralizedLinearEstimator(
    model_id='glm_v3',
    solver='L_BFGS',
    family='binomial')
glm_binom_v1.train(covtype_X, covtype_y, training_frame=train_b, validation_frame=valid_b)

glm Model Build progress: | (failed)


OSError: Job with key $03017f00000134d4ffffffff$_ba44fa5e3efce7edb5d7ba38ef664e0f failed with an exception: java.lang.AssertionError: x out of bounds, expected <0,1> range, got NaN
stacktrace: 
java.lang.AssertionError: x out of bounds, expected <0,1> range, got NaN
	at hex.glm.GLMModel$GLMWeightsFun.link(GLMModel.java:547)
	at hex.glm.GLM.getNullBeta(GLM.java:359)
	at hex.glm.GLM.init(GLM.java:502)
	at hex.glm.GLM$GLMDriver.computeImpl(GLM.java:1149)
	at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:206)
	at hex.glm.GLM$GLMDriver.compute2(GLM.java:569)
	at water.H2O$H2OCountedCompleter.compute(H2O.java:1263)
	at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
	at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
	at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
	at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
	at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
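
The stack trace places the failure in `GLMWeightsFun.link`, called from `GLM.getNullBeta`. One possible reading, which the notebook itself does not confirm, is that the intercept of the intercept-only (null) model is obtained by applying the link function to a mean of the response, and that this mean came out as NaN. The sketch below illustrates that mechanism in plain Python with hypothetical helpers; it is not H2O's implementation. If this reading is right, the response column of `df_b` would be the first thing to inspect, for example whether it still carries more than two factor levels after the filtering.

```python
import math

def logit_link(mu):
    """Logit link for a binomial response; defined only for 0 < mu < 1."""
    # NaN fails every comparison, so a NaN mean is rejected here as well
    if not (0.0 < mu < 1.0):
        raise AssertionError(
            "mu out of bounds, expected a value inside (0, 1), got %r" % mu
        )
    return math.log(mu / (1.0 - mu))

def null_intercept(y):
    """Intercept of an intercept-only logistic model: logit of the response mean."""
    return logit_link(sum(y) / len(y))

# A clean 0/1 response gives a finite intercept
print(null_intercept([0, 1, 1, 0, 1]))

# A response whose mean is NaN triggers the bounds check
try:
    null_intercept([float("nan"), 1.0])
except AssertionError as err:
    print("failed:", err)
```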


In [29]:
glm_binom_v1.accuracy()

AttributeError: type object 'H2OGeneralizedLinearEstimator' has no attribute 'accuracy'

In [42]:
def cut_column(train_df, train, valid, test, col):
    '''
    Convenience function to change a column from numerical to categorical
    We use train_df only for bucketing with histograms.
    Uses np.histogram to generate a histogram, with the buckets forming the categories of our new categorical.
    Picks buckets based on training data, then applies the same classification to the test and validation sets
    
    Assumes that train, valid, test will have the same histogram behavior.
    '''
    only_col= train_df[col]                            #Isolate the column in question from the training frame
    counts, breaks = np.histogram(only_col, bins=20)   #Generate counts and breaks for our histogram
    min_val = min(only_col)-1                          #Establish min and max values
    max_val = max(only_col)+1
    
    new_b = [min_val]                                  #Redefine breaks such that each bucket has enough support
    for i in range(19):
        if counts[i] > 1000 and counts[i+1] > 1000:
            new_b.append(breaks[i+1])
    new_b.append(max_val)
    names = [col + '_' + str(x) for x in range(len(new_b)-1)]  #Generate names for buckets, these will be categorical names
    
    train[col+"_cut"] = train[col].cut(breaks=new_b, labels=names)
    valid[col+"_cut"] = valid[col].cut(breaks=new_b, labels=names)
    test[col+"_cut"] = test[col].cut(breaks=new_b, labels=names)
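
The break-selection loop in `cut_column` is easier to follow on a small input. This sketch isolates that loop in a hypothetical helper, `merge_breaks`, and runs it on a made-up five-bin histogram instead of the 20 bins produced by `np.histogram`.

```python
def merge_breaks(counts, breaks, min_val, max_val, min_support=1000):
    """Keep an interior break only when the bins on both sides of it
    hold more than `min_support` rows, as the loop in cut_column does."""
    new_b = [min_val]
    for i in range(len(counts) - 1):
        if counts[i] > min_support and counts[i + 1] > min_support:
            new_b.append(breaks[i + 1])
    new_b.append(max_val)
    return new_b

# Hypothetical histogram: 5 bins over the range [0, 50]
counts = [200, 1500, 3000, 2500, 100]
breaks = [0, 10, 20, 30, 40, 50]

new_b = merge_breaks(counts, breaks, min_val=-1, max_val=51)
names = ["Elevation_" + str(x) for x in range(len(new_b) - 1)]

print(new_b)   # breaks that survive
print(names)   # one category name per resulting bucket
```

Sparse bins at the edges are absorbed into their neighbours, so every resulting category is backed by a reasonable number of training rows.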

In [43]:
def add_features(train, valid, test):
    '''
    Helper function to add a specific set of features to our covertype dataset
    '''
    #pull train dataset into Python
    train_df = train.as_data_frame(True)
    
    #Make categoricals for several columns
    cut_column(train_df, train, valid, test, "Elevation")
    cut_column(train_df, train, valid, test, "Hillshade_Noon")
    cut_column(train_df, train, valid, test, "Hillshade_9am")
    cut_column(train_df, train, valid, test, "Hillshade_3pm")
    cut_column(train_df, train, valid, test, "Horizontal_Distance_To_Hydrology")
    cut_column(train_df, train, valid, test, "Slope")
    cut_column(train_df, train, valid, test, "Horizontal_Distance_To_Roadways")
    cut_column(train_df, train, valid, test, "Aspect")


    #Add interaction columns for a subset of columns
    interaction_cols1 = ["Elevation_cut",
                         "Wilderness_Area",
                         "Soil_Type",
                         "Hillshade_Noon_cut",
                         "Hillshade_9am_cut",
                         "Hillshade_3pm_cut",
                         "Horizontal_Distance_To_Hydrology_cut",
                         "Slope_cut",
                         "Horizontal_Distance_To_Roadways_cut",
                         "Aspect_cut"]

    train_cols = train.interaction(factors=interaction_cols1,    #Generate pairwise columns
        pairwise=True,
        max_factors=1000,
        min_occurrence=100,
        destination_frame="itrain")
    
    valid_cols = valid.interaction(factors=interaction_cols1,
        pairwise=True,
        max_factors=1000,
        min_occurrence=100,
        destination_frame="ivalid")
    
    test_cols = test.interaction(factors=interaction_cols1,
        pairwise=True,
        max_factors=1000,
        min_occurrence=100,
        destination_frame="itest")

    train = train.cbind(train_cols)                              #Append pairwise columns to H2OFrames
    valid = valid.cbind(valid_cols)
    test = test.cbind(test_cols)
    
                                
    #Add a three-way interaction for Hillshade
    interaction_cols2 = ["Hillshade_Noon_cut","Hillshade_9am_cut","Hillshade_3pm_cut"]
    
                         
    train_cols = train.interaction(factors=interaction_cols2,    #Generate the three-way interaction column
        pairwise=False,
        max_factors=1000,
        min_occurrence=100,
        destination_frame="itrain")
    
    valid_cols = valid.interaction(factors=interaction_cols2,
        pairwise=False,
        max_factors=1000,
        min_occurrence=100,
        destination_frame="ivalid")
    
    test_cols = test.interaction(factors=interaction_cols2,
        pairwise=False,
        max_factors=1000,
        min_occurrence=100,
        destination_frame="itest")
    
    train = train.cbind(train_cols)                              #Append the three-way interaction column to H2OFrames
    valid = valid.cbind(valid_cols)
    test = test.cbind(test_cols)

                                 
    return train, valid, test

In [44]:
train_bf, valid_bf, test_bf = add_features(train_b, valid_b, test_b)

Interactions progress: |██████████████████████████████████████████████████| 100%
Interactions progress: |██████████████████████████████████████████████████| 100%
Interactions progress: |██████████████████████████████████████████████████| 100%
Interactions progress: |██████████████████████████████████████████████████| 100%
Interactions progress: |██████████████████████████████████████████████████| 100%
Interactions progress: |██████████████████████████████████████████████████| 100%


In [80]:
train_bf.summary()

Unnamed: 0,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,Wilderness_Area,Soil_Type,Cover_Type,Elevation_cut,Hillshade_Noon_cut,Hillshade_9am_cut,Hillshade_3pm_cut,Horizontal_Distance_To_Hydrology_cut,Slope_cut,Horizontal_Distance_To_Roadways_cut,Aspect_cut,Hillshade_Noon_cut_Hillshade_9am_cut_Hillshade_3pm_cut,Hillshade_Noon_cut_Hillshade_9am_cut_Hillshade_3pm_cut0
type,int,int,int,int,int,int,int,int,int,int,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum
mins,2142.0,0.0,0.0,0.0,-166.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,
mean,3009.8651868471597,154.02716623312205,13.373835629870173,275.57496380719465,44.27077697733842,2509.7929425471643,213.03225052919356,224.5165016120939,143.39138697751144,2100.479982927378,,,,,,,,,,,,,
maxs,3680.0,360.0,66.0,1397.0,601.0,7117.0,254.0,254.0,254.0,7173.0,,,,,,,,,,,,,
sigma,202.65600833565,111.67382431290244,6.9866401755217105,213.15808640964318,57.24285432412293,1571.0628085443263,24.918830103394665,18.387246591531216,36.17654203721041,1349.7823750892069,,,,,,,,,,,,,
zeros,0,3073,419,13743,22581,32,6,2,523,29,,,,,,,,,,,,,
missing,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,3136.0,32.0,20.0,450.0,-38.0,1290.0,211.0,193.0,111.0,1112.0,area_0,type_28,class_1,Elevation_8,Hillshade_Noon_3,Hillshade_9am_7,Hillshade_3pm_5,Horizontal_Distance_To_Hydrology_6,Slope_6,Horizontal_Distance_To_Roadways_3,Aspect_1,Hillshade_Noon_3_Hillshade_9am_7_Hillshade_3pm_5,Hillshade_Noon_3_Hillshade_9am_7_Hillshade_3pm_5
1,3119.0,293.0,13.0,30.0,10.0,4810.0,182.0,237.0,194.0,1200.0,area_0,type_21,class_1,Elevation_8,Hillshade_Noon_6,Hillshade_9am_5,Hillshade_3pm_12,Horizontal_Distance_To_Hydrology_0,Slope_3,Horizontal_Distance_To_Roadways_13,Aspect_16,Hillshade_Noon_6_Hillshade_9am_5_Hillshade_3pm_12,Hillshade_Noon_6_Hillshade_9am_5_Hillshade_3pm_12
2,3167.0,271.0,29.0,242.0,37.0,4700.0,133.0,235.0,234.0,3260.0,area_0,type_28,class_1,Elevation_9,Hillshade_Noon_6,Hillshade_9am_1,Hillshade_3pm_15,Horizontal_Distance_To_Hydrology_3,Slope_8,Horizontal_Distance_To_Roadways_13,Aspect_15,Hillshade_Noon_6_Hillshade_9am_1_Hillshade_3pm_15,Hillshade_Noon_6_Hillshade_9am_1_Hillshade_3pm_15
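
Judging from the summary above, each level of the three-way Hillshade interaction column is the underscore-joined combination of the three bucket labels in that row. A minimal sketch of that labelling, using the three sample rows shown in the summary and a hypothetical helper `interaction_label`:

```python
def interaction_label(row, factors):
    """Join the levels of the given factor columns into one combined level."""
    return "_".join(row[f] for f in factors)

factors = ["Hillshade_Noon_cut", "Hillshade_9am_cut", "Hillshade_3pm_cut"]

# Bucket labels copied from the three sample rows of train_bf.summary()
rows = [
    {"Hillshade_Noon_cut": "Hillshade_Noon_3",
     "Hillshade_9am_cut": "Hillshade_9am_7",
     "Hillshade_3pm_cut": "Hillshade_3pm_5"},
    {"Hillshade_Noon_cut": "Hillshade_Noon_6",
     "Hillshade_9am_cut": "Hillshade_9am_5",
     "Hillshade_3pm_cut": "Hillshade_3pm_12"},
    {"Hillshade_Noon_cut": "Hillshade_Noon_6",
     "Hillshade_9am_cut": "Hillshade_9am_1",
     "Hillshade_3pm_cut": "Hillshade_3pm_15"},
]

combined = [interaction_label(r, factors) for r in rows]
for level in combined:
    print(level)
```

This sketch ignores the `max_factors` and `min_occurrence` arguments, which limit which combined levels are kept in the real call.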


In [81]:
glm_binom_feat_1 = H2OGeneralizedLinearEstimator(family='binomial', solver='L_BFGS', model_id='glm_v4')
glm_binom_feat_1.train(covtype_X, covtype_y, training_frame=train_bf, validation_frame=valid_bf)

glm Model Build progress: | (failed)


OSError: Job with key $03017f00000134d4ffffffff$_8b13abf778b5cfeea3a147fb69e9401a failed with an exception: java.lang.AssertionError: x out of bounds, expected <0,1> range, got NaN
stacktrace: 
java.lang.AssertionError: x out of bounds, expected <0,1> range, got NaN
	at hex.glm.GLMModel$GLMWeightsFun.link(GLMModel.java:547)
	at hex.glm.GLM.getNullBeta(GLM.java:359)
	at hex.glm.GLM.init(GLM.java:502)
	at hex.glm.GLM$GLMDriver.computeImpl(GLM.java:1149)
	at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:206)
	at hex.glm.GLM$GLMDriver.compute2(GLM.java:569)
	at water.H2O$H2OCountedCompleter.compute(H2O.java:1263)
	at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
	at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
	at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
	at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
	at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)


In [46]:
glm_binom_feat_1.accuracy(valid=True)

AttributeError: type object 'H2OGeneralizedLinearEstimator' has no attribute 'accuracy'

In [47]:
glm_binom_feat_2 = H2OGeneralizedLinearEstimator(family='binomial', solver='L_BFGS', model_id='glm_v5', Lambda=0.001)
glm_binom_feat_2.train(covtype_X, covtype_y, training_frame=train_bf, validation_frame=valid_bf)

glm Model Build progress: | (failed)


OSError: Job with key $03017f00000134d4ffffffff$_955bbdee8f60d4e4c08c0ea92e68497b failed with an exception: java.lang.AssertionError: x out of bounds, expected <0,1> range, got NaN
stacktrace: 
java.lang.AssertionError: x out of bounds, expected <0,1> range, got NaN
	at hex.glm.GLMModel$GLMWeightsFun.link(GLMModel.java:547)
	at hex.glm.GLM.getNullBeta(GLM.java:359)
	at hex.glm.GLM.init(GLM.java:502)
	at hex.glm.GLM$GLMDriver.computeImpl(GLM.java:1149)
	at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:206)
	at hex.glm.GLM$GLMDriver.compute2(GLM.java:569)
	at water.H2O$H2OCountedCompleter.compute(H2O.java:1263)
	at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
	at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
	at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
	at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
	at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)


In [48]:
glm_binom_feat_2.accuracy(valid=True)

AttributeError: type object 'H2OGeneralizedLinearEstimator' has no attribute 'accuracy'

In [49]:
glm_binom_feat_3 = H2OGeneralizedLinearEstimator(family='binomial', model_id='glm_v6', lambda_search=True)
glm_binom_feat_3.train(covtype_X, covtype_y, training_frame=train_bf, validation_frame=valid_bf)

glm Model Build progress: | (failed)


OSError: Job with key $03017f00000134d4ffffffff$_a5d87b9a9506886dd041ae10527c40ef failed with an exception: java.lang.AssertionError: x out of bounds, expected <0,1> range, got NaN
stacktrace: 
java.lang.AssertionError: x out of bounds, expected <0,1> range, got NaN
	at hex.glm.GLMModel$GLMWeightsFun.link(GLMModel.java:547)
	at hex.glm.GLM.getNullBeta(GLM.java:359)
	at hex.glm.GLM.init(GLM.java:502)
	at hex.glm.GLM$GLMDriver.computeImpl(GLM.java:1149)
	at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:206)
	at hex.glm.GLM$GLMDriver.compute2(GLM.java:569)
	at water.H2O$H2OCountedCompleter.compute(H2O.java:1263)
	at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
	at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
	at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
	at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
	at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)


In [50]:
glm_binom_feat_3.accuracy(valid=True)

AttributeError: type object 'H2OGeneralizedLinearEstimator' has no attribute 'accuracy'

In [51]:
train_f, valid_f, test_f = add_features(train, valid, test)

Interactions progress: |██████████████████████████████████████████████████| 100%
Interactions progress: |██████████████████████████████████████████████████| 100%
Interactions progress: |██████████████████████████████████████████████████| 100%
Interactions progress: |██████████████████████████████████████████████████| 100%
Interactions progress: |██████████████████████████████████████████████████| 100%
Interactions progress: |██████████████████████████████████████████████████| 100%


In [52]:
glm_multi_v3 = H2OGeneralizedLinearEstimator(
                        model_id='glm_v7',           
                        family='multinomial',
                        solver='L_BFGS',
                        Lambda=0.0001)
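#covtype_X was built from the original column names, so only those columns are passed as predictors here
glm_multi_v3.train(covtype_X, covtype_y, training_frame=train_f, validation_frame=valid_f)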

glm Model Build progress: |███████████████████████████████████████████████| 100%


In [54]:
glm_multi_v3

Model Details
H2OGeneralizedLinearEstimator :  Generalized Linear Modeling
Model Key:  glm_v7


ModelMetricsMultinomialGLM: glm
** Reported on train data. **

MSE: 0.20567994255054234
RMSE: 0.4535195062514316

ModelMetricsMultinomialGLM: glm
** Reported on validation data. **

MSE: 0.20678926805660505
RMSE: 0.4547408801247201
Scoring History: 


timestamp,duration,iterations,negative_log_likelihood,objective
2018-03-14 16:44:28,0.000 sec,0,490244.3713397,1.2048958
2018-03-14 16:44:28,0.287 sec,1,390140.1633429,0.9589602
2018-03-14 16:44:28,0.412 sec,2,364664.5843174,0.8964480
2018-03-14 16:44:28,0.660 sec,3,340164.2226989,0.8364886
2018-03-14 16:44:29,0.954 sec,4,323662.6690375,0.7962510
---,---,---,---,---
2018-03-14 16:44:42,14.677 sec,55,256380.3518102,0.6394377
2018-03-14 16:44:43,14.924 sec,56,256248.8448666,0.6393366
2018-03-14 16:44:43,15.619 sec,57,256152.1928321,0.6392126



See the whole table with table.as_data_frame()
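
All three multinomial scoring histories start from the same objective, 1.2048958, at iteration 0. A plausible interpretation, offered here as an assumption rather than something the output states, is that iteration 0 corresponds to a model that predicts only the overall class frequencies, so its average negative log-likelihood equals the entropy of the class distribution. The sketch below checks this against the validation class counts taken from the `glm_v2` confusion matrix earlier in the notebook; the training frame will have slightly different proportions, so only an approximate match is expected.

```python
import math

# Validation-set class counts (row totals of the glm_v2 confusion matrix)
class_counts = [31687, 42453, 5313, 432, 1449, 2598, 3112]
n = sum(class_counts)

# Average negative log-likelihood of always predicting the class frequencies,
# which is the entropy of the class distribution in nats
null_nll_per_row = -sum((c / n) * math.log(c / n) for c in class_counts)

reported_objective = 1.2048958  # iteration 0 in the scoring histories
print("entropy of validation classes:", round(null_nll_per_row, 4))
print("difference from reported objective:",
      round(abs(null_nll_per_row - reported_objective), 4))
```

Under this reading, the drop from about 1.20 to about 0.64 over the iterations measures how much the predictors add beyond knowing the class frequencies alone.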




In [53]:
glm_multi_v3.hit_ratio_table(valid=True)

Top-7 Hit Ratios: 


k,hit_ratio
1,0.7235651
2,0.967752
3,0.9939686
4,0.9982768
5,0.9997358
6,0.9999885
7,1.0




In [56]:
glm_multi_v3.confusion_matrix(valid_f)

Confusion Matrix: Row labels: Actual class; Column labels: Predicted class



Actual,class_1,class_2,class_3,class_4,class_5,class_6,class_7,Error,Rate
class_1,22089.0,8963.0,1.0,0.0,0.0,10.0,624.0,0.3029002,"9,598 / 31,687"
class_2,7673.0,33990.0,507.0,2.0,31.0,226.0,24.0,0.1993499,"8,463 / 42,453"
class_3,0.0,537.0,4334.0,80.0,2.0,360.0,0.0,0.1842650,"979 / 5,313"
class_4,0.0,0.0,220.0,146.0,0.0,66.0,0.0,0.6620370,286 / 432
class_5,4.0,1363.0,36.0,0.0,34.0,12.0,0.0,0.9765355,"1,415 / 1,449"
class_6,0.0,633.0,1308.0,5.0,7.0,645.0,0.0,0.7517321,"1,953 / 2,598"
class_7,1338.0,30.0,0.0,0.0,0.0,0.0,1744.0,0.4395887,"1,368 / 3,112"
Total,31104.0,45516.0,6406.0,233.0,74.0,1319.0,2392.0,0.2764349,"24,062 / 87,044"
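
To see where `glm_v7` differs from `glm_v2`, the per-class Error columns of the two validation confusion matrices can be compared directly. The values below are copied from those two tables; the variable names are illustrative.

```python
labels = ["class_%d" % i for i in range(1, 8)]

# Error column of the validation confusion matrices
error_v2 = [0.3023322, 0.2004805, 0.1857708, 0.7106481, 0.9965493, 0.7494226, 0.4569409]
error_v7 = [0.3029002, 0.1993499, 0.1842650, 0.6620370, 0.9765355, 0.7517321, 0.4395887]

# Negative delta means glm_v7 makes fewer errors on that class
delta = {lab: e7 - e2 for lab, e2, e7 in zip(labels, error_v2, error_v7)}
improved = [lab for lab in labels if delta[lab] < 0]
worse = [lab for lab in labels if delta[lab] > 0]

for lab in labels:
    print(lab, round(delta[lab], 7))
print("improved:", improved)
print("worse:", worse)

# Overall error from the Total rows of the two tables
overall_delta = 0.2764349 - 0.2779973
print("overall change:", round(overall_delta, 7))
```

The largest gains are on the rare classes 4 and 5, while classes 1 and 6 get marginally worse; the overall error moves by less than 0.2 percentage points.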





In [None]:
#h2o.shutdown(prompt=False)