Documentation request/query: order of probability scores in Spark multinomial regression probability columns #907

Closed
MZLABS opened this Issue Aug 7, 2017 · 10 comments

MZLABS commented Aug 7, 2017

Documentation request: In what order are the predictions for a multinomial
logistic regression written? In the example below with three string
targets "g1", "g2", and "g3", notice that "g1" is assigned position 3,
"g2" position 2, and "g3" position 1: the reverse of the order you might expect.
I wonder if this is a convention
we can document and rely on, or is this undetermined?

suppressPackageStartupMessages(library("sparklyr"))
packageVersion("sparklyr")
#> [1] '0.6.1'

dL <- data.frame(x1 = c(1, 0, 0, 
                        0, 0, 0, 
                        1, 1, 1),
                 x2 = c(0, 1, 0, 
                        0, 0, 0, 
                        1, 1, 1),
                 x3 = c(0, 0, 1, 
                        0, 0, 0, 
                        1, 1, 1),
                 y = c('g1', 'g2', 'g3',
                       'g1', 'g2', 'g3',
                       'g1', 'g2', 'g3'),
                 stringsAsFactors = FALSE)
sc <- spark_connect(master = 'local', version = '2.1.0')

dS <- dplyr::copy_to(sc, dL, 'dS')

model <- ml_logistic_regression(dS,
                                response = 'y',
                                features = c('x1', 'x2', 'x3'))
#> * No rows dropped by 'na.omit' call
print(model)
#> Call: y ~ x1 + x2 + x3
#> 
#> Coefficients:
#>        (Intercept)        x1        x2        x3
#> [1,]  6.324847e-07 -7.352011 -7.351955 14.703965
#> [2,] -1.295642e-07 -7.351453 14.703408 -7.351955
#> [3,] -5.029206e-07 14.703464 -7.351453 -7.352010
res <- sdf_predict(model, dS)
cf <- sdf_separate_column(res, 'probability', 
                          list('p1'=1, 'p2'=2, 'p3'=3))
dplyr::select(cf, x1, x2, x3, y, p1, p2, p3)
#> # Source:   lazy query [?? x 7]
#> # Database: spark_connection
#>      x1    x2    x3     y           p1           p2           p3
#>   <dbl> <dbl> <dbl> <chr>        <dbl>        <dbl>        <dbl>
#> 1     1     0     0    g1 2.638941e-10 2.640412e-10 1.000000e+00
#> 2     0     1     0    g2 2.639234e-10 1.000000e+00 2.640558e-10
#> 3     0     0     1    g3 1.000000e+00 2.637759e-10 2.637612e-10
#> 4     0     0     0    g1 3.333335e-01 3.333333e-01 3.333332e-01
#> 5     0     0     0    g2 3.333335e-01 3.333333e-01 3.333332e-01
#> 6     0     0     0    g3 3.333335e-01 3.333333e-01 3.333332e-01
#> 7     1     1     1    g1 3.333333e-01 3.333333e-01 3.333333e-01
#> 8     1     1     1    g2 3.333333e-01 3.333333e-01 3.333333e-01
#> 9     1     1     1    g3 3.333333e-01 3.333333e-01 3.333333e-01

spark_disconnect(sc)

@javierluraschi javierluraschi added the ml label Aug 9, 2017

Collaborator

kevinykuo commented Aug 14, 2017

Yeah this is pretty confusing. @MZLABS @JohnMount what have you seen in other libraries? Output the probabilities in appropriately named columns?

JohnMount commented Aug 14, 2017

In R packages I'd expect the predictions to be in an order that matches the lexical order of the result levels. A really cool fix would be to populate the ml_logistic_regression() R result so it has the named list needed to call sdf_separate_column() as one of the values. User code could then use this value or look at it for guidance.

Contributor

kevinushey commented Aug 14, 2017

I believe the underlying fix should occur around here:

df <- ft_string_indexer(df, response, envir$response, envir)

The overarching issue is, when we use the string indexer to transform a categorical variable into a numeric variable (as required by the Spark ML routines), it ends up generating these labels in an unexpected way:

Browse[3]> ft_string_indexer(df, response, envir$response, envir)
# Source:   table<sparklyr_tmp_3cff40cbcd98> [?? x 6]
# Database: spark_connection
     x1    x2    x3     y id3cff65de6ccc response3cff6db8f953
  <dbl> <dbl> <dbl> <chr>          <dbl>                <dbl>
1     1     0     0    g1              0                    2
2     0     1     0    g2              1                    1
3     0     0     1    g3              2                    0
4     0     0     0    g1              3                    2
5     0     0     0    g2              4                    1
6     0     0     0    g3              5                    0
7     1     1     1    g1              6                    2
8     1     1     1    g2              7                    1
9     1     1     1    g3              8                    0

Although note that we do record the labels generated by Spark; but we fail to use them in the R outputs:

Browse[3]> envir$labels
[1] "g3" "g2" "g1"

I think we should do two things here:

  1. Attach the labels to the generated coefficients matrix,
  2. See if we can convince ft_string_indexer() to use lexical ordering of the variables encountered here.
Contributor

kevinushey commented Aug 14, 2017

Note that the StringIndexer automatically orders by label frequencies:

A label indexer that maps a string column of labels to an ML column of label indices. If the input column is numeric, we cast it to string and index the string values. The indices are in [0, numLabels), ordered by label frequencies. So the most frequent label gets index 0.

And AFAICS this is not configurable in the API. So it's possible that we might want to consider a different solution here altogether (i.e., perhaps implement our own 'string to index' transformation).
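The index mapping such a replacement transformation would need to produce is easy to express. A minimal base-R sketch of the idea (the real fix would have to run Spark-side as a feature transformer, which is not shown here):

```r
# Map string labels to 0-based indices in lexical (sorted) order,
# rather than by descending frequency as Spark's StringIndexer does.
lexical_index <- function(y) {
  levels <- sort(unique(y))   # e.g. "g1" "g2" "g3"
  match(y, levels) - 1        # 0-based, as Spark ML expects
}

y <- c("g1", "g2", "g3", "g1", "g2", "g3", "g1", "g2", "g3")
lexical_index(y)
#> [1] 0 1 2 0 1 2 0 1 2
```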

Contributor

kevinushey commented Aug 14, 2017

Spark 2.3.0 will make it possible to choose an ordering:

https://github.com/apache/spark/blob/fbc269252a1c99e04bd08906ad8404c031e9a097/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L59-L76

but we probably don't want to wait for that.
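For reference, the difference between the current frequency-based ordering and the alphabetical ordering Spark 2.3 will allow can be simulated in base R (the label set here is chosen so the two orderings differ; this is an illustration, not sparklyr API):

```r
y <- c("c", "c", "c", "b", "b", "a")

# frequencyDesc (StringIndexer's current behavior, and the future default):
# the most frequent label gets index 0
freq_order <- names(sort(table(y), decreasing = TRUE))
freq_order
#> [1] "c" "b" "a"

# alphabetAsc (one of the orderings Spark 2.3 will support)
alpha_order <- sort(unique(y))
alpha_order
#> [1] "a" "b" "c"

match(y, freq_order) - 1    # indices under frequencyDesc
#> [1] 0 0 0 1 1 2
match(y, alpha_order) - 1   # indices under alphabetAsc
#> [1] 2 2 2 1 1 0
```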

Collaborator

kevinykuo commented Aug 15, 2017

Is there any reason why we'd want to keep the list column of probability vectors? What if we output something like the below with sdf_predict():

  predict         g1         g2         g3
1      g1 0.85319282 0.07322736 0.07357982
2      g2 0.07365074 0.85270285 0.07364641
3      g3 0.07328707 0.07319418 0.85351875
4      g3 0.33380310 0.33082052 0.33537638
5      g3 0.33380310 0.33082052 0.33537638
6      g3 0.33380310 0.33082052 0.33537638
Contributor

kevinushey commented Aug 15, 2017

That sounds reasonable to me.

Collaborator

kevinykuo commented Aug 16, 2017

Closing, will track in #937

JohnMount commented Sep 1, 2017

@kevinushey The saving of model.parameters$labels makes everything solvable, thank you very much. Roughly, all one has to do is find a permutation that takes model.parameters$labels into the same order as the labels the user supplied. This is possible by finding the permutations that sort both vectors and inverting one of them (some notes on the general concepts here, and a write-up on the application to this problem here).
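The recipe above can be sketched in base R. Assuming `spark_labels` holds the saved model.parameters$labels and the user wants lexical order, `match()` gives the permutation directly, and it agrees with the sort-both-and-invert-one construction:

```r
spark_labels <- c("g3", "g2", "g1")  # order recorded by sparklyr
user_labels  <- sort(spark_labels)   # order the user expects: "g1" "g2" "g3"

# Position of each user label inside Spark's probability vector
perm <- match(user_labels, spark_labels)
perm
#> [1] 3 2 1

# Equivalent: compose the sorting permutations, inverting one of them
perm2 <- order(spark_labels)[order(order(user_labels))]
stopifnot(identical(perm, perm2))

# Named list ready to pass to sdf_separate_column():
# list(g1 = 3, g2 = 2, g3 = 1)
cols <- setNames(as.list(perm), user_labels)
```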

Collaborator

kevinykuo commented Oct 26, 2017

@JohnMount @MZLABS we now provide more reasonable prediction outputs for classification models (for ml_model); feel free to open another issue if you run into further problems.

iris_tbl <- sdf_copy_to(sc, iris)
lr <- ml_logistic_regression(iris_tbl, Species ~ Petal_Width + Petal_Length)
ml_predict(lr, iris_tbl) %>%
  glimpse()

Observations: 25
Variables: 12
$ Sepal_Length           <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4...
$ Sepal_Width            <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9...
$ Petal_Length           <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7...
$ Petal_Width            <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4...
$ Species                <chr> "setosa", "setosa", "setosa"...
$ prediction             <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
$ probability            <list> [<3.431843e-16, 1.906355e-3...
$ rawPrediction          <list> [<-0.1605276, -35.2872088, ...
$ predicted_label        <chr> "setosa", "setosa", "setosa"...
$ probability_versicolor <dbl> 3.431843e-16, 3.431843e-16, ...
$ probability_virginica  <dbl> 1.906355e-31, 1.906355e-31, ...
$ probability_setosa     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...