Contributing Spark TensorFlow connector to ecosystem #32

skavulya · 2017-02-11T03:10:16Z

Our team has been working on a Spark TensorFlow connector that we would like to contribute back to the TensorFlow ecosystem. The connector uses the TensorFlow Hadoop input/output format, and simplifies import and export of data from TFRecords into Spark dataframes.

@jhseu, please let us know if this is something you are interested in. We would need some guidance on which directory to place the library in before creating the pull request. We were not sure if we should create a new spark directory at the root of the repo, or whether to create a new sub-directory under hadoop.

Here is a snippet of code that demonstrates the usage of the library:

import org.apache.spark.sql.{ DataFrame, Row }
import org.apache.spark.sql.catalyst.expressions.GenericRow
import org.apache.spark.sql.types._

val path = "test-output.tfr"
val testRows: Array[Row] = Array(
  new GenericRow(Array[Any](11, 1, 23L, 10.0F, 14.0, List(1.0, 2.0), "r1")),
  new GenericRow(Array[Any](21, 2, 24L, 12.0F, 15.0, List(2.0, 2.0), "r2")))
val schema = StructType(List(
  StructField("id", IntegerType),
  StructField("IntegerTypelabel", IntegerType),
  StructField("LongTypelabel", LongType),
  StructField("FloatTypelabel", FloatType),
  StructField("DoubleTypelabel", DoubleType),
  StructField("vectorlabel", ArrayType(DoubleType, true)),
  StructField("name", StringType)))
val rdd = spark.sparkContext.parallelize(testRows)

// Save DataFrame as TFRecords
val df: DataFrame = spark.createDataFrame(rdd, schema)
df.write.format("tensorflow").save(path)

// Read TFRecords into DataFrame.
// The DataFrame schema is inferred from the TFRecords if no custom schema is provided.
val importedDf1: DataFrame = spark.read.format("tensorflow").load(path)
importedDf1.show()

// Read TFRecords into DataFrame using custom schema
val importedDf2: DataFrame = spark.read.format("tensorflow").schema(schema).load(path)
importedDf2.show()

+--------------+-----------+----------------+-------------+---------------+----+---+
|FloatTypelabel|vectorlabel|IntegerTypelabel|LongTypelabel|DoubleTypelabel|name| id|
+--------------+-----------+----------------+-------------+---------------+----+---+
|          10.0| [1.0, 2.0]|               1|           23|           14.0|  r1| 11|
|          12.0| [2.0, 2.0]|               2|           24|           15.0|  r2| 21|
+--------------+-----------+----------------+-------------+---------------+----+---+


jhseu · 2017-02-14T20:52:11Z

Looks really useful! I'll take a look after the dev summit tomorrow. Don't have cycles at the moment :)

skavulya · 2017-02-15T18:04:33Z

Thanks @jhseu. Enjoy the dev summit :)

dsblr · 2017-02-24T05:30:09Z

Hi @skavulya, nice work by you and your team. Could you please help me with a use case where I can use this integration?
Can I run a TensorFlow RNN over text documents on Spark for a chatbot-style application?

thesuperzapper · 2017-02-27T00:35:45Z

Would it make more sense to integrate/implement this into Spark directly?

skavulya · 2017-02-28T04:17:17Z

Thanks, @dsblr. This library is a connector for moving data between Spark and TensorFlow. For example, if you do ETL in Spark, you can use it to export the result as TFRecords, a format that a TensorFlow program can read. If you would like to run TensorFlow programs on Spark, you can check out Yahoo's TensorFlow on Spark.
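
To make the ETL-then-export use case concrete, here is a minimal sketch rather than a definitive recipe. It assumes Spark and the connector are on the classpath and that the data source is registered under the "tensorflow" format name used in the snippet at the top of this issue. The object name, the `clean` helper, the column names, the sample rows, and the output path are all hypothetical and chosen only for illustration.

```scala
import org.apache.spark.sql.{ DataFrame, SparkSession }
import org.apache.spark.sql.functions.col

object EtlToTfRecords {
  // Hypothetical ETL step: drop rows with a negative score and keep
  // only the columns a training job would need.
  def clean(raw: DataFrame): DataFrame =
    raw.filter(col("score") >= 0.0).select("id", "score", "label")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("etl-to-tfrecords")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up rows standing in for data loaded from an upstream source.
    val raw = Seq(
      (11, "r1", 10.0, 1),
      (21, "r2", -3.0, 0),
      (31, "r3", 12.5, 1)
    ).toDF("id", "name", "score", "label")

    // Assumption: the connector is available under the "tensorflow" format
    // name, as in the original snippet. The output path is hypothetical.
    clean(raw).write.format("tensorflow").save("etl-output.tfr")

    spark.stop()
  }
}
```

A TensorFlow program could then read the files written under the output path as TFRecords. Keeping the ETL step in its own function makes it possible to check the transformation without writing any files.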

@thesuperzapper We thought the library might be more applicable to the TensorFlow ecosystem since it builds upon the TensorFlow Hadoop input/output format, but we are open to suggestions.

@jhseu Did you get a chance to look at our repo? Please let us know what you think.

jhseu · 2017-02-28T19:26:07Z

@skavulya Thanks for your patience, been busy with other things. Yeah, I took a look and it definitely makes sense to merge here. Please make a pull request and put it under a separate spark directory. We can do code review in the pull request.

karthikvadla · 2017-03-01T02:00:41Z

@jhseu @skavulya I have created PR #34. Please review it and let me know if any changes are needed.

Please feel free to add other reviewers too.

jhseu · 2017-03-07T20:59:23Z

Merged

jhseu closed this as completed Mar 7, 2017
