Contributing Spark TensorFlow connector to ecosystem #32

Closed
skavulya opened this issue Feb 11, 2017 · 8 comments

Comments

@skavulya
Contributor

Our team has been working on a Spark TensorFlow connector that we would like to contribute back to the TensorFlow ecosystem. The connector uses the TensorFlow Hadoop input/output format and simplifies importing TFRecords into Spark DataFrames and exporting DataFrames back to TFRecords.

@jhseu, please let us know if this is something you are interested in. We would need some guidance on which directory to place the library in before creating the pull request. We were not sure if we should create a new spark directory at the root of the repo, or whether to create a new sub-directory under hadoop.

Here is a snippet of code that demonstrates the usage of the library:

import org.apache.commons.io.FileUtils
import org.apache.spark.sql.{ DataFrame, Row }
import org.apache.spark.sql.catalyst.expressions.GenericRow
import org.apache.spark.sql.types._

val path = "test-output.tfr"
val testRows: Array[Row] = Array(
  new GenericRow(Array[Any](11, 1, 23L, 10.0F, 14.0, List(1.0, 2.0), "r1")),
  new GenericRow(Array[Any](21, 2, 24L, 12.0F, 15.0, List(2.0, 2.0), "r2")))
val schema = StructType(List(
  StructField("id", IntegerType),
  StructField("IntegerTypelabel", IntegerType),
  StructField("LongTypelabel", LongType),
  StructField("FloatTypelabel", FloatType),
  StructField("DoubleTypelabel", DoubleType),
  StructField("vectorlabel", ArrayType(DoubleType, true)),
  StructField("name", StringType)))
val rdd = spark.sparkContext.parallelize(testRows)

//Save DataFrame as TFRecords
val df: DataFrame = spark.createDataFrame(rdd, schema)
df.write.format("tensorflow").save(path)

//Read TFRecords into DataFrame.
//The DataFrame schema is inferred from the TFRecords if no custom schema is provided.
val importedDf1: DataFrame = spark.read.format("tensorflow").load(path)
importedDf1.show()

//Read TFRecords into DataFrame using custom schema
val importedDf2: DataFrame = spark.read.format("tensorflow").schema(schema).load(path)
importedDf2.show()
+--------------+-----------+----------------+-------------+---------------+----+---+
|FloatTypelabel|vectorlabel|IntegerTypelabel|LongTypelabel|DoubleTypelabel|name| id|
+--------------+-----------+----------------+-------------+---------------+----+---+
|          10.0| [1.0, 2.0]|               1|           23|           14.0|  r1| 11|
|          12.0| [2.0, 2.0]|               2|           24|           15.0|  r2| 21|
+--------------+-----------+----------------+-------------+---------------+----+---+
@jhseu
Contributor

jhseu commented Feb 14, 2017

Looks really useful! I'll take a look after the dev summit tomorrow. Don't have cycles at the moment :)

@skavulya
Contributor Author

Thanks @jhseu. Enjoy the dev summit :)

@dsblr

dsblr commented Feb 24, 2017

Hi @skavulya, this is nice work by you and your team. Could you please help me with some use cases where I can use this integration?
Can I run a TensorFlow RNN on text documents in Spark for a chatbot-style application?

@thesuperzapper

thesuperzapper commented Feb 27, 2017

Would it make more sense to integrate/implement this into Spark directly?

@skavulya
Contributor Author

Thanks, @dsblr. This library is a connector for moving data between Spark and TensorFlow in both directions. For example, you could do ETL in Spark and then export the result in a format that a TensorFlow program can process. If you would like to run TensorFlow programs on Spark itself, you can check out Yahoo's TensorFlow on Spark.
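
As a rough illustration of that ETL-then-export flow, here is a minimal sketch. It assumes the connector is on the classpath and registers the same "tensorflow" format name used in the snippet at the top of this issue. The input file `events.csv`, its `id`, `label`, and `score` columns, and the output path `events.tfr` are hypothetical names chosen only for this example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object EtlToTFRecords {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("etl-to-tfrecords").getOrCreate()

    // Hypothetical input: a CSV file with "id", "label", and "score" columns.
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("events.csv")

    // ETL step: drop rows with missing values and keep only the needed columns.
    val cleaned = raw.na.drop().select(col("id"), col("label"), col("score"))

    // Export step: write the result as TFRecords through the connector
    // (assumes the "tensorflow" data source from this issue is available).
    cleaned.write.format("tensorflow").save("events.tfr")

    spark.stop()
  }
}
```

A TensorFlow program could then read `events.tfr` with its usual TFRecord input pipeline. The training itself happens outside Spark in this setup.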

@thesuperzapper We thought the library might be more applicable to the TensorFlow ecosystem since it builds upon the TensorFlow Hadoop input/output format, but we are open to suggestions.

@jhseu Did you get a chance to look at our repo? Please let us know what you think.

@jhseu
Contributor

jhseu commented Feb 28, 2017

@skavulya Thanks for your patience, been busy with other things. Yeah, I took a look and it definitely makes sense to merge here. Please make a pull request and put it under a separate spark directory. We can do code review in the pull request.

@karthikvadla
Contributor

@jhseu @skavulya I have created PR #34. Please review it and let me know if any changes are needed.

Please feel free to add other reviewers too.

@jhseu
Contributor

jhseu commented Mar 7, 2017

Merged

jhseu closed this as completed Mar 7, 2017