Enabling Spark based HDFS I/O for running ml4ir @rev balikasg@ #44

lastmansleeping · 2020-07-13T14:18:08Z

The PR includes changes required to read the following:

data (any directory -> copied over to local FS)
feature config + model config (any YAML/JSON file)
dataframe(eg: vocabulary files)

and writing the following back to HDFS:

models
logs, metrics, etc

lastmansleeping · 2020-07-13T14:41:24Z

python/ml4ir/base/io/spark_io.py

+    return SparkConfigHolder.HADOOP_CONFIG
+
+
+def get_hdfs():


@balikasg This is the main module to review. I had to jump some hoops to get access to the hadoop File system since it is only available in spark's java based APIs.

thanks for the pointer!
From what I understand it is not a Flowsnake v2 thing, we are expecting HDFS paths and we are handling things as this.

jakemannix · 2020-07-13T15:49:51Z

python/ml4ir/base/data/relevance_dataset.py

-        self.data_dir: str = data_dir
+        self.logger = logger
+
+        # If data directory is a HDFS path, first copy to local file system


I think maybe we can reorganize this a bit. This has references to sparkIO directly in this class, which we don't want. We should encapsulate the IO into a class (ahh! Jake is suggesting a class?!?) or a function which is passed in.

We can chat about that later on if it's not clear.

(Haha. I went overboard with NOT making spark_io a class this time.)
I'm not sure how we can eliminate a reference to spark_io here though. I can wrap it under file_io.copy_dir(src, dest) and then internally call spark_io like I do for the other I/O codepaths, if you like.

Does that work?

what I mean is you could pass into RelevanceDataset's constructor, an object or function which provides IO. If it's a object which you pass in, the (abstract) class of the object would be DataIO (or something like that), and some particular implementations would be LocalIO and SparkIO. Then RelevanceDataset has no references to spark, but whoever instantiates a RelevanceDataset may choose to create a spark-based DataIO and pass it into the constructor.

I understand what you are suggesting. But I don't get how you would do this specific task - "copying files from HDFS to Local filesystem", which is specific for the spark_io case.

Keep in mind that for tfrecord, we don't explicitly(using file_io or spark_io) do any read at all. It's handled by tensorflow.

Alternatively, I can do the copying in the main pipeline and force RelevanceDataset to work with only file_io(or local file I/O). This might be cleaner. My reasoning for adding the copying to RelevanceDataset was that if a user wanted to create a RelevanceDataset using data stored on HDFS, then they already can do it. They don't have to write their own version of spark_IO. May be passing in the file handler object into RelevanceDataset is a good middle ground. Not entirely convinced.

lastmansleeping · 2020-07-14T21:43:58Z

@balikasg Synced with @jakemannix offline, and he requested a few structural changes to the code. Will update you when done, but not changing any functionality. So you can still review it when you are free and try training models.

balikasg · 2020-07-15T07:46:25Z

python/ml4ir/base/config/keys.py

+    MODELS = "models"
+    LOGS = "logs"
+    DATA = "data"
+    TEMP_DATA = "data/temp"


since they are temp, should we move this to /tmp/data? do we have cleaning in place?

although I don't know if flowsnake has /tmp

I explicitly delete the temp directory once done.

balikasg · 2020-07-15T07:51:50Z

python/ml4ir/base/features/feature_config.py

@@ -465,13 +473,11 @@ def parse_config(
    tfrecord_type: str, feature_config, logger: Optional[Logger] = None
 ) -> FeatureConfig:
    if feature_config.endswith(".yaml"):
-        feature_config = file_io.read_yaml(feature_config)
-        if logger:
-            logger.info("Reading feature config from YAML file : {}".format(feature_config))


I am just wondering here. Not for this PR.
But if logger repeats a lot in ml4ir. Why not agree we always need a logger, and instantiate one form the beginning? Or decorate key functions/classes to avoid such checks.

Yeah. Let me file an issue for this. It looks to me like tech debt building.

I agree - we should just always have a Logger - if none is configured, use a dummy / noop logger or something.

balikasg · 2020-07-15T07:53:05Z

python/ml4ir/base/features/feature_config.py

-        if logger:
-            logger.info("Reading feature config from YAML string")
+    if logger:
+        logger.info("Feature Config \n{}".format(json.dumps(feature_config, indent=4)))


btw these are very nice, but not related to the functionality this PR adds!
thanks for the effort though! We could also consider this as logger debug, as it is often long

True. Can change.

balikasg · 2020-07-15T07:58:13Z

python/ml4ir/base/io/file_io.py

+
+    if infile.startswith(HDFS_PREFIX):
+        # NOTE: Move to fully spark dataframe based operations in the future
+        return spark_io.read_df(infile).toPandas()


is this lazy? Does it scale or OOM?

This is not lazy. The goal is to only use this for small files like vocabulary, etc which you really need in memory. For training data, we should use the lazy loading provided by TFRecordDataset.

balikasg · 2020-07-15T08:00:47Z

python/ml4ir/base/io/file_io.py

+
+    if infile.startswith(HDFS_PREFIX):
+        # NOTE: Move to fully spark dataframe based operations in the future
+        return spark_io.read_df(infile).toPandas()
    elif infile.endswith(".gz"):
        fp = gzip.open(os.path.expanduser(infile), "rb")


I have the impression that pandas handles gz files out of the box.

I used this method from our other projects. So my memory might not be completely correct - I don't think pandas can read gz files with the C engine. Can check though.

balikasg · 2020-07-15T08:03:05Z

python/ml4ir/base/io/file_io.py

-        fp.close()
-    return output
+        if outfile:
+            fp = open(outfile, "w")


Here using with open()... is a more pythonic way. Why not directly write with pd.to_csv? to replace?

After spending a lot of time with our ranking data, this combination of read_df and write_df were the only ones that were compatible with each other and the upstream spark jobs. If you notice, I need to replace \\ with \\\\ on the next line for compatibility with re-reading. I can add a FIXME here for the future though.

(Again, this method was written a year ago for some of our other projects. Might not be necessary to do all this hacky read write for ml4ir)

balikasg · 2020-07-15T09:15:03Z

python/ml4ir/base/pipeline.py

        if self.data_format == DataFormatKey.CSV:
            file_io.rm_dir(os.path.join(self.data_dir, "tfrecord"))
+        file_io.rm_dir(DefaultDirectoryKey.TEMP_DATA)


Just saw this, there is cleaning!

balikasg

This PR looks good, I like the new design!
I have left a few non-blocking comments, so I am approving

balikasg · 2020-07-16T12:16:57Z

python/ml4ir/base/io/file_io.py

-    """
-    if outfile and outfile.startswith(HDFS_PREFIX):
+
+class FileIO(object):


I like this! Defining FileIO and then having the LocalIO and SparkIO, is clean, slick etc. nice!

balikasg · 2020-07-16T12:20:14Z

python/ml4ir/base/io/local_io.py

+class LocalIO(FileIO):
+    """Class defining the file I/O handler methods for the local file system"""
+
+    def make_directory(self, dir_path: str, clear_dir: bool = False) -> str:


clear_dir is an overloaded name in this implementation. There are both functions and variables defined with the same name. Can we consider changing one of the two? For example, this bool arg can be renamed to remove_dir_content or something else equally descriptive. clear_dir_content would also work I guess.

balikasg · 2020-07-16T12:20:47Z

python/ml4ir/base/io/spark_io.py

+        )
+        self.local_fs = self.hdfs.getLocal(self.hadoop_config)
+
+    def make_directory(self, dir_path: str, clear_dir: bool = False) -> str:


same here with clear_dir..

balikasg · 2020-07-16T12:23:16Z

python/ml4ir/base/io/spark_io.py

+        Returns:
+            python dictionary
+        """
+        self.log("Reading JSON file : {}".format(infile))


nice usage of self.log here! +1

balikasg · 2020-07-16T12:25:29Z

python/ml4ir/base/io/spark_io.py

+        Returns:
+            pandas dataframe
+        """
+        raise NotImplementedError


SparkIO inherits from FileIO, isn't it redundant to overwrite notImplemented methods that were defined there? Any particular reason for it? it occurs a few times in SparkIO.

Another question (if you know): just passing the list of files, wouldn't do the trick? Like:

self.spark_session.read.csv .option("header", "true") .option("inferschema", "true") .option("mode", "DROPMALFORMED") .option("mergeSchema", "true") .load(infiles) .toPandas()

I am looking at this question.

lastmansleeping · 2020-07-16T12:32:39Z

This PR looks good, I like the new design!
I have left a few non-blocking comments, so I am approving

Thanks @balikasg . Will address the comments and fix the test failures and then merge the PR.

lastmansleeping · 2020-07-16T13:48:03Z

@jakemannix I have made the requested changes. Had to change a lot of files though.

jakemannix

Ok this about covers my concerns. We're still coupled, but the coupling is loose enough that it's mostly just package restructuring left.

jakemannix · 2020-07-16T14:48:54Z

python/ml4ir/base/pipeline.py

+        if self.args.file_handler == FileHandlerKey.LOCAL:
+            self.file_io = self.local_io
+        elif self.args.file_handler == FileHandlerKey.SPARK:
+            self.file_io = SparkIO(self.logger)


Ok so this is basically what I was looking for, yes. Technically, we need to make one last step of separation: pull spark_io.py out of ml4ir/base and move it into a new python package (ml4ir-spark), and perhaps put the pipeline in its own package as well (ml4ir-apps), and show how you could allow people to run pipelines without having to pip install pyspark at all, if not needed.

But folding that into the next ticket should be fine.

balikasg · 2020-07-17T08:18:47Z

python/ml4ir/applications/ranking/data/scripts/create_dataset.py

-    parser.add_argument('-max_num_records', default=MAX_NUM_RECORDS)
-    parser.add_argument('-num_samples', default=NUM_SAMPLES)
-    parser.add_argument('-random_state', default=RANDOM_STATE)
+    parser.add_argument("-data_dir", default=DATA_DIR)


help messages?

lastmansleeping added 17 commits July 13, 2020 02:20

Adding spark_io module

cdb8402

Fixing issues

47067d0

Updating logging

d98ab1b

Fixing issues

5d9dbce

Fixing issues

ae3efe0

Fixing issues

7953de8

Fixing HDFS get

2847fdd

Fixing path issues

d5c3edf

Fixing path issues

98b438d

Fixing YAML read

52af1a4

Adding logging

e6393de

Adding overwrite copy

f53cd96

Fixing issues

2e251ba

Clean up

0ce7932

Fixing issues

96efb83

Fixing overwrite

48235bb

Fixing read_df

95449a2

lastmansleeping self-assigned this Jul 13, 2020

salesforce-cla bot added the cla:signed label Jul 13, 2020

lastmansleeping changed the title ~~Enabling Spark based HDFS I/O for running ml4ir~~ Enabling Spark based HDFS I/O for running ml4ir @rev balikasg@ Jul 13, 2020

lastmansleeping requested review from balikasg and jakemannix and removed request for balikasg July 13, 2020 14:25

Updating logging

948eac1

lastmansleeping commented Jul 13, 2020

View reviewed changes

jakemannix suggested changes Jul 13, 2020

View reviewed changes

lastmansleeping requested a review from darshshah July 13, 2020 16:51

balikasg reviewed Jul 15, 2020

View reviewed changes

lastmansleeping added 2 commits July 16, 2020 02:19

Adding file IO handlers

f8ce5e9

Configuring file IO objects

3b5e9a8

balikasg approved these changes Jul 16, 2020

View reviewed changes

lastmansleeping added 4 commits July 16, 2020 05:44

Fixing issues

cb8137f

Fixing output directory

6e0bac8

Fixing tests

5e7b0ce

Clean up

d90cf35

jakemannix approved these changes Jul 16, 2020

View reviewed changes

lastmansleeping merged commit d47950a into master Jul 17, 2020

lastmansleeping deleted the ashish/spark_io branch July 17, 2020 07:40

balikasg reviewed Jul 17, 2020

View reviewed changes

Enabling Spark based HDFS I/O for running ml4ir @rev balikasg@ #44

Enabling Spark based HDFS I/O for running ml4ir @rev balikasg@ #44

Conversation

lastmansleeping commented Jul 13, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lastmansleeping commented Jul 14, 2020 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

balikasg left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lastmansleeping commented Jul 16, 2020

lastmansleeping commented Jul 16, 2020

jakemannix left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lastmansleeping commented Jul 14, 2020 •

edited