
Multiprocessing embedding #241

Merged: 7 commits merged from multiprocessing_embedding into master on Nov 6, 2018

Conversation

tweddielin (Contributor)

  • EmbeddingTrainer
    • is now able to train multiple models on the same corpus using multiprocessing (see the usage sketch below).
    • the class interface has changed: the corpus_generator should now be provided as an argument to the train method.
  • BatchGenerator provides the same functionality as batches_generator but can be serialized.
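
A minimal usage sketch of the reworked interface, assuming the models and a ModelStorage are still given to the constructor; every name other than EmbeddingTrainer, train, and save_model is a stand-in, not taken from this PR's diff:

# Hypothetical sketch of the interface change described above.
from skills_ml.algorithms.embedding.train import EmbeddingTrainer

trainer = EmbeddingTrainer(word2vec_model, fasttext_model, model_storage=model_storage)

# The corpus generator is now passed to train() rather than to the constructor,
# and the models are trained over the same corpus in parallel processes.
trainer.train(corpus_generator)

# Persist the trained models through the configured storage.
trainer.save_model()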

@codecov-io commented Oct 25, 2018

Codecov Report

Merging #241 into master will increase coverage by 0.07%.
The diff coverage is 98.63%.


@@            Coverage Diff             @@
##           master     #241      +/-   ##
==========================================
+ Coverage   86.11%   86.19%   +0.07%     
==========================================
  Files          75       75              
  Lines        4343     4367      +24     
==========================================
+ Hits         3740     3764      +24     
  Misses        603      603
Impacted Files Coverage Δ
skills_ml/job_postings/common_schema.py 96.25% <100%> (+0.47%) ⬆️
skills_ml/algorithms/embedding/train.py 95.65% <98.43%> (+0.84%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

        partial_train = partial(self._train_one_batch, batch=batch, *args, **kwargs)
        self._models = pool.map(partial_train, self._models)

elif set(self.model_type) == set(['doc2vec']):

Contributor:
self.model_type should already be a set from the constructor, right?
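
For reference, the pool.map + functools.partial pattern used above can be reproduced in isolation; this is a minimal sketch with stand-in objects, not the PR's code:

from functools import partial
from multiprocessing import Pool

def train_one_batch(model, batch):
    # Stand-in for _train_one_batch: update one model on one batch of sentences.
    model["examples_seen"] += len(batch)
    return model

models = [{"name": "word2vec", "examples_seen": 0}, {"name": "fasttext", "examples_seen": 0}]
batch = [["first", "sentence"], ["second", "sentence"]]

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        partial_train = partial(train_one_batch, batch=batch)
        # Each model is trained on the same batch in its own worker process.
        models = pool.map(partial_train, models)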

if isinstance(x, datetime.datetime) or isinstance(x, datetime.date):
    return x.isoformat()
if isinstance(x, np.ndarray):
    return len(x)

Contributor:
So any arrays of the same length will hash the same way?
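
The concern is easy to demonstrate: once an ndarray is reduced to its length, any two arrays with the same number of elements become indistinguishable to the hash. A minimal sketch (serialize here is a stand-in for the function above):

import numpy as np

def serialize(x):
    # Stand-in for the serializer above: arrays are reduced to their length.
    if isinstance(x, np.ndarray):
        return len(x)
    return x

a = np.array([1, 2, 3])
b = np.array([9, 9, 9])
print(serialize(a) == serialize(b))  # True -- different arrays, same value fed to the hash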

self._model.model_name = self.model_name
logging.info(f"{', '.join([m.model_name for m in self._models])} are trained in {str(timedelta(seconds=time()-tic))}")

def _model_hash(self, model, training_time, corpus_metadata):

Contributor:
This could probably just be a standalone function because there are no references to self

Contributor:
Or make it so the caller doesn't have to pass in self.training_time or self.corpus_metadata
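
One way to address both comments, sketched as an illustration rather than the merged implementation: make the helper a module-level function and let the trainer pass in its own attributes, so callers never supply training_time or corpus_metadata themselves. The JSON + md5 combination below is an assumption, not the PR's actual hashing scheme:

import hashlib
import json

def model_hash(model, training_time, corpus_metadata):
    # Module-level helper: no references to self.
    payload = json.dumps(
        {"model": model.model_name, "time": str(training_time), "corpus": corpus_metadata},
        sort_keys=True,
        default=str,
    )
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

# Inside the trainer, a thin wrapper fills in the instance attributes:
#     def _model_hash(self, model):
#         return model_hash(model, self.training_time, self.corpus_metadata)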

reiter_corpus_gen = Reiterable(corpus_gen)
self._model.build_vocab(reiter_corpus_gen)
self._model.train(reiter_corpus_gen, total_examples=self._model.corpus_count, epochs=self._model.iter, *args, **kwargs)
tic = time()

Contributor:
What does tic mean? I would just call this start_time or train_start_time


def save_model(self, storage=None):
    if storage is None:
        if self.model_storage is None:
            raise AttributeError(f"'self.model_storage' should not be None if you want to save the model")
        ms = self.model_storage
        ms.save_model(self._model, self.model_name)
    for model in self._models:

Contributor:
Unless I'm missing something, we could reduce the branching by doing something like:

if storage:
   ms = ModelStorage(storage)
else:
   ms = self.model_storage

for model in self._models:
   ms.save_model
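
Filling that suggestion out into a complete method, as a sketch only (whether ModelStorage takes the storage backend as its sole constructor argument, and what save_model's exact signature is, are assumptions):

def save_model(self, storage=None):
    # Pick the storage first, then save every model once -- no duplicated save calls per branch.
    if storage:
        ms = ModelStorage(storage)
    else:
        if self.model_storage is None:
            raise AttributeError("'self.model_storage' should not be None if you want to save the model")
        ms = self.model_storage

    for model in self._models:
        ms.save_model(model, model.model_name)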

self._model_hash(model, self.training_time, self.corpus_metadata)]
) + '.model'

if self.model_type <= set(['word2vec', 'fasttext']):

Contributor:
This method is kind of long, maybe splitting this big middle section into _train_batches and _train_full_corpus would work
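
A skeleton of what that split might look like; the method names come from the comment, the batch_size check is an assumption, and the bodies are placeholders rather than the PR's code:

def train(self, corpus_generator, *args, **kwargs):
    # Dispatch to one of two smaller methods instead of one long body.
    if self.batch_size:
        self._train_batches(corpus_generator, *args, **kwargs)
    else:
        self._train_full_corpus(corpus_generator, *args, **kwargs)

def _train_batches(self, corpus_generator, *args, **kwargs):
    # Batched, multiprocessing path (the pool.map + partial section above).
    ...

def _train_full_corpus(self, corpus_generator, *args, **kwargs):
    # Single-pass path: build_vocab + train over a re-iterable corpus.
    ...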

@thcrock thcrock merged commit db1c916 into master Nov 6, 2018
@thcrock thcrock deleted the multiprocessing_embedding branch November 6, 2018 19:47