Multiprocessing embedding #241
Conversation
Codecov Report

@@            Coverage Diff             @@
##           master     #241      +/-   ##
==========================================
+ Coverage   86.11%   86.19%   +0.07%
==========================================
  Files          75       75
  Lines        4343     4367      +24
==========================================
+ Hits         3740     3764      +24
  Misses        603      603

Continue to review full report at Codecov.
partial_train = partial(self._train_one_batch, batch=batch, *args, **kwargs)
self._models = pool.map(partial_train, self._models)

elif set(self.model_type) == set(['doc2vec']):
self.model_type should already be a set from the constructor, right?
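If the constructor already normalizes model_type into a set, the set() re-wrapping here is redundant. A minimal sketch of that idea, assuming a constructor roughly like this (the signature is illustrative, not the project's actual API):

class EmbeddingTrainer:
    def __init__(self, model_type):
        # Normalize once at construction; accept a single name or an iterable.
        if isinstance(model_type, str):
            model_type = [model_type]
        self.model_type = set(model_type)

    def train(self, *args, **kwargs):
        # With self.model_type guaranteed to be a set, the branch above
        # can compare directly:
        if self.model_type == {'doc2vec'}:
            ...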
skills_ml/utils/__init__.py
Outdated
if isinstance(x, datetime.datetime) or isinstance(x, datetime.date):
    return x.isoformat()
if isinstance(x, np.ndarray):
    return len(x)
So any arrays of the same length will hash the same way?
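The concern: if this serializer feeds a hash of model metadata, mapping every ndarray to its length means two different arrays of equal length serialize, and therefore hash, identically. A hedged sketch of the collision and one possible fix (the function name and its role as a JSON default hook are assumptions about the surrounding code):

import datetime
import numpy as np

def json_default(x):
    if isinstance(x, (datetime.datetime, datetime.date)):
        return x.isoformat()
    if isinstance(x, np.ndarray):
        # Original: return len(x). Then np.zeros(3) and np.ones(3)
        # both serialize to 3 and hash the same way. Serializing the
        # contents avoids the collision:
        return x.tolist()
    raise TypeError(f"Object of type {type(x).__name__} is not JSON serializable")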
self._model.model_name = self.model_name
logging.info(f"{', '.join([m.model_name for m in self._models])} are trained in {str(timedelta(seconds=time()-tic))}")

def _model_hash(self, model, training_time, corpus_metadata):
This could probably just be a standalone function because there are no references to self
Or make it so the caller doesn't have to pass in self.training_time or self.corpus_metadata
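Both suggestions sketched together; the hash construction itself is invented for illustration, since the body of _model_hash isn't shown in the diff:

import hashlib
import json

# Option 1: a standalone function -- nothing in it touches self.
def model_hash(model, training_time, corpus_metadata):
    payload = json.dumps(
        {'model': model.model_name,
         'trained': str(training_time),
         'corpus': corpus_metadata},
        sort_keys=True,
        default=str,
    )
    return hashlib.md5(payload.encode('utf-8')).hexdigest()

# Option 2: keep a thin method, but read the extra arguments off self
# so callers only pass the model.
def _model_hash(self, model):
    return model_hash(model, self.training_time, self.corpus_metadata)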
reiter_corpus_gen = Reiterable(corpus_gen)
self._model.build_vocab(reiter_corpus_gen)
self._model.train(reiter_corpus_gen, total_examples=self._model.corpus_count, epochs=self._model.iter, *args, **kwargs)
tic = time()
What does tic mean? I would just call this start_time or train_start_time
def save_model(self, storage=None):
    if storage is None:
        if self.model_storage is None:
            raise AttributeError(f"'self.model_storage' should not be None if you want to save the model")
        ms = self.model_storage
        ms.save_model(self._model, self.model_name)
    for model in self._models:
Unless I'm missing something, we could reduce the branching by doing something like:

if storage:
    ms = ModelStorage(storage)
else:
    ms = self.model_storage

for model in self._models:
    ms.save_model(model, model.model_name)
    self._model_hash(model, self.training_time, self.corpus_metadata)]
) + '.model'

if self.model_type <= set(['word2vec', 'fasttext']):
This method is kind of long; maybe splitting this big middle section into _train_batches and _train_full_corpus would work.
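A sketch of the suggested split, which also adopts the start_time rename from the earlier comment (everything beyond the two new method names is an assumption about the surrounding class):

def train(self, corpus_gen, *args, **kwargs):
    start_time = time()
    if self.batch_size:
        self._train_batches(corpus_gen, *args, **kwargs)
    else:
        self._train_full_corpus(corpus_gen, *args, **kwargs)
    logging.info(f"Training took {timedelta(seconds=time() - start_time)}")

def _train_batches(self, corpus_gen, *args, **kwargs):
    # Multiprocessing path: pool.map each batch across self._models.
    ...

def _train_full_corpus(self, corpus_gen, *args, **kwargs):
    # Single-pass path: build the vocab, then train on the whole corpus.
    reiter_corpus_gen = Reiterable(corpus_gen)
    self._model.build_vocab(reiter_corpus_gen)
    self._model.train(reiter_corpus_gen,
                      total_examples=self._model.corpus_count,
                      epochs=self._model.iter, *args, **kwargs)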
- EmbeddingTrainer: the corpus_generator should now be provided in the train method arguments.
- BatchGenerator provides the same functionality as batches_generator but can be serialized.
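The serialization point matters for the multiprocessing path: a generator object cannot be pickled and shipped to worker processes, but a small iterable class can. A minimal sketch of the pattern (an illustration, not the project's actual BatchGenerator implementation):

import pickle
from itertools import islice

def batches_generator(iterable, batch_size):
    # Plain generator function: calling it returns a generator object,
    # which pickle rejects (TypeError: cannot pickle 'generator' object).
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

class BatchGenerator:
    # Same behavior, but state lives in picklable attributes, so
    # instances can cross process boundaries.
    def __init__(self, iterable, batch_size):
        self.iterable = iterable
        self.batch_size = batch_size

    def __iter__(self):
        it = iter(self.iterable)
        while True:
            batch = list(islice(it, self.batch_size))
            if not batch:
                return
            yield batch

pickle.dumps(BatchGenerator(range(10), 3))       # works
# pickle.dumps(batches_generator(range(10), 3))  # raises TypeError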