
Multiprocessing embedding #241

Merged: 7 commits merged from multiprocessing_embedding into master on Nov 6, 2018

Conversation

tweddielin (Contributor)

  • EmbeddingTrainer
    • is now able to train multiple models on the same corpus using multiprocessing (see the usage sketch below).
    • the class interface has changed: the corpus_generator should now be provided as an argument to the train method.
  • BatchGenerator provides the same functionality as batches_generator but can be serialized.
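
A minimal usage sketch of the reworked interface, assuming the models and a ModelStorage are still given to the constructor; every name other than EmbeddingTrainer, train, and save_model is a stand-in, not taken from this PR's diff:

# Hypothetical sketch of the interface change described above.
from skills_ml.algorithms.embedding.train import EmbeddingTrainer

trainer = EmbeddingTrainer(word2vec_model, fasttext_model, model_storage=model_storage)

# The corpus generator is now passed to train() rather than to the constructor,
# and the models are trained over the same corpus in parallel processes.
trainer.train(corpus_generator)

# Persist the trained models through the configured storage.
trainer.save_model()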

@codecov-io commented Oct 25, 2018

Codecov Report

Merging #241 into master will increase coverage by 0.07%.
The diff coverage is 98.63%.


@@            Coverage Diff             @@
##           master     #241      +/-   ##
==========================================
+ Coverage   86.11%   86.19%   +0.07%     
==========================================
  Files          75       75              
  Lines        4343     4367      +24     
==========================================
+ Hits         3740     3764      +24     
  Misses        603      603
Impacted Files Coverage Δ
skills_ml/job_postings/common_schema.py 96.25% <100%> (+0.47%) ⬆️
skills_ml/algorithms/embedding/train.py 95.65% <98.43%> (+0.84%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

        partial_train = partial(self._train_one_batch, batch=batch, *args, **kwargs)
        self._models = pool.map(partial_train, self._models)

elif set(self.model_type) == set(['doc2vec']):

Contributor:
self.model_type should already be a set from the constructor, right?
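
For reference, the pool.map + functools.partial pattern used above can be reproduced in isolation; this is a minimal sketch with stand-in objects, not the PR's code:

from functools import partial
from multiprocessing import Pool

def train_one_batch(model, batch):
    # Stand-in for _train_one_batch: update one model on one batch of sentences.
    model["examples_seen"] += len(batch)
    return model

models = [{"name": "word2vec", "examples_seen": 0}, {"name": "fasttext", "examples_seen": 0}]
batch = [["first", "sentence"], ["second", "sentence"]]

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        partial_train = partial(train_one_batch, batch=batch)
        # Each model is trained on the same batch in its own worker process.
        models = pool.map(partial_train, models)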

if isinstance(x, datetime.datetime) or isinstance(x, datetime.date):
    return x.isoformat()
if isinstance(x, np.ndarray):
    return len(x)

Contributor:
So any arrays of the same length will hash the same way?
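
The concern is easy to demonstrate: once an ndarray is reduced to its length, any two arrays with the same number of elements become indistinguishable to the hash. A minimal sketch (serialize here is a stand-in for the function above):

import numpy as np

def serialize(x):
    # Stand-in for the serializer above: arrays are reduced to their length.
    if isinstance(x, np.ndarray):
        return len(x)
    return x

a = np.array([1, 2, 3])
b = np.array([9, 9, 9])
print(serialize(a) == serialize(b))  # True -- different arrays, same value fed to the hash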

self._model.model_name = self.model_name
logging.info(f"{', '.join([m.model_name for m in self._models])} are trained in {str(timedelta(seconds=time()-tic))}")

def _model_hash(self, model, training_time, corpus_metadata):

Contributor:
This could probably just be a standalone function because there are no references to self

Contributor:
Or make it so the caller doesn't have to pass in self.training_time or self.corpus_metadata
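
One way to address both comments, sketched as an illustration rather than the merged implementation: make the helper a module-level function and let the trainer pass in its own attributes, so callers never supply training_time or corpus_metadata themselves. The JSON + md5 combination below is an assumption, not the PR's actual hashing scheme:

import hashlib
import json

def model_hash(model, training_time, corpus_metadata):
    # Module-level helper: no references to self.
    payload = json.dumps(
        {"model": model.model_name, "time": str(training_time), "corpus": corpus_metadata},
        sort_keys=True,
        default=str,
    )
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

# Inside the trainer, a thin wrapper fills in the instance attributes:
#     def _model_hash(self, model):
#         return model_hash(model, self.training_time, self.corpus_metadata)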

reiter_corpus_gen = Reiterable(corpus_gen)
self._model.build_vocab(reiter_corpus_gen)
self._model.train(reiter_corpus_gen, total_examples=self._model.corpus_count, epochs=self._model.iter, *args, **kwargs)
tic = time()

Contributor:
What does tic mean? I would just call this start_time or train_start_time


def save_model(self, storage=None):
    if storage is None:
        if self.model_storage is None:
            raise AttributeError(f"'self.model_storage' should not be None if you want to save the model")
        ms = self.model_storage
        ms.save_model(self._model, self.model_name)
    for model in self._models:

Contributor:
Unless I'm missing something, we could reduce the branching by doing something like:

if storage:
   ms = ModelStorage(storage)
else:
   ms = self.model_storage

for model in self._models:
   ms.save_model
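
Filling that suggestion out into a complete method, as a sketch only (whether ModelStorage takes the storage backend as its sole constructor argument, and what save_model's exact signature is, are assumptions):

def save_model(self, storage=None):
    # Pick the storage first, then save every model once -- no duplicated save calls per branch.
    if storage:
        ms = ModelStorage(storage)
    else:
        if self.model_storage is None:
            raise AttributeError("'self.model_storage' should not be None if you want to save the model")
        ms = self.model_storage

    for model in self._models:
        ms.save_model(model, model.model_name)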

self._model_hash(model, self.training_time, self.corpus_metadata)]
) + '.model'

if self.model_type <= set(['word2vec', 'fasttext']):

Contributor:
This method is kind of long, maybe splitting this big middle section into _train_batches and _train_full_corpus would work
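
A skeleton of what that split might look like; the method names come from the comment, the batch_size check is an assumption, and the bodies are placeholders rather than the PR's code:

def train(self, corpus_generator, *args, **kwargs):
    # Dispatch to one of two smaller methods instead of one long body.
    if self.batch_size:
        self._train_batches(corpus_generator, *args, **kwargs)
    else:
        self._train_full_corpus(corpus_generator, *args, **kwargs)

def _train_batches(self, corpus_generator, *args, **kwargs):
    # Batched, multiprocessing path (the pool.map + partial section above).
    ...

def _train_full_corpus(self, corpus_generator, *args, **kwargs):
    # Single-pass path: build_vocab + train over a re-iterable corpus.
    ...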

@thcrock thcrock merged commit db1c916 into master Nov 6, 2018
@thcrock thcrock deleted the multiprocessing_embedding branch November 6, 2018 19:47