Move AccelerateMixin to hf.py, update docs
After introducing the skorch/hf.py module, it made sense to move the
AccelerateMixin there from helper.py. Of course, this required some code
adjustments, which have been made as well, but there are no functional
changes.

On top of that, I extended some parts of the documentation:

- Created dedicated HF section in docs to put all the HF stuff
- FAQ about gradient accumulation now also points to AccelerateMixin
- Mention gradient accumulation in AccelerateMixin docstring
BenjaminBossan committed Jul 26, 2022
1 parent 1dbf32b commit b884b11
Showing 8 changed files with 500 additions and 587 deletions.
1 change: 1 addition & 0 deletions docs/index.rst
@@ -60,6 +60,7 @@ User's Guide
user/parallelism
user/customization
user/performance
user/huggingface
user/FAQ


5 changes: 5 additions & 0 deletions docs/user/FAQ.rst
@@ -329,6 +329,11 @@ sure that there is an optimization step after the last batch of each
epoch. However, this example can serve as a starting point to
implement your own version of gradient accumulation.

Alternatively, make use of skorch's `accelerate
<https://github.com/huggingface/accelerate>`_ integration provided by
:class:`~skorch.hf.AccelerateMixin` and use the gradient accumulation feature
from that library.
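
For instance, a minimal sketch could look like the following (this assumes
versions of skorch and accelerate in which the ``Accelerator``'s
``gradient_accumulation_steps`` setting exists and is honored by
:class:`~skorch.hf.AccelerateMixin`; ``MyModule``, ``X``, and ``y`` are
placeholders):

.. code:: python

    from accelerate import Accelerator
    from skorch import NeuralNetClassifier
    from skorch.hf import AccelerateMixin

    class AcceleratedNet(AccelerateMixin, NeuralNetClassifier):
        """NeuralNetClassifier with accelerate support"""

    # accumulate gradients over 2 batches before each optimization step
    accelerator = Accelerator(gradient_accumulation_steps=2)
    net = AcceleratedNet(
        MyModule,
        accelerator=accelerator,
        device=None,  # let accelerate handle device placement
    )
    net.fit(X, y)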

How can I dynamically set the input size of the PyTorch module based on the data?
---------------------------------------------------------------------------------

54 changes: 0 additions & 54 deletions docs/user/helper.rst
@@ -48,58 +48,6 @@ argument ``idx=0``, the default) and one for y (with argument
gs.fit(X_sl, y_sl)

AccelerateMixin
---------------

This mixin class can be used to add support for huggingface accelerate_ to
skorch. E.g., this allows you to use mixed precision training (AMP), multi-GPU
training, or training with a TPU. For the time being, this feature should be
considered experimental.

To use this feature, create a new subclass of the neural net class you want to
use and inherit from the mixin class. E.g., if you want to use a
:class:`.NeuralNet`, it would look like this:

.. code:: python

    from skorch import NeuralNet
    from skorch.helper import AccelerateMixin

    class AcceleratedNet(AccelerateMixin, NeuralNet):
        """NeuralNet with accelerate support"""

The same would work for :class:`.NeuralNetClassifier`,
:class:`.NeuralNetRegressor`, etc. Then pass an instance of Accelerator_ with
the desired parameters and you're good to go:

.. code:: python

    from accelerate import Accelerator

    accelerator = Accelerator(...)
    net = AcceleratedNet(
        MyModule,
        accelerator=accelerator,
    )
    net.fit(X, y)

accelerate_ recommends leaving the device handling to the Accelerator_, which
is why ``device`` defaults to ``None`` (thus telling skorch not to change the
device).

To install accelerate_, run the following command inside your Python environment:

.. code:: bash

    python -m pip install accelerate

.. note::

    Under the hood, accelerate uses :class:`~torch.cuda.amp.GradScaler`,
    which does not support passing the training step as a closure.
    Therefore, if your optimizer requires that (e.g.
    :class:`torch.optim.LBFGS`), you cannot use accelerate.

Command line interface helpers
------------------------------

@@ -253,8 +201,6 @@ callbacks through the command line (but you can modify existing ones
as usual).
.. _accelerate: https://github.com/huggingface/accelerate
.. _Accelerator: https://huggingface.co/docs/accelerate/accelerator.html
.. _fire: https://github.com/google/python-fire
.. _numpydoc: https://github.com/numpy/numpydoc
.. _example: https://github.com/skorch-dev/skorch/tree/master/examples/cli
119 changes: 38 additions & 81 deletions notebooks/Hugging_Face_Finetuning.ipynb
@@ -445,55 +445,31 @@
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "28462f59552f4cb19c25ddac48fa47ef",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading: 0%| | 0.00/28.0 [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "321b9786bc634e8ab6ba113bb7ec9a30",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading: 0%| | 0.00/226k [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "66a02c063d594d18b733a7774260c146",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading: 0%| | 0.00/455k [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
"name": "stdout",
"output_type": "stream",
"text": [
"huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
"To disable this warning, you can either:\n",
"\t- Avoid using `tokenizers` before the fork if possible\n",
"\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
"huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
"To disable this warning, you can either:\n",
"\t- Avoid using `tokenizers` before the fork if possible\n",
"\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
"huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
"To disable this warning, you can either:\n",
"\t- Avoid using `tokenizers` before the fork if possible\n",
"\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.weight']\n",
"Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias']\n",
"- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
"- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
"Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.bias', 'classifier.weight', 'pre_classifier.bias']\n",
"Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.weight']\n",
"You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
]
},
@@ -517,7 +493,7 @@
"text": [
" epoch train_loss valid_acc valid_loss dur\n",
"------- ------------ ----------- ------------ --------\n",
" 1 \u001b[36m1.0325\u001b[0m \u001b[32m0.8444\u001b[0m \u001b[35m0.5309\u001b[0m 130.8168\n"
" 1 \u001b[36m1.0325\u001b[0m \u001b[32m0.8444\u001b[0m \u001b[35m0.5309\u001b[0m 129.3804\n"
]
},
{
@@ -538,7 +514,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
" 2 \u001b[36m0.3306\u001b[0m \u001b[32m0.8780\u001b[0m \u001b[35m0.4194\u001b[0m 129.1985\n"
" 2 \u001b[36m0.3306\u001b[0m \u001b[32m0.8780\u001b[0m \u001b[35m0.4194\u001b[0m 129.7384\n"
]
},
{
@@ -559,9 +535,9 @@
"name": "stdout",
"output_type": "stream",
"text": [
" 3 \u001b[36m0.1346\u001b[0m \u001b[32m0.8798\u001b[0m \u001b[35m0.4100\u001b[0m 129.4979\n",
"CPU times: user 6min 7s, sys: 43.4 s, total: 6min 50s\n",
"Wall time: 6min 44s\n"
" 3 \u001b[36m0.1346\u001b[0m \u001b[32m0.8798\u001b[0m \u001b[35m0.4100\u001b[0m 129.8741\n",
"CPU times: user 6min 7s, sys: 42.8 s, total: 6min 50s\n",
"Wall time: 6min 39s\n"
]
},
{
@@ -621,8 +597,8 @@
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 20 s, sys: 35.2 ms, total: 20 s\n",
"Wall time: 15.6 s\n"
"CPU times: user 19.4 s, sys: 23.6 ms, total: 19.5 s\n",
"Wall time: 15 s\n"
]
}
],
@@ -676,10 +652,10 @@
"source": [
"For this to work, you need:\n",
"- A GPU that is capable of mixed precision training\n",
"- The [accelerate library](https://huggingface.co/docs/accelerate/index), which you can install as: `python -m pip install accelerate`.\n",
"- The [accelerate library](https://huggingface.co/docs/accelerate/index), which you can install as: `python -m pip install 'accelerate>=0.11'`.\n",
"- skorch version 0.12 or installed from the current master branch (`python -m pip install git+https://github.com/skorch-dev/skorch.git`)\n",
"\n",
"Again, we assume that you're familiar with the general concept of mixed precision training. For more information on how skorch integrates with accelerate, please consult the [skorch docs](https://skorch.readthedocs.io/en/latest/user/helper.html#acceleratemixin)."
"Again, we assume that you're familiar with the general concept of mixed precision training. For more information on how skorch integrates with accelerate, please consult the [skorch docs](https://skorch.readthedocs.io/en/latest/user/huggingface.html#accelerate)."
]
},
{
@@ -700,37 +676,18 @@
}
],
"source": [
"! [ ! -z \"$COLAB_GPU\" ] && pip install accelerate"
"! [ ! -z \"$COLAB_GPU\" ] && pip install 'accelerate>=0.11'"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "c47aa1a6-f466-4a2c-84ab-034e4d6bdbcd",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
"To disable this warning, you can either:\n",
"\t- Avoid using `tokenizers` before the fork if possible\n",
"\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
"huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
"To disable this warning, you can either:\n",
"\t- Avoid using `tokenizers` before the fork if possible\n",
"\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
"huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
"To disable this warning, you can either:\n",
"\t- Avoid using `tokenizers` before the fork if possible\n",
"\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n"
]
}
],
"outputs": [],
"source": [
"from accelerate import Accelerator\n",
"from skorch.helper import AccelerateMixin"
"from skorch.hf import AccelerateMixin"
]
},
{
@@ -751,7 +708,7 @@
"metadata": {},
"outputs": [],
"source": [
"accelerator = Accelerator(fp16=True)"
"accelerator = Accelerator(mixed_precision='fp16')"
]
},
{
@@ -806,10 +763,10 @@
"name": "stderr",
"output_type": "stream",
"text": [
"Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.weight']\n",
"Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias']\n",
"- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
"- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
"Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.bias', 'classifier.weight', 'pre_classifier.bias']\n",
"Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.weight']\n",
"You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
]
},
@@ -833,7 +790,7 @@
"text": [
" epoch train_loss valid_acc valid_loss dur\n",
"------- ------------ ----------- ------------ -------\n",
" 1 \u001b[36m1.0463\u001b[0m \u001b[32m0.8374\u001b[0m \u001b[35m0.5547\u001b[0m 71.6220\n"
" 1 \u001b[36m1.0463\u001b[0m \u001b[32m0.8374\u001b[0m \u001b[35m0.5547\u001b[0m 71.2980\n"
]
},
{
@@ -854,7 +811,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
" 2 \u001b[36m0.3264\u001b[0m \u001b[32m0.8786\u001b[0m \u001b[35m0.4251\u001b[0m 71.5409\n"
" 2 \u001b[36m0.3264\u001b[0m \u001b[32m0.8786\u001b[0m \u001b[35m0.4251\u001b[0m 73.2230\n"
]
},
{
@@ -875,7 +832,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
" 3 \u001b[36m0.1387\u001b[0m \u001b[32m0.8845\u001b[0m \u001b[35m0.4142\u001b[0m 71.6285\n"
" 3 \u001b[36m0.1387\u001b[0m \u001b[32m0.8845\u001b[0m \u001b[35m0.4142\u001b[0m 74.4516\n"
]
},
{
@@ -927,8 +884,8 @@
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 11.5 s, sys: 32.9 ms, total: 11.5 s\n",
"Wall time: 7.02 s\n"
"CPU times: user 11.7 s, sys: 4.97 ms, total: 11.7 s\n",
"Wall time: 7.27 s\n"
]
}
],
