From 77dd0468d8c517ceff2c08e78869ff7c7f26cc0d Mon Sep 17 00:00:00 2001 From: psicobloc Date: Tue, 18 Feb 2020 18:59:49 -0600 Subject: [PATCH 1/5] progress on spanish version cs230 DL tips & tricks --- es/cs-230-deep-learning-tips-and-tricks.md | 450 +++++++++++++++++++++ 1 file changed, 450 insertions(+) create mode 100644 es/cs-230-deep-learning-tips-and-tricks.md diff --git a/es/cs-230-deep-learning-tips-and-tricks.md b/es/cs-230-deep-learning-tips-and-tricks.md new file mode 100644 index 000000000..2cb1e6dc6 --- /dev/null +++ b/es/cs-230-deep-learning-tips-and-tricks.md @@ -0,0 +1,450 @@ +**Deep Learning Tips and Tricks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-deep-learning-tips-and-tricks) + +
+ +**1. Deep Learning Tips and Tricks cheatsheet** + +⟶ Hoja de referencia de consejos y trucos sobre Aprendizaje Profundo. + +
+ + +**2. CS 230 - Deep Learning** + +⟶ CS 230 - Aprendizaje Profundo. + +
+ + +**3. Tips and tricks** + +⟶ Consejos y trucos. + +
+ + +**4. [Data processing, Data augmentation, Batch normalization]** + +⟶ [Procesamiento de datos, Aumentación de datos, Normalización por lotes] + +
**5. [Training a neural network, Epoch, Mini-batch, Cross-entropy loss, Backpropagation, Gradient descent, Updating weights, Gradient checking]**

⟶ [Entrenando una red neuronal, Época, Mini-lote, Pérdida de entropía cruzada, Retropropagación, Descenso por gradientes, Actualización de pesos, Comprobación de gradientes]

<br>
+ + +**6. [Parameter tuning, Xavier initialization, Transfer learning, Learning rate, Adaptive learning rates]** + +⟶[Ajuste de parámetros, Inicialización Xavier, Aprendizaje por transferencia, Tasa de aprendizaje, Tasas de aprendizaje adaptativas] + +
+ + +**7. [Regularization, Dropout, Weight regularization, Early stopping]** + +⟶ [Regularización, Descarte, Regularización de pesos, Parada temprana] + +
+ + +**8. [Good practices, Overfitting small batch, Gradient checking]** + +⟶ [Buenas prácticas, Sobreajuste en lotes pequeños, Comprobación de gradientes] + +
**9. View PDF version on GitHub**

⟶ Ver la versión PDF en GitHub.

<br>
+ + +**10. Data processing** + +⟶ Procesamiento de datos. + +
**11. Data augmentation ― Deep learning models usually need a lot of data to be properly trained. It is often useful to get more data from the existing ones using data augmentation techniques. The main ones are summed up in the table below. More precisely, given the following input image, here are the techniques that we can apply:**

⟶ Aumentación de datos ― Los modelos de aprendizaje profundo usualmente necesitan una gran cantidad de datos para poder ser entrenados adecuadamente. A menudo resulta útil obtener más datos a partir de los ya existentes, utilizando técnicas de aumentación de datos. Las principales técnicas se resumen en la siguiente tabla. Siendo más precisos, dada la siguiente imagen de entrada, éstas son las técnicas que podemos aplicar:

<br>
+ + +**12. [Original, Flip, Rotation, Random crop]** + +⟶ [Original, Volteo, Rotación, Recorte aleatorio] + +
**13. [Image without any modification, Flipped with respect to an axis for which the meaning of the image is preserved, Rotation with a slight angle, Simulates incorrect horizon calibration, Random focus on one part of the image, Several random crops can be done in a row]**

⟶ [Imagen sin ninguna modificación, Volteada respecto a un eje para el cual se conserva el significado de la imagen, Rotación con un ángulo ligero, Simula una calibración incorrecta del horizonte, Enfoque aleatorio en una parte de la imagen, Se pueden hacer varios recortes aleatorios seguidos]

<br>
+ + +**14. [Color shift, Noise addition, Information loss, Contrast change]** + +⟶[Cambio de color, Adición de ruido, Pérdida de información, Cambio de contraste] + +
**15. [Nuances of RGB is slightly changed, Captures noise that can occur with light exposure, Addition of noise, More tolerance to quality variation of inputs, Parts of image ignored, Mimics potential loss of parts of image, Luminosity changes, Controls difference in exposition due to time of day]**

⟶ [Los matices de RGB se cambian ligeramente, Captura el ruido que puede aparecer con la exposición a la luz, Adición de ruido, Mayor tolerancia a la variación de calidad de las entradas, Se ignoran partes de la imagen, Imita la pérdida potencial de partes de la imagen, Cambios de luminosidad, Controla la diferencia de exposición debida a la hora del día]

<br>
+ + +**16. Remark: data is usually augmented on the fly during training.** + +⟶ Observación: los datos generalmente se aumentan sobre la marcha, durante el entrenamiento. + +
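To make the augmentation table above concrete, here is a minimal NumPy sketch of on-the-fly flips, random crops, noise addition and contrast changes; the image shape, crop size and noise level are arbitrary illustrative values, not part of the original cheatsheet.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224, 3)).astype(np.float32)  # dummy RGB image

def random_flip(img):
    # Horizontal flip preserves the meaning of most natural images
    return img[:, ::-1, :] if rng.random() < 0.5 else img

def random_crop(img, size=196):
    # Random focus on one part of the image; several crops can be drawn per image
    h, w, _ = img.shape
    top, left = rng.integers(0, h - size), rng.integers(0, w - size)
    return img[top:top + size, left:left + size, :]

def add_noise(img, sigma=10.0):
    # Captures the kind of noise that can occur with light exposure
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0, 255)

def change_contrast(img, factor=1.3):
    # Controls differences in exposition, e.g. due to time of day
    return np.clip((img - img.mean()) * factor + img.mean(), 0, 255)

augmented = change_contrast(add_noise(random_crop(random_flip(image))))
print(augmented.shape)  # (196, 196, 3)
```

<br>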
**17. Batch normalization ― It is a step of hyperparameter γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of that we want to correct to the batch, it is done as follows:**

⟶ Normalización por lotes - Es un paso del hiperparámetro γ,β que normaliza el lote {xi}. Denotando μB,σ2B la media y la varianza de lo que queremos corregir en el lote, se realiza de la siguiente manera:

<br>
+ + +**18. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.** + +⟶ Se realiza usualmente después de una capa completamente conectada/convolucional y antes de una capa no-lineal y su objetivo es permitir tasas de aprendizaje más altas y reducir su fuerte dependencia sobre la inicialización. + +
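A minimal NumPy sketch of the batch-normalization step described above; the values of γ, β, ε and the batch shape are illustrative assumptions.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize a batch x of shape (batch_size, features), then scale and shift."""
    mu = x.mean(axis=0)                    # μB: per-feature batch mean
    var = x.var(axis=0)                    # σ²B: per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations
    return gamma * x_hat + beta            # learnable scale γ and shift β

x = np.random.randn(64, 100)               # a mini-batch of 64 examples, 100 features
out = batch_norm(x, gamma=np.ones(100), beta=np.zeros(100))
print(out.mean(axis=0).round(3)[:5], out.std(axis=0).round(3)[:5])  # ≈ 0 and ≈ 1
```

<br>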
+ + +**19. Training a neural network** + +⟶ Entrenando una red neuronal. + +
+ + +**20. Definitions** + +⟶ Definiciones. + +
+ + +**21. Epoch ― In the context of training a model, epoch is a term used to refer to one iteration where the model sees the whole training set to update its weights.** + +⟶ + +
**22. Mini-batch gradient descent ― During the training phase, updating weights is usually not based on the whole training set at once due to computation complexities or one data point due to noise issues. Instead, the update step is done on mini-batches, where the number of data points in a batch is a hyperparameter that we can tune.**

⟶ Descenso por gradientes con mini-lotes ― Durante la fase de entrenamiento, la actualización de los pesos usualmente no se basa en todo el conjunto de entrenamiento a la vez, debido a la complejidad de cómputo, ni en un solo dato, debido a problemas de ruido. En su lugar, el paso de actualización se realiza sobre mini-lotes, donde el número de datos por lote es un hiperparámetro que podemos ajustar.

<br>
+ + +**23. Loss function ― In order to quantify how a given model performs, the loss function L is usually used to evaluate to what extent the actual outputs y are correctly predicted by the model outputs z.** + +⟶ + +
+ + +**24. Cross-entropy loss ― In the context of binary classification in neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:** + +⟶ + +
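As a quick numeric check of the definition above, here is a minimal NumPy version of the binary cross-entropy L(z,y) = −[y log(z) + (1−y) log(1−z)]; the clipping constant is an added safeguard against log(0), not part of the formula itself.

```python
import numpy as np

def binary_cross_entropy(z, y, eps=1e-12):
    """Mean of L(z, y) = -[y*log(z) + (1-y)*log(1-z)] over a batch."""
    z = np.clip(z, eps, 1 - eps)          # avoid log(0)
    return -np.mean(y * np.log(z) + (1 - y) * np.log(1 - z))

y = np.array([1, 0, 1, 1])                # true labels
z = np.array([0.9, 0.2, 0.7, 0.4])        # model outputs (probabilities)
print(binary_cross_entropy(z, y))         # ≈ 0.40
```

<br>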
+ + +**25. Finding optimal weights** + +⟶ Encontrando los pesos óptimos. + +
+ + +**26. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to each weight w is computed using the chain rule.** + +⟶ + +
**27. Using this method, each weight is updated with the rule:**

⟶ Utilizando este método, cada peso es actualizado con la regla:

<br>
+ + +**28. Updating weights ― In a neural network, weights are updated as follows:** + +⟶ Actualizando pesos ― En una red neuronal, los pesos se actualizan de la siguiente manera: + +
**29. [Step 1: Take a batch of training data and perform forward propagation to compute the loss, Step 2: Backpropagate the loss to get the gradient of the loss with respect to each weight, Step 3: Use the gradients to update the weights of the network.]**

⟶ [Paso 1: Tomar un lote de datos de entrenamiento y realizar propagación hacia adelante para calcular la pérdida, Paso 2: Retropropagar la pérdida para obtener el gradiente de la pérdida con respecto a cada peso, Paso 3: Utilizar los gradientes para actualizar los pesos de la red.]

<br>
+ + +**30. [Forward propagation, Backpropagation, Weights update]** + +⟶ [Propagación hacia adelante, Retropropagación, Actualización de pesos] + +
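The three steps above (forward propagation, backpropagation, weights update) can be sketched end-to-end for a toy logistic-regression model; the data, learning rate and number of steps below are made-up values used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))              # one mini-batch: 32 examples, 5 features
y = rng.integers(0, 2, size=(32, 1))      # binary labels
w, b, alpha = np.zeros((5, 1)), 0.0, 0.1  # weights, bias, learning rate α

for step in range(100):
    # Step 1: forward propagation to compute the loss
    z = 1 / (1 + np.exp(-(X @ w + b)))                       # sigmoid output
    loss = -np.mean(y * np.log(z) + (1 - y) * np.log(1 - z))
    # Step 2: backpropagation (chain rule) to get the gradient w.r.t. each weight
    dz = (z - y) / len(X)
    dw, db = X.T @ dz, dz.sum()
    # Step 3: use the gradients to update the weights of the network
    w -= alpha * dw
    b -= alpha * db

print(round(loss, 4))
```

<br>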
+ + +**31. Parameter tuning** + +⟶ Ajuste de parámetros. + +
+ + +**32. Weights initialization** + +⟶ Inicialización de pesos. + +
**33. Xavier initialization ― Instead of initializing the weights in a purely random manner, Xavier initialization enables to have initial weights that take into account characteristics that are unique to the architecture.**

⟶ Inicialización Xavier ― En lugar de inicializar los pesos de manera puramente aleatoria, la inicialización Xavier permite obtener pesos iniciales que tienen en cuenta características propias de la arquitectura.

<br>
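A possible NumPy sketch of Xavier (Glorot) initialization, here in its uniform variant with limit √(6/(nin+nout)); the layer sizes are arbitrary.

```python
import numpy as np

def xavier_init(n_in, n_out, rng=np.random.default_rng(0)):
    # Glorot/Xavier uniform: variance scaled by the fan-in and fan-out of the layer
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W1 = xavier_init(784, 256)   # e.g. first dense layer of a small classifier
print(W1.std())              # close to sqrt(2 / (784 + 256)) ≈ 0.044
```

<br>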
**34. Transfer learning ― Training a deep learning model requires a lot of data and more importantly a lot of time. It is often useful to take advantage of pre-trained weights on huge datasets that took days/weeks to train, and leverage it towards our use case. Depending on how much data we have at hand, here are the different ways to leverage this:**

⟶ Aprendizaje por transferencia ― Entrenar un modelo de aprendizaje profundo requiere una gran cantidad de datos y, sobre todo, mucho tiempo. A menudo resulta útil aprovechar pesos pre-entrenados en conjuntos de datos enormes cuyo entrenamiento tomó días o semanas, y adaptarlos a nuestro caso de uso. Dependiendo de la cantidad de datos que tengamos a mano, éstas son las diferentes maneras de aprovecharlos:

<br>
+ + +**35. [Training size, Illustration, Explanation]** + +⟶ [Tamaño del entrenamiento, Ilustración, Explicación] + +
+ + +**36. [Small, Medium, Large]** + +⟶ [Pequeño, Mediano, Grande] + +
+ + +**37. [Freezes all layers, trains weights on softmax, Freezes most layers, trains weights on last layers and softmax, Trains weights on layers and softmax by initializing weights on pre-trained ones]** + +⟶ + +
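A possible PyTorch sketch of the three regimes in the table above (small, medium, large training set), assuming torch and torchvision are available; the choice of backbone, head size and which block to unfreeze are illustrative, not prescriptions.

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)        # weights pre-trained on a huge dataset

# Small training set: freeze all layers, train only the new classification head
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 10)  # new head (trainable by default)

# Medium training set: additionally unfreeze the last block and fine-tune it
for param in model.layer4.parameters():
    param.requires_grad = True

# Large training set: fine-tune everything, starting from the pre-trained weights
# for param in model.parameters():
#     param.requires_grad = True
```

<br>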
**38. Optimizing convergence**

⟶ Optimización de la convergencia.

<br>
**39. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. It can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.**

⟶ Tasa de aprendizaje - La tasa de aprendizaje, denotada como α o algunas veces η, indica a qué ritmo los pesos son actualizados. Este valor puede ser fijo o cambiar de forma adaptativa. El método más popular en este momento es llamado Adam, que es un método que adapta la tasa de aprendizaje.

<br>
**40. Adaptive learning rates ― Letting the learning rate vary when training a model can reduce the training time and improve the numerical optimal solution. While Adam optimizer is the most commonly used technique, others can also be useful. They are summed up in the table below:**

⟶ Tasas de aprendizaje adaptativas ― Dejar que la tasa de aprendizaje varíe al entrenar un modelo puede reducir el tiempo de entrenamiento y mejorar la solución numérica óptima. Si bien el optimizador Adam es la técnica más utilizada, otros métodos también pueden ser útiles. Se resumen en la siguiente tabla:

<br>
+ + +**41. [Method, Explanation, Update of w, Update of b]** + +⟶ + +
**42. [Momentum, Dampens oscillations, Improvement to SGD, 2 parameters to tune]**

⟶ [Momentum, Amortigua las oscilaciones, Mejora sobre SGD, 2 parámetros que ajustar]

<br>
**43. [RMSprop, Root Mean Square propagation, Speeds up learning algorithm by controlling oscillations]**

⟶ [RMSprop, Propagación de la raíz cuadrática media (Root Mean Square propagation), Acelera el algoritmo de aprendizaje controlando las oscilaciones]

<br>
+ + +**44. [Adam, Adaptive Moment estimation, Most popular method, 4 parameters to tune]** + +⟶ + +
+ + +**45. Remark: other methods include Adadelta, Adagrad and SGD.** + +⟶ + +
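To make the Adam row of the table concrete, here is a minimal NumPy sketch of one Adam update with the commonly used defaults β1=0.9 and β2=0.999 (assumed values); the toy gradient is only for illustration.

```python
import numpy as np

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moving averages of the gradient (m) and its square (v)."""
    m = beta1 * m + (1 - beta1) * grad          # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, -2.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 4):                           # a few steps on a toy gradient
    grad = 2 * w                                # gradient of ||w||²
    w, m, v = adam_step(w, grad, m, v, t)
print(w)
```

<br>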
+ + +**46. Regularization** + +⟶ + +
**47. Dropout ― Dropout is a technique used in neural networks to prevent overfitting the training data by dropping out neurons with probability p>0. It forces the model to avoid relying too much on particular sets of features.**

⟶ Descarte (dropout) ― El descarte es una técnica utilizada en redes neuronales para evitar el sobreajuste a los datos de entrenamiento, descartando neuronas con probabilidad p>0. Esto obliga al modelo a evitar depender demasiado de conjuntos particulares de características.

<br>
+ + +**48. Remark: most deep learning frameworks parametrize dropout through the 'keep' parameter 1−p.** + +⟶ + +
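A minimal NumPy sketch of "inverted" dropout as described above: units are dropped with probability p during training and the survivors are rescaled by the keep factor 1−p, so the network can be used unchanged at test time; p and the layer size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(a, p=0.3, training=True):
    """Drop each unit with probability p; scale survivors by the 'keep' factor 1-p."""
    if not training:
        return a                            # no dropout at test time
    keep = 1.0 - p
    mask = (rng.random(a.shape) < keep) / keep
    return a * mask

activations = rng.normal(size=(4, 8))       # activations of a hidden layer
print(dropout(activations, p=0.3))
```

<br>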
**49. Weight regularization ― In order to make sure that the weights are not too large and that the model is not overfitting the training set, regularization techniques are usually performed on the model weights. The main ones are summed up in the table below:**

⟶ Regularización de pesos ― Para asegurar que los pesos no sean demasiado grandes y que el modelo no se sobreajuste al conjunto de entrenamiento, usualmente se aplican técnicas de regularización sobre los pesos del modelo. Las principales se resumen en la siguiente tabla:

<br>
**50. [LASSO, Ridge, Elastic Net, Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]**

⟶ [LASSO, Ridge, Elastic Net, Reduce los coeficientes a 0, Bueno para la selección de variables, Hace los coeficientes más pequeños, Compromiso entre selección de variables y coeficientes pequeños]

<br>
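The three penalties of the table, written as terms added to the data loss in a small NumPy sketch; the λ and α values are illustrative assumptions.

```python
import numpy as np

def regularization_penalty(w, kind="ridge", lam=0.01, alpha=0.5):
    if kind == "lasso":        # L1: shrinks some coefficients exactly to 0
        return lam * np.sum(np.abs(w))
    if kind == "ridge":        # L2: makes all coefficients smaller
        return lam * np.sum(w ** 2)
    if kind == "elastic_net":  # tradeoff between the two, weighted by alpha
        return lam * (alpha * np.sum(np.abs(w)) + (1 - alpha) * np.sum(w ** 2))
    raise ValueError(kind)

w = np.array([0.5, -1.2, 3.0])
total_loss = 0.42 + regularization_penalty(w, kind="elastic_net")  # placeholder data loss + penalty
print(total_loss)
```

<br>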
+ +**51. Early stopping ― This regularization technique stops the training process as soon as the validation loss reaches a plateau or starts to increase.** + +⟶ + +
+ + +**52. [Error, Validation, Training, early stopping, Epochs]** + +⟶ + +
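A schematic, patience-based version of the early stopping described in entry 51; train_one_epoch and validation_loss below are hypothetical placeholders standing in for a real training loop and validation pass.

```python
import math

def train_one_epoch(epoch):      # placeholder for one pass over the training set
    pass

def validation_loss(epoch):      # placeholder: pretend the loss plateaus after epoch 12
    return 1.0 / (1 + epoch) + (0.02 * (epoch - 12) if epoch > 12 else 0.0)

best_loss, patience, wait = math.inf, 3, 0
for epoch in range(100):
    train_one_epoch(epoch)
    loss = validation_loss(epoch)
    if loss < best_loss - 1e-4:          # still improving: remember the best model
        best_loss, wait = loss, 0
    else:
        wait += 1                        # plateau or increase on the validation set
        if wait >= patience:
            print(f"early stopping at epoch {epoch}, best val loss {best_loss:.4f}")
            break
```

<br>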
+ + +**53. Good practices** + +⟶ + +
**54. Overfitting small batch ― When debugging a model, it is often useful to make quick tests to see if there is any major issue with the architecture of the model itself. In particular, in order to make sure that the model can be properly trained, a mini-batch is passed inside the network to see if it can overfit on it. If it cannot, it means that the model is either too complex or not complex enough to even overfit on a small batch, let alone a normal-sized training set.**

⟶ Sobreajuste en lotes pequeños ― Al depurar un modelo, a menudo es útil hacer pruebas rápidas para ver si existe algún problema importante con la arquitectura del modelo en sí. En particular, para asegurar que el modelo puede entrenarse correctamente, se pasa un mini-lote por la red para ver si puede sobreajustarse a él. Si no puede, significa que el modelo es demasiado complejo o no lo suficientemente complejo como para sobreajustarse siquiera a un lote pequeño, y mucho menos a un conjunto de entrenamiento de tamaño normal.

<br>
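A sketch of the "overfit one mini-batch" sanity check, reusing the same kind of toy NumPy logistic model as earlier; on this single fixed batch a healthy implementation should drive the loss close to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
Xb = rng.normal(size=(16, 5))              # one fixed mini-batch, reused at every step
yb = (Xb[:, :1] > 0).astype(float)         # toy labels the model is able to fit exactly
w, b = np.zeros((5, 1)), 0.0

for step in range(1000):                   # keep training on this single batch only
    z = 1 / (1 + np.exp(-(Xb @ w + b)))
    grad = (z - yb) / len(Xb)
    w -= 0.5 * Xb.T @ grad
    b -= 0.5 * grad.sum()

z = np.clip(z, 1e-12, 1 - 1e-12)
loss = -np.mean(yb * np.log(z) + (1 - yb) * np.log(1 - z))
print(loss)  # should be close to 0; if it is not, inspect the model/implementation first
```

<br>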
**55. Gradient checking ― Gradient checking is a method used during the implementation of the backward pass of a neural network. It compares the value of the analytical gradient to the numerical gradient at given points and plays the role of a sanity-check for correctness.**

⟶ Comprobación de gradientes ― La comprobación de gradientes es un método utilizado durante la implementación del paso hacia atrás (backward pass) de una red neuronal. Compara el valor del gradiente analítico con el del gradiente numérico en puntos dados y sirve como verificación de que la implementación es correcta.

<br>
+ + +**56. [Type, Numerical gradient, Analytical gradient]** + +⟶ + +
+ + +**57. [Formula, Comments]** + +⟶ + +
**58. [Expensive; loss has to be computed two times per dimension, Used to verify correctness of analytical implementation, Trade-off in choosing h not too small (numerical instability) nor too large (poor gradient approximation)]**

⟶ [Costoso; la pérdida debe calcularse dos veces por dimensión, Se utiliza para verificar la correctitud de la implementación analítica, Compromiso al elegir un h ni demasiado pequeño (inestabilidad numérica) ni demasiado grande (mala aproximación del gradiente)]

<br>
+ + +**59. ['Exact' result, Direct computation, Used in the final implementation]** + +⟶ + +
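A minimal NumPy sketch of gradient checking on a toy loss, comparing the centered numerical difference (f(w+h) − f(w−h)) / (2h) with the analytical gradient; h=1e-5 follows the trade-off noted in entry 58.

```python
import numpy as np

def loss(w):                       # toy loss with a known analytical gradient
    return np.sum(w ** 2) / 2

def analytical_grad(w):            # direct computation, used in the final implementation
    return w

def numerical_grad(f, w, h=1e-5):  # expensive: two loss evaluations per dimension
    grad = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = h
        grad[i] = (f(w + e) - f(w - e)) / (2 * h)
    return grad

w = np.array([1.0, -3.0, 2.5])
num, ana = numerical_grad(loss, w), analytical_grad(w)
relative_error = np.linalg.norm(num - ana) / (np.linalg.norm(num) + np.linalg.norm(ana))
print(relative_error)              # should be tiny (~1e-10) if the backward pass is correct
```

<br>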
**60. The Deep Learning cheatsheets are now available in [target language].**

⟶ Las hojas de referencia de Aprendizaje Profundo ya están disponibles en español.


**61. Original authors**

⟶ Autores originales

<br>
+ +**62.Translated by X, Y and Z** + +⟶ + +
+ +**63.Reviewed by X, Y and Z** + +⟶ + +
+ +**64.View PDF version on GitHub** + +⟶ + +
+ +**65.By X and Y** + +⟶ + +
From dd0216fd43ff2246ba7e759689fa42a61a9b7e4d Mon Sep 17 00:00:00 2001 From: psicobloc Date: Tue, 18 Feb 2020 19:01:42 -0600 Subject: [PATCH 2/5] fixed tipo in cs-229-deep-learning.md --- es/cs-229-deep-learning.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/es/cs-229-deep-learning.md b/es/cs-229-deep-learning.md index 85a2e2563..f2fbca500 100644 --- a/es/cs-229-deep-learning.md +++ b/es/cs-229-deep-learning.md @@ -132,7 +132,7 @@ **23. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.** -⟶ Se realiza usualmente después de una capa completamente conectada/convolucional y antes de una capa no-lineal y su objetivo es permitir velocidades de aprendizaje más altas de aprendizaje y reducir su fuerte dependencia sobre la inicialización. +⟶ Se realiza usualmente después de una capa completamente conectada/convolucional y antes de una capa no-lineal y su objetivo es permitir velocidades de aprendizaje más altas y reducir su fuerte dependencia sobre la inicialización.
From 5cd48ab8cf3a4af57d2b903f33cca0a5360ac28b Mon Sep 17 00:00:00 2001 From: psicobloc Date: Tue, 18 Feb 2020 19:18:17 -0600 Subject: [PATCH 3/5] mostly finished --- es/cs-230-deep-learning-tips-and-tricks.md | 28 +++++++++++----------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/es/cs-230-deep-learning-tips-and-tricks.md b/es/cs-230-deep-learning-tips-and-tricks.md index 2cb1e6dc6..c96747b86 100644 --- a/es/cs-230-deep-learning-tips-and-tricks.md +++ b/es/cs-230-deep-learning-tips-and-tricks.md @@ -165,7 +165,7 @@ **24. Cross-entropy loss ― In the context of binary classification in neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:** -⟶ +⟶Pérdida de entropía cruzada - En el contexto de clasificación binaria con redes neuronales, la pérdida de entropía cruzada L(z,y) es utilizada comúnmente y definida de la siguiente manera:
@@ -179,7 +179,7 @@ **26. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to each weight w is computed using the chain rule.** -⟶ +⟶ Retropropagación - La retropropagación, o propagación inversa, es un método de actualización de los pesos en una red neuronal, teniendo en cuenta la salida actual y la salida esperada. La derivada respecto al peso w es calculada utilizando la regla de la cadena.
@@ -284,7 +284,7 @@ **41. [Method, Explanation, Update of w, Update of b]** -⟶ +⟶[Método, Explicación, Actualización de w, Actualización de b]
@@ -305,21 +305,21 @@ **44. [Adam, Adaptive Moment estimation, Most popular method, 4 parameters to tune]** -⟶ +⟶ [Adam, Estimación adaptativa de momento, Método más popular, 4 parámetros que ajustar]
**45. Remark: other methods include Adadelta, Adagrad and SGD.** -⟶ +⟶ Observación: otros métodos incluyen Adadelta, Adagrad y SGD.
**46. Regularization** -⟶ +⟶ Regularización.
**48. Remark: most deep learning frameworks parametrize dropout through the 'keep' parameter 1−p.**

⟶ Observación: la mayoría de los entornos de trabajo (frameworks) de aprendizaje profundo parametrizan el descarte (dropout) mediante el parámetro 'keep' 1−p.

<br>
@@ -360,14 +360,14 @@ **52. [Error, Validation, Training, early stopping, Epochs]** -⟶ +⟶[Error, Validación, Entrenamiento, Parada temprana, Épocas]
**53. Good practices**

⟶ Buenas prácticas.

<br>
@@ -388,14 +388,14 @@ **56. [Type, Numerical gradient, Analytical gradient]** -⟶ +⟶ [Tipo, Gradiente numérico, Gradiente Analítico]
**57. [Formula, Comments]** -⟶ +⟶ [Fórmula, Comentarios]
@@ -409,7 +409,7 @@ **59. ['Exact' result, Direct computation, Used in the final implementation]** -⟶ +⟶ [Resultado 'exacto', Computación directa, Usado en la implementación final]
@@ -427,7 +427,7 @@ **62.Translated by X, Y and Z** -⟶ +⟶ Traducido por Hugo Valencia Vargas.
**64.View PDF version on GitHub**

⟶ Ver la versión PDF en GitHub.

<br>
From 206c53b328c878a4dba8d35a209661c08e9dc264 Mon Sep 17 00:00:00 2001
From: psicobloc
Date: Sat, 4 Jul 2020 15:27:21 -0500
Subject: [PATCH 4/5] progress

---
 es/cs-230-deep-learning-tips-and-tricks.md | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/es/cs-230-deep-learning-tips-and-tricks.md b/es/cs-230-deep-learning-tips-and-tricks.md
index c96747b86..876722d2a 100644
--- a/es/cs-230-deep-learning-tips-and-tricks.md
+++ b/es/cs-230-deep-learning-tips-and-tricks.md
@@ -116,7 +116,7 @@

**17. Batch normalization ― It is a step of hyperparameter γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of that we want to correct to the batch, it is done as follows:**

⟶ Normalización por lotes ― Es un paso del hiperparámetro γ,β que normaliza el lote {xi}. Denotando μB,σ2B la media y la varianza de lo que queremos corregir en el lote, se realiza de la siguiente manera:

<br>
@@ -144,7 +144,7 @@ **21. Epoch ― In the context of training a model, epoch is a term used to refer to one iteration where the model sees the whole training set to update its weights.** -⟶ +⟶ Época ― En el contexto del entrenamiento de un modelo, Época es un término utilizado para referirse a una iteración donde el modelo es expuesto a todo el set de entrenamiento para actualizar sus pesos.
@@ -158,14 +158,14 @@ **23. Loss function ― In order to quantify how a given model performs, the loss function L is usually used to evaluate to what extent the actual outputs y are correctly predicted by the model outputs z.** -⟶ +⟶Función de pérdida ― Para cuantificar cómo se desempeña un modelo dado, es usualmente utilizada la función de pérdida L para evaluar en qué medida las salidas reales y son predichas correctamente por las salidas z del modelo.
**24. Cross-entropy loss ― In the context of binary classification in neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:** -⟶Pérdida de entropía cruzada - En el contexto de clasificación binaria con redes neuronales, la pérdida de entropía cruzada L(z,y) es utilizada comúnmente y definida de la siguiente manera: +⟶Pérdida de entropía cruzada ― En el contexto de clasificación binaria con redes neuronales, la pérdida de entropía cruzada L(z,y) es utilizada comúnmente y está definida de la siguiente manera:
@@ -179,7 +179,7 @@ **26. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to each weight w is computed using the chain rule.** -⟶ Retropropagación - La retropropagación, o propagación inversa, es un método de actualización de los pesos en una red neuronal, teniendo en cuenta la salida actual y la salida esperada. La derivada respecto al peso w es calculada utilizando la regla de la cadena. +⟶ Retropropagación ― La retropropagación, o propagación inversa, es un método de actualización de los pesos en una red neuronal, teniendo en cuenta la salida actual y la salida esperada. La derivada respecto al peso w es calculada utilizando la regla de la cadena.
**37. [Freezes all layers, trains weights on softmax, Freezes most layers, trains weights on last layers and softmax, Trains weights on layers and softmax by initializing weights on pre-trained ones]**

⟶ [Congela todas las capas, entrena pesos en softmax, Congela la mayoría de las capas, entrena pesos en las últimas capas y softmax, Entrena los pesos en capas y softmax inicializando los pesos con otros pre-entrenados]

<br>
**39. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. It can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.**

⟶ Tasa de aprendizaje ― La tasa de aprendizaje, denotada como α o algunas veces η, indica a qué ritmo los pesos son actualizados. Este valor puede ser fijo o cambiar de forma adaptativa. El método más popular en este momento es llamado Adam, que es un método que adapta la tasa de aprendizaje.

<br>
@@ -353,7 +353,7 @@ **51. Early stopping ― This regularization technique stops the training process as soon as the validation loss reaches a plateau or starts to increase.** -⟶ +⟶Terminación temprana ―
From ffa300f38e7e0dd21e34adc37d91705dde7d93bb Mon Sep 17 00:00:00 2001
From: psicobloc
Date: Sat, 4 Jul 2020 16:05:44 -0500
Subject: [PATCH 5/5] progress

---
 es/cs-230-deep-learning-tips-and-tricks.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/es/cs-230-deep-learning-tips-and-tricks.md b/es/cs-230-deep-learning-tips-and-tricks.md
index 876722d2a..df8e75ff0 100644
--- a/es/cs-230-deep-learning-tips-and-tricks.md
+++ b/es/cs-230-deep-learning-tips-and-tricks.md
@@ -353,7 +353,7 @@

**51. Early stopping ― This regularization technique stops the training process as soon as the validation loss reaches a plateau or starts to increase.**

⟶Terminación temprana ― Esta técnica de regularización detiene el proceso de entrenamiento en cuanto la pérdida de validación se estanca en una meseta o comienza a aumentar.