diff --git a/README.md b/README.md index 469ad179..e49e7446 100644 --- a/README.md +++ b/README.md @@ -12,7 +12,7 @@ PyTorch Tabular aims to make Deep Learning with Tabular data easy and accessible to real-world cases and research alike. The core principles behind the design of the library are: -- Low Resistance Useability +- Low Resistance Usability - Easy Customization - Scalable and Easier to Deploy diff --git a/docs/models.md b/docs/models.md index 48f65763..83c800d0 100644 --- a/docs/models.md +++ b/docs/models.md @@ -27,7 +27,7 @@ While there are separate config classes for each model, all of them share a few - `learning_rate`: float: The learning rate of the model. Defaults to 1e-3. -- `loss`: Optional\[str\]: The loss function to be applied. By Default it is MSELoss for regression and CrossEntropyLoss for classification. Unless you are sure what you are doing, leave it at MSELoss or L1Loss for regression and CrossEntropyLoss for classification +- `loss`: Optional\[str\]: The loss function to be applied. By Default, it is MSELoss for regression and CrossEntropyLoss for classification. Unless you are sure what you are doing, leave it at MSELoss or L1Loss for regression and CrossEntropyLoss for classification - `metrics`: Optional\[List\[str\]\]: The list of metrics you need to track during training. The metrics should be one of the functional metrics implemented in `torchmetrics`. By default, it is `accuracy` if classification and `mean_squared_error` for regression @@ -55,13 +55,13 @@ That's it, Thats the most basic necessity. All the rest is intelligently inferre Adam Optimizer and the `learning_rate` of 1e-3 is a default that is set in PyTorch Tabular. It's a rule of thumb that works in most cases and a good starting point which has worked well empirically. If you want to change the learning rate(which is a pretty important hyperparameter), this is where you should. There is also an automatic way to derive a good learning rate which we will talk about in the TrainerConfig. In that case, Pytorch Tabular will ignore the learning rate set through this parameter -Another key component of the model is the `loss`. Pytorch Tabular can use any loss function from standard PyTorch([`torch.nn`](https://pytorch.org/docs/stable/nn.html#loss-functions)) through this config. By default it is set to `MSELoss` for regression and `CrossEntropyLoss` for classification, which works well for those use cases and are the most popular loss functions used. If you want to use something else specficaly, like `L1Loss`, you just need to mention it in the `loss` parameter +Another key component of the model is the `loss`. Pytorch Tabular can use any loss function from standard PyTorch([`torch.nn`](https://pytorch.org/docs/stable/nn.html#loss-functions)) through this config. By default, it is set to `MSELoss` for regression and `CrossEntropyLoss` for classification, which works well for those use cases and are the most popular loss functions used. If you want to use something else specficaly, like `L1Loss`, you just need to mention it in the `loss` parameter ```python loss = "L1Loss ``` -PyTorch Tabular also accepts custom loss functions(which are drop in replacements for the standard loss functions) through the `fit` method in the `TabularModel`. +PyTorch Tabular also accepts custom loss functions (which are drop in replacements for the standard loss functions) through the `fit` method in the `TabularModel`. !!! warning @@ -113,7 +113,7 @@ All the parameters have intelligent default values. 
Let's look at few of them:
- `use_batch_norm`: bool: Flag to include a BatchNorm layer after each Linear Layer+DropOut. Defaults to `False`
- `dropout`: float: The probability of the element to be zeroed. This applies to all the linear layers. Defaults to `0.0`
-**For a complete list of parameters refer to the API Docs**
+**For a complete list of parameters refer to the API Docs** [pytorch_tabular.models.CategoryEmbeddingModelConfig][]
### Gated Adaptive Network for Deep Automated Learning of Features (GANDALF)
@@ -141,7 +141,7 @@ All the parameters have beet set to recommended values from the paper. Let's loo
GANDALF can be considered as a more light and more performant Gated Additive Tree Ensemble (GATE). For most purposes, GANDALF is a better choice than GATE.
-**For a complete list of parameters refer to the API Docs**
+**For a complete list of parameters refer to the API Docs** [pytorch_tabular.models.GANDALFConfig][]
@@ -165,14 +165,14 @@ All the parameters have beet set to recommended values from the paper. Let's loo
- `share_head_weights`: bool: If True, we will share the weights between the heads. Defaults to True
-**For a complete list of parameters refer to the API Docs**
+**For a complete list of parameters refer to the API Docs** [pytorch_tabular.models.GatedAdditiveTreeEnsembleConfig][]
### Neural Oblivious Decision Ensembles (NODE)
-[Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data](https://arxiv.org/abs/1909.06312) is a model presented in ICLR 2020 and according to the authors have beaten well-tuned Gradient Boosting models on many datasets. It uses a Neural equivalent of Oblivious Trees(the kind of trees Catboost uses) as the basic building blocks of the architecture. You can use it by choosing `NodeConfig`.
+[Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data](https://arxiv.org/abs/1909.06312) is a model presented at ICLR 2020 and, according to the authors, has beaten well-tuned Gradient Boosting models on many datasets. It uses a Neural equivalent of Oblivious Trees (the kind of trees Catboost uses) as the basic building blocks of the architecture. You can use it by choosing `NodeConfig`.
-The basic block, or a "layer" looks something like below(from the paper)
+The basic block, or a "layer", looks something like below (from the paper)
![NODE Architecture](imgs/node_arch.png)
@@ -185,24 +185,24 @@ All the parameters have beet set to recommended values from the paper. Let's loo
- `num_layers`: int: Number of Oblivious Decision Tree Layers in the Dense Architecture. Defaults to `1`
- `num_trees`: int: Number of Oblivious Decision Trees in each layer. Defaults to `2048`
- `depth`: int: The depth of the individual Oblivious Decision Trees. Parameters increase exponentially with the increase in depth. Defaults to `6`
-- `choice_function`: str: Generates a sparse probability distribution to be used as feature weights(aka, soft feature selection). Choices are: `entmax15` `sparsemax`. Defaults to `entmax15`
-- `bin_function`: str: Generates a sparse probability distribution to be used as tree leaf weights. Choices are: `entmax15` `sparsemax`. Defaults to `entmax15`
+- `choice_function`: str: Generates a sparse probability distribution to be used as feature weights (aka, soft feature selection). Choices are: `entmax15` `sparsemax`. Defaults to `entmax15`
+- `bin_function`: str: Generates a sparse probability distribution to be used as tree leaf weights. Choices are: `entmoid15` `sparsemoid`.
Defaults to `entmoid15`
- `additional_tree_output_dim`: int: The additional output dimensions which is only used to pass through different layers of the architectures. Only the first output_dim outputs will be used for prediction. Defaults to `3`
- `input_dropout`: float: Dropout which is applied to the input to the different layers in the Dense Architecture. The probability of the element to be zeroed. Defaults to `0.0`
-**For a complete list of parameters refer to the API Docs**
+**For a complete list of parameters refer to the API Docs** [pytorch_tabular.models.NodeConfig][]
!!! note
- NODE model has a lot of parameters and therefore takes up a lot of memory. Smaller batchsizes(like 64 or 128) makes the model manageable in a smaller GPU(~4GB).
+ NODE model has a lot of parameters and therefore takes up a lot of memory. Smaller batchsizes (like 64 or 128) make the model manageable in a smaller GPU (~4GB).
### TabNet
- [TabNet: Attentive Interpretable Tabular Learning](https://arxiv.org/abs/1908.07442) is another model coming out of Google Research which uses Sparse Attention in multiple steps of decision making to model the output. You can use it by choosing `TabNetModelConfig`.
-The architecture is as shown below(from the paper)
+The architecture is as shown below (from the paper)
![TabNet Architecture](imgs/tabnet_architecture.png)
@@ -210,12 +210,12 @@ All the parameters have beet set to recommended values from the paper. Let's loo
- `n_d`: int: Dimension of the prediction layer (usually between 4 and 64). Defaults to `8`
- `n_a`: int: Dimension of the attention layer (usually between 4 and 64). Defaults to `8`
-- `n_steps`: int: Number of sucessive steps in the newtork (usually betwenn 3 and 10). Defaults to `3`
+- `n_steps`: int: Number of successive steps in the network (usually between 3 and 10). Defaults to `3`
- `n_independent`: int: Number of independent GLU layer in each GLU block. Defaults to `2`
- `n_shared`: int: Number of independent GLU layer in each GLU block. Defaults to `2`
- `virtual_batch_size`: int: Batch size for Ghost Batch Normalization. BatchNorm on large batches sometimes does not do very well and therefore Ghost Batch Normalization which does batch normalization in smaller virtual batches is implemented in TabNet. Defaults to `128`
-**For a complete list of parameters refer to the API Docs**
+**For a complete list of parameters refer to the API Docs** [pytorch_tabular.models.TabNetModelConfig][]
### Automatic Feature Interaction Learning via Self-Attentive Neural Networks(AutoInt)
@@ -228,9 +228,9 @@ All the parameters have beet set to recommended values from the paper. Let's loo
- `num_heads`: int: The number of heads in the Multi-Headed Attention layer. Defaults to 2
-- `num_attn_blocks`: int: The number of layers of stacked Multi-Headed Attention layers. Defaults to 2
+- `num_attn_blocks`: int: The number of layers of stacked Multi-Headed Attention layers. Defaults to 3
-**For a complete list of parameters refer to the API Docs**
+**For a complete list of parameters refer to the API Docs** [pytorch_tabular.models.AutoIntConfig][]
### DANETs: Deep Abstract Networks for Tabular Data Classification and Regression
@@ -239,18 +239,18 @@ All the parameters have beet set to recommended values from the paper. Let's loo
-- `n_layers`: int: Number of Blocks in the DANet. Defaults to 16
+- `n_layers`: int: Number of Blocks in the DANet. Each block has 2 Abstlay Blocks.
Defaults to 8 - `abstlay_dim_1`: int: The dimension for the intermediate output in the first ABSTLAY layer in a Block. Defaults to 32 -- `abstlay_dim_2`: int: The dimension for the intermediate output in the second ABSTLAY layer in a Block. Defaults to 64 +- `abstlay_dim_2`: int: The dimension for the intermediate output in the second ABSTLAY layer in a Block. If None, it will be twice abstlay_dim_1. Defaults to None - `k`: int: The number of feature groups in the ABSTLAY layer. Defaults to 5 - `dropout_rate`: float: Dropout to be applied in the Block. Defaults to 0.1 -**For a complete list of parameters refer to the API Docs** +**For a complete list of parameters refer to the API Docs** [pytorch_tabular.models.DANetConfig][] ## Implementing New Architectures @@ -308,7 +308,7 @@ In addition to the model, you will also need to define a config. Configs are pyt **Key things to note:** -1. All the different parameters in the different configs(like TrainerConfig, OptimizerConfig, etc) are all available in `config` before calling `super()` and in `self.hparams` after. +1. All the different parameters in the different configs (like TrainerConfig, OptimizerConfig, etc) are all available in `config` before calling `super()` and in `self.hparams` after. 1. the input batch at the `forward` method is a dictionary with keys `continuous` and `categorical` 1. In the `\_build_network` method, save every component that you want access in the `forward` to `self` diff --git a/src/pytorch_tabular/config/config.py b/src/pytorch_tabular/config/config.py index a5515309..f7160dc3 100644 --- a/src/pytorch_tabular/config/config.py +++ b/src/pytorch_tabular/config/config.py @@ -68,31 +68,31 @@ class DataConfig: introduction_date and with a monthly frequency like "2023-12" should have an entry ('intro_date','M','%Y-%m') - encode_date_columns (bool): Whether or not to encode the derived variables from date + encode_date_columns (bool): Whether to encode the derived variables from date validation_split (Optional[float]): Percentage of Training rows to keep aside as validation. Used only if Validation Data is not given separately - continuous_feature_transform (Optional[str]): Whether or not to transform the features before - modelling. By default it is turned off.. Choices are: [`None`,`yeo-johnson`,`box- - cox`,`quantile_normal`,`quantile_uniform`]. + continuous_feature_transform (Optional[str]): Whether to transform the features before + modelling. By default, it is turned off. Choices are: [`None`,`yeo-johnson`,`box-cox`, + `quantile_normal`,`quantile_uniform`]. normalize_continuous_features (bool): Flag to normalize the input features(continuous) quantile_noise (int): NOT IMPLEMENTED. If specified fits QuantileTransformer on data with added gaussian noise with std = :quantile_noise: * data.std ; this will cause discrete values to be more - separable. Please not that this transformation does NOT apply gaussian noise to the resulting + separable. Please note that this transformation does NOT apply gaussian noise to the resulting data, the noise is only applied for QuantileTransformer num_workers (Optional[int]): The number of workers used for data loading. For windows always set to 0 - pin_memory (bool): Whether or not to pin memory for data loading. + pin_memory (bool): Whether to pin memory for data loading. 
- handle_unknown_categories (bool): Whether or not to handle unknown or new values in categorical + handle_unknown_categories (bool): Whether to handle unknown or new values in categorical columns as unknown - handle_missing_values (bool): Whether or not to handle missing values in categorical columns as + handle_missing_values (bool): Whether to handle missing values in categorical columns as unknown """ @@ -146,7 +146,7 @@ class DataConfig: ) normalize_continuous_features: bool = field( default=True, - metadata={"help": "Flag to normalize the input features(continuous)"}, + metadata={"help": "Flag to normalize the input features (continuous)"}, ) quantile_noise: int = field( default=0, @@ -264,7 +264,7 @@ class TrainerConfig: Choices are: [`cpu`,`gpu`,`tpu`,`ipu`,'mps',`auto`]. devices (Optional[int]): Number of devices to train on (int). -1 uses all available devices. By - default uses all available devices (-1) + default, uses all available devices (-1) devices_list (Optional[List[int]]): List of devices to train on (list). If specified, takes precedence over `devices` argument. Defaults to None @@ -563,7 +563,7 @@ class ExperimentConfig: this defines the folder under which the logs will be saved and for W&B it defines the project name run_name (Optional[str]): The name of the run; a specific identifier to recognize the run. If left - blank, will be assigned a auto-generated name + blank, will be assigned an auto-generated name exp_watch (Optional[str]): The level of logging required. Can be `gradients`, `parameters`, `all` or `None`. Defaults to None. Choices are: [`gradients`,`parameters`,`all`,`None`]. @@ -695,7 +695,7 @@ def __init__( exp_version_manager: str = ".pt_tmp/exp_version_manager.yml", ) -> None: """The manages the versions of the experiments based on the name. It is a simple dictionary(yaml) based lookup. - Primary purpose is to avoid overwriting of saved models while runing the training without changing the + Primary purpose is to avoid overwriting of saved models while running the training without changing the experiment name. Args: @@ -752,7 +752,7 @@ class ModelConfig: learning_rate (float): The learning rate of the model. Defaults to 1e-3. - loss (Optional[str]): The loss function to be applied. By Default it is MSELoss for regression and + loss (Optional[str]): The loss function to be applied. By Default, it is MSELoss for regression and CrossEntropyLoss for classification. Unless you are sure what you are doing, leave it at MSELoss or L1Loss for regression and CrossEntropyLoss for classification diff --git a/src/pytorch_tabular/models/autoint/config.py b/src/pytorch_tabular/models/autoint/config.py index 93edde5b..2e8720f4 100644 --- a/src/pytorch_tabular/models/autoint/config.py +++ b/src/pytorch_tabular/models/autoint/config.py @@ -10,18 +10,19 @@ @dataclass class AutoIntConfig(ModelConfig): - """AutomaticFeatureInteraction configuration + """AutomaticFeatureInteraction configuration. + Args: attn_embed_dim (int): The number of hidden units in the Multi-Headed Attention layers. Defaults to 32 num_heads (int): The number of heads in the Multi-Headed Attention layer. Defaults to 2 - num_attn_blocks (int): The number of layers of stacked Multi-Headed Attention layers. Defaults to 2 + num_attn_blocks (int): The number of layers of stacked Multi-Headed Attention layers. Defaults to 3 attn_dropouts (float): Dropout between layers of Multi-Headed Attention Layers. 
Defaults to 0.0 - has_residuals (bool): Flag to have a residual connect from enbedded output to attention layer + has_residuals (bool): Flag to have a residual connect from embedded output to attention layer output. Defaults to True embedding_dim (int): The dimensions of the embedding for continuous and categorical columns. @@ -40,7 +41,7 @@ class AutoIntConfig(ModelConfig): share_embedding_strategy (Optional[str]): There are two strategies in adding shared embeddings. 1. `add` - A separate embedding for the feature is added to the embedding of the unique values of the feature. 2. `fraction` - A fraction of the input embedding is reserved for the shared embedding of - the feature. Defaults to fraction.. Choices are: [`add`,`fraction`]. + the feature. Defaults to fraction. Choices are: [`add`,`fraction`]. shared_embedding_fraction (float): Fraction of the input_embed_dim to be reserved by the shared embedding. Should be less than one. Defaults to 0.25 @@ -50,9 +51,10 @@ class AutoIntConfig(ModelConfig): layers (str): Hyphen-separated number of layers and units in the deep MLP. Defaults to 128-64-32 - activation (str): The activation type in the deep MLP. The default activaion in PyTorch like ReLU, - TanH, LeakyReLU, etc. https://pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum- - nonlinearity. Defaults to ReLU + activation (str): The activation type in the deep MLP. The default activation in PyTorch like ReLU, + TanH, LeakyReLU, etc. + https://pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity. + Defaults to ReLU use_batch_norm (bool): Flag to include a BatchNorm layer after each Linear Layer+DropOut in the deep MLP. Defaults to False @@ -60,15 +62,14 @@ class AutoIntConfig(ModelConfig): initialization (str): Initialization scheme for the linear layers in the deep MLP. Defaults to `kaiming`. Choices are: [`kaiming`,`xavier`,`random`]. - dropout (float): probability of an element to be zeroed in the deep MLP. Defaults to 0.0 + dropout (float): Probability of an element to be zeroed in the deep MLP. Defaults to 0.0 attention_pooling (bool): If True, will combine the attention outputs of each block for final prediction. Defaults to False - task (str): Specify whether the problem is regression or classification. `backbone` is a task which considers the model as a backbone to generate features. Mostly used internally for SSL and related - tasks.. Choices are: [`regression`,`classification`,`backbone`]. + tasks. Choices are: [`regression`,`classification`,`backbone`]. head (Optional[str]): The head to be used for the model. Should be one of the heads defined in `pytorch_tabular.models.common.heads`. Defaults to LinearHead. Choices are: @@ -81,14 +82,14 @@ class AutoIntConfig(ModelConfig): list of tuples (cardinality, embedding_dim). If left empty, will infer using the cardinality of the categorical column using the rule min(50, (x + 1) // 2) - embedding_dropout (float): Dropout to be applied to the Categorical Embedding. Defaults to 0.1 + embedding_dropout (float): Dropout to be applied to the Categorical Embedding. Defaults to 0.0 batch_norm_continuous_input (bool): If True, we will normalize the continuous layer by passing it through a BatchNorm layer. learning_rate (float): The learning rate of the model. Defaults to 1e-3. - loss (Optional[str]): The loss function to be applied. By Default it is MSELoss for regression and + loss (Optional[str]): The loss function to be applied. 
By Default, it is MSELoss for regression and CrossEntropyLoss for classification. Unless you are sure what you are doing, leave it at MSELoss or L1Loss for regression and CrossEntropyLoss for classification @@ -119,7 +120,7 @@ class AutoIntConfig(ModelConfig): ) num_attn_blocks: int = field( default=3, - metadata={"help": "The number of layers of stacked Multi-Headed Attention layers. Defaults to 2"}, + metadata={"help": "The number of layers of stacked Multi-Headed Attention layers. Defaults to 3"}, ) attn_dropouts: float = field( default=0.0, @@ -184,7 +185,7 @@ class AutoIntConfig(ModelConfig): activation: str = field( default="ReLU", metadata={ - "help": "The activation type in the deep MLP. The default activaion in PyTorch" + "help": "The activation type in the deep MLP. The default activation in PyTorch" " like ReLU, TanH, LeakyReLU, etc." " https://pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity." " Defaults to ReLU" @@ -206,7 +207,7 @@ class AutoIntConfig(ModelConfig): ) dropout: float = field( default=0.0, - metadata={"help": "probability of an element to be zeroed in the deep MLP. Defaults to 0.0"}, + metadata={"help": "Probability of an element to be zeroed in the deep MLP. Defaults to 0.0"}, ) attention_pooling: bool = field( default=False, diff --git a/src/pytorch_tabular/models/category_embedding/config.py b/src/pytorch_tabular/models/category_embedding/config.py index 4f43406f..ee3b6984 100644 --- a/src/pytorch_tabular/models/category_embedding/config.py +++ b/src/pytorch_tabular/models/category_embedding/config.py @@ -9,12 +9,13 @@ @dataclass class CategoryEmbeddingModelConfig(ModelConfig): - """CategoryEmbeddingModel configuration + """CategoryEmbeddingModel configuration. + Args: - layers (str): DEPRECATED: Hyphen-separated number of layers and units in the classification head. eg. 32-64-32. + layers (str): DEPRECATED: Hyphen-separated number of layers and units in the classification head. E.g. 32-64-32. Defaults to 128-64-32 - activation (str): DEPRECATED: The activation type in the classification head. The default activaion in PyTorch + activation (str): DEPRECATED: The activation type in the classification head. The default activation in PyTorch like ReLU, TanH, LeakyReLU, etc. https://pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity. Defaults to ReLU @@ -24,13 +25,13 @@ class CategoryEmbeddingModelConfig(ModelConfig): initialization (str): DEPRECATED: Initialization scheme for the linear layers. Defaults to `kaiming`. Choices are: [`kaiming`,`xavier`,`random`]. - dropout (float): DEPRECATED: probability of an classification element to be zeroed. This is added to each + dropout (float): DEPRECATED: probability of a classification element to be zeroed. This is added to each linear layer. Defaults to 0.0 task (str): Specify whether the problem is regression or classification. `backbone` is a task which considers the model as a backbone to generate features. Mostly used internally for SSL and related - tasks.. Choices are: [`regression`,`classification`,`backbone`]. + tasks. Choices are: [`regression`,`classification`,`backbone`]. head (Optional[str]): The head to be used for the model. Should be one of the heads defined in `pytorch_tabular.models.common.heads`. Defaults to LinearHead. Choices are: @@ -43,14 +44,14 @@ class CategoryEmbeddingModelConfig(ModelConfig): list of tuples (cardinality, embedding_dim). 
If left empty, will infer using the cardinality of the categorical column using the rule min(50, (x + 1) // 2) - embedding_dropout (float): Dropout to be applied to the Categorical Embedding. Defaults to 0.1 + embedding_dropout (float): Dropout to be applied to the Categorical Embedding. Defaults to 0.0 batch_norm_continuous_input (bool): If True, we will normalize the continuous layer by passing it through a BatchNorm layer. learning_rate (float): The learning rate of the model. Defaults to 1e-3. - loss (Optional[str]): The loss function to be applied. By Default it is MSELoss for regression and + loss (Optional[str]): The loss function to be applied. By Default, it is MSELoss for regression and CrossEntropyLoss for classification. Unless you are sure what you are doing, leave it at MSELoss or L1Loss for regression and CrossEntropyLoss for classification @@ -87,7 +88,7 @@ class CategoryEmbeddingModelConfig(ModelConfig): metadata={ "help": ( "The activation type in the classification head. The default" - " activaion in PyTorch like ReLU, TanH, LeakyReLU, etc." + " activation in PyTorch like ReLU, TanH, LeakyReLU, etc." " https://pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity." " Defaults to ReLU" ) diff --git a/src/pytorch_tabular/models/common/heads/config.py b/src/pytorch_tabular/models/common/heads/config.py index 50e75df5..b605af4d 100644 --- a/src/pytorch_tabular/models/common/heads/config.py +++ b/src/pytorch_tabular/models/common/heads/config.py @@ -11,13 +11,12 @@ class LinearHeadConfig: Args: layers (str): Hyphen-separated number of layers and units in the classification/regression head. - eg. 32-64-32. Default is just a mapping from intput dimension to output dimension + E.g. 32-64-32. Default is just a mapping from intput dimension to output dimension - activation (str): The activation type in the classification head. The default activaion in PyTorch - like ReLU, TanH, LeakyReLU, etc. https://pytorch.org/docs/stable/nn.html#non-linear-activations- - weighted-sum-nonlinearity + activation (str): The activation type in the classification head. The default activation in PyTorch + like ReLU, TanH, LeakyReLU, etc. https://pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity - dropout (float): probability of an classification element to be zeroed. + dropout (float): probability of a classification element to be zeroed. use_batch_norm (bool): Flag to include a BatchNorm layer after each Linear Layer+DropOut @@ -35,7 +34,7 @@ class LinearHeadConfig: activation: str = field( default="ReLU", metadata={ - "help": "The activation type in the classification head. The default activaion in PyTorch" + "help": "The activation type in the classification head. The default activation in PyTorch" " like ReLU, TanH, LeakyReLU, etc." " https://pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity" }, diff --git a/src/pytorch_tabular/models/danet/config.py b/src/pytorch_tabular/models/danet/config.py index 91af665c..f2a71619 100644 --- a/src/pytorch_tabular/models/danet/config.py +++ b/src/pytorch_tabular/models/danet/config.py @@ -10,7 +10,8 @@ @dataclass class DANetConfig(ModelConfig): - """DANet configuration + """DANet configuration. + Args: n_layers (int): Number of Blocks in the DANet. 8, 20, 32 are configurations the paper evaluated. Defaults to 8 @@ -27,7 +28,7 @@ class DANetConfig(ModelConfig): task (str): Specify whether the problem is regression or classification. 
`backbone` is a task which considers the model as a backbone to generate features. Mostly used internally for SSL and related - tasks.. Choices are: [`regression`,`classification`,`backbone`]. + tasks. Choices are: [`regression`,`classification`,`backbone`]. head (Optional[str]): The head to be used for the model. Should be one of the heads defined in `pytorch_tabular.models.common.heads`. Defaults to LinearHead. Choices are: @@ -40,14 +41,14 @@ class DANetConfig(ModelConfig): list of tuples (cardinality, embedding_dim). If left empty, will infer using the cardinality of the categorical column using the rule min(50, (x + 1) // 2) - embedding_dropout (float): Dropout to be applied to the Categorical Embedding. Defaults to 0.1 + embedding_dropout (float): Dropout to be applied to the Categorical Embedding. Defaults to 0.0 batch_norm_continuous_input (bool): If True, we will normalize the continuous layer by passing it through a BatchNorm layer. learning_rate (float): The learning rate of the model. Defaults to 1e-3. - loss (Optional[str]): The loss function to be applied. By Default it is MSELoss for regression and + loss (Optional[str]): The loss function to be applied. By Default, it is MSELoss for regression and CrossEntropyLoss for classification. Unless you are sure what you are doing, leave it at MSELoss or L1Loss for regression and CrossEntropyLoss for classification @@ -85,7 +86,7 @@ class DANetConfig(ModelConfig): abstlay_dim_2: Optional[int] = field( default=None, metadata={ - "help": "The dimension for the intermediate output in the second ABSTLAY layer in a Block. " + "help": "The dimension for the intermediate output in the second ABSTLAY layer in a Block." "If None, it will be twice abstlay_dim_1. Defaults to None" }, ) diff --git a/src/pytorch_tabular/models/ft_transformer/config.py b/src/pytorch_tabular/models/ft_transformer/config.py index 6bb25eb9..5afa1c21 100644 --- a/src/pytorch_tabular/models/ft_transformer/config.py +++ b/src/pytorch_tabular/models/ft_transformer/config.py @@ -10,7 +10,8 @@ @dataclass class FTTransformerConfig(ModelConfig): - """Tab Transformer configuration + """Tab Transformer configuration. + Args: input_embed_dim (int): The embedding dimension for the input categorical features. Defaults to 32 @@ -27,7 +28,7 @@ class FTTransformerConfig(ModelConfig): share_embedding_strategy (Optional[str]): There are two strategies in adding shared embeddings. 1. `add` - A separate embedding for the feature is added to the embedding of the unique values of the feature. 2. `fraction` - A fraction of the input embedding is reserved for the shared embedding of - the feature. Defaults to fraction.. Choices are: [`add`,`fraction`]. + the feature. Defaults to fraction. Choices are: [`add`,`fraction`]. shared_embedding_fraction (float): Fraction of the input_embed_dim to be reserved by the shared embedding. Should be less than one. Defaults to 0.25 @@ -71,14 +72,14 @@ class FTTransformerConfig(ModelConfig): list of tuples (cardinality, embedding_dim). If left empty, will infer using the cardinality of the categorical column using the rule min(50, (x + 1) // 2) - embedding_dropout (float): Dropout to be applied to the Categorical Embedding. Defaults to 0.1 + embedding_dropout (float): Dropout to be applied to the Categorical Embedding. Defaults to 0.0 batch_norm_continuous_input (bool): If True, we will normalize the continuous layer by passing it through a BatchNorm layer. learning_rate (float): The learning rate of the model. Defaults to 1e-3. 
- loss (Optional[str]): The loss function to be applied. By Default it is MSELoss for regression and + loss (Optional[str]): The loss function to be applied. By Default, it is MSELoss for regression and CrossEntropyLoss for classification. Unless you are sure what you are doing, leave it at MSELoss or L1Loss for regression and CrossEntropyLoss for classification diff --git a/src/pytorch_tabular/models/gandalf/config.py b/src/pytorch_tabular/models/gandalf/config.py index 45c8921b..c6f596fc 100644 --- a/src/pytorch_tabular/models/gandalf/config.py +++ b/src/pytorch_tabular/models/gandalf/config.py @@ -16,7 +16,7 @@ class GANDALFConfig(ModelConfig): gflu_dropout (float): Dropout rate for the feature abstraction layer. Defaults to 0.0 - gflu_feature_init_sparsity (float): Only valid for t-softmax. The perecentge of features + gflu_feature_init_sparsity (float): Only valid for t-softmax. The percentage of features to be selected in each GFLU stage. This is just initialized and during learning it may change. Defaults to 0.3 @@ -27,7 +27,7 @@ class GANDALFConfig(ModelConfig): task (str): Specify whether the problem is regression or classification. `backbone` is a task which considers the model as a backbone to generate features. Mostly used internally for SSL and related - tasks.. Choices are: [`regression`,`classification`,`backbone`]. + tasks. Choices are: [`regression`,`classification`,`backbone`]. head (Optional[str]): The head to be used for the model. Should be one of the heads defined in `pytorch_tabular.models.common.heads`. Defaults to LinearHead. Choices are: @@ -40,14 +40,14 @@ class GANDALFConfig(ModelConfig): list of tuples (cardinality, embedding_dim). If left empty, will infer using the cardinality of the categorical column using the rule min(50, (x + 1) // 2) - embedding_dropout (float): Dropout to be applied to the Categorical Embedding. Defaults to 0.1 + embedding_dropout (float): Dropout to be applied to the Categorical Embedding. Defaults to 0.0 batch_norm_continuous_input (bool): If True, we will normalize the continuous layer by passing it through a BatchNorm layer. learning_rate (float): The learning rate of the model. Defaults to 1e-3. - loss (Optional[str]): The loss function to be applied. By Default it is MSELoss for regression and + loss (Optional[str]): The loss function to be applied. By Default, it is MSELoss for regression and CrossEntropyLoss for classification. Unless you are sure what you are doing, leave it at MSELoss or L1Loss for regression and CrossEntropyLoss for classification diff --git a/src/pytorch_tabular/models/gate/config.py b/src/pytorch_tabular/models/gate/config.py index 9177adb1..99d582df 100644 --- a/src/pytorch_tabular/models/gate/config.py +++ b/src/pytorch_tabular/models/gate/config.py @@ -9,7 +9,7 @@ @dataclass class GatedAdditiveTreeEnsembleConfig(ModelConfig): - """Gated Additive Tree Ensemble Config. + """Gated Additive Tree Ensemble configuration. Args: gflu_stages (int): Number of layers in the feature abstraction layer. Defaults to 6 @@ -42,7 +42,7 @@ class GatedAdditiveTreeEnsembleConfig(ModelConfig): task (str): Specify whether the problem is regression or classification. `backbone` is a task which considers the model as a backbone to generate features. Mostly used internally for SSL and related - tasks.. Choices are: [`regression`,`classification`,`backbone`]. + tasks. Choices are: [`regression`,`classification`,`backbone`]. head (Optional[str]): The head to be used for the model. 
Should be one of the heads defined in `pytorch_tabular.models.common.heads`. Defaults to LinearHead. Choices are: @@ -55,14 +55,14 @@ class GatedAdditiveTreeEnsembleConfig(ModelConfig): list of tuples (cardinality, embedding_dim). If left empty, will infer using the cardinality of the categorical column using the rule min(50, (x + 1) // 2) - embedding_dropout (float): Dropout to be applied to the Categorical Embedding. Defaults to 0.1 + embedding_dropout (float): Dropout to be applied to the Categorical Embedding. Defaults to 0.0 batch_norm_continuous_input (bool): If True, we will normalize the continuous layer by passing it through a BatchNorm layer. learning_rate (float): The learning rate of the model. Defaults to 1e-3. - loss (Optional[str]): The loss function to be applied. By Default it is MSELoss for regression and + loss (Optional[str]): The loss function to be applied. By Default, it is MSELoss for regression and CrossEntropyLoss for classification. Unless you are sure what you are doing, leave it at MSELoss or L1Loss for regression and CrossEntropyLoss for classification diff --git a/src/pytorch_tabular/models/mixture_density/config.py b/src/pytorch_tabular/models/mixture_density/config.py index c58b221a..50ca1958 100644 --- a/src/pytorch_tabular/models/mixture_density/config.py +++ b/src/pytorch_tabular/models/mixture_density/config.py @@ -12,7 +12,8 @@ @dataclass class MDNConfig(ModelConfig): - """MDN configuration + """MDN configuration. + Args: backbone_config_class (str): The config class for defining the Backbone. The config class should be a valid module path from `models`. e.g. `FTTransformerConfig` @@ -22,7 +23,7 @@ class MDNConfig(ModelConfig): task (str): Specify whether the problem is regression or classification. `backbone` is a task which considers the model as a backbone to generate features. Mostly used internally for SSL and related - tasks.. Choices are: [`regression`,`classification`,`backbone`]. + tasks. Choices are: [`regression`,`classification`,`backbone`]. head (str): @@ -32,14 +33,14 @@ class MDNConfig(ModelConfig): list of tuples (cardinality, embedding_dim). If left empty, will infer using the cardinality of the categorical column using the rule min(50, (x + 1) // 2) - embedding_dropout (float): Dropout to be applied to the Categorical Embedding. Defaults to 0.1 + embedding_dropout (float): Dropout to be applied to the Categorical Embedding. Defaults to 0.0 batch_norm_continuous_input (bool): If True, we will normalize the continuous layer by passing it through a BatchNorm layer. learning_rate (float): The learning rate of the model. Defaults to 1e-3. - loss (Optional[str]): The loss function to be applied. By Default it is MSELoss for regression and + loss (Optional[str]): The loss function to be applied. By Default, it is MSELoss for regression and CrossEntropyLoss for classification. Unless you are sure what you are doing, leave it at MSELoss or L1Loss for regression and CrossEntropyLoss for classification diff --git a/src/pytorch_tabular/models/node/config.py b/src/pytorch_tabular/models/node/config.py index 7d72d562..1a262090 100644 --- a/src/pytorch_tabular/models/node/config.py +++ b/src/pytorch_tabular/models/node/config.py @@ -7,7 +7,8 @@ @dataclass class NodeConfig(ModelConfig): - """Model configuration + """Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data configuration. 
+ Args: num_layers (int): Number of Oblivious Decision Tree Layers in the Dense Architecture @@ -33,7 +34,7 @@ class NodeConfig(ModelConfig): initialize_response (str): Initializing the response variable in the Oblivious Decision Trees. By default, it is a standard normal distribution. Choices are: [`normal`,`uniform`]. - initialize_selection_logits (str): Initializing the feature selector. By default is a uniform + initialize_selection_logits (str): Initializing the feature selector. By default, is a uniform distribution across the features. Choices are: [`uniform`,`normal`]. threshold_init_beta (float): Used in the Data-aware initialization of thresholds @@ -60,7 +61,7 @@ class NodeConfig(ModelConfig): task (str): Specify whether the problem is regression or classification. `backbone` is a task which considers the model as a backbone to generate features. Mostly used internally for SSL and related - tasks.. Choices are: [`regression`,`classification`,`backbone`]. + tasks. Choices are: [`regression`,`classification`,`backbone`]. head (Optional[str]): The head to be used for the model. Should be one of the heads defined in `pytorch_tabular.models.common.heads`. Defaults to LinearHead. Choices are: @@ -73,14 +74,14 @@ class NodeConfig(ModelConfig): list of tuples (cardinality, embedding_dim). If left empty, will infer using the cardinality of the categorical column using the rule min(50, (x + 1) // 2) - embedding_dropout (float): Dropout to be applied to the Categorical Embedding. Defaults to 0.1 + embedding_dropout (float): Dropout to be applied to the Categorical Embedding. Defaults to 0.0 batch_norm_continuous_input (bool): If True, we will normalize the continuous layer by passing it through a BatchNorm layer. learning_rate (float): The learning rate of the model. Defaults to 1e-3. - loss (Optional[str]): The loss function to be applied. By Default it is MSELoss for regression and + loss (Optional[str]): The loss function to be applied. By Default, it is MSELoss for regression and CrossEntropyLoss for classification. Unless you are sure what you are doing, leave it at MSELoss or L1Loss for regression and CrossEntropyLoss for classification diff --git a/src/pytorch_tabular/models/tab_transformer/config.py b/src/pytorch_tabular/models/tab_transformer/config.py index 4a85f0da..8d09e31a 100644 --- a/src/pytorch_tabular/models/tab_transformer/config.py +++ b/src/pytorch_tabular/models/tab_transformer/config.py @@ -10,7 +10,8 @@ @dataclass class TabTransformerConfig(ModelConfig): - """Tab Transformer configuration + """Tab Transformer configuration. + Args: input_embed_dim (int): The embedding dimension for the input categorical features. Defaults to 32 @@ -27,7 +28,7 @@ class TabTransformerConfig(ModelConfig): share_embedding_strategy (Optional[str]): There are two strategies in adding shared embeddings. 1. `add` - A separate embedding for the feature is added to the embedding of the unique values of the feature. 2. `fraction` - A fraction of the input embedding is reserved for the shared embedding of - the feature. Defaults to fraction.. Choices are: [`add`,`fraction`]. + the feature. Defaults to fraction. Choices are: [`add`,`fraction`]. shared_embedding_fraction (float): Fraction of the input_embed_dim to be reserved by the shared embedding. Should be less than one. Defaults to 0.25 @@ -68,14 +69,14 @@ class TabTransformerConfig(ModelConfig): list of tuples (cardinality, embedding_dim). 
If left empty, will infer using the cardinality of the categorical column using the rule min(50, (x + 1) // 2) - embedding_dropout (float): Dropout to be applied to the Categorical Embedding. Defaults to 0.1 + embedding_dropout (float): Dropout to be applied to the Categorical Embedding. Defaults to 0.0 batch_norm_continuous_input (bool): If True, we will normalize the continuous layer by passing it through a BatchNorm layer. learning_rate (float): The learning rate of the model. Defaults to 1e-3. - loss (Optional[str]): The loss function to be applied. By Default it is MSELoss for regression and + loss (Optional[str]): The loss function to be applied. By Default, it is MSELoss for regression and CrossEntropyLoss for classification. Unless you are sure what you are doing, leave it at MSELoss or L1Loss for regression and CrossEntropyLoss for classification diff --git a/src/pytorch_tabular/models/tabnet/config.py b/src/pytorch_tabular/models/tabnet/config.py index d5a77e5d..83253d78 100644 --- a/src/pytorch_tabular/models/tabnet/config.py +++ b/src/pytorch_tabular/models/tabnet/config.py @@ -10,15 +10,16 @@ @dataclass class TabNetModelConfig(ModelConfig): - """Model configuration + """TabNet: Attentive Interpretable Tabular Learning configuration + Args: n_d (int): Dimension of the prediction layer (usually between 4 and 64) n_a (int): Dimension of the attention layer (usually between 4 and 64) - n_steps (int): Number of sucessive steps in the newtork (usually betwenn 3 and 10) + n_steps (int): Number of successive steps in the network (usually between 3 and 10) - gamma (float): Float above 1, scaling factor for attention updates (usually betwenn 1.0 to 2.0) + gamma (float): Float above 1, scaling factor for attention updates (usually between 1.0 to 2.0) n_independent (int): Number of independent GLU layer in each GLU block (default 2) @@ -29,10 +30,9 @@ class TabNetModelConfig(ModelConfig): mask_type (str): Either 'sparsemax' or 'entmax' : this is the masking function to use. Choices are: [`sparsemax`,`entmax`]. - task (str): Specify whether the problem is regression or classification. `backbone` is a task which considers the model as a backbone to generate features. Mostly used internally for SSL and related - tasks.. Choices are: [`regression`,`classification`,`backbone`]. + tasks. Choices are: [`regression`,`classification`,`backbone`]. head (Optional[str]): The head to be used for the model. Should be one of the heads defined in `pytorch_tabular.models.common.heads`. Defaults to LinearHead. Choices are: @@ -45,14 +45,14 @@ class TabNetModelConfig(ModelConfig): list of tuples (cardinality, embedding_dim). If left empty, will infer using the cardinality of the categorical column using the rule min(50, (x + 1) // 2) - embedding_dropout (float): Dropout to be applied to the Categorical Embedding. Defaults to 0.1 + embedding_dropout (float): Dropout to be applied to the Categorical Embedding. Defaults to 0.0 batch_norm_continuous_input (bool): If True, we will normalize the continuous layer by passing it through a BatchNorm layer. learning_rate (float): The learning rate of the model. Defaults to 1e-3. - loss (Optional[str]): The loss function to be applied. By Default it is MSELoss for regression and + loss (Optional[str]): The loss function to be applied. By Default, it is MSELoss for regression and CrossEntropyLoss for classification. 
Unless you are sure what you are doing, leave it at MSELoss or L1Loss for regression and CrossEntropyLoss for classification @@ -85,11 +85,11 @@ class TabNetModelConfig(ModelConfig): ) n_steps: int = field( default=3, - metadata={"help": ("Number of sucessive steps in the newtork (usually betwenn 3 and 10)")}, + metadata={"help": ("Number of successive steps in the network (usually between 3 and 10)")}, ) gamma: float = field( default=1.3, - metadata={"help": ("Float above 1, scaling factor for attention updates (usually betwenn" " 1.0 to 2.0)")}, + metadata={"help": ("Float above 1, scaling factor for attention updates (usually between" " 1.0 to 2.0)")}, ) n_independent: int = field( default=2, diff --git a/src/pytorch_tabular/tabular_model_tuner.py b/src/pytorch_tabular/tabular_model_tuner.py index 19640970..6dca2157 100644 --- a/src/pytorch_tabular/tabular_model_tuner.py +++ b/src/pytorch_tabular/tabular_model_tuner.py @@ -160,54 +160,55 @@ def tune( **kwargs, ): """Tune the hyperparameters of the TabularModel. + Args: - train (DataFrame): Training data + train (DataFrame): Training data - validation (DataFrame, optional): Validation data. Defaults to None. + validation (DataFrame, optional): Validation data. Defaults to None. - search_space (Dict): A dictionary of the form {param_name: [values to try]} - for grid search or {param_name: distribution} for random search + search_space (Dict): A dictionary of the form {param_name: [values to try]} + for grid search or {param_name: distribution} for random search - metric (Union[str, Callable]): The metric to be used for evaluation. - If str is provided, will use that metric from the defined ones. - If callable is provided, will use that function as the metric. - We expect callable to be of the form `metric(y_true, y_pred)`. For classification - problems, The `y_pred` is a dataframe with the probabilities for each class - (_probability) and a final prediction(prediction). And for Regression, it is a - dataframe with a final prediction (_prediction). - Defaults to None. + metric (Union[str, Callable]): The metric to be used for evaluation. + If str is provided, will use that metric from the defined ones. + If callable is provided, will use that function as the metric. + We expect callable to be of the form `metric(y_true, y_pred)`. For classification + problems, The `y_pred` is a dataframe with the probabilities for each class + (_probability) and a final prediction(prediction). And for Regression, it is a + dataframe with a final prediction (_prediction). + Defaults to None. - mode (str): One of ['max', 'min']. Whether to maximize or minimize the metric. + mode (str): One of ['max', 'min']. Whether to maximize or minimize the metric. - strategy (str): One of ['grid_search', 'random_search']. The strategy to use for tuning. + strategy (str): One of ['grid_search', 'random_search']. The strategy to use for tuning. - n_trials (int, optional): Number of trials to run. Only used for random search. - Defaults to None. + n_trials (int, optional): Number of trials to run. Only used for random search. + Defaults to None. - cv (Optional[Union[int, Iterable, BaseCrossValidator]]): Determines the cross-validation splitting strategy. - Possible inputs for cv are: + cv (Optional[Union[int, Iterable, BaseCrossValidator]]): Determines the cross-validation splitting strategy. + Possible inputs for cv are: - - None, to not use any cross validation. 
We will just use the validation data - - integer, to specify the number of folds in a (Stratified)KFold, - - An iterable yielding (train, test) splits as arrays of indices. - - A scikit-learn CV splitter. - Defaults to None. + - None, to not use any cross validation. We will just use the validation data + - integer, to specify the number of folds in a (Stratified)KFold, + - An iterable yielding (train, test) splits as arrays of indices. + - A scikit-learn CV splitter. + Defaults to None. - cv_agg_func (Optional[Callable], optional): Function to aggregate the cross validation scores. - Defaults to np.mean. + cv_agg_func (Optional[Callable], optional): Function to aggregate the cross validation scores. + Defaults to np.mean. - cv_kwargs (Optional[Dict], optional): Additional keyword arguments to be passed to the cross validation - method. Defaults to {}. + cv_kwargs (Optional[Dict], optional): Additional keyword arguments to be passed to the cross validation + method. Defaults to {}. - verbose (bool, optional): Whether to print the results of each trial. Defaults to False. + verbose (bool, optional): Whether to print the results of each trial. Defaults to False. - progress_bar (bool, optional): Whether to show a progress bar. Defaults to True. + progress_bar (bool, optional): Whether to show a progress bar. Defaults to True. - random_state (Optional[int], optional): Random state to be used for random search. Defaults to 42. + random_state (Optional[int], optional): Random state to be used for random search. Defaults to 42. - ignore_oom (bool, optional): Whether to ignore out of memory errors. Defaults to True. + ignore_oom (bool, optional): Whether to ignore out of memory errors. Defaults to True. - **kwargs: Additional keyword arguments to be passed to the TabularModel fit. + **kwargs: Additional keyword arguments to be passed to the TabularModel fit. Returns: OUTPUT: A named tuple with the following attributes: