Commit 49b37a5

fixies in docs (#14357)

1 parent 5a01057 · commit 49b37a5

56 files changed: +116 -62 lines changed

docs/en/advanced_settings.md

Lines changed: 4 additions & 0 deletions

@@ -17,6 +17,7 @@ sidebar:

 You can change the following Spark NLP configurations via Spark Configuration:

+{:.table-model-big}
 | Property Name | Default | Meaning |
 |---------------------------------------------------------|----------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
 | `spark.jsl.settings.pretrained.cache_folder` | `~/cache_pretrained` | The location to download and extract pretrained `Models` and `Pipelines`. By default, it will be in User's Home directory under `cache_pretrained` directory |
@@ -32,6 +33,8 @@ You can change the following Spark NLP configurations via Spark Configuration:
 | `spark.jsl.settings.onnx.optimizationLevel` | `ALL_OPT` | Sets the optimization level of this options object, overriding the old setting. |
 | `spark.jsl.settings.onnx.executionMode` | `SEQUENTIAL` | Sets the execution mode of this options object, overriding the old setting. |

+</div><div class="h3-box" markdown="1">
+
 ### How to set Spark NLP Configuration

 **SparkSession:**
@@ -93,6 +96,7 @@ spark.jsl.settings.annotator.log_folder dbfs:/PATH_TO_LOGS

 NOTE: If this is an existing cluster, after adding new configs or changing existing properties you need to restart it.

+</div><div class="h3-box" markdown="1">

 ### S3 Integration

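The properties in the table above are ordinary Spark configuration keys, so they can be set when the session is built. A minimal sketch in Python — the Maven coordinate matches the one used elsewhere in this commit, and the cache path is an illustrative placeholder:

```python
from pyspark.sql import SparkSession

# Minimal sketch: pass Spark NLP settings as ordinary Spark confs at startup.
# The package version mirrors the coordinate used elsewhere in this commit;
# the cache folder value is a placeholder.
spark = SparkSession.builder \
    .appName("Spark NLP") \
    .master("local[*]") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0") \
    .config("spark.jsl.settings.pretrained.cache_folder", "/tmp/cache_pretrained") \
    .getOrCreate()
```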
docs/en/hardware_acceleration.md

Lines changed: 2 additions & 0 deletions

@@ -34,6 +34,7 @@ Since the new Transformer models such as BERT for Word and Sentence embeddings a

 ![Spark NLP CPU vs. GPU](/assets/images/Spark_NLP_CPU_vs._GPU_Transformers_(Word_Embeddings).png)

+{:.table-model-big}
 | Model on GPU | Spark NLP 3.4.3 vs. 4.0.0 |
 | ----------------- |:-------------------------:|
 | RoBERTa base | +560%(6.6x) |
@@ -72,6 +73,7 @@ Here we compare the last release of Spark NLP 3.4.3 on CPU (normal) with Spark N

 ![Spark NLP 3.4.4 CPU vs. Spark NLP 4.0 CPU with oneDNN](/assets/images/Spark_NLP_3.4_on_CPU_vs._Spark_NLP_4.0_on_CPU_with_oneDNN.png)

+{:.table-model-big}
 | Model on CPU | 3.4.x vs. 4.0.0 with oneDNN |
 | ----------------- |:------------------------:|
 | BERT Base | +47% |
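The GPU numbers above presuppose the GPU build of Spark NLP. A minimal sketch of selecting it from Python, assuming the `sparknlp` helper package is installed (the `gpu` flag swaps in the `spark-nlp-gpu` artifact):

```python
import sparknlp

# Minimal sketch: start a session backed by the GPU build of Spark NLP
# (the spark-nlp-gpu Maven artifact) instead of the default CPU build.
spark = sparknlp.start(gpu=True)
```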

docs/en/install.md

Lines changed: 19 additions & 1 deletion

@@ -106,6 +106,8 @@ spark = SparkSession.builder \
 If using local jars, you can use `spark.jars` instead for comma-delimited jar files. For cluster setups, of course,
 you'll have to put the jars in a reachable location for all driver and executor nodes.

+</div><div class="h3-box" markdown="1">
+
 ### Python without explicit Pyspark installation

 ### Pip/Conda
@@ -306,7 +308,6 @@ as expected.

 </div><div class="h3-box" markdown="1">

-
 ## Command line

 Spark NLP supports all major releases of Apache Spark 3.0.x, Apache Spark 3.1.x, Apache Spark 3.2.x, Apache Spark 3.3.x, Apache Spark 3.4.x, and Apache Spark 3.5.x
@@ -379,6 +380,8 @@ spark-shell \
 --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0
 ```

+</div><div class="h3-box" markdown="1">
+
 ## Installation for M1 & M2 Chips

 ### Scala and Java for M1
@@ -524,6 +527,8 @@ com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0
 - Add a path to pre-built jar from [here](#compiled-jars) in the interpreter's library list making sure the jar is
 available to driver path

+</div><div class="h3-box" markdown="1">
+
 ## Python in Zeppelin

 Apart from the previous step, install the python module through pip
@@ -546,6 +551,8 @@ install the pip library with (e.g. `python3`).
 An alternative option would be to set `SPARK_SUBMIT_OPTIONS` (zeppelin-env.sh) and make sure `--packages` is there as
 shown earlier since it includes both scala and python side installation.

+</div><div class="h3-box" markdown="1">
+
 ## Jupyter Notebook

 **Recommended:**
@@ -582,6 +589,8 @@ Alternatively, you can mix in using `--jars` option for pyspark + `pip install s
 If not using pyspark at all, you'll have to run the instructions
 pointed [here](#python-without-explicit-pyspark-installation)

+</div><div class="h3-box" markdown="1">
+
 ## Databricks Cluster

 1. Create a cluster if you don't have one already
@@ -605,6 +614,8 @@ NOTE: Databricks' runtimes support different Apache Spark major releases. Please
 NLP Maven package name (Maven Coordinate) for your runtime from
 our [Packages Cheatsheet](https://github.com/JohnSnowLabs/spark-nlp#packages-cheatsheet)

+</div><div class="h3-box" markdown="1">
+
 ## EMR Cluster

 To launch EMR clusters with Apache Spark/PySpark and Spark NLP correctly you need to have bootstrap and software
@@ -670,6 +681,8 @@ aws emr create-cluster \
 --profile <aws_profile_credentials>
 ```

+</div><div class="h3-box" markdown="1">
+
 ## GCP Dataproc

 1. Create a cluster if you don't have one already as follows.
@@ -733,6 +746,7 @@ gcloud dataproc clusters create ${CLUSTER_NAME} \

 Spark NLP *5.4.0* has been built on top of Apache Spark 3.4 while fully supporting Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x

+{:.table-model-big}
 | Spark NLP | Apache Spark 3.5.x | Apache Spark 3.4.x | Apache Spark 3.3.x | Apache Spark 3.2.x | Apache Spark 3.1.x | Apache Spark 3.0.x | Apache Spark 2.4.x | Apache Spark 2.3.x |
 |-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
 | 5.4.x | YES | YES | YES | YES | YES | YES | NO | NO |
@@ -750,6 +764,7 @@ Find out more about `Spark NLP` versions from our [release notes](https://github

 ## Scala and Python Support

+{:.table-model-big}
 | Spark NLP | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10| Scala 2.11 | Scala 2.12 |
 |-----------|------------|------------|------------|------------|------------|------------|------------|
 | 5.3.x | NO | YES | YES | YES | YES | NO | YES |
@@ -1260,6 +1275,7 @@ PipelineModel.load("/tmp/explain_document_dl_en_2.0.2_2.4_1556530585689/")
 - Since you are downloading and loading models/pipelines manually, this means Spark NLP is not downloading the most recent and compatible models/pipelines for you. Choosing the right model/pipeline is on you
 - If you are local, you can load the model/pipeline from your local FileSystem, however, if you are in a cluster setup you need to put the model/pipeline on a distributed FileSystem such as HDFS, DBFS, S3, etc. (i.e., `hdfs:///tmp/explain_document_dl_en_2.0.2_2.4_1556530585689/`)

+</div><div class="h3-box" markdown="1">

 ## Compiled JARs

@@ -1285,6 +1301,8 @@ sbt -Dis_gpu=true assembly
 sbt -Dis_silicon=true assembly
 ```

+</div><div class="h3-box" markdown="1">
+
 ### Using the jar manually

 If for some reason you need to use the JAR, you can either download the Fat JARs provided here or download it
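As a quick sanity check after a pip-based install, a minimal sketch that starts a session and prints versions — pin `spark-nlp` and `pyspark` per the compatibility tables above:

```python
import sparknlp

# Minimal sketch: start Spark NLP after `pip install spark-nlp pyspark`
# (choose versions according to the compatibility tables above).
spark = sparknlp.start()

print(sparknlp.version())  # Spark NLP version, e.g. 5.4.0
print(spark.version)       # underlying Apache Spark version
```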

docs/en/mlflow.md

Lines changed: 24 additions & 6 deletions

@@ -133,6 +133,8 @@ import pandas as pd
 import glob
 ```

+</div><div class="h3-box" markdown="1">
+
 ### Spark NLP imports
 ```
 import sparknlp
@@ -172,13 +174,17 @@ We will be showcasing the serialization and experiment tracking of `NERDLApproac

 There is one specific util that is able to parse the log of that approach in order to extract the metrics and charts. Let's get it.

+</div><div class="h3-box" markdown="1">
+
 ### Ner Log Parser Util
 `!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Public/utils/ner_image_log_parser.py`

 Now, let's import the library:

 `import ner_image_log_parser`

+</div><div class="h3-box" markdown="1">
+
 ### Starting a SparkNLP session
 It's important we create a Spark NLP Session using the Session Builder, since we need to specify the jars not only of Spark NLP, but also of MLFlow.

@@ -198,6 +204,8 @@ def start():
 spark = start()
 ```

+</div><div class="h3-box" markdown="1">
+
 ### Training dataset preparation
 Let's download some training and test datasets:
 ```
@@ -221,6 +229,8 @@ TRAINING_SIZE = training_data.count()
 TRAINING_SIZE
 ```

+</div><div class="h3-box" markdown="1">
+
 ### Hyperparameters configuration
 Let's configure our hyperparameter values.
 ```
@@ -236,6 +246,8 @@ RANDOM_SEED = 0 # Adapt me to your experiment
 VALIDATION_SPLIT = 0.1 # Adapt me to your experiment
 ```

+</div><div class="h3-box" markdown="1">
+
 ### Creating the experiment
 Now, we are ready to instantiate an experiment in MLFlow
 ```
@@ -244,6 +256,8 @@ EXPERIMENT_ID = mlflow.create_experiment(f"{MODEL_NAME}_{EXPERIMENT_NAME}")

 Each time you want to test a different thing, change the EXPERIMENT_NAME and rerun the line above to create a new entry in the experiment. By changing the experiment name, a new experiment ID will be generated. Each experiment ID groups all runs in separate folders inside `./mlruns`.

+</div><div class="h3-box" markdown="1">
+
 ### Pipeline creation
 ```
 document = DocumentAssembler()\
@@ -300,11 +314,15 @@ ner_training_pipeline = Pipeline(stages = ner_preprocessing_pipeline.getStages()
 ## Preparing inference objects
 Now, let's prepare the inference as well, since we will train and infer afterwards, and store all the results of training and inference as artifacts in our MLFlow object.

+</div><div class="h3-box" markdown="1">
+
 ### Test dataset preparation
 ```
 test_data = CoNLL().readDataset(spark, TEST_DATASET)
 ```

+</div><div class="h3-box" markdown="1">
+
 ### Setting the names of the inference objects
 ```
 INFERENCE_NAME = "inference.parquet" # This is the name of the results inference on the test dataset, serialized in parquet,
@@ -520,11 +538,11 @@ Now, we just need to launch the MLFLow UI to see:
 </div><div class="h3-box" markdown="1">

 ## Some example screenshots
-![](/assets/images/mlflow/mlflow10.png)
-![](/assets/images/mlflow/mlflow11.png)
-![](/assets/images/mlflow/mlflow12.png)
-![](/assets/images/mlflow/mlflow13.png)
-![](/assets/images/mlflow/mlflow14.png)
-![](/assets/images/mlflow/mlflow15.png)
+![MLFLow](/assets/images/mlflow/mlflow10.png)
+![MLFLow](/assets/images/mlflow/mlflow11.png)
+![MLFLow](/assets/images/mlflow/mlflow12.png)
+![MLFLow](/assets/images/mlflow/mlflow13.png)
+![MLFLow](/assets/images/mlflow/mlflow14.png)
+![MLFLow](/assets/images/mlflow/mlflow15.png)

 </div>
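To make the experiment-creation step above concrete, a minimal sketch using the public MLflow API; `MODEL_NAME`, `EXPERIMENT_NAME`, and the logged values are placeholders echoing the hyperparameter step in this diff:

```python
import mlflow

# Sketch of the experiment bookkeeping described above: a new experiment
# name yields a new experiment ID, and runs under that ID are grouped in
# their own folder inside ./mlruns.
MODEL_NAME = "NER_DL"        # placeholder
EXPERIMENT_NAME = "run_v1"   # change me to start a fresh experiment

EXPERIMENT_ID = mlflow.create_experiment(f"{MODEL_NAME}_{EXPERIMENT_NAME}")

with mlflow.start_run(experiment_id=EXPERIMENT_ID):
    mlflow.log_param("VALIDATION_SPLIT", 0.1)  # values from the config step
    mlflow.log_param("RANDOM_SEED", 0)
```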

docs/en/pipelines.md

Lines changed: 6 additions & 0 deletions

@@ -62,6 +62,8 @@ annotation.select("entities.result").show(false)
 */
 ```

+</div><div class="h3-box" markdown="1">
+
 #### Showing Available Pipelines

 There are functions in Spark NLP that will list all the available Pipelines
@@ -105,6 +107,8 @@ ResourceDownloader.showPublicPipelines(lang = "en", version = "3.1.0")
 */
 ```

+</div><div class="h3-box" markdown="1">
+
 #### Please check out our Models Hub for the full list of [pre-trained pipelines](https://sparknlp.org/models) with examples, demos, benchmarks, and more

 ### Models
@@ -138,6 +142,8 @@ val french_pos = PerceptronModel.load("/tmp/pos_ud_gsd_fr_2.0.2_2.4_155653145734
 .setOutputCol("pos")
 ```

+</div><div class="h3-box" markdown="1">
+
 #### Showing Available Models

 There are functions in Spark NLP that will list all the available Models
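To complement the Scala snippets in this diff, a Python sketch of listing and loading a public pipeline; `explain_document_dl` is borrowed from the examples elsewhere in this commit:

```python
from sparknlp.pretrained import PretrainedPipeline, ResourceDownloader

# Assumes an active session, e.g. via sparknlp.start().
# Python counterpart of the Scala listing call shown in the diff.
ResourceDownloader.showPublicPipelines(lang="en")

# Download and run a pipeline by name.
pipeline = PretrainedPipeline("explain_document_dl", lang="en")
result = pipeline.annotate("Spark NLP ships hundreds of pretrained pipelines.")
print(result.keys())
```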

docs/en/training.md

Lines changed: 1 addition & 1 deletion

@@ -138,7 +138,7 @@ All of these graphs use an LSTM of size 128 and number of chars 100

 In case your train dataset has a different combination of tags, embeddings dimension, number of chars, and LSTM size than those shown in the table above, `NerDLApproach` will raise an **IllegalArgumentException** during runtime with the message below:

-*Graph [parameter] should be [value]: Could not find a suitable tensorflow graph for embeddings dim: [value] tags: [value] nChars: [value]. Check https://sparknlp.org/docs/en/graph for instructions to generate the required graph.*
+*Graph [parameter] should be [value]: Could not find a suitable tensorflow graph for embeddings dim: [value] tags: [value] nChars: [value]. Check [https://sparknlp.org/docs/en/graph](https://sparknlp.org/docs/en/graph) for instructions to generate the required graph.*

 To overcome this exception message we have to follow these steps:

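Once a suitable graph has been generated, it can be handed to the trainer; a minimal sketch, with the graph folder path as a placeholder:

```python
from sparknlp.annotator import NerDLApproach

# Minimal sketch: point NerDLApproach at a folder containing a custom
# TensorFlow graph matching your tags/embeddings-dim/nChars combination.
# The folder path is a placeholder.
ner_approach = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setGraphFolder("/tmp/custom_ner_graphs")
```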
docs/en/transformer_entries/AlbertEmbeddings.md

Lines changed: 3 additions & 2 deletions

@@ -10,6 +10,7 @@ All official Albert releases by google in TF-HUB are supported with this Albert

 **Ported TF-Hub Models:**

+{:.table-model-big}
 | Spark NLP Model | TF-Hub Model | Model Properties |
 | -------------------------- | ----------------------------------------------------------- | ------------------------------------------------------ |
 | `"albert_base_uncased"` | [albert_base](https://tfhub.dev/google/albert_base/3) | 768-embed-dim, 12-layer, 12-heads, 12M parameters |
@@ -39,9 +40,9 @@ and the [AlbertEmbeddingsTestSpec](https://github.com/JohnSnowLabs/spark-nlp/blo

 [ALBERT: A LITE BERT FOR SELF-SUPERVISED LEARNING OF LANGUAGE REPRESENTATIONS](https://arxiv.org/pdf/1909.11942.pdf)

-https://github.com/google-research/ALBERT
+[https://github.com/google-research/ALBERT](https://github.com/google-research/ALBERT)

-https://tfhub.dev/s?q=albert
+[https://tfhub.dev/s?q=albert](https://tfhub.dev/s?q=albert)

 **Paper abstract:**

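For orientation, a minimal Python sketch loading one of the ported models from the table above (assumes an active Spark NLP session and upstream `document`/`token` columns):

```python
from sparknlp.annotator import AlbertEmbeddings

# Sketch: load one of the ported TF-Hub models listed in the table.
embeddings = AlbertEmbeddings.pretrained("albert_base_uncased") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")
```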
docs/en/transformer_entries/AlbertForQuestionAnswering.md

Lines changed: 1 addition & 1 deletion

@@ -19,7 +19,7 @@ For available pretrained models please see the
 [Models Hub](https://sparknlp.org/models?task=Question+Answering).

 To see which models are compatible and how to import them see
-https://github.com/JohnSnowLabs/spark-nlp/discussions/5669. and the
+[https://github.com/JohnSnowLabs/spark-nlp/discussions/5669](https://github.com/JohnSnowLabs/spark-nlp/discussions/5669) and the
 [AlbertForQuestionAnsweringTestSpec](https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/AlbertForQuestionAnsweringTestSpec.scala).
 {%- endcapture -%}

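A minimal Python sketch of the usual question-answering wiring for this annotator; the column names follow the standard Spark NLP QA pattern, and `pretrained()` with no arguments resolves to the annotator's default model:

```python
from sparknlp.base import MultiDocumentAssembler
from sparknlp.annotator import AlbertForQuestionAnswering

# Sketch: question + context go in as two document columns.
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

span_classifier = AlbertForQuestionAnswering.pretrained() \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")
```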
docs/en/transformer_entries/AlbertForSequenceClassification.md

Lines changed: 1 addition & 1 deletion

@@ -19,7 +19,7 @@ The default model is `"albert_base_sequence_classifier_imdb"`, if no name is pro
 For available pretrained models please see the [Models Hub](https://sparknlp.org/models?task=Text+Classification).

 Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are
-compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669.
+compatible and how to import them see [https://github.com/JohnSnowLabs/spark-nlp/discussions/5669](https://github.com/JohnSnowLabs/spark-nlp/discussions/5669)
 and the [AlbertForSequenceClassification](https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/AlbertForSequenceClassificationTestSpec.scala).
 {%- endcapture -%}

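Similarly, a minimal sketch loading the default IMDB classifier named in this entry (assumes an active session):

```python
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, AlbertForSequenceClassification

# Sketch: standard document -> token -> classifier wiring.
document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

classifier = AlbertForSequenceClassification \
    .pretrained("albert_base_sequence_classifier_imdb") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("class")
```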

docs/en/transformer_entries/BartTransformer.md

Lines changed: 1 addition & 1 deletion

@@ -43,7 +43,7 @@ For extended examples of usage, see
 **References:**

 - [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://aclanthology.org/2020.acl-main.703.pdf)
-- https://github.com/pytorch/fairseq
+- [https://github.com/pytorch/fairseq](https://github.com/pytorch/fairseq)

 **Paper Abstract:**

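A hedged sketch of running BART for summarization from Python; the model name and task prefix follow the Models Hub examples and are assumptions here, not taken from this diff:

```python
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import BartTransformer

document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")

# Sketch: a summarization-style task prefix on a pretrained BART model
# (model name assumed from the Models Hub).
bart = BartTransformer.pretrained("distilbart_xsum_12_6") \
    .setTask("summarize:") \
    .setInputCols(["documents"]) \
    .setOutputCol("summaries")
```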
