

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/TEXT_PREPROCESSING.ipynb)




# **Pre-process text**
## **Convert text to tokens, remove punctuation and stop words, and perform stemming and lemmatization using Spark NLP's annotators**

**Demo of the following annotators:**


* SentenceDetector
* Tokenizer
* Normalizer
* Stemmer
* Lemmatizer
* StopWordsCleaner

## 1. Colab Setup

# Install PySpark and Spark NLP
! pip install -q pyspark==3.3.0 spark-nlp==4.2.8

In [0]:
import json
import pandas as pd
import numpy as np

import sparknlp
import pyspark.sql.functions as F

from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
from sparknlp.annotator import *
from sparknlp.base import *
from sparknlp.pretrained import PretrainedPipeline
from pyspark.sql.types import StringType, IntegerType

## 2. Start Spark Session

In [0]:
spark = sparknlp.start()
print("Spark NLP version:", sparknlp.version())
spark

## 3. Setting sample text

In [0]:
# Sample text to run through the pipeline

text_list = ["""The Geneva Motor Show, the first major car show of the year, opens tomorrow with U.S. Car makers hoping to make new inroads into European markets due to the cheap dollar, automobile executives said. Ford Motor Co and General Motors Corp sell cars in Europe, where about 10.5 mln new cars a year are bought. GM also makes a few thousand in North American plants for European export.""",]

## 4. Download the lemma reference file (you may also use a pretrained lemmatization model)

# Download the lemma dictionary used by the Lemmatizer
!wget https://raw.githubusercontent.com/mahavivo/vocabulary/master/lemmas/AntBNC_lemmas_ver_001.txt
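
A dictionary-based lemmatizer maps each inflected form back to its lemma. The sketch below shows that lookup in plain Python, assuming each line has the shape `lemma -> form1<TAB>form2`. The sample lines and the helper name `parse_lemma_dictionary` are hypothetical and are not copied from the downloaded file; this is an illustration of the idea, not Spark NLP's implementation.

```python
def parse_lemma_dictionary(lines, key_delimiter="->", value_delimiter="\t"):
    """Build a {inflected form: lemma} lookup from 'lemma -> form1<TAB>form2' lines."""
    form_to_lemma = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and lines without the key delimiter
        if not line or key_delimiter not in line:
            continue
        lemma, forms = line.split(key_delimiter, 1)
        lemma = lemma.strip()
        for form in forms.split(value_delimiter):
            form = form.strip()
            if form:
                form_to_lemma[form] = lemma
    return form_to_lemma


# Hypothetical sample lines, written by hand for illustration
sample_lines = [
    "make -> makes\tmade\tmaking",
    "car -> cars",
]

lookup = parse_lemma_dictionary(sample_lines)

# Tokens missing from the dictionary are passed through unchanged
lemmas = [lookup.get(tok, tok) for tok in ["makes", "cars", "Geneva"]]
print(lemmas)
```

Passing unknown tokens through unchanged is a design choice of this sketch; it keeps proper nouns such as "Geneva" intact.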

## 5. Define Spark NLP pipeline

In [0]:
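documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Lowercase tokens and strip characters that are not word characters, digits, or whitespace
normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized") \
    .setLowercase(True) \
    .setCleanupPatterns([r"[^\w\d\s]"])

# Produces the "stem" column displayed in section 7
stemmer = Stemmer() \
    .setInputCols(["token"]) \
    .setOutputCol("stem")

# Produces the "lemma" column displayed in section 7, using the file from section 4.
# The delimiters assume lines of the form: lemma -> form1<TAB>form2
lemmatizer = Lemmatizer() \
    .setInputCols(["token"]) \
    .setOutputCol("lemma") \
    .setDictionary("./AntBNC_lemmas_ver_001.txt", key_delimiter="->", value_delimiter="\t")

stopwords_cleaner = StopWordsCleaner() \
    .setInputCols(["token"]) \
    .setOutputCol("removed_stopwords") \
    .setCaseSensitive(False)

nlpPipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    normalizer,
    stemmer,
    lemmatizer,
    stopwords_cleaner,
])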
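
To make the effect of the cleanup pattern and the case-insensitive stop-word setting concrete, here is a minimal plain-Python sketch. It uses Python's `re` module as a stand-in for the regex engine Spark NLP runs on the JVM, so edge cases (for example, Unicode handling of `\w`) may differ. The token list is written by hand, `STOP_WORDS` is a tiny hypothetical set rather than Spark NLP's default list, and dropping tokens that become empty is an assumption of this sketch.

```python
import re

# Same pattern as in the Normalizer stage: anything that is not a word
# character, digit, or whitespace is removed
CLEANUP_PATTERN = re.compile(r"[^\w\d\s]")

# Hypothetical, deliberately tiny stop-word set (not Spark NLP's default list)
STOP_WORDS = {"the", "of", "to", "with"}


def normalize(tokens, lowercase=True):
    """Strip unwanted characters, optionally lowercase, and drop empty tokens."""
    cleaned = []
    for tok in tokens:
        tok = CLEANUP_PATTERN.sub("", tok)
        if lowercase:
            tok = tok.lower()
        if tok:  # assumption: tokens that end up empty are discarded
            cleaned.append(tok)
    return cleaned


def remove_stop_words(tokens, case_sensitive=False):
    """Filter out stop words; with case_sensitive=False, 'The' matches 'the'."""
    if case_sensitive:
        return [t for t in tokens if t not in STOP_WORDS]
    return [t for t in tokens if t.lower() not in STOP_WORDS]


# Hand-written tokens loosely based on the sample text
tokens = ["The", "Geneva", "Motor", "Show", ",", "opens", "with", "U.S.", "makers", "."]

normalized = normalize(tokens)
kept = remove_stop_words(tokens)

print(normalized)
print(kept)
```

Both functions read from the raw tokens, mirroring the pipeline, where the normalizer and the stop-word cleaner each take the `token` column as input rather than being chained.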


## 6. Run pipeline

In [0]:
df = spark.createDataFrame(text_list, StringType()).toDF("text")
result = nlpPipeline.fit(df).transform(df)

## 7. Visualize Results

In [0]:
# sentences in the text
result.select(F.explode(F.arrays_zip(result.sentences.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentences")).show(truncate=False)


In [0]:
# tokens in the text
result.select(F.explode(F.arrays_zip(result.token.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("tokens")).show(truncate=False)

In [0]:
# eliminated punctuation
result.select(F.explode(F.arrays_zip(result.normalized.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("normalized_tokens")).show(truncate=False)

# stemmed tokens
result.select(F.explode(F.arrays_zip(result.stem.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("token_stems")).show(truncate=False)

In [0]:
# removed_stopwords
result.select(F.explode(F.arrays_zip(result.removed_stopwords.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("removed_stopwords")).show(truncate=False)

# lemmatization
result.select(F.explode(F.arrays_zip(result.lemma.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("lemma")).show(truncate=False)
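
The stem and lemma columns often disagree because stemming trims suffixes by rule while lemmatization looks words up in a dictionary. The toy sketch below contrasts the two ideas in plain Python. The suffix list, the lookup table, and the function names are all hypothetical; this is not the algorithm behind Spark NLP's `Stemmer`, and its outputs should not be expected to match the pipeline's.

```python
# Toy suffix list, checked in order (illustrative only)
SUFFIXES = ("ing", "es", "s", "ed")

# Toy lemma lookup (illustrative only)
TOY_LEMMAS = {"makes": "make", "hoping": "hope", "bought": "buy"}


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    word = token.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemma(token):
    """Look the token up in the toy dictionary; fall back to the token itself."""
    return TOY_LEMMAS.get(token.lower(), token)


comparison = [(tok, toy_stem(tok), toy_lemma(tok)) for tok in ["makes", "hoping", "bought"]]

for token, stem, lemma in comparison:
    print(f"{token:8} stem={stem:8} lemma={lemma}")
```

Rule-based trimming can yield non-words such as "mak" and cannot handle irregular forms such as "bought", which is where a dictionary lookup helps.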