In [1]:
import os
# Find the latest Spark 3.0 release at http://www-us.apache.org/dist/spark/ and set it as the Spark version
# For example:
# spark_version = 'spark-3.0.1'
spark_version = 'spark-3.0.1'
os.environ['SPARK_VERSION']=spark_version

# Install Spark and Java
!apt-get update
!apt-get install openjdk-11-jdk-headless -qq > /dev/null
!wget -q http://www-us.apache.org/dist/spark/$SPARK_VERSION/$SPARK_VERSION-bin-hadoop2.7.tgz
!tar xf $SPARK_VERSION-bin-hadoop2.7.tgz
!pip install -q findspark

# Set environment variables (os is already imported at the top of this cell)
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = f"/content/{spark_version}-bin-hadoop2.7"

# Locate the Spark installation so that pyspark can be imported
import findspark
findspark.init()


In [2]:
# Start Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Tokens").getOrCreate()

In [3]:
# Import the Tokenizer class
from pyspark.ml.feature import Tokenizer

In [4]:
# Create sample Dataframe
dataframe = spark.createDataFrame([
    (0, "My name is Amir ElTabakh"),
    (1, "I am a junior in college at this time"),
    (2, "I am interning at NASA")
], ["id", "sentence"])

dataframe.show()

+---+--------------------+
| id|            sentence|
+---+--------------------+
|  0|My name is Amir E...|
|  1|I am a junior in ...|
|  2|I am interning at...|
+---+--------------------+



The `Tokenizer` constructor takes an input and an output parameter: `inputCol` names the column that we want tokenized, and `outputCol` names the new column that will hold the tokens. Type and run the following code:

In [5]:
# Tokenize sentences
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tokenizer

Tokenizer_3dd4a873d50f

The tokenizer that we created has a transform method that takes a DataFrame as input and returns a new DataFrame with the `words` column added. Because this is a transformation, nothing is computed until an action runs, so we call show(truncate=False) as our action to display the results without shortening the output, as shown below:

In [6]:
# Transform and show DataFrame
tokenized_df = tokenizer.transform(dataframe)
tokenized_df.show(truncate=False)

+---+-------------------------------------+-----------------------------------------------+
|id |sentence                             |words                                          |
+---+-------------------------------------+-----------------------------------------------+
|0  |My name is Amir ElTabakh             |[my, name, is, amir, eltabakh]                 |
|1  |I am a junior in college at this time|[i, am, a, junior, in, college, at, this, time]|
|2  |I am interning at NASA               |[i, am, interning, at, nasa]                   |
+---+-------------------------------------+-----------------------------------------------+
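
The `words` column above is consistent with lowercasing each sentence and splitting it on whitespace. For intuition, here is a minimal plain-Python sketch of that behavior. It assumes single-spaced input like the sample sentences, and the helper name `simple_tokenize` is hypothetical, not part of PySpark:

```python
def simple_tokenize(sentence):
    """Lowercase a sentence and split it on whitespace."""
    return sentence.lower().split()

# The same three sentences used in the sample DataFrame
sentences = [
    "My name is Amir ElTabakh",
    "I am a junior in college at this time",
    "I am interning at NASA",
]

# Tokenize each sentence outside of Spark
tokens = [simple_tokenize(s) for s in sentences]
for token_list in tokens:
    print(token_list)
```

This is only a comparison aid. Inside Spark, the Tokenizer should still be used so that the work is carried out on the DataFrame.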



The tokenizer behaves much like calling lower() and then split() on a Python string: it lowercases each sentence and splits it on whitespace. We can enhance the result by adding a word count for each row. Start by creating a Python function that takes a list of words as its input and returns the length of that list. We'll also import the `udf` function to turn it into a user-defined function, the `col` function to select the column that is passed into it, and the type IntegerType, which declares the data type of the udf's output.

We can then redo the tokenizer process. This time, after the DataFrame has been tokenized, we apply our own function to return the number of tokens in each row. This gives us another data point to use in the future, if needed.

In [7]:
# Create a function to return the length of a list
def word_list_length(word_list):
    return len(word_list)

from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

# Create a user defined function
count_tokens = udf(word_list_length, IntegerType())

In [8]:
# Create our tokenizer
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")

# Transform DataFrame
tokenized_df = tokenizer.transform(dataframe)

# Add a token-count column and show the results without truncation
tokenized_df.withColumn("tokens", count_tokens(col("words"))).show(truncate=False)

+---+-------------------------------------+-----------------------------------------------+------+
|id |sentence                             |words                                          |tokens|
+---+-------------------------------------+-----------------------------------------------+------+
|0  |My name is Amir ElTabakh             |[my, name, is, amir, eltabakh]                 |5     |
|1  |I am a junior in college at this time|[i, am, a, junior, in, college, at, this, time]|9     |
|2  |I am interning at NASA               |[i, am, interning, at, nasa]                   |5     |
+---+-------------------------------------+-----------------------------------------------+------+
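
Because `word_list_length` is an ordinary Python function, it can be checked without a Spark session before it is wrapped in `udf`. The sketch below does that for the three token lists shown above. It also includes a null-tolerant variant, `safe_word_list_length`, which is a hypothetical helper added on the assumption that a `words` value could be null (a Python UDF receives a null value as `None`):

```python
def word_list_length(word_list):
    """Return the number of items in a list of words."""
    return len(word_list)

def safe_word_list_length(word_list):
    """Hypothetical variant: return None instead of raising on a None input."""
    return None if word_list is None else len(word_list)

# Token lists copied from the tokenized output above
rows = [
    ["my", "name", "is", "amir", "eltabakh"],
    ["i", "am", "a", "junior", "in", "college", "at", "this", "time"],
    ["i", "am", "interning", "at", "nasa"],
]

counts = [word_list_length(row) for row in rows]
print(counts)  # [5, 9, 5]

# The variant agrees on real lists and tolerates a missing value
safe_counts = [safe_word_list_length(row) for row in rows + [None]]
print(safe_counts)  # [5, 9, 5, None]
```

The counts agree with the `tokens` column in the table. The null-tolerant variant is only needed if the input data can contain missing values.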

