## Creating Custom Transformers for PySpark MLlib
Contributor: Tommy Lin (October 20th, 2024)

### Overview
Transformers convert data into a desired format, usually as a preprocessing step for machine learning algorithms. Data from an input column is transformed and placed into an output column. A custom transformer can do this with the PySpark DataFrame's built-in method withColumn(), which adds or replaces a column in the dataset.

### Common Built-in MLlib Transformers
- Tokenizer
- StopWordsRemover
- StandardScaler
- CountVectorizer

### Key Parts of a Transformer
All of these extend the PySpark MLlib Transformer class and expose a <strong>transform()</strong> method, hence the name Transformer. When creating our own Transformer, we implement <strong>_transform()</strong>, which the inherited <strong>transform()</strong> method calls for us. Some components, such as StandardScaler, also have a fit() method that learns values from the data before transforming, though a plain Transformer does not require one.
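As a rough illustration of how transform() and _transform() relate, here is a plain-Python sketch that needs no Spark. `MiniTransformer`, `UpperCaser`, and the list-of-dicts "dataset" are hypothetical stand-ins invented for this example, not PySpark classes. The only thing borrowed from PySpark is the pattern in which the public transform() delegates to a subclass-provided _transform().

```python
# Simplified stand-in for the pattern only; this is NOT PySpark's implementation.
class MiniTransformer:
    def transform(self, rows):
        # Public entry point: callers always call this method.
        return self._transform(rows)

    def _transform(self, rows):
        # Subclasses supply the actual column logic here.
        raise NotImplementedError("subclasses must implement _transform()")


class UpperCaser(MiniTransformer):
    def __init__(self, inputCol, outputCol):
        self.inputCol = inputCol
        self.outputCol = outputCol

    def _transform(self, rows):
        # Copy each row and add the output column, leaving the input rows untouched.
        return [{**row, self.outputCol: row[self.inputCol].upper()} for row in rows]


rows = [{"text": "hello"}, {"text": "spark"}]
result = UpperCaser(inputCol="text", outputCol="upper").transform(rows)
print(result)
```

The custom classes later in this tutorial follow the same shape: they define _transform() and are used by calling transform().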

In [1]:
! pip install pyspark



### Built-in MLlib Transformers vs Custom Versions

In [2]:
from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext("local")
spark = SparkSession.builder.getOrCreate()

#### Tokenizer
The built-in MLlib Tokenizer converts a string to lowercase and splits it on whitespace. Our custom Tokenizer does the same: it converts the strings in the input column to lowercase, splits them on " ", and puts the results in the output column.

In [3]:
from pyspark.ml.feature import Tokenizer

# Create array of text
text = [('He went to the store.',), ('Dogs like to fight cats.',), ('When did you do that?',)]
df = spark.createDataFrame(text, ['text'])

# Tokenize the sentences into list of words using the built-in Tokenizer
# inputCol is the column we want to convert
# outputCol is where we will store the converted inputs
tokenizer = Tokenizer(inputCol="text", outputCol="tokenized_text")
tokenizer.transform(df).show(truncate=False)

+------------------------+------------------------------+
|text                    |tokenized_text                |
+------------------------+------------------------------+
|He went to the store.   |[he, went, to, the, store.]   |
|Dogs like to fight cats.|[dogs, like, to, fight, cats.]|
|When did you do that?   |[when, did, you, do, that?]   |
+------------------------+------------------------------+



In [4]:
from pyspark.ml import Transformer
import pyspark.sql.functions as F

class CustomTokenizer(Transformer):
    def __init__(self, inputCol, outputCol):
        super().__init__()
        self.inputCol = inputCol
        self.outputCol = outputCol

    # Transform method that converts each sentence to tokens
    # 1. Takes the specified input column
    # 2. performs the operations on the column
    # 3. puts the new values into an output column
    def _transform(self, df):
        return df.withColumn(self.outputCol, F.split(F.lower(F.col(self.inputCol)), " "))


# Tokenizes data using custom Tokenizer
# Same parameters inputCol and outputCol, just like the built-in tokenizer
custom_tokenizer = CustomTokenizer(inputCol="text", outputCol="tokenized_text")
custom_tokenizer.transform(df).show(truncate=False)

+------------------------+------------------------------+
|text                    |tokenized_text                |
+------------------------+------------------------------+
|He went to the store.   |[he, went, to, the, store.]   |
|Dogs like to fight cats.|[dogs, like, to, fight, cats.]|
|When did you do that?   |[when, did, you, do, that?]   |
+------------------------+------------------------------+
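To make the column expression in `_transform` concrete, here is a plain-Python sketch of roughly what it computes for a single row. The helper name `tokenize_row` is made up for this illustration, and Python's str.split stands in for F.split, on the assumption that the two behave the same way for a single-space pattern.

```python
def tokenize_row(text):
    # Lowercase first, then split on a single literal space.
    return text.lower().split(" ")


tokens = tokenize_row("He went to the store.")
print(tokens)  # ['he', 'went', 'to', 'the', 'store.']

# Splitting on exactly one space means consecutive spaces produce empty tokens.
double_space = tokenize_row("two  spaces")
print(double_space)  # ['two', '', 'spaces']
```

Punctuation stays attached to the neighbouring word ("store."), just as in the tables above.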



#### VectorAssembler
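The built-in VectorAssembler transformer combines several column values into one vector. For example, with columns "age", "height", and "weight", VectorAssembler combines each row's values into a single vector [age, height, weight]. We can use the F.array function in our own _transform() method to emulate this, with one difference: F.array produces an array column rather than an MLlib vector column, which is why the two outputs below are formatted slightly differently.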

In [5]:
from pyspark.ml.feature import VectorAssembler

x = [(1.0, 2.4),
     (2.9, 5.0),
     (3.3, 1.0),
     (4.8, 2.0),
     (5.0, 4.6),
     (6.0, 1.2)]

df = spark.createDataFrame(x, ["x1", "x2"])

# Use VectorAssembler to combine columns into a feature array
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
assembler.transform(df).show(truncate=False)

+---+---+---------+
|x1 |x2 |features |
+---+---+---------+
|1.0|2.4|[1.0,2.4]|
|2.9|5.0|[2.9,5.0]|
|3.3|1.0|[3.3,1.0]|
|4.8|2.0|[4.8,2.0]|
|5.0|4.6|[5.0,4.6]|
|6.0|1.2|[6.0,1.2]|
+---+---+---------+



In [6]:
from pyspark.ml import Transformer
import pyspark.sql.functions as F

class CustomVectorAssembler(Transformer):
    def __init__(self, inputCols, outputCol):
        super().__init__()
        self.inputCols = inputCols
        self.outputCol = outputCol

    # Transform method that collects each column value into a feature array
    # Takes all the input columns we want to group together
    # puts the column values into arrays using F.array()
    # Stores the results in the output column
    def _transform(self, df):
        return df.withColumn(self.outputCol, F.array(*self.inputCols))


# Use custom Vector Assembler to assemble features into an array
assembler = CustomVectorAssembler(inputCols=["x1", "x2"], outputCol="features")
assembler.transform(df).show(truncate=False)

+---+---+----------+
|x1 |x2 |features  |
+---+---+----------+
|1.0|2.4|[1.0, 2.4]|
|2.9|5.0|[2.9, 5.0]|
|3.3|1.0|[3.3, 1.0]|
|4.8|2.0|[4.8, 2.0]|
|5.0|4.6|[5.0, 4.6]|
|6.0|1.2|[6.0, 1.2]|
+---+---+----------+



### Custom Transformers

#### StandardScaler + Addition
We can also add our own functionality to our Transformers. What if we wanted to add a number to our data after scaling it? To do this, we can write a fit() method that calculates the mean and standard deviation, then use transform() to scale the values, add our desired value, and write the results to the output column. This fit() is our own helper method, so we have to call it ourselves before calling transform().

In [7]:
from pyspark.ml import Transformer
from pyspark.sql import functions as F

# Custom Transformer that scales a column of numbers
class CustomStandardScaler(Transformer):

    # inputCol and outputCol are required parameters
    # addition is optional, default value is 0, which makes this a regular StandardScaler
    def __init__(self, inputCol, outputCol, addition=0):
        super().__init__()
        self.inputCol = inputCol
        self.outputCol = outputCol
        self.addition = addition

    def fit(self, df):
        # Calculate mean and stddev for the input column
        stats = df.select(F.mean(F.col(self.inputCol)).alias("mean"),
                          F.stddev(F.col(self.inputCol)).alias("stddev"))\
                    .first()

        # Set mean and std of column to use later
        # we must run fit() before running transform()
        self.mean = stats["mean"]
        self.stddev = stats["stddev"]

    def _transform(self, df):
        # Scale the column using our calculated mean and sd and add the addition value
        # Use the values from the inputCol, mean, and std to scale
        # Add desired addition value
        # Store the results in the output column
        return df.withColumn(self.outputCol, (F.col(self.inputCol) - self.mean) / self.stddev + self.addition)


# Create dataframe of numbers
x = [(1.0,), (2.0,), (3.0,), (4.0,), (5.0,), (6.0,), (7.0,), (8.0,), (9.0,)]
df = spark.createDataFrame(x, ["value"])

# Use custom Standard Scaler with no addition
scaler = CustomStandardScaler(inputCol="value", outputCol="scaled")
scaler.fit(df)
df = scaler.transform(df)

# Use custom Standard Scaler with addition
scaler2 = CustomStandardScaler(inputCol="value", outputCol="scaled+addition", addition=10)
scaler2.fit(df)
scaler2.transform(df).show(truncate=False)

+-----+-------------------+------------------+
|value|scaled             |scaled+addition   |
+-----+-------------------+------------------+
|1.0  |-1.4605934866804429|8.539406513319557 |
|2.0  |-1.0954451150103321|8.904554884989668 |
|3.0  |-0.7302967433402214|9.269703256659778 |
|4.0  |-0.3651483716701107|9.634851628329889 |
|5.0  |0.0                |10.0              |
|6.0  |0.3651483716701107 |10.365148371670111|
|7.0  |0.7302967433402214 |10.730296743340222|
|8.0  |1.0954451150103321 |11.095445115010332|
|9.0  |1.4605934866804429 |11.460593486680443|
+-----+-------------------+------------------+
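The first row of the table can be checked by hand. The sketch below repeats the calculation in plain Python, relying on the fact that F.stddev computes the sample standard deviation (dividing by n - 1), as statistics.stdev also does. The helper name `scale` is introduced only for this check.

```python
import statistics

values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

mean = statistics.mean(values)     # 5.0
stddev = statistics.stdev(values)  # sample standard deviation, sqrt(60 / 8)


def scale(x, addition=0):
    # Same formula as _transform: standardize, then shift by `addition`.
    return (x - mean) / stddev + addition


print(scale(1.0))               # approximately -1.4606
print(scale(1.0, addition=10))  # approximately 8.5394
print(scale(5.0))               # 0.0
```

Since the squared deviations from the mean sum to 60 and n - 1 is 8, the standard deviation is sqrt(7.5), about 2.7386, so the value 1.0 scales to -4 / 2.7386, matching the first row of the output.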



#### Lemmatizer
What if we wanted to lemmatize our text?
MLlib doesn't have a built-in lemmatizer transformer we can use, but we can create our own using the spaCy library (spaCy and its en_core_web_sm model must be installed separately).
By creating our own, we can choose when to lemmatize the text: we can lemmatize either the raw string of text or an array of tokens.

In [8]:
import spacy
from pyspark.ml import Transformer, Pipeline
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, ArrayType

# Transformer for Lemmatizing Strings
# Example: "I am running" => "I be run"
class CustomLemmatizerString(Transformer):
    def __init__(self, inputCol, outputCol):
        super().__init__()
        self.inputCol = inputCol
        self.outputCol = outputCol

        # Initializes language from spacy
        self.lang = spacy.load("en_core_web_sm")

    # Helper function that lemmatizes the words in a given text string
    def lemmatize(self, text):
        # Uses the spaCy library to parse our string into a Doc object
        document = self.lang(text)

        # Lemmatize each word in the document and returns a string representation of the lemmatized tokens
        return ' '.join(token.lemma_ for token in document)

    # Transform function, uses helper function
    def _transform(self, df):
        # Create a User-Defined function that lemmatizes a dataframe column
        # uses the lemmatize function we created above and specifies the result type
        lemmatize_udf = F.udf(self.lemmatize, StringType())

        # Use our UDF to lemmatize the input column
        # Takes a column of strings and stores the lemmatized strings in the output column
        return df.withColumn(self.outputCol, lemmatize_udf(df[self.inputCol]))


# Create array of text
text = [('Bob is running',), ('Dreaming about eating',), ('He went to the store',)]
df = spark.createDataFrame(text, ['text'])

# Use custom lemmatizer transformer on string
lemmatizer = CustomLemmatizerString(inputCol="text", outputCol="lemmatized")
lemmatizer.transform(df).show(truncate=False)

+---------------------+------------------+
|text                 |lemmatized        |
+---------------------+------------------+
|Bob is running       |Bob be run        |
|Dreaming about eating|dream about eat   |
|He went to the store |he go to the store|
+---------------------+------------------+



In [9]:
# Transformer for Lemmatizing an array of tokens
# Example: ["I", "am", "running"] => ["I", "be", "run"]
class CustomLemmatizerTokens(Transformer):
    def __init__(self, inputCol, outputCol):
        super().__init__()
        self.inputCol = inputCol
        self.outputCol = outputCol

        # Initializes language from spacy
        self.lang = spacy.load("en_core_web_sm")

    # Helper function that lemmatizes the words in a given array of tokens
    def lemmatize(self, tokens):
        # Joins the tokens into a string and creates a spaCy Doc object
        text = self.lang(" ".join(tokens))

        # Lemmatizes words and returns as array
        return [word.lemma_ for word in text]

    # Transform function, uses helper function
    def _transform(self, df):
        # Create a User-Defined function that lemmatizes a dataframe column
        # Uses the lemmatize method we created above, and specifies array result type
        lemmatize_udf = F.udf(self.lemmatize, ArrayType(StringType()))

        # Use our UDF to Lemmatize the input column
        # Takes in a column containing arrays of words
        # Lemmatizes each word of arrays in the input column
        # Stores the arrays of lemmatized words in the output column
        return df.withColumn(self.outputCol, lemmatize_udf(df[self.inputCol]))


# Create array of text
text = [('Bob is running',), ('Dreaming about eating',), ('He went to the store',)]
df = spark.createDataFrame(text, ['text'])

# Create my custom transformers
tokenizer = CustomTokenizer(inputCol="text", outputCol="tokens")
lemmatizer = CustomLemmatizerTokens(inputCol="tokens", outputCol="lemmatized")

# Our custom Transformers can also be used in a Pipeline like built-in Transformers
# Our lemmatizer Transformer can be used in our machine learning preprocessing pipeline

# Create a pipeline and fit/transform data
pipeline = Pipeline(stages=[tokenizer, lemmatizer])
model = pipeline.fit(df)
model.transform(df).show(truncate=False)

+---------------------+--------------------------+------------------------+
|text                 |tokens                    |lemmatized              |
+---------------------+--------------------------+------------------------+
|Bob is running       |[bob, is, running]        |[bob, be, run]          |
|Dreaming about eating|[dreaming, about, eating] |[dream, about, eat]     |
|He went to the store |[he, went, to, the, store]|[he, go, to, the, store]|
+---------------------+--------------------------+------------------------+



### Notes for Custom Transformers
- Your custom transformer class must implement a _transform() method
- It must inherit from the base MLlib Transformer class
- Leverage DataFrame operations inside _transform()
- Custom transformers can be used in pipelines just like built-in transformers
- You can be as creative as you like!