![Cloud-First](../image/CloudFirst.png)


# SIT742: Modern Data Science
**(Module: Big Data Manipulation)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, change and distribute this package.
- If you find any issue or bug in this document, please submit an issue at [tulip-lab/sit742](https://github.com/tulip-lab/sit742/issues)


Prepared by **SIT742 Teaching Team**

---


## Session 4I: SparkSQL and Data Understanding
---

### Table of Contents

Part A: Spark Text Data Foundation

1. Word2Vec in Spark

2. Tokenization on the Given Document

3. Stop Words Removal
4. N Gram Tokenization
5. CountVectorizer Representation

Part B: Advanced Representation

6. Binarizer on the Continuous Feature

7. PCA in Spark
8. Polynomial Representation For Continuous Data
9. Min Max Scaler in Spark
10. Discretization in Spark
11. Imputation in Spark


---

## Introduction ##

This notebook introduces Spark's capabilities for handling text data and continuous data with the `pyspark.ml.feature` module.
The module provides a wide array of feature transformers for preparing data for machine learning in a scalable, distributed manner. It is designed to integrate smoothly into ML pipelines built with the `pyspark.ml` API.


# Part A: Spark Text Data Foundation

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local[*]') \
    .appName("Intro") \
    .getOrCreate()


Word2Vec in Spark

In [None]:
from pyspark.ml.feature import Word2Vec

# Input data: Each row is a bag of words from a sentence or document.
documentDF = spark.createDataFrame([
    ("Hi this is from SIT742".split(" "), ),
    ("I THINK sit742 IS a good unit".split(" "), ),
    ("Spark is not hard to learn".split(" "), )
], ["text"])

# Learn a mapping from words to Vectors.
word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="text", outputCol="result")
model = word2Vec.fit(documentDF)

result = model.transform(documentDF)
for row in result.collect():
    text, vector = row
    print("Text: [%s] => \nVector: %s\n" % (", ".join(text), str(vector)))
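
The vector printed for each row is a single document-level vector, not one vector per word. As a rough illustration, the plain-Python sketch below averages word vectors over a document, which is how the fitted model is understood to combine them. The vectors in `word_vectors` are hypothetical values made up for this sketch (a fitted model would learn them), and `document_vector` is an illustrative helper, not part of the Spark API.

```python
# Hypothetical 3-dimensional word vectors; a fitted Word2Vec model would learn these.
word_vectors = {
    "spark": [0.2, -0.1, 0.4],
    "is": [0.0, 0.3, -0.2],
    "fun": [0.4, 0.1, 0.1],
}


def document_vector(words, vectors, size=3):
    """Sum the vectors of known words and divide by the number of words in the document."""
    if not words:
        return [0.0] * size
    total = [0.0] * size
    for word in words:
        # Words missing from the vocabulary contribute nothing to the sum.
        for i, component in enumerate(vectors.get(word, [0.0] * size)):
            total[i] += component
    return [component / len(words) for component in total]


doc_vec = document_vector(["spark", "is", "fun"], word_vectors)
print(doc_vec)
```

Note that the cell above does not lowercase its input, so `SIT742` and `sit742` are treated as two different words.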

Tokenization on the Given Document

In [None]:
from pyspark.ml.feature import Tokenizer, RegexTokenizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

sentenceDataFrame = spark.createDataFrame([
    (0, "Hi|this|is|from|SIT742"),
    (1, "I     THINK sit742  IS a  good unit"),
    (2, "Spark,is,not,hard,to,learn")
], ["id", "sentence"])

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")

regexTokenizer = RegexTokenizer(inputCol="sentence", outputCol="words", pattern="\\W")
# alternatively, pattern="\\w+", gaps=False

countTokens = udf(lambda words: len(words), IntegerType())

tokenized = tokenizer.transform(sentenceDataFrame)
tokenized.select("sentence", "words")\
    .withColumn("tokens", countTokens(col("words"))).show(truncate=False)

regexTokenized = regexTokenizer.transform(sentenceDataFrame)
regexTokenized.select("sentence", "words") \
    .withColumn("tokens", countTokens(col("words"))).show(truncate=False)
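
The two tokenizers give different token counts on the same sentences. The sketch below mimics both in plain Python with the `re` module, assuming that `Tokenizer` lowercases and splits on each whitespace character, and that `RegexTokenizer` with `pattern="\\W"` treats non-word characters as separators and drops empty tokens. The helper names `whitespace_tokens` and `regex_tokens` are illustrative, not Spark functions.

```python
import re

sentences = [
    "Hi|this|is|from|SIT742",
    "I     THINK sit742  IS a  good unit",
    "Spark,is,not,hard,to,learn",
]


def whitespace_tokens(text):
    # Assumed Tokenizer behaviour: lowercase, then split on every single whitespace character,
    # so runs of spaces produce empty-string tokens.
    return re.split(r"\s", text.lower())


def regex_tokens(text, pattern=r"\W", min_length=1):
    # Assumed RegexTokenizer behaviour (gaps=True): the pattern marks separators,
    # and tokens shorter than min_length (such as empty strings) are dropped.
    return [t for t in re.split(pattern, text.lower()) if len(t) >= min_length]


for s in sentences:
    print(len(whitespace_tokens(s)), len(regex_tokens(s)))
```

In this sketch only the regex version separates the `|` and `,` delimited sentences into words, and the repeated spaces in the second sentence inflate the whitespace-based count.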

Stop Words Removal

In [None]:
from pyspark.ml.feature import StopWordsRemover

sentenceData = spark.createDataFrame([
    (0, ["I", "LIKE", "the", "UNIT", "?"]),
    (1, ["tHANKS", "AGAIN", "for", "your", "help"])
], ["id", "raw"])

remover = StopWordsRemover(inputCol="raw", outputCol="filtered")
remover.transform(sentenceData).show(truncate=False)
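
Conceptually, stop word removal is a filter over each token list. The sketch below assumes case-insensitive matching (the default in Spark, as far as the documentation states) and uses a tiny made-up stop list, `STOP_WORDS`, in place of the much longer default English list that Spark ships with.

```python
# Small illustrative stop list; this is NOT Spark's default list.
STOP_WORDS = {"i", "the", "for", "your", "again"}


def remove_stop_words(tokens, stop_words=STOP_WORDS, case_sensitive=False):
    """Drop tokens that appear in the stop list, keeping the original casing of the rest."""
    if case_sensitive:
        return [t for t in tokens if t not in stop_words]
    return [t for t in tokens if t.lower() not in stop_words]


filtered = remove_stop_words(["I", "LIKE", "the", "UNIT", "?"])
print(filtered)  # ['LIKE', 'UNIT', '?']
```

Punctuation such as `?` survives because it is not in the stop list, so it has to be removed at the tokenization step if it is unwanted.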

N Gram Tokenization

In [None]:
from pyspark.ml.feature import NGram
from pyspark.ml.feature import Tokenizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

sentenceDataFrame = spark.createDataFrame([
    (0, "Hi this is from SIT742"),
    (1, "I THINK sit742 IS a good unit"),
    (2, "Spark is not hard to learn")
], ["id", "sentence"])


tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
countTokens = udf(lambda words: len(words), IntegerType())
wordDataFrame = tokenizer.transform(sentenceDataFrame)

ngram = NGram(n=2, inputCol="words", outputCol="ngrams")

ngramDataFrame = ngram.transform(wordDataFrame)
ngramDataFrame.select("ngrams").show(truncate=False)
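
An n-gram is a sliding window of `n` consecutive tokens joined into one string. A minimal plain-Python sketch of the same idea follows; the helper name `ngrams` is illustrative.

```python
def ngrams(tokens, n=2):
    """Join every run of n consecutive tokens with a space."""
    # If there are fewer than n tokens, the range is empty and no n-grams are produced.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


words = "hi this is from sit742".split(" ")
bigrams = ngrams(words, 2)
print(bigrams)  # ['hi this', 'this is', 'is from', 'from sit742']
```

A sentence with `k` tokens yields `k - n + 1` n-grams, so five tokens give four bigrams here.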

CountVectorizer Representation

In [None]:
from pyspark.ml.feature import CountVectorizer
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, StringType

# Create initial DataFrame
df = spark.createDataFrame([
    (0, "Hi this is from SIT742".split(" ")),
    (1, "I THINK sit742 IS a good unit".split(" "))
], ["id", "words"])

# Define a UDF to lowercase all words
lowercase_udf = udf(lambda words: [w.lower() for w in words], ArrayType(StringType()))

# Apply the UDF
df_lower = df.withColumn("words_lower", lowercase_udf(col("words")))

# Fit a CountVectorizerModel from the lowercased corpus
cv = CountVectorizer(inputCol="words_lower", outputCol="features", minDF=1.0)
model = cv.fit(df_lower)
result = model.transform(df_lower)

# Show results
result.select("id", "words_lower", "features").show(truncate=False)

print("Vocabulary learned by CountVectorizer:")
print(model.vocabulary)
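
To see what the `features` column encodes, the sketch below builds a vocabulary and count vectors by hand with `collections.Counter`. Ordering the vocabulary by descending corpus frequency follows the documented behaviour; breaking ties alphabetically is an assumption made here only so the result is reproducible, and Spark may order tied terms differently. The vectors are shown in dense form, whereas Spark prints them as sparse vectors.

```python
from collections import Counter

docs = [
    "hi this is from sit742".split(" "),
    "i think sit742 is a good unit".split(" "),
]

# Corpus-wide term counts decide the vocabulary order (most frequent first).
term_counts = Counter(word for doc in docs for word in doc)
vocabulary = sorted(term_counts, key=lambda w: (-term_counts[w], w))


def count_vector(doc):
    """Return one count per vocabulary term, in vocabulary order."""
    counts = Counter(doc)
    return [counts.get(term, 0) for term in vocabulary]


vectors = [count_vector(doc) for doc in docs]
print(vocabulary)
print(vectors)
```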

# Part B: Advanced Representation

Binarizer on the Continuous Feature

In [None]:
from pyspark.ml.feature import Binarizer

continuousDataFrame = spark.createDataFrame([
    (0, 3.1),
    (1, 9.3),
    (2, 27.9)
], ["id", "feature"])

binarizer = Binarizer(threshold=9.3, inputCol="feature", outputCol="binarized_feature")

binarizedDataFrame = binarizer.transform(continuousDataFrame)

print("Binarizer output with Threshold = %f" % binarizer.getThreshold())
binarizedDataFrame.show()
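
The threshold comparison is strict: only values greater than the threshold map to 1.0, so the row whose feature equals the threshold (9.3) becomes 0.0. A one-function sketch of that rule, with `binarize` as an illustrative name:

```python
def binarize(values, threshold):
    """Map values strictly greater than the threshold to 1.0 and all others to 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]


binarized = binarize([3.1, 9.3, 27.9], threshold=9.3)
print(binarized)  # [0.0, 0.0, 1.0]
```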

PCA in Spark

In [None]:
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

data = [(Vectors.sparse(5, [(1, 3.5), (2, 9.5)]),),
        (Vectors.dense([1.0, 2.0, 4.0, 5.0, 6.0]),),
        (Vectors.dense([2.0, 3.0, 5.0, 7.0, 9.0]),)]
df = spark.createDataFrame(data, ["features"])

pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)

result = model.transform(df).select("pcaFeatures")
result.show(truncate=False)

Polynomial Representation For Continuous Data

In [None]:
from pyspark.ml.feature import PolynomialExpansion
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame([
    (Vectors.dense([3.0, 3.0]),),
    (Vectors.dense([9.0, 3.0]),)
], ["features"])

polyExpansion = PolynomialExpansion(degree=5, inputCol="features", outputCol="polyFeatures")
polyDF = polyExpansion.transform(df)

polyDF.show(truncate=False)
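
Polynomial expansion replaces each feature vector with all products of its entries up to the chosen degree. The sketch below enumerates those monomials with `itertools`; the order of terms is an assumption and will generally differ from the order Spark uses, but the set of values and the output length, C(n + d, d) - 1 for n features and degree d, should match.

```python
from itertools import combinations_with_replacement
from math import comb, prod


def polynomial_expansion(features, degree):
    """Return every monomial of total degree 1..degree over the input features."""
    expanded = []
    for d in range(1, degree + 1):
        # Each multiset of d feature indices corresponds to one monomial of degree d.
        for combo in combinations_with_replacement(range(len(features)), d):
            expanded.append(prod(features[i] for i in combo))
    return expanded


degree_two = polynomial_expansion([2.0, 3.0], degree=2)
print(sorted(degree_two))  # [2.0, 3.0, 4.0, 6.0, 9.0] -> x, y, x*x, x*y, y*y
print(len(polynomial_expansion([3.0, 3.0], degree=5)), comb(2 + 5, 5) - 1)
```

With two features and `degree=5`, as in the cell above, each row grows from 2 to 20 values, which is why high degrees become expensive quickly.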


Min Max Scaler in Spark

In [None]:
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.linalg import Vectors

dataFrame = spark.createDataFrame([
    (0, Vectors.dense([1.0, 2.1, -18.0,100]),),
    (1, Vectors.dense([2.0, 2.0, -22.0,200]),),
    (2, Vectors.dense([4.0, 1000.0, 18.0,0]),)
], ["id", "features"])

scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")

# Compute summary statistics and generate MinMaxScaler
scalerModel = scaler.fit(dataFrame)

scaledData = scalerModel.transform(dataFrame)

scaledData.select("features", "scaledFeatures").show()
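
Min-max scaling works column by column: each value is rescaled as (x - min) / (max - min) * (upper - lower) + lower, with the default range [0, 1]. The sketch below applies that formula to the same rows in plain Python. Mapping a constant column to the midpoint of the range follows the behaviour described in the Spark documentation; the helper name `min_max_scale` is illustrative.

```python
def min_max_scale(rows, lower=0.0, upper=1.0):
    """Rescale every column of a list of rows to the range [lower, upper]."""
    scaled_columns = []
    for column in zip(*rows):
        col_min, col_max = min(column), max(column)
        if col_max == col_min:
            # A constant column carries no spread, so use the midpoint of the range.
            scaled = [0.5 * (upper + lower)] * len(column)
        else:
            scaled = [(x - col_min) / (col_max - col_min) * (upper - lower) + lower
                      for x in column]
        scaled_columns.append(scaled)
    # Transpose back from columns to rows.
    return [list(row) for row in zip(*scaled_columns)]


rows = [
    [1.0, 2.1, -18.0, 100.0],
    [2.0, 2.0, -22.0, 200.0],
    [4.0, 1000.0, 18.0, 0.0],
]
scaled = min_max_scale(rows)
for row in scaled:
    print(row)
```

The second column shows the scaler's sensitivity to outliers: the single value 1000.0 pushes 2.1 and 2.0 to almost the same scaled value near 0.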

Discretization in Spark

In [None]:
from pyspark.ml.feature import Bucketizer

splits = [-float("inf"), -0.5, 0.0, 0.5, float("inf")]

data = [(-221.9,), (-0.5,), (-0.1,), (0.1,), (3.2,), (1999.9,)]
dataFrame = spark.createDataFrame(data, ["features"])

bucketizer = Bucketizer(splits=splits, inputCol="features", outputCol="bucketedFeatures")

# Transform original data into its bucket index.
bucketedData = bucketizer.transform(dataFrame)

print("Bucketizer output with %d buckets" % (len(bucketizer.getSplits())-1))
bucketedData.show()
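
With n + 1 split points there are n buckets, and each bucket covers the half-open interval [x, y) except the last, which also includes its upper bound. The sketch below reproduces the bucket lookup with the standard-library `bisect` module; `bucketize` is an illustrative helper, and raising an error for out-of-range values is an assumption that corresponds to the default error handling.

```python
from bisect import bisect_right

splits = [-float("inf"), -0.5, 0.0, 0.5, float("inf")]


def bucketize(value, splits):
    """Return the index of the bucket [splits[i], splits[i + 1]) that contains value."""
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} is outside the split range")
    if value == splits[-1]:
        # The last bucket is closed on the right.
        return len(splits) - 2
    return bisect_right(splits, value) - 1


values = [-221.9, -0.5, -0.1, 0.1, 3.2, 1999.9]
buckets = [bucketize(v, splits) for v in values]
print(buckets)  # [0, 1, 1, 2, 3, 3]
```

Note that -0.5 lands in bucket 1, not bucket 0, because the lower bound of each bucket is inclusive.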

Imputation in Spark

In [None]:
from pyspark.ml.feature import Imputer

df = spark.createDataFrame([
    (2.0, float("nan")),
    (3.0, float("nan")),
    (float("nan"), 7.0),
    (4.0, 4.0),
    (5.0, 7.0)
], ["a", "b"])

imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", "out_b"])
model = imputer.fit(df)

model.transform(df).show()
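
By default the imputer replaces missing values with the mean of the observed values in the same column. A plain-Python sketch of that default, using the same two columns, is given below; `impute_mean` is an illustrative helper and it assumes each column has at least one observed value.

```python
import math


def impute_mean(column):
    """Replace NaN entries with the mean of the non-NaN entries of the same column."""
    observed = [x for x in column if not math.isnan(x)]
    mean = sum(observed) / len(observed)
    return [mean if math.isnan(x) else x for x in column]


a = [2.0, 3.0, float("nan"), 4.0, 5.0]
b = [float("nan"), float("nan"), 7.0, 4.0, 7.0]
out_a = impute_mean(a)
out_b = impute_mean(b)
print(out_a)  # [2.0, 3.0, 3.5, 4.0, 5.0]
print(out_b)  # [6.0, 6.0, 7.0, 4.0, 7.0]
```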

# Now Your Turn: can you import a text dataset and process it with the steps learned above?


1.   Do you need to consider N-gram tokenization? How does it differ from normal tokenization?
2.   How would you conduct the processing if the text data is combined with continuous data?






## Example: Training and Evaluating an ML Model on Text Data with Spark

In [None]:
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, IDF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Create Spark session
spark = SparkSession.builder.appName("TextLogisticRegression").getOrCreate()

# Sample labeled text data
data = [
    (0, "I hate this unit, it's too hard", 0),
    (1, "This is an amazing class!", 1),
    (2, "I think the course is difficult", 0),
    (3, "What a great subject", 1),
    (4, "This unit is not good", 0),
    (5, "Excellent content and teaching", 1)
]
columns = ["id", "text", "label"]
df = spark.createDataFrame(data, columns)

# Text processing stages
tokenizer = Tokenizer(inputCol="text", outputCol="words")
remover = StopWordsRemover(inputCol="words", outputCol="filtered_words")
vectorizer = CountVectorizer(inputCol="filtered_words", outputCol="raw_features")
idf = IDF(inputCol="raw_features", outputCol="features")

# Logistic Regression model
lr = LogisticRegression(featuresCol="features", labelCol="label")

# Build pipeline
pipeline = Pipeline(stages=[tokenizer, remover, vectorizer, idf, lr])

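# Train/test split (with only six rows the test split can be tiny or contain a single class,
# so treat the reported metric as a demonstration only)
train_data, test_data = df.randomSplit([0.8, 0.2], seed=42)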

# Fit model
model = pipeline.fit(train_data)

# Evaluate model
predictions = model.transform(test_data)
predictions.select("text", "label", "prediction", "probability").show(truncate=False)

# Compute the area under the ROC curve (the evaluator's default metric, not accuracy)
evaluator = BinaryClassificationEvaluator(labelCol="label")
auc = evaluator.evaluate(predictions)
print(f"Test AUC = {auc:.3f}")
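
The value reported above is the area under the ROC curve (AUC), which can be read as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. The sketch below computes that quantity by direct pairwise comparison on made-up labels and scores. It illustrates the metric only; Spark derives the area from the ROC curve itself, and `auc_from_scores` is a hypothetical helper.

```python
def auc_from_scores(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted scores, not output from the pipeline above.
example_auc = auc_from_scores([0, 1, 0, 1], [0.1, 0.9, 0.4, 0.35])
print(example_auc)  # 0.75
```

The guard clause also shows why a test split containing only one class cannot produce a meaningful AUC.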