# Multilayer Perceptron Classifier (MLPC) in PySpark

This notebook demonstrates how to use the **Multilayer Perceptron Classifier (MLPC)** in PySpark for classification tasks.
We will cover:
1. Data preparation.
2. Building an MLPC model.
3. Training and evaluating the model.

## Step 1: Set Up Spark Session
First, we create a Spark session to work with PySpark.

In [1]:
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder \
    .appName("MLPC Example") \
    .getOrCreate()


## Step 2: Load and Prepare Data
We create a small dataset for binary classification: the four rows of the XOR truth table. Each row holds two features and a label.

In [2]:
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors

# Create sample dataset (XOR truth table).
# MLPC requires the features column to be an ML vector, so each list is wrapped in Vectors.dense.
data = [
    Row(features=Vectors.dense([0.0, 0.0]), label=0.0),
    Row(features=Vectors.dense([0.0, 1.0]), label=1.0),
    Row(features=Vectors.dense([1.0, 0.0]), label=1.0),
    Row(features=Vectors.dense([1.0, 1.0]), label=0.0)
]

# Convert to DataFrame
df = spark.createDataFrame(data)
df.show()

+---------+-----+
| features|label|
+---------+-----+
|[0.0,0.0]|  0.0|
|[0.0,1.0]|  1.0|
|[1.0,0.0]|  1.0|
|[1.0,1.0]|  0.0|
+---------+-----+
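
As a quick sanity check, the labels above follow the XOR rule: the label is 1 exactly when the two features differ. The sketch below verifies this in plain Python, with no Spark session needed. The `rows` list and the `xor_label` helper are illustrative names, not part of the notebook.

```python
# The four (features, label) pairs from the sample dataset, as plain tuples.
rows = [
    ((0.0, 0.0), 0.0),
    ((0.0, 1.0), 1.0),
    ((1.0, 0.0), 1.0),
    ((1.0, 1.0), 0.0),
]


def xor_label(x1: float, x2: float) -> float:
    """Return 1.0 when exactly one of the two inputs is 1.0, else 0.0."""
    return float(int(x1) ^ int(x2))


# Every label in the dataset should match the XOR of its two features.
is_xor = all(xor_label(x1, x2) == label for (x1, x2), label in rows)
print(is_xor)
```

XOR is not linearly separable, which is why the network needs a hidden layer at all.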



## Step 3: Split Data into Training and Test Sets
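We split the dataset into roughly 80% training and 20% testing to evaluate the model. `randomSplit` assigns rows at random, so the weights are only targets: with four rows, the actual split sizes can differ from these proportions, and the test set may even be empty.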

In [3]:
# Split data into training and testing sets
train_data, test_data = df.randomSplit([0.8, 0.2], seed=1234)
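
To see why the split sizes vary, here is a simplified plain-Python model of a weighted random split: each row draws a uniform number and lands in the bucket whose cumulative weight range contains it. This is a sketch of the idea under that assumption, not Spark's implementation, and `simulate_split` is a hypothetical helper. It will not reproduce the rows Spark selects for the same seed.

```python
import random


def simulate_split(rows, weights, seed):
    """Assign each row independently to a bucket with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds, e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    buckets = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, upper in enumerate(bounds):
            # The last bucket catches any value left over by rounding.
            if u < upper or i == len(bounds) - 1:
                buckets[i].append(row)
                break
    return buckets


rows = ["row0", "row1", "row2", "row3"]
# Try several seeds: the train/test sizes change from run to run.
for seed in range(5):
    train, test = simulate_split(rows, [0.8, 0.2], seed)
    print(seed, len(train), len(test))
```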

## Step 4: Build the MLPC Model
We use PySpark's `MultilayerPerceptronClassifier` to create the model.

In [4]:
from pyspark.ml.classification import MultilayerPerceptronClassifier

# Define the model layers (input layer, hidden layers, output layer)
# In this case: 2 input neurons, 2 hidden neurons, 2 output neurons (binary classification)
layers = [2, 2, 2]

# Initialize the MLPC model
mlpc = MultilayerPerceptronClassifier(featuresCol="features", labelCol="label", layers=layers, seed=1234)
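
The `layers` list also fixes the size of the model. Assuming each pair of consecutive layers is fully connected with one bias per output neuron, the number of trainable weights can be counted as in the sketch below. `count_weights` is a hypothetical helper written for this illustration, not a PySpark function.

```python
def count_weights(layers):
    """Count weights and biases in a fully connected feed-forward network."""
    total = 0
    for n_in, n_out in zip(layers, layers[1:]):
        # n_in * n_out connection weights plus n_out bias terms.
        total += n_in * n_out + n_out
    return total


print(count_weights([2, 2, 2]))  # (2*2 + 2) + (2*2 + 2) = 12
```

If that assumption holds, the count should equal the length of the fitted model's `weights` vector after training.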


## Step 5: Train the Model
Fit the MLPC model on the training data.

In [5]:
# Train the model
model = mlpc.fit(train_data)

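`fit` requires the `features` column to be an ML vector (`VectorUDT`). If the column holds plain Python lists instead, which Spark infers as `array<double>`, the call fails with:

IllegalArgumentException: requirement failed: Column features must be of type class org.apache.spark.ml.linalg.VectorUDT:struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually class org.apache.spark.sql.types.ArrayType:array<double>.

Building each row's features with `Vectors.dense` from `pyspark.ml.linalg` avoids this error.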

## Step 6: Make Predictions
Use the trained model to make predictions on the test data.

In [None]:
# Make predictions on the test data
predictions = model.transform(test_data)
predictions.show()
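
Conceptually, each prediction is a forward pass through the network followed by picking the most probable class. The sketch below runs a 2-2-2 network in plain Python, assuming sigmoid hidden units and a softmax output. The weights are hand-picked for illustration so that the network computes XOR; they are not the values Spark learns, and `forward` is a hypothetical helper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x1, x2):
    """Return (class probabilities, predicted class) for one input row."""
    # Hidden layer: h_or behaves like OR, h_nand like NAND.
    h_or = sigmoid(20 * x1 + 20 * x2 - 10)
    h_nand = sigmoid(-20 * x1 - 20 * x2 + 30)
    # Output layer: class 1 scores high only when both hidden units are on.
    scores = [0.0, 20 * h_or + 20 * h_nand - 30]
    probs = softmax(scores)
    prediction = float(probs.index(max(probs)))
    return probs, prediction


for x1, x2 in [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]:
    probs, prediction = forward(x1, x2)
    print((x1, x2), prediction)
```

In the DataFrame returned by `model.transform`, the corresponding values appear in the `probability` and `prediction` columns.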

## Step 7: Evaluate the Model
Evaluate the model on the test predictions using accuracy, the fraction of rows whose predicted label matches the true label.

In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Evaluate the model's accuracy
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print(f"Accuracy: {accuracy * 100:.2f}%")
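
Accuracy is the fraction of rows where `prediction` equals `label`. The plain-Python sketch below computes the same quantity on made-up (label, prediction) pairs. The pairs and the `accuracy_of` helper are hypothetical and are not results from this notebook.

```python
def accuracy_of(pairs):
    """Fraction of (label, prediction) pairs that agree."""
    if not pairs:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for label, prediction in pairs if label == prediction)
    return correct / len(pairs)


# Hypothetical pairs: three correct predictions out of four.
pairs = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 0.0)]
print(f"Accuracy: {accuracy_of(pairs) * 100:.2f}%")  # Accuracy: 75.00%
```

The empty-input guard is relevant here: with only four rows, a random split can leave the test set empty, and accuracy is undefined in that case.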

## Step 8: Stop the Spark Session
Finally, we stop the Spark session to release resources.

In [None]:
# Stop the Spark session
spark.stop()