# KMEANS CLUSTERING

## Importing necessary libraries
- `from pyspark.sql import SparkSession`: Imports the `SparkSession` class from `pyspark.sql`. `SparkSession` is the entry point to programming Spark with the Dataset and DataFrame API.

- `spark = SparkSession.builder.config("spark.driver.memory", "16g").appName('chapter_5').getOrCreate()`: Returns the existing `SparkSession` if there is one, or creates a new one named `chapter_5` that requests 16 GB of driver memory. `builder` is an attribute of `SparkSession` that collects these settings before `getOrCreate()` is called.

In [None]:
import pyspark
from pyspark import SparkContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.config("spark.driver.memory", "16g").appName('chapter_5').getOrCreate()




## Reading a CSV File Without Header in PySpark

The code cell below reads a CSV file without a header into a PySpark DataFrame:

- `data_without_header = spark.read.option("inferSchema", True).option("header", False).csv("kddcup.data_10_percent_corrected")`: Reads the CSV file into the DataFrame `data_without_header`. The option `inferSchema=True` tells Spark to infer column types from the data, and `header=False` indicates that the file has no header row.

- `data_without_header.summary().show()`: Not run in the cell below, but it can be used to display summary statistics for the columns. `summary` is a method, so `print(data_without_header.summary)` without the call parentheses prints only the bound method, not the statistics.


In [None]:
data_without_header = spark.read.option("inferSchema", True).option("header", False).csv("kddcup.data_10_percent_corrected")


## Defining column names for the DataFrame

In [None]:
column_names = [
    "duration", "protocol_type", "service", "flag",
    "src_bytes", "dst_bytes", "land", "wrong_fragment", "urgent",
    "hot", "num_failed_logins", "logged_in", "num_compromised",
    "root_shell", "su_attempted", "num_root", "num_file_creations",
    "num_shells", "num_access_files", "num_outbound_cmds",
    "is_host_login", "is_guest_login", "count", "srv_count",
    "serror_rate", "srv_serror_rate", "rerror_rate", "srv_rerror_rate",
    "same_srv_rate", "diff_srv_rate", "srv_diff_host_rate",
    "dst_host_count", "dst_host_srv_count",
    "dst_host_same_srv_rate", "dst_host_diff_srv_rate",
    "dst_host_same_src_port_rate", "dst_host_srv_diff_host_rate",
    "dst_host_serror_rate", "dst_host_srv_serror_rate",
    "dst_host_rerror_rate", "dst_host_srv_rerror_rate",
    "label"
]

## Converting the DataFrame without header to a DataFrame with column names


In [None]:
data = data_without_header.toDF(*column_names)

## Import col

- `col` function from `pyspark.sql.functions`: Used to refer to a column in a DataFrame. It allows you to access and manipulate columns when working with PySpark DataFrames.


In [None]:
from pyspark.sql.functions import col

## Task Description: Selecting, Grouping, Counting, Ordering, and Showing Top 25 Results

1. **Selecting "label" Column**: Select the "label" column from the DataFrame.

2. **Grouping by "label" Column**: Group the DataFrame by the "label" column.

3. **Counting Occurrences**: Count the occurrences of each unique label.

4. **Ordering by Count**: Order the result by the count of occurrences in descending order.

5. **Showing Top 25 Results**: Display the top 25 results after ordering.


In [None]:
data.select("label").groupBy("label").count().orderBy(col("count").desc()).show(25)

+----------------+------+
|           label| count|
+----------------+------+
|          smurf.|280790|
|        neptune.|107201|
|         normal.| 97278|
|           back.|  2203|
|          satan.|  1589|
|        ipsweep.|  1247|
|      portsweep.|  1040|
|    warezclient.|  1020|
|       teardrop.|   979|
|            pod.|   264|
|           nmap.|   231|
|   guess_passwd.|    53|
|buffer_overflow.|    30|
|           land.|    21|
|    warezmaster.|    20|
|           imap.|    12|
|        rootkit.|    10|
|     loadmodule.|     9|
|      ftp_write.|     8|
|       multihop.|     7|
|            phf.|     4|
|           perl.|     3|
|            spy.|     2|
+----------------+------+
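
The select, group, count, and order steps can also be illustrated without Spark. The sketch below is plain Python over a small hypothetical list of labels (not the real dataset), and it only mirrors the semantics of the chain above.

```python
from collections import Counter

# Hypothetical labels standing in for the "label" column.
labels = ["smurf.", "normal.", "smurf.", "neptune.", "smurf.", "normal."]

# groupBy("label").count() corresponds to counting occurrences per label.
label_counts = Counter(labels)

# orderBy(col("count").desc()) followed by show(25) corresponds to
# sorting by count, largest first, and keeping at most 25 rows.
top_labels = sorted(label_counts.items(), key=lambda item: item[1], reverse=True)[:25]

for label, count in top_labels:
    print(f"{label:>10} {count}")
```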



## Task Description: Creating a Pipeline for KMeans Clustering

1. **Importing Required Classes**: Import the `VectorAssembler` class from `pyspark.ml.feature`, the `KMeans` class from `pyspark.ml.clustering`, and the `Pipeline` class from `pyspark.ml`.

2. **Creating VectorAssembler**: Create a `VectorAssembler` instance to assemble feature columns into a single feature vector column.

3. **Creating KMeans Model**: Create a `KMeans` instance to define the KMeans clustering model. Parameters such as the number of clusters can be set on it; the first model below leaves them unset, so it uses the default of `k=2`.

4. **Creating Pipeline**: Create a `Pipeline` instance and set its stages to include the `VectorAssembler` and `KMeans` stages.

5. **Overall Purpose**: The pipeline is used to transform the input DataFrame by assembling the features into a vector and then applying KMeans clustering to the feature vectors.
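
To make the two stages concrete, here is a minimal pure-Python sketch of assembling features and clustering them. It is not how Spark implements either stage: the helper names (`assemble`, `farthest_first`, `nearest`, `kmeans`) and the rows are hypothetical, and the farthest-first initialization is a simplification chosen to keep the example deterministic (Spark's `KMeans` defaults to the `k-means||` initialization).

```python
import math

def assemble(row, feature_names):
    """Mimic the assembling step: collect the named columns into one list."""
    return [float(row[name]) for name in feature_names]

def nearest(point, centers):
    """Index of the center closest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def farthest_first(points, k):
    """Deterministic initialization: start at the first point, then keep
    adding the point that is farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, iterations=10):
    """Plain Lloyd iterations: assign points, then move centers to the means."""
    centers = farthest_first(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Keep the old center if no point was assigned to it.
        centers = [
            [sum(dim) / len(g) for dim in zip(*g)] if g else c
            for g, c in zip(groups, centers)
        ]
    return centers

# Hypothetical rows with two numeric features and a label.
rows = [
    {"src_bytes": 1, "dst_bytes": 2, "label": "a."},
    {"src_bytes": 2, "dst_bytes": 1, "label": "a."},
    {"src_bytes": 100, "dst_bytes": 101, "label": "b."},
    {"src_bytes": 101, "dst_bytes": 100, "label": "b."},
]
points = [assemble(r, ["src_bytes", "dst_bytes"]) for r in rows]
centers = kmeans(points, k=2)
clusters = [nearest(p, centers) for p in points]
print(clusters)
```

The label is never passed to the clustering step; it is only kept alongside the rows so that clusters can be compared with labels afterwards.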


In [None]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml import Pipeline

#### Dropping non-numeric columns and caching the DataFrame



In [None]:
numeric_only = data.drop("protocol_type", "service", "flag").cache()

#### Creating a VectorAssembler to assemble the feature columns into a single feature vector column



In [None]:
assembler = VectorAssembler().setInputCols(numeric_only.columns[:-1]).setOutputCol("featureVector")

#### Creating a KMeans model with the prediction column set to "cluster" and features column set to "featureVector"


In [None]:
kmeans = KMeans().setPredictionCol("cluster").setFeaturesCol("featureVector")

#### Creating a pipeline with the VectorAssembler and KMeans model


In [None]:
pipeline = Pipeline().setStages([assembler, kmeans])

#### Fitting the pipeline to the numeric_only DataFrame


In [None]:
pipeline_model = pipeline.fit(numeric_only)

#### Extracting the KMeans model from the pipeline model

In [None]:
kmeans_model = pipeline_model.stages[1]

#### Printing the cluster centers


In [None]:
from pprint import pprint
pprint(kmeans_model.clusterCenters())

[array([4.79793956e+01, 1.62207883e+03, 8.68534183e+02, 4.45326100e-05,
       6.43293794e-03, 1.41694668e-05, 3.45168212e-02, 1.51815716e-04,
       1.48247035e-01, 1.02121372e-02, 1.11331525e-04, 3.64357718e-05,
       1.13517671e-02, 1.08295211e-03, 1.09307315e-04, 1.00805635e-03,
       0.00000000e+00, 0.00000000e+00, 1.38658354e-03, 3.32286248e+02,
       2.92907143e+02, 1.76685418e-01, 1.76607809e-01, 5.74330999e-02,
       5.77183920e-02, 7.91548844e-01, 2.09816404e-02, 2.89968625e-02,
       2.32470732e+02, 1.88666046e+02, 7.53781203e-01, 3.09056111e-02,
       6.01935529e-01, 6.68351484e-03, 1.76753957e-01, 1.76441622e-01,
       5.81176268e-02, 5.74111170e-02]),
 array([2.0000000e+00, 6.9337564e+08, 0.0000000e+00, 0.0000000e+00,
       0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 0.0000000e+00,
       0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
       0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
       0.0000000e+00, 0.0000000e+00, 0.0000000e+00

#### Transforming the numeric_only DataFrame using the pipeline model to add a "cluster" column

In [None]:
with_cluster = pipeline_model.transform(numeric_only)

#### Counting labels per cluster

Select the "cluster" and "label" columns from the transformed DataFrame, group by both columns, count the occurrences of each combination, and order the result by "cluster" and then by count in descending order. `show(25)` displays the first 25 rows.


In [None]:
with_cluster.select("cluster", "label").groupBy("cluster", "label").count().orderBy(col("cluster"), col("count").desc()).show(25)  # Showing the top 25 results

+-------+----------------+------+
|cluster|           label| count|
+-------+----------------+------+
|      0|          smurf.|280790|
|      0|        neptune.|107201|
|      0|         normal.| 97278|
|      0|           back.|  2203|
|      0|          satan.|  1589|
|      0|        ipsweep.|  1247|
|      0|      portsweep.|  1039|
|      0|    warezclient.|  1020|
|      0|       teardrop.|   979|
|      0|            pod.|   264|
|      0|           nmap.|   231|
|      0|   guess_passwd.|    53|
|      0|buffer_overflow.|    30|
|      0|           land.|    21|
|      0|    warezmaster.|    20|
|      0|           imap.|    12|
|      0|        rootkit.|    10|
|      0|     loadmodule.|     9|
|      0|      ftp_write.|     8|
|      0|       multihop.|     7|
|      0|            phf.|     4|
|      0|           perl.|     3|
|      0|            spy.|     2|
|      1|      portsweep.|     1|
+-------+----------------+------+



### Choosing k

#### Import Summary

- `DataFrame` class from `pyspark.sql`: Used to represent a distributed collection of data organized into named columns.
- `randint` function from `random`: Used to generate a random integer between specified integers.


In [None]:
from pyspark.sql import DataFrame
from random import randint


## Function Summary: `clustering_score`

1. **Input Data Preparation**: It takes a DataFrame `input_data` as input and drops non-numeric columns (`"protocol_type"`, `"service"`, `"flag"`) to create a new DataFrame `input_numeric_only`.

2. **Feature Vector Assembly**: It uses `VectorAssembler` to assemble the feature columns of `input_numeric_only` (excluding the last column, which is assumed to be the label column) into a single feature vector column called `"featureVector"`.

3. **KMeans Model Creation**: It creates a KMeans model with a randomly generated seed and a specified number of clusters `k`, using the feature vector column `"featureVector"` for clustering and setting the prediction column name to `"cluster"`.

4. **Pipeline Creation and Fitting**: It creates a pipeline with the `VectorAssembler` and KMeans model, and fits the pipeline to the `input_numeric_only` DataFrame to train the KMeans model.

5. **Training Cost Extraction**: It extracts the trained KMeans model from the pipeline model and retrieves the training cost (sum of squared distances of points to their nearest cluster center) from the model's summary.

6. **Return Value**: It returns the training cost of the KMeans model.

7. **Iterating Over K Values**: After the function definition, a loop calls it for `k` = 20, 40, 60, and 80 and prints the training cost for each value.

Overall, `clustering_score` trains a KMeans model for a given number of clusters and returns its training cost, so that the costs can be compared across values of `k` when choosing the number of clusters.
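
The training cost described above can be computed by hand on a toy example. The sketch below is plain Python with hypothetical points and hand-picked centers; `training_cost` is an illustrative helper, not a Spark API. It also shows why the cost alone cannot pick `k`: adding centers tends to lower it.

```python
def squared_distance(p, c):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, c))

def training_cost(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)

# Hypothetical data: two tight groups, far apart along the first axis.
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]

one_center = [[5.0, 1.0]]                 # k = 1: the overall mean
two_centers = [[0.0, 1.0], [10.0, 1.0]]   # k = 2: one mean per group

cost_k1 = training_cost(points, one_center)
cost_k2 = training_cost(points, two_centers)
print(cost_k1, cost_k2)
```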


In [None]:
def clustering_score(input_data, k):
    input_numeric_only = input_data.drop("protocol_type", "service", "flag")
    assembler = VectorAssembler().setInputCols(input_numeric_only.columns[:-1]).setOutputCol("featureVector")
    kmeans = KMeans().setSeed(randint(100,100000)).setK(k).setPredictionCol("cluster").setFeaturesCol("featureVector")
    pipeline = Pipeline().setStages([assembler, kmeans])
    pipeline_model = pipeline.fit(input_numeric_only)
    kmeans_model = pipeline_model.stages[-1]
    training_cost = kmeans_model.summary.trainingCost
    return training_cost

# Print the training cost for k = 20, 40, 60, 80
for k in range(20, 100, 20):
    print(clustering_score(numeric_only, k))

## Feature normalization

#### `StandardScaler`

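The `StandardScaler` class in `pyspark.ml.feature` standardizes features by scaling each one to unit standard deviation and, optionally, subtracting the mean. By default it only scales (`withStd=True`, `withMean=False`), which is also how it is configured in this notebook. Scaling keeps features with large ranges, such as byte counts, from dominating the distance computation in KMeans.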
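
A minimal sketch of scaling without centering, i.e. the `withStd=True`, `withMean=False` configuration used by the functions in this notebook. It is plain Python on hypothetical columns, with `scale_to_unit_std` as an illustrative helper. It uses the sample standard deviation, and the handling of a constant column is an assumption of the sketch.

```python
from statistics import stdev

def scale_to_unit_std(column):
    """Divide each value by the sample standard deviation; the mean is not subtracted."""
    s = stdev(column)
    # Assumption for this sketch: a constant column (zero std) is mapped to zeros.
    return [0.0 for _ in column] if s == 0 else [x / s for x in column]

# Two hypothetical features on very different scales.
src_bytes = [0.0, 100.0, 200.0]
rate = [0.0, 0.5, 1.0]

scaled_bytes = scale_to_unit_std(src_bytes)
scaled_rate = scale_to_unit_std(rate)
print(scaled_bytes, scaled_rate)
```

After scaling, both columns span the same range, so neither dominates a Euclidean distance.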


In [None]:
from pyspark.ml.feature import StandardScaler


## Function Summary: `clustering_score_2`

1. **Input Data Preparation**: It takes a DataFrame `input_data` as input and drops non-numeric columns (`"protocol_type"`, `"service"`, `"flag"`) to create a new DataFrame `input_numeric_only`.

2. **Feature Vector Assembly**: It uses `VectorAssembler` to assemble the feature columns of `input_numeric_only` (excluding the last column, assumed to be the label column) into a single feature vector column called `"featureVector"`.

3. **Standard Scaling**: It scales the feature vector column to unit standard deviation using `StandardScaler` (`withStd=True`, `withMean=False`, so the mean is not subtracted), creating a new column `"scaledFeatureVector"`.

4. **KMeans Model Creation**: It creates a KMeans model with a randomly generated seed, a specified number of clusters `k`, maximum iterations of 40, tolerance of 1.0e-5, using the scaled feature vector column `"scaledFeatureVector"` for clustering, and setting the prediction column name to `"cluster"`.

5. **Pipeline Creation and Fitting**: It creates a pipeline with the `VectorAssembler`, `StandardScaler`, and KMeans model, and fits the pipeline to the `input_numeric_only` DataFrame to train the KMeans model.

6. **Training Cost Extraction**: It extracts the trained KMeans model from the pipeline model and retrieves the training cost (sum of squared distances of points to their nearest cluster center) from the model's summary.

7. **Return Value**: It returns the training cost of the KMeans model.

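Overall, the function `clustering_score_2` is an extension of `clustering_score` that scales the features before KMeans clustering, so that features measured on large scales do not dominate the Euclidean distances. Because scaling changes the units of the cost, its values are not directly comparable with those returned by `clustering_score`.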


In [None]:
def clustering_score_2(input_data, k):
    input_numeric_only = input_data.drop("protocol_type", "service", "flag")
    assembler = VectorAssembler().setInputCols(input_numeric_only.columns[:-1]).setOutputCol("featureVector")
    scaler = StandardScaler().setInputCol("featureVector").setOutputCol("scaledFeatureVector").setWithStd(True).setWithMean(False)
    kmeans = KMeans().setSeed(randint(100,100000)).setK(k).setMaxIter(40).setTol(1.0e-5).setPredictionCol("cluster").setFeaturesCol("scaledFeatureVector")
    pipeline = Pipeline().setStages([assembler, scaler, kmeans])
    pipeline_model = pipeline.fit(input_numeric_only)
    kmeans_model = pipeline_model.stages[-1]
    training_cost = kmeans_model.summary.trainingCost
    return training_cost


#### Printing the score for each k

In [None]:
for k in list(range(60, 271, 30)):
    print(k, clustering_score_2(numeric_only, k))

60 594473.2681887054
90 408499.30381192255
120 243875.62350682
150 185417.2358386817
180 151347.1862905078
210 126555.87022985023
240 111203.3118755722
270 99073.10534575064
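
The cost keeps falling as `k` grows, so one common way to read these numbers is to look for diminishing returns (the "elbow" heuristic). The sketch below is one possible reading aid rather than the notebook's method: it takes the costs printed above and computes the relative drop between consecutive values of `k`.

```python
# Training costs copied from the output above, keyed by k.
costs = {
    60: 594473.2681887054,
    90: 408499.30381192255,
    120: 243875.62350682,
    150: 185417.2358386817,
    180: 151347.1862905078,
    210: 126555.87022985023,
    240: 111203.3118755722,
    270: 99073.10534575064,
}

ks = sorted(costs)
relative_drops = {}
for prev_k, next_k in zip(ks, ks[1:]):
    # Fraction of the previous cost that was removed by moving to next_k.
    drop = (costs[prev_k] - costs[next_k]) / costs[prev_k]
    relative_drops[next_k] = drop
    print(f"k={prev_k} -> k={next_k}: cost fell by {drop:.1%}")
```

Because each run uses a random seed, a rerun would give somewhat different costs and drops.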


### Categorical Variables

In [None]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer

## Function Summary: `one_hot_pipeline`

1. **String Indexing**: It creates a `StringIndexer` to index the input column `input_col`, creating a new indexed column named `input_col + "_indexed"`.

2. **One-Hot Encoding**: It creates a `OneHotEncoder` to encode the indexed column `input_col + "_indexed"`, creating a new one-hot encoded column named `input_col + "_vec"`.

3. **Pipeline Creation**: It creates a pipeline with the `StringIndexer` and `OneHotEncoder` stages.

4. **Return Value**: It returns the pipeline and the name of the one-hot encoded column (`input_col + "_vec"`).

Overall, the `one_hot_pipeline` function encapsulates the process of indexing and one-hot encoding a single input column.
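
The two steps can be sketched in plain Python. This is an illustration under assumptions, not Spark's implementation: `fit_string_index` and `one_hot` are hypothetical helpers, the values are made up, indices are assigned by descending frequency, and the last category is dropped from the encoding, which matches the default settings of `StringIndexer` and `OneHotEncoder` as far as this sketch assumes. Spark also returns sparse vectors rather than lists.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to an index, most frequent first."""
    counts = Counter(values)
    # Ties are broken alphabetically here; that tie rule is an assumption of this sketch.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: index for index, value in enumerate(ordered)}

def one_hot(index, num_categories, drop_last=True):
    """Encode an index as a 0/1 list; with drop_last the last category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    return [1.0 if i == index else 0.0 for i in range(size)]

# Hypothetical values of a categorical column such as "protocol_type".
protocols = ["tcp", "udp", "tcp", "icmp", "tcp", "udp"]

index_of = fit_string_index(protocols)
encoded = {p: one_hot(index_of[p], len(index_of)) for p in index_of}
print(index_of)
print(encoded)
```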


In [None]:
def one_hot_pipeline(input_col):
    indexer = StringIndexer().setInputCol(input_col).setOutputCol(input_col +"_indexed")
    encoder = OneHotEncoder().setInputCol(input_col + "_indexed").setOutputCol(input_col + "_vec")
    pipeline = Pipeline().setStages([indexer, encoder])
    return pipeline, input_col + "_vec"

## Function Summary: `clustering_score_3`

1. **One-Hot Encoding Pipelines**:
   - Utilizes the `one_hot_pipeline` function to create one-hot encoding pipelines for the columns `"protocol_type"`, `"service"`, and `"flag"`.
   - Extracts the resulting one-hot encoded columns for each feature.

2. **Feature Assembly**:
   - Determines the feature columns for clustering by excluding `"label"`, `"protocol_type"`, `"service"`, and `"flag"` columns.
   - Includes the one-hot encoded columns in the feature columns.

3. **Vector Assembler**:
   - Uses a `VectorAssembler` to assemble the selected feature columns into a single feature vector column named `"featureVector"`.

4. **Standard Scaling**:
   - Uses a `StandardScaler` to standardize the feature vector column, producing a new column named `"scaledFeatureVector"`.

5. **KMeans Model Creation**:
   - Creates a `KMeans` model with parameters such as the number of clusters `k`, maximum iterations, and tolerance.
   - Sets a randomly generated seed, so results can differ from run to run.

6. **Creating Pipeline**:
   - Creates a `Pipeline` with stages including the one-hot encoding pipelines, `VectorAssembler`, `StandardScaler`, and `KMeans` stages.

7. **Fitting Pipeline**:
   - Fits the pipeline to the input data to create a `PipelineModel`.

8. **Extracting Training Cost**:
   - Extracts the trained KMeans model from the pipeline model and retrieves the training cost.

9. **Return Value**:
   - Returns the training cost of the KMeans model.

10. **Overall Purpose**:
    - The function `clustering_score_3` is used to train a KMeans clustering model with one-hot encoding for categorical features and evaluate its training cost.
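
The feature-selection step relies on Python set operators. The sketch below uses a shortened, hypothetical column list to show how the expression is evaluated. The final sorting step is an optional addition of this sketch, not part of the function.

```python
# Shortened, hypothetical column list; the real DataFrame has 42 columns.
columns = ["duration", "protocol_type", "service", "flag", "src_bytes", "label"]
one_hot_cols = {"protocol_type_vec", "service_vec", "flag_vec"}
excluded = {"label", "protocol_type", "service", "flag"}

# "-" binds more tightly than "|", so this is (columns - excluded) | one_hot_cols.
assemble_cols = set(columns) - excluded | one_hot_cols

# Sets are unordered, so list(assemble_cols) may order features differently
# between runs. Sorting gives a reproducible order (optional extra step).
feature_order = sorted(assemble_cols)
print(feature_order)
```

The order of features inside the vector does not change Euclidean distances, so the clustering cost is unaffected either way; a fixed order mainly helps when inspecting cluster centers.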


In [None]:
def clustering_score_3(input_data, k):
    proto_type_pipeline, proto_type_vec_col = one_hot_pipeline("protocol_type")
    service_pipeline, service_vec_col = one_hot_pipeline("service")
    flag_pipeline, flag_vec_col = one_hot_pipeline("flag")
    assemble_cols = set(input_data.columns) - {"label", "protocol_type", "service", "flag"} | {proto_type_vec_col, service_vec_col, flag_vec_col}
    assembler = VectorAssembler().setInputCols(list(assemble_cols)).setOutputCol("featureVector")
    scaler = StandardScaler().setInputCol("featureVector").setOutputCol("scaledFeatureVector").setWithStd(True).setWithMean(False)
    kmeans = KMeans().setSeed(randint(100,100000)).setK(k).setMaxIter(40).setTol(1.0e-5).setPredictionCol("cluster").setFeaturesCol("scaledFeatureVector")
    pipeline = Pipeline().setStages([proto_type_pipeline, service_pipeline,flag_pipeline, assembler, scaler, kmeans])
    pipeline_model = pipeline.fit(input_data)
    kmeans_model = pipeline_model.stages[-1]
    training_cost = kmeans_model.summary.trainingCost
    return training_cost

In [None]:
for k in list(range(60, 271, 30)):
    print(k, clustering_score_3(data, k))

60 17388727.09984871
90 6394617.506449859
120 1519827.0375092942
150 999082.8065718799
180 796774.6873166809
210 582344.6975851612
240 475594.9717839229
270 385256.792086577


### Using Labels with Entropy

In [None]:
from math import log

## Function Summary: `entropy`

1. **Calculating Non-Zero Counts**: It filters out counts `c` from the input `counts` where `c` is greater than 0 and stores them in `values`.

2. **Total Count Calculation**: It calculates the total count `n` by summing up all values in `values`.

3. **Probability Calculation**: It calculates the probability `p` for each non-zero count `v` in `values` as `v/n`.

4. **Entropy Calculation**: It calculates the entropy as the sum of the negative of each probability times the logarithm of the probability, i.e., `sum([-1*(p_v) * log(p_v) for p_v in p])`.

5. **Return Value**: It returns the calculated entropy value.

Overall, the `entropy` function computes the entropy of a distribution represented by the input counts, which is a measure of uncertainty or disorder in the distribution.


In [None]:
def entropy(counts):
    values = [c for c in counts if (c > 0)]
    n = sum(values)
    p = [v/n for v in values]
    return sum([-1*(p_v) * log(p_v) for p_v in p])
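
A short usage sketch with hypothetical counts. The function is repeated here only so that the example runs on its own. Note that it uses the natural logarithm, so the result is in nats, while the Spark expressions later in this notebook use `log2` and therefore give bits.

```python
from math import log

def entropy(counts):
    # Same definition as in the cell above, repeated so this example is self-contained.
    values = [c for c in counts if c > 0]
    n = sum(values)
    return sum(-(v / n) * log(v / n) for v in values)

# A cluster containing a single label has no uncertainty.
pure = entropy([10, 0])

# A cluster split evenly between two labels has entropy ln(2) nats, i.e. 1 bit.
even = entropy([5, 5])

print(pure, even, even / log(2))
```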

## Import Summary

- `functions` module from `pyspark.sql` as `fun`: Used to access SQL functions in PySpark for DataFrame operations.
- `Window` class from `pyspark.sql`: Used to define window specifications for window functions in PySpark DataFrame operations.


In [None]:
from pyspark.sql import functions as fun
from pyspark.sql import Window


## 1. **Transforming Data**:
- Use the pipeline model fitted earlier (`pipeline_model`, the default two-cluster model trained on `numeric_only`) to transform the input data `data` and select the "cluster" and "label" columns, creating a new DataFrame `cluster_label`.

In [None]:
cluster_label = pipeline_model.transform(data).select("cluster", "label")

## 2. **Grouping and Counting**:
- Group the `cluster_label` DataFrame by "cluster" and "label", count the occurrences of each cluster-label combination, and order the result by "cluster", creating a new DataFrame `df`.


In [None]:
df = cluster_label.groupBy("cluster", "label").count().orderBy("cluster")

## 3. **Window Specification**:
- Define a window specification `w` partitioned by "cluster" for use in window functions.


In [None]:
w = Window.partitionBy("cluster")

## 4. **Calculating Probabilities**:
- Calculate the probability `p_col` for each cluster-label combination as the count of the combination divided by the sum of counts for the cluster, using the `over` window function to sum counts within each cluster.

In [None]:
p_col = df['count'] / fun.sum(df['count']).over(w)

## 5. **Adding Probability Column**:
- Add the calculated probabilities as a new column "p_col" to the DataFrame `df`, creating a new DataFrame `with_p_col`.

In [None]:
with_p_col = df.withColumn("p_col", p_col)

## 6. **Calculating Entropy**:
- Calculate the entropy for each cluster by summing `-p_col * log2(p_col)` for each cluster-label combination, creating a new DataFrame `result`.

In [None]:
result = with_p_col.groupBy("cluster").agg((-fun.sum(col("p_col") * fun.log2(col("p_col")))).alias("entropy"),
fun.sum(col("count")).alias("cluster_size"))


## 7. **Calculating Weighted Cluster Entropy**:
- Calculate the weighted cluster entropy by multiplying the entropy for each cluster by the cluster size (sum of counts) for that cluster, adding a new column "weightedClusterEntropy" to the `result` DataFrame.

## 8. **Calculating Average**:
- Calculate the average weighted cluster entropy by summing the weighted cluster entropies for all clusters and dividing by the total count of records in the input data `data`.


In [None]:
result = result.withColumn('weightedClusterEntropy',fun.col('entropy') * fun.col('cluster_size'))
weighted_cluster_entropy_avg = result.agg(fun.sum(col('weightedClusterEntropy'))).collect()
weighted_cluster_entropy_avg[0][0]/data.count()

1.557605039016584
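
The whole calculation can be checked by hand on a tiny example. The sketch below is plain Python over hypothetical `(cluster, label, count)` rows shaped like the grouped DataFrame `df`; it follows the same steps (per-cluster probabilities, per-cluster entropy in bits, weighting by cluster size, dividing by the total count).

```python
from collections import defaultdict
from math import log2

# Hypothetical (cluster, label, count) rows.
rows = [(0, "smurf.", 8), (0, "normal.", 8), (1, "neptune.", 4)]

# Collect the label counts of each cluster (the role of the window partition).
by_cluster = defaultdict(list)
for cluster, _label, count in rows:
    by_cluster[cluster].append(count)

total = sum(count for _, _, count in rows)

weighted_sum = 0.0
for cluster, counts in by_cluster.items():
    cluster_size = sum(counts)
    # Entropy of the label distribution inside this cluster, in bits.
    cluster_entropy = -sum((c / cluster_size) * log2(c / cluster_size) for c in counts)
    weighted_sum += cluster_entropy * cluster_size

weighted_average_entropy = weighted_sum / total
print(weighted_average_entropy)
```

Cluster 0 is an even mix of two labels (1 bit) and holds 16 of the 20 records, while cluster 1 is pure (0 bits), so the weighted average is 16/20 = 0.8 bits. Lower values mean that clusters align more closely with labels.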

## Function Summary: `fit_pipeline_4`

1. **One-Hot Encoding Pipelines**: It creates three one-hot encoding pipelines (`proto_type_pipeline`, `service_pipeline`, `flag_pipeline`) using the `one_hot_pipeline` function for the columns `"protocol_type"`, `"service"`, and `"flag"`, respectively. These pipelines produce one-hot encoded columns (`proto_type_vec_col`, `service_vec_col`, `flag_vec_col`).

2. **Feature Assembly**: It assembles the feature columns for clustering, including the one-hot encoded columns, by taking the set of all input columns except `"label"`, `"protocol_type"`, `"service"`, and `"flag"` and adding the one-hot encoded columns.

3. **Vector Assembler**: It creates a `VectorAssembler` to assemble the selected feature columns into a single feature vector column named `"featureVector"`.

4. **Standard Scaling**: It standardizes the feature vector column using `StandardScaler`, creating a new column named `"scaledFeatureVector"`.

5. **KMeans Model Creation**: It creates a KMeans model with a randomly generated seed, a specified number of clusters `k`, maximum iterations of 40, tolerance of 1.0e-5, using the scaled feature vector column `"scaledFeatureVector"` for clustering, and setting the prediction column name to `"cluster"`.

6. **Pipeline Creation and Fitting**: It creates a pipeline with the one-hot encoding pipelines, `VectorAssembler`, `StandardScaler`, and KMeans model stages, and fits the pipeline to the input data `data`.

7. **Return Value**: It returns the fitted `PipelineModel`, which can be used to transform data.

Overall, the `fit_pipeline_4` function creates a pipeline for clustering with one-hot encoding, feature assembly, standard scaling, and KMeans clustering, and fits the pipeline to the input data.


In [None]:
def fit_pipeline_4(data, k):
    (proto_type_pipeline, proto_type_vec_col) = one_hot_pipeline("protocol_type")
    (service_pipeline, service_vec_col) = one_hot_pipeline("service")
    (flag_pipeline, flag_vec_col) = one_hot_pipeline("flag")
    assemble_cols = set(data.columns) - {"label", "protocol_type", "service","flag"} | {proto_type_vec_col, service_vec_col, flag_vec_col}
    assembler = VectorAssembler(inputCols=list(assemble_cols),outputCol="featureVector")
    scaler = StandardScaler(inputCol="featureVector",outputCol="scaledFeatureVector", withStd=True, withMean=False)
    kmeans = KMeans(seed=randint(100, 100000), k=k, predictionCol="cluster",
    featuresCol="scaledFeatureVector", maxIter=40, tol=1.0e-5)
    pipeline = Pipeline(stages=[proto_type_pipeline, service_pipeline,flag_pipeline, assembler, scaler, kmeans])
    return pipeline.fit(data)


## Function Summary: `clustering_score_4`

1. **Pipeline Fitting**: It fits a pipeline (`fit_pipeline_4`) to the input data `input_data` with a specified number of clusters `k`, producing a pipeline model.

2. **Cluster-Label DataFrame**: It transforms the input data using the fitted pipeline model to add a `"cluster"` column, then selects the `"cluster"` and `"label"` columns.

3. **Grouping and Aggregation**: It groups the cluster-label DataFrame by `"cluster"` and `"label"`, counts the occurrences of each combination, and orders the result by `"cluster"`.

4. **Window Specification**: It defines a window specification `w` partitioned by `"cluster"`.

5. **Calculating Probabilities**: It calculates the probability `p_col` for each cluster-label combination as the count of the combination divided by the sum of counts for the cluster.

6. **Adding Probability Column**: It adds the `"p_col"` column to the DataFrame with the probabilities.

7. **Entropy Calculation**: It calculates the entropy for each cluster by summing `-p_col * log2(p_col)` for each cluster-label combination.

8. **Weighted Cluster Entropy Calculation**: It calculates the weighted cluster entropy by multiplying the entropy for each cluster by the cluster size (sum of counts) for that cluster.

9. **Average Weighted Cluster Entropy**: It calculates the average weighted cluster entropy by summing the weighted cluster entropies for all clusters and dividing by the total count of records in the input data `input_data`.

10. **Return Value**: It returns the average weighted cluster entropy, which is a measure of the average uncertainty or disorder in the clustering results, weighted by the size of each cluster.


In [None]:
def clustering_score_4(input_data, k):
    pipeline_model = fit_pipeline_4(input_data, k)
    cluster_label = pipeline_model.transform(input_data).select("cluster","label")
    df = cluster_label.groupBy("cluster", "label").count().orderBy("cluster")
    w = Window.partitionBy("cluster")
    p_col = df['count'] / fun.sum(df['count']).over(w)
    with_p_col = df.withColumn("p_col", p_col)
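    # Parenthesize the negation so that alias("entropy") names the negated sum;
    # without the parentheses, .alias() binds before the unary minus.
    result = with_p_col.groupBy("cluster").agg((-fun.sum(col("p_col") * fun.log2(col("p_col")))).alias("entropy"), fun.sum(col("count")).alias("cluster_size"))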
    result = result.withColumn('weightedClusterEntropy', col('entropy') * col('cluster_size'))
    weighted_cluster_entropy_avg = result.agg(fun.sum(col('weightedClusterEntropy'))).collect()
    return weighted_cluster_entropy_avg[0][0] / input_data.count()

## Clustering in Action

## 1. **Fitting Pipeline**:
- Fit a pipeline model (`pipeline_model`) to the input data `data` with 180 clusters using the `fit_pipeline_4` function.


In [None]:
pipeline_model = fit_pipeline_4(data, 180)

## 2. **Transforming Data and Counting**:
- Transform the input data using the fitted pipeline model to add a "cluster" column.
- Select the "cluster" and "label" columns, then group by "cluster" and "label".
- Count the occurrences of each cluster-label combination and order the result by "cluster" and "label".
- Display the result using the `show()` method.


In [None]:

count_by_cluster_label = pipeline_model.transform(data).select("cluster", "label").groupBy("cluster", "label").count().orderBy("cluster", "label")
count_by_cluster_label.show()

+-------+-----------+-----+
|cluster|      label|count|
+-------+-----------+-----+
|      0|   neptune.|36459|
|      0| portsweep.|    9|
|      1|      back.|    1|
|      1|    normal.| 6029|
|      2|   neptune.|   94|
|      2|     satan.|    1|
|      3|   neptune.|   80|
|      4|   neptune.|  107|
|      4| portsweep.|    1|
|      5|loadmodule.|    2|
|      5|  multihop.|    1|
|      6|   neptune.|  177|
|      6|    normal.|    1|
|      6| portsweep.|    1|
|      7|   neptune.|   24|
|      7| portsweep.|    4|
|      7|     satan.|    1|
|      8|   neptune.|   77|
|      9|   neptune.|   90|
|      9|    normal.|    1|
+-------+-----------+-----+
only showing top 20 rows

