# Spark MLlib Assignment

### *Guillermo Climent, Rubén Giménez, Mayra Russo*

### Assignment

The assignment consists of loading, processing, and evaluating several of the clustering and classification algorithms provided by Spark's MLlib library on the classic MNIST dataset.

Detailed information about the dataset, along with the results of algorithms that have been applied to it (in terms of test error rate), is available at http://yann.lecun.com/exdb/mnist/.

The ‘Material’ folder of ‘Tema 6 – Spark MLlib’ contains a sub-folder named ‘mnist’ with the training and test files in .csv format, so that they can be read easily with Spark.

Several classification algorithms must be tested and compared, both with and without a prior feature selection step (for example, PCA).

At least LogisticRegression and MultilayerPerceptronClassifier must be compared as classification algorithms. For each method, the parameters must be searched and tuned to obtain the best possible classification.

Regarding clustering, at least one of Spark's methods will be tested to assess whether one of these unsupervised algorithms can achieve results comparable to those of the supervised algorithms.

The following will be valued:

The use of Pipelines in the tuning, training, and testing process.

Comparison with other classification methods besides the two required ones.

The assignment may be done in pairs or individually. The programs may be written in R (sparkR) or Python (py-spark). Using notebooks to carry out and present the work is recommended.

In [1]:
sc

Waiting for a Spark session to start...
Spark Initialization Done! ApplicationId = app-20190506091707-0000
KERNEL_ID = a82e690b-3659-4c1a-ba08-42fc42e86c1e


### 1. Preprocessing

In [1]:
SEED = 42

from pyspark import SparkContext
from pyspark.sql import SQLContext, SparkSession

# Reuse the running SparkContext if there is one (the kernel already provides `sc`);
# calling SparkContext() a second time would raise an error
sc = SparkContext.getOrCreate()

# Initializing the SQLContext
sqlContext = SQLContext(sc)

# Initializing Spark Session
spark = SparkSession.builder.appName('mnist-mllib').getOrCreate()

#### 1.1. Load train data

In [4]:

import ibmos2spark
# @hidden_cell
credentials = {
    'endpoint': 'https://s3.eu-geo.objectstorage.service.networklayer.com',
    'service_id': 'iam-ServiceId-a4631986-b796-40bd-b93a-06065e91801b',
    'iam_service_endpoint': 'https://iam.bluemix.net/oidc/token',
    'api_key': 'gl2jpib2U2AtMKgyuZX4nk2YLlPFz3DxfM4REYvALTQl'
}

configuration_name = 'os_068e48156a3944328f96d36308ac1cdf_configs'
cos = ibmos2spark.CloudObjectStorage(sc, credentials, configuration_name, 'bluemix_cos')

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Please read the documentation of PySpark to learn more about the possibilities to load data files.
# PySpark documentation: https://spark.apache.org/docs/2.0.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession
# The SparkSession object is already initialized for you.
# The following variable contains the path to your file on your IBM Cloud Object Storage.
train_file = cos.url('mnist_train.csv.bz2', 'filtradocollab-donotdelete-pr-dchkxqx3kk4b5a')


In [5]:
#train_file = 'mnist/mnist_train.csv.bz2'
train_df_raw = spark.read.csv(train_file, header=True, inferSchema=False)
train_df_raw.select(train_df_raw.columns[:8]).show(4)

+-----+---+---+---+---+---+---+---+
|label|1x1|1x2|1x3|1x4|1x5|1x6|1x7|
+-----+---+---+---+---+---+---+---+
|    5|  0|  0|  0|  0|  0|  0|  0|
|    0|  0|  0|  0|  0|  0|  0|  0|
|    4|  0|  0|  0|  0|  0|  0|  0|
|    1|  0|  0|  0|  0|  0|  0|  0|
+-----+---+---+---+---+---+---+---+
only showing top 4 rows



#### 1.2. Update data schema

In [6]:
from pyspark.sql.functions import col

# The CSV was read with inferSchema=False, so every column is a string: cast them all to integer
exprs = [col(c).cast('integer') for c in train_df_raw.columns]

train_df_raw = train_df_raw.select(*exprs)

#### 1.3. Add 'features' column as a Vector

In [8]:
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=train_df_raw.columns[1:], outputCol='features')
train_df = assembler.transform(train_df_raw)

train_df.select(['label', 'features']).show(4)

+-----+--------------------+
|label|            features|
+-----+--------------------+
|    5|(784,[152,153,154...|
|    0|(784,[127,128,129...|
|    4|(784,[160,161,162...|
|    1|(784,[158,159,160...|
+-----+--------------------+
only showing top 4 rows



#### 1.4. Scaling: create the scaler model

In [9]:
from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(
    inputCol='features',
    outputCol='scaledFeatures',
    withStd=True,
    withMean=False
)
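
As a rough illustration of what this configuration does, the sketch below reproduces the scaling in plain Python, without Spark. With `withStd=True` and `withMean=False`, each feature is divided by its standard deviation and is not centered, which keeps zero pixels at zero and the feature vectors sparse. The helper `scale_columns` and the toy pixel values are hypothetical. The use of the sample standard deviation and the mapping of constant columns to 0.0 are assumptions about Spark's behavior, not something shown in this notebook.

```python
from statistics import stdev


def scale_columns(rows):
    """Divide each column by its sample standard deviation, without centering."""
    columns = list(zip(*rows))
    stds = [stdev(column) for column in columns]
    # A constant column has zero deviation: map it to 0.0 instead of dividing by zero
    factors = [1.0 / s if s != 0 else 0.0 for s in stds]
    return [[value * factor for value, factor in zip(row, factors)] for row in rows]


# Three toy "images" with three pixels each (hypothetical values)
toy_pixels = [[0, 0, 10], [0, 128, 30], [0, 255, 20]]
scaled = scale_columns(toy_pixels)

for row in scaled:
    print([round(value, 3) for value in row])
```

The first column is constant, as many border pixels are in MNIST, and stays at zero after scaling.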

### 2. Training

#### 2.1. Define estimators and their hyperparameters
1) Logistic Regression <br>
2) Random Forest <br>
3) Multilayer Perceptron <br>
4) K-Means<br>

In [11]:
import re

from pyspark.ml.classification import LogisticRegression, MultilayerPerceptronClassifier, RandomForestClassifier
from pyspark.ml.clustering import KMeans, GaussianMixture

from pyspark.ml import Pipeline
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

from pyspark.ml.evaluation import MulticlassClassificationEvaluator, ClusteringEvaluator


# Extract number of classes
classes = train_df.select('label').distinct()
n_classes = train_df.select('label').distinct().count()


# Extract number of features
regex = re.compile(r'[0-9]+x[0-9]+')
feature_cols = list(filter(regex.search, train_df.columns))
n_features = len(feature_cols)


# Define estimators
log_reg = LogisticRegression(standardization=False)

rf = RandomForestClassifier(labelCol="label", seed = SEED)

mlp_layers = [n_features, int(n_features / 2), int(n_features / 4), n_classes]
mlp = MultilayerPerceptronClassifier(layers=mlp_layers, blockSize=128, seed=SEED)

kmeans = KMeans(k=n_classes, seed=SEED)

# Create a list with each defined estimator and its grid param
estimator_list = [
    (
        log_reg, 
        ParamGridBuilder() \
            .addGrid(log_reg.maxIter, [50, 100]) \
            .addGrid(log_reg.tol, [1E-6]) \
            .addGrid(log_reg.fitIntercept, [False]) \
            .addGrid(scaler.withMean, [False]) \
            .build()
    ),
    (
        rf,
        ParamGridBuilder() \
            .addGrid(rf.maxDepth, [10, 15]) \
            .addGrid(rf.maxBins, [5, 10, 20]) \
            .addGrid(rf.numTrees, [10, 20]) \
            .addGrid(scaler.withMean, [False, True]) \
            .build()
    ),
    (
        mlp, 
        ParamGridBuilder() \
            .addGrid(mlp.maxIter, [25, 50]) \
            .addGrid(mlp.tol, [1E-6]) \
            .addGrid(scaler.withMean, [False]) \
            .build()
    ),
]

clustering_list = [
    (
        kmeans, 
        ParamGridBuilder() \
            .addGrid(kmeans.maxIter, [20, 50])\
            .addGrid(scaler.withMean, [False, True]) \
            .build()
    ),

]

In [16]:
# Set the number of PCA components and define the models for the PCA pipelines
# We use the same classifiers, but the models must be defined again because some
# settings (such as the MLP input layer size) depend on the number of PCA components
n_features_pca = 25

# Define estimators
log_reg = LogisticRegression(standardization=False)

rf = RandomForestClassifier(labelCol="label", seed = SEED)

mlp_layers_pca = [n_features_pca, int(n_features_pca / 2), int(n_features_pca / 4), n_classes]
mlp_pca = MultilayerPerceptronClassifier(layers=mlp_layers_pca, blockSize=128, seed=SEED)

kmeans = KMeans(k=n_classes, seed=SEED)

# Create a list with each defined estimator and its grid param
estimators_list_pca = [
    (
        log_reg, 
        ParamGridBuilder() \
            .addGrid(log_reg.maxIter, [50, 100]) \
            .addGrid(log_reg.tol, [1E-6]) \
            .addGrid(log_reg.fitIntercept, [False]) \
            .addGrid(scaler.withMean, [False]) \
            .build()
    ),
    (
        rf,
        ParamGridBuilder() \
            .addGrid(rf.maxDepth, [10, 15]) \
            .addGrid(rf.maxBins, [5, 10, 20]) \
            .addGrid(rf.numTrees, [10, 20]) \
            .addGrid(scaler.withMean, [False, True]) \
            .build()
    ),
    (
        mlp_pca, 
        ParamGridBuilder() \
            .addGrid(mlp_pca.maxIter, [25, 50]) \
            .addGrid(mlp_pca.tol, [1E-6]) \
            .addGrid(scaler.withMean, [False]) \
            .build()
    ),
]

clustering_list = [
    (
        kmeans, 
        ParamGridBuilder() \
            .addGrid(kmeans.maxIter, [20, 50])\
            .addGrid(scaler.withMean, [False, True]) \
            .build()
    ),
]
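
To make the two MLP architectures concrete, the following sketch applies the same layer rule used above (two hidden layers of n/2 and n/4 units) to the 784 pixel features and to the 25 PCA components, and counts the trainable parameters of each network. The helper names `layer_sizes` and `count_parameters` are hypothetical, and the parameter count assumes fully connected layers with one bias per unit.

```python
def layer_sizes(n_inputs, n_outputs):
    """Mirror the notebook's rule: two hidden layers of n/2 and n/4 units."""
    return [n_inputs, int(n_inputs / 2), int(n_inputs / 4), n_outputs]


def count_parameters(layers):
    """Weights plus one bias per unit for every pair of consecutive layers."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))


full_layers = layer_sizes(784, 10)  # 784 pixel features, 10 digit classes
pca_layers = layer_sizes(25, 10)    # 25 PCA components, 10 digit classes

print(full_layers, count_parameters(full_layers))
print(pca_layers, count_parameters(pca_layers))
```

Under these assumptions the network on raw pixels has 386718 parameters and the one on PCA features has 460, which is one reason the PCA variant trains much faster.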

#### 2.2. PCA: create the PCA model

In [33]:
from pyspark.ml.feature import PCA

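# Note: inputCol is the raw 'features' column, so in the [scaler, pca, estimator]
# pipelines PCA does not consume the scaler's 'scaledFeatures' output
pca = PCA(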
    k=n_features_pca,
    inputCol='features',
    outputCol='pcaFeatures'
)

#### 2.3. Perform CV for models (without PCA step)

In [None]:
# Save the results of the cross-validation and hyperparameter tuning in a list of models
model_list = []
for estimator, paramGrid in estimator_list:
    # Set features column
    estimator.setFeaturesCol('scaledFeatures')
    
    pipeline = Pipeline(stages=[scaler, estimator])
    cvModel = CrossValidator(
        estimator=pipeline,
        estimatorParamMaps=paramGrid,
        evaluator=MulticlassClassificationEvaluator(metricName='accuracy'),
        numFolds=5
    )
    print(estimator) 
    %time cvModel = cvModel.fit(train_df)
    
    model_list.append(cvModel)

LogisticRegression_4698bb555695f6efcda9
CPU times: user 2.59 s, sys: 872 ms, total: 3.46 s
Wall time: 11min 42s
RandomForestClassifier_474a90300003cd62675a
CPU times: user 18.3 s, sys: 6.84 s, total: 25.2 s
Wall time: 1h 15min 7s
MultilayerPerceptronClassifier_458a8220b24b620b0ee1
CPU times: user 10.9 s, sys: 4.02 s, total: 14.9 s
Wall time: 1h 58min 41s
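
The wall times above are easier to read next to the number of pipelines each search has to fit. The sketch below expands the three parameter grids from section 2.1 in plain Python and counts the fits, assuming that `CrossValidator` trains one pipeline per parameter combination and fold and then refits the best combination once on the full training set. The dictionary `grids` and the helper `count_fits` are hypothetical stand-ins for the `ParamGridBuilder` objects.

```python
from itertools import product

NUM_FOLDS = 5

# Plain-Python copies of the grids defined in section 2.1
grids = {
    'LogisticRegression': {
        'maxIter': [50, 100], 'tol': [1e-6], 'fitIntercept': [False], 'withMean': [False],
    },
    'RandomForestClassifier': {
        'maxDepth': [10, 15], 'maxBins': [5, 10, 20], 'numTrees': [10, 20], 'withMean': [False, True],
    },
    'MultilayerPerceptronClassifier': {
        'maxIter': [25, 50], 'tol': [1e-6], 'withMean': [False],
    },
}


def count_fits(grid, num_folds=NUM_FOLDS):
    """Combinations x folds, plus one final refit of the best combination."""
    combinations = list(product(*grid.values()))
    return len(combinations) * num_folds + 1


fits = {name: count_fits(grid) for name, grid in grids.items()}
print(fits)
```

The Random Forest grid has 24 combinations against 2 for the other two estimators, so it requires about ten times as many fits.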


#### 2.4. Perform CV for models (with PCA step)

In [19]:
# Save the results of the cross-validation and hyperparameter tuning in a list of models
model_list_pca = []
for estimator, paramGrid in estimators_list_pca:
    # Set features column
    estimator.setFeaturesCol('pcaFeatures')
    
    pipeline = Pipeline(stages=[scaler, pca, estimator])
    cvModel = CrossValidator(
        estimator=pipeline,
        estimatorParamMaps=paramGrid,
        evaluator=MulticlassClassificationEvaluator(metricName='accuracy'),
        numFolds=5,
        parallelism=2
    )
    print(estimator)
    %time cvModel = cvModel.fit(train_df)
    
    model_list_pca.append(cvModel)

LogisticRegression_484eaf71803d0b80b0d6
CPU times: user 3.06 s, sys: 1.1 s, total: 4.16 s
Wall time: 7min 23s
RandomForestClassifier_48a090634c9a1cc6150a
CPU times: user 18.7 s, sys: 5.96 s, total: 24.6 s
Wall time: 27min 45s
MultilayerPerceptronClassifier_41a1a7dfc612324a502f
CPU times: user 4.56 s, sys: 1.52 s, total: 6.08 s
Wall time: 9min 1s


#### 2.5. Perform CV for clustering models (without PCA step)

In [9]:
# Save the results of the cross-validation and hyperparameter tuning in a list of models
cluster_list = []
for estimator, paramGrid in clustering_list:
    # Set features column
    estimator.setFeaturesCol('scaledFeatures')
    
    pipeline = Pipeline(stages=[scaler, estimator])
    cvModel = CrossValidator(
        estimator=pipeline,
        estimatorParamMaps=paramGrid,
        evaluator=ClusteringEvaluator(),
        numFolds=5,
        parallelism=2
    )
    print(estimator)
    %time cvModel = cvModel.fit(train_df)
    
    cluster_list.append(cvModel)

KMeans_b5816906ce6d
Wall time: 8min 17s


#### 2.6. Perform CV for clustering models (with PCA step)

In [10]:
# Save the results of the cross-validation and hyperparameter tuning in a list of models
cluster_list_pca = []
for estimator, paramGrid in clustering_list:
    # Set features column
    estimator.setFeaturesCol('pcaFeatures')
    
    pipeline = Pipeline(stages=[scaler, pca, estimator])
    cvModel = CrossValidator(
        estimator=pipeline,
        estimatorParamMaps=paramGrid,
        evaluator=ClusteringEvaluator(),
        numFolds=5,
        parallelism=2
    )
    print(estimator)
    %time cvModel = cvModel.fit(train_df)
    
    cluster_list_pca.append(cvModel)

KMeans_b5816906ce6d
Wall time: 6min 40s


### 3. Evaluation

In [20]:
import numpy as np

from sklearn import metrics

#### 3.1. Load test data

In [21]:
# Please read the documentation of PySpark to learn more about the possibilities to load data files.
# PySpark documentation: https://spark.apache.org/docs/2.0.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession
# The SparkSession object is already initialized for you.
# The following variable contains the path to your file on your IBM Cloud Object Storage.
test_file = cos.url('mnist_test.csv.bz2', 'filtradocollab-donotdelete-pr-dchkxqx3kk4b5a')

In [22]:
#test_file = 'mnist/mnist_test.csv.bz2'
test_df_raw = spark.read.csv(test_file, header=True, inferSchema=True)
test_df_raw.select(test_df_raw.columns[:8]).show(4)

+-----+---+---+---+---+---+---+---+
|label|1x1|1x2|1x3|1x4|1x5|1x6|1x7|
+-----+---+---+---+---+---+---+---+
|    7|  0|  0|  0|  0|  0|  0|  0|
|    2|  0|  0|  0|  0|  0|  0|  0|
|    1|  0|  0|  0|  0|  0|  0|  0|
|    0|  0|  0|  0|  0|  0|  0|  0|
+-----+---+---+---+---+---+---+---+
only showing top 4 rows



#### 3.2. Update data schema

In [32]:
from pyspark.sql.functions import col

# Cast every column to integer so the test schema matches the training schema
exprs = [col(c).cast('integer') for c in test_df_raw.columns]

test_df_raw = test_df_raw.select(*exprs)

#### 3.3. Add 'features' column as a Vector

In [24]:
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=test_df_raw.columns[1:], outputCol='features')
test_df = assembler.transform(test_df_raw)

test_df.select(['label', 'features']).show(4)

+-----+--------------------+
|label|            features|
+-----+--------------------+
|    7|(784,[202,203,204...|
|    2|(784,[94,95,96,97...|
|    1|(784,[128,129,130...|
|    0|(784,[124,125,126...|
+-----+--------------------+
only showing top 4 rows



#### 3.4. Predict using the best models

In [25]:
predictions = [model.transform(test_df) for model in model_list]
predictions_pca = [model.transform(test_df) for model in model_list_pca]
predictions_clust = [model.transform(test_df) for model in cluster_list]
predictions_clust_pca = [model.transform(test_df) for model in cluster_list_pca]

#### 3.5. Confusion matrix

In [16]:
def print_crosstab(model_list, predictions):
    assert len(model_list) == len(predictions)
    
    # Print the label-vs-prediction contingency table of each best model
    for cv_model, prediction in zip(model_list, predictions):
        print('Pipeline:', cv_model.bestModel.stages[:-1])
        print('Model:', str(cv_model.bestModel.stages[-1]))
        prediction.crosstab('label', 'prediction').sort('label_prediction').show()

**Models without PCA**

In [21]:
print_crosstab(model_list, predictions)
print_crosstab(model_list_pca, predictions_pca)
print_crosstab(cluster_list, predictions_clust)

Pipeline: [StandardScaler_456fb8b83d10d193a4dc]
Model: LogisticRegression_4698bb555695f6efcda9
+----------------+---+----+---+---+---+---+---+---+---+---+
|label_prediction|0.0| 1.0|2.0|3.0|4.0|5.0|6.0|7.0|8.0|9.0|
+----------------+---+----+---+---+---+---+---+---+---+---+
|               0|960|   0|  1|  2|  2|  4|  7|  1|  2|  1|
|               1|  0|1113|  5|  2|  0|  1|  3|  2|  9|  0|
|               2|  5|  10|927| 16|  7|  3| 12| 11| 39|  2|
|               3|  2|   0| 17|926|  1| 22|  3|  9| 24|  6|
|               4|  1|   1|  3|  2|915|  0| 12|  5| 10| 33|
|               5|  8|   2|  2| 34|  8|769| 13| 11| 38|  7|
|               6| 10|   3|  8|  1|  7| 15|910|  2|  2|  0|
|               7|  1|   5| 24|  8|  4|  1|  0|949|  5| 31|
|               8|  6|  11|  6| 22|  9| 25|  9| 13|865|  8|
|               9|  8|   8|  2|  8| 25|  5|  0| 19| 11|923|
+----------------+---+----+---+---+---+---+---+---+---+---+

Pipeline: [StandardScaler_456fb8b83d10d193a4dc]
Model: RandomFor

In [17]:
print_crosstab(cluster_list, predictions_clust)

Pipeline: [StandardScaler_9b98e64ce93c]
Model: KMeans_b5816906ce6d
+----------------+----+---+---+---+---+---+---+---+---+
|label_prediction|   0|  1|  2|  3|  4|  5|  6|  7|  9|
+----------------+----+---+---+---+---+---+---+---+---+
|               0|   1| 15|  1|480|174|  2|255| 15| 37|
|               1|1114|  2|  0|  0|  2|  0|  3| 10|  4|
|               2| 104|310| 10|  6|226| 14|119|  8|235|
|               3|  59|190| 19|  1|129| 18|559| 18| 17|
|               4|  49|  8|204| 10|  3|514|  1|165| 28|
|               5|  54| 47| 18|  7| 46| 38|372|298| 12|
|               6|  55| 26|  4| 65|  7|  4| 26|  5|766|
|               7|  74|  4|134|  0|  2|617|  1|194|  2|
|               8|  96| 44| 22|  4| 19| 58|521|200| 10|
|               9|  31|  2|210|  3|  9|654| 19| 80|  1|
+----------------+----+---+---+---+---+---+---+---+---+
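
K-Means numbers its clusters arbitrarily, so cluster 0 has no reason to correspond to digit 0. One common way to read a table like the one above is to assign each cluster to its most frequent digit and measure the share of images covered by those majority digits (often called purity). The sketch below does this in plain Python on the counts copied from the K-Means crosstab. The helpers `majority_mapping` and `purity` are hypothetical and are not part of the notebook's pipeline.

```python
# Rows: true digits 0-9. Columns: K-Means clusters 0, 1, 2, 3, 4, 5, 6, 7, 9
# (cluster 8 does not appear in the test predictions), copied from the crosstab.
cluster_ids = [0, 1, 2, 3, 4, 5, 6, 7, 9]
crosstab = [
    [1, 15, 1, 480, 174, 2, 255, 15, 37],
    [1114, 2, 0, 0, 2, 0, 3, 10, 4],
    [104, 310, 10, 6, 226, 14, 119, 8, 235],
    [59, 190, 19, 1, 129, 18, 559, 18, 17],
    [49, 8, 204, 10, 3, 514, 1, 165, 28],
    [54, 47, 18, 7, 46, 38, 372, 298, 12],
    [55, 26, 4, 65, 7, 4, 26, 5, 766],
    [74, 4, 134, 0, 2, 617, 1, 194, 2],
    [96, 44, 22, 4, 19, 58, 521, 200, 10],
    [31, 2, 210, 3, 9, 654, 19, 80, 1],
]


def majority_mapping(table, ids):
    """Map each cluster id to the digit that appears most often in it."""
    mapping = {}
    for col, cluster in enumerate(ids):
        counts = [row[col] for row in table]
        mapping[cluster] = counts.index(max(counts))
    return mapping


def purity(table):
    """Share of images that belong to the majority digit of their cluster."""
    total = sum(sum(row) for row in table)
    best = sum(max(row[col] for row in table) for col in range(len(table[0])))
    return best / total


mapping = majority_mapping(crosstab, cluster_ids)
score = purity(crosstab)
print(mapping)
print(round(score, 4))
```

On these counts the purity is 0.4617. Digits 2 and 9 are each the majority of two clusters, while digits 4, 7 and 8 are the majority of none.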



**Models with PCA**

In [27]:
print_crosstab(model_list_pca, predictions_pca)

Pipeline: [StandardScaler_4200ac3d67976e896566, PCA_40c996a054db5d8e508d]
Model: LogisticRegression_484eaf71803d0b80b0d6
+----------------+---+----+---+---+---+---+---+---+---+---+
|label_prediction|0.0| 1.0|2.0|3.0|4.0|5.0|6.0|7.0|8.0|9.0|
+----------------+---+----+---+---+---+---+---+---+---+---+
|               0|946|   0|  7|  5|  0|  8|  8|  2|  4|  0|
|               1|  0|1108|  3|  2|  0|  1|  3|  1| 17|  0|
|               2|  8|  11|876| 22| 19|  5| 19| 16| 42| 14|
|               3|  4|   0| 23|894|  2| 41|  2| 15| 21|  8|
|               4|  3|   2|  3|  0|889|  2| 15|  3| 14| 51|
|               5| 13|   3| 12| 60| 21|707| 18|  8| 42|  8|
|               6| 20|   3|  8|  1| 20| 12|888|  1|  5|  0|
|               7|  5|   6| 39|  3| 13|  0|  0|918|  7| 37|
|               8| 11|   9| 14| 37|  9| 44| 12|  7|813| 18|
|               9|  9|  12|  9|  8| 53| 19|  1| 35|  8|855|
+----------------+---+----+---+---+---+---+---+---+---+---+

Pipeline: [StandardScaler_4200ac3d6797

#### 3.6. Metrics

In [29]:
import pandas as pd

def print_metrics(cvModel):
    model = cvModel.bestModel.stages[-1]
    # model.summary is the training summary, so the metrics below are computed on the training data
    summary = model.summary
    
    print('Pipeline:', cvModel.bestModel.stages[:-1])
    print('Model:', str(model))
    print('Accuracy:', summary.accuracy)
    
    f_measure = summary.fMeasureByLabel()
    
    metrics_df = pd.DataFrame(
        {
            'Label': range(len(f_measure)),
            'F-measure': f_measure,
            'TPR (Recall)': summary.truePositiveRateByLabel,
            'FPR (1 - Specificity)': summary.falsePositiveRateByLabel,
            'Precision': summary.precisionByLabel
        }
    )
    
    print(metrics_df.round(2).to_string(index=False))
    
    print()

def print_custom_metrics(cvModel, predictions):
    model = cvModel.bestModel.stages[-1]
    
    print('Pipeline:', cvModel.bestModel.stages[:-1])
    print('Model:', model.__str__())
    
    y_true = predictions.select('label').toPandas().values.flatten()
    y_pred = predictions.select('prediction').toPandas().values.flatten()
    
    print(metrics.classification_report(y_true, y_pred))
    
    print()

**Without PCA**

In [44]:
# Logistic Regression metrics
print_metrics(model_list[0])

Pipeline: [StandardScaler_456fb8b83d10d193a4dc]
Model: LogisticRegression_4698bb555695f6efcda9
Accuracy: 0.9352333333333334
F-measure  FPR (1 - Specificity)  Label  Precision  TPR (Recall)
     0.97                   0.00      0       0.97          0.98
     0.97                   0.00      1       0.96          0.98
     0.93                   0.01      2       0.94          0.92
     0.92                   0.01      3       0.92          0.91
     0.94                   0.01      4       0.94          0.94
     0.90                   0.01      5       0.92          0.89
     0.96                   0.01      6       0.95          0.97
     0.95                   0.01      7       0.95          0.94
     0.90                   0.01      8       0.89          0.90
     0.92                   0.01      9       0.91          0.92
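
The accuracy printed above is taken from `model.summary`, which in Spark describes the data the model was fitted on. A test-set figure for the same model can be recovered from the Logistic Regression crosstab in section 3.5, as the sum of the diagonal divided by the total number of test images. The sketch below does that arithmetic in plain Python on the counts copied from that table. The variable names are hypothetical.

```python
# Rows: true digits 0-9. Columns: predicted digits 0-9.
# Counts copied from the Logistic Regression (without PCA) crosstab in section 3.5.
confusion = [
    [960, 0, 1, 2, 2, 4, 7, 1, 2, 1],
    [0, 1113, 5, 2, 0, 1, 3, 2, 9, 0],
    [5, 10, 927, 16, 7, 3, 12, 11, 39, 2],
    [2, 0, 17, 926, 1, 22, 3, 9, 24, 6],
    [1, 1, 3, 2, 915, 0, 12, 5, 10, 33],
    [8, 2, 2, 34, 8, 769, 13, 11, 38, 7],
    [10, 3, 8, 1, 7, 15, 910, 2, 2, 0],
    [1, 5, 24, 8, 4, 1, 0, 949, 5, 31],
    [6, 11, 6, 22, 9, 25, 9, 13, 865, 8],
    [8, 8, 2, 8, 25, 5, 0, 19, 11, 923],
]

# Correct predictions lie on the diagonal
correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)
test_accuracy = correct / total

# Recall per digit: diagonal count divided by the row total
recall_by_digit = [confusion[i][i] / sum(confusion[i]) for i in range(len(confusion))]

print(round(test_accuracy, 4))
print([round(r, 2) for r in recall_by_digit])
```

This gives a test accuracy of 0.9257, slightly below the 0.9352 reported by the summary.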



In [79]:
print_custom_metrics(model_list[1], predictions[1])
print_custom_metrics(model_list[2], predictions[2])

Pipeline: [StandardScaler_456fb8b83d10d193a4dc]
Model: RandomForestClassificationModel (uid=RandomForestClassifier_474a90300003cd62675a) with 20 trees
             precision    recall  f1-score   support

          0       0.97      0.99      0.98       980
          1       0.99      0.99      0.99      1135
          2       0.96      0.96      0.96      1032
          3       0.93      0.95      0.94      1010
          4       0.96      0.95      0.96       982
          5       0.97      0.93      0.95       892
          6       0.96      0.98      0.97       958
          7       0.97      0.95      0.96      1028
          8       0.94      0.94      0.94       974
          9       0.94      0.95      0.94      1009

avg / total       0.96      0.96      0.96     10000


Pipeline: [StandardScaler_456fb8b83d10d193a4dc]
Model: MultilayerPerceptronClassifier_458a8220b24b620b0ee1
             precision    recall  f1-score   support

          0       0.97      0.99      0.98      

**With PCA**

In [30]:
# Logistic Regression metrics
print_metrics(model_list_pca[0])

Pipeline: [StandardScaler_4200ac3d67976e896566, PCA_40c996a054db5d8e508d]
Model: LogisticRegression_484eaf71803d0b80b0d6
Accuracy: 0.8833833333333333
F-measure  FPR (1 - Specificity)  Label  Precision  TPR (Recall)
     0.94                   0.01      0       0.93          0.95
     0.95                   0.01      1       0.94          0.97
     0.86                   0.01      2       0.88          0.85
     0.86                   0.02      3       0.86          0.85
     0.89                   0.01      4       0.88          0.91
     0.80                   0.02      5       0.82          0.78
     0.92                   0.01      6       0.91          0.93
     0.91                   0.01      7       0.91          0.90
     0.83                   0.02      8       0.83          0.83
     0.85                   0.02      9       0.85          0.85



In [31]:
print_custom_metrics(model_list_pca[1], predictions_pca[1])
print_custom_metrics(model_list_pca[2], predictions_pca[2])

Pipeline: [StandardScaler_4200ac3d67976e896566, PCA_40c996a054db5d8e508d]
Model: RandomForestClassificationModel (uid=RandomForestClassifier_48a090634c9a1cc6150a) with 20 trees
             precision    recall  f1-score   support

          0       0.97      0.97      0.97       980
          1       0.98      0.98      0.98      1135
          2       0.93      0.93      0.93      1032
          3       0.92      0.93      0.92      1010
          4       0.93      0.93      0.93       982
          5       0.91      0.93      0.92       892
          6       0.95      0.97      0.96       958
          7       0.96      0.92      0.94      1028
          8       0.90      0.90      0.90       974
          9       0.91      0.91      0.91      1009

avg / total       0.94      0.94      0.94     10000


Pipeline: [StandardScaler_4200ac3d67976e896566, PCA_40c996a054db5d8e508d]
Model: MultilayerPerceptronClassifier_41a1a7dfc612324a502f
             precision    recall  f1-score   suppo

After running the different classification algorithms, we can see that they perform better when PCA is not used: the Logistic Regression accuracy reported by the model summary drops from 0.935 to 0.883, and the average F1-score of the Random Forest on the test set drops from 0.96 to 0.94.

Among the classification methods implemented (Multilayer Perceptron, Random Forest, and Logistic Regression), the best performance was obtained with Random Forest and the MLP without PCA.

In [18]:
# Evaluation of clustering methods
import pandas as pd

def print_cluster_metrics(cvModel):
    model = cvModel.bestModel.stages[-1]
    summary = model.summary
    
    print('Pipeline:', cvModel.bestModel.stages[:-1])
    print('Model:', model.__str__())
    
    metrics_df = pd.DataFrame(
        {
            'Label': range(len(summary.clusterSizes)),
            'clusterSizes': summary.clusterSizes,
        }
    )
    
    print(metrics_df.round(2).to_string(index=False))
    
    print('trainingCost:', round(summary.trainingCost, 2))
    
    print()

In [22]:
tmp = [print_cluster_metrics(model) for model in cluster_list + cluster_list_pca]

Pipeline: [StandardScaler_9b98e64ce93c]
Model: KMeans_b5816906ce6d
 Label  clusterSizes
     0         10279
     1          4140
     2          3916
     3          3361
     4          4233
     5         10889
     6         10033
     7          6457
     8             4
     9          6688
trainingCost: 36684486.11

Pipeline: [StandardScaler_9b98e64ce93c, PCA_f1334576f97f]
Model: KMeans_b5816906ce6d
 Label  clusterSizes
     0          5442
     1          4999
     2          6861
     3          5404
     4          4587
     5          5226
     6          7263
     7          7633
     8          5689
     9          6896
trainingCost: 90086714778.98
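
The two `trainingCost` values above differ by several orders of magnitude. As a reminder of what this number measures, the sketch below computes the K-Means cost in plain Python as the sum of squared distances from each point to its nearest centroid, and shows how the value changes when the features are rescaled. The helper `kmeans_cost` and the toy points are hypothetical. The takeaway, under this definition, is that costs computed in feature spaces with different scales are not directly comparable.

```python
def kmeans_cost(points, centroids):
    """Sum of squared Euclidean distances from each point to its nearest centroid."""
    return sum(
        min(sum((x - c) ** 2 for x, c in zip(point, centroid)) for centroid in centroids)
        for point in points
    )


# Toy data: two points near the first centroid, one exactly on the second
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0)]
centroids = [(0.0, 1.0), (10.0, 10.0)]
cost = kmeans_cost(points, centroids)

# Same geometry with every feature multiplied by 100
factor = 100.0
scaled_points = [tuple(x * factor for x in p) for p in points]
scaled_centroids = [tuple(c * factor for c in centroid) for centroid in centroids]
scaled_cost = kmeans_cost(scaled_points, scaled_centroids)

print(cost, scaled_cost)
```

Multiplying every feature by 100 multiplies the cost by 100 squared while the clusters themselves stay the same.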



In [24]:
from sklearn import metrics

def print_cluster_metrics(cvModel, predictions):
    model = cvModel.bestModel.stages[-1]
    summary = model.summary
    
    print('Pipeline:', cvModel.bestModel.stages[:-1])
    print('Model:', model.__str__())
    
    y_true = predictions.select('label').toPandas().values.flatten()
    y_pred = predictions.select('prediction').toPandas().values.flatten()
    
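    # Note: the scores below compare raw cluster ids with digit labels, and each
    # micro-averaged score is a single global value repeated on every row
    metrics_df = pd.DataFrame(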
        {
            'Label': range(len(summary.clusterSizes)),
            'clusterSizes': summary.clusterSizes,
            'F-measure': metrics.f1_score(y_true, y_pred, average='micro'),
            'TPR (Recall)': metrics.recall_score(y_true, y_pred, average='micro'),
            'Precision': metrics.precision_score(y_true, y_pred, average='micro')
        }
    )
    
    print(metrics_df.round(2).to_string(index=False))
    
    print('\ntrainingCost:', round(summary.trainingCost, 2))
    
    print()

**Clustering without PCA**

In [25]:
assert len(cluster_list) == len(predictions_clust)

for i in range(len(cluster_list)):
    print_cluster_metrics(cluster_list[i], predictions_clust[i])

Pipeline: [StandardScaler_9b98e64ce93c]
Model: KMeans_b5816906ce6d
 Label  clusterSizes  F-measure  TPR (Recall)  Precision
     0         10279       0.03          0.03       0.03
     1          4140       0.03          0.03       0.03
     2          3916       0.03          0.03       0.03
     3          3361       0.03          0.03       0.03
     4          4233       0.03          0.03       0.03
     5         10889       0.03          0.03       0.03
     6         10033       0.03          0.03       0.03
     7          6457       0.03          0.03       0.03
     8             4       0.03          0.03       0.03
     9          6688       0.03          0.03       0.03

trainingCost: 36684486.11
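
The three score columns above are identical on every row. This is expected: each score is one global value, and with exactly one label and one prediction per image, micro-averaged precision, recall and F1 all reduce to the fraction of images whose prediction equals the label. The sketch below checks this in plain Python on toy data. The helper `micro_scores` and the label lists are hypothetical and only mimic what micro averaging does.

```python
def micro_scores(y_true, y_pred):
    """Micro-averaged precision, recall and F1 from pooled per-class counts."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1  # predicted this class and it was right
            elif pred == label:
                fp += 1  # predicted this class but it was another one
            elif truth == label:
                fn += 1  # missed an instance of this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels and predictions (hypothetical)
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 1, 1, 0]

precision, recall, f1 = micro_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, f1, accuracy)
```

Every wrong prediction counts once as a false positive for the predicted class and once as a false negative for the true class, which is why the pooled precision and recall coincide.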



**Clustering with PCA**

In [26]:
# predicting pca
assert len(cluster_list_pca) == len(predictions_clust_pca)

for i in range(len(cluster_list_pca)):
    print_cluster_metrics(cluster_list_pca[i], predictions_clust_pca[i])

Pipeline: [StandardScaler_9b98e64ce93c, PCA_f1334576f97f]
Model: KMeans_b5816906ce6d
 Label  clusterSizes  F-measure  TPR (Recall)  Precision
     0          5442       0.04          0.04       0.04
     1          4999       0.04          0.04       0.04
     2          6861       0.04          0.04       0.04
     3          5404       0.04          0.04       0.04
     4          4587       0.04          0.04       0.04
     5          5226       0.04          0.04       0.04
     6          7263       0.04          0.04       0.04
     7          7633       0.04          0.04       0.04
     8          5689       0.04          0.04       0.04
     9          6896       0.04          0.04       0.04

trainingCost: 90086714778.98



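For the clustering algorithm, K-Means, the results were significantly worse than those obtained with the supervised classification methods: the micro-averaged F-measure on the test set is 0.03 without PCA and 0.04 with PCA, compared with an average F1-score of 0.96 for the Random Forest without PCA. These scores compare the raw cluster indices with the digit labels, so they depend on the arbitrary numbering that K-Means gives its clusters.

With respect to PCA, the results were slightly more favorable when PCA was used.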