## Classification of the Palmer penguins data

This document is part of the showcase, where I replicate the same brief and simple analyses with different tools.

This particular file focuses on simple classification of the Palmer penguins data from the tidytuesday project.

The data can be found in <https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-07-28>. They consist of a single file: *penguins.csv* contains information and measurements about some penguins.

For this analysis I will use **Python** and **pyspark** (plus **Jupyter notebook**).

We start by loading the pyspark package and starting a Spark session:

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

We load the dataset:

In [2]:
from pyspark import SparkFiles

url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-28/penguins.csv"
spark.sparkContext.addFile(url)
penguins = spark.read.csv(SparkFiles.get("penguins.csv"), header=True)

and check the schema:

In [3]:
penguins.printSchema()

root
 |-- species: string (nullable = true)
 |-- island: string (nullable = true)
 |-- bill_length_mm: string (nullable = true)
 |-- bill_depth_mm: string (nullable = true)
 |-- flipper_length_mm: string (nullable = true)
 |-- body_mass_g: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- year: string (nullable = true)



Everything was read as string, so we have to fix the schema:

In [4]:
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType, FloatType

penguins = penguins.withColumn("bill_length_mm",col("bill_length_mm").cast(FloatType())) \
    .withColumn("bill_depth_mm",col("bill_depth_mm").cast(FloatType())) \
    .withColumn("flipper_length_mm",col("flipper_length_mm").cast(FloatType())) \
    .withColumn("body_mass_g",col("body_mass_g").cast(FloatType())) \
    .withColumn("year",col("year").cast(IntegerType()))

penguins.printSchema()

root
 |-- species: string (nullable = true)
 |-- island: string (nullable = true)
 |-- bill_length_mm: float (nullable = true)
 |-- bill_depth_mm: float (nullable = true)
 |-- flipper_length_mm: float (nullable = true)
 |-- body_mass_g: float (nullable = true)
 |-- sex: string (nullable = true)
 |-- year: integer (nullable = true)
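
The cast does more than change the column types: a string such as `NA` that cannot be parsed as a number ends up as null (this is the behaviour with Spark's non-ANSI cast settings; under ANSI mode the cast would raise an error instead). A rough plain-Python analogy, not Spark code, where `to_float_or_none` is a hypothetical helper introduced only for illustration:

```python
def to_float_or_none(text):
    """Parse a string as a float, returning None when it cannot be parsed."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


# Made-up raw values, with "NA" standing in for a missing measurement.
raw = ["39.1", "NA", "36.7"]
parsed = [to_float_or_none(v) for v in raw]
print(parsed)
```

String columns are not cast, so any `NA` they contain stays a literal string.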



We can now look at a single record as an example:

In [5]:
penguins.show(1, truncate=False, vertical=True)

-RECORD 0----------------------
 species           | Adelie    
 island            | Torgersen 
 bill_length_mm    | 39.1      
 bill_depth_mm     | 18.7      
 flipper_length_mm | 181.0     
 body_mass_g       | 3750.0    
 sex               | male      
 year              | 2007      
only showing top 1 row



And the summary statistics:

In [6]:
penguins.describe().show(truncate=False, vertical=True)

-RECORD 0-------------------------------
 summary           | count              
 species           | 344                
 island            | 344                
 bill_length_mm    | 342                
 bill_depth_mm     | 342                
 flipper_length_mm | 342                
 body_mass_g       | 342                
 sex               | 344                
 year              | 344                
-RECORD 1-------------------------------
 summary           | mean               
 species           | null               
 island            | null               
 bill_length_mm    | 43.921929733097905 
 bill_depth_mm     | 17.151169584508526 
 flipper_length_mm | 200.91520467836258 
 body_mass_g       | 4201.754385964912  
 sex               | null               
 year              | 2008.0290697674418 
-RECORD 2-------------------------------
 summary           | stddev             
 species           | null               
 island            | null               
 bill_length_mm 

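The missing values are not handled consistently: the numeric columns already contain nulls after the cast (hence the counts of 342), but the `sex` column still holds the literal string 'NA' (hence its count of 344). We therefore replace every 'NA' with `None`, which Spark stores as null: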

In [7]:
penguins = penguins.replace('NA', None)
penguins.show()

+-------+---------+--------------+-------------+-----------------+-----------+------+----+
|species|   island|bill_length_mm|bill_depth_mm|flipper_length_mm|body_mass_g|   sex|year|
+-------+---------+--------------+-------------+-----------------+-----------+------+----+
| Adelie|Torgersen|          39.1|         18.7|            181.0|     3750.0|  male|2007|
| Adelie|Torgersen|          39.5|         17.4|            186.0|     3800.0|female|2007|
| Adelie|Torgersen|          40.3|         18.0|            195.0|     3250.0|female|2007|
| Adelie|Torgersen|          null|         null|             null|       null|  null|2007|
| Adelie|Torgersen|          36.7|         19.3|            193.0|     3450.0|female|2007|
| Adelie|Torgersen|          39.3|         20.6|            190.0|     3650.0|  male|2007|
| Adelie|Torgersen|          38.9|         17.8|            181.0|     3625.0|female|2007|
| Adelie|Torgersen|          39.2|         19.6|            195.0|     4675.0|  male|2007|

Now we can count the missing values:

In [8]:
from pyspark.sql.functions import when, count, col

penguins.select([count(when(col(c).isNull(), c)).alias(c) for c in penguins.columns]).show()

+-------+------+--------------+-------------+-----------------+-----------+---+----+
|species|island|bill_length_mm|bill_depth_mm|flipper_length_mm|body_mass_g|sex|year|
+-------+------+--------------+-------------+-----------------+-----------+---+----+
|      0|     0|             2|            2|                2|          2| 11|   0|
+-------+------+--------------+-------------+-----------------+-----------+---+----+
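
The `count(when(col(c).isNull(), c))` idiom counts, for each column, the rows whose value is null. The same logic can be sketched in plain Python on a few made-up rows (illustrative data, not the real dataset):

```python
# Three made-up rows; None plays the role of Spark's null.
rows = [
    {"species": "Adelie", "bill_length_mm": 39.1, "sex": "male"},
    {"species": "Adelie", "bill_length_mm": None, "sex": None},
    {"species": "Gentoo", "bill_length_mm": 46.1, "sex": None},
]

columns = list(rows[0])

# For each column, count the rows where the value is missing.
null_counts = {c: sum(1 for r in rows if r[c] is None) for c in columns}
print(null_counts)
```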



There are only a few of them, so we simply drop the affected rows:

In [9]:
penguins = penguins.na.drop()
penguins.show()

+-------+---------+--------------+-------------+-----------------+-----------+------+----+
|species|   island|bill_length_mm|bill_depth_mm|flipper_length_mm|body_mass_g|   sex|year|
+-------+---------+--------------+-------------+-----------------+-----------+------+----+
| Adelie|Torgersen|          39.1|         18.7|            181.0|     3750.0|  male|2007|
| Adelie|Torgersen|          39.5|         17.4|            186.0|     3800.0|female|2007|
| Adelie|Torgersen|          40.3|         18.0|            195.0|     3250.0|female|2007|
| Adelie|Torgersen|          36.7|         19.3|            193.0|     3450.0|female|2007|
| Adelie|Torgersen|          39.3|         20.6|            190.0|     3650.0|  male|2007|
| Adelie|Torgersen|          38.9|         17.8|            181.0|     3625.0|female|2007|
| Adelie|Torgersen|          39.2|         19.6|            195.0|     4675.0|  male|2007|
| Adelie|Torgersen|          41.1|         17.6|            182.0|     3200.0|female|2007|

Now we check again:

In [10]:
penguins.select([count(when(col(c).isNull(), c)).alias(c) for c in penguins.columns]).show()

+-------+------+--------------+-------------+-----------------+-----------+---+----+
|species|island|bill_length_mm|bill_depth_mm|flipper_length_mm|body_mass_g|sex|year|
+-------+------+--------------+-------------+-----------------+-----------+---+----+
|      0|     0|             0|            0|                0|          0|  0|   0|
+-------+------+--------------+-------------+-----------------+-----------+---+----+



We also drop the "year" column, because it is not going to be useful for our model:

In [11]:
penguins = penguins.drop('year')

Now we can split the dataset into training and test sets:

In [12]:
train_penguins, test_penguins = penguins.randomSplit([0.8, 0.2])
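
Note that `randomSplit` is called without a seed here, so the split (and every accuracy figure further down) changes from run to run; passing one, for example `penguins.randomSplit([0.8, 0.2], seed=42)`, should make it repeatable on the same data. As far as I understand it, the split is made by a random draw per row, so the 80/20 proportions are only approximate. A minimal plain-Python sketch of that idea, where `random_split` is a hypothetical helper and not Spark's implementation:

```python
import random


def random_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(100))  # 100 stand-in rows

# The same seed gives the same split every time.
train_a, test_a = random_split(rows, 0.8, seed=42)
train_b, test_b = random_split(rows, 0.8, seed=42)
print(len(train_a), len(test_a))
```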

We add some pre-processing steps, specifically:

* StringIndexer will turn the strings into numeric category indices
* OneHotEncoder will perform one-hot encoding of the categorical features
* StandardScaler will scale the numeric predictors (by default to unit standard deviation, without centering)
* VectorAssembler will assemble the features into a feature vector

In [13]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StandardScaler

We can display the DataFrame to check its columns again:

In [14]:
penguins

DataFrame[species: string, island: string, bill_length_mm: float, bill_depth_mm: float, flipper_length_mm: float, body_mass_g: float, sex: string]

We start by turning all string attributes into numeric indices:

In [15]:
# Multi-column form (Spark 3.0+), shown for reference only; the pipeline below uses the single-column indexers.
str_Ind = StringIndexer(inputCols=["sex", "island"], outputCols=["sex_ind", "island_ind"])

In [16]:
sex_indexer = StringIndexer(inputCol="sex", outputCol="sex_ind")

In [17]:
island_indexer = StringIndexer(inputCol="island", outputCol="island_ind")

In [18]:
species_indexer = StringIndexer(inputCol="species", outputCol="label")
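
With its default settings, `StringIndexer` orders the labels by descending frequency, so the most common value in the data it is fitted on receives index 0.0. A rough plain-Python sketch of that mapping on made-up values (`fit_string_index` is a hypothetical helper, and the alphabetical tie-break is my assumption about recent Spark versions):

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending count, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Made-up sample, not the real class distribution.
species = ["Adelie", "Gentoo", "Adelie", "Chinstrap", "Gentoo", "Adelie"]
mapping = fit_string_index(species)
print(mapping)
```

This is why the `label` column in the prediction tables holds values such as 0.0, 1.0 and 2.0 rather than species names.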

One-hot-encoding:

In [19]:
ohe = OneHotEncoder(inputCols = ['sex_ind', 'island_ind'], outputCols=['sex_ohe', 'island_ohe'])
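
By default `OneHotEncoder` drops the last category, so a feature with k categories becomes a vector of length k - 1 and the last category is encoded as all zeros. A small sketch of that encoding with dense Python lists (Spark itself produces sparse vectors; `one_hot` is a hypothetical helper):

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a 0/1 list, optionally dropping the last slot."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec


# Three categories, such as the three islands.
print(one_hot(0, 3))                   # first category
print(one_hot(2, 3))                   # last category: all zeros
print(one_hot(2, 3, drop_last=False))  # full-length encoding
```

Dropping one slot avoids a redundant column, since the last category is implied when all the others are zero.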

Assembling all numerical features and scaling them:

In [20]:
assembler1 = VectorAssembler(inputCols=['bill_length_mm','bill_depth_mm','flipper_length_mm','body_mass_g'], outputCol="features_scaled1")
scaler = StandardScaler(inputCol="features_scaled1", outputCol="features_scaled")
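
With the defaults used here (`withStd=True`, `withMean=False`), `StandardScaler` divides each feature by its sample standard deviation but does not subtract the mean. A plain-Python sketch for a single feature, using four body-mass values from the rows shown earlier (`scale_to_unit_std` is a hypothetical helper):

```python
import statistics


def scale_to_unit_std(values):
    """Divide by the sample standard deviation; the mean is left unshifted."""
    std = statistics.stdev(values)
    return [v / std for v in values]


body_mass = [3750.0, 3800.0, 3250.0, 3450.0]
scaled = scale_to_unit_std(body_mass)

print(statistics.stdev(scaled))  # close to 1.0
print(statistics.mean(scaled))   # not 0: the data are scaled, not centred
```

Setting `withMean=True` on the scaler would also centre the features, at the cost of producing dense vectors.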

Assembling all features:

In [21]:
assembler2 = VectorAssembler(inputCols=['sex_ohe', 'island_ohe','features_scaled'], outputCol="features")

The first model we are going to use is a multiclass logistic regression model (note that Spark's support vector machine implementation, `LinearSVC`, does not support more than two classes):

In [22]:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
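
The `regParam` and `elasticNetParam` arguments control an elastic-net penalty that mixes L1 and L2 regularization: `elasticNetParam=0` is pure L2 (ridge), `elasticNetParam=1` is pure L1 (lasso). The penalty term, as given in the Spark documentation, can be sketched as follows (the weights are made up, and `elastic_net_penalty` is a hypothetical helper):

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """reg_param * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (
        elastic_net_param * l1 + (1.0 - elastic_net_param) / 2.0 * l2_squared
    )


weights = [1.0, -2.0, 0.0]  # made-up coefficients
penalty = elastic_net_penalty(weights, reg_param=0.3, elastic_net_param=0.8)
print(penalty)
```

A fairly large `regParam` such as 0.3 shrinks the coefficients strongly, which tends to pull the predicted class probabilities towards each other.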

We construct a pipeline with the pre-processing steps and the model:

In [23]:
pipeline = Pipeline(stages = [sex_indexer, island_indexer, species_indexer, ohe, assembler1, scaler, assembler2, lr])

We fit the pipeline to the training data:

In [24]:
pModel = pipeline.fit(train_penguins)

And we use it on the testing data:

In [25]:
testingPred = pModel.transform(test_penguins)

We can check the predictions and their probabilities (remember that we have a multiclass logistic regression model):

In [26]:
testingPred.select('label','probability','prediction').show()

+-----+--------------------+----------+
|label|         probability|prediction|
+-----+--------------------+----------+
|  0.0|[0.43380747392115...|       0.0|
|  0.0|[0.43140546838099...|       0.0|
|  0.0|[0.38064482728401...|       1.0|
|  0.0|[0.45155213695603...|       0.0|
|  0.0|[0.40074319738246...|       0.0|
|  0.0|[0.41802236439608...|       0.0|
|  0.0|[0.42153372637713...|       0.0|
|  0.0|[0.41116176426650...|       0.0|
|  0.0|[0.46839620040492...|       0.0|
|  0.0|[0.47184560104111...|       0.0|
|  0.0|[0.49111323991538...|       0.0|
|  0.0|[0.45547413833924...|       0.0|
|  0.0|[0.46198566862360...|       0.0|
|  0.0|[0.45079245281516...|       0.0|
|  0.0|[0.46027711782604...|       0.0|
|  0.0|[0.47554251764059...|       0.0|
|  0.0|[0.48720154024803...|       0.0|
|  0.0|[0.44610865215814...|       0.0|
|  0.0|[0.44215089511852...|       0.0|
|  0.0|[0.46331029306059...|       0.0|
+-----+--------------------+----------+
only showing top 20 rows
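
Each `probability` entry is a vector with one value per class, and `prediction` is the index of the largest one. For a multinomial model the probabilities come from a softmax over the per-class scores; a minimal sketch with made-up scores (not values taken from the fitted model):

```python
import math


def softmax(scores):
    """Turn per-class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [0.2, 0.1, -0.3]  # made-up scores for three classes
probs = softmax(scores)

# The predicted class is the index with the highest probability.
prediction = float(probs.index(max(probs)))
print(probs, prediction)
```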



Finally, we can calculate the accuracy:

In [27]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(labelCol = "label", predictionCol = "prediction", metricName = "accuracy")
acc = evaluator.evaluate(testingPred)
acc

0.8428571428571429

About 84%.
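
The accuracy metric is simply the share of test rows whose prediction equals the label. A plain-Python sketch on made-up labels and predictions (`accuracy` is a hypothetical helper, not part of pyspark):

```python
def accuracy(labels, predictions):
    """Fraction of rows where the prediction matches the label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Made-up values: four of the five predictions are correct.
labels = [0.0, 0.0, 1.0, 2.0, 1.0]
predictions = [0.0, 1.0, 1.0, 2.0, 1.0]
acc_example = accuracy(labels, predictions)
print(acc_example)
```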

For the second model, we use a decision tree:

In [28]:
from pyspark.ml.classification import DecisionTreeClassifier

dt = DecisionTreeClassifier(labelCol="label", featuresCol="features")
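
By default the classifier chooses its splits using Gini impurity, which is 0 for a node containing a single class and grows as the classes become more mixed. A short sketch of the measure on made-up labels (`gini` is a hypothetical helper):

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


print(gini([0, 0, 0]))     # pure node
print(gini([0, 0, 1, 1]))  # evenly mixed two-class node
```

The tree picks the split that reduces this impurity the most. Because splits depend only on the ordering of feature values, the scaling step is not needed for this model, though reusing the same pre-processing does no harm.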

We construct the new pipeline by combining the old pre-processing steps with the new model:

In [29]:
pipeline_dt = Pipeline(stages = [sex_indexer, island_indexer, species_indexer, ohe, assembler1, scaler, assembler2, dt])

We fit the training data:

In [30]:
pModel_dt = pipeline_dt.fit(train_penguins)

And test on the testing data:

In [31]:
testingPred_dt = pModel_dt.transform(test_penguins)

We can check the predictions:

In [32]:
testingPred_dt.select('label','prediction').show()

+-----+----------+
|label|prediction|
+-----+----------+
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       2.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       2.0|
|  0.0|       0.0|
|  0.0|       0.0|
+-----+----------+
only showing top 20 rows



And calculate the accuracy:

In [33]:
acc_dt = evaluator.evaluate(testingPred_dt)
acc_dt

0.9714285714285714

The decision tree was correct in about 97% of its predictions on the test set.