# Data Science with PySpark Demo

In this notebook we will use a classification model to classify flower types by their characteristics, using the famous [iris dataset]().

This small dataset from 1936 is often used for testing out machine learning algorithms and visualizations (for example, scatter plots). Each row of the table represents an iris flower, including its species and the dimensions of its botanical parts (sepal and petal) in centimeters.

We have already explored the dataset using [this notebook](https://www.kaggle.com/aceccon/1-iris-dataset-data-exploration/notebook).

**Our goal is to get more than 90% of our predictions right (>90% accuracy).**

# Load Data

In [67]:
from pyspark.sql import SQLContext

sqlcontext = SQLContext(sc)
raw_df = sqlcontext.read.csv('iris.csv', header=True, inferSchema=True)
raw_df.show(10)
raw_df.printSchema()

+------------+-----------+------------+-----------+-------+
|sepal_length|sepal_width|petal_length|petal_width|species|
+------------+-----------+------------+-----------+-------+
|         5.1|        3.5|         1.4|        0.2| setosa|
|         4.9|        3.0|         1.4|        0.2| setosa|
|         4.7|        3.2|         1.3|        0.2| setosa|
|         4.6|        3.1|         1.5|        0.2| setosa|
|         5.0|        3.6|         1.4|        0.2| setosa|
|         5.4|        3.9|         1.7|        0.4| setosa|
|         4.6|        3.4|         1.4|        0.3| setosa|
|         5.0|        3.4|         1.5|        0.2| setosa|
|         4.4|        2.9|         1.4|        0.2| setosa|
|         4.9|        3.1|         1.5|        0.1| setosa|
+------------+-----------+------------+-----------+-------+
only showing top 10 rows

root
 |-- sepal_length: double (nullable = true)
 |-- sepal_width: double (nullable = true)
 |-- petal_length: double (nullable = true)
 |-- petal_width: double (nullable = true)
 |-- species: string (nullable = true)

# Cleaning 

We have already explored the dataset using [this notebook](https://www.kaggle.com/aceccon/1-iris-dataset-data-exploration/notebook) and found **no missing values** and no suspicious duplicates. In general, this dataset is well known for being clean and ML-ready.

# Preprocessing

#### Encode Labels

In [68]:
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol="species", outputCol="label")
df = indexer.fit(raw_df).transform(raw_df)
df.show(5)

+------------+-----------+------------+-----------+-------+-----+
|sepal_length|sepal_width|petal_length|petal_width|species|label|
+------------+-----------+------------+-----------+-------+-----+
|         5.1|        3.5|         1.4|        0.2| setosa|  2.0|
|         4.9|        3.0|         1.4|        0.2| setosa|  2.0|
|         4.7|        3.2|         1.3|        0.2| setosa|  2.0|
|         4.6|        3.1|         1.5|        0.2| setosa|  2.0|
|         5.0|        3.6|         1.4|        0.2| setosa|  2.0|
+------------+-----------+------------+-----------+-------+-----+
only showing top 5 rows
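
By default, `StringIndexer` maps each distinct string to a numeric index in descending order of frequency, so the most frequent label gets `0.0`. The sketch below mimics that idea in plain Python, without Spark. The helper name `index_labels`, the sample list and the alphabetical tie-break are assumptions made for illustration. The iris dataset has 50 rows per species, so all three labels are tied, and the order among tied labels can depend on the Spark version, which would explain why `setosa` received `2.0` above.

```python
from collections import Counter

def index_labels(values):
    """Map each distinct string to a float index, most frequent first.

    Ties are broken alphabetically here (an assumption of this sketch).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical, deliberately unbalanced sample so the ordering is visible.
species = ["setosa", "virginica", "virginica", "versicolor", "virginica", "setosa"]
mapping = index_labels(species)
print(mapping)
```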



#### Convert to Spark ML Schema

In [69]:
from pyspark.ml.feature import VectorAssembler

vectors2features = VectorAssembler(inputCols=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], outputCol='features')
df = vectors2features.transform(df)
df = df.select(df['label'], df['features'])
df.show(10)

+-----+-----------------+
|label|         features|
+-----+-----------------+
|  2.0|[5.1,3.5,1.4,0.2]|
|  2.0|[4.9,3.0,1.4,0.2]|
|  2.0|[4.7,3.2,1.3,0.2]|
|  2.0|[4.6,3.1,1.5,0.2]|
|  2.0|[5.0,3.6,1.4,0.2]|
|  2.0|[5.4,3.9,1.7,0.4]|
|  2.0|[4.6,3.4,1.4,0.3]|
|  2.0|[5.0,3.4,1.5,0.2]|
|  2.0|[4.4,2.9,1.4,0.2]|
|  2.0|[4.9,3.1,1.5,0.1]|
+-----+-----------------+
only showing top 10 rows



#### Train Test Split

In [70]:
train, test = df.randomSplit([0.8, 0.2], seed=2405)
# Use format() so the cell runs under both Python 2 and Python 3
print("Train #rows: {}".format(train.count()))
print("Test #rows: {}".format(test.count()))

Train #rows: 123
Test #rows: 27
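
The split is 123/27 rather than an exact 120/30 because `randomSplit` assigns rows to the splits at random according to the weights, so the resulting sizes are only approximately proportional. A minimal plain-Python sketch of that idea follows. The helper `random_split` is hypothetical and only stands in for Spark's implementation, which differs in detail, so the counts it prints will not match the ones above.

```python
import random

def random_split(rows, train_weight=0.8, seed=2405):
    """Assign each row independently to train or test.

    Each row lands in train with probability `train_weight`, so the
    resulting sizes vary around the requested proportions.
    """
    rng = random.Random(seed)
    train_part, test_part = [], []
    for row in rows:
        (train_part if rng.random() < train_weight else test_part).append(row)
    return train_part, test_part

rows = list(range(150))  # stand-in for the 150 iris rows
train_rows, test_rows = random_split(rows)
print("Train #rows: {}".format(len(train_rows)))
print("Test #rows: {}".format(len(test_rows)))
```

Fixing the seed makes the split reproducible, which is why the notebook passes `seed=2405`.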


# Modeling + Evaluation

In [73]:
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

rf_model = RandomForestClassifier(numTrees=10).fit(train) # we can also use cross validation
predictions_df = rf_model.transform(test)
predictions_df.select('prediction', 'label', 'features').show(10)
accuracy = MulticlassClassificationEvaluator(metricName="accuracy").evaluate(predictions_df)
print('Accuracy on test data is {}'.format(accuracy))

+----------+-----+-----------------+
|prediction|label|         features|
+----------+-----+-----------------+
|       0.0|  0.0|[5.0,2.3,3.3,1.0]|
|       0.0|  0.0|[5.5,2.3,4.0,1.3]|
|       0.0|  0.0|[5.5,2.4,3.7,1.0]|
|       0.0|  0.0|[5.5,2.4,3.8,1.1]|
|       0.0|  0.0|[5.5,2.5,4.0,1.3]|
|       0.0|  0.0|[5.7,2.6,3.5,1.0]|
|       0.0|  0.0|[5.7,2.8,4.1,1.3]|
|       0.0|  0.0|[5.9,3.0,4.2,1.5]|
|       0.0|  0.0|[6.3,2.3,4.4,1.3]|
|       0.0|  0.0|[6.4,2.9,4.3,1.3]|
+----------+-----+-----------------+
only showing top 10 rows

Accuracy on test data is 0.925925925926
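
Accuracy is the fraction of test rows whose prediction equals the label. With 27 test rows, the reported value corresponds to 25 correct predictions (25/27 ≈ 0.926), which meets the >90% goal. The sketch below reproduces that arithmetic in plain Python. The two lists are made-up placeholders, not the notebook's actual predictions, and `accuracy` is a hypothetical helper.

```python
def accuracy(predictions, labels):
    """Fraction of positions where the prediction matches the label."""
    correct = sum(1 for p, l in zip(predictions, labels) if p == l)
    return correct / float(len(labels))

# Placeholder data: 27 rows, of which 2 are misclassified.
labels = [0.0] * 9 + [1.0] * 9 + [2.0] * 9
predictions = list(labels)
predictions[3] = 1.0   # a wrong prediction (true label is 0.0)
predictions[12] = 0.0  # another wrong prediction (true label is 1.0)

print("Accuracy is {}".format(accuracy(predictions, labels)))
```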


# Notes

* The same principles can be used for [regression models](https://spark.apache.org/docs/2.1.0/ml-classification-regression.html#regression).

* This dataset is fairly clean and ready for modeling. Usually we will need to apply [transformations and feature engineering](https://spark.apache.org/docs/2.1.0/ml-features.html) first.

* Make sure you don't train and test on the same data. In general, surprisingly good metrics should concern you.
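
One common way to keep training and evaluation data separate while still using every row is k-fold cross validation, which the modeling cell mentions as an alternative. A minimal plain-Python sketch of how the folds are formed follows. `k_fold_indices` is a hypothetical helper for illustration. In Spark, the `CrossValidator` class in `pyspark.ml.tuning` plays this role.

```python
import random

def k_fold_indices(n_rows, k=3, seed=2405):
    """Yield (train_idx, test_idx) pairs; each row is tested exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds, round-robin.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

for train_idx, test_idx in k_fold_indices(150, k=3):
    # No row may appear on both sides of a fold.
    assert not set(train_idx) & set(test_idx)
    print("train={} test={}".format(len(train_idx), len(test_idx)))
```

Each model is trained on the other folds and scored on the held-out one, and the scores are then averaged.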