## [*Lesson 4*](https://hackmd.io/@J_qqq0PjTGK1be0341GpYA/BJEYLlK-X#/ "Spark Streaming - HackMD")


### Spark Streaming: ML with Streaming

---

Continuation of the notebook from [*Lesson 3*](https://github.com/rklepov/hse-cs-ml-2018-2019/blob/master/08-spark/03-ml/%D0%97%D0%B0%D0%BD%D1%8F%D1%82%D0%B8%D0%B5%20MLLib%2022.06.ipynb "Занятие MLLib 22.06.ipynb")

In [1]:
from zipfile import ZipFile
from io import BytesIO
import urllib.request

import ssl

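# NOTE: certificate verification is switched off for this download.
# That is insecure and only tolerable in a throwaway notebook.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE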


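def download(url):
    # Read the whole archive into memory, then unpack it into the current directory.
    with urllib.request.urlopen(url, context=ctx) as response:
        payload = BytesIO(response.read())
    with ZipFile(payload) as archive:
        archive.extractall()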


In [2]:
download('https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip')

In [3]:
!free -h

              total        used        free      shared  buff/cache   available
Mem:           1.9G        257M        864M        1.2M        823M        1.5G
Swap:            0B          0B          0B


In [4]:
import findspark

findspark.init()
import pyspark

In [5]:
spark = pyspark.sql.SparkSession.builder.getOrCreate()

In [6]:
!free -h

              total        used        free      shared  buff/cache   available
Mem:           1.9G        385M        736M        1.2M        824M        1.4G
Swap:            0B          0B          0B


In [7]:
sms = spark.read.option("sep", "\t").csv("SMSSpamCollection")

In [8]:
src = sms.withColumnRenamed("_c0", "label").withColumnRenamed("_c1", "text")
src.show()

+-----+--------------------+
|label|                text|
+-----+--------------------+
|  ham|Go until jurong p...|
|  ham|Ok lar... Joking ...|
| spam|Free entry in 2 a...|
|  ham|U dun say so earl...|
|  ham|Nah I don't think...|
| spam|FreeMsg Hey there...|
|  ham|Even my brother i...|
|  ham|As per your reque...|
| spam|WINNER!! As a val...|
| spam|Had your mobile 1...|
|  ham|I'm gonna be home...|
| spam|SIX chances to wi...|
| spam|URGENT! You have ...|
|  ham|I've been searchi...|
|  ham|I HAVE A DATE ON ...|
| spam|XXXMobileMovieClu...|
|  ham|Oh k...i'm watchi...|
|  ham|Eh u remember how...|
|  ham|Fine if thats th...|
| spam|England v Macedon...|
+-----+--------------------+
only showing top 20 rows



In [9]:
src.groupBy("label").count().show()

+-----+-----+
|label|count|
+-----+-----+
|  ham| 4827|
| spam|  747|
+-----+-----+
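
The label counts above show a heavily imbalanced dataset, which matters when judging any classifier trained on it. As a rough illustration in plain Python (not part of the original notebook), the accuracy of a trivial model that always answers `ham` follows directly from those two counts:

```python
# Label counts copied from the groupBy output above.
counts = {"ham": 4827, "spam": 747}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a classifier that always predicts the majority label.
baseline_accuracy = counts[majority_label] / total

print(majority_label, round(baseline_accuracy, 3))
```

An accuracy near 0.866 therefore says little on its own; per-class metrics such as recall on `spam` are more informative here.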



In [12]:
from pyspark.ml import feature

In [13]:
feature.Tokenizer(inputCol="text", outputCol="tokens").transform(src).show()

+-----+--------------------+--------------------+
|label|                text|              tokens|
+-----+--------------------+--------------------+
|  ham|Go until jurong p...|[go, until, juron...|
|  ham|Ok lar... Joking ...|[ok, lar..., joki...|
| spam|Free entry in 2 a...|[free, entry, in,...|
|  ham|U dun say so earl...|[u, dun, say, so,...|
|  ham|Nah I don't think...|[nah, i, don't, t...|
| spam|FreeMsg Hey there...|[freemsg, hey, th...|
|  ham|Even my brother i...|[even, my, brothe...|
|  ham|As per your reque...|[as, per, your, r...|
| spam|WINNER!! As a val...|[winner!!, as, a,...|
| spam|Had your mobile 1...|[had, your, mobil...|
|  ham|I'm gonna be home...|[i'm, gonna, be, ...|
| spam|SIX chances to wi...|[six, chances, to...|
| spam|URGENT! You have ...|[urgent!, you, ha...|
|  ham|I've been searchi...|[i've, been, sear...|
|  ham|I HAVE A DATE ON ...|[i, have, a, date...|
| spam|XXXMobileMovieClu...|[xxxmobilemoviecl...|
|  ham|Oh k...i'm watchi...|[oh, k...i'm, wat...|


In [14]:
from pyspark.ml import classification

In [15]:
from pyspark.ml import pipeline

main = pipeline.Pipeline(
    stages=(
        feature.RegexTokenizer(
            minTokenLength=3,
            inputCol="text", 
            pattern=r"\s+",
            outputCol="tokens",
        ),
        feature.CountVectorizer(
            inputCol="tokens", 
            outputCol="v",
            minDF=5,
            maxDF=900
        ),
        feature.StringIndexer(inputCol="label", outputCol="y"),
        classification.RandomForestClassifier(
            seed=123,
            labelCol="y",
            featuresCol="v",
        )
    )
)
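
To make the first two stages of the pipeline more concrete, here is a minimal plain-Python sketch of what they roughly do: split lowercased text on whitespace, drop tokens shorter than `minTokenLength`, then count only the terms whose document frequency lies between `minDF` and `maxDF`. The helper names and the toy messages are invented for illustration. The alphabetical vocabulary order is a simplification, since Spark orders its vocabulary by term frequency.

```python
import re
from collections import Counter


def regex_tokenize(text, pattern=r"\s+", min_token_length=3):
    """Lowercase, split on the pattern, drop tokens shorter than min_token_length."""
    return [t for t in re.split(pattern, text.lower()) if len(t) >= min_token_length]


def fit_vocabulary(docs, min_df, max_df):
    """Keep terms whose document frequency lies in [min_df, max_df]."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    kept = sorted(t for t, n in df.items() if min_df <= n <= max_df)
    return {term: index for index, term in enumerate(kept)}


def vectorize(tokens, vocab):
    """Sparse count vector as {index: count}; out-of-vocabulary terms are ignored."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return dict(sorted(counts.items()))


# Toy messages (hypothetical, not taken from the dataset).
texts = ["Free entry now", "free call now", "Call me now please"]
docs = [regex_tokenize(t) for t in texts]
vocab = fit_vocabulary(docs, min_df=2, max_df=2)
vectors = [vectorize(d, vocab) for d in docs]

print(vocab)
print(vectors)
```

In this toy corpus `now` is dropped because it appears in every document (above `max_df`), while `entry` and `please` are dropped for being too rare (below `min_df`). The pipeline's `minDF=5, maxDF=900` applies the same kind of filter.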




In [16]:
train, test = src.randomSplit(weights=(70., 30.), seed=123)
main_model = main.fit(train)

results = (
    main_model
    .transform(test)
    .select("y", "rawPrediction", "probability", "prediction")
    .cache()
)

results.show()

+---+--------------------+--------------------+----------+
|  y|       rawPrediction|         probability|prediction|
+---+--------------------+--------------------+----------+
|0.0|[17.4577708451969...|[0.87288854225984...|       0.0|
|0.0|[17.9915243908026...|[0.89957621954013...|       0.0|
|0.0|[17.9915243908026...|[0.89957621954013...|       0.0|
|0.0|[17.9915243908026...|[0.89957621954013...|       0.0|
|0.0|[17.9915243908026...|[0.89957621954013...|       0.0|
|0.0|[17.6961900509745...|[0.88480950254872...|       0.0|
|0.0|[17.9915243908026...|[0.89957621954013...|       0.0|
|0.0|[17.9915243908026...|[0.89957621954013...|       0.0|
|0.0|[17.9915243908026...|[0.89957621954013...|       0.0|
|0.0|[17.9915243908026...|[0.89957621954013...|       0.0|
|0.0|[17.9915243908026...|[0.89957621954013...|       0.0|
|0.0|[17.9915243908026...|[0.89957621954013...|       0.0|
|0.0|[17.9915243908026...|[0.89957621954013...|       0.0|
|0.0|[17.9915243908026...|[0.89957621954013...|       0.
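
In the output above, each `probability` vector appears to be the `rawPrediction` vector rescaled to sum to 1. For a random forest the raw prediction accumulates per-tree class probabilities, so with the default of 20 trees (the pipeline does not set `numTrees`) the raw values should sum to 20. A minimal sketch of that normalization in plain Python follows. The second component of the raw vector is cut off in the displayed output, so it is inferred here under that assumption.

```python
def normalize(raw):
    """Rescale a vector of non-negative scores so that it sums to 1."""
    total = sum(raw)
    return [x / total for x in raw]


num_trees = 20  # assumed: Spark's default for RandomForestClassifier
raw_first = 17.9915243908026  # first rawPrediction component as displayed above

# Assumption: the two components sum to the number of trees.
raw = [raw_first, num_trees - raw_first]
probability = normalize(raw)

print(probability[0])
```

The result agrees with the displayed `0.89957621954013...` to the printed precision, which is consistent with the 20-tree assumption.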

In [17]:
stream_in = "/streaming/mlexample"
!mkdir -p $stream_in

In [19]:
input_stream = spark.readStream.schema(train.schema).option("sep", "\t").csv(stream_in)

In [20]:
prediction_stream = main_model.transform(input_stream)

In [21]:
(
    prediction_stream
    .writeStream
    .trigger(processingTime="10 seconds")
    .format("memory")
    .queryName("preds")
    .start()
)

<pyspark.sql.streaming.StreamingQuery at 0x7f5720e73048>

In [23]:
spark.sql("select * from preds").show()

+-----+----+------+---+---+-------------+-----------+----------+
|label|text|tokens|  v|  y|rawPrediction|probability|prediction|
+-----+----+------+---+---+-------------+-----------+----------+
+-----+----+------+---+---+-------------+-----------+----------+



In [25]:
!shuf -n 1024 SMSSpamCollection > `mktemp -p $stream_in`
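
The same step, dropping a random sample of lines into the watched directory as a new file, can also be sketched in plain Python without relying on shell utilities. The function name `drop_sample` and its parameters are hypothetical. The file is handled as bytes so that no assumption about its text encoding is needed.

```python
import os
import random
import tempfile


def drop_sample(source_path, stream_dir, n_lines, seed=None):
    """Write a random sample of lines from source_path to a new file in stream_dir."""
    with open(source_path, "rb") as src:
        lines = src.readlines()

    rng = random.Random(seed)
    sample = rng.sample(lines, min(n_lines, len(lines)))

    # mkstemp creates a uniquely named file, so repeated calls never collide.
    fd, path = tempfile.mkstemp(dir=stream_dir)
    with os.fdopen(fd, "wb") as out:
        out.writelines(sample)
    return path
```

In the notebook this would be called as `drop_sample("SMSSpamCollection", stream_in, 1024)`. The new file should then be picked up on a subsequent trigger of the streaming query.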


In [26]:
spark.sql("select * from preds").show()

+-----+--------------------+--------------------+--------------------+---+--------------------+--------------------+----------+
|label|                text|              tokens|                   v|  y|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+---+--------------------+--------------------+----------+
| spam|A £400 XMAS REWAR...|[£400, xmas, rewa...|(1358,[2,5,14,19,...|1.0|[14.7228608827178...|[0.73614304413589...|       0.0|
|  ham|Lol! U drunkard! ...|[lol!, drunkard!,...|(1358,[0,14,44,12...|0.0|[17.9915243908026...|[0.89957621954013...|       0.0|
|  ham|When are you goin...|[when, are, you, ...|(1358,[3,6,18,42]...|0.0|[17.6961900509745...|[0.88480950254872...|       0.0|
|  ham|Ah poop. Looks li...|[poop., looks, li...|(1358,[4,12,22,36...|0.0|[17.9915243908026...|[0.89957621954013...|       0.0|
|  ham|Ü comin to fetch ...|[comin, fetch, or...| (1358,[1202],[1.0])|0.0|[17.9915243908026...|[0.899576