## [*Lesson 4*](https://hackmd.io/@J_qqq0PjTGK1be0341GpYA/BJEYLlK-X#/ "Spark Streaming - HackMD")

### Spark Streaming: ML with Streaming

---

Continuation of the notebook from [*Lesson 3*](https://github.com/rklepov/hse-cs-ml-2018-2019/blob/master/08-spark/03-ml/%D0%97%D0%B0%D0%BD%D1%8F%D1%82%D0%B8%D0%B5%20MLLib%2022.06.ipynb "Занятие MLLib 22.06.ipynb")

In [1]:
from zipfile import ZipFile
from io import BytesIO
import urllib.request

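import ssl

# NOTE: certificate verification is switched off in this context;
# do not reuse it for anything sensitive.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE


def download(url):
    """Download a zip archive and extract it into the current directory."""
    with urllib.request.urlopen(url, context=ctx) as response:
        archive = ZipFile(BytesIO(response.read()))
    archive.extractall()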


In [2]:
download('https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip')

In [3]:
!cat readme

SMS Spam Collection v.1
-------------------------

1. DESCRIPTION
--------------

The SMS Spam Collection v.1 (hereafter the corpus) is a set of SMS tagged messages that have been collected for SMS Spam research. It contains one set of SMS messages in English of 5,574 messages, tagged acording being ham (legitimate) or spam. 

1.1. Compilation
----------------

This corpus has been collected from free or free for research sources at the Web:

- A collection of between 425 SMS spam messages extracted manually from the Grumbletext Web site. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the very spam message received. The identification of the text of spam messages in the claims is a very hard and time-consuming task, and it involved carefully scanning hundreds of web pages. The Grumbletext Web site is: http://www.grumbletext.co.uk/
- A list of 450 SMS ham messages collected from Caroline

In [4]:
import findspark

findspark.init()
import pyspark

In [5]:
spark = pyspark.sql.SparkSession.builder.getOrCreate()

In [6]:
sms = spark.read.option("sep", "\t").csv("SMSSpamCollection")

In [7]:
src = sms.withColumnRenamed("_c0", "label").withColumnRenamed("_c1", "text")
src.show()

+-----+--------------------+
|label|                text|
+-----+--------------------+
|  ham|Go until jurong p...|
|  ham|Ok lar... Joking ...|
| spam|Free entry in 2 a...|
|  ham|U dun say so earl...|
|  ham|Nah I don't think...|
| spam|FreeMsg Hey there...|
|  ham|Even my brother i...|
|  ham|As per your reque...|
| spam|WINNER!! As a val...|
| spam|Had your mobile 1...|
|  ham|I'm gonna be home...|
| spam|SIX chances to wi...|
| spam|URGENT! You have ...|
|  ham|I've been searchi...|
|  ham|I HAVE A DATE ON ...|
| spam|XXXMobileMovieClu...|
|  ham|Oh k...i'm watchi...|
|  ham|Eh u remember how...|
|  ham|Fine if thats th...|
| spam|England v Macedon...|
+-----+--------------------+
only showing top 20 rows



In [8]:
src.groupBy("label").count().show()

+-----+-----+
|label|count|
+-----+-----+
|  ham| 4827|
| spam|  747|
+-----+-----+
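
The counts show a strongly imbalanced dataset. As a quick sanity check in plain Python (numbers copied from the table above, variable names purely illustrative), the share of spam and the accuracy of a trivial classifier that always answers `ham` could be computed like this:

```python
# Class counts copied from the groupBy output above.
counts = {"ham": 4827, "spam": 747}

total = sum(counts.values())
spam_share = counts["spam"] / total
# Accuracy of a hypothetical baseline that labels every message as ham.
always_ham_accuracy = counts["ham"] / total

print(f"total messages: {total}")
print(f"spam share: {spam_share:.3f}")
print(f"always-ham accuracy: {always_ham_accuracy:.3f}")
```

With roughly 87% of messages being ham, plain accuracy is easy to inflate, so a ranking metric such as the area under the ROC curve tends to be more informative on this data.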



In [9]:
from pyspark.ml import feature

In [10]:
feature.Tokenizer(inputCol="text", outputCol="tokens").transform(src).show()

+-----+--------------------+--------------------+
|label|                text|              tokens|
+-----+--------------------+--------------------+
|  ham|Go until jurong p...|[go, until, juron...|
|  ham|Ok lar... Joking ...|[ok, lar..., joki...|
| spam|Free entry in 2 a...|[free, entry, in,...|
|  ham|U dun say so earl...|[u, dun, say, so,...|
|  ham|Nah I don't think...|[nah, i, don't, t...|
| spam|FreeMsg Hey there...|[freemsg, hey, th...|
|  ham|Even my brother i...|[even, my, brothe...|
|  ham|As per your reque...|[as, per, your, r...|
| spam|WINNER!! As a val...|[winner!!, as, a,...|
| spam|Had your mobile 1...|[had, your, mobil...|
|  ham|I'm gonna be home...|[i'm, gonna, be, ...|
| spam|SIX chances to wi...|[six, chances, to...|
| spam|URGENT! You have ...|[urgent!, you, ha...|
|  ham|I've been searchi...|[i've, been, sear...|
|  ham|I HAVE A DATE ON ...|[i, have, a, date...|
| spam|XXXMobileMovieClu...|[xxxmobilemoviecl...|
|  ham|Oh k...i'm watchi...|[oh, k...i'm, wat...|


In [11]:
from pyspark.ml import classification

In [12]:
from pyspark.ml import pipeline

main = pipeline.Pipeline(
    stages=(
        feature.RegexTokenizer(
            minTokenLength=3,
            inputCol="text", 
            pattern=r"\s+",  # raw string: avoids the invalid "\s" escape warning
            outputCol="tokens",
        ),
        feature.CountVectorizer(
            inputCol="tokens", 
            outputCol="v",
            minDF=5,
            maxDF=900
        ),
        feature.StringIndexer(inputCol="label", outputCol="y"),
        classification.RandomForestClassifier(
            seed=123,
            labelCol="y",
            featuresCol="v",
        )
    )
)
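
To make the first two stages more concrete, here is a minimal plain-Python sketch of what `minTokenLength` and the `minDF`/`maxDF` limits do, assuming integer thresholds are interpreted as document counts. The helpers `regex_tokenize` and `build_vocabulary` are hypothetical names, the thresholds are scaled down to fit three sample messages, and the vocabulary is sorted alphabetically for readability (Spark orders it by term frequency).

```python
import re
from collections import Counter


def regex_tokenize(text, pattern=r"\s+", min_token_length=3):
    """Lowercase, split on the pattern, and drop tokens that are too short."""
    tokens = re.split(pattern, text.lower())
    return [t for t in tokens if len(t) >= min_token_length]


def build_vocabulary(docs, min_df, max_df):
    """Keep terms whose document frequency lies in [min_df, max_df]."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return sorted(t for t, n in doc_freq.items() if min_df <= n <= max_df)


# Sample messages taken from the notebook's printed output.
docs = [
    regex_tokenize("Mm i am on the way to railway"),
    regex_tokenize("I am on the way to ur home"),
    regex_tokenize("You want to go?"),
]
# Toy thresholds; the pipeline above uses minDF=5 and maxDF=900.
vocabulary = build_vocabulary(docs, min_df=2, max_df=2)

print(docs[0])
print(vocabulary)
```

Short tokens such as `i`, `am`, `on` and `to` never reach the vectorizer, and terms that occur in too few or too many messages are left out of the vocabulary.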




In [13]:
train, test = src.randomSplit(weights=(70., 30.), seed=123)
main_model = main.fit(train)

results = (
    main_model
    .transform(test)
    .select("y", "rawPrediction", "probability", "prediction")
    .cache()
)

results.show()

+---+--------------------+--------------------+----------+
|  y|       rawPrediction|         probability|prediction|
+---+--------------------+--------------------+----------+
|0.0|[17.4577708451969...|[0.87288854225984...|       0.0|
|0.0|[17.9915243908026...|[0.89957621954013...|       0.0|
|0.0|[17.9915243908026...|[0.89957621954013...|       0.0|
|0.0|[17.9915243908026...|[0.89957621954013...|       0.0|
|0.0|[17.9915243908026...|[0.89957621954013...|       0.0|
|0.0|[17.6961900509745...|[0.88480950254872...|       0.0|
|0.0|[17.9915243908026...|[0.89957621954013...|       0.0|
|0.0|[17.9915243908026...|[0.89957621954013...|       0.0|
|0.0|[17.9915243908026...|[0.89957621954013...|       0.0|
|0.0|[17.9915243908026...|[0.89957621954013...|       0.0|
|0.0|[17.9915243908026...|[0.89957621954013...|       0.0|
|0.0|[17.9915243908026...|[0.89957621954013...|       0.0|
|0.0|[17.9915243908026...|[0.89957621954013...|       0.0|
|0.0|[17.9915243908026...|[0.89957621954013...|       0.
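
In the rows above, the first `probability` component equals the first `rawPrediction` component divided by 20. That is consistent with a forest of 20 trees (Spark's default `numTrees`, which the pipeline does not override) whose per-tree votes are summed and then normalized. The sketch below reproduces the arithmetic under that assumption; the second raw component is truncated in the output, so it is derived here as `num_trees - raw_first` rather than read from the table.

```python
def normalize(raw):
    """Scale a vector of non-negative votes so that it sums to 1."""
    total = sum(raw)
    return [value / total for value in raw]


num_trees = 20  # assumption: the default forest size was used
raw_first = 17.9915243908026  # first rawPrediction component, from the table
# Assumption: both raw components add up to the number of trees.
raw_prediction = [raw_first, num_trees - raw_first]

probability = normalize(raw_prediction)
print(probability[0])
```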

In [14]:
from pyspark.sql import functions

results.orderBy("probability").show()

+---+--------------------+--------------------+----------+
|  y|       rawPrediction|         probability|prediction|
+---+--------------------+--------------------+----------+
|1.0|[6.02774777924674...|[0.30138738896233...|       1.0|
|1.0|[7.05938187799983...|[0.35296909389999...|       1.0|
|1.0|[7.06069250555023...|[0.35303462527751...|       1.0|
|1.0|[7.28622884676247...|[0.36431144233812...|       1.0|
|1.0|[7.54685095248554...|[0.37734254762427...|       1.0|
|1.0|[8.41448111855782...|[0.42072405592789...|       1.0|
|1.0|[8.46720221661500...|[0.42336011083075...|       1.0|
|1.0|[8.63567962692463...|[0.43178398134623...|       1.0|
|1.0|[8.67324089890456...|[0.43366204494522...|       1.0|
|1.0|[9.30519987922656...|[0.46525999396132...|       1.0|
|1.0|[9.36938491669584...|[0.46846924583479...|       1.0|
|1.0|[9.36938491669584...|[0.46846924583479...|       1.0|
|1.0|[9.36938491669584...|[0.46846924583479...|       1.0|
|1.0|[9.54773915954144...|[0.47738695797707...|       1.

In [15]:
results.orderBy(functions.desc("probability")).show()

+---+--------------------+--------------------+----------+
|  y|       rawPrediction|         probability|prediction|
+---+--------------------+--------------------+----------+
|1.0|[18.0893416907218...|[0.90446708453609...|       0.0|
|0.0|[18.0893416907218...|[0.90446708453609...|       0.0|
|1.0|[18.0893416907218...|[0.90446708453609...|       0.0|
|0.0|[18.0893416907218...|[0.90446708453609...|       0.0|
|1.0|[18.0893416907218...|[0.90446708453609...|       0.0|
|0.0|[17.9915243908026...|[0.89957621954013...|       0.0|
|0.0|[17.9915243908026...|[0.89957621954013...|       0.0|
|0.0|[17.9915243908026...|[0.89957621954013...|       0.0|
|0.0|[17.9915243908026...|[0.89957621954013...|       0.0|
|0.0|[17.9915243908026...|[0.89957621954013...|       0.0|
|0.0|[17.9915243908026...|[0.89957621954013...|       0.0|
|0.0|[17.9915243908026...|[0.89957621954013...|       0.0|
|0.0|[17.9915243908026...|[0.89957621954013...|       0.0|
|0.0|[17.9915243908026...|[0.89957621954013...|       0.

In [16]:
from pyspark.ml import evaluation

evaluation.BinaryClassificationEvaluator(labelCol="y").evaluate(results)

0.9314047828132821
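
`BinaryClassificationEvaluator` reports the area under the ROC curve by default. One way to read that number: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The following is a small illustrative sketch of that pairwise definition on made-up scores, not the evaluator's actual implementation; `roc_auc` is a hypothetical helper.

```python
def roc_auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        (p > n) + 0.5 * (p == n)
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores, for illustration only.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(toy_labels, toy_scores)
print(auc)
```

Three of the four positive/negative pairs are ordered correctly in the toy data, which gives 0.75.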

In [17]:
from pyspark.ml import pipeline

lsh_pipe = pipeline.Pipeline(
    stages=(
        feature.Tokenizer(
            inputCol="text",  
            outputCol="tokens",
        ),
        feature.CountVectorizer(
            binary=True,
            inputCol="tokens", 
            outputCol="v",
        ),
    )
)


In [18]:
lsh_prep_model = lsh_pipe.fit(src)

lsh_src = lsh_prep_model.transform(src)
lsh_src.show()

+-----+--------------------+--------------------+--------------------+
|label|                text|              tokens|                   v|
+-----+--------------------+--------------------+--------------------+
|  ham|Go until jurong p...|[go, until, juron...|(13587,[8,42,52,6...|
|  ham|Ok lar... Joking ...|[ok, lar..., joki...|(13587,[5,75,411,...|
| spam|Free entry in 2 a...|[free, entry, in,...|(13587,[0,3,8,20,...|
|  ham|U dun say so earl...|[u, dun, say, so,...|(13587,[5,22,60,1...|
|  ham|Nah I don't think...|[nah, i, don't, t...|(13587,[0,1,66,87...|
| spam|FreeMsg Hey there...|[freemsg, hey, th...|(13587,[0,2,6,10,...|
|  ham|Even my brother i...|[even, my, brothe...|(13587,[0,7,9,13,...|
|  ham|As per your reque...|[as, per, your, r...|(13587,[0,10,11,4...|
| spam|WINNER!! As a val...|[winner!!, as, a,...|(13587,[0,2,3,14,...|
| spam|Had your mobile 1...|[had, your, mobil...|(13587,[0,4,5,10,...|
|  ham|I'm gonna be home...|[i'm, gonna, be, ...|(13587,[0,1,6,32,...|
| spam

In [19]:
mh = feature.MinHashLSH(inputCol="v", outputCol="hash")
mh_model = mh.fit(lsh_src)

In [20]:
similar = mh_model.approxSimilarityJoin(lsh_src, lsh_src, 0.7)

In [21]:
similar.show()

+--------------------+--------------------+------------------+
|            datasetA|            datasetB|           distCol|
+--------------------+--------------------+------------------+
|[ham, Ok lar i do...|[ham, Ok lar i do...|               0.0|
|[ham, Hello my bo...|[ham, Hello my bo...|               0.0|
|[ham, Huh so late...|[ham, Huh so late...|               0.0|
|[ham, Was actuall...|[ham, Was actuall...|               0.0|
|[ham, Ill call u ...|[ham, Ill call u ...|               0.0|
|[ham, Save yourse...|[ham, Save yourse...|               0.0|
|[ham, Ok i msg u ...|[ham, Ok i msg u ...|               0.0|
|[spam, Do you wan...|[spam, Do you wan...|0.6578947368421053|
|[ham, Sir, I have...|[ham, Sir, I have...|               0.0|
|[ham, Hope you ar...|[ham, Hope you ar...|               0.0|
|[ham, I'm in offi...|[ham, K.i will se...|0.6428571428571428|
|[ham, Dear how yo...|[ham, how tall ar...|             0.625|
|[ham, Can i get y...|[ham, Can i get y...|            
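
`distCol` is the Jaccard distance between the two messages' token sets (the vectors are binary, so only the presence of a token matters), and the threshold `0.7` keeps pairs whose distance is below it. As a check, the distance for the pair reported with `distCol` 0.625 can be recomputed in plain Python; the full texts are taken from the notebook's own printed output, and `jaccard_distance` is a hypothetical helper.

```python
def jaccard_distance(tokens_a, tokens_b):
    """1 - |A & B| / |A | B| over the sets of distinct tokens."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)


text_a = "Dear how you. Are you ok?"
text_b = "how tall are you princess?"
distance = jaccard_distance(text_a.lower().split(), text_b.lower().split())
print(distance)
```

The two messages share 3 of 8 distinct tokens (`how`, `are`, `you`), so the distance is 1 - 3/8 = 0.625, matching the join output.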

Let's print the similar (but not identical) texts that were found.

In [22]:
print(
    "\n===========\n".join(
        " <= is similar to => ".join(x) 
        for x in
        similar
        .where("datasetA.text != datasetB.text")
        .rdd
        .map(lambda x: (x["datasetA"]["text"], x["datasetB"]["text"]))
        .take(100)
    )
)

Do you want 750 anytime any network mins 150 text and a NEW VIDEO phone for only five pounds per week call 08002888812 or reply for delivery tomorrow <= is similar to => Do you want a new Video handset? 750 any time any network mins? UNLIMITED TEXT? Camcorder? Reply or Call now 08000930705 for del Sat AM
I'm in office now . I will call you  &lt;#&gt;  min:) <= is similar to => K.i will send in  &lt;#&gt;  min:)
Dear how you. Are you ok? <= is similar to => how tall are you princess?
You want to go?  <= is similar to => Hey do you want anything to buy:)
Mm i am on the way to railway <= is similar to => I am on the way to ur home
You are being contacted by our dating service by someone you know! To find out who it is, call from a land line 09050000928. PoBox45W2TG150P <= is similar to => You are being contacted by our dating service by someone you know! To find out who it is, call from a land line 09050000878. PoBox45W2TG150P
Call him and say you not coming today ok and tell them not to fool me like this ok <= is similar 

In [32]:
stream_in = '/streaming/mlexample'

!mkdir -p $stream_in

In [33]:
input_stream = spark.readStream.schema(train.schema).option('sep', '\t').csv(stream_in)

In [27]:
prediction_stream = main_model.transform(input_stream)

In [28]:
(
    prediction_stream
    .writeStream
    .trigger(processingTime='10 seconds')
    .format('memory')
    .queryName('preds')
    .start()
)

<pyspark.sql.streaming.StreamingQuery at 0x7f542075f710>

In [48]:
spark.sql('select * from preds').show()

+-----+--------------------+--------------------+--------------------+---+--------------------+--------------------+----------+
|label|                text|              tokens|                   v|  y|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+---+--------------------+--------------------+----------+
|  ham|I'm putting it on...|[i'm, putting, no...|(1358,[2,13,81,98...|0.0|[17.9915243908026...|[0.89957621954013...|       0.0|
|  ham|U WILL SWITCH YOU...|[will, switch, yo...|(1358,[3,15,760],...|0.0|[17.6961900509745...|[0.88480950254872...|       0.0|
|  ham|The  &lt;#&gt; g ...|[the, &lt;#&gt;, ...|(1358,[0,1,2,7,10...|0.0|[17.9915243908026...|[0.89957621954013...|       0.0|
|  ham|Thanks a lot for ...|[thanks, lot, for...|(1358,[2,3,131,32...|0.0|[17.6961900509745...|[0.88480950254872...|       0.0|
|  ham|Doing nothing, th...|[doing, nothing,,...|(1358,[8,32,120,2...|0.0|[17.9915243908026...|[0.899576

In [40]:
!shuf SMSSpamCollection | head -n1k | grep spam > `tempfile -d $stream_in`

shuf: write error: Broken pipe
shuf: write error
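
The `Broken pipe` messages come from `head` closing the pipe after it has read enough lines while `shuf` is still writing; the file is still created. A file-based stream expects new files to appear in the watched directory atomically, which is usually achieved by writing elsewhere and then moving the file in. Below is a hedged plain-Python sketch of that idea. `drop_spam_batch` and the `staging_dir` parameter are hypothetical, the demo runs on throwaway directories with two made-up rows, and unlike `grep spam` the filter keeps only rows whose label column is `spam`.

```python
import os
import random
import tempfile


def drop_spam_batch(source_path, stream_dir, staging_dir, sample_size=1000, seed=None):
    """Sample lines, keep those labelled spam, and move them into stream_dir."""
    with open(source_path, encoding="utf-8") as src_file:
        lines = src_file.readlines()
    rng = random.Random(seed)
    sample = rng.sample(lines, min(sample_size, len(lines)))
    # Keep only rows whose first (label) column is exactly "spam".
    spam_rows = [line for line in sample if line.startswith("spam\t")]

    # Write the file outside the watched directory first ...
    fd, staged_path = tempfile.mkstemp(dir=staging_dir, suffix=".csv")
    with os.fdopen(fd, "w", encoding="utf-8") as out:
        out.writelines(spam_rows)
    # ... then move it in one step (atomic when both directories share a
    # filesystem), so the stream never sees a half-written file.
    target_path = os.path.join(stream_dir, os.path.basename(staged_path))
    os.replace(staged_path, target_path)
    return target_path


# Self-contained demo on temporary directories and two made-up rows.
with tempfile.TemporaryDirectory() as root:
    demo_stream_dir = os.path.join(root, "stream")
    demo_staging_dir = os.path.join(root, "staging")
    os.mkdir(demo_stream_dir)
    os.mkdir(demo_staging_dir)
    demo_source = os.path.join(root, "sample.tsv")
    with open(demo_source, "w", encoding="utf-8") as f:
        f.write("ham\thello there\nspam\twin a prize now\n")
    written = drop_spam_batch(demo_source, demo_stream_dir, demo_staging_dir, seed=0)
    with open(written, encoding="utf-8") as f:
        demo_rows = f.read().splitlines()
    print(demo_rows)
```

In the notebook's setting, the call would presumably point `source_path` at `SMSSpamCollection` and `stream_dir` at `stream_in`, with a staging directory on the same filesystem.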
