# Spark

* Last updated 20161125 20170221

## Objectives

* Use Spark to run ETL on big data.
* Use Spark for data analysis.

## Problems

* Problem S-1: Set up Spark as a standalone cluster
* Problem S-2: Hello Spark - read the configuration and create the client `sc`
* Problem S-3: Build word vectors as MLlib input using RDDs
* Problem S-4: Build feature vectors as MLlib input using RDDs
* Problem S-5: Read data from files with Spark SQL
* Problem S-6: Spark SQL on the Uber CSV data
* Problem S-7: Kolmogorov-Smirnov test
* Problem S-8: Random data generation
* Problem S-9: Quantitative data analysis
* Problem S-10: Text analysis
* Problem S-11: Twitter data analysis
* Problem S-12: Graph analysis
* Problem S-13: spark-submit
* Problem S-14: Visualization with Bokeh
    * http://www.blog.pythonlibrary.org/2016/07/27/python-visualization-with-bokeh/

* References
    * Introduction to Big Data with Apache Spark (UC Berkeley, Anthony Joseph)
    * [spark-sklearn](https://github.com/databricks/spark-sklearn)
        ```
        pip install spark-sklearn
        ```
        
        * spark-shell
        ```
        $SPARK_HOME/bin/spark-shell --packages databricks:spark-sklearn:0.2.0
        ```
        
    * Duke University STA663 Statistical Computing and Computation, Spring 2016
        * https://github.com/cliburn/sta-663-2016

    * kaggle
        * https://www.kaggle.com/kaggle/us-baby-names
        * https://github.com/pcsanwald/kaggle-titanic

* TODO
    * Image processing with Spark: http://docs.thunder-project.org/spark
    * Spark with MySQL
    ```
    spark.driver.extraLibraryPath
    os.environ['SPARK_CLASSPATH'] =r"/home/jsl/Code/git/bb/sd/lib/mysql-connector-java-5.1.40/mysql-connector-java-5.1.40-bin.jar"
    ```


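## S.1 Big Data and Spark

* Big data comes from many sources, and its formats are varied and often unstructured.
* It is closely related to data warehousing and data mining.
* ETL (Extract, Transform, and Load)
    * Tools include RapidMiner, SAS, and others.

Step | Description
-----|-----
Extract | Pull data from diverse sources (Hadoop files, plain files, JSON, databases, ...)
Transform | Convert the data.
Load | Store the result in a database for later use.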

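## S.2 Introducing Spark

### History
* 2009: Matei Zaharia develops Spark at UC Berkeley during his PhD.
* 2010: released as open source under a BSD license.
* 2013: moved to the Apache 2.0 license.
* The original developers went on to found Databricks, which remains heavily involved in maintaining it.

### Why learn Spark?
* It is easy to learn thanks to its REPL (Read-Eval-Print Loop): the shell is convenient, and you can start in standalone mode.
* It can run map-reduce style jobs over big data quickly.
* It ships with a machine learning library.

### Architecture

* Spark is a distributed cluster-computing framework whose APIs cover data extraction, transformation, machine learning, and graph analysis.
* Spark Core provides the foundation for distributed work; SQL, Streaming, MLlib, and GraphX are built on top of it.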

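Layer | Component | Description
-------|-------|-------
Spark engine | Spark Core | Task distribution, I/O, and the other basics of distributed execution
Spark Application Frameworks | Spark SQL | DataFrames
Spark Application Frameworks | Spark Streaming | Real-time processing
Spark Application Frameworks | MLlib | Machine learning (compare scikit-learn)
Spark Application Frameworks | GraphX | Graph analysis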


* Spark is built for big data and, unlike Hadoop MapReduce, pipelines its work in memory, which makes it fast. Spark can still use Hadoop data through RDDs.
* It is written in Scala and runs on the JVM, but it offers APIs for Scala, Java, Python, and R (polyglot).

Aspect | Spark | Hadoop
-------|-------|-------
Purpose | Data analysis | Distributed data processing
File system | None of its own; uses HDFS, databases, CSV files, etc. | HDFS
Speed | Fast, thanks to pipelining | Slower
### Spark's three data abstractions - RDD, Dataset, DataFrame

Data structure | Introduced in Spark version | Notes
---------|---------|---------
RDD | 1.0 | 
DataFrame | 1.3 | 
Dataset | 1.6 | Available in Scala and Java.

* Because Python is dynamically typed, PySpark offers only RDDs and DataFrames.

### Installation

* Choose the Hadoop version you want and install the matching prebuilt distribution.
* [Download Spark](https://spark.apache.org/downloads.html)
    * Spark 1.6 for Hadoop 2.6
        ```
        tar xvfz spark-1.x.x-bin-hadoop2.x.tgz
        ```

    * For a more recent release: `tar -xvzf spark-2.1.0-bin-hadoop2.7.tgz`

* Before Spark 2.0, each API had its own entry point:
    * core: SparkContext
    * streaming: StreamingContext
    * SQL: SQLContext
    * Hive: HiveContext

* Spark 2.0
    * SparkSession unifies the contexts: SQLContext and HiveContext now, and possibly StreamingContext in the future.
    * There is no need to create a SparkConf, SparkContext, or SQLContext yourself.
    * DataFrame still exists but is just a synonym for a Dataset of rows.
    * `sparkSession.read.text` plays a role similar to reading a text file into an RDD.
    * `_sc = spark.sparkContext` gives you the underlying SparkContext.
    * Backward compatibility: SQLContext can still be used as before.
    * A SparkSession includes the functionality of SQLContext.

* Differences between 1.6 and 2.0 (SparkSession)
    * Spark 1.6

            sparkConf = SparkConf()
            sc = SparkContext(conf=sparkConf)
            sqlContext = SQLContext(sc)
            df = sqlContext.read.json("data.json")
            tables = sqlContext.tables()

    * Spark 2.x

            spark = SparkSession.builder.getOrCreate()
            df = spark.read.json("data.json")
            tables = spark.catalog.listTables()


```
import pyspark
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = pyspark.sql.SparkSession.builder\
    .master("local")\
    .appName("spark session example")\
    .getOrCreate()

# 1. Read a plain text file
textFile=spark.read.text("ds_spark_wiki.txt")

# 2. DataFrame practice: build a DataFrame with an explicit schema
fields = [
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
]
schema = StructType(fields)

test = spark.createDataFrame([
    Row(id=1, name=u"a", age=34), 
    Row(id=2, name=u"b", age=25)
], schema)

test.show()

# 3. Write practice: save to PostgreSQL over JDBC (not tested)
test.write.jdbc(
    url="jdbc:postgresql://localhost:5432/db", 
    table="test", 
    mode="overwrite", 
    properties={
        "user":"root", 
        "password":"12345", 
        "driver":"org.postgresql.Driver", 
        "client_encoding":"utf8"
   }
)


```

## Problem S-1: Set up Spark as a standalone cluster

* Unless you configure a cluster, Spark runs without one: probing ports 7077 and 8080 with curl finds nothing listening, so there is no cluster.

* Cluster types:
    * Spark Standalone - Spark workers are registered with the Spark master.
    * YARN - Spark workers are registered with the YARN cluster manager.
    * Mesos - Spark workers are registered with Mesos.

* Cluster layout
    * Client: spark shell, pyspark shell
    * Cluster
        * Cluster 1:
            * Spark master / Spark worker
            * hdfs namenode / datanode
        * Cluster n:
            * Spark worker
            * hdfs datanode

* Spark Standalone
    * SPARK_HOME is /home/jsl/Downloads/spark-1.6.0-bin-hadoop2.6/
    * Step 1: set JAVA_HOME
        * Check the current Java setup.
            * The symlink /usr/bin/java to the Java binary is set automatically.
            * /etc/environment is a good place to set JAVA_HOME.
            ```
            $ echo $JAVA_HOME
            $ update-alternatives --config java
            ```

    * Step 2: start the master
        * Set the master IP in $SPARK_HOME/conf/spark-env.sh
            ```
            SPARK_MASTER_IP=
            ```

        * Start it
            ```
            $ sh $SPARK_HOME/sbin/start-master.sh
            ```
            * The master URL is spark://IP_ADDRESS_OF_YOUR_MASTER_SYSTEM:7077
            * The default port is 7077 (the web UI is at localhost:8080)

    * Step 3: start a slave (also called a worker)
        ```
        $ sh $SPARK_HOME/sbin/start-slave.sh spark://IP_ADDRESS_OF_YOUR_MASTER_SYSTEM:7077
        ```

* Shell scripts

Script in the sbin directory | Description
----------|----------
start-master.sh, stop-master.sh | Start (stop) the master
start-slaves.sh, stop-slaves.sh | Start (stop) the slaves on each node
start-all.sh, stop-all.sh | Start (stop) the master and all slaves

## Problem S-2: Hello Spark - read the configuration and create the client `sc`

* Spark can run in four modes: batch, streaming, iterative, and interactive.
* Read the configuration to create `sc`, then use it as a client to Spark.

### spark-submit
* Runs a Spark program as a batch job.
* Runs Python programs (.py) as batch jobs.

### Interactive shell

* A REPL (Read-Eval-Print Loop) is available for Scala and Python.
* Scala
    ```
    ./bin/spark-shell
    scala>
    ```

* Python
    * When you start pyspark, `sc` and `sqlContext` are created for you.
    ```
    spark-1.6.0-bin-hadoop2.6/bin$ pyspark
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
          /_/

    Using Python version 2.7.12 (default, Jul  1 2016 15:12:24)
    SparkContext available as sc, HiveContext available as sqlContext.
    >>> sc.version
    u'1.6.0'
    >>> text=sc.textFile("derby.log");
    >>>
    ```


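* SparkContext
    * Acts as the client to the Spark server (cluster); every application must have one.
    * It determines how the cluster is used: the cluster manager allocates system resources (CPU, memory, machines) to it.
    * Note that a Python SparkContext handles dependencies differently from Scala and Java, which ship jars to the cluster.
        * List the libraries you depend on in pyFiles (or leave a placeholder if there are none).

    * "Cannot run multiple SparkContexts at once"
        * This error appears when you create a SparkContext while `sc` already exists.
        ```
        from pyspark import SparkContext
        # "master", "sparkhome" and the zip name are placeholders; pyFiles expects a list of files.
        sc = SparkContext("master","my python app", sparkHome="sparkhome",pyFiles=["placeholderdeps.zip"])
        ```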

* SparkConf
    * Settings are read from files such as spark-defaults.conf.
    ```
    scala> sc.getConf.getOption("spark.local.dir")
    res0: Option[String] = None

    scala> sc.getConf.getOption("spark.app.name")
    res1: Option[String] = Some(Spark shell)

    scala> sc.getConf.get("spark.master")
    res2: String = local[*]
    ```

### PySpark on IPython Notebook

* Use pyspark from an IPython notebook.
* Instead of creating a dedicated kernel, use findspark.
* Set SPARK_HOME first.
    * Currently ~/Downloads/spark-1.6.0-bin-hadoop2.6
* If you do use a dedicated kernel, create the context like this:
    ```
    from pyspark import SparkContext, SparkConf
    conf = SparkConf().setAppName("jsl").setMaster("local")
    sc = SparkContext(conf=conf)
    ```

In [1]:
import os
import findspark

home=os.getenv("HOME")
spark_home=os.path.join(home,"Downloads/spark-1.6.0-bin-hadoop2.6")
findspark.init(spark_home)

* Create the SparkContext `sc` from the values set in SparkConf().
    * spark.master and spark.app.name must be set.
* Spark master when running standalone on one machine
    * It is set to local, and by default it creates as many threads as needed (local[*]).

In [2]:
import pyspark
conf=pyspark.SparkConf()
conf = pyspark.SparkConf().setAppName("myAppName")
sc = pyspark.SparkContext(conf=conf)

In [3]:
print sc

<pyspark.context.SparkContext object at 0x10f05add0>


In [4]:
sc.version

u'1.6.0'

* Read the settings (from the conf directory).

In [3]:
sc.master

u'local[*]'

In [7]:
sc._conf.get("spark.jars.packages")

u'graphframes:graphframes:0.1.0-spark1.6,org.mongodb.spark:mongo-spark-connector_2.10:1.1.0'

In [4]:
sc._conf.getAll()

[(u'spark.app.name', u'myAppName'),
 (u'spark.rdd.compress', u'True'),
 (u'spark.jars',
  u'file:/home/jsl/.ivy2/jars/graphframes_graphframes-0.1.0-spark1.6.jar,file:/home/jsl/.ivy2/jars/org.mongodb.spark_mongo-spark-connector_2.10-1.1.0.jar,file:/home/jsl/.ivy2/jars/org.mongodb_mongo-java-driver-3.2.2.jar'),
 (u'spark.jars.packages',
  u'graphframes:graphframes:0.1.0-spark1.6,org.mongodb.spark:mongo-spark-connector_2.10:1.1.0'),
 (u'spark.files',
  u'file:/home/jsl/.ivy2/jars/graphframes_graphframes-0.1.0-spark1.6.jar,file:/home/jsl/.ivy2/jars/org.mongodb.spark_mongo-spark-connector_2.10-1.1.0.jar,file:/home/jsl/.ivy2/jars/org.mongodb_mongo-java-driver-3.2.2.jar'),
 (u'spark.serializer.objectStreamReset', u'100'),
 (u'spark.master', u'local[*]'),
 (u'spark.submit.deployMode', u'client'),
 (u'spark.submit.pyFiles',
  u'/home/jsl/.ivy2/jars/graphframes_graphframes-0.1.0-spark1.6.jar,/home/jsl/.ivy2/jars/org.mongodb.spark_mongo-spark-connector_2.10-1.1.0.jar,/home/jsl/.ivy2/jars/org.mong

## S.3 Hello RDD

* An RDD (Resilient Distributed Dataset) is a distributed collection of records.
    * Resilient - fault tolerant (if a task fails on one node, it is run again on another).
    * Distributed - spread over multiple nodes in a cluster.
    * Dataset - a collection of typed data.
* An RDD is created from internal or external data and is read-only once created.
    * It can process HDFS files.

* Three processing steps
    * Step 1: read - two ways:
        * Read from external storage
        ```
        sc.textFile()
        ```
        
        * Read from an in-memory collection (parallelizing a collection)
        ```
        sc.parallelize()
        ```
        
    * Step 2: transformations - evaluated lazily: RDD => RDD or seq(RDD)

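Function | Description | Example
-------|-------|-------
map(fn) | Apply fn to every element and return the resulting RDD | 
filter(fn) | Keep only the elements for which fn returns true and return the resulting RDD | filter(lambda line: "Spark" in line)
flatMap(fn) | Apply fn to every element, flatten the results, and return the resulting RDD | .flatMap(lambda x: x.split(' '))
groupByKey() | Group the values by key and return an iterable of values for each key |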


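    * Step 3: actions: RDD => a value (e.g., a Python list)

Function | Description | Example
-------|-------|-------
reduce(fn) | Combine the elements pairwise with fn and return a single value |
collect() | Return all elements as a list |
count() | Return the number of elements |
countByKey() | Count the elements for each key and return a dictionary |
foreach(fn) | Apply fn to every element for its side effects; returns nothing |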

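The split between lazy transformations and eager actions can be imitated in plain Python 3, where `map()` and `filter()` are also lazy. This is only a local analogy, not Spark code; `traced_upper` and `log` are hypothetical helpers that record when work actually happens.

```python
lines = ["Apache Spark", "is fast", "Spark SQL"]
log = []  # records every element that has actually been processed


def traced_upper(s):
    """Upper-case s and remember that the call happened."""
    log.append(s)
    return s.upper()


# "Transformations": building the pipeline performs no work yet.
mapped = map(traced_upper, lines)
filtered = filter(lambda s: "SPARK" in s, mapped)
log_before = list(log)
print(log_before)   # [] because nothing has run so far

# "Action": consuming the pipeline (like collect()) triggers the computation.
result = list(filtered)
print(result)       # ['APACHE SPARK', 'SPARK SQL']
print(len(log))     # 3, every input line was processed exactly once
```
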


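### Before Spark: using Python's own functions

* map, reduce, filter
    * Each takes two arguments (a function and the data).

Function | Description | Example
-------|-------|-------
map() | Apply the function to every element and return a list | map(fn,data)
filter() | Return the elements for which the function returns True | filter(fn, data)
reduce() | Combine the elements pairwise with the function into a single value | reduce(fn, data)

* Process the data with a plain Python function.
    * Input and output are whole lists, not single items.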

In [28]:
celsius = [39.2, 36.5, 37.3, 37.8]
def c2f(c):
    f=list()
    for i in c:
        _f=(float(9)/5)*i + 32
        f.append(_f)
    return f

print c2f(celsius)

[102.56, 97.7, 99.14, 100.03999999999999]


* Use Python's built-in map() function. Its arguments:
    * (1) the function (it must return a value)
    * (2) the input data

In [8]:
celsius = [39.2, 36.5, 37.3, 37.8]

def c2f(c):
    return (float(9)/5)*c + 32

f=map(c2f, celsius)
print f

[102.56, 97.7, 99.14, 100.03999999999999]


* Use a lambda function.
    * A lambda is an anonymous function; the value of its expression is returned.

In [9]:
map(lambda c:(float(9)/5)*c + 32, celsius)

[102.56, 97.7, 99.14, 100.03999999999999]

* Apply map() to strings.

In [14]:
sentence = 'Hello World'
words = sentence.split()
print words

['Hello', 'World']


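* Given a string, map() iterates over its characters, so split() is applied to each character.
* Given a list, map() iterates over its elements, so split() is applied to each element.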

In [17]:
sentence = "Hello World"
map(lambda x:x.split(),sentence)

[['H'], ['e'], ['l'], ['l'], ['o'], [], ['W'], ['o'], ['r'], ['l'], ['d']]

In [18]:
sentence = ["Hello World"]
map(lambda x:x.split(),sentence)

[['Hello', 'World']]

* filter() selects data.

In [37]:
fib = [0,1,1,2,3,5,8,13,21,34,55]
result = filter(lambda x: x % 2, fib)
print result

[1, 1, 3, 5, 13, 21, 55]


* reduce() calls a function that takes two arguments.
* It evaluates func(...func(func(s1, s2), s3)..., sn).

In [36]:
reduce(lambda x, y: x+y, range(1,101))

5050

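The cells above are Python 2. As a minimal sketch of the same three calls under Python 3 (an assumption, since the notebook itself runs 2.7): `map()` and `filter()` return lazy iterators there, and `reduce()` has moved to `functools`.

```python
from functools import reduce  # reduce is no longer a builtin in Python 3

celsius = [39.2, 36.5, 37.3, 37.8]
fib = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55]

# map() and filter() are lazy in Python 3, so wrap them in list() to see the values.
fahrenheit = list(map(lambda c: 9 / 5 * c + 32, celsius))
odd_fib = list(filter(lambda x: x % 2, fib))

# reduce() folds the sequence into ONE value: ((1 + 2) + 3) + ... + 100
total = reduce(lambda x, y: x + y, range(1, 101))

print(fahrenheit)
print(odd_fib)  # [1, 1, 3, 5, 13, 21, 55]
print(total)    # 5050
```
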
### Using Spark RDDs

* The first paragraph of the Apache Spark Wikipedia article was copied into a file.
* Line 3 was added in Korean, and line 4 repeats the same words.


In [10]:
%%writefile data/ds_spark_wiki.txt
Wikipedia
Apache Spark is an open source cluster computing framework.
아파치 스파크는 오픈 소스 클러스터 컴퓨팅 프레임워크이다.
Apache Spark Apache Spark Apache Spark Apache Spark
Originally developed at the University of California, Berkeley's AMPLab,
the Spark codebase was later donated to the Apache Software Foundation,
which has maintained it since.
Spark provides an interface for programming entire clusters with
implicit data parallelism and fault-tolerance.

Overwriting data/ds_spark_wiki.txt


* Read from a file
    ```
    textFile()
    ```

In [11]:
textFile = sc.textFile("data/ds_spark_wiki.txt")

In [16]:
textFile.first()

u'Wikipedia'

* Split each line into words with map()

In [15]:
words=textFile.map(lambda x:x.split(' '))

In [14]:
words.count()

9

* map() with a named function instead of a lambda

In [12]:
def mySplit(x):
    return x.split(" ")

words=textFile.map(mySplit)

In [14]:
words.count()

9

In [16]:
words.collect()

[[u'Wikipedia'],
 [u'Apache',
  u'Spark',
  u'is',
  u'an',
  u'open',
  u'source',
  u'cluster',
  u'computing',
  u'framework.'],
 [u'\uc544\ud30c\uce58',
  u'\uc2a4\ud30c\ud06c\ub294',
  u'\uc624\ud508',
  u'\uc18c\uc2a4',
  u'\ud074\ub7ec\uc2a4\ud130',
  u'\ucef4\ud4e8\ud305',
  u'\ud504\ub808\uc784\uc6cc\ud06c\uc774\ub2e4.'],
 [u'Originally',
  u'developed',
  u'at',
  u'the',
  u'University',
  u'of',
  u'California,',
  u"Berkeley's",
  u'AMPLab,'],
 [u'the',
  u'Spark',
  u'codebase',
  u'was',
  u'later',
  u'donated',
  u'to',
  u'the',
  u'Apache',
  u'Software',
  u'Foundation,'],
 [u'which', u'has', u'maintained', u'it', u'since.'],
 [u'Spark',
  u'provides',
  u'an',
  u'interface',
  u'for',
  u'programming',
  u'entire',
  u'clusters',
  u'with'],
 [u'implicit', u'data', u'parallelism', u'and', u'fault-tolerance.']]

In [15]:
for i in words.collect():
    print i

[u'Wikipedia']
[u'Apache', u'Spark', u'is', u'an', u'open', u'source', u'cluster', u'computing', u'framework.']
[u'\uc544\ud30c\uce58', u'\uc2a4\ud30c\ud06c\ub294', u'\uc624\ud508', u'\uc18c\uc2a4', u'\ud074\ub7ec\uc2a4\ud130', u'\ucef4\ud4e8\ud305', u'\ud504\ub808\uc784\uc6cc\ud06c\uc774\ub2e4.']
[u'Apache', u'Spark', u'Apache', u'Spark', u'Apache', u'Spark', u'Apache', u'Spark']
[u'Originally', u'developed', u'at', u'the', u'University', u'of', u'California,', u"Berkeley's", u'AMPLab,']
[u'the', u'Spark', u'codebase', u'was', u'later', u'donated', u'to', u'the', u'Apache', u'Software', u'Foundation,']
[u'which', u'has', u'maintained', u'it', u'since.']
[u'Spark', u'provides', u'an', u'interface', u'for', u'programming', u'entire', u'clusters', u'with']
[u'implicit', u'data', u'parallelism', u'and', u'fault-tolerance.']


* Count the characters in each line.
    * The first line, 'Wikipedia', has 9.

In [24]:
textFile.map(lambda s:len(s)).collect()

[9, 59, 32, 72, 71, 30, 64, 46]

* filter()

In [6]:
_sparkLine=textFile.filter(lambda line: "Spark" in line)

In [25]:
print _sparkLine.count()

3


* Korean string literals need the u prefix.

In [26]:
_line = textFile.filter(lambda line: u"스파크" in line)

In [27]:
print _line.first()

아파치 스파크는 오픈 소스 클러스터 컴퓨팅 프레임워크이다.


* groupByKey()
    * groupByKey() groups the values by key, so it returns an iterable per key. Applying mapValues(sum) gives the total for each key.

In [24]:
textFile\
    .flatMap(lambda x:x.split())\
    .map(lambda x:(x,1))\
    .groupByKey()\
    .mapValues(sum)\
    .collect()

[(u'and', 1),
 (u'\uc18c\uc2a4', 1),
 (u'is', 1),
 (u'Wikipedia', 1),
 (u'AMPLab,', 1),
 (u'maintained', 1),
 (u'donated', 1),
 (u'\ucef4\ud4e8\ud305', 1),
 (u'open', 1),
 (u'since.', 1),
 (u'for', 1),
 (u'\ud074\ub7ec\uc2a4\ud130', 1),
 (u'with', 1),
 (u'framework.', 1),
 (u'provides', 1),
 (u'Apache', 6),
 (u'Spark', 7),
 (u'was', 1),
 (u'Originally', 1),
 (u'which', 1),
 (u'fault-tolerance.', 1),
 (u'University', 1),
 (u'codebase', 1),
 (u'interface', 1),
 (u'data', 1),
 (u'\ud504\ub808\uc784\uc6cc\ud06c\uc774\ub2e4.', 1),
 (u'Foundation,', 1),
 (u'\uc624\ud508', 1),
 (u'programming', 1),
 (u'\uc2a4\ud30c\ud06c\ub294', 1),
 (u'the', 3),
 (u'entire', 1),
 (u'has', 1),
 (u'to', 1),
 (u'later', 1),
 (u'computing', 1),
 (u'Software', 1),
 (u'developed', 1),
 (u"Berkeley's", 1),
 (u'it', 1),
 (u'an', 2),
 (u'cluster', 1),
 (u'implicit', 1),
 (u'at', 1),
 (u'of', 1),
 (u'clusters', 1),
 (u'parallelism', 1),
 (u'\uc544\ud30c\uce58', 1),
 (u'California,', 1),
 (u'source', 1)]

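To make the grouping step concrete, here is a plain-Python analogue (not Spark code) of `groupByKey().mapValues(sum)` next to the running-total style of `reduceByKey(add)`. The two sample lines come from the wiki file above; the variable names are illustrative only.

```python
from collections import Counter, defaultdict

lines = [
    "Apache Spark is an open source cluster computing framework.",
    "Apache Spark Apache Spark Apache Spark Apache Spark",
]

# flatMap + map: one (word, 1) pair per word
pairs = [(word, 1) for line in lines for word in line.split()]

# groupByKey(): first collect every value that belongs to a key ...
grouped = defaultdict(list)
for word, one in pairs:
    grouped[word].append(one)
# ... then mapValues(sum) adds the collected values up.
counts_grouped = {word: sum(ones) for word, ones in grouped.items()}

# reduceByKey(add): keep only a running total per key, never the full list.
counts_reduced = Counter()
for word, one in pairs:
    counts_reduced[word] += one

print(counts_grouped["Apache"], counts_grouped["Spark"])  # 5 5
print(counts_grouped == dict(counts_reduced))              # True
```

Both styles give the same counts. In Spark, `reduceByKey` is generally the better fit for large data because values are combined within each partition before they are shuffled.
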
* Using parallelize()
    * Read from a list and turn it into an RDD.

In [28]:
_aList=[1,2,3]
rdd = sc.parallelize(_aList)

In [44]:
rdd.take(3)

[1, 2, 3]

* Square the numbers with map() and collect()

In [29]:
nums = sc.parallelize([1, 2, 3, 4])
squared = nums.map(lambda x: x * x).collect()
print squared

[1, 4, 9, 16]


* Processing sentences
* Replacing words

In [23]:
a=["this is","a line"]
_rdd=sc.parallelize(a)

words=_rdd.map(lambda x:x.split())
print words.collect()

[['this', 'is'], ['a', 'line']]


In [18]:
_upper=_rdd.map(lambda x:x.replace("a","AA"))
_upper.take(10)

['this is', 'AA line']

* Upper-case the first word of each line and print it

In [29]:
's'.upper()

'S'

In [30]:
pluralRDD =words.map(lambda x: x[0].upper())
print pluralRDD.collect()

['THIS', 'A']


In [4]:
pluralRDD =words.map(lambda x: [i.upper() for i in x])
print pluralRDD.collect()

[[u'WIKIPEDIA'], [u'APACHE', u'SPARK', u'IS', u'AN', u'OPEN', u'SOURCE', u'CLUSTER', u'COMPUTING', u'FRAMEWORK.'], [u'\uc544\ud30c\uce58', u'\uc2a4\ud30c\ud06c\ub294', u'\uc624\ud508', u'\uc18c\uc2a4', u'\ud074\ub7ec\uc2a4\ud130', u'\ucef4\ud4e8\ud305', u'\ud504\ub808\uc784\uc6cc\ud06c\uc774\ub2e4.'], [u'APACHE', u'SPARK', u'APACHE', u'SPARK', u'APACHE', u'SPARK', u'APACHE', u'SPARK'], [u'ORIGINALLY', u'DEVELOPED', u'AT', u'THE', u'UNIVERSITY', u'OF', u'CALIFORNIA,', u"BERKELEY'S", u'AMPLAB,'], [u'THE', u'SPARK', u'CODEBASE', u'WAS', u'LATER', u'DONATED', u'TO', u'THE', u'APACHE', u'SOFTWARE', u'FOUNDATION,'], [u'WHICH', u'HAS', u'MAINTAINED', u'IT', u'SINCE.'], [u'SPARK', u'PROVIDES', u'AN', u'INTERFACE', u'FOR', u'PROGRAMMING', u'ENTIRE', u'CLUSTERS', u'WITH'], [u'IMPLICIT', u'DATA', u'PARALLELISM', u'AND', u'FAULT-TOLERANCE.']]


* Chain a transformation (map()) and an action (collect()) in one expression

In [33]:
pluralRDD =words.map(lambda x: [i.upper() for i in x]).collect()
print pluralRDD

[['THIS', 'IS'], ['A', 'LINE']]


In [34]:
wordsLength = words\
    .map(len)\
    .collect()
print wordsLength

[2, 2]


* Write to a file (saveAsTextFile() creates a directory of part files under the given name)

In [6]:
pluralRDD.saveAsTextFile("data/ds_spark_wiki1.txt")

* create RDD from CSV


In [6]:
%%writefile ./data/ds_spark_2cols.csv
35, 2
40, 27
12, 38
15, 31
21, 1
14, 19
46, 1
10, 34
28, 3
48, 1
16, 2
30, 3
32, 2
48, 1
31, 2
22, 1
12, 3
39, 29
19, 37
25, 2


Writing ./data/ds_spark_2cols.csv


In [6]:
inp_file = sc.textFile("./data/ds_spark_2cols.csv")
numbers_rdd = inp_file.map(lambda line: line.split(','))

In [7]:
numbers_rdd.take(10)

[[u'35', u' 2'],
 [u'40', u' 27'],
 [u'12', u' 38'],
 [u'15', u' 31'],
 [u'21', u' 1'],
 [u'14', u' 19'],
 [u'46', u' 1'],
 [u'10', u' 34'],
 [u'28', u' 3'],
 [u'48', u' 1']]

## S.4 Data types

Type | Description
----------|----------
Vector | Comes in dense and sparse forms.
Labeled Point | Pairs a label (class value) with a feature vector. Used for supervised learning.
Matrix | Organized in rows and columns.

* vectors
    * local vector - lives on a single machine.
    * pyspark.mllib.linalg.Vectors

dense vector | sparse vector
----------|----------
Use when few values are empty. Stored as an array of its values | Use when many values are empty. Indices and values are stored in separate arrays
(160,69,24) | (3,[0,1,2],[160.0,69.0,24.0])
Accepted input: NumPy array, Python list | Accepted input: MLlib SparseVector, SciPy csc_matrix with a single column

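The two layouts in the table can be converted into each other with a few lines of plain Python. This is only a sketch of the representation, not the MLlib implementation; `sparse_to_dense` and `dense_to_sparse` are hypothetical helper names.

```python
def sparse_to_dense(size, indices, values):
    """Expand (size, indices, values) into a full list, filling gaps with 0.0."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = float(v)
    return dense


def dense_to_sparse(dense):
    """Keep only the non-zero entries as (size, indices, values)."""
    indices = [i for i, v in enumerate(dense) if v != 0]
    values = [float(dense[i]) for i in indices]
    return len(dense), indices, values


print(sparse_to_dense(3, [1, 2], [1.0, 3.0]))  # [0.0, 1.0, 3.0]
print(dense_to_sparse([160, 69, 24]))           # (3, [0, 1, 2], [160.0, 69.0, 24.0])
```
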
### dense vector

In [41]:
from pyspark.mllib.linalg import Vectors
dv1=Vectors.dense([0.0, 1.1, 0.1])

In [42]:
print dv1

[0.0,1.1,0.1]


* A NumPy array can be used to create a dense vector.

In [45]:
import numpy as np

dv2 = np.array([1.0, 2.1, 3.2])

In [43]:
print dv1

[0.0,1.1,0.1]


* A Python list can also be used to create a dense vector.

In [46]:
dv2 = [1.0, 2.1, 3.2]

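### sparse vector

* SparseVector vs Vectors.sparse: Vectors.sparse() is a factory method that returns a SparseVector.

* toArray() expands the vector into a dense NumPy array.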


In [48]:
sv1 = Vectors.sparse(3, [1, 2], [1.0, 3.0])
print sv1.toArray()

[ 0.  1.  3.]


* scipy.sparse

In [50]:
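import numpy as np
import scipy.sparse as sps

# csc_matrix((data, row_indices, column_pointers)): a 3x1 column with 1.0 in row 0 and 3.0 in row 2
sv2 = sps.csc_matrix((np.array([1.0,3.0]), np.array([0,2]), np.array([0,2])), shape = (3,1))
sv2.todense()
print sv2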

  (0, 0)	1.0
  (2, 0)	3.0


### labeled point

* The features are a local vector, either dense or sparse.
* A labeled point consists of a label (the response) and its features.


In [52]:
from pyspark.mllib.regression import LabeledPoint
LabeledPoint(1, [1.0, 2.0, 3.0])

LabeledPoint(1.0, [1.0,2.0,3.0])

In [8]:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors
LabeledPoint(1.0, Vectors.dense([1.0, 2.0, 3.0]))

LabeledPoint(1.0, [1.0,2.0,3.0])

* Create a DataFrame from a Python list

In [12]:
p = [[1,[1.0,2.0,3.0]],[1,[1.1,2.1,3.1]],[0,[1.2,2.2,3.3]]]
trainDf=sqlCtx.createDataFrame(p)
trainDf.collect()

[Row(_1=1, _2=[1.0, 2.0, 3.0]),
 Row(_1=1, _2=[1.1, 2.1, 3.1]),
 Row(_1=0, _2=[1.2, 2.2, 3.3])]

* Building the list from LabeledPoint objects yields the columns label and features.

In [9]:
from pyspark.mllib.regression import LabeledPoint
p = [LabeledPoint(1,[1.0,2.0,3.0]),
     LabeledPoint(1,[1.1,2.1,3.1]),
     LabeledPoint(0,[1.2,2.2,3.3])]
trainDf=sqlCtx.createDataFrame(p)
trainDf.collect()

[Row(features=DenseVector([1.0, 2.0, 3.0]), label=1.0),
 Row(features=DenseVector([1.1, 2.1, 3.1]), label=1.0),
 Row(features=DenseVector([1.2, 2.2, 3.3]), label=0.0)]

In [11]:
from pyspark.mllib.linalg import Vectors

trainDf = sqlCtx.createDataFrame([
    (1.0, Vectors.dense([0.0, 1.1, 0.1])),
    (0.0, Vectors.dense([2.0, 1.0, 1.0])),
    (0.0, Vectors.dense([2.0, 1.3, 1.0])),
    (1.0, Vectors.dense([0.0, 1.2, 0.5]))], ["label", "features"])
trainDf.collect()

[Row(label=1.0, features=DenseVector([0.0, 1.1, 0.1])),
 Row(label=0.0, features=DenseVector([2.0, 1.0, 1.0])),
 Row(label=0.0, features=DenseVector([2.0, 1.3, 1.0])),
 Row(label=1.0, features=DenseVector([0.0, 1.2, 0.5]))]

* Create a DataFrame using a schema

In [6]:
from pyspark.mllib.linalg import SparseVector, VectorUDT
from pyspark.sql.types import StructType, StructField, DoubleType
_rdd = sc.parallelize([
    (0.0, SparseVector(4, {1: 1.0, 3: 5.5})),
    (1.0, SparseVector(4, {0: -1.0, 2: 0.5}))])

In [7]:
schema = StructType([
    StructField("label", DoubleType(), True),
    StructField("features", VectorUDT(), True)
])

In [8]:
trainDf=_rdd.toDF(schema)
trainDf.printSchema()

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)



### matrix

* local matrix - pyspark.mllib.linalg.Matrix, Matrices
* distributed matrix
    * pyspark.mllib.linalg.distributed.RowMatrix
    * pyspark.mllib.linalg.distributed.IndexedRow, IndexedRowMatrix
    * pyspark.mllib.linalg.distributed.BlockMatrix

In [53]:
from pyspark.mllib.linalg import Matrix, Matrices

# Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)); values are listed in column-major order
dm = Matrices.dense(3, 2, [1, 3, 5, 2, 4, 6])

# Create a sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0)) from column pointers, row indices, values
sm = Matrices.sparse(3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8])

### libsvm format

* RDD API: `data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')` (MLUtils is in `pyspark.mllib.util`)
* format
    ```
    [label] [index1]:[value1] [index2]:[value2] ...
    [label] [index1]:[value1] [index2]:[value2] ...
    ```
    * label - the class
    * index - integers (one-based, in ascending order)
    * value - real numbers

* see - /home/jsl/Downloads/spark-1.6.0-bin-hadoop2.6/data/mllib/sample_libsvm_data.txt
```
0 128:51 129:159 130:253 131:159 132:50 155:48 156:238 157:252 158:252 159:252 160:237 182:54 183:227 184:253 185:252 186:239 187:233 188:252 189:57 190:6 208:10 209:60 210:224 211:252 212:253 213:252 214:202 215:84 216:252 217:253 218:122 236:163 237:252 238:252 239:252 240:253 241:252 242:252 243:96 244:189 245:253 246:167 263:51 264:238 265:253 266:253 267:190 268:114 269:253 270:228 271:47 272:79 273:255 274:168 290:48 291:238 292:252 293:252 294:179 295:12 296:75 297:121 298:21 301:253 302:243 303:50 317:38 318:165 319:253 320:233 321:208 322:84 329:253 330:252 331:165 344:7 345:178 346:252 347:240 348:71 349:19 350:28 357:253 358:252 359:195 372:57 373:252 374:252 375:63 385:253 386:252 387:195 400:198 401:253 402:190 413:255 414:253 415:196 427:76 428:246 429:252 430:112 441:253 442:252 443:148 455:85 456:252 457:230 458:25 467:7 468:135 469:253 470:186 471:12 483:85 484:252 485:223 494:7 495:131 496:252 497:225 498:71 511:85 512:252 513:145 521:48 522:165 523:252 524:173 539:86 540:253 541:225 548:114 549:238 550:253 551:162 567:85 568:252 569:249 570:146 571:48 572:29 573:85 574:178 575:225 576:253 577:223 578:167 579:56 595:85 596:252 597:252 598:252 599:229 600:215 601:252 602:252 603:252 604:196 605:130 623:28 624:199 625:252 626:252 627:253 628:252 629:252 630:233 631:145 652:25 653:128 654:252 655:253 656:252 657:141 658:37
```
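To show how one such line maps to a label and a sparse feature set, here is a minimal plain-Python parser. It is a sketch for illustration, not the Spark loader; `parse_libsvm_line` is a hypothetical name. It shifts the one-based file indices to zero-based ones, which is also what Spark's loader does.

```python
def parse_libsvm_line(line):
    """Return (label, {zero_based_index: value}) for one LIBSVM-format line."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index) - 1] = float(value)  # the file uses one-based indices
    return label, features


# The first three features of the sample line shown above.
label, features = parse_libsvm_line("0 128:51 129:159 130:253")
print(label, features)  # 0.0 {127: 51.0, 128: 159.0, 129: 253.0}
```
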

In [10]:
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)

In [11]:
svmfn="/home/jsl/Downloads/spark-1.6.0-bin-hadoop2.6/data/mllib/sample_libsvm_data.txt"
svmDf = sqlCtx.read.format("libsvm").load(svmfn)

In [12]:
type(svmDf)

pyspark.sql.dataframe.DataFrame

In [13]:
svmDf.printSchema()

root
 |-- label: double (nullable = false)
 |-- features: vector (nullable = false)



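* CSR format (https://www.ncsu.edu/hpc/Courses/6sparse.html)
    ```
    0 0 0 0
    5 8 0 0
    0 0 3 0
    0 6 0 0
    ```
    * non-zero values: 5 8 3 6
    * column index: 0 1 2 1 (from the positions 5(1,0) 8(1,1) 3(2,2) 6(3,1), keep only the column of each value)
    * row pointer: 0 0 2 3 4 (entry i is where row i starts in the value array; the last entry is the number of non-zeros)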

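The three CSR arrays for the 4x4 example can be derived with a short plain-Python loop. This is a sketch of the format only; `to_csr` is a hypothetical helper, and real code would normally rely on `scipy.sparse.csr_matrix` instead.

```python
def to_csr(matrix):
    """Return (values, column_index, row_pointer) for a list-of-lists matrix."""
    values, col_index, row_ptr = [], [], [0]
    for row in matrix:
        for col, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_index.append(col)
        # After each row, record how many non-zeros have been seen so far.
        row_ptr.append(len(values))
    return values, col_index, row_ptr


m = [
    [0, 0, 0, 0],
    [5, 8, 0, 0],
    [0, 0, 3, 0],
    [0, 6, 0, 0],
]
values, col_index, row_ptr = to_csr(m)
print(values)     # [5, 8, 3, 6]
print(col_index)  # [0, 1, 2, 1]
print(row_ptr)    # [0, 0, 2, 3, 4]
```
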


## Problem S-3: Build word vectors as MLlib input using RDDs

* Words can be counted with the RDD API (map, reduce, and so on).
* The mllib package can transform the data.
    * TF-IDF, Word2Vec, and others are available.
    * For transformations that mllib lacks, use ml (the package that transforms DataFrames).
        * Tokenizer, StopWordsRemover, NGram, etc.


In [439]:
!ls data/ds_spark_wiki.txt

data/ds_spark_wiki.txt


### Word count over the whole file

In [19]:
lines = sc.textFile("data/ds_spark_wiki.txt")
wc = lines\
    .flatMap(lambda x: x.split(' '))

In [20]:
wc.collect()

[u'Wikipedia',
 u'Apache',
 u'Spark',
 u'is',
 u'an',
 u'open',
 u'source',
 u'cluster',
 u'computing',
 u'framework.',
 u'\uc544\ud30c\uce58',
 u'\uc2a4\ud30c\ud06c\ub294',
 u'\uc624\ud508',
 u'\uc18c\uc2a4',
 u'\ud074\ub7ec\uc2a4\ud130',
 u'\ucef4\ud4e8\ud305',
 u'\ud504\ub808\uc784\uc6cc\ud06c\uc774\ub2e4.',
 u'Apache',
 u'Spark',
 u'Apache',
 u'Spark',
 u'Apache',
 u'Spark',
 u'Apache',
 u'Spark',
 u'Originally',
 u'developed',
 u'at',
 u'the',
 u'University',
 u'of',
 u'California,',
 u"Berkeley's",
 u'AMPLab,',
 u'the',
 u'Spark',
 u'codebase',
 u'was',
 u'later',
 u'donated',
 u'to',
 u'the',
 u'Apache',
 u'Software',
 u'Foundation,',
 u'which',
 u'has',
 u'maintained',
 u'it',
 u'since.',
 u'Spark',
 u'provides',
 u'an',
 u'interface',
 u'for',
 u'programming',
 u'entire',
 u'clusters',
 u'with',
 u'implicit',
 u'data',
 u'parallelism',
 u'and',
 u'fault-tolerance.']

* Count the words and produce (word, count) tuples

In [453]:
from operator import add
wc = sc.textFile("data/ds_spark_wiki.txt")\
    .flatMap(lambda x: x.split(' '))\
    .map(lambda x: (x.lower().rstrip().lstrip().rstrip(',').rstrip('.'), 1))\
    .reduceByKey(add)

In [454]:
wc.count()

50

In [455]:
wc.first()

(u'and', 1)

### Word count per line

* Keep the words of each line together, as preparation for DataFrame processing

In [3]:
from operator import add
wc = sc.textFile("data/ds_spark_wiki.txt")\
    .map(lambda x: x.replace(',',' ').replace('.',' ').replace('-',' ').lower())\
    .map(lambda x:x.split())\
    .map(lambda x:[(i,1) for i in x])

In [4]:
for e in wc.collect():
    print e

[(u'wikipedia', 1)]
[(u'apache', 1), (u'spark', 1), (u'is', 1), (u'an', 1), (u'open', 1), (u'source', 1), (u'cluster', 1), (u'computing', 1), (u'framework', 1)]
[(u'\uc544\ud30c\uce58', 1), (u'\uc2a4\ud30c\ud06c\ub294', 1), (u'\uc624\ud508', 1), (u'\uc18c\uc2a4', 1), (u'\ud074\ub7ec\uc2a4\ud130', 1), (u'\ucef4\ud4e8\ud305', 1), (u'\ud504\ub808\uc784\uc6cc\ud06c\uc774\ub2e4', 1)]
[(u'originally', 1), (u'developed', 1), (u'at', 1), (u'the', 1), (u'university', 1), (u'of', 1), (u'california', 1), (u"berkeley's", 1), (u'amplab', 1)]
[(u'the', 1), (u'spark', 1), (u'codebase', 1), (u'was', 1), (u'later', 1), (u'donated', 1), (u'to', 1), (u'the', 1), (u'apache', 1), (u'software', 1), (u'foundation', 1)]
[(u'which', 1), (u'has', 1), (u'maintained', 1), (u'it', 1), (u'since', 1)]
[(u'spark', 1), (u'provides', 1), (u'an', 1), (u'interface', 1), (u'for', 1), (u'programming', 1), (u'entire', 1), (u'clusters', 1), (u'with', 1)]
[(u'implicit', 1), (u'data', 1), (u'parallelism', 1), (u'and', 1), (u'f

* TF (Term Frequency)
    * HashingTF

In [21]:
documents = sc.textFile("data/ds_spark_wiki.txt").map(lambda line: line.split(" "))

In [26]:
from pyspark.mllib.feature import HashingTF

hashingTF = HashingTF()
tf = hashingTF.transform(documents)

In [28]:
tf.collect()

[SparseVector(1048576, {253068: 1.0}),
 SparseVector(1048576, {36751: 1.0, 50570: 1.0, 68380: 1.0, 415281: 1.0, 511377: 1.0, 728364: 1.0, 862087: 1.0, 938426: 1.0, 999480: 1.0}),
 SparseVector(1048576, {63234: 1.0, 340190: 1.0, 357478: 1.0, 375592: 1.0, 458138: 1.0, 486171: 1.0, 598772: 1.0}),
 SparseVector(1048576, {938426: 4.0, 999480: 4.0}),
 SparseVector(1048576, {36757: 1.0, 225801: 1.0, 323305: 1.0, 453405: 1.0, 498679: 1.0, 518030: 1.0, 688842: 1.0, 762570: 1.0, 959994: 1.0}),
 SparseVector(1048576, {420843: 1.0, 550676: 1.0, 725041: 1.0, 782544: 1.0, 938426: 1.0, 959994: 2.0, 991590: 1.0, 993084: 1.0, 996703: 1.0, 999480: 1.0}),
 SparseVector(1048576, {50573: 1.0, 263739: 1.0, 892834: 1.0, 1014710: 1.0, 1035538: 1.0}),
 SparseVector(1048576, {3932: 1.0, 36751: 1.0, 192182: 1.0, 358969: 1.0, 363244: 1.0, 496856: 1.0, 546913: 1.0, 938426: 1.0, 951974: 1.0}),
 SparseVector(1048576, {69621: 1.0, 157580: 1.0, 219357: 1.0, 297436: 1.0, 715648: 1.0})]

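The idea behind HashingTF (the "hashing trick") can be sketched in plain Python: each term is hashed to a bucket index and the bucket counts become the term frequencies. The sketch below assumes `zlib.crc32` as the hash function purely for illustration, so its indices will NOT match the Spark output above; `hashing_tf` is a hypothetical helper. The default of 2^20 buckets mirrors the vector size 1048576 seen in that output.

```python
import zlib
from collections import Counter


def hashing_tf(terms, num_features=1 << 20):
    """Map each term to hash(term) % num_features and count the hits per bucket."""
    counts = Counter()
    for term in terms:
        index = zlib.crc32(term.encode("utf-8")) % num_features
        counts[index] += 1.0
    return dict(counts)


doc = "Apache Spark Apache Spark Apache Spark Apache Spark".split(" ")
vec = hashing_tf(doc)
print(vec)                # bucket index -> term frequency
print(sum(vec.values()))  # 8.0, one count for each of the 8 tokens
```

No vocabulary has to be built or shared between workers, at the cost that two different terms can collide in the same bucket.
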
In [None]:
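def countPartitions(idx, iterator):
    # Count the elements of one partition and emit (partition index, count).
    c = 0
    for _ in iterator:
        c += 1
    yield (idx, c)
# mapPartitionsWithIndex passes (index, iterator); mapPartitions passes only the iterator.
_wc=wc.mapPartitionsWithIndex(countPartitions)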

```
from pyspark.mllib.regression import LabeledPoint
# Go through .rdd so the call also works in Spark 2.x, where DataFrame has no map().
trainRdd = trainDf.rdd.map(lambda row: LabeledPoint(row.label,row.features))
```

## S.5 Spark SQL

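* Spark SQL
    * lets you give data a structure and query it with SQL. RDDs are used when the data is unstructured.

Aspect | Spark SQL | RDD
-----|-----|-----
Data | Structured | Unstructured

* Spark SQL components

Aspect | Description
-----|-----
Language API | APIs for Python, Java, Scala, and HiveQL
Schema RDD | Applies a schema to an RDD and turns it into a temporary table, i.e. a DataFrame.
Data Sources | Supports many sources - HDFS, Cassandra, HBase, and relational databases


* Creating a DataFrame
    * Read name and height records and select those with a height of 170 or more.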

In [None]:
p = [{'name': 'kim', 'height': 170}]
sqlCtx.createDataFrame(p).collect()

type(p)

from pyspark.sql import Row
# Pass the Row itself; wrapping it in list() would drop the column names.
pRow=Row(name="kim", height=1961)

df=sqlCtx.createDataFrame([pRow])
df.show()

## Problem S-4: Read a file and build feature vectors

* Create a DataFrame from an RDD and query it with SQL.
* Network intrusion data
    * https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
* The attack type is the last field (index 41, i.e. the 42nd column)

Class | Count
-------|-------
normal | 97278
attack | 396743
total | 494021

In [3]:
import os
import urllib
_url = 'http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz'
_fname = os.path.join(os.getcwd(),'data','kddcup.data_10_percent.gz')
if(not os.path.exists(_fname)):
    print "%s data does not exist! retrieving.." % _fname
    _f=urllib.urlretrieve(_url,_fname)


In [7]:
_rdd = sc.textFile(_fname)

In [8]:
_rdd.count()

494021

In [9]:
_rdd.take(3)

[u'0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.',
 u'0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,19,19,1.00,0.00,0.05,0.00,0.00,0.00,0.00,0.00,normal.',
 u'0,tcp,http,SF,235,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,29,29,1.00,0.00,0.03,0.00,0.00,0.00,0.00,0.00,normal.']

* Count the records that contain 'normal.'

In [10]:
_normal = _rdd.filter(lambda x: 'normal.' in x)
print _normal.count()

97278


In [11]:
_csvRdd=_rdd.map(lambda x: x.split(','))

In [12]:
print _csvRdd.take(1)

[[u'0', u'tcp', u'http', u'SF', u'181', u'5450', u'0', u'0', u'0', u'0', u'0', u'1', u'0', u'0', u'0', u'0', u'0', u'0', u'0', u'0', u'0', u'0', u'8', u'8', u'0.00', u'0.00', u'0.00', u'0.00', u'1.00', u'0.00', u'0.00', u'9', u'9', u'1.00', u'0.00', u'0.11', u'0.00', u'0.00', u'0.00', u'0.00', u'0.00', u'normal.']]


* Classifying the records
    * Use reduceByKey() to count the records for each label.

In [22]:
_kv = _csvRdd.map(lambda x: (x[41], 1))
_attack = _kv.reduceByKey(lambda x,y: x+y)

In [23]:
_attack.collect()

[(u'guess_passwd.', 53),
 (u'nmap.', 231),
 (u'warezmaster.', 20),
 (u'rootkit.', 10),
 (u'warezclient.', 1020),
 (u'smurf.', 280790),
 (u'pod.', 264),
 (u'neptune.', 107201),
 (u'normal.', 97278),
 (u'spy.', 2),
 (u'ftp_write.', 8),
 (u'phf.', 4),
 (u'portsweep.', 1040),
 (u'teardrop.', 979),
 (u'buffer_overflow.', 30),
 (u'land.', 21),
 (u'imap.', 12),
 (u'loadmodule.', 9),
 (u'perl.', 3),
 (u'multihop.', 7),
 (u'back.', 2203),
 (u'ipsweep.', 1247),
 (u'satan.', 1589)]

In [14]:
_normalRdd=_csvRdd.filter(lambda x: x[41]=="normal.")
_attackRdd=_csvRdd.filter(lambda x: x[41]!="normal.")

In [16]:
print _normalRdd.count()
print _attackRdd.count()

97278
396743


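* combineByKey(createCombiner, mergeValue, mergeCombiners)
    * createCombiner: turns the first value seen for a key into a combiner
        * to track a sum and a count per key, start from (value, 1)
    * mergeValue: folds another value of the same key into the combiner (within a partition)
    * mergeCombiners: merges two combiners of the same key (across partitions)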
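
The three functions passed to combineByKey can be hard to picture, so here is a minimal plain-Python sketch of the same idea that runs without Spark. The helper `combine_by_key` and the way the sample pairs are split into two partitions are assumptions made for illustration; Spark decides the real partitioning itself.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Emulate RDD.combineByKey on a list of partitions of (key, value) pairs."""
    per_partition = []
    for part in partitions:
        combiners = {}
        for key, value in part:
            if key in combiners:
                # key already seen in this partition: fold the value in
                combiners[key] = merge_value(combiners[key], value)
            else:
                # first value for this key in this partition
                combiners[key] = create_combiner(value)
        per_partition.append(combiners)

    merged = {}
    for combiners in per_partition:
        for key, comb in combiners.items():
            # merge the per-partition results for the same key
            merged[key] = merge_combiners(merged[key], comb) if key in merged else comb
    return merged


# Hypothetical split of the sample data into two partitions
partitions = [[(0, 2.0), (1, 0.0)], [(0, 4.0), (1, 10.0), (1, 20.0)]]

sum_count = combine_by_key(
    partitions,
    lambda value: (value, 1),                        # createCombiner
    lambda acc, value: (acc[0] + value, acc[1] + 1), # mergeValue
    lambda a, b: (a[0] + b[0], a[1] + b[1]),         # mergeCombiners
)
averages = {key: total / count for key, (total, count) in sum_count.items()}
print(averages)
```

Each key ends up with a (sum, count) pair, and dividing the two gives the per-key average.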


In [25]:
data = sc.parallelize( [(0, 2.), (0, 4.), (1, 0.), (1, 10.), (1, 20.)] )
sumCount = data.combineByKey(lambda value: (value, 1),
                             lambda x, value: (x[0] + value, x[1] + 1),
                             lambda x, y: (x[0] + y[0], x[1] + y[1]))

# kv is (label, (value_sum, count)); indexing avoids Python 2-only tuple parameters
averageByKey = sumCount.map(lambda kv: (kv[0], kv[1][0] / kv[1][1]))

print averageByKey.collectAsMap()

{0: 3.0, 1: 10.0}


In [27]:
from operator import add
aggregated_counts = (data
    .map(lambda kv: (kv, 1))
    .reduceByKey(add)
    .map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))
    .groupByKey()
    .mapValues(lambda xs: (list(xs), sum(x[1] for x in xs))))

aggregated_counts.collect()

[(0, ([(2.0, 1), (4.0, 1)], 2)), (1, ([(10.0, 1), (0.0, 1), (20.0, 1)], 3))]

In [24]:
sum_counts = _kv.combineByKey(
    (lambda x: (x, 1)), # the initial value, with value x and count 1
    (lambda acc, value: (acc[0]+value, acc[1]+1)), # how to combine a pair value with the accumulator: sum value, and increment count
    (lambda acc1, acc2: (acc1[0]+acc2[0], acc1[1]+acc2[1])) # combine accumulators
)

sum_counts.collectAsMap()

{u'back.': (2203, 2203),
 u'buffer_overflow.': (30, 30),
 u'ftp_write.': (8, 8),
 u'guess_passwd.': (53, 53),
 u'imap.': (12, 12),
 u'ipsweep.': (1247, 1247),
 u'land.': (21, 21),
 u'loadmodule.': (9, 9),
 u'multihop.': (7, 7),
 u'neptune.': (107201, 107201),
 u'nmap.': (231, 231),
 u'normal.': (97278, 97278),
 u'perl.': (3, 3),
 u'phf.': (4, 4),
 u'pod.': (264, 264),
 u'portsweep.': (1040, 1040),
 u'rootkit.': (10, 10),
 u'satan.': (1589, 1589),
 u'smurf.': (280790, 280790),
 u'spy.': (2, 2),
 u'teardrop.': (979, 979),
 u'warezclient.': (1020, 1020),
 u'warezmaster.': (20, 20)}

* From an RDD to a DataFrame and SQL

In [65]:
from pyspark.sql import Row

_csv = _data.map(lambda l: l.split(","))
_rdd = _csv.map(lambda p: 
    Row(
        duration=int(p[0]), 
        protocol=p[1],
        service=p[2],
        flag=p[3],
        src_bytes=int(p[4]),
        dst_bytes=int(p[5])
    )
)

In [66]:
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)

_df=sqlCtx.createDataFrame(_rdd)
_df.registerTempTable("_tab")

In [68]:
_df.select("protocol", "duration", "dst_bytes").groupBy("protocol").count().show()

+--------+------+
|protocol| count|
+--------+------+
|     udp| 20354|
|     tcp|190065|
|    icmp|283602|
+--------+------+



In [70]:
_df.select("protocol", "duration", "dst_bytes")\
    .filter(_df.duration>1000)\
    .filter(_df.dst_bytes==0)\
    .groupBy("protocol")\
    .count()\
    .show()

+--------+-----+
|protocol|count|
+--------+-----+
|     tcp|  139|
+--------+-----+



In [78]:
tcp_interactions = sqlCtx.sql(
"""
    SELECT duration, dst_bytes FROM _tab
    WHERE protocol = 'tcp' AND duration > 1000 AND dst_bytes = 0
""")

In [79]:
tcp_interactions.show()

+--------+---------+
|duration|dst_bytes|
+--------+---------+
|    5057|        0|
|    5059|        0|
|    5051|        0|
|    5056|        0|
|    5051|        0|
|    5039|        0|
|    5062|        0|
|    5041|        0|
|    5056|        0|
|    5064|        0|
|    5043|        0|
|    5061|        0|
|    5049|        0|
|    5061|        0|
|    5048|        0|
|    5047|        0|
|    5044|        0|
|    5063|        0|
|    5068|        0|
|    5062|        0|
+--------+---------+
only showing top 20 rows



In [74]:
tcp_interactions_out = tcp_interactions\
    .map(lambda p: "Duration: {}, Dest. bytes: {}".format(p.duration, p.dst_bytes))

In [76]:
for i,ti_out in enumerate(tcp_interactions_out.collect()):
    if(i%10==0):
        print ti_out

Duration: 5057, Dest. bytes: 0
Duration: 5043, Dest. bytes: 0
Duration: 5046, Dest. bytes: 0
Duration: 5051, Dest. bytes: 0
Duration: 5057, Dest. bytes: 0
Duration: 5063, Dest. bytes: 0
Duration: 42448, Dest. bytes: 0
Duration: 40121, Dest. bytes: 0
Duration: 31709, Dest. bytes: 0
Duration: 30619, Dest. bytes: 0
Duration: 22616, Dest. bytes: 0
Duration: 21455, Dest. bytes: 0
Duration: 13998, Dest. bytes: 0
Duration: 12933, Dest. bytes: 0


## Problem S-5: Reading data from files with Spark SQL

* 1. Reading from a JSON file
    * Note: the file appears to need one JSON record per line.
* 2. Twitter JSON
* 3. Reading JSON from a URL
* 4. Reading from a CSV file
* 5. com.databricks.spark.csv
    * vim conf/spark-defaults.conf
        ```
        spark.jars.packages=com.databricks:spark-csv_2.10:1.3.0
        ```

* sqlContext.jsonRDD()
* sqlContext.jsonFile()

### Reading a JSON file

* Read a JSON file, then query it with SQL.

In [None]:
# %load /home/jsl/Downloads/spark-1.6.0-bin-hadoop2.6/examples/src/main/resources/people.json
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
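
The people.json sample stores one JSON object per line rather than one JSON document for the whole file. The following plain-Python sketch, which runs without Spark, shows what that means in practice: each line parses on its own, while the file as a whole is not valid JSON. The variable names are illustrative only.

```python
import json

# Same three records as people.json, one JSON object per line
json_lines = """{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}"""

# Parse line by line, skipping blank lines
records = [json.loads(line) for line in json_lines.splitlines() if line.strip()]

# Equivalent of filtering on age > 21; records without an age are skipped
older = [r["name"] for r in records if r.get("age") is not None and r["age"] > 21]
print(older)

# Parsing the whole text as a single JSON document fails
try:
    json.loads(json_lines)
    whole_ok = True
except ValueError:
    whole_ok = False
print(whole_ok)
```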


In [35]:
pDF= sqlCtx.read.json("/home/jsl/Downloads/spark-1.6.0-bin-hadoop2.6/examples/src/main/resources/people.json")

In [36]:
type(pDF)

pyspark.sql.dataframe.DataFrame

In [37]:
pDF.filter(pDF['age'] > 21).show()

+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+



In [38]:
pDF.registerTempTable("people")
sqlCtx.sql("select name from people").show()

+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+



### Reading Twitter JSON

Form | Example
-------|-------
Backslash-escaped (saved as a unicode string) | "{\"created_at\":\"Sun Nov 13 00:05:19 +0000 2016\"
Plain | {"created_at":"Sun Nov 13 00:05:19 +0000 2016"

* Related JSON reader option: allowBackslashEscapingAnyCharacter
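
One way to get from the escaped form to the plain form is to decode twice: the first pass yields the inner JSON text, the second yields the record. This is a minimal sketch that runs without Spark. The sample line is shortened to a single field and closed with a brace for illustration, which is an assumption, since the example in the table is truncated.

```python
import json

# A string-encoded tweet: the whole JSON object is itself a JSON string
raw_line = r'"{\"created_at\":\"Sun Nov 13 00:05:19 +0000 2016\"}"'

inner = json.loads(raw_line)   # first pass: removes the outer quotes and backslashes
tweet = json.loads(inner)      # second pass: parses the actual object

print(inner)
print(tweet["created_at"])
```

After the first pass, `inner` is in the plain form that a per-line JSON reader can handle.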

In [33]:
twitterDF= sqlCtx.read.json("src/ds_twitter_1_noquote.json")

In [30]:
twitterDF.printSchema()

root
 |-- contributors: string (nullable = true)
 |-- coordinates: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- entities: struct (nullable = true)
 |    |-- hashtags: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- symbols: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- urls: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- user_mentions: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |-- favorite_count: long (nullable = true)
 |-- favorited: boolean (nullable = true)
 |-- geo: string (nullable = true)
 |-- id: long (nullable = true)
 |-- id_str: string (nullable = true)
 |-- in_reply_to_screen_name: string (nullable = true)
 |-- in_reply_to_status_id: string (nullable = true)
 |-- in_reply_to_status_id_str: string (nullable = true)
 |-- in_reply_to_user_id: string (nullable = true)
 |-- in_reply_to_user_id_str: s

In [31]:
twitterDF.select('text').show()

+---------------+
|           text|
+---------------+
|Hello 21 160924|
+---------------+



In [10]:
twitterDF.registerTempTable("twitter")
sqlCtx.sql("select text from twitter").show()

+---------------+
|           text|
+---------------+
|Hello 21 160924|
+---------------+



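### JSON from a URL

* Data read from a URL arrives as text (for example, iterating with r.iter_lines() yields raw text pieces, not parsed records).
* Parsing the response as JSON with r.json() works.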

In [15]:
import requests
r=requests.get("https://raw.githubusercontent.com/jokecamp/FootballData/master/World%20Cups/all-world-cup-players.json")

In [18]:
wc=r.json()

In [19]:
type(wc)

list

* Do the records need to be converted to Row objects first? A line-by-line alternative:
    ```
    df = sqlContext.createDataFrame([json.loads(line) for line in r.iter_lines()])
    ```

In [39]:
wcDF=sqlCtx.createDataFrame(wc)

In [40]:
wcDF.printSchema()

root
 |-- Club: string (nullable = true)
 |-- ClubCountry: string (nullable = true)
 |-- Competition: string (nullable = true)
 |-- DateOfBirth: string (nullable = true)
 |-- FullName: string (nullable = true)
 |-- IsCaptain: boolean (nullable = true)
 |-- Number: string (nullable = true)
 |-- Position: string (nullable = true)
 |-- Team: string (nullable = true)
 |-- Year: long (nullable = true)



In [44]:
wcDF.registerTempTable("wc")
sqlCtx.sql("select Club,Team,Year from wc").show(1)

+--------------------+---------+----+
|                Club|     Team|Year|
+--------------------+---------+----+
| Club Atlético Ta...|Argentina|1930|
+--------------------+---------+----+
only showing top 1 row



* baby names

In [82]:
import json
import requests
_url="https://health.data.ny.gov/api/views/jxy9-yhdk/rows.json?accessType=DOWNLOAD"
_json=requests.get(_url).json()

* The JSON payload is split into two parts, meta and data.
* data holds 52252 records.

In [52]:
_json.keys()

[u'meta', u'data']

In [53]:
_jsonList=_json['data']
print len(_jsonList)

52252


In [54]:
_json['data'][0]

[1,
 u'5DC7F285-052B-4739-8DC3-62827014A4CD',
 1,
 1425450997,
 u'714909',
 1425450997,
 u'714909',
 u'{\n}',
 u'2013',
 u'GAVIN',
 u'ST LAWRENCE',
 u'M',
 u'9']

* list to Spark DataFrame
    * If no schema is given, the DataFrame is created with auto-generated column names.

In [55]:
_df=sqlCtx.createDataFrame(_json['data'])
_df.count()

52252

* No schema was specified, so auto-generated column names (_1, _2, ...) are used.

In [57]:
_df.printSchema()

root
 |-- _1: long (nullable = true)
 |-- _2: string (nullable = true)
 |-- _3: long (nullable = true)
 |-- _4: long (nullable = true)
 |-- _5: string (nullable = true)
 |-- _6: long (nullable = true)
 |-- _7: string (nullable = true)
 |-- _8: string (nullable = true)
 |-- _9: string (nullable = true)
 |-- _10: string (nullable = true)
 |-- _11: string (nullable = true)
 |-- _12: string (nullable = true)
 |-- _13: string (nullable = true)
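
Positional names such as _10 are hard to read. A common remedy is to attach column names to the positional values. The plain-Python sketch below does that for the first record shown earlier and runs without Spark. The five names are hypothetical guesses based on the values (2013, GAVIN, ST LAWRENCE, M, 9) and are not taken from the dataset's meta section.

```python
# First record of _json['data'], copied from the output above
row = [1, '5DC7F285-052B-4739-8DC3-62827014A4CD', 1, 1425450997, '714909',
       1425450997, '714909', '{\n}', '2013', 'GAVIN', 'ST LAWRENCE', 'M', '9']

# Hypothetical names for the last five fields (assumption, not from the source)
names = ["year", "first_name", "county", "sex", "count"]

# Keep only the trailing payload fields and pair them with the names
record = dict(zip(names, row[-5:]))

# The numeric fields arrive as strings, so convert them explicitly
record["year"] = int(record["year"])
record["count"] = int(record["count"])
print(record)
```

In Spark itself, a list of column names can be passed as the second argument of createDataFrame, as the later examples in these notes do with ["name", "item"].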



In [59]:
_df.filter(_df['_10'] == u'GAVIN').show(2)

+---+--------------------+---+----------+------+----------+------+---+----+-----+-----------+---+---+
| _1|                  _2| _3|        _4|    _5|        _6|    _7| _8|  _9|  _10|        _11|_12|_13|
+---+--------------------+---+----------+------+----------+------+---+----+-----+-----------+---+---+
|  1|5DC7F285-052B-473...|  1|1425450997|714909|1425450997|714909|{
}|2013|GAVIN|ST LAWRENCE|  M|  9|
| 82|43E9414D-9BE0-456...| 82|1425450997|714909|1425450997|714909|{
}|2013|GAVIN|    SUFFOLK|  M| 54|
+---+--------------------+---+----------+------+----------+------+---+----+-----+-----------+---+---+
only showing top 2 rows



* Trying select
    * pivot table?

In [61]:
_df.registerTempTable("babyNames")
sqlCtx.sql("select distinct(_10) from babyNames").show(5)

+-------+
|    _10|
+-------+
|  HENRY|
| LESLIE|
|  ALICE|
|MIRANDA|
|    EVA|
+-------+
only showing top 5 rows



### read from text

In [None]:
# %load /home/jsl/Downloads/spark-1.6.0-bin-hadoop2.6/examples/src/main/resources/people.txt
Michael, 29
Andy, 30
Justin, 19


In [63]:
from pyspark.sql import Row
lines = sc.textFile("/home/jsl/Downloads/spark-1.6.0-bin-hadoop2.6/examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1].strip())))

schemaPeople = sqlCtx.createDataFrame(people)
schemaPeople.registerTempTable("people")

In [64]:
# SQL can be run over DataFrames that have been registered as a table.
teenagers = sqlCtx.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

In [65]:
# The results of SQL queries are RDDs and support all the normal RDD operations.
teenNames = teenagers.map(lambda p: "Name: " + p.name)
for teenName in teenNames.collect():
  print(teenName)

Name: Justin


In [None]:
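from pyspark.sql.types import StructType, StructField, StringType

schemaString = "name age"

# Every field is declared as a nullable string
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)

# Keep both columns as strings so the rows match the all-string schema
peopleStr = parts.map(lambda p: (p[0], p[1].strip()))
sqlCtx.createDataFrame(peopleStr, schema)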

### csv

In [None]:
%%writefile data/ds_spark.csv
1,2,3,4
11,22,33,44
111,222,333,444

In [None]:
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv')\
    .options(header='true', inferschema='true').load('data/ds_spark.csv')

df.show()

df.withColumnRenamed('1','label')

## Problem S-6: Spark SQL with the Uber CSV

https://github.com/tmcgrath/spark-with-python-course/blob/master/Spark-SQL-CSV-with-Python.ipynb



* fivethirtyeight
    * git clone https://github.com/fivethirtyeight/uber-tlc-foil-response.git
        daily Uber trip statistics in January and February 2015
        ```
        dispatching_base_number	date	active_vehicles	trips
        B02512	1/1/2015	190	1132
        B02765	1/1/2015	225	1765
        ```

In [14]:
data_home=os.path.join(home,"Code/git/else/uber-tlc-foil-response")
filePath=os.path.join(data_home,"Uber-Jan-Feb-FOIL.csv")

_fub = sc.textFile(filePath)

In [15]:
type(_fub)
_fub.count()
_fub.first()

pyspark.rdd.RDD

* CSV is a comma-separated format, so split each line on ','.
* Extract the key values from the first column (the header value is included).

In [None]:
_dub = _fub.map(lambda line: line.split(","))

type(_dub)

_row0keys=_dub.map(lambda row: row[0]).distinct().collect()

print _row0keys

_dub.filter(lambda row: "B02512" in row).count()

* For base B02512, collect the records whose trips value is greater than 2000.

In [None]:
_dub.filter(lambda row: "B02512" in row).filter(lambda row: int(row[3])>2000).collect()

* The header line holds the attribute names. Excluding it gives the total line count minus 1.

In [None]:
_noheader = _fub.filter(lambda line: "base" not in line).map(lambda line:line.split(","))
_noheader.count()

* reduceByKey merges the values of each key; for the pairs below the result is ("a", 3), ("b", 2)
```
("a", 1)
("b", 1)
("a", 1)
("a", 1)
("b", 1)
```
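
To make the reduceByKey behaviour concrete, here is a minimal plain-Python sketch that groups the pairs listed above by key and folds each group with a two-argument function. The helper `reduce_by_key` is a hypothetical stand-in that runs without Spark; the real operation also combines values within each partition before shuffling.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, func):
    """Group values by key, then fold each group with func(a, b)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}


pairs = [("a", 1), ("b", 1), ("a", 1), ("a", 1), ("b", 1)]

# Both lambda arguments are values of the same key, never the key itself
counts = reduce_by_key(pairs, lambda a, b: a + b)
print(counts)
```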

In [None]:
# Sum trips per base; both lambda arguments are values (trip counts) of the same key
_noheader.map(lambda x: (x[0], int(x[3]))).reduceByKey(lambda a, b: a + b).collect()

* saving
    ```
    rddOfStrings.saveAsTextFile("out.txt")
    ```

## S.6: DataFrame

http://www.cs.sfu.ca/CourseCentral/732/ggbaker/spark-sql.html


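* A DataFrame is like a database table.
    * It can be used as input data for MLlib.
        * Input data can be either 1) Spark RDDs or 2) DataFrames.
        * DataFrames (similar in concept to pandas DataFrames) are the default: since Spark 2.0 the DataFrame-based API (spark.ml) is the primary machine learning API, and the RDD-based API is in maintenance mode.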

Concept | Description | Example
----------|----------|----------
DataFrame | Holds text, feature vectors, true labels, and predictions. |
Transformer | Transforms one DataFrame into another DataFrame. | Transformer.transform()
Estimator | Is fit on a DataFrame to produce a Transformer. | Estimator.fit()
Pipeline | Chains multiple Transformers and Estimators together. |
Parameter | A common API for specifying parameters. | ParamMap

* Operations

Operation | Example
-------|-------
read JSON | sqlContext.read.json("employee.json")
view data | dfs.show()
schema | dfs.printSchema()
select | dfs.select("name").show()
filter | dfs.filter(dfs["age"] > 23).show()
groupBy | dfs.groupBy("age").count().show()

* Spark DataFrame vs. pandas

DataFrame | Spark | pandas
-------|-------|-------
csv file | map split(',') | read_csv()
view rows | show() | head(), tail()
data types | all strings (after split) | inferred

In [3]:
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)

In [48]:
from pyspark.sql import Row
Person = Row('name', 'height')
rows = [Person('kim', 170), Person('lee', 175), Person('lim', 180),]
rowsRdd = sc.parallelize(rows)  # needed by type(rowsRdd) below
#rowsDf = sqlCtx.createDataFrame(rowsRdd)

rowsDF=sqlCtx.createDataFrame(rows)

In [47]:
type(rows)

list

In [36]:
type(rowsRdd)

pyspark.rdd.RDD

In [49]:
type(rowsDF)

pyspark.sql.dataframe.DataFrame

In [50]:
rowsDF.printSchema()

root
 |-- name: string (nullable = true)
 |-- height: long (nullable = true)



In [52]:
rowsDF.where(rowsDF.height < 175)\
    .select([rowsDF.name, rowsDF.height]).show()

+----+------+
|name|height|
+----+------+
| kim|   170|
+----+------+



In [53]:
rowsDF.groupby(rowsDF.height).max().show()

+------+-----------+
|height|max(height)|
+------+-----------+
|   170|        170|
|   175|        175|
|   180|        180|
+------+-----------+



## S.7 Hello Statistics

* mllib.stat.Statistics
* Basic statistics
* Hypothesis testing
* Correlation

## Problem S-7: Kolmogorov-Smirnov test

In [54]:
from pyspark.mllib.stat import Statistics

parallelData = sc.parallelize([1.0, 2.0, 5.0, 4.0, 3.0, 3.3, 5.5])

# run a KS test for the sample versus a standard normal distribution
testResult = Statistics.kolmogorovSmirnovTest(parallelData, "norm", 0, 1)
print(testResult)

Kolmogorov-Smirnov test summary:
degrees of freedom = 0 
statistic = 0.841344746068543 
pValue = 5.06089025353873E-6 
Very strong presumption against null hypothesis: Sample follows theoretical distribution.
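
The reported statistic can be checked by hand. The Kolmogorov-Smirnov statistic is the largest gap between the empirical distribution of the sample and the theoretical CDF, here the standard normal. The sketch below is a plain-Python check that runs without Spark; the helper names are illustrative and it does not compute the p-value.

```python
import math


def standard_normal_cdf(x):
    """CDF of the normal distribution with mean 0 and standard deviation 1."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ks_statistic(sample, cdf):
    """Largest distance between the empirical CDF and the given CDF."""
    xs = sorted(sample)
    n = float(len(xs))
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # compare against the empirical CDF just before and just after x
        d = max(d, f - (i - 1) / n, i / n - f)
    return d


sample = [1.0, 2.0, 5.0, 4.0, 3.0, 3.3, 5.5]
d = ks_statistic(sample, standard_normal_cdf)
print(d)
```

The largest gap occurs at the smallest observation, 1.0, where the normal CDF is already about 0.84 while the empirical CDF is still 0, which agrees with the statistic printed above.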


## Problem S-8: Generating random data

In [55]:
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)

In [56]:
from pyspark.sql.functions import rand, randn
 # Create a DataFrame with one int column and 10 rows.
df = sqlCtx.range(0, 10)
df.show()

+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+



In [57]:
df.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal")).show()
df.describe().show()

+---+-------------------+--------------------+
| id|            uniform|              normal|
+---+-------------------+--------------------+
|  0|0.41371264720975787|  0.5888539012978773|
|  1| 0.1982919638208397| 0.06157382353970104|
|  2|0.12030715258495939|  1.0854146699817222|
|  3|0.44292918521277047| -0.4798519469521663|
|  4| 0.8898784253886249| -0.8820294772950535|
|  5| 0.2731073068483362|-0.15116027592854422|
|  6|   0.87079354700073|-0.27674189870783683|
|  7|0.27149331793166864|-0.18575112254167045|
|  8| 0.6037143578435027|   0.734722467897308|
|  9| 0.1435668838975337|-0.30123700668427145|
+---+-------------------+--------------------+

+-------+------------------+
|summary|                id|
+-------+------------------+
|  count|                10|
|   mean|               4.5|
| stddev|3.0276503540974917|
|    min|                 0|
|    max|                 9|
+-------+------------------+



In [58]:
df = sqlCtx.range(0, 10).withColumn('rand1', rand(seed=10)).withColumn('rand2', rand(seed=27))
print df.stat.corr('rand1', 'rand2')
print df.stat.corr('id', 'id')

-0.15104231217
1.0


In [59]:
names = ["Alice", "Bob", "Mike"]
items = ["milk", "bread", "butter", "apples", "oranges"]
df = sqlCtx.createDataFrame([(names[i % 3], items[i % 5]) for i in range(100)], ["name", "item"])
df.show(10)

+-----+-------+
| name|   item|
+-----+-------+
|Alice|   milk|
|  Bob|  bread|
| Mike| butter|
|Alice| apples|
|  Bob|oranges|
| Mike|   milk|
|Alice|  bread|
|  Bob| butter|
| Mike| apples|
|Alice|oranges|
+-----+-------+
only showing top 10 rows



In [60]:
df = sqlCtx.createDataFrame([(1, 2, 3) if i % 2 == 0 else (i, 2 * i, i % 4) for i in range(100)], ["a", "b", "c"])
# show() prints the table itself and returns None, so it is not wrapped in print
df.show(10)
freq = df.stat.freqItems(["a", "b", "c"], 0.4)

+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  3|
|  1|  2|  1|
|  1|  2|  3|
|  3|  6|  3|
|  1|  2|  3|
|  5| 10|  1|
|  1|  2|  3|
|  7| 14|  3|
|  1|  2|  3|
|  9| 18|  1|
+---+---+---+
only showing top 10 rows


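## Problem S-9: Quantitative data analysis

* Processing quantitative data
* Input formats for machine learning

API | DataFrame API | RDD API
----------|----------|----------
Data type | label, feature vectors | LabeledPoint

### Creating training data

* The data consists of categorical attributes.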

In [26]:
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)

In [5]:
df = sqlCtx.createDataFrame(
    [
        ['No','young', 'false', 'false', 'fair'],
        ['No','young', 'false', 'false', 'good'],
        ['Yes','young', 'true', 'false', 'good'],
        ['Yes','young', 'true', 'true', 'fair'],
        ['No','young', 'false', 'false', 'fair'],
        ['No','middle', 'false', 'false', 'fair'],
        ['No','middle', 'false', 'false', 'good'],
        ['Yes','middle', 'true', 'true', 'good'],
        ['Yes','middle', 'false', 'true', 'excellent'],
        ['Yes','middle', 'false', 'true', 'excellent'],
        ['Yes','old', 'false', 'true', 'excellent'],
        ['Yes','old', 'false', 'true', 'good'],
        ['Yes','old', 'true', 'false', 'good'],
        ['Yes','old', 'true', 'false', 'excellent'],
        ['No','old', 'false', 'false', 'fair'],
    ],
    ['cls','age','f1','f2','f3']
)

In [6]:
df.printSchema()

root
 |-- cls: string (nullable = true)
 |-- age: string (nullable = true)
 |-- f1: string (nullable = true)
 |-- f2: string (nullable = true)
 |-- f3: string (nullable = true)



* The indexed column is deliberately named labels here so that a column rename can be shown; it is renamed to label later.

In [7]:
from pyspark.ml.feature import StringIndexer
labelIndexer = StringIndexer(inputCol="cls", outputCol="labels")
model=labelIndexer.fit(df)

In [8]:
df1=model.transform(df)

In [9]:
df1.printSchema()

root
 |-- cls: string (nullable = true)
 |-- age: string (nullable = true)
 |-- f1: string (nullable = true)
 |-- f2: string (nullable = true)
 |-- f3: string (nullable = true)
 |-- labels: double (nullable = true)



In [10]:
df1.show()

+---+------+-----+-----+---------+------+
|cls|   age|   f1|   f2|       f3|labels|
+---+------+-----+-----+---------+------+
| No| young|false|false|     fair|   1.0|
| No| young|false|false|     good|   1.0|
|Yes| young| true|false|     good|   0.0|
|Yes| young| true| true|     fair|   0.0|
| No| young|false|false|     fair|   1.0|
| No|middle|false|false|     fair|   1.0|
| No|middle|false|false|     good|   1.0|
|Yes|middle| true| true|     good|   0.0|
|Yes|middle|false| true|excellent|   0.0|
|Yes|middle|false| true|excellent|   0.0|
|Yes|   old|false| true|excellent|   0.0|
|Yes|   old|false| true|     good|   0.0|
|Yes|   old| true|false|     good|   0.0|
|Yes|   old| true|false|excellent|   0.0|
| No|   old|false|false|     fair|   1.0|
+---+------+-----+-----+---------+------+
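
The reason Yes becomes 0.0 and No becomes 1.0 is that StringIndexer, by default, assigns indices in order of descending label frequency. The plain-Python sketch below reproduces that ordering for the cls column and runs without Spark. The helper `fit_string_indexer` is hypothetical, and its alphabetical tie-breaking is an assumption that does not matter here because the two counts differ.

```python
from collections import Counter


def fit_string_indexer(values):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(values)
    # ties are broken alphabetically in this sketch (assumption)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: float(i) for i, label in enumerate(ordered)}


# The cls column of the training data, in row order
cls = ['No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes',
       'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']

mapping = fit_string_indexer(cls)
labels = [mapping[c] for c in cls]
print(mapping)
print(labels[:5])
```

Yes occurs 9 times and No 6 times, so Yes receives index 0.0.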



In [11]:
labelIndexer = StringIndexer(inputCol="age", outputCol="att1")
model=labelIndexer.fit(df1)
df2=model.transform(df1)

In [12]:
labelIndexer = StringIndexer(inputCol="f1", outputCol="att2")
model=labelIndexer.fit(df2)
df3=model.transform(df2)

In [13]:
labelIndexer = StringIndexer(inputCol="f2", outputCol="att3")
model=labelIndexer.fit(df3)
df4=model.transform(df3)

In [14]:
labelIndexer = StringIndexer(inputCol="f3", outputCol="att4")
model=labelIndexer.fit(df4)
df5=model.transform(df4)

In [15]:
df5.printSchema()

root
 |-- cls: string (nullable = true)
 |-- age: string (nullable = true)
 |-- f1: string (nullable = true)
 |-- f2: string (nullable = true)
 |-- f3: string (nullable = true)
 |-- labels: double (nullable = true)
 |-- att1: double (nullable = true)
 |-- att2: double (nullable = true)
 |-- att3: double (nullable = true)
 |-- att4: double (nullable = true)



In [16]:
df5.show()

+---+------+-----+-----+---------+------+----+----+----+----+
|cls|   age|   f1|   f2|       f3|labels|att1|att2|att3|att4|
+---+------+-----+-----+---------+------+----+----+----+----+
| No| young|false|false|     fair|   1.0| 0.0| 0.0| 0.0| 1.0|
| No| young|false|false|     good|   1.0| 0.0| 0.0| 0.0| 0.0|
|Yes| young| true|false|     good|   0.0| 0.0| 1.0| 0.0| 0.0|
|Yes| young| true| true|     fair|   0.0| 0.0| 1.0| 1.0| 1.0|
| No| young|false|false|     fair|   1.0| 0.0| 0.0| 0.0| 1.0|
| No|middle|false|false|     fair|   1.0| 1.0| 0.0| 0.0| 1.0|
| No|middle|false|false|     good|   1.0| 1.0| 0.0| 0.0| 0.0|
|Yes|middle| true| true|     good|   0.0| 1.0| 1.0| 1.0| 0.0|
|Yes|middle|false| true|excellent|   0.0| 1.0| 0.0| 1.0| 2.0|
|Yes|middle|false| true|excellent|   0.0| 1.0| 0.0| 1.0| 2.0|
|Yes|   old|false| true|excellent|   0.0| 2.0| 0.0| 1.0| 2.0|
|Yes|   old|false| true|     good|   0.0| 2.0| 0.0| 1.0| 0.0|
|Yes|   old| true|false|     good|   0.0| 2.0| 1.0| 0.0| 0.0|
|Yes|   

* vector assembler

In [17]:
from pyspark.mllib.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

va = VectorAssembler(inputCols=["att1","att2","att3","att4"],outputCol="features")
df6 = va.transform(df5)

In [18]:
df7=df6.withColumnRenamed('labels','label')

In [19]:
df7.printSchema()

root
 |-- cls: string (nullable = true)
 |-- age: string (nullable = true)
 |-- f1: string (nullable = true)
 |-- f2: string (nullable = true)
 |-- f3: string (nullable = true)
 |-- label: double (nullable = true)
 |-- att1: double (nullable = true)
 |-- att2: double (nullable = true)
 |-- att3: double (nullable = true)
 |-- att4: double (nullable = true)
 |-- features: vector (nullable = true)



In [20]:
trainDf=df7.select('label','features')

In [21]:
trainDf.printSchema()

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)



In [222]:
trainDf.show()

+-----+-----------------+
|label|         features|
+-----+-----------------+
|  1.0|    (4,[3],[1.0])|
|  1.0|        (4,[],[])|
|  0.0|    (4,[1],[1.0])|
|  0.0|[0.0,1.0,1.0,1.0]|
|  1.0|    (4,[3],[1.0])|
|  1.0|[1.0,0.0,0.0,1.0]|
|  1.0|    (4,[0],[1.0])|
|  0.0|[1.0,1.0,1.0,0.0]|
|  0.0|[1.0,0.0,1.0,2.0]|
|  0.0|[1.0,0.0,1.0,2.0]|
|  0.0|[2.0,0.0,1.0,2.0]|
|  0.0|[2.0,0.0,1.0,0.0]|
|  0.0|[2.0,1.0,0.0,0.0]|
|  0.0|[2.0,1.0,0.0,2.0]|
|  1.0|[2.0,0.0,0.0,1.0]|
+-----+-----------------+
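
The features column mixes two notations. A value such as (4,[3],[1.0]) is a sparse vector written as (size, indices, values), while [0.0,1.0,1.0,1.0] is a dense vector. As far as I know, the assembler picks whichever form is more compact for each row. The following plain-Python sketch expands the sparse notation so the two can be compared; `sparse_to_dense` is a hypothetical helper and runs without Spark.

```python
def sparse_to_dense(size, indices, values):
    """Expand (size, indices, values) into a plain list of floats."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense


# First row: att1..att4 = 0, 0, 0, 1
print(sparse_to_dense(4, [3], [1.0]))
# Second row: every attribute is 0
print(sparse_to_dense(4, [], []))
```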



* An RDD of LabeledPoint is built by selecting columns from the DataFrame.

In [22]:
from pyspark.mllib.regression import LabeledPoint
trainRdd = trainDf.map(lambda row: LabeledPoint(row.label,row.features))

In [23]:
trainRdd.first()

LabeledPoint(1.0, (4,[3],[1.0]))

* udf
    * Takes a function and a return type (StringType is the default when none is given)
* How does the result look with show()?

In [35]:
from pyspark.sql.functions import udf
from pyspark.mllib.linalg import Vectors,VectorUDT
myudf=udf(lambda x: Vectors.dense(x), VectorUDT())
_trainDf2=trainDf.withColumn('dvf',myudf(trainDf.features))

In [36]:
_trainDf2.printSchema()

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- dvf: vector (nullable = true)



### machine learning test

In [172]:
trainDf.show()

+-----+-----------------+
|label|         features|
+-----+-----------------+
|  1.0|    (4,[3],[1.0])|
|  1.0|        (4,[],[])|
|  0.0|    (4,[1],[1.0])|
|  0.0|[0.0,1.0,1.0,1.0]|
|  1.0|    (4,[3],[1.0])|
|  1.0|[1.0,0.0,0.0,1.0]|
|  1.0|    (4,[0],[1.0])|
|  0.0|[1.0,1.0,1.0,0.0]|
|  0.0|[1.0,0.0,1.0,2.0]|
|  0.0|[1.0,0.0,1.0,2.0]|
|  0.0|[2.0,0.0,1.0,2.0]|
|  0.0|[2.0,0.0,1.0,0.0]|
|  0.0|[2.0,1.0,0.0,0.0]|
|  0.0|[2.0,1.0,0.0,2.0]|
|  1.0|[2.0,0.0,0.0,1.0]|
+-----+-----------------+



* pyspark.ml.classification.LogisticRegression
    * Binary classification only (as of the Spark 1.6 release used here)

In [24]:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(maxIter=10, regParam=0.01)
model1 = lr.fit(trainDf)
print model1.coefficients
print model1.intercept

[-0.50705810019,-5.31916107407,-5.04694958332,-0.351455356638]
3.2822908185


In [26]:
from pyspark.sql import Row
test0 = sc.parallelize([Row(features=Vectors.dense(2,0,0,1))]).toDF()
result = model1.transform(test0).head()

In [27]:
result.prediction

1.0
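
The prediction can be reproduced by hand from the printed coefficients and intercept: a logistic regression model computes the sigmoid of intercept + w·x and predicts 1.0 when the probability exceeds 0.5, the default threshold. This is a minimal plain-Python check that runs without Spark. The numbers are copied from the output above and the function name is illustrative.

```python
import math

# Values copied from model1.coefficients and model1.intercept
coefficients = [-0.50705810019, -5.31916107407, -5.04694958332, -0.351455356638]
intercept = 3.2822908185


def predict_probability(features):
    """Sigmoid of the linear score intercept + w . x."""
    z = intercept + sum(w * x for w, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


# Same test vector as test0
p = predict_probability([2.0, 0.0, 0.0, 1.0])
prediction = 1.0 if p > 0.5 else 0.0
print(p)
print(prediction)
```

The linear score is about 1.92, so the probability of class 1.0 is roughly 0.87.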

* mllib - lr test

In [48]:
from pyspark.mllib.classification import LogisticRegressionWithSGD
lrm = LogisticRegressionWithSGD.train(trainRdd, iterations=10)

In [49]:
lrm.predict([1.0,0.0,1.1,1.2])

0

* mllib - SVM (apparently not available in spark.ml in this version)

In [177]:
from pyspark.mllib.classification import SVMWithSGD
svm = SVMWithSGD.train(trainRdd, iterations=10)

In [178]:
svm.predict([1.0,0.0,1.1,1.2])

0

* ml - decision tree with a DataFrame pipeline

In [180]:
from pyspark import SparkContext, SQLContext
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
# VectorAssembler is used below, so it is imported here as well
from pyspark.ml.feature import StringIndexer, VectorIndexer, VectorAssembler
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

li1 = StringIndexer(inputCol="cls", outputCol="_label")
li2 = StringIndexer(inputCol="age", outputCol="att1")
li3 = StringIndexer(inputCol="f1", outputCol="att2")
li4 = StringIndexer(inputCol="f2", outputCol="att3")
li5 = StringIndexer(inputCol="f3", outputCol="att4")
va = VectorAssembler(inputCols=["att1","att2","att3","att4"],outputCol="_features")

dt = DecisionTreeClassifier(labelCol="_label", featuresCol="_features")
pipeline = Pipeline(stages=[li1,li2,li3,li4,li5,va,dt])
model = pipeline.fit(df)
predictions = model.transform(df)

# Select example rows to display.
predictions.select("prediction", "_label", "_features").show()

+----------+------+-----------------+
|prediction|_label|        _features|
+----------+------+-----------------+
|       1.0|   1.0|    (4,[3],[1.0])|
|       1.0|   1.0|        (4,[],[])|
|       0.0|   0.0|    (4,[1],[1.0])|
|       0.0|   0.0|[0.0,1.0,1.0,1.0]|
|       1.0|   1.0|    (4,[3],[1.0])|
|       1.0|   1.0|[1.0,0.0,0.0,1.0]|
|       1.0|   1.0|    (4,[0],[1.0])|
|       0.0|   0.0|[1.0,1.0,1.0,0.0]|
|       0.0|   0.0|[1.0,0.0,1.0,2.0]|
|       0.0|   0.0|[1.0,0.0,1.0,2.0]|
|       0.0|   0.0|[2.0,0.0,1.0,2.0]|
|       0.0|   0.0|[2.0,0.0,1.0,0.0]|
|       0.0|   0.0|[2.0,1.0,0.0,0.0]|
|       0.0|   0.0|[2.0,1.0,0.0,2.0]|
|       1.0|   1.0|[2.0,0.0,0.0,1.0]|
+----------+------+-----------------+



* ml - naive Bayes
    * labeled point

In [51]:
from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")
model = nb.fit(trainDf)

In [52]:
print model.pi
print model.theta

[-0.530628251062,-0.887303195001]
DenseMatrix([[-1.07044141, -1.76358859, -1.60943791, -1.25276297],
             [-0.87546874, -2.48490665, -2.48490665, -0.87546874]])


In [55]:
r=model.transform(test0)

In [53]:
from pyspark.sql import Row
test0 = sc.parallelize([Row(features=Vectors.dense([1.0,0.0,1.1,1.2]))]).toDF()
result = model.transform(test0).head()
result.prediction

0.0
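
The same prediction follows from model.pi and model.theta. In the multinomial model, pi holds the log class priors and theta the log feature likelihoods, so the score of each class is pi[c] + theta[c]·x and the class with the highest score wins. The sketch below is a plain-Python check that runs without Spark, with values copied from the output above and an illustrative function name.

```python
# Values copied from model.pi and model.theta
pi = [-0.530628251062, -0.887303195001]
theta = [
    [-1.07044141, -1.76358859, -1.60943791, -1.25276297],
    [-0.87546874, -2.48490665, -2.48490665, -0.87546874],
]


def class_scores(features):
    """Log-scale score per class: log prior plus weighted log likelihoods."""
    return [
        prior + sum(t * x for t, x in zip(row, features))
        for prior, row in zip(pi, theta)
    ]


# Same test vector as test0
scores = class_scores([1.0, 0.0, 1.1, 1.2])
prediction = float(scores.index(max(scores)))
print(scores)
print(prediction)
```

Class 0 scores about -4.87 against about -5.55 for class 1, hence the prediction 0.0.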

* mllib - naive Bayes (sample data shipped with Spark)

In [6]:
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

def parseLine(line):
    # Each line looks like "label,f1 f2 f3"
    parts = line.split(',')
    label = float(parts[0])
    features = Vectors.dense([float(x) for x in parts[1].split(' ')])
    return LabeledPoint(label, features)

data = sc.textFile('/home/jsl/Downloads/spark-1.6.0-bin-hadoop2.6/data/mllib/sample_naive_bayes_data.txt').map(parseLine)

# Split data approximately into training (60%) and test (40%)
training, test = data.randomSplit([0.6, 0.4], seed=0)

In [8]:
model = NaiveBayes.train(training, 1.0)

In [None]:
* The data prepared above

* Training fails if trainRdd contains negative values: multinomial naive Bayes requires nonnegative feature values.

## Problem S-10: Text analysis

* Bag-of-words representation
    * A document is made up of words.
    * Word order carries no meaning.

Term | Description | Example
----------|----------|----------
corpus | the set of documents | "why she had to go", "where she have to go"
document | one record | "why she had to go"
vocabulary | the set of distinct words | "why","she","had","to","go","where","have"
word vector | binary flags marking whether each vocabulary word is present | [1,1,1,1,1,0,0],[0,1,0,1,1,1,1]
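
The table can be reproduced in a few lines. This plain-Python sketch builds the vocabulary in order of first appearance and then marks, for each document, which vocabulary words it contains. It runs without Spark and the variable names are illustrative.

```python
corpus = ["why she had to go", "where she have to go"]

# Vocabulary: distinct words in order of first appearance
vocabulary = []
for document in corpus:
    for word in document.split():
        if word not in vocabulary:
            vocabulary.append(word)

# Word vector: 1 if the vocabulary word occurs in the document, else 0
vectors = []
for document in corpus:
    present = set(document.split())
    vectors.append([1 if word in present else 0 for word in vocabulary])

print(vocabulary)
print(vectors)
```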

* Building word vectors
    * Tokenizer
    * HashingTF
    ```
    hashingTF=HashingTF()
    tf=hashingTF.transform(_rdd)
    ```

    * TF-IDF Term Frequency-Inverse Document Frequency
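
To show what the two steps compute, here is a minimal plain-Python sketch of the hashing trick followed by IDF weighting, runnable without Spark. Several parts are assumptions: `zlib.crc32` stands in for Spark's own hash function, the bucket count is kept tiny (the vectors shown earlier use 1048576), and the helper names are hypothetical. The IDF formula log((m + 1) / (df + 1)) is the one documented for MLlib.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 16  # tiny on purpose, for readability


def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Term frequencies keyed by bucket index (hash of the term modulo size)."""
    counts = Counter()
    for token in tokens:
        # stand-in hash (assumption); Spark uses a different hash function
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] += 1.0
    return dict(counts)


def idf_weights(tf_vectors):
    """IDF per bucket: log((m + 1) / (df + 1)), m = number of documents."""
    m = len(tf_vectors)
    document_frequency = Counter()
    for vector in tf_vectors:
        document_frequency.update(vector.keys())
    return {i: math.log((m + 1.0) / (df + 1.0)) for i, df in document_frequency.items()}


documents = ["why she had to go".split(), "where she have to go".split()]
tf = [hashing_tf(tokens) for tokens in documents]
idf = idf_weights(tf)
tfidf = [{i: count * idf[i] for i, count in vector.items()} for vector in tf]
print(tfidf)
```

Terms that occur in every document, such as "she", get an IDF of 0 and drop out of the weighted vector, while terms unique to one document keep a positive weight.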


### Creating training data

* The input is given as a string.
    * Array input does not work here.
        ```
        [0,['my','dog','has','flea','problems','help','please']]
        or
        [0,['my dog has flea problems. help please.']]
        ```
    * Korean text is given as unicode strings.

In [66]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, RegexTokenizer
from pyspark.sql import SQLContext

sqlCtx = SQLContext(sc)
df = sqlCtx.createDataFrame(
    [
        [0,'my dog has flea problems. help please.'],
        [1,'maybe not take him to dog park stupid'],
        [0,'my dalmation is so cute. I love him'],
        [1,'stop posting stupid worthless garbage'],
        [0,'mr licks ate my steak how to stop him'],
        [1,'quit buying worthless dog food stupid'],
        [0,u'우리 강아지 벌레 있어요 도와주세요'],
        [0,u'우리 강아지 귀여워 너 사랑해'],
        [1,u'강아지 공원 가지마 바보같이'],
        [1,u'강아지 음식 구매 마세요 바보같이']
    ],
    ['cls','sent']
)

In [54]:
df.printSchema()

root
 |-- cls: long (nullable = true)
 |-- sent: string (nullable = true)



### tokenizer

* transform returns a new DataFrame with an extra column appended.
    * cls, sent + words

In [60]:
tokenizer = Tokenizer(inputCol="sent", outputCol="words")

In [67]:
tokDf = tokenizer.transform(df)

* words was created as an array whose elements are strings.

In [68]:
tokDf.printSchema()

root
 |-- cls: long (nullable = true)
 |-- sent: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)



In [69]:
for r in tokDf.select("cls", "sent").take(3):
    print(r)

Row(cls=0, sent=u'my dog has flea problems. help please.')
Row(cls=1, sent=u'maybe not take him to dog park stupid')
Row(cls=0, sent=u'my dalmation is so cute. I love him')


In [70]:
tokDf.show()

+---+--------------------+--------------------+
|cls|                sent|               words|
+---+--------------------+--------------------+
|  0|my dog has flea p...|[my, dog, has, fl...|
|  1|maybe not take hi...|[maybe, not, take...|
|  0|my dalmation is s...|[my, dalmation, i...|
|  1|stop posting stup...|[stop, posting, s...|
|  0|mr licks ate my s...|[mr, licks, ate, ...|
|  1|quit buying worth...|[quit, buying, wo...|
|  0| 우리 강아지 벌레 있어요 도와주세요|[우리, 강아지, 벌레, 있어요...|
|  0|    우리 강아지 귀여워 너 사랑해|[우리, 강아지, 귀여워, 너,...|
|  1|     강아지 공원 가지마 바보같이|[강아지, 공원, 가지마, 바보같이]|
|  1|  강아지 음식 구매 마세요 바보같이|[강아지, 음식, 구매, 마세요...|
+---+--------------------+--------------------+



### RegexTokenizer

* The \w pattern does not match Korean (Hangul) characters.
* Split on the whitespace pattern \s instead.

In [72]:
regexTok = RegexTokenizer(inputCol="sent", outputCol="wordsReg", pattern="\\s+")
regDf=regexTok.transform(df)
regDf.show()

+---+--------------------+--------------------+
|cls|                sent|            wordsReg|
+---+--------------------+--------------------+
|  0|my dog has flea p...|[my, dog, has, fl...|
|  1|maybe not take hi...|[maybe, not, take...|
|  0|my dalmation is s...|[my, dalmation, i...|
|  1|stop posting stup...|[stop, posting, s...|
|  0|mr licks ate my s...|[mr, licks, ate, ...|
|  1|quit buying worth...|[quit, buying, wo...|
|  0| 우리 강아지 벌레 있어요 도와주세요|[우리, 강아지, 벌레, 있어요...|
|  0|    우리 강아지 귀여워 너 사랑해|[우리, 강아지, 귀여워, 너,...|
|  1|     강아지 공원 가지마 바보같이|[강아지, 공원, 가지마, 바보같이]|
|  1|  강아지 음식 구매 마세요 바보같이|[강아지, 음식, 구매, 마세요...|
+---+--------------------+--------------------+
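
The difference between the two patterns can be checked with Python's own `re` module. This sketch assumes that the regex engine behind RegexTokenizer treats `\w` as ASCII-only by default, which is emulated here with the `re.ASCII` flag (Python 3 would otherwise match Hangul with `\w`).

```python
import re

sentence = u"우리 강아지 귀여워"

# ASCII-only \w (re.ASCII) mimics an engine whose \w covers [a-zA-Z0-9_] only,
# so no Korean word is found.
ascii_words = re.findall(r"\w+", sentence, flags=re.ASCII)

# Splitting on runs of whitespace works regardless of the script.
space_tokens = re.split(r"\s+", sentence.strip())

print(ascii_words)
print(space_tokens)
```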



### Stop words


In [82]:
from pyspark.ml.feature import StopWordsRemover
stop = StopWordsRemover(inputCol="words", outputCol="nostops")

* stop words

In [83]:
stopwords = list(stop.getStopWords())  # copy the default English stop words

In [84]:
stopwords.extend([u"나", u"너", u"우리"])  # add Korean stop words

In [85]:
stop.setStopWords(stopwords)

StopWordsRemover_4a9da39097e84c36b177

In [86]:
for e in stop.getStopWords():
    print e,

a about above across after afterwards again against all almost alone along already also although always am among amongst amoungst amount an and another any anyhow anyone anything anyway anywhere are around as at back be became because become becomes becoming been before beforehand behind being below beside besides between beyond bill both bottom but by call can cannot cant co con could couldnt cry de describe detail do done down due during each eg eight either eleven else elsewhere empty enough etc even ever every everyone everything everywhere except few fifteen fify fill find fire first five for former formerly forty found four from front full further get give go had has hasnt have he hence her here hereafter hereby herein hereupon hers herself him himself his how however hundred i ie if in inc indeed interest into is it its itself keep last latter latterly least less ltd made many may me meanwhile might mill mine more moreover most mostly move much must my myself name namely neither

* The Korean stop words '너' and '우리' have been removed.

In [87]:
stopDf=stop.transform(tokDf)
stopDf.show()

+---+--------------------+--------------------+--------------------+
|cls|                sent|               words|             nostops|
+---+--------------------+--------------------+--------------------+
|  0|my dog has flea p...|[my, dog, has, fl...|[dog, flea, probl...|
|  1|maybe not take hi...|[maybe, not, take...|[maybe, dog, park...|
|  0|my dalmation is s...|[my, dalmation, i...|[dalmation, cute....|
|  1|stop posting stup...|[stop, posting, s...|[stop, posting, s...|
|  0|mr licks ate my s...|[mr, licks, ate, ...|[mr, licks, ate, ...|
|  1|quit buying worth...|[quit, buying, wo...|[quit, buying, wo...|
|  0| 우리 강아지 벌레 있어요 도와주세요|[우리, 강아지, 벌레, 있어요...|[강아지, 벌레, 있어요, 도와...|
|  0|    우리 강아지 귀여워 너 사랑해|[우리, 강아지, 귀여워, 너,...|     [강아지, 귀여워, 사랑해]|
|  1|     강아지 공원 가지마 바보같이|[강아지, 공원, 가지마, 바보같이]|[강아지, 공원, 가지마, 바보같이]|
|  1|  강아지 음식 구매 마세요 바보같이|[강아지, 음식, 구매, 마세요...|[강아지, 음식, 구매, 마세요...|
+---+--------------------+--------------------+--------------------+
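
What the stop-word step does to one row can be sketched in plain Python. The tiny `default_stops` set is only an illustrative subset, not Spark's full default list, and `remove_stops` is a hypothetical helper; the lower-casing assumes case-insensitive matching.

```python
# Illustrative subset of English stop words plus the custom Korean ones.
default_stops = {"my", "has", "is", "so", "i", "him"}
custom_stops = {u"나", u"너", u"우리"}
stop_set = default_stops | custom_stops

def remove_stops(words, stops):
    """Keep only the tokens that are not stop words (case-insensitive)."""
    return [w for w in words if w.lower() not in stops]

tokens = [u"우리", u"강아지", u"귀여워", u"너", u"사랑해"]
filtered = remove_stops(tokens, stop_set)
print(filtered)
```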



### TF-IDF

* Term frequency-inverse document frequency (TF-IDF)
* Apply it after tokenizing.
* TF-IDF is computed in two steps: 1) HashingTF, 2) IDF

* sklearn

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

vocabulary = "a list of words I want to look for in the documents".split()
vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word', 
                 stop_words='english')
vect.fit(vocabulary)
tfidf=vect.transform(vocabulary)

* spark

In [89]:
hashTF = HashingTF(inputCol="words", outputCol="hash", numFeatures=50)
hashDf = hashTF.transform(tokDf)

In [90]:
hashDf.show()

+---+--------------------+--------------------+--------------------+
|cls|                sent|               words|                hash|
+---+--------------------+--------------------+--------------------+
|  0|my dog has flea p...|[my, dog, has, fl...|(50,[0,16,34,35,4...|
|  1|maybe not take hi...|[maybe, not, take...|(50,[0,7,8,17,36,...|
|  0|my dalmation is s...|[my, dalmation, i...|(50,[0,5,8,20,26,...|
|  1|stop posting stup...|[stop, posting, s...|(50,[1,17,18,39,4...|
|  0|mr licks ate my s...|[mr, licks, ate, ...|(50,[0,4,7,8,14,4...|
|  1|quit buying worth...|[quit, buying, wo...|(50,[17,38,39,41,...|
|  0| 우리 강아지 벌레 있어요 도와주세요|[우리, 강아지, 벌레, 있어요...|(50,[16,20,31,34,...|
|  0|    우리 강아지 귀여워 너 사랑해|[우리, 강아지, 귀여워, 너,...|(50,[16,20,30,31,...|
|  1|     강아지 공원 가지마 바보같이|[강아지, 공원, 가지마, 바보같이]|(50,[0,31,33,37],...|
|  1|  강아지 음식 구매 마세요 바보같이|[강아지, 음식, 구매, 마세요...|(50,[5,14,30,31,3...|
+---+--------------------+--------------------+--------------------+



In [91]:
idf = IDF(inputCol="hash", outputCol="idf")
idfModel = idf.fit(hashDf)
idfDf = idfModel.transform(hashDf)

In [92]:
idfDf.select('hash','idf').show()

+--------------------+--------------------+
|                hash|                 idf|
+--------------------+--------------------+
|(50,[0,16,34,35,4...|(50,[0,16,34,35,4...|
|(50,[0,7,8,17,36,...|(50,[0,7,8,17,36,...|
|(50,[0,5,8,20,26,...|(50,[0,5,8,20,26,...|
|(50,[1,17,18,39,4...|(50,[1,17,18,39,4...|
|(50,[0,4,7,8,14,4...|(50,[0,4,7,8,14,4...|
|(50,[17,38,39,41,...|(50,[17,38,39,41,...|
|(50,[16,20,31,34,...|(50,[16,20,31,34,...|
|(50,[16,20,30,31,...|(50,[16,20,30,31,...|
|(50,[0,31,33,37],...|(50,[0,31,33,37],...|
|(50,[5,14,30,31,3...|(50,[5,14,30,31,3...|
+--------------------+--------------------+



In [93]:
idfDf.printSchema()

root
 |-- cls: long (nullable = true)
 |-- sent: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- hash: vector (nullable = true)
 |-- idf: vector (nullable = true)



In [94]:
for e in idfDf.select("cls","hash").take(12):
    print(e)

Row(cls=0, hash=SparseVector(50, {0: 1.0, 16: 2.0, 34: 1.0, 35: 1.0, 44: 1.0, 48: 1.0}))
Row(cls=1, hash=SparseVector(50, {0: 1.0, 7: 1.0, 8: 1.0, 17: 1.0, 36: 1.0, 39: 1.0, 41: 1.0, 44: 1.0}))
Row(cls=0, hash=SparseVector(50, {0: 1.0, 5: 1.0, 8: 2.0, 20: 1.0, 26: 1.0, 29: 1.0, 31: 1.0}))
Row(cls=1, hash=SparseVector(50, {1: 1.0, 17: 1.0, 18: 1.0, 39: 1.0, 44: 1.0}))
Row(cls=0, hash=SparseVector(50, {0: 1.0, 4: 1.0, 7: 1.0, 8: 1.0, 14: 1.0, 43: 1.0, 44: 2.0, 46: 1.0}))
Row(cls=1, hash=SparseVector(50, {17: 1.0, 38: 1.0, 39: 1.0, 41: 1.0, 44: 2.0}))
Row(cls=0, hash=SparseVector(50, {16: 1.0, 20: 1.0, 31: 1.0, 34: 1.0, 38: 1.0}))
Row(cls=0, hash=SparseVector(50, {16: 1.0, 20: 1.0, 30: 1.0, 31: 1.0, 41: 1.0}))
Row(cls=1, hash=SparseVector(50, {0: 1.0, 31: 1.0, 33: 1.0, 37: 1.0}))
Row(cls=1, hash=SparseVector(50, {5: 1.0, 14: 1.0, 30: 1.0, 31: 1.0, 37: 1.0}))
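
To see what the IDF stage adds on top of the raw counts, here is a small plain-Python sketch. It assumes the smoothed form idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) the number of documents containing term t, as described in the Spark MLlib documentation; the helper names and the toy documents are made up for this example.

```python
import math

def idf_weights(docs):
    """idf(t) = log((m + 1) / (df(t) + 1)) for every term in the corpus."""
    m = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return {t: math.log((m + 1.0) / (d + 1.0)) for t, d in df.items()}

def tfidf(doc, idf):
    """Multiply each raw term count by the term's idf weight."""
    tf = {}
    for term in doc:
        tf[term] = tf.get(term, 0) + 1
    return {t: c * idf[t] for t, c in tf.items()}

docs = [["dog", "flea", "dog"], ["dog", "park"], ["cat", "park"]]
idf = idf_weights(docs)
weights = tfidf(docs[0], idf)
print(weights)
```

A term that occurs in every document gets weight 0 under this formula, so it contributes nothing to the final vector.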


### Word2Vec


* Word2Vec
    * see wikipedia https://en.wikipedia.org/wiki/Word2vec

In [95]:
from pyspark.ml.feature import Word2Vec
word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="words", outputCol="w2v")
model = word2Vec.fit(tokDf)
w2vDf = model.transform(tokDf)
for e in w2vDf.select("w2v").take(3):
    print(e)

Row(w2v=DenseVector([-0.0514, 0.0162, 0.0288]))
Row(w2v=DenseVector([0.0043, -0.0537, 0.0231]))
Row(w2v=DenseVector([-0.0225, 0.0084, -0.0055]))


### CountVectorizer

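* Generates word vectors.

* Use it after tokenizing.
* The result is a sparse vector.
* minDF
    * A fractional value is a ratio: (number of documents containing the term) / (total number of documents).
    * An integer value is a document count: the minimum number of documents a term must appear in.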

* sklearn
    * fit_transform -> bag of words
    vectorizer = CountVectorizer(analyzer = "word",   \
                             tokenizer = None,    \
                             preprocessor = None, \
                             stop_words = None,   \
                             max_features = 5000) 

In [71]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game'
]
vectorizer = CountVectorizer()
print vectorizer.fit_transform(corpus).todense()
print vectorizer.vocabulary_

[[1 1 0 1 0 1 0 1]
 [1 1 1 0 1 0 1 0]]
{u'duke': 1, u'basketball': 0, u'lost': 4, u'played': 5, u'game': 2, u'unc': 7, u'in': 3, u'the': 6}


In [97]:
from pyspark.ml.feature import CountVectorizer
cv = CountVectorizer(inputCol="nostops", outputCol="cv", vocabSize=30,minDF=1.0)
cvModel = cv.fit(stopDf)
cvDf = cvModel.transform(stopDf)

cvDf.collect()
cvDf.select('words','nostops','cv').show()

+--------------------+--------------------+--------------------+
|               words|             nostops|                  cv|
+--------------------+--------------------+--------------------+
|[my, dog, has, fl...|[dog, flea, probl...|(30,[1,11,16,17,2...|
|[maybe, not, take...|[maybe, dog, park...|(30,[1,2,8,13],[1...|
|[my, dalmation, i...|[dalmation, cute....|(30,[14,19],[1.0,...|
|[stop, posting, s...|[stop, posting, s...|(30,[2,3,5,18,26]...|
|[mr, licks, ate, ...|[mr, licks, ate, ...|(30,[3,15,27,28],...|
|[quit, buying, wo...|[quit, buying, wo...|(30,[1,2,5,22,29]...|
|[우리, 강아지, 벌레, 있어요...|[강아지, 벌레, 있어요, 도와...|(30,[0,10,24,25],...|
|[우리, 강아지, 귀여워, 너,...|     [강아지, 귀여워, 사랑해]|(30,[0,7],[1.0,1.0])|
|[강아지, 공원, 가지마, 바보같이]|[강아지, 공원, 가지마, 바보같이]|(30,[0,4,9,20],[1...|
|[강아지, 음식, 구매, 마세요...|[강아지, 음식, 구매, 마세요...|(30,[0,4,6,12,21]...|
+--------------------+--------------------+--------------------+



In [98]:
cvDf.select('cv').take(13)

[Row(cv=SparseVector(30, {1: 1.0, 11: 1.0, 16: 1.0, 17: 1.0, 23: 1.0})),
 Row(cv=SparseVector(30, {1: 1.0, 2: 1.0, 8: 1.0, 13: 1.0})),
 Row(cv=SparseVector(30, {14: 1.0, 19: 1.0})),
 Row(cv=SparseVector(30, {2: 1.0, 3: 1.0, 5: 1.0, 18: 1.0, 26: 1.0})),
 Row(cv=SparseVector(30, {3: 1.0, 15: 1.0, 27: 1.0, 28: 1.0})),
 Row(cv=SparseVector(30, {1: 1.0, 2: 1.0, 5: 1.0, 22: 1.0, 29: 1.0})),
 Row(cv=SparseVector(30, {0: 1.0, 10: 1.0, 24: 1.0, 25: 1.0})),
 Row(cv=SparseVector(30, {0: 1.0, 7: 1.0})),
 Row(cv=SparseVector(30, {0: 1.0, 4: 1.0, 9: 1.0, 20: 1.0})),
 Row(cv=SparseVector(30, {0: 1.0, 4: 1.0, 6: 1.0, 12: 1.0, 21: 1.0}))]

In [100]:
for v in cvModel.vocabulary:
    print v,

강아지 dog stupid stop 바보같이 worthless 구매 귀여워 maybe 가지마 있어요 help 음식 park love ate problems. flea posting cute. 공원 마세요 buying please. 도와주세요 벌레 garbage mr steak food
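
The roles of vocabSize and minDF can be sketched in plain Python. The functions `fit_vocabulary` and `count_vector` are hypothetical, and the alphabetical tie-break between equally frequent terms is an assumption of this sketch, not a statement about Spark's ordering.

```python
from collections import Counter

def fit_vocabulary(docs, vocab_size=30, min_df=1):
    """Keep terms that occur in enough documents, most frequent first."""
    term_counts = Counter()
    doc_counts = Counter()
    for doc in docs:
        term_counts.update(doc)
        doc_counts.update(set(doc))
    # min_df >= 1 is a document count; a value in [0, 1) is a fraction of documents.
    threshold = min_df if min_df >= 1 else min_df * len(docs)
    kept = [t for t in term_counts if doc_counts[t] >= threshold]
    kept.sort(key=lambda t: (-term_counts[t], t))  # ties broken alphabetically
    return kept[:vocab_size]

def count_vector(doc, vocab):
    """Sparse count vector as a dict {vocabulary index: count}."""
    index = {t: i for i, t in enumerate(vocab)}
    vec = {}
    for term in doc:
        if term in index:
            vec[index[term]] = vec.get(index[term], 0.0) + 1.0
    return vec

docs = [["dog", "flea", "help"], ["dog", "park", "stupid"], ["stop", "stupid", "dog"]]
vocab = fit_vocabulary(docs, min_df=2)
vectors = [count_vector(d, vocab) for d in docs]
print(vocab)
print(vectors)
```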


### Selecting and processing columns

* Convert to an RDD
    * Use the label plus a dummy feature vector [1,1] for testing

In [245]:
_rdd=df.map(lambda x:x.sent)
_rdd.first()

u'It is calm and sunny.'

In [13]:
from pyspark.mllib.regression import LabeledPoint
_rdd=df.map(lambda x:LabeledPoint(x.cls,[1,1]))
_rdd.first()

LabeledPoint(0.0, [1.0,1.0])

### VectorAssembler

* String columns cannot be assembled.

In [397]:
from pyspark.mllib.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(
    inputCols=["label2", "cv"],
    outputCol="va")

output = assembler.transform(cvDf)
print(output.select("va").first())

Row(va=SparseVector(8, {0: 1.0, 3: 1.0, 7: 1.0}))


### n-gram

* The input is an array of tokens.

In [279]:
from pyspark.ml.feature import NGram
ngram = NGram(inputCol="words", outputCol="ngrams")
ngramDf = ngram.transform(tokDf)
ngramDf.show()
for e in ngramDf.select("ngrams", "label2").take(3):
    print e

+------+--------------------+--------------------+--------------------+
|label2|                sent|               words|              ngrams|
+------+--------------------+--------------------+--------------------+
|     1|It is calm and su...|[it, is, calm, an...|[it is, is calm, ...|
|     0|         feel gloomy|      [feel, gloomy]|       [feel gloomy]|
|     1|           I am calm|       [i, am, calm]|     [i am, am calm]|
|     0|      gloomy weather|   [gloomy, weather]|    [gloomy weather]|
|     1|               나 행복해|            [나, 행복해]|             [나 행복해]|
|     0|               너 불행해|            [너, 불행해]|             [너 불행해]|
|     1|              우리 행복해|           [우리, 행복해]|            [우리 행복해]|
+------+--------------------+--------------------+--------------------+

Row(ngrams=[u'it is', u'is calm', u'calm and', u'and sunny.'], label2=1)
Row(ngrams=[u'feel gloomy'], label2=0)
Row(ngrams=[u'i am', u'am calm'], label2=1)
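
The bigram output above can be reproduced with a short plain-Python function. This is a sketch of the sliding-window idea only; `ngrams` is a hypothetical helper with n = 2 as its default.

```python
def ngrams(words, n=2):
    """Join every run of n consecutive tokens with a single space."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

bigrams = ngrams(["it", "is", "calm", "and", "sunny."])
print(bigrams)
```

A sentence with fewer than n tokens yields an empty list.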


### pipeline

In [24]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.classification import LogisticRegression

trainDf = sqlCtx.createDataFrame([
    (0L, "a b c d e spark", 1.0),
    (1L, "b d", 0.0),
    (2L, "spark f g h", 1.0),
    (3L, "hadoop mapreduce", 0.0)], ["id", "text", "label"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Fit the pipeline to training documents.
model = pipeline.fit(trainDf)

### udf

* A function applied to the rows of a DataFrame
* The function and its return type are declared in advance
* Used together with withColumn

In [20]:
cvDf.printSchema()

root
 |-- label: long (nullable = true)
 |-- sent: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- cv: vector (nullable = true)



In [25]:
cvDf.take(1)

[Row(label=0, sent=u'When I find myself in times of trouble', words=[u'when', u'i', u'find', u'myself', u'in', u'times', u'of', u'trouble'], cv=SparseVector(30, {3: 1.0, 4: 1.0, 17: 1.0, 18: 1.0, 20: 1.0, 23: 1.0, 24: 1.0, 27: 1.0}))]

* Convert sentences to upper case

In [280]:
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
 
def uppercase(s):
    return s.upper()

upperUdf = udf(uppercase, StringType())
newDF = cvDf.withColumn("upperSent", upperUdf(cvDf.sent))

In [281]:
_a=newDF.select('upperSent')
__a=_a.rdd.map(lambda x:x.upperSent.split(' '))
_a.take(1)

[Row(upperSent=u'IT IS CALM AND SUNNY.')]

* Vector conversion

In [6]:
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

maturity_udf = udf(lambda age: "adult" if age >=18 else "child", StringType())

# use Row
df = sqlCtx.createDataFrame([{'name': 'Alice', 'age': 1}])
df.withColumn("maturity", maturity_udf(df.age))


DataFrame[age: bigint, name: string, maturity: string]

### ml test

```
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)
```

```
(trainingData, testData) = data.randomSplit([0.7, 0.3])
```

* ml.decisiontree
    * Using cvDf's label column as-is raises IllegalArgumentException; convert it with StringIndexer first.

### StringIndexer

* Convert the label to double

In [101]:
from pyspark.ml.feature import StringIndexer
labelIndexer = StringIndexer(inputCol="cls", outputCol="labels")
model=labelIndexer.fit(cvDf)
trainDf2=model.transform(cvDf)

In [102]:
trainDf2.printSchema()

root
 |-- cls: long (nullable = true)
 |-- sent: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- nostops: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- cv: vector (nullable = true)
 |-- labels: double (nullable = true)
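
The mapping that StringIndexer learns can be sketched in plain Python: labels are ranked by frequency, and the most frequent label gets index 0.0. The helper `fit_string_indexer` is hypothetical, and breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter

def fit_string_indexer(labels):
    """Map each label (as a string) to a double index, most frequent first."""
    counts = Counter(str(l) for l in labels)
    ordered = sorted(counts, key=lambda l: (-counts[l], l))  # ties: alphabetical
    return {label: float(i) for i, label in enumerate(ordered)}

labels = [0, 1, 0, 1, 0, 1, 0, 0, 1, 1]  # the cls column of the example df
mapping = fit_string_indexer(labels)
indexed = [mapping[str(l)] for l in labels]
print(mapping)
print(indexed)
```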



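* [nok] A udf can also be used, but the version below does not produce a double: without an explicit return type the udf column defaults to string, and `x.DoubleType()` is not a valid conversion. Passing the type as the second argument, e.g. `udf(lambda x: float(x), DoubleType())` with `DoubleType` imported from `pyspark.sql.types`, is the usual form.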

In [104]:
from pyspark.sql.functions import udf

toDoublefunc = udf(lambda x: x.DoubleType())
trainDf3 = trainDf2.withColumn("_label",toDoublefunc(trainDf2.cls))

In [105]:
trainDf3.printSchema()

root
 |-- cls: long (nullable = true)
 |-- sent: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- nostops: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- cv: vector (nullable = true)
 |-- labels: double (nullable = true)
 |-- _label: string (nullable = true)



* The training DataFrame must use the expected column names and data types.


Column | Data type | Example
-----|-----|-----
label | double | 0.0
features | sparse or dense vectors | (7,[2,6],[1.0,1.0])

* A subset of the columns can be selected.

In [106]:
_trainDf=trainDf2.select('labels','cv')

In [107]:
trainDf=trainDf2.withColumnRenamed('labels','label')

* The result can be assigned back to the same DataFrame name.

In [108]:
trainDf=trainDf.withColumnRenamed('cv','features')

In [110]:
trainDf.printSchema()

root
 |-- cls: long (nullable = true)
 |-- sent: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- nostops: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- features: vector (nullable = true)
 |-- label: double (nullable = true)



In [112]:
trainDf.select('cls','label','features').show()

+---+-----+--------------------+
|cls|label|            features|
+---+-----+--------------------+
|  0|  0.0|(30,[1,11,16,17,2...|
|  1|  1.0|(30,[1,2,8,13],[1...|
|  0|  0.0|(30,[14,19],[1.0,...|
|  1|  1.0|(30,[2,3,5,18,26]...|
|  0|  0.0|(30,[3,15,27,28],...|
|  1|  1.0|(30,[1,2,5,22,29]...|
|  0|  0.0|(30,[0,10,24,25],...|
|  0|  0.0|(30,[0,7],[1.0,1.0])|
|  1|  1.0|(30,[0,4,9,20],[1...|
|  1|  1.0|(30,[0,4,6,12,21]...|
+---+-----+--------------------+



### machine learning test

* lr

In [113]:
from pyspark.ml.classification import *

lr = LogisticRegression(maxIter=10, regParam=0.01)

In [114]:
model = lr.fit(trainDf)

* lr pipeline
    * Run the pipeline starting from the original df

In [115]:
df.printSchema()

root
 |-- cls: long (nullable = true)
 |-- sent: string (nullable = true)



In [116]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import *

labelIndexer = StringIndexer(inputCol="cls", outputCol="label")
tokenizer = Tokenizer(inputCol="sent", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[labelIndexer,tokenizer, hashingTF, lr])

In [117]:
model=pipeline.fit(df)

* dt

In [118]:
from pyspark.ml.classification import DecisionTreeClassifier

dt = DecisionTreeClassifier(labelCol="label", featuresCol="features")
model=dt.fit(trainDf)

In [395]:
model.numNodes

5

* nb

In [119]:
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")
model = nb.fit(trainDf)

* mllib
    * Using the trainDf built above

In [120]:
from pyspark.mllib.regression import LabeledPoint
trainRdd = trainDf.map(lambda row: LabeledPoint(row.label,row.features))

In [121]:
from pyspark.mllib.classification import NaiveBayes

NaiveBayes.train(trainRdd)

<pyspark.mllib.classification.NaiveBayesModel at 0x10b4fed90>

In [122]:
from pyspark.mllib.regression import LinearRegressionWithSGD
LinearRegressionWithSGD.train(trainRdd)

(weights=[0.176731539981,0.215696205714,0.404460423896,0.102108218995,0.36888244587,0.219717639591,0.151150601987,-0.131736079465,0.184742784305,0.217731843883,-0.0604148264235,-0.0534905795486,0.151150601987,0.184742784305,0.0,-0.0331654196387,-0.0534905795486,-0.0534905795486,0.135273638634,0.0,0.217731843883,0.151150601987,0.0844440009573,-0.0534905795486,-0.0604148264235,-0.0604148264235,0.135273638634,-0.0331654196387,-0.0331654196387,0.0844440009573], intercept=0.0)

In [123]:
from pyspark.mllib.classification import LogisticRegressionWithSGD
lrm = LogisticRegressionWithSGD.train(trainRdd, iterations=10)

In [124]:
from pyspark.mllib.classification import SVMWithSGD
svm = SVMWithSGD.train(trainRdd, iterations=10)

## Problem S-11: Twitter data analysis

* 1-1 sklearn
* 1-2 spark

    * tf-idf
    * KMeans
    * MDS (Multi-Dimensional Scaling)
    * visualize

* An IBM employee's example that transforms tweets and applies MLlib: https://github.com/castanan/w2v

* [nok] _tweet.json in the current directory
    * Switched to src/ds_twitter_3.py (saves to ds_twitter_3.json)

### sklearn

In [592]:
import pandas as pd

#_jfname='_tweet.json'
_jfname='src/ds_twitter_seoul_3.json'
# read the entire file into a python array
with open(_jfname, 'rb') as f:
    data = f.readlines()

# remove the trailing "\n" from each line
data = map(lambda x: x.rstrip(), data)

# each element of 'data' is an individual JSON object.
# i want to convert it into an *array* of JSON objects
# which, in and of itself, is one large JSON object
# basically... add square brackets to the beginning
# and end, and have all the individual business JSON objects
# separated by a comma
data_json_str = "[" + ','.join(data) + "]"

# now, load it into pandas
data_df = pd.read_json(data_json_str)
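
Since the file holds one JSON object per line, an alternative to joining everything into one big JSON string is to parse it line by line. The sketch below uses only the standard library; the two records in `sample` are made-up stand-ins for the real file opened from `_jfname`.

```python
import io
import json

# Two stand-in records; the real input would be the file opened from _jfname.
sample = io.StringIO(
    u'{"id": 1, "text": "hello seoul"}\n'
    u'{"id": 2, "text": "good morning"}\n'
)

# Parse each non-empty line as its own JSON object.
records = [json.loads(line) for line in sample if line.strip()]
texts = [r["text"] for r in records]
print(texts)
```

Recent pandas versions also accept `lines=True` in `read_json`, which reads this one-object-per-line layout directly.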

In [593]:
data_df.count()

contributors                    0
coordinates                    14
created_at                   2000
entities                     2000
extended_entities             343
favorite_count               2000
favorited                    2000
geo                            14
id                           2000
id_str                       2000
in_reply_to_screen_name        66
in_reply_to_status_id          63
in_reply_to_status_id_str      63
in_reply_to_user_id            66
in_reply_to_user_id_str        66
is_quote_status              2000
lang                         2000
metadata                     2000
place                          21
possibly_sensitive            846
quoted_status                   6
quoted_status_id               14
quoted_status_id_str           14
retweet_count                2000
retweeted                    2000
retweeted_status             1586
source                       2000
text                         2000
truncated                    2000
user          

In [594]:
data_df.head()

Unnamed: 0,contributors,coordinates,created_at,entities,extended_entities,favorite_count,favorited,geo,id,id_str,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,is_quote_status,lang,metadata,place,possibly_sensitive,Unnamed: 21
0,,,2016-11-25 01:09:18,"{u'symbols': [], u'user_mentions': [{u'indices...",{u'media': [{u'source_user_id': 74629927357097...,0,False,,801955891956236288,801955891956236288,,,,,,False,en,"{u'iso_language_code': u'en', u'result_type': ...",,0.0,...
1,,,2016-11-25 01:09:14,"{u'symbols': [], u'user_mentions': [{u'indices...","{u'media': [{u'source_user_id': 3131144039, u'...",0,False,,801955872410697728,801955872410697728,,,,,,False,en,"{u'iso_language_code': u'en', u'result_type': ...",,0.0,...
2,,,2016-11-25 01:09:09,"{u'symbols': [], u'user_mentions': [], u'hasht...",,0,False,,801955852798267393,801955852798267392,,,,,,False,en,"{u'iso_language_code': u'en', u'result_type': ...",,0.0,...
3,,,2016-11-25 01:09:06,"{u'symbols': [], u'user_mentions': [], u'hasht...",{u'media': [{u'expanded_url': u'https://twitte...,0,False,,801955840051781633,801955840051781632,,,,,,False,en,"{u'iso_language_code': u'en', u'result_type': ...",,0.0,...
4,,,2016-11-25 01:09:04,"{u'symbols': [], u'user_mentions': [{u'indices...",,0,False,,801955833424642048,801955833424642048,,,,,,False,th,"{u'iso_language_code': u'th', u'result_type': ...",,,...


In [595]:
data_df['id']

0     801955891956236288
1     801955872410697728
2     801955852798267393
3     801955840051781633
4     801955833424642048
5     801955813715804160
6     801955812302127104
7     801955794631598080
8     801955792853168128
9     801955787476070404
10    801955750666899456
11    801955740197986304
12    801955739140964352
13    801955736448339968
14    801955733025816576
...
1985    801931097713975296
1986    801931077199818752
1987    801931065329876996
1988    801931056123260928
1989    801931044148506624
1990    801931008454967296
1991    801931005762289664
1992    801931001018486784
1993    801930990893617152
1994    801930973013311488
1995    801930944290504704
1996    801930935772086274
1997    801930895523512320
1998    801930887239634944
1999    801930876875481088
Name: id, Length: 2000, dtype: int64

* tf-idf
    * Step 1: build the tf-idf model
    * Step 2: fit

* An example follows

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

vocabulary = "a list of words I want to look for in the documents".split()

vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word', 
           stop_words='english', vocabulary=vocabulary)

doc = "some string I want to get tf-idf vector for"
doc_tfidf = vect.fit_transform([doc])
print doc_tfidf

  (0, 5)	1.0


* Practice with just 10 tweets

In [6]:
vectorizer = TfidfVectorizer(max_df=0.5, max_features=10000,
                                    min_df=2, stop_words='english',
                                    use_idf=True)
_text=data_df['text'][20:30]
X = vectorizer.fit_transform(_text)

In [16]:
print X

  (1, 0)	0.353553390593
  (1, 3)	0.353553390593
  (1, 9)	0.353553390593
  (1, 13)	0.353553390593
  (1, 11)	0.353553390593
  (1, 10)	0.353553390593
  (1, 12)	0.353553390593
  (1, 8)	0.353553390593
  (2, 15)	0.449404150908
  (2, 7)	0.449404150908
  (2, 14)	0.349561218439
  (2, 6)	0.349561218439
  (2, 4)	0.313925733382
  (2, 5)	0.313925733382
  (2, 2)	0.393175527292
  (3, 0)	0.301511344578
  (3, 3)	0.301511344578
  (3, 9)	0.301511344578
  (3, 13)	0.301511344578
  (3, 11)	0.301511344578
  (3, 10)	0.301511344578
  (3, 12)	0.301511344578
  (3, 8)	0.603022689156
  (5, 16)	0.439122351338
  (5, 17)	0.439122351338
  (5, 1)	0.439122351338
  (5, 14)	0.341563699106
  (5, 6)	0.341563699106
  (5, 4)	0.306743508955
  (5, 5)	0.306743508955
  (6, 14)	0.507796399128
  (6, 4)	0.456029870009
  (6, 5)	0.456029870009
  (6, 2)	0.571153510321
  (8, 1)	0.439122351338
  (8, 15)	0.439122351338
  (8, 7)	0.439122351338
  (8, 14)	0.341563699106
  (8, 6)	0.341563699106
  (8, 4)	0.306743508955
  (8, 5)	0.306743508955


In [20]:
from sklearn.cluster import KMeans
km = KMeans(n_clusters=7, init='k-means++', max_iter=100, n_init=1,verbose=1)

#import sklearn.cluster.KMeans
#km = sklearn.cluster.KMeans(n_clusters=7, init='k-means++', max_iter=100, n_init=1,verbose=1)
km.fit(X)

Initialization complete
Iteration  0, inertia 0.081
Iteration  1, inertia 0.041
Converged at iteration 1


KMeans(copy_x=True, init='k-means++', max_iter=100, n_clusters=7, n_init=1,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=1)

### spark

In [10]:
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)

* Data read above
    * _jfname -> strip \n -> data
    * data is a list of JSON strings, so parallelize it and build a DataFrame

In [596]:
_df=sqlCtx.jsonRDD(sc.parallelize(data))

In [597]:
type(_df)

pyspark.sql.dataframe.DataFrame

In [33]:
_df.count()

50

In [57]:
_my=sc.parallelize(["I am not"])  # wrap in a list; a bare string is split into characters

In [59]:
type(_my)

pyspark.rdd.RDD

In [58]:
_my.map(lambda x: x.replace("not","NN"))

PythonRDD[91] at RDD at PythonRDD.scala:43

## Problem S-12: Graph analysis

* todo
    * theme
        * link prediction - similarities
        * communities - clustering
        * opinion leader - PageRank
        * graph partitioning
    * google search
        * "social graph analysis spark"
        * to read: "Social big data: Recent achievements and new challenges" and the papers that cite it
    
* [ok] spark-defaults.conf
    ```
    spark.jars.packages=graphframes:graphframes:0.1.0-spark1.6
    ```
    
* [ok in python; nok in ipython notebook] to use a Spark package 'GraphFrame'
    * download graphframes-0.1.0-spark1.6.jar
    * copy to lib/ and symlink
    ```
    cd ~/Downloads/spark-1.6.0-bin-hadoop2.6/lib
    ln -s graphframes-0.1.0-spark1.6.jar graphframes.jar
    ```

    * run
    ```
    cd ~/Downloads/spark-1.6.0-bin-hadoop2.6/bin
    ./pyspark --py-files ../lib/graphframes.jar --jars ../lib/graphframes.jar
    ```
    
    * tutorial
    ```
    http://graphframes.github.io/quick-start.html
    ```


In [1]:
import os
import findspark

home=os.getenv("HOME")
spark_home=os.path.join(home,"Downloads/spark-1.6.0-bin-hadoop2.6")
findspark.init(spark_home)

import pyspark
#conf=pyspark.SparkConf()
#conf = pyspark.SparkConf().setAppName("myAppName")
#conf.set("spark.mongodb.input.uri","mongodb://127.0.0.1/ds_rest_subwayPassengers_mongo_db.db_rest_subway?readPreference=primaryPreferred")
#conf.set("spark.mongodb.output.uri","mongodb://127.0.0.1/ds_rest_subwayPassengers_mongo_db.db_rest_subway")
#sc = pyspark.SparkContext(conf=conf)
sc = pyspark.SparkContext()
print sc._conf.get("spark.jars.packages")
sc.setLogLevel("ERROR")
sqlContext = pyspark.sql.SQLContext(sc)

v = sqlContext.createDataFrame([
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
], ["id", "name", "age"])
# Create an Edge DataFrame with "src" and "dst" columns
e = sqlContext.createDataFrame([
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
], ["src", "dst", "relationship"])
# Create a GraphFrame
from graphframes import *
g = GraphFrame(v, e)

# Query: Get in-degree of each vertex.
g.inDegrees.show()

# Query: Count the number of "follow" connections in the graph.
g.edges.filter("relationship = 'follow'").count()

# Run PageRank algorithm, and show results.
results = g.pageRank(resetProbability=0.01, maxIter=20)
results.vertices.select("id", "pagerank").show()

graphframes:graphframes:0.1.0-spark1.6,org.mongodb.spark:mongo-spark-connector_2.10:1.1.0
+---+--------+
| id|inDegree|
+---+--------+
|  b|       2|
|  c|       1|
+---+--------+

+---+-------------------+
| id|           pagerank|
+---+-------------------+
|  a|               0.01|
|  b| 0.2808611427228327|
|  c|0.27995525261339177|
+---+-------------------+
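
The two queries above (in-degree and PageRank) can be sketched in plain Python on the same three edges. This is a textbook normalized PageRank with a damping factor, written under the assumption that every vertex has at least one outgoing edge; GraphFrames scales its ranks differently, so the numbers are not expected to match the output above. The function names are made up for this illustration.

```python
edges = [("a", "b"), ("b", "c"), ("c", "b")]
vertices = ["a", "b", "c"]

def in_degrees(edges):
    """Count incoming edges per destination vertex."""
    deg = {}
    for _, dst in edges:
        deg[dst] = deg.get(dst, 0) + 1
    return deg

def pagerank(vertices, edges, damping=0.85, iterations=50):
    """Power iteration; assumes every vertex has at least one outgoing edge."""
    out_deg = {v: 0 for v in vertices}
    for src, _ in edges:
        out_deg[src] += 1
    rank = {v: 1.0 / len(vertices) for v in vertices}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / len(vertices) for v in vertices}
        for src, dst in edges:
            new_rank[dst] += damping * rank[src] / out_deg[src]
        rank = new_rank
    return rank

degrees = in_degrees(edges)
ranks = pagerank(vertices, edges)
print(degrees)
print(ranks)
```

Vertex a has no incoming edge, so it only keeps the teleport share, while b and c pass rank back and forth; the ordering b > c > a agrees with the GraphFrames result.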



## Problem S-13: spark-submit

* spark-defaults.conf
    * Separate multiple packages with commas

* spark-submit (see the self-contained app in the quick start)

### hello

In [1]:
pwd

u'/home/jsl/Code/git/bb/jsl/pyds'

In [18]:
%%writefile src/ds_spark_hello.py
print "---------BEGIN-----------"
import pyspark
conf = pyspark.SparkConf().setAppName("myAppName1")
sc   = pyspark.SparkContext(conf=conf)
sc.setLogLevel("ERROR")
print "---------RESULT-----------"
print sc
rdd = sc.parallelize(range(1000), 10)
print "mean=",rdd.mean()
nums = sc.parallelize([1, 2, 3, 4])
squared = nums.map(lambda x: x * x).collect()
for num in squared:
    print "%i " % (num)


Overwriting src/ds_spark_hello.py


In [19]:
!/home/jsl/Downloads/spark-1.6.0-bin-hadoop2.6/bin/spark-submit src/ds_spark_hello.py

Ivy Default Cache set to: /home/jsl/.ivy2/cache
The jars for the packages stored in: /home/jsl/.ivy2/jars
:: loading settings :: url = jar:file:/home/jsl/Downloads/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.mongodb.spark#mongo-spark-connector_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
	confs: [default]
	found org.mongodb.spark#mongo-spark-connector_2.11;1.1.0 in central
	found org.mongodb#mongo-java-driver;3.2.2 in central
:: resolution report :: resolve 128ms :: artifacts dl 3ms
	:: modules in use:
	org.mongodb#mongo-java-driver;3.2.2 from central in [default]
	org.mongodb.spark#mongo-spark-connector_2.11;1.1.0 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	------------
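
The captured log above is cut off before the script's own output. As a rough cross-check, the same quantities can be computed without Spark; this is a plain-Python sketch, not the Spark job itself, and the values it prints are what `rdd.mean()` and the squared list should agree with.

```python
# Plain-Python equivalent of sc.parallelize(range(1000), 10).mean()
values = list(range(1000))
mean = sum(values) / float(len(values))

# Plain-Python equivalent of nums.map(lambda x: x * x).collect()
squared = [x * x for x in [1, 2, 3, 4]]

print("mean= %s" % mean)
for num in squared:
    print("%i " % num)
```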

### sql, file

In [20]:
%%writefile src/ds_spark_sql.py
import pyspark
sc = pyspark.SparkContext(appName="myAppName")
sc.setLogLevel("ERROR")
print "---------RESULT-----------"
from operator import add
lines = sc.textFile("README.md")
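# Split each line on spaces, normalize every token, and sum the counts per word.
word_count_bo = lines\
    .flatMap(lambda x: x.split(' '))\
    .map(lambda x: (x.lower().strip().rstrip(',').rstrip('.'), 1))\
    .reduceByKey(add)
# count() on the reduced RDD is the number of distinct words, not the total word count.
print word_count_bo.count()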

from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)
d = [{'name': 'Alice', 'age': 1}]
print sqlCtx.createDataFrame(d).collect()


Writing src/ds_spark_sql.py
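
To see what the `flatMap` / `map` / `reduceByKey` chain computes, the same steps can be traced on a few lines in plain Python. This is a minimal sketch under assumptions: the sample sentences are made up for illustration (they are not the contents of `README.md`), and `normalize` and `word_count` are hypothetical helper names.

```python
from collections import Counter


def normalize(token):
    # Same normalization as the Spark script: lower-case, trim whitespace,
    # then drop trailing commas and trailing periods.
    return token.lower().strip().rstrip(',').rstrip('.')


def word_count(lines):
    counts = Counter()
    for line in lines:                    # flatMap: one token stream over all lines
        for token in line.split(' '):
            counts[normalize(token)] += 1  # map to (word, 1), then reduceByKey(add)
    return counts


# Made-up sample input
sample = ["Apache Spark is fast.", "Spark is general, and spark is fun."]
counts = word_count(sample)

print(len(counts))       # number of distinct words, like word_count_bo.count()
print(counts["spark"])   # occurrences of one word
```

Note that splitting on a single space turns consecutive spaces into empty tokens, so an empty string can appear as a "word" in both versions.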


In [21]:
!/home/jsl/Downloads/spark-1.6.0-bin-hadoop2.6/bin/spark-submit src/ds_spark_sql.py

Ivy Default Cache set to: /home/jsl/.ivy2/cache
The jars for the packages stored in: /home/jsl/.ivy2/jars
:: loading settings :: url = jar:file:/home/jsl/Downloads/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.mongodb.spark#mongo-spark-connector_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
	confs: [default]
	found org.mongodb.spark#mongo-spark-connector_2.11;1.1.0 in central
	found org.mongodb#mongo-java-driver;3.2.2 in central
:: resolution report :: resolve 129ms :: artifacts dl 3ms
	:: modules in use:
	org.mongodb#mongo-java-driver;3.2.2 from central in [default]
	org.mongodb.spark#mongo-spark-connector_2.11;1.1.0 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	------------

### graphframes

* spark-defaults.conf
    ```
    spark.jars.packages=graphframes:graphframes:0.1.0-spark1.6
    ```

In [22]:
%%writefile src/ds_spark_graphframe.py
import pyspark
sc = pyspark.SparkContext(appName="myAppName")
sc.setLogLevel("ERROR")
print "=====ds_spark_graphframe====="
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
v = sqlContext.createDataFrame([
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
], ["id", "name", "age"])
# Create an Edge DataFrame with "src" and "dst" columns
e = sqlContext.createDataFrame([
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
], ["src", "dst", "relationship"])
# Create a GraphFrame
from graphframes import GraphFrame
g = GraphFrame(v, e)

# Query: Get in-degree of each vertex.
g.inDegrees.show()

# Query: Count the number of "follow" connections in the graph.
# The count is printed explicitly; a bare expression produces no output in a script.
print g.edges.filter("relationship = 'follow'").count()

# Run PageRank algorithm, and show results.
results = g.pageRank(resetProbability=0.01, maxIter=20)
results.vertices.select("id", "pagerank").show()

Writing src/ds_spark_graphframe.py


In [8]:
!/home/jsl/Downloads/spark-1.6.0-bin-hadoop2.6/bin/spark-submit src/ds_spark_graphframe.py

Ivy Default Cache set to: /home/jsl/.ivy2/cache
The jars for the packages stored in: /home/jsl/.ivy2/jars
:: loading settings :: url = jar:file:/home/jsl/Downloads/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
graphframes#graphframes added as a dependency
org.mongodb.spark#mongo-spark-connector_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
	confs: [default]
	found graphframes#graphframes;0.1.0-spark1.6 in spark-packages
	found org.mongodb.spark#mongo-spark-connector_2.10;1.1.0 in central
	found org.mongodb#mongo-java-driver;3.2.2 in central
:: resolution report :: resolve 143ms :: artifacts dl 3ms
	:: modules in use:
	graphframes#graphframes;0.1.0-spark1.6 from spark-packages in [default]
	org.mongodb#mongo-java-driver;3.2.2 from central in [default]
	org.mongodb.spark#mongo-spark-connector_2.10;1.1.0 from central in [default]
	------------------------------------------

## mongodb spark connector

### Installation

* Reference: https://docs.mongodb.com/spark-connector/
* Prerequisites
    * Running MongoDB instance (version 2.6 or later).
    * Spark 1.6.x.
    * Scala 2.10.x if using the mongo-spark-connector_2.10 package

* [ok] spark-submit
    ```
    $ ./bin/spark-submit /home/jsl/Code/git/bb/jsl/pyds/src/ds_spark_mongo.py
    ```
    
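    * Before running spark-submit, add the package to the configuration. The system-wide `scala -version` reports 2.11, which suggests mongo-spark-connector_2.11, but the _2.10 artifact works: the suffix has to match the Scala version Spark itself was built with, and the pre-built Spark 1.6.0 binaries use Scala 2.10.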
    ```
    $vim conf/spark-defaults.conf 
    spark.jars.packages=org.mongodb.spark:mongo-spark-connector_2.10:1.1.0
    ```
    
    * Check the installed Scala version
    ```
    jsl@jsl-smu:~$ scala -version
    Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL
    ```
    
* [ok] pyspark

```
./bin/pyspark --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/test.myCollection?readPreference=primaryPreferred" \
              --conf "spark.mongodb.output.uri=mongodb://127.0.0.1/test.myCollection" \
              --packages org.mongodb.spark:mongo-spark-connector_2.10:1.1.0
```

* If you see `ERROR DefaultMongoPartitioner: MongoDB version < 3.2 detected.`, add the following to the configuration:
    ```
    spark.mongodb.input.partitioner=MongoPaginateBySizePartitioner
    ```
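
The `spark.mongodb.input.uri` and `spark.mongodb.output.uri` values used above pack the host, database, collection, and options into one string. The sketch below takes such a URI apart using only the standard library, which can help catch a mistyped database or collection name before launching Spark. `split_mongo_uri` is a hypothetical helper for illustration and is not part of the connector.

```python
try:
    from urllib.parse import urlparse, parse_qs   # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs       # Python 2


def split_mongo_uri(uri):
    """Return (host, database, collection, options) for mongodb://host/db.collection?opts."""
    parsed = urlparse(uri)
    # The path looks like "/test.myCollection"; the database name ends at the first dot.
    database, _, collection = parsed.path.lstrip("/").partition(".")
    options = dict((key, values[0]) for key, values in parse_qs(parsed.query).items())
    return parsed.hostname, database, collection, options


host, database, collection, options = split_mongo_uri(
    "mongodb://127.0.0.1/test.myCollection?readPreference=primaryPreferred")
print(host, database, collection, options)
```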

### MongoDB Python API Basics

* Writing to MongoDB
    * Create a DataFrame
        ```
        SQLContext.createDataFrame()
        ```
    * Save the DataFrame to MongoDB
        ```
        write.format("com.mongodb.spark.sql.DefaultSource").mode("overwrite").save()
        ```

* Reading from MongoDB (a collection becomes a DataFrame)
    * The database and collection are taken from `spark.mongodb.input.uri`.
    * The format is always "com.mongodb.spark.sql.DefaultSource".
        ```
        sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").load()
        ```


In [6]:
%%writefile src/ds_spark_mongo.py
import pyspark
conf = pyspark.SparkConf().setAppName("myAppName")
conf.set("spark.mongodb.input.uri","mongodb://127.0.0.1/test.my2?readPreference=primaryPreferred")
conf.set("spark.mongodb.output.uri","mongodb://127.0.0.1/test.my2")
sc = pyspark.SparkContext(conf=conf)
#sc = pyspark.SparkContext()
sc.setLogLevel("ERROR")
print sc._conf.getAll()
sqlContext = pyspark.sql.SQLContext(sc)
print "---------write-----------"
myRdd = sc.parallelize([
        ("js", 150),
        ("Gandalf", 1000),
        ("Thorin", 195),
        ("Balin", 178),
        ("Kili", 77),
        ("Dwalin", 169),
        ("Oin", 167),
        ("Gloin", 158),
        ("Fili", 82),
        ("Bombur", None)
    ])
myDf = sqlContext.createDataFrame(myRdd, ["name", "age"])
print myDf
myDf.write.format("com.mongodb.spark.sql.DefaultSource").mode("overwrite").save()
print "---------read-----------"
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").load()
# printSchema() prints the schema itself and returns None, so it is not wrapped in print.
df.printSchema()
df.registerTempTable("myTable")
myTab = sqlContext.sql("SELECT name, age FROM myTable WHERE age >= 100")
myTab.show()

Overwriting src/ds_spark_mongo.py


In [4]:
!/home/jsl/Downloads/spark-1.6.0-bin-hadoop2.6/bin/spark-submit src/ds_spark_mongo.py

Ivy Default Cache set to: /home/jsl/.ivy2/cache
The jars for the packages stored in: /home/jsl/.ivy2/jars
:: loading settings :: url = jar:file:/home/jsl/Downloads/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
graphframes#graphframes added as a dependency
org.mongodb.spark#mongo-spark-connector_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
	confs: [default]
	found graphframes#graphframes;0.1.0-spark1.6 in spark-packages
	found org.mongodb.spark#mongo-spark-connector_2.10;1.1.0 in central
	found org.mongodb#mongo-java-driver;3.2.2 in central
:: resolution report :: resolve 146ms :: artifacts dl 4ms
	:: modules in use:
	graphframes#graphframes;0.1.0-spark1.6 from spark-packages in [default]
	org.mongodb#mongo-java-driver;3.2.2 from central in [default]
	org.mongodb.spark#mongo-spark-connector_2.10;1.1.0 from central in [default]
	------------------------------------------
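
The captured log stops before the query result. The rows that `WHERE age >= 100` should keep can be worked out by hand. The plain-Python sketch below assumes standard SQL semantics, where a comparison against NULL is never true, so the row with `None` is dropped. The order of rows returned from MongoDB may differ.

```python
rows = [
    ("js", 150), ("Gandalf", 1000), ("Thorin", 195), ("Balin", 178),
    ("Kili", 77), ("Dwalin", 169), ("Oin", 167), ("Gloin", 158),
    ("Fili", 82), ("Bombur", None),
]

# SELECT name, age FROM myTable WHERE age >= 100
# A NULL age does not satisfy the predicate, hence the explicit None check.
selected = [(name, age) for name, age in rows if age is not None and age >= 100]

for name, age in selected:
    print("%s %d" % (name, age))
```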

### My example: mongodb twitter

When reading nested JSON, a nested field is addressed with a dotted path such as
`CardSubwayStatisticsService.row.RIDE_PASGR_NUM`.
```
$ mongo
> use ds_rest_subwayPassengers_mongo_db
switched to db ds_rest_subwayPassengers_mongo_db
> show tables
db_rest_subway
system.indexes
> db.db_rest_subway.find().limit(1)
{ "_id" : ObjectId("57fa386ff5e6e94359c033e9"), "CardSubwayStatisticsService" : { "row" : [ { "COMMT" : "", "RIDE_PASGR_NUM" : 111275, "WORK_DT" : "20130723", "LINE_NUM" : "중앙선", "SUB_STA_NM" : "용문", "ALIGHT_PASGR_NUM" : 108878, "USE_MON" : "201306" }, { "COMMT" : "", "RIDE_PASGR_NUM" : 11495, "WORK_DT" : "20130723", "LINE_NUM" : "중앙선", "SUB_STA_NM" : "원덕", "ALIGHT_PASGR_NUM" : 10964, "USE_MON" : "201306" }, { "COMMT" : "", "RIDE_PASGR_NUM" : 118103, "WORK_DT" : "20130723", "LINE_NUM" : "중앙선", "SUB_STA_NM" : "양평", "ALIGHT_PASGR_NUM" : 116604, "USE_MON" : "201306" }, { "COMMT" : "", "RIDE_PASGR_NUM" : 10590, "WORK_DT" : "20130723", "LINE_NUM" : "중앙선", "SUB_STA_NM" : "오빈", "ALIGHT_PASGR_NUM" : 10020, "USE_MON" : "201306" }, { "COMMT" : "", "RIDE_PASGR_NUM" : 26304, "WORK_DT" : "20130723", "LINE_NUM" : "중앙선", "SUB_STA_NM" : "아신", "ALIGHT_PASGR_NUM" : 26358, "USE_MON" : "201306" } ], "RESULT" : { "MESSAGE" : "정상 처리되었습니다", "CODE" : "INFO-000" }, "list_total_count" : 530 } }
```
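
How the dotted path walks this document can be sketched in plain Python. The dictionary below copies only the numeric fields of the document shown above. `get_path` is a hypothetical helper, written to mimic how selecting a field from an array of structs yields an array of that field. It is an illustration, not Spark's implementation.

```python
doc = {
    "CardSubwayStatisticsService": {
        "row": [
            {"RIDE_PASGR_NUM": 111275, "ALIGHT_PASGR_NUM": 108878},
            {"RIDE_PASGR_NUM": 11495, "ALIGHT_PASGR_NUM": 10964},
            {"RIDE_PASGR_NUM": 118103, "ALIGHT_PASGR_NUM": 116604},
            {"RIDE_PASGR_NUM": 10590, "ALIGHT_PASGR_NUM": 10020},
            {"RIDE_PASGR_NUM": 26304, "ALIGHT_PASGR_NUM": 26358},
        ],
        "list_total_count": 530,
    }
}


def get_path(document, dotted_path):
    """Follow a dotted path; on a list, apply the remaining key to every element."""
    current = document
    for key in dotted_path.split("."):
        if isinstance(current, list):
            current = [item[key] for item in current]
        else:
            current = current[key]
    return current


ride = get_path(doc, "CardSubwayStatisticsService.row.RIDE_PASGR_NUM")
print(ride)
```

Because `row` is an array, the path yields one list per document rather than one value per station.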

In [15]:
%%writefile src/ds_spark_twitter.py
import pyspark
conf = pyspark.SparkConf().setAppName("myAppName")
conf.set("spark.mongodb.input.uri","mongodb://127.0.0.1/ds_rest_subwayPassengers_mongo_db.db_rest_subway?readPreference=primaryPreferred")
conf.set("spark.mongodb.output.uri","mongodb://127.0.0.1/ds_rest_subwayPassengers_mongo_db.db_rest_subway")
sc = pyspark.SparkContext(conf=conf)
#sc = pyspark.SparkContext()
sc.setLogLevel("ERROR")
print sc._conf.getAll()
sqlContext = pyspark.sql.SQLContext(sc)
print "---------read-----------"
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").load()
# printSchema() prints the schema itself and returns None, so it is not wrapped in print.
df.printSchema()
df.registerTempTable("myTwitter")
myTab = sqlContext.sql("SELECT CardSubwayStatisticsService.row.RIDE_PASGR_NUM FROM myTwitter")
print type(myTab)
myTab.show()

Overwriting src/ds_spark_twitter.py


In [16]:
!/home/jsl/Downloads/spark-1.6.0-bin-hadoop2.6/bin/spark-submit src/ds_spark_twitter.py

Ivy Default Cache set to: /home/jsl/.ivy2/cache
The jars for the packages stored in: /home/jsl/.ivy2/jars
:: loading settings :: url = jar:file:/home/jsl/Downloads/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
graphframes#graphframes added as a dependency
org.mongodb.spark#mongo-spark-connector_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
	confs: [default]
	found graphframes#graphframes;0.1.0-spark1.6 in spark-packages
	found org.mongodb.spark#mongo-spark-connector_2.10;1.1.0 in central
	found org.mongodb#mongo-java-driver;3.2.2 in central
:: resolution report :: resolve 144ms :: artifacts dl 4ms
	:: modules in use:
	graphframes#graphframes;0.1.0-spark1.6 from spark-packages in [default]
	org.mongodb#mongo-java-driver;3.2.2 from central in [default]
	org.mongodb.spark#mongo-spark-connector_2.10;1.1.0 from central in [default]
	------------------------------------------

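* All-in-one: run the example above directly in the notebook instead of through spark-submit.
* Take care not to create `sc` a second time; only one SparkContext can be active in a process.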

In [3]:
import os
import findspark

home=os.getenv("HOME")
spark_home=os.path.join(home,"Downloads/spark-1.6.0-bin-hadoop2.6")
findspark.init(spark_home)

import pyspark
conf = pyspark.SparkConf().setAppName("myAppName")
conf.set("spark.mongodb.input.uri","mongodb://127.0.0.1/ds_rest_subwayPassengersDb.db_rest_subwayTable?readPreference=primaryPreferred")
conf.set("spark.mongodb.output.uri","mongodb://127.0.0.1/ds_rest_subwayPassengersDb.db_rest_subwayTable")
sc = pyspark.SparkContext(conf=conf)
#sc = pyspark.SparkContext()
print sc._conf.getAll()
print sc._conf.get("spark.jars.packages")
sc.setLogLevel("ERROR")
sqlContext = pyspark.sql.SQLContext(sc)
print "---------read-----------"
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").load()
# printSchema() prints the schema itself and returns None, so it is not wrapped in print.
df.printSchema()
df.registerTempTable("myTwitter")
myTab = sqlContext.sql("SELECT CardSubwayStatisticsService.row.RIDE_PASGR_NUM FROM myTwitter")
print type(myTab)
myTab.show()

[(u'spark.app.name', u'myAppName'), (u'spark.mongodb.input.uri', u'mongodb://127.0.0.1/ds_rest_subwayPassengersDb.db_rest_subwayTable?readPreference=primaryPreferred'), (u'spark.submit.pyFiles', u'/home/jsl/.ivy2/jars/graphframes_graphframes-0.1.0-spark1.6.jar,/home/jsl/.ivy2/jars/org.mongodb.spark_mongo-spark-connector_2.10-1.1.0.jar,/home/jsl/.ivy2/jars/com.databricks_spark-csv_2.10-1.3.0.jar,/home/jsl/.ivy2/jars/org.mongodb_mongo-java-driver-3.2.2.jar,/home/jsl/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar,/home/jsl/.ivy2/jars/com.univocity_univocity-parsers-1.5.1.jar'), (u'spark.rdd.compress', u'True'), (u'spark.serializer.objectStreamReset', u'100'), (u'spark.master', u'local[*]'), (u'spark.mongodb.output.uri', u'mongodb://127.0.0.1/ds_rest_subwayPassengersDb.db_rest_subwayTable'), (u'spark.submit.deployMode', u'client'), (u'spark.jars', u'file:/home/jsl/.ivy2/jars/graphframes_graphframes-0.1.0-spark1.6.jar,file:/home/jsl/.ivy2/jars/org.mongodb.spark_mongo-spark-connector_2.10

In [23]:
print myTab.first()
print myTab.head()

Row(RIDE_PASGR_NUM=[111275.0, 11495.0, 118103.0, 10590.0, 26304.0])
Row(RIDE_PASGR_NUM=[111275.0, 11495.0, 118103.0, 10590.0, 26304.0])
