# Spark DataFrame

* Last updated 20161125 20170221
* TODO: Spark with MySQL, PostgreSQL

## S.1 Overview

### S.1.1 Objectives

* Define the data types required for Spark machine learning and work with DataFrames.
* Extract data using Spark SQL.
* Write Spark data to MongoDB and read it back.

### S.1.2 Contents

* S.2 Creating a SparkSession in IPython Notebook
* S.3 Data types
* S.3.1 dense vector
* S.3.2 sparse vector
* S.3.3 labeled point
* S.3.4 matrix
* S.3.5 libsvm format
* S.4 Spark SQL
* S.5 DataFrame
* S.6 MongoDB
* S.7 spark-submit

### S.1.3 Problems

* Problem S-1: Creating word vectors as MLlib input data using RDDs.
* Problem S-2: Creating feature vectors from a file.
* Problem S-3: Reading data from files with Spark SQL.
* Problem S-4: Spark SQL with the Uber CSV data.


## S.2 Creating a SparkSession in IPython Notebook


In [3]:
import os
import sys 
os.environ["SPARK_HOME"]=os.path.join(os.environ['HOME'],'Downloads','spark-2.0.0-bin-hadoop2.7')
os.environ["PYLIB"]=os.path.join(os.environ["SPARK_HOME"],'python','lib')
sys.path.insert(0,os.path.join(os.environ["PYLIB"],'py4j-0.10.1-src.zip'))
sys.path.insert(0,os.path.join(os.environ["PYLIB"],'pyspark.zip'))

In [4]:
import pyspark
spark = pyspark.sql.SparkSession.builder\
    .master("local")\
    .appName("mySpark")\
    .getOrCreate()

In [5]:
print spark.version

2.0.0


## S.3 Data types

* The main data types are as follows.

Type | Description
----------|----------
Vector | Comes in dense and sparse forms.
Labeled Point | Pairs a label (class value) with a feature vector. Used in supervised learning.
Matrix | Organized in rows and columns.

* Note that these data types differ between Spark's ml and mllib packages.

Package | Description
-------|-------
mllib | RDD API
ml | DataFrame API, Pipelines.


* vectors
    * local vector - stored on a single machine.
    * pyspark.mllib.linalg.Vectors

dense vector | sparse vector
----------|----------
Used when few values are empty. An array of its values | Used when many values are empty. Indices and values are kept in separate arrays
(160,69,24) | (3,[0,1,2],[160.0,69.0,24.0])
Input: NumPy's array, Python list | Input: MLlib's SparseVector, SciPy's csc_matrix with a single column
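
The two representations in the table can be related with a small pure-Python sketch. It does not use Spark: `to_sparse` and `to_dense` are hypothetical helper names introduced here only to show how the (size, indices, values) triple encodes the same data as the dense array.

```python
# Pure-Python sketch (no Spark needed); to_sparse/to_dense are illustrative helpers.

def to_sparse(dense):
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0]
    values = [float(dense[i]) for i in indices]
    return len(dense), indices, values


def to_dense(size, indices, values):
    """Expand a (size, indices, values) triple back into a full list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# The vector from the table has no zeros, so the sparse form saves nothing.
full = to_sparse([160, 69, 24])
# A mostly empty vector is where the sparse form pays off.
mostly_empty = to_sparse([0, 0, 0, 7.5, 0, 0, 0, 0])
print(full)
print(mostly_empty)
print(to_dense(*mostly_empty))
```

The first call reproduces the triple shown in the table, while the second stores one index and one value instead of eight numbers.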

### S.3.1 dense vector

* Note: check whether each class comes from ml or mllib when you use it.

In [11]:
from pyspark.ml.linalg import Vectors
dv1ml=Vectors.dense([0.0, 1.1, 0.1])

In [12]:
from pyspark.mllib.linalg import Vectors
dv1mllib=Vectors.dense([0.0, 1.1, 0.1])

In [14]:
print dv1ml, dv1mllib

[0.0,1.1,0.1] [0.0,1.1,0.1]


* A dense vector can also be created from a NumPy array.

In [5]:
import numpy as np

dv2 = np.array([1.0, 2.1, 3.2])

In [6]:
print dv1ml

[0.0,1.1,0.1]


* A plain Python list can also serve as a dense vector.

In [46]:
dv2 = [1.0, 2.1, 3.2]

### S.3.2 sparse vector

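* Vectors.sparse is a factory method that returns a SparseVector; a SparseVector can also be constructed directly.

* toArray() returns the vector as a dense NumPy array.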


In [48]:
sv1 = Vectors.sparse(3, [1, 2], [1.0, 3.0])
print sv1.toArray()

[ 0.  1.  3.]


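* A scipy.sparse csc_matrix with a single column can also be used as a sparse vector.

In [50]:
import numpy as np
import scipy.sparse as sps
# csc_matrix((data, row_indices, column_pointers), shape)
sv2 = sps.csc_matrix((np.array([1.0,3.0]), np.array([0,2]), np.array([0,2])), shape = (3,1))
sv2.todense()
print sv2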

  (0, 0)	1.0
  (2, 0)	3.0


### S.3.3 labeled point

* The features are a local vector, either dense or sparse.
* A labeled point consists of a label (the response) and its features.


In [15]:
from pyspark.mllib.regression import LabeledPoint
LabeledPoint(1, [1.0, 2.0, 3.0])

LabeledPoint(1.0, [1.0,2.0,3.0])

In [16]:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors
LabeledPoint(1.0, Vectors.dense([1.0, 2.0, 3.0]))

LabeledPoint(1.0, [1.0,2.0,3.0])

* When using mllib's LabeledPoint, mixing in ml's Vectors raises an error.

```
Cannot convert type <class 'pyspark.ml.linalg.DenseVector'> into Vector
```
* To avoid mixing the two, convert the ml vector with Vectors.fromML().

In [17]:
LabeledPoint(1.0, dv1ml)

TypeError: Cannot convert type <class 'pyspark.ml.linalg.DenseVector'> into Vector

In [18]:
LabeledPoint(1.0, dv1mllib)

LabeledPoint(1.0, [0.0,1.1,0.1])

In [19]:
LabeledPoint(1.0, Vectors.fromML(dv1ml))

LabeledPoint(1.0, [0.0,1.1,0.1])

* Creating a DataFrame from a Python list

In [9]:
p = [[1,[1.0,2.0,3.0]],[1,[1.1,2.1,3.1]],[0,[1.2,2.2,3.3]]]
#trainDf=sqlCtx.createDataFrame(p)
trainDf=spark.createDataFrame(p)
trainDf.collect()

[Row(_1=1, _2=[1.0, 2.0, 3.0]),
 Row(_1=1, _2=[1.1, 2.1, 3.1]),
 Row(_1=0, _2=[1.2, 2.2, 3.3])]

* When the Python list holds LabeledPoint objects, the DataFrame gets label and features columns.

In [10]:
from pyspark.mllib.regression import LabeledPoint
p = [LabeledPoint(1,[1.0,2.0,3.0]),
     LabeledPoint(1,[1.1,2.1,3.1]),
     LabeledPoint(0,[1.2,2.2,3.3])]
trainDf=spark.createDataFrame(p)
trainDf.collect()

[Row(features=DenseVector([1.0, 2.0, 3.0]), label=1.0),
 Row(features=DenseVector([1.1, 2.1, 3.1]), label=1.0),
 Row(features=DenseVector([1.2, 2.2, 3.3]), label=0.0)]

In [11]:
from pyspark.mllib.linalg import Vectors

trainDf = spark.createDataFrame([
    (1.0, Vectors.dense([0.0, 1.1, 0.1])),
    (0.0, Vectors.dense([2.0, 1.0, 1.0])),
    (0.0, Vectors.dense([2.0, 1.3, 1.0])),
    (1.0, Vectors.dense([0.0, 1.2, 0.5]))], ["label", "features"])
trainDf.collect()

[Row(label=1.0, features=DenseVector([0.0, 1.1, 0.1])),
 Row(label=0.0, features=DenseVector([2.0, 1.0, 1.0])),
 Row(label=0.0, features=DenseVector([2.0, 1.3, 1.0])),
 Row(label=1.0, features=DenseVector([0.0, 1.2, 0.5]))]

* Creating a DataFrame with an explicit schema

In [12]:
from pyspark.mllib.linalg import SparseVector, VectorUDT
from pyspark.sql.types import StructType, StructField, DoubleType
_rdd = spark.sparkContext.parallelize([
    (0.0, SparseVector(4, {1: 1.0, 3: 5.5})),
    (1.0, SparseVector(4, {0: -1.0, 2: 0.5}))])

In [13]:
schema = StructType([
    StructField("label", DoubleType(), True),
    StructField("features", VectorUDT(), True)
])

In [14]:
trainDf=_rdd.toDF(schema)
trainDf.printSchema()

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)



### S.3.4 matrix

* local matrix - pyspark.mllib.linalg.Matrix, Matrices
* distributed matrix
    * pyspark.mllib.linalg.distributed.RowMatrix
    * pyspark.mllib.linalg.distributed.IndexedRow, IndexedRowMatrix
    * pyspark.mllib.linalg.distributed.BlockMatrix

In [15]:
from pyspark.mllib.linalg import Matrix, Matrices

# Create a dense matrix; values are column-major: ((1.0, 4.0), (2.0, 5.0), (3.0, 6.0))
dm = Matrices.dense(3, 2, [1, 2, 3, 4, 5, 6])

# Create a sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0))
sm = Matrices.sparse(3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8])
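
`Matrices.dense` reads its values in column-major order, and `Matrices.sparse` takes column pointers, row indices, and values (the CSC layout). The following pure-Python sketch, which does not call Spark, expands both argument lists into nested row lists. `dense_rows` and `csc_rows` are hypothetical helper names used only for illustration.

```python
def dense_rows(num_rows, num_cols, values):
    """Expand a column-major value list into a list of rows."""
    return [[float(values[r + c * num_rows]) for c in range(num_cols)]
            for r in range(num_rows)]


def csc_rows(num_rows, num_cols, col_ptrs, row_indices, values):
    """Expand CSC arrays (column pointers, row indices, values) into rows."""
    rows = [[0.0] * num_cols for _ in range(num_rows)]
    for c in range(num_cols):
        # Entries col_ptrs[c] .. col_ptrs[c + 1] - 1 belong to column c.
        for k in range(col_ptrs[c], col_ptrs[c + 1]):
            rows[row_indices[k]][c] = float(values[k])
    return rows


# Same arguments as the dm and sm cells above.
dm_rows = dense_rows(3, 2, [1, 2, 3, 4, 5, 6])
sm_rows = csc_rows(3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8])
print(dm_rows)
print(sm_rows)
```

Reading the lists this way shows that the first three values of the dense list fill the first column, not the first row.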

### S.3.5 libsvm format

* RDD API: `data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')` (MLUtils is in pyspark.mllib.util)
* format
    ```
    [label] [index1]:[value1] [index2]:[value2] ...
    [label] [index1]:[value1] [index2]:[value2] ...
    ```
    * label - class
    * index - integers
    * value - real numbers

* see - /home/jsl/Downloads/spark-1.6.0-bin-hadoop2.6/data/mllib/sample_libsvm_data.txt
```
0 128:51 129:159 130:253 131:159 132:50 155:48 156:238 157:252 158:252 159:252 160:237 182:54 183:227 184:253 185:252 186:239 187:233 188:252 189:57 190:6 208:10 209:60 210:224 211:252 212:253 213:252 214:202 215:84 216:252 217:253 218:122 236:163 237:252 238:252 239:252 240:253 241:252 242:252 243:96 244:189 245:253 246:167 263:51 264:238 265:253 266:253 267:190 268:114 269:253 270:228 271:47 272:79 273:255 274:168 290:48 291:238 292:252 293:252 294:179 295:12 296:75 297:121 298:21 301:253 302:243 303:50 317:38 318:165 319:253 320:233 321:208 322:84 329:253 330:252 331:165 344:7 345:178 346:252 347:240 348:71 349:19 350:28 357:253 358:252 359:195 372:57 373:252 374:252 375:63 385:253 386:252 387:195 400:198 401:253 402:190 413:255 414:253 415:196 427:76 428:246 429:252 430:112 441:253 442:252 443:148 455:85 456:252 457:230 458:25 467:7 468:135 469:253 470:186 471:12 483:85 484:252 485:223 494:7 495:131 496:252 497:225 498:71 511:85 512:252 513:145 521:48 522:165 523:252 524:173 539:86 540:253 541:225 548:114 549:238 550:253 551:162 567:85 568:252 569:249 570:146 571:48 572:29 573:85 574:178 575:225 576:253 577:223 578:167 579:56 595:85 596:252 597:252 598:252 599:229 600:215 601:252 602:252 603:252 604:196 605:130 623:28 624:199 625:252 626:252 627:253 628:252 629:252 630:233 631:145 652:25 653:128 654:252 655:253 656:252 657:141 658:37
```
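
To make the line format concrete, here is a minimal pure-Python parser for a single libsvm line. It is a sketch, not the parser Spark uses: `parse_libsvm_line` is a hypothetical helper, and it keeps the indices exactly as they appear in the file.

```python
def parse_libsvm_line(line):
    """Parse '[label] [index]:[value] ...' into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        # Indices are kept exactly as written in the file (no re-basing here).
        features[int(index)] = float(value)
    return label, features


# The first few entries of the sample line shown above.
label, features = parse_libsvm_line("0 128:51 129:159 130:253")
print(label)
print(sorted(features.items()))
```

Only the non-zero entries appear in the file, which is why the format pairs naturally with sparse vectors.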

```OLD
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)
svmfn="/home/jsl/Downloads/spark-1.6.0-bin-hadoop2.6/data/mllib/sample_libsvm_data.txt"
svmDf = sqlCtx.read.format("libsvm").load(svmfn)
```

In [17]:
svmfn="/home/jsl/Downloads/spark-1.6.0-bin-hadoop2.6/data/mllib/sample_libsvm_data.txt"
svmDf = spark.read.format("libsvm").load(svmfn)

In [18]:
type(svmDf)

pyspark.sql.dataframe.DataFrame

In [19]:
svmDf.printSchema()

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)



* csr format (https://www.ncsu.edu/hpc/Courses/6sparse.html)
    ```
    0 0 0 0
    5 8 0 0
    0 0 3 0
    0 6 0 0
    ```
    * non-zero values: 5 8 3 6
    * column indices: 0 1 2 1 (the column part of each (row, column) position: 5 at (1,0), 8 at (1,1), 3 at (2,2), 6 at (3,1))
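
The CSR arrays for the 4x4 example can be derived with a short pure-Python sketch. `to_csr` is a hypothetical helper. Besides the non-zero values and their column indices it also builds the row-pointer array that a complete CSR representation needs, which the notes above do not list.

```python
def to_csr(matrix):
    """Return (values, col_indices, row_ptr) for a list-of-rows matrix."""
    values, col_indices, row_ptr = [], [], [0]
    for row in matrix:
        for col, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_indices.append(col)
        # row_ptr[i + 1] is the number of non-zeros seen through row i.
        row_ptr.append(len(values))
    return values, col_indices, row_ptr


matrix = [[0, 0, 0, 0],
          [5, 8, 0, 0],
          [0, 0, 3, 0],
          [0, 6, 0, 0]]
values, col_indices, row_ptr = to_csr(matrix)
print(values)
print(col_indices)
print(row_ptr)
```

The non-zeros of row `i` are `values[row_ptr[i]:row_ptr[i + 1]]`, so the empty first row shows up as two equal leading pointers.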



## Problem S-1: Creating word vectors as MLlib input data using RDDs.

* Words can be counted with the RDD API (map, reduce, and so on).
* The mllib package can transform the data.
    * TF-IDF, Word2Vec, and others are available.
    * Transformations missing from mllib are taken from ml (ml is the package that transforms DataFrames).
        * Tokenizer, StopWordsRemover, NGram, and so on.


In [439]:
!ls data/ds_spark_wiki.txt

data/ds_spark_wiki.txt


### Word count over the whole file

In [20]:
lines = spark.sparkContext.textFile("data/ds_spark_wiki.txt")
wc=lines\
    .flatMap(lambda x: x.split(' '))

In [21]:
type(wc)

pyspark.rdd.PipelinedRDD

In [22]:
wc.collect()

[u'Wikipedia',
 u'Apache',
 u'Spark',
 u'is',
 u'an',
 u'open',
 u'source',
 u'cluster',
 u'computing',
 u'framework.',
 u'\uc544\ud30c\uce58',
 u'\uc2a4\ud30c\ud06c\ub294',
 u'\uc624\ud508',
 u'\uc18c\uc2a4',
 u'\ud074\ub7ec\uc2a4\ud130',
 u'\ucef4\ud4e8\ud305',
 u'\ud504\ub808\uc784\uc6cc\ud06c\uc774\ub2e4.',
 u'Apache',
 u'Spark',
 u'Apache',
 u'Spark',
 u'Apache',
 u'Spark',
 u'Apache',
 u'Spark',
 u'Originally',
 u'developed',
 u'at',
 u'the',
 u'University',
 u'of',
 u'California,',
 u"Berkeley's",
 u'AMPLab,',
 u'the',
 u'Spark',
 u'codebase',
 u'was',
 u'later',
 u'donated',
 u'to',
 u'the',
 u'Apache',
 u'Software',
 u'Foundation,',
 u'which',
 u'has',
 u'maintained',
 u'it',
 u'since.',
 u'Spark',
 u'provides',
 u'an',
 u'interface',
 u'for',
 u'programming',
 u'entire',
 u'clusters',
 u'with',
 u'implicit',
 u'data',
 u'parallelism',
 u'and',
 u'fault-tolerance.']

* Count the words and emit (word, count) tuples.

In [23]:
from operator import add
wc = spark.sparkContext.textFile("data/ds_spark_wiki.txt")\
    .flatMap(lambda x: x.split(' '))\
    .map(lambda x: (x.lower().rstrip().lstrip().rstrip(',').rstrip('.'), 1))\
    .reduceByKey(add)

In [24]:
wc.count()

50

In [25]:
wc.first()

(u'and', 1)

### Per-line word count

* To be handled as a DataFrame.

In [26]:
from operator import add
wc = spark.sparkContext.textFile("data/ds_spark_wiki.txt")\
    .map(lambda x: x.replace(',',' ').replace('.',' ').replace('-',' ').lower())\
    .map(lambda x:x.split())\
    .map(lambda x:[(i,1) for i in x])

In [27]:
for e in wc.collect():
    print e

[(u'wikipedia', 1)]
[(u'apache', 1), (u'spark', 1), (u'is', 1), (u'an', 1), (u'open', 1), (u'source', 1), (u'cluster', 1), (u'computing', 1), (u'framework', 1)]
[(u'\uc544\ud30c\uce58', 1), (u'\uc2a4\ud30c\ud06c\ub294', 1), (u'\uc624\ud508', 1), (u'\uc18c\uc2a4', 1), (u'\ud074\ub7ec\uc2a4\ud130', 1), (u'\ucef4\ud4e8\ud305', 1), (u'\ud504\ub808\uc784\uc6cc\ud06c\uc774\ub2e4', 1)]
[(u'apache', 1), (u'spark', 1), (u'apache', 1), (u'spark', 1), (u'apache', 1), (u'spark', 1), (u'apache', 1), (u'spark', 1)]
[(u'originally', 1), (u'developed', 1), (u'at', 1), (u'the', 1), (u'university', 1), (u'of', 1), (u'california', 1), (u"berkeley's", 1), (u'amplab', 1)]
[(u'the', 1), (u'spark', 1), (u'codebase', 1), (u'was', 1), (u'later', 1), (u'donated', 1), (u'to', 1), (u'the', 1), (u'apache', 1), (u'software', 1), (u'foundation', 1)]
[(u'which', 1), (u'has', 1), (u'maintained', 1), (u'it', 1), (u'since', 1)]
[(u'spark', 1), (u'provides', 1), (u'an', 1), (u'interface', 1), (u'for', 1), (u'programming'

* TF (Term Frequency)
    * HashingTF

In [6]:
documents = spark.sparkContext.textFile("data/ds_spark_wiki.txt").map(lambda line: line.split(" "))

In [7]:
from pyspark.mllib.feature import HashingTF

hashingTF = HashingTF()
tf = hashingTF.transform(documents)

In [8]:
tf.collect()

[SparseVector(1048576, {253068: 1.0}),
 SparseVector(1048576, {36751: 1.0, 50570: 1.0, 68380: 1.0, 415281: 1.0, 511377: 1.0, 728364: 1.0, 862087: 1.0, 938426: 1.0, 999480: 1.0}),
 SparseVector(1048576, {63234: 1.0, 340190: 1.0, 357478: 1.0, 375592: 1.0, 458138: 1.0, 486171: 1.0, 598772: 1.0}),
 SparseVector(1048576, {938426: 4.0, 999480: 4.0}),
 SparseVector(1048576, {36757: 1.0, 225801: 1.0, 323305: 1.0, 453405: 1.0, 498679: 1.0, 518030: 1.0, 688842: 1.0, 762570: 1.0, 959994: 1.0}),
 SparseVector(1048576, {420843: 1.0, 550676: 1.0, 725041: 1.0, 782544: 1.0, 938426: 1.0, 959994: 2.0, 991590: 1.0, 993084: 1.0, 996703: 1.0, 999480: 1.0}),
 SparseVector(1048576, {50573: 1.0, 263739: 1.0, 892834: 1.0, 1014710: 1.0, 1035538: 1.0}),
 SparseVector(1048576, {3932: 1.0, 36751: 1.0, 192182: 1.0, 358969: 1.0, 363244: 1.0, 496856: 1.0, 546913: 1.0, 938426: 1.0, 951974: 1.0}),
 SparseVector(1048576, {69621: 1.0, 157580: 1.0, 219357: 1.0, 297436: 1.0, 715648: 1.0})]
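
The idea behind HashingTF can be sketched without Spark: each term is hashed to a bucket index, and the vector stores how often each bucket was hit. The sketch below is an assumption-laden illustration. `hashing_tf` is a hypothetical helper, and `zlib.crc32` stands in for Spark's own hash function, so the bucket numbers will not match the output above.

```python
import zlib
from collections import Counter


def hashing_tf(tokens, num_features=16):
    """Map tokens to bucket counts: {bucket_index: term_frequency}."""
    counts = Counter()
    for token in tokens:
        # crc32 is a stand-in hash chosen for this sketch, not Spark's hash.
        bucket = zlib.crc32(token.encode("utf-8")) % num_features
        counts[bucket] += 1.0
    return dict(counts)


# Mirrors the fourth line of the file: "Apache Spark" repeated four times.
doc = "apache spark apache spark apache spark apache spark".split()
tf_vector = hashing_tf(doc)
print(tf_vector)
```

This matches the shape of the fourth SparseVector in the output above, where two buckets each carry a count of 4.0. With a small `num_features`, distinct terms may collide in one bucket, which is the trade-off hashing makes for not keeping a vocabulary.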

In [None]:
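def countPartitions(id, iterator):
    # Count the elements in one partition and emit (partition id, count).
    c = 0
    for _ in iterator:
        c += 1
    yield (id, c)
# mapPartitionsWithIndex passes the partition index as the first argument.
_wc = wc.mapPartitionsWithIndex(countPartitions)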

```
from pyspark.mllib.regression import LabeledPoint
trainRdd = trainDf.rdd.map(lambda row: LabeledPoint(row.label,row.features))
```

## S.4 Spark SQL

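* Spark SQL
    * lets you give data a structure and query it with SQL. RDDs are used when the data is unstructured.

Aspect | Spark SQL | RDD
-----|-----|-----
Data | structured | unstructured

* Spark SQL components

Component | Description
-----|-----
Language API | Provides Python, Java, Scala, and HiveQL APIs
Schema RDD | Applies a schema to an RDD and turns it into a temporary table, i.e. a DataFrame.
Data Sources | Supports many sources - HDFS, Cassandra, HBase, and relational databases


* Creating a DataFrame
    * Goal: read records of name and height, and select those with a height of 170 or more.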

In [29]:
p = [{'name': 'kim', 'height': 170}]
spark.createDataFrame(p).collect()

[Row(height=170, name=u'kim')]

In [30]:
print type(p)

<type 'list'>


In [31]:
from pyspark.sql import *
pRow=list(Row(name="kim", height=1961))

df=spark.createDataFrame([pRow])
df.show()

+----+---+
|  _1| _2|
+----+---+
|1961|kim|
+----+---+



## Problem S-2: Creating feature vectors from a file.

* Create a DataFrame from an RDD and query it with SQL.
* Network intrusion data
    * https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
* The attack type is the label in the last field (index 41).

Class | Count
-------|-------
normal | 97278
attack | 396743
total | 494021

In [14]:
import os
import urllib
_url = 'http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz'
_fname = os.path.join(os.getcwd(),'data','kddcup.data_10_percent.gz')
if(not os.path.exists(_fname)):
    print "%s data does not exist! retrieving.." % _fname
    _f=urllib.urlretrieve(_url,_fname)


In [15]:
_rdd = spark.sparkContext.textFile(_fname)

In [16]:
_rdd.count()

494021

In [17]:
_rdd.take(3)

[u'0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.',
 u'0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,19,19,1.00,0.00,0.05,0.00,0.00,0.00,0.00,0.00,normal.',
 u'0,tcp,http,SF,235,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,29,29,1.00,0.00,0.03,0.00,0.00,0.00,0.00,0.00,normal.']

* Count the records that contain 'normal.'

In [18]:
_normal = _rdd.filter(lambda x: 'normal.' in x)
print _normal.count()

97278


In [19]:
_csvRdd=_rdd.map(lambda x: x.split(','))

In [20]:
print _csvRdd.take(1)

[[u'0', u'tcp', u'http', u'SF', u'181', u'5450', u'0', u'0', u'0', u'0', u'0', u'1', u'0', u'0', u'0', u'0', u'0', u'0', u'0', u'0', u'0', u'0', u'8', u'8', u'0.00', u'0.00', u'0.00', u'0.00', u'1.00', u'0.00', u'0.00', u'9', u'9', u'1.00', u'0.00', u'0.11', u'0.00', u'0.00', u'0.00', u'0.00', u'0.00', u'normal.']]


* Classifying the data
    * Use reduceByKey() to count the records in each class.

In [21]:
_kv = _csvRdd.map(lambda x: (x[41], 1))
_attack = _kv.reduceByKey(lambda x,y: x+y)

In [22]:
_attack.collect()

[(u'guess_passwd.', 53),
 (u'nmap.', 231),
 (u'warezmaster.', 20),
 (u'rootkit.', 10),
 (u'warezclient.', 1020),
 (u'smurf.', 280790),
 (u'pod.', 264),
 (u'neptune.', 107201),
 (u'normal.', 97278),
 (u'spy.', 2),
 (u'ftp_write.', 8),
 (u'phf.', 4),
 (u'portsweep.', 1040),
 (u'teardrop.', 979),
 (u'buffer_overflow.', 30),
 (u'land.', 21),
 (u'imap.', 12),
 (u'loadmodule.', 9),
 (u'perl.', 3),
 (u'multihop.', 7),
 (u'back.', 2203),
 (u'ipsweep.', 1247),
 (u'satan.', 1589)]

In [23]:
_normalRdd=_csvRdd.filter(lambda x: x[41]=="normal.")
_attackRdd=_csvRdd.filter(lambda x: x[41]!="normal.")

In [24]:
print _normalRdd.count()
print _attackRdd.count()

97278
396743


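* combineByKey(x, y, z)
    * Combiner function: x
        * Turns the first value seen for a key into a combiner, e.g. (value, 1) to track a running sum and count.
    * Merge value function: y
        * Folds another value of the same key into an existing combiner.
    * Merge combiners function: z
        * Merges two combiners of the same key that were built on different partitions.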
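
The roles of the three functions are easier to see when the partitions are made explicit. The following pure-Python sketch imitates the behaviour on a list of partitions. It is a simplified model, not Spark's implementation: `combine_by_key` and the two-partition split are assumptions made for illustration.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Simulate combineByKey over a list of partitions of (key, value) pairs."""
    per_partition = []
    for part in partitions:
        acc = {}
        for key, value in part:
            if key in acc:
                acc[key] = merge_value(acc[key], value)   # y: fold in a value
            else:
                acc[key] = create_combiner(value)         # x: first value of a key
        per_partition.append(acc)
    result = {}
    for acc in per_partition:
        for key, comb in acc.items():
            if key in result:
                result[key] = merge_combiners(result[key], comb)  # z: across partitions
            else:
                result[key] = comb
    return result


# Two hypothetical partitions holding the same pairs used in the next cell.
partitions = [[(0, 2.0), (0, 4.0), (1, 0.0)], [(1, 10.0), (1, 20.0)]]
sum_count = combine_by_key(
    partitions,
    lambda value: (value, 1),
    lambda acc, value: (acc[0] + value, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]))
averages = {key: total / count for key, (total, count) in sum_count.items()}
print(averages)
```

Key 1 appears in both partitions, so its final (sum, count) pair is produced by the merge-combiners function rather than by merge-value alone.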


In [33]:
data = spark.sparkContext.parallelize( [(0, 2.), (0, 4.), (1, 0.), (1, 10.), (1, 20.)] )
sumCount = data.combineByKey(lambda value: (value, 1),
                             lambda x, value: (x[0] + value, x[1] + 1),
                             lambda x, y: (x[0] + y[0], x[1] + y[1]))

averageByKey = sumCount.map(lambda (label, (value_sum, count)): (label, value_sum / count))

print averageByKey.collectAsMap()

{0: 3.0, 1: 10.0}


In [34]:
from operator import add
aggregated_counts = (data
    .map(lambda kv: (kv, 1))
    .reduceByKey(add)
    .map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))
    .groupByKey()
    .mapValues(lambda xs: (list(xs), sum(x[1] for x in xs))))

aggregated_counts.collect()

[(0, ([(2.0, 1), (4.0, 1)], 2)), (1, ([(10.0, 1), (0.0, 1), (20.0, 1)], 3))]

In [25]:
sum_counts = _kv.combineByKey(
    (lambda x: (x, 1)), # the initial value, with value x and count 1
    (lambda acc, value: (acc[0]+value, acc[1]+1)), # how to combine a pair value with the accumulator: sum value, and increment count
    (lambda acc1, acc2: (acc1[0]+acc2[0], acc1[1]+acc2[1])) # combine accumulators
)

sum_counts.collectAsMap()

{u'back.': (2203, 2203),
 u'buffer_overflow.': (30, 30),
 u'ftp_write.': (8, 8),
 u'guess_passwd.': (53, 53),
 u'imap.': (12, 12),
 u'ipsweep.': (1247, 1247),
 u'land.': (21, 21),
 u'loadmodule.': (9, 9),
 u'multihop.': (7, 7),
 u'neptune.': (107201, 107201),
 u'nmap.': (231, 231),
 u'normal.': (97278, 97278),
 u'perl.': (3, 3),
 u'phf.': (4, 4),
 u'pod.': (264, 264),
 u'portsweep.': (1040, 1040),
 u'rootkit.': (10, 10),
 u'satan.': (1589, 1589),
 u'smurf.': (280790, 280790),
 u'spy.': (2, 2),
 u'teardrop.': (979, 979),
 u'warezclient.': (1020, 1020),
 u'warezmaster.': (20, 20)}

* rdd to sql, dataframe

In [26]:
from pyspark.sql import Row

_csv = _rdd.map(lambda l: l.split(","))
_csvRdd = _csv.map(lambda p: 
    Row(
        duration=int(p[0]), 
        protocol=p[1],
        service=p[2],
        flag=p[3],
        src_bytes=int(p[4]),
        dst_bytes=int(p[5])
    )
)

In [27]:
type(_csvRdd)

pyspark.rdd.PipelinedRDD

In [30]:
#from pyspark.sql import SQLContext
#sqlCtx = SQLContext(sc)

#_df=sqlCtx.createDataFrame(_rdd)
_df=spark.createDataFrame(_csvRdd)

_df.registerTempTable("_tab")

In [31]:
_df.select("protocol", "duration", "dst_bytes").groupBy("protocol").count().show()

+--------+------+
|protocol| count|
+--------+------+
|     tcp|190065|
|     udp| 20354|
|    icmp|283602|
+--------+------+



In [32]:
_df.select("protocol", "duration", "dst_bytes")\
    .filter(_df.duration>1000)\
    .filter(_df.dst_bytes==0)\
    .groupBy("protocol")\
    .count()\
    .show()

+--------+-----+
|protocol|count|
+--------+-----+
|     tcp|  139|
+--------+-----+



In [34]:
tcp_interactions = spark.sql(
"""
    SELECT duration, dst_bytes FROM _tab
    WHERE protocol = 'tcp' AND duration > 1000 AND dst_bytes = 0
""")

In [79]:
tcp_interactions.show()

+--------+---------+
|duration|dst_bytes|
+--------+---------+
|    5057|        0|
|    5059|        0|
|    5051|        0|
|    5056|        0|
|    5051|        0|
|    5039|        0|
|    5062|        0|
|    5041|        0|
|    5056|        0|
|    5064|        0|
|    5043|        0|
|    5061|        0|
|    5049|        0|
|    5061|        0|
|    5048|        0|
|    5047|        0|
|    5044|        0|
|    5063|        0|
|    5068|        0|
|    5062|        0|
+--------+---------+
only showing top 20 rows



In [37]:
tcp_interactions_out = tcp_interactions.rdd\
    .map(lambda p: "Duration: {}, Dest. bytes: {}".format(p.duration, p.dst_bytes))

In [38]:
for i,ti_out in enumerate(tcp_interactions_out.collect()):
    if(i%10==0):
        print ti_out

Duration: 5057, Dest. bytes: 0
Duration: 5043, Dest. bytes: 0
Duration: 5046, Dest. bytes: 0
Duration: 5051, Dest. bytes: 0
Duration: 5057, Dest. bytes: 0
Duration: 5063, Dest. bytes: 0
Duration: 42448, Dest. bytes: 0
Duration: 40121, Dest. bytes: 0
Duration: 31709, Dest. bytes: 0
Duration: 30619, Dest. bytes: 0
Duration: 22616, Dest. bytes: 0
Duration: 21455, Dest. bytes: 0
Duration: 13998, Dest. bytes: 0
Duration: 12933, Dest. bytes: 0


## Problem S-3: Reading data from files with Spark SQL

* 1. Reading from a JSON file
    * Note: the file is expected to hold one JSON record per line.
* 2. Twitter JSON
* 3. Reading JSON from a URL
* 4. Reading from a CSV file
* 5. com.databricks.spark.csv
    * vim conf/spark-defaults.conf
        ```
        spark.jars.packages=com.databricks:spark-csv_2.10:1.3.0
        ```

* sqlContext.jsonRDD()
* sqlContext.jsonFile()

### Reading a JSON file

* Read a JSON file and query it with SQL.

In [None]:
# %load /home/jsl/Downloads/spark-1.6.0-bin-hadoop2.6/examples/src/main/resources/people.json
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}


In [6]:
pDF= spark.read.json("/home/jsl/Downloads/spark-1.6.0-bin-hadoop2.6/examples/src/main/resources/people.json")

In [7]:
type(pDF)

pyspark.sql.dataframe.DataFrame

In [8]:
pDF.filter(pDF['age'] > 21).show()

+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+



In [9]:
pDF.registerTempTable("people")
spark.sql("select name from people").show()

+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+



* spark.catalog can be used to list the tables currently registered.

In [11]:
spark.catalog.listTables()

[Table(name=u'people', database=None, description=None, tableType=u'TEMPORARY', isTemporary=True)]

### Reading Twitter JSON

Case | Example
-------|-------
Dumped as a unicode string, so quotes are backslash-escaped | "{\"created_at\":\"Sun Nov 13 00:05:19 +0000 2016\"
Normal | {"created_at":"Sun Nov 13 00:05:19 +0000 2016"

* Related JSON reader option: allowBackslashEscapingAnyCharacter

In [34]:
twitterDF= spark.read.json(os.path.join("src","ds_twitter_1_noquote.json"))

In [35]:
twitterDF.printSchema()

root
 |-- contributors: string (nullable = true)
 |-- coordinates: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- entities: struct (nullable = true)
 |    |-- hashtags: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- symbols: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- urls: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- user_mentions: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |-- favorite_count: long (nullable = true)
 |-- favorited: boolean (nullable = true)
 |-- geo: string (nullable = true)
 |-- id: long (nullable = true)
 |-- id_str: string (nullable = true)
 |-- in_reply_to_screen_name: string (nullable = true)
 |-- in_reply_to_status_id: string (nullable = true)
 |-- in_reply_to_status_id_str: string (nullable = true)
 |-- in_reply_to_user_id: string (nullable = true)
 |-- in_reply_to_user_id_str: s

In [36]:
twitterDF.select('text').show()

+---------------+
|           text|
+---------------+
|Hello 21 160924|
+---------------+



In [39]:
twitterDF.registerTempTable("twitter")
spark.sql("select text from twitter").show()

+---------------+
|           text|
+---------------+
|Hello 21 160924|
+---------------+



### JSON from URL

* Data read from a URL arrives as a string (e.g. r.iter_lines() yields it one line at a time).
* Parsing the response as JSON with r.json() works.

In [40]:
import requests
r=requests.get("https://raw.githubusercontent.com/jokecamp/FootballData/master/World%20Cups/all-world-cup-players.json")

In [42]:
wc=r.json()

In [43]:
type(wc)

list

* Do the records need to be converted to Row objects first?
    ```
    df = sqlContext.createDataFrame([json.loads(line) for line in r.iter_lines()])
    ```

In [44]:
wcDF=spark.createDataFrame(wc)



In [45]:
wcDF.printSchema()

root
 |-- Club: string (nullable = true)
 |-- ClubCountry: string (nullable = true)
 |-- Competition: string (nullable = true)
 |-- DateOfBirth: string (nullable = true)
 |-- FullName: string (nullable = true)
 |-- IsCaptain: boolean (nullable = true)
 |-- Number: string (nullable = true)
 |-- Position: string (nullable = true)
 |-- Team: string (nullable = true)
 |-- Year: long (nullable = true)



In [46]:
wcDF.registerTempTable("wc")
spark.sql("select Club,Team,Year from wc").show(1)

+--------------------+---------+----+
|                Club|     Team|Year|
+--------------------+---------+----+
|Club AtlÃ©tico Ta...|Argentina|1930|
+--------------------+---------+----+
only showing top 1 row



* baby names

In [47]:
import json
import requests
_url="https://health.data.ny.gov/api/views/jxy9-yhdk/rows.json?accessType=DOWNLOAD"
_json=requests.get(_url).json()

* The JSON data is split into meta and data.
* data holds 145570 records.

In [48]:
_json.keys()

[u'meta', u'data']

In [49]:
_jsonList=_json['data']
print len(_jsonList)

145570


In [50]:
_json['data'][0]

[1,
 u'5DC7F285-052B-4739-8DC3-62827014A4CD',
 1,
 1425450997,
 u'714909',
 1425450997,
 u'714909',
 u'{\n}',
 u'2013',
 u'GAVIN',
 u'ST LAWRENCE',
 u'M',
 u'9']

* list to Spark DataFrame
    * Without a schema, the DataFrame is created with default column names.

In [51]:
_df=spark.createDataFrame(_json['data'])
_df.count()

145570

* No schema was given, so the automatically generated column names (_1, _2, ...) are used.

In [52]:
_df.printSchema()

root
 |-- _1: long (nullable = true)
 |-- _2: string (nullable = true)
 |-- _3: long (nullable = true)
 |-- _4: long (nullable = true)
 |-- _5: string (nullable = true)
 |-- _6: long (nullable = true)
 |-- _7: string (nullable = true)
 |-- _8: string (nullable = true)
 |-- _9: string (nullable = true)
 |-- _10: string (nullable = true)
 |-- _11: string (nullable = true)
 |-- _12: string (nullable = true)
 |-- _13: string (nullable = true)



In [53]:
_df.filter(_df['_10'] == u'GAVIN').show(2)

+---+--------------------+---+----------+------+----------+------+---+----+-----+-----------+---+---+
| _1|                  _2| _3|        _4|    _5|        _6|    _7| _8|  _9|  _10|        _11|_12|_13|
+---+--------------------+---+----------+------+----------+------+---+----+-----+-----------+---+---+
|  1|5DC7F285-052B-473...|  1|1425450997|714909|1425450997|714909|{
}|2013|GAVIN|ST LAWRENCE|  M|  9|
| 82|43E9414D-9BE0-456...| 82|1425450997|714909|1425450997|714909|{
}|2013|GAVIN|    SUFFOLK|  M| 54|
+---+--------------------+---+----------+------+----------+------+---+----+-----+-----------+---+---+
only showing top 2 rows



* Try a select query.
    * Pivot table?

In [54]:
_df.registerTempTable("babyNames")
spark.sql("select distinct(_10) from babyNames").show(5)

+------+
|   _10|
+------+
|MILANA|
|  JADE|
|  ANNA|
|HUNTER|
|ANJALI|
+------+
only showing top 5 rows



### read from text

In [None]:
# %load /home/jsl/Downloads/spark-1.6.0-bin-hadoop2.6/examples/src/main/resources/people.txt
Michael, 29
Andy, 30
Justin, 19


In [39]:
from pyspark.sql import Row
lines = spark.sparkContext.textFile("/home/jsl/Downloads/spark-1.6.0-bin-hadoop2.6/examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1].strip())))

schemaPeople = spark.createDataFrame(people)
schemaPeople.registerTempTable("people")

In [40]:
# SQL can be run over DataFrames that have been registered as a table.
teenagers = spark.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

In [41]:
# The results of SQL queries are DataFrames; use .rdd for the normal RDD operations.
teenNames = teenagers.rdd.map(lambda p: "Name: " + p.name)
for teenName in teenNames.collect():
  print(teenName)

Name: Justin


In [45]:
from pyspark.sql.types import StructType, StructField, StringType
schemaString = "name age"

fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)

spark.createDataFrame(people,schema)

DataFrame[name: string, age: string]

### csv

In [None]:
%%writefile data/ds_spark.csv
1,2,3,4
11,22,33,44
111,222,333,444

In [46]:
#from pyspark.sql import SQLContext

#sqlContext = SQLContext(sc)
#df = sqlContext.read.format('com.databricks.spark.csv')\
#    .options(header='true', inferschema='true').load('data/ds_spark.csv')
df = spark.read.format('com.databricks.spark.csv')\
    .options(header='true', inferschema='true').load('data/ds_spark.csv')
df.show()

df.withColumnRenamed('1','label')

+---+---+---+---+
|  1|  2|  3|  4|
+---+---+---+---+
| 11| 22| 33| 44|
|111|222|333|444|
+---+---+---+---+



DataFrame[label: int, 2: int, 3: int, 4: int]

## Problem S-4: Spark SQL with the Uber CSV data

https://github.com/tmcgrath/spark-with-python-course/blob/master/Spark-SQL-CSV-with-Python.ipynb



* fivethirtyeight
    * git clone https://github.com/fivethirtyeight/uber-tlc-foil-response.git
        daily Uber trip statistics in January and February 2015
        ```
        dispatching_base_number	date	active_vehicles	trips
        B02512	1/1/2015	190	1132
        B02765	1/1/2015	225	1765
        ```

In [49]:
data_home=os.path.join(os.environ['HOME'],"Code/git/else/uber-tlc-foil-response")
filePath=os.path.join(data_home,"Uber-Jan-Feb-FOIL.csv")

_fub = spark.sparkContext.textFile(filePath)

In [50]:
type(_fub)
_fub.count()
_fub.first()

u'dispatching_base_number,date,active_vehicles,trips'

* CSV is a comma-separated format, so split on ','.
* Extract the key values from the first column (the header value is included).

In [51]:
_dub = _fub.map(lambda line: line.split(","))

type(_dub)

_row0keys=_dub.map(lambda row: row[0]).distinct().collect()

print _row0keys

_dub.filter(lambda row: "B02512" in row).count()

[u'B02682', u'B02512', u'dispatching_base_number', u'B02617', u'B02765', u'B02764', u'B02598']


59

* For B02512, collect the records with trips greater than 2000.

In [52]:
_dub.filter(lambda row: "B02512" in row).filter(lambda row: int(row[3])>2000).collect()

[[u'B02512', u'1/30/2015', u'256', u'2016'],
 [u'B02512', u'2/5/2015', u'264', u'2022'],
 [u'B02512', u'2/12/2015', u'269', u'2092'],
 [u'B02512', u'2/13/2015', u'281', u'2408'],
 [u'B02512', u'2/14/2015', u'236', u'2055'],
 [u'B02512', u'2/19/2015', u'250', u'2120'],
 [u'B02512', u'2/20/2015', u'272', u'2380'],
 [u'B02512', u'2/21/2015', u'238', u'2149'],
 [u'B02512', u'2/27/2015', u'272', u'2056']]

* The header holds the column names. Excluding it leaves the total count minus 1.

In [53]:
_noheader = _fub.filter(lambda line: "base" not in line).map(lambda line:line.split(","))
_noheader.count()

354

* reduceByKey - combines the values of each key -> the pairs below give a,3 b,2
```
("a", 1)
("b", 1)
("a", 1)
("a", 1)
("b", 1)
```
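
The reduceByKey behaviour on the five pairs above can be imitated in plain Python. `reduce_by_key` is a hypothetical helper that folds the values of each key with a two-argument function. It ignores partitioning, which Spark handles internally.

```python
def reduce_by_key(pairs, func):
    """Fold the values of each key with func, like RDD.reduceByKey."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


pairs = [("a", 1), ("b", 1), ("a", 1), ("a", 1), ("b", 1)]
totals = reduce_by_key(pairs, lambda a, b: a + b)
print(sorted(totals.items()))
```

Both arguments of the function are values of the same key, never the key itself, so the function must be associative and commutative for the result to be independent of the order in which pairs arrive.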

In [54]:
_noheader.map(lambda x: (x[0], int(x[3]))).reduceByKey(lambda a, b: a + b).collect()

[(u'B02682', 662509),
 (u'B02512', 93786),
 (u'B02617', 725025),
 (u'B02765', 193670),
 (u'B02764', 1914449),
 (u'B02598', 540791)]

* saving
    ```
    rddOfStrings.saveAsTextFile("out.txt")
    ```

## S.5 Dataframe

http://www.cs.sfu.ca/CourseCentral/732/ggbaker/spark-sql.html


* A DataFrame is like a database table.
    * It can be used as input data for MLlib.
        * Input data can be either 1) Spark RDDs or 2) DataFrames.
        * The default is the DataFrame-based API (a DataFrame is similar in concept to a pandas DataFrame). The RDD-based API was slated for removal in Spark 3.0, leaving the DataFrame API.

Pipeline | Description | Example
----------|----------|----------
DataFrame | text, feature vectors, true labels, and predictions. |
Transformer | turns a DataFrame into another DataFrame | Transformer.transform()
Estimator | fit on a DataFrame to produce a Transformer | Estimator.fit()
Pipeline | chains multiple Transformers and Estimators together |
Parameter | a common API for specifying parameters. | ParamMap

* Features

Feature | Example
-------|-------
read json | sqlContext.read.json("employee.json")
view data | dfs.show()
schema | dfs.printSchema()
select | dfs.select("name").show()
filter | dfs.filter(dfs["age"] > 23).show()
groupBy | dfs.groupBy("age").count().show()

* Spark DataFrame vs pandas

DataFrame | Spark | Pandas
-------|-------|-------
csv file | map split(',') | read_csv()
view data | show() | head(), tail()
data types | all strings after split(',') | inferred by read_csv()

In [55]:
from pyspark.sql import Row
Person = Row('name', 'height')
rows = [Person('kim', 170), Person('lee', 175), Person('lim', 180),]
#rowsRdd = sc.parallelize(rows)
#rowsDf = sqlCtx.createDataFrame(rowsRdd)

rowsDF=spark.createDataFrame(rows)

In [56]:
type(rows)

list

In [59]:
type(rowsDF.rdd)

pyspark.rdd.RDD

In [60]:
type(rowsDF)

pyspark.sql.dataframe.DataFrame

In [61]:
rowsDF.printSchema()

root
 |-- name: string (nullable = true)
 |-- height: long (nullable = true)



In [62]:
rowsDF.where(rowsDF.height < 175)\
    .select([rowsDF.name, rowsDF.height]).show()

+----+------+
|name|height|
+----+------+
| kim|   170|
+----+------+



In [63]:
rowsDF.groupby(rowsDF.height).max().show()

+------+-----------+
|height|max(height)|
+------+-----------+
|   170|        170|
|   175|        175|
|   180|        180|
+------+-----------+



## S.6 MongoDB

## S.7 spark-submit

## Problem S-13: spark-submit

* spark-defaults.conf
    * When listing several packages, separate them with commas.

* spark-submit (see the self-contained application example in the Spark Quick Start guide)
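
As a sketch of the comma rule, the snippet below assembles a `spark.jars.packages` line from several Maven coordinates (`groupId:artifactId:version`). The coordinates are the ones reported in the spark-submit log of this notebook; building the line in Python is only for illustration.

```python
# Maven coordinates as reported in the spark-submit log of this notebook.
packages = [
    "graphframes:graphframes:0.1.0-spark1.6",
    "org.mongodb.spark:mongo-spark-connector_2.10:2.0.0",
    "com.databricks:spark-csv_2.10:1.3.0",
]
# spark.jars.packages takes a single comma-separated string.
conf_line = "spark.jars.packages=" + ",".join(packages)
print(conf_line)
```

The printed line can be pasted into `conf/spark-defaults.conf`.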

In [1]:
pwd

u'/home/jsl/Code/git/bb/jsl/pyds'

### sql, file


In [64]:
%%writefile src/ds_spark_sql.py
import pyspark
#conf = pyspark.SparkConf().setAppName("myAppName1")
#sc   = pyspark.SparkContext(conf=conf)
#sc.setLogLevel("ERROR")
def doIt():
    from operator import add
    lines = spark.sparkContext.textFile("README.md")
    word_count_bo = lines\
        .flatMap(lambda x: x.split(' '))\
        .map(lambda x: (x.lower().rstrip().lstrip().rstrip(',').rstrip('.'), 1))\
        .reduceByKey(add)
    print word_count_bo.count()

    #from pyspark.sql import SQLContext
    #sqlCtx = SQLContext(sc)
    d = [{'name': 'Alice', 'age': 1}]
    print spark.createDataFrame(d).collect()

if __name__ == "__main__":
    myConf=pyspark.SparkConf()
    spark = pyspark.sql.SparkSession.builder\
        .master("local")\
        .appName("myApp")\
        .config(conf=myConf)\
        .getOrCreate()
    doIt()
    spark.stop()
#spark.stop()

Overwriting src/ds_spark_sql.py


* Before running spark-submit, 'conf/log4j.properties' was edited to set the log level to ERROR.
```
log4j.rootCategory=ERROR, console
```

In [65]:
!/home/jsl/Downloads/spark-2.0.0-bin-hadoop2.7/bin/spark-submit src/ds_spark_sql.py

268
[Row(age=1, name=u'Alice')]


## MongoDB Spark connector

* Reference: MongoDB's documentation for the Spark connector, https://docs.mongodb.com/spark-connector/

* Edit spark-defaults.conf (for MongoDB < 3.2, spark.mongodb.input.partitioner must also be set)
```
$vim conf/spark-defaults.conf 
spark.jars.packages=org.mongodb.spark:mongo-spark-connector_2.10:1.1.0
spark.mongodb.input.partitioner=MongoPaginateBySizePartitioner
```

### MongoDB Python API Basics

* Writing to MongoDB
    * Create a DataFrame
```
spark.createDataFrame()
```
    * Save the DataFrame to MongoDB
```
write.format("com.mongodb.spark.sql.DefaultSource").mode("overwrite").save()
```

* Reading from MongoDB (a collection becomes a DataFrame)
    * The database and collection are set through spark.mongodb.input.uri.
    * The format is fixed to "com.mongodb.spark.sql.DefaultSource".
```
spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
```
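
The connector takes the database and collection from the path part of the URI, in the form `mongodb://host/database.collection`. The helper `split_mongo_uri` below is a hypothetical plain-Python sketch that only illustrates this layout; it is not part of the connector and ignores credentials and ports.

```python
def split_mongo_uri(uri):
    """Split 'mongodb://host/database.collection?options' into its three parts."""
    rest = uri.split("://", 1)[1]                 # drop the scheme
    host, _, namespace = rest.partition("/")      # host, then 'database.collection?...'
    namespace = namespace.split("?", 1)[0]        # drop options such as readPreference
    database, _, collection = namespace.partition(".")
    return host, database, collection

# URI used for both input and output in src/ds_spark_mongo.py
parts = split_mongo_uri("mongodb://127.0.0.1/myDB.ds_spark_df_mongo")
print(parts)  # ('127.0.0.1', 'myDB', 'ds_spark_df_mongo')
```

Because input and output URIs are configured separately, reading and writing can target different collections.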


In [71]:
%%writefile src/ds_spark_mongo.py
import pyspark
#conf = pyspark.SparkConf().setAppName("myAppName1")
#sc   = pyspark.SparkContext(conf=conf)
#sc.setLogLevel("ERROR")
def doIt():
    print "---------RESULT-----------"
    print "------mongodb write-------"
    myRdd = spark.sparkContext.parallelize([
        ("js", 150),
        ("Gandalf", 1000),
        ("Thorin", 195),
        ("Balin", 178),
        ("Kili", 77),
        ("Dwalin", 169),
        ("Oin", 167),
        ("Gloin", 158),
        ("Fili", 82),
        ("Bombur", None)
    ])
    myDf = spark.createDataFrame(myRdd, ["name", "age"])
    print myDf
    myDf.write.format("com.mongodb.spark.sql.DefaultSource").mode("overwrite").save()
    print "---------read-----------"
    df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
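    df.printSchema()  # printSchema() prints by itself and returns None
    df.createOrReplaceTempView("myTable")  # registerTempTable is deprecated since Spark 2.0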
    myTab = spark.sql("SELECT name, age FROM myTable WHERE age >= 100")
    myTab.show()

if __name__ == "__main__":
    myConf=pyspark.SparkConf()
    spark = pyspark.sql.SparkSession.builder\
        .master("local")\
        .appName("myApp")\
        .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/myDB.ds_spark_df_mongo") \
        .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/myDB.ds_spark_df_mongo") \
        .getOrCreate()
    doIt()
    spark.stop()
#spark.stop()

Overwriting src/ds_spark_mongo.py


In [73]:
!/home/jsl/Downloads/spark-2.0.0-bin-hadoop2.7/bin/spark-submit src/ds_spark_mongo.py

Ivy Default Cache set to: /home/jsl/.ivy2/cache
The jars for the packages stored in: /home/jsl/.ivy2/jars
:: loading settings :: url = jar:file:/home/jsl/Downloads/spark-2.0.0-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
graphframes#graphframes added as a dependency
org.mongodb.spark#mongo-spark-connector_2.10 added as a dependency
com.databricks#spark-csv_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
	confs: [default]
	found graphframes#graphframes;0.1.0-spark1.6 in spark-packages
	found org.mongodb.spark#mongo-spark-connector_2.10;2.0.0 in central
	found org.mongodb#mongo-java-driver;3.2.2 in central
	found com.databricks#spark-csv_2.10;1.3.0 in central
	found org.apache.commons#commons-csv;1.1 in central
	found com.univocity#univocity-parsers;1.5.1 in central
downloading https://repo1.maven.org/maven2/org/mongodb/spark/mongo-spark-connector_2.10/2.0.0/mongo-spark-connector_2.10-2.0.0.jar ...
	[SUCCE

### My example: mongodb twitter

When reading nested JSON, a nested field is addressed by its dotted path, for example
CardSubwayStatisticsService.row.RIDE_PASGR_NUM
```
$ mongo
> use ds_rest_subwayPassengers_mongo_db
switched to db ds_rest_subwayPassengers_mongo_db
> show tables
db_rest_subway
system.indexes
> db.db_rest_subway.find().limit(1)
{ "_id" : ObjectId("57fa386ff5e6e94359c033e9"), "CardSubwayStatisticsService" : { "row" : [ { "COMMT" : "", "RIDE_PASGR_NUM" : 111275, "WORK_DT" : "20130723", "LINE_NUM" : "중앙선", "SUB_STA_NM" : "용문", "ALIGHT_PASGR_NUM" : 108878, "USE_MON" : "201306" }, { "COMMT" : "", "RIDE_PASGR_NUM" : 11495, "WORK_DT" : "20130723", "LINE_NUM" : "중앙선", "SUB_STA_NM" : "원덕", "ALIGHT_PASGR_NUM" : 10964, "USE_MON" : "201306" }, { "COMMT" : "", "RIDE_PASGR_NUM" : 118103, "WORK_DT" : "20130723", "LINE_NUM" : "중앙선", "SUB_STA_NM" : "양평", "ALIGHT_PASGR_NUM" : 116604, "USE_MON" : "201306" }, { "COMMT" : "", "RIDE_PASGR_NUM" : 10590, "WORK_DT" : "20130723", "LINE_NUM" : "중앙선", "SUB_STA_NM" : "오빈", "ALIGHT_PASGR_NUM" : 10020, "USE_MON" : "201306" }, { "COMMT" : "", "RIDE_PASGR_NUM" : 26304, "WORK_DT" : "20130723", "LINE_NUM" : "중앙선", "SUB_STA_NM" : "아신", "ALIGHT_PASGR_NUM" : 26358, "USE_MON" : "201306" } ], "RESULT" : { "MESSAGE" : "정상 처리되었습니다", "CODE" : "INFO-000" }, "list_total_count" : 530 } }
```
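
The dotted path walks into the nested document, and because `row` is an array it yields one value per array element. A plain-Python sketch on a trimmed copy of the document above (two of the five rows, fewer fields):

```python
# Trimmed copy of the MongoDB document shown above.
doc = {
    "CardSubwayStatisticsService": {
        "row": [
            {"USE_MON": "201306", "RIDE_PASGR_NUM": 111275, "ALIGHT_PASGR_NUM": 108878},
            {"USE_MON": "201306", "RIDE_PASGR_NUM": 11495, "ALIGHT_PASGR_NUM": 10964},
        ],
        "list_total_count": 530,
    }
}
# Equivalent of the path CardSubwayStatisticsService.row.RIDE_PASGR_NUM
rides = [r["RIDE_PASGR_NUM"] for r in doc["CardSubwayStatisticsService"]["row"]]
print(rides)  # [111275, 11495]
```

This is why the query result for one document is a single row holding a list of values rather than one row per station.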

In [15]:
%%writefile src/ds_spark_twitter.py
import pyspark
conf = pyspark.SparkConf().setAppName("myAppName")
conf.set("spark.mongodb.input.uri","mongodb://127.0.0.1/ds_rest_subwayPassengers_mongo_db.db_rest_subway?readPreference=primaryPreferred")
conf.set("spark.mongodb.output.uri","mongodb://127.0.0.1/ds_rest_subwayPassengers_mongo_db.db_rest_subway")
sc = pyspark.SparkContext(conf=conf)
#sc = pyspark.SparkContext()
sc.setLogLevel("ERROR")
print sc._conf.getAll()
sqlContext = pyspark.sql.SQLContext(sc)
print "---------read-----------"
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()  # prints by itself; wrapping it in print would also print None
df.registerTempTable("myTwitter")
myTab = sqlContext.sql("SELECT CardSubwayStatisticsService.row.RIDE_PASGR_NUM FROM myTwitter")
print type(myTab)
myTab.show()

Overwriting src/ds_spark_twitter.py


In [16]:
!/home/jsl/Downloads/spark-1.6.0-bin-hadoop2.6/bin/spark-submit src/ds_spark_twitter.py

Ivy Default Cache set to: /home/jsl/.ivy2/cache
The jars for the packages stored in: /home/jsl/.ivy2/jars
:: loading settings :: url = jar:file:/home/jsl/Downloads/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
graphframes#graphframes added as a dependency
org.mongodb.spark#mongo-spark-connector_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
	confs: [default]
	found graphframes#graphframes;0.1.0-spark1.6 in spark-packages
	found org.mongodb.spark#mongo-spark-connector_2.10;1.1.0 in central
	found org.mongodb#mongo-java-driver;3.2.2 in central
:: resolution report :: resolve 144ms :: artifacts dl 4ms
	:: modules in use:
	graphframes#graphframes;0.1.0-spark1.6 from spark-packages in [default]
	org.mongodb#mongo-java-driver;3.2.2 from central in [default]
	org.mongodb.spark#mongo-spark-connector_2.10;1.1.0 from central in [default]
	------------------------------------------

* All-in-one: the same task as above, run directly in the notebook instead of through spark-submit.
* Take care not to create a second SparkContext (`sc`).

In [3]:
import os
import findspark

home=os.getenv("HOME")
spark_home=os.path.join(home,"Downloads/spark-1.6.0-bin-hadoop2.6")
findspark.init(spark_home)

import pyspark
conf = pyspark.SparkConf().setAppName("myAppName")
conf.set("spark.mongodb.input.uri","mongodb://127.0.0.1/ds_rest_subwayPassengersDb.db_rest_subwayTable?readPreference=primaryPreferred")
conf.set("spark.mongodb.output.uri","mongodb://127.0.0.1/ds_rest_subwayPassengersDb.db_rest_subwayTable")
sc = pyspark.SparkContext(conf=conf)
#sc = pyspark.SparkContext()
print sc._conf.getAll()
print sc._conf.get("spark.jars.packages")
sc.setLogLevel("ERROR")
sqlContext = pyspark.sql.SQLContext(sc)
print "---------read-----------"
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()  # prints by itself; wrapping it in print would also print None
df.registerTempTable("myTwitter")
myTab = sqlContext.sql("SELECT CardSubwayStatisticsService.row.RIDE_PASGR_NUM FROM myTwitter")
print type(myTab)
myTab.show()

[(u'spark.app.name', u'myAppName'), (u'spark.mongodb.input.uri', u'mongodb://127.0.0.1/ds_rest_subwayPassengersDb.db_rest_subwayTable?readPreference=primaryPreferred'), (u'spark.submit.pyFiles', u'/home/jsl/.ivy2/jars/graphframes_graphframes-0.1.0-spark1.6.jar,/home/jsl/.ivy2/jars/org.mongodb.spark_mongo-spark-connector_2.10-1.1.0.jar,/home/jsl/.ivy2/jars/com.databricks_spark-csv_2.10-1.3.0.jar,/home/jsl/.ivy2/jars/org.mongodb_mongo-java-driver-3.2.2.jar,/home/jsl/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar,/home/jsl/.ivy2/jars/com.univocity_univocity-parsers-1.5.1.jar'), (u'spark.rdd.compress', u'True'), (u'spark.serializer.objectStreamReset', u'100'), (u'spark.master', u'local[*]'), (u'spark.mongodb.output.uri', u'mongodb://127.0.0.1/ds_rest_subwayPassengersDb.db_rest_subwayTable'), (u'spark.submit.deployMode', u'client'), (u'spark.jars', u'file:/home/jsl/.ivy2/jars/graphframes_graphframes-0.1.0-spark1.6.jar,file:/home/jsl/.ivy2/jars/org.mongodb.spark_mongo-spark-connector_2.10

In [23]:
print myTab.first()
print myTab.head()

Row(RIDE_PASGR_NUM=[111275.0, 11495.0, 118103.0, 10590.0, 26304.0])
Row(RIDE_PASGR_NUM=[111275.0, 11495.0, 118103.0, 10590.0, 26304.0])


## Further exercises

* Spark MySql
```
spark.driver.extraLibraryPath
os.environ['SPARK_CLASSPATH'] =r"/home/jsl/sd/lib/mysql-connector-java-5.1.40/mysql-connector-java-5.1.40-bin.jar"
```

* postgresql (not yet tested)
```
test.write.jdbc(
    url="jdbc:postgresql://localhost:5432/db", 
    table="test", 
    mode="overwrite", 
    properties={
        "user":"root", 
        "password":"12345", 
        "driver":"org.postgresql.Driver", 
        "client_encoding":"utf8"
   }
)
```