### Dataframes 
DataFrames are built on top of RDDs. They store two-dimensional data, similar to the data stored in a spreadsheet. Each column in a DataFrame can have a different type, and each row contains a `record`.

Spark DataFrames are similar to, but not the same as, `pandas` DataFrames. The important difference is that Spark DataFrames are **distributed** data structures, based on RDDs.

In [67]:
import os
import sys

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import Row, StructField, StructType, StringType, IntegerType, BinaryType

In [2]:
# Just like using Spark requires having a SparkContext, using SQL requires an SQLContext
sqlContext = SQLContext(sc)
sqlContext

<pyspark.sql.context.SQLContext at 0x1087174d0>

In [3]:
# One way to create a DataFrame is to first define an RDD from a list of rows
some_rdd = sc.parallelize([Row(name=u"John", age=19),
                           Row(name=u"Smith", age=23),
                           Row(name=u"Sarah", age=18)])
some_rdd.collect()

[Row(age=19, name=u'John'),
 Row(age=23, name=u'Smith'),
 Row(age=18, name=u'Sarah')]

In [4]:
# The DataFrame is created from the RDD of Rows
# Infer schema from the first row, create a DataFrame and print the schema
some_df = sqlContext.createDataFrame(some_rdd)
some_df.printSchema()

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)



In [5]:
# A dataframe is an RDD of rows plus information on the schema.
# In our case, the content of the RDD is the same as the content of the dataframe. 
print type(some_rdd),type(some_df)
print 'some_df =',some_df.collect()
print 'some_rdd=',some_rdd.collect()

<class 'pyspark.rdd.RDD'> <class 'pyspark.sql.dataframe.DataFrame'>
some_df = [Row(age=19, name=u'John'), Row(age=23, name=u'Smith'), Row(age=18, name=u'Sarah')]
some_rdd= [Row(age=19, name=u'John'), Row(age=23, name=u'Smith'), Row(age=18, name=u'Sarah')]


### Example of creating a DataFrame from an RDD with an explicit schema

In [6]:
# In this case we create the dataframe from an RDD of tuples (rather than Rows) and provide the schema explicitly
another_rdd = sc.parallelize([("John", 19), ("Smith", 23), ("Sarah", 18)])
# Schema with two fields - person_name and person_age
schema = StructType([StructField("person_name", StringType(), False),
                     StructField("person_age", IntegerType(), False)])

# Create a DataFrame by applying the schema to the RDD and print the schema
another_df = sqlContext.createDataFrame(another_rdd, schema)
another_df.printSchema()
# Expected schema: person_name (string) and person_age (integer), both non-nullable

root
 |-- person_name: string (nullable = false)
 |-- person_age: integer (nullable = false)



### What formats does spark-sql support?

According to [this post](https://databricks.com/blog/2015/01/09/spark-sql-data-sources-api-unified-data-access-for-the-spark-platform.html), Spark supports many formats. However, I could not find definitive documentation on which formats are supported.

In terms of syntax, you specify the format when reading:
```python
sqlContext.read.format('json').load('python/test_support/sql/people.json')
```
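Instead of `json` you can use `parquet`, `text`, and supposedly other formats, but I could not find an authoritative list of formats. It seems that `csv` is not supported in the Spark version used here (built-in `csv` support was added in Spark 2.0), so the CSV file below is read as plain text lines.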

In [21]:
RDD=sc.textFile('../../Data/example.csv')
RDD.collect()

[u'Col1, Col2, Col3', u'11,12,13', u'21,22,23', u'31,32,33']
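The text lines returned above still need to be split into fields. The following is a minimal sketch, not part of the original notebook, that parses the same four lines with the standard library `csv` module. The names `lines`, `header`, and `records` are illustrative; in Spark, similar per-line parsing could presumably be applied to the RDD instead of to a local list.

```python
import csv

# The lines as returned by RDD.collect() above (copied here as plain strings)
lines = ['Col1, Col2, Col3', '11,12,13', '21,22,23', '31,32,33']

# skipinitialspace drops the blank that follows each comma in the header line
reader = csv.reader(lines, skipinitialspace=True)

# The first line holds the column names
header = next(reader)

# Each remaining line becomes a dict that maps column name to an integer value
records = [dict(zip(header, (int(v) for v in row))) for row in reader]

print(header)
print(records)
```

Each dict plays the role of one record; the column names come from the file rather than from an explicit schema.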

### Parquet files
[Parquet](http://parquet.apache.org/) is a columnar format that is supported by many other data processing systems. Spark SQL supports both reading and writing Parquet files, and automatically preserves the schema of the original data.

In [9]:
dir='../../Data'
parquet_file=dir+"/users.parquet"
!ls -ld $dir/*.parquet
!rm -rf ../../Data/namesAndFavColors.parquet

drwxr-xr-x  10 yoavfreund  staff  340 Apr 24 11:05 ../../Data/namesAndFavColors.parquet
-rw-r--r--   1 yoavfreund  staff  615 Apr 24 11:05 ../../Data/users.parquet


In [10]:
#load a Parquet file
df = sqlContext.read.load(parquet_file)
df.show()

+------+--------------+----------------+
|  name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa|          null|  [3, 9, 15, 20]|
|   Ben|           red|              []|
+------+--------------+----------------+



In [13]:
df.printSchema()

root
 |-- name: string (nullable = false)
 |-- favorite_color: string (nullable = true)
 |-- favorite_numbers: array (nullable = false)
 |    |-- element: integer (containsNull = false)



In [11]:
df2=df.select("name", "favorite_color")
df2.show()

+------+--------------+
|  name|favorite_color|
+------+--------------+
|Alyssa|          null|
|   Ben|           red|
+------+--------------+



In [12]:
df2.write.save(dir+"/namesAndFavColors.parquet")
!ls -ld $dir/*.parquet

drwxr-xr-x  10 yoavfreund  staff  340 Aug  8 14:15 ../../Data/namesAndFavColors.parquet
-rw-r--r--   1 yoavfreund  staff  615 Apr 24 11:05 ../../Data/users.parquet


### Storing numpy arrays as byte arrays inside a dataframe

In [218]:
import numpy as np
"""Code for packing and unpacking a numpy array into a byte array.
   The array is flattened if it is not 1D.

   This code is intended to be used to store numpy arrays as fields in a dataframe and then store the
   dataframes in a parquet file.
"""

def packArray(a):
    # Serialize a numpy array into a bytearray; shape and dtype are not stored
    if not isinstance(a, np.ndarray):
        raise Exception("input to packArray should be numpy.ndarray. It is instead "+str(type(a)))
    return bytearray(a.tobytes())
def unpackArray(x,data_type=np.int16):
    # data_type must match the dtype of the array that was packed
    return np.frombuffer(x,dtype=data_type)

In [219]:
packArray(np.array([1,2,3]))

bytearray(b'\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00')
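The 24 bytes above look like three 8-byte little-endian integers, which suggests numpy chose a 64-bit integer type for `np.array([1,2,3])` on this machine. The sketch below, which is not part of the original notebook, reproduces that layout with the standard library `struct` module and shows what happens when the same buffer is read with a 16-bit type, the default `data_type` of `unpackArray`. The 64-bit little-endian layout is an assumption based on the output shown.

```python
import struct

# Three signed 64-bit little-endian integers (assumed layout of np.array([1, 2, 3]))
buf = bytearray(struct.pack('<3q', 1, 2, 3))

# Reading with the matching 64-bit type recovers the original values
as_int64 = struct.unpack('<3q', buf)

# Reading the same bytes as 16-bit integers yields 12 values instead of 3
as_int16 = struct.unpack('<12h', buf)

print(len(buf))
print(as_int64)
print(as_int16)
```

This is why the later cells pass `data_type` explicitly to `unpackArray`: the buffer itself does not record which type was packed.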

In [220]:
import numpy as np
n=100; m=100
L=[]
data_type=np.float128
for i in range(m):
    A=np.round(np.random.random([n,n])*1000)
    A=np.array(A,dtype=data_type)
    B=packArray(A)
    L.append(Row(image=B,j=i))

    if i==0:
        print A
        print 'length of buffer=',len(B)
        print 'type of buffer=',type(B)

        print 'mapping buffer back:',unpackArray(B,data_type=data_type)

schema = StructType([StructField("image", BinaryType(), True),
                     StructField("j", IntegerType(), False)])

RDD=sc.parallelize(L)
print 'number of elements in RDD=',RDD.count()
df=sqlContext.createDataFrame(RDD, schema)
print 'number of elements in dataframe=',df.count()

[[ 697.0  495.0  211.0 ...,  705.0  932.0  481.0]
 [ 979.0  250.0  797.0 ...,  305.0  617.0  457.0]
 [ 482.0  224.0  911.0 ...,  727.0  295.0  70.0]
 ..., 
 [ 648.0  120.0  596.0 ...,  231.0  780.0  284.0]
 [ 426.0  486.0  325.0 ...,  526.0  377.0  395.0]
 [ 353.0  714.0  806.0 ...,  255.0  81.0  374.0]]
length of buffer= 160000
type of buffer= <type 'bytearray'>
mapping buffer back: [ 697.0  495.0  211.0 ...,  255.0  81.0  374.0]
number of elements in RDD= 100
number of elements in dataframe= 100


In [224]:
parquet_file=dir+"/dataFrameWithNumpy.parquet"
!rm -rf $parquet_file
df.write.save(parquet_file)

!ls -l $parquet_file

total 16128
-rw-r--r--  1 yoavfreund  staff        0 Aug  8 19:00 _SUCCESS
-rw-r--r--  1 yoavfreund  staff      284 Aug  8 19:00 _common_metadata
-rw-r--r--  1 yoavfreund  staff   640749 Aug  8 19:00 _metadata
-rw-r--r--  1 yoavfreund  staff  3801957 Aug  8 19:00 part-r-00000-3fe7db4e-7767-4175-a51c-6b3309096125.gz.parquet
-rw-r--r--  1 yoavfreund  staff  3801927 Aug  8 19:00 part-r-00001-3fe7db4e-7767-4175-a51c-6b3309096125.gz.parquet


In [226]:
df2 = sqlContext.read.load(parquet_file)

LX=df2.take(3)

for X in LX:
    C=X.image
    print X.j
    print len(C)
    print type(C)
    print type(B)
    print unpackArray(C,data_type=data_type)

0
160000
<type 'bytearray'>
<type 'bytearray'>
[ 697.0  495.0  211.0 ...,  255.0  81.0  374.0]
1
160000
<type 'bytearray'>
<type 'bytearray'>
[ 411.0  155.0  330.0 ...,  158.0  878.0  281.0]
2
160000
<type 'bytearray'>
<type 'bytearray'>
[ 880.0  280.0  404.0 ...,  821.0  880.0  375.0]
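The arrays read back above are flat: the 100 x 100 shape is lost because the buffer stores only the raw values. One possible way to keep the shape is to prepend it to the buffer. The sketch below is a hypothetical illustration using only the standard library `struct` module as a stand-in for numpy; the helper names `pack_with_shape` and `unpack_with_shape` and the header layout (two unsigned 32-bit integers) are assumptions, not part of the notebook. Storing the shape in separate dataframe columns would be an alternative.

```python
import struct

def pack_with_shape(rows):
    """Pack a 2D list of floats as a shape header followed by float64 values."""
    n_rows, n_cols = len(rows), len(rows[0])
    flat = [value for row in rows for value in row]
    # Header: two unsigned 32-bit little-endian integers (rows, columns)
    header = struct.pack('<2I', n_rows, n_cols)
    # Payload: the flattened values as 64-bit little-endian floats
    payload = struct.pack('<%dd' % len(flat), *flat)
    return bytearray(header + payload)

def unpack_with_shape(buf):
    """Read the shape header, then rebuild the 2D list from the flat payload."""
    n_rows, n_cols = struct.unpack_from('<2I', buf, 0)
    flat = struct.unpack_from('<%dd' % (n_rows * n_cols), buf, 8)
    return [list(flat[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]

matrix = [[697.0, 495.0, 211.0], [979.0, 250.0, 797.0]]
buf = pack_with_shape(matrix)
restored = unpack_with_shape(buf)

print(len(buf))
print(restored == matrix)
```

With numpy, the equivalent of the last step would be reshaping the flat array returned by `unpackArray` once the shape is known.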
