## Intro to Spark


### what is Spark?

http://spark.apache.org/

*Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.*

The nicest thing about Spark is PySpark, its Python interface.

Spark itself is written in Scala. Luckily, you don't have to learn Scala to use it.



### how does it work?

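It uses a data structure called an RDD (Resilient Distributed Dataset): an immutable collection of records split into partitions across the cluster. Spark keeps RDDs in memory where it can (which is where most of the speed-up over MapReduce comes from), spills to disk only when it has to, and rebuilds a lost partition by replaying the transformations that produced it.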
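To make the "lazy transformations plus replay" idea concrete, here is a toy model in plain Python. It is only a sketch: `ToyRDD` and its methods are made-up names for illustration, not Spark's implementation, and everything runs in one process.

```python
class ToyRDD:
    """Toy stand-in for an RDD: fixed source partitions plus a recorded chain of steps."""

    def __init__(self, partitions, lineage=()):
        self.partitions = partitions    # list of lists, one per pretend worker
        self.lineage = tuple(lineage)   # transformations recorded so far, not yet run

    def map(self, fn):
        # Transformations only extend the lineage; no data is touched here.
        return ToyRDD(self.partitions, self.lineage + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self.partitions, self.lineage + (("filter", fn),))

    def compute_partition(self, index):
        # Replay the lineage on one partition. A lost partition could be
        # rebuilt the same way, without recomputing the others.
        rows = iter(self.partitions[index])
        for kind, fn in self.lineage:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)

    def collect(self):
        # Actions are what finally trigger the work.
        out = []
        for i in range(len(self.partitions)):
            out.extend(self.compute_partition(i))
        return out

    def count(self):
        return len(self.collect())


numbers = ToyRDD([[1, 2, 3], [4, 5, 6]])
evens_squared = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n)
result = evens_squared.collect()
```

Nothing is computed until `collect()` or `count()` is called, which mirrors how `count()`, `take()` and `show()` trigger the work in the examples later on.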




### using the Spark SQL shell instead of Hive

1. some pre-reqs
* start hive
* start the hive metastore

1. launch the Spark SQL shell instead of the Hive shell and run the same kind of commands
```
/data/spark15/bin/spark-sql
```

1. run the same queries you would run in Hive; they should come back faster
```
show tables;
select avg(age) from foo;
```


### use the Python interface

1. launch it 
```
/data/spark15/bin/pyspark
```

1. or launch ipython (much better)
```
IPYTHON=1 /data/spark15/bin/pyspark
```

1. examine the "Spark context variable" and the "Hive context variable"
```
sc
sqlContext
```

1. create the Hive context manually (a `HiveContext` is also a SQL context)
```
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
```

1. see what tables we have 
```
sqlContext.sql("show tables").show()
```

1. run a simple sql command
```
# sql() returns a DataFrame, not an RDD
df = sqlContext.sql("select * from user_info")
df.show()
```



### access HDFS directly bypassing Hive

1. read a text file directly from HDFS
```
f = sc.textFile("hdfs://localhost/user/w205/lab_3/user_data/userdata_lab.csv")
f.count()
f.take(5)
```

1. read the same file from HDFS without giving the full path (a relative path is resolved against your HDFS home directory, here `/user/w205`)
```
f = sc.textFile('lab_3/user_data/userdata_lab.csv')
f.count()
```

### access a local file directly, bypassing Hive and HDFS

1. read a local text file by giving a `file://` URI with the full path
```
f = sc.textFile('file:///home/w205/data/stadiums.csv')
f.count()
f.take(5)
```

2. **a local file always needs the full `file://` path**: a bare relative path is resolved against HDFS (as in the previous section), not the local disk
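If typing the full path gets tedious, you can build the URI from a relative path on the driver side. This is a small sketch; `local_uri` is a made-up helper name, and it assumes a POSIX filesystem like the lab machine.

```python
import os


def local_uri(path):
    """Turn a local path (relative or absolute) into a file:// URI."""
    # abspath resolves a relative path against the current working directory
    return "file://" + os.path.abspath(path)


uri = local_uri("/home/w205/data/stadiums.csv")
relative_uri = local_uri("data/stadiums.csv")

# In the pyspark shell, where `sc` already exists:
#   f = sc.textFile(local_uri("data/stadiums.csv"))
```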

*Question: when and why would you want to read local vs HDFS files?*

### let's write a little code

1. read in a long list of words
```
words = sc.textFile("file:///usr/share/dict/words")
words.count()
```

1. use the **filter** function
```
some_words = words.filter(lambda w: w.startswith("spark"))
some_words.count()
some_words.take(7)
```
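The `lambda` passed to `filter` is just a function that returns True or False for each element, so a named function works too. A small sketch, where `starts_with` is a made-up helper and the case-insensitive matching is an addition that the `lambda` above does not have:

```python
def starts_with(prefix):
    """Build a predicate that is True for words beginning with `prefix`, ignoring case."""
    prefix = prefix.lower()

    def predicate(word):
        return word.lower().startswith(prefix)

    return predicate


# Try the predicate on a plain Python list first.
sample = ["spark", "Sparkle", "sparrow", "park"]
matches = list(filter(starts_with("spark"), sample))

# In the pyspark shell the same predicate plugs straight into the RDD:
#   words.filter(starts_with("spark")).take(7)
```

Testing the predicate on a small list before handing it to Spark is a cheap way to catch mistakes without waiting on a job.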

### let's try a join in Spark

1. read in some files
```
rdd1 = sc.textFile('lab_3/user_data/userdata_lab.csv').map(lambda x: x.split("\t"))
rdd2 = sc.textFile('lab_3/weblog_data/weblog_lab.csv').map(lambda x: x.split("\t"))
```

1. convert to data frames

1. try a join ...
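One possible way to do the last two steps is sketched below. The helper names (`split_tabs`, `load_frame`, `join_frames`) are made up, and the column names in the usage comment are placeholders, since these notes don't say what the files contain. The code only uses `sc.textFile`, `sqlContext.createDataFrame` and `DataFrame.join`, and takes `sc` and `sqlContext` as arguments, so it can be pasted into the pyspark shell as is.

```python
def split_tabs(line):
    """Split one tab-delimited line into a list of field strings."""
    return line.rstrip("\n").split("\t")


def load_frame(sc, sqlContext, path, columns):
    """Read a tab-delimited file and turn it into a DataFrame with the given column names."""
    width = len(columns)
    rows = (sc.textFile(path)
              .map(split_tabs)
              .filter(lambda r: len(r) == width)   # drop lines with the wrong field count
              .map(tuple))
    return sqlContext.createDataFrame(rows, columns)


def join_frames(left, right, key):
    """Inner join of two DataFrames on a column they both have."""
    return left.join(right, left[key] == right[key])


# Usage in the pyspark shell (column names are hypothetical, adjust to the real files):
#   users = load_frame(sc, sqlContext, 'lab_3/user_data/userdata_lab.csv', ["user_id", "name"])
#   logs = load_frame(sc, sqlContext, 'lab_3/weblog_data/weblog_lab.csv', ["user_id", "url"])
#   join_frames(users, logs, "user_id").show()
```

All fields stay strings here; cast columns afterwards if you need numbers.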