# PySpark Q&A
Inspired by justmarkham/pandas-videos on GitHub.

## Table of Contents
+ [Installing and running a simple Spark environment]
  + [Developing on Apache Zeppelin]
  + [Developing on Jupyter]
+ [What is pyspark?]



### Datasets

Filename | Description | Raw File | Original Source | Other
--- | --- | --- | --- | ---
[chipotle.tsv](data/chipotle.tsv) | Online orders from the Chipotle restaurant chain | [bit.ly/chiporders](http://bit.ly/chiporders) | [The Upshot](https://github.com/TheUpshot/chipotle) | [Upshot article](http://www.nytimes.com/interactive/2015/02/17/upshot/what-do-people-actually-order-at-chipotle.html)
[drinks.csv](data/drinks.csv) | Alcohol consumption by country | [bit.ly/drinksbycountry](http://bit.ly/drinksbycountry) | [FiveThirtyEight](https://github.com/fivethirtyeight/data/tree/master/alcohol-consumption) | [FiveThirtyEight article](http://fivethirtyeight.com/datalab/dear-mona-followup-where-do-people-drink-the-most-beer-wine-and-spirits/)
[imdb_1000.csv](data/imdb_1000.csv) | Top rated movies from IMDb | [bit.ly/imdbratings](http://bit.ly/imdbratings) | [IMDb](http://www.imdb.com/search/title?groups=top_1000&sort=user_rating&view=simple) | [Web scraping script](https://github.com/justmarkham/DAT5/blob/master/code/08_web_scraping.py)
[titanic_test.csv](data/titanic_test.csv) | Testing set from Kaggle's Titanic competition | [bit.ly/kaggletest](http://bit.ly/kaggletest) | [Kaggle](https://www.kaggle.com/c/titanic) | [Data dictionary](https://www.kaggle.com/c/titanic/data)
[titanic_train.csv](data/titanic_train.csv) | Training set from Kaggle's Titanic competition | [bit.ly/kaggletrain](http://bit.ly/kaggletrain) | [Kaggle](https://www.kaggle.com/c/titanic) | [Data dictionary](https://www.kaggle.com/c/titanic/data)
[u.user](data/u.user) | Demographic information about MovieLens users | [bit.ly/movieusers](http://bit.ly/movieusers) | [GroupLens](http://grouplens.org/datasets/movielens/100k/) | [Data dictionary](http://files.grouplens.org/datasets/movielens/ml-100k-README.txt)
[ufo.csv](data/ufo.csv) | Reports of UFO sightings from 1930-2000 | [bit.ly/uforeports](http://bit.ly/uforeports) | [National UFO Reporting Center](http://www.nuforc.org/webreports.html) | [Web scraping script](https://github.com/josiahdavis/josiahdavis.github.io/blob/master/supporting%20material/get_ufo_data.py)


### 0a. Installing and running with Apache Zeppelin

If you already have a Spark setup, we suggest downloading and running Zeppelin as described in
https://zeppelin.apache.org/docs/0.7.3/install/install.html

If you want to develop Spark projects with relatively small data (or with limited resources), we suggest using the Zeppelin Docker image.

To persist the logs and notebook directories, mount them with Docker's volume option (`-v`):
```bash
docker run -p 8080:8080 --rm -v $PWD/logs:/logs -v $PWD/notebook:/notebook -e ZEPPELIN_LOG_DIR='/logs' -e ZEPPELIN_NOTEBOOK_DIR='/notebook' --name zeppelin apache/zeppelin:0.7.3
```



### 0b. Installing and running with Jupyter and Apache Toree

Follow these steps:

1. Download and Install JDK

```bash
wget -c --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u144-b01/090f390dda5b47b9b721c7dfaa008135/jdk-8u144-linux-x64.rpm
yum localinstall jdk-8u144-linux-x64.rpm
```
2. Download and Install Spark
```bash
wget -c https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz
mkdir -p ~/spark
tar -zxvf spark-2.2.0-bin-hadoop2.7.tgz -C ~/spark/
```
3. Set Environment variables

Add the following lines to your `.bash_profile` or `.bashrc`:
```bash
export SPARK_HOME=~/spark/spark-2.2.0-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
```
4. Test Spark Installation

You can start the Spark (Scala) shell and the Python shell with the commands below:
```bash
#To start spark shell
spark-shell
#To start python shell
pyspark
```

5. Download and Install Anaconda
```bash
wget -c https://repo.continuum.io/archive/Anaconda3-5.0.0.1-Linux-x86_64.sh
bash Anaconda3-5.0.0.1-Linux-x86_64.sh
```
6. Download and Install Apache Toree
```bash
wget -c https://dist.apache.org/repos/dist/dev/incubator/toree/0.2.0/snapshots/dev1/toree-pip/toree-0.2.0.dev1.tar.gz
pip install toree-0.2.0.dev1.tar.gz
jupyter toree install --spark_home=$SPARK_HOME --interpreters=Scala,PySpark,SQL --user
```
7. Start the Jupyter server (on a local machine)
```bash
jupyter notebook --no-browser
```
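
The installation steps above can be sanity-checked from Python before starting a notebook. The sketch below is not part of the original setup: `check_spark_install` is a hypothetical helper that only verifies that `SPARK_HOME` is set and that the `pyspark`, `spark-shell` and `spark-submit` launchers exist under its `bin` directory. It does not start Spark.

```python
import os


def check_spark_install(env=None):
    """Return a list of problems found with the Spark setup (empty if none).

    Hypothetical helper: it only checks that SPARK_HOME is set and that the
    launcher scripts exist under $SPARK_HOME/bin.
    """
    env = os.environ if env is None else env
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        return ["SPARK_HOME is not set"]

    spark_home = os.path.expanduser(spark_home)
    problems = []
    for launcher in ("pyspark", "spark-shell", "spark-submit"):
        path = os.path.join(spark_home, "bin", launcher)
        if not os.path.isfile(path):
            problems.append("missing launcher: " + path)
    return problems


if __name__ == "__main__":
    issues = check_spark_install()
    print("Spark setup looks fine" if not issues else "\n".join(issues))
```

An empty list means the environment variables from step 3 point at a directory that looks like a Spark distribution.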

### 1. What is PySpark?

- [what is spark?](https://databricks.com/spark/about)
- [spark main page](https://spark.apache.org/) 
- [pyspark documentation](https://spark.apache.org/docs/latest/api/python/index.html)

When working in an IDE, import the PySpark libraries explicitly, for example:
```python
from pyspark import SparkContext
```
When the Jupyter notebook is launched through `pyspark` with the following command, a SparkContext is already available (as `sc`):

```bash
MASTER="spark://127.0.0.1:7077" SPARK_EXECUTOR_MEMORY="8G" PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" ~/spark/bin/pyspark --master local[2]
```


To submit Spark jobs to a cluster, use `spark-submit`, which is in the `bin` directory of the Spark installation.

Simple submission
```bash
~/spark/bin/spark-submit examples/src/main/python/wordcount.py /etc/profile
```

Submission with parameters
```bash
export SPARK_MAJOR_VERSION=2
PYSPARK_PYTHON=/usr/local/bin/python2.7 \
spark-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 64 \
--driver-memory 10g \
--executor-memory 10g \
--executor-cores 4 \
--conf spark.yarn.maxAppAttempts=4 \
--conf spark.yarn.am.attemptFailuresValidityInterval=1h \
--conf spark.yarn.max.executor.failures=16 \
--conf spark.yarn.executor.failuresValidityInterval=1h \
--conf spark.task.maxFailures=8 \
--conf spark.speculation=true \
--conf spark.yarn.executor.memoryOverhead=6g \
--conf spark.driver.maxResultSize=6g \
--conf spark.executor.heartbeatInterval=20s \
--queue default \
examplecode.py
```
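
When the same job is submitted repeatedly with different settings, it can help to assemble the command from Python. The following is a sketch only: `build_spark_submit` is a hypothetical helper that formats the arguments shown above into a list. It does not check that Spark accepts the flags or property names it is given.

```python
import shlex


def build_spark_submit(script, master="yarn", deploy_mode="cluster",
                       options=None, conf=None):
    """Build the argument list for a spark-submit call.

    options: dict of plain flags, e.g. {"num-executors": 64}
    conf:    dict of Spark properties passed as --conf key=value
    """
    args = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    for flag, value in (options or {}).items():
        args += ["--" + flag, str(value)]
    for key, value in (conf or {}).items():
        args += ["--conf", "{}={}".format(key, value)]
    args.append(script)
    return args


cmd = build_spark_submit(
    "examplecode.py",
    options={"num-executors": 64, "executor-memory": "10g", "queue": "default"},
    conf={"spark.task.maxFailures": 8, "spark.speculation": "true"},
)
# print a shell-safe version of the command for inspection
print(" ".join(shlex.quote(a) for a in cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to launch the job, which avoids quoting problems that come with building one long shell string.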

### 2. How do I read a tabular data file into spark?

```python
# the file must be on a file system Spark can read (local or HDFS)
filename = './data/chipotle.tsv'
df = spark.read.csv(filename, header=True, inferSchema=True, sep="\t")
```

```python
# get the first row
df.head()
```

```
Row(order_id=1, quantity=1, item_name=u'Chips and Fresh Tomato Salsa', choice_description=u'NULL', item_price=u'$2.39 ')
```

```python
# display the first rows as a formatted table
df.show(4, truncate=True)
```

```
+--------+--------+--------------------+------------------+----------+
|order_id|quantity|           item_name|choice_description|item_price|
+--------+--------+--------------------+------------------+----------+
|       1|       1|Chips and Fresh T...|              NULL|    $2.39 |
|       1|       1|                Izze|      [Clementine]|    $3.39 |
|       1|       1|    Nantucket Nectar|           [Apple]|    $3.39 |
|       1|       1|Chips and Tomatil...|              NULL|    $2.39 |
+--------+--------+--------------------+------------------+----------+
only showing top 4 rows
```



```python
# files without a header row
userfilename = './data/u.user'
userdf = spark.read.csv(userfilename, header=False, inferSchema=True, sep="|")
```

```python
userdf.show()
```

```
+---+---+---+-------------+-----+
|_c0|_c1|_c2|          _c3|  _c4|
+---+---+---+-------------+-----+
|  1| 24|  M|   technician|85711|
|  2| 53|  F|        other|94043|
|  3| 23|  M|       writer|32067|
|  4| 24|  M|   technician|43537|
|  5| 33|  F|        other|15213|
|  6| 42|  M|    executive|98101|
|  7| 57|  M|administrator|91344|
|  8| 36|  M|administrator|05201|
|  9| 29|  M|      student|01002|
| 10| 53|  M|       lawyer|90703|
| 11| 39|  F|        other|30329|
| 12| 28|  F|        other|06405|
| 13| 47|  M|     educator|29206|
| 14| 45|  M|    scientist|55106|
| 15| 49|  F|     educator|97301|
| 16| 21|  M|entertainment|10309|
| 17| 30|  M|   programmer|06355|
| 18| 35|  F|        other|37212|
| 19| 40|  M|    librarian|02138|
| 20| 42|  F|    homemaker|95660|
+---+---+---+-------------+-----+
only showing top 20 rows
```



```python
# column names to use when changing the headers
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
```

```python
userdf.show(5)
```

```
+---+---+---+----------+-----+
|_c0|_c1|_c2|       _c3|  _c4|
+---+---+---+----------+-----+
|  1| 24|  M|technician|85711|
|  2| 53|  F|     other|94043|
|  3| 23|  M|    writer|32067|
|  4| 24|  M|technician|43537|
|  5| 33|  F|     other|15213|
+---+---+---+----------+-----+
only showing top 5 rows
```

