Reference: https://medium.com/swlh/building-a-big-data-pipeline-with-airflow-spark-and-zeppelin-843f31ef220c

## Airflow


### [QuickStart](https://airflow.apache.org/docs/stable/start.html)

```
sudo apt-get update
sudo apt-get install build-essential
```



```
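# create a virtualenv
cd ~/projects/bigdata
python -m venv bigdata
source bigdata/bin/activate

# create the Airflow home and DAG folder
mkdir -p ~/airflow/dags
export AIRFLOW_HOME=~/airflow

# install from PyPI using pip
pip install apache-airflow

# NOTE: the commands below use the Airflow 1.10.x CLI.
# Airflow 2.x renamed them: `airflow db init`, `airflow dags list`,
# `airflow tasks list <dag_id>`, `airflow tasks test <dag_id> <task_id> <date>`.

# initialize the database
airflow initdb

# start the web server, default port is 8080 (runs in the foreground)
airflow webserver -p 8080

# start the scheduler (in a second terminal, with the virtualenv active and AIRFLOW_HOME set)
airflow scheduler

# list DAGs, then the tasks of the tutorial DAG (flat and as a tree)
airflow list_dags
airflow list_tasks tutorial
airflow list_tasks tutorial --tree

# validate: with tutorial.py saved in the DAG folder, it should run without raising an exception
python ~/airflow/dags/tutorial.py

# test individual tasks for a given execution date
airflow test tutorial print_date 2015-06-01
airflow test tutorial sleep 2015-06-01
airflow test tutorial templated 2015-06-01

# bash_operator
python ~/airflow/dags/example_bash_operator.py list_tasks

# python_operator
airflow list_tasks example_python_operator
airflow test example_python_operator print_the_context 2020-01-01

# uninstall
pip uninstall apache-airflow

# postgresql
sudo apt update
sudo apt install postgresql postgresql-contrib
service postgresql status

# set a password for the postgres UNIX user (these notes enter "postgres" at both prompts)
sudo passwd postgres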
```
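The notes stop right after installing PostgreSQL. Assuming the goal is to use it as Airflow's metadata database (the notes do not say so), a minimal sketch for building the SQLAlchemy connection string might look like this. The helper `build_sql_alchemy_conn` and the `airflow` role, password, and database names are hypothetical and would have to be created in PostgreSQL first.

```python
from urllib.parse import quote_plus


def build_sql_alchemy_conn(user, password, host="localhost", port=5432, database="airflow"):
    """Return a SQLAlchemy URI for a PostgreSQL database via the psycopg2 driver."""
    # URL-encode the credentials so characters like '@' or ':' do not break the URI.
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


if __name__ == "__main__":
    # Hypothetical role, password, and database name (assumptions, not from the notes).
    print(build_sql_alchemy_conn("airflow", "airflow"))
```

In Airflow 1.10.x the printed value would go into `sql_alchemy_conn` under `[core]` in `$AIRFLOW_HOME/airflow.cfg`, followed by another `airflow initdb`. The `psycopg2` driver also has to be installed in the virtualenv, for example with `pip install psycopg2-binary`.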
 
### [Tutorial](https://airflow.apache.org/docs/stable/tutorial.html)


## Spark

```
pip install pyspark
```
- [Learning Spark v2](https://github.com/databricks/LearningSparkV2)
    - [Book](https://drive.google.com/file/d/1HhZrxAjGQ-D9cYEfrExBepKamKLtwUuV/view?usp=sharing)
    - [Source Code](https://github.com/databricks/LearningSparkV2)

- [Mastering Big Data Analytics with PySpark](https://github.com/PacktPublishing/Mastering-Big-Data-Analytics-with-PySpark)

- [Hands-On Big Data Analytics with PySpark](https://github.com/PacktPublishing/Hands-On-Big-Data-Analytics-with-PySpark)
    - [Preview/sample pages at Google Books](https://books.google.com/books/about/Hands_On_Big_Data_Analytics_with_PySpark.html?id=jc-PDwAAQBAJ)
    - Dataset: [scikit-learn kddcup99 dataset](https://figshare.com/articles/kddcup_data_gz/3830001)

- [Apache Spark for Big Data Analytics](https://medium.com/@christophberns/apache-spark-for-big-data-analytics-53b99185bf51)

- [Install Spark on Windows](https://medium.com/@GalarnykMichael/install-spark-on-windows-pyspark-4498a5d8d66c)

```
rem download winutils.exe, then place it in %HADOOP_HOME%\bin
curl -k -L -o winutils.exe https://github.com/steveloughran/winutils/raw/master/hadoop-2.7.1/bin/winutils.exe

setx SPARK_HOME C:\opt\spark\spark-3.0.1-bin-hadoop2.7
setx HADOOP_HOME C:\opt\spark\spark-3.0.1-bin-hadoop2.7
setx PYSPARK_DRIVER_PYTHON ipython
setx PYSPARK_DRIVER_PYTHON_OPTS notebook
setx JAVA_HOME C:\Java
```
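A quick way to confirm the variables above took effect is a small stdlib-only check, sketched below. `check_spark_env` is a hypothetical helper, not part of Spark or PySpark; it only verifies that the three directories exist and that `winutils.exe` sits under `HADOOP_HOME\bin`.

```python
import os
from pathlib import Path


def check_spark_env(env=None):
    """Return a list of problems found in the Spark-related environment variables."""
    env = os.environ if env is None else env
    problems = []
    for name in ("SPARK_HOME", "HADOOP_HOME", "JAVA_HOME"):
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not Path(value).is_dir():
            problems.append(f"{name} points to a missing directory: {value}")
    # Hadoop on Windows looks for winutils.exe in the bin folder of HADOOP_HOME.
    hadoop_home = env.get("HADOOP_HOME")
    if hadoop_home and not (Path(hadoop_home) / "bin" / "winutils.exe").is_file():
        problems.append("winutils.exe not found in HADOOP_HOME\\bin")
    return problems


if __name__ == "__main__":
    issues = check_spark_env()
    print("\n".join(issues) if issues else "Spark environment looks OK")
```

Run it from a newly opened terminal: `setx` stores the values for future sessions and does not change the shell it was typed in.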

## Zeppelin

### Install
https://zeppelin.apache.org/docs/0.7.3/install/install.html

```
tar -xvzf ~/Downloads/zeppelin-0.9.0-preview2-bin-all.tgz
cd zeppelin-0.9.0-preview2-bin-all
```

### Run
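From the Zeppelin install directory, run `bin/zeppelin-daemon.sh start` (or `stop` / `status`).

Then go to the Zeppelin home page at http://localhost:8080. The Airflow webserver above uses the same port, so run one of them on another port, for example `airflow webserver -p 8081`, or set `ZEPPELIN_PORT` in `conf/zeppelin-env.sh`.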
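
Before starting Zeppelin it can help to see whether something, such as the Airflow webserver from the earlier section, is already listening on port 8080. The sketch below uses only the standard library; `port_is_open` is a hypothetical helper name, and it only tests whether a TCP connection succeeds, not which service answers.

```python
import socket


def port_is_open(host="localhost", port=8080, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    # 8080 is the port these notes use for both the Airflow webserver and Zeppelin.
    state = "in use" if port_is_open() else "free"
    print(f"localhost:8080 is {state}")
```

The same call can be repeated after `bin/zeppelin-daemon.sh start` to confirm that the server came up.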
