## Why Apache Spark for Big Data?

1. Easy to use. Provides a high-level API that lets you focus on the content of the computation.
2. Fast, enabling interactive use and complex algorithms.
3. General engine. Combines multiple types of computations (SQL queries, text processing, and machine learning).

## Chapter 1: Introduction to Data Analysis with Spark

### What is Apache Spark?

1. Apache Spark is a cluster computing platform designed to be fast and general-purpose.
2. It can run computations in memory.
3. It is more efficient than MapReduce for complex applications.
4. It integrates closely with other Big Data tools.

### A Unified Stack

1. Spark Core - Task scheduling, memory management, RDD API
2. Spark SQL - Structured data
3. Spark Streaming - Live streams of data in real time
4. MLlib - Machine learning
5. GraphX - Graph processing
6. Cluster Managers - Standalone, YARN, Mesos

### Users of Spark

1. Data scientists
2. Engineers

## Chapter 2: Downloading Spark and Getting Started

The Spark shell lets us interact with data that is distributed on disk or in memory across many machines.
Spark provides both Scala and Python shells:

1. Scala shell: bin/spark-shell
2. Python shell (PySpark): bin/pyspark

### Changing the logging verbosity in the Spark shell

Make a copy of conf/log4j.properties.template called conf/log4j.properties and find the following line:  
log4j.rootCategory=INFO, console  
Then change it to:  
log4j.rootCategory=WARN, console
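
The copy-and-edit step above can also be scripted. The following is a minimal sketch, assuming a Spark release that ships the conf/log4j.properties.template file described above; the function names `quiet_log4j_config` and `write_log4j_config` are hypothetical helpers, not part of Spark.

```python
import re
from pathlib import Path


def quiet_log4j_config(template_text, level="WARN"):
    """Return the template text with the root logger level replaced by `level`."""
    # Only the level word after "log4j.rootCategory=" changes; ", console" is kept.
    return re.sub(
        r"^(log4j\.rootCategory=)\w+",
        r"\g<1>" + level,
        template_text,
        flags=re.MULTILINE,
    )


def write_log4j_config(conf_dir, level="WARN"):
    """Copy log4j.properties.template to log4j.properties with a new root level.

    `conf_dir` is the path to Spark's conf directory (an assumption about
    where Spark is installed on your machine).
    """
    conf_dir = Path(conf_dir)
    template = (conf_dir / "log4j.properties.template").read_text()
    (conf_dir / "log4j.properties").write_text(quiet_log4j_config(template, level))
```

For example, `write_log4j_config('/usr/local/spark/conf')` would apply the change for an installation under /usr/local/spark. Inside a shell that is already running, `sc.setLogLevel('WARN')` should have a similar effect for that session only.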

### Working with RDD

    lines = sc.textFile('file:///usr/local/spark/README.md')  # RDD with one element per line
    lines
    # file:///usr/local/spark/README.md MapPartitionsRDD[17] at textFile at NativeMethodAccessorImpl.java:0

    lines.count()  # number of lines in the file
    # 104

    lines.first()  # first line of the file
    # u'# Apache Spark'

### Introduction to Core Spark Concepts

Every Spark application consists of a driver program that launches various parallel operations on a cluster.  
Driver programs access Spark through a SparkContext object, which represents a connection to a computing cluster.  
Driver programs typically manage a number of executors, the processes that run those operations on the cluster's worker nodes.
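
As a rough illustration of a driver program launching a parallel operation, the sketch below filters the lines of a text file and counts the matches. It assumes an existing SparkContext `sc`, such as the one the shells create automatically; `count_matching_lines` is a hypothetical helper name, not a Spark API.

```python
def count_matching_lines(sc, path, keyword):
    """Count the lines of a text file that contain `keyword`.

    `sc` is an existing SparkContext; `path` and `keyword` are supplied
    by the caller.
    """
    lines = sc.textFile(path)  # RDD of strings, one per line
    # filter() only describes a new RDD; the lambda is shipped to the executors.
    matching = lines.filter(lambda line: keyword in line)
    # count() is what actually triggers the distributed computation.
    return matching.count()


# In the PySpark shell, where `sc` already exists:
# count_matching_lines(sc, 'file:///usr/local/spark/README.md', 'Python')
```

The driver only builds the description of the work. On a cluster, the function passed to `filter` runs on the executors, each handling a different part of the file.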

### Standalone Applications

In standalone applications, such as scripts, we have to initialize our own SparkContext.  
In Java and Scala, the application needs a Maven dependency on the spark-core artifact.  
In Python, applications are written as ordinary scripts and run with the bin/spark-submit script.

### Initializing a SparkContext

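    from pyspark import SparkConf, SparkContext

    # 'local' runs Spark on the local machine without connecting to a cluster;
    # the application name identifies the application in the cluster manager's UI.
    conf = SparkConf().setMaster('local').setAppName('My App')
    # sc = SparkContext(conf=conf)  # left commented out: a SparkContext is already running inside the IPython notebook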
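
Putting the pieces together, a standalone Python application might look like the sketch below. It is a minimal word count under stated assumptions: the script name `my_script.py`, the helpers `split_words` and `word_counts`, and the input path argument are all hypothetical, and the `pyspark` import sits inside `main` only so the helpers can be used without Spark installed.

```python
import sys
from operator import add


def split_words(line):
    """Split a line into lowercase words (plain Python, no Spark needed)."""
    return line.lower().split()


def word_counts(lines_rdd):
    """Return an RDD of (word, count) pairs for an RDD of text lines."""
    return (lines_rdd.flatMap(split_words)
                     .map(lambda word: (word, 1))
                     .reduceByKey(add))


def main(path):
    # Imported here so the helpers above stay usable without Spark installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster('local').setAppName('My App')
    sc = SparkContext(conf=conf)  # a standalone script creates its own context
    try:
        # take(10) returns up to ten (word, count) pairs to the driver.
        for word, count in word_counts(sc.textFile(path)).take(10):
            print(word, count)
    finally:
        sc.stop()  # shut down Spark when the application is done


if __name__ == '__main__':
    if len(sys.argv) != 2:
        print('usage: spark-submit my_script.py <path-to-text-file>')
    else:
        main(sys.argv[1])
```

Such a script would be launched with something like `bin/spark-submit my_script.py file:///usr/local/spark/README.md`. Unlike the notebook cell above, it creates the SparkContext itself because no shell has created one for it.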