# PySpark Tutorial 

## Basic idea

Source: [Tutorials Point](https://www.tutorialspoint.com/pyspark/pyspark_sparkcontext.htm)

- **SparkContext** is the entry point to any Spark functionality.
> SparkContext uses Py4J to launch a JVM and create a JavaSparkContext. In the PySpark shell, a SparkContext is already available as `sc`, so creating a second one in the same session raises an error.

![data flow](img/sparkcontext.jpg)

- [How to embed an image in a Jupyter notebook](https://stackoverflow.com/questions/32370281/how-to-embed-image-or-picture-in-jupyter-notebook-either-from-a-local-machine-o)
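
Because a session may already hold a context, a common way around the "new SparkContext won't work" problem is to ask for the existing one instead of constructing another. The sketch below assumes PySpark is installed; `get_context` is a hypothetical helper name, not part of PySpark.

```python
def get_context(master="local", app_name="first app"):
    """Return the active SparkContext, creating one only if none exists."""
    # Imported inside the function so the module loads without Spark installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster(master).setAppName(app_name)
    # getOrCreate returns the already-running context when there is one;
    # in that case the settings in conf are not applied to it.
    return SparkContext.getOrCreate(conf)
```

In a notebook this would be called as `sc = get_context()`, which is safe to re-run because it never starts a second context.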


## SparkContext class

The signature below lists the parameters that the `pyspark.SparkContext` constructor accepts.

```python
class pyspark.SparkContext(
   master = None,
   appName = None, 
   sparkHome = None, 
   pyFiles = None, 
   environment = None, 
   batchSize = 0, 
   serializer = PickleSerializer(), 
   conf = None, 
   gateway = None, 
   jsc = None, 
   profiler_cls = <class 'pyspark.profiler.BasicProfiler'>
)
```

### Parameters

- **master** − The URL of the cluster to connect to.
- **appName** − Name of your job.
- **sparkHome** − Spark installation directory.
- **pyFiles** − The .zip or .py files to send to the cluster and add to the PYTHONPATH.
- **environment** − Environment variables for the worker nodes.
- **batchSize** − The number of Python objects represented as a single Java object. Set 1 to disable batching, 0 to automatically choose the batch size based on object sizes, or -1 to use an unlimited batch size.
- **serializer** − RDD serializer.
- **conf** − A `SparkConf` object that sets all the Spark properties.
- **gateway** − Use an existing gateway and JVM; otherwise a new JVM is initialized.
- **jsc** − The JavaSparkContext instance.
- **profiler_cls** − A custom Profiler class used for profiling (the default is `pyspark.profiler.BasicProfiler`).
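
As a rough illustration of the `conf` parameter, the sketch below collects Spark properties as key/value pairs and hands them to the constructor through a `SparkConf`. The helper names `build_settings` and `make_context` are hypothetical, and the `spark.executor.memory` value is only an example; `make_context` assumes PySpark is installed.

```python
def build_settings(master="local", app_name="first app", extra=None):
    """Return Spark properties as a list of (key, value) pairs."""
    settings = [("spark.master", master), ("spark.app.name", app_name)]
    # Any additional properties are appended in sorted key order.
    settings.extend(sorted((extra or {}).items()))
    return settings


def make_context(settings):
    """Create a SparkContext from (key, value) pairs via SparkConf."""
    # Imported here so build_settings can be used without Spark installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAll(settings)
    return SparkContext(conf=conf)


settings = build_settings(extra={"spark.executor.memory": "1g"})
print(settings)
```

Passing everything through `conf` keeps the master URL and application name in one place, rather than splitting them between positional arguments and properties.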

Now let's try it out in Python.

In [1]:

```python
from pyspark import SparkContext
import os

cwd = os.getcwd()  # current working directory
print(cwd)
logFile = "spark_setup.md"
PATH = os.path.join(cwd, logFile)

sc = SparkContext("local", "first app")  # master URL and appName
data = sc.textFile(PATH).cache()
```

```
/home/wataru/spark
```


In [2]:

```python
numLines = data.count()  # total number of lines
# count the lines that contain "#" (Markdown headers in this file)
headers = data.filter(lambda s: "#" in s).count()
print("total %i lines, with %i headers" % (numLines, headers))
```

```
total 6 lines, with 3 headers
```
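
To sanity-check the Spark result, the same count and filter can be reproduced in plain Python on a small file. This is only a sketch: `count_lines_and_headers` is a hypothetical helper, and the six-line sample text is made up to stand in for `spark_setup.md`, whose real contents are not shown here.

```python
import os
import tempfile


def count_lines_and_headers(path):
    """Return (total lines, lines containing '#') for a text file."""
    total = 0
    headers = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            total += 1
            if "#" in line:  # same test as the Spark filter
                headers += 1
    return total, headers


# Hypothetical six-line sample standing in for spark_setup.md.
sample = "# Setup\n\n## Install\nsome text\n## Run\nmore text\n"
with tempfile.NamedTemporaryFile(
    "w", suffix=".md", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample)

num_lines, num_headers = count_lines_and_headers(tmp.name)
os.remove(tmp.name)
print("total %i lines, with %i headers" % (num_lines, num_headers))
```

Note that `"#" in s` matches any line containing a `#`, not only lines that start with one, so a stricter header count would use `s.startswith("#")` instead.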
