Historically, Apache Spark has had two core contexts available to the user: the SparkContext, exposed as sc, and the SQLContext, exposed as sqlContext. Each makes a variety of functions and information available. The sqlContext exposes most of the DataFrame functionality, while the sparkContext focuses on the Apache Spark engine itself.
In Apache Spark 2.x, however, there is a single entry point, the SparkSession, which covers the SQLContext functionality and gives access to the underlying SparkContext.

A SparkSession is generally created using the builder pattern, ending with getOrCreate(), which returns the existing session if one is already running. The builder accepts string-based configuration through config(key, value), and shortcuts exist for a number of common parameters. One of the more important shortcuts is enableHiveSupport(), which gives you access to Hive UDFs. It does not require a Hive installation, but it does require certain extra JARs (discussed in “Spark SQL Dependencies”). The enableHiveSupport() shortcut not only configures Spark SQL to use these Hive JARs, it also eagerly checks that they can be loaded, which leads to a clearer error message than setting the configuration values by hand. In general, prefer the shortcuts listed in the API docs when one exists, since the generic config interface performs no checking.

In [None]:
import findspark
findspark.init()

In [None]:
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .enableHiveSupport()
    .appName("PythonWordCount")
    .getOrCreate()
)


### Before 2.0, SparkContext was used for Spark Core functionality

In [None]:
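from pyspark import SparkConf, SparkContext

# Only one SparkContext can be active at a time, so reuse a running one
# (for example the one behind an existing SparkSession) instead of failing.
conf = SparkConf().setAppName("myapp").setMaster("local[2]")
sc = SparkContext.getOrCreate(conf=conf)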



### Before 2.0, SQLContext was used for Spark SQL functionality, and DataFrames were used through SQLContext

In [5]:
%%bash 
ls

Couldn't find program: 'bash'


In [7]:
%lsmagic


Available line magics:
%alias  %alias_magic  %autocall  %automagic  %autosave  %bookmark  %cd  %clear  %cls  %colors  %config  %connect_info  %copy  %ddir  %debug  %dhist  %dirs  %doctest_mode  %echo  %ed  %edit  %env  %gui  %hist  %history  %killbgscripts  %ldir  %less  %load  %load_ext  %loadpy  %logoff  %logon  %logstart  %logstate  %logstop  %ls  %lsmagic  %macro  %magic  %matplotlib  %mkdir  %more  %notebook  %page  %pastebin  %pdb  %pdef  %pdoc  %pfile  %pinfo  %pinfo2  %popd  %pprint  %precision  %profile  %prun  %psearch  %psource  %pushd  %pwd  %pycat  %pylab  %qtconsole  %quickref  %recall  %rehashx  %reload_ext  %ren  %rep  %rerun  %reset  %reset_selective  %rmdir  %run  %save  %sc  %set_env  %store  %sx  %system  %tb  %time  %timeit  %unalias  %unload_ext  %who  %who_ls  %whos  %xdel  %xmode

Available cell magics:
%%!  %%HTML  %%SVG  %%bash  %%capture  %%cmd  %%debug  %%file  %%html  %%javascript  %%js  %%latex  %%perl  %%prun  %%pypy  %%python  %%python2  %%python3  %%rub

In [2]:
!dir

 Volume in drive D is Data
 Volume Serial Number is 277C-C030

 Directory of D:\RomiCode\DataScience\Spark

05/24/17  07:09 AM    <DIR>          .
05/24/17  07:09 AM    <DIR>          ..
05/24/17  07:09 AM    <DIR>          .ipynb_checkpoints
05/24/17  07:09 AM             9,095 FindSpark, SparkSession and SparkContext.ipynb
05/24/17  06:39 AM            29,596 Introduction to Apache Spark on Databricks.ipynb
05/23/17  09:15 PM            80,076 PCA- PySpark.ipynb
01/14/17  09:46 PM            27,134 Practical ML.ipynb
01/17/17  04:22 PM            15,568 Spark ML - Test1.ipynb
05/22/17  09:36 PM               733 derby.log
05/22/17  09:36 PM    <DIR>          metastore_db
05/22/17  08:56 PM    <DIR>          spark-warehouse
               6 File(s)        162,202 bytes
               5 Dir(s)  32,024,481,792 bytes free


In [None]:
!pip install numpy
# grep is not available in the Windows shell used above; pip show works on any platform
!pip show pandas



In [None]:
$$ P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)} $$