# Connecting to Spark
 - The connection to Spark is established by a driver program.
	- The driver can be written in Java, Scala, Python, or R.
	- Python and R are high-level languages, so either is a convenient choice for writing the driver code.


In [None]:
import pyspark

PySpark version 2.4.1 is sufficient for the examples in these notes.

#### Sub-modules
- Structured data - pyspark.sql
- Streaming data - pyspark.streaming
- Machine learning - pyspark.mllib or pyspark.ml


#### Spark URL
- Remote cluster: connect using a Spark URL of the form spark://<IP address | DNS name>:<port>
- Example: spark://13.59.151.161:7077
- 7077 is the default port.

##### Local cluster: specify the number of cores to use
- local - only 1 core;
- local[4] - 4 cores; or
- local[*] - all available cores.
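
The master strings listed above follow a simple pattern, which the sketch below spells out. The function `master_url` is a hypothetical helper written for these notes, not part of PySpark; it only formats the string that would be passed to the builder's `master()` method.

```python
def master_url(host=None, port=7077, cores=None):
    """Build a Spark master string (hypothetical helper, not a PySpark API).

    host  - IP address or DNS name of a remote master; None means local mode
    port  - port of the remote master (7077 is the default port)
    cores - number of local cores, or "*" for all; None means a single core
    """
    if host is not None:
        return f"spark://{host}:{port}"
    if cores is None:
        return "local"
    return f"local[{cores}]"


print(master_url(host="13.59.151.161"))  # remote cluster on the default port
print(master_url())                      # local mode, 1 core
print(master_url(cores=4))               # local mode, 4 cores
print(master_url(cores="*"))             # local mode, all available cores
```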


In [None]:
# Connect to Spark
from pyspark.sql import SparkSession

# Create a local cluster using the SparkSession builder
spark = (
    SparkSession.builder
    .master('local[*]')                  # use all available cores
    .appName('first_spark_application')  # name the application
    .getOrCreate()                       # return the existing session or create a new one
)

# Close the connection to Spark
spark.stop()

##### Exercise: Creating a SparkSession
In this exercise, you'll spin up a local Spark cluster using all available cores. The cluster will be accessible via a SparkSession object.

The SparkSession class has a builder attribute, which is an instance of the Builder class. The Builder class exposes three important methods that let you:

- specify the location of the master node;
- name the application (optional); and
- retrieve an existing SparkSession or, if there is none, create a new one.

The SparkSession class has a version attribute, which gives the version of Spark.


Once you are finished with the cluster, it's a good idea to shut it down, which will free up its resources, making them available for other processes.


In [None]:
# Import the PySpark module
from pyspark.sql import SparkSession

# Create SparkSession object
spark = SparkSession.builder \
                    .master('local[*]') \
                    .appName('test') \
                    .getOrCreate()

# What version of Spark?
print(spark.version)

# Terminate the cluster
spark.stop()

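#### Reading data into Spark

 - A DataFrame holds tabular data.
 - Selected methods:
     - count() - number of rows
     - show() - displays a subset of rows
     - printSchema() - prints the column names and data types
 - Selected attributes:
     - dtypes - list of (column name, data type) pairs

 - CSV is the most commonly used format.
 - Fields are separated by commas.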
###### Read method in Spark: spark.read.csv()


In [None]:
# Read the CSV file of sample data (cars); assumes an active SparkSession named spark
cars = spark.read.csv('cars.csv', header=True)

# Commonly used optional arguments:
# - header - is the first row a header? (default: False)
# - sep - field separator (default: a comma ',')
# - schema - explicit column data types
# - inferSchema - deduce column data types from the data? (default: False)
# - nullValue - placeholder string that marks missing data

cars.show()  # display the first rows of the DataFrame


#### Cleansing the data using Spark

##### Check column types
- cars.printSchema() prints the schema of the DataFrame.
- The output lists each column's name and data type.




In [None]:
from pyspark.sql.types import StructType, StructField, StringType

# Infer the column types from the data
cars = spark.read.csv('cars.csv', header=True, inferSchema=True)
# With inferSchema=True, Spark takes an extra pass over the file to deduce each
# column's type; without it, every column is loaded as a string.

# Inspect the resulting data types
cars.dtypes

# Handle missing data by specifying nullValue='NA'
# Note: the nullValue string is matched case-sensitively.
cars = spark.read.csv('cars.csv', header=True, inferSchema=True, nullValue='NA')

# If the data types cannot be inferred correctly, specify them manually:
# define a schema, then pass schema=... in place of inferSchema.
# Every column is declared as a string here; use IntegerType(), DoubleType(),
# etc. (also in pyspark.sql.types) for numeric columns where appropriate.
schema = StructType([
    StructField("maker", StringType()),
    StructField("model", StringType()),
    StructField("origin", StringType()),
    StructField("type", StringType()),
    StructField("cyl", StringType()),
    StructField("size", StringType()),
    StructField("weight", StringType()),
    StructField("length", StringType()),
    StructField("rpm", StringType()),
    StructField("consumption", StringType())
])

cars = spark.read.csv('cars.csv', header=True, schema=schema, nullValue='NA')


##### Exercise: Loading flights data
In this exercise you're going to load some airline flight data from a CSV file. To ensure that the exercise runs quickly, these data have been trimmed down to only 50,000 records.

Notes on CSV format:

- fields are separated by a comma (this is the default separator) and
- missing data are denoted by the string 'NA'.

Data dictionary:

- mon — month (integer between 1 and 12)
- dom — day of month (integer between 1 and 31)
- dow — day of week (integer; 1 = Monday and 7 = Sunday)
- org — origin airport (IATA code)
- mile — distance (miles)
- carrier — carrier (IATA code)
- depart — departure time (decimal hour)
- duration — expected duration (minutes)
- delay — delay (minutes)

* pyspark has been imported for you and the session has been initialized.

Note: The data have been aggressively down-sampled.

In [None]:
# Read data from CSV file
flights = spark.read.csv('flights.csv',
                         sep=',',
                         header=True,
                         inferSchema=True,
                         nullValue='NA')

# Get number of records
print("The data contain %d records." % flights.count())

# View the first five records
flights.show(5)

# Check column data types
flights.dtypes

In [None]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Specify column names and types
schema = StructType([
    StructField("id", IntegerType()),
    StructField("text", StringType()),
    StructField("label", IntegerType())
])

# Load data from a delimited file
sms = spark.read.csv('sms.csv', sep=';', header=True, schema=schema)

# Print schema of DataFrame
sms.printSchema()