# Introduction to Python and PySpark

This supplementary lecture for the ECE795 Advanced Big Data Analytics course introduces programming with Python and PySpark.

TAs: Rui Ma (rxm1351@miami.edu), Anchen Sun (a.sun158@umiami.edu)

## Python Introduction 

- Python is an open-source general-purpose programming language with powerful library and package support. 

- Python supports object-oriented, procedural and functional programming. 

- Python has a great interactive environment and interfaces with Spark.

## Running Python: The Python Interpreter

### Print the version of Python

In [0]:
!python --version

Python 3.6.9


### Use Python as calculator

In [0]:
3 * (7 + 2)

27

### A simple example

1. Assignment operator: `=`
 
2. Value comparison: `==`

3. Basic arithmetic operators: add (`+`), subtract (`-`), multiply (`*`), divide (`/`), and modulo (`%`).

    - `+` is also used for string concatenation when the operands are of `str` type.

    - `%` is also used for string formatting (as with `printf` in C).

4. Logical operators are words (`and`, `or`, `not`), not symbols.

5. The basic printing command is `print`.

In [0]:
x = 100 - 50                            # A comment
y = 'University'                        # Another one
z = 23.33
if z == 23.33 or y == 'University':
  x = x + 1
  y = y + ' of Miami'                   # String concat
print(x)
print(y)

51
University of Miami
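
The cell above does not exercise `%` formatting or `not`, so here is a minimal sketch of both. The variable names and values (`course`, `students`) are hypothetical and chosen only for illustration.

```python
# Hypothetical values, used only for illustration.
course = 'ECE795'
students = 42

# With a string on the left, `%` formats values into it, like printf in C.
message = 'Course %s has %d students' % (course, students)
print(message)    # Course ECE795 has 42 students

# Logical operators are spelled out as words.
is_large = students > 30 and not course.startswith('CS')
print(is_large)   # True
```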


### Basic Data Types

Number

In [0]:
# Integers: int() truncates the float result of 7 / 3
z = int(7 / 3)
print(z)

# Floating-point numbers, the same as `double` in C
z = float(7 / 3)
print(z)

2
2.3333333333333335
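
Since `int()` and the floor-division operator `//` are easy to confuse, the following sketch contrasts them. They agree for positive results and differ for negative ones; the variable names are illustrative.

```python
# `/` always produces a float in Python 3.
true_div = 7 / 3
print(true_div)        # 2.3333333333333335

# `//` floors the result (rounds toward negative infinity).
floor_pos = 7 // 3
floor_neg = -7 // 3
print(floor_pos, floor_neg)   # 2 -3

# int() truncates (rounds toward zero), so it differs for negatives.
truncated = int(-7 / 3)
print(truncated)       # -2
```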


Strings

In [0]:
# Can use " " or ' ' to specify.
x = "abc"
y = 'abc'
print(x == y)

# Unmatched quotes can occur within the string.
z = "matt's"
print(z)

# Use triple double-quotes for multi-line strings or strings that contain both ' and " inside of them:
q = """a'b"c"""
print(q)

True
matt's
a'b"c


### Whitespace

Whitespace is meaningful in Python, especially indentation and the placement of newlines.

A newline ends a line of code. Use `\` when a statement must continue on the next line.

In [0]:
x = 1
y = 2
z = x + \
    y
print(z)

3
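
As an alternative to the backslash, an expression inside parentheses, brackets, or braces may span several lines without any continuation character. A minimal sketch, with `courses` as an illustrative name:

```python
x = 1
y = 2

# No backslash needed: the open parenthesis keeps the expression going.
z = (x +
     y)
print(z)             # 3

# The same holds inside [] and {}.
courses = [
    'ECE',
    795,
]
print(len(courses))  # 2
```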


### Comments

Start comments with `#`: the rest of the line is ignored.

You can include a `"""documentation string"""` as the first statement of any new function or class that you define.

The development environment, debugger, and other tools such as `help()` use it, so it is good style to include one.

In [0]:
def my_function(x, y):
  """
  This is the docstring. This function does ...

  Args:
    x (int): number of ...
    y (string): ...
  """
  # The code would go here...
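
To see how tools pick up a docstring, the sketch below defines a small hypothetical function `add` (not part of the lecture) and reads its docstring back through `__doc__` and `help()`.

```python
def add(x, y):
    """Return the sum of x and y."""
    return x + y

# The docstring is stored on the function object itself.
print(add.__doc__)   # Return the sum of x and y.

# help() prints the signature followed by the docstring.
help(add)
```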

### Multiple Simultaneous Assignments

In [0]:
x, y = 1, 100
print(x + y)

101


### Reserved words

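False, None, True, and, as, assert, break, class, continue, def, del, elif, else, except,
finally, for, from, global, if, import, in, is, lambda, nonlocal, not, or, pass, raise,
return, try, while, with, yield

`exec` and `print` were reserved words in Python 2; in Python 3 they are ordinary built-in functions. Python 3.7 and later also reserve `async` and `await`.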
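
The exact set of reserved words depends on the Python version, so it can be worth checking from the interpreter. A short sketch using the standard `keyword` module:

```python
import keyword

# iskeyword() reports whether a name is reserved in the running interpreter.
print(keyword.iskeyword('lambda'))   # True
print(keyword.iskeyword('print'))    # False: print is a function in Python 3
print(keyword.iskeyword('exec'))     # False: exec is a function in Python 3

# kwlist holds the full list for this Python version.
print(keyword.kwlist)
```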

### Basic Operators

Binary operators on numbers

In [0]:
x = 95
y = 7
print(x + y, x - y, x / y, x * y, 2 ** y, x % 7, x // 7)

102 88 13.571428571428571 665 128 4 13


In [0]:
# Some operators are overloaded to work on strings
print('University' + ' of ' + 'Miami')   # concatenation
print('I' + 3 * 'E')                     # replication
print(eval('1 + 4 / 3'))                 # evaluate a string as an expression

University of Miami
IEEE
2.333333333333333


### Advanced Data Types

1. Tuple
    
    - A simple immutable ordered sequence of items
    
    - Items can be of mixed types, including collections

2. String

    - Immutable

    - Conceptually very much like a tuple

3. List

    - Mutable ordered sequence of items of mixed types

All three sequence types (tuples, strings, and lists) share
much of the same syntax and functionality.

Key difference:

1. Tuples and strings are immutable

2. Lists are mutable
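
A compact way to see both the shared syntax and the key difference is to run the same operations on all three types. This is an illustrative sketch; the names `t`, `s`, and `items` are not from the lecture.

```python
t = ('ECE', 795)        # tuple
s = 'ECE'               # string
items = ['ECE', 795]    # list

# Shared syntax: len(), indexing, and concatenation work on all three.
print(len(t), len(s), len(items))   # 2 3 2
print(t[0], s[0], items[0])         # ECE E ECE
print(t + (1,), s + '!', items + [1])

# Only the list accepts item assignment.
items[1] = 800
immutable = []
for seq in (t, s):
    try:
        seq[0] = 'X'
    except TypeError:
        immutable.append(type(seq).__name__)
print(immutable)                    # ['tuple', 'str']
```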


Lists are defined using square brackets (and commas).

In [2]:
L = ['ECE', 795, 'Advanced Big Data Analytics']
print(L[0])

ECE


In [3]:
print(L[3])

IndexError: list index out of range

In [4]:
print(L[-1])

Advanced Big Data Analytics


Mutability: Lists are mutable, so the following operations work only on lists

In [5]:
L.append('College of Engineering')
print(L)

['ECE', 795, 'Advanced Big Data Analytics', 'College of Engineering']


In [6]:
L.insert(3, 'Dr. Shyu')
print(L)

['ECE', 795, 'Advanced Big Data Analytics', 'Dr. Shyu', 'College of Engineering']


In [7]:
L.pop()
print(L)

['ECE', 795, 'Advanced Big Data Analytics', 'Dr. Shyu']


In [8]:
L.pop(2)
print(L)

['ECE', 795, 'Dr. Shyu']


In [9]:
L[2] = 'College of Engineering'
print(L)

['ECE', 795, 'College of Engineering']


Mutability: Tuples are Immutable

In [10]:
# Tuples are defined using parentheses (and commas).
T = ('ECE', 795, 'Advanced Big Data Analytics')
print(T)

('ECE', 795, 'Advanced Big Data Analytics')


In [11]:
T[1] = 800

TypeError: 'tuple' object does not support item assignment

In [16]:
T = ('ECE', 795, 'Advanced Big Data Analytics', ['a', 'b', 'c'])
print(T)
L = T[3]
print(L)
L[0] = 1
L[1] = 2
print(T)

('ECE', 795, 'Advanced Big Data Analytics', ['a', 'b', 'c'])
['a', 'b', 'c']
('ECE', 795, 'Advanced Big Data Analytics', [1, 2, 'c'])


Slicing: Return a Copy of a Subset

In [17]:
print(T[1:3])
print(T[:3])
print(T[2:])

(795, 'Advanced Big Data Analytics')
('ECE', 795, 'Advanced Big Data Analytics')
('Advanced Big Data Analytics', [1, 2, 'c'])


Copy the Entire Sequence

In [20]:
A = [1, 2, 3, 4]
B = A
B[1] = 100
print(A, B)

A = [1, 2, 3, 4]
B = A[:]
B[1] = 100
print(A, B)

[1, 100, 3, 4] [1, 100, 3, 4]
[1, 2, 3, 4] [1, 100, 3, 4]
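
Note that `A[:]` makes a shallow copy: the outer list is new, but nested objects are still shared. The sketch below contrasts it with `copy.deepcopy` from the standard library, using a hypothetical nested list.

```python
import copy

A = [1, [2, 3]]
B = A[:]               # shallow copy: new outer list, same inner list
C = copy.deepcopy(A)   # deep copy: the inner list is copied too

B[0] = 100             # affects only B
B[1][0] = 200          # affects A and B (shared inner list), not C

print(A)   # [1, [200, 3]]
print(B)   # [100, [200, 3]]
print(C)   # [1, [2, 3]]
```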


In [15]:
St = 'ECE 795 Advanced Big Data Analytics'
print(St[2])

E


### Membership test

`in` is a Boolean test of whether a value is inside a container:

In [22]:
T = [1, 2, 3]
print(3 in T)
print(4 in T)
print(4 not in T)

True
False
True


For strings, `in` also tests for substrings:

In [23]:
st = 'University of Miami'
print('Miami' in st)
print('M' in st)

True
True


### Replication

The `*` operator produces a new tuple, list, or string that repeats the original content.

In [25]:
print((1, 2) * 3)
print([1, 2] * 3)
print('Miami' * 3)

(1, 2, 1, 2, 1, 2)
[1, 2, 1, 2, 1, 2]
MiamiMiamiMiami


List comprehensions: building lists out of other sequences

In [30]:
li = range(5)
print(li)

range(0, 5)


In [31]:
squares = [x ** 2 for x in li]
print(squares)

[0, 1, 4, 9, 16]


In [33]:
even_squares = [x for x in squares if x%2==0]
print(even_squares)

[0, 4, 16]


In [35]:
newlist = [(x,1) for x in even_squares]
print(newlist)

[(0, 1), (4, 1), (16, 1)]


In [36]:
newlist = [range(x) for x in range(5)]
print(newlist)

[range(0, 0), range(0, 1), range(0, 2), range(0, 3), range(0, 4)]


In [44]:
newlist = [y for x in range(5) for y in range(x)] #"nested" for loop
print(newlist)

[0, 0, 1, 0, 1, 2, 0, 1, 2, 3]
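
The nested comprehension reads left to right in the same order as ordinary nested loops. A sketch of the equivalent explicit loops, with `flat` as an illustrative name:

```python
flat = []
for x in range(5):        # first `for` in the comprehension: outer loop
    for y in range(x):    # second `for`: inner loop
        flat.append(y)

print(flat)   # [0, 0, 1, 0, 1, 2, 0, 1, 2, 3]

# Same result as the comprehension form.
print(flat == [y for x in range(5) for y in range(x)])   # True
```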


### Control Flow

if statement

In [45]:
x = 2
if x == 1:
  print('x = 1')
elif x == 2:
  print('x = 2')
else:
  print('x equals something else')

x = 2


In [46]:
x = 'abc'
if 'a' in x:
  print('a is in string x')
else:
  print('a is not in string x')

a is in string x


while statement: Fibonacci series

In [47]:
a, b = 0, 1
while a < 100:
  print(a)
  a, b = b, a + b

0
1
1
2
3
5
8
13
21
34
55
89


for loop: Primality Test

In [61]:
for n in range(2, 10):
  for x in range(2, n):
    if n % x == 0:
      print(n, 'equals', x, '*', n//x)
      break
  else:    # executed only if the inner loop finishes without a break
    print(n, 'is a prime number')

2 is a prime number
3 is a prime number
4 equals 2 * 2
5 is a prime number
6 equals 2 * 3
7 is a prime number
8 equals 2 * 4
9 equals 3 * 3


### Function

Function Definitions

1. `def` creates a function and assigns it a name

2. `return` sends a result back to the caller

3. Arguments are passed by assignment

4. Argument and return types are not declared:

In [63]:
def mean(nums):
  total = 0                 # named `total` to avoid shadowing the built-in sum()
  for each in nums:
    total += each
  return total / len(nums)

print(mean([1, 2, 3, 9, 10])) 

5.0
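
"Passed by assignment" means the parameter name is bound to the same object the caller passed. Mutating that object is visible to the caller, while rebinding the parameter is not. A minimal sketch with two hypothetical helper functions:

```python
def append_item(items):
    items.append(4)    # mutates the caller's list in place

def rebind(items):
    items = [0]        # rebinds the local name only; the caller is unaffected

nums = [1, 2, 3]
append_item(nums)
print(nums)   # [1, 2, 3, 4]

rebind(nums)
print(nums)   # [1, 2, 3, 4]
```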


### Python Modules

Modules are separate files that define functions and variables.

Functions or variables from a module are imported using `from` or `import`.

In [64]:
import numpy
print(numpy.sqrt(144))

12.0


In [65]:
from numpy import sqrt
print(sqrt(144))

12.0


In [66]:
import numpy as np
print(np.sqrt(144))

12.0


In [69]:
# Wildcard import: convenient interactively, but it can overwrite existing names
from numpy import *
print(sqrt(144))
print(exp(2))

12.0
7.38905609893065


In [72]:
import numpy as np
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
print(a + b)
print(np.dot(a, b))
print(np.outer(a, b))

[11 22 33]
140
[[10 20 30]
 [20 40 60]
 [30 60 90]]


## PySpark Introduction

What is Spark?

A general engine for large-scale data processing

A fast, expressive cluster computing system compatible with Apache Hadoop

Improves efficiency through:

1. In-memory computing primitives

2. General computation graphs

Improves usability through:

1. Rich APIs in Java, Scala, Python

2. Interactive shell

Programmer’s point of view:

A “normal” Python program with “special” parallel data structures

Easy to write short, clear map/reduce operations

The full expressiveness of Java/Scala/Python is available
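
To give a feel for the map/reduce style before any Spark setup, here is a plain-Python sketch that counts words with the built-in `map` and `functools.reduce`. It runs on a single machine and is only an analogy for the distributed operations; the input list and the `merge` helper are hypothetical.

```python
from functools import reduce

words = ['spark', 'python', 'spark', 'rdd']   # hypothetical input

# "map" step: pair each word with a count of 1.
pairs = list(map(lambda w: (w, 1), words))

# "reduce" step: merge the pairs into per-word totals.
def merge(counts, pair):
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

counts = reduce(merge, pairs, {})
print(counts)   # {'spark': 2, 'python': 1, 'rdd': 1}
```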

### Quick Tour of Operations

The starting point for Spark functionality is the SparkContext.

In the pyspark interpreter, it is loaded automatically as the variable `sc`.

In standalone programs, you create your own.

Creating an RDD

In [0]:
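# Set up PySpark

# Install Java 8, download and unpack Spark 2.4.4, and install findspark.
# If the download mirror below is unavailable, older Spark releases are kept at archive.apache.org.
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark

# Tell findspark where Java and Spark are installed
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"

import findspark
findspark.init()

# Start a local Spark session that uses all available cores
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

# Get the active SparkContext as `sc`
from pyspark import SparkContext
sc = SparkContext.getOrCreate()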

In [78]:
# Turn a local collection into an RDD
rdd = sc.parallelize([1, 2, 3])
print(rdd.take(2))

[1, 2]


In [85]:
rdd = sc.textFile('/content/sample_data/california_housing_test.csv')
print(rdd.take(2))

['"longitude","latitude","housing_median_age","total_rooms","total_bedrooms","population","households","median_income","median_house_value"', '-122.050000,37.370000,27.000000,3885.000000,661.000000,1537.000000,606.000000,6.608500,344700.000000']


An RDD behaves like a list whose elements are the lines of the file.

On a cluster, the list is distributed over multiple machines.

The list is immutable:

1. You cannot change it in place.

2. You cannot index individual elements directly (e.g., the 5th element).

3. You can interact with it only through specific operations.

In [87]:
# Basic transformations
nums = sc.parallelize([1, 2, 3])

# Pass each element through a function
squares = nums.map(lambda x: x * x)
print(squares.take(10))

# Keep elements passing a predicate
even = squares.filter(lambda x: x % 2 == 0)
print(even.take(10))

[1, 4, 9]
[4]


In [91]:
# Basic actions
nums = sc.parallelize([1, 2, 3])

# Retrieve RDD contents as a local collection
print(nums.collect())

# Count number of elements
print(nums.count())

# Write elements as text; the path names an output directory of part files
# nums.saveAsTextFile('file.txt')

[1, 2, 3]
3
