# pySPARK - BASIC KNOWLEDGE

## This book contains the contents of the pySPARK notebook in SparkPractices

#### This book follows the author's progress on the topic and will contain different versions of the notebook.

## Helping a local machine find Spark Home

In [1]:
import findspark
findspark.init()
findspark.find()

'C:\\pyspark'

## Using OLD style SparkContext creation:

In [None]:
#Using SparkContext, SparkConf
from pyspark import SparkConf,SparkContext
conf=SparkConf().setMaster('local').setAppName('Testing')
sc=SparkContext(conf=conf)

## Using NEW style SparkSession creation:

In [2]:
#Using SparkSession
from pyspark.sql import SparkSession

spark=SparkSession. \
builder. \
appName('PySpark'). \
master('local'). \
getOrCreate()

In [3]:
sc=spark.sparkContext
spark

## READING DATA FROM A FILE

#### On VAGRANT VM
##### With "--master yarn", paths without a scheme are typically resolved against HDFS

In [None]:
orderItems=sc.textFile('/public/retail_db/order_items')

#To read from the local filesystem, use a file:// path (here with "--master local[*]")
blog=sc.textFile('file:///home/vagrant/SparkPractices/Share/blogtexts')

#### FOR LOCAL
##### Can read from HDFS and local

In [4]:
orderItems=sc.textFile('D://Bigdata Tutorials//data//retail_db//order_items//part-00000')

blog=sc.textFile('Share//blogtexts')

## Let's work through the BlogTexts word-count example

## Step 1 - Let's look at a few rows of data

In [5]:
blog.take(5)

[u'Think of it for a moment \u2013 1 Qunitillion = 1 Million Billion! Can you imagine how many drives / CDs / Blue-ray DVDs would be required to store them? It is difficult to imagine this scale of data generation even as a data science professional. While this pace of data generation is very exciting,  it has created entirely new set of challenges and has forced us to find new ways to handle Big Huge data effectively.',
 u'',
 u'Big Data is not a new phenomena. It has been around for a while now. However, it has become really important with this pace of data generation. In past, several systems were developed for processing big data. Most of them were based on MapReduce framework. These frameworks typically rely on use of hard disk for saving and retrieving the results. However, this turns out to be very costly in terms of time and speed.',
 u'',
 u'On the other hand, Organizations have never been more hungrier to add a competitive differentiation through understanding this data and o

## Let's tackle transformations first

## map and flatMap

### Use Case

#### Suppose we want to convert all letters in the RDD to lowercase and split each line on " " (space)
  
#### map(): To lowercase each line of a document and split it into words, we can use the map transformation.
  * Note that the output is not FLAT, i.e. it is a nested list (one list of words per line).
  
#### flatMap(): Same as map, but as an extra step it flattens the result into a single list of words

In [6]:
blogMap=blog.map(lambda x: x.lower().split(" "))

In [7]:
# This gives the O/P as a list of lists (one per line), so we print only the first element with take(1)
print(blogMap.take(1))

[[u'think', u'of', u'it', u'for', u'a', u'moment', u'\u2013', u'1', u'qunitillion', u'=', u'1', u'million', u'billion!', u'can', u'you', u'imagine', u'how', u'many', u'drives', u'/', u'cds', u'/', u'blue-ray', u'dvds', u'would', u'be', u'required', u'to', u'store', u'them?', u'it', u'is', u'difficult', u'to', u'imagine', u'this', u'scale', u'of', u'data', u'generation', u'even', u'as', u'a', u'data', u'science', u'professional.', u'while', u'this', u'pace', u'of', u'data', u'generation', u'is', u'very', u'exciting,', u'', u'it', u'has', u'created', u'entirely', u'new', u'set', u'of', u'challenges', u'and', u'has', u'forced', u'us', u'to', u'find', u'new', u'ways', u'to', u'handle', u'big', u'huge', u'data', u'effectively.']]


In [8]:
blogFlatMap=blog.flatMap(lambda x: x.lower().split(" "),True)

In [9]:
print(blogFlatMap.take(20))

[u'think', u'of', u'it', u'for', u'a', u'moment', u'\u2013', u'1', u'qunitillion', u'=', u'1', u'million', u'billion!', u'can', u'you', u'imagine', u'how', u'many', u'drives', u'/']
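
The difference between the two outputs above can be reproduced without Spark. The following is a minimal plain-Python sketch, assuming a small hypothetical list `lines` in place of the `blog` RDD; it only mimics the shape of the results, not Spark's distributed execution.

```python
from itertools import chain

# Hypothetical stand-in for the lines of the blog RDD
lines = ["Big Data is", "", "not new"]

# map-like: one output element per input line, so the result is nested
mapped = [line.lower().split(" ") for line in lines]

# flatMap-like: the per-line lists are concatenated into one flat list
flat_mapped = list(chain.from_iterable(line.lower().split(" ") for line in lines))

print(mapped)       # [['big', 'data', 'is'], [''], ['not', 'new']]
print(flat_mapped)  # ['big', 'data', 'is', '', 'not', 'new']
```

Note that the empty line still contributes an empty string to the flat list.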


#### NOTE - To inspect the map output in the same way, we can use collect().
#### collect() brings the RDD back to the driver as a list of lists; indexing [0] picks the first line and [:20] shows only its first 20 words

In [10]:
print(blogMap.collect()[0][:20])

[u'think', u'of', u'it', u'for', u'a', u'moment', u'\u2013', u'1', u'qunitillion', u'=', u'1', u'million', u'billion!', u'can', u'you', u'imagine', u'how', u'many', u'drives', u'/']


#### Now we have a collection of words and symbols on different indexes of a list.

## FILTER out words that are not needed

#### We call these words “stop words”: they add little value to a text. For example, “is”, “am”, “are” and “the” are stop words.

* Let's try to filter these words out:  

In [11]:
blogFiltered=blogFlatMap.filter(lambda x: x not in ['is','am','are','the','a','for','/','1','='])

In [12]:
print(blogFiltered.take(20))

[u'think', u'of', u'it', u'moment', u'\u2013', u'qunitillion', u'million', u'billion!', u'can', u'you', u'imagine', u'how', u'many', u'drives', u'cds', u'blue-ray', u'dvds', u'would', u'be', u'required']
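
As the stop-word list grows, a set gives faster membership checks than a list. The sketch below is plain Python; the names `STOP_WORDS` and `keep_word` are hypothetical and not part of the notebook.

```python
# Same stop words as in the filter above, stored in a set for fast lookups
STOP_WORDS = frozenset(['is', 'am', 'are', 'the', 'a', 'for', '/', '1', '='])

def keep_word(word):
    """Return True for tokens that should survive the filter."""
    return word not in STOP_WORDS

# Hypothetical sample tokens standing in for blogFlatMap
words = ['think', 'of', 'it', 'for', 'a', 'moment', '1', '=']
filtered = [w for w in words if keep_word(w)]

print(filtered)  # ['think', 'of', 'it', 'moment']
```

A named predicate like this could presumably be passed to the RDD directly, e.g. `blogFlatMap.filter(keep_word)`, instead of the inline lambda.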


### Note, if this was a wordcount example, we would just need to do the following:
* Use map to create (key,value) pairs where starting value will be 1.
* We will use reduceByKey() to get the final result.

In [13]:
from operator import add

blogTuples=blogFiltered.map(lambda x: (x,1))
print(blogTuples.take(20))


[(u'think', 1), (u'of', 1), (u'it', 1), (u'moment', 1), (u'\u2013', 1), (u'qunitillion', 1), (u'million', 1), (u'billion!', 1), (u'can', 1), (u'you', 1), (u'imagine', 1), (u'how', 1), (u'many', 1), (u'drives', 1), (u'cds', 1), (u'blue-ray', 1), (u'dvds', 1), (u'would', 1), (u'be', 1), (u'required', 1)]


#### Now to find and print the WordCount:

In [14]:
blogResult=blogTuples.reduceByKey(add).map(lambda x: (x[1],x[0])).sortByKey(False)
print(blogResult.take(20))

[(274, u''), (164, u'to'), (143, u'in'), (122, u'of'), (106, u'and'), (103, u'we'), (69, u'spark'), (64, u'this'), (63, u'data'), (55, u'can'), (52, u'apache'), (40, u'it'), (40, u'on'), (39, u'which'), (32, u'with'), (32, u'will'), (31, u'you'), (31, u'by'), (30, u'rdd'), (28, u'as')]
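
The most frequent "word" in the result above is the empty string. This is a side effect of splitting on a single space: empty lines and double spaces both produce empty tokens. A small plain-Python sketch of the behaviour, using a phrase from the first blog line:

```python
line = "very exciting,  it has"  # note the two consecutive spaces

by_space = line.split(" ")     # explicit separator keeps empty tokens
by_whitespace = line.split()   # no separator: runs of whitespace collapse

print(by_space)        # ['very', 'exciting,', '', 'it', 'has']
print(by_whitespace)   # ['very', 'exciting,', 'it', 'has']
print("".split(" "))   # ['']  (an empty line still yields one empty token)
print("".split())      # []
```

Using `split()` in the flatMap lambda, or adding `''` to the filter list, would likely remove the `u''` entry; this is a suggestion and not something the notebook does.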


In [15]:
#Another way to find the word count:

blogCount=blogTuples.countByKey()

# This returns a dictionary, which we can trim to show only some of the data
##This is done by explicitly converting its items into a list, slicing the list and then converting back to a dict.

print(dict(list(blogCount.items())[:20]))

{u'': 274, u'saves': 2, u'elements,': 2, u'step1:': 1, u'cluster).': 1, u'pyspark_driver_python=ipython': 1, u'skills': 1, u'better.': 1, u'connects': 1, u"('am',": 1, u'solution': 3, u'sc.parallelize(data)': 2, u'elegant': 1, u'diff_cat_in_train_test.distinct().count()#': 1, u'sc.parallelize(data,': 1, u'otherwise,': 1, u'saved': 1, u'(which': 1, u'created.': 1, u'second': 1}
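
Slicing the dictionary items shows arbitrary entries, not the most frequent ones. To get a top-N view from a dictionary like the one countByKey() returns, the items can be sorted by count. A minimal sketch, assuming a small hypothetical dict `counts` built from a few of the values seen above:

```python
# Hypothetical stand-in for the dictionary returned by countByKey()
counts = {'': 274, 'to': 164, 'saves': 2, 'in': 143, 'step1:': 1}

# Sort (word, count) pairs by count, highest first, and keep the top 3
top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:3]

print(top)  # [('', 274), ('to', 164), ('in', 143)]
```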


In [16]:
# Instead of these, we can also try groupByKey

In [17]:
blogGroup=blogTuples.groupByKey()
print(list([j[0],list(j[1])] for j in blogGroup.take(5)))

[[u'', [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], [u'diff_cat_in_train_test.distinct().count()#', [1]], [u'better.', [1]], [u'elements,', [1, 1]], [u'saved', [1]]]


In [18]:
#sortByKey() -- True - ASC and FALSE - DESC
blogCount2=blogGroup.mapValues(sum).map(lambda x: (x[1],x[0])).sortByKey(False)
print(blogCount2.take(20))

[(274, u''), (164, u'to'), (143, u'in'), (122, u'of'), (106, u'and'), (103, u'we'), (69, u'spark'), (64, u'this'), (63, u'data'), (55, u'can'), (52, u'apache'), (40, u'it'), (40, u'on'), (39, u'which'), (32, u'with'), (32, u'will'), (31, u'you'), (31, u'by'), (30, u'rdd'), (28, u'as')]


## Difference B/W reduceByKey and groupByKey

* The “reduceByKey” transformation first combines the values for each key within each partition, so each partition holds only one value per key. After the shuffle, the executors apply the same operation in the reduce phase; in this case a sum, i.e. lambda x, y: x + y (the add function used above).

<div><img align=left src="Share/files/images/reduceByKey-3.png" width=500 height=500 /><img align=left src="Share/files/images/groupbykey.png" width=500 height=500 /></div>

* The “groupByKey” transformation, in contrast, does not combine the values for each key within the partitions first. It shuffles every record and only then merges the values for each key. This moves far more data between executors, so “reduceByKey” is the better choice when the amount of data to shuffle is large.
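
The effect on shuffle volume can be illustrated with a rough plain-Python simulation. This is only a sketch under stated assumptions: `partitions`, `combine_locally` and `merge` are hypothetical names, and real Spark shuffles involve serialization and networking that are not modelled here.

```python
from collections import defaultdict

# Two hypothetical partitions of (word, 1) pairs
partitions = [
    [('spark', 1), ('data', 1), ('spark', 1)],
    [('spark', 1), ('data', 1), ('data', 1), ('data', 1)],
]

def combine_locally(partition):
    """reduceByKey-style map-side combine: one (key, partial_sum) per key."""
    partial = defaultdict(int)
    for key, value in partition:
        partial[key] += value
    return list(partial.items())

def merge(records):
    """Reduce side: sum everything that arrived for each key."""
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return dict(totals)

# reduceByKey-like: combine inside each partition, then "shuffle" partial sums
shuffled_reduce = [rec for p in partitions for rec in combine_locally(p)]
# groupByKey-like: "shuffle" every record unchanged
shuffled_group = [rec for p in partitions for rec in p]

print(len(shuffled_reduce), len(shuffled_group))  # 4 7
print(merge(shuffled_reduce))                     # {'spark': 3, 'data': 4}
```

Both routes end with the same totals; only the number of records that cross the shuffle differs.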

### This completes the WORDCOUNT example

## But, we can do much more...

## groupBy

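### Step 3: Now, we want to group the words in blogFiltered based on the letters they start with.

* The key is whatever the grouping function returns, and the value is the collection of all words that map to that key. In the example below, f returns the word itself when it starts with 'all' and (implicitly) None otherwise, so each distinct word starting with 'all' forms its own group and all other words fall into a single None group.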

In [19]:
def f(x):
    if x.startswith('all'):
        return x
blogGroupIt=blogFiltered.groupBy(f)

In [20]:
[(k, list(v)) for (k, v) in blogGroupIt.take(2)]

[(u'all',
  [u'all',
   u'all',
   u'all',
   u'all',
   u'all',
   u'all',
   u'all',
   u'all',
   u'all',
   u'all']),
 (u'allows', [u'allows'])]
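
For comparison, grouping words by their first letter can be sketched in plain Python as follows. The list `words` is a hypothetical stand-in for blogFiltered, and empty tokens are skipped because they have no first letter.

```python
from collections import defaultdict

# Hypothetical sample of filtered words
words = ['all', 'allows', 'apache', 'big', '', 'billion!', 'all']

groups = defaultdict(list)
for w in words:
    if w:                        # skip empty tokens
        groups[w[0]].append(w)   # key = first character of the word

print(dict(groups))
# {'a': ['all', 'allows', 'apache', 'all'], 'b': ['big', 'billion!']}
```

On the RDD, a similar grouping could probably be written as `blogFiltered.filter(lambda w: w).groupBy(lambda w: w[0])`; that line is an untested assumption, not part of the notebook.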

## mapPartitions

### STEP 4
* We can use it to count the words ‘spark’ and ‘apache’ in blogFiltered separately on each partition, and get the output of the task performed on each of these partitions.

* We can do this by applying the “mapPartitions” transformation. “mapPartitions” is like a map transformation, but the function runs once per partition of an RDD and receives an iterator over that partition's elements.


In [21]:
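def func(iterator):
    # Count occurrences of 'spark' and 'apache' within one partition
    count_spark = 0
    count_apache = 0
    for i in iterator:
        if i == 'spark':
            count_spark = count_spark + 1
        if i == 'apache':
            count_apache = count_apache + 1
    # mapPartitions expects an iterable back; this tuple contributes two elements per partition
    return (count_spark, count_apache)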

blogFiltered.repartition(2).mapPartitions(func,True).glom().collect()

[[30, 23], [39, 29]]
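
In the output above each partition contributes two separate numbers, because the returned tuple is itself treated as the iterable of results. A variant that yields the tuple produces one record per partition instead. The sketch below simulates this in plain Python; `count_words` and `partitions` are hypothetical names, and the loop stands in for what mapPartitions does on each partition.

```python
def count_words(iterator):
    """Count 'spark' and 'apache' in one partition and yield a single tuple."""
    count_spark = 0
    count_apache = 0
    for word in iterator:
        if word == 'spark':
            count_spark += 1
        elif word == 'apache':
            count_apache += 1
    # yield makes this a generator, so the tuple stays together as one record
    yield (count_spark, count_apache)

# Two hypothetical partitions of words
partitions = [['spark', 'apache', 'data'], ['spark', 'spark', 'rdd']]

# Apply the function once per partition and flatten the yielded records
result = [rec for p in partitions for rec in count_words(iter(p))]

print(result)  # [(1, 1), (2, 0)]
```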

## Transformation: sample
### STEP 5 

* What if I want to work with samples instead of the full data?

* The “sample” transformation lets us take a sample instead of working on the full data.

* The sample method returns a new RDD containing a statistical sample of the original RDD.


In [22]:
blog.sample?

Signature: blog.sample(withReplacement, fraction, seed=None)
Docstring:
Return a sampled subset of this RDD.

:param withReplacement: can elements be sampled multiple times (replaced when sampled out)
:param fraction: expected size of the sample as a fraction of this RDD's size
    without replacement: probability that each element is chosen; fraction must be [0, 1]
    with replacement: expected number of times each element is chosen; fraction must be >= 0
:param seed: seed for the random number generator

.. note:: This is not guaranteed to provide exactly the fraction specified of the total
    count of the given :class:`DataFrame`.

>>> rdd = sc.parallelize(range(100), 4)
>>> 6 <= rdd.sample(False, 0.1, 81).count() <= 14
True
File:      c:\pyspark\python\pyspark\rdd.py
Type:      instancemethod


In [23]:
# fraction = 0.3 means each element is kept with probability 0.3, so roughly 30% of the dataset
# seed = fixes the random number generator so that the same sample is drawn every time
blogFilteredSample=blogFiltered.sample(False,0.3,30)

In [24]:
print(len(blogFiltered.collect()),len(blogFilteredSample.collect()))

(4989, 1509)
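
The sample size above is close to, but not exactly, 30% of the data. The docstring explains why: without replacement, fraction is the probability that each element is chosen. A minimal plain-Python sketch of that idea follows; `bernoulli_sample` is a hypothetical helper, and Spark's own sampler is implemented differently.

```python
import random

def bernoulli_sample(items, fraction, seed):
    """Keep each item independently with probability `fraction`."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return [x for x in items if rng.random() < fraction]

data = list(range(1000))  # hypothetical dataset
sample_1 = bernoulli_sample(data, 0.3, 30)
sample_2 = bernoulli_sample(data, 0.3, 30)

print(len(sample_1))         # close to 300, but not guaranteed to be exactly 300
print(sample_1 == sample_2)  # True: same seed, same sample
```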


## Transformation: Union
### STEP 6

* What if I want to work with multiple samples?


* “union” transformation will return a new RDD by taking the union of two RDDs. 

* Note that duplicate items will not be removed in the new RDD.

* To show the same:


In [25]:
# Same sample copy
blogFilteredSample2=blogFilteredSample

#OR 
#blogFilteredSample2=blogFiltered.sample(False,0.3,30)

#Union of the two samples:
blogFilteredSampleUnion=blogFilteredSample.union(blogFilteredSample2).collect()

In [26]:
print(len(blogFilteredSample.collect()),len(blogFilteredSample2.collect()),len(blogFilteredSampleUnion))

(1509, 1509, 3018)


### Thus, we can see that DUPLICATES were not removed!
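
The same behaviour can be mirrored with plain Python lists: concatenation keeps duplicates, and a separate de-duplication step is needed to drop them. The lists below are hypothetical stand-ins for the two samples.

```python
sample_a = ['spark', 'data', 'spark']  # hypothetical sample
sample_b = list(sample_a)              # an identical second sample

union_like = sample_a + sample_b        # like union(): duplicates are kept
distinct_like = sorted(set(union_like)) # like de-duplicating afterwards

print(len(union_like))  # 6
print(distinct_like)    # ['data', 'spark']
```

On an RDD, chaining `distinct()` after `union()` is the usual way to remove the duplicates.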