
Create the Spark context to start a session and connect to the cluster.

In [1]:
import sys
sys.path.append('/home/student/ROI/SparkProgram')
from initspark import *
sc, spark, conf = initspark()


Read a text file from the local file system.

In [2]:
shake = sc.textFile('/home/student/ROI/SparkProgram/datasets/text/shakespeare.txt')
print(shake.count())
print(shake.take(10))

124796
['The Project Gutenberg EBook of The Complete Works of William Shakespeare, by ', 'William Shakespeare', '', 'This eBook is for the use of anyone anywhere at no cost and with', 'almost no restrictions whatsoever.  You may copy it, give it away or', 're-use it under the terms of the Project Gutenberg License included', 'with this eBook or online at www.gutenberg.org', '', '** This is a COPYRIGHTED Project Gutenberg eBook, Details Below **', '**     Please follow the copyright guidelines in this file.     **']


Use the map method to apply a function to each element.

In [3]:
shake2 = shake.map(str.upper)
shake2.take(10)

['THE PROJECT GUTENBERG EBOOK OF THE COMPLETE WORKS OF WILLIAM SHAKESPEARE, BY ',
 'WILLIAM SHAKESPEARE',
 '',
 'THIS EBOOK IS FOR THE USE OF ANYONE ANYWHERE AT NO COST AND WITH',
 'ALMOST NO RESTRICTIONS WHATSOEVER.  YOU MAY COPY IT, GIVE IT AWAY OR',
 'RE-USE IT UNDER THE TERMS OF THE PROJECT GUTENBERG LICENSE INCLUDED',
 'WITH THIS EBOOK OR ONLINE AT WWW.GUTENBERG.ORG',
 '',
 '** THIS IS A COPYRIGHTED PROJECT GUTENBERG EBOOK, DETAILS BELOW **',
 '**     PLEASE FOLLOW THE COPYRIGHT GUIDELINES IN THIS FILE.     **']

Mapping with the split method turns each line into a list of words, so the result is a list of lists.

In [4]:
shake3 = shake.map(lambda x : x.split(' '))
shake3.take(10)

[['The',
  'Project',
  'Gutenberg',
  'EBook',
  'of',
  'The',
  'Complete',
  'Works',
  'of',
  'William',
  'Shakespeare,',
  'by',
  ''],
 ['William', 'Shakespeare'],
 [''],
 ['This',
  'eBook',
  'is',
  'for',
  'the',
  'use',
  'of',
  'anyone',
  'anywhere',
  'at',
  'no',
  'cost',
  'and',
  'with'],
 ['almost',
  'no',
  'restrictions',
  'whatsoever.',
  '',
  'You',
  'may',
  'copy',
  'it,',
  'give',
  'it',
  'away',
  'or'],
 ['re-use',
  'it',
  'under',
  'the',
  'terms',
  'of',
  'the',
  'Project',
  'Gutenberg',
  'License',
  'included'],
 ['with', 'this', 'eBook', 'or', 'online', 'at', 'www.gutenberg.org'],
 [''],
 ['**',
  'This',
  'is',
  'a',
  'COPYRIGHTED',
  'Project',
  'Gutenberg',
  'eBook,',
  'Details',
  'Below',
  '**'],
 ['**',
  '',
  '',
  '',
  '',
  'Please',
  'follow',
  'the',
  'copyright',
  'guidelines',
  'in',
  'this',
  'file.',
  '',
  '',
  '',
  '',
  '**']]

The flatMap method flattens the inner lists, returning one long list of strings instead.

In [5]:
shake4 = shake.flatMap(lambda x : x.split(' '))
shake4.take(20)

['The',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'The',
 'Complete',
 'Works',
 'of',
 'William',
 'Shakespeare,',
 'by',
 '',
 'William',
 'Shakespeare',
 '',
 'This',
 'eBook',
 'is',
 'for']
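
The difference between map and flatMap can be checked without a cluster. The sketch below is a plain-Python analogue, not Spark code: the `lines` list is made-up sample data standing in for the RDD, and the list comprehension and `itertools.chain` only approximate locally what Spark does across partitions.

```python
from itertools import chain

# Hypothetical sample data standing in for a few lines of the RDD.
lines = ['William Shakespeare', '', 'This eBook is for the use']

# Analogue of shake.map(lambda x: x.split(' ')): one list per input line.
mapped = [line.split(' ') for line in lines]

# Analogue of shake.flatMap(lambda x: x.split(' ')): the inner lists are
# concatenated into a single flat sequence of words.
flat_mapped = list(chain.from_iterable(line.split(' ') for line in lines))

print(mapped)
print(flat_mapped)
```

Note that an empty line still contributes one empty string in both cases, which matches the `''` entries in the output above.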

The parallelize method loads locally created data into the Spark cluster as an RDD.

In [6]:
r = sc.parallelize(range(1,11))
print(r.collect())
print(r.take(5))

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 2, 3, 4, 5]
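
An RDD created this way is divided into partitions that can be processed in parallel. The sketch below is a rough plain-Python illustration of that idea, assuming contiguous, near-equal slices; `split_into_partitions` is a hypothetical helper written for this example, and Spark's actual slicing may differ in detail.

```python
def split_into_partitions(data, num_slices):
    """Split data into num_slices contiguous chunks of near-equal size."""
    items = list(data)
    n = len(items)
    return [items[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

# Ten numbers spread over four hypothetical partitions.
partitions = split_into_partitions(range(1, 11), 4)
print(partitions)
```

On a real RDD, `sc.parallelize` accepts an optional `numSlices` argument, and `r.glom().collect()` shows the elements grouped by partition.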


Load a folder stored on HDFS.

In [7]:
cat = sc.textFile('hdfs://localhost:9000/categories')
cat.collect()

['1,Beverages,Soft drinks coffees teas beers and ales',
 '2,Condiments,Sweet and savory sauces relishes spreads and seasonings',
 '3,Confections,Desserts candies and sweet breads',
 '4,Dairy Products,Cheeses',
 '5,Grains/Cereals,Breads crackers pasta and cereal',
 '6,Meat/Poultry,Prepared meats',
 '7,Produce,Dried fruit and bean curd',
 '8,Seafood,Seaweed and fish']

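Other useful actions: takeOrdered returns the smallest elements, top returns the largest, and takeSample returns a random sample.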

In [8]:
print(cat.takeOrdered(5))
print(cat.top(5))
print(cat.takeSample(False,5))


['1,Beverages,Soft drinks coffees teas beers and ales', '2,Condiments,Sweet and savory sauces relishes spreads and seasonings', '3,Confections,Desserts candies and sweet breads', '4,Dairy Products,Cheeses', '5,Grains/Cereals,Breads crackers pasta and cereal']
['8,Seafood,Seaweed and fish', '7,Produce,Dried fruit and bean curd', '6,Meat/Poultry,Prepared meats', '5,Grains/Cereals,Breads crackers pasta and cereal', '4,Dairy Products,Cheeses']
['6,Meat/Poultry,Prepared meats', '7,Produce,Dried fruit and bean curd', '5,Grains/Cereals,Breads crackers pasta and cereal', '1,Beverages,Soft drinks coffees teas beers and ales', '4,Dairy Products,Cheeses']
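
Because these rows are still plain strings, the ordering above is lexicographic rather than numeric. The following plain-Python sketch shows the consequence using `heapq` and `random` as local stand-ins for the three actions; the `rows` list, including the `'10,Snacks'` entry, is invented for illustration.

```python
import heapq
import random

# Hypothetical rows in the same "id,name" shape, with an id past 9.
rows = ['1,Beverages', '2,Condiments', '10,Snacks', '8,Seafood']

# Analogue of takeOrdered(2) and top(2) on raw strings: comparison is
# character by character, so '10,...' sorts before '2,...'.
smallest = heapq.nsmallest(2, rows)
largest = heapq.nlargest(2, rows)
print(smallest)
print(largest)

# A key that parses the id gives numeric order instead.
def by_id(row):
    return int(row.split(',')[0])

smallest_by_id = heapq.nsmallest(2, rows, key=by_id)
largest_by_id = heapq.nlargest(2, rows, key=by_id)
print(smallest_by_id)
print(largest_by_id)

# Analogue of takeSample(False, 2): sample without replacement.
sample = random.Random(0).sample(rows, 2)
print(sample)
```

The RDD methods `takeOrdered` and `top` also accept a `key` function, so the same `by_id` idea should carry over, for example `cat.takeOrdered(5, key=by_id)`.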


Save the contents of an RDD to disk. Note that saveAsTextFile creates a folder and fills it with one file per partition of the RDD. The folder must not already exist, or the call throws an exception.

In [9]:
! rm -rf /home/student/file1.txt
cat.saveAsTextFile('/home/student/file1.txt')
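
The folder layout and the "must not exist" rule can be imitated locally. This is a minimal plain-Python sketch, not how Spark is implemented: `save_partitions` is a hypothetical helper, the partition contents are made up, and Spark typically writes additional marker files (such as `_SUCCESS`) that are omitted here.

```python
import os
import shutil
import tempfile

def save_partitions(partitions, path):
    """Write one part-NNNNN file per partition; refuse to overwrite."""
    if os.path.exists(path):
        raise FileExistsError(f'Output directory {path} already exists')
    os.makedirs(path)
    for index, partition in enumerate(partitions):
        with open(os.path.join(path, f'part-{index:05d}'), 'w') as f:
            for record in partition:
                f.write(f'{record}\n')

# Use a throwaway temp location rather than the course's home directory.
out_dir = os.path.join(tempfile.mkdtemp(), 'file1.txt')
partitions = [['1,Beverages', '2,Condiments'], ['3,Confections']]
save_partitions(partitions, out_dir)
listing = sorted(os.listdir(out_dir))
print(listing)

# Clearing the old output first makes a re-run safe; ignore_errors=True
# keeps this from failing when the directory is absent.
shutil.rmtree(out_dir, ignore_errors=True)
save_partitions(partitions, out_dir)
```

Despite its `.txt` name, `file1.txt` is a directory, which is why the shell cell removes it recursively.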

In [10]:
print(cat.map(str.upper).collect())

['1,BEVERAGES,SOFT DRINKS COFFEES TEAS BEERS AND ALES', '2,CONDIMENTS,SWEET AND SAVORY SAUCES RELISHES SPREADS AND SEASONINGS', '3,CONFECTIONS,DESSERTS CANDIES AND SWEET BREADS', '4,DAIRY PRODUCTS,CHEESES', '5,GRAINS/CEREALS,BREADS CRACKERS PASTA AND CEREAL', '6,MEAT/POULTRY,PREPARED MEATS', '7,PRODUCE,DRIED FRUIT AND BEAN CURD', '8,SEAFOOD,SEAWEED AND FISH']


Parse each string into a tuple that resembles a record structure.

In [11]:
cat1 = cat.map(lambda x : tuple(x.split(',')))
cat1 = cat1.map(lambda x : (int(x[0]), x[1], x[2]))
cat1.take(10)

[(1, 'Beverages', 'Soft drinks coffees teas beers and ales'),
 (2, 'Condiments', 'Sweet and savory sauces relishes spreads and seasonings'),
 (3, 'Confections', 'Desserts candies and sweet breads'),
 (4, 'Dairy Products', 'Cheeses'),
 (5, 'Grains/Cereals', 'Breads crackers pasta and cereal'),
 (6, 'Meat/Poultry', 'Prepared meats'),
 (7, 'Produce', 'Dried fruit and bean curd'),
 (8, 'Seafood', 'Seaweed and fish')]
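
Splitting on `','` works for this data because no field contains a comma. As a hedged alternative, the sketch below uses Python's standard `csv` module so that quoted fields survive; `parse_category` is a hypothetical helper name, and the second sample row is invented to show the quoted case.

```python
import csv

def parse_category(line):
    """Parse one CSV line into an (id, name, description) tuple."""
    fields = next(csv.reader([line]))
    return (int(fields[0]), fields[1], fields[2])

plain = parse_category('1,Beverages,Soft drinks coffees teas beers and ales')
print(plain)

# A quoted field containing a comma (hypothetical row) stays in one piece.
quoted = parse_category('9,Snacks,"Chips, nuts and dips"')
print(quoted)
```

A named function like this can be passed to map in place of the two lambdas, for example `cat.map(parse_category)`.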

**LAB:** Put the regions folder found in /home/student/ROI/Spark/datasets/northwind/CSV/regions into HDFS. Read it into an RDD and convert each line into a tuple.

The filter method takes a function that returns True or False and keeps only the elements for which it returns True.

In [12]:
cat1.filter(lambda x : x[0] <= 5).collect()


[(1, 'Beverages', 'Soft drinks coffees teas beers and ales'),
 (2, 'Condiments', 'Sweet and savory sauces relishes spreads and seasonings'),
 (3, 'Confections', 'Desserts candies and sweet breads'),
 (4, 'Dairy Products', 'Cheeses'),
 (5, 'Grains/Cereals', 'Breads crackers pasta and cereal')]

The filter expressions can be more complicated.

In [13]:
cat1.filter(lambda x : x[0] % 2 == 0 and 'e' in x[1]).collect()

[(2, 'Condiments', 'Sweet and savory sauces relishes spreads and seasonings'),
 (6, 'Meat/Poultry', 'Prepared meats'),
 (8, 'Seafood', 'Seaweed and fish')]
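
When a predicate grows beyond a short expression, a named function is often easier to read and test than a lambda. The sketch below is a plain-Python analogue using a list comprehension in place of filter; the `records` list is a shortened, made-up subset of the tuples above, and `even_id_with_e` is a hypothetical name.

```python
# Hypothetical (id, name) records standing in for the cat1 tuples.
records = [(1, 'Beverages'), (2, 'Condiments'),
           (4, 'Dairy Products'), (6, 'Meat/Poultry')]

def even_id_with_e(record):
    """Keep records with an even id whose name contains the letter 'e'."""
    return record[0] % 2 == 0 and 'e' in record[1]

# Analogue of cat1.filter(even_id_with_e).collect()
kept = [record for record in records if even_id_with_e(record)]
print(kept)
```

Record 4 is dropped because 'Dairy Products' has no 'e', the same reason it is missing from the output above.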

The sortBy method takes a function that returns the value used as the sort key for each element.

In [14]:
cat1.sortBy(lambda x : x[2]).collect()

[(5, 'Grains/Cereals', 'Breads crackers pasta and cereal'),
 (4, 'Dairy Products', 'Cheeses'),
 (3, 'Confections', 'Desserts candies and sweet breads'),
 (7, 'Produce', 'Dried fruit and bean curd'),
 (6, 'Meat/Poultry', 'Prepared meats'),
 (8, 'Seafood', 'Seaweed and fish'),
 (1, 'Beverages', 'Soft drinks coffees teas beers and ales'),
 (2, 'Condiments', 'Sweet and savory sauces relishes spreads and seasonings')]

sortBy has an optional ascending parameter to sort in reverse order.

In [15]:
cat1.sortBy(lambda x : x[0], ascending = False).collect()

[(8, 'Seafood', 'Seaweed and fish'),
 (7, 'Produce', 'Dried fruit and bean curd'),
 (6, 'Meat/Poultry', 'Prepared meats'),
 (5, 'Grains/Cereals', 'Breads crackers pasta and cereal'),
 (4, 'Dairy Products', 'Cheeses'),
 (3, 'Confections', 'Desserts candies and sweet breads'),
 (2, 'Condiments', 'Sweet and savory sauces relishes spreads and seasonings'),
 (1, 'Beverages', 'Soft drinks coffees teas beers and ales')]
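
The key-function idea is the same one used by Python's built-in `sorted`, which makes for a quick local check. This is a plain-Python analogue rather than Spark code; the `records` list is a shortened, made-up version of the tuples above, and `reverse=True` plays the role of `ascending=False`.

```python
# Hypothetical (id, name, description) records standing in for cat1.
records = [(1, 'Beverages', 'Soft drinks'),
           (2, 'Condiments', 'Sweet sauces'),
           (4, 'Dairy Products', 'Cheeses')]

# Analogue of cat1.sortBy(lambda x: x[2]): order by the description field.
by_description = sorted(records, key=lambda x: x[2])
print(by_description)

# Analogue of cat1.sortBy(lambda x: x[0], ascending=False): ids descending.
by_id_descending = sorted(records, key=lambda x: x[0], reverse=True)
print(by_id_descending)
```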