### The Hadoop Distributed File System (HDFS) is a distributed file system that stores files across multiple nodes in a cluster. It splits large files into blocks and stores redundant copies of each block on several nodes, according to the replication factor configured for the cluster.
To examine the contents of the HDFS cluster, you either need to install the Hadoop tools on a local machine or ssh into a remote machine that has them installed.
Try the following commands to see what is currently on the cluster and add new files to it.

In [None]:
! hadoop fs -ls /


In [2]:
! hadoop fs -put /class/datasets/northwind/CSV/categories /


In [3]:
! hadoop fs -ls /
! hadoop fs -ls /categories


Found 14 items
drwxr-xr-x   - root supergroup          0 2021-11-09 20:24 /categories
drwxr-xr-x   - root supergroup          0 2021-11-09 17:01 /orders_part
drwxr-xr-x   - root supergroup          0 2021-11-09 16:55 /orders_table
drwxr-xr-x   - root supergroup          0 2021-11-09 17:34 /person
drwxr-xr-x   - root supergroup          0 2021-11-09 17:39 /person2
drwxr-xr-x   - root supergroup          0 2021-11-09 19:30 /region_avro
drwxr-xr-x   - root supergroup          0 2021-11-09 14:21 /regions
drwxr-xr-x   - root supergroup          0 2021-11-09 19:29 /regions_nested
drwxr-xr-x   - root supergroup          0 2021-11-09 16:08 /shippers4
drwx-wx-wx   - root supergroup          0 2021-11-09 15:32 /tmp
drwxr-xr-x   - root supergroup          0 2021-11-09 16:50 /transactions
drwxr-xr-x   - root supergroup          0 2021-11-09 14:19 /user
drwxr-xr-x   - root supergroup          0 2021-11-09 15:48 /usstates11
drwxr-xr-x   - root supergroup          0 2021-11-09 15:51 /usstates12
Found
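
The block and replica arithmetic described at the top of this section can be sketched in plain Python. This is only an illustration: the 128 MB block size and replication factor of 3 are common Hadoop defaults assumed here, not values read from this cluster, and `hdfs_footprint` is a hypothetical helper.

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Return (number of blocks, number of block copies stored cluster-wide)."""
    # Assumed defaults: 128 MB blocks and 3 replicas; check your cluster's settings.
    blocks = math.ceil(file_size_mb / block_size_mb)
    return blocks, blocks * replication

blocks, replicas = hdfs_footprint(1000)
print(blocks, replicas)  # 8 24
```

A 1000 MB file would be cut into 8 blocks under these assumptions, and the cluster would hold 24 block copies in total.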

### Create the Spark context to start a session and connect to the cluster.

In [1]:
import sys
sys.path.append('/class')
from initspark import *
sc, spark, conf = initspark()


initializing pyspark
pyspark initialized


### Read a text file from the local file system.

In [4]:
shake = sc.textFile('file:///class/datasets/text/shakespeare.txt')
print(shake.count())
print(shake.take(10))


124796
['The Project Gutenberg EBook of The Complete Works of William Shakespeare, by ', 'William Shakespeare', '', 'This eBook is for the use of anyone anywhere at no cost and with', 'almost no restrictions whatsoever.  You may copy it, give it away or', 're-use it under the terms of the Project Gutenberg License included', 'with this eBook or online at www.gutenberg.org', '', '** This is a COPYRIGHTED Project Gutenberg eBook, Details Below **', '**     Please follow the copyright guidelines in this file.     **']


### Read from HDFS.

In [5]:
cat = sc.textFile('hdfs://localhost:9000/categories')
#cat = sc.textFile('/categories')
print(cat.count())
print(cat.take(10))


8
['1,Beverages,Soft drinks coffees teas beers and ales', '2,Condiments,Sweet and savory sauces relishes spreads and seasonings', '3,Confections,Desserts candies and sweet breads', '4,Dairy Products,Cheeses', '5,Grains/Cereals,Breads crackers pasta and cereal', '6,Meat/Poultry,Prepared meats', '7,Produce,Dried fruit and bean curd', '8,Seafood,Seaweed and fish']


### The `map` method applies a function or lambda to each element of the collection and distributes the work across multiple workers so it runs in parallel.

In [7]:
shake2 = shake.map(lambda x : x.split(' '))
shake2.take(10)

[['The',
  'Project',
  'Gutenberg',
  'EBook',
  'of',
  'The',
  'Complete',
  'Works',
  'of',
  'William',
  'Shakespeare,',
  'by',
  ''],
 ['William', 'Shakespeare'],
 [''],
 ['This',
  'eBook',
  'is',
  'for',
  'the',
  'use',
  'of',
  'anyone',
  'anywhere',
  'at',
  'no',
  'cost',
  'and',
  'with'],
 ['almost',
  'no',
  'restrictions',
  'whatsoever.',
  '',
  'You',
  'may',
  'copy',
  'it,',
  'give',
  'it',
  'away',
  'or'],
 ['re-use',
  'it',
  'under',
  'the',
  'terms',
  'of',
  'the',
  'Project',
  'Gutenberg',
  'License',
  'included'],
 ['with', 'this', 'eBook', 'or', 'online', 'at', 'www.gutenberg.org'],
 [''],
 ['**',
  'This',
  'is',
  'a',
  'COPYRIGHTED',
  'Project',
  'Gutenberg',
  'eBook,',
  'Details',
  'Below',
  '**'],
 ['**',
  '',
  '',
  '',
  '',
  'Please',
  'follow',
  'the',
  'copyright',
  'guidelines',
  'in',
  'this',
  'file.',
  '',
  '',
  '',
  '',
  '**']]

### `parallelize` loads locally created data into the Spark cluster as an RDD.

In [8]:
r = sc.parallelize(range(1,11))
print(r.collect())
print(r.take(5))


[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 2, 3, 4, 5]


### Load a folder stored on HDFS or on the local file system and apply a function to each element.

In [10]:
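cat = sc.textFile('hdfs://localhost:9000/categories')  # read from HDFS
cat = sc.textFile('file:///class/datasets/northwind/CSV/categories')  # or read the local copy (this replaces the line above)
cat = cat.map(lambda x : x.upper())
print(cat.collect())
print(cat)  # prints the RDD object itself, not its contents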

['1,BEVERAGES,SOFT DRINKS COFFEES TEAS BEERS AND ALES', '2,CONDIMENTS,SWEET AND SAVORY SAUCES RELISHES SPREADS AND SEASONINGS', '3,CONFECTIONS,DESSERTS CANDIES AND SWEET BREADS', '4,DAIRY PRODUCTS,CHEESES', '5,GRAINS/CEREALS,BREADS CRACKERS PASTA AND CEREAL', '6,MEAT/POULTRY,PREPARED MEATS', '7,PRODUCE,DRIED FRUIT AND BEAN CURD', '8,SEAFOOD,SEAWEED AND FISH']
PythonRDD[28] at collect at <ipython-input-10-15712c7b4bb1>:4


### Try some different actions to fetch data.

In [11]:
print(cat.takeOrdered(5))
print(cat.top(5))
print(cat.takeSample(False,5))

cat.foreach(lambda x : print(x.upper())) # output goes to the workers' stdout, so it may not display in the notebook


['1,BEVERAGES,SOFT DRINKS COFFEES TEAS BEERS AND ALES', '2,CONDIMENTS,SWEET AND SAVORY SAUCES RELISHES SPREADS AND SEASONINGS', '3,CONFECTIONS,DESSERTS CANDIES AND SWEET BREADS', '4,DAIRY PRODUCTS,CHEESES', '5,GRAINS/CEREALS,BREADS CRACKERS PASTA AND CEREAL']
['8,SEAFOOD,SEAWEED AND FISH', '7,PRODUCE,DRIED FRUIT AND BEAN CURD', '6,MEAT/POULTRY,PREPARED MEATS', '5,GRAINS/CEREALS,BREADS CRACKERS PASTA AND CEREAL', '4,DAIRY PRODUCTS,CHEESES']
['6,MEAT/POULTRY,PREPARED MEATS', '4,DAIRY PRODUCTS,CHEESES', '1,BEVERAGES,SOFT DRINKS COFFEES TEAS BEERS AND ALES', '2,CONDIMENTS,SWEET AND SAVORY SAUCES RELISHES SPREADS AND SEASONINGS', '8,SEAFOOD,SEAWEED AND FISH']
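
For readers who want to check what these actions return without a cluster, the following plain-Python sketch mirrors them with the standard library. It is an analogy on a small hypothetical list, not Spark's implementation.

```python
import heapq
import random

# Hypothetical rows standing in for the elements of an RDD.
rows = ['3,CONFECTIONS', '1,BEVERAGES', '4,DAIRY PRODUCTS', '2,CONDIMENTS']

smallest = heapq.nsmallest(2, rows)  # comparable to rdd.takeOrdered(2)
largest = heapq.nlargest(2, rows)    # comparable to rdd.top(2)
sample = random.sample(rows, 2)      # comparable to rdd.takeSample(False, 2)

print(smallest)  # ['1,BEVERAGES', '2,CONDIMENTS']
print(largest)   # ['4,DAIRY PRODUCTS', '3,CONFECTIONS']
print(sample)    # two distinct rows, different on each run
```

The first argument of `takeSample` controls sampling with replacement; `False` means no row is picked twice, which is also how `random.sample` behaves.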


### Save the results in an RDD to disk. Note that Spark creates a folder and fills it with one part file per partition of the RDD. Also, make sure the folder does not already exist, or Spark throws an exception.

In [12]:
! rm -r /tmp/file1.txt
cat.saveAsTextFile('file:///tmp/file1.txt')
! cat /tmp/file1.txt/*

rm: cannot remove '/tmp/file1.txt': No such file or directory
1,BEVERAGES,SOFT DRINKS COFFEES TEAS BEERS AND ALES
2,CONDIMENTS,SWEET AND SAVORY SAUCES RELISHES SPREADS AND SEASONINGS
3,CONFECTIONS,DESSERTS CANDIES AND SWEET BREADS
4,DAIRY PRODUCTS,CHEESES
5,GRAINS/CEREALS,BREADS CRACKERS PASTA AND CEREAL
6,MEAT/POULTRY,PREPARED MEATS
7,PRODUCE,DRIED FRUIT AND BEAN CURD
8,SEAFOOD,SEAWEED AND FISH
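
The folder-of-part-files layout can be imitated in plain Python to make the behavior concrete. This is a rough sketch under stated assumptions: `save_partitions` is a hypothetical helper, the two partitions are made up, and the real `saveAsTextFile` does more than this.

```python
import os
import tempfile

def save_partitions(partitions, out_dir):
    """Write one part-NNNNN file per partition; fail if the folder exists."""
    os.makedirs(out_dir)  # raises FileExistsError when out_dir already exists
    for i, part in enumerate(partitions):
        with open(os.path.join(out_dir, f'part-{i:05d}'), 'w') as f:
            for line in part:
                f.write(line + '\n')

base = tempfile.mkdtemp()
out_dir = os.path.join(base, 'file1.txt')

# Two hypothetical partitions produce two part files.
save_partitions([['1,BEVERAGES', '2,CONDIMENTS'], ['3,CONFECTIONS']], out_dir)
files = sorted(os.listdir(out_dir))
print(files)  # ['part-00000', 'part-00001']

# Writing to the same folder a second time is refused.
try:
    save_partitions([['x']], out_dir)
    second_write_failed = False
except FileExistsError:
    second_write_failed = True
print(second_write_failed)  # True
```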


### Use the map method to apply a function call on each element.

In [13]:
shake2 = shake.map(str.upper)
shake2.take(10)


['THE PROJECT GUTENBERG EBOOK OF THE COMPLETE WORKS OF WILLIAM SHAKESPEARE, BY ',
 'WILLIAM SHAKESPEARE',
 '',
 'THIS EBOOK IS FOR THE USE OF ANYONE ANYWHERE AT NO COST AND WITH',
 'ALMOST NO RESTRICTIONS WHATSOEVER.  YOU MAY COPY IT, GIVE IT AWAY OR',
 'RE-USE IT UNDER THE TERMS OF THE PROJECT GUTENBERG LICENSE INCLUDED',
 'WITH THIS EBOOK OR ONLINE AT WWW.GUTENBERG.ORG',
 '',
 '** THIS IS A COPYRIGHTED PROJECT GUTENBERG EBOOK, DETAILS BELOW **',
 '**     PLEASE FOLLOW THE COPYRIGHT GUIDELINES IN THIS FILE.     **']

### Using the split method you get a list of lists. Apply a second map to change the lists to tuples and just keep the first two elements.

In [16]:
cat3 = cat.map(lambda x : x.split(','))
print(cat3.take(5))

cat3 = cat3.map(lambda x : (int(x[0]), x[1]))
print(cat3.take(5))


[['1', 'BEVERAGES', 'SOFT DRINKS COFFEES TEAS BEERS AND ALES'], ['2', 'CONDIMENTS', 'SWEET AND SAVORY SAUCES RELISHES SPREADS AND SEASONINGS'], ['3', 'CONFECTIONS', 'DESSERTS CANDIES AND SWEET BREADS'], ['4', 'DAIRY PRODUCTS', 'CHEESES'], ['5', 'GRAINS/CEREALS', 'BREADS CRACKERS PASTA AND CEREAL']]
[(1, 'BEVERAGES'), (2, 'CONDIMENTS'), (3, 'CONFECTIONS'), (4, 'DAIRY PRODUCTS'), (5, 'GRAINS/CEREALS')]


### With `split`, the `map` method returns an RDD in which each element is a list, in effect a list of lists.

In [17]:
shake3 = shake.map(lambda x : x.split(' '))
print(shake3.count(), shake3.take(20))

124796 [['The', 'Project', 'Gutenberg', 'EBook', 'of', 'The', 'Complete', 'Works', 'of', 'William', 'Shakespeare,', 'by', ''], ['William', 'Shakespeare'], [''], ['This', 'eBook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 'with'], ['almost', 'no', 'restrictions', 'whatsoever.', '', 'You', 'may', 'copy', 'it,', 'give', 'it', 'away', 'or'], ['re-use', 'it', 'under', 'the', 'terms', 'of', 'the', 'Project', 'Gutenberg', 'License', 'included'], ['with', 'this', 'eBook', 'or', 'online', 'at', 'www.gutenberg.org'], [''], ['**', 'This', 'is', 'a', 'COPYRIGHTED', 'Project', 'Gutenberg', 'eBook,', 'Details', 'Below', '**'], ['**', '', '', '', '', 'Please', 'follow', 'the', 'copyright', 'guidelines', 'in', 'this', 'file.', '', '', '', '', '**'], [''], ['Title:', 'The', 'Complete', 'Works', 'of', 'William', 'Shakespeare'], [''], ['Author:', 'William', 'Shakespeare'], [''], ['Posting', 'Date:', 'September', '1,', '2011', '[EBook', '#100]'], ['Release', 'Date:'

### The `flatMap` method flattens the inner lists to return one big list of strings instead.

In [18]:
shake4 = shake.flatMap(lambda x : x.split(' '))
print(shake4.count(), shake4.take(20))


1410759 ['The', 'Project', 'Gutenberg', 'EBook', 'of', 'The', 'Complete', 'Works', 'of', 'William', 'Shakespeare,', 'by', '', 'William', 'Shakespeare', '', 'This', 'eBook', 'is', 'for']
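
The difference between `map` and `flatMap` can be mimicked without Spark. The sketch below is a plain-Python analogy on a few hypothetical lines, not how Spark implements either method.

```python
from itertools import chain

# Hypothetical lines standing in for the elements of the text RDD.
lines = ['William Shakespeare', '', 'This eBook']

# map: one output element (a list of words) per input line.
mapped = [line.split(' ') for line in lines]

# flatMap: the per-line lists are concatenated into one flat list of words.
flat = list(chain.from_iterable(line.split(' ') for line in lines))

print(len(mapped), mapped)  # 3 [['William', 'Shakespeare'], [''], ['This', 'eBook']]
print(len(flat), flat)      # 5 ['William', 'Shakespeare', '', 'This', 'eBook']
```

This is why the counts differ in the notebook: `map` keeps one element per line, while `flatMap` yields one element per word.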


### You can map an existing function, not just a lambda.

In [19]:
print(cat.map(str.upper).collect())


['1,BEVERAGES,SOFT DRINKS COFFEES TEAS BEERS AND ALES', '2,CONDIMENTS,SWEET AND SAVORY SAUCES RELISHES SPREADS AND SEASONINGS', '3,CONFECTIONS,DESSERTS CANDIES AND SWEET BREADS', '4,DAIRY PRODUCTS,CHEESES', '5,GRAINS/CEREALS,BREADS CRACKERS PASTA AND CEREAL', '6,MEAT/POULTRY,PREPARED MEATS', '7,PRODUCE,DRIED FRUIT AND BEAN CURD', '8,SEAFOOD,SEAWEED AND FISH']


### Parse the string into a tuple to resemble a record structure.

In [20]:
cat1 = cat.map(lambda x : tuple(x.split(',')))
cat1 = cat1.map(lambda x : (int(x[0]), x[1], x[2]))
cat1.take(10)


[(1, 'BEVERAGES', 'SOFT DRINKS COFFEES TEAS BEERS AND ALES'),
 (2, 'CONDIMENTS', 'SWEET AND SAVORY SAUCES RELISHES SPREADS AND SEASONINGS'),
 (3, 'CONFECTIONS', 'DESSERTS CANDIES AND SWEET BREADS'),
 (4, 'DAIRY PRODUCTS', 'CHEESES'),
 (5, 'GRAINS/CEREALS', 'BREADS CRACKERS PASTA AND CEREAL'),
 (6, 'MEAT/POULTRY', 'PREPARED MEATS'),
 (7, 'PRODUCE', 'DRIED FRUIT AND BEAN CURD'),
 (8, 'SEAFOOD', 'SEAWEED AND FISH')]

## LAB: ## 
### Put the regions folder found in /class/datasets/northwind/CSV/regions into HDFS. Read it into an RDD and convert it into a tuple shape.
<br>
<details><summary>Click for <b>hint</b></summary>
<p>
Use hadoop fs -put or hdfs dfs -put
<br>
Read the file using sc.textFile
<br>
Do a map to split and another to convert the datatypes
<br>
<br>
</p>
</details>

<details><summary>Click for <b>code</b></summary>
<p>

```python
! hadoop fs -put /class/datasets/northwind/CSV/regions /regions
regions = sc.textFile('hdfs://localhost:9000/regions')
regions = regions.map(lambda x : x.split(',')).map(lambda x : (int(x[0]), x[1]))
print(regions.collect())
```
</p>
</details>

[{'id': 1, 'name': 'Eastern'}, {'id': 2, 'name': 'Western'}, {'id': 3, 'name': 'Northern'}, {'id': 4, 'name': 'Southern'}]


### You can write more complex Python functions to map onto elements instead of using simple lambdas.

In [21]:
def multicomma(x):
    if len(x) == 2:
        return (int(x[0]), x[1], 'Unknown')
    elif len(x) == 3:
        return (int(x[0]), x[1], x[2])
    else:
        return (None, None, None)
        
cat3 = cat.map(lambda x : x.split(',')).map(multicomma)
print(cat3.collect())

[(1, 'BEVERAGES', 'SOFT DRINKS COFFEES TEAS BEERS AND ALES'), (2, 'CONDIMENTS', 'SWEET AND SAVORY SAUCES RELISHES SPREADS AND SEASONINGS'), (3, 'CONFECTIONS', 'DESSERTS CANDIES AND SWEET BREADS'), (4, 'DAIRY PRODUCTS', 'CHEESES'), (5, 'GRAINS/CEREALS', 'BREADS CRACKERS PASTA AND CEREAL'), (6, 'MEAT/POULTRY', 'PREPARED MEATS'), (7, 'PRODUCE', 'DRIED FRUIT AND BEAN CURD'), (8, 'SEAFOOD', 'SEAWEED AND FISH')]
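
To see how `multicomma` treats irregular rows, it can be called directly on a few hand-made lines. The sample lines below are hypothetical and not part of the categories data, which has exactly three fields per row.

```python
def multicomma(x):
    # Same logic as the notebook cell: pad two-field rows, reject anything else.
    if len(x) == 2:
        return (int(x[0]), x[1], 'Unknown')
    elif len(x) == 3:
        return (int(x[0]), x[1], x[2])
    else:
        return (None, None, None)

# Hypothetical input: a complete row, a row missing its description, a malformed row.
lines = ['1,Beverages,Soft drinks', '9,Mystery', 'bad,row,with,too,many,fields']
parsed = [multicomma(line.split(',')) for line in lines]
print(parsed)
# [(1, 'Beverages', 'Soft drinks'), (9, 'Mystery', 'Unknown'), (None, None, None)]
```

Testing the function locally like this is often quicker than debugging it inside a distributed `map`.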


### You can chain multiple transformations together to do it all in one step.
#### Here we converted the datatypes to int, then turned the tuple into a dictionary.

In [22]:
cat2 = cat.map(lambda x : tuple(x.split(','))) \
      .map(lambda x : (int(x[0]), x[1], x[2])) \
      .map(lambda x : dict(zip(['CategoryID', 'Name', 'Description'], x)))
cat2.take(10)


[{'CategoryID': 1,
  'Name': 'BEVERAGES',
  'Description': 'SOFT DRINKS COFFEES TEAS BEERS AND ALES'},
 {'CategoryID': 2,
  'Name': 'CONDIMENTS',
  'Description': 'SWEET AND SAVORY SAUCES RELISHES SPREADS AND SEASONINGS'},
 {'CategoryID': 3,
  'Name': 'CONFECTIONS',
  'Description': 'DESSERTS CANDIES AND SWEET BREADS'},
 {'CategoryID': 4, 'Name': 'DAIRY PRODUCTS', 'Description': 'CHEESES'},
 {'CategoryID': 5,
  'Name': 'GRAINS/CEREALS',
  'Description': 'BREADS CRACKERS PASTA AND CEREAL'},
 {'CategoryID': 6, 'Name': 'MEAT/POULTRY', 'Description': 'PREPARED MEATS'},
 {'CategoryID': 7,
  'Name': 'PRODUCE',
  'Description': 'DRIED FRUIT AND BEAN CURD'},
 {'CategoryID': 8, 'Name': 'SEAFOOD', 'Description': 'SEAWEED AND FISH'}]

### You could also write a single function that performs all the steps on each element and call a single `map` instead.

In [34]:
def parse_cat(x):
    step1 = x.split(',')
    step2 = (int(step1[0]), step1[1], step1[2])
    step3 = dict(zip(['categoryid', 'name', 'description'], step2))
    return step3
#cat.take(2)    
cat.map(parse_cat).take(10)

[{'categoryid': 1,
  'name': 'BEVERAGES',
  'description': 'SOFT DRINKS COFFEES TEAS BEERS AND ALES'},
 {'categoryid': 2,
  'name': 'CONDIMENTS',
  'description': 'SWEET AND SAVORY SAUCES RELISHES SPREADS AND SEASONINGS'},
 {'categoryid': 3,
  'name': 'CONFECTIONS',
  'description': 'DESSERTS CANDIES AND SWEET BREADS'},
 {'categoryid': 4, 'name': 'DAIRY PRODUCTS', 'description': 'CHEESES'},
 {'categoryid': 5,
  'name': 'GRAINS/CEREALS',
  'description': 'BREADS CRACKERS PASTA AND CEREAL'},
 {'categoryid': 6, 'name': 'MEAT/POULTRY', 'description': 'PREPARED MEATS'},
 {'categoryid': 7,
  'name': 'PRODUCE',
  'description': 'DRIED FRUIT AND BEAN CURD'},
 {'categoryid': 8, 'name': 'SEAFOOD', 'description': 'SEAWEED AND FISH'}]

### The `filter` method takes a function or lambda that returns True or False and keeps only the elements for which it returns True.

In [35]:
#cat2.filter(lambda x : x['CategoryID'] <= 5).collect()
#cat2.filter(lambda x : x['CategoryID'] % 2 == 0).collect()
cat2.filter(lambda x : x['Name'].startswith('S')).collect()


[{'CategoryID': 8, 'Name': 'SEAFOOD', 'Description': 'SEAWEED AND FISH'}]

### The `filter` expressions can be more complicated.

In [36]:
cat2.filter(lambda x : x['CategoryID'] % 2 == 0 and 'e' in x['Name'].lower()).collect()


[{'CategoryID': 2,
  'Name': 'CONDIMENTS',
  'Description': 'SWEET AND SAVORY SAUCES RELISHES SPREADS AND SEASONINGS'},
 {'CategoryID': 6, 'Name': 'MEAT/POULTRY', 'Description': 'PREPARED MEATS'},
 {'CategoryID': 8, 'Name': 'SEAFOOD', 'Description': 'SEAWEED AND FISH'}]

### The `sortBy` method takes a function that returns the value used to sort the data.

In [38]:
cat2.sortBy(lambda x : x['Name']).collect()


[{'CategoryID': 1,
  'Name': 'BEVERAGES',
  'Description': 'SOFT DRINKS COFFEES TEAS BEERS AND ALES'},
 {'CategoryID': 2,
  'Name': 'CONDIMENTS',
  'Description': 'SWEET AND SAVORY SAUCES RELISHES SPREADS AND SEASONINGS'},
 {'CategoryID': 3,
  'Name': 'CONFECTIONS',
  'Description': 'DESSERTS CANDIES AND SWEET BREADS'},
 {'CategoryID': 4, 'Name': 'DAIRY PRODUCTS', 'Description': 'CHEESES'},
 {'CategoryID': 5,
  'Name': 'GRAINS/CEREALS',
  'Description': 'BREADS CRACKERS PASTA AND CEREAL'},
 {'CategoryID': 6, 'Name': 'MEAT/POULTRY', 'Description': 'PREPARED MEATS'},
 {'CategoryID': 7,
  'Name': 'PRODUCE',
  'Description': 'DRIED FRUIT AND BEAN CURD'},
 {'CategoryID': 8, 'Name': 'SEAFOOD', 'Description': 'SEAWEED AND FISH'}]

### `sortBy` has an optional `ascending` parameter; set it to False to sort in reverse order.

In [39]:
cat1 = cat.map(lambda x : x.split(','))
cat1.sortBy(lambda x : x[0], ascending = False).collect()


[['8', 'SEAFOOD', 'SEAWEED AND FISH'],
 ['7', 'PRODUCE', 'DRIED FRUIT AND BEAN CURD'],
 ['6', 'MEAT/POULTRY', 'PREPARED MEATS'],
 ['5', 'GRAINS/CEREALS', 'BREADS CRACKERS PASTA AND CEREAL'],
 ['4', 'DAIRY PRODUCTS', 'CHEESES'],
 ['3', 'CONFECTIONS', 'DESSERTS CANDIES AND SWEET BREADS'],
 ['2',
  'CONDIMENTS',
  'SWEET AND SAVORY SAUCES RELISHES SPREADS AND SEASONINGS'],
 ['1', 'BEVERAGES', 'SOFT DRINKS COFFEES TEAS BEERS AND ALES']]
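
One caveat worth noting: in this cell the IDs are still strings, so the sort is alphabetical. That happens to work for IDs 1 through 8 but would misorder two-digit IDs. The plain-Python sketch below shows the effect on hypothetical rows; the same key functions could be passed to `sortBy`.

```python
# Hypothetical rows whose IDs are strings, as produced by split(',').
rows = [['9', 'NINE'], ['10', 'TEN'], ['2', 'TWO']]

# Sorting on the raw string compares character by character.
by_string = sorted(rows, key=lambda x: x[0], reverse=True)

# Converting the key to int gives numeric order.
by_number = sorted(rows, key=lambda x: int(x[0]), reverse=True)

print([r[0] for r in by_string])  # ['9', '2', '10']
print([r[0] for r in by_number])  # ['10', '9', '2']
```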

## LAB: ##
### Try to sort region in descending order by ID and then by name in ascending order. ###

<br>
<details><summary>Click for <b>hint</b></summary>
<p>
Use sortByKey and sortBy respectively
<br>
sortBy needs a lambda
<br><br>
</p>
</details>

<details><summary>Click for <b>code</b></summary>
<p>

```python
print(regions.sortByKey(ascending = False).collect())
print(regions.sortBy(lambda x : x[1]).collect())
```
</p>
</details>

[['1', 'BEVERAGES', 'SOFT DRINKS COFFEES TEAS BEERS AND ALES'],
 ['2',
  'CONDIMENTS',
  'SWEET AND SAVORY SAUCES RELISHES SPREADS AND SEASONINGS'],
 ['3', 'CONFECTIONS', 'DESSERTS CANDIES AND SWEET BREADS'],
 ['4', 'DAIRY PRODUCTS', 'CHEESES'],
 ['5', 'GRAINS/CEREALS', 'BREADS CRACKERS PASTA AND CEREAL'],
 ['6', 'MEAT/POULTRY', 'PREPARED MEATS'],
 ['7', 'PRODUCE', 'DRIED FRUIT AND BEAN CURD'],
 ['8', 'SEAFOOD', 'SEAWEED AND FISH']]

### We have to be aware of the shape of the data. Some functions require data to be shaped into just two elements, a key and a value, so the value portion may be more complex than a single simple value. We often need to reshape complex data into a simple two-element tuple as a preparatory step for another function.

```
# (1, 'A', 'B')
# (2, 'C', 'D')  3 elements per row
# (1, 'E', 'F')

# reshape to just 2 elements (Key, everything else)
# (1, ('A','B'))
# (2, ('C','D')) 2 elements per row 1 Key, 2 everything else wrapped up in a tuple or dict
# (1, ('E','F'))
```
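
The reshaping shown in the comments above can be tried in plain Python first. This is a minimal sketch on the same three example rows; in Spark the equivalent step would be a `map` with the same expression.

```python
# The three-element rows from the comment block above.
rows = [(1, 'A', 'B'), (2, 'C', 'D'), (1, 'E', 'F')]

# Keep the first element as the key and wrap everything else in a tuple.
pairs = [(row[0], row[1:]) for row in rows]

print(pairs)  # [(1, ('A', 'B')), (2, ('C', 'D')), (1, ('E', 'F'))]
```

Slicing with `row[1:]` keeps the sketch independent of how many value columns a row has.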

### Reshape categories from a tuple of three elements like (1, 'Beverages', 'Soft drinks') to a tuple with two elements (key, value) like (1, ('Beverages', 'Soft drinks')).

In [41]:
cat3 = cat1.map(lambda x : (x[0], (x[1], x[2]))) # eliminate the key from the values collection
cat3 = cat1.map(lambda x : (x[1], x)) # keep the key in the values collection
cat3.collect()


[('BEVERAGES', ['1', 'BEVERAGES', 'SOFT DRINKS COFFEES TEAS BEERS AND ALES']),
 ('CONDIMENTS',
  ['2',
   'CONDIMENTS',
   'SWEET AND SAVORY SAUCES RELISHES SPREADS AND SEASONINGS']),
 ('CONFECTIONS', ['3', 'CONFECTIONS', 'DESSERTS CANDIES AND SWEET BREADS']),
 ('DAIRY PRODUCTS', ['4', 'DAIRY PRODUCTS', 'CHEESES']),
 ('GRAINS/CEREALS',
  ['5', 'GRAINS/CEREALS', 'BREADS CRACKERS PASTA AND CEREAL']),
 ('MEAT/POULTRY', ['6', 'MEAT/POULTRY', 'PREPARED MEATS']),
 ('PRODUCE', ['7', 'PRODUCE', 'DRIED FRUIT AND BEAN CURD']),
 ('SEAFOOD', ['8', 'SEAFOOD', 'SEAWEED AND FISH'])]

### The sortByKey method does not require a function as a parameter if the data is structured into a tuple of the shape (key, value).

In [69]:
#cat3.sortByKey(ascending=False).collect()
#cat3.sortByKey(ascending=False).map(lambda x : x[1] ).collect()
cat1.map(lambda x: (x[1], x)).sortByKey().map(lambda x : x[1]).collect()



# select col1, count(col2)
# from table
# where col3 = 3
# group by col1
#
# The same query written in steps with a CTE:
# with step1 as
# (select col1, col2 from table where col3 = 3)
# select col1, count(col2) from step1
# group by col1


[['1', 'BEVERAGES', 'SOFT DRINKS COFFEES TEAS BEERS AND ALES'],
 ['2',
  'CONDIMENTS',
  'SWEET AND SAVORY SAUCES RELISHES SPREADS AND SEASONINGS'],
 ['3', 'CONFECTIONS', 'DESSERTS CANDIES AND SWEET BREADS'],
 ['4', 'DAIRY PRODUCTS', 'CHEESES'],
 ['5', 'GRAINS/CEREALS', 'BREADS CRACKERS PASTA AND CEREAL'],
 ['6', 'MEAT/POULTRY', 'PREPARED MEATS'],
 ['7', 'PRODUCE', 'DRIED FRUIT AND BEAN CURD'],
 ['8', 'SEAFOOD', 'SEAWEED AND FISH']]

### Read in another CSV file.

In [70]:
prod = sc.textFile('file:///class/datasets/northwind/CSV/products')
print(prod.count())
prod.take(4)


77


['1,Chai,8,1,10 boxes x 30 bags,18.0,39,0,10,1',
 '2,Chang,1,1,24 - 12 oz bottles,19.0,17,40,25,1',
 '3,Aniseed Syrup,1,2,12 - 550 ml bottles,10.0,13,70,25,0',
 "4,Chef Anton's Cajun Seasoning,2,2,48 - 6 oz jars,22.0,53,0,0,0"]

### Split it up and just keep the ProductID, ProductName, CategoryID, Price, Quantity values.

In [73]:
#prod1 = prod.map(lambda x : x.split(',')).map(lambda x : (int(x[0]), x[1], int(x[3]), float(x[5]), int(x[6])))
prod1 = (sc.textFile('file:///class/datasets/northwind/CSV/products')
          .map(lambda x : x.split(','))
          .map(lambda x : (int(x[0]), x[1], int(x[3]), float(x[5]), int(x[6])))
        )
prod1.take(5)

# FROM filename
# SELECT ....
# SELECT

[(1, 'Chai', 1, 18.0, 39),
 (2, 'Chang', 1, 19.0, 17),
 (3, 'Aniseed Syrup', 2, 10.0, 13),
 (4, "Chef Anton's Cajun Seasoning", 2, 22.0, 53),
 (5, "Chef Anton's Gumbo Mix", 2, 21.35, 0)]

### Reshape it to a key value tuple where category is the key and the other fields are the values.

In [74]:
prod2 = prod1.map(lambda x : (x[2], (x[0], x[1], x[3], x[4])))
prod2.take(5)


[(1, (1, 'Chai', 18.0, 39)),
 (1, (2, 'Chang', 19.0, 17)),
 (2, (3, 'Aniseed Syrup', 10.0, 13)),
 (2, (4, "Chef Anton's Cajun Seasoning", 22.0, 53)),
 (2, (5, "Chef Anton's Gumbo Mix", 21.35, 0))]

In [81]:

cat1.collect()
cat2 = cat1.map(lambda x : (int(x[0]), x[1].title()))
cat2.collect()

[(1, 'Beverages'),
 (2, 'Condiments'),
 (3, 'Confections'),
 (4, 'Dairy Products'),
 (5, 'Grains/Cereals'),
 (6, 'Meat/Poultry'),
 (7, 'Produce'),
 (8, 'Seafood')]

### Both cat2 and prod2 are in key-value tuple format, so they can be joined to produce a new tuple of (key, (cat, prod)).

In [82]:
joined = cat2.join(prod2)
joined.sortByKey().take(15)


[(1, ('Beverages', (1, 'Chai', 18.0, 39))),
 (1, ('Beverages', (2, 'Chang', 19.0, 17))),
 (1, ('Beverages', (24, 'Guarana Fantastica', 4.5, 20))),
 (1, ('Beverages', (34, 'Sasquatch Ale', 14.0, 111))),
 (1, ('Beverages', (35, 'Steeleye Stout', 18.0, 20))),
 (1, ('Beverages', (38, 'Cote de Blaye', 263.5, 17))),
 (1, ('Beverages', (39, 'Chartreuse verte', 18.0, 69))),
 (1, ('Beverages', (43, 'Ipoh Coffee', 46.0, 17))),
 (1, ('Beverages', (67, 'Laughing Lumberjack Lager', 14.0, 52))),
 (1, ('Beverages', (70, 'Outback Lager', 15.0, 15))),
 (1, ('Beverages', (75, 'Rhonbrau Klosterbier', 7.75, 125))),
 (1, ('Beverages', (76, 'Lakkalikoori', 18.0, 57))),
 (2, ('Condiments', (3, 'Aniseed Syrup', 10.0, 13))),
 (2, ('Condiments', (4, "Chef Anton's Cajun Seasoning", 22.0, 53))),
 (2, ('Condiments', (5, "Chef Anton's Gumbo Mix", 21.35, 0)))]
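
The shape of the joined result is easier to reason about with a small model. The sketch below is a plain-Python inner join on key-value pairs; `inner_join` is a hypothetical helper written for illustration, not Spark's algorithm, and the rows are a hand-picked subset.

```python
from collections import defaultdict

def inner_join(left, right):
    """Pair every left value with every right value that shares its key."""
    right_by_key = defaultdict(list)
    for k, v in right:
        right_by_key[k].append(v)
    # Keys with no match on the right produce no output rows.
    return [(k, (lv, rv)) for k, lv in left for rv in right_by_key.get(k, [])]

cats = [(1, 'Beverages'), (2, 'Condiments'), (3, 'Confections')]
prods = [(1, (1, 'Chai', 18.0, 39)),
         (1, (2, 'Chang', 19.0, 17)),
         (2, (3, 'Aniseed Syrup', 10.0, 13))]

joined = inner_join(cats, prods)
for row in joined:
    print(row)
# (1, ('Beverages', (1, 'Chai', 18.0, 39)))
# (1, ('Beverages', (2, 'Chang', 19.0, 17)))
# (2, ('Condiments', (3, 'Aniseed Syrup', 10.0, 13)))
```

Category 3 has no matching product in this sample, so it drops out, which is the behavior of an inner join.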

## LAB: ##
### Load territories into HDFS and join it to regions. ###


<br>
<details><summary>Click for <b>hint</b></summary>
<p>
Put /class/datasets/northwind/CSV/territories into HDFS
<br>
Use sc.textFile to read it into an RDD
<br>
Use map to split and convert it to the proper datatypes
<br>
Use the join method
<br><br>
</p>
</details>

<details><summary>Click for <b>code</b></summary>
<p>

```python
! hadoop fs -put /class/datasets/northwind/CSV/territories /

territories = sc.textFile('hdfs://localhost:9000/territories')
territories = territories.map(lambda x : x.split(',')).map(lambda x : (int(x[0]), x[1], int(x[2])))
print(territories.collect())

region_territories = regions.join(territories.map(lambda x : (x[2], (x[0],x[1]))))
print(region_territories.collect())
# Reshape it to make it look more normal. The * in front of the x is a python unpacking trick
region_territories = region_territories.map(lambda x : (x[0], (x[1][0], *x[1][1])))
print(region_territories.collect())
```
</p>
</details>

### The groupBy methods are seldom used, but they can produce hierarchies where child records are embedded inside a parent.

In [None]:
group1 = prod2.groupByKey()
group1.take(3)


In [None]:
list(group1.take(1)[0][1])


In [None]:
group2 = [(key, list(it)) for key, it in group1.collect()]
for k,v in group2:
    print ('Key:', k)
    for x in v:
        print(x)
#print (group2)
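
What `groupByKey` produces can be pictured with a dictionary of lists. The following is a plain-Python analogy on hypothetical pairs; Spark returns an iterable per key rather than a list, which is why the cells above wrap it in `list`.

```python
from collections import defaultdict

# Hypothetical (categoryID, product name) pairs.
pairs = [(1, 'Chai'), (2, 'Aniseed Syrup'), (1, 'Chang')]

groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)  # collect every value under its key

grouped = sorted(groups.items())
print(grouped)  # [(1, ['Chai', 'Chang']), (2, ['Aniseed Syrup'])]
```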


### The reduce methods take a function as a parameter that tells Spark how to accumulate the values for each group. The function takes two parameters; the first is the accumulated value and the second is the next value in the list. 

In [None]:
shake4.map(lambda x : (x, 1)).reduceByKey(lambda x, y : x + y).sortBy(lambda x : x[1], ascending = False).take(10)
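
The accumulation that `reduceByKey` performs can be modeled with `functools.reduce` applied per key. This is a sketch only: `reduce_by_key` is a hypothetical helper and the sentence is made-up input. Spark combines partial results from different partitions, so the function passed in should be associative and commutative, as addition is.

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, func):
    """Group values by key, then fold each group with func(accumulated, next)."""
    buckets = defaultdict(list)
    for k, v in pairs:
        buckets[k].append(v)
    return {k: reduce(func, vs) for k, vs in buckets.items()}

words = 'to be or not to be'.split(' ')
counts = reduce_by_key([(w, 1) for w in words], lambda x, y: x + y)
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}

# Most frequent words first, like sortBy(lambda x : x[1], ascending = False).
top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:2]
print(top)
```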


## LAB: ## 
### Use the territories RDD to count how many territories are in each region. 
### Display the results in regionID order and then descending order based on the counts.
<br>
<details><summary>Click for <b>hint</b></summary>
<p>
Use map to put the key first then reduceByKey to accumulate the values
<br>
Use sortByKey to sort by regionID and sortBy with a lambda to sort by counts
<br><br>
</p>
</details>

<details><summary>Click for <b>code</b></summary>
<p>

```python
region_count = territories.map(lambda x : (x[2], 1)).reduceByKey(lambda x, y: x + y)
print(region_count.sortByKey().collect())
print(region_count.sortBy(lambda x : x[1], ascending = False).collect())
```
</p>
</details>

### In this example, we are adding up all the prices for each categoryID.

In [None]:
red1 = prod2.map(lambda x : (x[0], x[1][2])).reduceByKey(lambda x, y: x + y)
red1.collect()


### To accumulate more than one value, use a tuple to hold as many values as you want to aggregate.

In [None]:
red1 = prod2.map(lambda x : (x[0], (x[1][2], x[1][3], 1))).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2]))
red1.collect()
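
Carrying a count of 1 in the tuple is what makes an average possible afterwards. The plain-Python sketch below walks through the same accumulation on three hypothetical (categoryID, (price, quantity, 1)) rows and then divides the price total by the count.

```python
# Hypothetical rows shaped like the output of the map step above.
rows = [(1, (18.0, 39, 1)), (1, (19.0, 17, 1)), (2, (10.0, 13, 1))]

totals = {}
for key, value in rows:
    if key in totals:
        # Element-wise addition, the same as the lambda passed to reduceByKey.
        totals[key] = tuple(a + b for a, b in zip(totals[key], value))
    else:
        totals[key] = value

print(totals)  # {1: (37.0, 56, 2), 2: (10.0, 13, 1)}

# Average price per category = summed price / number of rows.
averages = {k: price_sum / n for k, (price_sum, qty_sum, n) in totals.items()}
print(averages)  # {1: 18.5, 2: 10.0}
```

In Spark the final division could be a further `map` over the reduced pairs.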


### Some Python magic can make things easier in the long run.
Named tuples make accessing the elements of a row easier.
Unpacking with `*` is a neat Python trick that is widely used.

`datetime.strptime` converts a string into a datetime, and `.date()` reduces it to a date.

In [None]:
mort = sc.textFile('file:///class/datasets/finance/30YearMortgage.csv')
head = mort.first()
mort = mort.filter(lambda x : x != head)


In [None]:
from datetime import date, datetime
from collections import namedtuple
Rate = namedtuple('Rate','date fed_fund_rate avg_rate_30year')
mort1 = mort.map(lambda x : Rate(*(x.split(','))))
mort2 = mort1.map(lambda x : Rate(datetime.strptime(x.date, '%Y-%m').date(), float(x.fed_fund_rate), float(x.avg_rate_30year)))
mort2.take(5)


In [None]:
mort2.filter(lambda x : x.fed_fund_rate > .1 ).collect()
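
The same named tuple, unpacking, and date parsing steps can be tried without Spark. The two sample lines below are hypothetical values invented for illustration, not rows from 30YearMortgage.csv; only the `'%Y-%m'` format and the field names come from the cells above.

```python
from collections import namedtuple
from datetime import date, datetime

Rate = namedtuple('Rate', 'date fed_fund_rate avg_rate_30year')

# Hypothetical CSV lines in the year-month,rate,rate layout assumed above.
lines = ['2000-01,5.45,8.21', '2009-06,0.21,5.42']

# * unpacks the list from split(',') into the three namedtuple fields.
raw = [Rate(*line.split(',')) for line in lines]

# Convert each field from a string to its proper type.
rates = [Rate(datetime.strptime(r.date, '%Y-%m').date(),
              float(r.fed_fund_rate),
              float(r.avg_rate_30year)) for r in raw]

# Fields are accessed by name instead of by position.
high = [r for r in rates if r.fed_fund_rate > 1.0]
print(rates[0].date, len(high))  # 2000-01-01 1
```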


### HOMEWORK:
1. The creditcard.csv dataset provides sample data on credit card transactions
2. Load the file into HDFS
3. Load the file into an RDD
4. Parse the file into a tuple or namedtuple or dictionary
5. Make sure to convert columns to the right data types
6. You can ignore any columns you don’t need for the solution
7. Filter the data to show only transactions made by women
8. Calculate the amount spent in each city
