# Spark Introduction

The goal of this assignment is to develop some expertise and familiarity with Spark, using RDDs and pySpark.

There are 2 different datasets that you will use:
* Rx dataset: Medication prescriptions in the United Kingdom from July 2016 to September 2017
* Bioinformatics dataset: Tardigrade and bacteria genome sequences

There are 3 tasks:
1. Rx dataset

    Task 1: Compute the total **net ingredient cost** of prescription items dispensed for each PERIOD

    Task 2: Compute the 5 practices that issued the prescriptions with the highest total net ingredient cost

2. Bioinformatics dataset

    Task 3: Label each sequence from a provided sample as most likely being tardigrade or bacterial, using Edit Distance.


## Datasets
### Rx Dataset

We will be using practice prescribing data from the UK National Health Service.
The data set itself is a set of simple text files. Each prescription/prescribing practice combination is on its own line in a file.
The attributes present on each line of the files are, in order:

| Field    | Description                             |
|----------|-----------------------------------------|
| SHA      | Area team identifier                    |
| PCT      | Clinical commissioning group identifier |
| PRACTICE | Practice identifier                     |
| BNF_CODE | British National Formulary (BNF) code   |
| BNF_NAME | BNF name                                |
| ITEMS    | Number of prescription items dispensed  |
| NIC      | Net ingredient cost (pounds and pence)  |
| ACT_COST | Actual cost (pounds and pence)          |
| QUANTITY | Quantity - whole numbers                |
| PERIOD   | YYYYMM                                  |

Some additional information on the data can be found here:

https://digital.nhs.uk/data-and-information/areas-of-interest/prescribing/practice-level-prescribing-in-england-a-summary/practice-level-prescribing-glossary-of-terms

The data files are in comma separated values (CSV) format.


A super-small subset of the first file (only about 1000 lines) is available for download (see Canvas).  This file may be used on your computer using Docker and the Spark container. If you want, you can also use this file for testing and debugging by loading it into HDFS (just like you did in lab) and then running your Spark program over it. 

### Tardigrades

What is a tardigrade and why are we looking at this problem?

Tardigrades, also known as **water bears**, are micro-animals that live in water. They are caterpillar-like, with 4 pairs of legs and segmented bodies. They are ubiquitous and resilient, and have been found just about everywhere in the world. (https://en.wikipedia.org/wiki/Tardigrade)

In 2015, Boothby et al. published a paper claiming that the tardigrade's ability to survive extreme conditions is due to horizontal gene transfer (HGT), the transfer of genetic material between species, from many different species, including bacteria, fungi, and plants.

Koutsovoulos et al. investigated Boothby's claim and rebutted it, arguing that the evidence seen was DNA sample contamination, not actual HGT.

* The contaminated tardigrade assemblies are in the file ``LMYF01.1.oneline.fa``

You will be comparing these contigs with contigs in the following other files:

* The **clean** tardigrade reference assemblies are in the file ``nHd.2.3.abv500.oneline.fa``

* Bacterial contigs are in the file ``exp1.oneline.fa``



Each file contains a set of lines, one line per contig. Valid lines start with the ``>`` symbol, followed by the organism name. Next is a vertical bar (``|``) followed by a unique identifier for the contig within the organism. There may then be additional text describing the contig. Finally, there will be a ``<`` symbol. After this symbol, the remaining text on the line contains the DNA code. As you may know, this text consists of the characters A, C, T, and G.

Valid contig lines start with a ``>`` and contain only the specified letters in the DNA code.
You should only include valid lines in your analysis.
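
The line format described above can be checked with a small parser. The following is only a sketch: the function name `parse_contig_line` is hypothetical, and it assumes that the contig identifier is the first whitespace-delimited token after the vertical bar and that the DNA code uses upper-case letters only.

```python
import re

# Assumption: a valid DNA code is a non-empty run of upper-case A, C, G, T.
_DNA_RE = re.compile(r"[ACGT]+")


def parse_contig_line(line):
    """Return (organism, contig_id, dna) for a valid contig line, else None."""
    line = line.strip()
    if not line.startswith(">"):
        return None
    # Everything after the first '<' is treated as the DNA code.
    header, sep, dna = line[1:].partition("<")
    if not sep or not _DNA_RE.fullmatch(dna):
        return None
    organism, bar, rest = header.partition("|")
    if not bar:
        return None
    # Assumption: the identifier ends at the first whitespace;
    # any remaining text is a free-form description.
    parts = rest.split(None, 1)
    if not parts:
        return None
    return organism.strip(), parts[0], dna
```

With an RDD of raw lines, invalid lines could then be dropped with something like ``rdd.map(parse_contig_line).filter(lambda r: r is not None)``.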

1. Boothby TC, Tenlen JR, Smith FW, Wang JR, Patanella KA, Nishimura EO, et al. Evidence for extensive horizontal gene transfer from the draft genome of a tardigrade. Proceedings of the National Academy of Sciences. 2015;112(52):15976-81.

2. Koutsovoulos G, Kumar S, Laetsch DR, Stevens L, Daub J, Conlon C, et al. No evidence for extensive horizontal gene transfer in the genome of the tardigrade Hypsibius dujardini. Proceedings of the National Academy of Sciences. 2016:201600338.

## Start Spark Context

Make sure to execute this cell first, and only once per session.

In [2]:
from pyspark import SparkContext
sc = SparkContext(master="local[4]")

## Read in the (small) file

In [14]:
raw = sc.textFile('../data/rxSmallSubset.csv')

## Task 1

Write a program that computes the total "net ingredient cost" of prescription items dispensed for each PERIOD in the data set (total pounds and pence from the NIC field).

As you do this, be aware that this data (like all real data) can be quite noisy and dirty. The first line in the file might describe the schema, and so it doesn’t have any valid data, just a bunch of text. You may find lines that do not have enough entries on them, or where an entry is of the wrong type (for example, the NIC or ACT_COST cannot be converted into a decimal number). Basically, you need to write robust code. If you find any error on a line, simply discard the line. Your code should still output the correct result.


For your results, print out each period, in sorted order, followed by the total net ingredient cost for that period.

The following steps are just a guide. Feel free to do it your own way.

#### Define a function that checks if a string is a valid number
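
One possible shape for such a check is sketched below. The name `is_number` is hypothetical, and rejecting `nan` and `inf` is an assumption about what should count as a valid cost, not something the assignment specifies.

```python
import math


def is_number(text):
    """Return True if text can be converted to a finite float."""
    try:
        value = float(text)
    except (TypeError, ValueError):
        return False
    # float() also accepts strings such as "nan" and "inf";
    # treating those as invalid is an assumption made here.
    return math.isfinite(value)
```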

#### Split each line into fields

#### Filter out invalid line(s), probably using the function defined above

#### Pick fields of interest, as the key and value

#### Sum by PERIOD

#### Print the result in order
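
Taken together, the steps above might look like the following sketch. It is one approach among many, not a reference solution. The helper names and index constants are hypothetical; the indices follow the order of the field table, and the naive `split(",")` assumes that no field contains a quoted comma.

```python
import math

# Zero-based positions taken from the order of the field table.
NIC_INDEX = 6
PERIOD_INDEX = 9
NUM_FIELDS = 10


def parse_period_nic(line):
    """Return (period, nic) for a usable line, or None if the line is malformed."""
    # Assumption: fields never contain embedded commas.
    fields = [f.strip() for f in line.split(",")]
    if len(fields) < NUM_FIELDS:
        return None
    period = fields[PERIOD_INDEX]
    # PERIOD is documented as YYYYMM, so expect exactly six digits.
    if len(period) != 6 or not period.isdigit():
        return None
    try:
        nic = float(fields[NIC_INDEX])
    except ValueError:
        return None
    if not math.isfinite(nic):
        return None
    return period, nic


def total_nic_by_period(lines_rdd):
    """Return a list of (period, total_nic) sorted by period.

    lines_rdd is expected to be an RDD of raw text lines,
    for example the result of sc.textFile(...).
    """
    return (lines_rdd
            .map(parse_period_nic)
            .filter(lambda kv: kv is not None)
            .reduceByKey(lambda a, b: a + b)
            .sortByKey()
            .collect())
```

The result could then be printed with ``for period, total in total_nic_by_period(raw): print(period, total)``. Note that the schema line is discarded automatically, because its PERIOD entry is not a six-digit number.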

## Task 2

Find the 5 practices that issued the prescriptions with the highest total net ingredient cost in the data set.
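
A minimal sketch of one way to approach this follows. The helper names are hypothetical, the field positions follow the order of the field table, and the naive comma split assumes that no field contains a quoted comma. The ranking uses the RDD method `takeOrdered` with a negated key, which avoids sorting the whole data set.

```python
import math
from operator import add

# Zero-based positions taken from the order of the field table.
PRACTICE_INDEX = 2
NIC_INDEX = 6
NUM_FIELDS = 10


def parse_practice_nic(line):
    """Return (practice, nic) for a usable line, or None if the line is malformed."""
    fields = [f.strip() for f in line.split(",")]
    if len(fields) < NUM_FIELDS or not fields[PRACTICE_INDEX]:
        return None
    try:
        nic = float(fields[NIC_INDEX])
    except ValueError:
        # Also discards a schema line, whose NIC entry is plain text.
        return None
    if not math.isfinite(nic):
        return None
    return fields[PRACTICE_INDEX], nic


def top_practices(lines_rdd, n=5):
    """Return the n (practice, total_nic) pairs with the largest totals."""
    totals = (lines_rdd
              .map(parse_practice_nic)
              .filter(lambda kv: kv is not None)
              .reduceByKey(add))
    # Negating the total makes takeOrdered return the largest values first.
    return totals.takeOrdered(n, key=lambda kv: -kv[1])
```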

## Task 3

Your task is to classify each sequence in the contaminated tardigrade file as being most likely bacteria or tardigrade.

How many sequences in the contaminated file are believed to be bacterial sequences?

There are many ways to approach this job. Here are some steps at a high level:

a) A function that calculates Edit Distance between two sequences

b) Calculate Edit Distance for each sample against every clean and bacterial contig

c) Find the shortest distance for each sample

d) Classify samples
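
For step a), a common choice is the Levenshtein distance with unit costs for insertion, deletion, and substitution. The assignment does not say which variant of Edit Distance is intended, so treat that choice as an assumption. The sketch below uses the standard dynamic-programming recurrence and keeps only two rows in memory.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (unit costs assumed)."""
    if len(a) < len(b):
        a, b = b, a  # keep the stored rows as short as possible
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]
```

The running time is proportional to `len(a) * len(b)`, which matters here because every sample is compared against every reference contig.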

You are likely to use many more RDD operations than in the previous tasks. Check the Spark documentation for some handy functions.
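
As an illustration of steps b) through d), the sketch below combines the RDD operations `cartesian`, `reduceByKey`, `join`, and `mapValues`. All function names are hypothetical. It assumes each input RDD already holds `(contig_id, dna)` pairs for valid lines, that `distance` is any two-argument edit distance function, and that ties are labelled as tardigrade, which is an arbitrary choice.

```python
def label_from_distances(distances):
    """Map (clean_distance, bacterial_distance) to a label.

    Assumption: a tie is labelled as tardigrade.
    """
    clean_distance, bacterial_distance = distances
    return "tardigrade" if clean_distance <= bacterial_distance else "bacteria"


def classify_samples(samples, clean, bacterial, distance):
    """Return an RDD of (sample_id, label).

    samples, clean, and bacterial are RDDs of (contig_id, dna) pairs.
    """
    def nearest(references):
        # Pair every sample with every reference contig, then keep the
        # smallest distance seen for each sample.
        return (samples.cartesian(references)
                .map(lambda pair: (pair[0][0],
                                   distance(pair[0][1], pair[1][1])))
                .reduceByKey(min))

    # join yields (sample_id, (clean_distance, bacterial_distance)).
    return nearest(clean).join(nearest(bacterial)).mapValues(label_from_distances)
```

The number of samples labelled as bacterial could then be obtained with ``.filter(lambda kv: kv[1] == "bacteria").count()`` on the returned RDD. Because `cartesian` produces every sample and reference pair, this approach is only practical on small inputs such as the ``.small`` files.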

#### Load data files

In [3]:
bacterialRaw = sc.textFile('../data/exp1.oneline.fa.small')
cleanRaw = sc.textFile('../data/nHd.2.3.abv500.oneline.fa.small')
contaminatedRaw = sc.textFile('../data/LMYF01.1.oneline.fa.small')

Copyright ©  2019 Rice University, Christopher M Jermaine (cmj4@rice.edu), and Risa B Myers  (rbm2@rice.edu)

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.