# Spark Introduction

The goal of this assignment is to develop some expertise and familiarity with Spark, using RDDs and PySpark.

## Start Spark Context

Run this cell first, and run it only once per session: creating a second SparkContext while one is already active raises an error.

```python
from pyspark import SparkContext
sc = SparkContext(master="local[4]")
```

## Read in the (small) file

```python
raw = sc.textFile('../data/rxSmallSubset.csv')
```

## Task 1

Write a program that computes the total "net ingredient cost" of prescription items dispensed for each PERIOD in the data set (total pounds and pence from the NIC field).

As you do this, be aware that this data (like all real data) can be quite noisy and dirty. The first line in the file might describe the schema, and so it doesn’t have any valid data, just a bunch of text. You may find lines that do not have enough entries on them, or where an entry is of the wrong type (for example, the NIC or ACT COST cannot be converted into a decimal number). Basically, you need to write robust code. If you find any error on a line, simply discard the line. Your code should still output the correct result.


For your results, print out each period, in sorted order, followed by the total net ingredient cost for that period.

The following steps are just a guide. Feel free to do it your own way.

#### Define a function that checks if a string is a valid number
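A minimal sketch of such a check, assuming that "valid number" means anything Python's `float()` accepts, minus the special values `nan` and `inf`. The name `is_number` is only illustrative.

```python
import math


def is_number(text):
    """Return True if text parses as a finite decimal number."""
    try:
        value = float(text)
    except (TypeError, ValueError):
        # TypeError covers None, ValueError covers strings such as "NIC" or "".
        return False
    # float() also accepts "nan" and "inf", which are not usable costs.
    return math.isfinite(value)
```

Surrounding whitespace is tolerated by `float()`, so `is_number(" 7.57 ")` is true without an explicit `strip()`.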

#### Split each line into fields

#### Filter out invalid line(s), probably using the function defined above
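One possible way to split and filter, sketched under loud assumptions: the column count and the positions of NIC, ACT COST and PERIOD below are hypothetical and must be checked against the header line of your own file. The sketch also assumes that no field contains an embedded comma, so a plain `split(",")` is enough.

```python
import math

# Assumed layout (hypothetical; verify against the header line of the file).
EXPECTED_FIELDS = 10  # assumed number of comma-separated columns
NIC_IDX = 6           # assumed position of the NIC column
ACT_COST_IDX = 7      # assumed position of the ACT COST column
PERIOD_IDX = 9        # assumed position of the PERIOD column


def split_fields(line):
    """Split one CSV line on commas and trim padding around each field."""
    return [field.strip() for field in line.split(",")]


def is_valid_row(fields):
    """Return True only for rows with the right width and numeric costs."""
    if len(fields) != EXPECTED_FIELDS:
        return False
    try:
        costs = [float(fields[NIC_IDX]), float(fields[ACT_COST_IDX])]
    except ValueError:
        # The header line lands here, because "NIC" is not a number.
        return False
    return all(math.isfinite(c) for c in costs) and fields[PERIOD_IDX].isdigit()
```

With these helpers, the filtering step could be written as `rows = raw.map(split_fields).filter(is_valid_row)`.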

#### Pick fields of interest, as the key and value

#### Sum by PERIOD
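Picking the key and value and then summing by key might look like the sketch below. The column positions are again assumptions, and `valid_rows` is a hypothetical name for an RDD of field lists that have already passed your validity filter.

```python
from operator import add

# Assumed column positions (hypothetical; check your file's header line).
NIC_IDX = 6
PERIOD_IDX = 9


def to_period_cost(fields):
    """Map one validated row to a (period, net ingredient cost) pair."""
    return (fields[PERIOD_IDX], float(fields[NIC_IDX]))


def total_by_period(valid_rows):
    """Sum NIC per period; valid_rows is an RDD of validated field lists."""
    return valid_rows.map(to_period_cost).reduceByKey(add)
```

`reduceByKey` combines values within each partition before shuffling, which is usually cheaper than grouping all values first. Because the costs are added as floats, the totals can carry small rounding error; `decimal.Decimal` is an alternative if exact pence matter.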

#### Print the result in order
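A small sketch for the output step, assuming the per-period totals have already been brought back to the driver (for example with `collect()`) as `(period, total)` pairs. The sample pairs at the bottom are made-up values used only to show the format.

```python
def format_totals(totals):
    """Return one 'period total' string per pair, sorted by period."""
    return ["{} {:.2f}".format(period, total) for period, total in sorted(totals)]


# Made-up sample values, only to illustrate the output format.
for line in format_totals([("201002", 10.5), ("201001", 7.57)]):
    print(line)
```

Sorting on the driver is fine here because there is only one pair per period; for large results, `sortByKey()` on the RDD before collecting is an alternative.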

## Task 2

Find the 5 practices that issued the prescriptions with the highest total net ingredient cost in the data set.
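One reading of this task is: total the NIC per practice, then take the five largest totals. The sketch below follows that reading. The column positions are assumptions, and `valid_rows` is a hypothetical RDD of validated field lists such as the one built in Task 1.

```python
from operator import add

# Assumed column positions (hypothetical; check your file's header line).
PRACTICE_IDX = 2
NIC_IDX = 6


def to_practice_cost(fields):
    """Map one validated row to a (practice, net ingredient cost) pair."""
    return (fields[PRACTICE_IDX], float(fields[NIC_IDX]))


def top_practices(valid_rows, n=5):
    """Return the n (practice, total) pairs with the highest total NIC."""
    return (valid_rows
            .map(to_practice_cost)
            .reduceByKey(add)
            # Negating the total makes takeOrdered return the largest first.
            .takeOrdered(n, key=lambda kv: -kv[1]))
```

`takeOrdered` returns a plain Python list to the driver, so the result can be printed directly.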

## Task 3

Your task is to classify each sequence in the contaminated tardigrade file as being most likely bacterial or tardigrade. How many sequences in the contaminated file are believed to be bacterial sequences?

There are many ways to approach this job. Here are some steps at a high level:

a) A function that calculates Edit Distance between two sequences

b) Calculate Edit Distance for each sample against every clean and bacterial contig

c) Find the shortest distance for each sample

d) Classify samples
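Step a) can be sketched as follows, assuming that "Edit Distance" means the Levenshtein distance with unit cost for insertion, deletion and substitution. If the assignment intends a different cost model, adjust accordingly. The sketch keeps only two rows of the dynamic-programming table to limit memory use.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (unit costs assumed)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]
```

The running time is proportional to `len(a) * len(b)`, so comparing every sample against every contig can be slow on long sequences.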

You will likely use many more RDD operations than in the previous tasks. Check the Spark documentation for some handy functions.
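Steps b) through d) could be wired together roughly as below. This is a sketch, not the expected solution: it assumes each input has already been parsed into an RDD of `(seq_id, sequence)` pairs (the file format is not described here, so the parsing is left out), and it takes the distance function as a parameter. The names `closer` and `classify` are illustrative.

```python
def closer(a, b):
    """Return whichever (label, distance) pair has the smaller distance."""
    return a if a[1] <= b[1] else b


def classify(samples, bacterial, clean, distance):
    """Label each sample by its nearest reference sequence.

    samples, bacterial, clean: RDDs of (seq_id, sequence) pairs (assumed format).
    distance: a function of two sequences, e.g. an edit-distance function.
    """
    # Tag every reference sequence with the class it belongs to.
    references = (bacterial.map(lambda kv: ("bacteria", kv[1]))
                  .union(clean.map(lambda kv: ("tardigrade", kv[1]))))
    return (samples.cartesian(references)
            # (sample, reference) -> (sample_id, (label, distance))
            .map(lambda pair: (pair[0][0],
                               (pair[1][0], distance(pair[0][1], pair[1][1]))))
            .reduceByKey(closer)                # keep the nearest reference
            .mapValues(lambda best: best[0]))   # keep only its label
```

Calling `.values().countByValue()` on the result would then give the number of samples per label. Ties between a bacterial and a tardigrade reference are resolved by whichever pair the reduction sees first, so decide explicitly how you want to treat them.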

#### Load data files

```python
bacterialRaw = sc.textFile('../data/exp1.oneline.fa.small')
cleanRaw = sc.textFile('../data/nHd.2.3.abv500.oneline.fa.small')
contaminatedRaw = sc.textFile('../data/LMYF01.1.oneline.fa.small')
```

Copyright © 2019 Rice University, Christopher M Jermaine (cmj4@rice.edu), and Risa B Myers (rbm2@rice.edu)

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.