# Loading and Saving Your Data

So far, our examples have loaded and saved all of their data from native collections and regular files. Odds are that your data doesn't fit on a single machine, so it's time to explore our options for loading and saving.

Spark supports a wide range of input and output sources, partly because it builds on the ecosystem available for Hadoop.

## Text Files

When we load a single text file as an RDD, each input line becomes an element in the RDD. We can also load multiple whole text files at the same time into a pair RDD, with the key being the name and the value being the contents of each file.

In [1]:
# Loading a text file
input_file = sc.textFile("file:/Users/sergulaydore/Downloads/nyc16_bigdata1-master-3/week2b_thursday/shakespeare.txt")
# We can control the number of partitions by passing minPartitions,
# e.g. sc.textFile(path, minPartitions=4)

If our files are small enough, then we can use the SparkContext.wholeTextFiles() method and get back a pair RDD where the key is the name of the input file.
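
As a minimal sketch of how this might be used, the snippet below computes one average per file, assuming each file holds whitespace-separated numbers. The helper names `mean_of_numbers` and `average_per_file` and the `directory` argument are hypothetical; only `wholeTextFiles()` and `mapValues()` are Spark calls.

```python
def mean_of_numbers(contents):
    """Average the whitespace-separated numbers in one file's contents."""
    numbers = [float(token) for token in contents.split()]
    # Guard against empty files so we never divide by zero.
    return sum(numbers) / len(numbers) if numbers else 0.0


def average_per_file(sc, directory):
    """Return a pair RDD of (file name, average of the numbers in that file).

    `sc` is an existing SparkContext; `directory` is a hypothetical path.
    """
    # wholeTextFiles() yields (file name, entire file contents) pairs.
    files = sc.wholeTextFiles(directory)
    # mapValues() leaves the file name untouched and transforms the contents.
    return files.mapValues(mean_of_numbers)
```

Because every file arrives as a single string, this approach only makes sense when each file fits comfortably in memory on one worker.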

In [None]:
# Saving as a text file
result.saveAsTextFile(outfile)

## JSON

In [3]:
# Loading unstructured JSON in Python
import json
# Assumes input_file is an RDD of text lines, each holding one JSON record
data = input_file.map(lambda x: json.loads(x))

In [None]:
# Saving JSON - say we are running a promotion for people who love pandas
(data.filter(lambda x: x["lovesPandas"]).map(lambda x: json.dumps(x))).saveAsTextFile(outputFile)
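
Real input often contains a few malformed lines, and `json.loads` raises an exception on them, which would fail the whole task. One possible way to skip such lines is sketched below; the names `parse_json_line` and `load_json_records` are hypothetical, and `lines_rdd` is assumed to be an RDD with one JSON record per line.

```python
import json


def parse_json_line(line):
    """Return [record] for a valid JSON line and [] for a malformed one."""
    try:
        return [json.loads(line)]
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError.
        return []


def load_json_records(lines_rdd):
    """Parse every line of `lines_rdd`, silently dropping malformed ones."""
    # flatMap() flattens the returned lists, so empty lists simply vanish.
    return lines_rdd.flatMap(parse_json_line)
```

Dropping bad records silently is a design choice; for production data we would probably also want to count or log them.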

## Comma-Separated Values and Tab-Separated Values

Note that, in addition to writing CSV loading code by hand, there is a package on http://www.spark-packages.org called spark-csv that loads CSV data as a Spark SQL data source.

In [None]:
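# Loading CSV with textFile()
import csv
import io  # Python 3: io.StringIO replaces the Python 2 StringIO module
def loadRecord(line):
    """Parse a single CSV line into a dict keyed by field name"""
    my_input = io.StringIO(line)
    reader = csv.DictReader(my_input, fieldnames=["name", "favoriteAnimal"])
    return next(reader)
# inputFile is assumed to be the path of a CSV file with no header row
my_input = sc.textFile(inputFile).map(loadRecord)

In [None]:
# Writing CSV
def writeRecords(records):
    """Write out all records of one partition as a single CSV string"""
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=["name", "favoriteAnimal"])
    for record in records:
        writer.writerow(record)
    return [output.getvalue()]
# Apply writeRecords once per partition, then save the resulting strings
my_input.mapPartitions(writeRecords).saveAsTextFile(outputFile)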
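
Per-record logic like this can be sanity-checked on plain strings before it is run on a cluster. The sketch below is a self-contained Python 3 round trip that needs no SparkContext; the names `FIELDS`, `parse_record`, and `format_records` are hypothetical, and the sample line is made up for illustration.

```python
import csv
import io

# Field names taken from the CSV example in this section.
FIELDS = ["name", "favoriteAnimal"]


def parse_record(line):
    """Parse one CSV line into a dict keyed by FIELDS."""
    reader = csv.DictReader(io.StringIO(line), fieldnames=FIELDS)
    return next(reader)


def format_records(records):
    """Serialize an iterable of dicts into a one-element list of CSV text."""
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=FIELDS)
    for record in records:
        writer.writerow(record)
    return [output.getvalue()]


# Round trip: parse a line, then write it back out.
record = parse_record("Holden,panda")
csv_text = format_records([record])[0]
```

Note that the `csv` writer ends each row with `\r\n` by default, so the output is not byte-identical to an input line that ended in `\n`.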

## Sequence Files

SequenceFiles are a popular Hadoop format composed of flat files with key/value pairs. SequenceFiles consist of elements that implement Hadoop's Writable interface, as Hadoop uses a custom serialization framework.

Spark has a specialized API for reading in SequenceFiles. On the SparkContext we can call sequenceFile(path, keyClass, valueClass, minPartitions). Let's consider loading people and the number of pandas they have seen from a SequenceFile. In this case our keyClass would be Text and our valueClass would be IntWritable.

In [None]:
# Loading a SequenceFile
data = sc.sequenceFile(inFile,
                      "org.apache.hadoop.io.Text", "org.apache.hadoop.io.IntWritable")
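
Going the other way, a pair RDD can be written out with `saveAsSequenceFile()`. The sketch below writes a few made-up (name, pandas seen) pairs and reads them back; the function name `save_and_reload_counts`, the sample data, and the `path` argument are hypothetical.

```python
def save_and_reload_counts(sc, path):
    """Write (name, pandas seen) pairs to a SequenceFile and read them back.

    `sc` is an existing SparkContext; `path` is a hypothetical output
    location that must not exist yet.
    """
    counts = sc.parallelize([("Panda", 3), ("Kay", 6), ("Snail", 2)])
    # PySpark converts the Python keys and values to Hadoop Writables on save.
    counts.saveAsSequenceFile(path)
    # keyClass and valueClass are optional in PySpark's sequenceFile().
    return sc.sequenceFile(path)
```

Calling `collect()` on the returned RDD should give back the same pairs, though not necessarily in the same order.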

## Object Files

Object files are a deceptively simple wrapper around SequenceFiles that allows us to save RDDs containing just values.
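
Object files rely on Java serialization, so from Python the closest equivalents are `saveAsPickleFile()` and `pickleFile()`, which use Python's pickle serialization instead. A minimal sketch, in which the function name `save_and_reload_values`, the sample values, and the `path` argument are hypothetical:

```python
def save_and_reload_values(sc, path):
    """Save an RDD of plain values as a pickle file and load it back.

    `sc` is an existing SparkContext; `path` is a hypothetical output
    location that must not exist yet.
    """
    values = sc.parallelize(["panda", "koala", "otter"])
    # Stored as a SequenceFile of pickled Python objects.
    values.saveAsPickleFile(path)
    # pickleFile() loads an RDD previously written by saveAsPickleFile().
    return sc.pickleFile(path)
```

As with Java serialization, files written this way are tied to the classes that produced them, so they are best suited to passing data between our own Spark jobs.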

## Hadoop Input and Output Formats

Return to this chapter when you need other data types or file systems.