# Chapter 16: Working with CSV files and JSON data
In Chapter 15, you learned how to extract text from PDF and Word documents. These files were in a binary format, which required special Python modules to access their data. CSV and JSON files, on the other hand, are just plaintext files. You can view them in a text editor, such as Mu. But Python also comes with the special csv and json modules, each providing functions to help you work with these file formats.

CSV stands for “comma-separated values,” and CSV files are simplified spreadsheets stored as plaintext files. Python’s csv module makes it easy to parse CSV files.

JSON (pronounced “JAY-sawn” or “Jason”—it doesn’t matter how because either way people will say you’re pronouncing it wrong) is a format that stores information as JavaScript source code in plaintext files. (JSON is short for JavaScript Object Notation.) You don’t need to know the JavaScript programming language to use JSON files, but the JSON format is useful to know because it’s used in many web applications.

## The csv Module

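Each line in a CSV file represents a row in the spreadsheet, and commas separate the cells in the row. For example, the spreadsheet example.xlsx from https://nostarch.com/automatestuff2/ would look like this in a CSV file:

    4/5/2014 13:34,Apples,73
    4/5/2014 3:41,Cherries,85
    4/6/2014 12:46,Pears,14
    4/8/2014 8:59,Oranges,52
    4/10/2014 2:07,Apples,152
    4/10/2014 18:10,Bananas,23
    4/10/2014 2:40,Strawberries,98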

I will use this file for this chapter’s interactive shell examples. You can download example.csv from https://nostarch.com/automatestuff2/ or enter the text into a text editor and save it as example.csv.

CSV files are simple, lacking many of the features of an Excel spreadsheet. For example, CSV files:

- Don’t have types for their values; everything is a string
- Don’t have settings for font size or color
- Don’t have multiple worksheets
- Can’t specify cell widths and heights
- Can’t have merged cells
- Can’t have images or charts embedded in them

The advantage of CSV files is simplicity. CSV files are widely supported by many types of programs, can be viewed in text editors (including Mu), and are a straightforward way to represent spreadsheet data. The CSV format is exactly as advertised: it’s just a text file of comma-separated values.
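The first limitation in the list above, that every value is a string, means numeric columns need an explicit conversion before you can do arithmetic on them. The following is a minimal sketch of that idea, assuming an in-memory io.StringIO object as a stand-in for the real file and using two rows copied from example.csv:

```python
import csv
import io

# Two sample rows from example.csv; StringIO stands in for an opened file.
sample = io.StringIO(
    '4/5/2014 13:34,Apples,73\n'
    '4/5/2014 3:41,Cherries,85\n',
    newline='',
)

total = 0
for timestamp, fruit, quantity in csv.reader(sample):
    # quantity arrives as the string '73', not the integer 73
    total += int(quantity)

print(total)  # 158
```

Without the int() call, the += line would raise a TypeError because Python cannot add a string to an integer.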

Since CSV files are just text files, you might be tempted to read them in as a string and then process that string using the techniques you learned in Chapter 9. For example, since each cell in a CSV file is separated by a comma, maybe you could just call split(',') on each line of text to get the comma-separated values as a list of strings. But not every comma in a CSV file represents the boundary between two cells. CSV files also have their own set of escape characters to allow commas and other characters to be included as part of the values. The split() method doesn’t handle these escape characters. Because of these potential pitfalls, you should always use the csv module for reading and writing CSV files.
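To see the pitfall concretely, here is a small sketch that compares split(',') with csv.reader() on one line. The sample line is made up for illustration, and io.StringIO is used only so the example runs without a file on disk:

```python
import csv
import io

# Hypothetical one-line CSV: the second cell contains a comma, so it is quoted.
line = 'Alice,"Hello, world!",42\n'

# Naive approach: split on every comma, including the one inside the quotes.
naive_cells = line.rstrip('\n').split(',')

# csv.reader understands the quoting rules and keeps the quoted cell intact.
parsed_cells = next(csv.reader(io.StringIO(line, newline='')))

print(naive_cells)   # ['Alice', '"Hello', ' world!"', '42']
print(parsed_cells)  # ['Alice', 'Hello, world!', '42']
```

The split() call produces four pieces and leaves the quote characters in the data, while the csv module returns the three cells that were intended.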

### reader Objects

To read data from a CSV file with the csv module, you need to create a reader object. A reader object lets you iterate over lines in the CSV file. Enter the following into the interactive shell, with example.csv saved in the automate_online-materials folder inside the current working directory:

In [None]:
import os, csv

loadPath = os.path.join('automate_online-materials', 'example.csv')

exampleFile = open(loadPath) # open the CSV file
exampleReader = csv.reader(exampleFile) # csv.reader() returns a reader object
exampleData = list(exampleReader) # read every row into a list of lists
exampleFile.close() # the rows are in memory now, so the file can be closed
print(exampleData) # each inner list is one row of the file

[['4/5/2014 13:34', 'Apples', '73'], ['4/5/2014 3:41', 'Cherries', '85'], ['4/6/2014 12:46', 'Pears', '14'], ['4/8/2014 8:59', 'Oranges', '52'], ['4/10/2014 2:07', 'Apples', '152'], ['4/10/2014 18:10', 'Bananas', '23'], ['4/10/2014 2:40', 'Strawberries', '98']]


The csv module comes with Python, so you can import it without having to install it first.

To read a CSV file with the csv module, first open it using the open() function, just as you would any other text file. But instead of calling the read() or readlines() method on the File object that open() returns, pass it to the csv.reader() function. This will return a reader object for you to use. Note that you don’t pass a filename string directly to the csv.reader() function.

The most direct way to access the values in the reader object is to convert it to a plain Python list by passing it to list(). Using list() on this reader object returns a list of lists, which you can store in a variable like exampleData. Printing exampleData displays the list of lists.
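The same open-then-read sequence can also be written with a with statement, which closes the file for you. This is a sketch rather than the chapter's own code: it creates a two-row stand-in for example.csv in a temporary folder (a hypothetical location) so that it runs anywhere.

```python
import csv
import os
import tempfile

# Build a small stand-in for example.csv in a temporary folder (hypothetical path).
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'example.csv')
with open(path, 'w', newline='') as f:
    f.write('4/5/2014 13:34,Apples,73\n4/5/2014 3:41,Cherries,85\n')

# The with statement closes the file automatically when the block ends.
with open(path, newline='') as example_file:
    rows = list(csv.reader(example_file))

print(rows)
print(example_file.closed)  # True
```

Passing newline='' when opening a file for the csv module is what the csv documentation recommends for reading as well as writing.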

Now that you have the CSV file as a list of lists, you can access the value at a particular row and column with the expression exampleData[row][col], where row is the index of one of the lists in exampleData, and col is the index of the item you want from that list. Enter the following into the interactive shell:

In [4]:
# access the value at a particular row and column with exampleData[row][col]
print(exampleData[0][0]) # first row, first column
print(exampleData[0][1]) # first row, second column
print(exampleData[0][2]) # first row, third column
print(exampleData[1][1]) # second row, second column
print(exampleData[6][1]) # seventh row, second column

4/5/2014 13:34
Apples
73
Cherries
Strawberries


As you can see from the output, exampleData[0][0] goes into the first list and gives us the first string, exampleData[0][2] goes into the first list and gives us the third string, and so on.

### Reading Data from reader Objects in a for Loop

For large CSV files, you’ll want to use the reader object in a for loop. This avoids loading the entire file into memory at once. For example, enter the following into the interactive shell:

In [6]:
import os, csv

loadPath = os.path.join('automate_online-materials', 'example.csv')

exampleFile = open(loadPath)
exampleReader = csv.reader(exampleFile)


# The reader object's line_num variable contains the number of the current line
for row in exampleReader:
    print('Row # ' + str(exampleReader.line_num) + ' ' + str(row))

Row # 1 ['4/5/2014 13:34', 'Apples', '73']
Row # 2 ['4/5/2014 3:41', 'Cherries', '85']
Row # 3 ['4/6/2014 12:46', 'Pears', '14']
Row # 4 ['4/8/2014 8:59', 'Oranges', '52']
Row # 5 ['4/10/2014 2:07', 'Apples', '152']
Row # 6 ['4/10/2014 18:10', 'Bananas', '23']
Row # 7 ['4/10/2014 2:40', 'Strawberries', '98']


After you import the csv module and make a reader object from the CSV file, you can loop through the rows in the reader object. Each row is a list of values, with each value representing a cell.

The print() function call prints the number of the current row and the contents of the row. To get the row number, use the reader object’s line_num variable, which contains the number of the current line.

The reader object can be looped over only once. To reread the CSV file, reopen the file (or rewind it with exampleFile.seek(0)) and call csv.reader() again to create a new reader object.
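The one-pass behavior is easy to check for yourself. The sketch below uses a tiny made-up CSV held in an io.StringIO object instead of a real file; the variable names are illustrative only.

```python
import csv
import io

data = io.StringIO('a,b\nc,d\n', newline='')
reader = csv.reader(data)

first_pass = list(reader)    # consumes every row
second_pass = list(reader)   # the reader is exhausted, so this is empty

data.seek(0)                          # rewind the underlying file object
third_pass = list(csv.reader(data))   # a new reader sees the rows again

print(first_pass)   # [['a', 'b'], ['c', 'd']]
print(second_pass)  # []
print(third_pass)   # [['a', 'b'], ['c', 'd']]
```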

### writer Objects

A writer object lets you write data to a CSV file. To create a writer object, you use the csv.writer() function. Enter the following into the interactive shell:

In [7]:
import os, csv

savePath = os.path.join('Files', 'output.csv')

outputFile = open(savePath, 'w', newline='')
outputWriter = csv.writer(outputFile)
outputWriter.writerow(['spam', 'eggs', 'bacon', 'ham'])

21

In [8]:
# each writerow() call adds a new row and returns the number of characters written
outputWriter.writerow(['Hello, world!', 'eggs', 'bacon', 'ham'])

32

In [9]:
outputWriter.writerow([1, 2, 3.141592, 4])

16

In [10]:
outputFile.close()

First, call open() and pass it 'w' to open a file in write mode. This creates the File object that you then pass to csv.writer() to create a writer object.

On Windows, you’ll also need to pass a blank string for the open() function’s newline keyword argument. For technical reasons beyond the scope of this book, if you forget to set the newline argument, the rows in output.csv will be double-spaced, as shown in Figure 16-1.

The writerow() method for writer objects takes a list argument. Each value in the list is placed in its own cell in the output CSV file. The return value of writerow() is the number of characters written to the file for that row (including newline characters).

Notice how the writer object automatically escapes the comma in the value 'Hello, world!' with double quotes in the CSV file. The csv module saves you from having to handle these special cases yourself.
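If you want to inspect exactly what the writer produces without opening output.csv, one option is to write to an in-memory buffer. This sketch assumes io.StringIO as a stand-in for the output file and repeats two of the rows from the example:

```python
import csv
import io

buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
writer.writerow(['spam', 'eggs', 'bacon', 'ham'])
count = writer.writerow(['Hello, world!', 'eggs', 'bacon', 'ham'])

text = buffer.getvalue()
print(repr(text))
# 'spam,eggs,bacon,ham\r\n"Hello, world!",eggs,bacon,ham\r\n'
print(count)  # 32

# Reading the text back removes the quotes and restores the original value.
round_trip = list(csv.reader(io.StringIO(text, newline='')))
print(round_trip[1][0])  # Hello, world!
```

Only the cell that contains a comma is wrapped in double quotes; the other cells are written as they are.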

### The delimiter and lineterminator Keyword Arguments
Say you want to separate cells with a tab character instead of a comma and you want the rows to be double-spaced. You could enter something like the following into the interactive shell:

In [16]:
import os, csv

savePath = os.path.join('Files', 'example.tsv')

# Open a file in write mode. On Windows, set newline to an empty string.
csvFile = open(savePath, 'w', newline='')

# Change the delimiter between cells to a tab and the line terminator to two newlines.
csvWriter = csv.writer(csvFile, delimiter='\t', lineterminator='\n\n')
csvWriter.writerow(['apples', 'oranges', 'grapes'])
csvWriter.writerow(['eggs', 'bacon', 'ham'])
csvWriter.writerow(['spam', 'spam', 'spam', 'spam', 'spam', 'spam'])
csvFile.close()

This changes the delimiter and line terminator characters in your file. The delimiter is the character that appears between cells on a row. By default, the delimiter for a CSV file is a comma. The line terminator is the character sequence that comes at the end of a row. By default, the csv module ends each row with '\r\n' (a carriage return followed by a newline). You can change both to different values by using the delimiter and lineterminator keyword arguments with csv.writer().

Passing delimiter='\t' and lineterminator='\n\n' changes the character between cells to a tab and the characters between rows to two newlines. We then call writerow() three times to give us three rows.
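Reading a tab-separated file back requires the same delimiter argument on csv.reader(). The sketch below writes two of the rows to an in-memory buffer (an assumption made so the example needs no file on disk) and shows that the blank lines created by the doubled line terminator come back as empty lists, which you may want to filter out.

```python
import csv
import io

buffer = io.StringIO(newline='')
writer = csv.writer(buffer, delimiter='\t', lineterminator='\n\n')
writer.writerow(['apples', 'oranges', 'grapes'])
writer.writerow(['eggs', 'bacon', 'ham'])

buffer.seek(0)
# The reader needs the same delimiter; each blank line becomes an empty list.
all_rows = list(csv.reader(buffer, delimiter='\t'))
rows = [row for row in all_rows if row]  # drop the empty lists

print(all_rows)
# [['apples', 'oranges', 'grapes'], [], ['eggs', 'bacon', 'ham'], []]
print(rows)
# [['apples', 'oranges', 'grapes'], ['eggs', 'bacon', 'ham']]
```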

### DictReader and DictWriter CSV Objects

For CSV files that contain header rows, it’s often more convenient to work with the DictReader and DictWriter objects, rather than the reader and writer objects.

The reader and writer objects read and write to CSV file rows by using lists. The DictReader and DictWriter CSV objects perform the same functions but use dictionaries instead, and they use the first row of the CSV file as the keys of these dictionaries.

Go to https://nostarch.com/automatestuff2/ and download the exampleWithHeader.csv file. This file is the same as example.csv except it has Timestamp, Fruit, and Quantity as the column headers in the first row.

In [17]:
import os, csv

loadPath = os.path.join('automate_online-materials', 'exampleWithHeader.csv')

exampleFile = open(loadPath)
exampleDictReader = csv.DictReader(exampleFile)

for row in exampleDictReader: # each row is a dictionary keyed by the column headers
    print(row['Timestamp'], row['Fruit'], row['Quantity'])

4/5/2014 13:34 Apples 73
4/5/2014 3:41 Cherries 85
4/6/2014 12:46 Pears 14
4/8/2014 8:59 Oranges 52
4/10/2014 2:07 Apples 152
4/10/2014 18:10 Bananas 23
4/10/2014 2:40 Strawberries 98


Inside the loop, the DictReader object sets row to a dictionary object with keys derived from the headers in the first row. (Technically, Python 3.6 and 3.7 set row to an OrderedDict object, which you can use in the same way as a dictionary; since Python 3.8, row is a regular dictionary.) Using a DictReader object means you don’t need additional code to skip the first row’s header information, since the DictReader object does this for you.
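You can check which headers a DictReader object picked up through its fieldnames attribute. This is a small sketch that assumes an in-memory copy of the header row and first data row of exampleWithHeader.csv:

```python
import csv
import io

sample = io.StringIO(
    'Timestamp,Fruit,Quantity\n'
    '4/5/2014 13:34,Apples,73\n',
    newline='',
)
reader = csv.DictReader(sample)
first_row = next(reader)  # the header row is consumed automatically

print(reader.fieldnames)  # ['Timestamp', 'Fruit', 'Quantity']
print(dict(first_row))
# {'Timestamp': '4/5/2014 13:34', 'Fruit': 'Apples', 'Quantity': '73'}
```

Wrapping the row in dict() makes the printed output look the same whether your Python version returns an OrderedDict or a regular dictionary.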

If you tried to use DictReader objects with example.csv, which doesn’t have column headers in the first row, the DictReader object would use '4/5/2014 13:34', 'Apples', and '73' as the dictionary keys. To avoid this, you can supply the DictReader() function with a second argument containing made-up header names:

In [19]:
import os, csv

loadPath = os.path.join('automate_online-materials', 'example.csv')

exampleFile = open(loadPath)

# The CSV file does not contain headers, so supply made-up header names as the second argument
exampleDictReader = csv.DictReader(exampleFile, ['time', 'name', 'amount'])

for row in exampleDictReader:
    print(row['time'], row['name'], row['amount'])

4/5/2014 13:34 Apples 73
4/5/2014 3:41 Cherries 85
4/6/2014 12:46 Pears 14
4/8/2014 8:59 Oranges 52
4/10/2014 2:07 Apples 152
4/10/2014 18:10 Bananas 23
4/10/2014 2:40 Strawberries 98


Because example.csv’s first row doesn’t have any text for the heading of each column, we created our own: 'time', 'name', and 'amount'.

DictWriter objects use dictionaries to create CSV files.

In [None]:
import os, csv

savePath = os.path.join('Files', 'output.csv')

outputFile = open(savePath, 'w', newline='') # open a new file in write mode
outputDictWriter = csv.DictWriter(outputFile, ['Name', 'Pet', 'Phone']) # the list sets the column order
outputDictWriter.writeheader() # write the header row

# add a new row by passing a dictionary that uses the headers as keys
outputDictWriter.writerow({'Name': 'Alice', 'Pet': 'cat', 'Phone': '555-1234'})
outputDictWriter.writerow({'Name': 'Bob', 'Phone': '555-9999'}) # blank cell for pet column
outputDictWriter.writerow({'Phone': '555-5555', 'Name': 'Carol', 'Pet':'dog'})
outputFile.close()

If you want your file to contain a header row, write that row by calling writeheader(). Otherwise, skip calling writeheader() to omit a header row from the file. You then write each row of the CSV file with a writerow() method call, passing a dictionary that uses the headers as keys and contains the data to write to the file.

Notice that the order of the key-value pairs in the dictionaries you passed to writerow() doesn’t matter: they’re written in the order of the keys given to DictWriter(). For example, even though you passed the Phone key and value before the Name and Pet keys and values in the fourth row, the phone number still appeared last in the output.

Notice also that any missing keys, such as 'Pet' in {'Name': 'Bob', 'Phone': '555-9999'}, will simply be empty in the CSV file.
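DictWriter also has two keyword arguments that control these edge cases: restval sets the text written for missing keys, and extrasaction decides what happens when a dictionary contains a key that is not in the header list (the default, 'raise', raises a ValueError). The following sketch writes to an in-memory buffer; the 'N/A' placeholder and the Dave row are invented for illustration.

```python
import csv
import io

buffer = io.StringIO(newline='')
# restval replaces the default empty cell for any missing key.
writer = csv.DictWriter(buffer, ['Name', 'Pet', 'Phone'], restval='N/A')
writer.writeheader()
writer.writerow({'Name': 'Bob', 'Phone': '555-9999'})

problem = None
try:
    # 'Email' is not one of the headers, so the default extrasaction='raise' applies.
    writer.writerow({'Name': 'Dave', 'Email': 'dave@example.com'})
except ValueError as error:
    problem = str(error)

print(repr(buffer.getvalue()))
# 'Name,Pet,Phone\r\nBob,N/A,555-9999\r\n'
print(problem)
```

If you would rather drop unknown keys silently, you could pass extrasaction='ignore' when creating the DictWriter object.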
