# Data Has To Go Somewhere

Most of the files you will manipulate as a data scientist will be text files, and we've already seen how to access them using open(), read(), readlines() and so forth.  These commands work for any text file, but many text files include additional structure that we can leverage.  The text isn't just anything - it's organized according to certain standardized rules.  These are what we call structured text files, and they allow us to share and consume information from different applications easily. Let's look into some common structured text formats and how we can access them in Python.

## CSV: Comma Separated Value

CSV files are files in which items are divided across different lines and different columns, like a table.  On each line, items are separated by a comma. To get a good visual, think of a spreadsheet application: A typical spreadsheet program can open CSV files and display them to you as a table. 

The "C" in CSV stands for comma, which is called a *delimiter*, but other variations are possible.  Other common delimiters include ' ', '|', '\t', and so on. Furthermore, the first row of a CSV file is often a special header row that names the columns rather than holding data.

If you wanted to, you could read a CSV file using read() or readlines(), but it would take a little effort to deal with the commas and split each line correctly.  Instead, Python's standard library has a module, csv, that makes reading and writing CSV files easy:

In [1]:
import csv

grades = [
    ['John', 88],
    ['Kate', 93],
    ['Harry', 93],
    ['Linda', 87],
    ['Harriet', 91]
]

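# newline='' lets the csv module control line endings itself;
# the with block closes the file even if an error occurs
with open('grades.csv', 'wt', newline='') as grades_csv_write:
    csvout = csv.writer(grades_csv_write)
    csvout.writerows(grades)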

Now let's take a look at the file that was just created:

In [2]:
!cat grades.csv

John,88
Kate,93
Harry,93
Linda,87
Harriet,91


We have just created a comma separated file of a 5 by 2 table of names and grades. You could open this file in a spreadsheet program and see the data in tabular format, but the data itself is just a plain text file.

Notice that we constructed a csv.writer object, passing our file into the constructor, and used that to write our data to the file.  This is how Python makes it very simple to write (and read) csv files.
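The same writer can also produce the header row and the alternative delimiters mentioned earlier. The following is a minimal sketch, assuming a tab-delimited layout and hypothetical column names `name` and `grade`; it writes to an in-memory buffer (`io.StringIO`) rather than a file so the result can be inspected directly.

```python
import csv
import io

grades = [
    ['John', 88],
    ['Kate', 93],
]

# An in-memory text buffer stands in for a file opened with open().
buffer = io.StringIO()

# delimiter='\t' switches the separator from commas to tabs.
writer = csv.writer(buffer, delimiter='\t')
writer.writerow(['name', 'grade'])   # header row (hypothetical column names)
writer.writerows(grades)             # data rows

tsv_text = buffer.getvalue()
print(tsv_text)
```

Passing the same `delimiter` argument to csv.reader should let you read such a file back.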

We can read our CSV file in a similar manner:

In [3]:
import csv

# The with block closes the file once we are done reading it
with open('grades.csv', 'rt', newline='') as grades_csv_read:
    csvin = csv.reader(grades_csv_read)

    for row in csvin:
        print(row)

['John', '88']
['Kate', '93']
['Harry', '93']
['Linda', '87']
['Harriet', '91']


Once you create a csv.reader object, you can iterate row by row.  Each row in the CSV file is conveniently provided as a list of strings, which is why the grades above appear as '88' rather than 88. There is no need to write a special parser to split and manipulate the original text file - this is all done for you by the csv module.
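To illustrate why the csv module is worth using over a manual split, here is a small sketch. The sample text, including the quoted name "Smith, John" and the header names, is made up for this example. It uses csv.DictReader, which treats the first row as a header, and converts each grade to an int by hand.

```python
import csv
import io

# Sample text with a header row and one quoted field that contains a comma.
csv_text = 'name,grade\n"Smith, John",88\nKate,93\n'

# newline='' leaves line endings for the csv module to interpret.
reader = csv.DictReader(io.StringIO(csv_text, newline=''))

records = []
for row in reader:
    # Every value arrives as a string, so convert the grade explicitly.
    records.append((row['name'], int(row['grade'])))

print(records)

# A naive split breaks the quoted name into two pieces.
naive = csv_text.splitlines()[1].split(',')
print(naive)
```

The reader keeps "Smith, John" together as one field, while the naive split yields three pieces for the same line.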

## XML: eXtensible Markup Language

XML is another structured text format that is used to represent relationships between data.  XML has many applications, and features prominently in modern web services. Python gives us convenient tools for processing XML.

Below, we'll create an example of an XML file.  If you haven't worked with XML before, just know that it is a language with a syntax similar to HTML.  Notice the use of markup tags, which are placed between the '<' and '>' symbols.  The elements of an XML file are arranged in a hierarchy, with one root element, 'students', which contains several child elements.

In [4]:
xml_data = '''<?xml version="1.0"?>
<students>
	<student name="John">
		<grade value="88" />
	</student>
	<student name="Kate">
		<grade value="93" />
	</student>
	<student name="Harry">
		<grade value="93" />
	</student>
	<student name="Linda">
		<grade value="87" />
	</student>
	<student name="Harriet">
		<grade value="91" />
	</student>
</students>'''

xml_data_file = open('grades.xml', 'wt')
xml_data_file.write(xml_data)
xml_data_file.close()

In [5]:
!cat grades.xml

<?xml version="1.0"?>
<students>
	<student name="John">
		<grade value="88" />
	</student>
	<student name="Kate">
		<grade value="93" />
	</student>
	<student name="Harry">
		<grade value="93" />
	</student>
	<student name="Linda">
		<grade value="87" />
	</student>
	<student name="Harriet">
		<grade value="91" />
	</student>
</students>

This is just a regular text file that happens to be formatted as XML. Let's see how to read in the text file as an XML file.

In [8]:
from xml.etree import ElementTree  
tree = ElementTree.ElementTree(file='grades.xml')
root = tree.getroot()
print(root.tag)

students


We import the module ElementTree from the xml.etree package.  The ElementTree object gives us the ability to traverse an XML tree programmatically. After we create the tree, we retrieve the root. We can get the name of the root tag by accessing the root's tag attribute.

Once we have the root, we can proceed to traverse the XML tree.  We do this by moving from an element to its child elements.

In [7]:
for child in root:
    print(' tag:', child.tag, 'attributes:', child.attrib)

    for grandchild in child:
        print('\ttag:', grandchild.tag, 'attributes:', grandchild.attrib)

 tag: student attributes: {'name': 'John'}
	tag: grade attributes: {'value': '88'}
 tag: student attributes: {'name': 'Kate'}
	tag: grade attributes: {'value': '93'}
 tag: student attributes: {'name': 'Harry'}
	tag: grade attributes: {'value': '93'}
 tag: student attributes: {'name': 'Linda'}
	tag: grade attributes: {'value': '87'}
 tag: student attributes: {'name': 'Harriet'}
	tag: grade attributes: {'value': '91'}


Notice that each Element object has a tag property and an attrib property.  These are taken directly from the tag in the XML file. The Elements are also iterables.  Placing them in a for loop allows you to access each of an Element's child Elements. 
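As a sketch of how these pieces combine, the snippet below collects names and grades into a dictionary. It parses a shortened copy of the data from a string with ElementTree.fromstring(), so it does not depend on grades.xml, and it uses the Element methods findall(), find(), and get() in place of the nested loops.

```python
from xml.etree import ElementTree

xml_text = '''<students>
    <student name="John"><grade value="88" /></student>
    <student name="Kate"><grade value="93" /></student>
</students>'''

# fromstring() parses a string and returns the root Element directly.
root = ElementTree.fromstring(xml_text)

grades = {}
for student in root.findall('student'):   # direct children tagged 'student'
    grade = student.find('grade')         # first 'grade' child, or None
    if grade is not None:
        # Attribute values are strings, so convert the grade to an int.
        grades[student.get('name')] = int(grade.get('value'))

print(grades)
```

Like the values returned by csv.reader, XML attribute values are plain strings, so any numeric conversion is up to you.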

There are a number of other operations you can use for XML files: check out the [documentation](https://docs.python.org/3.3/library/xml.etree.elementtree.html) for details.

You can also try [xml.dom](https://docs.python.org/3/library/xml.dom.html) and [xml.sax](https://docs.python.org/3/library/xml.sax.html) for alternative xml processing libraries. 

## JSON: JavaScript Object Notation

JSON is a structured text format that is commonly used for web development, but is increasingly popular for other applications as well. One good thing about JSON is that its syntax closely resembles Python's dictionary and list literals, so it is easy to read.

Python's support for JSON is straightforward: one standard library module handles it, and it's conveniently called "json".

Let's begin with an example of a JSON file:

In [8]:
json_data = '''{
	"students": {
		
		"John": {
			"grades": [88]
		},

		"Kate": {
			"grades": [93]
		},

		"Harry": {
			"grades": [93]
		},

		"Linda": {
			"grades": [87]
		},

		"Harriet": {
			"grades": [91]
		}
	}
}'''

json_data_file = open('grades.json', 'wt')
json_data_file.write(json_data)
json_data_file.close()

In [9]:
!cat grades.json

{
	"students": {
		
		"John": {
			"grades": [88]
		},

		"Kate": {
			"grades": [93]
		},

		"Harry": {
			"grades": [93]
		},

		"Linda": {
			"grades": [87]
		},

		"Harriet": {
			"grades": [91]
		}
	}
}

Please note: JSON looks very similar to dictionary literals in Python. In fact, the json module converts JSON objects into nested dictionaries.  Any text in quotes is converted to a string, numbers are converted to ints or floats, and square brackets are interpreted as lists.

Let's see this in action:

In [2]:
import json
json_data_file = open("grades.json", "rt")
json_data = json.loads(json_data_file.read())
json_data_file.close()

print("root:", json_data)
print()
print("students:", json_data["students"])

root: {'students': {'Harry': {'grades': [93]}, 'Harriet': {'grades': [91]}, 'Linda': {'grades': [87]}, 'John': {'grades': [88]}, 'Kate': {'grades': [93]}}}

students: {'Harry': {'grades': [93]}, 'Harriet': {'grades': [91]}, 'Linda': {'grades': [87]}, 'John': {'grades': [88]}, 'Kate': {'grades': [93]}}


We have now read our JSON file from the file system and used the json.loads() function to convert its text into a Python object. As we print both the root json_data and the dictionary indexed by "students", we can see how Python has converted that JSON file into a Python dictionary.
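To make the type conversion concrete, here is a small sketch that parses a made-up JSON record (its field names are hypothetical) and prints the Python type of each value. It also shows two mappings not covered above: JSON true/false become Python booleans, and null becomes None.

```python
import json

text = '{"name": "Kate", "grades": [93, 87.5], "enrolled": true, "advisor": null}'
record = json.loads(text)

# Print each key with the Python type that json chose for its value.
for key, value in record.items():
    print(key, type(value).__name__, value)
```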

We can also take Python dictionaries and write them to JSON files.

In [11]:
python_dict = {'students': 
                   {'Harriet': {'grades': [91]}, 
                    'John': {'grades': [88]}, 
                    'Kate': {'grades': [93]}, 
                    'Linda': {'grades': [87]}, 
                    'Harry': {'grades': [93]}
                   }
              }
python_dict_json = json.dumps(python_dict)

python_dict_json_file = open("grades_python.json", "wt")
python_dict_json_file.write(python_dict_json)
python_dict_json_file.close()

In [12]:
!cat grades_python.json

{"students": {"Kate": {"grades": [93]}, "Harry": {"grades": [93]}, "Linda": {"grades": [87]}, "Harriet": {"grades": [91]}, "John": {"grades": [88]}}}

We use the `dumps()` function to convert a Python dictionary into a JSON-formatted string. We then write the result to grades_python.json.
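The single-line output above is valid but hard to read. As a sketch, the `indent` and `sort_keys` arguments of `dumps()` produce a more readable layout, and the related `dump()` and `load()` functions work on file objects directly. The file name grades_pretty.json and the temporary directory are assumptions made for this example.

```python
import json
import os
import tempfile

python_dict = {'students': {'John': {'grades': [88]}, 'Kate': {'grades': [93]}}}

# indent and sort_keys make the output easier to read and to compare.
pretty = json.dumps(python_dict, indent=2, sort_keys=True)
print(pretty)

# dump() and load() take file objects; a temporary directory keeps this
# sketch from leaving files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'grades_pretty.json')
    with open(path, 'wt') as f:
        json.dump(python_dict, f, indent=2, sort_keys=True)
    with open(path, 'rt') as f:
        round_trip = json.load(f)

# The data survives the trip through the file unchanged.
print(round_trip == python_dict)
```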

There are a number of other structured file formats, including HTML, YAML, and INI.  Python has specialized modules, either in the standard library or available as third-party packages, that handle reading and writing each of these. When you encounter a new file format, make sure you search for a previously written module before you try to write one yourself!
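As one illustration, the standard library's configparser module reads INI files. The sketch below assumes a made-up INI layout with one section per student; it is only meant to show that the pattern is the same as before: hand the text to a parser and get Python objects back.

```python
import configparser

# Hypothetical INI layout: one section per student.
ini_text = '''
[John]
grade = 88

[Kate]
grade = 93
'''

parser = configparser.ConfigParser()
parser.read_string(ini_text)

# getint() converts the stored string to an int for us.
grades = {name: parser.getint(name, 'grade') for name in parser.sections()}
print(grades)
```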