# CSV 

Comma-separated values (CSV) is a common data storage format. Yet, despite its name, it is actually an amalgam of a variety of possible formats, some of which do not even use commas as separators! Rather, when I say CSV here I'm referring to data that is stored in a rectangular format, with rows, columns, and often headers. Such data is extremely common on the web when looking for data sets. However, tables vary a lot in their specification. Here are some important features to consider:

- **Headers**: Does the file have headers for the columns? The pandas ```pd.read_csv()``` function assumes by default that the first line is a header; you can state this explicitly with ```header=0```, or pass ```header=None``` if the file has no header. Python's built-in ```csv.reader``` has no header argument and simply returns the header as the first row, whereas ```csv.DictReader``` uses the first row as its keys. Neither deals especially nicely with two-line headers. If you have a two-line header, one option is to read the file in and write it back out line by line, excluding the first line, and then import the new file as usual. In pandas, the ```skiprows``` argument can skip the unwanted line without creating a new file.
- **Quote character**: The quote character is used to indicate that all of the text inside quotes refers to one string. This is important when you are parsing a comma-separated file but the data itself contains commas. For example, imagine a file of addresses with headers: Name, Location, User. Now imagine one line of data is "Drake, Toronto, Canada, False". A parser might read that as ```Name: Drake```, ```Location: Toronto```, ```User: Canada```, followed by a parse error because there is one field too many. If, on the other hand, the data was:

~~~
Drake, "Toronto, Canada", False 
~~~

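Then it is clear that the comma between Toronto and Canada belongs in the string. Now you might be wondering: what if you want to use the quote character inside the string? This is where we would use 'escape codes', in this case ```\"``` to include ```"``` inside a string. Many CSV files instead double the quote character (```""```), which is the convention that Python's ```csv``` module and pandas assume by default. Here is an entry using the backslash form: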

~~~ 
Name, Location, User
"Sean \"P. Diddy\" Combs", "New York City, NY", False
~~~
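As a minimal sketch of how Python's ```csv``` module treats the rows above, the example below parses them from in-memory strings rather than a file. The helper ```parse_line``` is a hypothetical name used only for illustration. Two assumptions are worth flagging: ```skipinitialspace=True``` is needed because the sample rows put a space after each comma, and backslash escaping has to be requested with ```escapechar``` and ```doublequote=False```, since the module expects doubled quotes by default.

```python
import csv
import io

# Sample rows from the text above, held in memory instead of a file.
unquoted = 'Drake, Toronto, Canada, False\n'
quoted = 'Drake, "Toronto, Canada", False\n'
escaped = '"Sean \\"P. Diddy\\" Combs", "New York City, NY", False\n'
doubled = '"Sean ""P. Diddy"" Combs", "New York City, NY", False\n'


def parse_line(text, **options):
    """Parse a single CSV line and return its fields as a list.

    Hypothetical helper for this example only.
    """
    # skipinitialspace=True drops the space after each comma, so a quote
    # that follows ", " is still recognised as the start of a quoted field.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True, **options)
    return next(reader)


print(parse_line(unquoted))   # four fields: the location is split in two
print(parse_line(quoted))     # three fields: the quoted comma is kept
print(parse_line(escaped, escapechar='\\', doublequote=False))
print(parse_line(doubled))    # the module's default escaping convention
```

The unquoted row comes back with four fields, while the quoted row comes back with three, with ```Toronto, Canada``` kept together. Both escaping styles yield the same name, ```Sean "P. Diddy" Combs```.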

- **Delimiter**: Even though it is called a comma-separated file, sometimes people use a different character to separate the values. A common option is the tab character, in which case the file is sometimes referred to as a TSV (tab-separated values) file.
- **New Lines**: There are two issues with new lines that sometimes trip people up. The first is that in some files, particularly those created on Windows, each line ends with ```\r\n``` rather than just ```\n```. The ```\r``` represents a 'carriage return', that is, the cursor should return to the left of the screen, while the ```\n``` moves it down one line. This is very similar to a typewriter. Thankfully, most CSV tools, including Python's ```csv``` module and pandas, cope with both endings. The second issue is how many ```\n``` characters are at the end of the file. If there is more than one, some parsers get confused because they expect a full row in between them.
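To make the header, delimiter, and new line points concrete, here is a small sketch using Python's ```csv``` module. The data is a made-up, in-memory sample (not a file from this book) that is tab-delimited, has a two-line header, uses ```\r\n``` line endings, and ends with an extra blank line.

```python
import csv
import io

# Hypothetical sample: tab-delimited, a two-line header, Windows-style
# "\r\n" line endings, and an extra blank line at the end.
raw = (
    "Name\tSpecies\tFirst Appearance\r\n"
    "(text)\t(text)\t(year)\r\n"
    "Kermit\tFrog\t1955\r\n"
    "Fozzie\tBear\t1976\r\n"
    "\r\n"
)

# newline='' passes the line endings to the csv module untouched, which is
# what the csv documentation recommends when opening real files as well.
handle = io.StringIO(raw, newline='')
reader = csv.reader(handle, delimiter='\t')

header = next(reader)   # first header line: the column names
next(reader)            # second header line: read and discarded

# A blank line comes back as an empty list, so filter those out.
rows = [row for row in reader if row]

print(header)
print(rows)
```

Skipping the second header line with ```next(reader)``` avoids writing a cleaned copy of the file, and filtering out empty lists is one simple way to ignore stray blank lines at the end.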

It is not difficult to build your own CSV parser. In fact, that is one of the exercises in this section. However, it is clear that there are enough little details to attend to that it makes sense to use the built-in packages where possible. Python offers two main ways to parse CSV files. The first is the ```csv``` library. This is part of Python's standard library, so it can be imported without installing anything. It has many options for separators and quoting. It also has some nice ways to index the data. For example, if you want each row of your data as a dictionary with the headers as the keys and that row's entries as the values, you can use ```csv.DictReader``` to do that.

## Using the built-in CSV reader
The basic usage, however, is to iterate through a file line by line. Instead of iterating through with 'readline' and splitting the text that comes back, you create a "csv reader", and this iterates line by line returning not a string of text, but a list split at every comma (or user-defined separator). 

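In [1]:

~~~
import csv
import os

path = os.path.join('..', 'Data', 'MuppetsTable.csv')

# utf-8-sig drops the byte order mark (BOM) at the start of the file, which
# would otherwise be read as part of the first header name.
with open(path, newline='', encoding='utf-8-sig') as file_to_read:
    filereader = csv.reader(file_to_read, delimiter=',', quotechar='"')
    for row in filereader:
        # Left-align every field in a 20-character column.
        row = ["{:<20}".format(x) for x in row]
        print("".join(row))
~~~

~~~
Name                Gender              Species             First Appearance
Fozzie              Male                Bear                1976
Kermit              Male                Frog                1955
Piggy               Female              Pig                 1974
Gonzo               Male                Unknown             1970
Rowlf               Male                Dog                 1962
Beaker              Male                Muppet              1977
Janice              Female              Muppet              1975
Hilda               Female              Muppet              1976
~~~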


The nice thing about ```csv```, particularly when not using pandas, is the ```DictReader```. This returns each row as a dictionary with the headers as the keys and the entries in that row as the values. If there's no header line, you can specify a list to be the keys using the ```fieldnames``` argument, such as ```fieldnames=["Name", "Location", "User"]```.
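A short sketch of both cases follows. The sample strings are made up for illustration and held in memory with ```io.StringIO```; a real script would pass an open file object instead.

```python
import csv
import io

# Hypothetical sample data; one string has a header line, the other does not.
with_header = "Name,Species\nKermit,Frog\nFozzie,Bear\n"
without_header = "Drake,Toronto,False\n"

# With a header line, the first row supplies the dictionary keys.
for row in csv.DictReader(io.StringIO(with_header)):
    print(row["Name"], "is a", row["Species"])

# Without a header line, fieldnames supplies the keys instead.
reader = csv.DictReader(io.StringIO(without_header),
                        fieldnames=["Name", "Location", "User"])
records = list(reader)
print(records[0]["Location"])
```

Note that every value comes back as a string, so ```records[0]["User"]``` is the text ```"False"``` rather than a boolean.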

## Using the Pandas CSV reader 
To import into a DataFrame directly using pandas, observe: 

In [2]:

~~~
import os
import pandas as pd

# Build the path in a platform-independent way and read it into a DataFrame.
df = pd.read_csv(os.path.join('..', 'Data', 'MuppetsTable.csv'))
~~~

Just like ```csv.reader```, the pandas ```pd.read_csv``` function has many arguments for things like headers and delimiters.
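As a rough illustration of a few of those arguments, the sketch below reads a made-up tab-separated sample that has a second header line. The data is an assumption for the sake of the example and is held in memory rather than read from disk.

```python
import io

import pandas as pd

# Hypothetical tab-separated sample with a second header line of units.
raw = (
    "Name\tSpecies\tFirst Appearance\n"
    "(text)\t(text)\t(year)\n"
    "Kermit\tFrog\t1955\n"
    "Fozzie\tBear\t1976\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    sep="\t",        # the delimiter; a comma is the default
    header=0,        # the first line holds the column names
    skiprows=[1],    # skip the second line of the file (0-indexed)
)
print(df)
```

Because the second header line is skipped before parsing, the ```First Appearance``` column is read as integers rather than as text.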

> TIP: **Using help()** 
>
> You can use ```help()``` in Jupyter or in a Python console by passing it any method or function. So to learn about all the arguments for ```read_csv```, you would run ```help(pd.read_csv)```. A word of caution: note that this is without the ```()``` after ```read_csv```. If you put those parentheses inside the call to ```help()```, then it will first _evaluate_ ```read_csv()```, which means you will be asking for help on whatever ```read_csv()``` returns, not on ```read_csv``` itself.

Since we will almost always be moving data to a DataFrame, this is often a very handy thing to get working. However, as data gets larger, reading straight into a DataFrame gets increasingly slow. For very massive files you might want to read them piecemeal. How large are we talking? On modern computers we might be talking CSVs in the hundreds of megabytes or more. Below that, Python should be very speedy importing and parsing CSVs. Above that, you will want to consider whether to read in only parts of the file at a time or to use another strategy, typically to divide and conquer the data. For really big data (on the order of gigabytes or more, where you will have more data than RAM) you will want to turn to server-based solutions that are outside the scope of this book, such as Google's BigQuery.
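One way to read a file piecemeal is the ```chunksize``` argument of ```pd.read_csv```. The sketch below uses a tiny in-memory sample as a stand-in for a large file, so the column name ```value``` and the chunk size of 4 are assumptions chosen only to keep the example short.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large file: 10 rows, read 4 at a time.
raw = "value\n" + "\n".join(str(n) for n in range(10)) + "\n"

total = 0
n_chunks = 0

# chunksize makes read_csv return an iterator of DataFrames rather than
# one big DataFrame, so only one chunk is held in memory at a time.
for chunk in pd.read_csv(io.StringIO(raw), chunksize=4):
    total += int(chunk["value"].sum())
    n_chunks += 1

print(n_chunks, total)
```

Here the ten rows arrive as three chunks (of 4, 4, and 2 rows), and the running total is built up without ever holding the whole table at once. With a real file you would pass the file path in place of ```io.StringIO(raw)``` and pick a much larger chunk size.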