# 6.1 Reading and Writing Data in Text Format 

Parsing functions in pandas:
- `read_csv`
- `read_fwf`
- `read_clipboard`
- `read_excel`
- `read_hdf`
- `read_html`
- `read_json`
- `read_msgpack`
- `read_pickle`
- `read_sas`
- `read_sql`
- `read_stata`
- `read_feather`

These functions take optional arguments that fall into a few categories:
- Indexing: Can treat one or more columns as the index of the returned DF, and controls whether column names come from the file, from the user, or not at all.
- Type Inference and Data Conversion: Includes **user-defined value conversions** and custom lists of missing value markers.
- Datetime Parsing: Includes a **combining capability**, such as combining date and time info spread over multiple columns into a single column in the result.
- Iterating: Iterating over chunks of very large files.
- Unclean Data Issues: Skipping rows or a footer, comments, or other minor things like numeric data with thousands separated by commas.

Some of these functions perform type inference because the column data types are not part of the data format. 

In [3]:
# read examples/ex1.csv
import pandas as pd 
df = pd.read_csv('examples/ex1.csv')
df

Unnamed: 0,a,b,c,2,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


Notice the file has a header row. If the file you are working with does not have one, pass `header=None` **or** assign the names to the columns yourself by passing the `names` argument. 
- `df = pd.read_csv('examples/ex1.csv', header=None)`
- `df = pd.read_csv('examples/ex1.csv', names=['col1', 'col2', ...])`
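As a minimal sketch of the two options above, the snippet below reads the same headerless data twice. The inline string and the column names are assumptions standing in for a real file on disk.

```python
import io

import pandas as pd

# Hypothetical headerless CSV content (stands in for a file on disk).
raw = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

# header=None: pandas assigns integer column labels 0, 1, 2, ...
df_default = pd.read_csv(io.StringIO(raw), header=None)

# names=[...]: supply the column labels yourself.
df_named = pd.read_csv(io.StringIO(raw), names=["a", "b", "c", "d", "message"])

print(df_default.columns.tolist())
print(df_named.columns.tolist())
```

In both cases the first line of the data is kept as a data row rather than consumed as a header.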

You can indicate what column you would like to be the index column:
- `index_col='col6'`

Furthermore you can create a hierarchical index (multiple index values) by passing a list of columns to the `index_col` argument. 
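A small sketch of a hierarchical index built from two columns. The data and the column names `key1`, `key2`, `value1`, `value2` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical data with two key columns that together identify a row.
raw = (
    "key1,key2,value1,value2\n"
    "one,a,1,2\n"
    "one,b,3,4\n"
    "two,a,5,6\n"
    "two,b,7,8\n"
)

# Passing a list to index_col produces a MultiIndex with one level per column.
parsed = pd.read_csv(io.StringIO(raw), index_col=["key1", "key2"])

print(parsed.index.nlevels)
print(parsed.loc[("two", "a"), "value1"])
```

Rows can then be selected with a tuple of index values, one per level.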

In cases where the data does not have a fixed delimiter (for example, fields separated by a variable amount of whitespace), you can pass a **regular expression** as the `sep` argument. `read_csv` can also infer which column should be the DF's index: it does this when the header has one fewer column name than the data rows have fields.
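The sketch below illustrates both points with made-up whitespace-separated data: a regular expression as `sep`, and index inference when the header is one name short.

```python
import io

import pandas as pd

# Hypothetical data: fields separated by a variable amount of whitespace.
# The header names 3 columns, but each data row has 4 fields.
raw = (
    "A B C\n"
    "aaa   0.5  1.25 -2.0\n"
    "bbb  1.5   0.25    4.0\n"
)

# r"\s+" matches one or more whitespace characters.
result = pd.read_csv(io.StringIO(raw), sep=r"\s+")

print(result.index.tolist())
print(result.columns.tolist())
```

Because the header is one name short, the first field of each data row is used as the index.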

Pass a list of row indices to the `skiprows` argument to skip those rows when loading in the data.
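For instance, assuming a file whose first, third, and fourth lines are stray notes (the content below is invented for illustration), `skiprows` could be used like this:

```python
import io

import pandas as pd

# Hypothetical file content with note lines mixed into the data.
raw = (
    "# hey!\n"
    "a,b,c,d,message\n"
    "# just a note\n"
    "# another note\n"
    "1,2,3,4,hello\n"
    "5,6,7,8,world\n"
)

# Row numbers are 0-based and count lines in the raw file.
skipped = pd.read_csv(io.StringIO(raw), skiprows=[0, 2, 3])

print(skipped)
```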

Handling missing values is an important and frequently nuanced part of the file parsing process. Missing data is usually either not present (empty string) or marked by some **sentinel** value. By default, pandas uses a set of commonly occurring sentinels such as `NA` and `NULL`.
- Use `pd.isnull(data_frame)` to return a boolean DF indicating missing values. You can also use the result as a mask.
- Pass a list, set, or dict of strings to the `na_values` argument to specify additional strings that should be treated as missing.
- You can assign different `NA` sentinels to each column: pass a dict to the `na_values` argument with the column name as the key and the sentinel (or list of sentinels) as the value.
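The following sketch shows per-column sentinels and the boolean mask together. The data and the chosen sentinels (`foo`, `two`) are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical data; "NA" is one of the default sentinels pandas recognizes.
raw = (
    "something,a,b,message\n"
    "one,1,2,NA\n"
    "two,5,6,world\n"
    "three,9,10,foo\n"
)

# Different sentinels per column: keys are column names.
sentinels = {"message": ["foo", "NA"], "something": ["two"]}
result = pd.read_csv(io.StringIO(raw), na_values=sentinels)

# Boolean DataFrame marking the missing entries.
mask = pd.isnull(result)

# Use the mask to keep only rows where 'message' is present.
present = result[~mask["message"]]
print(present)
```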


Common `read_csv` function arguments:
- `path`
- `sep` or `delimiter`
- `header` a row number to use as column names
- `index_col`
- `names`
- `skiprows`
- `na_values`
- `comment`
- `parse_dates`
- `keep_date_col`
- `converters`
- `dayfirst`
- `date_parser` a function to use to parse dates
- `nrows`
- `iterator`
- `chunksize`
- `skipfooter`
- `verbose`
- `encoding`
- `squeeze` Returns a series if the parsed data only contains one column.
- `thousands`
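A few of these arguments can be combined in one call. The sketch below is only an illustration: the semicolon-separated data, the column names, and the choice of `str.upper` as a converter are all assumptions.

```python
import io

import pandas as pd

# Hypothetical export: a comment line, ';' as the field separator,
# and ',' as the thousands separator inside numbers.
raw = (
    "# exported by a hypothetical tool\n"
    "date;amount;code\n"
    "2020-01-02;1,234;ab\n"
    "2020-01-03;5,678;cd\n"
    "2020-01-04;9,012;ef\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    sep=";",
    comment="#",                    # ignore everything after '#'
    thousands=",",                  # "1,234" is parsed as 1234
    parse_dates=["date"],           # convert this column to datetimes
    converters={"code": str.upper}, # apply a function to each value
    nrows=2,                        # read only the first two data rows
)

print(df.dtypes)
print(df)
```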

## Reading Text Files in Pieces

Use the `nrows` argument with `pd.read_csv()` to limit the number of rows to load in. Alternatively, iterate over the file in pieces using the `chunksize` argument:
- Create a `TextFileReader` chunk object: `chunker = pd.read_csv('path', chunksize=1000)`
- Now use this object to iterate over the file.

In [None]:
# aggregate the counts in the 'key' column of our data. 

# total = pd.Series([], dtype='float64')
# for piece in chunker:
#     # add() returns a new Series, so the result must be reassigned
#     total = total.add(piece['key'].value_counts(), fill_value=0)
# total = total.sort_values(ascending=False)

This code produces a Series `total` whose index holds the distinct values of the `key` column and whose values are their counts across all chunks, sorted in descending order.
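A self-contained version of the same idea, using a small in-memory string in place of a large file. The data and the chunk size of 4 are assumptions chosen so the loop runs more than once.

```python
import io

import pandas as pd

# Hypothetical data: ten rows with a 'key' column (a=5, b=3, c=2).
raw = "key,value\n" + "".join(f"{k},{i}\n" for i, k in enumerate("abcaabcaab"))

# chunksize makes read_csv return an iterator of DataFrames.
chunker = pd.read_csv(io.StringIO(raw), chunksize=4)

total = pd.Series(dtype="float64")
for piece in chunker:
    # Accumulate per-chunk counts; fill_value=0 handles keys
    # that have not been seen in earlier chunks.
    total = total.add(piece["key"].value_counts(), fill_value=0)

total = total.sort_values(ascending=False)
print(total)
```

The counts come out as floats because the accumulator starts as an empty float Series.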

## Writing Data to Text Format

Use the `to_csv` method. 
- `sep` can be used here for delimiter.
- By default, missing values appear as empty strings in the output. Pass `na_rep` to change this.
- Both the row and column labels are written by default. This can be disabled with `index=False`, `header=False`.
- Write only a subset of the columns by passing a list of column names to the `columns` argument.
- Series also have a `to_csv` method.  
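The options above can be sketched as follows. The DataFrame contents are made up, and the output is written to an in-memory buffer rather than a file so the result can be inspected directly.

```python
import io

import pandas as pd

# Hypothetical frame with one missing value.
frame = pd.DataFrame(
    {"a": [1, 5], "b": [2.0, float("nan")], "message": ["hello", "world"]}
)

# Custom delimiter, custom marker for missing values, no row labels.
buf = io.StringIO()
frame.to_csv(buf, sep="|", na_rep="NULL", index=False)
full_lines = buf.getvalue().splitlines()
print(full_lines)

# With no path, to_csv returns the text; write a subset of columns
# with neither row labels nor a header.
subset_lines = frame.to_csv(
    index=False, header=False, columns=["a", "message"]
).splitlines()
print(subset_lines)
```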

## Working with Delimited Formats 

For any file with a single-character delimiter, you can use Python's built-in `csv` module.

In [5]:
import csv 
f = open('examples/ex7.csv')
reader = csv.reader(f)

In [6]:
for line in reader:
    print(line)

['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3']


Now let's put the data in the form that we need.

In [10]:
with open('examples/ex7.csv') as f:
    lines=list(csv.reader(f))

# split the lines into the header line and the data lines:
header, values = lines[0], lines[1:]

# create a dict of data columns using a dict comprehension and the expression zip(*values), which transposes rows to columns.
data_dict = {h:v for h, v in zip(header, zip(*values))}
data_dict

{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}

Since CSV files come in many different flavors, we can define a new format with a different delimiter, string quoting convention, or line terminator. To do so, define a simple subclass of `csv.Dialect`:

In [None]:
class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL

# f must be an open file handle here
reader = csv.reader(f, dialect=my_dialect)

If you don't need to go this far with it, you can simply pass individual dialect parameters as keyword arguments to `csv.reader`.
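As a short sketch of passing dialect parameters directly, the snippet below parses pipe-delimited text that uses single quotes for quoting. The data is invented for illustration.

```python
import csv
import io

# Hypothetical pipe-delimited content; the quoted field contains a delimiter.
raw = "a|b|c\n1|2|3\n4|'5|6'|7\n"

# Dialect options are passed as keyword arguments, no subclass needed.
reader = csv.reader(io.StringIO(raw), delimiter="|", quotechar="'")
rows = list(reader)

for row in rows:
    print(row)
```

The quoted field `'5|6'` is kept as one value because the delimiter inside the quotes is not treated as a separator.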

CSV Dialect Options:
- `delimiter`
- `lineterminator`
- `quotechar`
- `quoting`
- `skipinitialspace`
- `doublequote`
- `escapechar`

Note: For files with more complicated or fixed multicharacter delimiters, you will not be able to use the `csv` module. In those cases, you will have to do the line splitting and other cleanup using the string `split` method or the regular expression function `re.split`.
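A minimal sketch of both approaches, assuming a made-up two-character delimiter `::` and a separate line with irregular whitespace:

```python
import re

# Hypothetical lines using a fixed multicharacter delimiter.
lines = ["a::b::c", "1 ::2::  3", "4::5 ::6"]

# str.split handles a fixed delimiter; strip() cleans stray whitespace.
rows = [[field.strip() for field in line.split("::")] for line in lines]
print(rows)

# re.split handles delimiters described by a pattern,
# here any run of spaces or tabs.
fields = re.split(r"\s+", "x   y\tz")
print(fields)
```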