# 6.1 Reading and Writing Data in Text Format 

Parsing functions in pandas:
- `read_csv`
- `read_fwf`
- `read_clipboard`
- `read_excel`
- `read_hdf`
- `read_html`
- `read_json`
- `read_msgpack` (removed in pandas 1.0)
- `read_pickle`
- `read_sas`
- `read_sql`
- `read_stata`
- `read_feather`

These functions take optional arguments that fall into a few categories:
- Indexing: Can treat one or more columns as the index of the returned DataFrame, and controls whether column names come from the file, the user, or not at all.
- Type Inference and Data Conversion: Includes **user-defined value conversions** and custom lists of missing value markers.
- Datetime Parsing: Includes a **combining capability**, such as merging date and time information spread over multiple columns into a single column in the result.
- Iterating: Iterating over chunks of very large files.
- Unclean Data Issues: Skipping rows, a footer, or comments, and handling other minor things like numeric data with thousands separated by commas.
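
As a rough illustration of the last two categories, the sketch below reads a small in-memory table in chunks while parsing comma-separated thousands. The data, the column names `key` and `amount`, and the chunk size of 4 are all made up for the example; an `io.StringIO` buffer stands in for a large file on disk.

```python
import io
import pandas as pd

# Hypothetical data: 10 rows whose amounts are written like "1,000".
# The amounts are quoted so the inner comma is not read as a delimiter.
raw = 'key,amount\n' + ''.join(
    f'k{i % 3},"{1000 * (i + 1):,}"\n' for i in range(10)
)

chunk_sizes = []
total = 0
# chunksize makes read_csv return an iterator of DataFrames
# instead of one large DataFrame.
for chunk in pd.read_csv(io.StringIO(raw), thousands=',', chunksize=4):
    chunk_sizes.append(len(chunk))
    total += chunk['amount'].sum()

print(chunk_sizes)
print(total)
```

Each chunk is an ordinary DataFrame, so per-chunk aggregation like the running sum above keeps memory use bounded by the chunk size.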

Some of these functions perform type inference because the column data types are not part of the data format. 
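
A minimal sketch of what type inference looks like in practice, using made-up inline data rather than a file from the notes. The `dtype` argument shown at the end is not discussed above; it is included only as one way to override the inferred type.

```python
import io
import pandas as pd

# Hypothetical CSV text; nothing in it declares column types.
raw = 'a,b,c\n1,2.5,x\n3,4.0,y\n'

# pandas inspects the values and picks a dtype per column.
inferred = pd.read_csv(io.StringIO(raw))
print(inferred.dtypes)

# Passing dtype overrides the inference for the named column.
explicit = pd.read_csv(io.StringIO(raw), dtype={'a': 'float64'})
print(explicit.dtypes)
```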

In [3]:
# read examples/ex1.csv
import pandas as pd 
df = pd.read_csv('examples/ex1.csv')
df

Unnamed: 0,a,b,c,2,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


Notice the file has a header row. If the file you are working with does not have one, pass `header=None` to let pandas assign default column names, **or** assign the names yourself by passing the `names` argument.
- `df = pd.read_csv('examples/ex1.csv', header=None)`
- `df = pd.read_csv('examples/ex1.csv', names=['col1','col2', ...])`
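
The two options can be compared side by side on a headerless table. This is a sketch with made-up inline data (an `io.StringIO` buffer instead of a file), and the column names passed to `names` are chosen only for illustration.

```python
import io
import pandas as pd

# Hypothetical headerless data: the first line is already a data row.
raw = '1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n'

# header=None: pandas numbers the columns 0, 1, 2, ...
auto = pd.read_csv(io.StringIO(raw), header=None)
print(list(auto.columns))

# names=[...]: supply the column names yourself.
named = pd.read_csv(io.StringIO(raw), names=['a', 'b', 'c', 'd', 'message'])
print(list(named.columns))
```

In both cases all three lines are kept as data rows, since neither call treats the first line as a header.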

You can indicate which column you would like to use as the index:
- `index_col='col6'`

Furthermore, you can create a hierarchical index (multiple index levels) by passing a list of column names or numbers to the `index_col` argument.
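
A short sketch of both forms of `index_col`, assuming a small made-up table with two key columns (`key1`, `key2`) and two value columns; none of these names come from the example files.

```python
import io
import pandas as pd

# Hypothetical data with two columns that together identify a row.
raw = (
    'key1,key2,value1,value2\n'
    'one,a,1,2\n'
    'one,b,3,4\n'
    'two,a,5,6\n'
    'two,b,7,8\n'
)

# A single column name gives a flat index.
single = pd.read_csv(io.StringIO(raw), index_col='key1')

# A list of column names gives a hierarchical (multi-level) index.
multi = pd.read_csv(io.StringIO(raw), index_col=['key1', 'key2'])

print(multi)
# Rows can then be selected with a tuple of index values.
print(multi.loc[('two', 'a'), 'value1'])
```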

When the fields are not separated by a fixed delimiter (for example, by a variable amount of whitespace), pass a **regular expression** as the `sep` argument. `read_csv` can also infer which column should be the DataFrame's index: when the header row has one fewer column name than there are fields in each data row, the first column is used as the index.
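
Both behaviors show up in the following sketch, which uses made-up whitespace-separated text. The header has three names while each data row has four fields, so the first field of each row is expected to become the index.

```python
import io
import pandas as pd

# Hypothetical data: fields separated by a varying number of spaces,
# and a header with one fewer name than the data rows have fields.
raw = (
    'A   B  C\n'
    'aaa 1 2    3\n'
    'bbb   4 5 6\n'
)

# r'\s+' is a regular expression matching one or more whitespace characters.
df = pd.read_csv(io.StringIO(raw), sep=r'\s+')

print(df)
print(list(df.index))
```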

Pass a list of row indices (zero-based line numbers) to the `skiprows` argument to skip those rows when loading the data.
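
For instance, free-text lines scattered around the header can be dropped by position. The text below is invented for the sketch; the line numbers in `skiprows` count from zero and refer to lines of the raw input, not rows of the result.

```python
import io
import pandas as pd

# Hypothetical file content: lines 0, 2 and 3 are notes, line 1 is the header.
raw = (
    '# hey!\n'
    'a,b,c,d,message\n'
    '# note one\n'
    '# note two\n'
    '1,2,3,4,hello\n'
    '5,6,7,8,world\n'
)

df = pd.read_csv(io.StringIO(raw), skiprows=[0, 2, 3])
print(df)
```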

Handling missing values is an important and frequently nuanced part of the file parsing process. Missing data is usually either not present (empty string) or marked by some **sentinel** value. By default, pandas uses a set of commonly occurring sentinels such as `NA` and `NULL`.
- Use `pd.isnull(data_frame)` to return a boolean DataFrame indicating missing values. You can also use the result as a mask.
- Pass a list or set of strings to the `na_values` argument to add strings that should be treated as missing values.
- To assign different `NA` sentinels to each column, pass a dict to `na_values` with the column name as the key and the sentinel (or list of sentinels) as the value.
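
The points above can be sketched on a small made-up table. The column names and the choice of `'foo'` and `'two'` as extra sentinels are illustrative assumptions only.

```python
import io
import pandas as pd

# Hypothetical data: one 'NA' sentinel and one empty field.
raw = (
    'something,a,b,message\n'
    'one,1,2,NA\n'
    'two,5,,world\n'
    'three,9,10,foo\n'
)

# Default behavior: 'NA' and the empty field are both read as missing.
default = pd.read_csv(io.StringIO(raw))
mask = pd.isnull(default)
print(mask)

# Use a boolean column as a mask to select rows where 'b' is missing.
rows_missing_b = default[pd.isnull(default['b'])]
print(rows_missing_b)

# Per-column sentinels: treat 'foo' as missing in 'message'
# and 'two' as missing in 'something', on top of the defaults.
per_column = pd.read_csv(
    io.StringIO(raw),
    na_values={'message': ['foo'], 'something': ['two']},
)
print(per_column)
```

The per-column sentinels are added to the default ones here, so the original `NA` and empty field are still read as missing.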


Common `read_csv` function arguments:
- `path`
- `sep` or `delimiter`
- `header` a row number to use as column names
- `index_col`
- `names`
- `skiprows`
- `na_values`
- `comment`
- `parse_dates`
- `keep_date_col`
- `converters`
- `dayfirst`
- `date_parser` a function to use to parse dates
- `nrows`
- `iterator`
- `chunksize`
- `skipfooter` (spelled `skip_footer` in older pandas versions)
- `verbose`
- `encoding`
- `squeeze` Re