# Introduction to the pandas DataFrame

In [None]:
import numpy as np
import pandas as pd

### Execute the cell below if there is no output

In [None]:
df= pd.read_csv('./iris-data-index-column.csv', index_col=0, header=0)
print(df)

* You can think of a DataFrame as a spreadsheet
* Each column is stored as a separate pandas Series
* There are many ways to set up the index (d1, a1, ...) and the column names (Sepal-Length, Sepal-Width, ...)
* By default, floats are stored as float64 and integers are stored as int64
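
The points above can be checked on a tiny hand-built DataFrame. This is only a sketch: the values, the `Count` column, and the variable names are made up for illustration and are not read from the iris files.

```python
import pandas as pd

# Hypothetical sample values, not the notebook's iris file.
sample = pd.DataFrame(
    {"Sepal-Length": [5.1, 4.9, 4.7], "Count": [1, 2, 3]},
    index=["d1", "a1", "d2"],
)

# Selecting one column gives back a pandas Series.
column = sample["Sepal-Length"]
print(type(column).__name__)

# dtypes lists the storage type chosen for each column.
print(sample.dtypes)
```
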

### Execute the cell below to see the official documentation on a DataFrame.

In [None]:
help(pd.DataFrame)

# Getting the Data

## Checking for data
Execute the next cell to check that the iris data set is available. You should see:

    1. iris-data.csv
    2. iris-data-column.csv
    3. iris-data-index-column.csv
    4. iris-data.txt
    
The data set has been corrected. The index column is bogus and was added just for this notebook.



In [None]:
!ls -l iris*

## The first step is just taking a quick look at the data
In the three cells below, `cat` is used to look at each of the csv files.
Remember that to execute a shell command inside of a notebook, the format is `!<command>`.



In [None]:
!cat iris-data-column.csv

* The first line gives the column names.

In [None]:
!cat iris-data-index-column.csv

* The first column was added to show how to control the index
* Notice the column label row starts with a comma (,) to show the first column is the index column
* The index column (d1,a1,...) does not have a name in the column row (called a header)

In [None]:
!cat iris-data.csv 

This is an extract from the original raw data set—small so that the output can easily be seen.

The command to read data from a csv file is `pd.read_csv`.

Execute the next cell.

Look at all of the parameters for a minute. Then continue on in this notebook.

In [None]:
help(pd.read_csv)

Execute the next cell.

In [None]:
raw_df = pd.read_csv('./iris-data.csv', header=None)
print(raw_df)

If you do not include the keyword option `header=<row index>`, the first line is taken as the header line and used for the column headings.

    * header=None
        - Columns are numbered starting at 0
        - Use the keyword argument `names=<list of column names>` to supply column names
    * header=0
        - The first row (counting starts at 0) holds the headers
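
The `header` and `names` options can be tried without touching any file by reading from an in-memory string. The sketch below assumes two made-up rows in the style of the iris file; `csv_text` and the other variable names are illustrative only.

```python
import io
import pandas as pd

# Two made-up rows with no header line.
csv_text = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"

# header=None: no header line in the data, columns are numbered from 0.
no_header = pd.read_csv(io.StringIO(csv_text), header=None)
print(list(no_header.columns))

# header=None plus names=...: supply the column names yourself.
names = ["Sepal-Length", "Sepal-Width", "Petal-Length", "Petal-Width", "Class"]
named = pd.read_csv(io.StringIO(csv_text), header=None, names=names)
print(list(named.columns))

# header=0: the first line of the data holds the column names.
text_with_header = ",".join(names) + "\n" + csv_text
with_header = pd.read_csv(io.StringIO(text_with_header), header=0)
print(with_header.shape)
```
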
    
Execute the cell below.

See how you can change the column names.
    
The column names are stored as an Index object in the DataFrame (`raw_df.columns`).

In [None]:
raw_df.columns=['Sepal-Length', 'Sepal-Width','Petal-Length','Petal-Width', 'Class']
print(raw_df)

If there is an index column you must specify it. This is done with the keyword `index_col=<column number>`. This is not done very often.

The index is stored as an Index object. In the DataFrame below I am using the index as a way to say who the collector of the data was.

Execute the cell below.


In [None]:
df=pd.read_csv('./iris-data-index-column.csv', index_col=0, header=0)
print(df)
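
To see what `index_col=0` changes, the sketch below reads the same made-up text twice, once with and once without the keyword. The text imitates the layout of `iris-data-index-column.csv` (header row starting with a comma), but the values and variable names are assumptions for illustration.

```python
import io
import pandas as pd

# Made-up text: the header row starts with a comma, so the first column has no name.
csv_text = ",Sepal-Length,Sepal-Width\nd1,5.1,3.5\na1,4.9,3.0\n"

# With index_col=0 the first column becomes the row labels.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0, header=0)
print(list(indexed.index))
print(list(indexed.columns))

# Without index_col the first column is kept as ordinary data
# and pandas gives it a placeholder name.
plain = pd.read_csv(io.StringIO(csv_text), header=0)
print(list(plain.columns))
```
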

You can change the index labels.

Execute the cell below.

In [None]:
df.index=[1,2,3,4,5,6]
print(df)
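
One thing to keep in mind when replacing index labels is that the new list must have exactly one label per row. A minimal sketch on a made-up three-row DataFrame (the names and values are hypothetical):

```python
import pandas as pd

# Hypothetical three-row frame with string labels.
frame = pd.DataFrame({"Sepal-Length": [5.1, 4.9, 4.7]}, index=["d1", "a1", "d2"])

# Replace the labels with a list of the same length.
frame.index = [1, 2, 3]
labels_after = frame.index.tolist()
print(labels_after)

# A list of the wrong length is rejected with a ValueError.
rejected = False
try:
    frame.index = [1, 2]
except ValueError as err:
    rejected = True
    print("rejected:", err)
```
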

## Saving a DataFrame as a CSV file

To save a DataFrame to a CSV file, use `.to_csv`

In [None]:
help(df.to_csv)

In [None]:
df.to_csv("./saved_iris.csv")

In [None]:
!cat saved_iris.csv

* Index
    - The index is written out by default (`index=True`)
    - To not write out the index, use `index=False`
* Header
    - Use `header=False` if no column labels are to be saved
    - Use `header=True` (the default) if column labels are to be saved
    - Use `header=<list of column names>` to write out a different set of column names
* Columns
    - Use `columns=<list of column names>` to write out only the selected columns
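
The effect of these keywords is easiest to see on the text that `to_csv` produces. When no file path is given, `to_csv` returns the CSV text as a string, which the sketch below uses to avoid writing files. The DataFrame contents and the replacement names `length` and `class` are made up for illustration.

```python
import pandas as pd

# Hypothetical two-row frame.
frame = pd.DataFrame(
    {"Sepal-Length": [5.1, 4.9], "Class": ["Iris-setosa", "Iris-setosa"]},
    index=["d1", "a1"],
)

# Default: index and header are both written.
with_index = frame.to_csv()

# index=False drops the index column.
without_index = frame.to_csv(index=False)

# header=<list> writes different column names.
renamed = frame.to_csv(index=False, header=["length", "class"])

# columns=<list> writes only the selected columns.
subset = frame.to_csv(index=False, columns=["Class"])

for text in (with_index, without_index, renamed, subset):
    print(text)
```
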

There are multiple other ways to create a DataFrame. 

See [Dataframe](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe)