In [1]:
import numpy as np
import pandas as pd

# Pandas DataFrames

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:

* Dict of 1D ndarrays, lists, dicts, or Series
* 2-D numpy.ndarray
* Structured or record ndarray
* A Series
* Another DataFrame

<img src="img/df1.jpg">
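The less common input types from the list above can be sketched as follows. The variable names and the small values here are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# From a single Series: the Series name becomes the column label.
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'], name='score')
df_from_series = pd.DataFrame(s)

# From a structured ndarray: the field names become the column labels.
records = np.array([(1, 2.5), (2, 3.5)],
                   dtype=[('id', 'i4'), ('weight', 'f8')])
df_from_records = pd.DataFrame(records)

# From another DataFrame: copy=True requests an independent copy.
df_copy = pd.DataFrame(df_from_records, copy=True)

print(df_from_series)
print(df_from_records)
```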

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. If axis labels are not passed, they are constructed from the input data: for example, a dict of Series yields the union of the Series indexes, and plain lists or arrays get a default integer index.

Here's an example where we have set the Dates column to be the index and label for the rows. 

<img src="img/df2.jpg">
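A minimal sketch of both ways to get date labels on the rows: passing `index` and `columns` to the constructor, or promoting an existing column with `set_index`. The column names (`Dates`, `open`, `close`) and the values are assumptions made for illustration only.

```python
import pandas as pd

dates = pd.date_range('2020-01-01', periods=3, freq='D')
data = {'open': [10.0, 11.0, 12.0], 'close': [10.5, 10.8, 12.4]}

# Labels passed explicitly: `index` sets the row labels,
# `columns` selects and orders the columns.
labeled = pd.DataFrame(data, index=dates, columns=['close', 'open'])

# Alternatively, start with the dates as an ordinary column
# and promote that column to the row index afterwards.
raw = pd.DataFrame({'Dates': dates, 'close': data['close']})
by_date = raw.set_index('Dates')

print(labeled)
print(by_date)
```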

## Creation of DataFrames

### Using Dictionaries

In [2]:
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

d

{'one': a    1.0
 b    2.0
 c    3.0
 dtype: float64, 'two': a    1.0
 b    2.0
 c    3.0
 d    4.0
 dtype: float64}

In [3]:
df = pd.DataFrame(d)
df

Unnamed: 0,one,two
a,1.0,1.0
b,2.0,2.0
c,3.0,3.0
d,,4.0
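The missing value in row `d` appears because the resulting index is the union of the two Series indexes, and column `one` has no entry for `d`. As a rough sketch, passing `index` or `columns` explicitly selects and reorders labels instead; the label `'three'` below is a hypothetical column that does not exist in the dict.

```python
import numpy as np
import pandas as pd

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

# Default: row labels are the union of both indexes; gaps become NaN.
df = pd.DataFrame(d)

# An explicit index keeps only the requested rows, in the requested order.
subset = pd.DataFrame(d, index=['d', 'b', 'a'])

# Requesting a column that is not in the dict gives an all-NaN column.
with_extra = pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])

print(subset)
print(with_extra)
```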


### From a Dictionary of ndarrays / lists

The ndarrays (or lists) must all be the same length. If an index is passed, it must also be the same length as the arrays. If no index is passed, the index will be ``range(n)``, where n is the array length.

In [4]:
d = {'first' : [1., 2., 3., 4.], 'second' : [4., 3., 2., 1.], 'third':np.random.randint(10,20,4)}

In [5]:
df = pd.DataFrame(d)
df

Unnamed: 0,first,second,third
0,1.0,4.0,11
1,2.0,3.0,15
2,3.0,2.0,13
3,4.0,1.0,10
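The length requirement can be checked with a small sketch: both mismatched columns and a mismatched index should raise a `ValueError`. The dict keys and flag names below are illustrative.

```python
import pandas as pd

column_mismatch = False
index_mismatch = False

# Columns of different lengths cannot be combined.
try:
    pd.DataFrame({'first': [1., 2., 3.], 'second': [4., 3.]})
except ValueError as exc:
    column_mismatch = True
    print('columns:', exc)

# An explicit index must have one label per row.
try:
    pd.DataFrame({'first': [1., 2., 3.]}, index=['a', 'b'])
except ValueError as exc:
    index_mismatch = True
    print('index:', exc)

# With matching lengths and no index, the row labels are 0..n-1.
ok = pd.DataFrame({'first': [1., 2., 3.], 'second': [4., 3., 2.]})
```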


### From a Numpy Array

In [6]:
data = np.random.randint(0,2,size=(5,3))
data

array([[1, 1, 1],
       [0, 0, 1],
       [1, 1, 1],
       [0, 0, 1],
       [1, 1, 1]])

In [7]:
col_names = ['simulation1','simulation2','simulation3']
index_list = list('abcde')

In [8]:
df = pd.DataFrame(data, index=index_list, columns=col_names)
df

Unnamed: 0,simulation1,simulation2,simulation3
a,1,1,1
b,0,0,1
c,1,1,1
d,0,0,1
e,1,1,1


### From a file

In [9]:
titanic = pd.read_csv('../datasets/titanic/titanic.csv')

In [10]:
titanic

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0000,0.0,0.0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
2,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0000,1.0,2.0,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1.0,2.0,113781,151.5500,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1.0,2.0,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
5,1.0,1.0,"Anderson, Mr. Harry",male,48.0000,0.0,0.0,19952,26.5500,E12,S,3,,"New York, NY"
6,1.0,1.0,"Andrews, Miss. Kornelia Theodosia",female,63.0000,1.0,0.0,13502,77.9583,D7,S,10,,"Hudson, NY"
7,1.0,0.0,"Andrews, Mr. Thomas Jr",male,39.0000,0.0,0.0,112050,0.0000,A36,S,,,"Belfast, NI"
8,1.0,1.0,"Appleton, Mrs. Edward Dale (Charlotte Lamson)",female,53.0000,2.0,0.0,11769,51.4792,C101,S,D,,"Bayside, Queens, NY"
9,1.0,0.0,"Artagaveytia, Mr. Ramon",male,71.0000,0.0,0.0,PC 17609,49.5042,,C,,22.0,"Montevideo, Uruguay"
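Since the CSV path above depends on the local directory layout, here is a self-contained sketch that reads the same kind of data from an in-memory string. It reuses two rows and three columns from the output above; `io.StringIO` stands in for the file, and the `usecols` / `index_col` choices are just one possible configuration.

```python
import io
import pandas as pd

csv_text = '''name,age,cabin
"Allen, Miss. Elisabeth Walton",29.0,B5
"Artagaveytia, Mr. Ramon",71.0,
'''

# Quoted fields may contain commas; empty fields are read as NaN.
sample = pd.read_csv(io.StringIO(csv_text))

# Read only some columns and use one of them as the row index.
ages_by_name = pd.read_csv(io.StringIO(csv_text),
                           usecols=['name', 'age'],
                           index_col='name')

print(sample)
print(ages_by_name)
```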


### Anatomy of a DataFrame

A DataFrame consists of three parts:
* Row Index
* Column Names (Column Index)
* Data

The row and column labels are available through the ``index`` and ``columns`` attributes, and the underlying data through ``values``:

In [11]:
df.index

Index([u'a', u'b', u'c', u'd', u'e'], dtype='object')

In [12]:
df.columns

Index([u'simulation1', u'simulation2', u'simulation3'], dtype='object')

In [13]:
df.values

array([[1, 1, 1],
       [0, 0, 1],
       [1, 1, 1],
       [0, 0, 1],
       [1, 1, 1]])
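To make the three parts concrete, the sketch below takes a small frame apart and rebuilds it from its pieces. It uses `to_numpy()`, which recent pandas versions recommend over `values`; the two-row frame is a shortened stand-in for the simulation data above.

```python
import pandas as pd

df = pd.DataFrame([[1, 1, 1], [0, 0, 1]],
                  index=['a', 'b'],
                  columns=['simulation1', 'simulation2', 'simulation3'])

row_labels = df.index       # Index(['a', 'b'])
col_labels = df.columns     # Index of the three column names
data = df.to_numpy()        # plain 2-D NumPy array, labels stripped

# The three parts are enough to reconstruct an equal DataFrame.
rebuilt = pd.DataFrame(data, index=row_labels, columns=col_labels)

print(df.shape)
print(rebuilt.equals(df))
```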

## Column selection, addition, deletion

You can treat a DataFrame semantically like a dict of like-indexed Series objects. Getting, setting, and deleting columns works with the same syntax as the analogous dict operations:

In [14]:
titanic.columns

Index([u'pclass', u'survived', u'name', u'sex', u'age', u'sibsp', u'parch',
       u'ticket', u'fare', u'cabin', u'embarked', u'boat', u'body',
       u'home.dest'],
      dtype='object')

In [17]:
#titanic['name']

In [18]:
del titanic['ticket']
titanic.columns

Index([u'pclass', u'survived', u'name', u'sex', u'age', u'sibsp', u'parch',
       u'fare', u'cabin', u'embarked', u'boat', u'body', u'home.dest'],
      dtype='object')

In [19]:
titanic['age_in_months'] = 12*titanic['age']

In [20]:
titanic.columns

Index([u'pclass', u'survived', u'name', u'sex', u'age', u'sibsp', u'parch',
       u'fare', u'cabin', u'embarked', u'boat', u'body', u'home.dest',
       u'age_in_months'],
      dtype='object')
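The dict analogy extends a little further than `[]` and `del`. As a sketch on a small hypothetical frame (`people` and its values are made up for illustration), membership tests, `pop`, and positional `insert` behave like this:

```python
import pandas as pd

people = pd.DataFrame({'name': ['Ann', 'Bob'], 'age': [2.0, 30.0]})

# Setting a new key appends a column at the end.
people['age_in_months'] = 12 * people['age']

# `in` tests column labels, like keys of a dict.
has_age = 'age' in people

# pop removes a column and returns it as a Series.
ages = people.pop('age')

# insert places a new column at a chosen position instead of the end.
people.insert(0, 'id', [101, 102])

print(people.columns.tolist())
```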

When inserting a scalar value, it will naturally be propagated to fill the column:

In [22]:
titanic['year'] = 1909
#titanic['year']
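A small sketch of scalar propagation on a hypothetical three-row frame. It also shows `assign`, which returns a new DataFrame rather than modifying the original; that method is not used in this notebook and is included only as a possible alternative.

```python
import pandas as pd

frame = pd.DataFrame({'x': [1, 2, 3]})

# The scalar is repeated once per row.
frame['year'] = 1909

# assign returns a modified copy and leaves `frame` untouched.
flagged = frame.assign(checked=False)

print(frame)
print(flagged)
```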

You can insert raw ndarrays but their length must match the length of the DataFrame’s index.

In [23]:
len(titanic)

1310

In [25]:
titanic['rand_integer'] = np.random.randint(0,10,size=len(titanic))
#titanic['rand_integer'] = np.random.randint(0,10,size=10)
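The commented-out line above would fail because ten values cannot fill every row. The sketch below reproduces that failure on a small hypothetical frame, and contrasts it with assigning a Series, which is aligned on the index rather than matched by position.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'b', 'c'])

# A raw array of the wrong length raises ValueError.
length_error = False
try:
    frame['bad'] = np.arange(2)
except ValueError as exc:
    length_error = True
    print(exc)

# A Series is aligned by label; rows without a match become NaN.
frame['partial'] = pd.Series([10, 30], index=['a', 'c'])

print(frame)
```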

You can select many columns by passing a list of column names:

In [33]:
titanic.head(3)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,fare,embarked,boat,body,home.dest,age_in_months,year,rand_integer
1,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,151.55,S,11.0,,"Montreal, PQ / Chesterville, ON",11.0004,1909,2
2,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0,1.0,2.0,151.55,S,,,"Montreal, PQ / Chesterville, ON",24.0,1909,2
3,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1.0,2.0,151.55,S,,135.0,"Montreal, PQ / Chesterville, ON",360.0,1909,8


In [32]:
#titanic[['name','survived','sex','age']]
titanic.drop('cabin', axis=1, inplace=True)
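As a sketch of the two operations in this cell: selecting with a list of names returns a new DataFrame with just those columns, and `drop` without `inplace=True` returns a copy instead of modifying the original. The small frame reuses two passengers from the output above; the variable names are illustrative.

```python
import pandas as pd

people = pd.DataFrame({
    'name': ['Allen, Miss. Elisabeth Walton', 'Anderson, Mr. Harry'],
    'sex': ['female', 'male'],
    'age': [29.0, 48.0],
    'cabin': ['B5', 'E12'],
})

# Note the double brackets: the inner list holds the column names.
subset = people[['name', 'age']]

# Non-mutating drop: `people` keeps its 'cabin' column.
without_cabin = people.drop(columns='cabin')

print(subset)
print(without_cabin.columns.tolist())
```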

### Go to Exercises

## The basics of indexing / row selection

<table border="1" class="docutils">
<colgroup>
<col width="50%">
<col width="33%">
<col width="17%">
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Operation</th>
<th class="head">Syntax</th>
<th class="head">Result</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td>Select column</td>
<td><tt class="docutils literal"><span class="pre">df[col]</span></tt></td>
<td>Series</td>
</tr>
<tr class="row-odd"><td>Select row by label</td>
<td><tt class="docutils literal"><span class="pre">df.loc[label]</span></tt></td>
<td>Series</td>
</tr>
<tr class="row-even"><td>Select row by integer location</td>
<td><tt class="docutils literal"><span class="pre">df.iloc[loc]</span></tt></td>
<td>Series</td>
</tr>
<tr class="row-odd"><td>Slice rows</td>
<td><tt class="docutils literal"><span class="pre">df[5:10]</span></tt></td>
<td>DataFrame</td>
</tr>
<tr class="row-even"><td>Select rows by boolean vector</td>
<td><tt class="docutils literal"><span class="pre">df[bool_vec]</span></tt></td>
<td>DataFrame</td>
</tr>
</tbody>
</table>
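The operations in the table can be sketched on a small frame with the same labels as the simulation example. The values here come from `np.arange` so that each result is easy to check by eye; they are an assumption for illustration and differ from the random data used earlier.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(15).reshape(5, 3),
                  index=list('abcde'),
                  columns=['simulation1', 'simulation2', 'simulation3'])

col = df['simulation1']               # select column -> Series
row_b = df.loc['b']                   # select row by label -> Series
second = df.iloc[1]                   # select row by position -> Series
rows_2_3 = df[1:3]                    # slice rows -> DataFrame (rows b, c)
mask = df['simulation1'] > 3          # boolean Series, one entry per row
filtered = df[mask]                   # rows where mask is True -> DataFrame

value = df.loc['b', 'simulation2']    # single value by labels
cell = df.iat[0, 2]                   # single value by integer position

print(rows_2_3)
print(filtered)
```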

In [None]:
df

In [None]:
# Select index b


In [None]:
# select the second row


In [None]:
# select rows 2 and 3


In [None]:
# Selecting a single value


In [None]:
# Selecting two columns


In [None]:
# Selecting with a boolean series


In [None]:
# Cell by position 
