# pandas DataFrame

The DataFrame sits at the heart of data science in Python. It is the pandas representation of a table of data, and a solid understanding of it will help with nearly everything else you learn about data analysis in Python.

In [1]:
%matplotlib inline
import sys
print(sys.version)
import numpy as np
print(np.__version__)
import pandas as pd
print(pd.__version__)
import matplotlib.pyplot as plt

3.4.3 |Anaconda 2.3.0 (x86_64)| (default, Mar  6 2015, 12:07:41) 
[GCC 4.2.1 (Apple Inc. build 5577)]
1.9.2
0.16.2


In [12]:
import string
upcase = list(string.ascii_uppercase)
lcase = list(string.ascii_lowercase)

In [13]:
print(upcase[:5], lcase[:5])

['A', 'B', 'C', 'D', 'E'] ['a', 'b', 'c', 'd', 'e']


You can create DataFrames by passing in NumPy arrays, lists of Series, or dictionaries. Notice that when we pass a list of lists, each inner list is interpreted as a row.

In [14]:
pd.DataFrame([upcase, lcase])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,16,17,18,19,20,21,22,23,24,25
0,A,B,C,D,E,F,G,H,I,J,...,Q,R,S,T,U,V,W,X,Y,Z
1,a,b,c,d,e,f,g,h,i,j,...,q,r,s,t,u,v,w,x,y,z
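
The other two row-oriented inputs mentioned above behave the same way. The following is a minimal sketch with made-up data and column names (`x`, `y`, `z`, `a`, `b` are illustrative only), showing a 2-D NumPy array and a list of Series passed to the constructor.

```python
import numpy as np
import pandas as pd

# A 2-D NumPy array: each row of the array becomes a DataFrame row.
arr = np.arange(6).reshape(2, 3)
from_array = pd.DataFrame(arr, columns=['x', 'y', 'z'])  # column names are illustrative

# A list of Series: each Series becomes a row, and the Series index
# supplies the column labels.
s1 = pd.Series([1, 2], index=['a', 'b'])
s2 = pd.Series([3, 4], index=['a', 'b'])
from_series = pd.DataFrame([s1, s2])

print(from_array.shape)   # (2, 3)
print(from_series.shape)  # (2, 2)
```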


Although we have a lot of intricacies to cover, you can start thinking of a DataFrame as just a spreadsheet with columns and rows.

In more specific pandas language, a DataFrame is a collection of Series that share a common index.

Each column is a pandas Series of data.  We can operate on all these Series as a group when they are in a DataFrame.
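
To make this concrete, here is a small sketch using a toy DataFrame (the column names `a` and `b` are placeholders, not part of the lesson's data). Selecting one column gives back a Series, while arithmetic and aggregations on the DataFrame act on every column at once.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

col = df['a']              # selecting one column returns a Series
print(type(col).__name__)  # Series

doubled = df * 2           # arithmetic is applied to every column at once
col_sums = df.sum()        # one result per column, returned as a Series
print(col_sums)
```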

We saw that if we construct a DataFrame from a list of lists, each list becomes a row. More commonly, we might want each list to be a column.  If this is the case, one solution is to transpose our DataFrame.

In [15]:
pd.DataFrame([upcase, lcase]).T

Unnamed: 0,0,1
0,A,a
1,B,b
2,C,c
3,D,d
4,E,e
5,F,f
6,G,g
7,H,h
8,I,i
9,J,j


This should be familiar because it is the same way that we transpose ndarrays in NumPy.

A cleaner solution when we want our lists to be columns is to pass a dictionary into our DataFrame constructor. When we do this, the keys become column names and the values become the data in each column.


In [16]:
letters = pd.DataFrame({'lowercase':lcase, 'uppercase':upcase})
letters.head()

Unnamed: 0,lowercase,uppercase
0,a,A
1,b,B
2,c,C
3,d,D
4,e,E


If the lists are not all the same length, the constructor raises a `ValueError`. It is worth exploring your data to make sure that it is clean before using it to create a DataFrame.

In [17]:
pd.DataFrame({'lowercase':lcase + [0], 'uppercase':upcase})

ValueError: arrays must all be same length
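
Two possible ways to deal with this are sketched below; neither is the only option. The first simply checks the lengths before constructing anything. The second wraps each list in a Series, in which case pandas aligns the columns on the index and fills the missing position with `NaN`. The variable names `columns`, `lengths`, and `padded` are introduced here for illustration.

```python
import string
import pandas as pd

upcase = list(string.ascii_uppercase)
lcase = list(string.ascii_lowercase) + [0]   # one element too many

columns = {'lowercase': lcase, 'uppercase': upcase}

# Option 1: check the lengths up front so a mismatch is easy to spot.
lengths = {name: len(values) for name, values in columns.items()}
print(lengths)

# Option 2: wrap each list in a Series. pandas aligns the Series on their
# index and fills positions that have no value with NaN.
padded = pd.DataFrame({name: pd.Series(values) for name, values in columns.items()})
print(padded.tail(2))
```

Whether padding with `NaN` is appropriate depends on the data; often a length mismatch signals a problem upstream that is better fixed at the source.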

In [8]:
letters.head()

Unnamed: 0,lowercase,uppercase
0,a,A
1,b,B
2,c,C
3,d,D
4,e,E


We can rename the columns by assigning a new list to `columns`, and we can add a new column through a simple dictionary-like assignment.

In [9]:
letters.columns = ['LowerCase','UpperCase']

In [10]:
np.random.seed(25)
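# Integers from 1 to 50 inclusive; randint excludes its upper bound.
# (np.random.random_integers is deprecated in later NumPy releases.)
letters['Number'] = np.random.randint(1, 51, 26)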

In [11]:
letters

Unnamed: 0,LowerCase,UpperCase,Number
0,a,A,5
1,b,B,27
2,c,C,16
3,d,D,24
4,e,E,45
5,f,F,9
6,g,G,29
7,h,H,5
8,i,I,26
9,j,J,32
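
As an alternative to overwriting `columns` wholesale, `rename` accepts a mapping from old names to new ones, which avoids depending on column order. The sketch below uses a three-row stand-in for `letters`, and the `Position` column is a made-up example.

```python
import pandas as pd

df = pd.DataFrame({'lowercase': ['a', 'b', 'c'],
                   'uppercase': ['A', 'B', 'C']})

# rename() takes an old-name -> new-name mapping and returns a new
# DataFrame; the original is left untouched.
renamed = df.rename(columns={'lowercase': 'LowerCase',
                             'uppercase': 'UpperCase'})

# Dictionary-style assignment adds a column in place.
renamed['Position'] = [1, 2, 3]   # 'Position' is an illustrative column

print(renamed.columns.tolist())
```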


Just like Series, DataFrames have data types, one per column. We can inspect them through the `dtypes` attribute of the DataFrame.

In [12]:
letters.dtypes

LowerCase    object
UpperCase    object
Number        int64
dtype: object
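
A column's type can also be changed. The following sketch, on a toy DataFrame with hypothetical column names, uses `astype` to get a converted copy of one column.

```python
import pandas as pd

df = pd.DataFrame({'letter': ['a', 'b'], 'number': [1, 2]})
print(df.dtypes)

# astype returns a converted copy; assign it back to keep the change.
as_float = df['number'].astype(float)
print(as_float.dtype)   # float64
```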

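The index can be reassigned as well. Here we replace the default integer index with the lowercase letters.

In [13]: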
letters.index = lcase
letters

Unnamed: 0,LowerCase,UpperCase,Number
a,a,A,5
b,b,B,27
c,c,C,16
d,d,D,24
e,e,E,45
f,f,F,9
g,g,G,29
h,h,H,5
i,i,I,26
j,j,J,32


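We can sort a DataFrame by a specific column or, when called with no arguments, by the index. Note that `DataFrame.sort` belongs to the pandas version used here (0.16.2); later releases deprecated and then removed it in favor of `sort_values` and `sort_index`.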

In [14]:
letters.sort('Number')

Unnamed: 0,LowerCase,UpperCase,Number
t,t,T,2
l,l,L,2
s,s,S,4
p,p,P,4
n,n,N,4
a,a,A,5
h,h,H,5
k,k,K,6
f,f,F,9
y,y,Y,10


In [15]:
letters.sort()

Unnamed: 0,LowerCase,UpperCase,Number
a,a,A,5
b,b,B,27
c,c,C,16
d,d,D,24
e,e,E,45
f,f,F,9
g,g,G,29
h,h,H,5
i,i,I,26
j,j,J,32
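
For readers on a newer pandas release (0.17 or later), the two sorts above can be written as in the sketch below. It uses a small stand-in DataFrame rather than `letters`, so the values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Number': [5, 27, 16]}, index=['c', 'a', 'b'])

by_column = df.sort_values('Number')                        # ascending by column
by_column_desc = df.sort_values('Number', ascending=False)  # descending by column
by_index = df.sort_index()                                  # ascending by index label

print(by_column)
print(by_index)
```

Both methods return a new DataFrame and leave the original unsorted.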


We have seen how to query a DataFrame for a single column. To retrieve several columns, pass a list of column names inside the brackets.

Here we get the upper- and lowercase columns.

In [16]:
letters[['LowerCase','UpperCase']].head()

Unnamed: 0,LowerCase,UpperCase
a,a,A
b,b,B
c,c,C
d,d,D
e,e,E
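
The difference between single and double brackets is easy to miss, so here is a brief sketch on a two-row stand-in for `letters`. A single name returns a Series; a list of names returns a DataFrame, even when the list has only one entry.

```python
import pandas as pd

df = pd.DataFrame({'LowerCase': ['a', 'b'],
                   'UpperCase': ['A', 'B'],
                   'Number': [5, 27]})

one = df['Number']                     # a single name returns a Series
one_as_frame = df[['Number']]          # a one-item list returns a DataFrame
two = df[['UpperCase', 'LowerCase']]   # columns come back in the order requested

print(type(one).__name__, type(one_as_frame).__name__)
print(two.columns.tolist())
```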


We can also just query for specific rows using the index. A lot of what we learned for Series translates directly to DataFrames.

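We can query by integer position with `iloc` or by index label, which here means the letters. Note that a positional slice excludes its end, while a label slice includes it.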

In [17]:
letters.iloc[5:10]

Unnamed: 0,LowerCase,UpperCase,Number
f,f,F,9
g,g,G,29
h,h,H,5
i,i,I,26
j,j,J,32


In [18]:
letters["f":"k"]

Unnamed: 0,LowerCase,UpperCase,Number
f,f,F,9
g,g,G,29
h,h,H,5
i,i,I,26
j,j,J,32
k,k,K,6
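
The label-based counterpart of `iloc` is `loc`, which makes the intent explicit and also accepts a column selection. The sketch below uses a four-row toy DataFrame to contrast the two slicing rules.

```python
import pandas as pd

df = pd.DataFrame({'Number': [5, 27, 16, 24]}, index=list('abcd'))

by_position = df.iloc[1:3]     # positions 1 and 2; the end position is excluded
by_label = df.loc['b':'d']     # labels 'b' through 'd'; the end label is included
cell = df.loc['c', 'Number']   # row label and column name together

print(by_position.index.tolist())  # ['b', 'c']
print(by_label.index.tolist())     # ['b', 'c', 'd']
print(cell)                        # 16
```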


As you can see, these operations are very similar to the ones we used for Series.

You should be starting to become familiar with the way that DataFrames work.  DataFrames are really a foundation for doing data analysis in Python.  Although these lessons have used fabricated data, we have covered a lot of the methods you will need for analyzing real data.

Next, we will dive into our first data set.