In [1]:
import numpy as np
import pandas as pd

# Object Creation

Creating a `Series` by passing a list of values, letting pandas create a default integer index:

In [4]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
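
The `float64` dtype in the output comes from the `np.nan` entry: NaN is a floating-point value, so the integers are upcast. A minimal sketch of that behaviour, with illustrative variable names that are not part of the original notebook:

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 3, 5])            # no missing values: integer dtype is kept
mixed = pd.Series([1, 3, 5, np.nan])   # NaN forces an upcast to float64

print(ints.dtype)
print(mixed.dtype)
print(mixed.isna().sum())              # count of missing entries
```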

Creating a `DataFrame` by passing a NumPy array, with a datetime index and labeled columns:

In [6]:
dates = pd.date_range('20130101', periods=6)
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [8]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates,
                  columns=['A','B', 'C', 'D'])
df

                   A         B         C         D
2013-01-01  0.716451  0.673157  1.186836  0.495079
2013-01-02 -0.213976 -1.055492 -0.879488  0.745453
2013-01-03  0.512235 -1.800123 -0.379771  1.144568
2013-01-04  0.796069  0.617605 -1.298703  0.325440
2013-01-05 -2.124140 -0.217879 -1.868163  1.009221
2013-01-06  0.919543  0.252515 -1.916941  0.193060


Creating a `DataFrame` by passing a **dict of objects** that can be converted to series-like:

In [17]:
# The dict keys become the column names.
# A scalar passed to pd.Series is repeated for every index label,
# so `index` is what sets the length of column C here.
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(['test', 'train', 'test', 'train']),
                    'F': 'foo'})
df2

     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
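
Scalars in the dict are broadcast to the length set by the array-like entries. The following sketch shows this on a smaller frame; the column names `scalar`, `series`, and `array` are made up for illustration:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "scalar": 1.0,                                        # repeated on every row
    "series": pd.Series([10, 20, 30], index=[0, 1, 2]),   # supplies the index
    "array": np.array([7, 8, 9], dtype="int32"),          # length must match
})

print(frame)
print(frame.shape)
print(frame.dtypes)
```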


The columns of the resulting `DataFrame` have different `dtypes`.

In [19]:
# .dtypes returns the data type of each column
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

If you're using IPython, tab completion for column names (as well as public attributes) is automatically enabled. Here's a subset of the attributes that will be completed:

In [23]:
df2.<TAB>

df2.A                  df2.bool
df2.abs                df2.boxplot
df2.add                df2.C
df2.add_prefix         df2.clip
df2.add_suffix         df2.clip_lower
df2.align              df2.clip_upper
df2.all                df2.columns
df2.any                df2.combine
df2.append             df2.combine_first
df2.apply              df2.compound
df2.applymap           df2.consolidate
df2.D

0    1.0
1    1.0
2    1.0
3    1.0
Name: A, dtype: float64

As you can see, the columns `A`, `B`, `C`, and `D` are automatically tab-completed. `E` is there as well; the rest of the attributes have been truncated for brevity.

# Viewing Data

Here's how to view the top and bottom rows of the frame with `df.head()` and `df.tail()`:

In [24]:
df.head()

                   A         B         C         D
2013-01-01  0.716451  0.673157  1.186836  0.495079
2013-01-02 -0.213976 -1.055492 -0.879488  0.745453
2013-01-03  0.512235 -1.800123 -0.379771  1.144568
2013-01-04  0.796069  0.617605 -1.298703  0.325440
2013-01-05 -2.124140 -0.217879 -1.868163  1.009221


In [27]:
df.tail(3)

                   A         B         C         D
2013-01-04  0.796069  0.617605 -1.298703  0.325440
2013-01-05 -2.124140 -0.217879 -1.868163  1.009221
2013-01-06  0.919543  0.252515 -1.916941  0.193060


Display the index and the columns:

In [28]:
df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [29]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

`DataFrame.to_numpy()` gives a NumPy representation of the underlying data. Note that this can be an expensive operation when your `DataFrame` has columns with different data types, which comes down to a fundamental difference between pandas and NumPy: **NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column** (i.e. a DataFrame is made up of columns, each backed by its own array).

When you call `DataFrame.to_numpy()`, pandas will find the NumPy dtype that can hold _all_ of the dtypes in the DataFrame. This may end up being `object`, which requires casting every value to a Python object.

For `df`, our `DataFrame` of all floating-point values, `DataFrame.to_numpy()` is fast and doesn't require copying data.

In [30]:
df.to_numpy()

array([[ 0.71645067,  0.67315652,  1.18683594,  0.49507864],
       [-0.21397582, -1.05549155, -0.87948834,  0.74545296],
       [ 0.51223488, -1.80012294, -0.37977059,  1.14456771],
       [ 0.79606861,  0.61760478, -1.29870332,  0.32543979],
       [-2.12413968, -0.21787856, -1.8681629 ,  1.00922066],
       [ 0.91954347,  0.25251532, -1.91694139,  0.19305995]])

In [32]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

For `df2`, the `DataFrame` with multiple dtypes, `DataFrame.to_numpy()` is relatively expensive.

In [33]:
df2.to_numpy()

array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)

> **Note**: `DataFrame.to_numpy()` does _not_ include the index or column labels in the output.
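
The effect of the common-dtype rule can be checked directly. This is a small sketch on two hypothetical frames (`floats` and `mixed` are illustrative names, not objects from this notebook):

```python
import numpy as np
import pandas as pd

floats = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})
mixed = pd.DataFrame({"x": [1.0, 2.0], "label": ["a", "b"]})

float_array = floats.to_numpy()
mixed_array = mixed.to_numpy()

print(float_array.dtype)   # one float dtype covers both columns
print(mixed_array.dtype)   # floats and strings share no numeric dtype
print(float_array.shape)   # only the values: no index or column labels
```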

  
`describe()` shows a quick statistical summary of your data:

In [34]:
df.describe()

              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.101030 -0.255036 -0.859372  0.652137
std    1.162023  0.990965  1.161595  0.379698
min   -2.124140 -1.800123 -1.916941  0.193060
25%   -0.032423 -0.846088 -1.725798  0.367850
50%    0.614343  0.017318 -1.089096  0.620266
75%    0.776164  0.526332 -0.504700  0.943279
max    0.919543  0.673157  1.186836  1.144568
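
Each row of the summary corresponds to an ordinary reduction on the column. A minimal sketch on a made-up four-element `Series` (the name `col` is illustrative) compares a few of them:

```python
import pandas as pd

col = pd.Series([1.0, 2.0, 3.0, 4.0], name="x")
summary = col.describe()

# Each entry of the summary should match the corresponding reduction.
print(summary["count"], col.count())
print(summary["mean"], col.mean())
print(summary["std"], col.std())        # sample standard deviation (ddof=1)
print(summary["50%"], col.median())
```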


  
Transposing your data:

In [35]:
df.T

   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A    0.716451   -0.213976    0.512235    0.796069   -2.124140    0.919543
B    0.673157   -1.055492   -1.800123    0.617605   -0.217879    0.252515
C    1.186836   -0.879488   -0.379771   -1.298703   -1.868163   -1.916941
D    0.495079    0.745453    1.144568    0.325440    1.009221    0.193060


Sorting by an axis:

In [38]:
df

                   A         B         C         D
2013-01-01  0.716451  0.673157  1.186836  0.495079
2013-01-02 -0.213976 -1.055492 -0.879488  0.745453
2013-01-03  0.512235 -1.800123 -0.379771  1.144568
2013-01-04  0.796069  0.617605 -1.298703  0.325440
2013-01-05 -2.124140 -0.217879 -1.868163  1.009221
2013-01-06  0.919543  0.252515 -1.916941  0.193060


In [36]:
df.sort_index(axis=1, ascending=False)

                   D         C         B         A
2013-01-01  0.495079  1.186836  0.673157  0.716451
2013-01-02  0.745453 -0.879488 -1.055492 -0.213976
2013-01-03  1.144568 -0.379771 -1.800123  0.512235
2013-01-04  0.325440 -1.298703  0.617605  0.796069
2013-01-05  1.009221 -1.868163 -0.217879 -2.124140
2013-01-06  0.193060 -1.916941  0.252515  0.919543


Sorting by values:

In [39]:
df.sort_values(by='B')

                   A         B         C         D
2013-01-03  0.512235 -1.800123 -0.379771  1.144568
2013-01-02 -0.213976 -1.055492 -0.879488  0.745453
2013-01-05 -2.124140 -0.217879 -1.868163  1.009221
2013-01-06  0.919543  0.252515 -1.916941  0.193060
2013-01-04  0.796069  0.617605 -1.298703  0.325440
2013-01-01  0.716451  0.673157  1.186836  0.495079
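
Because `df` holds random values, the two kinds of sorting are easier to compare on a small deterministic frame. The sketch below uses made-up data; note that both methods return a new object and leave the original untouched:

```python
import pandas as pd

frame = pd.DataFrame({"b": [3, 1, 2], "a": [9, 8, 7]}, index=["z", "x", "y"])

by_rows = frame.sort_index()                          # axis=0: order rows by label
by_cols = frame.sort_index(axis=1)                    # axis=1: order columns by name
by_vals = frame.sort_values(by="b", ascending=False)  # order rows by column b

print(list(by_rows.index))
print(list(by_cols.columns))
print(list(by_vals["b"]))
print(list(frame.index))                              # original order is unchanged
```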


# Selection

## Getting

Selecting a _single_ column yields a `Series`; `df.A` is equivalent to `df['A']`:

In [40]:
df.A

2013-01-01    0.716451
2013-01-02   -0.213976
2013-01-03    0.512235
2013-01-04    0.796069
2013-01-05   -2.124140
2013-01-06    0.919543
Freq: D, Name: A, dtype: float64

In [41]:
df['A']

2013-01-01    0.716451
2013-01-02   -0.213976
2013-01-03    0.512235
2013-01-04    0.796069
2013-01-05   -2.124140
2013-01-06    0.919543
Freq: D, Name: A, dtype: float64

Selecting via `[]` with a slice, which slices the rows:

In [49]:
df[0:3]

                   A         B         C         D
2013-01-01  0.716451  0.673157  1.186836  0.495079
2013-01-02 -0.213976 -1.055492 -0.879488  0.745453
2013-01-03  0.512235 -1.800123 -0.379771  1.144568


Selecting rows and then a specific column by chaining two `[]` selections:

In [50]:
df[0:3]['A']

2013-01-01    0.716451
2013-01-02   -0.213976
2013-01-03    0.512235
Freq: D, Name: A, dtype: float64
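
For reading data, chaining two selections gives the same result as a single indexer call. The sketch below checks this on a made-up frame; when assigning values, the single-step form is generally preferred, because a chained selection may operate on a temporary copy:

```python
import pandas as pd

frame = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0], "B": [5.0, 6.0, 7.0, 8.0]})

chained = frame[0:3]["A"]        # two steps: slice rows, then pick column A
single = frame.iloc[0:3, 0]      # one step: rows 0-2 of the first column

print(chained.equals(single))
```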

## Selection by label

For getting a cross section using a label:

In [52]:
df

                   A         B         C         D
2013-01-01  0.716451  0.673157  1.186836  0.495079
2013-01-02 -0.213976 -1.055492 -0.879488  0.745453
2013-01-03  0.512235 -1.800123 -0.379771  1.144568
2013-01-04  0.796069  0.617605 -1.298703  0.325440
2013-01-05 -2.124140 -0.217879 -1.868163  1.009221
2013-01-06  0.919543  0.252515 -1.916941  0.193060


In [51]:
df.loc[[dates[0]]]

                   A         B         C         D
2013-01-01  0.716451  0.673157  1.186836  0.495079


Selecting on a multi-axis by label:

In [53]:
df.loc[:, ['A','B']]

                   A         B
2013-01-01  0.716451  0.673157
2013-01-02 -0.213976 -1.055492
2013-01-03  0.512235 -1.800123
2013-01-04  0.796069  0.617605
2013-01-05 -2.124140 -0.217879
2013-01-06  0.919543  0.252515


Label slicing includes _both_ endpoints:

In [59]:
df.loc['20130102':'20130104', ['A','B']]

                   A         B
2013-01-02 -0.213976 -1.055492
2013-01-03  0.512235 -1.800123
2013-01-04  0.796069  0.617605
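
Label slices with `.loc` include the stop label, whereas positional slices with `.iloc` exclude the stop position, as in plain Python. A minimal sketch on a made-up frame with the same kind of date index:

```python
import pandas as pd

dates = pd.date_range("20130101", periods=6)
frame = pd.DataFrame({"A": range(6)}, index=dates)

by_label = frame.loc["20130102":"20130104"]   # both endpoints included
by_position = frame.iloc[1:3]                 # stop position 3 excluded

print(len(by_label))
print(len(by_position))
```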


Reduction in the dimensions of the returned object:

In [60]:
df.loc['20130102',['A','B']]

A   -0.213976
B   -1.055492
Name: 2013-01-02 00:00:00, dtype: float64

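Selecting one row with a one-element list of columns returns a one-element `Series`; pass the bare column label instead (`df.loc[dates[0], 'A']`) to get a scalar value: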

In [63]:
df.loc[dates[0],['A']]

A    0.716451
Name: 2013-01-01 00:00:00, dtype: float64

In [64]:
df.loc['20130101',['A']]

A    0.716451
Name: 2013-01-01 00:00:00, dtype: float64
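
Whether `.loc` returns a `Series` or a scalar depends on whether the column is given as a list or as a bare label. The following sketch uses a made-up frame; `.at` is the label-based accessor for fast scalar lookups:

```python
import pandas as pd

dates = pd.date_range("20130101", periods=3)
frame = pd.DataFrame({"A": [0.5, 1.5, 2.5]}, index=dates)

as_series = frame.loc[dates[0], ["A"]]   # list of columns: one-element Series
as_scalar = frame.loc[dates[0], "A"]     # bare label: scalar
via_at = frame.at[dates[0], "A"]         # fast label-based scalar access

print(type(as_series).__name__)
print(as_scalar, via_at)
```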

## Selection by position

`df.iloc` selects by integer position, whereas `df.loc` selects by the row and column labels.

In [66]:
df

                   A         B         C         D
2013-01-01  0.716451  0.673157  1.186836  0.495079
2013-01-02 -0.213976 -1.055492 -0.879488  0.745453
2013-01-03  0.512235 -1.800123 -0.379771  1.144568
2013-01-04  0.796069  0.617605 -1.298703  0.325440
2013-01-05 -2.124140 -0.217879 -1.868163  1.009221
2013-01-06  0.919543  0.252515 -1.916941  0.193060


In [67]:
df.iloc[3]  # selects the row at integer position 3 (the fourth row)

A    0.796069
B    0.617605
C   -1.298703
D    0.325440
Name: 2013-01-04 00:00:00, dtype: float64

By integer slices, acting similarly to NumPy/Python:

In [68]:
df.iloc[3:5,0:2]

                   A         B
2013-01-04  0.796069  0.617605
2013-01-05 -2.124140 -0.217879


By lists of integer positions, similar to the NumPy/Python style:

In [69]:
df.iloc[[1,2,4],[0,2]]

                   A         C
2013-01-02 -0.213976 -0.879488
2013-01-03  0.512235 -0.379771
2013-01-05 -2.124140 -1.868163


For slicing rows explicitly:

In [70]:
df.iloc[1:3, :]

                   A         B         C         D
2013-01-02 -0.213976 -1.055492 -0.879488  0.745453
2013-01-03  0.512235 -1.800123 -0.379771  1.144568


For slicing columns explicitly:

In [71]:
df.iloc[:, 1:3]

                   B         C
2013-01-01  0.673157  1.186836
2013-01-02 -1.055492 -0.879488
2013-01-03 -1.800123 -0.379771
2013-01-04  0.617605 -1.298703
2013-01-05 -0.217879 -1.868163
2013-01-06  0.252515 -1.916941


For getting a value explicitly:

In [72]:
df.iloc[1,1]

-1.0554915462344563

For getting fast access to a scalar (equivalent to the prior method):

In [73]:
df.iat[1,1]

-1.0554915462344563
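
The equivalence of the two accessors for a single cell can be checked on a small made-up frame. This is only a sketch; `.iat` accepts scalar integer positions, while `.iloc` also accepts slices and lists:

```python
import pandas as pd

frame = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})

via_iloc = frame.iloc[1, 1]   # general positional indexer
via_iat = frame.iat[1, 1]     # scalar-only positional accessor

print(via_iloc == via_iat)
```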