## Set up the environment

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Object Creation
Creating a Series, which is a one-dimensional ndarray with axis labels. Passing a list of values lets pandas create a default integer index:

In [2]:
s=pd.Series([1,3,5,np.nan,6,8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
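
The "axis labels" part can be made visible by supplying an index explicitly. This is a minimal sketch; the letter labels and the name labeled are assumptions chosen for illustration and do not appear in the original cell.

```python
import numpy as np
import pandas as pd

# Hypothetical labels; the original cell relies on the default 0..5 index.
labeled = pd.Series([1, 3, 5, np.nan, 6, 8], index=list("abcdef"))

# Values are looked up by label rather than by integer position.
print(labeled["c"])

# The NaN entry forces a floating-point dtype, as in the output above.
print(labeled.dtype)

# isna() flags the single missing entry.
print(labeled.isna().sum())
```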

Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:

In [3]:
dates=pd.date_range('20130101',periods=6)
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

Use the generated dates as the index (row labels) for the next DataFrame:

In [5]:
df=pd.DataFrame(np.random.randn(6,4),index=dates, columns=list("ABCD"))
df

Unnamed: 0,A,B,C,D
2013-01-01,0.067085,0.307651,0.076259,-0.398497
2013-01-02,1.381004,-0.695469,-0.934102,0.305795
2013-01-03,0.207893,-0.53918,1.392446,0.271679
2013-01-04,1.181115,-1.307966,-0.358143,0.966294
2013-01-05,-0.147092,-0.835412,-0.718147,1.520685
2013-01-06,-0.64088,1.916561,1.522467,-0.30638


Create a DataFrame by passing a dictionary of objects that can be converted to something Series-like:

In [8]:
df2=pd.DataFrame({'A':1.,
                  'B':pd.Timestamp('20130102'),
                  'C':pd.Series(1,index=list(range(4)),dtype='float32'),
                  'D':np.array([3]*4,dtype='int32'),
                  'E':pd.Categorical(['test','train','test','train']),
                  'F':'foo'})
df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


In [9]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
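
In the dictionary above, the scalar entries are repeated for every row, while the array-like entries set the number of rows and keep their own dtypes. A smaller sketch of the same behaviour follows; the frame name small and its column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({
    "scalar": 1.0,                                # broadcast to every row
    "array": np.array([3] * 4, dtype="int32"),    # sets the length, keeps int32
    "label": pd.Categorical(["test", "train", "test", "train"]),
})

print(small.shape)
print(small.dtypes)
```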

## Viewing the data 

In [11]:
df.head()

Unnamed: 0,A,B,C,D
2013-01-01,0.067085,0.307651,0.076259,-0.398497
2013-01-02,1.381004,-0.695469,-0.934102,0.305795
2013-01-03,0.207893,-0.53918,1.392446,0.271679
2013-01-04,1.181115,-1.307966,-0.358143,0.966294
2013-01-05,-0.147092,-0.835412,-0.718147,1.520685


In [12]:
df2.head()

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


In [13]:
df.tail(3)

Unnamed: 0,A,B,C,D
2013-01-04,1.181115,-1.307966,-0.358143,0.966294
2013-01-05,-0.147092,-0.835412,-0.718147,1.520685
2013-01-06,-0.64088,1.916561,1.522467,-0.30638


Display the index, columns, and the underlying numpy data

In [14]:
df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [15]:
df2.index

Int64Index([0, 1, 2, 3], dtype='int64')

In [16]:
df.values

array([[ 0.06708505,  0.30765085,  0.07625925, -0.39849668],
       [ 1.38100429, -0.69546943, -0.93410228,  0.30579479],
       [ 0.20789324, -0.53918022,  1.39244574,  0.27167902],
       [ 1.18111466, -1.30796626, -0.35814268,  0.96629382],
       [-0.14709176, -0.83541153, -0.71814737,  1.52068452],
       [-0.64088026,  1.91656105,  1.52246708, -0.30637964]])
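
Recent pandas versions also offer DataFrame.to_numpy() for the same purpose as .values. The sketch below uses small made-up frames (frame and mixed are assumed names) to show that a homogeneous frame gives a numeric array, while mixed column types fall back to an object array.

```python
import numpy as np
import pandas as pd

# Homogeneous float data converts to a plain float array.
frame = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=["A", "B"])
arr = frame.to_numpy()
print(type(arr), arr.shape, arr.dtype)

# Mixed column types have no common numeric dtype, so the result is object.
mixed = pd.DataFrame({"x": [1, 2], "y": ["a", "b"]}).to_numpy()
print(mixed.dtype)
```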

Calling describe() on a DataFrame produces a quick statistical summary of each numeric column:

In [18]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,0.341521,-0.192303,0.163463,0.393263
std,0.785171,1.160091,1.060081,0.740035
min,-0.64088,-1.307966,-0.934102,-0.398497
25%,-0.093548,-0.800426,-0.628146,-0.161865
50%,0.137489,-0.617325,-0.140942,0.288737
75%,0.937809,0.095943,1.063399,0.801169
max,1.381004,1.916561,1.522467,1.520685
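
Each row of the summary corresponds to an ordinary reduction on the frame. The following sketch checks a few of them; the seeded random data and the names rng, frame and summary are assumptions, so the numbers differ from the table above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, assumed here for repeatability
frame = pd.DataFrame(rng.standard_normal((6, 4)), columns=list("ABCD"))
summary = frame.describe()

# The summary rows agree with the matching column reductions.
print(np.isclose(summary.loc["mean", "A"], frame["A"].mean()))
print(np.isclose(summary.loc["50%", "A"], frame["A"].median()))
print(np.isclose(summary.loc["std", "A"], frame["A"].std()))
```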


Transposing the data 

In [19]:
df.T

Unnamed: 0,2013-01-01 00:00:00,2013-01-02 00:00:00,2013-01-03 00:00:00,2013-01-04 00:00:00,2013-01-05 00:00:00,2013-01-06 00:00:00
A,0.067085,1.381004,0.207893,1.181115,-0.147092,-0.64088
B,0.307651,-0.695469,-0.53918,-1.307966,-0.835412,1.916561
C,0.076259,-0.934102,1.392446,-0.358143,-0.718147,1.522467
D,-0.398497,0.305795,0.271679,0.966294,1.520685,-0.30638


In [20]:
df2.T

Unnamed: 0,0,1,2,3
A,1,1,1,1
B,2013-01-02 00:00:00,2013-01-02 00:00:00,2013-01-02 00:00:00,2013-01-02 00:00:00
C,1,1,1,1
D,3,3,3,3
E,test,train,test,train
F,foo,foo,foo,foo


Sorting Operations

Sorting a DataFrame by its index (axis=0 sorts by the row labels):

In [25]:
df.sort_index(axis=0,ascending=False)

Unnamed: 0,A,B,C,D
2013-01-06,-0.64088,1.916561,1.522467,-0.30638
2013-01-05,-0.147092,-0.835412,-0.718147,1.520685
2013-01-04,1.181115,-1.307966,-0.358143,0.966294
2013-01-03,0.207893,-0.53918,1.392446,0.271679
2013-01-02,1.381004,-0.695469,-0.934102,0.305795
2013-01-01,0.067085,0.307651,0.076259,-0.398497


In [26]:
df2.sort_index(axis=0)

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


Sorting by values

In [28]:
df.sort_values(by='B',ascending=False)

Unnamed: 0,A,B,C,D
2013-01-06,-0.64088,1.916561,1.522467,-0.30638
2013-01-01,0.067085,0.307651,0.076259,-0.398497
2013-01-03,0.207893,-0.53918,1.392446,0.271679
2013-01-02,1.381004,-0.695469,-0.934102,0.305795
2013-01-05,-0.147092,-0.835412,-0.718147,1.520685
2013-01-04,1.181115,-1.307966,-0.358143,0.966294
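
sort_values also accepts a list of columns, with one sort direction per column. A small sketch with made-up data (the frame and its group/score columns are assumptions):

```python
import pandas as pd

frame = pd.DataFrame({"group": ["b", "a", "b", "a"], "score": [1, 4, 3, 2]})

# Sort by group ascending, then by score descending within each group.
ordered = frame.sort_values(by=["group", "score"], ascending=[True, False])
print(ordered)

# sort_values returns a new frame; the original row order is untouched.
print(list(frame.index))
```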


## Selection 

### Getting 

Selecting a single column, which yields a series, equivalent to df.A

In [29]:
df['A']

2013-01-01    0.067085
2013-01-02    1.381004
2013-01-03    0.207893
2013-01-04    1.181115
2013-01-05   -0.147092
2013-01-06   -0.640880
Freq: D, Name: A, dtype: float64

In [34]:
df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


In [35]:
df2['B']

0   2013-01-02
1   2013-01-02
2   2013-01-02
3   2013-01-02
Name: B, dtype: datetime64[ns]

Selecting via [] with a slice, which slices the rows:

In [36]:
df[0:3]

Unnamed: 0,A,B,C,D
2013-01-01,0.067085,0.307651,0.076259,-0.398497
2013-01-02,1.381004,-0.695469,-0.934102,0.305795
2013-01-03,0.207893,-0.53918,1.392446,0.271679


In [37]:
df[0:]

Unnamed: 0,A,B,C,D
2013-01-01,0.067085,0.307651,0.076259,-0.398497
2013-01-02,1.381004,-0.695469,-0.934102,0.305795
2013-01-03,0.207893,-0.53918,1.392446,0.271679
2013-01-04,1.181115,-1.307966,-0.358143,0.966294
2013-01-05,-0.147092,-0.835412,-0.718147,1.520685
2013-01-06,-0.64088,1.916561,1.522467,-0.30638


In [39]:
df2[0:]

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo
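
The [] operator behaves differently depending on what it is given: a column name or list of names selects columns, while a slice selects rows. A minimal sketch on a made-up frame (frame, cols and rows are assumed names):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("ABCD"))

cols = frame[["A", "C"]]  # a list of names selects columns
rows = frame[0:2]         # a slice selects rows

print(cols.shape)
print(rows.shape)
```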


### Selection by label

Getting a cross section (a single row) using a label:

In [45]:
df.loc[dates[5]]

A   -0.640880
B    1.916561
C    1.522467
D   -0.306380
Name: 2013-01-06 00:00:00, dtype: float64

In [42]:
df

Unnamed: 0,A,B,C,D
2013-01-01,0.067085,0.307651,0.076259,-0.398497
2013-01-02,1.381004,-0.695469,-0.934102,0.305795
2013-01-03,0.207893,-0.53918,1.392446,0.271679
2013-01-04,1.181115,-1.307966,-0.358143,0.966294
2013-01-05,-0.147092,-0.835412,-0.718147,1.520685
2013-01-06,-0.64088,1.916561,1.522467,-0.30638


Selecting on multiple axes by label (here all rows, columns A and B):

In [46]:
df.loc[:,['A','B']]

Unnamed: 0,A,B
2013-01-01,0.067085,0.307651
2013-01-02,1.381004,-0.695469
2013-01-03,0.207893,-0.53918
2013-01-04,1.181115,-1.307966
2013-01-05,-0.147092,-0.835412
2013-01-06,-0.64088,1.916561


In [48]:
df.loc['20130102':'20130106',["A","B"]]

Unnamed: 0,A,B
2013-01-02,1.381004,-0.695469
2013-01-03,0.207893,-0.53918
2013-01-04,1.181115,-1.307966
2013-01-05,-0.147092,-0.835412
2013-01-06,-0.64088,1.916561
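
Note that the label slice above includes both endpoints, which is why 2013-01-06 appears in the result. Positional slicing with iloc excludes the stop position instead. A short sketch of the difference, using an assumed one-column frame:

```python
import pandas as pd

dates = pd.date_range("20130101", periods=6)
frame = pd.DataFrame({"A": range(6)}, index=dates)

# Label slice: both 2013-01-02 and 2013-01-04 are included.
by_label = frame.loc["20130102":"20130104"]

# Position slice: position 1 is included, position 3 is not.
by_position = frame.iloc[1:3]

print(len(by_label), len(by_position))
```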


Reduction in the dimensions of the returned object

In [49]:
df.loc['20130102',['A','B']]

A    1.381004
B   -0.695469
Name: 2013-01-02 00:00:00, dtype: float64

In [50]:
df.loc[dates[0],'A']

0.067085045086817036
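
For a single scalar, the at accessor gives the same result as the loc call above and is intended for fast cell access. A sketch on assumed data (the frame contents are made up):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
frame = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

via_loc = frame.loc[dates[0], "A"]
via_at = frame.at[dates[0], "A"]

# Both accessors return the same scalar.
print(via_loc == via_at)
```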

### Selection by position
Selecting by integer position uses iloc, whereas selecting by row or column label uses loc.

In [51]:
df.iloc[3]

A    1.181115
B   -1.307966
C   -0.358143
D    0.966294
Name: 2013-01-04 00:00:00, dtype: float64

In [52]:
df

Unnamed: 0,A,B,C,D
2013-01-01,0.067085,0.307651,0.076259,-0.398497
2013-01-02,1.381004,-0.695469,-0.934102,0.305795
2013-01-03,0.207893,-0.53918,1.392446,0.271679
2013-01-04,1.181115,-1.307966,-0.358143,0.966294
2013-01-05,-0.147092,-0.835412,-0.718147,1.520685
2013-01-06,-0.64088,1.916561,1.522467,-0.30638


Using integer slices, which behave like NumPy/Python slicing:

In [53]:
df.iloc[3:5,0:2]

Unnamed: 0,A,B
2013-01-04,1.181115,-1.307966
2013-01-05,-0.147092,-0.835412


In [54]:
df2.iloc[1:,2:]

Unnamed: 0,C,D,E,F
1,1.0,3,train,foo
2,1.0,3,test,foo
3,1.0,3,train,foo


Slicing rows explicitly

In [55]:
df.iloc[1,:]

A    1.381004
B   -0.695469
C   -0.934102
D    0.305795
Name: 2013-01-02 00:00:00, dtype: float64

In [56]:
df

Unnamed: 0,A,B,C,D
2013-01-01,0.067085,0.307651,0.076259,-0.398497
2013-01-02,1.381004,-0.695469,-0.934102,0.305795
2013-01-03,0.207893,-0.53918,1.392446,0.271679
2013-01-04,1.181115,-1.307966,-0.358143,0.966294
2013-01-05,-0.147092,-0.835412,-0.718147,1.520685
2013-01-06,-0.64088,1.916561,1.522467,-0.30638


Selecting a column explicitly:

In [57]:
df.iloc[:,2]

2013-01-01    0.076259
2013-01-02   -0.934102
2013-01-03    1.392446
2013-01-04   -0.358143
2013-01-05   -0.718147
2013-01-06    1.522467
Freq: D, Name: C, dtype: float64
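
iloc also accepts lists of integer positions, which picks out non-contiguous rows and columns in one call. A minimal sketch on a made-up frame (frame and picked are assumed names):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(24).reshape(6, 4), columns=list("ABCD"))

# Rows at positions 1, 2 and 4; columns at positions 0 and 2.
picked = frame.iloc[[1, 2, 4], [0, 2]]
print(picked)
```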

### Boolean Indexing
Using a single column's values to select data

In [58]:
df[df.A>0]

Unnamed: 0,A,B,C,D
2013-01-01,0.067085,0.307651,0.076259,-0.398497
2013-01-02,1.381004,-0.695469,-0.934102,0.305795
2013-01-03,0.207893,-0.53918,1.392446,0.271679
2013-01-04,1.181115,-1.307966,-0.358143,0.966294


In [59]:
df[df.B>1]

Unnamed: 0,A,B,C,D
2013-01-06,-0.64088,1.916561,1.522467,-0.30638


Selecting values from a DataFrame where a boolean condition is met (entries that fail the condition become NaN):

In [60]:
df[df>0]

Unnamed: 0,A,B,C,D
2013-01-01,0.067085,0.307651,0.076259,
2013-01-02,1.381004,,,0.305795
2013-01-03,0.207893,,1.392446,0.271679
2013-01-04,1.181115,,,0.966294
2013-01-05,,,,1.520685
2013-01-06,,1.916561,1.522467,
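
Row filters on several columns can be combined with & (and) and | (or). Each comparison needs its own parentheses because these operators bind more tightly than > in Python. A sketch on assumed data:

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, -1, 2, -2], "B": [-1, -1, 3, 3]})

# Rows where both columns are positive.
both = frame[(frame["A"] > 0) & (frame["B"] > 0)]

# Rows where at least one column is positive.
either = frame[(frame["A"] > 0) | (frame["B"] > 0)]

print(list(both.index), list(either.index))
```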


Using the isin() method for filtering:

In [63]:
df2=df.copy()
df2['E']=['one', 'one','two','three','four','three']
df2

Unnamed: 0,A,B,C,D,E
2013-01-01,0.067085,0.307651,0.076259,-0.398497,one
2013-01-02,1.381004,-0.695469,-0.934102,0.305795,one
2013-01-03,0.207893,-0.53918,1.392446,0.271679,two
2013-01-04,1.181115,-1.307966,-0.358143,0.966294,three
2013-01-05,-0.147092,-0.835412,-0.718147,1.520685,four
2013-01-06,-0.64088,1.916561,1.522467,-0.30638,three


In [64]:
df2[df2['E'].isin(["two","four"])]

Unnamed: 0,A,B,C,D,E
2013-01-03,0.207893,-0.53918,1.392446,0.271679,two
2013-01-05,-0.147092,-0.835412,-0.718147,1.520685,four
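
The mask returned by isin() can be inverted with ~ to keep the rows that do not match. A short sketch reusing the same labels as column E (the frame itself is an assumed stand-in):

```python
import pandas as pd

frame = pd.DataFrame({"E": ["one", "one", "two", "three", "four", "three"]})

# Keep every row whose E value is neither "two" nor "four".
kept = frame[~frame["E"].isin(["two", "four"])]
print(kept)
```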


### Setting

Setting a new column automatically aligns the data by index. The Series below starts on 2013-01-02, one day later than the index of df:

In [68]:
s1=pd.Series([1,2,3,4,5,6],index=pd.date_range("20130102",periods=6))
s1

2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64
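
The alignment itself can be seen by assigning such a Series as a column. In this sketch the assignment is done on a separate assumed frame (frame, with a hypothetical column F) so that df stays as shown in the cells that follow.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
frame = pd.DataFrame({"A": np.zeros(6)}, index=dates)
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))

# Values are matched on the index: 2013-01-01 has no partner and becomes NaN,
# and 2013-01-07 is dropped because it is not in the frame's index.
frame["F"] = s1
print(frame)
```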

Setting values by label

In [69]:
df.at[dates[0],'A']=0

In [70]:
df

Unnamed: 0,A,B,C,D
2013-01-01,0.0,0.307651,0.076259,-0.398497
2013-01-02,1.381004,-0.695469,-0.934102,0.305795
2013-01-03,0.207893,-0.53918,1.392446,0.271679
2013-01-04,1.181115,-1.307966,-0.358143,0.966294
2013-01-05,-0.147092,-0.835412,-0.718147,1.520685
2013-01-06,-0.64088,1.916561,1.522467,-0.30638


Setting values by position within the DataFrame
Note the parallel with selection: loc slices by label and iloc slices by integer position. Likewise, at gets or sets a single cell by label, and iat does the same by integer position.

In [71]:
df.iat[0,1]

0.30765085002310805

In [72]:
df.iat[0,1]=1

In [73]:
df.loc[:,'D']=np.array([5]*len(df))

In [74]:
df

Unnamed: 0,A,B,C,D
2013-01-01,0.0,1.0,0.076259,5
2013-01-02,1.381004,-0.695469,-0.934102,5
2013-01-03,0.207893,-0.53918,1.392446,5
2013-01-04,1.181115,-1.307966,-0.358143,5
2013-01-05,-0.147092,-0.835412,-0.718147,5
2013-01-06,-0.64088,1.916561,1.522467,5


In [75]:
df2=df.copy()

In [76]:
df2[df2>0]=-df2
df2

Unnamed: 0,A,B,C,D
2013-01-01,0.0,-1.0,-0.076259,-5
2013-01-02,-1.381004,-0.695469,-0.934102,-5
2013-01-03,-0.207893,-0.53918,-1.392446,-5
2013-01-04,-1.181115,-1.307966,-0.358143,-5
2013-01-05,-0.147092,-0.835412,-0.718147,-5
2013-01-06,-0.64088,-1.916561,-1.522467,-5
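
The masked assignment above can also be written without modifying a frame in place, using where(), which keeps values where the condition holds and takes the replacement elsewhere. A sketch on assumed data showing that both forms agree:

```python
import pandas as pd

frame = pd.DataFrame({"A": [0.0, 1.5, -2.0], "B": [3.0, -4.0, 5.0]})

# In-place form, as in the cell above: flip the sign of every positive entry.
negated = frame.copy()
negated[negated > 0] = -negated

# Non-mutating form: keep entries that are <= 0, replace the rest with -frame.
alt = frame.where(frame <= 0, -frame)

print(negated.equals(alt))
```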
