Importing pandas

In [28]:
import numpy as np
import pandas as pd

Create a Series by passing a list of values, letting pandas create a default integer index

In [29]:
s = pd.Series([1,3,5,np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
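
The float64 dtype above comes from the np.nan entry: NaN is a floating-point value, so pandas stores the whole Series as floats. A minimal sketch of that behavior (the names with_nan and without_nan are illustrative and not part of the original notebook):

```python
import numpy as np
import pandas as pd

# The same kind of values, with and without a missing entry.
with_nan = pd.Series([1, 3, 5, np.nan, 6, 8])
without_nan = pd.Series([1, 3, 5, 6, 8])

print(with_nan.dtype)          # float64, because NaN is a float
print(without_nan.dtype)       # an integer dtype, nothing forces floats
print(with_nan.isna().sum())   # number of missing entries
```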

Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns

In [30]:
dates = pd.date_range("20130101", periods = 6)
dates
df = pd.DataFrame(np.random.randn(6,4), index = dates, columns = list("ABCD"))
df

Unnamed: 0,A,B,C,D
2013-01-01,-0.738126,-0.907267,0.54608,-1.841121
2013-01-02,0.118038,-0.037458,0.117986,0.173869
2013-01-03,1.826167,-0.209789,-0.890959,1.54978
2013-01-04,1.369127,0.902474,0.987785,0.402184
2013-01-05,1.014155,-1.21348,0.345806,0.083404
2013-01-06,-1.673284,2.301981,-1.350806,0.109042
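
Because np.random.randn draws new numbers on every run, the values above change each time the cell is executed. One way to make the frame reproducible, assuming a seeded NumPy Generator is acceptable (the seed 0 and the name rng are arbitrary choices, not part of the original notebook):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
dates = pd.date_range("20130101", periods=6)

# standard_normal((6, 4)) draws from the same distribution as randn(6, 4)
df = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))
print(df.shape)
```

Re-running this cell with the same seed produces the same DataFrame.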


Creating a DataFrame by passing a dict of objects that can be converted into a series-like structure

In [31]:
df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1,index=list(range(4)), dtype = "float32"),
        "D": np.array([3]*4, dtype = "int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo"
    }
)
df2


Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


The columns of the resulting DataFrame have different dtypes (datatypes)

In [32]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
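
Since each column carries its own dtype, columns can also be picked by dtype. A small sketch using DataFrame.select_dtypes on the same df2 definition as above (the names numeric and categorical are illustrative):

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)

# Keep only the columns whose dtype is numeric (floats and ints here).
numeric = df2.select_dtypes(include="number")
print(list(numeric.columns))      # ['A', 'C', 'D']

# Keep only the categorical columns.
categorical = df2.select_dtypes(include="category")
print(list(categorical.columns))  # ['E']
```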

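Access columns as attributes. A notebook cell displays only the value of its last expression, so only df2.F is shown below.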

In [33]:
df2.A
df2.B
df2.C
df2.D
df2.E
df2.F

0    foo
1    foo
2    foo
3    foo
Name: F, dtype: object
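
Attribute access is convenient, but it only works when the column name is a valid Python identifier that does not collide with an existing DataFrame attribute. A sketch with a hypothetical frame (the column names "count" and "unit price" are made up for illustration):

```python
import pandas as pd

# Hypothetical frame: one column shares its name with a DataFrame method,
# the other has a name that is not a valid Python identifier.
frame = pd.DataFrame({"count": [1, 2, 3], "unit price": [0.5, 1.5, 2.5]})

print(callable(frame.count))          # True: the attribute is the method
print(frame["count"].tolist())        # [1, 2, 3]: brackets find the column
print(frame["unit price"].tolist())   # [0.5, 1.5, 2.5]
```

Bracket access works for any column name, which makes it the safer choice in scripts.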

Viewing Data

Here is how to view the top and bottom rows of the frame

In [36]:
df.head()

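Unnamed: 0,A,B,C,D
2013-01-01,-0.738126,-0.907267,0.54608,-1.841121
2013-01-02,0.118038,-0.037458,0.117986,0.173869
2013-01-03,1.826167,-0.209789,-0.890959,1.54978
2013-01-04,1.369127,0.902474,0.987785,0.402184
2013-01-05,1.014155,-1.21348,0.345806,0.083404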


In [None]:
df.tail(3)

Display the index and the columns

In [39]:
df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [40]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

DataFrame.to_numpy() gives a NumPy representation of the underlying data. Note that this can be an expensive operation when your DataFrame has columns with different data types, which comes down to a fundamental difference between pandas and NumPy: NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column. When you call DataFrame.to_numpy(), pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame. This may end up being object, which requires casting every value to a Python object.

For df, our DataFrame of all floating-point values, DataFrame.to_numpy() is fast and doesn't require copying data.

DataFrame.to_numpy() does not include the index or column labels in the output.

In [45]:
df.to_numpy()

array([[-0.73812566, -0.9072671 ,  0.5460799 , -1.8411208 ],
       [ 0.11803808, -0.03745798,  0.11798629,  0.17386897],
       [ 1.82616728, -0.20978883, -0.89095871,  1.54977957],
       [ 1.36912731,  0.90247367,  0.98778516,  0.40218436],
       [ 1.01415511, -1.21348035,  0.34580553,  0.08340405],
       [-1.67328399,  2.30198058, -1.3508061 ,  0.10904226]])

In [46]:
df2.to_numpy()

array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)
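
The dtype of the resulting array shows which case applies. A minimal sketch with two small hypothetical frames (homogeneous and mixed are illustrative names, not from the original notebook):

```python
import numpy as np
import pandas as pd

homogeneous = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})
mixed = pd.DataFrame({"x": [1.0, 2.0], "label": ["a", "b"]})

print(homogeneous.to_numpy().dtype)  # float64: one dtype fits every column
print(mixed.to_numpy().dtype)        # object: floats and strings share no numeric dtype

# Converting only the numeric columns avoids the object array.
numeric_only = mixed.select_dtypes(include="number").to_numpy()
print(numeric_only.dtype)            # float64
```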

describe() shows a quick statistical summary of your data:

In [49]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,0.319346,0.13941,-0.040685,0.079526
std,1.340497,1.291359,0.896212,1.092193
min,-1.673284,-1.21348,-1.350806,-1.841121
25%,-0.524085,-0.732898,-0.638722,0.089814
50%,0.566097,-0.123623,0.231896,0.141456
75%,1.280384,0.667491,0.496011,0.345106
max,1.826167,2.301981,0.987785,1.54978


Here describe() summarized only the numeric columns of df2 (A, C, and D). Which columns are covered by default depends on the pandas version; pass include="all" to request a summary of every column.

In [50]:
df2.describe()

Unnamed: 0,A,C,D
count,4.0,4.0,4.0
mean,1.0,1.0,3.0
std,0.0,0.0,0.0
min,1.0,1.0,3.0
25%,1.0,1.0,3.0
50%,1.0,1.0,3.0
75%,1.0,1.0,3.0
max,1.0,1.0,3.0
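
The include parameter of describe() controls which dtypes are summarized. A sketch on a small hypothetical frame that reuses the D and E columns of df2 (the names default_summary and category_summary are illustrative):

```python
import pandas as pd

frame = pd.DataFrame(
    {
        "D": [3, 3, 3, 3],
        "E": pd.Categorical(["test", "train", "test", "train"]),
    }
)

# Default: the categorical column is left out.
default_summary = frame.describe()
print(list(default_summary.columns))  # ['D']

# Asking for categorical columns gives count, unique, top, and freq instead.
category_summary = frame.describe(include=["category"])
print(category_summary)
```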


Transposing data

In [51]:
df.T

Unnamed: 0,2013-01-01,2013-01-02,2013-01-03,2013-01-04,2013-01-05,2013-01-06
A,-0.738126,0.118038,1.826167,1.369127,1.014155,-1.673284
B,-0.907267,-0.037458,-0.209789,0.902474,-1.21348,2.301981
C,0.54608,0.117986,-0.890959,0.987785,0.345806,-1.350806
D,-1.841121,0.173869,1.54978,0.402184,0.083404,0.109042


Sorting by an axis

In [63]:
df.sort_index(axis = 1, ascending=True)

Unnamed: 0,A,B,C,D
2013-01-01,-0.738126,-0.907267,0.54608,-1.841121
2013-01-02,0.118038,-0.037458,0.117986,0.173869
2013-01-03,1.826167,-0.209789,-0.890959,1.54978
2013-01-04,1.369127,0.902474,0.987785,0.402184
2013-01-05,1.014155,-1.21348,0.345806,0.083404
2013-01-06,-1.673284,2.301981,-1.350806,0.109042


Sorting by values

In [57]:
df.sort_values(by = "B")

Unnamed: 0,A,B,C,D
2013-01-05,1.014155,-1.21348,0.345806,0.083404
2013-01-01,-0.738126,-0.907267,0.54608,-1.841121
2013-01-03,1.826167,-0.209789,-0.890959,1.54978
2013-01-02,0.118038,-0.037458,0.117986,0.173869
2013-01-04,1.369127,0.902474,0.987785,0.402184
2013-01-06,-1.673284,2.301981,-1.350806,0.109042
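
Both sorting methods accept ascending=False and return a new object rather than modifying the original. A minimal sketch on a small hypothetical frame (the values and index labels are made up for illustration):

```python
import pandas as pd

frame = pd.DataFrame(
    {"A": [2, 1, 3], "B": [0.5, -1.0, 0.0]},
    index=["x", "y", "z"],
)

# axis=1 sorts by the column labels; ascending=False reverses the order.
by_columns_desc = frame.sort_index(axis=1, ascending=False)
print(list(by_columns_desc.columns))  # ['B', 'A']

# Sort the rows by the values in column B, largest first.
by_b_desc = frame.sort_values(by="B", ascending=False)
print(list(by_b_desc.index))          # ['x', 'z', 'y']

# The original frame keeps its row order.
print(list(frame.index))              # ['x', 'y', 'z']
```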


Selection

While standard Python/NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code we recommend the optimized pandas data access methods: .at, .iat, .loc, and .iloc.

Getting

Selecting a single column, which yields a Series, equivalent to df.A

In [66]:
df["A"]

2013-01-01   -0.738126
2013-01-02    0.118038
2013-01-03    1.826167
2013-01-04    1.369127
2013-01-05    1.014155
2013-01-06   -1.673284
Freq: D, Name: A, dtype: float64

Selecting via [], which slices the rows

In [68]:
df[0:3]

Unnamed: 0,A,B,C,D
2013-01-01,-0.738126,-0.907267,0.54608,-1.841121
2013-01-02,0.118038,-0.037458,0.117986,0.173869
2013-01-03,1.826167,-0.209789,-0.890959,1.54978


In [86]:
df["20130102":"20130104"]

Unnamed: 0,A,B,C,D
2013-01-02,0.118038,-0.037458,0.117986,0.173869
2013-01-03,1.826167,-0.209789,-0.890959,1.54978
2013-01-04,1.369127,0.902474,0.987785,0.402184
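
The two row slices above correspond to the explicit accessors: an integer slice behaves like .iloc and a date-string slice behaves like .loc. A sketch that checks this, assuming a stand-in frame filled with np.arange values instead of the random data used in this notebook:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Stand-in data: 0.0 .. 23.0 instead of random numbers, so results are stable.
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

same_as_iloc = df[0:3].equals(df.iloc[0:3])
same_as_loc = df["20130102":"20130104"].equals(df.loc["20130102":"20130104"])

print(same_as_iloc)  # True
print(same_as_loc)   # True
```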


Selection by Label

For getting a cross section using a label:

In [71]:
df.loc[dates[0]]

A   -0.738126
B   -0.907267
C    0.546080
D   -1.841121
Name: 2013-01-01 00:00:00, dtype: float64

Selecting on a multi-axis by label. The single colon in the row position is a full slice: it selects every row, while the list selects columns A and B.

In [81]:
df.loc[:,["A", "B"]]

Unnamed: 0,A,B
2013-01-01,-0.738126,-0.907267
2013-01-02,0.118038,-0.037458
2013-01-03,1.826167,-0.209789
2013-01-04,1.369127,0.902474
2013-01-05,1.014155,-1.21348
2013-01-06,-1.673284,2.301981
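
A bare colon inside .loc is shorthand for slice(None), meaning "all labels on this axis". A short sketch, assuming a stand-in frame built from np.arange rather than the random data above:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

with_colon = df.loc[:, ["A", "B"]]
with_slice = df.loc[slice(None), ["A", "B"]]  # what the colon stands for
columns_only = df[["A", "B"]]                 # same selection without .loc

print(with_colon.equals(with_slice))    # True
print(with_colon.equals(columns_only))  # True
print(with_colon.shape)                 # (6, 2): every row, two columns
```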


Label slicing includes both endpoints

In [87]:
df.loc["20130102":"20130104", ["A", "B"]]

Unnamed: 0,A,B
2013-01-02,0.118038,-0.037458
2013-01-03,1.826167,-0.209789
2013-01-04,1.369127,0.902474
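
This differs from positional slicing, which follows the usual Python convention of excluding the end position. A sketch of the contrast, assuming a stand-in frame built from np.arange:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

by_label = df.loc["20130102":"20130104"]  # end label is included
by_position = df.iloc[1:3]                # end position 3 is excluded

print(len(by_label))     # 3 rows: Jan 2, Jan 3, Jan 4
print(len(by_position))  # 2 rows: Jan 2, Jan 3
```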


Reduction in the dimensions of the returned object

In [97]:
df.loc["20130102", ["A", "B"]]

A    0.118038
B   -0.037458
Name: 2013-01-02 00:00:00, dtype: float64
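
Passing a single row label drops the row dimension and yields a Series. Wrapping the label in a list keeps the result two-dimensional. A sketch of both forms, assuming a stand-in frame built from np.arange:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

as_series = df.loc[dates[1], ["A", "B"]]   # scalar row label: Series
as_frame = df.loc[[dates[1]], ["A", "B"]]  # list of row labels: DataFrame

print(type(as_series).__name__, as_series.shape)  # Series (2,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (1, 2)
```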

For getting a scalar value, pass a single label for both the row and the column

In [98]:
df.loc[dates[0], "A"]

-0.7381256634658458

For getting fast access to a scalar (equivalent to the prior method).
.at is intended for a single row label and a single column label, that is, for one scalar, whereas .loc also accepts lists, slices, and boolean masks and can return a Series or DataFrame.

In [99]:
df.at[dates[0], "A"]

-0.7381256634658458
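
A rough way to compare the two accessors is to time repeated scalar lookups. This is only a sketch: it assumes a stand-in frame built from np.arange, the repetition count is arbitrary, and the timings depend on the machine and the pandas version, so no output is shown.

```python
import timeit

import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

value_at = df.at[dates[0], "A"]
value_loc = df.loc[dates[0], "A"]
print(value_at == value_loc)  # True: both return the same scalar

# Time 2000 lookups with each accessor (arbitrary repetition count).
at_seconds = timeit.timeit(lambda: df.at[dates[0], "A"], number=2000)
loc_seconds = timeit.timeit(lambda: df.loc[dates[0], "A"], number=2000)
print(f".at:  {at_seconds:.4f} s")
print(f".loc: {loc_seconds:.4f} s")
```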

Selecting by position

Select via the position of the passed integers (makes things so much easier...)

In [100]:
df.iloc[3]

A    1.369127
B    0.902474
C    0.987785
D    0.402184
Name: 2013-01-04 00:00:00, dtype: float64

By lists of integer positions, similar to NumPy/Python style

In [101]:
df.iloc[[1,2,4], [0,2]]

Unnamed: 0,A,C
2013-01-02,0.118038,0.117986
2013-01-03,1.826167,-0.890959
2013-01-05,1.014155,0.345806


By integer slices, acting similar to NumPy/Python

In [102]:
df.iloc[3:5, 0:2]

Unnamed: 0,A,B
2013-01-04,1.369127,0.902474
2013-01-05,1.014155,-1.21348


For slicing rows explicitly

In [108]:
df.iloc[1:3, :]

Unnamed: 0,A,B,C,D
2013-01-02,0.118038,-0.037458,0.117986,0.173869
2013-01-03,1.826167,-0.209789,-0.890959,1.54978


For slicing columns explicitly

In [109]:
df.iloc[:, 1:3]

Unnamed: 0,B,C
2013-01-01,-0.907267,0.54608
2013-01-02,-0.037458,0.117986
2013-01-03,-0.209789,-0.890959
2013-01-04,0.902474,0.987785
2013-01-05,-1.21348,0.345806
2013-01-06,2.301981,-1.350806


For getting a value explicitly (scalar)

In [110]:
df.iloc[1, 1]

-0.03745797983251126

For getting fast access to a scalar (equivalent to the prior method)

In [None]:
df.iat[1,1]
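
.iat addresses a cell by integer position, just as .at addresses it by label, so all of the scalar accessors agree on the same cell. A sketch that checks this, assuming a stand-in frame built from np.arange so the value is predictable:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

by_iat = df.iat[1, 1]            # row position 1, column position 1
by_iloc = df.iloc[1, 1]          # same cell through .iloc
by_at = df.at[dates[1], "B"]     # same cell addressed by labels

print(by_iat)                    # 5.0
print(by_iat == by_iloc == by_at)  # True
```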