### DataFrame Methods

Like the Series object, the DataFrame object in pandas comes with a rich set of built-in methods.

In [1]:
import pandas as pd
from io import StringIO

data = StringIO(
    '''UPC,Units,Sales,Date
    1234,5,20.2,1-1-2014
    1234,2,8.,1-2-2014
    1234,3,13.,1-3-2014
    789,1,2.,1-1-2014
    789,2,3.8,1-2-2014
    789,,,1-3-2014
    789,1,1.8,1-5-2014''')
sales = pd.read_csv(data)

sales

Unnamed: 0,UPC,Units,Sales,Date
0,1234,5.0,20.2,1-1-2014
1,1234,2.0,8.0,1-2-2014
2,1234,3.0,13.0,1-3-2014
3,789,1.0,2.0,1-1-2014
4,789,2.0,3.8,1-2-2014
5,789,,,1-3-2014
6,789,1.0,1.8,1-5-2014


### DataFrame Attributes

As we know, we can access the axes of a DataFrame via the `.axes` attribute:

In [2]:
sales.axes

[RangeIndex(start=0, stop=7, step=1),
 Index(['UPC', 'Units', 'Sales', 'Date'], dtype='object')]

The two axes, the index and the columns, can be accessed through their respective attributes:

In [3]:
sales.index

RangeIndex(start=0, stop=7, step=1)

In [4]:
sales.columns

Index(['UPC', 'Units', 'Sales', 'Date'], dtype='object')

The dimensions of the DataFrame (number of rows, number of columns) are available through the `.shape` attribute:

In [5]:
sales.shape

(7, 4)

The `.info()` method prints basic information about the DataFrame object:

In [6]:
sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   UPC     7 non-null      int64  
 1   Units   6 non-null      float64
 2   Sales   6 non-null      float64
 3   Date    7 non-null      object 
dtypes: float64(2), int64(1), object(1)
memory usage: 352.0+ bytes


The `dtype` of each column matters here. Columns can be converted with methods like `.astype()` or functions like `pd.to_datetime()`.
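As a minimal sketch of such conversions, the snippet below builds a small stand-in frame (`demo` is a hypothetical name, not the `sales` object above) and converts two columns. The `format='%m-%d-%Y'` argument assumes the dates are written month-day-year, which the source data does not state explicitly.

```python
import pandas as pd

# Hypothetical stand-in for a slice of the sales data
demo = pd.DataFrame({
    'UPC': [1234, 789],
    'Units': [5.0, 1.0],
    'Date': ['1-1-2014', '1-5-2014'],
})

converted = demo.assign(
    # Treat the product code as text rather than a number
    UPC=demo['UPC'].astype(str),
    # Assumption: dates are month-day-year
    Date=pd.to_datetime(demo['Date'], format='%m-%d-%Y'),
)

print(converted.dtypes)
```

Both calls return new objects, so `demo` itself keeps its original dtypes.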

### Iteration

Although we should prefer vectorized operations when working with pandas objects, iteration is possible:

In [8]:
for column in sales:
    print(column)

UPC
Units
Sales
Date


By default, iterating over a DataFrame yields its column labels. Note that this differs from a Series, where iteration yields the values. We can make this explicit with the `.keys()` method:

In [9]:
for column in sales.keys():
    print(column)

UPC
Units
Sales
Date


To iterate over pairs of column name and the corresponding column as a Series, we again use the `.items()` method (the older `.iteritems()` spelling was removed in pandas 2.0):

In [10]:
for col, ser in sales.items():
    print(col, ser)

UPC 0    1234
1    1234
2    1234
3     789
4     789
5     789
6     789
Name: UPC, dtype: int64
Units 0    5.0
1    2.0
2    3.0
3    1.0
4    2.0
5    NaN
6    1.0
Name: Units, dtype: float64
Sales 0    20.2
1     8.0
2    13.0
3     2.0
4     3.8
5     NaN
6     1.8
Name: Sales, dtype: float64
Date 0    1-1-2014
1    1-2-2014
2    1-3-2014
3    1-1-2014
4    1-2-2014
5    1-3-2014
6    1-5-2014
Name: Date, dtype: object


We can use the `.iterrows()` method to return a tuple for every row in the format `(index value, row as Series)`:

In [11]:
for row in sales.iterrows():
    print(row)

(0, UPC          1234
Units           5
Sales        20.2
Date     1-1-2014
Name: 0, dtype: object)
(1, UPC          1234
Units           2
Sales           8
Date     1-2-2014
Name: 1, dtype: object)
(2, UPC          1234
Units           3
Sales          13
Date     1-3-2014
Name: 2, dtype: object)
(3, UPC           789
Units           1
Sales           2
Date     1-1-2014
Name: 3, dtype: object)
(4, UPC           789
Units           2
Sales         3.8
Date     1-2-2014
Name: 4, dtype: object)
(5, UPC           789
Units         NaN
Sales         NaN
Date     1-3-2014
Name: 5, dtype: object)
(6, UPC           789
Units           1
Sales         1.8
Date     1-5-2014
Name: 6, dtype: object)


Alternatively, `.itertuples()` returns a namedtuple containing the index and row values:

In [12]:
for row in sales.itertuples():
    print(row)

Pandas(Index=0, UPC=1234, Units=5.0, Sales=20.2, Date='1-1-2014')
Pandas(Index=1, UPC=1234, Units=2.0, Sales=8.0, Date='1-2-2014')
Pandas(Index=2, UPC=1234, Units=3.0, Sales=13.0, Date='1-3-2014')
Pandas(Index=3, UPC=789, Units=1.0, Sales=2.0, Date='1-1-2014')
Pandas(Index=4, UPC=789, Units=2.0, Sales=3.8, Date='1-2-2014')
Pandas(Index=5, UPC=789, Units=nan, Sales=nan, Date='1-3-2014')
Pandas(Index=6, UPC=789, Units=1.0, Sales=1.8, Date='1-5-2014')


As a side note, `namedtuple` is not a Python built-in; it lives in the standard library's `collections` module. A namedtuple behaves just like a tuple, but each value is "named" and can also be accessed by attribute:

In [14]:
from collections import namedtuple

Sales = namedtuple('Sales', 'upc,units,sales')
s = Sales(1234, 5., 20.2)

s

Sales(upc=1234, units=5.0, sales=20.2)

In [16]:
s[0]

1234

In [17]:
s.upc

1234

Another side note: calling `len()` on a DataFrame returns the number of rows, not columns:

In [18]:
len(sales)

7
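Since this section opens by recommending vectorized operations over iteration, here is a small sketch comparing the two on the same calculation. The frame `demo` and the derived "price per unit" quantity are illustrative assumptions, not part of the original data set.

```python
import pandas as pd

# Hypothetical numeric frame for illustration
demo = pd.DataFrame({'Units': [5.0, 2.0, 3.0], 'Sales': [20.2, 8.0, 13.0]})

# Vectorized: one expression over whole columns
unit_price_vec = demo['Sales'] / demo['Units']

# Iterative: the same arithmetic, one row at a time
unit_price_loop = [row.Sales / row.Units for row in demo.itertuples()]

print(unit_price_vec.tolist())
print(unit_price_loop)
```

Both approaches give the same numbers; the vectorized form is shorter and generally scales much better with the number of rows.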

### Arithmetic and Matrix Operations

Arithmetic operations are broadcast to every cell of a DataFrame. However, every cell involved must hold a numeric value (`int` or `float`); otherwise the operation will fail. One way to ensure this is to select the numeric columns via `.loc[]` or `.iloc[]`:

In [19]:
sales.loc[:, ['Sales', 'Units']] + 10

Unnamed: 0,Sales,Units
0,30.2,15.0
1,18.0,12.0
2,23.0,13.0
3,12.0,11.0
4,13.8,12.0
5,,
6,11.8,11.0
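Instead of listing the numeric columns by hand, one option is to let pandas pick them with `.select_dtypes()`. The following is a sketch on a hypothetical frame `demo`; note that it also selects the integer `UPC` column, which may or may not be what you want.

```python
import pandas as pd

# Hypothetical frame mixing numeric and text columns
demo = pd.DataFrame({
    'UPC': [1234, 789],
    'Units': [5.0, 1.0],
    'Sales': [20.2, 1.8],
    'Date': ['1-1-2014', '1-5-2014'],
})

# Keep only int/float columns, then broadcast the addition
numeric = demo.select_dtypes(include='number')
shifted = numeric + 10

print(shifted)
```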


Because pandas is built on NumPy, numeric DataFrames are essentially matrices, and most matrix-related methods work as expected, such as transposition:

In [20]:
sales.transpose()

Unnamed: 0,0,1,2,3,4,5,6
UPC,1234,1234,1234,789,789,789,789
Units,5,2,3,1,2,,1
Sales,20.2,8,13,2,3.8,,1.8
Date,1-1-2014,1-2-2014,1-3-2014,1-1-2014,1-2-2014,1-3-2014,1-5-2014


In [21]:
sales.T # the .T attribute is the same as .transpose()

Unnamed: 0,0,1,2,3,4,5,6
UPC,1234,1234,1234,789,789,789,789
Units,5,2,3,1,2,,1
Sales,20.2,8,13,2,3.8,,1.8
Date,1-1-2014,1-2-2014,1-3-2014,1-1-2014,1-2-2014,1-3-2014,1-5-2014


The dot product works as well, but only on numeric cells (note how the missing values in row 5 propagate as NaN):

In [27]:
sales.iloc[:, :3].dot(sales.iloc[:, :3].T)

Unnamed: 0,0,1,2,3,4,5,6
0,1523189.04,1522927.6,1523033.6,973671.4,973712.76,,973667.36
1,1522927.6,1522824.0,1522866.0,973644.0,973660.4,,973642.4
2,1523033.6,1522866.0,1522934.0,973655.0,973681.4,,973652.4
3,973671.4,973644.0,973655.0,622526.0,622530.6,,622525.6
4,973712.76,973660.4,973681.4,622530.6,622539.44,,622529.84
5,,,,,,,
6,973667.36,973642.4,973652.4,622525.6,622529.84,,622525.24
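To make the arithmetic easier to follow than with the sales figures, here is a sketch of the same operation on a tiny hypothetical 2x2 frame `m`, checked against NumPy's matrix product.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x2 numeric frame: rows are [1, 3] and [2, 4]
m = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})

# Row-by-row dot products: entry (i, j) is row i dotted with row j
gram = m.dot(m.T)

# The same computation on the underlying NumPy arrays
expected = m.to_numpy() @ m.to_numpy().T

print(gram)
```

Entry `(0, 1)` is `1*2 + 3*4 = 14`, and the result is symmetric because a matrix is being multiplied by its own transpose.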


### Serialization of DataFrames

The most common way to serialize a DataFrame is to write it to a .csv file. Pandas facilitates this with the `.to_csv()` method:

In [29]:
fout = StringIO()
sales.to_csv(fout, index_label='index')

print(fout.getvalue())

index,UPC,Units,Sales,Date
0,1234,5.0,20.2,1-1-2014
1,1234,2.0,8.0,1-2-2014
2,1234,3.0,13.0,1-3-2014
3,789,1.0,2.0,1-1-2014
4,789,2.0,3.8,1-2-2014
5,789,,,1-3-2014
6,789,1.0,1.8,1-5-2014



Alternatively, the `.to_dict()` method gives a mapping of column name to a mapping of index to value. This is useful for conversion to JSON later:

In [30]:
sales.to_dict()

{'UPC': {0: 1234, 1: 1234, 2: 1234, 3: 789, 4: 789, 5: 789, 6: 789},
 'Units': {0: 5.0, 1: 2.0, 2: 3.0, 3: 1.0, 4: 2.0, 5: nan, 6: 1.0},
 'Sales': {0: 20.2, 1: 8.0, 2: 13.0, 3: 2.0, 4: 3.8, 5: nan, 6: 1.8},
 'Date': {0: '1-1-2014',
  1: '1-2-2014',
  2: '1-3-2014',
  3: '1-1-2014',
  4: '1-2-2014',
  5: '1-3-2014',
  6: '1-5-2014'}}

We can force a mapping of column name to a list instead by using the `orient=` parameter:

In [31]:
sales.to_dict(orient='list')

{'UPC': [1234, 1234, 1234, 789, 789, 789, 789],
 'Units': [5.0, 2.0, 3.0, 1.0, 2.0, nan, 1.0],
 'Sales': [20.2, 8.0, 13.0, 2.0, 3.8, nan, 1.8],
 'Date': ['1-1-2014',
  '1-2-2014',
  '1-3-2014',
  '1-1-2014',
  '1-2-2014',
  '1-3-2014',
  '1-5-2014']}
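For the JSON conversion mentioned above, one possible shortcut is the `.to_json()` method, which skips the intermediate dictionary and writes missing values as `null`. This is a sketch on a hypothetical two-row frame `demo`, using `orient='records'` as one of several available layouts.

```python
import json
import pandas as pd

# Hypothetical frame with one missing value
demo = pd.DataFrame({'UPC': [1234, 789], 'Sales': [20.2, None]})

# One JSON object per row; NaN becomes null
as_json = demo.to_json(orient='records')

# Parse it back with the standard library to inspect the structure
records = json.loads(as_json)
print(records)
```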

Note that we can retrieve a DataFrame from such a dictionary:

In [32]:
pd.DataFrame.from_dict(sales.to_dict())

Unnamed: 0,UPC,Units,Sales,Date
0,1234,5.0,20.2,1-1-2014
1,1234,2.0,8.0,1-2-2014
2,1234,3.0,13.0,1-3-2014
3,789,1.0,2.0,1-1-2014
4,789,2.0,3.8,1-2-2014
5,789,,,1-3-2014
6,789,1.0,1.8,1-5-2014


Pandas supports reading and writing MS Excel files via the function `pd.read_excel()` and the method `.to_excel()`:

In [33]:
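# The context manager saves and closes the file on exit (writer.save() was removed in pandas 2.0)
with pd.ExcelWriter('output.xlsx') as writer:
    sales.to_excel(writer, sheet_name='sheet1')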

pd.read_excel('output.xlsx')

Unnamed: 0.1,Unnamed: 0,UPC,Units,Sales,Date
0,0,1234,5.0,20.2,1-1-2014
1,1,1234,2.0,8.0,1-2-2014
2,2,1234,3.0,13.0,1-3-2014
3,3,789,1.0,2.0,1-1-2014
4,4,789,2.0,3.8,1-2-2014
5,5,789,,,1-3-2014
6,6,789,1.0,1.8,1-5-2014


Finally, a DataFrame can be converted to a NumPy array through the `.values` attribute (the pandas documentation now recommends the equivalent `.to_numpy()` method):

In [36]:
sales.values

array([[1234, 5.0, 20.2, '1-1-2014'],
       [1234, 2.0, 8.0, '1-2-2014'],
       [1234, 3.0, 13.0, '1-3-2014'],
       [789, 1.0, 2.0, '1-1-2014'],
       [789, 2.0, 3.8, '1-2-2014'],
       [789, nan, nan, '1-3-2014'],
       [789, 1.0, 1.8, '1-5-2014']], dtype=object)
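The `dtype=object` in the output above comes from mixing numbers and strings in one array. As a sketch on a hypothetical frame `demo`, selecting only the numeric columns before converting gives a plain floating-point array instead.

```python
import pandas as pd

# Hypothetical frame mixing numeric and text columns
demo = pd.DataFrame({
    'UPC': [1234, 789],
    'Sales': [20.2, 1.8],
    'Date': ['1-1-2014', '1-5-2014'],
})

# Mixed columns fall back to a generic object array
mixed = demo.to_numpy()

# Numeric-only columns share a common numeric dtype
numeric = demo[['UPC', 'Sales']].to_numpy()

print(mixed.dtype, numeric.dtype)
```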

### DataFrame Index Operations

The index of a DataFrame supports many operations. The `.reindex()` method conforms the data to a new index (and/or new columns):

In [37]:
sales.reindex([0, 4])

Unnamed: 0,UPC,Units,Sales,Date
0,1234,5.0,20.2,1-1-2014
4,789,2.0,3.8,1-2-2014


In [39]:
sales.reindex(columns=['Date', 'Sales'])

Unnamed: 0,Date,Sales
0,1-1-2014,20.2
1,1-2-2014,8.0
2,1-3-2014,13.0
3,1-1-2014,2.0
4,1-2-2014,3.8
5,1-3-2014,
6,1-5-2014,1.8


In [41]:
sales.reindex(index=[2, 6, 8], columns=['Sales', 'UPC', 'missing']) # Labels that do not exist are filled with the value given by fill_value= (default is NaN)

Unnamed: 0,Sales,UPC,missing
2,13.0,1234.0,
6,1.8,789.0,
8,,,
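The `fill_value=` parameter mentioned in the comment above can be sketched as follows, on a hypothetical two-row frame `demo`. Here `0` is an arbitrary choice of filler for illustration.

```python
import pandas as pd

# Hypothetical frame whose index labels are 2 and 6
demo = pd.DataFrame({'UPC': [1234, 789], 'Sales': [13.0, 1.8]}, index=[2, 6])

# Row label 8 and column 'missing' do not exist, so they get the filler
filled = demo.reindex(index=[2, 6, 8],
                      columns=['Sales', 'UPC', 'missing'],
                      fill_value=0)

print(filled)
```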


To set a column as the index, use `.set_index()`:

In [42]:
by_date = sales.set_index('Date')

by_date

Date,UPC,Units,Sales
1-1-2014,1234,5.0,20.2
1-2-2014,1234,2.0,8.0
1-3-2014,1234,3.0,13.0
1-1-2014,789,1.0,2.0
1-2-2014,789,2.0,3.8
1-3-2014,789,,
1-5-2014,789,1.0,1.8


As in Series, the index of a DataFrame can contain duplicated values.

To reset the index, we can use `.reset_index()`:

In [44]:
by_date.reset_index() # This returns a new DataFrame, unless you pass inplace=True

Unnamed: 0,Date,UPC,Units,Sales
0,1-1-2014,1234,5.0,20.2
1,1-2-2014,1234,2.0,8.0
2,1-3-2014,1234,3.0,13.0
3,1-1-2014,789,1.0,2.0
4,1-2-2014,789,2.0,3.8
5,1-3-2014,789,,
6,1-5-2014,789,1.0,1.8


### Accessing and Modifying Values

Unlike in Series, the indexers `.iat[]` and `.at[]` are not the same as `.iloc[]` and `.loc[]`. The latter are generally used to return full rows/columns, while the former return a single value:

In [48]:
by_date.iat[4, 2]

3.8

In [49]:
by_date.at['1-5-2014', 'UPC']

789

Note the behavior when there are duplicated index values:

In [51]:
by_date.at['1-2-2014', 'UPC'] # Duplicated index values yield several values instead of a scalar (an array here; the container type may differ across pandas versions)

array([1234,  789], dtype=int64)

Using the assignment operator, we can modify values with `.iat[]` or `.at[]`:

In [55]:
sales.at[6, 'Sales'] = 789

sales

Unnamed: 0,UPC,Units,Sales,Date
0,1234,5.0,20.2,1-1-2014
1,1234,2.0,8.0,1-2-2014
2,1234,3.0,13.0,1-3-2014
3,789,1.0,2.0,1-1-2014
4,789,2.0,3.8,1-2-2014
5,789,,,1-3-2014
6,789,1.0,789.0,1-5-2014


To insert a column into a DataFrame, the `.insert()` method is available, but it is not the best way to do so. This method inserts **in place**, so use it cautiously.

In [56]:
sales.insert(1, 'Category', 'Food') # First parameter is position of column; second parameter is the name of column, third parameter is the new value.

sales

Unnamed: 0,UPC,Category,Units,Sales,Date
0,1234,Food,5.0,20.2,1-1-2014
1,1234,Food,2.0,8.0,1-2-2014
2,1234,Food,3.0,13.0,1-3-2014
3,789,Food,1.0,2.0,1-1-2014
4,789,Food,2.0,3.8,1-2-2014
5,789,Food,,,1-3-2014
6,789,Food,1.0,789.0,1-5-2014


The value of the inserted column does not have to be a scalar; it can also be a sequence (like a list) or a Series. But once again, using `.insert()` is not recommended; it is better to use index assignment (e.g. `sales.loc[:, 'Category'] = 'Food'`). Columns added this way are always placed at the right-most position, but the column order can always be rearranged via `.reindex()`.
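A sketch of that add-then-reorder pattern is shown below on a hypothetical frame `demo`. It uses `.assign()`, which is a swapped-in alternative to index assignment that returns a new DataFrame rather than modifying the original.

```python
import pandas as pd

# Hypothetical frame without a category column
demo = pd.DataFrame({'UPC': [1234, 789], 'Units': [5.0, 1.0]})

# The new column lands at the right-most position
with_cat = demo.assign(Category='Food')

# Reorder so that 'Category' sits right after 'UPC'
reordered = with_cat.reindex(columns=['UPC', 'Category', 'Units'])

print(reordered)
```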

To update many values of a DataFrame across columns, use `.replace()`:

In [57]:
sales.replace(789, 790) # Replace all 789 entries with 790. This is not done in place.

Unnamed: 0,UPC,Category,Units,Sales,Date
0,1234,Food,5.0,20.2,1-1-2014
1,1234,Food,2.0,8.0,1-2-2014
2,1234,Food,3.0,13.0,1-3-2014
3,790,Food,1.0,2.0,1-1-2014
4,790,Food,2.0,3.8,1-2-2014
5,790,Food,,,1-3-2014
6,790,Food,1.0,790.0,1-5-2014


Note that this replaces the value indiscriminately in *all cells*. To be more specific, we can pass a dictionary mapping instead:

In [59]:
sales.replace({'UPC': {789: 790},    # At column 'UPC', replace 789 with 790
              'Sales': {789: 1.4}})  # At column 'Sales', replace 789 with 1.4

Unnamed: 0,UPC,Category,Units,Sales,Date
0,1234,Food,5.0,20.2,1-1-2014
1,1234,Food,2.0,8.0,1-2-2014
2,1234,Food,3.0,13.0,1-3-2014
3,790,Food,1.0,2.0,1-1-2014
4,790,Food,2.0,3.8,1-2-2014
5,790,Food,,,1-3-2014
6,790,Food,1.0,1.4,1-5-2014


Powerfully, `.replace()` also accepts regular expressions. This can be switched on by passing `regex=True`:

In [61]:
sales.replace(r'(F.*d)', r'\1_stuff', regex=True) # Append '_stuff' to every string match that starts with 'F' and ends with 'd'

Unnamed: 0,UPC,Category,Units,Sales,Date
0,1234,Food_stuff,5.0,20.2,1-1-2014
1,1234,Food_stuff,2.0,8.0,1-2-2014
2,1234,Food_stuff,3.0,13.0,1-3-2014
3,789,Food_stuff,1.0,2.0,1-1-2014
4,789,Food_stuff,2.0,3.8,1-2-2014
5,789,Food_stuff,,,1-3-2014
6,789,Food_stuff,1.0,789.0,1-5-2014
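Like plain replacement, a regular-expression replacement touches every string column unless it is restricted. The sketch below, on a hypothetical frame `demo` with an extra `Note` column, contrasts the frame-wide call with a per-column form that passes one dictionary for the patterns and one for the replacements.

```python
import pandas as pd

# Hypothetical frame with two text columns
demo = pd.DataFrame({'Category': ['Food', 'Feed'], 'Note': ['Fed', 'ok']})

# Frame-wide: every matching string cell is rewritten
everywhere = demo.replace(r'^(F.*d)$', r'\1_stuff', regex=True)

# Per-column: only 'Category' is searched and rewritten
only_category = demo.replace({'Category': r'^(F.*d)$'},
                             {'Category': r'\1_stuff'},
                             regex=True)

print(everywhere)
print(only_category)
```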


### Deleting Columns

As with deleting rows, there are multiple ways to delete columns:

* The `.pop()` method
* The `.drop(axis=1)` method
* The `.reindex()` method
* Indexing with a list of new columns

Note that `.pop()` works in place, so either avoid this method or use it cautiously:

In [62]:
sales.loc[:, 'subcat'] = 'Dairy'

sales

Unnamed: 0,UPC,Category,Units,Sales,Date,subcat
0,1234,Food,5.0,20.2,1-1-2014,Dairy
1,1234,Food,2.0,8.0,1-2-2014,Dairy
2,1234,Food,3.0,13.0,1-3-2014,Dairy
3,789,Food,1.0,2.0,1-1-2014,Dairy
4,789,Food,2.0,3.8,1-2-2014,Dairy
5,789,Food,,,1-3-2014,Dairy
6,789,Food,1.0,789.0,1-5-2014,Dairy


In [63]:
sales.pop('subcat')

0    Dairy
1    Dairy
2    Dairy
3    Dairy
4    Dairy
5    Dairy
6    Dairy
Name: subcat, dtype: object

In [64]:
sales

Unnamed: 0,UPC,Category,Units,Sales,Date
0,1234,Food,5.0,20.2,1-1-2014
1,1234,Food,2.0,8.0,1-2-2014
2,1234,Food,3.0,13.0,1-3-2014
3,789,Food,1.0,2.0,1-1-2014
4,789,Food,2.0,3.8,1-2-2014
5,789,Food,,,1-3-2014
6,789,Food,1.0,789.0,1-5-2014


The `.drop(axis=1)` method is the safest way; like most methods in pandas, it returns a new DataFrame:

In [65]:
sales.drop(['Category', 'Units'], axis=1)

Unnamed: 0,UPC,Sales,Date
0,1234,20.2,1-1-2014
1,1234,8.0,1-2-2014
2,1234,13.0,1-3-2014
3,789,2.0,1-1-2014
4,789,3.8,1-2-2014
5,789,,1-3-2014
6,789,789.0,1-5-2014
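An equivalent spelling that some readers find clearer is the `columns=` keyword of `.drop()`, which avoids having to remember which axis number means columns. A brief sketch on a hypothetical frame `demo`:

```python
import pandas as pd

# Hypothetical frame with two columns to remove
demo = pd.DataFrame({
    'UPC': [1234, 789],
    'Category': ['Food', 'Food'],
    'Units': [5.0, 1.0],
    'Sales': [20.2, 1.8],
})

dropped_axis = demo.drop(['Category', 'Units'], axis=1)
dropped_kw = demo.drop(columns=['Category', 'Units'])

print(dropped_kw)
```

Both calls return a new DataFrame and leave `demo` untouched.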


The easiest way to use `.reindex()` or indexing to remove columns is to first create a list of the columns you want to keep:

In [67]:
cols = ['Sales', 'Date']
sales.reindex(columns=cols) # This is a safe method; returns a new DataFrame

Unnamed: 0,Sales,Date
0,20.2,1-1-2014
1,8.0,1-2-2014
2,13.0,1-3-2014
3,2.0,1-1-2014
4,3.8,1-2-2014
5,,1-3-2014
6,789.0,1-5-2014


In [69]:
sales.loc[:, cols] # Same here

Unnamed: 0,Sales,Date
0,20.2,1-1-2014
1,8.0,1-2-2014
2,13.0,1-3-2014
3,2.0,1-1-2014
4,3.8,1-2-2014
5,,1-3-2014
6,789.0,1-5-2014


### Slicing a DataFrame

The `.head()` and `.tail()` methods are great for slicing a DataFrame either from the beginning or the end row-wise. By default the slice is 5 rows:

In [70]:
sales.head()

Unnamed: 0,UPC,Category,Units,Sales,Date
0,1234,Food,5.0,20.2,1-1-2014
1,1234,Food,2.0,8.0,1-2-2014
2,1234,Food,3.0,13.0,1-3-2014
3,789,Food,1.0,2.0,1-1-2014
4,789,Food,2.0,3.8,1-2-2014


In [71]:
sales.tail()

Unnamed: 0,UPC,Category,Units,Sales,Date
2,1234,Food,3.0,13.0,1-3-2014
3,789,Food,1.0,2.0,1-1-2014
4,789,Food,2.0,3.8,1-2-2014
5,789,Food,,,1-3-2014
6,789,Food,1.0,789.0,1-5-2014


In [72]:
sales.head(2)

Unnamed: 0,UPC,Category,Units,Sales,Date
0,1234,Food,5.0,20.2,1-1-2014
1,1234,Food,2.0,8.0,1-2-2014


By using `.iloc[]` and `.loc[]`, we can slice the DataFrame in very specific ways:

In [74]:
sales.loc[:, 'new_index'] = list('abcdefg')
df = sales.set_index('new_index')
sales = sales.drop('new_index', axis=1)

df

new_index,UPC,Category,Units,Sales,Date
a,1234,Food,5.0,20.2,1-1-2014
b,1234,Food,2.0,8.0,1-2-2014
c,1234,Food,3.0,13.0,1-3-2014
d,789,Food,1.0,2.0,1-1-2014
e,789,Food,2.0,3.8,1-2-2014
f,789,Food,,,1-3-2014
g,789,Food,1.0,789.0,1-5-2014


In [75]:
df.iloc[2:4] # Slice DataFrame row-wise from position 2 to 3

new_index,UPC,Category,Units,Sales,Date
c,1234,Food,3.0,13.0,1-3-2014
d,789,Food,1.0,2.0,1-1-2014


The `.iloc[]` and `.loc[]` indexers accept up to two positional arguments; the first always selects rows, and the second selects columns. To specify just columns, you must pass `:` in the first position. For example:

In [76]:
df.iloc[2:4, 0:1] # Slice row-wise position 2-3, column-wise position 0

new_index,UPC
c,1234
d,789


In [77]:
df.loc['d':, 'Units'] # Slice row-wise from index value 'd' to the end; column-wise just the 'Units' column

new_index
d    1.0
e    2.0
f    NaN
g    1.0
Name: Units, dtype: float64

Note that `.loc[]` behaves a little differently from normal Python slicing. It **will** include the final label of the specified interval:

In [78]:
df.loc['a': 'd']

new_index,UPC,Category,Units,Sales,Date
a,1234,Food,5.0,20.2,1-1-2014
b,1234,Food,2.0,8.0,1-2-2014
c,1234,Food,3.0,13.0,1-3-2014
d,789,Food,1.0,2.0,1-1-2014


In summary:

* `.iloc[i:j]` - Rows position i up to but not including j
* `.iloc[:, i:j]` - Columns position i up to but not including j
* `.iloc[[i, k, m]]` - Rows at i, k, and m (not an interval)
* `.loc[a:b]` - Rows from index label a through (and including) b
* `.loc[:, c:d]` - Columns from column label c through (and including) d
* `.loc[:, [b, d, f]]` - Columns at labels b, d, and f (not an interval)
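The difference between the half-open `.iloc[]` interval and the inclusive `.loc[]` interval can be checked with a short sketch. The frame `demo` below is a hypothetical stand-in with the same letter index as `df`.

```python
import pandas as pd

# Hypothetical frame indexed by the letters 'a' through 'g'
demo = pd.DataFrame({'Units': range(7)}, index=list('abcdefg'))

# Positions 0, 1, 2: the stop position 3 is excluded
by_position = demo.iloc[0:3]

# Labels 'a' through 'd': the stop label 'd' is included
by_label = demo.loc['a':'d']

print(list(by_position.index))
print(list(by_label.index))
```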

To slice out columns by label, but rows by position, we can chain calls to `.loc[]` and `.iloc[]` together:

In [81]:
(df.loc[:, ['UPC', 'Sales']] # Slice columns by label; 'UPC' and 'Sales'
 .iloc[-4:])                 # Slice rows by position, the last 4 rows

new_index,UPC,Sales
d,789,2.0
e,789,3.8
f,789,
g,789,789.0
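As an alternative sketch to chaining, the column labels can first be translated into positions with `Index.get_indexer()`, so that a single `.iloc[]` call does both selections. The frame `demo` is a hypothetical stand-in for `df`, and whether this reads better than chaining is a matter of taste.

```python
import pandas as pd

# Hypothetical frame indexed by the letters 'a' through 'g'
demo = pd.DataFrame({
    'UPC': [1234, 1234, 1234, 789, 789, 789, 789],
    'Units': [5.0, 2.0, 3.0, 1.0, 2.0, None, 1.0],
    'Sales': [20.2, 8.0, 13.0, 2.0, 3.8, None, 1.8],
}, index=list('abcdefg'))

# Chained: columns by label, then rows by position
chained = demo.loc[:, ['UPC', 'Sales']].iloc[-4:]

# Single call: convert labels to integer positions first
col_positions = demo.columns.get_indexer(['UPC', 'Sales'])
single = demo.iloc[-4:, col_positions]

print(single)
```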


### Sorting a DataFrame

We can sort a DataFrame by index (with `.sort_index()`), or by any of the columns (with `.sort_values()`). In both cases this returns a new DataFrame:

In [88]:
sales.sort_index(ascending=False)

Unnamed: 0,UPC,Category,Units,Sales,Date
6,789,Food,1.0,789.0,1-5-2014
5,789,Food,,,1-3-2014
4,789,Food,2.0,3.8,1-2-2014
3,789,Food,1.0,2.0,1-1-2014
2,1234,Food,3.0,13.0,1-3-2014
1,1234,Food,2.0,8.0,1-2-2014
0,1234,Food,5.0,20.2,1-1-2014


In [85]:
sales.sort_values('UPC') # Have to specify a column label here

Unnamed: 0,UPC,Category,Units,Sales,Date
3,789,Food,1.0,2.0,1-1-2014
4,789,Food,2.0,3.8,1-2-2014
5,789,Food,,,1-3-2014
6,789,Food,1.0,789.0,1-5-2014
0,1234,Food,5.0,20.2,1-1-2014
1,1234,Food,2.0,8.0,1-2-2014
2,1234,Food,3.0,13.0,1-3-2014


Passing a list of column labels makes pandas sort by the first label in the list, then break ties with the following labels, proceeding left to right through the list:

In [87]:
sales.sort_values(['Units', 'UPC']) # Sort by Units first, then by UPC within equal Units, because 'Units' comes first in the list

Unnamed: 0,UPC,Category,Units,Sales,Date
3,789,Food,1.0,2.0,1-1-2014
6,789,Food,1.0,789.0,1-5-2014
4,789,Food,2.0,3.8,1-2-2014
1,1234,Food,2.0,8.0,1-2-2014
2,1234,Food,3.0,13.0,1-3-2014
0,1234,Food,5.0,20.2,1-1-2014
5,789,Food,,,1-3-2014
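To see that the order of labels in the list decides the sort priority, the following sketch sorts a small hypothetical frame `demo` both ways and compares the resulting row order. Missing values are placed last by default.

```python
import pandas as pd

# Hypothetical frame with one missing Units value
demo = pd.DataFrame({
    'UPC': [789, 1234, 1234, 789],
    'Units': [2.0, 1.0, 2.0, None],
})

# Units is the primary key, UPC breaks ties
units_first = demo.sort_values(['Units', 'UPC'])

# UPC is the primary key, Units breaks ties
upc_first = demo.sort_values(['UPC', 'Units'])

print(list(units_first.index))
print(list(upc_first.index))
```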
