# Pandas


Pandas is a data analysis library built on top of NumPy. You'll use it heavily for data manipulation, for visualisation, and when preparing data for machine learning models.

Pandas has two main data structures: Series and dataframes. Data is usually stored in dataframes, so manipulating dataframes quickly is probably the most important skill for data analysis.

*Source: https://pandas.pydata.org/pandas-docs/stable/overview.html*

In this section, you will study:
1. The pandas Series (similar to a numpy array)
    * Creating a pandas series
    * Indexing series
2. Dataframes 
    * Creating dataframes from dictionaries
    * Importing CSV data files as pandas dataframes
    * Reading and summarising dataframes
    * Sorting dataframes 

### 1. The Pandas Series 

A series is similar to a 1-D numpy array, and contains scalar values of the same type (numeric, character, datetime etc.). 
A dataframe is simply a table where each column is a pandas series.


#### Creating Pandas Series

Series are one-dimensional array-like structures. Unlike numpy arrays, they often contain non-numeric data (characters, dates, times, booleans, etc.).

You can create a pandas series from an array-like object using `pd.Series()`.

In [78]:
# import pandas, pd is an alias
import pandas as pd
import numpy as np

# Creating a numeric pandas series
s = pd.Series([2, 4, 5, 6, 9])
print(s)
print(type(s))

0    2
1    4
2    5
3    6
4    9
dtype: int64
<class 'pandas.core.series.Series'>


In [79]:
series_2 = pd.Series({'name': 'shravan', 'age': 28, 'email': 'skjh54554@gmail.com'})
print(series_2)

name                 shravan
age                       28
email    skjh54554@gmail.com
dtype: object


In [80]:
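# a comma was missing between 'Jha' and 'test'; without it Python joins
# adjacent string literals into the single element 'Jhatest'
series_3 = pd.Series(('shravan', 'Kumar', 'Jha', 'Jha', 'test', 'edfdasf', 'dfsadf', 3, 4, 6, 6.7, 7.4))
print(series_3)

0     shravan
1       Kumar
2         Jha
3         Jha
4        test
5     edfdasf
6      dfsadf
7           3
8           4
9           6
10        6.7
11        7.4
dtype: object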


In [81]:
s_2 = pd.Series((5,6,9,10,34))
print(s_2)
print(type(s_2))

0     5
1     6
2     9
3    10
4    34
dtype: int64
<class 'pandas.core.series.Series'>


Note that each element in the Series has an index, and the index starts at 0 as usual.

In [82]:
# creating a series of characters
# notice that the 'dtype' here is 'object'
char_series = pd.Series(['a', 'b', 'af'])
char_series

0     a
1     b
2    af
dtype: object
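
The `dtype` that pandas reports depends on the values passed in. The short sketch below illustrates this with three small series; the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Illustrative names; not part of the original notebook
int_series = pd.Series([2, 4, 5])          # all integers -> an integer dtype
float_series = pd.Series([2, 4.5, 5])      # one float upcasts every value to float
mixed_series = pd.Series([2, 'a', 5.5])    # mixed Python objects -> 'object'

print(int_series.dtype)
print(float_series.dtype)
print(mixed_series.dtype)
```

An `object` dtype usually means the series holds arbitrary Python objects, so numeric operations on it tend to be slower or unavailable.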

In [83]:
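# Aside: a numpy array holds a single type, so mixing strings and numbers
# converts every element to a string (note the quotes around the digits in the output)
mixed_array = np.array([['a', 'bdfds', 'dfsdfs','kk'], [2, 4, 6,6]])
print(mixed_array)
print()
print(type(mixed_array))
print()
print(mixed_array.shape)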

[['a' 'bdfds' 'dfsdfs' 'kk']
 ['2' '4' '6' '6']]

<class 'numpy.ndarray'>

(2, 4)


In [84]:
# creating a range of dates; note that pd.date_range returns a DatetimeIndex, not a Series
date_series = pd.date_range(start = '11-09-2017', end = '12-12-2017')
print(date_series)
print(type(date_series))


DatetimeIndex(['2017-11-09', '2017-11-10', '2017-11-11', '2017-11-12',
               '2017-11-13', '2017-11-14', '2017-11-15', '2017-11-16',
               '2017-11-17', '2017-11-18', '2017-11-19', '2017-11-20',
               '2017-11-21', '2017-11-22', '2017-11-23', '2017-11-24',
               '2017-11-25', '2017-11-26', '2017-11-27', '2017-11-28',
               '2017-11-29', '2017-11-30', '2017-12-01', '2017-12-02',
               '2017-12-03', '2017-12-04', '2017-12-05', '2017-12-06',
               '2017-12-07', '2017-12-08', '2017-12-09', '2017-12-10',
               '2017-12-11', '2017-12-12'],
              dtype='datetime64[ns]', freq='D')
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
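
As the printed type shows, `pd.date_range()` produces a `DatetimeIndex` rather than a Series. A minimal sketch of turning such a range into a Series of dates, using a shorter, made-up date span for brevity:

```python
import pandas as pd

# pd.date_range returns a DatetimeIndex ...
dates = pd.date_range(start='2017-11-09', end='2017-11-12')

# ... and wrapping it in pd.Series gives a Series with a datetime dtype
date_values = pd.Series(dates)

print(type(dates).__name__)
print(type(date_values).__name__)
print(len(date_values))   # both the start and the end date are included
```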


In [85]:
print(pd.date_range(start='jan-19-2023', end='feb-13-2023'))

DatetimeIndex(['2023-01-19', '2023-01-20', '2023-01-21', '2023-01-22',
               '2023-01-23', '2023-01-24', '2023-01-25', '2023-01-26',
               '2023-01-27', '2023-01-28', '2023-01-29', '2023-01-30',
               '2023-01-31', '2023-02-01', '2023-02-02', '2023-02-03',
               '2023-02-04', '2023-02-05', '2023-02-06', '2023-02-07',
               '2023-02-08', '2023-02-09', '2023-02-10', '2023-02-11',
               '2023-02-12', '2023-02-13'],
              dtype='datetime64[ns]', freq='D')


In [86]:
date_series2  = pd.date_range(start='Jan-29-2023', end='Feb-14-2023')
print(date_series2)

DatetimeIndex(['2023-01-29', '2023-01-30', '2023-01-31', '2023-02-01',
               '2023-02-02', '2023-02-03', '2023-02-04', '2023-02-05',
               '2023-02-06', '2023-02-07', '2023-02-08', '2023-02-09',
               '2023-02-10', '2023-02-11', '2023-02-12', '2023-02-13',
               '2023-02-14'],
              dtype='datetime64[ns]', freq='D')


In [1]:
import numpy as np
import pandas as pd

In [2]:
# np.random.randint(5) returns one random integer from 0 to 4,
# so this builds a Series with a single element
print(pd.Series(np.random.randint(5)))

#### Indexing Series

With the default integer index, indexing a series works the same way as indexing a 1-D numpy array: positions start at 0.

In [89]:
# Indexing pandas series: Same as indexing 1-d numpy arrays or lists
# accessing the fourth element
print(s[3])
print()

# accessing elements starting index = 2 till the end
print(s[2:])

6

2    5
3    6
4    9
dtype: int64


In [90]:
# accessing the second and the fourth elements
# note that s[1, 3] will not work, you need to pass the indices [1, 3] as a list inside the original []
print(s[[1, 3]])

1    4
3    6
dtype: int64


#### Explicitly specifying indices

You might have noticed that while creating a series, Pandas automatically indexes it from 0 to (n-1), n being the number of rows. If you want, you can also set the index explicitly by passing the `index` argument to `pd.Series()`.

In [91]:
# Indexing explicitly
pd.Series([0, 1, 2], index = ['a', 'b', 'c'])

a    0
b    1
c    2
dtype: int64
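
Once a series has explicit labels, elements can be selected either by label or by position. The sketch below shows both using `.loc` and `.iloc`; the name `letters` is illustrative.

```python
import pandas as pd

letters = pd.Series([0, 1, 2], index=['a', 'b', 'c'])

print(letters.loc['b'])          # select by label
print(letters.iloc[1])           # select by position (0-based)
print(letters.loc[['a', 'c']])   # several labels at once, passed as a list
```

Using `.loc` and `.iloc` makes it explicit whether a label or a position is meant, which matters when the labels are themselves integers.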

In [92]:
print(np.random.randint(8, size=(4,2)))

[[3 0]
 [0 1]
 [6 7]
 [6 3]]


In [93]:
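# values of different types in one series give dtype 'object'; the index here is four random floats
heterogeneous_series = pd.Series(data = [5, 6, '4', 'ff'], index= np.random.random(4) )
print(heterogeneous_series)
print(type(heterogeneous_series))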

0.979362     5
0.242922     6
0.791901     4
0.592913    ff
dtype: object
<class 'pandas.core.series.Series'>


In [94]:
# You can also give the index as a sequence or use functions to specify the index
# But always make sure that the number of elements in the index is equal to the number of elements in the series
from time import time
t0 = time()
print(pd.Series(np.array(range(0,10))**2, index = range(0,10)))
t1 = time()
# elapsed time in milliseconds (time() returns seconds)
print((t1 - t0) * 1000)

0     0
1     1
2     4
3     9
4    16
5    25
6    36
7    49
8    64
9    81
dtype: int32
0.0


In [95]:
pd.Series(data=np.array(range(1,21))**3,  index=range(1,21))

1        1
2        8
3       27
4       64
5      125
6      216
7      343
8      512
9      729
10    1000
11    1331
12    1728
13    2197
14    2744
15    3375
16    4096
17    4913
18    5832
19    6859
20    8000
dtype: int32
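
The requirement that the index has as many elements as the data can be checked directly. A minimal sketch, with a hypothetical flag name `mismatch_rejected`, of what happens when the lengths differ:

```python
import pandas as pd

try:
    pd.Series([1, 2, 3], index=['a', 'b'])   # 3 values but only 2 labels
    mismatch_rejected = False
except ValueError as err:
    mismatch_rejected = True
    print('ValueError:', err)
```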

Usually, you will work with Series only as a part of dataframes. Let's study the basics of dataframes.

### 2. The Pandas Dataframe

The dataframe is the most widely used data structure in pandas. It is a table with rows and columns, with rows having an index and columns having meaningful names.

#### Creating dataframes from dictionaries

There are various ways of creating dataframes, such as creating them from dictionaries, JSON objects, reading from txt, CSV files, etc. 

In [96]:
# keys become column names
df = pd.DataFrame({'name': ['Vinay', 'Kushal', 'Aman', 'Saif'], 
                   'age': [22, 25, 24, 28], 
                    'occupation': ['engineer', 'doctor', 'data analyst', 'teacher']})
df

Unnamed: 0,name,age,occupation
0,Vinay,22,engineer
1,Kushal,25,doctor
2,Aman,24,data analyst
3,Saif,28,teacher


In [97]:
df2 = pd.DataFrame({'Item': ['Sugar', 'Oil', 'Biscuit', 'Lays'], 'Price': ['Rs. 41/-', 'Rs. 190/-', 'Rs. 135/-', 'Rs. 150/-']})
df2

Unnamed: 0,Item,Price
0,Sugar,Rs. 41/-
1,Oil,Rs. 190/-
2,Biscuit,Rs. 135/-
3,Lays,Rs. 150/-


In [98]:
length = 10
df3 = pd.DataFrame({'RandInt': np.random.randint(length),
                    'Random': np.random.random(length),
                     'aRange': np.arange(1, length+1),
                     'np.ones': np.ones(length),
                     'np.zeros': np.zeros(length)
                     }, index = np.arange(1, length+1, 1)
                  )
df3

Unnamed: 0,RandInt,Random,aRange,np.ones,np.zeros
1,8,0.223507,1,1.0,0.0
2,8,0.573471,2,1.0,0.0
3,8,0.272857,3,1.0,0.0
4,8,0.736973,4,1.0,0.0
5,8,0.730535,5,1.0,0.0
6,8,0.232236,6,1.0,0.0
7,8,0.377073,7,1.0,0.0
8,8,0.388506,8,1.0,0.0
9,8,0.383493,9,1.0,0.0
10,8,0.546686,10,1.0,0.0
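
Notice that `RandInt` holds the same value in every row: `np.random.randint(length)` returns a single integer, and pandas repeats a scalar across all rows. If one random integer per row is wanted, which is an assumption about the intent here, a sketch using the `size` argument could look like this (the name `per_row` is illustrative):

```python
import numpy as np
import pandas as pd

length = 10
per_row = pd.DataFrame({
    # size=length draws one random integer in [0, length) for each row
    'RandInt': np.random.randint(length, size=length),
    'aRange': np.arange(1, length + 1),
}, index=np.arange(1, length + 1))

print(per_row.head())
```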


#### Importing CSV data files as pandas dataframes 

For the upcoming exercises, we will use a dataset of a retail store having details about the orders placed, customers, product details, sales, profits etc. 



In [99]:
# reading a CSV file as a dataframe
market_df = pd.read_csv("global_sales_data/market_fact.csv")
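
If the CSV file is not at hand, `pd.read_csv()` can be tried on in-memory text instead. The sketch below uses made-up rows (not the real dataset) wrapped in `io.StringIO`, which `read_csv` accepts in place of a file path:

```python
import io
import pandas as pd

# A tiny in-memory stand-in for a CSV file (made-up rows, not the real dataset)
csv_text = """Ord_id,Sales,Order_Quantity
Ord_1,136.81,23
Ord_2,42.27,13
"""

sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df)
print(sample_df.dtypes)   # the column types are inferred from the text
```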

Usually, dataframes are created by importing CSV files, but sometimes it is more convenient to convert dictionaries into dataframes. For example, when the raw data is in JSON format (which is not uncommon), you can easily convert it into a dictionary, and then into a dataframe. 

You will learn how to convert JSON objects to dataframes later.
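
Until then, here is a minimal sketch of the idea for the simple case where the JSON text is a list of flat records. The payload and the names `raw`, `records` and `people_df` are hypothetical.

```python
import json
import pandas as pd

# Hypothetical JSON payload: a list of flat records
raw = '[{"name": "Vinay", "age": 22}, {"name": "Kushal", "age": 25}]'

records = json.loads(raw)            # JSON text -> list of Python dictionaries
people_df = pd.DataFrame(records)    # each dictionary becomes one row

print(people_df)
```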

#### Reading and Summarising Dataframes

After you import a dataframe, you'd want to quickly understand its structure, shape, meanings of rows and columns etc. Further, you may want to look at summary statistics - such as mean, percentiles etc.

In [101]:
# Looking at the top and bottom entries of dataframes
market_df.head()

Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin
0,Ord_5446,Prod_16,SHP_7609,Cust_1818,136.81,0.01,23,-30.51,3.6,0.56
1,Ord_5406,Prod_13,SHP_7549,Cust_1818,42.27,0.01,13,4.56,0.93,0.54
2,Ord_5446,Prod_4,SHP_7610,Cust_1818,4701.69,0.0,26,1148.9,2.5,0.59
3,Ord_5456,Prod_6,SHP_7625,Cust_1818,2337.89,0.09,43,729.34,14.3,0.37
4,Ord_5485,Prod_17,SHP_7664,Cust_1818,4233.15,0.08,35,1219.87,26.3,0.38


In [102]:
market_df.tail()

Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin
8394,Ord_5353,Prod_4,SHP_7479,Cust_1798,2841.4395,0.08,28,374.63,7.69,0.59
8395,Ord_5411,Prod_6,SHP_7555,Cust_1798,127.16,0.1,20,-74.03,6.92,0.37
8396,Ord_5388,Prod_6,SHP_7524,Cust_1798,243.05,0.02,39,-70.85,5.35,0.4
8397,Ord_5348,Prod_15,SHP_7469,Cust_1798,3872.87,0.03,23,565.34,30.0,0.62
8398,Ord_5459,Prod_6,SHP_7628,Cust_1798,603.69,0.0,47,131.39,4.86,0.38


Here, each row represents one product line of an order placed at a retail store; an order id can repeat, as Ord_5446 does in the first five rows. Notice the index associated with each row: it starts at 0 and ends at 8398, so the dataframe has 8399 rows.

In [104]:
# Looking at the datatypes of each column
market_df.info()

# Note that each column is basically a pandas Series of length 8399
# The ID columns are 'objects', i.e. they are being read as characters
# The rest are numeric (floats or int)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8399 entries, 0 to 8398
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Ord_id               8399 non-null   object 
 1   Prod_id              8399 non-null   object 
 2   Ship_id              8399 non-null   object 
 3   Cust_id              8399 non-null   object 
 4   Sales                8399 non-null   float64
 5   Discount             8399 non-null   float64
 6   Order_Quantity       8399 non-null   int64  
 7   Profit               8399 non-null   float64
 8   Shipping_Cost        8399 non-null   float64
 9   Product_Base_Margin  8336 non-null   float64
dtypes: float64(5), int64(1), object(4)
memory usage: 656.3+ KB
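
In the output above, `Product_Base_Margin` has 8336 non-null values against 8399 rows, so 63 values are missing. Missing values can also be counted directly with `isna().sum()`. The sketch below does this on a small made-up dataframe; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up values; only the column names are borrowed from the dataset
toy_df = pd.DataFrame({
    'Sales': [136.81, 42.27, 4701.69],
    'Product_Base_Margin': [0.56, np.nan, 0.59],
})

# isna() marks missing cells as True; summing counts them per column
missing_per_column = toy_df.isna().sum()
print(missing_per_column)
```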


In [105]:
# Describe gives you a summary of all the numeric columns in the dataset
market_df.describe()

Unnamed: 0,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin
count,8399.0,8399.0,8399.0,8399.0,8399.0,8336.0
mean,1775.878179,0.049671,25.571735,181.184424,12.838557,0.512513
std,3585.050525,0.031823,14.481071,1196.653371,17.264052,0.135589
min,2.24,0.0,1.0,-14140.7,0.49,0.35
25%,143.195,0.02,13.0,-83.315,3.3,0.38
50%,449.42,0.05,26.0,-1.5,6.07,0.52
75%,1709.32,0.08,38.0,162.75,13.99,0.59
max,89061.05,0.25,50.0,27220.69,164.73,0.85


In [106]:
# Column names
market_df.columns

Index(['Ord_id', 'Prod_id', 'Ship_id', 'Cust_id', 'Sales', 'Discount',
       'Order_Quantity', 'Profit', 'Shipping_Cost', 'Product_Base_Margin'],
      dtype='object')

In [107]:
# The number of rows and columns
market_df.shape

(8399, 10)

In [111]:
# You can extract the values of a dataframe as a numpy array using df.values 
market_df.values

array([['Ord_5446', 'Prod_16', 'SHP_7609', ..., -30.51, 3.6, 0.56],
       ['Ord_5406', 'Prod_13', 'SHP_7549', ..., 4.56, 0.93, 0.54],
       ['Ord_5446', 'Prod_4', 'SHP_7610', ..., 1148.9, 2.5, 0.59],
       ...,
       ['Ord_5388', 'Prod_6', 'SHP_7524', ..., -70.85, 5.35, 0.4],
       ['Ord_5348', 'Prod_15', 'SHP_7469', ..., 565.34, 30.0, 0.62],
       ['Ord_5459', 'Prod_6', 'SHP_7628', ..., 131.39, 4.86, 0.38]],
      dtype=object)

In [108]:
print(market_df.values)

[['Ord_5446' 'Prod_16' 'SHP_7609' ... -30.51 3.6 0.56]
 ['Ord_5406' 'Prod_13' 'SHP_7549' ... 4.56 0.93 0.54]
 ['Ord_5446' 'Prod_4' 'SHP_7610' ... 1148.9 2.5 0.59]
 ...
 ['Ord_5388' 'Prod_6' 'SHP_7524' ... -70.85 5.35 0.4]
 ['Ord_5348' 'Prod_15' 'SHP_7469' ... 565.34 30.0 0.62]
 ['Ord_5459' 'Prod_6' 'SHP_7628' ... 131.39 4.86 0.38]]


In [110]:
print(market_df.values.shape)

(8399, 10)


In [120]:
# first three rows and first four columns of the underlying array
print(market_df.values[:3, :4])

[['Ord_5446' 'Prod_16' 'SHP_7609' 'Cust_1818']
 ['Ord_5406' 'Prod_13' 'SHP_7549' 'Cust_1818']
 ['Ord_5446' 'Prod_4' 'SHP_7610' 'Cust_1818']]


In [121]:
df3.head()

Unnamed: 0,RandInt,Random,aRange,np.ones,np.zeros
1,8,0.223507,1,1.0,0.0
2,8,0.573471,2,1.0,0.0
3,8,0.272857,3,1.0,0.0
4,8,0.736973,4,1.0,0.0
5,8,0.730535,5,1.0,0.0


In [122]:
df3.tail()

Unnamed: 0,RandInt,Random,aRange,np.ones,np.zeros
6,8,0.232236,6,1.0,0.0
7,8,0.377073,7,1.0,0.0
8,8,0.388506,8,1.0,0.0
9,8,0.383493,9,1.0,0.0
10,8,0.546686,10,1.0,0.0


In [123]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 1 to 10
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   RandInt   10 non-null     int64  
 1   Random    10 non-null     float64
 2   aRange    10 non-null     int32  
 3   np.ones   10 non-null     float64
 4   np.zeros  10 non-null     float64
dtypes: float64(3), int32(1), int64(1)
memory usage: 440.0 bytes


In [124]:
df3.describe()

Unnamed: 0,RandInt,Random,aRange,np.ones,np.zeros
count,10.0,10.0,10.0,10.0,10.0
mean,8.0,0.446534,5.5,1.0,0.0
std,0.0,0.191238,3.02765,0.0,0.0
min,8.0,0.223507,1.0,1.0,0.0
25%,8.0,0.298911,3.25,1.0,0.0
50%,8.0,0.386,5.5,1.0,0.0
75%,8.0,0.566775,7.75,1.0,0.0
max,8.0,0.736973,10.0,1.0,0.0


In [125]:
products_df = pd.read_csv('global_sales_data/prod_dimen.csv')
products_df

Unnamed: 0,Product_Category,Product_Sub_Category,Prod_id
0,OFFICE SUPPLIES,STORAGE & ORGANIZATION,Prod_1
1,OFFICE SUPPLIES,APPLIANCES,Prod_2
2,OFFICE SUPPLIES,BINDERS AND BINDER ACCESSORIES,Prod_3
3,TECHNOLOGY,TELEPHONES AND COMMUNICATION,Prod_4
4,FURNITURE,OFFICE FURNISHINGS,Prod_5
5,OFFICE SUPPLIES,PAPER,Prod_6
6,OFFICE SUPPLIES,RUBBER BANDS,Prod_7
7,TECHNOLOGY,COMPUTER PERIPHERALS,Prod_8
8,OFFICE SUPPLIES,ENVELOPES,Prod_9
9,FURNITURE,BOOKCASES,Prod_10


In [126]:
products_df.head()

Unnamed: 0,Product_Category,Product_Sub_Category,Prod_id
0,OFFICE SUPPLIES,STORAGE & ORGANIZATION,Prod_1
1,OFFICE SUPPLIES,APPLIANCES,Prod_2
2,OFFICE SUPPLIES,BINDERS AND BINDER ACCESSORIES,Prod_3
3,TECHNOLOGY,TELEPHONES AND COMMUNICATION,Prod_4
4,FURNITURE,OFFICE FURNISHINGS,Prod_5


In [127]:
products_df.tail()

Unnamed: 0,Product_Category,Product_Sub_Category,Prod_id
12,OFFICE SUPPLIES,PENS & ART SUPPLIES,Prod_13
13,TECHNOLOGY,COPIERS AND FAX,Prod_14
14,FURNITURE,CHAIRS & CHAIRMATS,Prod_15
15,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS",Prod_16
16,TECHNOLOGY,OFFICE MACHINES,Prod_17


In [128]:
products_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17 entries, 0 to 16
Data columns (total 3 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Product_Category      17 non-null     object
 1   Product_Sub_Category  17 non-null     object
 2   Prod_id               17 non-null     object
dtypes: object(3)
memory usage: 536.0+ bytes


In [130]:
products_df.describe()

Unnamed: 0,Product_Category,Product_Sub_Category,Prod_id
count,17,17,17
unique,3,17,17
top,OFFICE SUPPLIES,STORAGE & ORGANIZATION,Prod_1
freq,9,1,1


In [132]:
products_df.values

array([['OFFICE SUPPLIES', 'STORAGE & ORGANIZATION', 'Prod_1'],
       ['OFFICE SUPPLIES', 'APPLIANCES', 'Prod_2'],
       ['OFFICE SUPPLIES', 'BINDERS AND BINDER ACCESSORIES', 'Prod_3'],
       ['TECHNOLOGY', 'TELEPHONES AND COMMUNICATION', 'Prod_4'],
       ['FURNITURE', 'OFFICE FURNISHINGS', 'Prod_5'],
       ['OFFICE SUPPLIES', 'PAPER', 'Prod_6'],
       ['OFFICE SUPPLIES', 'RUBBER BANDS', 'Prod_7'],
       ['TECHNOLOGY', 'COMPUTER PERIPHERALS', 'Prod_8'],
       ['OFFICE SUPPLIES', 'ENVELOPES', 'Prod_9'],
       ['FURNITURE', 'BOOKCASES', 'Prod_10'],
       ['FURNITURE', 'TABLES', 'Prod_11'],
       ['OFFICE SUPPLIES', 'LABELS', 'Prod_12'],
       ['OFFICE SUPPLIES', 'PENS & ART SUPPLIES', 'Prod_13'],
       ['TECHNOLOGY', 'COPIERS AND FAX', 'Prod_14'],
       ['FURNITURE', 'CHAIRS & CHAIRMATS', 'Prod_15'],
       ['OFFICE SUPPLIES', 'SCISSORS, RULERS AND TRIMMERS', 'Prod_16'],
       ['TECHNOLOGY', 'OFFICE MACHINES', 'Prod_17']], dtype=object)

In [136]:
products_df.values.shape

(17, 3)

In [139]:
print(products_df.columns)
print(type(products_df.columns))

Index(['Product_Category', 'Product_Sub_Category', 'Prod_id'], dtype='object')
<class 'pandas.core.indexes.base.Index'>


#### Indices 

An important concept in pandas dataframes is that of *row indices*. By default, rows are assigned integer indices starting from 0, which are displayed on the left side of the dataframe. 

In [131]:
market_df.head()

Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin
0,Ord_5446,Prod_16,SHP_7609,Cust_1818,136.81,0.01,23,-30.51,3.6,0.56
1,Ord_5406,Prod_13,SHP_7549,Cust_1818,42.27,0.01,13,4.56,0.93,0.54
2,Ord_5446,Prod_4,SHP_7610,Cust_1818,4701.69,0.0,26,1148.9,2.5,0.59
3,Ord_5456,Prod_6,SHP_7625,Cust_1818,2337.89,0.09,43,729.34,14.3,0.37
4,Ord_5485,Prod_17,SHP_7664,Cust_1818,4233.15,0.08,35,1219.87,26.3,0.38


Now, arbitrary numeric indices are difficult to read and work with. Thus, you may want to change the indices of the df to something more meaningful.

Let's change the index to Ord_id (unique id of each order), so that you can select rows using the order ids directly.

In [120]:
# Setting index to Ord_id
market_df.set_index('Ord_id', inplace = True)
market_df.head()

Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin
Ord_5446,Prod_16,SHP_7609,Cust_1818,136.81,0.01,23,-30.51,3.6,0.56
Ord_5406,Prod_13,SHP_7549,Cust_1818,42.27,0.01,13,4.56,0.93,0.54
Ord_5446,Prod_4,SHP_7610,Cust_1818,4701.69,0.0,26,1148.9,2.5,0.59
Ord_5456,Prod_6,SHP_7625,Cust_1818,2337.89,0.09,43,729.34,14.3,0.37
Ord_5485,Prod_17,SHP_7664,Cust_1818,4233.15,0.08,35,1219.87,26.3,0.38
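
By default `set_index()` returns a new dataframe and leaves the original untouched; `inplace = True` modifies the original instead. The sketch below, on a small made-up dataframe, shows the default behaviour and how `reset_index()` moves the index back into a column.

```python
import pandas as pd

# Made-up rows for illustration
orders = pd.DataFrame({
    'Ord_id': ['Ord_1', 'Ord_2', 'Ord_3'],
    'Sales': [136.81, 42.27, 4701.69],
})

indexed = orders.set_index('Ord_id')   # returns a new dataframe
print(orders.index.tolist())           # the original still has 0, 1, 2
print(indexed.index.tolist())

restored = indexed.reset_index()       # moves Ord_id back into a column
print(restored.columns.tolist())
```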


In [146]:
products_df.set_index(keys='Prod_id', inplace=False, append=True).head()

Unnamed: 0,Prod_id,Product_Category,Product_Sub_Category
0,Prod_1,OFFICE SUPPLIES,STORAGE & ORGANIZATION
1,Prod_2,OFFICE SUPPLIES,APPLIANCES
2,Prod_3,OFFICE SUPPLIES,BINDERS AND BINDER ACCESSORIES
3,Prod_4,TECHNOLOGY,TELEPHONES AND COMMUNICATION
4,Prod_5,FURNITURE,OFFICE FURNISHINGS


In [162]:
shape= products_df.shape
index_count = shape[0]
print(index_count)
products_df.set_index(keys=pd.Series([ 'Order {}'.format(indx+1) for indx in range(index_count)]), inplace=False).head()

17


Unnamed: 0,Product_Category,Product_Sub_Category,Prod_id
Order 1,OFFICE SUPPLIES,STORAGE & ORGANIZATION,Prod_1
Order 2,OFFICE SUPPLIES,APPLIANCES,Prod_2
Order 3,OFFICE SUPPLIES,BINDERS AND BINDER ACCESSORIES,Prod_3
Order 4,TECHNOLOGY,TELEPHONES AND COMMUNICATION,Prod_4
Order 5,FURNITURE,OFFICE FURNISHINGS,Prod_5


In [161]:
products_df.tail()

Unnamed: 0,Product_Category,Product_Sub_Category,Prod_id
12,OFFICE SUPPLIES,PENS & ART SUPPLIES,Prod_13
13,TECHNOLOGY,COPIERS AND FAX,Prod_14
14,FURNITURE,CHAIRS & CHAIRMATS,Prod_15
15,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS",Prod_16
16,TECHNOLOGY,OFFICE MACHINES,Prod_17


Having meaningful row labels as indices helps you to select (subset) dataframes easily. You will study selecting dataframes in the next section. 
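
As a brief preview, and only as a sketch on made-up rows, label-based selection with `.loc` looks like this. An index does not have to be unique, so a label that repeats returns every matching row.

```python
import pandas as pd

# Made-up rows; Ord_1 appears twice on purpose
orders = pd.DataFrame(
    {'Prod_id': ['Prod_16', 'Prod_13', 'Prod_4'],
     'Sales': [136.81, 42.27, 4701.69]},
    index=pd.Index(['Ord_1', 'Ord_2', 'Ord_1'], name='Ord_id'),
)

single = orders.loc[['Ord_2']]    # a list of labels always returns a dataframe
repeated = orders.loc['Ord_1']    # a repeated label returns all matching rows

print(single)
print(repeated)
```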

In [121]:
products_df.head()

Unnamed: 0,Product_Category,Product_Sub_Category,Prod_id
0,OFFICE SUPPLIES,STORAGE & ORGANIZATION,Prod_1
1,OFFICE SUPPLIES,APPLIANCES,Prod_2
2,OFFICE SUPPLIES,BINDERS AND BINDER ACCESSORIES,Prod_3
3,TECHNOLOGY,TELEPHONES AND COMMUNICATION,Prod_4
4,FURNITURE,OFFICE FURNISHINGS,Prod_5


In [163]:
products_df.set_index(keys='Prod_id', inplace=False)

Prod_id,Product_Category,Product_Sub_Category
Prod_1,OFFICE SUPPLIES,STORAGE & ORGANIZATION
Prod_2,OFFICE SUPPLIES,APPLIANCES
Prod_3,OFFICE SUPPLIES,BINDERS AND BINDER ACCESSORIES
Prod_4,TECHNOLOGY,TELEPHONES AND COMMUNICATION
Prod_5,FURNITURE,OFFICE FURNISHINGS
Prod_6,OFFICE SUPPLIES,PAPER
Prod_7,OFFICE SUPPLIES,RUBBER BANDS
Prod_8,TECHNOLOGY,COMPUTER PERIPHERALS
Prod_9,OFFICE SUPPLIES,ENVELOPES
Prod_10,FURNITURE,BOOKCASES


In [164]:
products_df.head()

Unnamed: 0,Product_Category,Product_Sub_Category,Prod_id
0,OFFICE SUPPLIES,STORAGE & ORGANIZATION,Prod_1
1,OFFICE SUPPLIES,APPLIANCES,Prod_2
2,OFFICE SUPPLIES,BINDERS AND BINDER ACCESSORIES,Prod_3
3,TECHNOLOGY,TELEPHONES AND COMMUNICATION,Prod_4
4,FURNITURE,OFFICE FURNISHINGS,Prod_5


In [166]:
no_of_rows = products_df.shape[0]
print(no_of_rows)

17


In [167]:
products_df.set_index(keys=[np.arange(1, no_of_rows + 1), 'Prod_id'], inplace=False)

Unnamed: 0,Prod_id,Product_Category,Product_Sub_Category
1,Prod_1,OFFICE SUPPLIES,STORAGE & ORGANIZATION
2,Prod_2,OFFICE SUPPLIES,APPLIANCES
3,Prod_3,OFFICE SUPPLIES,BINDERS AND BINDER ACCESSORIES
4,Prod_4,TECHNOLOGY,TELEPHONES AND COMMUNICATION
5,Prod_5,FURNITURE,OFFICE FURNISHINGS
6,Prod_6,OFFICE SUPPLIES,PAPER
7,Prod_7,OFFICE SUPPLIES,RUBBER BANDS
8,Prod_8,TECHNOLOGY,COMPUTER PERIPHERALS
9,Prod_9,OFFICE SUPPLIES,ENVELOPES
10,Prod_10,FURNITURE,BOOKCASES


In [168]:
products_df

Unnamed: 0,Product_Category,Product_Sub_Category,Prod_id
0,OFFICE SUPPLIES,STORAGE & ORGANIZATION,Prod_1
1,OFFICE SUPPLIES,APPLIANCES,Prod_2
2,OFFICE SUPPLIES,BINDERS AND BINDER ACCESSORIES,Prod_3
3,TECHNOLOGY,TELEPHONES AND COMMUNICATION,Prod_4
4,FURNITURE,OFFICE FURNISHINGS,Prod_5
5,OFFICE SUPPLIES,PAPER,Prod_6
6,OFFICE SUPPLIES,RUBBER BANDS,Prod_7
7,TECHNOLOGY,COMPUTER PERIPHERALS,Prod_8
8,OFFICE SUPPLIES,ENVELOPES,Prod_9
9,FURNITURE,BOOKCASES,Prod_10


#### Sorting dataframes

You can sort dataframes in two ways - 1) by the indices and 2) by the values.  


In [169]:
# Sorting by index
# axis = 0 indicates that you want to sort rows (use axis=1 for columns)
market_df.sort_index(axis = 0, ascending = False)

Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin
8398,Ord_5459,Prod_6,SHP_7628,Cust_1798,603.6900,0.00,47,131.39,4.86,0.38
8397,Ord_5348,Prod_15,SHP_7469,Cust_1798,3872.8700,0.03,23,565.34,30.00,0.62
8396,Ord_5388,Prod_6,SHP_7524,Cust_1798,243.0500,0.02,39,-70.85,5.35,0.40
8395,Ord_5411,Prod_6,SHP_7555,Cust_1798,127.1600,0.10,20,-74.03,6.92,0.37
8394,Ord_5353,Prod_4,SHP_7479,Cust_1798,2841.4395,0.08,28,374.63,7.69,0.59
...,...,...,...,...,...,...,...,...,...,...
4,Ord_5485,Prod_17,SHP_7664,Cust_1818,4233.1500,0.08,35,1219.87,26.30,0.38
3,Ord_5456,Prod_6,SHP_7625,Cust_1818,2337.8900,0.09,43,729.34,14.30,0.37
2,Ord_5446,Prod_4,SHP_7610,Cust_1818,4701.6900,0.00,26,1148.90,2.50,0.59
1,Ord_5406,Prod_13,SHP_7549,Cust_1818,42.2700,0.01,13,4.56,0.93,0.54


In [172]:
products_df.sort_index(axis=0, ascending=False)

Unnamed: 0,Product_Category,Product_Sub_Category,Prod_id
16,TECHNOLOGY,OFFICE MACHINES,Prod_17
15,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS",Prod_16
14,FURNITURE,CHAIRS & CHAIRMATS,Prod_15
13,TECHNOLOGY,COPIERS AND FAX,Prod_14
12,OFFICE SUPPLIES,PENS & ART SUPPLIES,Prod_13
11,OFFICE SUPPLIES,LABELS,Prod_12
10,FURNITURE,TABLES,Prod_11
9,FURNITURE,BOOKCASES,Prod_10
8,OFFICE SUPPLIES,ENVELOPES,Prod_9
7,TECHNOLOGY,COMPUTER PERIPHERALS,Prod_8


In [141]:
# Sorting by values

# Sorting in increasing order of Sales
market_df.sort_values(by='Sales').head()

Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin
Ord_704,Prod_7,SHP_964,Cust_242,2.24,0.01,1,-1.97,0.7,0.37
Ord_149,Prod_3,SHP_7028,Cust_1712,3.2,0.09,1,-3.16,1.49,0.37
Ord_4270,Prod_7,SHP_5959,Cust_1450,3.23,0.06,2,-2.73,0.7,0.81
Ord_4755,Prod_13,SHP_6628,Cust_1579,3.41,0.06,1,-1.78,0.7,0.56
Ord_2252,Prod_3,SHP_3064,Cust_881,3.42,0.05,1,-2.91,1.49,0.37


In [174]:
products_df.sort_values(by='Prod_id').head()

Unnamed: 0,Product_Category,Product_Sub_Category,Prod_id
0,OFFICE SUPPLIES,STORAGE & ORGANIZATION,Prod_1
9,FURNITURE,BOOKCASES,Prod_10
10,FURNITURE,TABLES,Prod_11
11,OFFICE SUPPLIES,LABELS,Prod_12
12,OFFICE SUPPLIES,PENS & ART SUPPLIES,Prod_13
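
Because `Prod_id` holds strings, the sort above is alphabetical, which is why Prod_10 comes right after Prod_1. One possible way to sort by the numeric part instead is sketched below on a small made-up dataframe; the helper column `prod_num` is a hypothetical name introduced only for the sort.

```python
import pandas as pd

products = pd.DataFrame({'Prod_id': ['Prod_1', 'Prod_10', 'Prod_2', 'Prod_11']})

# Take the digits after the underscore and convert them to integers
prod_num = products['Prod_id'].str.split('_').str[1].astype(int)

# Sort on the temporary numeric column, then drop it again
numeric_order = (products.assign(prod_num=prod_num)
                         .sort_values(by='prod_num')
                         .drop(columns='prod_num'))
print(numeric_order)
```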


In [175]:
# Sorting in decreasing order of Shipping_Cost
market_df.sort_values(by='Shipping_Cost', ascending = False).head()

Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin
4509,Ord_1751,Prod_15,SHP_2426,Cust_597,14740.51,0.0,46,3407.73,164.73,0.56
5327,Ord_839,Prod_11,SHP_1361,Cust_364,12689.87,0.04,44,-169.23,154.12,0.76
8283,Ord_1741,Prod_11,SHP_2411,Cust_595,15168.82,0.02,26,-1096.78,147.12,0.8
2800,Ord_417,Prod_11,SHP_561,Cust_156,20333.816,0.02,45,-1430.45,147.12,0.8
5511,Ord_1581,Prod_15,SHP_2184,Cust_519,2573.92,0.07,17,117.23,143.71,0.55


In [23]:
# Sorting by more than one column

# Sorting in descending order of Prod_id, and in descending order of Sales within each product
market_df.sort_values(by=['Prod_id', 'Sales'], ascending = False)

Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin
Ord_2197,Prod_9,SHP_2994,Cust_827,7522.80,0.04,48,3187.37,19.99,0.39
Ord_4356,Prod_9,SHP_6074,Cust_1481,6831.72,0.01,41,3081.02,19.99,0.39
Ord_262,Prod_9,SHP_358,Cust_66,6553.45,0.03,39,2969.81,19.99,0.39
Ord_4059,Prod_9,SHP_5660,Cust_1378,5587.20,0.05,36,2254.16,19.99,0.39
Ord_2973,Prod_9,SHP_6073,Cust_1480,5410.95,0.09,36,2077.91,19.99,0.39
Ord_950,Prod_9,SHP_1315,Cust_334,4906.85,0.09,32,1907.94,19.99,0.39
Ord_5112,Prod_9,SHP_7141,Cust_1729,4273.95,0.05,49,1340.07,19.99,0.40
Ord_612,Prod_9,SHP_836,Cust_687,3872.38,0.10,50,1110.35,19.99,0.38
Ord_3443,Prod_9,SHP_4773,Cust_1246,3849.17,0.06,46,1982.78,5.01,0.38
Ord_3650,Prod_9,SHP_6795,Cust_1683,3353.54,0.07,22,1189.96,19.99,0.39
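
A single `ascending` value applies to every column in `by`. To mix directions, `ascending` can instead be given as a list with one entry per column. A small sketch on made-up rows:

```python
import pandas as pd

# Made-up rows for illustration
sales = pd.DataFrame({
    'Prod_id': ['Prod_9', 'Prod_3', 'Prod_9', 'Prod_3'],
    'Sales': [7522.80, 3.20, 3353.54, 3.42],
})

# Prod_id ascending, and within each product the largest Sales first
ordered = sales.sort_values(by=['Prod_id', 'Sales'], ascending=[True, False])
print(ordered)
```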


In [176]:
products_df.sort_index(axis = 0, ascending = True)

Unnamed: 0,Product_Category,Product_Sub_Category,Prod_id
0,OFFICE SUPPLIES,STORAGE & ORGANIZATION,Prod_1
1,OFFICE SUPPLIES,APPLIANCES,Prod_2
2,OFFICE SUPPLIES,BINDERS AND BINDER ACCESSORIES,Prod_3
3,TECHNOLOGY,TELEPHONES AND COMMUNICATION,Prod_4
4,FURNITURE,OFFICE FURNISHINGS,Prod_5
5,OFFICE SUPPLIES,PAPER,Prod_6
6,OFFICE SUPPLIES,RUBBER BANDS,Prod_7
7,TECHNOLOGY,COMPUTER PERIPHERALS,Prod_8
8,OFFICE SUPPLIES,ENVELOPES,Prod_9
9,FURNITURE,BOOKCASES,Prod_10


In [177]:
products_df.sort_values(by='Product_Category')

Unnamed: 0,Product_Category,Product_Sub_Category,Prod_id
14,FURNITURE,CHAIRS & CHAIRMATS,Prod_15
4,FURNITURE,OFFICE FURNISHINGS,Prod_5
10,FURNITURE,TABLES,Prod_11
9,FURNITURE,BOOKCASES,Prod_10
0,OFFICE SUPPLIES,STORAGE & ORGANIZATION,Prod_1
12,OFFICE SUPPLIES,PENS & ART SUPPLIES,Prod_13
11,OFFICE SUPPLIES,LABELS,Prod_12
15,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS",Prod_16
8,OFFICE SUPPLIES,ENVELOPES,Prod_9
6,OFFICE SUPPLIES,RUBBER BANDS,Prod_7
