# Data Manipulation with Pandas

Pandas is one of the most widely used Python libraries for data science. It helps you manipulate data so that you can derive better insights and build better machine learning models.

In this notebook, we will have a look at some of the intermediate concepts of working with pandas.


## Table of Contents

 - Apply function

In [1]:
import pandas as pd
import numpy as np

# read the dataset
data_BM = pd.read_csv('bigmart_data.csv')
# drop the null values
data_BM = data_BM.dropna(how="any")
# reset index after dropping
data_BM = data_BM.reset_index(drop=True)
# view the top results
data_BM.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052
4,FDP36,10.395,Regular,0.0,Baking Goods,51.4008,OUT018,2009,Medium,Tier 3,Supermarket Type2,556.6088


### Apply function

- The `apply` function can be used for pre-processing and data manipulation, both row-wise and column-wise.
- It is usually more concise than an explicit **for** loop over your dataframe and often faster, although built-in vectorized operations are typically faster still.
- Whenever I need to iterate over a dataframe or its rows/columns, `apply` is the first thing I consider.
- Hence, it is widely used in feature engineering code.
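
To make the row-wise and column-wise behaviour concrete, here is a minimal sketch on a toy dataframe. The column names mirror the dataset, but the values and the `price_per_weight` feature are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame: column names mirror the dataset, values are made up
toy = pd.DataFrame({
    "Item_Weight": [9.3, 5.92, 17.5],
    "Item_MRP": [249.8, 48.3, 141.6],
})

# axis=0 (default): the function is called once per column
col_max = toy.apply(lambda col: col.max())

# axis=1: the function is called once per row
price_per_weight = toy.apply(
    lambda row: row["Item_MRP"] / row["Item_Weight"], axis=1
)

print(col_max)
print(price_per_weight)
```

With `axis=0` the result has one entry per column, while with `axis=1` it has one entry per row, which is the usual shape for a new feature column.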

In [2]:
# axis=0 (default): the lambda receives each column as a Series
data_BM.apply(lambda x: x)

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.300,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.1380
1,DRC01,5.920,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.500,Low Fat,0.016760,Meat,141.6180,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.2700
3,NCD19,8.930,Low Fat,0.000000,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052
4,FDP36,10.395,Regular,0.000000,Baking Goods,51.4008,OUT018,2009,Medium,Tier 3,Supermarket Type2,556.6088
5,FDO10,13.650,Regular,0.012741,Snack Foods,57.6588,OUT013,1987,High,Tier 3,Supermarket Type1,343.5528
6,FDY07,11.800,Low Fat,0.000000,Fruits and Vegetables,45.5402,OUT049,1999,Medium,Tier 1,Supermarket Type1,1516.0266
7,FDA03,18.500,Regular,0.045464,Dairy,144.1102,OUT046,1997,Small,Tier 1,Supermarket Type1,2187.1530
8,FDX32,15.100,Regular,0.100014,Fruits and Vegetables,145.4786,OUT049,1999,Medium,Tier 1,Supermarket Type1,1589.2646
9,FDS46,17.600,Regular,0.047257,Snack Foods,119.6782,OUT046,1997,Small,Tier 1,Supermarket Type1,2145.2076


In [4]:
# first element of every column, i.e. the first row
data_BM.apply(lambda x: x.iloc[0])

Item_Identifier                          FDA15
Item_Weight                                9.3
Item_Fat_Content                       Low Fat
Item_Visibility                      0.0160473
Item_Type                                Dairy
Item_MRP                               249.809
Outlet_Identifier                       OUT049
Outlet_Establishment_Year                 1999
Outlet_Size                             Medium
Outlet_Location_Type                    Tier 1
Outlet_Type                  Supermarket Type1
Item_Outlet_Sales                      3735.14
dtype: object

In [5]:
# axis=1: the lambda receives each row as a Series
data_BM.apply(lambda x: x, axis=1)

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.300,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.1380
1,DRC01,5.920,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.500,Low Fat,0.016760,Meat,141.6180,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.2700
3,NCD19,8.930,Low Fat,0.000000,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052
4,FDP36,10.395,Regular,0.000000,Baking Goods,51.4008,OUT018,2009,Medium,Tier 3,Supermarket Type2,556.6088
5,FDO10,13.650,Regular,0.012741,Snack Foods,57.6588,OUT013,1987,High,Tier 3,Supermarket Type1,343.5528
6,FDY07,11.800,Low Fat,0.000000,Fruits and Vegetables,45.5402,OUT049,1999,Medium,Tier 1,Supermarket Type1,1516.0266
7,FDA03,18.500,Regular,0.045464,Dairy,144.1102,OUT046,1997,Small,Tier 1,Supermarket Type1,2187.1530
8,FDX32,15.100,Regular,0.100014,Fruits and Vegetables,145.4786,OUT049,1999,Medium,Tier 1,Supermarket Type1,1589.2646
9,FDS46,17.600,Regular,0.047257,Snack Foods,119.6782,OUT046,1997,Small,Tier 1,Supermarket Type1,2145.2076


In [6]:
# first value of every row, i.e. the first column (by position)
data_BM.apply(lambda x: x.iloc[0], axis=1)

0       FDA15
1       DRC01
2       FDN15
3       NCD19
4       FDP36
5       FDO10
6       FDY07
7       FDA03
8       FDX32
9       FDS46
10      FDF32
11      FDP49
12      NCB42
13      FDP49
14      FDU02
15      FDN22
16      NCB30
17      FDR28
18      FDV10
19      DRJ59
20      NCS17
21      FDP33
22      DRH01
23      NCX29
24      DRZ11
25      FDU02
26      FDK43
27      FDA46
28      FDC02
29      FDL50
        ...  
4620    FDJ32
4621    FDV31
4622    FDW27
4623    NCS17
4624    FDX34
4625    FDL10
4626    FDZ28
4627    DRJ49
4628    FDV13
4629    FDO03
4630    FDT34
4631    FDE22
4632    FDT08
4633    NCP54
4634    NCK53
4635    FDQ44
4636    FDB46
4637    DRF37
4638    FDN28
4639    FDN58
4640    FDF05
4641    FDR26
4642    FDH31
4643    FDH24
4644    NCJ19
4645    FDF53
4646    FDF22
4647    NCJ29
4648    FDN46
4649    DRG01
Length: 4650, dtype: object

In [7]:
# access by column name
data_BM.apply(lambda x: x["Item_Fat_Content"], axis=1)

0       Low Fat
1       Regular
2       Low Fat
3       Low Fat
4       Regular
5       Regular
6       Low Fat
7       Regular
8       Regular
9       Regular
10      Low Fat
11      Regular
12      Low Fat
13      Regular
14      Low Fat
15      Regular
16      Low Fat
17      Regular
18      Regular
19      low fat
20      Low Fat
21      Low Fat
22      Low Fat
23      Low Fat
24      Regular
25      Low Fat
26      Low Fat
27      Low Fat
28      Low Fat
29      Regular
         ...   
4620    Low Fat
4621         LF
4622    Regular
4623    Low Fat
4624    Low Fat
4625    Low Fat
4626    Regular
4627    Low Fat
4628    Regular
4629    Regular
4630    Low Fat
4631    Low Fat
4632    Low Fat
4633    Low Fat
4634    Low Fat
4635    Low Fat
4636    Regular
4637    Low Fat
4638    Regular
4639    Regular
4640    Low Fat
4641    Low Fat
4642    Regular
4643    Low Fat
4644    Low Fat
4645        reg
4646    Low Fat
4647    Low Fat
4648    Regular
4649    Low Fat
Length: 4650, dtype: object
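
The output above contains several spellings of what look like the same categories (`low fat`, `LF`, `reg`). As a small sketch of how `apply` could tidy these up, the hypothetical helper `normalize_fat` below assumes that `LF` and `low fat` mean `Low Fat` and that `reg` means `Regular`; the notebook itself does not state this mapping.

```python
import pandas as pd

# A few Item_Fat_Content spellings that appear in the output above
fat = pd.Series(
    ["Low Fat", "Regular", "low fat", "LF", "reg"], name="Item_Fat_Content"
)

def normalize_fat(label):
    # Hypothetical helper: the meaning of the abbreviations is an assumption
    if label in ("Low Fat", "low fat", "LF"):
        return "Low Fat"
    if label in ("Regular", "reg"):
        return "Regular"
    return label  # leave unknown labels untouched

fat_clean = fat.apply(normalize_fat)
print(fat_clean.unique())
```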

- You can also use `apply` to implement a **condition** individually on every row/column of your dataframe.
- Suppose you want to cap Item_MRP at 200, so that any value greater than 200 is replaced by 200:
```python
def clip_price(price):
    if price > 200:
        price = 200
    return price
```

In [7]:
# before clipping
data_BM["Item_MRP"][:5]

0    249.8092
1     48.2692
2    141.6180
3     53.8614
4     51.4008
Name: Item_MRP, dtype: float64

In [11]:
# clip price if it is greater than 200
def clip_price(price):
    if price > 200:
        price = 200
    return price

# after clipping
data_BM["Item_MRP"].apply(clip_price)[:5]

0    200.0000
1     48.2692
2    141.6180
3     53.8614
4     51.4008
Name: Item_MRP, dtype: float64
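
For a plain cap at 200, as described in the bullet above, pandas also offers the vectorized `Series.clip`. The sketch below compares it with `apply` on the first five Item_MRP values shown earlier; `cap_at_200` is a hypothetical helper name used only here.

```python
import pandas as pd

# First five Item_MRP values shown earlier in this notebook
mrp = pd.Series([249.8092, 48.2692, 141.6180, 53.8614, 51.4008], name="Item_MRP")

def cap_at_200(price):
    # Hypothetical helper: replace anything above 200 with 200
    return min(price, 200.0)

capped_apply = mrp.apply(cap_at_200)   # element-by-element Python call
capped_clip = mrp.clip(upper=200)      # vectorized equivalent

print(capped_clip)
```

Both approaches give the same values; `clip` avoids calling a Python function for every element, which tends to matter on larger columns.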

- Suppose you want to label encode Outlet_Location_Type as 0, 1 and 2 for Tier 1, Tier 2 and Tier 3 cities respectively. Your logic would be:

```python
def label_encode(city):
    if city == 'Tier 1':
        label = 0
    elif city == 'Tier 2':
        label = 1
    else:
        label = 2
    return label
```
- You can use `apply` to run the `label_encode` logic on every value of the Outlet_Location_Type column.

In [12]:
# before label encoding
data_BM["Outlet_Location_Type"][:5]

0    Tier 1
1    Tier 3
2    Tier 1
3    Tier 3
4    Tier 3
Name: Outlet_Location_Type, dtype: object

In [13]:
# label encode city type
def label_encode(city):
    if city == 'Tier 1':
        label = 0
    elif city == 'Tier 2':
        label = 1
    else:
        label = 2
    return label

# apply label_encode to every value of Outlet_Location_Type
data_BM["Outlet_Location_Type"] = data_BM["Outlet_Location_Type"].apply(label_encode)

In [14]:
# after label encoding
data_BM["Outlet_Location_Type"][:5]

0    0
1    2
2    0
3    2
4    2
Name: Outlet_Location_Type, dtype: int64
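
The same encoding can also be sketched with `Series.map` and a dictionary. The sample values below are made up and `tier_to_label` is a hypothetical name. One difference from `label_encode` is that `map` yields `NaN` for any value missing from the dictionary instead of falling through to 2.

```python
import pandas as pd

# Made-up sample of Outlet_Location_Type values
city = pd.Series(
    ["Tier 1", "Tier 3", "Tier 1", "Tier 3", "Tier 2"],
    name="Outlet_Location_Type",
)

# Hypothetical lookup table matching the labels used by label_encode
tier_to_label = {"Tier 1": 0, "Tier 2": 1, "Tier 3": 2}

encoded = city.map(tier_to_label)
print(encoded)
```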