<a href="https://colab.research.google.com/github/sahil301290/Python-for-Data-Science/blob/main/06_2_Pandas_Conditional_Filtering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Introduction to Conditional Filtering

In data analysis, datasets are usually too large to select rows by position. Instead, we select rows based on a condition.

Conditional filtering lets us select rows based on a condition applied to one or more columns.

This works best when the data is well organized, so we start with how the data should be laid out.

In [1]:
import numpy as np
import pandas as pd

Columns are features

Rows are instances of data



In [2]:
myindex = ['India', 'USA', 'Canada']
mydata = [[1947,1390, 10], [1776, 33, 20], [1867, 20, 12]]
mycolumns = ['Independence', 'Population', 'GDP']
df = pd.DataFrame(data = mydata, index = myindex, columns = mycolumns)
df

Unnamed: 0,Independence,Population,GDP
India,1947,1390,10
USA,1776,33,20
Canada,1867,20,12


Which countries have Population more than X?

Which countries have GDP less than Y?

Which country gained independence first?
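
The first two questions are answered with boolean comparisons in the cells that follow. The third is not really a filter. One possible way to answer it, sketched here on a rebuilt copy of the same small frame, is `Series.idxmin()`, which returns the index label of the smallest value. The name `toy` is only used for this illustration.

```python
import pandas as pd

# Rebuild the small country frame from the cell above.
toy = pd.DataFrame(
    data=[[1947, 1390, 10], [1776, 33, 20], [1867, 20, 12]],
    index=['India', 'USA', 'Canada'],
    columns=['Independence', 'Population', 'GDP'],
)

# idxmin() returns the index label of the row with the smallest value.
earliest = toy['Independence'].idxmin()
print(earliest)  # USA
```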

In [3]:
#Which countries have population greater than 30?
df['Population']>30
#df[df['Population']>30]

India      True
USA        True
Canada    False
Name: Population, dtype: bool

Key steps:
- Grab the column for comparison
- Perform a comparison to get a Series of boolean values
- Pass that boolean Series to the DataFrame to keep only the rows where it is True
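
The three steps can be sketched one at a time as follows. This is a minimal illustration on a small frame; the names `toy`, `population`, `mask` and `result` are assumptions made for the example, not part of the notebook.

```python
import pandas as pd

toy = pd.DataFrame(
    {'Population': [1390, 33, 20], 'GDP': [10, 20, 12]},
    index=['India', 'USA', 'Canada'],
)

# Step 1: grab the column to compare.
population = toy['Population']

# Step 2: compare it to get a boolean Series (one True/False per row).
mask = population > 30

# Step 3: pass the boolean Series to the DataFrame to keep the True rows.
result = toy[mask]
print(result)
```

Writing `toy[toy['Population'] > 30]` performs the same three steps in a single expression.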

Conditional Filtering
- Filter by Single Condition
- Filter by Multiple Conditions
- Check against multiple possible values

Reading dataset: https://www.kaggle.com/ranjeetjain3/seaborn-tips-dataset

In [4]:
df = pd.read_csv('tips.csv')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


Adding a column named "Aadhar_Number" that holds a random 12-digit number for every person in the DataFrame

In [5]:
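# dtype=np.int64 keeps 12-digit values in range on platforms where the default integer is 32-bit
Aadhar_Number = np.random.randint(100000000000, 999999999999, len(df), dtype=np.int64)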
Aadhar_Number

array([988570798439, 704978009810, 247681557945, 264536142568,
       435259271933, 678415459164, 586644504057, 917916745188,
       958768253912, 161285106957, 218829432664, 358597310957,
       204808159105, 314226442295, 893484418709, 819941857519,
       586327886141, 674608102707, 282459128391, 256764614701,
       925863765203, 628325541029, 914403001198, 443764599341,
       190834725676, 733275823349, 936160653101, 348925747043,
       773644180733, 380268243319, 794592891280, 288644875746,
       493331464415, 705015534662, 721910271503, 823529037321,
       306394405467, 485190521009, 314185743261, 116493301889,
       854339633133, 297200506614, 401124136611, 376604046367,
       739884220874, 713790967830, 848128078053, 496963648546,
       702972767586, 465422618257, 124739016861, 921574862383,
       540756694077, 667144014171, 624166231775, 709372079906,
       961357682127, 953873768086, 339752121740, 433035530724,
       174475791273, 843049139398, 538656682006, 860570

In [6]:
# A NumPy array of matching length can be assigned to a new column directly
df['Aadhar_Number'] = Aadhar_Number
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,Aadhar_Number
0,16.99,1.01,Female,No,Sun,Dinner,2,988570798439
1,10.34,1.66,Male,No,Sun,Dinner,3,704978009810
2,21.01,3.5,Male,No,Sun,Dinner,3,247681557945
3,23.68,3.31,Male,No,Sun,Dinner,2,264536142568
4,24.59,3.61,Female,No,Sun,Dinner,4,435259271933


###Filter by Single Condition

In [7]:
df['total_bill']

0      16.99
1      10.34
2      21.01
3      23.68
4      24.59
       ...  
239    29.03
240    27.18
241    22.67
242    17.82
243    18.78
Name: total_bill, Length: 244, dtype: float64

In [8]:
#Give all rows where total bill is greater than 45
bool_series = df['total_bill'] > 45

In [9]:
df[bool_series]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,Aadhar_Number
59,48.27,6.73,Male,No,Sat,Dinner,4,433035530724
156,48.17,5.0,Male,No,Sun,Dinner,6,469174814331
170,50.81,10.0,Male,Yes,Sat,Dinner,3,262086152866
182,45.35,3.5,Male,Yes,Sun,Dinner,3,219931343583
212,48.33,9.0,Male,No,Sat,Dinner,4,181663571451


In [10]:
df[df['total_bill'] > 45]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,Aadhar_Number
59,48.27,6.73,Male,No,Sat,Dinner,4,433035530724
156,48.17,5.0,Male,No,Sun,Dinner,6,469174814331
170,50.81,10.0,Male,Yes,Sat,Dinner,3,262086152866
182,45.35,3.5,Male,Yes,Sun,Dinner,3,219931343583
212,48.33,9.0,Male,No,Sat,Dinner,4,181663571451


In [11]:
#Give all rows where sex is Male
df[df['sex'] == 'Male']

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,Aadhar_Number
1,10.34,1.66,Male,No,Sun,Dinner,3,704978009810
2,21.01,3.50,Male,No,Sun,Dinner,3,247681557945
3,23.68,3.31,Male,No,Sun,Dinner,2,264536142568
5,25.29,4.71,Male,No,Sun,Dinner,4,678415459164
6,8.77,2.00,Male,No,Sun,Dinner,2,586644504057
...,...,...,...,...,...,...,...,...
236,12.60,1.00,Male,Yes,Sat,Dinner,2,477690927699
237,32.83,1.17,Male,Yes,Sat,Dinner,2,196661124391
239,29.03,5.92,Male,No,Sat,Dinner,3,375401151059
241,22.67,2.00,Male,Yes,Sat,Dinner,2,309704363004


In [12]:
#Give all rows where size of party is >4
df[df['size'] > 4]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,Aadhar_Number
125,29.8,4.2,Female,No,Thur,Lunch,6,501602536036
141,34.3,6.7,Male,No,Thur,Lunch,6,390347789189
142,41.19,5.0,Male,No,Thur,Lunch,5,860638290289
143,27.05,5.0,Female,No,Thur,Lunch,6,980156477617
155,29.85,5.14,Female,No,Sun,Dinner,5,821857251373
156,48.17,5.0,Male,No,Sun,Dinner,6,469174814331
185,20.69,5.0,Male,No,Sun,Dinner,5,454451866736
187,30.46,2.0,Male,Yes,Sun,Dinner,5,546769519333
216,28.15,3.0,Male,Yes,Sat,Dinner,5,166962488549


###Filter by Multiple Conditions

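Use the AND operator (&) when BOTH conditions must be true.

Use the OR operator (|) when it is enough for EITHER condition to be true.

Wrap each condition in parentheses, because & and | bind more tightly than comparison operators such as > and ==.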
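
A small sketch of both operators, plus `~` for negation, is shown here. The rows in `sample` are hypothetical values shaped like the tips dataset, chosen only to make the counts easy to check.

```python
import pandas as pd

# Hypothetical sample rows shaped like the tips dataset.
sample = pd.DataFrame({
    'total_bill': [48.27, 41.19, 44.30, 10.34],
    'sex': ['Male', 'Male', 'Female', 'Male'],
})

# Each condition sits in its own parentheses before being combined.
both = sample[(sample['total_bill'] > 40) & (sample['sex'] == 'Male')]
either = sample[(sample['total_bill'] > 40) | (sample['sex'] == 'Male')]

# ~ negates a condition element by element.
not_male = sample[~(sample['sex'] == 'Male')]

print(len(both), len(either), len(not_male))  # 2 4 1
```

Python's `and` and `or` keywords cannot be used here: they ask for a single truth value, and pandas raises a `ValueError` because the truth value of a whole Series is ambiguous.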

In [13]:
#1 == 1 and 5 == 5 

In [14]:
#Python's `and` keyword does not work element-wise on a Series (it raises a ValueError), so use & instead
df[(df['total_bill'] > 40) & (df['sex'] == 'Male')]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,Aadhar_Number
59,48.27,6.73,Male,No,Sat,Dinner,4,433035530724
95,40.17,4.73,Male,Yes,Fri,Dinner,4,978955749864
142,41.19,5.0,Male,No,Thur,Lunch,5,860638290289
156,48.17,5.0,Male,No,Sun,Dinner,6,469174814331
170,50.81,10.0,Male,Yes,Sat,Dinner,3,262086152866
182,45.35,3.5,Male,Yes,Sun,Dinner,3,219931343583
184,40.55,3.0,Male,Yes,Sun,Dinner,2,900295919016
212,48.33,9.0,Male,No,Sat,Dinner,4,181663571451


###Check against multiple possible values

In [15]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,Aadhar_Number
0,16.99,1.01,Female,No,Sun,Dinner,2,988570798439
1,10.34,1.66,Male,No,Sun,Dinner,3,704978009810
2,21.01,3.5,Male,No,Sun,Dinner,3,247681557945
3,23.68,3.31,Male,No,Sun,Dinner,2,264536142568
4,24.59,3.61,Female,No,Sun,Dinner,4,435259271933


In [16]:
#Return rows where day is Sat or Sun or Fri
df[(df['day'] == 'Sat') | (df['day'] == 'Sun') | (df['day'] == 'Fri')]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,Aadhar_Number
0,16.99,1.01,Female,No,Sun,Dinner,2,988570798439
1,10.34,1.66,Male,No,Sun,Dinner,3,704978009810
2,21.01,3.50,Male,No,Sun,Dinner,3,247681557945
3,23.68,3.31,Male,No,Sun,Dinner,2,264536142568
4,24.59,3.61,Female,No,Sun,Dinner,4,435259271933
...,...,...,...,...,...,...,...,...
238,35.83,4.67,Female,No,Sat,Dinner,3,942880169005
239,29.03,5.92,Male,No,Sat,Dinner,3,375401151059
240,27.18,2.00,Female,Yes,Sat,Dinner,2,125866245699
241,22.67,2.00,Male,Yes,Sat,Dinner,2,309704363004


In [17]:
#When every comparison is against the same column, isin() is more concise
options = ['Fri', 'Sat', 'Sun']
df['day'].isin(options)
#df[df['day'].isin(options)]

0       True
1       True
2       True
3       True
4       True
       ...  
239     True
240     True
241     True
242     True
243    False
Name: day, Length: 244, dtype: bool
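
As a quick check, the following sketch compares `isin()` with the chained `|` version and shows how to negate the mask. The five-row `sample` frame is a hypothetical stand-in for the `day` column.

```python
import pandas as pd

sample = pd.DataFrame({'day': ['Thur', 'Fri', 'Sat', 'Sun', 'Thur']})
options = ['Fri', 'Sat', 'Sun']

# isin() gives the same mask as chaining == comparisons with |.
mask_isin = sample['day'].isin(options)
mask_chain = (
    (sample['day'] == 'Fri') | (sample['day'] == 'Sat') | (sample['day'] == 'Sun')
)
print(mask_isin.tolist() == mask_chain.tolist())  # True

# Negating the mask keeps the rows whose day is NOT in the list.
others = sample[~mask_isin]
print(others['day'].tolist())  # ['Thur', 'Thur']
```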

##Method Calls in Pandas

Documentation Link: https://pandas.pydata.org/pandas-docs/stable/reference/index.html

###Apply Function .apply()
Apply any custom Python function of our own to every element in a Series

####Apply on single column

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   total_bill     244 non-null    float64
 1   tip            244 non-null    float64
 2   sex            244 non-null    object 
 3   smoker         244 non-null    object 
 4   day            244 non-null    object 
 5   time           244 non-null    object 
 6   size           244 non-null    int64  
 7   Aadhar_Number  244 non-null    int64  
dtypes: float64(2), int64(2), object(4)
memory usage: 15.4+ KB


In [19]:
#Aadhar_Number[0][0] would fail because an integer cannot be indexed; convert it to a string first
str(Aadhar_Number[0])[0]

'9'

In [20]:
str(Aadhar_Number[0])[:4]

'9885'

In [21]:
def Aadhar_series(num):
  return str(num)[0]

In [22]:
Aadhar_series(213421352643)

'2'

In [23]:
df['Aadhar_Number'].apply(Aadhar_series)

0      9
1      7
2      2
3      2
4      4
      ..
239    3
240    1
241    3
242    9
243    9
Name: Aadhar_Number, Length: 244, dtype: object

In [24]:
df['Aadhar_Series'] = df['Aadhar_Number'].apply(Aadhar_series)
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,Aadhar_Number,Aadhar_Series
0,16.99,1.01,Female,No,Sun,Dinner,2,988570798439,9
1,10.34,1.66,Male,No,Sun,Dinner,3,704978009810,7
2,21.01,3.5,Male,No,Sun,Dinner,3,247681557945,2
3,23.68,3.31,Male,No,Sun,Dinner,2,264536142568,2
4,24.59,3.61,Female,No,Sun,Dinner,4,435259271933,4
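
For simple string operations like this one, pandas also offers the `.str` accessor, which can replace the custom function. The sketch below is an alternative to the notebook's approach, not a replacement for it; `numbers` and `first_digit` are hypothetical names that mirror `Aadhar_Number` and `Aadhar_series`.

```python
import pandas as pd

numbers = pd.Series([988570798439, 704978009810, 247681557945])

def first_digit(num):
    # Same logic as Aadhar_series: first character of the number as text.
    return str(num)[0]

# Option 1: apply the custom function to every element.
via_apply = numbers.apply(first_digit)

# Option 2: convert to strings once, then slice with the .str accessor.
via_str = numbers.astype(str).str[0]

print(via_str.tolist())  # ['9', '7', '2']
```

`.apply()` remains the more general tool, since it accepts any Python function.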


In [25]:
#df['total_bill']
df['total_bill'].mean()

19.785942622950824

Defining a function that labels how expensive each bill is.

In [26]:
def Pricy(price):
  #Label a bill as cheap ($), moderate ($$) or expensive ($$$)
  if price < 10:
    return '$'
  elif price < 30:
    return '$$'
  else:
    return '$$$'

In [27]:
df['Costly'] = df['total_bill'].apply(Pricy)
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,Aadhar_Number,Aadhar_Series,Costly
0,16.99,1.01,Female,No,Sun,Dinner,2,988570798439,9,$$
1,10.34,1.66,Male,No,Sun,Dinner,3,704978009810,7,$$
2,21.01,3.5,Male,No,Sun,Dinner,3,247681557945,2,$$
3,23.68,3.31,Male,No,Sun,Dinner,2,264536142568,2,$$
4,24.59,3.61,Female,No,Sun,Dinner,4,435259271933,4,$$
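
Binning a numeric column into labelled ranges can also be done with `pd.cut`. The sketch below assumes the same thresholds as the function above (10 and 30) and uses hypothetical bill values; `right=False` makes each bin include its left edge, which matches the `<` comparisons.

```python
import numpy as np
import pandas as pd

bills = pd.Series([8.77, 10.00, 16.99, 30.00, 48.27])

def pricy(price):
    # Same thresholds as the Pricy function in the notebook.
    if price < 10:
        return '$'
    elif price < 30:
        return '$$'
    return '$$$'

# Bins are [-inf, 10), [10, 30), [30, inf) because right=False.
binned = pd.cut(
    bills,
    bins=[-np.inf, 10, 30, np.inf],
    labels=['$', '$$', '$$$'],
    right=False,
)
print(binned.tolist())  # ['$', '$$', '$$', '$$$', '$$$']
```

Note that `pd.cut` returns a categorical Series rather than plain strings, which may or may not be what a later step expects.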


####Apply on multiple columns

- using Lambda Function
- using NumPy Vectorize

Reviewing Lambda Function

In [28]:
def simple(num):
  return num*2

In [29]:
simple(3)

6

A lambda is an anonymous function: it is defined inline in a single expression and has no name.

In [30]:
lambda num: num * 2

<function __main__.<lambda>>

In [31]:
df['total_bill'].apply(lambda bill: bill * 2)

0      33.98
1      20.68
2      42.02
3      47.36
4      49.18
       ...  
239    58.06
240    54.36
241    45.34
242    35.64
243    37.56
Name: total_bill, Length: 244, dtype: float64

https://stackoverflow.com/questions/19914937/applying-function-with-multiple-arguments-to-create-a-new-pandas-column

In [32]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,Aadhar_Number,Aadhar_Series,Costly
0,16.99,1.01,Female,No,Sun,Dinner,2,988570798439,9,$$
1,10.34,1.66,Male,No,Sun,Dinner,3,704978009810,7,$$
2,21.01,3.5,Male,No,Sun,Dinner,3,247681557945,2,$$
3,23.68,3.31,Male,No,Sun,Dinner,2,264536142568,2,$$
4,24.59,3.61,Female,No,Sun,Dinner,4,435259271933,4,$$


Is tip generous or not?

In [33]:
def quality(total_bill, tip):
  if tip/total_bill > 0.2:
    return 'Generous'
  else:
    return 'Other'

In [34]:
quality(20,6)

'Generous'

Apply the function to the DataFrame row by row using a lambda function (axis = 1 passes each row to the lambda)

In [35]:
df[['total_bill', 'tip']].apply(lambda row: quality(row['total_bill'], row['tip']), axis = 1)

0         Other
1         Other
2         Other
3         Other
4         Other
         ...   
239    Generous
240       Other
241       Other
242       Other
243       Other
Length: 244, dtype: object

In [36]:
df['Quality_of_tip'] = df[['total_bill', 'tip']].apply(lambda row: quality(row['total_bill'], row['tip']), axis = 1)
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,Aadhar_Number,Aadhar_Series,Costly,Quality_of_tip
0,16.99,1.01,Female,No,Sun,Dinner,2,988570798439,9,$$,Other
1,10.34,1.66,Male,No,Sun,Dinner,3,704978009810,7,$$,Other
2,21.01,3.5,Male,No,Sun,Dinner,3,247681557945,2,$$,Other
3,23.68,3.31,Male,No,Sun,Dinner,2,264536142568,2,$$,Other
4,24.59,3.61,Female,No,Sun,Dinner,4,435259271933,4,$$,Other


Apply the function to the DataFrame using np.vectorize

In [37]:
df['Quality_of_tip'] = np.vectorize(quality)(df['total_bill'], df['tip'])
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,Aadhar_Number,Aadhar_Series,Costly,Quality_of_tip
0,16.99,1.01,Female,No,Sun,Dinner,2,988570798439,9,$$,Other
1,10.34,1.66,Male,No,Sun,Dinner,3,704978009810,7,$$,Other
2,21.01,3.5,Male,No,Sun,Dinner,3,247681557945,2,$$,Other
3,23.68,3.31,Male,No,Sun,Dinner,2,264536142568,2,$$,Other
4,24.59,3.61,Female,No,Sun,Dinner,4,435259271933,4,$$,Other


What is NumPy Vectorize and how fast is it?

https://stackoverflow.com/questions/3379301/using-numpy-vectorize-on-functions-that-return-vectors

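np.vectorize is a convenience function, not a performance tool: internally it is essentially a Python for loop over the elements. It can still be faster than DataFrame.apply with axis = 1 (as the timing below shows), because it avoids building a Series for every row, but it does not run at the speed of native NumPy operations. If np.vectorize isn't convenient to use, simply write your own function that works as you wish.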

The purpose of np.vectorize is to transform functions which are not numpy-aware (e.g. take floats as input and return floats as output) into functions that can operate on (and return) numpy arrays.
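
For comparison, a fully vectorized version of the same rule can be written with `np.where`, which works on whole columns at once instead of calling a Python function per element. This is a sketch on hypothetical sample values, with `quality` copied from the cell above.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'total_bill': [16.99, 29.03, 20.00, 10.00],
    'tip': [1.01, 5.92, 6.00, 1.50],
})

def quality(total_bill, tip):
    if tip / total_bill > 0.2:
        return 'Generous'
    return 'Other'

# np.vectorize wraps the scalar function and calls it once per element.
looped = np.vectorize(quality)(sample['total_bill'], sample['tip'])

# np.where evaluates the condition on whole columns in one step.
vectorized = np.where(
    sample['tip'] / sample['total_bill'] > 0.2, 'Generous', 'Other'
)

print(vectorized.tolist())  # ['Other', 'Generous', 'Generous', 'Other']
```

The `np.where` form only works when the rule can be expressed with array operations; `np.vectorize` and `.apply()` accept arbitrary Python logic.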

Checking the time difference between the lambda and vectorize methods

In [38]:
import time

In [39]:
#Time the row-wise apply version
tic = time.perf_counter()
df['Quality_of_tip'] = df[['total_bill', 'tip']].apply(lambda row: quality(row['total_bill'], row['tip']), axis = 1)
toc = time.perf_counter()
print("Lambda Version:"+str(1000*(toc-tic))+"ms")
#Time the np.vectorize version
tic = time.perf_counter()
df['Quality_of_tip'] = np.vectorize(quality)(df['total_bill'], df['tip'])
toc = time.perf_counter()
print("Vectorize Version:"+str(1000*(toc-tic))+"ms")

Lambda Version:9.551763534545898ms
Vectorize Version:1.0654926300048828ms
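
A single measurement like this can vary a lot from run to run. One way to get steadier numbers is the standard-library `timeit` module, which repeats each call many times. The sketch below assumes a randomly generated frame of the same size (244 rows) in place of the tips data; the function names are hypothetical, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

# Random stand-in for the tips data, same number of rows.
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    'total_bill': rng.uniform(5, 50, size=244),
    'tip': rng.uniform(1, 10, size=244),
})

def quality(total_bill, tip):
    if tip / total_bill > 0.2:
        return 'Generous'
    return 'Other'

def with_apply():
    return sample.apply(
        lambda row: quality(row['total_bill'], row['tip']), axis=1
    )

def with_vectorize():
    return np.vectorize(quality)(sample['total_bill'], sample['tip'])

# Run each function 20 times per round, 3 rounds, and keep the best round.
for fn in (with_apply, with_vectorize):
    best = min(timeit.repeat(fn, number=20, repeat=3)) / 20
    print(f"{fn.__name__}: {1000 * best:.3f} ms per call")
```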
