# Keywords
pandas, filtering, query, groupby, group by grouped data, comprehensions

# References
[filtering examples](https://towardsdatascience.com/20-examples-to-master-filtering-pandas-dataframes-df6fabfe126f)
[groupby](https://towardsdatascience.com/11-examples-to-master-pandas-groupby-function-86e0de574f38)
[group by grouped data](https://towardsdatascience.com/2-useful-code-snippets-for-aggregated-data-880d5d263a3b)
[comprehensions](https://python.plainenglish.io/comprehensions-in-python-a244e55aa2e5)
[*args, **kwargs](https://towardsdatascience.com/10-examples-to-master-args-and-kwargs-in-python-6f1e8cc30749)

# Python, pandas filtering 

In [13]:
import os
from datetime import datetime
import numpy as np
import pandas as pd
import seaborn as sns

In [2]:
# settings
import warnings

pd.set_option("display.precision", 2)
warnings.filterwarnings("ignore")

In [3]:
# create data
df = pd.DataFrame({
    "name": ["John","Jane","Emily","Lisa","Matt"],
    "note": [92,94,87,82,90],
    "profession":["Electrical engineer","Mechanical engineer",
                  "Data scientist","Accountant","Athlete"],
    "date_of_birth":["1998-11-01","2002-08-14","1996-01-12",
                     "2002-10-24","2004-04-05"],
    "group":["A","B","B","A","C"]})

In [4]:
# quick check
df

Unnamed: 0,name,note,profession,date_of_birth,group
0,John,92,Electrical engineer,1998-11-01,A
1,Jane,94,Mechanical engineer,2002-08-14,B
2,Emily,87,Data scientist,1996-01-12,B
3,Lisa,82,Accountant,2002-10-24,A
4,Matt,90,Athlete,2004-04-05,C


In [6]:
# Example 1: Selecting a subset of columns
df[["name","note"]]

Unnamed: 0,name,note
0,John,92
1,Jane,94
2,Emily,87
3,Lisa,82
4,Matt,90


In [7]:
# Example 2: Selecting a subset of rows and columns with loc
df.loc[:3, ["name","note"]]

Unnamed: 0,name,note
0,John,92
1,Jane,94
2,Emily,87
3,Lisa,82


In [None]:
# Example 3: Selecting a subset of rows and columns with iloc
df.iloc[:3, 2]
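
Examples 2 and 3 slice rows differently: `loc` slices by label and includes the end label, while `iloc` slices by integer position and excludes the end position. A minimal sketch on a small frame (the `demo` name and its values are illustrative, not the notebook's own data):

```python
import pandas as pd

# Hypothetical three-column frame with the default 0..4 integer index
demo = pd.DataFrame({
    "name": ["John", "Jane", "Emily", "Lisa", "Matt"],
    "note": [92, 94, 87, 82, 90],
    "group": ["A", "B", "B", "A", "C"],
})

# loc is label-based: the slice :3 includes the row labelled 3 (4 rows)
by_label = demo.loc[:3, ["name", "note"]]

# iloc is position-based: the slice :3 stops before position 3 (3 rows)
by_position = demo.iloc[:3, 2]

print(len(by_label), len(by_position))
```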

In [8]:
# Example 4: Using a comparison operator on column values
df[df.note > 90]

Unnamed: 0,name,note,profession,date_of_birth,group
0,John,92,Electrical engineer,1998-11-01,A
1,Jane,94,Mechanical engineer,2002-08-14,B


In [9]:
# Example 5: Using a comparison operator with strings
df[df.name=="John"]

Unnamed: 0,name,note,profession,date_of_birth,group
0,John,92,Electrical engineer,1998-11-01,A


In [None]:
# Example 6: String condition with str accessor
df[df.profession.str.contains("engineer")]

In [None]:
# Example 7: Another string condition with str accessor
df[df.name.str.startswith("L")]

In [None]:
# Example 8: Multiple str methods
df[df.name.str.lower().str.startswith("l")]

In [10]:
# Example 9: Tilde (~) operator
df[~df.profession.str.contains("engineer")]

Unnamed: 0,name,note,profession,date_of_birth,group
2,Emily,87,Data scientist,1996-01-12,B
3,Lisa,82,Accountant,2002-10-24,A
4,Matt,90,Athlete,2004-04-05,C


In [11]:
# Example 10: The dt accessor
df.date_of_birth = df.date_of_birth.astype("datetime64[ns]")

print(df[df.date_of_birth.dt.month==11])
print(df[df.date_of_birth.dt.year > 2000])

Unnamed: 0,name,note,profession,date_of_birth,group
0,John,92,Electrical engineer,1998-11-01,A


In [12]:
# Example 12: Multiple conditions (and)
df[(df.date_of_birth.dt.year > 2000) &  
   (df.profession.str.contains("engineer"))]

Unnamed: 0,name,note,profession,date_of_birth,group
1,Jane,94,Mechanical engineer,2002-08-14,B


In [13]:
# Example 13: Multiple conditions (or)
df[(df.note > 90) | (df.profession=="Data scientist")]

Unnamed: 0,name,note,profession,date_of_birth,group
0,John,92,Electrical engineer,1998-11-01,A
1,Jane,94,Mechanical engineer,2002-08-14,B
2,Emily,87,Data scientist,1996-01-12,B


In [14]:
# Example 14: The isin method
df[df.group.isin(["A","C"])]

Unnamed: 0,name,note,profession,date_of_birth,group
0,John,92,Electrical engineer,1998-11-01,A
3,Lisa,82,Accountant,2002-10-24,A
4,Matt,90,Athlete,2004-04-05,C


In [16]:
# Example 15: The query function

print(df.query("note > 90"))
print(df.query("group=='A' and note > 89"))

   name  note           profession date_of_birth group
0  John    92  Electrical engineer    1998-11-01     A
1  Jane    94  Mechanical engineer    2002-08-14     B
   name  note           profession date_of_birth group
0  John    92  Electrical engineer    1998-11-01     A
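
`query` can also reference Python variables by prefixing them with `@`, and it selects the same rows as the equivalent boolean mask. A small sketch, assuming a hypothetical `scores` frame and `threshold` variable that are not part of the notebook above:

```python
import pandas as pd

# Hypothetical data, used only for this illustration
scores = pd.DataFrame({
    "name": ["John", "Jane", "Lisa", "Matt"],
    "note": [92, 94, 82, 90],
    "group": ["A", "B", "A", "C"],
})

threshold = 89  # local variable referenced inside the query string with @

# Same filter written two ways
via_query = scores.query("group == 'A' and note > @threshold")
via_mask = scores[(scores.group == "A") & (scores.note > threshold)]

print(via_query)
```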


In [17]:
# Example 17: The nsmallest function
df.nsmallest(2, "note")

Unnamed: 0,name,note,profession,date_of_birth,group
3,Lisa,82,Accountant,2002-10-24,A
2,Emily,87,Data scientist,1996-01-12,B


In [18]:
# Example 18: The nlargest function
df.nlargest(2, "note")

Unnamed: 0,name,note,profession,date_of_birth,group
1,Jane,94,Mechanical engineer,2002-08-14,B
0,John,92,Electrical engineer,1998-11-01,A


In [19]:
# Example 19: The isna function
df.loc[0, "profession"] = np.nan

df[df.profession.isna()]

Unnamed: 0,name,note,profession,date_of_birth,group
0,John,92,,1998-11-01,A


In [20]:
# Example 20: The notna function

df[df.profession.notna()]

Unnamed: 0,name,note,profession,date_of_birth,group
1,Jane,94,Mechanical engineer,2002-08-14,B
2,Emily,87,Data scientist,1996-01-12,B
3,Lisa,82,Accountant,2002-10-24,A
4,Matt,90,Athlete,2004-04-05,C


# Pandas groupby examples

In [26]:
# load data

# path = 'e:\PycharmProjects\TimeSeries'
path = os.path.abspath(os.getcwd())

# load churn modelling data
churn_modelling_file = 'Churn_Modelling.csv'
path_to_churn_modelling_file = os.path.join(path, 'data', churn_modelling_file)
cm_df = pd.read_csv(path_to_churn_modelling_file).dropna()
df = cm_df

In [23]:
cm_df.sample(5)

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
2020,2021,15565779,Kent,627,Germany,Female,30,6,57809.32,1,1,0,188258.49,0
7959,7960,15686999,Nicholas,556,France,Female,40,8,0.0,2,1,0,62112.7,0
8014,8015,15644295,Hargreaves,731,Spain,Female,39,2,126816.18,1,1,1,74850.93,0
4608,4609,15614103,Colombo,850,Germany,Male,42,8,119839.69,1,0,1,51016.02,1
3083,3084,15814816,Kambinachi,466,France,Male,40,4,91592.06,1,1,0,141210.18,1


In [29]:
#example 1
df[['Gender','Exited']].groupby('Gender').count()

Unnamed: 0_level_0,Exited
Gender,Unnamed: 1_level_1
Female,4543
Male,5457


In [28]:
#example 2
df[['Gender','Exited']].groupby('Gender').agg(['mean','count'])

Unnamed: 0_level_0,Exited,Exited
Unnamed: 0_level_1,mean,count
Gender,Unnamed: 1_level_2,Unnamed: 2_level_2
Female,0.25,4543
Male,0.16,5457


In [30]:
#example 3
df[['Gender','Geography','Exited']].groupby(['Gender','Geography']).mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,Exited
Gender,Geography,Unnamed: 2_level_1
Female,France,0.2
Female,Germany,0.38
Female,Spain,0.21
Male,France,0.13
Male,Germany,0.28
Male,Spain,0.13


In [31]:
#example 4
df[['Gender','Geography','Exited']].groupby(['Gender','Geography']).mean().sort_values(by='Exited')

Unnamed: 0_level_0,Unnamed: 1_level_0,Exited
Gender,Geography,Unnamed: 2_level_1
Male,France,0.13
Male,Spain,0.13
Female,France,0.2
Female,Spain,0.21
Male,Germany,0.28
Female,Germany,0.38


In [32]:
#example 5
df[['Gender','Geography','Exited']].groupby(['Gender','Geography']).mean().sort_values(by='Exited', ascending=False)

Unnamed: 0_level_0,Unnamed: 1_level_0,Exited
Gender,Geography,Unnamed: 2_level_1
Female,Germany,0.38
Male,Germany,0.28
Female,Spain,0.21
Female,France,0.2
Male,Spain,0.13
Male,France,0.13


In [33]:
#example 6
df[['Geography','Age','Tenure']].groupby(['Geography']).agg(['mean','max'])

Unnamed: 0_level_0,Age,Age,Tenure,Tenure
Unnamed: 0_level_1,mean,max,mean,max
Geography,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
France,38.51,92,5.0,10
Germany,39.77,84,5.01,10
Spain,38.89,88,5.03,10


In [34]:
#example 7
df[['Exited','Geography','Age','Tenure']].groupby(['Exited','Geography']).agg(['mean','count'])

Unnamed: 0_level_0,Unnamed: 1_level_0,Age,Age,Tenure,Tenure
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,count,mean,count
Exited,Geography,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
0,France,37.24,4204,5.01,4204
0,Germany,37.31,1695,5.01,1695
0,Spain,37.84,2064,5.11,2064
1,France,45.13,810,5.0,810
1,Germany,44.89,814,5.01,814
1,Spain,44.15,413,4.66,413


In [35]:
#example 8
df[['Exited','Geography','Age','Tenure']].groupby(['Exited','Geography']).agg(['mean','count']).sort_values(by=[('Age','mean')])

Unnamed: 0_level_0,Unnamed: 1_level_0,Age,Age,Tenure,Tenure
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,count,mean,count
Exited,Geography,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
0,France,37.24,4204,5.01,4204
0,Germany,37.31,1695,5.01,1695
0,Spain,37.84,2064,5.11,2064
1,Spain,44.15,413,4.66,413
1,Germany,44.89,814,5.01,814
1,France,45.13,810,5.0,810


In [36]:
#example 9
df[['Exited','IsActiveMember','NumOfProducts','Balance']].groupby(['Exited','IsActiveMember'], as_index=False).mean()

Unnamed: 0,Exited,IsActiveMember,NumOfProducts,Balance
0,0,0,1.55,72048.82
1,0,1,1.54,73304.72
2,1,0,1.44,90988.81
3,1,1,1.53,91320.64


In [38]:
#example 10
df.loc[df.index[30:50], 'Geography'] = np.nan
df[['Geography','Exited']].groupby('Geography').mean()

Unnamed: 0_level_0,Exited
Geography,Unnamed: 1_level_1
France,0.16
Germany,0.32
Spain,0.17


In [41]:
#example 11
df[['Geography','Exited']].groupby('Geography', dropna=False).agg(['mean','count'])

Unnamed: 0_level_0,Exited,Exited
Unnamed: 0_level_1,mean,count
Geography,Unnamed: 1_level_2,Unnamed: 2_level_2
France,0.16,5008
Germany,0.32,2502
Spain,0.17,2470
,0.3,20


In [14]:
# example 12

df = sns.load_dataset('taxis')
df.head(2)

Unnamed: 0,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough
0,2019-03-23 20:21:09,2019-03-23 20:27:24,1,1.6,7.0,2.15,0.0,12.95,yellow,credit card,Lenox Hill West,UN/Turtle Bay South,Manhattan,Manhattan
1,2019-03-04 16:11:55,2019-03-04 16:19:00,1,0.79,5.0,0.0,0.0,9.3,yellow,cash,Upper West Side South,Upper West Side South,Manhattan,Manhattan


In [15]:
# example 12 :: group and count
df.groupby(['payment']).pickup_borough.value_counts()

payment      pickup_borough
cash         Manhattan         1397
             Queens             266
             Brooklyn           119
             Bronx               25
credit card  Manhattan         3839
             Queens             383
             Brooklyn           261
             Bronx               74
Name: pickup_borough, dtype: int64

In [19]:
# example 12 :: group, percentage, rename
df.groupby(['payment']).pickup_borough.value_counts(normalize=True)\
.rename('freq_count')\
.reset_index()


Unnamed: 0,payment,pickup_borough,freq_count
0,cash,Manhattan,0.77
1,cash,Queens,0.15
2,cash,Brooklyn,0.07
3,cash,Bronx,0.01
4,credit card,Manhattan,0.84
5,credit card,Queens,0.08
6,credit card,Brooklyn,0.06
7,credit card,Bronx,0.02


In [21]:
# Example 13 :: Group the grouped data

# Create a dataset
df2 = pd.DataFrame({'id': [1,1,2,2,3,3,3,3,3,3,4,4,4,6,7,7,8,8,5,5],
                    'register': [2,2,2,2,4,4,4,4,4,4,1,1,1,1,1,1,2,2,2,2],
                    'amount': np.random.randint(2,20,20)})

# Sort
df2 = df2.sort_values(by='id').reset_index(drop=True)

# Step 1: Group by register and ID and sum the amount.
# Step 2: group the result by register and calculate the mean.

df2.groupby(['register','id']).amount.sum()\
.groupby('register').mean()

register
1    18.67
2    15.75
4    52.00
Name: amount, dtype: float64
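
Because `amount` is drawn at random, the numbers above change on every run. The same two-step pattern can be checked by hand on fixed values; the `fixed` frame below is a hypothetical example, not the data used above:

```python
import pandas as pd

# Hypothetical fixed data so the result can be verified by hand
fixed = pd.DataFrame({
    "id":       [1, 1, 2, 3, 3, 4],
    "register": [2, 2, 2, 4, 4, 1],
    "amount":   [5, 7, 10, 3, 9, 6],
})

# Step 1: total amount per (register, id) pair
per_id = fixed.groupby(["register", "id"]).amount.sum()

# Step 2: average of those totals within each register
# (level= names the index level of the grouped result explicitly)
per_register = per_id.groupby(level="register").mean()

print(per_register)
```

Register 2 has two ids with totals 12 and 10, so its mean is 11; registers 4 and 1 each have a single id, so their means equal that id's total.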

# Comprehensions

## List comprehensions

In [42]:
# simple
num = [i*2 for i in range(100)]
print(num)

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98, 100, 102, 104, 106, 108, 110, 112, 114, 116, 118, 120, 122, 124, 126, 128, 130, 132, 134, 136, 138, 140, 142, 144, 146, 148, 150, 152, 154, 156, 158, 160, 162, 164, 166, 168, 170, 172, 174, 176, 178, 180, 182, 184, 186, 188, 190, 192, 194, 196, 198]


In [None]:
# if condition
even = [i for i in range(10) if i%2 ==0]
print(even)

In [5]:
# nested for loops
mat = [[j for j in range(5)] for i in range(3)]
print(mat)

[[0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4]]
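
The nested comprehension reads inside-out: the outer `for i` runs once per row, and the inner comprehension builds that row. An equivalent sketch with explicit loops (the names `mat_loop` and `flat` are illustrative):

```python
# Explicit-loop version of [[j for j in range(5)] for i in range(3)]
mat_loop = []
for i in range(3):          # one iteration per row
    row = []
    for j in range(5):      # one iteration per column
        row.append(j)
    mat_loop.append(row)

# A single comprehension with two for clauses flattens instead of nesting
flat = [j for i in range(3) for j in range(5)]

print(mat_loop)
print(len(flat))
```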


## Dict comprehensions

In [6]:
# With a single for loop
square= {i:i**2 for i in range(10)}
print(square)

{0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36, 7: 49, 8: 64, 9: 81}


In [7]:
# With an If condition
even_square= {i:i**2 for i in range(10) if i%2 ==0}
print(even_square)

{0: 0, 2: 4, 4: 16, 6: 36, 8: 64}


In [8]:
# With an if-else condition
num = [0,1,2,3,4,5,6,7,8,9]
dic = {i : 'even' if i%2==0 else 'odd' for i in num}
print(dic)

{0: 'even', 1: 'odd', 2: 'even', 3: 'odd', 4: 'even', 5: 'odd', 6: 'even', 7: 'odd', 8: 'even', 9: 'odd'}


## Set Comprehensions

In [11]:
# simple for loop
square = {i**2 for i in range(10)}
print(square)

{0, 1, 64, 4, 36, 9, 16, 49, 81, 25}


In [10]:
# With an If condition
even= {i for i in range(10) if i%2 ==0}
print(even)

{0, 2, 4, 6, 8}


# *args, **kwargs

To summarize:

- A function has two types of arguments: positional arguments (declared by a name only) and keyword arguments (declared by a name and a default value).
- When a function is called, values for positional arguments must be given.
- Keyword arguments are optional (they take the default value if not specified).
- `*args` collects the positional arguments that are not explicitly defined and stores them in a tuple.
- `**kwargs` does the same as `*args` but for keyword arguments. They are stored in a dictionary because keyword arguments are passed as name-value pairs.
- Python does not allow positional arguments to follow keyword arguments, so we first declare positional arguments and then keyword arguments.
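
The points above can be sketched with one small function. The `order` function and its parameters are hypothetical, chosen only to show where each kind of argument ends up:

```python
def order(item, quantity=1, *args, **kwargs):
    """Hypothetical function used only to show how arguments are collected."""
    return {"item": item, "quantity": quantity, "args": args, "kwargs": kwargs}

# Only the required positional argument: quantity falls back to its default
basic = order("tea")

# Extra positional values land in the args tuple,
# extra keyword values land in the kwargs dictionary
full = order("tea", 2, "gift wrap", "express", note="fragile", priority=True)

# A positional argument after a keyword argument is rejected at parse time
rejected = False
try:
    compile('order(quantity=2, "tea")', "<example>", "eval")
except SyntaxError:
    rejected = True

print(basic)
print(full)
print(rejected)
```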

