# Conditional selection

We could also write the condition with bracket notation, df[df['height'] < 1.8], but we don't need to do it here, as the column name doesn't contain any whitespace.

If we need to combine several conditions, we use the following Boolean operators:

- `&` for "and"
- `|` (vertical line) for "or"
- `~` for "not"
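- `>, <, >=, <=, ==, !=` for comparisons.

Don't forget the parentheses around each condition: `&` and `|` bind more tightly than the comparison operators, so without parentheses the expression is evaluated in the wrong order: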

df[(df.first_name == 'Michael') & (df.birthday == '17.02.1963')]

In [None]:
df[((df.first_name == 'Michael') | (df.first_name == 'John'))
   & (df.height >= 1.8)
   & (df.last_name != 'Jordan')]

If we want to invert the filter, in other words, to select everything except the rows that match the conditions, we can put a tilde character ~ in front and wrap the whole condition in an extra pair of parentheses:

In [None]:
df[~(((df.first_name == 'Michael') | (df.first_name == 'John'))
   & (df.height >= 1.8)
   & (df.last_name != 'Jordan'))]
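
To see what the combined mask and its negation select, here is a minimal sketch. The `people` table and its values are hypothetical, invented only for illustration; the column names mirror the ones used above.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
people = pd.DataFrame({
    'first_name': ['Michael', 'John', 'Michael', 'Anna'],
    'last_name': ['Jordan', 'Smith', 'Brown', 'Lee'],
    'height': [1.98, 1.75, 1.85, 1.70],
})

# Each comparison sits in its own parentheses before being combined.
mask = (((people.first_name == 'Michael') | (people.first_name == 'John'))
        & (people.height >= 1.8)
        & (people.last_name != 'Jordan'))

selected = people[mask]    # rows that satisfy every condition
excluded = people[~mask]   # all remaining rows

print(selected)
print(excluded)
```

Storing the mask in a variable first keeps the selection and its negation in sync, since both are built from the same expression.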

In [8]:
import pandas as pd

chart_dict = {"artist": {0: "The Weeknd", 1: "Dua Lipa", 2: "Roddy Ricch", 3: "Post Malone", 4: "Harry Styles", 5: "Tones And I", 6: "Future & Drake", 7: "Lewis Capaldi"}, "song": {0: "Blinding Lights", 1: "Don't Start Now", 2: "The Box", 3: "Circles", 4: "Adore You", 5: "Dance Monkey", 6: "Life Is Good", 7: "Before You Go"}, "peak_us": {0: 1, 1: 2, 2: 1, 3: 1, 4: 6, 5: 4, 6: 2, 7: 9}, "peak_uk": {0: "1", 1: "3", 2: "2", 3: "19", 4: "7", 5: "-", 6: "3", 7: "1"}, "peak_de": {0: "1", 1: "10", 2: "12", 3: "37", 4: "83", 5: "1", 6: "8", 7: "42"}, "peak_fr": {0: "1", 1: "12", 2: "7", 3: "77", 4: "126", 5: "3", 6: "33", 7: "49"}, "peak_ca": {0: 1, 1: 3, 2: 1, 3: 3, 4: 10, 5: 1, 6: 3, 7: 16}, "result_place": {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8}}

chart2020 = pd.DataFrame(chart_dict)

# your code here
df = chart2020[["artist","song"]]
df.tail(3)

Unnamed: 0,artist,song
5,Tones And I,Dance Monkey
6,Future & Drake,Life Is Good
7,Lewis Capaldi,Before You Go


# Combining Data in Pandas

- pandas offers several ways to combine data. In this topic, you will learn how to join DataFrame and Series objects using the concat() and merge() functions.

## Concatenating objects

- The concat() function is used to concatenate or glue multiple objects together along a horizontal or vertical axis. To do so, we need to pass multiple Series or DataFrame objects as arguments to the function. But first, we will create two DataFrame instances — the tables that store students' names and their results after running distances of 100 meters and 2 kilometers respectively:

In [4]:
import pandas as pd 
junior_class = pd.DataFrame({'Name': ['Ann', 'Kate', 'George', 'Eric'],
                             '100m (sec.)': ['16.3', '17.1', '14.8', '14.3'],
                             '2km (min., sec.)': ['9,24', '9,45', '9,17', '8,14']},
                            index=[1, 2, 3, 4])
senior_class = pd.DataFrame({'Name': ['Jack', 'Alicia', 'Ella', 'James'],
                             '100m (sec.)': ['15.9', '17.8', '17.0', '15.0'],
                             '2km (min., sec.)': ['8,18', '9,02', '8,58', '7,58']})

pd.concat([junior_class, senior_class])

Unnamed: 0,Name,100m (sec.),"2km (min., sec.)"
1,Ann,16.3,"9,24"
2,Kate,17.1,"9,45"
3,George,14.8,"9,17"
4,Eric,14.3,"8,14"
0,Jack,15.9,"8,18"
1,Alicia,17.8,"9,02"
2,Ella,17.0,"8,58"
3,James,15.0,"7,58"


In [5]:
pd.concat([junior_class, senior_class], axis=1)


Unnamed: 0,Name,100m (sec.),"2km (min., sec.)",Name,100m (sec.),"2km (min., sec.)"
1,Ann,16.3,"9,24",Alicia,17.8,"9,02"
2,Kate,17.1,"9,45",Ella,17.0,"8,58"
3,George,14.8,"9,17",James,15.0,"7,58"
4,Eric,14.3,"8,14",NaN,NaN,NaN
0,NaN,NaN,NaN,Jack,15.9,"8,18"


In [6]:
pd.concat([junior_class, senior_class], ignore_index=True)


Unnamed: 0,Name,100m (sec.),"2km (min., sec.)"
0,Ann,16.3,"9,24"
1,Kate,17.1,"9,45"
2,George,14.8,"9,17"
3,Eric,14.3,"8,14"
4,Jack,15.9,"8,18"
5,Alicia,17.8,"9,02"
6,Ella,17.0,"8,58"
7,James,15.0,"7,58"


In [7]:
pd.concat([junior_class, senior_class], axis=1, join='inner')


Unnamed: 0,Name,100m (sec.),"2km (min., sec.)",Name,100m (sec.),"2km (min., sec.)"
1,Ann,16.3,"9,24",Alicia,17.8,"9,02"
2,Kate,17.1,"9,45",Ella,17.0,"8,58"
3,George,14.8,"9,17",James,15.0,"7,58"


In [8]:
pd.concat([junior_class, senior_class], axis=1, join='outer')


Unnamed: 0,Name,100m (sec.),"2km (min., sec.)",Name,100m (sec.),"2km (min., sec.)"
1,Ann,16.3,"9,24",Alicia,17.8,"9,02"
2,Kate,17.1,"9,45",Ella,17.0,"8,58"
3,George,14.8,"9,17",James,15.0,"7,58"
4,Eric,14.3,"8,14",NaN,NaN,NaN
0,NaN,NaN,NaN,Jack,15.9,"8,18"


In [9]:
pd.concat([junior_class, senior_class], keys=['Jun. class', 'Sen. class'])


Unnamed: 0,Unnamed: 1,Name,100m (sec.),"2km (min., sec.)"
Jun. class,1,Ann,16.3,"9,24"
Jun. class,2,Kate,17.1,"9,45"
Jun. class,3,George,14.8,"9,17"
Jun. class,4,Eric,14.3,"8,14"
Sen. class,0,Jack,15.9,"8,18"
Sen. class,1,Alicia,17.8,"9,02"
Sen. class,2,Ella,17.0,"8,58"
Sen. class,3,James,15.0,"7,58"
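
The concat() variants above differ only in a few keyword arguments. As a rough side-by-side summary, the sketch below applies them to two tiny tables. The `left` and `right` frames are hypothetical and exist only to keep the results short enough to check by eye.

```python
import pandas as pd

# Two tiny hypothetical tables with partly overlapping index labels.
left = pd.DataFrame({'Name': ['Ann', 'Kate']}, index=[1, 2])
right = pd.DataFrame({'Name': ['Jack', 'Alicia']}, index=[0, 1])

# Default: rows are appended and the original index labels are kept.
stacked = pd.concat([left, right])

# ignore_index=True discards the labels and numbers the rows 0..n-1.
renumbered = pd.concat([left, right], ignore_index=True)

# axis=1 places the tables side by side, aligned on the index;
# labels missing from one table are filled with NaN (join='outer').
side_by_side = pd.concat([left, right], axis=1)

# join='inner' keeps only the index labels present in both tables.
common_rows = pd.concat([left, right], axis=1, join='inner')

# keys adds an outer index level that records the source of each row.
labelled = pd.concat([left, right], keys=['left', 'right'])

print(labelled)
```

With keys, the rows of one source can be retrieved afterwards with `labelled.loc['right']`.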


## Merging objects


In [10]:
age_of_participants = pd.DataFrame({'Name': ['Ann', 'Eric', 'Ella'],
                                    'Age': ['16', '16', '18']})

In [11]:
junior_class.merge(age_of_participants)


Unnamed: 0,Name,100m (sec.),"2km (min., sec.)",Age
0,Ann,16.3,"9,24",16
1,Eric,14.3,"8,14",16


In [12]:
junior_class.merge(age_of_participants, how='left')


Unnamed: 0,Name,100m (sec.),"2km (min., sec.)",Age
0,Ann,16.3,"9,24",16
1,Kate,17.1,"9,45",NaN
2,George,14.8,"9,17",NaN
3,Eric,14.3,"8,14",16


In [13]:
junior_class.merge(age_of_participants, how='right')


Unnamed: 0,Name,100m (sec.),"2km (min., sec.)",Age
0,Ann,16.3,"9,24",16
1,Eric,14.3,"8,14",16
2,Ella,NaN,NaN,18
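
The `how` argument decides which keys survive the merge. The sketch below compares the options on two small hypothetical tables (`results` and `ages` are invented for illustration). It also names the key column explicitly with `on` and uses `indicator=True`, which adds a `_merge` column showing where each row came from.

```python
import pandas as pd

# Hypothetical tables that share the key column 'Name'.
results = pd.DataFrame({'Name': ['Ann', 'Kate', 'Eric'],
                        'Time': [16.3, 17.1, 14.3]})
ages = pd.DataFrame({'Name': ['Ann', 'Eric', 'Ella'],
                     'Age': [16, 16, 18]})

# Default how='inner': only names present in both tables.
inner = results.merge(ages, on='Name')

# how='left': every row of the calling table, NaN where no age is known.
left = results.merge(ages, on='Name', how='left')

# how='right': every row of the table passed as the argument.
right = results.merge(ages, on='Name', how='right')

# how='outer': all names from both tables; '_merge' records the origin.
outer = results.merge(ages, on='Name', how='outer', indicator=True)

print(outer)
```

When the key columns have different names in the two tables, `left_on` and `right_on` can be used instead of `on`.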


## Searching within a pandas DataFrame 


- Set occurrence search


In [None]:
df[(df.location == 'Perth') | (df.location == 'Sydney') | (df.location == 'Canberra')]

That notation works, but what if we had about 100 cities to filter? With that many possible cell values, it is more practical to store them in a collection, for example a Python list, and ask for each row: "is this value in the list or not?" This is where the .isin() method comes into play:

In [None]:
desired_cities = ['Perth', 'Sydney', 'Canberra']
df.location.isin(desired_cities)

We get a boolean Series as a result:

`0     True
1     True
2    False
3     True
4     True
5     True
6    False
7     True
Name: location, dtype: bool`

Now, let's use it as a condition to select the required rows:



In [None]:
desired_cities = ['Perth', 'Sydney', 'Canberra']
df[df.location.isin(desired_cities)]

To exclude the rows whose values appear in the list, just place the tilde sign ~ before the expression (as with other negations):

In [None]:
df[~df.location.isin(desired_cities)]
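
As a quick check that .isin() gives the same result as the chained comparisons, here is a small self-contained sketch. The `weather` table is hypothetical; only its `location` column matters here.

```python
import pandas as pd

# Hypothetical weather rows, invented for illustration.
weather = pd.DataFrame({'location': ['Perth', 'Darwin', 'Sydney', 'Hobart'],
                        'max_temp': [25, 33, 22, 17]})
desired_cities = ['Perth', 'Sydney', 'Canberra']

# Boolean Series with one True/False value per row.
mask = weather.location.isin(desired_cities)

# The equivalent chain of comparisons joined with |.
chained = ((weather.location == 'Perth')
           | (weather.location == 'Sydney')
           | (weather.location == 'Canberra'))

in_list = weather[mask]        # rows whose city is in the list
not_in_list = weather[~mask]   # rows whose city is not in the list

print(in_list)
print(not_in_list)
```

A value in the list that never occurs in the column ('Canberra' here) is simply ignored.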


## How to replace a value


In [None]:
df[['min_temp', 'max_temp']].where(df.min_temp > 10)


In [None]:
df[['min_temp', 'max_temp']].where(df.min_temp > 10, 'min temp too low')
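
The .where() method keeps the values in rows where the condition is True and replaces everything else, with NaN by default or with the value given as the second argument. A minimal sketch on a hypothetical `temps` table (the numbers are invented for illustration):

```python
import pandas as pd

# Hypothetical temperatures, invented for illustration.
temps = pd.DataFrame({'min_temp': [8, 12, 15],
                      'max_temp': [20, 25, 30]})

# Rows that fail the condition are replaced with NaN in every column.
kept = temps.where(temps.min_temp > 10)

# The second argument sets the replacement value instead of NaN.
flagged = temps.where(temps.min_temp > 10, 'min temp too low')

print(kept)
print(flagged)
```

Note that the result always has the same shape as the original: rows are not dropped, only their values are replaced.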


## Boolean expression query


In [None]:
df.query("location == 'Perth' & max_temp > 22")
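
The query string is an alternative spelling of an ordinary boolean mask. The sketch below, on a hypothetical `weather` table, shows both forms next to each other. It also uses the `@` prefix, which lets the query refer to a local Python variable (here the invented name `threshold`).

```python
import pandas as pd

# Hypothetical weather rows, invented for illustration.
weather = pd.DataFrame({'location': ['Perth', 'Perth', 'Sydney'],
                        'max_temp': [25, 20, 30]})

# Condition written as a query string.
by_query = weather.query("location == 'Perth' & max_temp > 22")

# The same condition written as a boolean mask.
by_mask = weather[(weather.location == 'Perth') & (weather.max_temp > 22)]

# '@name' inserts the value of a local Python variable into the query.
threshold = 22
by_query_var = weather.query("location == 'Perth' and max_temp > @threshold")

print(by_query)
```

Inside a query string, `&` and `|` are treated like `and` and `or`, which is why the comparisons need no extra parentheses there.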


In [2]:
import pandas as pd
import numpy as np

values = np.random.randint(0, 100, size=(10, 2))
df = pd.DataFrame(values, columns=['measure', 'error'])
df.describe()

Unnamed: 0,measure,error
count,10.0,10.0
mean,55.6,47.9
std,30.79033,33.057021
min,13.0,7.0
25%,35.5,15.5
50%,51.5,55.0
75%,84.5,63.75
max,96.0,96.0


## Grouping and aggregating data in pandas

- DataFrame.aggregate


In [None]:
df.body_mass_g.agg('median')


- Tip: You can pass a function, a function name (as a string), a list of functions (or their names), or a dict to .agg(). For example, 'unique' in df.body_mass_g.agg('unique') returns the unique elements of the body_mass_g column, 'nunique' counts the unique elements in the specified column, and 'sum' in df.body_mass_g.agg('sum') produces the sum of the body_mass_g values.

Also, you can pass a function like so (here we pass Python's built-in sum(), but any function that can handle a Series can be passed). Note that pandas 2.1 and later may emit a FutureWarning for built-ins such as sum and suggest passing the string 'sum' instead:

In [None]:
df.body_mass_g.agg(sum)


Another way to do it is by calling the .median() method from the Series:



In [None]:
df.body_mass_g.median()


The difference is that by using the .agg() method, we are also able to apply several aggregating functions to different columns.

In [None]:
df.agg({'bill_length_mm': ['min', 'mean'],
         'flipper_length_mm': ['max', 'mean']
        })
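
To make the shape of these results concrete, here is a small sketch on a hypothetical `penguins` table with three invented rows. It shows a single aggregation, a list of aggregations on one column, and the dict form that assigns different functions to different columns.

```python
import pandas as pd

# Hypothetical measurements, invented for illustration.
penguins = pd.DataFrame({'bill_length_mm': [39.1, 46.5, 50.0],
                         'flipper_length_mm': [181, 210, 196]})

# One function name on one column returns a single value.
median_bill = penguins.bill_length_mm.agg('median')

# A list of names returns a Series indexed by the function names.
several = penguins.bill_length_mm.agg(['min', 'max'])

# A dict maps each column to its own list of functions; combinations
# that were not requested show up as NaN in the resulting table.
summary = penguins.agg({'bill_length_mm': ['min', 'mean'],
                        'flipper_length_mm': ['max', 'mean']})

print(summary)
```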

It's also possible to aggregate the data with your own functions. The example function below returns the number of missing values in a column. If the column contains no missing values, it returns the ok_message argument, which is 0 by default:

In [None]:
def count_nulls(series, ok_message=0):
    if not series.isna().sum():
        return ok_message
    return len(series) - series.count()

In [None]:
df.agg(count_nulls)


In [None]:
df.agg(count_nulls, ok_message='Hurray!')
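
To try the function without the full dataset, the sketch below repeats its definition and applies it to a hypothetical `birds` table (three invented rows with a few missing values). Extra keyword arguments given to .agg(), such as ok_message, are passed on to the function.

```python
import numpy as np
import pandas as pd

def count_nulls(series, ok_message=0):
    # Return ok_message when the column has no missing values,
    # otherwise the number of missing values.
    if not series.isna().sum():
        return ok_message
    return len(series) - series.count()

# Hypothetical rows, invented for illustration.
birds = pd.DataFrame({'bill_length_mm': [39.1, np.nan, 50.0],
                      'sex': ['male', None, None],
                      'island': ['Biscoe', 'Dream', 'Dream']})

# The function is called once per column.
default_counts = birds.agg(count_nulls)

# ok_message is forwarded to count_nulls as a keyword argument.
with_message = birds.agg(count_nulls, ok_message='Hurray!')

print(default_counts)
print(with_message)
```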


In [None]:
df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm']].agg(max, axis='columns')
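
The axis argument controls the direction of the aggregation. As a minimal sketch on a hypothetical `measurements` table (two invented rows), the code below contrasts axis='columns', which produces one value per row, with the default axis, which produces one value per column. It uses the string 'max' rather than the built-in function.

```python
import pandas as pd

# Hypothetical measurements, invented for illustration.
measurements = pd.DataFrame({'bill_length_mm': [39.1, 46.5],
                             'bill_depth_mm': [18.7, 17.9],
                             'flipper_length_mm': [181.0, 210.0]})

# axis='columns': the function runs across the columns, once per row.
row_max = measurements.agg('max', axis='columns')

# Default axis: the function runs down each column.
col_max = measurements.agg('max')

print(row_max)
print(col_max)
```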

## DataFrame.groupby


In [None]:
df.groupby(['sex']).agg({'bill_length_mm':'median'})

The df.groupby(['sex']) call alone outputs something like "<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002A1E7BE8358>". That's because it is just an object in memory that holds the rows split into the indicated groups. Nothing is computed or displayed until you apply an aggregation to it, for example with .agg().

As you've noticed, the "sex" column contains 11 missing values (some penguins preferred not to share their gender). By default, groupby drops the rows whose group key is missing. To include them in our grouping, set the groupby dropna argument to False (pandas 1.1 or higher required):

In [None]:
df.groupby('sex', dropna=False).agg({'bill_length_mm':'median'})

We can also group by several columns at once by passing a list of column names:

In [None]:
df.groupby(['island', 'sex']).agg({'bill_length_mm':'median'})


Tip: As you've seen thus far, groupby turns the group labels you pass into it into index levels. The as_index parameter is responsible for that, and it is set to True by default. For our last example, if we call .index.names, we can see that the index has two levels, 'island' and 'sex':

In [None]:
df.groupby(['island', 'sex']).agg({'bill_length_mm':'median'}).index.names
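
The grouping options discussed in this section can be compared on a small table. The sketch below uses a hypothetical `penguins` frame with five invented rows, one of which has a missing `sex` value, and also shows as_index=False, which keeps the group labels as ordinary columns.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, invented for illustration; one 'sex' value is missing.
penguins = pd.DataFrame({
    'island': ['Biscoe', 'Biscoe', 'Dream', 'Dream', 'Dream'],
    'sex': ['male', 'female', 'male', np.nan, 'male'],
    'bill_length_mm': [40.0, 38.0, 50.0, 45.0, 48.0],
})

# Rows with a missing group key are dropped by default.
by_sex = penguins.groupby('sex').agg({'bill_length_mm': 'median'})

# dropna=False keeps them as a separate NaN group (pandas 1.1+).
by_sex_with_nan = penguins.groupby('sex', dropna=False).agg({'bill_length_mm': 'median'})

# Grouping by two columns produces a two-level index.
nested = penguins.groupby(['island', 'sex']).agg({'bill_length_mm': 'median'})

# as_index=False returns the group labels as regular columns instead.
flat = penguins.groupby(['island', 'sex'], as_index=False).agg({'bill_length_mm': 'median'})

print(nested.index.names)
print(flat)
```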
