<a href="https://colab.research.google.com/github/thousandoaks/Python4DS101/blob/master/labs/Filtering_Basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Filtering Basics



## What I hope you'll get out of this lab
* The ability to use basic filtering operations to process DataFrames
* The ability to develop advanced filters based on the composition of basic ones
* Resources to look further

##  Comparison operators
### Comparison operators allow us to ask if something is true or false.



In [3]:
FirstVariable=50

In [4]:
SecondVariable=150

In [5]:
ThirdVariable=50

#### Question 1. Is FirstVariable equal to SecondVariable?

In [6]:
FirstVariable==SecondVariable

False

#### Question 2. Is FirstVariable equal to ThirdVariable?

In [7]:
FirstVariable==ThirdVariable

True

#### Question 3. Is FirstVariable larger than SecondVariable?

In [8]:
FirstVariable>SecondVariable

False

#### Question 4. Is SecondVariable greater than or equal to ThirdVariable?

In [9]:
SecondVariable>=ThirdVariable

True
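The questions above use three of Python's comparison operators. As a minimal sketch, the cell below exercises all six of them on the same variables. The dictionary name `comparisons` is only an illustrative choice and is not part of the original lab.

```python
# Same values as the variables defined earlier in this lab
FirstVariable = 50
SecondVariable = 150
ThirdVariable = 50

# Each comparison evaluates to a bool (True or False)
comparisons = {
    "==": FirstVariable == ThirdVariable,   # equal to
    "!=": FirstVariable != SecondVariable,  # not equal to
    "<":  FirstVariable < SecondVariable,   # less than
    ">":  FirstVariable > SecondVariable,   # greater than
    "<=": FirstVariable <= ThirdVariable,   # less than or equal to
    ">=": SecondVariable >= ThirdVariable,  # greater than or equal to
}

for operator, result in comparisons.items():
    print(operator, result)
```

Only the `>` comparison is False here, because 50 is not larger than 150.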

### The following image lists the comparison operators supported by Python:

<img src="https://github.com/thousandoaks/Python4DS101/blob/master/images/comparisonoperators.png?raw=1" width="50%"/>

## Filter-based selection in Pandas
### Filters rely on comparison operators to determine if a condition is true or false.


### Let's load our gapminder DataSet to start with

In [10]:
# we import the library pandas and give it the "pd" nickname
import pandas as pd

In [11]:
# we use pandas.read_csv() function to access the file "gapminder.tsv" stored in a remote location 

# the remote location is: https://raw.githubusercontent.com/thousandoaks/BEMM458/master/data/

# with the argument sep='\t' we indicate that the columns are separated by tabs rather than commas.

gapminderDataFrame = pd.read_csv('https://raw.githubusercontent.com/thousandoaks/BEMM458/master/data/gapminder.tsv', sep='\t')
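To see what `sep='\t'` does without downloading anything, here is a small sketch that parses a tab-separated string held in memory. The names `sample_tsv` and `sampleDataFrame` are illustrative assumptions, and the two data rows are copied from outputs shown in this lab.

```python
import io
import pandas as pd

# A tiny tab-separated sample; "\t" is the tab character between columns
sample_tsv = (
    "country\tcontinent\tyear\tlifeExp\tpop\tgdpPercap\n"
    "Afghanistan\tAsia\t1952\t28.801\t8425333\t779.445314\n"
    "Germany\tEurope\t1952\t67.5\t69145952\t7144.114393\n"
)

# io.StringIO lets read_csv treat the string as if it were a file
sampleDataFrame = pd.read_csv(io.StringIO(sample_tsv), sep='\t')

print(sampleDataFrame.shape)           # (number of rows, number of columns)
print(list(sampleDataFrame.columns))   # column names taken from the first line
```

Without `sep='\t'`, pandas would look for commas and would not split these lines into six columns.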


In [12]:
gapminderDataFrame.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
2,Afghanistan,Asia,1962,31.997,10267083,853.10071
3,Afghanistan,Asia,1967,34.02,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106


### Example 1. Given our Gapminder DataSet let's select countries in Asia

### Step A. We define the filtering condition
#### in this case, we want to select countries in the Asian continent

In [13]:
gapminderDataFrame['continent']=='Asia'

0        True
1        True
2        True
3        True
4        True
        ...  
1699    False
1700    False
1701    False
1702    False
1703    False
Name: continent, Length: 1704, dtype: bool

### Step B. We save the filtering condition


In [14]:
asiaFilter=gapminderDataFrame['continent']=='Asia'

### Step C. We apply the filtering condition
#### if the filter works as intended you will get only observations fulfilling the condition continent=='Asia'

In [15]:
gapminderDataFrame[asiaFilter]

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.853030
2,Afghanistan,Asia,1962,31.997,10267083,853.100710
3,Afghanistan,Asia,1967,34.020,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
...,...,...,...,...,...,...
1675,"Yemen, Rep.",Asia,1987,52.922,11219340,1971.741538
1676,"Yemen, Rep.",Asia,1992,55.599,13367997,1879.496673
1677,"Yemen, Rep.",Asia,1997,58.020,15826497,2117.484526
1678,"Yemen, Rep.",Asia,2002,60.308,18701257,2234.820827
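The three steps of Example 1 can be sketched on a toy DataFrame that is small enough to check by eye. The names `toyDataFrame`, `toyAsiaFilter`, `asiaRows`, `sameWithLoc` and `sameInline` are illustrative assumptions; the rows are copied from outputs shown in this lab.

```python
import pandas as pd

# Toy DataFrame with four observations
toyDataFrame = pd.DataFrame({
    'country':   ['Afghanistan', 'China', 'Germany', 'India'],
    'continent': ['Asia', 'Asia', 'Europe', 'Asia'],
    'year':      [1952, 2007, 2007, 2007],
    'pop':       [8425333, 1318683096, 82400996, 1110396331],
})

# Steps A and B: build the boolean Series (one True/False per row) and save it
toyAsiaFilter = toyDataFrame['continent'] == 'Asia'

# True counts as 1 and False as 0, so the sum is the number of matching rows
print(toyAsiaFilter.sum())

# Step C: keep only the rows where the filter is True
asiaRows = toyDataFrame[toyAsiaFilter]

# Two equivalent spellings of the same selection
sameWithLoc = toyDataFrame.loc[toyAsiaFilter]
sameInline = toyDataFrame[toyDataFrame['continent'] == 'Asia']

print(asiaRows)
```

Saving the filter in a variable, as the lab does, makes it easy to reuse and to combine with other filters later.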


### Example 2. Given our Gapminder DataSet let's select countries with a population of at least 1,000,000,000 people

### Step A. We define the filtering condition
#### in this case, we want to select countries whose population is at least 1,000,000,000 (one billion) people

In [16]:
gapminderDataFrame['pop']>=1000000000

0       False
1       False
2       False
3       False
4       False
        ...  
1699    False
1700    False
1701    False
1702    False
1703    False
Name: pop, Length: 1704, dtype: bool

### Step B. We save the filtering condition

In [17]:
largePopulationFilter=gapminderDataFrame['pop']>=1000000000

#### Step C. We apply the filtering condition
#### if the filter works as intended you will get only observations fulfilling the condition population of at least one billion

In [18]:
gapminderDataFrame[largePopulationFilter]

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
294,China,Asia,1982,65.525,1000281000,962.421381
295,China,Asia,1987,67.274,1084035000,1378.904018
296,China,Asia,1992,68.69,1164970000,1655.784158
297,China,Asia,1997,70.426,1230075000,2289.234136
298,China,Asia,2002,72.028,1280400000,3119.280896
299,China,Asia,2007,72.961,1318683096,4959.114854
706,India,Asia,2002,62.879,1034172547,1746.769454
707,India,Asia,2007,64.698,1110396331,2452.210407
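The filter saved in Step B uses the threshold 1000000000, and long runs of zeros are easy to mistype. As a hedged sketch, the cell below writes the same number with digit separators and then sanity-checks the result. The names `threshold`, `toyLargeFilter` and `largeRows` are illustrative assumptions; the rows are copied from outputs shown in this lab.

```python
import pandas as pd

# Toy DataFrame with five observations
toyDataFrame = pd.DataFrame({
    'country': ['Afghanistan', 'China', 'China', 'Germany', 'India'],
    'year':    [1952, 1982, 2007, 2007, 2002],
    'pop':     [8425333, 1000281000, 1318683096, 82400996, 1034172547],
})

# Underscores inside a number are ignored by Python; they only aid readability
threshold = 1_000_000_000

toyLargeFilter = toyDataFrame['pop'] >= threshold
largeRows = toyDataFrame[toyLargeFilter]

# Sanity checks: the smallest population kept, and which countries remain
print(largeRows['pop'].min())
print(largeRows['country'].unique())
```

Checking the minimum of the filtered column is a quick way to confirm that the threshold in the code matches the one you had in mind.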


### Example 3. Given our Gapminder DataSet let's select observations from Germany

#### Step A. We define the filtering condition
#### in this case, we want to select observations corresponding to Germany
#### But how is Germany spelled in the data set? Let's list the country names

In [19]:
pd.set_option('display.max_rows', 200)
gapminderDataFrame.groupby('country').count()

country,continent,year,lifeExp,pop,gdpPercap
Afghanistan,12,12,12,12,12
Albania,12,12,12,12,12
Algeria,12,12,12,12,12
Angola,12,12,12,12,12
Argentina,12,12,12,12,12
Australia,12,12,12,12,12
Austria,12,12,12,12,12
Bahrain,12,12,12,12,12
Bangladesh,12,12,12,12,12
Belgium,12,12,12,12,12
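Grouping and counting is one way to discover how a country is spelled. As an alternative sketch, the cell below uses `unique()`, `value_counts()` and a case-insensitive `str.contains()` search on a toy DataFrame. The names `toyDataFrame` and `matches` are illustrative assumptions; the country and year values are copied from outputs shown in this lab.

```python
import pandas as pd

# Toy DataFrame with four observations
toyDataFrame = pd.DataFrame({
    'country': ['Afghanistan', 'Germany', 'Germany', 'Yemen, Rep.'],
    'year':    [1952, 1952, 1957, 1987],
})

# All distinct spellings that occur in the column
print(toyDataFrame['country'].unique())

# Number of observations per country
print(toyDataFrame['country'].value_counts())

# Case-insensitive substring search, handy when the exact spelling is unknown
matches = toyDataFrame.loc[
    toyDataFrame['country'].str.contains('germ', case=False), 'country'
].unique()
print(matches)
```

The substring search is useful for names such as "Yemen, Rep.", whose exact spelling is hard to guess.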


#### Now that we know that it is called "Germany" we define the filtering condition accordingly

In [20]:
gapminderDataFrame['country']=='Germany'

0       False
1       False
2       False
3       False
4       False
        ...  
1699    False
1700    False
1701    False
1702    False
1703    False
Name: country, Length: 1704, dtype: bool

#### Step B. We save the filtering condition

In [21]:
germanyFilter=gapminderDataFrame['country']=='Germany'

#### Step C. We apply the filtering condition
#### if the filter works as intended you will get only observations from Germany

In [22]:
gapminderDataFrame[germanyFilter]

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
564,Germany,Europe,1952,67.5,69145952,7144.114393
565,Germany,Europe,1957,69.1,71019069,10187.82665
566,Germany,Europe,1962,70.3,73739117,12902.46291
567,Germany,Europe,1967,70.8,76368453,14745.62561
568,Germany,Europe,1972,71.0,78717088,18016.18027
569,Germany,Europe,1977,72.5,78160773,20512.92123
570,Germany,Europe,1982,73.8,78335266,22031.53274
571,Germany,Europe,1987,74.847,77718298,24639.18566
572,Germany,Europe,1992,76.07,80597764,26505.30317
573,Germany,Europe,1997,77.34,82011073,27788.88416


## Composite Filtering

### Example 4. Given our Gapminder DataSet let's select observations from Germany from 1982 onwards

#### In this example we need a composite filter

#### Step A. We define the filtering conditions
#### in this case, we want to select observations from Germany that were recorded from 1982 onwards

In [23]:
gapminderDataFrame['country']=='Germany'

0       False
1       False
2       False
3       False
4       False
        ...  
1699    False
1700    False
1701    False
1702    False
1703    False
Name: country, Length: 1704, dtype: bool

In [24]:
gapminderDataFrame['year']>=1982

0       False
1       False
2       False
3       False
4       False
        ...  
1699     True
1700     True
1701     True
1702     True
1703     True
Name: year, Length: 1704, dtype: bool

### Step B. We save the filtering conditions

In [25]:
germanyFilter=gapminderDataFrame['country']=='Germany'

In [26]:
Filter1982Onwards=gapminderDataFrame['year']>=1982

#### Step C. We apply the filtering condition
#### if the filter works as intended you will get only observations from Germany from 1982 onwards

In [27]:
gapminderDataFrame[germanyFilter & Filter1982Onwards]


Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
570,Germany,Europe,1982,73.8,78335266,22031.53274
571,Germany,Europe,1987,74.847,77718298,24639.18566
572,Germany,Europe,1992,76.07,80597764,26505.30317
573,Germany,Europe,1997,77.34,82011073,27788.88416
574,Germany,Europe,2002,78.67,82350671,30035.80198
575,Germany,Europe,2007,79.406,82400996,32170.37442
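The composite filter above combines two saved filters with `&` (logical AND). As a minimal sketch, the cell below also shows `|` (OR) and `~` (NOT) on a toy DataFrame. All names starting with `toy`, plus `both`, `either`, `notGermany` and `inline`, are illustrative assumptions; the rows are copied from outputs shown in this lab.

```python
import pandas as pd

# Toy DataFrame with five observations
toyDataFrame = pd.DataFrame({
    'country': ['Germany', 'Germany', 'Germany', 'China', 'China'],
    'year':    [1977, 1982, 2007, 1982, 2007],
    'lifeExp': [72.5, 73.8, 79.406, 65.525, 72.961],
})

toyGermanyFilter = toyDataFrame['country'] == 'Germany'
toyFilter1982Onwards = toyDataFrame['year'] >= 1982

# AND: both conditions must hold
both = toyDataFrame[toyGermanyFilter & toyFilter1982Onwards]

# OR: at least one condition must hold
either = toyDataFrame[toyGermanyFilter | toyFilter1982Onwards]

# NOT: invert a condition
notGermany = toyDataFrame[~toyGermanyFilter]

# Written inline, each condition needs its own parentheses,
# because & binds more tightly than == and >=
inline = toyDataFrame[
    (toyDataFrame['country'] == 'Germany') & (toyDataFrame['year'] >= 1982)
]

print(both)
```

Python's `and`, `or` and `not` keywords do not work element by element on a Series, which is why pandas filters use `&`, `|` and `~` instead.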


## Challenge yourself!

### Question: How has life expectancy evolved in China?

### Question: Plot the life expectancy evolution of China

### Question: Which countries have a gdpPercap larger than 30000 dollars?

### Question: Which countries have a gdpPercap between 500 and 1000 dollars?


### Question: Which countries have a gdpPercap larger than 5000 dollars and are located in Asia?