# Lab-Data-Manipulation - Filtering Data


In [39]:
import pandas as pd

# Context 

For this lab you'll use a dataset of UFO observations. The objective is to practice manipulating a dataframe, so we'll use the tools we've learned for `reading`, `renaming`, `selecting specific columns`, `filtering based on conditions` and `merging` dataframes to better understand the dataset and store an enriched version of it at the end.

variable | class | description
---------|-------|------------
date_time | datetime (mdy h:m) | Date and time the sighting occurred
city_area | character | City or area of sighting
state | character | State/region of sighting
country | character | Country of sighting
ufo_shape | character | UFO shape
encounter_length | double | Encounter length in seconds
described_encounter_length | character | Encounter length as described (e.g. 1 hour)
description | character | Description of encounter
date_documented | character | Date documented
latitude | double | Latitude
longitude | double | Longitude

## Read the dataset and store it in a dataframe called `ufo`

Pay attention to the file separator.

In [40]:
ufo = pd.read_csv('data/ufo.csv',sep=';')
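
To see why the separator matters, here is a minimal sketch that reads a small in-memory sample instead of `data/ufo.csv`. The sample rows and the names `sample`, `wrong` and `right` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row sample that mimics a semicolon-separated file.
sample = "year;country;ufo_shape\n1949;us;cylinder\n1955;gb;circle\n"

# With the default separator (a comma), each line ends up in one single column.
wrong = pd.read_csv(io.StringIO(sample))

# With sep=';' the three columns are split correctly.
right = pd.read_csv(io.StringIO(sample), sep=';')

print(wrong.shape)  # (2, 1)
print(right.shape)  # (2, 3)
```

If a freshly loaded dataframe has a single column whose name contains all the expected headers, a wrong separator is a likely cause.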

## Check the first 6 columns of the dataframe

In [41]:
ufo.columns[0:]

Index(['Unnamed: 0', 'date', 'year', 'month', 'day', 'date_time', 'city_area',
       'state', 'country', 'ufo_shape', 'encounter_length',
       'described_encounter_length', 'description', 'date_documented',
       'latitude', 'longitude'],
      dtype='object')

In [42]:
ufo.iloc[:, :6]

Unnamed: 0.1,Unnamed: 0,date,year,month,day,date_time
0,0,1949-10-10,1949,10,10,10/10/1949 20:30
1,1,1949-10-10,1949,10,10,10/10/1949 21:00
2,2,1955-10-10,1955,10,10,10/10/1955 17:00
3,3,1956-10-10,1956,10,10,10/10/1956 21:00
4,4,1960-10-10,1960,10,10,10/10/1960 20:00
...,...,...,...,...,...,...
80327,80327,2013-09-09,2013,9,9,9/9/2013 21:15
80328,80328,2013-09-09,2013,9,9,9/9/2013 22:00
80329,80329,2013-09-09,2013,9,9,9/9/2013 22:00
80330,80330,2013-09-09,2013,9,9,9/9/2013 22:20


## Check the shape of your dataframe to see how many rows and columns it has

In [43]:
ufo.shape

(80332, 16)

## Bring the date information to the beginning of the dataframe

If you check the dataframe columns, some of the date information sits at the end of the dataframe. For this task, reorder the columns so that the first few columns all show the date information.

*Hint: Use `ufo.columns` to see all the column names you have.*

In [44]:
ufo.columns.tolist()

['Unnamed: 0',
 'date',
 'year',
 'month',
 'day',
 'date_time',
 'city_area',
 'state',
 'country',
 'ufo_shape',
 'encounter_length',
 'described_encounter_length',
 'description',
 'date_documented',
 'latitude',
 'longitude']

In [45]:
ufo = ufo[['date', 'year','month', 'day', 'date_time', 'city_area',
            'state', 'country', 'ufo_shape', 'encounter_length',
            'described_encounter_length', 'description','date_documented',
            'latitude', 'longitude']]
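
Typing every column name by hand works, but it is easy to misspell one. As an alternative sketch, the new order can be built from the existing column list. The miniature dataframe `df` and the names `date_cols` and `other_cols` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical miniature frame with the date columns at the end.
df = pd.DataFrame({
    "city_area": ["san marcos"],
    "country": ["us"],
    "year": [1949],
    "month": [10],
})

date_cols = ["year", "month"]
# Keep every non-date column in its current relative order.
other_cols = [c for c in df.columns if c not in date_cols]

# Selecting with a list of names returns the columns in that order.
df = df[date_cols + other_cols]
print(df.columns.tolist())  # ['year', 'month', 'city_area', 'country']
```

Any column left out of the list is dropped from the result, which is how the cell above also removes `Unnamed: 0`.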

## Check that you did it the right way. Take a look at the head of the dataframe again and see whether the `ufo` dataframe is now reordered.

In [46]:
ufo.head()

Unnamed: 0,date,year,month,day,date_time,city_area,state,country,ufo_shape,encounter_length,described_encounter_length,description,date_documented,latitude,longitude
0,1949-10-10,1949,10,10,10/10/1949 20:30,san marcos,tx,us,cylinder,2700.0,45 minutes,This event took place in early fall around 194...,4/27/2004,29.883056,-97.941111
1,1949-10-10,1949,10,10,10/10/1949 21:00,lackland afb,tx,,light,7200.0,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,12/16/2005,29.38421,-98.581082
2,1955-10-10,1955,10,10,10/10/1955 17:00,chester (uk/england),,gb,circle,20.0,20 seconds,Green/Orange circular disc over Chester&#44 En...,1/21/2008,53.2,-2.916667
3,1956-10-10,1956,10,10,10/10/1956 21:00,edna,tx,us,circle,20.0,1/2 hour,My older brother and twin sister were leaving ...,1/17/2004,28.978333,-96.645833
4,1960-10-10,1960,10,10,10/10/1960 20:00,kaneohe,hi,us,light,900.0,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/2004,21.418056,-157.803611


## Select a piece of your dataframe. We won't work with the whole dataframe for now, just a few columns. Create a new dataframe called `ufo_vars` and select only the following columns of the `ufo` dataframe. 

`year`, `month`, `state`, `country`, `ufo_shape`, `encounter_length`

In [47]:
ufo_vars = ufo[['year', 'month', 'state', 'country', 'ufo_shape', 'encounter_length']]

Call `.head()` on your result to check that you did it right.

Expected output:


|    |   year |   month | state   | country   | ufo_shape   |   encounter_length |
|---:|-------:|--------:|:--------|:----------|:------------|-------------------:|
|  0 |   1949 |      10 | tx      | us        | cylinder    |               2700 |
|  1 |   1949 |      10 | tx      | nan       | light       |               7200 |
|  2 |   1955 |      10 | nan     | gb        | circle      |                 20 |
|  3 |   1956 |      10 | tx      | us        | circle      |                 20 |
|  4 |   1960 |      10 | hi      | us        | light       |                900 |

In [48]:
ufo_vars.head()

Unnamed: 0,year,month,state,country,ufo_shape,encounter_length
0,1949,10,tx,us,cylinder,2700.0
1,1949,10,tx,,light,7200.0
2,1955,10,,gb,circle,20.0
3,1956,10,tx,us,circle,20.0
4,1960,10,hi,us,light,900.0
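
The double brackets in the selection above are easy to overlook. The following sketch, on a made-up two-row dataframe, shows the difference between selecting with a single label and selecting with a list of labels.

```python
import pandas as pd

# Hypothetical two-row frame, loosely modelled on the lab data.
df = pd.DataFrame({
    "year": [1949, 1955],
    "state": ["tx", None],
    "country": ["us", "gb"],
})

one_column = df["year"]              # single label -> Series
sub_frame = df[["year", "country"]]  # list of labels -> DataFrame

print(type(one_column).__name__)  # Series
print(type(sub_frame).__name__)   # DataFrame
print(sub_frame.shape)            # (2, 2)
```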


## Rename the variable `encounter_length` to `encounter_seconds`. Keep using the `ufo_vars` dataset for the following tasks, unless stated otherwise.

Again, inspect your results to confirm you did it right.

Expected output:


|    |   year |   month | state   | country   | ufo_shape   |   encounter_seconds |
|---:|-------:|--------:|:--------|:----------|:------------|--------------------:|
|  0 |   1949 |      10 | tx      | us        | cylinder    |                2700 |
|  1 |   1949 |      10 | tx      | nan       | light       |                7200 |
|  2 |   1955 |      10 | nan     | gb        | circle      |                  20 |
|  3 |   1956 |      10 | tx      | us        | circle      |                  20 |
|  4 |   1960 |      10 | hi      | us        | light       |                 900 |

In [49]:
ufo_vars = ufo_vars.rename(columns = {'encounter_length' : 'encounter_seconds'})
ufo_vars.head()

Unnamed: 0,year,month,state,country,ufo_shape,encounter_seconds
0,1949,10,tx,us,cylinder,2700.0
1,1949,10,tx,,light,7200.0
2,1955,10,,gb,circle,20.0
3,1956,10,tx,us,circle,20.0
4,1960,10,hi,us,light,900.0
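
One thing to watch for with `rename` is that a misspelled column name is silently ignored by default. The sketch below uses a made-up one-column dataframe and a deliberately misspelled key (`encounter_lenght`) to show this, along with the `errors="raise"` option that turns the typo into an error.

```python
import pandas as pd

df = pd.DataFrame({"encounter_length": [2700.0, 7200.0]})

# rename returns a new dataframe; the original is left unchanged.
renamed = df.rename(columns={"encounter_length": "encounter_seconds"})

# By default, keys that match no column are silently ignored...
unchanged = df.rename(columns={"encounter_lenght": "encounter_seconds"})

# ...while errors="raise" turns the misspelled name into a KeyError.
try:
    df.rename(columns={"encounter_lenght": "encounter_seconds"}, errors="raise")
    typo_caught = False
except KeyError:
    typo_caught = True

print(renamed.columns.tolist())    # ['encounter_seconds']
print(unchanged.columns.tolist())  # ['encounter_length']
print(typo_caught)                 # True
```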


## Let's start filtering some records. Create a new dataframe called `ufo_us` and filter the `ufo_vars` dataframe bringing only the results in which the `country` is `"us"`



Expected output:


|    |   year |   month | state   | country   | ufo_shape   |   encounter_seconds |
|---:|-------:|--------:|:--------|:----------|:------------|--------------------:|
|  0 |   1949 |      10 | tx      | us        | cylinder    |                2700 |
|  3 |   1956 |      10 | tx      | us        | circle      |                  20 |
|  4 |   1960 |      10 | hi      | us        | light       |                 900 |
|  5 |   1961 |      10 | tn      | us        | sphere      |                 300 |
|  7 |   1965 |      10 | ct      | us        | disk        |                1200 |

In [50]:
condition = (ufo_vars['country'] == 'us')
ufo_us = ufo_vars[condition]
ufo_us.head()

Unnamed: 0,year,month,state,country,ufo_shape,encounter_seconds
0,1949,10,tx,us,cylinder,2700.0
3,1956,10,tx,us,circle,20.0
4,1960,10,hi,us,light,900.0
5,1961,10,tn,us,sphere,300.0
7,1965,10,ct,us,disk,1200.0


### Use the `.query()` method to perform the same task as above



In [51]:
ufo_us = ufo_vars.query('country == "us"')
ufo_us.head()

Unnamed: 0,year,month,state,country,ufo_shape,encounter_seconds
0,1949,10,tx,us,cylinder,2700.0
3,1956,10,tx,us,circle,20.0
4,1960,10,hi,us,light,900.0
5,1961,10,tn,us,sphere,300.0
7,1965,10,ct,us,disk,1200.0


Decide which approach you prefer and keep using it for the exercises that follow.
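
To confirm that the two approaches select the same rows, here is a small sketch on made-up data. Note that the row with a missing `country` is excluded by both.

```python
import pandas as pd

# Hypothetical four-row sample; one country is missing.
df = pd.DataFrame({
    "country": ["us", None, "gb", "us"],
    "ufo_shape": ["cylinder", "light", "circle", "disk"],
})

by_mask = df[df["country"] == "us"]
by_query = df.query('country == "us"')

print(by_mask.equals(by_query))  # True
print(by_mask.index.tolist())    # [0, 3]
```

Both results keep the original index labels (0 and 3 here), which is why the filtered outputs in this lab skip some index values.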

## For the `ufo_us` dataframe, select only the cases in which the year is in the first decade of the 2000s (2001-2010). Put that in a variable called `ufo_us_2000`.

Check your results.

In [52]:
condition = (ufo_us['year'] > 2000) & (ufo_us['year'] <= 2010)
ufo_us_2000 = ufo_us[condition]
ufo_us_2000.head()

Unnamed: 0,year,month,state,country,ufo_shape,encounter_seconds
102,2001,10,ia,us,triangle,240.0
105,2001,10,ca,us,circle,120.0
106,2001,10,ia,us,rectangle,300.0
107,2001,10,ca,us,changing,900.0
108,2001,10,az,us,triangle,60.0


In [53]:
ufo_us_2000 = ufo_us.query('year > 2000 and year <= 2010')
ufo_us_2000.head()

Unnamed: 0,year,month,state,country,ufo_shape,encounter_seconds
102,2001,10,ia,us,triangle,240.0
105,2001,10,ca,us,circle,120.0
106,2001,10,ia,us,rectangle,300.0
107,2001,10,ca,us,changing,900.0
108,2001,10,az,us,triangle,60.0


## Try to do the same without the intermediate step of creating the `ufo_us` dataframe. That is, filter the original dataframe for the cases in which the country is "us" and the year is between 2001 and 2010.



*Hint:* You have to make sure all of these conditions are applied simultaneously - using the `and` (or `&`) operator. Try to understand when to use the `and` and the `&` operator.
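
The difference between `and` and `&` can be sketched as follows, on a made-up four-row dataframe. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["us", "us", "gb", "us"],
    "year": [1999, 2001, 2005, 2010],
})

# Element-wise AND between boolean Series: use & and wrap each comparison
# in parentheses, because & binds more tightly than == or >.
mask = (df["country"] == "us") & (df["year"] > 2000) & (df["year"] <= 2010)
by_mask = df[mask]

# Python's `and` asks for a single True/False value, which a Series
# with several elements cannot provide, so pandas raises a ValueError.
try:
    (df["country"] == "us") and (df["year"] > 2000)
    and_failed = False
except ValueError:
    and_failed = True

# Inside a .query() string, `and` is parsed by pandas and works element-wise.
by_query = df.query('country == "us" and year > 2000 and year <= 2010')

print(by_mask.index.tolist())    # [1, 3]
print(and_failed)                # True
print(by_mask.equals(by_query))  # True
```

In short: use `&` when combining boolean Series in plain Python, and `and` inside a `.query()` string.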

In [54]:
conditions = (ufo['country'] == 'us') & (ufo['year'] > 2000) & (ufo['year'] <= 2010)
ufo[conditions]

Unnamed: 0,date,year,month,day,date_time,city_area,state,country,ufo_shape,encounter_length,described_encounter_length,description,date_documented,latitude,longitude
102,2001-10-10,2001,10,10,10/10/2001 03:00,rockwell city,ia,us,triangle,240.0,4 min.s,Large&#44silent&#44slow&#44low to the ground d...,7/1/2002,42.395278,-94.633611
105,2001-10-10,2001,10,10,10/10/2001 20:35,hayward,ca,us,circle,120.0,2/min.,FALLING STAR STOPS &#39SHOTS OUT DOZENS OF ...,11/20/2001,37.668889,-122.079722
106,2001-10-10,2001,10,10,10/10/2001 21:15,ottumwa,ia,us,rectangle,300.0,3-5 minutes,We saw a square object at night&#44 which had...,11/20/2001,41.004167,-92.373611
107,2001-10-10,2001,10,10,10/10/2001 21:30,fresno,ca,us,changing,900.0,15 min. apprx,Objects were sighted driving north on Highway ...,11/20/2001,36.747778,-119.771389
108,2001-10-10,2001,10,10,10/10/2001 22:00,phoenix,az,us,triangle,60.0,less then a minute,Triangle shaped craft spotted flying west to e...,11/20/2001,33.448333,-112.073333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80269,2010-09-09,2010,9,9,9/9/2010 21:30,brighton,mi,us,changing,300.0,3-5 minutes,Glittery random lights near each other&#44 rea...,11/21/2010,42.529444,-83.780278
80270,2010-09-09,2010,9,9,9/9/2010 22:00,gainesville,fl,us,oval,2700.0,45 min,2nd night watching the objects&#44 approximate...,11/21/2010,29.651389,-82.325000
80271,2010-09-09,2010,9,9,9/9/2010 22:00,gainesville,fl,us,oval,7200.0,2+ hrs,Multiple sightings by multiple whitnesses&#44 ...,11/21/2010,29.651389,-82.325000
80272,2010-09-09,2010,9,9,9/9/2010 22:00,lemitar,nm,us,fireball,300.0,3 to 5 mins,orange lights in the southeastern sky in New M...,11/21/2010,34.159722,-106.909722


In [55]:
ufo.query('country == "us" and year > 2000 and year <= 2010')

Unnamed: 0,date,year,month,day,date_time,city_area,state,country,ufo_shape,encounter_length,described_encounter_length,description,date_documented,latitude,longitude
102,2001-10-10,2001,10,10,10/10/2001 03:00,rockwell city,ia,us,triangle,240.0,4 min.s,Large&#44silent&#44slow&#44low to the ground d...,7/1/2002,42.395278,-94.633611
105,2001-10-10,2001,10,10,10/10/2001 20:35,hayward,ca,us,circle,120.0,2/min.,FALLING STAR STOPS &#39SHOTS OUT DOZENS OF ...,11/20/2001,37.668889,-122.079722
106,2001-10-10,2001,10,10,10/10/2001 21:15,ottumwa,ia,us,rectangle,300.0,3-5 minutes,We saw a square object at night&#44 which had...,11/20/2001,41.004167,-92.373611
107,2001-10-10,2001,10,10,10/10/2001 21:30,fresno,ca,us,changing,900.0,15 min. apprx,Objects were sighted driving north on Highway ...,11/20/2001,36.747778,-119.771389
108,2001-10-10,2001,10,10,10/10/2001 22:00,phoenix,az,us,triangle,60.0,less then a minute,Triangle shaped craft spotted flying west to e...,11/20/2001,33.448333,-112.073333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80269,2010-09-09,2010,9,9,9/9/2010 21:30,brighton,mi,us,changing,300.0,3-5 minutes,Glittery random lights near each other&#44 rea...,11/21/2010,42.529444,-83.780278
80270,2010-09-09,2010,9,9,9/9/2010 22:00,gainesville,fl,us,oval,2700.0,45 min,2nd night watching the objects&#44 approximate...,11/21/2010,29.651389,-82.325000
80271,2010-09-09,2010,9,9,9/9/2010 22:00,gainesville,fl,us,oval,7200.0,2+ hrs,Multiple sightings by multiple whitnesses&#44 ...,11/21/2010,29.651389,-82.325000
80272,2010-09-09,2010,9,9,9/9/2010 22:00,lemitar,nm,us,fireball,300.0,3 to 5 mins,orange lights in the southeastern sky in New M...,11/21/2010,34.159722,-106.909722


## BONUS 1:  Take a look at the column named `ufo_shape`. Compare the number of triangular UFO occurrences in the US from 2001 to 2010 with the number from 1991 to 2000.

*Hint: you should expect roughly 3 times more cases for 2001-2010 than for 1991-2000.*

In [56]:
# calculate the dataframe from 2001-2010 here
conditions = (ufo['country'] == 'us') & (ufo['year'] > 2000) & (ufo['year'] <= 2010) & (ufo['ufo_shape'] == 'triangle')
ufo[conditions]

# or

ufo.query('country == "us" and year > 2000 and year <= 2010 and ufo_shape == "triangle"')


Unnamed: 0,date,year,month,day,date_time,city_area,state,country,ufo_shape,encounter_length,described_encounter_length,description,date_documented,latitude,longitude
102,2001-10-10,2001,10,10,10/10/2001 03:00,rockwell city,ia,us,triangle,240.0,4 min.s,Large&#44silent&#44slow&#44low to the ground d...,7/1/2002,42.395278,-94.633611
108,2001-10-10,2001,10,10,10/10/2001 22:00,phoenix,az,us,triangle,60.0,less then a minute,Triangle shaped craft spotted flying west to e...,11/20/2001,33.448333,-112.073333
109,2001-10-10,2001,10,10,10/10/2001 23:00,virginia beach,va,us,triangle,30.0,30 seconds,shaped like a stealth bomber (boomerang like&#...,7/26/2002,36.852778,-75.978333
127,2004-10-10,2004,10,10,10/10/2004 02:50,mahwah,nj,us,triangle,180.0,2 - 3 minutes,Triangle shaped flying&#44 hovering U.F.O. wit...,12/12/2011,41.088611,-74.144167
147,2005-10-10,2005,10,10,10/10/2005 20:00,loretto,pa,us,triangle,300.0,5 minutes,Dull red flash&#44 Triangular ship&#44 Vocal N...,10/30/2006,40.503056,-78.630556
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80195,2004-09-09,2004,9,9,9/9/2004 22:00,corunna,mi,us,triangle,240.0,4 mins,Flying Triangle seen over a field in Michigan...,9/29/2004,42.981944,-84.117778
80231,2008-09-09,2008,9,9,9/9/2008 22:00,shelbyville,il,us,triangle,7.0,7 seconds,triangle &#443 huge bright lights&#44 fly over...,10/31/2008,39.406389,-88.790000
80235,2009-09-09,2009,9,9,9/9/2009 03:38,denver,co,us,triangle,900.0,15 minutes,Bright triangle lights in sky,12/12/2009,39.739167,-104.984167
80248,2009-09-09,2009,9,9,9/9/2009 20:45,elkton,or,us,triangle,300.0,5 minutes,Two triangluar-shaped craft filmed near Elkton...,12/12/2009,43.637778,-123.566944


In [57]:
# calculate the dataframe from 1991-2000 here
conditions = (ufo['country'] == 'us') & (ufo['year'] > 1990) & (ufo['year'] <= 2000) & (ufo['ufo_shape'] == 'triangle')
ufo[conditions]

# or

ufo.query('country == "us" and year > 1990 and year <= 2000 and ufo_shape == "triangle"')

Unnamed: 0,date,year,month,day,date_time,city_area,state,country,ufo_shape,encounter_length,described_encounter_length,description,date_documented,latitude,longitude
50,1991-10-10,1991,10,10,10/10/1991 22:00,harrisburg,pa,us,triangle,600.0,10 minutes,We observed 3 triangular shaped high speed obj...,5/9/2003,40.273611,-76.884722
64,1996-10-10,1996,10,10,10/10/1996 03:20,higginsville,mo,us,triangle,3.0,3sec,illuminated triangular craft&#44 flying at hig...,2/16/2000,39.072500,-93.716944
70,1997-10-10,1997,10,10,10/10/1997 20:00,bonaire,ga,us,triangle,300.0,<5 minutes,Triangular Object Sighted at Very Close Range,2/1/2007,32.543611,-83.596111
78,1998-10-10,1998,10,10,10/10/1998 20:30,spokane (about 30 miles sw of&#44i-90&#44 mayb...,wa,us,triangle,600.0,10 minutes,Dark boomerange object seen for ten minutes ho...,8/5/2001,47.658889,-117.425000
291,1994-10-11,1994,10,11,10/11/1994 02:00,jackson,nj,us,triangle,300.0,5mins,triangle UFO in jackson NJ Countyline 526 o...,1/29/2002,39.776389,-74.862778
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
79711,2000-09-07,2000,9,7,9/7/2000 20:00,glendale,az,us,triangle,120.0,2 minutes,At an intersecton I saw a silent and motionles...,8/5/2001,33.538611,-112.185278
79712,2000-09-07,2000,9,7,9/7/2000 20:30,syracuse,ny,us,triangle,300.0,5 mins,I was on I-90 30&#44 miles west of Syracuse&#4...,9/17/2000,43.048056,-76.147778
79897,1999-09-08,1999,9,8,9/8/1999 21:58,bellingham,wa,us,triangle,10.0,10 seconds,At 9:58pm on 9/8/99 I was on the back deck smo...,9/12/1999,48.759722,-122.486944
79898,1999-09-08,1999,9,8,9/8/1999 22:00,albany,or,us,triangle,15.0,appox 10-15 sec,The craft was flying at approx. 200 feet with ...,9/12/1999,44.636667,-123.104722


In [58]:
# count the rows (and columns) of the 1991-2000 result here

ufo.query('country == "us" and year > 1990 and year <= 2000 and ufo_shape == "triangle"').shape

(1116, 15)

In [59]:
# count the rows (and columns) of the 2001-2010 result here

ufo.query('country == "us" and year > 2000 and year <= 2010 and ufo_shape == "triangle"').shape

(3353, 15)
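
Since the same filter is applied twice with different years, it can be wrapped in a small helper. This is only a sketch: `count_us_triangles` is a hypothetical function, the toy data is made up, and `Series.between` (inclusive on both ends by default) is swapped in for the `>` and `<=` comparisons used above.

```python
import pandas as pd

# Hypothetical toy data; the real lab uses the full `ufo` dataframe.
df = pd.DataFrame({
    "country":   ["us", "us", "us", "us", "gb", "us"],
    "year":      [1995, 2003, 2007, 2010, 2005, 2004],
    "ufo_shape": ["triangle", "triangle", "triangle",
                  "triangle", "triangle", "light"],
})

def count_us_triangles(frame, first_year, last_year):
    """Count US triangle sightings with first_year <= year <= last_year."""
    mask = (
        (frame["country"] == "us")
        & (frame["ufo_shape"] == "triangle")
        & frame["year"].between(first_year, last_year)  # inclusive on both ends
    )
    return int(mask.sum())  # each True counts as 1

count_90s = count_us_triangles(df, 1991, 2000)
count_00s = count_us_triangles(df, 2001, 2010)
print(count_90s, count_00s, count_00s / count_90s)  # 1 3 3.0
```

With the row counts from the two outputs above, 3353 / 1116 is about 3.0, which matches the hint.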

## BONUS 1.1: Count how many values each category of `ufo_shape` has.

Expected output:

````
        light        16565
        triangle      7865
        circle        7608
        fireball      6208
        other         5649
        unknown       5584
        sphere        5387
        disk          5213
        oval          3733
        formation     2457
        cigar         2057
        changing      1962
        flash         1328
        rectangle     1297
        cylinder      1283
        diamond       1178
        chevron        952
        egg            759
        teardrop       750
        cone           316
        cross          233
        delta            7
        round            2
        crescent         2
        dome             1
        pyramid          1
        changed          1
        hexagon          1
        flare            1
        Name: ufo_shape, dtype: int64

````



In [60]:
ufo['ufo_shape'].value_counts()

light        16565
triangle      7865
circle        7608
fireball      6208
other         5649
unknown       5584
sphere        5387
disk          5213
oval          3733
formation     2457
cigar         2057
changing      1962
flash         1328
rectangle     1297
cylinder      1283
diamond       1178
chevron        952
egg            759
teardrop       750
cone           316
cross          233
delta            7
crescent         2
round            2
pyramid          1
hexagon          1
changed          1
dome             1
flare            1
Name: ufo_shape, dtype: int64
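
As a side note, `value_counts` skips missing values by default. The sketch below, on a made-up Series, shows the `dropna` and `normalize` parameters, which may help when the counts do not add up to the number of rows.

```python
import pandas as pd

# Hypothetical six-value sample with one missing shape.
shapes = pd.Series(
    ["light", "triangle", "light", None, "circle", "light"],
    name="ufo_shape",
)

counts = shapes.value_counts()                      # missing values are excluded
counts_with_na = shapes.value_counts(dropna=False)  # missing values get their own row
shares = shapes.value_counts(normalize=True)        # fractions of the non-missing total

print(counts["light"])            # 3
print(int(counts.sum()))          # 5
print(int(counts_with_na.sum()))  # 6
print(shares["light"])            # 0.6
```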

### Mask hints 

A `mask` is nothing more than a condition. This condition is applied to your whole dataframe (or pandas Series).
So for example, if you had a pandas Series with a variable called `Age`, you could create a mask for all people whose `Age` is 18 or younger using the syntax:

`df['Age'] <= 18`

This would return a pandas Series containing `True` and `False` values, one for each index.

You could save this mask in a variable, for example:

`condition = (df['Age'] <= 18)`

And then you could use that variable `condition` to select only the cases of the dataframe in which the index returned `True` using:
`df.loc[condition, :]`.
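
Putting the mask steps together, a minimal runnable sketch could look like this. The dataframe and the names `selected`, `names` and `others` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ana", "Jack", "Lia"], "Age": [17, 18, 30]})

condition = (df["Age"] <= 18)      # boolean Series, one value per row
print(condition.tolist())          # [True, True, False]

selected = df.loc[condition, :]    # rows where the mask is True, all columns
names = df.loc[condition, "name"]  # same rows, a single column
others = df.loc[~condition, :]     # ~ negates the mask

print(selected["name"].tolist())   # ['Ana', 'Jack']
print(others["name"].tolist())     # ['Lia']
```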


### `.query()` hints

Remember that the `.query()` method expects a string. That string should contain the column name of your dataframe without quotation marks, followed by the comparison. For example, if you had a variable called `name`, you'd use a syntax like:
 `df.query('name == "Jack"')`
 
to bring all observations whose column `name` is exactly equal to `"Jack"` (note that Jack should be within quotation marks because a name is a string in this example).
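
As a small extension of the example above, `.query()` can also refer to Python variables by prefixing them with `@`. The sketch below uses a made-up dataframe, and `target` and `min_age` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Jack", "Jill", "Jack"], "age": [30, 25, 41]})

# String values go inside quotation marks; column names do not.
jacks = df.query('name == "Jack"')

# A Python variable can be referenced inside the string with the @ prefix.
target = "Jack"
min_age = 35
older_jacks = df.query('name == @target and age > @min_age')

print(jacks.index.tolist())        # [0, 2]
print(older_jacks.index.tolist())  # [2]
```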