## Example 5: Filters

### 1. Objectives:
    - Learn how filters work
    - Apply several filters to see them in action
 
### 2. Walkthrough:

In [2]:
import pandas as pd

We create our DataFrame from the CSV data file:

In [3]:
df = pd.read_csv('../../Datasets/new_york_times_bestsellers-dirty.csv', index_col=0)

df.head(3)

Unnamed: 0,amazon_product_url,author,description,publisher,title,oid,bestsellers_date.numberLong,published_date.numberLong,rank.numberInt,rank_last_week.numberInt,weeks_on_list.numberInt,price.numberDouble
0,http://www.amazon.com/The-Host-Novel-Stephenie...,Stephenie Meyer,Descr: Aliens have taken control of the minds ...,"Little, Brown",THE HOST,5b4aa4ead3089013507db18c,2008-05-24 00:00:00,1212883200000,2,1,3,25.99
1,http://www.amazon.com/Love-Youre-With-Emily-Gi...,Emily Giffin,Descr: A woman's happy marriage is shaken when...,St. Martin's,LOVE THE ONE YOU'RE WITH,5b4aa4ead3089013507db18d,2008-05-24 00:00:00,1212883200000,3,2,2,24.95
2,http://www.amazon.com/The-Front-Garano-Patrici...,Patricia Cornwell,Descr: A Massachusetts state investigator and ...,Putnam,THE FRONT,5b4aa4ead3089013507db18e,2008-05-24 00:00:00,1212883200000,4,0,1,22.95


Now we want every record whose author name starts with 'R', so we apply the `str.startswith(-pattern-)` function to the `author` column, in the form:

`df[-column-].str.startswith(-pattern-)`

In [4]:
df["author"].str.startswith("R")

0       False
1       False
2       False
3       False
5       False
        ...  
3027    False
3028    False
3029    False
3030    False
3031    False
Name: author, Length: 2266, dtype: bool

What we get back is a `Series` with the same length as the original column: `startswith()` was applied to each element and returned a boolean value for it.
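Because this dataset is dirty, a column such as `author` could contain missing values. The following is a minimal sketch, on made-up toy data rather than the real file, of how the `na` parameter of `str.startswith` can force missing entries to count as non-matching so the result is a clean boolean `Series`. The `toy` DataFrame and its contents are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the missing author stands in for a dirty record.
toy = pd.DataFrame({"author": ["Robert Crais", "Nora Roberts", np.nan, "Robin Cook"]})

# Without `na`, how the missing entry is reported depends on the pandas
# version and column dtype (it may stay missing instead of True/False).
raw_mask = toy["author"].str.startswith("R")

# With `na=False`, a missing author is always treated as "does not match".
clean_mask = toy["author"].str.startswith("R", na=False)

print(clean_mask.tolist())
```

Passing `na=False` is a design choice: it silently excludes records with no author, which is usually what a filter like this wants.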

Next, when we pass this filter to the `indexing operator` of the `DataFrame`, every row paired with a `True` is kept, while every row paired with a `False` is left out of the resulting subset:

`dataframe[ -boolean Series used as filter- ]`
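As a small illustration of the pattern above, here is a sketch on a hypothetical three-row DataFrame (the `toy` name, its values, and its index labels are made up). It suggests that the filtered subset keeps the original index labels of the surviving rows, and that `.loc` with the same boolean `Series` gives an equivalent result.

```python
import pandas as pd

# Hypothetical toy data with non-consecutive index labels.
toy = pd.DataFrame(
    {
        "author": ["Robert Crais", "Emily Giffin", "Robin Cook"],
        "price": [25.95, 24.95, 25.95],
    },
    index=[79, 1, 143],
)

mask = toy["author"].str.startswith("R")

subset = toy[mask]           # keeps the rows where the mask is True
same_subset = toy.loc[mask]  # explicit, label-based equivalent

print(subset.index.tolist())
```

The index is not renumbered after filtering, which is why the filtered outputs in this example show gaps in the row labels.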

In [6]:
df[ df["author"].str.startswith("R") ].head(5)

Unnamed: 0,amazon_product_url,author,description,publisher,title,oid,bestsellers_date.numberLong,published_date.numberLong,rank.numberInt,rank_last_week.numberInt,weeks_on_list.numberInt,price.numberDouble
79,http://www.amazon.com/Chasing-Darkness-Elvis-N...,Robert Crais,Descr: he Los Angeles private eye Elvis Cole r...,Simon & Schuster,CHASING DARKNESS,5b4aa4ead3089013507db209,2008-07-05 00:00:00,1216512000000,7,0,1,25.95
94,http://www.amazon.com/Chasing-Darkness-Elvis-N...,Robert Crais,Descr: Is the Los Angeles private eye Elvis Co...,Simon & Schuster,CHASING DARKNESS,5b4aa4ead3089013507db221,2008-07-12 00:00:00,1217116800000,11,7,2,25.95
110,http://www.amazon.com/Killer-View-Fleming-Ridl...,Ridley Pearson,"Descr: A sheriff in Sun Valley, Idaho, investi...",Putnam,KILLER VIEW,5b4aa4ead3089013507db239,2008-07-19 00:00:00,1217721600000,15,0,1,24.95
143,http://www.amazon.com/Foreign-Body-Robin-Cook/...,Robin Cook,Descr: A medical student investigates a rising...,Putnam,FOREIGN BODY,5b4aa4ead3089013507db26f,2008-08-09 00:00:00,1219536000000,9,0,1,25.95
158,http://www.amazon.com/Foreign-Body-Robin-Cook/...,Robin Cook,Descr: A medical student investigates a rising...,Putnam,FOREIGN BODY,5b4aa4ead3089013507db287,2008-08-16 00:00:00,1220140800000,No Rank,9,2,25.95


In [7]:
df[ df["author"].str.startswith("R") ].shape

(57, 12)

We can also store our conditions in variables and use them later. For example, to find all books priced above 20:

In [8]:
mayor_a_20 = df["price.numberDouble"] > 20
mayor_a_20.head()

0    True
1    True
2    True
3    True
5    True
Name: price.numberDouble, dtype: bool

In [9]:
df[mayor_a_20].head()

Unnamed: 0,amazon_product_url,author,description,publisher,title,oid,bestsellers_date.numberLong,published_date.numberLong,rank.numberInt,rank_last_week.numberInt,weeks_on_list.numberInt,price.numberDouble
0,http://www.amazon.com/The-Host-Novel-Stephenie...,Stephenie Meyer,Descr: Aliens have taken control of the minds ...,"Little, Brown",THE HOST,5b4aa4ead3089013507db18c,2008-05-24 00:00:00,1212883200000,2,1,3,25.99
1,http://www.amazon.com/Love-Youre-With-Emily-Gi...,Emily Giffin,Descr: A woman's happy marriage is shaken when...,St. Martin's,LOVE THE ONE YOU'RE WITH,5b4aa4ead3089013507db18d,2008-05-24 00:00:00,1212883200000,3,2,2,24.95
2,http://www.amazon.com/The-Front-Garano-Patrici...,Patricia Cornwell,Descr: A Massachusetts state investigator and ...,Putnam,THE FRONT,5b4aa4ead3089013507db18e,2008-05-24 00:00:00,1212883200000,4,0,1,22.95
3,http://www.amazon.com/Snuff-Chuck-Palahniuk/dp...,Chuck Palahniuk,Descr: An aging porn queens aims to cap her ca...,Doubleday,SNUFF,5b4aa4ead3089013507db18f,2008-05-24 00:00:00,1212883200000,5,0,1,24.95
5,http://www.amazon.com/Phantom-Prey-John-Sandfo...,John Sandford,Descr: The Minneapolis detective Lucas Davenpo...,Putnam,PHANTOM PREY,5b4aa4ead3089013507db191,2008-05-24 00:00:00,1212883200000,7,4,3,26.95


We can even apply two or more filters at once using `logical operators`. In this case, the `and` operator is written `&` and the `or` operator is written `|`.

So we can get all books that hold rank 1 and also cost more than 20:

In [12]:
rank_numero_uno = df["rank.numberInt"] == "1"
rank_numero_uno.unique()

array([False,  True])
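The comparison above uses the string `"1"` rather than the integer `1`. A likely reason is that the column contains text values such as `No Rank`, so pandas keeps every entry as a string. The sketch below uses a made-up `Series` (not the real column) to show the difference, plus one possible alternative: converting to numbers with `pd.to_numeric(..., errors="coerce")`. The explicit `dtype=object` is an assumption chosen to mimic a column of Python strings.

```python
import pandas as pd

# Hypothetical stand-in for a dirty rank column stored as text.
rank = pd.Series(["2", "1", "No Rank", "1"], dtype=object, name="rank.numberInt")

as_text = rank == "1"  # string comparison, as in the cell above
as_int = rank == 1     # a string never equals an integer, so nothing matches

# Alternative: convert to numbers first; unparseable text becomes NaN.
rank_numeric = pd.to_numeric(rank, errors="coerce")
numeric_mask = rank_numeric == 1

print(as_text.tolist())
print(as_int.tolist())
print(numeric_mask.tolist())
```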

In [14]:
df[mayor_a_20 & rank_numero_uno].head()  # per row: True & False -> False (row dropped)
                                        #          True & True  -> True  (row kept)

Unnamed: 0,amazon_product_url,author,description,publisher,title,oid,bestsellers_date.numberLong,published_date.numberLong,rank.numberInt,rank_last_week.numberInt,weeks_on_list.numberInt,price.numberDouble
51,http://www.amazon.com/Fearless-Fourteen-Janet-...,Janet Evanovich,Descr: Stephanie Plum and her boyfriend Joe Mo...,St. Martin’s,FEARLESS FOURTEEN,5b4aa4ead3089013507db1db,2008-06-21 00:00:00,1215302400000,1,0,1,27.95
63,http://www.amazon.com/Fearless-Fourteen-Janet-...,Janet Evanovich,Descr: Stephanie Plum and her boyfriend Joe Mo...,St. Martin’s,FEARLESS FOURTEEN,5b4aa4ead3089013507db1ef,2008-06-28 00:00:00,1215907200000,1,1,2,27.95
85,http://www.amazon.com/Tribute-Nora-Roberts/dp/...,Nora Roberts,Descr: A former child star returns to Virginia...,Putnam,TRIBUTE,5b4aa4ead3089013507db217,2008-07-12 00:00:00,1217116800000,1,0,1,26.95
98,http://www.amazon.com/Tribute-Nora-Roberts/dp/...,Nora Roberts,Descr: A former child star returns to Virginia...,Putnam,TRIBUTE,5b4aa4ead3089013507db22b,2008-07-19 00:00:00,1217721600000,1,1,2,26.95
111,http://www.amazon.com/Secret-Servant-Gabriel-A...,Daniel Silva,"Descr: Gabriel Allon, an art restorer and an o...",Putnam,THE SECRET SERVANT,5b4aa4ead3089013507db23f,2008-07-26 00:00:00,1218326400000,1,0,1,26.95
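When conditions are written inline instead of being stored in variables first, each comparison needs its own parentheses, because `&` and `|` bind more tightly than `>` or `==`. The sketch below, on a hypothetical toy DataFrame with simplified column names, shows `&`, `|`, and the negation operator `~` side by side.

```python
import pandas as pd

# Hypothetical toy data; "rank" is kept as text, as in the dataset above.
toy = pd.DataFrame(
    {
        "rank": ["1", "2", "1", "3"],
        "price": [27.95, 25.99, 14.99, 9.99],
    }
)

# Each comparison is wrapped in parentheses before being combined.
both = toy[(toy["price"] > 20) & (toy["rank"] == "1")]          # and
either = toy[(toy["price"] > 20) | (toy["rank"] == "1")]        # or
neither = toy[~((toy["price"] > 20) | (toy["rank"] == "1"))]    # not (or)

print(both.index.tolist(), either.index.tolist(), neither.index.tolist())
```

Python's own `and`, `or`, and `not` keywords do not work element by element on a `Series`, which is why the symbol operators are used here.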


---

## Challenge 5: Filters

### 1. Objectives:
    - Practice using filters to obtain subsets of data
    
### 2. Walkthrough:

#### a) Filtering by dates, booleans, and numeric values

We are going to work with the same dataset you saved in the previous Challenge. This Challenge consists of the following:

Using filters, create 3 subsets of the data:

1. A subset named `df_hazardous` that contains only the records for objects where `is_potentially_hazardous_asteroid` is `True` (or `1`).
2. A subset named `df_bigger_than_1000` that contains only the records where `estimated_diameter.meters.estimated_diameter_max` is greater than 1000 meters.
3. A subset named `df_february` that contains only the records that fall exactly within February 1995. Remember that the values in the `epoch_date_close_approach` column are in milliseconds.
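As a hint for working with timestamps stored in milliseconds, here is a minimal sketch on made-up values (not the challenge data, and for a different month). It assumes a hypothetical column named `epoch_ms` and shows one possible approach: converting with `pd.to_datetime(..., unit="ms")` and then keeping the rows inside a half-open date range.

```python
import pandas as pd

# Hypothetical epoch values in milliseconds:
# 2000-01-01, 2000-02-01 and 2000-03-01 (all at 00:00 UTC).
toy = pd.DataFrame({"epoch_ms": [946684800000, 949363200000, 951868800000]})

# Interpret the integers as milliseconds since the Unix epoch.
toy["date"] = pd.to_datetime(toy["epoch_ms"], unit="ms")

# Half-open interval: start is included, end is excluded.
start = pd.Timestamp("2000-01-01")
end = pd.Timestamp("2000-02-01")
in_january = toy[(toy["date"] >= start) & (toy["date"] < end)]

print(in_january["epoch_ms"].tolist())
```

Using `>=` for the start and `<` for the first day of the following month avoids having to know how many days the month has.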


In [None]:
df_reto_5 = pd.read_csv("../Ejemplo-04/objetos_cercanos_4.csv", index_col=0)
df_reto_5.head(3)

In [None]:
df_hazardous = ...

In [None]:
df_bigger_than_1000 = ...

In [None]:
df_february = ...

In [None]:
def checar_subconjuntos(df_february, df_hazardous, df_bigger_than_1000):
    
    import pandas as pd
    import base64

    datos = b'CmFzc2VydCAoZGZfaGF6YXJkb3VzWydpc19wb3RlbnRpYWxseV9oYXphcmRvdXNfYXN0ZXJvaWQnXSA9PSAwKS5zdW0oKSA9PSAwLCAnQWxndW5vcyByZWNvcmRzIGVuIGBkZl9oYXphcmRvdXNgIHBlcnRlbmVjZW4gYSBvYmpldG9zIGRvbmRlIGlzX3BvdGVudGlhbGx5X2hhemFyZG91c19hc3Rlcm9pZCBlcyBgRmFsc2VgJwphc3NlcnQgKGRmX2hhemFyZG91c1snaXNfcG90ZW50aWFsbHlfaGF6YXJkb3VzX2FzdGVyb2lkJ10gPT0gMSkuc3VtKCkgPiAwLCAnTm8gaGF5IG5pbmd1biByZWNvcmQgZW4gYGRmX2hhemFyZG91c2AgZG9uZGUgaXNfcG90ZW50aWFsbHlfaGF6YXJkb3VzX2FzdGVyb2lkIHNlYSBgVHJ1ZWAnCgphc3NlcnQgKGRmX2JpZ2dlcl90aGFuXzEwMDBbJ2VzdGltYXRlZF9kaWFtZXRlci5tZXRlcnMuZXN0aW1hdGVkX2RpYW1ldGVyX21heCddIDw9IDEwMDApLnN1bSgpID09IDAsICdBbGd1bm9zIHJlY29yZHMgZW4gYGRmX2JpZ2dlcl90aGFuXzEwMDBgIHBlcnRlbmVjZW4gYSBvYmpldG9zIGNvbiBkacOhbWV0cm8gbWVub3IgYSAxMDAwIG1ldHJvcycKYXNzZXJ0IChkZl9iaWdnZXJfdGhhbl8xMDAwWydlc3RpbWF0ZWRfZGlhbWV0ZXIubWV0ZXJzLmVzdGltYXRlZF9kaWFtZXRlcl9tYXgnXSA+IDEwMDApLnN1bSgpID4gMCwgJ05vIGhheSBuaW5nw7puIHJlY29yZCBlbiBgZGZfYmlnZ2VyX3RoYW5fMTAwMGAgcXVlIHBlcnRlbmV6Y2EgYSBvYmpldG9zIGNvbiBkacOhbWV0cm8gbWF5b3IgYSAxMDAwIG1ldHJvcycKCmZlYnJ1YXJ5ID0gcGQudG9fZGF0ZXRpbWUoJzE5OTUtMDItMDEnLCBmb3JtYXQ9JyVZLSVtLSVkJykudGltZXN0YW1wKCkgKiAxMDAwCm1hcmNoID0gcGQudG9fZGF0ZXRpbWUoJzE5OTUtMDMtMDEnLCBmb3JtYXQ9JyVZLSVtLSVkJykudGltZXN0YW1wKCkgKiAxMDAwIAoKYXNzZXJ0IChkZl9mZWJydWFyeVsnZXBvY2hfZGF0ZV9jbG9zZV9hcHByb2FjaCddIDwgZmVicnVhcnkpLnN1bSgpID09IDAsICdBbGd1bm9zIHJlY29yZHMgZGUgYGRmX2ZlYnJ1YXJ5YCBwZXJ0ZW5lY2VuIGEgbWVzZXMgYW50ZXJpb3JlcyBhIEZlYnJlcm8gZGUgMTk5NScKYXNzZXJ0IChkZl9mZWJydWFyeVsnZXBvY2hfZGF0ZV9jbG9zZV9hcHByb2FjaCddID49IG1hcmNoKS5zdW0oKSA9PSAwLCAnQWxndW5vcyByZWNvcmRzIGRlIGBkZl9mZWJydWFyeWAgcGVydGVuZWNlbiBhIG1lc2VzIHBvc3RlcmlvcmVzIGEgRmVicmVybyBkZSAxOTk1Jwo='
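    # Run the encoded checks against the DataFrames passed to this function,
    # rather than whatever variables happen to exist in globals().
    namespace = {"pd": pd, "df_february": df_february, "df_hazardous": df_hazardous, "df_bigger_than_1000": df_bigger_than_1000}
    exec(compile(base64.b64decode(datos), "", "exec"), namespace)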
    
    print('All your subsets are correct. Great job!')
    
checar_subconjuntos(df_february, df_hazardous, df_bigger_than_1000)