<img src=images/gdd-logo.png width=300px align=right> 

# Selections and sorting

In this notebook, you will learn how to do selections and sorting with Pandas.

- [Selections](#s)
    - [Selecting the top/bottom](#tb)
    - [Selecting columns](#c)
    - [Selecting rows and columns together](#rc)
    - <mark>[Exercise: Selections](#e-select)</mark>
- [Sorting](#sort)

Let's load in the `chickweight` dataset again.

In [None]:
import pandas as pd

In [None]:
chickweight = pd.read_csv('data/chickweight.csv').rename(str.lower, axis='columns')
chickweight

<a id='s'></a>
## Selections

Before attempting to figure out the best diet for the chicks, let's investigate how to select different rows/columns. There are a few different approaches!

<a id='tb'></a>
### Selecting the top/bottom

In [None]:
chickweight.head()

In [None]:
chickweight.tail(2)

In [None]:
chickweight.head(5).tail(2)

<a id='c'></a>
### Selecting columns (DataFrames vs. Series)

In [None]:
chickweight['weight'].head() 

Note that the output of this next command is a little bit different.

In [None]:
chickweight[['weight']].head()

There is a subtle difference at work here. 

- `chickweight['weight']` returns a `pandas` `Series`, which can only contain one column.
- `chickweight[['weight']]` returns a `pandas` `DataFrame`, which can have one or more columns.

As you'll see, many techniques that work on DataFrames also work on Series objects, but not all of them!
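To make the difference concrete, here is a minimal sketch on a small made-up DataFrame (the `toy` data is hypothetical and only stands in for `chickweight`). It compares the type, shape and number of dimensions of both selections.

```python
import pandas as pd

# Hypothetical toy data standing in for chickweight
toy = pd.DataFrame({
    'weight': [42, 51, 59],
    'time': [0, 2, 4],
    'chick': [1, 1, 1],
})

as_series = toy['weight']    # single brackets -> Series (one-dimensional)
as_frame = toy[['weight']]   # double brackets -> DataFrame (two-dimensional)

print(type(as_series).__name__, as_series.shape, as_series.ndim)
print(type(as_frame).__name__, as_frame.shape, as_frame.ndim)
```

The Series has a shape with a single entry (rows only), while the DataFrame has both rows and columns, even though it holds just one column.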

Returning a DataFrame (using double square brackets) means you can select more than one column.

In [None]:
chickweight[['weight', 'time', 'chick']].head()

<a id=e-select></a>

#### <mark>Exercise: Selections</mark>

1. Select the `diet` column as a Series.

2. Select the `time` and `chick` columns.

3. Select the `diet` column as a DataFrame.

In [None]:
# %load answers/02_Selections_and_Filtering/selections-columns.py

<a id='rc'></a>
### Selecting rows & columns together

You can use [`.loc[]`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) to select a subset of a DataFrame based on the **names** (labels) of the index and columns. It accepts several types of arguments; this notebook covers three of them.

Note that `.loc[]` requires **square** brackets `[]`, unlike the other methods we cover, which are called with parentheses `()`.

For selecting multiple rows and columns, it uses the following format:

`.loc[[index_names], [column_names]]` 

In [None]:
chickweight.loc[[1,2,3], ['time', 'chick']]

You **must** specify which rows to select; the columns part is optional.

In [None]:
chickweight.loc[[1,2,3]]
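Because `.loc[]` works on names rather than positions, the same pattern also applies when the index is not numeric. The sketch below assumes a small made-up DataFrame with hypothetical string labels `'a'`, `'b'` and `'c'` as its index; it is not part of the `chickweight` data.

```python
import pandas as pd

# Hypothetical toy data with string labels as the index
toy = pd.DataFrame(
    {'weight': [42, 51, 59], 'time': [0, 2, 4]},
    index=['a', 'b', 'c'],
)

# Rows and columns are both selected by name
subset = toy.loc[['a', 'c'], ['time']]
print(subset)

# Leaving out the columns part returns all columns for the chosen rows
rows_only = toy.loc[['a', 'c']]
print(rows_only)
```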

Because the index of the `chickweight` DataFrame is numeric, you can use slice notation similar to the one seen with Python lists. Unlike list slicing, a `.loc[]` slice **includes** its end label:
 
 - if you use `10:15`, it will select all the rows between those with index `10` and `15` (inclusive)
 
 - if you use `:50`, it will select all rows to index `50` (including the row with index `50`)
 
 - if you use ` : `, it will select all rows
 
 - if you use `::2`, it will select every other row
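
The slicing rules above can be checked with a short sketch. It assumes a made-up DataFrame `toy` with the default integer index `0` to `19` (hypothetical data, not `chickweight`) and compares a `.loc[]` slice with the same slice on a plain Python list.

```python
import pandas as pd

# Hypothetical toy data with the default index 0..19
toy = pd.DataFrame({'weight': list(range(100, 120))})

loc_slice = toy.loc[10:15]             # labels 10, 11, ..., 15 (end included)
list_slice = list(range(20))[10:15]    # items 10, 11, ..., 14 (end excluded)
print(len(loc_slice), len(list_slice))

up_to_five = toy.loc[:5]               # labels 0..5, including 5
every_other = toy.loc[::2]             # labels 0, 2, 4, ...
print(list(up_to_five.index))
print(list(every_other.index))
```

The `.loc[]` slice returns one row more than the list slice, because the end label is part of the result.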
 

In [None]:
chickweight.loc[10:15]

In [None]:
chickweight.loc[10:15, ['time', 'chick']]

In [None]:
chickweight.loc[ : , ['time', 'chick']]

### Stylistic Choices

When writing pandas code, you might end up with long lines. To make a query easier to read, you can wrap it entirely in parentheses and put each step on its own line.

This keeps the query readable, especially as you chain more and more methods.

In [None]:
(
    chickweight
    .loc[10:15, ['time', 'chick']]
)

### <mark>Exercise: Select rows and columns</mark>

Select only:

a) Rows up to and including index 10 of the data (without using `.head()`).

b) Rows 50 to 60 of the data.

c) The `chick` and `weight` columns **without** `.loc[]`.

d) The `chick` and `weight` columns **with** `.loc[]`.

e) Rows 50 to 60 of the data of the `chick` and `weight` columns.

**Answers**

In [None]:
# %load answers/02_Selections_and_Filtering/selections.py

<a id='sort' ></a>
## Sorting in Pandas

Sometimes you are interested in the highest (or lowest) few values of a column. In these cases, it is useful to sort the data before slicing.

For example, you can find the 10 lowest weights by sorting on the `weight` column:

In [None]:
(
    chickweight
    .sort_values('weight')
    .head(10)
)

Or the top-10 values by sorting in descending order:

In [None]:
(
    chickweight
    .sort_values(by='weight', ascending=False)
    .head(10)
)

Or you can sort by two columns, where the second column breaks ties in the first. *(To see what happens, look at the order of chick IDs where* `weight` *is equal to* `39`*.)*

In [None]:
(
    chickweight
    .sort_values(by=['weight', 'chick'])
    .head(10)
)
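The tie-breaking behaviour is easier to see on a handful of rows. The sketch below uses a made-up DataFrame `toy` (hypothetical values, not taken from `chickweight`) in which two rows share the same weight.

```python
import pandas as pd

# Hypothetical toy data: two rows share weight 39
toy = pd.DataFrame({
    'weight': [41, 39, 39, 40],
    'chick': [7, 9, 3, 5],
})

# Sort on weight first; rows with equal weight are then ordered by chick
by_weight_then_chick = toy.sort_values(by=['weight', 'chick'])
print(by_weight_then_chick)
```

The two rows with weight `39` come first, ordered by `chick` (`3` before `9`). Note that each row keeps its original index label after sorting.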

<a id='e2'></a>
## <mark>Exercise: Sorting</mark>
1. Sort the data in `chickweight` by `rownum` from highest to lowest.

2. After sorting, the DataFrame keeps the index of the original data. Figure out how to *ignore* the original index, so that the resulting DataFrame has an index of increasing integers starting at 0. Look at the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html) to check if you can use a parameter in `.sort_values()` to do this.

### Bonus
1. Using the resulting DataFrame from above, select the first 25 rows with `.loc[]`. Compare the results when you ignore the original index in `.sort_values()` and when you don't. What causes this difference? 

2. Sort the data by `weight` (ascending) and by `chick` (descending). Look at the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html) to see what type of inputs the `ascending` parameter takes.

<details>
    
  <summary><span style="color:blue">Show hint</span></summary>
  
Description of the ascending parameter: *"Specify list for multiple sort orders. If this is a list of bools, must match the length of the by."*
    
If you're sorting by 2 columns, you can give them to `by` in a list. Similarly, you can also give a list as an input to `ascending`. What data types should that list contain?

</details>

**Answers**

In [None]:
# %load answers/02_Selections_and_Filtering/sorting.py