**SA433 &#x25aa; Data Wrangling and Visualization &#x25aa; Fall 2025**

# Lesson 14. Sorting Rows and Selecting Columns in Pandas

## In this lesson...

- How do we sort the rows of a DataFrame?

- How do we select columns from a DataFrame?
    - For that matter, how do we drop columns? Rename columns?

- Data wrangling by method chaining

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Setup

- Let's start by importing Pandas

In [None]:
import pandas as pd

* We'll use the nycflights13 dataset that we used in the previous lesson


* This dataset is located in `data/nycflights13_flights.csv.zip` in the same folder as this notebook:

In [None]:
df = pd.read_csv('data/nycflights13_flights.csv.zip')

* Just to remind ourselves what this dataset looks like:

In [None]:
df.head()

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Sorting rows

- We can sort the rows of a DataFrame with the `.sort_values()` method


- For example, let's sort the flights of the nycflights13 dataset according to their time in the air:
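
A minimal sketch of this sort. To keep it self-contained, it uses a small made-up DataFrame called `toy` (a hypothetical stand-in, not the real data) that borrows a few column names from the flights dataset. With the lesson's `df`, the same call would be `df.sort_values('air_time')`.

```python
import pandas as pd

# Hypothetical stand-in for the flights data (values are made up)
toy = pd.DataFrame({
    'carrier': ['UA', 'AA', 'B6', 'DL'],
    'flight': [1545, 1141, 725, 461],
    'air_time': [227.0, 160.0, None, 116.0],
})

# Sort the rows by time in the air; the result is a new DataFrame
by_air_time = toy.sort_values('air_time')
by_air_time
```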

- Note that the `.sort_values()` method returns a new DataFrame


- By default, `.sort_values()` sorts the rows in ascending order


- We can specify the sort order with the `ascending=...` keyword argument, like this:
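
As a rough illustration, here is a descending sort on a small made-up DataFrame (`toy` is a hypothetical stand-in for `df`, and its values are invented). One row has a missing `air_time` so that its position in the output can be inspected.

```python
import pandas as pd

# Hypothetical stand-in for the flights data (values are made up)
toy = pd.DataFrame({
    'flight': [1545, 1141, 725, 461],
    'air_time': [227.0, 160.0, None, 116.0],
})

# ascending=False puts the longest time in the air first
longest_first = toy.sort_values('air_time', ascending=False)
longest_first
```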

- Note that missing values (`NaN`) are placed at the end by default, regardless of the sort order; the `na_position=...` keyword argument of `.sort_values()` can change this


- We can also provide `.sort_values()` with a *list* of column names: each additional column will be used to break ties between the values of preceding columns


- For example, we can sort the rows by carrier and flight number, and then by year, month and day:
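
One way this could look, sketched on a made-up DataFrame (`toy` is hypothetical, and so are its values). Two of the rows share a carrier and flight number, so the date columns decide their relative order.

```python
import pandas as pd

# Hypothetical stand-in for the flights data (values are made up)
toy = pd.DataFrame({
    'year': [2013, 2013, 2013, 2013],
    'month': [2, 1, 1, 1],
    'day': [1, 15, 3, 1],
    'carrier': ['UA', 'AA', 'UA', 'AA'],
    'flight': [1545, 1141, 1545, 301],
})

# Sort by carrier, then flight number, then year, month, and day
by_flight_and_date = toy.sort_values(['carrier', 'flight', 'year', 'month', 'day'])
by_flight_and_date
```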

- We can also specify a sort order for specific columns by passing a list to the `ascending=...` keyword argument, like this:
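
A small sketch of a mixed sort order, again on a made-up DataFrame (`toy` and its values are hypothetical). The list passed to `ascending=...` has one entry per sort column, in the same order.

```python
import pandas as pd

# Hypothetical stand-in for the flights data (values are made up)
toy = pd.DataFrame({
    'carrier': ['UA', 'AA', 'UA', 'AA'],
    'flight': [1545, 1141, 303, 301],
})

# carrier in ascending order, flight number in descending order
mixed_order = toy.sort_values(['carrier', 'flight'], ascending=[True, False])
mixed_order
```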

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Selecting columns

- Sometimes, it's useful to be able to select columns so that we can focus on a particular part of the dataset


- To select multiple columns from `df` and put them into a new DataFrame, we can index `df` with square brackets, as we would a Python dictionary, using a *list* of column names as the key


- For example, we can create a *new* DataFrame with only the flight numbers and date information, like this:
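
A minimal sketch of this selection, using a made-up DataFrame (`toy` is a hypothetical stand-in for `df`). With the lesson's data, the equivalent would be `df[['flight', 'year', 'month', 'day']]`.

```python
import pandas as pd

# Hypothetical stand-in for the flights data (values are made up)
toy = pd.DataFrame({
    'year': [2013, 2013],
    'month': [1, 1],
    'day': [1, 2],
    'carrier': ['UA', 'AA'],
    'flight': [1545, 1141],
})

# The outer brackets index the DataFrame; the inner brackets form the list of names
flight_dates = toy[['flight', 'year', 'month', 'day']]
flight_dates
```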

- Note that the columns are reordered according to the order given in the list used as a key


- Sometimes a DataFrame's column names have a particular structure that we might want to exploit


- For example, several of the columns in `df` start with `dep_` or `arr_`


- The DataFrame attribute `df.columns` contains all the column names of `df`:

In [None]:
df.columns

- Recall that `.startswith()` is a built-in Python string method
    - `s.startswith(prefix)` returns `True` if the string `s` starts with `prefix`, and `False` otherwise

- We can create a new DataFrame with only those columns that start with `dep_` or `arr_` by creating a list of these column names first:
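
A sketch of the two-step approach, assuming a made-up one-row DataFrame (`toy`, hypothetical) with the same kind of column names as `df`. The variable name `dep_arr_columns` is also just for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the flights data (values are made up)
toy = pd.DataFrame({
    'dep_time': [517.0], 'sched_dep_time': [515], 'dep_delay': [2.0],
    'arr_time': [830.0], 'sched_arr_time': [819], 'arr_delay': [11.0],
    'carrier': ['UA'],
})

# Step 1: collect the column names that start with dep_ or arr_
dep_arr_columns = []
for name in toy.columns:
    if name.startswith('dep_') or name.startswith('arr_'):
        dep_arr_columns.append(name)

# Step 2: use the list of names to select the columns
dep_arr = toy[dep_arr_columns]
dep_arr
```

Note that `sched_dep_time` and `sched_arr_time` are left out, since they start with `sched_`.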

- We can be 😎 and use a list comprehension instead, requiring only 1 line of code:
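
The list-comprehension version might look like the sketch below. As before, `toy` is a made-up, hypothetical stand-in for `df`.

```python
import pandas as pd

# Hypothetical stand-in for the flights data (values are made up)
toy = pd.DataFrame({
    'dep_time': [517.0], 'sched_dep_time': [515], 'dep_delay': [2.0],
    'arr_time': [830.0], 'sched_arr_time': [819], 'arr_delay': [11.0],
    'carrier': ['UA'],
})

# Build the list of column names and select with it in a single line
dep_arr = toy[[c for c in toy.columns if c.startswith('dep_') or c.startswith('arr_')]]
dep_arr
```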

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Dropping columns

- We might find it easier to drop columns instead of selecting them


- We can do this with the `.drop()` DataFrame method


- For example, we can drop the `time_hour` variable from `df` like this:
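
A minimal sketch on a made-up DataFrame (`toy` is hypothetical, and its values are invented). With the lesson's data, the call would be `df.drop(columns='time_hour')`.

```python
import pandas as pd

# Hypothetical stand-in for the flights data (values are made up)
toy = pd.DataFrame({
    'carrier': ['UA', 'AA'],
    'flight': [1545, 1141],
    'time_hour': ['2013-01-01 05:00:00', '2013-01-01 05:00:00'],
})

# Drop a single column by name; toy itself is left unchanged
without_time_hour = toy.drop(columns='time_hour')
without_time_hour
```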

- Don't forget the `columns=...` keyword!


- Similar to the other DataFrame methods we've learned about so far, `.drop()` also returns a new DataFrame


- We can drop multiple columns simultaneously in a similar fashion


- For example, let's drop the scheduled departure and arrival times from `df`:
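
One possible way to write this, sketched on a made-up DataFrame (`toy`, hypothetical). It assumes the scheduled times live in the `sched_dep_time` and `sched_arr_time` columns, as listed in `df.columns`.

```python
import pandas as pd

# Hypothetical stand-in for the flights data (values are made up)
toy = pd.DataFrame({
    'dep_time': [517.0], 'sched_dep_time': [515],
    'arr_time': [830.0], 'sched_arr_time': [819],
    'carrier': ['UA'],
})

# Pass a list to columns=... to drop several columns at once
without_sched = toy.drop(columns=['sched_dep_time', 'sched_arr_time'])
without_sched
```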

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Renaming columns

- Sometimes, we may want to rename columns to facilitate easier wrangling or analysis


- For example, we might have two datasets from different sources that contain the same kind of data, but in differently named columns


- We can use the `.rename()` DataFrame method to do this


- For example, let's rename `carrier` to `airline` and `flight` to `flight_no`:
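
A short sketch of this renaming on a made-up DataFrame (`toy` is a hypothetical stand-in for `df`). Columns that are not mentioned in the dictionary keep their names.

```python
import pandas as pd

# Hypothetical stand-in for the flights data (values are made up)
toy = pd.DataFrame({
    'carrier': ['UA', 'AA'],
    'flight': [1545, 1141],
    'origin': ['EWR', 'JFK'],
})

# Map each old column name to its new name
renamed = toy.rename(columns={'carrier': 'airline', 'flight': 'flight_no'})
renamed
```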

- As with `.drop()`, don't forget the `columns=...` keyword!


- The `columns=...` keyword argument is a *dictionary* that maps old column names to new column names 


- Again, note that `.rename()` returns a new DataFrame

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Data wrangling by method chaining

- Why is it significant that the methods we've learned about so far return new DataFrames?

- Since these methods act on DataFrames and return DataFrames, we can use **method chaining**
    - We learned about method chaining at the beginning of the semester
    - We also used method chaining extensively with Altair

- Suppose we want to take the nycflights13 dataset and:
    1. Keep only the flights on September 1 on United Airlines
    2. Keep only the year, month, day, carrier, flight, origin, and destination columns
    3. Sort the resulting data by flight number, in descending order

- We could do each step individually, using auxiliary variables to store our intermediate steps:

In [None]:
df2 = df.query('(month == 9) and (day == 1) and (carrier == "UA")')
df3 = df2[['year', 'month', 'day', 'carrier', 'flight', 'origin', 'dest']]
df4 = df3.sort_values('flight', ascending=False)

df4

- Or, we can write a method chain to do this:
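
A sketch of what such a chain could look like. It runs on a made-up DataFrame (`toy`, a hypothetical stand-in for `df` with invented values) so that it is self-contained; the three steps mirror the query, column selection, and sort described above.

```python
import pandas as pd

# Hypothetical stand-in for the flights data (values are made up)
toy = pd.DataFrame({
    'year': [2013, 2013, 2013, 2013, 2013],
    'month': [9, 9, 9, 8, 9],
    'day': [1, 1, 1, 1, 2],
    'carrier': ['UA', 'UA', 'AA', 'UA', 'UA'],
    'flight': [300, 1500, 100, 900, 700],
    'origin': ['EWR', 'LGA', 'JFK', 'EWR', 'EWR'],
    'dest': ['ORD', 'IAH', 'MIA', 'SFO', 'DEN'],
    'air_time': [110.0, 200.0, 150.0, 340.0, 230.0],
})

# Wrapping the chain in parentheses lets us put one step per line
result = (
    toy
    .query('(month == 9) and (day == 1) and (carrier == "UA")')      # step 1: filter rows
    [['year', 'month', 'day', 'carrier', 'flight', 'origin', 'dest']]  # step 2: select columns
    .sort_values('flight', ascending=False)                            # step 3: sort rows
)
result
```

Each step receives the DataFrame returned by the step before it, so no intermediate variables like `df2` or `df3` are needed.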

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Problems

For the problems below, use the nycflights13 dataset we used in this lesson.

### Problem 1

Find the most delayed flights. Find the flights that left the earliest.

### Problem 2

Which flights travelled the farthest? Which travelled the shortest distance?

### Problem 3

Come up with 3 or more ways to select the `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` columns from `df`.

*Hint.* You may find [this section on built-in Python string methods](https://realpython.com/python-strings/#built-in-string-methods) from [this *Real Python* article on strings and character data](https://realpython.com/python-strings/) useful.

### Problem 4

Create a list of all Southwest (WN) flights cancelled on December 10, 2013, departing from Newark (EWR). Sort the flights in order of scheduled arrival time. Drop the following columns: `dep_time`, `arr_time`, `dep_delay`, `arr_delay`, `air_time`, `time_hour`.

### Problem 5

Create a table of all Delta (DL) flights in 2013 from JFK to LAX that arrived more than 180 minutes late, sorted in descending order of arrival delay. Select and rename the columns so that your table looks like this:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>airline</th>
      <th>flight number</th>
      <th>year</th>
      <th>month</th>
      <th>day</th>
      <th>delay</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>152312</th>
      <td>DL</td>
      <td>2363</td>
      <td>2013</td>
      <td>3</td>
      <td>18</td>
      <td>784.0</td>
    </tr>
    <tr>
      <th>256516</th>
      <td>DL</td>
      <td>1163</td>
      <td>2013</td>
      <td>7</td>
      <td>7</td>
      <td>302.0</td>
    </tr>
    <tr>
      <th>258671</th>
      <td>DL</td>
      <td>2363</td>
      <td>2013</td>
      <td>7</td>
      <td>10</td>
      <td>284.0</td>
    </tr>
    <tr>
      <th>259397</th>
      <td>DL</td>
      <td>95</td>
      <td>2013</td>
      <td>7</td>
      <td>10</td>
      <td>274.0</td>
    </tr>
    <tr>
      <th>244507</th>
      <td>DL</td>
      <td>95</td>
      <td>2013</td>
      <td>6</td>
      <td>24</td>
      <td>259.0</td>
    </tr>
    <tr>
      <th>271972</th>
      <td>DL</td>
      <td>17</td>
      <td>2013</td>
      <td>7</td>
      <td>23</td>
      <td>224.0</td>
    </tr>
  </tbody>
</table>

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Notes and sources

- From the [Pandas User Guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/):
    - [Basics of accessing columns of a DataFrame](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#basics)

- From the [Pandas API reference](https://pandas.pydata.org/docs/reference/index.html):
    - [`DataFrame.sort_values`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html)
    - [`DataFrame.drop`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)
    - [`DataFrame.rename`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html)

- Lesson and problems inspired by Chapter 5 of [R for Data Science](https://r4ds.had.co.nz/)