# Lecture 3 – Pandas II

### DATA 2201, Fall 2025

A demonstration of advanced `pandas` syntax to accompany Lecture 3.

In [None]:
import numpy as np
import pandas as pd

In [None]:
# Loading the elections DataFrame
elections = pd.read_csv("../data/elections.csv")

elections.head() 

## Slicing in `DataFrame`s

### Review: Label-Based Extraction Using `loc`

Arguments to `.loc` can be:
1. A list.
2. A slice (syntax is inclusive of the right-hand side of the slice).
3. A single value.

#### For example: Selection by a list

<details>
<summary>Click to show solution</summary>

<pre>

elections.loc[[87, 25, 179], ["Year", "Candidate", "Result"]]

</pre>
</details>


In [None]:
# For example: Selection by a list
elections.loc[[87, 25, 179], ["Year", "Candidate", "Result"]]

### Review: Integer-Based Extraction Using `iloc`

`iloc` selects items by row and column *integer* position.

Arguments to `.iloc` can be:
1. A list.
2. A slice (syntax is exclusive of the right-hand side of the slice).
3. A single value.
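
The three argument types can be sketched on a small stand-in table. The `toy` DataFrame below is a hypothetical example, not the real `elections` data, so the positions are easy to check by eye.

```python
import pandas as pd

# Hypothetical toy data standing in for the elections DataFrame
toy = pd.DataFrame({
    "Year": [1824, 1824, 1828, 1828],
    "Candidate": ["Andrew Jackson", "John Quincy Adams",
                  "Andrew Jackson", "John Quincy Adams"],
    "Result": ["loss", "win", "win", "loss"],
})

# 1. Lists: rows at positions 1, 2, 3 and columns at positions 0, 1, 2
by_list = toy.iloc[[1, 2, 3], [0, 1, 2]]

# 2. Slices: the right-hand side is excluded, so 1:3 means positions 1 and 2
by_slice = toy.iloc[1:3, 0:2]

# 3. Single values: the entry at row position 0, column position 1
single = toy.iloc[0, 1]
```

Note that `1:3` selects two rows here, whereas the same slice passed to `.loc` on a default index would select three.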


In [None]:
# Select the rows at positions 1, 2, and 3.
# Select the columns at positions 0, 1, and 2.
# Remember that Python indexing begins at position 0!


In [None]:
# Index-based extraction using a list of rows and a slice of column indices


In [None]:
# Selecting all rows using a colon


In [None]:
# Extracting the value at row 0 and the second column


### Context-dependent Extraction using `[]`

- We could technically do anything we want using `loc` or `iloc`. 
- However, in practice, the `[]` operator is often used instead to yield more concise code.

- `[]` is a bit trickier to understand than `loc` or `iloc`, but it achieves essentially the same functionality.
- The difference is that `[]` is *context-dependent*.

- `[]` only takes one argument, which may be:
    1. A slice of row integers.
    2. A list of column labels.
    3. A single column label.


#### Why Use `[]`?
- In short: `[]` can be much more concise than `.loc` or `.iloc`.
- Consider the case where we wish to extract the "Candidate" column. It is far simpler to write `elections["Candidate"]` than it is to write `elections.loc[:, "Candidate"]`.

- In practice, `[]` is often used over `.iloc` and `.loc` in data science work. Typing time adds up!


If we provide a slice of row numbers, we get the numbered rows.

If we provide a list of column names, we get the listed columns.

And if we provide a single column name we get back just that column, stored as a `Series`.
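
A minimal sketch of the three cases, assuming a made-up `toy` DataFrame (the names and counts are illustrative values, not taken from any dataset in this lecture):

```python
import pandas as pd

# Hypothetical toy data with illustrative values
toy = pd.DataFrame({
    "Name": ["Mary", "Helen", "Dorothy", "Margaret"],
    "Count": [295, 239, 220, 163],
})

# A slice of row integers returns the rows at positions 1 and 2
rows = toy[1:3]

# A list of column labels returns a DataFrame with those columns
cols = toy[["Name", "Count"]]

# A single column label returns that column as a Series
col = toy["Name"]
```

The same brackets do three different jobs, which is why the type of the argument matters: `toy[["Name"]]` is a one-column `DataFrame`, while `toy["Name"]` is a `Series`.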

### Exercise - Check your understanding

In [None]:
weird = pd.DataFrame({
    1:["topdog","botdog"], 
    "1":["topcat","botcat"]
})
weird

A. What is the result of the following code? 

```python
weird[1]
```

<details><summary>Click for Solution</summary> <br>
    
```python
0    topdog
1    botdog
Name: 1, dtype: object
```
    
</details><br>

B. What is the result of the following code?

```python
weird["1"]
```

<details><summary>Click for Solution</summary> <br>
    
```python
0    topcat
1    botcat
Name: 1, dtype: object
```
    
</details><br>

C. What is the result of the following code?

```python
weird[1:]
```


In [None]:
weird[1:]

## Dataset: California baby names

- In today's lecture, we'll work with the `babynames` dataset, which contains information about the names of infants born in California.

- The cell below pulls the data from a government website (the Social Security Administration) and then loads it into a usable form.

- The download code shown here is outside the scope of lecture. Focus on the `pandas` code in the examples and topics that follow.

In [None]:
import urllib.request
import os.path
import zipfile

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "../data/babynamesbystate.zip"
if not os.path.exists(local_filename): # If the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

zf = zipfile.ZipFile(local_filename, 'r')

ca_name = 'CA.TXT'
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
    babynames = pd.read_csv(fh, header=None, names=field_names)

babynames.head()

## Conditional Selection

### Boolean Array Input for `.loc` and `[]`
- We can extract data according to its integer position (`.iloc`) or its label (`.loc`).
- What if we want to extract rows that satisfy a given condition?
- `.loc` and `[]` also accept boolean arrays as input.
- Rows corresponding to `True` are extracted; rows corresponding to `False` are not.


In [None]:
# Ask yourself: Why is :9 the correct slice to select the first 10 rows with .loc?
babynames_first_10_rows = ...

babynames_first_10_rows

- By passing in a sequence (list, array, or `Series`) of boolean values, we can extract a subset of the rows in a `DataFrame`.
- We will keep *only* the rows that correspond to a boolean value of `True`.

In [None]:
# Notice how we have exactly 10 elements in our boolean array argument.
babynames_first_10_rows[[True, False, True, False, True, 
                         False, True, False, True, False]]

In [None]:
# Or using .loc to filter a DataFrame by a Boolean array argument.
babynames_first_10_rows.loc[[True, False, True, False, True, 
                             False, True, False, True, False], :]


#### Oftentimes, we'll use boolean selection to check for entries in a `DataFrame` that meet a particular condition.

In [None]:
# First, use a logical condition to generate a boolean Series
logical_operator = ...
logical_operator

In [None]:
# Then, use this boolean Series to filter the DataFrame
babynames[logical_operator]

Boolean selection also works with `loc`!
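
As a rough illustration, here is the same idea on a hypothetical `toy` DataFrame (the names and values are invented for this sketch). The boolean `Series` goes in the row position of `.loc`, and the column argument is optional.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration
toy = pd.DataFrame({
    "Sex": ["F", "M", "F", "M"],
    "Year": [1990, 1995, 2005, 2010],
    "Name": ["Ana", "Ben", "Cara", "Dev"],
})

# A logical condition produces a boolean Series, one entry per row
mask = toy["Year"] < 2000

# With no column argument, .loc keeps every column
filtered_all = toy.loc[mask]

# With a column argument, .loc filters rows and selects columns in one step
filtered_names = toy.loc[mask, ["Name"]]
```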

In [None]:
# Notice that we did not have to specify columns to select 
# If no columns are referenced, pandas will automatically select all columns


### Bitwise Operators

To filter on multiple conditions, we combine boolean conditions using **bitwise operators**.

Symbol | Usage      | Meaning
------ | ---------- | -------------------------------------
~      | ~p         | Returns negation of p
&#124; | p &#124; q | p OR q
&      | p & q      | p AND q
^      | p ^ q      | p XOR q (exclusive or)
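
The four operators in the table can be tried on a small hypothetical `toy` DataFrame (invented values, not the real `babynames` data). Naming each condition first avoids the need for extra parentheses; when conditions are written inline, each comparison must be wrapped in parentheses because `&` and `|` bind more tightly than `==` and `<`.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration
toy = pd.DataFrame({
    "Sex": ["F", "M", "F", "M"],
    "Year": [1990, 1995, 2005, 2010],
    "Name": ["Ana", "Ben", "Cara", "Dev"],
})

is_female = toy["Sex"] == "F"      # True, False, True, False
before_2000 = toy["Year"] < 2000   # True, True, False, False

both = toy[is_female & before_2000]         # AND: both conditions hold
either = toy[is_female | before_2000]       # OR: at least one holds
exactly_one = toy[is_female ^ before_2000]  # XOR: exactly one holds
not_female = toy[~is_female]                # NOT: condition is negated
```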

#### Example:
- This line filters the babynames DataFrame to return only the rows where:
    - The baby is female (Sex == "F"), and
    - The birth year is before 2000 (Year < 2000)


<details>
<summary>Click to show solution</summary>

<pre>

babynames[(babynames["Sex"] == "F") & (babynames["Year"] < 2000)]

</pre>
</details>


#### Repeat the earlier example with the `|` operator

### Exercise - Check your understanding 

How could you write a pandas statement to return a DataFrame of the first 3 baby names with Count > 250?



<details><summary>Click for Solution</summary> <br>

If you know the rows of data that meet this condition: 

```python
babynames.iloc[[0, 233, 484], [3, 4]]
# or 
babynames.loc[[0, 233, 484]]
```

Alternatively, use a conditional selection and then select the first three rows:
```python
babynames.loc[babynames["Count"] > 250, ["Name", "Count"]].head(3)
# or 
babynames.loc[babynames["Count"] > 250, ["Name", "Count"]].iloc[0:3, :]
```
    
</details><br>

In [None]:
# Note: The parentheses surrounding the code make it possible to break the code into multiple lines for readability

(
    babynames[(babynames["Name"] == "Bella") | 
              (babynames["Name"] == "Alex") |
              (babynames["Name"] == "Narges") |
              (babynames["Name"] == "Lisa")]
)


In [None]:
# A more concise method to achieve the above: .isin
names = ["Bella", "Alex", "Narges", "Lisa"]




<details>
<summary>Click to show solution</summary>

<pre>

display(babynames["Name"].isin(names))
display(babynames[babynames["Name"].isin(names)])

</pre>
</details>

In [None]:
# What if we only want names that start with "N"?



<details>
<summary>Click to show solution</summary>

<pre>

display(babynames["Name"].str.startswith("N"))
display(babynames[babynames["Name"].str.startswith("N")])

</pre>
</details>

## Adding, Removing, and Modifying Columns

#### Adding a column is easy:

- To add a column, use `[]` to reference the desired new column.
- Assign it to a `Series` or array of appropriate length.
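
One possible way to do this, sketched on a hypothetical `toy` DataFrame: the `.str.len()` accessor is one way to build a `Series` of name lengths, though it is not necessarily the approach used in lecture.

```python
import pandas as pd

# Hypothetical toy data; counts are invented for illustration
toy = pd.DataFrame({
    "Name": ["Mary", "Helen", "Dorothy"],
    "Count": [295, 239, 220],
})

# Build a Series with one length per row; it shares toy's index
lengths = toy["Name"].str.len()

# Assigning to a label that does not exist yet creates the column
toy["name_lengths"] = lengths
```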

In [None]:
# Create a Series of the length of each name
babyname_lengths = ...

# Add a column named "name_lengths" that includes the length of each name
babynames["name_lengths"] = babyname_lengths

babynames

#### Modifying a column is very similar to adding a column.

- To modify a column, use `[]` to access the desired column. 
- Re-assign it to a new array or Series.
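
A short sketch of the same pattern, again on a hypothetical `toy` DataFrame. The only difference from adding a column is that the label already exists, so the assignment overwrites it.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    "Name": ["Mary", "Helen", "Dorothy"],
    "name_lengths": [4, 5, 7],
})

# Arithmetic on a Series applies element-wise; re-assigning overwrites the column
toy["name_lengths"] = toy["name_lengths"] - 1
```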

In [None]:
# Modify the "name_lengths" column to be one less than its original value
babynames["name_lengths"] = ...
babynames

### Syntax for Renaming a Column

- Rename a column using the (creatively named) `.rename()` method.
    - `.rename()` takes in a dictionary that maps old column names to new ones.
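
A minimal sketch on a hypothetical `toy` DataFrame. By default `.rename()` returns a new `DataFrame` and leaves the original untouched, which is why the result is usually assigned back to a variable.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    "Name": ["Mary", "Helen", "Dorothy"],
    "name_lengths": [4, 5, 7],
})

# The dictionary maps old column names to new ones
renamed = toy.rename(columns={"name_lengths": "Length"})
```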


In [None]:
# Rename "name_lengths" to "Length"
babynames = ...
babynames

### Syntax for Dropping a Column (or Row)
- Remove a column using `.drop()`.
- The `.drop()` method assumes you're dropping a row by default. Use `axis="columns"` to drop a column instead.
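
Both behaviors can be seen on a hypothetical `toy` DataFrame. Like `.rename()`, `.drop()` returns a new `DataFrame` by default rather than modifying the original.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    "Name": ["Mary", "Helen", "Dorothy"],
    "Length": [4, 5, 7],
})

# Drop a column: the axis must be given, since rows are the default
without_length = toy.drop("Length", axis="columns")

# Drop a row: with the default axis, 0 is read as a row label
without_first_row = toy.drop(0)
```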


In [None]:
# Remove our new "Length" column
babynames = ...
babynames

## Useful Utility Functions

#### `NumPy`

The `NumPy` functions you encountered in [Data 8](https://www.data8.org/su23/reference/#array-functions-and-methods) are compatible with objects in `pandas`. 
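
For instance, `np.mean` and `np.max` accept a `Series` directly. The `counts` values below are invented for illustration and are not real counts for any name.

```python
import numpy as np
import pandas as pd

# Hypothetical yearly counts; values are invented for illustration
counts = pd.Series([5, 8, 12, 7], name="Count")

# NumPy functions operate on the values of the Series
mean_count = np.mean(counts)
max_count = np.max(counts)
```

The equivalent `Series` methods, `counts.mean()` and `counts.max()`, give the same results.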

In [None]:
yash_counts = ...
yash_counts


<details>
<summary>Click to show solution</summary>

<pre>

yash_counts = babynames[babynames["Name"] == "Yash"]["Count"]

</pre>
</details>



In [None]:
# Average number of babies named Yash each year



In [None]:
# Max number of babies named Yash born in any single year



#### Built-In `pandas` Methods

- There are *many* utility functions built into `pandas`, far more than we can possibly cover in lecture.
- You are encouraged to explore all the functionality outlined in the `pandas` [documentation](https://pandas.pydata.org/docs/reference/index.html).
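
A few of these utilities are sketched below on a hypothetical `toy` DataFrame with invented values. This is a small sampler, not a complete list.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration
toy = pd.DataFrame({
    "Name": ["Ana", "Ben", "Ana", "Cara"],
    "Count": [10, 30, 20, 40],
})

shape = toy.shape                        # (num_rows, num_columns)
size = toy.size                          # num_rows * num_columns
summary = toy.describe()                 # summary statistics for numeric columns
name_counts = toy["Name"].value_counts() # occurrences of each unique value
unique_names = toy["Name"].unique()      # unique values, in order of appearance
by_count = toy.sort_values(by="Count", ascending=False)  # largest Count first
random_rows = toy.sample(2)              # two rows chosen at random
```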

In [None]:
# Returns the shape of the object in the format (num_rows, num_columns)


In [None]:
# Returns the total number of entries in the object, equal to num_rows * num_columns


In [None]:
# What summary statistics can we describe?


In [None]:
# Our statistics are slightly different when working with a Series


In [None]:
# Randomly sample row(s) from the DataFrame


In [None]:
# Rerun this cell a few times – you'll get different results!



<details>
<summary>Click to show solution</summary>

<pre>

babynames.sample(5).iloc[:, 2:]

</pre>
</details>



In [None]:
# Sampling with replacement



<details>
<summary>Click to show solution</summary>

<pre>

babynames[babynames["Year"] == 2000].sample(4, replace = True).iloc[:,2:]

</pre>
</details>


In [None]:
# Count the number of times each unique value occurs in a Series


In [None]:
# Return an array of all unique values in the Series


In [None]:
# Sort a Series


In [None]:
# Sort a DataFrame – there are lots of Michaels in California



<details>
<summary>Click to show solution</summary>

<pre>

babynames.sort_values(by="Count", ascending=False)

</pre>
</details>

