# Lecture 2 – Data 100, Spring 2025

[Acknowledgments Page](https://ds100.org/sp25/acks/)

A high-level overview of the [`pandas`](https://pandas.pydata.org) library to accompany Lecture 2.

In [1]:
# `pd` is the conventional alias for Pandas, as `np` is for NumPy
import pandas as pd

## Series, DataFrames, and Indices 

Series, DataFrames, and Indices are fundamental `pandas` data structures for storing tabular data and processing the data using vectorized operations.
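To make "vectorized" concrete, here is a minimal sketch comparing one arithmetic expression applied to a whole `Series` with the equivalent explicit loop. The `prices` values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the idea
prices = pd.Series([5.49, 3.99, 2.50])

# Vectorized: the multiplication is applied to every element at once
doubled = prices * 2

# The same computation written as an explicit Python loop, for comparison
doubled_loop = pd.Series([p * 2 for p in prices])

# Both approaches produce the same Series
print(doubled.equals(doubled_loop))
```

The vectorized form is shorter and is typically faster on large data, since the loop runs inside the library rather than in Python.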

### Series

A `Series` is a 1-D labeled array of data. We can think of it as columnar data. 

#### Creating a new `Series` object
Below, we create a `Series` object and examine its two components: 1) values and 2) index.

In [2]:
s = pd.Series(["welcome", "to", "data 100"])

s

0     welcome
1          to
2    data 100
dtype: object

In [3]:
s.values

array(['welcome', 'to', 'data 100'], dtype=object)

In [4]:
s.index

RangeIndex(start=0, stop=3, step=1)

In the example above, `pandas` automatically generated an `Index` of integer labels. We can also create a `Series` object by providing a custom `Index`.

In [5]:
s = pd.Series([-1, 10, 2], index=["a", "b", "c"])
s

a    -1
b    10
c     2
dtype: int64

In [6]:
s.values

array([-1, 10,  2])

In [7]:
s.index

Index(['a', 'b', 'c'], dtype='object')

After a `Series` has been created, we can replace its `Index` by assigning a new one to `s.index`.

In [8]:
s.index = ["first", "second", "third"]
s

first     -1
second    10
third      2
dtype: int64
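One detail worth checking when reassigning an index: the new labels must match the number of values. The sketch below repeats the reassignment above and then attempts one with too few labels; the names `mismatch_error`, `"only"`, and `"two"` are illustrative.

```python
import pandas as pd

s = pd.Series([-1, 10, 2], index=["a", "b", "c"])

# Replacing the labels leaves the values untouched
s.index = ["first", "second", "third"]

# Assigning a different number of labels than values raises a ValueError
mismatch_error = None
try:
    s.index = ["only", "two"]
except ValueError as err:
    mismatch_error = err

print(type(mismatch_error).__name__)
```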

#### Selection in Series
We can select a single value or a set of values in a `Series` using:
- A single label
- A list of labels
- A filtering condition

In [9]:
s = pd.Series([4, -2, 0, 6], index=["a", "b", "c", "d"])
s

a    4
b   -2
c    0
d    6
dtype: int64

**Selection using one or more label(s)**

In [10]:
# Selection using a single label
# Notice how the return value is a single array element
s["a"]

4

In [11]:
# Selection using a list of labels
# Notice how the return value is another Series
s[["a", "c"]]

a    4
c    0
dtype: int64

**Selection using a filter condition**

In [12]:
# Filter condition: select all elements greater than 0
s>0

a     True
b    False
c    False
d     True
dtype: bool

In [13]:
# Use the Boolean filter to select data from the original Series
s[s>0]

a    4
d    6
dtype: int64
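Filter conditions can also be combined. The following sketch, which reuses the same `Series`, assumes we want the values strictly between 0 and 5 and, separately, the values that are not positive; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([4, -2, 0, 6], index=["a", "b", "c", "d"])

# & (and), | (or), and ~ (not) combine Boolean Series element by element.
# Each comparison needs parentheses because & and | bind more tightly than > and <.
between = s[(s > 0) & (s < 5)]

# ~ inverts a Boolean Series, so this keeps the elements that are NOT greater than 0
not_positive = s[~(s > 0)]

print(between)
print(not_positive)
```

Python's `and`, `or`, and `not` keywords do not work element by element on a `Series`, which is why the symbolic operators are used here.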

### DataFrame

A `DataFrame` is a 2-D tabular data structure with both row and column labels. In this lecture, we will see how a `DataFrame` can be created from scratch or loaded from a file. 

#### Creating a new `DataFrame` object
We can create a `DataFrame` in a variety of ways. Here, we cover the following:
1. From a CSV file
2. Using a list and column names
3. From a dictionary
4. From a `Series`


##### Creating a `DataFrame` from a CSV file
For loading data into a `DataFrame`, `pandas` has a number of very useful file reading tools. We'll be using `read_csv` today to load data from a CSV file into a `DataFrame` object. 

In [14]:
elections = pd.read_csv("data/elections.csv")
elections

,Year,Candidate,Party,Popular vote,Result,%
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073
4,1832,Andrew Jackson,Democratic,702735,win,54.574789
...,...,...,...,...,...,...
182,2024,Donald Trump,Republican,77303568,win,49.808629
183,2024,Kamala Harris,Democratic,75019230,loss,48.336772
184,2024,Jill Stein,Green,861155,loss,0.554864
185,2024,Robert Kennedy,Independent,756383,loss,0.487357


By passing a column name to the `index_col` parameter of `read_csv`, we can choose which column becomes the `Index` when the `DataFrame` is created.

In [15]:
elections = pd.read_csv("data/elections.csv", index_col="Candidate")
elections

Candidate,Year,Party,Popular vote,Result,%
Andrew Jackson,1824,Democratic-Republican,151271,loss,57.210122
John Quincy Adams,1824,Democratic-Republican,113142,win,42.789878
Andrew Jackson,1828,Democratic,642806,win,56.203927
John Quincy Adams,1828,National Republican,500897,loss,43.796073
Andrew Jackson,1832,Democratic,702735,win,54.574789
...,...,...,...,...,...
Donald Trump,2024,Republican,77303568,win,49.808629
Kamala Harris,2024,Democratic,75019230,loss,48.336772
Jill Stein,2024,Green,861155,loss,0.554864
Robert Kennedy,2024,Independent,756383,loss,0.487357


In [16]:
elections = pd.read_csv("data/elections.csv", index_col="Year")
elections

Year,Candidate,Party,Popular vote,Result,%
1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
1828,Andrew Jackson,Democratic,642806,win,56.203927
1828,John Quincy Adams,National Republican,500897,loss,43.796073
1832,Andrew Jackson,Democratic,702735,win,54.574789
...,...,...,...,...,...
2024,Donald Trump,Republican,77303568,win,49.808629
2024,Kamala Harris,Democratic,75019230,loss,48.336772
2024,Jill Stein,Green,861155,loss,0.554864
2024,Robert Kennedy,Independent,756383,loss,0.487357
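The effect of `index_col` can be reproduced without the data file. The sketch below reads a small in-memory CSV through `io.StringIO`; the text in `csv_text` is a hypothetical stand-in built from three rows of the table above, not the actual contents of `data/elections.csv`.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file: three rows copied from the output above
csv_text = """Year,Candidate,Party,Popular vote,Result,%
1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
1828,Andrew Jackson,Democratic,642806,win,56.203927
"""

# Without index_col, pandas generates the default integer index
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col, the named column becomes the index and is no longer a regular column
by_candidate = pd.read_csv(io.StringIO(csv_text), index_col="Candidate")

print(default_index.shape, by_candidate.shape)
print(by_candidate.index.is_unique)
```

Note that index labels do not have to be unique: "Andrew Jackson" appears twice in this sample, just as candidates and years repeat in the full table.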


##### Creating a `DataFrame` using a list and column names

In [17]:
# Creating a single-column DataFrame using a list
df_list_1 = pd.DataFrame([1, 2, 3], 
                         columns=["Number"])
display(df_list_1)

,Number
0,1
1,2
2,3


In [18]:
# Creating a multi-column DataFrame using a list of lists
df_list_2 = pd.DataFrame([[1, "one"], [2, "two"]], 
                         columns=["Number", "Description"])
df_list_2

,Number,Description
0,1,one
1,2,two


##### Creating a `DataFrame` from a dictionary

In [19]:
# Creating a DataFrame from a dictionary of columns
df_dict_1 = pd.DataFrame({"Fruit":["Strawberry", "Orange"], 
                          "Price":[5.49, 3.99]})
df_dict_1

,Fruit,Price
0,Strawberry,5.49
1,Orange,3.99


In [20]:
# Creating a DataFrame from a list of row dictionaries
df_dict_2 = pd.DataFrame([{"Fruit":"Strawberry", "Price":5.49}, 
                          {"Fruit":"Orange", "Price":3.99}])
df_dict_2

,Fruit,Price
0,Strawberry,5.49
1,Orange,3.99
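The two dictionary-based constructions describe the same table from different directions: one column at a time, or one row at a time. A minimal sketch to check that they agree (the names `by_columns`, `by_rows`, and `same_values` are illustrative):

```python
import pandas as pd

# One dictionary whose keys are column names and whose values are whole columns
by_columns = pd.DataFrame({"Fruit": ["Strawberry", "Orange"],
                           "Price": [5.49, 3.99]})

# A list with one dictionary per row, keyed by column name
by_rows = pd.DataFrame([{"Fruit": "Strawberry", "Price": 5.49},
                        {"Fruit": "Orange", "Price": 3.99}])

# Compare cell by cell, then reduce over both axes to a single Boolean
same_values = bool((by_columns == by_rows).all().all())
print(same_values)
```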


##### Creating a `DataFrame` from a `Series`

In [21]:
# In the examples below, we create a DataFrame from a Series

s_a = pd.Series(["a1", "a2", "a3"], index=["r1", "r2", "r3"])
s_b = pd.Series(["b1", "b2", "b3"], index=["r1", "r2", "r3"])

In [22]:
# Passing Series objects for columns
df_ser = pd.DataFrame({"A-column":s_a, "B-column":s_b})
df_ser

,A-column,B-column
r1,a1,b1
r2,a2,b2
r3,a3,b3
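In the example above, both `Series` share the same index. When the indices differ, `pandas` lines the values up by label rather than by position. The sketch below assumes a hypothetical third `Series`, `s_c`, whose labels are shifted by one row.

```python
import pandas as pd

s_a = pd.Series(["a1", "a2", "a3"], index=["r1", "r2", "r3"])
# Hypothetical Series with a partly different index: no "r1", but an extra "r4"
s_c = pd.Series(["c2", "c3", "c4"], index=["r2", "r3", "r4"])

# Rows are matched by label; labels missing from one Series are filled with NaN
aligned = pd.DataFrame({"A-column": s_a, "C-column": s_c})
print(aligned)
```

The resulting `DataFrame` has one row for every label that appears in either `Series`.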


In [23]:
# Passing a Series to the DataFrame constructor to make a one-column DataFrame
df_ser = pd.DataFrame(s_a)
df_ser

,0
r1,a1
r2,a2
r3,a3


In [24]:
# Using to_frame() to convert a Series to DataFrame
ser_to_df = s_a.to_frame()
ser_to_df

,0
r1,a1
r2,a2
r3,a3
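The column label `0` in the two outputs above appears because the `Series` has no name. As a sketch, a label can be supplied either through the `name` argument of `to_frame` or by naming the `Series` itself; the label `"A-column"` is reused from the earlier example.

```python
import pandas as pd

s_a = pd.Series(["a1", "a2", "a3"], index=["r1", "r2", "r3"])

# An unnamed Series produces a column labeled 0
unnamed = s_a.to_frame()

# Passing `name` to to_frame sets the column label
named = s_a.to_frame(name="A-column")

# A Series that already has a name carries it over to the column
s_named = pd.Series(["a1", "a2", "a3"], index=["r1", "r2", "r3"], name="A-column")
from_named = s_named.to_frame()

print(list(unnamed.columns), list(named.columns), list(from_named.columns))
```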


In [25]:
# Creating a DataFrame from a CSV file and specifying the Index column
elections = pd.read_csv("data/elections.csv", index_col="Candidate")
elections.head(5) # Using `.head` shows only the first 5 rows to save space

Candidate,Year,Party,Popular vote,Result,%
Andrew Jackson,1824,Democratic-Republican,151271,loss,57.210122
John Quincy Adams,1824,Democratic-Republican,113142,win,42.789878
Andrew Jackson,1828,Democratic,642806,win,56.203927
John Quincy Adams,1828,National Republican,500897,loss,43.796073
Andrew Jackson,1832,Democratic,702735,win,54.574789


In [26]:
elections.reset_index(inplace=True) # Need to reset the Index to keep 'Candidate' as one of the DataFrame columns
elections.set_index("Party", inplace=True) # This sets the Index to the "Party" column
elections

Party,Candidate,Year,Popular vote,Result,%
Democratic-Republican,Andrew Jackson,1824,151271,loss,57.210122
Democratic-Republican,John Quincy Adams,1824,113142,win,42.789878
Democratic,Andrew Jackson,1828,642806,win,56.203927
National Republican,John Quincy Adams,1828,500897,loss,43.796073
Democratic,Andrew Jackson,1832,702735,win,54.574789
...,...,...,...,...,...
Republican,Donald Trump,2024,77303568,win,49.808629
Democratic,Kamala Harris,2024,75019230,loss,48.336772
Green,Jill Stein,2024,861155,loss,0.554864
Independent,Robert Kennedy,2024,756383,loss,0.487357
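To see why the `reset_index` call matters, the sketch below repeats the same steps on a small hypothetical `DataFrame` (two rows taken from the table above). It uses the non-`inplace` forms of `set_index` and `reset_index`, which return a new `DataFrame` and leave the original unchanged.

```python
import pandas as pd

# Hypothetical two-row table, used only to illustrate the index operations
df = pd.DataFrame({"Candidate": ["Andrew Jackson", "John Quincy Adams"],
                   "Party": ["Democratic", "National Republican"],
                   "Year": [1828, 1828]})

by_candidate = df.set_index("Candidate")

# Resetting first moves "Candidate" back into the columns before "Party" becomes the index
by_party = by_candidate.reset_index().set_index("Party")

# Skipping the reset replaces the old index, so "Candidate" is lost
dropped = by_candidate.set_index("Party")

print(list(by_party.columns))
print(list(dropped.columns))
```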


#### `DataFrame` attributes: `index`, `columns`, and `shape`

In [27]:
elections.index

Index(['Democratic-Republican', 'Democratic-Republican', 'Democratic',
       'National Republican', 'Democratic', 'National Republican',
       'Anti-Masonic', 'Whig', 'Democratic', 'Whig',
       ...
       'Green', 'Democratic', 'Republican', 'Libertarian', 'Green',
       'Republican', 'Democratic', 'Green', 'Independent',
       'Libertarian Party'],
      dtype='object', name='Party', length=187)

In [28]:
elections.columns

Index(['Candidate', 'Year', 'Popular vote', 'Result', '%'], dtype='object')

The `Index` can be reset to the default integer labels by calling `reset_index()` on a `DataFrame`. The previous index then becomes a regular column.

In [29]:
elections.reset_index(inplace=True) # Revert the Index back to its default numeric labeling
elections

,Party,Candidate,Year,Popular vote,Result,%
0,Democratic-Republican,Andrew Jackson,1824,151271,loss,57.210122
1,Democratic-Republican,John Quincy Adams,1824,113142,win,42.789878
2,Democratic,Andrew Jackson,1828,642806,win,56.203927
3,National Republican,John Quincy Adams,1828,500897,loss,43.796073
4,Democratic,Andrew Jackson,1832,702735,win,54.574789
...,...,...,...,...,...,...
182,Republican,Donald Trump,2024,77303568,win,49.808629
183,Democratic,Kamala Harris,2024,75019230,loss,48.336772
184,Green,Jill Stein,2024,861155,loss,0.554864
185,Independent,Robert Kennedy,2024,756383,loss,0.487357


In [30]:
elections.shape

(187, 6)
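The three attributes are related: `shape` is the pair (number of rows, number of columns), and the index is not counted as a column. A minimal sketch on a small hypothetical `DataFrame`:

```python
import pandas as pd

# Hypothetical two-row table reused from the dictionary example
fruits = pd.DataFrame({"Fruit": ["Strawberry", "Orange"],
                       "Price": [5.49, 3.99]})

# shape matches the lengths of the row index and the column index
n_rows, n_cols = fruits.shape
matches = (n_rows, n_cols) == (len(fruits.index), len(fruits.columns))

# Moving "Fruit" into the index removes it from the column count
indexed = fruits.set_index("Fruit")

print(fruits.shape, indexed.shape, matches)
```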