# Agenda

- Data frames
    - Creating
    - Retrieving
    - Methods (series methods and special data frame methods)
- Loading data
    - CSV
    - Excel
    - Downloading from the Internet (retrieving files and scraping sites)

In [2]:
import pandas as pd
from pandas import Series, DataFrame

In [3]:
# create a series based on any iterable in Python (usually a list or a NumPy array)
# we can pass additional keyword arguments, especially index and dtype

# Data frames are similar: we pass a 2D set of values (a list of lists or a 2D NumPy array),
# plus, optionally, index (the row labels) and columns (the names of the columns we want)

In [4]:
df = DataFrame([[10, 20, 30],
                [40, 50, 60],
                [70, 80, 90],
                [100, 110, 120]])
df

Unnamed: 0,0,1,2
0,10,20,30
1,40,50,60
2,70,80,90
3,100,110,120


In [5]:
type(df)

pandas.core.frame.DataFrame

# Our basic data frame

- A data frame has rows and columns
- The rows are identified with the index (i.e., the same index that we had on our series)
- The columns are identified with names that we just call "columns"

The key thing to remember: Each column is a Pandas series!

- I can retrieve a row with `.loc` or `.iloc`, as before
- I can retrieve a column with `[]` and the column's name

In [6]:
df.loc[1]

0    40
1    50
2    60
Name: 1, dtype: int64

In [7]:
df[1]  # this will retrieve the column

0     20
1     50
2     80
3    110
Name: 1, dtype: int64

In [8]:
%timeit df.loc[1]

11.4 μs ± 129 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [9]:
%timeit df.iloc[1]

10.3 μs ± 120 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [10]:
%timeit df[1]

1.33 μs ± 17.2 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


In [11]:
# behind the scenes, a data frame whose columns all share one dtype can be viewed as a 2D NumPy array

df.values

array([[ 10,  20,  30],
       [ 40,  50,  60],
       [ 70,  80,  90],
       [100, 110, 120]])
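
The NumPy view is simplest when every column has the same dtype. Here is a minimal sketch of what happens otherwise, using `to_numpy()`, which the pandas documentation recommends over `.values`. The names `same_dtype` and `mixed` are made up for this illustration.

```python
import numpy as np
import pandas as pd

# All columns are integers, so the array keeps an integer dtype
same_dtype = pd.DataFrame([[10, 20, 30], [40, 50, 60]])
arr = same_dtype.to_numpy()
print(type(arr), arr.shape)

# One int column and one float column: a NumPy array has a single dtype,
# so the combined array is upcast to float64
mixed = pd.DataFrame({'i': [1, 2], 'f': [1.5, 2.5]})
print(mixed.dtypes)
print(mixed.to_numpy().dtype)   # float64
```

Each column keeps its own dtype inside the data frame; the upcast only happens when we ask for one combined array.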

In [12]:
# we can retrieve in all of the ways we've already seen!

df.loc[[1, 3]]  # fancy indexing

Unnamed: 0,0,1,2
1,40,50,60
3,100,110,120


In [13]:
df[[0, 1]]  # here, we retrieve two columns

Unnamed: 0,0,1
0,10,20
1,40,50
2,70,80
3,100,110


In [14]:
# if I retrieve one row, or one column, then I get a series
df[1]

0     20
1     50
2     80
3    110
Name: 1, dtype: int64

In [15]:
df

Unnamed: 0,0,1,2
0,10,20,30
1,40,50,60
2,70,80,90
3,100,110,120


In [16]:
# We can name the index and the columns via keyword arguments when we create the data frame

df = DataFrame([[10, 20, 30],
                [40, 50, 60],
                [70, 80, 90],
                [100, 110, 120]],
              index=list('abcd'),
              columns=list('xyz'))
df

Unnamed: 0,x,y,z
a,10,20,30
b,40,50,60
c,70,80,90
d,100,110,120


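- Row names (just like the index in a series) can repeat, and that's even useful!
- Column names should be unique (like the keys in a dict). Pandas does allow repeated column names, but then `df['x']` returns a data frame rather than a series, so it's best to avoid them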
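
To illustrate the point about repeated row names, here is a small sketch. The data frame `sales` and its labels are invented for the example.

```python
import pandas as pd

# The label 'mon' appears twice in the index
sales = pd.DataFrame([[10, 20],
                      [40, 50],
                      [70, 80]],
                     index=['mon', 'tue', 'mon'],
                     columns=['x', 'y'])

# A repeated label gives back every matching row, as a data frame
mondays = sales.loc['mon']
print(mondays)

# A label that appears once still gives back a series
print(type(sales.loc['tue']))
```

This is why repeated labels can be useful: one `.loc` lookup retrieves all of the rows in a group.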

You can even think of a data frame as a dict of series, where the keys are the column names and the values are Pandas series objects.
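
The dict-of-series picture can be taken literally when creating a data frame. This is a sketch with illustrative names (`x`, `y`, `df_from_dict`), not the only way to build one.

```python
import pandas as pd

# Two series that share the same index
x = pd.Series([10, 40, 70, 100], index=list('abcd'))
y = pd.Series([20, 50, 80, 110], index=list('abcd'))

# The dict keys become the column names, and the series become the columns
df_from_dict = pd.DataFrame({'x': x, 'y': y})
print(df_from_dict)

# Retrieving a column hands back a series
print(type(df_from_dict['x']))
```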

In [17]:
df.loc['b']

x    40
y    50
z    60
Name: b, dtype: int64

In [18]:
df.iloc[2]

x    70
y    80
z    90
Name: c, dtype: int64

In [19]:
df['x']

a     10
b     40
c     70
d    100
Name: x, dtype: int64

In [20]:
df.loc[['a', 'b']]

Unnamed: 0,x,y,z
a,10,20,30
b,40,50,60


In [21]:
df[['x', 'z']]

Unnamed: 0,x,z
a,10,30
b,40,60
c,70,90
d,100,120


In [22]:
# if I retrieve a series, then I can run a method on it

df['x'].mean()

np.float64(55.0)

In [23]:
df['x'].describe()

count      4.000000
mean      55.000000
std       38.729833
min       10.000000
25%       32.500000
50%       55.000000
75%       77.500000
max      100.000000
Name: x, dtype: float64

In [25]:
# I can do the same for a row, since retrieving a row also gives me a series

df.loc['b'].describe()

count     3.0
mean     50.0
std      10.0
min      40.0
25%      45.0
50%      50.0
75%      55.0
max      60.0
Name: b, dtype: float64

As a general rule, any method that we can run on a series can also be run on a data frame. The method runs once per column: if it returns a single value for each column, we get back a series (indexed by column name), and if it returns a series for each column, we get back a data frame.

In [26]:
df.mean()   # this will calculate the mean for each column

x    55.0
y    65.0
z    75.0
dtype: float64

In [27]:
df.describe()   # this will invoke describe on each column

Unnamed: 0,x,y,z
count,4.0,4.0,4.0
mean,55.0,65.0,75.0
std,38.729833,38.729833,38.729833
min,10.0,20.0,30.0
25%,32.5,42.5,52.5
50%,55.0,65.0,75.0
75%,77.5,87.5,97.5
max,100.0,110.0,120.0


In [28]:
df

Unnamed: 0,x,y,z
a,10,20,30
b,40,50,60
c,70,80,90
d,100,110,120


In [30]:
# we've seen how we can retrieve either a row or a column.
# but how can we retrieve an individual value?

df.loc['b'].loc['y']

np.int64(50)

In [31]:
# even worse: mixing .loc with [] (this is still two separate lookups)

df.loc['b']['y']

np.int64(50)

In [32]:
# better is to use the 2-argument version of .loc, where you specify the row and the column

df.loc['b', 'y']

np.int64(50)

In [33]:
df.sum()

x    220
y    260
z    300
dtype: int64

In [34]:
# what if I want to sum across the rows?

df.sum(axis='columns')

a     60
b    150
c    240
d    330
dtype: int64
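
The `axis` argument names the axis that gets collapsed, which is easy to mix up. Here is a short sketch on a small frame (`df_demo` is an illustrative name) showing the string and integer spellings side by side.

```python
import pandas as pd

df_demo = pd.DataFrame([[10, 20, 30],
                        [40, 50, 60]],
                       index=list('ab'),
                       columns=list('xyz'))

# Default: axis='index' (same as axis=0), giving one total per column
col_totals = df_demo.sum()

# axis='columns' (same as axis=1), giving one total per row
row_totals = df_demo.sum(axis='columns')

print(col_totals)
print(row_totals)
```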

In [35]:
df = DataFrame([[10, 20, 30.5],
                [40, 50, 60],
                [70, 80, 90],
                [100, 110, 120]],
              index=list('abcd'),
              columns=list('xyz'))
df

Unnamed: 0,x,y,z
a,10,20,30.5
b,40,50,60.0
c,70,80,90.0
d,100,110,120.0


In [36]:
df.dtypes   # get the dtype of each column

x      int64
y      int64
z    float64
dtype: object

In [37]:
# what happens when we retrieve row 'c'?

df.loc['c']

x    70.0
y    80.0
z    90.0
Name: c, dtype: float64
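
The row comes back as `float64` because a series has a single dtype, and the row crosses both `int64` and `float64` columns. Here is a minimal sketch of that upcast, using an illustrative `df_demo`.

```python
import pandas as pd

df_demo = pd.DataFrame({'x': [10, 40], 'y': [20, 50], 'z': [30.5, 60.0]},
                       index=list('ab'))

# This row spans int64 and float64 columns, so the values become floats
row = df_demo.loc['b']
print(row.dtype)        # float64

# If we first keep only the integer columns, the row stays int64
ints_only_row = df_demo[['x', 'y']].loc['b']
print(ints_only_row.dtype)   # int64
```

The values stored in the data frame are unchanged; only the retrieved row is upcast.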

In [38]:
# what if I want to assign back to a value?

# this is a bad way to do it, as you'll see!
df.loc['c'].loc['z'] = 999 

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  df.loc['c'].loc['z'] = 999
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc['c'].loc['z'] = 999


In [39]:
df

Unnamed: 0,x,y,z
a,10,20,30.5
b,40,50,60.0
c,70,80,90.0
d,100,110,120.0


In [40]:
# the solution is always to use the two-argument version of .loc,
# which guarantees that we'll assign back to the original data frame

df.loc['c', 'z'] = 999 

In [41]:
df

Unnamed: 0,x,y,z
a,10,20,30.5
b,40,50,60.0
c,70,80,999.0
d,100,110,120.0


# Exercise: Simple data frame

1. Create a 5x5 data frame in which the rows are `abcde` and the columns are `vwxyz`. The values can be random integers from 0-1,000. (You can use a 2D NumPy array to initialize the values.)
2. Retrieve row `b`
3. Retrieve rows `b` and `d`
4. Retrieve rows `b`, `c`, and `d`
5. Retrieve column `w`
6. Retrieve columns `w` and `y`
7. Retrieve columns `w`, `x`, and `y`
8. Retrieve the item at row `e`, column `v`
9. Update the value at row `e`, column `z` to be 123.456

In [42]:
import numpy as np

np.random.randint(0, 1000, 20).reshape(4, 5)

array([[ 92, 855, 155, 443, 455],
       [386, 725, 796, 590, 586],
       [756, 567, 363, 394, 451],
       [449, 357,   3, 378, 714]])

In [43]:
np.random.randint(0, 1000, [4,5])

array([[444, 675, 888, 800, 967],
       [174, 391, 834, 520, 112],
       [976, 750, 342, 754, 295],
       [367, 243, 289, 455,  79]])

In [44]:
np.random.seed(0)

df = DataFrame(np.random.randint(0, 1000, [5,5]),
               index=list('abcde'),
               columns=list('vwxyz'))
df

Unnamed: 0,v,w,x,y,z
a,684,559,629,192,835
b,763,707,359,9,723
c,277,754,804,599,70
d,472,600,396,314,705
e,486,551,87,174,600


In [45]:
# Retrieve row b

df.loc['b']

v    763
w    707
x    359
y      9
z    723
Name: b, dtype: int64

In [46]:
%timeit df.loc['b']

12.2 μs ± 70 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [47]:
%timeit df.iloc[1]

10.6 μs ± 177 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [48]:
# Retrieve rows b and d

df.loc[['b', 'd']]

Unnamed: 0,v,w,x,y,z
b,763,707,359,9,723
d,472,600,396,314,705


In [49]:
# Retrieve rows b, c, and d

df.loc['b':'d']  # use a slice, up to and including (with df.loc)

Unnamed: 0,v,w,x,y,z
b,763,707,359,9,723
c,277,754,804,599,70
d,472,600,396,314,705


In [52]:
# Retrieve column w

%timeit df['w']

1.57 μs ± 10.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


In [53]:
# you can also use a different syntax, "dot syntax," to retrieve a column

%timeit df.w

2.92 μs ± 21 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
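
Dot syntax only works when the column name is a valid Python identifier that doesn't collide with an existing data frame attribute. Here is a small sketch of those limits; the column names `count` and `unit price` are chosen just to trigger them.

```python
import pandas as pd

df_demo = pd.DataFrame({'w': [1, 2],
                        'count': [3, 4],
                        'unit price': [5.0, 6.0]})

# A plain name works with both syntaxes
print(df_demo.w.equals(df_demo['w']))

# 'count' is also a DataFrame method, so the attribute is the method, not the column
print(callable(df_demo.count))

# Square brackets always reach the column, including names containing spaces
print(df_demo['count'].tolist())
print(df_demo['unit price'].tolist())
```

For that reason, square brackets are the safer habit.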


In [54]:
# Retrieve columns w and y

df[['w', 'y']]

Unnamed: 0,w,y
a,559,192
b,707,9
c,754,599
d,600,314
e,551,174


In [55]:
# Retrieve columns w, x, and y

df[['w', 'x', 'y']]

Unnamed: 0,w,x,y
a,559,629,192
b,707,359,9
c,754,804,599
d,600,396,314
e,551,87,174


In [56]:
df['w':'y']  # if you give a slice to [], Pandas looks for that slice ON THE ROWS!

Unnamed: 0,v,w,x,y,z
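
To slice on the columns instead, one option is the two-argument form of `.loc`, with `:` as the row selector. This is a sketch on a stand-in frame (`df_demo`, filled with `np.arange` rather than random values).

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.arange(25).reshape(5, 5),
                       index=list('abcde'),
                       columns=list('vwxyz'))

# ':' keeps every row; 'w':'y' is a label slice on the columns, end included
cols_w_to_y = df_demo.loc[:, 'w':'y']
print(cols_w_to_y)
```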


In [57]:
# Retrieve the item at row e, column v

df.loc[ 'e',    # row selector
        'v']    # column selector


np.int64(486)

In [58]:
df.loc[['b', 'e'],   # row selector is two rows
       ['v', 'z']]   # column selector is two columns

Unnamed: 0,v,z
b,763,723
e,486,600


In [59]:
# Update the value at row e, column z to be 123.456

df.loc['e', 'z'] = 123.456



In [60]:
df.dtypes

v      int64
w      int64
x      int64
y      int64
z    float64
dtype: object

In [61]:
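# assigning a float into the int64 column z forced the whole column to become float64
# (recent versions of pandas also warn about setting a value of an incompatible dtype)
# how should we update that column/value to avoid the problem?
# answer: first replace the column with a float version of itself (via .astype), then assign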

np.random.seed(0)

df = DataFrame(np.random.randint(0, 1000, [5,5]),
               index=list('abcde'),
               columns=list('vwxyz'))
df

Unnamed: 0,v,w,x,y,z
a,684,559,629,192,835
b,763,707,359,9,723
c,277,754,804,599,70
d,472,600,396,314,705
e,486,551,87,174,600


In [62]:
df.dtypes

v    int64
w    int64
x    int64
y    int64
z    int64
dtype: object

In [63]:
# you can add a new column to a data frame via assignment
df['u'] = [10, 20, 30, 40, 50]   # needs to be the same length as the other columns

df

Unnamed: 0,v,w,x,y,z,u
a,684,559,629,192,835,10
b,763,707,359,9,723,20
c,277,754,804,599,70,30
d,472,600,396,314,705,40
e,486,551,87,174,600,50


In [64]:
# what if I assign to a column that already exists? It's replaced.

df['u'] = [100, 200, 300, 400, 500]
df

Unnamed: 0,v,w,x,y,z,u
a,684,559,629,192,835,100
b,763,707,359,9,723,200
c,277,754,804,599,70,300
d,472,600,396,314,705,400
e,486,551,87,174,600,500


In [65]:
# I can do the same, but using the output of .astype

df['u'] = df['u'].astype(np.float64)
df

Unnamed: 0,v,w,x,y,z,u
a,684,559,629,192,835,100.0
b,763,707,359,9,723,200.0
c,277,754,804,599,70,300.0
d,472,600,396,314,705,400.0
e,486,551,87,174,600,500.0


In [66]:
df.loc['e', 'u'] = 123.456


In [67]:
df

Unnamed: 0,v,w,x,y,z,u
a,684,559,629,192,835,100.0
b,763,707,359,9,723,200.0
c,277,754,804,599,70,300.0
d,472,600,396,314,705,400.0
e,486,551,87,174,600,123.456
