# Day 2 -- data frames

1. Q&A
2. Data frames -- creating, and working with them
3. Adding and removing data in our data frames
4. Useful methods for our data frames
5. Boolean indexes / mask indexes
6. Using `.loc` to retrieve rows, rows/columns
7. Reading data from outside sources
    - Download this zipfile: https://files.lerner.co.il/data-science-exercise-files.zip
    - Scraping HTML files
    - Retrieving other formats

In [4]:
import pandas as pd
from pandas import Series, DataFrame

In [5]:
temps = Series([20, 23, 25, 22, 23, 25, 22, 27, 20])

# `.loc` retrieves data in numerous ways

1. Use it to retrieve one value from a series with the (default, numeric) index
2. Use it to retrieve one value from a series with the (non-default) index of another dtype -- strings, ints, floats, etc.
3. Retrieve one or more values as a series based on a "fancy index," passing a list of index values
4. Pass `.loc` a series/list of booleans of the same length as the series itself, and you get back only those elements that correspond to a `True`
5. Generate a boolean series with a comparison operator, and then pass that to `.loc`, to get only those values in the series for which the comparison is `True`

In [6]:
temps.loc[4]

np.int64(23)

In [7]:
temps.loc[2]

np.int64(25)

In [8]:
temps.loc[200]

KeyError: 200

In [9]:
temps

0    20
1    23
2    25
3    22
4    23
5    25
6    22
7    27
8    20
dtype: int64

In [11]:
# if I pass my own index values, .loc looks things up by those labels,
# not by position (that's why we sometimes need .iloc)

temps = Series([20, 23, 25, 22, 23, 25, 22, 27, 20],
               index='Mon Tue Wed Thu Fri Sat Sun Mon Tue'.split())

In [12]:
temps

Mon    20
Tue    23
Wed    25
Thu    22
Fri    23
Sat    25
Sun    22
Mon    27
Tue    20
dtype: int64

In [13]:
temps.loc['Sat']

np.int64(25)

In [14]:
temps.loc['Tue']

Tue    23
Tue    20
dtype: int64

In [15]:
temps.loc[2]

KeyError: 2
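
The notes mention `.iloc` without showing it. As a minimal sketch, using the same `temps` series as above, `.iloc` retrieves by position no matter what the index labels are:

```python
from pandas import Series

temps = Series([20, 23, 25, 22, 23, 25, 22, 27, 20],
               index='Mon Tue Wed Thu Fri Sat Sun Mon Tue'.split())

# .loc looks up by label; .iloc looks up by integer position
by_label = temps.loc['Wed']
by_position = temps.iloc[2]
print(by_label, by_position)

# .iloc also accepts a list of positions (fancy indexing by position)
picked = temps.iloc[[2, 4]]
print(picked)
```

So `temps.loc[2]` fails here because `2` isn't a label, while `temps.iloc[2]` still works.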

In [16]:
temps.loc[[2, 4]]   # fancy indexing by label -- 2 and 4 aren't labels in this index, so we get a KeyError

KeyError: "None of [Index([2, 4], dtype='object')] are in the [index]"

In [17]:
temps.loc[['Mon', 'Thu']]   # fancy indexing -- we give a list of indexes and get a series back

Mon    20
Mon    27
Thu    22
dtype: int64

In [18]:
# if you use fancy indexing, and there's only one match, you will still get a series back

temps.loc[['Thu']]

Thu    22
dtype: int64

In [19]:
# we can also use a boolean series (or list)
# in this case, the argument to .loc must be a list/series of booleans that is the same
# length as the series itself

temps.loc[[True, False, True, False, True, False, True, False, True]]

Mon    20
Wed    25
Fri    23
Sun    22
Tue    20
dtype: int64

In [21]:
# a variation on this is to generate a boolean series with a comparison, and pass it to .loc

temps.loc[temps < 25]   # this generates a boolean series, and passes that series to .loc

Mon    20
Tue    23
Thu    22
Fri    23
Sun    22
Tue    20
dtype: int64
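
It can help to look at the boolean series on its own before handing it to `.loc`. A small sketch, using the same `temps` series (the variable names `mask` and `cooler` are just for illustration):

```python
from pandas import Series

temps = Series([20, 23, 25, 22, 23, 25, 22, 27, 20],
               index='Mon Tue Wed Thu Fri Sat Sun Mon Tue'.split())

mask = temps < 25          # a boolean series with the same index as temps
print(mask)

cooler = temps.loc[mask]   # same result as temps.loc[temps < 25]
print(cooler)
```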

In [22]:
temps

Mon    20
Tue    23
Wed    25
Thu    22
Fri    23
Sat    25
Sun    22
Mon    27
Tue    20
dtype: int64

In [23]:
temps.index

Index(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'Mon', 'Tue'], dtype='object')

In [24]:
temps.values

array([20, 23, 25, 22, 23, 25, 22, 27, 20])

# What is a data frame?

- 2D data
- The rows have an index (just like a series)
- The columns have names (which work like the index, but vertically)
- Each column is basically a series, which means that all values in that column have the same dtype

We can create a data frame using a list of lists

In [25]:
df = DataFrame([[10, 20, 30, 40],
                [50, 60, 70, 80],
                [90, 100, 110, 120]])
df

    0    1    2    3
0  10   20   30   40
1  50   60   70   80
2  90  100  110  120


In [26]:
# we can name the data frame's index just as we did with a series -- passing the "index" kwarg
# we can name the data frame's columns in the same way, passing the "columns" kwarg

In [28]:
df = DataFrame([[10, 20, 30, 40],
                [50, 60, 70, 80],
                [90, 100, 110, 120]],
               index=list('abc'),
               columns=list('wxyz'))
df

    w    x    y    z
a  10   20   30   40
b  50   60   70   80
c  90  100  110  120


In [29]:
# retrieve from the rows using .loc
# all of the techniques we talked about still work!
# when you use .loc (in the simplest way), you get a row back
# and that row is represented as a series!

df.loc['a']

w    10
x    20
y    30
z    40
Name: a, dtype: int64

In [30]:
df.loc['b']

w    50
x    60
y    70
z    80
Name: b, dtype: int64

In [31]:
df.loc['d']

KeyError: 'd'

In [32]:
# how do I retrieve columns?
# we use []

df['x']

a     20
b     60
c    100
Name: x, dtype: int64

In [33]:
df['z']

a     40
b     80
c    120
Name: z, dtype: int64
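
Each column comes back as a series with its own dtype. Here is a rough sketch of what that means when the columns differ; the column names `whole` and `fractional` and the data are made up for illustration:

```python
from pandas import DataFrame, Series

# hypothetical data: one integer column, one float column
df = DataFrame([[10, 1.5],
                [20, 2.5],
                [30, 3.5]],
               index=list('abc'),
               columns=['whole', 'fractional'])

col = df['whole']
print(type(col))    # a column is a pandas Series
print(df.dtypes)    # one dtype per column

# a row spans both columns, so its values are converted to a common dtype
row = df.loc['a']
print(row)
```

This is why a row retrieved with `.loc` can show integers as floats when the data frame mixes integer and float columns.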

In [34]:
# can I retrieve more than one row with a fancy index?
df.loc[['a', 'c']]  # fancy indexing on the row

    w    x    y    z
a  10   20   30   40
c  90  100  110  120


In [36]:
df[['x', 'z']]   # this gives us more than one column back -- fancy indexing on the columns

     x    z
a   20   40
b   60   80
c  100  120


# Exercise: Grocery store

1. Create a data frame in which you have two columns: One will be the price of an item (`price`), and the second will be the number of sales of that item (`sales`). The index will be the items that you sell.
2. The data frame should have 4 rows, and each item will have a price and a number of sales.
3. Retrieve all of the information for apples.
4. Retrieve all of the information for bananas.
5. Retrieve all information for apples and bananas.
6. What is the mean price for all products?

In [39]:
# because we're creating a data frame with two columns,
# we'll use a list of lists in which every internal
# list will have two elements, the price and the number of sales

df = DataFrame([[1, 10],
                [0.5, 15],
                [3, 8],
                [2, 20]],
               columns=['price', 'sales'],
               index='apple banana mushroom pepper'.split())

df

          price  sales
apple       1.0     10
banana      0.5     15
mushroom    3.0      8
pepper      2.0     20


In [40]:
df.loc['apple']   # I want the row for the index "apple"

price     1.0
sales    10.0
Name: apple, dtype: float64

In [41]:
df.loc['banana']

price     0.5
sales    15.0
Name: banana, dtype: float64

In [42]:
# get both apple and banana

df.loc[['apple', 'banana']]

        price  sales
apple     1.0     10
banana    0.5     15


In [44]:
# What is the mean price for all products?

df['price'].mean()

np.float64(1.625)

In [46]:
# when the median is lower than the mean, it usually means that one or more
# values are much higher than the rest, and are "pulling up" the mean

df['price'].median()

np.float64(1.5)
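
To see the "pulling up" effect more clearly, here is a small sketch that adds one very expensive item to the same prices. The `saffron` item and its price are invented for illustration:

```python
from pandas import Series

prices = Series([1, 0.5, 3, 2],
                index='apple banana mushroom pepper'.split())
print(prices.mean(), prices.median())

# hypothetical: the same prices plus one item that costs far more than the rest
with_outlier = Series([1, 0.5, 3, 2, 50],
                      index='apple banana mushroom pepper saffron'.split())
print(with_outlier.mean(), with_outlier.median())
```

The mean jumps from 1.625 to about 11.3, while the median only moves from 1.5 to 2.0.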

In [47]:
df['price']

apple       1.0
banana      0.5
mushroom    3.0
pepper      2.0
Name: price, dtype: float64

In [48]:
# AM asks: why do we use [] after .loc, and not ()?
# isn't it a method?

# official answer: no, it's not a method -- it's an attribute that we index into with []

# to allow us to use a slice, they used []
# slice syntax (first:second) only works inside of [], not ()

df.loc['banana':'mushroom']

          price  sales
banana      0.5     15
mushroom    3.0      8


In [50]:
df.loc[slice('banana','mushroom')]

          price  sales
banana      0.5     15
mushroom    3.0      8
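
One detail worth noticing in the output above: a slice of labels passed to `.loc` includes both endpoints, unlike a regular Python slice. A short sketch comparing it with a positional slice via `.iloc`, using the same grocery data:

```python
from pandas import DataFrame

df = DataFrame([[1, 10],
                [0.5, 15],
                [3, 8],
                [2, 20]],
               columns=['price', 'sales'],
               index='apple banana mushroom pepper'.split())

by_label = df.loc['banana':'mushroom']   # labels: both ends are included
by_position = df.iloc[1:2]               # positions: the end is excluded

print(list(by_label.index))
print(list(by_position.index))
```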


# Modifying our data frame -- adding and removing values

1. We can add a new column by assigning to it. (If the column already exists, we replace it.)
2. We can add a new row by assigning to it via `.loc`. Here, too, if the index label already exists, the existing row's values are replaced; no second row is added.
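
A quick check of what happens when the column name or row label already exists. This is a sketch on a cut-down, two-item version of the grocery data, and the `cherry` row is invented for illustration:

```python
from pandas import DataFrame

df = DataFrame([[1, 10],
                [0.5, 15]],
               columns=['price', 'sales'],
               index=['apple', 'banana'])

df['sales'] = [11, 16]          # existing column: its values are replaced
df.loc['banana'] = [0.75, 20]   # existing row label: that row is overwritten
df.loc['cherry'] = [4, 5]       # new row label: a row is added at the end

print(df)
```

The data frame ends up with three rows, not four: `banana` appears only once, with its new values.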

In [53]:
# the index of price and the index of sales are identical, so we 
# can multiply one by the other

df['revenue'] = df['price'] * df['sales'] # assign based on an operation
df

          price  sales  revenue
apple       1.0     10     10.0
banana      0.5     15      7.5
mushroom    3.0      8     24.0
pepper      2.0     20     40.0


In [54]:
df['stuff'] = [10, 20, 30, 40]   # assign directly
df

          price  sales  revenue  stuff
apple       1.0     10     10.0     10
banana      0.5     15      7.5     20
mushroom    3.0      8     24.0     30
pepper      2.0     20     40.0     40


In [55]:
df['other_things'] = 9  # if you assign a scalar value, it'll be assigned to all rows
df

          price  sales  revenue  stuff  other_things
apple       1.0     10     10.0     10             9
banana      0.5     15      7.5     20             9
mushroom    3.0      8     24.0     30             9
pepper      2.0     20     40.0     40             9


In [56]:
# if it's not scalar, then you must assign the right number of values

df['this_will_not_work'] = [10, 20]

ValueError: Length of values (2) does not match length of index (4)

In [57]:
df

          price  sales  revenue  stuff  other_things
apple       1.0     10     10.0     10             9
banana      0.5     15      7.5     20             9
mushroom    3.0      8     24.0     30             9
pepper      2.0     20     40.0     40             9


In [59]:
# make sure to use df.loc when you add a row!

df.loc['cucumber'] = [0.25, 10, 2.50, 25, 9]
df

          price  sales  revenue  stuff  other_things
apple       1.0   10.0     10.0   10.0           9.0
banana      0.5   15.0      7.5   20.0           9.0
mushroom    3.0    8.0     24.0   30.0           9.0
pepper      2.0   20.0     40.0   40.0           9.0
cucumber   0.25   10.0      2.5   25.0           9.0


In [60]:
# how can we remove a row or column? We can use the "df.drop" method

df.drop('cucumber')  # this returns a new data frame, based on df, without the "cucumber" row

          price  sales  revenue  stuff  other_things
apple       1.0   10.0     10.0   10.0           9.0
banana      0.5   15.0      7.5   20.0           9.0
mushroom    3.0    8.0     24.0   30.0           9.0
pepper      2.0   20.0     40.0   40.0           9.0


In [61]:
df

          price  sales  revenue  stuff  other_things
apple       1.0   10.0     10.0   10.0           9.0
banana      0.5   15.0      7.5   20.0           9.0
mushroom    3.0    8.0     24.0   30.0           9.0
pepper      2.0   20.0     40.0   40.0           9.0
cucumber   0.25   10.0      2.5   25.0           9.0


In [64]:
df = df.drop('cucumber')   # this is the way to do it!

In [65]:
df

          price  sales  revenue  stuff  other_things
apple       1.0   10.0     10.0   10.0           9.0
banana      0.5   15.0      7.5   20.0           9.0
mushroom    3.0    8.0     24.0   30.0           9.0
pepper      2.0   20.0     40.0   40.0           9.0


In [66]:
# you can pass a list of indexes to drop

df.drop(['apple', 'banana'])

          price  sales  revenue  stuff  other_things
mushroom    3.0    8.0     24.0   30.0           9.0
pepper      2.0   20.0     40.0   40.0           9.0


In [67]:
df

          price  sales  revenue  stuff  other_things
apple       1.0   10.0     10.0   10.0           9.0
banana      0.5   15.0      7.5   20.0           9.0
mushroom    3.0    8.0     24.0   30.0           9.0
pepper      2.0   20.0     40.0   40.0           9.0


In [68]:
df.drop('other_things') # by default, df.drop assumes you want to remove rows

KeyError: "['other_things'] not found in axis"

In [69]:
df.drop('other_things', axis='columns')    # to remove a column instead, pass axis='columns'

          price  sales  revenue  stuff
apple       1.0   10.0     10.0   10.0
banana      0.5   15.0      7.5   20.0
mushroom    3.0    8.0     24.0   30.0
pepper      2.0   20.0     40.0   40.0


In [71]:
df = df.drop(['stuff', 'other_things'], axis='columns')
df

          price  sales  revenue
apple       1.0   10.0     10.0
banana      0.5   15.0      7.5
mushroom    3.0    8.0     24.0
pepper      2.0   20.0     40.0
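
As an alternative to `axis=`, `drop` also accepts `index=` and `columns=` keyword arguments, which some people find easier to read. A minimal sketch on a cut-down version of the grocery data:

```python
from pandas import DataFrame

df = DataFrame([[1, 10, 10],
                [0.5, 15, 20]],
               columns=['price', 'sales', 'stuff'],
               index=['apple', 'banana'])

no_stuff = df.drop(columns='stuff')   # equivalent to df.drop('stuff', axis='columns')
no_apple = df.drop(index='apple')     # equivalent to df.drop('apple')

print(list(no_stuff.columns))
print(list(no_apple.index))
print(df.shape)   # the original still has both rows and all three columns
```

Either way, `drop` returns a new data frame, so assign the result back to `df` if you want the change to stick.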


# Exercise: Sales tax!

1. Building on your grocery-store data frame (or mine, if you want to use it), add a new column for the sales tax, which is 10% for each item.
2. Add another column, total revenue, which will be price times sales, plus the sales tax on that amount.

In [72]:
df

          price  sales  revenue
apple       1.0   10.0     10.0
banana      0.5   15.0      7.5
mushroom    3.0    8.0     24.0
pepper      2.0   20.0     40.0


In [73]:
df['sales_tax'] = 0.1    # assign 10% sales tax to each item
df

          price  sales  revenue  sales_tax
apple       1.0   10.0     10.0        0.1
banana      0.5   15.0      7.5        0.1
mushroom    3.0    8.0     24.0        0.1
pepper      2.0   20.0     40.0        0.1


In [75]:
df['total_revenue'] = df['revenue'] + (df['sales_tax'] * df['price'] * df['sales'])
df

          price  sales  revenue  sales_tax  total_revenue
apple       1.0   10.0     10.0        0.1           11.0
banana      0.5   15.0      7.5        0.1           8.25
mushroom    3.0    8.0     24.0        0.1           26.4
pepper      2.0   20.0     40.0        0.1           44.0
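
Since `revenue` is already price times sales, the same total can be written more compactly as `revenue * (1 + sales_tax)`. A small sketch checking that the two forms agree (the names `long_form` and `short_form` are just for illustration):

```python
from pandas import DataFrame

df = DataFrame([[1, 10],
                [0.5, 15],
                [3, 8],
                [2, 20]],
               columns=['price', 'sales'],
               index='apple banana mushroom pepper'.split())
df['revenue'] = df['price'] * df['sales']
df['sales_tax'] = 0.1

# the calculation as written in the exercise solution
long_form = df['revenue'] + (df['sales_tax'] * df['price'] * df['sales'])

# the same calculation, reusing the revenue column
short_form = df['revenue'] * (1 + df['sales_tax'])

# compare with a small tolerance, because these are floats
print((long_form - short_form).abs().max() < 1e-9)
```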


In [78]:
# to rename a column, use the "rename" method, with a dict handed to the "columns"
# keyword argument

df.rename(columns={'revenue':'gross_revenue'})

          price  sales  gross_revenue  sales_tax  total_revenue
apple       1.0   10.0           10.0        0.1           11.0
banana      0.5   15.0            7.5        0.1           8.25
mushroom    3.0    8.0           24.0        0.1           26.4
pepper      2.0   20.0           40.0        0.1           44.0
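
Like `drop`, `rename` returns a new data frame and leaves the original alone. A short sketch to confirm that, on a two-item version of the data:

```python
from pandas import DataFrame

df = DataFrame([[1, 10, 10],
                [0.5, 15, 7.5]],
               columns=['price', 'sales', 'revenue'],
               index=['apple', 'banana'])

renamed = df.rename(columns={'revenue': 'gross_revenue'})

print(list(df.columns))        # the original column names are unchanged
print(list(renamed.columns))   # the new data frame has the new name
```

To keep the new name, assign the result back: `df = df.rename(columns={'revenue': 'gross_revenue'})`.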


# Next up

- Useful methods (some of which you already know)
- Boolean indexes and our data frames

# Pandas rule of thumb #1 

Anywhere you can pass a single string (typically an index label or a column name), you can usually pass a list of strings (for multiple rows or multiple columns).
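
A sketch of this rule with three operations from earlier in these notes (`[]`, `.loc`, and `drop`), using the small `wxyz` data frame:

```python
from pandas import DataFrame, Series

df = DataFrame([[10, 20, 30, 40],
                [50, 60, 70, 80],
                [90, 100, 110, 120]],
               index=list('abc'),
               columns=list('wxyz'))

one_col = df['x']               # a single string gives one column (a series)
two_cols = df[['x', 'z']]       # a list of strings gives several columns (a data frame)

one_row = df.loc['a']           # a single label gives one row (a series)
two_rows = df.loc[['a', 'c']]   # a list of labels gives several rows (a data frame)

minus_one = df.drop('a')        # drop one row
minus_two = df.drop(['a', 'b']) # drop several rows

print(two_cols.shape, two_rows.shape, minus_two.shape)
```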

# Pandas rule of thumb #2

Almost anything you can do to a series, you can also do to a data frame.

- If the series version of the method returns a single (scalar) value, then the result from a data frame will be a series.
- If the series version of the method returns a series, then the result from a data frame will be a new data frame.
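
A compact sketch of both cases, checking the type of what comes back. It uses `sum` and `head` on a shortened, four-row version of the data:

```python
from pandas import Series, DataFrame

s = Series([10, 20, 30, 40], index=list('abcd'))
df = DataFrame({'numbers': [10, 20, 30, 40],
                'times_10': [100, 200, 300, 400]},
               index=list('abcd'))

# scalar from a series -> series from a data frame (one value per column)
print(s.sum())
print(df.sum())

# series from a series -> data frame from a data frame
print(s.head(2))
print(df.head(2))
```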

In [80]:
s = Series([10, 20, 30, 40, 50, 60, 70, 80],
           index=list('abcdefgh'))

# use a dict of lists to create a data frame; the key is the column name
# and the value is a list of values
df = DataFrame({'numbers': [10, 20, 30, 40, 50, 60, 70, 80],
                'times_10': [100, 200, 300, 400, 500, 600, 700, 800]},
               index=list('abcdefgh'))


In [81]:
s

a    10
b    20
c    30
d    40
e    50
f    60
g    70
h    80
dtype: int64

In [82]:
df

   numbers  times_10
a       10       100
b       20       200
c       30       300
d       40       400
e       50       500
f       60       600
g       70       700
h       80       800


In [83]:
# aggregation methods

s.mean()

np.float64(45.0)

In [84]:
df.mean()

numbers      45.0
times_10    450.0
dtype: float64

In [85]:
s.min()

np.int64(10)

In [86]:
s.max()

np.int64(80)

In [87]:
df.min()

numbers      10
times_10    100
dtype: int64

In [88]:
df.max()

numbers      80
times_10    800
dtype: int64

In [89]:
s.describe()

count     8.000000
mean     45.000000
std      24.494897
min      10.000000
25%      27.500000
50%      45.000000
75%      62.500000
max      80.000000
dtype: float64

In [90]:
df.describe()

         numbers    times_10
count        8.0         8.0
mean        45.0       450.0
std    24.494897  244.948974
min         10.0       100.0
25%         27.5       275.0
50%         45.0       450.0
75%         62.5       625.0
max         80.0       800.0


In [91]:
s.head()  # first 5 values

a    10
b    20
c    30
d    40
e    50
dtype: int64

In [92]:
s.tail() # final 5 values

d    40
e    50
f    60
g    70
h    80
dtype: int64

In [93]:
s.tail(3)  # final 3 values

f    60
g    70
h    80
dtype: int64