# Supplemental notes on Pandas

The [**pandas** library](https://pandas.pydata.org/) is a Python module for representing what we call "tibbles" in Topic 7. Beyond what you see there, this notebook has additional notes to help you understand how to manipulate objects in pandas. These notes adapt those found in the recommended text, [Python for Data Analysis (2nd ed.)](http://shop.oreilly.com/product/0636920050896.do), which is written by the creator of pandas, [Wes McKinney](http://wesmckinney.com/).

**Versions.** The state of pandas is a bit in flux, so it's important to be flexible and accommodate differences in functionality that might vary by version. The following code shows you how to check what version of pandas you have.

In [1]:
import pandas as pd  # Standard idiom for loading pandas

print("=== pandas version: {} ===\n".format(pd.__version__))

import sys
print("=== Python version ===\n{}".format(sys.version))

=== pandas version: 1.1.2 ===

=== Python version ===
3.7.5 (default, Dec 18 2019, 06:24:58) 
[GCC 5.5.0 20171010]
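Since behavior can vary by release, one possible way to guard version-specific code is to compare the major and minor version numbers. This is only a sketch: the helper name `pandas_major_minor` is hypothetical, and the parsing assumes the version string begins with `major.minor`.

```python
import pandas as pd

def pandas_major_minor(version=pd.__version__):
    """Return (major, minor) as integers, e.g., '1.1.2' -> (1, 1)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)

# Tuples compare element by element, so this reads as "1.0 or newer".
if pandas_major_minor() >= (1, 0):
    print("pandas 1.0 or newer")
else:
    print("pandas older than 1.0")
```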


The main object that pandas implements is the `DataFrame`, which is essentially a 2-D table. It's an ideal target for holding the tibbles of Topic+Notebook 7, and its design derives in part from [data frame objects in the R language](https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/data.frame).

In addition to `DataFrame`, another important component of pandas is the `Series`, which is essentially one column of a `DataFrame` object (and, therefore, corresponds to variables and responses in a tibble).

In [2]:
from pandas import DataFrame, Series

# `Series` objects

A pandas [`Series`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html) object is a column-oriented object that we will use to store a variable of a tibble.

In [3]:
obj = Series([-1, 2, -3, 4, -5])
print(f"`obj` has type `{type(obj)}`:\n\n{obj}")

`obj` has type `<class 'pandas.core.series.Series'>`:

0   -1
1    2
2   -3
3    4
4   -5
dtype: int64


Observe the common **base type** (`dtype: int64`) and **index** (element numbers).

Regarding the base type, a `Series` differs from a Python `list` in that the types of its elements are assumed to be the same. Doing so allows many operations on a `Series` to be faster than their counterparts for `list` objects, as in this search example.

In [4]:
from random import randint
n_ints = 10000000
max_value = 5*n_ints

print(f"""
Creating random `list` and `Series` objects:
- Length: {n_ints} elements
- Range: [{-max_value}, {max_value}]
""")
a_list = [randint(-max_value, max_value) for _ in range(n_ints)]
a_series = Series(a_list)

print("==> Estimating time to search the `list`:")
t_list_search = %timeit -o randint(-max_value, max_value) in a_list

print("\n==> Estimating time to search the `Series`:")
t_series_search = %timeit -o a_series.isin([randint(-max_value, max_value)])

print(f"\n==> (`list` time) divided by `Series` time is roughly {t_list_search.average / t_series_search.average:.1f}x")


Creating random `list` and `Series` objects:
- Length: 10000000 elements
- Range: [-50000000, 50000000]

==> Estimating time to search the `list`:
171 ms ± 32.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

==> Estimating time to search the `Series`:
20.6 ms ± 106 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

==> (`list` time) divided by `Series` time is roughly 8.3x


If you create a `Series` with "mixed types," the `dtype` will become the most generic Python type, `object`. (A deeper understanding of what this fact means requires some knowledge of object-oriented programming, but that won't be necessary for our course.)

In [5]:
obj2 = Series([-1, '2', -3, '4', -5])
obj2

0    -1
1     2
2    -3
3     4
4    -5
dtype: object

If you want to query the base type, use:

In [6]:
print(obj.dtype)
print(obj2.dtype)

int64
object
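If the mixed types are accidental, as with numbers stored as strings, one option is to convert the values to a numeric base type. The sketch below assumes every element can be parsed as a number and uses `pd.to_numeric`; the variable names are for illustration only.

```python
import pandas as pd

mixed = pd.Series([-1, '2', -3, '4', -5])   # mixed ints and strings: dtype is object
converted = pd.to_numeric(mixed)            # parses the strings as numbers

print(mixed.dtype)
print(converted.dtype)
print(converted.sum())
```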


Regarding the index, it provides a convenient way to reference individual elements of the `Series`.

By default, a `Series` has an index that is akin to `range()` in standard Python, and effectively numbers the entries from 0 to `n-1`, where `n` is the length of the `Series`. A `Series` object also becomes list-like in how you reference its elements.

In [7]:
print("obj.index: {}".format(obj.index))
print("range(0, 5): {}".format(range(0, 5)))

obj.index: RangeIndex(start=0, stop=5, step=1)
range(0, 5): range(0, 5)


In [8]:
print("==> obj[2]:\n{}\n".format(obj[2]))
print("==> obj[3]:\n{}\n".format(obj[3]))
print("==> obj[1:4]:\n{}\n".format(obj[1:4]))

==> obj[2]:
-3

==> obj[3]:
4

==> obj[1:4]:
1    2
2   -3
3    4
dtype: int64



You can also use more complex index objects, like lists of integers and conditional masks.

In [9]:
I = [0, 2, 3]
obj[I] # Also: obj[[0, 2, 3]]

0   -1
2   -3
3    4
dtype: int64

In [10]:
I_pos = obj > 0
print(type(I_pos), I_pos)

<class 'pandas.core.series.Series'> 0    False
1     True
2    False
3     True
4    False
dtype: bool


In [11]:
print(obj[I_pos])

1    2
3    4
dtype: int64


However, the index can be a more general structure, which effectively turns a `Series` object into something that is "dictionary-like."

In [12]:
obj3 = Series([      1,    -2,       3,     -4,        5,      -6],
              ['alice', 'bob', 'carol', 'dave', 'esther', 'frank'])
obj3

alice     1
bob      -2
carol     3
dave     -4
esther    5
frank    -6
dtype: int64

In [13]:
print("* obj3['bob']: {}\n".format(obj3['bob']))
print("* obj3['carol']: {}\n".format(obj3['carol']))

* obj3['bob']: -2

* obj3['carol']: 3



In fact, you can construct a `Series` from a dictionary directly:

In [14]:
peeps = {'alice': 1, 'carol': 3, 'esther': 5, 'bob': -2, 'dave': -4, 'frank': -6}
obj4 = Series(peeps)
print(obj4)

alice     1
carol     3
esther    5
bob      -2
dave     -4
frank    -6
dtype: int64


In [15]:
mujeres = [0, 2, 4] # list of integer offsets
print("* las mujeres of `obj3` at offsets {}:\n{}\n".format(mujeres, obj3[mujeres]))

* las mujeres of `obj3` at offsets [0, 2, 4]:
alice     1
carol     3
esther    5
dtype: int64



In [16]:
hombres = ['bob', 'dave', 'frank'] # list of index values
print("* hombres, by their names, {}:\n{}".format(hombres, obj3[hombres]))

* hombres, by their names, ['bob', 'dave', 'frank']:
bob     -2
dave    -4
frank   -6
dtype: int64
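The two lookups above rely on `[]` deciding whether a key is an offset or a label. A more explicit alternative, sketched here on a fresh `Series` named `s`, is to use `.iloc` for integer offsets and `.loc` for index values. Newer pandas releases deprecate integer offsets inside `[]` when the index holds labels, so the explicit form may be the safer habit.

```python
import pandas as pd

s = pd.Series([1, -2, 3, -4, 5, -6],
              index=['alice', 'bob', 'carol', 'dave', 'esther', 'frank'])

by_position = s.iloc[[0, 2, 4]]             # integer offsets, 0-based
by_label = s.loc[['bob', 'dave', 'frank']]  # index values (labels)

print(by_position)
print(by_label)
```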


In [17]:
I_neg = obj3 < 0
print(I_neg)

alice     False
bob        True
carol     False
dave       True
esther    False
frank      True
dtype: bool


In [18]:
print(obj3[I_neg])

bob     -2
dave    -4
frank   -6
dtype: int64


Because of the dictionary-like naming of `Series` elements, you can use the Python `in` operator in the same way you would a dictionary.

> Note: In the timing experiment comparing `list` search and `Series` search, you may have noticed that the benchmark does not use `in`, but rather, `Series.isin`. Why is that?

In [19]:
print('carol' in peeps)
print('carol' in obj3)

True
True
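As a hint toward the note above, here is a small sketch of what `in` actually tests on a `Series`. It checks the index labels, just as `in` checks the keys of a dictionary, so searching the values takes a different expression.

```python
import pandas as pd

s = pd.Series([1, -2, 3], index=['alice', 'bob', 'carol'])

print('carol' in s)          # tests the index labels
print(3 in s)                # 3 is a value here, not a label
print(3 in s.values)         # tests the underlying array of values
print(s.isin([3]).any())     # element-wise test, reduced to a single bool
```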


Basic arithmetic works on `Series` as vector-like operations.

In [20]:
print(obj3, "\n")
print(obj3 + 5, "\n")
print(obj3 + 5 > 0, "\n")
print((-2.5 * obj3) + (obj3 + 5))

alice     1
bob      -2
carol     3
dave     -4
esther    5
frank    -6
dtype: int64 

alice      6
bob        3
carol      8
dave       1
esther    10
frank     -1
dtype: int64 

alice      True
bob        True
carol      True
dave       True
esther     True
frank     False
dtype: bool 

alice      3.5
bob        8.0
carol      0.5
dave      11.0
esther    -2.5
frank     14.0
dtype: float64


A `Series` object also supports vector-style operations with automatic alignment based on index values.

In [21]:
print(obj3)

alice     1
bob      -2
carol     3
dave     -4
esther    5
frank    -6
dtype: int64


In [22]:
obj_l = obj3[mujeres]
obj_l

alice     1
carol     3
esther    5
dtype: int64

In [23]:
obj3 + obj_l

alice      2.0
bob        NaN
carol      6.0
dave       NaN
esther    10.0
frank      NaN
dtype: float64

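Observe what happened with the labels that appear in only one operand: their results are `NaN`, and the `dtype` became `float64` to accommodate them. If you are familiar with relational databases, this behavior is akin to an _outer-join_.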
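If `NaN` is not what you want for unmatched labels, one option is the method form of the operator, `Series.add`, with its `fill_value` parameter. The sketch below rebuilds the two operands under the illustrative names `full` and `part`, and treats a missing entry as 0.

```python
import pandas as pd

full = pd.Series([1, -2, 3, -4, 5, -6],
                 index=['alice', 'bob', 'carol', 'dave', 'esther', 'frank'])
part = full.loc[['alice', 'carol', 'esther']]

total = full + part                     # labels missing from `part` give NaN
filled = full.add(part, fill_value=0)   # treat missing entries as 0 instead

print(total.isna().sum())
print(filled)
```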

Another useful transformation is the `.apply(fun)` method. It returns a copy of the `Series` where the function `fun` has been applied to each element. For example:

In [24]:
abs(-5) # Python built-in function

5

In [25]:
obj3 # Recall

alice     1
bob      -2
carol     3
dave     -4
esther    5
frank    -6
dtype: int64

In [26]:
obj3.apply(abs)

alice     1
bob       2
carol     3
dave      4
esther    5
frank     6
dtype: int64

In [27]:
obj3 # Note: `.apply()` returned a copy, so the original is untouched

alice     1
bob      -2
carol     3
dave     -4
esther    5
frank    -6
dtype: int64
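For common operations, pandas often provides a built-in method that gives the same answer as `.apply()` without calling a Python function per element. A minimal sketch, using `Series.abs()` on an illustrative `Series` named `s`:

```python
import pandas as pd

s = pd.Series([1, -2, 3, -4], index=['alice', 'bob', 'carol', 'dave'])

via_apply = s.apply(abs)   # calls abs() once per element
via_method = s.abs()       # built-in vectorized method

print(via_apply.equals(via_method))
```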

A `Series` may be _named_, too.

In [28]:
print(obj3.name)

None


In [29]:
obj3.name = 'peep'
obj3

alice     1
bob      -2
carol     3
dave     -4
esther    5
frank    -6
Name: peep, dtype: int64

When we move on to `DataFrame` objects, you'll see why names matter.
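As a small preview, and only a sketch with illustrative names, gluing named `Series` objects side by side with `pd.concat(..., axis=1)` produces a table whose column labels are the `Series` names.

```python
import pandas as pd

peep = pd.Series([1, -2, 3], index=['alice', 'bob', 'carol'], name='peep')
doubled = (2 * peep).rename('doubled')      # same index, new name

table = pd.concat([peep, doubled], axis=1)  # one column per Series
print(table.columns.tolist())
print(table)
```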

# `DataFrame` objects

A pandas [`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) object is a table whose columns are `Series` objects, all keyed on the same index. It's the perfect container for what we have been referring to as a tibble.

In [30]:
cafes = DataFrame({'name': ['east pole', 'chrome yellow', 'brash', 'taproom', '3heart', 'spiller park pcm', 'refuge', 'toptime'],
                   'zip': [30324, 30312, 30318, 30317, 30306, 30308, 30303, 30318],
                   'poc': ['jared', 'kelly', 'matt', 'jonathan', 'nhan', 'dale', 'kitti', 'nolan']})
print("type:", type(cafes))
print(cafes)

type: <class 'pandas.core.frame.DataFrame'>
               name    zip       poc
0         east pole  30324     jared
1     chrome yellow  30312     kelly
2             brash  30318      matt
3           taproom  30317  jonathan
4            3heart  30306      nhan
5  spiller park pcm  30308      dale
6            refuge  30303     kitti
7           toptime  30318     nolan


In [31]:
display(cafes) # Or just `cafes` as the last line of a cell

Unnamed: 0,name,zip,poc
0,east pole,30324,jared
1,chrome yellow,30312,kelly
2,brash,30318,matt
3,taproom,30317,jonathan
4,3heart,30306,nhan
5,spiller park pcm,30308,dale
6,refuge,30303,kitti
7,toptime,30318,nolan


The `DataFrame` has named columns, which are stored as an `Index` (more later!):

In [32]:
cafes.columns

Index(['name', 'zip', 'poc'], dtype='object')

Each column is a named `Series`:

In [33]:
type(cafes['zip']) # Aha!

pandas.core.series.Series

As you might expect, these `Series` objects should all have the same index.

In [34]:
cafes.index

RangeIndex(start=0, stop=8, step=1)

In [35]:
cafes.index == cafes['zip'].index

array([ True,  True,  True,  True,  True,  True,  True,  True])

In [36]:
cafes['zip'].index == cafes['poc'].index

array([ True,  True,  True,  True,  True,  True,  True,  True])

You may use complex indexing of columns.

In [37]:
target_fields = ['zip', 'poc']
cafes[target_fields]

Unnamed: 0,zip,poc
0,30324,jared
1,30312,kelly
2,30318,matt
3,30317,jonathan
4,30306,nhan
5,30308,dale
6,30303,kitti
7,30318,nolan


But slices apply to rows.

In [38]:
cafes[1::2]

Unnamed: 0,name,zip,poc
1,chrome yellow,30312,kelly
3,taproom,30317,jonathan
5,spiller park pcm,30308,dale
7,toptime,30318,nolan


The index above is, by default, an integer range.

In [39]:
cafes.index

RangeIndex(start=0, stop=8, step=1)

In [40]:
cafes2 = cafes[['poc', 'zip']].copy() # Explicit copy, so later changes apply to `cafes2` itself
cafes2.index = cafes['name']
cafes2.index.name = None
cafes2

Unnamed: 0,poc,zip
east pole,jared,30324
chrome yellow,kelly,30312
brash,matt,30318
taproom,jonathan,30317
3heart,nhan,30306
spiller park pcm,dale,30308
refuge,kitti,30303
toptime,nolan,30318


You can access subsets of rows using the `.loc` field and index values:

In [41]:
cafes2.loc[['chrome yellow', '3heart']]

Unnamed: 0,poc,zip
chrome yellow,kelly,30312
3heart,nhan,30306


Alternatively, you can use integer offsets via the `.iloc` field, which is 0-based.

In [42]:
cafes2.iloc[[1, 3]]

Unnamed: 0,poc,zip
chrome yellow,kelly,30312
taproom,jonathan,30317
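Both `.loc` and `.iloc` also accept a second argument that selects columns, so you can pick rows and columns in one step. The sketch below uses a smaller, made-up subset of the cafes table under the name `df`.

```python
import pandas as pd

df = pd.DataFrame({'poc': ['jared', 'kelly', 'matt', 'jonathan'],
                   'zip': [30324, 30312, 30318, 30317]},
                  index=['east pole', 'chrome yellow', 'brash', 'taproom'])

by_label = df.loc[['chrome yellow', 'taproom'], 'zip']  # rows and column by label
by_offset = df.iloc[[1, 3], 1]                          # same cells by 0-based offset

print(by_label)
print(by_label.equals(by_offset))
```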


Adding columns is easy. Suppose every cafe has a 4-star rating on Yelp! and a two-dollar-sign cost:

In [43]:
cafes2['rating'] = 4.0
cafes2['price'] = '$$'
cafes2

Unnamed: 0,poc,zip,rating,price
east pole,jared,30324,4.0,$$
chrome yellow,kelly,30312,4.0,$$
brash,matt,30318,4.0,$$
taproom,jonathan,30317,4.0,$$
3heart,nhan,30306,4.0,$$
spiller park pcm,dale,30308,4.0,$$
refuge,kitti,30303,4.0,$$
toptime,nolan,30318,4.0,$$


And vector arithmetic should work on columns as expected.

In [44]:
prices_as_ints = cafes2['price'].apply(lambda s: len(s))
prices_as_ints

east pole           2
chrome yellow       2
brash               2
taproom             2
3heart              2
spiller park pcm    2
refuge              2
toptime             2
Name: price, dtype: int64

In [45]:
cafes2['value'] = cafes2['rating'] / prices_as_ints
cafes2

Unnamed: 0,poc,zip,rating,price,value
east pole,jared,30324,4.0,$$,2.0
chrome yellow,kelly,30312,4.0,$$,2.0
brash,matt,30318,4.0,$$,2.0
taproom,jonathan,30317,4.0,$$,2.0
3heart,nhan,30306,4.0,$$,2.0
spiller park pcm,dale,30308,4.0,$$,2.0
refuge,kitti,30303,4.0,$$,2.0
toptime,nolan,30318,4.0,$$,2.0


Because the columns are `Series` objects, there is an implicit matching that is happening on the indexes. In the preceding example, it works because all the `Series` objects involved have identical indexes.

However, the following will not work as intended because referencing rows yields copies.

For instance, suppose there is a price hike of one more `'$'` for being in the 30306 and 30308 zip codes. (If you are in Atlanta, you may know that these are the zip codes that place you close to, or in, [Ponce City Market](http://poncecitymarket.com/) and the [Eastside Beltline Trail](https://beltline.org/explore-atlanta-beltline-trails/eastside-trail/)!) Let's increase the price there, on a copy of the dataframe, `cafes3`.

In [46]:
cafes3 = cafes2.copy()
cafes3

Unnamed: 0,poc,zip,rating,price,value
east pole,jared,30324,4.0,$$,2.0
chrome yellow,kelly,30312,4.0,$$,2.0
brash,matt,30318,4.0,$$,2.0
taproom,jonathan,30317,4.0,$$,2.0
3heart,nhan,30306,4.0,$$,2.0
spiller park pcm,dale,30308,4.0,$$,2.0
refuge,kitti,30303,4.0,$$,2.0
toptime,nolan,30318,4.0,$$,2.0


In [47]:
is_fancy = cafes3['zip'].isin({30306, 30308})
# Alternative:
#is_fancy = cafes3['zip'].apply(lambda z: z in {30306, 30308})
is_fancy

east pole           False
chrome yellow       False
brash               False
taproom             False
3heart               True
spiller park pcm     True
refuge              False
toptime             False
Name: zip, dtype: bool

In [48]:
cafes3[is_fancy]

Unnamed: 0,poc,zip,rating,price,value
3heart,nhan,30306,4.0,$$,2.0
spiller park pcm,dale,30308,4.0,$$,2.0


In [49]:
# Recall: Legal Python for string concatenation
s = '$$'
s += '$'
print(s)

$$$


In [50]:
cafes3[is_fancy]['price'] += '$'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


What does that warning message mean? Let's see if anything changed.

In [51]:
cafes3

Unnamed: 0,poc,zip,rating,price,value
east pole,jared,30324,4.0,$$,2.0
chrome yellow,kelly,30312,4.0,$$,2.0
brash,matt,30318,4.0,$$,2.0
taproom,jonathan,30317,4.0,$$,2.0
3heart,nhan,30306,4.0,$$,2.0
spiller park pcm,dale,30308,4.0,$$,2.0
refuge,kitti,30303,4.0,$$,2.0
toptime,nolan,30318,4.0,$$,2.0


Nope! Selecting rows with a boolean mask, as in `cafes3[is_fancy]`, returns a copy of the original data rather than a reference to a subset of it, so the `+=` modified a temporary object. Therefore, we'll need a different strategy.

Observe that the warning message suggests a way!

In [52]:
cafes3.loc[is_fancy, 'price'] += '$'
cafes3

Unnamed: 0,poc,zip,rating,price,value
east pole,jared,30324,4.0,$$,2.0
chrome yellow,kelly,30312,4.0,$$,2.0
brash,matt,30318,4.0,$$,2.0
taproom,jonathan,30317,4.0,$$,2.0
3heart,nhan,30306,4.0,$$$,2.0
spiller park pcm,dale,30308,4.0,$$$,2.0
refuge,kitti,30303,4.0,$$,2.0
toptime,nolan,30318,4.0,$$,2.0
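Another way to make a conditional update without chained indexing is `Series.where`, which keeps a value where a condition holds and substitutes a replacement elsewhere. This is a sketch on a made-up three-row table named `df`, not the method used in these notes.

```python
import pandas as pd

df = pd.DataFrame({'zip': [30324, 30306, 30308],
                   'price': ['$$', '$$', '$$']},
                  index=['east pole', '3heart', 'spiller park pcm'])
is_fancy = df['zip'].isin([30306, 30308])

# Keep the old price where `is_fancy` is False; otherwise use the marked-up price.
df['price'] = df['price'].where(~is_fancy, df['price'] + '$')
print(df)
```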


**A different approach.** For pedagogical purposes, let's see if we can go about solving this problem in other ways to see what might or might not work.

In [53]:
cafes4 = cafes2.copy() # Start over
cafes4

Unnamed: 0,poc,zip,rating,price,value
east pole,jared,30324,4.0,$$,2.0
chrome yellow,kelly,30312,4.0,$$,2.0
brash,matt,30318,4.0,$$,2.0
taproom,jonathan,30317,4.0,$$,2.0
3heart,nhan,30306,4.0,$$,2.0
spiller park pcm,dale,30308,4.0,$$,2.0
refuge,kitti,30303,4.0,$$,2.0
toptime,nolan,30318,4.0,$$,2.0


Based on the earlier discussion, a well-educated first attempt might be to construct a `Series` with a named index, where the index values for fancy neighborhoods have an additional `'$'`, and then use string concatenation.

In [54]:
fancy_shops = cafes4.index[is_fancy]
fancy_shops

Index(['3heart', 'spiller park pcm'], dtype='object')

In [55]:
fancy_markup = Series(['$'] * len(fancy_shops), index=fancy_shops)
fancy_markup

3heart              $
spiller park pcm    $
dtype: object

In [56]:
cafes4['price'] + fancy_markup

3heart              $$$
brash               NaN
chrome yellow       NaN
east pole           NaN
refuge              NaN
spiller park pcm    $$$
taproom             NaN
toptime             NaN
dtype: object

Close! The labels missing from `fancy_markup` have nothing to be combined with, so their results are `NaN`.

**Exercise**. Develop an alternative scheme.

In [57]:
# Preliminary observation:
print("False * '$' == '{}'".format(False * '$'))
print("True * '$' == '{}'".format(True * '$'))

False * '$' == ''
True * '$' == '$'


In [58]:
cafes4 = cafes2.copy()
cafes4['price'] += Series([x * '$' for x in is_fancy.tolist()], index=is_fancy.index)
cafes4

Unnamed: 0,poc,zip,rating,price,value
east pole,jared,30324,4.0,$$,2.0
chrome yellow,kelly,30312,4.0,$$,2.0
brash,matt,30318,4.0,$$,2.0
taproom,jonathan,30317,4.0,$$,2.0
3heart,nhan,30306,4.0,$$$,2.0
spiller park pcm,dale,30308,4.0,$$$,2.0
refuge,kitti,30303,4.0,$$,2.0
toptime,nolan,30318,4.0,$$,2.0
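Yet another scheme, sketched here on made-up three-row data, repairs the first attempt directly: expand the markup `Series` so it has an entry for every label, using an empty string where no markup applies. It relies on `Series.reindex` with its `fill_value` parameter.

```python
import pandas as pd

price = pd.Series(['$$', '$$', '$$'],
                  index=['east pole', '3heart', 'spiller park pcm'])
markup = pd.Series(['$', '$'], index=['3heart', 'spiller park pcm'])

# Expand `markup` to cover every label in `price`, using '' where none exists.
markup_full = markup.reindex(price.index, fill_value='')
print(price + markup_full)
```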


**More on `apply()` for `DataFrame` objects.** As with a `Series`, there is a `DataFrame.apply()` procedure. However, its meaning is a bit more nuanced because a `DataFrame` is generally 2-D rather than 1-D.

In [59]:
cafes4.apply(lambda x: repr(type(x))) # What does this do? What does the output tell you?

poc       <class 'pandas.core.series.Series'>
zip       <class 'pandas.core.series.Series'>
rating    <class 'pandas.core.series.Series'>
price     <class 'pandas.core.series.Series'>
value     <class 'pandas.core.series.Series'>
dtype: object

A useful parameter is `axis`:

In [60]:
cafes4.apply(lambda x: repr(type(x)), axis=1) # What does this do? What does the output tell you?

east pole           <class 'pandas.core.series.Series'>
chrome yellow       <class 'pandas.core.series.Series'>
brash               <class 'pandas.core.series.Series'>
taproom             <class 'pandas.core.series.Series'>
3heart              <class 'pandas.core.series.Series'>
spiller park pcm    <class 'pandas.core.series.Series'>
refuge              <class 'pandas.core.series.Series'>
toptime             <class 'pandas.core.series.Series'>
dtype: object

And just to quickly verify what you get when `axis=1`:

In [61]:
cafes4.apply(lambda x: print('==> ' + x.name + '\n' + repr(x)) if x.name == 'east pole' else None, axis=1);

==> east pole
poc       jared
zip       30324
rating        4
price        $$
value         2
Name: east pole, dtype: object
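To make the two directions concrete, here is a small numeric sketch on a made-up table named `df`. By default the function receives one column at a time; with `axis=1` it receives one row at a time.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]}, index=['x', 'y', 'z'])

col_ranges = df.apply(lambda col: col.max() - col.min())      # one result per column
row_sums = df.apply(lambda row: row['a'] + row['b'], axis=1)  # one result per row

print(col_ranges)
print(row_sums)
```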


**Exercise.** Use `DataFrame.apply()` to update the `'value'` column in `cafes4`, which is out of date given the update of the prices.

In [62]:
cafes4 # Verify visually that `'value'` is out of date

Unnamed: 0,poc,zip,rating,price,value
east pole,jared,30324,4.0,$$,2.0
chrome yellow,kelly,30312,4.0,$$,2.0
brash,matt,30318,4.0,$$,2.0
taproom,jonathan,30317,4.0,$$,2.0
3heart,nhan,30306,4.0,$$$,2.0
spiller park pcm,dale,30308,4.0,$$$,2.0
refuge,kitti,30303,4.0,$$,2.0
toptime,nolan,30318,4.0,$$,2.0


In [63]:
def calc_value(row):
    return row['rating'] / len(row['price'])

cafes4['value'] = cafes4.apply(calc_value, axis=1)
cafes4

Unnamed: 0,poc,zip,rating,price,value
east pole,jared,30324,4.0,$$,2.0
chrome yellow,kelly,30312,4.0,$$,2.0
brash,matt,30318,4.0,$$,2.0
taproom,jonathan,30317,4.0,$$,2.0
3heart,nhan,30306,4.0,$$$,1.333333
spiller park pcm,dale,30308,4.0,$$$,1.333333
refuge,kitti,30303,4.0,$$,2.0
toptime,nolan,30318,4.0,$$,2.0
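For comparison, the same calculation can also be written without a row-wise `apply()`. The sketch below uses the string accessor `.str.len()` on a made-up two-row table named `df`, and divides the two columns element-wise.

```python
import pandas as pd

df = pd.DataFrame({'rating': [4.0, 4.0], 'price': ['$$', '$$$']},
                  index=['east pole', '3heart'])

df['value'] = df['rating'] / df['price'].str.len()  # element-wise, no explicit row loop
print(df)
```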


Another useful operation is gluing `DataFrame` objects together. There are several helpful operations covered in Notebook 7; one not mentioned there, but useful in one of its exercises, is `pd.concat()`.

In [64]:
# Split based on price
is_cheap = cafes4['price'] <= '$$'
cafes_cheap = cafes4[is_cheap]
cafes_pricey = cafes4[~is_cheap]

display(cafes_cheap)
display(cafes_pricey)

Unnamed: 0,poc,zip,rating,price,value
east pole,jared,30324,4.0,$$,2.0
chrome yellow,kelly,30312,4.0,$$,2.0
brash,matt,30318,4.0,$$,2.0
taproom,jonathan,30317,4.0,$$,2.0
refuge,kitti,30303,4.0,$$,2.0
toptime,nolan,30318,4.0,$$,2.0


Unnamed: 0,poc,zip,rating,price,value
3heart,nhan,30306,4.0,$$$,1.333333
spiller park pcm,dale,30308,4.0,$$$,1.333333


In [65]:
# Never mind; recombine
pd.concat([cafes_cheap, cafes_pricey])

Unnamed: 0,poc,zip,rating,price,value
east pole,jared,30324,4.0,$$,2.0
chrome yellow,kelly,30312,4.0,$$,2.0
brash,matt,30318,4.0,$$,2.0
taproom,jonathan,30317,4.0,$$,2.0
refuge,kitti,30303,4.0,$$,2.0
toptime,nolan,30318,4.0,$$,2.0
3heart,nhan,30306,4.0,$$$,1.333333
spiller park pcm,dale,30308,4.0,$$$,1.333333
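Note that the recombined rows keep their labels but not their original order. Assuming you still have the original table, here is a sketch of two ways to tidy up, using a made-up three-row table named `df`.

```python
import pandas as pd

df = pd.DataFrame({'price': ['$$', '$$$', '$$']},
                  index=['east pole', '3heart', 'refuge'])
is_cheap = df['price'] == '$$'

glued = pd.concat([df[is_cheap], df[~is_cheap]])  # rows keep their labels
restored = glued.loc[df.index]                    # reorder rows to match `df`
renumbered = glued.reset_index(drop=True)         # or discard labels: 0, 1, 2, ...

print(glued.index.tolist())
print(restored.equals(df))
```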


## More on index objects

A pandas [`Index`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html), used by `Series` and `DataFrame`, is "list-like." It has a number of useful operations, including set-like operations (e.g., testing for membership, intersection, union, difference):

In [66]:
from pandas import Index

In [67]:
cafes4.index

Index(['east pole', 'chrome yellow', 'brash', 'taproom', '3heart',
       'spiller park pcm', 'refuge', 'toptime'],
      dtype='object')

In [68]:
cafes4.index.isin(['brash', '3heart'])

array([False, False,  True, False,  True, False, False, False])

In [69]:
cafes4.index.union(['chattahoochee'])

Index(['3heart', 'brash', 'chattahoochee', 'chrome yellow', 'east pole',
       'refuge', 'spiller park pcm', 'taproom', 'toptime'],
      dtype='object')

In [70]:
cafes4.index.difference(['chattahoochee', 'starbucks', 'bar crema'])

Index(['3heart', 'brash', 'chrome yellow', 'east pole', 'refuge',
       'spiller park pcm', 'taproom', 'toptime'],
      dtype='object')
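The remaining set-like operations mentioned above, intersection and single-label membership, follow the same pattern. A brief sketch with made-up labels:

```python
import pandas as pd

ours = pd.Index(['east pole', 'brash', '3heart'])
theirs = ['brash', 'starbucks', '3heart']

common = ours.intersection(theirs)   # labels present in both
print(sorted(common))
print('brash' in ours)               # membership test on a single label
```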

If you need to change the index of a `DataFrame`, here is one way to do it.

In [71]:
cafes5 = cafes4.reindex(Index(['3heart', 'east pole', 'brash', 'starbucks']))

display(cafes4)
display(cafes5)

Unnamed: 0,poc,zip,rating,price,value
east pole,jared,30324,4.0,$$,2.0
chrome yellow,kelly,30312,4.0,$$,2.0
brash,matt,30318,4.0,$$,2.0
taproom,jonathan,30317,4.0,$$,2.0
3heart,nhan,30306,4.0,$$$,1.333333
spiller park pcm,dale,30308,4.0,$$$,1.333333
refuge,kitti,30303,4.0,$$,2.0
toptime,nolan,30318,4.0,$$,2.0


Unnamed: 0,poc,zip,rating,price,value
3heart,nhan,30306.0,4.0,$$$,1.333333
east pole,jared,30324.0,4.0,$$,2.0
brash,matt,30318.0,4.0,$$,2.0
starbucks,,,,,


Observe that this reindexing operation matches the supplied index values against the existing ones. (What happens to index values you leave out? What happens with new index values?)
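If you would rather not have missing values for new labels, `reindex` accepts a `fill_value` parameter. The sketch below, on a made-up two-row table named `df`, compares the default behavior with filling by 0.

```python
import pandas as pd

df = pd.DataFrame({'zip': [30324, 30318]}, index=['east pole', 'brash'])

default = df.reindex(['brash', 'starbucks'])               # new label gets NaN
filled = df.reindex(['brash', 'starbucks'], fill_value=0)  # new label gets 0

print(default)
print(filled)
```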

Another useful operation is dropping the index (and replacing it with the default, integers).

In [72]:
cafes6 = cafes4.reset_index(drop=True)
cafes6['name'] = cafes4.index
cafes6

Unnamed: 0,poc,zip,rating,price,value,name
0,jared,30324,4.0,$$,2.0,east pole
1,kelly,30312,4.0,$$,2.0,chrome yellow
2,matt,30318,4.0,$$,2.0,brash
3,jonathan,30317,4.0,$$,2.0,taproom
4,nhan,30306,4.0,$$$,1.333333,3heart
5,dale,30308,4.0,$$$,1.333333,spiller park pcm
6,kitti,30303,4.0,$$,2.0,refuge
7,nolan,30318,4.0,$$,2.0,toptime
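A possible shortcut, sketched here on a made-up two-row table named `df`, is to let `reset_index()` move the old index into a column for you. Naming the index first with `rename_axis` determines the name of the new column.

```python
import pandas as pd

df = pd.DataFrame({'zip': [30324, 30318]}, index=['east pole', 'brash'])

flat = df.rename_axis('name').reset_index()  # index becomes a column called 'name'
print(flat)
```

Unlike the cell above, this places `'name'` as the first column rather than the last.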


**Fin!** That's the end of these notes. With this information as background, you should be able to complete Notebook 7.