# Day 2 -- data frames

1. Q&A
2. Data frames -- creating and working with them
3. Adding and removing data in our data frames
4. Useful methods for our data frames
5. Boolean indexes / mask indexes
6. Using `.loc` to retrieve rows, or rows and columns together
7. Reading data from outside sources
    - Download this zipfile: https://files.lerner.co.il/data-science-exercise-files.zip
    - Scraping HTML files
    - Retrieving other formats

In [4]:
import pandas as pd
from pandas import Series, DataFrame

In [5]:
temps = Series([20, 23, 25, 22, 23, 25, 22, 27, 20])

# `.loc` retrieves data in numerous ways

1. Retrieve one value from a series that has the default (numeric, position-like) index
2. Retrieve one value from a series that has a custom (non-default) index, whose labels can be strings, ints, floats, etc.
3. Retrieve one or more values as a series using "fancy indexing," i.e., passing a list of index labels
4. Pass `.loc` a series/list of booleans of the same length as the series itself, and you get back only those elements that correspond to a `True`
5. Generate a boolean series with a comparison operator, and then pass that to `.loc`, to get only those values in the series for which the comparison is `True`

In [6]:
temps.loc[4]

np.int64(23)

In [7]:
temps.loc[2]

np.int64(25)

In [8]:
temps.loc[200]

KeyError: 200

In [9]:
temps

0    20
1    23
2    25
3    22
4    23
5    25
6    22
7    27
8    20
dtype: int64

In [11]:
# if I pass values for index=, those labels become the index, and .loc looks things up by those labels
# (that's why we sometimes need .iloc, which looks things up by position)

temps = Series([20, 23, 25, 22, 23, 25, 22, 27, 20],
              index='Mon Tue Wed Thu Fri Sat Sun Mon Tue'.split())

In [12]:
temps

Mon    20
Tue    23
Wed    25
Thu    22
Fri    23
Sat    25
Sun    22
Mon    27
Tue    20
dtype: int64

In [13]:
temps.loc['Sat']

np.int64(25)

In [14]:
temps.loc['Tue']

Tue    23
Tue    20
dtype: int64
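
The result above is a series, not a single value, because `'Tue'` appears twice in the index. Here is a minimal sketch of how you might check for repeated labels before relying on `.loc` to return a scalar. It rebuilds the same `temps` series; the variable names (`sat`, `tue`, `repeated_labels`) are just for illustration.

```python
import pandas as pd

temps = pd.Series([20, 23, 25, 22, 23, 25, 22, 27, 20],
                  index='Mon Tue Wed Thu Fri Sat Sun Mon Tue'.split())

# A label that appears once gives back a single value;
# a label that appears more than once gives back a series
sat = temps.loc['Sat']
tue = temps.loc['Tue']

# is_unique tells us whether every label in the index appears only once
has_unique_index = temps.index.is_unique

# duplicated(keep=False) marks every occurrence of a repeated label
repeated_labels = sorted(set(temps.index[temps.index.duplicated(keep=False)]))

print(type(tue).__name__, has_unique_index, repeated_labels)
```

If your code needs a scalar back, checking `is_unique` first is one way to avoid being surprised by a series.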

In [15]:
temps.loc[2]

KeyError: 2

In [16]:
temps.loc[[2, 4]]   # fancy indexing by label -- 2 and 4 aren't labels in this index, so we get a KeyError

KeyError: "None of [Index([2, 4], dtype='object')] are in the [index]"
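
Both errors above come from the same cause: once the index holds strings, `.loc` no longer understands integers. As a small sketch (rebuilding the same `temps` series, with illustrative variable names), this is how `.iloc` gets at the same elements by position:

```python
import pandas as pd

temps = pd.Series([20, 23, 25, 22, 23, 25, 22, 27, 20],
                  index='Mon Tue Wed Thu Fri Sat Sun Mon Tue'.split())

# .iloc ignores the labels and counts positions, starting at 0
third = temps.iloc[2]          # the value at position 2 (labeled 'Wed')
picked = temps.iloc[[2, 4]]    # fancy indexing by position; we get a series back

# .loc with a label that isn't in the index raises KeyError, which we can catch
try:
    temps.loc[2]
    found = True
except KeyError:
    found = False

print(third, list(picked.index), found)
```

Note that the series we get back from `.iloc` still carries the original labels (`'Wed'` and `'Fri'`), even though we asked by position.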

In [17]:
temps.loc[['Mon', 'Thu']]   # fancy indexing -- we give a list of indexes and get a series back

Mon    20
Mon    27
Thu    22
dtype: int64

In [18]:
# if you use fancy indexing, and there's only one match, you will still get a series back

temps.loc[['Thu']]

Thu    22
dtype: int64

In [19]:
# we can also use a boolean series (or list)
# in this case, the argument to .loc must be a list/series of booleans that is the same
# length as the series itself

temps.loc[[True, False, True, False, True, False, True, False, True]]

Mon    20
Wed    25
Fri    23
Sun    22
Tue    20
dtype: int64

In [21]:
# a variation on this is to generate a boolean series with a comparison, and pass it to .loc

temps.loc[temps < 25]   # this generates a boolean series, and passes that series to .loc

Mon    20
Tue    23
Thu    22
Fri    23
Sun    22
Tue    20
dtype: int64
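
It can help to see the two steps separately. The following sketch (same `temps` data; the names `mask`, `cool_days`, and `warm_days` are made up for illustration) first builds the boolean series and then hands it to `.loc`. It also uses `~`, which inverts a boolean series, to select the remaining rows.

```python
import pandas as pd

temps = pd.Series([20, 23, 25, 22, 23, 25, 22, 27, 20],
                  index='Mon Tue Wed Thu Fri Sat Sun Mon Tue'.split())

# Step 1: the comparison on its own gives a boolean series,
# with the same index and length as temps
mask = temps < 25

# Step 2: .loc keeps only the elements where the mask is True
cool_days = temps.loc[mask]

# ~ flips True and False, so this selects everything the mask left out
warm_days = temps.loc[~mask]

print(mask.sum(), list(cool_days), list(warm_days))
```

Since `True` counts as 1, `mask.sum()` is a quick way to count how many elements match the condition.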

In [22]:
temps

Mon    20
Tue    23
Wed    25
Thu    22
Fri    23
Sat    25
Sun    22
Mon    27
Tue    20
dtype: int64

In [23]:
temps.index

Index(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'Mon', 'Tue'], dtype='object')

In [24]:
temps.values

array([20, 23, 25, 22, 23, 25, 22, 27, 20])

# What is a data frame?

- 2D data
- The rows have an index (just like a series)
- The columns have names (which work like the index, but label the columns rather than the rows)
- Each column is basically a series, which means that all values in that column have the same dtype
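
To make these points concrete, here is a minimal sketch that builds a small data frame from a dict of lists. The data, the column names (`high`, `low`, `city`), and the row labels are all invented for illustration.

```python
import pandas as pd

# Hypothetical data: each dict key becomes a column name,
# and each list becomes that column's values
df = pd.DataFrame({'high': [25, 27, 22],
                   'low': [15.5, 17.0, 14.5],
                   'city': ['Haifa', 'Eilat', 'Safed']},
                  index=['Mon', 'Tue', 'Wed'])

print(df.shape)           # (number of rows, number of columns)
print(list(df.index))     # the row labels
print(list(df.columns))   # the column names
print(df.dtypes)          # one dtype per column

# Retrieving one column gives back a series that shares the row index
high = df['high']
print(type(high).__name__, list(high.index))
```

Each column has its own dtype, so a single data frame can mix integer, float, and string columns, while any one column stays uniform.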
