# Agenda, day 2

1. Q&A
2. Data frames -- creating and working with them
3. Adding and removing data in our data frames
4. Useful methods for data frames
5. Boolean / mask indexes and selecting data
6. Using `.loc` to retrieve rows, or rows and columns
7. Reading data from outside sources
    - CSV, Excel, etc.
  
Download this file, which contains a few items we'll use in the class: https://files.lerner.co.il/data-science-exercise-files.zip

In [1]:
import pandas as pd
from pandas import Series, DataFrame

In [3]:
temps = Series([20, 23, 25, 22, 23], index=list('abcde'))
temps

a    20
b    23
c    25
d    22
e    23
dtype: int64

# Most of the time, we want 2D data!

It's pretty standard to have *tabular* data, which has two dimensions, normally seen as rows and columns.

You can think of each column in a data frame (which is the Pandas 2D data structure) as a series, with an index label for each element. That index is shared across all of the columns, and each column has its own name as well.

In a data frame:

- The rows are labeled with an index
- The columns are labeled with column names, stored in `columns` (yeah, that's repetitive)
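Here's a minimal sketch of that idea. The data frame `df_demo` and its values are made up for illustration, and retrieving a column with square brackets is one common way to do it, previewed here just to show that what comes back is a series sharing the data frame's index.

```python
import pandas as pd
from pandas import DataFrame

# Hypothetical data, chosen only for illustration
df_demo = DataFrame({'x': [10, 40], 'y': [20, 50]},
                    index=['a', 'b'])

col_x = df_demo['x']     # retrieve one column by its name

print(type(col_x).__name__)                 # Series
print(col_x.name)                           # x
print(col_x.index.equals(df_demo.index))    # True -- same index as the data frame

print(list(df_demo.index))      # ['a', 'b'] -- the row labels
print(list(df_demo.columns))    # ['x', 'y'] -- the column labels
```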

In [4]:
# To create a data frame:
# - we need 2D data -- this could be a list of lists, or a list of dicts, or a dict of lists
#   it could also be a 2D NumPy array
# - we need names for the rows (the index)
# - we need names for the columns

# something simple, using a list of lists

df = DataFrame([[10, 20, 30],
               [40, 50, 60],
               [70, 80, 90],
               [100, 110, 120]])
df

     0    1    2
0   10   20   30
1   40   50   60
2   70   80   90
3  100  110  120


In [5]:
# let's add names for both our index (the rows) and our columns


df = DataFrame([[10, 20, 30],
               [40, 50, 60],
               [70, 80, 90],
               [100, 110, 120]],
              index=list('abcd'),
              columns=list('xyz'))
df

     x    y    z
a   10   20   30
b   40   50   60
c   70   80   90
d  100  110  120
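The comments earlier mentioned other kinds of 2D input: a dict of lists, a list of dicts, and a 2D NumPy array. As a rough sketch, here is the same data frame built each way. The variable names (`from_dict_of_lists` and so on) are just illustrative, and the comparison checks only values and labels, since the dtype of the NumPy version can depend on your platform.

```python
import numpy as np
from pandas import DataFrame

rows = [[10, 20, 30],
        [40, 50, 60],
        [70, 80, 90],
        [100, 110, 120]]
index = list('abcd')
columns = list('xyz')

# list of lists: each inner list is a row
from_lists = DataFrame(rows, index=index, columns=columns)

# dict of lists: each key is a column name, each list is that column's values
from_dict_of_lists = DataFrame({'x': [10, 40, 70, 100],
                                'y': [20, 50, 80, 110],
                                'z': [30, 60, 90, 120]},
                               index=index)

# list of dicts: each dict is a row, and its keys are the column names
from_list_of_dicts = DataFrame([dict(zip(columns, row)) for row in rows],
                               index=index)

# 2D NumPy array: like a list of lists, so we pass the names ourselves
from_array = DataFrame(np.array(rows), index=index, columns=columns)

# Compare values and labels against the list-of-lists version
for other in [from_dict_of_lists, from_list_of_dicts, from_array]:
    same = (other.values.tolist() == from_lists.values.tolist()
            and list(other.index) == index
            and list(other.columns) == columns)
    print(same)
```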


In [6]:
# if I want to retrieve a row, I can use .loc or .iloc, just like with
# a series!

df.loc['a']   # this retrieves row a

x    10
y    20
z    30
Name: a, dtype: int64

Pandas only knows how to handle data in a limited number of ways. If you ask for one value, you'll get that value. But if you ask for a number of values in 1D, you'll get a series. (That's what we got here.) If you ask for a number of values in 2D, you'll get a data frame.
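To make those three cases concrete, here's a small sketch using the same data frame. The two-argument form `df.loc[row, column]` and the list-of-labels form are previews of things we'll cover later; the variable names are only for illustration.

```python
from pandas import Series, DataFrame

df = DataFrame([[10, 20, 30],
                [40, 50, 60],
                [70, 80, 90],
                [100, 110, 120]],
               index=list('abcd'),
               columns=list('xyz'))

one_value = df.loc['a', 'x']     # one row label + one column label: a single value
one_row = df.loc['a']            # one row label: 1D, so we get a series
two_rows = df.loc[['a', 'b']]    # a list of row labels: 2D, so we get a data frame

print(one_value)                   # 10
print(type(one_row).__name__)      # Series
print(type(two_rows).__name__)     # DataFrame
```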

In [7]:
df.loc['b']

x    40
y    50
z    60
Name: b, dtype: int64

In [8]:
df.loc['d']

x    100
y    110
z    120
Name: d, dtype: int64

In [10]:
%xmode Minimal
df.loc['x']

Exception reporting mode: Minimal


KeyError: 'x'
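Why the error? With a single label, `.loc` looks in the index (the row labels), and `'x'` is a column name, not a row label. The sketch below checks where `'x'` lives and then retrieves that column with square brackets, which is one common way to get a column (named `column_x` here just for illustration).

```python
from pandas import DataFrame

df = DataFrame([[10, 20, 30],
                [40, 50, 60],
                [70, 80, 90],
                [100, 110, 120]],
               index=list('abcd'),
               columns=list('xyz'))

print('x' in df.index)      # False -- 'x' is not a row label
print('x' in df.columns)    # True  -- it is a column label

try:
    df.loc['x']             # .loc with one label searches the row labels
except KeyError as e:
    print(f'KeyError: {e}')     # KeyError: 'x'

column_x = df['x']          # retrieve the column by name instead
print(list(column_x))       # [10, 40, 70, 100]
```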