# Agenda, day 2

1. Q&A
2. Data frames -- creating and working with them
3. Adding and removing data in our data frames
4. Useful methods for data frames
5. Boolean / mask indexes and selecting data
6. Using `.loc` to retrieve rows, or rows and columns
7. Reading data from outside sources
    - CSV, Excel, etc.
  
Download this file, which contains a few items we'll use in the class: https://files.lerner.co.il/data-science-exercise-files.zip

In [1]:
import pandas as pd
from pandas import Series, DataFrame

In [3]:
temps = Series([20, 23, 25, 22, 23], index=list('abcde'))
temps

a    20
b    23
c    25
d    22
e    23
dtype: int64

# Most of the time, we want 2D data!

It's pretty standard to have *tabular* data, which contains 2 dimensions, normally seen as rows and columns.

You can think of each column in a data frame (which is the Pandas 2D data structure) as a series. It has an index entry for each element. That index is shared across all of the columns, and each column has its own name, as well.

In a data frame:

- The rows are labeled with an index
- The columns are labeled with columns (yeah, that's repetitive)
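
To make the "each column is a series" idea concrete, here's a minimal sketch. The name `df_demo` and its values are made up for illustration; they aren't part of the class files.

```python
from pandas import DataFrame, Series

# a small data frame with row labels a, b, c and column labels x, y
df_demo = DataFrame({'x': [10, 40, 70],
                     'y': [20, 50, 80]},
                    index=list('abc'))

col = df_demo['x']   # retrieving one column gives us a series

print(isinstance(col, Series))                    # is the column a series?
print(col.name)                                   # the series is named after the column
print(col.index.equals(df_demo.index))            # same index as the data frame
print(df_demo['y'].index.equals(col.index))       # and the same index as the other column
```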

In [4]:
# To create a data frame:
# - we need 2D data -- this could be a list of lists, or a list of dicts, or a dict of lists
#   it could also be a 2D NumPy array
# - we need names for the rows (the index)
# - we need names for the columns

# something simple, using a list of lists

df = DataFrame([[10, 20, 30],
               [40, 50, 60],
               [70, 80, 90],
               [100, 110, 120]])
df

     0    1    2
0   10   20   30
1   40   50   60
2   70   80   90
3  100  110  120
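
The comments in the cell above mention other kinds of 2D input. Here's a rough sketch of what those might look like; the variable names (`from_dicts`, `from_lists`, `from_array`) and the values are just for illustration.

```python
import numpy as np
from pandas import DataFrame

# list of dicts: each dict is one row, and the keys become column names
from_dicts = DataFrame([{'x': 10, 'y': 20, 'z': 30},
                        {'x': 40, 'y': 50, 'z': 60}])

# dict of lists: each key is a column name, each list holds that column's values
from_lists = DataFrame({'x': [10, 40],
                        'y': [20, 50],
                        'z': [30, 60]})

# 2D NumPy array: works like a list of lists, so the index and
# columns get default numeric labels unless we pass our own
from_array = DataFrame(np.array([[10, 20, 30],
                                 [40, 50, 60]]))

print(from_dicts.equals(from_lists))   # do both approaches give the same data frame?
print(from_array.shape)                # (rows, columns)
```

With a list of dicts or a dict of lists, the column names come from the keys, so you only need to pass `index` if you want your own row labels.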


In [5]:
# let's add names for both our index (the rows) and our columns


df = DataFrame([[10, 20, 30],
               [40, 50, 60],
               [70, 80, 90],
               [100, 110, 120]],
              index=list('abcd'),
              columns=list('xyz'))
df

     x    y    z
a   10   20   30
b   40   50   60
c   70   80   90
d  100  110  120


In [6]:
# if I want to retrieve a row, I can use .loc or .iloc, just like with
# a series!

df.loc['a']   # this retrieves row a

x    10
y    20
z    30
Name: a, dtype: int64

Pandas gives you data back in one of three forms, depending on what you ask for. If you ask for one value, you'll get that value. If you ask for a number of values in 1D (a single row or a single column), you'll get a series. (That's what we got here.) If you ask for a number of values in 2D, you'll get a data frame.
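
Here's a quick sketch of those three cases. `df_demo` is a stand-in copy of the data frame above. The `row, column` form of `.loc` used for the single value is something we get to later in the agenda, so treat it as a preview.

```python
from pandas import DataFrame, Series

df_demo = DataFrame([[10, 20, 30],
                     [40, 50, 60],
                     [70, 80, 90],
                     [100, 110, 120]],
                    index=list('abcd'),
                    columns=list('xyz'))

one_value = df_demo.loc['a', 'x']     # one row label + one column label: a single value
one_dim = df_demo.loc['a']            # one row label: a series
two_dim = df_demo.loc[['a', 'c']]     # a list of row labels: a data frame

print(one_value)                 # 10
print(type(one_dim).__name__)    # Series
print(type(two_dim).__name__)    # DataFrame
```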

In [7]:
df.loc['b']

x    40
y    50
z    60
Name: b, dtype: int64

In [8]:
df.loc['d']

x    100
y    110
z    120
Name: d, dtype: int64

In [10]:
%xmode Minimal
df.loc['x']

Exception reporting mode: Minimal


KeyError: 'x'
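
We get a `KeyError` here because `.loc` with a single label looks for that label in the index (the rows), and `x` is a column name. If you aren't sure where a label lives, one way to check is sketched below; `df_demo` is a stand-in copy of our data frame.

```python
from pandas import DataFrame

df_demo = DataFrame([[10, 20, 30],
                     [40, 50, 60],
                     [70, 80, 90],
                     [100, 110, 120]],
                    index=list('abcd'),
                    columns=list('xyz'))

print('x' in df_demo.index)      # False -- there is no row named x
print('x' in df_demo.columns)    # True  -- x is a column

try:
    df_demo.loc['x']             # asks for a *row* named x
except KeyError as e:
    print(f'No row named {e}')
```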

In [11]:
# how can I retrieve a single column?
# for this, we use just []! 
# this is why I asked you last week *NOT* to use just [] to get items from a series,
# but rather to use .loc and .iloc

df['x']  # this is a column, and we get it back!

a     10
b     40
c     70
d    100
Name: x, dtype: int64

In [12]:
# can I retrieve more than one row, or more than one column?
# remember, we can use "fancy indexing," passing a list of index/column names

df

     x    y    z
a   10   20   30
b   40   50   60
c   70   80   90
d  100  110  120


In [13]:
df.loc[['a', 'c']]  # fancy indexing

    x   y   z
a  10  20  30
c  70  80  90


In [15]:
df.loc['b':'d']   # slice -- only one pair of [] -- with .loc, the endpoint is included!

     x    y    z
b   40   50   60
c   70   80   90
d  100  110  120


In [16]:
# you can do similar things with .iloc, which uses the numeric position

df.iloc[1]

x    40
y    50
z    60
Name: b, dtype: int64

In [17]:
df.iloc[[1, 3]]

     x    y    z
b   40   50   60
d  100  110  120


In [19]:
df.iloc[1:3]  # this is up to and *not* including, as usual in Python

    x   y   z
b  40  50  60
c  70  80  90


In [20]:
# how can I retrieve multiple columns?
# answer: just use a list of column names, as you would with the index

df[['x', 'z']]

     x    z
a   10   30
b   40   60
c   70   90
d  100  120


In [21]:
df[['z', 'x']]   # retrieve them in a different order

     z    x
a   30   10
b   60   40
c   90   70
d  120  100


In [22]:
# sometimes, you might want a one-column data frame, rather than a series
# if you say this, you get a series back

df['x']

a     10
b     40
c     70
d    100
Name: x, dtype: int64

In [23]:
# instead, you can say

df[['x']]   # the [[ ]] means: I want a data frame back, even though it's just one column

     x
a   10
b   40
c   70
d  100


In [24]:
# once you have retrieved data from a data frame, you can perform all sorts of calculations
# using methods -- min, mean, max, std, median, etc.
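
For example, here's a small sketch of a few such calculations. `df_demo` is a stand-in copy of the `x`/`y`/`z` data frame from earlier.

```python
from pandas import DataFrame

df_demo = DataFrame([[10, 20, 30],
                     [40, 50, 60],
                     [70, 80, 90],
                     [100, 110, 120]],
                    index=list('abcd'),
                    columns=list('xyz'))

col_mean = df_demo['x'].mean()          # mean of one column (a series)
row_max = df_demo.loc['a'].max()        # max of one row (also a series)
two_means = df_demo[['x', 'z']].mean()  # on a data frame, we get one mean per column

print(col_mean)     # 55.0
print(row_max)      # 30
print(two_means)
```

Notice that calling `mean` on a data frame gives back a series, with one value for each column.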

# Exercise: Grocery store

1. Create a data frame in which you have two columns. One is the price of an item (`price`), and the other will be the number of sales of that item (`sales`). The index will be the names of the items that you are selling.
2. The data frame should have 4 rows, and each item will have a price and a number of sales.
3. Retrieve all of the info for apples.
4. Retrieve all of the info for bananas.
5. Retrieve all information for apples and bananas.
6. What is the mean price for all products?
7. What is the mean price for just apples and bananas?

In [26]:
df = DataFrame([[10, 5],
                [15, 4],
                [7, 10],
                [20, 2]],
               index='apple banana cucumber dill'.split(),
               columns='price sales'.split())

df

          price  sales
apple        10      5
banana       15      4
cucumber      7     10
dill         20      2
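
Here's one possible way to approach steps 3 through 7, using only what we've seen so far. This is a sketch, not the only answer; the name `grocery` is just a stand-in for the data frame above.

```python
from pandas import DataFrame

grocery = DataFrame([[10, 5],
                     [15, 4],
                     [7, 10],
                     [20, 2]],
                    index='apple banana cucumber dill'.split(),
                    columns='price sales'.split())

print(grocery.loc['apple'])               # step 3: one row, as a series
print(grocery.loc['banana'])              # step 4
print(grocery.loc[['apple', 'banana']])   # step 5: fancy indexing, gives a data frame

mean_all = grocery['price'].mean()        # step 6: mean of the price column
# step 7: first select the two rows, then the price column, then take the mean
mean_two = grocery.loc[['apple', 'banana']]['price'].mean()

print(mean_all)   # 13.0
print(mean_two)   # 12.5
```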
