# Agenda: CSV and data frames

1. Data frames in general
    - What are they?
    - Creating them (very simple version)
    - Retrieving from them
2. Working with CSV files
    - What options are there?
3. Retrieving from parts of a data frame
    - Applying `.loc` and boolean indexes to retrieve from a data frame

# Data frames

So far, we've been (mostly) using series. But most work in Pandas is done with a data frame:

- Two-dimensional table
- Rows (which we identify with the index)
- Columns (which we identify with column names)

Remember that every column in a data frame is a Pandas series:
- The index identifying the rows continues to identify elements of a series
- All of the series (columns) in the data frame share an index
- When we retrieve a column, we get a (normal) series
- When we retrieve a row, Pandas creates a new series on the fly
- Every column has a dtype, whereas a row combines values from columns that may have different dtypes
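
To make the points above concrete, here is a minimal sketch. The data frame, its column names (`n`, `label`) and its index labels (`r1` to `r3`) are hypothetical, chosen only for illustration, and are not the ones used in the cells that follow.

```python
import pandas as pd

# Hypothetical example data, not taken from the cells below
df = pd.DataFrame({'n': [1, 2, 3], 'label': ['a', 'b', 'c']},
                  index=['r1', 'r2', 'r3'])

col = df['n']         # a column is an ordinary series with its own dtype
row = df.loc['r1']    # a row is a series that Pandas builds on the fly

print(type(col).__name__, col.dtype)   # the column keeps an integer dtype
print(type(row).__name__, row.dtype)   # the row mixes types, so its dtype is object
print(col.index.equals(df.index))      # the column shares the data frame's index
```

The row's index is made up of the column names, which is why a row that spans columns of different dtypes ends up as an `object` series.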

In [1]:
# it's rare to create a data frame from scratch
# but if we really want to create a data frame, we can do it with a list of lists, or a list of dicts,
# or even a dict of lists

# I'm just going to use a list of lists for my data frame

import pandas as pd
from pandas import Series, DataFrame

In [2]:
# Create a data frame with a list of lists
df = DataFrame([[10, 20, 30],
                [40, 50, 60],
                [70, 80, 90],
                [100, 110, 120]])

df

     0    1    2
0   10   20   30
1   40   50   60
2   70   80   90
3  100  110  120


In [3]:
# let's make this a bit more readable by giving both 
# an index and column names. Just as we can pass index= to 
# the Series class, we can pass columns= (as well) to the
# DataFrame class

df = DataFrame([[10, 20, 30],
                [40, 50, 60],
                [70, 80, 90],
                [100, 110, 120]],
               index=list('abcd'),
               columns=list('xyz'))

df

     x    y    z
a   10   20   30
b   40   50   60
c   70   80   90
d  100  110  120
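
The comments in the first cell mention that a data frame can also be built from a list of dicts or a dict of lists. The sketch below builds the same table both ways; the names `df_cols` and `df_rows` are hypothetical, used only to keep the two versions apart.

```python
from pandas import DataFrame

# A dict of lists: each key becomes a column name
df_cols = DataFrame({'x': [10, 40, 70, 100],
                     'y': [20, 50, 80, 110],
                     'z': [30, 60, 90, 120]},
                    index=list('abcd'))

# A list of dicts: each dict becomes one row, and its keys name the columns
df_rows = DataFrame([{'x': 10, 'y': 20, 'z': 30},
                     {'x': 40, 'y': 50, 'z': 60},
                     {'x': 70, 'y': 80, 'z': 90},
                     {'x': 100, 'y': 110, 'z': 120}],
                    index=list('abcd'))

# Both approaches should produce the same data frame
print(df_cols.equals(df_rows))
```

With a dict of lists you think column by column; with a list of dicts you think row by row. Which one is more convenient depends on how the data arrives.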


In [4]:
# we can retrieve the index just as we did with a series
df.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [5]:
# we can retrieve the column names with .columns
df.columns

Index(['x', 'y', 'z'], dtype='object')

In [6]:
# how can I retrieve one row?
# use the index and .loc

df.loc['c']

x    70
y    80
z    90
Name: c, dtype: int64

In [7]:
# I can run any series method on this return value
df.loc['c'].mean()

80.0

In [8]:
# how do I get a column? Answer: square brackets with the column name
df['x']

a     10
b     40
c     70
d    100
Name: x, dtype: int64

In [9]:
# we can get more than one row if we pass .loc a list of indexes
df.loc[['a', 'c']]

    x   y   z
a  10  20  30
c  70  80  90


In [10]:
# we can retrieve more than one column in the same way
df[['x', 'z']]

     x    z
a   10   30
b   40   60
c   70   90
d  100  120


In [11]:
# what kinds of operations can we run on a data frame?
# rule of thumb: (Just about) any method that you can run on a series, you can also run on a data frame
# you'll get back one result for each column

df['x'].mean()

55.0

In [12]:
df.mean()   # we get a series back, whose index is the same as df's column names

x    55.0
y    65.0
z    75.0
dtype: float64

In [13]:
df.min()

x    10
y    20
z    30
dtype: int64
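
The calls above return one result per column. As a hedged aside, the same methods accept an `axis` argument, which is not used elsewhere in these notes, to get one result per row instead. A minimal sketch, rebuilding the same data frame so that it runs on its own:

```python
from pandas import DataFrame

df = DataFrame([[10, 20, 30],
                [40, 50, 60],
                [70, 80, 90],
                [100, 110, 120]],
               index=list('abcd'),
               columns=list('xyz'))

# axis='columns' means "work across the columns", giving one value per row
row_means = df.mean(axis='columns')
print(row_means)
```

The result is a series whose index matches the data frame's index (`a` through `d`), rather than its column names.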

In [14]:
# I can even invoke something like describe, here on a single column

df['x'].describe()

count      4.000000
mean      55.000000
std       38.729833
min       10.000000
25%       32.500000
50%       55.000000
75%       77.500000
max      100.000000
Name: x, dtype: float64

In [15]:
df.describe()  # this will give us all of those values, for each column

                x           y           z
count    4.000000    4.000000    4.000000
mean    55.000000   65.000000   75.000000
std     38.729833   38.729833   38.729833
min     10.000000   20.000000   30.000000
25%     32.500000   42.500000   52.500000
50%     55.000000   65.000000   75.000000
75%     77.500000   87.500000   97.500000
max    100.000000  110.000000  120.000000


In [16]:
# you can add a new column to a data frame just by assigning
df['w'] = ['hello', 'out', 'there', 'everyone']
df

     x    y    z         w
a   10   20   30     hello
b   40   50   60       out
c   70   80   90     there
d  100  110  120  everyone


In [17]:
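# describe() summarizes only the numeric columns by default, so w doesn't appear
df.describe()

                x           y           z
count    4.000000    4.000000    4.000000
mean    55.000000   65.000000   75.000000
std     38.729833   38.729833   38.729833
min     10.000000   20.000000   30.000000
25%     32.500000   42.500000   52.500000
50%     55.000000   65.000000   75.000000
75%     77.500000   87.500000   97.500000
max    100.000000  110.000000  120.000000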


In [19]:
df['w'].describe()

count         4
unique        4
top       hello
freq          1
Name: w, dtype: object

In [20]:
df

     x    y    z         w
a   10   20   30     hello
b   40   50   60       out
c   70   80   90     there
d  100  110  120  everyone


In [21]:
df.loc['a']

x       10
y       20
z       30
w    hello
Name: a, dtype: object

In [22]:
# this row now mixes integers with a string (dtype object), so mean() can't add the values
df.loc['a'].mean()

UFuncTypeError: ufunc 'add' did not contain a loop with signature matching types (dtype('int64'), dtype('<U5')) -> None
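
One possible way around this error, sketched here as an assumption rather than as the approach these notes go on to take, is to select only the numeric columns before retrieving the row. The smaller two-row data frame below is hypothetical, rebuilt so that the example runs on its own.

```python
from pandas import DataFrame

# Hypothetical two-row version of the data frame above, including the string column w
df = DataFrame([[10, 20, 30, 'hello'],
                [40, 50, 60, 'out']],
               index=list('ab'),
               columns=list('xyzw'))

# Pick the numeric columns first, then the row: the result is an integer series
numeric_row = df[['x', 'y', 'z']].loc['a']
print(numeric_row.dtype)
print(numeric_row.mean())
```

Because the string column is left out before the row is built, the row keeps an integer dtype and `mean` works as it did before `w` was added.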

# Exercise: Create a data frame

1. Create a data frame describing the weather
2. 