# Agenda

1. Fun with promotions
2. Data frames
    - What are they?
    - How can we create simple ones by hand?
    - How do we retrieve from them?
3. Reading from CSV files
    - Some basic options
    - Things to watch out for
4. Retrieving with `.loc` -- ways to think about it
5. How to avoid a common Pandas warning / error

In [2]:
import pandas as pd
from pandas import Series, DataFrame

In [3]:
s = Series([10, 20, 30, 40, 50], dtype='int8')
s

0    10
1    20
2    30
3    40
4    50
dtype: int8

In [4]:
2 ** 8

256

In [5]:
s * 2

0     20
1     40
2     60
3     80
4    100
dtype: int8

In [6]:
s * 10  # here, we multiply by 10 and the numbers "roll over" -- because int8 (which holds -128 through 127) isn't big enough for some of the results

0    100
1    -56
2     44
3   -112
4    -12
dtype: int8

In [8]:
s * 200   # will this have similar problems?  no!

0     2000
1     4000
2     6000
3     8000
4    10000
dtype: int16

In [20]:
# somewhere along the line, Pandas decided to "promote" the series from int8 to int16, thus saving the day

# Pandas looks at the number by which we're multiplying -- if it fits into the current dtype, then it keeps the dtype
# and performs the operation, even if the results overflow. But if the number itself is too big for the current dtype,
# then it promotes the series to a larger dtype when it calculates.
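
To make the int8 limits concrete, here is a small sketch that isn't part of the original session. It looks up the dtype's range with `np.iinfo`, then widens the series by hand with `astype` before multiplying. Widening explicitly is just one assumed way to sidestep the problem; it's worth knowing because the automatic promotion shown in these cells depends on which NumPy and Pandas versions you have installed.

```python
import numpy as np
import pandas as pd

# The same values as above, stored as 8-bit signed integers
s = pd.Series([10, 20, 30, 40, 50], dtype='int8')

# np.iinfo reports the smallest and largest values an integer dtype can hold
limits = np.iinfo(np.int8)
print(limits.min, limits.max)   # -128 127

# Widen the dtype ourselves before multiplying, so nothing wraps around
widened = s.astype('int16') * 10
print(widened.tolist())         # [100, 200, 300, 400, 500]
```

If you know in advance that your results will need more room, choosing a bigger dtype up front means you don't have to rely on promotion at all.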

In [21]:
s + 126

0   -120
1   -110
2   -100
3    -90
4    -80
dtype: int8

In [22]:
s + 127

0   -119
1   -109
2    -99
3    -89
4    -79
dtype: int8

In [23]:
s + 128

0    138
1    148
2    158
3    168
4    178
dtype: int16

# Data frames

A data frame is a 2D table with rows and columns:

- Each row is identified by an index
- Each column is identified by a column name

Each column is basically a Pandas series. So anything that you can do on a series, you can do on a column.

- Because each column is a series, we continue to use the index to identify each element.
- All of the series (columns) in a data frame share an index
- When we retrieve a column, it's a series, like usual.
- When we retrieve a row, Pandas creates a new series on the fly
- Every column has a dtype, whereas a row combines whatever dtypes are in its columns, so a row is often an "object" series
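
The points in this list can be checked with a tiny sketch. The data frame here (columns `n` and `label`, index `r1` through `r3`) is made up purely for illustration; it mixes an integer column with a text column so that the row dtype becomes visible.

```python
import pandas as pd

# Hypothetical frame: one numeric column, one text column
df = pd.DataFrame({'n': [1, 2, 3], 'label': ['a', 'b', 'c']},
                  index=['r1', 'r2', 'r3'])

col = df['n']        # a column is a series with its own dtype
row = df.loc['r2']   # a row is a new series, built on the fly

print(type(col).__name__, type(row).__name__)   # Series Series
print(col.index.equals(df['label'].index))      # True: the columns share one index
print(row.index.tolist())                       # ['n', 'label']: a row is indexed by column names
print(row.dtype)                                # object, because the row mixes an int and a string
```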



In [24]:
# create a data frame with a list of lists
# each inner list describes a row in the data frame

df = DataFrame([[10, 20, 30],
                [40, 50, 60],
                [70, 80, 90],
               [100, 110, 120]])
df

     0    1    2
0   10   20   30
1   40   50   60
2   70   80   90
3  100  110  120


In [25]:
# by default, the rows and columns are both numbered starting at 0. That's technically fine,
# but in real life you'll want to identify them with meaningful names or numbers. We can do that by
# passing the "index" keyword argument, and also the "columns" keyword argument.

df = DataFrame([[10, 20, 30],
                [40, 50, 60],
                [70, 80, 90],
               [100, 110, 120]],
              index=list('abcd'),
              columns=list('xyz'))
df

     x    y    z
a   10   20   30
b   40   50   60
c   70   80   90
d  100  110  120
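
A list of lists describes the data row by row. Another common way to build a small data frame by hand, not shown in the session itself, is to pass a dict in which each key is a column name and each value is that column's data. The sketch below builds the same frame both ways and compares them; the variable names `by_rows` and `by_columns` are just for illustration.

```python
import pandas as pd

# Row-by-row construction: each inner list is one row
by_rows = pd.DataFrame([[10, 20, 30],
                        [40, 50, 60],
                        [70, 80, 90],
                        [100, 110, 120]],
                       index=list('abcd'),
                       columns=list('xyz'))

# Column-by-column construction: each dict key becomes a column name
by_columns = pd.DataFrame({'x': [10, 40, 70, 100],
                           'y': [20, 50, 80, 110],
                           'z': [30, 60, 90, 120]},
                          index=list('abcd'))

# Same labels, same values, same dtypes
print(by_rows.equals(by_columns))   # True
```

Which form is more convenient depends on whether your data naturally arrives as records (rows) or as measurements (columns).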


In [26]:
# we can retrieve the index, just as we did with a series
df.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [27]:
# how do we retrieve the column names? Just use the "columns" attribute
df.columns

Index(['x', 'y', 'z'], dtype='object')

In [28]:
# how can I retrieve a row from my data frame?
# use .loc, just as we did with a series

df.loc['c']

x    70
y    80
z    90
Name: c, dtype: int64

In [29]:
# with a series, we can retrieve more than one object at a time with fancy indexing.
# can we do that now? What do we get back?

df.loc[['a', 'c']]

    x   y   z
a  10  20  30
c  70  80  90


In [30]:
# can we retrieve individual columns? Yes, just use [] without any .loc
df['x']

a     10
b     40
c     70
d    100
Name: x, dtype: int64

In [31]:
df[['x', 'z']]

     x    z
a   10   30
b   40   60
c   70   90
d  100  120
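
So far we've selected rows with `.loc` and columns with `[]`. As a quick sketch of how the two fit together, `.loc` also accepts a row selector and a column selector, separated by a comma. The frame is rebuilt here so the example runs on its own; the variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame([[10, 20, 30],
                   [40, 50, 60],
                   [70, 80, 90],
                   [100, 110, 120]],
                  index=list('abcd'),
                  columns=list('xyz'))

# One row label and one column label: a single value
one_value = df.loc['c', 'x']
print(one_value)          # 70

# One row label and a list of column labels: a series
part_of_row = df.loc['c', ['x', 'z']]
print(part_of_row.tolist())   # [70, 90]

# A list of row labels and a list of column labels: a smaller data frame
corner = df.loc[['a', 'c'], ['x', 'z']]
print(corner)
```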


In [32]:
# we can run a method on a series, which means (normally) either on a row or a column
df['x'].mean()

55.0

In [33]:
# in most cases, a method you can run on a series can also be run on a data frame
# and in such a case, you'll get back one result for each column.

df.mean()   # this will return the mean for each column, labeling each column as well

x    55.0
y    65.0
z    75.0
dtype: float64

In [34]:
df.sum()   # let's sum all of the numbers in each column

x    220
y    260
z    300
dtype: int64

In [35]:
# What happens when I do the following:

df.sum().sum()

780

In [36]:
# lots of methods we can run:

df.min()

x    10
y    20
z    30
dtype: int64

In [37]:
# give me info about column x
df['x'].describe()

count      4.000000
mean      55.000000
std       38.729833
min       10.000000
25%       32.500000
50%       55.000000
75%       77.500000
max      100.000000
Name: x, dtype: float64

In [38]:
# run a method on a data frame, and you get back one set of answers per column -- that means a data frame!
df.describe() 

               x          y          z
count        4.0        4.0        4.0
mean        55.0       65.0       75.0
std    38.729833  38.729833  38.729833
min         10.0       20.0       30.0
25%         32.5       42.5       52.5
50%         55.0       65.0       75.0
75%         77.5       87.5       97.5
max        100.0      110.0      120.0


In [39]:
# the "values" attribute gives us the data frame's contents as a 2D NumPy array
df.values

array([[ 10,  20,  30],
       [ 40,  50,  60],
       [ 70,  80,  90],
       [100, 110, 120]])
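
As a hedged aside, the `to_numpy` method returns the same kind of array as `values`, and it is a handy way to see what happens when the columns don't all share a dtype. The `label` column below is hypothetical, added only to make the frame mixed.

```python
import pandas as pd

df = pd.DataFrame([[10, 20, 30],
                   [40, 50, 60],
                   [70, 80, 90],
                   [100, 110, 120]],
                  index=list('abcd'),
                  columns=list('xyz'))

# All columns are integers, so we get a 2D integer array
arr = df.to_numpy()
print(arr.shape)     # (4, 3)

# Hypothetical text column: now no single numeric dtype fits every column
mixed = df.assign(label=['p', 'q', 'r', 's']).to_numpy()
print(mixed.dtype)   # object
```

Notice that the array keeps only the data; the index and column names stay behind in the data frame.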

In [40]:
df

     x    y    z
a   10   20   30
b   40   50   60
c   70   80   90
d  100  110  120


In [41]:
df.mean()   # this gives us the mean for each column

x    55.0
y    65.0
z    75.0
dtype: float64

In [42]:
df.mean(axis='columns')  # this means: calculate across the columns, giving me a new series with one result per row

a     20.0
b     50.0
c     80.0
d    110.0
dtype: float64
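
The two `axis` values are easy to mix up, so here is a small side-by-side sketch. Storing the row means in a new column with `assign` is an assumed follow-up step, and the column name `row_mean` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame([[10, 20, 30],
                   [40, 50, 60],
                   [70, 80, 90],
                   [100, 110, 120]],
                  index=list('abcd'),
                  columns=list('xyz'))

# axis='index' (the default): move down the rows, one result per column
col_means = df.mean(axis='index')
print(col_means.index.tolist())   # ['x', 'y', 'z']

# axis='columns': move across the columns, one result per row
row_means = df.mean(axis='columns')
print(row_means.index.tolist())   # ['a', 'b', 'c', 'd']

# The row means share the frame's index, so they fit as a new column
with_mean = df.assign(row_mean=row_means)
print(with_mean)
```

One way to remember it: the result's index tells you what was kept, and the `axis` argument names what was collapsed.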

In [43]:
# how can I add a new column to a data frame?
# just assign to it!  (you can also replace an existing column this way, if you want)

df['w'] = 'hello out there everyone'.split()
df

     x    y    z         w
a   10   20   30     hello
b   40   50   60       out
c   70   80   90     there
d  100  110  120  everyone


In [44]:
df.dtypes  # what dtypes are in each column

x     int64
y     int64
z     int64
w    object
dtype: object

In [45]:
df.describe()  # what will happen with column w? By default, it's left out, because it isn't numeric

               x          y          z
count        4.0        4.0        4.0
mean        55.0       65.0       75.0
std    38.729833  38.729833  38.729833
min         10.0       20.0       30.0
25%         32.5       42.5       52.5
50%         55.0       65.0       75.0
75%         77.5       87.5       97.5
max        100.0      110.0      120.0


In [46]:
df['w'].describe()

count         4
unique        4
top       hello
freq          1
Name: w, dtype: object
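
If you'd rather not describe the text column separately, there are two options worth sketching here, neither of which appeared in the session: `select_dtypes` picks out columns by dtype, and `describe(include='all')` asks for a summary of every column at once. Statistics that don't apply to a column show up as `NaN`.

```python
import pandas as pd

df = pd.DataFrame([[10, 20, 30],
                   [40, 50, 60],
                   [70, 80, 90],
                   [100, 110, 120]],
                  index=list('abcd'),
                  columns=list('xyz'))
df['w'] = 'hello out there everyone'.split()

# Keep only the numeric columns
numeric_only = df.select_dtypes(include='number')
print(numeric_only.columns.tolist())   # ['x', 'y', 'z']

# Summarize every column, numeric or not, in one data frame
summary = df.describe(include='all')
print(summary)
```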

# Exercise: Create a data frame

1. Create a data frame containing your local weather forecast
    - It should have six rows, each for a different day
    - The index will contain day names/abbreviations
    - The columns will be `high` and `low`, showing the forecast high and low temps on each
2. Describe the values in `high` and `low`, b