# Agenda, week 2

1. Q&A
2. Data frames
    - Creating them
    - Indexes and column names
    - Retrieving from them
    - Methods and attributes that we'll want to use
    - `.loc` and mask indexes
3. Creating data frames from files
    - CSV
    - Other formats, as well
    - Retrieving data from the Internet

In [2]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [5]:
s = Series([10, 20, 30, np.nan, 50, 60, 70, np.nan, np.nan, 100, 200],
          index=list('abcdefghijk'))
s

a     10.0
b     20.0
c     30.0
d      NaN
e     50.0
f     60.0
g     70.0
h      NaN
i      NaN
j    100.0
k    200.0
dtype: float64

In [6]:
s.dropna()    # returns a new series without the NaN values; the original series is unchanged

a     10.0
b     20.0
c     30.0
e     50.0
f     60.0
g     70.0
j    100.0
k    200.0
dtype: float64

In [7]:
s  # why do we still have NaN values here? Because dropna returned a new series and left s alone

a     10.0
b     20.0
c     30.0
d      NaN
e     50.0
f     60.0
g     70.0
h      NaN
i      NaN
j    100.0
k    200.0
dtype: float64

In [8]:
# How can you actually remove NaN values from a series?

# s = s.dropna()   # this takes the new series that dropna returns and assigns it back to s, replacing the original one
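
To make the difference concrete, here is a minimal sketch on a smaller series (the values and the name `cleaned` are just for illustration). It checks that `dropna` leaves the original alone, and that only the assignment changes what `s` refers to.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, np.nan, 40], index=list('abcd'))

cleaned = s.dropna()          # a new series; s itself is untouched
print(s.isna().sum())         # 1 -- the NaN is still in s
print(cleaned.isna().sum())   # 0

s = s.dropna()                # now the name s refers to the cleaned series
print(len(s))                 # 3
```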

In [9]:
s.fillna(999)   # new series with 999 instead of NaN is returned, but the original (s) is still unchanged

a     10.0
b     20.0
c     30.0
d    999.0
e     50.0
f     60.0
g     70.0
h    999.0
i    999.0
j    100.0
k    200.0
dtype: float64

In [10]:
s.interpolate()  # returns a new series with each NaN replaced by a linearly interpolated value (the default method); the original is unchanged

a     10.0
b     20.0
c     30.0
d     40.0
e     50.0
f     60.0
g     70.0
h     80.0
i     90.0
j    100.0
k    200.0
dtype: float64

By having methods return a new series/data frame, rather than modify the original, you can "chain" methods and perform longer, more interesting queries without making any assignments.
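
As a rough sketch of what chaining looks like, using the same series as above (the variable names `total` and `filled_total` are mine, not part of the lesson):

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30, np.nan, 50, 60, 70, np.nan, np.nan, 100, 200],
              index=list('abcdefghijk'))

# Each method returns a new series, so the next call can be applied to it directly
total = s.dropna().astype(int).sum()
print(total)          # 540

# Wrapping the chain in parentheses lets us put one step per line
filled_total = (
    s
    .interpolate()    # fill the NaN values first
    .sum()            # then add everything up
)
print(filled_total)   # 750.0
```

Neither chain changes `s`, so it still contains its three NaN values afterwards.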

# Data frames

If a series is a 1D data structure, with an index and values (all of which are one dtype), then a data frame is a 2D data structure, similar to an Excel spreadsheet.

A data frame:

- Has an index, which refers to the rows
- Has column names, which refer to the columns
- Each column is a series! This means that each column has a dtype. You can have different dtypes in different columns, but all of the values in a single column must have the same dtype.
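
Here is a small sketch of the "each column is a series" idea. The data frame and its column names are made up for illustration, and it uses `df['price']` to pull out one column, which is a preview of the retrieval syntax we'll get to later. The exact dtype reported for the text column depends on your pandas version, so I'm not showing the output here.

```python
import pandas as pd

# Hypothetical data: three columns, each with its own dtype
df = pd.DataFrame({'count': [1, 2, 3],
                   'price': [9.5, 20.0, 3.25],
                   'label': ['a', 'b', 'c']})

print(df.dtypes)             # one dtype per column
print(type(df['price']))     # a single column comes back as a Series
```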

In [11]:
# creating a data frame

# I'll create this using a list of lists


df = DataFrame([[10, 20, 30],
                [40, 50, 60],
                [70, 80, 90],
                [100, 110, 120]])
df

     0    1    2
0   10   20   30
1   40   50   60
2   70   80   90
3  100  110  120


We can see in our data frame:

- The index goes from 0-3, along the left side, just as we saw with a series
- The column names go from 0-2, along the top, naming the columns
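
A quick way to check those default labels, sketched on a smaller two-row data frame (my own example, not the one above):

```python
from pandas import DataFrame

# No index or column names given, so pandas numbers both from 0
df = DataFrame([[10, 20, 30],
                [40, 50, 60]])

print(list(df.index))     # [0, 1]
print(list(df.columns))   # [0, 1, 2]
print(df.shape)           # (2, 3) -- rows first, then columns
```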

In [12]:
# Can I work with this data frame? Yes. But it's usually better to give names to our index and our columns.
# We can do that at creation time by passing the "index" and "columns" keyword arguments.

df = DataFrame([[10, 20, 30],
                [40, 50, 60],
                [70, 80, 90],
                [100, 110, 120]],
              index=list('abcd'),
              columns=list('xyz'))

df

     x    y    z
a   10   20   30
b   40   50   60
c   70   80   90
d  100  110  120


In [13]:
# I can get the index for df with df.index

df.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [14]:
# I can get the columns for df with df.columns

df.columns

Index(['x', 'y', 'z'], dtype='object')
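
Both `df.index` and `df.columns` give back an `Index` object. As a small sketch (assuming the same `df` as above, rebuilt here so the example runs on its own), you can turn either one into a plain list or ask whether a label is in it:

```python
from pandas import DataFrame

df = DataFrame([[10, 20, 30],
                [40, 50, 60],
                [70, 80, 90],
                [100, 110, 120]],
               index=list('abcd'),
               columns=list('xyz'))

print(df.index.tolist())     # ['a', 'b', 'c', 'd']
print(df.columns.tolist())   # ['x', 'y', 'z']
print('a' in df.index)       # True
print('q' in df.columns)     # False
```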