# Basics of Pandas

### pd.Series

In [None]:
import numpy as np
import pandas as pd

In [None]:
s = pd.Series([1, 2, 3, 4])
s  # Looks like a numpy array with some extra numbers to the left

In [None]:
s.values  # Really looks like a numpy array

In [None]:
type(s.values)  # Welp, it's a NumPy array

In [None]:
s.index  # Extra data that helps access and organize the values

In [None]:
s[2]  # Same slicing and indexing operations work on a series

In [None]:
s[1:]

So why not just use the positional indexing that we already have? Why index with a Pandas index? Because we can use things besides numbers for our index keys.

In [None]:
s = pd.Series([6, 7, 8, 9], index=['a', 'b', 'c', 'd'])
s

In [None]:
s['b':'d']  # NB. Look at that! Endpoint is included.

In [None]:
s = pd.Series({
    'b': 4,
    'c': 2,
    'a': 1,
    'd': 0,
})  # Can toss a dictionary into a Series constructor and the keys will be used to create the index!
s

In [None]:
s.index  # Notice: not a RangeIndex. RangeIndex is something special that Pandas uses as an optimization when it can.
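
As a quick check of the RangeIndex remark, the sketch below compares the index pandas builds by default with the one it builds from dictionary keys. The names `default` and `labelled` are only illustrative.

```python
import pandas as pd

# No index given: pandas falls back to a RangeIndex (0, 1, 2, ...).
default = pd.Series([1, 2, 3, 4])

# Dictionary keys become the index, so a RangeIndex is not possible here.
labelled = pd.Series({'b': 4, 'c': 2, 'a': 1, 'd': 0})

print(isinstance(default.index, pd.RangeIndex))   # True
print(isinstance(labelled.index, pd.RangeIndex))  # False
print(list(labelled.index))  # the keys, in insertion order
```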

In [None]:
s = pd.Series(['Grandma', 'Mom', 'Self', 'Daughter'], index=[-2, -1, 0, 1])
s

In [None]:
s[1]  # Wait? Shouldn't that return 'Mom'?

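Pandas Series use their index to look up a value, so `s[1]` means "the item labeled 1", not "the second item". To be more explicit, let's use the indexers, .loc[] and .iloc[]. Note that they take square brackets, not parentheses.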

In [None]:
s.loc[1]  # Look up item by value in index

In [None]:
s.iloc[1]  # Look up item by position in series (Python list style)

There's one more indexer, ix, that we'll talk about once we get through a bit of DataFrame info.

In [None]:
a = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'f'])
b = pd.Series([5, 6, 7, 8], index=['b', 'c', 'e', 'f'])

a + b  # Sums values based on index!

In [None]:
a / b  # What's a NaN?

NaN means Not a Number. When pandas cannot perform a mathematical operation because one side is missing data, it puts a NaN in the result. Sometimes this is okay. Sometimes we don't want any NaNs.

In [None]:
a.div(b, fill_value=0)  # Fill in missing data with this number...

In [None]:
a.div(b, fill_value=1)
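
To see what `fill_value` does label by label, here is a small sketch that reuses the same two Series. The thing to notice is that the fill value stands in for whichever operand is missing, so filling with 0 can put a zero in the denominator.

```python
import pandas as pd

a = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'f'])
b = pd.Series([5, 6, 7, 8], index=['b', 'c', 'e', 'f'])

filled = a.div(b, fill_value=0)

# 'a' exists only in a, so b['a'] is treated as 0: 1 / 0 -> inf
print(filled['a'])
# 'e' exists only in b, so a['e'] is treated as 0: 0 / 7 -> 0.0
print(filled['e'])
# 'f' exists in both, so fill_value is not used: 4 / 8 -> 0.5
print(filled['f'])
```

Whether 0, 1, or some other value is the right fill depends on the operation, which is why it has to be chosen explicitly.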

In [None]:
a.fillna(4)  # NB. This returns a copy, not an in place change

In [None]:
(a / b).dropna()  # Get rid of NaNs altogether...

In [None]:
s.isnull()

Why not use None?

Because a None forces the array to the generic object dtype, which is slow, and some operations stop working on it.

In [None]:
a = np.array([1, 2, None, 3])
a

In [None]:
a.sum()  # Raises a TypeError: an int can't be added to None

In [None]:
s = pd.Series([1, 2, None, 3])  # Pandas automatically converts None to a NaN
s

In [None]:
%timeit s.sum()

In [None]:
s = pd.Series([1, 2, np.nan, 3])  # np.nan is the canonical spelling; the np.NaN alias was removed in NumPy 2.0
s

In [None]:
%timeit s.sum()

In [None]:
np.nan  # A value defined by the IEEE 754 floating-point standard, so it's understood by a lot of other tools outside of Python.
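
The dtype difference behind the None-versus-NaN discussion can be checked directly. This is a minimal sketch; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# A None inside a NumPy array forces the generic object dtype.
obj_arr = np.array([1, 2, None, 3])
print(obj_arr.dtype)   # object

# NaN is a float, so the array keeps a fast numeric dtype.
nan_arr = np.array([1, 2, np.nan, 3])
print(nan_arr.dtype)   # float64

# pandas converts None to NaN for numeric data, so both Series are float64.
s_none = pd.Series([1, 2, None, 3])
s_nan = pd.Series([1, 2, np.nan, 3])
print(s_none.dtype, s_nan.dtype)

# Series.sum() skips NaN by default; np.nansum does the same for arrays.
print(s_nan.sum())          # 6.0
print(np.nansum(nan_arr))   # 6.0
```

Since pandas has already converted the None, the two Series above hold the same float64 data. The slowdown shows up when the data stays as object dtype.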

To sum up, Series are numpy arrays with an index object and extra methods to ease usage of the array.

### pd.DataFrame

The spreadsheet equivalent.

If we think of a series as a column, a DataFrame is a collection of columns side by side. These columns also share the same index!
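
A minimal sketch of that idea: two Series placed side by side become the columns of a DataFrame, and each column carries the frame's index. The names `goog`, `aapl`, and `frame` are only illustrative.

```python
import pandas as pd

dates = ['2019-07-01', '2019-04-01']
goog = pd.Series([1080, 1194], index=dates, name='GOOG')
aapl = pd.Series([197, 191], index=dates, name='AAPL')

# axis=1 places the Series next to each other as columns.
frame = pd.concat([goog, aapl], axis=1)
print(frame)

# Each column is a Series that shares the frame's index.
same_index = (frame['GOOG'].index.equals(frame.index)
              and frame['AAPL'].index.equals(frame.index))
print(same_index)  # True
```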

In [None]:
GOOG = {
    '2019-07-01': 1080,
    '2019-04-01': 1194,
    '2019-01-01': 1045,
    '2018-10-01': 1195,
}

AAPL = {
    '2019-07-01': 197,
    '2019-04-01': 191,
    '2019-01-01': 157,
    '2018-10-01': 227,
}

stocks = {
    'GOOG': GOOG,
    'AAPL': AAPL,
}

In [None]:
df = pd.DataFrame(stocks)
df

In [None]:
df['2018-10-01']  # KeyError: [] on a DataFrame looks up columns, so we can't index by row label anymore...

In [None]:
df['GOOG']

In [None]:
type(df['GOOG'])  # It's a Series

In [None]:
df['GOOG']['2018-10-01']  # Index into the outer dictionary first (the columns), then we can slice into the rows

Small gotcha: This behavior is the inverse of 2-D arrays in numpy, where rows are returned by the first index and columns are returned by the second index.
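
The contrast can be sketched side by side. The frame `grid` and its labels `x`/`y`/`z` and `p`/`q`/`r` are made up for illustration. Note that `.loc` takes row first, then column, which matches the numpy order.

```python
import numpy as np
import pandas as pd

A = np.arange(9).reshape((3, 3))
print(A[1])                # numpy: the first index picks a row -> [3 4 5]

grid = pd.DataFrame(A, index=['x', 'y', 'z'], columns=['p', 'q', 'r'])
print(grid['q'])           # pandas: plain [] picks a column -> 1, 4, 7
print(grid.loc['y'])       # .loc with one label picks a row -> 3, 4, 5
print(grid.loc['y', 'q'])  # .loc with (row, column) -> 4
```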

In [None]:
import numpy as np

A = np.arange(9).reshape((3, 3))
A

In [None]:
A[1]

In [None]:
df  # Back to the df

In [None]:
df.index

In [None]:
df.columns

In [None]:
df.values

### DataFrame operations

In [None]:
df

In [None]:
df.T  # Transpose!

In [None]:
df['GOOG-deviation-from-mean'] = df['GOOG'] - df['GOOG'].mean()
df  # Not the standard deviation!

In [None]:
df.loc['2019-01-01':, 'GOOG']  # This used to be written with .ix, which was deprecated in pandas 0.20 and removed in 1.0.
#      rows , columns  <- Extra confusion
# .ix also had the same problem as s[1]: with integers, it indexed by index value rather than by position.
# Basically, know of it but don't use it. Use .loc (labels) or .iloc (positions) instead.

In [None]:
# Let's find all the times GOOG was greater than its average in another way.
slice_ = df['GOOG'] > df['GOOG'].mean()  
slice_

In [None]:
df[slice_]  # Wait, shouldn't the first argument to an index operation operate on columns? Like df['GOOG']?
# Boolean masks (and slices) are special: they act on rows.
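
If the column-versus-row behavior of plain `[]` feels ambiguous, the same selection can be written with `.loc`, which always reads as rows first. A small sketch using the GOOG and AAPL prices from above; the names `prices` and `mask` are only illustrative.

```python
import pandas as pd

prices = pd.DataFrame(
    {'GOOG': [1195, 1045, 1194, 1080], 'AAPL': [227, 157, 191, 197]},
    index=['2018-10-01', '2019-01-01', '2019-04-01', '2019-07-01'],
)

# One boolean per row: True where GOOG is above its mean (1128.5).
mask = prices['GOOG'] > prices['GOOG'].mean()

print(prices.loc[mask])          # same rows as prices[mask]
print(prices.loc[mask, 'AAPL'])  # rows by mask, then a single column
```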

In [None]:
df.columns = ['GOOG', 'AAPL', 'GOOG-dfm']  # Change column names
df

In [None]:
# Need to specify the axis (0 for rows, 1 for columns) and 
# note that it returns a copy of the DataFrame.
df.drop(['GOOG-dfm'], axis=1)
df

In [None]:
df = df.drop(['GOOG-dfm'], axis=1)
df

In [None]:
df

In [None]:
df2 = pd.DataFrame({
    'GOOG': pd.Series([1127, 1031], index=['2018-07-01', '2018-04-01']),
    'MSFT': pd.Series([133, 101], index=['2019-07-01', '2019-01-01']),
})
df2  # Pandas fills in NaN when it doesn't have a value

In [None]:
df2.fillna(0)  # Fill and drop methods operate on whole DataFrame
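
When filling or dropping across the whole DataFrame is too blunt, both methods can be narrowed to specific columns. This sketch rebuilds the same frame under the illustrative name `frame`.

```python
import pandas as pd

frame = pd.DataFrame({
    'GOOG': pd.Series([1127, 1031], index=['2018-07-01', '2018-04-01']),
    'MSFT': pd.Series([133, 101], index=['2019-07-01', '2019-01-01']),
})

# A dict limits the fill to the named columns; GOOG keeps its NaNs.
filled = frame.fillna({'MSFT': 0})
print(filled)

# dropna() removes every row that has any NaN. Here that is all of them.
print(len(frame.dropna()))

# subset= only looks at the listed columns when deciding what to drop.
print(frame.dropna(subset=['GOOG']))
```

Like the Series versions, these return new objects and leave `frame` unchanged.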