# Basics of Pandas

### pd.Series

In [29]:
import pandas as pd
import numpy as np

In [3]:
s = pd.Series([1, 2, 3, 4])
s  # Looks like a numpy array with some extra numbers to the left

0    1
1    2
2    3
3    4
dtype: int64

In [4]:
s.values  # Really looks like a numpy array

array([1, 2, 3, 4])

In [5]:
type(s.values)  # Whelp, it's a numpy array

numpy.ndarray

In [6]:
s.index  # Extra data that helps access and organize the values

RangeIndex(start=0, stop=4, step=1)

In [7]:
s[2]  # Same slicing and indexing operations work on a series

3

In [8]:
s[1:]

1    2
2    3
3    4
dtype: int64

So why not just use the indexing that we already have? Why index with a Pandas index? We can use things besides numbers for our index keys.

In [9]:
s = pd.Series([6, 7, 8, 9], index=['a', 'b', 'c', 'd'])
s

a    6
b    7
c    8
d    9
dtype: int64

In [10]:
s['b':'d']  # NB. Look at that! Endpoint is included.

b    7
c    8
d    9
dtype: int64

In [11]:
s = pd.Series({
    'b': 4,
    'c': 2,
    'a': 1,
    'd': 0,
})  # Can toss a dictionary into the Series constructor and the keys will be used to create the index!
s

b    4
c    2
a    1
d    0
dtype: int64

In [12]:
s.index  # Notice: not a RangeIndex. RangeIndex is something special that Pandas uses as an optimization when it can.

Index(['b', 'c', 'a', 'd'], dtype='object')

In [13]:
s = pd.Series(['Grandma', 'Mom', 'Self', 'Daughter'], index=[-2, -1, 0, 1])
s

-2     Grandma
-1         Mom
 0        Self
 1    Daughter
dtype: object

In [14]:
s[1]  # Wait? Shouldn't that return 'Mom'?

'Daughter'

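A pandas Series uses its index to look up a value, so with the integer index above, s[1] matches the label 1 rather than the second position. To be more explicit, let's use the indexers .loc[] and .iloc[] (note the square brackets: they are indexers, not methods).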

In [15]:
s.loc[1]  # Look up item by value in index

'Daughter'

In [16]:
s.iloc[1]  # Look up item by position in series (Python list style)

'Mom'
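
The difference between the two indexers also shows up when slicing. A minimal sketch, rebuilding the Series from above under the name `family` (our name, not part of the original cells): `.loc` slices by label and includes the endpoint, while `.iloc` slices by position and excludes it, like a Python list.

```python
import pandas as pd

family = pd.Series(['Grandma', 'Mom', 'Self', 'Daughter'], index=[-2, -1, 0, 1])

by_label = family.loc[-1:0]      # labels -1 through 0, endpoint included
by_position = family.iloc[1:3]   # positions 1 and 2, endpoint excluded

print(by_label.tolist())
print(by_position.tolist())
```

Both selections pick out the same two entries here, even though the numbers in the slices are different.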

There's one more indexer, ix, that we'll talk about once we get through a bit of DataFrame info.

In [17]:
a = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'f'])
b = pd.Series([5, 6, 7, 8], index=['b', 'c', 'e', 'f'])

a + b  # Sums values based on index!

a     NaN
b     7.0
c     9.0
e     NaN
f    12.0
dtype: float64

In [18]:
a / b  # What's a NaN?

a    NaN
b    0.4
c    0.5
e    NaN
f    0.5
dtype: float64

NaN means Not a Number. When pandas cannot perform a mathematical operation because data is missing (here, an index label that appears in only one of the two Series), it places a NaN in the result. Sometimes this is okay. Sometimes we don't want any NaNs.

In [19]:
a.div(b, fill_value=0)  # Fill in missing data with this number...

a         inf
b    0.400000
c    0.500000
e    0.000000
f    0.500000
dtype: float64

In [20]:
a.div(b, fill_value=1)

a    1.000000
b    0.400000
c    0.500000
e    0.142857
f    0.500000
dtype: float64

In [21]:
a.fillna(4)  # NB. This returns a copy, not an in place change

a    1
b    2
c    3
f    4
dtype: int64

In [22]:
(a / b).dropna()  # Get rid of NaNs altogether...

b    0.4
c    0.5
f    0.5
dtype: float64

In [23]:
s.isnull()

-2    False
-1    False
 0    False
 1    False
dtype: bool
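
`isnull()` is more informative on data that actually has gaps. A small sketch, rebuilding the two Series from the arithmetic example under the hypothetical names `left` and `right`: NaN does not compare equal to itself, so `isna()` (an alias of `isnull()`) is the way to find missing entries.

```python
import numpy as np
import pandas as pd

left = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'f'])
right = pd.Series([5, 6, 7, 8], index=['b', 'c', 'e', 'f'])

ratio = left / right     # NaN wherever a label is missing on either side
missing = ratio.isna()   # boolean mask of the missing entries

print(np.nan == np.nan)               # NaN never equals itself
print(int(missing.sum()))             # number of missing entries
print(ratio[missing].index.tolist())  # labels that had no partner
```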

Why not use None?

Because None forces a NumPy array to dtype=object, which is slow and breaks numeric operations such as sum().

In [30]:
a = np.array([1, 2, None, 3])
a

array([1, 2, None, 3], dtype=object)

In [31]:
a.sum()

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
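
To see the dtype difference directly, here is a minimal sketch comparing the two kinds of array (the variable names are ours). With `np.nan` the array stays a float array and NumPy's NaN-aware reductions such as `np.nansum` work; with `None` it falls back to `object`.

```python
import numpy as np

with_none = np.array([1, 2, None, 3])
with_nan = np.array([1, 2, np.nan, 3])

print(with_none.dtype)       # object
print(with_nan.dtype)        # float64
print(with_nan.sum())        # nan: NaN propagates through a plain sum
print(np.nansum(with_nan))   # 6.0: the NaN-aware sum skips the missing value
```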

In [32]:
s = pd.Series([1, 2, None, 3])  # Pandas automatically converts None to a NaN
s

0    1.0
1    2.0
2    NaN
3    3.0
dtype: float64

In [33]:
%timeit s.sum()

85.4 µs ± 6.92 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [34]:
s = pd.Series([1, 2, np.nan, 3])  # np.nan; the np.NaN alias was removed in NumPy 2.0
s

0    1.0
1    2.0
2    NaN
3    3.0
dtype: float64

In [35]:
%timeit s.sum()

109 µs ± 23.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [36]:
np.nan   # NaN is defined by the IEEE 754 floating-point standard, so it's understood by a lot of other tools outside of Python.

nan

To sum up, Series are numpy arrays with an index object and extra methods to ease usage of the array.

### pd.DataFrame

The spreadsheet equivalent.

If we think of a series as a column, a DataFrame is a collection of columns side by side. These columns also share the same index!

In [37]:
GOOG = {
    '2019-07-01': 1080,
    '2019-04-01': 1194,
    '2019-01-01': 1045,
    '2018-10-01': 1195,
}

AAPL = {
    '2019-07-01': 197,
    '2019-04-01': 191,
    '2019-01-01': 157,
    '2018-10-01': 227,
}

stocks = {
    'GOOG': GOOG,
    'AAPL': AAPL,
}

In [38]:
df = pd.DataFrame(stocks)
df

            GOOG  AAPL
2018-10-01  1195   227
2019-01-01  1045   157
2019-04-01  1194   191
2019-07-01  1080   197


In [39]:
df['2018-10-01']  #  Can't index by row index anymore...

KeyError: '2018-10-01'

In [40]:
df['GOOG']

2018-10-01    1195
2019-01-01    1045
2019-04-01    1194
2019-07-01    1080
Name: GOOG, dtype: int64

In [41]:
type(df['GOOG'])  # It's a series

pandas.core.series.Series

In [42]:
df['GOOG']['2018-10-01']  # Index into the outer dictionary first (the columns), then we can slice into the rows

1195

Small gotcha: This behavior is the inverse of 2-D arrays in numpy, where rows are returned by the first index and columns are returned by the second index.

In [43]:
import numpy as np

A = np.arange(9).reshape((3, 3))
A

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [44]:
A[1]

array([3, 4, 5])
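
To get numpy-style row-first access on a DataFrame, `.loc` takes the row label first and the column label second. A short sketch on a two-row version of the stocks table (the name `prices` is ours):

```python
import pandas as pd

prices = pd.DataFrame(
    {'GOOG': [1195, 1045], 'AAPL': [227, 157]},
    index=['2018-10-01', '2019-01-01'],
)

row = prices.loc['2018-10-01']            # a whole row, returned as a Series
cell = prices.loc['2018-10-01', 'GOOG']   # row label first, then column label

print(row.to_dict())
print(cell)
```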

In [45]:
df  # Back to the df

            GOOG  AAPL
2018-10-01  1195   227
2019-01-01  1045   157
2019-04-01  1194   191
2019-07-01  1080   197


In [46]:
df.index

Index(['2018-10-01', '2019-01-01', '2019-04-01', '2019-07-01'], dtype='object')

In [47]:
df.columns

Index(['GOOG', 'AAPL'], dtype='object')

In [48]:
df.values

array([[1195,  227],
       [1045,  157],
       [1194,  191],
       [1080,  197]])

### DataFrame operations

In [49]:
df

            GOOG  AAPL
2018-10-01  1195   227
2019-01-01  1045   157
2019-04-01  1194   191
2019-07-01  1080   197


In [50]:
df.T  # Transpose!

      2018-10-01  2019-01-01  2019-04-01  2019-07-01
GOOG        1195        1045        1194        1080
AAPL         227         157         191         197


In [51]:
df['GOOG-deviation-from-mean'] = df['GOOG'] - df['GOOG'].mean()
df  # Not standard dev! 

            GOOG  AAPL  GOOG-deviation-from-mean
2018-10-01  1195   227                      66.5
2019-01-01  1045   157                     -83.5
2019-04-01  1194   191                      65.5
2019-07-01  1080   197                     -48.5


In [52]:
df.ix['2019-01-01':, 'GOOG']  # ix is deprecated and on its way out (it was removed in pandas 1.0).
#     rows , columns  <- Row-first order, the opposite of df[...], adds extra confusion.
# Also, with integer keys it is still ambiguous whether ix looks up by index value or by position.
# Basically, know of it but don't use it.

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated


2019-01-01    1045
2019-04-01    1194
2019-07-01    1080
Name: GOOG, dtype: int64
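
The same selection can be written with `.loc`, or with `.iloc` if you want positions, as the warning suggests. A sketch on a rebuilt copy of the table (the name `prices` is ours):

```python
import pandas as pd

prices = pd.DataFrame(
    {'GOOG': [1195, 1045, 1194, 1080], 'AAPL': [227, 157, 191, 197]},
    index=['2018-10-01', '2019-01-01', '2019-04-01', '2019-07-01'],
)

by_label = prices.loc['2019-01-01':, 'GOOG']   # rows from this label onward
by_position = prices.iloc[1:, 0]               # rows 1 onward, column 0

print(by_label.tolist())
print(by_label.equals(by_position))
```

Unlike ix, each of these says unambiguously whether the keys are labels or positions.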

In [53]:
# Let's find all the times GOOG was greater than its average in another way.
slice_ = df['GOOG'] > df['GOOG'].mean()  
slice_

2018-10-01     True
2019-01-01    False
2019-04-01     True
2019-07-01    False
Name: GOOG, dtype: bool

In [54]:
df[slice_]  # Wait, shouldn't the first argument to an index operation operate on columns? Like df['GOOG']?
# Boolean masks (like slices) are special: they act on rows.

            GOOG  AAPL  GOOG-deviation-from-mean
2018-10-01  1195   227                      66.5
2019-04-01  1194   191                      65.5
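
Masks can be combined with `&` and `|` (with parentheses or separate variables for each comparison) and passed to `.loc` together with a column selection. A sketch on a rebuilt copy of the table; the name `prices` and the threshold of 200 are arbitrary choices for illustration:

```python
import pandas as pd

prices = pd.DataFrame(
    {'GOOG': [1195, 1045, 1194, 1080], 'AAPL': [227, 157, 191, 197]},
    index=['2018-10-01', '2019-01-01', '2019-04-01', '2019-07-01'],
)

above_mean = prices['GOOG'] > prices['GOOG'].mean()
cheap_aapl = prices['AAPL'] < 200   # arbitrary threshold for illustration

# Rows where both conditions hold, keeping only the AAPL column
selected = prices.loc[above_mean & cheap_aapl, 'AAPL']
print(selected)
```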


In [55]:
df.columns = ['GOOG', 'AAPL', 'GOOG-dfm']  # Change column names
df

            GOOG  AAPL  GOOG-dfm
2018-10-01  1195   227      66.5
2019-01-01  1045   157     -83.5
2019-04-01  1194   191      65.5
2019-07-01  1080   197     -48.5


In [56]:
# Need to specify the axis (0 for rows, 1 for columns) and 
# note that it returns a copy of the DataFrame.
df.drop(['GOOG-dfm'], axis=1)
df

            GOOG  AAPL  GOOG-dfm
2018-10-01  1195   227      66.5
2019-01-01  1045   157     -83.5
2019-04-01  1194   191      65.5
2019-07-01  1080   197     -48.5


In [57]:
df = df.drop(['GOOG-dfm'], axis=1)
df

            GOOG  AAPL
2018-10-01  1195   227
2019-01-01  1045   157
2019-04-01  1194   191
2019-07-01  1080   197
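
`drop` also accepts a `columns=` keyword, which avoids having to remember which axis number means what. A sketch (the `prices` frame and its `spread` column are made up for illustration):

```python
import pandas as pd

prices = pd.DataFrame(
    {'GOOG': [1195, 1045], 'AAPL': [227, 157]},
    index=['2018-10-01', '2019-01-01'],
)
prices['spread'] = prices['GOOG'] - prices['AAPL']

trimmed = prices.drop(columns=['spread'])   # same as drop(['spread'], axis=1)

print(list(prices.columns))    # the original frame is unchanged
print(list(trimmed.columns))
```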


In [58]:
df

            GOOG  AAPL
2018-10-01  1195   227
2019-01-01  1045   157
2019-04-01  1194   191
2019-07-01  1080   197


In [59]:
df2 = pd.DataFrame({
    'GOOG': pd.Series([1127, 1031], index=['2018-07-01', '2018-04-01']),
    'MSFT': pd.Series([133, 101], index=['2019-07-01', '2019-01-01']),
})
df2  # Pandas fills in NaN when it doesn't have a value

              GOOG   MSFT
2018-04-01  1031.0    NaN
2018-07-01  1127.0    NaN
2019-01-01     NaN  101.0
2019-07-01     NaN  133.0


In [60]:
df2.fillna(0)  # Fill and drop methods operate on whole DataFrame

              GOOG   MSFT
2018-04-01  1031.0    0.0
2018-07-01  1127.0    0.0
2019-01-01     0.0  101.0
2019-07-01     0.0  133.0
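
Both methods take options for finer control. As a sketch on a rebuilt copy of the frame (named `partial` here, our name): `fillna` accepts a dict of per-column fill values, and `dropna` accepts `subset=` to consider only certain columns. The fill values are arbitrary.

```python
import pandas as pd

partial = pd.DataFrame({
    'GOOG': pd.Series([1127, 1031], index=['2018-07-01', '2018-04-01']),
    'MSFT': pd.Series([133, 101], index=['2019-07-01', '2019-01-01']),
})

filled = partial.fillna({'GOOG': 0, 'MSFT': -1})   # a different fill per column
goog_only = partial.dropna(subset=['GOOG'])        # drop rows that are missing GOOG

print(filled)
print(goog_only.index.tolist())
```

Like `Series.fillna`, both calls return new objects and leave `partial` untouched.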


In [61]:
df.drop?