# Pandas Overview

As we discussed previously, NumPy is the underlying library for numerical processing in Python. The PyData ecosystem has higher-level tools for everything from machine learning to data manipulation, so you will not need to spend much of your time working *directly* with NumPy.

The final library we will explore is pandas, the Python Data Analysis Library.

>*pandas* is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Learn more at: http://pandas.pydata.org/

pandas implements an abstraction called the DataFrame, which is borrowed (conceptually) from the R programming language. **You will use pandas all the time.** This library is impressive: it makes simple things simple and hard things possible. While it has its quirks, as does every library, overall it is a worthwhile tool that companies around the world use to run their businesses.

Here are some of the highlights:

- A fast and efficient DataFrame object for data manipulation with integrated indexing;
- Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
- Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
- Flexible reshaping and pivoting of data sets;
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
- Columns that can be inserted and deleted from data structures for size mutability;
- Ability to aggregate or transform data with a powerful group by engine allowing split-apply-combine operations on data sets;
- High performance merging and joining of data sets;
- Hierarchical axis indexing that provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
- Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging; even the ability to create domain-specific time offsets and join time series without losing data;
- Highly optimized for performance, with critical code paths written in Cython or C.
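
To make the alignment and missing-data highlight concrete, here is a minimal sketch. The fruit labels and quantities are made up for illustration and do not come from any real data set.

```python
import pandas as pd

# Two Series with partially overlapping labels (made-up data).
q1 = pd.Series([10, 20, 30], index=["apples", "pears", "plums"])
q2 = pd.Series([1, 2, 3], index=["pears", "plums", "kiwis"])

# Addition aligns on labels, not positions. Labels that appear in only
# one of the two Series produce NaN (missing) in the result.
total = q1 + q2
print(total)

# Missing values can then be dropped, or avoided by supplying a fill value.
print(total.dropna())
filled = q1.add(q2, fill_value=0)
print(filled)
```

Nothing here required us to line the two Series up by hand; pandas matched the labels for us.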

Python with pandas is in use in a wide variety of academic and commercial domains, including finance, neuroscience, economics, statistics, advertising, web analytics, and more.

In [1]:
import sys
print(sys.version)
import numpy as np
print(np.__version__)
import pandas as pd
print(pd.__version__)

3.5.0 (v3.5.0:374f501f4567, Sep 12 2015, 11:00:19) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
1.9.2
0.16.2


Let's get started with a general overview of the library and what it makes available to us. We will proceed through a couple of subjects at a quick pace. You do not need to take notes at this point. You can if you like, but this section is mainly here to introduce you to the power of pandas.

Let's take a look at the core types that make up pandas.

## Index
One of the most important types in pandas is the Index. We have not seen an index thus far, but you might think of it as a way to assign names to the columns and rows of an array.

We can find it in the pandas module, which by convention is imported `as pd`, much as NumPy is imported `as np`.

In [2]:
pd.Index

pandas.core.index.Index
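
As a minimal sketch of what an Index does on its own, the example below builds one from three labels and looks things up by label. The variable names are chosen here for illustration only.

```python
import pandas as pd

# An Index is an immutable, array-like collection of labels.
idx = pd.Index(["a", "b", "c"])

position = idx.get_loc("b")   # integer position of the label "b"
has_c = "c" in idx            # membership test on labels
print(position, has_c)

# Individual labels cannot be reassigned in place.
try:
    idx[0] = "z"
    mutated = True
except TypeError as err:
    mutated = False
    print("Index is immutable:", err)
```

Because an Index cannot be changed element by element, it is safe to share the same Index between several Series or DataFrames.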

## Series
Next we have the pandas Series. The Series is like a one-dimensional array in NumPy with a lot of useful helper functions as well as an index that makes for simple querying.

In [3]:
pd.Series

pandas.core.series.Series

Let's walk through an example.

In [4]:
series_ex = pd.Series(['a','b','c'])
series_ex

0    a
1    b
2    c
dtype: object

We can see the index on the left (numerical in this case) and the values on the right. The values are strings, which pandas reports here as dtype `object`. (We will get into the data types later.)

Let's look into each of those.

In [5]:
series_ex.index

Int64Index([0, 1, 2], dtype='int64')

In [6]:
series_ex.values

array(['a', 'b', 'c'], dtype=object)

Indexes do not have to be numerical; they can be almost anything we want. For example, let's replace our Series and its index with something else.

In [7]:
series_ex = pd.Series(np.arange(26))
series_ex

0      0
1      1
2      2
3      3
4      4
5      5
6      6
7      7
8      8
9      9
10    10
11    11
12    12
13    13
14    14
15    15
16    16
17    17
18    18
19    19
20    20
21    21
22    22
23    23
24    24
25    25
dtype: int64

In [8]:
import string  # Python standard library

In [9]:
lc = string.ascii_lowercase
uc = string.ascii_uppercase

In [10]:
print(lc,uc)

abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ


In [11]:
print(list(lc))

['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [12]:
series_ex.index=list(lc)

In [13]:
series_ex

a     0
b     1
c     2
d     3
e     4
f     5
g     6
h     7
i     8
j     9
k    10
l    11
m    12
n    13
o    14
p    15
q    16
r    17
s    18
t    19
u    20
v    21
w    22
x    23
y    24
z    25
dtype: int64
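
Assigning to `.index` after the fact is one option. As a sketch of an alternative, the labels can also be supplied when the Series is created, through the `index` argument of `pd.Series`. The name `by_letter` is just an illustrative choice.

```python
import string

import numpy as np
import pandas as pd

# Pass the labels directly through the `index` argument.
by_letter = pd.Series(np.arange(26), index=list(string.ascii_lowercase))

print(by_letter.head(3))
print(by_letter["z"])
```

Both approaches should give an equivalent Series; setting the labels at construction time simply avoids the intermediate numerical index.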

We can see that the index is now made up of letters rather than numbers, which means we can query by letter too. We can select a range of labels or a single value. For example, let's get the values from p to z.

In [14]:
series_ex.loc['p':'z']  # label-based slice (.ix was removed in later pandas releases)

p    15
q    16
r    17
s    18
t    19
u    20
v    21
w    22
x    23
y    24
z    25
dtype: int64
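
One detail worth seeing in code: a slice by label includes both endpoints, while a slice by position excludes the stop value, as with Python lists. The sketch below rebuilds the same letter-indexed Series under the hypothetical name `s` and compares the label-based `.loc` indexer with the position-based `.iloc` indexer.

```python
import string

import numpy as np
import pandas as pd

s = pd.Series(np.arange(26), index=list(string.ascii_lowercase))

by_label = s.loc["p":"z"]     # label slice: both "p" and "z" are included
by_position = s.iloc[15:26]   # positional slice: position 26 is excluded

same = by_label.equals(by_position)
print(same)

# The difference shows up clearly on a short slice.
n_label = len(s.loc["a":"c"])   # a, b, c
n_position = len(s.iloc[0:2])   # positions 0 and 1 only
print(n_label, n_position)
```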

In [15]:
series_ex['c']

2

Think for a moment about how much more work that would be using NumPy by itself. We would have to re-create all of the indexing machinery ourselves. Don't worry about understanding all of this right now; just get a sense for what I am doing. I have a series of data points in an array, and the index lets me easily query them.
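
For comparison, here is a rough sketch of the same two lookups using NumPy alone. It assumes we keep the labels in one array and the values in another, which is only one of several ways this could be done by hand.

```python
import string

import numpy as np

labels = np.array(list(string.ascii_lowercase))
values = np.arange(26)

# Single lookup: find the position of the label, then index the values.
c_value = values[np.flatnonzero(labels == "c")[0]]

# Range lookup: find both endpoint positions by hand, and remember the
# +1 so that the end label is included.
start = np.flatnonzero(labels == "p")[0]
stop = np.flatnonzero(labels == "z")[0]
p_to_z = values[start:stop + 1]

print(c_value)
print(p_to_z)
```

This works, but we now have two arrays to keep in sync, and every query repeats the bookkeeping that the pandas index handles for us.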

Now let's dive into the final concept: the DataFrame.

## DataFrames
You will be using DataFrames throughout your entire career as a data scientist. They are in R, they are in Python, and they are in other data tools and libraries. They are almost everywhere. Let's take a look. Once again, we can access the type under `pd`.

In [16]:
pd.DataFrame

pandas.core.frame.DataFrame

Let's go ahead and create one; we will make it similar to the series that we just created.

In [17]:
letters = pd.DataFrame([list(lc),list(uc),list(range(26))])
letters

   0  1  2  3  4  5  6  7  8  9  ...  16  17  18  19  20  21  22  23  24  25
0  a  b  c  d  e  f  g  h  i  j  ...   q   r   s   t   u   v   w   x   y   z
1  A  B  C  D  E  F  G  H  I  J  ...   Q   R   S   T   U   V   W   X   Y   Z
2  0  1  2  3  4  5  6  7  8  9  ...  16  17  18  19  20  21  22  23  24  25


Just like in NumPy, we can transpose this, flipping it across its diagonal so that rows become columns.

In [18]:
letters.transpose()

   0  1  2
0  a  A  0
1  b  B  1
2  c  C  2
3  d  D  3
4  e  E  4
5  f  F  5
6  g  G  6
7  h  H  7
8  i  I  8
9  j  J  9


I will keep the transposed version.

In [19]:
letters = letters.transpose()

As above, we can see its index. However, now we also have columns that we can access.

In [20]:
letters.index

Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
            17, 18, 19, 20, 21, 22, 23, 24, 25],
           dtype='int64')

In [21]:
letters.columns

Int64Index([0, 1, 2], dtype='int64')

At this point, the columns do not have very useful names. Let's change this to make them a bit more accessible in the future.

In [22]:
letters.columns = ['lowercase','uppercase','number']
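
As an aside, the transpose and rename steps can be skipped by building the frame from a dictionary that maps column names to column values. This is a sketch of an alternative, not the approach used in the rest of this section, and `letters_direct` is a hypothetical name.

```python
import string

import pandas as pd

# Each dictionary key becomes a column name; each value becomes that column.
letters_direct = pd.DataFrame({
    "lowercase": list(string.ascii_lowercase),
    "uppercase": list(string.ascii_uppercase),
    "number": list(range(26)),
})

print(letters_direct.shape)
print(letters_direct.dtypes)
```

One practical difference: built this way, the `number` column is stored with an integer dtype. In the transposed frame every original column mixed letters and numbers, so all three columns end up with the generic `object` dtype.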

The `head` method returns a DataFrame containing the first N rows; N defaults to 5. Notice that the output shows the column names as well as the index.

In [23]:
letters.head()

  lowercase uppercase number
0         a         A      0
1         b         B      1
2         c         C      2
3         d         D      3
4         e         E      4
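
To see a different number of rows, pass N explicitly. The sketch below uses a small stand-in frame named `df` (a hypothetical name) and also shows `tail`, which does the same thing from the other end.

```python
import string

import pandas as pd

df = pd.DataFrame({
    "lowercase": list(string.ascii_lowercase),
    "number": list(range(26)),
})

first_three = df.head(3)   # first 3 rows instead of the default 5
last_two = df.tail(2)      # last 2 rows

print(first_three)
print(last_two)
```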


Now that we have these column names, we can access the columns using dot syntax, which treats each column as though it were an attribute of the DataFrame. Note that this only works when the name is a valid Python identifier (no spaces, for example) and does not clash with an existing DataFrame attribute or method. We can also access columns via dictionary-style bracket syntax, which works for any column name.

In [24]:
letters.lowercase

0     a
1     b
2     c
3     d
4     e
5     f
6     g
7     h
8     i
9     j
10    k
11    l
12    m
13    n
14    o
15    p
16    q
17    r
18    s
19    t
20    u
21    v
22    w
23    x
24    y
25    z
Name: lowercase, dtype: object

In [25]:
letters['lowercase']

0     a
1     b
2     c
3     d
4     e
5     f
6     g
7     h
8     i
9     j
10    k
11    l
12    m
13    n
14    o
15    p
16    q
17    r
18    s
19    t
20    u
21    v
22    w
23    x
24    y
25    z
Name: lowercase, dtype: object

Both forms return the column as a `Series`.

In [26]:
type(letters.lowercase)

pandas.core.series.Series
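
The sketch below shows where dot access falls short. The column names are hypothetical and were picked to cause trouble: one contains a space, and one shares its name with the `DataFrame.count` method.

```python
import pandas as pd

# Hypothetical column names chosen to show the limits of dot access.
df = pd.DataFrame({
    "unit price": [1.5, 2.0],
    "count": [3, 4],
    "total": [4.5, 8.0],
})

# Bracket access works for any column name.
print(df["unit price"])
print(df["count"])

# Dot access finds the method, not the column, when the names clash.
count_is_method = callable(df.count)
print(count_is_method)

# For an ordinary name, the two forms return the same Series.
same_column = df.total.equals(df["total"])
print(same_column)
```

A reasonable habit is to use bracket syntax in code you intend to keep and reserve dot syntax for quick interactive exploration.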

This has been a quick, high-level orientation to the key pandas types. At this point you should start to understand how Series, Indexes, and DataFrames fit together. Coming up, we will delve deeper into each one to get a better understanding of how they work.