# Pandas Overview

As we discussed previously, NumPy is the underlying library for numerical processing in Python. The PyData ecosystem has higher-level tools for everything from machine learning to data manipulation, so you won't need to spend too much of your time working *directly* with NumPy.

The final library we're going to explore is pandas, the Python Data Analysis Library.

>pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Learn more at: http://pandas.pydata.org/

Pandas implements an abstraction called the DataFrame, which is borrowed (conceptually) from the R programming language. **You will use pandas all the time.** Let's just repeat that: **you will use pandas all the time.** This library is awesome; it makes simple things simple and hard things possible. While it has its quirks like every library, overall it is an amazing tool that companies around the world use to run their businesses.

Here are some of the highlights:

- A fast and efficient DataFrame object for data manipulation with integrated indexing;
- Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
- Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
- Flexible reshaping and pivoting of data sets;
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
- Columns can be inserted and deleted from data structures for size mutability;
- Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
- High performance merging and joining of data sets;
- Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
- Time series-functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;
- Highly optimized for performance, with critical code paths written in Cython or C.
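To make the alignment and missing-data highlight a little more concrete, here is a minimal sketch. The two Series and their labels are made up for illustration; the point is only that arithmetic matches values by label rather than by position.

```python
import pandas as pd

# Two hypothetical Series whose labels only partly overlap.
left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Arithmetic aligns on labels, not positions. A label found in only one
# Series produces NaN (missing) in the result.
aligned = left + right
print(aligned)

# Series.add with fill_value treats a label missing from one side as 0.
filled = left.add(right, fill_value=0)
print(filled)
```

Here `"a"` and `"d"` come out missing in `aligned` because each appears on only one side, while `filled` keeps all four labels with numeric values.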

Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.

In [2]:
import sys
print(sys.version)
import numpy as np
print(np.__version__)
import pandas as pd
print(pd.__version__)

3.3.5 |Anaconda 2.2.0 (64-bit)| (default, Sep  2 2014, 13:55:40) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
1.9.2
0.15.2


So let's get started with a general overview of the library and what it makes available to us. We're going to breeze through a couple of subjects. Don't feel the need to take notes or even try this code yourself. You can if you like, but the goal is to introduce you to the power of pandas, not to give you code to copy.

Pandas is built around a few core types: the Index, the Series, and the DataFrame.

## Index
One of those types is the Index. We haven't seen an index thus far, but think of it as the labels for the rows and columns of a NumPy array.

We can find it in the pandas module, which by convention is imported as `pd` (just as NumPy is imported as `np`).

In [None]:
pd.Index
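As a rough illustration of what an Index does on its own, the sketch below builds one from a few made-up labels and looks a label up. The labels and the `mutated` flag are assumptions for the example, not part of the lesson's data.

```python
import pandas as pd

# A small Index of made-up labels.
idx = pd.Index(["a", "b", "c"])

print(idx)                # the labels and their dtype
print(idx.get_loc("b"))   # integer position of the label "b"
print("c" in idx)         # membership test on labels

# An Index is immutable: assigning to a position raises TypeError.
try:
    idx[0] = "z"
    mutated = True
except TypeError:
    mutated = False
print(mutated)
```

That immutability is part of what lets pandas safely share one Index between several Series or DataFrames.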

## Series
Next we've got the Series. The Series is like a one-dimensional NumPy array with a bunch more helper functions, plus an index that makes for simple querying.

In [4]:
pd.Series

pandas.core.series.Series

Let's walk through an example.

In [7]:
series_ex = pd.Series(['a','b','c'])
series_ex

0    a
1    b
2    c
dtype: object

We can see the index on the left (which in this case is numerical) and the values on the right, which are strings stored with the `object` dtype (we'll get into data types later).

Let's look into each of those.

In [8]:
series_ex.index

Int64Index([0, 1, 2], dtype='int64')

In [9]:
series_ex.values

array(['a', 'b', 'c'], dtype=object)

Now, indexes don't have to be numerical; they can be pretty much anything we want. For example, let's replace our series with a new one and then swap its index for something else.

In [10]:
series_ex = pd.Series(np.arange(26))
series_ex

0      0
1      1
2      2
3      3
4      4
5      5
6      6
7      7
8      8
9      9
10    10
11    11
12    12
13    13
14    14
15    15
16    16
17    17
18    18
19    19
20    20
21    21
22    22
23    23
24    24
25    25
dtype: int64

In [11]:
import string #python standard library

In [12]:
lc = string.ascii_lowercase
uc = string.ascii_uppercase

In [13]:
print(lc,uc)

abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ


In [17]:
print(list(lc))

['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [15]:
series_ex.index=list(lc)

In [16]:
series_ex

a     0
b     1
c     2
d     3
e     4
f     5
g     6
h     7
i     8
j     9
k    10
l    11
m    12
n    13
o    14
p    15
q    16
r    17
s    18
t    19
u    20
v    21
w    22
x    23
y    24
z    25
dtype: int64

Alright! We can see that our index is no longer numerical; it's now labeled by letter! This is awesome because we can query it by letter too. We can query sections or even specific values. For example, let's get the values from p to z.

In [18]:
# .loc selects by label; the older .ix accessor was removed in pandas 1.0
series_ex.loc['p':'z']

p    15
q    16
r    17
s    18
t    19
u    20
v    21
w    22
x    23
y    24
z    25
dtype: int64

In [19]:
series_ex['c']

2

How awesome is that? Think of how difficult that would be using NumPy directly. It'd be *much harder* because a NumPy array doesn't have those handy labels to work from. Don't worry about understanding all of this right now; just get a sense for what I'm doing. I'm able to query via an index pretty easily, and underneath it's just a series of data points in an array, nothing much more than that.
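To give a feel for that comparison, here is a small sketch that does the same "p to z" lookup both ways. The `position` dictionary is a hypothetical helper we would have to maintain ourselves with plain NumPy; the variable names are made up for this example.

```python
import string

import numpy as np
import pandas as pd

labels = list(string.ascii_lowercase)
values = np.arange(26)

# Plain NumPy: we have to maintain the label -> position mapping ourselves.
position = {label: i for i, label in enumerate(labels)}
numpy_slice = values[position["p"]:position["z"] + 1]

# pandas: the index carries the labels, so we can ask for them directly.
series = pd.Series(values, index=labels)
label_slice = series.loc["p":"z"]     # label slice, includes the end label "z"
position_slice = series.iloc[15:26]   # position slice, excludes the end position

print(numpy_slice)
print(label_slice.equals(position_slice))
```

Note the difference in endpoints: a label slice with `.loc` includes `"z"`, while a positional slice with `.iloc` stops before position 26, just like a regular Python slice.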

Now let's dive into the final concept - the DataFrame.

## DataFrames
You will be using DataFrames all your life as a data scientist. They're in R, they're in Python, they're in other data tools and libraries. They're all over the place. So let's get you introduced. We can again access it under `pd`.

In [20]:
pd.DataFrame

pandas.core.frame.DataFrame

Now let's go ahead and create one. We'll make it similar to the series that we just created.

In [26]:
letters = pd.DataFrame([list(lc),list(uc),list(range(26))])
letters

   0  1  2  3  4  5  6  7  8  9  ...  16  17  18  19  20  21  22  23  24  25
0  a  b  c  d  e  f  g  h  i  j  ...   q   r   s   t   u   v   w   x   y   z
1  A  B  C  D  E  F  G  H  I  J  ...   Q   R   S   T   U   V   W   X   Y   Z
2  0  1  2  3  4  5  6  7  8  9  ...  16  17  18  19  20  21  22  23  24  25


Now, just like in NumPy, we can transpose this to flip it across its diagonal, so rows become columns and columns become rows.

In [30]:
letters.transpose()

   0  1  2
0  a  A  0
1  b  B  1
2  c  C  2
3  d  D  3
4  e  E  4
5  f  F  5
6  g  G  6
7  h  H  7
8  i  I  8
9  j  J  9


I'm going to go ahead and keep the transposed version.

In [32]:
letters = letters.transpose()

Like we did above, we can look at its index. Now, however, we also have columns that we can access.

In [34]:
letters.index

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25], dtype='int64')

In [33]:
letters.columns

Int64Index([0, 1, 2], dtype='int64')

However, the column names aren't very descriptive at this point. Let's rename them to make them easier to work with in the future.

In [35]:
letters.columns = ['lowercase','uppercase','number']

The `head` method returns a DataFrame containing the first N rows, where N defaults to 5. This lets us see the columns as well as the index.

In [37]:
letters.head()

  lowercase uppercase number
0         a         A      0
1         b         B      1
2         c         C      2
3         d         D      3
4         e         E      4
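The construct, transpose, and rename steps above can also be collapsed into one step by passing a dictionary of columns. This is a sketch of an alternative, not the approach the lesson uses; `letters_alt` is a made-up name so it doesn't clash with `letters`.

```python
import string

import pandas as pd

lc = string.ascii_lowercase
uc = string.ascii_uppercase

# Each dict key becomes a column name; each value becomes that column's data.
# Passing columns= pins the column order explicitly.
letters_alt = pd.DataFrame(
    {
        "lowercase": list(lc),
        "uppercase": list(uc),
        "number": list(range(26)),
    },
    columns=["lowercase", "uppercase", "number"],
)

print(letters_alt.head())
print(letters_alt.dtypes)
```

One side effect worth noticing: built this way, the `number` column gets an integer dtype, whereas a frame transposed from mixed rows of strings and numbers stores every column with the `object` dtype.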


Another cool feature: now that we have these column names, we can access a column with dot syntax, like a property of the DataFrame, as long as the name has no spaces. We can also access it via dictionary syntax, which is especially helpful if we have spaces in our column names.

In [41]:
letters.lowercase

0     a
1     b
2     c
3     d
4     e
5     f
6     g
7     h
8     i
9     j
10    k
11    l
12    m
13    n
14    o
15    p
16    q
17    r
18    s
19    t
20    u
21    v
22    w
23    x
24    y
25    z
Name: lowercase, dtype: object

In [42]:
letters['lowercase']

0     a
1     b
2     c
3     d
4     e
5     f
6     g
7     h
8     i
9     j
10    k
11    l
12    m
13    n
14    o
15    p
16    q
17    r
18    s
19    t
20    u
21    v
22    w
23    x
24    y
25    z
Name: lowercase, dtype: object

Either way, we get that column back as a `Series`.

In [43]:
type(letters.lowercase)

pandas.core.series.Series
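Here is a short sketch of where dot syntax stops working and dictionary syntax is needed. The frame and its column names are hypothetical, chosen so that one name contains a space and another collides with an existing DataFrame method.

```python
import pandas as pd

# Hypothetical frame: one column name contains a space, another shares
# its name with a DataFrame method.
df = pd.DataFrame({
    "first name": ["Ada", "Grace"],
    "count": [3, 5],
    "city": ["London", "New York"],
})

print(df.city)            # dot syntax works: valid identifier, no name clash
print(df["first name"])   # the space forces bracket syntax
print(df["count"])        # bracket syntax returns the column

# df.count is the DataFrame.count method, not the "count" column.
print(callable(df.count))
```

Because of cases like `count`, bracket syntax is the safer default in code you plan to keep, with dot syntax as a convenience for quick exploration.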

That'll be all for this lesson, because I'm sure you're feeling a bit overwhelmed. This is intended to give you an overview of the concepts and core data abstractions that pandas has. Now that we've gotten that overview, let's dive a bit deeper into each one to get a better understanding of how they work.