# Introduction to Pandas

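**pandas** is a Python package providing fast, flexible, and expressive data structures designed to work with labeled, tabular data. The name is derived from "panel data", an econometrics term for structured, multidimensional data sets.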

pandas is well suited for:

- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet

**Key Features**:

- Easy handling of **missing data**
- Automatic and explicit **data alignment**
- Intelligent label-based **slicing, indexing and subsetting** of large data sets
- Powerful, flexible **group by functionality** to perform split-apply-combine operations on data sets
- Robust **IO tools** for loading data from flat files, Excel files, databases, etc.

In [5]:
# Embed the pandas home page in the notebook.
from IPython.display import HTML
HTML('<iframe src="http://pandas.pydata.org" width="800" height="350"></iframe>')

- Before exploring pandas, let's import the package. By convention, it is imported under the alias `pd`.

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np

## Pandas Data Structures

### Series

A Series is a single vector of data (like a NumPy array) with an *index* that labels each element in the vector.

In [28]:
counts = pd.Series([223, 43, 53, 24, 43])
counts

0    223
1     43
2     53
3     24
4     43
dtype: int64

- If an *index* is not specified, a default sequence of integers is assigned as the index.

- We can access the values as we would in an array.

In [17]:
counts[0]

223

In [18]:
counts[1:4]

1    43
2    53
3    24
dtype: int64

- You can get the array representation and the index object of the *Series* via its `values` and `index` attributes, respectively.

In [7]:
counts.values

array([223,  43,  53,  24,  43], dtype=int64)

In [8]:
counts.index

RangeIndex(start=0, stop=5, step=1)

- We can assign meaningful labels to the index, if they are available:

In [9]:
fruit = pd.Series([223,  43,  53,  24, 43],
                 index=['apple', 'orange', 'banana', 'pears', 'lemon'])

fruit

apple     223
orange     43
banana     53
pears      24
lemon      43
dtype: int64

In [20]:
fruit.index

Index(['apple', 'orange', 'banana', 'pears', 'lemon'], dtype='object')

- These labels can be used to refer to the values in the Series.

In [10]:
fruit['apple']

223

In [11]:
fruit[['apple', 'lemon']]

apple    223
lemon     43
dtype: int64
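
- The lookups above can be made explicit with `.loc` (by label) and `.iloc` (by position). The following is a minimal sketch, not part of the original notebook, that rebuilds the `fruit` Series so it runs on its own:

```python
import pandas as pd

# Same data as the `fruit` Series above.
fruit = pd.Series([223, 43, 53, 24, 43],
                  index=['apple', 'orange', 'banana', 'pears', 'lemon'])

by_label = fruit.loc['banana']              # label-based lookup
by_position = fruit.iloc[2]                 # position-based lookup (third element)
label_slice = fruit.loc['orange':'pears']   # label slices include both endpoints
position_slice = fruit.iloc[1:3]            # position slices exclude the stop

print(by_label, by_position)
print(label_slice)
print(position_slice)
```

Being explicit about label versus position avoids ambiguity when the index itself holds integers.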

- We can also give a name to both the array of values and the index:

In [15]:
fruit.name = 'counts'
fruit.index.name = 'fruit'
fruit

fruit
apple     223
orange     43
banana     53
pears      24
lemon      43
Name: counts, dtype: int64

- Operations can be applied to a Series without losing the data structure: the result keeps the index.
- A boolean Series produced by a comparison can be used to filter the original Series.

In [21]:
fruit > 50

fruit
apple      True
orange    False
banana     True
pears     False
lemon     False
Name: counts, dtype: bool

In [22]:
fruit[fruit > 50]

fruit
apple     223
banana     53
Name: counts, dtype: int64
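
- As a further illustration of operations that keep the index, here is a small sketch (an addition to the original notebook) using scalar arithmetic, a NumPy function, and a filter that combines two conditions:

```python
import numpy as np
import pandas as pd

fruit = pd.Series([223, 43, 53, 24, 43],
                  index=['apple', 'orange', 'banana', 'pears', 'lemon'])

doubled = fruit * 2       # scalar arithmetic keeps the labels
logged = np.log(fruit)    # NumPy ufuncs return a Series with the same index

# Combine conditions with & (and) or | (or); each condition needs parentheses.
mid_range = fruit[(fruit > 40) & (fruit < 100)]

print(doubled)
print(logged)
print(mid_range)
```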

- Critically, the labels are used to align data when used in operations with other Series objects.

In [26]:
fruit2 = pd.Series([11, 12, 13, 14, 15],
                   index=['orange', 'banana', 'pears', 'peach', 'apple'])
fruit2

orange    11
banana    12
pears     13
peach     14
apple     15
dtype: int64

In [29]:
fruit

fruit
apple     223
orange     43
banana     53
pears      24
lemon      43
Name: counts, dtype: int64

In [27]:
fruit + fruit2

apple     238.0
banana     65.0
lemon       NaN
orange     54.0
peach       NaN
pears      37.0
dtype: float64

- Contrast this with NumPy arrays, where arrays of the same length are combined element-wise by position. Adding two Series combines the values that share a label.
- Notice that the missing values are propagated by addition: a label present in only one Series (`lemon`, `peach`) yields `NaN`, which is why the result has dtype `float64`.
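
- If the `NaN` entries are not wanted, two common options are to drop them or to treat a missing label as zero. The sketch below is an addition to the original notebook and rebuilds both Series so it is self-contained:

```python
import pandas as pd

fruit = pd.Series([223, 43, 53, 24, 43],
                  index=['apple', 'orange', 'banana', 'pears', 'lemon'])
fruit2 = pd.Series([11, 12, 13, 14, 15],
                   index=['orange', 'banana', 'pears', 'peach', 'apple'])

total = fruit + fruit2                     # NaN where a label is missing on one side
only_shared = total.dropna()               # keep only labels present in both Series
filled = fruit.add(fruit2, fill_value=0)   # treat a missing label as 0 before adding

print(only_shared)
print(filled)
```

Which option is appropriate depends on whether a missing label really means "zero" in the data at hand.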

### DataFrame


In [19]:
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
               'key2':['one', 'two', 'one', 'two', 'one'],
               'data1': np.random.randn(5),
               'data2': np.random.randn(5)})
df

      data1     data2 key1 key2
0 -1.839735  0.248020    a  one
1 -0.224832 -0.625287    a  two
2 -1.533834 -0.217950    b  one
3  0.772731 -0.037452    b  two
4  0.927768  0.289960    a  one
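
- A DataFrame can be thought of as a set of Series (the columns) that share one index. The sketch below is an addition to the original notebook; it uses a seeded random generator, which is an assumption made here so that reruns give the same values:

```python
import numpy as np
import pandas as pd

# Seeded generator: an assumption for reproducibility, not in the original cell.
rng = np.random.default_rng(0)
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': rng.standard_normal(5),
                   'data2': rng.standard_normal(5)})

col = df['data1']                 # a single column is a Series
subset = df[['key1', 'data1']]    # a list of column names returns a DataFrame
row = df.loc[0]                   # one row, as a Series indexed by column name

print(df.dtypes)   # each column has its own dtype
print(df.shape)    # (rows, columns)
```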


### Import and Store Data

- Read and write a *CSV* file.

In [23]:
df.to_csv('test_csv_file.csv',index=False)

In [24]:
df_csv = pd.read_csv('test_csv_file.csv')
df_csv

      data1     data2 key1 key2
0 -1.839735  0.248020    a  one
1 -0.224832 -0.625287    a  two
2 -1.533834 -0.217950    b  one
3  0.772731 -0.037452    b  two
4  0.927768  0.289960    a  one
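
- To see what `index=False` changes, the following sketch (an addition to the original notebook, using a small made-up frame and an in-memory buffer instead of a file) writes the same data with and without the index and reads both back:

```python
import io
import pandas as pd

# Small illustrative frame; the values are made up for this example.
df = pd.DataFrame({'key1': ['a', 'b'], 'data1': [1.5, -2.0]})

with_index = df.to_csv()                 # no path given, so the CSV text is returned
without_index = df.to_csv(index=False)   # the row labels are left out

roundtrip = pd.read_csv(io.StringIO(without_index))
extra_col = pd.read_csv(io.StringIO(with_index))

print(list(roundtrip.columns))
print(list(extra_col.columns))   # the saved index comes back as an unnamed column
```

Writing the index and then reading the file with default settings is the usual source of an `Unnamed: 0` column.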


- Read and write an *Excel* file.

In [26]:
# The context manager saves and closes the workbook on exit.
with pd.ExcelWriter('test_excel_file.xlsx') as writer:
    df.to_excel(writer, sheet_name='sheet1', index=False)

In [27]:
df_excel = pd.read_excel('test_excel_file.xlsx', sheet_name='sheet1')
df_excel

      data1     data2 key1 key2
0 -1.839735  0.248020    a  one
1 -0.224832 -0.625287    a  two
2 -1.533834 -0.217950    b  one
3  0.772731 -0.037452    b  two
4  0.927768  0.289960    a  one

### Group By

- `groupby` splits the data by one or more keys so that an aggregation such as `mean` can be applied to each group. Because `df` is built from `np.random.randn`, its values change whenever the cell is re-run; the outputs in this section come from a different run than the tables shown above.


In [34]:
grouped = df['data1'].groupby(df['key1'])

grouped

<pandas.core.groupby.SeriesGroupBy object at 0x000000000B533E80>

In [35]:
grouped.mean()

key1
a   -0.640287
b    0.975406
Name: data1, dtype: float64

In [37]:
grouped = df[['data1', 'data2']].groupby(df['key1'])
grouped.mean()

         data1     data2
key1                    
a    -0.640287 -0.054757
b     0.975406 -0.583977
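
- The group means above follow the split-apply-combine pattern. The sketch below is an addition to the original notebook: it uses the rows printed in the per-group listing at the end of this section, computes the mean by hand, and compares it with `groupby`:

```python
import pandas as pd

# Values copied from the per-group listing at the end of this section.
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [-0.018367, -1.217026, 0.098915, 1.851896, -0.685467],
                   'data2': [0.109608, -1.153398, 0.117554, -1.285509, 0.879519]})

# Split by key, apply the mean to each piece, combine into one Series.
manual = pd.Series({key: df.loc[df['key1'] == key, 'data1'].mean()
                    for key in df['key1'].unique()})

by_key = df.groupby('key1')['data1'].mean()   # the same computation via groupby

# Several aggregations at once, for both data columns.
summary = df.groupby('key1')[['data1', 'data2']].agg(['mean', 'count'])

print(manual)
print(by_key)
print(summary)
```

Passing the column name (`'key1'`) is equivalent here to passing the column itself (`df['key1']`), as the cells above do.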


In [38]:
grouped = df.groupby([df['key1'], df['key2']])

group_mean = grouped.mean()

In [46]:
group_mean.index

MultiIndex([('a', 'one'),
            ('a', 'two'),
            ('b', 'one'),
            ('b', 'two')],
           names=['key1', 'key2'])

In [51]:
print(group_mean)

group_mean.loc[('a', 'one'), 'data1']

              data1     data2
key1 key2                    
a    one  -0.351917  0.494564
     two  -1.217026 -1.153398
b    one   0.098915  0.117554
     two   1.851896 -1.285509


-0.35191679594841752

In [42]:
group_mean.unstack()

         data1               data2          
key2       one       two       one       two
key1                                        
a    -0.351917 -1.217026  0.494564 -1.153398
b     0.098915  1.851896  0.117554 -1.285509
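
- `unstack` moves one level of the row index into the columns. By default it moves the innermost level (`key2` here), and the `level` argument selects another. The sketch below is an addition to the original notebook and reuses the rows printed in the per-group listing at the end of this section:

```python
import pandas as pd

# Values copied from the per-group listing at the end of this section.
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [-0.018367, -1.217026, 0.098915, 1.851896, -0.685467],
                   'data2': [0.109608, -1.153398, 0.117554, -1.285509, 0.879519]})

group_mean = df.groupby(['key1', 'key2'])[['data1', 'data2']].mean()

wide = group_mean.unstack()                     # key2 becomes a column level
wide_key1 = group_mean.unstack(level='key1')    # key1 becomes a column level instead

# A cell in the wide table is addressed by row label and a (column, key2) tuple.
print(wide.loc['a', ('data1', 'one')])
print(wide_key1)
```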


In [45]:
for name, group in grouped:
    print(name)
    print(group)

('a', 'one')
      data1     data2 key1 key2
0 -0.018367  0.109608    a  one
4 -0.685467  0.879519    a  one

('a', 'two')
      data1     data2 key1 key2
1 -1.217026 -1.153398    a  two

('b', 'one')
      data1     data2 key1 key2
2  0.098915  0.117554    b  one

('b', 'two')
      data1     data2 key1 key2
3  1.851896 -1.285509    b  two
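
- Besides looping over the groups, a single group can be fetched by its key, and the group sizes can be inspected directly. This is a minimal sketch added to the original notebook, using the same rows as the listing above:

```python
import pandas as pd

# Values copied from the per-group listing above.
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [-0.018367, -1.217026, 0.098915, 1.851896, -0.685467],
                   'data2': [0.109608, -1.153398, 0.117554, -1.285509, 0.879519]})

grouped = df.groupby(['key1', 'key2'])

sizes = grouped.size()                      # number of rows in each group
a_one = grouped.get_group(('a', 'one'))     # one group, selected by its key tuple
pieces = {name: group for name, group in grouped}   # all groups as a dict

print(sizes)
print(a_one)
print(list(pieces))
```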

