# Pandas Basics

The **Pandas** library is built on `NumPy` and provides easy-to-use data structures and data analysis tools for the Python programming language.

In [1]:
import pandas as pd

## Pandas Data Structures

### Series

A one-dimensional labeled array capable of holding any data type.

In [2]:
s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
print(s)

a    3
b   -5
c    7
d    4
dtype: int64
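
To make "labeled" a little more concrete, here is a minimal sketch that rebuilds the same Series and reads its labels and values separately. It assumes Python 3 and a reasonably recent pandas; the names `labels`, `values`, `by_label` and `by_position` are only illustrative.

```python
import pandas as pd

s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])

labels = list(s.index)    # the index holds the labels
values = s.tolist()       # the data as plain Python ints
by_label = s['c']         # label-based lookup
by_position = s.iloc[2]   # position-based lookup of the same element
```

Both lookups reach the same element; the label is simply a second way to address it.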


### DataFrame

A two-dimensional labeled data structure with columns of potentially different types.

In [3]:
data = {'Country': ['Belgium', 'India', 'Brazil'],
        'Capital': ['Brussels', 'New Delhi', 'Brasilia'],
        'Population': [11190846, 1303171035, 207847528]}

df = pd.DataFrame(data)
df

     Capital  Country  Population
0   Brussels  Belgium    11190846
1  New Delhi    India  1303171035
2   Brasilia   Brazil   207847528
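
The output above lists the columns alphabetically, which appears to be how the environment used for this notebook ordered dict keys. Recent pandas versions are assumed to keep the dict's insertion order instead, so the sketch below passes `columns` explicitly to get the same layout either way. The names `column_names` and `population_dtype` are illustrative.

```python
import pandas as pd

data = {'Country': ['Belgium', 'India', 'Brazil'],
        'Capital': ['Brussels', 'New Delhi', 'Brasilia'],
        'Population': [11190846, 1303171035, 207847528]}

# Passing `columns` fixes the column order regardless of dict ordering.
df = pd.DataFrame(data, columns=['Capital', 'Country', 'Population'])

column_names = list(df.columns)
population_dtype = str(df['Population'].dtype)  # each column has its own dtype
```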


## Selection

### Getting

In [4]:
print(s['b']) # Get one element from Series

print(s[['b', 'c']]) # Get multiple elements from Series

-5
b   -5
c    7
dtype: int64


In [5]:
df[1:] # Get subset of a DataFrame

     Capital Country  Population
1  New Delhi   India  1303171035
2   Brasilia  Brazil   207847528


### Selecting, Boolean Indexing & Setting

#### By Position

In [6]:
# Select by row & column position; list indexers return a 1x1 DataFrame
df.iloc[[0],[0]]
# print(df.iloc[0, 0])  # scalar form of the same selection

    Capital
0  Brussels


#### By Label

In [7]:
# Select single value by row & column labels
df.loc[0, 'Country']

'Belgium'
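
As a small side-by-side sketch of the two access styles, the block below rebuilds the DataFrame with the column order shown in this notebook and pulls out values by position and by label. It assumes Python 3; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'Capital': ['Brussels', 'New Delhi', 'Brasilia'],
     'Country': ['Belgium', 'India', 'Brazil'],
     'Population': [11190846, 1303171035, 207847528]},
    columns=['Capital', 'Country', 'Population'])

by_position = df.iloc[0, 0]       # row 0, column 0 -> scalar
by_label = df.loc[0, 'Country']   # row label 0, column 'Country' -> scalar
as_frame = df.iloc[[0], [0]]      # list indexers keep the DataFrame shape
```

Passing a single `row, column` pair to one indexer avoids the chained form `df.loc[0]['Country']`, which first builds an intermediate Series.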

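#### By Label / Position

`.ix` accepted both labels and positions, but it was deprecated in pandas 0.20 and removed in 1.0. The cells below use `.loc` instead, which returns the same results here because the row labels are the default integers.

In [8]:
df.loc[2]

Capital        Brasilia
Country          Brazil
Population    207847528
Name: 2, dtype: object

In [9]:
print(df.loc[1:, 'Capital']) # Select a single column for a subset of rows

1    New Delhi
2     Brasilia
Name: Capital, dtype: object


In [10]:
df.loc[1, 'Capital'] # Select a single value by row and column label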

'New Delhi'

#### Boolean Indexing

In [11]:
print(s)

print(s[(s > 1)])
print(s[~(s > 1)])  # Note the usage of `~`: it negates the mask (element-wise 'not')

a    3
b   -5
c    7
d    4
dtype: int64
a    3
c    7
d    4
dtype: int64
b   -5
dtype: int64


In [12]:
print(s[(s<-1) | (s>4)])  # Select where value is <-1 or >4

b   -5
c    7
dtype: int64


In [13]:
df[df['Population'] > 1200000000]  # Use filter to subset DataFrame

     Capital Country  Population
1  New Delhi   India  1303171035
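
Conditions on a DataFrame can be combined the same way as on a Series. The sketch below is one possible illustration: each condition is wrapped in parentheses and joined with `&` (and); the threshold and the names `mask` and `result` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {'Capital': ['Brussels', 'New Delhi', 'Brasilia'],
     'Country': ['Belgium', 'India', 'Brazil'],
     'Population': [11190846, 1303171035, 207847528]})

# Parentheses are needed because `&` binds tighter than the comparisons.
mask = (df['Population'] > 100000000) & (df['Country'] != 'India')
result = df[mask]
```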


#### Setting

In [14]:
print(s['a'])
s['a'] = 6  # Set index `a` of Series `s` to 6
print(s['a'])

3
6


## Dropping

In [15]:
# Note: `drop` returns a new Series and does NOT modify the existing one.
print(s)
print(s.drop(['a', 'c']))  # Drop values from rows (axis = 0)
print(s)

a    6
b   -5
c    7
d    4
dtype: int64
b   -5
d    4
dtype: int64
a    6
b   -5
c    7
d    4
dtype: int64


In [16]:
print(df)
print(df.drop('Country', axis = 1)) # Drop values from columns (axis = 1)
print(df)

     Capital  Country  Population
0   Brussels  Belgium    11190846
1  New Delhi    India  1303171035
2   Brasilia   Brazil   207847528
     Capital  Population
0   Brussels    11190846
1  New Delhi  1303171035
2   Brasilia   207847528
     Capital  Country  Population
0   Brussels  Belgium    11190846
1  New Delhi    India  1303171035
2   Brasilia   Brazil   207847528
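
Since `drop` leaves the original untouched, the usual way to keep the result is to assign it to a name. A minimal sketch, assuming Python 3; `trimmed` and `without_country` are illustrative names, and `columns=` is used as an alternative spelling of `axis=1`.

```python
import pandas as pd

s = pd.Series([6, -5, 7, 4], index=['a', 'b', 'c', 'd'])
df = pd.DataFrame(
    {'Capital': ['Brussels', 'New Delhi', 'Brasilia'],
     'Country': ['Belgium', 'India', 'Brazil'],
     'Population': [11190846, 1303171035, 207847528]})

trimmed = s.drop(['a', 'c'])                  # new Series without 'a' and 'c'
without_country = df.drop(columns='Country')  # same effect as drop('Country', axis=1)
```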


## Sort & Rank

**!!! Sort operations return a new sorted Series or DataFrame and will NOT change the existing object**

In [17]:
df.sort_values(by='Country')  # Sort rows by the values in a column

     Capital  Country  Population
0   Brussels  Belgium    11190846
2   Brasilia   Brazil   207847528
1  New Delhi    India  1303171035


In [18]:
print(s.sort_values()) # Sort a Series by its values
print(s)  # Sort operation will return a new sorted Series and will NOT change the existing series

b   -5
d    4
a    6
c    7
dtype: int64
a    6
b   -5
c    7
d    4
dtype: int64
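
Two related variations, sketched under the assumption of Python 3: `ascending=False` reverses the order, and `sort_index()` sorts by label rather than by value. The variable names are illustrative.

```python
import pandas as pd

s = pd.Series([6, -5, 7, 4], index=['a', 'b', 'c', 'd'])

descending = s.sort_values(ascending=False)  # largest value first
by_label = descending.sort_index()           # back to label order a, b, c, d
```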


In [19]:
df.rank()  # assign ranks to entries

   Capital  Country  Population
0      2.0      1.0         1.0
1      3.0      3.0         3.0
2      1.0      2.0         2.0


## Retrieving Series / DataFrame Information

### Basic Information

In [20]:
df.shape  # (rows, columns)

(3, 3)

In [21]:
df.index # Describe index

RangeIndex(start=0, stop=3, step=1)

In [22]:
df.columns # Describe DataFrame columns

Index([u'Capital', u'Country', u'Population'], dtype='object')

In [23]:
df.info()  # Info on DataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
Capital       3 non-null object
Country       3 non-null object
Population    3 non-null int64
dtypes: int64(1), object(2)
memory usage: 144.0+ bytes


In [24]:
df.count()  # Number of non-NA values

Capital       3
Country       3
Population    3
dtype: int64

### Summary

In [25]:
df.sum() # Sum of Values

Capital       BrusselsNew DelhiBrasilia
Country              BelgiumIndiaBrazil
Population                   1522209409
dtype: object

In [26]:
df.cumsum()  # Cumulative sum of values

                     Capital             Country  Population
0                   Brussels             Belgium    11190846
1          BrusselsNew Delhi        BelgiumIndia  1314361881
2  BrusselsNew DelhiBrasilia  BelgiumIndiaBrazil  1522209409


In [27]:
print(df.min())
print(df.max())

Capital       Brasilia
Country        Belgium
Population    11190846
dtype: object
Capital        New Delhi
Country            India
Population    1303171035
dtype: object


In [28]:
df.describe() # Summary statistics

         Population
count           3.0
mean    507403100.0
std     696134600.0
min      11190850.0
25%     109519200.0
50%     207847500.0
75%     755509300.0
max    1303171000.0


In [29]:
df.mean(numeric_only=True)  # mean of values (numeric columns only)

Population    5.074031e+08
dtype: float64

In [30]:
df.median(numeric_only=True) # median of values (numeric columns only)

Population    207847528.0
dtype: float64

## Applying Functions

In [31]:
f = lambda x: x*2

print(df.apply(f))      # Apply function to each column
print(df.applymap(f))   # Apply function element-wise (renamed `DataFrame.map` in pandas 2.1)

              Capital         Country  Population
0    BrusselsBrussels  BelgiumBelgium    22381692
1  New DelhiNew Delhi      IndiaIndia  2606342070
2    BrasiliaBrasilia    BrazilBrazil   415695056
              Capital         Country  Population
0    BrusselsBrussels  BelgiumBelgium    22381692
1  New DelhiNew Delhi      IndiaIndia  2606342070
2    BrasiliaBrasilia    BrazilBrazil   415695056


## Data Alignment

### Internal Data Alignment

NA values are introduced at the index labels that don't overlap.

In [32]:
s3 = pd.Series([7, -2, 3], index=['a', 'c', 'd'])
s + s3

a    13.0
b     NaN
c     5.0
d     7.0
dtype: float64
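
One way to see what alignment did is to ask which labels ended up as NaN. The sketch below assumes Python 3 and rebuilds `s` and `s3` with the values used in this notebook; `total` and `missing` are illustrative names.

```python
import pandas as pd

s = pd.Series([6, -5, 7, 4], index=['a', 'b', 'c', 'd'])
s3 = pd.Series([7, -2, 3], index=['a', 'c', 'd'])

total = s + s3                                # aligned on the union of labels
missing = total[total.isna()].index.tolist()  # labels present in only one Series
result_dtype = str(total.dtype)               # upcast to float so NaN can be stored
```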

### Arithmetic Operations with Fill Methods

You can also do the internal data alignment yourself with the help of the fill methods.

In [33]:
print(s)
print(s3)

print(s.add(s3))
print(s.add(s3, fill_value=0))

a    6
b   -5
c    7
d    4
dtype: int64
a    7
c   -2
d    3
dtype: int64
a    13.0
b     NaN
c     5.0
d     7.0
dtype: float64
a    13.0
b    -5.0
c     5.0
d     7.0
dtype: float64


In [34]:
s.sub(s3, fill_value=0)

a   -1.0
b   -5.0
c    9.0
d    1.0
dtype: float64

In [35]:
s.div(s3, fill_value=1)  # note here I used a different value for `fill_value`

a    0.857143
b   -5.000000
c   -3.500000
d    1.333333
dtype: float64

In [36]:
s.mul(s3, fill_value=100)  # note the difference brought by using a different `fill_value`

a     42.0
b   -500.0
c    -14.0
d     12.0
dtype: float64
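
A rough way to think about `fill_value` is that the missing label is filled in before the operation runs. The sketch below, assuming Python 3, compares `add(..., fill_value=0)` with doing the same alignment by hand through `reindex`; the names `filled` and `manual` are illustrative.

```python
import pandas as pd

s = pd.Series([6, -5, 7, 4], index=['a', 'b', 'c', 'd'])
s3 = pd.Series([7, -2, 3], index=['a', 'c', 'd'])

filled = s.add(s3, fill_value=0)                # 'b' is treated as 0 in s3
manual = s + s3.reindex(s.index, fill_value=0)  # same alignment done explicitly
```

The values match; only the dtype may differ, since the manual version never introduces NaN.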

## I/O

### Read & Write to CSV

```python
pd.read_csv("file.csv", header=None, nrows=5)
df.to_csv('myDataFrame.csv')  # `to_csv` is a DataFrame method, not a `pd` function
```
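
To try a CSV round trip without touching the file system, an in-memory buffer can stand in for the file. This is only a sketch, assuming Python 3; the small DataFrame and the names `buffer` and `restored` are made up for the example.

```python
import io

import pandas as pd

df = pd.DataFrame({'Country': ['Belgium', 'India'],
                   'Population': [11190846, 1303171035]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False leaves the row labels out of the file
buffer.seek(0)                  # rewind before reading
restored = pd.read_csv(buffer)
```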

### Read & Write to Excel

```python
pd.read_excel('file.xlsx')
df.to_excel('dir/data.xlsx', sheet_name='Sheet1')  # `to_excel` is a DataFrame method
```

Read multiple sheets from the same file

```python
xlsx = pd.ExcelFile('file.xls')
df = pd.read_excel(xlsx, 'Sheet1')
```

### Read & Write to SQL Query or Database Table

```python
from sqlalchemy import create_engine
engine = create_engine('sqlite:///:memory:')
pd.read_sql("SELECT * FROM my_table;", engine)
pd.read_sql_table('my_table', engine)
pd.read_sql_query("SELECT * FROM my_table;", engine)
```

`read_sql()` is a convenience wrapper around `read_sql_table()` and `read_sql_query()`.

```python
df.to_sql('myDF', engine)  # `to_sql` is a DataFrame method
```
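
For a quick experiment that does not need SQLAlchemy, pandas can also work with a plain `sqlite3` connection from the standard library. The sketch below assumes Python 3 and an in-memory database; the table name `my_table` and the sample data are illustrative.

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({'Country': ['Belgium', 'India'],
                   'Population': [11190846, 1303171035]})

conn = sqlite3.connect(':memory:')        # throwaway in-memory database
df.to_sql('my_table', conn, index=False)  # write the DataFrame as a table
restored = pd.read_sql('SELECT * FROM my_table', conn)
conn.close()
```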