Following the pandas to SQL tutorial by Greg Reda
http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/

## Administrative setup

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', 50)
%matplotlib inline

## Exploring Series and DataFrame

Pandas introduces two new data structures, Series and DataFrame, both built on top of NumPy.

### Introducing Series

A Series is a one-dimensional object, similar to a list or an array, that can hold any data type. Each item gets a label in the index. By default the labels are the integers 0 to N-1; alternatively, you can supply your own labels through the index argument.

In [2]:
s = pd.Series(['abcd',3.14159,True,-12345,None,0.0])
s

0       abcd
1    3.14159
2       True
3     -12345
4       None
5          0
dtype: object

In [3]:
s_hat = pd.Series(['abcd',3.14159,True,-12345,None,0.0],index = ['A','B','C','D','E','F'])
s_hat

A       abcd
B    3.14159
C       True
D     -12345
E       None
F          0
dtype: object
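
To make the difference between the two indexes concrete, here is a minimal sketch. The values are made up for illustration and it assumes a reasonably recent pandas.

```python
import pandas as pd

values = [10, 20, 30]  # hypothetical sample values

default_index = pd.Series(values)                          # labels default to 0, 1, 2
custom_index = pd.Series(values, index=['a', 'b', 'c'])    # labels supplied explicitly

print(list(default_index.index))   # [0, 1, 2]
print(list(custom_index.index))    # ['a', 'b', 'c']
print(custom_index['b'])           # 20
```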

The Series constructor can also convert a dict, using the dict's keys as the index.

In [11]:
d = {'Sydney':4000,'Melbourne':2500,'Adelaide':3200,'Perth':5800,'Hobart':1700,'Darwin':48,'Brisbane':9000}
d

{'Adelaide': 3200,
 'Brisbane': 9000,
 'Darwin': 48,
 'Hobart': 1700,
 'Melbourne': 2500,
 'Perth': 5800,
 'Sydney': 4000}

In [37]:
cities = pd.Series(d)
cities

Adelaide     3200
Brisbane     9000
Darwin         48
Hobart       1700
Melbourne    2500
Perth        5800
Sydney       4000
dtype: int64

In [14]:
print "Dict is of type ", type(d)
print "Series is of type ", type(cities)

Dict is of type  <type 'dict'>
Series is of type  <class 'pandas.core.series.Series'>


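Notice how both cities and d are arranged alphabetically. That reflects the Python 2 and pandas versions used here; newer versions keep the dict's insertion order.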

In [10]:
# sort_values returns a new Series, so assign the result back
cities = cities.sort_values()
cities

Darwin         48
Hobart       1700
Melbourne    2500
Adelaide     3200
Sydney       4000
Perth        5800
Brisbane     9000
dtype: int64
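
One thing worth spelling out: sort_values returns a new Series and leaves the original alone, and sort_index does the same for labels. A small sketch with a three-city subset (assuming a recent pandas on Python 3, where a Series built from a dict keeps insertion order):

```python
import pandas as pd

cities = pd.Series({'Sydney': 4000, 'Darwin': 48, 'Adelaide': 3200})

by_value = cities.sort_values()   # new Series ordered by value; cities is untouched
by_label = cities.sort_index()    # new Series ordered by index label

print(list(by_value.index))   # ['Darwin', 'Adelaide', 'Sydney']
print(list(by_label.index))   # ['Adelaide', 'Darwin', 'Sydney']
print(list(cities.index))     # ['Sydney', 'Darwin', 'Adelaide']
```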

In [16]:
d['Adelaide']

3200

In [18]:
cities['Adelaide']

3200

An advantage of a Series over a dict is that you can select items in more than one way: by label, as above, or by integer position.

In [17]:
cities[3]

3200

In [19]:
d[3]

KeyError: 3
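
Plain brackets guess whether you mean a label or a position. If you want to be explicit, loc (labels) and iloc (positions) remove the ambiguity. A short sketch on a four-city subset of the data:

```python
import pandas as pd

cities = pd.Series({'Darwin': 48, 'Hobart': 1700, 'Melbourne': 2500, 'Adelaide': 3200})

by_label = cities.loc['Adelaide']   # label-based lookup
by_position = cities.iloc[3]        # position-based lookup (the fourth item)
first_two = cities.iloc[:2]         # positional slice, returns a Series

print(by_label)                 # 3200
print(by_position)              # 3200
print(list(first_two.index))    # ['Darwin', 'Hobart']
```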

You can also use boolean subsetting: the Series returns only the items for which the condition is True. A dict has no equivalent.

In [21]:
cities[cities < 2000]

Darwin      48
Hobart    1700
dtype: int64

In [22]:
d[d < 2000]

KeyError: False

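The boolean subsetting on the Series may look a bit weird. How did it work? The comparison cities < 2000 returns a Series of True/False values, and passing that Series back into cities keeps only the items marked True. This may help.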

In [25]:
less_than_2000 = cities < 2000
print less_than_2000
print '\n'
print cities[cities < 2000]

Darwin        True
Hobart        True
Melbourne    False
Adelaide     False
Sydney       False
Perth        False
Brisbane     False
dtype: bool


Darwin      48
Hobart    1700
dtype: int64
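
Boolean masks can also be combined. The sketch below assumes the usual pandas idiom of & for "and" and | for "or", with each comparison wrapped in parentheses; the four-city subset is just for illustration.

```python
import pandas as pd

cities = pd.Series({'Darwin': 48, 'Hobart': 1700, 'Melbourne': 2500, 'Perth': 5800})

# Element-wise AND; the parentheses are needed because & binds tighter than < and >
mask = (cities > 1000) & (cities < 3000)
mid_sized = cities[mask]

# Element-wise OR
small_or_large = cities[(cities < 1000) | (cities > 5000)]

print(list(mid_sized.index))        # ['Hobart', 'Melbourne']
print(list(small_or_large.index))   # ['Darwin', 'Perth']
```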


You can also assign new values to the selected elements of a Series on the fly.

In [38]:
# changing values of elements below 2000
print cities[cities < 2000]
print '\n'
cities[cities < 2000] = cities.mean()/2
print cities[cities < 2000]

Darwin      48
Hobart    1700
dtype: int64


Darwin    1874.857143
Hobart    1874.857143
dtype: float64
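
If you would rather not modify the Series in place, the mask method does the same replacement but returns a new Series. This is an alternative to the assignment above, not what the original cell does; the three-city subset is for illustration only.

```python
import pandas as pd

cities = pd.Series({'Darwin': 48, 'Hobart': 1700, 'Perth': 5800})

# mask() replaces values where the condition is True and returns a new Series
half_mean = cities.mean() / 2                  # (48 + 1700 + 5800) / 3 / 2 = 1258.0
adjusted = cities.mask(cities < 2000, half_mean)

print(adjusted['Darwin'])   # 1258.0
print(cities['Darwin'])     # 48, the original is unchanged
```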


You can select several elements at once by passing a list of labels, hence the double brackets. Passing the labels without the inner brackets raises a KeyError.

In [63]:
cities[['Sydney','Hobart','Melbourne']]

Sydney          4000
Hobart       1874.86
Melbourne       2500
dtype: object

In [64]:
cities['Sydney','Hobart','Melbourne']

KeyError: ('Sydney', 'Hobart', 'Melbourne')

Use the idiomatic Python in operator to check whether a label is in the Series. Note that it tests the index labels, not the values, and that it is case-sensitive.

In [39]:
print 'Auckland' in cities
print 'Sydney' in cities
print 'sydney' in cities

False
True
False
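
Because in looks at the index labels, checking for a value needs a different approach. A small sketch of two options, searching the underlying values array or using isin:

```python
import pandas as pd

cities = pd.Series({'Sydney': 4000, 'Hobart': 1700})

label_present = 'Sydney' in cities        # True: 'Sydney' is an index label
value_as_label = 4000 in cities           # False: 4000 is a value, not a label
value_present = 4000 in cities.values     # True: searches the underlying values
membership = cities.isin([1700, 9999])    # element-wise membership test on the values

print(list(membership))   # [False, True]
```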


Appending an element to the Series is very simple: assign a value to a new label.

In [57]:
cities['Auckland'] = None
cities

Adelaide        3200
Brisbane        9000
Darwin       1874.86
Hobart       1874.86
Melbourne       2500
Perth           5800
Sydney          4000
Auckland        None
dtype: object
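
To add several elements at once, one option is to build a second Series and join the two with pd.concat. The extra cities and their values below are made up for illustration.

```python
import pandas as pd

cities = pd.Series({'Sydney': 4000, 'Hobart': 1700})
extra = pd.Series({'Auckland': 1500, 'Wellington': 400})   # hypothetical values

# concat returns a new Series; neither input is modified
combined = pd.concat([cities, extra])

print(list(combined.index))   # ['Sydney', 'Hobart', 'Auckland', 'Wellington']
```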

You can do arithmetic on every element of a Series at once, using scalar operations or NumPy functions.

In [40]:
cities/3

Adelaide     1066.666667
Brisbane     3000.000000
Darwin        624.952381
Hobart        624.952381
Melbourne     833.333333
Perth        1933.333333
Sydney       1333.333333
dtype: float64

In [41]:
np.square(cities)

Adelaide     10240000.000000
Brisbane     81000000.000000
Darwin        3515089.306122
Hobart        3515089.306122
Melbourne     6250000.000000
Perth        33640000.000000
Sydney       16000000.000000
dtype: float64
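
Both kinds of operation work element by element and keep the index labels attached to the results. A quick sketch with made-up values that give round numbers:

```python
import numpy as np
import pandas as pd

cities = pd.Series({'Sydney': 400, 'Darwin': 49})   # hypothetical values

doubled = cities * 2        # scalar applied to every element
roots = np.sqrt(cities)     # NumPy functions work element-wise and return a Series

print(doubled['Sydney'])    # 800
print(roots['Darwin'])      # 7.0
```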

Check for nulls with the isnull and notnull methods.

In [60]:
print cities.isnull()
print '\n'
print cities.notnull()

Adelaide     False
Brisbane     False
Darwin       False
Hobart       False
Melbourne    False
Perth        False
Sydney       False
Auckland      True
dtype: bool


Adelaide      True
Brisbane      True
Darwin        True
Hobart        True
Melbourne     True
Perth         True
Sydney        True
Auckland     False
dtype: bool
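
Once you know where the nulls are, the usual next step is to drop or fill them. A minimal sketch using dropna and fillna on a two-item Series; filling with 0 is only an example default.

```python
import numpy as np
import pandas as pd

cities = pd.Series({'Sydney': 4000.0, 'Auckland': np.nan})

missing = cities.isnull()    # True where the value is missing
cleaned = cities.dropna()    # new Series without the missing entries
filled = cities.fillna(0)    # new Series with missing entries replaced by 0

print(list(missing))          # [False, True]
print(list(cleaned.index))    # ['Sydney']
print(filled['Auckland'])     # 0.0
```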


You can add two Series together. Values are aligned by index label, and labels that appear in only one of the two Series yield NaN.

In [65]:
cities[['Adelaide','Sydney','Melbourne','Brisbane']] + cities[['Hobart','Sydney','Melbourne']]

Adelaide      NaN
Brisbane      NaN
Hobart        NaN
Melbourne    5000
Sydney       8000
dtype: object
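
If NaN is not what you want for the non-matching labels, the add method takes a fill_value that stands in for the missing side. A sketch with two small Series:

```python
import pandas as pd

left = pd.Series({'Adelaide': 3200, 'Sydney': 4000})
right = pd.Series({'Hobart': 1700, 'Sydney': 4000})

aligned = left + right                     # NaN where a label is missing on either side
filled = left.add(right, fill_value=0)     # treat a missing label as 0 instead

print(aligned['Sydney'])     # 8000.0
print(filled['Adelaide'])    # 3200.0
print(filled['Hobart'])      # 1700.0
```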

## Exploring DataFrames

You can create a DataFrame by passing a dictionary of lists to the DataFrame constructor.

In [2]:
data = {'year':[2010,2011,2012,2013,2014,2015,2016],
        'team':['Crows','Lions','Magpies','Bombers','Hawks','Swans','Bulldogs'],
        'wins':[11,8,10,15,11,6,4],
        'losses':[3,4,1,5,6,2,0]}
print type(data)
print '\n'
data

<type 'dict'>




{'losses': [3, 4, 1, 5, 6, 2, 0],
 'team': ['Crows',
  'Lions',
  'Magpies',
  'Bombers',
  'Hawks',
  'Swans',
  'Bulldogs'],
 'wins': [11, 8, 10, 15, 11, 6, 4],
 'year': [2010, 2011, 2012, 2013, 2014, 2015, 2016]}

In [26]:
footie = pd.DataFrame(data)
footie

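|  | losses | team | wins | year |
|---|---|---|---|---|
| 0 | 3 | Crows | 11 | 2010 |
| 1 | 4 | Lions | 8 | 2011 |
| 2 | 1 | Magpies | 10 | 2012 |
| 3 | 5 | Bombers | 15 | 2013 |
| 4 | 6 | Hawks | 11 | 2014 |
| 5 | 2 | Swans | 6 | 2015 |
| 6 | 0 | Bulldogs | 4 | 2016 |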


The columns parameter tells the DataFrame constructor which columns to keep and in what order. With the pandas version used here, the default ordering is alphabetical, as seen above; newer versions keep the dict's insertion order.

In [5]:
footie = pd.DataFrame(data,columns = ['year','team','wins','losses'])
footie

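|  | year | team | wins | losses |
|---|---|---|---|---|
| 0 | 2010 | Crows | 11 | 3 |
| 1 | 2011 | Lions | 8 | 4 |
| 2 | 2012 | Magpies | 10 | 1 |
| 3 | 2013 | Bombers | 15 | 5 |
| 4 | 2014 | Hawks | 11 | 6 |
| 5 | 2015 | Swans | 6 | 2 |
| 6 | 2016 | Bulldogs | 4 | 0 |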


In [9]:
footie = pd.DataFrame(data,columns = ['year','team','wins'])
print footie
print '\n'
# note that losses didn't come through because of the columns parameter
footie['losses']

   year      team  wins
0  2010     Crows    11
1  2011     Lions     8
2  2012   Magpies    10
3  2013   Bombers    15
4  2014     Hawks    11
5  2015     Swans     6
6  2016  Bulldogs     4




KeyError: 'losses'
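
The columns parameter works in the other direction too: naming a column that is not in the dict does not fail, it creates a column of NaN. A sketch on a two-row version of the data, where 'draws' is a made-up column name:

```python
import pandas as pd

data = {'year': [2010, 2011],
        'team': ['Crows', 'Lions'],
        'wins': [11, 8],
        'losses': [3, 4]}

subset = pd.DataFrame(data, columns=['team', 'wins'])        # keep and order two columns
with_extra = pd.DataFrame(data, columns=['team', 'draws'])   # 'draws' is not in data, so it is all NaN

print(list(subset.columns))                 # ['team', 'wins']
print(with_extra['draws'].isnull().all())   # True
```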

### Read csv files

I downloaded a dataset containing the locations of Centrelink offices in Australia as of 2015.
Download location - https://data.gov.au/dataset/location-of-centrelink-offices

In [13]:
# "!" command is used to execute shell commands
!head -n 5 centrelink.csv

OFFICE TYPE,SITE NAME,ALTERNATIVE NAME,ADDRESS,SUBURB,STATE,POSTCODE,LATITUDE,LONGITUDE,Open,Close,Closed for lunch/Office Notes
Centrelink Customer Service Centre,Albury,,430 Wilson Street,Albury,NSW,2640,-36.07727,146.92370,08:30:00,16:30:00,No
Centrelink Customer Service Centre,Townsville Jobseekers,,307 Ross River Road,Aitkenvale,QLD,4814,-19.29713,146.76441,08:30:00,16:30:00,No
Centrelink Customer Service Centre,Albany,,15 Peels Place,Albany,WA,6330,-35.02591,117.88466,08:30:00,16:30:00,No
Centrelink Customer Service Centre,Alice Springs,,5 Railway Terrace,Alice Springs,NT,870,-23.69590,133.88036,08:30:00,16:30:00,No


In [14]:
empdata = pd.read_csv('centrelink.csv')
empdata.head()

|  | OFFICE TYPE | SITE NAME | ALTERNATIVE NAME | ADDRESS | SUBURB | STATE | POSTCODE | LATITUDE | LONGITUDE | Open | Close | Closed for lunch/Office Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Centrelink Customer Service Centre | Albury |  | 430 Wilson Street | Albury | NSW | 2640 | -36.07727 | 146.9237 | 08:30:00 | 16:30:00 | No |
| 1 | Centrelink Customer Service Centre | Townsville Jobseekers |  | 307 Ross River Road | Aitkenvale | QLD | 4814 | -19.29713 | 146.76441 | 08:30:00 | 16:30:00 | No |
| 2 | Centrelink Customer Service Centre | Albany |  | 15 Peels Place | Albany | WA | 6330 | -35.02591 | 117.88466 | 08:30:00 | 16:30:00 | No |
| 3 | Centrelink Customer Service Centre | Alice Springs |  | 5 Railway Terrace | Alice Springs | NT | 870 | -23.6959 | 133.88036 | 08:30:00 | 16:30:00 | No |
| 4 | Centrelink Customer Service Centre | Tangentyere |  | 4 Elder Street | Alice Springs | NT | 870 | -23.69924 | 133.87184 | 08:30:00 | 15:00:00 | 12:00:00 to 13:00:00 |


Pandas has several reader functions whose parameters let us skip lines of a file, parse dates, or specify how to handle NA values. You can see them all in the function's docstring:

In [22]:
# ?pd.read_csv
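
Here is a small sketch of three of those parameters in action. The file contents are made up and read from an in-memory string, so nothing needs to be downloaded; the column names and the 'unknown' marker are assumptions for the example.

```python
import io
import pandas as pd

# Hypothetical file contents: one junk line, then a header and two rows
raw = io.StringIO(
    "exported by some tool\n"
    "site,opened,staff\n"
    "Albury,2015-12-02,12\n"
    "Albany,2015-12-03,unknown\n"
)

offices = pd.read_csv(
    raw,
    skiprows=1,                # skip the junk first line
    parse_dates=['opened'],    # convert this column to datetimes
    na_values=['unknown'],     # treat 'unknown' as a missing value
)

print(offices.shape)                    # (2, 3)
print(offices['staff'].isnull().sum())  # 1
```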

Pandas also has matching writer functions for the different formats, such as to_csv:

In [23]:
# empdata.to_csv('path_to_file.csv')
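
A quick way to see what to_csv produces without touching the disk: called without a path, it returns the CSV as a string, which can be read straight back. A sketch on a two-row DataFrame:

```python
import io
import pandas as pd

footie = pd.DataFrame({'team': ['Crows', 'Lions'], 'wins': [11, 8]})

csv_text = footie.to_csv(index=False)              # no path given, so the CSV comes back as a string
round_trip = pd.read_csv(io.StringIO(csv_text))    # read it back from memory

print(csv_text.splitlines()[0])     # team,wins
print(round_trip.equals(footie))    # True
```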

### Read excel files

Pandas can help read and write Excel files. This means we can read from Excel, write code in Python, and write back into Excel. No need for VBA.

Reading Excel files required a library called "xlrd" at the time of writing. You can install it via pip (pip install xlrd). Recent pandas versions use openpyxl for .xlsx files instead.

<img src="https://imgs.xkcd.com/comics/python.png">

In [25]:
import xlrd

Let's write the footie data into an xlsx.

In [27]:
# since index on our footie data is meaningless, there's no need to write it.
footie.to_excel('footie.xlsx',index = False)

In [28]:
!ls -ltr *.xlsx

-rw-r--r--  1 Vivek  staff  5605  1 Jan 20:38 footie.xlsx


In [29]:
# delete the dataframe

del footie

In [30]:
# read from Excel

footie = pd.read_excel('footie.xlsx','Sheet1')
footie

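|  | losses | team | wins | year |
|---|---|---|---|---|
| 0 | 3 | Crows | 11 | 2010 |
| 1 | 4 | Lions | 8 | 2011 |
| 2 | 1 | Magpies | 10 | 2012 |
| 3 | 5 | Bombers | 15 | 2013 |
| 4 | 6 | Hawks | 11 | 2014 |
| 5 | 2 | Swans | 6 | 2015 |
| 6 | 0 | Bulldogs | 4 | 2016 |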


### Reading URL

You can also read directly from a URL. Let me try it with the CSV from the Centrelink dataset page linked above.

In [36]:
url = 'http://data.gov.au/dataset/70c2b2fe-2a32-450e-98dc-453fe4a02aae/resource/5a45d7b2-8579-425b-bb46-53a0e0bfa053/download/Centrelink-Office-Locations-as-at--2-December-2015.csv'

# let's see how much time it takes to download the data
# need to import the datetime library
from datetime import datetime

print datetime.now().time()
empdata_from_url = pd.read_table(url)
print datetime.now().time()

empdata_from_url.head()

20:52:20.629146
20:52:26.245874


|  | OFFICE TYPE,SITE NAME,ALTERNATIVE NAME,ADDRESS,SUBURB,STATE,POSTCODE,LATITUDE,LONGITUDE,Open,Close,Closed for lunch/Office Notes |
|---|---|
| 0 | Centrelink Customer Service Centre,Albury,,430... |
| 1 | Centrelink Customer Service Centre,Townsville ... |
| 2 | Centrelink Customer Service Centre,Albany,,15 ... |
| 3 | Centrelink Customer Service Centre,Alice Sprin... |
| 4 | Centrelink Customer Service Centre,Tangentyere... |


read_table assumes tab-separated data by default (sep='\t'), which is why every row above landed in a single column. This file is comma-separated, so pass sep=',' instead.

In [38]:
print datetime.now().time()
empdata_from_url = pd.read_table(url,sep=',')
print datetime.now().time()

empdata_from_url.head(3)

20:54:44.227679
20:54:50.622689


|  | OFFICE TYPE | SITE NAME | ALTERNATIVE NAME | ADDRESS | SUBURB | STATE | POSTCODE | LATITUDE | LONGITUDE | Open | Close | Closed for lunch/Office Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Centrelink Customer Service Centre | Albury |  | 430 Wilson Street | Albury | NSW | 2640 | -36.07727 | 146.9237 | 08:30:00 | 16:30:00 | No |
| 1 | Centrelink Customer Service Centre | Townsville Jobseekers |  | 307 Ross River Road | Aitkenvale | QLD | 4814 | -19.29713 | 146.76441 | 08:30:00 | 16:30:00 | No |
| 2 | Centrelink Customer Service Centre | Albany |  | 15 Peels Place | Albany | WA | 6330 | -35.02591 | 117.88466 | 08:30:00 | 16:30:00 | No |
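
The effect of the separator is easy to reproduce offline. The sketch below reads the same made-up table from in-memory strings with the right and the wrong sep, using read_csv (which defaults to a comma):

```python
import io
import pandas as pd

comma_text = "team,wins\nCrows,11\nLions,8\n"     # hypothetical sample data
tab_text = "team\twins\nCrows\t11\nLions\t8\n"

from_commas = pd.read_csv(io.StringIO(comma_text))           # read_csv defaults to sep=','
from_tabs = pd.read_csv(io.StringIO(tab_text), sep='\t')     # override for tab-separated data
wrong_sep = pd.read_csv(io.StringIO(comma_text), sep='\t')   # everything lands in one column

print(from_commas.equals(from_tabs))   # True
print(wrong_sep.shape)                 # (2, 1)
print(list(wrong_sep.columns))         # ['team,wins']
```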


### Databases

Pandas can also read from and write to databases, and it can read from the clipboard as well. But let me get to these at a later time.
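
As a small taste in the meantime, here is a sketch of a database round trip. It assumes an in-memory SQLite database from Python's standard library, and the table name 'footie' is just a name picked for the example.

```python
import sqlite3
import pandas as pd

footie = pd.DataFrame({'team': ['Crows', 'Lions'], 'wins': [11, 8]})

conn = sqlite3.connect(':memory:')             # throwaway in-memory database
footie.to_sql('footie', conn, index=False)     # write the DataFrame to a table
winners = pd.read_sql_query('SELECT team FROM footie WHERE wins > 10', conn)
conn.close()

print(winners['team'].tolist())   # ['Crows']
```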