# Introduction
**pandas** is an open-source Python library for data analysis.

## Data Structures
**pandas** provides two new data structures to Python:
1. Series
2. DataFrame

Both of them are built on top of NumPy.
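
As a rough illustration of that NumPy foundation, the sketch below pulls the underlying array out of a Series and a DataFrame with to_numpy(). The values and variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical values, chosen only for illustration
s = pd.Series([1.5, 2.5, 3.5])
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# to_numpy() exposes the stored data as a NumPy array
s_values = s.to_numpy()
df_values = df.to_numpy()

print(type(s_values), s_values.dtype)
print(df_values.shape)
```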


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

### Series
A Series is a one-dimensional object similar to an array, list, or column in a table. It will assign a labeled index to each item in the Series. By default, each item will receive an index label from 0 to N, where N is the length of the Series minus one.

In [4]:
# create a Series with an arbitrary list
s = pd.Series(['Jaipur', 5, -34645768, 'You are most welcome!', 3.14])
s

0                   Jaipur
1                        5
2                -34645768
3    You are most welcome!
4                     3.14
dtype: object

You can also specify an index to use when creating the Series.

In [6]:
s = pd.Series(['Jaipur', 5, -34645768, 'You are most welcome!', 3.14], index=['A', 'B', 'C', 'D', 'E'])
s

A                   Jaipur
B                        5
C                -34645768
D    You are most welcome!
E                     3.14
dtype: object

The Series constructor can convert a dictionary as well, using the keys of the dictionary as its index.

In [30]:
d = {'Chennai': 134434, 'New Delhi': 443443, 'Andaman and Nicobar': 665566, 'Daman and Diu': 76767,
     'Hyderabad': 87665, 'Udaipur': None}
cities = pd.Series(d)
cities

Andaman and Nicobar    665566
Chennai                134434
Daman and Diu           76767
Hyderabad               87665
New Delhi              443443
Udaipur                   NaN
dtype: float64

You can use the index to select specific items from the Series.

In [11]:
cities['Chennai']

134434.0

In [12]:
cities[['New Delhi', 'Daman and Diu', 'Udaipur']]

New Delhi        443443
Daman and Diu     76767
Udaipur             NaN
dtype: float64

Or you can use boolean indexing for selection.

In [19]:
cities[cities < 10034440]

Andaman and Nicobar    665566
Chennai                134434
Daman and Diu           76767
Hyderabad               87665
New Delhi              443443
dtype: float64

cities < 10034440 returns a Series of True/False values, which we then pass to our Series cities, returning only the items where the value is True.

In [22]:
less_than_10034440 = cities < 10034440
print(less_than_10034440)
print('\n')
print(cities[less_than_10034440])

Andaman and Nicobar     True
Chennai                 True
Daman and Diu           True
Hyderabad               True
New Delhi               True
Udaipur                False
dtype: bool


Andaman and Nicobar    665566
Chennai                134434
Daman and Diu           76767
Hyderabad               87665
New Delhi              443443
dtype: float64


Values in a Series can be changed on the fly.

In [23]:
# changing values based on the index
print('Old value:', cities['Chennai'])
cities['Chennai'] = 343565656
print('New value:', cities['Chennai'])

Old value: 134434.0
New value: 343565656.0


In [28]:
# changing values using boolean logic
print(cities[cities < 10034440])
print('\n')
cities[cities < 10034440] = 6970909

print(cities[cities < 10034440])

Andaman and Nicobar    665566
Chennai                134434
Daman and Diu           76767
Hyderabad               87665
New Delhi              443443
dtype: float64


Andaman and Nicobar    6970909
Chennai                6970909
Daman and Diu          6970909
Hyderabad              6970909
New Delhi              6970909
dtype: float64


To check whether an item is in the Series, you can use idiomatic Python. Note that the in operator tests the index labels, not the values.

In [29]:
print('Jaipur' in cities)
print('Udaipur' in cities)

False
True
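
To test for a value rather than a label, one option is to check the underlying values instead. A minimal sketch on a small made-up Series (the name populations and its contents are assumptions for illustration):

```python
import pandas as pd

# Hypothetical Series, for illustration only
populations = pd.Series({'Chennai': 134434, 'Hyderabad': 87665})

# `in` on a Series looks at the index labels
label_found = 'Chennai' in populations

# 134434 is a value, not a label, so this membership test fails
value_as_label = 134434 in populations

# Two ways to test the stored values instead
value_found = 134434 in populations.to_numpy()
value_found_isin = populations.isin([134434]).any()

print(label_found, value_as_label, value_found, value_found_isin)
```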


Mathematical operations can be done using scalars and functions.

In [31]:
# divide city values by 3
cities / 3

Andaman and Nicobar    221855.333333
Chennai                 44811.333333
Daman and Diu           25589.000000
Hyderabad               29221.666667
New Delhi              147814.333333
Udaipur                          NaN
dtype: float64

In [32]:
# square city values
np.square(cities)

Andaman and Nicobar    4.429781e+11
Chennai                1.807250e+10
Daman and Diu          5.893172e+09
Hyderabad              7.685152e+09
New Delhi              1.966417e+11
Udaipur                         NaN
dtype: float64

When you add two Series together, the result is indexed by the union of the two indexes, with the addition occurring only on the shared index values. A label that appears in only one of the Series produces NULL/NaN (not a number).

In [33]:
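# reindex() is used for the missing label 'Kolkata'; recent pandas raises a KeyError for cities[['Kolkata', ...]]
print(cities[['Chennai', 'New Delhi', 'Udaipur']])
print('\n')
print(cities.reindex(['Kolkata', 'New Delhi']))
print('\n')
print(cities[['Chennai', 'New Delhi', 'Udaipur']] + cities.reindex(['Kolkata', 'New Delhi']))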

Chennai      134434
New Delhi    443443
Udaipur         NaN
dtype: float64


Kolkata         NaN
New Delhi    443443
dtype: float64


Chennai         NaN
Kolkata         NaN
New Delhi    886886
Udaipur         NaN
dtype: float64
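
The same alignment behaviour can be seen on a smaller, made-up pair of Series. The sketch below also uses Series.add with fill_value, which is one option (not used in this notebook) for treating a label missing from one side as 0 instead of producing NaN.

```python
import pandas as pd

# Hypothetical Series with partially overlapping labels
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20], index=['y', 'w'])

# Labels are aligned before adding; only 'y' exists in both Series
total = a + b

# With fill_value=0, a label missing from one side counts as 0
total_filled = a.add(b, fill_value=0)

print(total)
print(total_filled)
```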


NULL checking can be performed with isnull and notnull.

In [34]:
# returns a boolean series indicating which values aren't NULL
cities.notnull()

Andaman and Nicobar     True
Chennai                 True
Daman and Diu           True
Hyderabad               True
New Delhi               True
Udaipur                False
dtype: bool

In [35]:
# use boolean logic to grab the NULL cities
print(cities.isnull())
print('\n')
print(cities[cities.isnull()])

Andaman and Nicobar    False
Chennai                False
Daman and Diu          False
Hyderabad              False
New Delhi              False
Udaipur                 True
dtype: bool


Udaipur   NaN
dtype: float64


### DataFrame

A DataFrame is a tabular data structure comprised of rows and columns, similar to a spreadsheet, database table, or R's data.frame object. You can also think of a DataFrame as a group of Series objects that share an index (the column names).
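
One way to read that description is sketched below: two made-up Series with the same labels are combined into a DataFrame, and selecting a column hands back a Series. The names wins, losses, and record are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data, for illustration only
wins = pd.Series([11, 8], index=['2010', '2011'])
losses = pd.Series([5, 8], index=['2010', '2011'])

# Each Series becomes a column; the shared labels become the row index
record = pd.DataFrame({'wins': wins, 'losses': losses})

# Selecting a single column gives back a Series with the same index
wins_column = record['wins']

print(type(wins_column).__name__)
print(record.index.equals(wins_column.index))
```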

#### Reading Data
We can pass a dictionary of lists to the DataFrame constructor.

Using the columns parameter allows us to tell the constructor how we'd like the columns ordered. Older versions of pandas ordered the columns alphabetically by default; recent versions keep the insertion order of the dictionary.

In [36]:
data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions', 'Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}
football = pd.DataFrame(data, columns=['year', 'team', 'wins', 'losses'])
football

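| | year | team | wins | losses |
|---|---|---|---|---|
| 0 | 2010 | Bears | 11 | 5 |
| 1 | 2011 | Bears | 8 | 8 |
| 2 | 2012 | Bears | 10 | 6 |
| 3 | 2011 | Packers | 15 | 1 |
| 4 | 2012 | Packers | 11 | 5 |
| 5 | 2010 | Lions | 6 | 10 |
| 6 | 2011 | Lions | 10 | 6 |
| 7 | 2012 | Lions | 4 | 12 |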


**Reading from CSV**

Reading a CSV is as simple as calling the read_csv function. By default, the read_csv function expects the column separator to be a comma, but you can change that using the sep parameter.

In [41]:
sales_from_csv = pd.read_csv('../data/kc_housing_sales_data.csv')
sales_from_csv.head()

| | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 20141013T000000 | 221900 | 3 | 1.0 | 1180 | 5650 | 1 | 0 | 0 | 3 | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 20141209T000000 | 538000 | 3 | 2.25 | 2570 | 7242 | 2 | 0 | 0 | 3 | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 20150225T000000 | 180000 | 2 | 1.0 | 770 | 10000 | 1 | 0 | 0 | 3 | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 20141209T000000 | 604000 | 4 | 3.0 | 1960 | 5000 | 1 | 0 | 0 | 5 | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 20150218T000000 | 510000 | 3 | 2.0 | 1680 | 8080 | 1 | 0 | 0 | 3 | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |
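
To illustrate the sep parameter mentioned above, here is a minimal sketch that reads semicolon-separated text. The data is made up and held in an in-memory string (via io.StringIO) rather than a real file.

```python
import io
import pandas as pd

# Made-up semicolon-separated text standing in for a file on disk
raw = "id;price;bedrooms\n1;221900;3\n2;538000;3\n"

# sep tells read_csv which character separates the columns
semi = pd.read_csv(io.StringIO(raw), sep=';')

print(semi)
```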


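Our file had headers, which the function inferred upon reading in the file. Had the file lacked a header row, we could have passed header=None to the function, optionally along with a list of column names to use. Passing header=None for our file shows the effect: the header line is read as the first row of data, and the columns are numbered instead: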

In [42]:
sales_from_csv = pd.read_csv('../data/kc_housing_sales_data.csv', header=None)
sales_from_csv.head()

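| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
| 1 | 7129300520 | 20141013T000000 | 221900 | 3 | 1 | 1180 | 5650 | 1 | 0 | 0 | 3 | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 2 | 6414100192 | 20141209T000000 | 538000 | 3 | 2.25 | 2570 | 7242 | 2 | 0 | 0 | 3 | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 3 | 5631500400 | 20150225T000000 | 180000 | 2 | 1 | 770 | 10000 | 1 | 0 | 0 | 3 | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 4 | 2487200875 | 20141209T000000 | 604000 | 4 | 3 | 1960 | 5000 | 1 | 0 | 0 | 5 | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |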
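
When a file really has no header row, header=None is usually paired with the names parameter. A minimal sketch on made-up CSV text (the data and the column names are assumptions for illustration):

```python
import io
import pandas as pd

# Made-up CSV text with no header row
raw = "1,221900,3\n2,538000,3\n"

# header=None: the first line is data; names supplies the column labels
named = pd.read_csv(io.StringIO(raw), header=None,
                    names=['id', 'price', 'bedrooms'])

print(named)
```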


The various **pandas** reader functions have many parameters allowing you to do things like skipping lines of the file, parsing dates, or specifying how to handle NA/NULL datapoints.
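
As a sketch of those three options, the example below combines the skiprows, parse_dates, and na_values parameters of read_csv. The input text, including the 'missing' marker and the extra first line, is invented for illustration.

```python
import io
import pandas as pd

# Made-up file contents: one junk line, a header, then two rows
raw = (
    "exported by a hypothetical tool\n"
    "date,price\n"
    "2014-10-13,221900\n"
    "2014-12-09,missing\n"
)

sales = pd.read_csv(
    io.StringIO(raw),
    skiprows=1,             # skip the first line of the file
    parse_dates=['date'],   # convert the date column to datetimes
    na_values=['missing'],  # treat this marker as NA
)

print(sales.dtypes)
print(sales)
```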

There's also a set of writer functions for writing to a variety of formats (CSVs, HTML tables, JSON).
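
A minimal sketch of three of those writers, using a small made-up DataFrame. Called without a path, to_csv, to_json, and to_html return the output as a string instead of writing a file.

```python
import pandas as pd

# Hypothetical data, for illustration only
frame = pd.DataFrame({'team': ['Bears', 'Lions'], 'wins': [11, 6]})

# With no path argument, these methods return the output as a string
csv_text = frame.to_csv(index=False)
json_text = frame.to_json(orient='records')
html_text = frame.to_html(index=False)

print(csv_text)
print(json_text)
```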

**pandas** also has some support for
* reading/writing DataFrames directly from/to Excel files
* reading/writing DataFrames directly from/to a database
* reading data from the clipboard or from a URL
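
As a sketch of the database round trip, the example below writes a made-up DataFrame to an in-memory SQLite database with to_sql and reads part of it back with read_sql. The table name football and the query are assumptions for illustration. Excel support typically needs an additional engine package, so it is left out here.

```python
import sqlite3
import pandas as pd

# Hypothetical data, for illustration only
frame = pd.DataFrame({'team': ['Bears', 'Lions'], 'wins': [11, 6]})

# In-memory SQLite database, so nothing is written to disk
conn = sqlite3.connect(':memory:')

# Write the DataFrame to a table, then query it back
frame.to_sql('football', conn, index=False)
from_db = pd.read_sql('SELECT team, wins FROM football WHERE wins > 8', conn)
conn.close()

print(from_db)
```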

**Next:** [Working with DataFrames](https://github.com/ranjankumar-gh/ml-specialization/blob/master/pandas/notebooks/working-with-dataframes.ipynb)

**Reference:** [1](http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/), [2](http://pandas.pydata.org/)