# Pandas: First Steps

## What is Pandas?

* The most common/popular library for tabular data work in Python
* Provides straightforward APIs for
  * Reading/writing data to/from most common formats
  * Filtering, projecting, and generally transforming data (including pivoting and joins)
  * Indexing for fast access to data (with optional ordering and multi-tier index patterns)
  * Applying grouping/aggregation and rolling/window operations
  * Applying user-provided (custom) functions
* How is Pandas different from plain Python? Is it just an API wrapper over Python data structures? __No__
  * Pandas relies on NumPy to provide efficient contiguously allocated native data representations
  * Pandas leverages NumPy and Cython to provide native-code implementation of many operations

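To make the NumPy point concrete, here is a minimal sketch. The `wins` Series and its values are made up for illustration. It shows that a numeric column is backed by a typed NumPy array and that arithmetic on it is applied to the whole column at once, with no explicit Python loop.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only
wins = pd.Series([7, 11, 3], name='Wins')

# The values live in a typed NumPy array, not in a list of Python objects
arr = wins.to_numpy()
print(type(arr), arr.dtype)

# Arithmetic is applied element-wise to the whole column
doubled = wins * 2
print(doubled.tolist())
```
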
<sup>*Credit for parts of this notebook: Takenori Takaki (https://github.com/takenory) for converting to Jupyter from http://pandas.pydata.org/pandas-docs/stable/10min.html and the Pandas team*</sup>

Let's see an example:

In [None]:
import pandas as pd

df = pd.DataFrame({'Team':['Tigers', 'Sharks', 'Cobras'], 'Wins': [7, 11, 3]})

df

In [None]:
df['Games'] = [10, 10, 9]

df

## Getting Data In/Out

### CSV
[Writing to a csv file](http://pandas.pydata.org/pandas-docs/stable/io.html#io-store-in-csv)

In [None]:
df.to_csv('foo.csv', index=False)

[Reading from a csv file](http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table)

In [None]:
pd.read_csv('foo.csv')

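The same round trip can be sketched without touching the filesystem. This relies on `to_csv` returning the CSV text as a string when no path is given, and on `read_csv` accepting a file-like object. The `teams` frame repeats the example data from above. Passing `index=False` keeps the row index out of the file, so it does not come back as an extra unnamed column.

```python
import io
import pandas as pd

teams = pd.DataFrame({'Team': ['Tigers', 'Sharks', 'Cobras'], 'Wins': [7, 11, 3]})

# With no path argument, to_csv returns the CSV text instead of writing a file
csv_text = teams.to_csv(index=False)
print(csv_text)

# read_csv accepts any file-like object, so wrap the string in StringIO
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip)
```
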
### HDF5
Reading and writing to [HDFStores](http://pandas.pydata.org/pandas-docs/stable/io.html#io-hdf5)

Writing to an HDF5 store (requires the PyTables package)

In [None]:
df.to_hdf('foo.h5', key='df')

Reading from an HDF5 store

In [None]:
pd.read_hdf('foo.h5', key='df')

In [None]:
try:
    df = pd.read_csv('data/housing.csv')
except Exception as e:
    print(e)

In [None]:
! head data/housing.csv

In [None]:
df = pd.read_csv('data/housing.csv', comment='#')

df

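The `comment='#'` argument used above tells the parser to ignore everything from a `#` to the end of the line, so lines that are entirely comments are skipped. The sketch below uses invented file contents, since the real `data/housing.csv` is not shown here; only the column names `beds` and `bath` are taken from this notebook. Without `comment='#'`, the parser would treat the first comment line as the header row.

```python
import io
import pandas as pd

# Hypothetical file contents; the real data/housing.csv may look different
raw = """# source: example listing export
# generated for illustration
beds,bath,price
3,2,250000
4,3,410000
"""

# Fully commented lines are skipped, so the header is found on the third line
housing = pd.read_csv(io.StringIO(raw), comment='#')
print(housing)
```
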
## Viewing Data

See the [Basics section](http://pandas.pydata.org/pandas-docs/stable/basics.html#basics) 

See the top & bottom rows of the frame

In [None]:
df.head()

In [None]:
df.tail(3)

Display the index, the columns, and the underlying NumPy data (`df.to_numpy()` is the currently recommended alternative to `df.values`)

In [None]:
df.index

In [None]:
df.columns

In [None]:
df.values

`describe()` shows a quick statistical summary of your data

In [None]:
df.describe()

### Getting

Selecting a single column, which yields a Series; equivalent to `df.beds`

In [None]:
df['beds']

Passing a slice to `[]` selects rows by position.

In [None]:
df[0:3]

### Selection by Label

Getting a cross section (a single row) using a label

In [None]:
df.index

In [None]:
df.loc[df.index[100]]

Selecting on multiple axes by label

In [None]:
df.loc[:,['beds','bath']]

Label slicing includes both endpoints

In [None]:
df.loc[10:12,['beds','bath']]

Passing a single row label reduces the dimensions of the returned object (a Series instead of a DataFrame)

In [None]:
df.loc[10,['beds','bath']]

For getting a scalar value

In [None]:
df.loc[10, 'beds']

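The three `loc` calls above differ only in how many axes receive a single label, and that determines the type of the result. A minimal sketch, assuming a small made-up frame named `demo` with integer labels 10 to 12:

```python
import pandas as pd

# Hypothetical frame, for illustration only
demo = pd.DataFrame({'beds': [1, 2, 3], 'bath': [1, 1, 2]}, index=[10, 11, 12])

frame = demo.loc[10:11, ['beds', 'bath']]   # row slice + column list -> DataFrame
row = demo.loc[10, ['beds', 'bath']]        # single row label -> Series
value = demo.loc[10, 'beds']                # single row and column -> scalar

print(type(frame).__name__, type(row).__name__, value)
```
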
### Selection by Position

See more in [Selection by Position](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-integer)

Select via the position of the passed integers

In [None]:
df.iloc[3]

By integer slices, which behave like NumPy/Python slices (the end position is excluded)

In [None]:
df.iloc[3:5,0:2]

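The two slicing rules are easy to confuse: a label slice with `loc` includes the end label, while a positional slice with `iloc` excludes the end position, as in plain Python. A small sketch with a made-up frame (`demo` is a hypothetical name) and the default integer index shows the difference:

```python
import pandas as pd

# Hypothetical frame with the default index 0..4
demo = pd.DataFrame({'beds': [1, 2, 3, 4, 5], 'bath': [1, 1, 2, 2, 3]})

by_label = demo.loc[1:3, ['beds', 'bath']]   # labels 1, 2 and 3
by_position = demo.iloc[1:3]                 # positions 1 and 2 only

print(len(by_label), len(by_position))
```
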
By lists of integer positions, similar to NumPy/Python style

In [None]:
df.iloc[[1,2,4],[0,2]]

For slicing rows explicitly

In [None]:
df.iloc[1:3,:]

For slicing columns explicitly

In [None]:
df.iloc[:,1:3]

For getting a value explicitly

In [None]:
df.iloc[1,1]

For getting fast access to a scalar (equivalent to the prior method)

In [None]:
df.iat[1,1]

## Boolean Indexing

Using a single column’s values to select data.

In [None]:
df[df.beds > 7]

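Selecting values from a DataFrame where a boolean condition is met (a "where" operation); entries that fail the condition become NaN.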

In [None]:
df[df > 0]

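Indexing a frame with a boolean frame of the same shape keeps the shape and blanks out the entries that fail the condition. A minimal sketch on made-up numeric data (`nums` is a hypothetical name):

```python
import pandas as pd

# Hypothetical numeric frame, for illustration only
nums = pd.DataFrame({'a': [1, -2, 3], 'b': [-4, 5, -6]})

mask = nums > 0          # DataFrame of booleans with the same shape as nums
positive = nums[mask]    # entries where mask is False become NaN

print(positive)
```
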
Using the isin() method for filtering:

In [None]:
df2 = df.copy()

In [None]:
try:
    df2['E'] = ['one','one', 'two','three','four','three']
except Exception as e:
    print(e)

In [None]:
# The new column needs exactly one value per row, so size it from the frame
df2['E'] = range(0, 2 * len(df2), 2)
df2

In [None]:
df2[df2['E'].isin([10,20,30])]

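As a self-contained sketch of the same pattern, the example below builds a small made-up frame (`demo` and `wanted` are hypothetical names), adds an `E` column with one value per row, and filters with `isin()`. Values in the list that match nothing are simply ignored, and `~` negates the mask.

```python
import pandas as pd

# Hypothetical frame, for illustration only
demo = pd.DataFrame({'beds': [2, 3, 4, 5]})
demo['E'] = range(0, 2 * len(demo), 2)    # 0, 2, 4, 6: one value per row

wanted = demo['E'].isin([2, 6, 100])      # 100 matches no row
print(demo[wanted])

# ~ negates the mask, keeping the rows whose E value is NOT in the list
print(demo[~wanted])
```
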
## Data Cleansing

<img src="https://materials.s3.amazonaws.com/i/Data-Cleansing-tool.jpg" width=500>

__Typical Problems__
* Incomplete records / missing values
* Duplicate (or partial duplicate) records
* Impossible values
* Values that violate business rules
* Sampling/distribution problems
* Skewed values

__Approaches to Cleansing/Repair__
* Dropping records
* Repairing values from alternate sources
* Imputing values
* Upsampling/downsampling/stratified sampling
* Deskewing calculations
* Normalization (scale to 0-1) / standardization (mean 0, sd 1)

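Several of these approaches map onto short Pandas calls. The sketch below is only an illustration on made-up records (the `raw` frame and its values are invented). Which repair is appropriate depends on the data and the problem, and imputing the median is just one possible choice.

```python
import pandas as pd

# Hypothetical records with one exact duplicate and one missing value
raw = pd.DataFrame({
    'beds':  [2.0, 3.0, 3.0, None, 4.0],
    'price': [100.0, 200.0, 200.0, 150.0, 300.0],
})

deduped = raw.drop_duplicates()     # remove exact duplicate records
dropped = deduped.dropna()          # option 1: drop incomplete records

# Option 2: impute the missing value, here with the column median
imputed = deduped.fillna({'beds': deduped['beds'].median()})

price = imputed['price']
# Normalization: rescale to the range 0-1
normalized = (price - price.min()) / (price.max() - price.min())
# Standardization: mean 0, sd 1 (Series.std uses the sample sd, ddof=1)
standardized = (price - price.mean()) / price.std()
```
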
__BEWARE!__

*Cleansing your data is like doing surgery: if you get it right, everyone will be happy ... and may not even notice anything happened.*

*But: if you don't understand the data and problems thoroughly, and if you are not thoughtful about the effect of your intervention, you can create worse problems:*
* System crashes
* Financial (business) losses due to poor human or machine decision-making from the data
* Legal liability for your company, your business unit, or yourself, due to violation of US or EU law around privacy, discrimination, accounting rules, etc.