# Slack
You need to fill in the [form](https://docs.google.com/forms/d/1OmT8ODmVBNgl0eOmZT51JMTHUSA_eNrHTcDRnmNDMgQ) to get invited.

Slack url: https://rt-portal.slack.com/

# Pandas Cheat Sheet
https://github.com/pandas-dev/pandas/raw/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf

# Getting started with pandas

In [None]:
from pandas import Series, DataFrame
import pandas as pd

In [None]:
from __future__ import division
from numpy.random import randn
import numpy as np
import os
import matplotlib.pyplot as plt
np.random.seed(12345)
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4)

In [None]:
%pwd

## Introduction to pandas data structures

To get started with pandas, you will need to get comfortable with its two workhorse
data structures: `Series` and `DataFrame`. While they are not a universal solution for every
problem, they provide a solid, easy-to-use basis for most applications.

### Series

A `Series` is a one-dimensional array-like object containing an array of data (of any
`NumPy` data type) and an associated array of data labels, called its index. The simplest
`Series` is formed from only an array of data:

In [None]:
obj = Series([4, 7, -5, 3])
obj

The string representation of a Series displayed interactively shows the index on the left
and the values on the right. Since we did not specify an index for the data, a default
one consisting of the integers 0 through N - 1 (where N is the length of the data) is
created. You can get the array representation and index object of the Series via its values
and index attributes, respectively:

In [None]:
print(obj.values)
print(obj.index)

In [None]:
type(obj.values)

Often it will be desirable to create a Series with an index identifying each data point:

In [None]:
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

In [None]:
obj2.index

Compared with a regular `NumPy` array, you can use values in the index when selecting
single values or a set of values:

In [None]:
obj2['a']

In [None]:
obj2['d'] = 6
obj2[['c', 'a', 'd']]

`NumPy` array operations, such as filtering with a boolean array, scalar multiplication,
or applying math functions, will preserve the index-value link:

In [None]:
obj2[obj2 > 0]

In [None]:
obj2 * 2

In [None]:
np.exp(obj2)

Another way to think about a `Series` is as a fixed-length, ordered dict, as it is a mapping
of index values to data values. It can be substituted into many functions that expect a
`dict`:

In [None]:
'b' in obj2

In [None]:
'e' in obj2
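The dict analogy can be made concrete with a short sketch. The Series `s` below is a stand-in for `obj2` (the variable names are illustrative, not from the original notebook); `get`, `items`, and `to_dict` are standard Series methods that mirror their `dict` counterparts.

```python
import pandas as pd

# Illustrative data mirroring obj2 from the text (names are assumptions).
s = pd.Series([6, 7, -5, 3], index=['d', 'b', 'a', 'c'])

has_b = 'b' in s            # membership tests the index labels, not the values
fallback = s.get('e', 0)    # like dict.get: returns the default for an absent label
pairs = list(s.items())     # (label, value) pairs in index order
as_dict = s.to_dict()       # convert to a plain Python dict
```

Note that `in` checks labels only; testing whether a value is present requires looking at `s.values` instead.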

Should you have data contained in a Python `dict`, you can create a `Series` from it by
passing the `dict`:

In [None]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(sdata)
obj3

When only passing a `dict`, the index of the resulting Series is built from the dict’s keys.
Older pandas versions place the keys in sorted order; newer versions keep the dict’s insertion order.

In [None]:
states = ['California', 'Ohio', 'Texas', 'Oregon']
obj4 = Series(sdata, index=states)
obj4

In this case, 3 values found in **sdata** were placed in the appropriate locations, but since
no value for **'California'** was found, it appears as **NaN** (not a number), which pandas
uses to mark missing or *NA* values. I will use the terms “missing” or “NA”
to refer to missing data. The **isnull** and **notnull** functions in pandas should be used to
detect missing data:

In [None]:
pd.isnull(obj4)

In [None]:
pd.notnull(obj4)

In [None]:
obj4.isnull()

I discuss working with missing data in more detail later in this chapter.

A critical Series feature for many applications is that it automatically aligns
differently-indexed data in arithmetic operations:

In [None]:
obj3

In [None]:
obj4

In [None]:
obj3 + obj4
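As a minimal sketch of this alignment behavior, the two illustrative Series below (`a` and `b` are assumed names with made-up values) share only some labels. Plain `+` yields NaN wherever a label is missing from either side, while `Series.add` with `fill_value` treats the missing side as the given number.

```python
import pandas as pd

# Two Series with partially overlapping labels (illustrative data).
a = pd.Series({'Ohio': 35000, 'Texas': 71000, 'Utah': 5000})
b = pd.Series({'California': 1000, 'Ohio': 100, 'Texas': 200})

total = a + b                     # values are matched by label; unmatched labels become NaN
filled = a.add(b, fill_value=0)   # a label missing on one side is treated as 0
```

The result index is the union of both indexes, so no label is silently dropped.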

Data alignment features are addressed as a separate topic.

Both the Series object itself and its index have a `name` attribute, which integrates with
other key areas of pandas functionality:

In [None]:
obj4.name = 'population'
obj4.index.name = 'state'
obj4
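One place where these names surface is when a Series is converted to a table. In this sketch (illustrative data and assumed variable names), `reset_index` turns the index into a column named after `index.name` and the values into a column named after the Series `name`.

```python
import pandas as pd

# Illustrative Series with both name attributes set.
s = pd.Series({'Ohio': 35000, 'Texas': 71000}, name='population')
s.index.name = 'state'

# The index becomes a regular column; both names are reused as column labels.
table = s.reset_index()
```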

A Series’s index can be altered in place by assignment:

In [None]:
obj

In [None]:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj

### DataFrame

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric,
string, boolean, etc.). The DataFrame has both a row and column index; it can be
thought of as a dict of Series that all share the same index. Compared with other
DataFrame-like structures you may have used before (like R’s data.frame), row-oriented and column-oriented operations in DataFrame are treated roughly symmetrically. Under the hood, the data is stored as one or more two-dimensional blocks rather
than a list, dict, or some other collection of one-dimensional arrays. The exact details
of DataFrame’s internals are far outside the scope of this book.

There are numerous ways to construct a DataFrame, though one of the most common
is from a `dict` of equal-length lists or NumPy arrays:

In [None]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)

The resulting DataFrame will have its index assigned automatically as with Series. In older
pandas versions the columns are placed in sorted order; newer versions keep the dict’s key order:

In [None]:
frame

If you specify a sequence of columns, the DataFrame’s columns will be exactly what
you pass:

In [None]:
DataFrame(data, columns=['year', 'state', 'pop'])

As with Series, if you pass a column that isn’t contained in **data**, it will appear with NA
values in the result:

In [None]:
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                   index=['one', 'two', 'three', 'four', 'five'])
frame2

In [None]:
frame2.columns

A column in a DataFrame can be retrieved as a Series either by dict-like notation or by
attribute:

In [None]:
frame2['state']

In [None]:
frame2.year

Note that the returned Series have the same index as the DataFrame, and their name
attribute has been appropriately set.

Rows can also be retrieved by position or name with a couple of methods, such as the
`loc` indexing field (much more on this later):

In [None]:
frame2.loc['three']
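To contrast the two access styles, here is a small sketch on an illustrative frame (the data and variable names are assumptions): `loc` selects a row by index label, while `iloc` selects it by integer position.

```python
import pandas as pd

# Illustrative frame with string row labels.
frame = pd.DataFrame({'year': [2000, 2001, 2002],
                      'state': ['Ohio', 'Ohio', 'Nevada']},
                     index=['one', 'two', 'three'])

by_label = frame.loc['three']    # row selected by its index label
by_position = frame.iloc[2]      # the same row selected by integer position
```

Both calls return the row as a Series whose index is the frame's column labels.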

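Columns can be modified by assignment. For example, the empty `'debt'` column can be
assigned a scalar value or an array of values:

In [None]:
frame2['debt'] = 16.5
frame2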

In [None]:
frame2['debt'] = np.arange(5.)
frame2

When assigning lists or arrays to a column, the value’s length must match the length
of the DataFrame. If you assign a Series, it will instead be conformed exactly to the
DataFrame’s index, inserting missing values in any holes:

In [None]:
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2
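The difference between the two assignment rules can be sketched as follows, using an illustrative frame (names and values are assumptions). A list of the wrong length is rejected with a `ValueError`, whereas a Series is matched by label and any unmatched rows are filled with NaN.

```python
import pandas as pd

# Illustrative frame with three labeled rows.
frame = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Nevada']},
                     index=['one', 'two', 'three'])

# A plain list must have exactly one value per row.
try:
    frame['debt'] = [1.0, 2.0]    # two values for three rows
    length_error = None
except ValueError as exc:
    length_error = exc

# A Series is aligned on the index instead; rows without a match get NaN.
frame['debt'] = pd.Series([-1.2], index=['two'])
```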

Assigning a column that doesn’t exist will create a new column. The del keyword will
delete columns as with a dict:

In [None]:
frame2['eastern'] = frame2.state == 'Ohio'
frame2

In [None]:
del frame2['eastern']
frame2.columns

The column returned when indexing a DataFrame can be a **view** on the underlying data rather than a copy, so in-place modifications to the Series
may be reflected in the DataFrame (the exact behavior depends on the pandas version). To work on independent data, copy the column explicitly
using the Series’s **copy** method.
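A minimal sketch of the explicit-copy approach, with illustrative data and assumed variable names: after calling `copy`, changes to the extracted column leave the DataFrame untouched.

```python
import pandas as pd

# Illustrative single-column frame.
frame = pd.DataFrame({'pop': [1.5, 1.7, 3.6]})

col = frame['pop'].copy()   # independent copy of the column
col.iloc[0] = 99.0          # modifies only the copy
```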

Another common form of data is a nested dict of dicts format:

In [None]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

If passed to DataFrame, it will interpret the outer dict keys as the columns and the inner
keys as the row indices:

In [None]:
frame3 = DataFrame(pop)
frame3

Of course you can always transpose the result:

In [None]:
frame3.T

The keys in the inner dicts are unioned and sorted to form the index in the result. This
isn’t true if an explicit index is specified:

In [None]:
DataFrame(pop, index=[2001, 2002, 2003])

Dicts of Series are treated much in the same way:

In [None]:
pdata = {'Ohio': frame3['Ohio'][:-1],
         'Nevada': frame3['Nevada'][:2]}
DataFrame(pdata)

For a complete list of things you can pass the DataFrame constructor, see **Table 5-1**.

**Table 5-1. Possible data inputs to DataFrame constructor**

Type | Notes
--- | ---
2D ndarray | A matrix of data, passing optional row and column labels
dict of arrays, lists, or tuples | Each sequence becomes a column in the DataFrame. All sequences must be the same length.
NumPy structured/record array | Treated as the “dict of arrays” case
dict of Series | Each value becomes a column. Indexes from each Series are unioned together to form the result’s row index if no explicit index is passed.
dict of dicts | Each inner dict becomes a column. Keys are unioned to form the row index as in the “dict of Series” case.
list of dicts or Series | Each item becomes a row in the DataFrame. Union of dict keys or Series indexes become the DataFrame’s column labels
list of lists or tuples | Treated as the “2D ndarray” case
Another DataFrame | The DataFrame’s indexes are used unless different ones are passed
NumPy MaskedArray | Like the “2D ndarray” case except masked values become NA/missing in the DataFrame result
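A few of the input types from Table 5-1 that are not demonstrated elsewhere in this section can be sketched like this. The data and variable names are illustrative assumptions; only the constructor behavior described in the table is being shown.

```python
import numpy as np
import pandas as pd

# 2D ndarray with optional row and column labels.
from_array = pd.DataFrame(np.arange(6).reshape(2, 3),
                          index=['a', 'b'], columns=['x', 'y', 'z'])

# List of dicts: each dict becomes a row; the union of keys forms the columns.
from_records = pd.DataFrame([{'x': 1, 'y': 2}, {'y': 3, 'z': 4}])

# List of tuples: treated like the 2D ndarray case, one tuple per row.
from_rows = pd.DataFrame([(1, 'a'), (2, 'b')], columns=['num', 'letter'])
```

In `from_records`, keys missing from a given dict show up as NaN in that row.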

If a DataFrame’s **index** and **columns** have their **name** attributes set, these will also be
displayed:

In [None]:
frame3.index.name = 'year'; frame3.columns.name = 'state'
frame3

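Like Series, the **values** attribute returns the data contained in the DataFrame as a 2D ndarray:

In [None]:
frame3.values

If the DataFrame’s columns are different dtypes, the dtype of the values array will be chosen to
accommodate all of the columns:

In [None]:
frame2.values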
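How the dtype of the `values` array depends on the column types can be checked with a small sketch (illustrative frames, assumed names). A frame with only float columns gives a float array, while mixing numbers and strings falls back to the generic `object` dtype.

```python
import pandas as pd

# All columns are floats, so the 2D array is float64.
numeric = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})
numeric_dtype = numeric.values.dtype

# Floats and strings have no common numeric type, so the array holds Python objects.
mixed = pd.DataFrame({'a': [1.0, 2.0], 'b': ['x', 'y']})
mixed_dtype = mixed.values.dtype
```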

### Index objects

pandas’s Index objects are responsible for holding the axis labels and other metadata
(like the axis name or names). Any array or other sequence of labels used when constructing a Series or DataFrame is internally converted to an Index:

In [None]:
obj = Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index

In [None]:
index[1:]

Index objects are immutable and thus can’t be modified by the user:

In [None]:
index[1] = 'd'  # raises TypeError because Index objects are immutable

Immutability is important so that Index objects can be safely shared among data
structures:

In [None]:
index = pd.Index(np.arange(3))
obj2 = Series([1.5, -2.5, 0], index=index)
obj2.index is index

**Table 5-2** has a list of built-in Index classes in the library. With some development
effort, Index can even be subclassed to implement specialized axis indexing functionality.

**Table 5-2. Main Index objects in pandas**

Class | Description
--- | ---
Index | The most general Index object, representing axis labels in a NumPy array of Python objects.
Int64Index | Specialized Index for integer values. Removed in pandas 2.0, where the plain Index holds integer dtypes directly.
MultiIndex | “Hierarchical” index object representing multiple levels of indexing on a single axis. Can be thought of as similar to an array of tuples.
DatetimeIndex | Stores nanosecond timestamps (represented using NumPy’s datetime64 dtype).
PeriodIndex | Specialized Index for Period data (timespans).
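As a rough sketch, several of the classes in Table 5-2 can be created with standard pandas constructors. The variable names and sample values below are illustrative assumptions.

```python
import pandas as pd

labels = pd.Index(['a', 'b', 'c'])                                   # general-purpose Index
multi = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1)])    # two-level hierarchical index
dates = pd.date_range('2000-01-01', periods=3, freq='D')             # DatetimeIndex of daily timestamps
periods = pd.period_range('2000-01', periods=3, freq='M')            # PeriodIndex of monthly periods
```

Any of these can be passed as the `index` argument when constructing a Series or DataFrame of matching length.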

In addition to being array-like, an Index also functions as a fixed-size set:

In [None]:
frame3

In [None]:
'Ohio' in frame3.columns

In [None]:
2003 in frame3.index

Each Index has a number of methods and properties for set logic and answering other
common questions about the data it contains. These are summarized in **Table 5-3**.

**Table 5-3. Index methods and properties**

Method | Description
--- | ---
append | Concatenate with additional Index objects, producing a new Index
difference | Compute set difference as an Index (named `diff` in early pandas releases)
intersection | Compute set intersection
union | Compute set union
isin | Compute boolean array indicating whether each value is contained in the passed collection
delete | Compute new Index with element at index i deleted
drop | Compute new index by deleting passed values
insert | Compute new Index by inserting element at index i
is_monotonic | Returns True if each element is greater than or equal to the previous element (spelled `is_monotonic_increasing` in newer pandas versions)
is_unique | Returns True if the Index has no duplicate values
unique | Compute the array of unique values in the Index
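Several entries from Table 5-3 are sketched below on two illustrative Index objects (`left` and `right` are assumed names). Each call returns a new object and leaves the original Index unchanged, consistent with Index immutability. The set difference is written as `difference`, the method name used by current pandas.

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

both = left.union(right)            # labels present in either Index
common = left.intersection(right)   # labels present in both
only_left = left.difference(right)  # labels in left but not in right
appended = left.append(right)       # plain concatenation; duplicates are kept
mask = left.isin(['a', 'c'])        # boolean array, one entry per label in left
without_first = left.delete(0)      # remove the element at position 0
without_a = left.drop(['a'])        # remove elements by value
inserted = left.insert(1, 'z')      # insert 'z' at position 1
```

Because `append` keeps duplicates, `appended.is_unique` is False here even though both inputs are unique.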