# Pandas Cheat Sheet
This notebook provides a summary of the most important and useful pandas commands.

# Resources

* [Master Data Analysis with Python](https://online.dunderdata.com/courses/master-data-analysis-with-python-volume-1-foundations-of-data-exploration)
    * Book by Ted Petrou - 1000 Pages, 400 Exercises
* [Complete Master Data Analysis with Python Bundle](https://online.dunderdata.com/bundles/complete-master-data-analysis-with-python-bundle)
    * Exercise Python and Master Data Analysis with Python and all videos
* [Pandas Cookbook](https://www.amazon.com/Pandas-Cookbook-Ted-Petrou/dp/1784393878)
    * Book from Ted Petrou. Detailed recipes that show how to complete specific tasks with real-world data.
* [Official Pandas Documentation](http://pandas.pydata.org/pandas-docs/stable/)
* [Tagged pandas questions on stackoverflow](https://stackoverflow.com/questions/tagged/pandas)


# Making the most of a Jupyter Notebook
* Jupyter Notebooks are composed of cells
* There are two main **types** of cells 
    * **Code** cells - understand python code
    * **Markdown** cells - understand markdown (a simple plain text formatting language)
* Each cell has two separate **modes**
    * **Edit** mode - The keys you press will work similarly as they do in a text editor
        * Cell is outlined in green
        * Flashing cursor inside the cell
        * Pencil icon appears in the upper right hand corner of the notebook
    * **Command** mode - The keys have special meaning and do not print to the screen
        * Cell is outlined in blue
        * No flashing cursor
        * No pencil icon
* Pressing **ESC** puts you in command mode
* Pressing **ENTER** puts you in edit mode
* In command mode:
    * `A` - Add a new cell above
    * `B` - Add a new cell below
    * `DD` - Delete a cell
    * `M` - Change cell type to markdown
    * `Y` - Change cell type to code
    * `H` - Get all keyboard shortcuts for both modes
* Executing code cells:
    * **Shift + enter** - execute and move to the next cell in command mode OR, if it is the last cell, create a new code cell in edit mode
    * **Ctrl + enter** - execute and stay in current cell in command mode
    * **Alt/option + enter** - execute and insert new cell in edit mode
* Keep your hands on the keyboard. Do not use your mouse to switch from edit to command mode

## Help in the Notebook
* Press **Shift + Tab + Tab** at the end of a function/method to reveal the docstring as a popup window
* Place a `?` at the end of a function/method and execute the cell to reveal the docstring as a window at the bottom of the notebook
* Place `??` at the end of a function/method and execute the cell to reveal the source code

# Five step process for Doing Data Science in the Notebook
[See the blog post for more](https://medium.com/dunder-data/the-five-step-process-for-data-exploration-in-a-jupyter-notebook-92fe818b5a62)
1. Write and execute a single line of code to explore your data. Usually you are doing something to a DataFrame or a Series
1. Verify that this line of code works by inspecting the output
1. Assign the result to a variable
1. Within the same cell, in a second line, output the head of the DataFrame or Series
1. Continue to the next cell. Do not add more lines of code to the cell


# Pandas
* Name derived from panel + data
* Used for two-dimensional, 'tabular' data analysis 
* Tabular data has two dimensions, rows and columns. A table.
* Relies heavily on numpy to store data and do calculations.
* The **DataFrame** is the primary container of data. It is two-dimensional and contains heterogeneous data (data of different types)
* The **Series** is another container of data and is one-dimensional. A Series is essentially a single column of data from a DataFrame

## Data Types
Every column of a Pandas DataFrame has a particular **data type**. The data type is very important and informs us that each value within a particular column is of that data type.

### Most Common Data Types
There are many possible data types with the following being more common:

* Boolean
* Integer
* Float
* Object (can be any Python object but is usually strings)
* Datetime
* Timedelta
* Period
* Categorical

## Missing Value Representation
Pandas uses three representations for missing values: `NaN`, `None`, `NaT`
* NaN - not a number, found only in float or object columns
* None - Python object **`None`** - found only in object columns
* NaT - not a time, found only in Datetime, Timedelta, and Period columns
* No missing values for the default integer or boolean data types (an unfortunate Pandas limitation). The nullable extension types `Int64` and `boolean` lift it by using `pd.NA`
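
A minimal sketch of the three representations, using made-up data (the column names are hypothetical). Newer pandas versions may store the string column with a dedicated string data type rather than object, but `isna` reports the missing value either way.

```python
import numpy as np
import pandas as pd

# Hypothetical data: exactly one missing value in each column
df = pd.DataFrame({
    'price': [1.5, np.nan, 3.0],   # float column, missing value is NaN
    'name': ['a', None, 'c'],      # string column, missing value given as None
    'date': pd.to_datetime(['2020-01-01', None, '2020-01-03']),  # datetime column, NaT
})

# isna treats NaN, None, and NaT all as missing
missing_counts = df.isna().sum()
print(missing_counts)
```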

## DataFrame

![][2]
* 2 dimensional, "tabular" data. rows and columns. 
* Think of a DataFrame as a collection of columns. Pandas thinks columnwise
* Three main components - **index**, **columns** and **data (values)**
* The index labels the rows and the column names label the columns
* Uses an Index object for both the rows and the columns
* The row Index is simply called the index. The column index is called the columns.
* Access Index with **`df.index`**, the columns with **`df.columns`**, and the values with **`df.values`**
* Most common way to create a DataFrame is with **`pd.read_csv('file_name.csv')`**

## Series

![][1]

* One dimensional object
* Two components to a Series - the **index** and the **data (values)**
* Get the values: **`s.values`** - returns a NumPy array
* Get the index: **`s.index`** - returns a Pandas Index object - default index is a **`RangeIndex`**
* Use **`head/tail`** methods to shorten long output


### Common DataFrame Attributes

* **`df.index`**
* **`df.columns`**
* **`df.values`**
* **`df.shape`** - tuple of (rows, columns)
* **`df.size`** - Total elements in DataFrame - rows x columns
* **`df.dtypes`** - Returns each data type of a column as a Series

### Common Series attributes

* **`s.index`**
* **`s.values`**
* **`s.size`** - can also find number of Series elements with **`len(s)`**
* **`s.dtype`**

### Setting a Meaningful index on a DataFrame
* It is never necessary to set an index with DataFrame
* All data analysis is possible without setting an index
* The default index for a DataFrame is the range of integers 0 to n-1. This is called a **`RangeIndex`**
* If you do set an index, choose one that is both unique and descriptive. Uniqueness is not enforced
* Set the index on read with the `index_col` parameter. **`pd.read_csv('file_name.csv', index_col='title')`**
* Set the index after read with `df.set_index('title')`
* Turn the index into the first column of a DataFrame with `df.reset_index()`
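
The following sketch walks through these options. It reads from an in-memory string instead of `'file_name.csv'` so it runs on its own; the data is made up.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for a file on disk
csv_text = "title,year\nMovie A,1999\nMovie B,2005\n"

df = pd.read_csv(io.StringIO(csv_text))                            # default RangeIndex
df_on_read = pd.read_csv(io.StringIO(csv_text), index_col='title') # index set on read
df_after = df.set_index('title')                                   # index set after read
df_reset = df_after.reset_index()                                  # index back to first column

print(df.index)
print(df_after.head())
```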

## Selecting subsets of DataFrames

* The indexing operator has specific rules for a DataFrame. It selects either a single column as a Series or multiple columns as a DataFrame
* Use a single column name (usually a string) to select one column as a Series - **`df['col1']`**
* **`df.col1`** also selects a single column, but do NOT use. It doesn't work for columns with spaces
* Use a list of column names to select multiple columns as a DataFrame. **`df[['col1', 'col2', 'col3']]`**  Notice the inner list
*  **`df[['col']]`** selects a one column DataFrame
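
A short sketch of the three forms on a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]})

one_col_series = df['col1']       # single name -> Series
one_col_frame = df[['col1']]      # list with one name -> one-column DataFrame
two_cols = df[['col1', 'col3']]   # list of names -> DataFrame

print(type(one_col_series).__name__, type(one_col_frame).__name__)
```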

#### Simultaneous row and column selection with `loc`
* The **`loc`** indexer has completely different rules than just the indexing operator.
* Use **`loc`** to simultaneously select rows and columns from a DataFrame - `df.loc[rows, cols]` where `rows` and `cols` can be one of the following three:
    * A single label
    * A list of labels
    * A slice of labels
* **`loc`** only works with index and column **labels** NOT by integer location
* The comma separates the row selection from the column selection

#### `loc` examples
* **`df.loc[['index1', 'index5'], ['col3', 'col1', 'col7']]`** selects two rows and three columns with lists for each row and column selection
* **`df.loc['index1':'index10', ['col3', 'col7']]`** Uses slice notation for the rows and lists for the columns
* **`df.loc['index1', 'col5']`** selects a single scalar value
* Notice that the column selection is optional and not present for the following examples. In these cases, all the columns are selected
* **`df.loc['index1']`** selects a single row as a Series. Notice that the columns are optional and not present here.
* **`df.loc[['index1', 'index2']]`** selects multiple rows as a DataFrame
* **`df.loc['index1':'index100':5]`** - slice notation selects multiple rows as a DataFrame
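
These selections can be tried on a small made-up DataFrame (the labels `'a'` to `'d'` are placeholders):

```python
import pandas as pd

df = pd.DataFrame(
    {'col1': [1, 2, 3, 4], 'col2': [10, 20, 30, 40], 'col3': [100, 200, 300, 400]},
    index=['a', 'b', 'c', 'd'],
)

subset = df.loc[['a', 'd'], ['col3', 'col1']]  # lists of labels for rows and columns
sliced = df.loc['b':'d', ['col2']]             # label slices include the stop label
scalar = df.loc['c', 'col2']                   # single labels -> scalar value
row = df.loc['a']                              # one row, all columns, as a Series

print(subset)
print(sliced)
```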

#### Simultaneous row and column selection with `iloc`
* **`.iloc`** works analogously to `loc` but uses integer location instead
* **Integer location** refers to the position of the index or column label along its respective axis.
* It follows the pattern `df.iloc[rows, cols]` where `rows` and `cols` can be one of the following three:
    * A single integer
    * A list of integers
    * A slice of integers

#### `iloc` examples
* **`df.iloc[[2, 6], [0, 4, 2]]`** selects two rows and three columns with lists for each row and column selection
* **`df.iloc[5:10, [3, 7]]`** Uses slice notation for the rows and lists for the columns
* **`df.iloc[1, 5]`** Uses a single integer for both row and column selection to select a single scalar value
* Notice that the column selection is optional and not present for the following examples. In these cases, all the columns are selected
* **`df.iloc[2]`** selects a single row as a Series.
* **`df.iloc[[2, 5]]`** selects multiple rows as a DataFrame
* **`df.iloc[10:200:5]`** - slice notation selects multiple rows as a DataFrame
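
The same kind of made-up DataFrame illustrates the integer-location versions. Note that integer slices exclude the stop position.

```python
import pandas as pd

df = pd.DataFrame(
    {'col1': [1, 2, 3, 4], 'col2': [10, 20, 30, 40], 'col3': [100, 200, 300, 400]},
    index=['a', 'b', 'c', 'd'],
)

subset = df.iloc[[0, 3], [2, 0]]  # lists of integer positions
sliced = df.iloc[1:3, [1]]        # rows at positions 1 and 2 (3 is excluded)
scalar = df.iloc[2, 1]            # single integers -> scalar value
row = df.iloc[0]                  # one row as a Series

print(subset)
print(sliced)
```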

## Selecting subsets of Series 

* It is possible to use just the indexing operator to make selections from a Series, but I advise against it as it is ambiguous whether you are using it to select with integers or labels. Instead use **`loc`** or **`iloc`**.
* Both the **`loc`** and **`iloc`** indexers work analogously to how they do with a DataFrame, but only take one selection as there are no columns.
* **`s.loc['label1']`** - scalar that selects a single item in Series
* **`s.loc[['label1', 'label5']]`** - use a list to select disjoint items
* **`s.loc['start':'stop':step]`** - use a slice to select from start to stop inclusive
* **`s.iloc[integer1]`** - scalar that selects a single item in Series
* **`s.iloc[[integer1, integer2]]`** - use a list to select disjoint items
* **`s.iloc[start:stop:step]`** - use a slice to select from start up to, but not including, stop
* **`.ix`** is deprecated. Do NOT use it.
* **`s[label]`** works for both integer and label location. It is ambiguous. Avoid if possible.
* **Automatic alignment of the index** - be careful when operating with two Series at the same time. They align on the index first (an outer join, which produces a Cartesian product for labels that repeat in both) and then complete the calculation.
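
A small sketch contrasting label slices with integer slices on a made-up Series:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

single = s.loc['a']          # one label -> scalar
by_label = s.loc['b':'d']    # label slice includes 'd'
by_position = s.iloc[1:3]    # integer slice excludes position 3

print(by_label.tolist())
print(by_position.tolist())
```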

## Boolean Indexing for DataFrames
* Boolean indexing (a.k.a Boolean selection) is the process of filtering data based on its actual values
* Examples of boolean indexing: Find all the employees with salary over 100k, find all the flights leaving from Houston with a distance over 500 miles
* Boolean indexing works by passing in a Series, or other one-dimensional sequence of booleans to the brackets or **`loc`** indexer. Only values that are True remain.
* Must create a Series (or sequence) of booleans first. Sometimes save this to a variable named **`filt`**
* Usually the boolean Series is created by using comparison operators with columns in the DataFrame
* Example filter: `filt = df['col1'] > 5`
* Pass this filter into the brackets to complete the selection: `df[filt]`
* Boolean indexing with a boolean Series does not work with `.iloc`

### Multiple Condition Boolean Series
* Create multiple conditions by using `&`, `|` operators for the logical **and** and **or**.
* Must wrap each condition in parentheses when on the same line
* Example filter: `filt = ((df['col1'] > 5) | (df['col2'] < -2)) & (df['col3'] % 2 == 0)`
* Invert a Boolean Series with the `~` operator

#### Simultaneous boolean selection and column selection
* **`df[filt]`** selects all the columns for the rows that are true
* Use **`loc`** to simultaneously do boolean selection on the rows while selecting particular columns
* Example: **`df.loc[filt, ['col1', 'col2']]`** does boolean selection on the rows while simultaneously selecting columns with a list.
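
The multiple-condition filter from above can be traced on made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 6, 8, 3],
    'col2': [0, -5, 4, -3],
    'col3': [2, 4, 5, 6],
})

# (col1 over 5 OR col2 under -2) AND col3 even; each condition is parenthesized
filt = ((df['col1'] > 5) | (df['col2'] < -2)) & (df['col3'] % 2 == 0)

all_cols = df[filt]                       # all columns for the True rows
some_cols = df.loc[filt, ['col1', 'col2']]  # True rows, selected columns
inverse = df[~filt]                       # rows where the filter is False

print(some_cols)
```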

#### Methods that create Boolean Series
* Use `isin` to do multiple equality comparisons. `filt = df['col1'].isin([val1, val2, val3])`
* Use `between` as a shortcut to check whether each value in a Series is between two values. `filt = df['col1'].between(50, 100)`
* Use `isna` to check whether each value is missing or not
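
A sketch of the three methods on a made-up Series. With its default arguments `between` includes both endpoints, and the missing value yields False for `isin` and `between`.

```python
import numpy as np
import pandas as pd

s = pd.Series([50, 75, 120, np.nan])

in_list = s.isin([50, 120])     # equal to any value in the list
in_range = s.between(50, 100)   # 50 <= value <= 100
missing = s.isna()              # True only for the missing value

print(s[in_range])
```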

## Boolean Indexing on a Series

* Works similarly to boolean indexing for DataFrames
* Create the filter and pass it back into the brackets
* Boolean Series example: **`filt = ((s > 5) | (s < -2)) & (s % 2 == 0)`**
*  **`s[filt]`**  does the boolean selection

## Operations on a numeric Series

### Series  Arithmetic Operations

* `+, -, *, /, //, **` - all operate on every value of the Series
* **`s + 5`** adds 5 to every value in the Series
* Operations are **vectorized**
* Vectorization means no explicit writing of a for loop

### Series Comparison Operations
* **`<, >, <=, >=, ==, !=`** - applies condition to each value in Series - returns boolean Series
* **`s > 5`** - compares every value in Series to 5 and returns a Series of booleans
* Comparison against a missing value returns False for every operator except `!=`, which returns True
* Operations are vectorized
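
A brief sketch of vectorized arithmetic and comparison on a made-up Series containing one missing value:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 5.0, np.nan, 9.0])

plus_five = s + 5      # applied to every value, no explicit loop
greater = s > 5        # boolean Series; the missing value compares as False
not_equal = s != 5     # the missing value compares as True for !=

print(plus_five.tolist())
print(greater.tolist())
```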

## Series Descriptive Statistical Methods

### Aggregation methods
An aggregation is defined as a function that returns a **single** value.

* **`s.sum`**
* **`s.min`**
* **`s.max`**
* **`s.median`**
* **`s.mean`**
* **`s.count`** - counts non-missing values
* **`s.std`**
* **`s.var`**
* **`s.quantile(q=.5)`** - percentile of distribution
* **`s.describe`** - returns most of the above aggregations in one Series

### Non-Aggregation methods
A non-aggregating method does not return a single value and most often returns a Series the same length as the original.
* **`s.abs`** - takes absolute value
* **`s.round`** - round to the nearest given decimal
* **`s.cummin`** - cumulative minimum
* **`s.cummax`** - cumulative maximum
* **`s.cumsum`** - cumulative sum
* **`s.rank`** - rank values in a variety of different ways
* **`s.diff`** - difference between one element and another
* **`s.pct_change`** - percent change from one element to another

### Missing Value methods
* **`s.isna`** - Returns a Series of booleans based on whether each value is missing or not. **`isnull`** is an alias
* **`s.notna`** - Exact opposite of **`isna`**
* **`s.fillna`** - fills missing values in a variety of ways
* **`s.dropna`** - Drops the missing values from the Series

### More Series methods
* **`s.idxmin`**, **`s.idxmax`** - returns the index label of the minimum/maximum value
* **`s.unique`** - returns a NumPy array of unique values
* **`s.nunique`** - returns the number of unique values
* **`s.drop_duplicates`** - returns a Series of the first unique values by default
* **`s.sort_values`** - Sorts from least to greatest by default. Use **`ascending=False`** to reverse
* **`s.sort_index`** - Sorts index of Series
* **`s.sample`** - Randomly samples Series
* **`s.replace`** - replaces values in a Series

### Series `value_counts` - very important method
* **`value_counts`** - returns the sorted frequency of each unique value in a Series. Use **`normalize=True`** to return proportions
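
For example, on a made-up Series of letters:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c', 'a', 'b'])

counts = s.value_counts()                     # most frequent value first
proportions = s.value_counts(normalize=True)  # counts divided by the total

print(counts)
print(proportions)
```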

## Operations on a Series of strings with the `str` accessor
* Only available to Series that have string data (object data type)
* Many Methods overlap with Python strings
* Popular str methods include: `contains, count, extract, split, strip`
* Learn regular expressions for more powerful searching and extracting
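
A sketch of a few `str` methods on made-up movie titles. The pattern looks for a four-digit year in parentheses; `expand=False` is passed so that `extract` returns a Series for the single capture group.

```python
import pandas as pd

s = pd.Series(['  The Matrix (1999)', 'Up (2009) ', 'Alien'])

cleaned = s.str.strip()                                    # remove surrounding whitespace
has_year = cleaned.str.contains(r'\(\d{4}\)')              # boolean Series
year = cleaned.str.extract(r'\((\d{4})\)', expand=False)   # captured text or missing
words = cleaned.str.split(' ')                             # list of pieces per value

print(has_year.tolist())
print(year.tolist())
```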

## Operations on a Series of datetimes with the `dt` accessor
* Available to Series with datetime data
* Mostly simple attributes that retrieve particular information about the data type with attributes such as year, hour, month, weekday, etc...
* Popular methods include `round`, `floor`, `ceil` which require offset alias strings (see below)
* Also available for timedelta and period columns
* timedelta and period columns have different but similar attributes and methods
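
A minimal sketch of the `dt` accessor on two made-up datetimes, using the `'D'` (day) offset alias for rounding down:

```python
import pandas as pd

s = pd.Series(pd.to_datetime(['2014-12-05 13:45:10', '2015-02-12 08:05:59']))

years = s.dt.year          # attribute, no parentheses
hours = s.dt.hour
days = s.dt.floor('D')     # method, rounds down to midnight

print(years.tolist())
print(days.tolist())
```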


## DataFrame  Arithmetic Operations
* `+, -, *, /, //, **` - all apply a single value to all values of the DataFrame
* **`df + 5`** adds 5 to every value in the DataFrame
* Error will occur if operation does not work with every single data type (i.e. adding 5 to a string column)
* Operations are vectorized
* Use **`select_dtypes`** to select all columns with the given type - Use strings 'int', 'float', 'bool', 'object', 'datetime', 'timedelta', 'category'
* Use string 'number' to select both int and float

## DataFrame Comparison Operations
* **`<, >, <=, >=, ==, !=`** - applies condition to each value in DataFrame - returns DataFrame of all booleans
* **`df > 5`** - compares every value in DataFrame to 5.
* Operations are vectorized

## Basics of DataFrame Methods 
* DataFrames and Series share about 90% of their methods
* DataFrame methods differ from Series methods because they (usually) have an `axis` parameter that controls the **direction of the operation**.
* The direction of the operation will either be vertical or horizontal
* **`axis`** equal to **`index`** (or **`0`**) will perform the operation vertically (i.e. summing all the values in each column). 
* **`axis`** equal to **`columns`** (or **`1`**) will perform the action horizontally. 
* The default is almost always **`axis=0`** - meaning Pandas **thinks column-wise**. By default, operations happen to each column independently.
* Use the string names for the axes ('index', and 'columns') instead of the integers 0 and 1 as the string names are more explicit.
* **`info`** method returns some metadata from the DataFrame
* All the descriptive statistical methods are the same as they are with Series and by default operate down each column.
* **`describe`** method has the **`include`** parameter which accepts a string or list of strings for data types that you would like to find summary statistics for
* **`rename`** - pass a dictionary of old-to-new name pairs to the **`columns`** parameter to rename columns
* **`drop`** - pass string or list of strings to **`columns`** parameter to drop columns
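
A short sketch of the `axis` parameter, `rename`, and `drop` on a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

col_sums = df.sum(axis='index')     # vertical: one result per column
row_sums = df.sum(axis='columns')   # horizontal: one result per row

renamed = df.rename(columns={'a': 'alpha'})  # old name -> new name
dropped = df.drop(columns='b')               # remove a column

print(col_sums.tolist(), row_sums.tolist())
```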

### Specifics on certain DataFrame methods
* **`sort_values`** - must pass a string or list of strings to **`by`** parameter to sort columns. Pass **`ascending`** list of booleans that correspond with **`by`** columns to control direction of sort.
* **`drop_duplicates`** and **`dropna`** have a **`subset`** parameter. Pass columns to it to limit the functionality to just those columns.

## Methods not available to DataFrames
* The `str, dt, cat` accessors are only available to Series
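* `unique` is only available to Series. `value_counts` was Series-only in older versions; pandas 1.1 added `DataFrame.value_counts`, which counts unique rows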

## Methods not available to Series
* `set_index`, `info`, `select_dtypes`, `melt`, `pivot_table`

### Reading in Data
* Many datasets will be stored in CSV's. Read them in by passing the file location to **`pd.read_csv`**
* Set the index column on read with parameter **`index_col`**
* Set the **`parse_dates`** parameter to a list of the Datetime column names to convert them on read
* Set the index column after read with **`df.set_index('column name')`**
* A good choice for index is a column that uniquely identifies each row. Indexes can have duplicates

### Automatic alignment of the Index
* Pandas has the surprising and implicit feature of automatically aligning the index and columns when operating with two Series/DataFrames at the same time
* For example adding two Series - `s1 + s2` or two DataFrames `df1 + df2`
* Both the index and columns align first before any operation is done.
* They align by forming an outer join. For index values that repeat in each, a Cartesian product is produced
* Index and Column labels unique to either Series/DataFrame will remain in the result with a missing value
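
The alignment rules can be seen with two pairs of made-up Series. The second pair has a label that repeats in both indexes, which triggers the Cartesian product.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Outer join on the index: 'a' and 'd' exist in only one Series, so they become missing
total = s1 + s2
print(total)

# 'x' appears twice in each index: 2 x 2 = 4 rows for 'x', plus one row for 'y'
s3 = pd.Series([1, 2], index=['x', 'x'])
s4 = pd.Series([10, 20, 30], index=['x', 'x', 'y'])
dup_total = s3 + s4
print(dup_total)
```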


## Grouping
Also known as split-apply-combine

* Split - Splits your data into independent groups
* Apply - Apply a function to each of your groups
* Combine - Put the results of the apply function back together

![][3]

The most common type of function to apply to each group is one that **aggregates**. This pattern will always have three components:

* **grouping columns** - Each unique combination forms a group
* **aggregating columns** - Values of these columns are going to be aggregated into a single number
* **aggregating functions** - Determines how the aggregation will happen - **`sum, mean, median`**, etc...

A popular syntax for grouping a single column, aggregating a single column, and applying a single aggregation function is:

```
>>> df.groupby('<grouping column>').agg({'<aggregating column>':'<aggregating function>'})
```

* Use a list if you would like to use more than one grouping column.
* To have additional aggregating columns, add them as new key-value pairs to the dictionary inside of **`agg`**
* Use a list as the value in the dictionary to use more than one aggregating function
* There are many built-in aggregating functions. Use their string name. Some examples are `min, max, sum, mean, count, size, std` and more.
* By default, grouping columns get put in the index. Call the **`reset_index`** method after to make them columns.
* Shortcut for getting the size of each group: `df.groupby('<grouping column>').size()`
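
A minimal sketch of this syntax on made-up sales data (the column names `state` and `sales` are placeholders):

```python
import pandas as pd

df = pd.DataFrame({
    'state': ['TX', 'TX', 'CA', 'CA', 'CA'],
    'sales': [10, 20, 5, 15, 25],
})

# grouping column: state, aggregating column: sales, aggregating function: sum
totals = df.groupby('state').agg({'sales': 'sum'}).reset_index()

# several aggregating functions for one column
stats = df.groupby('state').agg({'sales': ['min', 'max']})

# number of rows in each group
sizes = df.groupby('state').size()

print(totals)
```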

### Advanced Grouping
There are several alternate syntaxes for groupby
* If all aggregating columns will use the same aggregating functions:
```
>>> df.groupby('<grouping column>')['<aggregating columns>'].agg([<aggregating functions>])
```
* If all aggregating columns will use exactly one aggregating function, use it as a method:
```
>>> df.groupby('<grouping column>')['<aggregating columns>'].sum()
```
* If you'd like to aggregate all the non-grouping columns with the same function
```
>>> df.groupby('<grouping column>').sum()
```
* You can write your own custom aggregation function if it does not exist in Pandas.
* Custom aggregation functions have poor performance. Avoid if possible.
* Use the `filter` groupby method to filter out groups as a whole. You must write a custom function that returns a single boolean for each group. The original data is filtered and the same number of columns are returned
* Use the `transform` method to return either a single value or a sequence the same length as the group. Good if you want to preserve the shape of the original DataFrame
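
A sketch of `transform` and `filter` on made-up data. The threshold of 40 in the custom function is arbitrary and only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'state': ['TX', 'TX', 'CA', 'CA', 'CA'],
    'sales': [10, 30, 5, 15, 25],
})

# transform: each row receives the mean of its own group, so the length is preserved
group_mean = df.groupby('state')['sales'].transform('mean')

# filter: the custom function returns one boolean per group; whole groups are kept or dropped
big_groups = df.groupby('state').filter(lambda g: g['sales'].sum() > 40)

print(group_mean.tolist())
print(big_groups)
```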

### Pivot Table
* The **`pivot_table`** method is very similar to a **`groupby`** aggregation, but will pivot one of the grouping columns. We can use the same terminology as we do with groupby
```
>>> df.pivot_table(index='grouping column 1', columns='grouping column 2', values='aggregating column', aggfunc='sum')
```
* `pd.crosstab` is useful to normalize counts across more than one variable. Otherwise it is not needed and adds no value over `pivot_table`. 
* It is nearly identical to `df.pivot_table` but it is a **function** therefore you need to pass it Series and not strings. Set normalize to either 'all', 'index', or 'columns'
```
>>> pd.crosstab(index=df['grouping column 1'], columns=df['grouping column 2'], normalize='index') 
```
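
Both calls can be compared on a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'state': ['TX', 'TX', 'CA', 'CA'],
    'year': [2019, 2020, 2019, 2020],
    'sales': [10, 20, 5, 15],
})

# one row per state, one column per year, cells hold the summed sales
pt = df.pivot_table(index='state', columns='year', values='sales', aggfunc='sum')

# crosstab takes Series; normalize='index' makes each row sum to 1
ct = pd.crosstab(index=df['state'], columns=df['year'], normalize='index')

print(pt)
print(ct)
```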
## Time Series

### Datetimes (Timestamps)
* A date is only year, month, day
* A time is only hour, minute, second, part of second
* A datetime is a date and a time combined
* Pandas has a **`Timestamp`** type but it is just a powerful datetime object. It is a specific moment in time.
* In pandas when we talk about Timestamps and Datetimes, we are talking about the same thing. Confusing!
* Create Timestamps with the **`to_datetime`** function. It converts a wide variety of strings. It also converts numbers, treating them as a count of some unit of time since the Unix epoch of Jan 1, 1970.

### Timedeltas
* Pandas has the **`Timedelta`** data type for amounts of time. e.g. 5 hours 36 minutes and 10 seconds
* The **`to_timedelta`** function converts strings and numbers to Timedeltas
* Subtracting two Timestamps creates a Timedelta
* Both Timestamps and Timedeltas have the same (and more) attributes and methods that the Series **`dt`** accessor has
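
A brief sketch of creating Timestamps and Timedeltas, with made-up moments in time:

```python
import pandas as pd

start = pd.to_datetime('2014-12-05 08:00:00')
end = pd.to_datetime('2014-12-05 13:36:10')

elapsed = end - start                    # Timestamp - Timestamp -> Timedelta
same = pd.to_timedelta('05:36:10')       # string -> Timedelta
epoch = pd.to_datetime(0, unit='s')      # 0 seconds after the Unix epoch

print(elapsed, elapsed.total_seconds())
print(start.year, start.hour)            # same attributes the dt accessor offers
```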

## Offset Alias Strings
* Used to round, group, select, and roll by amounts of time

![][4]

### Time Series Data
* A time series is a sequence of data collected over time, often with time increments equally spaced and unique
* Set the index to the datetime column when you have time series data to make operations easier. This formally creates a DatetimeIndex

#### Selecting with Time Series
* Use partial date matching to make selections when you have a DatetimeIndex
    * `df.loc['2014-12-5']` selects all data from Dec 5, 2014
    * `df.loc['2014-12']` selects all data from Dec 2014
    * `df.loc['2014']` selects all data from 2014
    * `df.loc['2014-12-5':'2015-2-12']` selects all data from Dec 5, 2014 to Feb 12, 2015 inclusive
    * `df.loc['2014-12':'2015']` selects all data from Dec 1, 2014 to Dec 31, 2015
    * Plain brackets with a single date string (`df['2014']`) no longer select rows in pandas 2.0 and later; use `loc`
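
A sketch of partial date matching through `loc` on a made-up DataFrame with a sorted DatetimeIndex:

```python
import pandas as pd

idx = pd.to_datetime([
    '2014-11-30 09:00', '2014-12-05 08:00', '2014-12-05 17:30',
    '2014-12-20 12:00', '2015-02-12 10:00',
])
df = pd.DataFrame({'value': [1, 2, 3, 4, 5]}, index=idx)

one_day = df.loc['2014-12-05']                 # every row on Dec 5, 2014
one_month = df.loc['2014-12']                  # every row in Dec 2014
one_year = df.loc['2014']                      # every row in 2014
span = df.loc['2014-12-05':'2015-02-12']       # both end dates are included

print(len(one_day), len(one_month), len(one_year), len(span))
```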
    
### Sampling Time Series
* Use `asfreq` to upsample/downsample with **offset alias** strings
* Use the **`resample`** method to group by a period of time. Use the **`on`** parameter to specify the date column if it's not in the index.
* **`resample`** is very similar to **`groupby`**. Chain the **`agg`** method to aggregate.
* Use offset aliases to specify the date grouping increment.
```
df.resample('M').agg({'col1':'sum'})
```
* **Anchored offset aliases** can shift the date group range. **W-FRI** anchors the week to end on Friday
* A DatetimeIndex makes it easier to select subsets of data. `df.loc['2014-5']` selects all rows in May, 2014.
* Calculate moving window statistics with the **`rolling`** method. This also works just like `groupby`
* The window can be either a date period (use offset aliases) or a fixed size window (use integer)
* Both **`resample`** and **`rolling`** have simpler syntax with Series that have a DatetimeIndex. `s.resample('M').sum()` and `s.rolling('5D').sum()`
* Group by date and another column in two ways:
    * Independently: `df.groupby('col1').resample('M').agg({'col2':'sum'})`
    * Together - must use `pd.Grouper` - think of it as just a dictionary holding data
        * `tg = pd.Grouper(freq='M')`
        * `df.groupby([tg, 'col1']).agg({'col2':'sum'})`
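
A minimal sketch of `resample` and `rolling` on a made-up daily Series. It uses day-based aliases (`'2D'`) rather than the month alias, since the spelling of the month-end alias differs between pandas versions.

```python
import pandas as pd

idx = pd.date_range('2014-12-01', periods=6, freq='D')
s = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

two_day_totals = s.resample('2D').sum()   # group into 2-day bins, then aggregate
time_window = s.rolling('2D').sum()       # moving window covering the last 2 days
fixed_window = s.rolling(3).sum()         # moving window of exactly 3 rows

print(two_day_totals.tolist())
print(time_window.tolist())
print(fixed_window.tolist())
```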

## Tidy Data
* Tidy data is a structure of data that makes data analysis easy
* Tidy data is defined as:
    * Each variable forms a column
    * Each observation forms a row
    * Each type of observational unit forms a table
* All other datasets are "messy"
* Messy data does not mean it is difficult to read
* Tidy data is simply a structure that makes it easy to do most other kinds of data analysis
* Identify variables and then transform your DataFrame

### Melting
* Used to make column names into variable values. Also called 'unpivoting' or 'stacking'
* The **`melt`** method has two main parameters
    * **`id_vars`** - Columns that you wish to keep vertical
    * **`value_vars`** - Columns that you wish to melt. Column names will all go in a single column. Values will go in another column.
* Not necessary to explicitly assign `value_vars`. By default, any columns not given to `id_vars` will be melted.
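
A sketch of `melt` on a made-up wide table whose year columns are really values. The `var_name` and `value_name` parameters are not covered above; they are used here only to name the two new columns.

```python
import pandas as pd

wide = pd.DataFrame({'name': ['Ann', 'Bob'], '2019': [1, 2], '2020': [3, 4]})

# 'name' stays vertical; the remaining columns are melted by default
tidy = wide.melt(id_vars='name', var_name='year', value_name='score')

print(tidy)
```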

### Pivoting
* The **`pivot`** method takes three parameters
    * **`index`** - column to keep vertical
    * **`columns`** - column name whose values will become new column names
    * **`values`** - column name whose values will be tiled over the intersection of the index and columns
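
The reverse direction, sketched on made-up tidy data:

```python
import pandas as pd

tidy = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Ann', 'Bob'],
    'year': ['2019', '2019', '2020', '2020'],
    'score': [1, 2, 3, 4],
})

# unique values of 'year' become the new column names
wide = tidy.pivot(index='name', columns='year', values='score')

print(wide)
```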

### Common Messy Datasets
1. Column names are values, not variable names.
1. Multiple variables are stored in one column.
1. Variables are stored in both rows and columns.
1. Multiple types of observational units are stored in the same table.
1. A single observational unit is stored in multiple tables

### Data Normalization
* A process to reduce data redundancy and increase data integrity
* If data is unnecessarily repeated, separate into own table
* Add primary key to uniquely identify each row
* Dimension tables hold non-event information that is more static, such as a store name and address
* Fact tables hold event and transaction information such as a sale within a store

## Regular Expressions

* Used to find patterns within text
* Miniature programming language
* Two different types of characters - literal and metacharacters (a.k.a special characters)
* Literal characters represent themselves
* Special or metacharacters represent something entirely different
* Primary usage of regex is to either match a particular string or extract a substring
* Many Pandas `str` accessor methods accept regular expressions
* You will often use **`contains`** and **`extract`** when doing regex work
* Use raw Python strings when writing regex. Raw strings have 'r' prepended to them.

### Metacharacter Summary
* All metacharacters - `. ^ $ * + ? { } [ ] \ | ( )`

#### The dot - `.`
* `.` - Matches any character except line breaks

#### Anchors - `^, $`
* `^` - Anchors next characters to beginning
    * `^My` matches strings that begin with 'My'
* `$` - Anchors previous characters to end
    * `Movie$` matches strings that end with 'Movie'

#### Quantifiers - `*, +, ?, {}`
* `*` - Matches 0 or more occurrences of previous character
* `+` - Matches 1 or more occurrences of previous character
* `?` - Matches 0 or 1 occurrences of previous character
* `{m}` - Matches exactly m of the previous character, 
* `{m,}` - Matches m or more of the previous character 
* `{,n}` - Matches up to n of the previous character 
* `{m,n}` - Matches between m and n repeats of the previous character

#### Character Sets
* `[]` - A character set to match one out of many characters. `[aeiou]` matches a single vowel
* `[a-z]`, `[A-Z]`, `[0-9]` - Character sets for lowercase, uppercase, and digits
* `[^abc]` - Use caret at beginning of bracket to match anything but these characters
* `\` - backslash changes meaning of next character
* `\s` - whitespace - single space, tab, new-line
* `\S` - non-whitespace
* `\w` - word character - lower/uppercase, digits, and underscore
* `\W` - non-word-character
* `\d` - digits
* `\D` - non-digits
* `\b` - word boundary - matches empty string between words, that is between `\w` and `\W`
* `\B` - non-word boundary
* `\.` - A backslash before a metacharacter escapes it so that it matches literally. `\.` matches a literal dot and `\*` matches a literal asterisk


#### Or Clause
* `|` - The pipe metacharacter represents the **or** clause. Matches when either left or right set of characters match. `cat|dog` matches either 'cat' or 'dog'

#### Grouping
* `()` - Groups together parts of regex like mathematical parentheses to achieve different operator precedence
* `()` - Also represents capture groups for extracting text.
* `(?:)` - A non-capturing group. The group won't be captured or be able to be referenced. Use this when you just want to group but don't want to capture. Example: `(?:His|Her) shoe` matches both 'His shoe' and 'Her shoe'
* `(?=)` - Positive lookahead - Example: `Batman(?=.*Robin)(?=.*Joker)` Matches strings that have 'Batman' followed by both 'Robin' and 'Joker' somewhere after it. No characters are consumed.
* `(?!)` - Negative lookahead - Example: `Beauty and the (?!.*Beast)` Matches strings that have 'Beauty and the ' with no 'Beast' anywhere after it
* `(?<=)` - Positive look-behind - Example: `(?<=bird)house` Matches the word 'house', but only when it is immediately preceded by 'bird'
* `(?<!)` - Negative look-behind - Example: `(?<!bird)house` Matches the word 'house' except when it is immediately preceded by 'bird'
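
The grouping and lookaround examples can be checked with Python's built-in `re` module, as sketched below with made-up test strings. The helper `matches` is hypothetical and exists only to keep the checks short.

```python
import re

def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return re.search(pattern, text) is not None

# Non-capturing group: either alternative followed by ' shoe'
print(matches(r'(?:His|Her) shoe', 'Her shoe'))

# Positive lookahead: both names must appear somewhere after 'Batman', in any order
print(matches(r'Batman(?=.*Robin)(?=.*Joker)', 'Batman vs Joker and Robin'))

# Negative lookahead: no 'Beast' may appear after 'Beauty and the '
print(matches(r'Beauty and the (?!.*Beast)', 'Beauty and the Beast'))

# Look-behind: 'house' preceded, or not preceded, by 'bird'
print(matches(r'(?<=bird)house', 'birdhouse'))
print(matches(r'(?<!bird)house', 'doghouse'))
```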

## Joining
* Use the function `pd.concat` to stack multiple DataFrames on top of each other or side-by-side. Pandas automatically aligns on the index.
* Use sqlalchemy to make a connection to a SQL database
* The `merge` method does sql-style joins.
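
A short sketch of `pd.concat` and `merge` on two made-up DataFrames that share a `key` column (all names are placeholders):

```python
import pandas as pd

left = pd.DataFrame({'key': [1, 2, 3], 'name': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [2, 3, 4], 'score': [20, 30, 40]})

stacked = pd.concat([left, left], ignore_index=True)  # on top of each other
side_by_side = pd.concat([left, right], axis='columns')  # aligned on the index

inner = left.merge(right, on='key', how='inner')  # keys present in both
outer = left.merge(right, on='key', how='outer')  # keys present in either

print(inner)
print(outer)
```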

## Visualization

* For any numerical Series: `s.plot()` will produce a line plot by default with index as x-axis and values as y-axis.
* Use parameter `kind=` to change plot type to `hist, bar, barh, kde, pie, box, density, area`
* `linestyle` (ls) - Pass a string of one of the following ['--', '-.', '-', ':']
* `color` (c) - Can take a string of a named color, a string of the hexadecimal characters or a rgb tuple with each number between 0 and 1. [Check out this really good stackoverflow post to see the colors](http://stackoverflow.com/questions/22408237/named-colors-in-matplotlib)
* `linewidth` (lw) - controls thickness of line. Default is 1
* `alpha` - controls opacity with a number between 0 and 1
* `figsize` - a tuple used to control the size of the plot. (width, height) 
* `legend` - boolean to control legend
* Change plotting template with `plt.style.use('ggplot')`. See all templates with `plt.style.available`

[1]: 01.%20Selecting%20Subsets%20of%20Data/images/series_components.png
[2]: 01.%20Selecting%20Subsets%20of%20Data/images/df_components.png
[3]: 03.%20Grouping/images/split-apply-combine.png
[4]: 04.%20Time%20Series/images/offsetalias.png