# Pandas Cheat Sheet
This notebook provides a summary of all the Pandas commands in all of the notebooks in this repository. The most important and useful commands will be presented with as little code as possible.

# Getting Help
* The documentation is excellent - http://pandas.pydata.org/pandas-docs/stable/
* [Pandas Cookbook](https://www.amazon.com/Pandas-Cookbook-Ted-Petrou/dp/1784393878) - Book from Ted Petrou. Detailed recipes that cover the entire Pandas API.
* [Tagged pandas questions on stackoverflow](https://stackoverflow.com/questions/tagged/pandas)
* Shift + Tab + Tab - Opens up the docstring when cursor on attribute/method
* ? - enter at the end of your attribute/method to bring up docstring
* ?? - enter at the end of attribute/method to bring up the source code

# Python
* Everything in Python is an object except the syntactical structures such as **`for`**, **`if`**, and operators.
* Every object has a type
* The **`type`** function returns the object's type.
* Always know the type of object you have
* All objects have attributes (descriptions) and methods (actions) which unleash all of their abilities
* Methods must be executed with `()`
* Use **`dir`** function to uncover all attributes and methods
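
The points above can be tried on any object. The short sketch below uses a plain Python list (an arbitrary choice for illustration) to show `type`, `dir`, and the difference between referencing a method and executing it with `()`.

```python
x = [3, 1, 2]

# type returns the object's type
print(type(x))

# dir lists every attribute and method name available on the object
names = dir(x)
print([n for n in names if not n.startswith('_')])

# Without parentheses you only reference the method; with them it runs
method = x.sort      # a bound method object, nothing has been sorted yet
x.sort()             # executes the method and sorts the list in place
print(x)
```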

# Pandas
* Name derived from panel + data
* Excellent at tabular data analysis.
* Relies heavily on NumPy to store data and do calculations.
* DataFrame is primary data structure - two dimensional
* Series is one column of data

## Data Types
Every column of a Pandas DataFrame must be of a particular data type. Each value in the column must be of that data type.

* Boolean
* Integer
* Float
* Object (can be any Python object but is mainly strings)
* Datetime
* Timedelta
* Period
* Categorical

## Missing Values
* NaN - not a number, found only in float or object columns
* None - Python object **`None`** - found only in object columns
* NaT - not a time, found only in Datetime, Timedelta, and Period columns
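* The default integer and boolean data types cannot hold missing values. Adding a missing value converts the column to float or object. (The optional nullable `Int64` and `boolean` extension types use **`pd.NA`** instead.)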
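
A minimal sketch of how the missing value markers show up in practice. The Series names and values are made up for illustration, and the object column is requested explicitly with `dtype=object`.

```python
import numpy as np
import pandas as pd

floats = pd.Series([1.5, np.nan, 3.0])                     # NaN in a float column
objects = pd.Series(['a', None, 'c'], dtype=object)        # None in an object column
dates = pd.Series(pd.to_datetime(['2020-01-01', None]))    # NaT in a datetime column

# Integer data that receives a missing value is stored as float instead
ints = pd.Series([1, 2, None])

print(floats.dtype, objects.dtype, dates.dtype, ints.dtype)
print(dates.isna().tolist())
```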

## Series
![](Basics/images/Components of a Series.png)
* One dimensional object
* Two components to a Series - the **index** and the **data (values)**
* Get the values: **`s.values`** - returns a NumPy array
* Get the index: **`s.index`** - returns a Pandas Index object - default index is a **`RangeIndex`**
* Get in the habit of using **`head, tail`** methods to shorten long output

### Selecting subsets of Series 

* There are two main indexers - **`.loc`** for labels and **`.iloc`** for integer location
* Both indexers can take either a **scalar**, a **list**, or **slice notation**
* **`s.loc['label1']`** - scalar that selects a single item in Series
* **`s.loc[['label1', 'label5']]`** - use a list to select disjoint items
* **`s.loc['start':'stop':step]`** - use a slice to select from start to stop inclusive
* **`s.iloc[integer1]`** - scalar that selects a single item in Series
* **`s.iloc[[integer1, integer2]]`** - use a list to select disjoint items
* **`s.iloc[start:stop:step]`** - use a slice to select from start up to, but not including, stop (same as Python lists)
* **`.ix`** was deprecated and then removed in pandas 1.0. Do NOT use it.
* **`s[label]`** works for both integer and label location. It is ambiguous. Avoid if possible.
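* **Automatic alignment of the index** - be careful when operating with two Series at the same time. They align on the union of their index labels first and then complete the calculation. Labels found in only one Series produce missing values, and duplicated labels are matched in every combination (a Cartesian product).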
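
The selection rules above can be sketched with a small Series. The labels and values are invented for illustration. Note that the label slice includes its stop label, while the integer slice leaves the stop position out.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

by_label = s.loc['b']               # scalar -> a single value
label_list = s.loc[['a', 'e']]      # list -> disjoint items
label_slice = s.loc['b':'d']        # label slice keeps 'b', 'c' and 'd'

by_position = s.iloc[1]             # scalar -> a single value
position_list = s.iloc[[0, 4]]      # list -> disjoint items
position_slice = s.iloc[1:3]        # integer slice keeps positions 1 and 2 only

print(label_slice)
print(position_slice)
```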

#### Boolean Indexing

* Boolean indexing works by passing in a Series, or other sequence of booleans to the brackets. Only values that are True remain in the Series.
* Must create a Series of booleans first. Sometimes save this to a variable **`criteria`**
*  **`s[criteria]`**  does the boolean selection
* Use `& | ~` for **and, or, not** to create multiple and complex boolean conditions. 
* Wrap each condition in parentheses
* Criteria example: **`criteria = ((s > 5) | (s < -2)) & (s % 2 == 0)`**
* Use **`s.isin([1,2,3])`** instead of **`(s == 1) | (s == 2) | (s == 3)`**
* **`s.isna()`** returns a boolean for every value indicating whether it is missing
* Use **`s.between(left, right)`** to replace **`(s >= left) & (s <= right)`**
* It's OK to use just the brackets for boolean selection.
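
Here is a small sketch of boolean selection using the example criteria from the list above. The Series values are made up so that each condition matches at least one element.

```python
import numpy as np
import pandas as pd

s = pd.Series([-4, -1, 2, 6, 7, 8, np.nan])

# Each condition is wrapped in parentheses and combined with | and &
criteria = ((s > 5) | (s < -2)) & (s % 2 == 0)
selected = s[criteria]

in_set = s[s.isin([2, 6])]          # replaces (s == 2) | (s == 6)
in_range = s[s.between(-1, 6)]      # replaces (s >= -1) & (s <= 6)
not_missing = s[~s.isna()]          # ~ negates a boolean Series

print(selected)
```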

### Common Series attributes

* **`s.index`**
* **`s.values`**
* **`s.size`** - can also find number of Series elements with **`len(s)`**
* **`s.dtype`**

### Series Vectorized Math Operations
* **`+, -, *, /, //, %, **`** - all apply single value to all values of Series
* **`s + 5`** adds 5 to every value in the Series

### Series Vectorized Comparison Operations
* **`<, >, <=, >=, ==, !=`** - applies condition to each value in Series - returns boolean Series
* **`s > 5`** - compares every value in Series to 5 and returns a Series of booleans

## Series Descriptive Statistical Methods

### Aggregation methods
An aggregation is defined as a function that returns a **single** value

* **`s.sum`**
* **`s.min`**
* **`s.max`**
* **`s.median`**
* **`s.mean`**
* **`s.count`** - counts non-missing values
* **`s.std`**
* **`s.var`**
* **`s.quantile(q=.5)`** - percentile of distribution
* **`s.describe`** - returns most of the above aggregations in one Series
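
A quick sketch of a few aggregations on a made-up Series that contains one missing value. Each call collapses the Series to a single value, and missing values are skipped by default.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, None])

total = s.sum()               # missing values are ignored
average = s.mean()
non_missing = s.count()       # counts only non-missing values
middle = s.quantile(q=.5)     # the 50th percentile equals the median
summary = s.describe()        # several aggregations returned as one Series

print(summary)
```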

### Non-Aggregation methods
* **`s.abs`** - takes absolute value
* **`s.round`** - round to the nearest given decimal
* **`s.cummin`** - cumulative minimum
* **`s.cummax`** - cumulative maximum
* **`s.cumsum`** - cumulative sum
* **`s.rank`** - rank values in a variety of different ways
* **`s.diff`** - difference between one element and another
* **`s.pct_change`** - percent change from one element to another
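
In contrast to aggregations, the methods above return a Series of the same length as the original. A brief sketch with invented values:

```python
import pandas as pd

s = pd.Series([3, 1, 4, 1, 5])

running_total = s.cumsum()    # cumulative sum
running_max = s.cummax()      # cumulative maximum
change = s.diff()             # difference from the previous element
ranks = s.rank()              # ties share the average rank by default
pct = s.pct_change()          # percent change from the previous element

print(pd.DataFrame({'s': s, 'cumsum': running_total, 'diff': change, 'rank': ranks}))
```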

### Missing Value methods
* **`s.isna`** - Returns a Series of booleans based on whether each value is missing or not. **`isnull`** is an alias
* **`s.notna`** - Exact opposite of **`isna`**
* **`s.fillna`** - fills missing values in a variety of ways
* **`s.dropna`** - Drops the missing values from the Series
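
The missing value methods can be sketched on a small made-up Series. The fill strategies shown (a constant, the mean, and carrying the last value forward with `ffill`) are just three of the options.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan])

missing = s.isna()                   # True where the value is missing
present = s.notna()                  # exact opposite of isna
filled_zero = s.fillna(0)            # fill with a constant
filled_mean = s.fillna(s.mean())     # fill with the mean of the non-missing values
forward = s.ffill()                  # carry the last valid value forward
dropped = s.dropna()                 # remove the missing values

print(filled_mean)
```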

### More Series methods

* **`s.idxmin`**, **`s.idxmax`** - returns the index label of the minimum/maximum value
* **`s.unique`** - returns a NumPy array of unique values
* **`s.nunique`** - returns the number of unique values
* **`s.drop_duplicates`** - returns a Series of the first unique values by default
* **`s.sort_values`** - Sorts from least to greatest by default. Use **`ascending=False`**
* **`s.sort_index`** - Sorts index of Series
* **`s.sample`** - Randomly samples Series
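
A short sketch of these methods on a made-up Series with one repeated value. The `random_state` argument is used only to make the sample repeatable.

```python
import pandas as pd

s = pd.Series([3, 1, 3, 2], index=['a', 'b', 'c', 'd'])

low_label = s.idxmin()                       # index label of the minimum
high_label = s.idxmax()                      # label of the first maximum
distinct = s.unique()                        # NumPy array of unique values
n_distinct = s.nunique()                     # number of unique values
first_seen = s.drop_duplicates()             # keeps the first occurrence of each value
descending = s.sort_values(ascending=False)  # greatest to least
by_label = s.sort_index()                    # sorted by index label
picked = s.sample(n=2, random_state=0)       # two randomly chosen elements

print(descending)
```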

### Series `value_counts` - very important method
* **`value_counts`** - returns the sorted frequency of each unique value in a Series. Use **`normalize=True`** to return proportions
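
A minimal illustration of `value_counts`, using an invented Series of color names:

```python
import pandas as pd

s = pd.Series(['red', 'blue', 'red', 'green', 'red', 'blue'])

counts = s.value_counts()                       # most frequent value first
proportions = s.value_counts(normalize=True)    # fractions that sum to 1

print(counts)
print(proportions)
```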

### Series `str` accessor
* Only available to Series that have string data
* Methods overlap with Python strings
* Learn regular expressions for more powerful searching and extracting
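
The sketch below shows a few `str` accessor methods on made-up names, including one regular expression. The pattern `(\w+)$` is just an illustrative way to capture the last word.

```python
import pandas as pd

s = pd.Series(['Alice Smith', 'bob jones', None])

upper = s.str.upper()                                         # same idea as Python's str.upper
has_smith = s.str.contains('smith', case=False, na=False)     # na=False treats missing as no match
first_word = s.str.split(' ').str[0]                          # index into each resulting list
last_word = s.str.extract(r'(\w+)$', expand=False)            # regex capture group

print(pd.DataFrame({'upper': upper, 'first': first_word, 'last': last_word}))
```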

### Series `dt` accessor
* Only available to Series with datetime, timedelta, or period data
* Mostly simple attributes that retrieve particular information about datetime such as year, hour, month, weekday, etc...
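
A brief sketch of the `dt` accessor on two made-up datetimes:

```python
import pandas as pd

s = pd.Series(pd.to_datetime(['2021-03-15 08:30', '2022-12-31 23:59']))

years = s.dt.year
months = s.dt.month
hours = s.dt.hour
weekdays = s.dt.weekday        # Monday is 0
day_names = s.dt.day_name()    # a method, so it needs parentheses

print(pd.DataFrame({'year': years, 'hour': hours, 'day': day_names}))
```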

## DataFrame

![](Basics/images/Components of a DataFrame.png)
* Two dimensional - rows and columns
* Three main components - **index**, **columns** and **data (values)**
* The index labels the rows and the columns label the columns
* Uses an Index object for both the rows and the columns
* The row Index is simply called the index. The column index is called the columns.
* Access Index with **`df.index`**, the columns with **`df.columns`**, and the values with **`df.values`**
* Most common way to create a DataFrame is with **`pd.read_csv('file_name.csv')`**

### Common DataFrame Attributes

* **`df.index`**
* **`df.columns`**
* **`df.values`**
* **`df.shape`** - tuple of (rows, columns)
* **`df.size`** - Total elements in DataFrame - rows x columns
* **`df.dtypes`** - Returns each data type of a column as a Series

### Selecting subsets of DataFrames

* Indexing operator is different for DataFrames than Series. It selects a column or columns.
* **`df[col1]`** selects a single column as a Series
* **`df.col1`** also selects a single column, but do NOT use it. It doesn't work for column names that contain spaces or that clash with an existing DataFrame attribute or method
* **`df[[col1, col2, col3]]`** selects multiple columns. Notice the inner list
*  **`df[[col]]`** selects a one column DataFrame
* **`.iloc`** uses integer location and **`.loc`** uses index labels.
* The **`.iloc`** and **`.loc`** indexers make selections by just rows, just columns, or simultaneously rows and columns:
    * **`df.loc[row_selection]`** - Selects just rows
    * **`df.loc[:, col_selection]`** - Selects just columns
    * **`df.loc[row_selection, col_selection]`** - Selects rows and columns simultaneously. The **comma** separates the row selection from the column selection.
* **`.iloc`** and **`.loc`** accept scalars, lists and slice notation. 
* **`df.loc[label]`** selects a single row of data as a Series
* **`df.loc[[label, label2]]`** selects multiple rows as a DataFrame
* **`df.loc[start:stop:step]`** - slice notation selects multiple rows as a DataFrame
* **`.iloc`** works the same way except with integers
* **`df.loc[label1:label2:3, ["col1", "col2"]]`** uses slice notation for the rows and a list for the columns
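
The selection patterns above can be sketched on a tiny DataFrame. The column names, labels, and values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'col1': [1, 2, 3, 4], 'col2': [10, 20, 30, 40], 'col3': ['w', 'x', 'y', 'z']},
    index=['a', 'b', 'c', 'd'],
)

one_col = df['col1']                          # a Series
one_col_frame = df[['col1']]                  # a one column DataFrame
two_cols = df[['col1', 'col3']]               # inner list selects several columns

one_row = df.loc['b']                         # a single row as a Series
some_rows = df.loc[['a', 'c']]                # several rows as a DataFrame
only_cols = df.loc[:, ['col1', 'col2']]       # just columns
both = df.loc['a':'c', ['col1', 'col2']]      # label slice for rows, list for columns
by_position = df.iloc[0:2, [0, 2]]            # same idea with integer locations

print(both)
```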

#### Boolean Indexing for DataFrames
* Boolean indexing works similarly as it does for Series. First, create a boolean Series (or array) and pass this to the brackets.
* Usually the boolean Series is created by using comparison operators with columns in the DataFrame
* Example Criteria: `criteria = ((df['col1'] > 5) | (df['col2'] < -2)) & (df['col3'] % 2 == 0)`
* **`df[criteria]`** selects all the columns for the rows that are true
* Use **`.loc`** to simultaneously do boolean selection on the rows while selecting particular columns
* Example: **`df.loc[criteria, [col1, col2]]`** does boolean selection on the rows while simultaneously selecting columns with a list.
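
A small sketch of the example criteria applied to a made-up DataFrame. Only the second and fourth rows satisfy it.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 6, 8, 3],
    'col2': [0, -5, 4, -3],
    'col3': [2, 4, 5, 6],
})

criteria = ((df['col1'] > 5) | (df['col2'] < -2)) & (df['col3'] % 2 == 0)

all_columns = df[criteria]                         # every column for the matching rows
some_columns = df.loc[criteria, ['col1', 'col2']]  # matching rows, chosen columns

print(some_columns)
```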

### DataFrame Vectorized Math Operations
* **`+, -, *, /, //, %, **`** - all apply single value to all values of DataFrame
* **`df + 5`** adds 5 to every value in the DataFrame
* Error will occur if operation does not work with every single data type (i.e. adding 5 to a string column)
* **`select_dtypes`** - pass string or list of strings of data types to select. Use strings 'int', 'float', 'bool', 'object', 'datetime', 'timedelta', 'category'
* Use string 'number' to select both int and float
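
One way to avoid the data type error mentioned above is to select the numeric columns first. A minimal sketch with invented column names:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [1.5, 2.5], 'c': ['x', 'y']})

numeric = df.select_dtypes('number')              # keeps the int and float columns
non_numeric = df.select_dtypes(exclude='number')  # everything else

shifted = numeric + 5                             # safe: every column is numeric

print(shifted)
```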

### DataFrame Vectorized Comparison Operations
* **`<, >, <=, >=, ==, !=`** - applies condition to each value in DataFrame - returns DataFrame of all booleans
* **`df > 5`** - compares every value in DataFrame to 5.
* Take care when working with entire DataFrames to ensure the operation makes sense (i.e. comparing a string column to a number)

### Basics of DataFrame Methods 
* DataFrames and Series share about 90% of their methods
* DataFrame methods differ from Series methods because they (usually) have an `axis` parameter that controls whether the action will take place up and down the columns or left to right across the rows.
* **`axis`** equal to **`index`** (or **`0`**) will perform the action down the columns. **`axis`** equal to **`columns`** (or **`1`**) will perform the action across the rows. Default is **`axis=0`**
* Use the string names for the axes ('index', and 'columns') instead of the integers 0 and 1 as the string names are more explicit.
* **`info`** method returns lots of metadata from DataFrame
* All the descriptive statistical methods are the same as Series and by default operate down each column.
* **`describe`** method has the **`include`** parameter which accepts a string or list of strings for data types that you would like to find summary statistics for
* **`rename`** - pass a dictionary of old, new key-value pairs to the **`columns`** parameter to rename columns
* **`drop`** - pass string or list of strings to **`columns`** parameter to drop columns
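
The `axis` parameter, `rename`, and `drop` can be sketched on a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})

down_columns = df.sum(axis='index')      # one result per column (the default)
across_rows = df.sum(axis='columns')     # one result per row

renamed = df.rename(columns={'x': 'x_new'})   # dictionary of old name -> new name
dropped = df.drop(columns='y')                # string or list of strings

df.info()                                # prints metadata about the DataFrame
print(down_columns)
print(across_rows)
```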

### Specifics on certain DataFrame methods
* **`sort_values`** - must pass a string or list of strings to **`by`** parameter to sort columns. Pass **`ascending`** list of booleans that correspond with **`by`** columns to control direction of sort.
* **`drop_duplicates`** and **`dropna`** have a **`subset`** parameter. Pass columns to it to limit the functionality to just those columns.
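
A short sketch of these parameters, using invented sales data:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['B', 'A', 'B', 'A'],
    'year': [2020, 2020, 2021, 2021],
    'sales': [5.0, None, 7.0, 3.0],
})

# city ascending, then year descending within each city
ordered = df.sort_values(by=['city', 'year'], ascending=[True, False])

# Only the city column is considered when looking for duplicates
one_per_city = df.drop_duplicates(subset='city')

# Only rows with a missing sales value are dropped
with_sales = df.dropna(subset=['sales'])

print(ordered)
```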

### Reading in Data
* Many datasets will be stored in CSVs. Read them in by passing the file location to **`pd.read_csv`**
* Set the index column on read with parameter **`index_col`**
* Set the **`parse_dates`** parameter to a list of the Datetime column names to convert them on read
* Set the index column after read with **`df.set_index('column name')`**
* A good choice for index is a column that uniquely identifies each row, although indexes are allowed to have duplicates
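
The sketch below reads a tiny CSV from an in-memory string so that it runs without a file on disk. The column names and values are made up, and a real call would pass a file path instead of the `StringIO` object.

```python
import io
import pandas as pd

csv_text = "id,date,amount\n1,2021-01-05,10.5\n2,2021-02-10,7.0\n"

# Set the index and convert the date column while reading
df = pd.read_csv(io.StringIO(csv_text), index_col='id', parse_dates=['date'])

# Or read first and set the index afterwards
df2 = pd.read_csv(io.StringIO(csv_text)).set_index('id')

print(df.dtypes)
```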

### Automatic alignment of the Index
* Pandas has the surprising feature of automatically aligning the index and columns when operating with two Series/DataFrames at the same time
* For example adding two Series - `s1 + s2` or two DataFrames `df1 + df2`
* Both the index and columns align first before any operation is done.
* They align on the union of the index and column labels. When labels are duplicated, every matching pair is combined, which forms a Cartesian product
* Index and Column labels unique to either Series/DataFrame will remain in the result with a missing value
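
A small sketch of alignment with made-up Series. The `add` method with `fill_value` is shown as one possible way to avoid the missing values, and the last example shows how duplicated labels multiply the number of rows.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20], index=['b', 'd'])

total = s1 + s2                       # only 'b' exists in both Series
filled = s1.add(s2, fill_value=0)     # treat a label missing from one side as 0

# Duplicated labels: each 'x' on the left pairs with each 'x' on the right
left = pd.Series([1, 2], index=['x', 'x'])
right = pd.Series([10, 20, 30], index=['x', 'x', 'y'])
product = left + right

print(total)
print(product)
```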


## Split - Apply - Combine (a.k.a Grouping)

* Split - Splits your data into specific groups
* Apply - Apply a function to each of your groups
* Combine - Put the results of the apply function back together

The most common type of function to apply to each group is one that **aggregates**. This pattern will always have three components:

* **grouping columns** - Each unique combination forms a group
* **aggregating columns** - Values of these columns are going to be aggregated into a single number
* **aggregating functions** - Determines how the aggregation will happen - **`sum, mean, median`**, etc...

A popular syntax for grouping a single column, aggregating a single column, and applying a single aggregation function is:

```
>>> df.groupby('<grouping column>').agg({'<aggregating column>':'<aggregating function>'})
```

* Use a list if you would like to use more than one grouping column.
* To have additional aggregating columns, add them to the dictionary inside of **`agg`**
* Use a list as the value in the dictionary to use more than one aggregating function
* There are many built-in aggregating functions. Use their string name. Some examples are `min, max, sum, mean, count, size, std` and more.
* To prevent the grouping columns going into the index use parameter `as_index=False` in the groupby method or call the **`reset_index`** method after.
* You can write your own custom aggregation function if it does not exist in Pandas.
* Custom aggregation functions have poor performance. Avoid if possible.
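
The grouping variations listed above can be sketched on a small made-up DataFrame. The custom function (the range of each group) is only an example of something a user might write.

```python
import pandas as pd

df = pd.DataFrame({
    'state': ['TX', 'TX', 'CA', 'CA', 'CA'],
    'sales': [1, 2, 3, 4, 5],
})

# One grouping column, one aggregating column, one aggregating function
totals = df.groupby('state').agg({'sales': 'sum'})

# A list of functions produces one result column per function
several = df.groupby('state').agg({'sales': ['sum', 'mean']})

# Keep the grouping column as a regular column instead of the index
flat = df.groupby('state', as_index=False).agg({'sales': 'sum'})

# A custom aggregation function: works, but is slower than the built-ins
spread = df.groupby('state').agg({'sales': lambda x: x.max() - x.min()})

print(several)
```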

## Time Series

#### Timestamps (datetimes) and Timedeltas
* A date is only year, month, day
* A time is only hour, minute, second, part of second
* A datetime is a date and a time combined
* Pandas has the **`Timestamp`** data type, a powerful datetime object. It represents a specific moment in time
* In pandas when we talk about Timestamps and Datetimes, we are talking about the same thing. Confusing!
* Create Timestamps with the **`to_datetime`** function. It converts a wide variety of strings. It also converts numbers, treating them as a count of units (set with the **`unit`** parameter) since the Unix epoch of Jan 1, 1970.
* Pandas has the **`Timedelta`** data type for amounts of time. e.g. 5 hours 36 minutes and 10 seconds
* The **`to_timedelta`** function converts strings and numbers to Timedeltas
* Subtracting two Timestamps creates a Timedelta
* Both Timestamps and Timedeltas have the same (and more) attributes and methods that the Series **`dt`** accessor has
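
A minimal sketch of creating Timestamps and Timedeltas. The specific dates and amounts are arbitrary.

```python
import pandas as pd

ts = pd.to_datetime('2021-03-15 08:30')       # string -> Timestamp
from_number = pd.to_datetime(1, unit='D')     # 1 day after the Unix epoch

td = pd.to_timedelta('05:36:10')              # 5 hours 36 minutes 10 seconds
two_hours = pd.to_timedelta(2, unit='h')      # number plus a unit

# Subtracting two Timestamps creates a Timedelta
gap = pd.Timestamp('2021-03-16') - ts

print(ts.year, ts.day_name(), td.components.hours, gap)
```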

#### Time Series
* A time series is a sequence of data collected over time, often with time increments equally spaced
* Use the **`resample`** method to group by a period of time. Use the **`on`** parameter to specify the date column if it's not in the index.
* **`resample`** is very similar to **`groupby`**. Chain the **`agg`** method to aggregate.
* Use offset aliases to specify the date grouping increment.
* Anchored offset aliases can shift the date group range
* Putting a datetime column in the index creates a DatetimeIndex.
* A DatetimeIndex makes it easier to select subsets of data. `df.loc['2014-05']` selects all rows in May, 2014.
* Calculate moving window statistics with the **`rolling`** method.
* The window can be either a date period (use offset aliases) or a fixed size window (use integer)
* **`rolling`** also operates similarly to **`groupby`**. Chain **`agg`** to aggregate
* Both **`resample`** and **`rolling`** have simpler syntax with Series that have a DatetimeIndex. `s.resample('M').sum()` and `s.rolling('5D').sum()`. Note that pandas 2.2 and later spell the month-end alias `'ME'` instead of `'M'`.
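
The sketch below applies `resample` and `rolling` to six days of made-up daily data. The two day grouping and the window sizes are arbitrary choices for illustration.

```python
import pandas as pd

idx = pd.date_range('2021-01-01', periods=6, freq='D')
s = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

two_day_totals = s.resample('2D').sum()    # group into 2 day bins
fixed_window = s.rolling(3).sum()          # fixed size window of 3 rows
time_window = s.rolling('2D').sum()        # window covering the last 2 days

january = s.loc['2021-01']                 # partial string selects the whole month

# With the dates in a column, point resample at it with the on parameter
df = s.rename('value').rename_axis('date').reset_index()
from_column = df.resample('2D', on='date').agg({'value': 'sum'})

print(two_day_totals)
```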

## Tidy Data
* Tidy data is a structure of data that makes data analysis easy
* Tidy data is defined as:
    * Each variable forms a column
    * Each observation forms a row
    * Each type of observational unit forms a table
* All other datasets are "messy"
* Messy data does not mean it is difficult to read
* Tidy data is simply a structure that makes it easy to do most other kinds of data analysis

### Melting
* The **`melt`** method has two main parameters
    * **`id_vars`** - Columns that you wish to keep vertical
    * **`value_vars`** - Columns that you wish to melt. Column names will all go in a single column. Values will go in another column.
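
A brief sketch of `melt` on a made-up table whose column headers are years. The `var_name` and `value_name` arguments are optional and only rename the two new columns.

```python
import pandas as pd

wide = pd.DataFrame({'name': ['Ann', 'Bob'], '2020': [1, 2], '2021': [3, 4]})

long = wide.melt(
    id_vars='name',                # stays vertical
    value_vars=['2020', '2021'],   # column names go into one column, values into another
    var_name='year',
    value_name='score',
)

print(long)
```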

### Pivoting
* The **`pivot`** method takes three parameters
    * **`index`** - column to keep vertical
    * **`columns`** - column name whose values will become new column names
    * **`values`** - column name whose values will be tiled over the intersection of the index and columns
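
A minimal sketch of `pivot`, reversing a melted table like the one described in the previous section. The data is made up.

```python
import pandas as pd

long = pd.DataFrame({
    'name': ['Ann', 'Ann', 'Bob', 'Bob'],
    'year': ['2020', '2021', '2020', '2021'],
    'score': [1, 3, 2, 4],
})

wide = long.pivot(index='name', columns='year', values='score')

print(wide)
```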
    
### Pivot Table
* The **`pivot_table`** method is very similar to **`pivot`**, but will aggregate all the common values in the intersection of **`index`** and **`columns`**.
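
In the made-up data below, Texas has two rows for 2020, so `pivot` would raise an error, while `pivot_table` aggregates them. The mean is used here as an example aggregation.

```python
import pandas as pd

df = pd.DataFrame({
    'state': ['TX', 'TX', 'TX', 'CA'],
    'year': [2020, 2020, 2021, 2020],
    'sales': [1, 3, 5, 7],
})

table = df.pivot_table(index='state', columns='year', values='sales', aggfunc='mean')

print(table)
```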

### Common Messy Datasets
1. Column headers are values, not variable names.
1. Multiple variables are stored in one column.
1. Variables are stored in both rows and columns.
1. Multiple types of observational units are stored in the same table.
1. A single observational unit is stored in multiple tables

### Data Normalization
* A process to reduce data redundancy and increase data integrity
* If data is unnecessarily repeated, separate into own table
* Add primary key to uniquely identify each row
* Dimension tables hold non-event information that is more static, such as a store name and address
* Fact tables hold event and transaction information such as a sale within a store

## Visualization

* For any numerical Series: `s.plot()` will produce a line plot by default with index as x-axis and values as y-axis.
* Use parameter `kind=` to change plot type to `hist, bar, barh, kde, pie, box, density, area`
* `linestyle` (ls) - Pass a string of one of the following ['--', '-.', '-', ':']
* `color` (c) - Can take a string of a named color, a string of the hexadecimal characters or a rgb tuple with each number between 0 and 1. [Check out this really good stackoverflow post to see the colors](http://stackoverflow.com/questions/22408237/named-colors-in-matplotlib)
* `linewidth` (lw) - controls thickness of line. Default is 1.5 in matplotlib 2.0 and later
* `alpha` - controls opacity with a number between 0 and 1
* `figsize` - a tuple used to control the size of the plot. (width, height) 
* `legend` - boolean to control legend
* Change plotting template with `plt.style.use('ggplot')`. See all templates with `plt.style.available`

