## Data I/O:

In [3]:
import pandas as pd

In [37]:
df = pd.DataFrame({"a":[1,2,3], "b":['a','b','c'], "c":[1,2,3], "d":[1,2,3]})

In [71]:
df.filter(like="A", axis=1)

0
1
2


### Read/Write from Files:

**General syntactical guideline:**
- **Read:** `pd.read_format`
- **Write:** `df.to_format`

**Raw Files**

- `pd.read_csv()`
    - Required Argument - File name relative/absolute including path
    - Optional Arguments
        - **sep or delimiter - column delimiter**
        - **dtype - dict of column to type**
        - **low_memory - boolean (results in lower memory use while parsing)**
        - names - list of column names
        - index_col - Columns to use as row labels (column number or sequence)
        - nrows - number of lines to read (in case of large files)
        - na_values - additional strings to identify NAs (can be dict column to na string)
        - na_filter - boolean (detect missing values)
        - parse_dates - boolean or list of columns **(May throw memory errors for large datasets; instead use `pd.to_datetime` after loading)**
        - skiprows - rows to skip, can be a lambda function as well
        - skipfooter - number of rows to skip at the end
        - prefix - prefix to add to column numbers when no header available (deprecated in pandas 1.4 and removed in 2.0)
        - decimal - character to use as decimal separator
        - thousands - thousands separator (default None)
        - compression - 'infer' file compression or file compression type specifier
        
    - **`pd.read_table()`** is same as read_csv with tab as default delimiter 
    

- `df.to_csv()`
    - Required Argument - None
    - Optional Arguments
        - **path_or_buf - File name relative/absolute including path** (returns the CSV text as a string if none provided)
        - **sep - column delimiter**
        - **na_rep - how to represent NAs, default is blank**
        - header - flag to include header in the output
        - index - flag to include index in the output
        - compression/decimal - same as read
    
    
- `pd.read_excel()`
    - Required Argument - File name including path
    - Important Optional Argument
        - **sheet_name - integer/string or list of integers/strings with name of sheets to import**
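
A minimal sketch of a CSV round trip using a few of the arguments listed above. The in-memory text, the column names and the `missing` NA marker are made up for illustration; a file path works the same way in place of the `StringIO` buffer.

```python
import io

import pandas as pd

# Hypothetical pipe-delimited data; "missing" is a custom NA marker.
raw = (
    "id|price|day\n"
    "1|10.5|2020-01-01\n"
    "2|missing|2020-01-02\n"
    "3|7.25|2020-01-03\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    sep="|",                 # column delimiter
    dtype={"id": "int32"},   # dict of column to type
    na_values=["missing"],   # extra strings to treat as NA
    nrows=2,                 # read only the first two data rows
)

# Convert dates after loading instead of using parse_dates.
df["day"] = pd.to_datetime(df["day"])

# With no path given, to_csv returns the CSV text as a string.
csv_text = df.to_csv(sep=",", na_rep="NULL", index=False)
print(csv_text)
```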


**Binary Formats**

- `pd.read_feather()`
    - Fast binary format for reading and writing columnar data
    - Required Argument - File path
    - Optional Argument - use_threads (boolean; whether to parallelize reading across multiple threads; replaces the older `nthreads`)


- `pd.read_pickle()`
    - Required Argument - File path
    - Optional Argument - compression
    

- `pd.read_parquet()`
    - Required Argument - File path
    - Aside - Parquet format lacks built in support for categorical data. Optimized for IO constrained scan-oriented use cases.
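
A small sketch of a binary round trip, using pickle because it needs no extra dependencies (feather and parquet follow the same read/write pattern but require an engine such as `pyarrow`). The file name and the sample frame are hypothetical.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": pd.Categorical(["x", "y", "x"])})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame.pkl.gz")  # hypothetical file name
    df.to_pickle(path, compression="gzip")
    restored = pd.read_pickle(path, compression="gzip")

# Pickle preserves dtypes, including category.
same = restored.equals(df)
kept_category = isinstance(restored["b"].dtype, pd.CategoricalDtype)
print(same, kept_category)
```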


**SQL**
- `pd.read_sql()`
    - Required Arguments - Query and connection object
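
A minimal sketch of `pd.read_sql()` against an in-memory SQLite database from the standard library. The table name `demo` and its contents are assumptions made for the example; other databases usually go through a SQLAlchemy connection instead.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")

# Write a small hypothetical table so there is something to query.
pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]}).to_sql("demo", conn, index=False)

# Query and connection object are the two required arguments.
result = pd.read_sql("SELECT a, b FROM demo WHERE a >= 2", conn)
conn.close()

print(result)
```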
    

### Other Parsable Formats:

- JSON (`read_json`, `to_json`)
- HTML (`read_html`, `to_html`)
- Clipboard (`read_clipboard`, `to_clipboard`)
- HDF5 (hierarchical data, `read_hdf`, `to_hdf`)
- MessagePack (`read_msgpack`, `to_msgpack`; deprecated in pandas 0.25 and removed in 1.0)
- stata (`read_stata`, `to_stata`)
- SAS (`read_sas`, write not available)
- Google Big Query (`read_gbq`, `to_gbq`)

Detailed Documentation: https://pandas.pydata.org/pandas-docs/stable/io.html

Performance comparison of I/O on various formats: https://pandas.pydata.org/pandas-docs/stable/io.html#io-perf

## Type Casting:

- `df.dtypes` - Get datatypes of all columns
- `df.col.astype('dtype')` - Explicitly convert `col` to type `dtype`
- `df[[cols]].astype('dtype')` - Explicitly convert subset of columns (`cols`) to type `dtype`
- `df.astype({'a': 'dtype1' , 'b': 'dtype2'})` - Specify data type for each column
- `astype` has no `inplace` argument; assign the result back (`df = df.astype(...)`). `copy=False` only avoids a copy when the data already has the requested dtype

**For one dimensional objects (series)**, use `pd.to_numeric()`, `pd.to_datetime()`, `pd.to_timedelta()` for type casting. Handy arguments:
- `errors`: `raise` to throw an error, `coerce` to replace unparseable values with NaN/NaT, `ignore` to return the input unchanged
- `downcast` (`pd.to_numeric` only): downcasts the resulting numeric data to the smallest suitable dtype to conserve memory. Can take values `integer`, `signed`, `unsigned`, `float`
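
The casting calls above can be sketched as follows; the sample frame and its values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["1", "2", "oops"],   # strings, one of them not a number
    "b": [1.0, 2.0, 3.0],
    "c": ["x", "y", "x"],
})

# Per-column casting with a dict; astype returns a new frame.
df = df.astype({"b": "int16", "c": "category"})

# errors="coerce" turns the unparseable "oops" into NaN.
a_num = pd.to_numeric(df["a"], errors="coerce")

# downcast picks the smallest dtype that can hold the values.
small = pd.to_numeric(pd.Series([1, 2, 300]), downcast="unsigned")

print(df.dtypes)
print(a_num.tolist(), small.dtype)
```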

**Commonly used data types:**
- `object`
- `category`
- `float<x>` - x can be 16, 32, 64 (default)
- `int<x>` - x can be 8, 16, 32, 64 (default)
- `uint<x>` - x can be 8, 16, 32, 64 (default)
- `bool`


|**Data type**|**Range of Values**|
|------|------|
| int8 | -128 to 127 |
| int16 | -32,768 to 32,767 |
| int32 | -2,147,483,648 to 2,147,483,647 |
| int64 | -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 |
| uint8 | 0 to 255 |
| uint16 | 0 to 65,535 |
| uint32 | 0 to 4,294,967,295 |
| uint64 | 0 to 18,446,744,073,709,551,615 |

|**Data type**|**Resolution**|
|------|------|
| float16 | 1e-3 |
| float32 | 1e-6 |
| float64 | 1e-15 |

**Get info about data types**
- `np.finfo(np.float16)`
- `np.iinfo(np.int16)`
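
A short sketch of reading the limits from `np.iinfo` / `np.finfo`, plus an illustration (not from the tables above) of what happens when a value exceeds the range of a small integer type.

```python
import numpy as np

int16_info = np.iinfo(np.int16)
float16_info = np.finfo(np.float16)

print(int16_info.min, int16_info.max)   # integer range
print(float16_info.resolution)          # approximate decimal resolution

# Array arithmetic in a fixed-width integer type wraps around on overflow.
wrapped = np.array([127], dtype=np.int8) + np.array([1], dtype=np.int8)
print(wrapped)
```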

**Columns may be upcast when combined with other types** (for example, an integer column becomes `float64` once it holds a NaN)
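
A minimal sketch of three common upcasting cases; the series are made up for illustration.

```python
import pandas as pd

ints = pd.Series([1, 2, 3], dtype="int64")
floats = pd.Series([0.5, 1.5, 2.5], dtype="float64")

# int64 combined with float64 gives float64.
summed = ints + floats

# Reindexing introduces a NaN, which forces int64 up to float64.
with_gap = ints.reindex([0, 1, 2, 3])

# Mixing integers with Python objects falls back to object dtype.
mixed = pd.concat([ints, pd.Series(["x"], dtype=object)])

print(summed.dtype, with_gap.dtype, mixed.dtype)
```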

## Data Access and Filtering:

**Index or label based:**
- `df.iloc[1:2,3:4]` - Select data based on integer location
- `df.loc[1:2,'a':'c']` or `df.loc[1:2,['a','b','c']]` - Select data based on row/column labels. By default, row labels are the integer row positions, which is why the first element of `iloc` and `loc` looks similar.
    - **Important Note: A `loc` slice includes both endpoints, unlike slicing elsewhere in Python (`loc[1:2]` returns two rows).**
    - The second element of `loc` can be either a list of labels or a `from_col:to_col` slice
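
A quick sketch contrasting the two accessors on the small frame defined at the top of these notes.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"], "c": [1, 2, 3], "d": [1, 2, 3]})

# Position based: end of each slice is excluded.
by_position = df.iloc[1:2, 3:4]    # row 1, column 'd'

# Label based: both ends of each slice are included.
by_label = df.loc[1:2, "a":"c"]    # rows 1 and 2, columns a, b, c

# A list of labels also works for the column selector.
by_list = df.loc[1:2, ["a", "c"]]

print(by_position.shape, by_label.shape, by_list.shape)
```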
    

**Filter:**
- `df.select_dtypes(include=[], exclude=[])` - Select columns based on dtypes; Exclude/Include should not have overlap
- `df[boolean_array]` - length of boolean array must be same as # of rows in data frame. Boolean array can be derived from single or multiple conditions. Examples:
    - `df[df.col1=="x"]`
    - `df[cond1 operator cond2]`
    - Operators:
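        - `|` for element-wise or (the Python keyword `or` raises an ambiguity error on Series); wrap each condition in parentheses
        - `&` for element-wise and (the Python keyword `and` raises an ambiguity error on Series); wrap each condition in parentheses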
        - `~` (not operator)
        - `==` equality
        - `!=` inequality
        - `isin(list)` equivalent to `in` in SQL
        - `>=`, `<=` relational
        - `all(axis=0)` checks whether all values down the rows of each column are True
        - `any(axis=1)` checks whether at least one value across the columns of each row is True

- `df.query(condition)` - condition is a string expression that can involve columns or constants. Example - `df.query("col1 > col2")` or `df.query("col1 == 10")`
- `df.filter()` - Filters on row/column labels, not on values. Arguments (exactly one of the first 3 is required)
    - `items` - list of column or row labels
    - `like` - string for partial match (case sensitive)
    - `regex` - string with regex to parse
    - `axis` - row(0)/column(1) labels to filter 
- `df.isnull()` - checks if individual elements of dataframe are null (NaN/NaT)
- `df.notnull()` - checks if individual elements of dataframe are not null (NaN/NaT)
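
The filtering options above can be sketched on a small frame; the column values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"], "c": [3, 2, 1], "d": [1, 2, 3]})

# Boolean masks: parenthesize each condition and use & | ~.
both = df[(df["a"] >= 2) & (df["b"] != "c")]
either = df[(df["a"] == 1) | df["b"].isin(["c"])]
negated = df[~(df["a"] == 2)]

# query takes the condition as a string.
queried = df.query("a > c")

# Column selection by dtype and by label.
numeric = df.select_dtypes(include=["number"])
no_match = df.filter(like="A", axis=1)   # case sensitive: no column matches
match = df.filter(like="a", axis=1)
regexed = df.filter(regex="^[cd]$", axis=1)

print(both, either, queried, sep="\n\n")
```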


## Data Manipulation

### Dealing with NaNs

- `df.fillna(value, inplace=True)` - Fills NaNs in all columns with value
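- `df['col'] = df['col'].fillna(value)` - Fills NaNs in column `col` with value (the chained `df.col.fillna(value, inplace=True)` form may not update `df` in recent pandas versions)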
- `df[[cols]] = df[[cols]].fillna(value)` - Fills NaNs in columns `cols` with value



- `df.dropna(axis=0, how='any', inplace=True)` - Drops rows with NaNs in any of the columns; For columns, use `axis=1`
- `df.dropna(subset=[cols])` - Drops rows with NaNs in any of columns in `cols`
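
A minimal sketch of the fill and drop calls on a made-up frame. The dict form of `fillna` (one fill value per column) is an addition to the list above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", None, "z"],
    "c": [np.nan, np.nan, 9.0],
})

# Fill a subset of columns with one value.
filled_subset = df[["a", "c"]].fillna(0)

# Fill with a different value per column.
filled_some = df.fillna({"a": df["a"].mean(), "b": "unknown"})

# Drop rows with a NaN in any column, or only in selected columns.
complete_rows = df.dropna(axis=0, how="any")
rows_with_a = df.dropna(subset=["a"])

print(filled_some)
print(complete_rows.index.tolist(), rows_with_a.index.tolist())
```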

### Broadcasting Behavior

Numpy like broadcasting behavior can be obtained by using following functions:

- `df.add(row, axis=1)` - Adds `row` (matched on column labels) to every row of the dataframe df
- `df.sub(column, axis=0)` - Subtracts `column` (matched on the row index) from every column of the dataframe df
- `df.mul()` - multiply
- `df.div()` - divide
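
A short sketch of both broadcasting directions; the frame and the two series are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

# axis=1: the Series index is matched against the column labels.
row = pd.Series({"x": 100, "y": 200})
added = df.add(row, axis=1)

# axis=0: the Series index is matched against the row index.
column = pd.Series([1, 2, 3])
subtracted = df.sub(column, axis=0)

print(added)
print(subtracted)
```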

## Aggregate Functions:

- `df.groupby(['col1','col2']).func()` - func can be `sum`, `mean`, `count`, `nunique`; `unique` is available after selecting a single column, e.g. `df.groupby('col1')['col3'].unique()`
- `df.groupby(['col1','col2']).aggregate({'col3':['sum'], 'col4':['mean','count']})`
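
A minimal sketch of both groupby forms on made-up data. Flattening the resulting two-level column names is an optional extra step, not something the calls above require.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "a", "b", "b"],
    "col2": ["x", "x", "y", "z"],
    "col3": [1, 2, 3, 4],
    "col4": [10.0, 20.0, 30.0, 40.0],
})

# One function applied to every remaining column.
totals = df.groupby(["col1", "col2"]).sum()

# Different functions per column; the result has two-level column names.
summary = df.groupby(["col1", "col2"]).aggregate({"col3": ["sum"], "col4": ["mean", "count"]})

# Optional: flatten ('col4', 'mean') into 'col4_mean'.
summary.columns = ["_".join(pair) for pair in summary.columns]

print(totals)
print(summary)
```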

## Concatenation and Joins:

## Indexing:

## Handy Utilities:

- `df.shape` - Get the dimensions of the dataframe
- `df.values` - Get the values as a numpy array
- `df.head(x)` - Get first x rows of the dataframe
- `df.tail(x)` - Get last x rows of the dataframe
- `df.columns` - Get column names (an attribute, not a method)
- `df.info()` - High level metadata of columns and data
- `df.nunique()` - Number of unique records in each column
- `df.describe()` - High level statistical summary of numerical columns
- `df.columns = df.columns.str.lower()` Converting column names to lower case
- Like tables in relational databases, pandas operations such as aggregations and joins benefit from sorted columns.