In [None]:
import sys
sys.executable

# Pandas

[pandas API Reference](https://pandas.pydata.org/pandas-docs/stable/api.html)

https://github.com/pandas-dev/pandas


pandas is the standard for data science with Python.

- [pandas](https://pandas.pydata.org/)

pandas data structures:

- Series
- DataFrame

## pandas functions

- `to_numeric()`---one way to change the dtype of a column

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Pandas Data Structures

- [Series](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html) (1-d)
- [DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) (2-d) ~ data.frame

https://pandas.pydata.org/pandas-docs/stable/dsintro.html

Series may be created by passing a list to the constructor. DataFrames may be created by passing a numpy array or dict to the constructor.
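
A minimal sketch of the constructors described above. The variable names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Series from a list, with an explicit index.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# DataFrame from a 2-d numpy array, with explicit column labels.
df_from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["x", "y"])

# DataFrame from a dict: keys become column labels.
df_from_dict = pd.DataFrame({"x": [1, 2], "y": [3.0, 4.0]})
```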

`read_csv()` is an easy way to read files. By default it infers column names from the first line. `index_col`...
See also `read_table()`.
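
A small sketch of `read_csv()` with header inference and `index_col`. It reads from an in-memory string (via `io.StringIO`) instead of a file so it is self-contained; the CSV content is invented.

```python
import io
import pandas as pd

csv_text = "id,name,score\n1,ann,3.5\n2,bob,4.0\n"

# Column names come from the first line; the "id" column becomes the index.
df_csv = pd.read_csv(io.StringIO(csv_text), index_col="id")
```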

Since pandas is built on NumPy's `ndarray`, strings are stored as `object` dtype to get around the homogeneity requirement of `ndarray`.

http://pbpython.com/pandas_dtypes.html



### Supporting Data Types

Pandas objects (inherit from `PandasObject`) and their subclass relationships:
- [Index](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html)
  - [NumericIndex](https://pandas.pydata.org/pandas-docs/stable/api.html#numeric-index) (abstract class)
    - Int64Index
      - RangeIndex
    - UInt64Index
    - Float64Index
  - CategoricalIndex
  - MultiIndex
- SparseArray
- [Categorical](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Categorical.html#pandas.Categorical)

- NDFrame
  - DataFrame
  - ...



...

- [categorical data type](https://pandas.pydata.org/pandas-docs/stable/categorical.html)
- time?

In [None]:
# See that RangeIndex is a subclass of Index
issubclass(pd.RangeIndex, pd.Index)

## Index Class

[`pandas.Index`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html)


### Attributes

- `dtype_str`---prefer to size
- `name`

### Methods

- `drop()`
- `drop_duplicates()`
- `duplicated()`
- `sort_values()`

### Comparisons

these are on Index and possibly others!?
- `equals()`
- `identical()`---vs equals? and note neither Series nor DataFrame has this (they have eq()), only Index does.
- `difference()`
- `symmetric_difference()`
- `intersection()`
- `union()` (vs append()?)
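
A sketch that may help with the open questions above, using three made-up indexes. As far as I can tell, `equals()` compares only the elements while `identical()` also compares attributes such as `name`; `union()` is a set operation, while `append()` just concatenates and keeps duplicates.

```python
import pandas as pd

a = pd.Index([1, 2, 3], name="left")
b = pd.Index([1, 2, 3], name="right")
c = pd.Index([3, 4])

same_values = a.equals(b)      # elements only
same_object = a.identical(b)   # elements and attributes such as name

union = a.union(c)                    # set union, no duplicates
appended = a.append(c)                # plain concatenation, keeps the repeated 3
common = a.intersection(c)
only_a = a.difference(c)
either = a.symmetric_difference(c)
```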

## Series

[`pandas.Series`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html)

### Attributes

- `index`
- `values`
- `dtype`---There is also `dtypes` but it is exactly the same and seems redundant. Not sure why it even exists on Series.
- `str`---string methods!!!, I think Index has also but documentation is sparse

### Methods

- `hasnans`---a property, not a method. Note Index also has this but not sure why; DataFrame does NOT have this.
- `isin()`
- `isna()`/`notna()`
- `isnull()`/`notnull()`
- `nonzero()`

- `describe()`
- `is_unique`---a property, not a method

- `where()`
- `filter()`

- `infer_objects()`

- `iteritems()`---note `items()` is an alias for this

- `equals()`
- `eq()`---similar to `==` but with support for missing values

How to change dtype of Series?
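
One possible answer, sketched with invented data: `astype()` for a straightforward cast, and `pd.to_numeric()` when the values are strings that may not all parse. The `errors="coerce"` choice is just for illustration.

```python
import pandas as pd

counts = pd.Series([1, 2, 3])
as_float = counts.astype("float64")

labels = pd.Series(["x", "y", "x"]).astype("category")

# Strings that should be numbers; unparseable entries become NaN.
raw = pd.Series(["1", "2.5", "oops"], dtype=object)
parsed = pd.to_numeric(raw, errors="coerce")
```

Both calls return a new Series; the original is left untouched.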

## DataFrame

[`pandas.DataFrame`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)

Axis labels are stored in `Index` objects.

A DataFrame's *index* holds the row labels and its *columns* hold the column labels.

TODO review how to use index properly!

The index and columns can have a `name`!

Index: see `set_index()`, which can also set a multi-index (hierarchical/multi-level indexing, `MultiIndex`)! See the options to `set_index()`, incl. `verify_integrity`!

    df.set_index(keys, inplace=True, ...)

    df.index.name = None  # clear index name

    df.index.name = key
    df.reset_index(inplace=True)

    # ?
    # df.set_axis()
    # df.rename_axis()
    # but there is no set_columns, why?
    # so there is a partial symmetry between index and columns wrt the name thing
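
A runnable sketch of the index operations noted above, on a made-up `people` frame. It uses the non-inplace forms, which return new frames; `set_axis(..., axis="columns")` appears to be the closest thing to a "set_columns".

```python
import pandas as pd

people = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Rome"],
    "name": ["ann", "bob", "cy"],
    "age": [31, 45, 27],
})

# Single-column index; verify_integrity raises if the new index has duplicates.
by_name = people.set_index("name", verify_integrity=True)

# Passing a list of columns builds a MultiIndex.
by_city_name = people.set_index(["city", "name"])

unnamed = by_name.rename_axis(None)      # clear the index name
restored = by_name.reset_index()         # index goes back to being a column
relabeled = by_name.set_axis(["town", "years"], axis="columns")
```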

columns

An Index has a dtype, which will be `object` if it is heterogeneous (i.e. it may contain a mix of str, int, etc.).

### Attributes

- `values`---return a numpy representation of the DataFrame with axes labels removed.
- `loc[]`---see also `filter()`
- `iloc[]`

- `dtypes`---return a Series with the data type (.dtype) of each column.

- `shape`

### Methods

- `info()`
- `describe()`

- `iteritems()`---good way to iterate!

- `filter`---see also `loc` which has overlapping function
- `query()`

- [`where()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html#pandas.DataFrame.where)---shape preserving
- `mask()`
- `combine()`
- `combine_first()`

- `isin()`---selects values!
- `isna()`/`notna()`
- `isnull()`/`notnull()`

- `dropna()`---may want how='all'

- `sort_index()`
- `sort_values()`
- `set_index()`---set index using one or more existing columns.
- `set_axis()`

- `to_csv()`---see also `pandas.read_csv()`

- `get_dtype_counts()`---see also dtypes attribute
- `astype()`---note, do not use .loc with this
- `infer_objects()`---hmmmm? this could be really useful! See also pandas.to_numeric().

- `equals()`---NaNs in the same location are considered equal. Note some classes (e.g. `Index`) also have an `identical()` method, but DataFrame does not.
- `eq()`---element-wise comparison; you probably want `equals()` for a whole-frame check. Related to `ne()`, `lt()`, `gt()`, `le()`, `ge()`.

- `any()`---default of `axis=0` gives per-column results
- `all()`
- `bool()`---see also `empty`
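
A sketch contrasting `equals()` with `eq()`, and showing the axis behaviour of `all()`. The frames are invented for the example.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"a": [1.0, np.nan], "b": [3.0, 4.0]})
right = left.copy()

whole = left.equals(right)       # one bool; NaNs in the same place count as equal
elementwise = left.eq(right)     # boolean frame; NaN compared with NaN is False

per_column = elementwise.all()         # axis=0: one result per column
per_row = elementwise.all(axis=1)      # one result per row
```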

### Notes

TODO Changing data types, incl astype(), infer_objects(), to_numeric()
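
A first pass at this TODO, as a sketch on an invented frame whose columns start out as `object`. `astype()` casts to the types you name, while `infer_objects()` only picks a better dtype for values that are already the right Python type (it does not parse strings).

```python
import pandas as pd

mixed = pd.DataFrame({"n": [1, 2, 3], "s": ["4", "5", "6"]}, dtype=object)

# Explicit cast, one target dtype per column.
typed = mixed.astype({"n": "int64", "s": "float64"})

# Soft conversion: "n" holds real ints so it becomes an integer column,
# "s" holds strings so it does not become numeric.
inferred = mixed.infer_objects()
```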



### Creating DataFrames

In [None]:
# Just create an example DataFrame, from dict of Series.

d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
      'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])

### Inspecting DataFrames

- `info()`

In [None]:
df.info()
# Note that if there are too many columns info will not list them (output None?).
# info() will print even without wrapping in print().

In [None]:
df.index

In [None]:
df.columns

In [None]:
# Return a Series with the data type of each column.
df.dtypes

In [None]:
# The index corresponds to the DataFrame's columns.
df.dtypes.index

In [None]:
df.ftypes

In [None]:
df.describe()

## Basics of Viewing Data

In [None]:
#df.at?

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.values

## Indexing and Selecting Data

I'm still confused!

- [Indexing and Selecting Data](https://pandas.pydata.org/pandas-docs/stable/indexing.html)

There are a number of ways to select data:

- attribute operator (`.`)
- `df['column_name']` ~ `df.loc[:, 'column_name']`
- `df[['column_name1', 'column_name2']]`
- `df[index_variable]` will also work
- `df[][]` can combine row and column index/select.
- `loc[]` selects based on labels; it can also take slices of both rows and columns, or a list.
- use `iloc[]` to do the same based on integer positions
- `where()`: `df.where(df > 0)` is equivalent to `df[df > 0]`!
- what about `at[]` and `iat[]`?
- you can also do conditional selection / boolean indexing, see also `isin()`
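
To make the options above concrete, here is a sketch on a made-up `scores` frame. `at[]` and `iat[]` seem to be the scalar-only counterparts of `loc[]` and `iloc[]`.

```python
import pandas as pd

scores = pd.DataFrame(
    {"math": [90, -5, 70], "art": [60, 80, -1]},
    index=["ann", "bob", "cy"],
)

col = scores["math"]                         # one column, as a Series
sub = scores[["math", "art"]]                # list of columns, as a DataFrame

by_label = scores.loc["ann":"bob", "art"]    # label slice, end label included
by_pos = scores.iloc[0:2, 1]                 # position slice, end excluded

one = scores.at["cy", "math"]                # single value by label
one_pos = scores.iat[2, 0]                   # single value by position

positive = scores[scores > 0]                # boolean frame: non-matches become NaN
same = scores.where(scores > 0)              # same result, spelled with where()

picked = scores[scores["math"].isin([90, 70])]   # boolean row selection
```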

In [None]:
# df[1]

### Reindexing

- https://pandas.pydata.org/pandas-docs/stable/10min.html#missing-data
- https://pandas.pydata.org/pandas-docs/stable/basics.html#reindexing-and-altering-labels

`reindex()`,
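
A small `reindex()` sketch on an invented frame: existing labels are reordered, and labels that were not there before get a missing value (or `fill_value` if given).

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "c"])

reordered = frame.reindex(["c", "a", "z"])             # "z" is new, so x is NaN
filled = frame.reindex(["c", "a", "z"], fill_value=0)  # "z" gets 0 instead
```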

- `df.drop()`---works on either axis
- `df.drop_duplicates()`---only works on rows, but with a little work `i.drop_duplicates()` can be used for columns. See https://stackoverflow.com/a/40435354/2762964 for a oneliner for removing duplicate columns.

maybe can del df['col_name'] same as drop??
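
A sketch of the dropping options, on invented frames. The duplicate-column line is one common idiom built on `Index.duplicated()`; I have not checked that it is exactly the one-liner from the linked answer. On the `del` question: `del df['col']` removes the column in place, while `drop()` returns a new frame by default.

```python
import pandas as pd

rows = pd.DataFrame({"a": [1, 1, 3], "b": [2, 2, 4]})
no_dup_rows = rows.drop_duplicates()       # keeps the first of each repeated row

cols = pd.DataFrame([[1, 2, 1], [3, 4, 3]], columns=["a", "b", "a"])
unique_cols = cols.loc[:, ~cols.columns.duplicated()]   # keep first of each label

without_b = unique_cols.drop(columns="b")  # new frame, unique_cols is unchanged

trimmed = unique_cols.copy()
del trimmed["b"]                           # mutates trimmed
```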

`df.reindex_like()`

`df.reindex_axis()` is deprecated in favor of reindex!
see also [`df.sort_index(axis=1)`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_index.html)

what about aligning index `align()`

WTF does align() do? https://pandas.pydata.org/pandas-docs/stable/basics.html#aligning-objects-with-each-other-with-align

- df.sort_index()!

In [None]:
# TODO df.reindex

In [None]:
# TODO
df.sort_index()  # defaults to axis=0 (the row index)
df.sort_index(axis='columns')

### `rename()`

Use rename() to rename index or columns as per a dict-like or function. The simplest way is to use the `index` and/or `columns` keyword arguments. Alternatively you can use `mapper` and specify which `axis`.

Note in the example `index=str` argument converts index to string!
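
A sketch of both calling styles, on a made-up frame `t`: the `index`/`columns` keywords, and a `mapper` plus `axis`.

```python
import pandas as pd

t = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=[10, 20])

# Keyword style: a function for the index, a dict for the columns.
renamed = t.rename(index=str, columns={"A": "a"})

# Mapper style: one function applied to the chosen axis.
via_mapper = t.rename(str.lower, axis="columns")
```

Labels missing from a dict mapper (here `"B"`) are left as they are.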


In [None]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

s1 = s[:4]

s2 = s[1:]

s1

In [None]:
s2

In [None]:
s1.align(s2)

In [None]:
df.eq??

## ?

`pd.unique()` vs `Series.unique()` vs `Index.unique()`? And also the `Series.is_unique` and `Index.is_unique` properties.
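
A quick comparison on invented data. From what I can tell the three agree on the values (in order of first appearance) and differ mainly in the return type; `is_unique` is a property, so no parentheses.

```python
import pandas as pd

s = pd.Series([3, 1, 3, 2])
idx = pd.Index([3, 1, 3, 2])

u_func = pd.unique(s)     # top-level function
u_series = s.unique()     # Series method, returns an array for this int data
u_index = idx.unique()    # Index method, returns an Index

has_repeats = not s.is_unique
```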

## Sorting Data

- `sort()`
- `sort_index()`
- `sort_values()`
- `is_monotonic`, etc. (properties, not methods)
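
A sorting sketch on a made-up frame, using `sort_values()`, `sort_index()`, and the `is_monotonic_increasing` property of the index.

```python
import pandas as pd

u = pd.DataFrame({"k": [2, 3, 1]}, index=["b", "c", "a"])

by_value = u.sort_values("k", ascending=False)   # order rows by column values
by_label = u.sort_index()                        # order rows by index labels

was_sorted = u.index.is_monotonic_increasing
now_sorted = by_label.index.is_monotonic_increasing
```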

In [None]:
# TODO df.sort_index

In [None]:
# TODO df.sort_values()

## Manipulation




...


df.assign and what about df['var']=?
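
A sketch of the difference, with invented column names: `assign()` returns a new frame and leaves the original alone, while `df['var'] = ...` modifies the frame in place.

```python
import pandas as pd

base = pd.DataFrame({"x": [1, 2, 3]})

# assign: new frame; the lambda receives the frame being assigned to.
with_double = base.assign(double=lambda d: d["x"] * 2)

# Item assignment: adds the column to base itself.
base["square"] = base["x"] ** 2
```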

## Merge

? What operations induce a reindexing?

https://pandas.pydata.org/pandas-docs/stable/merging.html

There are a handful of ways to combine data sets and it can get confusing. Here's a summary:
- df
  - **append**
  - **join**
  - merge--instance method is a convenience wrapper for global version
  - update
  - assign---assign new column(s), or overwrite if existing, returns copy, use concat for more arbitrary concatenation
- global
  - **merge**
  - **concat**


### append

[`df.append()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html)

    DataFrame.append(other, ignore_index=False, verify_integrity=False, sort=None)

Uses `pd.concat()` internally.

Super naive, simply appends rows to a frame.

Note `append` lacks the `axis` parameter of `concat`, so it is **only good for appending rows**. This may lead to duplicate index labels or, if `ignore_index=True`, index labels will be overwritten with meaningless numeric labels. New columns may be added from the other frame, but in summary, don't use `append` if the index labels have any meaning.

Note `append` lacks the `join` parameter of `concat` and just uses the default of `outer`.

### join

    DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)

[`df.join`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html)

Via the private `df._join_compat()` method, join uses `pd.merge()` internally. But unlike merge, join defaults to joining index-on-index; the `on` parameter allows the caller to join on column name, but the other is always joined on index.

Multiple data frames can be joined, but only index-on-index. In that case, if the index of each frame is unique (to itself) (WHY?), it uses `pd.concat()`; otherwise it falls back to `pd.merge()`.
Read through that section of the `_join_compat` method to make sense of it!

See `combine_first()` for working with overlapping data sets! also `df.combine()`

`how`:
- left---join default
- outer
- inner---merge default
- right
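
A sketch of the two join styles described above, with made-up frames: a column of the caller matched against the index of the other frame (`on=`), and plain index-on-index.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "lv": [1, 2, 3]})
right = pd.DataFrame({"rv": [10, 20]}, index=["a", "b"])

# Default how='left': every row of left is kept; "c" has no match, so rv is NaN.
joined = left.join(right, on="key")

# Index-on-index, keeping only labels present in both frames.
inner = left.set_index("key").join(right, how="inner")
```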


### concat

    pandas.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=None, copy=True)

[`pd.concat()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html) appends rows by default; use `axis=1` to append columns. See also the `join` parameter.

Objects are aligned on the labels of the other (non-concatenation) axis, as controlled by `join`. The labels along the concatenation axis are kept as-is unless you pass `ignore_index=True`, which replaces them with 0..n-1 (`append` has the same option).

See also the `df.append` instance method for a quick way to concatenate along `axis=0`.

`join`:
- `inner`
- `outer` (default)
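
A sketch of the main `concat` options, on invented frames that share the index label `"a"`.

```python
import pandas as pd

top = pd.DataFrame({"x": [1, 2]}, index=["a", "b"])
bottom = pd.DataFrame({"x": [3], "y": [9]}, index=["a"])

# Rows stacked; columns are outer-joined; the repeated label "a" is kept.
stacked = pd.concat([top, bottom])

# Same rows, but the labels are replaced with 0..n-1.
renumbered = pd.concat([top, bottom], ignore_index=True)

# Side by side; join='inner' keeps only index labels present in both.
side = pd.concat([top, bottom], axis=1, join="inner")

# verify_integrity refuses to create duplicate labels.
try:
    pd.concat([top, bottom], verify_integrity=True)
    overlap_detected = False
except ValueError:
    overlap_detected = True
```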


### merge

    pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, 
                 sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
                 
[`pd.merge`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.merge.html) is also available as instance method [`df.merge`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html)

Note `merge` can join on arbitrary variables vs `append` which joins only on index...


May need to consider `drop()`, `drop_duplicates()`, or `duplicated()` (see the `keep` parameter) (vs `unique()`?) for cleaning up.
`duplicated()`/`drop_duplicates()`/etc. are also on Series and DataFrame (applies to index).
? What can you do with the indices returned by `duplicated()`?

`how`:
- left---join default
- outer
- inner---merge default
- right
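
A sketch of `merge` on a shared column, with made-up `orders` and `names` frames. `indicator=True` adds a `_merge` column showing where each row came from.

```python
import pandas as pd

orders = pd.DataFrame({"cust": [1, 2, 4], "amount": [50, 20, 70]})
names = pd.DataFrame({"cust": [1, 2, 3], "name": ["ann", "bob", "cy"]})

# Default how='inner': only customers present in both frames.
inner = pd.merge(orders, names, on="cust")

# Instance-method form, keeping everything from both sides.
outer = orders.merge(names, on="cust", how="outer", indicator=True)
```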

### assign

[`df.assign`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.assign.html)

### update

[`df.update`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.update.html)

? how does this compare to combine_first?
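
A sketch that may answer the question, on invented frames. As far as I can tell, `combine_first()` only fills the caller's missing values and returns a new frame, while `update()` overwrites the caller in place with the other frame's non-missing values.

```python
import numpy as np
import pandas as pd

old = pd.DataFrame({"v": [1.0, np.nan, 3.0]})
new = pd.DataFrame({"v": [10.0, 20.0, np.nan]})

# Keep old's values, fill only its holes from new.
patched = old.combine_first(new)

# Overwrite with new's non-missing values; modifies the frame and returns None.
updated = old.copy()
updated.update(new)
```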

In [None]:
df.append?

In [None]:
df.set_index?

---

## Sparse...

DataFrames (etc.) can be compressed to *sparse* form (`SparseDataFrame`) for memory efficiency using `to_sparse()`. The `density` attribute gives the ratio of non-sparse values to total. Convert back to *dense* form using `to_dense()`.

Sparse structures only store values distinct from the `fill_value`, which defaults to `nan`.
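
Later pandas releases dropped `SparseDataFrame` and `to_sparse()` in favour of sparse arrays with a `.sparse` accessor. The sketch below assumes a version where `pd.arrays.SparseArray` and `Series.sparse` are available, so it will not match the older API used in the cells of this section.

```python
import numpy as np
import pandas as pd

# One stored value out of three; the rest equal the fill value (NaN).
sparse_col = pd.Series(pd.arrays.SparseArray([1.0, np.nan, np.nan]))

density = sparse_col.sparse.density      # stored values / total length
fill = sparse_col.sparse.fill_value      # NaN by default for float data
dense = sparse_col.sparse.to_dense()     # back to an ordinary Series
```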

See `ftypes` in addition to `dtypes`.

The use and implication of `get_values()` vs `values` for sparse structures is not at all clear, but probably isn't too important for now. According to [pandas#19617](https://github.com/pandas-dev/pandas/issues/19617) I think we can forget about `get_values()`.

See also SparseSeries and SparseArray.

In [None]:
df = pd.DataFrame({"a": pd.SparseArray([1, None, None]), "c": [1.0, 2.0, 3.0]})

In [None]:
df.get_values()

In [None]:
type(df)

In [None]:
df.ftypes

In [None]:
df.dtypes

In [None]:
pd.SparseDataFrame

In [None]:
type(df.to_sparse())

In [None]:
df.values

In [None]:
df.get_values()

In [None]:
df1 = pd.DataFrame({'a': [1, 2], 'b': [True, False], 'c': [1.0, 2.0]})

In [None]:
df1.values

In [None]:
df1.get_values()

In [None]:
dfs = df.to_sparse()

In [None]:
dfs.density

In [None]:
type(dfs)

In [None]:
print(dfs)

In [None]:
df1.to_sparse??

In [None]:
type(dfs.values)

In [None]:
type(dfs['a'].get_values())

In [None]:
type(dfs)

In [None]:
pd.SparseDataFrame?

In [None]:
type(pd.DataFrame(dfs.values))

In [None]:
type(pd.DataFrame(dfs.get_values()))

In [None]:
dfs.get_values??

In [None]:
pd.DataFrame?

In [None]:
pd.read_csv?

## Cleaning Data

whitespace? consider this?

The str attribute of Series and Index can be used for vectorized string operations that also handle missing and NA values.

    df.columns = df.columns.str.strip()
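
A slightly fuller sketch of the same idea on a made-up frame: strip whitespace from the column labels and from a string column, with a missing value left as missing.

```python
import pandas as pd

messy = pd.DataFrame({" name ": [" ann", "bob ", None]})

messy.columns = messy.columns.str.strip()     # Index.str
messy["name"] = messy["name"].str.strip()     # Series.str, missing stays missing
```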