## Getting Started with Pandas

`pandas` adopts significant parts of NumPy's idiomatic style of array-based computing, especially array-based functions and a preference for data processing without `for` loops.

The biggest difference is that `pandas` is designed for working with tabular or heterogeneous data, whereas NumPy is best suited for working with homogeneous numerical array data.

In [1]:
import pandas as pd
from pandas import Series, DataFrame

# 5.1 Introduction to pandas Data Structures 

## Series

A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its index.

In [2]:
# The simplest series is formed from only an array of data 
obj = pd.Series([1, 2, 3, 4])
obj

0    1
1    2
2    3
3    4
dtype: int64

In [3]:
obj.index

RangeIndex(start=0, stop=4, step=1)

In [4]:
obj.values

array([1, 2, 3, 4])

Since we didn't assign an index, one was assigned automatically. It's often desirable to create a Series with an index identifying each data point with a **label**:

In [5]:
obj2 = pd.Series([4,7,-5,3], index=['a', 'b', 'c', 'd'])
obj2

a    4
b    7
c   -5
d    3
dtype: int64

In [6]:
obj2.index

Index(['a', 'b', 'c', 'd'], dtype='object')

Note: you can use labels to index the series

In [7]:
obj2['b']

np.int64(7)

Using NumPy functions or NumPy-like operations, such as filtering with a `boolean` array, scalar multiplication, or applying math functions, will preserve the index-value link. 

In [9]:
obj2[obj2 > 0]

a    4
b    7
d    3
dtype: int64

In [10]:
obj*2

0    2
1    4
2    6
3    8
dtype: int64

In [11]:
obj2*2

a     8
b    14
c   -10
d     6
dtype: int64

In [13]:
import numpy as np
np.exp(obj2)

a      54.598150
b    1096.633158
c       0.006738
d      20.085537
dtype: float64

**Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values.** 

In [14]:
# use it as if it were a dictionary 
'b' in obj2

True

In [15]:
'e' in obj2

False

As you might have guessed, you can pass a Python dict when creating a Series.


In [16]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah':5000}
obj3=pd.Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

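Notice that the index consists of the dict keys from the source, in the order they were passed. This can be overridden by passing an `index` with the keys in the order we want them to appear in the resulting Series. Keys in the index that are missing from the dict (here `'California'`) get `NaN`, and dict keys that are absent from the index (here `'Utah'`) are dropped.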

In [17]:
states = ['California', 'Ohio', 'Oregon', 'Texas']

In [18]:
obj4 = pd.Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

`NaN` (not a number) is used in pandas to mark missing or `NA` values. If we want to detect **missing** or **NA** values in our data, we can use the `isnull` and `notnull` functions in pandas.

In [19]:
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [20]:
pd.notnull(obj4) 

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

 These are pandas functions as well as Series instance methods. 

In [21]:
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

A useful feature of Series is that it automatically aligns by index label in arithmetic operations.

In [22]:
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [23]:
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [24]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

This is similar to a `join` operation. 

Both the Series object **and** its index have a `name` attribute, which integrates with other key areas of pandas functionality.

In [26]:
obj4.name = 'population'

In [27]:
obj4.index.name = 'state'

In [28]:
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

In [29]:
obj4.index

Index(['California', 'Ohio', 'Oregon', 'Texas'], dtype='object', name='state')

 A Series' index can be altered in-place by assignment:

In [30]:
obj

0    1
1    2
2    3
3    4
dtype: int64

In [31]:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj

Bob      1
Steve    2
Jeff     3
Ryan     4
dtype: int64

## DataFrame

A DataFrame (DF) represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, bool, etc.). The DF has both a row and column index; it can be thought of as a dict of Series all sharing the same index.
- There are many ways to construct one; one of the most common is from a dict of **equal-length lists or NumPy arrays**.

In [33]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

In [35]:
frame = pd.DataFrame(data) 

In [36]:
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


Just like with `Series`, we did not assign an index, so one is assigned automatically. The columns are placed in the order in which they were passed to `pd.DataFrame`.

In [37]:
# select the first 5 rows with the .head() method 
frame.head()

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9


When creating the dataframe, use the `columns` argument to specify the order. 

In [38]:
pd.DataFrame(data, columns=['year', 'state', 'pop'])

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2


 If you pass a column that isn't contained in the dict (input data) it will appear with missing values in the result

In [39]:
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index=['one', 'two', 'three', 'four', 'five', 'six'])
frame2

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7   NaN
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4   NaN
five   2002  Nevada  2.9   NaN
six    2003  Nevada  3.2   NaN


In [40]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

In [41]:
frame2.index

Index(['one', 'two', 'three', 'four', 'five', 'six'], dtype='object')

A column in a DF can be retrieved as a series either by dict-like notation or as an attribute

In [42]:
frame2['year']

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

In [43]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

Note, the returned Series have the same index as the DF, and their name attribute has been appropriately set. 

In [44]:
frame2.year.name

'year'

**Rows** can also be retrieved by label with the special `loc` attribute (or by integer position with `iloc`).

In [45]:
frame2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

Columns can be modified by assignment. Let's use this to assign values to the `debt` column. 

In [48]:
frame2.debt = 16.5
frame2

       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
six    2003  Nevada  3.2  16.5


In [52]:
frame2['debt'] = np.arange(6.)
frame2['debt']

one      0.0
two      1.0
three    2.0
four     3.0
five     4.0
six      5.0
Name: debt, dtype: float64

When assigning lists or arrays to a column, the value's **length** must match the length of the DF.
- If you assign a Series, its labels will be realigned exactly to the DF's index, inserting missing values into any holes.
  - So technically in this case it does not have to be the same length; you just have to provide the labels where you want the data to go.

In [54]:
val = pd.Series([-1.2, -1.5, -1.7,], index=['two', 'four', 'five'])

frame2['debt']=val

frame2

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN


- Assigning a column that doesn't exist will create a new column.
- New columns cannot be created with the `frame.column` syntax; use `frame['column']` instead.
- The `del` keyword will delete columns as with a dict.

In [58]:
# add a bool column to the DataFrame using the method above
frame2['bool_column'] = frame2.state == 'Nevada'
frame2

       year   state  pop  debt  bool_column
one    2000    Ohio  1.5   NaN        False
two    2001    Ohio  1.7  -1.2        False
three  2002    Ohio  3.6   NaN        False
four   2001  Nevada  2.4  -1.5         True
five   2002  Nevada  2.9  -1.7         True
six    2003  Nevada  3.2   NaN         True


In [59]:
# delete that column 
del frame2['bool_column']
frame2

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN


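The column returned from indexing a DF has traditionally been a **view** on the underlying data, not a copy, so in-place modifications to the Series could be reflected in the DF. Under pandas' Copy-on-Write mode (optional in 2.x, the default from 3.0) such modifications no longer propagate to the DF. Either way, the column can be explicitly copied with the Series' `copy` method.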
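A minimal sketch of the explicit-copy route. The frame `demo` and its values are made up for illustration. Whether a plain `demo['debt']` selection shares memory with the frame depends on the pandas version and its copy-on-write setting, so the example only shows the behavior of `copy`, which should not vary.

```python
import pandas as pd

# Hypothetical frame used only for this illustration
demo = pd.DataFrame({'debt': [1.0, 2.0, 3.0]}, index=['one', 'two', 'three'])

# An explicit copy is independent of the frame it came from
debt_copy = demo['debt'].copy()
debt_copy['one'] = 99.0

print(demo.loc['one', 'debt'])   # the frame still holds 1.0
print(debt_copy['one'])          # the copy holds 99.0
```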

In [61]:
# nested dict of dicts. 
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}
       }
frame3 = pd.DataFrame(pop)
frame3

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5


If a nested dict is passed to the DF constructor, pandas will interpret the outer dict keys as the columns and the inner keys as the row indices.

Transposing syntax is similar to NumPy

In [62]:
frame3.T

        2001  2002  2000
Nevada   2.4   2.9   NaN
Ohio     1.7   3.6   1.5


The keys in the inner dicts (of a nested dict input) are combined to form the index in the result; in the output above they appear in order of first appearance (2001, 2002, 2000) rather than sorted. Passing an explicit index fixes the order:

In [63]:
pd.DataFrame(pop, index=[2000, 2001, 2002])

      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6


In [69]:
# dicts of series are treated in much the same way. 
pdata = {'Ohio': frame3['Ohio'][:-1],
'Nevada': frame3['Nevada'][:2]}

In [70]:
pd.DataFrame(pdata)

      Ohio  Nevada
2001   1.7     2.4
2002   3.6     2.9


If the DF's `index` and `columns` have their `name` attributes set, these will also be displayed.

In [71]:
frame3.index.name='year'; frame3.columns.name='state'

In [72]:
frame3

state  Nevada  Ohio
year
2001      2.4   1.7
2002      2.9   3.6
2000      NaN   1.5


 **As with `Series`, the `values` attribute returns the data contained in the DF as a 2D array** (or a 1D array for a single column, as in the next cell).
 - Remember it's an attribute, so you don't need to add `()` to the end of it.

In [74]:
frame3['Nevada'].values

array([2.4, 2.9, nan])

 If the DF's columns are different dtypes, the dtype of the values array will be chosen to accommodate all of the columns: 

In [75]:
frame2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7],
       [2003, 'Nevada', 3.2, nan]], dtype=object)

Possible data inputs to the DF constructor:
- **2D ndarray:** A matrix of data, passing optional row and column labels
- **dict of arrays, lists, or tuples:** Each sequence becomes a column in the DF; all sequences must be the same length
- **NumPy structured/record array:** Treated as the "dict of arrays" case
- **dict of Series:** Each value becomes a column; indexes from each Series are unioned together to form the result's row index if no explicit index is passed
- **dict of dicts:** Each inner dict becomes a column; keys are unioned to form the row index as in the "dict of Series" case.
- **List of dicts or Series:** Each item becomes a row in the DF; union of dict keys or Series indexes become the DF's column labels.
- **List of lists or tuples:** Treated as the "2D ndarray" case
- **Another DF:** DF's indexes are used unless different ones are passed.
- **NumPy MaskedArray:** Like the "2D ndarray" case except masked values become NA/missing in the DF result.
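A short sketch of three of the inputs listed above. All variable names and values here are made up for illustration.

```python
import numpy as np
import pandas as pd

# 2D ndarray with optional row and column labels
from_array = pd.DataFrame(np.arange(6).reshape((2, 3)),
                          index=['r1', 'r2'],
                          columns=['a', 'b', 'c'])

# List of dicts: each dict becomes a row; the keys are unioned into columns
from_records = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# Dict of Series: the Series indexes are unioned to form the row index
from_series = pd.DataFrame({'x': pd.Series([1, 2], index=['p', 'q']),
                            'y': pd.Series([3, 4], index=['q', 'r'])})

print(from_array)
print(from_records)   # missing keys show up as NaN
print(from_series)
```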

## Index Objects 

 Responsible for holding the axis labels and other metadata (like the axis name or names)

In [78]:
# arrays or sequences of labels passed to the `index` arg are converted to an Index
obj = pd.Series(range(3), index=['a','b','c'])
obj.index

Index(['a', 'b', 'c'], dtype='object')

Index objects can be indexed. 

In [79]:
obj.index[0]

'a'

Index objects are **immutable**. This makes it safer to share objects among data structures. 

In [80]:
labels = pd.Index(np.arange(3))
labels

Index([0, 1, 2], dtype='int64')

In [81]:
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2.index

Index([0, 1, 2], dtype='int64')

In [84]:
obj2.index is labels 

True

An Index also behaves like a fixed-size set, but unlike Python sets, it can contain duplicate labels.

In [87]:
1 in obj2.index

True

 Selections with duplicate labels will select all occurrences of that label. 

Index objects have a number of methods and properties for set logic
- `append`
- `difference`
- `intersection`
- `union`
- `isin`
- `delete`
- `drop`
- `insert`
- `is_monotonic_increasing` (named `is_monotonic` before pandas 2.0)
- `is_unique`
- `unique`
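A few of these set-style methods in action, as a small sketch. The `left` and `right` indexes are illustrative names, not part of the notes above.

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

print(left.union(right))          # labels found in either index
print(left.intersection(right))   # labels found in both
print(left.difference(right))     # labels in left but not in right
print(left.isin(['a', 'd']))      # boolean array, one entry per label
print(left.is_unique)             # True when no label repeats
```

Each call returns a new object; because an Index is immutable, `left` and `right` are unchanged.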

# 5.2 Essential Functionality

## Reindexing

`reindex` is **not** an in-place method: it returns a new object with the data conformed to the new index.

In [91]:
# reindexing is creating a new object with the data conformed to a new index. 
# NaNs are introduced if there is no value for that label/index
obj = pd.Series([4.5,7.2,-5.3,3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [92]:
obj2 = obj.reindex(['a','b','c','d','e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

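You can specify the `method` argument when reindexing to fill in values for the newly introduced labels, which is useful for ordered data such as time series.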

In [104]:
# forward fill method `ffill`
obj3 = pd.Series(['blue','purple','yellow'],index=[0,2,4])
obj3

0      blue
2    purple
4    yellow
dtype: object

In [105]:
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

Reindexing a DF

In [109]:
frame = pd.DataFrame(np.arange(9).reshape((3,3)),
                     index=['a','c','d'],
                     columns=['Ohio','Texas','California'])
frame                              

   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8


In [110]:
frame2 = frame.reindex(['a','b','c','d'])
frame2

   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0


 If you want to reindex on the columns, pass the `columns` keyword to the reindex method. 

In [117]:
states=['Texas', 'Utah', 'California']
frame = frame.reindex(columns=states)

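Lastly, you can select the same rows and columns by label with `loc` (label-indexing). This works here because every requested label already exists after the `reindex` above; `loc` raises a `KeyError` for missing labels rather than inserting `NaN`.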

In [118]:
frame.loc[['a','c','d'],states]

   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8


 Reindex function arguments:
 - `index`
 - `method`
 - `fill_value`
 - `limit`
 - `tolerance`
 - `level`
 - `copy`
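As a sketch of two of these arguments, the snippet below uses `fill_value` to replace the `NaN` a new label would otherwise get, and `limit` to cap how far a forward fill reaches. The Series are made up for illustration.

```python
import pandas as pd

s = pd.Series([4.5, 7.2, -5.3], index=['d', 'b', 'a'])

# fill_value is used for labels that are not in the original index
filled = s.reindex(['a', 'b', 'c', 'd'], fill_value=0.0)
print(filled)

# With method='ffill', limit=1 fills at most one consecutive new label
colors = pd.Series(['blue', 'purple'], index=[0, 3])
limited = colors.reindex(range(5), method='ffill', limit=1)
print(limited)   # label 2 stays missing because it is two steps after 0
```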

## Dropping entries from an axis

The `drop` method returns a new object with the indicated value or values deleted from an axis.

In [4]:
import pandas as pd 
import numpy as np 

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [8]:
new_obj = obj.drop('c')
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

You can also pass a list of row labels to drop more than one row at a time.

In a DF, index values can be deleted from either axis. 
- to drop from the columns use `axis=1` or `axis = 'columns'`
- combine this argument with the name of the column you want to drop
- **Note:** many functions, such as `drop`, that modify the size and shape of the DF, can do this in-place by using the `inplace=True` argument. 
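A small sketch of dropping from either axis of a DF. The frame `df` and its labels are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape((4, 4)),
                  index=['Ohio', 'Colorado', 'Utah', 'New York'],
                  columns=['one', 'two', 'three', 'four'])

# Drop rows by label (axis=0 is the default); a new object is returned
fewer_rows = df.drop(['Colorado', 'Ohio'])

# Drop columns by label
fewer_cols = df.drop(['two', 'four'], axis='columns')

print(fewer_rows.index.tolist())
print(fewer_cols.columns.tolist())
print(df.shape)   # the original frame is unchanged
```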

## Indexing, Slicing, Filtering

Series indexing (`obj[...]`) works analogously to NumPy array indexing, except you can **also** use the Series index values instead of only integers.
- For example, if you set `index=['a', 'b', 'c']` when you create the Series, you can use these values for indexing.
- In recent pandas versions, integer keys on a non-integer index (such as `obj[1]`) are deprecated and emit a warning; `obj.iloc[1]` is the explicit positional form.

In [9]:
obj = pd.Series(np.arange(4.), index = ['a','b','c','d'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [10]:
obj['b']

np.float64(1.0)

In [11]:
# explicit positional access; plain obj[1] on a label index is deprecated in recent pandas
obj.iloc[1]

np.float64(1.0)

In [13]:
obj[2:4]

c    2.0
d    3.0
dtype: float64

In [14]:
# explicit positional access, for the same reason as above
obj.iloc[[1, 3]]

b    1.0
d    3.0
dtype: float64

In [15]:
obj[obj<2]

a    0.0
b    1.0
dtype: float64

**Slicing with labels behaves differently than normal Python slicing in that the endpoint is inclusive.**

Setting with a label slice
- `obj['b':'c'] = 5`

modifies the corresponding section of the Series.
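A minimal sketch of label slicing and assignment through a label slice. The Series `s` is defined here so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

# Label slice: both endpoints 'b' and 'c' are included
print(s['b':'c'])

# Assigning through a label slice modifies those entries of s
s['b':'c'] = 5
print(s)
```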

Indexing into a Data Frame is for retrieving one or more columns either with a single value or sequence
- `data['column_name']`
- `data[['col_1', 'col_2']]`

Slicing rows or indexing with a boolean array:
- `data[:2]` selects rows `0, 1`.
- `data[data['column'] > 5]` creates a boolean array from `'column'`, then uses that boolean array to select rows.
- `data < 5` produces a boolean DataFrame from the whole DF.
- `data[data < 5] = 0` sets every value less than 5 to 0, across all rows and columns.
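The four forms above can be tried on a small frame. The frame `data` and its labels are assumptions for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

first_two = data[:2]                  # row slice: the first two rows
big_three = data[data['three'] > 5]   # rows where the boolean mask is True
mask = data < 5                       # boolean DataFrame, same shape as data

data[mask] = 0                        # element-wise assignment through the mask
print(data)
```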

### Selecting with `loc` and `iloc`

These are two special indexing operators. They enable you to select a subset of the rows and columns from a DF with NumPy-like notation using either axis labels (`loc`) or integers (`iloc`).

- `data.loc['axis_label', ['column_name1', 'column_name2']]` Select the values from row `axis_label` and the columns specified.
- `data.iloc[2, [3, 0, 1]]` Select the values from row `2` and the columns specified.
- `data.iloc[2]` Select the values from row `2`, and all columns
- `data.iloc[[1, 2], [3, 0, 1]]` Select the values from rows `1, 2` and from columns specified. 

Both indexing operators work with slices in addition to single labels or lists of labels.
- `data.loc[:'row_3', 'col_2']` Select all rows up to and including `row_3`, and take the values from `col_2`.
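The selections above, sketched on a concrete frame. The frame `data` and its labels are assumptions for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

by_label = data.loc['Colorado', ['two', 'three']]   # one row, two columns, by label
by_position = data.iloc[2, [3, 0, 1]]               # row 2, columns 3, 0, 1, by position
block = data.iloc[[1, 2], [3, 0, 1]]                # two rows, three columns
up_to_utah = data.loc[:'Utah', 'two']               # label slice includes 'Utah'

print(by_label)
print(by_position)
print(block)
print(up_to_utah)
```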

Indexing options with DF:
- `df[val]`
- `df.loc[val]`
- `df.loc[:,val]`
- `df.loc[val1, val2]`
- `df.iloc[where]`
- `df.iloc[:, where]`
- `df.iloc[where_i, where_j]`
- `df.at[label_i, label_j]`
- `df.iat[i, j]`
- `reindex` method
- `get_value`, `set_value` methods (removed in pandas 1.0; use `at` and `iat` instead)

## Integer Indexes

...

## Arithmetic and Data Alignment 

An important pandas feature for some applications is the behavior of arithmetic between objects with different indexes. 
- When you add objects together, if any index pairs are not the same, the index of the result is the union of the index pairs (similar to an automatic outer join on the index labels).
- In other words, a label found in only one object gets `NaN` (missing) in the result, even though that object had a value for it, because the label does not exist in the other object.

In the case of a DataFrame, alignment is performed on both the rows and the columns. As you can imagine, this can introduce a lot of missing values (`NaN`).

### Arithmetic methods with Fill Values

To limit this, we can substitute a fill value (such as 0) when an axis label is found in one object but not the other.
- To do this, use the `.add` method on the first DF and pass the fill value as an argument.
- `df1.add(df2, fill_value=0)`
- A fill value can also be specified when using the `reindex` method.
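A small sketch contrasting `+` with `.add(..., fill_value=0)`. The frames `df1` and `df2` are made up for illustration. Note that a cell missing from both frames stays `NaN` even with a fill value.

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]}, index=['x', 'y'])
df2 = pd.DataFrame({'b': [10.0, 20.0], 'c': [30.0, 40.0]}, index=['y', 'z'])

plain = df1 + df2                     # NaN wherever a label is missing from either frame
filled = df1.add(df2, fill_value=0)   # the missing side is treated as 0

print(plain)
print(filled)
```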

Flexible arithmetic methods: 
- `add`, `radd`
- `sub`, `rsub`
- `div`, `rdiv`
- `floordiv`, `rfloordiv`
- `mul`, `rmul`
- `pow`, `rpow`

### Operations between DF's and Series 

As a motivating example, consider the subtraction between a 2D array and one of its rows.
- `arr = np.arange(12.).reshape((3,4))`
- `arr - arr[0]`

In this example the subtraction is performed once for each row. This is referred to as **broadcasting**. Operations between a DF and a Series are similar.

DF - Series: 
- `series = frame.iloc[0]`
- By default arithmetic between DF and Series matches the index of the series on the DF's columns, broadcasting down the rows.
- In other words, each row of the DF subtracts the Series. (df_row1 - Series), (df_row2 - Series), etc.
- **Broadcasting** down the rows
- If an index is not found in one or the other (DF or Series index), the objects will be reindexed to form the union.
- Note that this means if the data is not in both, it will turn to a missing value.
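A sketch of the default behavior described above, including the union that forms when the Series index and the DF columns do not fully overlap. The names `frame`, `first_row`, and `other` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
first_row = frame.iloc[0]

# The Series index is matched on the columns and broadcast down the rows
diff = frame - first_row
print(diff)

# Labels found in only one object produce all-NaN columns in the union
other = pd.Series([0.0, 1.0, 2.0], index=['b', 'e', 'f'])
unioned = frame + other
print(unioned)
```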

Broadcasting over the columns
- You have to use one of the arithmetic methods for this, e.g. `.sub`.
- `series_1 = data_frame['col2']`
- `data_frame.sub(series_1, axis='index')`
- In this case, the axis that we pass is the axis we want to **match on**.

In [25]:
data_frame_1 = pd.DataFrame(np.arange(12).reshape((3, 4)), index = ['a','b','c'], columns=['d','e','f','g'])
data_frame_1

   d  e   f   g
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11


In [27]:
series_1 = data_frame_1['d']
series_1


a    0
b    4
c    8
Name: d, dtype: int64

In [32]:
data_frame_1.sub(series_1, axis=0)

   d  e  f  g
a  0  1  2  3
b  0  1  2  3
c  0  1  2  3


## Function Application and Mapping

NumPy `ufuncs` (universal functions, i.e. element-wise array methods) also work with pandas objects.
- `np.abs(frame)`


Another frequent operation is applying a function on one-dimensional arrays to each column or row. DataFrame's `apply` method does exactly this:
- `f = lambda x: x.max() - x.min()` Compute the difference between the max and min of the Series `x`.
- `frame.apply(f)` Apply this to the data frame, i.e. to each column.
- The result is a Series with one value (the output of `f`) for each column. The columns of `frame` are the index of this output Series.
- If you pass `axis='columns'` to `apply`, the function will be invoked once per row instead.
- Now the output will be a Series whose index is the row index of `frame`, and whose values are the scalar outputs of `f`.

Many of the most common array statistics (like `sum`, `mean`) are DF methods, so using apply is not necessary. 

Note, the function passed to the `apply` method does not have to return a scalar value. It can return a Series with multiple values, in which case the result is a DataFrame rather than a Series.

Element-wise python functions can be used as well. 
- Series has a `map` method for applying an element-wise function.
- `frame['column1'].map(function)`
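A compact sketch of `apply` in both directions, a function that returns a Series, and `map` on a single column. The frame and the helper names `spread` and `min_max` are hypothetical.

```python
import pandas as pd

frame = pd.DataFrame({'b': [1.0, -2.0, 3.0], 'd': [0.5, 4.5, -1.5]},
                     index=['Utah', 'Ohio', 'Texas'])

def spread(x):
    """Difference between the max and min of a Series."""
    return x.max() - x.min()

def min_max(x):
    """Return two statistics at once, as a Series."""
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

per_column = frame.apply(spread)                 # one value per column
per_row = frame.apply(spread, axis='columns')    # one value per row
stats = frame.apply(min_max)                     # Series-returning function gives a DataFrame
formatted = frame['b'].map(lambda v: f'{v:.2f}') # element-wise on one column

print(per_column)
print(per_row)
print(stats)
print(formatted)
```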

## Sorting and Ranking 

To sort lexicographically by row or column index, use the `sort_index` method, which returns a **new sorted object.**

- `series.sort_index()`

With a DF, you can sort by index on either axis.
- `frame.sort_index(axis=0)`
- Ascending order by default. Pass `ascending` argument to change this (bool).
- Not passing axis arg results in row sorting.

Sorting a series by its values: 
- `series.sort_values()`
- any missing (`NaN`) values are pushed to the end. 

 When sorting a DF, you can use the data in one or more columns as the sort keys. To do so, pass one or more column names to the `by` option of `sort_values`.
 - `data_frame.sort_values(by='column3')` Sorts the whole data frame according to column 3.
 - you can also pass more than one column to the `by` argument in a list `[]` syntax. 
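A short sketch of the sorting calls above. The frame and Series are made up for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

by_columns = frame.sort_index(axis=1)            # sort the column labels
by_b = frame.sort_values(by='b')                 # sort rows by one column
by_a_then_b = frame.sort_values(by=['a', 'b'])   # sort rows by two columns

s = pd.Series([4, np.nan, 7, np.nan, -3, 2])
sorted_s = s.sort_values()                       # NaN values go to the end

print(by_columns)
print(by_b)
print(by_a_then_b)
print(sorted_s)
```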

**Ranking** assigns ranks from 1 through the total number of valid data points in an array.
- By default, the `rank` method breaks ties by assigning each group the mean rank.
- `data_frame.rank()`
- `series.rank()`
- Ranks can also be assigned according to the order in which the values are observed in the data.
- Use the `method='first'` argument.
- You can also rank in descending order: `series.rank(ascending=False, method='max')`. Here `method='max'` breaks ties using the maximum rank of the whole tied group.
- DFs can compute ranks over the rows or over the columns.
- `data_frame.rank(axis='columns')` ranks the values within each row, i.e. across the columns.

Tie-breaking methods with rank:
- `average`
- `min`
- `max`
- `first`
- `dense`
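A sketch comparing three of the tie-breaking methods on one Series. The values are chosen so that two pairs of ties occur.

```python
import pandas as pd

s = pd.Series([7, -5, 7, 4, 2, 0, 4])

average_rank = s.rank()                               # ties share the mean rank
first_rank = s.rank(method='first')                   # ties broken by order of appearance
desc_max_rank = s.rank(ascending=False, method='max') # descending, ties get the max rank

print(pd.DataFrame({'value': s,
                    'average': average_rank,
                    'first': first_rank,
                    'desc_max': desc_max_rank}))
```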

## Axis Indexes with Duplicate Labels

Unique axis labels are not mandatory.
- The `series.index.is_unique` attribute tells us whether the labels are unique or not. It returns a bool.

Data Selection behaves differently with duplicates. Indexing a label with multiple entries returns a series, while single entries return a scalar value. 
- In other words, if the index of the row **is not** a duplicate, it will return a single scalar value for that column
- If the index of the row **is** a duplicate, it will return a series.
- Just to reiterate, the output type from indexing can vary based on whether a label is repeated or not.
- The same logic applies to DF's as well. 
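A minimal sketch of how the return type changes with duplicate labels. The Series `s` is made up for illustration.

```python
import pandas as pd

s = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

print(s.index.is_unique)   # False: 'a' and 'b' repeat
print(s['a'])              # repeated label: a Series with both entries
print(s['c'])              # unique label: a single scalar
```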

# 5.3 Summarizing and Computing Descriptive Statistics 

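Pandas objects are equipped with a set of common mathematical and statistical methods. Compared with the similar methods found on NumPy arrays, they have built-in handling for missing data.
- NA values are excluded by default (if the entire slice is NA, statistics such as `mean` are NA); pass `skipna=False` to disable this.

Options for the common reduction / summary statistics methods include:
- `axis` The axis to reduce over: 0 for DF rows, and 1 for columns
- `skipna` Exclude missing values; `True` by default
- `level` Reduce grouped by level if the axis is hierarchically indexed (removed from the reduction methods in pandas 2.0; use `groupby(level=...)` instead)

Some methods return indirect statistics:
- `idxmax` returns the index label of the max
- `idxmin` returns the index label of the min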

Accumulations: 
- `data_frame.cumsum()`

`describe` is not like these, it produces multiple summary stats in one shot. Try it on both numerical and non-numerical data. 
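A sketch of reductions, `skipna`, an accumulation, and `describe` on a frame with missing values. The frame `df` is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

col_sums = df.sum()                                 # NaN values are skipped
row_means = df.mean(axis='columns')                 # row 'c' is all NaN, so its mean is NaN
strict_means = df.mean(axis='columns', skipna=False)  # any NaN in a row makes the result NaN
max_labels = df.idxmax()                            # index label of each column's max

print(col_sums)
print(row_means)
print(strict_means)
print(max_labels)
print(df.cumsum())      # accumulation: running total down each column
print(df.describe())    # several summary statistics at once
```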

Descriptive and Summary Statistics: 
- `count`
- `describe`
- `min, max`
- `argmin, argmax`
- `idxmin, idxmax`
- `quantile`
- `sum`
- `mean`
- `median`
- `mad` (removed in pandas 2.0)
- `prod`
- `var`
- `std`
- `skew`
- `kurt`
- `cumsum`
- `cummin, cummax`
- `cumprod`
- `diff`
- `pct_change`

## Correlation and Covariance

**Some** summary statistics, like correlation and covariance, are computed from pairs of arguments.

The data that we will be working with has stock tickers (such as 'AAPL', 'GOOG') as column labels, dates as the row index (time series data), and prices as the values.

 Let's compute the percent change of the prices
 - `result = data_frame.pct_change()`
 - `result.tail()`

The `corr` method of the series computes the correlation of the overlapping, non-NA, aligned-by-index values in two Series. `cov` does the same but for the covariance.
- `result['column1'].corr(result['column2'])`
- `result['column4'].cov(result['column5'])`

If the column names are valid Python attributes, we can select them with the following syntax:
- `result.column1.corr(result.column2)`

Calling these methods on a DataFrame returns a full correlation or covariance matrix as a DF.
- Using DF's `corrwith()` method, we can compute **pairwise correlations** between a DF's columns or rows with another Series or DF.
- `result.corrwith(result.column1)`
- Passing a DF computes the correlations of **matching column names**.
- Passing `axis='columns'` does things row-by-row instead.
- **In all cases, the data points are aligned by label before the correlation is computed.**
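Since the price data itself is not included in these notes, the sketch below substitutes synthetic random-walk prices. The tickers `AAA`, `BBB`, `CCC`, the dates, and the seed are all made up, so the printed numbers carry no meaning beyond showing the calls.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the price data (hypothetical tickers and dates)
rng = np.random.default_rng(0)
dates = pd.date_range('2024-01-01', periods=50, freq='D')
prices = pd.DataFrame(100 + rng.standard_normal((50, 3)).cumsum(axis=0),
                      index=dates, columns=['AAA', 'BBB', 'CCC'])

returns = prices.pct_change()   # the first row is NaN by construction

pair_corr = returns['AAA'].corr(returns['BBB'])   # Series vs. Series
pair_cov = returns['AAA'].cov(returns['BBB'])
corr_matrix = returns.corr()                      # full matrix as a DataFrame
with_aaa = returns.corrwith(returns['AAA'])       # every column vs. one Series

print(pair_corr, pair_cov)
print(corr_matrix)
print(with_aaa)
```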

## Unique Values, Value Counts, and Membership

A class of related methods that extracts information about the values contained in a 1d series. 
- `series.unique()` gives us an array of the unique values in a Series. They are returned in order of first appearance, not sorted, but the resulting array can be sorted afterwards with its `.sort()` method.
- Since `.unique()` returns a new array, assign the result to a variable if you want to keep it.
- `series.value_counts()` computes a Series containing value frequencies, sorted by count in descending order.
- `pd.value_counts()` is also available as a top-level pandas function that can be used with any array or sequence (deprecated in recent pandas versions in favor of `pd.Series(values).value_counts()`).
- `isin` performs a vectorized set membership check and can be useful in filtering a dataset down to a subset of values in a Series or column in a DF. It is useful because it creates a boolean **mask**.
- `mask = series.isin(['value1', 'value2'])`
- Make a subset using the mask: `subset = series[mask]`

Unique value counts and set membership methods:
- `isin`
- `get_indexer`
- `unique`
- `value_counts`
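A small sketch of `unique`, `value_counts`, and `isin` on one Series. The Series `obj` is made up for illustration.

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

uniques = obj.unique()          # array of distinct values, in order of first appearance
counts = obj.value_counts()     # frequencies, largest count first
mask = obj.isin(['b', 'c'])     # boolean mask, True where the value is 'b' or 'c'
subset = obj[mask]              # filter the Series with the mask

print(uniques)
print(counts)
print(subset)
```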


Question: What does the output of this look like?

In [4]:
data = pd.DataFrame({'Q1': [1,3,4,3,4],
                     'Q2': [2,3,1,2,3],
                     'Q3': [1,5,2,4,4]})
data

   Q1  Q2  Q3
0   1   2   1
1   3   3   5
2   4   1   2
3   3   2   4
4   4   3   4


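- `result = data.apply(pd.value_counts).fillna(0)`

Well, the `.apply` method applies the function to each column of the DF, so `value_counts` is applied to each column. The row labels of the result are the distinct values found across all columns, and each entry is how often that value occurs in that column (`NaN`, or 0 after `fillna(0)`, where it does not occur).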


In [15]:
# pd.value_counts is deprecated in recent pandas; the Series method gives the same result
data.apply(pd.Series.value_counts)

    Q1   Q2   Q3
1  1.0  1.0  1.0
2  NaN  2.0  1.0
3  2.0  2.0  NaN
4  2.0  NaN  2.0
5  NaN  NaN  1.0
