## 8.1 Hierarchical Indexing


### Construct data frames with multiindex
1. Pass nested lists as the `index` or `columns` argument of the `pd.Series` or `pd.DataFrame` constructor.
    ```
    pd.DataFrame(np.arange(12).reshape((4, 3)),
                 index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
                 columns=[["Ohio", "Ohio", "Colorado"],
                          ["Green", "Red", "Green"]])
    ```
2. A MultiIndex can be created by itself and then reused; the columns in the preceding DataFrame with level names could also be created like this:
    ```
    pd.MultiIndex.from_arrays([["Ohio", "Ohio", "Colorado"],
                               ["Green", "Red", "Green"]],
                              names=["state", "color"])
    ```

3. Attributes of `index` or `columns`:
    - `.names`: the names of the levels (used instead of `.name` when the index is a MultiIndex)
    - `.nlevels`: the number of levels
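A minimal sketch tying the three points together, assuming the same sample values as above. The row level names `key1` and `key2` are illustrative choices, not required names.

```python
import numpy as np
import pandas as pd

# Build the column MultiIndex on its own so it can be reused.
columns = pd.MultiIndex.from_arrays(
    [["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
    names=["state", "color"],
)

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=columns,
)

# Name the row levels after construction (illustrative names).
frame.index.names = ["key1", "key2"]

print(frame.index.nlevels)        # number of row levels
print(list(frame.columns.names))  # names of the column levels

# A fully specified (row, column) label pair selects a single cell.
cell = frame.loc[("a", 2), ("Ohio", "Red")]
```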



### Reordering and Sorting Levels
Methods:
- `.swaplevel("level1", "level2")` or `.swaplevel(0, 1)`: interchange two levels
- `.sort_index(level=0)`: sort the rows by one index level; `level=0` is the outermost level, `level=1` the inner level

The `swaplevel` method takes two level numbers or names and returns a new object with the levels interchanged (but the data is otherwise unaltered).

By default, `sort_index` sorts the data lexicographically using all the index levels, but you can sort by a single level or a subset of levels by passing the `level` argument.
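A small sketch of both methods, assuming a hypothetical frame with row levels named `key1` and `key2`:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_arrays(
        [["a", "a", "b", "b"], [1, 2, 1, 2]], names=["key1", "key2"]
    ),
    columns=["x", "y", "z"],
)

# Interchange the two row levels; the row order and data are unchanged.
swapped = frame.swaplevel("key1", "key2")

# Sort by the (new) outermost level so the index is lexicographically ordered.
by_key2 = swapped.sort_index(level=0)

print(by_key2)
```

Swapping and then sorting is a common pairing, because a swapped index is usually no longer in lexicographic order.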

:::{.callout-note}
Data selection performance is much better on hierarchically indexed objects if the index is lexicographically sorted starting with the outermost level, that is, the result of calling `sort_index(level=0)` or `sort_index()`.
:::

### Summary Statistics by Level

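`.groupby(level="key", axis="columns").sum()`: aggregate by the level named "key"; the default axis is "index". Recent pandas releases deprecate the `axis` argument of `groupby`, so grouping a transposed frame is an alternative for column levels.

Older pandas versions gave many descriptive and summary statistics on DataFrame and Series a `level` option for aggregating by a level on a particular axis. In pandas 2.0 and later, the same result comes from `groupby(level=...)` followed by the statistic.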
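A hedged sketch of aggregating by level with `groupby`. The sample frame and level names are assumptions, and the column-level case goes through a transpose rather than `axis="columns"` so that it does not depend on the deprecated argument.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_arrays(
        [["a", "a", "b", "b"], [1, 2, 1, 2]], names=["key1", "key2"]
    ),
    columns=pd.MultiIndex.from_arrays(
        [["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
        names=["state", "color"],
    ),
)

# Sum the rows that share the same value in the "key2" row level.
row_sums = frame.groupby(level="key2").sum()

# Sum the columns that share the same "color": transpose, group, transpose back.
col_sums = frame.T.groupby(level="color").sum().T

print(row_sums)
print(col_sums)
```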

### Indexing with a DataFrame's Columns

- `.set_index(['col1','col2'], drop=True)`: build a MultiIndex from `col1` and `col2`. With `drop=False`, the columns are also kept in the DataFrame.
- `.reset_index()`: move the index levels back into columns

DataFrame's `set_index` method creates a new DataFrame using one or more of its columns as the index.

`reset_index` does the opposite of `set_index`: the hierarchical index levels are moved into the columns.
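A short round trip, using a made-up frame with columns `a`, `b`, `c`, and `d`:

```python
import pandas as pd

frame = pd.DataFrame({
    "a": range(7),
    "b": range(7, 0, -1),
    "c": ["one", "one", "one", "two", "two", "two", "two"],
    "d": [0, 1, 2, 0, 1, 2, 3],
})

# Move "c" and "d" out of the columns and into a two-level row index.
indexed = frame.set_index(["c", "d"])

# Same index, but the source columns also stay in the frame.
kept = frame.set_index(["c", "d"], drop=False)

# Move the index levels back into ordinary columns.
restored = indexed.reset_index()

print(indexed.head(3))
print(list(restored.columns))
```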

## 8.2 Combining and Merging Datasets

Data contained in pandas objects can be combined in a number of ways:

`pandas.merge`
Connect rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database `join` operations.


`pandas.concat`
Concatenate or "stack" objects together along an axis.

`combine_first`
Splice together overlapping data to fill in missing values in one object with values from another.

:::{.callout-warning}
When you're joining columns on columns, the indexes on the passed DataFrame objects are discarded. If you need to preserve the index values, you can use `reset_index` to append the index to the columns.
:::

### Database-Style DataFrame Joins


- `pd.merge(df1, df2, on="key")`: inner join on the column "key". If `on` is omitted, the overlapping column names are used as the keys.
- `pd.merge(df1, df2, left_on="left_key", right_on="right_key", how="inner")`: join on differently named key columns; `how` is one of "inner", "left", "right", or "outer"
- `pd.merge(left, right, on=["key1", "key2"], how="outer")`: merge with multiple keys
- `pd.merge(left, right, on="key1", suffixes=("_left", "_right"))`: append suffixes to overlapping non-key column names

Table 8.1: Different join types with the `how` argument

| Option | Behavior |
|:----------|:--------------------------------------------------------|
| `how="inner"` | Use only the key combinations observed in both tables |
| `how="left"` | Use all key combinations found in the left table |
| `how="right"` | Use all key combinations found in the right table |
| `how="outer"` | Use all key combinations observed in both tables together |

By default, `pandas.merge` does an "inner" join; the keys in the result are the intersection, or the common set found in both tables. Other possible options are "left", "right", and "outer". The outer join takes the union of the keys, combining the effect of applying both left and right joins.
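To make the join types concrete, here is a sketch with two small made-up frames that share a `key` column:

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["b", "b", "a", "c"], "data1": [0, 1, 2, 3]})
df2 = pd.DataFrame({"key": ["a", "b", "d"], "data2": [10, 11, 12]})

# Inner join (the default): only keys present in both frames survive.
inner = pd.merge(df1, df2, on="key")

# Left join: every row of df1 is kept; unmatched keys get NaN in data2.
left = pd.merge(df1, df2, on="key", how="left")

# Outer join: the union of the keys from both frames.
outer = pd.merge(df1, df2, on="key", how="outer")

print(inner)
print(left)
print(outer)
```

Here "c" appears only in `df1` and "d" only in `df2`, so both drop out of the inner join and both appear, with missing values, in the outer join.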

Table 8.2: `pandas.merge` function arguments

| Argument | Description |
|:---------|:-----------------------------------------------------------------------|
| `left` | DataFrame to be merged on the left side. |
| `right` | DataFrame to be merged on the right side. |
| `how` | Type of join to apply: one of "inner", "outer", "left", or "right"; defaults to "inner". |
| `on` | Column names to join on. Must be found in both DataFrame objects. If not specified and no other join keys given, will use the intersection of the column names in left and right as the join keys. |
| `left_on` | Columns in left DataFrame to use as join keys. Can be a single column name or a list of column names. |
| `right_on` | Analogous to `left_on` for right DataFrame. |
| `left_index` | Use row index in left as its join key (or keys, if a MultiIndex). |
| `right_index` | Analogous to `left_index`. |
| `sort` | Sort merged data lexicographically by join keys; False by default. |
| `suffixes` | Tuple of string values to append to column names in case of overlap; defaults to ("_x", "_y") (e.g., if "data" in both DataFrame objects, would appear as "data_x" and "data_y" in result). |
| `copy` | If False, avoid copying data into resulting data structure in some exceptional cases; by default always copies. |
| `validate` | Verifies if the merge is of the specified type, whether one-to-one, one-to-many, or many-to-many. See the docstring for full details on the options. |
| `indicator` | Adds a special column `_merge` that indicates the source of each row; values will be "left_only", "right_only", or "both" based on the origin of the joined data in each row. |
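A brief sketch of three of the arguments in Table 8.2 (`suffixes`, `indicator`, and `validate`) working together. The frames are invented for illustration, and both deliberately contain a non-key column called `data`.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "data": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "data": [20, 30, 40]})

merged = pd.merge(
    left,
    right,
    on="key",
    how="outer",
    suffixes=("_left", "_right"),  # disambiguate the overlapping "data" column
    indicator=True,                # add a "_merge" column with each row's origin
    validate="one_to_one",         # raise if keys are duplicated on either side
)

print(merged)
```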

### Merging on Index

In some cases, the merge key(s) in a DataFrame will be found in its `index` (row labels). In this case, you can pass `left_index=True` or `right_index=True` (or both) to indicate that the index should be used as the merge key:
- `pd.merge(left1, right1, left_on="key", right_index=True) `
- `pd.merge(left2, right2, how="outer", left_index=True, right_index=True)`

DataFrame also has a `join` method, which merges on the index:
- `left2.join(right2, how="outer")`: join using the index of both DataFrames
- `left1.join(right1, on="key")`: left join of the "key" column of `left1` with the index of `right1`

For simple index-on-index merges, you can pass a list of DataFrames to `join` as an alternative to using the more general `pandas.concat` function described in the next section:
- `left2.join([right2, another])`: left join on the index
- `left2.join([right2, another], how="outer")`: outer join on the index
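The index-based variants can be sketched as follows. The frames reuse the names `left1`, `right1`, `left2`, and `right2` from the bullets, but their contents are assumptions made for this example.

```python
import pandas as pd

left1 = pd.DataFrame({"key": ["a", "b", "a", "c"], "value": range(4)})
right1 = pd.DataFrame({"group_val": [3.5, 7.0]}, index=["a", "b"])

# Column "key" on the left is matched against the index on the right (inner join).
merged = pd.merge(left1, right1, left_on="key", right_index=True)

# The same match as a left join: all rows of left1 are kept.
joined = left1.join(right1, on="key")

left2 = pd.DataFrame({"x": [1, 2]}, index=["a", "c"])
right2 = pd.DataFrame({"y": [10, 20]}, index=["b", "c"])

# Index-on-index outer join.
outer = left2.join(right2, how="outer")

print(merged)
print(joined)
print(outer)
```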

### pd.concat

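NumPy's `concatenate` function joins arrays along an axis:
- `np.concatenate([arr, arr], axis=1)`

`pandas.concat` does the same for pandas objects. By default it works along axis="index", so concatenating Series produces another Series. If you pass axis="columns", the result will instead be a DataFrame:
- `pd.concat([s1, s2, s3])`: concatenate along the rows, gluing the values and index labels end to end
- `pd.concat([s1, s4], axis="columns", join="inner")`: align on the index and keep only the labels found in both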
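A sketch of the two axes, assuming small Series with non-overlapping labels. The values are made up, and `s4` is built from `s1` and `s3` so that it overlaps with `s1`.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=["a", "b"])
s2 = pd.Series([2, 3, 4], index=["c", "d", "e"])
s3 = pd.Series([5, 6], index=["f", "g"])
s4 = pd.concat([s1, s3])

# Along the rows (default): the result is a longer Series.
stacked = pd.concat([s1, s2, s3])

# Along the columns: each Series becomes a column, aligned on the index.
side = pd.concat([s1, s4], axis="columns")

# join="inner" keeps only the labels shared by every input.
inner = pd.concat([s1, s4], axis="columns", join="inner")

print(stacked)
print(side)
print(inner)
```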

A potential issue is that the concatenated pieces are not identifiable in the result. Suppose instead you wanted to create a hierarchical index on the concatenation axis. To do this, use the keys argument:
- `pd.concat([s1, s1, s3], keys=["one", "two", "three"])`: one key per Series
- `pd.concat([df1, df2], axis="columns", keys=["level1", "level2"])`: the keys become the outer column level; rows are aligned on the index with an outer join by default
- `pd.concat({"level1": df1, "level2": df2}, axis="columns")`: alternatively, the dictionary keys are used for the `keys` option
- `pd.concat([df1, df2], axis="columns", keys=["level1", "level2"], names=["upper", "lower"])`: name the created axis levels with the `names` argument
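The effect of `keys` and `names` is easier to see on concrete data. This sketch assumes small invented inputs named as in the bullets above.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([0, 1], index=["a", "b"])
s3 = pd.Series([5, 6], index=["f", "g"])

# Each key labels one input, forming the outer level of the row index.
result = pd.concat([s1, s1, s3], keys=["one", "two", "three"])

df1 = pd.DataFrame(np.arange(6).reshape(3, 2),
                   index=["a", "b", "c"], columns=["one", "two"])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2),
                   index=["a", "c"], columns=["three", "four"])

# Along the columns, the keys become the outer column level instead.
wide = pd.concat([df1, df2], axis="columns",
                 keys=["level1", "level2"], names=["upper", "lower"])

print(result)
print(wide)
```

Row "b" exists only in `df1`, so its `level2` cells are missing in `wide`, which reflects the default outer join on the other axis.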

You can pass `ignore_index=True`, which discards the indexes from each DataFrame, concatenates the data in the columns only, and assigns a new default index:
- `pd.concat([df1, df2], ignore_index=True)`: ignore the original index
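A quick comparison with and without `ignore_index`, using two hypothetical frames whose columns are the same but listed in a different order:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])
df2 = pd.DataFrame(np.arange(4).reshape(2, 2), columns=["b", "a"])

# Default: the original row labels are kept, so labels 0 and 1 repeat.
kept = pd.concat([df1, df2])

# ignore_index=True: rows get a fresh 0..n-1 index; columns still align by name.
fresh = pd.concat([df1, df2], ignore_index=True)

print(kept)
print(fresh)
```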

Table 8.3: `pandas.concat` function arguments

| Argument | Description |
|:-----------------|:----------------------------------------------------------------------|
| `objs` | List or dictionary of pandas objects to be concatenated; this is the only required argument |
| `axis` | Axis to concatenate along; defaults to concatenating along rows (axis="index") |
| `join` | Either "inner" or "outer" ("outer" by default); whether to intersect (inner) or union (outer) indexes along the other axes |
| `keys` | Values to associate with objects being concatenated, forming a hierarchical index along the concatenation axis; can be a list or array of arbitrary values, an array of tuples, or a list of arrays (if multiple-level arrays passed in levels) |
| `levels` | Specific indexes to use as hierarchical index level or levels if keys passed |
| `names` | Names for created hierarchical levels if keys and/or levels passed |
| `verify_integrity` | Check new axis in concatenated object for duplicates and raise an exception if so; by default (False) allows duplicates |
| `ignore_index` | Do not preserve indexes along concatenation axis, instead produce a new range(total_length) index |

### Combining Data with Overlap

`numpy.where` does not check whether the index labels are aligned (and does not even require the objects to be the same length), so if you want to line up values by index, use the Series `combine_first` method.

Where a value in `a` is NaN or missing at a given index label, it is filled with the corresponding value from `b` at that label. Values in `a` that are not missing remain unchanged.

`a.combine_first(b)`: the index of the result is the sorted union of both indexes.

With DataFrames, `combine_first` does the same thing column by column.

`df1.combine_first(df2)`: the output has the union of all the column names.
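A sketch of `combine_first` on a Series and on a DataFrame. All values are invented, and the names `a`, `b`, `df1`, and `df2` simply mirror the text above.

```python
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5, 0.0, np.nan], index=["f", "e", "d", "a"])
b = pd.Series([0.0, np.nan, 2.0, 3.0], index=["a", "b", "c", "f"])

# Values from a win; gaps in a are patched from b, matched by label.
combined = a.combine_first(b)

df1 = pd.DataFrame({"a": [1.0, np.nan, 5.0, np.nan],
                    "b": [np.nan, 2.0, np.nan, 6.0]})
df2 = pd.DataFrame({"a": [5.0, 4.0, np.nan, 3.0, 7.0],
                    "c": [10.0, 20.0, 30.0, 40.0, 50.0]})

# Column by column; the result carries the columns of both inputs.
df_combined = df1.combine_first(df2)

print(combined)
print(df_combined)
```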

## 8.3 Reshaping and Pivoting

### Reshaping with Hierarchical Indexing

`stack`: to make a long table

This “rotates” or pivots from the columns in the data to the rows.

`data.stack()` 

`unstack`: to make a wide table

This pivots from the rows into the columns. By default, the innermost level is unstacked (the same is true of `stack`). You can unstack a different level by passing a level number or name:

- `result.unstack()`
- `result.unstack(level=0)`
- `result.unstack(level="state")`
- `data2.unstack().stack()`: stacking filters out missing data by default, so the operation is more easily invertible
- `data2.unstack().stack(dropna=False)`: keep the NA values
- `df.unstack(level="state").stack(level="side")`: as with `unstack`, you can name the level to stack
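A minimal round trip between the wide and long layouts. The frame below is an assumption: two states by three numbered columns, with the axes named `state` and `number`.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(["Ohio", "Colorado"], name="state"),
    columns=pd.Index(["one", "two", "three"], name="number"),
)

# Columns -> innermost row level: a Series indexed by (state, number).
result = data.stack()

# Innermost row level -> columns: back to the wide layout.
wide_again = result.unstack()

# Unstack a different level by name: states become the columns.
by_state = result.unstack(level="state")

print(result)
print(by_state)
```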

### Pivoting “Long” to “Wide” Format by .pivot()

The `pop` method on a DataFrame returns a column while deleting it from the DataFrame at the same time.

`pandas.PeriodIndex` represents time intervals rather than points in time; it is discussed in more detail in Ch 11: Time Series. Here it combines the `year` and `quarter` columns into a single index:

```
periods = pd.PeriodIndex(year=data.pop("year"),
                         quarter=data.pop("quarter"),
                         name="date") 
```

One way to obtain long data is to use a combination of the `.stack()` and `.reset_index()` methods:
```
long_data = (data.stack()
             .reset_index()
             .rename(columns={0: "value"}))
```

Now use `.pivot()` to make a wide data table
```
long_data.pivot(index="date", columns="item",
                          values="value")
```

If you omit the last argument (`values`), you obtain a DataFrame with hierarchical columns when there is more than one column of values.
```
long_data.pivot(index="date", columns="item")
```

Note that pivot is equivalent to creating a hierarchical index using set_index followed by a call to unstack:

```
long_data.set_index(["date", "item"]).unstack(level="item")
```
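The equivalence can be checked on a tiny long-format table. The data below is invented, with dates written as plain strings to keep the sketch self-contained, and `value2` is a hypothetical second value column.

```python
import pandas as pd

long_data = pd.DataFrame({
    "date": ["2000Q1", "2000Q1", "2000Q2", "2000Q2"],
    "item": ["realgdp", "infl", "realgdp", "infl"],
    "value": [100.0, 2.0, 101.5, 2.5],
})

# One row per date, one column per item.
wide = long_data.pivot(index="date", columns="item", values="value")

# The same reshaping spelled out with set_index and unstack.
via_unstack = long_data.set_index(["date", "item"])["value"].unstack(level="item")

# With a second value column and no `values` argument, columns are hierarchical.
long_data["value2"] = [1.0, 2.0, 3.0, 4.0]
hier = long_data.pivot(index="date", columns="item")

print(wide)
print(hier)
```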

### Pivoting “Wide” to “Long” Format by `melt()`

 `pandas.melt` merges multiple columns into one, producing a DataFrame that is longer than the input. The "key" column may be a group indicator, and the other columns are data values. When using `pandas.melt`, we must indicate which columns (if any) are group indicators. Let's use "key" as the only group indicator here:

```
melted = pd.melt(df, id_vars="key")  # id_vars can be omitted, in which case there is no group identifier
```

Using pivot, we can reshape back to the original layout:

```
reshaped = melted.pivot(index="key", columns="variable",
                        values="value")
```
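Putting both directions together, here is a sketch with a made-up frame. The final `reset_index` step is an addition that moves `key` back into a regular column.

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["foo", "bar", "baz"],
    "A": [1, 2, 3],
    "B": [4, 5, 6],
    "C": [7, 8, 9],
})

# Wide -> long: one row per (key, variable) pair.
melted = pd.melt(df, id_vars="key")

# Long -> wide: "key" becomes the index, one column per variable.
reshaped = melted.pivot(index="key", columns="variable", values="value")

# pivot leaves "key" in the index; reset_index restores it as a column.
back = reshaped.reset_index()

print(melted)
print(back)
```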