### Pandas Notes

Pandas = Python’s spreadsheet on steroids.

A free, open-source library that gives you two super-powered objects:

- Series = a single column of labelled data (any type).
- DataFrame = a whole table of Series that share an index (rows + columns).

In one sentence: “Import messy data, clean it, slice it, join it, group it, time-series it, and export it, all in a couple of lines.”

To get started with pandas, you will need to get comfortable with its two workhorse data structures: Series and DataFrame. While they are not a universal solution for every problem, they provide a solid foundation for a wide variety of data tasks.

In [None]:
%pip install pandas

In [1]:
import pandas as pd

In [None]:
obj = pd.Series([4, 7, -5, 3])

In [None]:
obj

In [None]:
print(obj[0])

You can get the array representation and index object of the Series via its array and index attributes, respectively:

In [None]:
print(obj.array)

In [None]:
obj.index

Often, you'll want to create a Series with an index identifying each data point with a label:

In [None]:
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
obj2

In [None]:
obj2.index

Compared with NumPy arrays, you can use labels in the index when selecting single values or a set of values:

In [None]:
obj2["a"]

In [None]:
obj2["d"] = 6

In [None]:
obj2[["c", "a", "d"]]

Using NumPy functions or NumPy-like operations, such as filtering with a Boolean array, scalar multiplication, or applying math functions, will preserve the index-value link:

In [None]:
obj2[obj2 > 0]

In [None]:
obj2 * 2

In [3]:
import numpy as np

In [None]:
np.exp(obj2)

Another way to think about a Series is as a fixed-length, ordered dictionary, as it is a mapping of index values to data values. It can be used in many contexts where you might use a dictionary:

In [None]:
"b" in obj2

In [None]:
"e" in obj2

Should you have data contained in a Python dictionary, you can create a Series from it by passing the dictionary:

In [None]:
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}

In [None]:
obj3 = pd.Series(sdata)

In [None]:
obj3

A Series can be converted back to a dictionary with its to_dict method:

In [None]:
obj2.to_dict()

When you are only passing a dictionary, the index in the resulting Series will respect the order of the keys according to the dictionary's keys method, which depends on the key insertion order. You can override this by passing an index with the dictionary keys in the order you want them to appear in the resulting Series:

In [None]:
states = ["California", "Ohio", "Oregon", "Texas"]

In [None]:
obj4 = pd.Series(sdata, index=states)

In [None]:
obj4

Here, three values found in sdata were placed in the appropriate locations, but since no value for "California" was found, it appears as NaN (Not a Number), which is considered in pandas to mark missing or NA values. Since "Utah" was not included in states, it is excluded from the resulting object.

I will use the terms “missing,” “NA,” or “null” interchangeably to refer to missing data. The isna and notna functions in pandas should be used to detect missing data:

In [None]:
pd.isna(obj4)

In [None]:
pd.notna(obj4)

A useful Series feature for many applications is that it automatically aligns by index label in arithmetic operations:

In [None]:
obj3

In [None]:
obj4

In [None]:
obj3 + obj4

Data alignment features will be addressed in more detail later. If you have experience with databases, you can think about this as being similar to a join operation.
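As a small sketch of the join-like behavior described above, the following uses two hypothetical Series (the names `left` and `right` and their values are made up for illustration). Plain addition gives NaN for labels found on only one side, while the `add` method with `fill_value` treats the missing side as 0:

```python
import pandas as pd

# Hypothetical data, chosen only to illustrate alignment.
left = pd.Series({"Ohio": 1.0, "Texas": 2.0, "Utah": 3.0})
right = pd.Series({"Texas": 10.0, "Ohio": 20.0, "California": 30.0})

# Addition matches values by label, not by position.
# Labels present on only one side produce NaN.
total = left + right

# add() with fill_value substitutes 0 for a label missing from one side.
filled = left.add(right, fill_value=0)

print(total)
print(filled)
```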

Both the Series object itself and its index have a name attribute, which integrates with other areas of pandas functionality:

In [None]:
obj4.name = "population"
obj4.index.name = "state"

In [None]:
obj4

A Series’s index can be altered in place by assignment:

In [None]:
obj

In [None]:
obj.index = ["Bob", "Steve", "Jeff", "Ryan"]

In [None]:
obj

### DataFrame
A DataFrame represents a rectangular table of data and contains an ordered, named collection of columns, each of which can be a different value type (numeric, string, Boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dictionary of Series all sharing the same index.

There are many ways to construct a DataFrame, though one of the most common is from a dictionary of equal-length lists or NumPy arrays:

In [None]:
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

The resulting DataFrame will have its index assigned automatically, as with Series, and the columns are placed according to the order of the keys in data (which depends on their insertion order in the dictionary):

In [None]:
frame

For large DataFrames, the head method selects only the first rows: five by default, or the number you pass (here, up to 10):

In [None]:
frame.head(10)

Similarly, tail returns the last five rows:

In [None]:
frame.tail()

If you specify a sequence of columns, the DataFrame’s columns will be arranged in that order:

In [None]:
pd.DataFrame(data, columns=["year", "state", "pop"])

If you pass a column that isn’t contained in the dictionary, it will appear with missing values in the result:

In [None]:
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])

In [None]:
frame2

A column in a DataFrame can be retrieved as a Series either by dictionary-like notation or by using the dot attribute notation:

In [None]:
frame2["state"]

In [None]:
frame2.year

Columns can be modified by assignment. For example, a single cell of the empty debt column can be set with loc, or the whole column can be assigned a scalar value or an array of values:

In [None]:
frame2.loc[1, "debt"] = 15.00

In [None]:
frame2

In [None]:
frame2["debt"] = np.arange(6.)

In [None]:
frame2

When you are assigning lists or arrays to a column, the value’s length must match the length of the DataFrame. If you assign a Series, its labels will be realigned exactly to the DataFrame’s index, inserting missing values in any index values not present:

In [None]:
val = pd.Series([-1.2, -1.5, -1.7])

In [None]:
frame2["debt"] = val

In [None]:
frame2

Assigning a column that doesn’t exist will create a new column.

The del keyword will delete columns like with a dictionary. As an example, I first add a new column of Boolean values where the state column equals "Ohio":

In [None]:
frame2["eastern"] = frame2["state"] == "Ohio"

In [None]:
frame2

Note: New columns cannot be created with the frame2.eastern dot attribute notation.
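A minimal sketch of what the note means, using a small made-up frame. Assigning through a new attribute name only sets a plain Python attribute on the object (pandas emits a warning about it, which is silenced here), whereas bracket assignment creates a real column:

```python
import warnings
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Nevada"], "pop": [1.5, 2.4]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas warns about this pattern
    # This sets an ordinary attribute; it does NOT add a column.
    frame.eastern = frame["state"] == "Ohio"

attr_created_column = "eastern" in frame.columns  # False

# Bracket notation is the way to create a new column.
frame["eastern"] = frame["state"] == "Ohio"
bracket_created_column = "eastern" in frame.columns  # True

print(attr_created_column, bracket_created_column)
```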

The del keyword can then be used to remove this column:

In [None]:
del frame2["eastern"]

In [None]:
frame2.columns

Another common form of data is a nested dictionary of dictionaries:

In [None]:
populations = {"Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},"Nevada": {2001: 2.4, 2002: 2.9}}

If the nested dictionary is passed to the DataFrame, pandas will interpret the outer dictionary keys as the columns, and the inner keys as the row indices:

In [None]:
frame3 = pd.DataFrame(populations)

In [None]:
frame3

You can transpose the DataFrame (swap rows and columns) with similar syntax to a NumPy array:

In [None]:
frame3.T

Dictionaries of Series are treated in much the same way:

In [None]:
pdata = {"Ohio": frame3["Ohio"][:-1],"Nevada": frame3["Nevada"][:2]}

In [None]:
populations = {"Ohio": {2000: 1.5, 2001: 1.7},"Nevada": {2001: 2.4, 2002: 2.9}}

In [None]:
frame3["Ohio"][:-1]

In [None]:
frame3["Nevada"][:2]

In [None]:
pd.DataFrame(pdata)

If a DataFrame’s index and columns have their name attributes set, these will also be displayed:

In [None]:
frame3.index.name = "year"

In [None]:
frame3.columns.name = "state"

In [None]:
frame3

Unlike Series, DataFrame does not have a name attribute. DataFrame's to_numpy method returns the data contained in the DataFrame as a two-dimensional ndarray:

In [None]:
frame3.to_numpy()

If the DataFrame’s columns are different data types, the data type of the returned array will be chosen to accommodate all of the columns:

In [None]:
frame2.to_numpy()
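To make the dtype rule concrete, here is a short sketch with two made-up frames (`numeric` and `mixed` are illustrative names). Integer and float columns can share a float array, while numbers mixed with strings fall back to a generic object array:

```python
import pandas as pd

numeric = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})
mixed = pd.DataFrame({"a": [1, 2], "state": ["Ohio", "Utah"]})

# Integers and floats are both representable as float64.
numeric_dtype = numeric.to_numpy().dtype

# Numbers and strings have no common numeric type, so the array holds Python objects.
mixed_dtype = mixed.to_numpy().dtype

print(numeric_dtype, mixed_dtype)
```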

#### Index Objects
pandas’s Index objects are responsible for holding the axis labels (including a DataFrame's column names) and other metadata (like the axis name or names). Any array or other sequence of labels you use when constructing a Series or DataFrame is internally converted to an Index:

In [None]:
obj = pd.Series(np.arange(3), index=["a", "b", "c"])

In [None]:
index = obj.index

In [None]:
index

In [None]:
index[1:]

Index objects are immutable and thus can’t be modified by the user:

In [None]:
index[1] = "d"  # TypeError
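If you want to see the error without stopping the notebook, a sketch along these lines may help. It catches the TypeError and then builds a new Index with `delete` and `insert`, which is one possible way to get a changed set of labels:

```python
import pandas as pd

index = pd.Index(["a", "b", "c"])

try:
    index[1] = "d"  # item assignment is not supported
except TypeError as exc:
    error_message = str(exc)

# Index methods return new objects instead of modifying in place.
renamed = index.delete(1).insert(1, "d")

print(error_message)
print(renamed)
```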

Immutability makes it safer to share Index objects among data structures:

In [None]:
labels = pd.Index(np.arange(3))

In [None]:
labels

In [None]:
obj2 = pd.Series([1.5, -2.5, 0], index=labels)

In [None]:
obj2

In [None]:
obj2.index is labels

Unlike Python sets, a pandas Index can contain duplicate labels:

In [None]:
pd.Index(["foo", "foo", "bar", "bar"])

Selections with duplicate labels will select all occurrences of that label.

In [None]:
new_data = pd.Series([1,2,3,4], index=["foo", "foo", "bar", "bar"])

In [None]:
new_data['foo']

#### Some Index methods and properties

| Method / Property | Description                                                                               |
| ----------------- | ----------------------------------------------------------------------------------------- |
| `append()`        | Concatenate with additional Index objects, producing a new Index                          |
| `difference()`    | Compute set difference as an Index                                                        |
| `intersection()`  | Compute set intersection                                                                  |
| `union()`         | Compute set union                                                                         |
| `isin()`          | Compute Boolean array indicating whether each value is contained in the passed collection |
| `delete()`        | Compute new Index with element at index *i* deleted                                       |
| `drop()`          | Compute new Index by deleting passed values                                               |
| `insert()`        | Compute new Index by inserting element at index *i*                                       |
| `is_monotonic_increasing` | Returns `True` if each element is greater than or equal to the previous element (named `is_monotonic` before pandas 2.0) |
| `is_unique`       | Returns `True` if the Index has no duplicate values                                       |
| `unique()`        | Compute the array of unique values in the Index                                           |
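
A few of the set-like methods from the table can be tried on two small, made-up Index objects. This is only an illustrative sketch; the labels are arbitrary:

```python
import pandas as pd

left = pd.Index(["a", "b", "c"])
right = pd.Index(["b", "c", "d"])

both = left.intersection(right)     # labels found in both
either = left.union(right)          # labels found in either
only_left = left.difference(right)  # labels in left but not in right
mask = left.isin(["a", "c"])        # Boolean NumPy array, one entry per label

print(both, either, only_left, mask)
```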


### Essential Functionality
This section will walk you through the fundamental mechanics of interacting with the data contained in a Series or DataFrame. In the chapters to come, we will delve more deeply into data analysis and manipulation topics using pandas. This book is not intended to serve as exhaustive documentation for the pandas library; instead, we'll focus on familiarizing you with heavily used features, leaving the less common (i.e., more esoteric) things for you to learn more about by reading the online pandas documentation.

#### Reindexing
An important method on pandas objects is reindex, which means to create a new object with the values rearranged to align with the new index. Consider an example:

In [None]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])

In [None]:
obj

Calling reindex on this Series rearranges the data according to the new index, introducing missing values if any index values were not already present:

In [None]:
obj2 = obj.reindex(["a", "b", "c", "d", "e"])

In [None]:
obj2

For ordered data like time series, you may want to do some interpolation or filling of values when reindexing. The method option allows us to do this, using a method such as ffill, which forward-fills the values:

In [None]:
obj3 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

In [None]:
obj3

In [None]:
obj3.reindex(np.arange(6), method="ffill")

With DataFrame, reindex can alter the (row) index, columns, or both. When passed only a sequence, it reindexes the rows in the result:

In [None]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)), index=["a", "c", "d"], columns=["Ohio", "Texas", "California"])

In [None]:
frame

In [None]:
frame2 = frame.reindex(index=["a", "b", "c", "d"])

In [None]:
frame2

The columns can be reindexed with the columns keyword:

In [None]:
states = ["Texas", "Utah", "California"]

In [None]:
frame.reindex(columns=states, fill_value=4)

Because "Ohio" was not in states, the data for that column is dropped from the result.

Another way to reindex a particular axis is to pass the new axis labels as a positional argument and then specify the axis to reindex with the axis keyword:

In [None]:
frame.reindex(states, axis="columns")

#### `reindex` function arguments

| Argument     | Description                                                                                                                                                          |
| ------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `labels`     | New sequence to use as an index. Can be an Index instance or any other sequence-like Python data structure. An Index will be used exactly as-is without any copying. |
| `index`      | Use the passed sequence as the new row-index labels.                                                                                                                 |
| `columns`    | Use the passed sequence as the new column labels.                                                                                                                    |
| `axis`       | Axis to reindex: `"index"` (rows, default) or `"columns"`. You can also call `reindex(index=...)` or `reindex(columns=...)` directly.                                |
| `method`     | Interpolation (fill) method: `"ffill"` (forward-fill) or `"bfill"` (backward-fill).                                                                                  |
| `fill_value` | Scalar value to insert for missing labels (default = NaN).                                                                                                           |
| `limit`      | Max number of consecutive missing elements to fill when using `method`.                                                                                              |
| `tolerance`  | Max absolute numeric distance allowed for inexact matches when using `method`.                                                                                       |
| `level`      | Level of MultiIndex to match on (or subset).                                                                                                                         |
| `copy`       | If `True`, always copy underlying data even when the new index equals the old one; if `False`, avoid copying when possible.                                          |
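
As a sketch of how `method`, `limit`, and `fill_value` interact, the example below reindexes a tiny made-up Series. With `limit=2`, forward-filling stops after two consecutive missing labels, so the third gap stays missing:

```python
import numpy as np
import pandas as pd

colors = pd.Series(["blue", "purple"], index=[0, 4])

# Forward-fill, but at most two consecutive missing labels.
limited = colors.reindex(np.arange(6), method="ffill", limit=2)

# Without a fill method, fill_value replaces the default NaN.
filled = colors.reindex(np.arange(6), fill_value="unknown")

print(limited)
print(filled)
```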


#### Dropping Entries from an Axis
Dropping one or more entries from an axis is simple if you already have an index array or list without those entries, since you can use the reindex method or .loc-based indexing. As that can require a bit of munging and set logic, the drop method will return a new object with the indicated value or values deleted from an axis:

In [None]:
obj = pd.Series(np.arange(5.), index=["a", "b", "c", "d", "e"])

In [None]:
obj

In [None]:
new_obj = obj.drop("c")

In [None]:
new_obj

In [None]:
obj.drop(["d", "c"])

With DataFrame, index values can be deleted from either axis. To illustrate this, we first create an example DataFrame:

In [None]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),index=["Ohio", "Colorado", "Utah", "New York"], columns=["one", "two", "three", "four"])

In [None]:
data

Calling drop with a sequence of labels will drop values from the row labels (axis 0):

In [None]:
data.drop(index=["Colorado", "Ohio"])

To drop labels from the columns, instead use the columns keyword:

In [None]:
newdata = data.drop(columns=["two"])

In [None]:
newdata

You can also drop values from the columns by passing axis=1 (which is like NumPy) or axis="columns":

In [None]:
data.drop("two" , axis=1)

In [None]:
data.drop(["two", "four"], axis="columns")
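
The sketch below, on a small made-up frame, shows two details that are easy to miss: `drop` can take `index` and `columns` in one call, and it leaves the original object untouched. The `errors="ignore"` option skips labels that are not present instead of raising:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=["Ohio", "Utah"], columns=["one", "two", "three"])

# Drop a row and a column in a single call; data itself is not modified.
trimmed = data.drop(index="Utah", columns=["two"])

# A label that does not exist would normally raise KeyError.
unchanged = data.drop(columns=["missing"], errors="ignore")

print(trimmed)
print(data.shape)
```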

#### Indexing, Selection, and Filtering
Series indexing (obj[...]) works analogously to NumPy array indexing, except you can use the Series’s index values instead of only integers. Here are some examples of this:

In [None]:
obj = pd.Series(np.arange(4.), index=["a", "b", "c", "d"])

In [None]:
obj

In [None]:
obj[["b", "a", "d"]]

In [None]:
obj[[1, 3]]

In [None]:
obj[obj < 2]

While you can select data by label this way, the preferred way to select index values is with the special `loc` operator:

In [None]:
obj.loc[["b", "a", "d"]]

The reason to prefer loc is because of the different treatment of integers when indexing with []. Regular []-based indexing will treat integers as labels if the index contains integers, so the behavior differs depending on the data type of the index. For example:

In [None]:
obj1 = pd.Series([1, 2, 3], index=[2, 0, 1])

In [None]:
obj1[0]

In [None]:
obj1.loc[0]

In [None]:
obj2 = pd.Series([1, 2, 3], index=["a", "b", "c"])

In [None]:
obj2[1]

In [None]:
obj1

In [None]:
obj2

In [None]:
obj1[[0, 1, 2]]

In [None]:
obj2[[0, 1, 2]]

When using `loc`, the expression `obj.loc[[0, 1, 2]]` will fail when the index does not contain integers:

In [None]:
obj2.loc[[0, 1]]

Since the loc operator indexes exclusively with labels, there is also an iloc operator that indexes exclusively with integer positions, so it works consistently whether or not the index contains integers:

In [None]:
obj1.iloc[[0, 1, 2]]

In [None]:
obj2.iloc[[0, 1, 2]]
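
The difference is easiest to see on a Series whose integer labels are deliberately out of order. In this small sketch (values are made up), the same integer means different things to `loc` and `iloc`:

```python
import pandas as pd

ser = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = ser.loc[0]      # the element whose label is 0
by_position = ser.iloc[0]  # the first element, whatever its label

print(by_label, by_position)
```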

You can also slice with labels, but it works differently from normal Python slicing in that the endpoint is inclusive:

In [None]:
obj2.loc["a":"c"]

Assigning values using these methods modifies the corresponding section of the Series:

In [None]:
obj2.loc["b":"c"] = 5

In [None]:
obj2
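
To compare the two slicing conventions side by side, here is a short sketch with a made-up Series: a label slice with `loc` includes its endpoint, while a positional slice with `iloc` follows the usual Python rule and excludes it:

```python
import pandas as pd

letters = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

label_slice = letters.loc["a":"c"]  # "a", "b", and "c" (endpoint included)
position_slice = letters.iloc[0:2]  # positions 0 and 1 (endpoint excluded)

print(label_slice)
print(position_slice)
```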

Indexing into a DataFrame retrieves one or more columns either with a single value or sequence:

In [None]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)), index=["Ohio", "Colorado", "Utah", "New York"], columns=["one", "two", "three", "four"])

In [None]:
data

In [None]:
data['one']

Indexing like this has a few special cases. The first is slicing or selecting data with a Boolean array:

In [None]:
data[:2]

In [None]:
data[data["three"] > 5]

In [None]:
data    

The row selection syntax data[:2] is provided as a convenience. Passing a single element or a list to the [] operator selects columns.

Another use case is indexing with a Boolean DataFrame, such as one produced by a scalar comparison. Consider a DataFrame with all Boolean values produced by comparing with a scalar value:

In [None]:
data < 5

We can use this DataFrame to assign the value 0 to each location with the value True, like so:

In [None]:
data[data < 5] = 0

In [None]:
# data['four'][data['four'] == 15] = 10

data.loc["New York", "four"] = 11

In [None]:
data    

#### Selection on DataFrame with loc and iloc

Like Series, DataFrame has special attributes loc and iloc for label-based and integer-based indexing, respectively. Since DataFrame is two-dimensional, you can select a subset of the rows and columns with NumPy-like notation using either axis labels (loc) or integers (iloc).

As a first example, let's select a single row by label:

In [None]:
data.loc["Colorado"]

The result of selecting a single row is a Series with an index that contains the DataFrame's column labels. To select multiple rows, creating a new DataFrame, pass a sequence of labels:

In [None]:
data.loc[["Colorado", "New York"]]

In [None]:
data.loc["Ohio":"Utah", "one":"three"] #selecting rows and columns

In [None]:
data.loc["Ohio":"Utah"]

You can combine both row and column selection in loc by separating the selections with a comma:

In [None]:
data.loc["Colorado", ["two", "three"]]

We'll then perform some similar selections with integers using iloc:

In [None]:
data.iloc[2, [3, 0, 1]]

In [None]:
data.iloc[[1,2], [3, 0, 1]]

In [None]:
data.iloc[1:2, [3, 0, 1]]

Boolean Series can be used with loc but not iloc (iloc accepts only a plain Boolean array, such as the result of to_numpy()):

In [None]:
data.loc[data.three >= 2]

In [None]:
data.three

In [None]:
data.loc[data.three >= 2, 'one']

#### Indexing options with DataFrame

| Type | Notes |
|------|-------|
| `df[column]` | Select single column or sequence of columns from the DataFrame; special-case conveniences: Boolean array (filter rows), slice (slice rows), or Boolean DataFrame (set values based on some criterion) |
| `df.loc[rows]` | Select single row or subset of rows from the DataFrame by **label** |
| `df.loc[:, cols]` | Select single column or subset of columns by **label** |
| `df.loc[rows, cols]` | Select both row(s) and column(s) by **label** |
| `df.iloc[rows]` | Select single row or subset of rows from the DataFrame by **integer position** |
| `df.iloc[:, cols]` | Select single column or subset of columns by **integer position** |
| `df.iloc[rows, cols]` | Select both row(s) and column(s) by **integer position** |
| `df.at[row, col]` | Select a **single scalar value** by row and column **label** |
| `df.iat[row, col]` | Select a **single scalar value** by row and column **position** (integers) |
| `reindex` method | Select either rows or columns by **labels** |

In [None]:
data

In [None]:
print(data.iat[1, 3])

In [None]:
data.four   
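
The scalar accessors `at` and `iat` from the table can be sketched on a fresh copy of the example frame (rebuilt here so the snippet does not depend on the earlier modifications). Both reach the same cell, one by label and one by position, and both also support assignment:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

by_label = data.at["Colorado", "four"]  # row label, column label
by_position = data.iat[1, 3]            # row position, column position

# Assignment through at updates a single cell.
data.at["Utah", "one"] = -1

print(by_label, by_position)
print(data)
```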

### Arithmetic and Data Alignment
pandas can make it much simpler to work with objects that have different indexes. For example, when you add objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs. Let’s look at an example:

In [None]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=["a", "c", "e", "f", "g"])
s1 + s2


The internal data alignment introduces missing values in the label locations that don’t overlap. Missing values will then propagate in further arithmetic computations.

In the case of DataFrame, alignment is performed on both rows and columns:

In [None]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list("bcd"), index=["Ohio", "Texas", "Colorado"])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"), index=["Utah", "Ohio", "Texas", "Oregon"])


In [None]:
df1

In [None]:
df2

In [None]:
df1 + df2

Using the add method on df1, I pass df2 and an argument to fill_value, which substitutes the passed value for any missing values in the operation:

In [None]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),columns=list("abcde"))

df2.loc[1, 'b'] = np.nan

In [None]:
df2

In [None]:
df1 + df2

In [None]:
df1.add(df2, fill_value=0)

#### Flexible arithmetic methods

| Method | Description |
|--------|-------------|
| `add`, `radd` | Methods for addition (`+`) |
| `sub`, `rsub` | Methods for subtraction (`-`) |
| `div`, `rdiv` | Methods for division (`/`) |
| `floordiv`, `rfloordiv` | Methods for floor division (`//`) |
| `mul`, `rmul` | Methods for multiplication (`*`) |
| `pow`, `rpow` | Methods for exponentiation (`**`) |
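
Each method in the table has an `r`-prefixed counterpart with the arguments reversed. A brief sketch with a made-up frame shows that `df.rdiv(1)` is the method form of `1 / df`, and how `sub` and `rsub` differ:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1., 5.).reshape((2, 2)), columns=list("ab"))

reciprocal_op = 1 / df
reciprocal_method = df.rdiv(1)  # reversed division: 1 / df

diff = df.sub(10)    # df - 10
rdiff = df.rsub(10)  # 10 - df

print(reciprocal_method)
print(rdiff)
```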

### Function Application and Mapping
NumPy ufuncs (element-wise array methods) also work with pandas objects:

In [178]:
frame = pd.DataFrame(np.random.standard_normal((4, 3)), columns=list("bde"), index=["Utah", "Ohio", "Texas", "Oregon"])

In [179]:
frame

               b         d         e
Utah    0.088197  0.767472  0.581320
Ohio   -0.683308 -2.000710  0.575624
Texas  -0.319300 -0.089491  0.102543
Oregon  0.388269 -1.734548  0.965957


In [180]:
np.abs(frame)

               b         d         e
Utah    0.088197  0.767472  0.581320
Ohio    0.683308  2.000710  0.575624
Texas   0.319300  0.089491  0.102543
Oregon  0.388269  1.734548  0.965957


Another frequent operation is applying a function on one-dimensional arrays to each column or row. DataFrame’s apply method does exactly this:

In [181]:
def f1(x):
    return x.max() - x.min()

frame.apply(f1)

b    1.071578
d    2.768181
e    0.863414
dtype: float64

In [182]:
f1(frame['b'])

np.float64(1.071577519626428)

Here the function f1, which computes the difference between the maximum and minimum of a Series, is invoked once on each column in frame. The result is a Series having the columns of frame as its index.

If you pass axis="columns" to apply, the function will be invoked once per row instead. A helpful way to think about this is as "apply across the columns":

In [183]:
frame.apply(f1, axis="columns")

Utah      0.679274
Ohio      2.576334
Texas     0.421843
Oregon    2.700506
dtype: float64

Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary.
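
For instance, the max-minus-min range computed with apply earlier can be obtained from the built-in `max` and `min` methods directly. This sketch regenerates random data with an arbitrary seed, so the numbers will not match the output shown above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
frame = pd.DataFrame(rng.standard_normal((4, 3)), columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

# Column-wise range via apply ...
via_apply = frame.apply(lambda col: col.max() - col.min())

# ... and the same result from built-in reductions.
via_methods = frame.max() - frame.min()

print(via_methods)
```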

The function passed to apply need not return a scalar value; it can also return a Series with multiple values:

In [188]:
def f2(x):
    return pd.Series([x.min(), x.max()], index=["min", "max"])

frame.apply(f2)

            b         d         e
min -0.683308 -2.000710  0.102543
max  0.388269  0.767472  0.965957


Element-wise Python functions can be used, too. Suppose you wanted to compute a formatted string from each floating-point value in frame. You can do this with DataFrame's map method (named applymap before pandas 2.1):

In [192]:
def my_format(x):
    return f"{x:.2f}"

frame.map(my_format)

            b      d     e
Utah     0.09   0.77  0.58
Ohio    -0.68  -2.00  0.58
Texas   -0.32  -0.09  0.10
Oregon   0.39  -1.73  0.97


DataFrame's map method mirrors the map method on Series, which applies an element-wise function to a single column:

In [195]:
frame

               b         d         e
Utah    0.088197  0.767472  0.581320
Ohio   -0.683308 -2.000710  0.575624
Texas  -0.319300 -0.089491  0.102543
Oregon  0.388269 -1.734548  0.965957


In [193]:
frame["e"].map(my_format)

Utah      0.58
Ohio      0.58
Texas     0.10
Oregon    0.97
Name: e, dtype: object

### Sorting and Ranking
Sorting a dataset by some criterion is another important built-in operation. To sort lexicographically by row or column label, use the sort_index method, which returns a new, sorted object:

In [None]:
obj = pd.Series(np.arange(4), index=["d", "a", "b", "c"])
obj.sort_index()

With a DataFrame, you can sort by index on either axis:

In [None]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)), index=["three", "one"], columns=["d", "a", "b", "c"])
frame.sort_index()

In [None]:
frame.sort_index(axis="columns")

The data is sorted in ascending order by default but can be sorted in descending order, too:

In [None]:
frame.sort_index(axis="columns", ascending=False)

To sort a Series by its values, use its sort_values method:



In [None]:
obj = pd.Series([4, 7, -3, 2])

obj.sort_values()

Any missing values are sorted to the end of the Series by default:

In [None]:
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()

Missing values can be sorted to the start instead by using the na_position option:

In [None]:
obj.sort_values(na_position="first")

When sorting a DataFrame, you can use the data in one or more columns as the sort keys. To do so, pass one or more column names to sort_values:

In [None]:
frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})
frame

In [None]:
frame.sort_values("b")

In [None]:
frame.sort_values(["a", "b"])
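
When sorting by several columns, `ascending` can also be a list with one flag per key. As a small sketch on the same frame, this sorts by "a" ascending and then by "b" descending within each value of "a":

```python
import pandas as pd

frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

# One ascending flag per sort key, in the same order as the keys.
ordered = frame.sort_values(["a", "b"], ascending=[True, False])

print(ordered)
```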

Ranking assigns ranks from one through the number of valid data points in an array, starting from the lowest value. The rank methods for Series and DataFrame are the place to look; by default, rank breaks ties by assigning each group the mean rank:

In [None]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])  # sorted values: [-5, 0, 2, 4, 4, 7, 7]
obj.rank()

Ranks can also be assigned according to the order in which they’re observed in the data:

In [None]:
obj.rank(method="first")

Here, instead of receiving the average rank 6.5, the entries at labels 0 and 2 have been assigned ranks 6 and 7, because label 0 precedes label 2 in the data.

You can rank in descending order, too:

In [None]:
obj.rank(ascending=False)

DataFrame can compute ranks over the rows or the columns:

In [2]:
frame = pd.DataFrame({"b": [4.3, 7, -3, 2], "a": [0, 1, 0, 1], "c": [-2, 5, 8, -2.5]})

In [3]:
frame.rank(axis="columns")

     b    a    c
0  3.0  2.0  1.0
1  3.0  1.0  2.0
2  1.0  2.0  3.0
3  3.0  2.0  1.0


#### Tie-breaking methods with rank

| Method | Description |
|--------|-------------|
| `"average"` | Default: assign the **average rank** to each entry in the equal group |
| `"min"` | Use the **minimum rank** for the whole group |
| `"max"` | Use the **maximum rank** for the whole group |
| `"first"` | Assign ranks in the **order the values appear** in the data |
| `"dense"` | Like `"min"`, but ranks **always increase by 1** between groups (no gaps) |
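
To compare the tie-breaking methods side by side, one option is to collect them as columns of a DataFrame. This sketch reuses the Series from the ranking example; the name `ranks` is just illustrative:

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

methods = ["average", "min", "max", "first", "dense"]
# One column per tie-breaking method, sharing obj's index.
ranks = pd.DataFrame({m: obj.rank(method=m) for m in methods})

print(ranks)
```

The two 7s and the two 4s are the tied groups, so those are the rows where the columns differ.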

### Axis Indexes with Duplicate Labels
Up until now almost all of the examples we have looked at have unique axis labels (index values). While many pandas functions (like reindex) require that the labels be unique, it’s not mandatory. Let’s consider a small Series with duplicate indices:

In [None]:
obj = pd.Series(np.arange(5), index=["a", "a", "b", "b", "c"])

In [None]:
obj

The `is_unique` property of the index can tell you whether or not its labels are unique:

In [None]:
obj.index.is_unique

Data selection is one of the main things that behaves differently with duplicates. Indexing a label with multiple entries returns a Series, while single entries return a scalar value:

In [None]:
obj["a"]

This can make your code more complicated, as the output type from indexing can vary based on whether or not a label is repeated.
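
A short sketch of the varying return type, and of one possible way to avoid it: passing a list of labels to `loc` returns a Series even when the label occurs only once.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5), index=["a", "a", "b", "b", "c"])

repeated = obj["a"]  # label occurs twice -> Series
single = obj["c"]    # label occurs once  -> scalar

# A list of labels gives a Series regardless of how often each label occurs.
always_series = obj.loc[["c"]]

print(type(repeated).__name__, type(single).__name__, type(always_series).__name__)
```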

The same logic extends to indexing rows (or columns) in a DataFrame:

In [None]:
df = pd.DataFrame(np.random.standard_normal((5, 3)), index=["a", "a", "b", "b", "c"])
df


In [None]:
df.loc["b"]

In [None]:
df.loc["c"]

## Summarizing and Computing Descriptive Statistics
pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics, methods that extract a single value (like the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame. Compared with the similar methods found on NumPy arrays, they have built-in handling for missing data. Consider a small DataFrame:

In [None]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]], index=["a", "b", "c", "d"], columns=["one", "two"])
df

Calling DataFrame’s `sum` method returns a Series containing column sums:

In [None]:
df.sum()

Passing `axis="columns"` or `axis=1` sums across the columns instead:

In [None]:
df.sum(axis="columns")

When an entire row or column contains all NA values, the sum is 0; otherwise, NA values are skipped and the sum is computed from the remaining values. This skipping can be disabled with the `skipna` option, in which case any NA value in a row or column makes the corresponding result NA:

In [None]:
df.sum(axis="index", skipna=False)

In [None]:
df.sum(axis="columns", skipna=False)
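
If a sum of 0 for an all-NA row is not what you want, the `min_count` argument of `sum` offers a middle ground. The sketch below rebuilds the same small frame and compares the three behaviors:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"], columns=["one", "two"])

default_sum = df.sum(axis="columns")               # NA skipped; all-NA row "c" gives 0.0
strict_sum = df.sum(axis="columns", skipna=False)  # any NA in the row gives NaN

# min_count=1 requires at least one non-NA value, so only the all-NA row becomes NaN.
guarded_sum = df.sum(axis="columns", min_count=1)

print(pd.DataFrame({"default": default_sum, "strict": strict_sum, "guarded": guarded_sum}))
```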

Some aggregations, like mean, require at least one non-NA value to yield a valid result, so here we have:

In [None]:
df.mean(axis="columns")

Some methods, like `idxmin` and `idxmax`, return indirect statistics, like the index value where the minimum or maximum values are attained:

In [None]:
df.idxmax()

Other methods are accumulations:

In [None]:
df.cumsum()

Some methods are neither reductions nor accumulations. describe is one such example, producing multiple summary statistics in one shot:

In [None]:
df.describe()

On nonnumeric data, describe produces alternative summary statistics:

In [None]:
obj = pd.Series(["a", "a", "b", "c"] * 4)

In [None]:
obj.describe()

#### Descriptive and summary statistics

| Method | Description |
|--------|-------------|
| `count` | Number of non-NA values |
| `describe` | Compute set of summary statistics |
| `min`, `max` | Compute minimum and maximum values |
| `argmin`, `argmax` | Compute index **locations** (integers) at which minimum or maximum value is obtained, respectively; not available on DataFrame objects |
| `idxmin`, `idxmax` | Compute index **labels** at which minimum or maximum value is obtained, respectively |
| `quantile` | Compute sample quantile ranging from 0 to 1 (default: 0.5) |
| `sum` | Sum of values |
| `mean` | Mean of values |
| `median` | Arithmetic median (50 % quantile) of values |
| `mad` | Mean absolute deviation from mean value (removed in pandas 2.0) |
| `prod` | Product of all values |
| `var` | Sample variance of values |
| `std` | Sample standard deviation of values |
| `skew` | Sample skewness (third moment) of values |
| `kurt` | Sample kurtosis (fourth moment) of values |
| `cumsum` | Cumulative sum of values |
| `cummin`, `cummax` | Cumulative minimum or maximum of values, respectively |
| `cumprod` | Cumulative product of values |
| `diff` | Compute first arithmetic difference (useful for time series) |
| `pct_change` | Compute percent changes |
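
A few of the methods in the table are easy to confuse, so here is a small sketch on a made-up Series (the values and labels are illustrative only) contrasting integer locations with index labels, and showing two of the non-reducing methods.

```python
import pandas as pd

s = pd.Series([3.0, -1.0, 2.0], index=["x", "y", "z"])

position = s.argmin()     # integer location of the minimum
label = s.idxmin()        # index label of the minimum
steps = s.diff()          # first differences; the first entry is NaN
running_min = s.cummin()  # cumulative minimum

print(position, label)
print(steps)
print(running_min)
```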

## Correlation and Covariance
Some summary statistics, like correlation and covariance, are computed from pairs of arguments. Let’s consider some DataFrames of stock prices and volumes originally obtained from Yahoo! Finance and available in binary Python pickle files you can find in the accompanying datasets for the book:

In [None]:
price = pd.read_pickle("examples/yahoo_price.pkl")
volume = pd.read_pickle("examples/yahoo_volume.pkl")

In [None]:
returns = price.pct_change()

In [None]:
returns.tail(10)

The `corr` method of Series computes the correlation of the overlapping, non-NA, aligned-by-index values in two Series. Relatedly, `cov` computes the covariance:

In [None]:
returns["MSFT"].corr(returns["IBM"])

In [None]:
returns["MSFT"].cov(returns["IBM"])
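
If the pickle files are not at hand, the relationship between `corr` and `cov` can still be checked on synthetic data. The sketch below uses two made-up Series (not the stock data) and verifies that the correlation equals the covariance divided by the product of the standard deviations, computed over the overlapping non-NA values only.

```python
import numpy as np
import pandas as pd

# Made-up data; the last pair is dropped because a is NA there
a = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan])
b = pd.Series([2.0, 4.1, 5.9, 8.2, 10.0])

corr = a.corr(b)
cov = a.cov(b)

# Recompute the correlation by hand on the overlapping, non-NA rows
valid = a.notna() & b.notna()
manual_corr = cov / (a[valid].std() * b[valid].std())

print(corr, manual_corr)
```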

DataFrame’s `corr` and `cov` methods, on the other hand, return a full correlation or covariance matrix as a DataFrame, respectively:

In [None]:
returns.corr()

In [None]:
returns.cov()

Using DataFrame’s corrwith method, you can compute pair-wise correlations between a DataFrame’s columns or rows with another Series or DataFrame. Passing a Series returns a Series with the correlation value computed for each column:

In [None]:
returns.corrwith(returns["IBM"])

Passing a DataFrame computes the correlations of matching column names. Here, I compute correlations of percent changes with volume:

In [None]:
returns.corrwith(volume)

## Unique Values, Value Counts, and Membership
Another class of related methods extracts information about the values contained in a one-dimensional Series. To illustrate these, consider this example:

In [None]:
obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

The first function is unique, which gives you an array of the unique values in a Series:

In [None]:
uniques = obj.unique()

print(uniques)

The unique values are returned in the order in which they first appear, not in sorted order, but they can be sorted after the fact if needed (`uniques.sort()`). Relatedly, `value_counts` computes a Series containing value frequencies:

In [None]:
obj.value_counts()

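The Series is sorted by count in descending order as a convenience. `value_counts` is also available as a top-level pandas function that can be used with NumPy arrays or other Python sequences, although `pandas.value_counts` is deprecated as of pandas 2.1 in favor of `pd.Series(values).value_counts()`: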

In [None]:
pd.value_counts(obj.to_numpy(), sort=False)

`isin` performs a vectorized set membership check and can be useful in filtering a dataset down to a subset of values in a Series or column in a DataFrame:

In [None]:
obj

In [None]:
mask = obj.isin(["b", "c"])

mask

In [None]:
obj[mask]

Related to isin is the Index.get_indexer method, which gives you an index array from an array of possibly nondistinct values into another array of distinct values:

In [None]:
to_match = pd.Series(["c", "a", "b", "b", "c", "a"])

unique_vals = pd.Series(["c", "b", "a"])

# For each value in to_match, look up its integer position in unique_vals
indices = pd.Index(unique_vals).get_indexer(to_match)
indices
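
As a quick check of what `get_indexer` returns, the sketch below repeats the lookup with one extra value, `"z"`, which I added for illustration. Values that do not occur in the index are reported as -1.

```python
import pandas as pd

unique_vals = pd.Index(["c", "b", "a"])
to_match = pd.Series(["c", "a", "b", "b", "c", "a", "z"])

# Position of each to_match value within unique_vals; -1 means "not found"
indices = unique_vals.get_indexer(to_match)
print(indices)

# The positions can be used to filter out values that had no match
matched = to_match[indices >= 0]
print(matched.tolist())
```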

### Unique, value counts, and set membership methods

| Method | Description |
|--------|-------------|
| `isin` | Compute a Boolean array indicating whether each Series or DataFrame value is contained in the passed sequence of values |
| `get_indexer` | Compute integer indices for each value in an array into another array of distinct values; helpful for data alignment and join-type operations |
| `unique` | Compute an array of unique values in a Series, returned in the order observed |
| `value_counts` | Return a Series containing unique values as its index and frequencies as its values, ordered by count in descending order |

In some cases, you may want to compute a histogram on multiple related columns in a DataFrame. Here’s an example:

In [None]:
data = pd.DataFrame({"Qu1": [1, 3, 4, 3, 4], "Qu2": [2, 3, 1, 2, 3], "Qu3": [1, 5, 2, 4, 4]})

In [None]:
data

We can compute the value counts for a single column, like so:

In [None]:
data["Qu1"].value_counts().sort_index()

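To compute this for all columns, pass `pd.Series.value_counts` to the DataFrame’s `apply` method (the top-level `pandas.value_counts` also works here but is deprecated as of pandas 2.1):

In [None]:
result = data.apply(pd.Series.value_counts).fillna(0)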

In [None]:
result

Here, the row labels in the result are the distinct values occurring in all of the columns. The values are the respective counts of these values in each column.

There is also a DataFrame.value_counts method, but it computes counts considering each row of the DataFrame as a tuple to determine the number of occurrences of each distinct row:

In [None]:
data = pd.DataFrame({"a": [1, 1, 1, 2, 2], "b": [0, 0, 1, 0, 0]})

In [None]:
data

In [None]:
data.value_counts()
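
The difference between per-column counts and per-row counts is easy to miss, so here is a short sketch that runs both on the same small DataFrame. Indexing the row-wise result with a tuple such as `(1, 0)` is just my way of reading a single count back out.

```python
import pandas as pd

data = pd.DataFrame({"a": [1, 1, 1, 2, 2], "b": [0, 0, 1, 0, 0]})

# Per column: how often does each distinct value occur in each column?
per_column = data.apply(pd.Series.value_counts).fillna(0)

# Per row: how often does each distinct (a, b) combination occur?
per_row = data.value_counts()

print(per_column)
print(per_row)
```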

## Data Loading, Storage, and File Formats

Reading data and making it accessible (often called data loading) is a necessary first step for using most of the tools in this book. The term parsing is also sometimes used to describe loading text data and interpreting it as tables and different data types. I’m going to focus on data input and output using pandas, though there are numerous tools in other libraries to help with reading and writing data in various formats.

Input and output typically fall into a few main categories: reading text files and other more efficient on-disk formats, loading data from databases, and interacting with network sources like web APIs.

#### Reading and Writing Data in Text Format

pandas features a number of functions for reading tabular data as a DataFrame object. The table below summarizes some of them; `pandas.read_csv` is one of the most frequently used in this book.

### Text and binary data loading functions in pandas
| Function | Description |
|----------|-------------|
| `read_csv` | Load delimited data from a file, URL, or file-like object; uses comma as default delimiter |
| `read_fwf` | Read data in fixed-width column format (i.e., no delimiters) |
| `read_clipboard` | Variation of `read_csv` that reads data from the clipboard; useful for converting tables from web pages |
| `read_excel` | Read tabular data from an Excel `.xls` or `.xlsx` file |
| `read_hdf` | Read HDF5 files written by pandas |
| `read_html` | Read all tables found in the given HTML document |
| `read_json` | Read data from a JSON (JavaScript Object Notation) string, file, URL, or file-like object |
| `read_feather` | Read the Feather binary file format |
| `read_orc` | Read the Apache ORC binary file format |
| `read_parquet` | Read the Apache Parquet binary file format |
| `read_pickle` | Read an object stored by pandas using the Python pickle format |
| `read_sas` | Read a SAS dataset stored in one of the SAS system’s custom storage formats |
| `read_spss` | Read a data file created by SPSS |
| `read_sql` | Read the results of a SQL query (via SQLAlchemy) |
| `read_sql_table` | Read an entire SQL table (via SQLAlchemy); equivalent to `SELECT * FROM table` with `read_sql` |
| `read_stata` | Read a dataset from Stata file format |
| `read_xml` | Read a table of data from an XML file |

#### Indexing
Can treat one or more columns as the index of the returned DataFrame, and controls whether to get column names from the file, from arguments you provide, or not at all.

#### Type inference and data conversion
Includes user-defined value conversions and custom lists of missing value markers.

#### Date and time parsing
Includes the ability to combine date and time information spread over multiple columns into a single column in the result.

#### Iterating
Support for iterating over chunks of very large files.

#### Unclean data issues
Includes skipping rows or a footer, comments, or other minor things like numeric data with thousands separated by commas.

Because of how messy data in the real world can be, some of the data loading functions (especially `pandas.read_csv`) have accumulated a long list of optional arguments over time. It's normal to feel overwhelmed by the number of different parameters (`pandas.read_csv` has around 50). The online pandas documentation has many examples about how each of these works, so if you're struggling to read a particular file, there might be a similar enough example to help you find the right parameters.

Some of these functions perform type inference, because the column data types are not part of the data format. That means you don’t necessarily have to specify which columns are numeric, integer, Boolean, or string. Other data formats, like HDF5, ORC, and Parquet, have the data type information embedded in the format.

Handling dates and other custom types can require extra effort.

Let’s start with a small comma-separated values (CSV) text file:

In [196]:
df = pd.read_csv("examples/ex1.csv")
df

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


A file will not always have a header row; `examples/ex2.csv` is one such file. To read it, you have a couple of options. You can allow pandas to assign default column names, or you can specify names yourself:

In [199]:
df = pd.read_csv("examples/ex2.csv", header=None)
df

Unnamed: 0,0,1,2,3,4
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [201]:
pd.read_csv("examples/ex2.csv", names=["a", "b", "c", "d", "message"])

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


Suppose you wanted the message column to be the index of the returned DataFrame. You can either indicate you want the column at index 4 or named "message" using the index_col argument:

In [200]:
names = ["a", "b", "c", "d", "message"]
pd.read_csv("examples/ex2.csv", names=names, index_col="message")

message,a,b,c,d
hello,1,2,3,4
world,5,6,7,8
foo,9,10,11,12


In some cases, a table might not have a fixed delimiter, using whitespace or some other pattern to separate fields. In `examples/ex3.txt`, for instance, the fields are separated by a variable amount of whitespace. While you could do some munging by hand, you can instead pass a regular expression as a delimiter for `pandas.read_csv`. Here, the delimiter can be expressed by the regular expression `\s+` (written as a raw string so that Python does not treat `\s` as an escape sequence):

In [205]:
result = pd.read_csv("examples/ex3.txt", sep=r"\s+")
result


Unnamed: 0,A,B,C
aaa,-0.264438,-1.026059,-0.6195
bbb,0.927272,0.302904,-0.032399
ccc,-0.264273,-0.386314,-0.217601
ddd,-0.871858,-0.348382,1.100491


Because the header row has one fewer column name than there are fields in each data row, `pandas.read_csv` infers that the first column should be the DataFrame’s index in this special case.
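
The index inference described above can be reproduced without the example file. This is a minimal sketch that parses an inline string (a made-up stand-in for `examples/ex3.txt`, with invented numbers) through `io.StringIO`.

```python
import io

import pandas as pd

# Hypothetical whitespace-delimited text: 3 header names, 4 fields per row
text = (
    "A B C\n"
    "aaa 1.0  2.0 3.0\n"
    "bbb   4.0 5.0   6.0\n"
)

# Raw string so the backslash reaches the regex engine untouched
result = pd.read_csv(io.StringIO(text), sep=r"\s+")

# The extra leading field in each row became the index
print(result)
print(result.index.tolist())
```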

The file parsing functions have many additional arguments to help you handle the wide variety of irregular file formats that occur (see the partial listing in the table of `pandas.read_csv` arguments below). For example, you can skip the first, third, and fourth rows of a file with `skiprows`:

In [206]:
pd.read_csv("examples/ex4.csv", skiprows=[0, 2, 3])

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


Handling missing values is an important and frequently nuanced part of the file reading process. Missing data is usually either not present (empty string) or marked by some sentinel (placeholder) value. By default, pandas uses a set of commonly occurring sentinels, such as NA and NULL:

In [207]:
result = pd.read_csv("examples/ex5.csv")
result

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


Recall that pandas outputs missing values as NaN, so we have two null or missing values in result:

In [208]:
pd.isna(result)

Unnamed: 0,something,a,b,c,d,message
0,False,False,False,False,False,True
1,False,False,False,True,False,False
2,False,False,False,False,False,False


The `na_values` option accepts a sequence of strings to add to the default list of strings recognized as missing:

In [None]:
result = pd.read_csv("examples/ex5.csv", na_values=["NULL"])
result

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


`pandas.read_csv` has a list of many default NA value representations, but these defaults can be disabled with the `keep_default_na` option:

In [222]:
result2 = pd.read_csv("examples/ex5.csv", keep_default_na=False)
result2

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


In [223]:
result2.isna()

Unnamed: 0,something,a,b,c,d,message
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False


In [235]:
result3 = pd.read_csv("examples/ex5.csv", keep_default_na=False, na_values=["NA"])
result3

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


In [236]:
result3.isna()

Unnamed: 0,something,a,b,c,d,message
0,False,False,False,False,False,True
1,False,False,False,False,False,False
2,False,False,False,False,False,False


Different NA sentinels can be specified for each column in a dictionary:

In [237]:
sentinels = {"message": ["foo", "NA"], "something": ["two"]}
result4 = pd.read_csv("examples/ex5.csv", keep_default_na=False, na_values=sentinels)
result4


Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,,5,6,,8,world
2,three,9,10,11.0,12,
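
The interaction between `keep_default_na` and `na_values` is easier to follow when all three variants are run on the same input. The sketch below uses an inline string that I am assuming resembles `examples/ex5.csv` (one literal `NA` and one empty field); the real file may differ.

```python
import io

import pandas as pd

# Assumed stand-in for examples/ex5.csv
csv_text = (
    "something,a,b,c,d,message\n"
    "one,1,2,3,4,NA\n"
    "two,5,6,,8,world\n"
    "three,9,10,11,12,foo\n"
)

# Default sentinels: both "NA" and the empty field become missing
default = pd.read_csv(io.StringIO(csv_text))

# No sentinels at all: "NA" and "" are kept as ordinary strings
raw = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

# Per-column sentinels only
sentinels = {"message": ["foo", "NA"], "something": ["two"]}
custom = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=sentinels)

print(default.isna().sum().sum(), raw.isna().sum().sum())
print(custom.isna())
```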


#### Some `pandas.read_csv` function arguments
| Argument | Description |
|----------|-------------|
| `filepath_or_buffer` | String indicating filesystem location, URL, or file-like object (usually passed as the first positional argument). |
| `sep` or `delimiter` | Character sequence or regular expression to split fields in each row. |
| `header` | Row number to use as column names; defaults to 0 (first row). Use `None` if there is no header row. |
| `index_col` | Column number(s) or name(s) to use as the row index; can be single or list for hierarchical index. |
| `names` | List of column names to use for the result. |
| `skiprows` | Number of rows to skip at start of file, or list of row numbers (0-based) to skip. |
| `na_values` | Sequence of strings to treat as NA/NaN (added to default list unless `keep_default_na=False`). |
| `keep_default_na` | Whether to use the default NA value list (`True` by default). |
| `comment` | Character(s) to split comments off the end of lines. |
| `parse_dates` | Attempt to parse data to datetime. `False` by default; can be `True`, list of columns, or list of lists/tuples for multi-column dates. |
| `keep_date_col` | If joining columns to parse date, keep the joined columns (`False` by default). |
| `converters` | Dict mapping column number/name to functions, e.g. `{"foo": f}` applies `f` to the "foo" column. |
| `dayfirst` | Treat ambiguous dates as international format (e.g. 7/6/2012 → June 7, 2012); `False` by default. |
| `date_parser` | Function to use to parse dates (deprecated since pandas 2.0 in favor of `date_format`). |
| `nrows` | Number of rows to read from beginning of file (excluding header). |
| `iterator` | Return a `TextFileReader` for iterative/chunked reading; usable with `with` statement. |
| `chunksize` | For iteration, number of rows per chunk. |
| `skipfooter` | Number of lines to ignore at end of file. |
| `verbose` | Print parsing info (timing, memory usage). |
| `encoding` | Text encoding (e.g. `"utf-8"`); defaults to `"utf-8"` if `None`. |
| `squeeze` | If parsed data contains only one column, return a `Series` (removed in pandas 2.0). |
| `thousands` | Thousands separator (e.g. `","` or `"."`); default `None`. |
| `decimal` | Decimal separator (e.g. `"."` or `","`); default `"."`. |
| `engine` | CSV parsing engine: `"c"` (default), `"python"`, or `"pyarrow"` (`"pyarrow"` faster on large files; `"python"` supports extra features but slower). |

#### Reading Text Files in Pieces
When processing very large files or figuring out the right set of arguments to correctly process a large file, you may want to read only a small piece of a file or iterate through smaller chunks of the file.

Before we look at a large file, we make the pandas display settings more compact:

In [246]:
pd.options.display.max_rows = 10

Now we have:

In [247]:
result = pd.read_csv("examples/ex6.csv")
result

Unnamed: 0,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.501840,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q
...,...,...,...,...,...
9995,2.311896,-0.417070,-1.409599,-0.515821,L
9996,-0.479893,-0.650419,0.745152,-0.646038,E
9997,0.523331,0.787112,0.486066,1.093156,K
9998,-0.362559,0.598894,-1.843201,0.887292,G


The ellipsis marks `...` indicate that rows in the middle of the DataFrame have been omitted.

If you want to read only a small number of rows (avoiding reading the entire file), specify that with `nrows`:

In [248]:
pd.read_csv("examples/ex6.csv", nrows=10)

Unnamed: 0,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.50184,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q
5,1.81748,0.742273,0.419395,-2.251035,Q
6,-0.776764,0.935518,-0.332872,-1.875641,U
7,-0.913135,1.530624,-0.572657,0.477252,K
8,0.35848,-0.497572,-0.367016,0.507702,S
9,-1.740877,-1.160417,-1.63783,2.172201,G
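
Besides `nrows`, the `chunksize` argument from the table above lets you iterate over a file in pieces. Here is a minimal sketch, assuming a small made-up CSV held in memory rather than `examples/ex6.csv`, that tallies the values of a `key` column one chunk at a time.

```python
import io

import pandas as pd

# Made-up data: 20 rows with a repeating key pattern
keys = "ABCAB" * 4
csv_text = "key,value\n" + "\n".join(f"{key},{i}" for i, key in enumerate(keys))

counts = {}
# With chunksize set, read_csv returns an iterator of DataFrames
with pd.read_csv(io.StringIO(csv_text), chunksize=6) as reader:
    for chunk in reader:
        for key, n in chunk["key"].value_counts().items():
            counts[key] = counts.get(key, 0) + int(n)

print(counts)
```

Only one chunk is held in memory at a time, which is the point of the approach for files too large to load at once.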


#### Writing Data to Text Format
Data can also be exported to a delimited format. Let’s consider one of the CSV files read before:

In [249]:
data = pd.read_csv("examples/ex5.csv")
data.to_csv("examples/out.csv", index=False)

Other delimiters can be used, of course (writing to sys.stdout so it prints the text result to the console rather than a file):

In [250]:
import sys

data.to_csv(sys.stdout, sep="|")

|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo


Missing values appear as empty strings in the output. You might want to denote them by some other sentinel value:

In [251]:
data.to_csv(sys.stdout, na_rep="NULL")

,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo


With no other options specified, both the row and column labels are written. Both of these can be disabled:

In [252]:
data.to_csv(sys.stdout, index=False, header=False)

one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo


You can also write only a subset of the columns, and in an order of your choosing:

In [253]:
data.to_csv(sys.stdout, index=False, columns=["a", "b", "c"])

a,b,c
1,2,3.0
5,6,
9,10,11.0
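
When no path or file object is passed, `to_csv` returns the text as a string, which makes a quick round-trip check possible. The sketch below uses a small made-up DataFrame and confirms that the `NULL` sentinel written by `na_rep` is read back as missing.

```python
import io

import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "something": ["one", "two"],
    "c": [3.0, np.nan],
    "message": [np.nan, "world"],
})

# No path given, so the CSV text is returned as a string
text = frame.to_csv(index=False, na_rep="NULL")
print(text)

# "NULL" is among the default NA sentinels, so it round-trips to NaN
restored = pd.read_csv(io.StringIO(text))
print(restored.isna())
```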


### JSON Data
JSON (short for JavaScript Object Notation) has become one of the standard formats for sending data by HTTP request between web browsers and other applications. It is a much more free-form data format than a tabular text form like CSV. Here is an example:

In [4]:
obj = """
{"name": "Wes",
 "cities_lived": ["Akron", "Nashville", "New York", "San Francisco"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 34, "hobbies": ["guitars", "soccer"]},
              {"name": "Katie", "age": 42, "hobbies": ["diving", "art"]}]
}
"""

JSON is very nearly valid Python code with the exception of its null value null and some other nuances (such as disallowing trailing commas at the end of lists). The basic types are objects (dictionaries), arrays (lists), strings, numbers, Booleans, and nulls. All of the keys in an object must be strings. There are several Python libraries for reading and writing JSON data. I’ll use `json` here, as it is built into the Python standard library. To convert a JSON string to Python form, use `json.loads`:

In [5]:
import json

result = json.loads(obj)
result

{'name': 'Wes',
 'cities_lived': ['Akron', 'Nashville', 'New York', 'San Francisco'],
 'pet': None,
 'siblings': [{'name': 'Scott', 'age': 34, 'hobbies': ['guitars', 'soccer']},
  {'name': 'Katie', 'age': 42, 'hobbies': ['diving', 'art']}]}

`json.dumps` converts a Python object back to JSON:

In [6]:
asjson = json.dumps(result)

asjson

'{"name": "Wes", "cities_lived": ["Akron", "Nashville", "New York", "San Francisco"], "pet": null, "siblings": [{"name": "Scott", "age": 34, "hobbies": ["guitars", "soccer"]}, {"name": "Katie", "age": 42, "hobbies": ["diving", "art"]}]}'

How you convert a JSON object or list of objects to a DataFrame or some other data structure for analysis will be up to you. Conveniently, you can pass a list of dictionaries (which were previously JSON objects) to the DataFrame constructor and select a subset of the data fields:

In [7]:
siblings = pd.DataFrame(result["siblings"], columns=["name", "age"])

siblings

Unnamed: 0,name,age
0,Scott,34
1,Katie,42


`pandas.read_json` can automatically convert JSON datasets in specific arrangements into a Series or DataFrame. Its default options assume that each object in the JSON array is a row in the table:

In [8]:
data = pd.read_json("examples/example.json")

data

If you need to export data from pandas to JSON, one way is to use the to_json methods on Series and DataFrame:

In [9]:
data.to_json(sys.stdout)
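
If `examples/example.json` is not available, the same round trip can be tried on an inline JSON array. This is a sketch under that assumption: the JSON text below is a made-up stand-in, wrapped in `io.StringIO` so that `pandas.read_json` treats it as a file-like object.

```python
import io
import json

import pandas as pd

# Hypothetical stand-in for the contents of a JSON file: an array of objects
json_text = '[{"a": 1, "b": 2, "c": 3}, {"a": 4, "b": 5, "c": 6}, {"a": 7, "b": 8, "c": 9}]'

# Each object in the array becomes one row
data = pd.read_json(io.StringIO(json_text))
print(data)

# orient="records" writes the rows back out as an array of objects
as_records = data.to_json(orient="records")
print(as_records)
```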

### Interacting with Web APIs

Many websites have public APIs providing data feeds via JSON or some other format. There are a number of ways to access these APIs from Python; one method that I recommend is the requests package, which can be installed with pip or conda:

`pip install requests`

To find the last 30 issues for pandas on GitHub, we can make a GET HTTP request using the add-on requests library:

In [11]:
%pip install requests

Note: you may need to restart the kernel to use updated packages.


In [7]:
import requests
url = "https://api.github.com/repos/pandas-dev/pandas/issues"
resp = requests.get(url)
resp.raise_for_status()
resp

It's a good practice to always call raise_for_status after using requests.get to check for HTTP errors.

The response object’s json method will return a Python object containing the parsed JSON data as a dictionary or list (depending on what JSON is returned):

In [8]:
data = resp.json()

In [9]:
data

In [264]:
data[0]["title"]

'DOC: Missing documentation for Series.__invert__() (~ operator)'

Since the results retrieved are based on real-time data, what you see when you run this code will almost certainly be different.

Each element in data is a dictionary containing all of the data found on a GitHub issue page (except for the comments). We can pass data directly to pandas.DataFrame and extract fields of interest:

In [4]:
import pandas as pd

In [10]:
issues = pd.DataFrame(data, columns=["number", "title", "labels", "state"])

issues.to_csv("examples/teejay.csv", index=False)
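
The DataFrame step does not depend on the network, so it can be tried offline. In the sketch below, `records` is a hand-written, hypothetical stand-in for what `resp.json()` might return (the field values are invented), and the `label_names` step is my own addition to show one way of flattening the nested `labels` field.

```python
import pandas as pd

# Hypothetical records shaped like items in an issues API response
records = [
    {"number": 101, "title": "Hypothetical issue A", "labels": [],
     "state": "open", "comments": 3},
    {"number": 102, "title": "Hypothetical issue B", "labels": [{"name": "Bug"}],
     "state": "open", "comments": 0},
]

# Only the listed fields are kept; "comments" is dropped
issues = pd.DataFrame(records, columns=["number", "title", "labels", "state"])

# Each "labels" entry is a list of dictionaries; pull out just the names
label_names = issues["labels"].apply(lambda labels: [label["name"] for label in labels])

print(issues)
print(label_names.tolist())
```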


### Interacting with Databases

In a business setting, a lot of data may not be stored in text or Excel files. SQL-based relational databases (such as SQL Server, PostgreSQL, and MySQL) are in wide use, and many alternative databases have become quite popular. The choice of database is usually dependent on the performance, data integrity, and scalability needs of an application.

pandas has some functions to simplify loading the results of a SQL query into a DataFrame. As an example, I’ll create a SQLite3 database using Python’s built-in sqlite3 driver:

In [11]:
import sqlite3

query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
 c REAL,        d INTEGER
);"""

con = sqlite3.connect("mydata.sqlite")

con.execute(query)

<sqlite3.Cursor at 0x1b0b3731140>

In [12]:
con.commit()

Then, insert a few rows of data:

In [13]:
data = [("Atlanta", "Georgia", 1.25, 6),
         ("Tallahassee", "Florida", 2.6, 3),
         ("Sacramento", "California", 1.7, 5)]

stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"

con.executemany(stmt, data)
con.commit()

Most Python SQL drivers return a list of tuples when selecting data from a table:

In [271]:
cursor = con.execute("SELECT * FROM test")
rows = cursor.fetchall()
rows

[('Atlanta', 'Georgia', 1.25, 6),
 ('Tallahassee', 'Florida', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5),
 ('Atlanta', 'Georgia', 1.25, 6),
 ('Tallahassee', 'Florida', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5),
 ('Atlanta', 'Georgia', 1.25, 6),
 ('Tallahassee', 'Florida', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5)]

You can pass the list of tuples to the DataFrame constructor, but you also need the column names, contained in the cursor’s description attribute. Note that for SQLite3, the cursor description only provides column names (the other fields, which are part of Python's Database API specification, are None), but for some other database drivers, more column information is provided:

In [272]:
cursor.description

(('a', None, None, None, None, None, None),
 ('b', None, None, None, None, None, None),
 ('c', None, None, None, None, None, None),
 ('d', None, None, None, None, None, None))

In [273]:
pd.DataFrame(rows, columns=[x[0] for x in cursor.description])

Unnamed: 0,a,b,c,d
0,Atlanta,Georgia,1.25,6
1,Tallahassee,Florida,2.6,3
2,Sacramento,California,1.7,5
3,Atlanta,Georgia,1.25,6
4,Tallahassee,Florida,2.6,3
5,Sacramento,California,1.7,5
6,Atlanta,Georgia,1.25,6
7,Tallahassee,Florida,2.6,3
8,Sacramento,California,1.7,5


This is quite a bit of munging that you’d rather not repeat each time you query the database. The SQLAlchemy project is a popular Python SQL toolkit that abstracts away many of the common differences between SQL databases. pandas has a `read_sql` function that enables you to read data easily from a general SQLAlchemy connection. You can install SQLAlchemy with pip like so:

In [1]:
%pip install sqlalchemy

Now, we'll connect to the same SQLite database with SQLAlchemy and read data from the table created before:

In [274]:
import sqlalchemy as sqla

db = sqla.create_engine("sqlite:///mydata.sqlite")
pd.read_sql("SELECT * FROM test", db)

Unnamed: 0,a,b,c,d
0,Atlanta,Georgia,1.25,6
1,Tallahassee,Florida,2.6,3
2,Sacramento,California,1.7,5
3,Atlanta,Georgia,1.25,6
4,Tallahassee,Florida,2.6,3
5,Sacramento,California,1.7,5
6,Atlanta,Georgia,1.25,6
7,Tallahassee,Florida,2.6,3
8,Sacramento,California,1.7,5


## Data Cleaning and Preparation

During the course of doing data analysis and modeling, a significant amount of time is spent on data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take up 80% or more of an analyst's time. Sometimes the way that data is stored in files or databases is not in the right format for a particular task. Many researchers choose to do ad hoc processing of data from one form to another using a general-purpose programming language, like Python, Perl, R, or Java, or Unix text-processing tools like sed or awk. Fortunately, pandas, along with the built-in Python language features, provides you with a high-level, flexible, and fast set of tools to enable you to manipulate data into the right form.

If you identify a type of data manipulation that isn’t anywhere in this book or elsewhere in the pandas library, feel free to share your use case on one of the Python mailing lists or on the pandas GitHub site. Indeed, much of the design and implementation of pandas have been driven by the needs of real-world applications.

In this chapter I discuss tools for missing data, duplicate data, string manipulation, and some other analytical data transformations. In the next chapter, I focus on combining and rearranging datasets in various ways.

#### Handling Missing Data

Missing data occurs commonly in many data analysis applications. One of the goals of pandas is to make working with missing data as painless as possible. For example, all of the descriptive statistics on pandas objects exclude missing data by default.

The way that missing data is represented in pandas objects is somewhat imperfect, but it is sufficient for most real-world use. For data with float64 dtype, pandas uses the floating-point value NaN (Not a Number) to represent missing data.

We call this a sentinel value: when present, it indicates a missing (or null) value:

In [275]:
float_data = pd.Series([1.2, -3.5, np.nan, 0])

In [276]:
float_data

0    1.2
1   -3.5
2    NaN
3    0.0
dtype: float64

The isna method gives us a Boolean Series with True where values are null:

In [277]:
float_data.isna()

0    False
1    False
2     True
3    False
dtype: bool

In pandas, we've adopted a convention used in the R programming language by referring to missing data as NA, which stands for not available. In statistics applications, NA data may either be data that does not exist or that exists but was not observed (through problems with data collection, for example). When cleaning up data for analysis, it is often important to do analysis on the missing data itself to identify data collection problems or potential biases in the data caused by missing data.

The built-in Python None value is also treated as NA:

In [278]:
string_data = pd.Series(["aardvark", np.nan, None, "avocado"])
string_data.isna()

0    False
1     True
2     True
3    False
dtype: bool

The pandas project has attempted to make working with missing data consistent across data types. Functions like `pandas.isna` abstract away many of the annoying details. See the table below for a list of some functions related to missing data handling.

#### NA handling object methods
| Method | Description |
|--------|-------------|
| `dropna` | Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate. |
| `fillna` | Fill in missing data with some value or using an interpolation method such as `"ffill"` or `"bfill"`. |
| `isna` | Return Boolean values indicating which values are missing/NA. |
| `notna` | Negation of `isna`; returns `True` for non-NA values and `False` for NA values. |
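
As a small illustration of the methods in the table, the sketch below counts missing values per column. The `frame` object and its column names are made up for this example and do not appear elsewhere in these notes.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: two columns with a few missing entries
frame = pd.DataFrame({
    "price": [1.5, np.nan, 3.0, np.nan],
    "label": ["a", None, "c", "d"],
})

# isna returns a Boolean frame; summing it counts the True values per column
missing_per_column = frame.isna().sum()

# notna is the negation, so the two counts add up to the number of rows
present_per_column = frame.notna().sum()

# pd.isna also accepts scalars such as None, NaN, or ordinary numbers
scalar_check = (pd.isna(None), pd.isna(np.nan), pd.isna(0.0))

print(missing_per_column)
print(scalar_check)
```

Counting missing values per column like this is often a reasonable first step before deciding whether to drop or fill them.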

#### Filtering Out Missing Data
There are a few ways to filter out missing data. While you always have the option to do it by hand using pandas.isna and Boolean indexing, dropna can be helpful. On a Series, it returns the Series with only the nonnull data and index values:

In [279]:
data = pd.Series([1, np.nan, 3.5, np.nan, 7])
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

This is the same thing as doing:

In [280]:
data[data.notna()]

0    1.0
2    3.5
4    7.0
dtype: float64

With DataFrame objects, there are different ways to remove missing data. You may want to drop rows or columns that are all NA, or only those rows or columns containing any NAs at all. dropna by default drops any row containing a missing value:

In [282]:
data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan], [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
data.dropna()

     0    1    2
0  1.0  6.5  3.0


Passing how="all" will drop only rows that are all NA:

In [283]:
data

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0


In [284]:
data.dropna(how="all")

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0


Keep in mind that these functions return new objects by default and do not modify the contents of the original object.

To drop columns in the same way, pass axis="columns":

In [285]:
data[4] = np.nan

data   

     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN


In [286]:
data.dropna(axis="columns", how="all")

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0


Suppose you want to keep only rows that contain at least a certain number of observed (non-missing) values. You can indicate this with the thresh argument, which sets the minimum number of non-NA values a row must have in order to be kept:

In [287]:
df = pd.DataFrame(np.random.standard_normal((7, 3)))
df.iloc[:4, 1] = np.nan
df.iloc[:2, 2] = np.nan

print(df)

          0         1         2
0  2.040950       NaN       NaN
1  0.373144       NaN       NaN
2 -1.704707       NaN -1.056701
3 -0.370714       NaN  0.983587
4  1.636196  0.073746 -1.638427
5  1.563269 -1.494755 -0.227718
6 -0.650784 -0.820741 -0.977512


In [288]:
df.dropna()

          0         1         2
4  1.636196  0.073746 -1.638427
5  1.563269 -1.494755 -0.227718
6 -0.650784 -0.820741 -0.977512


In [289]:
df.dropna(thresh=2)

          0         1         2
2 -1.704707       NaN -1.056701
3 -0.370714       NaN  0.983587
4  1.636196  0.073746 -1.638427
5  1.563269 -1.494755 -0.227718
6 -0.650784 -0.820741 -0.977512
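
Because the random data above changes on every run, a deterministic sketch may make the meaning of thresh easier to check. The `frame` below is a made-up example with 0, 1, 2, and 3 missing values in its four rows.

```python
import numpy as np
import pandas as pd

# Hypothetical 4x3 frame; row i has i missing values
frame = pd.DataFrame(
    [[1.0, 2.0, 3.0],
     [1.0, 2.0, np.nan],
     [1.0, np.nan, np.nan],
     [np.nan, np.nan, np.nan]]
)

# thresh=2 keeps rows that have at least two non-NA values
kept = frame.dropna(thresh=2)

# The same filter written by hand with notna and Boolean indexing
by_hand = frame[frame.notna().sum(axis="columns") >= 2]

print(kept.index.tolist())
```

To keep rows with at most k missing values in a frame with n columns, one option is to pass `thresh=n - k`.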


#### Filling In Missing Data
Rather than filtering out missing data (and potentially discarding other data along with it), you may want to fill in the “holes” in any number of ways. For most purposes, the fillna method is the workhorse function to use. Calling fillna with a constant replaces missing values with that value:

In [290]:
df.fillna(0)

          0         1         2
0  2.040950  0.000000  0.000000
1  0.373144  0.000000  0.000000
2 -1.704707  0.000000 -1.056701
3 -0.370714  0.000000  0.983587
4  1.636196  0.073746 -1.638427
5  1.563269 -1.494755 -0.227718
6 -0.650784 -0.820741 -0.977512


Calling fillna with a dictionary, you can use a different fill value for each column:

In [291]:
df.fillna({1: 0.5, 2: 0})

          0         1         2
0  2.040950  0.500000  0.000000
1  0.373144  0.500000  0.000000
2 -1.704707  0.500000 -1.056701
3 -0.370714  0.500000  0.983587
4  1.636196  0.073746 -1.638427
5  1.563269 -1.494755 -0.227718
6 -0.650784 -0.820741 -0.977512


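The same interpolation methods available for reindexing (see Table 5.3), such as forward fill, can be used to fill missing data. Recent pandas releases deprecate the method argument of fillna in favor of the dedicated ffill and bfill methods, which produce the same result: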

In [292]:
df = pd.DataFrame(np.random.standard_normal((6, 3)))
df.iloc[2:, 1] = np.nan
df.iloc[4:, 2] = np.nan
print(df)

          0         1         2
0 -0.745716  0.439531  0.703067
1  1.205190 -0.347581  0.317678
2 -0.901683       NaN -0.526896
3  0.810781       NaN -1.113451
4 -0.033441       NaN       NaN
5 -1.295903       NaN       NaN


In [293]:
df.ffill()  # same result as df.fillna(method="ffill"), which recent pandas versions deprecate

          0         1         2
0 -0.745716  0.439531  0.703067
1  1.205190 -0.347581  0.317678
2 -0.901683 -0.347581 -0.526896
3  0.810781 -0.347581 -1.113451
4 -0.033441 -0.347581 -1.113451
5 -1.295903 -0.347581 -1.113451


In [294]:
df.ffill(limit=2)  # fill at most two consecutive missing values in each gap

          0         1         2
0 -0.745716  0.439531  0.703067
1  1.205190 -0.347581  0.317678
2 -0.901683 -0.347581 -0.526896
3  0.810781 -0.347581 -1.113451
4 -0.033441       NaN -1.113451
5 -1.295903       NaN -1.113451


With fillna you can do lots of other things such as simple data imputation using the median or mean statistics:

In [None]:
data = pd.Series([1., np.nan, 3.5, np.nan, 7]) 
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
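
The same idea extends to a DataFrame. The sketch below fills each column with its own median; the `frame` and its column names are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column
frame = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [10.0, 20.0, np.nan],
})

# frame.median() is a Series indexed by column name, so fillna matches
# each column to its own median (NaN values are skipped when computing it)
filled = frame.fillna(frame.median())

print(filled)
```

Whether the mean, the median, or some other statistic is appropriate depends on the data; this only shows the mechanics.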

#### `fillna` function arguments:
| Argument | Description |
|----------|-------------|
| `value` | Scalar value or dictionary-like object to use to fill missing values |
| `method` | Interpolation method: one of `"bfill"` (backward fill) or `"ffill"` (forward fill); default is `None`. Deprecated in recent pandas releases in favor of the `ffill` and `bfill` methods |
| `axis` | Axis to fill on (`"index"` or `"columns"`); default is `axis="index"` |
| `limit` | For forward and backward filling, maximum number of consecutive periods to fill |
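
A minimal sketch of the limit and axis options follows, using the ffill and bfill methods rather than the method argument. The Series `s` and the `frame` are made-up examples.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a gap of three missing values
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Forward fill, but only the first two missing values of the gap
forward = s.ffill(limit=2)

# Backward fill, but only one missing value before each observed value
backward = s.bfill(limit=1)

# On a DataFrame, axis="columns" fills along each row instead of down each column
frame = pd.DataFrame([[1.0, np.nan, 3.0],
                      [np.nan, 5.0, np.nan]])
across = frame.ffill(axis="columns")

print(forward.tolist())
print(backward.tolist())
print(across)
```

Note that a leading missing value (such as the first cell of the second row) has nothing before it to carry forward, so it stays missing.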


### Data Transformation

So far in this chapter we’ve been concerned with handling missing data. Filtering, cleaning, and other transformations are another class of important operations.

#### Removing Duplicates
Duplicate rows may be found in a DataFrame for any number of reasons. Here is an example:

In [None]:
data = pd.DataFrame({"k1": ["one", "two"] * 3 + ["two"], "k2": [1, 1, 2, 3, 3, 4, 4]})
data

The DataFrame method duplicated returns a Boolean Series indicating whether each row is a duplicate (its column values are exactly equal to those in an earlier row) or not:

In [None]:
data.duplicated()

Relatedly, drop_duplicates returns a DataFrame containing only the rows for which duplicated is False, that is, with the duplicate rows filtered out:

In [None]:
data.drop_duplicates()

Both methods by default consider all of the columns; alternatively, you can specify any subset of them to detect duplicates. Suppose we had an additional column of values and wanted to filter duplicates based only on the "k1" column:

In [None]:
data["v1"] = range(7)

In [None]:
data

In [None]:
data.drop_duplicates(subset="k1")
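
By default, duplicated and drop_duplicates keep the first occurrence of each combination. The sketch below, which rebuilds the same example frame under the hypothetical name `dupes`, shows how `keep="last"` changes which row survives.

```python
import pandas as pd

# Same values as the example above, rebuilt so the block is self-contained
dupes = pd.DataFrame({
    "k1": ["one", "two"] * 3 + ["two"],
    "k2": [1, 1, 2, 3, 3, 4, 4],
    "v1": range(7),
})

# Rows 5 and 6 share ("two", 4); the default keeps row 5
first_kept = dupes.drop_duplicates(subset=["k1", "k2"])

# keep="last" keeps row 6 instead
last_kept = dupes.drop_duplicates(subset=["k1", "k2"], keep="last")

print(first_kept.index.tolist())
print(last_kept.index.tolist())
```

This matters when the non-key columns (here `v1`) differ between the duplicated rows.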

#### Transforming Data Using a Function or Mapping
For many datasets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame. Consider the following hypothetical data collected about various kinds of meat:

In [2]:
import pandas as pd

In [3]:
data = pd.DataFrame({"food": ["bacon", "pulled pork", "bacon", "pastrami", "corned beef", "bacon", "pastrami", "honey ham", "nova lox"], "ounces": [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

In [4]:
data

          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     pastrami     6.0
4  corned beef     7.5
5        bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0


Suppose you wanted to add a column indicating the type of animal that each food came from. Let’s write down a mapping of each distinct meat type to the kind of animal:

In [6]:
meat_to_animal = {
  "bacon": "pig",
  "pulled pork": "pig",
  "pastrami": "cow",
  "corned beef": "cow",
  "honey ham": "pig",
  "nova lox": "salmon"
}

The `map` method on a Series accepts a function or dictionary-like object containing a mapping to do the transformation of values:

In [7]:
data["animal"] = data["food"].map(meat_to_animal)

data

          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     pastrami     6.0     cow
4  corned beef     7.5     cow
5        bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon


We could also have passed a function that does all the work:

In [8]:
def get_animal(x):
    return meat_to_animal[x]

data["animal"] = data["food"].map(get_animal)
data

          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     pastrami     6.0     cow
4  corned beef     7.5     cow
5        bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon


Using map is a convenient way to perform element-wise transformations and other data cleaning-related operations.
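
One practical difference between the two approaches: a dictionary passed to map leaves unmatched values as NaN, while a function that indexes the dictionary directly would raise a KeyError. The sketch below uses a shortened, hypothetical version of the mapping and a made-up `foods` Series with inconsistent capitalization.

```python
import pandas as pd

# Hypothetical inputs: one value is capitalized, one has no mapping at all
foods = pd.Series(["bacon", "Pastrami", "tofu"])
meat_to_animal = {"bacon": "pig", "pastrami": "cow"}

# Dictionary lookup: values without a matching key become NaN
mapped = foods.map(meat_to_animal)

# Normalizing the case first lets "Pastrami" match its key
normalized = foods.str.lower().map(meat_to_animal)

# A function using dict.get can supply a default instead of NaN
labelled = foods.str.lower().map(lambda x: meat_to_animal.get(x, "unknown"))

print(mapped.tolist())
print(normalized.tolist())
print(labelled.tolist())
```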

#### Replacing Values
Filling in missing data with the fillna method is a special case of more general value replacement. As you've already seen, map can be used to modify a subset of values in an object, but replace provides a simpler and more flexible way to do so. Let’s consider this Series:

In [None]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])

data

In [None]:
data.replace(-999, np.nan)  # treat the sentinel value -999 as missing

In [None]:
data.replace([-999, -1000], np.nan)  # replace several sentinel values at once

To use a different replacement for each value, pass a list of substitutes:

In [None]:
data.replace([-999, -1000], [np.nan, 0])

The argument passed can also be a dictionary:

In [None]:
data.replace({-999: np.nan, -1000: 0})
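
To see why replacing sentinel values matters, the sketch below compares the mean before and after cleaning, and contrasts replace (whole values) with str.replace (substrings). The `codes` Series is a made-up example.

```python
import numpy as np
import pandas as pd

data = pd.Series([1.0, -999.0, 2.0, -999.0, -1000.0, 3.0])

# Convert both sentinel values to NaN so statistics skip them
cleaned = data.replace([-999, -1000], np.nan)
raw_mean = data.mean()       # distorted by the sentinel values
clean_mean = cleaned.mean()  # mean of 1, 2, and 3 only

# replace matches whole values; str.replace substitutes inside each string
codes = pd.Series(["N/A", "ok", "N/A ok"])
whole = codes.replace("N/A", "missing")
substring = codes.str.replace("N/A", "missing", regex=False)

print(raw_mean, clean_mean)
print(whole.tolist())
print(substring.tolist())
```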

#### Renaming Axis Indexes
Like values in a Series, axis labels can be transformed by a function or mapping of some form to produce new, differently labeled objects. You can also modify the axes in place without creating a new data structure. Here’s a simple example:

In [None]:
data = pd.DataFrame(np.arange(12).reshape((3, 4)), index=["Ohio", "Colorado", "New York"], columns=["one", "two", "three", "four"])

In [None]:
def transform(x):
    # Keep the first four characters of each label, uppercased
    return x[:4].upper()

data.index.map(transform)

You can assign to the index attribute, modifying the DataFrame in place:

In [None]:
data.index = data.index.map(transform)

In [None]:
data

If you want to create a transformed version of a dataset without modifying the original, a useful method is rename:

In [None]:
data.rename(index=str.lower, columns=str.upper)

Notably, rename can be used in conjunction with a dictionary-like object, providing new values for a subset of the axis labels:

In [None]:
data.rename(index={"OHIO": "INDIANA"}, columns={"three": "peekaboo"})
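
The following sketch checks that rename returns a new object and leaves the original labels alone. It rebuilds the example frame under the hypothetical name `states` so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Same shape and labels as the example frame, before any in-place changes
states = pd.DataFrame(
    np.arange(12).reshape((3, 4)),
    index=["Ohio", "Colorado", "New York"],
    columns=["one", "two", "three", "four"],
)

# A dictionary renames only the listed labels; a function is applied to all
renamed = states.rename(index={"Ohio": "Indiana"}, columns=str.upper)

print(renamed.index.tolist(), renamed.columns.tolist())
print(states.index.tolist(), states.columns.tolist())
```

Labels that are not keys in the dictionary (here "Colorado" and "New York") pass through unchanged.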

#### Discretization and Binning
Continuous data is often discretized or otherwise separated into “bins” for analysis. Suppose you have data about a group of people in a study, and you want to group them into discrete age buckets:

In [None]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

Let’s divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older. To do so, use `pandas.cut`:

In [None]:
bins = [18, 25, 35, 60, 100]
age_categories = pd.cut(ages, bins)
age_categories

The object pandas returns is a special Categorical object. The output you see describes the bins computed by `pandas.cut`. Each bin is identified by a special (unique to pandas) interval value type containing the lower and upper limit of each bin:

In [None]:
age_categories.codes

In [None]:
age_categories.categories

In [None]:
age_categories.categories[0]

In [None]:
age_categories.value_counts()  # pd.value_counts(...) is deprecated in recent pandas versions

You can override the default interval-based bin labeling by passing a list or array to the labels option:

In [None]:
group_names = ["Youth", "YoungAdult", "MiddleAged", "Senior"]

pd.cut(ages, bins, labels=group_names)
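
By default the bins are closed on the right, so an age of exactly 25 falls in the first bin. The sketch below compares that with `right=False`, which closes the bins on the left instead. The variable names `right_closed` and `left_closed` are introduced here for illustration.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
group_names = ["Youth", "YoungAdult", "MiddleAged", "Senior"]

# Default: intervals are (18, 25], (25, 35], ... so 25 counts as "Youth"
right_closed = pd.cut(ages, bins, labels=group_names)

# right=False: intervals are [18, 25), [25, 35), ... so 25 moves up a bin
left_closed = pd.cut(ages, bins, labels=group_names, right=False)

# ages[2] is 25, the value sitting exactly on a bin edge
print(right_closed[2], left_closed[2])
```

Which side to close depends on how the age ranges are defined in the study; the default matches the "18 to 25, 26 to 35" wording used above for whole-number ages.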

### Grouping and Pivot tables

0.  Setup (one cell)

In [None]:
import pandas as pd
import numpy as np

1.  Toy data – keep it tiny so the output fits on one slide

In [None]:
df = pd.DataFrame({
    'Region' : ['North', 'North', 'South', 'South', 'East', 'East'],
    'Product': ['Apple', 'Banana', 'Apple', 'Banana', 'Apple', 'Banana'],
    'Qtr'    : ['Q1', 'Q2', 'Q1', 'Q2', 'Q1', 'Q2'],
    'Sales'  : [100, 150, 120, 90, 80, 110],
    'Units'  : [10, 15, 12, 9, 8, 11]
})

2.  groupby – “split → apply → combine” in one line


a. **Total sales per region**


In [None]:
df.groupby('Region')['Sales'].sum()

b. **Multiple metrics**

In [None]:
df.groupby('Region').agg({'Sales': 'sum', 'Units': 'mean'})
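
An alternative to the dictionary form is named aggregation, which lets you choose the output column names. The sketch below rebuilds the toy data under the hypothetical name `sales`; the output names `total_sales`, `avg_units`, and `n_rows` are also made up.

```python
import pandas as pd

# Same toy data as above, rebuilt so the block is self-contained
sales = pd.DataFrame({
    "Region": ["North", "North", "South", "South", "East", "East"],
    "Product": ["Apple", "Banana", "Apple", "Banana", "Apple", "Banana"],
    "Qtr": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2"],
    "Sales": [100, 150, 120, 90, 80, 110],
    "Units": [10, 15, 12, 9, 8, 11],
})

# Each keyword is an output column: (input column, aggregation function)
summary = sales.groupby("Region").agg(
    total_sales=("Sales", "sum"),
    avg_units=("Units", "mean"),
    n_rows=("Sales", "count"),
)

print(summary)
```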

c. **Two-level index**

In [None]:
df.groupby(['Region', 'Product'])['Sales'].sum()

3.  pivot_table – “groupby + reshape”

Same result as (c) but **wide** (Region → rows, Product → columns):

In [None]:
pivot_table = pd.pivot_table(df,
               values='Sales',
               index='Region',
               columns='Product',
               aggfunc='sum',
               fill_value=0)


In [None]:
pivot_table
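
To make the "groupby + reshape" description concrete, the sketch below builds the same wide table twice: once with groupby followed by unstack, and once with pivot_table. The toy data is rebuilt under the hypothetical name `sales` (only the columns needed here).

```python
import pandas as pd

# Subset of the toy data above, rebuilt so the block is self-contained
sales = pd.DataFrame({
    "Region": ["North", "North", "South", "South", "East", "East"],
    "Product": ["Apple", "Banana", "Apple", "Banana", "Apple", "Banana"],
    "Sales": [100, 150, 120, 90, 80, 110],
})

# Step 1: aggregate in long form; step 2: move Product into the columns
wide_from_groupby = (
    sales.groupby(["Region", "Product"])["Sales"]
    .sum()
    .unstack("Product", fill_value=0)
)

# pivot_table does both steps in one call
wide_from_pivot = pd.pivot_table(
    sales, values="Sales", index="Region", columns="Product",
    aggfunc="sum", fill_value=0,
)

print(wide_from_groupby)
```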

In [None]:
pivot_table.plot(kind='bar')

Add **margins** (Excel-style grand totals):

In [None]:
pd.pivot_table(df, values='Sales', index='Region', columns='Product',
               aggfunc='sum', fill_value=0, margins=True)

Multiple value fields:

In [None]:
pd.pivot_table(df, values=['Sales', 'Units'],
               index='Region', columns='Product',
               aggfunc='sum', fill_value=0)

4.  pivot – “pure reshape, no aggregation”

Use when **index + column combination is unique** (no duplicates).

Example: long → wide for **Qtr** (unique per Region-Product pair):

In [None]:
df.pivot(index=['Region', 'Product'],
         columns='Qtr',
         values='Sales')
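
The uniqueness requirement can be demonstrated with a small made-up frame, `long_sales`, in which one Region and Qtr combination appears twice. In this sketch pivot raises a ValueError, while pivot_table resolves the duplicates by aggregating them.

```python
import pandas as pd

# Hypothetical long-format data: ("North", "Q1") appears twice
long_sales = pd.DataFrame({
    "Region": ["North", "North", "North"],
    "Qtr": ["Q1", "Q1", "Q2"],
    "Sales": [100, 50, 150],
})

# pivot cannot decide which of the two Q1 values to place in the cell
try:
    long_sales.pivot(index="Region", columns="Qtr", values="Sales")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table combines the duplicates with the chosen aggregation
summed = long_sales.pivot_table(
    index="Region", columns="Qtr", values="Sales", aggfunc="sum"
)

print(pivot_failed)
print(summed)
```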

### 1.  What merge does
Combines rows from two DataFrames **side-by-side** by matching values in **one or more key columns** – exactly like a SQL JOIN.

### 2.  The four common joins (picture = SQL)

```python
pd.merge(left, right, how='inner', on='key')
```

| `how=` | Rows kept | Venn mental pic |
|--------|-----------|-----------------|
| `inner` | only rows whose key appears in **BOTH** tables | ∩ intersection |
| `left` | **all LEFT rows**, plus matching data from the right | ← full left |
| `right` | **all RIGHT rows**, plus matching data from the left | → full right |
| `outer` | **ALL rows** from either table, matched where possible | ∪ union |

Imagine two small tables:

**Table A (left)**  
| id | name |
|----|------|
| 1  | Ann  |
| 2  | Bob  |

**Table B (right)**  
| id | city  |
|----|-------|
| 1  | Paris |
| 3  | Rome  |

---

### 1. INNER join  
**Keep only the rows that match in BOTH tables.**  
→ Result:  
| id | name | city  |
|----|------|-------|
| 1  | Ann  | Paris |

---

### 2. LEFT join  
**Keep every row from Table A;**  
if a row has **no match** in B, fill with **NaN**.  
→ Result:  
| id | name | city  |
|----|------|-------|
| 1  | Ann  | Paris |
| 2  | Bob  | NaN   |

---

### 3. RIGHT join  
**Keep every row from Table B;**  
if a row has **no match** in A, fill with **NaN**.  
→ Result:  
| id | name | city  |
|----|------|-------|
| 1  | Ann  | Paris |
| 3  | NaN  | Rome  |

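---

### 4. OUTER join  
**Keep every row from both tables;**  
where a row has **no match** on the other side, fill with **NaN**.  
→ Result:  
| id | name | city  |
|----|------|-------|
| 1  | Ann  | Paris |
| 2  | Bob  | NaN   |
| 3  | NaN  | Rome  |

---

**One-sentence summary:**  
- **Inner** = overlap only  
- **Left** = all from **first** table  
- **Right** = all from **second** table  
- **Outer** = all from **both** tables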
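
The join types can be checked against Table A and Table B with a short sketch. The variable names `table_a`, `table_b`, and `results` are introduced here for illustration.

```python
import pandas as pd

# Table A (left) and Table B (right) from the text above
table_a = pd.DataFrame({"id": [1, 2], "name": ["Ann", "Bob"]})
table_b = pd.DataFrame({"id": [1, 3], "city": ["Paris", "Rome"]})

# Run the same merge once per join type and collect the results
results = {
    how: pd.merge(table_a, table_b, on="id", how=how)
    for how in ["inner", "left", "right", "outer"]
}

# Show which ids survive each join
for how, merged in results.items():
    print(how, merged["id"].tolist())
```

Only the value of `how` changes between the calls; the key column and the inputs stay the same.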

### 3.  Smallest working example

In [None]:
import pandas as pd

users = pd.DataFrame({'user_id': [1, 2, 3], 'name': ['Ann', 'Bob', 'Cat']})
orders = pd.DataFrame({'user_id': [1, 1, 2], 'amount': [10, 20, 30]})

pd.merge(users, orders, on='user_id', how='left')

### 4.  Must-know arguments

| Arg | Meaning |
|-----|---------|
| `on=` | Single column name **common to both** (or list). |
| `left_on=` / `right_on=` | Use when column names **differ** between tables. |
| `left_index=True` / `right_index=True` | Merge on **index labels** instead of columns. |
| `suffixes=('', '_right')` | Pair of strings appended to **overlapping non-key column names** from the left and right tables (the default `('_x', '_y')` gives `name_x`, `name_y`). |
| `validate='1:m'` | Safety check on key uniqueness: `'1:1'`, `'1:m'`, `'m:1'`, `'m:m'`; raises `MergeError` if the assumption is broken. |
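
A brief sketch of validate and suffixes in action follows. It reuses the `users` and `orders` frames from the smallest working example; the `left_vals` and `right_vals` frames and the suffix strings are made up for this illustration.

```python
import pandas as pd
from pandas.errors import MergeError

users = pd.DataFrame({"user_id": [1, 2, 3], "name": ["Ann", "Bob", "Cat"]})
orders = pd.DataFrame({"user_id": [1, 1, 2], "amount": [10, 20, 30]})

# One user can have many orders, so '1:m' passes
checked = pd.merge(users, orders, on="user_id", how="left", validate="1:m")

# '1:1' fails because user_id 1 appears twice in orders
try:
    pd.merge(users, orders, on="user_id", validate="1:1")
    one_to_one = True
except MergeError:
    one_to_one = False

# suffixes renames non-key columns that exist in both frames
left_vals = pd.DataFrame({"key": [1], "val": [10]})
right_vals = pd.DataFrame({"key": [1], "val": [20]})
suffixed = pd.merge(left_vals, right_vals, on="key", suffixes=("_left", "_right"))

print(len(checked), one_to_one)
print(suffixed.columns.tolist())
```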

### 5.  Quick reference card

In [None]:
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'val1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'val2': [4, 5, 6]})

In [None]:
# Key column has the same name in both frames.
# Try how='left', 'right', and 'outer' as well to compare the results.
pd.merge(df1, df2, on='key', how='inner')

In [None]:
df1 = pd.DataFrame({'key1': ['A', 'B', 'C'], 'val1': [1, 2, 3]})
df2 = pd.DataFrame({'key2': ['A', 'B', 'D'], 'val2': [4, 5, 6]})

In [None]:
# different names
pd.merge(df1, df2, left_on='key1', right_on='key2', how='left')

In [None]:
# index join
pd.merge(df1, df2, left_index=True, right_index=True, how='outer')

#### 1.  What concat does
Stacks or appends DataFrames **along one axis** – no matching required.  
Think **“glue”** instead of **“lock-and-key”**.

- `axis=0` ➜ **row-wise** (top-to-bottom) – **default**  
- `axis=1` ➜ **column-wise** (side-by-side) – like **bind-cols**

---

#### 2.  Smallest demo

In [None]:
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2], 'val': ['A', 'B']})
df2 = pd.DataFrame({'id': [3, 4], 'val': ['C', 'D']})

pd.concat([df1, df2])          # axis=0 (row-wise)

##### Column-wise:

In [None]:
pd.concat([df1, df2], axis=1)


#### 3.  Must-know arguments
| Arg | What it does |
|-----|--------------|
| `objs=[df1, df2, ...]` | **List/tuple** of DataFrames/Series to glue. |
| `axis=0` | `0` or `'index'` = rows; `1` or `'columns'` = columns. |
| `ignore_index=False` | If `True`, re-label rows 0, 1, 2, … (removes duplicate index). |
| `keys=['A','B']` | Creates **extra index level** so you can tell which chunk each row came from (handy for groupby later). |
| `join='outer'` | How to handle labels on the **other** axis: keep their **union** (`outer`, default) or **intersection** (`inner`). With `axis=0` this applies to columns; with `axis=1`, to row labels. |
| `verify_integrity=False` | If `True`, raise error on **duplicate index** (safety check). |
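
The sketch below exercises several of these arguments on two small frames. The extra column `extra` and the key labels "first" and "second" are made up for this example.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2], "val": ["A", "B"]})
df2 = pd.DataFrame({"id": [3, 4], "val": ["C", "D"], "extra": [True, False]})

# Default: original row labels are kept (so they repeat), columns are unioned
stacked = pd.concat([df1, df2])

# ignore_index=True relabels the rows 0, 1, 2, 3
renumbered = pd.concat([df1, df2], ignore_index=True)

# keys adds an outer index level identifying the source frame
tagged = pd.concat([df1, df2], keys=["first", "second"])

# join="inner" keeps only the columns present in every frame
common = pd.concat([df1, df2], join="inner")

# verify_integrity=True rejects the repeated row labels 0 and 1
try:
    pd.concat([df1, df2], verify_integrity=True)
    integrity_ok = True
except ValueError:
    integrity_ok = False

print(stacked.index.tolist(), renumbered.index.tolist())
print(common.columns.tolist(), integrity_ok)
```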
