# Pandas Processing Refresher

## The agenda:
- DataFrame vs. Series
- Indexes
- `iloc`: position-based indexing
- `loc`: label-based indexing
- max / min / mean
- Filtering / masks
- Summary methods: `df.info()`, `df.describe()`, `series.unique()`
- Creating new columns from the values of other columns
- Group by
- Group by + `count()`
- Group by + `max()`

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("data/cars.csv")

In [None]:
df.head()

`DataFrame`: A two-dimensional labelled data structure whose columns can have
different data types, i.e. a collection of Series of equal length that share
the same index

`Series`: A one-dimensional labelled array (typically backed by a NumPy array)
with a label-based index; index labels can be integers or text

`array`: A NumPy `ndarray`, i.e. a grid of values that all share one data type
and are addressed by integer position (NumPy, conventionally imported as `np`,
is a package that pandas builds on)
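
The relationship between the three structures can be checked on a small example. This is a minimal sketch on hypothetical toy data (the frame `toy` and its values are made up for illustration and do not come from `cars.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from cars.csv
toy = pd.DataFrame(
    {"mpg": [18.0, 15.0, 24.0], "origin": ["usa", "usa", "japan"]},
    index=["a", "b", "c"],
)

mpg = toy["mpg"]         # selecting one column of a DataFrame gives a Series
values = mpg.to_numpy()  # the values of the Series as a NumPy array

print(type(toy).__name__, type(mpg).__name__, type(values).__name__)
print(mpg.index.equals(toy.index))  # the Series keeps the DataFrame's index
print(toy.dtypes)                   # each column has its own data type
```

Note that the array no longer carries the labels `a`, `b`, `c`; it can only be addressed by integer position.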

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df

In [None]:
df.iloc[0]

In [None]:
df.iloc[0, 0]

In [None]:
df.iloc[:, 0]

In [None]:
df.loc[0, "mpg"]

In [None]:
# returns a Series
df.loc[:, "mpg"]

In [None]:
df.loc[:, "mpg"].max()

In [None]:
df.loc[:, "mpg"].min()

In [None]:
df["mpg"].max()

In [None]:
df["mpg"].min()

In [None]:
df["origin"].unique()

In [None]:
# returns a DataFrame
df.loc[:, ["mpg", "horsepower"]]

In [None]:
# returns a DataFrame
df[["mpg", "horsepower"]]

In [None]:
mask = df["origin"] == "usa"
df[mask]

In [None]:
print(f"{df[mask].shape[0]} cars from USA")
print(f"out of {df.shape[0]} cars")

In [None]:
# boolean mask: True where the name contains the substring "chevrolet"
string_mask = df["name"].str.contains("chevrolet")
df[string_mask].head(10)

In [None]:
# new column created with value 0 for all rows
df["price"] = 0

In [None]:
df.loc[0, "price"] = 100

In [None]:
df.head()
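
The agenda also lists creating new columns from the values of other columns. A minimal sketch of that idea, again on hypothetical toy data: the `weight` column and both new column names are assumptions made for illustration and may not match `cars.csv`.

```python
import pandas as pd

# Hypothetical toy data; "weight" is an assumed column name
toy = pd.DataFrame({"mpg": [18.0, 15.0, 24.0], "weight": [3504, 3693, 2372]})

# Derive a column from one existing column (US miles per gallon -> km per litre)
toy["km_per_litre"] = toy["mpg"] * 0.425144

# Derive a column from two existing columns, computed element-wise per row
toy["weight_per_mpg"] = toy["weight"] / toy["mpg"]

print(toy)
```

Arithmetic between Series is aligned on the index, so each row of the new column is computed from the same row of the source columns.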

In [None]:
df.groupby("origin").size()

In [None]:
df.groupby("origin").max()

In [None]:
df.groupby("origin")["mpg"].max()

In [None]:
# returns a Series
df.groupby("origin").count()["model_year"]

In [None]:
# returns a DataFrame
df.groupby("origin").count()[["horsepower", "model_year"]]

In [None]:
df.groupby("origin").size()
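
The cells above use both `.size()` and `.count()`, which can disagree when a column has missing values: `.size()` counts rows per group, while `.count()` counts non-null values per column. A small sketch on hypothetical toy data (values made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing horsepower value
toy = pd.DataFrame(
    {
        "origin": ["usa", "usa", "japan"],
        "horsepower": [130.0, np.nan, 95.0],
    }
)

sizes = toy.groupby("origin").size()                   # rows per group
counts = toy.groupby("origin")["horsepower"].count()   # non-null values per group

print(sizes)
print(counts)
```

Here the `usa` group has two rows but only one non-null `horsepower` value, so the two results differ for that group.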

### Using `.isnull()`

This method returns a boolean object of the same shape as the object it is called on, indicating which values are null (e.g. `None` or `np.nan`). Null values are mapped to `True`; non-null values are mapped to `False`.
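
As a minimal sketch of this behaviour, using hypothetical toy data rather than `cars.csv`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column
toy = pd.DataFrame(
    {"mpg": [18.0, np.nan, 24.0], "name": ["ford", "fiat", None]}
)

null_flags = toy.isnull()       # boolean DataFrame with the same shape as toy
null_counts = null_flags.sum()  # True counts as 1, so this counts nulls per column

print(null_flags)
print(null_counts)
```

Summing the boolean result is a common way to see how many values are missing in each column.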

### Using `.any()`

The `.any()` method returns whether any element is `True`, optionally reduced along an axis. Called on a DataFrame, it reduces along `axis=0` by default and returns one boolean per column.

Note that when you specify an `axis` argument (e.g. `.any(axis=1)`), you are specifying which axis is *reduced*, or removed. See the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.any.html) for further details.
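
The effect of the `axis` argument is easiest to see side by side. A minimal sketch on hypothetical toy data (values made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: only the second row of "mpg" is missing
toy = pd.DataFrame(
    {"mpg": [18.0, np.nan, 24.0], "horsepower": [130.0, 165.0, 95.0]}
)
flags = toy.isnull()

per_column = flags.any(axis=0)  # rows are reduced: one boolean per column
per_row = flags.any(axis=1)     # columns are reduced: one boolean per row

print(per_column)
print(per_row)
print(toy[per_row])  # the per-row result can be used as a mask to select rows
```

With `axis=1` the result is indexed like the rows of the DataFrame, which is why it can be used directly as a row mask.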

In [None]:
# True for every row that contains at least one null value
mask_nan_rows = df.isnull().any(axis=1)

df[mask_nan_rows]