# Sorting, Casting, and Categories

In this notebook, we will cover casting between data types, sorting operations, and categorical data. We will also learn how to use `apply()` to perform more custom operations on Pandas data.

Let's import `pandas` and then declare this tiny weather dataset as a dataframe to practice with.

In [None]:
import pandas as pd

df = pd.DataFrame({
    "record_id" : ['DCMXP87EDE', 'ZMIFM3HX9G', 'HIVVXBAPS2', 'U1AA66UDES', 'B20KL5PW3L', 'FIZLY34KSQ'],
    "rain_inches" : [1.1, 0.0, 0.0, 2.4, 11.2, 3.2],
    "tornado" : [0,1,0,0,0,0],
    "lightning" :[0,1,1,1,0,0],
    "wind_speed_mph" : [3.1, 143.0, 12.2, 8.1, 5.0, 19.0],
    "severity" : ['CLEAR', 'SEVERE', 'MINOR', 'MINOR', 'MAJOR', 'CLEAR']
})

df

## Data Types and Casting

As we talk more about data cleaning, this is a good time to talk about data types and choosing them carefully. Some basic Pandas datatypes are declared below with examples. Most of these are backed by NumPy datatypes as well.


In [None]:
df_types = pd.DataFrame({'float': [1.2],
              'int': [3],
              'datetime': [pd.Timestamp('20230130')],
              'bool' : [True],
              'time_delta' : [pd.Timestamp('20230130') - pd.Timestamp('20230127')],
              'string': ['hello']
            })
df_types

You can view the datatypes for a given dataframe using the `dtypes` property. 

In [None]:
df_types.dtypes

> Note that Pandas also provides additional datatypes of its own. A full list of supported datatypes [can be found here](https://pandas.pydata.org/docs/user_guide/basics.html#dtypes).

Now let's turn our attention back to the small weather dataset. 

In [None]:
df

Let's observe the datatypes.

In [None]:
df.dtypes

One of the most basic operations in data cleaning is casting. You can use the `astype()` function to convert a given column to another data type. For example, the `tornado` and `lightning` columns contain only 1's and 0's, indicating they are intended to be boolean values (True=1, False=0). We can cast them to booleans here.

In [None]:
df['tornado'] = df['tornado'].astype('bool')
df['lightning'] = df['lightning'].astype('bool')

df

And sure enough, you will see that the datatypes have changed to `bool` for those two columns.

In [None]:
df.dtypes
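
One caveat is worth keeping in mind when casting flags like these: `astype('bool')` treats every non-zero number as `True`, so an unexpected value would be accepted silently. Below is a minimal sketch of a check you could run before casting. The `flags` series is a hypothetical example and is not part of the weather dataframe.

```python
import pandas as pd

# Hypothetical flag column containing an unexpected value (2)
flags = pd.Series([0, 1, 2, 0])

# astype('bool') maps every non-zero number to True, so the 2 is silently accepted
as_bool = flags.astype("bool")

# Checking that only 0 and 1 are present before casting catches the stray value
is_clean = flags.isin([0, 1]).all()

print(as_bool.tolist())
print(is_clean)
```

If `is_clean` comes back `False`, it is probably worth inspecting the column before casting it.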

## Sorting Values

In Pandas, you can sort data along a row or column by specifying its axis in the `sort_values()` function. Below, we sort in ascending order, first by the `lightning` field and then by the `rain_inches` field.

In [None]:
df.sort_values(by=["lightning","rain_inches"])

If you want different sort behaviors for each of the columns, where some are ascending while others are descending, pass a boolean list to the `ascending` parameter. Below we set `lightning` to be descending so `True` records rise to the top, while `rain_inches` is ascending.

In [None]:
df.sort_values(by=["lightning","rain_inches"],ascending=[False,True])

> When using the sort methods, remember to add the `inplace=True` parameter if you want to replace the existing dataframe with the sorted one. 
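
To make the note above concrete, here is a small sketch showing that `sort_values()` returns a new dataframe and leaves the original untouched unless you ask otherwise. The `small` dataframe is a hypothetical stand-in used only for illustration.

```python
import pandas as pd

# Hypothetical miniature dataframe for illustration
small = pd.DataFrame({"rain_inches": [2.4, 0.0, 1.1]})

# Without inplace=True, sort_values() returns a new dataframe
sorted_copy = small.sort_values(by="rain_inches")

# The original keeps its order; the copy is sorted and keeps the old row labels
print(small["rain_inches"].tolist())
print(sorted_copy["rain_inches"].tolist())
print(sorted_copy.index.tolist())
```

Reassigning the result, as in `small = small.sort_values(by="rain_inches")`, is an alternative to passing `inplace=True`.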

## Sorting Index

Let's demonstrate how you can sort on an index. Let's first set the index to use the `record_id` for the rows.

In [None]:
df.set_index('record_id', inplace=True)

df

Now when we sort on the rows (using `axis=0`), notice that we sort alphabetically on the `record_id` as the index.

In [None]:
df.sort_index(axis=0)

This may not seem super interesting as we could have also sorted `record_id` as a column. But now consider that if we set `axis=1` in `sort_index()`, we can now sort the columns!

In [None]:
df.sort_index(axis=1)

So as you can imagine, that is useful. Note that since the `record_id` was turned into an index for the rows, it is not sorted with the rest of the columns and remains on the left of the dataframe.

> When using the sort methods, remember to add the `inplace=True` parameter if you want to replace the existing dataframe with the sorted one. 
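
As a compact recap of both axes, the sketch below sorts a hypothetical two-row dataframe (`frame`, with shortened made-up labels) by its row index in descending order and then by its column names. It assumes only the `axis` and `ascending` parameters of `sort_index()`.

```python
import pandas as pd

# Hypothetical dataframe with unsorted column names and short made-up row labels
frame = pd.DataFrame(
    {"wind": [3.1, 12.2], "rain": [1.1, 0.0]},
    index=["DCMX", "ZMIF"],
)

# Sort rows by index label in descending order
rows_desc = frame.sort_index(axis=0, ascending=False)

# Sort columns alphabetically; the row index stays where it is
cols_sorted = frame.sort_index(axis=1)

print(rows_desc.index.tolist())
print(cols_sorted.columns.tolist())
```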

## Categories

At times, there are going to be columns in a dataframe that only allow a few values. When these values are strings, it becomes all the more important to consider converting them into a category type. Behind the scenes, each distinct string is stored once and the rows hold small integer codes, which reduces memory use and eliminates the redundancy of duplicate strings.

In our weather dataset, note the `severity` column. Let's say the only possible values for it are "CLEAR", "MINOR", "MAJOR", and "SEVERE." Rather than store these as strings, we can turn them explicitly into categories.

First, we can create a new `CategoricalDtype` and specify the expected "categories" in a list. If we want the categories to have a notion of ordering, we can specify `ordered=True` and those labels in that order will become the hierarchy. In ascending order, "CLEAR" is before "MINOR", then "MINOR" is before "MAJOR", and so on.


In [None]:
cat_type = pd.CategoricalDtype(categories=["CLEAR", "MINOR", "MAJOR", "SEVERE"], ordered=True)

We can then pass that instance of `CategoricalDtype` to the `astype()` function on a dataframe column, and replace that column with the categorized `severity`.

In [None]:
df["severity"] = df["severity"].astype(cat_type)

df

Sure enough, if you inspect the datatypes of the dataframe the `severity` column is now a `category` type. This will be much more efficient to work with.

In [None]:
df.dtypes
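
To get a feel for the efficiency gain, you can compare memory usage before and after the conversion. This is a rough sketch on a hypothetical column (`labels`) that repeats the four severity values many times; the exact byte counts will vary with your Pandas version, so none are shown here.

```python
import pandas as pd

# Hypothetical column: the four severity labels repeated many times
labels = pd.Series(["CLEAR", "MINOR", "MAJOR", "SEVERE"] * 10_000)

# deep=True counts the memory held by the string values themselves
bytes_as_strings = labels.memory_usage(deep=True)
bytes_as_category = labels.astype("category").memory_usage(deep=True)

print(bytes_as_strings, bytes_as_category)
```

On a six-row dataframe like ours the difference is negligible, but it grows with the number of repeated strings.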

Note that if you apply a categorization on a column that has values not mapping to any category, then those will become `NaN` values. 
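
Here is a small sketch of that behavior. The label `"EXTREME"` is a hypothetical value chosen because it is not one of the declared categories.

```python
import pandas as pd

severity_type = pd.CategoricalDtype(
    categories=["CLEAR", "MINOR", "MAJOR", "SEVERE"], ordered=True
)

# "EXTREME" is a hypothetical label that is not among the declared categories
raw = pd.Series(["CLEAR", "EXTREME", "MAJOR"])
converted = raw.astype(severity_type)

# The unmapped label shows up as a missing value
print(converted.isna().tolist())
```

Checking `isna()` right after a conversion is one way to spot labels you did not anticipate.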

Finally, note that sorting on the `severity` column no longer sorts alphabetically but instead follows the order you defined on the `CategoricalDtype`. You can see this in `MINOR` coming before `MAJOR`.

In [None]:
df.sort_values(by=["severity"])
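
An ordered category helps with more than sorting. As a sketch, the hypothetical `levels` series below uses the same category order to filter with a comparison and to find the lowest and highest values.

```python
import pandas as pd

severity_type = pd.CategoricalDtype(
    categories=["CLEAR", "MINOR", "MAJOR", "SEVERE"], ordered=True
)

# Hypothetical series holding one of each severity label
levels = pd.Series(["CLEAR", "SEVERE", "MINOR", "MAJOR"]).astype(severity_type)

# Comparing against a category label uses the declared order, not the alphabet
at_least_major = levels[levels >= "MAJOR"]

# min() and max() also respect the declared order
lowest, highest = levels.min(), levels.max()

print(at_least_major.tolist(), lowest, highest)
```

Comparisons like this only work when the category was created with `ordered=True`.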

## Using apply() and applymap()

Let's say you want to categorize wind speeds so you create this Python function. 

In [None]:
def map_wind_speed(x): 
    if x >= 60: 
        return 'DANGEROUS'
    elif x >= 30: 
        return 'HIGH'
    elif x >= 15:
        return 'MODERATE'
    else:
        return 'LOW'
    

How can you apply this to the `wind_speed_mph` column and create a new column out of it? You can use the `apply()` function, which calls your function once for each value in the column.

In [None]:
df['wind_speed_mph'].apply(map_wind_speed)

You could then add the result to the dataframe as a new column named `wind_speed_cat`.

In [None]:
df["wind_speed_cat"] = df['wind_speed_mph'].apply(map_wind_speed)

df
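
For binning numbers into ranges specifically, `pd.cut()` is a possible alternative to writing a function for `apply()`. The sketch below is not the approach this notebook uses; it assumes the same thresholds as `map_wind_speed` and works on a standalone `speeds` series so that it does not depend on the dataframe above.

```python
import pandas as pd

# Standalone copy of the wind speeds from the weather dataset
speeds = pd.Series([3.1, 143.0, 12.2, 8.1, 5.0, 19.0])

# Bin edges mirror map_wind_speed: <15 LOW, <30 MODERATE, <60 HIGH, else DANGEROUS
# right=False makes each bin include its lower edge, matching the >= comparisons
binned = pd.cut(
    speeds,
    bins=[float("-inf"), 15, 30, 60, float("inf")],
    labels=["LOW", "MODERATE", "HIGH", "DANGEROUS"],
    right=False,
)

print(binned.tolist())
```

A side effect of this approach is that the result is already an ordered category, with the labels ordered as they were passed in.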

## Exercise

Take the weather dataframe we just made above and make the `wind_speed_cat` column (which is currently stored as string objects) into a category type. Set it so the ascending order goes `LOW`, `MODERATE`, `HIGH`, then `DANGEROUS`. Then sort on that column in descending order.

In [None]:
wind_cat_type = ?

df["wind_speed_cat"] = df['wind_speed_cat'].astype(?)

df.sort_values(by=?,ascending=?, inplace=True)
df

### SCROLL DOWN FOR ANSWER
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
v 

In [None]:
wind_cat_type = pd.CategoricalDtype(categories=["LOW", "MODERATE", "HIGH", "DANGEROUS"], ordered=True)

df["wind_speed_cat"] = df['wind_speed_cat'].astype(wind_cat_type)

df.sort_values(by=["wind_speed_cat"],ascending=[False], inplace=True)
df