# Sorting, Casting, and Categories

In this notebook, we will cover casting, sorting operations, and categorical data. We will also learn how to use `apply()` to run custom operations on Pandas data.

Let's import `pandas` and then declare this tiny weather dataframe to practice with.

In [1]:
import pandas as pd

df = pd.DataFrame({
    "record_id" : ['DCMXP87EDE', 'ZMIFM3HX9G', 'HIVVXBAPS2', 'U1AA66UDES', 'B20KL5PW3L', 'FIZLY34KSQ'],
    "rain_inches" : [1.1, 0.0, 0.0, 2.4, 11.2, 3.2],
    "tornado" : [0,1,0,0,0,0],
    "lightning" :[0,1,1,1,0,0],
    "wind_speed_mph" : [3.1, 143.0, 12.2, 8.1, 5.0, 19.0],
    "severity" : ['CLEAR', 'SEVERE', 'MINOR', 'MINOR', 'MAJOR', 'CLEAR']
})

df

,record_id,rain_inches,tornado,lightning,wind_speed_mph,severity
0,DCMXP87EDE,1.1,0,0,3.1,CLEAR
1,ZMIFM3HX9G,0.0,1,1,143.0,SEVERE
2,HIVVXBAPS2,0.0,0,1,12.2,MINOR
3,U1AA66UDES,2.4,0,1,8.1,MINOR
4,B20KL5PW3L,11.2,0,0,5.0,MAJOR
5,FIZLY34KSQ,3.2,0,0,19.0,CLEAR


## Data Types and Casting

As we talk more about data cleaning, this is a good time to discuss data types and why to choose them carefully. Some basic pandas datatypes are declared below with examples. These datatypes come from NumPy, which makes them the most common ones you will encounter.


In [4]:
df_types = pd.DataFrame({'float': [1.2],
              'int': [3],
              'datetime': [pd.Timestamp('20230130')],
              'bool' : [True],
              'time_delta' : [pd.Timestamp('20230130') - pd.Timestamp('20230127')],
              'string': ['hello']
            })
df_types

,float,int,datetime,bool,time_delta,string
0,1.2,3,2023-01-30,True,3 days,hello


You can view the datatypes for a given dataframe using the `dtypes` property. 

In [8]:
df_types.dtypes

float                 float64
int                     int64
datetime       datetime64[ns]
bool                     bool
time_delta    timedelta64[ns]
string                 object
dtype: object

Now let's turn our attention back to the small weather dataset. 

In [11]:
df

,record_id,rain_inches,tornado,lightning,wind_speed_mph,severity
0,DCMXP87EDE,1.1,0,0,3.1,CLEAR
1,ZMIFM3HX9G,0.0,1,1,143.0,SEVERE
2,HIVVXBAPS2,0.0,0,1,12.2,MINOR
3,U1AA66UDES,2.4,0,1,8.1,MINOR
4,B20KL5PW3L,11.2,0,0,5.0,MAJOR
5,FIZLY34KSQ,3.2,0,0,19.0,CLEAR


Let's observe the datatypes.

In [14]:
df.dtypes

record_id          object
rain_inches       float64
tornado             int64
lightning           int64
wind_speed_mph    float64
severity           object
dtype: object

One of the most basic operations in data cleaning is casting. You can use the `astype()` method to coerce a given column to another data type. For example, the `tornado` and `lightning` columns contain only 1's and 0's, indicating they are intended as boolean values (True=1, False=0). We can cast them to booleans here.

In [17]:
df['tornado'] = df['tornado'].astype('bool')
df['lightning'] = df['lightning'].astype('bool')

df

,record_id,rain_inches,tornado,lightning,wind_speed_mph,severity
0,DCMXP87EDE,1.1,False,False,3.1,CLEAR
1,ZMIFM3HX9G,0.0,True,True,143.0,SEVERE
2,HIVVXBAPS2,0.0,False,True,12.2,MINOR
3,U1AA66UDES,2.4,False,True,8.1,MINOR
4,B20KL5PW3L,11.2,False,False,5.0,MAJOR
5,FIZLY34KSQ,3.2,False,False,19.0,CLEAR


And sure enough, the datatypes have changed to `bool` for those two columns.

In [20]:
df.dtypes

record_id          object
rain_inches       float64
tornado              bool
lightning            bool
wind_speed_mph    float64
severity           object
dtype: object
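
As a small variation on the cast above, `astype()` also accepts a dictionary that maps column names to target types, so several columns can be converted in one call. The sketch below uses a stand-in `flags` dataframe (a hypothetical name, not part of the notebook) and first checks that the columns really contain only 0's and 1's, since any other nonzero value would silently become `True`.

```python
import pandas as pd

# Hypothetical stand-in for the two flag columns of the weather dataframe
flags = pd.DataFrame({
    "tornado": [0, 1, 0],
    "lightning": [0, 1, 1],
})

# Confirm every value is 0 or 1 before casting; any other nonzero
# integer would also be converted to True
only_binary = bool(flags.isin([0, 1]).all().all())

# Cast both columns in a single call by mapping column name -> dtype
if only_binary:
    flags = flags.astype({"tornado": "bool", "lightning": "bool"})

print(flags.dtypes)
```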

## Sorting Values

In Pandas, you can sort a dataframe by the values in one or more columns using the `sort_values()` method (or by the values in a row, by setting its `axis` parameter). Below, we sort ascending first by the `lightning` field and then by the `rain_inches` field.

In [24]:
df.sort_values(by=["lightning","rain_inches"])

,record_id,rain_inches,tornado,lightning,wind_speed_mph,severity
0,DCMXP87EDE,1.1,False,False,3.1,CLEAR
5,FIZLY34KSQ,3.2,False,False,19.0,CLEAR
4,B20KL5PW3L,11.2,False,False,5.0,MAJOR
1,ZMIFM3HX9G,0.0,True,True,143.0,SEVERE
2,HIVVXBAPS2,0.0,False,True,12.2,MINOR
3,U1AA66UDES,2.4,False,True,8.1,MINOR
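
The `axis` parameter of `sort_values()` is easiest to see on a dataframe whose columns all hold the same kind of value. Here is a minimal sketch, assuming a made-up `rain` dataframe with one column per region (the names and numbers are illustrative only). With `axis=1`, the `by` argument names a row label, and the columns are reordered by the values in that row.

```python
import pandas as pd

# Hypothetical rainfall readings: one row per day, one column per region
rain = pd.DataFrame(
    {"north": [3.0, 0.5], "south": [1.0, 2.0], "east": [2.0, 4.0]},
    index=["mon", "tue"],
)

# Reorder the columns by the values found in the "mon" row (ascending)
by_monday = rain.sort_values(by="mon", axis=1)

print(by_monday)
```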


If you want different sort behaviors for each of the columns, where some are ascending while others are descending, pass a boolean list to the `ascending` parameter. Below we set `lightning` to be descending so `True` records rise to the top, while `rain_inches` is ascending.

In [27]:
df.sort_values(by=["lightning","rain_inches"],ascending=[False,True])

,record_id,rain_inches,tornado,lightning,wind_speed_mph,severity
1,ZMIFM3HX9G,0.0,True,True,143.0,SEVERE
2,HIVVXBAPS2,0.0,False,True,12.2,MINOR
3,U1AA66UDES,2.4,False,True,8.1,MINOR
0,DCMXP87EDE,1.1,False,False,3.1,CLEAR
5,FIZLY34KSQ,3.2,False,False,19.0,CLEAR
4,B20KL5PW3L,11.2,False,False,5.0,MAJOR


> When using the sort methods, remember to add the `inplace=True` parameter if you want to replace the existing dataframe with the sorted one. 
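
To make the note above concrete, here is a minimal sketch using a three-row stand-in dataframe (the `record_id` values are placeholders). By default `sort_values()` returns a new sorted dataframe and leaves the original untouched. With `inplace=True` it modifies the original and returns `None`.

```python
import pandas as pd

# Small stand-in dataframe with placeholder record IDs
df = pd.DataFrame({
    "record_id": ["A", "B", "C"],
    "rain_inches": [2.4, 0.0, 1.1],
})

# Default behavior: a new sorted dataframe is returned
sorted_df = df.sort_values(by="rain_inches")
order_before = df["record_id"].tolist()   # original row order is unchanged

# inplace=True: the original dataframe itself is reordered
result = df.sort_values(by="rain_inches", inplace=True)
order_after = df["record_id"].tolist()

print(order_before, order_after, result)
```

Reassigning the result (`df = df.sort_values(...)`) has the same effect as `inplace=True`, and some people find it easier to read.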

## Sorting Index

Let's demonstrate how you can sort on an index. First, set the index to use the `record_id` for the rows.

In [32]:
df.set_index('record_id', inplace=True)

df

record_id,rain_inches,tornado,lightning,wind_speed_mph,severity
DCMXP87EDE,1.1,False,False,3.1,CLEAR
ZMIFM3HX9G,0.0,True,True,143.0,SEVERE
HIVVXBAPS2,0.0,False,True,12.2,MINOR
U1AA66UDES,2.4,False,True,8.1,MINOR
B20KL5PW3L,11.2,False,False,5.0,MAJOR
FIZLY34KSQ,3.2,False,False,19.0,CLEAR


Now when we sort on the rows (using `axis=0`), notice that we sort alphabetically on `record_id`, which is now the index.

In [35]:
df.sort_index(axis=0)

record_id,rain_inches,tornado,lightning,wind_speed_mph,severity
B20KL5PW3L,11.2,False,False,5.0,MAJOR
DCMXP87EDE,1.1,False,False,3.1,CLEAR
FIZLY34KSQ,3.2,False,False,19.0,CLEAR
HIVVXBAPS2,0.0,False,True,12.2,MINOR
U1AA66UDES,2.4,False,True,8.1,MINOR
ZMIFM3HX9G,0.0,True,True,143.0,SEVERE


This may not seem super interesting as we could have also sorted `record_id` as a column. But now consider that if we set `axis=1` in `sort_index()`, we can now sort the columns!

In [38]:
df.sort_index(axis=1)

record_id,lightning,rain_inches,severity,tornado,wind_speed_mph
DCMXP87EDE,False,1.1,CLEAR,False,3.1
ZMIFM3HX9G,True,0.0,SEVERE,True,143.0
HIVVXBAPS2,True,0.0,MINOR,False,12.2
U1AA66UDES,True,2.4,MINOR,False,8.1
B20KL5PW3L,False,11.2,MAJOR,False,5.0
FIZLY34KSQ,False,3.2,CLEAR,False,19.0


This is useful for sorting the columns by name. You can also sort only certain columns by extracting a partial dataframe, sorting it, and then replacing those columns. Note that since `record_id` was turned into the index for the rows, it is not sorted with the rest of the columns and remains on the left of the dataframe.
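
One possible way to sort only some of the columns is sketched below. It assumes we want `severity` to stay first while the remaining columns are ordered alphabetically. The two-row dataframe and the `fixed` list are illustrative stand-ins, not part of the notebook.

```python
import pandas as pd

# Two-row stand-in for the weather dataframe, indexed by record_id
df = pd.DataFrame(
    {
        "wind_speed_mph": [3.1, 143.0],
        "rain_inches": [1.1, 0.0],
        "severity": ["CLEAR", "SEVERE"],
        "lightning": [False, True],
    },
    index=pd.Index(["DCMXP87EDE", "ZMIFM3HX9G"], name="record_id"),
)

# Columns we want to keep at the front, in this exact order
fixed = ["severity"]

# Extract the remaining columns and sort them by column name
rest = df.drop(columns=fixed).sort_index(axis=1)

# Stitch the fixed columns and the sorted columns back together
reordered = pd.concat([df[fixed], rest], axis=1)

print(reordered.columns.tolist())
```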

> When using the sort methods, remember to add the `inplace=True` parameter if you want to replace the existing dataframe with the sorted one. 

## Categories

At times, there are going to be columns in a dataframe that only allow a few values. When these values are strings, it becomes all the more important to consider converting them into a category type. Behind the scenes, each distinct string is stored once and the rows hold small integer codes, which reduces memory use and can speed up operations.

In our weather dataset, note the `severity` column. Let's say the only possible values for it are "CLEAR", "MINOR", "MAJOR", and "SEVERE." Rather than store these as strings, we can turn them explicitly into categories.

First, we can create a new `CategoricalDtype` and specify the expected `categories` in a list. If we want the categories to have a notion of ordering, we can specify `ordered=True`, and the labels in that order become the hierarchy. In ascending order, "CLEAR" comes before "MINOR", "MINOR" comes before "MAJOR", and so on.


In [43]:
cat_type = pd.CategoricalDtype(categories=["CLEAR", "MINOR", "MAJOR", "SEVERE"], ordered=True)

We can then pass that instance of `CategoricalDtype` to the `astype()` method on the column, and replace that column with the categorized `severity`.

In [46]:
df["severity"] = df["severity"].astype(cat_type)

df

record_id,rain_inches,tornado,lightning,wind_speed_mph,severity
DCMXP87EDE,1.1,False,False,3.1,CLEAR
ZMIFM3HX9G,0.0,True,True,143.0,SEVERE
HIVVXBAPS2,0.0,False,True,12.2,MINOR
U1AA66UDES,2.4,False,True,8.1,MINOR
B20KL5PW3L,11.2,False,False,5.0,MAJOR
FIZLY34KSQ,3.2,False,False,19.0,CLEAR


Sure enough, if you inspect the datatypes of the dataframe, the `severity` column is now a `category` type. This will be much more efficient to work with.

In [49]:
df.dtypes

rain_inches        float64
tornado               bool
lightning             bool
wind_speed_mph     float64
severity          category
dtype: object
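
The efficiency gain is hard to see on six rows, so here is a rough sketch on a larger made-up series that repeats the four severity labels. It compares `memory_usage(deep=True)` before and after converting to a category type. Exact byte counts depend on your pandas version and platform, so none are shown here.

```python
import pandas as pd

# Hypothetical series: 100,000 labels drawn from only four distinct strings
labels = pd.Series(["CLEAR", "MINOR", "MAJOR", "SEVERE"] * 25_000)

# Memory used when each row stores a string
as_text_bytes = labels.memory_usage(deep=True)

# Memory used when rows store small integer codes pointing at four categories
as_category_bytes = labels.astype("category").memory_usage(deep=True)

print(f"strings:  {as_text_bytes:,} bytes")
print(f"category: {as_category_bytes:,} bytes")
```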

Note that if you apply a categorization to a column containing values that do not map to any category, those values will become `NaN` (missing) values.
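
Here is a small sketch of that behavior. The label `"EXTREME"` is a made-up value that is deliberately absent from the categories, and the `raw` series is only for illustration. Comparing the converted column against the original is one way to find values that were dropped.

```python
import pandas as pd

cat_type = pd.CategoricalDtype(
    categories=["CLEAR", "MINOR", "MAJOR", "SEVERE"], ordered=True
)

# "EXTREME" is a hypothetical label that is not one of the categories
raw = pd.Series(["CLEAR", "EXTREME", "MINOR"])

# Values outside the categories come back as missing values
converted = raw.astype(cat_type)

# Look up the original values that failed to map to a category
unmapped = raw[converted.isna()]

print(converted)
print(unmapped.tolist())
```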

Finally, note that when you sort on the `severity` column, it no longer sorts alphabetically but instead follows the order you defined on the `CategoricalDtype`. You can see this in `MINOR` coming before `MAJOR`.

In [52]:
df.sort_values(by=["severity"])

record_id,rain_inches,tornado,lightning,wind_speed_mph,severity
DCMXP87EDE,1.1,False,False,3.1,CLEAR
FIZLY34KSQ,3.2,False,False,19.0,CLEAR
HIVVXBAPS2,0.0,False,True,12.2,MINOR
U1AA66UDES,2.4,False,True,8.1,MINOR
B20KL5PW3L,11.2,False,False,5.0,MAJOR
ZMIFM3HX9G,0.0,True,True,143.0,SEVERE
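
Sorting is not the only thing that uses the category order. As a brief sketch, an ordered categorical column also supports comparisons against one of its labels, and `min()` and `max()` follow the defined hierarchy. The standalone `severity` series below simply repeats the values from our weather data.

```python
import pandas as pd

cat_type = pd.CategoricalDtype(
    categories=["CLEAR", "MINOR", "MAJOR", "SEVERE"], ordered=True
)

# Same severity values as the weather dataframe, as a standalone series
severity = pd.Series(
    ["CLEAR", "SEVERE", "MINOR", "MINOR", "MAJOR", "CLEAR"]
).astype(cat_type)

# Comparison uses the category order, not alphabetical order
at_least_major = severity[severity >= "MAJOR"]

# The highest label according to the defined hierarchy
worst = severity.max()

print(at_least_major.tolist(), worst)
```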


## Using apply()

Let's say you want to categorize wind speeds, so you create this Python function.

In [56]:
def map_wind_speed(x): 
    if x >= 60: 
        return 'DANGEROUS'
    elif x >= 30: 
        return 'HIGH'
    elif x >= 15:
        return 'MODERATE'
    else:
        return 'LOW'
    

How can you apply this to the `wind_speed_mph` column and create a new column out of it? You can use the `apply()` method.

In [59]:
df['wind_speed_mph'].apply(map_wind_speed)

record_id
DCMXP87EDE          LOW
ZMIFM3HX9G    DANGEROUS
HIVVXBAPS2          LOW
U1AA66UDES          LOW
B20KL5PW3L          LOW
FIZLY34KSQ     MODERATE
Name: wind_speed_mph, dtype: object

You could then add this as a new column named `wind_speed_cat`.

In [62]:
df["wind_speed_cat"] = df['wind_speed_mph'].apply(map_wind_speed)

df

record_id,rain_inches,tornado,lightning,wind_speed_mph,severity,wind_speed_cat
DCMXP87EDE,1.1,False,False,3.1,CLEAR,LOW
ZMIFM3HX9G,0.0,True,True,143.0,SEVERE,DANGEROUS
HIVVXBAPS2,0.0,False,True,12.2,MINOR,LOW
U1AA66UDES,2.4,False,True,8.1,MINOR,LOW
B20KL5PW3L,11.2,False,False,5.0,MAJOR,LOW
FIZLY34KSQ,3.2,False,False,19.0,CLEAR,MODERATE


In short, `apply()` passes each value in a column through a function and returns a new series containing the outputs.
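
For simple range-based binning like this, pandas also offers `pd.cut`, which is a different technique from `apply()` and is shown here only as an alternative sketch. The bin edges mirror the thresholds in `map_wind_speed`, and `right=False` makes each bin include its lower edge, matching the `>=` checks. The standalone `wind` series repeats the wind speeds from our dataframe.

```python
import pandas as pd

# Same wind speeds as the weather dataframe, as a standalone series
wind = pd.Series([3.1, 143.0, 12.2, 8.1, 5.0, 19.0], name="wind_speed_mph")

# Bin edges mirroring map_wind_speed: [-inf, 15), [15, 30), [30, 60), [60, inf)
bins = [float("-inf"), 15, 30, 60, float("inf")]
labels = ["LOW", "MODERATE", "HIGH", "DANGEROUS"]

# right=False closes each interval on the left, like the >= comparisons
wind_cat = pd.cut(wind, bins=bins, labels=labels, right=False)

print(wind_cat)
```

One design difference worth noting: `pd.cut` returns an ordered category column directly, whereas `apply()` with a custom function returns plain strings.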

## Exercise

Take the weather dataframe we just made above and convert the `wind_speed_cat` column (which is currently stored as string objects) into a category type. Set the ascending order to `LOW`, `MODERATE`, `HIGH`, then `DANGEROUS`. Then sort on that column in descending order.

In [21]:
wind_cat_type = pd.CategoricalDtype(categories=["LOW", "MODERATE", "HIGH", "DANGEROUS"], ordered=True)

df["wind_speed_cat"] = df['wind_speed_cat'].astype(wind_cat_type)

df.sort_values(by=["wind_speed_cat"],ascending=[False], inplace=True)
df

record_id,rain_inches,tornado,lightning,wind_speed_mph,severity,wind_speed_cat
ZMIFM3HX9G,0.0,True,True,143.0,SEVERE,DANGEROUS
FIZLY34KSQ,3.2,False,False,19.0,CLEAR,MODERATE
DCMXP87EDE,1.1,False,False,3.1,CLEAR,LOW
HIVVXBAPS2,0.0,False,True,12.2,MINOR,LOW
U1AA66UDES,2.4,False,True,8.1,MINOR,LOW
B20KL5PW3L,11.2,False,False,5.0,MAJOR,LOW
