## More Data Processing with Pandas

### Scales

Different variables, or different groups within the data, can be measured on different scales. As we move through data cleaning and into statistical analysis and machine learning, it's important to clarify our knowledge and terminology.

As a data scientist, there are at least four different scales worth knowing about:

* **Ratio Scale**: 
    - Units are equally spaced
    - Mathematical operations of $+-/*$ are all valid
    - e.g. height and weight

* **Interval Scale**:
    - Units are equally spaced, but there is no true zero (zero does not mean an absence of value)
    - Multiplication and division ($/*$) are not valid
    - e.g. temperature in Celsius or Fahrenheit (0 degrees is a meaningful value itself), the direction of a compass

* **Ordinal Scale**:
    - The order of the units is important, but the units are not evenly spaced
    - e.g. letter grades such as A+, A

* **Nominal Scale**:
    - Categories of data, but the categories have no order with respect to one another
    - e.g. teams of a sport
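
The difference between the interval and ratio scales can be illustrated with a small sketch. The temperatures below are made-up values chosen for illustration. Differences in Celsius are meaningful, but the ratio of two Celsius values is not, because 0 °C is not a true zero. Converting to Kelvin (a ratio scale) gives a ratio that does have a physical meaning.

```python
# Two hypothetical temperatures in Celsius (an interval scale)
c_cold, c_warm = 10.0, 20.0

# Addition and subtraction are valid: the warm value is 10 degrees higher
diff = c_warm - c_cold

# Division runs without error, but "twice as hot" is not a meaningful claim
naive_ratio = c_warm / c_cold

# Kelvin has a true zero, so the ratio of the same two temperatures is meaningful
k_cold, k_warm = c_cold + 273.15, c_warm + 273.15
kelvin_ratio = k_warm / k_cold

print(diff, naive_ratio, round(kelvin_ratio, 3))
```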

In Pandas, understanding and setting the scale of the variables in a column is very helpful when applying operations to the DataFrame, such as comparisons or boolean masking.

Let's see an example: a small DataFrame of letter grades in descending order, indexed by a human judgement of how good each student was.

Let's compare how Pandas performs when it knows the variable scale (`dtype`) and when it does not.

**Converting to categorical and ordered scales**:

In [1]:
import pandas as pd

In [2]:
# Create the dataframe
df = pd.DataFrame(['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D'],
                  index=['excellent', 'excellent', 'excellent', 'good', 'good', 'good',
                         'ok', 'ok', 'ok', 'poor', 'poor'],
                  columns=['Grades'])
df

Unnamed: 0,Grades
excellent,A+
excellent,A
excellent,A-
good,B+
good,B
good,B-
ok,C+
ok,C
ok,C-
poor,D+
poor,D


If we check the data type of this column, we see that it's just an object:

In [3]:
df.dtypes

Grades    object
dtype: object

We can change the type to *category* using the `astype()` method. Note that it returns a new Series and leaves `df` unchanged:

In [4]:
df['Grades'].astype('category')

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
good         B-
ok           C+
ok            C
ok           C-
poor         D+
poor          D
Name: Grades, dtype: category
Categories (11, object): ['A', 'A+', 'A-', 'B', ..., 'C+', 'C-', 'D', 'D+']

Now, Pandas is aware of those 11 categories, but it does not understand yet that they are ordered categories (e.g. `'A+' > 'A' > 'A-' > 'B+'`).

To do this, we first need to create a new ordered categorical data type using `pd.CategoricalDtype()` and then pass it to the `astype()` method:

In [5]:
my_categories = pd.CategoricalDtype(categories=['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+'],
                                    ordered=True)

grades = df['Grades'].astype(my_categories)
grades.head() # It now understands the order of the categories

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
Name: Grades, dtype: category
Categories (11, object): ['D' < 'D+' < 'C-' < 'C' ... 'B+' < 'A-' < 'A' < 'A+']
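
As a quick check of what each conversion stores, the `.cat` accessor can be inspected. This is a minimal sketch on a shortened, four-grade Series; the variable names `raw`, `plain`, `ranked_dtype` and `ranked` are only for illustration.

```python
import pandas as pd

raw = pd.Series(['A+', 'A', 'A-', 'B+'])

# Plain conversion: categories are inferred and sorted as strings, with no order
plain = raw.astype('category')
print(list(plain.cat.categories))
print(plain.cat.ordered)

# Explicit ordered dtype: categories are listed from lowest to highest
ranked_dtype = pd.CategoricalDtype(categories=['B+', 'A-', 'A', 'A+'], ordered=True)
ranked = raw.astype(ranked_dtype)
print(ranked.cat.ordered)

# The integer codes are positions in the category list, so they follow the ranking
print(ranked.cat.codes.tolist())
```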

Let's compare both versions of the grades to `'C'` to see the difference:

In [6]:
# Original object (string) column: values are compared lexicographically
df[df['Grades'] > 'C']

Unnamed: 0,Grades
ok,C+
ok,C-
poor,D+
poor,D


In [7]:
# Ordered categorical Series: values are compared by category order
grades[grades > 'C']

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
good         B-
ok           C+
Name: Grades, dtype: category
Categories (11, object): ['D' < 'D+' < 'C-' < 'C' ... 'B+' < 'A-' < 'A' < 'A+']
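
Two related behaviours are worth knowing, sketched here on a small made-up list of grades. An *unordered* categorical does not fall back to string comparison: `>` raises a `TypeError`. An ordered categorical, by contrast, also supports `max()` and `sort_values()` according to the category order.

```python
import pandas as pd

letters = ['C', 'A', 'B+', 'D']

# Unordered categorical: only equality comparisons are allowed
unordered = pd.Series(letters, dtype='category')
try:
    unordered > 'C'
    raised = False
except TypeError:
    raised = True
print(raised)

# Ordered categorical: comparisons, max and sorting follow the declared order
order = pd.CategoricalDtype(categories=['D', 'C', 'B+', 'A'], ordered=True)
ordered = pd.Series(letters).astype(order)
best = ordered.max()
ranked = ordered.sort_values().tolist()
print(best, ranked)
```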

**Converting to categorical from interval/ratio scales**:

Sometimes it is useful to convert an interval or ratio scale into a categorical scale. This may seem counter-intuitive, but it is commonly done when visualizing frequencies of categories, for example in a histogram. In addition, if you are using a machine learning classification model, the value you predict needs to be categorical.

The Pandas function `pd.cut()` is usually used in this case. It takes an array-like structure and either a number of bins or a list of bin edges.
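
Before applying this to real data, here is a minimal sketch of `pd.cut()` on three made-up values. The `labels` argument is optional; the names `'low'`, `'mid'` and `'high'` are chosen only for illustration. With an integer `bins`, the range of the data is split into equal-width intervals.

```python
import pandas as pd

values = pd.Series([1, 5, 10], index=['a', 'b', 'c'])

# Three equal-width bins over the range 1..10, named instead of shown as intervals
binned = pd.cut(values, bins=3, labels=['low', 'mid', 'high'])
print(binned)

# The result is an ordered categorical, so comparisons between bins are valid
print(binned.cat.ordered)
```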

Let's recall the census data:

In [8]:
import numpy as np

In [9]:
# Read the US census data 
df = pd.read_csv('../resources/week-3/datasets/census.csv')

# And reduce this to county data
df = df[df['SUMLEV']==50]

In [10]:
# Now, let's group by State and apply the aggregate function as we did before
df = df.set_index('STNAME').groupby(level=0)['CENSUS2010POP'].agg(np.average)
df.head()

STNAME
Alabama        71339.343284
Alaska         24490.724138
Arizona       426134.466667
Arkansas       38878.906667
California    642309.586207
Name: CENSUS2010POP, dtype: float64

If we want to sort these state averages into "bins", we can use `cut()`:

In [11]:
pd.cut(df, 10) # 10 bins

STNAME
Alabama                   (11706.087, 75333.413]
Alaska                    (11706.087, 75333.413]
Arizona                 (390320.176, 453317.529]
Arkansas                  (11706.087, 75333.413]
California              (579312.234, 642309.586]
Colorado                 (75333.413, 138330.766]
Connecticut             (390320.176, 453317.529]
Delaware                (264325.471, 327322.823]
District of Columbia    (579312.234, 642309.586]
Florida                 (264325.471, 327322.823]
Georgia                   (11706.087, 75333.413]
Hawaii                  (264325.471, 327322.823]
Idaho                     (11706.087, 75333.413]
Illinois                 (75333.413, 138330.766]
Indiana                   (11706.087, 75333.413]
Iowa                      (11706.087, 75333.413]
Kansas                    (11706.087, 75333.413]
Kentucky                  (11706.087, 75333.413]
Louisiana                 (11706.087, 75333.413]
Maine                    (75333.413, 138330.766]
Maryland     