# Scales

In [None]:
# Let's bring in pandas as normal
import pandas as pd

# Here’s an example. Let's create a dataframe of letter grades in descending order. We can also set an index
# value; here we'll just make it some human judgement of how good a student was, like "excellent" or "good"

df = pd.DataFrame(['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D'],
                  index=['excellent', 'excellent', 'excellent', 'good', 'good', 'good',
                         'ok', 'ok', 'ok', 'poor', 'poor'],
                  columns=["Grades"])
df

In [None]:
# Now, if we check the datatype of this column, we see that it's just an object, since we set string values
df.dtypes

In [None]:
# We can, however, tell pandas that we want to change the type to category, using the astype() function.
# Note that astype() returns a new Series; df itself is left unchanged unless we assign the result.
df["Grades"].astype("category")

In [None]:
# Comparing the raw strings does work, but the comparison is lexicographic (character by character),
# so the order it implies does not match the grade scale.
df[df["Grades"] > "B-"]
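
To see why the string comparison misleads, here is a minimal sketch using a standalone Series of the same grades (the name `grades` is only for illustration). It also shows what an unordered categorical does with the same comparison: as far as I know, pandas raises a `TypeError` rather than guessing an order.

```python
import pandas as pd

grades = pd.Series(["A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D"])

# Plain strings compare character by character, so every C and D grade
# counts as "greater than" B- even though those grades are lower.
lexicographic = grades[grades > "B-"].tolist()

# An unordered categorical has no notion of greater-than at all,
# so the comparison is rejected instead of silently using string order.
unordered = grades.astype("category")
try:
    unordered > "B-"
    comparison_allowed = True
except TypeError:
    comparison_allowed = False

print(lexicographic)
print(comparison_allowed)
```

Neither behavior gives the grade ordering we want, which is why the categories need to be declared with an explicit order.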

In [None]:
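# Now let's define the correct order. We list the categories from lowest to highest and set ordered=True
# so that pandas knows the categories form an ordinal scale.

my_categories = pd.CategoricalDtype(categories=['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+'],
                                    ordered=True)
# Then we can just pass this to the astype() function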


In [None]:
my_categories

In [None]:
# df_corr is a Series whose values follow the ordering defined in my_categories
df_corr = df["Grades"].astype(my_categories)


In [None]:
# On the grade scale a C+ is greater than a C, but a C- and a D certainly are not. If we broadcast the
# comparison over the Series whose type is set to an ordered categorical, we get only the grades above a C

df_corr[df_corr > "C"]

In [None]:
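# For contrast, the same comparison on the original string column is lexicographic,
# so it returns C+, C-, D+ and D
df[df["Grades"] > "C"]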

In [None]:
# We see that with the ordered categorical the operator works as we would expect. We can then use a certain
# set of mathematical operators, like minimum, maximum, etc., on the ordinal data.
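
As a small illustration of those operators, the sketch below builds its own ordered grade scale (the names `grade_scale` and `grades` are made up for this example) and then takes the minimum, the maximum, and a sort, all of which follow the declared category order rather than string order.

```python
import pandas as pd

# Categories listed from lowest to highest, with ordered=True
grade_scale = pd.CategoricalDtype(
    categories=['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+'],
    ordered=True)

grades = pd.Series(["B+", "C-", "A", "D+", "B"]).astype(grade_scale)

lowest = grades.min()                    # lowest grade on the scale
highest = grades.max()                   # highest grade on the scale
ranked = grades.sort_values().tolist()   # sorted by category order

print(lowest, highest)
print(ranked)
```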

In [None]:
# Sometimes it is useful to represent categorical values as each being a column with a true or a false as to
# whether the category applies. This is especially common in feature extraction, which is a topic in the data
# mining course. Variables with a boolean value are typically called dummy variables, and pandas has a
# built-in function called get_dummies which will convert the values of a single column into multiple columns
# of zeros and ones indicating the presence of the dummy variable. I rarely use it, but when I do it's very
# handy.
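
A minimal sketch of `get_dummies` on a toy column follows. The data is invented for illustration, and `dtype=int` is passed on the assumption that you want zeros and ones, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Toy data: one categorical column with three distinct values
toy = pd.DataFrame({"Grades": ["A", "B", "A", "C"]})

# One new column per distinct value; a 1 marks the rows where that value applies
dummies = pd.get_dummies(toy["Grades"], dtype=int)

print(dummies)
```

Each row has exactly one 1, in the column matching its original value.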

In [None]:
# There’s one more common scale-based operation I’d like to talk about, and that’s converting a scale from
# something that is on the interval or ratio scale, like a numeric grade, into one which is categorical. Now,
# this might seem a bit counterintuitive to you, since you are losing information about the value. But it’s
# commonly done in a couple of places. For instance, if you are visualizing the frequencies of categories,
# this can be an extremely useful approach, and histograms are regularly used with converted interval or ratio
# data. In addition, if you’re using a machine learning classification approach on data, you need to be using
# categorical data, so reducing a continuous value to a handful of categories may be useful just to apply a
# given technique. Pandas has a function called cut which takes as an argument some array-like structure like
# a column of a dataframe or a series. It also takes a number of bins to be used, and all bins are kept at
# equal spacing.

# Let's use the mpg dataset for an example. Its cty column holds each car's city mileage. If we apply cut to
# this column with, say, two bins, we see each car labeled with the interval that its mileage falls into.
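
Before applying `cut` to a real dataset, here is a small self-contained sketch on made-up scores. It assumes three equal-width bins and also shows the optional `labels` argument, which replaces the interval notation with names of your choosing (the labels "low", "mid" and "high" are purely illustrative).

```python
import pandas as pd

scores = pd.Series([52, 67, 71, 88, 95, 100])

# Three equal-width bins spanning the observed range of the data
binned = pd.cut(scores, 3)
bin_codes = binned.cat.codes.tolist()            # which bin each score fell into
counts = binned.value_counts(sort=False).tolist()  # how many scores per bin

# The same bins, with readable labels instead of interval notation
labeled = pd.cut(scores, 3, labels=["low", "mid", "high"]).tolist()

print(binned)
print(counts)
print(labeled)
```

The result of `cut` is itself an ordered categorical, so the comparison and min/max operations shown earlier apply to the bins as well.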



In [32]:
# let's bring in numpy
import numpy as np

# Now we read in our dataset, using the first column as the index
df = pd.read_csv("datasets/mpg.csv", index_col=0)

# Split the city mileage column into 2 equal-width bins
pd.cut(df['cty'], 2)

1      (8.974, 22.0]
2      (8.974, 22.0]
3      (8.974, 22.0]
4      (8.974, 22.0]
5      (8.974, 22.0]
6      (8.974, 22.0]
7      (8.974, 22.0]
8      (8.974, 22.0]
9      (8.974, 22.0]
10     (8.974, 22.0]
11     (8.974, 22.0]
12     (8.974, 22.0]
13     (8.974, 22.0]
14     (8.974, 22.0]
15     (8.974, 22.0]
16     (8.974, 22.0]
17     (8.974, 22.0]
18     (8.974, 22.0]
19     (8.974, 22.0]
20     (8.974, 22.0]
21     (8.974, 22.0]
22     (8.974, 22.0]
23     (8.974, 22.0]
24     (8.974, 22.0]
25     (8.974, 22.0]
26     (8.974, 22.0]
27     (8.974, 22.0]
28     (8.974, 22.0]
29     (8.974, 22.0]
30     (8.974, 22.0]
           ...      
205    (8.974, 22.0]
206    (8.974, 22.0]
207    (8.974, 22.0]
208    (8.974, 22.0]
209    (8.974, 22.0]
210    (8.974, 22.0]
211    (8.974, 22.0]
212    (8.974, 22.0]
213     (22.0, 35.0]
214    (8.974, 22.0]
215    (8.974, 22.0]
216    (8.974, 22.0]
217    (8.974, 22.0]
218    (8.974, 22.0]
219    (8.974, 22.0]
220    (8.974, 22.0]
221    (8.974

In [33]:
df

,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact
6,audi,a4,2.8,1999,6,manual(m5),f,18,26,p,compact
7,audi,a4,3.1,2008,6,auto(av),f,18,27,p,compact
8,audi,a4 quattro,1.8,1999,4,manual(m5),4,18,26,p,compact
9,audi,a4 quattro,1.8,1999,4,auto(l5),4,16,25,p,compact
10,audi,a4 quattro,2.0,2008,4,manual(m6),4,20,28,p,compact


In [39]:
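# Summarize city mileage per manufacturer and model year, computing the mean, max and min for each
df.pivot_table(values='cty', index='manufacturer', columns='year', aggfunc=[np.mean, np.max, np.min]).head()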

,mean,mean,amax,amax,amin,amin
year,1999,2008,1999,2008,1999,2008
manufacturer,,,,,,
audi,17.111111,18.111111,21,21,15,15
chevrolet,15.142857,14.916667,19,22,11,11
dodge,13.375,12.952381,18,17,11,9
ford,13.933333,14.1,18,17,11,12
honda,24.8,24.0,28,26,23,21
