## Scales

In [1]:
# Let's bring in pandas as normal
import pandas as pd

# Here’s an example. Let's create a dataframe of letter grades in descending order. We can also set an index
# value, and here we'll just make it some human judgement of how good a student was, like "excellent" or "good".

df=pd.DataFrame(['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D'],
                index=['excellent', 'excellent', 'excellent', 'good', 'good', 'good', 
                       'ok', 'ok', 'ok', 'poor', 'poor'],
               columns=["Grades"])
df

Unnamed: 0,Grades
excellent,A+
excellent,A
excellent,A-
good,B+
good,B
good,B-
ok,C+
ok,C
ok,C-
poor,D+
poor,D


In [2]:
# Now if we check the datatype of this column, we find that it is just an object, since we set string values.

df.dtypes

Grades    object
dtype: object

In [3]:
# We can, however, tell pandas that we want to change the type to categorical using the astype() function.
df['Grades'].astype('category').head()

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
Name: Grades, dtype: category
Categories (11, object): [A, A+, A-, B, ..., C+, C-, D, D+]

In [4]:
# We see now that there are eleven categories, and pandas is aware of what those categories are.
# More interesting, though, is that our data isn't just categorical, but that it is ordered too. That is,
# an A- comes after a B+, and a B comes before a B+. We can tell pandas that the data is ordered by first
# creating a new categorical data type with the list of categories in order and the ordered=True flag.

my_categories=pd.CategoricalDtype(categories=['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+'], 
                           ordered=True)
# then we can just pass this to the astype() function
grades=df["Grades"].astype(my_categories)
grades.head()

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
Name: Grades, dtype: category
Categories (11, object): [D < D+ < C- < C ... B+ < A- < A < A+]

In [5]:
# Now we see that pandas is not only aware that there are 11 categories, but it is also aware of the order of
# those categories. So, what can you do with this? Well, because there is an ordering, this can help with
# comparisons and boolean masking. For instance, if we take our original column of grade strings and compare
# it to a “C”, we see that the lexicographical comparison returns results we were not intending.

df[df["Grades"]>"C"]

Unnamed: 0,Grades
ok,C+
ok,C-
poor,D+
poor,D


In [7]:
# So a C+ is greater than a C, but a C- and a D+ certainly are not. However, if we broadcast over the
# series that has the type set to an ordered categorical, we get the result we wanted.

grades[grades > 'C']

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
good         B-
ok           C+
Name: Grades, dtype: category
Categories (11, object): [D < D+ < C- < C ... B+ < A- < A < A+]

In [8]:
# We see that the operator works as we would expect. We can then use a certain set of mathematical operators,
# like minimum, maximum, etc., on the ordinal data.
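
As a minimal sketch of the point above, the following applies `min()`, `max()` and `sort_values()` to an ordered categorical series. The `sample` series and the name `grade_order` are made up for illustration; the category order matches the one used earlier.

```python
import pandas as pd

# Same ordering as before: worst grade first, best grade last
grade_order = pd.CategoricalDtype(
    categories=['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+'],
    ordered=True)

# Hypothetical handful of grades, stored with the ordered categorical type
sample = pd.Series(['B+', 'C-', 'A', 'D+', 'B'], dtype=grade_order)

# min() and max() follow the category order, not alphabetical order
lowest = sample.min()
highest = sample.max()

# Sorting also follows the category order
ranked = sample.sort_values()

print(lowest, highest)
print(ranked.tolist())
```

On an unordered categorical these reductions are not defined, which is why the `ordered=True` flag matters here.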

In [9]:
# Sometimes it is useful to represent categorical values as each being a column with a true or false as to
# whether the category applies. This is especially common in feature extraction, which is a topic in the data
# mining course. Variables with boolean values are typically called dummy variables, and pandas has a built-in
# function called get_dummies which will convert the values of a single column into multiple columns of zeros
# and ones indicating the presence of the dummy variable.
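
A small sketch of `get_dummies` on a single column follows. The `letters` series is invented for illustration, and `dtype=int` is passed as an assumption so that the result holds zeros and ones regardless of the pandas version (newer versions default to booleans).

```python
import pandas as pd

# Hypothetical column of letter grades
letters = pd.Series(['A', 'B', 'A', 'C'], name='Grades')

# One new column per distinct value; a 1 marks the rows where that value occurs
dummies = pd.get_dummies(letters, dtype=int)

print(dummies)
```

Each row has exactly one 1, in the column that matches its original value.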

In [10]:
# There’s one more common scale-based operation I’d like to talk about, and that’s on converting a scale from
# something that is on the interval or ratio scale, like a numeric grade, into one which is categorical. Now,
# this might seem a bit counterintuitive to you, since you are losing information about the value. But it’s
# commonly done in a couple of places. For instance, if you are visualizing the frequencies of categories,
# this can be an extremely useful approach, and histograms are regularly used with converted interval or ratio
# data. In addition, if you’re using a machine learning classification approach on data, you need to be using
# categorical data, so reducing dimensionality may be useful just to apply a given technique. Pandas has a
# function called cut which takes as an argument some array-like structure like a column of a dataframe or a
# series. It also takes a number of bins to be used, and all bins are kept at equal spacing.
 
# Let's go back to our census data for an example. We saw that we could group by state, then aggregate to get a
# list of the average county size by state. If we further apply cut to this with, say, ten bins, we can see
# the states listed as categoricals using the average county size.

# let's bring in numpy
import numpy as np

# Now we read in our dataset
df=pd.read_csv("census.csv")

# And we reduce this to county data
df=df[df['SUMLEV']==50]

# Then we group by state and take the average county population
df=df.set_index('STNAME').groupby(level=0)['CENSUS2010POP'].agg(np.average)

df.head()


STNAME
Alabama        71339.343284
Alaska         24490.724138
Arizona       426134.466667
Arkansas       38878.906667
California    642309.586207
Name: CENSUS2010POP, dtype: float64

In [11]:
# Now if we want to make bins of each of these, we can use cut()
pd.cut(df,10)

STNAME
Alabama                   (11706.087, 75333.413]
Alaska                    (11706.087, 75333.413]
Arizona                 (390320.176, 453317.529]
Arkansas                  (11706.087, 75333.413]
California              (579312.234, 642309.586]
Colorado                 (75333.413, 138330.766]
Connecticut             (390320.176, 453317.529]
Delaware                (264325.471, 327322.823]
District of Columbia    (579312.234, 642309.586]
Florida                 (264325.471, 327322.823]
Georgia                   (11706.087, 75333.413]
Hawaii                  (264325.471, 327322.823]
Idaho                     (11706.087, 75333.413]
Illinois                 (75333.413, 138330.766]
Indiana                   (11706.087, 75333.413]
Iowa                      (11706.087, 75333.413]
Kansas                    (11706.087, 75333.413]
Kentucky                  (11706.087, 75333.413]
Louisiana                 (11706.087, 75333.413]
Maine                    (75333.413, 138330.766]
Maryland     

In [12]:
# Here we see that states like Alabama and Alaska fall into the same category, while California and the
# District of Columbia fall into a very different category.

# Now, cutting is just one way to build categories from your data, and there are many other methods.
# For instance, cut gives you interval data, where the spacing between each category is equal sized. But
# sometimes you want to form categories based on frequency: you want the number of items in each bin
# to be the same, instead of the spacing between bins. It really depends on what the shape of your data is
# and what you are planning to do with it.
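
One way to get frequency-based bins in pandas is `pd.qcut`, which splits on quantiles. The notebook does not name a specific method, so treating `qcut` as the alternative meant here is an assumption. The `values` series is made up and deliberately skewed to make the contrast with `cut` visible.

```python
import pandas as pd

# Hypothetical skewed data: seven small values and one large outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# cut: four bins of equal width, so almost everything lands in the first bin
equal_width = pd.cut(values, 4)

# qcut: four bins chosen so each holds the same number of items
equal_count = pd.qcut(values, 4)

print(equal_width.value_counts().sort_index())
print(equal_count.value_counts().sort_index())
```

With skewed data like this, equal-width bins are dominated by the outlier, while quantile bins stay balanced at the cost of very uneven widths.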


In [13]:
# In simple terms, cut divides the data into the intervals that we want.
# In our example above, we divided the data into 10 equal-width categories.
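
Besides a number of bins, `pd.cut` also accepts explicit bin edges and labels, which is one way to turn a numeric grade into a letter category. The sketch below uses made-up scores, and the cut-off points are purely illustrative.

```python
import pandas as pd

# Hypothetical numeric grades
scores = pd.Series([55, 62, 71, 78, 85, 93])

# Passing an integer asks for that many equal-width bins
binned = pd.cut(scores, 4)

# Passing a list of edges gives the intervals (0, 60], (60, 70], (70, 80], (80, 100];
# the labels and cut-offs here are illustrative assumptions
labelled = pd.cut(scores, bins=[0, 60, 70, 80, 100], labels=['D', 'C', 'B', 'A'])

print(binned)
print(labelled.tolist())
```

The result is an ordered categorical, so comparisons such as `labelled > 'C'` behave like the ordered grades shown earlier.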