# STAT 441 - General Tutorial

This tutorial will briefly discuss:
* The first data challenge;
* JupyterHub and the GPU JupyterHub;
* A basic Python data-handling overview;
* The basics of handling categorical data.

You can access JupyterHub at https://jupyter.math.uwaterloo.ca/ or, if you are on campus or signed into the school's VPN, you can use the same service running on a GPU server at https://gpu-pt1-01.math.private.uwaterloo.ca/. This is a great way to overcome the limitations of your own computer. A word of warning: the servers can get bogged down, particularly around deadlines. Further, while they are fast, they are by no means the fastest computational platforms available! **Leave time to complete the assignments and data challenge, _especially_ if you plan on using JupyterHub.**


## Brief Python Data-Handling Overview

In [1]:
# Pandas is a package which provides efficient data structures for data-handling
# Of particular note, pandas gives access to a "dataframe" structure, that should 
#    be quite familiar to R users.
# It also provides easy data reading features (e.g. from different file types),
#    and (as we will see later) allows for many built-in manipulations
# https://pandas.pydata.org/

import pandas as pd # It is standard practice to import this "as pd"

In [2]:
# Numpy is a library designed to assist in many aspects of scientific computing.
# Of particular interest for our use cases, numpy provides an n-dimensional array,
#    and plenty of efficient computational functions [e.g. linear algebra, etc.]

import numpy as np # It is standard practice to import this "as np"
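
As a small illustration of the array type and the linear algebra routines mentioned in the comments above, here is a minimal sketch. The numbers are made up for the example and have nothing to do with the challenge data.

```python
import numpy as np

# A 2-D array (matrix) and a 1-D array (vector); values are illustrative only
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

# Element-wise (vectorized) arithmetic: no explicit loop is needed
scaled = 2 * b + 1

# Solve the linear system A x = b
x = np.linalg.solve(A, b)

print(A.shape)   # dimensions of the matrix
print(scaled)    # result of the vectorized expression
print(x)         # solution of the linear system
```

Vectorized expressions like `2 * b + 1` are typically much faster than an equivalent Python loop, which matters once the arrays hold tens of thousands of rows.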

### Importing Data from a CSV

For the first data challenge, the training and test data files are given to you as CSVs. With Pandas imported, reading this data from a CSV is quite straightforward.

In [3]:
# The "read_csv" method, provided by Pandas, will allow you to open any standard
#    csv file, and read in the data. 
# Simply provide the path to the file, and the resulting CSV will be stored in a
#    dataframe.

train_df = pd.read_csv('_my/home/individual_data_train.csv') 
train_df_copy = pd.read_csv('_my/home/individual_data_train.csv') 
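
The dummy-variable column names later in this tutorial (for example `Workclass_ ?`) suggest that the text values in the file carry a leading space after each comma. The sketch below uses a tiny made-up in-memory CSV (not the real file) to show what that looks like and how the `skipinitialspace` option of `read_csv` removes it. Whether you want to strip the spaces is your choice; the category list used later in this tutorial is written with the leading spaces kept.

```python
import io
import pandas as pd

# A tiny in-memory stand-in for a CSV file; the contents are made up for illustration
csv_text = "Age, Workclass, Class\n39, State-gov, 0\n50, Private, 1\n"

raw_df = pd.read_csv(io.StringIO(csv_text))                            # spaces kept
clean_df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)   # spaces dropped

print(list(raw_df.columns))       # header names as read, spaces included
print(repr(raw_df.iloc[0, 1]))    # first Workclass value, spaces included
print(repr(clean_df.iloc[0, 1]))  # same value with the leading space removed
print(clean_df.shape)             # (rows, columns)
print(clean_df.dtypes)            # inferred type of each column
```

Checking `shape` and `dtypes` right after reading a file is a cheap way to confirm the data was parsed the way you expected.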

In [4]:
# The "head" method on the dataframe will allow you to view the first 5 rows
train_df.head()

Unnamed: 0,Age,Workclass,Fnlwgt,Education,Education Num,Marital Status,Occupation,Relationship,Race,Sex,Capital Gains,Capital Loss,Hours per Week,Native Country,Class
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0


In [5]:
# The "tail" method on the dataframe will allow you to view the final 5 rows
train_df.tail()

Unnamed: 0,Age,Workclass,Fnlwgt,Education,Education Num,Marital Status,Occupation,Relationship,Race,Sex,Capital Gains,Capital Loss,Hours per Week,Native Country,Class
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,0
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,1
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,0
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,0
32560,52,Self-emp-inc,287927,HS-grad,9,Married-civ-spouse,Exec-managerial,Wife,White,Female,15024,0,40,United-States,1


In [6]:
# You can select columns by the headers given. Selecting a single column returns a pandas
#    Series; selecting a list of columns (as in the next cell) returns a dataframe.

age_df = train_df['Age']
age_df.head()

0    39
1    50
2    38
3    53
4    28
Name: Age, dtype: int64
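
The difference between selecting with one label and selecting with a list of labels can be checked directly. This is a minimal sketch on a toy frame that reuses the tutorial's column names with illustrative values.

```python
import pandas as pd

# Toy data with the same column names as the tutorial; the values are illustrative
toy_df = pd.DataFrame({"Age": [39, 50, 38], "Sex": ["Male", "Male", "Female"]})

single = toy_df["Age"]       # one label -> pandas Series (1-D)
as_frame = toy_df[["Age"]]   # list of labels -> pandas DataFrame (2-D)

print(type(single).__name__, single.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The distinction matters when a library expects a 2-D input: a one-column dataframe has shape `(n, 1)`, while a Series has shape `(n,)`.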

In [7]:
age_sex_class_df = train_df[['Age', 'Sex', 'Class']]
age_sex_class_df.head()

Unnamed: 0,Age,Sex,Class
0,39,Male,0
1,50,Male,0
2,38,Male,0
3,53,Male,0
4,28,Female,0


In [8]:
# You can add new columns simply by specifying their computed value
train_df['Age_50_Plus'] = train_df['Age'] >= 50
train_df.head()

Unnamed: 0,Age,Workclass,Fnlwgt,Education,Education Num,Marital Status,Occupation,Relationship,Race,Sex,Capital Gains,Capital Loss,Hours per Week,Native Country,Class,Age_50_Plus
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0,False
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0,True
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0,False
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0,True
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0,False
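
A boolean flag is the simplest derived column. If you want more than two groups, one option is `pd.cut`, sketched below on toy ages. The cut points and group names are arbitrary choices made for this example, not something prescribed by the data challenge.

```python
import pandas as pd

toy_df = pd.DataFrame({"Age": [39, 50, 38, 53, 28]})

# Half-open bins [0, 30), [30, 50), [50, 120); the cut points are arbitrary choices
toy_df["Age_Group"] = pd.cut(
    toy_df["Age"],
    bins=[0, 30, 50, 120],
    labels=["under_30", "30_to_49", "50_plus"],
    right=False,   # include the left edge, exclude the right edge
)

print(toy_df)
```

With `right=False`, an age of exactly 50 falls in the `50_plus` group, which matches the `>= 50` rule used for `Age_50_Plus` above.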


In [9]:
# You can use the .loc indexer to select rows by label or by a boolean condition.
# For instance, if you wish to select all of the individuals who belong to class 0

class_0_df = train_df.loc[train_df['Class'] == 0]

class_0_df.head()

Unnamed: 0,Age,Workclass,Fnlwgt,Education,Education Num,Marital Status,Occupation,Relationship,Race,Sex,Capital Gains,Capital Loss,Hours per Week,Native Country,Class,Age_50_Plus
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0,False
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0,True
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0,False
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0,True
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0,False


In [10]:
class_0_df.tail()

Unnamed: 0,Age,Workclass,Fnlwgt,Education,Education Num,Marital Status,Occupation,Relationship,Race,Sex,Capital Gains,Capital Loss,Hours per Week,Native Country,Class,Age_50_Plus
32553,32,Private,116138,Masters,14,Never-married,Tech-support,Not-in-family,Asian-Pac-Islander,Male,0,0,11,Taiwan,0,False
32555,22,Private,310152,Some-college,10,Never-married,Protective-serv,Not-in-family,White,Male,0,0,40,United-States,0,False
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,0,False
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,0,True
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,0,False
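
Conditions can be combined, and `.loc` can restrict the columns in the same step. The following is a minimal sketch on a toy frame with illustrative values.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Age": [39, 50, 38, 53, 28],
    "Sex": ["Male", "Male", "Male", "Male", "Female"],
    "Class": [0, 0, 0, 1, 0],
})

# Each condition needs its own parentheses; combine with & (and), | (or), ~ (not)
mask = (toy_df["Class"] == 0) & (toy_df["Age"] >= 38)

# Rows where the mask is True, restricted to two columns
subset = toy_df.loc[mask, ["Age", "Sex"]]

print(subset)
```

Note that the selected rows keep their original index labels, which is why the `tail()` output above skips some row numbers. If you prefer a fresh 0, 1, 2, ... index, `subset.reset_index(drop=True)` provides one.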


In [11]:
## Label Extraction
# If we wish to extract the series of classes from the above data frame, it will help to apply 
#   some numpy features as well.

class_df = train_df['Class']         # Extract the class labels (a pandas Series)
class_arr = np.array(class_df)       # Turn the Series into a numpy array

In [12]:
class_df.head()

0    0
1    0
2    0
3    0
4    0
Name: Class, dtype: int64

In [13]:
class_arr.shape

(32561,)

In [14]:
class_arr[0:50]

array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0])
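
Once the labels are extracted, it is common to keep the remaining columns as a separate feature table. The sketch below shows one way to do that on toy data; the names `X` and `y` are only a convention, not something the challenge requires.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Age": [39, 50, 38],
    "Hours per Week": [40, 13, 40],
    "Class": [0, 0, 1],
})

y = toy_df["Class"].to_numpy()        # 1-D array of labels
X = toy_df.drop(columns=["Class"])    # every remaining column, still a dataframe

print(X.shape, y.shape)
```

`drop` returns a new dataframe and leaves `toy_df` unchanged, so the label column is still available afterwards.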

In [15]:
# Some of the nice numpy features

age_arr = np.array(age_df) # Generate an age array

mean_age = np.mean(age_arr) 
median_age = np.median(age_arr)

class_1_count = np.sum(class_arr)

In [16]:
print("The mean age is {}, with a median of {}.".format(mean_age, median_age))
print("There are {} occurrences of class 1 in the dataset.".format(class_1_count))

The mean age is 38.58164675532078, with a median of 37.0.
There are 7841 occurrences of class 1 in the dataset.
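
The same summaries are also available directly on pandas objects, without converting to numpy first. This is a minimal sketch on toy values, so the numbers differ from the output above.

```python
import pandas as pd

toy_df = pd.DataFrame({"Age": [39, 50, 38, 53, 28], "Class": [0, 0, 0, 1, 1]})

mean_age = toy_df["Age"].mean()
median_age = toy_df["Age"].median()

class_counts = toy_df["Class"].value_counts()   # number of rows per class label
class_1_share = toy_df["Class"].mean()          # valid only because labels are coded 0/1

print(mean_age, median_age)
print(class_counts)
print(class_1_share)
```

Looking at the share of class 1 early on is worthwhile: if the classes are unbalanced, raw accuracy can be a misleading measure of how well a classifier is doing.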


There are plenty of other ways of manipulating data, and plenty more ways to import it. For now, I will leave these as an exercise for the interested reader; the online material on the subject is quite good.

If you ever need assistance, reach out on Piazza or come to office hours and I would be happy to help!

## Dealing with Categorical Data
As you have surely noted, much of the data in the data challenge is categorical by nature. While there exist methods that deal directly with categorical data, it is almost always better to encode it as numerical data in some manner, so as to assist the methods of analysis.

### Dummy Variables
The most common method of dealing with categorical data is through the use of "dummy" or indicator variables. The general idea should be familiar to those who have worked with categorical data in a regression context: for each categorical variable with n categories you create n (or perhaps n - 1) indicator variables, each of which takes a value of 1 if the data point belongs to that category, and 0 otherwise.

For the familiar, the use of dummy variables is effectively equivalent to one hot encoding, though we tend to create a single vector in the case of the latter, and generate new columns in a dataframe in the case of the former.

These binary switches have the advantage of being easy to interpret and tremendously quick to create, but they can quickly add tremendous dimensionality to your data, and they give up hierarchical relationships.
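
To make the "n versus n - 1" remark concrete: with n indicator columns, the indicators in every row sum to 1, so one of them is redundant when the model also has an intercept. The sketch below shows both encodings on a toy column, using the `drop_first` option of `get_dummies`. The `dtype=int` argument is passed only so the output is 0/1 regardless of the pandas version.

```python
import pandas as pd

toy = pd.DataFrame({"Sex": ["Male", "Female", "Male"]})

# n indicator columns (one per category)
full = pd.get_dummies(toy, columns=["Sex"], dtype=int)

# n - 1 indicator columns (the first category becomes the implicit baseline)
reduced = pd.get_dummies(toy, columns=["Sex"], drop_first=True, dtype=int)

print(full)
print(reduced)
```

In the reduced encoding, a row with all indicators equal to 0 belongs to the dropped baseline category.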

In [17]:
## Pandas allows you to generate a set of dummy variables for a dataframe with the 
##    "get_dummies" method.

pd.get_dummies(train_df).head()

Unnamed: 0,Age,Fnlwgt,Education Num,Capital Gains,Capital Loss,Hours per Week,Class,Age_50_Plus,Workclass_ ?,Workclass_ Federal-gov,...,Native Country_ Portugal,Native Country_ Puerto-Rico,Native Country_ Scotland,Native Country_ South,Native Country_ Taiwan,Native Country_ Thailand,Native Country_ Trinadad&Tobago,Native Country_ United-States,Native Country_ Vietnam,Native Country_ Yugoslavia
0,39,77516,13,2174,0,40,0,False,0,0,...,0,0,0,0,0,0,0,1,0,0
1,50,83311,13,0,0,13,0,True,0,0,...,0,0,0,0,0,0,0,1,0,0
2,38,215646,9,0,0,40,0,False,0,0,...,0,0,0,0,0,0,0,1,0,0
3,53,234721,7,0,0,40,0,True,0,0,...,0,0,0,0,0,0,0,1,0,0
4,28,338409,13,0,0,40,0,False,0,0,...,0,0,0,0,0,0,0,0,0,0


In [18]:
# You can also specify which columns you wish to create dummy variables for
pd.get_dummies(train_df, prefix=['Marital Status', 'Sex'], columns=['Marital Status', 'Sex']).head()

Unnamed: 0,Age,Workclass,Fnlwgt,Education,Education Num,Occupation,Relationship,Race,Capital Gains,Capital Loss,...,Age_50_Plus,Marital Status_ Divorced,Marital Status_ Married-AF-spouse,Marital Status_ Married-civ-spouse,Marital Status_ Married-spouse-absent,Marital Status_ Never-married,Marital Status_ Separated,Marital Status_ Widowed,Sex_ Female,Sex_ Male
0,39,State-gov,77516,Bachelors,13,Adm-clerical,Not-in-family,White,2174,0,...,False,0,0,0,0,1,0,0,0,1
1,50,Self-emp-not-inc,83311,Bachelors,13,Exec-managerial,Husband,White,0,0,...,True,0,0,1,0,0,0,0,0,1
2,38,Private,215646,HS-grad,9,Handlers-cleaners,Not-in-family,White,0,0,...,False,1,0,0,0,0,0,0,0,1
3,53,Private,234721,11th,7,Handlers-cleaners,Husband,Black,0,0,...,True,0,0,1,0,0,0,0,0,1
4,28,Private,338409,Bachelors,13,Prof-specialty,Wife,Black,0,0,...,False,0,0,1,0,0,0,0,1,0
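
One practical point when the training and test data come in separate files: `get_dummies` only creates columns for the categories it actually sees, so the two encoded frames can end up with different columns. A possible way to handle this, sketched on toy data with made-up frame names, is to reindex the test frame to the training columns.

```python
import pandas as pd

# Toy frames: "Taiwan" appears only in train, "Scotland" only in test
train = pd.DataFrame({"Native Country": ["Cuba", "Taiwan", "Cuba"]})
test = pd.DataFrame({"Native Country": ["Cuba", "Scotland"]})

train_d = pd.get_dummies(train, columns=["Native Country"], dtype=int)
test_d = pd.get_dummies(test, columns=["Native Country"], dtype=int)

# Give the test frame exactly the training columns:
# missing columns are filled with 0, columns unseen in training are dropped
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)

print(list(train_d.columns))
print(test_aligned)
```

Dropping a category that never appeared in training is a design choice rather than the only option. It amounts to treating that category as "none of the known ones".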


### Categorical -> Numeric Data
A second method, which perhaps provides more utility when there is a clear hierarchy in the categories, involves assigning a numeric value to the different categories. This overcomes some of the shortfalls with the dummy variable technique, but introduces issues if the data does not exhibit a clear structure. Consider the case of regressing on one such numerically encoded variable. If there is not a clear trend between the categories, then it would be effectively impossible to generate a meaningful coefficient from the analysis. 

However, when working with variables such as "Education" it may make sense. 

In [19]:
# First, we can force pandas to treat a column as a category.
# This will provide some benefits on the whole, but in particular, allows us to easily
#    extract numerical values for the different categories

train_df['Education'] = train_df['Education'].astype('category')

In [20]:
# We can then assign the category codes (.cat.codes) to a new column in the data frame,
#   to get a sense of what is happening.
train_df['Education_Code'] = train_df['Education'].cat.codes
train_df.head()

Unnamed: 0,Age,Workclass,Fnlwgt,Education,Education Num,Marital Status,Occupation,Relationship,Race,Sex,Capital Gains,Capital Loss,Hours per Week,Native Country,Class,Age_50_Plus,Education_Code
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0,False,9
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0,True,9
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0,False,11
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0,True,1
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0,False,9
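
The codes above are not arbitrary: by default pandas sorts the category labels lexically and uses each label's position in that sorted order as its code. This is why "11th" receives a smaller code than "Bachelors". The sketch below shows how to recover the mapping on a toy column; the leading spaces mimic the values in the data file.

```python
import pandas as pd

edu = pd.Series([" Bachelors", " HS-grad", " 11th", " Bachelors"]).astype("category")

# Categories are sorted lexically by default; codes are positions in that order
code_to_label = dict(enumerate(edu.cat.categories))

print(code_to_label)
print(edu.cat.codes.tolist())
```

Because this default order is alphabetical rather than meaningful, these codes should not be read as a ranking of education levels.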


In [23]:
# To provide more meaningful orderings of these categories, you can specify a list of the correct order.
# First we need to import the API for the categorical data type
from pandas.api.types import CategoricalDtype

# Then we need to define the ordering of our categories
education_levels = [ ' Preschool',
                     ' 1st-4th',
                     ' 5th-6th',
                     ' 7th-8th',
                     ' 9th',
                     ' 10th',
                     ' 11th',
                     ' 12th',
                     ' HS-grad',
                     ' Some-college',
                     ' Assoc-acdm',
                     ' Assoc-voc',
                     ' Bachelors',
                     ' Prof-school',
                     ' Masters',
                     ' Doctorate']

# Then we generate our custom datatype
education_cat_type = CategoricalDtype(categories=education_levels, ordered=True)

# And finally assign the column to be that datatype
train_df_copy['Education'] = train_df_copy['Education'].astype(education_cat_type)
train_df_copy['Education_Code'] = train_df_copy['Education'].cat.codes
train_df_copy.head()

Unnamed: 0,Age,Workclass,Fnlwgt,Education,Education Num,Marital Status,Occupation,Relationship,Race,Sex,Capital Gains,Capital Loss,Hours per Week,Native Country,Class,Education_Code
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0,12
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0,12
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0,8
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0,6
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0,12
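
A caveat with hand-written category lists: any value that does not appear in the list exactly (a misspelling, a missing entry, a stray or missing leading space) is silently converted to a missing value with code -1. The sketch below demonstrates this on toy data with a deliberately incomplete list, and shows a quick check for it. It also shows that an ordered categorical supports comparisons.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Deliberately incomplete ordering: "Masters" is missing from the list
levels = ["HS-grad", "Bachelors"]
edu_type = CategoricalDtype(categories=levels, ordered=True)

edu = pd.Series(["HS-grad", "Masters", "Bachelors"]).astype(edu_type)
codes = edu.cat.codes

n_unmatched = int((codes == -1).sum())   # values that matched no category
at_least_bachelors = edu >= "Bachelors"  # comparison uses the declared order

print(codes.tolist())
print(n_unmatched)
print(at_least_bachelors.tolist())
```

Checking that no code equals -1 after the conversion is a cheap safeguard before using the codes as a feature.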


### Sklearn
Sklearn (scikit-learn) is a very commonly used Python package for machine learning. There is extensive documentation regarding the different utilities and techniques provided by sklearn. In particular, most of the classification algorithms, feature extraction tools, and data handling utilities one could need for this course are available (NOTE: for the data challenges, it is not necessary to implement your own algorithms).

For extensive documentation and examples, check out: http://scikit-learn.org/stable/