Course materials for Stat 154, spring 2018, at UC Berkeley

Stat 154 Spring 2018

This repository holds the course materials for the Spring 2018 edition of Statistics 154: Modern Statistical Prediction and Machine Learning at UC Berkeley.

  • Instructor: Gaston Sanchez, gasigiri [at] berkeley [dot] edu
  • Class Time: MWF 11-12pm in 180 Tan
  • Session Dates: 01/17/18 - 05/04/18
  • Code #: 30887
  • Units: 4 (more info here)
  • Office Hours: MW 2:15-3:15pm in 309 Evans (or by appointment)
  • Piazza:
  • Final: Tue May 8, 7-10pm (room TBD)
  • GSI: Omid Solari (Mon. 5-6pm, Wed. 8-10am @444 EVANS).

| Lab | Date    | Room      | GSI         |
|-----|---------|-----------|-------------|
| 101 | M 12-2pm | 334 Evans | Omid Solari |
| 102 | M 3-5pm  | 334 Evans | Omid Solari |


This is an introductory-level course in statistical learning, with an emphasis on regression and classification methods, and a pinch of unsupervised methods. Time permitting, the course covers the following topics (not necessarily in the displayed order; see the syllabus for more info):

  • Process of predictive model building
  • Data Preprocessing
  • Regression Models
    • Linear models
    • Non-linear models (time permitting)
    • Tree-based methods
  • Classification Models
    • Linear models
    • Non-linear models
    • Tree-based methods
    • Support Vector Machines (time permitting)
  • Unsupervised methods like PCA and Clustering
  • Data spending: splitting and resampling methods
  • Bias-Variance Trade-off
  • Model Assessment
  • Model Selection

Throughout the semester we will explore the predictive modeling lifecycle, including question formulation, data preprocessing, exploratory data analysis and visualization, model building, model assessment/validation, model selection, and decision-making.

Prerequisites / Review

  • Multivariate calculus or the equivalent, esp. partial derivatives; e.g. Math 53
  • Linear algebra or the equivalent (matrices, vector spaces); e.g. Math 54
  • Statistical inference or the equivalent; e.g. Stat 135
  • Scripting experience in R required; e.g. Stat 133

This course will build on a lot of material from matrix algebra. In particular, you should be comfortable with notions such as vector spaces, inner products, norms, matrix products/transpose/rank/determinants/inverses, as well as matrix decompositions.

You should also have some scripting experience, preferably in R, at the level of writing functions, conditionals (if-then-else structures), for loops, and while loops, as well as sampling, reading in data sets, and exporting results.

Last but not least, it helps to know the basics of Rmd files and to have some knowledge of LaTeX, especially some experience writing math symbols and equations.


There is no official textbook for this course, although we will use the following texts as supporting material:

  • An Introduction to Statistical Learning (ISL) by James, Witten, Hastie, and Tibshirani. Springer, 2013. It is freely available online in PDF format (courtesy of the authors).

  • The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman. Springer, 2009 (2nd Ed). This book is more mathematically and conceptually advanced than ISL. It is freely available online in PDF format (courtesy of the authors).

  • Applied Predictive Modeling by Max Kuhn and Kjell Johnson. Springer, 2013.

  • Data Mining and Statistics for Decision Making by Stephane Tuffery. Wiley, 2011.


We expect that by the end of the course you will:

  • Have a basic, yet solid, understanding of the prediction modeling process/lifecycle.
  • Be able to read a well-described algorithm, and write code to implement it computationally (in R).
  • Know the pros and cons of each predictive technique.
  • Be able to describe (to non-professionals) what a predictive technique is doing.

Methods of Instruction

  • We will be using a combination of materials such as slides, tutorials, reading assignments, and chalk-and-talk.
  • The main computational tool will be the computing and programming environment R.
  • The main workbench will be the IDE RStudio. You will also use a terminal emulator to work with the command line.


  • Please read the course logistics and policies for more details about the structure of the course, dos and don'ts, etc.


Creative Commons License
Unless otherwise noted, this work, by Gaston Sanchez, is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Author: Gaston Sanchez