This is the GitHub repo for the course Statistical Machine Learning offered at Columbia University. I set this up as my personal blog for future students rather than as just another site for homework answers. Please feel free to email me if you have questions.

Statistical Machine Learning

This is the GitHub repo for the course Statistical Machine Learning offered at Columbia University. I set this up for future students and hope this site can be helpful. Please feel free to email me if you have questions.


This repo contains the following documents:

  • Homework: a few files I collected from past homework. I would prefer that people not focus on this part, because simply doing homework is not enough.
  • Notes: these are my own lecture notes, which I am glad to share. That said, I believe successful students tend not to rely on online documents; they are capable of preparing their own notes. Mine are up simply for sharing and inspiration. It is strongly recommended that you use your own notes.
  • Lecture: slides I collected on machine learning.
  • Exercise: a collection of machine learning scripts. Each script has (1) a function definition, (2) toy data, and (3) a run of the function on that data. I wrote all scripts in the same format so that I can fit them into a larger picture later on. It is strongly recommended that you come up with your own format for executing these functions, but I hope mine can be good inspiration.
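To make the three-part script format concrete, here is a minimal sketch in Python. It is not one of the scripts in the Exercise folder; the function name `fit_simple_linear_regression` and the toy numbers are assumptions made up for illustration.

```python
# (1) Definition of function
# Hypothetical example function: least-squares fit of a straight line.
def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# (2) Toy data: made-up points that lie exactly on y = 1 + 2x
toy_x = [0.0, 1.0, 2.0, 3.0, 4.0]
toy_y = [1.0, 3.0, 5.0, 7.0, 9.0]

# (3) Running the function
intercept, slope = fit_simple_linear_regression(toy_x, toy_y)
print(f"intercept={intercept:.2f}, slope={slope:.2f}")  # intercept=1.00, slope=2.00
```

Keeping the toy data noise-free means you know the right answer in advance, which makes it easy to tell whether the function works before you try it on real data.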


I do not believe there is one book or one problem set that will magically turn anyone into an expert in machine learning. That being said, there are a few directions that can perhaps minimize the time you spend searching for the right path. On top of that, your diligence is a major factor in determining how far you can push yourself in this field.

Put simply: if you take grades, money, and media attention out of the equation, are you still willing to do machine learning? If not, then this is not for you.

Stage I

(1) Read as many books as you can and try to replicate the machine learning techniques in them. This is the early stage of getting familiar with machine learning tools, and you should feel comfortable getting your hands dirty.

Some great books are:

  • An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani; click here
  • An Introduction to Support Vector Machines and Other Kernel-based Learning Methods by Nello Cristianini and John Shawe-Taylor; click here
  • The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani and Jerome Friedman; click here
  • Machine Learning: A Probabilistic Perspective by Kevin P. Murphy; click here

Programming Languages:

  • You should be fluent in both Python and R.
  • You should attempt to replicate the same results in C++.
  • A list of ML functions for practice, taken from the Hands-On ML textbook. For the full list of Python code, please click here.

A good exercise is to open an online code compiler (click here) and do the same matrix algebra in several languages side by side.
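As one possible starting point for that exercise, the sketch below does a few matrix operations in plain Python with no libraries, so it should run in most online compilers. The helper names (`matmul`, `transpose`, `inverse_2x2`) and the matrices are assumptions chosen for illustration; the idea is to rewrite the same steps in R and C++ and check that the numbers agree.

```python
def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    """Swap rows and columns."""
    return [list(row) for row in zip(*a)]


def inverse_2x2(a):
    """Invert a 2 x 2 matrix using the closed-form adjugate formula."""
    (p, q), (r, s) = a
    det = p * s - q * r
    if det == 0:
        raise ValueError("matrix is singular")
    return [[s / det, -q / det], [-r / det, p / det]]


# Made-up example matrices
A = [[2.0, 1.0], [1.0, 3.0]]
B = [[1.0, 0.0], [4.0, 2.0]]

product = matmul(A, B)             # A times B
A_t = transpose(A)                 # A is symmetric, so this equals A
A_inv = inverse_2x2(A)             # determinant of A is 5
identity_check = matmul(A, A_inv)  # should be the identity up to rounding

print("A * B      =", product)
print("A^T        =", A_t)
print("A^-1       =", A_inv)
print("A * A^-1   =", identity_check)
```

Multiplying a matrix by its inverse and checking for the identity is a cheap sanity test that carries over unchanged to any other language.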

Stage II

(2) As an intermediate-level student, you should be fluent in Step (1). To move beyond it, go to GitHub or Kaggle, search for new data sets (ones you have never touched before), and try to replicate Step (1) on them.

Review this wiki site and search for a new data set. Once you find something interesting, you can go to GitHub or Kaggle.

A new data set is like a new person you may want to be friends with. You treat it well and learn from it, and you gain experience. A data set is not limited to any one form. It can be (1) big or small, (2) supervised or unsupervised, (3) time series, sound waves, or stock prices, (4) 2D images, 3D objects, ..., (5) classification or regression, and so on. You need to be able to tell a great story with results from multiple different machine learning techniques, given any data set.

  • A list of famous data sets can be found here.
  • Kaggle data sets for competitions and machine learning can be found here.
  • You can also refer to the UCI Machine Learning Repository for more data sets. Click here.
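Telling a story with results from multiple techniques usually starts with running them on the same held-out data. The sketch below is one minimal way to do that in plain Python, comparing a 1-nearest-neighbour classifier with a nearest-centroid classifier. The data are hypothetical toy points, not a real data set, and the function names are assumptions for illustration; with a data set from the sources above you would replace the toy lists with your own train and test splits.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def predict_1nn(train_x, train_y, point):
    """Label of the single closest training point."""
    best = min(range(len(train_x)), key=lambda i: euclidean(train_x[i], point))
    return train_y[best]


def predict_nearest_centroid(train_x, train_y, point):
    """Label of the class whose mean (centroid) is closest to the point."""
    centroids = {}
    for label in set(train_y):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return min(centroids, key=lambda label: euclidean(centroids[label], point))


def accuracy(predict, train_x, train_y, test_x, test_y):
    """Fraction of held-out points that the given predictor labels correctly."""
    hits = sum(predict(train_x, train_y, x) == y for x, y in zip(test_x, test_y))
    return hits / len(test_x)


# Hypothetical toy data: two well-separated 2D clusters
train_x = [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5], [5.0, 5.0], [5.5, 6.0], [6.0, 5.5]]
train_y = [0, 0, 0, 1, 1, 1]
test_x = [[0.2, 0.4], [1.2, 1.0], [5.2, 5.1], [6.1, 6.3]]
test_y = [0, 0, 1, 1]

# Evaluate every technique on the same held-out split
results = {
    "1-nearest neighbour": accuracy(predict_1nn, train_x, train_y, test_x, test_y),
    "nearest centroid": accuracy(predict_nearest_centroid, train_x, train_y, test_x, test_y),
}
for name, acc in results.items():
    print(f"{name}: held-out accuracy = {acc:.2f}")
```

On clusters this clean both methods classify every test point correctly, so the interesting part of the story only begins on a real data set, where the methods disagree and you have to explain why.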

Stage III

(3) At an advanced or research level, you are fluent in Steps (1) and (2). In fact, you might be too fluent to find them interesting. Moreover, you have looked at so many data sets that there isn't a single form of data you have not seen before. You start to think about how you can contribute to society and what can be improved. You start asking questions such as "why do apples fall?" If you are here, you are an advanced machine learning practitioner. You can question and improve on any author or textbook. You can design and even invent profitable machine learning products, so that perhaps you can go out and look for investors to finance your idea and start your own company.

  • Read my story
  • A sample package and instructions for publishing R packages; click here.
  • From YinsStockPredictoR_1, I developed version 2.0 here, a design that builds on the platform of the first version.
  • From YinsPredictor 2.0, I further developed another version of the stock predictor. Though it is only a small building block at Yin's Capital, it is the most fundamental part my organization requires. Click here to get acquainted with this package.
  • I published my own neural network package using the Keras R interface; click here to learn how to download it.