Statistical Machine Learning
This is the GitHub repo for the course Statistical Machine Learning offered at Columbia University. I set this up for future students and hope this site can be helpful. Please feel free to email me if you have questions.
Updated materials have been moved to: https://yinscapital.com/product/statistical-machine-learning-master-collection/
This repo contains the following documents:
- Homework: a few files I collected from past homework. I would encourage you not to focus on this part alone, because simply doing homework is not enough.
- Notes: this document contains the lecture notes I took, and I am glad to share them with you. That being said, I have faith that successful students tend not to rely on online documents; they are capable of preparing their own notes. Mine are shared simply for inspiration. It is strongly recommended that you take your own notes.
- Lecture: slides I collected on machine learning.
- Exercise: a directory of packages of different machine learning scripts. Each script has (1) the definition of a function, (2) toy data, and (3) a run of the function. I wrote all scripts in the same format so that I can fit them into a larger picture later on. It is strongly recommended that you come up with your own format for executing these functions, but I hope mine can be good inspiration.
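A three-part script format like the one above can be sketched roughly as follows. This is a minimal Python illustration only; the function and data names (`fit_knn`, `make_toy_data`) are hypothetical and not taken from the actual Exercise scripts.

```python
import numpy as np

# (1) Definition of function
def fit_knn(X_train, y_train, X_test, k=3):
    """Predict labels for X_test by majority vote among the k nearest training points."""
    preds = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to training set
        nearest = y_train[np.argsort(dists)[:k]]      # labels of the k closest points
        preds.append(np.bincount(nearest).argmax())   # majority vote
    return np.array(preds)

# (2) Toy data
def make_toy_data(n=50, seed=0):
    """Two well-separated Gaussian clusters in 2D, labeled 0 and 1."""
    rng = np.random.default_rng(seed)
    X0 = rng.normal(loc=-1.0, size=(n, 2))
    X1 = rng.normal(loc=+1.0, size=(n, 2))
    X = np.vstack([X0, X1])
    y = np.array([0] * n + [1] * n)
    return X, y

# (3) Running the function
if __name__ == "__main__":
    X, y = make_toy_data()
    preds = fit_knn(X, y, X, k=3)
    print("training accuracy:", (preds == y).mean())
```

Keeping every script in this shape makes it easy to swap in a different function or a different data set later without touching the rest of the file.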
I do not believe there is one book or one problem set that will magically make you an expert in machine learning. That being said, there are a few directions you can take to minimize the time spent searching for the right path. On top of that, your diligence is a major factor in determining how far you can push yourself in this field.
In simple terms: with grades, money, and media attention taken out of the equation, are you still willing to do machine learning? If not, then this field is not for you.
(1) Read as many books as you can and try to replicate the machine learning techniques. This is the early stage of familiarizing yourself with machine learning tools, and you should feel comfortable getting your hands dirty.
Some great books are:
- An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani; click here
- An Introduction to Support Vector Machines and Other Kernel-based Learning Methods by Nello Cristianini and John Shawe-Taylor; click here
- The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman; click here
- Machine Learning: A Probabilistic Perspective by Kevin Murphy; click here
- A collection of cheat sheets from Stanford CS230 can be found here.
Programming Languages (academic level):
- You should be fluent in both Python and R. See a comparison here.
- You should attempt to replicate the same results in C++.
- Additional practice would be to use Java.
- You should be fluent navigating the Windows command prompt and shell-based languages so you can remotely control a machine. This is for the purposes of parallel computing and cloud computing.
- A list of ML functions for practice taken from the Hands-On Machine Learning textbook. For the full list of Python code, please click here. Another good Python collection, by Sebastian Raschka, is here. A great TensorFlow notebook source, covering material also taught by Andrew Ng on Coursera, can be found here. A more difficult Colab research page with advanced neural network algorithms can be found here.
- Professor Andrew Ng, a Stanford professor and former Google Brain engineer, teaches the Deep Learning Specialization on Coursera, with materials here. It charges $49 per month, and I finished the specialization in about 3-4 months. The results were quite fruitful.
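As a concrete example of the kind of small task worth re-implementing in R, C++, or Java once you have it working in Python: ordinary least squares fitted with nothing but NumPy. This is a sketch on synthetic data, not code from any of the courses above.

```python
import numpy as np

def ols_fit(X, y):
    """Return coefficients beta minimizing ||X beta - y||^2."""
    # lstsq is numerically safer than inverting X'X directly
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# toy data: y = 1 + 2*x plus small Gaussian noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
X = np.column_stack([np.ones_like(x), x])   # intercept column plus x
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=100)

beta = ols_fit(X, y)
print(beta)   # close to [1, 2]
```

Translating exactly this routine into another language forces you to confront each language's linear-algebra facilities, which is the point of the exercise.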
Programming Languages (industrial level):
- You want to be familiar with C++/C#, Java, and Python altogether for the purpose of writing software packages and .exe executables. These skills go beyond statistical modeling and predictive machine learning. They are more about software engineering and building platforms (such as Google Cloud, Amazon Web Services, or mobile apps).
- I sourced this repo here for those who are interested in getting started with Java. This site also has a few tips for a quick start in Java.
- Click here for LeetCode, an online playground environment for coding in C-based languages, Java, and Python in a convenient manner, with no installation required. One can always refer to some quick tips on Python, specifically on scikit-learn and Matplotlib.
- To handle big data in R, one can always refer to Spark SQL R interface here.
- Kafka is another open-source stream-processing software platform, developed by LinkedIn and donated to the Apache Software Foundation, written in Scala and Java. A well-known Go library for Kafka can be found here. This Confluence page explains how to handle Kafka clients as well.
- To implement Go in R, one can refer to this GitHub repo.
- Key advantages of using Go: here.
- Difference between C++ and Python: here.
- Difference between Go and Java: here.
- It is worth knowing that Spark (tutorial here) is written in Scala, and here is a comparison of using Scala versus Python.
- When working as a data scientist or statistician on a team, DevOps is quite important to understand and be comfortable with.
A good exercise is to go to an online code compiler (click here) and do some matrix algebra in different languages simultaneously.
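The matrix-algebra exercise might look like this in Python with NumPy (a sketch using a small hand-picked matrix); the goal is to reproduce the same operations and outputs in each language you are practicing.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

print(A @ A.T)              # matrix product of A with its transpose
print(np.linalg.det(A))     # determinant: 2*3 - 1*1 = 5
x = np.linalg.solve(A, b)   # solve the linear system A x = b
print(x)
print(A @ x)                # should recover b
```

Repeating these four operations in R, C++, and Java quickly reveals how each ecosystem handles linear algebra (built-in, standard library, or third-party).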
(2) For intermediate-level students: you should be fluent in Step (1). To move beyond it, go to GitHub or Kaggle, search for new data sets (ones you have never touched before), and try to replicate Step (1) using those new data sets.
Review this Wiki site and search for new data sets. Once you find something interesting, you can go to GitHub or Kaggle.
A new data set is like a new person you may want to befriend. Treat it well and learn from it, and you will gain experience. Data sets are not limited to any one form: they can be (1) big or small, (2) supervised or unsupervised, (3) time series, sound waves, or stock prices, (4) 2D images, 3D objects, ..., (5) classification or regression, and so on. You need to be able to tell a great story with results from multiple different machine learning techniques given any data set.
- A list of famous data sets can be found here.
- Kaggle data sets for competitions and machine learning; the link can be found here.
- One can also refer to the UCI Machine Learning Repository for more data sets. Click here.
- A good example is to find an interesting data set, such as Fashion-MNIST, and test all sorts of machine learning algorithms on it. This post focuses on a variety of neural network and convolutional neural network architectures applied to this one data set. It is worth exploring this data set with other methodologies.
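The "one data set, many methods" idea above can be sketched with scikit-learn. For a self-contained example this uses the small built-in digits set as a lightweight stand-in for Fashion-MNIST; the same loop applies unchanged to any `(X, y)` pair you download.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# one data set, held out the same way for every method
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# several very different model families on identical splits
models = {
    "logistic regression": LogisticRegression(max_iter=5000),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "random forest":       RandomForestClassifier(n_estimators=200, random_state=0),
}

for name, model in models.items():
    acc = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: test accuracy = {acc:.3f}")
```

Comparing several model families on one fixed split is what lets you tell a coherent story about the data set rather than about any single algorithm.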
(3) At an advanced or research level, you are fluent in Steps (1) and (2). In fact, you might be too fluent to find them interesting. Moreover, you have looked at so many data sets that there isn't a single data form you have not seen before. You start to think about how you can contribute to society and what can be improved. You start asking questions such as "why do apples fall?" If you are here, you are an advanced machine learning practitioner. You can challenge any author or textbook. You can design and even invent profitable machine learning products, so that perhaps you can go out and look for investors to finance your idea and start your own company.
- Read my story
- For a sample package and instructions on publishing R packages, click here.
- From YinsStockPredictoR_1, I developed version 2.0 here, a design that builds on the platform of the first version.
- From YinsPredictor 3.0, I further advanced another version of the stock predictor. Though a small building block at Yin's Capital, it is the most rudimentary part required for my organization. Click here to get acquainted with this package.
- I published my version of a neural network package using the Keras R interface; click here to learn more about how to download it. A more consolidated ML package can be found at YinsLibrary.
- For my own holding company, I developed a corporate-level package; one can access it here at YinsCapital.