< [Data Preprocessing and Visualization](../ica04/Data_Preprocessing_and_Visualization.ipynb) | Contents (TODO) |  [Distance and Similarity](../ica06/Distance_and_Similarity.ipynb) >

<a href="https://colab.research.google.com/github/stephenbaek/bigdata/blob/master/in-class-assignments/ica05/Supervised_Learning.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

# Introduction to Supervised Learning for Big Data

In this example, we will look at issues that arise when supervised learning is applied to big data. Computational speed, in particular, is the main concern we will address here.

Before we begin, here's a simple trick you can use to measure the time elapsed for an operation.

In [None]:
import time

In [None]:
start_time = time.time()
time.sleep(3)  # an operation you want to evaluate
elapsed_time = time.time() - start_time
print('Elapsed time: {} seconds'.format(elapsed_time))
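If you time many operations, the same pattern can be wrapped in a small helper. The following is a minimal sketch, not part of the original notebook: `timer` and `timings` are hypothetical names, and it uses `time.perf_counter()`, a standard-library clock intended for measuring short durations, instead of `time.time()`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(label, results=None):
    """Time the enclosed block and print the elapsed wall-clock time.

    `timer` and `results` are hypothetical names used for illustration.
    """
    start = time.perf_counter()  # monotonic, high-resolution clock
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed  # keep the number for later comparison
        print('Elapsed time for {}: {:.4f} seconds'.format(label, elapsed))


timings = {}
with timer('sleep', timings):
    time.sleep(0.1)  # an operation you want to evaluate
```

The measured times end up in the `timings` dictionary, so several steps can be compared after they have all run.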

With this simple trick in hand, let's measure how long it takes to solve a linear system.

First, let us consider matrices $X\in \mathbb{R}^{N\times d}$ and $Y \in \mathbb{R}^{N\times 1}$ for some positive integers $N$ and $d < N$.

In [None]:
import numpy as np

In [None]:
N = 10000
d = 500
X = np.random.normal(loc=0, scale=1, size=[N, d])
Y = np.random.normal(loc=0, scale=1, size=[N, 1])

print(X)
print(Y)

For a linear system of equations $Y = XA$, the least-squares solution (assuming $X^\top X$ is invertible) is given by:

\begin{equation*}
A = ((X^\top X)^{-1}X^\top)Y
\end{equation*}
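For reference, here is a brief sketch of where this formula comes from, under the assumption that $X$ has full column rank so that $X^\top X$ is invertible. The least-squares solution minimizes the squared residual $\|Y - XA\|_2^2$, and setting its gradient with respect to $A$ to zero gives:

\begin{align*}
\nabla_A \|Y - XA\|_2^2 = -2X^\top (Y - XA) = 0
\quad &\Longrightarrow \quad X^\top X A = X^\top Y \\
\quad &\Longrightarrow \quad A = (X^\top X)^{-1} X^\top Y
\end{align*}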

To compute this, a straightforward approach is to (1) compute $X^\top X$, (2) take the inverse $(X^\top X)^{-1}$, (3) multiply the inverse by $X^\top$, and finally (4) multiply the result by $Y$. The following cell measures how much computation time each of these steps requires.

In [None]:
start_time = time.time()
XTX = np.matmul(X.T, X)
XTX_elapsed_time = time.time() - start_time
print('Elapsed time for XTX: {} seconds'.format(XTX_elapsed_time))

start_time = time.time()
inv = np.linalg.inv(XTX)
inv_elapsed_time = time.time() - start_time
print('Elapsed time for the inverse: {} seconds'.format(inv_elapsed_time))

start_time = time.time()
invXT = np.matmul(inv, X.T)
invXT_elapsed_time = time.time() - start_time
print('Elapsed time for the inverse times XT: {} seconds'.format(invXT_elapsed_time))

start_time = time.time()
A = np.matmul(invXT, Y)
A_elapsed_time = time.time() - start_time
print('Elapsed time for the inverse times XT times Y: {} seconds'.format(A_elapsed_time))

print('Total: {} seconds'.format(XTX_elapsed_time + inv_elapsed_time + invXT_elapsed_time + A_elapsed_time))

Now, a simple trick can make a huge difference in computation time. Consider the same equation as above, but this time, let us change the order of computation slightly.

\begin{equation*}
A = (X^\top X)^{-1}(X^\top Y)
\end{equation*}

That is, this time, we are going to (1) compute $X^\top X$, (2) take the inverse $(X^\top X)^{-1}$, (3) compute $X^\top Y$, and finally (4) multiply $(X^\top X)^{-1}$ by $X^\top Y$. Steps (1) and (2) are the same as before, but steps (3) and (4) group the multiplications differently. Let's take a look at how much time is required to compute the solution with this strategy.

In [None]:
start_time = time.time()
XTX = np.matmul(X.T, X)
XTX_elapsed_time = time.time() - start_time
print('Elapsed time for XTX: {} seconds'.format(XTX_elapsed_time))

start_time = time.time()
inv = np.linalg.inv(XTX)
inv_elapsed_time = time.time() - start_time
print('Elapsed time for the inverse: {} seconds'.format(inv_elapsed_time))

start_time = time.time()
XTY = np.matmul(X.T, Y)
XTY_elapsed_time = time.time() - start_time
print('Elapsed time for XTY: {} seconds'.format(XTY_elapsed_time))

start_time = time.time()
A = np.matmul(inv, XTY)
A_elapsed_time = time.time() - start_time
print('Elapsed time for the inverse times XTY: {} seconds'.format(A_elapsed_time))

print('Total: {} seconds'.format(XTX_elapsed_time + inv_elapsed_time + XTY_elapsed_time + A_elapsed_time))

Notice the significant reduction in computation time?
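Because matrix multiplication is associative, both orderings should produce the same $A$ up to floating-point rounding. The following is a minimal sketch of how one might check this; the smaller sizes, the seeded random generator, and the `_s` variable names are assumptions made here so that the check runs quickly and does not overwrite the variables above.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator (an assumption, for reproducibility)
N_s, d_s = 2000, 50             # smaller than N and d above so the check is fast
X_s = rng.normal(size=(N_s, d_s))
Y_s = rng.normal(size=(N_s, 1))

inv_s = np.linalg.inv(X_s.T @ X_s)

A_first = (inv_s @ X_s.T) @ Y_s    # first ordering:  ((X^T X)^{-1} X^T) Y
A_second = inv_s @ (X_s.T @ Y_s)   # second ordering: (X^T X)^{-1} (X^T Y)

print('Shapes:', A_first.shape, A_second.shape)
print('Same solution:', np.allclose(A_first, A_second))
```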

### Assignment
- Which step shows the greatest difference?
- Why?
- Fix $d = 500$ but increase $N$ from 10,000 to 20,000, 50,000, and 100,000. How does the computation time change? Is there any trend?
- Fix $N = 10000$ but increase $d$ from 500 to 1,000, 2,000, and 5,000. How does the computation time change? Is there any trend?
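To run these experiments without copying the timing cell each time, the steps can be wrapped in a function. This is only a sketch: `time_normal_equations` is a hypothetical helper, not part of the original notebook, and the loop uses deliberately small values of $N$ and $d$ so that it finishes quickly. Substitute the values from the assignment.

```python
import time

import numpy as np


def time_normal_equations(N, d, seed=0):
    """Return per-step timings (seconds) for A = (X^T X)^{-1} (X^T Y).

    Hypothetical helper for the assignment; X and Y are random, as above.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(N, d))
    Y = rng.normal(size=(N, 1))
    timings = {}

    start = time.perf_counter()
    XTX = X.T @ X
    timings['XTX'] = time.perf_counter() - start

    start = time.perf_counter()
    inv = np.linalg.inv(XTX)
    timings['inverse'] = time.perf_counter() - start

    start = time.perf_counter()
    XTY = X.T @ Y
    timings['XTY'] = time.perf_counter() - start

    start = time.perf_counter()
    A = inv @ XTY
    timings['solve'] = time.perf_counter() - start

    # Sum of the four steps measured above
    timings['total'] = sum(timings.values())
    return timings


# Small illustrative sizes; replace with the values from the assignment.
for N in [1000, 2000, 4000]:
    t = time_normal_equations(N, d=50)
    print('N = {:>5}, d = 50: total {:.4f} seconds'.format(N, t['total']))
```

Since the function returns a dictionary of per-step timings, the same loop can be used to see which step grows fastest as $N$ or $d$ increases.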

### Note: Advanced Profiling

Measuring how long each part of a program takes to run is called profiling. Using the `time` library is quite simple, but sometimes we need a more advanced method. For example, you may have already noticed that the computation time of the same code varies from run to run.

One way of profiling your code is to put the `%timeit` magic command in front of the line you want to evaluate. For example:
```python
%timeit inv = np.linalg.inv(XTX)
```
runs `inv = np.linalg.inv(XTX)` multiple times and reports the mean and standard deviation of the computation time.

Another way is to put the `%prun` magic command in front of the line. For instance:
```python
%prun inv = np.linalg.inv(XTX)
```
provides a more in-depth breakdown of the process. If you are not very familiar with computer programming, however, `%prun` might be too much, as it gives very detailed information. In that case, `%timeit` or the `time.time()` method should serve you just fine.

In [None]:
%timeit inv = np.linalg.inv(XTX)

In [None]:
%prun inv = np.linalg.inv(XTX)
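The `%timeit` and `%prun` commands only work inside IPython or Jupyter. In a plain Python script, the standard-library modules `timeit` and `cProfile` offer roughly comparable functionality. The sketch below is an illustration under assumptions: `XTX_demo` is a small stand-in matrix created here, and the repeat counts are arbitrary choices rather than the defaults that `%timeit` would pick.

```python
import cProfile
import io
import pstats
import timeit

import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(200, 200))
XTX_demo = M.T @ M  # small symmetric matrix standing in for XTX

# Rough counterpart of %timeit: 5 repeats of 10 calls each.
calls_per_repeat = 10
runs = timeit.repeat(lambda: np.linalg.inv(XTX_demo),
                     repeat=5, number=calls_per_repeat)
per_call = np.array(runs) / calls_per_repeat
print('{:.6f} s +/- {:.6f} s per call'.format(per_call.mean(), per_call.std()))

# Rough counterpart of %prun: profile a single call and print the top entries.
profiler = cProfile.Profile()
profiler.enable()
np.linalg.inv(XTX_demo)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(5)
print(stream.getvalue())
```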

### Scikit-Learn

One of the reasons why Python is so popular in data science is the availability of free, open-source libraries with pre-defined data analysis functions and algorithms. Among many others, `Scikit-learn` is perhaps the most popular Python library for beginners, as it provides a wide variety of algorithm implementations as well as easy-to-follow tutorials. For more information, visit http://scikit-learn.org

Scikit-learn can be installed on your computer by running the following cell.

In [None]:
%pip install scikit-learn

Once it is installed, many machine learning modules, including linear regression, can be accessed as shown below:

In [None]:
from sklearn.linear_model import LinearRegression

start_time = time.time()
reg = LinearRegression(fit_intercept = False, n_jobs=1).fit(X, Y)
sk_elapsed_time = time.time() - start_time
print('Elapsed time for Scikit-Learn: {} seconds'.format(sk_elapsed_time))

Note that Scikit-learn is designed for data sets that fit in the memory of a single machine. When you have a very large data set, Scikit-learn can be extremely slow.
