# Machine Learning: *The Briefest Survey Ever*

Key questions:

  1. What *is* machine learning (or "ML")?
  2. When to *use* ML?

From [Wikipedia](https://en.wikipedia.org/wiki/Machine_learning): "*Machine learning (ML) is the study of computer algorithms that improve automatically through experience*"

 <img src="wordcloud.png" alt="ML word cloud from https://cmci.colorado.edu/classes/INFO-4604/" width="800"> 

Machine learning is most often "the answer" when (1) **data input is unbounded** and/or (2) **output should improve in quality with continued "learning"** (ideally in real time).

# Case Study 1 - The Creepy Know-What-You-Want Services

Netflix, Hulu, etc., have a weird way of knowing what you want.  How so?

Consider Netflix, which has lots of movies, lots of users, and the star ranking of movies by those users.  

Suppose user $i$ has rated movie $j$ with the value $s_{ij}$.  A whole matrix of these scores can be formed:

$$
\mathbf{S} =
\begin{bmatrix}
s_{11} & s_{12} & \cdots & s_{1j} & \cdots & s_{1M} \\
s_{21} & s_{22} & \cdots & s_{2j} & \cdots & s_{2M} \\
\vdots & \vdots & \ddots & \vdots &        & \vdots \\
s_{i1} & s_{i2} & \cdots & s_{ij} & \cdots & s_{iM} \\
\vdots & \vdots &        & \vdots & \ddots & \vdots \\
s_{N1} & s_{N2} & \cdots & s_{Nj} & \cdots & s_{NM}
\end{bmatrix}
$$

for $N$ users and $M$ movies.  

The problem is, what if a score $s_{ij}$ doesn't exist because that user never rated that movie?

We assume some sort of relationship between known scores and unknown scores! 

A *simple* option is *linear*:

$$
    \mathbf{S} = \mathbf{A} \mathbf{B}
$$

where $\mathbf{A} \in \mathbb{R}^{N\times r}$, $\mathbf{B} \in \mathbb{R}^{r\times M}$, and $r \ll N, M$.

How?  With a low-rank matrix factorization, a relative of the singular value decomposition (SVD).  See https://developers.google.com/machine-learning/recommendation/collaborative/matrix and http://ethen8181.github.io/machine-learning/recsys/1_ALSWR.html.
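
To make the factorization concrete, here is a minimal sketch on a tiny, fully observed rating matrix. It uses NumPy's truncated SVD as a stand-in, which is an assumption made for illustration: a real recommender has to cope with missing entries, and plain SVD does not. The matrix values and the rank `r = 2` are invented.

```python
import numpy as np

# Hypothetical 4-user x 5-movie rating matrix (values invented for illustration).
S = np.array([
    [5.0, 4.0, 1.0, 1.0, 2.0],
    [4.0, 5.0, 1.0, 2.0, 1.0],
    [1.0, 1.0, 5.0, 4.0, 5.0],
    [2.0, 1.0, 4.0, 5.0, 4.0],
])

r = 2  # assumed rank of the approximation

# Thin SVD: S = U @ diag(sigma) @ Vt
U, sigma, Vt = np.linalg.svd(S, full_matrices=False)

# Keep the r largest singular values and split them between the two factors.
A = U[:, :r] * np.sqrt(sigma[:r])            # shape (4, r): one row per user
B = np.sqrt(sigma[:r])[:, None] * Vt[:r, :]  # shape (r, 5): one column per movie

S_approx = A @ B
rel_error = np.linalg.norm(S - S_approx) / np.linalg.norm(S)

print("A shape:", A.shape, " B shape:", B.shape)
print("relative error of rank-2 approximation: {:.3f}".format(rel_error))
```

Each row of `A` can be read as a user's affinity for `r` hidden "taste" factors, and each column of `B` as how strongly a movie expresses those factors.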

First, download [this file](http://files.grouplens.org/datasets/movielens/ml-100k.zip) and unzip it!

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# adjust this path to wherever ml-100k was unzipped
file_path = '/home/robertsj/Downloads/ml-100k/u.data'
names = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv(file_path, sep = '\t', names = names)
print(df.shape)
df.head()

In [None]:
# create the rating matrix S
n_users = df['user_id'].unique().shape[0]
n_items = df['item_id'].unique().shape[0]
S = np.zeros((n_users, n_items)) # careful: a zero here means "no rating", not a rating of zero
for row in df.itertuples(index = False):
    S[row.user_id - 1, row.item_id - 1] = row.rating
# percentage of entries that hold a rating (a small value means a very sparse matrix)
matrix_size = np.prod(S.shape)
interaction = np.flatnonzero(S).shape[0]
sparsity = 100 * (interaction / matrix_size)
print('dimension: ', S.shape)
print('sparsity: {:.1f}%'.format(sparsity))
S

In [None]:
# hold out 10 randomly chosen ratings per user as a test set; the rest stay in train
test = np.zeros(S.shape)
train = S.copy()
for user in range(S.shape[0]):
    test_index = np.random.choice(
        np.flatnonzero(S[user]), size = 10, replace = False)
    train[user, test_index] = 0.0
    test[user, test_index] = S[user, test_index]
train

In [None]:
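from sklearn.decomposition import NMF
# approximate train by A @ B with rank r = 10; note that NMF treats the zeros
# (missing ratings) as real values, so this is only a rough first attempt
model = NMF(n_components=10, init='random', random_state=0)
A = model.fit_transform(train)   # shape (n_users, 10)
B = model.components_            # shape (10, n_items)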
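
The product of the two factors fills in every entry of the rating matrix, including the ones nobody rated. A minimal sketch of how held-out ratings might be scored follows; the helper name `masked_rmse` and the toy arrays are hypothetical, and only the nonzero (that is, actually rated) entries are compared.

```python
import numpy as np

def masked_rmse(predicted, actual):
    """Root-mean-square error over the entries where `actual` is nonzero.

    Zeros in `actual` are treated as "no rating" and ignored.
    """
    mask = actual != 0
    diff = predicted[mask] - actual[mask]
    return np.sqrt(np.mean(diff ** 2))

# Toy stand-ins (invented values) for a predicted matrix and a held-out test matrix.
predicted = np.array([[4.5, 2.0, 3.0],
                      [1.0, 3.5, 5.0]])
held_out = np.array([[5.0, 0.0, 3.0],
                     [0.0, 4.0, 0.0]])

rmse = masked_rmse(predicted, held_out)
print("RMSE on held-out ratings: {:.3f}".format(rmse))
```

With the notebook's own variables, the analogous call would be `masked_rmse(A @ B, test)`.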

# Case Study 2 - Who Belongs Where?

Clustering assigns unlabeled data points to groups of similar points.  See scikit-learn's `KMeans`: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
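
As a minimal sketch of the clustering idea, the cell below groups synthetic 2-D points with scikit-learn's `KMeans`. The data from `make_blobs`, the choice of three clusters, and the random seeds are all assumptions made for illustration; they are not part of the original case study.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: 300 points drawn around 3 well-separated centers (assumed setup).
centers = np.array([[-5.0, -5.0], [0.0, 5.0], [5.0, -5.0]])
X, true_labels = make_blobs(n_samples=300, centers=centers,
                            cluster_std=0.5, random_state=0)

# Fit k-means with k = 3; the n_init restarts guard against a poor initialization.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("cluster sizes:", np.bincount(labels))
print("cluster centers:\n", kmeans.cluster_centers_)
```

The cluster numbering is arbitrary, so the fitted labels need not match `true_labels` index for index even when the grouping itself is recovered.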

# Case Study 3 - Given $x_i$ and $y_i$, how to predict $y_j$?

In [None]:
# https://towardsdatascience.com/deep-neural-multilayer-perceptron-mlp-with-scikit-learn-2698e77155e

from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
import pandas as pd

cal_housing = fetch_california_housing()
X = pd.DataFrame(cal_housing.data,columns=cal_housing.feature_names)
y = cal_housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=1, test_size=0.2)

# fit the scaler on the training data only, then apply the same scaling to the test data
sc_X = StandardScaler()
X_trainscaled = sc_X.fit_transform(X_train)
X_testscaled = sc_X.transform(X_test)

In [None]:
reg = MLPRegressor(hidden_layer_sizes=(8,8),activation="relu",
                   random_state=1, max_iter=100).fit(X_trainscaled, y_train)

In [None]:
y_pred=reg.predict(X_testscaled)

print("R^2 score on the test set:", r2_score(y_test, y_pred))
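
The coefficient of determination is not symmetric in its arguments: scikit-learn's `r2_score` expects the true values first and the predictions second. A minimal sketch with made-up numbers shows that swapping them changes the result.

```python
import numpy as np
from sklearn.metrics import r2_score

# Invented values for illustration only.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 2.5, 3.0])

# R^2 = 1 - SS_res / SS_tot, where SS_tot is measured around the mean
# of the FIRST argument, so the argument order matters.
correct = r2_score(y_true, y_hat)
swapped = r2_score(y_hat, y_true)

print("r2_score(y_true, y_hat) =", correct)
print("r2_score(y_hat, y_true) =", swapped)
```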

In [None]:
plt.plot(y_test, y_pred, 'o')