# The Market for AI
In the last lesson, we saw models like logistic regression and random forests, which
are examples of supervised learning. That is, these models receive labelled
data (x, y), and learn relationships in the data to make a prediction. 

However, the data we collect won't always have a label, and that's
where different unsupervised learning techniques can help. 

## k-means Clustering

k-means clustering is a method that partitions a dataset into *k* different groups, with each data 
point belonging to the cluster with the nearest mean. Because we choose *k* ourselves, we can segment the data into any number of groups.
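
To make "the cluster with the nearest mean" concrete, here is a minimal NumPy sketch of the usual assign-then-update loop (often called Lloyd's algorithm). The function name `kmeans_sketch` and the toy data are assumptions made for illustration; the rest of this notebook relies on scikit-learn's `KMeans` instead.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=20, seed=0):
    """Illustrative k-means: assign points to the nearest mean, then update the means."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every center: shape (n_points, k).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its points (keep it if the cluster is empty).
        centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return labels, centers

# Hypothetical toy data: two well-separated blobs in 2D.
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])
toy_labels, toy_centers = kmeans_sketch(X_toy, k=2)
print(toy_centers)  # one row per cluster mean
```

A fixed number of iterations keeps the sketch short; a fuller implementation would typically stop once the assignments no longer change.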

For example, say you've developed a tool for Airbnb 
hosts that automates their guest interactions. You have four different 
subscription tiers. You'd like to be able to market each tier to different 
groups, but you are not sure which customer fits in which group. 

You can run k-means clustering on the data to separate it into different groups. You would like to see the results graphically, but the data has many features (including the number of properties managed, the size of each property, and
location), and you can only plot up to three dimensions (x, y, and z axes). 

## Principal Component Analysis

Principal component analysis (PCA) reduces the number of features 
under consideration by combining the original features into a smaller set of 
new ones, the so-called "principal components," which capture as much of the 
variation in the data as possible. With a tool like 
PCA, we can reduce the dimensionality of the data to a viewable form. 
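
As a rough illustration of that reduction, the sketch below runs scikit-learn's `PCA` on made-up data. The toy dataset, the mixing matrix, and the variable names are assumptions for this example only; they are not part of the housing analysis.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data: 200 samples, 5 features driven by 2 hidden factors.
rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 1.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 1.0, 1.0]])
X_toy = factors @ mixing + 0.05 * rng.normal(size=(200, 5))

# Standardize each feature, then keep the two directions of greatest variance.
X_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
pca_toy = PCA(n_components=2)
Z_toy = pca_toy.fit_transform(X_std)

print(Z_toy.shape)                        # (200, 2)
print(pca_toy.explained_variance_ratio_)  # share of variance kept by each component
```

Because the five features here are built from only two underlying factors, two components retain nearly all of the variance. Real data is rarely this tidy, so it is worth checking `explained_variance_ratio_` before trusting a low-dimensional plot.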

Let's take another look at the Boston housing data to get a feel for these tools.

## Step 1: Load the Infrastructure
Run the following cell to load all the functions that we will need later.

In [None]:
from mpl_toolkits import mplot3d
from sklearn.cluster import KMeans
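# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs scikit-learn < 1.2.
from sklearn.datasets import load_boston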
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd 
import seaborn as sns; sns.set()

def norm_data(X):
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    X_norm = (X - mu) / sd

    return X_norm

def pca_transform(n, data, inv = False):
    pca = PCA(n_components = n, random_state = 0)
    data_norm = norm_data(data)
    Z = pca.fit_transform(data_norm)
    if inv:
        Z = pca.inverse_transform(Z)*data.std(axis=0) + data.mean(axis=0)

    return Z

def plot_PCs(n, data, y, km = False):

    Z = pca_transform(n, data)

    if km:
        kmeans = KMeans(4)
        kmeans.fit(Z)
        cl = kmeans.labels_

    # Split data by MEDV > mean.
    gt = Z[y==1]
    lt = Z[y==0]

    # 3D k-means.
    if km and n == 3:
        ax = plt.subplot(projection='3d')
        ax.scatter3D(Z[:,0], Z[:,1], Z[:,2], c=cl,  cmap = "Accent")

        ax = plt.gca()
        ax.set_zlabel("3rd PC")
        plt.xlabel("1st PC")
        plt.ylabel("2nd PC")


    # 3D MEDV split.
    elif n == 3:
        ax = plt.subplot(projection='3d')
        ax.scatter3D(gt[:,0], gt[:,1], gt[:,2], marker = '+', cmap = "Accent",
                    label = "Greater than mean MEDV")
        ax.scatter3D(lt[:,0], lt[:,1], lt[:,2], s =5, marker = 'o', cmap = "Accent",
                    label = "Less than mean MEDV")

        ax = plt.gca()
        ax.set_zlabel("3rd PC")
        plt.xlabel("1st PC")
        plt.ylabel("2nd PC")

    # 2D k-means.
    if km and n == 2:
        plt.scatter(Z[:,0], Z[:,1], c=cl,  cmap = "Accent")

        plt.xlabel("1st PC")
        plt.ylabel("2nd PC")
    # 2D MEDV split.
    elif n == 2:
        plt.scatter(gt[:,0], gt[:,1],  marker = '+', cmap = "Accent",
                    label = "Greater than mean MEDV")
        plt.scatter(lt[:,0], lt[:,1], s=5, marker = 'o', cmap = "Accent",
                    label = "Less than mean MEDV")
        plt.xlabel("1st PC")
        plt.ylabel("2nd PC")

    # 1D k-means.
    if km and n == 1:
        plt.scatter(Z[:,0],[0]*Z.shape[0], s=5, c=cl, cmap = "Accent")

        plt.xlabel("1st PC")
        plt.title("Boston Housing: PCA Reduced")
        ax = plt.gca()
        ax.set(yticks=[])
    # 1D MEDV split.
    elif n == 1:
        plt.scatter(gt[:,0],[0]*gt.shape[0], marker = '+', cmap = "Accent",
                    label = "Greater than mean MEDV")
        plt.scatter(lt[:,0],[0]*lt.shape[0], s=5, marker = 'o', cmap = "Accent",
                    label = "Less than mean MEDV")

        plt.xlabel("1st PC")
        plt.title("Boston Housing: PCA Reduced")
        ax = plt.gca()
        ax.set(yticks=[])

    if not km:
        plt.legend()
    if km:
        plt.title("Boston Housing: PCA and K-Means")
    else:
        plt.title("Boston Housing: PCA Reduced")
    plt.show()

def plot_diff_means(d1, d2, labels):
    fig, ax = plt.subplots()
    bar_width = 0.35
    opacity = 0.4
    index = np.arange(d1.shape[1])
    mean1 = d1.mean(axis=0)
    mean2 = d2.mean(axis=0)
    diff = 100*(mean1 - mean2) / mean2
    
    rect1 = ax.bar(index, diff, bar_width,
                alpha=opacity, color='b',
                label = "% Difference")

    plt.xticks(index, labels, rotation = 45)
    plt.ylabel("Percentage")
    plt.title("Difference of Means: Points of Interest vs. Original Data")
    plt.legend()
    
    plt.show()

## Step 2: Run a Principal Component Analysis

First, we will reduce the dimensionality and look at which homes have a value (MEDV) greater than or
less than the mean MEDV across the dataset. Run the following cell to see the results.

In [None]:
data = load_boston().data
target = load_boston().target
y = np.zeros_like(target)
y[target > target.mean()] = 1

for i in range(1,4):
    plot_PCs(i, data, y)

### Results

Originally, the data contained 506 examples with 13 different features, which made it
impossible to plot directly. With the help of PCA, we can reduce our dataset
down to three, two, or one principal components, each a combination of the
original 13 features. This makes visualization possible. 

It's worth taking another look at the graph with one principal component. There
are regions at the edges where home values are mostly above or mostly below the mean MEDV.
By investigating how our original 13 features behave in these regions, 
we might find out why these home values are so different.
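
One complementary way to connect a principal component back to the original features is to inspect its loadings, which scikit-learn exposes as `PCA.components_`. This is a sketch on made-up data with hypothetical column names (`feat_a`, `feat_b`, `feat_c`); it is not the approach taken in the next step, which maps selected points back into the original feature space instead.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical toy data with made-up feature names (not the Boston columns).
rng = np.random.default_rng(1)
base = rng.normal(size=300)
toy = pd.DataFrame({
    "feat_a": base + 0.1 * rng.normal(size=300),   # moves with the hidden factor
    "feat_b": -base + 0.1 * rng.normal(size=300),  # moves against it
    "feat_c": rng.normal(size=300),                # unrelated noise
})
toy_std = (toy - toy.mean()) / toy.std(ddof=0)

pca_toy = PCA(n_components=1).fit(toy_std)
# components_ has one row per principal component and one column per original feature.
loadings = pd.Series(pca_toy.components_[0], index=toy.columns)
print(loadings)
```

Features with large absolute loadings contribute most to the component, and opposite signs mean the features move in opposite directions along it. The overall sign of a component is arbitrary, so only relative signs are meaningful.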

## Step 3: Run a Second Principal Component Analysis

Run the following cell to see which features distinguish the MEDV outliers.

In [None]:
# Code to pick out interesting data.
Z = pca_transform(1, data)
# Among homes below the mean MEDV, find the smallest PC1 value.
min_lt = Z[y==0].min()
ind = []
for i, v in enumerate(Z):
    # Grab homes that are to the left of min_lt on the PC1 graph.
    if v < min_lt:
        ind.append(i)

# Go back to 13 features.
Z_rec = pca_transform(1, data, inv=True)
# Grab points of interest in the original 13 feature space.
Z_rec[ind]
bar_labels = load_boston().feature_names
# Plot the difference between the data as a whole and our points of interest.
plot_diff_means(Z_rec[ind], data, bar_labels)

CRIM and ZN immediately stand out.

- CRIM is the per capita crime rate 
- ZN is the proportion of residential land zoned for lots over 25,000 sq. ft. (2,323 sq. m.).

CRIM is much lower and ZN is much higher than in the data as a whole. This is probably an exclusive
neighborhood filled with large homes; a 25,000 sq. ft. (2,323 sq. m.) lot is pretty big. 

By looking at the graph of the first principal component, we see that there are
two regions that stand out from the data: the region < -4 and the region > 4.

We took these data points < -4 and transformed them back into the original 13
feature space. After comparing the means of our data of interest with the
original data, we saw that CRIM and ZN were very different, leading us to the
conclusion that our data of interest probably represents an exclusive
neighborhood filled with large homes. 
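
The round trip described above (standardize, project onto one component, then map back and undo the standardization) can be sketched on made-up data as follows. It mirrors what `pca_transform(..., inv=True)` does in this notebook, but the toy data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data: 3 features that mostly vary along one shared direction.
rng = np.random.default_rng(2)
t = rng.normal(size=(100, 1))
X_toy = t @ np.array([[2.0, -1.0, 0.5]]) + 0.05 * rng.normal(size=(100, 3))
X_toy = X_toy + np.array([10.0, 20.0, 30.0])  # give each feature a nonzero mean

mu, sd = X_toy.mean(axis=0), X_toy.std(axis=0)
pca_toy = PCA(n_components=1)
Z_toy = pca_toy.fit_transform((X_toy - mu) / sd)    # 3 features -> 1 component

# Map back to standardized feature space, then undo the standardization.
X_rec = pca_toy.inverse_transform(Z_toy) * sd + mu  # 1 component -> 3 features

print(X_rec.shape)                  # (100, 3)
print(np.abs(X_rec - X_toy).max())  # largest reconstruction error
```

The reconstruction is an approximation: whatever variation the discarded components carried is lost. That is why the recovered housing values should be read as a smoothed profile of each region rather than as the exact original rows.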

In the following exercise, you will repeat this process for values from the first principal component > 4. 

## Step 4: Run Another PCA

In step 3, we discovered that two regions stood out from the data: < -4 and > 4. We then transformed the points with first principal component values < -4 back into the original 13-feature space. 


In this step, you will add values to the following code cell to transform the values > 4 from the first principal component. 

**Note**: To see the code that we used, see **Answer Code** below the code cell.

In [None]:
# Code to investigate the opposite side of the first principal component.
Z = pca_transform(1, data)
# Of all homes > mean MEDV, what is the greatest? 
max_gt = 
ind = []
Z_rec = 
plot_diff_means(Z_rec[ind], data, bar_labels)

### Answer Code
We used the following code in the code cell.

```python
Z = pca_transform(1, data)
# Of all homes > mean MEDV, what is the greatest? 
max_gt = Z[y==1].max()
ind = []
for i, v in enumerate(Z):
    # Grab homes that are to the right of max_gt on the PC1 graph.
    if v > max_gt:
        ind.append(i)

# Go back to 13 features.
Z_rec = pca_transform(1, data, inv=True)
# Grab points of interest in the original 13 feature space.
Z_rec[ind]
bar_labels = load_boston().feature_names
# Plot the difference between the data as a whole and our points of interest.
plot_diff_means(Z_rec[ind], data, bar_labels)
```

## Step 5: k-means Clustering 

Now that we're able to visualize our data, let's segment it further. Let's say
you're selling four different versions of a product to families in the Boston
area. You'd like to segment your potential customers into four distinct groups,
so you can better target your advertising to them. 

Run the following cell to perform k-means clustering on the PCA-reduced Boston housing data:

In [None]:
for i in range(1,4):
    plot_PCs(i, data, y, km = True)

We could then follow the same process to back out what our groups look like in
our original data to come up with profiles on them. 

k-means clustering groups data without needing any labels. Here it split the
housing data into four clusters, and it can produce any number of clusters you
ask for. This can be very useful, especially for customer segmentation. 

## Additional Exercise 
Clustering or segmenting a dataset into different groups is valuable for a
number of cases; for example, trying to target customers for specific products
within a product offering. For further practice:
- Investigate how the different clusters correspond to the original 13
  features by following the same procedure from the PCA from earlier. 
    - Isolate the principal component values that correspond to each group.
    - Back out your highlighted data to the original 13 feature space.
    - Compare the means of the highlighted data to those of the original data.
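
As a starting point for this exercise, the sketch below shows one possible way to profile clusters on made-up data. The column names and toy blobs are hypothetical, and unlike the steps above it groups the original rows by cluster label rather than using reconstructed values, so treat it as a variation on the procedure and not the notebook's own method.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Hypothetical toy data standing in for the housing features.
rng = np.random.default_rng(3)
toy = pd.DataFrame(
    np.vstack([
        rng.normal(loc=[0, 0, 0], scale=0.3, size=(40, 3)),
        rng.normal(loc=[4, 4, 0], scale=0.3, size=(40, 3)),
        rng.normal(loc=[0, 4, 4], scale=0.3, size=(40, 3)),
    ]),
    columns=["feat_a", "feat_b", "feat_c"],
)

# Standardize, reduce to two components, and cluster in the reduced space.
toy_std = (toy - toy.mean()) / toy.std(ddof=0)
Z_toy = PCA(n_components=2).fit_transform(toy_std)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z_toy)

# Profile each cluster in the original feature space.
cluster_means = toy.groupby(labels).mean()
# Percent difference of each cluster's means from the overall means.
profile = 100 * (cluster_means - toy.mean()) / toy.mean()
print(profile.round(1))
```

Each row of `profile` plays the same role as one bar chart from `plot_diff_means`: it shows which features set that cluster apart from the data as a whole.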


## Additional Reading
For more information about the math in this coding exercise, see the following
resources:

- [Andrew Ng on K-Means](https://www.youtube.com/watch?v=Ev8YbxPu_bQ)
- [PCA Tutorial](http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf)

## Conclusion
We've seen two new tools from unsupervised learning (learning on unlabeled
data): k-means clustering and PCA. k-means clustering assigned each data point to a cluster,
based on the number of groups we wanted. PCA reduced the dimensionality of our
data, making it possible to visualize. These are both effective tools because
most of the data "in the wild" doesn't come with labels. PCA and
k-means add meaning to this data.