# Undergraduate Final Project: Semi-Supervised Learning
Data Science and Applied Machine Learning <br/>
Dr. Leslie Kerby <br/>
Student's name: Ziming Wang


**Question 1** <br/>
Import the MNIST Digits Dataset as shown in class. Put all 70k images in one container (no splitting the dataset).

In [None]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 
# not needed in Colab

# Download MNIST digits
from sklearn.datasets import fetch_openml
# as_frame=False returns NumPy arrays, which the integer indexing below relies on
mnist = fetch_openml('mnist_784', version=1, as_frame=False)

In [None]:
X, y = mnist['data'], mnist['target']

In [None]:
mnist['data'].shape

(70000, 784)

**Question 2: Semi-Supervised Learning**<br/>
While labelling all 70k images would be very time-consuming, labelling a small subset is not. A supervised model can be trained on that small subset. This is called "semi-supervised learning".<br/>
Take 150 random images from the 70k image dataset and use their labels to train a Logistic Regression model. Test it on the full dataset and report the accuracy.  

In [None]:
%%time
from sklearn.linear_model import LogisticRegression

# X is a NumPy array, not a pandas DataFrame, so there is no sample() method;
# draw 150 random row indices instead
idx = np.random.permutation(70000)[:150]
X_small = X[idx]

# Train on the 150 labeled images only
semi_log = LogisticRegression()
semi_log.fit(X_small, y[idx])

CPU times: user 230 ms, sys: 110 ms, total: 340 ms
Wall time: 187 ms


In [None]:
print(f'Logistic Regression Semi-Supervised -- Training score: {semi_log.score(X_small,y[idx])}      Testing score: {semi_log.score(X,y)}')

Logistic Regression Semi-Supervised -- Training score: 1.0      Testing score: 0.7596714285714286
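
The random draw above changes on every run, so the reported scores will vary slightly between runs. A minimal sketch of a reproducible draw, assuming NumPy's `default_rng` with an arbitrary seed (the seed value 42 and the names `rng` and `idx_seeded` are illustrative and not part of the original solution):

```python
import numpy as np

n_samples = 70000  # size of the full MNIST container
n_labeled = 150    # number of images whose labels are used

# Seeded generator: the same seed always yields the same 150 indices
rng = np.random.default_rng(42)  # 42 is an arbitrary, illustrative seed
idx_seeded = rng.choice(n_samples, size=n_labeled, replace=False)

print(idx_seeded.shape)            # (150,)
print(len(np.unique(idx_seeded)))  # 150, no index is drawn twice
```

`idx_seeded` could then replace `idx` in the cell above without any other change.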


**Question 3: More Semi-Supervised Learning**<br/>
A more sophisticated semi-supervised method is to use k-means to find "important" or "representative" instances, and then label those.<br/>
Use k-means with k=50. Find the nearest instance to each of the 50 centroids, and these will be your 50 "representative" instances. Use their labels to train a Logistic Regression model. Report the accuracy.

*Hint: You can use KMeans.transform() or KMeans.fit_transform() to transform X to cluster-distance space.*<br/>
*Hint 2: Then you can use np.argmin() to find the indices of the minimum cluster-distance of each cluster. Use that as a Boolean mask on X to select out your 50 'representative' instances.*
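
The two hints can be illustrated on a tiny distance matrix. This is a sketch with made-up numbers (the array `toy_dist` and its values are hypothetical), where rows are instances and columns are centroids, mirroring the shape that `KMeans.transform()` returns:

```python
import numpy as np

# Hypothetical distances: 5 instances (rows) to 3 centroids (columns)
toy_dist = np.array([
    [0.9, 2.0, 3.1],
    [0.2, 2.5, 2.9],
    [1.8, 0.4, 2.2],
    [2.7, 1.1, 0.3],
    [3.0, 0.6, 1.5],
])

# axis=0 searches down each column, giving one row index per centroid
representatives = np.argmin(toy_dist, axis=0)
print(representatives)  # [1 2 3]
```

The result is an integer index array rather than a Boolean mask, so it can be passed directly to `X[...]` and `y[...]`.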

In [None]:
%%time
# Cluster the MNIST images with k-means
from sklearn.cluster import KMeans

k = 50
kmeans = KMeans(n_clusters = k)
kmeans.fit(X)

dist = kmeans.transform(X)

# np.argmin along axis=0 scans each column (one per centroid) and returns
# the row index of the instance nearest to that centroid
cluster_pts = np.argmin(dist, axis=0)

# Train Logistic Regression on the 50 representative instances and their labels
from sklearn.linear_model import LogisticRegression
semi_log = LogisticRegression()
semi_log.fit(X[cluster_pts], y[cluster_pts])

CPU times: user 6min 38s, sys: 19.6 s, total: 6min 57s
Wall time: 5min 43s


In [None]:
print(f'Logistic Regression Semi-Supervised  w/ k-means -- Training score: {semi_log.score(X[cluster_pts],y[cluster_pts])}      Testing score: {semi_log.score(X,y)}')

Logistic Regression Semi-Supervised  w/ k-means -- Training score: 1.0      Testing score: 0.7757


**Question 4: Yet More Semi-Supervised Learning** <br/>
The model from Question 3 could be improved further by assigning the 50 centroid labels to 10 instances nearby, instead of just to the nearest point. Do this and re-train your Logistic Regression model on the 500 labeled points. What is the new accuracy?
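
One way to read this question is as label propagation: each representative's label is copied to the instances nearest its centroid. A minimal sketch on made-up data, assuming 2 centroids and 2 neighbors each instead of 50 and 10 (the arrays `toy_dist` and `toy_y` are hypothetical, and only the representatives' entries of `toy_y` are treated as known):

```python
import numpy as np

# Hypothetical distances: 6 instances (rows) to 2 centroids (columns)
toy_dist = np.array([
    [0.1, 3.0],
    [0.5, 2.8],
    [1.9, 2.1],
    [2.6, 0.2],
    [3.2, 0.7],
    [2.4, 1.6],
])
toy_y = np.array(['3', '3', '8', '5', '5', '3'])  # stand-in for hand labels

n_near = 2  # neighbors per centroid (10 in the question)

# Representative of each centroid: the single nearest instance
reps = np.argmin(toy_dist, axis=0)

# Row indices of the n_near smallest distances per column (unordered)
nearest = np.argpartition(toy_dist, n_near, axis=0)[:n_near]

# Copy each representative's label to all of its centroid's neighbors
propagated_idx = nearest.T.flatten()          # grouped centroid by centroid
propagated_y = np.repeat(toy_y[reps], n_near)

print(sorted(propagated_idx.tolist()))  # [0, 1, 3, 4]
print(propagated_y.tolist())            # ['3', '3', '5', '5']
```

With the real data, `X[propagated_idx]` and `propagated_y` would form the training set, and only the representatives' labels would need to be assigned by hand.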

In [None]:
%%time
from sklearn.cluster import KMeans

k = 50
kmeans = KMeans(n_clusters = k)
kmeans.fit(X)

dist = kmeans.transform(X)

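# np.argpartition along axis=0 puts the row indices of the 10 smallest
# distances in each column (centroid) into the first 10 rows, in no particular order
indices = np.argpartition(dist, 10, axis=0)
repr_500 = indices[:10].flatten()  # 10 nearest instances x 50 centroids = 500 indices
X_500 = X[repr_500]
y_500 = y[repr_500]  # note: these are the instances' own labels taken from y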

from sklearn.linear_model import LogisticRegression
semi_log = LogisticRegression()
semi_log.fit(X_500, y_500)

CPU times: user 7min 39s, sys: 19.8 s, total: 7min 59s
Wall time: 6min 44s


In [None]:
print(f'Logistic Regression Semi-Supervised  w/ k-means -- 500 labeled points nearby -- Training score: {semi_log.score(X_500, y_500)}      Testing score: {semi_log.score(X,y)}')

Logistic Regression Semi-Supervised  w/ k-means -- 500 labeled points nearby -- Training score: 1.0      Testing score: 0.8092714285714285


**Question 5**<br/>
Comment on the accuracy of these different semi-supervised methods. Which one was the best? Worst? Best considering time or effort? How did they compare to other supervised and unsupervised models built previously? Etc...

**Answer**  
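Of the three semi-supervised methods, Q4 (k-means with 10 labeled instances per centroid, 500 labels in total) was the most accurate, reaching about 0.81 on the full dataset. Q3 (50 representative instances) scored about 0.78, slightly better than Q2 (150 random instances) at about 0.76, even though it used only a third as many labels. Q2 was therefore the least accurate of the three.  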

In terms of running time, Q4 was the slowest, at 6 min 44 s of wall time (about 8 min of CPU time), while Q2, the least accurate, took only 187 ms of wall time (about 340 ms of CPU time). Q3 took 5 min 43 s, not much less than Q4, yet its accuracy was lower than Q4's.  

The unsupervised model I built in homework 9, which used only k-means, performed far worse than the semi-supervised models. With t-SNE, however, the unsupervised approach ran much faster than the k-means-based semi-supervised models and reached roughly the same accuracy as Q4.  

As for supervised models, I expect a model trained on all of the labels to be at least as accurate as the semi-supervised and unsupervised models, but it requires every instance to be labeled, which the semi-supervised and unsupervised approaches avoid. In general, each of the three kinds of model has its own strengths and its own use cases.  