### 1: Principal Components Analysis  
This may be helpful:
Jake VanderPlas' Python Data Science Handbook: https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html
You can assume you have a training set 𝑋 with 𝑁 examples, each one a feature vector with 𝐹 features.

####  1a: True/False, providing explanation for each one

* 1a(i): You compute principal components by finding eigenvectors of the dataset's feature matrix 𝑋. - **False. The principal components are eigenvectors of the covariance matrix of the centered data (an 𝐹×𝐹 matrix), not of 𝑋 itself, which is 𝑁×𝐹 and generally not even square. The procedure is: obtain the eigenvectors and eigenvalues of the covariance (or correlation) matrix, or equivalently perform a singular value decomposition of the centered 𝑋. Sort the eigenvalues in descending order and keep the 𝐾 eigenvectors that correspond to the 𝐾 largest eigenvalues, where 𝐾 ≤ 𝐹 is the number of dimensions of the new feature subspace. Construct the projection matrix 𝑊 from the selected eigenvectors, then transform the centered dataset via 𝑊 to obtain its 𝐾-dimensional representation.**


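* 1a(ii): To select the number of components 𝐾 to use, you can find the value of 𝐾 that minimizes reconstruction error on the training set. This choice will help manage accuracy and model complexity. - **False. Reconstruction error on the training set never increases as 𝐾 grows and reaches zero at 𝐾=𝐹, so minimizing it always selects the largest possible 𝐾 and does nothing to manage model complexity. In practice 𝐾 is selected by plotting the number of components against the cumulative explained variance ratio, which shows how much of the variance of the original data is captured by the first 𝐾 components, and picking the smallest 𝐾 that retains a chosen fraction of it. Reconstruction error on held-out data is another option.**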


* 1a(iii): If we already have fit a PCA model with 𝐾=10 components, fitting a PCA model with 𝐾=9 components for the same dataset is easy. - **True. Principal components are nested: the components are ordered by decreasing eigenvalue, so the 𝐾=9 solution consists of the first 9 components of the 𝐾=10 solution. We can drop the last row of the fitted 𝑊 (and the last coordinate of each projection) instead of training again. With scikit-learn's PCA class we could also define a new model with 9 components and fit it, but it would recover the same 9 directions (up to sign).**
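
The three statements above can be checked numerically. The following is a minimal sketch on synthetic data (the dataset, its shape, and the variable names are assumptions made for illustration); it compares eigenvectors of the covariance matrix with scikit-learn's `PCA` components, checks that the 𝐾=9 components match the first 9 of the 𝐾=10 fit, and shows that training reconstruction error never increases with 𝐾.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data (illustrative only): N=200 examples, F=12 correlated features
X = rng.normal(size=(200, 12)) @ rng.normal(size=(12, 12))

# 1a(i): components are eigenvectors of the covariance matrix of centered X
Xc = X - X.mean(axis=0)
cov_FF = Xc.T @ Xc / (X.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(cov_FF)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]           # re-sort in descending order
W_top = eigvecs[:, order[:10]].T            # shape (K, F) with K=10

pca10 = PCA(n_components=10, svd_solver="full").fit(X)
pca9 = PCA(n_components=9, svd_solver="full").fit(X)

# Eigenvectors are only defined up to sign, so compare |inner product| with 1
match_eig = np.allclose(
    np.abs(np.sum(W_top * pca10.components_, axis=1)), 1.0, atol=1e-6)

# 1a(iii): the K=9 components are the first 9 rows of the K=10 components
nested = np.allclose(
    np.abs(np.sum(pca10.components_[:9] * pca9.components_, axis=1)), 1.0, atol=1e-6)

# 1a(ii): training reconstruction error as a function of K
def recon_error(K):
    pca = PCA(n_components=K, svd_solver="full").fit(X)
    X_hat = pca.inverse_transform(pca.transform(X))
    return np.mean((X - X_hat) ** 2)

errors = [recon_error(K) for K in range(1, 13)]

print("eigenvectors match PCA components:", match_eig)
print("K=9 nested inside K=10:", nested)
print("training error by K:", np.round(errors, 4))
```

Because the error curve only goes down, its minimum sits at 𝐾=𝐹, which is why it cannot be used on its own to pick 𝐾.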

#### 1b: You had a dataset for a medical application, with many measurements describing the patient's height, weight, income, daily food consumption, and many other attributes.

* 1b(i): Would the PCA components learned be the same if you used feet or centimeters to measure height? Why or why not? **The learned components will be different. Changing the unit rescales the height feature, which changes its variance and its covariances with the other features. The covariance matrix is therefore different, which results in different eigenvalues and eigenvectors and a different projection matrix. If PCA is based on the correlation matrix instead, the unit of measurement has no effect. To avoid this kind of problem, it is generally good practice to standardize the data first using a scaler such as StandardScaler, RobustScaler, or MinMaxScaler.**


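* 1b(ii): Before applying PCA to this dataset, what (if any) preprocessing would you recommend and why? **Center each feature and rescale the features to comparable ranges. The attributes here are measured in very different units (height, weight, income, food consumption), and PCA favors directions of large variance, so without rescaling the features with the largest numeric spread would dominate the leading components regardless of how informative they are. It is generally good practice to standardize the data using a scaler such as StandardScaler, RobustScaler, or MinMaxScaler.**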
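
The effect of units can be seen in a small experiment. This is a sketch on made-up height and weight data (the distributions and variable names are assumptions, not part of the question): the first principal component changes direction when height is converted from centimeters to feet, but not once both versions are standardized.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
N = 500
# Hypothetical measurements: height in centimeters, weight in kilograms
height_cm = rng.normal(170.0, 10.0, size=N)
weight_kg = 0.5 * (height_cm - 170.0) + rng.normal(70.0, 8.0, size=N)

X_cm = np.column_stack([height_cm, weight_kg])
X_ft = np.column_stack([height_cm / 30.48, weight_kg])  # same heights in feet

def first_component(X):
    """Return the first principal direction of X (length-F vector)."""
    return PCA(n_components=1).fit(X).components_[0]

# Raw data: the direction depends on the unit chosen for height
w_cm = first_component(X_cm)
w_ft = first_component(X_ft)

# Standardized data: the unit change cancels out
w_cm_std = first_component(StandardScaler().fit_transform(X_cm))
w_ft_std = first_component(StandardScaler().fit_transform(X_ft))

# Components are defined up to sign, so print absolute values
print("raw, cm:", np.abs(w_cm), " raw, ft:", np.abs(w_ft))
print("std, cm:", np.abs(w_cm_std), " std, ft:", np.abs(w_ft_std))
```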


#### 1c: Stella has a training dataset, which has 𝑁 example feature vectors 𝑥𝑛, each with 𝐹 features. She applies PCA to her training set to learn a matrix 𝑊 of shape (𝐾,𝐹), where each row represents the basis vector for component 𝑘. She would now like to project her test dataset using the same components 𝑊. She remembers that she needs to center her data, so she applies the following 3 steps to project each test vector x′ to a 𝐾-dimensional vector z′:
$$
m = \frac{1}{T} \sum_{t=1}^T x'_t
\\
\tilde{x}'_t = x'_t - m
\\
z_t = W \tilde{x}'_t
$$

Is this correct? Why or why not? If not, explain what Stella should do differently.

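**It is not correct. Stella centers the test data with the mean of the test set, but 𝑊 was learned from training data centered with the training mean. The test vectors must be centered with that same training mean so that they are expressed in the coordinate system the components were learned in. A test-set mean also depends on which test examples happen to be available: with a single test vector it would map every input to zero. Centering means subtracting the mean only; no division by the standard deviation is needed unless the training data were standardized before PCA. The corrected steps, using the 𝑁 training vectors to compute the mean, are:**

$$
m = \frac{1}{N} \sum_{n=1}^N x_n
\\
\tilde{x}'_t = x'_t - m
\\
z_t = W \tilde{x}'_t
$$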
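
As a rough check of the point about which mean to subtract, the sketch below (synthetic data and variable names are assumptions) projects test vectors with the training mean and compares the result with scikit-learn's `PCA.transform`, which stores the training mean in `mean_`. Centering with the test mean gives a different answer.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Illustrative data: N=100 training and T=20 test vectors with F=6 features
x_train_NF = rng.normal(loc=5.0, size=(100, 6))
x_test_TF = rng.normal(loc=5.0, size=(20, 6))

pca = PCA(n_components=3).fit(x_train_NF)
W_KF = pca.components_            # shape (K, F)
m_F = x_train_NF.mean(axis=0)     # mean of the TRAINING set

# Projection with the training mean
z_manual_TK = (x_test_TF - m_F) @ W_KF.T

# scikit-learn applies the stored training mean internally
z_sklearn_TK = pca.transform(x_test_TF)

# Centering with the test mean instead, for comparison
z_testmean_TK = (x_test_TF - x_test_TF.mean(axis=0)) @ W_KF.T

print("training-mean projection matches sklearn:",
      np.allclose(z_manual_TK, z_sklearn_TK))
print("test-mean projection matches sklearn:",
      np.allclose(z_testmean_TK, z_sklearn_TK))
```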


### 2: K-Means Clustering
Consider the k-means clustering algorithm. This may be helpful:
Jake VanderPlas' Python Data Science Handbook: https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html
You can assume you have a training set 𝑋 with 𝑁 examples, each one a feature vector with 𝐹 features.
#### 2a: True/False questions with one sentence of explanation for each one

* 2a(i): you always get the same clustering of dataset 𝑋 when applying K-means with 𝐾=1, no matter how you initialize the cluster centroids 𝜇 - **True. With only one cluster, every example is assigned to it regardless of where the centroid starts, and the first update moves the centroid to the mean of the whole dataset.**


* 2a(ii): you always get the same clustering of dataset 𝑋 when applying K-means with 𝐾=2, no matter how you initialize the cluster centroids 𝜇 - **False. K-means only converges to a local optimum of its cost function, and which local optimum it reaches depends on the initial centroids. In scikit-learn the initial cluster centers are drawn according to random_state, so fitting KMeans with different random_state values can produce different clusterings.**


* 2a(iii): The only way to find the cluster centroids 𝜇 that minimize the K-means cost function (minimize sum of distances to nearest centroid) is to apply the K-means algorithm, alternatively updating assignments and cluster centroid locations. - **False. The K-means algorithm is one way to decrease the cost, not the only one, and it is only guaranteed to reach a local optimum. It starts by guessing cluster centers, assigns each point to the nearest center by Euclidean distance, moves each center to the mean of its assigned points, and repeats until the centers stop moving. Other procedures can minimize the same cost, for example exhaustively searching over all assignments on a small dataset or applying gradient-based optimization to the centroids.**


* 2a(iv): The K-means cost function requires computing the Euclidean distance from each example 𝑥𝑛 to its nearest cluster centroid 𝜇𝑘. Because the Euclidean distance requires a square root operation (e.g. np.sqrt or np.pow(___, 0.5)), no implementation of the K-means algorithm can be correct unless a "sqrt" operation is performed when computing distances between examples and clusters. - **False. The square root is not necessary. It is a monotonically increasing function, so the centroid with the smallest squared distance is also the centroid with the smallest distance, and the assignments are identical. Skipping the square root saves a step and makes the algorithm run a bit faster.**
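
Two of the claims above are easy to probe on a tiny dataset. The sketch below (random data, with sizes chosen small so enumeration is feasible; all names are assumptions) checks that nearest-centroid assignments are the same with and without the square root, and uses brute-force enumeration of every assignment as one alternative to the K-means iterations for minimizing the squared-distance cost.

```python
import itertools
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
x_NF = rng.normal(size=(10, 2))   # tiny dataset: N=10, F=2
m_KF = rng.normal(size=(2, 2))    # two arbitrary centroids

# 2a(iv): argmin over squared distances equals argmin over distances
sq_dist_NK = ((x_NF[:, None, :] - m_KF[None, :, :]) ** 2).sum(axis=2)
same_assignments = np.array_equal(
    sq_dist_NK.argmin(axis=1), np.sqrt(sq_dist_NK).argmin(axis=1))

# 2a(iii): exhaustive search over all 2**N assignments into at most 2 clusters
def cost_of_assignment(z_N):
    """Sum of squared distances to each cluster's mean for assignment z_N."""
    cost = 0.0
    for k in np.unique(z_N):
        members = x_NF[z_N == k]
        cost += ((members - members.mean(axis=0)) ** 2).sum()
    return cost

best_cost = min(
    cost_of_assignment(np.array(z))
    for z in itertools.product([0, 1], repeat=len(x_NF)))

kmeans_cost = KMeans(n_clusters=2, n_init=10, random_state=0).fit(x_NF).inertia_

print("same assignments with and without sqrt:", same_assignments)
print("brute-force optimum:", best_cost, " K-means inertia:", kmeans_cost)
```

Enumeration is exponential in 𝑁, so it is only a demonstration that other routes to the optimum exist, not a practical replacement.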

#### 2b: Suppose you are given a dataset 𝑋 with 𝑁 examples, as well as a group of 𝐾=5 cluster locations  fit by the K-means algorithm to this dataset. You know 𝑁>5. Describe how you could initialize K-means using 𝐾=6 clusters to obtain a better cost than the 𝐾=5 solution.

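**Reuse the 𝐾=5 solution: initialize five of the six centroids at the five fitted cluster locations, and place the sixth at a data point that does not coincide with any of them (for example, the point farthest from its assigned centroid; 𝑁>5 means there are more points than centroids to choose from). Every example is then at least as close to its nearest centroid as before, and the chosen point has distance zero, so the initial cost is already lower than the 𝐾=5 cost. K-means iterations never increase the cost, so the final 𝐾=6 cost is lower too. A fresh 'k-means++' initialization with 𝐾=6 usually also does better, but it carries no such guarantee. In the call below, init_6F is a name introduced here for the (6, 𝐹) array built as just described.**

    kmeans = sklearn.cluster.KMeans(
        n_clusters=6, init=init_6F, n_init=1)

    kmeans.fit(X)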
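
One way to guarantee a better cost with 𝐾=6 is to warm-start from the 𝐾=5 centroids. The following is a sketch under assumptions (synthetic blobs stand in for the dataset, and `init_6F` and `far_point_F` are illustrative names): five centroids are copied from the 𝐾=5 fit and the sixth is placed on the example farthest from its assigned centroid.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative dataset standing in for X
X, _ = make_blobs(n_samples=300, centers=8, n_features=4, random_state=0)

# Existing K=5 solution
kmeans5 = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# Squared distance from each example to its assigned K=5 centroid
sq_dist_N = ((X - kmeans5.cluster_centers_[kmeans5.labels_]) ** 2).sum(axis=1)
far_point_F = X[np.argmax(sq_dist_N)]   # example with the largest distance

# Six initial centroids: the five fitted ones plus the far point
init_6F = np.vstack([kmeans5.cluster_centers_, far_point_F])

# n_init=1 because the initialization is fixed rather than random
kmeans6 = KMeans(n_clusters=6, init=init_6F, n_init=1).fit(X)

print("K=5 cost:", kmeans5.inertia_)
print("K=6 cost:", kmeans6.inertia_)
```

The starting cost is already below the 𝐾=5 cost because the far point now has distance zero, and the subsequent iterations can only lower it further.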
    
#### 2c: Suppose you are using sklearn's implementation of K-means to fit 10 clusters to a dataset.
You start with code like this:
    
    kmeans = sklearn.cluster.KMeans(
        n_clusters=10, random_state=42, init='random', n_init=1, algorithm='full')

    kmeans.fit(X)
    
List at least two changes you might make to these keyword arguments to improve the quality of your learned 
clusters (as measured by the K-means cost function).

**The changes below should improve the quality of the learned clusters.**

    kmeans = sklearn.cluster.KMeans(
        n_clusters=10, random_state=42, init='k-means++', n_init=10, algorithm='full')

    kmeans.fit(X)

**Two keyword arguments changed: n_init from 1 to 10, and init from 'random' to 'k-means++'.**

**n_init is the number of times the K-means algorithm is run with different centroid seeds. The final result is the best of the n_init runs in terms of inertia (the K-means cost), so more runs make it less likely that we are stuck with a poor local optimum.**

**'k-means++' selects initial cluster centers that are spread out across the data, which speeds up convergence and tends to reach a lower cost than purely random initialization.**
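
The effect of these two arguments can be explored empirically. This is a sketch on synthetic blobs (the dataset and seeds are assumptions; the `algorithm` argument is omitted because its accepted values differ between scikit-learn versions): it records the cost of several single random initializations and compares them with one 'k-means++' fit using `n_init=10`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative dataset standing in for X
X, _ = make_blobs(n_samples=500, centers=10, n_features=5, random_state=0)

# Original settings, repeated over several seeds: one random init per fit
single_run_costs = [
    KMeans(n_clusters=10, init="random", n_init=1, random_state=seed)
    .fit(X).inertia_
    for seed in range(10)
]

# Suggested settings: k-means++ initialization, best of 10 runs
improved = KMeans(
    n_clusters=10, init="k-means++", n_init=10, random_state=42).fit(X)

print("single random runs: min=%.1f max=%.1f"
      % (min(single_run_costs), max(single_run_costs)))
print("k-means++ with n_init=10: %.1f" % improved.inertia_)
```

The spread between the minimum and maximum of the single runs gives a sense of how much a single unlucky initialization can cost.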
    
    
#### 2d: Consider the two steps of the k-means algorithm, assign_examples_to_clusters and update_cluster_locations. Docstrings describing the input and output of each step are given below, so you can be sure what happens in each step.

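* 2d(i): What is the big-O runtime of assign_examples_to_clusters? And explain why? Express in terms of 𝑁,𝐾,𝑎𝑛𝑑 𝐹. - **It takes O(𝑁𝐾𝐹) time. For each of the 𝑁 examples we compute the distance to each of the 𝐾 centroids, and each distance computation loops over all 𝐹 features. Picking the nearest of the 𝐾 centroids for every example adds only O(𝑁𝐾).**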


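* 2d(ii): What is the big-O runtime of update_cluster_locations? And explain why? Express in terms of 𝑁,𝐾,𝑎𝑛𝑑 𝐹. - **It takes O(𝑁𝐹) time. Each row of m_KF is the mean of the examples assigned to that cluster, and every example belongs to exactly one cluster, so each of the 𝑁 feature vectors (𝐹 values each) is added into a running sum exactly once. Dividing the 𝐾 sums by their counts to fill m_KF, which is of size 𝐾×𝐹, costs an extra O(𝐾𝐹), which is dominated by O(𝑁𝐹) when 𝑁 ≥ 𝐾.**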
    
        def assign_examples_to_clusters(x_NF, m_KF):
            ''' Assign each training feature vector to closest of K centroids

            Returned assignments z_N will, for each example n,
            provide the index of the row of m_KF that is closest to vector x_NF[n]

            Args
            ----
            x_NF : 2D array, size n_examples x n_features (N x F)
                Observed data feature vectors, one for each example
            m_KF : 2D array, size n_clusters x n_features (K x F)
                Centroid location vectors, one for each cluster

            Returns
            -------
            z_N : 1D array, size N
                Integer indicator of which cluster each example is assigned to.
                Example n is assigned to cluster k if z_N[n] = k
            '''
            pass

        def update_cluster_locations(x_NF, z_N):
            ''' Update the locations of each cluster

            Returned centroid locations will minimize the distance between
            each cluster k's vector m_KF[k] and its assigned data x_NF[z_N == k]

            Args
            ----
            x_NF : 2D array, size n_examples x n_features (N x F)
                Observed data feature vectors, one for each example
            z_N : 1D array, size N
                Integer indicator of which cluster each example is assigned to.
                Example n is assigned to cluster k if z_N[n] = k

            Returns
            -------
            m_KF : 2D array, size n_clusters x n_features (K x F)
                Centroid location vectors, one for each cluster
            '''
            pass
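
To make the runtime analysis concrete, here is one possible NumPy sketch of the two steps. It is not the reference solution: the optional `K` argument of `update_cluster_locations` and the choice to leave empty clusters at the origin are assumptions added for illustration.

```python
import numpy as np

def assign_examples_to_clusters(x_NF, m_KF):
    """Return z_N, the index of the nearest centroid for each example.

    Work: N * K squared distances, each summing over F features -> O(N K F).
    """
    # Broadcast (N, 1, F) - (1, K, F) -> (N, K, F), then sum over features
    sq_dist_NK = ((x_NF[:, None, :] - m_KF[None, :, :]) ** 2).sum(axis=2)
    # Squared distance is enough: sqrt would not change the argmin
    return sq_dist_NK.argmin(axis=1)

def update_cluster_locations(x_NF, z_N, K=None):
    """Return m_KF, the mean of the examples assigned to each cluster.

    Work: each example is added to one running sum -> O(N F),
    plus O(K F) for the final division.
    """
    if K is None:
        K = int(z_N.max()) + 1        # assumption: highest label is K - 1
    sum_KF = np.zeros((K, x_NF.shape[1]))
    np.add.at(sum_KF, z_N, x_NF)      # accumulate each row into its cluster
    count_K = np.bincount(z_N, minlength=K)
    # Assumption: an empty cluster keeps a zero vector (avoids division by 0)
    return sum_KF / np.maximum(count_K, 1)[:, None]

# Small usage example with two obvious clusters
x_NF = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
m_KF = np.array([[0.0, 0.0], [10.0, 10.0]])

z_N = assign_examples_to_clusters(x_NF, m_KF)
new_m_KF = update_cluster_locations(x_NF, z_N, K=2)
print(z_N)
print(new_m_KF)
```

The broadcasting in the assignment step materializes an (𝑁, 𝐾, 𝐹) array, which makes the O(𝑁𝐾𝐹) cost visible in memory as well as in time.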

In [1]:
import sklearn
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

In [2]:
X, Y = make_classification()
X.shape, Y.shape

((100, 20), (100,))

In [3]:
kmeans1 = KMeans(1, random_state=0)
kmeans1.fit(X, Y)
Y1 = kmeans1.predict(X)

kmeans2 = KMeans(1, random_state=123)
kmeans2.fit(X, Y)
Y2 = kmeans2.predict(X)

np.all(Y1 == Y2)

True

In [4]:
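# Same data, K=2, two different seeds for the centroid initialization
kmeans1 = KMeans(2, random_state=0)
kmeans1.fit(X, Y)  # Y is ignored by KMeans; clustering is unsupervised
Y1 = kmeans1.predict(X)

kmeans2 = KMeans(2, random_state=123)
kmeans2.fit(X, Y)
Y2 = kmeans2.predict(X)

# Element-wise label comparison; note it is also False when both runs find
# the same partition but number the two clusters differently
np.all(Y1 == Y2)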

False

In [5]:
KMeans()

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=8, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)