# Unsupervised Learning

Most applications of machine learning today are based on supervised learning, where we have labelled data. However, the vast majority of available data is unlabelled: we have the input features $X$, but not the labels $y$. In cases like this we use unsupervised learning. Unsupervised learning has very large potential, and we are only beginning to explore its applications.

Let's look at an example of why we need unsupervised learning. Say we have a manufacturing production line and we wish to decide which items are defective. We can fairly easily create a system that takes pictures of the items every day, so we can build a reasonably large dataset very quickly. However, the data will not be labelled. To make this a supervised learning problem, we would need workers to manually go through and label the data. This is a long, costly and tedious task, so it will usually only be done for a small subset of the data, and with so few training instances the classifier's performance will be disappointing. Also, whenever the company creates a new product or changes an existing one, we will have to start the whole process from scratch.

Unsupervised learning can be used to solve this problem. The algorithm can exploit the unlabelled data without needing humans to label every picture, and it can still help decide whether an item is defective or not. We will look into this process in this chapter.

Earlier we covered dimensionality reduction, which is one of the most common unsupervised learning tasks. We will now look at some other tasks as well:

1. Clustering: the goal is to group similar instances together into **clusters**. Clustering is a great tool for data analysis, customer segmentation, recommendation systems, search engines, semi-supervised learning, dimensionality reduction, and more.

2. Anomaly detection: the objective is to learn what normal data looks like and then use that to detect abnormal instances, such as defective items on a production line or a new trend in a time series.

3. Density estimation: this is the task of estimating the **probability density function** (PDF) of the random process that generated the dataset. Density estimation is commonly used for anomaly detection: instances located in very low-density regions are likely to be anomalies. It is also useful for data analysis and visualisation.
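
As a rough illustration of how density estimation can feed anomaly detection, the sketch below fits a kernel density estimate to some synthetic "normal" data and flags new points that fall in very low-density regions. The synthetic data, the bandwidth, and the 1% threshold are all assumptions made for the example, not values from this chapter.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(42)

# Hypothetical "normal" data: 500 points drawn from a 2D standard Gaussian.
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

# Estimate the PDF with a Gaussian kernel (the bandwidth is an assumed value).
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X_normal)

# score_samples returns the log-density of each instance under the estimate.
log_density = kde.score_samples(X_normal)

# Treat anything below the 1st percentile of the training log-densities
# as an anomaly (the 1% cut-off is an arbitrary choice for illustration).
threshold = np.percentile(log_density, 1)

X_new = np.array([[0.1, -0.2],   # close to the bulk of the data
                  [6.0, 6.0]])   # far away from it
is_anomaly = kde.score_samples(X_new) < threshold
print(is_anomaly)
```

In practice the threshold would be tuned to the tolerated false-alarm rate rather than fixed in advance.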

This chapter is divided into the following sections:
1. Clustering
2. Gaussian Mixtures

## 1. Clustering

Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).

To understand this, let's look at an example. Say you are walking in the park and you see some flowers that you have never seen before. Nearby you see other flowers that are similar to, but not quite the same as, the ones you saw before, so you can infer that both probably belong to the same species or genus. While you may need an expert to tell you what the flowers are, you can definitely tell whether they are similar or not without one. This is the concept upon which clustering is built. We identify similar instances and assign them to clusters, or groups of similar instances, such that the inter-cluster similarity is lower than the intra-cluster similarity (instances in the same cluster are more similar to each other than to instances in other clusters).

Just like in classification, each instance gets assigned to a group. Unlike classification, however, clustering is an unsupervised task, so the groups are not known beforehand: the algorithm has to figure them out on its own. Take the iris dataset we used earlier. On the left, the species are clearly separated because we have the labels. The plot on the right does not have access to the labels, so a classification algorithm cannot be used; this is where clustering algorithms come in. Most clustering algorithms will easily detect the lower-left cluster, but it is much less obvious that the cluster in the upper-right corner is made up of two separate clusters. However, the dataset has two additional features, sepal length and width, which are not represented here, and clustering algorithms can make good use of all the features, so in fact they identify the three clusters fairly well (e.g. using a Gaussian mixture model, only 5 instances out of 150 are assigned to the wrong cluster).
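
The iris experiment described above can be sketched roughly as follows, assuming scikit-learn's `GaussianMixture` and its bundled copy of the iris dataset. The labels `y` are used only to evaluate the result afterwards, never to fit the model, and the `n_init` and `random_state` values are arbitrary choices for this sketch.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

iris = load_iris()
X, y = iris.data, iris.target  # y is kept aside for evaluation only

# Fit a 3-component Gaussian mixture on all four features, without labels.
gm = GaussianMixture(n_components=3, n_init=10, random_state=42)
cluster_ids = gm.fit_predict(X)

# Cluster IDs are arbitrary, so map each cluster to the species
# that appears most often inside it.
mapping = {}
for cluster_id in np.unique(cluster_ids):
    species_in_cluster = y[cluster_ids == cluster_id]
    mapping[cluster_id] = np.bincount(species_in_cluster).argmax()

y_pred = np.array([mapping[c] for c in cluster_ids])
n_wrong = int(np.sum(y_pred != y))
print(f"{n_wrong} of {len(y)} instances fall in a cluster dominated by another species")
```

The exact count may vary with the library version and the random seed, so treat the printed number as indicative rather than as a reproduction of the figure quoted above.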

Clustering is used in a wide variety of applications, such as:

1. Customer Segmentation: you can cluster your customers based on their purchases and their activity on the website. This is useful for understanding who your customers are and what they need, so you can adapt your products and marketing campaigns to each segment. For example, customer segmentation can be useful in recommender systems to suggest content that other users in the same cluster enjoyed.

2. Data Analysis: when you analyse a new dataset, it can be useful to run a clustering algorithm and then analyse each cluster separately. This can help you gain some insights into the dataset. For example, you can use clustering to detect the main topics discussed in a collection of newsgroup posts, or to find similar groups of customers.

3. Dimensionality Reduction: once a dataset has been clustered, it is possible to measure each instance's **affinity** with each cluster (affinity is a measure of how well an instance fits into a cluster). Each instance's feature vector $x$ can then be replaced with the vector of its cluster affinities. If there are $k$ clusters, then this vector is $k$-dimensional. It is typically much lower-dimensional than the original feature vector, but it can preserve enough information for further processing. This is called **feature vector compression**. For example, you can use clustering for dimensionality reduction, then feed the resulting vectors to a linear classifier such as a logistic regression classifier.

4. Anomaly Detection (Outlier Detection): any instance that has a low affinity to all the clusters is likely to be an anomaly. For example, if you have clustered the users of your website based on their behaviour, you can detect users with unusual behaviour, such as an unusual number of requests per second. Anomaly detection is particularly useful for fraud detection; for example, you can use it to detect credit card fraud.

5. Semi-Supervised Learning: if you have only a few instances with labels but a very large dataset, you can perform clustering and propagate those labels to all the instances in the same cluster. This technique can greatly increase the amount of labelled data available for training, which in turn can improve the performance of the subsequent supervised learning algorithm.

6. Search Engines: some search engines let you search for images similar to a reference image. To build such a system, you would first apply a clustering algorithm to all the images in your database; similar images would end up in the same cluster. Then, when a user provides a reference image, you can find the cluster this image would belong to and simply return all the images from that cluster.

7. Image Segmentation: by clustering pixels according to their color, then replacing each pixel's color with the mean color of its cluster, it is possible to considerably reduce the number of different colors in the image. Image segmentation is used in many object detection and tracking systems, as it makes it easier to detect the contour of each object.
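
As a rough illustration of the dimensionality reduction use case (item 3), the sketch below clusters the scikit-learn digits dataset with K-Means and replaces each 64-pixel feature vector with its distances to the cluster centroids before training a logistic regression classifier. Several things here are assumptions for the example: the digits dataset, the choice of $k = 30$, and the use of centroid distance as a stand-in for affinity (a smaller distance meaning a higher affinity).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)  # 64 pixel features per image
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

k = 30  # number of clusters, chosen arbitrarily for this sketch

# KMeans.transform maps each instance to its distances to the k centroids,
# so inside a pipeline it acts as a 64 -> k dimensionality reduction step.
pipeline = make_pipeline(
    KMeans(n_clusters=k, n_init=10, random_state=42),
    LogisticRegression(max_iter=5000),
)
pipeline.fit(X_train, y_train)

X_test_reduced = pipeline[0].transform(X_test)
accuracy = pipeline.score(X_test, y_test)
print(X_test_reduced.shape[1], "features per instance after clustering")
print("test accuracy:", accuracy)
```

The number of clusters is a hyperparameter like any other here, so it could be tuned with cross-validation against the classifier's accuracy.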

There is no universal definition of what a cluster is: it really depends on the context, and different algorithms will capture different kinds of clusters. Some algorithms look for instances centered around a particular point, called a **centroid**. Others look for continuous regions of densely packed instances: these clusters can take on any shape. Some algorithms are hierarchical, looking for clusters of clusters.
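
To make the difference between centroid-based and density-based clusters concrete, here is a minimal sketch that runs K-Means and DBSCAN on the same synthetic two-moons dataset. The dataset and every hyperparameter (`noise`, `eps`, `min_samples`) are assumptions chosen for illustration. The helper `agreement` is a hypothetical function that compares the cluster assignments with the true moons under both possible label orders.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaved half-circles: dense, continuous, but not blob-shaped.
X, y = make_moons(n_samples=1000, noise=0.05, random_state=42)

# Centroid-based: each instance goes to the nearest of 2 centroids.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# Density-based: instances are linked through dense neighbourhoods.
# DBSCAN labels instances it considers noise with -1.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

def agreement(labels, y_true):
    """Fraction of instances matching the true moon, trying both label orders."""
    return max(np.mean(labels == y_true), np.mean(labels == 1 - y_true))

print("K-Means agreement:", agreement(kmeans_labels, y))
print("DBSCAN agreement: ", agreement(dbscan_labels, y))
```

On data like this, one would expect the centroid-based approach to cut straight across the two moons, while the density-based approach can follow their curved shape.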

In this section we will look at two popular clustering algorithms, K-Means and DBSCAN, and explore some of their applications, such as nonlinear dimensionality reduction, semi-supervised learning, and anomaly detection.