# Machine Learning
## Compulsory task

1. For each of the following examples describe at least one possible input and
output. Justify your answers:  
* 1.1 A self-driving car
* 1.2 Netflix recommendation system
* 1.3 Signature recognition
* 1.4 Medical diagnosis


1. Answer here

|      | Input(s) | Output(s) |
| ---- | -------- | --------- |
| 1.1  | Sensor data from LIDAR, RADAR and cameras; GPS data with maps | Steering angle, motor acceleration, braking force applied |
| 1.2  | User's watch history, cookies and search history, ratings of shows watched, shows saved to the watchlist, number of films vs series watched, demographic information | List of recommended movies/shows (and now video game apps), with a confidence score ("your percentage match") for each show |
| 1.3  | Image(s) of the signature or, if sketched digitally, stroke lengths and line thickness (pressure applied to the digital pen), all broken down into key features such as where loops are located, whether letters are joined up, the spikiness of lines, etc. A saved reference of what the user's signature should look like is also needed to compare against | Verification result: true if the new signature matches the saved known signature |
| 1.4  | Patient's medical history, DNA information, family history, symptoms, test results, scans, images of problem areas, subjective pain scores, medication that has already been tried, etc. | Diagnosis, confidence score of the diagnosis, referral contacts for local specialists, drug recommendations, differential diagnosis queries to narrow potential conditions down to a smaller set |
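
To make the input/output idea in row 1.3 concrete, here is a minimal sketch of signature verification as a function from features to a true/false result. The feature names, the example values and the distance threshold are all made up for illustration; a real system would learn its features and threshold from data.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class SignatureFeatures:
    """Hypothetical features extracted from one signature sample."""
    loop_count: float     # number of closed loops in the signature
    stroke_length: float  # total pen-path length, normalised to 0..1
    mean_pressure: float  # average pen pressure, normalised to 0..1

    def as_vector(self) -> list[float]:
        return [self.loop_count, self.stroke_length, self.mean_pressure]


def verify(candidate: SignatureFeatures,
           reference: SignatureFeatures,
           threshold: float = 0.5) -> bool:
    """Return True when the candidate is close enough to the saved reference.

    The threshold of 0.5 is an arbitrary illustrative value.
    """
    distance = sqrt(sum((c - r) ** 2
                        for c, r in zip(candidate.as_vector(),
                                        reference.as_vector())))
    return distance <= threshold


# Input: a saved reference plus a new sample. Output: a verification result.
reference = SignatureFeatures(loop_count=3, stroke_length=0.80, mean_pressure=0.60)
genuine = SignatureFeatures(loop_count=3, stroke_length=0.78, mean_pressure=0.62)
forged = SignatureFeatures(loop_count=1, stroke_length=0.40, mean_pressure=0.90)

print(verify(genuine, reference))
print(verify(forged, reference))
```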


2. For each of the following case studies, determine whether it is appropriate to utilise regression or classification machine learning algorithms. Justify your answers:
* 2.1 Classifying emails as promotion or social based on their content and metadata.
* 2.2 Forecasting the stock price of a company based on historical data and market trends.
* 2.3 Sorting images of animals into different species based on their visual features.
* 2.4 Predicting the likelihood of a patient having a particular disease based on medical history and diagnostic test results.

2. Answer here
* 2.1 Classification - there are two predefined buckets to sort emails into, without a continuous scale between them
* 2.2 Regression - stock price is a continuous quantity that can take any non-negative value. Regression lets the algorithm fit its numeric predictions to the true historical performance of the stock
* 2.3 Classification - species are by definition separate, and viable/fertile animals can belong to a single species only, so an image of an animal can only be of a single species
* 2.4 Regression (assuming that we are only trying to diagnose a single disease; otherwise classification would also be involved!) - the output is a likelihood, which is a continuous value, so the question is how closely we can fit patients' metrics across numerous variables (history, observations, test results, etc.) to the likelihood that the patient has, or will develop, this specific disease. The question leaves a bit of ambiguity: it could mean diagnosing a single disease now, or whether the patient will ever develop said disease in their lifetime, so I hope I don't get marked down for this!
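
The difference between the two families comes down to the type of output: a number for regression, a label for classification. The sketch below illustrates this with two deliberately tiny models written from scratch (a least-squares line and a nearest-centroid rule). The prices, the "discount word count" feature and the function names are invented for illustration and are not part of the task.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar


def nearest_centroid(train, query):
    """Classify `query` by the label whose mean feature value is closest."""
    centroids = {label: mean(values) for label, values in train.items()}
    return min(centroids, key=lambda label: abs(centroids[label] - query))


# Regression (as in 2.2): made-up closing prices on days 1..5.
# The output is a continuous number.
days = [1, 2, 3, 4, 5]
prices = [10.0, 12.0, 14.0, 16.0, 18.0]
slope, intercept = fit_line(days, prices)
predicted_price = slope * 6 + intercept

# Classification (as in 2.1): made-up count of discount words per email.
# The output is one of two predefined labels.
emails = {"promotion": [8, 9, 10], "social": [0, 1, 2]}
predicted_label = nearest_centroid(emails, 7)

print(predicted_price)
print(predicted_label)
```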

3. For each of the following real-world problems, determine whether it is appropriate to utilise a supervised or unsupervised machine learning algorithm. Justify your answers:
* 3.1 Detecting anomalies in a manufacturing process using sensor data without prior knowledge of specific anomaly patterns.
* 3.2 Predicting customer lifetime value based on historical transaction data and customer demographics.
* 3.3 Segmenting customer demographics based on their purchase history, browsing behaviour, and preferences.
* 3.4 Analysing social media posts to categorise them into different themes.


3. Answer here
* 3.1 Unsupervised - detecting anomalies without pre-existing labels in the training set of what an anomaly should look/behave/feel/smell like
* 3.2 Supervised - the task is to predict a continuous value (lifetime monetary value to our company), and our inputs are labelled: we know from many previous customers what they spent over their lifetimes and what demographics they belonged to. The algorithm can therefore be trained to match its predictions to the true lifetime value that we recorded for those historical customers
* 3.3 Unsupervised - we are segmenting customers into different groups without pre-existing segment labels in the inputs. The inputs (purchase history, browsing behaviour, preferences) are not themselves demographic labels. We are not telling the algorithm what to aim for; we are letting it find the groupings by itself
* 3.4 Supervised - assuming we have predefined our themes and have example posts labelled with them, the algorithm has a target to learn from. If the themes were not known in advance, this would instead be an unsupervised clustering problem
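
As an illustration of the unsupervised case in 3.1, the sketch below flags anomalous sensor readings without ever being told what an anomaly looks like: it only measures how far each reading sits from the bulk of the data. The readings and the z-score threshold of 2.0 are assumptions chosen for the example, and a z-score rule is just one simple technique among many.

```python
from statistics import mean, pstdev


def find_anomalies(readings, z_threshold=2.0):
    """Return indices of readings whose z-score exceeds z_threshold.

    No labels are used: "normal" is inferred from the data itself.
    """
    centre = mean(readings)
    spread = pstdev(readings)
    if spread == 0:
        return []  # all readings identical, nothing stands out
    return [i for i, value in enumerate(readings)
            if abs(value - centre) / spread > z_threshold]


# Made-up temperature readings from a production line; one is clearly off.
readings = [20.1, 19.9, 20.0, 20.2, 19.8, 20.0, 35.0, 20.1]
anomalies = find_anomalies(readings)
print(anomalies)
```

A supervised version of the same task would need historical readings already marked as "anomaly" or "normal", which is exactly what 3.1 says we do not have.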

4. For each of the following real-world problems, determine whether it is appropriate to utilise semi-supervised machine learning algorithms. Justify your answers:
* 4.1 Predicting fraudulent financial transactions using a dataset where most transactions are labelled as fraudulent or legitimate.
* 4.2 Analysing customer satisfaction surveys where only a small portion of the data is labelled with satisfaction ratings.
* 4.3 Identifying spam emails in a dataset where the majority of emails are labelled.
* 4.4 Predicting the probability of default for credit card applicants based on their complete financial and credit-related information.


4. Answer here
All of these answers pivot on the proportion of our dataset which is labelled!
* 4.1 Supervised: Because most of the transactions are labelled, I would suggest using the fully labelled subset of our dataset to train a fully supervised ML model. This probably generalises better to unseen transactions. Adding more labelled data also wouldn't be as expensive, since we already have so much of it.
* 4.2 Semi-supervised: Only a small portion of the dataset is labelled, which makes this problem a good candidate for semi-supervised ML. It is cheaper to learn a function from the few labels we have and propagate those labels to similar data points in the rest of the dataset than it would be to pay humans for their time to label the rest manually. We can extrapolate to "fill in the blanks", so to speak, and still predict a continuous "customer satisfaction value" despite the missing labels.
* 4.3 Supervised: The majority of the emails are labelled, so I would suggest training a supervised model. Train the model only on the labelled data, then use it to label the unlabelled emails accordingly. This is cheaper than paying human reviewers.
* 4.4 Supervised: We have vast data for this, we know exactly which historical customers defaulted, and we have those customers' complete data, so we can fully train a supervised ML model to predict the probability of default for unseen future credit card applicants.
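
The "fill in the blanks" idea from 4.2 can be sketched as a very small self-training loop: start from the few labelled points and repeatedly give the closest unlabelled point the label of its nearest already-labelled neighbour. Everything here is an assumption made for illustration: the two features (hypothetical support and delivery scores out of 10), the discrete labels used instead of a continuous rating, and the nearest-neighbour rule itself.

```python
from math import dist


def propagate_labels(labelled, unlabelled):
    """Self-training with a 1-nearest-neighbour rule.

    labelled:   list of (features, label) pairs
    unlabelled: list of feature tuples
    Returns a list of (features, label) pairs covering both sets.
    """
    known = list(labelled)
    pending = list(unlabelled)
    while pending:
        # Pick the unlabelled point closest to any already-labelled point,
        # i.e. the one we are most confident about.
        best_point, best_label, best_distance = None, None, float("inf")
        for point in pending:
            for features, label in known:
                d = dist(point, features)
                if d < best_distance:
                    best_point, best_label, best_distance = point, label, d
        known.append((best_point, best_label))
        pending.remove(best_point)
    return known


# Two labelled survey responses and four unlabelled ones (made-up values).
labelled = [((1.0, 1.0), "unsatisfied"), ((9.0, 9.0), "satisfied")]
unlabelled = [(2.0, 1.5), (8.0, 8.5), (3.0, 3.5), (6.5, 6.0)]

result = dict(propagate_labels(labelled, unlabelled))
for features, label in result.items():
    print(features, label)
```

Newly labelled points are added to the known set, so later points can inherit labels from earlier pseudo-labelled neighbours. That is also the main risk of this approach: an early mistake gets propagated.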