# Linear regression
- Calculate the regression line by fitting a line that is as close as possible to every data point
    - The **Least Squares Method** is used; it measures closeness only in the vertical (**up** and **down**) direction, minimising the sum of the squared vertical distances
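
A minimal sketch of the least squares fit for a single input variable, using the standard closed-form slope and intercept. The function name `fit_line` and the data are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) minimising the sum of squared vertical distances."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```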

# Bias-Variance and Cross Validation
- A fundamental topic for understanding your model's performance

<br></br>
## Bias-Variance
- The point beyond which adding model complexity only fits unwanted noise in the data
- Past this point the **training error** keeps going down but the **test error** goes up
- Beyond the **bias-variance** trade-off point the model begins to overfit

<br></br>
<img src='pics/bias-variance_trade-off.png'>

<br></br>
- If complexity is continually added to a model it will overfit the training data
    - Making it less useful when applied to unseen data, such as the test set

<img src='pics/b_v_t_o_overfit.png'>

<br></br>

<img src='pics/b_v_flexibility.png'>

<br></br>
- **Flexibility:** Complexity of the model, e.g. the polynomial degree of a regression fit (how many terms are being accounted for)
- **Mean Squared Error**: The error metric

<br></br>
- The goal is to balance the **bias** and **variance** of your model, so that it has the lowest error
    - Find the point on the training curve that is closest to the minimum-error point on the test data
    - Find the ideal number of variables for analysis
        - In this example it is the blue dot, within the middle figure
            - The blue dot is the quadratic fit
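
The trade-off can be reproduced on synthetic data. The sketch below assumes NumPy is available and uses made-up quadratic data with noise; it fits polynomials of increasing degree and compares the training and test mean squared error. The data, the degree range, and the names `train_mse` and `test_mse` are illustrative assumptions, not taken from the figures.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Made-up data: a quadratic relationship plus Gaussian noise."""
    x = rng.uniform(-1.0, 1.0, n)
    y = x ** 2 + rng.normal(0.0, 0.1, n)
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(30)

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

train_mse, test_mse = {}, {}
for degree in range(1, 9):
    # Higher degree = more flexible model
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse[degree] = mse(y_train, np.polyval(coeffs, x_train))
    test_mse[degree] = mse(y_test, np.polyval(coeffs, x_test))

# Degree with the lowest error on the held-out test data
best_degree = min(test_mse, key=test_mse.get)
print(best_degree)
```

The training error can only go down as the degree rises, whereas the test error is what identifies a sensible level of flexibility.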

<img src='pics/b_v_over_under_fit.png'>

<br></br>
- *Going left*: Higher bias, lower variance
- *Going right*: Lower bias, higher variance
- **Underfitting**: The model is too simple (not enough complexity or features), so the error could still be reduced by adding complexity
- **Overfitting**: The model is too complex and fits noise in the training data, so the test error could be reduced by removing complexity

<br></br>
- Pick a point as a **bias-variance trade-off**

# Logistic regression
- Trying to predict the probability of discrete categories

<br></br>
- A method for classification
    - Spam and real emails
    - Loan default (yes/no)
    - Disease diagnosis (have/do not have)
        - These are examples of **binary classification**, i.e. there are 2 classes
            - Typically **0** and **1**
            
<br></br>
- A linear regression will give a bad fit for **binary classification** models
    - It would predict values below **0** and above **1**
        - The linear fit is therefore transformed into a **logistic regression** curve, which only outputs values between **0** and **1**

<img src='pics/log_regress_curve.png'>

<br></br>
- The shape of the logistic regression curve is called the **Sigmoid**
    - It will take in any value and only output a value between 0 and 1
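
A minimal sketch of the sigmoid function, written in a numerically stable form so that very large negative inputs do not overflow:

```python
import math

def sigmoid(z):
    """Map any real number z to a value strictly between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form that avoids exp() of a large number
    ez = math.exp(z)
    return ez / (1.0 + ez)

for z in (-5.0, 0.0, 5.0):
    print(z, sigmoid(z))
```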

<img src='pics/log_reg_signoid.png'>

<br></br>
-  The *linear model* is transformed into the *logistic model*
    - So that all the **output values** range from **0** to **1**

<img src='pics/log_reg_transform.png'>

<br></br>
-  A line can be drawn halfway up the probability axis (at 0.5)
    - Anything **below** the line is **class 0**
    - Anything **above** the line is **class 1**
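
As a rough sketch of this cut-off, the linear model's output is passed through the sigmoid and the resulting probability is compared with 0.5. The coefficients `weight` and `bias` are hypothetical values chosen for illustration; in practice they come from training.

```python
import math

def predict_probability(x, weight, bias):
    """Logistic model: squash the linear model weight * x + bias into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(weight * x + bias)))

def predict_class(x, weight, bias, threshold=0.5):
    """Class 1 if the probability reaches the threshold, otherwise class 0."""
    return 1 if predict_probability(x, weight, bias) >= threshold else 0

# Hypothetical coefficients, not fitted to any real data
weight, bias = 1.5, -3.0
predicted = [predict_class(x, weight, bias) for x in (0.0, 1.0, 3.0, 4.0)]
print(predicted)
```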

<img src='pics/log_reg_class_line.png'>

<br></br>
- Train the logistic model then test it
    - A **confusion matrix** can be used to evaluate classification models

<img src='pics/log_reg_confusion.png'>

**Basic terminology:**
- True positives (TP)
- True negatives (TN)
- False positives (FP): **Type 1 error**
- False negatives (FN): **Type 2 error**

<br></br>
**Accuracy rate:**
- How often the prediction is correct
    - (TP + TN) / total = 150 / 165 ≈ 0.91 = 91%

<br></br>
**Misclassification rate:**
- How often the prediction is wrong
    - (FP + FN) / total = 15 / 165 ≈ 0.09 = 9%
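
Both rates can be computed directly from the four cells of the confusion matrix. The individual counts below are assumed for illustration: they are chosen only so that they match the totals quoted above (150 correct and 15 wrong out of 165) and are not read from the figure.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def misclassification_rate(tp, tn, fp, fn):
    """Share of all predictions that were wrong."""
    return (fp + fn) / (tp + tn + fp + fn)

# Assumed cell counts, consistent with 150 correct and 15 wrong out of 165
counts = dict(tp=100, tn=50, fp=10, fn=5)
acc = accuracy(**counts)
err = misclassification_rate(**counts)
print(round(acc, 2), round(err, 2))
```

The two rates always sum to 1, so either one determines the other.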

# K nearest neighbours (KNN)
- Used for classification problems
- Suited to two or more classes described by continuous features
    - E.g. the height and weight of dogs and horses

<img src='pics/knn_and_pre.png'>

- Predict whether a data point represents a horse or a dog based on the height and weight

<br></br>
Training algorithm:
- Store all the data

<br></br>
Prediction algorithm:
1. Calculate the distance from x to all points in the data being used
2. Sort the points by increasing distance from x
3. Predict the majority label (class) of the closest 'K' points
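
The three prediction steps can be sketched in plain Python as follows. Euclidean distance is assumed as the distance measure, and the height and weight values are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, x, k=3):
    """Predict the label of x from its k nearest stored points."""
    # 1. Distance from x to every stored point
    distances = [(math.dist(p, x), label) for p, label in zip(train_points, train_labels)]
    # 2. Sort the points by increasing distance from x
    distances.sort(key=lambda pair: pair[0])
    # 3. Majority label among the k closest points
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Made-up (height in cm, weight in kg) measurements
points = [(50, 20), (60, 30), (45, 15), (160, 500), (170, 550), (150, 450)]
labels = ["dog", "dog", "dog", "horse", "horse", "horse"]

print(knn_predict(points, labels, (55, 25)))
print(knn_predict(points, labels, (165, 520)))
```

"Training" here is just keeping `points` and `labels` around, which is why all of the cost falls on prediction.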

<img src='pics/knn_and_post.png'>

<br></br>
- For low values of 'K' the prediction is sensitive to noise (high variance)
- For high values of 'K' there is less noise but the bias increases

<img src='pics/knn_increase.png'>

<br></br>
Pros:
- Easy to set up
- Works with any number of classes
- Easy to add more data

<br></br>
Cons:
- High prediction cost (worse for large data sets)
- Categorical features don't work well

# Decision Trees and Random Forests
## Decision trees
- Make a decision when there are different outcomes for many features

Example: will my friend play (Yes/No) based on the weather?

<img src='pics/decision_tree.png'>

- **Nodes**: Split value of a certain attribute (white box)
- **Edges**: Connection between the nodes, denoting the outcome of splitting to a node
- **Root**: The node that performs the first split
- **Leaves**: Terminal nodes that predict the outcome (Red and green)

<br></br>
- Try to choose the features that best split your data
    - I.e. after the split, each side should contain as much of a single class as possible
    - Referred to as *maximising the information gain* from the split
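
Information gain can be illustrated with a small calculation. The sketch below assumes entropy as the impurity measure (a common choice, though not the only one) and uses made-up Yes/No labels: the gain is the parent's entropy minus the size-weighted entropy of the child groups.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child groups."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Made-up labels: 4 "yes" and 4 "no"
parent = ["yes"] * 4 + ["no"] * 4
perfect_split = [["yes"] * 4, ["no"] * 4]
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

A split that separates the classes completely gains the full entropy of the parent, while a split that leaves the class mix unchanged gains nothing.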

<br></br>
## Random forests
- Improve on the performance of single decision trees
    - Use many trees, with a random sample of features considered at each split

<br></br>
- Sometimes the predictive accuracy of decision trees can be low due to their high variance
    - Caused by different splits of the training data, which can lead to very different trees
        - **Bagging** is a method used to reduce the variance in machine learning models

<br></br>
- **Random forests** are a slight variation on bagged decision trees with better performance
- Each tree is trained on a sample drawn from the training data set with replacement
- A new random sample of features is chosen for **every single tree at every single split**
    - For **classification**, **m** is typically chosen to be the square root of **p**
        - **m**: The number of features in the random sample
        - **p**: The total number of features
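
The two sources of randomness can be sketched as below, taking **m** as the number of features sampled at a split and **p** as the total number of features. The helper names `bootstrap_sample` and `features_for_split` and the data are hypothetical; this shows only the sampling, not a full forest.

```python
import math
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (some rows repeat, some are left out)."""
    return [rng.choice(rows) for _ in rows]

def features_for_split(feature_names, rng):
    """Pick m = sqrt(p) candidate features, without repeats, for one split."""
    p = len(feature_names)
    m = max(1, round(math.sqrt(p)))
    return rng.sample(feature_names, m)

rng = random.Random(0)
features = [f"feature_{i}" for i in range(16)]  # p = 16, so m = 4
rows = list(range(10))                           # stand-in for training rows

sample = bootstrap_sample(rows, rng)
split_features = features_for_split(features, rng)
print(sample)
print(split_features)
```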

# Support Vector Machines

# K means Clustering
- An unsupervised learning algorithm that will label unlabelled data
    - Will attempt to group similar data points together into clusters
         - Cluster similar documents
         - Cluster customers based on features
         - Market segmentation
         - Identify similar physical groups
         
<br></br>
- The goal is to divide data into distinct groups such that observations within each group are similar

<img src='pics/k_means_cluster.png'>
<br></br>
- Unlabelled training data on the left
- KMC algorithm trying to cluster the data into 5 coloured groups

<img src='pics/k_means_algo.png'>

<img src='pics/k_means_cluster_process.png'>
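
The procedure itself is only given in the figures. A minimal sketch of the standard K-means iteration (often called Lloyd's algorithm), which is assumed here to be what the figures depict: pick K starting centroids, assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat until nothing changes. The data and function names are made up.

```python
import math
import random

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points]

def k_means(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(iterations):
        labels = assign(points, centroids)
        new_centroids = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                # Move the centroid to the mean of its assigned points
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[i])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assign(points, centroids)

# Two made-up, well-separated groups of points
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centroids, labels = k_means(points, k=2)
print(labels)
```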


# Principal Component Analysis
- An unsupervised statistical technique used to examine the interrelations among a set of variables
    - In order to identify the underlying structure to those variables
        - Sometimes referred to as **factor analysis**
        
<br></br>
- **Regression** determines a line of best fit to a data set
- **Factor analysis** determines several orthogonal lines of best fit to the data set
    - The lines are perpendicular to each other in n-dimensional space
        - 2 variables, 2 lines, 2D
        - 3 variables, 3 lines, 3D
- n-Dimensional Space is the variable sample space
    - There are as many dimensions as there are variables
    
<img src='pics/principle_c_a.png'>

<br></br>
- The **components** are a **linear transformation** of the data set onto a new set of axes, such that
    - The greatest variance of the data set comes to lie on the first axis
    - The second greatest variance on the second axis

<br></br>
- The **components are uncorrelated** since they are orthogonal to each other in the sample space

<br></br>
- This process allows us to **reduce the number of variables** used in an analysis
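
A minimal sketch of these ideas, assuming NumPy is available and using made-up data with two strongly correlated variables. The components are taken as the eigenvectors of the covariance matrix, ordered by how much variance they capture, and the data is then reduced from two variables to one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: the second variable is roughly twice the first
x = rng.normal(0.0, 1.0, 200)
data = np.column_stack([x, 2.0 * x + rng.normal(0.0, 0.1, 200)])

# Centre the data, then find the orthogonal directions of the covariance matrix
centered = data - data.mean(axis=0)
covariance = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(covariance)

# Order the components from greatest to smallest variance
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
components = eigenvectors[:, order]

explained_ratio = eigenvalues / eigenvalues.sum()

# Keep only the first component: 2 variables reduced to 1
reduced = centered @ components[:, :1]
print(explained_ratio)
```

Because the two variables move together, almost all of the variance lies along the first component, which is what justifies dropping the second.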

# Recommender Systems 
Two most common types of recommender systems:
- **Content-based**
    - Recommendations based on the attributes of the item (most similar items)
- **Collaborative filtering (CF)**
    - Recommendations based on knowledge of the users' behaviour (crowd behaviour)
    
<br></br>
- Collaborative filtering (CF) can be divided into:
    - **Memory-Based collaborative filtering**
        - Computes cosine similarity
    - **Model-Based collaborative filtering**
        - Uses singular value decomposition (SVD)
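
As an illustration of the memory-based approach, the sketch below computes the cosine similarity between users' rating vectors and finds the most similar user. The users and ratings are made up, and treating an unrated item as 0 is a simplifying assumption.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up ratings for four items; 0 stands for "not rated" (an assumption)
ratings = {
    "alice": [5, 4, 0, 1],
    "bob": [4, 5, 0, 1],
    "carol": [1, 0, 5, 4],
}

def most_similar_user(target):
    """The other user whose rating vector points in the most similar direction."""
    others = [user for user in ratings if user != target]
    return max(others, key=lambda user: cosine_similarity(ratings[target], ratings[user]))

print(most_similar_user("alice"))
```

Items that the most similar user rated highly, and that the target user has not yet rated, would then be candidates for recommendation.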