# Unsupervised Learning
- aim is to learn a mapping from the input space to an output space (e.g. cluster assignments) without being told the correct outputs
    - no pre-defined labels for the data
        - i.e. just data, no "correct" answers to test against
    - the model determines what features are important
        - i.e. it is deciding how to group the data

### k-Means Clustering
- split data into k clusters
    - k is a hyperparameter
    - first, choose k starting centroids and determine the distance of each point from each centroid
    - then group the points by the closest centroid
    - i.e. find the k centroids that minimize the distance between the grouped points and their respective centroids 
- variety of algorithms for moving centroids to new locations
    - basic idea is to monitor the average distance of points from their centroids
    - keep moving the centroids until the average distance stops decreasing
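
The loop described above can be sketched in plain Python. This is only one of the possible variants (the common "assign, then move each centroid to the mean of its points" scheme); the function names, the toy points, and the choice to leave an empty cluster's centroid where it is are all assumptions made for illustration.

```python
import math
import random


def assign(points, centroids):
    """Label each point with the index of its closest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    moved = []
    for i, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == i]
        if members:
            moved.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
        else:
            moved.append(old)  # assumption: an empty cluster keeps its centroid
    return moved


def k_means(points, centroids, max_iter=100):
    """Alternate assignment and update until the centroids stop moving."""
    centroids = [tuple(map(float, c)) for c in centroids]
    labels = assign(points, centroids)
    for _ in range(max_iter):
        moved = update(points, labels, centroids)
        if moved == centroids:
            break
        centroids = moved
        labels = assign(points, centroids)
    return centroids, labels


# Made-up 2-D points forming two obvious groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
start = random.Random(0).sample(points, 2)  # k = 2 random starting centroids
centroids, labels = k_means(points, start)
print(centroids, labels)
```

Passing the starting centroids in explicitly makes it easy to see how the result depends on where the centroids begin.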
#### Reconstruction Error
- used to determine the optimal number of clusters (k)
- it is the sum of squared distances between each point $x^t$ and its centroid $m_i$
    - Reconstruction Error $= \sum_{t} \sum_{i} b_i^t ||x^t - m_i||^2$
    - $b_i^t$ is 1 if $m_i$ is the closest centroid to $x^t$ and 0 otherwise
- the best k is usually found at the "elbow" of the graph of the reconstruction error
    - i.e. the point where the error stops decreasing as much
    - <img src="images/recoerror.png">
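
A small sketch of the error computation, assuming each point is charged only to its closest centroid. The function name and the six sample points are illustrative; the centroids are simply the means of all points (k = 1) and of each half (k = 2), not the output of a clustering run.

```python
import math


def reconstruction_error(points, centroids):
    """Sum of squared distances from each point to its closest centroid."""
    return sum(min(math.dist(p, m) ** 2 for m in centroids) for p in points)


points = [(2, 3, 4), (1, 6, 3), (2, 1, 5), (6, 3, 5), (9, 8, 7), (7, 2, 6)]
one_centroid = [(27 / 6, 23 / 6, 30 / 6)]                  # k = 1: mean of all points
two_centroids = [(5 / 3, 10 / 3, 4), (22 / 3, 13 / 3, 6)]  # k = 2: mean of each half

error_k1 = reconstruction_error(points, one_centroid)
error_k2 = reconstruction_error(points, two_centroids)
print(error_k1, error_k2)
```

Plotting this value for k = 1, 2, 3, ... gives the curve whose elbow is read off to choose k.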
#### Non-Deterministic
- k-means clustering is non-deterministic
    - i.e. it can give different results each time it is run
- mitigation
    - run the algorithm multiple times and choose the best result
        - about 100 times usually works
    - alternatively, use a uniform distribution of initial centroids
        - this may miss a better solution
#### k-Means in Python
- scikit-learn supports k-Means with `sklearn.cluster.KMeans`
    - `init` specifies the method for initializing the centroids
        - `random` randomly selects k points from the data
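        - `k-means++` picks starting centroids that are spread far apart, which usually converges faster and to a better result (scikit-learn's default)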
    - `n_init` specifies the number of times to run the algorithm
        - the best result is returned
    - `inertia_` (an attribute of the fitted model) is the reconstruction error
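
A minimal usage sketch, assuming scikit-learn is installed. The data points and the parameter values (`n_clusters=2`, `n_init=10`, `random_state=0`) are made up for illustration.

```python
from sklearn.cluster import KMeans

# Made-up 2-D points forming two obvious groups.
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]

model = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
labels = model.fit_predict(X)  # cluster index for each row of X

print(labels)
print(model.cluster_centers_)  # one row per centroid
print(model.inertia_)          # reconstruction error of the best of the n_init runs
```

Fixing `random_state` makes the run repeatable, which helps when comparing different values of k.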
#### Summary
- advantages
    - simple
    - available in most libraries
    - scales well
    - applicable to many fields and problems
- disadvantages
    - requires k to be specified
    - assumes that clusters are spherical in shape
    - sensitive to outliers
    - "curse of dimensionality"
        - i.e. the more dimensions, the more data is needed to get good results
        - PCA can help with this
    - always converges, but possibly only to a local minimum

### IC 26Oct23
- 1. 
        - distance((1, 4, 8) -> (3, 8, 2))
        - $d = \sqrt{(1-3)^2 + (4-8)^2 + (8-2)^2}$
        - $d = \sqrt{4 + 16 + 36}$
        - $d = \sqrt{56}$
        - $d = 7.483$
- 2. 
        - centroid_1((2, 3, 4), (1, 6, 3), (2, 1, 5))
        - $c_1 = (\frac{2+1+2}{3}, \frac{3+6+1}{3}, \frac{4+3+5}{3})$
        - $c_1 = (\frac{5}{3}, \frac{10}{3}, \frac{12}{3})$
        - $c_1 = (1.667, 3.333, 4)$
        - centroid_2((6, 3, 5), (9, 8, 7), (7, 2, 6)) 
        - $c_2 = (\frac{6+9+7}{3}, \frac{3+8+2}{3}, \frac{5+7+6}{3})$
        - $c_2 = (\frac{22}{3}, \frac{13}{3}, \frac{18}{3})$
        - $c_2 = (7.333, 4.333, 6)$
- 3.
        - 
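
The arithmetic in parts 1 and 2 can be checked with a few lines of Python. The helper name `centroid` is just a convenience introduced here.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


d = math.dist((1, 4, 8), (3, 8, 2))             # Euclidean distance from part 1
c1 = centroid([(2, 3, 4), (1, 6, 3), (2, 1, 5)])  # first centroid from part 2
c2 = centroid([(6, 3, 5), (9, 8, 7), (7, 2, 6)])  # second centroid from part 2

print(round(d, 3))
print([round(v, 3) for v in c1])
print([round(v, 3) for v in c2])
```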

### Syllabification
- common problem in applied linguistics
- goal is to split a word into syllables
    - i.e. "syllabification"
- phones
    - about 150 sounds which the human vocal tract can produce
        - English uses about 44-62
    - any word can be represented as a sequence of phones
- sonority scale
    - a ranking of speech phones by intensity (i.e. how loud they are), from least to most sonorous
        - here we will use a scale of 1-13
        - vowels are the most sonorous because they have the most vibration
        - consonants are less sonorous because they have less vibration
- sonority sequencing principle
    - one of several theories on how words are split into syllables
    - suggests that the middle of a syllable is the sonority apex (nucleus)
        - usually a vowel
    - can be applied to an audio file
        - represent phones as a sequence of their intensities between 1 and 13
        - how do you determine exactly where to split the word into syllables based on these numbers?
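
As a rough illustration of the principle, the candidate nuclei are the local peaks of the sonority sequence. The sketch below only finds those peaks; it does not decide where the boundaries between them fall, which is the open question above. The function name and the sample sequence are made up.

```python
def sonority_peaks(sonority):
    """Return the indices of local sonority maxima (candidate syllable nuclei)."""
    peaks = []
    for i, value in enumerate(sonority):
        left = sonority[i - 1] if i > 0 else float("-inf")
        right = sonority[i + 1] if i < len(sonority) - 1 else float("-inf")
        # Strictly above the left neighbour, so a plateau counts only once.
        if value > left and value >= right:
            peaks.append(i)
    return peaks


# Hypothetical word encoded as sonority values on the 1-13 scale.
print(sonority_peaks([3, 13, 5, 10, 12, 4]))
```

Each peak suggests one syllable, so this sequence would be treated as having two.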

### Syllabification Problem Statement
- need a set of rules to divide phones into syllables
    - i.e. a syllabification algorithm
    - rules are like look-up tables showing a pattern and the corresponding action
        - e.g. if you have the pattern [13 12], split it into [13] & [12]
        - e.g. if you have the pattern [13 5 10], split it into [13] & [5 10]
- what if the pattern is not in the table?
    - find the closest matching pattern from the rules we do have
        - uses euclidean distance
        - <img src="images/syllabification.png">
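
A sketch of the lookup with a nearest-pattern fallback. The rule table stores each pattern with the index at which to split it; the first two rules come from the examples above, while the third rule, the function names, and the decision to compare only patterns of the same length are assumptions made for this example.

```python
import math

# Sonority pattern -> index at which to split it.
RULES = {
    (13, 12): 1,      # [13 12]   -> [13] & [12]
    (13, 5, 10): 1,   # [13 5 10] -> [13] & [5 10]
    (13, 12, 5): 2,   # hypothetical extra rule: [13 12] & [5]
}


def closest_rule(pattern, rules):
    """Find the known pattern nearest to `pattern` by Euclidean distance."""
    candidates = [r for r in rules if len(r) == len(pattern)]
    if not candidates:
        raise LookupError("no rule with a matching length")
    return min(candidates, key=lambda r: math.dist(r, pattern))


def syllabify(pattern, rules):
    """Split a pattern using its own rule, or the closest one if it has none."""
    pattern = tuple(pattern)
    rule = pattern if pattern in rules else closest_rule(pattern, rules)
    split = rules[rule]
    return list(pattern[:split]), list(pattern[split:])


print(syllabify([13, 12], RULES))    # exact match in the table
print(syllabify([12, 6, 9], RULES))  # falls back to the closest pattern
```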

### Syllable Alignment Error
- the metric used to evaluate the accuracy of the syllabification algorithm
    - an ugly, ugly function
- compare the machine classified syllables to the human classified syllables
    - calculate how much time each machine classified syllable overlaps with each human classified syllable
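
The overlap step on its own can be sketched as below. This is not the full error function, only the pairwise overlap it is built on; the interval representation as `(start, end)` times in seconds, the function names, and the sample boundaries are assumptions.

```python
def overlap(a, b):
    """Length of time shared by two (start, end) intervals, 0 if disjoint."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def overlap_matrix(machine, human):
    """Overlap of every machine syllable (rows) with every human syllable (columns)."""
    return [[overlap(m, h) for h in human] for m in machine]


# Hypothetical syllable boundaries for one two-syllable word.
machine = [(0.0, 0.30), (0.30, 0.70)]
human = [(0.0, 0.25), (0.25, 0.70)]

for row in overlap_matrix(machine, human):
    print(row)
```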

### Syllabification Objective Function
- $\min_{e(r)} S(R,X)$
    - $R$ is the set of rules
    - $r$ is a rule
    - $e(r)$ is the error of a rule
    - $X$ is a set of utterances
    - $S(R,X)$ is a function which takes $R$ and $X$ and returns a set of utterances broken into syllables

### Syllabification Genetic Algorithm
- often use the TIMIT corpus to train the algorithm
    - 6300 utterances
    - 8 different dialects
    - 630 different speakers
    - speech is made up of
        - dialect sentences to identify dialect
        - phonetically rich sentences to identify common phonemes
        - phonetically compact sentences to include interesting and problematic phoneme combinations 
    - also includes manually classified beginnings and endings of syllables and phones

### IC 09Nov23
- a. euclidean distance from [9 8 9] to [10 2 10] and [10 7 10]
    - $d_1 = \sqrt{(9-10)^2 + (8-2)^2 + (9-10)^2}$
    - $d_1 = \sqrt{1 + 36 + 1}$
    - $d_1 = \sqrt{38}$
    - $d_1 \approx 6.164$
    - $d_2 = \sqrt{(9-10)^2 + (8-7)^2 + (9-10)^2}$
    - $d_2 = \sqrt{1 + 1 + 1}$
    - $d_2 = \sqrt{3}$
    - $d_2 \approx 1.732$
- b. [9] [8 9] 

### Temporal Difference Learning
- a form of reinforcement learning
    - learning by trial and error
    - i.e. learning by doing
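- Off-Policy
    - trial and error
    - learn about one policy (e.g. the greedy one) while acting according to a different, more exploratory one
- On-Policy
    - learning from instruction
    - follow the policy and learn about it at the same time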
- Q-Learning
    - a form of off-policy learning
    - uses a Q-table to store the value of each action in each state
        - Q-table is initialized to 0
        - Q-table is updated after each action
        - Q-table is used to determine the next action
    - e.g. robot in a house
    <img src="images/qlearnstate.png">

    - use the state table to represent the options
        - assign doors that lead outside a value of 100
        - all other doors are assigned a value of 0
    <img src="images/qlearnstaterewards.png">

    - create a rewards matrix and a Q matrix
        - rewards matrix is the same as the state matrix
        - Q matrix is initialized to 0 and is updated after each action
        - Q matrix is like the brain or memory of the robot
    - transition rule is dictated by Q(state, action) = R(state, action) + $\gamma$ Max[Q(next state, all actions)]
        - $\gamma$ is the discount factor
            - it is a value between 0 and 1
            - it is used to determine how much to discount future rewards
            - i.e. how much to discount the Max[Q(next state, all actions)]
            - if $\gamma$ is 0, then the robot will only consider the immediate reward
            - if $\gamma$ is 1, then the robot will weigh future rewards as heavily as the immediate reward
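
To see what the discount factor does, the transition rule can be written as a one-line function and evaluated for a few values of $\gamma$. The function name and the numbers passed in are made up for illustration.

```python
def q_value(reward, next_q_values, gamma):
    """Q(state, action) = R(state, action) + gamma * max over the next state's Q-values."""
    return reward + gamma * max(next_q_values)


# Hypothetical step: no immediate reward, best known value from the next state is 100.
for gamma in (0.0, 0.8, 1.0):
    print(gamma, q_value(0, [0, 100], gamma))
```

With $\gamma$ = 0 the future value is ignored entirely; with $\gamma$ = 1 it is passed back undiminished.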

### Q-Learning Algorithm
- initialize Q-table to 0
- for each episode
    - select random initial state
    - while state is not terminal
        - randomly select one of the actions available in the current state
        - consider taking this action and moving to the next state
        - get the maximum Q-value for the next state over all of its possible actions
        - compute Q(state, action)
        - set the current state to the next state
- e.g. initial state of 1 & $\gamma$ = 0.8
    <img src="images/qlearnroom1.png">

    - options are go to state 3 or state 5
        - random selection
    - choose 5
    - calculate Q(1,5)
        - R(1,5) = 100
        - max[Q(next state, all actions)] = max[Q(5,1), Q(5,4), Q(5,5)] = Max[0, 0, 0] = 0
        - Q(1,5) = 100 + 0.8 * 0 = 100
    - append Q(1,5) to Q-table
- e.g. initial state of 3 & $\gamma$ = 0.8
    - options are go to state 1, 2, or state 4
        - random selection
    - choose 1
    - calculate Q(3,1)
        - R(3,1) = 0
        - max[Q(next state, all actions)] = max[Q(1,3), Q(1,5)] = Max[0, 100] = 100
        - Q(3,1) = 0 + 0.8 * 100 = 80
    - append Q(3,1) to Q-table
- after many iterations, the Q-table will start to converge
    <img src="images/qlearntable.png">

    - convergence can be defined as when the Q-table stops changing
        - because the algorithm is exploratory, it may be worth continuing after the first identical iteration
            - two iterations may not differ, but the third may differ significantly
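
The whole procedure can be sketched as follows. The reward matrix is an assumption: the rows for states 1, 2, 3 and 5 follow the transitions used in the worked examples, while the rows for states 0 and 4 are guesses, since the figure itself is not reproduced here. `-1` marks a missing door, state 5 is taken as the terminal state, and the function name, episode count and seed are arbitrary.

```python
import random

GOAL = 5  # assumed terminal state

# R[state][action]; moving through a door to room `action` puts the robot in that room.
R = [
    [-1, -1, -1, -1,  0,  -1],  # state 0 (guessed)
    [-1, -1, -1,  0, -1, 100],  # state 1: to 3 or 5
    [-1, -1, -1,  0, -1,  -1],  # state 2: to 3
    [-1,  0,  0, -1,  0,  -1],  # state 3: to 1, 2 or 4
    [ 0, -1, -1,  0, -1, 100],  # state 4 (guessed)
    [-1,  0, -1, -1,  0, 100],  # state 5: to 1, 4 or 5
]


def train(R, goal, gamma=0.8, episodes=1000, seed=0):
    """Fill in a Q-table by taking random actions until the goal is reached."""
    rng = random.Random(seed)
    n = len(R)
    Q = [[0.0] * n for _ in range(n)]
    for _ in range(episodes):
        state = rng.randrange(n)  # random initial state
        while state != goal:
            action = rng.choice([a for a in range(n) if R[state][a] >= 0])
            next_state = action
            best_next = max(Q[next_state][a] for a in range(n) if R[next_state][a] >= 0)
            Q[state][action] = R[state][action] + gamma * best_next
            state = next_state
    return Q


Q = train(R, GOAL)
for row in Q:
    print([round(v, 1) for v in row])
```

Because each episode stops as soon as state 5 is reached, the row for state 5 is never updated in this sketch and stays at 0, which is what the worked examples assume when they evaluate Max[0, 0, 0].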

### Q-Learning vs Monte Carlo
- Q-Learning can learn ***before*** knowing the final outcome
    - can learn every step
    - monte carlo can only learn at the end of the episode
- Q-Learning can learn ***without*** the final outcome
    - Q-Learning can learn from incomplete episodes
    - monte carlo can only learn from complete episodes
    - Q-Learning works in continuing (non-terminating) environments
        - e.g. a robot in a house
    - monte carlo only works in episodic (terminating) environments
        - e.g. a robot in a maze
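
The difference shows up in what each method needs before it can compute a learning target. In this sketch (function names and the reward sequence are made up), the Monte Carlo return needs every reward up to the end of the episode, while the temporal difference target needs only one reward and the current estimate for the next state.

```python
def monte_carlo_returns(rewards, gamma):
    """Discounted return from each step, computable only once the episode is over."""
    returns = []
    g = 0.0
    for reward in reversed(rewards):
        g = reward + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


def td_target(reward, next_value, gamma):
    """One-step target, available immediately after a single action."""
    return reward + gamma * next_value


# Hypothetical three-step episode that ends with a reward of 100.
print(monte_carlo_returns([0, 0, 100], gamma=0.8))
print(td_target(0, next_value=100, gamma=0.8))
```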

<img src="images/ICP_28Nov.png">

### IC 28Nov23
- initial state of 2 
- options are go to state 3
    - no random selection
- calculate Q(2,3)
    - R(2,3) = 0
    - max[Q(next state, all actions)] = max[Q(3,1), Q(3,2), Q(3,4)] = Max[80, 0, 0] = 80
    - Q(2,3) = 0 + 0.8 * 80 = 64
- append Q(2,3) to Q-table
- - - - - - - - 
- new state is 3
- options are go to state 1 or state 2 or state 4
    - random selection
- go to state 1
- calculate Q(3,1)
    - R(3,1) = 0
    - max[Q(next state, all actions)] = max[Q(1,3), Q(1,5)] = Max[0, 100] = 100
    - Q(3,1) = 0 + 0.8 * 100 = 80
- append Q(3,1) to Q-table
- - - - - - - - 
- new state is 1
- options are go to state 3 or state 5
    - random selection
- go to state 5
- calculate Q(1,5)
    - R(1,5) = 100
    - max[Q(next state, all actions)] = max[Q(5,1), Q(5,4), Q(5,5)] = Max[0, 0, 0] = 0
    - Q(1,5) = 100 + 0.8 * 0 = 100
- append Q(1,5) to Q-table