---
execute:
  cache: false
  eval: true
  echo: true
  warning: false
jupyter: python3
---

# Clustering



## DBSCAN

* Video: [Clustering with DBSCAN, Clearly Explained!!!](https://youtu.be/RDZUdRSDOok?si=C7SzTAQC8BmD8AZy)



## k-Means Clustering

The $k$-means algorithm is an unsupervised learning algorithm. Despite the similar name, it is only loosely related to the $k$-nearest neighbor classifier, which is a supervised method.
The $k$-means algorithm works as follows:

* Step 1: Randomly choose $k$ initial centers (centroids) and assign each point to the cluster of its nearest center.
* Step 2: Compute the distance from each data point to every centroid and re-assign each point to the cluster whose centroid is closest.
* Step 3: Recalculate each cluster centroid as the mean of the points assigned to it.
* Step 4: Repeat steps 2 and 3 until no data point switches from one cluster to another, so that no further improvement is possible. The algorithm has then converged to a local optimum, which is not necessarily the global optimum.
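
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `kmeans`, its parameters `n_iter` and `seed`, and the toy data are hypothetical, and the initial centers are drawn from the data points themselves, which is only one of several possible initialization schemes.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means sketch; returns (centers, labels)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as initial centers (an assumption).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(n_iter):
        # Step 2: Euclidean distance of every point to every center,
        # then assign each point to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop as soon as no point switches its cluster.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each center to the mean of its assigned points;
        # an empty cluster keeps its previous center.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, labels


# Toy data (hypothetical): three well-separated blobs in two dimensions.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.3, size=(50, 2)),
])
centers, labels = kmeans(X, k=3)
print(centers.round(2))
```

Because the result depends on the random initial centers, different values of `seed` can end in different local optima. A common remedy is to run the algorithm several times and keep the solution with the smallest total within-cluster squared distance.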

The basic principle of the $k$-means algorithm is illustrated in @fig-kmeans1, @fig-kmeans2, @fig-kmeans3, and @fig-kmeans4.

![k-means algorithm. Step 1. Randomly choose $k$ centers. Assign points to cluster. $k$ initial "means" (in this case $k=3$) are randomly generated within the data domain (shown in color). Attribution: I, Weston.pace, CC BY-SA 3.0 <http://creativecommons.org/licenses/by-sa/3.0/>, via Wikimedia Commons](./figures_static/kmeans1.png){width=70% #fig-kmeans1}

![k-means algorithm. Step 2. $k$ clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means. Attribution: I, Weston.pace, CC BY-SA 3.0 <http://creativecommons.org/licenses/by-sa/3.0/>, via Wikimedia Commons](./figures_static/kmeans2.png){width=70% #fig-kmeans2}

![k-means algorithm. Step 3. The centroid of each of the $k$ clusters becomes the new mean. Attribution: I, Weston.pace, CC BY-SA 3.0 <http://creativecommons.org/licenses/by-sa/3.0/>, via Wikimedia Commons](./figures_static/kmeans3.png){width=70% #fig-kmeans3}

![k-means algorithm. Step 4. Steps 2 and 3 are repeated until convergence has been reached. Attribution: I, Weston.pace, CC BY-SA 3.0 <http://creativecommons.org/licenses/by-sa/3.0/>, via Wikimedia Commons](./figures_static/kmeans4.png){width=70% #fig-kmeans4}



* Video: [K-means clustering](https://youtu.be/4b5d3muPQmA?si=O9-s32Kw676wXCQF)



## DDMO-Additional Videos

* [Odds and Log(Odds), Clearly Explained!!!](https://youtu.be/ARfXDSkQf1Y?si=E4TjFoloRjbQYbzQ)
* [One-Hot, Label, Target and K-Fold Target Encoding, Clearly Explained!!!](https://youtu.be/589nCGeWG1w?si=YN9vO0HQlnll1wb6)
* [Maximum Likelihood for the Exponential Distribution, Clearly Explained!!!](https://youtu.be/p3T-_LMrvBc?si=3Jcjueue1otXzS1r)
* [ROC and AUC, Clearly Explained!](https://youtu.be/4jRBRDbJemM?si=hJXxjRV7_Ib2ckVe)
* [Entropy (for data science) Clearly Explained!!!](https://youtu.be/YtebGVx-Fxw?si=xbMzfX2oqAsE6MGK)
* [Classification Trees in Python from Start to Finish](https://youtu.be/q90UDEgYqeI?si=teRC5oYkHcXXgUSU): Long live video!



## DDMO-Exercises 

::: {#exr-small-bins}
### Smaller Bins
What happens when we use smaller bins in a histogram?
:::

::: {#exr-curve}
### Density Curve
Why plot a curve to approximate a histogram?
:::

::: {#exr-2SD}
###  TwoSDQuestion
How many samples are plus/minus two SD around the mean?
:::


::: {#exr-2SD1}
###  OneSDQuestion
How many samples are plus/minus one SD around the mean?
:::

::: {#exr-2SD2}
###  ThreeSDQuestion
How many samples are plus/minus three SD around the mean?
:::


::: {#exr-2SD3}
###  DataRangeQuestion
You have a mean at 100 and a SD of 10. Where are 95% of the data?
:::


::: {#exr-2SD4}
###  PeakHeightQuestion
If the peak is very high, is the SD low or high?
:::

::: {#exr-POP1}
### ProbabilityQuestion
Given a certain density curve with a mean of 20, what is the probability of observing a value exactly equal to 20?
:::


::: {#exr-CAL1}
### MeanDifferenceQuestion
What is the difference between $\mu$ and $\bar{x}$?
:::

::: {#exr-CAL2}
### EstimateMeanQuestion
How do you calculate the sample mean?
:::


::: {#exr-CAL3}
### SigmaSquaredQuestion
What is sigma squared?
:::


::: {#exr-CAL4}
### EstimatedSDQuestion
What is the formula for the estimated standard deviation?
:::


::: {#exr-CAL5}
### VarianceDifferenceQuestion
What is the difference between the variance and the estimated variance?
:::

::: {#exr-MAT1}
### ModelBenefitsQuestion
What are the benefits of using models?
:::

::: {#exr-SAM1}
### SampleDefinitionQuestion
What is a sample in statistics?
:::

::: {#exr-Hyp1}
### RejectHypothesisQuestion
What does it mean to reject a hypothesis?
:::


::: {#exr-Hyp2}
### NullHypothesisQuestion
What is a null hypothesis?
:::


::: {#exr-Hyp3}
### BetterDrugQuestion
How can you show that you have found a better drug?
:::


::: {#exr-PVal1}
### PValueIntroductionQuestion
What is the reason for introducing the p-value?
:::


::: {#exr-PVal2}
### PValueRangeQuestion
Is there any range for p-values? Can it be negative?
:::


::: {#exr-PVal4}
### TypicalPValueQuestion
What are typical values of the p-value and what does it mean? 5%?
:::

::: {#exr-PVal5}
### FalsePositiveQuestion
What is a false-positive?
:::



::: {#exr-Calc1}
### CalculatePValueQuestion
How do you calculate a p-value?
:::


::: {#exr-Calc2}
### SDCalculationQuestion
What is the SD if the mean is 155 and 95% of the data lie in the range from 142 to 169?
:::


::: {#exr-Calc3}
### SidedPValueQuestion
When do we need the two-sided p-value and when the one-sided?
:::


::: {#exr-Calc4}
### CoinTestQuestion
Test a coin with Tail-Head-Head. What is the p-value?
:::


::: {#exr-Calc5}
### BorderPValueQuestion
If you get exactly the 0.05 border value, can you reject?
:::


::: {#exr-Calc6}
### OneSidedPValueCautionQuestion
Why should you be careful with a one-sided p-test?
:::


::: {#exr-Calc7}
### BinomialDistributionQuestion
What is the binomial distribution?
:::

::: {#exr-Hack1}
### PHackingWaysQuestion
Name two typical ways of p-hacking.
:::


::: {#exr-Hack2}
### AvoidPHackingQuestion
How can p-hacking be avoided?
:::


::: {#exr-Hack3}
### MultipleTestingProblemQuestion
What is the multiple testing problem?
:::


#### Covariance

::: {#exr-Cov1}
### CovarianceDefinitionQuestion
What is covariance?
:::


::: {#exr-Cov2}
### CovarianceMeaningQuestion
What is the meaning of covariance?
:::


::: {#exr-Cov3}
### CovarianceVarianceRelationshipQuestion
What is the relationship between covariance and variance?
:::

::: {#exr-Cov4}
### HighCovarianceQuestion
If covariance is high, is there a strong relationship?
:::

::: {#exr-Cov5}
### ZeroCovarianceQuestion
What if the covariance is zero?
:::


::: {#exr-Cov6}
### NegativeCovarianceQuestion
Can covariance be negative?
:::


::: {#exr-Cov7}
### NegativeVarianceQuestion
Can variance be negative?
:::



::: {#exr-Corr1}
### CorrelationValueQuestion
What do you do if the correlation value is 10?
:::


::: {#exr-Corr2}
### CorrelationRangeQuestion
What is the possible range of correlation values?
:::


::: {#exr-Corr3}
### CorrelationFormulaQuestion
What is the formula for correlation?
:::

::: {#exr-StatPow1}
### UnderstandingStatisticalPower
What is the definition of power in a statistical test?
:::


::: {#exr-StatPow2}
### DistributionEffectOnPower
What is the implication for power analysis if the samples come from the same distribution?
:::

::: {#exr-StatPow3}
### IncreasingPower
How can you increase the power if the distributions are very similar?
:::


::: {#exr-StatPow4}
### PreventingPHacking
What should be done to avoid p-hacking when the distributions are close to each other?
:::

::: {#exr-StatPow5}
### SampleSizeAndPower
If there is overlap and the sample size is small, will the power be high or low?
:::

::: {#exr-PowAn1}
### FactorsAffectingPower
Which are the two main factors that affect power?
:::


::: {#exr-PowAn2}
### PurposeOfPowerAnalysis
What does power analysis tell us?
:::


::: {#exr-PowAn3}
### ExperimentRisks
What are the two risks faced when performing an experiment?
:::


::: {#exr-PowAn4}
### PerformingPowerAnalysis
How do you perform a power analysis?
:::

::: {#exr-CenLi1}
### CentralLimitTheoremExplanation
What does the Central Limit Theorem state?
:::

::: {#exr-BoxPlo1}
### MedianInBoxplot
What is represented by the middle line in a boxplot?
:::


::: {#exr-BoxPlo2}
### BoxContentInBoxplot
What does the box in a boxplot represent?
:::


::: {#exr-RSqu1}
### RSquaredDefinition
What is R-squared? Show the formula.
:::


::: {#exr-RSqu2}
### NegativeRSquared
Can the R-squared value be negative?
:::


::: {#exr-RSqu3}
### RSquaredCalculation
Perform a calculation involving R-squared.
:::

::: {#exr-FitLin1}
### LeastSquaresMeaning
What is the meaning of the least squares method?
:::

::: {#exr-ML1}
### RegressionVsClassification
What is the difference between regression and classification?
:::



::: {#exr-MaxLike1}
### LikelihoodConcept
What is the idea of likelihood?
:::

::: {#exr-Prob1}
### ProbabilityVsLikelihood
What is the difference between probability and likelihood?
:::

::: {#exr-CroVal1}
### TrainVsTestData
What is the difference between training and testing data?
:::


::: {#exr-CroVal2}
### SingleValidationIssue
What is the problem if you validate the model only once?
:::


::: {#exr-CroVal3}
### FoldDefinition
What is a fold in cross-validation?
:::


::: {#exr-CroVal4}
### LeaveOneOutValidation
What is leave-one-out cross-validation?
:::


::: {#exr-ConMat1}
### DrawingConfusionMatrix
Draw the confusion matrix.
:::

::: {#exr-SenSpe1}
### SensitivitySpecificityCalculation1
Calculate the sensitivity and specificity for a given confusion matrix.
:::


::: {#exr-SenSpe2}
### SensitivitySpecificityCalculation2
Calculate the sensitivity and specificity for a given confusion matrix.
:::


::: {#exr-MalLea1}
### BiasAndVariance
What are bias and variance?
:::

::: {#exr-MutInf1}
### MutualInformationExample
Provide an example and calculate if mutual information is high or low.
:::

::: {#exr-PCA1}
### WhatIsPCA
What is PCA?
:::


::: {#exr-PCA2}
### ScreePlotExplanation
What is a scree plot?
:::


::: {#exr-PCA3}
### LeastSquaresInPCA
Does PCA use least squares?
:::


::: {#exr-PCA4}
### PCASteps
Which steps are performed by PCA?
:::


::: {#exr-PCA5}
### EigenvaluePC1
What is the eigenvalue of the first principal component?
:::


::: {#exr-PCA6}
### DifferencesBetweenPoints
Are the differences between red and yellow the same as the differences between red and blue points?
:::


::: {#exr-PCA7}
### ScalingInPCA
How do you scale data in PCA?
:::


::: {#exr-PCA8}
### DetermineNumberOfComponents
How do you determine the number of principal components?
:::


::: {#exr-PCA9}
### LimitingNumberOfComponents
How is the number of principal components limited?
:::

::: {#exr-tSNE1}
### WhyUseTSNE
Why use t-SNE?
:::


::: {#exr-tSNE2}
### MainIdeaOfTSNE
What is the main idea of t-SNE?
:::


::: {#exr-tSNE3}
### BasicConceptOfTSNE
What is the basic concept of t-SNE?
:::


::: {#exr-tSNE4}
### TSNESteps
What are the steps in t-SNE?
:::


::: {#exr-KMeans1}
### HowKMeansWorks
How does K-means clustering work?
:::


::: {#exr-KMeans2}
### QualityOfClusters
How can the quality of the resulting clusters be calculated?
:::


::: {#exr-KMeans3}
### IncreasingK
Why is it not a good idea to increase k too much?
:::

::: {#exr-DBSCAN1}
### CorePointInDBSCAN
What is a core point in DBSCAN?
:::


::: {#exr-DBSCAN2}
### AddingVsExtending
What is the difference between adding and extending in DBSCAN?
:::


::: {#exr-DBSCAN3}
### OutliersInDBSCAN
What are outliers in DBSCAN?
:::

::: {#exr-KNN1}
### AdvantagesAndDisadvantagesOfK
What are the advantages and disadvantages of k = 1 and k = 100 in K-nearest neighbors?
:::

::: {#exr-NaiveBayes1}
### NaiveBayesFormula
What is the formula for Naive Bayes?
:::

::: {#exr-NaiveBayes2}
### CalculateProbabilities
Calculate the probabilities for a given example using Naive Bayes.
:::

::: {#exr-GaussianNB1}
### UnderflowProblem
Why is underflow a problem in Gaussian Naive Bayes?
:::

::: {#exr-Tree1}
### Tree Usage
For what can we use trees? 
:::

::: {#exr-DTree1}
### Tree Usage

Based on a shown tree graph: 

* How can you use this tree? 
* What is the root node? 
* What are branches and internal nodes? 
* What are the leaves? 
* Are the leaves pure or impure? 
* Which of the leaves is more impure? 
:::



::: {#exr-DTree2}
### Tree Feature Importance
Is the most or least important feature on top? 
:::



::: {#exr-DTree3}
### Tree Feature Imputation
How can you fill a gap/missing data? 
:::

::: {#sol-DTree3}
### Tree Feature Imputation
* Mean of the column
* Median of the column
* Estimate from a column that is highly correlated with the incomplete one
:::
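
The imputation options from the solution above can be illustrated with a small NumPy sketch. All arrays and variable names (`x`, `height`, `weight`) are hypothetical toy data chosen for illustration, and the straight-line fit is only one simple way to exploit a highly correlated column.

```python
import numpy as np

# Hypothetical column with one missing value (NaN) and one large value.
x = np.array([1.0, 2.0, np.nan, 4.0, 13.0])

# Option 1 and 2: fill the gap with the mean or the median of the observed values.
mean_fill = np.where(np.isnan(x), np.nanmean(x), x)
median_fill = np.where(np.isnan(x), np.nanmedian(x), x)
print(mean_fill[2], median_fill[2])

# Option 3: use a highly correlated column. Here a straight line is fitted
# on the complete rows and used to predict the missing entry.
height = np.array([150.0, 160.0, 170.0, 180.0])
weight = np.array([50.0, 60.0, np.nan, 80.0])
known = ~np.isnan(weight)
slope, intercept = np.polyfit(height[known], weight[known], deg=1)
weight_fill = np.where(known, weight, slope * height + intercept)
print(weight_fill.round(1))
```

The mean is pulled towards the large value 13, whereas the median is not, which is why the median is often preferred for skewed columns.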


::: {#exr-RTree1}
### Regression Tree Limitations
What are the limitations of regression trees? 
:::


::: {#exr-RTree2}
### Regression Tree Score
How is the tree score calculated? 
:::



::: {#exr-RTree3}
### Regression Tree Alpha Value Small
What can we say about the tree if the alpha value is small? 
:::


::: {#exr-RTree4}
### Regression Tree Increase Alpha Value 
What happens if you increase alpha? 
:::



::: {#exr-RTree5}
### Regression Tree Pruning
What is the meaning of pruning? 
:::