k-means added
soodoku committed Sep 17, 2015
1 parent c9a6d92 commit 81a9724
Showing 3 changed files with 252 additions and 2 deletions.
Binary file added ds6/kmeans.pdf
Binary file not shown.
244 changes: 244 additions & 0 deletions ds6/kmeans.tex
@@ -0,0 +1,244 @@
\setbeamercolor{normal text}{fg=black}
\setbeamercovered{transparent, still covered={\opaqueness<1->{0}}, again covered={\opaqueness<1->{30}}}
\setbeamertemplate{navigation symbols}{}

\renewcommand*\familydefault{\sfdefault} %% Only if the base font of the document is to be sans serif

\title[DS]{\scalebox{.20}{\includegraphics{specialk.png}}\\ $K$-means Clustering}
\href{}{twitter} \textbf{|} \href{}{github}}



setwd(paste0(githubdir, "data-science/ds6/"))
tools::texi2dvi("kmeans.tex", pdf=TRUE,clean=TRUE)


\frametitle{Unsupervised Learning}
\item[-]<2> Everything is dimension reduction
\item[-]<3> In supervised learning, labels supervise dimension reduction\\
\item[-]<4> For instance, regression is about finding a low dimensional representation of $Y$
\item[-]<5> Supervised Learning $\sim$ Given Apples and Oranges, learn traits of Apples vs. Oranges
\item[-]<6> Unsupervised Learning $\sim$ Given a bunch of spherical fruits, optimally describe types of fruits

\frametitle{Ways to Think About Unsupervised Learning}
\item[-]<2>Learning the probability model of the data $p(x_n \mid x_1, \ldots, x_{n-1})$
\item[-]<3>\textbf{Applications:} Outlier detection, Data compression
\item[-]<4>Find rows similar to each other, groups of rows dissimilar to each other
\item[-]<5>Find columns similar to each other, groups of columns dissimilar to each other
\item[-]<6>\textbf{Applications:} Group movies by ratings, Segment shoppers

\item[-]<1-3>Two kinds of methods:
\item[-]<2->Principal components analysis
\item[-]<4>Clustering looks to partition data into similar subgroups
\item[-]<5-7>Two popular methods:
\item[-]<6-> Hierarchical clustering (computationally expensive)
\item[-]<7> $k$-means clustering (pre-specify $k$)

\only<1>{\scriptsize{Source: \href{}{Pattern Recognition and Machine Learning}}}


\frametitle{$k$-Means Clustering}
\item[-]<1-3>$k$-means: Assume that we must split data into $k$ clusters
\item[-]<2-5>Each observation belongs to one cluster
\item[-]<3-5>No observation belongs to more than one cluster
\item[-]<4>Find partitioning that minimizes within cluster variation summed over all $k$
\item[-]<5>Within-cluster variation: squared Euclidean distance between every pair of observations in a cluster, scaled by cluster size\\\normalsize
\min_{C_1,\ldots,C_K} \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2
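The objective above can be checked numerically. The following is a minimal sketch in plain Python, not part of the original slides; the function names and the toy `clusters` data are illustrative assumptions. It computes the pairwise within-cluster variation and compares it with the equivalent centroid form (twice the sum of squared distances to the cluster centroid), which is the identity that makes the centroid-based algorithm reduce this objective.

```python
from itertools import product


def pairwise_variation(cluster):
    """(1/|C_k|) * sum over all ordered pairs (i, i') of squared Euclidean distance."""
    total = 0.0
    for a, b in product(cluster, repeat=2):
        total += sum((x - y) ** 2 for x, y in zip(a, b))
    return total / len(cluster)


def centroid_variation(cluster):
    """2 * sum of squared distances from each point to the cluster centroid."""
    n = len(cluster)
    p = len(cluster[0])
    centroid = [sum(pt[j] for pt in cluster) / n for j in range(p)]
    return 2.0 * sum(
        sum((pt[j] - centroid[j]) ** 2 for j in range(p)) for pt in cluster
    )


def objective(clusters):
    """Within-cluster variation summed over all clusters."""
    return sum(pairwise_variation(c) for c in clusters)


# Hypothetical toy data: two clusters of 2-D points.
clusters = [
    [(1.0, 2.0), (2.0, 1.0), (1.5, 1.5)],
    [(8.0, 8.0), (9.0, 10.0)],
]

for c in clusters:
    print(pairwise_variation(c), centroid_variation(c))
```

Both columns printed for each cluster should agree up to floating-point error.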

\frametitle{$k$-Means (Lloyd's) Algorithm}
\item[-]<2>Randomly assign observations to 1 of $k$ clusters
\item[-]<4-> For each of the $k$ clusters, compute the centroid
\item[-]<5-> Assign each observation to cluster whose centroid is closest
\only<1-6>{\scriptsize{Source: James et al. 2015}}
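The steps on this slide can be sketched as follows. This is an illustrative plain-Python version under stated assumptions, not the implementation used for the slides: the name `lloyd_kmeans`, the `seed` and `max_iter` parameters, and the rule for re-seeding an empty cluster with a random observation are all choices made here for the example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[j] for p in points) / n for j in range(len(points[0])))


def lloyd_kmeans(data, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: randomly assign each observation to one of k clusters.
    labels = [rng.randrange(k) for _ in data]
    centers = []
    for _ in range(max_iter):
        # Step 2: compute the centroid of each cluster.
        centers = []
        for c in range(k):
            members = [x for x, lab in zip(data, labels) if lab == c]
            # Assumption: an empty cluster is re-seeded with a random observation.
            centers.append(centroid(members) if members else rng.choice(data))
        # Step 3: assign each observation to the cluster with the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(x, centers[c])) for x in data
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centers


# Hypothetical toy data: two visually separated groups of points.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = lloyd_kmeans(data, k=2)
print(labels, centers)
```

Different seeds can give different label numberings and, on harder data, different local minima.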



\frametitle{$k$-Means Algorithm}
\item[-]<0>Randomly assign observations to 1 of $k$ clusters
\item[-]<1-> For each of the $k$ clusters, compute the centroid
\item[-]<1-> Assign each observation to cluster whose centroid is closest
\item[-]<2>Why does it work?
\item[-]<3>It doesn't always. Each iteration lowers the objective, but only a local minimum is guaranteed.
\item[-]<5->Forgy: Randomly choose $k$ observations and set them as centroids.
\item[-]<6->Random Partition: Assign each observation randomly to one of the clusters.
\item[-]<7->Run an alternate clustering algorithm on a small sample and use the clusters as initial centroids
\item[-]<8> Pick dispersed points as centroids, e.g., $k$-means++ and variations of it.
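Two of the initialization strategies listed above can be sketched in a few lines. This is a hedged illustration rather than a reference implementation: `forgy_init` and `kmeans_pp_init` are names chosen here, and the sketch assumes the data contain at least $k$ distinct points (otherwise the sampling weights can all be zero). The $k$-means++ rule shown is the standard one: each new centroid is drawn with probability proportional to its squared distance from the nearest centroid chosen so far.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def forgy_init(data, k, seed=0):
    """Forgy: pick k observations at random and use them as centroids."""
    return random.Random(seed).sample(data, k)


def kmeans_pp_init(data, k, seed=0):
    """k-means++ seeding: favor points far from the centroids already chosen."""
    rng = random.Random(seed)
    centers = [rng.choice(data)]  # first centroid: uniform at random
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen centroid.
        d2 = [min(sq_dist(x, c) for c in centers) for x in data]
        # Draw the next centroid with probability proportional to d2.
        centers.append(rng.choices(data, weights=d2, k=1)[0])
    return centers


# Hypothetical toy data with four distinct points.
data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (20.0, 0.0)]
print(forgy_init(data, 3))
print(kmeans_pp_init(data, 3))
```

Because an already chosen point has weight zero, the $k$-means++ sketch never picks the same observation twice when the data points are distinct.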

\frametitle{Distance between clusters}
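\item[-]<1>Complete Linkage\\\normalsize
Farthest distance between points in clusters
\item[-]<2>Single Linkage\\\normalsize
Closest pair
\item[-]<3>Average Linkage\\\normalsize
All pairs, and then take the average
\item[-]<4>Centroid Linkage\\\normalsize
Has problems called inversions\\
Used in Genomics
\item[-]<5>Complete and Average most commonly used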
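The between-cluster distances described on this slide can be written down directly. The sketch below is illustrative only; the function names are chosen here, and it assumes Euclidean distance between points (via `math.dist`, available in Python 3.8+).

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def complete_linkage(a, b):
    """Farthest distance between a point in a and a point in b."""
    return max(dist(x, y) for x in a for y in b)


def single_linkage(a, b):
    """Distance between the closest pair of points across the two clusters."""
    return min(dist(x, y) for x in a for y in b)


def average_linkage(a, b):
    """Average distance over all cross-cluster pairs."""
    return sum(dist(x, y) for x in a for y in b) / (len(a) * len(b))


def centroid_linkage(a, b):
    """Distance between the two cluster centroids."""
    def centroid(pts):
        return tuple(sum(p[j] for p in pts) / len(pts) for j in range(len(pts[0])))
    return dist(centroid(a), centroid(b))


# Hypothetical toy clusters.
A = [(0.0, 0.0), (0.0, 1.0)]
B = [(3.0, 0.0), (4.0, 0.0)]
for f in (complete_linkage, single_linkage, average_linkage, centroid_linkage):
    print(f.__name__, f(A, B))
```

For any pair of clusters, the single-linkage distance is at most the average-linkage distance, which is at most the complete-linkage distance.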

\frametitle{Practical Issues}
\item[-]<1-4>Choice of Similarity Measure
\item[-]<2->Scaling Matters
\item[-]<3->Jaccard similarity: can be estimated quickly with MinHash and locality-sensitive hashing (LSH)
\item[-]<4->Correlation based measures (+/- may matter)
\item[-]<5>High-dimensional data: solutions include, e.g., DANN
\item[-]<6-8>Choosing $k$:
\item[-]<7->Calculate average distance to centroid for multiple $k$
\item[-]<8>Plot them, look for the \emph{knee}\\
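For the Jaccard bullet above, the following sketch shows how a MinHash signature approximates Jaccard similarity between two sets. It is an illustration under assumptions: the seeded MD5-based hash family, the signature length of 200, and the function names are choices made here, and the LSH banding step that would make nearest-neighbor search fast is not shown.

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard similarity: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def _hash(item, seed):
    """Deterministic 64-bit hash of an item for a given seed (assumed hash family)."""
    digest = hashlib.md5(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=200):
    """For each seeded hash function, keep the minimum hash value over the set."""
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two minima agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


# Hypothetical toy sets.
a = {"apple", "banana", "cherry", "date"}
b = {"banana", "cherry", "date", "elderberry", "fig"}
print(jaccard(a, b), estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

The estimate gets closer to the exact value as the number of hash functions grows; the signatures are much smaller than the sets when the sets are large.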
\frametitle{A(N)alyst choose $k$ contd.}
\item[-]<1-5>Calinski-Harabasz (CH) Index:
\item[-]<2->Between Cluster, $B = \sum_{k=1}^{K} n_k \lVert \bar{X}_k - \bar{X} \rVert^2$
\item[-]<3->Within Cluster, $W = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert X_i - \bar{X}_k \rVert^2$
\item[-]<4->Maximize Between Cluster Variation, Minimize Within Cluster Variation
\item[-]<5->$\text{CH(K)} = \frac{B(K)}{(K-1)}\frac{n - K}{W(K)}$
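A small worked example of the CH index may help. The sketch below is an assumption-laden illustration, not code from the slides: it takes $B$ as the size-weighted squared distance of each cluster mean from the grand mean and $W$ as the squared distance of each point from its own cluster mean, and the function name `ch_index` is chosen here. It requires at least two clusters and $W > 0$.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[j] for p in points) / n for j in range(len(points[0])))


def ch_index(data, labels):
    """Calinski-Harabasz index: (B / (K - 1)) * ((n - K) / W)."""
    n = len(data)
    clusters = {}
    for x, lab in zip(data, labels):
        clusters.setdefault(lab, []).append(x)
    K = len(clusters)
    grand = mean(data)
    # Between-cluster variation: cluster means around the grand mean.
    B = sum(len(m) * sq_dist(mean(m), grand) for m in clusters.values())
    # Within-cluster variation: points around their own cluster mean.
    W = sum(sq_dist(x, mean(m)) for m in clusters.values() for x in m)
    return (B / (K - 1)) * ((n - K) / W)


# Hypothetical 1-D toy data with one sensible and one poor labeling.
data = [(0.0,), (2.0,), (10.0,), (12.0,)]
good = [0, 0, 1, 1]
bad = [0, 1, 0, 1]
print(ch_index(data, good), ch_index(data, bad))
```

Here the sensible labeling has $B = 100$, $W = 4$ and CH $= 50$, while the poor labeling has $B = 4$, $W = 100$ and CH $= 0.08$, so a larger CH value indicates the better partition.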

\item[-]<6-8>Gap Statistic (Tibshirani):
\item[-]<6->Compare observed $W(K)$ to $W_{\text{unif}}(K)$
\item[-]<7->$\text{GAP}(K) = \log W_{\text{unif}}(K) - \log W(K)$
\item[-]<8->Calculate $W_{\text{unif}}(K)$ by simulation.
\frametitle{Running Time}
\item[-]<1> $O(kn)$ distance computations for each iteration.
\item[-]<2> But the number of iterations can be very large: the worst case is superpolynomial in $n$.
\item[-]<3> But in practice, it converges quickly; smoothed running time is polynomial.
\item[-]<4-6> Big (Long) Data Solutions:
\item[-]<5->Bradley-Fayyad-Reina (BFR)
\item[-]<7->Assumes clusters are normally distributed around a centroid in Euclidean space.
\item[-]<8->Exploit that to quantify the likelihood that a point belongs to a cluster
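The BFR idea in the bullets above can be sketched with per-cluster summary statistics. This is an illustrative sketch only: the class name `ClusterSummary` and its methods are hypothetical, and it assumes independent (axis-aligned) normal dimensions with non-zero variance. Each cluster is summarized by the count, the per-dimension sum, and the per-dimension sum of squares, from which the centroid and variance follow; a normalized (Mahalanobis-style) distance then measures how many standard deviations a point lies from the centroid.

```python
from math import sqrt


class ClusterSummary:
    """Summary of a cluster: N, SUM and SUMSQ per dimension (hypothetical helper)."""

    def __init__(self, dim):
        self.n = 0
        self.sum = [0.0] * dim
        self.sumsq = [0.0] * dim

    def add(self, point):
        """Fold a point into the summary without storing the point itself."""
        self.n += 1
        for j, x in enumerate(point):
            self.sum[j] += x
            self.sumsq[j] += x * x

    def centroid(self):
        return [s / self.n for s in self.sum]

    def variance(self):
        # Var = E[x^2] - (E[x])^2 in each dimension.
        return [sq / self.n - (s / self.n) ** 2 for s, sq in zip(self.sum, self.sumsq)]

    def mahalanobis(self, point):
        """Distance to the centroid in units of per-dimension standard deviations.

        Assumes every dimension has non-zero variance.
        """
        c, v = self.centroid(), self.variance()
        return sqrt(sum((x - cj) ** 2 / vj for x, cj, vj in zip(point, c, v)))


# Hypothetical toy cluster.
summary = ClusterSummary(dim=2)
for pt in [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]:
    summary.add(pt)
print(summary.centroid(), summary.variance(), summary.mahalanobis((3.0, 1.0)))
```

A new point would be assigned to a cluster only when this distance falls below a chosen threshold; the threshold itself is a tuning choice not specified in the slides.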
10 changes: 8 additions & 2 deletions
@@ -39,14 +39,20 @@ Data Science: Some Basics
- Assessing model fit
- Clarification about Big Data

5. Presenting Analyses
5. Supervised Methods

6. Unsupervised Methods
- k-means ([presentation](ds6/kmeans.pdf), [tex](ds6/kmeans.tex))

7. Presenting Analyses
- [ggplot2 in brief](graphs/
- Examples of ggplot in action:
- NYT Civil Rights Coverage ([R code](, [Graph](
- Military Experience of UK Prime Ministers ([R code](, [Graph](
- [Suggestions for writing](

6. Some Applications
8. Some Applications
- From paper to digital ([presentation](app/PaperToDigital.pdf), [tex](app/PaperToDigital.tex))
- Text as Data
- [Sentiment Analysis](