Clustering

K means clustering

Points to remember

The starting centroids matter.
The final result is not always the same: it can differ depending on where the centroids start.

For understanding only

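To deal with this, scikit-learn can run k-means several times, each with different random starting centroids (controlled by the n_init parameter).
If the runs end up with different groupings, the best grouping is chosen.
For a given k, the best grouping is the one where the distance from the points to their corresponding centroids is the smallest. scikit-learn measures this as inertia, the sum of squared distances from each point to its closest centroid.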
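
The restart idea above can be sketched in plain Python. This is a minimal illustration, not scikit-learn's implementation: the helper names (kmeans_once, best_of_restarts), the toy points, and the seed are all assumptions made for this example. In scikit-learn the equivalent knobs are the n_init parameter and the inertia_ attribute of KMeans.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_once(points, k, rng, max_iter=100):
    """One k-means run from random starting centroids.

    Returns (centroids, labels, inertia).
    """
    # Pick k of the data points as the starting centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # leave an empty cluster's centroid where it is
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    inertia = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, inertia


def best_of_restarts(points, k, n_init=10, seed=0):
    """Run k-means n_init times and keep the run with the smallest inertia."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[2])


# Made-up toy data: three loose groups of three points each.
points = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
    (9.0, 1.0), (9.1, 1.2), (8.9, 0.9),
]
centroids, labels, inertia = best_of_restarts(points, k=3)
print("best inertia over 10 restarts:", round(inertia, 3))
```

A single run can get stuck in a poor grouping if two starting centroids land in the same group, which is why keeping the lowest-inertia run out of several is useful.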

Feature Scaling

For any machine learning algorithm that uses distances as a part of its optimization, it is important to scale your features.

Most Common

Normalizing or Min-Max Scaling - this type of scaling rescales each variable to the range 0 to 1
Standardizing or Z-Score Scaling - this type of scaling rescales each variable to have a mean of 0 and a standard deviation of 1
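
Both kinds of scaling can be written out by hand for a single feature. The sketch below uses only the standard library; the function names and the income values are made up for illustration, and it assumes the feature is not constant (otherwise the denominators would be zero). In scikit-learn, MinMaxScaler and StandardScaler in sklearn.preprocessing cover the same two cases.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Normalizing: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score_scale(values):
    """Standardizing: subtract the mean, divide by the population std."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical feature values, chosen only for this example.
incomes = [20000.0, 35000.0, 50000.0, 80000.0, 110000.0]

normalized = min_max_scale(incomes)
standardized = z_score_scale(incomes)

print("normalized:  ", [round(v, 3) for v in normalized])
print("standardized:", [round(v, 3) for v in standardized])
```

The population standard deviation (pstdev) is used here so that the scaled values have a standard deviation of exactly 1.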

Advantage

Without feature scaling, features with a much larger variance dominate features with a small variance, which is not what we want.
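
To see how a large-scale feature can dominate a distance, the sketch below compares how much each feature contributes to the squared Euclidean distance between two records, before and after min-max scaling. The (age, income) records and the helper names are made up for this example.

```python
def min_max_columns(rows):
    """Rescale each column of `rows` to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]


def squared_diff_shares(p, q):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    parts = [(a - b) ** 2 for a, b in zip(p, q)]
    total = sum(parts)
    return [part / total for part in parts]


# Hypothetical (age, income) records. The first two differ by the full
# age range but only slightly in income.
people = [(25.0, 50000.0), (60.0, 50500.0), (26.0, 58000.0)]

raw_shares = squared_diff_shares(people[0], people[1])
scaled = min_max_columns(people)
scaled_shares = squared_diff_shares(scaled[0], scaled[1])

print("age/income share before scaling:", [round(s, 4) for s in raw_shares])
print("age/income share after scaling: ", [round(s, 4) for s in scaled_shares])
```

Before scaling, almost all of the distance between the first two records comes from income, simply because income is measured in larger numbers. After scaling, the large age gap is what drives the distance.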