# Chapter 1

### Unsupervised Learning

- Unsupervised learning : Machine learning on unlabeled data
- pure pattern discovery with unguided learning
- dimension = number of features or columns in dataset
- Algorithms :
    - clustering: Organize data into similar groups
        - k-means : 
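            - k = number of clusters to form from the samples (rows in dataset); chosen in advance
            - the centroid of each cluster is the mean of the samples assigned to it
            - starts from randomly placed centroids, assigns each sample to its nearest centroid, then moves each centroid to the mean of its assigned samples; repeats until the centroids stop moving or the iteration limit is reached
            - Inertia : sum of squared distances from each sample to its nearest centroid. Good clustering has low inertia, but inertia always decreases as k grows, so look for the k where the drop levels off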
        - Scatterplot visualization
        - t-SNE : 
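            - maps data from a high-dimensional space to 2 (or 3) dimensions for visualization
            - A black box : the axes of the plot have no interpretable meaning. It only gives insight into how the samples group together (eg : how many clusters there may be)
            - Produces different results on different runs unless the random seed is fixed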
        - Hierarchical clustering : 
            - arranges samples into a tree of nested clusters
            - Each leaf of the tree is a single sample
            - In each step the 2 closest clusters merge; the linkage method defines the distance between clusters
                - for n samples, n-1 steps are taken to complete the whole merge process
                - complete linkage = distance between clusters is the maximum distance between their samples
                - single linkage = distance between clusters is the minimum distance between their samples
            - Merging ("agglomerative" clustering) continues until only one cluster is left, or stops once the specified number of clusters is reached
            - The reverse approach, starting from one cluster and splitting repeatedly : Divisive clustering
            - Computation time and memory grow at least quadratically with the number of datapoints (all pairwise distances are needed), so it is a poor choice for large datasets
    - Dimension reduction : Remove redundant features from the data in order to produce a simpler model
        - Treats less informative features as noise and discards them
        - Keeps the patterns in the data in a compressed form
        - PCA:
            - Step 1 decorrelation : 
                - principal components = the directions (axes) along which the samples vary the most
                - rotates the samples so that the principal components line up with the coordinate axes
                - shifts the samples so that each feature has mean 0
                - no information is lost in this step
                - due to the rotation, any correlated features become decorrelated
            - Step 2 dimension reduction :
                - Intrinsic dimension = number of features needed to approximate the dataset
                - keeps the PCA features with high variance and discards the low-variance ones, which are assumed to be noise
        - NMF
            - Non-negative matrix factorization
            - Decomposes each sample as a sum of parts
                - documents : 
                    - combination of common themes (here components = topics)
                    - eg : word-frequency arrays such as tf-idf
                - images : 
                    - combination of common patterns (here components = frequent patterns)
                    - eg : grayscale image
                    - need to flatten from 2D to 1D row in order to feed into NMF
                    - later can be reshaped from 1D to 2D to re-construct the image
            - works with both dense (NumPy) and sparse arrays
            - Components are interpretable, unlike PCA components
            - All sample features must be non-negative
            - reconstruction of sample = sum of (NMF feature value * component)
        - TruncatedSVD : PCA alternative that works on sparse dataset where most entries are 0. (eg : "tf-idf")
- Market basket analysis: Find items that are frequently bought together (eg: pencil and eraser) 
- Anomaly detection : When data appears outside of normal range (outlier). eg: Suspicious Credit Card Transactions
- A list here: https://mlforall.files.wordpress.com/2019/09/machinelearningalgorithms.png
- and here: https://www.theinsaneapp.com/wp-content/uploads/2022/02/Machine-Learning-Algorithms-PDF.png
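
The k-means and inertia notes above can be checked with a small experiment. This is a minimal sketch, assuming scikit-learn's `KMeans` and synthetic data: the three blobs, their centers, and the variable names are made up for illustration. It fits k = 1..5 and records the inertia for each k.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (an assumption for illustration): three well-separated 2-D blobs
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit k-means for several values of k and record the inertia of each model
inertias = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_  # sum of squared distances to the nearest centroid

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

On data like this the inertia should fall sharply up to k = 3 and only slightly afterwards; that "elbow" is a common heuristic for choosing k.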

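The two PCA steps in the notes above (decorrelation, then dimension reduction) can be illustrated as follows. This is a sketch under stated assumptions: it uses scikit-learn's `PCA` on two synthetic, strongly correlated features, and the data and variable names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (an assumption for illustration): y is strongly correlated with x
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=0.3, size=500)
X = np.column_stack([x, y])

# Step 1: keep all components, so PCA only centers and rotates the data
pca = PCA()
X_rot = pca.fit_transform(X)

corr_before = np.corrcoef(X, rowvar=False)[0, 1]
corr_after = np.corrcoef(X_rot, rowvar=False)[0, 1]
print("correlation before:", corr_before)
print("correlation after :", corr_after)          # close to 0
print("feature means     :", X_rot.mean(axis=0))  # close to 0

# No information is lost: the rotation can be undone exactly
X_back = pca.inverse_transform(X_rot)
print("reconstructed     :", np.allclose(X_back, X))

# Step 2: the variance per component hints at the intrinsic dimension
print("variance ratio    :", pca.explained_variance_ratio_)
```

Here almost all of the variance sits on the first component, which suggests an intrinsic dimension of 1; `PCA(n_components=1)` would then keep only that feature.
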
### Normalize data

```
# Standardize each feature (column) to mean 0 and standard deviation 1 before modeling
from sklearn.preprocessing import StandardScaler
X = StandardScaler()\
	.fit(X)\
	.transform(X.astype(float))
# StandardScaler is z-score scaling; MinMaxScaler, RobustScaler etc. are other scalers

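# Different technique : rescale each sample (row) so that its absolute values sum to 1 (L1 norm)
from sklearn.preprocessing import normalize
X = normalize(X.astype(float), norm="l1")

# Same row-wise rescaling as a transformer object (default norm is "l2"), usable in a pipeline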
from sklearn.preprocessing import Normalizer
X = Normalizer()\
	.fit(X)\
	.transform(X.astype(float))

# Alternative approach with scipy : normalize to a standard deviation of 1
from scipy.cluster.vq import whiten
scaled_data = whiten(data) # Works with multi-dimensional data

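# Normalizing without sklearn (df is assumed to be a pandas DataFrame)
# Max scaling : divide by the column maximum
df["feature_scaled"] = df["col"]/ (df["col"].max())
# Min-max Scaling : rescale to the range [0, 1]
df["minmax_scaled"] = (df["col"] - df["col"].min()) / (df["col"].max() - df["col"].min())
# Z-score (pandas .std() uses ddof=1, StandardScaler uses ddof=0, so results differ slightly)
df["z_scaled"] = (df["col"] - df["col"].mean()) / df["col"].std()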
```
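
The snippets above mix two different operations: `StandardScaler` works per feature (column), while `normalize` / `Normalizer` work per sample (row). A minimal sketch of the difference, assuming a tiny made-up array:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, Normalizer

# Toy data (an assumption for illustration): two features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Column-wise: every feature ends up with mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(axis=0), X_std.std(axis=0))

# Row-wise: every sample ends up with absolute values summing to 1 (L1 norm)
X_l1 = Normalizer(norm="l1").fit_transform(X)
print(np.abs(X_l1).sum(axis=1))
```

Column-wise scaling is the usual choice before distance-based clustering such as k-means; row-wise normalization is more common for count-like data where only the proportions within a sample matter.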

# Chapter 2

### Hierarchical clustering

<center><img src="images/02.01.png"  style="width: 400px; height: 300px;"/></center>

```
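import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
# df is assumed to be a DataFrame of numeric features plus a "target" label column
sample = df.drop("target", axis=1)
# Each row of mergings records one merge: the two cluster ids, their distance, the new cluster size
mergings = linkage(sample.values, method='complete', metric='euclidean', optimal_ordering=False)
# Visualize
dendrogram(mergings, labels=df["target"].values, leaf_rotation=90, leaf_font_size=6)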
plt.show()
from scipy.cluster.hierarchy import fcluster
# Cut the tree at height 15 : clusters that would only merge at a distance above 15 stay separate
labels = fcluster(mergings, 15, criterion='distance')
print(labels)
df['predicted_labels'] = labels
# Compare the predicted clusters against the known labels
ct = pd.crosstab(df['predicted_labels'], df['target'])
print(ct)

```
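
To see how the linkage method changes the merge distances, and that n samples take n-1 merges, here is a small sketch on five made-up 1-D points (the data and variable names are assumptions for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five samples on a line (an assumption for illustration)
points = np.array([[0.0], [1.0], [3.0], [7.0], [12.0]])

single = linkage(points, method="single")      # cluster distance = closest pair of samples
complete = linkage(points, method="complete")  # cluster distance = farthest pair of samples

# One row per merge: n samples give n-1 rows
print(single.shape, complete.shape)

# Column 2 holds the distance at which each merge happened
print("single  :", single[:, 2])
print("complete:", complete[:, 2])

# Stop at a fixed number of clusters instead of a distance threshold
labels = fcluster(complete, 2, criterion="maxclust")
print(labels)
```

The first merge is the same for both methods because it joins two single samples; later merges happen at larger distances under complete linkage, since it measures the farthest pair.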

### Measure Timing performance

```
import time
# Measure execution time (perf_counter is a monotonic, high-resolution clock)
start = time.perf_counter()
### Your code .......
end = time.perf_counter()
print(end - start)  # elapsed seconds

# Alternative approach 1 : call a function many times; returns the total time in seconds
import timeit
time_taken = timeit.timeit(example_function, number=1000)

# Alternative approach 2 : IPython / Jupyter magic (not valid in a plain .py script)
%timeit example_function()
```
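
The snippet above leaves `example_function` undefined. A self-contained sketch with a placeholder function (the function body is a made-up example workload) shows how to turn the `timeit` total into a per-call time:

```python
import timeit

def example_function():
    # Placeholder workload (an assumption for illustration)
    return sum(i * i for i in range(1000))

# timeit.timeit returns the TOTAL time in seconds for `number` calls
total = timeit.timeit(example_function, number=1000)
print("per call:", total / 1000)

# timeit.repeat runs the whole measurement several times; the minimum
# is usually the most stable estimate because it has the least interference
runs = timeit.repeat(example_function, number=1000, repeat=5)
best = min(runs) / 1000
print("best per call:", best)
```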