# Lab 8  356

## Agglomerative clustering in Python

`linkage()` performs agglomerative clustering. The two most important parameters are `method` and `metric`. The `method` parameter specifies the linkage criterion, that is, how the distance between two clusters is measured, such as `single`, `complete`, or `centroid`. The `metric` parameter specifies the kind of distance between individual instances, such as Euclidean distance; it is used only when `linkage()` is given raw observations rather than precomputed distances. The rest of the parameters and their allowed values can be found in the scipy documentation for hierarchical clustering.
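As a minimal sketch of how `method` and `metric` are passed, the block below runs `linkage()` twice on a small made-up array (the `points` values are hypothetical and are not taken from the wine data). It only illustrates the call signature; the interpretation of the output is left to the lab questions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical toy data: two tight pairs of points plus one distant point
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.5], [20.0, 0.0]])

# Same observations and the same metric, but two different linkage criteria
Z_single = linkage(points, method='single', metric='euclidean')
Z_complete = linkage(points, method='complete', metric='euclidean')

# For n observations the result has n - 1 rows and 4 columns
print(Z_single.shape)
print(Z_single)
print(Z_complete)
```

Comparing the third column of the two printouts shows that the choice of `method` changes the reported distances even though the data and the metric are identical.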

The `dendrogram()` function plots a dendrogram from the linkage matrix returned by `linkage()`, not from the dataframe itself. The scipy documentation for dendrograms lists the parameters and corresponding values: https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.dendrogram.html
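The following sketch, on hypothetical toy data, shows that `dendrogram()` takes the output of `linkage()` as its input. It uses `no_plot=True` so that nothing is drawn and only the layout dictionary is returned; the letters passed to `labels` are made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical toy data (not the wine data)
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.5], [20.0, 0.0]])
Z = linkage(points, method='single')

# no_plot=True skips the drawing and returns only the computed layout.
# labels needs one entry per original observation.
info = dendrogram(Z, no_plot=True, labels=['a', 'b', 'c', 'd', 'e'])

# 'ivl' holds the leaf labels in the left-to-right order of the plot
print(info['ivl'])
# 'leaves' holds the corresponding row indices of the original data
print(info['leaves'])
```

Dropping `no_plot=True` draws the same dendrogram on the current matplotlib axes.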

Sometimes it is more convenient to structure the data for clustering as a distance matrix. `linkage()` accepts a condensed distance matrix, which is the one-dimensional form returned by `pdist()`. The `squareform()` function from the `scipy.spatial.distance` module converts a square distance matrix into the condensed form and back.
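A short sketch of the two distance-matrix layouts, assuming three made-up points: `pdist()` returns the condensed form, `squareform()` converts between the condensed and square forms, and `linkage()` is given the condensed one.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

# Hypothetical toy data: three points on a line
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Condensed form: one entry per pair of points, n * (n - 1) / 2 values
condensed = pdist(points)

# Square form: n x n, symmetric, zeros on the diagonal
square = squareform(condensed)

# squareform() also converts a square matrix back to the condensed form
back = squareform(square)

# linkage() expects the condensed form when given distances
Z = linkage(back, method='single')

print(condensed)
print(square)
```

If the data already comes as a square distance matrix, converting it with `squareform()` first avoids `linkage()` treating the rows of the matrix as raw observations.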

Researchers studying chemical properties of wines collected data on a sample of white wines in Northern Portugal. A research goal was to cluster wines based on similar chemical properties.

Cluster wines with single linkage.
The code provided creates a dataframe with two features (residual_sugar and fixed_acidity), standardizes the features, computes the pairwise distances between instances, and prints the linkage matrix produced by the clustering.

In [None]:
import pandas as pd

from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

In [None]:
wine = pd.read_csv('wine1.csv')
wine.head()

In [None]:
# Select the features to cluster on
X = wine[['residual_sugar', 'fixed_acidity']]

# Standardize each feature to mean 0 and standard deviation 1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# pdist() calculates the distance between each pair of instances
# in the scaled data (condensed distance matrix)
dist = pdist(X_scaled)

# Perform hierarchical clustering using single linkage
clusterModel = linkage(dist, method='single')

# Perform hierarchical clustering using the centroid method
#clusterModel = linkage(dist, method='centroid')
print(clusterModel)

In [None]:

# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(clusterModel)
plt.title('Dendrogram')
plt.xlabel('Sample index')
plt.ylabel('Distance')
plt.show()


1. What does linkage mean? What does single mean? What is a linkage matrix?
2. Interpret the linkage matrix and dendrogram together: what is happening at each step in the printout of clusterModel?
3. Change the linkage method to centroid. Plot it above, in addition to the first plot. How did your dendrogram change?
4. What does centroid mean?
5. What distance would you choose to decide the number of clusters?
6. What is one way you could use this visualization to understand more about your samples?
7. How might you label your clusters? Try to label your clusters and replot your dendrogram.

## DBSCAN
The main idea behind the DBSCAN algorithm is that connected core points and corresponding boundary points form a single cluster. An instance that is neither a core point nor a boundary point will be classified as an outlier.

The DBSCAN algorithm requires two parameters:

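- epsilon or ε (`eps` in scikit-learn) - the radius of the spherical region (the ε-neighborhood) around each point
- min_samples - the minimum number of samples, or instances, that must lie within a point's ε-neighborhood for that point to be a core point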

Given ε and min_samples, the following steps outline the DBSCAN algorithm:

Step 1: Count the number of points within the ε-neighborhood of each instance, and classify as core points the instances whose ε-neighborhood contains at least min_samples points.

Step 2: Identify the core points that are within the ε-neighborhood of other core points. These connected core points form a single cluster.

Step 3: Assign points that are within the ε-neighborhood of a cluster to that cluster.

Step 4: Label points that are not within the ε-neighborhood of any cluster as outliers.
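The steps above can be sketched by hand with numpy on a small made-up dataset. This is an illustration under stated assumptions, not scikit-learn's implementation: the `points`, `eps`, and `min_samples` values are hypothetical, each point counts itself as part of its own ε-neighborhood (as scikit-learn does), and Step 2 is skipped because the toy data contains only one dense group.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical toy data: a dense square of four points,
# one point near its edge, and one point far away
points = np.array([[0.0, 0.0], [0.0, 0.4], [0.4, 0.0],
                   [0.4, 0.4], [0.9, 0.4], [5.0, 5.0]])
eps = 0.6          # assumed radius of the neighborhood
min_samples = 4    # assumed core-point threshold

# Square matrix of pairwise Euclidean distances
dist = squareform(pdist(points))

# Step 1: count the points in each eps-neighborhood (self included)
neighbor_counts = (dist <= eps).sum(axis=1)
is_core = neighbor_counts >= min_samples

# Steps 3 and 4 (simplified): a non-core point within eps of some
# core point is a boundary point; any other non-core point is an outlier
near_core = (dist[:, is_core] <= eps).any(axis=1)
is_boundary = ~is_core & near_core
is_outlier = ~is_core & ~near_core

print(neighbor_counts)
print(is_core)
print(is_boundary)
print(is_outlier)
```

Changing `eps` or `min_samples` in this sketch and rerunning it shows how the split between core, boundary, and outlier points depends on both parameters.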

In [None]:
from sklearn.cluster import DBSCAN
wineDB = pd.read_csv('wine2.csv')
wineDB

In [None]:

# Create an input matrix with selected features
X_2 = wineDB[['sulphates', 'total_sulfur_dioxide']]

scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X_2))

# Cluster using DBSCAN with default options
dbModel2 = DBSCAN()

# Fit on the scaled features selected above (not X from the first dataset)
dbModel2 = dbModel2.fit(X_scaled)

print(dbModel2.labels_)

In [None]:
wineDB = pd.read_csv('wine2.csv')

# Create an input matrix with selected features
X_2 = wineDB[['density', 'alcohol']]

scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X_2))

# Cluster using DBSCAN with default options
dbModel = DBSCAN()

# Fit on the scaled features selected above (not X from the first dataset)
dbModel = dbModel.fit(X_scaled)
print(dbModel.labels_)

8. What do the -1 labels mean (assuming you got all -1's for the last two input matrices)?
9. How can you fix this? (Hint: try an eps of 0.69 and min_samples=3.) What does this mean?
10. Plot one of your results below.

In [None]:
#!pip install tpot

In [None]:
from tpot import TPOTClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Define TPOT search parameters
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, random_state=42)

# Fit TPOT to find the best pipeline
tpot.fit(X_train, y_train)

# Evaluate the best pipeline on the test set
accuracy = tpot.score(X_test, y_test)
print("Test set accuracy:", accuracy)


# Export the final pipeline code
tpot.export('tpot_iris_pipeline.py')


11. Go to http://epistasislab.github.io/tpot/ and read about this tool.
12. What is TPOT good for? What is it not good for?
13. Which algorithms does TPOT check?
14. In the code above, what do the parameters mean? (Do NOT be this simplistic with your own metrics; the code above is only a sample.)
15. In a new notebook, do:
    - DBSCAN on an aspect of your data, with a visualization
    - TPOT on an aspect of your final project; implement its suggestion and report on your results
    - A dendrogram that is readable and legible (you can take representative samples from your data)
    - Spelling counts from now on! (too many misspelled words in your notebooks)