# Unsupervised Learning with scikit-learn

## Table of contents:

* <a href=#Clust>Clustering for dataset exploration</a>
* <a href=#Viz>Visualization with hierarchical clustering and t-SNE</a>
* <a href=#Decorr>Decorrelating your data and dimension reduction</a>
* <a href=#Disc>Discovering interpretable features</a>

## Load Packages and Set Global Variables

<a id="imports"></a>

In [21]:
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, Normalizer, normalize
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA, TruncatedSVD, NMF
from sklearn.feature_extraction.text import TfidfVectorizer


from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.stats import pearsonr

## Global Variables

All embeddings and clusterings can be saved and loaded into this script. Be careful about overwriting cluster caches once cell type annotation has started, as cluster labels may be shuffled.

Set whether anndata objects are recomputed or loaded from cache.

In [2]:
bool_recomp = False

Set whether clustering is recomputed or loaded from saved .obs file. Loading makes sense if the clustering changes due to a change in scanpy or one of its dependencies and the number of clusters or the cluster labels change accordingly.

In [3]:
bool_recluster = False

Set whether cluster cache is overwritten. Note that the cache exists for reproducibility of clustering, see above.

In [4]:
bool_write_cluster_cache = False

Set whether to produce plots, set to False for test runs.

In [5]:
bool_plot = False

Set whether observations should be calculated. If False, a cached file containing the necessary information must be read. The notebook then shows the distributions of counts and genes, as well as mt_frac, after filtering.
Set to True to see the data before filtering and follow the decisions for cutoffs.

In [6]:
bool_create_observations = True

<a id="Dataloading"></a>

## Clustering for dataset exploration

Clustering 2D points

In [7]:
if bool_recomp == True:   
    # Create a KMeans instance with 3 clusters: model
    model = KMeans(n_clusters=3)

    # Fit model to points
    model.fit(points)

    # Determine the cluster labels of new_points: labels
    labels = model.predict(new_points)

    # Print cluster labels of new_points
    print(labels)
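
The cell above relies on `points` and `new_points`, which are not defined in this notebook. As a minimal, self-contained sketch of the same fit/predict pattern, the block below uses synthetic Gaussian blobs as hypothetical stand-ins for both arrays; the blob centers, sample sizes, `n_init` and `random_state` are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-ins for `points` and `new_points`: three 2D Gaussian blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + 0.3 * rng.standard_normal((50, 2)) for c in centers])
new_points = np.vstack([c + 0.3 * rng.standard_normal((10, 2)) for c in centers])

# Fit on one set of points, then assign unseen points to the learned centroids.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
model.fit(points)
labels = model.predict(new_points)

print(labels)
```

The integer assigned to each cluster is arbitrary, so only the grouping of the labels is meaningful, not their values.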

Inspect your clustering

In [8]:
if bool_recomp == True: 
    # Assign the columns of new_points: xs and ys
    xs = new_points[:,0]
    ys = new_points[:,1]

    # Make a scatter plot of xs and ys, using labels to define the colors
    plt.scatter(xs, ys, c=labels, alpha=0.5)

    # Assign the cluster centers: centroids
    centroids = model.cluster_centers_

    # Assign the columns of centroids: centroids_x, centroids_y
    centroids_x = centroids[:,0]
    centroids_y = centroids[:,1]

    # Make a scatter plot of centroids_x and centroids_y
    plt.scatter(centroids_x, centroids_y, marker='D', s=50)
    plt.show()

How many clusters of grain?

In [9]:
if bool_recomp == True: 
    ks = range(1, 6)
    inertias = []

    for k in ks:
        # Create a KMeans instance with k clusters: model
        model = KMeans(n_clusters=k)

        # Fit model to samples
        model.fit(samples)

        # Append the inertia to the list of inertias
        inertias.append(model.inertia_)

    # Plot ks vs inertias
    plt.plot(ks, inertias, '-o')
    plt.xlabel('number of clusters, k')
    plt.ylabel('inertia')
    plt.xticks(ks)
    plt.show()
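
The inertia plot is usually read by looking for an "elbow", the value of k after which inertia stops dropping sharply. The sketch below reproduces the loop without plotting, on synthetic data with three blobs standing in for `samples` (an assumption, as are `n_init` and `random_state`), and prints how much inertia falls at each step.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for `samples`: three well-separated 2D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
samples = np.vstack([c + 0.3 * rng.standard_normal((50, 2)) for c in centers])

ks = range(1, 6)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(samples)
    inertias.append(model.inertia_)

# Drop in inertia when going from k to k + 1 clusters.
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
for k, drop in zip(ks, drops):
    print(f"k={k} -> k={k + 1}: inertia drops by {drop:.1f}")
```

On data like this, the drop should shrink markedly once k exceeds the true number of blobs, which is the pattern to look for in the plot.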

Evaluating the grain clustering

In [10]:
if bool_recomp == True: 
    # Create a KMeans model with 3 clusters: model
    model = KMeans(n_clusters=3)

    # Use fit_predict to fit model and obtain cluster labels: labels
    labels = model.fit_predict(samples)

    # Create a DataFrame with labels and varieties as columns: df
    df = pd.DataFrame({'labels': labels, 'varieties': varieties})

    # Create crosstab: ct
    ct = pd.crosstab(df['labels'],df['varieties'])

    # Display ct
    print(ct)

Scaling fish data for clustering

In [12]:
if bool_recomp == True: 
    # Create scaler: scaler
    scaler = StandardScaler()

    # Create KMeans instance: kmeans
    kmeans = KMeans(n_clusters=4)

    # Create pipeline: pipeline
    pipeline = make_pipeline(scaler, kmeans)
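
KMeans uses Euclidean distances, so a feature with a much larger variance can dominate the clustering; `StandardScaler` rescales each column to zero mean and unit variance first. The sketch below illustrates this on made-up measurements (the `weight` and `ratio` columns are hypothetical, not the fish data) and inspects the scaler step inside the fitted pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for `samples`: two features on very different scales.
rng = np.random.default_rng(0)
weight = rng.normal(500.0, 200.0, size=100)  # large variance
ratio = rng.normal(0.3, 0.05, size=100)      # small variance
samples = np.column_stack([weight, ratio])

pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=4, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(samples)

# make_pipeline names each step after its lowercased class name.
scaled = pipeline.named_steps['standardscaler'].transform(samples)
print(scaled.mean(axis=0).round(3), scaled.std(axis=0).round(3))
```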

Clustering the fish data

In [14]:
if bool_recomp == True: 
    # Fit the pipeline to samples
    pipeline.fit(samples)

    # Calculate the cluster labels: labels
    labels = pipeline.predict(samples)

    # Create a DataFrame with labels and species as columns: df
    df = pd.DataFrame({'labels': labels, 'species': species})

    # Create crosstab: ct
    ct = pd.crosstab(df['labels'],df['species'])

    # Display ct
    print(ct)

Clustering stocks using KMeans

In [None]:
if bool_recomp == True:
    # Create a normalizer: normalizer
    normalizer = Normalizer()

    # Create a KMeans model with 10 clusters: kmeans
    kmeans = KMeans(n_clusters=10)

    # Make a pipeline chaining normalizer and kmeans: pipeline
    pipeline = make_pipeline(normalizer, kmeans)

    # Fit pipeline to the daily price movements
    pipeline.fit(movements)
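
Unlike `StandardScaler`, which rescales each column, `Normalizer` rescales each row (here, each stock) to unit length, so that stocks are compared by the shape of their movements rather than their magnitude. A minimal sketch with a tiny made-up `movements` array:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical price movements: rows are stocks, columns are days.
movements = np.array([
    [3.0, 4.0],
    [0.3, 0.4],    # same pattern as the first row, ten times smaller
    [-6.0, 8.0],
])

normalized = Normalizer().fit_transform(movements)
row_norms = np.linalg.norm(normalized, axis=1)

print(normalized)
print(row_norms)
```

After normalization the first two rows coincide, since they differ only in scale.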

Which stocks move together?

In [None]:
if bool_recomp == True:
    # Predict the cluster labels: labels
    labels = pipeline.predict(movements)

    # Create a DataFrame aligning labels and companies: df
    df = pd.DataFrame({'labels': labels, 'companies': companies})

    # Display df sorted by cluster label
    print(df.sort_values(by='labels'))

## Visualization with hierarchical clustering and t-SNE

Hierarchical clustering of the grain data

In [None]:
if bool_recomp == True:
    # Calculate the linkage: mergings
    mergings = linkage(samples,method='complete')

    # Plot the dendrogram, using varieties as labels
    dendrogram(mergings,
               labels=varieties,
               leaf_rotation=90,
               leaf_font_size=6,
    )
    plt.show()

Hierarchies of stocks

In [15]:
if bool_recomp == True:
    # Normalize the movements: normalized_movements
    normalized_movements = normalize(movements)

    # Calculate the linkage: mergings
    mergings = linkage(normalized_movements, method='complete')

    # Plot the dendrogram
    dendrogram(mergings,
               labels=companies,
               leaf_rotation=90,
               leaf_font_size=6
    )
    plt.show()

Different linkage, different hierarchical clustering!

In [None]:
if bool_recomp == True:
    # Calculate the linkage: mergings
    mergings = linkage(samples, method='single')

    # Plot the dendrogram
    dendrogram(mergings,
               labels=country_names,
               leaf_rotation=90,
               leaf_font_size=6
    )
    plt.show()

Extracting the cluster labels

In [None]:
if bool_recomp == True:
    # Use fcluster to extract labels: labels
    labels = fcluster(mergings, 6, criterion='distance')

    # Create a DataFrame with labels and varieties as columns: df
    df = pd.DataFrame({'labels': labels, 'varieties': varieties})

    # Create crosstab: ct
    ct = pd.crosstab(df['labels'],df['varieties'])

    # Display ct
    print(ct)
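
With `criterion='distance'`, `fcluster` cuts the dendrogram at the given height: observations that were merged below that height end up with the same label. The sketch below shows this on two synthetic blobs (a hypothetical stand-in for `samples`), where a cut at height 6 lies above every within-blob merge and below the merge that joins the two blobs.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical stand-in for `samples`: two tight blobs far apart.
rng = np.random.default_rng(0)
blob_a = rng.normal(0.0, 0.3, size=(20, 2))
blob_b = rng.normal(10.0, 0.3, size=(20, 2))
samples = np.vstack([blob_a, blob_b])

mergings = linkage(samples, method='complete')

# Cut the hierarchy at height 6; fcluster labels start at 1, not 0.
labels = fcluster(mergings, 6, criterion='distance')

print(labels)
```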

t-SNE visualization of grain dataset

In [17]:
if bool_recomp == True:
    # Create a TSNE instance: model
    model = TSNE(learning_rate=200)

    # Apply fit_transform to samples: tsne_features
    tsne_features = model.fit_transform(samples)

    # Select the 0th feature: xs
    xs = tsne_features[:,0]

    # Select the 1st feature: ys
    ys = tsne_features[:,1]

    # Scatter plot, coloring by variety_numbers
    plt.scatter(xs, ys, c=variety_numbers)
    plt.show()
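
For readers without the grain data, the following sketch runs the same `fit_transform` call on synthetic 5-dimensional samples (an assumption) and checks only the shape of the result. `TSNE` has no separate `transform` method, and the axes of the map carry no interpretable meaning; `perplexity` and `random_state` are set here purely for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical stand-in for `samples`: two groups in 5 dimensions.
rng = np.random.default_rng(0)
samples = np.vstack([
    rng.normal(0.0, 1.0, size=(30, 5)),
    rng.normal(8.0, 1.0, size=(30, 5)),
])

# perplexity must be smaller than the number of samples.
model = TSNE(learning_rate=200, perplexity=10, random_state=0)
tsne_features = model.fit_transform(samples)

print(tsne_features.shape)  # (60, 2)
```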

A t-SNE map of the stock market

In [None]:
if bool_recomp == True:
    # Create a TSNE instance: model
    model = TSNE(learning_rate=50)

    # Apply fit_transform to normalized_movements: tsne_features
    tsne_features = model.fit_transform(normalized_movements)

    # Select the 0th feature: xs
    xs = tsne_features[:,0]

    # Select the 1st feature: ys
    ys = tsne_features[:,1]

    # Scatter plot
    plt.scatter(xs, ys, alpha=0.5)

    # Annotate the points
    for x, y, company in zip(xs, ys, companies):
        plt.annotate(company, (x, y), fontsize=5, alpha=0.75)
    plt.show()

## Decorrelating your data and dimension reduction

Correlated data in nature

In [18]:
if bool_recomp == True:
    # Assign the 0th column of grains: width
    width = grains[:,0]

    # Assign the 1st column of grains: length
    length = grains[:,1]

    # Scatter plot width vs length
    plt.scatter(width, length)
    plt.axis('equal')
    plt.show()

    # Calculate the Pearson correlation
    correlation, pvalue = pearsonr(width, length)

    # Display the correlation
    print(correlation)

Decorrelating the grain measurements with PCA

In [19]:
if bool_recomp == True:
    # Create PCA instance: model
    model = PCA()

    # Apply the fit_transform method of model to grains: pca_features
    pca_features = model.fit_transform(grains)

    # Assign 0th column of pca_features: xs
    xs = pca_features[:,0]

    # Assign 1st column of pca_features: ys
    ys = pca_features[:,1]

    # Scatter plot xs vs ys
    plt.scatter(xs, ys)
    plt.axis('equal')
    plt.show()

    # Calculate the Pearson correlation of xs and ys
    correlation, pvalue = pearsonr(xs, ys)

    # Display the correlation
    print(correlation)
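
The two cells above can be condensed into one self-contained check: PCA rotates the data onto its principal axes, so the resulting features are linearly uncorrelated on the data the model was fitted to. The `grains` array below is synthetic (made-up width and length values with a linear relationship), not the real grain measurements.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.decomposition import PCA

# Hypothetical stand-in for `grains`: length depends linearly on width plus noise.
rng = np.random.default_rng(0)
width = rng.normal(3.0, 0.4, size=200)
length = 1.8 * width + rng.normal(0.0, 0.2, size=200)
grains = np.column_stack([width, length])

corr_before, _ = pearsonr(grains[:, 0], grains[:, 1])

pca_features = PCA().fit_transform(grains)
corr_after, _ = pearsonr(pca_features[:, 0], pca_features[:, 1])

print(f"before PCA: {corr_before:.3f}, after PCA: {corr_after:.3f}")
```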

The first principal component

In [None]:
if bool_recomp == True:
    # Make a scatter plot of the untransformed points
    plt.scatter(grains[:,0], grains[:,1])

    # Create a PCA instance: model
    model = PCA()

    # Fit model to grains
    model.fit(grains)

    # Get the mean of the grain samples: mean
    mean = model.mean_

    # Get the first principal component: first_pc
    first_pc = model.components_[0,:]

    # Plot first_pc as an arrow, starting at mean
    plt.arrow(mean[0], mean[1], first_pc[0], first_pc[1], color='red', width=0.01)

    # Keep axes on same scale
    plt.axis('equal')
    plt.show()

Variance of the PCA features

In [None]:
if bool_recomp == True:
    # Create scaler: scaler
    scaler = StandardScaler()

    # Create a PCA instance: pca
    pca = PCA()

    # Create pipeline: pipeline
    pipeline = make_pipeline(scaler, pca)

    # Fit the pipeline to 'samples'
    pipeline.fit(samples)

    # Plot the explained variances
    features = range(pca.n_components_)
    plt.bar(features, pca.explained_variance_)
    plt.xlabel('PCA feature')
    plt.ylabel('variance')
    plt.xticks(features)
    plt.show()
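
The bar chart is typically used to estimate the intrinsic dimension: the number of PCA features with non-negligible variance. A minimal sketch, assuming synthetic 3-feature data in which the third feature is almost exactly the sum of the first two, so only two PCA features should carry meaningful variance:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for `samples`: z is nearly determined by x and y.
rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = rng.normal(size=300)
z = x + y + rng.normal(scale=0.01, size=300)
samples = np.column_stack([x, y, z])

pca = PCA()
pipeline = make_pipeline(StandardScaler(), pca)
pipeline.fit(samples)

# Variances are reported in decreasing order, one per PCA feature.
print(pca.explained_variance_.round(4))
```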

Dimension reduction of the fish measurements

In [21]:
if bool_recomp == True:
    # Create a PCA model with 2 components: pca
    pca = PCA(n_components=2)

    # Fit the PCA instance to the scaled samples
    pca.fit(scaled_samples)

    # Transform the scaled samples: pca_features
    pca_features = pca.transform(scaled_samples)

    # Print the shape of pca_features
    print(pca_features.shape)

A tf-idf word-frequency array

In [22]:
if bool_recomp == True:
    # Create a TfidfVectorizer: tfidf
    tfidf = TfidfVectorizer()

    # Apply fit_transform to documents: csr_mat
    csr_mat = tfidf.fit_transform(documents)

    # Print result of toarray() method
    print(csr_mat.toarray())

    # Get the words: words (get_feature_names() was removed in scikit-learn 1.2)
    words = tfidf.get_feature_names_out()

    # Print words
    print(words)

Clustering Wikipedia part I - Create a Pipeline object consisting of a TruncatedSVD followed by KMeans.

In [23]:
if bool_recomp == True:
    # Create a TruncatedSVD instance: svd
    svd = TruncatedSVD(n_components=50)

    # Create a KMeans instance: kmeans
    kmeans = KMeans(n_clusters=6)

    # Create a pipeline: pipeline
    pipeline = make_pipeline(svd, kmeans)

Clustering Wikipedia part II

In [24]:
if bool_recomp == True:
    # Fit the pipeline to articles
    pipeline.fit(articles)

    # Calculate the cluster labels: labels
    labels = pipeline.predict(articles)

    # Create a DataFrame aligning labels and titles: df
    df = pd.DataFrame({'label': labels, 'article': titles})

    # Display df sorted by cluster label
    print(df.sort_values(by='label'))

## Discovering interpretable features

NMF applied to Wikipedia articles

In [26]:
if bool_recomp == True:
    # Create an NMF instance: model
    model = NMF(n_components=6)

    # Fit the model to articles
    model.fit(articles)

    # Transform the articles: nmf_features
    nmf_features = model.transform(articles)

    # Print the NMF features
    print(nmf_features.round(2))
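
NMF factorizes a non-negative matrix into non-negative features and components whose product approximates the original. The sketch below uses a tiny made-up word-count matrix in place of `articles`; the `init`, `max_iter` and `random_state` settings are assumptions chosen to keep the toy example stable.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical word-count matrix: 4 documents x 5 words, with two rough "topics".
articles = np.array([
    [3, 2, 0, 0, 0],
    [4, 3, 0, 0, 1],
    [0, 0, 5, 4, 0],
    [0, 1, 4, 5, 0],
], dtype=float)

model = NMF(n_components=2, init='nndsvda', max_iter=1000, random_state=0)
nmf_features = model.fit_transform(articles)

# Each row of `articles` is approximated by nmf_features @ components_.
reconstruction = nmf_features @ model.components_

print(nmf_features.round(2))
print(model.components_.round(2))
```

Because every entry is non-negative, each document is expressed as an additive combination of components, which is what makes the components readable as topics.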

NMF features of the Wikipedia articles

In [27]:
if bool_recomp == True:
    # Create a pandas DataFrame: df
    df = pd.DataFrame(nmf_features, index=titles)

    # Print the row for 'Anne Hathaway'
    print(df.loc['Anne Hathaway'])

    # Print the row for 'Denzel Washington'
    print(df.loc['Denzel Washington'])

NMF learns topics of documents

In [28]:
if bool_recomp == True:
    # Create a DataFrame: components_df
    components_df = pd.DataFrame(model.components_, columns=words)

    # Print the shape of the DataFrame
    print(components_df.shape)

    # Select row 3: component
    component = components_df.iloc[3,:]

    # Print result of nlargest
    print(component.nlargest())