# Supervised Learning with scikit-learn

## Table of contents:

* <a href=#Class>Applying logistic regression and SVM</a>
* <a href=#Regres>Loss functions</a>
* <a href=#Tuning>Logistic regression</a>
* <a href=#Pipe>Support Vector Machines</a>

## Load Packages and Set Global Variables

<a id="imports"></a>

In [4]:
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC


## Global Variables

All embeddings and clusterings can be saved and loaded into this script. Be careful about overwriting cluster caches once cell type annotation has started, as cluster labels may be shuffled.

Set whether anndata objects are recomputed or loaded from cache.

In [5]:
bool_recomp = False

Set whether clustering is recomputed or loaded from saved .obs file. Loading makes sense if the clustering changes due to a change in scanpy or one of its dependencies and the number of clusters or the cluster labels change accordingly.

In [6]:
bool_recluster = False

Set whether cluster cache is overwritten. Note that the cache exists for reproducibility of clustering, see above.

In [7]:
bool_write_cluster_cache = False

Set whether to produce plots; set to False for test runs.

In [8]:
bool_plot = False

Set whether observations should be calculated. If False, it is necessary to read a cached file that contains the necessary information. It then shows the distributions of counts and genes, as well as mt_frac after filtering.
Set to True to see the data before filtering and follow the decisions for cutoffs.

In [9]:
bool_create_observations = True

<a id="Dataloading"></a>

## Applying logistic regression and SVM

KNN classification

In [10]:
if bool_recomp:
    # Create and fit the model
    # (X_train, y_train, and X_test must already exist; this notebook only creates them in the next cell)
    knn = KNeighborsClassifier()
    knn.fit(X_train, y_train)

    # Predict on the test features, print the results
    pred = knn.predict(X_test)[0]
    print("Prediction for test example 0:", pred)

Running LogisticRegression and SVC

In [11]:
if bool_recomp:
    digits = datasets.load_digits()
    X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target)

    # Apply logistic regression and print scores
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    print(lr.score(X_train, y_train))
    print(lr.score(X_test, y_test))

    # Apply SVM and print scores
    svm = SVC()
    svm.fit(X_train, y_train)
    print(svm.score(X_train, y_train))
    print(svm.score(X_test, y_test))

Sentiment analysis for movie reviews

In [12]:
if bool_recomp:
    # Instantiate logistic regression and train
    # (X, y, and get_features are not defined in this notebook and must be provided beforehand)
    lr = LogisticRegression()
    lr.fit(X, y)

    # Predict sentiment for a glowing review
    review1 = "LOVED IT! This movie was amazing. Top 10 this year."
    review1_features = get_features(review1)
    print("Review:", review1)
    print("Probability of positive review:", lr.predict_proba(review1_features)[0,1])

    # Predict sentiment for a poor review
    review2 = "Total junk! I'll never watch a film by that director again, no matter how good the reviews."
    review2_features = get_features(review2)
    print("Review:", review2)
    print("Probability of positive review:", lr.predict_proba(review2_features)[0,1])
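
The cell above relies on `X`, `y`, and `get_features`, none of which are defined in this notebook. As a rough, self-contained sketch of how such a workflow might look, the example below assumes a bag-of-words representation built with `CountVectorizer`. The toy corpus, the labels, and the names `get_features_sketch`, `lr_toy`, and `X_toy` are all hypothetical and only serve as illustration; the original feature extraction may work differently.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus, invented for illustration only (1 = positive, 0 = negative)
train_reviews = [
    "loved it, amazing movie",
    "great film, loved the acting",
    "amazing and great fun",
    "total junk, terrible movie",
    "boring film, terrible acting",
    "junk and boring",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Learn the vocabulary and build the bag-of-words training matrix
vectorizer = CountVectorizer()
X_toy = vectorizer.fit_transform(train_reviews)

# Fit logistic regression on the word counts
lr_toy = LogisticRegression()
lr_toy.fit(X_toy, train_labels)


def get_features_sketch(review):
    """Map one review string to a 1-row feature matrix using the fitted vocabulary."""
    return vectorizer.transform([review])


# Column 1 of predict_proba holds the probability of the positive class (label 1)
p_good = lr_toy.predict_proba(get_features_sketch("LOVED IT! This movie was amazing."))[0, 1]
p_bad = lr_toy.predict_proba(get_features_sketch("Total junk! Terrible."))[0, 1]
print("Probability of positive review (glowing):", p_good)
print("Probability of positive review (poor):", p_bad)
```

With only six training sentences the probabilities are not meaningful in themselves; the point is that words unseen during fitting are ignored by `transform`, and the glowing review should score higher than the poor one.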

Visualizing decision boundaries

In [13]:
if bool_recomp:
    # Define the classifiers
    classifiers = [LogisticRegression(), LinearSVC(), SVC(), KNeighborsClassifier()]

    # Fit the classifiers
    for c in classifiers:
        c.fit(X, y)

    # Plot the classifiers (plot_4_classifiers is a helper that is not defined in this notebook)
    plot_4_classifiers(X, y, classifiers)
    plt.show()
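
Since `plot_4_classifiers` is not defined in this notebook, the following is a minimal sketch of what such a helper could do, assuming two-dimensional features. The function name `plot_boundaries_sketch`, the synthetic `make_blobs` data, and the `steps` parameter are assumptions made for illustration, not the original helper.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC


def plot_boundaries_sketch(X, y, classifiers, steps=200):
    """Draw each fitted classifier's predicted regions on a 2x2 grid of axes."""
    # Build a mesh that covers the data with a small margin
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, steps),
                         np.linspace(y_min, y_max, steps))
    grid = np.c_[xx.ravel(), yy.ravel()]

    fig, axes = plt.subplots(2, 2, figsize=(8, 8))
    for ax, clf in zip(axes.ravel(), classifiers):
        # Predict a class for every mesh point and shade the resulting regions
        zz = clf.predict(grid).reshape(xx.shape)
        ax.contourf(xx, yy, zz, alpha=0.3)
        ax.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k", s=20)
        ax.set_title(type(clf).__name__)
    return fig


# Synthetic 2-D data, used only because the notebook's X and y are not shown
X_demo, y_demo = make_blobs(n_samples=100, centers=2, random_state=0)
demo_classifiers = [LogisticRegression(), LinearSVC(), SVC(), KNeighborsClassifier()]
for clf in demo_classifiers:
    clf.fit(X_demo, y_demo)

fig = plot_boundaries_sketch(X_demo, y_demo, demo_classifiers)
```

Calling `plt.show()` afterwards displays the figure. The linear models (`LogisticRegression`, `LinearSVC`) produce straight boundaries, while `SVC` with its default RBF kernel and `KNeighborsClassifier` can produce curved ones.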

## Loss functions