# Logistic Regression Lab 2

Scikit-Learn includes several sample datasets which can demonstrate
logistic regression's usefulness.

This is a very free-form lab: you won't be walked through it step-by-step,
so you might want to keep some other examples open.

In [38]:
import sklearn.datasets
import sklearn.linear_model as lm
import sklearn.model_selection as gs  # replaces the removed cross_validation and grid_search modules
import sklearn.neighbors as nb
import sklearn.metrics as met
import sklearn.preprocessing as prep
import pandas as pd

We will look at the Wisconsin breast cancer database, and a classic
dataset of [different kinds of iris flowers](https://en.wikipedia.org/wiki/Iris_flower_data_set).

# Wisconsin

In the Wisconsin breast cancer database, you are trying to predict whether
a tumour is malignant or benign. The database consists of the measurements
of the tumour (bc.data) and the diagnosis (bc.target): 0 = malignant, 1 = benign.
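
The label encoding can be checked on the loaded dataset itself. This is a small sketch, assuming a scikit-learn version that ships `load_breast_cancer`; the names `label_names` and `counts` are introduced here purely for illustration.

```python
import numpy as np
import sklearn.datasets

bc = sklearn.datasets.load_breast_cancer()

# target_names is indexed by label: position 0 names class 0, and so on.
label_names = {i: str(name) for i, name in enumerate(bc.target_names)}
print(label_names)

# Number of tumours carrying each label, in label order.
counts = np.bincount(bc.target)
print(counts)
```

Checking the class counts is also a quick way to see how unbalanced the two classes are before judging an accuracy score.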

Try using various combinations of parameters in a logistic regression.

Validate your results with cross-validation.
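
One possible way to try several parameter settings and validate each with cross-validation is sketched below. It is not the only approach: the list of `C` values, the 5-fold split, `max_iter=5000` and the name `mean_scores` are all assumptions chosen for illustration. The scaler is placed inside a pipeline so that it is fitted only on the training part of each fold.

```python
import sklearn.datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

bc = sklearn.datasets.load_breast_cancer()

mean_scores = {}
for C in [0.01, 0.1, 1.0, 10.0]:
    # The scaler sits inside the pipeline, so it is re-fit on the
    # training part of every fold and never sees the held-out part.
    model = make_pipeline(RobustScaler(), LogisticRegression(C=C, max_iter=5000))
    scores = cross_val_score(model, bc.data, bc.target, cv=5)
    mean_scores[C] = scores.mean()
    print(C, round(scores.mean(), 3))
```

Smaller values of `C` mean stronger regularisation, so comparing the mean scores gives a rough idea of how much regularisation this dataset tolerates.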



In [15]:
bc = sklearn.datasets.load_breast_cancer()
# print(bc.DESCR)  # uncomment for the full dataset description
bcdf = pd.DataFrame(bc.data, columns=bc.feature_names)
# bc.target

In [39]:
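import sklearn.model_selection as gs  # train_test_split and GridSearchCV live here

# Scale the features, then hold out a test set (25% of the rows by default).
scaler = prep.RobustScaler()
bcdf = scaler.fit_transform(bcdf)
x_train, x_test, y_train, y_test = gs.train_test_split(bcdf, bc.target)

# Baseline: k-nearest neighbours with default settings.
knn = nb.KNeighborsClassifier()
knn.fit(x_train, y_train)
preds = knn.predict(x_test)

# Predictions are passed first, so rows are predicted labels and columns are true labels.
print(met.confusion_matrix(preds, y_test))
print(knn.score(x_test, y_test))

# Logistic regression; the regularisation strength C is picked by internal cross-validation.
logreg = lm.LogisticRegressionCV()
logreg.fit(x_train, y_train)
preds = logreg.predict(x_test)

print(met.confusion_matrix(preds, y_test))
print(logreg.score(x_test, y_test))

# k-nearest neighbours again, this time with a grid search over n_neighbors and weights.
knncv = gs.GridSearchCV(nb.KNeighborsClassifier(),
                        {'n_neighbors': list(range(3, 12)),
                         'weights': ['uniform', 'distance']})
knncv.fit(x_train, y_train)
preds = knncv.predict(x_test)

print(met.confusion_matrix(preds, y_test))
print(knncv.score(x_test, y_test))

print(knncv.best_params_)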

[[47  0]
 [ 6 90]]
0.958041958042
[[51  0]
 [ 2 90]]
0.986013986014
{'n_neighbors': 8, 'weights': 'distance'}
[[48  0]
 [ 5 90]]
0.965034965035
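
The scores printed above can be recovered from the confusion matrices: accuracy is the sum of the diagonal divided by the total number of test samples. The sketch below copies the three matrices from the output and assumes they appear in the order the models were fitted; the dictionary keys are labels made up for this example.

```python
import numpy as np

# Confusion matrices copied from the output above.
matrices = {
    "knn": np.array([[47, 0], [6, 90]]),
    "logreg": np.array([[51, 0], [2, 90]]),
    "knn_grid": np.array([[48, 0], [5, 90]]),
}

accuracies = {}
for name, cm in matrices.items():
    # Correct predictions lie on the diagonal; every test sample is counted once.
    accuracies[name] = np.trace(cm) / cm.sum()
    print(name, accuracies[name])
```

Because the predictions were passed as the first argument to `confusion_matrix`, the off-diagonal entry in the second row counts test samples predicted as class 1 whose true class is 0.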


# Irises

There are three kinds of flowers in the dataset:

- [Setosa](https://en.wikipedia.org/wiki/Iris_setosa) ( = 0)

- [Versicolor](https://en.wikipedia.org/wiki/Iris_versicolor) ( = 1)

- [Virginica](https://en.wikipedia.org/wiki/Iris_virginica) ( = 2)

Try using various combinations of parameters in a logistic regression.

Validate your results with cross-validation.

In [None]:
iris = sklearn.datasets.load_iris()
print(iris.DESCR)
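
As a possible starting point for the iris exercise, the sketch below runs a cross-validated grid search over the regularisation strength of a three-class logistic regression. The choice of `StandardScaler`, the grid of `C` values, `max_iter=1000`, the 5-fold split and the names `pipe`, `param_grid` and `search` are assumptions for illustration, not a prescribed solution.

```python
import sklearn.datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = sklearn.datasets.load_iris()

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
# Step names created by make_pipeline are the lowercased class names,
# so the regularisation strength is addressed as logisticregression__C.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(iris.data, iris.target)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`best_score_` is the mean cross-validated accuracy of the best setting, so it can be compared directly with scores from other models evaluated on the same folds.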