# Getting started with scikit-learn (and a NumPy refresher)

This notebook is a gentle introduction to working with scikit-learn (sklearn) and finding your way around its documentation. Please also have a look at the tutorial notebook in the git repo. Some of the following code and explanations are taken from the sklearn quickstart guide (https://scikit-learn.org/0.21/tutorial/basic/tutorial.html) and will be marked accordingly. You will probably also have to work with the NumPy documentation (https://numpy.org/doc/stable/reference/arrays.ndarray.html).

Let's get started by importing the first library we need. Usually you would import everything you need in one place, but for this intro we do it one library at a time.



In [1]:
import numpy as np

We need a randomizer. Let's talk about that: https://docs.scipy.org/doc/numpy-1.3.x/reference/generated/numpy.random.seed.html



In [11]:
print("first " + str(np.random.rand(4))) #print out 4 random numbers between 0 and 1
print("second " + str(np.random.rand(4))) #again
print("third "  + str(np.random.rand(4))) #and again
np.random.seed(0) #now let's set the seed -> the random numbers will be generated from a specific starting value (0)
print("first with seed" + str(np.random.rand(4))) #print 4 random numbers with the seed 0
np.random.seed(0) #note we have to set the same seed again
print("second with seed" + str(np.random.rand(4))) #and try it again
np.random.seed(0)
print("third with seed"  + str(np.random.rand(4))) #to see the numbers are repeated.

first [0.56804456 0.92559664 0.07103606 0.0871293 ]
second [0.0202184  0.83261985 0.77815675 0.87001215]
third [0.97861834 0.79915856 0.46147936 0.78052918]
first with seed[0.5488135  0.71518937 0.60276338 0.54488318]
second with seed[0.5488135  0.71518937 0.60276338 0.54488318]
third with seed[0.5488135  0.71518937 0.60276338 0.54488318]


Play with this to see what happens. Why can it be helpful to use pseudo-random numbers? When should you not do that?
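As a side note to the seeding experiment above, here is a minimal sketch of the same idea using NumPy's newer `Generator` interface (`np.random.default_rng`, available in recent NumPy versions). The variable names are only for illustration. Instead of seeding one global random state, each generator object carries its own seed.

```python
import numpy as np

# Two generators created with the same seed produce the same sequence.
rng_a = np.random.default_rng(0)
rng_b = np.random.default_rng(0)
first = rng_a.random(4)
second = rng_b.random(4)
print(first)
print(second)
print(np.array_equal(first, second))  # compares the two draws element by element

# A generator created without a seed is initialised from fresh entropy,
# so its numbers change from run to run.
rng_c = np.random.default_rng()
third = rng_c.random(4)
print(third)
```

A fixed seed is handy when you want others to reproduce your results exactly. An unseeded generator is the better choice when the numbers must not be predictable.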

Then we move on to creating data. We use ndarrays and the function linspace: https://numpy.org/doc/stable/reference/arrays.ndarray.html

https://numpy.org/doc/stable/reference/generated/numpy.linspace.html







In [16]:
np.random.seed(7) #our friend the seed again
X = np.linspace(0, 1, 5) # creates a one-dimensional ndarray of 5 evenly spaced values from 0 to 1 (both included)

# Inspect X.
print(X) #print the array we created and see it in one row
print(type(X)) #what is the type?
print(X.shape) #what is the shape? Let's talk more about the shape! You'll need that later for the class projects

[0.   0.25 0.5  0.75 1.  ]
<class 'numpy.ndarray'>
(5,)
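Since the shape will matter later, here is a small sketch of how the same five values can be arranged in different shapes with `reshape`. The names `X_col` and `X_row` are made up for this example. The `(n_samples, n_features)` layout is the one that the sklearn quote further down describes for `.data`.

```python
import numpy as np

X = np.linspace(0, 1, 5)   # shape (5,): a single axis with 5 entries

# -1 tells NumPy to work out that dimension from the number of elements.
X_col = X.reshape(-1, 1)   # shape (5, 1): 5 samples with 1 feature each
X_row = X.reshape(1, -1)   # shape (1, 5): 1 sample with 5 features

print(X.shape, X_col.shape, X_row.shape)
print(X_col)
```

Note that `(5,)`, `(5, 1)` and `(1, 5)` all hold the same numbers, but they are different shapes, and many functions care about the difference.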


The data we will work with will rarely be one-dimensional, so let's get a grip on matrices.

In [25]:
a = np.random.random((2,3)) #creates a matrix with 2 rows and three columns
print(a)
print("\n")
b = np.array([[1, 2, 3], [4, 5, 6]], np.int32) #dimensions (2x3) and data type
print(b)
print(type(b))
print(b.shape)
print(b.dtype)
print("one element: 2nd row, 3rd column")
print(b[1,2])
print("\ntranspose that:")
c = b.transpose()
print(c)
print("\ndimensions/shape:")
print(c.shape)

[[0.65739946 0.37035108 0.45909298]
 [0.71932412 0.41299183 0.90642327]]


[[1 2 3]
 [4 5 6]]
<class 'numpy.ndarray'>
(2, 3)
int32
one element: 2nd row, 3rd column
6

transpose that:
[[1 4]
 [2 5]
 [3 6]]

dimensions/shape:
(3, 2)
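Besides picking out a single element, you will often want whole rows or columns, or a computation along one axis. The following is a small sketch using the same matrix `b` as above. The other variable names are only for illustration.

```python
import numpy as np

b = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int32)

first_row = b[0]              # whole first row, shape (3,)
last_column = b[:, -1]        # ":" means "all rows", -1 is the last column, shape (2,)
column_sums = b.sum(axis=0)   # collapse the rows: one sum per column
row_sums = b.sum(axis=1)      # collapse the columns: one sum per row

print(first_row)
print(last_column)
print(column_sums)
print(row_sums)
print(b.T.shape)              # .T is shorthand for .transpose()
```

A useful way to remember `axis`: the axis you pass is the one that disappears from the shape of the result.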


From here on, most of the code is taken from https://scikit-learn.org/0.21/tutorial/basic/tutorial.html

In [26]:
from sklearn import datasets #import more libs
digits = datasets.load_digits() #load data set "digits"


"A dataset is a dictionary-like object that holds all the data and some metadata about the data. This data is stored in the .data member, which is a n_samples, n_features array. In the case of supervised problem, one or more response variables are stored in the .target member" 

In [27]:
print(digits.data) 

[[ 0.  0.  5. ...  0.  0.  0.]
 [ 0.  0.  0. ... 10.  0.  0.]
 [ 0.  0.  0. ... 16.  9.  0.]
 ...
 [ 0.  0.  1. ...  6.  0.  0.]
 [ 0.  0.  2. ... 12.  0.  0.]
 [ 0.  0. 10. ... 12.  1.  0.]]


In [28]:
digits.target

array([0, 1, 2, ..., 8, 9, 8])
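To see what "dictionary-like" means in practice, here is a short sketch that lists the fields of the loaded object. The exact set of keys depends on your scikit-learn version, so treat the printed list as something to explore rather than a fixed contract.

```python
from sklearn import datasets

digits = datasets.load_digits()

print(sorted(digits.keys()))   # names of all fields stored in the object
print(digits.data.shape)       # (n_samples, n_features)
print(digits.target.shape)     # one label per sample

# Fields can be read like dictionary entries or like attributes.
same_array = digits["data"] is digits.data
print(same_array)
```

If one of the keys is `DESCR`, printing `digits.DESCR` gives a text description of the data set, which is a good first stop before reading the online docs.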

What about the shape?

In [31]:
print(digits.data.shape)

(1797, 64)


Can we print out the data samples in their original image form, and then just one of them?

In [32]:
print(digits.images) #In the case of the digits, each original sample is an image of shape (8, 8)

[[[ 0.  0.  5. ...  1.  0.  0.]
  [ 0.  0. 13. ... 15.  5.  0.]
  [ 0.  3. 15. ... 11.  8.  0.]
  ...
  [ 0.  4. 11. ... 12.  7.  0.]
  [ 0.  2. 14. ... 12.  0.  0.]
  [ 0.  0.  6. ...  0.  0.  0.]]

 [[ 0.  0.  0. ...  5.  0.  0.]
  [ 0.  0.  0. ...  9.  0.  0.]
  [ 0.  0.  3. ...  6.  0.  0.]
  ...
  [ 0.  0.  1. ...  6.  0.  0.]
  [ 0.  0.  1. ...  6.  0.  0.]
  [ 0.  0.  0. ... 10.  0.  0.]]

 [[ 0.  0.  0. ... 12.  0.  0.]
  [ 0.  0.  3. ... 14.  0.  0.]
  [ 0.  0.  8. ... 16.  0.  0.]
  ...
  [ 0.  9. 16. ...  0.  0.  0.]
  [ 0.  3. 13. ... 11.  5.  0.]
  [ 0.  0.  0. ... 16.  9.  0.]]

 ...

 [[ 0.  0.  1. ...  1.  0.  0.]
  [ 0.  0. 13. ...  2.  1.  0.]
  [ 0.  0. 16. ... 16.  5.  0.]
  ...
  [ 0.  0. 16. ... 15.  0.  0.]
  [ 0.  0. 15. ... 16.  0.  0.]
  [ 0.  0.  2. ...  6.  0.  0.]]

 [[ 0.  0.  2. ...  0.  0.  0.]
  [ 0.  0. 14. ... 15.  1.  0.]
  [ 0.  4. 16. ... 16.  7.  0.]
  ...
  [ 0.  0.  0. ... 16.  2.  0.]
  [ 0.  0.  4. ... 16.  2.  0.]
  [ 0.  0.  5. ... 12.  0.  

In [33]:
print(digits.images[0])

[[ 0.  0.  5. 13.  9.  1.  0.  0.]
 [ 0.  0. 13. 15. 10. 15.  5.  0.]
 [ 0.  3. 15.  2.  0. 11.  8.  0.]
 [ 0.  4. 12.  0.  0.  8.  8.  0.]
 [ 0.  5.  8.  0.  0.  9.  8.  0.]
 [ 0.  4. 11.  0.  1. 12.  7.  0.]
 [ 0.  2. 14.  5. 10. 12.  0.  0.]
 [ 0.  0.  6. 13. 10.  0.  0.  0.]]
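How do the 8x8 images relate to the rows of 64 features in `digits.data`? The sketch below flattens every image into one row and compares the result with `.data`. The names `n_samples` and `flattened` are chosen just for this example.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

n_samples = len(digits.images)
# Keep the first axis (samples) and merge the two image axes: (n, 8, 8) -> (n, 64)
flattened = digits.images.reshape(n_samples, -1)

print(flattened.shape)
print(np.array_equal(flattened, digits.data))  # check against the ready-made .data array
print(digits.target[0])                        # the label that belongs to digits.images[0]
```

This is why the shape matters: most sklearn estimators expect one row per sample, so image data has to be flattened like this before training.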


In the future we will need a training and a test data set (see week 2 to learn why). So let's split the data. We have 1797 examples, so we take the first 1700 for training and the remaining 97 for testing.

In [37]:
X, Y = digits.data, digits.target #X for the features and Y for the labels is the usual machine learning naming
print(X.shape)
print(Y.shape)
train_data, train_labels = X[:1700], Y[:1700]
test_data, test_labels = X[1700:], Y[1700:]


(1797, 64)
(1797,)
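Slicing takes the first 1700 samples in their stored order. As an alternative sketch, sklearn's `train_test_split` shuffles the data before splitting. The choices `test_size=97` and `random_state=0` are assumptions made here to mirror the sizes above and to make the shuffle repeatable. The variable names differ from the cell above on purpose, so running this does not overwrite the slice-based split.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X, Y = digits.data, digits.target

# An integer test_size is an absolute number of test samples;
# random_state plays the same role as the seed discussed earlier.
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=97, random_state=0
)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
```

Shuffling is usually the safer default when the stored order of a data set might carry some pattern.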


We can use this to train a simple model like KNN (see week 2): https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

In [40]:
from sklearn.neighbors import KNeighborsClassifier #import the classifier we want to use
knn = KNeighborsClassifier(n_neighbors=3) #create the classifier
knn.fit(train_data, train_labels) #train on the training data
pred = knn.predict(test_data) #make a prediction with the test data
#how good is our prediction?
print(knn.score(test_data, test_labels))

0.9896907216494846
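What does the number returned by `score` mean? For a classifier like this one it is the accuracy, i.e. the fraction of test samples whose predicted label matches the true label. The sketch below repeats the steps above and computes that fraction by hand. `manual_accuracy` is a name made up for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X, Y = digits.data, digits.target
train_data, train_labels = X[:1700], Y[:1700]
test_data, test_labels = X[1700:], Y[1700:]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(train_data, train_labels)
pred = knn.predict(test_data)

# pred == test_labels is a boolean array; its mean is the share of True values.
manual_accuracy = np.mean(pred == test_labels)
print(manual_accuracy)
print(np.isclose(manual_accuracy, knn.score(test_data, test_labels)))
```

Comparing `pred` and `test_labels` directly is also a quick way to find out which test samples the model got wrong.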
