# Numpy, Matplotlib and Sklearn Tutorial

We often use NumPy to handle multi-dimensional arrays.

Let's try some basic NumPy operations:

In [1]:
import numpy as np

a = np.array([[1,2,3], [2,3,4]])
print("a",a.ndim, a.shape, a.size, a.dtype, type(a))

b = np.zeros((3,4))
c = np.ones((3,4))
d = np.random.randn(2,3)
e = np.array([[1,2], [2,3], [3,4]])
f = b*2 - c*3
g = 2*c*f
h = np.dot(a,e)
i = d.mean()
j = d.max(axis=1)
k = a[-1][:2]
print("b",b.ndim, b.shape, b.size, b.dtype, type(b))
print("c",c.ndim, c.shape, c.size, c.dtype, type(c))
print("d",d.ndim, d.shape, d.size, d.dtype, type(d))
print("e",e.ndim, e.shape, e.size, e.dtype, type(e))
print("f",f.ndim, f.shape, f.size, f.dtype, type(f))
print("g",g.ndim, g.shape, g.size, g.dtype, type(g))
print("h",h.ndim, h.shape, h.size, h.dtype, type(h))
print("i",i.ndim, i.shape, i.size, i.dtype, type(i))
print("j",j.ndim, j.shape, j.size, j.dtype, type(j))
print("k",k.ndim, k.shape, k.size, k.dtype, type(k))
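print("a is a 2-D array with 2 rows and 3 columns; the first row is 1, 2, 3 and the second row is 2, 3, 4. ndim is the number of dimensions, shape is the length along each dimension, size is the total number of elements, dtype is the element data type, and the last value is the type of the object itself")
print("b is a 3x4 array of zeros\n c is a 3x4 array of ones\n d is a 2x3 array of random numbers\n e is a 3x2 array whose rows are 1, 2; 2, 3; and 3, 4\n ")
print("f is b times 2 minus c times 3\n g is 2 times c times f, element-wise\n h is the dot product of a and e\n i is the mean of d\n j is the maximum of d along axis 1, so it holds one value per row\n k takes the last row of a, from the first element through the second")
# You can also print a through k themselves to inspect their values

a 2 (2, 3) 6 int32 <class 'numpy.ndarray'>
b 2 (3, 4) 12 float64 <class 'numpy.ndarray'>
c 2 (3, 4) 12 float64 <class 'numpy.ndarray'>
d 2 (2, 3) 6 float64 <class 'numpy.ndarray'>
e 2 (3, 2) 6 int32 <class 'numpy.ndarray'>
f 2 (3, 4) 12 float64 <class 'numpy.ndarray'>
g 2 (3, 4) 12 float64 <class 'numpy.ndarray'>
h 2 (2, 2) 4 int32 <class 'numpy.ndarray'>
i 0 () 1 float64 <class 'numpy.float64'>
j 1 (2,) 2 float64 <class 'numpy.ndarray'>
k 1 (2,) 2 int32 <class 'numpy.ndarray'>
a is a 2-D array with 2 rows and 3 columns; the first row is 1, 2, 3 and the second row is 2, 3, 4. ndim is the number of dimensions, shape is the length along each dimension, size is the total number of elements, dtype is the element data type, and the last value is the type of the object itself
b is a 3x4 array of zeros
 c is a 3x4 array of ones
 d is a 2x3 array of random numbers
 e is a 3x2 array whose rows are 1, 2; 2, 3; and 3, 4
 
f is b times 2 minus c times 3
 g is 2 times c times f, element-wise
 h is the dot product of a and e
 i is the mean of d
 j is the maximum of d along axis 1, so it holds one value per row
 k takes the last row of a, from the first element through the second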
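
The `axis` argument and the difference between `*` and `np.dot` are easier to see on a small fixed array. The following is a minimal sketch; the array `m` and the variable names are made up for illustration and are not part of the cell above.

```python
import numpy as np

# A small, fixed array so the results are reproducible (unlike np.random.randn).
m = np.array([[1, 5, 3],
              [4, 2, 6]])

col_max = m.max(axis=0)   # reduce over rows    -> one value per column
row_max = m.max(axis=1)   # reduce over columns -> one value per row

print(col_max, col_max.shape)  # [4 5 6] (3,)
print(row_max, row_max.shape)  # [5 6] (2,)

# `*` multiplies element by element; np.dot performs matrix multiplication.
elementwise = m * m            # same shape as m
matmul = np.dot(m, m.T)        # (2, 3) dot (3, 2) -> (2, 2)
print(elementwise.shape, matmul.shape)  # (2, 3) (2, 2)
```

This mirrors `j = d.max(axis=1)` and `h = np.dot(a, e)` above: reducing along axis 1 leaves one entry per row, and the dot product of a 2x3 and a 3x2 array is 2x2.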


matplotlib.pyplot provides a convenient API for drawing graphs.

Let's try some basic matplotlib.pyplot operations:

In [None]:
import matplotlib.pyplot as plt

x = np.arange(2, 10, 0.2)

plt.plot(x, x**1.5*.5, 'r-', x, np.log(x)*5, 'g--', x, x, 'b.')
plt.show()

If you want to draw the curves in separate subplots, try this:

In [None]:
def f(x):
    return np.sin(np.pi*x)

x1 = np.arange(0, 5, 0.1)
x2 = np.arange(0, 5, 0.01)

plt.subplot(211)
plt.plot(x1, f(x1), 'go', x2, f(x2-1))

plt.subplot(212)
plt.plot(x2, f(x2), 'r--')
plt.show()
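
The same two-panel figure can also be built with `plt.subplots`, which returns the figure and its axes as objects. This is only a sketch of that alternative style: the function name `wave`, the legend labels, and the `Agg` backend (chosen so the code runs without a display) are assumptions made for this example.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window is opened
import matplotlib.pyplot as plt
import numpy as np


def wave(x):
    return np.sin(np.pi * x)


x1 = np.arange(0, 5, 0.1)
x2 = np.arange(0, 5, 0.01)

# 2 rows, 1 column; both panels share the same x axis
fig, (ax_top, ax_bottom) = plt.subplots(2, 1, sharex=True)

ax_top.plot(x1, wave(x1), 'go', label='coarse samples')
ax_top.plot(x2, wave(x2 - 1), label='shifted by 1')
ax_top.legend()

ax_bottom.plot(x2, wave(x2), 'r--')
ax_bottom.set_xlabel('x')

# Render into an in-memory buffer; in a notebook you would call plt.show() instead.
buf = io.BytesIO()
fig.savefig(buf, format='png')
```

Working with explicit axes objects tends to be easier to follow than `plt.subplot(211)` once a figure has more than a couple of panels, because each call names the panel it draws on.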

How about displaying images?

Let's display an image whose pixel values increase gradually:

Different pixel values represent different gray levels.

In [None]:
img = np.arange(0, 1, 1/32/32) # define a 1D array of 32x32 gradually increasing values
img = img.reshape(32, 32) # reshape it into a 32x32 array that represents a 32x32 image;
                          # each element is the gray level of the corresponding pixel
plt.imshow(img, cmap='gray')
plt.show()
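
To see why the image gets brighter from left to right and from top to bottom, it helps to check how `reshape` lays out the values. A minimal sketch (the variable `n` is introduced here only for readability):

```python
import numpy as np

n = 32
img = np.arange(0, 1, 1 / n / n).reshape(n, n)

print(img.shape)              # (32, 32)
# reshape fills the array row by row, so neighbours in a row differ by one step ...
print(img[0, 1] - img[0, 0])  # 0.0009765625  (= 1/1024)
# ... and moving down one row skips a full row of 32 steps.
print(img[1, 0] - img[0, 0])  # 0.03125       (= 32/1024)
```

The top-left pixel is therefore the darkest (0) and the bottom-right pixel the brightest (1023/1024).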

Built on NumPy, scikit-learn (sklearn) is a powerful library that provides many tools for machine learning.

Let's use it for MNIST classification:

In [1]:
from sklearn.datasets import fetch_openml
import matplotlib.pyplot as plt
# download and load the MNIST data from https://www.openml.org/d/554
# for this tutorial, the data have already been downloaded to './scikit_learn_data'
# as_frame=False returns NumPy arrays; with a pandas DataFrame, X[0] looks up
# a column named 0 and raises KeyError: 0
X, Y = fetch_openml('mnist_784', version=1, data_home='./scikit_learn_data', return_X_y=True, as_frame=False)

# scale the pixel values from [0, 255] to [0, 1] for further processing
X = X / 255.

# show the first image of the dataset
img1 = X[0].reshape(28, 28)
plt.imshow(img1, cmap='gray')
plt.show()

# show the image after simple transformations
img2 = 1 - img1  # invert the gray levels
plt.imshow(img2, cmap='gray')
plt.show()

img3 = img1.transpose()  # swap rows and columns
plt.imshow(img3, cmap='gray')
plt.show()

In [2]:
# split the data into training and test sets (for faster computation, use only 1/10 of the data)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, Y_train, Y_test = train_test_split(X[::10], Y[::10], test_size=1000)

# fit the scaler on the training set only, then apply the same transformation to the test set
standardScaler = StandardScaler()
X_train = standardScaler.fit_transform(X_train)
X_test = standardScaler.transform(X_test)
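
What `StandardScaler` does can be reproduced by hand: subtract the per-feature mean of the training set and divide by its standard deviation. The sketch below uses made-up toy data; the last column is constant, like a pixel that is 0 in every training image.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): 3 training samples, 1 test sample, 3 features.
train = np.array([[0.0, 1.0, 0.0],
                  [2.0, 3.0, 0.0],
                  [4.0, 8.0, 0.0]])
test = np.array([[1.0, 2.0, 0.0]])

mean = train.mean(axis=0)
std = train.std(axis=0)      # population standard deviation (ddof=0)
std[std == 0] = 1.0          # constant features are left unscaled to avoid dividing by zero

manual_train = (train - mean) / std
manual_test = (test - mean) / std   # the test data reuses the training statistics

scaler = StandardScaler()
sk_train = scaler.fit_transform(train)
sk_test = scaler.transform(test)

print(np.allclose(manual_train, sk_train), np.allclose(manual_test, sk_test))  # True True
```

Reusing the training mean and standard deviation for the test set is the reason the cell above calls `fit_transform` on `X_train` but only `transform` on `X_test`.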

#### Q1:
Please use logistic regression (default parameters) in sklearn to classify the data above, and print the training accuracy and test accuracy.

In [3]:
# TODO: use logistic regression
from sklearn.linear_model import LogisticRegression

# note: max_iter and C differ from the defaults (max_iter=100, C=1.0);
# a smaller C means stronger regularization
clf = LogisticRegression(max_iter=1000, C=0.1).fit(X_train, Y_train)

# score() returns the mean accuracy on the given data and labels
train_accuracy = clf.score(X_train, Y_train)
test_accuracy = clf.score(X_test, Y_test)

print('Training accuracy: %0.2f%%' % (train_accuracy*100))
print('Testing accuracy: %0.2f%%' % (test_accuracy*100))

Training accuracy: 98.12%
Testing accuracy: 88.30%
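
The gap between training and test accuracy depends on the regularization strength `C`. The sketch below shows one way to compare a few values. To stay small and offline it uses scikit-learn's bundled 8x8 digits dataset (`load_digits`) instead of MNIST, so its numbers are not comparable with the results above; the variable names and the choice of `C` values are assumptions for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Small stand-in dataset: 1797 images of 8x8 digits.
X_small, y_small = load_digits(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X_small, y_small, test_size=0.25, random_state=0)

scaler = StandardScaler().fit(Xtr)
Xtr, Xte = scaler.transform(Xtr), scaler.transform(Xte)

results = {}
for C in (0.01, 0.1, 1.0):
    model = LogisticRegression(C=C, max_iter=1000).fit(Xtr, ytr)
    results[C] = (model.score(Xtr, ytr), model.score(Xte, yte))
    print('C=%g  train: %.2f%%  test: %.2f%%'
          % (C, results[C][0] * 100, results[C][1] * 100))
```

A smaller `C` usually lowers the training accuracy; whether it helps the test accuracy depends on the data.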


#### Q2:
Please use naive Bayes (Bernoulli, default parameters) in sklearn to classify the data above, and print the training accuracy and test accuracy.

In [4]:
# TODO: use naive Bayes
from sklearn.naive_bayes import BernoulliNB

# with the default binarize=0.0, every feature is thresholded at 0 before fitting
clf_bayes = BernoulliNB()
clf_bayes.fit(X_train, Y_train)

train_accuracy = clf_bayes.score(X_train, Y_train)
test_accuracy = clf_bayes.score(X_test, Y_test)

print('Training accuracy: %0.2f%%' % (train_accuracy*100))
print('Testing accuracy: %0.2f%%' % (test_accuracy*100))

Training accuracy: 82.32%
Testing accuracy: 81.10%
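
`BernoulliNB` models binary features, so with its default `binarize=0.0` it maps every input value greater than 0 to 1 and everything else to 0. On standardized data this means "above or not above the feature's training mean". The sketch below checks that behaviour on made-up random data; `X_toy`, `y_toy`, and the labeling rule are assumptions for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.RandomState(0)
X_toy = rng.randn(200, 5)                             # roughly standardized features
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)   # hypothetical labels

# Default: BernoulliNB thresholds the inputs itself (binarize=0.0).
nb_auto = BernoulliNB().fit(X_toy, y_toy)

# Equivalent: binarize by hand, then tell the model the data is already binary.
X_bin = (X_toy > 0.0).astype(float)
nb_manual = BernoulliNB(binarize=None).fit(X_bin, y_toy)

same = np.allclose(nb_auto.predict_proba(X_toy), nb_manual.predict_proba(X_bin))
print(same)  # True
```

Because the threshold discards how far a pixel is from 0, changing `binarize` is one parameter worth trying if you want to tune this model.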


#### Q3:
Please use a support vector machine (default parameters) in sklearn to classify the data above, and print the training accuracy and test accuracy.

In [5]:
# TODO: use a support vector machine
from sklearn.svm import LinearSVC

# LinearSVC is a linear SVM; max_iter is raised from its default of 1000 to help the solver converge
svc = LinearSVC(max_iter=100000)
svc.fit(X_train, Y_train)

train_accuracy = svc.score(X_train, Y_train)
test_accuracy = svc.score(X_test, Y_test)

print('Training accuracy: %0.2f%%' % (train_accuracy*100))
print('Testing accuracy: %0.2f%%' % (test_accuracy*100))

Training accuracy: 99.43%
Testing accuracy: 80.60%


#### Q4:
Please adjust the parameters of SVM to increase the testing accuracy, and print the training accuracy and test accuracy.

In [6]:
# TODO: use SVM with another group of parameters
from sklearn.model_selection import GridSearchCV

# cross-validated search over the regularization strength C
svc = LinearSVC(max_iter=200000)
param_grid = {'C': [1e-3, 1e-2, 0.1, 1, 10]}
grid_search = GridSearchCV(svc, param_grid=param_grid)
grid_search.fit(X_train, Y_train)
best_C = grid_search.best_params_['C']

# retrain on the full training set with the best C found by the search
svc = LinearSVC(C=best_C)
svc.fit(X_train, Y_train)

train_accuracy = svc.score(X_train, Y_train)
test_accuracy = svc.score(X_test, Y_test)

print('Training accuracy: %0.2f%%' % (train_accuracy*100))
print('Testing accuracy: %0.2f%%' % (test_accuracy*100))

Training accuracy: 95.45%
Testing accuracy: 86.10%
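
After fitting, `GridSearchCV` exposes the selected parameters, the cross-validated score, and an already refitted model, so a second manual fit is optional. The sketch below illustrates these attributes on scikit-learn's bundled 8x8 digits dataset rather than MNIST, to keep it small and offline; the grid values, `cv=5`, and the variable names are assumptions for this example.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Small stand-in dataset: 1797 images of 8x8 digits.
X_small, y_small = load_digits(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X_small, y_small, test_size=0.25, random_state=0)

scaler = StandardScaler().fit(Xtr)
Xtr, Xte = scaler.transform(Xtr), scaler.transform(Xte)

grid = {'C': [1e-3, 1e-2, 0.1, 1]}
search = GridSearchCV(LinearSVC(max_iter=10000), param_grid=grid, cv=5)
search.fit(Xtr, ytr)

print('best C:', search.best_params_['C'])
print('mean cross-validated accuracy: %.2f%%' % (search.best_score_ * 100))

# refit=True is the default, so best_estimator_ is already trained on all of Xtr.
test_acc = search.best_estimator_.score(Xte, yte)
print('test accuracy: %.2f%%' % (test_acc * 100))
```

The test set is used only once, at the end; the choice of `C` is based on the cross-validation folds inside the training set.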
