
SciPy and Scikit Learn Stuff

New stuff I learned while using the scikit-learn and SciPy modules.

Table of Contents

- Working with Numpy
- Plotting
- Dealing with Data

Working with Numpy

Numpy Create a Range of Values with a Given Interval

Use arange to create an array of evenly spaced values within a given interval.

import numpy as np

np.arange(0, 5, 1)
# array([0,1,2,3,4])

np.arange(1, 4, 0.5)
# array([1. , 1.5, 2. , 2.5, 3. , 3.5])
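If you need a fixed number of points instead of a fixed step, Numpy's linspace is a handy alternative; a small sketch:

import numpy as np

# 6 evenly spaced values from 1 to 4, endpoint included.
np.linspace(1, 4, 6)
# array([1. , 1.6, 2.2, 2.8, 3.4, 4. ])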

Numpy Create Coordinate Matrices from Coordinate Vectors

We can use the Numpy meshgrid function to make coordinate matrices from one-dimensional coordinate arrays.

import numpy as np

np.meshgrid([1, 2, 3], [0, 7])
# [
#   array([[1,2,3], [1,2,3]]),
#   array([[0,0,0], [7,7,7]])
# ]

Flatten Numpy Array

When we have a multi-dimensional Numpy array, we can easily flatten it with the ravel method:

import numpy as np

arr = np.array([[1,2], [3,4]])
arr.ravel()
# array([1, 2, 3, 4])
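Note that ravel returns a view of the original array whenever possible, while the similar flatten method always returns a copy. A quick sketch of the difference:

import numpy as np

arr = np.array([[1, 2], [3, 4]])

# ravel returns a view when possible, so writes affect the original.
raveled = arr.ravel()
raveled[0] = 99
arr[0, 0]  # 99

# flatten always returns a copy, leaving the original untouched.
flattened = arr.flatten()
flattened[0] = 0
arr[0, 0]  # still 99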

Pairing Array Values with Second Axis

We can use Numpy's c_ to pair array values with another array that will be its second axis. See the numpy.c_ documentation for details.

import numpy as np

x = [1,2]
y = [10,20]

np.c_[x, y]
# array([[ 1, 10],
#        [ 2, 20]])

Generate Coordinates Across The Grid

With the knowledge of the Numpy arange, meshgrid, ravel and c_ functions, we can easily generate coordinates across the grid, pass them to the classifier, and plot the decision surface.

import numpy as np

# Grid boundaries and step size (example values).
x_min, x_max, y_min, y_max, step = 0.0, 1.0, 0.0, 1.0, 0.01

# Generate evenly spaced coordinates across the grid.
x_points, y_points = np.meshgrid(np.arange(x_min, x_max, step), np.arange(y_min, y_max, step))

# Pair the x and y points.
test_coordinates = np.c_[x_points.ravel(), y_points.ravel()]

Plotting

Plot the Decision Surface

We can pass the grid coordinates to the classifier to predict the output at each coordinate. We can then use matplotlib.pyplot to plot the decision surface.

import matplotlib.pyplot as plt

# Predict the label for each coordinate on the grid.
predicted_labels = classifier.predict(test_coordinates)

# Reshape the flat output back into the grid's shape.
predicted_labels = predicted_labels.reshape(x_points.shape)

# Set the axes limits.
plt.xlim(x_points.min(), x_points.max())
plt.ylim(y_points.min(), y_points.max())

# Plot the decision boundary with the seismic color map.
plt.pcolormesh(x_points, y_points, predicted_labels, cmap = plt.cm.seismic)

The classifier output is a one-dimensional array, so don't forget to reshape it back into a two-dimensional array before plotting. The cmap parameter is optional and sets the color map; here we use matplotlib's seismic color map, which has the red-blue colors.
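For context, here's a minimal, self-contained sketch of the whole flow. The toy dataset and the GaussianNB classifier are just stand-ins for whatever data and classifier you actually use:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB

# Toy training data: two features, two classes (made up for illustration).
features_train = np.array([[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8]])
labels_train = np.array([0, 0, 1, 1])

classifier = GaussianNB()
classifier.fit(features_train, labels_train)

# Generate coordinates across the grid.
x_points, y_points = np.meshgrid(np.arange(0, 1, 0.01), np.arange(0, 1, 0.01))
test_coordinates = np.c_[x_points.ravel(), y_points.ravel()]

# Predict, reshape, and plot the decision surface.
predicted_labels = classifier.predict(test_coordinates).reshape(x_points.shape)
plt.pcolormesh(x_points, y_points, predicted_labels, cmap = plt.cm.seismic)
plt.show()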

Scatter Plot

We need to separate the test points based on their predicted labels (the speed), so we can plot the test points in two different colors.

# Separate fast (label = 0) & slow (label = 1) test points.
grade_fast = [features_test[i][0] for i in range(0, len(features_test)) if labels_test[i] == 0]
bumpy_fast = [features_test[i][1] for i in range(0, len(features_test)) if labels_test[i] == 0]
grade_slow = [features_test[i][0] for i in range(0, len(features_test)) if labels_test[i] == 1]
bumpy_slow = [features_test[i][1] for i in range(0, len(features_test)) if labels_test[i] == 1]

# Plot the test points based on their speed.
plt.scatter(grade_fast, bumpy_fast, color = "b", label = "fast")
plt.scatter(grade_slow, bumpy_slow, color = "r", label = "slow")

# Show the plot legend.
plt.legend()

# Add the axis labels.
plt.xlabel("grade")
plt.ylabel("bumpiness")

# Show the plot.
plt.show()

If you want to save the plot into an image, you can use the savefig function instead:

plt.savefig('scatter_plot.png')
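Note that savefig saves the current figure, so call it before plt.show(); once the plot window is closed the figure is gone, and a later savefig call would typically write out an empty image.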

Dealing with Data

Deserializing Python Objects

We can use the pickle module for serializing and deserializing Python objects. There's also cPickle, a faster C implementation (Python 2 only; in Python 3, pickle uses the C implementation automatically). We use both of these modules to deserialize the email texts and the author list.

import pickle
import cPickle

# Unpickling or deserializing the texts.
# Note: pickle files should be opened in binary mode.
texts_file_handler = open(texts_file, "rb")
texts = cPickle.load(texts_file_handler)
texts_file_handler.close()

# Unpickling or deserializing the authors.
authors_file_handler = open(authors_file, "rb")
authors = pickle.load(authors_file_handler)
authors_file_handler.close()
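Serializing goes the other way with pickle.dump. A small sketch with a made-up object and file name, using with statements so the files close automatically:

import pickle

data = {"texts": ["hello", "world"], "authors": ["sara", "chris"]}

# Pickling or serializing the object into a file.
with open("data.pkl", "wb") as file_handler:
    pickle.dump(data, file_handler)

# Unpickling it back.
with open("data.pkl", "rb") as file_handler:
    data = pickle.load(file_handler)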

Split Data for Training and Testing

We can use the built-in train_test_split function from scikit-learn to split the data into training and testing sets.

from sklearn.model_selection import train_test_split

features_train, features_test, labels_train, labels_test = train_test_split(texts, authors, test_size = 0.1, random_state = 42)

test_size is the proportion of the data to put into the test split; in our case we split off 10% for testing.
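A quick sketch of the split with made-up data, so the resulting sizes are easy to see:

from sklearn.model_selection import train_test_split

texts = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
authors = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

features_train, features_test, labels_train, labels_test = train_test_split(texts, authors, test_size = 0.1, random_state = 42)

len(features_train), len(features_test)
# (9, 1)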

Vectorizing the Strings

When working with text documents, we need to vectorize the strings into lists of numbers so they're easier to process. We can use the TfidfVectorizer class to vectorize the strings into a matrix of TF-IDF features.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(sublinear_tf = True, max_df = 0.5, stop_words = "english")
features_train_transformed = vectorizer.fit_transform(features_train)
features_test_transformed = vectorizer.transform(features_test)

Words with a document frequency higher than max_df will be ignored. Stop words are also ignored; these are the most common words in a language (e.g. a, the, has).
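To see which words ended up as features, the fitted vectorizer can list its vocabulary. A short sketch (get_feature_names_out is the name in scikit-learn 1.0+; older versions call it get_feature_names):

# Inspect the learned vocabulary of the fitted vectorizer.
feature_names = vectorizer.get_feature_names_out()

# Number of TF-IDF features and a few sample words.
print(len(feature_names))
print(feature_names[:10])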

Feature Selection

Text can have a lot of features, which can make computation slow. We can use the scikit-learn SelectPercentile class to keep only the most important features.

from sklearn.feature_selection import SelectPercentile, f_classif

selector = SelectPercentile(f_classif, percentile = 10)
selector.fit(features_train_transformed, labels_train)
selected_features_train_transformed = selector.transform(features_train_transformed).toarray()
selected_features_test_transformed = selector.transform(features_test_transformed).toarray()

percentile is the percentage of features we'd like to select, taking those with the highest scores.
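To check which features survived, the fitted selector provides a get_support method that returns a boolean mask. Combined with the feature_names from the vectorizer sketch above (an assumption, not part of the original snippet), it maps back to words:

# Boolean mask marking the selected features.
mask = selector.get_support()

# Map the mask back to the vectorizer's words.
selected_words = feature_names[mask]
print(selected_words[:10])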