New things I learned while using the scikit-learn and SciPy modules.
Use arange to create an array of evenly spaced values within a given interval.
import numpy as np
np.arange(0, 5, 1)
# array([0,1,2,3,4])
np.arange(1, 4, 0.5)
# array([1. , 1.5, 2. , 2.5, 3. , 3.5])
We can use the NumPy meshgrid function to make coordinate matrices from one-dimensional coordinate arrays.
import numpy as np
np.meshgrid([1, 2, 3], [0, 7])
# [
# array([[1,2,3], [1,2,3]]),
# array([[0,0,0], [7,7,7]])
# ]
When we have a multi-dimensional NumPy array, we can easily flatten it with the ravel method:
import numpy as np
arr = np.array([[1,2], [3,4]])
arr.ravel()
# array([1, 2, 3, 4])
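We can use NumPy's c_ object to stack one-dimensional arrays side by side as columns, so that each value is paired with the value at the same position in the other array. Note that c_ is indexed with square brackets rather than called like a function. See the numpy.c_ documentation for details.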
import numpy as np
x = [1,2]
y = [10,20]
np.c_[x, y]
# array([[ 1, 10], [ 2, 20]])
With NumPy's arange, meshgrid, ravel and c_, we can easily generate coordinates across a grid, pass them to the classifier and plot the decision surface.
import numpy as np
# Generate evenly spaced coordinates.
x_points, y_points = np.meshgrid(np.arange(x_min, x_max, step), np.arange(y_min, y_max, step))
# Pair the x and y points.
test_coordinates = np.c_[x_points.ravel(), y_points.ravel()]
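To make the grid construction concrete, here is a small sketch with made-up bounds. The values of x_min, x_max, y_min, y_max and step are assumptions chosen only for illustration; in practice they would come from the range of your feature data.

```python
import numpy as np

# Assumed bounds and step size, chosen only for illustration.
x_min, x_max = 0.0, 1.0
y_min, y_max = 0.0, 1.0
step = 0.5

# Build the grid, then flatten it into one (x, y) row per grid point.
x_points, y_points = np.meshgrid(np.arange(x_min, x_max, step),
                                 np.arange(y_min, y_max, step))
test_coordinates = np.c_[x_points.ravel(), y_points.ravel()]

print(x_points.shape)          # (2, 2)
print(test_coordinates.shape)  # (4, 2)
print(test_coordinates.tolist())
# [[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [0.5, 0.5]]
```

A smaller step gives a smoother decision surface, but the number of points the classifier has to predict grows quickly.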
We can pass the grid coordinates to the classifier to predict the output at each coordinate. We can then use matplotlib.pyplot to plot the decision surface.
import matplotlib.pyplot as plt
import pylab as pl
# Pass coordinates across the grid.
predicted_labels = classifier.predict(test_coordinates)
# Reshape the flat output array back to the shape of the grid.
predicted_labels = predicted_labels.reshape(x_points.shape)
# Set the axis limits.
plt.xlim(x_points.min(), x_points.max())
plt.ylim(y_points.min(), y_points.max())
# Plot the decision surface with the seismic color map.
plt.pcolormesh(x_points, y_points, predicted_labels, cmap = pl.cm.seismic)
The classifier output is a one-dimensional array, so don't forget to reshape it back into a two-dimensional array before plotting. cmap is an optional parameter that sets the color map. Here we use the seismic color map, accessed through the pylab module (plt.cm.seismic or cmap = "seismic" gives the same color map). It runs from blue through white to red.
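The reshape step can be checked without a trained model. The sketch below uses a hypothetical predict function as a stand-in for classifier.predict (it simply labels a point 1 when x + y > 1), so the rule itself is an assumption and not part of the original project. It shows that the prediction comes back flat and that reshape restores the grid layout.

```python
import numpy as np

def predict(coordinates):
    """Hypothetical stand-in for classifier.predict: label 1 when x + y > 1."""
    return (coordinates[:, 0] + coordinates[:, 1] > 1).astype(int)

# A 2 x 3 grid: three x values and two y values.
x_points, y_points = np.meshgrid(np.arange(0, 3, 1), np.arange(0, 2, 1))
test_coordinates = np.c_[x_points.ravel(), y_points.ravel()]

flat_labels = predict(test_coordinates)            # one label per grid point
grid_labels = flat_labels.reshape(x_points.shape)  # back to the grid layout

print(flat_labels.shape)     # (6,)
print(grid_labels.shape)     # (2, 3)
print(grid_labels.tolist())  # [[0, 0, 1], [0, 1, 1]]
```

Because ravel and reshape use the same row-major order, each label ends up at the grid position of the coordinate it was predicted for.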
We need to separate the test points by their label (the speed) so that we can plot them in two different colors.
# Separate fast (label = 0) & slow (label = 1) test points.
grade_fast = [features_test[i][0] for i in range(0, len(features_test)) if labels_test[i] == 0]
bumpy_fast = [features_test[i][1] for i in range(0, len(features_test)) if labels_test[i] == 0]
grade_slow = [features_test[i][0] for i in range(0, len(features_test)) if labels_test[i] == 1]
bumpy_slow = [features_test[i][1] for i in range(0, len(features_test)) if labels_test[i] == 1]
# Plot the test points, colored by speed.
plt.scatter(grade_fast, bumpy_fast, color = "b", label = "fast")
plt.scatter(grade_slow, bumpy_slow, color = "r", label = "slow")
# Show the plot legend.
plt.legend()
# Add the axis labels.
plt.xlabel("grade")
plt.ylabel("bumpiness")
# Show the plot.
plt.show()
If you want to save the plot as an image, you can use the savefig function instead:
plt.savefig('scatter_plot.png')
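The list comprehensions that separate the fast and slow test points can also be written with NumPy boolean masks. This is a sketch under the assumption that features_test and labels_test can be converted to NumPy arrays. The sample values are made up.

```python
import numpy as np

# Made-up test data: each row is (grade, bumpiness); 0 = fast, 1 = slow.
features_test = np.array([[0.1, 0.9], [0.8, 0.2], [0.4, 0.5], [0.7, 0.6]])
labels_test = np.array([0, 1, 0, 1])

# Boolean masks that pick out the rows belonging to each label.
fast = labels_test == 0
slow = labels_test == 1

grade_fast, bumpy_fast = features_test[fast, 0], features_test[fast, 1]
grade_slow, bumpy_slow = features_test[slow, 0], features_test[slow, 1]

print(grade_fast.tolist())  # [0.1, 0.4]
print(bumpy_slow.tolist())  # [0.2, 0.6]
```

The resulting arrays can be passed to plt.scatter in the same way as the lists.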
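We can use the pickle module to serialize and deserialize Python objects. In Python 2 there is also cPickle, a faster C implementation; in Python 3 it was merged into pickle, which uses the C implementation automatically when available. We use both of these modules to deserialize the email texts and the author list.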
import pickle
import cPickle  # Python 2 only; in Python 3, use pickle for both loads.
# Unpickle (deserialize) the texts. Pickle files should be opened in binary mode.
with open(texts_file, "rb") as texts_file_handler:
    texts = cPickle.load(texts_file_handler)
# Unpickle (deserialize) the authors.
with open(authors_file, "rb") as authors_file_handler:
    authors = pickle.load(authors_file_handler)
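For reference, here is a minimal round trip that also covers the serializing side. The sample list and the temporary file name are made up for illustration. The sketch uses only the pickle module, so it runs on Python 3 as written.

```python
import os
import pickle
import tempfile

# Made-up sample data standing in for the email texts.
texts = ["first email body", "second email body"]

path = os.path.join(tempfile.mkdtemp(), "texts.pkl")

# Pickle reads and writes bytes, so the file is opened in binary mode.
with open(path, "wb") as handler:
    pickle.dump(texts, handler)

with open(path, "rb") as handler:
    restored = pickle.load(handler)

print(restored == texts)  # True
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.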
We can use the built-in train_test_split function from scikit-learn to split the data into training and test sets.
from sklearn.model_selection import train_test_split
features_train, features_test, labels_train, labels_test = train_test_split(texts, authors, test_size = 0.1, random_state = 42)
test_size is the proportion of the data to put in the test set. In our case, we hold out 10% for testing.
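A quick way to see the effect of test_size is to split a small, made-up data set and count the results. The ten placeholder documents and labels below are assumptions for illustration only.

```python
from sklearn.model_selection import train_test_split

# Made-up data: ten documents and ten matching author labels.
texts = ["email %d" % i for i in range(10)]
authors = [i % 2 for i in range(10)]

features_train, features_test, labels_train, labels_test = train_test_split(
    texts, authors, test_size=0.1, random_state=42)

print(len(features_train), len(features_test))  # 9 1
```

The features and labels are shuffled together, so each text stays paired with its author. Fixing random_state makes the split reproducible between runs.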
When working with text documents, we need to vectorize the strings into lists of numbers so they are easier to process. We can use the TfidfVectorizer class to vectorize the strings into a matrix of TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf = True, max_df = 0.5, stop_words = "english")
features_train_transformed = vectorizer.fit_transform(features_train)
features_test_transformed = vectorizer.transform(features_test)
Words with a document frequency higher than max_df are ignored; with max_df = 0.5, that means words that appear in more than 50% of the documents. Stop words are also ignored. These are the most common words in a language (e.g. a, the, has).
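The effect of max_df and stop_words is easiest to see on a tiny corpus. The four sentences below are made up for illustration: "apple" appears in three of the four documents (75%), so it should be dropped by max_df = 0.5, while "the" and "is" should be dropped as English stop words.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: "apple" appears in 3 of 4 documents (75%).
documents = [
    "the apple is red",
    "the apple is green",
    "the apple pie is sweet",
    "the banana is yellow",
]

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words="english")
matrix = vectorizer.fit_transform(documents)

# vocabulary_ maps each kept word to its column index in the matrix.
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
# ['banana', 'green', 'pie', 'red', 'sweet', 'yellow']
print(matrix.shape)  # (4, 6): one row per document, one column per kept word
```

Note that the vectorizer is fitted on the training data only and then reused to transform the test data, so both end up with the same columns.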
Text can have a lot of features, so it may be slow to process. We can use scikit-learn's SelectPercentile class to keep only the most important features.
from sklearn.feature_selection import SelectPercentile, f_classif
selector = SelectPercentile(f_classif, percentile = 10)
selector.fit(features_train_transformed, labels_train)
selected_features_train_transformed = selector.transform(features_train_transformed).toarray()
selected_features_test_transformed = selector.transform(features_test_transformed).toarray()
percentile is the percentage of features to keep, chosen from those with the highest scores.
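As a rough illustration of what the selector does, the sketch below runs it on synthetic data instead of the email features. The data is made up: 20 random features, of which only the first one determines the label, so that feature should survive the selection.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif

# Made-up data: 100 samples, 20 features; only feature 0 determines the label.
rng = np.random.RandomState(0)
features = rng.normal(size=(100, 20))
labels = (features[:, 0] > 0).astype(int)

# Keep the 10% of features with the highest ANOVA F-scores.
selector = SelectPercentile(f_classif, percentile=10)
selected = selector.fit_transform(features, labels)

print(features.shape)             # (100, 20)
print(selected.shape)             # (100, 2)
print(selector.get_support()[0])  # True: the informative feature is kept
```

Here the input is a dense array, so there is no toarray() call. With the sparse TF-IDF matrix, transform returns a sparse matrix, which is why the original code converts it.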