# A Gentle Introduction to Machine Learning with sklearn

## Playing with the 20 Newsgroups Dataset

The dataset consists of roughly 18,000 newsgroup posts on 20 different topics. Given a post, can we predict which topic it belongs to?

In [None]:
import numpy as np

In [None]:
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

In [None]:
print("FileNames: {}".format(newsgroups_train.filenames.shape))
print("Target: {}".format(newsgroups_train.target.shape))
datapoint = 1  # index of the data point to view; pick from 0 to 11313
print(newsgroups_train.data[datapoint])
print(newsgroups_train.target_names[newsgroups_train.target[datapoint]])

How can we convert text into numeric features that we can plug into mathematical models? Enter `sklearn.feature_extraction`.
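Before running this on the full dataset, it may help to see what a bag-of-words matrix looks like on a tiny corpus. The sketch below uses three made-up sentences (`toy_docs` and the other `toy_` names are illustrative and not part of the notebook) and the default `CountVectorizer` settings, so stop words are kept here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus, used only for illustration
toy_docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

toy_vectorizer = CountVectorizer()                 # default settings: lowercase, no stop-word removal
toy_counts = toy_vectorizer.fit_transform(toy_docs)  # sparse matrix: one row per document
toy_vocab = toy_vectorizer.vocabulary_             # maps each word to its column index

print(sorted(toy_vocab))     # the learned vocabulary, in column order
print(toy_counts.toarray())  # dense view: entry [i, j] counts word j in document i
```

Each row is one document and each column is one vocabulary word, so word order within a document is discarded and only the counts remain.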

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Create bag-of-words representation of text, ignoring stopwords (like "the" or "of", etc.)
vectorizer = CountVectorizer(stop_words='english') 
vectors_train = vectorizer.fit_transform(newsgroups_train.data)
vectors_train.shape

In [None]:
vocab = vectorizer.vocabulary_
vocab

In [None]:
# Let's view the vector for the datapoint we saw earlier.
# vectors_train is a sparse matrix, so we convert the row to a dense matrix to display it.
data_mat = vectors_train[datapoint].todense()
print(data_mat.shape)
data_mat

In [None]:
# let's see the count for a specific word
data_mat[(0,vocab['clock'])]

Let's naively try fitting the data now, using ... wait for it ... a Naive Bayes Classifier!
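To make the idea concrete, here is a minimal NumPy sketch of what a multinomial Naive Bayes classifier computes: a log prior per class plus Laplace-smoothed log word probabilities. The count matrix, labels, and the `predict_toy` helper are hypothetical and only illustrate the technique; this is not scikit-learn's actual implementation.

```python
import numpy as np

# Hypothetical word-count matrix: 4 documents x 3 vocabulary words
X_toy = np.array([[3, 0, 1],
                  [2, 1, 0],
                  [0, 4, 1],
                  [1, 3, 0]])
y_toy = np.array([0, 0, 1, 1])
alpha = 1.0  # Laplace (add-one) smoothing

classes = np.unique(y_toy)

# log P(class): fraction of training documents in each class
log_prior = np.log(np.array([(y_toy == c).mean() for c in classes]))

# Smoothed word counts per class, normalised to log P(word | class)
word_counts = np.array([X_toy[y_toy == c].sum(axis=0) for c in classes]) + alpha
log_likelihood = np.log(word_counts / word_counts.sum(axis=1, keepdims=True))

def predict_toy(x):
    """Return the class with the highest log posterior for count vector x."""
    scores = log_prior + log_likelihood @ x
    return classes[np.argmax(scores)]

print(predict_toy(np.array([2, 0, 0])))  # document dominated by word 0
print(predict_toy(np.array([0, 2, 1])))  # document dominated by word 1
```

The "naive" part is the assumption that words occur independently given the class, which is why the score is just a sum of per-word log probabilities.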

In [None]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(vectors_train, newsgroups_train.target)

In [None]:
# Get test data, we must use the same vectorizer or else we'll end up with a different feature set!
newsgroups_test = fetch_20newsgroups(subset='test')
vectors_test = vectorizer.transform(newsgroups_test.data)
vectors_test.shape
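
The point about reusing the fitted vectorizer can be seen on a toy example. In this sketch (made-up sentences, illustrative `toy_` names), calling `transform` keeps the column layout learned from the training text and silently ignores words it has never seen.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training and test text
toy_train = ["red apple green apple", "green pear"]
toy_test = ["red pear purple plum"]

toy_vec = CountVectorizer()
toy_train_counts = toy_vec.fit_transform(toy_train)  # learns the vocabulary from training text
toy_test_counts = toy_vec.transform(toy_test)        # reuses that vocabulary; no refitting

print(toy_train_counts.shape, toy_test_counts.shape)  # same number of columns
print(toy_test_counts.toarray())                      # unseen words contribute nothing
```

Fitting a second vectorizer on the test text would instead produce columns that mean different words, and the classifier's learned weights would no longer line up with them.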

In [None]:
from sklearn.metrics import accuracy_score
y_pred = clf.predict(vectors_test)
y_pred.shape
accuracy_score(newsgroups_test.target, y_pred)

Many words have little to no predictive value and are just noise, so we remove them through a process called feature selection. The model then works with far fewer words, with little loss (or even a gain) in predictive accuracy.
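As a rough illustration of how a chi-squared score separates informative words from noise, the sketch below scores two hypothetical count features: one that appears only in class 0 and one that is identical in every document. It uses `SelectKBest` rather than `SelectPercentile` purely because the toy matrix has only two columns; all `toy_` names are illustrative.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical counts: column 0 only occurs in class 0, column 1 is constant
toy_X = np.array([[5, 1],
                  [4, 1],
                  [0, 1],
                  [0, 1]])
toy_y = np.array([0, 0, 1, 1])

# chi2 returns one score and one p-value per feature
toy_scores, toy_pvalues = chi2(toy_X, toy_y)

# Keep only the single highest-scoring feature
toy_selector = SelectKBest(chi2, k=1).fit(toy_X, toy_y)
toy_X_selected = toy_selector.transform(toy_X)

print(toy_scores)                  # higher score = stronger association with the label
print(toy_selector.get_support())  # boolean mask of the kept columns
```

A feature whose counts are spread evenly across classes carries no information about the label, so it scores lowest and is dropped first.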

In [None]:
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectPercentile

ch2 = SelectPercentile(chi2, percentile=5) # keep the top 5% of features, ranked by chi-squared score
y_train = newsgroups_train.target
X_train = ch2.fit_transform(vectors_train, y_train)
y_test = newsgroups_test.target
X_test = ch2.transform(vectors_test)

inv_vocab = {v: k for k, v in vocab.items()} # maps from index to word

# list the words kept by the selector
feature_names = [inv_vocab[i] for i in ch2.get_support(indices=True)]
feature_names

In [None]:
# Train with new set
clf = MultinomialNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_pred.shape
accuracy_score(y_test, y_pred)

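Let's see which features mattered most for a given category by looking at their coefficients. For Naive Bayes these are the log-probabilities of each word given the class, so a higher value means the word is more probable in posts from that category.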
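A compact way to rank words per class, sketched here on a made-up four-document corpus, is to sort the rows of `feature_log_prob_` (the fitted log P(word | class) values) and map the column indices back to words. The corpus, labels, and the `top_words` helper are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical corpus: class 0 is about space, class 1 about football
toy_docs = [
    "space rocket launch orbit",
    "rocket orbit moon space",
    "goal team match score",
    "team score goal league",
]
toy_labels = [0, 0, 1, 1]

toy_vec = CountVectorizer()
toy_X = toy_vec.fit_transform(toy_docs)
toy_clf = MultinomialNB().fit(toy_X, toy_labels)

# Map column index back to word, as done with inv_vocab in the notebook
toy_inv_vocab = {idx: word for word, idx in toy_vec.vocabulary_.items()}

def top_words(class_index, k=3):
    """Return the k words with the highest log P(word | class), best first."""
    order = np.argsort(toy_clf.feature_log_prob_[class_index])
    return [toy_inv_vocab[i] for i in order[-k:][::-1]]

print(top_words(0))
print(top_words(1))
```

Note that a high log-probability marks a word as frequent within a class, not necessarily as discriminative between classes; combining it with feature selection, as above, helps filter out words that are common everywhere.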

In [None]:
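category = 1
print(newsgroups_train.target_names[category])
# feature_log_prob_ holds log P(word | class); clf.coef_ was removed from MultinomialNB in scikit-learn 1.1
log_probs = clf.feature_log_prob_[category]
# Pair each selected word with its score in a structured array (string field + float field)
feature_coefs = np.array(list(zip(feature_names, log_probs)), dtype=[('feature', 'U64'), ('coef', 'f8')])
# Sort ascending by score, so the strongest features end up last
feature_coefs = np.sort(feature_coefs, order=['coef'], kind='mergesort')
feature_coefs.shape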

In [None]:
[x[0] for x in feature_coefs[-10:]] # the 10 highest-scoring features, in ascending order of score