In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
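# removing headers, footers, and quotes to deal with just the text
newsgroups_train = fetch_20newsgroups(subset='train',
                                      remove=('headers','footers', 'quotes'))
# NOTE: this second call overwrites newsgroups_train with the 'test' subset
# (7,532 posts), so every result below is computed on that subset.
newsgroups_train = fetch_20newsgroups(subset='test',
                                      remove=('headers','footers', 'quotes'))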

Because we're dealing with text data, we're going to represent our data as a bag of words. In a bag of words there's no real concept of order. We're going to look for the occurrence of each word and treat those occurrences as features. So our count vectorizer is just going to count the words in each instance.

In [3]:
newsgroups_train.data[0]

'I am a little confused on all of the models of the 88-89 bonnevilles.\nI have heard of the LE SE LSE SSE SSEI. Could someone tell me the\ndifferences are far as features or performance. I am also curious to\nknow what the book value is for prefereably the 89 model. And how much\nless than book value can you usually get them for. In other words how\nmuch are they in demand this time of year. I have heard that the mid-spring\nearly summer is the best time to buy.'

By default, the count vectorizer only keeps tokens of two or more characters, so "I" and "a" are not included.
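That behavior comes from CountVectorizer's default token pattern, `(?u)\b\w\w+\b`, applied after lowercasing. As a rough sketch of the same rule using only the standard library (the `tokenize` helper is a made-up name for illustration, not part of scikit-learn):

```python
import re
from collections import Counter

# Default CountVectorizer token pattern: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and keep tokens of at least two characters."""
    return TOKEN_PATTERN.findall(text.lower())

text = "I am a little confused. I am also curious."
counts = Counter(tokenize(text))
print(counts)
```

The single-character tokens "I" and "a" never make it into the counts, while "am" is counted twice.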

Instantiating the vectorizer and fitting it to the newsgroup data. This creates a vocabulary for us.

In [4]:
vectorizer = CountVectorizer()
vectorizer.fit(newsgroups_train.data)

CountVectorizer()

The below cell produces the index assigned to each word, not its count. Each word in the vocabulary maps to one column of the count matrix.

In [5]:
vectorizer.vocabulary_

{'am': 11171,
 'little': 39524,
 'confused': 19380,
 'on': 46954,
 'all': 10989,
 'of': 46671,
 'the': 62245,
 'models': 43322,
 '88': 7467,
 '89': 7517,
 'bonnevilles': 15201,
 'have': 31837,
 'heard': 31987,
 'le': 38799,
 'se': 56730,
 'lse': 39985,
 'sse': 59442,
 'ssei': 59444,
 'could': 20106,
 'someone': 58743,
 'tell': 61935,
 'me': 41870,
 'differences': 22831,
 'are': 12131,
 'far': 27275,
 'as': 12404,
 'features': 27478,
 'or': 47215,
 'performance': 49013,
 'also': 11107,
 'curious': 20854,
 'to': 62900,
 'know': 37828,
 'what': 67769,
 'book': 15215,
 'value': 65952,
 'is': 35662,
 'for': 28386,
 'prefereably': 50685,
 'model': 43308,
 'and': 11430,
 'how': 32989,
 'much': 44076,
 'less': 39052,
 'than': 62219,
 'can': 16636,
 'you': 70187,
 'usually': 65546,
 'get': 29823,
 'them': 62278,
 'in': 34305,
 'other': 47474,
 'words': 68390,
 'they': 62379,
 'demand': 22070,
 'this': 62428,
 'time': 62702,
 'year': 69964,
 'that': 62235,
 'mid': 42617,
 'spring': 59286,
 'earl
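The indices above follow the sorted order of the terms ('all' < 'also' < 'am' < 'and'). A minimal sketch of how such a term-to-column mapping could be built by hand, assuming the same two-or-more-character tokenization (`build_vocabulary` is an illustrative helper, not a scikit-learn function):

```python
import re

def build_vocabulary(documents):
    """Map each distinct token to a column index, in sorted token order."""
    tokens = set()
    for doc in documents:
        tokens.update(re.findall(r"(?u)\b\w\w+\b", doc.lower()))
    return {term: index for index, term in enumerate(sorted(tokens))}

docs = ["the book value", "the best time to buy"]
vocabulary = build_vocabulary(docs)
print(vocabulary)
```

Each term gets exactly one index, and that index is the column the term will occupy once the documents are transformed into counts.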

Transforming the data

So let's move forward and transform our data. Right now, we've just fit the vectorizer. It has learned its vocabulary, but we haven't transformed our data yet. I'm just going to transform one row, using this vectorizer on the newsgroup data.

We need to make sure we're passing the text itself, the data attribute, because there are additional pieces to the newsgroup object. We only want the first example, example 0, and we wrap it in a list because transform expects a collection of documents. The vectorizer returns its result in sparse matrix form, which is efficient when lots of the values are zero. And that's what we expect.

However, we want to be able to inspect different portions visually, and that's difficult with the sparse matrix format. So I'm going to move away from the sparse matrix and get a traditional-looking matrix by converting it with toarray().

In [6]:
sample = vectorizer.transform([newsgroups_train.data[0]]).toarray()

So let's take a look at what this sample looks like. Let's take a look at the shape. We see that it's a single row vector with over 71,000 elements, one per vocabulary word, which means it's very, very large. Let's take a look at it visually, and we see that it's mostly zeros. This could be alarming, but as I mentioned ahead of time, it should be sparse, because not every word occurs in every example.

In [7]:
sample.shape

(1, 71018)

In [8]:
sample

array([[0, 0, 0, ..., 0, 0, 0]], dtype=int64)

So as you can see, the count vectorizer transforms your text data into a bag-of-words model, in which each document is represented by the counts of the occurrences of its words. Now, we obviously could do this manually, but the count vectorizer is extremely efficient.

In [9]:
sample[0][11171] #Number of the word "am" in the original text paragraph

2
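To make the "we could do this manually" point concrete, here is a small sketch of turning one document into a count row, both as a dense list and as a sparse dictionary that stores only the nonzero columns. The tiny `vocabulary` and the `count_row` helper are assumptions for illustration only:

```python
import re
from collections import Counter

def count_row(text, vocabulary):
    """Return (dense, sparse) count representations of one document."""
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    # Words outside the fitted vocabulary are ignored.
    counts = Counter(t for t in tokens if t in vocabulary)
    dense = [0] * len(vocabulary)
    for term, n in counts.items():
        dense[vocabulary[term]] = n
    # The sparse form keeps only the columns with a nonzero count.
    sparse = {vocabulary[term]: n for term, n in counts.items()}
    return dense, sparse

vocabulary = {"also": 0, "am": 1, "confused": 2, "curious": 3, "little": 4, "zebra": 5}
dense, sparse = count_row("I am a little confused. I am also curious.", vocabulary)
print(dense)
print(sparse)
```

Looking up `dense[vocabulary["am"]]` mirrors the `sample[0][11171]` lookup above. With a vocabulary of tens of thousands of words, nearly every dense entry is zero, which is why the sparse form is the default.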

#### DOCUMENT FREQUENCY - TFIDF Vectorizer

Now, there's both a transformer and a vectorizer. The TF-IDF transformer is for when you've already got your counts. The TF-IDF vectorizer does the counting for us. So although we may already have counts, in this case I'm going to start with the raw text and show you how fast the vectorizer works to produce this TF-IDF matrix.

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [11]:
tf_vectors = TfidfVectorizer()

So this (below) will produce a matrix of term frequency-inverse document frequency values. The first part, the term frequency, is just the counts, our bag-of-words model. For each document, or each post, the count of every vocabulary word is recorded. Most words aren't there, so there will be many 0's. But for the words that are there, we count the number of occurrences.

Now, to improve our ability to discern between different types of documents, this adds an inverse document frequency. The document frequency of a word is the number of documents it appears in, and we scale each count by the inverse of that. The key here is that it helps with words that occur commonly. They have a high document frequency, so their importance, or their value, is lessened when we multiply by the inverse of the document frequency.

Words that occur in some documents but not others have a low document frequency, so they are scaled down less. In other words, we're essentially dividing by the document frequency: instead of dividing by a large number, we're dividing by a small number. This helps us separate terms that occur in some documents and not others (terms that can help us distinguish differences) from terms that occur in everything, which don't help us tell classes apart. That's why TF-IDF is a very popular starting point for clustering and other work with unstructured data, as well as for dealing with textual data in general.
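The weighting can be worked through by hand on a toy corpus. This sketch assumes the smoothed formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The `tfidf_rows` helper and the three tiny documents are made up for illustration:

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs):
    """Compute L2-normalized TF-IDF rows using a smoothed IDF."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted({t for doc in tokenized_docs for t in doc})
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        weights = [tf[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, idf, rows

docs = [["the", "car", "engine"],
        ["the", "space", "shuttle"],
        ["the", "car", "dealer"]]
vocabulary, idf, rows = tfidf_rows(docs)
for term in ("the", "car", "engine"):
    print(term, round(idf[term], 3))
```

"the" appears in every document and gets the smallest weight, "car" appears in two of three and sits in the middle, and "engine" appears in only one and gets the largest.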

In [12]:
new_vectors = tf_vectors.fit_transform(newsgroups_train.data)

In [13]:
new_vectors.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

So remember, the bag of words starts with the count vectorizer, and TF-IDF, or term frequency-inverse document frequency, is an improvement over it. The features are still words represented as numbers, and those numbers start as counts, but the counts are reweighted by how many documents each word appears in. Common words end up with low weights, while semi-common words that appear in some documents but not others keep higher weights. Those are the words we want to look for, as they help us distinguish between different types or classes of documents.

In [14]:
#same number of vocab words as earlier, but now with 7.5 thousand instances
new_vectors.shape

(7532, 71018)

#### KNN - Supervised learning algo - Relies on data having labels

Now, we can choose the number of neighbors and the other hyperparameters, and we'll have to try and tune them. But I'll start with the number of neighbors being 5.

In [15]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)

All right, so let's go ahead and fit this. We're going to use the new vectors that we built with our TF-IDF, and, of course, we'll need the targets. So let's see what happens.

In [16]:
knn.fit(new_vectors, newsgroups_train['target'])

KNeighborsClassifier()

In [17]:
preds = knn.predict(new_vectors)

So we didn't do too well, as the accuracy below shows. That illustrates some of the issues that you have with KNN. We remember that the number of neighbors was 5, so let's see if we can change that and see if it affects our accuracy.

In [18]:
from sklearn.metrics import accuracy_score
accuracy_score(newsgroups_train['target'], preds)

0.36311736590547

In [19]:
knn = KNeighborsClassifier(n_neighbors=10)

In [20]:
knn.fit(new_vectors, newsgroups_train['target'])

KNeighborsClassifier(n_neighbors=10)

In [21]:
preds = knn.predict(new_vectors)

So you can see in this case (below), our accuracy went down, so we probably need to look in the other direction. Rather than tuning the hyperparameters all the way to a final model here, we'll leave optimizing your KNN as an exercise.

In [22]:
accuracy_score(newsgroups_train['target'], preds)

0.26925119490175253

DECREASING n_neighbors

In [23]:
knn = KNeighborsClassifier(n_neighbors=3)

In [24]:
knn.fit(new_vectors, newsgroups_train['target'])

KNeighborsClassifier(n_neighbors=3)

In [25]:
preds = knn.predict(new_vectors)

In [26]:
accuracy_score(newsgroups_train['target'], preds)

0.7338024429102497
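One caution when tuning: the accuracies above are scored on the same rows the model was fit on. The toy sketch below, a hand-rolled KNN on made-up 2-D points (none of it from the notebook), suggests why that can flatter small values of k. With k=1 every training point is its own nearest neighbor, so training accuracy is perfect even when one label is noisy.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Predict the majority label among the k nearest training points."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]

def accuracy(train_x, train_y, eval_x, eval_y, k):
    """Fraction of evaluation points whose predicted label is correct."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y for x, y in zip(eval_x, eval_y))
    return hits / len(eval_x)

# Two groups of points; (5.2, 5.2) carries a deliberately noisy "a" label.
train_x = [(0, 0), (1, 0), (0, 1), (5.2, 5.2), (5, 5), (6, 5), (5, 6)]
train_y = ["a", "a", "a", "a", "b", "b", "b"]
held_x = [(0.5, 0.5), (5.5, 5.5), (4.9, 4.9)]
held_y = ["a", "b", "b"]

for k in (1, 3):
    print(k,
          accuracy(train_x, train_y, train_x, train_y, k),
          accuracy(train_x, train_y, held_x, held_y, k))
```

Scoring on a held-out split, rather than on the fitting data, gives a fairer basis for comparing values of n_neighbors.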

#### DBSCAN - Unsupervised learning algo - Clustering

In [27]:
from sklearn.cluster import DBSCAN

So we need to create our clustering model, which we'll just call db, so let's go ahead and set the important hyperparameters. The first one is eps, the distance. This is the maximum distance between two points for them to be considered neighbors, and so part of the same cluster. This is the one we're going to have to tune the most to get our best results.

In [28]:
#DEFAULT HYPERPARAMETERS
db = DBSCAN(eps=0.5, min_samples=5)

So what we got out looks like all negative 1s, the label DBSCAN gives to noise points, which tells us that our distance is probably too small for most points to reach each other. But let's check that we're not just seeing the ends of the array, so we'll assign our predictions to a preds variable and let it run again.

In [29]:
preds = db.fit_predict(new_vectors)

So now that we've got our preds (above), let's see if we can find a max value. The max label is 0, so there is only one cluster in here. But we know that there should be roughly 20 (because there are 20 topics in the newsgroup data), and most of the data looks like it was labeled as outliers. We can see that by either putting this into a DataFrame or, because this is a NumPy array, using NumPy to do the value counts.

In [30]:
max(preds)

0

In [32]:
import pandas as pd
x = pd.DataFrame(preds)

All right, so let's look at the value counts. What we see here is that we've got one cluster, labeled 0, but the majority of the data is labeled -1, an outlier. What that's saying is our eps was probably too small. Let's see if we can raise it.

In [33]:
x[0].value_counts()

-1    7308
 0     224
Name: 0, dtype: int64
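The same tally can be made without pandas. A minimal sketch with `collections.Counter`, treating -1 as DBSCAN's noise label (the `summarize_labels` helper and the short label list are illustrative stand-ins for `preds`):

```python
from collections import Counter

def summarize_labels(labels):
    """Return (number of clusters, number of noise points, per-label counts)."""
    counts = Counter(labels)
    n_noise = counts.get(-1, 0)
    # Every label other than -1 is a real cluster.
    n_clusters = len(set(counts) - {-1})
    return n_clusters, n_noise, counts

labels = [-1, -1, 0, -1, 0, -1]
n_clusters, n_noise, counts = summarize_labels(labels)
print(n_clusters, n_noise, dict(counts))
```

For a NumPy array such as `preds`, passing `preds.tolist()` would work the same way.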

In [34]:
db = DBSCAN(eps=5, min_samples=5)

In [35]:
preds = db.fit_predict(new_vectors)

Now it looks like we've got the opposite problem in that all the data is in a single cluster. So we've located the region where the eps needs to be tuned. Further tuning can help us arrive at the number of clusters that we want. We know our target is roughly 20, because there are 20 topics in the newsgroup data.

But if you're working with unstructured data and trying to create features, keep in mind how many features or values you want to create out of your clusters. Typically, we would want something in the 20 to 100 range. Once we refine our eps between 0.5 and 5, we'll eventually settle on a number of clusters.

In [36]:
x = pd.DataFrame(preds)
x[0].value_counts()

0    7532
Name: 0, dtype: int64
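There is a likely reason eps=5 swallows everything. By default TfidfVectorizer scales each row to unit length, and all entries are non-negative, so with DBSCAN's default Euclidean metric no two documents can be farther apart than the square root of 2, about 1.41. The sketch below checks that bound on a few made-up count vectors (the vectors and the `l2_normalize` helper are assumptions for illustration):

```python
import math
from itertools import combinations

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (all-zero vectors stay as-is)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else list(vector)

raw = ([3, 0, 1, 0], [0, 2, 0, 0], [1, 1, 0, 4], [0, 0, 0, 5])
rows = [l2_normalize(v) for v in raw]

# The largest pairwise distance is reached by documents sharing no terms.
max_distance = max(math.dist(a, b) for a, b in combinations(rows, 2))
print(max_distance, math.sqrt(2))
```

Under those assumptions, the eps values worth searching sit between 0.5 and roughly 1.41 rather than all the way up to 5.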

#### K-MEANS CLUSTERING - unstructured - finding relationships

The number of clusters has to be decided ahead of time.

In [39]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=20, tol=0.0001, max_iter=300)

In [40]:
kmeans.fit_predict(new_vectors)

array([ 8, 10,  1, ..., 10,  1,  6])

The numbers won't match up with the targets exactly because, remember, the cluster labels are arbitrary assigned IDs, while the target labels have meanings.

So target 7 may be the car newsgroup and target 4 may be the sci-fi newsgroup. Of course, the clustering has no idea which cluster is which. There may be a one-to-one correspondence in which, say, cluster 12 is actually target 7. This arises because we have labeled (supervised) data and we're trying to find unsupervised structure in it.

In [41]:
newsgroups_train['target']

array([ 7,  5,  0, ...,  9,  6, 15])

We don't know at the end what the correspondence is either. That's why it's difficult to evaluate the clusters at this point, until we actually find a matchup of which cluster goes with which target. So at this point, we will finish our k-means tutorial.
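One simple way to look for such a matchup, when the true targets happen to be available, is to give each cluster the most common target among its members. This is only a sketch with made-up labels, and the resulting mapping is not guaranteed to be one-to-one (`majority_mapping` is an illustrative helper):

```python
from collections import Counter, defaultdict

def majority_mapping(cluster_labels, true_labels):
    """Map each cluster id to the most common true label among its members."""
    members = defaultdict(Counter)
    for cluster, target in zip(cluster_labels, true_labels):
        members[cluster][target] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in members.items()}

clusters = [8, 8, 1, 1, 1, 6, 6]
targets = [7, 7, 0, 0, 5, 15, 15]
mapping = majority_mapping(clusters, targets)
print(mapping)

# Share of points whose mapped cluster label agrees with the true target.
agreement = sum(mapping[c] == t for c, t in zip(clusters, targets)) / len(targets)
print(agreement)
```

Here cluster 8 lines up with target 7 even though the numbers differ, which is exactly the kind of correspondence described above.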

A reminder: we did guess initially at the number of clusters, but it was a fairly educated guess because we knew there were 20 classes. In reality, what we would do is look at the dispersion around the centers. In other words, we would measure the distance from the points in each cluster to their center as a loss metric, and find where that stops decreasing sharply as we add clusters. This is known as the elbow method.

The point where the decrease levels off suggests a suitable number of clusters for our problem. But for now, we used k-means to demonstrate that we could cluster and that it did find some structure within our data.
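The elbow idea can be sketched on toy data. The code below is a deliberately simple 1-D k-means with a deterministic start, not scikit-learn's implementation, and the nine points are made up to form three clumps. It records the sum of squared distances to the nearest center, the quantity scikit-learn's KMeans exposes as `inertia_`, for several values of k:

```python
def kmeans_inertia(points, k, n_iter=20):
    """Run a simple 1-D k-means and return the final sum of squared distances."""
    ordered = sorted(points)
    n = len(ordered)
    # Deterministic start: one center from the middle of each of k equal slices.
    centers = [ordered[int((i + 0.5) * n / k)] for i in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k), key=lambda j: (x - centers[j]) ** 2)
            clusters[nearest].append(x)
        # Move each center to its cluster mean; keep it if the cluster is empty.
        centers = [sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)]
    return sum(min((x - c) ** 2 for c in centers) for x in points)

points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
inertias = {k: kmeans_inertia(points, k) for k in range(1, 6)}
for k, value in inertias.items():
    print(k, round(value, 2))
```

The loss falls steeply up to k=3 and barely moves afterwards, so the bend at 3 matches the three clumps in the toy data.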

And remember that the label numbers found by k-means do not correspond to your supervised labels, because k-means has no concept of what your original targets were.