
first iteration of clustering documents

1 parent 6de88b4 commit efd1edbe01723ab80f948c220df3c4b042a7c9dc @thedatachef committed Apr 25, 2011
@@ -1,51 +0,0 @@
-register 'target/varaha-1.0-SNAPSHOT.jar';
-
-vectors = LOAD '$TFIDF-vectors' AS (doc_id:chararray, vector:bag {t:tuple (token:chararray, weight:double)});
-
---
--- Choose K random centers. This is kind of a hacky process. Since we can't really use
--- parameters for the sampler we have to precompute S. Here S=(K+5)/NDOCS. This way we're
--- guaranteed to get greater than (but not too much so) K vectors. Then we limit it to K.
---
--- sampled = SAMPLE vectors $S;
--- k_centers = LIMIT sampled $K;
---
--- STORE k_centers INTO '$TFIDF-centers-0';
-
--- k_centers = LOAD '$TFIDF-centers-0' AS (doc_id:chararray, vector:bag {t:tuple (token:chararray, weight:double)});
--- with_centers = CROSS k_centers, vectors;
--- similarities = FOREACH with_centers GENERATE
--- k_centers::doc_id AS center_id,
--- k_centers::vector AS center,
--- vectors::doc_id AS doc_id,
--- vectors::vector AS vector,
--- varaha.text.TermVectorSimilarity(k_centers::vector, vectors::vector) AS similarity;
-
--- STORE similarities INTO '$TFIDF-similarities-0';
--- similarities = LOAD '$TFIDF-similarities-0' AS (
--- center_id:chararray,
--- center:bag {t:tuple (token:chararray, weight:double)},
--- doc_id:chararray,
--- vector:bag {t:tuple (token:chararray, weight:double)},
--- similarity:double
--- );
---
--- finding_nearest = GROUP similarities BY doc_id;
--- only_nearest = FOREACH finding_nearest {
--- nearest_center = TOP(1, 4, similarities);
--- GENERATE
--- FLATTEN(nearest_center) AS (center_id, center, doc_id, vector, similarity)
--- ;
--- };
--- cut_nearest = FOREACH only_nearest GENERATE center_id, vector;
--- clusters = GROUP cut_nearest BY center_id; -- this gets worse as K/NDOCS gets smaller
---
--- cut_clusters = FOREACH clusters GENERATE group AS center_id, cut_nearest.vector AS vector_collection;
--- STORE cut_clusters INTO '$TFIDF-clusters-0';
-
-clusters = LOAD '$TFIDF-clusters-0' AS (center_id:chararray, vectors:bag {t:tuple (vector:bag {s:tuple (token:chararray, weight:double)})});
-centroids = FOREACH clusters GENERATE
- center_id,
- varaha.text.TermVectorCentroid(vectors) -- implement this
- ;
-STORE centroids INTO '$TFIDF-centroids-0';
@@ -0,0 +1,35 @@
+h1. Document Clustering
+
+Here we use K-means clustering to group a set of raw text documents into K clusters.
+
+h3. TFIDF
+
+First we need to run tf-idf over our documents to vectorize them. It is assumed that your documents are stored one per line, tab-separated, where the first field is the document id and the second field is the document text, which must contain *no* newlines.
+
+<pre><code>
+pig -p DOCS=/path/to/my_docs -p TFIDF=/path/to/output tfidf.pig
+</code></pre>
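+
+For reference, a (hypothetical) input file with two documents would look like this, with a literal tab between the id and the text:
+
+<pre><code>
+doc_1	call me ishmael some years ago never mind how long precisely
+doc_2	it was the best of times it was the worst of times
+</code></pre>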
+
+h3. K Centers
+
+This is the tricky step. There's a Pig script that uniformly samples your data to generate K initial centers, but it's hacky: the sampler itself can't take parameters, so you have to precompute the sample rate S as S=(K+5)/NDOCS, which guarantees slightly more than K vectors survive the sample before the LIMIT to K. To use it, do:
+
+<pre><code>
+pig -p TFIDF=/path/to/tfidf-output -p CENTERS=/path/to/output -p K=10 -p S=0.0003 sample_k_centers.pig
+</code></pre>
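+
+The K and S values above are only an illustration: with K=10 centers and 50,000 documents, S = (10+5)/50000 = 0.0003.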
+
+You can also generate your own centers, as long as they can be loaded with the following Pig schema:
+
+<pre><code>
+(doc_id:chararray, vector:bag {t:tuple (token:chararray, weight:double)})
+</code></pre>
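+
+A (hypothetical) centers record in that schema, as Pig would print it, looks something like:
+
+<pre><code>
+(doc_42,{(pig,0.31),(hadoop,0.27),(clustering,0.12)})
+</code></pre>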
+
+h3. Iteration
+
+I haven't written a driver yet. Boo hoo. Meanwhile, to run a single iteration that clusters document vectors around the current K centers and computes new centroids, do:
+
+<pre><code>
+pig -p TFIDF=/path/to/tfidf-output -p CURR_CENTERS=/path/to/current-centers -p NEXT_CENTERS=/path/to/next-centers cluster_documents.pig
+</code></pre>
+
+Run that a bunch of times, using the output centers as the new centers for the next iteration, and you should be happy.
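+
+Until a real driver exists, a bash sketch along these lines does the job (paths and the iteration count are hypothetical; it assumes the initial centers from the sampling step were stored at /path/to/centers-0):
+
+<pre><code>
+#!/usr/bin/env bash
+# Run N iterations of cluster_documents.pig, feeding each iteration's output
+# centers back in as the next iteration's current centers.
+TFIDF=/path/to/tfidf-output
+CENTERS=/path/to/centers
+N=10
+for i in $(seq 1 $N); do
+  pig -p TFIDF=$TFIDF \
+      -p CURR_CENTERS=${CENTERS}-$((i-1)) \
+      -p NEXT_CENTERS=${CENTERS}-$i \
+      cluster_documents.pig
+done
+</code></pre>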
@@ -0,0 +1,49 @@
+register '../../target/varaha-1.0-SNAPSHOT.jar'; -- yikes, just autoregister this in the runner
+
+vectors = LOAD '$TFIDF' AS (doc_id:chararray, vector:bag {t:tuple (token:chararray, weight:double)});
+k_centers = LOAD '$CURR_CENTERS' AS (doc_id:chararray, vector:bag {t:tuple (token:chararray, weight:double)});
+
+--
+-- Compute the similarity between every document vector and each of the K centers.
+--
+-- FIXME: this can be optimized for large K; CROSS is dangerous
+--
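+-- (CROSS emits one record for every (center, document) pair, K * NDOCS records in all)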
+with_centers = CROSS k_centers, vectors;
+similarities = FOREACH with_centers GENERATE
+ k_centers::doc_id AS center_id,
+ k_centers::vector AS center,
+ vectors::doc_id AS doc_id,
+ vectors::vector AS vector,
+ varaha.text.TermVectorSimilarity(k_centers::vector, vectors::vector) AS similarity
+ ;
+
+--
+-- For each vector, find its nearest center
+--
+finding_nearest = GROUP similarities BY doc_id;
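+-- TOP(1, 4, similarities) keeps, for each doc_id group, the single tuple with the
+-- largest value in field 4, the zero-indexed position of 'similarity'.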
+only_nearest = FOREACH finding_nearest {
+ nearest_center = TOP(1, 4, similarities);
+ GENERATE
+ FLATTEN(nearest_center) AS (center_id, center, doc_id, vector, similarity)
+ ;
+ };
+
+--
+-- Group on center_id and collect all the documents associated with each center. This
+-- can be quite memory intensive and gets nearly impossible when K/NDOCS is a small number.
+--
+cut_nearest = FOREACH only_nearest GENERATE center_id, vector;
+clusters = GROUP cut_nearest BY center_id; -- this gets worse as K/NDOCS gets smaller
+cut_clusters = FOREACH clusters GENERATE group AS center_id, cut_nearest.vector AS vector_collection;
+
+--
+-- Compute the centroid of all the documents associated with a given center. These will be the new
+-- centers in the next iteration.
+--
+centroids = FOREACH cut_clusters GENERATE
+    center_id,
+ varaha.text.TermVectorCentroid(vector_collection)
+ ;
+
+STORE centroids INTO '$NEXT_CENTERS';
@@ -1,7 +1,7 @@
--
-- Load the vectors from the tfidf process.
--
-vectors = LOAD '$TFIDF-vectors' AS (doc_id:chararray, vector:bag {t:tuple (token:chararray, weight:double)});
+vectors = LOAD '$TFIDF' AS (doc_id:chararray, vector:bag {t:tuple (token:chararray, weight:double)});
--
-- Choose K random centers. This is kind of a hacky process. Since we can't really use
@@ -13,4 +13,4 @@ vectors = LOAD '$TFIDF-vectors' AS (doc_id:chararray, vector:bag {t:tuple (token
sampled = SAMPLE vectors $S;
k_centers = LIMIT sampled $K;
-STORE k_centers INTO '$TFIDF-centers-0';
+STORE k_centers INTO '$CENTERS';
@@ -62,4 +62,4 @@ tfidf_all = FOREACH token_usages {
grouped = GROUP tfidf_all BY doc_id;
vectors = FOREACH grouped GENERATE group AS doc_id, tfidf_all.(token, weight) AS vector;
-STORE vectors INTO '$OUT';
+STORE vectors INTO '$TFIDF';
