# Homework - kNN implementation with Spark 

In this homework you will implement the kNN algorithm with Spark and apply it to classify text documents. “Classification” is the task of labeling documents based on their contents. You will be asked to perform 4 tasks, covering data preparation, feature extraction, and classification.

## Data

The dataset for this homework is the same one you used in the Spark introduction lab: the widely used “20 newsgroups” dataset. A newsgroup post is like an old-school blog post, and this dataset has 19,997 such posts from 20 different categories, according to where the post was made. 

The 20 categories are listed in the file `news_categories.txt`. The category name can be extracted from the id of the document. For example, 
* the document with the id `20_newsgroups/comp.graphics/37261` is from the `comp.graphics` category, 
* the document with the id `20_newsgroups/sci.med/59082` is from the `sci.med` category. 
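
As a minimal sketch of that extraction (the helper name `category_of` is hypothetical, and it assumes every id follows the `20_newsgroups/<category>/<number>` pattern shown above):

```python
def category_of(doc_id):
    """Return the newsgroup name embedded in a document id.

    Assumes ids look like '20_newsgroups/<category>/<number>'.
    """
    # the category is the middle component of the slash-separated id
    return doc_id.split('/')[1]


print(category_of('20_newsgroups/comp.graphics/37261'))
print(category_of('20_newsgroups/sci.med/59082'))
```

The same one-line split can be used inside a Spark `map` when you need the label of a training document.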

The data file contains one document per line. It can be accessed at:

`s3://comp643bucket/lab/spark_intro_aws/20_news_same_line.txt`

We have also provided a small subset of the data in the file `20_news_same_line_random_sample.txt`, so that you can debug your Spark code on a small dataset before running it on the entire dataset. 

In [80]:
import pyspark
import re
import numpy as np

In [81]:
# pyspark works best with java8 
# set the JAVA_HOME environment variable to the java8 path 
%env JAVA_HOME = /usr/lib/jvm/java-8-openjdk-amd64

env: JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64


In [82]:
# reuse the running SparkContext if there is one; calling SparkContext() a second time
# in the same notebook raises "Cannot run multiple SparkContexts at once"
sc = pyspark.SparkContext.getOrCreate()

## Task 1 - compute "bag of words" for each document (25 pts)

For task 1, we want to extract "bag of words" features for documents. 

The first part of this task is the same as what you've already implemented in `Lab - Spark introduction (Vocareum)`. We need a dictionary, as an RDD, that includes the 20,000 most frequent words in the training corpus. The RDD must be in this format:
`
[('mostcommonword', 0),
 ('nextmostcommonword', 1),
 ...]
`

**NOTE**: There aren’t 20,000 unique words in the small dataset (`20_news_same_line_random_sample.txt`). Use only the top 50 words when working with this file.
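
To make the target format concrete, here is a small plain-Python sketch (no Spark) that ranks words by frequency on a toy word list. The function name `build_ref_dict` and the toy data are assumptions for illustration only; the provided Spark code further down is what actually builds `refDict`.

```python
from collections import Counter


def build_ref_dict(words, num_words):
    """Return [(word, rank), ...] for the num_words most frequent words."""
    counts = Counter(words)
    # most_common() sorts by descending count; enumerate() assigns ranks 0, 1, ...
    return [(word, rank)
            for rank, (word, _count) in enumerate(counts.most_common(num_words))]


toy_words = "the cat and the dog and the bird".split()
toy_ref_dict = build_ref_dict(toy_words, 2)
print(toy_ref_dict)
```

Here "the" occurs three times and "and" twice, so they receive ranks 0 and 1.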

For this part, we have provided the code, so you only need to run it to create this dictionary, named `refDict`, as an RDD. This `refDict` RDD will be our reference dictionary of words. **The words in `refDict` are the reference words for which we will compute "bag of words" and "TF-IDF" features, first for our training corpus and finally for the test documents.**

**Provided code to create the reference dictionary of words.**

Run the code cells below to create the `refDict` RDD.

In [83]:
# set the number of dictionary words 
# 50 for the small dataset
# 20,000 for the large dataset
numWords = 50

In [84]:
# load up the dataset 
# "data/20_news_same_line_random_sample.txt" for small dataset 
# "s3://comp643bucket/lab/spark_intro_aws/20_news_same_line.txt" for entire large dataset
corpus = sc.textFile ("data/20_news_same_line_random_sample.txt")

# each entry in validLines will be a line from the text file
validLines = corpus.filter(lambda x : 'id=' in x)

# now we transform it into a bunch of (docID, text) pairs
keyAndText = validLines.map(lambda x : (x[x.index('id="') + 4 : x.index('" url=')], x[x.index('"> ') + 3:x.index(' </doc>')]))

# now we split the text in each (docID, text) pair into a list of words
# after this, we have a data set with (docID, ["word1", "word2", "word3", ...])
# we have a bit of fancy regular expression stuff here to make sure that we do not
# die on some of the documents                       
regex = re.compile('[^a-zA-Z]')  
keyAndListOfWords = keyAndText.map(lambda x : (str(x[0]), regex.sub(' ', x[1]).lower().split()))

# now get the top 20,000 words... first change (docID, ["word1", "word2", "word3", ...])
# to ("word1", 1) ("word2", 1)...
allWords = keyAndListOfWords.flatMap(lambda x: ((j, 1) for j in x[1]))

# now, count all of the words, giving us ("word1", 1433), ("word2", 3423423), etc.
allCounts = allWords.reduceByKey (lambda a, b: a + b)

# and get the top numWords (50 for small dataset, 20K for large dataset) frequent words in a local array
topWords = allCounts.top (numWords, lambda x : x[1])

# and we'll create an RDD that has a bunch of (word, rank) pairs
# start by creating an RDD that has the number 0 up to numWords (50 for small dataset, 20K for large dataset) 
# numWords is the number of words that will be in our dictionary
twentyK = sc.parallelize(range(numWords))

# now, we transform (0), (1), (2), ... to ("mostcommonword", 0) ("nextmostcommon", 1), ...
# the number will be the spot in the dictionary used to tell us where the word is located
refDict = twentyK.map(lambda x:(topWords[x][0],x))

In [85]:
refDict.take(100)

[('the', 0),
 ('to', 1),
 ('of', 2),
 ('a', 3),
 ('and', 4),
 ('i', 5),
 ('in', 6),
 ('is', 7),
 ('that', 8),
 ('it', 9),
 ('you', 10),
 ('for', 11),
 ('s', 12),
 ('t', 13),
 ('w', 14),
 ('from', 15),
 ('on', 16),
 ('this', 17),
 ('are', 18),
 ('be', 19),
 ('not', 20),
 ('have', 21),
 ('with', 22),
 ('as', 23),
 ('or', 24),
 ('but', 25),
 ('edu', 26),
 ('was', 27),
 ('if', 28),
 ('they', 29),
 ('subject', 30),
 ('m', 31),
 ('date', 32),
 ('lines', 33),
 ('apr', 34),
 ('can', 35),
 ('re', 36),
 ('by', 37),
 ('at', 38),
 ('gmt', 39),
 ('my', 40),
 ('what', 41),
 ('an', 42),
 ('all', 43),
 ('he', 44),
 ('will', 45),
 ('we', 46),
 ('do', 47),
 ('writes', 48),
 ('would', 49)]

Now, your task is to write Spark code to create "bag of words" features based on the words in the reference dictionary, `refDict`.  

You need to create a new RDD, named `bag_of_words`. Each element of this RDD corresponds to one document and is a key-value pair. The key is the document identifier `id` (like `20_newsgroups/comp.graphics/37261`), and the value is a `numpy` array with `numWords` entries (50 for the small dataset, 20K for the large dataset), where entry i of the array is the number of times the word with rank i in `refDict` (created in the first part) appears in that document. This array holds the "bag of words" features for the document.  

Once you have created this `bag_of_words` RDD, print out the result arrays for these documents:
* `20_newsgroups/soc.religion.christian/21626`
* `20_newsgroups/talk.politics.misc/179019`
* `20_newsgroups/rec.autos/103167`

Since each array is going to be huge, with a lot of zeros, just print out the non-zero entries in the array (that is, for an array `a`, print out `a[a.nonzero()]`).
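
The sketch below illustrates what one count vector should contain, using plain Python on a single toy sentence. It is only meant to show the expected result: the local dict `toy_ref_dict`, the function name, and the sentence are assumptions for illustration, and a real solution has to keep `refDict` as an RDD and do this work with Spark transformations.

```python
import re


def bag_of_words_vector(text, ref_dict):
    """Count how often each reference word occurs in text.

    ref_dict maps word -> rank; the result has one slot per rank.
    """
    # same cleaning as the provided code: keep letters only, lower-case, split
    tokens = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    vec = [0] * len(ref_dict)
    for tok in tokens:
        rank = ref_dict.get(tok)
        if rank is not None:      # words outside the dictionary are ignored
            vec[rank] += 1
    return vec


# hypothetical 4-word dictionary, standing in for refDict
toy_ref_dict = {'the': 0, 'to': 1, 'of': 2, 'a': 3}
vec = bag_of_words_vector("The cat went to the store to buy a hat.", toy_ref_dict)
non_zero = [c for c in vec if c != 0]
print(vec, non_zero)
```

"the" and "to" each occur twice, "a" once, and "of" never, so the slot for rank 2 stays zero and is dropped from the non-zero printout.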

<font color=red>Important: `refDict` is an RDD and must stay an RDD as you work with it; don't collect it into a Python object</font> 

In [86]:
# putting the key and value in a string together 
keyValue = keyAndListOfWords.map (lambda x: list(str(x[0]) + ' ' + str(val) for val in x[1]))
keyValueOne = keyValue.flatMap (lambda x: ((j, 1) for j in x))

# count the number of times that key + value occur 
counting = keyValueOne.reduceByKey (lambda a, b: a + b)

# make the words the key to join with the reference dictionary 
splitToJoin = counting.map (lambda x: (x[0].split()[1], (x[0].split()[0], x[1])))
joined = splitToJoin.join(refDict)

# reorganize to have the document as key and (word index, word count) as value 
split = joined.map(lambda x: (x[1][0][0], (x[1][1], x[1][0][1])))
lists = split.map (lambda x: (x[0], [x[1]]))
proper = lists.reduceByKey (lambda a, b: a + b)

# create an array of zeroes and join it to the RDD with the doc, word index, and word count 
zeros = keyAndListOfWords.map(lambda x: (x[0], np.zeros(numWords)))
joinedZeros = zeros.join(proper)

# create a function to initialize the zero array to the proper values 
def init(x):
    doc = x[0]
    arr, tuples = x[1]
    for tup in tuples:
        arr[tup[0]] = tup[1]
    return (doc, arr)

# final RDD
bag_of_words = joinedZeros.map(init)

In [87]:
bag_of_words.take(5)

[('20_newsgroups/comp.graphics/38583',
  array([1., 3., 0., 2., 0., 2., 2., 1., 0., 1., 0., 0., 1., 1., 0., 1., 0.,
         1., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 1., 0., 1., 1.,
         1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 2.])),
 ('20_newsgroups/comp.graphics/38752',
  array([5., 1., 0., 1., 0., 4., 2., 1., 2., 0., 0., 0., 0., 0., 0., 1., 0.,
         0., 1., 0., 0., 0., 1., 0., 1., 0., 2., 0., 0., 0., 1., 2., 1., 1.,
         1., 0., 1., 1., 0., 1., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0.])),
 ('20_newsgroups/comp.os.ms-windows.misc/10020',
  array([2., 3., 3., 2., 2., 0., 0., 4., 0., 0., 2., 0., 0., 0., 0., 1., 0.,
         2., 0., 0., 0., 0., 2., 0., 1., 0., 2., 0., 0., 0., 1., 0., 1., 1.,
         1., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1.])),
 ('20_newsgroups/comp.os.ms-windows.misc/9810',
  array([12.,  8.,  5.,  8.,  4.,  1.,  4.,  9.,  6.,  4.,  6.,  3.,  1.,
          3.,  0.,  2.,  3.,  2.,  3.,  2.,  6.,  1.,  4.,  5

Once you have created your `bag_of_words` RDD, print out the result arrays for these documents,
* `20_newsgroups/soc.religion.christian/21626`
* `20_newsgroups/talk.politics.misc/179019`
* `20_newsgroups/rec.autos/103167`

by running the code cells below:

In [88]:
arr1_1 = np.array(bag_of_words.filter(lambda x: x[0]=='20_newsgroups/soc.religion.christian/21626').values().collect())
arr1_1[arr1_1.nonzero()]

array([ 7.,  2., 10.,  4.,  4.,  5.,  1.,  6.,  7.,  8.,  1.,  1.,  1.,
        3.,  1.,  2.,  1.,  1.,  2.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.])

In [89]:
arr1_2 = np.array(bag_of_words.filter(lambda x: x[0]=='20_newsgroups/talk.politics.misc/179019').values().collect())
arr1_2[arr1_2.nonzero()]

array([ 7., 23.,  5., 17.,  6.,  5., 14., 10.,  3., 20., 15.,  4., 11.,
        1.,  1.,  4.,  4.,  8.,  4.,  3.,  2.,  1.,  2.,  3., 10.,  3.,
        1.,  1.,  1.,  1.,  1.,  1.,  2.,  1.,  1.,  1.,  2.,  2.,  2.,
        3.])

In [90]:
arr1_3 = np.array(bag_of_words.filter(lambda x: x[0]=='20_newsgroups/rec.autos/103167').values().collect())
arr1_3[arr1_3.nonzero()]

array([9., 1., 2., 3., 2., 1., 1., 1., 1., 1., 1., 1., 1., 2., 1., 3., 1.,
       1., 1., 1., 1., 1.])

## Task 2 - compute TF-IDF for each document (30 pts)

It is often difficult to classify documents accurately using raw count vectors (bag of words). Thus, the next task is to write some more Spark code that converts each of the count vectors to a TF-IDF vector. You need to create an RDD of key-value pairs, named `tfidf`, in which the keys are document identifiers and the values are the TF-IDF vectors. Again, we are only interested in the top `numWords` (50 for the small dataset, 20K for the large dataset) most common words as our features.  

Entry i in the TF-IDF vector of document $d$ corresponds to the TF-IDF value of the word with rank i in `refDict`. The TF-IDF value of the word with rank i for document $d$ is computed as:

$$ TF(i, d) \times IDF(i) $$

Where $TF(i, d)$ is: 

$$ \frac {\textrm{Number of occurrences of word with rank $i$ in $d$}} {\textrm{Total number of refDict words in $d$}} $$

Note that the “Total number of `refDict` words” is not the number of distinct words. The “total number of words”
in “Today is a great day today” is six. 

And the $IDF(i)$ is:

$$ \ln \frac {\textrm{Number of documents in corpus}} {\textrm{Number of documents having word with rank $i$}} $$
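
The two formulas can be checked by hand on a tiny example. The sketch below uses plain Python lists rather than Spark, and the three toy count vectors are made up for illustration; it assumes every dictionary word appears in at least one document, so the IDF denominator is never zero.

```python
import math


def tf_vector(counts):
    """Divide each count by the document's total number of dictionary words."""
    total = sum(counts)
    if total == 0:                      # document shares no word with the dictionary
        return [0.0] * len(counts)
    return [c / total for c in counts]


def idf_vector(all_counts):
    """ln(number of documents / number of documents containing word i)."""
    n_docs = len(all_counts)
    idf = []
    for i in range(len(all_counts[0])):
        docs_with_word = sum(1 for counts in all_counts if counts[i] > 0)
        idf.append(math.log(n_docs / docs_with_word))
    return idf


# toy bag-of-words vectors for 3 documents over a 3-word dictionary
toy_counts = [[2, 1, 0],
              [1, 0, 1],
              [3, 0, 0]]

toy_idf = idf_vector(toy_counts)
toy_tfidf = [[t * w for t, w in zip(tf_vector(c), toy_idf)] for c in toy_counts]
print(toy_idf)
print(toy_tfidf[0])
```

The word with rank 0 appears in all three documents, so its IDF is $\ln(3/3) = 0$ and it contributes nothing to any TF-IDF vector. The word with rank 1 appears in one document, so its IDF is $\ln 3$, and its TF-IDF value in the first document is $\frac{1}{3} \ln 3$.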

Once you have created this `tfidf` RDD, print out the non-zero array entries (TF-IDF vector) that you have created for these documents:

* 20_newsgroups/soc.religion.christian/21626
* 20_newsgroups/talk.politics.misc/179019
* 20_newsgroups/rec.autos/103167

<font color=red> Important: If you are using `bag_of_words` RDD, don't collect it into a Python object; work with it as an RDD </font>  

In [91]:
# divide each count by the document's total number of refDict words (vectorized);
# the function has its own name so that it is not shadowed by the `tf` RDD below
def compute_tf(x):
    doc, arr = x
    total = arr.sum()
    return (doc, arr / total)

tf = bag_of_words.map(compute_tf)
tf.take(1)

[('20_newsgroups/comp.graphics/38583',
  array([0.03846154, 0.11538462, 0.        , 0.07692308, 0.        ,
         0.07692308, 0.07692308, 0.03846154, 0.        , 0.03846154,
         0.        , 0.        , 0.03846154, 0.03846154, 0.        ,
         0.03846154, 0.        , 0.03846154, 0.        , 0.03846154,
         0.        , 0.        , 0.        , 0.        , 0.03846154,
         0.        , 0.        , 0.        , 0.03846154, 0.        ,
         0.03846154, 0.        , 0.03846154, 0.03846154, 0.03846154,
         0.        , 0.        , 0.        , 0.        , 0.03846154,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.07692308]))]

In [92]:
number_of_docs = bag_of_words.count()

# creating an rdd where all the (key, value) pairs have the same keys, and binary arrays showing a word's presence (vectorized)
def to_binary(x):
    arr = x[1]
    binary = (arr != 0).astype(int)
    return (1, binary)

binary = bag_of_words.map(to_binary)

# adding all these elements of the rdd to find number of docs containing each word 
docs_having_word = binary.reduceByKey (lambda a, b: a + b)

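python_result = docs_having_word.first()
python_array = np.array(python_result[1])
# NOTE: this is the raw ratio N / df; the IDF formula above takes the natural log
# of this ratio (np.log), which is not applied here
idf = number_of_docs / python_array
print(idf)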

[ 1.07066381  1.14155251  1.2300123   1.14547537  1.21359223  1.20772947
  1.16144019  1.2755102   1.37551582  1.38888889  1.71821306  1.37551582
  1.68918919  1.77304965 16.66666667  1.          1.61290323  1.64473684
  1.89753321  1.74216028  1.7921147   1.65289256  1.67504188  2.1691974
  1.78571429  1.92678227  1.61550889  2.35849057  1.92307692  2.43902439
  1.          3.89105058  1.          1.00502513  1.06269926  2.0746888
  1.3986014   2.57069409  2.30946882  1.17096019  2.55102041  2.44498778
  2.35849057  2.48138958  4.52488688  2.96735905  3.57142857  2.7173913
  1.81818182  2.4691358 ]


In [93]:
# element-wise product of tf and idf
tfidf = tf.map(lambda x: (x[0], x[1] * idf))
tfidf.take(1)

[('20_newsgroups/comp.graphics/38583',
  array([0.04117938, 0.1317176 , 0.        , 0.08811349, 0.        ,
         0.09290227, 0.08934155, 0.04905808, 0.        , 0.0534188 ,
         0.        , 0.        , 0.06496881, 0.06819422, 0.        ,
         0.03846154, 0.        , 0.06325911, 0.        , 0.06700616,
         0.        , 0.        , 0.        , 0.        , 0.06868132,
         0.        , 0.        , 0.        , 0.0739645 , 0.        ,
         0.03846154, 0.        , 0.03846154, 0.03865481, 0.04087305,
         0.        , 0.        , 0.        , 0.        , 0.04503693,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.18993352]))]

Once you have created your `tfidf` RDD, print out the result arrays for these documents,
* `20_newsgroups/soc.religion.christian/21626`
* `20_newsgroups/talk.politics.misc/179019`
* `20_newsgroups/rec.autos/103167`

by running the code cells below:

In [94]:
arr2_1 = np.array(tfidf.filter(lambda x: x[0]=='20_newsgroups/soc.religion.christian/21626').values().collect())
arr2_1[arr2_1.nonzero()]

array([0.09733307, 0.02965071, 0.15974186, 0.05950521, 0.06304375,
       0.07842399, 0.01508364, 0.09939041, 0.12504689, 0.14430014,
       0.01298701, 0.0209468 , 0.02136022, 0.07392987, 0.02327422,
       0.04293227, 0.02175379, 0.02319109, 0.05004629, 0.02497502,
       0.01298701, 0.01298701, 0.01305227, 0.02694401, 0.0299931 ,
       0.01520728, 0.03313014, 0.03853713, 0.0352908 ])

In [95]:
arr2_2 = np.array(tfidf.filter(lambda x: x[0]=='20_newsgroups/talk.politics.misc/179019').values().collect())
arr2_2[arr2_2.nonzero()]

array([0.03638178, 0.12745489, 0.02985467, 0.09452952, 0.03534735,
       0.0281903 , 0.08668516, 0.06677261, 0.02022654, 0.1668168 ,
       0.10015892, 0.03279979, 0.09467741, 0.00485437, 0.00782963,
       0.03193664, 0.0368453 , 0.06765671, 0.03479834, 0.02407125,
       0.01626254, 0.01053008, 0.01733703, 0.02352683, 0.09335325,
       0.03551977, 0.00485437, 0.00485437, 0.00487876, 0.00515873,
       0.00678933, 0.0124791 , 0.02242203, 0.00568427, 0.01238359,
       0.01144898, 0.02409116, 0.02880931, 0.01765225, 0.03595829])

In [96]:
arr2_3 = np.array(tfidf.filter(lambda x: x[0]=='20_newsgroups/rec.autos/103167').values().collect())
arr2_3[arr2_3.nonzero()]

array([0.26043174, 0.03085277, 0.06559958, 0.09792401, 0.06278055,
       0.03447325, 0.0371761 , 0.0371761 , 0.02702703, 0.04359198,
       0.05128468, 0.04843553, 0.04467277, 0.0905428 , 0.0520752 ,
       0.19122896, 0.02702703, 0.02702703, 0.02716284, 0.0287216 ,
       0.03780004, 0.03164757])

## Task 3 - build a kNN classifier (30 pts)

Task 3 is to build a kNN classifier, as a Python function named `predictLabel` in the cell below. This function takes as input a text string (`test_doc`) and a number k, and outputs the name of one of the 20 newsgroups. This name is the newsgroup that the classifier thinks the text string is “closest” to. It is computed using the classical kNN algorithm. 

Your function first needs to convert the input string into lower-case words, and then compute a TF-IDF vector corresponding to the words in `refDict` created in the first task. Recall that the words in `refDict` are our reference words for computing "TF-IDF" features. In task 2, we already computed TF-IDF values of these words for our training corpus. In this task, you need to compute TF-IDF values of these words for the input text string `test_doc`. For that, you need to compute the term frequency of these words in `test_doc`. Since the IDF of a word depends only on the training corpus, and it was already calculated for the `refDict` words in task 2, you don't need to re-calculate IDF for `test_doc` and can re-use what you have.    
Then, your function needs to find the k documents in the corpus that are “closest” to `test_doc` (where distance is computed using the l2 norm between the TF-IDF feature vectors), and return the newsgroup label that is most frequent among those k. Ties go to the label with the closest corpus document. 
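
The voting and tie-breaking rule can be sketched in plain Python on a handful of toy vectors. The function name `knn_label`, the two-dimensional vectors, and the labels below are assumptions for illustration; in the homework the training vectors live in the `tfidf` RDD and the k nearest ones should be found with Spark.

```python
import math
from collections import Counter


def l2_distance(u, v):
    """Euclidean (l2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def knn_label(train, query, k):
    """train is a list of (label, vector); return the majority label of the k nearest."""
    by_distance = sorted(train, key=lambda item: l2_distance(item[1], query))
    nearest_labels = [label for label, _vec in by_distance[:k]]   # closest first
    votes = Counter(nearest_labels)
    best = max(votes.values())
    # scanning closest-first means a tie goes to the label of the closest document
    for label in nearest_labels:
        if votes[label] == best:
            return label


toy_train = [('sci.med',   [0.0, 0.0]),
             ('rec.autos', [1.0, 0.0]),
             ('rec.autos', [0.0, 2.0]),
             ('sci.med',   [3.0, 3.0])]
toy_query = [0.1, 0.0]

print(knn_label(toy_train, toy_query, 3))   # two of the three nearest are rec.autos
print(knn_label(toy_train, toy_query, 2))   # 1-1 tie, resolved by the closest document
```

With k = 2 each label gets one vote, and the result is `sci.med` because the single closest training vector carries that label.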

Once you have implemented your function, run it on the following 8 test cases. Each is an excerpt from a Wikipedia article, chosen to match one of the 20 newsgroups. By reading each test document, you can guess which of the 20 newsgroups is the most relevant topic, and you can compare that with what your prediction function returns. The results you get from the small dataset might not be very accurate, due to the small training corpus. But once you run it on the entire dataset in S3, you should see reasonable results (with few mismatches).  

<font color=red>Important: `refDict` is an RDD and must stay an RDD as you work with it; don't collect it into a Python object</font> 

In [187]:
# k is the number of neighbors to consider
# test_doc is the text to compare 

def predictLabel (k, test_doc):
    
    # your code here

    # clean test doc 
    regex = re.compile('[^a-zA-Z]')  
    new_test_doc = regex.sub(' ', test_doc).lower().split()

    rdd = sc.parallelize (new_test_doc)
    rdd_ones = rdd.map(lambda x: (x, 1))

    # count the number of times that each word appears
    test_counting = rdd_ones.reduceByKey (lambda a, b: a + b)

    # join the count with the refDict
    test_joined = test_counting.join(refDict)
    joined_no_label = test_joined.map(lambda x: x[1])

    # create bag of word based on word count and index in the refDict 
    test_proper = joined_no_label.collect()
    test_bag_of_words = np.zeros(numWords)

    for tup in test_proper:
        test_bag_of_words[tup[1]] = tup[0]

    # calculate tf; warn and fall back to a total of 1 to avoid division by 0 
    test_total = test_bag_of_words.sum()
    if test_total == 0:
        print('TEST DOCUMENT HAS NOTHING IN COMMON WITH REFDICT')
        test_total = 1
    test_tf = test_bag_of_words / test_total

    # calculate tfidf
    test_tfidf = test_tf * idf

    # squared l2 distance between this doc and each training doc
    # (the square root is skipped because it does not change the ordering)
    tfidf_difference = tfidf.map (lambda x: (x[0], x[1] - test_tfidf))
    tfidf_distance = tfidf_difference.map (lambda x: (x[0], (x[1] * x[1]).sum()))

    # keep the k closest documents in a python list, nearest first
    # (takeOrdered avoids sorting the whole RDD just to read k elements)
    k_closest = tfidf_distance.takeOrdered(k, key=lambda x: x[1])

    # find the number of occurrences of each category
    cleaned_k_closest = []
    number_occurences = {}

    for doc, dist in k_closest:
        cat = doc.split('/')[1]
        if cat not in cleaned_k_closest:
            cleaned_k_closest.append(cat)
            number_occurences[cat] = 1
        else: 
            number_occurences[cat] += 1

    # find the majority. In case of a tie, picks the closest doc
    majority = ''
    biggest_occurence = 0

    for cat in cleaned_k_closest:
        if number_occurences[cat] > biggest_occurence:
            majority = cat
            biggest_occurence = number_occurences[cat]
    
    return majority

#### Test cases

Run your predictLabel function on the 8 test cases below.

In [188]:
print(predictLabel (10, 'Graphics are pictures and movies created using computers – usually referring to image data created by a computer specifically with help from specialized graphical hardware and software. It is a vast and recent area in computer science. The phrase was coined by computer graphics researchers Verne Hudson and William Fetter of Boeing in 1960. It is often abbreviated as CG, though sometimes erroneously referred to as CGI. Important topics in computer graphics include user interface design, sprite graphics, vector graphics, 3D modeling, shaders, GPU design, implicit surface visualization with ray tracing, and computer vision, among others. The overall methodology depends heavily on the underlying sciences of geometry, optics, and physics. Computer graphics is responsible for displaying art and image data effectively and meaningfully to the user, and processing image data received from the physical world. The interaction and understanding of computers and interpretation of data has been made easier because of computer graphics. Computer graphic development has had a significant impact on many types of media and has revolutionized animation, movies, advertising, video games, and graphic design generally.'))

talk.politics.mideast


In [189]:
print(predictLabel (10, 'A deity is a concept conceived in diverse ways in various cultures, typically as a natural or supernatural being considered divine or sacred. Monotheistic religions accept only one Deity (predominantly referred to as God), polytheistic religions accept and worship multiple deities, henotheistic religions accept one supreme deity without denying other deities considering them as equivalent aspects of the same divine principle, while several non-theistic religions deny any supreme eternal creator deity but accept a pantheon of deities which live, die and are reborn just like any other being. A male deity is a god, while a female deity is a goddess. The Oxford reference defines deity as a god or goddess (in a polytheistic religion), or anything revered as divine. C. Scott Littleton defines a deity as a being with powers greater than those of ordinary humans, but who interacts with humans, positively or negatively, in ways that carry humans to new levels of consciousness beyond the grounded preoccupations of ordinary life.'))

talk.politics.guns


In [190]:
print(predictLabel (10, 'Egypt, officially the Arab Republic of Egypt, is a transcontinental country spanning the northeast corner of Africa and southwest corner of Asia by a land bridge formed by the Sinai Peninsula. Egypt is a Mediterranean country bordered by the Gaza Strip and Israel to the northeast, the Gulf of Aqaba to the east, the Red Sea to the east and south, Sudan to the south, and Libya to the west. Across the Gulf of Aqaba lies Jordan, and across from the Sinai Peninsula lies Saudi Arabia, although Jordan and Saudi Arabia do not share a land border with Egypt. It is the worlds only contiguous Eurafrasian nation. Egypt has among the longest histories of any modern country, emerging as one of the worlds first nation states in the tenth millennium BC. Considered a cradle of civilisation, Ancient Egypt experienced some of the earliest developments of writing, agriculture, urbanisation, organised religion and central government. Iconic monuments such as the Giza Necropolis and its Great Sphinx, as well the ruins of Memphis, Thebes, Karnak, and the Valley of the Kings, reflect this legacy and remain a significant focus of archaeological study and popular interest worldwide. Egypts rich cultural heritage is an integral part of its national identity, which has endured, and at times assimilated, various foreign influences, including Greek, Persian, Roman, Arab, Ottoman, and European. One of the earliest centers of Christianity, Egypt was Islamised in the seventh century and remains a predominantly Muslim country, albeit with a significant Christian minority.'))

talk.politics.mideast


In [191]:
print(predictLabel (10, 'The term atheism originated from the Greek atheos, meaning without god(s), used as a pejorative term applied to those thought to reject the gods worshiped by the larger society. With the spread of freethought, skeptical inquiry, and subsequent increase in criticism of religion, application of the term narrowed in scope. The first individuals to identify themselves using the word atheist lived in the 18th century during the Age of Enlightenment. The French Revolution, noted for its unprecedented atheism, witnessed the first major political movement in history to advocate for the supremacy of human reason. Arguments for atheism range from the philosophical to social and historical approaches. Rationales for not believing in deities include arguments that there is a lack of empirical evidence; the problem of evil; the argument from inconsistent revelations; the rejection of concepts that cannot be falsified; and the argument from nonbelief. Although some atheists have adopted secular philosophies (eg. humanism and skepticism), there is no one ideology or set of behaviors to which all atheists adhere.'))

talk.politics.mideast


In [192]:
print(predictLabel (10, 'President Dwight D. Eisenhower established NASA in 1958 with a distinctly civilian (rather than military) orientation encouraging peaceful applications in space science. The National Aeronautics and Space Act was passed on July 29, 1958, disestablishing NASAs predecessor, the National Advisory Committee for Aeronautics (NACA). The new agency became operational on October 1, 1958. Since that time, most US space exploration efforts have been led by NASA, including the Apollo moon-landing missions, the Skylab space station, and later the Space Shuttle. Currently, NASA is supporting the International Space Station and is overseeing the development of the Orion Multi-Purpose Crew Vehicle, the Space Launch System and Commercial Crew vehicles. The agency is also responsible for the Launch Services Program (LSP) which provides oversight of launch operations and countdown management for unmanned NASA launches.'))

talk.politics.mideast


In [193]:
print(predictLabel (10, 'The transistor is the fundamental building block of modern electronic devices, and is ubiquitous in modern electronic systems. First conceived by Julius Lilienfeld in 1926 and practically implemented in 1947 by American physicists John Bardeen, Walter Brattain, and William Shockley, the transistor revolutionized the field of electronics, and paved the way for smaller and cheaper radios, calculators, and computers, among other things. The transistor is on the list of IEEE milestones in electronics, and Bardeen, Brattain, and Shockley shared the 1956 Nobel Prize in Physics for their achievement.'))

talk.politics.mideast


In [194]:
print(predictLabel (10, 'The Colt Single Action Army which is also known as the Single Action Army, SAA, Model P, Peacemaker, M1873, and Colt .45 is a single-action revolver with a revolving cylinder holding six metallic cartridges. It was designed for the U.S. government service revolver trials of 1872 by Colts Patent Firearms Manufacturing Company – todays Colts Manufacturing Company – and was adopted as the standard military service revolver until 1892. The Colt SAA has been offered in over 30 different calibers and various barrel lengths. Its overall appearance has remained consistent since 1873. Colt has discontinued its production twice, but brought it back due to popular demand. The revolver was popular with ranchers, lawmen, and outlaws alike, but as of the early 21st century, models are mostly bought by collectors and re-enactors. Its design has influenced the production of numerous other models from other companies.'))

talk.religion.misc


In [195]:
print(predictLabel (10, 'Howe was recruited by the Red Wings and made his NHL debut in 1946. He led the league in scoring each year from 1950 to 1954, then again in 1957 and 1963. He ranked among the top ten in league scoring for 21 consecutive years and set a league record for points in a season (95) in 1953. He won the Stanley Cup with the Red Wings four times, won six Hart Trophies as the leagues most valuable player, and won six Art Ross Trophies as the leading scorer. Howe retired in 1971 and was inducted into the Hockey Hall of Fame the next year. However, he came back two years later to join his sons Mark and Marty on the Houston Aeros of the WHA. Although in his mid-40s, he scored over 100 points twice in six years. He made a brief return to the NHL in 1979–80, playing one season with the Hartford Whalers, then retired at the age of 52. His involvement with the WHA was central to their brief pre-NHL merger success and forced the NHL to expand their recruitment to European talent and to expand to new markets.'))

talk.politics.mideast


## Task 4 - run on the entire dataset in EMR cluster (15 pts)

For the last part of this homework, you need to run your Spark code for tasks 1 through 3, on the entire dataset stored in S3, in an AWS EMR cluster. 

Follow the instructions in `Lab - Spark Intro (AWS)` to create and connect to an EMR cluster in AWS and run Spark programs there. For better efficiency, in the hardware configuration of your cluster, choose `m5.xlarge` as the instance type, and enter 4 as the number of core instances. For your reference, an efficient program should take no more than about 4 minutes overall with this configuration.

--------------------------------------------
**Important new configuration in your cluster:**

The latest AWS EMR version does not come with pre-installed numpy. So, we need to install numpy on all the nodes in the cluster. This is doable through bootstrapping while creating a cluster. When you are creating a cluster, under "Bootstrap actions", click on "add", provide a name such as "install_numpy", and under "Script location", copy this S3 URI: 

`s3://comp643bucket/install_numpy.sh`

and click on "Add bootstrap action". The rest of cluster configuration is the same as what you had done in the labs.

For your information, the content of install_numpy.sh is as follows:

```
#!/bin/bash
sudo python3 -m pip install numpy
```

Running this shell script over the cluster through bootstrapping will install numpy on all the nodes in the cluster. 

-----------------------------------------------

You can gather your code for each task in a Python `.py` file, submit the files as jobs in batch mode, and get the final results back. To troubleshoot, you can run your code line by line in interactive mode to debug your program.       

The entire dataset exists in this S3 URI: `s3://comp643bucket/lab/spark_intro_aws/20_news_same_line.txt`

Repeat tasks 1 through 3 on the entire dataset in your EMR cluster, and print your results in the markdown cells below (keep the results from the small subset above). 

**Repeat task 1 on the entire dataset in your EMR cluster - print out the non-zero array entries (bag of words) that you have created for documents:**

* `20_newsgroups/soc.religion.christian/21626`
* `20_newsgroups/talk.politics.misc/179019`
* `20_newsgroups/rec.autos/103167`

**using the following code:**

```
arr1_1 = np.array(bag_of_words.filter(lambda x: x[0]=='20_newsgroups/soc.religion.christian/21626').values().collect())
print("20_newsgroups/soc.religion.christian/21626 ... \n", arr1_1[arr1_1.nonzero()])

arr1_2 = np.array(bag_of_words.filter(lambda x: x[0]=='20_newsgroups/talk.politics.misc/179019').values().collect())
print("20_newsgroups/talk.politics.misc/179019 ...\n", arr1_2[arr1_2.nonzero()])

arr1_3 = np.array(bag_of_words.filter(lambda x: x[0]=='20_newsgroups/rec.autos/103167').values().collect())
print("20_newsgroups/rec.autos/103167 ... \n", arr1_3[arr1_3.nonzero()])
```

20_newsgroups/soc.religion.christian/21626 ...
 
 [ 7.  2. 10.  4.  4.  5.  1.  6.  7.  8.  1.  1.  1.  1.  3.  2.  1.  1.
  1.  2.  1.  1.  1.  1.  1.  1.  1.  3.  1.  1.  1.  1.  1.  4.  1.  1.
  1.  1.  2.  2.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  2.  1.  1.  1.  1.  1.  2.  1.  1.  1.  1.  1.  2.  1.  1.  1.  1.
  1.  1.  3.  1.  1.  1.  1.  1.  1.  1.  1.  2.  1.  2.  1.  1.  2.  1.
  1.  1.  1.  2.  1.  1.  1.  2.  3.  1.  1.  2.  1.  2.  5.  3.  1.  2.
  1.  1.  1.  1.  3.  1.  1.  1.  1.  2.  2.]
 
20_newsgroups/talk.politics.misc/179019 ...
 
 [ 7. 23.  5. 17.  6.  5. 14. 10.  3. 15. 20.  4.  1. 11.  1.  4.  8.  4.
  4.  3.  2.  1.  2.  3. 10.  3.  1.  1.  1.  1.  1.  1.  2.  1.  1.  1.
  1.  2.  2.  3.  5.  2.  2.  1.  1.  2.  1.  1.  8.  1.  2.  1.  3.  1.
  1.  2.  1.  2.  1.  1.  2.  2.  3.  1.  1.  1.  1.  2.  1.  1. 11.  3.
  1.  1.  1.  1.  1.  2.  1.  2.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  2.  2.  1.  1.  1.  1.  1.  1.  2.  1.  1.  2.  1.  1.  6.  1.  3.  1.
  3.  1.  1.  1.  3.  3.  2.  1.  1.  3. 11.  2.  1.  1.  1.  1. 11.  1.
  1.  1.  1.  1.  2.  2.  1.  1.  1.  2.  2.  1.  5.  1.  2.  1.  4.  1.
  2.  1.  1.  1.  1.  1.  2.  3.  1.  1.  1.  1.  4.  3.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  2.  1.  2.  1.  1.  1.  1.]
  
20_newsgroups/rec.autos/103167 ...
 
 [9. 1. 2. 3. 2. 1. 1. 1. 1. 1. 1. 1. 1. 2. 3. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 2. 1. 2. 1. 1. 1. 3. 1. 1. 1. 1. 1. 1. 1. 3. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 2. 2. 1. 1. 3. 1. 1. 1. 1. 1. 1. 2. 1. 1. 1.]
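
The printing code for task 1 wraps the result of `collect()` in `np.array` and then indexes it with `nonzero()`. The toy example below shows what that does, assuming (as the code suggests) that each value in `bag_of_words` is a single numpy count vector. The five-element vector is made up for illustration.

```python
import numpy as np

# collect() returns a list; with one matching document it holds one vector.
collected = [np.array([0.0, 2.0, 0.0, 1.0, 0.0])]

# Wrapping the list gives a 2-D array with one row per matching document.
arr = np.array(collected)

# nonzero() returns one index array per dimension: (row indices, column indices).
rows, cols = arr.nonzero()

# Indexing with that tuple keeps only the non-zero counts, in column order.
nonzero_entries = arr[arr.nonzero()]

print(arr.shape)        # (1, 5)
print(cols)             # [1 3]
print(nonzero_entries)  # [2. 1.]
```

Because the entries come back in column order, the printed values follow the positions of the words in the vector, and the number of printed values equals the number of distinct dictionary words in the document.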


**Repeat task 2 on the entire dataset in your EMR cluster - print out the non-zero array entries (TF-IDF) that you have created for documents:**

* `20_newsgroups/soc.religion.christian/21626`
* `20_newsgroups/talk.politics.misc/179019`
* `20_newsgroups/rec.autos/103167`

**using the following code:**

```
arr2_1 = np.array(tfidf.filter(lambda x: x[0]=='20_newsgroups/soc.religion.christian/21626').values().collect())
print('20_newsgroups/soc.religion.christian/21626 ... \n', arr2_1[arr2_1.nonzero()])

arr2_2 = np.array(tfidf.filter(lambda x: x[0]=='20_newsgroups/talk.politics.misc/179019').values().collect())
print("20_newsgroups/talk.politics.misc/179019 ...\n", arr2_2[arr2_2.nonzero()])

arr2_3 = np.array(tfidf.filter(lambda x: x[0]=='20_newsgroups/rec.autos/103167').values().collect())
print("20_newsgroups/rec.autos/103167 ... \n", arr2_3[arr2_3.nonzero()])
```

20_newsgroups/soc.religion.christian/21626 ...
 [3.78280215e-02 1.14274677e-02 6.04144996e-02 2.28536359e-02
 2.41353278e-02 3.01149116e-02 5.93301276e-03 3.94171432e-02
 4.91930946e-02 5.60873158e-02 5.02512563e-03 8.28830726e-03
 8.38233543e-03 9.12692436e-03 2.84157142e-02 1.69213500e-02
 8.99860636e-03 9.71550200e-03 9.70705537e-03 1.94913078e-02
 5.02512563e-03 5.03974308e-03 5.02512563e-03 1.10462171e-02
 1.19899102e-02 5.91172121e-03 1.29011988e-02 3.72221647e-02
 1.35209146e-02 1.56254762e-02 1.33662460e-02 1.36069651e-02
 1.46184808e-02 5.56871362e-02 1.76603580e-02 1.44961681e-02
 1.60420557e-02 1.80343570e-02 3.57606538e-02 1.19913410e-01
 1.81680414e-02 9.18532333e-02 5.91450484e-02 2.20512261e-02
 2.26885160e-02 2.11330047e-02 7.32949943e-02 2.79753444e-02
 2.52227503e-02 2.55368328e-02 2.07190592e-01 2.86207454e-02
 6.75318798e-02 2.99783524e-02 3.15106420e-02 5.44204913e-02
 4.13018649e-02 6.92539195e-02 5.60755788e-02 5.05979039e-02
 5.63586299e-02 1.08988544e-01 5.53040381e-02 5.30276713e-02
 6.66362316e-02 6.76227707e-02 5.77514007e-02 1.78803269e-01
 9.39134927e-02 7.18794257e-02 9.09388572e-02 8.13663459e-02
 8.14983270e-02 9.95911171e-02 3.62334509e-01 8.64036433e-02
 1.11901378e-01 1.11405141e-01 1.52253693e-01 1.82042459e-01
 2.00573727e-01 2.33691714e-01 1.45633967e-01 3.05897830e-01
 2.10224764e-01 3.20534090e-01 1.35610576e-01 2.03415865e-01
 5.82535868e-01 2.18450950e-01 2.39255803e-01 2.40977068e-01
 4.63075747e-01 6.78969170e-01 5.26112237e-01 4.67383429e-01
 4.69567463e-01 1.21069201e+00 2.16879361e+00 6.35996438e-01
 5.55179211e-01 1.86087847e+00 7.85058103e-01 3.04507385e+00
 5.70951348e+00 3.91509496e+00 1.10425755e+00 2.61006330e+00
 1.49981250e+00 1.79441852e+00 1.54596057e+00 1.76293749e+00
 1.50731156e+01 2.39255803e+00 2.79131770e+00 5.91102572e+00
 5.91102572e+00 1.82704431e+01 2.00974874e+01]

20_newsgroups/talk.politics.misc/179019 ...
 [1.69163512e-02 5.87679996e-02 1.35084106e-02 4.34347474e-02
 1.61896525e-02 1.32659499e-02 4.11296483e-02 3.14267282e-02
 9.40565380e-03 4.64226019e-02 7.65668404e-02 1.50884172e-02
 2.24719101e-03 4.57565366e-02 3.70645650e-03 1.49940202e-02
 3.18138610e-02 1.63259141e-02 1.69430026e-02 1.13506135e-02
 8.04819175e-03 4.91868199e-03 8.68937033e-03 1.07607947e-02
 4.34090791e-02 1.69275786e-02 2.24719101e-03 2.25372780e-03
 2.24719101e-03 2.44356056e-03 5.80507411e-03 3.20224319e-03
 1.07235601e-02 2.64366859e-03 5.42129070e-03 5.76930012e-03
 5.54847248e-03 1.13707183e-02 1.06599641e-02 1.74377488e-02
 2.92178665e-02 1.39751450e-02 8.52049273e-03 6.58322277e-03
 7.76115348e-03 9.85894661e-03 8.06480234e-03 7.16128743e-03
 6.03587356e-02 7.06113744e-03 1.95591202e-02 1.77336538e-02
 2.43737545e-02 7.21764835e-03 8.23023419e-03 1.77511668e-02
 8.38691278e-03 1.91670201e-02 9.32111152e-03 9.86111008e-03
 1.89009795e-02 2.35828279e-02 3.58159500e-02 1.13678418e-02
 1.28870314e-02 3.40432414e-02 1.50694429e-02 2.62482936e-02
 1.50291233e-02 1.51303295e-02 1.63678101e-01 4.62474223e-02
 1.55384089e-02 1.53736157e-02 1.82005179e-02 1.61586043e-02
 1.86770900e-02 3.83585819e-02 1.84850180e-02 4.02301510e-02
 2.06227988e-02 2.02146103e-02 2.05285878e-02 2.12770259e-02
 2.53309350e-02 2.55469464e-02 2.47314687e-02 2.67641922e-02
 3.36103804e-02 3.06528504e-02 6.04805904e-02 6.09316321e-02
 3.18026034e-02 3.54673075e-02 4.18408554e-02 3.62688286e-02
 4.05935670e-02 3.69548344e-02 7.03240667e-02 4.48026706e-02
 4.14166624e-02 8.98741573e-02 4.98193777e-02 5.04910996e-02
 3.32456809e-01 8.14077512e-02 3.43031135e-01 6.45647682e-02
 2.09334217e-01 7.85613263e-02 6.68706528e-02 1.24135576e-01
 4.92011810e-01 3.01591132e-01 1.69254534e-01 1.09602631e-01
 9.94183156e-02 3.51987561e-01 2.34269130e+00 2.89916636e-01
 1.11230393e-01 1.04993174e-01 1.33741306e-01 1.35761567e-01
 2.06823375e+00 1.00982199e-01 1.19196495e-01 1.18255470e-01
 1.47334684e-01 1.14053499e-01 3.43031135e-01 3.59496629e-01
 2.01511563e-01 1.86460907e-01 1.97092450e-01 4.34174673e-01
 1.30252402e+00 1.91221611e-01 1.09602631e+00 2.09986349e-01
 4.60893114e-01 2.62789934e-01 1.37212454e+00 2.55324311e-01
 6.46576671e-01 2.59751900e-01 3.01591132e-01 2.89916636e-01
 4.68094569e-01 5.10648621e-01 1.72834918e+00 1.89874980e+00
 5.28671514e-01 5.68823780e-01 5.99161049e-01 7.13286963e-01
 2.85314785e+00 2.54360823e+00 1.12342697e+00 1.21451564e+00
 8.64174589e-01 1.15223279e+00 8.98741573e-01 1.12342697e+00
 9.76893014e-01 9.17083238e-01 1.12342697e+00 9.56108056e-01
 1.12342697e+00 2.36510940e+00 1.87237828e+00 4.73021881e+00
 4.08518897e+00 1.12342697e+01 2.80856742e+00 3.20979133e+00
 2.80856742e+00 2.80856742e+00]

20_newsgroups/rec.autos/103167 ...
 [9.97790671e-02 1.17219901e-02 2.47573723e-02 3.70692830e-02
 2.43437019e-02 1.34776830e-02 1.44174165e-02 1.41979779e-02
 1.03092784e-02 1.70038468e-02 1.87243087e-02 1.94320520e-02
 1.73574673e-02 3.69221168e-02 7.44061498e-02 1.99936611e-02
 1.03092784e-02 1.03392667e-02 1.03092784e-02 1.12101489e-02
 1.46907033e-02 1.21281703e-02 2.44519795e-02 2.74214737e-02
 4.79987518e-02 1.35449829e-01 4.77430846e-02 4.24012010e-02
 2.78587350e-01 5.37420853e-02 1.03258021e-01 4.54485536e-02
 7.10143435e-02 7.32603551e-02 2.89272178e-01 1.03803947e-01
 9.07770318e-02 1.23519856e-01 1.18479678e-01 1.48633482e-01
 2.06154639e-01 2.40834859e-01 1.18253139e+00 2.79342330e-01
 3.64230811e-01 3.91928972e-01 3.24142514e-01 4.29488832e-01
 4.75010689e-01 4.00300270e-01 4.05816219e-01 4.99163775e-01
 4.23315481e-01 5.60202824e-01 6.84899133e-01 1.13897591e+00
 3.14739907e+00 5.49745704e+00 2.10361877e+00 2.24081130e+00
 1.04824393e+01 2.60955239e+00 2.78587350e+00 3.81767850e+00
 4.68533271e+00 3.61674806e+00 6.65014965e+00 4.12309278e+01
 1.47253314e+01 1.47253314e+01 2.57693299e+01]

**Repeat task 3 on the entire dataset in your EMR cluster - print out the predicted label for each of the test documents below:**

In [None]:
print(predictLabel (10, 'Graphics are pictures and movies created using computers – usually referring to image data created by a computer specifically with help from specialized graphical hardware and software. It is a vast and recent area in computer science. The phrase was coined by computer graphics researchers Verne Hudson and William Fetter of Boeing in 1960. It is often abbreviated as CG, though sometimes erroneously referred to as CGI. Important topics in computer graphics include user interface design, sprite graphics, vector graphics, 3D modeling, shaders, GPU design, implicit surface visualization with ray tracing, and computer vision, among others. The overall methodology depends heavily on the underlying sciences of geometry, optics, and physics. Computer graphics is responsible for displaying art and image data effectively and meaningfully to the user, and processing image data received from the physical world. The interaction and understanding of computers and interpretation of data has been made easier because of computer graphics. Computer graphic development has had a significant impact on many types of media and has revolutionized animation, movies, advertising, video games, and graphic design generally.'))
print(predictLabel (10, 'A deity is a concept conceived in diverse ways in various cultures, typically as a natural or supernatural being considered divine or sacred. Monotheistic religions accept only one Deity (predominantly referred to as God), polytheistic religions accept and worship multiple deities, henotheistic religions accept one supreme deity without denying other deities considering them as equivalent aspects of the same divine principle, while several non-theistic religions deny any supreme eternal creator deity but accept a pantheon of deities which live, die and are reborn just like any other being. A male deity is a god, while a female deity is a goddess. The Oxford reference defines deity as a god or goddess (in a polytheistic religion), or anything revered as divine. C. Scott Littleton defines a deity as a being with powers greater than those of ordinary humans, but who interacts with humans, positively or negatively, in ways that carry humans to new levels of consciousness beyond the grounded preoccupations of ordinary life.'))
print(predictLabel (10, 'Egypt, officially the Arab Republic of Egypt, is a transcontinental country spanning the northeast corner of Africa and southwest corner of Asia by a land bridge formed by the Sinai Peninsula. Egypt is a Mediterranean country bordered by the Gaza Strip and Israel to the northeast, the Gulf of Aqaba to the east, the Red Sea to the east and south, Sudan to the south, and Libya to the west. Across the Gulf of Aqaba lies Jordan, and across from the Sinai Peninsula lies Saudi Arabia, although Jordan and Saudi Arabia do not share a land border with Egypt. It is the worlds only contiguous Eurafrasian nation. Egypt has among the longest histories of any modern country, emerging as one of the worlds first nation states in the tenth millennium BC. Considered a cradle of civilisation, Ancient Egypt experienced some of the earliest developments of writing, agriculture, urbanisation, organised religion and central government. Iconic monuments such as the Giza Necropolis and its Great Sphinx, as well the ruins of Memphis, Thebes, Karnak, and the Valley of the Kings, reflect this legacy and remain a significant focus of archaeological study and popular interest worldwide. Egypts rich cultural heritage is an integral part of its national identity, which has endured, and at times assimilated, various foreign influences, including Greek, Persian, Roman, Arab, Ottoman, and European. One of the earliest centers of Christianity, Egypt was Islamised in the seventh century and remains a predominantly Muslim country, albeit with a significant Christian minority.'))
print(predictLabel (10, 'The term atheism originated from the Greek atheos, meaning without god(s), used as a pejorative term applied to those thought to reject the gods worshiped by the larger society. With the spread of freethought, skeptical inquiry, and subsequent increase in criticism of religion, application of the term narrowed in scope. The first individuals to identify themselves using the word atheist lived in the 18th century during the Age of Enlightenment. The French Revolution, noted for its unprecedented atheism, witnessed the first major political movement in history to advocate for the supremacy of human reason. Arguments for atheism range from the philosophical to social and historical approaches. Rationales for not believing in deities include arguments that there is a lack of empirical evidence; the problem of evil; the argument from inconsistent revelations; the rejection of concepts that cannot be falsified; and the argument from nonbelief. Although some atheists have adopted secular philosophies (eg. humanism and skepticism), there is no one ideology or set of behaviors to which all atheists adhere.'))
print(predictLabel (10, 'President Dwight D. Eisenhower established NASA in 1958 with a distinctly civilian (rather than military) orientation encouraging peaceful applications in space science. The National Aeronautics and Space Act was passed on July 29, 1958, disestablishing NASAs predecessor, the National Advisory Committee for Aeronautics (NACA). The new agency became operational on October 1, 1958. Since that time, most US space exploration efforts have been led by NASA, including the Apollo moon-landing missions, the Skylab space station, and later the Space Shuttle. Currently, NASA is supporting the International Space Station and is overseeing the development of the Orion Multi-Purpose Crew Vehicle, the Space Launch System and Commercial Crew vehicles. The agency is also responsible for the Launch Services Program (LSP) which provides oversight of launch operations and countdown management for unmanned NASA launches.'))
print(predictLabel (10, 'The transistor is the fundamental building block of modern electronic devices, and is ubiquitous in modern electronic systems. First conceived by Julius Lilienfeld in 1926 and practically implemented in 1947 by American physicists John Bardeen, Walter Brattain, and William Shockley, the transistor revolutionized the field of electronics, and paved the way for smaller and cheaper radios, calculators, and computers, among other things. The transistor is on the list of IEEE milestones in electronics, and Bardeen, Brattain, and Shockley shared the 1956 Nobel Prize in Physics for their achievement.'))
print(predictLabel (10, 'The Colt Single Action Army which is also known as the Single Action Army, SAA, Model P, Peacemaker, M1873, and Colt .45 is a single-action revolver with a revolving cylinder holding six metallic cartridges. It was designed for the U.S. government service revolver trials of 1872 by Colts Patent Firearms Manufacturing Company – todays Colts Manufacturing Company – and was adopted as the standard military service revolver until 1892. The Colt SAA has been offered in over 30 different calibers and various barrel lengths. Its overall appearance has remained consistent since 1873. Colt has discontinued its production twice, but brought it back due to popular demand. The revolver was popular with ranchers, lawmen, and outlaws alike, but as of the early 21st century, models are mostly bought by collectors and re-enactors. Its design has influenced the production of numerous other models from other companies.'))
print(predictLabel (10, 'Howe was recruited by the Red Wings and made his NHL debut in 1946. He led the league in scoring each year from 1950 to 1954, then again in 1957 and 1963. He ranked among the top ten in league scoring for 21 consecutive years and set a league record for points in a season (95) in 1953. He won the Stanley Cup with the Red Wings four times, won six Hart Trophies as the leagues most valuable player, and won six Art Ross Trophies as the leading scorer. Howe retired in 1971 and was inducted into the Hockey Hall of Fame the next year. However, he came back two years later to join his sons Mark and Marty on the Houston Aeros of the WHA. Although in his mid-40s, he scored over 100 points twice in six years. He made a brief return to the NHL in 1979–80, playing one season with the Hartford Whalers, then retired at the age of 52. His involvement with the WHA was central to their brief pre-NHL merger success and forced the NHL to expand their recruitment to European talent and to expand to new markets.'))

talk.politics.misc

talk.religion.misc

soc.religion.christian

talk.politics.misc

talk.politics.misc

alt.atheism

talk.politics.guns

talk.politics.misc
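
For reference, the voting step of a kNN classifier can be sketched in plain Python. This is a generic illustration and not the `predictLabel` implementation used above: the function names are hypothetical, the scores and two of the document ids are toy data, and it assumes that a higher score means a closer neighbor.

```python
from collections import Counter
import heapq


def category_from_id(doc_id):
    """Extract the category, e.g. '20_newsgroups/sci.med/59082' -> 'sci.med'."""
    return doc_id.split("/")[1]


def majority_label(scored_ids, k):
    """Return the most common category among the k highest-scoring documents.

    scored_ids is an iterable of (score, doc_id) pairs.
    """
    top_k = heapq.nlargest(k, scored_ids, key=lambda pair: pair[0])
    votes = Counter(category_from_id(doc_id) for _, doc_id in top_k)
    return votes.most_common(1)[0][0]


# Toy neighbors: two comp.graphics documents outvote one sci.med document.
neighbors = [
    (0.9, "20_newsgroups/sci.med/59082"),
    (0.8, "20_newsgroups/comp.graphics/37261"),
    (0.7, "20_newsgroups/comp.graphics/1"),
    (0.1, "20_newsgroups/sci.med/2"),
]
print(majority_label(neighbors, 3))  # comp.graphics
```

In Spark, the top-k selection would typically happen on an RDD of scores before the small list of neighbors is brought back to the driver for the vote.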


### Copyright ©2020 Christopher M Jermaine (cmj4@rice.edu), Risa B Myers (rbm2@rice.edu), Marmar Orooji (marmar.orooji@rice.edu)

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.