### Coding Challenge #3:

In this coding challenge, you will work through two scenarios that introduce the Spark MLlib package and use it to make predictions.


**Question 1**: We are going to use Spark's MLlib library (specifically a decision tree model) to predict whether a person gets hired, based on a selected set of attributes/features. The **ask** is to train a Decision Tree model on "Hiring" data and then use the trained model on test data to predict outcomes (**hired** or **not hired**).

**Dataset**: https://www.dropbox.com/s/owywl67x4y7ftv8/History_Hires.csv?raw=1 - Download the file, save it to a local folder, and then use the textFile method of the SparkContext class to read it in.

The dataset consists of the following attributes:

**1)** Years Experience
**2)** Employed?
**3)** Previous Employers (i.e. how many previous employers they have had)
**4)** Level of Education (i.e. degrees)
**5)** Top-Tier School
**6)** Interned?
**7)** Hired (i.e. dependent variable)

Once the decision tree model is trained, test it against the following two test candidates and surface predictions.

**Test Candidate 1**: 

The first candidate has 10 years of experience, is currently employed, has had 3 previous employers, and holds a BS degree from a non-top-tier school, where he or she did not do an internship.

**Test Candidate 2**:

The second candidate has 0 years of experience, is currently not employed, has had no previous employers, and holds a BS degree from a non-top-tier school, where he or she did not do an internship.

**Stretch Goal**: 

Make up a large number of test candidates and write them to a CSV file. Read the CSV file and then test the trained model against your test candidates to surface predictions.
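
One way to approach the stretch goal is sketched below, using only the Python standard library to make up candidates and round-trip them through a CSV file. The file name `test_candidates.csv`, the helper names, and the value ranges are assumptions made for illustration; the encoding mirrors the Y/N and degree mappings used for the training data in this notebook.

```python
import csv
import random

# Assumed encodings, matching the mappings used for the training data.
YES_NO = {'N': 0, 'Y': 1}
EDUCATION = {'BS': 0, 'MS': 1, 'PhD': 2}

def make_candidates(n, seed=0):
    """Generate n made-up candidates in the raw CSV layout (no Hired column)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        rows.append([
            rng.randint(0, 20),            # Years Experience
            rng.choice('YN'),              # Employed?
            rng.randint(0, 6),             # Previous Employers
            rng.choice(list(EDUCATION)),   # Level of Education
            rng.choice('YN'),              # Top-Tier School
            rng.choice('YN'),              # Interned?
        ])
    return rows

def write_candidates(path, rows):
    """Write the raw candidate rows to a CSV file without a header."""
    with open(path, 'w', newline='') as f:
        csv.writer(f).writerows(rows)

def read_encoded(path):
    """Read the CSV back and encode each row as a numeric feature list."""
    encoded = []
    with open(path, newline='') as f:
        for years, employed, employers, degree, top_tier, interned in csv.reader(f):
            encoded.append([
                float(years), YES_NO[employed], float(employers),
                EDUCATION[degree], YES_NO[top_tier], YES_NO[interned],
            ])
    return encoded

write_candidates('test_candidates.csv', make_candidates(100))
features = read_encoded('test_candidates.csv')
```

Each encoded row has the same six-column layout as the feature vectors used for training, so once a model is trained, something like `model.predict(sc.parallelize(features)).collect()` should return one prediction per candidate.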

Reference: https://spark.apache.org/docs/2.3.0/mllib-decision-tree.html

In [1]:
from pyspark import SparkContext, SparkConf, SparkFiles
from pyspark.mllib.tree import DecisionTree
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

import os

In [2]:
conf = SparkConf().setAppName("CC3").setMaster("local[2]")
sc = SparkContext(conf=conf)

In [3]:
sc.addFile('https://uc22357bc171769db5c2d50b65b9.dl.dropboxusercontent.com/cd/0/inline/AJXdOZX3BxXZVGAFZypk6TdnQeDGQex5n28d0Pev9CJde_uJyKDxKhT9qunecHjGEBZIKZJN6GAM8dfsMGgrLxofjN5fjy0DpsSsXgQAJginaKoemAT35zZ0QXI-8lpHKViGjB5YRl766ASgqOl50hJgNKxqqWWqxOkDuOAZIwjgMFDkXXkYAS5cv2D4ZOnB-6c/file')

In [4]:
data = sc.textFile(SparkFiles.get('file'))

In [5]:
header = data.first()
data = data.filter(lambda line: line != header)

In [6]:
data = data.map(lambda line: line.split(','))

In [7]:
data.collect()

[['10', 'Y', '4', 'BS', 'N', 'N', 'Y'],
 ['0', 'N', '0', 'BS', 'Y', 'Y', 'Y'],
 ['7', 'N', '6', 'BS', 'N', 'N', 'N'],
 ['2', 'Y', '1', 'MS', 'Y', 'N', 'Y'],
 ['20', 'N', '2', 'PhD', 'Y', 'N', 'N'],
 ['0', 'N', '0', 'PhD', 'Y', 'Y', 'Y'],
 ['5', 'Y', '2', 'MS', 'N', 'Y', 'Y'],
 ['3', 'N', '1', 'BS', 'N', 'Y', 'Y'],
 ['15', 'Y', '5', 'BS', 'N', 'N', 'Y'],
 ['0', 'N', '0', 'BS', 'N', 'N', 'N'],
 ['1', 'N', '1', 'PhD', 'Y', 'N', 'N'],
 ['4', 'Y', '1', 'BS', 'N', 'Y', 'Y'],
 ['0', 'N', '0', 'PhD', 'Y', 'N', 'Y']]

In [8]:
yes_no_dict = {'N': 0, 'Y': 1}
edu_dict = {'BS': 0, 'MS': 1, 'PhD': 2}

In [9]:
def label_encode(line, indexer, col):
    line[col] = indexer[line[col]]
    return line
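
As a quick check of what `label_encode` does, the sketch below applies it to the first row of the dataset outside of Spark. The function and dictionaries are repeated so the snippet runs on its own; note that the function modifies the list in place and returns the same object.

```python
yes_no_dict = {'N': 0, 'Y': 1}
edu_dict = {'BS': 0, 'MS': 1, 'PhD': 2}

def label_encode(line, indexer, col):
    # Replace the string in column `col` with its integer code.
    line[col] = indexer[line[col]]
    return line

row = ['10', 'Y', '4', 'BS', 'N', 'N', 'Y']   # first row of the dataset
for col, indexer in [(1, yes_no_dict), (3, edu_dict), (4, yes_no_dict),
                     (5, yes_no_dict), (6, yes_no_dict)]:
    row = label_encode(row, indexer, col)

print(row)   # ['10', 1, '4', 0, 0, 0, 1]
```

The numeric columns (years of experience and previous employers) are still strings at this point and need to be cast before training.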

In [10]:
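# Each categorical column is encoded in its own map call. Wrapping these calls
# in a for loop raised errors, most likely because the lambda picks up the
# loop variable lazily (only when Spark finally runs the job).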
data = data.map(lambda line: label_encode(line, yes_no_dict, 1))
data = data.map(lambda line: label_encode(line, edu_dict, 3))
data = data.map(lambda line: label_encode(line, yes_no_dict, 4))
data = data.map(lambda line: label_encode(line, yes_no_dict, 5))
data = data.map(lambda line: label_encode(line, yes_no_dict, 6))
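
The loop version most likely fails because of how Python closures interact with Spark's lazy evaluation: a `lambda` created in a loop looks up the loop variable when it is finally called, not when it is defined, so every deferred `map` would see the values from the last iteration. This is an assumption about the cause, since the original traceback is not shown. The sketch below reproduces the effect in plain Python (no Spark needed) and shows the usual workaround of binding the current value as a default argument; the `columns` list is made up for the illustration.

```python
columns = [(1, 'employed'), (3, 'education'), (4, 'top_tier')]

# Late binding: each lambda reads `col` when it is called, after the loop ended.
late = [lambda: col for col, _ in columns]
late_result = [f() for f in late]
print(late_result)    # [4, 4, 4]

# Workaround: a default argument captures the value at definition time.
bound = [lambda col=col: col for col, _ in columns]
bound_result = [f() for f in bound]
print(bound_result)   # [1, 3, 4]
```

Applied to this notebook, the same idea would look like `data = data.map(lambda line, d=indexer, c=col: label_encode(line, d, c))` inside a loop over `(col, indexer)` pairs.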

In [11]:
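def line_to_labeledpoint(line):
    # The last column (Hired) is the label; the rest are the features.
    *features, label = line
    # Years Experience and Previous Employers are still strings at this point,
    # so cast every value to float explicitly instead of relying on implicit
    # conversion inside LabeledPoint.
    return LabeledPoint(float(label), [float(x) for x in features])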

data = data.map(line_to_labeledpoint)

In [12]:
model = DecisionTree.trainClassifier(data, numClasses=2, 
                                     categoricalFeaturesInfo={1:2, 3:3, 4:2, 5:2},
                                     impurity='gini', maxDepth=5, maxBins=32)
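
`impurity='gini'` tells the trainer to pick splits that reduce Gini impurity, which is 1 minus the sum of the squared class proportions. As a small worked example, the snippet below computes the impurity of the 13 training labels shown in the `data.collect()` output (9 hired, 4 not hired) before any split is made. The helper name `gini` is made up for this illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

# Hired column of the 13 rows in the dataset, in order.
hired = ['Y', 'Y', 'N', 'Y', 'N', 'Y', 'Y', 'Y', 'Y', 'N', 'N', 'Y', 'Y']
root_impurity = gini(hired)
print(round(root_impurity, 3))   # 0.426
```

A node that contains only one class has impurity 0, which is what the tree works toward as it splits.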

Test Candidate 1:

The first candidate has 10 years of experience, is currently employed, has had 3 previous employers, and holds a BS degree from a non-top-tier school, where he or she did not do an internship.

Test Candidate 2:

The second candidate has 0 years of experience, is currently not employed, has had no previous employers, and holds a BS degree from a non-top-tier school, where he or she did not do an internship.

In [13]:
test1 = [10, 1, 3, 0, 0, 0]
model.predict(test1)

1.0

In [14]:
test2 = [0, 0, 0, 0, 0, 0]
model.predict(test2)

0.0

In [15]:
os.remove(SparkFiles.get('file'))

**Question 2**: The ask in this case is to build a Logistic Regression model that classifies a body of text as "Spam" or "Ham". You will use the "SMSSpamCollection" file, which contains labeled spam and ham messages. You will need to create a feature vector from the text data and then train a Logistic Regression model on the entire set of messages (both spam and ham). Once you have trained the model, test it with 2 messages (one spam and one ham) to see how the model categorizes each of them (1 indicates spam and 0 indicates ham).

**Test Message 1 (Spam)**:

"Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"


**Test Message 2 (Ham)**:

"I've been searching for the right words to thank you for this breather"

**Dataset**: https://www.dropbox.com/s/z5zm0fxevqvujee/SMSSpamCollection.tsv?raw=1 - Download the file, save it to a local folder, and then use the textFile method of the SparkContext class to read it in.


In [16]:
from pyspark.mllib.feature import HashingTF

In [17]:
sc.addFile('https://uc7ff37af3d623228ac1e3441384.dl.dropboxusercontent.com/cd/0/inline/AJWVbSHDtM1eOSh4hkz0ovY9J81DJozzzjVPrTg7O0uqjhJhrjjIxnm-Liq9IzMlDVbbaXNwUGwm5lnDXY9JCiASulav49bR8pC8d5cUO-SArHcs972RTXPBsuRee54mtkZK_roORzXe9hH2yO0B5z4ivPSfJ4EYJHoQRgQAmI206WkPtg6mUvijHGDZ7w7oAa0/file')

In [18]:
data = sc.textFile(SparkFiles.get('file'))

In [19]:
data = data.map(lambda line: line.split('\t'))

In [20]:
data.take(5)

[['ham',
  "I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times."],
 ['spam',
  "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"],
 ['ham', "Nah I don't think he goes to usf, he lives around here though"],
 ['ham',
  'Even my brother is not like to speak with me. They treat me like aids patent.'],
 ['ham', 'I HAVE A DATE ON SUNDAY WITH WILL!!']]

In [21]:
data = data.map(lambda line: (line[0], line[1].split()))

In [22]:
print(data.first())

('ham', ["I've", 'been', 'searching', 'for', 'the', 'right', 'words', 'to', 'thank', 'you', 'for', 'this', 'breather.', 'I', 'promise', 'i', 'wont', 'take', 'your', 'help', 'for', 'granted', 'and', 'will', 'fulfil', 'my', 'promise.', 'You', 'have', 'been', 'wonderful', 'and', 'a', 'blessing', 'at', 'all', 'times.'])


In [23]:
labels = data.map(lambda x: x[0])
documents = data.map(lambda x: x[1])

In [24]:
ham_spam = {'ham': 0, 'spam': 1}
labels = labels.map(lambda x: ham_spam[x])

In [25]:
hashingTF = HashingTF()
tf = hashingTF.transform(documents)
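
`HashingTF` maps each token to a column index by hashing it and then counts how often each index occurs, so no vocabulary has to be built first. The sketch below imitates that idea in plain Python. It uses `hashlib.md5` purely to get a reproducible illustration, which is an assumption and not the hash function Spark uses, so the indices will not match Spark's output. The vector size of 2^20 = 1048576 matches the `SparseVector` length shown in this notebook.

```python
import hashlib
from collections import Counter

NUM_FEATURES = 1 << 20   # 1048576, the vector size seen in the Spark output

def bucket(token, num_features=NUM_FEATURES):
    # Stable stand-in hash (assumption): md5 digest reduced modulo the vector size.
    digest = hashlib.md5(token.encode('utf-8')).hexdigest()
    return int(digest, 16) % num_features

def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Return a sparse term-frequency mapping {index: count}."""
    return dict(Counter(bucket(t, num_features) for t in tokens))

tokens = "thank you for this and for that and for more".split()
tf_sketch = hashing_tf(tokens)
```

Two different tokens can land on the same index; a large vector size keeps such collisions rare, which is why the default is so much bigger than a typical SMS vocabulary.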

In [26]:
dataset = labels.zip(tf)

In [27]:
dataset.first()

(0,
 SparseVector(1048576, {1475: 1.0, 70882: 1.0, 151357: 2.0, 154253: 1.0, 163495: 1.0, 173174: 1.0, 231791: 1.0, 235395: 1.0, 238153: 1.0, 241476: 1.0, 250929: 1.0, 270412: 1.0, 276491: 3.0, 463522: 1.0, 479025: 1.0, 486014: 1.0, 488866: 1.0, 494808: 1.0, 550685: 1.0, 578619: 2.0, 622323: 1.0, 648331: 1.0, 702216: 1.0, 706364: 1.0, 724221: 1.0, 789438: 1.0, 837499: 1.0, 910746: 1.0, 935701: 1.0, 990085: 1.0, 1000347: 1.0, 1016101: 1.0, 1031802: 1.0}))

In [28]:
dataset = dataset.map(lambda x: LabeledPoint(x[0], x[1]))

In [29]:
model = LogisticRegressionWithLBFGS.train(dataset)
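
Once trained, a binary logistic regression model scores a feature vector by passing the weighted sum of its features through the logistic function and comparing the result with a threshold (0.5 by default in MLlib). The sketch below shows that decision rule on a sparse `{index: value}` vector. The weights, intercept, and indices are hypothetical and are not taken from the trained model.

```python
import math

def predict_label(features, weights, intercept=0.0, threshold=0.5):
    """Return 1 (spam) if the logistic score exceeds the threshold, else 0 (ham)."""
    margin = intercept + sum(weights.get(i, 0.0) * v for i, v in features.items())
    score = 1.0 / (1.0 + math.exp(-margin))
    return 1 if score > threshold else 0

# Hypothetical weights: index 7 pushes toward spam, index 42 toward ham.
weights = {7: 2.5, 42: -1.8}

spam_like = predict_label({7: 2.0, 99: 1.0}, weights)
ham_like = predict_label({42: 1.0, 99: 3.0}, weights)
print(spam_like, ham_like)   # 1 0
```

Indices with no learned weight (99 here) contribute nothing to the score, which is how unseen words in a test message are effectively ignored.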

Test Message 1 (Spam):

"Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"

Test Message 2 (Ham):

"I've been searching for the right words to thank you for this breather"

In [30]:
def process(text):
    tf = hashingTF.transform(text.split())
    return tf

In [31]:
test1 = ["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"]
test1 = sc.parallelize(test1)
test1 = test1.map(process)

In [32]:
model.predict(test1).collect()

[1]

In [33]:
test2 = ["I've been searching for the right words to thank you for this breather"]
test2 = sc.parallelize(test2)
test2 = test2.map(process)

In [34]:
model.predict(test2).collect()

[0]