### Coding Challenge #3:

In this coding challenge, you will work through two scenarios that use the Spark MLlib package to train models and surface predictions.


**Question 1**: We are going to use the ML library from Spark (specifically, a decision tree model) to predict whether a person gets hired based on a select set of attributes/features. The **ask** is to train a Decision Tree model on hiring data using the Spark ML library and then use the trained model on test data to predict outcomes (**hired** or **not hired**).

**Dataset**: https://www.dropbox.com/s/owywl67x4y7ftv8/History_Hires.csv?raw=1 - Download the file, save it to a local folder, and then use the `textFile` method of `SparkContext` to read it in.

The dataset consists of the following attributes:

1) Years Experience
2) Employed?
3) Previous Employers (i.e. how many previous employers they have had)
4) Level of Education (i.e. degrees)
5) Top-Tier School
6) Interned?
7) Hired (i.e. dependent variable)

Once the decision tree model is trained, test it against the following 2 test candidates and surface predictions.

**Test Candidate 1**: 

The first candidate has 10 years of experience, is currently employed, has had
3 previous employers, and holds a BS degree from a non-top-tier school where he or she did not do an internship.

**Test Candidate 2**:

The second candidate has 0 years of experience, is currently not employed,
has had no previous employers, and holds a BS degree from a non-top-tier school where
he or she did not do an internship.
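
As a rough sketch of how the two candidate descriptions translate into numeric features (in the column order listed above, minus the Hired label), the hypothetical helper `encode_candidate` below maps Y/N to 1/0 and BS/MS/PhD to 0/1/2, which is the same encoding the notebook applies to the training rows. The helper names and arguments are illustrative assumptions and not part of Spark.

```python
# Hypothetical helpers (not part of Spark): turn a candidate description
# into the six-element feature list in the dataset's column order.
EDUCATION_CODES = {"BS": 0, "MS": 1, "PhD": 2}


def yes_no(flag):
    """Map 'Y' to 1 and anything else to 0."""
    return 1 if flag == "Y" else 0


def encode_candidate(years, employed, previous_employers,
                     education, top_tier, interned):
    return [
        int(years),
        yes_no(employed),
        int(previous_employers),
        EDUCATION_CODES[education],
        yes_no(top_tier),
        yes_no(interned),
    ]


candidate_1 = encode_candidate(10, "Y", 3, "BS", "N", "N")
candidate_2 = encode_candidate(0, "N", 0, "BS", "N", "N")
print(candidate_1)  # [10, 1, 3, 0, 0, 0]
print(candidate_2)  # [0, 0, 0, 0, 0, 0]
```

These two lists match the `test_candidates` rows passed to the trained model further down.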

**Stretch Goal**: 

Make up a large number of test candidates and write them to a CSV file. Read the CSV file and then test the trained model against your test candidates to surface predictions.
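
One possible way to approach the stretch goal is sketched below using only the Python standard library. The value ranges, the file name `test_candidates.csv`, and the helper names are all assumptions made for illustration; the rows are random and carry no Hired label.

```python
import csv
import os
import random
import tempfile

# Assumed header: the six feature columns of the dataset (no Hired column).
FIELDS = ["Years Experience", "Employed?", "Previous employers",
          "Level of Education", "Top-tier school", "Interned"]


def make_candidates(n, seed=0):
    """Generate n made-up candidate rows in the dataset's raw value format."""
    rng = random.Random(seed)  # fixed seed so the file is reproducible
    rows = []
    for _ in range(n):
        rows.append([
            rng.randint(0, 20),               # years of experience (assumed range)
            rng.choice("YN"),                 # currently employed?
            rng.randint(0, 6),                # previous employers (assumed range)
            rng.choice(["BS", "MS", "PhD"]),  # level of education
            rng.choice("YN"),                 # top-tier school?
            rng.choice("YN"),                 # interned?
        ])
    return rows


def write_candidates(path, rows):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(FIELDS)
        writer.writerows(rows)


def read_candidates(path):
    """Read the file back, skipping the header; values come back as strings."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        return [row for row in reader]


# Hypothetical output location in the system temp directory.
csv_path = os.path.join(tempfile.gettempdir(), "test_candidates.csv")
rows = make_candidates(100)
write_candidates(csv_path, rows)
loaded = read_candidates(csv_path)
print(len(loaded), loaded[0])
```

Each loaded row would still need to be encoded numerically (Y/N to 1/0, degree to 0/1/2) before being handed to `sc.parallelize` and `hires_model.predict`, as the notebook does for the two named candidates.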

Reference: https://spark.apache.org/docs/2.3.0/mllib-decision-tree.html

In [0]:
# https://mikestaszel.com/2018/03/07/apache-spark-on-google-colaboratory/
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
!tar xf spark-2.3.1-bin-hadoop2.7.tgz
!pip install -q findspark

In [0]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.3.1-bin-hadoop2.7"

In [0]:
import findspark
findspark.init()
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = SparkContext.getOrCreate()

In [5]:
from pyspark import SparkFiles

sc.addFile('https://raw.githubusercontent.com/saranyamandava/ML-Sprint-Challenges/master/Datasets/History_Hires.csv')
hires = sc.textFile(SparkFiles.get('History_Hires.csv'))
hires.take(5)

['Years Experience,Employed?,Previous employers,Level of Education,Top-tier school,Interned,Hired',
 '10,Y,4,BS,N,N,Y',
 '0,N,0,BS,Y,Y,Y',
 '7,N,6,BS,N,N,N',
 '2,Y,1,MS,Y,N,Y']

In [7]:
# Remove the header
header = hires.first()
hires = hires.filter(lambda row: row != header)

# Split on commas
hires = hires.map(lambda row: row.split(','))



In [8]:
# Helper function to encode each row as a LabeledPoint
from pyspark.mllib.regression import LabeledPoint

def construct_labeled_points(row):
  years = int(row[0])
  employed = 1 if row[1] == 'Y' else 0
  previous_employers = int(row[2])
  # Education is encoded as BS -> 0, MS -> 1, PhD -> 2
  education = 1 if row[3] == 'MS' else 2 if row[3] == 'PhD' else 0
  top_tier = 1 if row[4] == 'Y' else 0
  interned = 1 if row[5] == 'Y' else 0  # the data uses uppercase 'Y'
  hired = 1 if row[6] == 'Y' else 0
  return LabeledPoint(hired, [years, employed, previous_employers, education, top_tier, interned])
  

labeled_points_rdd = hires.map(construct_labeled_points)


In [9]:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree

In [10]:
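# categoricalFeaturesInfo maps feature index -> number of categories;
# features 0 (years) and 2 (previous employers) are treated as continuous
hires_model = DecisionTree.trainClassifier(labeled_points_rdd, categoricalFeaturesInfo={1:2, 3:3, 4:2, 5:2}, numClasses=2)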

In [11]:
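# Feature order: years, employed, previous employers, education (BS=0), top-tier, interned
test_candidates = [[10, 1, 3, 0, 0, 0],   # Test Candidate 1
                   [0, 0, 0, 0, 0, 0]]    # Test Candidate 2
test_data = sc.parallelize(test_candidates)
predict = hires_model.predict(test_data)
print(predict.collect())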


[1.0, 0.0]
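
Since the label was encoded as 1 for `Y` and 0 otherwise, the numeric predictions can be translated back into the outcomes the question asks for. The snippet below is a minimal sketch of that step; `OUTCOMES` and `describe_predictions` are hypothetical names, and the input list simply repeats the values printed above.

```python
# Map the numeric class back to the question's labels (1 = hired, 0 = not hired).
OUTCOMES = {0.0: "not hired", 1.0: "hired"}


def describe_predictions(predictions):
    """Translate a list of 0/1 predictions into readable outcomes."""
    return [OUTCOMES[float(p)] for p in predictions]


predictions = [1.0, 0.0]  # the values printed by the cell above
for i, outcome in enumerate(describe_predictions(predictions), start=1):
    print(f"Test Candidate {i}: {outcome}")
```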


**Question 2**: The ask in this case is to build a Logistic Regression model to determine whether a body of text is "Spam" or "Ham". You will use the "SMSSpamCollection" file, which contains both spam and ham messages. You will need to create a feature vector from the text data and then train a Logistic Regression model on the entire set of messages (both spam and ham). Once you have trained the model, you will test it with 2 messages (one spam message and one ham message) to see how the model categorizes each of them (1 indicates spam and 0 indicates ham).

**Test Message 1 (Spam)**:

"Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"


**Test Message 2 (Ham)**:

"I've been searching for the right words to thank you for this breather"

**Dataset**: https://www.dropbox.com/s/z5zm0fxevqvujee/SMSSpamCollection.tsv?raw=1 - Download the file, save it to a local folder, and then use the `textFile` method of `SparkContext` to read it in.
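
To make the "feature vector from text data" step more concrete, here is a small pure-Python sketch of the hashing trick that a hashed term-frequency transform relies on: each token is hashed to an index in a fixed-size vector and the counts are accumulated. This is an illustration only. It assumes CRC32 as the hash function, so the indices will not match those produced by Spark's `HashingTF`; the function names are hypothetical.

```python
import zlib
from collections import Counter

NUM_FEATURES = 1 << 20  # 1048576 buckets


def feature_index(token, num_features=NUM_FEATURES):
    # Assumption: CRC32 stands in for the hash function used by a real library.
    return zlib.crc32(token.encode("utf-8")) % num_features


def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Return a sparse {index: count} term-frequency vector."""
    counts = Counter(feature_index(t, num_features) for t in tokens)
    return dict(counts)


message = "I've been searching for the right words to thank you for this breather"
tokens = message.split()
vector = hashing_tf(tokens)
print(len(tokens), sum(vector.values()))  # 13 13
```

The counts always sum to the number of tokens, and a repeated word such as "for" ends up with a count of at least 2 at its index. Distinct words can collide in the same bucket, which is the usual trade-off of the hashing trick.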


In [0]:
from pyspark.mllib.feature import HashingTF, IDF
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

In [21]:
# NOTE: the URL below is a resolved Dropbox redirect and is likely temporary;
# if it has expired, download the file from the dataset link given above instead.
sc.addFile('https://uc7ff37af3d623228ac1e3441384.dl.dropboxusercontent.com/cd/0/inline/AJWVbSHDtM1eOSh4hkz0ovY9J81DJozzzjVPrTg7O0uqjhJhrjjIxnm-Liq9IzMlDVbbaXNwUGwm5lnDXY9JCiASulav49bR8pC8d5cUO-SArHcs972RTXPBsuRee54mtkZK_roORzXe9hH2yO0B5z4ivPSfJ4EYJHoQRgQAmI206WkPtg6mUvijHGDZ7w7oAa0/file')

In [22]:
data = sc.textFile(SparkFiles.get('file'))

In [23]:
data = data.map(lambda line: line.split('\t'))

In [24]:
data.take(5)

[['ham',
  "I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times."],
 ['spam',
  "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"],
 ['ham', "Nah I don't think he goes to usf, he lives around here though"],
 ['ham',
  'Even my brother is not like to speak with me. They treat me like aids patent.'],
 ['ham', 'I HAVE A DATE ON SUNDAY WITH WILL!!']]

In [25]:
data = data.map(lambda line: (line[0], line[1].split()))

In [26]:
print(data.first())

('ham', ["I've", 'been', 'searching', 'for', 'the', 'right', 'words', 'to', 'thank', 'you', 'for', 'this', 'breather.', 'I', 'promise', 'i', 'wont', 'take', 'your', 'help', 'for', 'granted', 'and', 'will', 'fulfil', 'my', 'promise.', 'You', 'have', 'been', 'wonderful', 'and', 'a', 'blessing', 'at', 'all', 'times.'])


In [27]:
labels = data.map(lambda x: x[0])
documents = data.map(lambda x: x[1])

In [28]:
ham_spam = {'ham': 0, 'spam': 1}
labels = labels.map(lambda x: ham_spam[x])

In [29]:
hashingTF = HashingTF()
tf = hashingTF.transform(documents)

In [30]:
dataset = labels.zip(tf)

In [31]:
dataset.first()

(0,
 SparseVector(1048576, {1475: 1.0, 70882: 1.0, 151357: 2.0, 154253: 1.0, 163495: 1.0, 173174: 1.0, 231791: 1.0, 235395: 1.0, 238153: 1.0, 241476: 1.0, 250929: 1.0, 270412: 1.0, 276491: 3.0, 463522: 1.0, 479025: 1.0, 486014: 1.0, 488866: 1.0, 494808: 1.0, 550685: 1.0, 578619: 2.0, 622323: 1.0, 648331: 1.0, 702216: 1.0, 706364: 1.0, 724221: 1.0, 789438: 1.0, 837499: 1.0, 910746: 1.0, 935701: 1.0, 990085: 1.0, 1000347: 1.0, 1016101: 1.0, 1031802: 1.0}))

In [32]:
dataset = dataset.map(lambda x: LabeledPoint(x[0], x[1]))

In [47]:
dataset.first()

LabeledPoint(0.0, (1048576,[1475,70882,151357,154253,163495,173174,231791,235395,238153,241476,250929,270412,276491,463522,479025,486014,488866,494808,550685,578619,622323,648331,702216,706364,724221,789438,837499,910746,935701,990085,1000347,1016101,1031802],[1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]))

In [33]:
model = LogisticRegressionWithLBFGS.train(dataset)

In [48]:
def process(text):
    # Tokenize on whitespace and hash into the same feature space used for training
    return hashingTF.transform(text.split())

In [50]:
test1 = ["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"]
test1 = sc.parallelize(test1)
test1 = test1.map(process)

In [51]:
model.predict(test1).collect()

[1]

In [52]:
test2 = ["I've been searching for the right words to thank you for this breather"]
test2 = sc.parallelize(test2)
test2 = test2.map(process)


In [53]:
model.predict(test2).collect()

[0]
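
For intuition about how a logistic regression model arrives at a 0 or 1, the sketch below scores a sparse feature vector with a weighted sum, passes it through the sigmoid, and thresholds the resulting probability at 0.5. The weights and feature indices are made up for illustration; a trained model learns its own, and `predict_label` is a hypothetical helper rather than a Spark function.

```python
import math


def predict_label(features, weights, intercept=0.0, threshold=0.5):
    """Score a sparse {index: value} vector and threshold the probability."""
    margin = intercept + sum(weights.get(i, 0.0) * v for i, v in features.items())
    probability = 1.0 / (1.0 + math.exp(-margin))
    label = 1 if probability > threshold else 0
    return label, probability


# Made-up weights for illustration only.
toy_weights = {3: 2.0, 7: -1.5}
spam_like = {3: 2.0}  # feature 3 pushes the score toward spam (1)
ham_like = {7: 1.0}   # feature 7 pushes the score toward ham (0)

print(predict_label(spam_like, toy_weights))
print(predict_label(ham_like, toy_weights))
```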

**Alternate Way:**

In [12]:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.classification import LogisticRegressionWithSGD

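# The URL is quoted so the shell does not treat '?' as a glob character.
# Note: the file is tab-separated even though it is saved as spam.csv.
!wget -nc "https://www.dropbox.com/s/z5zm0fxevqvujee/SMSSpamCollection.tsv?raw=1" -O spam.csv
raw_data = sc.textFile('spam.csv')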

tf = HashingTF()

spam = raw_data.filter(lambda row: row.startswith('spam'))
ham = raw_data.filter(lambda row: row.startswith('ham'))

def process_text(text):
  columns = text.split('\t')
  label = 1 if columns[0] == 'spam' else 0
  body = columns[1]
  tokens = body.split(' ')  # TODO: try different tokenization
  features = tf.transform(tokens)
  return LabeledPoint(label, features)

spam_data = spam.map(process_text)
ham_data = ham.map(process_text)

# Maybe do some other investigation of spam/ham separately

train_data = spam_data.union(ham_data)
train_data.cache()  # Cache because logistic regression training is iterative

spam_model = LogisticRegressionWithSGD.train(train_data)

spam_test = tf.transform("Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's".split(" "))
ham_test = tf.transform("I've been searching for the right words to thank you for this breather".split(" "))

print("Prediction for the spam test example: %g" %
      spam_model.predict(spam_test))

print("Prediction for the ham test example: %g" %
      spam_model.predict(ham_test))

File `spam.csv' already there; not retrieving.
Prediction for the spam test example: 0
Prediction for the ham test example: 0

Note that in this run the SGD-trained model labels the spam test example as 0 (ham), so it misclassifies it, whereas the L-BFGS model above predicted 1 for the same message.
