# Spark MLlib Example: Spam Email Classification


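1. Spam classification here relies on two classes from Spark MLlib:

1) HashingTF: builds term frequency (TF) feature vectors from text data.

2) LogisticRegressionWithLBFGS: trains a logistic regression model with the L-BFGS optimizer. LogisticRegressionWithSGD, which uses stochastic gradient descent, appears in the code only as a commented-out alternative.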
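
To make the feature hashing idea behind HashingTF concrete, here is a minimal pure-Python sketch. The function name `hashing_tf` and the choice of `zlib.crc32` as the hash are assumptions made for illustration; Spark's own hash function and vector type may differ, so the indices produced here are not expected to match Spark's.

```python
import zlib
from collections import Counter


def hashing_tf(tokens, num_features=1000):
    """Map tokens to a sparse term-frequency vector {bucket_index: count}.

    Hypothetical helper for illustration only, not part of Spark.
    """
    counts = Counter()
    for token in tokens:
        # crc32 is a stand-in deterministic hash (an assumption, not Spark's)
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(counts)


# Each word lands in one of num_features buckets; repeated words add up.
vec = hashing_tf("cheap money cheap".split(" "), num_features=1000)
print(vec)
```

Because the vector length is fixed in advance, no vocabulary has to be built, at the cost that two different words may occasionally share a bucket.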


2. Training data

Spam examples: spam.txt

Non-spam examples: normal.txt

In [None]:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.feature import HashingTF
#from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

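# Upload the files to the corresponding HDFS directory first.
# `sc` is assumed to be the SparkContext provided by the PySpark shell or notebook.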
spam = sc.textFile("spam.txt")
normal = sc.textFile("normal.txt")

# First, turn the raw text into feature vectors using feature hashing (TF stands for "term frequency")
tf = HashingTF(numFeatures=1000)

# Each email is split into words, and each word is mapped to one feature
spamFeatures = spam.map(lambda email: tf.transform(email.split(" ")))
normalFeatures = normal.map(lambda email: tf.transform(email.split(" ")))


# Now label each feature vector with its class (1 = spam, 0 = normal), using MLlib's LabeledPoint type
positives = spamFeatures.map(lambda features: LabeledPoint(1, features))
negatives = normalFeatures.map(lambda features: LabeledPoint(0, features))

trainingData = positives.union(negatives)
trainingData.cache()  # Cache since Logistic Regression is an iterative algorithm


# Run Logistic Regression using the L-BFGS optimizer (the SGD variant is left commented out)
#model = LogisticRegressionWithSGD.train(trainingData)
model = LogisticRegressionWithLBFGS.train(trainingData)




In [None]:
type(model)

In [None]:
# Test on a positive example (spam) and a negative one (normal).
# We first apply the same HashingTF feature transformation to get vectors, then apply the model.
posTest = tf.transform("O M G GET cheap stuff by sending money to ...".split(" "))
negTest = tf.transform("Dear Dad, Spark is great!  Love, Pat".split(" "))

print("Prediction for posTest: {0}".format(model.predict(posTest)))
print("Prediction for negTest: {0}".format(model.predict(negTest)))
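
As a rough illustration of what a logistic regression model does when it predicts, the sketch below applies a sigmoid to a weighted sum of the features and thresholds the result at 0.5. The weights, intercept, and feature vectors are made-up values, and `predict` is a hypothetical stand-in, not the MLlib method itself.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, intercept, features, threshold=0.5):
    """Return 1 (spam) if the predicted probability exceeds the threshold, else 0.

    `weights` and `features` are sparse dicts {bucket_index: value}.
    """
    margin = intercept + sum(weights.get(i, 0.0) * v for i, v in features.items())
    return 1 if sigmoid(margin) > threshold else 0


# Hypothetical model: bucket 3 pushes toward spam, bucket 7 toward normal.
weights = {3: 2.0, 7: -1.5}
intercept = -0.5

spam_like = {3: 2.0}    # term-frequency vector with two hits in bucket 3
normal_like = {7: 1.0}  # term-frequency vector with one hit in bucket 7

print(predict(weights, intercept, spam_like))
print(predict(weights, intercept, normal_like))
```

Training is then the search for weights and an intercept that make these predictions agree with the labeled examples as often as possible.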

In [None]:
# Try the model on a few more sentences, several of which mention "money".
posTest1Example = tf.transform("I really wish well to all my friends.".split(" "))
posTest2Example = tf.transform("He stretched into his pocket for some money.".split(" "))
posTest3Example = tf.transform("He entrusted his money to me.".split(" "))
posTest4Example = tf.transform("Where do you keep your money?".split(" "))
posTest5Example = tf.transform("She borrowed some money of me.".split(" "))

print("Prediction for 1: {0}".format(model.predict(posTest1Example)))
print("Prediction for 2: {0}".format(model.predict(posTest2Example)))
print("Prediction for 3: {0}".format(model.predict(posTest3Example)))
print("Prediction for 4: {0}".format(model.predict(posTest4Example)))
print("Prediction for 5: {0}".format(model.predict(posTest5Example)))