### TF-IDF
Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in a corpus. Denote a term by t, a document by d, and the corpus by D. Term frequency TF(t,d) is the number of times that term t appears in document d, while document frequency DF(t,D) is the number of documents that contain term t. If we use only term frequency to measure importance, it is easy to over-emphasize terms that appear very often but carry little information about the document, e.g. “a”, “the”, and “of”. A term that appears in most documents of the corpus carries little information that distinguishes any particular document. Inverse document frequency (IDF) is a numerical measure of how much information a term provides: it is high for terms found in few documents and low for terms found in many. TF-IDF is the product of TF and IDF.
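
To make the definitions concrete, here is a minimal plain-Python sketch that does not use Spark. It assumes the smoothed form given in Spark's MLlib documentation, IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)), and TF-IDF(t, d, D) = TF(t, d) * IDF(t, D). The toy corpus and the helper names are hypothetical and only serve the illustration.

```python
import math
from collections import Counter

# Toy corpus, already tokenized and lower-cased (hypothetical data).
corpus = [
    ["i", "heard", "today", "is", "bacon", "day"],
    ["i", "want", "bacon", "on", "everything"],
    ["spark", "is", "like", "bacon", "on", "pizza"],
]

def term_frequency(doc):
    """TF(t, d): how many times each term occurs in one document."""
    return Counter(doc)

def document_frequency(docs):
    """DF(t, D): in how many documents each term occurs at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # set() so a term counts once per document
    return df

def inverse_document_frequency(docs):
    """Smoothed IDF: log((|D| + 1) / (DF(t, D) + 1))."""
    n_docs = len(docs)
    df = document_frequency(docs)
    return {term: math.log((n_docs + 1) / (count + 1)) for term, count in df.items()}

idf = inverse_document_frequency(corpus)

# TF-IDF for every document: term frequency scaled by the corpus-wide IDF.
tfidf = [
    {term: tf * idf[term] for term, tf in term_frequency(doc).items()}
    for doc in corpus
]
```

Because "bacon" appears in all three documents, its IDF is log(4/4) = 0 and it drops out of every TF-IDF vector, while a term that occurs in a single document, such as "pizza", receives the largest weight.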

In Spark you can use the Tokenizer, HashingTF, and IDF classes from pyspark.ml.feature to compute TF-IDF features:

In [13]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

# Create a DataFrame with a label column and a sentence column
sentenceData = spark.createDataFrame([
    (0.0, "Hi I heard today is National Bacon Day!"),
    (0.0, "I want bacon on everything"),
    (1.0, "Spark is like bacon on pizza!")
], ["label", "sentence"])



In [14]:
sentenceData.show() 

+-----+--------------------+
|label|            sentence|
+-----+--------------------+
|  0.0|Hi I heard today ...|
|  0.0|I want bacon on e...|
|  1.0|Spark is like bac...|
+-----+--------------------+



In [15]:
# Next, use the Tokenizer to split each sentence into lower-cased words
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)
wordsData.show()

+-----+--------------------+--------------------+
|label|            sentence|               words|
+-----+--------------------+--------------------+
|  0.0|Hi I heard today ...|[hi, i, heard, to...|
|  0.0|I want bacon on e...|[i, want, bacon, ...|
|  1.0|Spark is like bac...|[spark, is, like,...|
+-----+--------------------+--------------------+



In [16]:
# Next, HashingTF hashes each word list into a fixed-length (numFeatures=20) term-frequency vector
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)
featurizedData.show()

+-----+--------------------+--------------------+--------------------+
|label|            sentence|               words|         rawFeatures|
+-----+--------------------+--------------------+--------------------+
|  0.0|Hi I heard today ...|[hi, i, heard, to...|(20,[0,1,2,5,9,10...|
|  0.0|I want bacon on e...|[i, want, bacon, ...|(20,[2,5,9,12,16]...|
|  1.0|Spark is like bac...|[spark, is, like,...|(20,[1,2,5,10,13,...|
+-----+--------------------+--------------------+--------------------+
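
The idea behind HashingTF (the "hashing trick") can be sketched in plain Python: each word is hashed to an index in the range 0 to numFeatures - 1, and the vector stores how many words landed on each index. This is only an illustration under stated assumptions. The function `hashing_tf` is hypothetical, and `zlib.crc32` is a stand-in hash. Spark's implementation uses a different hash function (MurmurHash3), so the indices produced here will not match the table above.

```python
import zlib
from collections import Counter

def hashing_tf(words, num_features=20):
    """Return (size, indices, values), mirroring the sparse-vector layout
    shown in the rawFeatures column. Hypothetical helper, not Spark's code."""
    counts = Counter()
    for word in words:
        # Stand-in hash: CRC32 of the UTF-8 bytes, reduced modulo the vector size.
        index = zlib.crc32(word.encode("utf-8")) % num_features
        counts[index] += 1
    indices = sorted(counts)
    values = [float(counts[i]) for i in indices]
    return num_features, indices, values

size, indices, values = hashing_tf(["i", "want", "bacon", "on", "everything"])
```

Two different words can hash to the same index (a collision), in which case their counts are added together. With a vector as small as 20 entries, collisions are likely even for a short vocabulary, so a much larger numFeatures is the usual choice for real data.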



In [5]:
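# IDF is an Estimator: fit() computes document frequencies over the corpus,
# and the resulting IDFModel rescales the term-frequency vectors.
# (Alternatively, CountVectorizer can be used instead of HashingTF to get term frequency vectors.)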
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)

rescaledData.select("label", "features").show()

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(20,[0,5,9,17],[0...|
|  0.0|(20,[2,7,9,13,15]...|
|  1.0|(20,[4,6,13,15,18...|
+-----+--------------------+
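
The comment in the cell above mentions CountVectorizer as an alternative source of term-frequency vectors. Conceptually, it first learns an explicit vocabulary from the corpus and then counts words against it, so no two words share an index. The plain-Python sketch below illustrates that idea only. The helpers `fit_vocabulary` and `count_vector` are hypothetical, and the alphabetical tie-breaking is an assumption made here to keep the result deterministic.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Assign an index to every distinct word, most frequent word first."""
    totals = Counter(word for doc in docs for word in doc)
    # Sort by descending corpus count, then alphabetically to break ties.
    ordered = sorted(totals, key=lambda word: (-totals[word], word))
    return {word: index for index, word in enumerate(ordered)}

def count_vector(doc, vocab):
    """Dense term-frequency vector for one document; unknown words are ignored."""
    vector = [0] * len(vocab)
    for word in doc:
        if word in vocab:
            vector[vocab[word]] += 1
    return vector

docs = [["i", "want", "bacon"], ["bacon", "on", "pizza"]]
vocab = fit_vocabulary(docs)
vector = count_vector(["bacon", "bacon", "pizza", "unknown"], vocab)
```

The trade-off is that the vocabulary has to be built in an extra pass over the data and kept in memory, whereas the hashing approach needs no fitting step but cannot map an index back to a word.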



## Stop Words Remover
Stop words are words that should be excluded from the input, typically because they appear frequently and carry little meaning.

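StopWordsRemover takes as input a sequence of strings (e.g. the output of a Tokenizer) and drops all the stop words from the input sequences. The list of stop words is specified by the stopWords parameter. When stopWords is not set, as in the example below, a default English list is used, and matching is case-insensitive unless caseSensitive is set to True.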

In [8]:
from pyspark.ml.feature import StopWordsRemover

sentenceData = spark.createDataFrame([
    (0, ["I", "saw", "the", "red", "balloon"]),
    (1, ["Mary", "had", "a", "little", "lamb"])
], ["id", "raw"])

remover = StopWordsRemover(inputCol="raw", outputCol="filtered")
remover.transform(sentenceData).show(truncate=False)


+---+----------------------------+--------------------+
|id |raw                         |filtered            |
+---+----------------------------+--------------------+
|0  |[I, saw, the, red, balloon] |[saw, red, balloon] |
|1  |[Mary, had, a, little, lamb]|[Mary, little, lamb]|
+---+----------------------------+--------------------+
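
The filtering shown in the table above can be sketched in plain Python. This is a minimal illustration, not Spark's implementation: the function `remove_stop_words` and the four-word stop list are hypothetical, whereas Spark's default English list is much longer.

```python
# Tiny, hypothetical stop-word list chosen to cover the example sentences.
STOP_WORDS = {"i", "the", "had", "a"}

def remove_stop_words(words, stop_words=STOP_WORDS, case_sensitive=False):
    """Return the words that are not stop words, preserving order and casing."""
    if case_sensitive:
        return [w for w in words if w not in stop_words]
    lowered = {s.lower() for s in stop_words}
    return [w for w in words if w.lower() not in lowered]

filtered_0 = remove_stop_words(["I", "saw", "the", "red", "balloon"])
filtered_1 = remove_stop_words(["Mary", "had", "a", "little", "lamb"])
```

With case-insensitive matching the capitalized "I" is dropped because it matches the lower-case entry "i". Passing `case_sensitive=True` to this sketch would keep it, since only the exact string "i" is on the list.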

