
Create an NLP pipeline using PySpark (more details)
1. Tokenizer
2. StopWordsRemover
3. HashingTF & IDF (TF-IDF)

In [0]:
# Install Java, Spark, and Findspark
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://www-us.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz
!pip install -q findspark

# Set Environment Variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"

# Locate the Spark installation so that pyspark can be imported
import findspark
findspark.init()

# Start a SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('NPL_Pipeline').getOrCreate()

In [0]:
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF

In [0]:
# Read in data from AWS S3 buckets
from pyspark import SparkFiles
url = "https://s3.amazonaws.com/dataviz-curriculum/day_2/airlines.csv"

spark.sparkContext.addFile(url)
df = spark.read.csv(SparkFiles.get("airlines.csv"), sep=",", header=True)

df.show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------+
|Airline Tweets                                                                                                                         |
+---------------------------------------------------------------------------------------------------------------------------------------+
|@VirginAmerica plus you've added commercials to the experience... tacky.                                                               |
|@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing. it's really the only bad thing about flying VA|
|@VirginAmerica do you miss me? Don't worry we'll be together very soon.                                                                |
|@VirginAmerica Are the hours of operation for the Club at SFO that are posted online current?                                          |
|@VirginAmerica awaiting my return

In [0]:
# Tokenize the tweets: lowercase each one and split it into words
tokened = Tokenizer(inputCol='Airline Tweets', outputCol='tokened_words')
tokened_df = tokened.transform(df)
tokened_df.show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Airline Tweets                                                                                                                         |tokened_words                                                                                                                                                  |
+---------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
|@VirginAmerica plus you've added commercials to the experience... tacky.                                 
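To make the tokenizer step concrete, here is a minimal pure-Python sketch of what `Tokenizer` appears to do, judging from the output in this notebook: lowercase the text and split it on whitespace. The helper name `tokenize` is hypothetical, and this is an approximation rather than Spark's implementation; edge cases such as repeated spaces may behave differently.

```python
def tokenize(text):
    """Lowercase the text and split it on whitespace (punctuation is kept)."""
    return text.lower().split()


tweet = "@VirginAmerica plus you've added commercials to the experience... tacky."
print(tokenize(tweet))
# ['@virginamerica', 'plus', "you've", 'added', 'commercials', 'to', 'the', 'experience...', 'tacky.']
```

Note that punctuation stays attached to the words (`experience...`, `tacky.`), which is why those tokens survive unchanged into the later steps.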

In [0]:
# StopWordsRemover
remover = StopWordsRemover(inputCol='tokened_words', outputCol='stopWords_filtered')
removered_df = remover.transform(tokened_df)
removered_df.select('tokened_words', 'stopWords_filtered').show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------+
|tokened_words                                                                                                                                                  |stopWords_filtered                                                                             |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------+
|[@virginamerica, plus, you've, added, commercials, to, the, experience..., tacky.]                                                                             |[@virginamerica, plus, added, commercials, experience..., tacky.]
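The filtering step can be sketched in plain Python as a list comprehension over the tokens. The `STOP_WORDS` set below is a tiny illustrative subset chosen for this example, not Spark's default English list (which is much longer), and `remove_stop_words` is a hypothetical helper name.

```python
# Illustrative subset only; Spark's default English stop word list is much longer.
STOP_WORDS = {"you've", "to", "the", "a", "and", "of"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop word set (case-insensitive)."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = ["@virginamerica", "plus", "you've", "added", "commercials",
          "to", "the", "experience...", "tacky."]
print(remove_stop_words(tokens))
# ['@virginamerica', 'plus', 'added', 'commercials', 'experience...', 'tacky.']
```

The match is on whole tokens, so a stop word with punctuation attached (for example `the.`) would not be removed by this sketch.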

In [0]:
# HashingTF (term frequency)
hashingTF = HashingTF(inputCol="stopWords_filtered", outputCol="hashedValues", numFeatures=pow(2,4))
hashed_df = hashingTF.transform(removered_df)
hashed_df.select('stopWords_filtered','hashedValues').show(truncate = False)

+-----------------------------------------------------------------------------------------------+----------------------------------------------------------------+
|stopWords_filtered                                                                             |hashedValues                                                    |
+-----------------------------------------------------------------------------------------------+----------------------------------------------------------------+
|[@virginamerica, plus, added, commercials, experience..., tacky.]                              |(16,[1,3,5,7,8,12],[1.0,1.0,1.0,1.0,1.0,1.0])                   |
|[@virginamerica, seriously, pay, $30, flight, seats, playing., really, bad, thing, flying, va] |(16,[0,1,8,9,11,13,14],[1.0,1.0,1.0,1.0,2.0,2.0,4.0])           |
|[@virginamerica, miss, me?, worry, together, soon.]                                            |(16,[1,8,10,12,15],[1.0,1.0,1.0,1.0,2.0])                       |
|[@virginamerica, hour
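Each `hashedValues` entry is a sparse vector written as (size, indices, counts). With only 16 buckets, different words can land in the same bucket: the second row above has 12 tokens but only 7 distinct indices, and its counts sum to 12. The sketch below shows the idea in plain Python. It uses `zlib.crc32` as a stand-in hash, which is an assumption made for illustration; Spark uses a different hash function, so the indices here will not match the output above. `hashing_tf` is a hypothetical helper name.

```python
import zlib
from collections import Counter


def hashing_tf(tokens, num_features=16):
    """Map each token to a bucket with a hash, then count tokens per bucket.

    Returns (size, indices, values), mirroring how a sparse vector is printed.
    """
    # crc32 is a stand-in hash for illustration; it is not the hash Spark uses.
    counts = Counter(zlib.crc32(token.encode("utf-8")) % num_features
                     for token in tokens)
    indices = sorted(counts)
    values = [float(counts[i]) for i in indices]
    return num_features, indices, values


tokens = ["@virginamerica", "seriously", "pay", "$30", "flight", "seats",
          "playing.", "really", "bad", "thing", "flying", "va"]
size, indices, values = hashing_tf(tokens)
print(size, indices, values)
print(sum(values) == len(tokens))  # every token is counted exactly once
```

Raising `numFeatures` makes collisions less likely at the cost of longer vectors.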

In [0]:
# IDF (inverse document frequency) is an estimator: fit() it on the data first, then transform()
idf = IDF(inputCol='hashedValues',outputCol='features')
idfModel = idf.fit(hashed_df)
rescaledData = idfModel.transform(hashed_df)
rescaledData.select('hashedValues','features').show(truncate=False)

+----------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|hashedValues                                                    |features                                                                                                                                                                                |
+----------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|(16,[1,3,5,7,8,12],[1.0,1.0,1.0,1.0,1.0,1.0])                   |(16,[1,3,5,7,8,12],[0.0,1.0986122886681098,1.0986122886681098,0.6931471805599453,0.1823215567939546,0.6931471805599453])                                                          
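The `features` values can be checked by hand. Spark's MLlib documentation gives the smoothed formula idf = log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents that contain the term (here, the hash bucket). The values in the first row are consistent with, for example, m = 5 and document frequencies of 5, 1, 1, 2, 4, 2 for buckets 1, 3, 5, 7, 8, 12. Those numbers are inferred from the output for illustration, not read from the data, and the helper names `idf` and `tf_idf` are hypothetical.

```python
import math


def idf(num_docs, doc_freq):
    """Smoothed inverse document frequency: log((m + 1) / (df + 1))."""
    return math.log((num_docs + 1) / (doc_freq + 1))


def tf_idf(tf_values, doc_freqs, num_docs):
    """Scale each term frequency by the IDF of its bucket."""
    return [tf * idf(num_docs, df) for tf, df in zip(tf_values, doc_freqs)]


# Assumed for illustration: 5 documents, with per-bucket document frequencies
# inferred from the first output row (buckets 1, 3, 5, 7, 8, 12).
term_freqs = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
doc_freqs = [5, 1, 1, 2, 4, 2]
for weight in tf_idf(term_freqs, doc_freqs, num_docs=5):
    print(round(weight, 4))
```

A bucket that appears in every document gets a weight of 0, which is why index 1 (the bucket holding `@virginamerica`, present in every tweet shown) contributes nothing to `features`.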