### Steps in NLP

1. Reading the corpus
2. Tokenization
3. Cleaning/Stopword removal
4. Stemming
5. Converting into numerical form

In [1]:
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext('local', 'NLP with PySpark')
spark = SparkSession(sc)

In [2]:
! ls ../datasets/nlp_with_pyspark/

axisdata.csv		       Log_Reg_dataset.csv  Movie_reviews.csv
Linear_regression_dataset.csv  movie_reviews.csv    sample_data.csv


In [3]:
df_text = spark.read.csv('../datasets/nlp_with_pyspark/movie_reviews.csv',
                        inferSchema=True, header=True, sep=',')

In [4]:
df_text.printSchema()

root
 |-- Review: string (nullable = true)
 |-- Sentiment: string (nullable = true)



In [5]:
df_text.count()

7087

Some records might not be labelled properly, so we filter them out<br>
and keep only the ones whose Sentiment is 0 or 1.

In [6]:
df_text=df_text.filter((df_text.Sentiment == '1')|(df_text.Sentiment == '0'))

In [7]:
df_text.count()

6990

In [8]:
df_text.groupBy('Sentiment').count().show()

+---------+-----+
|Sentiment|count|
+---------+-----+
|        0| 3081|
|        1| 3909|
+---------+-----+



In [9]:
from pyspark.sql.functions import rand

In [10]:
df_text.orderBy(rand()).show(10, False)

+------------------------------------------------------------------------+---------+
|Review                                                                  |Sentiment|
+------------------------------------------------------------------------+---------+
|The Da Vinci Code sucked big time.                                      |0        |
|What I'd like to see is some Mission Impossible stuff, maybe throw them |1        |
|Harry Potter = Gorgeous!.                                               |1        |
|I OFFICIALLY * HATE * BROKEBACK MOUNTAIN!!!!!!!!!!!                     |0        |
|i love being a sentry for mission impossible and a station for bonkers. |1        |
|The Da Vinci Code sucked big time.                                      |0        |
|I love Brokeback Mountain.                                              |1        |
|Mission Impossible 3 was excellent.                                     |1        |
|I am going to start reading the Harry Potter series again becaus

##### Now we will create a numeric (float) Label column for the sentiment in place of the string column

In [11]:
df_text=df_text.withColumn('Label', df_text.Sentiment.cast('float')).drop('Sentiment')

In [12]:
df_text.orderBy(rand()).show(10, False)

+------------------------------------------------------------------------+-----+
|Review                                                                  |Label|
+------------------------------------------------------------------------+-----+
|I love all the Mission Impossible movies, and this one is no exception. |1.0  |
|I hate Harry Potter, that daniel wotshisface needs a fucking slap...    |0.0  |
|I wanted desperately to love'The Da Vinci Code as a film.               |1.0  |
|"Anyway, thats why I love "" Brokeback Mountain."                       |1.0  |
|Because I would like to make friends who like the same things I like, an|1.0  |
|Harry Potter dragged Draco Malfoy ’ s trousers down past his hips and   |0.0  |
|These Harry Potter movies really suck.                                  |0.0  |
|Oh, and Brokeback Mountain is a TERRIBLE movie...                       |0.0  |
|dudeee i LOVED brokeback mountain!!!!                                   |1.0  |
|I wanted desperately to lov

#### Adding another column with the character length of each review

In [13]:
from pyspark.sql.functions import length

In [14]:
df_text=df_text.withColumn('length', length(df_text['Review']))

In [15]:
df_text.orderBy(rand()).show(10, False)

+------------------------------------------------------------------------+-----+------+
|Review                                                                  |Label|length|
+------------------------------------------------------------------------+-----+------+
|brokeback mountain is such a beautiful movie.                           |1.0  |45    |
|This quiz sucks and Harry Potter sucks ok bye..                         |0.0  |47    |
|Is it just me, or does Harry Potter suck?...                            |0.0  |44    |
|And Da Vinci Code is awesome.                                           |1.0  |29    |
|I am going to start reading the Harry Potter series again because that i|1.0  |72    |
|The Da Vinci Code is awesome!!                                          |1.0  |30    |
|"I liked the first "" Mission Impossible."                              |1.0  |42    |
|I hate Harry Potter, it's retarted, gay and stupid and there's only one |0.0  |71    |
|I love Brokeback Mountain....  

In [16]:
df_text.groupBy('Label').agg({'Length': 'mean'}).show()

+-----+-----------------+
|Label|      avg(Length)|
+-----+-----------------+
|  1.0|47.61882834484523|
|  0.0|50.95845504706264|
+-----+-----------------+



We don't see any major difference between the average lengths of the positive and negative reviews (about 47.6 vs. 51.0 characters). Now we will tokenize the reviews and remove stop words.

In [17]:
from pyspark.ml.feature import Tokenizer, StopWordsRemover

In [18]:
tokenization = Tokenizer(inputCol='Review', outputCol='tokens')

In [19]:
df_tokenized = tokenization.transform(df_text)

In [20]:
stopword_removal = StopWordsRemover(inputCol='tokens',
                                  outputCol='refined_tokens')

In [21]:
df_refined_text = stopword_removal.transform(df_tokenized)
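To make the two transformations above concrete, here is a minimal plain-Python sketch of what they do to a single review. It assumes the documented behaviour of `Tokenizer` (lowercase, then split on whitespace); `STOP_WORDS` is a tiny hypothetical list for illustration, not Spark's default English stop-word list.

```python
# Plain-Python sketch of tokenization followed by stop-word removal.
# STOP_WORDS is a small hypothetical set, not Spark's default list.
STOP_WORDS = {"i", "a", "the", "is", "such", "and", "was"}

def tokenize(text):
    # Lowercase, then split on whitespace; punctuation stays attached.
    return text.lower().split()

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Keep only the tokens that are not in the stop-word set.
    return [t for t in tokens if t not in stop_words]

review = "I love Brokeback Mountain."
tokens = tokenize(review)
refined_tokens = remove_stop_words(tokens)
print(tokens)
print(refined_tokens)
```

Note that the trailing period stays attached to the last token, because neither step strips punctuation.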

Now we calculate the number of tokens in each row, since we are dealing with tokens instead of raw text.

In [22]:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

In [23]:
len_udf = udf(lambda s: len(s), IntegerType())

In [24]:
df_refined_text=df_refined_text.withColumn("token_count",
                                          len_udf(col('refined_tokens')))

In [25]:
df_refined_text.orderBy(rand()).show(10)

+--------------------+-----+------+--------------------+--------------------+-----------+
|              Review|Label|length|              tokens|      refined_tokens|token_count|
+--------------------+-----+------+--------------------+--------------------+-----------+
|I hated Brokeback...|  0.0|    27|[i, hated, brokeb...|[hated, brokeback...|          3|
|I either LOVE Bro...|  1.0|    71|[i, either, love,...|[either, love, br...|          7|
|friday hung out w...|  0.0|    72|[friday, hung, ou...|[friday, hung, ke...|          9|
|dudeee i LOVED br...|  1.0|    37|[dudeee, i, loved...|[dudeee, loved, b...|          4|
|I hate Harry Potter.|  0.0|    20|[i, hate, harry, ...|[hate, harry, pot...|          3|
|I love Brokeback ...|  1.0|    29|[i, love, brokeba...|[love, brokeback,...|          3|
|Da Vinci code is ...|  1.0|    25|[da, vinci, code,...|[da, vinci, code,...|          4|
|Harry Potter drag...|  0.0|    69|[harry, potter, d...|[harry, potter, d...|          9|
|the last 

##### Now that we have refined tokens after stopword removal, we can use any of the above approaches to convert the text into numerical features. In this case, we use a CountVectorizer to build the feature vectors for the machine learning model

In [26]:
from pyspark.ml.feature import CountVectorizer

In [27]:
count_vec = CountVectorizer(inputCol='refined_tokens',
                            outputCol='features')

In [28]:
df_cv_text=count_vec.fit(df_refined_text).transform(df_refined_text)

In [29]:
df_cv_text.select(['refined_tokens', 'token_count', 'features', 'Label']).show(10)

+--------------------+-----------+--------------------+-----+
|      refined_tokens|token_count|            features|Label|
+--------------------+-----------+--------------------+-----+
|[da, vinci, code,...|          5|(2302,[0,1,4,43,2...|  1.0|
|[first, clive, cu...|          9|(2302,[11,51,229,...|  1.0|
|[liked, da, vinci...|          5|(2302,[0,1,4,53,3...|  1.0|
|[liked, da, vinci...|          5|(2302,[0,1,4,53,3...|  1.0|
|[liked, da, vinci...|          8|(2302,[0,1,4,53,6...|  1.0|
|[even, exaggerati...|          6|(2302,[46,229,271...|  1.0|
|[loved, da, vinci...|          8|(2302,[0,1,22,30,...|  1.0|
|[thought, da, vin...|          7|(2302,[0,1,4,228,...|  1.0|
|[da, vinci, code,...|          6|(2302,[0,1,4,33,2...|  1.0|
|[thought, da, vin...|          7|(2302,[0,1,4,223,...|  1.0|
+--------------------+-----------+--------------------+-----+
only showing top 10 rows
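The `features` column above is printed in sparse form: vocabulary size, then the indices of the terms present, then their counts. The following plain-Python sketch shows how such a vector could be built. It is not Spark's implementation; the toy `docs` are made up, and the alphabetical tie-break in `fit_vocabulary` is an assumption added only to keep the sketch deterministic.

```python
from collections import Counter

def fit_vocabulary(docs):
    # Order terms by total corpus frequency, most frequent first.
    # Ties are broken alphabetically (an assumption for this sketch).
    totals = Counter(tok for doc in docs for tok in doc)
    ordered = sorted(totals, key=lambda t: (-totals[t], t))
    return {term: idx for idx, term in enumerate(ordered)}

def to_sparse(doc, vocab):
    # Count each known term and return (size, indices, values).
    counts = Counter(vocab[t] for t in doc if t in vocab)
    indices = sorted(counts)
    values = [float(counts[i]) for i in indices]
    return (len(vocab), indices, values)

# Hypothetical toy corpus of already-refined tokens.
docs = [
    ["da", "vinci", "code", "awesome"],
    ["liked", "da", "vinci", "code"],
    ["love", "harry", "potter"],
]
vocab = fit_vocabulary(docs)
sparse_rows = [to_sparse(d, vocab) for d in docs]
for row in sparse_rows:
    print(row)
```

Because frequent terms get the lowest indices, reviews that mention the same popular title share small indices, much like the repeated `0,1,4` in the output above.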



In [30]:
df_model_text=df_cv_text.select(['features', 'token_count', 'Label'])

Now that we have a feature vector for each row, we can use VectorAssembler to combine it with token_count into the input features for the machine learning model.

In [31]:
from pyspark.ml.feature import VectorAssembler

In [32]:
df_assembler = VectorAssembler(inputCols=['features', 'token_count'],
                              outputCol='features_vec')

In [33]:
df_model_text = df_assembler.transform(df_model_text)

In [34]:
df_model_text.printSchema()

root
 |-- features: vector (nullable = true)
 |-- token_count: integer (nullable = true)
 |-- Label: float (nullable = true)
 |-- features_vec: vector (nullable = true)
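Conceptually, the assembler concatenates its input columns into one vector, which is why `features` has 2302 entries while `features_vec` in the later prediction output has 2303. A minimal plain-Python sketch of appending a scalar to a sparse vector follows; `append_scalar` and the example row are hypothetical, for illustration only.

```python
def append_scalar(sparse_vec, value):
    # sparse_vec is (size, indices, values); the scalar becomes the last slot.
    size, indices, values = sparse_vec
    new_indices = list(indices)
    new_values = list(values)
    if value != 0:
        # Only non-zero entries are stored in a sparse vector.
        new_indices.append(size)
        new_values.append(float(value))
    return (size + 1, new_indices, new_values)

# Hypothetical row: three vocabulary terms present, five refined tokens.
features = (2302, [0, 1, 4], [1.0, 1.0, 1.0])
features_vec = append_scalar(features, 5)
print(features_vec)
```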



##### We can use any classification model, but we generally try logistic regression first

In [35]:
from pyspark.ml.classification import LogisticRegression

In [36]:
df_training, df_test = df_model_text.randomSplit([0.75,0.25])
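The split above is random, so the 75/25 proportions are only approximate and the counts change between runs unless a seed is fixed (`randomSplit` accepts an optional `seed` argument). The plain-Python sketch below illustrates the idea with one independent random draw per row. It is a simplified illustration, not Spark's actual sampler, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, weights, seed):
    # Assign each row to a split with one independent uniform draw.
    rng = random.Random(seed)
    total = float(sum(weights))
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)  # cumulative upper bound of each split
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point leftovers.
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(6990))  # same row count as the filtered dataset
train, test = random_split(rows, [0.75, 0.25], seed=42)
print(len(train), len(test))
```

With the same seed the split is reproducible; with a different seed the sizes shift by a few dozen rows but stay close to 75/25.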

In [37]:
df_training.groupBy('Label').count().show()

+-----+-----+
|Label|count|
+-----+-----+
|  1.0| 2964|
|  0.0| 2280|
+-----+-----+



In [38]:
df_test.groupBy('Label').count().show()

+-----+-----+
|Label|count|
+-----+-----+
|  1.0|  945|
|  0.0|  801|
+-----+-----+



In [39]:
log_reg = LogisticRegression(featuresCol='features_vec', labelCol='Label').fit(df_training)

##### Now we evaluate the model on the test set

In [40]:
results = log_reg.evaluate(df_test).predictions

In [41]:
results.show()

+--------------------+-----------+-----+--------------------+--------------------+--------------------+----------+
|            features|token_count|Label|        features_vec|       rawPrediction|         probability|prediction|
+--------------------+-----------+-----+--------------------+--------------------+--------------------+----------+
|(2302,[0,1,4,5,30...|          5|  1.0|(2303,[0,1,4,5,30...|[-22.123993180017...|[2.46417668500255...|       1.0|
|(2302,[0,1,4,10,2...|         11|  0.0|(2303,[0,1,4,10,2...|[17.1895591385806...|[0.99999996574931...|       0.0|
|(2302,[0,1,4,11,2...|         10|  0.0|(2303,[0,1,4,11,2...|[45.3133904036052...|[1.0,2.0923994219...|       0.0|
|(2302,[0,1,4,12,1...|          5|  1.0|(2303,[0,1,4,12,1...|[-28.892175567477...|[2.83326750617783...|       1.0|
|(2302,[0,1,4,12,3...|          5|  1.0|(2303,[0,1,4,12,3...|[-24.555254355521...|[2.16664885491525...|       1.0|
|(2302,[0,1,4,12,3...|          5|  1.0|(2303,[0,1,4,12,3...|[-24.555254355521..
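For a binary logistic regression, the `rawPrediction` column holds the pair `[-margin, margin]`, and `probability` is obtained by passing those values through the sigmoid function. The sketch below checks this against the first row shown above; `binary_probabilities` is a hypothetical helper, and the margin is the truncated value from the output, so the match is approximate.

```python
import math

def binary_probabilities(margin):
    # margin is the linear score for class 1; the class 0 score is -margin.
    p0 = 1.0 / (1.0 + math.exp(margin))   # sigmoid(-margin)
    p1 = 1.0 / (1.0 + math.exp(-margin))  # sigmoid(margin)
    return [p0, p1]

# First row above: rawPrediction[0] is about -22.124, so margin is about 22.124.
probs = binary_probabilities(22.123993180017)
prediction = 1.0 if probs[1] > probs[0] else 0.0
print(probs, prediction)  # probs[0] is roughly 2.46e-10, prediction is 1.0
```

Very large margins in either direction mean the model is almost certain about the class, which is why most probabilities in the table are close to 0 or 1.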

In [42]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [51]:
true_positives = results[(results.Label == 1) & (results.prediction == 1)].count()

In [52]:
true_negatives = results[(results.Label == 0) & (results.prediction == 0)].count()

In [53]:
false_positives = results[(results.Label == 0) & (results.prediction == 1)].count()

In [54]:
false_negatives = results[(results.Label == 1) & (results.prediction == 0)].count()

#### PERFORMANCE OF THE MODEL 

In [55]:
recall = float(true_positives)/(true_positives + false_negatives)
print(recall)

0.9873015873015873


In [56]:
precision = float(true_positives)/(true_positives+false_positives)
print(precision)

0.97288842544317


In [57]:
accuracy = float(true_positives + true_negatives) / results.count()

In [58]:
print(accuracy)

0.9782359679266895
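The three metrics above all derive from the same four confusion-matrix counts, so they can be gathered in one helper. This is a minimal sketch; `classification_metrics` is a hypothetical function, and the counts passed to it are made-up toy numbers, not the results of this model.

```python
def classification_metrics(tp, tn, fp, fn):
    # Guard against empty denominators, e.g. when nothing is predicted positive.
    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "recall": ratio(tp, tp + fn),        # share of actual positives found
        "precision": ratio(tp, tp + fp),     # share of predicted positives that are right
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }

# Toy counts for illustration only.
metrics = classification_metrics(tp=90, tn=80, fp=10, fn=20)
print(metrics)
```

In the notebook, the same function could be called with `true_positives`, `true_negatives`, `false_positives`, and `false_negatives` computed above.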
