Performing NLP text classification on coronavirus tweets

1) Log in to the virtual machine
2) Open LXTerminal
3) Start Hadoop by running ./allstart.sh
4) Copy the required CSV file to the local system through bitwise client
5) Once Hadoop has started, run hadoop fs -put Corona_NLP_train.csv to import the file into HDFS
6) With Hadoop running, run pysparknb to start PySpark
7) In PySpark, open a Jupyter notebook and start the project

In [4]:
from pyspark.sql import SparkSession

In [5]:
spark= SparkSession.builder.appName('nlp').getOrCreate()

In [6]:
#Importing all the necessary libraries
import numpy as np

from pyspark.ml.feature import StringIndexer, OneHotEncoder

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import MinMaxScaler, StandardScaler
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

In [7]:
#Reading the dataset
corona= spark.read.csv('Corona_NLP_train.csv', header=True, inferSchema=True)

In [8]:
corona.head()

Row(UserName='3799', ScreenName='48751', Location='London', TweetAt='16-03-2020', OriginalTweet='@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2Pa and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8', Sentiment='Neutral')

In [9]:
corona.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+---------+
|            UserName|          ScreenName|            Location|             TweetAt|       OriginalTweet|Sentiment|
+--------------------+--------------------+--------------------+--------------------+--------------------+---------+
|                3799|               48751|              London|          16-03-2020|@MeNyrbie @Phil_G...|  Neutral|
|                3800|               48752|                  UK|          16-03-2020|advice Talk to yo...| Positive|
|                3801|               48753|           Vagabonds|          16-03-2020|Coronavirus Austr...| Positive|
|                3802|               48754|                null|          16-03-2020|My food stock is ...|     null|
|              PLEASE|         don't panic| THERE WILL BE EN...|                null|                null|     null|
|           Stay calm|          stay safe.|                null|

In [10]:
corona.select("OriginalTweet", "Sentiment").show(10)

+--------------------+---------+
|       OriginalTweet|Sentiment|
+--------------------+---------+
|@MeNyrbie @Phil_G...|  Neutral|
|advice Talk to yo...| Positive|
|Coronavirus Austr...| Positive|
|My food stock is ...|     null|
|                null|     null|
|                null|     null|
|                null|     null|
|Me, ready to go a...|     null|
|                null|     null|
|                null|     null|
+--------------------+---------+
only showing top 10 rows
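The null rows above, and stray values such as "PLEASE" appearing in the UserName column, suggest that some tweets contain line breaks inside quoted fields, so a line-by-line read splits one record into several rows. This is an inference from the output, not something verified against the file. The sketch below uses Python's standard csv module on a made-up sample (the sample text is hypothetical) to show the effect. In Spark, the CSV reader's multiLine option, together with matching quote and escape settings, is the usual thing to try for such files; it has not been tested on this dataset.

```python
import csv
import io

# Hypothetical sample: the second record has a line break inside a quoted field.
raw = (
    'UserName,OriginalTweet,Sentiment\n'
    '3799,"first tweet",Neutral\n'
    '3800,"PLEASE don\'t panic\nStay calm, stay safe.",Positive\n'
)

# Naive line-based splitting breaks the second record into two fragments.
naive_rows = [line.split(",") for line in raw.splitlines()]

# A quote-aware parser keeps the embedded newline inside one field.
parsed_rows = list(csv.reader(io.StringIO(raw)))

print(len(naive_rows))   # header plus three fragments
print(len(parsed_rows))  # header plus two complete records
print(parsed_rows[2])
```

With the naive split, the fragment starting with "Stay calm" becomes a row of its own, which matches the pattern seen in the table above.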



In [11]:
#Removing unnecessary columns
drop_col= ['ScreenName', 'UserName', 'TweetAt','Location']

In [12]:
corona= corona.select([column for column in corona.columns if column not in drop_col])

In [13]:
corona.show()

+--------------------+---------+
|       OriginalTweet|Sentiment|
+--------------------+---------+
|@MeNyrbie @Phil_G...|  Neutral|
|advice Talk to yo...| Positive|
|Coronavirus Austr...| Positive|
|My food stock is ...|     null|
|                null|     null|
|                null|     null|
|                null|     null|
|Me, ready to go a...|     null|
|                null|     null|
|                null|     null|
|As news of the re...| Positive|
|"Cashier at groce...| Positive|
|Was at the superm...|     null|
|                null|     null|
|Due to COVID-19 o...| Positive|
|For corona preven...| Negative|
|All month there h...|  Neutral|
|Due to the Covid-...|     null|
|                null|     null|
|                null|     null|
+--------------------+---------+
only showing top 20 rows



In [14]:
corona.printSchema()

root
 |-- OriginalTweet: string (nullable = true)
 |-- Sentiment: string (nullable = true)



In [15]:
#Dropping null values from dataset
corona = corona.dropna()
corona.count()

28617

In [16]:
#Checking the type of sentiments
from pyspark.sql.functions import col
corona.groupBy("Sentiment").count().orderBy(col("count").desc()).show()

+--------------------+-----+
|           Sentiment|count|
+--------------------+-----+
|            Positive| 7718|
|            Negative| 6857|
|             Neutral| 5224|
|  Extremely Positive| 4412|
|  Extremely Negative| 3751|
|   social distancing|    5|
|    N. Y. - April 10|    3|
| state governors ...|    2|
|             however|    2|
| supermarket workers|    2|
|        Stay with us|    2|
| but we also need...|    2|
| or click the lin...|    2|
| just ""selfish p...|    2|
|           of course|    2|
| not going to the...|    2|
| ecological collapse|    2|
|        Corona Virus|    2|
|            delivery|    2|
| Big Tech could b...|    1|
+--------------------+-----+
only showing top 20 rows



In [17]:
#Keeping only rows whose Sentiment is one of the five valid labels: Positive, Negative, Neutral, Extremely Positive, Extremely Negative
import pyspark.sql.functions as fn
data = corona.where(fn.col("Sentiment").isin(["Positive", "Negative", "Neutral","Extremely Positive","Extremely Negative"]))
data.groupBy("Sentiment").count().orderBy(col("count").desc()).show()

+------------------+-----+
|         Sentiment|count|
+------------------+-----+
|          Positive| 7718|
|          Negative| 6857|
|           Neutral| 5224|
|Extremely Positive| 4412|
|Extremely Negative| 3751|
+------------------+-----+



The tweet text is transformed in three steps for modelling: it is tokenized, stop words are removed, and the remaining tokens are converted to a count vector that serves as the feature column.

In [18]:
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.classification import LogisticRegression
# split each tweet into lowercase tokens on non-word characters
regexTokenizer = RegexTokenizer(inputCol="OriginalTweet", outputCol="words", pattern="\\W")
# custom stop-word list; setStopWords replaces the default English list, so only these words are removed
add_stopwords = ["http","https","amp","rt","t","c","the"] 
stopwordsRemover = StopWordsRemover(inputCol="words", outputCol="filtered").setStopWords(add_stopwords)
# bag-of-words counts: keep at most 1000 terms, each appearing in at least 5 tweets
countVectors = CountVectorizer(inputCol="filtered", outputCol="features", vocabSize=1000, minDF=5)
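To make the three stages concrete, here is a minimal plain-Python sketch of what they roughly do to a single tweet. It is an approximation, not Spark's implementation: the sample tweet, the toy vocabulary and the helper names (tokenize, remove_stop_words, count_vector) are all made up for illustration, and the real CountVectorizer learns its vocabulary from the data.

```python
import re
from collections import Counter

# Same words as add_stopwords in the cell above.
STOP_WORDS = {"http", "https", "amp", "rt", "t", "c", "the"}

def tokenize(text):
    # Lowercase, split on runs of non-word characters, drop empty tokens.
    return [tok for tok in re.split(r"\W+", text.lower(), flags=re.ASCII) if tok]

def remove_stop_words(tokens):
    # Drop every token that appears in the stop-word set.
    return [tok for tok in tokens if tok not in STOP_WORDS]

def count_vector(tokens, vocabulary):
    # Sparse representation: vocabulary index -> occurrence count.
    counts = Counter(tokens)
    return {vocabulary[tok]: float(n) for tok, n in counts.items() if tok in vocabulary}

tweet = "RT the shelves are empty https://t.co/abc the SHELVES"  # hypothetical tweet
vocabulary = {"shelves": 0, "empty": 1}                          # hypothetical toy vocabulary

tokens = tokenize(tweet)
filtered = remove_stop_words(tokens)
sparse = count_vector(filtered, vocabulary)

print(filtered)
print(sparse)
```

Words outside the vocabulary, such as "are" here, simply contribute nothing to the vector, which is also what happens to terms that fall outside the 1000-term limit or below the minDF threshold.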

StringIndexer is used to convert the Sentiment text into a numeric label column.

In [19]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
label_stringIdx = StringIndexer(inputCol = "Sentiment", outputCol = "label")
pipeline = Pipeline(stages=[regexTokenizer, stopwordsRemover, countVectors, label_stringIdx])
pipelineFit = pipeline.fit(data)
dataset = pipelineFit.transform(data)
dataset.show(10)

+--------------------+------------------+--------------------+--------------------+--------------------+-----+
|       OriginalTweet|         Sentiment|               words|            filtered|            features|label|
+--------------------+------------------+--------------------+--------------------+--------------------+-----+
|@MeNyrbie @Phil_G...|           Neutral|[menyrbie, phil_g...|[menyrbie, phil_g...|(1000,[1,5],[2.0,...|  2.0|
|advice Talk to yo...|          Positive|[advice, talk, to...|[advice, talk, to...|(1000,[0,2,25,34,...|  0.0|
|Coronavirus Austr...|          Positive|[coronavirus, aus...|[coronavirus, aus...|(1000,[0,5,6,8,9,...|  0.0|
|As news of the re...|          Positive|[as, news, of, th...|[as, news, of, re...|(1000,[0,1,2,5,8,...|  0.0|
|"Cashier at groce...|          Positive|[cashier, at, gro...|[cashier, at, gro...|(1000,[0,4,5,11,1...|  0.0|
|Due to COVID-19 o...|          Positive|[due, to, covid, ...|[due, to, covid, ...|(1000,[0,1,4,5,7,...|  0.0|
|
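By default StringIndexer assigns indices by descending frequency, so the most common sentiment gets label 0.0. The small sketch below reproduces that ordering in plain Python from the counts printed earlier; it is only an illustration of the default rule, and the variable names are made up.

```python
# Sentiment counts as printed by the groupBy cell earlier in the notebook.
sentiment_counts = {
    "Positive": 7718,
    "Negative": 6857,
    "Neutral": 5224,
    "Extremely Positive": 4412,
    "Extremely Negative": 3751,
}

# Most frequent value first, mirroring StringIndexer's default ordering.
ordered = sorted(sentiment_counts, key=lambda s: sentiment_counts[s], reverse=True)
label_of = {sentiment: float(index) for index, sentiment in enumerate(ordered)}

for sentiment, label in label_of.items():
    print(label, sentiment)
```

This agrees with the table above, where Positive rows carry label 0.0 and the Neutral row carries label 2.0.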

Splitting the data into training (75%) and test (25%) sets

In [20]:
(trainData, testData) = dataset.randomSplit([0.75, 0.25], seed=120)
print("Training Data Count: " + str(trainData.count()))
print("Test Data Count: " + str(testData.count()))

Training Data Count: 20996
Test Data Count: 6966


Using Random Forest Classifier model

In [22]:
from pyspark.ml.classification import RandomForestClassifier
rfc = RandomForestClassifier(labelCol="label", 
                            featuresCol="features",
                            numTrees = 100, 
                            maxDepth = 4, 
                            maxBins = 32)
# Training the model
rfcModel = rfc.fit(trainData)
# Making prediction
predictions = rfcModel.transform(testData)
predictions.filter(predictions['prediction'] == 0) \
    .select("OriginalTweet", "Sentiment", "probability", "label", "prediction") \
    .orderBy("probability", ascending=False) \
    .show(n=10, truncate=30)

+------------------------------+------------------+------------------------------+-----+----------+
|                 OriginalTweet|         Sentiment|                   probability|label|prediction|
+------------------------------+------------------+------------------------------+-----+----------+
|All e com sites like @amazo...|Extremely Positive|[0.3234730543809337,0.20512...|  3.0|       0.0|
|looking forward to working ...|Extremely Positive|[0.3225341763305235,0.18365...|  3.0|       0.0|
|To help contain the COVID-1...|Extremely Positive|[0.32141291519207366,0.1847...|  3.0|       0.0|
|The first 2 steps in the di...|Extremely Positive|[0.3185136475272952,0.20021...|  3.0|       0.0|
|Today At Riverside HS I ran...|Extremely Positive|[0.3181401306110266,0.20709...|  3.0|       0.0|
|Like for Hand sanitizer Ret...|Extremely Positive|[0.31798576146088153,0.2054...|  3.0|       0.0|
|Brewers and distillers acro...|          Positive|[0.3178164717030156,0.19192...|  0.0|       0.0|


In [23]:
#Evaluating the model; no metricName is set, so the evaluator uses its default metric, F1 (not accuracy)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction")
evaluator.evaluate(predictions)

0.1770125486389085

Using Logistic Regression model

In [24]:
logr = LogisticRegression(maxIter=20, regParam=0.3, elasticNetParam=0)
logrModel = logr.fit(trainData)
# making prediction on test data 
predictions = logrModel.transform(testData)
predictions.filter(predictions['prediction'] == 0) \
    .select("OriginalTweet", "Sentiment", "probability", "label", "prediction") \
    .orderBy("probability", ascending=False) \
    .show(n=10, truncate=30)

+------------------------------+------------------+------------------------------+-----+----------+
|                 OriginalTweet|         Sentiment|                   probability|label|prediction|
+------------------------------+------------------+------------------------------+-----+----------+
|If you want to learn about ...|          Positive|[0.6246888881907727,0.07304...|  0.0|       0.0|
|The UK Government is helpin...|Extremely Positive|[0.6091860852243968,0.06411...|  3.0|       0.0|
|Members - Tune in Wed for a...|Extremely Positive|[0.6024668771197693,0.03902...|  3.0|       0.0|
|@adrparsons @commerson @Mic...|          Positive|[0.5937497976647509,0.16216...|  0.0|       0.0|
|Please join us for a webina...|Extremely Positive|[0.5930786297371786,0.04188...|  3.0|       0.0|
|Appalachian Wireless has le...|          Positive|[0.5882698052413038,0.11207...|  0.0|       0.0|
|Last week, @Captify opened ...|          Positive|[0.576844022207999,0.098579...|  0.0|       0.0|


In [25]:
#Evaluating the model; no metricName is set, so the evaluator uses its default metric, F1 (not accuracy)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction")
evaluator.evaluate(predictions)

0.4841331688930419

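We can conclude that the Logistic Regression model performs better on this dataset, scoring about 0.48 compared to about 0.18 for the Random Forest classifier. Because MulticlassClassificationEvaluator was used with its default metric, these values are F1 scores rather than accuracy.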
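MulticlassClassificationEvaluator reports F1 unless metricName is set (for example metricName="accuracy"), and the two metrics can differ noticeably. The sketch below computes both in plain Python on a tiny made-up example where a model predicts the majority class for every row. The labels are hypothetical and the helper functions are illustrative, following the usual support-weighted F1 definition; this is not a claim about how either model in this notebook behaved.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    # Fraction of rows where the prediction equals the true label.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    # Per-class F1, averaged with weights proportional to each class's true count.
    total = len(y_true)
    score = 0.0
    for label, n_true in Counter(y_true).items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        n_pred = sum(p == label for p in y_pred)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * n_true / total
    return score

# Hypothetical labels: three classes, and a model that always predicts class 0.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0]

acc = accuracy(y_true, y_pred)
f1 = weighted_f1(y_true, y_pred)
print(acc, f1)
```

Here accuracy is 0.5 while the weighted F1 is about 0.33, so a score should always be reported together with the metric that produced it.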