# Project: NLP Tag Filter

Dataset: stack_overflow_data.csv (Stack Overflow questions and their associated tags)

In this project, we build a tag filter using PySpark, several NLP tools, and a classifier that predicts the tag for each question.

In [1]:
import findspark
findspark.init()
from pyspark import SparkContext
SparkContext.setSystemProperty('spark.hadoop.dfs.client.use.datanode.hostname', 'true')
sc=SparkContext(master='local', appName='New Spark Context')
sc

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('nlp').getOrCreate()
spark

In [3]:
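# Import only the functions used below; a wildcard import would shadow Python builtins such as sum, min and max
from pyspark.sql.functions import count, isnan, isnull, length, when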
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, IDF, StringIndexer, VectorAssembler
from pyspark.ml.classification import NaiveBayes, LogisticRegression, RandomForestClassifier
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.linalg import Vector

In [4]:
data=spark.read.csv("D:/DS/stack_overflow_data.csv", inferSchema=True, header=True)
data.show(5)

+--------------------+-----------+
|                post|       tags|
+--------------------+-----------+
|what is causing t...|         c#|
|have dynamic html...|    asp.net|
|how to convert a ...|objective-c|
|.net framework 4 ...|       .net|
|trying to calcula...|     python|
+--------------------+-----------+
only showing top 5 rows



In [5]:
data.groupBy('tags').count().show()

+-----------+-----+
|       tags|count|
+-----------+-----+
|     iphone| 2000|
|    android| 2000|
|         c#| 2000|
|       NULL|20798|
|    asp.net| 2000|
|       html| 2000|
|      mysql| 2000|
|     jquery| 2000|
| javascript| 2000|
|        css| 2000|
|        sql| 2000|
|        c++| 2000|
|          c| 2000|
|objective-c| 2000|
|       java| 2000|
|        php| 2000|
|       .net| 2000|
|        ios| 2000|
|     python| 2000|
|  angularjs| 2000|
+-----------+-----+
only showing top 20 rows



In [6]:
data.count()

60798

In [7]:
# Count NaN values in each column
data.select([count(when(isnan(c), c)).alias(c) for c in data.columns]).toPandas().T

      0
post  0
tags  0


In [8]:
# Count null values in each column
data.select([count(when(isnull(c), c)).alias(c) for c in data.columns]).toPandas().T

          0
post      0
tags  20798


In [9]:
data=data.dropna()
data.count()

40000

In [10]:
data=data.dropDuplicates()
data.count()

39790

In [11]:
data=data.withColumn('length', length(data['post']))
data.show()

+--------------------+----------+------+
|                post|      tags|length|
+--------------------+----------+------+
|sql simple select...|       sql|   521|
|send reminder: 2 ...|       php|   247|
|htmlinputfile thr...|      html|   421|
|why remove and re...|         c|  1249|
|designing and wri...|   asp.net|   743|
|input fields asso...|javascript|   240|
|navigation bar st...|       ios|   936|
|help with searchf...|       php|   509|
|sql concat rows w...|       sql|   795|
|how can i divide ...|      html|   401|
|dynamic create ed...|   android|   225|
|sql query: how to...|       sql|   677|
|specific css clas...|       css|   565|
|truncate timespan...|        c#|   343|
|python error cont...|    python|   572|
|is possible t<t> ...|        c#|   720|
|what is the diffe...|       css|   477|
|how to solve inco...|       sql|   599|
|developing websit...|   asp.net|   181|
|storing numbers t...|     mysql|   145|
+--------------------+----------+------+
only showing top 20 rows

In [12]:
data.groupBy('tags').mean().show()

+-------------+------------------+
|         tags|       avg(length)|
+-------------+------------------+
|       iphone|           709.621|
|      android| 1714.287643821911|
|           c#|         1145.3065|
|         html| 912.1049129989765|
|      asp.net|            999.95|
|        mysql|1039.5925925925926|
|       jquery| 1095.646403242148|
|   javascript| 986.9109518935517|
|          css|  983.086508753862|
|          sql|           870.912|
|          c++|1296.5992996498248|
|            c|         1121.1115|
|  objective-c|          972.8925|
|         java| 1357.982991495748|
|          php|1125.6558116232466|
|         .net|          731.0075|
|          ios|          970.7565|
|       python|         1018.6695|
|    angularjs| 1310.383097165992|
|ruby-on-rails| 1244.823911955978|
+-------------+------------------+



In [13]:
tokenizer = Tokenizer(inputCol="post", outputCol="token_post")

stopremove = StopWordsRemover(inputCol='token_post',outputCol='stop_tokens')

count_vec = CountVectorizer(inputCol='stop_tokens',outputCol='c_vec')

idf = IDF(inputCol="c_vec", outputCol="tf_idf")

tags_to_num = StringIndexer(inputCol='tags',outputCol='label')
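
The `label` column produced by `StringIndexer` holds floating-point indices rather than tag names. As a rough pure-Python sketch of that mapping, assuming the default frequency-descending ordering (most frequent tag gets 0.0) and assuming ties are broken alphabetically as in recent Spark versions, the helper `index_labels` and the sample tags below are hypothetical and only for illustration:

```python
from collections import Counter

def index_labels(labels):
    """Map each distinct label to a float index, most frequent first.

    Ties are broken alphabetically here; this is an assumption about
    Spark's behavior, not something shown in the notebook.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: float(i) for i, lab in enumerate(ordered)}

# Made-up sample, not taken from the dataset
tags = ["python", "sql", "python", "css", "sql", "python"]
mapping = index_labels(tags)
print(mapping)
```

Because the deduplicated dataset no longer has exactly 2000 posts per tag, the index assigned to each tag depends on the remaining per-tag counts.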

In [14]:
clean_up=VectorAssembler(inputCols=['tf_idf', 'length'], outputCol='features')

In [15]:
data_pre_pipe=Pipeline(stages=[tags_to_num,
                               tokenizer,
                               stopremove, 
                               count_vec,
                               idf,
                               clean_up])

clean_data=data_pre_pipe.fit(data).transform(data)
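
To make the text stages of this pipeline more concrete, here is a minimal pure-Python sketch of what `Tokenizer`, `StopWordsRemover`, `CountVectorizer` and `IDF` compute together. It assumes the smoothed formula idf = log((N + 1) / (df + 1)) given in the Spark ML documentation. The three sample posts, the tiny stop-word set and the helper names `tokenize` and `tf_idf` are made up for illustration and are not part of PySpark.

```python
import math
from collections import Counter

def tokenize(text):
    # Roughly what Spark's Tokenizer does: lowercase, then split on whitespace
    return text.lower().split()

def tf_idf(docs, stop_words=frozenset()):
    """Return one {term: tf * idf} dict per document."""
    tokenized = [[t for t in tokenize(d) if t not in stop_words] for d in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # Smoothed inverse document frequency (assumed to match Spark's IDF)
    idf = {term: math.log((n_docs + 1) / (df + 1)) for term, df in doc_freq.items()}

    # Raw term counts (the CountVectorizer step) scaled by idf
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]

# Hypothetical sample posts
posts = [
    "how to parse json in python",
    "how to center a div in css",
    "python list comprehension syntax",
]
stop_words = {"how", "to", "in", "a"}
weights = tf_idf(posts, stop_words)
print(weights[0])
```

Terms that appear in many posts, such as `python` in this sample, receive a lower weight than terms that are specific to one post, such as `json`.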

In [16]:
clean_data = clean_data.select(['label','features'])
clean_data.show()

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  8.0|(262145,[0,1,12,3...|
| 14.0|(262145,[0,19,32,...|
| 17.0|(262145,[0,11,20,...|
|  2.0|(262145,[0,2,4,7,...|
|  1.0|(262145,[0,20,47,...|
| 18.0|(262145,[0,33,36,...|
|  4.0|(262145,[0,1,4,9,...|
| 14.0|(262145,[0,4,5,12...|
|  8.0|(262145,[0,4,6,9,...|
| 17.0|(262145,[0,4,32,5...|
|  9.0|(262145,[0,12,21,...|
|  8.0|(262145,[0,4,20,3...|
| 19.0|(262145,[0,2,3,4,...|
|  3.0|(262145,[0,11,18,...|
|  7.0|(262145,[0,1,4,9,...|
|  3.0|(262145,[0,4,8,13...|
| 19.0|(262145,[0,16,29,...|
|  8.0|(262145,[0,4,12,1...|
|  1.0|(262145,[0,11,237...|
| 13.0|(262145,[0,74,77,...|
+-----+--------------------+
only showing top 20 rows



In [17]:
train, test = clean_data.randomSplit([0.7, 0.3], seed=142)

In [18]:
nb=NaiveBayes()

In [19]:
predictor=nb.fit(train)

In [20]:
test_result=predictor.transform(test)

In [21]:
test_result.show(10,False)


In [22]:
test_result.groupBy('label', 'prediction').count().show()

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|  8.0|       3.0|    5|
| 12.0|      16.0|    1|
| 10.0|       1.0|    1|
|  2.0|       0.0|   11|
| 15.0|      16.0|    8|
| 12.0|       5.0|    1|
|  0.0|       8.0|    4|
|  0.0|      12.0|    3|
|  1.0|      19.0|    7|
|  1.0|      12.0|    3|
|  7.0|       3.0|    9|
| 11.0|      17.0|    1|
| 17.0|      19.0|  107|
| 17.0|       7.0|    3|
|  6.0|       1.0|    1|
|  3.0|       9.0|    1|
|  4.0|       6.0|   69|
|  3.0|       5.0|    2|
|  6.0|       8.0|    2|
|  9.0|       5.0|   17|
+-----+----------+-----+
only showing top 20 rows
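
One way to read these counts is to rank the misclassified (label, prediction) pairs by size. The sketch below does this in plain Python on a handful of rows copied from the output above. The displayed table is only the first 20 rows, so this is a partial view, and the helper name `most_confused` is hypothetical.

```python
# (label, prediction, count) rows copied from the displayed output
rows = [
    (8.0, 3.0, 5),
    (2.0, 0.0, 11),
    (15.0, 16.0, 8),
    (17.0, 19.0, 107),
    (4.0, 6.0, 69),
    (9.0, 5.0, 17),
    (7.0, 3.0, 9),
]

def most_confused(rows, top_n=3):
    """Return the top_n misclassified pairs, largest count first."""
    errors = [r for r in rows if r[0] != r[1]]
    return sorted(errors, key=lambda r: r[2], reverse=True)[:top_n]

top = most_confused(rows)
for label, prediction, n in top:
    print(f"label {label} predicted as {prediction}: {n} posts")
```

In this partial view, label 17.0 predicted as 19.0 (107 posts) and label 4.0 predicted as 6.0 (69 posts) stand out, which suggests that a few closely related tags account for a large share of the errors.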



In [23]:
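# Note: MulticlassClassificationEvaluator defaults to metricName='f1',
# so despite the variable names this value is the weighted F1 score, not accuracy
acc_eval=MulticlassClassificationEvaluator()
acc=acc_eval.evaluate(test_result)
acc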

0.724638798587695
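
Since `MulticlassClassificationEvaluator` reports the weighted F1 score by default, the value above is not plain accuracy. The following pure-Python sketch shows how the two metrics differ on a small made-up set of (label, prediction) pairs. The function name `accuracy_and_weighted_f1` and the sample pairs are hypothetical, and the weighting by true-label frequency is assumed to match Spark's definition.

```python
from collections import Counter

def accuracy_and_weighted_f1(pairs):
    """Compute accuracy and support-weighted F1 from (label, prediction) pairs."""
    pairs = list(pairs)
    total = len(pairs)
    correct = sum(1 for y, p in pairs if y == p)

    true_counts = Counter(y for y, _ in pairs)          # support per true label
    pred_counts = Counter(p for _, p in pairs)          # predictions per label
    true_pos = Counter(y for y, p in pairs if y == p)   # correct predictions per label

    weighted_f1 = 0.0
    for label, support in true_counts.items():
        precision = true_pos[label] / pred_counts[label] if pred_counts[label] else 0.0
        recall = true_pos[label] / support
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weighted_f1 += f1 * support / total

    return correct / total, weighted_f1

# Made-up predictions for two classes
pairs = [(0, 0), (0, 0), (0, 0), (0, 1), (1, 1), (1, 0), (1, 0)]
accuracy, f1 = accuracy_and_weighted_f1(pairs)
print(accuracy, f1)
```

In PySpark, accuracy can be requested explicitly by passing `metricName='accuracy'` to the evaluator.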

The current results are not optimal: the Naive Bayes model reaches a weighted F1 score of about 0.72 on the test set. To improve the model’s performance, it is recommended to collect more data. A larger and more diverse dataset can help the model better distinguish between tags, leading to more accurate predictions.