# 3. Analyze data set from Stack Exchange
By Siny

The data set contains the following attributes:
* qid: Unique question id
* i: User id of questioner
* qs: Score of the question
* qt: Time of the question (in epoch time)
* tags: A comma-separated list of the tags associated with the question
* qvc: Number of views of this question (at the time of the datadump)
* qac: Number of answers for this question (at the time of the datadump)
* aid: Unique answer id
* j: User id of answerer
* as: Score of the answer
* at: Time of the answer (in epoch time)
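
Since qt and at are plain Unix epoch seconds, they can be converted to readable timestamps, and their difference gives the answer delay in seconds. A minimal sketch in plain Python, using the qt and at values of the first row in the sample output shown later (the helper name `to_utc` is hypothetical and not part of the notebook):

```python
from datetime import datetime, timezone

def to_utc(epoch_seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

# qt and at of the first sample row (qid 563355, aid 563372)
qt, at = 1235000081, 1235000501

print(to_utc(qt).isoformat())  # question time as an ISO 8601 string
print(at - qt)                 # answer delay in seconds
```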

In [0]:
from pyspark.sql import SparkSession, functions as F

In [0]:
spark = SparkSession.builder.appName('master').getOrCreate()

## Load Data

In [0]:
answers = spark.read.csv('/FileStore/tables/answers.csv', inferSchema=True, header=True)

## Data Analysis and Preprocessing

In [0]:
answers.show(15)

+---+------+-----+---+----------+--------------------+----+---+------+-----+---+----------+
|_c0|   qid|    i| qs|        qt|                tags| qvc|qac|   aid|    j| as|        at|
+---+------+-----+---+----------+--------------------+----+---+------+-----+---+----------+
|  1|563355|62701|  0|1235000081|php,error,gd,imag...| 220|  2|563372|67183|  2|1235000501|
|  2|563355|62701|  0|1235000081|php,error,gd,imag...| 220|  2|563374|66554|  0|1235000551|
|  3|563356|15842| 10|1235000140|lisp,scheme,subje...|1047| 16|563358|15842|  3|1235000177|
|  4|563356|15842| 10|1235000140|lisp,scheme,subje...|1047| 16|563413|  893| 18|1235001545|
|  5|563356|15842| 10|1235000140|lisp,scheme,subje...|1047| 16|563454|11649|  4|1235002457|
|  6|563356|15842| 10|1235000140|lisp,scheme,subje...|1047| 16|563472|50742|  6|1235002809|
|  7|563356|15842| 10|1235000140|lisp,scheme,subje...|1047| 16|563484| 8899|  1|1235003266|
|  8|563356|15842| 10|1235000140|lisp,scheme,subje...|1047| 16|563635|60190| 12|

Above is a glimpse of the data. All features except tags appear to be numeric. From an initial look, the questioner user id (i) seems to be determined by the question id (qid): every row for a given question carries the same i. The time taken to answer can be computed as the difference between the answer time (at) and the question time (qt).

In [0]:
print(f"Number of rows : {answers.count()}")

Number of rows : 263540


In [0]:
answers.summary().show()

+-------+-----------------+------------------+-----------------+-----------------+--------------------+--------------------+-----------------+-----------------+------------------+------------------+------------------+--------------------+
|summary|              _c0|               qid|                i|               qs|                  qt|                tags|              qvc|              qac|               aid|                 j|                as|                  at|
+-------+-----------------+------------------+-----------------+-----------------+--------------------+--------------------+-----------------+-----------------+------------------+------------------+------------------+--------------------+
|  count|           263540|            263540|           263540|           263540|              263540|              263540|           263540|           263540|            263540|            263540|            263540|              263540|
|   mean|         131770.5| 758744.192536237

In [0]:
answers.columns

Out[14]: ['_c0', 'qid', 'i', 'qs', 'qt', 'tags', 'qvc', 'qac', 'aid', 'j', 'as', 'at']

In [0]:
# Find the null value counts in the data for each column
for col, dtype in answers.dtypes:
  print(f"{col}:")
  print(f"Null count: {answers.filter(answers[col].isNull()).count()}")
  print('#' * 50)

_c0:
Null count: 0
##################################################
qid:
Null count: 0
##################################################
i:
Null count: 0
##################################################
qs:
Null count: 0
##################################################
qt:
Null count: 0
##################################################
tags:
Null count: 0
##################################################
qvc:
Null count: 0
##################################################
qac:
Null count: 0
##################################################
aid:
Null count: 0
##################################################
j:
Null count: 0
##################################################
as:
Null count: 0
##################################################
at:
Null count: 0
##################################################


In [0]:
print('Datatypes of each column are:')
answers.dtypes

Datatypes of each column are:
Out[32]: [('_c0', 'int'),
 ('qid', 'int'),
 ('i', 'string'),
 ('qs', 'int'),
 ('qt', 'int'),
 ('tags', 'string'),
 ('qvc', 'int'),
 ('qac', 'int'),
 ('aid', 'int'),
 ('j', 'string'),
 ('as', 'int'),
 ('at', 'int')]

None of the columns contain null values. However, columns i and j, the user ids of the questioner and answerer respectively, should be numeric, yet their dtypes are string. The summary output also shows NA in these columns. Let us therefore explore these two columns further.

In [0]:
print(f"Number of NA values in column j : {answers.filter(F.col('j') == 'NA').count()}")
print(f"Number of NA values in column i : {answers.filter(F.col('i') == 'NA').count()}")

Number of NA values in column j : 140
Number of NA values in column i : 276
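
The null check earlier reports zero for every column because the missing user ids are stored as the literal string 'NA', which is an ordinary value rather than a null. The toy example below, in plain Python with made-up rows, illustrates the difference. In Spark, the CSV reader's `nullValue` option can map such a marker to a real null at load time; whether that is appropriate for this data set is an assumption that has not been checked here.

```python
# Made-up rows for illustration only; 'NA' mimics the marker found in columns i and j.
rows = [
    {"qid": 1, "i": "101", "j": "NA"},
    {"qid": 2, "i": "NA", "j": "202"},
    {"qid": 3, "i": "NA", "j": None},
]

def count_nulls(rows, col):
    """Count true nulls (None), the analogue of isNull() in Spark."""
    return sum(1 for r in rows if r[col] is None)

def count_marker(rows, col, marker="NA"):
    """Count rows whose value equals the sentinel string."""
    return sum(1 for r in rows if r[col] == marker)

for col in ("i", "j"):
    print(col, "nulls:", count_nulls(rows, col), "NA markers:", count_marker(rows, col))
```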


In [0]:
"""From analysing the data we know that the questioner user id is the same for every row of a given qid. So let us check whether a qid with NA in column i also has rows with a non-NA i value. To do that, we build a dataframe of qid and row count from the rows that have NA in column i."""

ans_with_NA = answers.filter(F.col('i') == 'NA').groupBy('qid').count()


In [0]:
"""Here we inner join the dataframe of NA rows with the original dataframe and count the rows whose i is not NA. The count is zero; otherwise we could have filled the NAs with the matching non-NA i values. This step could have been omitted, as the analysis is barely affected by the i (questioner) column."""
answers.join(ans_with_NA, on='qid', how='inner').filter(F.col('i') != 'NA').count()

Out[136]: 0
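
The join above asks whether any question with an NA questioner id also has a row with a usable id. The same logic can be sketched in plain Python on a few made-up rows (the function name `recoverable_qids` is hypothetical):

```python
from collections import defaultdict

def recoverable_qids(rows, marker="NA"):
    """Return qids that have both an NA and a non-NA questioner id,
    i.e. questions whose NA could be filled from a sibling row."""
    ids_by_qid = defaultdict(set)
    for qid, i in rows:
        ids_by_qid[qid].add(i)
    return sorted(q for q, ids in ids_by_qid.items() if marker in ids and len(ids) > 1)

# Made-up (qid, i) pairs: only qid 12 mixes 'NA' with a real id
rows = [(10, "7"), (10, "7"), (11, "NA"), (11, "NA"), (12, "NA"), (12, "9")]
print(recoverable_qids(rows))
```

An empty result corresponds to the count of 0 obtained above: no NA questioner id in this data can be recovered from another row of the same question.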

In [0]:
answers_final = answers

In [0]:
"""The records with NA answerer ids may be dropped using the following code. But since the percentage of those rows is negligible, I am retaining them"""
# answers_final = answers_final.filter((F.col('j') != 'NA'))
# print(f"Number of rows without NA answerer ids: {answers_final.count()}")

Out[34]: 'The records with NA answerer ids may be dropped using the following code. But since the percentage of those rows is negligible, I am retaining them'

Now let us analyse the time taken to answer each question. qt is the question time and at is the answer time, so the difference at - qt should never be negative. Let us check whether any such rows exist.

In [0]:
answers_final.filter('(at - qt) < 0').count()

Out[139]: 17

We have 17 records whose answer time precedes the question time, which is logically impossible. We therefore remove those records as well.

In [0]:
answers_final = answers_final.filter('(at - qt) >=0')

In [0]:
answers_final.count()

Out[6]: 263523

### Question 1
Top 10 most commonly used tags in this data set.

In [0]:
answers_final.groupBy('tags').count().orderBy('count', ascending=False).show(10, truncate=False)

+-----------+-----+
|tags       |count|
+-----------+-----+
|c#         |1805 |
|javascript |948  |
|aspûnet    |856  |
|c++        |798  |
|python     |782  |
|java       |751  |
|aspûnet-mvc|707  |
|jquery     |675  |
|php        |667  |
|c#,ûnet    |531  |
+-----------+-----+
only showing top 10 rows
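
Note that grouping on the tags column counts each complete tag string, so a combination such as c#,ûnet appears as its own entry. If the goal is to count individual tags, each string would first have to be split on commas; in PySpark this is typically done with `F.split` followed by `F.explode`. The sketch below shows the idea in plain Python on made-up tag strings, so its counts are not results from this data set.

```python
from collections import Counter

def top_tags(tag_strings, n=10):
    """Split comma-separated tag strings and count each tag separately."""
    counts = Counter()
    for s in tag_strings:
        counts.update(t.strip() for t in s.split(",") if t.strip())
    return counts.most_common(n)

# Made-up tag strings, one per answer row
sample = ["c#", "c#,linq", "python", "python,http", "c#,asp"]
print(top_tags(sample, n=2))
```

Because the data has one row per answer, either approach weights a question by its number of answers; de-duplicating on qid first would count each question once.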



### Question 2:
Average time to answer questions.

In [0]:
answers_final.withColumn('time', F.col('at') - F.col('qt')).select(F.ceil(F.avg('time')).alias('avg time(seconds)')).show()

+-----------------+
|avg time(seconds)|
+-----------------+
|           133792|
+-----------------+
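
For readability, the average of 133792 seconds can be converted to larger units. A small sketch using only the standard library:

```python
from datetime import timedelta

avg_seconds = 133792  # value reported above (already rounded up by F.ceil)

print(timedelta(seconds=avg_seconds))  # days, hours:minutes:seconds
print(round(avg_seconds / 3600, 2))    # the same duration in hours
```

The mean delay is therefore roughly 37 hours, or about a day and a half.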



### Question 3:
Number of questions which got answered within 1 hour

In [0]:
answers_1hr = answers_final.withColumn('time', (F.col('at') - F.col('qt')) / 3600).filter('time < 1')

In [0]:
answers_1hr.count()

Out[25]: 174581
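
The count above is the number of answer rows posted within an hour, since the data has one row per answer. If the number of distinct questions is wanted instead, the qid values would need to be de-duplicated (in PySpark, for example, with `F.countDistinct('qid')`). A plain-Python sketch of the difference on made-up rows (the function name `answered_within` is hypothetical):

```python
def answered_within(rows, limit_seconds=3600):
    """Return (answer_count, question_count) for answers posted in under limit_seconds.

    rows: iterable of (qid, qt, at) tuples with epoch-second times.
    """
    fast = [(qid, qt, at) for qid, qt, at in rows if 0 <= at - qt < limit_seconds]
    return len(fast), len({qid for qid, _, _ in fast})

# Made-up rows: question 1 has two fast answers, question 2 one slow answer,
# question 3 one answer just under the one-hour limit
rows = [(1, 0, 60), (1, 0, 1800), (2, 100, 100 + 7200), (3, 50, 50 + 3599)]
print(answered_within(rows))
```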

### Question 4:
Tags of questions which got answered within 1 hour

In [0]:
print(f"Count of distinct tags which were answered within one hour: {answers_1hr.select('tags').distinct().count()}")

Count of distinct tags which were answered within one hour: 52938


In [0]:
print('Displaying 10 tags that got answered within 1hr')
answers_1hr.select('tags').distinct().show(10, truncate=False)

Displaying 10 tags that got answered within 1hr
+------------------------------+
|tags                          |
+------------------------------+
|aspûnet                       |
|lisp,scheme,subjective,clojure|
|css                           |
|jquery,javascript,dom         |
|python,http                   |
|visualstudio                  |
|php,error,gd,image-processing |
|cocoa-touch,core-graphics     |
|winforms,gridview,ûnet        |
|core-animation                |
+------------------------------+
only showing top 10 rows

