# Getting the Data

To get started, we pulled data from Stack Overflow using the Stack Exchange Data Explorer at http://data.stackexchange.com/stackoverflow/query/new

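The query below selects the ID and tags of every post created in 2013 that has more than one answer and at least one tag; ORDER BY NEWID() returns the rows in random order: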
Select id, Tags

from posts

where CreationDate BETWEEN '01/01/2013' AND '12/31/2013'

AND AnswerCount >1 AND Tags <> ''

ORDER BY NEWID()

This next step uses findspark to locate the Spark installation so that pyspark can be imported. We then import pyspark and create the SparkContext used in the following code.

In [17]:
import findspark
findspark.init('/usr/hdp/2.6.3.0-235/spark2')

import pyspark
sc = pyspark.SparkContext(appName="FinalProject")

Here I loaded the text file containing the results of the query above. The BaseData.txt file was saved in HDFS under /user/vagrant/data/input. I assigned the resulting RDD to the variable text_file so I could manipulate it going forward.

In [20]:
text_file = sc.textFile("hdfs:///user/vagrant/data/input/BaseData.txt")

from pyspark.rdd import RDD
isinstance(text_file, RDD)

text_file.take(5)
#This shows the output of the data and shows us what we need to remove to clean up the data

['id,Tags',
 '"19448711","<javascript><javascript-events><cross-browser><onkeyup>"',
 '"16042613","<java>"',
 '"17933344","<python><postgresql><fetchall>"',
 '"15837853","<exception><activemq><jmx>"']

The results above show what needs to be cleaned up to make the data more readable: the double quotes, the angle brackets around each tag, and the comma between the two columns. After cleaning and splitting each line, we take the first 5 rows to check that the output is as expected.

In [21]:
word_list = text_file.map(lambda x: x.lower().replace('>',' ').replace(',',' ').replace('<',' ').replace('"',' ').split())
#This takes the first 5 of the split
word_list.take(5)

[['id', 'tags'],
 ['19448711', 'javascript', 'javascript-events', 'cross-browser', 'onkeyup'],
 ['16042613', 'java'],
 ['17933344', 'python', 'postgresql', 'fetchall'],
 ['15837853', 'exception', 'activemq', 'jmx']]
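
The chained replace calls work for this file. As a hedged alternative, the sketch below parses one exported line with the standard csv module and a regular expression, and returns None for the header row so that it can be filtered out. The name `parse_line` is hypothetical, and the code assumes every line has exactly the two columns shown above.

```python
import csv
import re

# Matches the text between each pair of angle brackets, e.g. <python> -> python
TAG_PATTERN = re.compile(r"<([^<>]+)>")

def parse_line(line):
    """Return [post_id, tag1, tag2, ...] for one CSV line, or None for the header."""
    # Assumption: each line holds exactly two columns (id, Tags).
    post_id, tags = next(csv.reader([line]))
    if post_id.lower() == "id":
        return None
    return [post_id] + TAG_PATTERN.findall(tags.lower())

sample = '"17933344","<python><postgresql><fetchall>"'
print(parse_line(sample))
print(parse_line("id,Tags"))
```

With Spark, this could be applied as `text_file.map(parse_line).filter(lambda row: row is not None)`, which would also keep the header row out of the later steps.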

The output is as expected, so we move on to using a flatMap to emit a (tag, post ID) pair for each tag on each post. Then we take the first 20 results.

In [12]:
test = word_list.flatMap(lambda row: [(tag, row[0]) for tag in row[1:]])
test.take(20)

[('tags', 'id'),
 ('javascript', '19448711'),
 ('javascript-events', '19448711'),
 ('cross-browser', '19448711'),
 ('onkeyup', '19448711'),
 ('java', '16042613'),
 ('python', '17933344'),
 ('postgresql', '17933344'),
 ('fetchall', '17933344'),
 ('exception', '15837853'),
 ('activemq', '15837853'),
 ('jmx', '15837853'),
 ('java', '14944882'),
 ('c++', '20601118'),
 ('php', '16505917'),
 ('c++', '18993197'),
 ('arrays', '18993197'),
 ('pointers', '18993197'),
 ('memory', '18993197'),
 ('dynamic', '18993197')]
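
To make the flatMap step concrete, here is a minimal sketch in plain Python (no Spark needed) of what happens to each row: one row becomes a list of (tag, post ID) pairs, and the per-row lists are flattened into a single sequence. The helper name `tag_pairs` is hypothetical, and the two sample rows are copied from the output above.

```python
def tag_pairs(row):
    """Turn [post_id, tag1, tag2, ...] into [(tag1, post_id), (tag2, post_id), ...]."""
    post_id, tags = row[0], row[1:]
    return [(tag, post_id) for tag in tags]

rows = [
    ["19448711", "javascript", "javascript-events", "cross-browser", "onkeyup"],
    ["16042613", "java"],
]

# Plain-Python equivalent of flatMap: map each row to a list, then flatten.
pairs = [pair for row in rows for pair in tag_pairs(row)]
print(pairs)
```

Passing the same function to Spark as `word_list.flatMap(tag_pairs)` should produce the same pairs as the lambda above.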

Here we run a reduce to find which post IDs are associated with each tag. This produces a long comma-separated list of post IDs per tag, and the length of that list shows how popular the tag is.

In [14]:
test2 = test.reduceByKey(lambda a, b: a + ',' + b)

test2.take(5)

[('tags',
  'id,20077858,20817230,17472061,19913064,14731380,16330901,19593785,20194271,14829726,18678603,14374265,15917677,14413526,14366182,19769173,19937442,20069658,20464028,17767136,14244220,16968479,16757687,16790722,17426451,19769469,16372299,17632799,17735715,18970704,18250404'),
 ('javascript',
  '19448711,17539449,16182963,19888637,19902017,15454992,19596551,17123706,15382064,17340373,19607874,19079241,18286925,14634138,17533806,18370542,20635469,17731425,20173927,16923385,15634575,16428895,20337869,14187680,20089865,16292349,15235933,18393040,14172343,20139300,17399897,15451959,14314867,16428413,20309746,19401511,14357824,17798278,16282045,17185225,15418134,19398346,16739908,16264634,18763046,16822781,18703906,15387939,19426323,17400191,15774020,20353834,16160622,14655605,19654195,20583551,18919849,20219004,18518812,18060564,19274386,20306204,18016899,17567583,17962353,17460116,19690108,17021873,17663358,14699856,20476255,14133559,15815663,14671204,20243208,18054461,16995108
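
As an illustration of what the reduce produces, the plain-Python sketch below groups post IDs by tag into lists instead of comma-separated strings, then uses the length of each list as the tag's popularity. The helper name `group_ids_by_tag` is hypothetical, the sample pairs are taken from the flatMap output, and nothing here runs on the cluster.

```python
from collections import defaultdict

def group_ids_by_tag(pairs):
    """Group (tag, post_id) pairs into {tag: [post_id, ...]}."""
    grouped = defaultdict(list)
    for tag, post_id in pairs:
        grouped[tag].append(post_id)
    return dict(grouped)

pairs = [
    ("java", "16042613"),
    ("python", "17933344"),
    ("java", "14944882"),
]

grouped = group_ids_by_tag(pairs)
# Popularity of a tag = number of posts that carry it
popularity = {tag: len(ids) for tag, ids in grouped.items()}
print(grouped)
print(popularity)
```

If only the counts are needed, Spark's `countByKey()` on the pair RDD returns them to the driver directly, without building the ID strings first.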

This next part collects the results into a Python dictionary, then loops over it, splitting each string of IDs on the comma and adding the tag and its list of IDs to a Python list.

In [8]:
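# Collect the (tag, "id,id,...") pairs to the driver as a dictionary
test3 = test2.collectAsMap()
test4 = []
test5 = []
for tag in test3:
    # extend() adds the tag and its list of IDs as two separate elements
    test4.extend((tag, test3[tag].split(',')))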

In [10]:
test4[:5]

['tags',
 ['id',
  '20077858',
  '20817230',
  '17472061',
  '19913064',
  '14731380',
  '16330901',
  '19593785',
  '20194271',
  '14829726',
  '18678603',
  '14374265',
  '15917677',
  '14413526',
  '14366182',
  '19769173',
  '19937442',
  '20069658',
  '20464028',
  '17767136',
  '14244220',
  '16968479',
  '16757687',
  '16790722',
  '17426451',
  '19769469',
  '16372299',
  '17632799',
  '17735715',
  '18970704',
  '18250404'],
 'javascript',
 ['19448711',
  '17539449',
  '16182963',
  '19888637',
  '19902017',
  '15454992',
  '19596551',
  '17123706',
  '15382064',
  '17340373',
  '19607874',
  '19079241',
  '18286925',
  '14634138',
  '17533806',
  '18370542',
  '20635469',
  '17731425',
  '20173927',
  '16923385',
  '15634575',
  '16428895',
  '20337869',
  '14187680',
  '20089865',
  '16292349',
  '15235933',
  '18393040',
  '14172343',
  '20139300',
  '17399897',
  '15451959',
  '14314867',
  '16428413',
  '20309746',
  '19401511',
  '14357824',
  '17798278',
  '16282045',
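
The output above alternates between a tag and its list of IDs because extend adds each item of the tuple separately. The sketch below, on a small made-up dictionary rather than the project data, contrasts extend with append, which would keep each tag and its IDs together as one tuple. The variable names are hypothetical.

```python
# Made-up stand-in for the dictionary returned by collectAsMap()
tag_to_ids = {"java": "16042613,14944882", "python": "17933344"}

flat = []
paired = []
for tag, ids in tag_to_ids.items():
    flat.extend((tag, ids.split(",")))    # adds two elements: the tag, then the list
    paired.append((tag, ids.split(",")))  # adds one (tag, list) tuple

print(flat)
print(paired)
```

Which form is preferable depends on how the list is used later; the paired form makes it easier to look up the IDs that belong to a given tag.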
 

# Lessons Learned

In this class I learned a lot about how virtual machines work and how to set up a cluster. I had known how this works in theory, but I had avoided ever doing it myself because there are a lot of ways to mess up. Being able to do a walkthrough with you, the professor, was extremely helpful because you were able to show us and explain what each step was actually doing. Once we had the cluster up and running, the rest of the class was quite a bit simpler in my opinion.

This last project showed me that there are many different ways to accomplish the same outcome, and hundreds of ways of doing something incorrectly. Before starting this Master's program my knowledge of Python was extremely limited, but each lab/project we do helps me understand the language better and the possibilities for its use.

Overall, I think that this class did a really good job of explaining the basics. I know that we have barely scratched the surface and I am excited to see where this all goes. I was also extremely excited to learn that you could query against Stack Overflow; I feel that with more time this project could have gone a lot farther. Out of curiosity, I wanted to see how additional analysis of this data would have gone. I found it a lot easier to redo all of this project in Excel, and I got the top 10 tags, which are as follows:

| Tags | Counts |
|---|---|
| java | 3,818 |
| c# | 3,044 |
| javascript | 2,885 |
| php | 2,725 |
| android | 2,409 |
| jquery | 2,065 |
| python | 1,703 |
| c++ | 1,658 |
| html | 1,559 |


I used the Excel document to make sure that I was ending up with the same results as this program. After I did the reduce, I ran the following to get the number of unique tags:

test2.count()
Out[248]:
10571

Using my Excel data I was able to verify that there were in fact 10,571 unique tags in my data (including the header row). I would have liked to have figured out how to verify all of this data without having to go back to Excel.
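
One possible way to do that check in Python rather than Excel is sketched below: count how often each tag appears, then read off the most common tags and the number of unique tags. The function name `top_tags` is hypothetical, and the three sample rows are made up for illustration; they are not the project data.

```python
from collections import Counter

def top_tags(rows, n=10):
    """Return (the n most common tags with counts, the number of unique tags).

    Each row is expected to look like [post_id, tag1, tag2, ...].
    """
    counts = Counter(tag for row in rows for tag in row[1:])
    return counts.most_common(n), len(counts)

# Made-up sample rows in the same shape as the cleaned data
rows = [
    ["19448711", "javascript", "onkeyup"],
    ["16042613", "java"],
    ["14944882", "java"],
]

top, unique = top_tags(rows, n=2)
print(top)
print(unique)
```

On the cluster, a comparable result could probably be obtained from the pair RDD with `test.map(lambda kv: (kv[0], 1)).reduceByKey(lambda a, b: a + b).takeOrdered(10, key=lambda kv: -kv[1])`, which keeps the counting in Spark and only brings the top 10 back to the driver.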


