### Let's read the data

In [42]:
! gsutil ls gs://pyspark-workshop/so-posts

gs://pyspark-workshop/so-posts/Posts.xml-aa
gs://pyspark-workshop/so-posts/Posts.xml-ab
gs://pyspark-workshop/so-posts/Posts.xml-ac
gs://pyspark-workshop/so-posts/Posts.xml-ad
gs://pyspark-workshop/so-posts/Posts.xml-ae
gs://pyspark-workshop/so-posts/Posts.xml-af
gs://pyspark-workshop/so-posts/Posts.xml-ag
gs://pyspark-workshop/so-posts/Posts.xml-ah
gs://pyspark-workshop/so-posts/Posts.xml-ai
gs://pyspark-workshop/so-posts/Posts.xml-aj
gs://pyspark-workshop/so-posts/Posts.xml-ak
gs://pyspark-workshop/so-posts/Posts.xml-al
gs://pyspark-workshop/so-posts/Posts.xml-am
gs://pyspark-workshop/so-posts/Posts.xml-an
gs://pyspark-workshop/so-posts/Posts.xml-ao
gs://pyspark-workshop/so-posts/Posts.xml-ap
gs://pyspark-workshop/so-posts/Posts.xml-aq
gs://pyspark-workshop/so-posts/Posts.xml-ar
gs://pyspark-workshop/so-posts/Posts.xml-as
gs://pyspark-workshop/so-posts/Posts.xml-at
gs://pyspark-workshop/so-posts/Posts.xml-au
gs://pyspark-workshop/so-posts/Posts.xml-av
gs://pyspa

In [2]:
lines = sc.textFile("gs://pyspark-workshop/so-posts/*")

In [33]:
# or a smaller piece of them
lines = sc.textFile("gs://pyspark-workshop/so-posts/Posts.xml-*a")

### Let's check what's inside these files...

In [34]:
lines.take(5)

['<?xml version="1.0" encoding="utf-8"?>',
 '<posts>',
 '  <row Id="4" PostTypeId="1" AcceptedAnswerId="7" CreationDate="2008-07-31T21:42:52.667" Score="421" ViewCount="28370" Body="&lt;p&gt;I want to use a track-bar to change a form\'s opacity.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;This is my code:&lt;/p&gt;&#xA;&#xA;&lt;pre&gt;&lt;code&gt;decimal trans = trackBar1.Value / 5000;&#xA;this.Opacity = trans;&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&#xA;&lt;p&gt;When I try to build it, I get this error:&lt;/p&gt;&#xA;&#xA;&lt;blockquote&gt;&#xA;  &lt;p&gt;Cannot implicitly convert type \'decimal\' to \'double\'.&lt;/p&gt;&#xA;&lt;/blockquote&gt;&#xA;&#xA;&lt;p&gt;I tried making &lt;code&gt;trans&lt;/code&gt; a &lt;code&gt;double&lt;/code&gt;, but then the control doesn\'t work. This code has worked fine for me in VB.NET in the past. &lt;/p&gt;&#xA;" OwnerUserId="8" LastEditorUserId="5455605" LastEditorDisplayName="Rich B" LastEditDate="2015-12-23T21:34:28.557" LastActivityDate="2016-07-17T20:33:18.217" Ti

### Only proper rows with posts

In [5]:
rows = lines.filter(lambda x: x.lstrip().startswith('<row'))

### Let's parse this mess...

In [9]:
import xml.etree.ElementTree as ET

In [10]:
# reuse the filtered rows from above; strip the indentation before parsing
parsed = rows.map(lambda x: ET.fromstring(x.lstrip()))

In [12]:
from pprint import pprint
pprint(parsed.take(2))

[<Element 'row' at 0x7f58e69ed868>, <Element 'row' at 0x7f58e69ed818>]


### Better:

In [49]:
pprint(parsed.map(lambda x: x.attrib).take(3))

[{'AcceptedAnswerId': '7',
  'AnswerCount': '13',
  'Body': "<p>I want to use a track-bar to change a form's opacity.</p>\n"
          '\n'
          '<p>This is my code:</p>\n'
          '\n'
          '<pre><code>decimal trans = trackBar1.Value / 5000;\n'
          'this.Opacity = trans;\n'
          '</code></pre>\n'
          '\n'
          '<p>When I try to build it, I get this error:</p>\n'
          '\n'
          '<blockquote>\n'
          "  <p>Cannot implicitly convert type 'decimal' to 'double'.</p>\n"
          '</blockquote>\n'
          '\n'
          '<p>I tried making <code>trans</code> a <code>double</code>, but '
          "then the control doesn't work. This code has worked fine for me in "
          'VB.NET in the past. </p>\n',
  'CommentCount': '3',
  'CommunityOwnedDate': '2012-10-31T16:42:47.213',
  'CreationDate': '2008-07-31T21:42:52.667',
  'FavoriteCount': '33',
  'Id': '4',
  'LastActivityDate': '2016-07-17T20:33:18.217',
  'LastEditDate': '2015-12-23T21:34
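
To see what `ET.fromstring` and `.attrib` do with a single line, the parsing step can be tried locally without a cluster. The sample row below is a shortened, made-up line in the same shape as the real data (its attribute values are illustrative assumptions). The point is that the parser decodes entities such as `&lt;` and `&#xA;` for you.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened row in the same shape as the lines in Posts.xml.
sample_line = (
    '  <row Id="4" PostTypeId="1" '
    'Body="&lt;p&gt;Hello&lt;/p&gt;&#xA;" '
    'Tags="&lt;c#&gt;&lt;winforms&gt;" />'
)

element = ET.fromstring(sample_line.lstrip())

# .attrib is a plain dict of attribute name -> string value,
# with XML entities already decoded.
row_attributes = element.attrib
print(row_attributes["Id"])
print(repr(row_attributes["Body"]))
print(row_attributes["Tags"])
```

Note that every value is still a string, so numeric fields such as `Score` need an explicit `int(...)` before any arithmetic.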

### Let's compute tag counts!

In [36]:
def parse_tags(x):
    return x[1:-1].split("><")

tags = parsed.map(lambda x: parse_tags(x.attrib['Tags']) if 'Tags' in x.attrib else [])
tags.take(5)

[['c#', 'winforms', 'type-conversion', 'decimal', 'opacity'],
 ['html', 'css', 'css3', 'internet-explorer-7'],
 [],
 ['c#', '.net', 'datetime'],
 ['c#', 'datetime', 'datediff', 'relative-time-span']]
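
The slicing in `parse_tags` relies on the parsed `Tags` attribute looking like `<c#><winforms>`: dropping the first and last character and splitting on `><` leaves the bare tag names. A small local check on made-up input also shows one edge case: an empty string yields `['']` rather than `[]`. The `parse_tags_safe` variant below is a hypothetical guard for that case, not part of the workshop code.

```python
def parse_tags(x):
    # "<a><b>" -> "a><b" -> ["a", "b"]
    return x[1:-1].split("><")

def parse_tags_safe(x):
    # Hypothetical variant: no tags for a missing or empty attribute.
    return parse_tags(x) if x else []

print(parse_tags("<c#><.net><datetime>"))
print(parse_tags(""))
print(parse_tags_safe(""))
```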

In [40]:
counts = tags.flatMap(lambda x: x).groupBy(lambda x: x).map(lambda x: (x[0], len(x[1])))

### Taking long? Go to http://cluster-1-m:8088 and explore the running job (if you're using the default cluster name).

### Do you know reduceByKey? If so, rewrite the previous statement to use reduceByKey instead of groupBy.

In [41]:
counts.sortBy(lambda x: x[1], ascending=False).take(10)

[('javascript', 1206322),
 ('java', 1128803),
 ('c#', 997465),
 ('php', 969707),
 ('android', 887010),
 ('jquery', 770009),
 ('python', 624016),
 ('html', 572556),
 ('c++', 467183),
 ('ios', 457172)]
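
If the `groupBy` version is slow, one reason is that it gathers every single tag occurrence for a key before `len` is applied. A common alternative is to emit `(tag, 1)` pairs and sum them per key, which in PySpark would be `tags.flatMap(lambda x: x).map(lambda t: (t, 1)).reduceByKey(lambda a, b: a + b)`. The sketch below imitates that pipeline locally on a few made-up tag lists so the logic can be checked without a cluster; `local_reduce_by_key` is a hypothetical helper, not a Spark API.

```python
from itertools import chain

def local_reduce_by_key(pairs, func):
    """Hypothetical stand-in for RDD.reduceByKey: merge each key's values with func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

# Made-up sample in the same shape as tags.take(5).
sample_tags = [["c#", "winforms"], ["html", "css"], [], ["c#", ".net"], ["c#", "datetime"]]

flat = chain.from_iterable(sample_tags)   # like flatMap(lambda x: x)
pairs = ((tag, 1) for tag in flat)        # like map(lambda t: (t, 1))
sample_counts = local_reduce_by_key(pairs, lambda a, b: a + b)

# Same ranking step as sortBy(lambda x: x[1], ascending=False).take(3).
top = sorted(sample_counts.items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top)
```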

### Shout if you're the first one here! Congrats!

Puzzles:

1. Can you compute how many times someone asked about Python this month? (Counting only posts with the python tag is fine.)
2. Can you measure Python's monthly popularity over the last year? Can you plot it?
3. Can you do the same, but only for main posts (questions)?
4. (*) Can you find the question that has the most posts attached?
  1. Do the same, but rank by the total score of the subposts.
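
As a starting point for the first two puzzles, one possible approach is to turn each matching post into a month key and count per month. The sketch below runs that key extraction locally on made-up attribute dicts. The helper names are hypothetical, and it assumes `CreationDate` always has the ISO form seen in the output above, so that its first seven characters are `YYYY-MM`. On the cluster, the same helpers could be passed to `filter` and `map` on `parsed.map(lambda x: x.attrib)`.

```python
from collections import Counter

def parse_tags(x):
    return x[1:-1].split("><")

def has_tag(attrib, tag):
    """True if the post has a Tags attribute that contains exactly this tag."""
    return "Tags" in attrib and tag in parse_tags(attrib["Tags"])

def month_of(attrib):
    """'2008-07-31T21:42:52.667' -> '2008-07' (assumes ISO timestamps)."""
    return attrib["CreationDate"][:7]

# Made-up posts in the shape of x.attrib.
sample_posts = [
    {"Id": "1", "CreationDate": "2016-07-01T10:00:00.000", "Tags": "<python><pandas>"},
    {"Id": "2", "CreationDate": "2016-07-15T12:30:00.000", "Tags": "<java>"},
    {"Id": "3", "CreationDate": "2016-08-02T09:00:00.000", "Tags": "<python>"},
    {"Id": "4", "CreationDate": "2016-08-03T09:00:00.000"},  # no Tags attribute
]

python_per_month = Counter(month_of(p) for p in sample_posts if has_tag(p, "python"))
print(sorted(python_per_month.items()))
```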

In [50]:
# if you hate XML (you do), then save it as JSON on HDFS!
import json
parsed.map(lambda x: json.dumps(x.attrib)).saveAsTextFile("posts.jsons")
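
Each saved line is one JSON object, so reading the data back in a later session should only need `json.loads` per line, for example `sc.textFile("posts.jsons").map(json.loads)`. A local round trip on a made-up attribute dict shows why this works: `json.dumps` escapes the newlines inside `Body`, so every post stays on a single line.

```python
import json

# Made-up attributes in the shape of x.attrib.
attributes = {"Id": "4", "Tags": "<c#><winforms>", "Body": "<p>Hello</p>\n"}

line = json.dumps(attributes)   # what would be written for this post
restored = json.loads(line)     # what a later job would get back

print(line)
print(restored == attributes)
```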