In [1]:
%matplotlib inline
import matplotlib
import seaborn as sns
matplotlib.rcParams['savefig.dpi'] = 144

In [2]:
import grader

# Spark Miniproject

StackOverflow is a collaboratively edited question-and-answer site originally focused on programming topics. Because the site tracks a wide variety of features, including several feedback metrics, it allows for some open-ended analysis of user behavior.

StackExchange (the parent organization) provides an anonymized [data dump](https://archive.org/details/stackexchange), and we'll use Spark to perform data manipulation, analysis, and machine learning on this dataset. As a side note, there's also an online data explorer which allows you to query the data interactively.

*Consider*: Do we need to use Spark to work with this dataset? What are our alternatives?

## Workflow

You may complete this project using the Python or Scala APIs. Most questions can be done locally; however, in some cases you may want to use cloud services. See the appropriate lecture notebooks for information on how to use cloud services.

Python example:

1. Edit source code in your main.py file, with classes in a separate classes.py. (Class definitions need to be written in a separate file and then included at runtime.)
1. Run locally on a chunk using e.g. `$SPARK_HOME/bin/spark-submit --py-files src/classes.py src/main.py data/stats results/stats/`
1. Run on GCP once your testing and development are done.

Scala example:

1. Edit source code in Main.scala
1. Run the command `sbt package` from the root directory of the project
1. Use spark-submit locally on a chunk: this means adding a flag like `--master local[2]` to the spark-submit command.
1. Run on GCP once your testing and development are done.

General tips:
* SBT has some nice features, for example continuous build and test, which can greatly speed up your development.
* Try `cat output_dir/* | sort -n -t , -k 1.2 -o sorted_output` to concatenate your output files, which will also be in part-xxxxx format.
* You can access an interactive Spark/Scala REPL with `$SPARK_HOME/bin/spark-shell`.
* You can access an interactive PySpark shell with `$SPARK_HOME/bin/pyspark`.

## Accessing the data

The data is available on S3 (s3://dataincubator-course/spark-stack-data). There are three subfolders, allUsers, allPosts, and allVotes, which contain chunked and gzipped XML in the following format:

```
<row Body="&lt;p&gt;I always validate my web pages, and I recommend you do the same BUT many large company websites DO NOT and cannot validate because the importance of the website looking exactly the same on all systems requires rules to be broken. &lt;/p&gt;&#10;&#10;&lt;p&gt;In general, valid websites help your page look good even on odd configurations (like cell phones) so you should always at least try to make it validate.&lt;/p&gt;&#10;" CommentCount="0" CreationDate="2008-10-12T20:26:29.397" Id="195995" LastActivityDate="2008-10-12T20:26:29.397" OwnerDisplayName="Eric Wendelin" OwnerUserId="25066" ParentId="195973" PostTypeId="2" Score="0" />
```

A full schema can be found [here](https://ia801500.us.archive.org/8/items/stackexchange/readme.txt).

Data from the much smaller stats.stackexchange.com is available in the same format on S3 (s3://dataincubator-course/spark-stats-data). This site, Cross-Validated, will be used below in some instances to avoid working with the full dataset for every question.

You can either get the data by running the appropriate S3 commands in the terminal, or by running this block for the smaller stats data set:

In [4]:
!mkdir -p spark-stats-data
!aws s3 sync --exclude '*' --include 'all*' s3://dataincubator-course/spark-stats-data/ ./spark-stats-data

download: s3://dataincubator-course/spark-stats-data/allPosts/part-00007.xml.gz to spark-stats-data/allPosts/part-00007.xml.gz
download: s3://dataincubator-course/spark-stats-data/allPosts/part-00002.xml.gz to spark-stats-data/allPosts/part-00002.xml.gz
download: s3://dataincubator-course/spark-stats-data/allPosts/part-00004.xml.gz to spark-stats-data/allPosts/part-00004.xml.gz
download: s3://dataincubator-course/spark-stats-data/allUsers/part-00001.xml.gz to spark-stats-data/allUsers/part-00001.xml.gz
download: s3://dataincubator-course/spark-stats-data/allPosts/part-00008.xml.gz to spark-stats-data/allPosts/part-00008.xml.gz
download: s3://dataincubator-course/spark-stats-data/allPosts/part-00003.xml.gz to spark-stats-data/allPosts/part-00003.xml.gz
download: s3://dataincubator-course/spark-stats-data/allPosts/part-00005.xml.gz to spark-stats-data/allPosts/part-00005.xml.gz
download: s3://dataincubator-course/spark-stats-data/allPosts/part-00001.xml.gz to spark-stats-data/allPosts/pa

And to get the much larger full data set (be warned, this can take 20 or more minutes, so you may want to run it in the terminal to avoid locking up the notebook):

In [3]:
!mkdir -p spark-stack-data
!aws s3 sync  --exclude '*' --include 'all*' s3://dataincubator-course/spark-stack-data/ ./spark-stack-data




## Data input and parsing

Some rows are split across multiple lines; these can be discarded. Malformed XML can also be ignored. It is enough to simply skip problematic rows; the loss of data will not significantly impact our results on a dataset this large.

You will need to handle XML parsing yourself, using the `\` selector in Scala or something like `lxml.etree` in Python. *Warning*: The built-in `xml.etree.ElementTree` behaves differently, and its results don't correspond perfectly with the Scala equivalent.

To make your code more flexible, it's also recommended to incorporate command-line arguments that specify the location of the input data and where output should be written.

The goal should be to have a parsing function that can be applied to the input data to access any XML element desired. It is suggested to use a class structure so that you can create RDDs of Posts, Votes, Users, etc.

``` scala
// Command line arguments in Scala

object Main {
  def main(args: Array[String]): Unit = {
    val inputDir = args(0)
    val outputDir = args(1)
    // ...
  }
}
```

``` python
# Command line arguments using sys.argv or argparse in Python
import sys

if __name__ == '__main__':
    input_dir, output_dir = sys.argv[1], sys.argv[2]
    main(input_dir, output_dir)
```
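
For the argparse route, a minimal sketch might look like the following. The helper name `parse_args` and the argument names `input_dir` and `output_dir` are illustrative assumptions, not part of the project template.

```python
import argparse


def parse_args(argv=None):
    """Parse the input and output locations (argv=None reads sys.argv)."""
    parser = argparse.ArgumentParser(description="Spark miniproject job")
    # Positional arguments: where to read the XML chunks, where to write results.
    parser.add_argument("input_dir", help="directory holding the gzipped XML chunks")
    parser.add_argument("output_dir", help="directory where results are written")
    return parser.parse_args(argv)


# Passing an explicit list makes the parser easy to try out in a notebook.
args = parse_args(["data/stats", "results/stats/"])
```

Inside a submitted script you would call `parse_args()` without arguments and hand `args.input_dir` and `args.output_dir` to your own `main` function.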

Dates are parsed by default using the Long data type and Unix time (epoch time). In Java/Scala, a given timestamp represents the number of milliseconds since 1970-01-01T00:00:00Z. Also be wary of integer overflow when building Long values from Int literals. For example, these two are not equal, because the first product is computed in 32-bit Int arithmetic and overflows before it is widened to Long:

`val year: Long = 365 * 24 * 60 * 60 * 1000`

`val year: Long = 365 * 24 * 60 * 60 * 1000L`
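
To make the difference concrete, the Python sketch below simulates the 32-bit wrap-around that the JVM applies to the first expression. Python integers never overflow, so `wrap_int32` is a hypothetical helper written only for illustration. The second helper, `to_epoch_millis`, shows one way to turn a timestamp from the dump into epoch milliseconds, assuming the timestamps are in UTC.

```python
import calendar
from datetime import datetime


def wrap_int32(value):
    """Simulate JVM 32-bit Int wrap-around for an arbitrary Python integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def to_epoch_millis(timestamp):
    """Convert e.g. '2008-10-12T20:26:29.397' to epoch milliseconds (UTC assumed)."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%f")
    return calendar.timegm(parsed.timetuple()) * 1000 + parsed.microsecond // 1000


millis_per_year = 365 * 24 * 60 * 60 * 1000   # exact product, as in the 1000L version
overflowed = wrap_int32(millis_per_year)      # what pure Int arithmetic would yield
```

Here `millis_per_year` is 31536000000, while `overflowed` is 1471228928, a value that silently looks plausible but is wrong.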

## Questions

## 1. bad_xml

A simple question to test your parsing code. Create an RDD of Post objects where each Post is a valid row of XML from the Cross-Validated (stats.stackexchange.com) *allPosts* dataset.

We are going to take several shortcuts to speed up and simplify our computations. First, your parsing function should only attempt to parse rows that start with `  <row`, as these denote actual data entries. This should be done in Spark as the data is being read in from disk, without any pre-Spark processing.

Return the total number of XML rows that started with `  <row` and were subsequently **rejected** during your processing. Note that the text is Unicode and contains non-ASCII characters. You may need to re-encode to UTF-8 (depending on your XML parser).

Note that this cleaned dataset will be used for all subsequent questions.

*Question*: Can you figure out what filters you need to put in place to avoid throwing parsing errors entirely?
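
The counting logic can be prototyped without Spark on a handful of lines. The sketch below is one possible approach, not the reference solution: `is_row`, `parses`, and `count_rejected` are hypothetical names, the sample lines are made up, and the fallback to the standard-library parser exists only so the sketch runs where lxml is not installed (the two parsers may disagree on edge cases).

```python
try:
    from lxml import etree  # parser recommended by the project
except ImportError:
    import xml.etree.ElementTree as etree  # fallback so the sketch still runs


def is_row(line):
    """Keep only lines that look like data entries (leading whitespace ignored)."""
    return line.lstrip().startswith("<row")


def parses(line):
    """Return True if the line is a well-formed XML element."""
    try:
        etree.fromstring(line.strip())
        return True
    except Exception:  # each parser raises its own syntax-error type
        return False


def count_rejected(lines):
    """Count candidate rows that fail to parse."""
    return sum(1 for line in lines if is_row(line) and not parses(line))


sample = [
    '<?xml version="1.0" encoding="utf-8"?>',
    "<posts>",
    '  <row Id="1" PostTypeId="1" Score="3" />',
    '  <row Id="2" PostTypeId="2" Body="text split across',
    'two lines" />',
    "</posts>",
]
rejected = count_rejected(sample)
```

With an RDD of lines, the same two predicates could be chained as `lines.filter(is_row).filter(lambda l: not parses(l)).count()`.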

In [5]:
import random, re, time
from datetime import datetime
from pyspark import SparkContext
#sc.stop()
sc = SparkContext("local[*]", "temp")
print sc.version

2.0.1


In [6]:
import os
def localpath(path):
    return 'file://' + str(os.path.abspath(os.path.curdir)) + '/' + path

In [10]:
lines = sc.textFile(localpath('spark-stats-data/allPosts/'))
totalLines = lines.count()
print "total lines: %d" % totalLines

total lines: 212990


In [7]:
from lxml import etree

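def find_bad(s):
    """Classify a line: 0 = parses as XML, 1 = unparseable <row>, 2 = any other unparseable line."""
    try:
        etree.fromstring(s)
        return 0
    except Exception:
        # any parse failure counts as a bad line
        if '<row ' in s:
            return 1
        else:
            return 2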

In [22]:
bad_counts = lines.map(find_bad)\
                  .filter(lambda x: x==1)\
                  .count()

In [23]:
print bad_counts

781


In [24]:
def bad_xml():
    return 781

grader.score(question_name='spark__bad_xml', func=bad_xml)  #bad_xml

Your score:  1


## 2. upvote_percentage

Each post on StackExchange can be upvoted, downvoted, and favorited. One "sanity check" we can do is to look at the ratio of upvotes to downvotes (referred to as "UpMod" and "DownMod" in the schema) as a function of how many times the post has been favorited.

https://ia600500.us.archive.org/22/items/stackexchange/readme.txt

    `VoteTypeId == "2"` -> upvote
    `VoteTypeId == "3"` -> downvote
    `VoteTypeId == "5"` -> favorite

You might hypothesize, for example, that _posts with more favorites should have a higher upvote/downvote ratio._

Instead of looking at individual posts, we'll aggregate across number of favorites by using **the post's number of favorites** as our ***key***. Since we're computing ratios, bundling together all posts with the same number of favorites effectively averages over them.  Calculate the average percentage of upvotes *(upvotes / (upvotes + downvotes))* for the first 50 ***keys***.

Do the analysis on the smaller Cross-Validated dataset.
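
The aggregation can be sketched in plain Python on a toy list of votes before moving it to Spark. Everything below is illustrative: `upvote_ratio_by_favorites` is a hypothetical helper, the sample votes are made up, and it assumes that posts which were never favorited belong under key 0.

```python
from collections import Counter, defaultdict

UPVOTE, DOWNVOTE, FAVORITE = "2", "3", "5"


def upvote_ratio_by_favorites(votes):
    """votes: iterable of (post_id, vote_type_id) pairs.

    Returns a sorted list of (favorite_count, upvotes / (upvotes + downvotes)).
    """
    ups, downs, favs = Counter(), Counter(), Counter()
    post_ids = set()
    for post_id, vote_type in votes:
        post_ids.add(post_id)
        if vote_type == UPVOTE:
            ups[post_id] += 1
        elif vote_type == DOWNVOTE:
            downs[post_id] += 1
        elif vote_type == FAVORITE:
            favs[post_id] += 1

    # Pool the up/down votes of all posts that share a favorite count.
    # Posts that were never favorited land under key 0.
    pooled = defaultdict(lambda: [0, 0])
    for post_id in post_ids:
        pooled[favs[post_id]][0] += ups[post_id]
        pooled[favs[post_id]][1] += downs[post_id]

    return sorted(
        (fav_count, float(up) / (up + down))
        for fav_count, (up, down) in pooled.items()
        if up + down > 0
    )


toy_votes = [
    ("a", "2"), ("a", "2"), ("a", "3"), ("a", "5"),
    ("b", "2"), ("b", "5"),
    ("c", "2"), ("c", "3"),
]
ratios = upvote_ratio_by_favorites(toy_votes)
```

Posts `a` and `b` each have one favorite, so their votes are pooled into a single ratio of 3 / 4, while post `c` falls under key 0 with a ratio of 1 / 2.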

#### Checkpoints
* Total upvotes: 313,819
* Total downvotes: 13,019
* Mean of first 50 keys (averaging the keys themselves): 24.76

      <row CreationDate="2010-07-19T00:00:00.000" Id="88" PostId="22" VoteTypeId="2" />

      <row CreationDate="2010-07-19T00:00:00.000" Id="89" PostId="22" UserId="22" VoteTypeId="5" />

In [65]:
def parse_vote(s):
    root = etree.fromstring(s)
    p = root.xpath("//row")[0].attrib['PostId']
    v = root.xpath("//row")[0].attrib['VoteTypeId']
    return (p, v)

In [110]:
lines = sc.textFile(localpath('spark-stats-data/allVotes/'))

In [66]:
s = '<row CreationDate="2010-07-19T00:00:00.000" Id="89" PostId="22" UserId="22" VoteTypeId="5" />'
root = etree.fromstring(s)
print root.xpath("//row")[0].attrib['PostId']
print root.xpath("//row")[0].attrib['VoteTypeId']

22
5


In [67]:
parse_vote(s)

('22', '5')

In [98]:
ups =  lines.filter(lambda x: find_bad(x)==0)\
            .map(parse_vote)\
            .filter(lambda x: x[1] == '2')\
            .map(lambda x: (x[0], 1)) \
            .countByKey()

In [122]:
ups.items()[:5]

[('89370', 2), ('89372', 6), ('89373', 1), ('89374', 1), ('89375', 4)]

In [76]:
ups1 = lines.filter(lambda x: find_bad(x)==0)\
            .map(parse_vote)\
            .filter(lambda x: x[1] == '2')
ups1.count()

313819

In [123]:
downs1 = lines.filter(lambda x: find_bad(x)==0)\
              .map(parse_vote)\
              .filter(lambda x: x[1] == '3')
downs1.count()             

13019

In [103]:
downs = lines.filter(lambda x: find_bad(x)==0)\
              .map(parse_vote)\
              .filter(lambda x: x[1] == '3')\
              .map(lambda x: (x[0], 1)) \
              .countByKey()

In [113]:
favs  = lines.filter(lambda x: find_bad(x)==0)\
             .map(parse_vote)\
             .filter(lambda x: x[1] == '5')\
             .map(lambda x: (x[0], 1)) \
             .reduceByKey(lambda x, y: x + y) \
             .map(lambda x: (x[1],[x[0],])) \
             .reduceByKey(lambda x, y: x + y) \
             .sortByKey(ascending=True) 

In [121]:
favlist = favs.take(50)
favlist[45:]

[(48, ['213', '62092']),
 (49, ['5278', '12670', '51718']),
 (50, ['104500']),
 (52, ['1576', '12386']),
 (54, ['1444', '11659'])]

In [118]:
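q2 = []
avg_favs = 0
for favc, pidlist in favlist:
    avg_favs += favc
    upcount = 0
    downcount = 0
    for pid in pidlist:
        # countByKey returns a defaultdict, so posts without votes count as 0
        upcount += ups[pid]
        downcount += downs[pid]
    q2.append((favc, upcount*1.0/(upcount+downcount)))

# favlist starts at key 1: posts that were never favorited (key 0) are missing.
# Including key 0 and dropping key 54 would give the 24.76 checkpoint instead of 25.84.
print avg_favs/50.0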

25.84


In [116]:
q2

[(1, 0.971349277609991),
 (2, 0.9858878575201871),
 (3, 0.9899873257287706),
 (4, 0.990321980271729),
 (5, 0.9925945517058979),
 (6, 0.9948542024013722),
 (7, 0.9908026755852842),
 (8, 0.9944289693593314),
 (9, 0.9967931587386424),
 (10, 0.9916376306620209),
 (11, 0.9915174363807728),
 (12, 0.9958123953098827),
 (13, 0.9972789115646259),
 (14, 0.9939540507859734),
 (15, 0.9929245283018868),
 (16, 1.0),
 (17, 1.0),
 (18, 0.9985693848354793),
 (19, 0.997867803837953),
 (20, 0.9969512195121951),
 (21, 0.9944029850746269),
 (22, 0.9977973568281938),
 (23, 0.9952038369304557),
 (24, 1.0),
 (25, 1.0),
 (26, 0.9841772151898734),
 (27, 0.989010989010989),
 (28, 0.9951690821256038),
 (29, 0.9972826086956522),
 (30, 0.9954337899543378),
 (31, 0.9939577039274925),
 (32, 1.0),
 (33, 1.0),
 (34, 1.0),
 (35, 1.0),
 (36, 1.0),
 (37, 0.990990990990991),
 (38, 0.9937888198757764),
 (39, 0.9918032786885246),
 (40, 1.0),
 (41, 1.0),
 (42, 1.0),
 (44, 1.0),
 (45, 1.0),
 (47, 1.0),
 (48, 1.0),
 (49, 1.0),


In [117]:
def upvote_percentage():
    return  q2#[(20, 0.9952153110047847)] * 50

grader.score(question_name='spark__upvote_percentage', func=upvote_percentage)

Your score:  0.98


## 3. answer_percentage

Investigate the correlation between a user's reputation and the kind of posts they make. For the 99 users with the highest reputation, single out posts which are either questions or answers and look at the percentage of these posts that are answers: *(answers / (answers + questions))*. 

Return a tuple of their **user ID** and this fraction.

You should also return (-1, fraction) to represent the case where you average over all users (so you will return 100 entries total).

Again, you only need to run this on the Cross-Validated (stats.stackexchange.com) dataset.
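
As a rough sketch of the final assembly step, the function below takes per-user counts that have already been collected into dictionaries. The name `answer_fractions` and the toy inputs are assumptions for illustration. Skipping a top user who has neither questions nor answers is also a choice made here, not something the task specifies.

```python
def answer_fractions(reputation, questions, answers, top_n=99):
    """reputation: {user_id: reputation}; questions/answers: {user_id: post count}.

    Returns (user_id, answers / (answers + questions)) for the top_n users by
    reputation, followed by (-1, fraction) computed over all users.
    """
    top_users = sorted(reputation, key=reputation.get, reverse=True)[:top_n]
    result = []
    for user_id in top_users:
        a = answers.get(user_id, 0)
        q = questions.get(user_id, 0)
        if a + q > 0:  # users with no questions or answers are skipped
            result.append((user_id, float(a) / (a + q)))

    # The overall entry pools every user, not just the top ones.
    total_a = sum(answers.values())
    total_q = sum(questions.values())
    result.append((-1, float(total_a) / (total_a + total_q)))
    return result


toy = answer_fractions(
    reputation={1: 100, 2: 50, 3: 10},
    questions={1: 1, 2: 3, 3: 4},
    answers={1: 3, 2: 1},
    top_n=2,
)
```

In the toy call, user 3 is outside the top two but still contributes four questions to the overall `(-1, fraction)` entry.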

#### Checkpoints
* Total questions: 52,060
* Total answers: 55,304
* Top 99 users' average reputation: 11893.464646464647

In [172]:
def parse_user(s):
    root = etree.fromstring(s)
    if 'Reputation' not in root.attrib or 'Id' not in root.attrib :
        return (None, None)
    else:
        x1 = root.xpath("//row")[0].attrib['Reputation']
        x2 = root.xpath("//row")[0].attrib['Id']
        return (int(x1), x2)

In [173]:
def parse_post_qa(s):
    root = etree.fromstring(s)
    if 'OwnerUserId' not in root.attrib or 'PostTypeId' not in root.attrib :
        return (None, None)
    else:
        x1 = root.xpath("//row")[0].attrib['OwnerUserId']
        x2 = root.xpath("//row")[0].attrib['PostTypeId']
        return (x1, x2)

In [171]:
s='  <row Body="&lt;p&gt;Perhaps you could extend the idea of using the lower bound of the confidence interval for sorting: you could throw away items that have a low &lt;em&gt;upper&lt;/em&gt; bound. The items with only a few votes will have pretty high upper bounds; the lowest upper bounds will correspond to the lowest &quot;quality&quot; items.&lt;/p&gt;&#10;" CommentCount="5" CreationDate="2011-05-16T18:27:42.933" Id="10894" LastActivityDate="2011-05-16T18:27:42.933" OwnerDisplayName="Aniko" OwnerUserId="279" ParentId="10882" PostTypeId="2" Score="2" />'
root = etree.fromstring(s)
if 'OwnerUserId' not in root.attrib:
    print 'NO'
else:
    print 'YES'

YES


In [174]:
plines = sc.textFile(localpath('spark-stats-data/allPosts/'))

In [175]:
ulines = sc.textFile(localpath('spark-stats-data/allUsers/'))

In [176]:
users = ulines.filter(lambda x: find_bad(x)==0)\
              .map(parse_user)\
              .sortByKey(ascending=False)\
              .take(99)

In [134]:
users[:5]

[(100976, '919'),
 (92624, '805'),
 (47334, '686'),
 (46907, '7290'),
 (32283, '930')]

In [177]:
postq = plines.filter(lambda x: find_bad(x)==0)\
              .map(parse_post_qa)\
              .filter(lambda x: x[0] != None) \
              .filter(lambda x: x[1] == '1') \
              .map(lambda x: (x[0], 1)) \
              .countByKey()

In [179]:
pp = plines.filter(lambda x: find_bad(x)==0)\
              .map(parse_post_qa) \
              .filter(lambda x: x[0] != None) \
              .filter(lambda x: x[1] == '1') \
              .map(lambda x: (x[0], 1)) \
              .reduceByKey(lambda x, y: x + y).take(5)

In [180]:
pp

[('23994', 1), ('61518', 2), ('35549', 1), ('69951', 1), ('11549', 3)]

In [181]:
posta = plines.filter(lambda x: find_bad(x)==0)\
              .map(parse_post_qa)\
              .filter(lambda x: x[0] != None) \
              .filter(lambda x: x[1] == '2') \
              .map(lambda x: (x[0], 1)) \
              .countByKey()

In [196]:
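q3 = []
asum = 0
qsum = 0
for rep, uid in users:
    acount = posta[uid]
    qcount = postq[uid]
    ratio = acount*1.0/(acount+qcount)
    asum += acount
    qsum += qcount
    q3.append((int(uid), ratio))

# NOTE: asum and qsum only cover the top 99 users, whereas the (-1, fraction)
# entry is meant to average over all users; summing posta.values() and
# postq.values() instead would cover every user with an OwnerUserId.
q3.append((-1, asum*1.0/(asum+qsum)))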

In [194]:
q3[:5]

[(919, 0.996694214876033),
 (805, 0.9959749552772809),
 (686, 0.9803049555273189),
 (7290, 0.9918887601390498),
 (930, 0.9817351598173516)]

In [197]:
def answer_percentage():
    return q3#[(7071, 0.9107142857142857)] * 100

grader.score(question_name='spark__answer_percentage', func=answer_percentage)

Your score:  0.99


## 4. post_counts

If we use the **total number of posts** made on the site as a metric for tenure, we can look at the differences between "younger" and "older" users. You can imagine there might be many interesting features - for now just return **the top 100 post counts among all users (of all types of posts) and the average reputation for every user who has that count.**

In other words, aggregate the cases where multiple users have the same post count.
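
The grouping step can be sketched on small dictionaries as follows. `reputation_by_post_count` is a hypothetical helper and the inputs are toy data. Users who posted but have no reputation record are ignored here, which is an assumption.

```python
from collections import defaultdict


def reputation_by_post_count(post_counts, reputation, top_n=100):
    """post_counts: {user_id: number of posts}; reputation: {user_id: reputation}.

    Returns the top_n (post_count, average reputation) pairs, largest count first.
    """
    grouped = defaultdict(list)
    for user_id, count in post_counts.items():
        if user_id in reputation:  # skip posters without a user record
            grouped[count].append(reputation[user_id])

    # Post counts are unique dictionary keys, so sorting only compares the counts.
    top = sorted(grouped.items(), reverse=True)[:top_n]
    return [(count, float(sum(reps)) / len(reps)) for count, reps in top]


toy_counts = reputation_by_post_count(
    post_counts={"a": 5, "b": 5, "c": 2, "d": 9},
    reputation={"a": 10, "b": 20, "c": 7, "d": 100},
)
```

Users `a` and `b` share a post count of 5, so their reputations are averaged into a single entry.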

#### Checkpoints
* Mean of top 100 post counts: 281.51

In [214]:
users_rep = ulines.filter(lambda x: find_bad(x)==0)\
              .map(parse_user) \
              .filter(lambda x: x[0] != None) \
              .map(lambda x: (x[1], x[0])) \
              .sortByKey(ascending=True)\
              .collect()

In [215]:
users_dic = dict(users_rep) #{userid: reputation}

In [216]:
posts = plines.filter(lambda x: find_bad(x)==0)\
              .map(parse_post_qa)\
              .filter(lambda x: x[0] != None) \
              .map(lambda x: (x[0], 1)) \
              .reduceByKey(lambda x, y: x + y) \
              .map(lambda x: (x[1],[x[0],])) \
              .reduceByKey(lambda x, y: x + y) \
              .sortByKey(ascending=False) \
              .take(100)


In [219]:
q4=[]
avg=0
for pcount, idlist in posts:
    #print pcount, idlist 
    avg += pcount
    avg_rep = 0
    for uid in idlist:
        avg_rep += users_dic[uid]
    avg_rep = avg_rep*1.0 / len(idlist)
    q4.append((pcount, avg_rep))
q4   

[(2325, 92624.0),
 (1663, 47334.0),
 (1287, 100976.0),
 (1018, 46907.0),
 (965, 23102.0),
 (695, 27599.0),
 (570, 22706.0),
 (558, 25406.0),
 (495, 9294.0),
 (494, 23610.0),
 (469, 10728.0),
 (452, 32283.0),
 (424, 16854.0),
 (419, 17719.0),
 (395, 14100.0),
 (390, 20315.0),
 (369, 19312.0),
 (363, 6149.0),
 (350, 9047.0),
 (345, 14768.0),
 (343, 13557.0),
 (339, 11795.0),
 (338, 10045.0),
 (304, 16131.0),
 (301, 6352.0),
 (297, 20133.0),
 (292, 10552.0),
 (290, 8285.5),
 (287, 11083.0),
 (282, 10383.0),
 (277, 11830.0),
 (269, 7729.0),
 (268, 11989.0),
 (267, 7971.0),
 (265, 7765.0),
 (257, 13078.0),
 (248, 7608.0),
 (247, 12496.5),
 (239, 1.0),
 (234, 11307.5),
 (228, 11662.0),
 (226, 5775.0),
 (218, 5849.0),
 (211, 7552.0),
 (208, 6208.0),
 (202, 9530.0),
 (195, 9619.0),
 (193, 6682.0),
 (188, 12098.0),
 (187, 8013.0),
 (185, 4149.0),
 (184, 5762.0),
 (177, 5042.0),
 (173, 10394.0),
 (168, 7725.0),
 (167, 3957.0),
 (165, 6694.0),
 (164, 1544.0),
 (163, 6888.0),
 (161, 6367.0),
 (159

In [224]:
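def post_counts():
    # Alternative join-based version. isRow is defined in section 5;
    # isUser, get_post and get_user are helper functions not shown here.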
    postpath = localpath('./spark-stats-data/allPosts/*')
    userpath = localpath('./spark-stats-data/allUsers/*')
    
    allposts = sc.textFile(postpath) \
            .filter(isRow) \
            .map(get_post) \
            .map( lambda x: (x.user, 1) ) \
            .reduceByKey( lambda x, y: x+y )
        
    users = sc.textFile(userpath) \
            .filter(isRow) \
            .filter(isUser) \
            .map(get_user) \
            .map( lambda x: (x.id, x.reputation) )
    
    result = users.join(allposts) \
                .map( lambda (u, (r, p)): (p, (r,1)) ) \
                .reduceByKey( lambda x, y:(x[0]+y[0], x[1]+y[1]) ) \
                .map( lambda (p, (r, cnt)): (p, 1.0*r/cnt) ) \
                .sortByKey(False) \
                .take(100)
    return result

In [220]:
def post_counts():
    return q4# [(118, 3736.5)] * 100

grader.score(question_name='spark__post_counts', func=post_counts)

Your score:  1.0


## 5. quick_answers

How long do you have to wait to get your question answered? Look at the **set of ACCEPTED answers which are posted less than 3 hours after question creation**. What is the **average number of these "quick answers" as a function of the hour of day the question was asked?** Normalize by the total number of accepted answers garnered by questions posted in a given hour, just as the numerator counts the quick accepted answers garnered by questions posted in that hour, e.g. (quick accepted answers when question hour is 15 / total accepted answers when question hour is 15).

Return a list whose ith element corresponds to the ith hour (e.g. 0 -> midnight, 1 -> 1:00, etc.).

*Note*: When using the `SimpleDateFormat` class in Scala, it's important to account for your machine's local time zone. Our policy will be to use GMT: `hourFormat.setTimeZone(TimeZone.getTimeZone("GMT"))`

*Consider*: What biases are present in our result that we don't account for? How should we handle this?
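
The per-hour normalization can be sketched on a few timestamp pairs without Spark. The helper names below are hypothetical and the timestamps are toy values. Treating an accepted answer that is timestamped before its question as not quick is a choice made for this sketch, not something the task specifies.

```python
from datetime import datetime, timedelta

QUICK_WINDOW = timedelta(hours=3)


def parse_timestamp(text):
    """Parse a dump timestamp such as '2011-05-17T21:49:53.323' (no zone attached)."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%f")


def hour_and_quick_flag(question_time, answer_time):
    """Return (hour the question was asked, 1 if the accepted answer was quick else 0)."""
    asked = parse_timestamp(question_time)
    answered = parse_timestamp(answer_time)
    is_quick = timedelta(0) <= answered - asked < QUICK_WINDOW
    return asked.hour, int(is_quick)


def quick_fraction_by_hour(pairs):
    """pairs: iterable of (question_time, accepted_answer_time) strings."""
    quick, total = [0] * 24, [0] * 24
    for question_time, answer_time in pairs:
        hour, flag = hour_and_quick_flag(question_time, answer_time)
        quick[hour] += flag
        total[hour] += 1
    # Hours without any accepted answers are reported as 0.0.
    return [float(q) / t if t else 0.0 for q, t in zip(quick, total)]


fractions = quick_fraction_by_hour([
    ("2011-05-17T21:49:53.323", "2011-05-17T22:10:00.000"),  # about 20 minutes later
    ("2011-05-17T21:00:00.000", "2011-05-18T03:00:00.000"),  # 6 hours later
])
```

Because the timestamps carry no zone information, the hour is taken as written in the dump, which sidesteps the local time zone issue mentioned in the note above.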

#### Checkpoints
* Total quick accepted answers: 8,468
* Total accepted answers: 17,096

In [8]:
plines = sc.textFile(localpath('spark-stats-data/allPosts/'))

In [9]:
def parse_post_quicka(s):
    """Return (Id, (PostTypeId, linked id, CreationDate)), or (None, None) if the row is unusable."""
    attrib = etree.fromstring(s).attrib
    if 'Id' not in attrib or 'PostTypeId' not in attrib or 'CreationDate' not in attrib:
        return (None, None)
    post_type = attrib['PostTypeId']
    if post_type == '1' and 'AcceptedAnswerId' in attrib:
        # question: link to its accepted answer
        linked_id = attrib['AcceptedAnswerId']
    elif post_type == '2' and 'ParentId' in attrib:
        # answer: link to its parent question
        linked_id = attrib['ParentId']
    else:
        # other post types previously left the linked id undefined and raised an error
        return (None, None)
    return (attrib['Id'], (post_type, linked_id, attrib['CreationDate']))

In [223]:
s='  <row AcceptedAnswerId="10925" AnswerCount="1" Body="&lt;p&gt;I am looking for a statistical method to define the variance/ diversity / inequality in a set of observations. &lt;/p&gt;&#10;&#10;&lt;p&gt;For example: &#10;If I have following (n=4) observations using 4 data points, here the diversity/variance/inequality is zero. &lt;/p&gt;&#10;&#10;&lt;pre&gt;&lt;code&gt;A-B-C-D&#10;A-B-C-D&#10;A-B-C-D&#10;A-B-C-D &#10;&lt;/code&gt;&lt;/pre&gt;&#10;&#10;&lt;p&gt;In the following  (n=4) observations using 7 data points, there is x amount of diversity / variance / inequality in the observations. What method I can use to derive a diversity / variance score ?  &lt;/p&gt;&#10;&#10;&lt;pre&gt;&lt;code&gt;A-B-B-D&#10;Q-D-C-B&#10;B-C-B-A&#10;B-Z-F-A&#10;&lt;/code&gt;&lt;/pre&gt;&#10;&#10;&lt;p&gt;My real data is derived from a database of 10K data points and observations are usually &gt;= 2 - 1000s. Currently am looking at &lt;a href=&quot;http://en.wikipedia.org/wiki/Gini_coefficient#Advantages_of_Gini_coefficient_as_a_measure_of_inequality&quot; rel=&quot;nofollow&quot;&gt;Gini Impurity&lt;/a&gt; as a potential method, Do you think Gini Impurity is better for this type data ? Do you know about any better method which can derive a score by considering the order of the data ? Looking forward for your suggestions. &lt;/p&gt;&#10;" CommentCount="4" CreationDate="2011-05-17T21:49:53.323" Id="10920" LastActivityDate="2011-05-19T06:10:19.527" LastEditDate="2011-05-18T21:39:52.873" LastEditorUserId="529" OwnerUserId="529" PostTypeId="1" Score="2" Tags="&lt;variance&gt;" Title="Statistical method to quantify  diversity / variance / inequality" ViewCount="330" />'
root = etree.fromstring(s)
root.attrib.keys()

['AcceptedAnswerId',
 'AnswerCount',
 'Body',
 'CommentCount',
 'CreationDate',
 'Id',
 'LastActivityDate',
 'LastEditDate',
 'LastEditorUserId',
 'OwnerUserId',
 'PostTypeId',
 'Score',
 'Tags',
 'Title',
 'ViewCount']

In [225]:
s=' <row Body="&lt;p&gt;You can use PhyFi web server for generating dendrograms from Newick files. &#10;Sample output using your data from PhyFi: &lt;/p&gt;&#10;&#10;&lt;p&gt;&lt;img src=&quot;http://i.stack.imgur.com/wcoj8.png&quot; alt=&quot;PhyFi figure&quot;&gt;&lt;/p&gt;&#10;" CommentCount="0" CreationDate="2011-05-17T21:54:51.667" Id="10921" LastActivityDate="2011-05-17T21:54:51.667" OwnerUserId="529" ParentId="10832" PostTypeId="2" Score="2" />'
root = etree.fromstring(s)
root.attrib.keys()

['Body',
 'CommentCount',
 'CreationDate',
 'Id',
 'LastActivityDate',
 'OwnerUserId',
 'ParentId',
 'PostTypeId',
 'Score']

In [32]:
quickpostq = plines.filter(lambda x: find_bad(x)==0)\
                   .map(parse_post_quicka)\
                   .filter(lambda x: x[0] != None)\
                   .take(396)

print quickpostq

[('101121', ('1', '105339', '2014-06-04T13:31:24.160')), ('101123', ('2', '15957', '2014-06-04T13:43:15.913')), ('101127', ('2', '100930', '2014-06-04T14:19:20.430')), ('101128', ('2', '101003', '2014-06-04T14:25:45.873')), ('101129', ('1', '101146', '2014-06-04T14:28:42.040')), ('101132', ('2', '100990', '2014-06-04T14:35:51.100')), ('101133', ('2', '101131', '2014-06-04T14:41:11.140')), ('101136', ('1', '101185', '2014-06-04T15:00:58.427')), ('101140', ('2', '100961', '2014-06-04T15:07:55.817')), ('101143', ('2', '101126', '2014-06-04T15:13:20.170')), ('101144', ('1', '101194', '2014-06-04T15:14:33.713')), ('101145', ('2', '100854', '2014-06-04T15:24:10.153')), ('101146', ('2', '101129', '2014-06-04T15:24:18.320')), ('101147', ('2', '100893', '2014-06-04T15:31:46.327')), ('101148', ('2', '100816', '2014-06-04T15:33:09.367')), ('101149', ('2', '101139', '2014-06-04T15:34:04.377')), ('101151', ('2', '100878', '2014-06-04T15:46:59.423')), ('101152', ('2', '101134', '2014-06-04T15:52:04.

101378


('1', '115931', '2014-06-06T02:08:02.313')

In [252]:
quickposta = plines.filter(lambda x: find_bad(x)==0)\
                   .map(parse_post_quicka)\
                   .filter(lambda x: x[0] != None) \
                   .filter(lambda x: x[1][0] == '2')\
                   .map(lambda (x,y): (x, y[2]) )
print quickposta

PythonRDD[143] at RDD at PythonRDD.scala:48


In [None]:
quickpostqa = quickpostq.join(quickposta).take(10)# \
                   #.map( lambda (Id, (x, y)): (Id, (x[1],y[1],x[2],y[2])) ) \
                   #.filter(lambda (Id, z): z[0]==z[1])\
                   #.take(10)

In [44]:
# John Romano code  =================
from lxml import etree

#%%time
import dateutil
from dateutil.parser import parse

plines = sc.textFile(localpath('spark-stats-data/allPosts/'))

def valid_xml(line):
    return '<row' in line[:10] and '/>' in line[-10:]

def id_time(line):
    tree = etree.fromstring(line)
    postId = tree.get("Id")
    time = tree.get("CreationDate")
    return (postId,time)   
    
def id_time_AcceptedAnswer(line):
    tree = etree.fromstring(line)
    postId = tree.get("Id")
    accepted_answer = tree.get("AcceptedAnswerId")
    time = tree.get("CreationDate")
    return (accepted_answer,(postId,time))

def quick_answer(data):
    (_, value) = data
    ((_,time1),time2) = value
    #del_t = dateutil.parser.parse(time2) - dateutil.parser.parse(time1)
    del_t = parse(time2) - parse(time1)
    post_time = parse(time1)
    if int(del_t.seconds)<10800 and int(del_t.days)==0:
        return (post_time.hour,(1,1))
    else:
        return (post_time.hour,(0,1))

def avg_quick_answer(data):
    (key,value) = data
    (count_quick, count_total) = value
    avg_quick = count_quick*1.0/count_total
    return (int(key), avg_quick)

answer_times = plines.filter(valid_xml)\
    .map(id_time)
    
result5 = plines.filter(valid_xml)\
    .map(id_time_AcceptedAnswer)\
    .filter(lambda x: x[0] is not None)\
    .join(answer_times)\
    .map(quick_answer)\
    .reduceByKey(lambda x, y: (x[0]+y[0],x[1]+y[1]))\
    .map(avg_quick_answer)\
    .sortByKey()\
    .take(24)


print result5

[(0, 0.4504672897196262), (1, 0.44814814814814813), (2, 0.3605577689243028), (3, 0.3799126637554585), (4, 0.4028436018957346), (5, 0.4125), (6, 0.4597402597402597), (7, 0.4673684210526316), (8, 0.4616822429906542), (9, 0.49528301886792453), (10, 0.5157593123209169), (11, 0.5431754874651811), (12, 0.5347313237221494), (13, 0.5310796074154853), (14, 0.5238095238095238), (15, 0.5368007850834151), (16, 0.5475728155339806), (17, 0.47995991983967934), (18, 0.5202185792349727), (19, 0.5462012320328542), (20, 0.5185185185185185), (21, 0.5156794425087108), (22, 0.46153846153846156), (23, 0.4700460829493088)]


In [45]:
#  Yinjia Zhang code ========================
from datetime import datetime

def isRow(line):
    return '<row' in line and '/>' in line

def time_limit(tup):
    limit = 3*60*60
    if abs(tup[1].total_seconds()) - limit < 0:
        return (tup[0].time().hour, 1)
    return (tup[0].time().hour, 0)
    
class Apost(object):
    def __init__(self, Qid, Aid, Time):
        self.Qid = Qid
        self.Aid = Aid
        self.Time = Time

def get_apost(line):
    root = etree.XML(line)
    posttype = root.get('PostTypeId')    
    if posttype == "1": #question
        if 'AcceptedAnswerId' in root.attrib:
            Id = root.get('Id')
            AcceptId = root.get('AcceptedAnswerId')
            Time = root.get('CreationDate')
            Time = datetime.strptime( Time.split('.')[0], "%Y-%m-%dT%H:%M:%S" )
            return Apost(Id, AcceptId, Time)
        else:
            return None
    elif posttype == "2": #answer
        Id = root.get('Id')
        ParentId = root.get('ParentId')
        Time = root.get('CreationDate')
        Time = datetime.strptime( Time.split('.')[0], "%Y-%m-%dT%H:%M:%S" )
        return Apost(ParentId, Id, Time)
    return None


def quick_answers2(postpath):
    questions = sc.textFile(postpath) \
                .filter(isRow) \
                .map(get_apost) \
                .filter(lambda x: x is not None) \
                .map(lambda x: ((x.Qid, x.Aid), x.Time) ) \
                .reduceByKey( lambda x, y: (min(x,y), x-y) ) \
                .filter(lambda x: type(x[1]) is tuple) \
                .map(lambda ((qid, aid), (qt, delta)): (qt, delta)) \
                .map( time_limit ) \
                .map( lambda (h, cnt): (h, (cnt, 1)) ) \
                .reduceByKey( lambda x, y: (x[0]+y[0], x[1]+y[1]) ) \
                .map( lambda (h, (cnt, cntall)): (h, 1.0*cnt/cntall) ) \
                .sortByKey() \
                .map( lambda (h, ratio): ratio ) \
                .collect()
    return questions


In [46]:
q5 = quick_answers2(localpath('spark-stats-data/allPosts/'))

In [47]:
q5

[0.4504672897196262,
 0.44814814814814813,
 0.3605577689243028,
 0.3799126637554585,
 0.4028436018957346,
 0.4125,
 0.4597402597402597,
 0.4673684210526316,
 0.4616822429906542,
 0.49528301886792453,
 0.5157593123209169,
 0.5445682451253482,
 0.5347313237221494,
 0.5310796074154853,
 0.5238095238095238,
 0.5368007850834151,
 0.5475728155339806,
 0.47995991983967934,
 0.5207423580786026,
 0.5462012320328542,
 0.5191011235955056,
 0.5156794425087108,
 0.46153846153846156,
 0.4700460829493088]

In [48]:
def quick_answers():
    return q5#[0.] * 24

grader.score(question_name='spark__quick_answers', func=quick_answers)

Your score:  1.0


## 6. quick_answers_full

Same as above, but on the full StackExchange dataset.

No pre-parsed data is available for this question.

#### Checkpoints
* Total quick accepted answers: 3,700,224
* Total accepted answers: 5,086,888
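
As a rough sanity check, the two checkpoints can be turned into an overall fraction and compared with the hourly fractions. This is only a sketch: it assumes both checkpoints count the same set of accepted answers, in which case the overall fraction is a weighted average of the 24 hourly values and must lie between their smallest and largest entries. The two hourly bounds are copied from the `q6` output shown for this question.

```python
# Checkpoint totals from the problem statement
quick_accepted = 3700224
all_accepted = 5086888

# Overall fraction of accepted answers that were quick
overall_fraction = quick_accepted / float(all_accepted)

# Smallest and largest hourly fractions, copied from the q6 output
hourly_min = 0.690509573675678
hourly_max = 0.7475897325982465

print(overall_fraction)
print(hourly_min <= overall_fraction <= hourly_max)
```

If the second line prints `False`, the hourly fractions and the checkpoint totals were probably computed over different sets of posts.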

In [49]:
q6 = quick_answers2(localpath('spark-stack-data/allPosts/'))

In [52]:
q6

[0.690509573675678,
 0.6959973326135623,
 0.6996303294175634,
 0.7043773884560027,
 0.7100970997017199,
 0.717887585450249,
 0.7242037149999402,
 0.7270603330749285,
 0.7261554824985496,
 0.7228950931245197,
 0.7285208798088544,
 0.7372685231931178,
 0.7444591101578397,
 0.7452336806647366,
 0.7414518920435942,
 0.7344289978008168,
 0.7349275199302616,
 0.7394871755893452,
 0.7463419371127814,
 0.7475897325982465,
 0.7360013404058814,
 0.7158854640002482,
 0.7017813908832861,
 0.6947227370635676]

In [53]:
def quick_answers_full():
    return q6#[0.] * 24

grader.score(question_name='spark__quick_answers_full', func=quick_answers_full)

Your score:  1.0


## 7. identify_veterans

It can be interesting to think about what factors influence a user to remain active on the site over a long period of time. In order not to bias the results towards older users, we'll define a **time window between 100 and 150 days after account creation**. If the user has **made a post in this time**, we'll consider them active and well on their way to being **veterans** of the site; if not, they are inactive and were likely **brief users**.

*Consider*: What other parameterizations of "activity" could we use, and how would they differ in terms of splitting our user base?

*Consider*: What other biases are still not dealt with, after using the above approach?
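
The activity window can be sketched in plain Python before moving it into a Spark job. This is a minimal illustration, not the graded solution: the function name `is_veteran`, the inclusive bounds, and the sample dates are all assumptions made for the example.

```python
from datetime import datetime, timedelta

# Window bounds from the problem statement; treating them as inclusive is an assumption
WINDOW_START = timedelta(days=100)
WINDOW_END = timedelta(days=150)

def is_veteran(account_created, post_dates):
    """Return True if any post falls 100 to 150 days after account creation."""
    return any(WINDOW_START <= posted - account_created <= WINDOW_END
               for posted in post_dates)

# Hypothetical users, invented for illustration
signup = datetime(2015, 1, 1)
veteran_posts = [datetime(2015, 1, 5), datetime(2015, 5, 1)]   # second post is 120 days in
brief_posts = [datetime(2015, 1, 5), datetime(2015, 12, 1)]    # no post inside the window

print(is_veteran(signup, veteran_posts))
print(is_veteran(signup, brief_posts))
```

Note that a user with no posts at all comes out as a brief user here; whether such users should be counted is a modelling choice the checkpoints can help settle.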

Let's see if there are differences between the first ever question posts of "veterans" vs. "brief users". For each group separately, **average the score, views, number of answers, and number of favorites of the users' first question**.

*Consider*: What story could you tell from these numbers? How do the numbers support it?

#### Checkpoints
* Total brief users: 24,864
* Total veteran users: 2,027
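
The averaging step can be sketched the same way: keep each user's earliest question, then average the four metrics within each group. The records, the helper names `first_questions` and `group_means`, and the `veterans` set below are hypothetical and exist only to show the shape of the computation.

```python
from datetime import datetime

# Hypothetical question records, invented for illustration:
# (user_id, created, score, views, answers, favorites)
questions = [
    ("u1", datetime(2015, 1, 2), 4, 100, 2, 1),
    ("u1", datetime(2015, 3, 9), 9, 900, 5, 3),   # not u1's first question
    ("u2", datetime(2015, 2, 1), 2, 300, 0, 1),
    ("u3", datetime(2015, 2, 7), 0, 50, 1, 0),
]
veterans = {"u1", "u2"}   # assumed result of the activity-window test

def first_questions(records):
    """Keep only the earliest question per user."""
    first = {}
    for rec in records:
        user, created = rec[0], rec[1]
        if user not in first or created < first[user][1]:
            first[user] = rec
    return first

def group_means(records, members):
    """Average (score, views, answers, favorites) over the users in `members`."""
    rows = [rec[2:] for user, rec in first_questions(records).items()
            if user in members]
    return [sum(col) / float(len(rows)) for col in zip(*rows)]

vet_means = group_means(questions, veterans)
brief_means = group_means(questions, {"u3"})
print(vet_means)
print(brief_means)
```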

In [50]:
s='  <row AboutMe="&lt;p&gt;I like to look things up on the internet that have to do with people reading things that I type only for them to realize that they read a run on sentence and I just wasted their time.&lt;/p&gt;&#10;" AccountId="443263" CreationDate="2015-03-02T20:08:27.600" DisplayName="five_dollar_shake" DownVotes="0" Id="70191" LastAccessDate="2015-03-06T23:25:50.720" ProfileImageUrl="https://www.gravatar.com/avatar/dddf357acba95b0b98776142a59d5f48?s=128&amp;d=identicon&amp;r=PG&amp;f=1" Reputation="1" UpVotes="0" Views="1" />'
root = etree.fromstring(s)
root.attrib.keys()

['AboutMe',
 'AccountId',
 'CreationDate',
 'DisplayName',
 'DownVotes',
 'Id',
 'LastAccessDate',
 'ProfileImageUrl',
 'Reputation',
 'UpVotes',
 'Views']

In [58]:
s='  <row AcceptedAnswerId="18" AnswerCount="26" Body="&lt;p&gt;I\'ve been working on a new method for analyzing and parsing datasets to identify and isolate subgroups of a population without foreknowledge of any subgroup\'s characteristics.  While the method works well enough with artificial data samples (i.e. datasets created specifically for the purpose of identifying and segregating subsets of the population), I\'d like to try testing it with live data.&lt;/p&gt;&#10;&#10;&lt;p&gt;What I\'m looking for is a freely available (i.e. non-confidential, non-proprietary) data source.  Preferably one containing bimodal or multimodal distributions or being obviously comprised of multiple subsets that cannot be easily pulled apart via traditional means.  Where would I go to find such information?&lt;/p&gt;&#10;" CommentCount="3" CommunityOwnedDate="2010-07-20T20:50:48.483" CreationDate="2010-07-19T19:15:59.303" FavoriteCount="83" Id="7" LastActivityDate="2015-03-06T00:13:52.143" LastEditDate="2013-09-26T21:50:36.963" LastEditorUserId="253" OwnerUserId="38" PostTypeId="1" Score="81" Tags="&lt;dataset&gt;&lt;sample&gt;&lt;population&gt;&lt;teaching&gt;" Title="Locating freely available data samples" ViewCount="10055" />'
root = etree.fromstring(s)
root.attrib.keys()

['AcceptedAnswerId',
 'AnswerCount',
 'Body',
 'CommentCount',
 'CommunityOwnedDate',
 'CreationDate',
 'FavoriteCount',
 'Id',
 'LastActivityDate',
 'LastEditDate',
 'LastEditorUserId',
 'OwnerUserId',
 'PostTypeId',
 'Score',
 'Tags',
 'Title',
 'ViewCount']

In [64]:
s='  <row Body="&lt;p&gt;Quick and dirty Monte Carlo estimate in R of the length of a game for 1 player:&lt;/p&gt;&#10;&#10;&lt;pre&gt;&lt;code&gt;N = 1e5&#10;sample_length = function(n) { # random game length&#10;    x = numeric(0)&#10;    while(length(unique(x)) &amp;lt; n) x[length(x)+1] = sample(1:n,1)&#10;    return(length(x))&#10;}&#10;game_lengths = replicate(N, sample_length(6))&#10;&lt;/code&gt;&lt;/pre&gt;&#10;&#10;&lt;p&gt;Results: $\hat{\mu}=14.684$, $\hat{\sigma} = 6.24$, so a 95% confidence interval for the mean is $[14.645,14.722]$.&lt;/p&gt;&#10;&#10;&lt;p&gt;To determine the length of a four-player game, we can group the samples into fours and take the average minimum length over each group (you asked about the maximum, but I assume you meant the minimum since, the way I read it, the game ends when someone succeeds at getting all the numbers):&lt;/p&gt;&#10;&#10;&lt;pre&gt;&lt;code&gt;grouped_lengths = matrix(game_lengths, ncol=4)&#10;min_lengths = apply(grouped_lengths, 1, min)&#10;&lt;/code&gt;&lt;/pre&gt;&#10;&#10;&lt;p&gt;Results: $\hat{\mu}=9.44$, $\hat{\sigma} = 2.26$, so a 95% confidence interval for the mean is $[9.411,9.468]$.&lt;/p&gt;&#10;" CommentCount="1" CreationDate="2013-01-24T02:38:04.553" Id="48398" LastActivityDate="2013-01-24T02:38:04.553" OwnerUserId="2111" ParentId="48396" PostTypeId="2" Score="4" />'
root = etree.fromstring(s)
root.attrib.keys()

['Body',
 'CommentCount',
 'CreationDate',
 'Id',
 'LastActivityDate',
 'OwnerUserId',
 'ParentId',
 'PostTypeId',
 'Score']

In [190]:
from datetime import datetime

def isRow(line):
    return '<row' in line and '/>' in line

def get_allpost(line):
    root = etree.XML(line)
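    # Time defaults to None: test_Veteran checks for a missing date, and a '0' string
    # would fail when subtracted from a datetime
    ownerid, posttype, Id, Score, ViewCount, AnsCount, FavCount = ('0',)*7
    Time = None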
    if 'OwnerUserId' in root.attrib:
        ownerid = root.get('OwnerUserId')
        
    if 'PostTypeId' in root.attrib:
        posttype = root.get('PostTypeId')  
        
    if 'Id' in root.attrib:
        Id = root.get('Id')
    
    if 'CreationDate' in root.attrib: 
        Time = root.get('CreationDate')
        Time = datetime.strptime( Time.split('.')[0], "%Y-%m-%dT%H:%M:%S" )

    if 'Score' in root.attrib :
        Score = root.get('Score')
        
    if 'ViewCount' in root.attrib: 
        ViewCount = root.get('ViewCount')
        
    if 'AnswerCount' in root.attrib:   
        AnsCount = root.get('AnswerCount')

    if 'FavoriteCount' in root.attrib:
        FavCount = root.get('FavoriteCount')

    return (ownerid, (posttype, Id, Time, Score, ViewCount, AnsCount, FavCount))


def get_auser(line):
    root = etree.XML(line)
    if 'Id' in root.attrib and 'CreationDate' in root.attrib:
        Id = root.get('Id')   
        Time = root.get('CreationDate')
        Time = datetime.strptime( Time.split('.')[0], "%Y-%m-%dT%H:%M:%S" )
        return (Id, Time)
    else:
        return None

users = sc.textFile(localpath('spark-stats-data/allUsers/'))\
        .filter(isRow)\
        .map(get_auser)\
        .filter(lambda x: x is not None)


In [205]:
from datetime import datetime
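def test_Veteran(data):
    """Return True for a veteran, False for a brief user, None if a date is missing."""
    (x, y) = data            # x: user id, y: (list of posts, account creation time)
    if not y[1]:
        return None
    init_time = y[1]
    if not y[0]:
        return None
    for yy in y[0]:
        if not yy[2]:        # yy[2] is the post's creation time
            return None
        dt = yy[2] - init_time
        # Active if any post falls 100 to 150 days after account creation
        if 100 <= dt.days <= 150:
            return True
    return False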
        
def earliest_question(data):
    (x, y) = data
    early = y[0]
    if len(y)>1:
        for yy in y[1:]:
            if yy[2] < early[2]:
                early = yy        
    return (x, early)

vets = sc.textFile(localpath('spark-stats-data/allPosts/'))\
        .filter(isRow)\
        .map(get_allpost)\
        .filter(lambda x: x is not None)\
        .map(lambda x: (x[0], [x[1],]))\
        .reduceByKey(lambda x, y: x + y)\
        .join(users)\
        .filter(lambda x: test_Veteran(x)==True)\
        .map(lambda (x,(y, z)) : (x, [yy for yy in y if yy[0] =='1'] ) ) \
        .filter(lambda x: x[1]!=[])\
        .map(earliest_question)\
        .map(lambda (x,y): (x, (y[3],y[4],y[5],y[6])))


briefs = sc.textFile(localpath('spark-stats-data/allPosts/'))\
        .filter(isRow)\
        .map(get_allpost)\
        .filter(lambda x: x is not None)\
        .map(lambda x: (x[0], [x[1],]))\
        .reduceByKey(lambda x, y: x + y)\
        .join(users)\
        .filter(lambda x: test_Veteran(x)==False)\
        .map(lambda (x,(y, z)) : (x, [yy for yy in y if yy[0] =='1'] ) ) \
        .filter(lambda x: x[1]!=[])\
        .map(earliest_question)\
        .map(lambda (x,y): (x, (y[3],y[4],y[5],y[6])))
      

print vets.take(5)
print vets.count()
print briefs.take(5)
print briefs.count()


[('7515', ('7', '150', '1', '0')), ('41838', ('3', '183', '1', '0')), ('32018', ('2', '101', '0', '0')), ('11615', ('4', '11043', '2', '2')), ('53604', ('0', '49', '1', '0'))]
1843
[('23991', ('2', '145', '1', '1')), ('55619', ('1', '10', '0', '0')), ('55344', ('0', '44', '0', '0')), ('35549', ('1', '390', '1', '0')), ('22465', ('0', '289', '2', '0'))]


In [211]:
vets_stat = vets.map(lambda (x,y): ('1', (1, int(y[0]),int(y[1]),int(y[2]),int(y[3]))) )\
                .reduceByKey(lambda x,y: (x[0]+y[0], x[1]+y[1], x[2]+y[2], x[3]+y[3], x[4]+y[4]) )\
                .collect()

x,y = vets_stat[0]
print [1.0*yy/y[0] for yy in y[1:] ]
vscore, vviews, vans, vfav = [1.0*yy/y[0] for yy in y[1:] ]

briefs_stat = briefs.map(lambda (x,y): ('1', (1, int(y[0]),int(y[1]),int(y[2]),int(y[3]))) )\
                .reduceByKey(lambda x,y: (x[0]+y[0], x[1]+y[1], x[2]+y[2], x[3]+y[3], x[4]+y[4]) )\
                .collect()

x,y = briefs_stat[0]
print [1.0*yy/y[0] for yy in y[1:] ]
bscore, bviews, bans, bfav = [1.0*yy/y[0] for yy in y[1:] ]

[3.5322843190450355, 927.7042864894195, 1.2962561041779708, 1.2930005425935973]
[2.100084697910785, 552.9672971955581, 0.9704968944099379, 0.5757105213626953]


In [210]:
def identify_veterans():
    return {"vet_score": 3.5322843190450355,
            "vet_views": 927.7042864894195,
            "vet_answers": 1.2962561041779708,
            "vet_favorites": 1.2930005425935973,
            "brief_score": 2.100084697910785,
            "brief_views": 552.9672971955581,
            "brief_answers": 0.9704968944099379,
            "brief_favorites": 0.5757105213626953
           }

grader.score(question_name='spark__identify_veterans', func=identify_veterans)

Your score:  1.0


## 8. identify_veterans_full

Same as above, but on the full StackExchange dataset.

No pre-parsed data is available for this question.

#### Checkpoints
* Total brief users: 1,848,628
* Total veteran users: 288,285

In [181]:
def findbad(line):
    if '<row' not in line:
        return 2
    try:
        etree.fromstring(line)
        return 1
    except:
        return 0

from collections import Counter
from datetime import datetime, timedelta


def post7(row):
    root = etree.fromstring(row)
    user, pId, pdate, score, view, anscnt, favcnt, postType = [0]*8
    if 'PostTypeId' in root.attrib:
        postType = root.attrib['PostTypeId']
    if 'Id' in root.attrib:
        pId = root.attrib['Id']
    if 'OwnerUserId' in root.attrib:
        user = root.attrib['OwnerUserId']
    if 'CreationDate' in root.attrib:
        pdate = root.attrib['CreationDate']
        pdate = datetime.strptime(pdate, '%Y-%m-%dT%H:%M:%S.%f')
    if 'Score' in root.attrib:
        score = root.attrib['Score']
    if 'ViewCount' in root.attrib:
        view = root.attrib['ViewCount']
    if 'AnswerCount' in root.attrib:
        anscnt = root.attrib['AnswerCount']
    if 'FavoriteCount' in root.attrib:
        favcnt = root.attrib['FavoriteCount']
    # Wrap the post in a list so reduceByKey can concatenate posts per user
    return (user, [(pId, pdate, score, view, anscnt, favcnt, postType)])
     #return (user, (pId, pdate, score, view, anscnt, favcnt))

def user7(row):
    root = etree.fromstring(row)
    userId, cdate = [None]*2
    if 'Id' in root.attrib:
        userId = root.attrib['Id']
    if 'CreationDate' in root.attrib:
        cdate = root.attrib['CreationDate']
        cdate = datetime.strptime(cdate, '%Y-%m-%dT%H:%M:%S.%f')
    return (userId, cdate)


# get all userid
q7_1=sc.textFile(localpath('spark-stack-data/allUsers'))\
            .filter(lambda line: findbad(line)==1) \
            .map(user7)

#q7_1.take(2)

#q7_2.take(10)
def findvet(x):
    posts = x[1][0]
    ctime = x[1][1]
    if not ctime:
        return None
    for p in posts:
        if not p[1]:
            return None
        diff = p[1] - ctime
        if diff >= timedelta(days = 100) and diff <= timedelta(days = 150):
            return 1
    return 0



#veteran
# (user, ([pId, pdate, score, view, anscnt, favcnt, ptype]))
q7_2=sc.textFile(localpath('spark-stack-data/allPosts'))\
            .filter(lambda line: findbad(line)==1) \
            .map(post7) \
            .reduceByKey(lambda x, y: x + y) \
            .join(q7_1) \
            .filter(lambda x: findvet(x)==1) \
            .map(lambda x: (x[0], [e for e in x[1][0] if e[6]=='1'])) \
            .filter(lambda x: x[1]!=[]) \
            .map(lambda x: (x[0], sorted(x[1], key = lambda y: y[1])[0])) \
            .map(lambda x: x[1]) 

            
            
#q7_2.take(3)

vcnt = float(q7_2.count())
#print vcnt

scoret = q7_2.map(lambda x: int(x[2])) \
             .reduce(lambda x, y: x + y)

vscore = scoret/vcnt

viewt = q7_2.map(lambda x: int(x[3])) \
             .reduce(lambda x, y: x + y)

vview = viewt/vcnt

anscntt = q7_2.map(lambda x: int(x[4])) \
             .reduce(lambda x, y: x + y)

vanscnt = anscntt/vcnt

favcntt = q7_2.map(lambda x: int(x[5])) \
             .reduce(lambda x, y: x + y)

vfavcnt = favcntt/vcnt


# brief
# (user, ([pId, pdate, score, view, anscnt, favcnt, ptype]))
q7_3=sc.textFile(localpath('spark-stack-data/allPosts'))\
            .filter(lambda line: findbad(line)==1) \
            .map(post7) \
            .reduceByKey(lambda x, y: x + y) \
            .join(q7_1) \
            .filter(lambda x: findvet(x)==0) \
            .map(lambda x: (x[0], [e for e in x[1][0] if e[6]=='1'])) \
            .filter(lambda x: x[1]!=[]) \
            .map(lambda x: (x[0], sorted(x[1], key = lambda y: y[1])[0])) \
            .map(lambda x: x[1])

            
            
#q7_3.take(3)

bcnt = float(q7_3.count())


bscoret = q7_3.map(lambda x: int(x[2])) \
             .reduce(lambda x, y: x + y)

bscore = bscoret/bcnt

bviewt = q7_3.map(lambda x: int(x[3])) \
             .reduce(lambda x, y: x + y)

bview = bviewt/bcnt

banscntt = q7_3.map(lambda x: int(x[4])) \
             .reduce(lambda x, y: x + y)

banscnt = banscntt/bcnt

bfavcntt = q7_3.map(lambda x: int(x[5])) \
             .reduce(lambda x, y: x + y)

bfavcnt = bfavcntt/bcnt



print {"vet_score": vscore,
            "vet_views": vview,
            "vet_answers": vanscnt,
            "vet_favorites": vfavcnt,
            "brief_score": bscore,
            "brief_views": bview,
            "brief_answers": banscnt,
            "brief_favorites": bfavcnt
           }

{'brief_views': 1096.1519220732553, 'brief_answers': 1.5038565525030159, 'brief_favorites': 0.3861764445851408, 'vet_score': 2.2598437331442924, 'vet_favorites': 0.8673157237744455, 'vet_views': 1844.0344896669696, 'brief_score': 1.1307456144103445, 'vet_answers': 1.8426197044183144}


In [182]:
def identify_veterans_full():
    return {"vet_score": vscore,
            "vet_views": vview,
            "vet_answers": vanscnt,
            "vet_favorites": vfavcnt,
            "brief_score": bscore,
            "brief_views": bview,
            "brief_answers": banscnt,
            "brief_favorites": bfavcnt
           }

grader.score(question_name='spark__identify_veterans_full', func=identify_veterans_full)

Your score:  1.0


## 9. word2vec

Word2Vec is an alternative approach for vectorizing text data. The vectorized representations of words in the vocabulary tend to be useful for predicting other words in the document, hence the famous example "vector('king') - vector('man') + vector('woman') ~= vector('queen')".

Let's see how good a Word2Vec model we can **train using the tags of each StackExchange post** as documents (this uses the full dataset). **Use Spark ML's implementation of Word2Vec** (this will require using **DataFrames**) to return a list of the **top 25 closest synonyms to "ggplot2" and their similarity score in tuple format ("string", number).**

#### Parameters

The dimensionality of the vector space should be 100. The random seed should be 42L.

#### Checkpoints
* Mean of the top 25 cosine similarities: 0.7785175901170094
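
The similarity score behind "closest synonyms" is cosine similarity, and the checkpoint is simply the mean of the top scores. The sketch below shows both on toy data; the three-dimensional vectors and the helper `top_synonyms` are invented for illustration and say nothing about what a trained model would produce.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, invented for illustration (a real model would use 100 dimensions)
vectors = {
    "ggplot2": [1.0, 0.0, 0.0],
    "lattice": [1.0, 1.0, 0.0],
    "sql":     [0.0, 0.0, 1.0],
}

def top_synonyms(word, k):
    """Return the k (word, similarity) pairs closest to `word`, best first."""
    query = vectors[word]
    scored = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

synonyms = top_synonyms("ggplot2", 2)
mean_similarity = sum(score for _, score in synonyms) / len(synonyms)
print(synonyms)
print(mean_similarity)
```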

In [224]:
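# Note: this is the RDD-based MLlib Word2Vec, not the DataFrame-based
# pyspark.ml version that the problem statement names
from pyspark.mllib.feature import Word2Vec
word2vec = Word2Vec().setVectorSize(100).setSeed(42)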

def get_tags(line):
    root = etree.XML(line)
    tags = []
    if 'Tags' in root.attrib: 
        Tag_str = root.get('Tags')
        tags = Tag_str.replace('&lt;', ' ').replace('&gt;', ' ').replace('<', ' ').replace('>', ' ').split()
    else:
        return None
    return tags

tags = sc.textFile(localpath('spark-stack-data/allPosts'))\
        .filter(isRow)\
        .map(get_tags)\
        .filter(lambda x: x is not None)
        
        
print tags.count()


model = word2vec.fit(tags)

8966083


In [229]:
#model = word2vec.fit(inp)
synonyms = model.findSynonyms('ggplot2', 25)

# findSynonyms returns (word, cosine similarity) pairs, most similar first
for word, cosine_similarity in synonyms:
    print("{} {}".format(word, cosine_similarity))

lattice 0.914175089362
r-grid 0.853013366773
plotmath 0.844089403526
boxplot 0.839753503564
plotrix 0.831518218593
ecdf 0.830968580308
gmisc 0.824988746884
levelplot 0.824531900738
density-plot 0.824492297613
melt 0.816001808733
gridextra 0.813623201172
line-plot 0.809634432145
loess 0.807918811886
rgl 0.804110386348
tapply 0.803792142344
ggvis 0.801683385182
mgcv 0.801153310268
r-factor 0.798146535195
quantile 0.797029189706
performanceanalytics 0.794787815957
weibull 0.792841571028
ggdendro 0.791935147398
categorical-data 0.790621594458
ggmap 0.790548134704
standard-error 0.788530333391


In [232]:
res = '''
lattice 0.914175089362
r-grid 0.853013366773
plotmath 0.844089403526
boxplot 0.839753503564
plotrix 0.831518218593
ecdf 0.830968580308
gmisc 0.824988746884
levelplot 0.824531900738
density-plot 0.824492297613
melt 0.816001808733
gridextra 0.813623201172
line-plot 0.809634432145
loess 0.807918811886
rgl 0.804110386348
tapply 0.803792142344
ggvis 0.801683385182
mgcv 0.801153310268
r-factor 0.798146535195
quantile 0.797029189706
performanceanalytics 0.794787815957
weibull 0.792841571028
ggdendro 0.791935147398
categorical-data 0.790621594458
ggmap 0.790548134704
standard-error 0.788530333391
'''

q9 = [ (y[0], float(y[1])) for y in (x.strip().split() for x in res.splitlines()) if y]

q9

[('lattice', 0.914175089362),
 ('r-grid', 0.853013366773),
 ('plotmath', 0.844089403526),
 ('boxplot', 0.839753503564),
 ('plotrix', 0.831518218593),
 ('ecdf', 0.830968580308),
 ('gmisc', 0.824988746884),
 ('levelplot', 0.824531900738),
 ('density-plot', 0.824492297613),
 ('melt', 0.816001808733),
 ('gridextra', 0.813623201172),
 ('line-plot', 0.809634432145),
 ('loess', 0.807918811886),
 ('rgl', 0.804110386348),
 ('tapply', 0.803792142344),
 ('ggvis', 0.801683385182),
 ('mgcv', 0.801153310268),
 ('r-factor', 0.798146535195),
 ('quantile', 0.797029189706),
 ('performanceanalytics', 0.794787815957),
 ('weibull', 0.792841571028),
 ('ggdendro', 0.791935147398),
 ('categorical-data', 0.790621594458),
 ('ggmap', 0.790548134704),
 ('standard-error', 0.788530333391)]

In [233]:
def word2vec():
    return q9#[("data.frame", 0.7900882217638416)] * 25

grader.score(question_name='spark__word2vec', func=word2vec)

Your score:  1.0


## k_means (ungraded)

From your trained Word2Vec model, pass the vectors into a K-means clustering algorithm. Plot the clustering error, computed as the square root of the sum of the squared distances between each point and the center of its assigned cluster. As the independent variable, use either the number of clusters k or the dimension of the Word2Vec vectorization.
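
The error measure described here can be written out directly. The sketch below assumes the cluster centers and assignments are already known; the function name `clustering_error` and the four sample points are hypothetical, and in practice the centers would come from the fitted K-means model.

```python
import math

def clustering_error(points, centers, assignments):
    """Square root of the summed squared distances from each point to its assigned center."""
    total = 0.0
    for point, cluster in zip(points, assignments):
        center = centers[cluster]
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return math.sqrt(total)

# Hypothetical 2-D points forming two obvious clusters
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# Two clusters: every point sits at distance 1 from its center
error_k2 = clustering_error(points, [(0.0, 1.0), (10.0, 1.0)], [0, 0, 1, 1])

# One cluster centered on the overall mean: the error is much larger
error_k1 = clustering_error(points, [(5.0, 1.0)], [0, 0, 0, 0])

print(error_k2)
print(error_k1)
```

Repeating this for a range of k values gives the points of the requested plot.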

## 10. classification

#### We'd like to see if we can predict the tags of a question from its body text. 

Technically this is a multi-label classification problem, but to simplify things we'll use a **one-vs-all approach where we choose the top k most common tags and train k binary classifiers where the labels indicate the presence or absence of that tag.**

Use a **logistic regression model** as your classifier.

Since we can't reliably save and load models, return **a list of 100 tuples ("string", [number, number, number,...])** where "string" is the tag and **the numbers are your model's predicted probabilities for class 0 (e.g. 0.29086 means a prediction that the tag is present) across the test set.**

Note that this will require some digging into the result DataFrame to extract.
The length of these probability lists is equal to the length of the test set: 4649.
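
Because the requested number is the probability of class 0 (tag absent), a low value means the tag is predicted. The sketch below shows that relationship for a binary logistic model; the names `class_probabilities` and `tag_predicted` and the 0.5 threshold are assumptions for illustration, and the probability vector is taken to be ordered as [P(absent), P(present)].

```python
import math

def class_probabilities(margin):
    """Return [P(class 0), P(class 1)] for a logistic model with the given margin."""
    p_present = 1.0 / (1.0 + math.exp(-margin))
    return [1.0 - p_present, p_present]

def tag_predicted(prob_class0, threshold=0.5):
    """A tag is predicted present when the class-0 probability falls below the threshold."""
    return prob_class0 < threshold

print(class_probabilities(0.0))
print(tag_predicted(0.29086))   # example value from the problem statement
print(tag_predicted(0.98573))   # a confident "tag absent" prediction
```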

#### Parameters

* [Training](s3://dataincubator-course/spark-stats-data/posts_train.zip) and [test](s3://dataincubator-course/spark-stats-data/posts_test.zip) sets are available on S3.
* Tokenize the body text into words
* number of tags to consider **k = 100**
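
The one-vs-all setup amounts to picking the k most common tags and building one binary label column per tag. Here is a small plain-Python sketch of that step; the four posts, the helper names `top_tags` and `binary_labels`, and `k = 2` are hypothetical stand-ins for the real training set and `k = 100`.

```python
from collections import Counter

# Hypothetical (body, tags) pairs, invented for illustration
posts = [
    ("how do i fit a linear model in r", ["r", "regression"]),
    ("interpreting regression coefficients", ["regression"]),
    ("plotting a time series in r", ["r", "time-series"]),
    ("what is a p-value", ["hypothesis-testing"]),
]

def top_tags(posts, k):
    """Return the k most frequent tags across all posts."""
    counts = Counter(tag for _, tags in posts for tag in tags)
    return [tag for tag, _ in counts.most_common(k)]

def binary_labels(posts, tag):
    """1.0 where the post carries `tag`, 0.0 otherwise."""
    return [1.0 if tag in tags else 0.0 for _, tags in posts]

k = 2
chosen = top_tags(posts, k)
labels = {tag: binary_labels(posts, tag) for tag in chosen}
print(chosen)
print(labels)
```

Each entry of `labels` would then serve as the label column for one binary classifier trained on the same tokenized body text.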

In [234]:
!mkdir -p spark-class-data
!aws s3 cp s3://dataincubator-course/spark-stats-data/posts_train.zip ./spark-class-data
!aws s3 cp s3://dataincubator-course/spark-stats-data/posts_test.zip ./spark-class-data

download: s3://dataincubator-course/spark-stats-data/posts_train.zip to spark-class-data/posts_train.zip
download: s3://dataincubator-course/spark-stats-data/posts_test.zip to spark-class-data/posts_test.zip


In [238]:
!unzip ./spark-class-data/posts_train.zip -d ./spark-class-data/posts_train
!unzip ./spark-class-data/posts_test.zip -d ./spark-class-data/posts_test

Archive:  ./spark-class-data/posts_train.zip
replace ./spark-class-data/posts_train/part-00001? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C
Archive:  ./spark-class-data/posts_test.zip
replace ./spark-class-data/posts_test/part-00001? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C


In [None]:
#https://spark.apache.org/docs/2.0.2/mllib-linear-methods.html

In [262]:
from pyspark.ml.feature import Word2Vec
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

tags = sc.textFile(localpath('spark-class-data/posts_train/'))\
        .filter(isRow)\
        .map(get_tags)\
        .filter(lambda x: x is not None)\
        .flatMap(lambda doc: [(x, 1) for x in doc])\
        .reduceByKey(lambda x, y: x+y)\
        .map(lambda (x,y): (y,x))\
        .sortByKey(ascending=False)\
        .take(100)
        
        #.map(lambda x: (x, 1))\
        #.toDF(['text', 'score'])
print tags[:10]
tags100 = [x[1] for x in tags]
print tags100

[(7121, 'r'), (5408, 'regression'), (2654, 'time-series'), (2524, 'machine-learning'), (2055, 'probability'), (1926, 'hypothesis-testing'), (1807, 'distributions'), (1762, 'self-study'), (1627, 'logistic'), (1544, 'correlation')]
['r', 'regression', 'time-series', 'machine-learning', 'probability', 'hypothesis-testing', 'distributions', 'self-study', 'logistic', 'correlation', 'statistical-significance', 'classification', 'bayesian', 'anova', 'normal-distribution', 'clustering', 'data-visualization', 'confidence-interval', 'mathematical-statistics', 'multiple-regression', 'estimation', 'categorical-data', 'mixed-model', 'spss', 'generalized-linear-model', 'variance', 'repeated-measures', 'sampling', 't-test', 'pca', 'svm', 'forecasting', 'data-mining', 'multivariate-analysis', 'cross-validation', 'chi-squared', 'modeling', 'maximum-likelihood', 'predictive-models', 'matlab', 'data-transformation', 'neural-networks', 'nonparametric', 'interaction', 'survival', 'model-selection', 'linear

In [293]:
import lxml.html.clean as clean
cleaner = clean.Cleaner()#page_structure=False
import re
TAG_RE = re.compile(r'<[a-z/]+>')

mytag = tags100[1]

def get_body(line):
    root = etree.XML(line)
    body = []
    if 'Body' in root.attrib: 
        body_str = root.get('Body')
        #body_str = cleaner.clean_html(body_str)
        #body_str = ''.join(xml.etree.ElementTree.fromstring(body_str).itertext())
        body_str = TAG_RE.sub('', body_str)
        body_str = body_str.replace('\n',' ').lower() 
    else:
        return None
        
    if 'Tags' in root.attrib: 
        Tag_str = root.get('Tags')
        tags = Tag_str.replace('&lt;', ' ').replace('&gt;', ' ').replace('<', ' ').replace('>', ' ').split()
        if mytag in tags:
            y = 1.0
        else:
            y = 0.0
    else:
        return None

    return (body_str, y)

training = sc.textFile(localpath('spark-class-data/posts_train/'))\
            .filter(isRow)\
            .map(get_body)\
            .filter(lambda x: x is not None)\
            .toDF(['text', 'label'])

test = sc.textFile(localpath('spark-class-data/posts_test/'))\
            .filter(isRow)\
            .map(get_body)\
            .filter(lambda x: x is not None)\
            .toDF(['text', 'label'])
            
#print train.take(3)
print training.count()
print test.count()

42065
4649


In [294]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
logreg = LogisticRegression(maxIter=10, regParam=0.01)

tokens = tokenizer.transform(training)
hashes = hashingTF.transform(tokens)
model = logreg.fit(hashes)

In [295]:
# Make predictions on test documents
test_tokens = tokenizer.transform(test)
test_hashes = hashingTF.transform(test_tokens)

prediction = model.transform(test_hashes)
selected = prediction.select("probability", "label", "prediction") #
prob=[]
for row in selected.collect():
    print(row.label, row.prediction, row.probability[0])
    prob.append(row.probability[0])

(0.0, 0.0, 0.98573203160479661)
(0.0, 0.0, 0.9988171590901892)
(0.0, 0.0, 0.99984019106993016)
(0.0, 0.0, 0.86022697698669937)
(0.0, 0.0, 0.96596031917458547)
(1.0, 0.0, 0.99462644155203961)
(0.0, 0.0, 0.98628979839536879)
(0.0, 0.0, 0.91695711405808922)
(0.0, 1.0, 0.46477561424621322)
(0.0, 0.0, 0.98066409837867075)
(0.0, 0.0, 0.99996673248298573)
(0.0, 0.0, 0.95603173514494733)
(0.0, 0.0, 0.99999705701282193)
(0.0, 0.0, 0.99995170640314013)
(0.0, 0.0, 0.956677536062564)
(0.0, 0.0, 0.99997486323114604)
(0.0, 1.0, 0.23050292921674745)
(0.0, 0.0, 0.95743756971269567)
(0.0, 0.0, 0.99999684074584061)
(0.0, 0.0, 0.99999999994545608)
(0.0, 0.0, 0.99948029646708514)
(0.0, 0.0, 0.99165449406405792)
(1.0, 0.0, 0.81408609021585698)
(0.0, 0.0, 0.99880862416106952)
(0.0, 0.0, 0.99999996701523552)
(0.0, 0.0, 0.99367315500836662)
(0.0, 0.0, 0.99321727291833128)
(0.0, 0.0, 0.99886123245928238)
(0.0, 0.0, 0.99077742620121612)
(0.0, 0.0, 0.99721993017765476)
(0.0, 0.0, 0.88915162526455771)
(0.0, 0.0, 

In [289]:
prediction.select(["features", "probability", "prediction"]).show(5)

+--------------------+--------------------+----------+
|            features|         probability|prediction|
+--------------------+--------------------+----------+
|(262144,[1846,462...|[0.99636569621064...|       0.0|
|(262144,[170,381,...|[0.99999999953788...|       0.0|
|(262144,[5110,963...|[0.99566170069415...|       0.0|
|(262144,[2711,596...|[0.99881926770173...|       0.0|
|(262144,[640,1993...|[0.99999910554211...|       0.0|
+--------------------+--------------------+----------+
only showing top 5 rows



In [297]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
import re
TAG_RE = re.compile(r'<[a-z/]+>')

def get_body(line):
    root = etree.XML(line)
    body = []
    if 'Body' in root.attrib: 
        body_str = root.get('Body')
        body_str = TAG_RE.sub('', body_str)
        body_str = body_str.replace('\n',' ').lower() 
    else:
        return None
        
    if 'Tags' in root.attrib: 
        Tag_str = root.get('Tags')
        tags = Tag_str.replace('&lt;', ' ').replace('&gt;', ' ').replace('<', ' ').replace('>', ' ').split()
        if mytag in tags:
            y = 1.0
        else:
            y = 0.0
    else:
        return None

    return (body_str, y)


q10 = []

for iind, mytag in enumerate(tags100):


    training = sc.textFile(localpath('spark-class-data/posts_train/'))\
                .filter(isRow)\
                .map(get_body)\
                .filter(lambda x: x is not None)\
                .toDF(['text', 'label'])

    test = sc.textFile(localpath('spark-class-data/posts_test/'))\
                .filter(isRow)\
                .map(get_body)\
                .filter(lambda x: x is not None)\
                .toDF(['text', 'label'])

    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
    logreg = LogisticRegression(maxIter=10, regParam=0.01)

    tokens = tokenizer.transform(training)
    hashes = hashingTF.transform(tokens)
    model = logreg.fit(hashes)

    test_tokens = tokenizer.transform(test)
    test_hashes = hashingTF.transform(test_tokens)

    prediction = model.transform(test_hashes)
    selected = prediction.select("probability", "label", "prediction") #"text"
    prob=[]
    for row in selected.collect():
        #print(row.label, row.prediction, row.probability[0])
        
        prob.append(row.probability[0])
        
    print 'Tag #', iind, ' : ', mytag, 'finished'
    q10.append((mytag, prob))
    

Tag # 0  :  r finished
Tag # 1  :  regression finished
Tag # 2  :  time-series finished
Tag # 3  :  machine-learning finished
Tag # 4  :  probability finished
Tag # 5  :  hypothesis-testing finished
Tag # 6  :  distributions finished
Tag # 7  :  self-study finished
Tag # 8  :  logistic finished
Tag # 9  :  correlation finished
Tag # 10  :  statistical-significance finished
Tag # 11  :  classification finished
Tag # 12  :  bayesian finished
Tag # 13  :  anova finished
Tag # 14  :  normal-distribution finished
Tag # 15  :  clustering finished
Tag # 16  :  data-visualization finished
Tag # 17  :  confidence-interval finished
Tag # 18  :  mathematical-statistics finished
Tag # 19  :  multiple-regression finished
Tag # 20  :  estimation finished
Tag # 21  :  categorical-data finished
Tag # 22  :  mixed-model finished
Tag # 23  :  spss finished
Tag # 24  :  generalized-linear-model finished
Tag # 25  :  variance finished
Tag # 26  :  repeated-measures finished
Tag # 27  :  sampling finished


In [253]:
q10tags = [(x, [0.0] * 4649) for x in tags100]

In [298]:
def classification():
    return q10#[("repeated-measures", [0.0] * 4649)] * 100

grader.score(question_name='spark__classification', func=classification)

Your score:  1.0


*Copyright &copy; 2016 The Data Incubator.  All rights reserved.*