# DATA 620, Assignment 2, Part 2: “Centrality Measures”

### Group 1 Members:

* Mauricio Alarcon
* Sekhar Mekala 
* Aadi Kalloo
* Srinivasa Illapani
* Param Singh 

### Assignment Description

Centrality measures can be used to predict (positive or negative) outcomes for a node.

Your task in this week’s assignment is to identify an interesting set of network data that is available on the web (either through web scraping or web APIs) that could be used for analyzing and comparing centrality measures across nodes.  As an additional constraint, there should be at least one categorical variable available for each node (such as “Male” or “Female”; “Republican”, “Democrat,” or “Undecided”, etc.)

In addition to identifying your data source, you should create a high level plan that describes how you would load the data for analysis, and describe a hypothetical outcome that could be predicted from comparing degree centrality across categorical groups. 

For this week’s assignment, you are not required to actually load or analyze the data.  Please see also Project 1 below.


# Assignment Response

We identified the Amazon Co-Product (co-purchasing) dataset as a viable source for this exercise. The edge list is available at https://snap.stanford.edu/data/amazon0302.html and the product metadata at https://snap.stanford.edu/data/amazon-meta.html.

Let's look a little closer at the datasets:

In [1]:
import graphlab as gl
gl.canvas.set_target('ipynb') # use IPython Notebook output for GraphLab Canvas

### Edges

The list of Edges:

In [2]:
import gzip
from itertools import islice

# Preview the first 20 lines of the compressed edge list
with gzip.open('data/amazon0302.txt.gz','r') as fin:
    for line in islice(fin, 20):
        print(line.strip())

# Directed graph (each unordered pair of nodes is saved once): Amazon0302.txt
# Amazon product co-purchaisng network from March 02 2003
# Nodes: 262111 Edges: 1234877
# FromNodeId	ToNodeId
0	1
0	2
0	3
0	4
0	5
1	0
1	2
1	4
1	5
1	15
2	0
2	11
2	12
2	13
2	14
3	63
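
The preview shows the layout of the edge file: comment lines starting with `#`, followed by tab-separated `FromNodeId`/`ToNodeId` pairs. As a minimal sketch of how such a file could be read without GraphLab, the snippet below parses an in-memory sample. The sample text, the `read_edges` helper, and the `edges_sample` name are illustrative assumptions, not part of the dataset or of the notebook's actual loading code.

```python
import io

# Hypothetical in-memory sample that mirrors the layout of amazon0302.txt
SAMPLE = (
    "# Directed graph (each unordered pair of nodes is saved once): Amazon0302.txt\n"
    "# FromNodeId\tToNodeId\n"
    "0\t1\n"
    "0\t2\n"
    "1\t0\n"
    "1\t2\n"
)

def read_edges(lines):
    """Yield (src, dst) integer pairs, skipping '#' comment lines and blank lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        src, dst = line.split('\t')
        yield int(src), int(dst)

edges_sample = list(read_edges(io.StringIO(SAMPLE)))
print(edges_sample)
```

The same generator could be pointed at the handle returned by `gzip.open`, provided the file is opened so that it yields text lines.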


### Vertices

In [3]:
from itertools import islice

# Preview the first 30 lines of the compressed product metadata
with gzip.open('data/amazon-meta.txt.gz','r') as fin:
    for line in islice(fin, 30):
        print(line.strip())

# Full information about Amazon Share the Love products
Total items: 548552

Id:   0
ASIN: 0771044445
discontinued product

Id:   1
ASIN: 0827229534
title: Patterns of Preaching: A Sermon Sampler
group: Book
salesrank: 396585
similar: 5  0804215715  156101074X  0687023955  0687074231  082721619X
categories: 2
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Sermons[12370]
reviews: total: 2  downloaded: 2  avg rating: 5
2000-7-28  cutomer: A2JW67OY8U6HHK  rating: 5  votes:  10  helpful:   9
2003-12-14  cutomer: A2VE83MZF98ITY  rating: 5  votes:   6  helpful:   5

Id:   2
ASIN: 0738700797
title: Candlemas: Feast of Flames
group: Book
salesrank: 168596
similar: 5  0738700827  1567184960  1567182836  0738700525  0738700940
categories: 2
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Earth-Based Religions[12472]|Wicca[12484]
|Book

As we can see in the preview above, the vertices are data-rich. We can consider collecting the following attributes:

* Group
* Categories (first line)
* Reviews Total
* Reviews Avg Rating
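
The first category line is a `|`-separated path in which each level carries a numeric id in square brackets. A rough sketch of how that path might be reduced to a categorical label per node is shown below. The `category_levels` helper is a hypothetical name introduced here, and the sketch assumes that category names never contain a `|` character themselves.

```python
import re

def category_levels(path):
    """Split '|Books[283155]|Subjects[1000]|...' into level names without the numeric ids."""
    parts = [p for p in path.strip().split('|') if p]
    return [re.sub(r'\[\d+\]$', '', p) for p in parts]

# Category path copied from the metadata preview above
path = ("|Books[283155]|Subjects[1000]|Religion & Spirituality[22]"
        "|Christianity[12290]|Clergy[12360]|Preaching[12368]")
levels = category_levels(path)
print(levels[2])  # third level of the path, a candidate categorical variable
```

Which level to keep is a judgment call: shallow levels give few, large groups, while deeper levels give many small ones.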

# Import List of Edges

In [4]:
edges = gl.SFrame.read_csv('data/amazon0302.txt.gz',delimiter='\t',skiprows=4,header=False)



------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [5]:
# Rename the auto-generated columns X1/X2 to the source and destination of each edge
edges.rename({'X1':'src', 'X2':'dst'})
edges.head()

src,dst
0,1
0,2
0,3
0,4
0,5
1,0
1,2
1,4
1,5
1,15


In [6]:
edges

src,dst
0,1
0,2
0,3
0,4
0,5
1,0
1,2
1,4
1,5
1,15


In [None]:
import re
import gzip

# Anchored patterns, so free text elsewhere in a record cannot match a field by accident
ID_RE = re.compile(r'^Id:\s*([0-9]+)')  # capture the full id, not only its first digit
TITLE_RE = re.compile(r'^\s*title:\s*(.+)')
GROUP_RE = re.compile(r'^\s*group:\s*(.+)')
CATEGORIES_RE = re.compile(r'^\s*categories:\s*([0-9]+)')
REVIEWS_RE = re.compile(r'^\s*reviews:\s*total:\s*([0-9]+)\s*downloaded:\s*([0-9]+)\s*avg rating:\s*([0-9.]+)')

FIELDS = ['id', 'title', 'group', 'category',
          'reviews_total', 'reviews_downloaded', 'reviews_avg_rating']
columns = dict((name, []) for name in FIELDS)
rec = {}
in_category = False

with gzip.open('data/amazon-meta.txt.gz','r') as fin:
    for line in fin:
        #Id:   1
        id_m = ID_RE.match(line)
        if id_m:
            # New product: reset state so no field leaks in from the previous record.
            # The id is stored as an int to match the integer node ids in the edge list.
            rec = {'id': int(id_m.group(1))}
            in_category = False
            continue
        if in_category:
            # Keep only the first category path listed under "categories: N"
            rec['category'] = line.strip()
            in_category = False
            continue
        #title: Patterns of Preaching: A Sermon Sampler
        title_m = TITLE_RE.match(line)
        if title_m:
            rec['title'] = title_m.group(1).strip()
            continue
        #group: Book
        group_m = GROUP_RE.match(line)
        if group_m:
            rec['group'] = group_m.group(1).strip()
            continue
        #categories: 2
        categories_m = CATEGORIES_RE.match(line)
        if categories_m:
            # "categories: 0" has no path lines, so expect one only when the count is positive
            in_category = int(categories_m.group(1)) > 0
            continue
        #reviews: total: 2  downloaded: 2  avg rating: 5
        reviews_m = REVIEWS_RE.match(line)
        if reviews_m:
            rec['reviews_total'] = int(reviews_m.group(1))
            rec['reviews_downloaded'] = int(reviews_m.group(2))
            rec['reviews_avg_rating'] = float(reviews_m.group(3))
            # Emit only complete records; products without a category are skipped
            if all(name in rec for name in FIELDS):
                for name in FIELDS:
                    columns[name].append(rec[name])
            rec = {}

# Build the SFrame once; appending one row at a time is very slow on a file of this size
v = gl.SFrame(columns)


In [None]:
v
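
Once edges and vertex attributes are loaded, the comparison described in the assignment could be set up along the following lines. This is only a sketch in plain Python on hypothetical toy data: the edges, the group labels, and the helper names `in_degree_centrality` and `mean_by_group` are assumptions made for illustration, and using in-degree (how often a product is pointed to by a co-purchase edge) is one possible choice of degree centrality, not a result from the dataset.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: directed co-purchase edges and a 'group' label per node
toy_edges = [(0, 1), (0, 2), (1, 2), (3, 2), (3, 1)]
toy_group = {0: 'Book', 1: 'Book', 2: 'Music', 3: 'DVD'}

def in_degree_centrality(edges, nodes):
    """In-degree of each node divided by (n - 1), the maximum possible in-degree."""
    counts = Counter(dst for _, dst in edges)
    n = len(nodes)
    return {node: counts[node] / float(n - 1) for node in nodes}

def mean_by_group(centrality, group):
    """Average the centrality values of the nodes within each categorical group."""
    totals, sizes = defaultdict(float), Counter()
    for node, value in centrality.items():
        totals[group[node]] += value
        sizes[group[node]] += 1
    return {g: totals[g] / sizes[g] for g in totals}

centrality = in_degree_centrality(toy_edges, toy_group)
group_means = mean_by_group(centrality, toy_group)
print(group_means)
```

On the real data, the same grouping could be repeated with `group` or a category level as the categorical variable, and the group means compared against an outcome such as the average review rating.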