* In this file we create a matrix that gives, for each dealer, the distance from every starting node in the 2013 Q1 "purchased from" network
* We also have to identify these start nodes
    - This is a little counterintuitive: these start nodes are actually the end points of a VAT chain
    

* Step 0: Import GraphLab, SGraph, etc.
* Step 1: Import the node list and edge list from CSV files
* Step 2: Create an SGraph from the imported node list and edge list
* Step 3: Save the graph in a directory "Purchased_from_2013_q1". Then load the graph into a new SGraph
* Step 4: Find the starting nodes
    - We define the start nodes as the nodes which have indegree zero and outdegree greater than zero
    - Another issue is that we also have to include dealers that are starting nodes but have a self-loop. GraphLab, by design, counts their indegree as 1, so the indegree-zero filter misses them. As a result, we add them separately
* Step 5: Calculate the distance from the starting nodes
    - The output files were becoming huge, so we split the start nodes into batches of 100 and write one file per batch.
    - We will merge these files in Stata
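
To make Step 5 concrete, here is a minimal pure-Python sketch of the same idea on a toy graph. It is not the GraphLab code used in this notebook: the names `bfs_distances`, `distance_matrix` and `toy_edges` are hypothetical, the edges are assumed to be unweighted (each hop counts as 1), and unreachable dealers are marked with `None`, which is a choice of this sketch only.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Return {vertex: hop count} for every vertex reachable from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency.get(u, ()):
            if v not in dist:  # first visit is the shortest path in an unweighted graph
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def distance_matrix(edges, start_nodes):
    """Build {vertex: {start_node: distance or None}} from a directed edge list."""
    adjacency = {}
    vertices = set()
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
        vertices.update((src, dst))
    matrix = {v: {} for v in vertices}
    for s in start_nodes:
        dist = bfs_distances(adjacency, s)
        for v in vertices:
            matrix[v][s] = dist.get(v)  # None marks "unreachable" (sketch convention)
    return matrix


# Hypothetical toy edge list of (Source, Destination) pairs; not the real data.
toy_edges = [(1, 2), (2, 3), (4, 3), (3, 5)]
toy_matrix = distance_matrix(toy_edges, start_nodes=[1, 4])
```

Each start node contributes one column, which is why the real matrix grows quickly and has to be written out in batches.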


* Step 0: Import GraphLab, SGraph, etc.
* Step 1: Import the node list and edge list from CSV files

In [1]:
import graphlab 
from graphlab import SGraph, Vertex, Edge, SFrame, degree_counting

In [2]:
edge_data = SFrame.read_csv( 'H:/graphlab/2013_q1/PurchasedFrom/edge_list.csv')
node_data=SFrame.read_csv('H:/graphlab/2013_q1/PurchasedFrom/nodes.csv')

[INFO] GraphLab Create v1.8.3 started. Logging: C:\Users\ADMINI~1\AppData\Local\Temp\2\graphlab_server_1456896165.log.0


------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[long,long,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[long,long,long,long,long]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [3]:
node_data.head()

DealerTIN,indegree,outdegree,network_number,self_loop
427816,0,3,1,0
489461,2,2,1,0
427810,3,26,1,0
160199,0,9,1,0
427809,0,104,1,0
155446,7,8,1,0
252192,34,1,1,0
300114,20,1,1,0
427796,8,3,1,0
395109,15,10,1,0


* Step 2: Create an SGraph from the imported node list and edge list
* Step 3: Save the graph in a directory "Purchased_from_2013_q1". Then load the graph into a new SGraph

In [4]:
g = SGraph(vertices=node_data, edges=edge_data, vid_field='DealerTIN', src_field='Source', dst_field='Destination')
g.save('H:/graphlab/2013_q1/PurchasedFrom/Purchased_from_2013_q1')
new_graph = graphlab.load_sgraph('H:/graphlab/2013_q1/PurchasedFrom/Purchased_from_2013_q1')

In [5]:
new_graph.summary()

{'num_edges': 1209025L, 'num_vertices': 241350L}

* Step 4: Find starting nodes
    - We define these start nodes as the nodes which have indegree zero and outdegree greater than zero


In [6]:
deg = degree_counting.create(new_graph)  # count the degrees of each node
deg_graph = deg['graph']  # a new SGraph with degree data attached to each vertex

# keep vertices with in-degree zero, then drop those with no outgoing edges
sub_verts = deg_graph.get_vertices(fields={'in_degree': 0})
sub_verts = sub_verts[sub_verts['out_degree'] > 0]

In [8]:
print sub_verts.unique()
print len(sub_verts)

+--------+-----------+------------+--------------+
|  __id  | in_degree | out_degree | total_degree |
+--------+-----------+------------+--------------+
| 289734 |     0     |     3      |      3       |
| 490170 |     0     |     13     |      13      |
| 327437 |     0     |     2      |      2       |
| 299590 |     0     |     2      |      2       |
| 517345 |     0     |     2      |      2       |
| 444817 |     0     |     2      |      2       |
| 374527 |     0     |     2      |      2       |
| 85789  |     0     |     2      |      2       |
| 321040 |     0     |     4      |      4       |
| 324718 |     0     |     2      |      2       |
+--------+-----------+------------+--------------+
[53481 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
53481


- Another issue is that we also have to include dealers that are starting nodes but have a self-loop.
- GraphLab, by design, counts their indegree as 1, so the indegree-zero filter above misses them. As a result, we add them separately.
    - We created this list in Stata, using the condition indegree=0 & outdegree>0 & selfloop=1
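
As an illustration of the rule above, the following pure-Python sketch finds start nodes in a single pass by leaving self-loops out of the degree counts. It is not how this notebook does it (here the self-loop dealers come from a Stata-generated CSV). The function name `find_start_nodes` and the toy edges are hypothetical, and ignoring self-loops for the outdegree as well is an assumption of this sketch.

```python
def find_start_nodes(edges):
    """Return the sorted vertices with no incoming edge from another vertex
    and at least one outgoing edge to another vertex."""
    in_deg = {}
    out_deg = {}
    for src, dst in edges:
        for v in (src, dst):
            in_deg.setdefault(v, 0)
            out_deg.setdefault(v, 0)
        if src == dst:
            continue  # assumption: a self-loop counts towards neither degree
        out_deg[src] += 1
        in_deg[dst] += 1
    return sorted(v for v in in_deg if in_deg[v] == 0 and out_deg[v] > 0)


# Hypothetical toy edges: dealer 1 has a self-loop but no other incoming edge.
toy_edges_with_loops = [(1, 1), (1, 2), (2, 3), (4, 2), (5, 5)]
toy_start_nodes = find_start_nodes(toy_edges_with_loops)
```

In this toy graph dealer 1 still qualifies as a start node despite its self-loop, while dealer 5, which only trades with itself, does not.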

In [10]:
self_loop_nodes=SFrame.read_csv('H:/graphlab/2013_q1/PurchasedFrom/IndegreeZeroOutdegreeNonzeroWithSelfLoop.csv')
sub_verts=sub_verts.append(self_loop_nodes)

------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[long,long,long,long]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [11]:
print sub_verts.unique()
print len(sub_verts)

+--------+-----------+------------+--------------+
|  __id  | in_degree | out_degree | total_degree |
+--------+-----------+------------+--------------+
| 289734 |     0     |     3      |      3       |
| 490170 |     0     |     13     |      13      |
| 327437 |     0     |     2      |      2       |
| 299590 |     0     |     2      |      2       |
| 517345 |     0     |     2      |      2       |
| 444817 |     0     |     2      |      2       |
| 374527 |     0     |     2      |      2       |
| 85789  |     0     |     2      |      2       |
| 321040 |     0     |     4      |      4       |
| 324718 |     0     |     2      |      2       |
+--------+-----------+------------+--------------+
[53950 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
53950


In [12]:
sub_verts_id=sub_verts.select_column('__id')
sub_verts.export_csv('H:/graphlab/2013_q1/PurchasedFrom/StartNodes.csv')
sub_verts_id.save('H:/graphlab/2013_q1/PurchasedFrom/StartNodesSarray.csv')

In [13]:
parent_sframe = graphlab.SFrame()
i = 1  # 1-based position of the start node being processed
for StartNode in sub_verts_id:
    # shortest-path distance from StartNode to every node in the graph
    sp = graphlab.shortest_path.create(new_graph, source_vid=StartNode, verbose=False)
    sp_sframe = sp['distance']  # SFrame with columns '__id' and 'distance'
    sp_sframe.rename({'distance': 'd' + str(StartNode)})
    if not parent_sframe.column_names():
        parent_sframe = sp_sframe  # first column of a new batch
    else:
        parent_sframe = parent_sframe.join(sp_sframe, on='__id', how='outer')
    if i % 100 == 0:
        # write out a batch of 100 distance columns and start a new one
        parent_sframe.export_csv('H:/graphlab/2013_q1/PurchasedFrom/StartNodes/DistanceMatrixFromStartNodes'+str(i)+'.csv')
        parent_sframe = graphlab.SFrame()
    i = i + 1
# write out the final, partial batch
parent_sframe.export_csv('H:/graphlab/2013_q1/PurchasedFrom/StartNodes/DistanceMatrixFromStartNodes'+str(i)+'.csv')

IOError: Fail to write.

In [14]:
print i

25900


In [27]:
print sub_verts_id[25800]

188278


In [29]:
print len(sub_verts_id[25800:])

28150


In [17]:
print len(sub_verts_id)

53950


In [31]:
# Resume after the failed write: the files up to ...25800.csv were written,
# so continue with the 25,801st start node (0-based index 25800).
parent_sframe = graphlab.SFrame()
i = 25801  # 1-based position, so that batches still end on multiples of 100
for StartNode in sub_verts_id[25800:]:
    # shortest-path distance from StartNode to every node in the graph
    sp = graphlab.shortest_path.create(new_graph, source_vid=StartNode, verbose=False)
    sp_sframe = sp['distance']  # SFrame with columns '__id' and 'distance'
    sp_sframe.rename({'distance': 'd' + str(StartNode)})
    if not parent_sframe.column_names():
        parent_sframe = sp_sframe  # first column of a new batch
    else:
        parent_sframe = parent_sframe.join(sp_sframe, on='__id', how='outer')
    if i % 100 == 0:
        # write out a batch of 100 distance columns and start a new one
        parent_sframe.export_csv('H:/graphlab/2013_q1/PurchasedFrom/StartNodes/DistanceMatrixFromStartNodes'+str(i)+'.csv')
        parent_sframe = graphlab.SFrame()
    i = i + 1
# write out the final, partial batch
parent_sframe.export_csv('H:/graphlab/2013_q1/PurchasedFrom/StartNodes/DistanceMatrixFromStartNodes'+str(i)+'.csv')
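
The manual restart above could be avoided by making the batching resumable. The sketch below is one possible way to do that in plain Python, under stated assumptions: `batches` and `pending_batches` are hypothetical helpers, not GraphLab functions, and the labelling (1-based position of the last start node in the batch) is this sketch's convention.

```python
import os


def batches(ids, batch_size=100):
    """Yield (label, chunk), where label is the 1-based position of the
    last id in the chunk."""
    for start in range(0, len(ids), batch_size):
        chunk = ids[start:start + batch_size]
        yield start + len(chunk), chunk


def pending_batches(ids, out_dir, prefix, batch_size=100):
    """Return the batches whose output CSV does not exist yet in out_dir."""
    todo = []
    for label, chunk in batches(ids, batch_size):
        path = os.path.join(out_dir, prefix + str(label) + '.csv')
        if not os.path.exists(path):
            todo.append((label, chunk))
    return todo
```

After a failure, calling `pending_batches` again with the same arguments would return only the batches that still need to be computed. A partially written file from the failed export would count as existing, so it should be deleted before rerunning.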