## Large Network Example: Computing Network Projections on Retail Data

This notebook demonstrates how to compute a network projection of the Amazon Movies and TV reviews data set, which you can download from the [Amazon review data page](https://nijianmo.github.io/amazon/index.html#subsets).

Be warned: this is a large input graph, so running this example on a multi-machine computing cluster is recommended.

To learn more, visit the [projection documentation](https://go.documentation.sas.com/?cdcId=pgmsascdc&cdcVersion=default&docsetId=casmlnetwork&docsetTarget=casmlnetwork_network_syntax22.htm&locale=en).

In [1]:
import pandas as pd
import swat
import sys
sys.path.append(r"../../../common/python")
import visualization as vz
import cas_connection as cas

In [2]:
swat.options.cas.print_messages=True
caslib = 'myData'
subdir = 'data/snap/amazon/'
s = cas.reconnect(caslib=caslib, datasubdir=subdir)
s.loadActionSet('network')
s.loadActionSet('fedsql')

NOTE: 'myData' is now the active caslib.
NOTE: Cloud Analytic Services added the caslib 'myData'.
NOTE: Added action set 'network'.
NOTE: Added action set 'fedsql'.


### Define the Input Graph
The following cell loads the links of the bipartite network from a CSV file. The file has no header row, so the column names and types are supplied explicitly.

In [3]:
s.table.loadtable(
    caslib=caslib,
    path="Movies_and_TV.csv",
    casout="links",
    importOptions={
        "filetype":"csv",
        "delimiter":",",
        "getNames":False,
        "vars":[
            {"name":"from", "type":"CHAR", "length":16},
            {"name":"to",   "type":"CHAR", "length":16},
            {"name":"rating", "type":"DOUBLE"},
            {"name":"timestamp", "type":"DOUBLE"}
         ]
    }
)

NOTE: Cloud Analytic Services made the file Movies_and_TV.csv available as table LINKS in caslib myData.


In [4]:
s.CASTable("links").head()

Unnamed: 0,from,to,rating,timestamp
0,1527665,A3478QRKQDOPQ2,5.0,1362960000.0
1,1527665,A2VHSG6TZHU1OB,5.0,1361146000.0
2,1527665,A23EJWOW1TLENE,5.0,1358381000.0
3,1527665,A1KM9FNEJ8Q171,5.0,1357776000.0
4,1527665,A38LY2SSHVHRYB,4.0,1356480000.0
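To make the import options concrete, here is a minimal local sketch of the same idea: reading a headerless CSV while supplying the column names and types yourself. It uses only the Python standard library and a few rows copied from the preview above. The `read_links` helper and the `SAMPLE` string are illustrative names, not part of the notebook or of SWAT.

```python
import csv
import io

# A few rows copied from the table preview above; the real file is far larger.
SAMPLE = """\
1527665,A3478QRKQDOPQ2,5.0,1362960000.0
1527665,A2VHSG6TZHU1OB,5.0,1361146000.0
1527665,A23EJWOW1TLENE,5.0,1358381000.0
"""

# Same column order as the "vars" list passed to loadtable.
COLUMNS = ["from", "to", "rating", "timestamp"]


def read_links(text):
    """Parse headerless CSV text into a list of typed link records."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=COLUMNS)
    links = []
    for row in reader:
        links.append({
            "from": row["from"],                    # kept as text, like CHAR
            "to": row["to"],                        # kept as text, like CHAR
            "rating": float(row["rating"]),         # numeric, like DOUBLE
            "timestamp": float(row["timestamp"]),   # numeric, like DOUBLE
        })
    return links


links = read_links(SAMPLE)
print(links[0])
```

Keeping the two ID columns as text matters because node identifiers are labels, not quantities.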


### Create the Nodes Table

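In this example, pairs of nodes on one side of the bipartite network that share neighbors on the other side are of interest. The `projection` action requires a nodes data table with a column that indicates which side each node belongs to. The following statements generate that table, which has an identifier variable called *node* and a partition variable called *partitionFlag*.

The projection infers links between pairs of nodes whose partition value is 1, based on the neighbors they share among the nodes whose partition value is 0. Here, values from the `to` column get 1 and values from the `from` column get 0. In the rows previewed above, `from` holds product IDs and `to` holds reviewer IDs, so despite the table names `nodesUser` and `nodesProduct`, the inferred links connect pairs of reviewers who reviewed the same products.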

In [5]:
s.fedSql.execDirect(
    query='''
        create table nodesUser {options replace=True}  as
        select distinct a.from as "node", 0 as "partitionFlag"
        from links as a;
    '''
)
s.fedSql.execDirect(
    query='''
        create table nodesProduct {options replace=True}  as
        select distinct a.to as "node", 1 as "partitionFlag"
        from links as a;
    '''
)
s.datastep.runCode(
    code='''
        data nodes;
            set nodesUser nodesProduct;
        run;
    '''
)

NOTE: Table NODESUSER was created in caslib myData with 182032 rows returned.
NOTE: Table NODESPRODUCT was created in caslib myData with 3826085 rows returned.


Unnamed: 0,casLib,Name,Rows,Columns,casTable
0,myData,nodesUser,182032,2,"CASTable('nodesUser', caslib='myData')"
1,myData,nodesProduct,3826085,2,"CASTable('nodesProduct', caslib='myData')"

Unnamed: 0,casLib,Name,Rows,Columns,Append,Promoted,casTable
0,myData,nodes,4008117,2,,N,"CASTable('nodes', caslib='myData')"


In [6]:
s.CASTable("nodes").head()

Unnamed: 0,node,partitionFlag
0,0980066441,0.0
1,6303383378,0.0
2,6303473253,0.0
3,6304609493,0.0
4,B000006QQZ,0.0
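The nodes table construction can be sketched on a toy edge list in plain Python. This is an illustration of the logic only, not the FedSQL or DATA step code above. The `build_nodes` helper and the `L*`/`R*` identifiers are made up for the example.

```python
def build_nodes(links):
    """Return (node, partitionFlag) rows for a bipartite edge list.

    Distinct values from the first column get flag 0 and distinct values
    from the second column get flag 1, mirroring the two queries above.
    """
    from_ids = {src for src, _ in links}
    to_ids = {dst for _, dst in links}
    # In a bipartite network no identifier should appear on both sides.
    overlap = from_ids & to_ids
    if overlap:
        raise ValueError(f"IDs found in both columns: {sorted(overlap)}")
    return [(n, 0) for n in sorted(from_ids)] + [(n, 1) for n in sorted(to_ids)]


# Hypothetical toy links: (from, to)
toy_links = [("L1", "R1"), ("L1", "R2"), ("L2", "R1"), ("L2", "R2"), ("L3", "R2")]
nodes = build_nodes(toy_links)
print(nodes)
```

The overlap check is an extra safeguard added for the sketch. The notebook code concatenates the two tables without such a check.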


### Run the Projection Algorithm

In [7]:
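s.network.projection(
    links              = {"name": "links"},
    nodes              = {"name": "nodes"},
    # Keep only the projected links whose endpoints share at least 5 neighbors.
    outProjectionLinks = {"name": "links_out",
                          "replace": True,
                          "where": "commonNeighbors >= 5"},
    # Column in the nodes table that holds the 0/1 partition value.
    partition          = "partitionFlag",
    # Request the commonNeighbors column in the output table.
    commonNeighbors    = True,
    nThreads           = 4
)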

NOTE: The number of nodes in the input graph is 4008117.
NOTE: The number of links in the input graph is 8765568.
NOTE: Processing network projection using 484 threads across 121 machines.
NOTE: Processing projection used 69.45 (cpu: 10744.02) seconds.


Unnamed: 0,Name1,Label1,cValue1,nValue1
0,numNodes,Number of Nodes,4008117,4008117.0
1,numLinks,Number of Links,8765568,8765568.0
2,graphDirection,Graph Direction,Undirected,

Unnamed: 0,Name1,Label1,cValue1,nValue1
0,problemType,Problem Type,Projection,
1,status,Solution Status,OK,
2,cpuTime,CPU Time,10744.02,10744.02
3,realTime,Real Time,69.45,69.452025

Unnamed: 0,casLib,Name,Label,Rows,Columns,casTable
0,myData,links_out,,3544940,3,"CASTable('links_out', caslib='myData')"
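To illustrate what the projection computes, the following is a small in-memory sketch of a bipartite projection with common-neighbor counts and a minimum-count filter. It is a conceptual illustration on toy data, not the algorithm that the `network` action set uses. The `project` function, its `min_common` parameter, and the toy identifiers are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def project(links, flags, min_common=1):
    """Project a bipartite graph onto its flag-1 nodes.

    Two flag-1 nodes are linked when they share at least `min_common`
    flag-0 neighbors. Returns {(node_a, node_b): common_neighbor_count}.
    """
    # Collect, for every flag-0 node, the set of flag-1 nodes adjacent to it.
    neighbors = defaultdict(set)
    for a, b in links:
        if flags[a] == 0 and flags[b] == 1:
            neighbors[a].add(b)
        elif flags[a] == 1 and flags[b] == 0:
            neighbors[b].add(a)

    # Every pair of flag-1 nodes adjacent to the same flag-0 node gains
    # one common neighbor.
    common = Counter()
    for members in neighbors.values():
        for u, v in combinations(sorted(members), 2):
            common[(u, v)] += 1

    return {pair: n for pair, n in common.items() if n >= min_common}


# Hypothetical toy graph: L* nodes have flag 0, R* nodes have flag 1.
toy_links = [("L1", "R1"), ("L1", "R2"), ("L2", "R1"),
             ("L2", "R2"), ("L3", "R2"), ("L3", "R3")]
toy_flags = {"L1": 0, "L2": 0, "L3": 0, "R1": 1, "R2": 1, "R3": 1}

all_pairs = project(toy_links, toy_flags)
strong_pairs = project(toy_links, toy_flags, min_common=2)
print(all_pairs)
print(strong_pairs)
```

In this sketch a flag-0 node with d neighbors contributes d(d-1)/2 candidate pairs, which is why projections of large graphs can be expensive and why a threshold such as `commonNeighbors >= 5` helps keep the output manageable.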


In [8]:
output_rows = len(s.CASTable("links_out"))
print(f"There are {output_rows:,} rows in the projected links table.")

There are 3,544,940 rows in the projected links table.


View the first five rows of the projected links table. Each row is a pair of nodes that share at least five common neighbors:

In [9]:
s.CASTable("links_out").head()

Unnamed: 0,from,to,commonNeighbors
0,A10O1QPYEGSYBF,A2YUA3H1LLU53Z,5.0
1,A10O1QPYEGSYBF,AV6QDP8Q0ONK4,5.0
2,A10O32IJF4LY1V,A2EDZH51XHFA9B,5.0
3,A10O32IJF4LY1V,A8DI0COTCMRDV,6.0
4,A10O32IJF4LY1V,AIMR915K4YCN,7.0


In [10]:
s.terminate()