# Converting CSV files to graph-tool graphs

To avoid converting the CSV files to a graph every time we need it, we build the graph once and store it on disk. This saves computation time whenever we work with the graph later.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from graph_tool.all import *
from graph_tool.draw import *

## Preprocessing

Both the patents and the citations data sheets are tab-separated files of several gigabytes each, and the categorymap
clocks in at about 200MB.
As a first step in the cleaning pipeline, we preprocessed the files on the command line to
reduce their size by extracting only the relevant columns.

More specifically, [awk](https://www.gnu.org/software/gawk/manual/gawk.html) is a handy stream
processing language with which a tab-separated file can be read line by line, processed, and written to
a new file, preferably a comma-separated one.

#### Example

```shell
printf 'Alice\tBob\tEve\n' | awk -F '\t' '{ print $1","$3 }' > alice_eve.csv
cat alice_eve.csv # Alice,Eve
```


In our case, we extract the patent_id and citation_id columns from citations_raw.tsv, the id and data columns
from patents_raw.tsv, and the patent_id and subcategory_id columns from categorymap.tsv.
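The same column extraction can also be sketched in Python with the standard `csv` module. This is only an illustration, not the command we ran: the helper name `extract_columns` is hypothetical, and the sketch assumes that the raw files start with a header row containing the column names.

```python
import csv

def extract_columns(src_path, dst_path, columns):
    """Stream a tab-separated file and write only `columns` as CSV.

    Hypothetical helper; assumes the first line of the input is a header row.
    """
    with open(src_path, newline='', encoding='utf-8') as src, \
         open(dst_path, 'w', newline='', encoding='utf-8') as dst:
        reader = csv.DictReader(src, delimiter='\t')
        # extrasaction='ignore' silently drops every column not listed in `columns`.
        writer = csv.DictWriter(dst, fieldnames=columns,
                                extrasaction='ignore', lineterminator='\n')
        writer.writeheader()
        for row in reader:  # one row at a time, so memory use stays flat
            writer.writerow(row)

# Possible usage with the file and column names mentioned above:
# extract_columns('citations_raw.tsv', 'citations.csv', ['patent_id', 'citation_id'])
```

Like awk, this reads the input line by line, so it should also cope with files that do not fit into memory.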

This reduces the file sizes as follows:
 - citations_raw.tsv **7.71GB** to **1.45GB** (~81% reduction)
 - patents_raw.tsv **4.9GB** to **121MB** (~97% reduction)
 - categorymap_raw.tsv **204.2MB** to **61.3MB** (~70% reduction)
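The percentages follow from `1 - after / before`. A minimal sketch of the arithmetic, assuming 1 GB = 1000 MB for the unit conversion (the function name is ours):

```python
def reduction_percent(before_mb, after_mb):
    """Return the relative size reduction in percent."""
    return (1 - after_mb / before_mb) * 100

# Sizes taken from the list above, converted to MB (assuming 1 GB = 1000 MB).
sizes_mb = {
    'citations': (7710, 1450),
    'patents': (4900, 121),
    'categorymap': (204.2, 61.3),
}

for name, (before, after) in sizes_mb.items():
    print(f'{name}: {reduction_percent(before, after):.1f}% smaller')
```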

In [4]:
DATA = './data/'

In [5]:
citations = pd.read_csv(DATA + 'citations.csv', nrows=10)

In [6]:
citations.head()

Unnamed: 0,patent_id,citation_id
0,9009250,8127342
1,9643605,5471515
2,5354551,4875247
3,D786922,D718330
4,D490798,D190749


In [5]:
citations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 2 columns):
patent_id      10 non-null object
citation_id    10 non-null object
dtypes: object(2)
memory usage: 240.0+ bytes


In [6]:
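# Drop rows with missing IDs.
citations.dropna(inplace=True)

# Keep only IDs that consist of digits with an optional leading "D".
citations = citations[citations.patent_id.str.match(r'([D]\d+$)|(^\d+$)')]
citations = citations[citations.citation_id.str.match(r'([D]\d+$)|(^\d+$)')]

# str.strip() returns a new Index, so assign it back for it to take effect.
citations.columns = citations.columns.str.strip()
citations.columns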

Index(['patent_id', 'citation_id'], dtype='object')
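To see which IDs survive the filter, here is a small standalone sketch using Python's `re` module, which pandas' `str.match` builds on. The pattern `D?\d+$` is a shorter form that should be equivalent to the one in the cell above when matching from the start of the string. The first two sample IDs come from the data shown earlier; the others are made-up examples of values that would be dropped.

```python
import re

# Optional leading "D", then digits only until the end of the string.
ID_PATTERN = re.compile(r'D?\d+$')

# '9009250' and 'D786922' appear in the data; the rest are made-up rejects.
samples = ['9009250', 'D786922', 'RE12345', '12A45', '']

# re.match anchors at the start of the string, like Series.str.match.
valid = [s for s in samples if ID_PATTERN.match(s)]
print(valid)  # ['9009250', 'D786922']
```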

In [7]:
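# Parse each ID as a base-16 number: "D" is a valid hex digit, so IDs with a "D" prefix also map to integers.
citations['patent_id'] = np.array(list(map(lambda i: int(i, 16), citations.patent_id)), dtype=np.int32)
citations['citation_id'] = np.array(list(map(lambda i: int(i, 16), citations.citation_id)), dtype=np.int32)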

In [8]:
citations.head()

Unnamed: 0,patent_id,citation_id
0,151032400,135426882
1,157562373,88544533
2,87377233,75977287
3,225995042,225542960
4,222889880,219744073
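The mapping works because every character in our IDs (the digits and the letter `D`) is a valid hexadecimal digit. The following sketch shows the round trip for two IDs from the table above; the helper names `encode_id` and `decode_id` are ours. It assumes that IDs have no leading zeros, since those would be lost when converting back.

```python
def encode_id(patent_id: str) -> int:
    """Parse a patent ID such as 'D786922' as a base-16 number."""
    return int(patent_id, 16)

def decode_id(value: int) -> str:
    """Recover the ID string from its integer encoding (uppercase hex)."""
    return format(value, 'X')

INT32_MAX = 2**31 - 1  # largest value that fits into the int32 column

for pid in ['9009250', 'D786922']:
    n = encode_id(pid)
    assert n <= INT32_MAX
    assert decode_id(n) == pid
    print(pid, '->', n)
# 9009250 -> 151032400
# D786922 -> 225995042
```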


In [None]:
citations.info()

### Saving mapped citations to CSV

In [None]:
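# index=False keeps the row index out of the file, so reading it back yields only the two ID columns.
citations.to_csv('citations_processed.csv', index=False)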

## Creating the graph

graph-tool identifies vertices by an index from 0 to N-1, where N is the number of vertices in the graph. To know which vertex belongs to which patent, we need to store the ID of the respective patent in a property of each vertex. Passing `hashed=True` to `add_edge_list` does this for us: the IDs in the edge list are mapped to vertex indices in the order in which they are encountered, and the call returns a vertex property map holding the original ID of each vertex.

In [None]:
graph = Graph()
props = graph.add_edge_list(citations.values, hashed=True)
#graph.vertex_properties["id"] = props
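The idea behind `hashed=True` can be illustrated in plain Python. This is a conceptual sketch and not graph-tool's implementation: the function `hash_edge_list` is hypothetical, and the list `ids` merely plays the role of the property map returned as `props`.

```python
def hash_edge_list(edges):
    """Map arbitrary vertex IDs to indices 0..N-1 in order of first appearance."""
    index_of = {}        # original ID -> vertex index
    ids = []             # vertex index -> original ID
    indexed_edges = []
    for source, target in edges:
        pair = []
        for vid in (source, target):
            if vid not in index_of:
                index_of[vid] = len(ids)  # next free vertex index
                ids.append(vid)
            pair.append(index_of[vid])
        indexed_edges.append(tuple(pair))
    return indexed_edges, ids

# First three mapped citation pairs from the table above.
edges = [(151032400, 135426882), (157562373, 88544533), (87377233, 75977287)]
indexed_edges, ids = hash_edge_list(edges)
print(indexed_edges)  # [(0, 1), (2, 3), (4, 5)]
print(ids[0])         # 151032400
```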

In [None]:
prop_dict = {idx: val for idx, val in enumerate(props.get_array())}
prop_dict

In [None]:
a = np.empty(0, dtype=np.int32)
a = np.append(a, props.get_array())
a = np.append(a, props.get_array())
b = PropertyArray(a, graph)
b

In [None]:
y = props.get_array()
props.get_array()[len(y)-2:]

In [None]:
#graph.vertex_properties['id'] = PropertyMap(prop_dict, graph, "v")
x = graph.new_vertex_property('int', props.get_array())

In [None]:
graph.vertex_properties['id'] = x

In [None]:
graph.vp.id['0']

In [None]:
#pos = arf_layout(graph, max_iter=0)
#graph_draw(graph, pos=pos, vertex_text=prop, vertex_font_size=10, inline=True)

## Saving the graph to disk

In [None]:
graph.save("citations_graph.xml.gz")

In [18]:
graph = load_graph("citations_graph.xml.gz")

In [28]:
graph.vp.id['0']

151032400

In [29]:
graph

<Graph object, directed, with 20 vertices and 10 edges at 0x7f50cf7f12b0>

## Trying out some graph algorithms

In [33]:
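# Look up a vertex by the patent ID stored in its "id" property.
#shortest_path(graph, graph.vertex(234), graph.vertex(81273))
#'{0:0X}'.format(int(graph.vp.id['234']))
# find_vertex returns a list of matching vertices; it is empty when no vertex
# carries this ID, in which case indexing it raises an IndexError.
v = find_vertex(graph, graph.vertex_properties.id, 143216977)
v2 = v[0]
int(v2)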

IndexError: list index out of range

In [None]:
# In-degree of every vertex; get_vertices() returns all vertex indices as an array.
in_deg = graph.get_in_degrees(graph.get_vertices())
in_deg.max()
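For intuition, the in-degree computation can be reproduced without graph-tool on the five mapped citation pairs shown in the `citations.head()` output. This is a plain-Python sketch that assumes each edge points from `patent_id` to `citation_id`, which is the column order passed to `add_edge_list`.

```python
from collections import Counter

# First five mapped citation pairs: (patent_id, citation_id).
edges = [
    (151032400, 135426882),
    (157562373, 88544533),
    (87377233, 75977287),
    (225995042, 225542960),
    (222889880, 219744073),
]

# The in-degree of a vertex is the number of edges that point to it.
in_degree = Counter(target for _, target in edges)
max_in_degree = max(in_degree.values())

# Distribution: how many vertices have each in-degree.
# Vertices that never appear as a target have in-degree 0.
vertices = {v for edge in edges for v in edge}
distribution = Counter(in_degree.get(v, 0) for v in vertices)

print(max_in_degree)       # 1
print(dict(distribution))
```

On this tiny sample every cited patent is cited exactly once, so the maximum in-degree is 1.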