# Converting CSV files to graph-tool graphs

To avoid rebuilding the graph from the CSV files every time we need it, we build it once and store it on disk. This saves computation time whenever we work with the graph later.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from graph_tool.all import *
from graph_tool.draw import *

## Preprocessing

Both the patents and the citations data sheets are tab-separated files of several gigabytes each, and the categorymap
clocks in at about 200MB.
As a first step in the cleaning pipeline, we preprocessed the files on the command line to
reduce their size by extracting only the relevant columns.

More specifically, [awk](https://www.gnu.org/software/gawk/manual/gawk.html) is a handy stream-processing
language that reads a tab-separated file line by line, processes each line and writes the result to
a new file, preferably a comma-separated one.

#### Example

```shell
# printf always expands \t to a tab, whereas a plain echo may print it literally
printf 'Alice\tBob\tEve\n' | awk -F '\t' '{ print $1","$3 }' > alice_eve.csv
cat alice_eve.csv # Alice,Eve
```


In our case, we extract the patent_id and citation_id columns from citations_raw.tsv, the id and data columns
from patents_raw.tsv, and the patent_id and subcategory_id columns from categorymap.tsv.
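
For readers without awk at hand, the same column extraction can be sketched in Python with the standard csv module. This is an illustration only, not the pipeline we ran: the function name extract_columns, the assumption that the file starts with a header row, and the tiny in-memory sample (including its sequence column) are all made up for the example.

```python
import csv
import io

def extract_columns(tsv_stream, csv_stream, columns):
    """Copy the named columns from a tab-separated stream to a comma-separated one."""
    reader = csv.DictReader(tsv_stream, delimiter='\t')
    writer = csv.DictWriter(csv_stream, fieldnames=columns,
                            extrasaction='ignore', lineterminator='\n')
    writer.writeheader()
    for row in reader:  # one row at a time, so memory use stays flat
        writer.writerow(row)

# Hypothetical sample standing in for a raw TSV file with a header row
sample = io.StringIO(
    "patent_id\tsequence\tcitation_id\n"
    "9000000\t0\t8000000\n"
    "D700000\t1\tD600000\n"
)
out = io.StringIO()
extract_columns(sample, out, ['patent_id', 'citation_id'])
print(out.getvalue())
```

With real files, the two streams would be file objects opened with open(..., newline='').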

In terms of filesize reduction we get the following improvements:
 - citations_raw.tsv **7.71GB** to **1.45GB** (~81% reduction)
 - patents_raw.tsv **4.9GB** to **121MB** (~97% reduction)
 - categorymap_raw.tsv **204.2MB** to **61.3MB** (~70% reduction)
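
The percentages above can be checked with a few lines of Python. The sketch assumes 1GB = 1000MB; the helper name reduction_percent is introduced here for illustration.

```python
def reduction_percent(before_mb, after_mb):
    """Relative size reduction in percent."""
    return 100.0 * (1.0 - after_mb / before_mb)

# (size before, size after) in MB, taken from the list above
sizes_mb = {
    'citations_raw.tsv': (7710.0, 1450.0),
    'patents_raw.tsv': (4900.0, 121.0),
    'categorymap_raw.tsv': (204.2, 61.3),
}

for name, (before, after) in sizes_mb.items():
    print(f'{name}: {reduction_percent(before, after):.1f}% smaller')
```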

In [2]:
DATA = './'

In [None]:
citations = pd.read_csv(DATA + 'citations.csv')

In [None]:
citations.head()

In [None]:
citations.info()

In [None]:
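# Strip whitespace from the column names before accessing columns by name
citations.columns = citations.columns.str.strip()

citations.dropna(inplace=True)

# Keep only rows whose ids are purely numeric or a 'D' followed by digits
id_pattern = r'^D?\d+$'
citations = citations[citations.patent_id.str.match(id_pattern)]
citations = citations[citations.citation_id.str.match(id_pattern)]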

In [None]:
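# Read the ids as base-16 so that the 'D' prefix counts as a digit; formatting the integer as hex recovers the id
# int32 only holds values up to 0x7FFFFFFF, so longer ids would need a wider dtype
citations['patent_id'] = np.array([int(i, 16) for i in citations.patent_id], dtype=np.int32)
citations['citation_id'] = np.array([int(i, 16) for i in citations.citation_id], dtype=np.int32)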
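
The filtering and the hexadecimal mapping can be tried out on a handful of ids without pandas. This is a minimal sketch: the sample ids and the helper names encode and decode are assumptions made for illustration.

```python
import re

# Accepts the same ids as the filter above: digits only, or 'D' followed by digits
ID_PATTERN = re.compile(r'D?\d+$')

def encode(patent_id):
    """Map an id such as 'D257752' to an integer by reading it as hexadecimal."""
    return int(patent_id, 16)

def decode(number):
    """Recover the id string from its integer encoding."""
    return format(number, 'X')

# Made-up sample ids; only the first two match the pattern
raw_ids = ['4683202', 'D257752', 'RE43633', 'H001234', 'PP12345']
kept = [i for i in raw_ids if ID_PATTERN.match(i)]
encoded = [encode(i) for i in kept]
decoded = [decode(n) for n in encoded]
print(kept, encoded, decoded)
```

One caveat of this encoding: an id with leading zeros does not survive the round trip, because the zeros are lost when the string is parsed as a number.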

In [None]:
citations.head()

In [None]:
citations.info()

### Saving mapped citations to CSV

In [None]:
# index=False keeps the DataFrame's row index out of the file
citations.to_csv('citations_processed.csv', index=False)

## Creating the graph

Vertices in graph-tool are identified by an index from 0 to N-1, where N is the number of vertices in the graph. To know which vertex is which, we need to store the id of the respective patent in a property of each vertex. Passing hashed=True to add_edge_list does this for us: it maps each distinct id in the edge list to a vertex index and returns a vertex property map holding the original ids.

In [None]:
graph = Graph()
props = graph.add_edge_list(citations.values, hashed=True)
graph.vertex_properties["id"] = props
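
Conceptually, the id-to-index mapping works roughly like the pure-Python sketch below. It is not graph-tool's implementation; the function name hash_edges and the toy edge list are assumptions made for illustration.

```python
def hash_edges(edges):
    """Map arbitrary vertex ids to indices 0..N-1 in order of first appearance."""
    index_of = {}       # original id -> vertex index
    ids = []            # vertex index -> original id (the role of the property map)
    indexed_edges = []
    for source, target in edges:
        pair = []
        for vertex_id in (source, target):
            if vertex_id not in index_of:
                index_of[vertex_id] = len(ids)
                ids.append(vertex_id)
            pair.append(index_of[vertex_id])
        indexed_edges.append(tuple(pair))
    return indexed_edges, ids

# Toy edge list with made-up ids
edges = [(500, 300), (700, 300), (700, 500)]
indexed_edges, ids = hash_edges(edges)
print(indexed_edges)
print(ids)
```

Here ids[i] plays the role of graph.vp.id for the vertex with index i.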

In [None]:
graph.vp.id['0']

In [None]:
#pos = arf_layout(graph, max_iter=0)
#graph_draw(graph, pos=pos, vertex_text=props, vertex_font_size=10, inline=True)

## Saving the graph to disk

In [None]:
graph.save("citations_graph.xml.gz")

In [None]:
g2 = load_graph("citations_graph.xml.gz")

In [None]:
g2.vp.id['0']

## Trying out some graph algorithms

In [None]:
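# Shortest path between two vertices, returned as (vertex list, edge list)
path_vertices, path_edges = shortest_path(graph, graph.vertex(234), graph.vertex(81273))

# Format the stored integer id as hex to recover the original patent id
patent_id = '{0:0X}'.format(int(graph.vp.id['234']))

# Look up the vertex that carries a given id
v = find_vertex(graph, graph.vertex_properties.id, '127345446')
v2 = v[0]
int(v2)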

In [None]:
# In-degree of every vertex; get_vertices() avoids a Python-level loop over the graph
in_deg = graph.get_in_degrees(graph.get_vertices())
in_deg.max()
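
Beyond the maximum, the full in-degree distribution is often informative for a citation graph. The sketch below computes it in pure Python on a made-up edge list. It does not use graph-tool, and the function name in_degree_distribution is an assumption made for illustration.

```python
from collections import Counter

def in_degree_distribution(edges):
    """Return {in-degree: number of vertices with that in-degree}."""
    in_degree = Counter()
    vertices = set()
    for source, target in edges:
        vertices.update((source, target))
        in_degree[target] += 1
    # Vertices that are never a target have in-degree 0
    return Counter(in_degree[v] for v in vertices)

# Toy edge list: vertex 1 has three incoming edges, vertex 0 has one
edges = [(0, 1), (2, 1), (2, 0), (3, 1)]
dist = in_degree_distribution(edges)
print(sorted(dist.items()))
```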