# Setting up the Knowledge Graph Datasets

In [1]:
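from pyspark.sql import SparkSession

from aips import get_engine
from aips.spark.dataframe import from_csv
import aips.indexer

# Start (or reuse) the Spark session used to load the CSV datasets
spark = SparkSession.builder.appName("AIPS").getOrCreate()
# Get the default search engine that the collections will be indexed into
engine = get_engine()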

## Index the Jobs Dataset into the Search Engine

In [None]:
aips.indexer.build_collection(engine, "jobs")

## Index the StackExchange Datasets: health, cooking, scifi, travel, devops

In [3]:
datasets = ["health", "cooking", "scifi", "travel", "devops"]
for dataset in datasets:
    aips.indexer.build_collection(engine, dataset, log=True)

Collection [health] exists? True
Documents expected: 12892
Documents found: 12892
Collection [health] is healthy
Collection [cooking] exists? True
Documents expected: 79324
Documents found: 17647
Reindexing [cooking] collection
Wiping "cooking" collection
Creating "cooking" collection
Status: Success
File [data/cooking/posts.csv] exists? True
Loading data/cooking/posts.csv
Schema: 
root
 |-- post_type_id: integer (nullable = true)
 |-- accepted_answer_id: integer (nullable = true)
 |-- parent_id: integer (nullable = true)
 |-- creation_date: timestamp (nullable = true)
 |-- deletion_date: string (nullable = true)
 |-- score: integer (nullable = true)
 |-- view_count: integer (nullable = true)
 |-- body: string (nullable = true)
 |-- owner_user_id: integer (nullable = true)
 |-- owner_display_name: string (nullable = true)
 |-- last_editor_user_id: integer (nullable = true)
 |-- last_editor_display_name: string (nullable = true)
 |-- last_edit_date: timestamp (nullable = true)
 |-- last

100%|██████████| 6.0/6.0 [00:02<00:00,  2.09it/s]   


Extracting [data/repositories/scifi/scifi.tgz] to [data/scifi]
Loading data/scifi/posts.csv
Schema: 
root
 |-- post_type_id: integer (nullable = true)
 |-- accepted_answer_id: integer (nullable = true)
 |-- parent_id: integer (nullable = true)
 |-- creation_date: timestamp (nullable = true)
 |-- deletion_date: string (nullable = true)
 |-- score: integer (nullable = true)
 |-- view_count: integer (nullable = true)
 |-- body: string (nullable = true)
 |-- owner_user_id: integer (nullable = true)
 |-- owner_display_name: string (nullable = true)
 |-- last_editor_user_id: integer (nullable = true)
 |-- last_editor_display_name: string (nullable = true)
 |-- last_edit_date: timestamp (nullable = true)
 |-- last_activity_date: timestamp (nullable = true)
 |-- title: string (nullable = true)
 |-- tags: string (nullable = true)
 |-- answer_count: integer (nullable = true)
 |-- comment_count: integer (nullable = true)
 |-- favorite_count: integer (nullable = true)
 |-- closed_date: timestamp (

100%|██████████| 6.0/6.0 [00:01<00:00,  4.21it/s]    


Extracting [data/repositories/travel/travel.tgz] to [data/travel]
Loading data/travel/posts.csv
Schema: 
root
 |-- post_type_id: integer (nullable = true)
 |-- accepted_answer_id: integer (nullable = true)
 |-- parent_id: integer (nullable = true)
 |-- creation_date: timestamp (nullable = true)
 |-- deletion_date: string (nullable = true)
 |-- score: integer (nullable = true)
 |-- view_count: integer (nullable = true)
 |-- body: string (nullable = true)
 |-- owner_user_id: integer (nullable = true)
 |-- owner_display_name: string (nullable = true)
 |-- last_editor_user_id: integer (nullable = true)
 |-- last_editor_display_name: string (nullable = true)
 |-- last_edit_date: timestamp (nullable = true)
 |-- last_activity_date: timestamp (nullable = true)
 |-- title: string (nullable = true)
 |-- tags: string (nullable = true)
 |-- answer_count: integer (nullable = true)
 |-- comment_count: integer (nullable = true)
 |-- favorite_count: integer (nullable = true)
 |-- closed_date: timesta

100%|██████████| 6.0/6.0 [00:00<00:00, 25.62it/s]    


Extracting [data/repositories/devops/devops.tgz] to [data/devops]
Loading data/devops/posts.csv
Schema: 
root
 |-- post_type_id: integer (nullable = true)
 |-- accepted_answer_id: integer (nullable = true)
 |-- parent_id: integer (nullable = true)
 |-- creation_date: timestamp (nullable = true)
 |-- deletion_date: string (nullable = true)
 |-- score: integer (nullable = true)
 |-- view_count: integer (nullable = true)
 |-- body: string (nullable = true)
 |-- owner_user_id: integer (nullable = true)
 |-- owner_display_name: string (nullable = true)
 |-- last_editor_user_id: integer (nullable = true)
 |-- last_editor_display_name: string (nullable = true)
 |-- last_edit_date: timestamp (nullable = true)
 |-- last_activity_date: timestamp (nullable = true)
 |-- title: string (nullable = true)
 |-- tags: string (nullable = true)
 |-- answer_count: integer (nullable = true)
 |-- comment_count: integer (nullable = true)
 |-- favorite_count: integer (nullable = true)
 |-- closed_date: timesta
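
The log output above suggests that, with `log=True`, each collection is checked for existence and for a matching document count, and is wiped and reindexed when the counts differ (as happened for `cooking`). The following is a minimal sketch of that decision only. The `CollectionStatus` class and both functions are hypothetical names introduced for illustration; they are not part of `aips`, and the real `build_collection` may work differently.

```python
from dataclasses import dataclass


@dataclass
class CollectionStatus:
    """Hypothetical summary of a collection's state (not an aips class)."""
    name: str
    exists: bool
    expected: int  # documents the source dataset should produce
    found: int     # documents currently in the search engine


def needs_reindex(status: CollectionStatus) -> bool:
    """Return True when the collection is missing or its document count is off."""
    return (not status.exists) or status.found != status.expected


def describe(status: CollectionStatus) -> list[str]:
    """Build log lines in the same style as the output shown above."""
    lines = [
        f"Collection [{status.name}] exists? {status.exists}",
        f"Documents expected: {status.expected}",
        f"Documents found: {status.found}",
    ]
    if needs_reindex(status):
        lines.append(f"Reindexing [{status.name}] collection")
    else:
        lines.append(f"Collection [{status.name}] is healthy")
    return lines


# Counts taken from the log output above
for status in [
    CollectionStatus("health", True, 12892, 12892),
    CollectionStatus("cooking", True, 79324, 17647),
]:
    print("\n".join(describe(status)))
```

Comparing counts is a cheap way to detect an interrupted or partial indexing run without re-reading the source files, which is presumably why a collection that is already complete is skipped.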

In [4]:
aips.indexer.build_collection(engine, "stackexchange")

Wiping "stackexchange" collection
Creating "stackexchange" collection
Status: Success
Loading data/health/posts.csv
Schema: 
root
 |-- post_type_id: integer (nullable = true)
 |-- accepted_answer_id: integer (nullable = true)
 |-- parent_id: integer (nullable = true)
 |-- creation_date: timestamp (nullable = true)
 |-- deletion_date: string (nullable = true)
 |-- score: integer (nullable = true)
 |-- view_count: integer (nullable = true)
 |-- body: string (nullable = true)
 |-- owner_user_id: integer (nullable = true)
 |-- owner_display_name: string (nullable = true)
 |-- last_editor_user_id: integer (nullable = true)
 |-- last_editor_display_name: string (nullable = true)
 |-- last_edit_date: timestamp (nullable = true)
 |-- last_activity_date: timestamp (nullable = true)
 |-- title: string (nullable = true)
 |-- tags: string (nullable = true)
 |-- answer_count: integer (nullable = true)
 |-- comment_count: integer (nullable = true)
 |-- favorite_count: integer (nullable = true)
 |-- 

<engines.solr.SolrCollection.SolrCollection at 0x7f18afbd1270>

## Dual-Index the Datasets into Solr for the Semantic Knowledge Graph (SKG)

In [5]:
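# Request the Solr engine explicitly, regardless of the default engine
solr_engine = get_engine("solr")
# Index every collection built above into Solr as well
aips.indexer.build_collection(solr_engine, "jobs")
for dataset in datasets:
    aips.indexer.build_collection(solr_engine, dataset)
aips.indexer.build_collection(solr_engine, "stackexchange")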

<engines.solr.SolrCollection.SolrCollection at 0x7f18ac3421a0>
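
The cell above repeats the same call for `jobs`, each StackExchange dataset, and `stackexchange`. One way to express this as a single loop is sketched below. `index_all` is a hypothetical helper, not part of `aips`; it takes the build step as a callable so the sketch runs on its own, and it skips names it has already handled.

```python
from typing import Callable, Iterable


def index_all(build: Callable[[str], object], names: Iterable[str]) -> dict[str, object]:
    """Call `build` once per unique collection name, preserving order."""
    results: dict[str, object] = {}
    for name in names:
        if name in results:
            continue  # already indexed in this run
        results[name] = build(name)
    return results


# Stand-in build step so the sketch is runnable without a search engine
datasets = ["health", "cooking", "scifi", "travel", "devops"]
built = index_all(lambda name: f"built:{name}", ["jobs", *datasets, "stackexchange"])
print(list(built))
```

In the notebook, the stand-in lambda would be replaced with something like `lambda name: aips.indexer.build_collection(solr_engine, name)`, assuming `solr_engine` is defined as in the cell above.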

## Success!

You have now indexed several large text datasets. In the next notebook, we will use Semantic Knowledge Graphs to explore the rich graph of semantic relationships embedded in those documents, traversing and ranking arbitrary relationships within the domains of our datasets in real time.

Up next: [Working with Semantic Knowledge Graphs](3.semantic-knowledge-graph.ipynb)