# Setting up the Knowledge Graph Datasets

In [1]:
from aips import get_engine
import aips.indexer
from aips.spark import get_spark_session
from aips.spark.dataframe import from_csv

spark = get_spark_session()
engine = get_engine()

## Index the Jobs Dataset into the Search Engine

In [2]:
jobs_collection = aips.indexer.build_collection(engine, "jobs")

Wiping "jobs" collection
Creating "jobs" collection
Status: Success
Loading data/jobs/jobs.csv
Schema: 
root
 |-- job_title: string (nullable = true)
 |-- job_description: string (nullable = true)
 |-- job_type: string (nullable = true)
 |-- category: string (nullable = false)
 |-- job_location: string (nullable = true)
 |-- job_city: string (nullable = true)
 |-- job_state: string (nullable = true)
 |-- job_country: string (nullable = true)
 |-- job_zip_code: string (nullable = true)
 |-- job_address: string (nullable = true)
 |-- min_salary: string (nullable = true)
 |-- max_salary: string (nullable = true)
 |-- salary_period: string (nullable = true)
 |-- apply_url: string (nullable = true)
 |-- apply_email: string (nullable = true)
 |-- num_employees: string (nullable = true)
 |-- industry: string (nullable = true)
 |-- company_name: string (nullable = true)
 |-- company_email: string (nullable = true)
 |-- company_website: string (nullable = true)
 |-- company_phone: string (nulla

## Index the StackExchange Datasets: health, cooking, scifi, travel, devops

In [3]:
datasets = ["health", "cooking", "scifi", "travel", "devops"]
for dataset in datasets:
    aips.indexer.build_collection(engine, dataset)

In [6]:
stackexchange_collection = aips.indexer.build_collection(engine, "stackexchange")

## Dual-index the Datasets into Solr for the Semantic Knowledge Graph (SKG)

In [7]:
solr_engine = get_engine("solr")
aips.indexer.build_collection(solr_engine, "jobs")
for dataset in datasets:
    aips.indexer.build_collection(solr_engine, dataset)
stackexchange_collection = aips.indexer.build_collection(solr_engine, "stackexchange")

Wiping "jobs" collection
Creating "jobs" collection
Status: Success
Loading data/jobs/jobs.csv
Schema: 
root
 |-- job_title: string (nullable = true)
 |-- job_description: string (nullable = true)
 |-- job_type: string (nullable = true)
 |-- category: string (nullable = false)
 |-- job_location: string (nullable = true)
 |-- job_city: string (nullable = true)
 |-- job_state: string (nullable = true)
 |-- job_country: string (nullable = true)
 |-- job_zip_code: string (nullable = true)
 |-- job_address: string (nullable = true)
 |-- min_salary: string (nullable = true)
 |-- max_salary: string (nullable = true)
 |-- salary_period: string (nullable = true)
 |-- apply_url: string (nullable = true)
 |-- apply_email: string (nullable = true)
 |-- num_employees: string (nullable = true)
 |-- industry: string (nullable = true)
 |-- company_name: string (nullable = true)
 |-- company_email: string (nullable = true)
 |-- company_website: string (nullable = true)
 |-- company_phone: string (nulla
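
The `In [7]` cell above repeats the same `build_collection` calls against a second engine. As a minimal sketch, that pattern can be written as one loop over engines and collection names. `build_collections` and `fake_build` are hypothetical names introduced here for illustration and are not part of `aips`; the stand-in builder only returns a label so the block runs without a search engine.

```python
from typing import Callable, Dict, Iterable, Tuple


def build_collections(
    build: Callable[[object, str], object],
    engines: Dict[str, object],
    names: Iterable[str],
) -> Dict[Tuple[str, str], object]:
    """Call build(engine, name) for every engine/collection pair.

    Returns a dict keyed by (engine_label, collection_name).
    """
    names = list(names)  # allow generators; reuse the list for each engine
    built = {}
    for engine_label, engine in engines.items():
        for name in names:
            built[(engine_label, name)] = build(engine, name)
    return built


def fake_build(engine: object, name: str) -> str:
    """Hypothetical stand-in for aips.indexer.build_collection."""
    return f"{engine}/{name}"


if __name__ == "__main__":
    collection_names = ["jobs", "health", "cooking", "scifi",
                        "travel", "devops", "stackexchange"]
    # Plain strings stand in for the objects returned by
    # get_engine() and get_engine("solr") in the notebook.
    engines = {"default": "default-engine", "solr": "solr-engine"}
    collections = build_collections(fake_build, engines, collection_names)
    print(len(collections))
```

In the notebook itself, the call would presumably look like `build_collections(aips.indexer.build_collection, {"default": engine, "solr": solr_engine}, ["jobs", *datasets, "stackexchange"])`, assuming `build_collection` keeps the `(engine, name)` signature used in the cells above.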

## Success!

Now that you've indexed several large text datasets, the next notebook explores the graph of semantic relationships embedded in those documents. It uses Semantic Knowledge Graphs to traverse and rank arbitrary relationships within the domains of our datasets in real time.

Up next: [Working with Semantic Knowledge Graphs](3.semantic-knowledge-graph.ipynb)