# MAG Sample: Get Fields of Study for an Affiliation

## Prerequisites

Complete these tasks before you begin this tutorial:

- Set up provisioning of Microsoft Academic Graph to an Azure blob storage account. See [Get Microsoft Academic Graph on Azure storage](https://docs.microsoft.com/academic-services/graph/get-started-setup-provisioning).
- Set up the Azure Databricks service. See [Set up Azure Databricks](https://docs.microsoft.com/academic-services/graph/get-started-setup-databricks).
- Install the Python libraries `plotly` and `pycountry` on the cluster where you want to run this tutorial.

## Gather the information

Before you begin, you should have these items of information:

- The name of your Azure Storage (AS) account containing the MAG dataset, from [Get Microsoft Academic Graph on Azure storage](https://docs.microsoft.com/academic-services/graph/get-started-setup-provisioning#note-azure-storage-account-name-and-primary-key).
- The access key of your Azure Storage (AS) account, from [Get Microsoft Academic Graph on Azure storage](https://docs.microsoft.com/academic-services/graph/get-started-setup-provisioning#note-azure-storage-account-name-and-primary-key).
- The name of the container in your Azure Storage (AS) account that contains the MAG dataset.

## Import notebooks

- [Import](https://docs.databricks.com/user-guide/notebooks/notebook-manage.html#import-a-notebook) `samples/pyspark/MagClass.py` under your working folder.
- [Import](https://docs.databricks.com/user-guide/notebooks/notebook-manage.html#import-a-notebook) this notebook under the same folder.

### Initialize storage account and container details

  | Variable  | Value | Description  |
  | --------- | --------- | --------- |
  | AzureStorageAccount | Replace **`<AzureStorageAccount>`** | The Azure Storage account containing the MAG dataset. |
  | AzureStorageAccessKey | Replace **`<AzureStorageAccessKey>`** | The access key of the Azure Storage account. |
  | MagContainer | Replace **`<MagContainer>`** | The name of the container in the Azure Storage account that holds the MAG dataset, usually in the form `mag-yyyy-mm-dd`. |
  | OutputContainer | Replace **`<OutputContainer>`** | The name of the container in the Azure Storage account where the output is written. Create this container before running this notebook. |

In [0]:
AzureStorageAccount = '<AzureStorageAccount>'
AzureStorageAccessKey = '<AzureStorageAccessKey>'
MagContainer = '<MagContainer>'
OutputContainer = '<OutputContainer>'
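
If any of the four values is left as its `<...>` placeholder, the later cells fail only when they first touch storage. A minimal sketch of an early check follows; `find_unset` is a hypothetical helper (it is not part of the MAG samples), and the sample values are made up for illustration.

```python
import re

# Hypothetical helper, not part of the MAG samples: flags values that are
# empty or still look like an unreplaced `<Placeholder>` from the table above.
PLACEHOLDER = re.compile(r'^<[^<>]+>$')

def find_unset(settings):
    """Return the names whose values are empty or still a <Placeholder>."""
    return [name for name, value in settings.items()
            if not value or PLACEHOLDER.match(value)]

# Made-up example values for illustration only.
settings = {
    'AzureStorageAccount': '<AzureStorageAccount>',  # placeholder not replaced
    'AzureStorageAccessKey': 'example-key',
    'MagContainer': 'mag-2020-01-01',
    'OutputContainer': '',                           # left empty
}

unset = find_unset(settings)
if unset:
    print('Replace these values before continuing:', ', '.join(unset))
```

In the notebook, the same check could be applied to a dictionary built from the four variables defined in the cell above.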

### Define MicrosoftAcademicGraph class

Run the MagClass notebook to define the `MicrosoftAcademicGraph` class. The notebook must be in the same folder as this one for the relative path to resolve.

In [0]:
%run "./MagClass"

### Create a MicrosoftAcademicGraph instance to access MAG dataset
Use account=AzureStorageAccount, key=AzureStorageAccessKey, container=MagContainer.

In [0]:
mag = MicrosoftAcademicGraph(account=AzureStorageAccount, key=AzureStorageAccessKey, container=MagContainer)

### Create an AzureStorageUtil instance to access other Azure Storage files
Use account=AzureStorageAccount, key=AzureStorageAccessKey, container=OutputContainer.

In [0]:
asu = AzureStorageUtil(account=AzureStorageAccount, key=AzureStorageAccessKey, container=OutputContainer)
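
The internals of `MicrosoftAcademicGraph` and `AzureStorageUtil` are not shown in this notebook. As a rough sketch of what the `account`, `key`, and `container` arguments are typically used for, Spark addresses Azure Blob storage through `wasbs://` URLs and a per-account key setting. The helper names below (`wasbs_path`, `account_key_setting`) are hypothetical, and it is an assumption that MagClass builds its paths this way.

```python
def wasbs_path(account, container, path=''):
    """Build a wasbs:// URL for a blob path (hypothetical helper)."""
    return 'wasbs://%s@%s.blob.core.windows.net/%s' % (
        container, account, path.lstrip('/'))

def account_key_setting(account):
    """Name of the Spark/Hadoop setting that holds the storage access key."""
    return 'fs.azure.account.key.%s.blob.core.windows.net' % account

# Made-up account and container names for illustration only.
print(wasbs_path('myaccount', 'output', 'Paper.tsv'))
print(account_key_setting('myaccount'))
```

Under this assumption, the `mag` and `asu` objects differ only in the container they point at: `MagContainer` for reading the graph, `OutputContainer` for reading and writing intermediate results.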

### Load Papers

In [0]:
# Get paper details for the input organization from previous output
orgPapers = asu.load('Paper.tsv')

### Get PaperFieldsOfStudy

In [0]:
# Load FieldsOfStudy data
fieldOfStudy = mag.getDataframe('FieldsOfStudy')

# Load PaperFieldsOfStudy data
paperFieldsOfStudy = mag.getDataframe('PaperFieldsOfStudy')

# Get Paper-Field-of-Study relationships for the input organization
orgPaperFieldOfStudy = paperFieldsOfStudy \
    .join(orgPapers, paperFieldsOfStudy.PaperId == orgPapers.PaperId, 'inner') \
    .select(orgPapers.PaperId, paperFieldsOfStudy.FieldOfStudyId)

# Optional: peek result
orgPaperFieldOfStudy.show(10)
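
To make the effect of the inner join concrete, here is a small plain-Python sketch on toy data. The IDs are made up, and the sketch assumes each `PaperId` appears at most once in `orgPapers`, so the join keeps exactly the `PaperFieldsOfStudy` rows whose paper belongs to the organization.

```python
# Toy stand-ins for the two tables (made-up IDs, not real MAG data).
paper_fields = [(1, 100), (1, 200), (2, 100), (3, 300)]  # (PaperId, FieldOfStudyId)
org_paper_ids = {1, 3}                                   # PaperIds of the organization

def inner_join_on_paper_id(paper_fields, org_paper_ids):
    """Keep the (PaperId, FieldOfStudyId) pairs whose PaperId is in org_paper_ids.

    Equivalent to the inner join above when PaperId is unique in orgPapers.
    """
    return [(paper_id, field_id)
            for paper_id, field_id in paper_fields
            if paper_id in org_paper_ids]

org_paper_fields = inner_join_on_paper_id(paper_fields, org_paper_ids)
print(org_paper_fields)
```

Paper 2 is dropped because it is not one of the organization's papers, while paper 1 contributes two rows, one per field of study.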

### Save PaperFieldOfStudyRelationship

In [0]:
asu.save(orgPaperFieldOfStudy, 'PaperFieldOfStudyRelationship.tsv')

### Get FieldsOfStudy

In [0]:
# Get all field-of-study Ids for the input organization
orgFieldOfStudyIds = orgPaperFieldOfStudy.select(orgPaperFieldOfStudy.FieldOfStudyId).distinct()

# Get all field-of-study details for the input organization
orgFieldOfStudy = fieldOfStudy \
    .join(orgFieldOfStudyIds, fieldOfStudy.FieldOfStudyId == orgFieldOfStudyIds.FieldOfStudyId, 'inner') \
    .select(orgFieldOfStudyIds.FieldOfStudyId, fieldOfStudy.Level.alias('FieldLevel'), fieldOfStudy.DisplayName.alias('FieldName'))

# Optional: peek result
orgFieldOfStudy.show(10)

### Save FieldOfStudy

In [0]:
asu.save(orgFieldOfStudy, 'FieldOfStudy.tsv')
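
The Get FieldsOfStudy step first reduces the relationships to distinct field IDs and then looks up each field's level and name. A minimal plain-Python sketch of that two-stage logic follows; the IDs, levels, and names are made up for illustration, and `org_field_details` is a hypothetical helper.

```python
# Toy stand-ins (made-up values, not real MAG data).
org_paper_fields = [(1, 100), (1, 200), (3, 100)]  # (PaperId, FieldOfStudyId)
fields = {                                         # FieldOfStudyId -> (Level, DisplayName)
    100: (0, 'Computer science'),
    200: (1, 'Machine learning'),
    300: (0, 'Biology'),
}

def org_field_details(org_paper_fields, fields):
    """Return (FieldOfStudyId, FieldLevel, FieldName) for each distinct field."""
    # Stage 1: distinct field IDs, mirroring .select(...).distinct()
    distinct_ids = sorted({field_id for _, field_id in org_paper_fields})
    # Stage 2: inner join against the field table; IDs without a match are dropped
    return [(field_id,) + fields[field_id]
            for field_id in distinct_ids
            if field_id in fields]

details = org_field_details(org_paper_fields, fields)
print(details)
```

Field 100 appears once in the result even though two papers reference it, and field 300 is absent because none of the organization's papers use it.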