# MAG Sample: Get Authors and Paper Details

## Prerequisites

Complete these tasks before you begin this tutorial:

- Set up provisioning of Microsoft Academic Graph (MAG) to an Azure blob storage account. See [Get Microsoft Academic Graph on Azure storage](https://docs.microsoft.com/academic-services/graph/get-started-setup-provisioning).
- Set up the Azure Databricks service. See [Set up Azure Databricks](https://docs.microsoft.com/academic-services/graph/get-started-setup-databricks).

## Gather the information

Before you begin, you should have these items of information:

- The name of your Azure Storage (AS) account containing the MAG dataset, from [Get Microsoft Academic Graph on Azure storage](https://docs.microsoft.com/academic-services/graph/get-started-setup-provisioning#note-azure-storage-account-name-and-primary-key).
- The access key of your Azure Storage (AS) account, from [Get Microsoft Academic Graph on Azure storage](https://docs.microsoft.com/academic-services/graph/get-started-setup-provisioning#note-azure-storage-account-name-and-primary-key).
- The name of the container in your Azure Storage (AS) account that contains the MAG dataset.

## Import notebooks

- [Import](https://docs.databricks.com/user-guide/notebooks/notebook-manage.html#import-a-notebook) `samples/pyspark/MagClass.py` from the MAG dataset into your working folder.
- [Import](https://docs.databricks.com/user-guide/notebooks/notebook-manage.html#import-a-notebook) this notebook into the same folder.

### Initialize storage account and container details

  | Variable  | Value | Description  |
  | --------- | --------- | --------- |
  | AzureStorageAccount | Replace **`<AzureStorageAccount>`** | The Azure Storage account containing the MAG dataset. |
  | AzureStorageAccessKey | Replace **`<AzureStorageAccessKey>`** | The access key of the Azure Storage account. |
  | MagContainer | Replace **`<MagContainer>`** | The container in the Azure Storage account that holds the MAG dataset, usually named in the form mag-yyyy-mm-dd. |
  | OutputContainer | Replace **`<OutputContainer>`** | The container in the Azure Storage account that receives the output. Create this container before running this notebook. |

In [0]:
AzureStorageAccount = '<AzureStorageAccount>'
AzureStorageAccessKey = '<AzureStorageAccessKey>'
MagContainer = '<MagContainer>'
OutputContainer = '<OutputContainer>'
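
Running the later cells with a placeholder still in place tends to fail only when storage is first accessed. A minimal sketch of an early check follows; `find_unset` is a hypothetical helper, not part of the MAG samples, and the example values are made up for illustration.

```python
import re

# Hypothetical helper (not part of the MAG samples): report which settings
# are empty or still hold an unreplaced "<...>" template placeholder.
PLACEHOLDER = re.compile(r"^<[^<>]+>$")

def find_unset(settings):
    """Return the sorted names of settings that are empty or placeholders."""
    return sorted(
        name for name, value in settings.items()
        if not value or PLACEHOLDER.match(value)
    )

# Made-up example values; in the notebook you would pass the four variables above.
example = {
    "AzureStorageAccount": "mystorageaccount",
    "AzureStorageAccessKey": "<AzureStorageAccessKey>",
    "MagContainer": "mag-2019-01-01",
    "OutputContainer": "",
}
unset = find_unset(example)
print(unset)
```

In the notebook, a non-empty result could be turned into a `ValueError` so the run stops before any data is loaded.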

### Define MicrosoftAcademicGraph class

Run the MagClass notebook to define the `MicrosoftAcademicGraph` class and the `AzureStorageUtil` helper used in the cells below.

In [0]:
%run "./MagClass"

### Create a MicrosoftAcademicGraph instance to access the MAG dataset

Use `account=AzureStorageAccount`, `key=AzureStorageAccessKey`, and `container=MagContainer`.

In [0]:
mag = MicrosoftAcademicGraph(account=AzureStorageAccount, key=AzureStorageAccessKey, container=MagContainer)

### Create an AzureStorageUtil instance to access other Azure Storage files

Use `account=AzureStorageAccount`, `key=AzureStorageAccessKey`, and `container=OutputContainer`.

In [0]:
asu = AzureStorageUtil(account=AzureStorageAccount, key=AzureStorageAccessKey, container=OutputContainer)
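
Both objects address files by account, container, and relative path. The source does not show how MagClass builds those addresses, so the following is only a sketch, assuming the standard `wasbs://` URL scheme that Azure Blob storage exposes to Spark. `blob_url` is a hypothetical name used for illustration.

```python
def blob_url(account, container, path):
    """Compose a wasbs:// URL for a blob (hypothetical helper, not from MagClass)."""
    return "wasbs://{}@{}.blob.core.windows.net/{}".format(
        container, account, path.lstrip("/")
    )

# Made-up account and container names, for illustration only.
url = blob_url("mystorageaccount", "output", "Affiliation/Author.tsv")
print(url)
```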

### Filter Authors by Affiliation

In [0]:
from pyspark.sql.functions import concat, lit, log, when

# Load PaperAuthorAffiliationRelationship data written by an earlier step to the output container
paperAuthorAffiliation = asu.load('Affiliation/PaperAuthorAffiliationRelationship.tsv')

# Distinct IDs of authors that appear in the affiliation relationship
orgAuthorIds = paperAuthorAffiliation.select(paperAuthorAffiliation.AuthorId).distinct()

# Load Authors data
authors = mag.getDataframe('Authors')

# Keep only those authors and expose DisplayName as AuthorName
orgAuthors = authors \
    .join(orgAuthorIds, authors.AuthorId == orgAuthorIds.AuthorId, 'inner') \
    .select(orgAuthorIds.AuthorId, authors.DisplayName.alias('AuthorName'))

# Peek result
display(orgAuthors.head(5))

# Output result
asu.save(orgAuthors, 'Affiliation/Author.tsv')

AuthorId,AuthorName
53423,Yannai A. Gonczarowski
720112,Sergei Gringauze
1665409,Rogier Dittner
2364515,Steven J. Altschuler
4011424,Nicolae Surpatanu
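
The cell above reduces to two steps: collect the distinct author IDs from the relationship data, then inner-join them against the full Authors table. A plain-Python analogue of that logic, using made-up toy rows rather than MAG data, might look like this. It is a sketch of the semantics, not of how Spark executes the join.

```python
# Toy rows standing in for PaperAuthorAffiliationRelationship: (PaperId, AuthorId).
relationships = [(10, 1), (11, 1), (12, 2)]
# Toy rows standing in for the Authors table: (AuthorId, DisplayName).
authors = [(1, "Author A"), (2, "Author B"), (3, "Author C")]

# Equivalent of select(AuthorId).distinct(): a set removes the duplicate ID 1.
org_author_ids = {author_id for _, author_id in relationships}

# Equivalent of the inner join plus select/alias: keep matching authors only.
org_authors = [
    {"AuthorId": author_id, "AuthorName": name}
    for author_id, name in authors
    if author_id in org_author_ids
]
print(org_authors)
```

Author 3 has no row in the relationship data, so the inner join drops it.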


### Filter Papers by Affiliation

In [0]:
# Load Papers data
papers = mag.getDataframe('Papers')

# Add a constant column holding the URL prefix used to build each paper's link
papers = papers.withColumn('Prefix', lit('https://academic.microsoft.com/#/detail/'))

# Distinct IDs of papers that appear in the affiliation relationship
orgPaperIds = paperAuthorAffiliation.select(paperAuthorAffiliation.PaperId).distinct()

# Get details for those papers, restricted to publication years from 1991 onward
orgPapers = papers \
    .join(orgPaperIds, papers.PaperId == orgPaperIds.PaperId) \
    .where(papers.Year >= 1991) \
    .select(papers.PaperId, papers.PaperTitle.alias('Title'), papers.EstimatedCitation.alias('CitationCount'),
            papers.Date, when(papers.DocType.isNull(), 'Not available').otherwise(papers.DocType).alias('PublicationType'),
            log(papers.Rank).alias('LogProb'), concat(papers.Prefix, papers.PaperId).alias('Url'),
            when(papers.ConferenceSeriesId.isNull(), papers.JournalId).otherwise(papers.ConferenceSeriesId).alias('VId'),
            papers.Year)

# Peek result
display(orgPapers.head(5))

# Optional: Count number of rows in result
print('Number of rows in orgPapers: {}'.format(orgPapers.count()))

PaperId,Title,CitationCount,Date,PublicationType,LogProb,Url,VId,Year
2568259326,the unsplittable stable marriage problem,0,2006-01-01,Not available,9.972360416822504,https://academic.microsoft.com/#/detail/2568259326,,2006
1535764483,the unsplittable stable marriage problem,18,2006-08-01,Conference,9.935664284231349,https://academic.microsoft.com/#/detail/1535764483,2755269626.0,2006
2914429498,proceedings of the the 1st acm workshop on continuous archival and retrieval of personal experiences,0,2004-10-15,Conference,10.036005932236272,https://academic.microsoft.com/#/detail/2914429498,2758214222.0,2004
1536284211,proceedings of the the 1st acm workshop on continuous archival and retrieval of personal experiences,6,2004-10-15,Conference,9.995793223320154,https://academic.microsoft.com/#/detail/1536284211,1135237122.0,2004
1580275452,semantically annotated provenance in the life science grid,4,2009-10-25,Conference,9.951753769945617,https://academic.microsoft.com/#/detail/1580275452,1155608529.0,2009
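
Four of the output columns are derived rather than copied: `PublicationType`, `LogProb`, `Url`, and `VId`. The sketch below restates those rules in plain Python for a single row, which may make the `when(...).otherwise(...)` fallbacks easier to follow. `derive_columns` is a hypothetical name, and the input row is made up rather than taken from MAG.

```python
import math

URL_PREFIX = "https://academic.microsoft.com/#/detail/"

def derive_columns(paper):
    """Mirror the derived columns of the select above for one row (illustrative only)."""
    doc_type = paper.get("DocType")
    conference_id = paper.get("ConferenceSeriesId")
    return {
        # when(DocType.isNull(), 'Not available').otherwise(DocType)
        "PublicationType": doc_type if doc_type is not None else "Not available",
        # log(Rank): the one-argument form is the natural logarithm
        "LogProb": math.log(paper["Rank"]),
        # concat(Prefix, PaperId): the numeric ID is rendered as text
        "Url": URL_PREFIX + str(paper["PaperId"]),
        # Prefer the conference series ID, fall back to the journal ID
        "VId": conference_id if conference_id is not None else paper.get("JournalId"),
    }

# Made-up row: no DocType and no conference, so both fallbacks apply.
row = derive_columns({"PaperId": 42, "Rank": 20000, "DocType": None,
                      "ConferenceSeriesId": None, "JournalId": 7})
print(row)
```

When a paper has neither a conference series nor a journal, `VId` stays empty, as in the first row of the output above.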


### Save Paper.tsv

In [0]:
asu.save(orgPapers, 'Affiliation/Paper.tsv')