Link to the repo: https://github.com/taidopurason/bdm-project-1

Check out the Schema of our Warehouse: https://github.com/taidopurason/bdm-project-1/blob/main/Schema.png

### Structure of this notebook
1. Read data from parquet files into a DataFrame.
2. Apply necessary (cleaning) transformations to the dataframe.
3. Create new DataFrames corresponding to our Warehouse Schema.  
4. Save the DataFrames as Delta tables.
5. Demonstrate adding new entries to the warehouse.
6. Demonstrate queries on the data.

In [0]:
import pyspark.sql.functions as F
import pyspark.sql.types as T
from delta.tables import *

import logging
import json
import re


logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
DISPLAY_LIMIT = 20

### 1. Extract the data
To check how we downloaded the data from the source, see https://github.com/taidopurason/bdm-project-1/blob/main/Loading%20data%20v2.ipynb. We split the downloaded data into parquet files where each file contains data from 250,000 json objects.

In [0]:
# Uncomment one or the other line.


# This reads ALL splits into one dataframe
#_df = spark.read.parquet('dbfs:/user/dblpv13/dblpv13.*.parquet')


# For a faster setup, read just one split
_df = spark.read.parquet('dbfs:/user/dblpv13/dblpv13.0.parquet')

In [0]:
# Immediately delete the abstract column because they look really annoying on GitHub.
_df = _df.drop(F.col('abstract'))

_df.printSchema()

root
 |-- _id: string (nullable = true)
 |-- authors: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _id: string (nullable = true)
 |    |    |-- bio: string (nullable = true)
 |    |    |-- email: string (nullable = true)
 |    |    |-- gid: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- name_zh: string (nullable = true)
 |    |    |-- oid: string (nullable = true)
 |    |    |-- oid_zh: string (nullable = true)
 |    |    |-- orcid: string (nullable = true)
 |    |    |-- org: string (nullable = true)
 |    |    |-- org_zh: string (nullable = true)
 |    |    |-- orgid: string (nullable = true)
 |    |    |-- orgs: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- orgs_zh: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- sid: string (nullable = true)
 |-- doi: string (nullable = true)
 |-- fos: array (nullable = 

In [0]:
# Display raw data
display(_df.limit(DISPLAY_LIMIT))

_id,authors,doi,fos,isbn,issn,issue,keywords,lang,n_citation,page_end,page_start,pdf,references,title,url,venue,volume,year
53e99784b7602d9701f3e3f5,,,,,,,List(),en,0.0,,,,,3GIO.,,"List(null, null, null, null, null, null, null, null, null, null, null, null, 0)",,2011
53e99784b7602d9701f3e133,"List(List(53f45728dabfaec09f209538, null, null, null, Peijuan Wang, null, null, null, null, null, null, null, null, null, null), List(5601754345cedb3395e59457, null, null, null, Jiahua Zhang, null, null, null, null, null, null, null, null, null, null), List(53f38438dabfae4b34a08928, null, null, null, Donghui Xie, null, null, null, null, null, null, null, null, null, null), List(5601754345cedb3395e5945a, null, null, null, Yanyan Xu, null, null, null, null, null, null, null, null, null, null), List(53f43d25dabfaeecd6995149, null, null, null, Yun Xu, null, null, null, null, null, null, null, null, null, null))",10.1109/IGARSS.2011.6049503,"List(Agronomy, Moisture, Hydrology, Environmental science, Dry weight, Water content, Stomatal conductance, Transpiration, Irrigation, Soil water, Canopy)",,,,"List(canopy parameters, canopy spectrum, different soil water content control, winter wheat, irrigation, hydrology, radiometry, moisture, indexes, vegetation, indexation, dry weight, soil moisture, water content, indexing terms, spectrum, natural disaster)",en,0.0,1933,1930.0,,,The relationship between canopy parameters and spectrum of winter wheat under different irrigations in Hebei Province.,List(http://dx.doi.org/10.1109/IGARSS.2011.6049503),"List(53a7297d20f7420be8bd4ae7, null, null, International Geoscience and Remote Sensing Symposium, null, null, null, IGARSS, null, null, null, null, 0)",,2011
53e99784b7602d9701f3e151,"List(List(53f46797dabfaeb22f542630, null, null, null, Jairo Rocha, null, null, null, null, null, null, null, null, null, null), List(54328883dabfaeb4c6a8a699, null, null, null, Theo Pavlidis, null, null, null, null, null, null, null, null, null, null))",10.1109/ICDAR.1993.395663,"List(Intelligent character recognition, Pattern recognition, Computer science, Feature (computer vision), Document processing, Handwriting recognition, Optical character recognition, Feature extraction, Feature (machine learning), Artificial intelligence, Intelligent word recognition)",,,,"List(handwriting recognition, prototypes, image segmentation, computer science, expert systems, knowledge base, pattern recognition, usability, optical character recognition, shape, feature extraction)",en,17.0,605,602.0,,"List(53e99cf5b7602d97025ace63, 557e8a7a6fee0fe990caa63d, 53e9a96cb7602d97032c459a, 53e9b929b7602d9704515791, 557e59ebf6678c77ea222447)",A solution to the problem of touching and broken characters.,List(http://dx.doi.org/10.1109/ICDAR.1993.395663),"List(53a72a4920f7420be8bfa51b, null, null, International Conference on Document Analysis and Recognition, null, null, null, ICDAR-1, null, null, null, null, 0)",,1993
53e99784b7602d9701f3e15d,"List(List(53f43b03dabfaedce555bf2a, null, null, null, Min Pan, null, null, null, null, null, null, null, null, null, null), List(53f45ee9dabfaee43ecda842, null, null, null, Chris C. N. Chu, null, null, null, null, null, null, null, null, null, null), List(53f42e8cdabfaee1c0a4274e, null, null, null, Hai Zhou, null, null, null, null, null, null, null, null, null, null))",10.1109/ISCAS.2005.1465124,"List(Delay calculation, Timing failure, Monte Carlo method, Sequential logic, Statistical static timing analysis, Shortest path problem, Computer science, Algorithm, Clock skew, Static timing analysis, Statistics)",0-7803-8834-8,,,"List(sequential circuits, statistical distributions, set-up time constraints, register-to-register paths, statistical static timing analysis, integrated circuit modelling, parameter estimation, statistical analysis, circuit model, path delays, deep sub-micron technology, timing, delay distributions, delays, circuit timing, shortest path variations, hold time constraints, integrated circuit yield, process variations, integrated circuit layout, high-performance circuit designs, clock skew, timing yield estimation, deterministic static timing analysis, monte carlo simulation, design method, static timing analysis, design methodology, process variation, shortest path, registers, circuit design, circuit analysis)",en,28.0,2464Vol.3,2461.0,//static.aminer.org/pdf/PDF/000/423/329/timing_yield_estimation_using_statistical_static_timing_analysis.pdf,"List(53e9a8a9b7602d97031f6bb9, 599c7b6b601a182cd27360da, 53e9b443b7602d9703f3e52b, 53e9a6a6b7602d9702fdc57e, 599c7b6a601a182cd2735703, 53e9aad9b7602d970345afea, 5582821f0cf2bf7bae57ac18, 5e8911859fced0a24bb9a2ba, 53e9b002b7602d9703a5c932)",Timing yield estimation using statistical static timing analysis,"List(http://dx.doi.org/10.1109/ISCAS.2005.1465124, http://ieeexplore.ieee.org/xpl/abstractAuthors.jsp?tp=&arnumber=1465124)","List(53a72e2020f7420be8c80142, null, null, International Symposium on Circuits and Systems, null, null, null, ISCAS (3), null, null, null, null, 0)",,2005
53e99784b7602d9701f3e161,"List(List(53f46946dabfaec09f24b4ed, null, null, 5b86cf1ae1cd8e14a3fc787b, Miguel Palma, null, 544bd9c245ce266baf189c4f, null, null, Miguel Palma Studio, null, null, null, null, null))",10.1145/1665137.1665166,,,,,"List(global high technology, daily short-distance flight, enormous waste, daily life)",en,,39,39.0,,,360°,,"List(5390a74a20f70186a0e8b40b, null, null, null, null, null, null, ACM SIGGRAPH ASIA 2009 Art Gallery & Emerging Technologies: Adaptation, null, null, null, null, null)",,2009
53e99784b7602d9701f3e162,"List(List(53f43d95dabfaedf435b63fa, null, null, 5b869031e1cd8e14a34a782f, Maureen Squillace, null, null, null, null, Fox Studios Australia, Moore Park, New South Wales, Australia, null, 5f71b2e41c455f439fe3efd1, null, null, null))",10.1145/1281740.1281746,,,,,List(),en,0.0,14,14.0,,,300,"List(http://dx.doi.org/10.1145/1281740.1281746, http://doi.acm.org/10.1145/1281740.1281746)","List(5736ae3ad39c4f40a7976010, null, null, null, null, null, null, SIGGRAPH Computer Animation Festival, null, null, null, null, 10)",,2007
53e99784b7602d9701f3e165,"List(List(54484654dabfae87b7dfc077, null, null, null, Jon G. Hall, null, null, null, null, null, null, null, null, null, 433474))",10.1111/j.1468-0394.2009.00532.x,,,,4.0,List(),en,0.0,306,305.0,,,34957+70764=105621,List(http://dx.doi.org/10.1111/j.1468-0394.2009.00532.x),"List(53a001b1831432abcb737ee4, null, null, null, null, null, null, Expert Systems, null, null, null, null, 0)",26.0,2009
53e99784b7602d9701f3e922,"List(List(53f39e3edabfae4b34aa8c4a, null, null, null, Jungil Park, null, null, null, null, null, null, null, null, null, 237372), List(53f431bcdabfaee2a1cb41b5, null, null, null, Sunyoung Ahn, null, null, null, null, null, null, null, null, null, 24447851), List(53f46ac3dabfaeee22a63eab, null, null, null, Youngmi Kim Pak, null, null, null, null, null, null, null, null, null, 4241287), List(53f44f6adabfaedf435efcb8, null, null, null, James Jungho Pak, null, null, null, null, null, null, null, null, null, 22875855))",10.1109/NEMS.2009.5068754,,,,,List(),en,1.0,1057,1054.0,//static.aminer.org/pdf/PDF/002/845/190/.pdf,,International Conference on Nano/Micro Engineered and Molecular Systems,List(http://doi.ieeecomputersociety.org/10.1109/NEMS.2009.5068754),"List(53a72dfb20f7420be8c7a2f3, null, null, null, null, null, null, NEMS, null, null, null, null, null)",,2009
53e99784b7602d9701f3e4f4,"List(List(53f45ad4dabfaee1c0b3e206, null, null, null, Bonnie Mitchell, null, null, null, null, null, null, null, null, null, null))",10.1145/1596685.1596687,,,,,"List(visual source material, minute sound, integrated journey temporally, abstract environment, intricate detail, particulated image, artists delve, visual experience, multi-faceted granular complexity, stylized natural element)",en,0.0,8,8.0,,,2BTextures,"List(http://dx.doi.org/10.1145/1596685.1596687, http://doi.acm.org/10.1145/1596685.1596687, db/conf/siggraph/siggraph2009festival.html#Mitchell09, https://doi.org/10.1145/1596685.1596687)","List(5736ae3ad39c4f40a7976060, null, null, null, null, null, null, SIGGRAPH Computer Animation Fesitval, null, null, null, null, 10)",,2009
53e99784b7602d9701f3eaf2,"List(List(53f438d0dabfaeee229c1f1c, null, null, null, Naotaka Tanaka, null, null, null, null, null, null, null, null, null, null), List(53f47083dabfaeee22a79321, null, null, null, Mio Yamamoto, null, null, null, null, null, null, null, null, null, null))",10.1007/3-540-45324-5_74,,3-540-42185-8,,,List(),en,0.0,514,513.0,,,11MonkeysII,List(http://dx.doi.org/10.1007/3-540-45324-5_74),"List(5390b44b20f70186a0efa5ba, null, null, null, null, null, null, RoboCup 2009, null, null, null, null, 0)",,2001


### 2. Transform the data

In [0]:
def replace_empty_string(col):
    return F.when(col == "", None).otherwise(col)

def transform(_df):
    # Create the col of author IDs
    _df = _df.withColumn('Author_ID', F.col('authors._id'))
    # Delete entries where any author ID is null
    _df = _df.where("!exists(Author_ID, x -> x is null)")
    # Drop entries with 1-word titles or empty authors or nonexistant _id or any nonexistant author id.
    # Also removes empty or missing references.
    _df = (_df.filter((F.size(F.col('authors')) > 0) & # By default F.size() returns -1 if the value is null.
                      (F.size(F.split(F.col('title'), ' ')) > 1) &  
                      (F.col('_id') != '') & 
                      (F.col('_id').isNotNull()) & 
                      ~(F.array_contains(F.col('references'), '')) & 
                      ~(F.array_contains(F.col('Author_ID'), ''))))
    # Remove all null references
    _df = _df.withColumn('references', F.expr('filter(references, x -> x is not null)'))    
    # Remove entries that are forewords
    _df = _df.filter(~F.lower(F.col("title")).contains("foreword"))
    # Convert n_citation data type to int
    _df = _df.withColumn('n_citation', F.col('n_citation').cast('int'))
    # Replace empty language values with null.
    _df = _df.withColumn('lang', F.when(F.col('lang') == '', None).otherwise(F.col('lang')))
    # Replace empty 'keyword' and 'fos' arrays with null values.
    _df = (_df.withColumn('keywords', F.when(F.size(F.col('keywords')) == 0, None).otherwise(F.col('keywords')))
              .withColumn('fos', F.when(F.size(F.col('fos')) == 0, None).otherwise(F.col('fos'))))
    # Replace non-numeric page numbers with nulls and convert column type to int. Then replace 0 page numbers with nulls as well.
    _df = (_df.withColumn('page_start', F.when(F.col('page_start').cast('int').isNotNull(), F.col('page_start')).otherwise(None)) # replace non-numeric page numbers with null
              .withColumn('page_end', F.when(F.col('page_end').cast('int').isNotNull(), F.col('page_end')).otherwise(None))
              .withColumn('page_start', F.col('page_start').cast('int')) # convert column type to int
              .withColumn('page_end', F.col('page_end').cast('int'))
              .withColumn('page_start', F.when(F.col('page_start') == 0, None).otherwise(F.col('page_start'))) # replace 0 page numbers with null as well
              .withColumn('page_end', F.when(F.col('page_end') == 0, None).otherwise(F.col('page_end'))))
    # Replace empty dois with nulls.
    _df = _df.withColumn('doi', F.when(F.col('doi') == '', None).otherwise(F.col('doi')))
    # Replace empty years with nulls and change data type to int.
    _df = (_df.withColumn('year', F.when(F.col('year') == 0, None).otherwise(F.col('year')))
              .withColumn('year', F.col('year').cast('int')))
    # Replace non-numeric volume and issue numbers with null and convert data types to int. Then repalce 0 values with null as well.
    _df = (_df.withColumn('volume', F.when(F.col('volume').cast('int').isNotNull(), F.col('volume')).otherwise(None)) # replace non-numeric values
              .withColumn('issue', F.when(F.col('issue').cast('int').isNotNull(), F.col('issue')).otherwise(None))
              .withColumn('volume', F.col('volume').cast('int')) # convert column type to int
              .withColumn('issue', F.col('issue').cast('int'))
              .withColumn('volume', F.when(F.col('volume') == 0, None).otherwise(F.col('volume'))) # replace 0 issue and volume numbers with null as well.
              .withColumn('issue', F.when(F.col('issue') == 0, None).otherwise(F.col('issue'))))
    
    # Remove all entries where affiliation (orgid) of the first author is null
    _df = (_df.withColumn('auth_orgs', F.col('authors.orgid'))
              .withColumn('auth_orgs', F.col('auth_orgs').getItem(0))
              .filter(F.col('auth_orgs').isNotNull())
              .drop('auth_orgs'))
    
    # Replace empty strings in some columns with nulls
    venue = F.col("venue")
    for col in ["_id", "issn", "name", "name_d", "name_s", "online_issn", "publisher", "raw", "raw_zh", "t"]:
        venue = venue.withField(col, replace_empty_string(F.col(f"venue.{col}")))  
    _df = (
        _df
        .withColumn("venue", venue)
        .withColumn("issn", replace_empty_string(F.col("issn")))
        .withColumn("isbn", replace_empty_string(F.col("isbn")))
        .withColumn("isbn", F.when(F.col("isbn") == "isbn", None).otherwise(F.col("isbn")))
        .withColumn("issn", F.when(F.col("issn") == "issn", None).otherwise(F.col("issn")))
    )
    # fix incorrect issn
    _df = (_df
               .withColumn("issn",
                           F.when(F.length(F.col("issn")) == 9, F.col("issn"))
                           .when(F.length(F.col("issn")) == 8, F.concat(F.col("issn").substr(1, 4), F.lit("-"), F.col("issn").substr(5, 4)))
                           .when(F.col("issn").contains("E-ISBN"), F.col("issn").substr(1, 9))
                           .otherwise(None)
                          )
               .withColumn("venue", 
                           F.col("venue")
                           .withField("issn", F.coalesce(F.col("venue.issn"), F.col("issn")))
                           )
               .drop("issn")
              )
    # replace venue with null fields with null
    venue_is_empty = (
        F.col("venue.issn").isNull() &
        F.col("venue.name").isNull() &
        F.col("venue.name_d").isNull() &
        F.col("venue.name_s").isNull() &
        F.col("venue.online_issn").isNull() &
        F.col("venue.publisher").isNull() &
        F.col("venue.raw").isNull() &
        F.col("venue.raw_zh").isNull()
    )
    _df = _df.withColumn("venue", F.when(venue_is_empty, None).otherwise(F.col("venue")))
    # remove rows with null venues
    _df = _df.filter(F.col("venue").isNotNull())
    # coalescing venue._id and venue.issn to make up for missing ids
    _df = _df.withColumn("venue", F.col("venue").withField("_id", F.coalesce(F.col("venue._id"), F.col("venue.issn"))))
    # removing rows with venue id null
    _df = _df.filter(F.col("venue._id").isNotNull())
    return _df

In [0]:
logger.info(f"Initially, there were {_df.count()} rows of data")

_df = transform(_df)

logger.info(f"After the transformations, there are {_df.count()} rows of data")

_df.printSchema()

INFO:__main__:Initially, there were 250000 rows of data
INFO:__main__:After the transformations, there are 77009 rows of data
root
 |-- _id: string (nullable = true)
 |-- authors: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _id: string (nullable = true)
 |    |    |-- bio: string (nullable = true)
 |    |    |-- email: string (nullable = true)
 |    |    |-- gid: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- name_zh: string (nullable = true)
 |    |    |-- oid: string (nullable = true)
 |    |    |-- oid_zh: string (nullable = true)
 |    |    |-- orcid: string (nullable = true)
 |    |    |-- org: string (nullable = true)
 |    |    |-- org_zh: string (nullable = true)
 |    |    |-- orgid: string (nullable = true)
 |    |    |-- orgs: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- orgs_zh: array (nullable = true)
 |    |    |    |-- element: string (con

In [0]:
# Display cleaned data
display(_df.limit(DISPLAY_LIMIT))

_id,authors,doi,fos,isbn,issue,keywords,lang,n_citation,page_end,page_start,pdf,references,title,url,venue,volume,year,Author_ID
53e99784b7602d9701f3f5fe,"List(List(53f46a22dabfaee0d9c3d5e5, null, ysg_2005@hotmail.com, 5b8698cce1cd8e14a3826671, Shuguo Yang, null, null, null, null, School of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, China 266061, null, 5f71b2e91c455f439fe3f23f, null, null, null))",10.1007/s11704-011-0127-6,"List(Virtualization, Service level objective, Virtual machine, Computer science, Testbed, Quality of service, Provisioning, Resource allocation, Web application, Operating system, Distributed computing)",,4.0,"List(resource allocation, cpu utilization, quality of service)",en,2,512.0,506.0,,"List(53e9a073b7602d9702957efa, 53e9ad87b7602d970377bfb5, 53e9be51b7602d9704b11381, 53e9be04b7602d9704abb31d, 53e9992bb7602d9702169236, 53e998cdb7602d97021044db, 53e9afa6b7602d97039f6054, 53e99822b7602d9702044e60)",Research on resource allocation for multi-tier web applications in a virtualization environment,"List(http://dx.doi.org/10.1007/s11704-011-0127-6, http://link.springer.com/article/10.1007/s11704-011-0127-6, http://www.webofknowledge.com/)","List(572de199d39c4f49934b3d5c, 1673-7350, null, null, null, null, null, Frontiers of Computer Science in China, null, null, null, null, 0)",5.0,2011,List(53f46a22dabfaee0d9c3d5e5)
53e99792b7602d9701f5af35,"List(List(53f43a51dabfaec22baa659b, null, dedwards@cs.uwf.edu, 5b8695e5e1cd8e14a36f684d, Dennis Edwards, null, null, null, null, Department of Computer Science, University of West Florida, 11000 University Parkway, Pensacola, FL 32514, USA, null, 5f71b2bd1c455f439fe3dea6, List(Department of Computer Science, University of West Florida, 11000 University Parkway, Pensacola, FL 32514, USA), null, null), List(53f3b3ffdabfae4b34b2dae9, null, ssimmons@cs.uwf.edu, 5b8695e5e1cd8e14a36f684d, Sharon Simmons, null, null, null, null, Department of Computer Science, University of West Florida, 11000 University Parkway, Pensacola, FL 32514, USA, null, 5f71b2bd1c455f439fe3dea6, List(Department of Computer Science, University of West Florida, 11000 University Parkway, Pensacola, FL 32514, USA), null, null), List(53f4333fdabfaeb22f451979, null, nwilde@uwf.edu, null, Norman Wilde, null, null, null, null, Corresponding author. Tel.: +1 850 474 2542; fax: +1 850 857 6056., null, null, List(Corresponding author. Tel.: +1 850 474 2542; fax: +1 850 857 6056., Department of Computer Science, University of West Florida, 11000 University Parkway, Pensacola, FL 32514, USA), null, null))",10.1016/j.jss.2004.12.018,"List(Data mining, Causality, End user, Ranking, Computer science, Military systems, Software, Feature model, Component-based software engineering, A-weighting, Distributed computing)",,1.0,"List(Feature location, Distributed systems, Software reconnaissance)",en,62,68.0,57.0,//static.aminer.org/pdf/PDF/000/996/035/an_approach_to_feature_location_in_distributed_systems.pdf,"List(53e9b6eeb7602d970427df40, 53e9b6eeb7602d9704283b9f, 53e9b40eb7602d9703f01b25, 53e9a3c0b7602d9702ccdfc9, 53e99818b7602d97020347a2, 53e9a2acb7602d9702bb4d7e, 558aa7ea84ae84d265bee194, 558a5258e4b037c08756714c, 53e9b946b7602d97045336a9, 53e9b1d6b7602d9703c67695, 53e9a516b7602d9702e3bcea, 53e9ac33b7602d97035f892c, 53e9ba22b7602d9704628817, 53e9af3ab7602d97039769c8, 53e9b1a3b7602d9703c2c6f7, 53e9ac89b7602d9703660f90, 53e9ad2db7602d970370e8a2, 53e9a735b7602d970306db2b, 53e99960b7602d97021a17da)",An approach to feature location in distributed systems,"List(http://dx.doi.org/10.1016/j.jss.2004.12.018, https://www.sciencedirect.com/science/article/pii/S016412120500004X, http://www.webofknowledge.com/)","List(54825226582fc50b5e05610e, 0164-1212, null, null, null, null, null, Journal of Systems and Software, null, null, null, null, 0)",79.0,2006,"List(53f43a51dabfaec22baa659b, 53f3b3ffdabfae4b34b2dae9, 53f4333fdabfaeb22f451979)"
53e99792b7602d9701f5b0ed,"List(List(542a6734dabfae646d55cc87, null, gaoyibo@gmail.com, 5b86cb35e1cd8e14a3e00691, Kun Gao, null, null, null, null, Computer Science and Information Technology College, Zhejiang Wanli University, China, null, 5f71b6101c455f439fe555a5, null, null, null))",10.1007/978-3-540-85565-1_38,"List(Data mining, Data stream mining, Grid computing, Data-intensive computing, Computer science, Directed acyclic graph, Semantic grid, Knowledge extraction, Business process discovery, Grid, Distributed computing)",,,"List(knowledge discovery grid, dynamic grid environment, high performance data mining, parallel optimization method, grid feature, high performance ddm application, data mining grid, decomposing data mining application, data mining parallelization, data intensive computing problem, computational grid environment, data mining application, knowledge discovery, distributed computing, data mining, data intensive computing, parallelization, grid computing, directed acyclic graph)",en,4,312.0,306.0,,"List(53e99cbbb7602d970256c4af, 53e9b092b7602d9703afec3f, 53e9b089b7602d9703af06a7, 53e9ae9cb7602d97038c0529, 53e9b5f3b7602d9704144753, 558aac78e4b0b32fcb3831b7, 53e9bc3bb7602d97048b1ed7, 53e9b7b4b7602d970435c58f, 53e9b672b7602d97041db0b8, 53e9b96eb7602d970455d792, 53e9ae97b7602d97038ba7dd, 53e9adf7b7602d97038042d9, 53e9bd6ab7602d9704a0644e)",A Uniform Parallel Optimization Method for Knowledge Discovery Grid,"List(http://dx.doi.org/10.1007/978-3-540-85565-1_38, http://www.webofknowledge.com/)","List(53a727f720f7420be8ba3092, 0302-9743, null, null, null, null, null, KES (2), null, null, null, null, 0)",5178.0,2008,List(542a6734dabfae646d55cc87)
53e99792b7602d9701f5b119,"List(List(5630ff9645cedb3399c3ca55, null, L.yang@lboro.ac.uk, null, Lili Yang, null, null, null, null, Business School, Loughborough University, Loughborough, Leicestershire LE11 3TU, UK, null, 5f71b29c1c455f439fe3d0d7, List(Business School, Loughborough University, Loughborough, Leicestershire LE11 3TU, UK, Corresponding author. Tel.: +44 1509 223130; fax: +44 1509 223961.), null, null), List(53f4371cdabfaec22ba8766f, null, null, null, Bryan F. Jones, null, null, null, null, Applied Computing Department, The University of Derby, Derby DE22 1GB, UK, null, 5f71b2ba1c455f439fe3dd51, List(Applied Computing Department, The University of Derby, Derby DE22 1GB, UK), null, null), List(54867430dabfae9b40133dc3, null, null, null, Shuanghua Yang, null, null, null, null, Computer Science Department, Loughborough University, Loughborough, Leicestershire LE11 3TU, UK, null, 5f71b29c1c455f439fe3d0d7, List(Computer Science Department, Loughborough University, Loughborough, Leicestershire LE11 3TU, UK), null, null))",10.1016/j.ejor.2006.07.003,"List(Objective programming, Fire protection, Computer science, Fuzzy logic, Operations research, Fire risk, Genetic algorithm, Decision maker)",,2.0,"List(Location, Fire stations, Multi-objective programming, Genetic algorithm, Fuzzy programming)",en,140,915.0,903.0,,"List(53e9bd50b7602d97049e3238, 573695926e3b12023e4b23a9, 53e9af87b7602d97039cdb54, 53e99876b7602d97020b2192, 53e9bbf5b7602d9704850823, 5c7965584895d9cbc6430efd, 53e9ab42b7602d97034d404d, 5b67376bab2dfb7a20258e8e)",A fuzzy multi-objective programming for optimization of fire station locations through genetic algorithms,"List(http://dx.doi.org/10.1016/j.ejor.2006.07.003, https://www.sciencedirect.com/science/article/pii/S037722170600467X)","List(0377-2217, 0377-2217, null, null, null, null, North-Holland, European Journal of Operational Research, null, european-journal-of-operational-research, null, J, null)",181.0,2007,"List(5630ff9645cedb3399c3ca55, 53f4371cdabfaec22ba8766f, 54867430dabfae9b40133dc3)"
53e99792b7602d9701f5b140,"List(List(56017d4445cedb3395e638f7, null, null, 5b86cf81e1cd8e14a3ff7abc, Bielikova, M., null, null, null, null, Dept. of Comput. Sci. & Eng., Slovak Tech. Univ., Bratislava, Slovakia|c|, null, 5f71b2f61c455f439fe3f847, null, null, null))",10.1109/ICALT.2001.943897,"List(Metadata, Programming language, Information retrieval, XML, Computer science, Adaptive system, Hypermedia, Software prototyping, Context model, Software, Presentation logic)",0-7695-1013-2,,"List(evolving information, database, educational module information, educational module, extensible markup language, meta-information, xml, adaptive presentation, proposed framework, administrative information, user view, software prototyping, educational administrative data processing, presentation time, hypermedia markup languages, time view, meta data, adaptive systems, hypermedia, software prototype, computer science, solids, context modeling, databases, html)",en,10,196.0,193.0,//static.aminer.org/pdf/PDF/000/269/528/adaptive_presentation_of_evolving_information_using_xml.pdf,"List(53e9a2c8b7602d9702bd1d38, 53e9b304b7602d9703dc9a0c, 53e9ae84b7602d97038a3a28, 53e99a5cb7602d97022c416d)",Adaptive presentation of evolving information using XML,"List(http://dx.doi.org/10.1109/ICALT.2001.943897, http://ieeexplore.ieee.org/xpl/abstractAuthors.jsp?tp=&arnumber=943897)","List(5550376d7cea80f9541873d5, null, null, International Conference on Advanced Learning Technologies, null, null, null, ICALT, null, null, null, null, 0)",,2001,List(56017d4445cedb3395e638f7)
53e99792b7602d9701f5b19a,"List(List(54301e81dabfaeca69bca10d, null, null, 5b86919fe1cd8e14a353a903, Lajos Horváth, null, null, null, 0000-0001-8594-4972, Department of Mathematics, University of Utah, 155 South 1440 East, Salt Lake City, UT 84112-0090, USA, null, 5f71b2841c455f439fe3c6c8, List(Department of Mathematics, University of Utah, 155 South 1440 East, Salt Lake City, UT 84112-0090, USA), null, null), List(53f4647ddabfaee02ad8cbb6, null, null, 5b86ae49e1cd8e14a31057a0, Marie Hušková, null, null, null, 0000-0002-6868-1362, Department of Statistics, Charles University, Sokolovská 83, CZ-18600 Praha, Czech Republic, null, 5f71b2851c455f439fe3c70f, List(Department of Statistics, Charles University, Sokolovská 83, CZ-18600 Praha, Czech Republic), null, null), List(5448d49ddabfae87b7e82000, null, Piotr.Kokoszka@usu.edu, 5b8689c1e1cd8e14a3214e89, Piotr Kokoszka, null, null, null, null, Department of Mathematics and Statistics, Utah State University, 3900 Old Main Hill, Logan, UT 84322-3900, USA, null, 5f71b2961c455f439fe3ce40, List(Department of Mathematics and Statistics, Utah State University, 3900 Old Main Hill, Logan, UT 84322-3900, USA, Corresponding author.), null, null))",10.1016/j.jmva.2008.12.008,"List(Functional principal component analysis, Time series, Truncation, Autoregressive model, Applied mathematics, Test statistic, Credit card, Calculus, Statistical hypothesis testing, Mathematics, Asymptotic distribution)",,2.0,"List(asymptotic justification, observations x, equation x, functional principal component analysis, test statistic, credit card transaction data, principal component, functional autoregressive process, functional time series data, change-point, functional time series analysis, 62m10, time series data, autoregressive process, transaction data, asymptotic distribution, time series analysis)",en,62,367.0,352.0,,"List(53e9ace1b7602d97036bc507, 53e999ffb7602d970224c172)",Testing the stability of the functional autoregressive process,"List(http://dx.doi.org/10.1016/j.jmva.2008.12.008, https://dl.acm.org/doi/10.1016/j.jmva.2008.12.008, https://www.sciencedirect.com/science/article/pii/S0047259X08002789, http://www.webofknowledge.com/)","List(555036db7cea80f9541603d7, null, null, null, null, null, null, J. Multivariate Analysis, null, null, null, null, 0)",101.0,2010,"List(54301e81dabfaeca69bca10d, 53f4647ddabfaee02ad8cbb6, 5448d49ddabfae87b7e82000)"
53e99792b7602d9701f5b1ba,"List(List(53f42c98dabfaeb22f3fc92d, null, null, 5b869b48e1cd8e14a393cf7b, V. William Porto, null, null, null, null, Natural Selection, Inc., La Jolla, CA, null, 5f71b57c1c455f439fe515f1, null, null, null), List(53f64e31dabfae6a71b6029f, null, null, 5b869b48e1cd8e14a393cf7b, David B. Fogel, null, null, null, null, Natural Selection, Inc., La Jolla, CA, null, 5f71b57c1c455f439fe515f1, null, null, null), List(53f47307dabfaee02adc6671, null, null, 5b869b48e1cd8e14a393cf7b, Lawrence J. Fogel, null, null, null, null, Natural Selection, Inc., La Jolla, CA, null, 5f71b57c1c455f439fe515f1, null, null, null))",10.1145/1056754.1056755,"List(Interactive evolutionary computation, Human-based evolutionary computation, Evolutionary algorithm, Computer science, Evolutionary computation, Evolution strategy, Artificial intelligence, Evolutionary programming, Evolutionary music, Java Evolutionary Computation Toolkit)",,2.0,"List(adaptive behavior, platoon-level engagement, current knowledge, generating novel tactic, appropriate tactic, simulated evolution, evolutionary computation, projected situation, higher level organization, individual unit, evolutionary algorithm, real time, evolutionary computing)",en,5,14.0,8.0,,"List(53e9bc80b7602d97048fe74a, 53e9bb93b7602d97047d88e5, 53e99d28b7602d97025dc235, 53e99842b7602d970206de90, 558a8260e4b031bae1f80ed5)",Generating novel tactics through evolutionary computation,"List(http://dx.doi.org/10.1145/1056754.1056755, http://doi.acm.org/10.1145/1056754.1056755)","List(53a7310820f7420be8d1bc69, null, null, Intelligence/sigart Bulletin, null, null, null, SIGART Bulletin, null, null, null, null, 0)",9.0,1998,"List(53f42c98dabfaeb22f3fc92d, 53f64e31dabfae6a71b6029f, 53f47307dabfaee02adc6671)"
53e99792b7602d9701f5b1e7,"List(List(53f43685dabfaec09f17df79, null, null, 5b86b6bfe1cd8e14a34c9318, Katie L. Hoffman, null, null, null, null, harvard university, null, 5f71b4501c455f439fe491ff, null, null, null), List(54069dffdabfae44f084828e, null, null, 5b86b6bfe1cd8e14a34c9318, Robert J. Wood, null, null, null, null, harvard university, null, 5f71b4501c455f439fe491ff, null, null, null))",10.1109/IROS.2013.6696543,"List(Gait, Longitudinal static stability, Simulation, Control theory, Catastrophic failure, Robustness (computer science), Gait analysis, Fault tolerance, Redundancy (engineering), Engineering, Robot)",,,"List(gait analysis, legged locomotion, stability, centipede-inspired millirobot locomotion, curvature radius, curvature speed, gait options, leg failures, mechanical redundancy, miniature ambulatory robots, performance metrics, robustness enhancement, static stability retention)",en,8,1479.0,1472.0,https://static.aminer.cn/upload/pdf/program/53e99792b7602d9701f5b1e7_0.pdf,"List(53e9aa61b7602d97033d7457, 53e9b464b7602d9703f64f1b, 53e998fcb7602d9702138f37, 53e9bd0ab7602d9704990fd9, 558b092284ae84d265c11df2, 53e99f8db7602d9702865bbe, 558b6afae4b037c0875cc785, 53e9992ab7602d9702163a36, 53e9af26b7602d970396322a, 53e9b2f4b7602d9703db0617, 53e99d6bb7602d9702623452, 558a4e4684ae84d265bcd17f, 53e9b12ab7602d9703bad72e)",Robustness of centipede-inspired millirobot locomotion to leg failures.,"List(http://dx.doi.org/10.1109/IROS.2013.6696543, http://ieeexplore.ieee.org/xpl/abstractAuthors.jsp?tp=&arnumber=6696543)","List(555037837cea80f95418b43e, null, null, Intelligent Robots and Systems, null, null, null, IROS, null, null, null, null, 0)",,2013,"List(53f43685dabfaec09f17df79, 54069dffdabfae44f084828e)"
53e99792b7602d9701f5b2b3,"List(List(53f427b6dabfaec09f0d9c8a, null, null, 5b86b6c7e1cd8e14a34ccbac, Bruno Van Den Bossche, null, null, null, null, Ghent University, Gent, Belgium, null, 5f71b2961c455f439fe3ce44, null, null, null), List(53f4296cdabfaec22b9e434b, null, null, 5b86b6c7e1cd8e14a34ccbac, Tom Verdickt, null, null, null, null, Ghent University, Gent, Belgium, null, 5f71b2961c455f439fe3ce44, null, null, null), List(53f3839ddabfae4b34a04de3, null, null, 5b86b6c7e1cd8e14a34ccbac, Bart De Vleeschauwer, null, null, null, null, Ghent University, Gent, Belgium, null, 5f71b2961c455f439fe3ce44, null, null, null), List(53f42f58dabfaee43ebde3dc, null, null, 5b86b6c7e1cd8e14a34ccbac, Stein Desmet, null, null, null, null, Ghent University, Gent, Belgium, null, 5f71b2961c455f439fe3ce44, null, null, null), List(53f455dedabfaee0d9bf1e22, null, null, 5b86b6c7e1cd8e14a34ccbac, Stijn De Mulder, null, null, null, null, Ghent University, Gent, Belgium, null, 5f71b2961c455f439fe3ce44, null, null, null), List(543501afdabfaebba588d847, null, null, 5b86b6c7e1cd8e14a34ccbac, Filip De Turck, null, null, null, null, Ghent University, Gent, Belgium, null, 5f71b2961c455f439fe3ce44, null, null, null), List(5630a35645cedb3399af4a03, null, null, 5b86b6c7e1cd8e14a34ccbac, Bart Dhoedt, null, null, null, null, Ghent University, Gent, Belgium, null, 5f71b2961c455f439fe3ce44, null, null, null), List(53f56636dabfae6293f8045b, null, null, 5b86b6c7e1cd8e14a34ccbac, Piet Demeester, null, null, null, null, Ghent University, Gent, Belgium, null, 5f71b2961c455f439fe3ce44, null, null, null))",10.1145/1378191.1378195,"List(Server-side, Division (mathematics), Computer science, Load balancing (computing), Popularity, Microcell, Software architecture, Virtual world, Distributed computing)",1-59593-285-2,,"List(virtual world, dynamic microcell redeployment, evaluation result, massively multiplayer online games, server side, load distribution, multiplayer online game, efficient software architecture, software architecture, huge popularity, virtual worlds, load balancing, load balance)",en,24,,,https://static.aminer.cn/upload/pdf/program/53e99792b7602d9701f5b2b3_0.pdf,"List(558a424ae4b0b32fcb35c0ae, 53e9b648b7602d97041a6427, 53e9b8a1b7602d970447c694, 53e99df1b7602d97026b383b)",A platform for dynamic microcell redeployment in massively multiplayer online games,"List(http://dx.doi.org/10.1145/1378191.1378195, http://doi.acm.org/10.1145/1378191.1378195)","List(555037227cea80f95417540f, null, null, Network and Operating System Support for Digital Audio and Video, null, null, null, NOSSDAV, null, null, null, null, 0)",,2006,"List(53f427b6dabfaec09f0d9c8a, 53f4296cdabfaec22b9e434b, 53f3839ddabfae4b34a04de3, 53f42f58dabfaee43ebde3dc, 53f455dedabfaee0d9bf1e22, 543501afdabfaebba588d847, 5630a35645cedb3399af4a03, 53f56636dabfae6293f8045b)"
53e99792b7602d9701f5b2bc,"List(List(54096bf9dabfae8faa68e261, null, lanwang@memphis.edu, 5b86c4f3e1cd8e14a3b2badb, Lan Wang, null, null, null, null, University of Memphis, Computer Science Department, 318 Dunn Hall, Memphis, TN 38152, United States and Colorado State University, Computer Science Department, Fort Collins, CO 80523, United State ..., null, 5f71b2aa1c455f439fe3d5c6, List(Corresponding author. Address: University of Memphis, Computer Science Department, 318 Dunn Hall, Memphis, TN 38152, United States. Tel.: +1 901 678 2727., University of Memphis, Computer Science Department, 318 Dunn Hall, Memphis, TN 38152, United States, Colorado State University, Computer Science Department, Fort Collins, CO 80523, United States, University of California, Los Angeles, Computer Science Department, Los Angeles, CA 90095, United States), null, null), List(56017d4c45cedb3395e639a3, null, massey@cs.colostate.edu, 5b86c4f3e1cd8e14a3b2badb, Daniel Massey, null, null, null, null, University of Memphis, Computer Science Department, 318 Dunn Hall, Memphis, TN 38152, United States and Colorado State University, Computer Science Department, Fort Collins, CO 80523, United State ..., null, 5f71b2aa1c455f439fe3d5c6, List(University of Memphis, Computer Science Department, 318 Dunn Hall, Memphis, TN 38152, United States, Colorado State University, Computer Science Department, Fort Collins, CO 80523, United States, University of California, Los Angeles, Computer Science Department, Los Angeles, CA 90095, United States), null, null), List(560aae3445cedb33971711c9, null, lixia@cs.ucla.edu, 5b86c4f3e1cd8e14a3b2badb, Lixia Zhang, null, null, null, null, University of Memphis, Computer Science Department, 318 Dunn Hall, Memphis, TN 38152, United States and Colorado State University, Computer Science Department, Fort Collins, CO 80523, United State ..., null, 5f71b2aa1c455f439fe3d5c6, List(University of Memphis, Computer Science Department, 318 Dunn Hall, Memphis, TN 38152, United States, Colorado State University, Computer Science Department, Fort Collins, CO 80523, United States, University of California, Los Angeles, Computer Science Department, Los Angeles, CA 90095, United States), null, null))",10.1016/j.comnet.2006.07.015,"List(Convergence (routing), Default-free zone, Computer science, Soft state, Computer network, Routing domain, Border Gateway Protocol, Routing table, Communications protocol, Routing protocol)",,6.0,"List(Reliability, Fault tolerance, Network state management, Soft-state, State consistency, Refresh overhead)",en,6,1458.0,1444.0,,"List(53e9984fb7602d9702081493, 53e9ab37b7602d97034c376f, 557f1370d19faf961d16e9f2, 53e9bcefb7602d970497580f, 53e99c04b7602d97024ae6fe, 53e9abd4b7602d970358f037, 53e9a508b7602d9702e2c226, 53e99937b7602d970217392b, 558a65c5e4b031bae1f764e6, 53e99d0cb7602d97025c39df, 53e99b51b7602d97023f58dc, 53e99931b7602d970216db73, 53e999e0b7602d9702223ef4, 53e9bac1b7602d97046eef9f, 53e9a9d9b7602d970333a6aa)",Persistent detection and recovery of state inconsistencies,"List(http://dx.doi.org/10.1016/j.comnet.2006.07.015, http://dx.doi.org/https://doi.org/10.1016/j.comnet.2006.07.015, https://www.sciencedirect.com/science/article/pii/S1389128606002088)","List(53907df520f770854f6106bd, null, null, null, null, null, null, Computer Networks, null, null, null, null, 0)",51.0,2007,"List(54096bf9dabfae8faa68e261, 56017d4c45cedb3395e639a3, 560aae3445cedb33971711c9)"


### 3. Create the new DFs
#### 3.1. Venue DF

In [0]:
def create_venues_df(_df):
    venues_df = (_df
                 .withColumn("has_volume_or_issue", F.when(F.col("volume").isNotNull() | F.col("issue").isNotNull(), True).otherwise(None))
                 .select("venue.*", "has_volume_or_issue")
                 .filter(F.col("_id").isNotNull())
                 .drop("src", "sid", "type"))

    # removing the columns from the original df
    _df = _df.withColumn("venue_id", F.col("venue._id")).drop("venue")
    
    # combining rows with the same id, but different column values
    # taking the first non-null value for the id as the column value
    venue_columns = (
        "issn",
        "name",
        "name_d",
        "name_s",
        "raw",
        "raw_zh",
        "online_issn",
        "publisher",
        "t",
        "has_volume_or_issue")
    venues_df = venues_df.groupBy(F.col("_id")).agg(*(F.first(F.col(col), ignorenulls=True).alias(col) for col in venue_columns))
    
    venues_df = (
        venues_df
        # coalescing the name and raw columns
        .withColumn("raw", F.coalesce(
                F.col("raw"), 
                F.col("raw_zh"),
            ))
        .withColumn("name", F.coalesce(
                F.col("name"), 
                F.col("name_d"),
                F.col("raw"),
            ))
        .drop("name_d", "name_s", "raw_zh") 
        # creating the type field
        .withColumn("type",            
                   F.when(
                       (
                           F.col("raw").contains("@") | 
                           F.lower(F.col("raw")).contains("workshop") |
                           F.lower(F.col("name")).contains("workshop")
                       ), 
                       "Workshop"
                   ).when(
                       (F.col("t") == "J"),
                       "Journal"
                   ).when(
                       (
                           (F.col("t") == "C") |
                           F.lower(F.col("raw")).contains("conference") |
                           F.lower(("name")).contains("conference") |
                           F.lower(F.col("raw")).contains("symposium") |
                           F.lower(("name")).contains("symposium") |
                           F.lower(F.col("raw")).contains("proceedings") |
                           F.lower(("name")).contains("proceedings")
                       ),
                       "Conference"
                   ).when(
                       (
                           F.lower(F.col("raw")).contains("journal") |
                           F.lower(("name")).contains("journal") |
                           F.col("has_volume_or_issue")
                       ),
                       "Journal"
                   ).otherwise(None)
          )
        .drop("t", "has_volume_or_issue")
        .select(
            F.col("_id").alias("ID"),
            F.col("issn").alias("ISSN"),
            F.col("name").alias("Name"),
            F.col("online_issn").alias("OnlineISSN"),
            F.col("publisher").alias("Publisher"),
            F.col("type").alias("Type")
        )
    )
    return _df, venues_df

_df, venues_df = create_venues_df(_df)

display(venues_df.limit(DISPLAY_LIMIT))

ID,ISSN,Name,OnlineISSN,Publisher,Type
0001-0782,0001-0782,IEEE Signal Process. Mag.,,,Journal
0001-253X,0001-253X,ASLIB PROCEEDINGS,1758-3748,,Journal
0001-5903,0001-5903,Acta Informatica,,,Journal
0001-8708,0001-8708,Advances in Mathematics,,Academic Press,Journal
0002-9378,0002-9378,AMERICAN JOURNAL OF OBSTETRICS AND GYNECOLOGY,,,Journal
0002-9890,0002-9890,AMERICAN MATHEMATICAL MONTHLY,1930-0972,,Journal
0003-4347,0003-4347,ANNALES DES TELECOMMUNICATIONS-ANNALS OF TELECOMMUNICATIONS,1958-9395,,Journal
0003-486X,0003-486X,Annals of Mathematics,,,Journal
0004-3702,0004-3702,Artificial Intelligence,,Elsevier,Journal
0004-5411,0004-5411,J. ACM,,,Journal


In [0]:
display(_df.limit(DISPLAY_LIMIT))

_id,authors,doi,fos,isbn,issue,keywords,lang,n_citation,page_end,page_start,pdf,references,title,url,volume,year,Author_ID,venue_id
53e99784b7602d9701f3f5fe,"List(List(53f46a22dabfaee0d9c3d5e5, null, ysg_2005@hotmail.com, 5b8698cce1cd8e14a3826671, Shuguo Yang, null, null, null, null, School of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, China 266061, null, 5f71b2e91c455f439fe3f23f, null, null, null))",10.1007/s11704-011-0127-6,"List(Virtualization, Service level objective, Virtual machine, Computer science, Testbed, Quality of service, Provisioning, Resource allocation, Web application, Operating system, Distributed computing)",,4.0,"List(resource allocation, cpu utilization, quality of service)",en,2,512.0,506.0,,"List(53e9a073b7602d9702957efa, 53e9ad87b7602d970377bfb5, 53e9be51b7602d9704b11381, 53e9be04b7602d9704abb31d, 53e9992bb7602d9702169236, 53e998cdb7602d97021044db, 53e9afa6b7602d97039f6054, 53e99822b7602d9702044e60)",Research on resource allocation for multi-tier web applications in a virtualization environment,"List(http://dx.doi.org/10.1007/s11704-011-0127-6, http://link.springer.com/article/10.1007/s11704-011-0127-6, http://www.webofknowledge.com/)",5.0,2011,List(53f46a22dabfaee0d9c3d5e5),572de199d39c4f49934b3d5c
53e99792b7602d9701f5af35,"List(List(53f43a51dabfaec22baa659b, null, dedwards@cs.uwf.edu, 5b8695e5e1cd8e14a36f684d, Dennis Edwards, null, null, null, null, Department of Computer Science, University of West Florida, 11000 University Parkway, Pensacola, FL 32514, USA, null, 5f71b2bd1c455f439fe3dea6, List(Department of Computer Science, University of West Florida, 11000 University Parkway, Pensacola, FL 32514, USA), null, null), List(53f3b3ffdabfae4b34b2dae9, null, ssimmons@cs.uwf.edu, 5b8695e5e1cd8e14a36f684d, Sharon Simmons, null, null, null, null, Department of Computer Science, University of West Florida, 11000 University Parkway, Pensacola, FL 32514, USA, null, 5f71b2bd1c455f439fe3dea6, List(Department of Computer Science, University of West Florida, 11000 University Parkway, Pensacola, FL 32514, USA), null, null), List(53f4333fdabfaeb22f451979, null, nwilde@uwf.edu, null, Norman Wilde, null, null, null, null, Corresponding author. Tel.: +1 850 474 2542; fax: +1 850 857 6056., null, null, List(Corresponding author. Tel.: +1 850 474 2542; fax: +1 850 857 6056., Department of Computer Science, University of West Florida, 11000 University Parkway, Pensacola, FL 32514, USA), null, null))",10.1016/j.jss.2004.12.018,"List(Data mining, Causality, End user, Ranking, Computer science, Military systems, Software, Feature model, Component-based software engineering, A-weighting, Distributed computing)",,1.0,"List(Feature location, Distributed systems, Software reconnaissance)",en,62,68.0,57.0,//static.aminer.org/pdf/PDF/000/996/035/an_approach_to_feature_location_in_distributed_systems.pdf,"List(53e9b6eeb7602d970427df40, 53e9b6eeb7602d9704283b9f, 53e9b40eb7602d9703f01b25, 53e9a3c0b7602d9702ccdfc9, 53e99818b7602d97020347a2, 53e9a2acb7602d9702bb4d7e, 558aa7ea84ae84d265bee194, 558a5258e4b037c08756714c, 53e9b946b7602d97045336a9, 53e9b1d6b7602d9703c67695, 53e9a516b7602d9702e3bcea, 53e9ac33b7602d97035f892c, 53e9ba22b7602d9704628817, 53e9af3ab7602d97039769c8, 53e9b1a3b7602d9703c2c6f7, 53e9ac89b7602d9703660f90, 53e9ad2db7602d970370e8a2, 53e9a735b7602d970306db2b, 53e99960b7602d97021a17da)",An approach to feature location in distributed systems,"List(http://dx.doi.org/10.1016/j.jss.2004.12.018, https://www.sciencedirect.com/science/article/pii/S016412120500004X, http://www.webofknowledge.com/)",79.0,2006,"List(53f43a51dabfaec22baa659b, 53f3b3ffdabfae4b34b2dae9, 53f4333fdabfaeb22f451979)",54825226582fc50b5e05610e
53e99792b7602d9701f5b0ed,"List(List(542a6734dabfae646d55cc87, null, gaoyibo@gmail.com, 5b86cb35e1cd8e14a3e00691, Kun Gao, null, null, null, null, Computer Science and Information Technology College, Zhejiang Wanli University, China, null, 5f71b6101c455f439fe555a5, null, null, null))",10.1007/978-3-540-85565-1_38,"List(Data mining, Data stream mining, Grid computing, Data-intensive computing, Computer science, Directed acyclic graph, Semantic grid, Knowledge extraction, Business process discovery, Grid, Distributed computing)",,,"List(knowledge discovery grid, dynamic grid environment, high performance data mining, parallel optimization method, grid feature, high performance ddm application, data mining grid, decomposing data mining application, data mining parallelization, data intensive computing problem, computational grid environment, data mining application, knowledge discovery, distributed computing, data mining, data intensive computing, parallelization, grid computing, directed acyclic graph)",en,4,312.0,306.0,,"List(53e99cbbb7602d970256c4af, 53e9b092b7602d9703afec3f, 53e9b089b7602d9703af06a7, 53e9ae9cb7602d97038c0529, 53e9b5f3b7602d9704144753, 558aac78e4b0b32fcb3831b7, 53e9bc3bb7602d97048b1ed7, 53e9b7b4b7602d970435c58f, 53e9b672b7602d97041db0b8, 53e9b96eb7602d970455d792, 53e9ae97b7602d97038ba7dd, 53e9adf7b7602d97038042d9, 53e9bd6ab7602d9704a0644e)",A Uniform Parallel Optimization Method for Knowledge Discovery Grid,"List(http://dx.doi.org/10.1007/978-3-540-85565-1_38, http://www.webofknowledge.com/)",5178.0,2008,List(542a6734dabfae646d55cc87),53a727f720f7420be8ba3092
53e99792b7602d9701f5b119,"List(List(5630ff9645cedb3399c3ca55, null, L.yang@lboro.ac.uk, null, Lili Yang, null, null, null, null, Business School, Loughborough University, Loughborough, Leicestershire LE11 3TU, UK, null, 5f71b29c1c455f439fe3d0d7, List(Business School, Loughborough University, Loughborough, Leicestershire LE11 3TU, UK, Corresponding author. Tel.: +44 1509 223130; fax: +44 1509 223961.), null, null), List(53f4371cdabfaec22ba8766f, null, null, null, Bryan F. Jones, null, null, null, null, Applied Computing Department, The University of Derby, Derby DE22 1GB, UK, null, 5f71b2ba1c455f439fe3dd51, List(Applied Computing Department, The University of Derby, Derby DE22 1GB, UK), null, null), List(54867430dabfae9b40133dc3, null, null, null, Shuanghua Yang, null, null, null, null, Computer Science Department, Loughborough University, Loughborough, Leicestershire LE11 3TU, UK, null, 5f71b29c1c455f439fe3d0d7, List(Computer Science Department, Loughborough University, Loughborough, Leicestershire LE11 3TU, UK), null, null))",10.1016/j.ejor.2006.07.003,"List(Objective programming, Fire protection, Computer science, Fuzzy logic, Operations research, Fire risk, Genetic algorithm, Decision maker)",,2.0,"List(Location, Fire stations, Multi-objective programming, Genetic algorithm, Fuzzy programming)",en,140,915.0,903.0,,"List(53e9bd50b7602d97049e3238, 573695926e3b12023e4b23a9, 53e9af87b7602d97039cdb54, 53e99876b7602d97020b2192, 53e9bbf5b7602d9704850823, 5c7965584895d9cbc6430efd, 53e9ab42b7602d97034d404d, 5b67376bab2dfb7a20258e8e)",A fuzzy multi-objective programming for optimization of fire station locations through genetic algorithms,"List(http://dx.doi.org/10.1016/j.ejor.2006.07.003, https://www.sciencedirect.com/science/article/pii/S037722170600467X)",181.0,2007,"List(5630ff9645cedb3399c3ca55, 53f4371cdabfaec22ba8766f, 54867430dabfae9b40133dc3)",0377-2217
53e99792b7602d9701f5b140,"List(List(56017d4445cedb3395e638f7, null, null, 5b86cf81e1cd8e14a3ff7abc, Bielikova, M., null, null, null, null, Dept. of Comput. Sci. & Eng., Slovak Tech. Univ., Bratislava, Slovakia|c|, null, 5f71b2f61c455f439fe3f847, null, null, null))",10.1109/ICALT.2001.943897,"List(Metadata, Programming language, Information retrieval, XML, Computer science, Adaptive system, Hypermedia, Software prototyping, Context model, Software, Presentation logic)",0-7695-1013-2,,"List(evolving information, database, educational module information, educational module, extensible markup language, meta-information, xml, adaptive presentation, proposed framework, administrative information, user view, software prototyping, educational administrative data processing, presentation time, hypermedia markup languages, time view, meta data, adaptive systems, hypermedia, software prototype, computer science, solids, context modeling, databases, html)",en,10,196.0,193.0,//static.aminer.org/pdf/PDF/000/269/528/adaptive_presentation_of_evolving_information_using_xml.pdf,"List(53e9a2c8b7602d9702bd1d38, 53e9b304b7602d9703dc9a0c, 53e9ae84b7602d97038a3a28, 53e99a5cb7602d97022c416d)",Adaptive presentation of evolving information using XML,"List(http://dx.doi.org/10.1109/ICALT.2001.943897, http://ieeexplore.ieee.org/xpl/abstractAuthors.jsp?tp=&arnumber=943897)",,2001,List(56017d4445cedb3395e638f7),5550376d7cea80f9541873d5
53e99792b7602d9701f5b19a,"List(List(54301e81dabfaeca69bca10d, null, null, 5b86919fe1cd8e14a353a903, Lajos Horváth, null, null, null, 0000-0001-8594-4972, Department of Mathematics, University of Utah, 155 South 1440 East, Salt Lake City, UT 84112-0090, USA, null, 5f71b2841c455f439fe3c6c8, List(Department of Mathematics, University of Utah, 155 South 1440 East, Salt Lake City, UT 84112-0090, USA), null, null), List(53f4647ddabfaee02ad8cbb6, null, null, 5b86ae49e1cd8e14a31057a0, Marie Hušková, null, null, null, 0000-0002-6868-1362, Department of Statistics, Charles University, Sokolovská 83, CZ-18600 Praha, Czech Republic, null, 5f71b2851c455f439fe3c70f, List(Department of Statistics, Charles University, Sokolovská 83, CZ-18600 Praha, Czech Republic), null, null), List(5448d49ddabfae87b7e82000, null, Piotr.Kokoszka@usu.edu, 5b8689c1e1cd8e14a3214e89, Piotr Kokoszka, null, null, null, null, Department of Mathematics and Statistics, Utah State University, 3900 Old Main Hill, Logan, UT 84322-3900, USA, null, 5f71b2961c455f439fe3ce40, List(Department of Mathematics and Statistics, Utah State University, 3900 Old Main Hill, Logan, UT 84322-3900, USA, Corresponding author.), null, null))",10.1016/j.jmva.2008.12.008,"List(Functional principal component analysis, Time series, Truncation, Autoregressive model, Applied mathematics, Test statistic, Credit card, Calculus, Statistical hypothesis testing, Mathematics, Asymptotic distribution)",,2.0,"List(asymptotic justification, observations x, equation x, functional principal component analysis, test statistic, credit card transaction data, principal component, functional autoregressive process, functional time series data, change-point, functional time series analysis, 62m10, time series data, autoregressive process, transaction data, asymptotic distribution, time series analysis)",en,62,367.0,352.0,,"List(53e9ace1b7602d97036bc507, 53e999ffb7602d970224c172)",Testing the stability of the functional autoregressive process,"List(http://dx.doi.org/10.1016/j.jmva.2008.12.008, https://dl.acm.org/doi/10.1016/j.jmva.2008.12.008, https://www.sciencedirect.com/science/article/pii/S0047259X08002789, http://www.webofknowledge.com/)",101.0,2010,"List(54301e81dabfaeca69bca10d, 53f4647ddabfaee02ad8cbb6, 5448d49ddabfae87b7e82000)",555036db7cea80f9541603d7
53e99792b7602d9701f5b1ba,"List(List(53f42c98dabfaeb22f3fc92d, null, null, 5b869b48e1cd8e14a393cf7b, V. William Porto, null, null, null, null, Natural Selection, Inc., La Jolla, CA, null, 5f71b57c1c455f439fe515f1, null, null, null), List(53f64e31dabfae6a71b6029f, null, null, 5b869b48e1cd8e14a393cf7b, David B. Fogel, null, null, null, null, Natural Selection, Inc., La Jolla, CA, null, 5f71b57c1c455f439fe515f1, null, null, null), List(53f47307dabfaee02adc6671, null, null, 5b869b48e1cd8e14a393cf7b, Lawrence J. Fogel, null, null, null, null, Natural Selection, Inc., La Jolla, CA, null, 5f71b57c1c455f439fe515f1, null, null, null))",10.1145/1056754.1056755,"List(Interactive evolutionary computation, Human-based evolutionary computation, Evolutionary algorithm, Computer science, Evolutionary computation, Evolution strategy, Artificial intelligence, Evolutionary programming, Evolutionary music, Java Evolutionary Computation Toolkit)",,2.0,"List(adaptive behavior, platoon-level engagement, current knowledge, generating novel tactic, appropriate tactic, simulated evolution, evolutionary computation, projected situation, higher level organization, individual unit, evolutionary algorithm, real time, evolutionary computing)",en,5,14.0,8.0,,"List(53e9bc80b7602d97048fe74a, 53e9bb93b7602d97047d88e5, 53e99d28b7602d97025dc235, 53e99842b7602d970206de90, 558a8260e4b031bae1f80ed5)",Generating novel tactics through evolutionary computation,"List(http://dx.doi.org/10.1145/1056754.1056755, http://doi.acm.org/10.1145/1056754.1056755)",9.0,1998,"List(53f42c98dabfaeb22f3fc92d, 53f64e31dabfae6a71b6029f, 53f47307dabfaee02adc6671)",53a7310820f7420be8d1bc69
53e99792b7602d9701f5b1e7,"List(List(53f43685dabfaec09f17df79, null, null, 5b86b6bfe1cd8e14a34c9318, Katie L. Hoffman, null, null, null, null, harvard university, null, 5f71b4501c455f439fe491ff, null, null, null), List(54069dffdabfae44f084828e, null, null, 5b86b6bfe1cd8e14a34c9318, Robert J. Wood, null, null, null, null, harvard university, null, 5f71b4501c455f439fe491ff, null, null, null))",10.1109/IROS.2013.6696543,"List(Gait, Longitudinal static stability, Simulation, Control theory, Catastrophic failure, Robustness (computer science), Gait analysis, Fault tolerance, Redundancy (engineering), Engineering, Robot)",,,"List(gait analysis, legged locomotion, stability, centipede-inspired millirobot locomotion, curvature radius, curvature speed, gait options, leg failures, mechanical redundancy, miniature ambulatory robots, performance metrics, robustness enhancement, static stability retention)",en,8,1479.0,1472.0,https://static.aminer.cn/upload/pdf/program/53e99792b7602d9701f5b1e7_0.pdf,"List(53e9aa61b7602d97033d7457, 53e9b464b7602d9703f64f1b, 53e998fcb7602d9702138f37, 53e9bd0ab7602d9704990fd9, 558b092284ae84d265c11df2, 53e99f8db7602d9702865bbe, 558b6afae4b037c0875cc785, 53e9992ab7602d9702163a36, 53e9af26b7602d970396322a, 53e9b2f4b7602d9703db0617, 53e99d6bb7602d9702623452, 558a4e4684ae84d265bcd17f, 53e9b12ab7602d9703bad72e)",Robustness of centipede-inspired millirobot locomotion to leg failures.,"List(http://dx.doi.org/10.1109/IROS.2013.6696543, http://ieeexplore.ieee.org/xpl/abstractAuthors.jsp?tp=&arnumber=6696543)",,2013,"List(53f43685dabfaec09f17df79, 54069dffdabfae44f084828e)",555037837cea80f95418b43e
53e99792b7602d9701f5b2b3,"List(List(53f427b6dabfaec09f0d9c8a, null, null, 5b86b6c7e1cd8e14a34ccbac, Bruno Van Den Bossche, null, null, null, null, Ghent University, Gent, Belgium, null, 5f71b2961c455f439fe3ce44, null, null, null), List(53f4296cdabfaec22b9e434b, null, null, 5b86b6c7e1cd8e14a34ccbac, Tom Verdickt, null, null, null, null, Ghent University, Gent, Belgium, null, 5f71b2961c455f439fe3ce44, null, null, null), List(53f3839ddabfae4b34a04de3, null, null, 5b86b6c7e1cd8e14a34ccbac, Bart De Vleeschauwer, null, null, null, null, Ghent University, Gent, Belgium, null, 5f71b2961c455f439fe3ce44, null, null, null), List(53f42f58dabfaee43ebde3dc, null, null, 5b86b6c7e1cd8e14a34ccbac, Stein Desmet, null, null, null, null, Ghent University, Gent, Belgium, null, 5f71b2961c455f439fe3ce44, null, null, null), List(53f455dedabfaee0d9bf1e22, null, null, 5b86b6c7e1cd8e14a34ccbac, Stijn De Mulder, null, null, null, null, Ghent University, Gent, Belgium, null, 5f71b2961c455f439fe3ce44, null, null, null), List(543501afdabfaebba588d847, null, null, 5b86b6c7e1cd8e14a34ccbac, Filip De Turck, null, null, null, null, Ghent University, Gent, Belgium, null, 5f71b2961c455f439fe3ce44, null, null, null), List(5630a35645cedb3399af4a03, null, null, 5b86b6c7e1cd8e14a34ccbac, Bart Dhoedt, null, null, null, null, Ghent University, Gent, Belgium, null, 5f71b2961c455f439fe3ce44, null, null, null), List(53f56636dabfae6293f8045b, null, null, 5b86b6c7e1cd8e14a34ccbac, Piet Demeester, null, null, null, null, Ghent University, Gent, Belgium, null, 5f71b2961c455f439fe3ce44, null, null, null))",10.1145/1378191.1378195,"List(Server-side, Division (mathematics), Computer science, Load balancing (computing), Popularity, Microcell, Software architecture, Virtual world, Distributed computing)",1-59593-285-2,,"List(virtual world, dynamic microcell redeployment, evaluation result, massively multiplayer online games, server side, load distribution, multiplayer online game, efficient software architecture, software architecture, huge popularity, virtual worlds, load balancing, load balance)",en,24,,,https://static.aminer.cn/upload/pdf/program/53e99792b7602d9701f5b2b3_0.pdf,"List(558a424ae4b0b32fcb35c0ae, 53e9b648b7602d97041a6427, 53e9b8a1b7602d970447c694, 53e99df1b7602d97026b383b)",A platform for dynamic microcell redeployment in massively multiplayer online games,"List(http://dx.doi.org/10.1145/1378191.1378195, http://doi.acm.org/10.1145/1378191.1378195)",,2006,"List(53f427b6dabfaec09f0d9c8a, 53f4296cdabfaec22b9e434b, 53f3839ddabfae4b34a04de3, 53f42f58dabfaee43ebde3dc, 53f455dedabfaee0d9bf1e22, 543501afdabfaebba588d847, 5630a35645cedb3399af4a03, 53f56636dabfae6293f8045b)",555037227cea80f95417540f
53e99792b7602d9701f5b2bc,"List(List(54096bf9dabfae8faa68e261, null, lanwang@memphis.edu, 5b86c4f3e1cd8e14a3b2badb, Lan Wang, null, null, null, null, University of Memphis, Computer Science Department, 318 Dunn Hall, Memphis, TN 38152, United States and Colorado State University, Computer Science Department, Fort Collins, CO 80523, United State ..., null, 5f71b2aa1c455f439fe3d5c6, List(Corresponding author. Address: University of Memphis, Computer Science Department, 318 Dunn Hall, Memphis, TN 38152, United States. Tel.: +1 901 678 2727., University of Memphis, Computer Science Department, 318 Dunn Hall, Memphis, TN 38152, United States, Colorado State University, Computer Science Department, Fort Collins, CO 80523, United States, University of California, Los Angeles, Computer Science Department, Los Angeles, CA 90095, United States), null, null), List(56017d4c45cedb3395e639a3, null, massey@cs.colostate.edu, 5b86c4f3e1cd8e14a3b2badb, Daniel Massey, null, null, null, null, University of Memphis, Computer Science Department, 318 Dunn Hall, Memphis, TN 38152, United States and Colorado State University, Computer Science Department, Fort Collins, CO 80523, United State ..., null, 5f71b2aa1c455f439fe3d5c6, List(University of Memphis, Computer Science Department, 318 Dunn Hall, Memphis, TN 38152, United States, Colorado State University, Computer Science Department, Fort Collins, CO 80523, United States, University of California, Los Angeles, Computer Science Department, Los Angeles, CA 90095, United States), null, null), List(560aae3445cedb33971711c9, null, lixia@cs.ucla.edu, 5b86c4f3e1cd8e14a3b2badb, Lixia Zhang, null, null, null, null, University of Memphis, Computer Science Department, 318 Dunn Hall, Memphis, TN 38152, United States and Colorado State University, Computer Science Department, Fort Collins, CO 80523, United State ..., null, 5f71b2aa1c455f439fe3d5c6, List(University of Memphis, Computer Science Department, 318 Dunn Hall, Memphis, TN 38152, United States, Colorado State University, Computer Science Department, Fort Collins, CO 80523, United States, University of California, Los Angeles, Computer Science Department, Los Angeles, CA 90095, United States), null, null))",10.1016/j.comnet.2006.07.015,"List(Convergence (routing), Default-free zone, Computer science, Soft state, Computer network, Routing domain, Border Gateway Protocol, Routing table, Communications protocol, Routing protocol)",,6.0,"List(Reliability, Fault tolerance, Network state management, Soft-state, State consistency, Refresh overhead)",en,6,1458.0,1444.0,,"List(53e9984fb7602d9702081493, 53e9ab37b7602d97034c376f, 557f1370d19faf961d16e9f2, 53e9bcefb7602d970497580f, 53e99c04b7602d97024ae6fe, 53e9abd4b7602d970358f037, 53e9a508b7602d9702e2c226, 53e99937b7602d970217392b, 558a65c5e4b031bae1f764e6, 53e99d0cb7602d97025c39df, 53e99b51b7602d97023f58dc, 53e99931b7602d970216db73, 53e999e0b7602d9702223ef4, 53e9bac1b7602d97046eef9f, 53e9a9d9b7602d970333a6aa)",Persistent detection and recovery of state inconsistencies,"List(http://dx.doi.org/10.1016/j.comnet.2006.07.015, http://dx.doi.org/https://doi.org/10.1016/j.comnet.2006.07.015, https://www.sciencedirect.com/science/article/pii/S1389128606002088)",51.0,2007,"List(54096bf9dabfae8faa68e261, 56017d4c45cedb3395e639a3, 560aae3445cedb33971711c9)",53907df520f770854f6106bd


#### 3.2. Author DF

In [0]:
# Create the Authors DF
def create_authors_df(_df):
    df2 = _df.withColumn('auth_expl', F.explode(F.col("authors"))) # explode the authors array
    df2 = (df2.withColumn('auth_id', F.col('auth_expl._id')) # separate the authors id and name
              .withColumn('auth_name', F.col('auth_expl.name')))

    # Make the authors df of distinct auth_id and auth_name pairs. Also keeps only the first instance of duplicate ID entries.
    authors_df = df2.select('auth_id', 'auth_name').groupBy(F.col("auth_id")).agg(F.first(F.col("auth_name"), ignorenulls=True).alias("auth_name"))

    authors_df = (authors_df.withColumnRenamed('auth_id', 'ID')
                            .withColumnRenamed('auth_name', 'Name'))
    return _df, authors_df

_df, authors_df = create_authors_df(_df)
display(authors_df.limit(DISPLAY_LIMIT))

ID,Name
53f3186fdabfae9a84425cfb,Matthew Prowse
53f31871dabfae9a84425db7,Renato Fabbri
53f31873dabfae9a84425e8a,Joachim Schimpf
53f31875dabfae9a84425f46,Steven F. Roth
53f31876dabfae9a84425fb2,Jarkko Oksala
53f31878dabfae9a8442603d,Nima Zahadat
53f31879dabfae9a844260c5,Philips S.
53f3187ddabfae9a84426223,Ke Fa Cen
53f31881dabfae9a844263e3,Eduardo H. Ramirez
53f31882dabfae9a844263f4,B. Besserer


In [0]:
# The authors ID column (named Author_ID) is already in the original DF from the Transform function.
display(_df.limit(DISPLAY_LIMIT))

_id,authors,doi,fos,isbn,issue,keywords,lang,n_citation,page_end,page_start,pdf,references,title,url,volume,year,Author_ID,venue_id
53e99784b7602d9701f3f5fe,"List(List(53f46a22dabfaee0d9c3d5e5, null, ysg_2005@hotmail.com, 5b8698cce1cd8e14a3826671, Shuguo Yang, null, null, null, null, School of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, China 266061, null, 5f71b2e91c455f439fe3f23f, null, null, null))",10.1007/s11704-011-0127-6,"List(Virtualization, Service level objective, Virtual machine, Computer science, Testbed, Quality of service, Provisioning, Resource allocation, Web application, Operating system, Distributed computing)",,4.0,"List(resource allocation, cpu utilization, quality of service)",en,2,512.0,506.0,,"List(53e9a073b7602d9702957efa, 53e9ad87b7602d970377bfb5, 53e9be51b7602d9704b11381, 53e9be04b7602d9704abb31d, 53e9992bb7602d9702169236, 53e998cdb7602d97021044db, 53e9afa6b7602d97039f6054, 53e99822b7602d9702044e60)",Research on resource allocation for multi-tier web applications in a virtualization environment,"List(http://dx.doi.org/10.1007/s11704-011-0127-6, http://link.springer.com/article/10.1007/s11704-011-0127-6, http://www.webofknowledge.com/)",5.0,2011,List(53f46a22dabfaee0d9c3d5e5),572de199d39c4f49934b3d5c
53e99792b7602d9701f5af35,"List(List(53f43a51dabfaec22baa659b, null, dedwards@cs.uwf.edu, 5b8695e5e1cd8e14a36f684d, Dennis Edwards, null, null, null, null, Department of Computer Science, University of West Florida, 11000 University Parkway, Pensacola, FL 32514, USA, null, 5f71b2bd1c455f439fe3dea6, List(Department of Computer Science, University of West Florida, 11000 University Parkway, Pensacola, FL 32514, USA), null, null), List(53f3b3ffdabfae4b34b2dae9, null, ssimmons@cs.uwf.edu, 5b8695e5e1cd8e14a36f684d, Sharon Simmons, null, null, null, null, Department of Computer Science, University of West Florida, 11000 University Parkway, Pensacola, FL 32514, USA, null, 5f71b2bd1c455f439fe3dea6, List(Department of Computer Science, University of West Florida, 11000 University Parkway, Pensacola, FL 32514, USA), null, null), List(53f4333fdabfaeb22f451979, null, nwilde@uwf.edu, null, Norman Wilde, null, null, null, null, Corresponding author. Tel.: +1 850 474 2542; fax: +1 850 857 6056., null, null, List(Corresponding author. Tel.: +1 850 474 2542; fax: +1 850 857 6056., Department of Computer Science, University of West Florida, 11000 University Parkway, Pensacola, FL 32514, USA), null, null))",10.1016/j.jss.2004.12.018,"List(Data mining, Causality, End user, Ranking, Computer science, Military systems, Software, Feature model, Component-based software engineering, A-weighting, Distributed computing)",,1.0,"List(Feature location, Distributed systems, Software reconnaissance)",en,62,68.0,57.0,//static.aminer.org/pdf/PDF/000/996/035/an_approach_to_feature_location_in_distributed_systems.pdf,"List(53e9b6eeb7602d970427df40, 53e9b6eeb7602d9704283b9f, 53e9b40eb7602d9703f01b25, 53e9a3c0b7602d9702ccdfc9, 53e99818b7602d97020347a2, 53e9a2acb7602d9702bb4d7e, 558aa7ea84ae84d265bee194, 558a5258e4b037c08756714c, 53e9b946b7602d97045336a9, 53e9b1d6b7602d9703c67695, 53e9a516b7602d9702e3bcea, 53e9ac33b7602d97035f892c, 53e9ba22b7602d9704628817, 53e9af3ab7602d97039769c8, 53e9b1a3b7602d9703c2c6f7, 53e9ac89b7602d9703660f90, 53e9ad2db7602d970370e8a2, 53e9a735b7602d970306db2b, 53e99960b7602d97021a17da)",An approach to feature location in distributed systems,"List(http://dx.doi.org/10.1016/j.jss.2004.12.018, https://www.sciencedirect.com/science/article/pii/S016412120500004X, http://www.webofknowledge.com/)",79.0,2006,"List(53f43a51dabfaec22baa659b, 53f3b3ffdabfae4b34b2dae9, 53f4333fdabfaeb22f451979)",54825226582fc50b5e05610e
53e99792b7602d9701f5b0ed,"List(List(542a6734dabfae646d55cc87, null, gaoyibo@gmail.com, 5b86cb35e1cd8e14a3e00691, Kun Gao, null, null, null, null, Computer Science and Information Technology College, Zhejiang Wanli University, China, null, 5f71b6101c455f439fe555a5, null, null, null))",10.1007/978-3-540-85565-1_38,"List(Data mining, Data stream mining, Grid computing, Data-intensive computing, Computer science, Directed acyclic graph, Semantic grid, Knowledge extraction, Business process discovery, Grid, Distributed computing)",,,"List(knowledge discovery grid, dynamic grid environment, high performance data mining, parallel optimization method, grid feature, high performance ddm application, data mining grid, decomposing data mining application, data mining parallelization, data intensive computing problem, computational grid environment, data mining application, knowledge discovery, distributed computing, data mining, data intensive computing, parallelization, grid computing, directed acyclic graph)",en,4,312.0,306.0,,"List(53e99cbbb7602d970256c4af, 53e9b092b7602d9703afec3f, 53e9b089b7602d9703af06a7, 53e9ae9cb7602d97038c0529, 53e9b5f3b7602d9704144753, 558aac78e4b0b32fcb3831b7, 53e9bc3bb7602d97048b1ed7, 53e9b7b4b7602d970435c58f, 53e9b672b7602d97041db0b8, 53e9b96eb7602d970455d792, 53e9ae97b7602d97038ba7dd, 53e9adf7b7602d97038042d9, 53e9bd6ab7602d9704a0644e)",A Uniform Parallel Optimization Method for Knowledge Discovery Grid,"List(http://dx.doi.org/10.1007/978-3-540-85565-1_38, http://www.webofknowledge.com/)",5178.0,2008,List(542a6734dabfae646d55cc87),53a727f720f7420be8ba3092
53e99792b7602d9701f5b119,"List(List(5630ff9645cedb3399c3ca55, null, L.yang@lboro.ac.uk, null, Lili Yang, null, null, null, null, Business School, Loughborough University, Loughborough, Leicestershire LE11 3TU, UK, null, 5f71b29c1c455f439fe3d0d7, List(Business School, Loughborough University, Loughborough, Leicestershire LE11 3TU, UK, Corresponding author. Tel.: +44 1509 223130; fax: +44 1509 223961.), null, null), List(53f4371cdabfaec22ba8766f, null, null, null, Bryan F. Jones, null, null, null, null, Applied Computing Department, The University of Derby, Derby DE22 1GB, UK, null, 5f71b2ba1c455f439fe3dd51, List(Applied Computing Department, The University of Derby, Derby DE22 1GB, UK), null, null), List(54867430dabfae9b40133dc3, null, null, null, Shuanghua Yang, null, null, null, null, Computer Science Department, Loughborough University, Loughborough, Leicestershire LE11 3TU, UK, null, 5f71b29c1c455f439fe3d0d7, List(Computer Science Department, Loughborough University, Loughborough, Leicestershire LE11 3TU, UK), null, null))",10.1016/j.ejor.2006.07.003,"List(Objective programming, Fire protection, Computer science, Fuzzy logic, Operations research, Fire risk, Genetic algorithm, Decision maker)",,2.0,"List(Location, Fire stations, Multi-objective programming, Genetic algorithm, Fuzzy programming)",en,140,915.0,903.0,,"List(53e9bd50b7602d97049e3238, 573695926e3b12023e4b23a9, 53e9af87b7602d97039cdb54, 53e99876b7602d97020b2192, 53e9bbf5b7602d9704850823, 5c7965584895d9cbc6430efd, 53e9ab42b7602d97034d404d, 5b67376bab2dfb7a20258e8e)",A fuzzy multi-objective programming for optimization of fire station locations through genetic algorithms,"List(http://dx.doi.org/10.1016/j.ejor.2006.07.003, https://www.sciencedirect.com/science/article/pii/S037722170600467X)",181.0,2007,"List(5630ff9645cedb3399c3ca55, 53f4371cdabfaec22ba8766f, 54867430dabfae9b40133dc3)",0377-2217
53e99792b7602d9701f5b140,"List(List(56017d4445cedb3395e638f7, null, null, 5b86cf81e1cd8e14a3ff7abc, Bielikova, M., null, null, null, null, Dept. of Comput. Sci. & Eng., Slovak Tech. Univ., Bratislava, Slovakia|c|, null, 5f71b2f61c455f439fe3f847, null, null, null))",10.1109/ICALT.2001.943897,"List(Metadata, Programming language, Information retrieval, XML, Computer science, Adaptive system, Hypermedia, Software prototyping, Context model, Software, Presentation logic)",0-7695-1013-2,,"List(evolving information, database, educational module information, educational module, extensible markup language, meta-information, xml, adaptive presentation, proposed framework, administrative information, user view, software prototyping, educational administrative data processing, presentation time, hypermedia markup languages, time view, meta data, adaptive systems, hypermedia, software prototype, computer science, solids, context modeling, databases, html)",en,10,196.0,193.0,//static.aminer.org/pdf/PDF/000/269/528/adaptive_presentation_of_evolving_information_using_xml.pdf,"List(53e9a2c8b7602d9702bd1d38, 53e9b304b7602d9703dc9a0c, 53e9ae84b7602d97038a3a28, 53e99a5cb7602d97022c416d)",Adaptive presentation of evolving information using XML,"List(http://dx.doi.org/10.1109/ICALT.2001.943897, http://ieeexplore.ieee.org/xpl/abstractAuthors.jsp?tp=&arnumber=943897)",,2001,List(56017d4445cedb3395e638f7),5550376d7cea80f9541873d5
53e99792b7602d9701f5b19a,"List(List(54301e81dabfaeca69bca10d, null, null, 5b86919fe1cd8e14a353a903, Lajos Horváth, null, null, null, 0000-0001-8594-4972, Department of Mathematics, University of Utah, 155 South 1440 East, Salt Lake City, UT 84112-0090, USA, null, 5f71b2841c455f439fe3c6c8, List(Department of Mathematics, University of Utah, 155 South 1440 East, Salt Lake City, UT 84112-0090, USA), null, null), List(53f4647ddabfaee02ad8cbb6, null, null, 5b86ae49e1cd8e14a31057a0, Marie Hušková, null, null, null, 0000-0002-6868-1362, Department of Statistics, Charles University, Sokolovská 83, CZ-18600 Praha, Czech Republic, null, 5f71b2851c455f439fe3c70f, List(Department of Statistics, Charles University, Sokolovská 83, CZ-18600 Praha, Czech Republic), null, null), List(5448d49ddabfae87b7e82000, null, Piotr.Kokoszka@usu.edu, 5b8689c1e1cd8e14a3214e89, Piotr Kokoszka, null, null, null, null, Department of Mathematics and Statistics, Utah State University, 3900 Old Main Hill, Logan, UT 84322-3900, USA, null, 5f71b2961c455f439fe3ce40, List(Department of Mathematics and Statistics, Utah State University, 3900 Old Main Hill, Logan, UT 84322-3900, USA, Corresponding author.), null, null))",10.1016/j.jmva.2008.12.008,"List(Functional principal component analysis, Time series, Truncation, Autoregressive model, Applied mathematics, Test statistic, Credit card, Calculus, Statistical hypothesis testing, Mathematics, Asymptotic distribution)",,2.0,"List(asymptotic justification, observations x, equation x, functional principal component analysis, test statistic, credit card transaction data, principal component, functional autoregressive process, functional time series data, change-point, functional time series analysis, 62m10, time series data, autoregressive process, transaction data, asymptotic distribution, time series analysis)",en,62,367.0,352.0,,"List(53e9ace1b7602d97036bc507, 53e999ffb7602d970224c172)",Testing the stability of the functional autoregressive process,"List(http://dx.doi.org/10.1016/j.jmva.2008.12.008, https://dl.acm.org/doi/10.1016/j.jmva.2008.12.008, https://www.sciencedirect.com/science/article/pii/S0047259X08002789, http://www.webofknowledge.com/)",101.0,2010,"List(54301e81dabfaeca69bca10d, 53f4647ddabfaee02ad8cbb6, 5448d49ddabfae87b7e82000)",555036db7cea80f9541603d7
53e99792b7602d9701f5b1ba,"List(List(53f42c98dabfaeb22f3fc92d, null, null, 5b869b48e1cd8e14a393cf7b, V. William Porto, null, null, null, null, Natural Selection, Inc., La Jolla, CA, null, 5f71b57c1c455f439fe515f1, null, null, null), List(53f64e31dabfae6a71b6029f, null, null, 5b869b48e1cd8e14a393cf7b, David B. Fogel, null, null, null, null, Natural Selection, Inc., La Jolla, CA, null, 5f71b57c1c455f439fe515f1, null, null, null), List(53f47307dabfaee02adc6671, null, null, 5b869b48e1cd8e14a393cf7b, Lawrence J. Fogel, null, null, null, null, Natural Selection, Inc., La Jolla, CA, null, 5f71b57c1c455f439fe515f1, null, null, null))",10.1145/1056754.1056755,"List(Interactive evolutionary computation, Human-based evolutionary computation, Evolutionary algorithm, Computer science, Evolutionary computation, Evolution strategy, Artificial intelligence, Evolutionary programming, Evolutionary music, Java Evolutionary Computation Toolkit)",,2.0,"List(adaptive behavior, platoon-level engagement, current knowledge, generating novel tactic, appropriate tactic, simulated evolution, evolutionary computation, projected situation, higher level organization, individual unit, evolutionary algorithm, real time, evolutionary computing)",en,5,14.0,8.0,,"List(53e9bc80b7602d97048fe74a, 53e9bb93b7602d97047d88e5, 53e99d28b7602d97025dc235, 53e99842b7602d970206de90, 558a8260e4b031bae1f80ed5)",Generating novel tactics through evolutionary computation,"List(http://dx.doi.org/10.1145/1056754.1056755, http://doi.acm.org/10.1145/1056754.1056755)",9.0,1998,"List(53f42c98dabfaeb22f3fc92d, 53f64e31dabfae6a71b6029f, 53f47307dabfaee02adc6671)",53a7310820f7420be8d1bc69
53e99792b7602d9701f5b1e7,"List(List(53f43685dabfaec09f17df79, null, null, 5b86b6bfe1cd8e14a34c9318, Katie L. Hoffman, null, null, null, null, harvard university, null, 5f71b4501c455f439fe491ff, null, null, null), List(54069dffdabfae44f084828e, null, null, 5b86b6bfe1cd8e14a34c9318, Robert J. Wood, null, null, null, null, harvard university, null, 5f71b4501c455f439fe491ff, null, null, null))",10.1109/IROS.2013.6696543,"List(Gait, Longitudinal static stability, Simulation, Control theory, Catastrophic failure, Robustness (computer science), Gait analysis, Fault tolerance, Redundancy (engineering), Engineering, Robot)",,,"List(gait analysis, legged locomotion, stability, centipede-inspired millirobot locomotion, curvature radius, curvature speed, gait options, leg failures, mechanical redundancy, miniature ambulatory robots, performance metrics, robustness enhancement, static stability retention)",en,8,1479.0,1472.0,https://static.aminer.cn/upload/pdf/program/53e99792b7602d9701f5b1e7_0.pdf,"List(53e9aa61b7602d97033d7457, 53e9b464b7602d9703f64f1b, 53e998fcb7602d9702138f37, 53e9bd0ab7602d9704990fd9, 558b092284ae84d265c11df2, 53e99f8db7602d9702865bbe, 558b6afae4b037c0875cc785, 53e9992ab7602d9702163a36, 53e9af26b7602d970396322a, 53e9b2f4b7602d9703db0617, 53e99d6bb7602d9702623452, 558a4e4684ae84d265bcd17f, 53e9b12ab7602d9703bad72e)",Robustness of centipede-inspired millirobot locomotion to leg failures.,"List(http://dx.doi.org/10.1109/IROS.2013.6696543, http://ieeexplore.ieee.org/xpl/abstractAuthors.jsp?tp=&arnumber=6696543)",,2013,"List(53f43685dabfaec09f17df79, 54069dffdabfae44f084828e)",555037837cea80f95418b43e
53e99792b7602d9701f5b2b3,"List(List(53f427b6dabfaec09f0d9c8a, null, null, 5b86b6c7e1cd8e14a34ccbac, Bruno Van Den Bossche, null, null, null, null, Ghent University, Gent, Belgium, null, 5f71b2961c455f439fe3ce44, null, null, null), List(53f4296cdabfaec22b9e434b, null, null, 5b86b6c7e1cd8e14a34ccbac, Tom Verdickt, null, null, null, null, Ghent University, Gent, Belgium, null, 5f71b2961c455f439fe3ce44, null, null, null), List(53f3839ddabfae4b34a04de3, null, null, 5b86b6c7e1cd8e14a34ccbac, Bart De Vleeschauwer, null, null, null, null, Ghent University, Gent, Belgium, null, 5f71b2961c455f439fe3ce44, null, null, null), List(53f42f58dabfaee43ebde3dc, null, null, 5b86b6c7e1cd8e14a34ccbac, Stein Desmet, null, null, null, null, Ghent University, Gent, Belgium, null, 5f71b2961c455f439fe3ce44, null, null, null), List(53f455dedabfaee0d9bf1e22, null, null, 5b86b6c7e1cd8e14a34ccbac, Stijn De Mulder, null, null, null, null, Ghent University, Gent, Belgium, null, 5f71b2961c455f439fe3ce44, null, null, null), List(543501afdabfaebba588d847, null, null, 5b86b6c7e1cd8e14a34ccbac, Filip De Turck, null, null, null, null, Ghent University, Gent, Belgium, null, 5f71b2961c455f439fe3ce44, null, null, null), List(5630a35645cedb3399af4a03, null, null, 5b86b6c7e1cd8e14a34ccbac, Bart Dhoedt, null, null, null, null, Ghent University, Gent, Belgium, null, 5f71b2961c455f439fe3ce44, null, null, null), List(53f56636dabfae6293f8045b, null, null, 5b86b6c7e1cd8e14a34ccbac, Piet Demeester, null, null, null, null, Ghent University, Gent, Belgium, null, 5f71b2961c455f439fe3ce44, null, null, null))",10.1145/1378191.1378195,"List(Server-side, Division (mathematics), Computer science, Load balancing (computing), Popularity, Microcell, Software architecture, Virtual world, Distributed computing)",1-59593-285-2,,"List(virtual world, dynamic microcell redeployment, evaluation result, massively multiplayer online games, server side, load distribution, multiplayer online game, efficient software architecture, software architecture, huge popularity, virtual worlds, load balancing, load balance)",en,24,,,https://static.aminer.cn/upload/pdf/program/53e99792b7602d9701f5b2b3_0.pdf,"List(558a424ae4b0b32fcb35c0ae, 53e9b648b7602d97041a6427, 53e9b8a1b7602d970447c694, 53e99df1b7602d97026b383b)",A platform for dynamic microcell redeployment in massively multiplayer online games,"List(http://dx.doi.org/10.1145/1378191.1378195, http://doi.acm.org/10.1145/1378191.1378195)",,2006,"List(53f427b6dabfaec09f0d9c8a, 53f4296cdabfaec22b9e434b, 53f3839ddabfae4b34a04de3, 53f42f58dabfaee43ebde3dc, 53f455dedabfaee0d9bf1e22, 543501afdabfaebba588d847, 5630a35645cedb3399af4a03, 53f56636dabfae6293f8045b)",555037227cea80f95417540f
53e99792b7602d9701f5b2bc,"List(List(54096bf9dabfae8faa68e261, null, lanwang@memphis.edu, 5b86c4f3e1cd8e14a3b2badb, Lan Wang, null, null, null, null, University of Memphis, Computer Science Department, 318 Dunn Hall, Memphis, TN 38152, United States and Colorado State University, Computer Science Department, Fort Collins, CO 80523, United State ..., null, 5f71b2aa1c455f439fe3d5c6, List(Corresponding author. Address: University of Memphis, Computer Science Department, 318 Dunn Hall, Memphis, TN 38152, United States. Tel.: +1 901 678 2727., University of Memphis, Computer Science Department, 318 Dunn Hall, Memphis, TN 38152, United States, Colorado State University, Computer Science Department, Fort Collins, CO 80523, United States, University of California, Los Angeles, Computer Science Department, Los Angeles, CA 90095, United States), null, null), List(56017d4c45cedb3395e639a3, null, massey@cs.colostate.edu, 5b86c4f3e1cd8e14a3b2badb, Daniel Massey, null, null, null, null, University of Memphis, Computer Science Department, 318 Dunn Hall, Memphis, TN 38152, United States and Colorado State University, Computer Science Department, Fort Collins, CO 80523, United State ..., null, 5f71b2aa1c455f439fe3d5c6, List(University of Memphis, Computer Science Department, 318 Dunn Hall, Memphis, TN 38152, United States, Colorado State University, Computer Science Department, Fort Collins, CO 80523, United States, University of California, Los Angeles, Computer Science Department, Los Angeles, CA 90095, United States), null, null), List(560aae3445cedb33971711c9, null, lixia@cs.ucla.edu, 5b86c4f3e1cd8e14a3b2badb, Lixia Zhang, null, null, null, null, University of Memphis, Computer Science Department, 318 Dunn Hall, Memphis, TN 38152, United States and Colorado State University, Computer Science Department, Fort Collins, CO 80523, United State ..., null, 5f71b2aa1c455f439fe3d5c6, List(University of Memphis, Computer Science Department, 318 Dunn Hall, Memphis, TN 38152, United States, Colorado State University, Computer Science Department, Fort Collins, CO 80523, United States, University of California, Los Angeles, Computer Science Department, Los Angeles, CA 90095, United States), null, null))",10.1016/j.comnet.2006.07.015,"List(Convergence (routing), Default-free zone, Computer science, Soft state, Computer network, Routing domain, Border Gateway Protocol, Routing table, Communications protocol, Routing protocol)",,6.0,"List(Reliability, Fault tolerance, Network state management, Soft-state, State consistency, Refresh overhead)",en,6,1458.0,1444.0,,"List(53e9984fb7602d9702081493, 53e9ab37b7602d97034c376f, 557f1370d19faf961d16e9f2, 53e9bcefb7602d970497580f, 53e99c04b7602d97024ae6fe, 53e9abd4b7602d970358f037, 53e9a508b7602d9702e2c226, 53e99937b7602d970217392b, 558a65c5e4b031bae1f764e6, 53e99d0cb7602d97025c39df, 53e99b51b7602d97023f58dc, 53e99931b7602d970216db73, 53e999e0b7602d9702223ef4, 53e9bac1b7602d97046eef9f, 53e9a9d9b7602d970333a6aa)",Persistent detection and recovery of state inconsistencies,"List(http://dx.doi.org/10.1016/j.comnet.2006.07.015, http://dx.doi.org/https://doi.org/10.1016/j.comnet.2006.07.015, https://www.sciencedirect.com/science/article/pii/S1389128606002088)",51.0,2007,"List(54096bf9dabfae8faa68e261, 56017d4c45cedb3395e639a3, 560aae3445cedb33971711c9)",53907df520f770854f6106bd


#### 3.3. Organization DF

In [0]:
# Finds the country names in a list of strings
# modified to only use the first element of the list
# uses regex to remove punctuation from the string and to match the given names of the countries and some abbreviations

def getCountry(s):
    if s is None:
        return None
    arr = []
    countries = ["Afghanistan", "Albania", "Algeria", "Andorra", "Angola", "Antigua & Deps", "Argentina", "Armenia", "Australia", "Austria", "Azerbaijan", "Bahamas", "Bahrain", "Bangladesh", "Barbados", "Belarus", "Belgium", "Belize", "Benin", "Bhutan", "Bolivia", "Bosnia", "Botswana", "Brazil", "Brunei", "Bulgaria", "Burkina", "Burundi", "Cambodia", "Cameroon", "Canada", "Cape Verde", "Central African Republic", "Chad", "Chile", "China", "Colombia", "Comoros", "Congo", "Congo Democratic Republic", "Costa Rica", "Croatia", "Cuba", "Cyprus", "Czech Republic", "Denmark", "Djibouti", "Dominica", "Dominican Republic", "East Timor", "Ecuador", "Egypt", "El Salvador", "Equatorial Guinea", "Eritrea", "Estonia", "Ethiopia", "Fiji", "Finland", "France", "Gabon", "Gambia", "Georgia", "Germany", "Ghana", "Greece", "Grenada", "Guatemala", "Guinea", "Guinea-bissau", "Guyana", "Haiti", "Honduras", "Hungary", "Iceland", "India", "Indonesia", "Iran", "Iraq", "Ireland", "Israel", "Italy", "Ivory Coast", "Jamaica", "Japan", "Jordan", "Kazakhstan", "Kenya", "Kiribati", "South Korea", "Kosovo", "Kuwait", "Kyrgyzstan", "Laos", "Latvia", "Lebanon", "Lesotho", "Liberia", "Libya", "Liechtenstein", "Lithuania", "Luxembourg", "Macedonia", "Madagascar", "Malawi", "Malaysia", "Maldives", "Mali", "Malta", "Marshall Islands", "Mauritania", "Mauritius", "Mexico", "Micronesia", "Moldova", "Monaco", "Mongolia", "Montenegro", "Morocco", "Mozambique", "Myanmar", "Burma", "Namibia", "Nauru", "Nepal", "Netherlands", "New Zealand", "Nicaragua", "Niger", "Nigeria", "Norway", "Romania", "Pakistan", "Palau", "Panama", "Papua New Guinea", "Paraguay", "Peru", "Philippines", "Poland", "Portugal", "Qatar", "Oman", "Russia", "Rwanda", "St Kitts & Nevis", "St Lucia", "Saint Vincent & The Grenadines", "Samoa", "San Marino", "Sao Tome & Principe", "Saudi Arabia", "Senegal", "Serbia", "Seychelles", "Sierra Leone", "Singapore", "Slovakia", "Slovenia", "Solomon Islands", "Somalia", "South Africa", "South Sudan", "Spain", "Sri Lanka", "Sudan", "Suriname", "Swaziland", "Sweden", "Switzerland", "Syria", "Taiwan", "Tajikistan", "Tanzania", "Thailand", "Togo", "Tonga", "Trinidad & Tobago", "Tunisia", "Turkey", "Turkmenistan", "Tuvalu", "Uganda", "Ukraine", "United Arab Emirates", "United Kingdom", "United States", "Uruguay", "Uzbekistan", "Vanuatu", "Vatican City", "Venezuela", "Vietnam", "Yemen", "Zambia", "Zimbabwe"]
    state_names = ["alaska", "alabama", "arkansas", "american samoa", "arizona", "california", "colorado", "connecticut", "district ", "of columbia", "delaware", "florida", "georgia", "guam", "hawaii", "iowa", "idaho", "illinois", "indiana", "kansas", "kentucky", "louisiana", "massachusetts", "maryland", "maine", "michigan", "minnesota", "missouri", "mississippi", "montana", "north carolina", "north dakota", "nebraska", "new hampshire", "new jersey", "new mexico", "nevada", "new york", "ohio", "oklahoma", "oregon", "pennsylvania", "puerto rico", "rhode island", "south carolina", "south dakota", "tennessee", "texas", "utah", "virginia", "virgin islands", "vermont", "washington", "wisconsin", "west virginia", "wyoming"]
    states = ['AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA', 'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VA', 'VT', 'WA', 'WI', 'WV', 'WY']
    
    for i in s:
        if i["org"] is None:
            arr.append(None)
            break
        sent = re.sub("[^a-zA-Z -]", "", i["org"])
        x = None
        for j in countries:
            x = re.search(j.lower(), sent.lower())
            if x is not None:
                if j.lower() == 'india':
                    x = re.search('indiana', sent.lower())
                    if x is not None:
                        arr.append("United States")
                elif j.lower() == 'georgia':
                    x = re.search('USA', sent)
                    if x is not None:
                        arr.append("United States")
                else:
                    arr.append(j)
                break
        if x is None:
            x = re.search("USA", sent)
            if x is not None:
                arr.append("United States")
                break
        if x is None:
            x = re.search("UK", sent)
            if x is not None:
                arr.append("United Kingdom")
                break
        if x is None:
            x = re.search("england", sent.lower())
            if x is not None:
                arr.append("United Kingdom")
                break
        if x is None:
            x = re.search("scotland", sent.lower())
            if x is not None:
                arr.append("United Kingdom")
                break
        if x is None:
            x = re.search("wales", sent.lower())
            if x is not None:
                arr.append("United Kingdom")
                break
        if x is None:
            for j in states:
                x = re.search(j, sent)
                if x is not None:
                    arr.append("United States")
                    break
        if x is None:
            for j in state_names:
                x = re.search(j, sent.lower())
                if x is not None:
                    arr.append("United States")
                    break
        break
                    
    if len(arr) > 0:
        return arr[0]
    else:
        return None

getCountryUDF = udf(getCountry)

In [0]:
# Organization (affiliation of the first author)
# ID - authors.orgid
# Name - authors.org
# Country - getCountryUDF(F.arrays_zip("authors.org"))
def organization(df):
    new_df = (df
              .withColumn("ID", F.col("authors.orgid").getItem(0))
              .withColumn("Name", F.col("authors.org").getItem(0))
              .dropDuplicates(["ID"])
              .withColumn("Country", getCountryUDF(F.arrays_zip("authors.org")))
              .select("ID", "Name", "Country"))
    
    new_df = new_df.na.drop("all")
    return new_df

def create_orgs_df(_df):
    org_df = organization(_df)
    
    org_cols = ('Name', 'Country')
    new_df = org_df.select('ID', 'Name', 'Country')
    
    _df = _df.withColumn("Org", F.col("authors.orgid").getItem(0))

    return _df, new_df

In [0]:
_df, orgs_df = create_orgs_df(_df)
display(orgs_df.limit(DISPLAY_LIMIT))

ID,Name,Country
5f71b2801c455f439fe3c575,"Chair ANSI X3L1.2, GIS Extensions to SQL",
5f71b2811c455f439fe3c57c,Arizona State University,United States
5f71b2811c455f439fe3c57e,"Adobe Research, Adobe Systems Incorporated, San Francisco, CA",United States
5f71b2811c455f439fe3c58a,ACM,
5f71b2811c455f439fe3c592,"Department of Computer Science, Brown University, Providence, RI",United States
5f71b2811c455f439fe3c599,"The Burroughs, Hendon, Middlsex University, London, UK",United Kingdom
5f71b2811c455f439fe3c59c,"Future Technologies Group, British Telecom Laboratories, MLB1 PP12, Adastral Park, Martlesham Heath, Ipswich, IP5 3RE Suffolk, UK",United Kingdom
5f71b2811c455f439fe3c5a3,"Eshraghian Labs. Pty Ltd, Bentley, WA, Australia",Australia
5f71b2811c455f439fe3c5a5,Chalmers University of Technology (e-mail: koen@cs.chalmers.se),
5f71b2811c455f439fe3c5ab,"Department of Mechanical Engineering, Columbia University New York, NY",United States


In [0]:
display(_df.limit(DISPLAY_LIMIT))

_id,authors,doi,fos,isbn,issue,keywords,lang,n_citation,page_end,page_start,pdf,references,title,url,volume,year,Author_ID,venue_id,Org
53e99784b7602d9701f3f5fe,"List(List(53f46a22dabfaee0d9c3d5e5, null, ysg_2005@hotmail.com, 5b8698cce1cd8e14a3826671, Shuguo Yang, null, null, null, null, School of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, China 266061, null, 5f71b2e91c455f439fe3f23f, null, null, null))",10.1007/s11704-011-0127-6,"List(Virtualization, Service level objective, Virtual machine, Computer science, Testbed, Quality of service, Provisioning, Resource allocation, Web application, Operating system, Distributed computing)",,4.0,"List(resource allocation, cpu utilization, quality of service)",en,2,512.0,506.0,,"List(53e9a073b7602d9702957efa, 53e9ad87b7602d970377bfb5, 53e9be51b7602d9704b11381, 53e9be04b7602d9704abb31d, 53e9992bb7602d9702169236, 53e998cdb7602d97021044db, 53e9afa6b7602d97039f6054, 53e99822b7602d9702044e60)",Research on resource allocation for multi-tier web applications in a virtualization environment,"List(http://dx.doi.org/10.1007/s11704-011-0127-6, http://link.springer.com/article/10.1007/s11704-011-0127-6, http://www.webofknowledge.com/)",5.0,2011,List(53f46a22dabfaee0d9c3d5e5),572de199d39c4f49934b3d5c,5f71b2e91c455f439fe3f23f
53e99792b7602d9701f5af35,"List(List(53f43a51dabfaec22baa659b, null, dedwards@cs.uwf.edu, 5b8695e5e1cd8e14a36f684d, Dennis Edwards, null, null, null, null, Department of Computer Science, University of West Florida, 11000 University Parkway, Pensacola, FL 32514, USA, null, 5f71b2bd1c455f439fe3dea6, List(Department of Computer Science, University of West Florida, 11000 University Parkway, Pensacola, FL 32514, USA), null, null), List(53f3b3ffdabfae4b34b2dae9, null, ssimmons@cs.uwf.edu, 5b8695e5e1cd8e14a36f684d, Sharon Simmons, null, null, null, null, Department of Computer Science, University of West Florida, 11000 University Parkway, Pensacola, FL 32514, USA, null, 5f71b2bd1c455f439fe3dea6, List(Department of Computer Science, University of West Florida, 11000 University Parkway, Pensacola, FL 32514, USA), null, null), List(53f4333fdabfaeb22f451979, null, nwilde@uwf.edu, null, Norman Wilde, null, null, null, null, Corresponding author. Tel.: +1 850 474 2542; fax: +1 850 857 6056., null, null, List(Corresponding author. Tel.: +1 850 474 2542; fax: +1 850 857 6056., Department of Computer Science, University of West Florida, 11000 University Parkway, Pensacola, FL 32514, USA), null, null))",10.1016/j.jss.2004.12.018,"List(Data mining, Causality, End user, Ranking, Computer science, Military systems, Software, Feature model, Component-based software engineering, A-weighting, Distributed computing)",,1.0,"List(Feature location, Distributed systems, Software reconnaissance)",en,62,68.0,57.0,//static.aminer.org/pdf/PDF/000/996/035/an_approach_to_feature_location_in_distributed_systems.pdf,"List(53e9b6eeb7602d970427df40, 53e9b6eeb7602d9704283b9f, 53e9b40eb7602d9703f01b25, 53e9a3c0b7602d9702ccdfc9, 53e99818b7602d97020347a2, 53e9a2acb7602d9702bb4d7e, 558aa7ea84ae84d265bee194, 558a5258e4b037c08756714c, 53e9b946b7602d97045336a9, 53e9b1d6b7602d9703c67695, 53e9a516b7602d9702e3bcea, 53e9ac33b7602d97035f892c, 53e9ba22b7602d9704628817, 53e9af3ab7602d97039769c8, 53e9b1a3b7602d9703c2c6f7, 53e9ac89b7602d9703660f90, 53e9ad2db7602d970370e8a2, 53e9a735b7602d970306db2b, 53e99960b7602d97021a17da)",An approach to feature location in distributed systems,"List(http://dx.doi.org/10.1016/j.jss.2004.12.018, https://www.sciencedirect.com/science/article/pii/S016412120500004X, http://www.webofknowledge.com/)",79.0,2006,"List(53f43a51dabfaec22baa659b, 53f3b3ffdabfae4b34b2dae9, 53f4333fdabfaeb22f451979)",54825226582fc50b5e05610e,5f71b2bd1c455f439fe3dea6
53e99792b7602d9701f5b0ed,"List(List(542a6734dabfae646d55cc87, null, gaoyibo@gmail.com, 5b86cb35e1cd8e14a3e00691, Kun Gao, null, null, null, null, Computer Science and Information Technology College, Zhejiang Wanli University, China, null, 5f71b6101c455f439fe555a5, null, null, null))",10.1007/978-3-540-85565-1_38,"List(Data mining, Data stream mining, Grid computing, Data-intensive computing, Computer science, Directed acyclic graph, Semantic grid, Knowledge extraction, Business process discovery, Grid, Distributed computing)",,,"List(knowledge discovery grid, dynamic grid environment, high performance data mining, parallel optimization method, grid feature, high performance ddm application, data mining grid, decomposing data mining application, data mining parallelization, data intensive computing problem, computational grid environment, data mining application, knowledge discovery, distributed computing, data mining, data intensive computing, parallelization, grid computing, directed acyclic graph)",en,4,312.0,306.0,,"List(53e99cbbb7602d970256c4af, 53e9b092b7602d9703afec3f, 53e9b089b7602d9703af06a7, 53e9ae9cb7602d97038c0529, 53e9b5f3b7602d9704144753, 558aac78e4b0b32fcb3831b7, 53e9bc3bb7602d97048b1ed7, 53e9b7b4b7602d970435c58f, 53e9b672b7602d97041db0b8, 53e9b96eb7602d970455d792, 53e9ae97b7602d97038ba7dd, 53e9adf7b7602d97038042d9, 53e9bd6ab7602d9704a0644e)",A Uniform Parallel Optimization Method for Knowledge Discovery Grid,"List(http://dx.doi.org/10.1007/978-3-540-85565-1_38, http://www.webofknowledge.com/)",5178.0,2008,List(542a6734dabfae646d55cc87),53a727f720f7420be8ba3092,5f71b6101c455f439fe555a5
53e99792b7602d9701f5b119,"List(List(5630ff9645cedb3399c3ca55, null, L.yang@lboro.ac.uk, null, Lili Yang, null, null, null, null, Business School, Loughborough University, Loughborough, Leicestershire LE11 3TU, UK, null, 5f71b29c1c455f439fe3d0d7, List(Business School, Loughborough University, Loughborough, Leicestershire LE11 3TU, UK, Corresponding author. Tel.: +44 1509 223130; fax: +44 1509 223961.), null, null), List(53f4371cdabfaec22ba8766f, null, null, null, Bryan F. Jones, null, null, null, null, Applied Computing Department, The University of Derby, Derby DE22 1GB, UK, null, 5f71b2ba1c455f439fe3dd51, List(Applied Computing Department, The University of Derby, Derby DE22 1GB, UK), null, null), List(54867430dabfae9b40133dc3, null, null, null, Shuanghua Yang, null, null, null, null, Computer Science Department, Loughborough University, Loughborough, Leicestershire LE11 3TU, UK, null, 5f71b29c1c455f439fe3d0d7, List(Computer Science Department, Loughborough University, Loughborough, Leicestershire LE11 3TU, UK), null, null))",10.1016/j.ejor.2006.07.003,"List(Objective programming, Fire protection, Computer science, Fuzzy logic, Operations research, Fire risk, Genetic algorithm, Decision maker)",,2.0,"List(Location, Fire stations, Multi-objective programming, Genetic algorithm, Fuzzy programming)",en,140,915.0,903.0,,"List(53e9bd50b7602d97049e3238, 573695926e3b12023e4b23a9, 53e9af87b7602d97039cdb54, 53e99876b7602d97020b2192, 53e9bbf5b7602d9704850823, 5c7965584895d9cbc6430efd, 53e9ab42b7602d97034d404d, 5b67376bab2dfb7a20258e8e)",A fuzzy multi-objective programming for optimization of fire station locations through genetic algorithms,"List(http://dx.doi.org/10.1016/j.ejor.2006.07.003, https://www.sciencedirect.com/science/article/pii/S037722170600467X)",181.0,2007,"List(5630ff9645cedb3399c3ca55, 53f4371cdabfaec22ba8766f, 54867430dabfae9b40133dc3)",0377-2217,5f71b29c1c455f439fe3d0d7
53e99792b7602d9701f5b140,"List(List(56017d4445cedb3395e638f7, null, null, 5b86cf81e1cd8e14a3ff7abc, Bielikova, M., null, null, null, null, Dept. of Comput. Sci. & Eng., Slovak Tech. Univ., Bratislava, Slovakia|c|, null, 5f71b2f61c455f439fe3f847, null, null, null))",10.1109/ICALT.2001.943897,"List(Metadata, Programming language, Information retrieval, XML, Computer science, Adaptive system, Hypermedia, Software prototyping, Context model, Software, Presentation logic)",0-7695-1013-2,,"List(evolving information, database, educational module information, educational module, extensible markup language, meta-information, xml, adaptive presentation, proposed framework, administrative information, user view, software prototyping, educational administrative data processing, presentation time, hypermedia markup languages, time view, meta data, adaptive systems, hypermedia, software prototype, computer science, solids, context modeling, databases, html)",en,10,196.0,193.0,//static.aminer.org/pdf/PDF/000/269/528/adaptive_presentation_of_evolving_information_using_xml.pdf,"List(53e9a2c8b7602d9702bd1d38, 53e9b304b7602d9703dc9a0c, 53e9ae84b7602d97038a3a28, 53e99a5cb7602d97022c416d)",Adaptive presentation of evolving information using XML,"List(http://dx.doi.org/10.1109/ICALT.2001.943897, http://ieeexplore.ieee.org/xpl/abstractAuthors.jsp?tp=&arnumber=943897)",,2001,List(56017d4445cedb3395e638f7),5550376d7cea80f9541873d5,5f71b2f61c455f439fe3f847
53e99792b7602d9701f5b19a,"List(List(54301e81dabfaeca69bca10d, null, null, 5b86919fe1cd8e14a353a903, Lajos Horváth, null, null, null, 0000-0001-8594-4972, Department of Mathematics, University of Utah, 155 South 1440 East, Salt Lake City, UT 84112-0090, USA, null, 5f71b2841c455f439fe3c6c8, List(Department of Mathematics, University of Utah, 155 South 1440 East, Salt Lake City, UT 84112-0090, USA), null, null), List(53f4647ddabfaee02ad8cbb6, null, null, 5b86ae49e1cd8e14a31057a0, Marie Hušková, null, null, null, 0000-0002-6868-1362, Department of Statistics, Charles University, Sokolovská 83, CZ-18600 Praha, Czech Republic, null, 5f71b2851c455f439fe3c70f, List(Department of Statistics, Charles University, Sokolovská 83, CZ-18600 Praha, Czech Republic), null, null), List(5448d49ddabfae87b7e82000, null, Piotr.Kokoszka@usu.edu, 5b8689c1e1cd8e14a3214e89, Piotr Kokoszka, null, null, null, null, Department of Mathematics and Statistics, Utah State University, 3900 Old Main Hill, Logan, UT 84322-3900, USA, null, 5f71b2961c455f439fe3ce40, List(Department of Mathematics and Statistics, Utah State University, 3900 Old Main Hill, Logan, UT 84322-3900, USA, Corresponding author.), null, null))",10.1016/j.jmva.2008.12.008,"List(Functional principal component analysis, Time series, Truncation, Autoregressive model, Applied mathematics, Test statistic, Credit card, Calculus, Statistical hypothesis testing, Mathematics, Asymptotic distribution)",,2.0,"List(asymptotic justification, observations x, equation x, functional principal component analysis, test statistic, credit card transaction data, principal component, functional autoregressive process, functional time series data, change-point, functional time series analysis, 62m10, time series data, autoregressive process, transaction data, asymptotic distribution, time series analysis)",en,62,367.0,352.0,,"List(53e9ace1b7602d97036bc507, 53e999ffb7602d970224c172)",Testing the stability of the functional autoregressive process,"List(http://dx.doi.org/10.1016/j.jmva.2008.12.008, https://dl.acm.org/doi/10.1016/j.jmva.2008.12.008, https://www.sciencedirect.com/science/article/pii/S0047259X08002789, http://www.webofknowledge.com/)",101.0,2010,"List(54301e81dabfaeca69bca10d, 53f4647ddabfaee02ad8cbb6, 5448d49ddabfae87b7e82000)",555036db7cea80f9541603d7,5f71b2841c455f439fe3c6c8
53e99792b7602d9701f5b1ba,"List(List(53f42c98dabfaeb22f3fc92d, null, null, 5b869b48e1cd8e14a393cf7b, V. William Porto, null, null, null, null, Natural Selection, Inc., La Jolla, CA, null, 5f71b57c1c455f439fe515f1, null, null, null), List(53f64e31dabfae6a71b6029f, null, null, 5b869b48e1cd8e14a393cf7b, David B. Fogel, null, null, null, null, Natural Selection, Inc., La Jolla, CA, null, 5f71b57c1c455f439fe515f1, null, null, null), List(53f47307dabfaee02adc6671, null, null, 5b869b48e1cd8e14a393cf7b, Lawrence J. Fogel, null, null, null, null, Natural Selection, Inc., La Jolla, CA, null, 5f71b57c1c455f439fe515f1, null, null, null))",10.1145/1056754.1056755,"List(Interactive evolutionary computation, Human-based evolutionary computation, Evolutionary algorithm, Computer science, Evolutionary computation, Evolution strategy, Artificial intelligence, Evolutionary programming, Evolutionary music, Java Evolutionary Computation Toolkit)",,2.0,"List(adaptive behavior, platoon-level engagement, current knowledge, generating novel tactic, appropriate tactic, simulated evolution, evolutionary computation, projected situation, higher level organization, individual unit, evolutionary algorithm, real time, evolutionary computing)",en,5,14.0,8.0,,"List(53e9bc80b7602d97048fe74a, 53e9bb93b7602d97047d88e5, 53e99d28b7602d97025dc235, 53e99842b7602d970206de90, 558a8260e4b031bae1f80ed5)",Generating novel tactics through evolutionary computation,"List(http://dx.doi.org/10.1145/1056754.1056755, http://doi.acm.org/10.1145/1056754.1056755)",9.0,1998,"List(53f42c98dabfaeb22f3fc92d, 53f64e31dabfae6a71b6029f, 53f47307dabfaee02adc6671)",53a7310820f7420be8d1bc69,5f71b57c1c455f439fe515f1
53e99792b7602d9701f5b1e7,"List(List(53f43685dabfaec09f17df79, null, null, 5b86b6bfe1cd8e14a34c9318, Katie L. Hoffman, null, null, null, null, harvard university, null, 5f71b4501c455f439fe491ff, null, null, null), List(54069dffdabfae44f084828e, null, null, 5b86b6bfe1cd8e14a34c9318, Robert J. Wood, null, null, null, null, harvard university, null, 5f71b4501c455f439fe491ff, null, null, null))",10.1109/IROS.2013.6696543,"List(Gait, Longitudinal static stability, Simulation, Control theory, Catastrophic failure, Robustness (computer science), Gait analysis, Fault tolerance, Redundancy (engineering), Engineering, Robot)",,,"List(gait analysis, legged locomotion, stability, centipede-inspired millirobot locomotion, curvature radius, curvature speed, gait options, leg failures, mechanical redundancy, miniature ambulatory robots, performance metrics, robustness enhancement, static stability retention)",en,8,1479.0,1472.0,https://static.aminer.cn/upload/pdf/program/53e99792b7602d9701f5b1e7_0.pdf,"List(53e9aa61b7602d97033d7457, 53e9b464b7602d9703f64f1b, 53e998fcb7602d9702138f37, 53e9bd0ab7602d9704990fd9, 558b092284ae84d265c11df2, 53e99f8db7602d9702865bbe, 558b6afae4b037c0875cc785, 53e9992ab7602d9702163a36, 53e9af26b7602d970396322a, 53e9b2f4b7602d9703db0617, 53e99d6bb7602d9702623452, 558a4e4684ae84d265bcd17f, 53e9b12ab7602d9703bad72e)",Robustness of centipede-inspired millirobot locomotion to leg failures.,"List(http://dx.doi.org/10.1109/IROS.2013.6696543, http://ieeexplore.ieee.org/xpl/abstractAuthors.jsp?tp=&arnumber=6696543)",,2013,"List(53f43685dabfaec09f17df79, 54069dffdabfae44f084828e)",555037837cea80f95418b43e,5f71b4501c455f439fe491ff
53e99792b7602d9701f5b2b3,"List(List(53f427b6dabfaec09f0d9c8a, null, null, 5b86b6c7e1cd8e14a34ccbac, Bruno Van Den Bossche, null, null, null, null, Ghent University, Gent, Belgium, null, 5f71b2961c455f439fe3ce44, null, null, null), List(53f4296cdabfaec22b9e434b, null, null, 5b86b6c7e1cd8e14a34ccbac, Tom Verdickt, null, null, null, null, Ghent University, Gent, Belgium, null, 5f71b2961c455f439fe3ce44, null, null, null), List(53f3839ddabfae4b34a04de3, null, null, 5b86b6c7e1cd8e14a34ccbac, Bart De Vleeschauwer, null, null, null, null, Ghent University, Gent, Belgium, null, 5f71b2961c455f439fe3ce44, null, null, null), List(53f42f58dabfaee43ebde3dc, null, null, 5b86b6c7e1cd8e14a34ccbac, Stein Desmet, null, null, null, null, Ghent University, Gent, Belgium, null, 5f71b2961c455f439fe3ce44, null, null, null), List(53f455dedabfaee0d9bf1e22, null, null, 5b86b6c7e1cd8e14a34ccbac, Stijn De Mulder, null, null, null, null, Ghent University, Gent, Belgium, null, 5f71b2961c455f439fe3ce44, null, null, null), List(543501afdabfaebba588d847, null, null, 5b86b6c7e1cd8e14a34ccbac, Filip De Turck, null, null, null, null, Ghent University, Gent, Belgium, null, 5f71b2961c455f439fe3ce44, null, null, null), List(5630a35645cedb3399af4a03, null, null, 5b86b6c7e1cd8e14a34ccbac, Bart Dhoedt, null, null, null, null, Ghent University, Gent, Belgium, null, 5f71b2961c455f439fe3ce44, null, null, null), List(53f56636dabfae6293f8045b, null, null, 5b86b6c7e1cd8e14a34ccbac, Piet Demeester, null, null, null, null, Ghent University, Gent, Belgium, null, 5f71b2961c455f439fe3ce44, null, null, null))",10.1145/1378191.1378195,"List(Server-side, Division (mathematics), Computer science, Load balancing (computing), Popularity, Microcell, Software architecture, Virtual world, Distributed computing)",1-59593-285-2,,"List(virtual world, dynamic microcell redeployment, evaluation result, massively multiplayer online games, server side, load distribution, multiplayer online game, efficient software architecture, software architecture, huge popularity, virtual worlds, load balancing, load balance)",en,24,,,https://static.aminer.cn/upload/pdf/program/53e99792b7602d9701f5b2b3_0.pdf,"List(558a424ae4b0b32fcb35c0ae, 53e9b648b7602d97041a6427, 53e9b8a1b7602d970447c694, 53e99df1b7602d97026b383b)",A platform for dynamic microcell redeployment in massively multiplayer online games,"List(http://dx.doi.org/10.1145/1378191.1378195, http://doi.acm.org/10.1145/1378191.1378195)",,2006,"List(53f427b6dabfaec09f0d9c8a, 53f4296cdabfaec22b9e434b, 53f3839ddabfae4b34a04de3, 53f42f58dabfaee43ebde3dc, 53f455dedabfaee0d9bf1e22, 543501afdabfaebba588d847, 5630a35645cedb3399af4a03, 53f56636dabfae6293f8045b)",555037227cea80f95417540f,5f71b2961c455f439fe3ce44
53e99792b7602d9701f5b2bc,"List(List(54096bf9dabfae8faa68e261, null, lanwang@memphis.edu, 5b86c4f3e1cd8e14a3b2badb, Lan Wang, null, null, null, null, University of Memphis, Computer Science Department, 318 Dunn Hall, Memphis, TN 38152, United States and Colorado State University, Computer Science Department, Fort Collins, CO 80523, United State ..., null, 5f71b2aa1c455f439fe3d5c6, List(Corresponding author. Address: University of Memphis, Computer Science Department, 318 Dunn Hall, Memphis, TN 38152, United States. Tel.: +1 901 678 2727., University of Memphis, Computer Science Department, 318 Dunn Hall, Memphis, TN 38152, United States, Colorado State University, Computer Science Department, Fort Collins, CO 80523, United States, University of California, Los Angeles, Computer Science Department, Los Angeles, CA 90095, United States), null, null), List(56017d4c45cedb3395e639a3, null, massey@cs.colostate.edu, 5b86c4f3e1cd8e14a3b2badb, Daniel Massey, null, null, null, null, University of Memphis, Computer Science Department, 318 Dunn Hall, Memphis, TN 38152, United States and Colorado State University, Computer Science Department, Fort Collins, CO 80523, United State ..., null, 5f71b2aa1c455f439fe3d5c6, List(University of Memphis, Computer Science Department, 318 Dunn Hall, Memphis, TN 38152, United States, Colorado State University, Computer Science Department, Fort Collins, CO 80523, United States, University of California, Los Angeles, Computer Science Department, Los Angeles, CA 90095, United States), null, null), List(560aae3445cedb33971711c9, null, lixia@cs.ucla.edu, 5b86c4f3e1cd8e14a3b2badb, Lixia Zhang, null, null, null, null, University of Memphis, Computer Science Department, 318 Dunn Hall, Memphis, TN 38152, United States and Colorado State University, Computer Science Department, Fort Collins, CO 80523, United State ..., null, 5f71b2aa1c455f439fe3d5c6, List(University of Memphis, Computer Science Department, 318 Dunn Hall, Memphis, TN 38152, United States, Colorado State University, Computer Science Department, Fort Collins, CO 80523, United States, University of California, Los Angeles, Computer Science Department, Los Angeles, CA 90095, United States), null, null))",10.1016/j.comnet.2006.07.015,"List(Convergence (routing), Default-free zone, Computer science, Soft state, Computer network, Routing domain, Border Gateway Protocol, Routing table, Communications protocol, Routing protocol)",,6.0,"List(Reliability, Fault tolerance, Network state management, Soft-state, State consistency, Refresh overhead)",en,6,1458.0,1444.0,,"List(53e9984fb7602d9702081493, 53e9ab37b7602d97034c376f, 557f1370d19faf961d16e9f2, 53e9bcefb7602d970497580f, 53e99c04b7602d97024ae6fe, 53e9abd4b7602d970358f037, 53e9a508b7602d9702e2c226, 53e99937b7602d970217392b, 558a65c5e4b031bae1f764e6, 53e99d0cb7602d97025c39df, 53e99b51b7602d97023f58dc, 53e99931b7602d970216db73, 53e999e0b7602d9702223ef4, 53e9bac1b7602d97046eef9f, 53e9a9d9b7602d970333a6aa)",Persistent detection and recovery of state inconsistencies,"List(http://dx.doi.org/10.1016/j.comnet.2006.07.015, http://dx.doi.org/https://doi.org/10.1016/j.comnet.2006.07.015, https://www.sciencedirect.com/science/article/pii/S1389128606002088)",51.0,2007,"List(54096bf9dabfae8faa68e261, 56017d4c45cedb3395e639a3, 560aae3445cedb33971711c9)",53907df520f770854f6106bd,5f71b2aa1c455f439fe3d5c6


#### 3.4. DBLP fact table

In [0]:
def create_dblp_df(_df):
    dblp_df = _df.select('_id','venue_id','Org','Author_ID','references','keywords','fos','title','n_citation','lang','page_start','page_end','doi','isbn','year','volume','issue')
    return _df, dblp_df.toDF('ID','Venue','Org','Authors','References','Keywords','FOS','Title','NoCitations','Lang','PageStart','PageEnd','DOI','ISBN','Year','Volume','Issue')
_df, dblp_df = create_dblp_df(_df)

In [0]:
display(dblp_df.limit(DISPLAY_LIMIT))

ID,Venue,Org,Authors,References,Keywords,FOS,Title,NoCitations,Lang,PageStart,PageEnd,DOI,ISBN,Year,Volume,Issue
53e99784b7602d9701f3f5fe,572de199d39c4f49934b3d5c,5f71b2e91c455f439fe3f23f,List(53f46a22dabfaee0d9c3d5e5),"List(53e9a073b7602d9702957efa, 53e9ad87b7602d970377bfb5, 53e9be51b7602d9704b11381, 53e9be04b7602d9704abb31d, 53e9992bb7602d9702169236, 53e998cdb7602d97021044db, 53e9afa6b7602d97039f6054, 53e99822b7602d9702044e60)","List(resource allocation, cpu utilization, quality of service)","List(Virtualization, Service level objective, Virtual machine, Computer science, Testbed, Quality of service, Provisioning, Resource allocation, Web application, Operating system, Distributed computing)",Research on resource allocation for multi-tier web applications in a virtualization environment,2,en,506.0,512.0,10.1007/s11704-011-0127-6,,2011,5.0,4.0
53e99792b7602d9701f5af35,54825226582fc50b5e05610e,5f71b2bd1c455f439fe3dea6,"List(53f43a51dabfaec22baa659b, 53f3b3ffdabfae4b34b2dae9, 53f4333fdabfaeb22f451979)","List(53e9b6eeb7602d970427df40, 53e9b6eeb7602d9704283b9f, 53e9b40eb7602d9703f01b25, 53e9a3c0b7602d9702ccdfc9, 53e99818b7602d97020347a2, 53e9a2acb7602d9702bb4d7e, 558aa7ea84ae84d265bee194, 558a5258e4b037c08756714c, 53e9b946b7602d97045336a9, 53e9b1d6b7602d9703c67695, 53e9a516b7602d9702e3bcea, 53e9ac33b7602d97035f892c, 53e9ba22b7602d9704628817, 53e9af3ab7602d97039769c8, 53e9b1a3b7602d9703c2c6f7, 53e9ac89b7602d9703660f90, 53e9ad2db7602d970370e8a2, 53e9a735b7602d970306db2b, 53e99960b7602d97021a17da)","List(Feature location, Distributed systems, Software reconnaissance)","List(Data mining, Causality, End user, Ranking, Computer science, Military systems, Software, Feature model, Component-based software engineering, A-weighting, Distributed computing)",An approach to feature location in distributed systems,62,en,57.0,68.0,10.1016/j.jss.2004.12.018,,2006,79.0,1.0
53e99792b7602d9701f5b0ed,53a727f720f7420be8ba3092,5f71b6101c455f439fe555a5,List(542a6734dabfae646d55cc87),"List(53e99cbbb7602d970256c4af, 53e9b092b7602d9703afec3f, 53e9b089b7602d9703af06a7, 53e9ae9cb7602d97038c0529, 53e9b5f3b7602d9704144753, 558aac78e4b0b32fcb3831b7, 53e9bc3bb7602d97048b1ed7, 53e9b7b4b7602d970435c58f, 53e9b672b7602d97041db0b8, 53e9b96eb7602d970455d792, 53e9ae97b7602d97038ba7dd, 53e9adf7b7602d97038042d9, 53e9bd6ab7602d9704a0644e)","List(knowledge discovery grid, dynamic grid environment, high performance data mining, parallel optimization method, grid feature, high performance ddm application, data mining grid, decomposing data mining application, data mining parallelization, data intensive computing problem, computational grid environment, data mining application, knowledge discovery, distributed computing, data mining, data intensive computing, parallelization, grid computing, directed acyclic graph)","List(Data mining, Data stream mining, Grid computing, Data-intensive computing, Computer science, Directed acyclic graph, Semantic grid, Knowledge extraction, Business process discovery, Grid, Distributed computing)",A Uniform Parallel Optimization Method for Knowledge Discovery Grid,4,en,306.0,312.0,10.1007/978-3-540-85565-1_38,,2008,5178.0,
53e99792b7602d9701f5b119,0377-2217,5f71b29c1c455f439fe3d0d7,"List(5630ff9645cedb3399c3ca55, 53f4371cdabfaec22ba8766f, 54867430dabfae9b40133dc3)","List(53e9bd50b7602d97049e3238, 573695926e3b12023e4b23a9, 53e9af87b7602d97039cdb54, 53e99876b7602d97020b2192, 53e9bbf5b7602d9704850823, 5c7965584895d9cbc6430efd, 53e9ab42b7602d97034d404d, 5b67376bab2dfb7a20258e8e)","List(Location, Fire stations, Multi-objective programming, Genetic algorithm, Fuzzy programming)","List(Objective programming, Fire protection, Computer science, Fuzzy logic, Operations research, Fire risk, Genetic algorithm, Decision maker)",A fuzzy multi-objective programming for optimization of fire station locations through genetic algorithms,140,en,903.0,915.0,10.1016/j.ejor.2006.07.003,,2007,181.0,2.0
53e99792b7602d9701f5b140,5550376d7cea80f9541873d5,5f71b2f61c455f439fe3f847,List(56017d4445cedb3395e638f7),"List(53e9a2c8b7602d9702bd1d38, 53e9b304b7602d9703dc9a0c, 53e9ae84b7602d97038a3a28, 53e99a5cb7602d97022c416d)","List(evolving information, database, educational module information, educational module, extensible markup language, meta-information, xml, adaptive presentation, proposed framework, administrative information, user view, software prototyping, educational administrative data processing, presentation time, hypermedia markup languages, time view, meta data, adaptive systems, hypermedia, software prototype, computer science, solids, context modeling, databases, html)","List(Metadata, Programming language, Information retrieval, XML, Computer science, Adaptive system, Hypermedia, Software prototyping, Context model, Software, Presentation logic)",Adaptive presentation of evolving information using XML,10,en,193.0,196.0,10.1109/ICALT.2001.943897,0-7695-1013-2,2001,,
53e99792b7602d9701f5b19a,555036db7cea80f9541603d7,5f71b2841c455f439fe3c6c8,"List(54301e81dabfaeca69bca10d, 53f4647ddabfaee02ad8cbb6, 5448d49ddabfae87b7e82000)","List(53e9ace1b7602d97036bc507, 53e999ffb7602d970224c172)","List(asymptotic justification, observations x, equation x, functional principal component analysis, test statistic, credit card transaction data, principal component, functional autoregressive process, functional time series data, change-point, functional time series analysis, 62m10, time series data, autoregressive process, transaction data, asymptotic distribution, time series analysis)","List(Functional principal component analysis, Time series, Truncation, Autoregressive model, Applied mathematics, Test statistic, Credit card, Calculus, Statistical hypothesis testing, Mathematics, Asymptotic distribution)",Testing the stability of the functional autoregressive process,62,en,352.0,367.0,10.1016/j.jmva.2008.12.008,,2010,101.0,2.0
53e99792b7602d9701f5b1ba,53a7310820f7420be8d1bc69,5f71b57c1c455f439fe515f1,"List(53f42c98dabfaeb22f3fc92d, 53f64e31dabfae6a71b6029f, 53f47307dabfaee02adc6671)","List(53e9bc80b7602d97048fe74a, 53e9bb93b7602d97047d88e5, 53e99d28b7602d97025dc235, 53e99842b7602d970206de90, 558a8260e4b031bae1f80ed5)","List(adaptive behavior, platoon-level engagement, current knowledge, generating novel tactic, appropriate tactic, simulated evolution, evolutionary computation, projected situation, higher level organization, individual unit, evolutionary algorithm, real time, evolutionary computing)","List(Interactive evolutionary computation, Human-based evolutionary computation, Evolutionary algorithm, Computer science, Evolutionary computation, Evolution strategy, Artificial intelligence, Evolutionary programming, Evolutionary music, Java Evolutionary Computation Toolkit)",Generating novel tactics through evolutionary computation,5,en,8.0,14.0,10.1145/1056754.1056755,,1998,9.0,2.0
53e99792b7602d9701f5b1e7,555037837cea80f95418b43e,5f71b4501c455f439fe491ff,"List(53f43685dabfaec09f17df79, 54069dffdabfae44f084828e)","List(53e9aa61b7602d97033d7457, 53e9b464b7602d9703f64f1b, 53e998fcb7602d9702138f37, 53e9bd0ab7602d9704990fd9, 558b092284ae84d265c11df2, 53e99f8db7602d9702865bbe, 558b6afae4b037c0875cc785, 53e9992ab7602d9702163a36, 53e9af26b7602d970396322a, 53e9b2f4b7602d9703db0617, 53e99d6bb7602d9702623452, 558a4e4684ae84d265bcd17f, 53e9b12ab7602d9703bad72e)","List(gait analysis, legged locomotion, stability, centipede-inspired millirobot locomotion, curvature radius, curvature speed, gait options, leg failures, mechanical redundancy, miniature ambulatory robots, performance metrics, robustness enhancement, static stability retention)","List(Gait, Longitudinal static stability, Simulation, Control theory, Catastrophic failure, Robustness (computer science), Gait analysis, Fault tolerance, Redundancy (engineering), Engineering, Robot)",Robustness of centipede-inspired millirobot locomotion to leg failures.,8,en,1472.0,1479.0,10.1109/IROS.2013.6696543,,2013,,
53e99792b7602d9701f5b2b3,555037227cea80f95417540f,5f71b2961c455f439fe3ce44,"List(53f427b6dabfaec09f0d9c8a, 53f4296cdabfaec22b9e434b, 53f3839ddabfae4b34a04de3, 53f42f58dabfaee43ebde3dc, 53f455dedabfaee0d9bf1e22, 543501afdabfaebba588d847, 5630a35645cedb3399af4a03, 53f56636dabfae6293f8045b)","List(558a424ae4b0b32fcb35c0ae, 53e9b648b7602d97041a6427, 53e9b8a1b7602d970447c694, 53e99df1b7602d97026b383b)","List(virtual world, dynamic microcell redeployment, evaluation result, massively multiplayer online games, server side, load distribution, multiplayer online game, efficient software architecture, software architecture, huge popularity, virtual worlds, load balancing, load balance)","List(Server-side, Division (mathematics), Computer science, Load balancing (computing), Popularity, Microcell, Software architecture, Virtual world, Distributed computing)",A platform for dynamic microcell redeployment in massively multiplayer online games,24,en,,,10.1145/1378191.1378195,1-59593-285-2,2006,,
53e99792b7602d9701f5b2bc,53907df520f770854f6106bd,5f71b2aa1c455f439fe3d5c6,"List(54096bf9dabfae8faa68e261, 56017d4c45cedb3395e639a3, 560aae3445cedb33971711c9)","List(53e9984fb7602d9702081493, 53e9ab37b7602d97034c376f, 557f1370d19faf961d16e9f2, 53e9bcefb7602d970497580f, 53e99c04b7602d97024ae6fe, 53e9abd4b7602d970358f037, 53e9a508b7602d9702e2c226, 53e99937b7602d970217392b, 558a65c5e4b031bae1f764e6, 53e99d0cb7602d97025c39df, 53e99b51b7602d97023f58dc, 53e99931b7602d970216db73, 53e999e0b7602d9702223ef4, 53e9bac1b7602d97046eef9f, 53e9a9d9b7602d970333a6aa)","List(Reliability, Fault tolerance, Network state management, Soft-state, State consistency, Refresh overhead)","List(Convergence (routing), Default-free zone, Computer science, Soft state, Computer network, Routing domain, Border Gateway Protocol, Routing table, Communications protocol, Routing protocol)",Persistent detection and recovery of state inconsistencies,6,en,1444.0,1458.0,10.1016/j.comnet.2006.07.015,,2007,51.0,6.0


In [0]:
dblp_df.printSchema()

root
 |-- ID: string (nullable = true)
 |-- Venue: string (nullable = true)
 |-- Org: string (nullable = true)
 |-- Authors: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- References: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- Keywords: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- FOS: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- Title: string (nullable = true)
 |-- NoCitations: integer (nullable = true)
 |-- Lang: string (nullable = true)
 |-- PageStart: integer (nullable = true)
 |-- PageEnd: integer (nullable = true)
 |-- DOI: string (nullable = true)
 |-- ISBN: string (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Volume: integer (nullable = true)
 |-- Issue: integer (nullable = true)



### 4. Load DFs as Delta tables

In [0]:
# DBLP fact table
dblp_df.write.format('delta').mode('overwrite').saveAsTable('dblp_fact_table')
dblp_table = DeltaTable.forName(spark, 'dblp_fact_table')

In [0]:
# Venue table
venues_df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable("venues")
venue_table = DeltaTable.forName(spark, 'venues')

In [0]:
# Author table
authors_df.write.format('delta').mode('overwrite').saveAsTable('authors')
author_table = DeltaTable.forName(spark, 'authors')

In [0]:
# Organization
orgs_df.write.format('delta').mode('overwrite').saveAsTable('orgs')
org_table = DeltaTable.forName(spark, 'orgs')

### 5. Incremental updates
For simulating updates, we will read some more parquet files into a dataframe and transform them as we did above. Then, stream new entries from the dataframe to the existing Delta tables.

In [0]:
logger.info(f'Row counts before streaming:\n\tDBLP fact table: {dblp_table.toDF().count()}\n\tAuthor table: {author_table.toDF().count()}\n\tVenue table: {venue_table.toDF().count()}\n\tOrganization table: {org_table.toDF().count()}')

INFO:__main__:Row counts before streaming:
	DBLP fact table: 77009
	Author table: 144390
	Venue table: 8418
	Organization table: 5370


In [0]:
from pyspark.sql.types import StructType,StructField, StringType, LongType, ArrayType

# Define the schema for incoming data
schema = StructType([
    StructField("_id", StringType(), True),
    StructField("abstract", StringType(), True),
    StructField("authors",
                ArrayType(
                    StructType([
                        StructField("_id", StringType(), True),
                        StructField("bio", StringType(), True),
                        StructField("email", StringType(), True),
                        StructField("gid", StringType(), True),
                        StructField("name", StringType(), True),
                        StructField("name_zh", StringType(), True),
                        StructField("oid", StringType(), True),
                        StructField("oid_zh", StringType(), True),
                        StructField("orcid", StringType(), True),
                        StructField("org", StringType(), True),
                        StructField("org_zh", StringType(), True),
                        StructField("orgid", StringType(), True),
                        StructField("orgs", ArrayType(StringType(), True), True),
                        StructField("orgs_zh", ArrayType(StringType(), True), True),
                        StructField("sid", StringType(), True)
                    ]),
                    True
                ),
                True
                ),
    StructField("doi", StringType(), True),
    StructField("fos", ArrayType(StringType(), True), True),
    StructField("isbn", StringType(), True),
    StructField("issn", StringType(), True),
    StructField("issue", StringType(), True),
    StructField("keywords", ArrayType(StringType(), True), True),
    StructField("lang", StringType(), True),
    StructField("n_citation", LongType(), True),
    StructField("page_end", StringType(), True),
    StructField("page_start", StringType(), True),
    StructField("pdf", StringType(), True),
    StructField("references", ArrayType(StringType(), True), True),
    StructField("title", StringType(), True),
    StructField("url", ArrayType(StringType(), True), True),
    StructField("venue",
                StructType([
                    StructField("_id", StringType(), True),
                    StructField("issn", StringType(), True),
                    StructField("name", StringType(), True),
                    StructField("name_d", StringType(), True),
                    StructField("name_s", StringType(), True),
                    StructField("online_issn", StringType(), True),
                    StructField("publisher", StringType(), True),
                    StructField("raw", StringType(), True),
                    StructField("raw_zh", StringType(), True),
                    StructField("sid", StringType(), True),
                    StructField("src", StringType(), True),
                    StructField("t", StringType(), True),
                    StructField("type", LongType(), True)
                ]),
                True
                ),
    StructField("volume", StringType(), True),
    StructField("year", LongType(), True)
])

# DF of incoming data
# Read some uncleaned splits
streaming_df = (spark.readStream
                .schema(schema)
                .option("maxFilesPerTrigger", 1)
                .parquet('dbfs:/user/dblpv13/dblpv13.{5,6}.parquet') # Stream two splits: 5 and 6. Note: NO SPACE BETWEEN THE NUMBERS!              
)

# Clean the incoming data
streaming_df = transform(streaming_df)

# Perform this on each batch of incoming data
def update_tables(batch_df, batch_id):
    
    # From the incoming DF, make the same DFs that are in our warehouse
    _df, venues_df = create_venues_df(batch_df)
    _df, authors_df = create_authors_df(_df)
    _df, orgs_df = create_orgs_df(_df)
    _df, dblp_df = create_dblp_df(_df)
    
    # For each existing Delta table and new DF
    for table, df in (
        (venue_table, venues_df), # The *_table variables refer to previously created Delta tables in our warehouse
        (author_table, authors_df), 
        (org_table, orgs_df),
        (dblp_table, dblp_df)
    ):
        # Upsert each Delta table with data from the DF
        (table
         .alias("t")
         .merge(
             df.alias("s"),
             "s.ID == t.ID"
         ).whenMatchedUpdateAll() # when ID already exists: update all columns
        .whenNotMatchedInsertAll() # when ID doesn't exist: insert
        .execute())


query = (streaming_df.writeStream
    .foreachBatch(update_tables)
    # Note: checkpointing causes the stream to not work sometimes
    .start()
)

In [0]:
# loosely based on https://stackoverflow.com/questions/45717433/stop-structured-streaming-query-gracefully

def stop_stream_query(query):
    while True:
        if not query.status["isDataAvailable"] and not query.status["isTriggerActive"] and not query.status["message"] == "Initializing sources":
            query.stop()
            return
        
    
stop_stream_query(query)

INFO:py4j.java_gateway:Received command  on object id 
INFO:py4j.java_gateway:Closing down callback connection
INFO:py4j.java_gateway:Received command c on object id p1


In [0]:
logger.info(f'Row counts after streaming:\n\tDBLP fact table: {dblp_table.toDF().count()}\n\tAuthor table: {author_table.toDF().count()}\n\tVenue table: {venue_table.toDF().count()}\n\tOrganization table: {org_table.toDF().count()}')

INFO:__main__:Row counts after streaming:
	DBLP fact table: 275859
	Author table: 406699
	Venue table: 13567
	Organization table: 8882


#### 6. Queries

Find TOP 10 venues by number of publications.

In [0]:
%sql
SELECT Name, Type, Publications FROM 
  (SELECT Venue, COUNT(ID) AS Publications
  FROM dblp_fact_table
  GROUP BY Venue)
INNER JOIN Venues ON Venues.ID=Venue 
ORDER BY Publications DESC 
LIMIT 10;

Name,Type,Publications
Lecture Notes in Computer Science,Journal,4346
Human Factors in Computing Systems,Journal,1882
Expert Syst. Appl.,Journal,1598
European Journal of Operational Research,Journal,1481
Communications of The ACM,Journal,1461
Algebraic Methodology and Software Technology,Journal,1450
BMC Bioinformatics,Journal,1391
arXiv: Learning,Journal,1317
Discrete Mathematics,Journal,1306
International Joint Conference on Artificial Intelligence,Conference,1127


Find authors who had the most publications at the single venue.

In [0]:
%sql
SELECT Name AS AuthorName, VenueName, VenueType, Publications FROM
  (SELECT Venue, Author, Publications, Name AS VenueName, Type AS VenueType FROM
    (SELECT * FROM
      (SELECT Venue, Author, COUNT(ID) AS Publications
      FROM (SELECT ID, Venue, EXPLODE(Authors) AS Author FROM dblp_fact_table) 
      GROUP BY Venue, Author)
    INNER JOIN Venues ON Venues.ID=Venue)
    )
INNER JOIN Authors ON Authors.ID=Author
ORDER BY Publications DESC
LIMIT 10;

AuthorName,VenueName,VenueType,Publications
Daniel Cohen-Or,International Conference on Computer Graphics and Interactive Techniques,Conference,21
Wolfgang Glänzel,Scientometrics,Journal,21
Markus H. Gross,International Conference on Computer Graphics and Interactive Techniques,Conference,20
W. Bruce Croft,International ACM SIGIR Conference on Research and Development in Information Retrieval,Conference,19
Baining Guo,International Conference on Computer Graphics and Interactive Techniques,Conference,18
Xiaoou Tang,Computer Vision and Pattern Recognition,Journal,18
Jan Borchers,Human Factors in Computing Systems,Journal,17
Nicholas R. Jennings,Adaptive Agents and Multi-Agents Systems,,17
Massoud Pedram,Design Automation Conference,Conference,17
Tovi Grossman,Human Factors in Computing Systems,Journal,17


For each author, show their starting year (year when they released their first publication) and ending year (year when they released their last publication). Sort by career length.

In [0]:
%sql
SELECT AuthorName, StartYear, LastYear, (LastYear - StartYear) AS CareerLengthSoFar FROM
  (SELECT Name AS AuthorName, MIN(Year) AS StartYear, MAX(Year) AS LastYear FROM
      (SELECT EXPLODE(Authors) AS AuthorID, Year FROM dblp_fact_table)
    INNER JOIN Authors ON Authors.ID=AuthorID
  GROUP BY AuthorID, AuthorName)
SORT BY CareerLengthSoFar DESC
LIMIT 10;

AuthorName,StartYear,LastYear,CareerLengthSoFar
Arto Salomaa,1967,2013,46
Richard M Karp,1967,2013,46
Niklaus Wirth,1963,2007,44
Ulrich W. Kulisch,1967,2011,44
Stuart C. Shapiro,1969,2013,44
Ugo Montanari,1969,2012,43
Saul I. Gass,1961,2004,43
John E. Hopcroft,1970,2012,42
Jim Gray,1967,2007,40
Arie Tamir,1972,2012,40


Find the most impactful authors. (Calculate the H-index of each author and sort by it.)

In [0]:
# UDF for calculating the h-index, given the descendingly sorted list of citations the author's works have
def calc_h_index(citations:list):
    h_index = 0 # Each author starts with an h-index of 0
    
    # Iterate over the sorted citations
    for n in citations:
        if n <= h_index: # If citations are fewer or equal to current h-index, we have maximum h-index
            return h_index
        h_index += 1 # Otherwise, increase h-index and continue
        
    return h_index

spark.udf.register("calc_h_index_udf", calc_h_index, T.IntegerType())

Out[32]: <function __main__.calc_h_index(citations: list)>

In [0]:
%sql
select authors.Name as AuthorName, `H-index` from
    (select AuthorID, calc_h_index_udf(CitationsList) as `H-index` from
        (select AuthorID, sort_array(collect_list(NoCitations), false) as CitationsList from
            (select NoCitations, explode(Authors) as AuthorID 
             from dblp_fact_table
             where NoCitations is not NULL)
         group by AuthorID))
join authors on authors.ID=AuthorID
sort by `H-index` desc
limit 10;

AuthorName,H-index
Anil K. Jain,56
Jiawei Han,56
Thomas Huang,53
Philip Yu,51
W. M. P. van der Aalst,46
Thomas A. Henzinger,46
Nicholas R. Jennings,44
Dacheng Tao,41
Zhi-Hua Zhou,41
Luc J. Van Gool,40


Explore the most popular themes in computer science articles of the last 10 years. Titles of articles from the last 10 years that have the keyword "computer science", more than 1 author, more than 20 citations.

In [0]:
%sql
select Title 
from dblp_fact_table
where array_contains(Keywords, "computer science") and 
    year(current_date()) - Year <= 10 and 
    size(Authors) > 1 and 
    NoCitations > 20
sort by Title;

Title
A SOFT way for openflow switch interoperability testing
A taxonomy and survey of SCTP research
"Automatically mining software-based, semantically-similar words from comment-code mappings"
Computer science principles: analysis of a proposed advanced placement course
Efficient Index-Based Approaches for Skyline Queries in Location-Based Applications
From codes to patterns: designing interactive decoration for tableware
Judging a book by its cover: interface elements that affect reader selection of ebooks
Learning dependency-based compositional semantics
Making the Long Code Shorter
NetCoffee: a fast and accurate global alignment approach to identify functionally conserved proteins in multiple networks.


Most prestigious organizations. Sums the citations for each organization all of their articles have. Divides it by total number of articles. Takes top 10.

In [0]:
%sql
SELECT Name, Country, TotalNoCitations, Publications, TotalNoCitations / Publications AS CitationsPerPublication FROM
(SELECT Org, SUM(NoCitations) AS TotalNoCitations, Count(*) AS Publications
FROM dblp_fact_table
GROUP BY Org) AS dblp_fact_table
INNER JOIN orgs
ON dblp_fact_table.Org = orgs.ID
ORDER BY CitationsPerPublication DESC
LIMIT 10;

Name,Country,TotalNoCitations,Publications,CitationsPerPublication
"AT&T Wireless Services, Redmond, WA",United States,17035,3,5678.333333333333
"HHMI Janelia Farm Research Campus, Ashburn, VA 20147, USA. wheelert@janelia.hhmi.org",United States,6897,3,2299.0
"Graduate School of International Corporate Strategy, Hitotsubashi University, Gakujutsu Sogo Center, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8439, Japan",Japan,2203,1,2203.0
Mars Electronics International,,2080,1,2080.0
"Inmarsat, London, UK",United Kingdom,7295,4,1823.75
"Ingenu Syst, Redwood City, CA 94063 USA",United States,1772,1,1772.0
"NHLBI, Natl Inst Hlth, Lab Computat Biol, Bethesda, MD 20892 USA",United States,4773,3,1591.0
School of Information Management and Systems|Monash University|Gartner Group Pacific,,1542,1,1542.0
imperial college school of medicine,,4453,3,1484.3333333333333
lear corporation,,2916,2,1458.0


Most productive countries. Sorts the countries by total number of articles released by organizations in that country. Takes the top 10.

In [0]:
%sql
SELECT nvl(Country, 'Unidentified') AS Country, COUNT(*) AS Publications FROM
(SELECT Country
FROM dblp_fact_table
INNER JOIN orgs
ON dblp_fact_table.Org = orgs.ID)
GROUP BY Country
ORDER BY Publications DESC
LIMIT 10;

Country,Publications
United States,75250
Unidentified,45907
China,21510
Germany,15733
United Kingdom,12150
Canada,11596
France,9212
Italy,8885
Japan,7694
Spain,6911
