<a href="https://colab.research.google.com/github/siherm/PUMA/blob/main/PUMA_Duplicate_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [53]:
!pip install pydantic



In [54]:
# __future__ imports have to come first when this cell is run as a plain script
from __future__ import annotations

import pandas as pd
import re
import requests

from itertools import compress
from pydantic import BaseModel, Field, PrivateAttr, validator
from typing import Optional, List, Union

In [55]:
def get_simtech_group_entries(username: str, token: str):
    """Returns a list of all posts present in the SimTech group

    Args:

        username (str): Username that is used to evaluate rights. Usually the profile name of your PUMA account.
        token (str): Token required to grant access to functionalities found in Bibsonomy.

    Returns:

        list[dict]: All posts as a list of dictionaries, each holding PUMA-specific metadata and the corresponding BibTeX entry as a dict.
    """
    return requests.get(
        "https://puma.ub.uni-stuttgart.de/api/posts?user=simtech&resourcetype=bibtex&start=0&end=11000&format=json",
        auth=(username, token)
    ).json()["posts"]["post"]
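
The query string above is hard-coded. As a minimal sketch, the same URL can be assembled from named parameters using only the standard library. The names `BASE_URL` and `build_posts_url` are hypothetical and not part of the notebook, and no request is sent here.

```python
from urllib.parse import urlencode

# Endpoint taken from get_simtech_group_entries
BASE_URL = "https://puma.ub.uni-stuttgart.de/api/posts"


def build_posts_url(user: str = "simtech", start: int = 0, end: int = 11000) -> str:
    """Assemble the posts URL from named query parameters (hypothetical helper)."""
    query = urlencode({
        "user": user,
        "resourcetype": "bibtex",
        "start": start,
        "end": end,
        "format": "json",
    })
    return f"{BASE_URL}?{query}"


url = build_posts_url()
print(url)
```

With the default arguments this reproduces the URL used in `get_simtech_group_entries`, so a smaller window (for example `end=100`) could be requested while testing.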

## Fetching and processing records from PUMA

The following code demonstrates how to fetch and process entries found in PUMA. First, the REST API is called to return a list of all entries found in the SimTech group. Next, each record is parsed into an object for analysis. Finally, all objects are collected in a dedicated ```PUMACollection``` class that supports duplicate search as well as other analytics.

In [56]:
# 1 - Fetch data from PUMA
username = "hermann" #@param {type:"string"}
token = "cc7b8804d7963cb91dad41f7f8977347" #@param {type:"string"}
records = get_simtech_group_entries(username, token)
len(records)

1984

In [57]:
# Content of a record
records[3]

{'bibtex': {'address': 'Austin, TX, USA',
  'author': 'Str{\\"a}sser, R. and Berberich, J. and Allg{\\"o}wer, F.',
  'bibtexKey': 'ist:strasser21a',
  'booktitle': 'Proc. 60th IEEE Conf. Decision and Control (CDC)',
  'entrytype': 'inproceedings',
  'href': 'https://puma.ub.uni-stuttgart.de/api/users/simtech/posts/44bc0a33f4b4b7c12b9c9c73b575a3ed',
  'interhash': '3d94f3fe49b453350bda077389123ec8',
  'intrahash': '44bc0a33f4b4b7c12b9c9c73b575a3ed',
  'misc': '  doi = {10.1109/CDC45484.2021.9683211}',
  'pages': '4344-4351',
  'title': 'Data-Driven Control of Nonlinear Systems: Beyond Polynomial Dynamics',
  'year': '2021'},
 'changedate': '2022-02-10T11:11:03.000+01:00',
 'group': [{'href': 'https://puma.ub.uni-stuttgart.de/api/groups/public',
   'name': 'public'}],
 'postingdate': '2022-02-10T11:11:03.000+01:00',
 'tag': [{'href': 'https://puma.ub.uni-stuttgart.de/api/tags/myown',
   'name': 'myown'},
  {'href': 'https://puma.ub.uni-stuttgart.de/api/tags/pn4', 'name': 'pn4'},
  {'href

## Setting up classes for management

To process PUMA entries and enable deeper analysis, each entry is encapsulated in a class with the following attributes.

- __str__ title
- __str__ user
- __list[str]__ authors
- __Optional[str]__ journal
- __str__ entrytype
- __str__ group
- __list[str]__ tags
- __Optional[str]__ url
- __dict__ bibtex (raw entry to get values not present in this class)
- __Optional[str]__ doi
- __Optional[str]__ isbn
- __Optional[str]__ issn
- __Optional[str]__ misc
- __Optional[dict]__ preprint_id

Please note that pydantic is used for data validation and processing.
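
Before validation, the nested API record has to be flattened so that the BibTeX fields sit next to the post metadata. The following is a minimal sketch with plain dictionaries, not the class itself. `flatten_record` is a hypothetical helper, and the sample record is a trimmed version of the record printed earlier.

```python
def flatten_record(api_record: dict, group: str = "SimTech") -> dict:
    """Merge the nested 'bibtex' dict into the top level and simplify tags."""
    flat = {**api_record, **api_record["bibtex"]}
    # The API returns 'group' as a list of dicts; it is replaced by a plain name
    flat["group"] = group
    # Tags arrive as dicts with 'href' and 'name'; only the names are kept
    flat["tags"] = [tag["name"] for tag in api_record.get("tag", [])]
    return flat


# Trimmed sample record (values taken from the output shown earlier)
sample = {
    "bibtex": {
        "title": "Data-Driven Control of Nonlinear Systems: Beyond Polynomial Dynamics",
        "entrytype": "inproceedings",
        "year": "2021",
    },
    "group": [{"name": "public"}],
    "tag": [{"name": "myown"}, {"name": "pn4"}],
}

flat = flatten_record(sample)
print(flat["title"], flat["tags"], flat["group"])
```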

In [58]:
class PUMAEntry(BaseModel):

  class Config:
    allow_population_by_field_name = True
    smart_union = True

  bibtex: dict = Field(
      ...,
      description="Raw bibtex file that can be utilized in later stages"
  )

  title: str = Field(
      ...,
      description="Title of the publication"
  )

  user: Union[str, dict] = Field(
      ...,
      description="User that has added the publication"
  )

  authors: List[str] = Field(
      default_factory=list,
      description="Authors of the publication"
  )

  entrytype: str = Field(
      ...,
      description="Type of entry that is given in the PUMA entry"
  )

  group: str = Field(
      ...,
      description="PUMA group association of the entry"
  )

  tags: list = Field(
      default_factory=list,
      alias="tag",
      description="Tags the dataset has been assigned to"
  )

  journal: Optional[str] = Field(
      None,
      description="Journal in which the manuscript has been published"
  )

  preprint_id: Optional[dict] = Field(
      None,
      description="ID of the preprint as well as the server"
  )

  url: Optional[str] = Field(
      None,
      description="URL pointing to further information on the publication"
  )

  doi: Optional[str] = Field(
      None,
      description="Digital Object Identifier of the publication"
  )

  isbn: Optional[str] = Field(
      None,
      description="ISBN of the book"
  )

  issn: Optional[str] = Field(
      None,
      description="ISSN of the journal or series"
  )

  misc: Optional[str] = Field(
      None,
      description="Miscellaneous items, mainly used to populate DOI and ISBN"
  )

  # * Private attributes
  _raw_data: dict = PrivateAttr()


  # ! Validators
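  @validator("user")
  def fetch_username(cls, value: Union[str, dict]):
    """Returns the user name, given either as a string or as a user dictionary"""
    if isinstance(value, dict):
      return value["name"]
    return value

  @validator("tags", pre=True)
  def parse_tags(cls, tags: list):
    """Parses tag name dictionaries to strings"""
    # The validator receives the whole list; plain strings are passed through
    return [
            tag["name"] if isinstance(tag, dict) else tag
            for tag in tags
    ]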

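  @validator("misc", always=True)
  def parse_misc(cls, misc: Optional[str], values: dict):
    """Uses the miscellaneous field to find DOI, ISSN, ISBN and arXiv ID"""

    # Note: this validator writes into `values`, which only works because
    # doi, isbn, issn and preprint_id are declared before misc

    if not misc:
      # Nothing to parse
      return None

    if "doi = " in misc:
      # Use regular expression to fetch the DOI
      doi = cls._fetch_from_misc(misc.lower(), "doi")

      if doi and "10." in doi:
        # Drop any prefix in front of the DOI, e.g. a resolver URL
        values["doi"] = doi[doi.find("10."):]

    if "issn = " in misc:
      values["issn"] = cls._fetch_from_misc(misc.lower(), "issn")

    if "isbn = " in misc:
      values["isbn"] = cls._fetch_from_misc(misc.lower(), "isbn")

    if "arxivid = " in misc:
      arxiv_id = cls._fetch_from_misc(misc.lower(), "arxivid")

      if arxiv_id:
        # Same structure as in parse_url and parse_journal
        values["preprint_id"] = {"type": "arxiv", "id": arxiv_id}

    # The raw misc string is dropped once its content has been extracted
    return None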

  @validator("url")
  def parse_url(cls, url: str, values: dict):
    """Parses url to get preprint info"""

    if not url:
      return None

    if "arxiv.org" in url:
      values["preprint_id"] = {
          "type": "arxiv",
          "id": url.split("/")[-1]
      }

    return url

  @validator("journal")
  def parse_journal(cls, journal: str, values: dict):
    """Extracts possible pre-prints from the journal"""

    if not journal:
      return None

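    if "ArXiv e-print" in journal:
      # Caveat: journal is declared before preprint_id, so in pydantic v1 the
      # default of preprint_id may be applied afterwards and overwrite this value
      values["preprint_id"] = {
          "type": "arxiv",
          "id": journal.split(" ")[-1]
      }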

    return journal


  @staticmethod
  def _fetch_from_misc(misc: str, search_term: str):
    """Fetches any additional attribute in misc"""
    pattern = re.compile(f"{search_term} = "+ r"{(.*?)}")

    try:
      return pattern.findall(misc)[0].strip("{}")
    except IndexError:
      return None

  # ! Initializers
  @classmethod
  def from_api_record(cls, api_record: dict, group: str = "SimTech"):
    """Initializes objects coming from an API fetch"""

    # Rearrange data for easy initialization
    api_record = {**api_record, **api_record["bibtex"]}
    api_record["group"] = group

    # Create instance (named `entry` to avoid shadowing `cls`)
    entry = cls(**api_record)
    entry._raw_data = api_record

    return entry

  # ! ReDefs
  def json(self, indent=2, **kwargs):
    return super().json(
        exclude={"bibtex", "_raw_data"},
        indent=indent,
        **kwargs
    )
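
The ```misc``` parsing can also be tried outside the class. The sketch below is a standalone variant of `_fetch_from_misc`. The name `extract_misc_field` is hypothetical, and the sample string is the `misc` value of the record shown earlier.

```python
import re
from typing import Optional


def extract_misc_field(misc: str, key: str) -> Optional[str]:
    """Return the value of `key = {value}` inside a misc string, or None."""
    # Escaped braces match the literal BibTeX-style delimiters
    pattern = re.compile(rf"{re.escape(key)} = \{{(.*?)\}}")
    match = pattern.search(misc.lower())
    return match.group(1).strip() if match else None


sample_misc = "  doi = {10.1109/CDC45484.2021.9683211}"
doi = extract_misc_field(sample_misc, "doi")
missing = extract_misc_field(sample_misc, "isbn")
print(doi, missing)
```

Lower-casing the input mirrors the class, which is why the DOIs in the DataFrame appear in lower case.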


In [59]:
class PUMACollection(BaseModel):
  """
  Collection of PUMAEntry classes as well as functionalities to perform
  analysis and data export.
  """


  records: List[PUMAEntry] = Field(
      default_factory=list,
      description="Collection of PUMA entries"
  )

  @classmethod
  def from_web_api(cls, username: str, token: str):
    """Initializes a PUMA collection using the REST API"""

    # Initialize object (named `collection` to avoid shadowing `cls`)
    collection = cls()

    # Get all entries
    records = get_simtech_group_entries(username, token)

    # Iterate through entries and convert to objects
    for record in records:
      collection.records.append(
          PUMAEntry.from_api_record(record)
      )

    return collection

  # ! Exports
  def to_dataframe(self):
    """Turns the given collection into a Pandas DataFrame for further analysis"""

    data = [record.dict(exclude={"_raw_data", "bibtex"}) for record in self.records]

    return pd.DataFrame(data)

  def __iter__(self):
    return iter(self.records)

## Utilizing the PUMACollection
### Application: Find duplicates

Since the PUMACollection class provides a ```to_dataframe``` method, a DataFrame can be used to perform various analyses, one of which is finding duplicate entries. In this application, the DataFrame is reduced to only those records that share a DOI with an earlier record. The following procedure is applied:

1. Generate DataFrame
2. Reduce to duplicates
3. Create new PUMACollection using indices
4. Use the ```merge_duplicates``` method (not implemented yet) to decide which entry should be used.
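
How step 2 behaves depends on the `keep` argument of `DataFrame.duplicated`. With the default `keep="first"`, the first record of each DOI is not flagged, so only the later copies remain. With `keep=False`, every record that shares a DOI is flagged. The following sketch illustrates the difference on made-up toy data (titles and DOIs are hypothetical).

```python
import pandas as pd

# Toy data: two DOIs occur twice, one record has no DOI
df_toy = pd.DataFrame({
    "title": ["A", "A (copy)", "B", "C", "C (copy)", "D"],
    "doi": ["10.1/a", "10.1/a", "10.1/b", "10.1/c", "10.1/c", None],
})

# Records without a DOI cannot be compared and are dropped first
with_doi = df_toy.dropna(subset=["doi"])

# Default keep="first": only the later copies are flagged
later_copies = with_doi[with_doi.duplicated(subset=["doi"])]

# keep=False: all records of a duplicated DOI are flagged
all_copies = with_doi[with_doi.duplicated(subset=["doi"], keep=False)]

print(later_copies.index.to_list())
print(all_copies.index.to_list())
```

If the goal is to compare all copies of a publication before deciding which one to keep, `keep=False` may be the more convenient choice, since the first occurrence stays in the result.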

In [60]:
# 0 Fetch remote collection from Simtech group
collection = PUMACollection.from_web_api(username, token)

In [61]:
# 1 Export to DataFrame
df = collection.to_dataframe()
df.head(3)

| | title | user | authors | entrytype | group | tags | journal | preprint_id | url | doi | isbn | issn | misc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A non-intrusive nonlinear model reduction meth... | simtech | [] | article | SimTech | [myown, from:jkneifl, peerReviewed, pn7, exc2075] | International Journal for Numerical Methods in... | | https://doi.org/10.1002%2Fnme.6712 | 10.1002/nme.6712 | | | |
| 1 | Dynamic self-triggered control for nonlinear s... | simtech | [] | inproceedings | SimTech | [from:michaelhertneck, myown, peerreviewed, ex... | | | | 10.1109/cdc45484.2021.9682784 | | | |
| 2 | Multi-party computation enables secure polynom... | simtech | [] | inproceedings | SimTech | [myown, from:sebastianschlor, pn4, peerReviewe... | | | | 10.1109/cdc45484.2021.9683026 | | | |


In [62]:
# 2 Reduce DataFrame to only include duplicates
df_duplicates = df.dropna(subset=["doi"], axis=0)
df_duplicates = df_duplicates[df_duplicates.duplicated(subset=["doi"])]
df_duplicates.shape

(102, 13)

In [63]:
# 3 Create new PUMACollection based on duplicates DataFrame
duplicates = PUMACollection(
    records = [collection.records[i] for i in df_duplicates.index.to_list()]
)
len(duplicates.records)

102

### @Sibylle: how to continue

You now have all objects that are duplicates in one structure and can filter them as needed. For example, if you want to check the tags, you can access them via ```record.tags``` while iterating over the collection with a for loop.

You could also group the duplicates by DOI and take a closer look. The following example shows how you could do that:

In [64]:
duplicate_groups = {}

# Collect all records that share a DOI in a single pass
for record in duplicates:
  duplicate_groups.setdefault(record.doi, []).append(record)

In [70]:
# An entry then looks like this
print(duplicate_groups['10.1002/fld.4065'][0].json())

{
  "title": "Relaxation of the Navier-Stokes-Korteweg Equations for Compressible Two-Phase Flow with Phase Transition",
  "user": "simtech",
  "authors": [],
  "entrytype": "article",
  "group": "SimTech",
  "tags": [
    "EXC310",
    "from:katharinafuchs",
    "from:simtechpuma",
    "imported",
    "pn_missing"
  ],
  "journal": "Internat. J. Numer. Methods Fluids",
  "preprint_id": null,
  "url": "http://doi.org/10.1002/fld.4065",
  "doi": "10.1002/fld.4065",
  "isbn": null,
  "issn": null,
  "misc": null
}
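
To check the tags within each DOI group, one option is to compare the tag sets of the grouped entries. This is only a sketch under assumptions: `Entry` is a hypothetical stand-in for `PUMAEntry` with just `doi` and `tags`, `tag_differences` is not part of the notebook, and the DOI in the toy group is made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Entry:
    """Hypothetical stand-in for PUMAEntry, reduced to doi and tags."""
    doi: str
    tags: List[str] = field(default_factory=list)


def tag_differences(groups: Dict[str, List[Entry]]) -> Dict[str, Set[str]]:
    """For each DOI, return the tags that are not shared by all entries."""
    result = {}
    for doi, entries in groups.items():
        tag_sets = [set(entry.tags) for entry in entries]
        shared = set.intersection(*tag_sets)
        combined = set.union(*tag_sets)
        result[doi] = combined - shared
    return result


# Toy group with two entries for the same (made-up) DOI
groups = {
    "10.1/a": [
        Entry("10.1/a", ["myown", "pn4"]),
        Entry("10.1/a", ["myown", "imported"]),
    ]
}

differences = tag_differences(groups)
print(differences)
```

An empty set for a DOI means that all of its entries carry identical tags. A non-empty set lists the tags that would have to be reconciled before merging.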
