## Citation searching with ContentMine & PyCMLib

This is a brief example of how to use the Python library I've written to process data acquired using ContentMine. This specific example will cover searching for citations to a specific paper.

Before starting, you will need to have a directory containing many CMDirectories, each of which contains _at least_ the following two files: `scholarly.html` and `results.json`. For example, something like this:

```
├── mdpi-rs-2009
│   ├── http_dx.doi.org_10.3390_rs1010001
│   │   ├── results.json
│   │   └── scholarly.html
│   ├── http_dx.doi.org_10.3390_rs1010003
│   │   ├── results.json
│   │   └── scholarly.html
│   ├── http_dx.doi.org_10.3390_rs1010022
│   │   ├── results.json
│   │   └── scholarly.html
│   ├── http_dx.doi.org_10.3390_rs1020036
│   │   ├── results.json
│   │   └── scholarly.html
```

This sort of folder structure can easily be created using the `quickscrape` command, followed by `norma` to convert `fulltext.xml` or `fulltext.html` files to `scholarly.html` files.
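
Before going further, it can be worth checking that every article folder really does contain both files. The following is a minimal sketch using only the standard library; `find_incomplete_cmdirs` is a hypothetical helper name and is not part of PyCMLib.

```python
from pathlib import Path

# Files that each CMDirectory must contain (as listed in the text above)
REQUIRED_FILES = ("scholarly.html", "results.json")


def find_incomplete_cmdirs(corpus_dir):
    """Return {folder name: [missing files]} for article folders lacking a required file.

    Hypothetical helper for illustration, not part of PyCMLib.
    """
    incomplete = {}
    for cmdir in sorted(Path(corpus_dir).iterdir()):
        if not cmdir.is_dir():
            continue  # ignore stray files sitting next to the article folders
        missing = [name for name in REQUIRED_FILES if not (cmdir / name).is_file()]
        if missing:
            incomplete[cmdir.name] = missing
    return incomplete
```

Calling `find_incomplete_cmdirs("mdpi-rs-2009")` returns an empty dictionary when every folder is complete.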

So, let's get going. First we import the library I have written, called (for the moment) `PyCMLib`:

In [8]:
from PyCMLib import *

Now we need to define what paper we want to find all of the citations for. Ideally we should get both the DOI and the title, as some journal papers do not use the DOI in citation lists (yes, I know...!).

In [9]:
doi = "10.1016/S0034-4257(00)00169-3"
title = "Classification and change detection using Landsat TM data: When and how to correct atmospheric effects?"
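
To illustrate why both fields are useful, here is a rough sketch of how a reference-list entry could be tested against either the DOI or the title. This is not PyCMLib's actual matching logic (which is not shown in this example); `normalise` and `reference_matches` are hypothetical names.

```python
import re


def normalise(text):
    """Lower-case the text and collapse punctuation/whitespace runs to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def reference_matches(reference_text, doi, title):
    """Return True if a reference-list entry mentions the DOI or the title.

    Hypothetical helper for illustration, not part of PyCMLib.
    """
    # DOIs are case-insensitive, so compare in lower case
    if doi.lower() in reference_text.lower():
        return True
    # Fall back to the title, ignoring case, punctuation and spacing differences
    return normalise(title) in normalise(reference_text)
```

The title fallback is what catches reference lists that omit the DOI, at the cost of being a plain substring test that a very short title could trigger by accident.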

Then we call the main processing function, `process_all_articles`. It iterates through all of the folders matched by its first argument (the `mdpi-only/**` string is a 'glob pattern' that matches all subfolders of `mdpi-only`, each of which represents one article) and runs the given processing function (in this case `pf_get_citation`) on each one. The remaining keyword arguments are parameters for that specific processing function: here they give it the DOI and title, and tell it to grab two sentences around each use of the citation.

In [10]:
res = process_all_articles('mdpi-only/**', pf_get_citation, doi=doi, title=title, n_sentences=2)
len(res)

2239
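
To make the pattern concrete, here is a simplified stand-in for this kind of driver. It is a sketch under assumptions, not PyCMLib's code: `apply_to_articles`, `pf_marker_contexts` and `citation_contexts` are made-up names, the rows come back as a plain list of dictionaries rather than a DataFrame, the HTML is treated as plain text, and the citation is assumed to have already been replaced by a `**REF**` marker. It also assumes that "two sentences around" means two on either side, which is what the printed output at the end of this example suggests.

```python
import glob
import os
import re


def citation_contexts(text, marker="**REF**", n_sentences=2):
    """Return each sentence containing `marker` plus n_sentences on either side."""
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    contexts = []
    for i, sentence in enumerate(sentences):
        if marker in sentence:
            start = max(0, i - n_sentences)
            contexts.append(" ".join(sentences[start:i + n_sentences + 1]))
    return contexts


def pf_marker_contexts(folder, marker="**REF**", n_sentences=2):
    """Hypothetical processing function: one match_N entry per citation context."""
    with open(os.path.join(folder, "scholarly.html"), encoding="utf-8") as f:
        text = f.read()
    contexts = citation_contexts(text, marker, n_sentences)
    return {"match_%d" % i: c for i, c in enumerate(contexts)}


def apply_to_articles(pattern, processing_function, **kwargs):
    """Call processing_function(folder, **kwargs) for every folder matching pattern."""
    rows = []
    # Without recursive=True, '**' behaves like '*', so only direct children match
    for folder in sorted(glob.glob(pattern)):
        if os.path.isdir(folder):
            row = {"folder": os.path.basename(folder)}
            row.update(processing_function(folder, **kwargs))
            rows.append(row)
    return rows
```

Passing the resulting list to `pandas.DataFrame(rows)` gives one row per article, with NaN in any `match_N` column that an article did not fill.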

Now that `process_all_articles` has run (which may take a little while), we have a Pandas DataFrame called `res`. At the moment it contains information on _all_ of the papers, so we need to subset it to just the papers that cited our paper of interest. This is the nastiest bit of the code so far, and will be hidden away sometime soon, but basically it just selects all of the rows that have at least one match.

We can then count the results and find that 22 papers in our corpus cited the paper of interest.

In [11]:
res = res[~res.match_0.isnull()]
len(res)

22
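
The same selection can be written a little more readably with `notna()`. The small DataFrame below is made-up data that only mimics the `match_N` column layout; it relies on the assumption that matches are filled in order, so that any paper with at least one match has a non-null `match_0`.

```python
import pandas as pd

# Made-up stand-in for `res`: one row per paper, match_N is missing when absent
demo = pd.DataFrame({
    "doi": ["10.0000/a", "10.0000/b", "10.0000/c"],
    "match_0": ["First context ...", None, "Another context ..."],
    "match_1": ["Second context ...", None, None],
})

# Keep only the rows with at least one match (same rows as ~demo.match_0.isnull())
citing = demo[demo["match_0"].notna()]

# Count how many citation contexts were found in each citing paper
match_cols = [c for c in citing.columns if c.startswith("match_")]
n_matches = citing[match_cols].notna().sum(axis=1)
```

Here `len(citing)` plays the role of the count above, and `n_matches` shows which papers cite the target more than once.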

We can then look at the full output DataFrame, and do any processing we want on it:

In [12]:
res

|  | date | doi | journal | match_0 | match_1 | match_2 | title |
|---|---|---|---|---|---|---|---|
| 17 | 2009-07-15 | 10.3390/rs1030278 | Remote Sensing | In certain circumstances, radiance calibration... | Scattering effect was computed for each pixel ... |  | A Highly Accurate Classification of TM Data th... |
| 123 | 2010-04-08 | 10.3390/rs2041035 | Remote Sensing | Despite relatively good atmospheric conditions... |  |  | Per-Field Irrigated Crop Classification in Ari... |
| 145 | 2010-06-03 | 10.3390/rs2061508 | Remote Sensing | The importance of radiometric normalization fo... |  |  | Change Detection Accuracy and Image Properties... |
| 438 | 2012-06-29 | 10.3390/rs4071947 | Remote Sensing | The 30-m image projection was converted to the... |  |  | Utility of Satellite and Aerial Images for Qua... |
| 496 | 2012-10-19 | 10.3390/rs4103184 | Remote Sensing | Since the predictors are only used in empirica... |  |  | Downscaling Land Surface Temperature in an Urb... |
| 507 | 2012-11-09 | 10.3390/rs4113417 | Remote Sensing | The final image set was pre-processed to at-su... |  |  | Continental Scale Mapping of Tidal Flats acros... |
| 668 | 2013-05-30 | 10.3390/rs5062763 | Remote Sensing | These include no-change buffer zone around a f... | The PIFs describe a high-density ridge along a... | Song et al.[ **REF**] identifies PIFs using sc... | Radiometric Normalization of Temporal Images C... |
| 677 | 2013-06-13 | 10.3390/rs5062973 | Remote Sensing | Because of significant differences between the... |  |  | Removal of Optically Thick Clouds from Multi-S... |
| 688 | 2013-07-04 | 10.3390/rs5073212 | Remote Sensing | Though multi-temporal and multi-platform data ... |  |  | Influence of Multi-Source and Multi-Temporal R... |
| 869 | 2013-12-27 | 10.3390/rs6010285 | Remote Sensing | Before using the RapidEye images in the final ... |  |  | Training Area Concept in a Two-Phase Biomass I... |


For example, we can print out the full citation context for each paper, so that we can easily read through them:

In [24]:
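# Print the title of each citing paper, followed by all of its citation contexts
for i, s in res.dropna(subset=['match_0']).iterrows():
    print(s['title'])
    print("-" * 80)
    for col in s.index:
        # Only the match_N columns hold citation contexts; skip the empty (NaN) ones
        if 'match' in col and str(s[col]) != 'nan':
            print(s[col])
            print('\n')
    print('\n\n')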

A Highly Accurate Classification of TM Data through Correction of Atmospheric Effects
--------------------------------------------------------------------------------
In certain circumstances, radiance calibration of the image data is necessary prior to classification using multitemporal images [ 2]. The atmospheric effect can prevent the proper interpretation of images if it is not corrected [ 3]. For many other applications involving image classification and change detection, atmospheric correction is unnecessary for a single date image [ **REF**]. As long as the training data and images to be classified are on the same relative scale, atmospheric correction has little effect on classification accuracy [ 5, 6, 7]. Past investigators have used relative atmospheric correction techniques, which proceed on the assumption that a linear relationship exists between the measurements from a targeted area over time.


Scattering effect was computed for each pixel and the images were subjected 