<a href="https://colab.research.google.com/github/restrepo/colav/blob/main/sample.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Install
https://www.anaconda.com/products/individual#Downloads
```bash
$ bash Anaconda3-2020.11-Linux-x86_64.sh
```

https://github.com/colav/HunabKu → README
```bash
$ conda install nodejs==10.13.0
$ conda install -c anaconda mongodb
$ conda install mongodb mongo-tools
$ npm install -g apidoc
$ pip install python-Levenshtein
$ pip install hunabku
```
Check the `mongodb` installation with

```bash
$ mongo
...
> exit
```

If necessary, start the `mongod` service explicitly:
```
$ mkdir -p data/db
$ mongod --dbpath data/db/
```

Then use `mongo` in another terminal.
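As a quick alternative check from Python, the sketch below tests whether something is listening on the MongoDB port. It assumes `mongod` runs on the local machine on its default port, 27017; the helper name `is_listening` is hypothetical and only the standard library is used. A successful connection only shows that a process accepts TCP connections on that port, not that it is actually `mongod`.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, unreachable host, etc.
        return False


# Assumption: mongod was started locally on its default port, 27017.
print("mongod reachable:", is_listening("127.0.0.1", 27017))
```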

Download `Moai-0.0.3a0-py3-none-any.whl` from the CoLav shared Google Drive and install it with

```bash
$ pip install Moai-0.0.3a0-py3-none-any.whl
```

Now we can start the server
```bash
$ hunabku_server
```
Take note of the server URL that is generated. The same URL serves the documentation.
Check the documentation server with the link at the end, for example:

http://fisica.udea.edu.co:8080/apidoc/index.html

This server is the endpoint for `mongodb`. 
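Since the same URL serves both the endpoints and the documentation, a small helper can derive the documentation link from the server URL and check that it answers. This is only a sketch: the path `apidoc/index.html` is taken from the example link above and is assumed to be the same on other servers, and the names `apidoc_url` and `docs_available` are hypothetical.

```python
from urllib.error import URLError
from urllib.parse import urljoin
from urllib.request import urlopen


def apidoc_url(server_url: str) -> str:
    """Build the documentation URL (assumed path: apidoc/index.html)."""
    return urljoin(server_url.rstrip("/") + "/", "apidoc/index.html")


def docs_available(server_url: str, timeout: float = 5.0) -> bool:
    """Return True if the documentation page answers with HTTP 200."""
    try:
        with urlopen(apidoc_url(server_url), timeout=timeout) as response:
            return response.status == 200
    except (URLError, OSError):
        return False


print(apidoc_url("http://fisica.udea.edu.co:8080"))
# → http://fisica.udea.edu.co:8080/apidoc/index.html
```

Call `docs_available(...)` with the URL printed by `hunabku_server` once the server is running.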

<!-- 16m of Video to explain the plugin system and endpoint (url) system) -->

<!-- 24m how to build the wheel -->


## Data sample for Moai
Full, detailed video help [ES]:

* https://drive.google.com/file/d/1kQTzKvrmQuUxN-SLs7L9HmMKdhP4fgkl/view
* https://drive.google.com/file/d/18kNkiQqp6lWAKsmk2VBkJli7l5RUElpz/view [TMP]

Data sample [XLSX]:

https://docs.google.com/spreadsheets/d/1tAbdLUTELnaulyoF-ctXXhln7oXFXJ2J

In [1]:
import pandas as pd
import json

In [16]:
df=pd.read_excel('https://docs.google.com/spreadsheets/d/e/2PACX-1vTJwBnceHMjYSVar20c-XMMgto7GxuWUKSo_ikPZjru1q8vQpFsQZZZBn7VNyFaJw/pub?output=xlsx')

The complete list of the mandatory fields is: 

`["article_id","journal","publisher","country","title", "author","doi","year","volume","issue","pages","abstract"]`

`"author"` must be a comma-separated list of authors, each written as last name followed by initials, e.g.: `Bianchinotti MV, Borromei AM`
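The two requirements above can be checked before exporting the data. The following is a minimal sketch, not part of Moai: the helper names are hypothetical, and the author rule (the last token of each entry consists of uppercase initials) is an assumption inferred from the example `Bianchinotti MV, Borromei AM`.

```python
# Mandatory fields, copied from the list above.
MANDATORY_FIELDS = [
    "article_id", "journal", "publisher", "country", "title", "author",
    "doi", "year", "volume", "issue", "pages", "abstract",
]


def missing_fields(columns):
    """Return the mandatory fields that are absent from `columns`."""
    present = set(columns)
    return [field for field in MANDATORY_FIELDS if field not in present]


def is_valid_author(name):
    """Assumed rule: last name(s) followed by uppercase initials."""
    parts = name.split()
    if len(parts) < 2:
        return False
    initials = parts[-1]
    return initials.isalpha() and initials.isupper()


def invalid_authors(author_field):
    """Return the comma-separated entries that break the assumed rule."""
    names = [name.strip() for name in author_field.split(",")]
    return [name for name in names if not is_valid_author(name)]


print(missing_fields(["author", "year", "title"]))
print(invalid_authors("Bianchinotti MV, Borromei AM"))   # → []
print(invalid_authors("Bianchinotti MV, Ana Borromei"))  # → ['Ana Borromei']
```

With the DataFrame loaded above, `missing_fields(df.columns)` lists the columns still to be added, and `df['author'].map(invalid_authors)` flags suspicious entries row by row.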

In [18]:
df[['author','year','publisher', 'title', 'doi',  'country', 'journal', 'volume', 
    'pages', 'article_id', 'language', 'abstract','issue'
    ]]

| | author | year | publisher | title | doi | country | journal | volume | pages | article_id | language | abstract | issue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bianchinotti MV, Borromei AM, Musotto LL | 2013 | MACN Bernardino Rivadavia | Inferencias paleoecológicas a partir del análi... | | Buenos Aires | Revista del Museo Argentino de Ciencias Natura... | | | | | | |
| 1 | Brignardello M | 2013 | Universidad Santiago de Chile | ¿Escasez de agua en el siglo XXI?: Formas de a... | | Santiago de Chile | Estudios Avanzados | | | | | | |
| 2 | Velez J | 2018 | Universidad del Rosario | Security floors: Towards an urban anthropology... | | Bogotá | Territorios | | | | | | |
| 3 | Calvinho LF, Dallard BE | 2010 | Universidad Nacional del Litoral | Receptores tipo toll en la inmunidad innata y ... | | | Fave Ciencias Veterinarias | | | | | | |


In [19]:
df[['author','year','publisher', 'title', 'doi',  'country', 'journal', 'volume', 
    'pages', 'article_id', 'language', 'abstract','issue'
    ]].fillna('').to_json('sample.json',orient='records',force_ascii=False)

In [2]:
with open(r"sample.json", "r") as read_file:
    data = json.load(read_file)

Check the data

In [3]:
data[0]['author']

'Bianchinotti MV, Borromei AM,  Musotto LL'

In [4]:
data

[{'author': 'Bianchinotti MV, Borromei AM,  Musotto LL',
  'year': 2013,
  'publisher': 'MACN Bernardino Rivadavia',
  'title': 'Inferencias paleoecológicas a partir del análisis de microfósiles fúngicos en una turbera pleistoceno-holocena de Tierra del Fuego, Argentina',
  'doi': '',
  'country': 'Buenos Aires',
  'journal': 'Revista del Museo Argentino de Ciencias Naturales, n.s.',
  'volume': '',
  'pages': '',
  'article_id': '',
  'language': '',
  'abstract': '',
  'issue': ''},
 {'author': 'Brignardello M',
  'year': 2013,
  'publisher': 'Universidad Santiago de Chile',
  'title': '¿Escasez de agua en el siglo XXI?: Formas de apropiación, distribución y uso del recurso hídrico por parte de productores vitivinícolas de Maipú, Mendoza',
  'doi': '',
  'country': 'Santiago de Chile',
  'journal': 'Estudios Avanzados',
  'volume': '',
  'pages': '',
  'article_id': '',
  'language': '',
  'abstract': '',
  'issue': ''},
 {'author': 'Velez J',
  'year': 2018,
  'publisher': 'Universi

## Create the database in MongoDB
Load the data into MongoDB and give the database a name, e.g., `la`:

```bash
mongoimport --db la --collection data --file sample.json --jsonArray
```
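Because `--jsonArray` expects the file to contain a single JSON array of documents, it can help to verify that shape before importing. The sketch below is an illustration under that assumption; the helper name `count_records` is hypothetical and the inline sample stands in for the contents of `sample.json`.

```python
import json


def count_records(json_text: str) -> int:
    """Return the number of records if json_text is a JSON array of objects."""
    data = json.loads(json_text)
    if not isinstance(data, list):
        raise ValueError("top-level JSON value is not an array")
    if not all(isinstance(record, dict) for record in data):
        raise ValueError("every array element must be a JSON object")
    return len(data)


# Inline stand-in for the contents of sample.json.
sample = '[{"author": "Velez J", "year": 2018}, {"author": "Brignardello M", "year": 2013}]'
print(count_records(sample))  # → 2
```

For the real file, pass the text read from `sample.json` (opened with `encoding="utf-8"`); the returned count should match the number of documents that `mongoimport` reports as imported.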

## GS metadata

Get an API key from your https://www.scraperapi.com/ account.

Now the important thing!

```bash
moai_gslookup --hunabku_server http://127.0.1.1:8080 --proxy_api XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX --max_threads 1 --max_papers 2 --db la --max_tries 2
```

(For wn6, use `http://10.0.0.8:8080`.) Check the results:

```bash
$ mongo
> use la
> show collections
> db.stage.count()
> db.stage.find()
```

To delete a collection
```
> db.stage.drop()
```

## GS Cites
To download the cites, we need to keep track of all the cites of each article. Therefore, we first need to create a cache.

### Creating the cache
```bash
moai_gscites  --hunabku_server http://127.0.1.1:8080 --proxy_apikey xxxxxx --db la --create_cache
```

**Where**:
* `http://127.0.1.1:8080`: the Hunabku server
* `xxxxxx`: the ScraperAPI API key
* `la`: the name of the database that holds the `stage` collection produced by `moai_gslookup`
* `--create_cache`: option to create the cites cache

**Results:**
A new collection called `cache_cites` is created in the database `la`.

**NOTE**: depending on the number of papers and the number of cites per paper, creating the cache can take a while. The process is not parallelized yet :(


### Downloading the cites
At this point we should have the cache ready.
To download the cites please run:

```bash
moai_gscites  --hunabku_server http://127.0.1.1:8080 --proxy_apikey xxxxxxxx  --max_threads 5  --max_tries 2  --db la --max_papers 10000
```

**Where**:
* `http://127.0.1.1:8080`: the Hunabku server
* `xxxxxxxx`: the API key from ScraperAPI
* N (`--max_threads` option): the number of threads, which depends on the plan that you buy from ScraperAPI
* M: the number of cites to download, which depends on the number of API calls that you buy from ScraperAPI. Every cite requires two calls, so if you buy 1000000 calls you can download at most 500000 cites.
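The budget arithmetic in the last item can be written down as a one-line helper. This is just a worked example of the rule stated above (two API calls per cite); the function name `max_cites` is hypothetical.

```python
def max_cites(api_calls: int, calls_per_cite: int = 2) -> int:
    """Upper bound on downloadable cites for a given budget of API calls."""
    if api_calls < 0 or calls_per_cite <= 0:
        raise ValueError("api_calls must be >= 0 and calls_per_cite > 0")
    # Integer division: a cite only counts if all of its calls fit the budget.
    return api_calls // calls_per_cite


print(max_cites(1_000_000))  # → 500000
```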


## Summary:

In [11]:
import getpass

In [12]:
api_key=getpass.getpass()

 ································


In [14]:
summary=f'''
#Creates db.data collection
mongoimport --db la --collection data --file sample.json --jsonArray 

#Creates db.stage collection
moai_gslookup --hunabku_server http://10.0.0.8:8080 --proxy_api {api_key} --max_threads 1 --max_papers 2 --db la --max_tries 2

#Creates db.cache_cites collection
moai_gscites  --hunabku_server http://10.0.0.8:8080 --proxy_apikey {api_key} --db la --create_cache

#Downloads the cites
moai_gscites  --hunabku_server http://10.0.0.8:8080 --proxy_apikey {api_key}  --max_threads 5  --max_tries 2  --db la --max_papers 10000
'''

In [None]:
print(summary)