__PyScopus__: An example of author disambiguation

Scopus sometimes mixes up people with similar names. I recently came up with a fairly simple method for cleaning author publication profiles, though it needs some manual work.

If you can think of a better way, please do let me know!

---

In [3]:
import pyscopus
pyscopus.__version__

'1.0.0a1'

In [6]:
from pyscopus import Scopus
key = 'YOUR_OWN_APIKEY'
scopus = Scopus(key)

In [9]:
import utils

---

When I was collecting data for my own research, I found that [Dr. Vivek K. Singh](http://web.media.mit.edu/~singhv/) has a very [noisy profile in Scopus](https://www.scopus.com/authid/detail.uri?authorId=7404651152). Let's use this as an example.

The basic idea is to match author-affiliation pairs:
- For each paper found in the _mixed profile_
    - Find the focal author (in this case, Dr. Singh)
    - Look at his/her affiliation on that paper
        - Keep the paper if the affiliation is one the author actually belongs to
        - If not, discard the paper
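
The matching rule above can be sketched in a few lines. This is a minimal illustration on toy records, not the actual `utils.check_pub_validity` implementation; the record layout and the `keep_paper` helper are assumptions made for the example.

```python
# Toy records standing in for per-paper author data. The field names are
# hypothetical and do not reflect the actual Scopus response format.
papers = [
    {"scopus_id": "p1", "authors": [
        {"author_id": "7404651152", "affil_ids": ["60030623"]},
        {"author_id": "111", "affil_ids": ["60008721"]},
    ]},
    {"scopus_id": "p2", "authors": [
        {"author_id": "7404651152", "affil_ids": ["60008721"]},
    ]},
    {"scopus_id": "p3", "authors": [
        {"author_id": "222", "affil_ids": ["60022195"]},
    ]},
]


def keep_paper(paper, author_id, affil_id_list):
    """Return True if the focal author is on the paper with a known affiliation."""
    known = set(affil_id_list)
    for author in paper["authors"]:
        if author["author_id"] == author_id:
            # Keep the paper only if at least one listed affiliation is known.
            return bool(known.intersection(author["affil_ids"]))
    return False  # focal author is not listed on this paper


affil_id_list = ["60030623", "60022195", "60007278"]
kept = [p["scopus_id"] for p in papers
        if keep_paper(p, "7404651152", affil_id_list)]
print(kept)
```

Only `p1` survives here: in `p2` the focal author is listed under an affiliation outside the known list, and in `p3` the focal author does not appear at all.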

For Dr. Singh, I manually obtained his affiliation ids by searching through [Scopus affiliation search](https://www.scopus.com/search/form.uri?display=affiliationLookup). With those in hand, create a dictionary containing the _name (first/last)_, an _affiliation name_, and _a list of affiliation ids_. The author and affiliation names are used to search for the author; the list of affiliation ids is used to clean the papers:

- UC Irvine `60007278`
- MIT `60022195`
- Rutgers `60030623`

In [23]:
d = {'authfirst': 'Vivek', 'authlastname': 'Singh', 'affiliation': 'Rutgers',
     'affil_id_list': ['60030623', '60022195', '60007278']
    }
d

{'authfirst': 'Vivek',
 'authlastname': 'Singh',
 'affiliation': 'Rutgers',
 'affil_id_list': ['60030623', '60022195', '60007278']}

In [24]:
query = "AUTHLASTNAME({}) and AUTHFIRST({}) and AFFIL({})".format(d['authlastname'], d['authfirst'], d['affiliation'])
author_search_df = scopus.search_author(query)
author_search_df

| | author_id | name | document_count | affiliation | affiliation_id |
|---|---|---|---|---|---|
| 0 | 7404651152 | Vivek Kumar N. Singh | 490 | Banaras Hindu University | 60008721 |


Sometimes the search returns several author profiles for one author. In this case there is only one, and it is clearly very noisy: 490 documents, with Banaras Hindu University listed as the affiliation.
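
When several profiles do come back, one possible way to decide which to inspect first is to rank them. The sketch below is only an assumption about how that could be done: `rank_profiles` is a hypothetical helper, and the second candidate row is made up for illustration.

```python
# Candidate profiles as plain dicts. The first row mirrors the search result
# above; the second row is invented purely for illustration.
candidates = [
    {"author_id": "7404651152", "document_count": 490, "affiliation_id": "60008721"},
    {"author_id": "0000000001", "document_count": 12, "affiliation_id": "60030623"},
]
known_affils = {"60030623", "60022195", "60007278"}


def rank_profiles(candidates, known_affils):
    """Sort profiles: known affiliation first, then by document count (descending)."""
    return sorted(
        candidates,
        key=lambda c: (c["affiliation_id"] not in known_affils, -c["document_count"]),
    )


ranked = rank_profiles(candidates, known_affils)
for profile in ranked:
    print(profile["author_id"], profile["document_count"], profile["affiliation_id"])
```

The ranking only sets the order for manual review. A profile whose listed affiliation is not in the known list, like the one found here, can still contain the author's papers and still needs to be screened.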

In the following step, I use the helper functions in `utils` to screen each paper attached to the `author_id` found above (`7404651152`).

In [25]:
author_id = '7404651152'
author_id, d['affil_id_list']

('7404651152', ['60030623', '60022195', '60007278'])

The filtering process may take a while, depending on how many documents are mixed up.

In [26]:
# Pass the API key defined earlier as `key` (`apikey` is not defined in this notebook)
filterd_pub_df = utils.check_pub_validity(scopus, author_id, d['affil_id_list'], key)
filterd_pub_df.shape[0], filterd_pub_df.scopus_id.unique().size, filterd_pub_df.scopus_id.isnull().sum()

(134, 134, 0)

The number of papers drops sharply, from the 490 documents listed on the profile to 134. We can now check a random subset to see if the filtered papers make sense for this author.

In [28]:
# Draw 20 rows at random, with replacement (`pd.np` is no longer available in recent pandas)
filterd_pub_df.sample(n=20, replace=True)[['title', 'publication_name']]

| | title | publication_name |
|---|---|---|
| 117 | Effects of high-energy irradiation on silicon ... | Optics InfoBase Conference Papers |
| 34 | "They basically like destroyed the school one ... | Proceedings of the ACM Conference on Computer ... |
| 58 | Situation recognition from multimodal data | MM 2016 - Proceedings of the 2016 ACM Multimed... |
| 70 | On-chip mid-infrared gas detection using chalc... | Applied Physics Letters |
| 212 | Mid-infrared As<inf>2</inf>Se<inf>3</inf> chal... | IEEE International Conference on Group IV Phot... |
| 88 | Geo-intelligence and visualization through big... | Geo-Intelligence and Visualization through Big... |
| 120 | Effects of high-energy irradiation on silicon ... | Optics InfoBase Conference Papers |
| 206 | Low loss mid-infrared silicon waveguides by us... | Frontiers in Optics, FIO 2012 |
| 151 | Predicting spending behavior using socio-mobil... | Proceedings - SocialCom/PASSAT/BigData/EconCom... |
| 154 | Generation of two-cycle pulses and octave-span... | Optics Letters |


However, the sample still contains noise (e.g., papers published in optics/photonics venues). We can manually exclude those as well:

In [44]:
# Venue keywords that indicate papers from a different research field
noise_keywords = ['optic', 'photonic', 'nano', 'quantum', 'sensor', 'cleo', 'materials', 'physics', 'chip']
# Flag every paper whose venue name contains any keyword (case-insensitive)
is_noise = filterd_pub_df.publication_name.str.contains('|'.join(noise_keywords), case=False, na=False)
filterd_pub_df = filterd_pub_df[~is_noise]
filterd_pub_df.shape

(59, 16)
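
Because the keywords are chosen by hand, it can help to see which keyword removed which venue, so that an overly broad term (for instance `chip` or `sensor`) does not silently drop legitimate papers. The following is a minimal sketch in plain Python; `matched_keywords` is a hypothetical helper, and the venue list is a small hand-picked subset of the names shown in the sample above.

```python
# A few venue names taken from the sample output above
venues = [
    "Optics Letters",
    "Applied Physics Letters",
    "IEEE Internet Computing",
    "International Journal of Information Security",
    "Frontiers in Optics, FIO 2012",
]
keywords = ["optic", "photonic", "nano", "quantum", "sensor",
            "cleo", "materials", "physics", "chip"]


def matched_keywords(venue, keywords):
    """Return the exclusion keywords found in a venue name (case-insensitive)."""
    name = venue.lower()
    return [kw for kw in keywords if kw in name]


# Map each removed venue to the keywords responsible for removing it
removed = {v: matched_keywords(v, keywords)
           for v in venues if matched_keywords(v, keywords)}
kept = [v for v in venues if not matched_keywords(v, keywords)]

for venue, hits in removed.items():
    print(f"{venue!r} removed by {hits}")
print("kept:", kept)
```

Reviewing the `removed` mapping before applying the filter makes it easier to spot a keyword that matches more venues than intended.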

Let's check a random subset again:

In [46]:
# Draw 20 rows at random, with replacement (`pd.np` is no longer available in recent pandas)
filterd_pub_df.sample(n=20, replace=True)[['title', 'publication_name']]

| | title | publication_name |
|---|---|---|
| 94 | Physical-Cyber-Social Computing: Looking Back,... | IEEE Internet Computing |
| 58 | Situation recognition from multimodal data | MM 2016 - Proceedings of the 2016 ACM Multimed... |
| 151 | Predicting spending behavior using socio-mobil... | Proceedings - SocialCom/PASSAT/BigData/EconCom... |
| 28 | Toward multimodal cyberbullying detection | Conference on Human Factors in Computing Syste... |
| 61 | If it looks like a spammer and behaves like a ... | International Journal of Information Security |
| 134 | Cyber bullying detection using social and text... | SAM 2014 - Proceedings of the 3rd Internationa... |
| 45 | From sensors to sense-making: Opportunities an... | Proceedings of the Association for Information... |
| 45 | From sensors to sense-making: Opportunities an... | Proceedings of the Association for Information... |
| 297 | Structural analysis of the emerging event-web | Proceedings of the 19th International Conferen... |
| 40 | Using cognitive dissonance theory to understan... | Proceedings of the Association for Information... |


This looks much better, and we can use the cleaned paper list for the focal author.