## String basics
Strings in Python come with a number of useful features and methods.

In [260]:
quickfox = "the quick brown fox jumped over the lazy dog. "

In [261]:
quickfox.capitalize()

'The quick brown fox jumped over the lazy dog. '

In [262]:
quickfox.upper()

'THE QUICK BROWN FOX JUMPED OVER THE LAZY DOG. '

In [263]:
'fox' in quickfox

True

In [264]:
quickfox.startswith('fox')

False

In [265]:
quickfox.find("fox")

16

In [266]:
quickfox[16:]

'fox jumped over the lazy dog. '
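One caveat with .find(): it returns -1 when the substring is absent, and -1 is also a valid index (the last character), so slicing with an unchecked result can silently give the wrong answer. A minimal sketch, reusing the quickfox string; the search term 'cat' is only an illustrative choice:

```python
quickfox = "the quick brown fox jumped over the lazy dog. "

pos = quickfox.find("cat")      # substring is absent, so find() returns -1
print(pos)

# Guard against -1 before slicing; quickfox[-1:] would be the last character
tail = quickfox[pos:] if pos != -1 else ""
print(repr(tail))

# .index() behaves like .find() but raises ValueError when nothing is found
try:
    quickfox.index("cat")
except ValueError:
    print("'cat' not found")
```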

In [267]:
quickfox.count('fox')

1

In [268]:
quickfox.replace('fox', 'hare').replace('lazy', 'adorable')

'the quick brown hare jumped over the adorable dog. '
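Note that none of these methods change quickfox itself: Python strings are immutable, so every method returns a new string. A short sketch of what that means in practice (the names hare and quickfox_upper are illustrative):

```python
quickfox = "the quick brown fox jumped over the lazy dog. "

hare = quickfox.replace('fox', 'hare')
print(hare)
print(quickfox)   # the original string is unchanged

# To keep a change, assign the result to a name
quickfox_upper = quickfox.upper()
print(quickfox_upper)
```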

Splitting strings is an important standard operation that produces a list of substrings, based on a defined separator. In this case, we split the sentence on single spaces.

In [269]:
quickfox.split(' ')

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.', '']

Note the empty string at the end of the list. It appears because the original string ends with a space. We can use the .strip() method to remove leading and trailing whitespace from a string.

In [270]:
quickfox.strip()

'the quick brown fox jumped over the lazy dog.'
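As a sketch of the options for avoiding that trailing empty string: calling .split() without a separator splits on runs of whitespace and drops empty strings, which for this sentence gives the same result as stripping first. The variable names below are illustrative.

```python
quickfox = "the quick brown fox jumped over the lazy dog. "

# An explicit ' ' separator keeps the empty string after the trailing space
with_sep = quickfox.split(' ')

# With no separator, split() splits on runs of whitespace and drops empty strings
no_sep = quickfox.split()

# Stripping first gives the same result for this sentence
stripped = quickfox.strip().split(' ')

print(with_sep[-1] == '', no_sep == stripped)
```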

.join() is a powerful method that joins a list of strings together, using the string it is called on as the separator. In this case, we join a list of number words with ' and '.

In [271]:
example_list = ['one', 'two', 'three', 'four']

In [272]:
' and '.join(example_list)

'one and two and three and four'
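.join() only accepts strings, so a list of actual numbers has to be converted first. A minimal sketch (the numbers list is an illustrative assumption), which also shows that split and join undo each other when the same separator is used:

```python
numbers = [1, 2, 3, 4]

# join() raises TypeError on non-string items, so convert them first
joined = ', '.join(str(n) for n in numbers)
print(joined)

# split() and join() are inverses when the same separator is used
words = 'one and two and three and four'.split(' and ')
print(' and '.join(words))
```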

## Pandas string types

In [273]:
import pandas as pd

In [274]:
df = pd.read_csv('../data/publications.txt', sep='\t', encoding='utf-8', dtype={'authors': 'string', 'journal_title': 'string', 'paper_title': 'string', 'abstract': 'string'})

In [299]:
df

Unnamed: 0,authors,journal_title,pub_year,n_cits,n_refs,paper_title,abstract
0,Edward J. Hackett,,2000,10,0,12 interdisciplinary research initiatives at the u s national science foundation,
1,Wilhelm Krull,,2000,3,0,13 beyond the ivory tower some observations on external funding of interdisciplinary research in...,
2,Anthony Rj. Van Raan,,2000,21,0,4 the interdisciplinary nature of science theoretical framework and bibliometric empirical approach,
3,Anne Brüggemann-Klein; Rolf Klein; Britta Landgraf,,2000,9,9,bibrelex exploring bibliographic databases by visualization of annotated content based relations,Traditional searching and browsing functions for bibliographic databases no longer enable resear...
4,Howard D. White; Jan W. Buzydlowski; Xia Lin,,2000,24,18,co cited author maps as interfaces to digital libraries designing pathfinder networks in the hum...,"Using data from the Arts and Humanities Citation Index for 1988-1997, we are attempting to gener..."
...,...,...,...,...,...,...,...
30016,Alexander F. Post; Adam Y. Li; Jennifer B. Dai; Stanislaw Sobotka; Syed Haider; Tanvir F. Choudh...,World Neurosurgery,2019,0,17,academic productivity of spine surgeons at united states neurological surgery and orthopedic sur...,"Objective Spinal surgery is taught and practiced within 2 different surgical disciplines, neuro..."
30017,Ali Akhaddar,World Neurosurgery,2019,1,18,contribution of moroccan neurosurgeons to the world neurosurgical data in pubmed a bibliometric ...,"Background Medical publications reflect the development of training, research, and health servi..."
30018,Walter C. Jean; Daniel R. Felbaum,World Neurosurgery,2019,0,18,impact of training and practice environment on academic productivity of early career academic ne...,Background Factors affecting academic productivity of neurosurgeons are increasingly being stud...
30019,Chesney S. Oravec; Casey D. Frey; Benjamin W. Berwick; Stacey Quintero Wolfe; Carol A. Aschenbre...,World Neurosurgery,2019,0,20,predictors of citations in neurosurgical research,Objective The number of citations an article receives is an important measure of impact for pub...


In [297]:
df['paper_title'].str.title()

0                           12 Interdisciplinary Research Initiatives At The U S National Science Foundation
1        13 Beyond The Ivory Tower Some Observations On External Funding Of Interdisciplinary Research In...
2        4 The Interdisciplinary Nature Of Science Theoretical Framework And Bibliometric Empirical Approach
3           Bibrelex Exploring Bibliographic Databases By Visualization Of Annotated Content Based Relations
4        Co Cited Author Maps As Interfaces To Digital Libraries Designing Pathfinder Networks In The Hum...
                                                        ...                                                 
30016    Academic Productivity Of Spine Surgeons At United States Neurological Surgery And Orthopedic Sur...
30017    Contribution Of Moroccan Neurosurgeons To The World Neurosurgical Data In Pubmed A Bibliometric ...
30018    Impact Of Training And Practice Environment On Academic Productivity Of Early Career Academic Ne...
30019              

In [298]:
df['paper_title'].str.find('citation')

0        -1
1        -1
2        -1
3        -1
4        -1
         ..
30016    -1
30017    -1
30018    -1
30019    14
30020    -1
Name: paper_title, Length: 30021, dtype: Int64

In [277]:
df['authors'].str.split('; ')

0                                                                                        [Edward J. Hackett]
1                                                                                            [Wilhelm Krull]
2                                                                                     [Anthony Rj. Van Raan]
3                                                       [Anne Brüggemann-Klein, Rolf Klein, Britta Landgraf]
4                                                             [Howard D. White, Jan W. Buzydlowski, Xia Lin]
                                                        ...                                                 
30016    [Alexander F. Post, Adam Y. Li, Jennifer B. Dai, Stanislaw Sobotka, Syed Haider, Tanvir F. Choud...
30017                                                                                         [Ali Akhaddar]
30018                                                                    [Walter C. Jean, Daniel R. Felbaum]
30019    [Chesney S

In [293]:
df['authors'].str.contains('van Eck')

0        False
1        False
2        False
3        False
4        False
         ...  
30016    False
30017    False
30018    False
30019    False
30020    False
Name: authors, Length: 30021, dtype: boolean

In [296]:
df.loc[df['authors'].str.contains('van Eck')]

Unnamed: 0,authors,journal_title,pub_year,n_cits,n_refs,paper_title,abstract
3718,N.J. van Eck; Ludo Waltman; J.W.O. van den Berg,,2005,21,9,a novel algorithm for visualizing concept associations,An associative concept space is a map that visualizes the associations between concepts in a sci...
4784,N.J. van Eck; Flavius Frasincar; J.W.O. van den Berg,,2006,10,8,visualizing concept associations using concept density maps,The concept mapping algorithm proposed in an earlier paper is one of the dimensionality reductio...
4785,N.J. van Eck; Ludo Waltman; J.W.O. van den Berg; Uzay Kaymak,,2006,6,10,visualizing the wcci 2006 knowledge domain,"In this paper, a knowledge domain visualization approach is applied to the computational intelli..."
5036,Nees Jan van Eck; Ludo Waltman,ERIM report series research in management Erasmus Research Institute of Management,2006,102,7,vos a new method for visualizing similarities between objects,"We present a new method for visualizing similarities between objects. The method is called VOS, ..."
5082,Nees Jan van Eck; Ludo Waltman; Jan van den Berg; Uzay Kaymak,IEEE Computational Intelligence Magazine,2006,38,4,visualizing the computational intelligence field,"n this paper, we visualize the struc- ture and the evolution of the compu- tational intelligence..."
...,...,...,...,...,...,...,...
26973,Rene Mahieu; Nees Jan van Eck; David van Putten; Jeroen van den Hoven,Ethics and Information Technology,2018,2,18,from dignity to security protocols a scientometric analysis of digital ethics,"Our lives are increasingly intertwined with the digital realm, and with new technology, new ethi..."
27233,Nees Jan van Eck; Ludo Waltman,Journal of Data and Information Science,2018,0,8,analyzing the activities of visitors of the leiden ranking website,Purpose: To get a better understanding of the way in which university rankings are used.Design/m...
27308,Kevin W. Boyack; Nees Jan van Eck; Giovanni Colavizza; Ludo Waltman,Journal of Informetrics,2018,28,25,characterizing in text citations in scientific articles a large scale analysis,We report characteristics of in-text citations in over five million full text articles from two ...
27453,Magnus Palmblad; Nees Jan van Eck,Journal of the American Society for Mass Spectrometry,2018,0,8,bibliometric analyses reveal patterns of collaboration between asms members,We have explored the collaborative network of the current American Society for Mass Spectrometry...
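.str.contains() is case-sensitive by default and propagates missing values. The sketch below shows the case and na parameters on a small, hypothetical Series (not the publications data), assuming the same 'string' dtype used when loading the file:

```python
import pandas as pd

# Hypothetical toy data, not taken from publications.txt
authors = pd.Series(
    ['N.J. van Eck; Ludo Waltman', 'Nees Jan Van Eck', None],
    dtype='string',
)

# Default: case-sensitive, and missing values stay missing (<NA>)
strict = authors.str.contains('van Eck')

# case=False ignores capitalisation; na=False treats missing values as non-matches
relaxed = authors.str.contains('van eck', case=False, na=False)

print(strict.tolist())
print(relaxed.tolist())
```

Filling missing values with False is convenient when the result is used as a filter, since every row then has a definite True or False.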


## Formatted strings
Formatted strings (f-strings) allow you to insert variables, and even expressions, within a string.

In [305]:
name='Wout'
f'My name is {name}'

'My name is Wout'

In [307]:
a='Amsterdam'
b='the Netherlands'
c=800000
f'{a} is the capital of {b} and it has a population of over {c}'

'Amsterdam is the capital of the Netherlands and it has a population of over 800000'
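F-strings also accept a format specification after a colon, which helps with large or fractional numbers. A short sketch; the ratio value is only an illustrative assumption:

```python
a = 'Amsterdam'
c = 800000

# ':,' inserts thousands separators
msg = f'{a} has a population of over {c:,}'
print(msg)

# ':.2f' rounds to two decimal places
ratio = 2 / 3
rounded = f'{ratio:.2f}'
print(rounded)
```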

In [324]:
a = 5
b = '5'
c = 10
f'{a} times {c} is {a*c} but {b} times {c} is {b*c}'

'5 times 10 is 50 but 5 times 10 is 5555555555'

Beware: either ' or " can be used to define strings, but the two typically do not mix. If you use string literals within the expressions in an f-string, they must use a different quote style than the f-string itself. Otherwise you get a syntax error in Python versions before 3.12, as in the example below.

In [321]:
f'The dataframe contains {df['authors'].str.contains('van Eck').sum()} articles by Nees Jan van Eck.' # this returns an error

SyntaxError: invalid syntax (Temp/ipykernel_4404/2659263662.py, line 1)

In [380]:
f"The dataframe contains {df['authors'].str.contains('van Eck').sum()} articles by Nees Jan van Eck." # this works!

'The dataframe contains 62 articles by Nees Jan van Eck.'
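Another way to sidestep the quoting problem is to compute the value before building the string. A sketch using a plain dictionary as a hypothetical stand-in for the dataframe lookup (counts and n_articles are illustrative names):

```python
counts = {'van Eck': 62}   # hypothetical stand-in for the dataframe lookup

# Option 1: different quote styles inside and outside the f-string
msg1 = f"The dataframe contains {counts['van Eck']} articles by Nees Jan van Eck."

# Option 2: compute the value first, so the f-string needs no inner quotes
n_articles = counts['van Eck']
msg2 = f'The dataframe contains {n_articles} articles by Nees Jan van Eck.'

print(msg1 == msg2)
```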

## Regular expressions
A powerful tool for parsing and editing string data.

In [381]:
import re

Let's start by retrieving the abstract of Vincent's paper in the data.

In [382]:
vincent_abstract = df.loc[df['authors'].str.contains('Traag') & df['abstract'].notna(), 'abstract'].iloc[0]
vincent_abstract

'When performing a national research assessment, some countries rely on citation metrics whereas others, such as the UK, primarily use peer review. In the influential Metric Tide report, a low agreement between metrics and peer review in the UK Research Excellence Framework (REF) was found. However, earlier studies observed much higher agreement between metrics and peer review in the REF and argued in favour of using metrics. This shows that there is considerable ambiguity in the discussion on agreement between metrics and peer review. We provide clarity in this discussion by considering four important points: (1) the level of aggregation of the analysis; (2) the use of either a size-dependent or a size-independent perspective; (3) the suitability of different measures of agreement; and (4) the uncertainty in peer review. In the context of the REF, we argue that agreement between metrics and peer review should be assessed at the institutional level rather than at the publication level.

Regular expressions allow you to quickly search and manipulate strings. They use wildcards, patterns, quantifiers, and character groups. For instance, we can find any numeric character:

In [383]:
re.findall('[0-9]', vincent_abstract)

['1', '2', '3', '4']

Regex uses a number of special characters, such as parentheses and square brackets, to denote groups of characters. If you want to explicitly look for these, you need to escape them with a backslash. Writing the pattern as a raw string (r'...') keeps Python from interpreting the backslashes itself.

In [384]:
re.findall(r'\([0-9]\)', vincent_abstract)

['(1)', '(2)', '(3)', '(4)']
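When the text to match is known literally, re.escape() can add the backslashes for you. A minimal sketch on an illustrative string (not the abstract), with patterns written as raw strings so that Python leaves the backslashes alone:

```python
import re

text = 'four points: (1) level; (2) perspective'   # illustrative string

# Manual escaping of the parentheses, written as a raw string
manual = re.findall(r'\([0-9]\)', text)

# re.escape() adds the backslashes needed to match a literal string
pattern = re.escape('(1)')
escaped = re.findall(pattern, text)

print(manual, pattern, escaped)
```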

Quantifiers can be used to denote numbers of characters to look for. Let's find any substring that consists of at least two capital letters.

In [385]:
re.findall('[A-Z]{2,}', vincent_abstract)

['UK', 'UK', 'REF', 'REF', 'REF', 'REF']

Finally, let's use wildcards to match any character between the numbers in parentheses, and ending at the first semicolon or period.

In [386]:
re.findall(r'\([0-9]\).*?[;.]', vincent_abstract)

['(1) the level of aggregation of the analysis;',
 '(2) the use of either a size-dependent or a size-independent perspective;',
 '(3) the suitability of different measures of agreement;',
 '(4) the uncertainty in peer review.']
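The ? after .* makes the wildcard non-greedy, so it stops at the first semicolon or period instead of the last. The sketch below contrasts the two on a short illustrative string:

```python
import re

text = 'points: (1) level; (2) perspective; (3) measures.'   # illustrative string

# Non-greedy: .*? stops at the first ';' or '.' after each number
lazy = re.findall(r'\([0-9]\).*?[;.]', text)

# Greedy: .* runs on to the last ';' or '.' in the string
greedy = re.findall(r'\([0-9]\).*[;.]', text)

print(lazy)
print(greedy)
```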

Regex allows for more than just finding or matching patterns. It can also be used to substitute a pattern with a new string.

In [387]:
re.sub(r'\([0-9]\).*?[;.]', '<SENTENCE REMOVED>', vincent_abstract)

'When performing a national research assessment, some countries rely on citation metrics whereas others, such as the UK, primarily use peer review. In the influential Metric Tide report, a low agreement between metrics and peer review in the UK Research Excellence Framework (REF) was found. However, earlier studies observed much higher agreement between metrics and peer review in the REF and argued in favour of using metrics. This shows that there is considerable ambiguity in the discussion on agreement between metrics and peer review. We provide clarity in this discussion by considering four important points: <SENTENCE REMOVED> <SENTENCE REMOVED> <SENTENCE REMOVED> and <SENTENCE REMOVED> In the context of the REF, we argue that agreement between metrics and peer review should be assessed at the institutional level rather than at the publication level. Both a size-dependent and a size-independent perspective are relevant in the REF. The interpretation of correlations may be problematic
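Substitution can also reuse parts of the match. Unescaped parentheses create a capture group, and \1 in the replacement refers to whatever the first group matched. A small sketch on an illustrative string:

```python
import re

text = 'points: (1) level; (2) perspective; (3) measures.'   # illustrative string

# The inner, unescaped parentheses capture the digit; \1 inserts it again
renumbered = re.sub(r'\(([0-9])\)', r'\1.', text)
print(renumbered)
```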

Other important built-in features are anchors that match the start of a string (^) and the end of a string ($). We can, for instance, extract the first sentence of the abstract by searching for a pattern that runs from the start of the string up to the first period. re.search returns a match object, which contains both the matched text and its location in the original string.

In [403]:
re.search(r'^.*?\.', vincent_abstract)

<re.Match object; span=(0, 146), match='When performing a national research assessment, s>
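The match object exposes the matched text and its position through methods. A short sketch on an illustrative string, which also uses the $ anchor to pick out the last sentence:

```python
import re

text = 'First sentence. Second sentence. Last one.'   # illustrative string

m = re.search(r'^.*?\.', text)
first = m.group()      # the matched text
first_span = m.span()  # (start, end) positions in the original string

# '$' anchors the pattern to the end of the string
last = re.search(r'[^.]*\.$', text).group().strip()

print(first, first_span, last)

# re.search returns None when nothing matches, so check before calling .group()
no_match = re.search(r'[0-9]', text)
print(no_match is None)
```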

There are many more things that you can do with regular expressions, which we will not get into today, as it gets rather complex very fast.

## NLTK
So far, most of these string operations have been possible within SQL as well, so you might be asking: why Python? The Natural Language Toolkit (NLTK) is the first of a long list of libraries that allow you to do much more with text data. First, we need to download the NLTK corpus and model files. Run the cell below, then download the 'popular' packages; that is enough for now.

In [414]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

When working with longer texts, it is often useful to break them up into individual sentences, or even words. This is called tokenization.

In [467]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize

In [480]:
vincent_sentences = sent_tokenize(vincent_abstract)
vincent_sentences

['When performing a national research assessment, some countries rely on citation metrics whereas others, such as the UK, primarily use peer review.',
 'In the influential Metric Tide report, a low agreement between metrics and peer review in the UK Research Excellence Framework (REF) was found.',
 'However, earlier studies observed much higher agreement between metrics and peer review in the REF and argued in favour of using metrics.',
 'This shows that there is considerable ambiguity in the discussion on agreement between metrics and peer review.',
 'We provide clarity in this discussion by considering four important points: (1) the level of aggregation of the analysis; (2) the use of either a size-dependent or a size-independent perspective; (3) the suitability of different measures of agreement; and (4) the uncertainty in peer review.',
 'In the context of the REF, we argue that agreement between metrics and peer review should be assessed at the institutional level rather than at t

In [496]:
# break the abstract into individual words
vincent_words = word_tokenize(vincent_abstract)
vincent_words

['When',
 'performing',
 'a',
 'national',
 'research',
 'assessment',
 ',',
 'some',
 'countries',
 'rely',
 'on',
 'citation',
 'metrics',
 'whereas',
 'others',
 ',',
 'such',
 'as',
 'the',
 'UK',
 ',',
 'primarily',
 'use',
 'peer',
 'review',
 '.',
 'In',
 'the',
 'influential',
 'Metric',
 'Tide',
 'report',
 ',',
 'a',
 'low',
 'agreement',
 'between',
 'metrics',
 'and',
 'peer',
 'review',
 'in',
 'the',
 'UK',
 'Research',
 'Excellence',
 'Framework',
 '(',
 'REF',
 ')',
 'was',
 'found',
 '.',
 'However',
 ',',
 'earlier',
 'studies',
 'observed',
 'much',
 'higher',
 'agreement',
 'between',
 'metrics',
 'and',
 'peer',
 'review',
 'in',
 'the',
 'REF',
 'and',
 'argued',
 'in',
 'favour',
 'of',
 'using',
 'metrics',
 '.',
 'This',
 'shows',
 'that',
 'there',
 'is',
 'considerable',
 'ambiguity',
 'in',
 'the',
 'discussion',
 'on',
 'agreement',
 'between',
 'metrics',
 'and',
 'peer',
 'review',
 '.',
 'We',
 'provide',
 'clarity',
 'in',
 'this',
 'discussion',
 'by',
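To get a feel for what word tokenization involves, here is a rough regex approximation. It is not how NLTK's word_tokenize works internally, only a sketch that separates words (keeping hyphenated words together) from punctuation on an illustrative sentence:

```python
import re

text = 'In the context of the REF, we consider (1) a size-dependent perspective.'

# Either a word (optionally hyphenated) or a single non-space, non-word character
rough_tokens = re.findall(r'\w+(?:-\w+)*|[^\w\s]', text)
print(rough_tokens)
```

A trained tokenizer handles many cases this pattern does not, such as contractions and abbreviations.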

Note that there are a lot of 'stopwords' in sentences. These typically add little to a quantitative analysis of text, and can be removed. NLTK has lists of stopwords for various languages. Let's remove these from the text.

In [497]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))
print(stop_words)

{'then', "haven't", 'only', 'does', 'aren', 'yourselves', 'myself', 'from', 'they', 'doesn', 'has', "shouldn't", 'whom', 'do', 'now', "that'll", 'will', 'such', 'how', 'while', 'very', 'll', "shan't", 'their', 'can', 'ma', 'below', "aren't", 'been', 'in', 'was', 'too', 'yours', 'did', 'when', 'needn', 'won', "she's", "it's", 'on', 'his', 'than', 're', 'just', 'hasn', 'itself', 'over', 'for', 'she', 'this', 'who', 'to', "isn't", "you're", 'out', 'off', "needn't", 'my', 'at', 'hers', 'some', 'hadn', 'again', 'isn', 'above', 'what', 'any', 'am', 'it', 'your', 'our', "should've", "mightn't", "wasn't", "wouldn't", 'no', 'being', 'until', 'because', 'same', 'wasn', 'ours', 'don', 've', "didn't", 'other', 'shan', 'against', 'after', 'into', "weren't", 'you', 'between', 'which', 'me', 'have', 'm', 'didn', 'about', 'all', 'ain', "hasn't", 'but', 'down', 'each', 'a', 'its', 'or', 'where', 'theirs', "you've", 'are', "you'll", 'and', 'under', 'once', 'd', 'further', "you'd", 'couldn', 'nor', "don'

In [498]:
vincent_words = [w.lower() for w in vincent_words if w.lower() not in stop_words]
print(vincent_words)

['performing', 'national', 'research', 'assessment', ',', 'countries', 'rely', 'citation', 'metrics', 'whereas', 'others', ',', 'uk', ',', 'primarily', 'use', 'peer', 'review', '.', 'influential', 'metric', 'tide', 'report', ',', 'low', 'agreement', 'metrics', 'peer', 'review', 'uk', 'research', 'excellence', 'framework', '(', 'ref', ')', 'found', '.', 'however', ',', 'earlier', 'studies', 'observed', 'much', 'higher', 'agreement', 'metrics', 'peer', 'review', 'ref', 'argued', 'favour', 'using', 'metrics', '.', 'shows', 'considerable', 'ambiguity', 'discussion', 'agreement', 'metrics', 'peer', 'review', '.', 'provide', 'clarity', 'discussion', 'considering', 'four', 'important', 'points', ':', '(', '1', ')', 'level', 'aggregation', 'analysis', ';', '(', '2', ')', 'use', 'either', 'size-dependent', 'size-independent', 'perspective', ';', '(', '3', ')', 'suitability', 'different', 'measures', 'agreement', ';', '(', '4', ')', 'uncertainty', 'peer', 'review', '.', 'context', 'ref', ',', 'a

Let's also remove all tokens that start with a non-alphabetical character, such as punctuation and digits, with a simple regular expression.

In [499]:
vincent_words = [w for w in vincent_words if not re.match('[^a-z]', w)]
print(vincent_words)

['performing', 'national', 'research', 'assessment', 'countries', 'rely', 'citation', 'metrics', 'whereas', 'others', 'uk', 'primarily', 'use', 'peer', 'review', 'influential', 'metric', 'tide', 'report', 'low', 'agreement', 'metrics', 'peer', 'review', 'uk', 'research', 'excellence', 'framework', 'ref', 'found', 'however', 'earlier', 'studies', 'observed', 'much', 'higher', 'agreement', 'metrics', 'peer', 'review', 'ref', 'argued', 'favour', 'using', 'metrics', 'shows', 'considerable', 'ambiguity', 'discussion', 'agreement', 'metrics', 'peer', 'review', 'provide', 'clarity', 'discussion', 'considering', 'four', 'important', 'points', 'level', 'aggregation', 'analysis', 'use', 'either', 'size-dependent', 'size-independent', 'perspective', 'suitability', 'different', 'measures', 'agreement', 'uncertainty', 'peer', 'review', 'context', 'ref', 'argue', 'agreement', 'metrics', 'peer', 'review', 'assessed', 'institutional', 'level', 'rather', 'publication', 'level', 'size-dependent', 'size-
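re.match only tests the beginning of each token, which is why hyphenated words such as 'size-dependent' survive the filter. If you wanted to keep purely alphabetical tokens instead, re.fullmatch is one option. A sketch on a few illustrative tokens:

```python
import re

tokens = ['peer', ',', '(', '1', 'size-dependent', 'ref']   # illustrative tokens

# Drops tokens whose first character is not a lowercase letter
first_char_filter = [w for w in tokens if not re.match('[^a-z]', w)]

# Keeps only tokens made up entirely of lowercase letters
strict_filter = [w for w in tokens if re.fullmatch('[a-z]+', w)]

print(first_char_filter)
print(strict_filter)
```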

### Lemmatization and stemming
Stemming reduces words to a base stem form by using predefined rules to trim the endings of nouns and verbs.

In [515]:
from nltk.stem.porter import PorterStemmer

In [516]:
ps = PorterStemmer()

for w in vincent_words:
    stemmed = ps.stem(w)
    if w != stemmed:
        print(w, " : ", stemmed)

performing  :  perform
national  :  nation
assessment  :  assess
countries  :  countri
rely  :  reli
citation  :  citat
metrics  :  metric
whereas  :  wherea
others  :  other
primarily  :  primarili
influential  :  influenti
metrics  :  metric
excellence  :  excel
however  :  howev
studies  :  studi
observed  :  observ
metrics  :  metric
argued  :  argu
using  :  use
metrics  :  metric
shows  :  show
considerable  :  consider
ambiguity  :  ambigu
discussion  :  discuss
metrics  :  metric
provide  :  provid
clarity  :  clariti
discussion  :  discuss
considering  :  consid
important  :  import
points  :  point
aggregation  :  aggreg
analysis  :  analysi
size-dependent  :  size-depend
size-independent  :  size-independ
perspective  :  perspect
suitability  :  suitabl
different  :  differ
measures  :  measur
uncertainty  :  uncertainti
argue  :  argu
metrics  :  metric
assessed  :  assess
institutional  :  institut
publication  :  public
size-dependent  :  size-depend
size-independent  :  
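To illustrate the idea of rule-based trimming, here is a deliberately crude toy stemmer. It is not the Porter algorithm, which applies many more rules in several phases; toy_stem and its three suffix rules are assumptions made up for this sketch.

```python
def toy_stem(word):
    """Strip the first matching suffix; a crude stand-in for real stemming rules."""
    for suffix in ('ing', 'ed', 's'):
        # Only trim when a stem of at least three letters remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words = ['performing', 'argued', 'metrics', 'is', 'peer']
stems = [toy_stem(w) for w in words]
print(stems)
```

As with the real stemmer, the result ('argu') does not have to be an existing word; the goal is only to map related word forms to the same stem.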

Lemmatization looks up words in a dictionary and replaces them with their base form, if found. The downside is that words that are not found are left unchanged.

In [517]:
from nltk.stem.wordnet import WordNetLemmatizer

In [518]:
lemmatizer = WordNetLemmatizer()

for w in vincent_words:
    lemmed = lemmatizer.lemmatize(w)
    if w != lemmed:
        print(w, " : ", lemmed)

countries  :  country
metrics  :  metric
metrics  :  metric
studies  :  study
metrics  :  metric
metrics  :  metric
shows  :  show
metrics  :  metric
points  :  point
measures  :  measure
metrics  :  metric
correlations  :  correlation
measures  :  measure
differences  :  difference
metrics  :  metric
outcomes  :  outcome
physics  :  physic
metrics  :  metric
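The lookup idea can be sketched with a plain dictionary. This is a toy illustration, not WordNet: lemma_table and toy_lemmatize are hypothetical, and the table holds only three entries.

```python
# Hypothetical mini-dictionary mapping word forms to their base form
lemma_table = {'countries': 'country', 'studies': 'study', 'metrics': 'metric'}

def toy_lemmatize(word):
    """Return the base form if the word is known, else the word unchanged."""
    return lemma_table.get(word, word)

words = ['countries', 'studies', 'bibliometrics']
lemmas = [toy_lemmatize(w) for w in words]
print(lemmas)
```

The unknown word 'bibliometrics' passes through unchanged, which is the downside described above.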


## POS tagging
We can also find part-of-speech tags (nouns, verbs, etc.) using NLTK. This allows us to extract, for instance, all verbs from Vincent's abstract. First, let's return to the original sentences, tokenize each one into words, and tag the words sentence by sentence. See https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html for a list of all POS tags.

In [534]:
from nltk import pos_tag
sent_pos_tags = [pos_tag(word_tokenize(sent)) for sent in vincent_sentences]
print(sent_pos_tags[0])

[('When', 'WRB'), ('performing', 'VBG'), ('a', 'DT'), ('national', 'JJ'), ('research', 'NN'), ('assessment', 'NN'), (',', ','), ('some', 'DT'), ('countries', 'NNS'), ('rely', 'VBP'), ('on', 'IN'), ('citation', 'NN'), ('metrics', 'NNS'), ('whereas', 'JJ'), ('others', 'NNS'), (',', ','), ('such', 'JJ'), ('as', 'IN'), ('the', 'DT'), ('UK', 'NNP'), (',', ','), ('primarily', 'RB'), ('use', 'VB'), ('peer', 'NN'), ('review', 'NN'), ('.', '.')]


In [536]:
# retrieve verbs: all Penn Treebank verb tags start with 'V'
[[word for word, tag in s if tag.startswith('V')] for s in sent_pos_tags]

[['performing', 'rely', 'use'],
 ['peer', 'was', 'found'],
 ['observed', 'peer', 'argued', 'using'],
 ['shows', 'is', 'peer'],
 ['provide', 'considering'],
 ['argue', 'peer', 'be', 'assessed'],
 ['are'],
 ['be', 'therefore', 'are', 'based', 'peer'],
 ['get', 'rely', 'bootstrap'],
 ['conclude', 'agree', 'offer', 'peer']]
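The same filtering pattern works for other word classes, since all Penn Treebank noun tags start with 'NN'. A sketch on a handful of (word, tag) pairs copied from the tagged first sentence above, so that it runs without NLTK:

```python
# A few (word, tag) pairs taken from the tagged first sentence
tagged = [('some', 'DT'), ('countries', 'NNS'), ('rely', 'VBP'), ('on', 'IN'),
          ('citation', 'NN'), ('metrics', 'NNS'), ('the', 'DT'), ('UK', 'NNP')]

# Tuple unpacking gives each pair readable names
nouns = [word for word, tag in tagged if tag.startswith('NN')]
verbs = [word for word, tag in tagged if tag.startswith('V')]

print(nouns)
print(verbs)
```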