## String basics
Strings in Python come with a number of useful features and methods.

In [None]:
quickfox = "the quick brown fox jumped over the lazy dog. "

In [None]:
quickfox.capitalize()

In [None]:
quickfox.upper()

In [None]:
'fox' in quickfox

In [None]:
quickfox.startswith('fox')

In [None]:
quickfox.find("fox")

In [None]:
quickfox[16:]
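
A small aside on `.find()`: it returns the index of the first match, or `-1` when the substring is absent, so it can be worth checking the result before slicing. A minimal sketch, reusing the sentence from above (the variable names `start`, `missing` and `word` are only illustrative):

```python
quickfox = "the quick brown fox jumped over the lazy dog. "

start = quickfox.find("fox")      # index of the first match
missing = quickfox.find("cat")    # -1, because "cat" does not occur

# Slice out exactly the word that was found.
word = quickfox[start:start + len("fox")]
print(start, missing, word)
```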

In [None]:
quickfox.count('fox')

In [None]:
quickfox.replace('fox', 'hare').replace('lazy', 'adorable')

Splitting strings is an important standard operation that allows you to produce lists of substrings, based on a defined separator. In this case, we split the sentence on single spaces.

In [None]:
quickfox.split(' ')

Note the empty string at the end of the list. It is there because the original string ends with a space. We can use the .strip() method to remove leading and trailing whitespace from a string.

In [None]:
quickfox.strip()
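
To make the difference concrete, the sketch below compares three ways of splitting the same sentence. Calling `.split()` without an argument splits on any run of whitespace and drops empty strings, which is often what you want. The variable names are only illustrative.

```python
quickfox = "the quick brown fox jumped over the lazy dog. "

# Splitting on a single space keeps the empty string caused by the trailing space.
by_space = quickfox.split(' ')

# Stripping first removes the trailing space, so no empty string remains.
stripped_then_split = quickfox.strip().split(' ')

# Without an argument, split() uses any whitespace and discards empty strings.
by_whitespace = quickfox.split()

print(len(by_space), len(stripped_then_split), len(by_whitespace))
```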

.join() is a powerful method that allows you to join a list of strings together, using the specified separator. In this case, we will join a list of number words together.

In [None]:
example_list = ['one', 'two', 'three', 'four']

In [None]:
' and '.join(example_list)
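
One thing to keep in mind is that `.join()` only accepts strings. If the list holds actual numbers, they need to be converted first. A minimal sketch with a made-up list:

```python
numbers = [1, 2, 3, 4]

# ', '.join(numbers) would raise a TypeError, because the items are ints.
joined = ', '.join(str(n) for n in numbers)
print(joined)

# split() reverses the operation, giving back a list of strings.
parts = joined.split(', ')
print(parts)
```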

## Pandas string types

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('../data/publications.txt', sep='\t', encoding='utf-8', dtype={'authors': 'string', 'journal_title': 'string', 'paper_title': 'string', 'abstract': 'string'})

In [None]:
df

In [None]:
df['paper_title'].str.title()

In [None]:
df['paper_title'].str.find('citation')

In [None]:
df['authors'].str.split('; ')
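
After splitting, each row holds a list of names. If you want one row per author instead, `.explode()` can flatten those lists. The sketch below uses a small made-up Series (`toy_authors` is hypothetical and not taken from the publications file):

```python
import pandas as pd

# Hypothetical author strings in the same 'name; name' layout.
toy_authors = pd.Series(['van Eck, N.J.; Waltman, L.', 'Traag, V.A.'])

# Split into lists, then give every list element its own row.
# The index of the original row is repeated for each author.
one_per_row = toy_authors.str.split('; ').explode()
print(one_per_row)
```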

In [None]:
df['authors'].str.contains('van Eck')

In [None]:
df.loc[df['authors'].str.contains('van Eck')]
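
Two options of `.str.contains()` are worth knowing when filtering: `na=False` treats missing values as non-matches, and `regex=False` searches for the literal text instead of interpreting it as a regular expression. A minimal sketch on a made-up DataFrame (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical data with one missing author field.
toy = pd.DataFrame({
    'authors': pd.Series(
        ['van Eck, N.J.; Waltman, L.', 'Traag, V.A.', pd.NA],
        dtype='string',
    )
})

# Missing values count as "no match"; the search text is taken literally.
mask = toy['authors'].str.contains('van Eck', na=False, regex=False)
matches = toy.loc[mask]
print(matches)
```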

## Formatted strings
Formatted strings (f-strings) allow you to insert variables, and even expressions, within a string.

In [None]:
name='Wout'
f'My name is {name}'

In [None]:
a='Amsterdam'
b='the Netherlands'
c=800000
f'{a} is the capital of {b} and it has a population of over {c}'

In [None]:
a = 5
b = '5'
c = 10
f'{a} times {c} is {a*c} but {b} times {c} is {b*c}'
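
F-strings also accept a format specification after a colon, which controls how a value is displayed. The sketch below shows a thousands separator, a fixed number of decimals, and a percentage. The numbers are made up for illustration.

```python
population = 800000
ratio = 3.14159
share = 0.256

with_separator = f'{population:,}'   # thousands separator
two_decimals = f'{ratio:.2f}'        # fixed-point, two decimals
as_percent = f'{share:.1%}'          # multiply by 100 and add a % sign

print(with_separator, two_decimals, as_percent)
```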

Beware: a string can be delimited with either ' or ", but before Python 3.12 you could not reuse the enclosing quote character inside an f-string expression. If the expression itself contains strings, use the other quote style for them; otherwise older Python versions raise a syntax error, as in the example below.

In [None]:
f'The dataframe contains {df['authors'].str.contains('van Eck').sum()} articles by Nees Jan van Eck.' # SyntaxError before Python 3.12

In [None]:
f"The dataframe contains {df['authors'].str.contains('van Eck').sum()} articles by Nees Jan van Eck." # this works!
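
The same rule applies to dictionary keys inside an f-string. One way to sidestep the quoting question altogether is to compute the value first and insert only a plain variable. A minimal sketch with a made-up dictionary (`article_counts` and its value are hypothetical):

```python
# Hypothetical counts, only for illustration.
article_counts = {'van Eck': 3}

# Option 1: double quotes outside, single quotes inside the expression.
mixed_quotes = f"There are {article_counts['van Eck']} articles by van Eck."

# Option 2: do the lookup first, so the f-string contains no quotes at all.
n_articles = article_counts['van Eck']
precomputed = f'There are {n_articles} articles by van Eck.'

print(mixed_quotes)
```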

## Regular expressions
A powerful tool for parsing and editing string data.

In [None]:
import re

Let's start by retrieving the abstract of Vincent's paper in the data.

In [None]:
vincent_abstract = df.loc[df['authors'].str.contains('Traag') & df['abstract'].notna()]['abstract'].tolist()[0]
vincent_abstract

Regular expressions allow you to quickly search and manipulate strings. They use wildcards, patterns, quantifiers, and character groups. For instance, we can find any numeric character:

In [None]:
re.findall('[0-9]', vincent_abstract)

Regex uses a number of special characters, such as parentheses and square brackets, to denote groups of characters. If you want to match these characters literally, you need to escape them with a backslash. Writing the pattern as a raw string (r'...') stops Python from interpreting the backslashes itself.

In [None]:
re.findall(r'\([0-9]\)', vincent_abstract)
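
If the text you want to match literally comes from a variable, `re.escape()` can add the backslashes for you. A minimal sketch on a made-up sentence (`text` is hypothetical, not the abstract from the data):

```python
import re

# Hypothetical sentence with numbered items in parentheses.
text = "We compare (1) citations and (2) peer review."

# Raw string with manually escaped parentheses.
numbered = re.findall(r'\([0-9]\)', text)

# re.escape() builds the escaped pattern from a literal string.
literal = '(2)'
pattern = re.escape(literal)
found = re.findall(pattern, text)

print(numbered, pattern, found)
```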

Quantifiers can be used to denote numbers of characters to look for. Let's find any substring that consists of at least two capital letters.

In [None]:
re.findall('[A-Z]{2,}', vincent_abstract)
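
The quantifier decides how many characters a match may contain: `{2,}` means two or more, `{3}` means exactly three, and `+` means one or more. The sketch below applies them to a made-up sentence (`text` is hypothetical) to show how the results differ.

```python
import re

# Hypothetical sentence containing a few acronyms and a year.
text = "The JIF and SNIP indicators were computed for PLOS ONE in 2015."

two_or_more = re.findall(r'[A-Z]{2,}', text)   # whole acronyms
exactly_three = re.findall(r'[A-Z]{3}', text)  # cuts longer acronyms short
digits = re.findall(r'[0-9]+', text)           # one or more digits in a row

print(two_or_more, exactly_three, digits)
```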

Finally, let's use a wildcard with a lazy quantifier to match everything from a number in parentheses up to the first semicolon or period that follows.

In [None]:
re.findall(r'\([0-9]\).*?[;.]', vincent_abstract)
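
The question mark after `.*` matters here. Without it the quantifier is greedy and runs to the last semicolon or period it can find; with it, the match stops at the first one. A minimal sketch on a made-up string (`text` is hypothetical):

```python
import re

# Hypothetical enumeration in the style of an abstract.
text = "(1) first; (2) second; (3) third."

# Lazy: each match ends at the nearest ';' or '.'.
lazy = re.findall(r'\([0-9]\).*?[;.]', text)

# Greedy: the first match swallows everything up to the final '.'.
greedy = re.findall(r'\([0-9]\).*[;.]', text)

print(lazy)
print(greedy)
```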

Regex allows for more than just finding or matching patterns. It can also be used to substitute a pattern with a new string.

In [None]:
re.sub(r'\([0-9]\).*?[;.]', '<SENTENCE REMOVED>', vincent_abstract)

Other important built-in features are the anchors for the start of a string (^) and the end of a string ($). We can, for instance, extract the first sentence of the abstract by searching for a pattern that runs from the start of the string up to the first period. re.search returns a match object, which contains both the matched text and its location in the original string.

In [None]:
re.search(r'^.*?\.', vincent_abstract)
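
The match object can be queried for its parts. The sketch below, on a made-up two-sentence string (`text` is hypothetical), reads out the matched text and its position, and uses `$` to anchor a second search at the end of the string.

```python
import re

# Hypothetical text, only for illustration.
text = "First sentence. Second sentence."

first = re.search(r'^.*?\.', text)
first_text = first.group()    # the matched text
first_span = first.span()     # (start, end) positions in the original string

# '$' anchors the pattern at the end, so only the last 'sentence.' matches.
last = re.search(r'sentence\.$', text)
last_start = last.start()

print(first_text, first_span, last_start)
```

Note that `re.search` returns `None` when nothing matches, so real code should check the result before calling `.group()`.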

There are many more things that you can do with regular expressions, which we will not get into today, as it gets rather complex very fast.

## NLTK
So far, most of these string operations have been possible within SQL as well, so you might be asking: why Python? The Natural Language Toolkit (NLTK) is the first of a long list of libraries that allow you to do much more with text data. First, we need to download the NLTK corpus and model files. Run the cell below, then download the 'popular' collection; that is enough for now.

In [None]:
import nltk
nltk.download()

When working with longer texts, it is often useful to break them up into individual sentences, or even words. This is called tokenization.

In [None]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize

In [None]:
vincent_sentences = sent_tokenize(vincent_abstract)
vincent_sentences

In [None]:
# break the abstract into individual words
vincent_words = word_tokenize(vincent_abstract)
vincent_words
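
To get a feel for what a tokenizer does, here is a deliberately naive version built only on regular expressions. It is not how NLTK works internally, and it will stumble over abbreviations and decimals, which is exactly why a dedicated tokenizer is useful. The sample text and variable names are made up.

```python
import re

# Hypothetical text, only for illustration.
text = "Citations matter. Do they? Yes, they do!"

# Naive sentence split: break after '.', '!' or '?' followed by whitespace.
naive_sentences = re.split(r'(?<=[.!?])\s+', text)

# Naive word split: runs of word characters, or single punctuation marks.
naive_words = re.findall(r'\w+|[^\w\s]', text)

print(naive_sentences)
print(naive_words)
```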

Note that there are a lot of 'stopwords' in sentences. These typically add little to a quantitative analysis of text, and can be removed. NLTK has lists of stopwords for various languages. Let's remove these from the text.

In [None]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))
print(stop_words)

In [None]:
vincent_words = [w.lower() for w in vincent_words if w.lower() not in stop_words]
print(vincent_words)
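
The filtering step itself is plain Python: lowercase each token and keep it only if it is not in the stopword collection. The sketch below shows the idea with a tiny hand-made set (`toy_stop_words` is hypothetical and far shorter than NLTK's list). Using a set rather than a list keeps the membership test fast.

```python
# Hypothetical mini stopword set, only for illustration.
toy_stop_words = {'the', 'of', 'a', 'is'}

tokens = ['The', 'number', 'of', 'citations', 'is', 'a', 'proxy']

# Lowercase first, so that 'The' is recognised as the stopword 'the'.
filtered = [t.lower() for t in tokens if t.lower() not in toy_stop_words]
print(filtered)
```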

Let's also remove all tokens that start with a non-alphabetical character, such as punctuation and numbers, with a simple regular expression.

In [None]:
vincent_words = [w for w in vincent_words if not re.match('[^a-z]', w)]
print(vincent_words)
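
The filter above checks only the first character of each token, because `re.match` anchors at the start of the string. If you want to keep only tokens made up entirely of letters, `re.fullmatch` or `str.isalpha()` are stricter alternatives. A minimal comparison on made-up tokens:

```python
import re

# Hypothetical lowercase tokens, only for illustration.
tokens = ['citations', '(', '2019', 'peer-review', 'x2', '.']

# Keeps anything whose first character is a lowercase letter.
starts_with_letter = [w for w in tokens if not re.match('[^a-z]', w)]

# Keeps only tokens that consist entirely of lowercase letters.
letters_only = [w for w in tokens if re.fullmatch('[a-z]+', w)]

# str.isalpha() is a regex-free alternative (it also accepts non-ASCII letters).
alpha_only = [w for w in tokens if w.isalpha()]

print(starts_with_letter, letters_only, alpha_only)
```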

### Lemmatization and stemming
Stemming reduces words to a base stem form by using predefined rules to trim the endings of nouns and verbs.

In [None]:
from nltk.stem.porter import PorterStemmer

In [None]:
ps = PorterStemmer()

for w in vincent_words:
    stemmed = ps.stem(w)
    if w != stemmed:
        print(w, " : ", stemmed)
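
One common use of stemming is to group word variants under a shared stem, for example before counting them. The sketch below does this for a small made-up word list; `PorterStemmer` needs no downloaded corpus, and the names `words` and `groups` are only illustrative.

```python
from collections import defaultdict

from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

# Hypothetical word list, only for illustration.
words = ['connect', 'connected', 'connecting', 'connection', 'connections']

# Map each stem to the list of original words that reduce to it.
groups = defaultdict(list)
for w in words:
    groups[ps.stem(w)].append(w)

print(dict(groups))
```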

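Lemmatization looks up words and replaces them with their base form, if found. The downside is that unknown words are left unchanged. Note that WordNetLemmatizer treats every word as a noun unless you pass a part of speech, for example pos='v' for verbs.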

In [None]:
from nltk.stem.wordnet import WordNetLemmatizer

In [None]:
lemmatizer = WordNetLemmatizer()  # create the lemmatizer once, outside the loop

for w in vincent_words:
    lemmed = lemmatizer.lemmatize(w)
    if w != lemmed:
        print(w, " : ", lemmed)

## POS tagging
We can find part-of-speech (POS) tags (nouns, verbs, etc.) using NLTK as well. This allows us to extract, for instance, all verbs from Vincent's abstract. First, let's return to the original sentences, then tokenize and tag them one sentence at a time. See https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html for a list of all POS tags.

In [None]:
from nltk import pos_tag
sent_pos_tags = [pos_tag(word_tokenize(sent)) for sent in vincent_sentences]
print(sent_pos_tags[0])

In [None]:
# retrieve verbs
[[v[0] for v in s if v[1][0]=='V'] for s in sent_pos_tags]
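
The tagger returns a list of (word, tag) tuples per sentence, so picking out a word class comes down to filtering on the tag prefix. The sketch below uses a hand-written tagged sentence (`tagged` is hypothetical, not actual NLTK output) and unpacks each tuple by name, which can be easier to read than indexing.

```python
# Hypothetical tagged sentence in Penn Treebank style, written by hand.
tagged = [
    ('We', 'PRP'), ('analyze', 'VBP'), ('citation', 'NN'),
    ('networks', 'NNS'), ('and', 'CC'), ('found', 'VBD'), ('clusters', 'NNS'),
]

# All verb tags in the Penn Treebank tag set start with 'VB'.
verbs = [word for word, tag in tagged if tag.startswith('VB')]

# All noun tags start with 'NN'.
nouns = [word for word, tag in tagged if tag.startswith('NN')]

print(verbs, nouns)
```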