## Strings

**Text**

Strings can be created with double quotes `"`, single quotes `'`, or by calling `str()`:

In [1]:
"this is fine"

'this is fine'

In [2]:
'so is this (but not in JSON)'

'so is this (but not in JSON)'

In [3]:
str(float(64))

'64.0'

Files are just long sequences of characters
- the line structure is a mirage

`\n` is the newline character (the UNIX line ending; Windows files use `\r\n`)

In [4]:
file = 'First line.\nSecond line.'
print(file)

First line.
Second line.
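
To make the point that a newline is an ordinary character, here is a minimal sketch. The variable name `text` is ours, chosen for illustration; the content mirrors the cell above.

```python
# `text` is an illustrative name; the content mirrors the cell above.
text = 'First line.\nSecond line.'

# repr() shows the raw characters, including the newline escape.
raw = repr(text)
print(raw)

# The newline counts as a single character in the string.
newline_count = text.count('\n')
print(newline_count)

# splitlines() breaks the string at each newline and drops the newlines.
lines = text.splitlines()
print(lines)
```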


Multi-line strings (using triple quotes):

In [6]:
print("""
First
whatever
""")


First
whatever



String literals next to each other are joined:

In [7]:
'Py' 'thon'

'Python'

This can be useful for breaking a long string across several lines of code:

In [8]:
print('An expert is a person who has made all the mistakes '
      'that can be made in a very narrow field – NIELS BOHR')

An expert is a person who has made all the mistakes that can be made in a very narrow field – NIELS BOHR


We can add strings together:

In [9]:
"ja" + " ja " + "ja"

'ja ja ja'

And multiply them:

In [10]:
"ja" * 3

'jajaja'

## Strings are iterable

In [11]:
for character in 'Py' 'thon':
    print(character)

P
y
t
h
o
n


We can use the built-in `len` to count the characters in a string:

In [11]:
len('Python')

6
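
Because strings are sequences, they also support indexing and slicing. The sketch below is an illustration rather than part of the original notebook; slicing is used later in the stemming exercise.

```python
word = 'Python'

first = word[0]     # indexing starts at 0
last = word[-1]     # negative indices count from the end
prefix = word[:4]   # slice: the first four characters
short = 'map'[:4]   # slicing past the end is safe and returns what is there

print(first, last, prefix, short)
```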

## `upper` and `lower`

A common operation in NLP is making everything lower case:

In [12]:
'PYTHON'.lower()

'python'

In [13]:
'python'.upper()

'PYTHON'

We can see some of the other functionality available on the `str` object using `dir`:

In [14]:
dir(str)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']
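
The full list is long, so it can help to filter out the double-underscore names and try a few of the methods. A minimal sketch; the example strings are ours, and the exact number of methods depends on the Python version.

```python
# Keep only the names that do not start with an underscore.
public_methods = [name for name in dir(str) if not name.startswith('_')]
print(len(public_methods))

# A few of the listed methods in action.
title = 'monty python'.title()
count = 'banana'.count('a')
starts = 'Python'.startswith('Py')
joined = '-'.join(['a', 'b', 'c'])
print(title, count, starts, joined)
```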

## String formatting

There are many ways to do this - below is what works for me:

In [15]:
'{} {} {}'.format('first', 'second', 'third')

'first second third'

We can control the number of decimal places:

In [16]:
'{:.2f} {:.1f} {:.0f}'.format(420, 420, 420)

'420.00 420.0 420'
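
`str.format` also accepts positional indices, keyword fields, and a field width. The sketch below shows a few of these options; the example values are ours.

```python
# Positional indices let a value be reused or reordered.
reordered = '{1} {0} {1}'.format('first', 'second')

# Keyword fields make longer templates easier to read.
named = '{name} is {age}'.format(name='Eric', age=74)

# Width and precision combine: a field 8 characters wide, two decimal places.
padded = '{:8.2f}'.format(420)

print(reordered)
print(named)
print(repr(padded))
```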

## F-strings

`str.format` can be verbose, which makes long format strings hard to read. F-strings (available since Python 3.6) are a newer, more readable way to format strings.

In [17]:
name = "Eric"
age = 74
f"Hello, {name}. You are {age}."

'Hello, Eric. You are 74.'

In [20]:
f"{2 + 37}"


'39'

In [21]:
def to_lowercase(text):
    return text.lower()

name = "Eric Idle"
f"{to_lowercase(name)} is funny."

'eric idle is funny.'

In [23]:
d = {'a' : 1}

f"{d['a']}"

'1'
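
F-strings accept the same format specifications as `str.format`, written after a colon inside the braces. A short sketch, with variable names and values chosen for illustration:

```python
price = 420
ratio = 0.5
population = 1234567

two_places = f"{price:.2f}"     # same spec as '{:.2f}'.format(price)
percent = f"{ratio:.0%}"        # multiply by 100 and add a percent sign
thousands = f"{population:,}"   # comma as thousands separator

print(two_places, percent, thousands)
```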

## String splitting

A common operation is to split strings on characters.  Let's get the current working directory:

In [24]:
import os

os.getcwd()

'/home/stas/dsr/dsr-classes/python/basics'

We can then use the `split` method to create a list of the parts:

In [25]:
os.getcwd().split('/')

['', 'home', 'stas', 'dsr', 'dsr-classes', 'python', 'basics']

In [26]:
os.getcwd().split('/')[-2]

'python'
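
The inverse of `split` is `join`, and `rsplit` with a maximum split count separates only the last component. A minimal sketch using a hypothetical path (not the author's directory):

```python
# Hypothetical path, used only for illustration.
path = '/home/user/project/notebooks'

parts = path.split('/')
print(parts)

# join() on the same separator rebuilds the original string.
rebuilt = '/'.join(parts)
print(rebuilt == path)

# rsplit with maxsplit=1 splits once, starting from the right.
head, tail = path.rsplit('/', 1)
print(head, tail)
```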

We will see more on paths in the next notebook - and more on iterables in the notebook after that.

## String stripping

A common operation is removing whitespace from the ends of a string. `strip` removes it from both ends (`rstrip` removes trailing whitespace only):

In [28]:
'python is dynamically typed    '.strip(' ')

'python is dynamically typed'
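
The three stripping methods differ only in which end they touch. A small sketch with an example string of our own:

```python
padded = '   python is dynamically typed    '

both = padded.strip()    # no argument: removes whitespace from both ends
left = padded.lstrip()   # leading whitespace only
right = padded.rstrip()  # trailing whitespace only

# With an argument, strip removes any of the given characters from both ends.
chars = 'xxhixx'.strip('x')

print(repr(both))
print(repr(left))
print(repr(right))
print(chars)
```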

A related operation is removing all spaces from a string, which can be done by replacing them with `''`:

In [31]:
'python is dynamically typed    '.replace(' ', '')

'pythonisdynamicallytyped'
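
If the goal is to tidy up repeated spaces rather than remove every space, one common idiom is to split on whitespace and join with a single space. A sketch with an example string of our own:

```python
messy = 'python   is  dynamically typed    '

# split() with no argument splits on runs of whitespace and drops empty parts.
collapsed = ' '.join(messy.split())

print(repr(collapsed))
```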

## `in`

A very Pythonic pattern is to check if an object exists in an iterable using `in`.  As strings are iterable, this syntax works with strings:

In [32]:
'P' in 'Python'

True

In [33]:
'p' in 'Python'

False
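
For strings, `in` also matches substrings, not only single characters. When the position matters, `find` returns the index of the first match, or -1 if there is none. A short sketch:

```python
has_sub = 'th' in 'Python'       # substrings work too
wrong_order = 'ht' in 'Python'   # the characters must appear in this order
absent = 'x' not in 'Python'     # `not in` negates the check

position = 'Python'.find('th')   # index of the first match
missing = 'Python'.find('z')     # -1 when there is no match

print(has_sub, wrong_order, absent, position, missing)
```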

## Exercise

Write a function to check for a letter in a word independently of the case:

In [40]:
def check(char, word):
    return char.lower() in word.lower()
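
A quick way to try the function; it is repeated here so the snippet runs on its own, and the test inputs are ours:

```python
def check(char, word):
    # Lower-case both sides so the comparison ignores case.
    return char.lower() in word.lower()

results = [check('p', 'Python'), check('P', 'python'), check('z', 'Python')]
print(results)
```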

## Exercise

**Stemming** is a process of converting words to their stem (a base or root form).  It is a common operation in NLP.

For the text below (`sample`):
- remove all dots, commas and brackets
- create a list of the stems
- a stem (in this case) is the word shortened to 4 characters, ignoring letter case
- words of 2-3 characters should be kept as they are
- words of 1 character should be dropped

After creating your list of stems, count them
- use a `collections.defaultdict(int)` to store the counts

(There are libraries that will do stemming for you. We do it by hand below to practice working with strings.)

In [37]:
sample = 'In linguistic morphology and information retrieval, stemming is   the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.   Algorithms for stemming have been studied in computer   science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.A computer program or subroutine that stems word may be called a stemming program, stemming algorithm, or stemmer.'

sample

'In linguistic morphology and information retrieval, stemming is   the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.   Algorithms for stemming have been studied in computer   science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.A computer program or subroutine that stems word may be called a stemming program, stemming algorithm, or stemmer.'

In [38]:
import collections

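def stem(sample):
    # Remove punctuation, normalise case, then split on whitespace
    words = sample.replace(',', '').replace('.', '').replace('(', '').replace(')', '').lower().split()
    stems = []

    for word in words:
        # Drop 1-character words
        if len(word) != 1:
            if len(word) < 4:
                # Keep words of 2-3 characters unchanged
                stems.append(word)
            else:
                # Shorten longer words to their first 4 characters
                stems.append(word[:4])

    return stems

stems = stem(sample)

stems_dict = collections.defaultdict(int)

# Use a loop variable that does not shadow the `stem` function
for word_stem in stems:
    stems_dict[word_stem] += 1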
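
As an alternative sketch (not the approach the exercise asks for), the same counting can be done with `str.translate` and `collections.Counter`. The function name `count_stems` and the short test sentence are ours.

```python
import collections

def count_stems(text):
    # Remove dots, commas and brackets in one pass.
    table = str.maketrans('', '', '.,()')
    words = text.translate(table).lower().split()
    # Drop 1-character words; slicing leaves 2-3 character words unchanged.
    return collections.Counter(word[:4] for word in words if len(word) > 1)

counts = count_stems('Stemming (a process) stems words, a stemmer stems too.')
print(counts.most_common(1))
```

`Counter` is a `dict` subclass, so the result can be indexed like `stems_dict`.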

In [39]:
stems_dict

defaultdict(int,
            {'in': 3,
             'ling': 1,
             'morp': 2,
             'and': 1,
             'info': 1,
             'retr': 1,
             'stem': 11,
             'is': 3,
             'the': 7,
             'proc': 2,
             'of': 3,
             'redu': 1,
             'infl': 1,
             'or': 4,
             'some': 1,
             'deri': 1,
             'word': 7,
             'to': 3,
             'thei': 1,
             'base': 1,
             'root': 3,
             'form': 2,
             'writ': 1,
             'need': 1,
             'not': 2,
             'be': 2,
             'iden': 1,
             'it': 1,
             'usua': 1,
             'suff': 1,
             'that': 2,
             'rela': 1,
             'map': 1,
             'same': 2,
             'even': 1,
             'if': 1,
             'this': 1,
             'itse': 1,
             'vali': 1,
             'algo': 2,
             'for': 1,
             'have'