<html>
<table width="100%" cellspacing="2" cellpadding="2" border="1">
<tbody>
<tr>
<td valign="center" align="center" width="25%"><img src="../../media/decartes.jpg"
alt="DeCART Icon" width="128" height="171"><br>
</td>
<td valign="center" align="center" width="75%">
<h1 align="center"><font size="+3">DeCART Summer School<br>
for<br>
Biomedical Data Science</font></h1></td>
<td valign="center" align="center" width="25%"><img
src="../../media/U_Health_stacked_png_red.png" alt="Utah Health
Logo" width="128" height="134"><br>
</td>
</tr>
</tbody>
</table>
<br>
</html>

# Text and Medicine

It is estimated that 75% of all information in the health record is in free-text form, such as radiology or pathology reports, nursing notes, and discharge summaries. Consequently, text processing is one of the most important data processing tasks in medicine.

# Strings and Lists in Python

Two of the most important data structures we will be using in Python are [strings](https://goo.gl/EEoJHd) and [lists](https://docs.python.org/3/library/stdtypes.html?highlight=strings#lists). Both are examples of sequences in Python. Strings will be our basic data type for text, representing medical texts, genomic sequences, and ICD codes, for example, while lists and tuples will serve as collections where we keep related data.

In [1]:
from quizzes.string_quizzes import *

### Strings
* Strings are **immutable** sequences of characters. 
    * Immutability is very important: once created, a string cannot be modified. Instead of modifying a string, we always create a new string that is derived from the original.
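
As a minimal sketch of what immutability means in practice, the cell below tries to change one character in place and then builds a new string instead. The variable names and the example value are illustrative assumptions, not part of the course material.

```python
# Strings cannot be changed in place: item assignment raises a TypeError.
diagnosis = "hepatitis c"  # hypothetical example value

try:
    diagnosis[0] = "H"
except TypeError as error:
    print("Cannot modify in place:", error)

# Instead, build a new string that is derived from the original.
capitalized = diagnosis[0].upper() + diagnosis[1:]

print(diagnosis)    # the original string is unchanged
print(capitalized)  # the new, derived string
```

The original name still refers to the original text; only the new name refers to the modified copy.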



Here are two example strings. ``code`` may look numeric but it is a unique identifier without numeric meaning.

In [2]:
description = 'CHRONIC HEPATITIS C WITHOUT HEPATIC COMA'
code = """070.54"""

print(description)
print(code)

CHRONIC HEPATITIS C WITHOUT HEPATIC COMA
070.54


In Python 3.x all strings are Unicode strings composed of [Unicode](https://en.wikipedia.org/wiki/Unicode) characters. Every Unicode character has a unique ordinal position (its code point) in the Unicode definition. The [``ord``](https://docs.python.org/3/library/functions.html#ord) function returns the ordinal number that corresponds to a specific character, and the [``chr``](https://docs.python.org/3/library/functions.html#chr) function returns the character that corresponds to an ordinal number (if there is one: an arbitrary integer might fall outside the range of valid code points).

The first 128 Unicode characters (ordinals 0 through 127) are the ASCII characters:

![Table of ASCII Characters](https://upload.wikimedia.org/wikipedia/commons/thumb/7/7b/Ascii_Table-nocolor.svg/800px-Ascii_Table-nocolor.svg.png)

In [None]:
print(chr(89))
print(chr(5674))

print(ord("œ"))

print(ord("a"))
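
The following sketch, under the assumption that the sample characters are typed as single code points, illustrates that ``ord`` and ``chr`` undo each other and that ``chr`` rejects integers beyond the Unicode range. The names `sample`, `round_trip`, and `out_of_range_rejected` are made up for this illustration.

```python
# ord() maps a character to its ordinal; chr() maps the ordinal back.
sample = ["A", "a", "µ", "œ"]
round_trip = [chr(ord(character)) for character in sample]
print(round_trip == sample)

# chr() accepts integers from 0 through 0x10FFFF. One past that upper
# limit is not a valid code point and raises a ValueError.
try:
    chr(0x110000)
    out_of_range_rejected = False
except ValueError as error:
    out_of_range_rejected = True
    print("Not a valid code point:", error)
```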

### Comparing Characters Means Comparing Their Ordinal Values

In [None]:
print('a' < 'A')
print("aBc" < "abc")

In the evolution of computer character representations, English upper case letters were defined before lower case letters and thus have a lower ordinal value.
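
One practical consequence shows up when sorting. The sketch below uses a hypothetical list of terms to contrast the default ordinal ordering with a case-insensitive ordering; passing `key=str.lower` is one common way to ignore case, not the only one.

```python
# Upper case letters have lower ordinal values than lower case letters.
print(ord("A"), ord("a"))

terms = ["sepsis", "Asthma", "anemia", "COPD"]  # hypothetical example terms

# The default sort compares ordinal values, so terms starting with an
# upper case letter come before terms starting with a lower case letter.
ordinal_order = sorted(terms)
print(ordinal_order)

# Sorting on a lowercased key compares the terms without regard to case.
case_insensitive_order = sorted(terms, key=str.lower)
print(case_insensitive_order)
```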

In [None]:
print("." < "A")
print("." < "1")

## Strings: Accessing Data
* We can access individual characters in a string with a square bracket ([]) syntax.
* Python sequences start at 0. That is, the first element in the sequence is accessed by the number 0 (zero).
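
A short sketch of zero-based indexing, using a copy of the ICD code from earlier under the illustrative name `icd_code`. Because counting starts at 0, the last valid index is one less than the length of the string, and asking for anything beyond it raises an `IndexError`.

```python
icd_code = "070.54"  # same value as `code` above, under an illustrative name

first_character = icd_code[0]        # index 0 is the first character
last_index = len(icd_code) - 1       # the last valid index
last_character = icd_code[last_index]

print(first_character, last_index, last_character)

# An index equal to the length is already out of range.
try:
    icd_code[len(icd_code)]
except IndexError as error:
    print("Out of range:", error)
```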

## Exercise
For the string 

```Python
mystring = """Termination of Swan-Ganz catheter within the proximal right main pulmonary."""
```
What is the value of ``mystring[19]``?

In [None]:
string_index1_quiz("replace me with the character at mystring[19]")

## Exercise

For ``mystring`` defined above, what do you think will happen if you type

```Python
mystring[-4]
```
?

Test your guess below.

In [None]:
string_index2_quiz("replace me with your guess of the result for mystring[-4]")

## **Slicing**

* You can access a segment of a string using a slicing notation: **STRING[start:stop:increment]**
    * start is inclusive
    * **stop is exclusive**
* start, stop and increment all have default values
    * start: 0
    * stop: Length of string
    * increment: 1
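
The defaults listed above can be checked directly. This sketch slices a copy of the ICD code (the names `icd_code`, `category`, and `subcategory` are illustrative assumptions) and compares a slice with omitted values against one with every value written out.

```python
icd_code = "070.54"  # illustrative copy of the ICD code used earlier

# stop is exclusive: index 3 (the period) is not included.
category = icd_code[:3]
# start is inclusive: index 4 is the first character returned.
subcategory = icd_code[4:]

# Omitting all three values is the same as spelling out the defaults.
explicit = icd_code[0:len(icd_code):1]
implicit = icd_code[::]

# An increment of 2 keeps every other character, starting at index 0.
every_other = icd_code[::2]

print(category, subcategory, explicit == implicit, every_other)
```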

In [None]:
print(description)
print(description[0])
print(description[5])
print(description[0:13])
print(description[13:])
print(description[::2])

## Exercise

What are the slicing start and stop indices needed to extract the substring
```Python
Swan-Ganz catheter
```
from ``mystring``?

# Strings: Attributes and Methods
* Strings are objects that have **attributes** and **methods**
    * **attributes**: think nouns (things strings have)
    * **methods**: think verbs (things strings do)
* You can access **attributes** and **methods** using the 'dot' (.) notation
* You can learn about the **attributes** and **methods** using tab completion and **help()**
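
Outside of tab completion, the built-in `dir()` function lists the names available on an object. The sketch below uses it to find the string methods whose names start with `is`, and shows that a method only runs when it is called with parentheses. The variable names and the example text are illustrative.

```python
text = "Swan-Ganz"  # illustrative example string

# dir() returns the names of the attributes and methods of an object.
predicate_methods = [name for name in dir(text) if name.startswith("is")]
print(predicate_methods)

# Without parentheses, dot notation returns the method itself.
print(callable(text.lower))

# With parentheses, the method is called and returns a new string.
lowered = text.lower()
print(lowered)
```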

In [None]:
description.isupper()

In [None]:
help(code.isalnum)

In [None]:
code.isalnum()

Why did this evaluate to False?

## Splitting Strings

A common manipulation of strings is splitting a string into a list of substrings:
* Split the string at a specified delimiter (defaults to any run of whitespace, including spaces, tabs, and newlines)
* Returns a **list** (to be discussed later) of substrings
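
The default delimiter behaves differently from an explicit single space, which matters for clinical text with irregular spacing. The sketch below uses a made-up line (not taken from a real note) to show the difference.

```python
# Hypothetical line with a double space, a tab, and a trailing newline.
line = "pt  received on\tpsv mode\n"

# With no argument, any run of whitespace is one delimiter, and leading
# or trailing whitespace produces no empty strings.
default_tokens = line.split()
print(default_tokens)

# With an explicit single space, every space is a delimiter, so the
# double space yields an empty string and the tab and newline remain.
space_tokens = line.split(" ")
print(space_tokens)
```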

In [None]:
a = '1,2,3,4,5'
help(a.split)

In [None]:
a.split()

In [None]:
a.split(',')

In [None]:
note = """resp care
pt received on psv mode, per team peep placed back on at 5 cmH20. initially pt requiring ps 12, now on 8 for progression of weaning. tolerating fair with rr approx 25-32 range. mdi's given q4h, flovent started at 8 p. bid. cuff leak seems more constant today, ?worse with peep on, cuff pressure kept at 30 cmH20 with 10 cc's in cuff, to seal it would require cuff pressure of 45 cmh20. IP evaluated and chooses not to replace trach at this time, maintain cuff pressure at 30. c/w slow wean, progress to trach mask as soon as possible."""

print(note.split())

## Joining Strings
* **join()** is the inverse of **split()**: it combines a list of strings into a single string
* The string that **join()** is called on becomes the delimiter
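
A small sketch of the inverse relationship, using a hypothetical pipe-delimited record. Splitting and joining with the same explicit delimiter reproduces the original string. The sketch also shows that **join()** only accepts strings, so numbers need to be converted first.

```python
# Hypothetical record with an ICD code and its description.
record = "070.54|CHRONIC HEPATITIS C WITHOUT HEPATIC COMA"

fields = record.split("|")
rebuilt = "|".join(fields)
print(rebuilt == record)

# join() requires every element to be a string.
counts = [1, 2, 3]
try:
    ", ".join(counts)
except TypeError as error:
    print("Cannot join integers:", error)

# Convert each element to a string before joining.
joined_counts = ", ".join(str(count) for count in counts)
print(joined_counts)
```

Note that the round trip is only exact with an explicit delimiter. After a default `split()`, the original runs of whitespace cannot be recovered.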

In [None]:
number_list = a.split(",")
print(number_list)
print(''.join(number_list))

In [None]:
print(' '.join(number_list))

In [None]:
print(','.join(number_list))

print(', '.join(number_list))
print('this will look messy'.join(number_list))

# String Methods for Preprocessing/Modifying
* Note that since strings are **immutable**, these methods don't change the string; they return a *new* string
* **lower()**: converts all characters to lower case
* **upper()**: converts all characters to upper case
* **replace(a,b)**: replaces all occurrences of a in string with b (e.g., replacing tabs with spaces)
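
Because each of these methods returns a new string, calls can be chained. The sketch below applies two of them to a made-up, tab-separated fragment and confirms that the original is untouched. The optional third argument to **replace()** limits how many occurrences are replaced.

```python
raw = "Neuro exam\tq1h.\tMonitor I/O"  # hypothetical tab-separated fragment

# Chain methods: replace tabs with spaces, then convert to lower case.
cleaned = raw.replace("\t", " ").lower()
print(cleaned)

# Limit the replacement to the first occurrence only.
first_only = raw.replace("\t", " ", 1)
print(repr(first_only))

# The original string still contains both tabs.
print(raw.count("\t"))
```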

In [None]:
note2 = """PLANL: Admin K-excelate. 3% NS with q2h Na levels. Neuro exam q1h. Monitor I/O, resp status. MRI today after questionaire completed. Call H.O. with changes."""
print(note2.upper())
print("-"*42)
print(note2.lower())
print("-"*42)
print(note2)

In [None]:
print(note2.swapcase())
print("-"*42)

print(note2.replace('a','Z'))
print("-"*42)

print(note2.replace(' ','')) # replace spaces with empty string