# 17 Strings & Text Data
File(s) needed: none

This section is about Python functionality, not pandas.

A string is just a sequence of characters. A great deal of data arrives as strings, even numerical data, which must be converted to a numeric type before you can do any calculations with it.

We create strings manually in Python by putting those characters in quotes.

In [1]:
# Create sample strings and store them in variables
word = 'coconut'
pair = 'holy grail'
mixed = 'MIS3335'
curse = 'Your mother was a hamster and your father smelt of elderberries.'

print(word,pair,mixed,curse,sep=' | ')

coconut | holy grail | MIS3335 | Your mother was a hamster and your father smelt of elderberries.


## Concatenation
Concatenation combines strings using the `+` operator. With strings, `+` is the concatenation operator, not the addition operator.

We can also use the `+=` augmented assignment operator with strings. It provides a good way to build output or display strings.

In [3]:
# Concatenation example displaying names in an output string
first_name='Eric'
last_name='Idle'
full_name = first_name +' '+ last_name
full_name

'Eric Idle'

In [None]:
# Build an output string
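
One way to fill in the cell above is sketched here. The labels in the output text are made up for illustration; only `first_name` and `last_name` come from the earlier example.

```python
# Build an output string piece by piece with +=
first_name = 'Eric'
last_name = 'Idle'

output = 'First name: ' + first_name
output += '\nLast name: ' + last_name      # each += builds a new string
output += '\nFull name: ' + first_name + ' ' + last_name
print(output)
```

Each `+=` line adds one more piece, which keeps long output strings readable.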


Using `+=` makes it look like you are changing the string, but you are not: Python builds a new string and assigns it to the same name. **Strings are immutable**. Because of that, you can't use an index on the left side of an assignment operator.

In [None]:
# To confirm, see what happens if you try to change a letter in the string.
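
A minimal sketch of that experiment follows. The `try`/`except` wrapper is only there so the cell keeps running after the error; the name `error_text` is an assumption for illustration.

```python
word = 'coconut'

try:
    word[0] = 'C'              # item assignment is not allowed on a string
except TypeError as err:
    error_text = str(err)
    print('TypeError:', error_text)

# To "change" a letter, build a new string and reassign the name
word = 'C' + word[1:]
print(word)
```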


# Subsetting and slicing strings

As noted earlier, a string is just a sequence of characters. That means you can subset it by the index position of its characters, the same way we did with a Series.

![string_index_positions.png](attachment:string_index_positions.png)


## Individual or multiple character slices

Use the index positions above as a reference. When slicing multiple characters, remember that Python is left-side inclusive, right-side exclusive: the slice includes the start position but stops just before the end position.

In [None]:
# Get the first character in coconut


# Get the fifth character in coconut
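
A short sketch of single-character indexing, assuming the `word` variable from the first cell. Positions start at 0, so the fifth character sits at index 4.

```python
word = 'coconut'

first_char = word[0]    # index 0 is the first character
fifth_char = word[4]    # index 4 is the fifth character
print(first_char, fifth_char)   # c n
```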


In [8]:
# multiple characters - specify the start position
# Get the first eight characters of 'holy grail'


# Use negative index values to do the same - harder to understand, right?


# Get 'con' from the middle of coconut
'coconut'[2:5]

# Get just the last character of 'holy grail'


'con'
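
The slices asked for in the cell above could look like the sketch below. The variable names on the left are assumptions chosen for readability.

```python
word = 'coconut'
pair = 'holy grail'

first_eight = pair[0:8]          # positions 0 through 7
same_slice = pair[-10:-2]        # negative positions count back from the end
middle = word[2:5]               # positions 2, 3, and 4
last_char = pair[-1]             # -1 is always the last character

print(first_eight, same_slice, middle, last_char, sep=' | ')
# holy gra | holy gra | con | l
```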

Remember that leaving the right side of the colon `:` blank means the slice goes to the end of the sequence. In the same way, a blank before the colon starts from the beginning and goes up to (but not including) the index value on the right of the colon.

In [None]:
# get the first four characters of holy grail


# get the last four characters of holy grail
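
As a sketch of the blank-side rule, leaving out one side of the colon gives slices that run from the start or to the end of the string.

```python
pair = 'holy grail'

first_four = pair[:4]     # blank left side: start at the beginning
last_four = pair[-4:]     # blank right side: run to the end
print(first_four, last_four)    # holy rail

# The two halves of a split at the same position rebuild the original
print(pair[:4] + pair[4:] == pair)   # True
```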


# String Methods
The functionality available using methods built into the string object can be found in Table 8.3 (page 159 of your text) or at the Python documentation site. https://docs.python.org/3/library/stdtypes.html#string-methods

We will cover a few handy methods with examples. 
- `count`: counts the number of times a group of characters appears in the string.
- `find`: returns the index of the first location of a target string or -1 if not found.
- `isalpha`, `isdecimal`, `isalnum`: test whether all the characters are alphabetic, decimal digits, or alphanumeric (letters and/or digits).
- `lower`/`upper`: return a copy of the string in all lower or upper case. Very useful when comparing strings.
- `strip`: removes blanks from the beginning and end by default but can also remove other characters.

In [17]:
# String method examples
# Here is a reminder of the strings we are testing in these examples
print("word:\t",word,"\npair:\t",pair, "\nmixed:\t", mixed, "\ncurse:\t", curse, "\n")

# Find the first occurrence of "o", "oc", or "h" in coconut
print(word.find("o"))
print(word.find("oc"))
print(word.find("h"))
# Test to see if the variable contents are all alphabetic or numeric
print(word.isalpha())
print(word.isalnum())
print(mixed.isalpha())
print(mixed.isalnum())

word:	 coconut 
pair:	 holy grail 
mixed:	 MIS3335 
curse:	 Your mother was a hamster and your father smelt of elderberries. 

1
1
-1
True
True
False
True
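
The cell above does not exercise `count` or `isdecimal`, so here is a small sketch of both. It also shows the string-to-number conversion mentioned at the top of this section; slicing `mixed` at position 3 is just an illustrative choice.

```python
word = 'coconut'
mixed = 'MIS3335'

print(word.count('o'))       # 2
print(word.count('co'))      # 2

print(mixed.isdecimal())     # False, because of the letters
digits = mixed[3:]           # '3335'
print(digits.isdecimal())    # True

# A string of digits must be converted before doing arithmetic with it
next_number = int(digits) + 1
print(next_number)           # 3336
```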


In [24]:
# Upper and lower
print(word.upper())
print(mixed.lower())
# Is the comparison statement true or false?
fly='bee'
run='Zebra'
print(fly,run,fly<run)

# What about this?
print(fly.lower(),run.lower(),fly.lower()<run.lower())

COCONUT
mis3335
bee Zebra False
bee zebra True


In [35]:
# strip() removes blank spaces from both ends of a string.
# Also check out lstrip() and rstrip().
blanks="  we want ... another shrubbery! "
output="|"+blanks+"|"+'\n'
output+="|"+blanks.strip()+"|"
print(output)


|  we want ... another shrubbery! |
|we want ... another shrubbery!|


## String Formatting
Formatting allows you to specify a template to control the way the output looks. 

### Formatted string literals
Formatted string literals (f-strings) were added in Python 3.6. The output string begins with an `f` character (outside the string quotes) to tell Python we are using a formatted string literal, and curly braces `{}` are used as placeholders for any variable values we want to include in the output.


In [36]:
# Simple example with direct output
print(f"our class is {mixed}")

our class is MIS3335


In [38]:
# Using an output variable
output=f"My quest is to seek the {pair}"
print(output)

My quest is to seek the holy grail


In [40]:
# Building an output string.
output="my quest is to "
output+= f"seek the {pair}."
output+= "\nso take that"
print(output)

my quest is to seek the holy grail.
so take that


### Numeric value formats

The `{}` placeholders can represent numeric variables, too. Different number formats can also be specified for numeric values.

https://docs.python.org/3/library/string.html#formatspec

In [41]:
# The variables can have numeric values as well
age=37
out=f"im {age}, im not old."
print(out)

im 37, im not old.


In [48]:
# Example with numeric formats
cost=100
markup=.184
price=cost*(1+markup)
print(f"a markup of {markup:.1%} on a cost of ${cost:.2f} \
gives us a sell price of ${price:.2f}.")

a markup of 18.4% on a cost of $100.00 gives us a sell price of $118.40.


# Regular Expressions (AKA, "RegEx" or "regex")
When the `find()` string method is not powerful enough, regular expressions allow you to search for patterns in text data. There are a few important points to remember about regex. Most of these will become apparent very soon.
- It is very hard for a person to read a finished expression.
- Expressions can get very complex.
- Writing good regular expressions is a specialized subset of data analysis skills. There are many resources on the Internet to help you practice writing them.
- If you need to use a regular expression in your code, it is usually best to see (i.e., Google search) if someone has already written the appropriate expression(s) and use theirs. Just be sure to document the source of the code you use.

Regex functionality is found in the `re` library in Python.

The official reference: https://docs.python.org/3/library/re.html

There is also a third-party library called `regex` that you can use for writing regular expressions. It is not part of the base Anaconda distribution but it can be added easily; find out more in Anaconda Navigator. For a gentler introduction to the built-in `re` library, see the Regular Expression HOWTO at https://docs.python.org/3/howto/regex.html


## Basics of regex

### Basic syntax of regex
|<p style="text-align:left;">Syntax</p>|<p style="text-align:left;">  </p>|<p style="text-align:left;">Description</p>|
| --- | --- | --- |
|<p style="text-align:center;font-family:Courier New;font-size:125%">.</p>|<p> </p> |<p style="text-align:left;">Matches any one character</p>|
|<p style="text-align:center;font-family:Courier New;font-size:125%">^</p>|<p style="text-align:left;font-family:Courier New"> </p>|<p style="text-align:left;">Matches from the beginning of a string</p>|
|<p style="text-align:center;font-family:Courier New;font-size:125%">$</p>|<p style="text-align:left;font-family:Courier New"> </p>|<p style="text-align:left;">Matches at the end of a string</p>|
|<p style="text-align:center;font-family:Courier New;font-size:125%">*</p>|<p style="text-align:left;font-family:Courier New"> </p>|<p style="text-align:left;">Matches zero or more repetitions of the previous character</p>|
|<p style="text-align:center;font-family:Courier New;font-size:125%">+</p>|<p style="text-align:left;font-family:Courier New"> </p>|<p style="text-align:left;">Matches one or more repetitions of the previous character</p>|
|<p style="text-align:center;font-family:Courier New;font-size:125%">?</p>|<p style="text-align:left;font-family:Courier New"> </p>|<p style="text-align:left;">Matches zero or one repetitions of the previous character</p>|
|<p style="text-align:center;font-family:Courier New;font-size:125%">{m}</p>|<p style="text-align:left;font-family:Courier New"> </p>|<p style="text-align:left;">Matches m repetitions of the previous character</p>|
|<p style="text-align:center;font-family:Courier New;font-size:125%">{m,n}</p>|<p style="text-align:left;font-family:Courier New"> </p>|<p style="text-align:left;">Matches any number of repetitions from m to n of the previous character</p>|
|<p style="text-align:center;font-family:Courier New;font-size:125%">\\</p>|<p style="text-align:left;font-family:Courier New"> </p>|<p style="text-align:left;">Escape character</p>|
|<p style="text-align:center;font-family:Courier New;font-size:125%">[ ]</p>|<p style="text-align:left;font-family:Courier New"> </p>|<p style="text-align:left;">A set of characters</p>|
|<p style="text-align:center;font-family:Courier New;font-size:125%">\|</p>|<p style="text-align:left;font-family:Courier New"> </p>|<p style="text-align:left;">OR operator</p>|
|<p style="text-align:center;font-family:Courier New;font-size:125%">()</p>|<p style="text-align:left;font-family:Courier New"> </p>|<p style="text-align:left;">Groups the contained pattern so it is treated as a single unit</p>|

### Select special regex characters
|<p style="text-align:left;">Characters</p>|<p style="text-align:left;">  </p>|<p style="text-align:left;">Meaning</p>|
| --- | --- | --- |
|<p style="text-align:center;font-family:Courier New;font-size:125%">\\d</p>|<p> </p> |<p style="text-align:left;">A digit</p>|
|<p style="text-align:center;font-family:Courier New;font-size:125%">\\D</p>|<p> </p> |<p style="text-align:left;">NOT a digit</p>|
|<p style="text-align:center;font-family:Courier New;font-size:125%">\\s</p>|<p> </p> |<p style="text-align:left;">Any whitespace character (e.g., space or tab)</p>|
|<p style="text-align:center;font-family:Courier New;font-size:125%">\\S</p>|<p> </p> |<p style="text-align:left;">Any character NOT a whitespace</p>|
|<p style="text-align:center;font-family:Courier New;font-size:125%">\\w</p>|<p> </p> |<p style="text-align:left;">A word character (letter, digit, or underscore)</p>|
|<p style="text-align:center;font-family:Courier New;font-size:125%">\\W</p>|<p> </p> |<p style="text-align:left;">Any character NOT a word character</p>|

### Common regex functions
|<p style="text-align:left;">Function</p>|<p style="text-align:left;">  </p>|<p style="text-align:left;">Description</p>|
| --- | --- | --- |
|<p style="text-align:center;font-family:Courier New;font-size:125%">search</p>|<p> </p> |<p style="text-align:left;">Find the first occurrence of the pattern in a string</p>|
|<p style="text-align:center;font-family:Courier New;font-size:125%">match</p>|<p> </p> |<p style="text-align:left;">Match from the beginning of a string</p>|
|<p style="text-align:center;font-family:Courier New;font-size:125%">fullmatch</p>|<p> </p> |<p style="text-align:left;">Match the entire string</p>|
|<p style="text-align:center;font-family:Courier New;font-size:125%">split</p>|<p> </p> |<p style="text-align:left;">Split the string by the pattern</p>|
|<p style="text-align:center;font-family:Courier New;font-size:125%">findall</p>|<p> </p> |<p style="text-align:left;">Find all non-overlapping matches of the pattern in a string</p>|
|<p style="text-align:center;font-family:Courier New;font-size:125%">sub</p>|<p> </p> |<p style="text-align:left;">Substitute the matched pattern with the provided string</p>|

## Matching a pattern
As you can see from the above table, there are many ways to match patterns. All but the last one use the basic syntax of 

```re.function_name(pattern, string)```.

`search`, `match`, and `fullmatch` return a _Match_ object (or `None` when there is no match), while `split` and `findall` return lists. By convention, the match object is assigned the name _m_, and a long regex pattern is saved in a separate variable named _p_. This is an example of conventions overruling the standard rules of good variable naming that avoid single letter names [<sup>1</sup>](#fn1). If we also put the target string in a variable, we will often see this basic syntax:

```m = re.function_name(pattern=p, string=target_variable)```

We'll begin with a simple example using `re.match` and incorporate other regex functions as needed. Let's create a pattern to match a 10 digit US telephone number. We'll start by simply matching 10 digits.

---
<sup>1</sup><span id="fn1" style="font-size:85%"> But not really a violation, right? If a reader understands at a glance what the variable is, that makes it a good variable name.</span>

In [None]:
# Import the re library


In [None]:
# Match 10 digits


In [None]:
# print the m object contents
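
The three cells above could be filled in along these lines. The sample number is an assumption: it is the digits of (501) 450-5000 with the separators removed, and `tele_num` is a made-up variable name. The pattern is written as a raw string (`r'...'`) so the backslashes reach `re` unchanged.

```python
import re

# Assumed sample value: ten digits with no separators
tele_num = '5014505000'

# Ten \d tokens in a row match ten consecutive digits
m = re.match(pattern=r'\d\d\d\d\d\d\d\d\d\d', string=tele_num)

# Printing the match object shows the span and the matched text
print(m)
```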


You'll see that the match object identifies the location of the matching text and the exact text that was matched. Sometimes we don't care where the match occurs and only want to know whether it occurs. When nothing matches, `re.match` returns `None` instead of a match object, so the `bool()` function gives you `True` for a match and `False` for no match.

In [None]:
# Print the boolean value of the m object
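
A quick sketch of the boolean check, using one string that matches and one that does not. Both sample strings and the name `no_match` are assumptions for illustration.

```python
import re

p = r'\d\d\d\d\d\d\d\d\d\d'

m = re.match(p, '5014505000')
no_match = re.match(p, 'coconut')

print(bool(m))          # True
print(bool(no_match))   # False
print(no_match)         # None
```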


We might just be looking to match an expression so we can do something if it is there and something else if it is not there. In that case we can use the match object in an `if` block without extracting the actual boolean value. 

In [None]:
# Using match results in a loop - for future reference
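
Here is a sketch of that idea: the match object goes straight into the `if` test, and a small loop tries it on two assumed sample strings. The `results` list is only there to collect the messages.

```python
import re

p = r'\d\d\d\d\d\d\d\d\d\d'
results = []

for candidate in ['5014505000', 'coconut']:
    m = re.match(p, candidate)
    if m:                                   # a match object counts as True
        results.append(f'{candidate}: match')
    else:                                   # None counts as False
        results.append(f'{candidate}: no match')

print('\n'.join(results))
```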

    

We can also use methods of the match object to extract parts of the span or the string that matched the pattern.

In [None]:
# Using match methods to get parts of the results

# get the first index of the matched string


# get the last index of the matched string


# get the first and last index of the matched string


# get the string that matched the pattern
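
The match object methods for these four tasks are `start()`, `end()`, `span()`, and `group()`. A sketch with the assumed ten-digit sample follows. Note that `end()` follows the slicing rule: it is one position past the last matched character.

```python
import re

m = re.match(r'\d\d\d\d\d\d\d\d\d\d', '5014505000')

start_index = m.start()    # index where the match begins
end_index = m.end()        # index just past the end of the match
both = m.span()            # (start, end) as a tuple
matched_text = m.group()   # the text that matched the pattern

print(start_index, end_index, both, matched_text)
# 0 10 (0, 10) 5014505000
```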


Telephone numbers are rarely represented as 10 digits with no separators. What if there are spaces between the groupings?

In [None]:
# Telephone number with spaces


# Simplify the previous pattern using the {} repetition syntax, 
# then use it on the new example.
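
A sketch of the simplified pattern: `\d{10}` means the same thing as ten `\d` tokens. The spaced number is an assumed sample built from the same digits.

```python
import re

tele_num = '5014505000'
tele_num_spaces = '501 450 5000'

# {10} repeats the previous token exactly ten times
m_plain = re.match(r'\d{10}', tele_num)
m_spaces = re.match(r'\d{10}', tele_num_spaces)

print(m_plain)
print(m_spaces)    # None
```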


There was no match because the new telephone number contains spaces, so we will need to change our expression. That means we want a pattern with three digits, a space, three digits, a space, and then four digits.

<p style="color:green;font-size:120%">NOTE: Due to the large number of "typo candidates" in the code remaining in this notebook, a working version will be provided in a text file on Bb at the end of class. You can copy and paste that code into the appropriate cells in your notebook if you run into any problems.</p>

In [None]:
# Pattern with spaces
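
One way to write the three-digits, space, three-digits, space, four-digits pattern is sketched below. It uses `\s` from the special characters table; a literal space in the pattern would also work for this sample.

```python
import re

tele_num = '5014505000'
tele_num_spaces = '501 450 5000'

p = r'\d{3}\s\d{3}\s\d{4}'

m_spaces = re.match(p, tele_num_spaces)
m_plain = re.match(p, tele_num)

print(m_spaces.group())   # 501 450 5000
print(m_plain)            # None, because the spaces are now required
```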


But we also want it to be compatible with the first telephone number that had no spaces. To generalize between the two cases, we want to match either zero or one space at a time. We need the `?` from the above syntax table to do that. This is also a good time to move the pattern to a separate variable.

In [None]:
# Backward compatible pattern - run it on both telephone numbers
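
A sketch of the backward compatible version: adding `?` after each `\s` makes the space optional, so both assumed sample numbers match.

```python
import re

tele_num = '5014505000'
tele_num_spaces = '501 450 5000'

# \s? matches zero or one whitespace character
p = r'\d{3}\s?\d{3}\s?\d{4}'

m_plain = re.match(p, tele_num)
m_spaces = re.match(p, tele_num_spaces)

print(m_plain.group())    # 5014505000
print(m_spaces.group())   # 501 450 5000
```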


Don't we normally see telephone numbers formatted with parentheses and a dash like (501) 450-5000? Let's add that capability to our pattern.

In [None]:
# Add parentheses and a dash to the pattern but keep it compatible
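
Parentheses have a special meaning in regex, so they must be escaped with a backslash to match them literally. The sketch below is one possible pattern, not the only one. It is permissive: every separator is optional, so it would also accept oddities such as a missing closing parenthesis.

```python
import re

# \(? and \)? are optional literal parentheses, -? is an optional dash
p = r'\(?\d{3}\)?\s?\d{3}\s?-?\d{4}'

numbers = ['(501) 450-5000', '501 450 5000', '5014505000']
matched = [re.match(p, n).group() for n in numbers]

print(matched)
```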


What if we have a US country code before the number?

In [None]:
# Add US country code
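
For the US country code, this sketch assumes it appears as `1` or `+1`, optionally followed by a space. The `+` is a regex operator, so it is escaped as `\+`. The two numbers with a country code are made-up variations on the example above.

```python
import re

# \+? optional plus sign, 1? optional country code, \s? optional space
p = r'\+?1?\s?\(?\d{3}\)?\s?\d{3}\s?-?\d{4}'

numbers = ['+1 (501) 450-5000', '1 501 450 5000', '(501) 450-5000']
matched = [re.match(p, n).group() for n in numbers]

print(matched)
```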


What if the parts of the number are divided by periods, a particular favorite of Dr. Ellis? We can add to the pattern the syntax to also match zero or one period at each divider. Let's also make sure it works for all the phone numbers we've tried so far.

In [None]:
# Add periods as dividers.


In [None]:
# Test our pattern with all the phone numbers we've used.
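
A sketch covering both cells above: `\.?` is added at each divider (the period is escaped because a bare `.` matches any character), and the pattern is then tried on every format used so far. All of the sample numbers are assumed variations on (501) 450-5000, and `results` is a made-up name.

```python
import re

p = r'\+?1?\s?\.?\(?\d{3}\)?\s?\.?\d{3}\s?-?\.?\d{4}'

numbers = [
    '5014505000',
    '501 450 5000',
    '(501) 450-5000',
    '+1 (501) 450-5000',
    '501.450.5000',
]

# Record whether each number matched
results = {n: bool(re.match(p, n)) for n in numbers}

for number, found in results.items():
    print(f'{number:<20}{found}')
```

The pattern is already hard to read at a glance, which is the first point from the start of this section.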


<p style="font-size:150%">WAIT! What if you wanted to make it work for <span style="color:green"><i>all country codes?</i></span></p>

Take a look at this: https://countrycode.org/

---
##### Does this ever end? And this is just telephone numbers! Now you should understand the important points mentioned at the start of this section.

Sometimes you need to use regular expressions because it is the best way to do what you need to do. A good rule of thumb is to use the simplest method that gets the job done.

## Finding a pattern
Use the `findall()` function to find all matches within a string. This example will find all the digits in a string.

In [None]:
# Find all the digits
s = "Campus: 356 total campus acreage, 124 campus buildings and facilities, "\
    "3,242,632 building square feet maintained, 2018 Enrollment: 11,177"

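A sketch of `findall()` on the string above. `\d` alone returns every digit separately, which is rarely what we want, so two variations are shown as well. The last pattern is an assumption about what would be useful here; it keeps the commas inside a number together with its digits.

```python
import re

s = "Campus: 356 total campus acreage, 124 campus buildings and facilities, "\
    "3,242,632 building square feet maintained, 2018 Enrollment: 11,177"

single_digits = re.findall(r'\d', s)          # one list item per digit
digit_runs = re.findall(r'\d+', s)            # runs of consecutive digits
with_commas = re.findall(r'\d[\d,]*', s)      # a digit, then digits or commas

print(len(single_digits))
print(digit_runs)
print(with_commas)
```

The comma-aware pattern works for this string, but it would also pick up a trailing comma if a number were immediately followed by one.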

## Compiling a pattern
If we need to use a pattern multiple times, maybe to do the same thing on multiple columns or rows, we can **_compile_** it to reuse it easily. We write the pattern as before, but instead of only saving it in a string variable, we pass it to `re.compile()` and save the resulting pattern object. Then we call regex methods such as `match()` and `findall()` directly on that compiled object.

The examples below show how our previous examples work when using this compile capability.

In [None]:
# Matching telephone numbers with a compiled pattern
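
A sketch of the compiled version of a telephone pattern. The name `phone_re` is an assumption, and the pattern is one possible version that allows parentheses, spaces, and a dash.

```python
import re

# Compile once, then reuse the pattern object
phone_re = re.compile(r'\(?\d{3}\)?\s?\d{3}\s?-?\d{4}')

# The target string is the only argument; the pattern is already built in
m = phone_re.match('(501) 450-5000')
print(m.group())    # (501) 450-5000
```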


In [None]:
# Using findall with a compiled pattern
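
The same idea with `findall()`, sketched with the campus string from the previous section. `digits_re` is a made-up name.

```python
import re

s = "Campus: 356 total campus acreage, 124 campus buildings and facilities, "\
    "3,242,632 building square feet maintained, 2018 Enrollment: 11,177"

digits_re = re.compile(r'\d+')
found = digits_re.findall(s)
print(found)
```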


In [None]:
# Matching all telephone numbers in a loop with a compiled pattern
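
A sketch of the loop: the pattern is compiled once and reused on every item. The list of candidates, including the one deliberate non-number, is assumed sample data.

```python
import re

phone_re = re.compile(r'\+?1?\s?\.?\(?\d{3}\)?\s?\.?\d{3}\s?-?\.?\d{4}')

candidates = [
    '5014505000',
    '501 450 5000',
    '(501) 450-5000',
    '+1 (501) 450-5000',
    '501.450.5000',
    'coconut',
]

matched = []
for candidate in candidates:
    m = phone_re.match(candidate)
    if m:
        matched.append(m.group())
        print(f'{candidate:<20}match')
    else:
        print(f'{candidate:<20}no match')
```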
