# Working with text
* Working with strings in Python and pandas
* Scraping text from the web
* Regular expressions
* ???

## String functions in Python
* Python has rich built-in support for text processing
* We can do powerful things with strings in very little code
* Let's take a look at the methods available in the string class...

In [None]:
# dir() lists all attributes (including methods) of the provided object 
msg = "Hello, World!"
print(dir(msg))
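
The raw `dir()` output is dominated by double-underscore ("dunder") attributes that we rarely call directly. As a rough sketch, we can filter the list down to the public names; the variable name `public_methods` is just an illustrative choice.

```python
msg = "Hello, World!"

# Keep only the public attributes (names that do not start with an underscore)
public_methods = [name for name in dir(msg) if not name.startswith('_')]

# How many are there, and what do the first few look like?
print(len(public_methods))
print(public_methods[:5])
```

Calling `help(str.find)` (or any other method) prints the documentation for a single entry in this list.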

### A few examples

In [None]:
msg = "Hello, World!"

# Convert to lowercase
print(1, 'ABCDE'.lower())

# Find the index of the first occurrence of 'o' in msg
print(2, msg.find('o'))   

# Split the string into a list on commas
print(3, msg.split(','))

# Replace a substring with another string
print(4, msg.replace('World', 'Mars'))

# Select the characters at indices 9 down to 3, in reverse order (the stop index, 2, is excluded)
print(5, msg[9:2:-1])

# Separate all characters with an exclamation mark (a string is already iterable)
print(6, '!'.join(msg))

# Check if a string starts or ends with another string
print(7, msg.startswith('Hell'), msg.endswith('!'))

# Most string methods return a new string, so we can chain calls
msg = "  A seNtEnCe witH somE Case ISsuES aNTYPOd exTra WhitesPACE.  "
print(8, msg.strip().lower().capitalize().replace('typo', ''))
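
Two details from the examples above are easy to trip over: a slice includes its start index but excludes its stop index, and `find()` signals "not found" with `-1` instead of raising an error. A small sketch to check both (the names `forward` and `backward` are only for illustration):

```python
msg = "Hello, World!"

# A slice is [start:stop:step]; the start index is included, the stop index is not
forward = msg[3:10]      # indices 3 through 9
backward = msg[9:2:-1]   # indices 9 down to 3

print(forward)
print(backward)

# The reversed slice is the mirror image of the forward one
print(backward == forward[::-1])

# find() returns -1 when the substring is absent, rather than raising an error
print(msg.find('z'))
```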

## String formatting
* Python has a powerful built-in string formatting engine (examples: [1](https://pyformat.info/), [2](https://mkaz.tech/python-string-format.html))

In [None]:
# Define a string template
template = "I bought a ${:.2f} plane ticket from {} to {} yesterday."

# Variables we want to inject into the string
source = 'Austin'
dest = 'Chicago'
price = 282.45708284

# Format the string
print(template.format(price, source, dest))
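
Positional `{}` fields are filled in order, which gets fragile as a template grows. As a sketch of two common alternatives, the same sentence can be written with named fields or, assuming Python 3.6 or later, as an f-string:

```python
source = 'Austin'
dest = 'Chicago'
price = 282.45708284

# Named placeholders: arguments are matched by name, so their order no longer matters
template = "I bought a ${price:.2f} plane ticket from {source} to {dest} yesterday."
named = template.format(dest=dest, source=source, price=price)

# f-string (Python 3.6+): expressions inside the braces are evaluated in place
inline = f"I bought a ${price:.2f} plane ticket from {source} to {dest} yesterday."

print(named)
print(named == inline)
```

Named fields are the better fit when the template lives apart from the code, because the template itself documents what it expects.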

### External report templates
We can use string formatting to populate external report templates, which can then be maintained and edited independently of the code.

In [None]:
# Read in the animal outcome data
import pandas as pd
data = pd.read_csv('../data/Austin_Animal_Center_Outcomes.csv')

# Or, uncomment the next two lines if you don't have the local file
# url = 'https://raw.githubusercontent.com/tyarkoni/SSI2016/master/data/Austin_Animal_Center_Outcomes.csv'
# data = pd.read_csv(url)

# Read in an HTML template (the with block closes the file once it has been read)
with open('../templates/report.html') as template_file:
    template = template_file.read()

# Restrict the dataset to dogs with names (.dropna() removes rows with a missing
# value in any of the selected columns)
dog_data = data[data['Animal Type']=='Dog']
dogs_with_names = dog_data[['DateTime', 'Name', 'Breed', 'Outcome Type']].dropna()

# Populate the fields in the template with data. Argument names passed to .format() must
# match the names inside the {}'s in the template.
formatted_html = template.format(
        dataset_name = 'Austin Animal Center Outcomes',
        number_of_rows = len(data),
        table_contents = dogs_with_names.iloc[:100].to_html(index=False)
    )

# Write the result to a file. Open this in a browser and take a look!
# The with block closes the file, which guarantees the contents are flushed to disk.
with open('../templates/formatted_reported.html', 'w') as report_file:
    report_file.write(formatted_html)
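
The contents of `report.html` are not shown here, so the following is only a hypothetical stand-in for what such a template might look like. It uses the same three field names as the call above and illustrates one gotcha: literal braces in the template (as in CSS) have to be doubled, otherwise `.format()` treats them as fields.

```python
# Hypothetical stand-in for an external template file such as report.html.
# Literal braces (as in the CSS rule) are doubled so .format() leaves them alone.
template = """<html>
<head><style>table {{ border-collapse: collapse; }}</style></head>
<body>
<h1>{dataset_name}</h1>
<p>Rows in dataset: {number_of_rows}</p>
{table_contents}
</body>
</html>"""

# Made-up values for illustration; in the real cell these come from the DataFrame
formatted_html = template.format(
    dataset_name='Example dataset',
    number_of_rows=3,
    table_contents='<table><tr><td>Rex</td></tr></table>',
)
print(formatted_html)
```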

## Strings in pandas
* Pandas provides vectorized versions of many Python string methods that operate on a whole column at once
* Or, we can apply an arbitrary Python function to each value in a column (slower)
* We'll look at one example of each of these approaches
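
To make the two approaches concrete before turning to the real data, here is a minimal sketch on a toy Series. The values and the helper `tidy` are invented for illustration.

```python
import pandas as pd

# Toy data invented for illustration
names = pd.Series(['Rex', 'BELLA', 'max'])

# Approach 1: a pandas string method, reached through the .str accessor
via_str = names.str.lower()

# Approach 2: apply an arbitrary Python function to each element
def tidy(name):
    return name.lower()

via_apply = names.apply(tidy)

print(via_str.tolist())
print(via_apply.tolist())
```

Both give the same result here; `.apply()` becomes worthwhile when the logic has no ready-made `.str` equivalent.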

### How do we handle color?
* We might want to know if animals' outcomes differ by color
* But many animals have mixed colors; e.g., "Orange/White"
* To analyze based on color, we might want to focus on the primary color
    * We need to split the color string and keep just the first element
    * We should probably also have an indicator tracking mixed vs. single color

In [None]:
data['Color'].unique()[:50]

In [None]:
# pandas string methods are accessed through a column's str attribute

# store the first color in the DataFrame
data['first_color'] = data['Color'].str.split('/').str.get(0)

# also store the number of '/' separators as a mixed-color indicator
# (0 = single color, 1 or more = mixed)
data['mixed_color'] = data['Color'].str.count('/')

data[['Color', 'first_color', 'mixed_color']].head(10)
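
The cell above stores the number of `/` separators in each color string. If a strict True/False indicator is preferred, `str.contains` is one option. A sketch on a toy DataFrame (the three color values are made up to mirror the format of the real column):

```python
import pandas as pd

# Toy data invented for illustration; the real values come from data['Color']
colors = pd.DataFrame({'Color': ['Orange/White', 'Black', 'Brown Tabby/White']})

# Primary color: the text before the first '/'
colors['first_color'] = colors['Color'].str.split('/').str.get(0)

# Boolean indicator: True when the color string contains a '/'
# (regex=False treats '/' as a plain substring rather than a pattern)
colors['mixed_color'] = colors['Color'].str.contains('/', regex=False)

print(colors)
```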