In [14]:
import pandas as pd

# Python String Methods

First, we'll introduce a few methods useful for string manipulation. The following table lists several string operations supported by both Python and `pandas`. The Python methods operate on a single string, while their `pandas` equivalents are **vectorized**: they operate on an entire Series of string data at once.

| Operation             | Python String Methods | Series (Dataframe Columns) String Methods | Explanation|
|-----------------------|-----------------|---------------------------|-- |
| Transformation        | `s.lower()`, `s.upper()`  | `ser.str.lower()`, `ser.str.upper()`      | `lower` converts all characters to lowercase; `upper` converts all characters to uppercase |
| Replacement/Deletion  | `s.replace(_)`  | `ser.str.replace(_)`      | Replaces every occurrence of the first argument with the second argument. If the second argument is the empty string (`''`), this effectively deletes the first argument. |
| Split                 | `s.split(_)`    | `ser.str.split(_)`        | Splits the string at every occurrence of the argument passed to `.split(_)`. The output is a list of strings |
| Substring             | `s[1:4]`        | `ser.str[1:4]`            | Returns the characters at indices 1 through 3 (0-indexed; the end index 4 is excluded) |
| Membership            | `'_' in s`      | `ser.str.contains(_)`     | Returns `True`/`False` depending on whether the string contains the value |
| Length                | `len(s)`        | `ser.str.len()`           | Returns the length of the string |
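
To make the table concrete, here is a minimal sketch that applies each operation first to a single Python string and then to a Series. The example strings and the variable names `s` and `ser` are illustrative assumptions rather than part of the original notebook. Note that `ser.str.replace` is called with `regex=False` so that the pattern is treated as a literal string.

```python
import pandas as pd

s = 'Lewis & Clark County'                                   # a single Python string
ser = pd.Series(['Lewis & Clark County', 'De Witt County'])  # a Series of strings

# Transformation
lower_s = s.lower()
lower_ser = ser.str.lower()

# Replacement
replaced_s = s.replace('&', 'and')
replaced_ser = ser.str.replace('&', 'and', regex=False)

# Split (each result is a list of strings)
split_s = s.split(' ')
split_ser = ser.str.split(' ')

# Substring (indices 1, 2, and 3)
sub_s = s[1:4]
sub_ser = ser.str[1:4]

# Membership
has_county_s = 'County' in s
has_county_ser = ser.str.contains('County')

# Length
len_s = len(s)
len_ser = ser.str.len()

print(lower_ser.tolist())
print(replaced_ser.tolist())
print(sub_ser.tolist(), len_ser.tolist())
```

Each Series method returns a new Series with one result per row, so the calls can be chained without writing an explicit loop.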

## Example of Cleaning Data

In [15]:
with open('data/county_and_state.csv') as f:
    county_and_state = pd.read_csv(f)
    
with open('data/county_and_population.csv') as f:
    county_and_pop = pd.read_csv(f)

display(county_and_state), display(county_and_pop);

Unnamed: 0,County,State
0,De Witt County,IL
1,Lac qui Parle County,MN
2,Lewis and Clark County,MT
3,St John the Baptist Parish,LS


Unnamed: 0,County,Population
0,DeWitt,16798
1,Lac Qui Parle,8067
2,Lewis & Clark,55716
3,St. John the Baptist,43044


If we want to join the two tables above, we run into a problem: the `County` column is not standardized across the two tables. To merge them on `County`, we first need to clean the data. Specifically, we want to:
- Convert all strings to lowercase (to deal with the difference between `Lac qui Parle` and `Lac Qui Parle`)
- Remove all spaces (to deal with the difference between `DeWitt` and `De Witt`)
- Replace every `&` with `and` (to deal with the difference between `Lewis & Clark` and `Lewis and Clark`)
- Remove all periods (to deal with the difference between `St. John` and `St John`)
- Remove all instances of the words `county` and `parish`

The code below performs all of these steps using the Series string methods introduced above.

In [16]:
def canonicalize_county_series(county_series):
    return (
        county_series
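            .str.lower()                             # lowercase everything
            .str.replace(' ', '', regex=False)       # remove spaces
            .str.replace('&', 'and', regex=False)    # standardize '&' to 'and'
            .str.replace('.', '', regex=False)       # remove periods (literal match, not regex)
            .str.replace('county', '', regex=False)  # drop the word 'county'
            .str.replace('parish', '', regex=False)  # drop the word 'parish'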
    )

county_and_pop['clean_county_pandas'] = canonicalize_county_series(county_and_pop['County'])
county_and_state['clean_county_pandas'] = canonicalize_county_series(county_and_state['County'])
display(county_and_pop), display(county_and_state);


Unnamed: 0,County,Population,clean_county_pandas
0,DeWitt,16798,dewitt
1,Lac Qui Parle,8067,lacquiparle
2,Lewis & Clark,55716,lewisandclark
3,St. John the Baptist,43044,stjohnthebaptist


Unnamed: 0,County,State,clean_county_pandas
0,De Witt County,IL,dewitt
1,Lac qui Parle County,MN,lacquiparle
2,Lewis and Clark County,MT,lewisandclark
3,St John the Baptist Parish,LS,stjohnthebaptist


We can now merge these datasets!

In [17]:
county_and_pop.merge(county_and_state, on='clean_county_pandas', how='inner')

Unnamed: 0,County_x,Population,clean_county_pandas,County_y,State
0,DeWitt,16798,dewitt,De Witt County,IL
1,Lac Qui Parle,8067,lacquiparle,Lac qui Parle County,MN
2,Lewis & Clark,55716,lewisandclark,Lewis and Clark County,MT
3,St. John the Baptist,43044,stjohnthebaptist,St John the Baptist Parish,LS
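
To see why the cleaning step matters, the sketch below compares a merge on the raw `County` column with a merge on the cleaned key. It rebuilds the two tables inline from the values displayed above so that it can run without the CSV files. The helper name `clean` and the column name `key` are assumptions made for this illustration.

```python
import pandas as pd

# Tables rebuilt from the values displayed above.
county_and_pop = pd.DataFrame({
    'County': ['DeWitt', 'Lac Qui Parle', 'Lewis & Clark', 'St. John the Baptist'],
    'Population': [16798, 8067, 55716, 43044],
})
county_and_state = pd.DataFrame({
    'County': ['De Witt County', 'Lac qui Parle County',
               'Lewis and Clark County', 'St John the Baptist Parish'],
    'State': ['IL', 'MN', 'MT', 'LS'],
})

def clean(county_series):
    """Apply the same cleaning steps as canonicalize_county_series."""
    return (
        county_series
            .str.lower()
            .str.replace(' ', '', regex=False)
            .str.replace('&', 'and', regex=False)
            .str.replace('.', '', regex=False)
            .str.replace('county', '', regex=False)
            .str.replace('parish', '', regex=False)
    )

# Merging on the raw names: no two spellings agree exactly.
raw_merge = county_and_pop.merge(county_and_state, on='County', how='inner')

# Merging on the cleaned key: every county finds its partner.
county_and_pop['key'] = clean(county_and_pop['County'])
county_and_state['key'] = clean(county_and_state['County'])
clean_merge = county_and_pop.merge(county_and_state, on='key', how='inner')

print(len(raw_merge), len(clean_merge))
```

An inner merge keeps only rows whose keys match exactly, so the raw merge comes back empty while the cleaned merge keeps all four counties.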


## Extraction

Let's say we want to read some data from a `.txt` file:

In [18]:
with open('data/log.txt', 'r') as f:
    log_lines = f.readlines()

log_lines

['169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"\n',
 '193.205.203.3 - - [2/Feb/2005:17:23:6 -0800] "GET /stat141/Notes/dim.html HTTP/1.0" 404 302 "http://eeyore.ucdavis.edu/stat141/Notes/session.html"\n',
 '169.237.46.240 - "" [3/Feb/2006:10:18:37 -0800] "GET /stat141/homework/Solutions/hw1Sol.pdf HTTP/1.1"\n']

The data appears to follow the format `'[IP Address] - - [Date and Time] "GET and HTTP URLs" number number "web URL"\n'`. Let's say we want to extract the date and time. Unfortunately, the date and time start at a different position in each line (the IP addresses vary in length, for example), so we can't simply slice each string at fixed indices.
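
A quick check makes the problem visible. The sketch below uses truncated copies of the three log lines above (only the part up to the closing bracket) and the illustrative names `log_prefixes`, `bracket_positions`, and `fixed_slices`, none of which appear in the original notebook.

```python
# Truncated copies of the three log lines above (everything after ']' is dropped).
log_prefixes = [
    '169.237.46.168 - - [26/Jan/2014:10:47:58 -0800]',
    '193.205.203.3 - - [2/Feb/2005:17:23:6 -0800]',
    '169.237.46.240 - "" [3/Feb/2006:10:18:37 -0800]',
]

# Index of the opening bracket in each line: it is not the same everywhere.
bracket_positions = [line.index('[') for line in log_prefixes]
print(bracket_positions)

# A fixed slice that captures the day in the first line misses it in the others.
fixed_slices = [line[20:22] for line in log_prefixes]
print(fixed_slices)
```

Because the bracket sits at a different index in every line, any slice tuned to one line returns the wrong characters for the others.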

Instead, we can use Python's `str.split` method.

In [19]:
for line in log_lines:                            # Iterate through every line of the log
    pertinent = line.split('[')[1].split(']')[0]  # Isolate the text inside the first pair of square brackets []
    day, month, rest = pertinent.split('/')       # Split on the slashes to get the day and month
    year, hour, minute, rest = rest.split(':')    # Split on the colons to get the year, hour, and minute
    seconds, time_zone = rest.split(' ')          # Split on the space to get the seconds and time zone
    print(day, month, year, hour, minute, seconds, time_zone) # Print everything

26 Jan 2014 10 47 58 -0800
2 Feb 2005 17 23 6 -0800
3 Feb 2006 10 18 37 -0800
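
The same split-based extraction can also be written with the vectorized Series string methods from the start of this section. The following is a minimal sketch rather than part of the original notebook: it uses truncated copies of the three log lines, and the names `logs`, `timestamps`, and `date_parts` (along with the column labels) are assumptions chosen for illustration.

```python
import pandas as pd

# Truncated copies of the three log lines shown above (trailing fields dropped for brevity).
log_lines = [
    '169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1"',
    '193.205.203.3 - - [2/Feb/2005:17:23:6 -0800] "GET /stat141/Notes/dim.html HTTP/1.0"',
    '169.237.46.240 - "" [3/Feb/2006:10:18:37 -0800] "GET /stat141/homework/Solutions/hw1Sol.pdf HTTP/1.1"',
]
logs = pd.Series(log_lines)

# .str.split returns a list per row; .str[i] then picks element i from each list.
timestamps = logs.str.split('[').str[1].str.split(']').str[0]

# expand=True spreads the split pieces into separate DataFrame columns.
date_parts = timestamps.str.split('/', expand=True)
date_parts.columns = ['day', 'month', 'rest']  # hypothetical column labels

print(timestamps.tolist())
print(date_parts)
```

This version avoids the explicit loop, at the cost of being a little harder to read step by step.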


Another tool we can use to solve this problem is regular expressions, which will be discussed in the next subchapter.