In [1]:
import numpy as np
# import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

import re # regular expressions

## Python String Methods

Python provides a variety of methods for basic string manipulation. Although simple, these methods form the primitives that piece together to form more complex string operations. We will introduce Python's string methods in the context of a common use case for working with text: data cleaning.

In [None]:
'ca' > 'hi'
max('az', 'ca', 'hi')

In [None]:
df = pd.DataFrame({'State': ['AZ', 'HI', 'CA', 'AZ', 'HI', 'CA', 'AZ', 'CA', 'HI'],
                   'Tax': [3, 1, 4, 1, 5, 9, 3, 4, 6],
                   'County': ['ak','tx','fl','hi','mi','ak','ca','sd','nc']
                   })
df

In [None]:
df.groupby(['State']).agg(max)
#df.groupby(['State', 'Tax']).agg(max)


## Cleaning Text Data

Data often comes from several different sources that each implements its own way of encoding information. In the following example, we have one table that records the state that a county belongs to and another that records the population of the county.

In [2]:
state = pd.DataFrame({
    'County': [
        'Lac qui Parle County',
        'De Witt County',
        'Lewis and Clark County',
        'St John the Baptist Parish',
    ],
    'State': [
        'IL',
        'MN',
        'MT',
        'LA',
    ]
})
population = pd.DataFrame({
    'County': [
        'DeWitt  ',
        'Lac Qui Parle',
        'Lewis & Clark',
        'St. John the Baptist',
    ],
    'Population': [
        '16,798',
        '8,067',
        '55,716',
        '43,044',
    ]
})

In [3]:
state

|     | County                     | State |
| --- | -------------------------- | ----- |
| 0   | Lac qui Parle County       | IL    |
| 1   | De Witt County             | MN    |
| 2   | Lewis and Clark County     | MT    |
| 3   | St John the Baptist Parish | LA    |


In [4]:
population

|     | County               | Population |
| --- | -------------------- | ---------- |
| 0   | DeWitt               | 16,798     |
| 1   | Lac Qui Parle        | 8,067      |
| 2   | Lewis & Clark        | 55,716     |
| 3   | St. John the Baptist | 43,044     |


We would naturally like to join the `state` and `population` tables using the `County` column. Unfortunately, not a single county is spelled the same in the two tables. This example is illustrative of the following common issues in text data:

1.  Capitalization: `qui` vs `Qui`
1.  Different punctuation conventions: `St.` vs `St` 
1.  Omission of words: `County`/`Parish` is absent in the `population` table
1.  Use of whitespace: `DeWitt` vs `De Witt`
1.  Different abbreviation conventions: `&` vs `and`

## String Methods

Python's string methods allow us to start resolving these issues. These methods are conveniently defined on all Python strings and thus do not require importing other modules. Although it is worth familiarizing yourself with [the complete list of string methods](https://docs.python.org/3/library/stdtypes.html#string-methods), we describe a few of the most commonly used methods in the table below.

| Method              | Description                                                                 |
| ------------------- | --------------------------------------------------------------------------- |
| `str[x:y]`          | Slices `str`, returning indices x (inclusive) to y (not inclusive)          |
| `str.lower()`       | Returns a copy of a string with all letters converted to lowercase          |
| `str.replace(a, b)` | Replaces all instances of the substring `a` in `str` with the substring `b` |
| `str.split(a)`      | Returns substrings of `str` split at a substring `a`                        |
| `str.strip()`       | Removes leading and trailing whitespace from `str`                          |
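
As a quick illustration of the methods in the table, the sketch below applies each one to a made-up county string. The variable names are ours and chosen only for this example.

```python
# A made-up county name with stray whitespace (illustrative only)
raw = '  St. John the Baptist Parish '

stripped = raw.strip()                # remove leading and trailing whitespace
lowered = stripped.lower()            # convert every letter to lowercase
no_period = lowered.replace('.', '')  # delete every period
words = no_period.split(' ')          # split into a list at each space
first_two = no_period[0:2]            # characters at indices 0 and 1

print(stripped, lowered, no_period, words, first_two, sep='\n')
```

Each method returns a new string (or list) and leaves the original untouched, which is why the calls can be chained.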


We select the string for St. John the Baptist parish from the `state` and `population` tables and apply string methods to remove capitalization, punctuation, and `county`/`parish` occurrences.

In [None]:
john1 = state.loc[3, 'County']
john2 = population.loc[3, 'County']
print(john1)
print(john2)

In [None]:
(john1
 .lower()
 .strip()
 .replace(' parish', '')
 .replace(' county', '')
 .replace('&', 'and')
 .replace('.', '')
 .replace(' ', '')
)

Applying the same set of methods to `john2` allows us to verify that the two strings are now identical.

In [None]:
(john2
 .lower()
 .strip()
 .replace(' parish', '')
 .replace(' county', '')
 .replace('&', 'and')
 .replace('.', '')
 .replace(' ', '')
)

We can create a function called `clean_county` that normalizes an input county name.

In [None]:
def clean_county(county):
    return (county
            .lower()
            .strip()
            .replace(' county', '')
            .replace(' parish', '')
            .replace('&', 'and')
            .replace(' ', '')
            .replace('.', ''))

We may now verify that the `clean_county` function produces matching names for all the counties in both tables:

In [None]:
([clean_county(county) for county in state['County']],
 [clean_county(county) for county in population['County']]
)

Because each county in both tables has the same transformed representation, we may successfully join the two tables using the transformed county names.

## String Methods in pandas

In the code above we used a loop to transform each county name. `pandas` Series objects provide a convenient way to apply string methods to each item in the series. First, the series of county names in the `state` table:

In [None]:
state['County']

The `.str` property on `pandas` Series exposes the same string methods as Python does. Calling a method on the `.str` property calls the method on each item in the series.

In [None]:
state['County'].str.lower()

This allows us to transform each string in the series without using a loop.

In [None]:
(state['County']
 .str.lower()
 .str.strip()
 .str.replace(' parish', '')
 .str.replace(' county', '')
 .str.replace('&', 'and')
 .str.replace('.', '', regex=False)  # literal period, not the regex wildcard
 .str.replace(' ', '')
)

We save the transformed counties back into their originating tables:

In [None]:
state['County'] = (state['County']
 .str.lower()
 .str.strip()
 .str.replace(' parish', '')
 .str.replace(' county', '')
 .str.replace('&', 'and')
 .str.replace('.', '', regex=False)  # literal period, not the regex wildcard
 .str.replace(' ', '')
)

population['County'] = (population['County']
 .str.lower()
 .str.strip()
 .str.replace(' parish', '')
 .str.replace(' county', '')
 .str.replace('&', 'and')
 .str.replace('.', '', regex=False)  # literal period, not the regex wildcard
 .str.replace(' ', '')
)

Now, the two tables contain the same string representation of the counties:

In [None]:
state

In [None]:
population

It is simple to join these tables once the counties match.

In [None]:
state.merge(population, on='County')
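
One way to check that a join like this caught every row is to merge on a separate normalized key and ask `pandas` to flag unmatched rows. The sketch below is only an illustration: it uses small stand-in tables, the `Example` row and the `key` column name are made up, and `clean_county` repeats the function defined earlier.

```python
import pandas as pd

def clean_county(county):
    """Normalize a county name (same steps as the function defined earlier)."""
    return (county
            .lower()
            .strip()
            .replace(' county', '')
            .replace(' parish', '')
            .replace('&', 'and')
            .replace(' ', '')
            .replace('.', ''))

# Small stand-in tables; 'Example' is a made-up row with no partner
state_demo = pd.DataFrame({
    'County': ['Lewis and Clark County', 'St John the Baptist Parish'],
    'State': ['MT', 'LA'],
})
population_demo = pd.DataFrame({
    'County': ['Lewis & Clark', 'St. John the Baptist', 'Example'],
    'Population': ['55,716', '43,044', '0'],
})

# Keep the original spellings and join on a separate normalized key
state_demo['key'] = state_demo['County'].map(clean_county)
population_demo['key'] = population_demo['County'].map(clean_county)

# how='outer' keeps unmatched rows; indicator=True adds a `_merge` column
merged = state_demo.merge(population_demo, on='key', how='outer', indicator=True)
print(merged[['key', 'State', 'Population', '_merge']])
```

Rows marked `left_only` or `right_only` in the `_merge` column are the ones whose cleaned names still disagree.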

## Summary

Python's string methods form a set of simple and useful operations for string manipulation. `pandas` Series expose the same methods through the `.str` property, which applies the underlying Python method to each string in the series.

You may find the complete docs on Python's `str` methods [here](https://docs.python.org/3/library/stdtypes.html#string-methods) and the docs on the `pandas` `.str` methods [here](https://pandas.pydata.org/pandas-docs/stable/text.html#method-summary).

## Regular Expressions

In this section we introduce regular expressions, an important tool to specify patterns in strings.

![https://www.xkcd.com/208/](regular_expressions.png)

Perl is a dynamic programming language that has been nicknamed "the Swiss Army chainsaw of scripting languages" because of its flexibility and power, and also its ugliness. [Source: Wikipedia](https://en.wikipedia.org/wiki/Perl)

## Motivation

In a larger piece of text, many useful substrings come in a specific format. For instance, the sentence below contains a U.S. phone number.

`"give me a call, my number is 123-456-7890."`

The phone number contains the following pattern:

1. Three numbers
1. Followed by a dash
1. Followed by three numbers
1. Followed by a dash
1. Followed by four numbers

Given a free-form segment of text, we might naturally wish to detect and extract the phone numbers. We may also wish to extract specific pieces of the phone numbers—for example, by extracting the area code we may deduce the locations of individuals mentioned in the text.

To detect whether a string contains a phone number, we may attempt to write a function like the following:

In [None]:
def is_phone_number(string):
    
    digits = '0123456789'
    
    def is_not_digit(token):
        return token not in digits 
    
    # Three numbers
    for i in range(3):
        if is_not_digit(string[i]):
            return False
    
    # Followed by a dash
    if string[3] != '-':
        return False
    
    # Followed by three numbers
    for i in range(4, 7):
        if is_not_digit(string[i]):
            return False
        
    # Followed by a dash    
    if string[7] != '-':
        return False
    
    # Followed by four numbers
    for i in range(8, 12):
        if is_not_digit(string[i]):
            return False
    
    return True

In [None]:
is_phone_number("382-384-3840")

In [None]:
is_phone_number("phone number")

The code above is unpleasant and verbose. Rather than manually loop through the characters of the string, we would prefer to specify a pattern and command Python to match the pattern.

**Regular expressions** (often abbreviated **regex**) conveniently solve this exact problem by allowing us to create general patterns for strings. Using a regular expression, we may re-implement the `is_phone_number` function in two short lines of Python:

In [None]:
import re

def is_phone_number(string):
    regex = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"
    return re.search(regex, string) is not None

is_phone_number("382-384-3840")

In the code above, we use the regex `[0-9]{3}-[0-9]{3}-[0-9]{4}` to match phone numbers. Although cryptic at first glance, the syntax of regular expressions is fortunately much simpler to learn than the Python language itself; we introduce nearly all of the syntax in this section alone.

We will also introduce the built-in Python module `re` that performs string operations using regexes. 

## Regex Syntax

We start with the syntax of regular expressions. In Python, regular expressions are most commonly stored as raw strings. Raw strings behave like normal Python strings without special handling for backslashes.

For example, to store the text `hello \n world`, with a literal backslash followed by the letter `n`, in a normal Python string, we must escape the backslash:

In [None]:
# Backslashes need to be escaped in normal Python strings
some_string = 'hello \\n world'
print(some_string)

Using a raw string removes the need to escape the backslash:

In [None]:
# Note the `r` prefix on the string
some_raw_string = r'hello \n world'
print(some_raw_string)

Since backslashes appear often in regular expressions, we will use raw strings for all regexes in this section.
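
To make the difference concrete, the short sketch below compares the lengths of a normal string and a raw string that are typed with the same characters. The variable names are ours.

```python
normal = '\n'     # one character: a newline
raw = r'\n'       # two characters: a backslash and the letter n
escaped = '\\n'   # the same two characters, written with an escaped backslash

print(len(normal), len(raw), len(escaped))
print(raw == escaped)
```

The raw string and the escaped string are equal; the raw form is simply easier to read once a regex contains several backslashes.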

### Literals

A **literal** character in a regular expression matches the character itself. For example, the regex `r"a"` will match any `"a"` in `"Say! I like green eggs and ham!"`. All alphanumeric characters and most punctuation characters are regex literals.

In [None]:
# HIDDEN
def show_regex_match(text, regex):
    """
    Prints the string with the regex match highlighted.
    """
    print(re.sub(f'({regex})', r'\033[1;30;43m\1\033[m', text))

In [None]:
# The show_regex_match function highlights all regex matches
# in the input string
regex = r"green"
show_regex_match("Say! I like green eggs and ham!", regex)

In [None]:
show_regex_match("Say! I like green eggs and ham!", r"a")

In the example above we observe that regular expressions can match patterns that appear anywhere in the input string. In Python, this behavior differs depending on the method used to match the regex—some methods only return a match if the regex appears at the start of the string; some methods return a match anywhere in the string.

Notice also that the `show_regex_match` function highlights all occurrences of the regex in the input string. Again, this differs depending on the Python method used: some methods return all matches while some only return the first match.

Regular expressions are case-sensitive. In the example below, the regex only matches the lowercase `s` in `eggs`, not the uppercase `S` in `Say`.

In [None]:
show_regex_match("Say! I like green eggs and ham!", r"s")

### Wildcard Character

Some characters have special meaning in a regular expression. These meta characters allow regexes to match a variety of patterns.

In a regular expression, the period character `.` matches any character except a newline.

In [None]:
show_regex_match("Call me at 382-384-3840.", r".all")

To match only the literal period character we must escape it with a backslash:

In [None]:
show_regex_match("Call me at 382-384-3840.", r"\.")

By using the period character to mark the parts of a pattern that vary, we construct a regex to match phone numbers. For example, we may take our original phone number `382-384-3840` and replace the numbers with `.`, leaving the dashes as literals. This results in the regex `...-...-....`.

In [None]:
show_regex_match("Call me at 382-384-3840.", "...-...-....")

Since the period character matches all characters, however, the following input string will produce a spurious match.

In [None]:
show_regex_match("My truck is not-all-blue.", "...-...-....")

### Character Classes

A **character class** matches a specified set of characters, allowing us to create more restrictive matches than the `.` character alone. To create a character class, wrap the set of desired characters in brackets `[ ]`.

In [None]:
show_regex_match("I like your gray shirt.", "gr[ae]y")

In [None]:
show_regex_match("I like your grey shirt.", "gr[ae]y")

In [None]:
# Does not match; a character class only matches 
# one character from a set
show_regex_match("I like your graey shirt.", "gr[ae]y")

In [None]:
# In this example, repeating the character class will match
show_regex_match("I like your graey shirt.", "gr[ae][ae]y")

In a character class, the `.` character is treated as a literal, not as a wildcard.

In [None]:
show_regex_match("I like your grey shirt.", "irt[.]")

There are a few special shorthand notations we can use for commonly used character classes:

Shorthand | Meaning
--- | ---
[0-9] | All the digits
[a-z] | Lowercase letters
[A-Z] | Uppercase letters

In [None]:
show_regex_match("I like your gray shirt.", "[a-z]y")

Character classes allow us to create a more specific regex for phone numbers.

In [None]:
# We replaced every `.` character in ...-...-.... with [0-9] to restrict
# matches to digits.
phone_regex = r'[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]'
show_regex_match("Call me at 382-384-3840.", phone_regex)

In [None]:
# Now we no longer match this string:
show_regex_match("My truck is not-all-blue.", phone_regex)

### Negated Character Classes

A **negated character class** matches any character **except** the characters in the class. To create a negated character class, wrap the negated characters in `[^ ]`.

In [None]:
show_regex_match("The car parked in the garage.", r"[^c]ar")
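
The caret only negates a class when it is the first character inside the brackets. The sketch below, which uses `re.findall` from the standard library to list matches, shows both cases; the second input string is made up for illustration.

```python
import re

text = "The car parked in the garage."

# Negated class: any character except `c`, followed by `ar`
negated = re.findall(r"[^c]ar", text)

# Caret placed later in the class: matches a literal `c` or a literal `^`
literal_caret = re.findall(r"[c^]ar", "car ^ar bar")

print(negated)
print(literal_caret)
```

Note that a negated class still has to consume one character, so `[^c]ar` cannot match an `ar` at the very start of a string.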

### Quantifiers

To create a regex to match phone numbers, we wrote:

```
[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]
```

This matches 3 digits, a dash, 3 more digits, a dash, and 4 more digits.

Quantifiers allow us to match multiple consecutive appearances of a pattern. We specify the number of repetitions by placing the number in curly braces `{ }`.

In [None]:
phone_regex = r'[0-9]{3}-[0-9]{3}-[0-9]{4}'
show_regex_match("Call me at 382-384-3840.", phone_regex)

In [None]:
# No match
phone_regex = r'[0-9]{3}-[0-9]{3}-[0-9]{4}'
show_regex_match("Call me at 12-384-3840.", phone_regex)

A quantifier always modifies the character or character class to its immediate left. The following table shows the complete syntax for quantifiers.

Quantifier | Meaning
--- | ---
`{m,n}` | Match the preceding character or character class m to n times.
`{m}` | Match the preceding character or character class exactly m times.
`{m,}` | Match the preceding character or character class at least m times.
`{,n}` | Match the preceding character or character class at most n times.
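
The quantifier forms in the table can be checked with `re.fullmatch`, which succeeds only when the whole string matches the regex. This is a small sketch; the helper name `fully_matches` is ours.

```python
import re

def fully_matches(regex, string):
    """Return True when the entire string matches the regex."""
    return re.fullmatch(regex, string) is not None

# Between 2 and 3 repetitions
two_to_three = [fully_matches(r"a{2,3}", s) for s in ["a", "aa", "aaa", "aaaa"]]

# At least 2 repetitions, and at most 2 repetitions
at_least_two = fully_matches(r"a{2,}", "aaaaa")
at_most_two = [fully_matches(r"a{,2}", s) for s in ["", "aa", "aaa"]]

# Python's `re` treats braces with a space after the comma as literal text
with_space = fully_matches(r"a{2, 3}", "aa")

print(two_to_three, at_least_two, at_most_two, with_space)
```

The last line is a common stumbling block: writing `{2, 3}` with a space silently turns the quantifier into ordinary characters.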

**Shorthand Quantifiers**

Some commonly used quantifiers have a shorthand:

Symbol | Quantifier | Meaning
--- | --- | ---
`*` | `{0,}` | Match the preceding character 0 or more times
`+` | `{1,}` | Match the preceding character 1 or more times
`?` | `{0,1}` | Match the preceding character 0 or 1 times

We use the `*` character instead of `{0,}` in the following examples.

In [None]:
# 3 lowercase a's
show_regex_match('He screamed "Aaaah!" as the cart took a plunge.', "Aa*h!")

In [None]:
# Lots of a's
show_regex_match(
    'He screamed "Aaaaaaaaaaaaaaaaaaaah!" as the cart took a plunge.',
    "Aa*h!"
)

In [None]:
# No lowercase a's
show_regex_match('He screamed "Ah!" as the cart took a plunge.', "Aa*h!")

**Quantifiers are greedy**

Quantifiers will always return the longest match possible. This sometimes results in surprising behavior:

In [None]:
# We tried to match 311 and 911 but matched the ` and ` as well because
# `<311> and <911>` is the longest match possible for `<.+>`.
show_regex_match("Remember the numbers <311> and <911>", "<.+>")

In many cases, using a more specific character class prevents these false matches:

In [None]:
show_regex_match("Remember the numbers <311> and <911>", "<[0-9]+>")
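
Another option, when a narrower character class is not available, is a non-greedy quantifier: adding `?` after `+` or `*` asks for the shortest match instead of the longest. A minimal sketch using `re.findall`:

```python
import re

text = "Remember the numbers <311> and <911>"

greedy = re.findall(r"<.+>", text)       # longest possible match
non_greedy = re.findall(r"<.+?>", text)  # shortest possible matches

print(greedy)
print(non_greedy)
```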

### Anchoring

Sometimes a pattern should only match at the beginning or end of a string. The special character `^` anchors the regex to match only if the pattern appears at the beginning of the string; the special character `$` anchors the regex to match only if the pattern occurs at the end of the string. For example, the regex `well$` only matches an appearance of `well` at the end of the string.

In [None]:
show_regex_match('well, well, well', r"well$")

Using both `^` and `$` requires the regex to match the full string.

In [None]:
phone_regex = r"^[0-9]{3}-[0-9]{3}-[0-9]{4}$"
show_regex_match('382-384-3840', phone_regex)

In [None]:
# No match
show_regex_match('You can call me at 382-384-3840.', phone_regex)
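
One subtlety worth a sketch: in Python, `$` also matches just before a newline at the very end of the string, so a string read from a file with its trailing newline still passes an anchored regex. `re.fullmatch` is a stricter alternative that needs no anchors. The variable names below are ours.

```python
import re

anchored = r"^[0-9]{3}-[0-9]{3}-[0-9]{4}$"
unanchored = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"

# `$` tolerates a single trailing newline
exact = re.search(anchored, "382-384-3840") is not None
with_newline = re.search(anchored, "382-384-3840\n") is not None

# fullmatch requires the regex to cover every character
strict = re.fullmatch(unanchored, "382-384-3840") is not None
strict_with_newline = re.fullmatch(unanchored, "382-384-3840\n") is not None

print(exact, with_newline, strict, strict_with_newline)
```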

### Escaping Meta Characters

All regex meta characters have special meaning in a regular expression. To match meta characters as literals, we escape them using the `\` character.

In [None]:
# `[` is a meta character and requires escaping
show_regex_match("Call me at [382-384-3840].", r"\[")

In [None]:
# `.` is a meta character and requires escaping
show_regex_match("Call me at [382-384-3840].", r"\.")

## Reference Tables

We have now covered the most important pieces of regex syntax and meta characters. For a more complete reference, we include the tables below.

**Meta Characters**

This table includes most of the important *meta characters*, which help us specify certain patterns we want to match in a string.

| Char   | Description                         | Example                    | Matches        | Doesn't Match |
| ------ | ----------------------------------- | -------------------------- | -------------- | ------------- |
| .      | Any character except \n             | `...`                      | abc            | ab<br>abcd    |
| [ ]    | Any character inside brackets       | `[cb.]ar`                  | car<br>.ar     | jar           |
| [^ ]   | Any character _not_ inside brackets | `[^b]ar`                   | car<br>par     | bar<br>ar     |
| \*     | 0 or more of last symbol            | `[pb]*ark`                 | bbark<br>ark   | dark          |
| +      | 1 or more of last symbol            | `[pb]+ark`                 | bbpark<br>bark | dark<br>ark   |
| ?      | 0 or 1 of last symbol               | `s?he`                     | she<br>he      | the           |
| {_n_}  | Exactly _n_ of last symbol          | `hello{3}`                 | hellooo        | hello         |
| &#124; | Pattern before or after bar         | <code>we&#124;[ui]s</code> | we<br>us<br>is | e<br>s        |
| \      | Escapes next character              | `\[hi\]`                   | [hi]           | hi            |
| ^      | Beginning of line                   | `^ark`                     | ark two        | dark          |
| \$     | End of line                         | `ark$`                     | noahs ark      | noahs arks    |
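
The "Matches" and "Doesn't Match" columns read most naturally as statements about the whole string. The sketch below checks the rows that way with `re.fullmatch`, and checks the two anchor rows with `re.search`; the `rows` list is our own transcription of the table.

```python
import re

# (regex, strings that should match, strings that should not match)
rows = [
    (r"...",       ["abc"],              ["ab", "abcd"]),
    (r"[cb.]ar",   ["car", ".ar"],       ["jar"]),
    (r"[^b]ar",    ["car", "par"],       ["bar", "ar"]),
    (r"[pb]*ark",  ["bbark", "ark"],     ["dark"]),
    (r"[pb]+ark",  ["bbpark", "bark"],   ["dark", "ark"]),
    (r"s?he",      ["she", "he"],        ["the"]),
    (r"hello{3}",  ["hellooo"],          ["hello"]),
    (r"we|[ui]s",  ["we", "us", "is"],   ["e", "s"]),
    (r"\[hi\]",    ["[hi]"],             ["hi"]),
]

mismatches = []
for regex, should_match, should_not_match in rows:
    for s in should_match:
        if re.fullmatch(regex, s) is None:
            mismatches.append((regex, s))
    for s in should_not_match:
        if re.fullmatch(regex, s) is not None:
            mismatches.append((regex, s))

# Anchors are about position, so they are checked with re.search
anchor_checks = [
    re.search(r"^ark", "ark two") is not None,
    re.search(r"^ark", "dark") is None,
    re.search(r"ark$", "noahs ark") is not None,
    re.search(r"ark$", "noahs arks") is None,
]

print(mismatches)     # an empty list means every row behaves as listed
print(anchor_checks)
```

With `re.search`, a row such as `[pb]*ark` would find the `ark` inside `dark`, so the choice of matching function matters when reading the table.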

**Shorthand Character Sets**

Some commonly used character sets have shorthands. The bracket forms below are the ASCII equivalents; for Python `str` patterns, the shorthands also match the corresponding Unicode characters.

| Description                          | Bracket Form     | Shorthand |
| ------------------------------------ | ---------------- | --------- |
| Alphanumeric character or underscore | `[a-zA-Z0-9_]`   | `\w`      |
| Not alphanumeric or underscore       | `[^a-zA-Z0-9_]`  | `\W`      |
| Digit                                | `[0-9]`          | `\d`      |
| Not a digit                          | `[^0-9]`         | `\D`      |
| Whitespace                           | `[ \t\n\r\f\v]`  | `\s`      |
| Not whitespace                       | `[^ \t\n\r\f\v]` | `\S`      |
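
A brief sketch of the shorthands in use, with made-up input strings. It rewrites the phone number regex with `\d` and shows what `\w`, `\s`, and `\D` pick up.

```python
import re

# The phone number regex written with the digit shorthand
phones = re.findall(r"\d{3}-\d{3}-\d{4}", "Call 382-384-3840 or 123-456-7890")

# \w includes the underscore but not the hyphen
words = re.findall(r"\w+", "snake_case, kebab-case")

# \s+ matches any run of spaces, tabs, or newlines
pieces = re.split(r"\s+", "a  b\tc\nd")

# \D+ matches runs of non-digits
separators = re.findall(r"\D+", "382-384-3840")

print(phones, words, pieces, separators, sep="\n")
```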

## Summary

Almost all programming languages have a library to match patterns using regular expressions, making them useful regardless of the specific language. In this section, we introduced regex syntax and the most useful meta characters.

## Regex and Python

In this section, we introduce regex usage in Python using the built-in `re` module. Since we only cover a few of the most commonly used methods, you will find it useful to consult [the official documentation on the `re` module](https://docs.python.org/3/library/re.html) as well.

In [None]:
#import re

## `re.search`

`re.search(pattern, string)` searches for a match of the regex `pattern` anywhere in `string`. It returns a truthy match object if the pattern is found; it returns `None` if not.

In [None]:
phone_re = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"
text  = "Call me at 382-384-3840."
match = re.search(phone_re, text)
match

Although the returned match object has a variety of useful properties, we most commonly use `re.search` to test whether a pattern appears in a string.

In [None]:
if re.search(phone_re, text):
    print("Found a match!")

In [None]:
if re.search(phone_re, 'Hello world'):
    print("No match; this won't print")

Another commonly used method, `re.match(pattern, string)`, behaves the same as `re.search` but only checks for a match at the start of `string` instead of a match anywhere in the string.
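
The difference between the two is easiest to see side by side. The sketch below also reads a few of the match object's properties: `group()` for the matched text and `span()` for its position. The second input string is made up.

```python
import re

phone_re = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"

# The number appears mid-string: search finds it, match does not
found = re.search(phone_re, "Call me at 382-384-3840.")
at_start = re.match(phone_re, "Call me at 382-384-3840.")

# The number appears at the start: match succeeds
leading = re.match(phone_re, "382-384-3840 is my number")

print(found.group(), found.span())
print(at_start)
print(leading.group())
```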

## `re.findall`

We use `re.findall(pattern, string)` to extract substrings that match a regex. This method returns a list of all matches of `pattern` in `string`.

In [None]:
gmail_re = r'[a-zA-Z0-9]+@gmail\.com'
text = '''
From: email1@gmail.com
To: email2@yahoo.com and email3@gmail.com
'''
re.findall(gmail_re, text)

## Regex Groups



Using **regex groups**, we specify subpatterns to extract from a regex by wrapping the subpattern in parentheses `( )`. When a regex contains more than one group, `re.findall` returns a list of tuples that contain the subpattern contents; with a single group, it returns a list of the strings matched by that group.

For example, the following familiar regex extracts phone numbers from a string:

In [None]:
phone_re = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"
text  = "Sam's number is 382-384-3840 and Mary's is 123-456-7890."
re.findall(phone_re, text)

To split apart the individual three or four digit components of a phone number, we can wrap each digit group in parentheses.

In [None]:
# Same regex with parentheses around the digit groups
phone_re = r"([0-9]{3})-([0-9]{3})-([0-9]{4})"
text  = "Sam's number is 382-384-3840 and Mary's is 123-456-7890."
re.findall(phone_re, text)

As promised, `re.findall` returns a list of tuples containing the individual components of the matched phone numbers.
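
Groups are also available on the match objects returned by `re.search` and `re.finditer`. The sketch below uses named groups, written `(?P<name>...)`, so each component can be looked up by name; the names `area`, `exchange`, and `line` are labels we chose for illustration.

```python
import re

phone_re = r"(?P<area>[0-9]{3})-(?P<exchange>[0-9]{3})-(?P<line>[0-9]{4})"
text = "Sam's number is 382-384-3840 and Mary's is 123-456-7890."

# Groups on a single match object
first = re.search(phone_re, text)
whole = first.group(0)         # the entire match
by_number = first.group(1)     # first group, by position
by_name = first.group('area')  # the same group, by name
parts = first.groups()         # all groups as a tuple

# finditer yields one match object per phone number
area_codes = [m.group('area') for m in re.finditer(phone_re, text)]

print(whole, by_number, by_name, parts, area_codes, sep="\n")
```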

## `re.sub`

`re.sub(pattern, replacement, string)` replaces all occurrences of `pattern` with `replacement` in the provided `string`. This method behaves like the Python string method `str.replace` but uses a regex to match patterns.

In the code below, we alter the dates to have a common format by substituting the date separators with a dash.

In [None]:
messy_dates = '03/12/2018, 03.13.18, 03/14/2018, 03:15:2018'
regex = r'[/.:]'
re.sub(regex, '-', messy_dates)
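
The replacement string may also refer back to regex groups with `\1`, `\2`, and so on, which lets `re.sub` rearrange text rather than only swap characters. The sketch below reorders dates into year-first form; it assumes the inputs are written month-day-year with a four-digit year, which is our assumption for the example.

```python
import re

dates = '03-12-2018, 03-14-2018'

# Groups: \1 = month, \2 = day, \3 = year (assumed order)
date_re = r'([0-9]{2})-([0-9]{2})-([0-9]{4})'
year_first = re.sub(date_re, r'\3-\1-\2', dates)

print(year_first)
```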

## `re.split`

`re.split(pattern, string)` splits the input `string` each time the regex `pattern` appears. This method behaves like the Python string method `str.split` but uses a regex to make the split.

In the code below, we use `re.split` to split chapter names from their page numbers in a table of contents for a book.

In [None]:
toc = '''
PLAYING PILGRIMS============3
A MERRY CHRISTMAS===========13
THE LAURENCE BOY============31
BURDENS=====================55
BEING NEIGHBORLY============76
'''.strip()
toc

# First, split into individual lines
lines = re.split('\n', toc)
lines

In [None]:
# Then, split into chapter title and page number
split_re = r'=+' # Matches any sequence of = characters
[re.split(split_re, line) for line in lines]
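
As a possible next step, the split pieces can be unpacked into a dictionary that maps each chapter title to its page number. This is a small sketch on two of the lines above; the names `toc_lines` and `pages` are ours.

```python
import re

toc_lines = [
    'PLAYING PILGRIMS============3',
    'A MERRY CHRISTMAS===========13',
]

pages = {}
for line in toc_lines:
    # Each line splits into exactly two pieces: title and page number
    title, page = re.split(r'=+', line)
    pages[title] = int(page)

print(pages)
```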

## Regex and pandas

Recall that `pandas` Series objects have a `.str` property that supports string manipulation using Python string methods. Conveniently, the `.str` property also supports some functions from the `re` module. We demonstrate basic regex usage in `pandas`, leaving the complete method list to [the `pandas` documentation on string methods](https://pandas.pydata.org/pandas-docs/stable/text.html).

We've stored the text of the first five sentences of the novel *Little Women* in the DataFrame below. We can use the string methods that `pandas` provides to extract the spoken dialog in each sentence.

In [None]:
# HIDDEN
text = '''
"Christmas won't be Christmas without any presents," grumbled Jo, lying on the rug.
"It's so dreadful to be poor!" sighed Meg, looking down at her old dress.
"I don't think it's fair for some girls to have plenty of pretty things, and other girls nothing at all," added little Amy, with an injured sniff.
"We've got Father and Mother, and each other," said Beth contentedly from her corner.
The four young faces on which the firelight shone brightened at the cheerful words, but darkened again as Jo said sadly, "We haven't got Father, and shall not have him for a long time."
'''.strip()
little = pd.DataFrame({
    'sentences': text.split('\n')
})

In [None]:
little

Since spoken dialog lies within double quotation marks, we create a regex that captures a double quotation mark, a sequence of any characters except a double quotation mark, and the closing quotation mark.

In [None]:
quote_re = r'"[^"]+"'
little['sentences'].str.findall(quote_re)

Since the `Series.str.findall` method returns a list of matches for each string, `pandas` also provides the `Series.str.extract` and `Series.str.extractall` methods to extract matches into a Series or DataFrame. These methods require the regex to contain at least one regex group.

In [None]:
# Extract text within double quotes; expand=False returns a Series
quote_re = r'"([^"]+)"'
spoken = little['sentences'].str.extract(quote_re, expand=False)
spoken
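
`Series.str.extract` keeps only the first match in each string. When a string may contain several matches, `Series.str.extractall` returns one row per match, indexed by the original row label and a match number. The sketch below uses a made-up Series named `lines_demo` to show the difference.

```python
import pandas as pd

# Made-up sentences: two quotes, no quotes, one quote
lines_demo = pd.Series([
    '"Hi," she said. "Bye."',
    'No dialog here.',
    'He replied, "Okay."',
])
quote_re = r'"([^"]+)"'

# First match per row; rows without a match become NaN
first_only = lines_demo.str.extract(quote_re, expand=False)

# Every match, one row each, with a (row, match) MultiIndex
all_quotes = lines_demo.str.extractall(quote_re)

print(first_only)
print(all_quotes)
```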

We can add this series as a column of the `little` DataFrame:

In [None]:
little['dialog'] = spoken
little

We can confirm that our string manipulation behaves as expected for the last sentence in our DataFrame by printing the original and extracted text:

In [None]:
print(little.loc[4, 'sentences'])

In [None]:
print(little.loc[4, 'dialog'])

## Summary

The `re` module in Python provides a useful group of methods for manipulating text using regular expressions. When working with DataFrames, we often use the analogous string manipulation methods implemented in `pandas`.

For the complete documentation on the `re` module, see https://docs.python.org/3/library/re.html

For the complete documentation on `pandas` string methods, see https://pandas.pydata.org/pandas-docs/stable/text.html


## Text Processing and Dates

In [None]:
!cat log.txt

In [None]:
with open('log.txt') as f:
    lines = f.readlines()
first = lines[0]
first

String manipulation based on character positions.

In [None]:
time_str = first.split('[', 1)[1].split(' ', 1)[0]
day, month, rest = time_str.split('/')
year, hour, minute, second = rest.split(':')
year, month, day, hour, minute, second

In [None]:
time_strs = (pd.Series(lines).str.split('[', n=1, expand=True)[1]
             .str.split(' ', n=1, expand=True)[0])
day_month_rest = time_strs.str.split('/', expand=True)
#day_month_rest
pd.concat([day_month_rest.loc[:, 0:1], 
           day_month_rest[2].str.split(':', expand=True)], axis=1)

String manipulation based on regular expressions.

In [None]:
import re
pattern = r'(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+)'
day, month, year, hour, minute, second = re.search(pattern, first).groups()
year, month, day, hour, minute, second

In [None]:
pd.Series(lines).str.extract(pattern)

Date parsing using the `datetime` module.

In [None]:
from datetime import datetime
datetime.strptime(time_str, '%d/%b/%Y:%H:%M:%S')

In [None]:
pd.Series(lines).str.extract(r'\[(.*) -0800\]')[0].apply(
    lambda s: datetime.strptime(s, '%d/%b/%Y:%H:%M:%S'))
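
Rather than hard-coding the `-0800` offset in the regex, `strptime` can parse it with the `%z` directive, which yields a timezone-aware `datetime`. The sketch below runs on a single hypothetical log line written in the bracketed-timestamp style used above; the real contents of `log.txt` may differ.

```python
import re
from datetime import datetime, timedelta

# Hypothetical log line (not taken from log.txt)
line = '127.0.0.1 - - [26/Jan/2014:10:47:58 -0800] "GET / HTTP/1.1" 200 512'

# Capture everything between the square brackets, offset included
stamp = re.search(r'\[(.*?)\]', line).group(1)

# %z parses the UTC offset, so the result is timezone-aware
parsed = datetime.strptime(stamp, '%d/%b/%Y:%H:%M:%S %z')

print(parsed.isoformat())
print(parsed.utcoffset() == timedelta(hours=-8))
```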