<a href="https://github.com/theonaunheim">
    <img style="border-radius: 100%; float: right;" src="static/strawberry_thief_square.png" width=10% alt="Theo Naunheim's Github">
</a>

<br style="clear: both">
<hr>
<br>


<h1 align='center'>Data Type-Specific Cleaning Techniques</h1>

<br>

<div style="display: table; width: 100%">
  <div style="display: table-row; width: 100%;">
    <div style="display: table-cell; width: 50%; vertical-align: middle;">
      <img src="static/my_kingdom_for_a_decent_data_type_image.png">
    </div>
    <div style="display: table-cell; width: 10%">
    </div>
    <div style="display: table-cell; width: 40%; vertical-align: top;">
      <blockquote>
        <p style="font-style: italic;">"All perfect data is alike; all imperfect data is imperfect in its own janky way."</p>
        <br>
        <p>— Leo Tolstoy (paraphrased)</p>
      </blockquote>        
    </div>
  </div>
</div>

<br>

<div align='left'>
    Image courtesy of <a href='https://commons.wikimedia.org/wiki/File:Binario_cropped.png'>MdeVicente</a>, released into the public domain.
</div>

<hr>

# Generally

Data type-specific cleaning is where pandas really comes into its own. Whether your data is a time series, a float, or a string, pandas ships with ready-made tools for cleaning it.

---

# Modules covered

### Standard Library
* [pathlib](https://docs.python.org/3/library/pathlib.html)

### Third-Party Libraries
* [numpy](https://docs.scipy.org/doc/numpy/)
* [pandas](https://pandas.pydata.org/)


# Modules not covered

### Standard Library
* None

### Third-Party Libraries
* None

---

In [1]:
# Python stdlib imports
import pathlib

# Third party imports
import numpy as np
import pandas as pd

# Strings

[.str namespace methods](https://pandas.pydata.org/pandas-docs/stable/api.html#string-handling)

Most of the string-specific functionality within pandas resides in the .str namespace. Each column containing text has this namespace, which allows you to access advanced string functions.
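
As a minimal sketch of how the accessor behaves (the `names` Series is made up for illustration and is not part of the dataset), `.str` methods apply element-wise and pass missing values through rather than raising:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not taken from mangled_data.csv
names = pd.Series(['  crunch ', 'Burger-King', np.nan])

# Each .str method is applied element-wise; NaN stays NaN
tidy = names.str.strip().str.title()
lengths = names.str.len()

print(tidy.tolist())
print(lengths.tolist())
```

Because missing values survive the chain, you can usually clean first and decide what to do about the gaps afterwards.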

In [2]:
# Skip malformed rows. Note: pandas 1.3+ replaces error_bad_lines=False with on_bad_lines='skip'.
df = pd.read_csv('./data/mangled_data.csv', error_bad_lines=False)
print(df.dtypes)
df

surname          object
forename         object
prefix           object
birthday         object
height           object
balance          object
pure_numeric    float64
attributes       object
biography        object
dtype: object


b'Skipping line 3: expected 9 fields, saw 10\n'


Unnamed: 0,surname,forename,prefix,birthday,height,balance,pure_numeric,attributes,biography
0,\t\tcrunch,horatio,Captain,5/6/1945,999999999,"$255,000.04",0.5,pulls_rank,See Second Crunchberry Rebellion. Rose to prom...
1,Burger King,THE,His \tHighness,The 5th dayeth of June in the year 1800,75,"($30,000)",1.0,pulls_rank,Forced to abdicate by Major McCheese and the a...
2,Rooster,CORNELIUS,Mx.,5/10/2015,twelve,"$5,000,000",0.8,flies,Defenestrated chicken little. Social 333-33-3333.
3,Cuckoo-Bird,SONNY,Mr.,9/9/9999,twelve,"($2,000,000)",0.3,flies|coo_coo_for_cocopuffs,
4,Monster,Loch Ness,Ms.,2018-04-02T23:01:22.55555,-100,($3.50),0.2,,Dislikes rutabagas. Social 555-55-5555.


In [3]:
# We can convert numbers to strings with the astype method.
df.pure_numeric.astype(str)

0    0.5
1    1.0
2    0.8
3    0.3
4    0.2
Name: pure_numeric, dtype: object

In [4]:
# We can run most of our regular string methods.
df.surname = df.surname.str.upper()
df

Unnamed: 0,surname,forename,prefix,birthday,height,balance,pure_numeric,attributes,biography
0,\t\tCRUNCH,horatio,Captain,5/6/1945,999999999,"$255,000.04",0.5,pulls_rank,See Second Crunchberry Rebellion. Rose to prom...
1,BURGER KING,THE,His \tHighness,The 5th dayeth of June in the year 1800,75,"($30,000)",1.0,pulls_rank,Forced to abdicate by Major McCheese and the a...
2,ROOSTER,CORNELIUS,Mx.,5/10/2015,twelve,"$5,000,000",0.8,flies,Defenestrated chicken little. Social 333-33-3333.
3,CUCKOO-BIRD,SONNY,Mr.,9/9/9999,twelve,"($2,000,000)",0.3,flies|coo_coo_for_cocopuffs,
4,MONSTER,Loch Ness,Ms.,2018-04-02T23:01:22.55555,-100,($3.50),0.2,,Dislikes rutabagas. Social 555-55-5555.


In [5]:
# And chain them if need be (ditch tabs, ditch dashes, clip whitespace)
df.surname.str.replace('\t','').str.replace('-',' ').str.strip()

0         CRUNCH
1    BURGER KING
2        ROOSTER
3    CUCKOO BIRD
4        MONSTER
Name: surname, dtype: object

In [6]:
# str.contains builds a boolean mask that we can use to filter rows.
mask = df.surname.str.contains('KING')
print(mask)
df[mask]

0    False
1     True
2    False
3    False
4    False
Name: surname, dtype: bool


Unnamed: 0,surname,forename,prefix,birthday,height,balance,pure_numeric,attributes,biography
1,BURGER KING,THE,His \tHighness,The 5th dayeth of June in the year 1800,75,"($30,000)",1.0,pulls_rank,Forced to abdicate by Major McCheese and the a...
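
One caveat worth sketching: on a column with missing text, `str.contains` can return missing values in the mask, and such a mask typically cannot be used for boolean indexing. Passing `na=False` treats missing text as a non-match. The `bios` Series below is illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing entry
bios = pd.Series(['Dislikes rutabagas', np.nan, 'Likes rutabagas'])

# na=False: treat missing text as "does not contain"
mask = bios.str.contains('Dislikes', na=False)
print(bios[mask])
```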


In [7]:
# We can get substrings with the same slicing syntax used for regular Python strings
df.surname.str[:3]

0     \t\t
1      BUR
2       RO
3         
4       MO
Name: surname, dtype: object

In [8]:
# We can concatenate strings
forename_string = df.forename.str.cat(sep=';')
print(forename_string)

df.prefix + ' ' + df.forename + ' is my name!'

horatio;THE;CORNELIUS;SONNY;Loch Ness


0           Captain horatio is my name!
1    His \tHighness     THE is my name!
2             Mx. CORNELIUS is my name!
3                 Mr. SONNY is my name!
4             Ms. Loch Ness is my name!
dtype: object

In [9]:
# We can split strings
print(df.birthday.str.split('/'), '\n\n')

# And get the first result
print(df.birthday.str.split('/').str.get(0), '\n\n')

# And get a dataframe of results
out = df.birthday.str.partition('/', expand=True)
out.columns = ['Before First Slash', 'First Slash', 'After First Slash']
out

0                                 [5, 6, 1945]
1    [The 5th dayeth of June in the year 1800]
2                                [5, 10, 2015]
3                                 [9, 9, 9999]
4                  [2018-04-02T23:01:22.55555]
Name: birthday, dtype: object 


0                                          5
1    The 5th dayeth of June in the year 1800
2                                          5
3                                          9
4                  2018-04-02T23:01:22.55555
Name: birthday, dtype: object 




Unnamed: 0,Before First Slash,First Slash,After First Slash
0,5,/,6/1945
1,The 5th dayeth of June in the year 1800,,
2,5,/,10/2015
3,9,/,9/9999
4,2018-04-02T23:01:22.55555,,


In [10]:
# And we can use regexes for complex stuff. Socials with and without dashes.
df.biography.str.extractall(r'(\w*\d\d\d-?\d\d-?\d\d\d\d?\w*)')

Unnamed: 0_level_0,Unnamed: 1_level_0,0
Unnamed: 0_level_1,match,Unnamed: 2_level_1
0,0,111111111
1,0,222222222
2,0,333-33-3333
4,0,555-55-5555


In [11]:
# And we can replace them (regex=True makes the pattern matching explicit)
df.biography.str.replace(r'(\w*\d\d\d-?\d\d-?\d\d\d\d?\w*)', 'REDACTION', regex=True)

0    See Second Crunchberry Rebellion. Rose to prom...
1    Forced to abdicate by Major McCheese and the a...
2      Defenestrated chicken little. Social REDACTION.
3                                                  NaN
4                Dislikes rutabagas. Social REDACTION.
Name: biography, dtype: object
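
A self-contained sketch of the same extract-then-redact idea, assuming a simplified pattern of nine digits with optional dashes. The `notes` Series and `ssn_pattern` are hypothetical and the numbers are placeholders. A raw string keeps the backslashes intact, and `regex=True` states the intent explicitly:

```python
import pandas as pd

# Made-up sample text; the numbers are placeholders, not real identifiers
notes = pd.Series([
    'Social 333-33-3333.',
    'Social 444444444 on file',
    'No number here',
])

# Raw string avoids invalid-escape warnings; both dashes are optional
ssn_pattern = r'\b\d{3}-?\d{2}-?\d{4}\b'

# extract needs a capture group; rows without a match become NaN
found = notes.str.extract('(' + ssn_pattern + ')', expand=False)

# Replace every match with a fixed token
redacted = notes.str.replace(ssn_pattern, 'REDACTION', regex=True)

print(found.tolist())
print(redacted.tolist())
```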

In [12]:
# Unaltered df for comparison
df

Unnamed: 0,surname,forename,prefix,birthday,height,balance,pure_numeric,attributes,biography
0,\t\tCRUNCH,horatio,Captain,5/6/1945,999999999,"$255,000.04",0.5,pulls_rank,See Second Crunchberry Rebellion. Rose to prom...
1,BURGER KING,THE,His \tHighness,The 5th dayeth of June in the year 1800,75,"($30,000)",1.0,pulls_rank,Forced to abdicate by Major McCheese and the a...
2,ROOSTER,CORNELIUS,Mx.,5/10/2015,twelve,"$5,000,000",0.8,flies,Defenestrated chicken little. Social 333-33-3333.
3,CUCKOO-BIRD,SONNY,Mr.,9/9/9999,twelve,"($2,000,000)",0.3,flies|coo_coo_for_cocopuffs,
4,MONSTER,Loch Ness,Ms.,2018-04-02T23:01:22.55555,-100,($3.50),0.2,,Dislikes rutabagas. Social 555-55-5555.


In [13]:
# And you can binarize features with get_dummies
dummies = df.attributes.str.get_dummies().astype(bool)
pd.concat([df, dummies], axis=1)

Unnamed: 0,surname,forename,prefix,birthday,height,balance,pure_numeric,attributes,biography,coo_coo_for_cocopuffs,flies,pulls_rank
0,\t\tCRUNCH,horatio,Captain,5/6/1945,999999999,"$255,000.04",0.5,pulls_rank,See Second Crunchberry Rebellion. Rose to prom...,False,False,True
1,BURGER KING,THE,His \tHighness,The 5th dayeth of June in the year 1800,75,"($30,000)",1.0,pulls_rank,Forced to abdicate by Major McCheese and the a...,False,False,True
2,ROOSTER,CORNELIUS,Mx.,5/10/2015,twelve,"$5,000,000",0.8,flies,Defenestrated chicken little. Social 333-33-3333.,False,True,False
3,CUCKOO-BIRD,SONNY,Mr.,9/9/9999,twelve,"($2,000,000)",0.3,flies|coo_coo_for_cocopuffs,,True,True,False
4,MONSTER,Loch Ness,Ms.,2018-04-02T23:01:22.55555,-100,($3.50),0.2,,Dislikes rutabagas. Social 555-55-5555.,False,False,False
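
`str.get_dummies` splits on `|` by default, which is why `flies|coo_coo_for_cocopuffs` became two columns above. A small sketch with a different delimiter (the `tags` Series is invented for illustration):

```python
import pandas as pd

# Hypothetical tags separated by semicolons instead of pipes
tags = pd.Series(['flies;pulls_rank', 'flies', None])

# sep controls the delimiter; missing values become all-False rows
dummies = tags.str.get_dummies(sep=';').astype(bool)
print(dummies)
```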


# Numeric

Pandas offers many numeric methods, but we'll focus on the ones that are most useful for cleaning. Your workhorses are going to be <a href='https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_numeric.html'>pd.to_numeric()</a>, <a href='http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.Series.clip.html'>Series.clip()</a>, <a href='https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.round.html'>Series.round()/DataFrame.round()</a>, and <a href='https://pandas.pydata.org/pandas-docs/stable/generated/pandas.cut.html'>pd.cut()</a>.

In [14]:
df

Unnamed: 0,surname,forename,prefix,birthday,height,balance,pure_numeric,attributes,biography
0,\t\tCRUNCH,horatio,Captain,5/6/1945,999999999,"$255,000.04",0.5,pulls_rank,See Second Crunchberry Rebellion. Rose to prom...
1,BURGER KING,THE,His \tHighness,The 5th dayeth of June in the year 1800,75,"($30,000)",1.0,pulls_rank,Forced to abdicate by Major McCheese and the a...
2,ROOSTER,CORNELIUS,Mx.,5/10/2015,twelve,"$5,000,000",0.8,flies,Defenestrated chicken little. Social 333-33-3333.
3,CUCKOO-BIRD,SONNY,Mr.,9/9/9999,twelve,"($2,000,000)",0.3,flies|coo_coo_for_cocopuffs,
4,MONSTER,Loch Ness,Ms.,2018-04-02T23:01:22.55555,-100,($3.50),0.2,,Dislikes rutabagas. Social 555-55-5555.


In [15]:
# pd.to_numeric is the best general purpose function for casting to numeric types.

# 'ignore' keeps the "bad" values as they are
ignored = pd.to_numeric(df.height, errors='ignore')

# 'coerce' changes bad values to np.nan
coerced = pd.to_numeric(df.height, errors='coerce')

# Note: np.nan is a float, so only float or object columns can hold it; plain integer columns cannot.
pd.DataFrame(
    {
        'original': df.height, 
        'ignore': ignored, 
        'coerce': coerced
    }
)

Unnamed: 0,original,ignore,coerce
0,999999999,999999999,999999999.0
1,75,75,75.0
2,twelve,twelve,
3,twelve,twelve,
4,-100,-100,-100.0
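
The coerced column comes back as floats because `NaN` is itself a float. If you want whole numbers alongside missing values, one option is the nullable `Int64` dtype (capital "I", available in pandas 0.24 and later). A minimal sketch on made-up data:

```python
import pandas as pd

# Object dtype, like the height column above (values are illustrative)
heights = pd.Series(['75', 'twelve', '-100'], dtype=object)

# Bad values become NaN, which forces a float column
as_float = pd.to_numeric(heights, errors='coerce')

# The nullable integer dtype keeps whole numbers and marks gaps as <NA>
as_int = as_float.astype('Int64')

print(as_float)
print(as_int)
```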


In [16]:
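# Sometimes you need to strip characters first. Wrapping the chain in parentheses keeps it readable.
# regex=False treats characters such as '(' and '$' as literals rather than regex syntax.
# Also, translate would be much better here.
clean = (df.balance.str.strip()
                   .str.replace('(', '-', regex=False)
                   .str.replace(')', '', regex=False)
                   .str.replace('$', '', regex=False)
                   .str.replace(',', '', regex=False)
                   .str.replace(' ', '', regex=False))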

# More efficient version of the above
translation_table = str.maketrans(
    {
        '(': '-',
        ')': '',
        '$': '',
        ' ': '',
        ',': '',
    }
)
clean = df.balance.str.translate(translation_table)   
         
pd.DataFrame({
    'cleaned_float' : pd.to_numeric(clean),
    'original_text' : df.balance,
})

Unnamed: 0,cleaned_float,original_text
0,255000.04,"$255,000.04"
1,-30000.0,"($30,000)"
2,5000000.0,"$5,000,000"
3,-2000000.0,"($2,000,000)"
4,-3.5,($3.50)


In [17]:
# Round to a fixed number of decimal places with round().
pd.DataFrame(
    {
        'cleaned'  : pd.to_numeric(clean),
        'one_digit': pd.to_numeric(clean).round(1),
    }
)

Unnamed: 0,cleaned,one_digit
0,255000.04,255000.0
1,-30000.0,-30000.0
2,5000000.0,5000000.0
3,-2000000.0,-2000000.0
4,-3.5,-3.5


In [18]:
# Often we need to rein in obviously bad values. clip() caps them at a floor and/or ceiling.
pd.DataFrame(
    {
        'original'  : df.height,
        'coerced'   : coerced,
        'lower_clip': coerced.clip(lower=0),
        'upper_clip': coerced.clip(upper=100),
        'both_clip' : coerced.clip(lower=0, upper=100),
    }
)

Unnamed: 0,original,coerced,lower_clip,upper_clip,both_clip
0,999999999,999999999.0,999999999.0,100.0,100.0
1,75,75.0,75.0,75.0,75.0
2,twelve,,,,
3,twelve,,,,
4,-100,-100.0,0.0,-100.0,0.0
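
Note that `clip` keeps the out-of-range rows and moves them to the boundary. If you would rather blank those values out, one alternative (a sketch, not the only approach) is to combine `between` with `where`. The numbers below mirror the height column but are typed in by hand:

```python
import pandas as pd

heights = pd.Series([999999999.0, 75.0, -100.0])

# clip: out-of-range values are moved to the nearest bound
clipped = heights.clip(lower=0, upper=100)

# where + between: out-of-range values are replaced with NaN instead
masked = heights.where(heights.between(0, 100))

print(pd.DataFrame({'clipped': clipped, 'masked': masked}))
```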


In [19]:
# Pandas cut is useful for turning continuous numbers into discrete categories.
my_bins = np.arange(0,1.1,.1)
pd.DataFrame(
    {
        'original': df.pure_numeric,
        'category': pd.cut(df.pure_numeric, bins=my_bins)
    }
)

# You can also use bins=5 for five equal-width bins

Unnamed: 0,original,category
0,0.5,"(0.4, 0.5]"
1,1.0,"(0.9, 1.0]"
2,0.8,"(0.7, 0.8]"
3,0.3,"(0.2, 0.3]"
4,0.2,"(0.1, 0.2]"


In [20]:
# More usefully
pd.DataFrame(
    {
        'percent': df.pure_numeric,
        'grade': pd.cut(
            df.pure_numeric, 
            bins=[0.0, 0.6, .65, .7, .85, 1.0],
            labels=['F', 'D', 'C', 'B', 'A']
        )
    }
)

Unnamed: 0,percent,grade
0,0.5,F
1,1.0,A
2,0.8,B
3,0.3,F
4,0.2,F
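
One edge case to watch: by default `pd.cut` intervals are open on the left and closed on the right, so a value exactly equal to the lowest bin edge falls outside every bin. A short sketch with made-up scores, using the same bins and labels as above, shows the effect of `include_lowest=True`:

```python
import pandas as pd

scores = pd.Series([0.0, 0.6, 0.61, 1.0])
bins = [0.0, 0.6, 0.65, 0.7, 0.85, 1.0]
labels = ['F', 'D', 'C', 'B', 'A']

# Default: the first interval is (0.0, 0.6], so 0.0 gets no grade
default = pd.cut(scores, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left as well
inclusive = pd.cut(scores, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({'score': scores, 'default': default, 'inclusive': inclusive}))
```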


# Datetimes

[.dt namespace methods](https://pandas.pydata.org/pandas-docs/stable/api.html#datetimelike-properties)

Like strings, datetimes have a special namespace used to access their functions.

In [21]:
# Show df for clarity
df

Unnamed: 0,surname,forename,prefix,birthday,height,balance,pure_numeric,attributes,biography
0,\t\tCRUNCH,horatio,Captain,5/6/1945,999999999,"$255,000.04",0.5,pulls_rank,See Second Crunchberry Rebellion. Rose to prom...
1,BURGER KING,THE,His \tHighness,The 5th dayeth of June in the year 1800,75,"($30,000)",1.0,pulls_rank,Forced to abdicate by Major McCheese and the a...
2,ROOSTER,CORNELIUS,Mx.,5/10/2015,twelve,"$5,000,000",0.8,flies,Defenestrated chicken little. Social 333-33-3333.
3,CUCKOO-BIRD,SONNY,Mr.,9/9/9999,twelve,"($2,000,000)",0.3,flies|coo_coo_for_cocopuffs,
4,MONSTER,Loch Ness,Ms.,2018-04-02T23:01:22.55555,-100,($3.50),0.2,,Dislikes rutabagas. Social 555-55-5555.


In [22]:
# pd.to_datetime() is our workhorse function
pd.DataFrame(
    {  
        'coerce': pd.to_datetime(df['birthday'], errors='coerce'),
        'ignore': pd.to_datetime(df['birthday'], errors='ignore'),
    }
)

Unnamed: 0,coerce,ignore
0,1945-05-06 00:00:00.000000,5/6/1945
1,NaT,The 5th dayeth of June in the year 1800
2,2015-05-10 00:00:00.000000,5/10/2015
3,NaT,9/9/9999
4,2018-04-02 23:01:22.555550,2018-04-02T23:01:22.55555
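
The `9/9/9999` row coerces to `NaT` because the nanosecond-resolution timestamps this notebook relies on only cover roughly the years 1677 to 2262. A minimal sketch, assuming that default resolution: `pd.Timestamp.min` and `pd.Timestamp.max` show the bounds, and a `pd.Period` is one way to hold a date outside them.

```python
import pandas as pd

# Bounds of a nanosecond-resolution Timestamp
print(pd.Timestamp.min, pd.Timestamp.max)

# A Period stores the date as a span, so far-future dates are fine
far_future = pd.Period('9999-09-09', freq='D')
print(far_future.year)
```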


In [23]:
# It also parses unusual date strings if you supply an explicit strptime-style format
format_string = 'The %dth dayeth of %B in the year %Y'
dates = pd.to_datetime(df['birthday'], errors='coerce', format=format_string)

In [24]:
# This allows us to do all the fancy dt stuff.
pd.DataFrame(
    {
        'converted'    : dates,
        'day'          : dates.dt.day,
        'month'        : dates.dt.month,
        'year'         : dates.dt.year,
        'day_in_week'  : dates.dt.dayofweek,
        'days_in_month': dates.dt.days_in_month,
        'weekday'      : dates.dt.day_name()
    }
).dropna()

Unnamed: 0,converted,day,month,year,day_in_week,days_in_month,weekday
1,1800-06-05,5.0,6.0,1800.0,3.0,30.0,Thursday
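
For a cleaner view of the `.dt` namespace, here is a small sketch on two hand-typed dates. It uses `dt.day_name()` and `dt.month_name()`, which newer pandas releases provide for weekday and month names:

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2018-04-02', '2015-05-10']))

summary = pd.DataFrame({
    'weekday': dates.dt.day_name(),          # e.g. 'Monday'
    'month_name': dates.dt.month_name(),     # e.g. 'April'
    'is_weekend': dates.dt.dayofweek >= 5,   # Monday=0 ... Sunday=6
})
print(summary)
```

With no missing dates in the column, the numeric fields stay integers instead of being upcast to floats as in the output above.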


# Strptime and strftime reference

<table class="docutils" border="1">
<colgroup>
<col width="15%">
<col width="43%">
<col width="32%">
<col width="9%">
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Directive</th>
<th class="head">Meaning</th>
<th class="head">Example</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">%a</span></code></td>
<td>Weekday as locale’s
abbreviated name.</td>
<td><div class="first last line-block">
<div class="line">Sun, Mon, …, Sat
(en_US);</div>
<div class="line">So, Mo, …, Sa
(de_DE)</div>
</div>
</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">%A</span></code></td>
<td>Weekday as locale’s full name.</td>
<td><div class="first last line-block">
<div class="line">Sunday, Monday, …,
Saturday (en_US);</div>
<div class="line">Sonntag, Montag, …,
Samstag (de_DE)</div>
</div>
</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">%w</span></code></td>
<td>Weekday as a decimal number,
where 0 is Sunday and 6 is
Saturday.</td>
<td>0, 1, …, 6</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">%d</span></code></td>
<td>Day of the month as a
zero-padded decimal number.</td>
<td>01, 02, …, 31</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">%b</span></code></td>
<td>Month as locale’s abbreviated
name.</td>
<td><div class="first last line-block">
<div class="line">Jan, Feb, …, Dec
(en_US);</div>
<div class="line">Jan, Feb, …, Dez
(de_DE)</div>
</div>
</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">%B</span></code></td>
<td>Month as locale’s full name.</td>
<td><div class="first last line-block">
<div class="line">January, February,
…, December (en_US);</div>
<div class="line">Januar, Februar, …,
Dezember (de_DE)</div>
</div>
</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">%m</span></code></td>
<td>Month as a zero-padded
decimal number.</td>
<td>01, 02, …, 12</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">%y</span></code></td>
<td>Year without century as a
zero-padded decimal number.</td>
<td>00, 01, …, 99</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">%Y</span></code></td>
<td>Year with century as a decimal
number.</td>
<td>0001, 0002, …, 2013,
2014, …, 9998, 9999</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">%H</span></code></td>
<td>Hour (24-hour clock) as a
zero-padded decimal number.</td>
<td>00, 01, …, 23</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">%I</span></code></td>
<td>Hour (12-hour clock) as a
zero-padded decimal number.</td>
<td>01, 02, …, 12</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">%p</span></code></td>
<td>Locale’s equivalent of either
AM or PM.</td>
<td><div class="first last line-block">
<div class="line">AM, PM (en_US);</div>
<div class="line">am, pm (de_DE)</div>
</div>
</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">%M</span></code></td>
<td>Minute as a zero-padded
decimal number.</td>
<td>00, 01, …, 59</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">%S</span></code></td>
<td>Second as a zero-padded
decimal number.</td>
<td>00, 01, …, 59</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">%f</span></code></td>
<td>Microsecond as a decimal
number, zero-padded on the
left.</td>
<td>000000, 000001, …,
999999</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">%z</span></code></td>
<td>UTC offset in the form +HHMM
or -HHMM (empty string if the
object is naive).</td>
<td>(empty), +0000, -0400,
+1030</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">%Z</span></code></td>
<td>Time zone name (empty string
if the object is naive).</td>
<td>(empty), UTC, EST, CST</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">%j</span></code></td>
<td>Day of the year as a
zero-padded decimal number.</td>
<td>001, 002, …, 366</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">%U</span></code></td>
<td>Week number of the year
(Sunday as the first day of
the week) as a zero padded
decimal number. All days in a
new year preceding the first
Sunday are considered to be in
week 0.</td>
<td>00, 01, …, 53</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">%W</span></code></td>
<td>Week number of the year
(Monday as the first day of
the week) as a decimal number.
All days in a new year
preceding the first Monday
are considered to be in
week 0.</td>
<td>00, 01, …, 53</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">%c</span></code></td>
<td>Locale’s appropriate date and
time representation.</td>
<td><div class="first last line-block">
<div class="line">Tue Aug 16 21:30:00
1988 (en_US);</div>
<div class="line">Di 16 Aug 21:30:00
1988 (de_DE)</div>
</div>
</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">%x</span></code></td>
<td>Locale’s appropriate date
representation.</td>
<td><div class="first last line-block">
<div class="line">08/16/88 (None);</div>
<div class="line">08/16/1988 (en_US);</div>
<div class="line">16.08.1988 (de_DE)</div>
</div>
</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">%X</span></code></td>
<td>Locale’s appropriate time
representation.</td>
<td><div class="first last line-block">
<div class="line">21:30:00 (en_US);</div>
<div class="line">21:30:00 (de_DE)</div>
</div>
</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">%%</span></code></td>
<td>A literal <code class="docutils literal notranslate"><span class="pre">'%'</span></code> character.</td>
<td>%</td>
</tr>
</tbody>
</table>
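
To see a few of these directives in action, here is a minimal standard-library sketch. The date string is made up to match the example used in the table:

```python
from datetime import datetime

# Parse day/month/year plus a 24-hour time
parsed = datetime.strptime('16/08/1988 21:30', '%d/%m/%Y %H:%M')

# Re-emit the same moment with different directives
iso_date = parsed.strftime('%Y-%m-%d')
day_of_year = parsed.strftime('%j')
twelve_hour = parsed.strftime('%I:%M')

print(iso_date, day_of_year, twelve_hour)
```

The same format strings work with `pd.to_datetime(..., format=...)` and `Series.dt.strftime(...)`.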

In [25]:
# Though not optimal, you can use apply()/map() to do more complex transformations
spanish_months = {
    month: str(index)
    for index, month
    in enumerate([
        'enero','febrero', 'marzo', 'abril', 
        'mayo', 'junio', 'julio', 'agosto',
        'septiembre', 'octubre', 'noviembre', 'diciembre'
    ], start=1)
}

def custom_converter(cell):
    '''This does some simple remapping.'''
    try:
        # Lower case
        content = cell.lower() 
        # Partition on the comma -> ('enero 5', ',', ' 2018')
        month_and_day, sep, year = content.partition(',')
        # Split substring 'enero 5' -> ['enero', '5']
        month, day = month_and_day.strip().split()
        # Map the Spanish month name to its month number (as a string)
        month = spanish_months[month]
        # Create timestring and time
        time_string = year.strip() + '-' + month + '-' + day
        return pd.to_datetime(time_string)
    except Exception:
        # Anything we cannot parse becomes a missing value
        return np.nan

spanish = pd.Series(['enero 5, 2018', 'Agosto 20, 2015', 'July 20, 2012'])
converted = spanish.map(custom_converter)

pd.DataFrame({
    'spanish'  : spanish,
    'converted': converted,
})

Unnamed: 0,spanish,converted
0,"enero 5, 2018",2018-01-05
1,"Agosto 20, 2015",2015-08-20
2,"July 20, 2012",NaT
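
If the row-by-row `map()` becomes a bottleneck, the same conversion can be sketched with vectorized string methods. This is one possible alternative under the assumption that every value looks like `<month> <day>, <year>`; the regex and the zero-padded month lookup are illustrative choices, not part of the original approach.

```python
import pandas as pd

# Month name -> zero-padded month number, e.g. 'enero' -> '01'
spanish_months = {
    name: str(number).zfill(2)
    for number, name in enumerate([
        'enero', 'febrero', 'marzo', 'abril',
        'mayo', 'junio', 'julio', 'agosto',
        'septiembre', 'octubre', 'noviembre', 'diciembre',
    ], start=1)
}

spanish = pd.Series(['enero 5, 2018', 'Agosto 20, 2015', 'July 20, 2012'])

# Pull month, day and year into named columns; non-matching rows become NaN
parts = spanish.str.lower().str.extract(
    r'^(?P<month>[a-z]+)\s+(?P<day>\d{1,2}),\s*(?P<year>\d{4})$'
)

# Unknown month names map to NaN, which propagates through str.cat
month_numbers = parts['month'].map(spanish_months)
iso = parts['year'].str.cat([month_numbers, parts['day'].str.zfill(2)], sep='-')

# Rows with a missing piece end up as NaT
converted = pd.to_datetime(iso, format='%Y-%m-%d', errors='coerce')
print(converted)
```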


In [26]:
# Normalize sets all the times to midnight
bdays = pd.to_datetime(df['birthday'], errors='coerce').dropna()

print(bdays)
print(bdays.dt.normalize())

0   1945-05-06 00:00:00.000000
2   2015-05-10 00:00:00.000000
4   2018-04-02 23:01:22.555550
Name: birthday, dtype: datetime64[ns]
0   1945-05-06
2   2015-05-10
4   2018-04-02
Name: birthday, dtype: datetime64[ns]


In [27]:
# We can convert to weeks or months using dt.to_period()
# Periods represent spans of time, as opposed to point-in-time timestamps.
pd.DataFrame(
    {
        'week'    : bdays.dt.to_period('W'),
        'quarter' : bdays.dt.to_period('Q'),
        'hour'    : bdays.dt.to_period('H'),
        'month'   : bdays.dt.to_period('M'),
        'day'     : bdays.dt.to_period('D'),
    }
)

# See 'Offset Aliases' below for potential strings.

Unnamed: 0,week,quarter,hour,month,day
0,1945-04-30/1945-05-06,1945Q2,1945-05-06 00:00,1945-05,1945-05-06
2,2015-05-04/2015-05-10,2015Q2,2015-05-10 00:00,2015-05,2015-05-10
4,2018-04-02/2018-04-08,2018Q2,2018-04-02 23:00,2018-04,2018-04-02
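
To make the span idea concrete, here is a small sketch on a single hand-typed timestamp. A `Period` exposes `start_time` and `end_time`, and the original timestamp falls between them:

```python
import pandas as pd

stamp = pd.Timestamp('2018-04-02 23:01:22')

# Convert the point in time to the month that contains it
month = stamp.to_period('M')
print(month, month.start_time, month.end_time)

# The timestamp lies inside the period's boundaries
inside = month.start_time <= stamp <= month.end_time
print(inside)
```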


In [28]:
# And you can output strings as needed
print(bdays.dt.strftime(format_string))

0      The 06th dayeth of May in the year 1945
2      The 10th dayeth of May in the year 2015
4    The 02th dayeth of April in the year 2018
Name: birthday, dtype: object
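
The hard-coded `th` in the format string is why the output reads "02th". `strftime` has no ordinal directive, so one workaround is a small helper. The `ordinal` function below is hypothetical, written for this sketch, and the dates are typed in by hand:

```python
import pandas as pd

def ordinal(day):
    '''Return the day number with its English ordinal suffix, e.g. 2 -> '2nd'.'''
    if 11 <= day % 100 <= 13:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(day % 10, 'th')
    return str(day) + suffix

bdays = pd.Series(pd.to_datetime(['1945-05-06', '2015-05-10', '2018-04-02']))

# Build the day part ourselves and let strftime handle month and year
pretty = bdays.dt.day.map(ordinal) + ' of ' + bdays.dt.strftime('%B %Y')
print(pretty)
```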


# Additional Learning Resources

* ### [Pandas: Working With Text Data](http://pandas.pydata.org/pandas-docs/stable/text.html)
* ### [Pandas: Working With Time Series](http://pandas.pydata.org/pandas-docs/stable/timeseries.html)
* ### [Pandas: Computational Tools](https://pandas.pydata.org/pandas-docs/stable/computation.html)
* ### [Pandas: Offset Aliases](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases)
* ### [Python Strftime and Strptime Behavior](https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior)

---

# Next Up: [Other](5_other.ipynb)

<div>
    <br>
    <img style="margin-left: 0;" src='./static/other.png' width="200">
    <br>
    Image courtesy of <a href='https://commons.wikimedia.org/wiki/File:Flag_of_None.svg'>Rainer Zenz</a>. Image is public domain.

</div>




---