<a href="https://github.com/theonaunheim">
    <img style="border-radius: 100%; float: right;" src="static/strawberry_thief_square.png" width=10% alt="Theo Naunheim's Github">
</a>

<br style="clear: both">
<hr>
<br>


<h1 align='center'>Data Type-Specific Cleaning Techniques</h1>

<br>

<div style="display: table; width: 100%">
  <div style="display: table-row; width: 100%;">
    <div style="display: table-cell; width: 50%; vertical-align: middle;">
      <img src="static/my_kingdom_for_a_decent_data_type_image.png">
    </div>
    <div style="display: table-cell; width: 10%">
    </div>
    <div style="display: table-cell; width: 40%; vertical-align: top;">
      <blockquote>
        <p style="font-style: italic;">"All perfect data is alike; all imperfect data is imperfect in its own janky way."</p>
        <br>
        <p>— Leo Tolstoy (paraphrased)</p>
      </blockquote>        
    </div>
  </div>
</div>

<br>

<div align='left'>
    Image courtesy of <a href='https://commons.wikimedia.org/wiki/File:Binario_cropped.png'>MdeVicente</a>, released into the public domain.
</div>

<hr>

# Generally

Data type-specific cleaning is where pandas really comes into its own. Whether your data is a time series, a float, or a string, pandas ships ready-made tools for cleaning it.

---

# Modules covered

### Standard Library
* [pathlib](https://docs.python.org/3/library/pathlib.html)

### Third-Party Libraries
* [numpy](https://docs.scipy.org/doc/numpy/)
* [pandas](https://pandas.pydata.org/)


# Modules not covered

### Standard Library
* None

### Third-Party Libraries
* None

---

In [None]:
# Python stdlib imports
import pathlib

# Third party imports
import numpy as np
import pandas as pd

# Strings

[.str namespace methods](https://pandas.pydata.org/pandas-docs/stable/api.html#string-handling)

Most of the string-specific functionality within pandas resides in the .str namespace. Each column containing text has this namespace, which allows you to access advanced string functions.
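
As a rough, self-contained illustration of the .str namespace, the sketch below uses a made-up Series (the values are hypothetical and are not taken from mangled_data.csv). It shows that the methods work element-wise and that missing values pass through untouched.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
names = pd.Series(['  king\t', 'Smith-Jones', None, "o'neil"])

# Each .str method runs element-wise; missing values stay missing.
tidy = names.str.strip().str.upper()
print(tidy)

# Boolean-returning methods take na= to decide how missing values are treated.
has_king = tidy.str.contains('KING', na=False)
print(has_king)
```

Passing na=False keeps the mask strictly boolean, so it can be used directly for row selection.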

In [None]:
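# Skip malformed rows. on_bad_lines='skip' replaces error_bad_lines=False,
# which was deprecated in pandas 1.3 and removed in 2.0.
df = pd.read_csv('./data/mangled_data.csv', on_bad_lines='skip')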
print(df.dtypes)
df

In [None]:
# We can convert numbers to strings with the astype method.
df.pure_numeric.astype(str)

In [None]:
# We can run most of our regular string methods.
df.surname = df.surname.str.upper()
df

In [None]:
# And chain them if need be (ditch tabs, ditch dashes, clip whitespace)
df.surname.str.replace('\t','').str.replace('-',' ').str.strip()

In [None]:
# str.contains builds a boolean mask; na=False treats missing surnames as non-matches.
mask = df.surname.str.contains('KING', na=False)
print(mask)
df[mask]

In [None]:
# We can get substrings with the same syntax
df.surname.str[:3]

In [None]:
# We can concatenate strings
forename_string = df.forename.str.cat(sep=';')
print(forename_string)

df.prefix + ' ' + df.forename + ' is my name!'

In [None]:
# We can split strings
print(df.birthday.str.split('/'), '\n\n')

# And get the first result
print(df.birthday.str.split('/').str.get(0), '\n\n')

# And get a dataframe of results
out = df.birthday.str.partition('/', expand=True)
out.columns = ['Before First Slash', 'First Slash', 'After First Slash']
out

In [None]:
# And we can use regexes for complex stuff. Socials with and without dashes.
# A raw string keeps the backslashes from being read as Python escapes.
df.biography.str.extractall(r'(\w*\d\d\d-?\d\d-?\d\d\d\d?\w*)')

In [None]:
# And we can replace them (same pattern as above; regex=True is required in pandas 2.0+)
df.biography.str.replace(r'(\w*\d\d\d-?\d\d-?\d\d\d\d?\w*)', 'REDACTION', regex=True)
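
The extract-then-redact idea can be sketched on its own with invented text. The biographies and the stricter pattern below are assumptions for illustration (nine digits, optional dashes, no partial matches inside longer digit runs), not the pattern used by the notebook.

```python
import pandas as pd

# Hypothetical biographies; the numbers are made up.
bios = pd.Series([
    'SSN 123-45-6789 on file.',
    'Two on record: 111223333 and 444-55-6666.',
    'Nothing sensitive here.',
])

# Assumed pattern: 3-2-4 digits with optional dashes, not embedded in a longer number.
ssn_pattern = r'(?<!\d)\d{3}-?\d{2}-?\d{4}(?!\d)'

# extractall needs a capture group and returns one row per match.
found = bios.str.extractall(f'({ssn_pattern})')
print(found)

# regex=True is passed explicitly because the default differs across pandas versions.
redacted = bios.str.replace(ssn_pattern, 'REDACTION', regex=True)
print(redacted)
```

The result of extractall carries a two-level index: the original row label and the match number within that row.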

In [None]:
# Unaltered df for comparison
df

In [None]:
# And you can binarize features with get_dummies
dummies = df.attributes.str.get_dummies().astype(bool)
pd.concat([df, dummies], axis=1)

# Numeric

pandas has many numeric methods, but we'll focus on the ones that are most useful for cleaning. Your workhorses are going to be <a href='https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_numeric.html'>pd.to_numeric()</a>, <a href='http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.Series.clip.html'>Series.clip()</a>, <a href='https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.round.html'>Series.round()/DataFrame.round()</a>, and <a href='https://pandas.pydata.org/pandas-docs/stable/generated/pandas.cut.html'>pd.cut()</a>.
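
Here is a minimal sketch of how these four functions can be chained, using an invented Series (the values, bounds, and band labels are assumptions for illustration):

```python
import pandas as pd

# Hypothetical raw readings with junk mixed in.
raw = pd.Series(['72.456', 'tall', '-3', '9000', None])

# 1. Cast to numbers; anything unparseable becomes NaN.
numeric = pd.to_numeric(raw, errors='coerce')

# 2. Pull out-of-range values back to an assumed valid range of 0 to 100.
bounded = numeric.clip(lower=0, upper=100)

# 3. Round to one decimal place.
rounded = bounded.round(1)

# 4. Bucket into two assumed bands; include_lowest keeps 0 in the first band.
bands = pd.cut(rounded, bins=[0, 50, 100], labels=['low', 'high'], include_lowest=True)

print(pd.DataFrame({'raw': raw, 'rounded': rounded, 'band': bands}))
```

Missing values flow through every step unchanged, so the cleaning decisions stay visible in the final frame.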

In [None]:
df

In [None]:
# pd.to_numeric is the best general purpose function for casting to numeric types.

# 'ignore' keeps the "bad" values as they are (deprecated since pandas 2.2)
ignored = pd.to_numeric(df.height, errors='ignore')

# 'coerce' changes bad values to np.nan
coerced = pd.to_numeric(df.height, errors='coerce')

# Note: only float or object (string) columns can hold np.nan
pd.DataFrame(
    {
        'original': df.height, 
        'ignore': ignored, 
        'coerce': coerced
    }
)

In [None]:
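# Sometimes you need to take out characters first. Note: wrap the chain in parens for cleaner chaining.
# regex=False treats '(', ')' and '$' as literal characters rather than regex syntax.
# Also, translate would be much better here.
clean = (df.balance.str.strip()
                   .str.replace('(', '-', regex=False)
                   .str.replace(')', '', regex=False)
                   .str.replace('$', '', regex=False)
                   .str.replace(' ', '', regex=False)
                   .str.replace(',', '', regex=False))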

# More efficient version of the above
translation_table = str.maketrans(
    {
        '(': '-',
        ')': '',
        '$': '',
        ' ': '',
        ',': '',
    }
)
clean = df.balance.str.translate(translation_table)   
         
pd.DataFrame({
    'cleaned_float' : pd.to_numeric(clean),
    'original_text' : df.balance,
})
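
To make the translate approach easy to try without the CSV, here is a small sketch on invented accounting-style strings (the balances are hypothetical):

```python
import pandas as pd

# Hypothetical balances: parentheses mean negative, as in accounting notation.
balances = pd.Series(['$1,234.50', '(500.00)', ' $ 20 '])

# One pass over each string: '(' becomes a minus sign, everything else listed is dropped.
table = str.maketrans({'(': '-', ')': '', '$': '', ',': '', ' ': ''})

amounts = pd.to_numeric(balances.str.translate(table))
print(amounts)
```

A translation table handles only single-character substitutions; multi-character cleanup still needs str.replace.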

In [None]:
# We round using round.
pd.DataFrame(
    {
        'cleaned'  : pd.to_numeric(clean),
        'one_digit': pd.to_numeric(clean).round(1),
    }
)

In [None]:
# Often we need to rein in obviously bad values. clip() sets a floor and/or a ceiling.
pd.DataFrame(
    {
        'original'  : df.height,
        'coerced'   : coerced,
        'lower_clip': coerced.clip(lower=0),
        'upper_clip': coerced.clip(upper=100),
        'both_clip' : coerced.clip(lower=0, upper=100),
    }
)

In [None]:
# Pandas cut is useful for turning continuous numbers into discrete categories.
# linspace gives exactly 11 edges from 0 to 1 and avoids float drift at the top edge
my_bins = np.linspace(0, 1, 11)
pd.DataFrame(
    {
        'original': df.pure_numeric,
        'category': pd.cut(df.pure_numeric, bins=my_bins)
    }
)

# You can also pass bins=5 to get five equal-width bins

In [None]:
# More usefully, explicit edges plus labels turn percentages into letter grades
pd.DataFrame(
    {
        'percent': df.pure_numeric,
        'grade': pd.cut(
            df.pure_numeric, 
            bins=[0.0, 0.6, .65, .7, .85, 1.0],
            labels=['F', 'D', 'C', 'B', 'A']
        )
    }
)
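
One detail worth knowing about pd.cut is that bins are right-closed by default, so a value equal to the lowest edge falls outside every bin. The sketch below uses invented scores to show the difference that include_lowest=True makes.

```python
import pandas as pd

# Hypothetical scores chosen to sit on or near bin edges.
scores = pd.Series([0.0, 0.6, 0.61, 1.0])
edges = [0.0, 0.6, 0.65, 0.7, 0.85, 1.0]
labels = ['F', 'D', 'C', 'B', 'A']

# Default: intervals are (low, high], so 0.0 is not in any bin.
default = pd.cut(scores, bins=edges, labels=labels)

# include_lowest=True closes the first interval on the left as well.
closed_low = pd.cut(scores, bins=edges, labels=labels, include_lowest=True)

print(pd.DataFrame({'score': scores, 'default': default, 'include_lowest': closed_low}))
```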

# Datetimes

[.dt namespace methods](https://pandas.pydata.org/pandas-docs/stable/api.html#datetimelike-properties)

Like strings, datetimes have a special namespace used to access their functions.
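
A quick sketch of the .dt namespace on invented timestamps (hypothetical values, with an explicit format so parsing does not depend on inference):

```python
import pandas as pd

# Hypothetical timestamps, one of them unparseable.
raw = pd.Series(['2018-01-05 13:45:00', 'not a date', '2020-02-29 00:00:00'])
stamps = pd.to_datetime(raw, format='%Y-%m-%d %H:%M:%S', errors='coerce')

# Components and helpers live under .dt; NaT rows yield missing values.
summary = pd.DataFrame({
    'stamp': stamps,
    'year': stamps.dt.year,
    'weekday': stamps.dt.day_name(),
    'days_in_month': stamps.dt.days_in_month,
    'midnight': stamps.dt.normalize(),
})
print(summary)
```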

In [None]:
# Show df for clarity
df

In [None]:
# pd.to_datetime() is our workhorse function.
# Note: errors='ignore' is deprecated since pandas 2.2; prefer 'coerce' in new code.
pd.DataFrame(
    {  
        'coerce': pd.to_datetime(df['birthday'], errors='coerce'),
        'ignore': pd.to_datetime(df['birthday'], errors='ignore'),
    }
)

In [None]:
# It also handles unusual date strings if you supply an explicit format string
format_string = 'The %dth dayeth of %B in the year %Y'
dates = pd.to_datetime(df['birthday'], errors='coerce', format=format_string)

In [None]:
# This allows us to do all the fancy dt stuff.
pd.DataFrame(
    {
        'converted'    : dates,
        'day'          : dates.dt.day,
        'month'        : dates.dt.month,
        'year'         : dates.dt.year,
        'day_in_week'  : dates.dt.dayofweek,
        'days_in_month': dates.dt.days_in_month,
        'weekday'      : dates.dt.day_name()
    }
).dropna()

# Strptime and strftime reference

<table class="docutils" border="1">
<colgroup>
<col width="15%">
<col width="50%">
<col width="35%">
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Directive</th>
<th class="head">Meaning</th>
<th class="head">Example</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">%a</span></code></td>
<td>Weekday as locale’s
abbreviated name.</td>
<td><div class="first last line-block">
<div class="line">Sun, Mon, …, Sat
(en_US);</div>
<div class="line">So, Mo, …, Sa
(de_DE)</div>
</div>
</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">%A</span></code></td>
<td>Weekday as locale’s full name.</td>
<td><div class="first last line-block">
<div class="line">Sunday, Monday, …,
Saturday (en_US);</div>
<div class="line">Sonntag, Montag, …,
Samstag (de_DE)</div>
</div>
</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">%w</span></code></td>
<td>Weekday as a decimal number,
where 0 is Sunday and 6 is
Saturday.</td>
<td>0, 1, …, 6</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">%d</span></code></td>
<td>Day of the month as a
zero-padded decimal number.</td>
<td>01, 02, …, 31</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">%b</span></code></td>
<td>Month as locale’s abbreviated
name.</td>
<td><div class="first last line-block">
<div class="line">Jan, Feb, …, Dec
(en_US);</div>
<div class="line">Jan, Feb, …, Dez
(de_DE)</div>
</div>
</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">%B</span></code></td>
<td>Month as locale’s full name.</td>
<td><div class="first last line-block">
<div class="line">January, February,
…, December (en_US);</div>
<div class="line">Januar, Februar, …,
Dezember (de_DE)</div>
</div>
</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">%m</span></code></td>
<td>Month as a zero-padded
decimal number.</td>
<td>01, 02, …, 12</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">%y</span></code></td>
<td>Year without century as a
zero-padded decimal number.</td>
<td>00, 01, …, 99</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">%Y</span></code></td>
<td>Year with century as a decimal
number.</td>
<td>0001, 0002, …, 2013,
2014, …, 9998, 9999</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">%H</span></code></td>
<td>Hour (24-hour clock) as a
zero-padded decimal number.</td>
<td>00, 01, …, 23</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">%I</span></code></td>
<td>Hour (12-hour clock) as a
zero-padded decimal number.</td>
<td>01, 02, …, 12</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">%p</span></code></td>
<td>Locale’s equivalent of either
AM or PM.</td>
<td><div class="first last line-block">
<div class="line">AM, PM (en_US);</div>
<div class="line">am, pm (de_DE)</div>
</div>
</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">%M</span></code></td>
<td>Minute as a zero-padded
decimal number.</td>
<td>00, 01, …, 59</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">%S</span></code></td>
<td>Second as a zero-padded
decimal number.</td>
<td>00, 01, …, 59</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">%f</span></code></td>
<td>Microsecond as a decimal
number, zero-padded on the
left.</td>
<td>000000, 000001, …,
999999</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">%z</span></code></td>
<td>UTC offset in the form +HHMM
or -HHMM (empty string if the
object is naive).</td>
<td>(empty), +0000, -0400,
+1030</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">%Z</span></code></td>
<td>Time zone name (empty string
if the object is naive).</td>
<td>(empty), UTC, EST, CST</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">%j</span></code></td>
<td>Day of the year as a
zero-padded decimal number.</td>
<td>001, 002, …, 366</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">%U</span></code></td>
<td>Week number of the year
(Sunday as the first day of
the week) as a zero padded
decimal number. All days in a
new year preceding the first
Sunday are considered to be in
week 0.</td>
<td>00, 01, …, 53</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">%W</span></code></td>
<td>Week number of the year
(Monday as the first day of
the week) as a decimal number.
All days in a new year
preceding the first Monday
are considered to be in
week 0.</td>
<td>00, 01, …, 53</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">%c</span></code></td>
<td>Locale’s appropriate date and
time representation.</td>
<td><div class="first last line-block">
<div class="line">Tue Aug 16 21:30:00
1988 (en_US);</div>
<div class="line">Di 16 Aug 21:30:00
1988 (de_DE)</div>
</div>
</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">%x</span></code></td>
<td>Locale’s appropriate date
representation.</td>
<td><div class="first last line-block">
<div class="line">08/16/88 (None);</div>
<div class="line">08/16/1988 (en_US);</div>
<div class="line">16.08.1988 (de_DE)</div>
</div>
</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">%X</span></code></td>
<td>Locale’s appropriate time
representation.</td>
<td><div class="first last line-block">
<div class="line">21:30:00 (en_US);</div>
<div class="line">21:30:00 (de_DE)</div>
</div>
</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">%%</span></code></td>
<td>A literal <code class="docutils literal notranslate"><span class="pre">'%'</span></code> character.</td>
<td>%</td>
</tr>
</tbody>
</table>
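
The directives in the table can be tried directly with the standard library, and the same format string works with pd.to_datetime. This is a small sketch with invented dates; it sticks to numeric directives because name-based ones such as %A and %B depend on the locale.

```python
from datetime import datetime

import pandas as pd

fmt = '%d/%m/%Y %H:%M'

# strptime: text -> datetime, guided by the format string.
parsed = datetime.strptime('16/08/1988 21:30', fmt)

# strftime: datetime -> text, using any mix of directives.
iso_date = parsed.strftime('%Y-%m-%d')
day_of_year = parsed.strftime('%j')
print(iso_date, day_of_year)

# The same format string drives the vectorized pandas parser.
series = pd.to_datetime(pd.Series(['16/08/1988 21:30', '01/12/2001 08:05']), format=fmt)
print(series)
```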

In [None]:
# Though not optimal, you can use apply()/map() to do more complex transformations
spanish_months = {
    month: str(index)
    for index, month
    in enumerate([
        'enero','febrero', 'marzo', 'abril', 
        'mayo', 'junio', 'julio', 'agosto',
        'septiembre', 'octubre', 'noviembre', 'diciembre'
    ], start=1)
}

def custom_converter(cell):
    '''This does some simple remapping.'''
    try:
        # Lower case
        content = cell.lower()
        # Split at the first comma -> ('enero 5', ',', ' 2018')
        month_and_day, sep, year = content.partition(',')
        # Split substring 'enero 5' -> ['enero', '5']
        month, day = month_and_day.strip().split()
        # Map the Spanish month name to its month number (as a string)
        month = spanish_months[month]
        # Build a 'year-month-day' string and parse it
        time_string = year.strip() + '-' + month + '-' + day
        return pd.to_datetime(time_string)
    except Exception:
        # Non-strings, unknown months, and unparseable dates become NaT
        return pd.NaT

spanish = pd.Series(['enero 5, 2018', 'Agosto 20, 2015', 'July 20, 2012'])
converted = spanish.map(custom_converter)

pd.DataFrame({
    'spanish'  : spanish,
    'converted': converted,
})
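
For comparison, the same conversion can be sketched without a Python-level function by combining .str.extract, a dictionary map, and pd.to_datetime. The regex and the variable names here are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

MONTHS = [
    'enero', 'febrero', 'marzo', 'abril', 'mayo', 'junio',
    'julio', 'agosto', 'septiembre', 'octubre', 'noviembre', 'diciembre',
]
# Zero-padded month numbers keyed by Spanish month name.
month_numbers = {name: f'{i:02d}' for i, name in enumerate(MONTHS, start=1)}

spanish = pd.Series(['enero 5, 2018', 'Agosto 20, 2015', 'July 20, 2012'])

# Assumed layout: '<month name> <day>, <4-digit year>'.
pattern = r'^\s*(?P<month>[a-z]+)\s+(?P<day>\d{1,2}),\s*(?P<year>\d{4})\s*$'
parts = spanish.str.lower().str.extract(pattern)

# Unknown month names map to NaN, which propagates through the concatenation.
iso = parts['year'] + '-' + parts['month'].map(month_numbers) + '-' + parts['day'].str.zfill(2)
converted = pd.to_datetime(iso, format='%Y-%m-%d', errors='coerce')

print(pd.DataFrame({'spanish': spanish, 'converted': converted}))
```

The English 'July' row matches the layout but has no entry in the month table, so it ends up as NaT rather than raising.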

In [None]:
# Normalize sets all the times to midnight
bdays = pd.to_datetime(df['birthday'], errors='coerce').dropna()

print(bdays)
print(bdays.dt.normalize())

In [None]:
# We can convert to weeks or months using dt.to_period()
# Periods represent spans of time, as opposed to point-in-time timestamps.
pd.DataFrame(
    {
        'week'    : bdays.dt.to_period('W'),
        'quarter' : bdays.dt.to_period('Q'),
        'hour'    : bdays.dt.to_period('H'),
        'month'   : bdays.dt.to_period('M'),
        'day'     : bdays.dt.to_period('D'),
    }
)

# See 'Offset Aliases' below for potential strings.
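
A short sketch of periods on invented dates, showing that a period is a span which can be converted back to the timestamp at its start:

```python
import pandas as pd

# Hypothetical dates for illustration.
stamps = pd.Series(pd.to_datetime(['2018-01-05', '2018-03-31']))

# Each timestamp is replaced by the month or quarter that contains it.
months = stamps.dt.to_period('M')
quarters = stamps.dt.to_period('Q')
print(pd.DataFrame({'stamp': stamps, 'month': months, 'quarter': quarters}))

# to_timestamp() goes back to a point in time (the start of the period by default).
month_starts = months.dt.to_timestamp()
print(month_starts)
```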

In [None]:
# And you can output strings as needed
print(bdays.dt.strftime(format_string))

# Additional Learning Resources

* ### [Pandas: Working With Text Data](http://pandas.pydata.org/pandas-docs/stable/text.html)
* ### [Pandas: Working With Time Series](http://pandas.pydata.org/pandas-docs/stable/timeseries.html)
* ### [Pandas: Computational Tools](https://pandas.pydata.org/pandas-docs/stable/computation.html)
* ### [Pandas: Offset Aliases](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases)
* ### [Python Strftime and Strptime Behavior](https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior)

---

# Next Up: [Other](5_other.ipynb)

<div>
    <br>
    <img style="margin-left: 0;" src='./static/other.png' width="200">
    <br>
    Image courtesy of <a href='https://commons.wikimedia.org/wiki/File:Flag_of_None.svg'>Rainer Zenz</a>. Image is public domain.

</div>




---