<a href="https://github.com/theonaunheim">
    <img style="border-radius: 100%; float: right;" src="static/strawberry_thief_square.png" width=10% alt="Theo Naunheim's Github">
</a>

<br style="clear: both">
<hr>
<br>



<h1 align='center'>Other</h1>

<br>

<div style="display: table; width: 100%">
    <div style="display: table-row; width: 100%;">
        <div style="display: table-cell; width: 50%; vertical-align: middle;">
            <img src="static/other.png" width="400">
        </div>
        <div style="display: table-cell; width: 10%">
        </div>
        <div style="display: table-cell; width: 40%; vertical-align: top;">
            <blockquote>
                <p style="font-style: italic;">'"This entire thing is the quote, not just the part in quote marks." [Quote marks, brackets, and editor's note are all in the original. -Ed.]'</p>
                <br>
                <p>— xkcd</p>
            </blockquote>
        </div>
    </div>
</div>

<br>

<div align='left'>
    <br>
    Image courtesy of <a href='https://commons.wikimedia.org/wiki/File:Flag_of_None.svg'>Rainer Zenz</a>. Image is public domain.
</div>

<hr>

# Other

We have tried to give you a high-level overview of the tools available for cleaning data, but that overview only scratches the surface. Some tools and techniques also do not fit neatly into the categories explored so far, so we cover them here.

---

# Modules covered

### Standard Library
* [csv](https://docs.python.org/3/library/csv.html)


### Third-Party Libraries
* [pandas](https://pandas.pydata.org/)
* [dask](https://dask.pydata.org/en/latest)
* [numpy](https://numpy.org/)


# Modules not covered

### Standard Library
* A whole bunch

### Third-Party Libraries
* A whole bunch

---

In [1]:
# Python stdlib imports
import csv

# Third party imports
import dask.dataframe
import numpy as np
import pandas as pd

# Strategies for dealing with ginormous data

By default, pandas loads data eagerly rather than lazily: it focuses on in-memory computation on arrays, Series, and DataFrames, which is the more convenient way to work. At times, however, your data will be too large to fit in RAM.

In [2]:
# Using a categorical dtype is akin to using a smaller surrogate key:
# each distinct string is stored once and the rows hold small integer codes.
names = pd.read_fwf('./data/iris_dataset.txt')['class']
names_cat = names.astype('category')

names_mu     = names.memory_usage()
names_cat_mu = names_cat.memory_usage()
multiplier   = names_mu / names_cat_mu
print('In this case, the strings use {} times more space than a categorical.'.format(multiplier))

In this case, the strings use 3.8447761194029852 times more space than a categorical.
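
One caveat about the measurement above: for object-dtype strings, `memory_usage()` without `deep=True` counts only the pointers to the strings, not the strings themselves. The following is a minimal sketch of a deeper comparison. It uses a synthetic Series (the values and row count are made up for illustration, not read from the iris file), so the exact numbers will differ from the output above.

```python
import pandas as pd

# Synthetic stand-in for a low-cardinality string column (hypothetical data).
labels = pd.Series(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'] * 1000)
labels_cat = labels.astype('category')

# deep=True asks pandas to measure the string contents as well;
# index=False leaves the index out so only the values are compared.
labels_bytes = labels.memory_usage(index=False, deep=True)
labels_cat_bytes = labels_cat.memory_usage(index=False, deep=True)

print('strings: {} bytes, categorical: {} bytes'.format(labels_bytes, labels_cat_bytes))
```

The saving depends on how often values repeat; a column of mostly unique strings gains little from a categorical dtype.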


In [3]:
# We can read the file in chunks (here, 1024 rows at a time).
for chunk in pd.read_csv('./data/ascii_table.csv', chunksize=1024):
    # process chunk
    pass
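
The "process chunk" step usually means keeping a small running result rather than the chunks themselves. Here is a rough sketch of that pattern, assuming an in-memory CSV in place of a large file; the `character` column and the chunk size of 10 are illustrative choices, not taken from `ascii_table.csv`.

```python
import io

import pandas as pd

# Small in-memory CSV standing in for a large file (hypothetical data).
csv_text = 'decimal_value,character\n' + '\n'.join(
    '{},{}'.format(i, chr(i)) for i in range(65, 91)
)

running_max = None
row_count = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=10):
    # Each chunk is an ordinary DataFrame of at most 10 rows.
    chunk_max = chunk['decimal_value'].max()
    running_max = chunk_max if running_max is None else max(running_max, chunk_max)
    row_count += len(chunk)

print(running_max, row_count)
```

Only one chunk is held in memory at a time, so peak memory is governed by `chunksize` rather than by the file size.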

In [4]:
# As an alternative to chunked reads, the csv module can be even faster when you only need a single column.
with open('./data/ascii_table.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    data_header = next(reader)[0]
    for remaining_row in reader:
        # Process as needed lazily.
        pass
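
To make the "single column, processed lazily" idea concrete, the sketch below streams one column with `csv.DictReader` and a generator. The helper name `iter_column` and the sample data are hypothetical; only the standard library is used.

```python
import csv
import io

# Tiny in-memory CSV standing in for a file on disk (hypothetical data).
csv_text = 'decimal_value,character\n65,A\n66,B\n67,C\n'


def iter_column(file_obj, column_name):
    """Yield one column's values row by row without loading the whole file."""
    reader = csv.DictReader(file_obj)
    for row in reader:
        yield row[column_name]


# The generator is consumed one row at a time, so nothing is accumulated.
largest = max(int(value) for value in iter_column(io.StringIO(csv_text), 'decimal_value'))
print(largest)
```

Note that the csv module yields every value as a string, so any type conversion (here `int`) is up to you.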

In [5]:
# Dask helps with lazy computation, but we need to explicitly tell it to compute().
dd = dask.dataframe.read_csv('./data/ascii_table.csv')
decimal_max = dd['decimal_value'].max().compute()
decimal_max

127

# Other easy preprocessing functions

In [6]:
# Load data
df = pd.read_fwf('./data/iris_dataset.txt')
df.head(5)

Unnamed: 0,sepal_length_cm,sepal_width_cm,petal_length_cm,petal_width_cm,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [7]:
# cumsum gives a running total down the column.
pd.DataFrame(
    {
        'original'      : df.sepal_length_cm,
        'cumulative_sum': df.sepal_length_cm.cumsum(),
    }
).head()

Unnamed: 0,original,cumulative_sum
0,5.1,5.1
1,4.9,10.0
2,4.7,14.7
3,4.6,19.3
4,5.0,24.3


In [8]:
# diff and pct_change are useful for row-to-row differences.
pd.DataFrame(
    {
        'original'   : df.sepal_length_cm,
        'diff'       : df.sepal_length_cm.diff(),
        'pct_change' : df.sepal_length_cm.pct_change(), 
    }
).head().round(2)

Unnamed: 0,original,diff,pct_change
0,5.1,,
1,4.9,-0.2,-0.04
2,4.7,-0.2,-0.04
3,4.6,-0.1,-0.02
4,5.0,0.4,0.09
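
As a sanity check on what these two methods compute, the sketch below rebuilds them by hand from the previous row, using the first five sepal lengths shown above. It assumes no missing values in the input.

```python
import numpy as np
import pandas as pd

s = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0])

# diff is the current value minus the previous one;
# pct_change divides that difference by the previous value.
manual_diff = s - s.shift(1)
manual_pct = manual_diff / s.shift(1)

# equal_nan=True treats the leading NaN in each result as a match.
diff_matches = np.allclose(s.diff(), manual_diff, equal_nan=True)
pct_matches = np.allclose(s.pct_change(), manual_pct, equal_nan=True)
print(diff_matches, pct_matches)
```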


In [9]:
# The T property on a pandas DataFrame transposes the data (reflects over its diagonal)
df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,141,142,143,144,145,146,147,148,149,150
sepal_length_cm,5.1,4.9,4.7,4.6,5,5.4,4.6,5,4.4,4.9,...,6.9,5.8,6.8,6.7,6.7,6.3,6.5,6.2,5.9,
sepal_width_cm,3.5,3,3.2,3.1,3.6,3.9,3.4,3.4,2.9,3.1,...,3.1,2.7,3.2,3.3,3,2.5,3,3.4,3,
petal_length_cm,1.4,1.4,1.3,1.5,1.4,1.7,1.4,1.5,1.4,1.5,...,5.1,5.1,5.9,5.7,5.2,5,5.2,5.4,5.1,
petal_width_cm,0.2,0.2,0.2,0.2,0.2,0.4,0.3,0.2,0.2,0.1,...,2.3,1.9,2.3,2.5,2.3,1.9,2,2.3,1.8,
class,Iris-setosa,Iris-setosa,Iris-setosa,Iris-setosa,Iris-setosa,Iris-setosa,Iris-setosa,Iris-setosa,Iris-setosa,Iris-setosa,...,Iris-virginica,Iris-virginica,Iris-virginica,Iris-virginica,Iris-virginica,Iris-virginica,Iris-virginica,Iris-virginica,Iris-virginica,


In [10]:
# rank gives each value's position in sorted order; pct=True scales the ranks into (0, 1].
pd.DataFrame(
    {
        'sepal_orig'     : df.head(5).sepal_length_cm,
        'sepal_rank'     : df.head(5).sepal_length_cm.rank(ascending=True),
        'sepal_pct_rank' : df.head(5).sepal_length_cm.rank(ascending=True, pct=True),
    }
)

Unnamed: 0,sepal_orig,sepal_rank,sepal_pct_rank
0,5.1,5.0,1.0
1,4.9,3.0,0.6
2,4.7,2.0,0.4
3,4.6,1.0,0.2
4,5.0,4.0,0.8
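
The five values above happen to be distinct. When values tie, the `method` argument of `rank` decides what they receive. A small sketch with made-up values containing one tie:

```python
import pandas as pd

# Hypothetical values; 5.0 appears twice.
lengths = pd.Series([5.0, 4.9, 5.0, 4.6, 5.4])

ranks = pd.DataFrame(
    {
        'value'  : lengths,
        'average': lengths.rank(method='average'),  # default: ties share the mean rank
        'min'    : lengths.rank(method='min'),      # ties share the lowest rank
        'dense'  : lengths.rank(method='dense'),    # like min, but no gaps afterwards
    }
)
print(ranks)
```

With `average` the two 5.0 values both get 3.5; with `min` they both get 3 and the next value gets 5; with `dense` the next value gets 4.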


In [11]:
# Shift allows you to shift columns back and forth.
pd.DataFrame(
    {
        'sepal_length_cm'       : df.sepal_length_cm,
        'three_observations_ago': df.sepal_length_cm.shift(3), # pos or neg
        'diff'                  : df.sepal_length_cm - df.sepal_length_cm.shift(3),
    }
)[['sepal_length_cm', 'three_observations_ago', 'diff']].head()

Unnamed: 0,sepal_length_cm,three_observations_ago,diff
0,5.1,,
1,4.9,,
2,4.7,,
3,4.6,5.1,-0.5
4,5.0,4.9,0.1
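
Subtracting a shifted column is the same operation as `diff` with a `periods` argument, and a negative shift looks ahead instead of back. A short sketch using the first five sepal lengths:

```python
import numpy as np
import pandas as pd

s = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0])

# Two ways to compare each value with the one three rows earlier.
via_shift = s - s.shift(3)
via_diff = s.diff(periods=3)
same = np.allclose(via_shift, via_diff, equal_nan=True)

# A negative shift pulls values up, giving the *next* observation.
next_obs = s.shift(-1)
print(same)
print(next_obs)
```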


In [12]:
# See Python Data Analysis tools for merge and concatenate information.
pd.concat(
    [
        pd.merge(
            pd.DataFrame({'a': [1,5,3], 'b': [4,5,6]}),
            pd.DataFrame({'a': [3,2,1], 'c': [0,9,8]}),   
            how='left'
        ),
        pd.Series(['d','d','d'], name='d')
    ],
    axis=1
)

Unnamed: 0,a,b,c,d
0,1,4,8.0,d
1,5,5,,d
2,3,6,0.0,d
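
The `NaN` in column `c` above comes from the left row with `a == 5`, which has no partner on the right. When cleaning data it is often useful to make that explicit. The sketch below reuses the same two frames with `indicator=True`, which adds a `_merge` column; the variable names are illustrative.

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 5, 3], 'b': [4, 5, 6]})
right = pd.DataFrame({'a': [3, 2, 1], 'c': [0, 9, 8]})

# indicator=True records whether each row matched ('both') or not ('left_only').
merged = pd.merge(left, right, how='left', on='a', indicator=True)

# Rows from the left frame that found no match on the right.
unmatched = merged[merged['_merge'] == 'left_only']
print(merged)
```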


In [13]:
# Note: convert numbers to formatted strings only just before export.
pd.Series([1000000.29, 400000.99, 41112.38]).map(lambda x: '{:,.2f}'.format(x))

0    1,000,000.29
1      400,000.99
2       41,112.38
dtype: object
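
The reason for formatting last is that the result holds strings, so numeric operations no longer apply to it. One way to sketch this is to keep the numeric Series and derive the formatted copy from it; the bound method `'{:,.2f}'.format` is used here as an equivalent of the lambda above.

```python
import pandas as pd

amounts = pd.Series([1000000.29, 400000.99, 41112.38])

# Do the arithmetic while the data is still numeric.
total = amounts.sum()

# Format a separate copy for export; `amounts` itself stays numeric.
formatted = amounts.map('{:,.2f}'.format)
print(formatted.tolist())
```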

# Useful NumPy functions

In [14]:
# np.where gets you the indices where a particular condition holds
TARGET_VALUE = 5.5
x_vals, y_vals = np.where(df.select_dtypes([np.float64]) == TARGET_VALUE)

indices = df.index[x_vals].values
columns = df.columns[y_vals].values

for index, column in zip(indices, columns):
    print('Sample {} is equal to {} in column {}'.format(index, TARGET_VALUE, column))

Sample 33 is equal to 5.5 in column sepal_length_cm
Sample 36 is equal to 5.5 in column sepal_length_cm
Sample 53 is equal to 5.5 in column sepal_length_cm
Sample 80 is equal to 5.5 in column sepal_length_cm
Sample 81 is equal to 5.5 in column sepal_length_cm
Sample 89 is equal to 5.5 in column sepal_length_cm
Sample 90 is equal to 5.5 in column sepal_length_cm
Sample 112 is equal to 5.5 in column petal_length_cm
Sample 116 is equal to 5.5 in column petal_length_cm
Sample 137 is equal to 5.5 in column petal_length_cm
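
`np.where` also has a three-argument form that picks between two values element by element, which is handy for recoding. A brief sketch with made-up lengths; the 5.5 cut-off and the labels are arbitrary choices for illustration.

```python
import numpy as np

# Hypothetical sepal lengths.
sepal_lengths = np.array([5.1, 4.9, 5.5, 6.3])

# One argument: a tuple of index arrays where the condition is True.
rows = np.where(sepal_lengths == 5.5)[0]

# Three arguments: take 'long' where the condition holds, else 'short'.
labels = np.where(sepal_lengths >= 5.5, 'long', 'short')
print(rows, labels)
```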


In [15]:
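# argsort returns the indices that would sort the input;
# applying it a second time turns those indices into each item's rank.
list_to_sort = ['a', 'c', 'b', 'd']
indices = np.argsort(np.argsort(list_to_sort)) + 1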

for letter, index in zip(list_to_sort, indices):
    print('Letter {} will be item {} in a sorted list.'.format(letter, index))

Letter a will be item 1 in a sorted list.
Letter c will be item 3 in a sorted list.
Letter b will be item 2 in a sorted list.
Letter d will be item 4 in a sorted list.
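
A caution when using `argsort` for ranking: it returns the indices that would sort the input, which is not the same thing as each item's rank. For the list above the two coincide, but in general they differ. A small sketch with a list where they do:

```python
import numpy as np

letters = ['c', 'a', 'b']

# order[i] is the position in `letters` of the i-th smallest item.
order = np.argsort(letters)

# ranks[i] is the zero-based position of letters[i] in the sorted list.
ranks = np.argsort(order)

sorted_letters = [letters[i] for i in order]
print(order, ranks, sorted_letters)
```

Here `order` is `[1, 2, 0]` while `ranks` is `[2, 0, 1]`: 'c' sits at index 0 but is last in sorted order.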


# Additional Learning Resources

* None

---

# Questions Before Exercises?


# Next Up: [Cleaning Exercises](6_cleaning_exercises.ipynb)

---