<a href="https://github.com/theonaunheim">
    <img style="border-radius: 100%; float: right;" src="static/strawberry_thief_square.png" width=10% alt="Theo Naunheim's Github">
</a>

<br style="clear: both">
<hr>
<br>

<h1 align='center'>Loading and Exploring</h1>

<br>

<div style="display: table; width: 100%">
    <div style="display: table-row; width: 100%;">
        <div style="display: table-cell; width: 50%; vertical-align: middle;">
            <img src="static/loading.png" width="300">
        </div>
        <div style="display: table-cell; width: 10%">
        </div>
        <div style="display: table-cell; width: 40%; vertical-align: top;">
            <blockquote>
                <p style="font-style: italic;">“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”</p>
                <br>
                <p>— Sherlock Holmes</p>
                <br>
            </blockquote>
        </div>
    </div>
</div>

<br>

<div align='left'>

Image courtesy of <a href='https://commons.wikimedia.org/wiki/File:New_Zealand_RP-7.1_(right).svg'>The New Zealand Transport Authority</a>, released into the public domain.

<hr>

# Generally

Like most things, loading data in Python goes much more smoothly if you can get it right the first time. Consequently, we are going to focus on mechanisms for loading data first, and then later work on how to _munge_ that data.

> “**Data wrangling**, sometimes referred to as **data munging**, is the process of transforming and mapping data from one ‘raw’ data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics.”¹

Pandas has a large number of [supported input types](https://pandas.pydata.org/pandas-docs/stable/io.html), but most of our work will be done through the workhorse [read_csv()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv) function, the accompanying [read_excel()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html) function, and assorted [DataFrame construction methods](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).

---
¹Wikipedia contributors, “Data wrangling,” Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/w/index.php?title=Data_wrangling&oldid=834062041 (accessed May 16, 2018).

---

# Modules covered

### Standard Library
* [csv](https://docs.python.org/3.4/library/csv.html)
* [pathlib](https://docs.python.org/3/library/pathlib.html)

### Third-Party Libraries
* [chardet](http://chardet.readthedocs.io/en/latest/usage.html)
* [numpy](https://docs.scipy.org/doc/numpy-1.13.0/reference/)
* [pandas](https://pandas.pydata.org/)
* [tabulate](https://tabulate.readthedocs.io/en/latest/)


# Modules not covered

### Standard Library
* None

### Third-Party Libraries
* None

---

In [None]:
# stdlib imports
import csv
import pathlib

# Third-party imports
import chardet

import numpy as np
import pandas as pd

# The Wide, Wonderful, Wacky World of Filetypes

Sometimes people have access to conformant CSV files, well-structured databases, and well-kept Excel spreadsheets.

And sometimes, you get a trainwreck of a database or streams of bytes that want you to believe they’re data, but in reality just really wanna ruin your day…

### Fixed-width files

Fixed-width files are handled with [read_fwf()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_fwf.html). You can infer columns if the file is well-structured, or specify their sizes with the 'widths' keyword argument. Headers are inferred by default; if the file has none, say so with header=None.

In [None]:
# Specify fixed-width file path
IRIS_FWF_PATH = './data/iris_dataset.txt'

# Sometimes people send you fixed width files...
with open(IRIS_FWF_PATH) as f:
    print(f.read(256))

# ... and you can read these with pd.read_fwf()
df = pd.read_fwf(IRIS_FWF_PATH, colspecs='infer', widths=None)
df.head(2)
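
When inference guesses the column boundaries wrong, you can spell them out. The following is a minimal sketch using a made-up two-row sample (not the iris file above) and hypothetical column names; it assumes each row is laid out as 5 + 5 + 10 characters with no header line.

```python
import io

import pandas as pd

# Made-up fixed-width sample: 5 + 5 + 10 characters per row, no header line.
raw = (
    "  5.1  3.5    setosa\n"
    "  7.0  3.2versicolor\n"
)

df = pd.read_fwf(
    io.StringIO(raw),
    widths=[5, 5, 10],  # explicit column sizes instead of inferred ones
    header=None,        # the sample has no header row
    names=['sepal_length', 'sepal_width', 'species'],  # hypothetical names
)
print(df)
```

Note that the second row has no whitespace between the second and third fields, which is exactly the case where inference tends to struggle and explicit widths help.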

### Franken-CSVs

Sometimes CSVs are not really CSVs. If someone sends you a janky CSV with weird separators and line breaks, Pandas can handle it with read_csv().

In [None]:
# Specify frankencsv path.
FRANKEN_CSV_PATH = './data/website_feedback.csv'

# Sometimes people send you weird files.
with open(FRANKEN_CSV_PATH) as f:
    print(f.read(500))

# You can read these with pd.read_csv()
df = pd.read_csv(
    FRANKEN_CSV_PATH,
    sep=';',
    # Only the 'sep' argument is needed, the others are just for show.
    index_col='title',
    usecols=[0, 1, 2, 3, 4, 5],
    dtype={'category': 'category'},
    converters={'satistifaction': pd.to_numeric},
    thousands=',',
    na_values='no',
    keep_default_na=True,
    quotechar='"',
    escapechar=None,
    encoding='utf8',
    doublequote=True,
    delimiter=None,
    lineterminator=None,
    # error_bad_lines=True was removed in pandas 2.0; on_bad_lines replaces it.
    on_bad_lines='error',
    skip_blank_lines=False,
)
df.head(5)
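
If you want to experiment without the data file, here is a small sketch on a made-up semicolon-separated sample. The column names and values are invented for illustration; only `sep` is strictly required, and the other arguments show how quoting and thousands separators interact with it.

```python
import io

import pandas as pd

# Made-up sample: one quoted field contains the separator itself,
# and the numeric column uses commas as thousands separators.
raw = (
    'title;category;visits\n'
    '"Home; landing";nav;1,200\n'
    'Contact;form;35\n'
)

df = pd.read_csv(
    io.StringIO(raw),
    sep=';',            # the only argument strictly needed here
    quotechar='"',      # the default, shown for clarity
    thousands=',',      # parse "1,200" as the integer 1200
    index_col='title',
)
print(df)
```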

### pandas.read_csv() signature

pandas.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=None, error_bad_lines=True, warn_bad_lines=True, skipfooter=0, skip_footer=0, doublequote=True, delim_whitespace=False, as_recarray=None, compact_ints=None, use_unsigned=None, low_memory=True, buffer_lines=None, memory_map=False, float_precision=None)

Can you remember all that? Me neither. [Read the documentation!](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv) That’s what it’s there for!

### Excel files

For better or worse, people use a lot of Excel. Pandas supports that (even if I personally **do not**) via [read_excel()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html).

In [None]:
# Specify Excel doc path
EXCEL_PATH = './data/greenway_analysis.xlsx'

# Excel files are binary, so the raw bytes are not human-readable.
with open(EXCEL_PATH, 'rb') as f:
    print(f.read(100))

# You can read these with pd.read_excel()
df = pd.read_excel(EXCEL_PATH)
df.head(3)

### pandas.read_excel() signature

pandas.read_excel(io, sheet_name=0, header=0, skiprows=None, skip_footer=0, index_col=None, names=None, usecols=None, parse_dates=False, date_parser=None, na_values=None, thousands=None, convert_float=True, converters=None, dtype=None, true_values=None, false_values=None, engine=None, squeeze=False, **kwds)

In [None]:
### Warning: opening a CSV with read_excel will fail.
pd.read_excel('./data/requests-for-open-data.csv')

In [None]:
### Warning: opening an Excel file with read_csv will fail.
pd.read_csv('./data/greenway_analysis.xlsx')

### Other types

Remember, if you run into difficulty, see the supported types or build something from scratch.

* pd.read_clipboard()
* pd.read_gbq()
* pd.read_hdf()
* pd.read_html()
* pd.read_json()
* pd.read_sas()
* pd.read_sql()
* pd.read_stata()
* pd.read_table()

And you can also build dataframes from scratch with raw data.

* pd.DataFrame()
* pd.DataFrame.from_dict()
* pd.DataFrame.from_records()
* pd.DataFrame.from_items() (deprecated in pandas 0.23 and removed in 1.0; use pd.DataFrame.from_dict() instead)

In [None]:
pd.DataFrame({
    'Nbr': np.random.normal(1, 3, size=3),
    'Msg': ['such dataframe', 'very elegant', 'wow']
})
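
The alternate constructors mostly differ in the shape of raw data they expect. Here is a short sketch with invented records showing `from_records()` for a list of row dicts and `from_dict()` with `orient='index'` for a dict keyed by row label.

```python
import pandas as pd

# Hypothetical row-oriented records, as you might get from an API or a cursor.
records = [
    {'name': 'alpha', 'score': 1.5},
    {'name': 'beta', 'score': 2.5},
]
from_records = pd.DataFrame.from_records(records, index='name')

# A dict of dicts keyed by row label; orient='index' turns the keys into rows.
by_row = {'alpha': {'score': 1.5}, 'beta': {'score': 2.5}}
from_dict = pd.DataFrame.from_dict(by_row, orient='index')

print(from_records)
print(from_dict)
```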


# Wrong encodings

In olden times, you didn't have to worry about [character encoding](https://en.wikipedia.org/wiki/Character_encoding), which you can think of as the manner in which bytes translate to characters. You had [ASCII](https://en.wikipedia.org/wiki/ASCII) (or if you were particularly unlucky, [EBCDIC](https://en.wikipedia.org/wiki/EBCDIC)), and you didn't have to convert things back and forth.

We are not so lucky these days, but you do have tools to deal with it. Common encodings you will come across are 'cp1252' (Windows), 'utf8' (everywhere), 'ascii' (old, but good), and 'latin-1' (evil). A [list of encodings supported out of the box in Python is here](https://docs.python.org/3/library/codecs.html#standard-encodings).

Note: both 'read_csv()' and 'to_csv()' give you the option to save or read whatever encoding you please. The default is utf8.

In [None]:
# The default encoding is utf8
df = pd.read_csv('./data/weather.csv', encoding='utf8')
df.head(5)

In [None]:
# Which can cause other encodings to fail
df = pd.read_csv('./data/weather.csv', encoding='cp1252')
df.head(5)

In [None]:
# Which can cause other encodings to fail
df = pd.read_csv('./data/weather.csv', encoding='ascii')
df.head(5)
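
To see why the wrong encoding bites, it helps to look at the bytes directly. This sketch uses plain Python strings and a made-up sample value (not the weather file) to show the two typical failure modes: a hard error and silent mojibake.

```python
# A made-up sample string containing non-ASCII characters.
text = 'Zürich,21°C'

as_cp1252 = text.encode('cp1252')
as_utf8 = text.encode('utf8')

# The same characters produce different bytes under each encoding.
print(as_cp1252)
print(as_utf8)

# Decoding cp1252 bytes as UTF-8 fails outright...
try:
    as_cp1252.decode('utf8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# ...while decoding UTF-8 bytes as cp1252 "succeeds" but garbles the text.
mojibake = as_utf8.decode('cp1252')
print(decoded_ok, mojibake)
```

The silent case is the more dangerous one, since nothing raises and the damage only shows up when someone reads the values.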

### How do we determine encodings?

Your first option is to always ensure you are loading from and saving to the default encoding, UTF-8. Alternatively, you can use chardet.

In [None]:
# Find all our CSVs (rglob already searches subdirectories recursively).
cwd = pathlib.Path.cwd()
files = list(cwd.rglob('*.csv'))

records = []

# For each CSV
for csv_file in files:
    # Read the raw bytes and let chardet guess the encoding
    with open(csv_file, 'rb') as f:
        encoding = chardet.detect(f.read())
    # Add the name and encoding info to our records
    records.append({
        'filename'  : csv_file.name,
        'encoding'  : encoding['encoding'],
        'confidence': encoding['confidence']
    })

# Build the frame once (DataFrame.append was removed in pandas 2.0)
df = pd.DataFrame(records, columns=['filename', 'encoding', 'confidence'])

# Show frame
df
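
If chardet is not available, a cruder fallback is to try a short list of candidate encodings in order. This is a swapped-in technique, not what chardet does internally, and the helper name `first_working_encoding` and the candidate list are assumptions made for illustration.

```python
def first_working_encoding(raw_bytes, candidates=('ascii', 'utf8', 'cp1252')):
    """Return the first candidate encoding that decodes raw_bytes, else None."""
    for name in candidates:
        try:
            raw_bytes.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Made-up sample bytes written with a Windows code page.
sample = 'café'.encode('cp1252')
guess = first_working_encoding(sample)
print(guess)
```

A successful decode only proves the bytes are valid in that encoding, not that the guess is right ('latin-1' will decode anything), so order the candidates from strictest to most permissive.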

# Malformed files

The best laid schemes o' mice an' men often result in seriously broken CSV files. Here are a couple common scenarios that a) apply to files as a whole and b) won't be covered in our datatype-specific cleaning later on.

### No headers

Generally pandas expects that you'll have a header row to name your columns. If not, no big deal.

In [None]:
# Note: the head function still works no matter what.
df = pd.read_csv('./data/headless.csv').head(2)
df

In [None]:
# Pass header=None and supply the column names yourself instead.
column_names = [
    'OBJECTID', 'COMPLEX_NAME', 'ORGANIZATION_NAME', 'UNIT_NAME', 'UNIT_TYPE', 
    'INTEREST_TYPE1', 'INTEREST_TYPE2', 'ACRES', 'SHAPE.AREA', 'SHAPE.LEN'
]

df = pd.read_csv(
    './data/headless.csv',
    header=None,
    names=column_names
)

# Alternatively
# df = pd.read_csv('./data/headless.csv', header=None)
# df.columns = column_names

df.head(3)
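
The cost of forgetting `header=None` is easy to miss: the first data row silently becomes the column names. A small sketch on a made-up two-row sample makes the difference visible.

```python
import io

import pandas as pd

# Made-up headerless sample with two data rows.
raw = '1,ARROWWOOD\n2,DEVILS LAKE\n'

# Without header=None the first data row is consumed as column names.
wrong = pd.read_csv(io.StringIO(raw))

# With header=None and explicit names, both rows survive as data.
right = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=['OBJECTID', 'COMPLEX_NAME'],
)

print(len(wrong), list(wrong.columns))
print(len(right), list(right.columns))
```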

### Duplicate data

Oftentimes you will have duplicate data. This can be easily rectified on a per-record or per-column basis.

In [None]:
# Load data
df = pd.read_csv('./data/complex_dupe.csv')
devils_lake = df[df['COMPLEX_NAME'] == 'DEVILS LAKE WMD']

devils_lake.head(3)

In [None]:
# We can eliminate exact row duplicates
devils_lake.drop_duplicates().head(3)

In [None]:
# Or for a column or set of specific columns
devils_lake.drop_duplicates(subset=['COMPLEX_NAME', 'ORGANIZATION_NAME']).head(3)
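
Before dropping anything, it can be worth counting the duplicates and deciding which copy to keep. This sketch uses an invented four-row frame; `duplicated()` flags repeats and the `keep` argument controls which occurrence survives.

```python
import pandas as pd

# Made-up frame: rows 0 and 1 are exact duplicates, row 3 repeats only the name.
df_demo = pd.DataFrame({
    'COMPLEX_NAME': ['A', 'A', 'B', 'A'],
    'ACRES': [10, 10, 5, 12],
})

# Boolean mask marking every exact repeat of an earlier row.
dupe_mask = df_demo.duplicated()
print(dupe_mask.sum())

# Keep the last occurrence per name instead of the default first.
latest = df_demo.drop_duplicates(subset=['COMPLEX_NAME'], keep='last')
print(latest)
```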

### Unnecessary data

Don't load what you don't need.

In [None]:
# Load data
df = pd.read_csv('./data/complex_dupe.csv', usecols=['OBJECTID','COMPLEX_NAME', 'ORGANIZATION_NAME'])
df.head()

### Random Garbage

Sometimes you have garbage values for no particular reason. You can manually pull those out, or you can use convenience functions.

In [None]:
df = pd.read_csv('./data/weather.csv')
df.head()

In [None]:
df.head().replace(to_replace='ツ', value='._.', regex=True)

In [None]:
# You can also find the lines for specific garbage using this idiom
df[(df == r'¯\_(ツ)_/¯').any(axis=1)]

# # NOT REQUIRED, but parsing this out ...
# # Get whether each cell is equal to the value
# eq_to_df = df == r'¯\_(ツ)_/¯'
# # Get whether each row contains a cell equal to the value
# row_contains_bool = eq_to_df.any(axis=1)
# # Get those rows where the condition is true.
# df_containing_bad_vals = df[row_contains_bool]
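
If you already know the garbage token, one option is to neutralize it at load time with `na_values`, so the column keeps a numeric dtype. This is a sketch on a made-up two-row sample, not the weather file.

```python
import io

import pandas as pd

# The garbage token, kept as a raw string so the backslash is literal.
GARBAGE = r'¯\_(ツ)_/¯'

# Made-up sample where one temperature reading is garbage.
raw = 'city,temp\nFargo,71\nMinot,' + GARBAGE + '\n'

# Treat the token as missing while parsing.
df = pd.read_csv(io.StringIO(raw), na_values=[GARBAGE])
print(df)
print(df.dtypes)
```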

# Absurdly large files

Python is generally pretty efficient when it comes to packing a lot of data into a very small space. That said, RAM can be a limitation. If you cannot fit your file into RAM, you have a variety of options.

### Chunking (preferred)

You can break off bits of a dataframe at a time using the "chunksize" option.

In [None]:
chunk_count = 0

for chunk in pd.read_csv('./data/weather.csv', chunksize=100):
    chunk_count += 1
    print('The first cell in chunk {} is {}'.format(chunk_count, chunk.iloc[0,0]))

chunk.head()
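
Chunking pays off when you fold each chunk into a running result rather than keeping the chunks around. A minimal sketch, assuming a made-up single-column sample, that computes a mean without ever holding the whole file:

```python
import io

import pandas as pd

# Made-up sample: a header plus the integers 0 through 9.
raw = 'temp\n' + '\n'.join(str(n) for n in range(10)) + '\n'

total = 0
rows = 0

# Only one chunk of at most 4 rows is in memory at a time.
for chunk in pd.read_csv(io.StringIO(raw), chunksize=4):
    total += chunk['temp'].sum()
    rows += len(chunk)

print(total / rows)
```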

### Offset / nrows

You can read a limited number of rows, optionally starting at an offset.

In [None]:
# Skip data rows 1-300 but keep the header (line 0), then read the next 100 rows.
pd.read_csv('./data/weather.csv', nrows=100, skiprows=range(1, 301))
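
One subtlety with offsets: an integer `skiprows` counts from the very first line, so it swallows the header too. The sketch below, on a made-up sample, contrasts that with passing a range that starts at line 1.

```python
import io

import pandas as pd

# Made-up sample: header 'n' followed by the integers 0 through 9.
raw = 'n\n' + '\n'.join(str(n) for n in range(10)) + '\n'

# An integer skips the header as well, so a data row becomes the header.
naive = pd.read_csv(io.StringIO(raw), skiprows=3, nrows=3)

# A range starting at 1 leaves line 0 (the header) in place.
window = pd.read_csv(io.StringIO(raw), skiprows=range(1, 4), nrows=3)

print(list(naive.columns), list(window.columns))
print(window)
```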

### Out-of-core

There are a variety of other tactics, such as using third-party libraries for lazy evaluation (dask), using disk or databases as a cache, or key ranges.
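
As one possible take on the "database as cache" idea, you can stream chunks into SQLite and let the database do the aggregation. This is a sketch under assumptions: the table name `weather` and the sample data are invented, and an in-memory database stands in for what would normally be a file on disk.

```python
import io
import sqlite3

import pandas as pd

# Made-up sample: six readings alternating between two stations.
raw = 'station,temp\n' + ''.join(
    'S{},{}\n'.format(n % 2, n) for n in range(6)
)

# ':memory:' keeps the demo self-contained; a file path would cache on disk.
conn = sqlite3.connect(':memory:')

# Append each small chunk to the table instead of holding the full frame.
for chunk in pd.read_csv(io.StringIO(raw), chunksize=2):
    chunk.to_sql('weather', conn, if_exists='append', index=False)

# Let the database aggregate, and pull back only the small result.
query = (
    'SELECT station, AVG(temp) AS avg_temp '
    'FROM weather GROUP BY station ORDER BY station'
)
result = pd.read_sql(query, conn)
conn.close()

print(result)
```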

# Exploring

There are books devoted to exploratory data analysis (EDA). Here are some quick tips you might use.

In [None]:
df = pd.read_excel('./data/greenway_analysis.xlsx')
df.head(3)

In [None]:
# Remember, the return values are regular Python objects, so you can do whatever.
greenway_names = df['main_greenway'].unique()
path_names = [item for item in greenway_names if 'path' in item.lower()]
path_names

In [None]:
# Datatypes: You can check a dataframe or series dtype with .dtypes
df.dtypes.head(5)

In [None]:
# Describe will give you information about the numeric types.
df.describe()

In [None]:
# Alternatively, you can get these individually (numeric_only skips text columns).
df.median(numeric_only=True)

In [None]:
# Or examine the component column series individually
df['age']

In [None]:
# Info will give you information about the values stored.
df.info()

In [None]:
# You can use value counts to examine columns.
df['greenway'].value_counts().head()
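
Two `value_counts()` options are handy during exploration: `dropna=False` to surface missing values and `normalize=True` to get shares instead of counts. A short sketch on an invented series:

```python
import pandas as pd

# Made-up series with one missing value.
s = pd.Series(['trail', 'path', 'trail', None])

# Include missing values in the tally.
counts = s.value_counts(dropna=False)

# Shares of the non-missing values.
shares = s.value_counts(normalize=True)

print(counts)
print(shares)
```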

In [None]:
# You can get correlation, etc.
df = pd.read_fwf(IRIS_FWF_PATH, colspecs='infer', widths=None)
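# numeric_only=True leaves out text columns, which newer pandas will not skip silently.
df.corr(numeric_only=True)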

In [None]:
# Which as an aside looks way cooler as a scatter matrix
import matplotlib
%matplotlib inline
pd.plotting.scatter_matrix(df, figsize=(7,7), diagonal='kde');

# Additional Learning Resources

* ### [10 Minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/10min.html)
* ### [Pandas Tutorials](https://pandas.pydata.org/pandas-docs/stable/tutorials.html)
* ### [Pandas Cookbook](https://pandas.pydata.org/pandas-docs/stable/tutorials.html#pandas-cookbook)
* ### [Pandas Guide for New Users](https://pandas.pydata.org/pandas-docs/stable/tutorials.html#lessons-for-new-pandas-users)
* ### [Pandas Intro to Data Structures](https://pandas.pydata.org/pandas-docs/stable/dsintro.html)

---

# Next Up: [Missing Data](3_missing_data.ipynb)

<br>

<img style="margin-left: 0;" src="static/empty_set.png">

<br>

<div align='left'>
    Image courtesy of <a href='https://commons.wikimedia.org/wiki/File:Empty_set.svg'>Octahedron80</a>, released into the public domain
</div>

---