<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# `pandas` Data Munging Overview: Part 1

_Authors: Joseph Nelson (DC)_

---

**Warning: This is a resource-heavy notebook that can consume a lot of RAM, especially when it's run in Chrome. For this lesson, you may want to close idle applications and/or open this notebook with Safari.**

### Lesson Guide
- [The Basics of `pandas` DataFrames](#basics)
    - [Loading Data](#loading)
    - [A Basic Examination of DataFrames](#examine)
    - [Selecting Columns](#selecting)
    - [Describing Data](#describing)
- [Exercise #1](#exercise-1)
- [Filtering and Sorting DataFrames](#filtering-sorting)
    - [Boolean Filtering](#filtering)
    - [Sorting](#sorting)
- [Exercise #2](#exercise-2)
- [Renaming, Adding, and Removing Columns](#columns)
    - [Renaming Columns](#renaming-columns)
    - [Adding Columns](#adding-columns)
    - [Removing Columns](#removing-columns)
- [Handling Missing Values](#missing)
    - [Finding Missing Values](#find-missing)
    - [Dropping Missing Values](#drop-missing)
    - [Filling in Missing Values](#fill-missing)


<a id='basics'></a>

## The Basics of `pandas` DataFrames

---

In [1]:
import pandas as pd

<a id='loading'></a>
### Loading Data

**Q.1** You can read in a file either from your local computer or directly from a URL.

```Python
# Local:
users = pd.read_table('../datasets/users.txt')

# Remote:
users = pd.read_table('https://git.generalassemb.ly/dsi-unit-2/pandas-data_munging_full_overview-lesson/tree/master/datasets/users.txt')
```

Read in the data using the method you prefer.

In [2]:
# A:
users = pd.read_table('../datasets/users.txt')

In [None]:
# A:

**Q.2** Use keyword arguments (kwargs) to set appropriate data-reading parameters.

In [None]:
# A:
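
As a minimal sketch of passing keyword arguments to a reader, the example below parses a small in-memory, pipe-delimited sample that has no header row. The sample rows, the `sample_users` name, and the use of `io.StringIO` are assumptions for illustration; the actual `users.txt` file may use a different delimiter or already include a header.

```python
import io

import pandas as pd

# Hypothetical sample standing in for a delimited text file with no header row.
sample = io.StringIO("1|24|M|technician|85711\n2|53|F|other|94043\n")

# `sep` sets the delimiter, `header=None` marks the first line as data,
# and `names` supplies the column labels.
cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
sample_users = pd.read_table(sample, sep='|', header=None, names=cols)
print(sample_users)
```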

<a id='examine'></a>
### A Basic Examination of DataFrames

**Q.1** Print the type of `users`.

In [None]:
# A:

**Q.2** Print the first five rows, first 10 rows, and last two rows of `users`.

In [None]:
# A:

In [None]:
# A:

In [None]:
# A:

**Q.3** Print the index and columns.

In [None]:
# A:

**Q.4** Find the dtypes of the columns.

In [None]:
# A:

**Q.5** Find the dimensions of the DataFrame.

In [None]:
# A:

**Q.6** Extract the underlying `numpy` array as a new variable.

In [None]:
# A:
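
The inspection steps above could look roughly like the following on a small stand-in DataFrame. The `demo` frame and its values are hypothetical; the same attributes and methods apply to `users`.

```python
import pandas as pd

# Hypothetical stand-in for `users`.
demo = pd.DataFrame({
    'age': [24, 53, 23, 33],
    'gender': ['M', 'F', 'M', 'F'],
    'occupation': ['technician', 'other', 'writer', 'executive'],
})

print(type(demo))     # the DataFrame class
print(demo.head(2))   # first two rows (the default is five)
print(demo.tail(1))   # last row
print(demo.index)     # row labels
print(demo.columns)   # column labels
print(demo.dtypes)    # one dtype per column
print(demo.shape)     # (number of rows, number of columns)

# Underlying values as a NumPy array; mixed column types produce dtype `object`.
demo_array = demo.to_numpy()
```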

<a id='selecting'></a>
### Selecting Columns

**Q.1** Assign the `gender` column to a variable.

In [None]:
# A:

_Bracket notation (`users['gender']`) is preferred over dot notation (`users.gender`), as column names can contain spaces or periods, or collide with existing DataFrame attributes, any of which breaks dot notation._

**Q.2** What is the type of `gender`?

In [None]:
# A:

**Q.3** Select `gender` and `occupation` as a new DataFrame.

In [None]:
# A:
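
A short sketch of the selection styles discussed in this section, using a hypothetical `demo` frame rather than the lesson's data:

```python
import pandas as pd

# Hypothetical stand-in for `users`.
demo = pd.DataFrame({
    'age': [24, 53, 23],
    'gender': ['M', 'F', 'M'],
    'occupation': ['technician', 'other', 'writer'],
})

gender = demo['gender']                   # bracket notation returns a Series
same_gender = demo.gender                 # dot notation works only for simple names
subset = demo[['gender', 'occupation']]   # a list of names returns a DataFrame

print(type(gender), type(subset))
```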

<a id='describing'></a>
### Describing Data

**Q.1** Calculate the descriptive statistics for the numeric columns in the DataFrame (_the default behavior of `describe()`_).

In [None]:
# A:

**Q.2** Describe the "object" (string) columns.

In [None]:
# A:

**Q.3** Describe all of the columns, regardless of type.

In [None]:
# A:

**Q.4** Describe the `gender` Series from the `users` DataFrame.

In [None]:
# A:

**Q.5** Calculate the mean of the `age` column.

In [None]:
# A:

**Q.6** Calculate the counts of distinct values in the `gender` and `age` columns.

In [None]:
# A:
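
One way to produce these summaries, sketched on a hypothetical `demo` frame. The values are made up, and `include='all'` is shown as one option for covering non-numeric columns.

```python
import pandas as pd

# Hypothetical stand-in for `users`.
demo = pd.DataFrame({
    'age': [24, 53, 23, 33],
    'gender': ['M', 'F', 'M', 'F'],
})

numeric_summary = demo.describe()             # numeric columns only by default
full_summary = demo.describe(include='all')   # every column, regardless of type
gender_summary = demo['gender'].describe()    # count, unique, top, freq for a Series

mean_age = demo['age'].mean()
gender_counts = demo['gender'].value_counts()  # occurrences of each distinct value

print(numeric_summary)
print(gender_counts)
```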

<a id='exercise-1'></a>
## Exercise #1

---

Load the `drinks.csv` data from the remote URL or the local path provided below.

**Perform the following:**
1. Print the head and tail.
2. Look at the index, columns, dtypes, and shape.
3. Assign the `beer_servings` column/Series to a variable.
4. Calculate summary statistics for `beer_servings`.
5. Calculate the median of `beer_servings`.
6. Count the values of unique categories in `continent`.
7. Print the dimensions of the `drinks` DataFrame.
8. Find the first three items of the value counts of the `occupation` column in `users`.

**BONUS:**
1. Create the `users` DataFrame from the `user_file` provided (which lacks a header row).
2. Supply a header: `['user_id', 'age', 'gender', 'occupation', 'zip_code']`.


In [None]:
# Use your preferred file location to read in the data:
remote_drinks_csv = 'https://git.generalassemb.ly/dsi-unit-2/pandas-data_munging_full_overview-lesson/tree/master/datasets/drinks.csv'
local_drinks_csv = 'datasets/drinks.csv'
# and
remote_user_file = 'https://git.generalassemb.ly/dsi-unit-2/pandas-data_munging_full_overview-lesson/tree/master/datasets/users_original.txt'
local_user_file = 'datasets/users_original.txt'

In [None]:
# A:

<a id='filtering-sorting'></a>

## Filtering and Sorting DataFrames

---


<a id='filtering'></a>
### Boolean Filtering

**Q.1** Show the users with `age < 20` using a Boolean mask.

In [None]:
# A:

**Q.2** Calculate the value counts of `occupation` for users with `age < 20`.

In [None]:
# A:

**Q.3** Print the male users with `age < 20`.

In [None]:
# A:

**Q.4** Print the users with `age < 10` or `age > 70`.

In [None]:
# A:
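
The filters above can be sketched as follows on a hypothetical `demo` frame. Each comparison produces a Boolean Series, and conditions are combined with `&` (and) or `|` (or), with each condition wrapped in parentheses.

```python
import pandas as pd

# Hypothetical stand-in for `users`.
demo = pd.DataFrame({
    'age': [7, 18, 25, 73, 19],
    'gender': ['M', 'F', 'M', 'F', 'M'],
    'occupation': ['student', 'student', 'writer', 'retired', 'student'],
})

mask = demo['age'] < 20          # Boolean Series, one value per row
young = demo[mask]               # keeps only rows where the mask is True

young_jobs = young['occupation'].value_counts()
young_men = demo[(demo['age'] < 20) & (demo['gender'] == 'M')]
extremes = demo[(demo['age'] < 10) | (demo['age'] > 70)]

print(young_men)
print(extremes)
```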

<a id='sorting'></a>
### Sorting

**Q.1** Return the `age` column sorted in ascending order.

In [None]:
# A:

**Q.2** Sort the `users` DataFrame by the `age` column (ascending).

In [None]:
# A:

**Q.3** Sort the `users` DataFrame by the `age` column in *descending* order.

In [None]:
# A:
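
A brief sketch of the sorting calls, again on hypothetical data. `sort_values` returns a new object and leaves the original order untouched.

```python
import pandas as pd

# Hypothetical stand-in for `users`.
demo = pd.DataFrame({
    'age': [33, 24, 53],
    'occupation': ['executive', 'technician', 'other'],
})

sorted_ages = demo['age'].sort_values()                  # sorted Series
by_age = demo.sort_values('age')                         # whole frame, ascending
by_age_desc = demo.sort_values('age', ascending=False)   # whole frame, descending

print(by_age_desc)
```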

<a id='exercise-2'></a>

## Exercise #2

---

**Using the `drinks` DataFrame from the previous exercise:**
1. Filter `drinks` to include only European countries.
2. Filter `drinks` to include only European countries with `wine_servings` > 300.
3. Calculate the mean `beer_servings` for all of Europe.
4. Determine which 10 countries have the highest `total_litres_of_pure_alcohol`.

**Using the `users` DataFrame:**
1. Sort `users` by `occupation` and then by `age` in a single command.
2. Filter `users` to only include doctors and lawyers without using a `|`.

> **Hint:** Look up `pandas.Series.isin`.

In [None]:
# A:
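
As a sketch of the two techniques this exercise points to (multi-column sorting and `Series.isin`), here is an example on hypothetical data. It is an illustration of the calls, not the exercise solution for the lesson's files.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the lesson.
demo = pd.DataFrame({
    'occupation': ['doctor', 'lawyer', 'artist', 'doctor'],
    'age': [40, 35, 28, 31],
})

# A list of column names sorts by the first, then breaks ties with the second.
by_job_then_age = demo.sort_values(['occupation', 'age'])

# `isin` tests membership in a list, avoiding chained `|` comparisons.
docs_and_lawyers = demo[demo['occupation'].isin(['doctor', 'lawyer'])]

print(by_job_then_age)
print(docs_and_lawyers)
```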

<a id='columns'></a>

## Renaming, Adding, and Removing Columns

---

<a id='renaming-columns'></a>
### Renaming Columns

**Q.1** Rename `beer_servings` as `beer` and `wine_servings` as `wine` in the `drinks` DataFrame, returning a *new* DataFrame.

In [None]:
# A:

**Q.2** Perform the same renaming for `drinks`, but in place.

In [None]:
# A:

In [None]:
# A:

**Q.3** Replace the column names of `drinks` with `['country', 'beer', 'spirit', 'wine', 'liters', 'continent']`.

In [None]:
# A:
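
The three renaming approaches can be sketched like this. The `demo` frame, its values, and the `nation` label are hypothetical.

```python
import pandas as pd

# Hypothetical values; only the original column names mirror the lesson.
demo = pd.DataFrame({
    'country': ['Xland', 'Yland'],
    'beer_servings': [100, 50],
    'wine_servings': [20, 80],
})
new_names = {'beer_servings': 'beer', 'wine_servings': 'wine'}

renamed = demo.rename(columns=new_names)        # returns a new frame; `demo` is unchanged
demo.rename(columns=new_names, inplace=True)    # modifies `demo` itself and returns None

# Assigning to `.columns` replaces every name at once; the length must match.
demo.columns = ['nation', 'beer', 'wine']

print(renamed.columns.tolist())
print(demo.columns.tolist())
```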

<a id='adding-columns'></a>
### Adding Columns

**Q.1** Make a `servings` column that combines `beer`, `spirit`, and `wine`.

In [None]:
# A:

**Q.2** Make an `mL` column that is the `liters` column multiplied by 1,000.

In [None]:
# A:
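
Assigning to a new column name creates the column. A minimal sketch with hypothetical numbers:

```python
import pandas as pd

# Hypothetical values using the renamed columns from this section.
demo = pd.DataFrame({
    'beer': [100, 50],
    'spirit': [20, 30],
    'wine': [5, 10],
    'liters': [2.5, 1.0],
})

# Arithmetic on Series is element-wise, so each row gets its own total.
demo['servings'] = demo['beer'] + demo['spirit'] + demo['wine']
demo['mL'] = demo['liters'] * 1000

print(demo)
```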

<a id='removing-columns'></a>
### Removing Columns

**Q.1** Remove the `mL` column, returning a new DataFrame.

In [None]:
# A:

**Q.2** Remove the `mL` and `servings` columns from `drinks` in place.

In [None]:
# A:
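
A sketch of both removal styles with `DataFrame.drop`, on a hypothetical frame:

```python
import pandas as pd

# Hypothetical values.
demo = pd.DataFrame({
    'liters': [2.5, 1.0],
    'mL': [2500.0, 1000.0],
    'servings': [125, 90],
})

without_ml = demo.drop(columns='mL')                    # new frame; `demo` keeps `mL`
demo.drop(columns=['mL', 'servings'], inplace=True)     # removes both from `demo`

print(without_ml.columns.tolist())
print(demo.columns.tolist())
```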

<a id='missing'></a>
## Handling Missing Values

---

<a id='find-missing'></a>
### Finding Missing Values

**Q.1** Include missing values from the `continent` variable in the `drinks` DataFrame when counting unique values.

In [None]:
# A:

**Q.2** Create Boolean Series indicating which values of `continent` are missing and which are not.

In [None]:
# A:

**Q.3** Subset to rows in `drinks` where `continent` is missing and where `continent` is not missing.

In [None]:
# A:

**Q.4** Calculate the sum of `drinks`' *columns* and the sum of its *rows*.

In [None]:
# A:

In [None]:
# A:

**Side Note: Adding Booleans**
```python
pd.Series([True, False, True])  # Creates a Boolean Series
pd.Series([True, False, True]).sum()  # Converts `False` to 0 and `True` to 1
```

**Q.5** Find the number of missing values by column in `drinks`.

In [None]:
# A:
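
The steps in this section can be sketched on a small hypothetical frame with one missing `continent`. Because `True` counts as 1 when summed, `isnull().sum()` gives the number of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one missing continent.
demo = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'beer': [10, 20, 30],
    'wine': [1, 2, 3],
    'continent': ['EU', np.nan, 'AS'],
})

counts_with_nan = demo['continent'].value_counts(dropna=False)  # includes NaN
is_missing = demo['continent'].isnull()
not_missing = demo['continent'].notnull()

missing_rows = demo[is_missing]
present_rows = demo[not_missing]

numeric = demo[['beer', 'wine']]
column_sums = numeric.sum()        # one total per column (axis=0 is the default)
row_sums = numeric.sum(axis=1)     # one total per row

missing_per_column = demo.isnull().sum()
print(missing_per_column)
```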

<a id='drop-missing'></a>
### Dropping Missing Values

**Q.1** Drop rows where *ANY* values are missing in `drinks` (returning a new DataFrame).  
_Make sure you know ahead of time exactly what you'll be dropping._

In [None]:
# A:

**Q.2** Drop rows only where *ALL* values are missing in `drinks`.

In [None]:
# A:
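
The difference between `how='any'` and `how='all'` is easiest to see on a frame that has both a partially empty and a fully empty row. The data below is hypothetical.

```python
import numpy as np
import pandas as pd

# Row 0 is complete, row 1 is missing one value, row 2 is missing everything.
demo = pd.DataFrame({
    'beer': [10, np.nan, np.nan],
    'continent': ['EU', 'AS', np.nan],
})

any_dropped = demo.dropna(how='any')   # drops rows 1 and 2
all_dropped = demo.dropna(how='all')   # drops only row 2

print(any_dropped)
print(all_dropped)
```

Both calls return new DataFrames, so `demo` itself still has three rows afterward.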

<a id='fill-missing'></a>
### Filling in Missing Values

What's up with these `NaN` continents?

In [None]:
# A:

_You probably figured it out already, but all of these countries are in North America (`NA`), and when the file was read in, the string `NA` was misinterpreted as a null (`NaN`) value._

**Q.1** Fill in the missing values of the `continent` column using string `NA`.

In [None]:
# A:

**Q.2** Turn off the missing value filter when loading the `drinks` `.csv`.

In [None]:
# A:
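
A minimal sketch of both fixes, using a hypothetical two-row CSV held in memory instead of the lesson's `drinks.csv`. `fillna` repairs the values after loading, while `na_filter=False` stops `read_csv` from converting strings such as `NA` in the first place.

```python
import io

import pandas as pd

# Hypothetical CSV text; `NA` here is meant as the North America code.
csv_text = "country,continent\nCanada,NA\nFrance,EU\n"

default_read = pd.read_csv(io.StringIO(csv_text))      # 'NA' is parsed as NaN
filled = default_read['continent'].fillna('NA')        # restore the string afterward

raw_read = pd.read_csv(io.StringIO(csv_text), na_filter=False)  # 'NA' stays a string

print(default_read['continent'].isnull().sum())
print(raw_read['continent'].tolist())
```

Note that `na_filter=False` disables missing-value detection for every column, so genuinely empty fields will come through as empty strings rather than `NaN`.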