In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from numpy import nan as NaN  # numpy.NaN was removed in NumPy 2.0; the alias keeps the old name usable
from glob import glob
import re

In [None]:
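# Use fully qualified option names; the short forms can be ambiguous in newer pandas releases
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 300)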
pd.set_option('display.expand_frame_repr', True)

### Data Files Location

* Most data files for the exercises can be found [here](https://www.datacamp.com/courses/manipulating-dataframes-with-pandas)
    * [Olympic medals](https://assets.datacamp.com/production/repositories/502/datasets/bf22326ecc9171f68796ad805a7c1135288120b6/all_medalists.csv)
    * [Gapminder](https://assets.datacamp.com/production/repositories/502/datasets/09378cc53faec573bcb802dce03b01318108a880/gapminder_tidy.csv)
    * [2012 US election results (Pennsylvania)](https://assets.datacamp.com/production/repositories/502/datasets/502f4eedaf44ad1c94b3595c7691746f282e0b0a/pennsylvania2012_turnout.csv)
    * [Pittsburgh weather data](https://assets.datacamp.com/production/repositories/502/datasets/6c4984cb81ea50971c1660434cc4535a6669a848/pittsburgh2013.csv)
    * [Sales](https://assets.datacamp.com/production/repositories/502/datasets/4c6d3be9e8640e2d013298230c415d3a2a2162d4/sales.zip)
    * [Titanic](https://assets.datacamp.com/production/repositories/502/datasets/e280ed94bf4539afb57d8b1cbcc14bcf660d3c63/titanic.csv)
    * [Users](https://assets.datacamp.com/production/repositories/502/datasets/eaf29468b9fbaad454a74d3c2b59b36e5ab4558b/users.csv)
* Other data files may be found in my [DataCamp repository](https://github.com/trenton3983/DataCamp/tree/master/data)

### Data File Objects

In [None]:
election_penn = 'data/manipulating-dataframes-with-pandas/2012_US_election_results_(Pennsylvania).csv'
gapminder = 'data/manipulating-dataframes-with-pandas/gapminder.csv'
medals = 'data/manipulating-dataframes-with-pandas/olympic_medals.csv'
weather = 'data/manipulating-dataframes-with-pandas/Pittsburgh_weather_data.csv'
sales = 'data/manipulating-dataframes-with-pandas/sales.csv'
sales_feb = 'data/manipulating-dataframes-with-pandas/sales-feb-2015.csv'
titanic = 'data/manipulating-dataframes-with-pandas/titanics.csv'
users = 'data/manipulating-dataframes-with-pandas/users.csv'

# Manipulating DataFrames with pandas

### What You'll Learn

* Extracting, filtering, and transforming data from DataFrames
* Advanced indexing with multiple levels
* Tidying, rearranging and restructuring your data
* Pivoting, melting, and stacking DataFrames
* Identifying and splitting DataFrames by groups

## Extracting and transforming data

### Indexing DataFrames

#### A simple DataFrame

In [None]:
df = pd.read_csv(sales, index_col='month')
df

#### Indexing using square brackets

In [None]:
df['salt']['Jan']

#### Using column attribute and row label

In [None]:
df.eggs['Mar']

#### Accessors

* A more efficient and more programmatically reusable method of accessing data in a DataFrame is by using accessors
    * .loc - accesses using labels
    * .iloc - accesses using index positions
* Both accessors share the same syntax: `[row specifier, column specifier]`
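As a minimal sketch of the two accessors side by side, the snippet below builds a small stand-in DataFrame. The name `demo` and all of its values are illustrative assumptions, not the contents of the course's sales file.

```python
import pandas as pd

# Hypothetical stand-in for the sales data; values are illustrative only.
demo = pd.DataFrame(
    {'eggs': [47, 110, 221], 'salt': [12.0, 50.0, 89.0], 'spam': [17, 31, 72]},
    index=pd.Index(['Jan', 'Feb', 'Mar'], name='month'),
)

# .loc selects by label: row 'Feb', column 'spam'
by_label = demo.loc['Feb', 'spam']

# .iloc selects by integer position: second row (1), third column (2)
by_position = demo.iloc[1, 2]

print(by_label, by_position)
```

Both lookups address the same cell, so they return the same value; only the way the cell is named differs.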

##### Using the .loc accessor

In [None]:
df.loc['May', 'spam']

##### Using the .iloc accessor

In [None]:
df.iloc[4, 2]

#### Selecting only some columns

* When using bracket-indexing without the .loc or .iloc accessors, the result can be an individual value, a pandas Series, or a pandas DataFrame.
* To ensure the return value is a DataFrame, pass a list of column labels inside the square brackets (a nested list).

In [None]:
df_new = df[['salt','eggs']]
df_new
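To make the difference in return types concrete, here is a small sketch on a made-up frame (again, `demo` and its values are assumptions for illustration only). It shows the three possible results of bracket-indexing.

```python
import pandas as pd

# Hypothetical two-row frame; values are illustrative only.
demo = pd.DataFrame({'eggs': [47, 110], 'salt': [12.0, 50.0]}, index=['Jan', 'Feb'])

single = demo['salt']        # one column label -> Series
nested = demo[['salt']]      # list holding one label -> one-column DataFrame
value = demo['salt']['Jan']  # chained label lookup -> individual value

print(type(single).__name__, type(nested).__name__, value)
```

The nested-list form is useful when downstream code expects DataFrame methods or a two-dimensional shape, even if only one column is selected.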

### Exercises

#### Index ordering

In this exercise, the DataFrame ***election*** is provided for you. It contains the 2012 US election results for the state of Pennsylvania with county names as row indices. Your job is to select ***'Bedford'*** county and the ***'winner'*** column. Which method is the preferred way?

In [None]:
election = pd.read_csv(election_penn, index_col='county')
election.head()

In [None]:
election.loc['Bedford', 'winner']

#### Positional and labeled indexing

Given a pair of label-based indices, sometimes it's necessary to find the corresponding positions. In this exercise, you will use the Pennsylvania election results again. The DataFrame is provided for you as ***election***.

Find ***x*** and ***y*** such that ***election.iloc[x, y] == election.loc['Bedford', 'winner']***. That is, what is the row position of ***'Bedford'***, and the column position of ***'winner'***? Remember that the first position in Python is 0, not 1!

To answer this question, first explore the DataFrame using ***election.head()*** in the IPython Shell and inspect it visually.

***Instructions***

* Explore the DataFrame in the IPython Shell using ***election.head()***.
* Assign the row position of ***election.loc['Bedford']*** to ***x***.
* Assign the column position of ***election['winner']*** to ***y***.
* Hit 'Submit Answer' to print the boolean equivalence of the ***.loc*** and ***.iloc*** selections.

In [None]:
# Assign the row position of election.loc['Bedford']: x
x = 4

# Assign the column position of election['winner']: y
y = 4

# Print the boolean equivalence
print(election.iloc[x, y] == election.loc['Bedford', 'winner'])

***Depending on the situation, you may wish to use .iloc[] over .loc[], and vice versa. The important thing to realize is you can achieve the exact same results using either approach.***
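The exercise finds the positions by inspection. As an alternative sketch, the positions can also be looked up programmatically with `Index.get_loc`. The frame below is a hypothetical miniature of the election table: only the `'Bedford'` label and the `'winner'` column come from the exercise, while the other county names, the `total` column, and all values are made up.

```python
import pandas as pd

# Hypothetical miniature of the election table; only 'Bedford' and 'winner'
# come from the exercise, everything else is illustrative.
demo = pd.DataFrame(
    {'total': [100, 200, 300], 'winner': ['A', 'B', 'A']},
    index=pd.Index(['Adams', 'Bedford', 'Centre'], name='county'),
)

# get_loc returns an integer position when the label is unique
x = demo.index.get_loc('Bedford')
y = demo.columns.get_loc('winner')

print(x, y)
print(demo.iloc[x, y] == demo.loc['Bedford', 'winner'])
```

Note that `get_loc` returns a single integer only for unique labels; with duplicated labels it may return a slice or a boolean mask instead.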

### Slicing DataFrames

### Filtering DataFrames

### Transforming DataFrames

## Advanced Indexing

### Index objects and labeled data

### Hierarchical indexing

## Rearranging and reshaping data

### Pivoting DataFrames

### Stacking & unstacking DataFrames

### Melting DataFrames

### Pivot tables

## Grouping data

### Categorical and groupby

### Groupby and aggregation

### Groupby and transformation

### Groupby and filtering

## Bringing it all together

### Case Study - Summer Olympics

### Understanding the column labels

### Constructing alternative country rankings

### Reshaping DataFrames for visualization