In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from numpy import nan as NaN  # the NaN alias was removed in NumPy 2.0
from glob import glob
import re

In [None]:
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 300)
pd.set_option('display.expand_frame_repr', True)

### Data Files Location

* Most data files for the exercises can be found on the [course site](https://www.datacamp.com/courses/manipulating-dataframes-with-pandas)
    * [Olympic medals](https://assets.datacamp.com/production/repositories/502/datasets/bf22326ecc9171f68796ad805a7c1135288120b6/all_medalists.csv)
    * [Gapminder](https://assets.datacamp.com/production/repositories/502/datasets/09378cc53faec573bcb802dce03b01318108a880/gapminder_tidy.csv)
    * [2012 US election results (Pennsylvania)](https://assets.datacamp.com/production/repositories/502/datasets/502f4eedaf44ad1c94b3595c7691746f282e0b0a/pennsylvania2012_turnout.csv)
    * [Pittsburgh weather data](https://assets.datacamp.com/production/repositories/502/datasets/6c4984cb81ea50971c1660434cc4535a6669a848/pittsburgh2013.csv)
    * [Sales](https://assets.datacamp.com/production/repositories/502/datasets/4c6d3be9e8640e2d013298230c415d3a2a2162d4/sales.zip)
    * [Titanic](https://assets.datacamp.com/production/repositories/502/datasets/e280ed94bf4539afb57d8b1cbcc14bcf660d3c63/titanic.csv)
    * [Users](https://assets.datacamp.com/production/repositories/502/datasets/eaf29468b9fbaad454a74d3c2b59b36e5ab4558b/users.csv)
* Other data files may be found in my [DataCamp repository](https://github.com/trenton3983/DataCamp/tree/master/data)

### Data File Objects

In [None]:
election_penn = 'data/manipulating-dataframes-with-pandas/2012_US_election_results_(Pennsylvania).csv'
gapminder = 'data/manipulating-dataframes-with-pandas/gapminder.csv'
medals = 'data/manipulating-dataframes-with-pandas/olympic_medals.csv'
weather = 'data/manipulating-dataframes-with-pandas/Pittsburgh_weather_data.csv'
sales = 'data/manipulating-dataframes-with-pandas/sales.csv'
sales_feb = 'data/manipulating-dataframes-with-pandas/sales-feb-2015.csv'
titanic = 'data/manipulating-dataframes-with-pandas/titanics.csv'
users = 'data/manipulating-dataframes-with-pandas/users.csv'

# Manipulating DataFrames with pandas

***Course Description***

In this course, you'll learn how to leverage pandas' extremely powerful data manipulation engine to get the most out of your data. It is important to be able to extract, filter, and transform data from DataFrames in order to drill into the data that really matters. The pandas library has many techniques that make this process efficient and intuitive. You will learn how to tidy, rearrange, and restructure your data by pivoting or melting and stacking or unstacking DataFrames. These are all fundamental next steps on the road to becoming a well-rounded Data Scientist, and you will have the chance to apply all the concepts you learn to real-world datasets.

### What You'll Learn

* Extracting, filtering, and transforming data from DataFrames
* Advanced indexing with multiple levels
* Tidying, rearranging and restructuring your data
* Pivoting, melting, and stacking DataFrames
* Identifying and splitting DataFrames by groups

## Extracting and transforming data

In this chapter, you will learn all about how to index, slice, filter, and transform DataFrames, using a variety of datasets, ranging from 2012 US election data for the state of Pennsylvania to Pittsburgh weather data.

### Indexing DataFrames

#### A simple DataFrame

In [None]:
df = pd.read_csv(sales, index_col='month')
df

#### Indexing using square brackets

In [None]:
df['salt']['Jan']

#### Using column attribute and row label

In [None]:
df.eggs['Mar']

#### Accessors

* Accessors are a more efficient and more programmatically reusable way to access data in a DataFrame
    * .loc - accesses using labels
    * .iloc - accesses using integer positions
* Both accessors share the same syntax: left bracket, row specifier, comma, column specifier, right bracket

##### Using the .loc accessor

In [None]:
df.loc['May', 'spam']

##### Using the .iloc accessor

In [None]:
df.iloc[4, 2]
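
The two accessors can be compared side by side on a small stand-in table. This is only a sketch: the values in `toy` are made up for illustration and are not taken from `sales.csv`; only the column names and the month index follow the cells above.

```python
import pandas as pd

# Hypothetical stand-in for the sales data; the numbers are illustrative only
toy = pd.DataFrame(
    {'eggs': [47, 110, 221, 77, 132, 205],
     'salt': [12.0, 50.0, 89.0, 87.0, 40.0, 60.0],
     'spam': [17, 31, 72, 20, 52, 55]},
    index=pd.Index(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'], name='month'),
)

by_label = toy.loc['May', 'spam']   # row label, column label
by_position = toy.iloc[4, 2]        # row position 4, column position 2

# Both expressions address the same cell
print(by_label == by_position)
```

Label-based access keeps working if rows or columns are reordered, whereas positional access depends on the current layout of the table.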

#### Selecting only some columns

* When using bracket-indexing without the .loc or .iloc accessors, the result can be an individual value, a pandas Series, or a pandas DataFrame.
* To ensure the return value is a DataFrame, pass a list of column names inside the square brackets.

In [None]:
df_new = df[['salt','eggs']]
df_new

### Exercises

#### Index ordering

In this exercise, the DataFrame ***election*** is provided for you. It contains the 2012 US election results for the state of Pennsylvania with county names as row indices. Your job is to select ***'Bedford'*** county and the ***'winner'*** column. Which method is the preferred way?

In [None]:
election = pd.read_csv(election_penn, index_col='county')
election.head()

In [None]:
election.loc['Bedford', 'winner']

#### Positional and labeled indexing

Given a pair of label-based indices, sometimes it's necessary to find the corresponding positions. In this exercise, you will use the Pennsylvania election results again. The DataFrame is provided for you as ***election***.

Find ***x*** and ***y*** such that ***election.iloc[x, y] == election.loc['Bedford', 'winner']***. That is, what is the row position of ***'Bedford'***, and the column position of ***'winner'***? Remember that the first position in Python is 0, not 1!

To answer this question, first explore the DataFrame using ***election.head()*** in the IPython Shell and inspect it with your eyes.

***Instructions***

* Explore the DataFrame in the IPython Shell using ***election.head()***.
* Assign the row position of ***election.loc['Bedford']*** to ***x***.
* Assign the column position of ***election['winner']*** to ***y***.
* Hit 'Submit Answer' to print the boolean equivalence of the ***.loc*** and ***.iloc*** selections.

In [None]:
# Assign the row position of election.loc['Bedford']: x
x = 4

# Assign the column position of election['winner']: y
y = 4

# Print the boolean equivalence
print(election.iloc[x, y] == election.loc['Bedford', 'winner'])

***Depending on the situation, you may wish to use .iloc[] over .loc[], and vice versa. The important thing to realize is you can achieve the exact same results using either approach.***
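
Instead of reading the positions off `election.head()` by eye, they can also be looked up programmatically. The sketch below uses `Index.get_loc`, which returns the integer position of a label when the label is unique. The table `toy_election` is a hypothetical miniature with made-up values, not the real election data.

```python
import pandas as pd

# Hypothetical miniature of the election table; all values are made up
toy_election = pd.DataFrame(
    {'state': ['PA', 'PA', 'PA'],
     'total': [100, 200, 300],
     'winner': ['Romney', 'Obama', 'Romney']},
    index=pd.Index(['Adams', 'Allegheny', 'Bedford'], name='county'),
)

x = toy_election.index.get_loc('Bedford')    # row position of the label
y = toy_election.columns.get_loc('winner')   # column position of the label

# The positional lookup matches the label-based lookup
print(toy_election.iloc[x, y] == toy_election.loc['Bedford', 'winner'])
```

The same two `get_loc` calls should work on the real `election` DataFrame, provided the county names in its index are unique.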

#### Indexing and column rearrangement

There are circumstances in which it's useful to modify the order of your DataFrame columns. We do that now by extracting just three columns from the Pennsylvania election results DataFrame.

Your job is to read the CSV file and set the index to ***'county'***. You'll then assign a new DataFrame by selecting the list of columns ***['winner', 'total', 'voters']***. The CSV file is provided to you in the variable ***filename***.

***Instructions***

* Import pandas as pd.
* Read in ***filename*** using ***pd.read_csv()*** and set the index to ***'county'*** by specifying the ***index_col*** parameter.
* Create a separate DataFrame ***results*** with the columns ***['winner', 'total', 'voters']***.
* Print the output using ***results.head()***. This has been done for you, so hit 'Submit Answer' to see the new DataFrame!


In [None]:
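# In this notebook, the path is defined under "Data File Objects"
filename = election_penn

# Read in filename and set the index: election
election = pd.read_csv(filename, index_col='county')

# Create a separate dataframe with the columns ['winner', 'total', 'voters']: results
results = election[['winner', 'total', 'voters']]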

# Print the output of results.head()
print(results.head())

### Slicing DataFrames

#### sales DataFrame

In [None]:
df = pd.read_csv(sales, index_col='month')
df

#### Selecting a column (i.e., Series)

In [None]:
df['eggs']

In [None]:
type(df.eggs)

#### Slicing and indexing a Series

In [None]:
df['eggs'][1:4] # Part of the eggs column

In [None]:
df['eggs'].iloc[4] # The value associated with May (explicit positional access)

#### Using .loc[] (1)

In [None]:
df.loc[:, 'eggs':'salt'] # All rows, some columns

#### Using .loc[] (2)

In [None]:
df.loc['Jan':'Apr',:] # Some rows, all columns

#### Using .loc[] (3)

In [None]:
df.loc['Mar':'May', 'salt':'spam']

#### Using .iloc[]

In [None]:
df.iloc[2:5, 1:] # A block from middle of the DataFrame
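
One difference between the two accessors is easy to miss in the slices above: a `.loc` label slice includes its end label, while an `.iloc` position slice stops before its end position, like ordinary Python slicing. A minimal sketch on a made-up table (the values in `toy` are illustrative, not from `sales.csv`):

```python
import pandas as pd

# Hypothetical stand-in table; values are illustrative only
toy = pd.DataFrame(
    {'eggs': [47, 110, 221, 77], 'salt': [12.0, 50.0, 89.0, 87.0]},
    index=['Jan', 'Feb', 'Mar', 'Apr'],
)

label_slice = toy.loc['Jan':'Mar', :]   # includes the 'Mar' row
position_slice = toy.iloc[0:2, :]       # stops before position 2

print(list(label_slice.index))
print(list(position_slice.index))
```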

#### Using lists rather than slices (1)

In [None]:
df.loc['Jan':'May', ['eggs', 'spam']]

#### Using lists rather than slices (2)

In [None]:
df.iloc[[0,4,5], 0:2]

#### Series versus 1-column DataFrame

In [None]:
# A Series by column name
df['eggs']

In [None]:
type(df['eggs'])

In [None]:
# A DataFrame w/ single column
df[['eggs']]

In [None]:
type(df[['eggs']])

### Filtering DataFrames

#### Data

In [None]:
df = pd.read_csv(sales, index_col='month')
df

#### Creating a Boolean Series

In [None]:
df.salt > 60

#### Filtering with a Boolean Series

In [None]:
df[df.salt > 60]

In [None]:
enough_salt_sold = df.salt > 60

In [None]:
df[enough_salt_sold]

#### Combining filters

In [None]:
df[(df.salt >= 50) & (df.eggs < 200)] # Both conditions

In [None]:
df[(df.salt >= 50) | (df.eggs < 200)] # Either condition
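
The parentheses around each condition matter: `&` and `|` bind more tightly than comparison operators, so each comparison has to be evaluated first. A small sketch on made-up data (the values in `toy` are illustrative only) that also shows `~` for negating a filter:

```python
import pandas as pd

# Hypothetical stand-in table; values are illustrative only
toy = pd.DataFrame(
    {'eggs': [47, 110, 221, 77, 250],
     'salt': [12.0, 50.0, 89.0, 87.0, 30.0]},
    index=['Jan', 'Feb', 'Mar', 'Apr', 'May'],
)

salty = toy.salt >= 50
few_eggs = toy.eggs < 200

both = toy[salty & few_eggs]          # rows meeting both conditions
either = toy[salty | few_eggs]        # rows meeting at least one condition
neither = toy[~(salty | few_eggs)]    # rows meeting none of them

print(list(both.index))
print(list(either.index))
print(list(neither.index))
```

Naming the individual masks first, as with `enough_salt_sold` earlier, keeps compound filters readable and avoids the precedence issue altogether.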

#### DataFrames with zeros and NaNs

In [None]:
df2 = df.copy()

In [None]:
df2['bacon'] = [0, 0, 50, 60, 70, 80]

In [None]:
df2

#### Select columns with all nonzeros

In [None]:
df2.loc[:, df2.all()]

#### Select columns with any nonzeros

In [None]:
df2.loc[:, df2.any()]

#### Select columns with any NaNs

In [None]:
df.loc[:, df.isnull().any()]

#### Select columns without NaNs

In [None]:
df.loc[:, df.notnull().all()]

#### Drop rows with any NaNs

In [None]:
df.dropna(how='any')
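
The column selections above all follow one pattern: a reduction such as `.all()` or `.any()` produces one Boolean per column, and `.loc[:, ...]` keeps the columns marked `True`. The sketch below replays them on a made-up table; the values and the `zeros` column are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical table with zeros and one NaN; values are illustrative only
toy = pd.DataFrame(
    {'eggs': [47, 110, 221],
     'salt': [12.0, np.nan, 89.0],
     'bacon': [0, 0, 50],
     'zeros': [0, 0, 0]},
    index=['Jan', 'Feb', 'Mar'],
)

all_nonzero = toy.loc[:, toy.all()]             # no zero entries (NaN is skipped)
any_nonzero = toy.loc[:, toy.any()]             # at least one nonzero entry
with_nan = toy.loc[:, toy.isnull().any()]       # at least one NaN
without_nan = toy.loc[:, toy.notnull().all()]   # no NaN at all
complete_rows = toy.dropna(how='any')           # rows with no NaN

print(list(all_nonzero.columns))
print(list(any_nonzero.columns))
print(list(with_nan.columns))
print(list(without_nan.columns))
print(list(complete_rows.index))
```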

#### Filtering a column based on another

In [None]:
df.eggs[df.salt > 55]

#### Modifying a column based on another

In [None]:
df.loc[df.salt > 55, 'eggs'] += 5  # a single .loc call avoids chained assignment

In [None]:
df
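
A conditional update is most reliable when the row mask and the column label go into a single `.loc` call, because chained indexing may end up modifying a temporary copy rather than the DataFrame itself. A minimal sketch on made-up data (the values in `toy` are illustrative only):

```python
import pandas as pd

# Hypothetical stand-in table; values are illustrative only
toy = pd.DataFrame(
    {'eggs': [47, 110, 221], 'salt': [12.0, 60.0, 89.0]},
    index=['Jan', 'Feb', 'Mar'],
)

# Select rows by the Boolean mask and the column by label, then update in place
toy.loc[toy.salt > 55, 'eggs'] += 5

print(toy['eggs'].tolist())
```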

### Transforming DataFrames

## Advanced Indexing

Having learned the fundamentals of working with DataFrames, you will now move on to more advanced indexing techniques. You will learn about MultiIndexes, or hierarchical indexes, and learn how to interact with and extract data from them.

### Index objects and labeled data

### Hierarchical indexing

## Rearranging and reshaping data

Here, you will learn how to reshape your DataFrames using techniques such as pivoting, melting, stacking, and unstacking. These are powerful techniques that allow you to tidy and rearrange your data into the format that allows you to most easily analyze it for insights.

### Pivoting DataFrames

### Stacking & unstacking DataFrames

### Melting DataFrames

### Pivot tables

## Grouping data

In this chapter, you'll learn how to identify and split DataFrames by groups or categories for further aggregation or analysis. You'll also learn how to transform and filter your data, including how to detect outliers and impute missing values. Knowing how to effectively group data in pandas can be a seriously powerful addition to your data science toolbox.

### Categorical and groupby

### Groupby and aggregation

### Groupby and transformation

### Groupby and filtering

## Bringing it all together

Here, you will bring together everything you have learned in this course while working with data recorded from the Summer Olympic games that goes as far back as 1896! This is a rich dataset that will allow you to fully apply the data manipulation techniques you have learned. You will pivot, unstack, group, slice, and reshape your data as you explore this dataset and uncover some truly fascinating insights. Enjoy!

### Case Study - Summer Olympics

### Understanding the column labels

### Constructing alternative country rankings

### Reshaping DataFrames for visualization