In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from numpy import nan as NaN  # the numpy.NaN alias was removed in NumPy 2.0
from scipy.stats import zscore

In [None]:
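# Use the full option names; the short forms are ambiguous in recent pandas versions
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 300)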
pd.set_option('display.expand_frame_repr', True)

### Data Files Location

* Most data files for the exercises can be found on the [course site](https://www.datacamp.com/courses/manipulating-dataframes-with-pandas)
    * [Olympic medals](https://assets.datacamp.com/production/repositories/502/datasets/bf22326ecc9171f68796ad805a7c1135288120b6/all_medalists.csv)
    * [Gapminder](https://assets.datacamp.com/production/repositories/502/datasets/09378cc53faec573bcb802dce03b01318108a880/gapminder_tidy.csv)
    * [2012 US election results (Pennsylvania)](https://assets.datacamp.com/production/repositories/502/datasets/502f4eedaf44ad1c94b3595c7691746f282e0b0a/pennsylvania2012_turnout.csv)
    * [Pittsburgh weather data](https://assets.datacamp.com/production/repositories/502/datasets/6c4984cb81ea50971c1660434cc4535a6669a848/pittsburgh2013.csv)
    * [Sales](https://assets.datacamp.com/production/repositories/502/datasets/4c6d3be9e8640e2d013298230c415d3a2a2162d4/sales.zip)
    * [Titanic](https://assets.datacamp.com/production/repositories/502/datasets/e280ed94bf4539afb57d8b1cbcc14bcf660d3c63/titanic.csv)
    * [Users](https://assets.datacamp.com/production/repositories/502/datasets/eaf29468b9fbaad454a74d3c2b59b36e5ab4558b/users.csv)
* Other data files may be found in my [DataCamp repository](https://github.com/trenton3983/DataCamp/tree/master/data)

### Data File Objects

In [None]:
election_penn = 'data/manipulating-dataframes-with-pandas/2012_US_election_results_(Pennsylvania).csv'
gapminder = 'data/manipulating-dataframes-with-pandas/gapminder.csv'
medals = 'data/manipulating-dataframes-with-pandas/olympic_medals.csv'
weather_data = 'data/manipulating-dataframes-with-pandas/Pittsburgh_weather_data.csv'
sales = 'data/manipulating-dataframes-with-pandas/sales.csv'
sales_feb = 'data/manipulating-dataframes-with-pandas/sales-feb-2015.csv'
titanic_data = 'data/manipulating-dataframes-with-pandas/titanics.csv'
users = 'data/manipulating-dataframes-with-pandas/users.csv'

# Manipulating DataFrames with pandas

***Course Description***

In this course, you'll learn how to leverage pandas' extremely powerful data manipulation engine to get the most out of your data. It is important to be able to extract, filter, and transform data from DataFrames in order to drill into the data that really matters. The pandas library has many techniques that make this process efficient and intuitive. You will learn how to tidy, rearrange, and restructure your data by pivoting or melting and stacking or unstacking DataFrames. These are all fundamental next steps on the road to becoming a well-rounded Data Scientist, and you will have the chance to apply all the concepts you learn to real-world datasets.

### What You'll Learn

* Extracting, filtering, and transforming data from DataFrames
* Advanced indexing with multiple levels
* Tidying, rearranging and restructuring your data
* Pivoting, melting, and stacking DataFrames
* Identifying and splitting DataFrames by groups

## Extracting and transforming data

In this chapter, you will learn all about how to index, slice, filter, and transform DataFrames, using a variety of datasets, ranging from 2012 US election data for the state of Pennsylvania to Pittsburgh weather data.

### Indexing DataFrames

#### A simple DataFrame

In [None]:
df = pd.read_csv(sales, index_col='month')
df

#### Indexing using square brackets

In [None]:
df['salt']['Jan']

#### Using column attribute and row label

In [None]:
df.eggs['Mar']

#### Accessors

* A more efficient and more programmatically reusable way to access data in a DataFrame is to use accessors
    * .loc - accesses using labels
    * .iloc - accesses using index positions
* Both accessors use the same syntax: left bracket, row specifier, comma, column specifier, right bracket, i.e. `[row, column]`

##### Using the .loc accessor

In [None]:
df.loc['May', 'spam']

##### Using the .iloc accessor

In [None]:
df.iloc[4, 2]

#### Selecting only some columns

* When using bracket-indexing without the .loc or .iloc accessors, the result returned can be an individual value, Pandas Series, or Pandas DataFrame.
* To ensure the return value is a DataFrame, use a nested list within square brackets

In [None]:
df_new = df[['salt','eggs']]
df_new

### Exercises

#### Index ordering

In this exercise, the DataFrame ***election*** is provided for you. It contains the 2012 US election results for the state of Pennsylvania with county names as row indices. Your job is to select ***'Bedford'*** county and the ***'winner'*** column. Which method is the preferred way?

In [None]:
election = pd.read_csv(election_penn, index_col='county')
election.head()

In [None]:
election.loc['Bedford', 'winner']

#### Positional and labeled indexing

Given a pair of label-based indices, sometimes it's necessary to find the corresponding positions. In this exercise, you will use the Pennsylvania election results again. The DataFrame is provided for you as ***election***.

Find ***x*** and ***y*** such that ***election.iloc[x, y] == election.loc['Bedford', 'winner']***. That is, what is the row position of ***'Bedford'***, and the column position of ***'winner'***? Remember that the first position in Python is 0, not 1!

To answer this question, first explore the DataFrame using ***election.head()*** in the IPython Shell and inspect it with your eyes.

***Instructions***

* Explore the DataFrame in the IPython Shell using ***election.head()***.
* Assign the row position of ***election.loc['Bedford']*** to ***x***.
* Assign the column position of ***election['winner']*** to ***y***.
* Hit 'Submit Answer' to print the boolean equivalence of the ***.loc*** and ***.iloc*** selections.

In [None]:
# Assign the row position of election.loc['Bedford']: x
x = 4

# Assign the column position of election['winner']: y
y = 4

# Print the boolean equivalence
print(election.iloc[x, y] == election.loc['Bedford', 'winner'])

***Depending on the situation, you may wish to use .iloc[] over .loc[], and vice versa. The important thing to realize is you can achieve the exact same results using either approach.***
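
Rather than counting rows and columns by eye, the positions can also be looked up from the labels. The following is a minimal sketch, assuming a small made-up DataFrame (`toy` and its values are hypothetical, not the course's election data) and an index with unique labels, in which case `Index.get_loc` returns a single integer:

```python
import pandas as pd

# Hypothetical stand-in for the election data (values are made up)
toy = pd.DataFrame(
    {'total': [100, 200, 300], 'winner': ['Romney', 'Obama', 'Romney']},
    index=pd.Index(['Adams', 'Allegheny', 'Bedford'], name='county'),
)

# Look up integer positions from labels instead of counting manually
x = toy.index.get_loc('Bedford')    # row position of the 'Bedford' label
y = toy.columns.get_loc('winner')   # column position of the 'winner' label

print(x, y)
print(toy.iloc[x, y] == toy.loc['Bedford', 'winner'])
```

If the index contains duplicate labels, `get_loc` may return a slice or a boolean mask instead of an integer, so this approach is best suited to unique indexes.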

#### Indexing and column rearrangement

There are circumstances in which it's useful to modify the order of your DataFrame columns. We do that now by extracting just three columns from the Pennsylvania election results DataFrame.

Your job is to read the CSV file and set the index to ***'county'***. You'll then assign a new DataFrame by selecting the list of columns ***['winner', 'total', 'voters']***. The CSV file is provided to you in the variable ***filename***.

***Instructions***

* Import pandas as pd.
* Read in ***filename*** using ***pd.read_csv()*** and set the index to ***'county'*** by specifying the ***index_col*** parameter.
* Create a separate DataFrame ***results*** with the columns ***['winner', 'total', 'voters']***.
* Print the output using ***results.head()***. This has been done for you, so hit 'Submit Answer' to see the new DataFrame!


In [None]:
# Read in filename and set the index: election
election = pd.read_csv(election_penn, index_col='county')

# Create a separate dataframe with the columns ['winner', 'total', 'voters']: results
results = election[['winner', 'total', 'voters']]

# Print the output of results.head()
print(results.head())

***The original election DataFrame had 6 columns, but as you can see, your results DataFrame now has just the 3 columns: 'winner', 'total', and 'voters'.***

### Slicing DataFrames

#### sales DataFrame

In [None]:
df = pd.read_csv(sales, index_col='month')
df

#### Selecting a column (i.e., Series)

In [None]:
df['eggs']

In [None]:
type(df.eggs)

#### Slicing and indexing a Series

In [None]:
df['eggs'][1:4] # Part of the eggs column

In [None]:
df['eggs'].iloc[4] # The value associated with May (explicit positional access)

#### Using .loc[] (1)

In [None]:
df.loc[:, 'eggs':'salt'] # All rows, some columns

#### Using .loc[] (2)

In [None]:
df.loc['Jan':'Apr',:] # Some rows, all columns

#### Using .loc[] (3)

In [None]:
df.loc['Mar':'May', 'salt':'spam']

#### Using .iloc[]

In [None]:
df.iloc[2:5, 1:] # A block from middle of the DataFrame

#### Using lists rather than slices (1)

In [None]:
df.loc['Jan':'May', ['eggs', 'spam']]

#### Using lists rather than slices (2)

In [None]:
df.iloc[[0,4,5], 0:2]

#### Series versus 1-column DataFrame

In [None]:
# A Series by column name
df['eggs']

In [None]:
type(df['eggs'])

In [None]:
# A DataFrame w/ single column
df[['eggs']]

In [None]:
type(df[['eggs']])

### Exercises

#### Slicing rows

The Pennsylvania US election results data set that you have been using so far is ordered by county name. This means that county names can be sliced alphabetically. In this exercise, you're going to perform slicing on the county names of the ***election*** DataFrame from the previous exercises, which has been pre-loaded for you.

***Instructions***

* Slice the row labels ***'Perry'*** to ***'Potter'*** and assign the output to ***p_counties***.
* Print the ***p_counties*** DataFrame. This has been done for you.
* Slice the row labels ***'Potter'*** to ***'Perry'*** in reverse order. To do this for hypothetical row labels ***'a'*** and ***'b'***, you could use a stepsize of ***-1*** like so: ***df.loc['b':'a':-1]***.
* Print the ***p_counties_rev*** DataFrame. This has also been done for you, so hit 'Submit Answer' to see the result of your slicing!

In [None]:
# Slice the row labels 'Perry' to 'Potter': p_counties
p_counties = election.loc['Perry':'Potter']

# Print the p_counties DataFrame
print(p_counties)

# Slice the row labels 'Potter' to 'Perry' in reverse order: p_counties_rev
p_counties_rev = election.loc['Potter':'Perry':-1]

# Print the p_counties_rev DataFrame
print(p_counties_rev)
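
One detail worth keeping in mind when slicing: label slices with `.loc` include the end label, whereas position slices with `.iloc` exclude the stop position. A minimal sketch on made-up data (the `toy` DataFrame and its vote counts are hypothetical):

```python
import pandas as pd

# Hypothetical data with an alphabetically sorted index
toy = pd.DataFrame(
    {'votes': [10, 20, 30, 40]},
    index=['Perry', 'Philadelphia', 'Pike', 'Potter'],
)

forward = toy.loc['Perry':'Pike']        # label slice: 'Pike' is included
positional = toy.iloc[0:2]               # position slice: position 2 is excluded
backward = toy.loc['Pike':'Perry':-1]    # reversed label slice with a step of -1

print(list(forward.index))
print(list(positional.index))
print(list(backward.index))
```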

#### Slicing columns

Similar to row slicing, columns can be sliced by value. In this exercise, your job is to slice column names from the Pennsylvania election results DataFrame using ***.loc[]***.

It has been pre-loaded for you as ***election***, with the index set to ***'county'***.

***Instructions***

* Slice the columns from the starting column to ***'Obama'*** and assign the result to ***left_columns***
* Slice the columns from ***'Obama'*** to ***'winner'*** and assign the result to ***middle_columns***
* Slice the columns from ***'Romney'*** to the end and assign the result to ***right_columns***
* The code to print the first 5 rows of ***left_columns***, ***middle_columns***, and ***right_columns*** has been written, so hit 'Submit Answer' to see the results!

In [None]:
# Slice the columns from the starting column to 'Obama': left_columns
left_columns = election.loc[:, :'Obama']

# Print the output of left_columns.head()
print('Left Columns: \n', left_columns.head())

# Slice the columns from 'Obama' to 'winner': middle_columns
middle_columns = election.loc[:, 'Obama':'winner']

# Print the output of middle_columns.head()
print('\nMiddle Columns: \n', middle_columns.head())

# Slice the columns from 'Romney' to the end: 'right_columns'
right_columns = election.loc[:, 'Romney':]

# Print the output of right_columns.head()
print('\nRight Columns: \n', right_columns.head())

#### Subselecting DataFrames with lists

You can use lists to select specific row and column labels with the ***.loc[]*** accessor. In this exercise, your job is to select the counties ***['Philadelphia', 'Centre', 'Fulton']*** and the columns ***['winner','Obama','Romney']*** from the ***election*** DataFrame, which has been pre-loaded for you with the index set to ***'county'***.

***Instructions***

* Create the list of row labels ***['Philadelphia', 'Centre', 'Fulton']*** and assign it to ***rows***.
* Create the list of column labels ***['winner', 'Obama', 'Romney']*** and assign it to ***cols***.
* Create a new DataFrame by selecting with ***rows*** and ***cols*** in ***.loc[]*** and assign it to ***three_counties***.
* Print the ***three_counties*** DataFrame. This has been done for you, so hit 'Submit Answer' to see your new DataFrame.

In [None]:
# Create the list of row labels: rows
rows = ['Philadelphia', 'Centre', 'Fulton']

# Create the list of column labels: cols
cols = ['winner', 'Obama', 'Romney']

# Create the new DataFrame: three_counties
three_counties = election.loc[rows, cols]

# Print the three_counties DataFrame
print(three_counties)

***If you know exactly which rows and columns are of interest to you, this is a useful approach for subselecting DataFrames.***

### Filtering DataFrames

#### Data

In [None]:
df = pd.read_csv(sales, index_col='month')
df

#### Creating a Boolean Series

In [None]:
df.salt > 60

#### Filtering with a Boolean Series

In [None]:
df[df.salt > 60]

In [None]:
enough_salt_sold = df.salt > 60

In [None]:
df[enough_salt_sold]

#### Combining filters

In [None]:
df[(df.salt >= 50) & (df.eggs < 200)] # Both conditions

In [None]:
df[(df.salt >= 50) | (df.eggs < 200)] # Either condition

#### DataFrames with zeros and NaNs

In [None]:
df2 = df.copy()

In [None]:
df2['bacon'] = [0, 0, 50, 60, 70, 80]

In [None]:
df2

#### Select columns with all nonzeros

In [None]:
df2.loc[:, df2.all()]

#### Select columns with any nonzeros

In [None]:
df2.loc[:, df2.any()]

#### Select columns with any NaNs

In [None]:
df.loc[:, df.isnull().any()]

#### Select columns without NaNs

In [None]:
df.loc[:, df.notnull().all()]

#### Drop rows with any NaNs

In [None]:
df.dropna(how='any')

#### Filtering a column based on another

In [None]:
df.eggs[df.salt > 55]

#### Modifying a column based on another

In [None]:
# Use a single .loc call; chained assignment (df.eggs[mask] += 5) may not modify df
df.loc[df.salt > 55, 'eggs'] += 5

In [None]:
df

### Exercises

#### Thresholding data

In this exercise, we have provided the Pennsylvania election results and included a column called ***'turnout'*** that contains the percentage of voter turnout per county. Your job is to prepare a boolean array to select all of the rows and columns where voter turnout exceeded 70%.

As before, the DataFrame is available to you as ***election*** with the index set to ***'county'***.

***Instructions***

* Create a boolean array of the condition where the ***'turnout'*** column is greater than ***70*** and assign it to ***high_turnout***.
* Filter the ***election*** DataFrame with the ***high_turnout*** array and assign it to ***high_turnout_df***.
* Print the filtered DataFrame. This has been done for you, so hit 'Submit Answer' to see it!

In [None]:
election = pd.read_csv(election_penn, index_col='county')

In [None]:
# Create the boolean array: high_turnout
high_turnout = election.turnout > 70

# Filter the election DataFrame with the high_turnout array: high_turnout_df
high_turnout_df = election[high_turnout]

# Print the high_turnout_df DataFrame
print(high_turnout_df)

#### Filtering columns using other columns

The election results DataFrame has a column labeled ***'margin'*** which expresses the number of extra votes the winner received over the losing candidate. This number is given as a percentage of the total votes cast. It is reasonable to assume that in counties where this margin was less than 1%, the results would be too-close-to-call.

Your job is to use boolean selection to filter the rows where the margin was less than 1. You'll then convert these rows of the ***'winner'*** column to ***np.nan*** to indicate that these results are too close to declare a winner.

The DataFrame has been pre-loaded for you as ***election***.

***Instructions***

* Import ***numpy*** as ***np***.
* Create a boolean array for the condition where the ***'margin'*** column is less than 1 and assign it to ***too_close***.
* Convert the entries in the ***'winner'*** column where the result was too close to call to ***np.nan***.
* Print the output of ***election.info()***. This has been done for you, so hit 'Submit Answer' to see the results.

In [None]:
# Create the boolean array: too_close
too_close = election.margin < 1

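# Assign np.nan to the 'winner' column where the results were too close to call
# (a single .loc call avoids chained assignment, which may not modify election)
election.loc[too_close, 'winner'] = np.nan

# Print the output of election.info() (it prints directly and returns None)
election.info()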
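
The idea of masking one column based on another can be sketched on made-up data as follows. The `toy` DataFrame and its values are hypothetical. The assignment uses one `.loc` call with the boolean mask as the row specifier and the column label as the column specifier, which is an assumption about the preferred style rather than the course's own solution:

```python
import numpy as np
import pandas as pd

# Hypothetical margins (in percent) and winners
toy = pd.DataFrame(
    {'margin': [0.5, 12.3, 0.9], 'winner': ['Obama', 'Romney', 'Romney']},
    index=['A', 'B', 'C'],
)

too_close = toy['margin'] < 1            # boolean Series, one entry per row
toy.loc[too_close, 'winner'] = np.nan    # rows chosen by mask, column chosen by label

print(toy)
print(toy['winner'].isna().sum())
```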

#### Filtering using NaNs

In certain scenarios, it may be necessary to remove rows and columns with missing data from a DataFrame. The ***.dropna()*** method is used to perform this action. You'll now practice using this method on a dataset obtained from [Vanderbilt University](http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.html), which consists of data from passengers on the Titanic.

The DataFrame has been pre-loaded for you as ***titanic***. Explore it in the IPython Shell and you will note that there are many NaNs. You will focus specifically on the ***'age'*** and ***'cabin'*** columns in this exercise. Your job is to use ***.dropna()*** in two ways: first to remove rows where either of these two columns contains missing data, and then to remove only the rows where both of these columns contain missing data.

You'll also use the ***.shape*** attribute, which returns the number of rows and columns in a tuple from a DataFrame, or the number of rows from a Series, to see the effect of dropping missing values from a DataFrame.

Finally, you'll use the ***thresh=*** keyword argument to drop columns from the full dataset that have fewer than 1000 non-missing values.

***Instructions***

* Select the ***'age'*** and ***'cabin'*** columns of ***titanic*** and create a new DataFrame ***df***.
* Print the shape of ***df***. This has been done for you.
* Drop rows in ***df*** with ***how='any'*** and print the shape.
* Drop rows in ***df*** with ***how='all'*** and print the shape.
* Drop columns from the ***titanic*** DataFrame that have fewer than 1000 non-missing values by specifying the ***thresh*** and ***axis*** keyword arguments. Print the output of ***.info()*** from this.

In [None]:
titanic = pd.read_csv(titanic_data)

In [None]:
# Select the 'age' and 'cabin' columns: df
df = titanic[['age', 'cabin']]

# Print the shape of df
print(df.shape)

# Drop rows in df with how='any' and print the shape
print('\n', df.dropna(how='any').shape)

# Drop rows in df with how='all' and print the shape
print('\n', df.dropna(how='all').shape)

# Drop columns in titanic with fewer than 1000 non-missing values
# (.info() prints its summary directly and returns None, so it is not wrapped in print())
titanic.dropna(thresh=1000, axis='columns').info()
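
The difference between `how='any'`, `how='all'`, and `thresh` is easier to see on a tiny example. This is a minimal sketch on made-up data (the `toy` DataFrame, its values, and the threshold of 3 are hypothetical and chosen only to keep the example small):

```python
import numpy as np
import pandas as pd

# Hypothetical passenger-like data with missing values
toy = pd.DataFrame({
    'age':   [22.0, np.nan, np.nan, 35.0],
    'cabin': ['C85', np.nan, 'E46', np.nan],
    'fare':  [7.25, 71.28, 8.05, 53.10],
})

sub = toy[['age', 'cabin']]
any_shape = sub.dropna(how='any').shape   # drop rows with at least one NaN
all_shape = sub.dropna(how='all').shape   # drop rows only when every value is NaN

# Keep only the columns that have at least 3 non-missing values
kept = toy.dropna(thresh=3, axis='columns')

print(any_shape, all_shape, list(kept.columns))
```

Here `thresh` is the minimum number of non-missing values a column needs in order to be kept, so columns below that count are dropped.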

### Transforming DataFrames

#### Data

In [None]:
df = pd.read_csv(sales, index_col='month')
df

#### DataFrame vectorized methods

In [None]:
df.floordiv(12) # Convert to dozens unit

#### NumPy vectorized functions

* [np.floor_divide](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.floor_divide.html)
* [invalid value encountered in floor_divide](https://stackoverflow.com/questions/14861891/runtimewarning-invalid-value-encountered-in-divide)

In [None]:
np.floor_divide(df, 12) # Convert to dozens unit

#### Plain Python functions (1)

In [None]:
def dozens(n):
    return n//12

In [None]:
df.apply(dozens)  # Convert to dozens unit

#### Plain Python functions (2)

In [None]:
df.apply(lambda n: n//12)

#### Storing a transformation

In [None]:
df['dozens_of_eggs'] = df.eggs.floordiv(12)
df

#### The DataFrame index

In [None]:
df.index

#### Working with string values (1)

In [None]:
df.index = df.index.str.upper()
df

#### Working with string values (2)

In [None]:
df.index = df.index.map(str.lower)
df

#### Defining columns using other columns

In [None]:
df['salty_eggs'] = df.salt + df.dozens_of_eggs
df

### Exercises

#### Using apply() to transform a column

The ***.apply()*** method can be used on a pandas DataFrame to apply an arbitrary Python function along an axis, which by default means once per column. In this exercise you'll take daily weather data in Pittsburgh in 2013 obtained from [Weather Underground](https://www.wunderground.com/history).

A function to convert degrees Fahrenheit to degrees Celsius has been written for you. Your job is to use the ***.apply()*** method to perform this conversion on the 'Mean TemperatureF' and 'Mean Dew PointF' columns of the weather DataFrame.

***Instructions***

* Apply the ***to_celsius()*** function over the ***['Mean TemperatureF','Mean Dew PointF']*** columns of the ***weather*** DataFrame.
* Reassign the columns of ***df_celsius*** to ***['Mean TemperatureC','Mean Dew PointC']***.
* Hit 'Submit Answer' to see the new DataFrame with the converted units.

In [None]:
weather = pd.read_csv(weather_data)

In [None]:
# Write a function to convert degrees Fahrenheit to degrees Celsius: to_celsius
def to_celsius(F):
    return 5/9*(F - 32)

In [None]:
# Apply the function over 'Mean TemperatureF' and 'Mean Dew PointF': df_celsius
df_celsius = weather[['Mean TemperatureF', 'Mean Dew PointF']].apply(to_celsius)
df_celsius.head()

In [None]:
# Reassign the columns df_celsius
df_celsius.columns = ['Mean TemperatureC', 'Mean Dew PointC']
df_celsius.head()
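
To see what `.apply()` actually hands to the function on a DataFrame, the following minimal sketch wraps the conversion in a hypothetical `traced` helper that records the type of its argument. The `toy` DataFrame, its column names, and its values are made up for illustration:

```python
import pandas as pd

def to_celsius(f):
    # Works on scalars and on whole Series, since the arithmetic is vectorized
    return 5 / 9 * (f - 32)

# Hypothetical temperatures in degrees Fahrenheit
toy = pd.DataFrame({'TempF': [32.0, 212.0], 'DewF': [50.0, 14.0]})

seen = []

def traced(col):
    seen.append(type(col).__name__)   # record what .apply() passes in
    return to_celsius(col)

converted = toy.apply(traced)

print(seen)
print(converted)
```

Because each call receives a whole column as a Series, the same result can be obtained by calling `to_celsius(toy)` directly.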

#### Using .map() with a dictionary

The ***.map()*** method is used to transform values according to a Python dictionary look-up. In this exercise you'll practice this method while returning to working with the ***election*** DataFrame, which has been pre-loaded for you.

Your job is to use a dictionary to map the values ***'Obama'*** and ***'Romney'*** in the ***'winner'*** column to the values ***'blue'*** and ***'red'***, and assign the output to the new column ***'color'***.

***Instructions***

* Create a dictionary with the key:value pairs ***'Obama':'blue'*** and ***'Romney':'red'***.
* Use the ***.map()*** method on the ***'winner'*** column using the ***red_vs_blue*** dictionary you created.
* Print the output of ***election.head()***. This has been done for you, so hit 'Submit Answer' to see the new column!

In [None]:
election =  pd.read_csv(election_penn, index_col='county')
election.head()

In [None]:
# Create the dictionary: red_vs_blue
red_vs_blue = {'Obama':'blue', 'Romney':'red'}

# Use the dictionary to map the 'winner' column to the new column: election['color']
election['color'] = election.winner.map(red_vs_blue)

# Print the output of election.head()
election.head()
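
One behaviour of dictionary-based `.map()` worth knowing is that values without a matching key become missing values. A minimal sketch on a made-up Series (the `'Other'` entry is hypothetical and does not appear in the election data):

```python
import pandas as pd

# Hypothetical winners, including one value that is not in the dictionary
winner = pd.Series(['Obama', 'Romney', 'Other'])

red_vs_blue = {'Obama': 'blue', 'Romney': 'red'}
colors = winner.map(red_vs_blue)

print(colors)
print(colors.isna().tolist())
```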

#### Using vectorized functions

When performance is paramount, you should avoid using ***.apply()*** and ***.map()*** because those constructs perform Python for-loops over the data stored in a pandas Series or DataFrame. By using vectorized functions instead, you can loop over the data at the same speed as compiled code (C, Fortran, etc.)! NumPy, SciPy and pandas come with a variety of vectorized functions (called Universal Functions or UFuncs in NumPy).

You can even write your own vectorized functions, but for now we will focus on the ones distributed by NumPy and pandas.

In this exercise you're going to import the ***zscore*** function from ***scipy.stats*** and use it to compute the deviation in voter turnout in Pennsylvania from the mean in fractions of the standard deviation. In statistics, the z-score is the number of standard deviations by which an observation is above the mean - so if it is negative, it means the observation is below the mean.

Instead of using ***.apply()*** as you did in the earlier exercises, the ***zscore*** UFunc will take a pandas Series as input and return a NumPy array. You will then assign the values of the NumPy array to a new column in the DataFrame. You will be working with the ***election*** DataFrame - it has been pre-loaded for you.

***Instructions***

* Import ***zscore*** from ***scipy.stats***.
* Call ***zscore*** with ***election['turnout']*** as input.
* Print the output of ***type(turnout_zscore)***. This has been done for you.
* Assign ***turnout_zscore*** to a new column in ***election*** as ***'turnout_zscore'***.
* Print the output of ***election.head()***. This has been done for you, so hit 'Submit Answer' to view the result.

In [None]:
# Call zscore with election['turnout'] as input: turnout_zscore
turnout_zscore = zscore(election.turnout)

# Print the type of turnout_zscore
print('Type: \n', type(turnout_zscore), '\n')

# Assign turnout_zscore to a new column: election['turnout_zscore']
election['turnout_zscore'] = turnout_zscore

# Print the output of election.head()
election.head()
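
The z-score itself is simple to write with whole-Series arithmetic. The following minimal sketch, on made-up turnout values, compares that vectorized form with an element-by-element `.apply()` version. It assumes the population standard deviation (`ddof=0`), which is also the default used by `scipy.stats.zscore`:

```python
import numpy as np
import pandas as pd

# Hypothetical turnout percentages
turnout = pd.Series([60.0, 70.0, 80.0, 90.0])

mean = turnout.mean()
std = turnout.std(ddof=0)   # population standard deviation

vectorized = (turnout - mean) / std                   # whole-Series arithmetic
looped = turnout.apply(lambda v: (v - mean) / std)    # one Python call per element

print(np.allclose(vectorized, looped))
```

Both forms give the same values; the vectorized one avoids a Python-level function call for every element, which is where the performance difference comes from on larger data.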

## Advanced Indexing

Having learned the fundamentals of working with DataFrames, you will now move on to more advanced indexing techniques. You will learn about MultiIndexes, or hierarchical indexes, and learn how to interact with and extract data from them.

### Index objects and labeled data

### Hierarchical indexing

## Rearranging and reshaping data

Here, you will learn how to reshape your DataFrames using techniques such as pivoting, melting, stacking, and unstacking. These are powerful techniques that allow you to tidy and rearrange your data into the format that allows you to most easily analyze it for insights.

### Pivoting DataFrames

### Stacking & unstacking DataFrames

### Melting DataFrames

### Pivot tables

## Grouping data

In this chapter, you'll learn how to identify and split DataFrames by groups or categories for further aggregation or analysis. You'll also learn how to transform and filter your data, including how to detect outliers and impute missing values. Knowing how to effectively group data in pandas can be a seriously powerful addition to your data science toolbox.

### Categorical and groupby

### Groupby and aggregation

### Groupby and transformation

### Groupby and filtering

## Bringing it all together

Here, you will bring together everything you have learned in this course while working with data recorded from the Summer Olympic games that goes as far back as 1896! This is a rich dataset that will allow you to fully apply the data manipulation techniques you have learned. You will pivot, unstack, group, slice, and reshape your data as you explore this dataset and uncover some truly fascinating insights. Enjoy!

### Case Study - Summer Olympics

### Understanding the column labels

### Constructing alternative country rankings

### Reshaping DataFrames for visualization