# More Pandas and Plotting with Matplotlib

## Goals

- Leverage pandas to perform SQL-like tasks (grouping, merging, applying functions)
- Use Matplotlib and pandas to create standard charts (scatter, line, bar, boxplot, etc.)


## Review

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

1\. Read the CSV file 'movie_metadata.csv' into a pandas dataframe

In [None]:
path = '../../data/movie_metadata.csv'


2\. Let's take a peek at the first 5 rows of the data

3\. Let's take a look at a few of the movie titles in the *'movie_title'* column

4\. Let's take a closer look -- print the *'movie_title'* column as a numpy array

Hmm...what are those weird characters at the end of each title? Maybe **Google** knows?

5\. Ok let's get rid of those weird characters by **replacing** them with an empty string. But how to do that? Maybe there's something in the **pandas documentation**?
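
A minimal sketch of the replacement step, run on a tiny made-up DataFrame rather than the real file. It assumes the stray character is a trailing non-breaking space (`\xa0`); check what the numpy array view shows for your own data before relying on that.

```python
import pandas as pd

# Hypothetical stand-in for the movie data. The trailing '\xa0'
# (non-breaking space) is an assumption about the odd characters.
movies = pd.DataFrame({'movie_title': ['Avatar\xa0', 'Spectre\xa0', 'Deadpool\xa0']})

# The raw numpy values expose characters that a normal printout hides
print(movies['movie_title'].values)

# Replace the character with an empty string across the whole column
movies['movie_title'] = movies['movie_title'].str.replace('\xa0', '', regex=False)

print(movies['movie_title'].values)
```

`str.strip()` would also work here if the unwanted characters only ever appear at the ends of the titles.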

6\. Let's see if that did the trick -- print the *'movie_title'* column as a numpy array again

7\. Great! Now, let's set an index for the data. The movie title seems like a good candidate. But first, let's see if the titles are **unique**. Maybe compare the total number of movie titles to the number of unique movie titles?

8\. Uh-oh...it looks like the movie titles aren't unique. Let's find the **duplicates**. Maybe the **pandas documentation** has something?

9\. Let's just keep the first occurrence of every duplicate row (based on movie title) and drop the others. Wasn't there something in the pandas documentation about **drop_duplicates**?
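
The uniqueness check, the duplicate lookup, and the de-duplication can be sketched like this. The titles and years are made up for illustration.

```python
import pandas as pd

# Made-up rows; 'King Kong' appears twice on purpose
movies = pd.DataFrame({
    'movie_title': ['King Kong', 'Ben-Hur', 'King Kong', 'Deadpool'],
    'title_year': [2005, 2016, 2005, 2016],
})

# Compare the total number of titles to the number of unique titles
n_total = len(movies)
n_unique = movies['movie_title'].nunique()

# keep=False flags every copy of a duplicated title, not just the repeats
dupes = movies[movies.duplicated(subset='movie_title', keep=False)]

# Keep the first occurrence of each title and drop the rest
deduped = movies.drop_duplicates(subset='movie_title', keep='first')

print(n_total, n_unique)
print(dupes)
print(deduped)
```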

10\. Great! Now, let's make the movie title the index.

11\. Deadpool was a pretty good movie. Let's get the data for it.

12\. Oh wow -- Deadpool grossed $363,024,000. What movies made more money than that?

13\. Let's sort all of the movies by gross earnings and find the top 5 highest grossing movies.
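
Steps 10 through 13 follow a common pattern: set an index, look up one row by label, filter with a boolean mask, then sort. Here is a rough sketch with invented titles and gross values.

```python
import pandas as pd

# Invented titles and gross figures, for illustration only
movies = pd.DataFrame({
    'movie_title': ['Film A', 'Film B', 'Film C', 'Film D', 'Film E', 'Film F'],
    'gross': [100.0, 400.0, 250.0, 50.0, 300.0, 10.0],
})

# Make the title the index so rows can be looked up by name
movies = movies.set_index('movie_title')

# Look up a single movie by its label
film_c = movies.loc['Film C']

# Boolean mask: movies that grossed more than Film C
bigger = movies[movies['gross'] > film_c['gross']]

# Sort by gross, highest first, and keep the top 5
top5 = movies.sort_values('gross', ascending=False).head(5)

print(bigger)
print(top5)
```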

14\. Let's create a new metric called 'profit' -- it's calculated as **gross - budget**. Add a column to the data for this metric.
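
A small sketch of the profit column and the two summary questions that build on it. The numbers are invented, and one gross value is left missing to show how `NaN` behaves.

```python
import numpy as np
import pandas as pd

# Invented figures; one gross value is deliberately missing
movies = pd.DataFrame({
    'gross': [100.0, 400.0, np.nan, 50.0],
    'budget': [150.0, 100.0, 80.0, 50.0],
}, index=['Film A', 'Film B', 'Film C', 'Film D'])

# Column arithmetic is element-wise; a missing input gives a missing profit
movies['profit'] = movies['gross'] - movies['budget']

# Comparisons against NaN are False, so missing rows are not counted here
n_negative = (movies['profit'] < 0).sum()

# mean() skips NaN by default
mean_profit = movies['profit'].mean()

print(movies)
print(n_negative, mean_profit)
```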

15\. How many movies generated a negative profit?

16\. What's the average profit per movie?

17\. Last, let's see what percentage of movies have each *content_rating*.
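
One way to get percentages per category is `value_counts(normalize=True)`. The ratings below are made up.

```python
import pandas as pd

# Made-up ratings for illustration
movies = pd.DataFrame({'content_rating': ['R', 'PG-13', 'R', 'PG']})

# normalize=True returns fractions instead of raw counts
pct = movies['content_rating'].value_counts(normalize=True) * 100

print(pct)
```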

## Pandas GroupBy
#### Split -> Apply -> Combine

![image.png](attachment:image.png)
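
As a rough illustration of split -> apply -> combine, the sketch below groups a made-up table by director and aggregates profit. The director names and values are hypothetical.

```python
import pandas as pd

# Hypothetical directors and profits
movies = pd.DataFrame({
    'director_name': ['Ann', 'Bo', 'Ann', 'Bo', 'Cy'],
    'profit': [10.0, -5.0, 30.0, 15.0, 7.0],
})

# Split: this is a lazy GroupBy object, so printing it shows no data
grouped = movies.groupby('director_name')
print(grouped)

# Apply + combine: one aggregated value per group
mean_profit = grouped['profit'].mean()

# Several aggregations at once
summary = grouped['profit'].agg(['count', 'mean', 'max'])

print(mean_profit)
print(summary)
```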

In [None]:
# Group the data by director_name


In [None]:
# Well that's not helpful. The data needs to be accessed by certain methods


In [None]:
# You can also access a specific column on a groupby object


In [None]:
# Let's fix those NaN. First, how many are missing?


In [None]:
# Let's replace those missing profit values with the median profit across all films


In [None]:
# Or we can just drop those rows altogether
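
The two ways of handling missing values can be compared side by side. This is a sketch on invented profit values, not the real movie data.

```python
import numpy as np
import pandas as pd

# Invented profits with two missing entries
movies = pd.DataFrame({'profit': [-50.0, 300.0, np.nan, 0.0, np.nan]})

# Count the missing values
n_missing = movies['profit'].isnull().sum()

# Option 1: fill the gaps with the median (median() skips NaN)
filled = movies['profit'].fillna(movies['profit'].median())

# Option 2: drop rows where profit is missing
dropped = movies.dropna(subset=['profit'])

print(n_missing)
print(filled)
print(dropped)
```

Filling keeps every row but adds invented values; dropping keeps only real values but loses rows. Which trade-off is acceptable depends on the analysis.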


In [None]:
# Let's find the movies that are appropriate for kids under the age of 17


## Mapping and Applying Functions

In [None]:
# Make a function that returns 'fresh' if imdb_score > 5 and 'rotten' if imdb_score <= 5


In [None]:
# Apply the function to the 'imdb_score' column


In [None]:
# How many movies are rotten vs. fresh?
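
A possible shape for the three steps above, using made-up scores. The function name `rate` and the new column name `rating` are placeholders.

```python
import pandas as pd

def rate(score):
    """Return 'fresh' for scores above 5, otherwise 'rotten'."""
    return 'fresh' if score > 5 else 'rotten'

# Made-up scores; 5.0 sits exactly on the boundary and counts as rotten
movies = pd.DataFrame({'imdb_score': [7.9, 5.0, 4.2, 6.8]})

# apply() calls the function once per value in the column
movies['rating'] = movies['imdb_score'].apply(rate)

# Count how many movies fall into each bucket
counts = movies['rating'].value_counts()

print(movies)
print(counts)
```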


In [None]:
# How many 'Comedy' movies are there?


In [None]:
# What about the number of comedic action movies?
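
Counting by genre can be sketched with `str.contains`. This assumes a `genres` column holding pipe-delimited strings such as `'Action|Comedy'`; check the real column before relying on that layout.

```python
import pandas as pd

# Assumed layout: one pipe-delimited string of genres per movie
movies = pd.DataFrame({
    'genres': ['Action|Adventure', 'Comedy|Romance', 'Action|Comedy', 'Drama'],
})

# Boolean masks: True where the genre name appears in the string
is_comedy = movies['genres'].str.contains('Comedy')
is_action = movies['genres'].str.contains('Action')

# Summing a boolean mask counts the True values
n_comedy = is_comedy.sum()

# Combine masks with & to require both genres
n_action_comedy = (is_comedy & is_action).sum()

print(n_comedy, n_action_comedy)
```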


## Additional Pandas Resources


Be sure to check out any of these pandas resources!

- Pandas cheatsheets in the resources directory

- https://chrisalbon.com/ <- Great website. For now check out data wrangling section.

- Data school's collection of resources http://www.dataschool.io/best-python-pandas-resources/

- Data school tutorial in giant repo. http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb

- http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/

- http://nbviewer.jupyter.org/github/fonnesbeck/Bios8366/blob/master/notebooks/Section2_1-Introduction-to-Pandas.ipynb

- Repo with Pandas exercises https://github.com/guipsamora/pandas_exercises

- https://github.com/brandon-rhodes/pycon-pandas-tutorial

- https://github.com/jonathanrocher/pandas_tutorial

- https://github.com/chendaniely/scipy-2017-tutorial-pandas

- https://github.com/adeshpande3/Pandas-Tutorial

## Concatenating Dataframes

In [None]:
# Use numpy to create some data

array1 = np.random.normal(loc=3, size=(100, 3))
array2 = np.random.normal(loc=-3, size=(100, 3))

columns = ['feature_1', 'feature_2', 'feature_3']

df1 = pd.DataFrame(array1, columns=columns)
df2 = pd.DataFrame(array2, columns=columns)

In [None]:
# Take a peek at df1


In [None]:
# Take a peek at df2


In [None]:
# Now we want to concatenate these dataframes together. We can do that using pd.concat
# Check the pandas documentation for how to use this function


In [None]:
# Let's verify that worked -- our new df should have a shape of (200, 3)


In [None]:
# Let's look at the new dataframe


In [None]:
# Hmm...it looks like we have duplicate index values. Let's get rid of those by resetting the index
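
The whole concatenation walkthrough fits in a few lines. The sketch rebuilds the two frames as above (with a fixed seed so it is repeatable) and stacks them row-wise.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so the example is repeatable

columns = ['feature_1', 'feature_2', 'feature_3']
df1 = pd.DataFrame(np.random.normal(loc=3, size=(100, 3)), columns=columns)
df2 = pd.DataFrame(np.random.normal(loc=-3, size=(100, 3)), columns=columns)

# Stack the rows of df2 underneath df1
stacked = pd.concat([df1, df2])

# Each frame kept its own 0..99 index, so labels now repeat
had_duplicate_index = not stacked.index.is_unique

# drop=True discards the old index instead of keeping it as a column
combined = stacked.reset_index(drop=True)

print(stacked.shape, had_duplicate_index)
print(combined.tail())
```

Passing `ignore_index=True` to `pd.concat` gives the same result in one step.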


## Merging DataFrames

Similar to SQL joins, we can combine two DataFrames by using a unique key found in both DataFrames.

In [None]:
num = pd.DataFrame({'color': ['green', 'yellow', 'red'], 'num': [1, 2, 3]})
size = pd.DataFrame({'color': ['green', 'yellow', 'pink'], 'size': ['S', 'M', 'L']})

In [None]:
num

In [None]:
size

Inner join: Only include observations found in both *num* and *size*

In [None]:
# Use pd.merge to do the join. Check out the pandas documentation for help!


Left outer join: Include every observation found in *num*, matched with *size* where possible

Right outer join: Include every observation found in *size*, matched with *num* where possible

Full outer join: Include observations found in either *num* or *size*
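
The four join types map onto the `how` argument of `pd.merge`. The sketch below reuses the *num* and *size* frames defined earlier.

```python
import pandas as pd

num = pd.DataFrame({'color': ['green', 'yellow', 'red'], 'num': [1, 2, 3]})
size = pd.DataFrame({'color': ['green', 'yellow', 'pink'], 'size': ['S', 'M', 'L']})

# Inner: only colors present in both frames
inner = pd.merge(num, size, on='color', how='inner')

# Left: every row of num; size is NaN where there is no match
left = pd.merge(num, size, on='color', how='left')

# Right: every row of size; num is NaN where there is no match
right = pd.merge(num, size, on='color', how='right')

# Outer: every color from either frame
outer = pd.merge(num, size, on='color', how='outer')

for name, frame in [('inner', inner), ('left', left), ('right', right), ('outer', outer)]:
    print(name)
    print(frame)
```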

### More resources on concatenating, merging, and other ways to transform data in pandas
- http://www.datacarpentry.org/python-ecology-lesson/04-merging-data/ 
- https://github.com/shanealynn/Pandas-Merge-Tutorial/blob/master/Pandas%20Merge%20Tutorial.ipynb
- https://www.shanelynn.ie/merge-join-dataframes-python-pandas-index-1/
- https://pythonprogramming.net/concatenate-append-data-analysis-python-pandas-tutorial/
- https://www.dezyre.com/data-science-in-python-tutorial/pandas-introductory-tutorial-part-3
- Pivot tables: http://pbpython.com/pandas-pivot-table-explained.html

# Plotting + Exploratory Data Analysis

In [None]:
# import the matplotlib plotting library
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Let's read in the HR satisfaction data
path = '../../data/HR_comma_sep.csv'


In [None]:
# Let's just use a sample of 100 rows for now


In [None]:
# Let's focus on satisfaction and average monthly hours for now
# Looks like someone misspelled 'monthly' -- we can fix that


In [None]:
# Use .plot to make a line plot with satisfaction level


In [None]:
# Change the color of the line


In [None]:
# Change the thickness of the line


In [None]:
# Now, let's adjust the size and add some context

# Set plot size to 8x6 


# Call plot with linewidth to 3


# Label the x-axis


# Label the y-axis


# Give the plot a title
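
Put together, the sizing, line width, labels, and title might look like the sketch below. The satisfaction values and label text are made up. The `Agg` backend line is only there so the sketch runs without a display; leave it out in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values standing in for the sampled HR data
hr = pd.DataFrame({'satisfaction_level': [0.38, 0.80, 0.11, 0.72, 0.37, 0.41]})

# Set the plot size to 8x6 inches
fig, ax = plt.subplots(figsize=(8, 6))

# Line plot with a custom color and a line width of 3
hr['satisfaction_level'].plot(ax=ax, color='purple', linewidth=3)

# Add context: axis labels and a title
ax.set_xlabel('Row index')
ax.set_ylabel('Satisfaction level')
ax.set_title('Satisfaction level by row')
```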


In [None]:
# We can combine charts together on a single plot


In [None]:
# We can also use subplots to show them separately

# Set the figsize to (12,4)


# Plot 'a' with a purple line


# Plot 'b' with a black line


# Use tight_layout() to add spacing between plots


In [None]:
# Plot 'a' vs. 'b' using a scatter plot

# Adjust the size


# Call .scatter() method


# Label the x and y axes


In [None]:
# Let's do that again, but with limits on the x and y axes

# Adjust the size


# Call .scatter() method


# Label the x and y axes


# Add the limits


In [None]:
# We can also change the color and sizes of individual dots

plt.figure(figsize=(8,5))

# Turn the last_evaluation into sizes


# Create dict of colors that correspond to salary
colors = {'low': 'r', 'medium': 'g', 'high': 'b'}


# Set 'sizes' to sizes and 'colors' to colors


# Label the axes
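
A sketch of per-point sizes and colors, using invented rows. The scaling factor of 100 for the marker sizes is an arbitrary choice, and the column names assume the misspelled 'monthly' column has already been renamed.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented rows standing in for the HR sample
hr = pd.DataFrame({
    'satisfaction_level': [0.38, 0.80, 0.11, 0.72],
    'average_monthly_hours': [157, 262, 272, 223],
    'last_evaluation': [0.53, 0.86, 0.88, 0.87],
    'salary': ['low', 'medium', 'high', 'low'],
})

# Turn last_evaluation into marker sizes (the factor 100 is arbitrary)
sizes = hr['last_evaluation'] * 100

# Map each salary level to a color code
colors = {'low': 'r', 'medium': 'g', 'high': 'b'}
point_colors = hr['salary'].map(colors)

plt.figure(figsize=(8, 5))
points = plt.scatter(hr['satisfaction_level'], hr['average_monthly_hours'],
                     s=sizes.values, c=point_colors.tolist())
plt.xlabel('Satisfaction level')
plt.ylabel('Average monthly hours')
```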


In [None]:
# Let's try to make this a little nicer now
# Load in the fivethirtyeight plot style
plt.style.use("fivethirtyeight")

In [None]:
# Let's run our scatter plot again


In [None]:
# Plot a bar plot of sales (i.e. department) and average_monthly_hours
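
The bar chart needs one value per department. One plausible way to build that is a groupby mean, sketched here on invented rows. The name `departments` is chosen to match the plotting code in this cell, and the column names assume the misspelled 'monthly' column has been renamed.

```python
import pandas as pd

# Invented rows; 'sales' is the department column in this dataset
hr = pd.DataFrame({
    'sales': ['IT', 'sales', 'IT', 'hr', 'sales'],
    'average_monthly_hours': [200, 150, 220, 180, 170],
})

# One mean value per department, indexed by department name
departments = hr.groupby('sales')['average_monthly_hours'].mean()

print(departments)
```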



# Requires a list of integers to be used as position coordinates
pos = np.arange(departments.size)

plt.figure(figsize=(15,5))

# Set bar width
w = 0.4
# Newer Matplotlib releases take the bar positions as `x` (the old `left` keyword no longer works)
plt.bar(x=pos, height=departments, width=w, tick_label=departments.index, align="center")

In [None]:
# We can create the same chart but with a horizontal orientation

# Newer Matplotlib releases take the bar positions as `y` (the old `bottom` keyword no longer works)
plt.barh(y=pos, width=departments, height=w, tick_label=departments.index, align="center")

In [None]:
# Make a histogram of satisfaction level


In [None]:
# Let's make some changes
# Set bins=20 and density=True (the option older Matplotlib releases called normed)
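
A sketch of a normalized histogram on randomly generated values. In current Matplotlib the normalization option is `density=True`, which replaced the older `normed` argument. With it, the bar areas sum to 1.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(0)  # fixed seed so the example is repeatable

# Random stand-in for the satisfaction_level column
satisfaction = np.random.uniform(0.1, 1.0, size=500)

plt.figure(figsize=(8, 5))

# hist() returns the bar heights, the bin edges, and the drawn patches
heights, edges, _ = plt.hist(satisfaction, bins=20, density=True)

# With density=True, height * bin width summed over all bins equals 1
area = (heights * np.diff(edges)).sum()
print(area)
```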


In [None]:
# We can also show two histograms on the same plot

plt.figure(figsize=(9,8))

In [None]:
# Boxplots are also easy to make
# Let's make a boxplot of average_monthly_hours


- The box in the boxplot represents the middle 50% of values (i.e. all the values between the 25th and 75th percentile, also known as the interquartile range or IQR)
- The line inside the box represents the median (its color depends on the active plot style).
- The whiskers extend up and down to the most extreme values that lie within 1.5x the IQR of the top and bottom of the box; values beyond that are drawn as individual points. If all values are within that range, the whiskers will go to the max and min of the variable.
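
The boxplot description above can be checked by hand. The sketch below computes the box, the whisker ends, and the outliers with numpy for a made-up set of monthly hours, using the 1.5x IQR rule described above.

```python
import numpy as np

# Made-up monthly hours; 500 is deliberately extreme
hours = np.array([96, 130, 150, 160, 175, 200, 210, 245, 260, 310, 500])

# The box: 25th percentile, median, and 75th percentile
q1, median, q3 = np.percentile(hours, [25, 50, 75])
iqr = q3 - q1

# Whiskers may reach at most 1.5 * IQR beyond the box
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# They stop at the most extreme data points inside those limits
whisker_low = hours[hours >= lower_limit].min()
whisker_high = hours[hours <= upper_limit].max()

# Anything outside the limits is drawn as an individual point
outliers = hours[(hours < lower_limit) | (hours > upper_limit)]

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```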