Jupyter Notebook tips:  
Run a cell: `shift + enter`  
See "Cell" menu option above to run all cells or change cell type from code to markdown.  
See "Insert" menu option above to add cells.  
See "Edit" menu option above to delete cells.  
See "Kernel" menu option above to restart the kernel and run all cells.

# STEP 0: Play With Python

In [None]:
# Assign a list of strings to a variable called 'presents'


In [None]:
# Assign a new variable 'favorite_present' by indexing the list of strings 'presents' we made above
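
One possible way to fill in the two cells above is sketched below. The gift names are made up for illustration; any list of strings works the same way.

```python
# Hypothetical gift names, used only for illustration
presents = ["socks", "scarf", "ugly sweater"]

# Python lists are zero-indexed, so index 2 is the third item
favorite_present = presents[2]

# Negative indices count from the end of the list
last_present = presents[-1]

print(favorite_present)
```

Indexing past the end of the list (for example `presents[3]` here) raises an `IndexError`.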


# STEP 1: LOAD PACKAGES & DATA

### Load Packages

In [None]:
# Import Pandas
import pandas as pd
# Show every row and column when displaying a dataframe
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

### What is a dataframe in Pandas?
Let's look at the [docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) together.

In [None]:
# Create a dataframe using an example from the docs
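
As a starting point, the first example on the linked docs page builds a dataframe from a dictionary. A minimal sketch of that pattern:

```python
import pandas as pd

# Dictionary keys become column names; each list becomes a column
d = {"col1": [1, 2], "col2": [3, 4]}
df = pd.DataFrame(data=d)

# Inspect the result and the inferred column types
print(df)
print(df.dtypes)
```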


### Read in Data


In [None]:
# Read the CSV into a dataframe called 'raw_data'


In [None]:
# Let's just display the 'hs_tf' column


In [None]:
# What happens if we compare the 'hs_tf' column with 'Yes'?


In [None]:
# Bring it all together to filter 'hs_tf' into a dataframe called 'sweaters'
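
The three cells above build up to boolean-mask filtering. The sketch below uses a small made-up dataframe in place of the real CSV; only the `hs_tf` column name and the `raw_data` / `sweaters` variable names come from this notebook, and the values are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the data read from the CSV
raw_data = pd.DataFrame({
    "hs_tf": ["Yes", "No", "Yes"],
    "colors": ["red, green", "blue", "red and white"],
})

# Comparing a column with a scalar returns a boolean Series (a mask)
mask = raw_data["hs_tf"] == "Yes"

# Indexing the dataframe with the mask keeps only the rows where it is True
sweaters = raw_data[mask]

print(mask.tolist())
print(len(sweaters))
```

The comparison is exact and case-sensitive, so values such as `'yes'` or `'Yes '` would not match.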


# STEP 2: WRANGLE/PROCESS DATA

### Wrangle Colors (The Tidy Way)

In [None]:
# Make a copy to avoid assignment warnings


# Convert 'colors' column from a comma- or 'and'-delimited string to a new 'colors_list' column


# Use a question mark to see documentation on str.split


# Tidy colors data (1 color per row)


# Calc how many colors for each sweater & add to dataframe as new 'num_colors' column
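
The steps listed in this cell could be sketched as follows. The rows are invented for illustration and the real `colors` values may be formatted differently, so the split pattern is an assumption that would need checking against the actual data.

```python
import pandas as pd

# Hypothetical filtered data; only the column names follow the notebook
filtered = pd.DataFrame({
    "colors": ["red, green, and white", "blue", "red and gold"],
})

# Copy so later column assignments do not warn about modifying a slice
sweaters = filtered.copy()

# Split on a comma (optionally followed by "and") or on a standalone "and";
# pandas treats a multi-character pattern as a regular expression
pattern = r",\s*(?:and\s+)?|\s+and\s+"
sweaters["colors_list"] = sweaters["colors"].str.split(pattern)

# Count the colors for each sweater
sweaters["num_colors"] = sweaters["colors_list"].apply(len)

# Tidy form: explode() repeats each row once per list element
tidy = sweaters.explode("colors_list")

print(tidy[["colors_list", "num_colors"]])
```

In a notebook, running `sweaters['colors'].str.split?` shows the documentation for `str.split`.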



In [None]:
# Display the dataframe


### Wrangle Image Descriptions (The Python / Pandas Way)

In [None]:
# Replace NaN with empty string
sweaters['image_desc'] = sweaters['image_desc'].fillna('')

# Convert 'image_desc' column from a single-space-delimited string to a new list-of-strings column 'image_desc_list'
sweaters['image_desc_list'] = sweaters['image_desc'].str.split(' ')

# Calculate how many image_desc words are present & assign to new column 'num_words'
sweaters['num_words'] = sweaters['image_desc_list'].apply(len)

# Display the dataframe
sweaters
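
One detail worth checking in the cell above is how empty descriptions are counted. The plain-Python sketch below shows how splitting on an explicit single space differs from splitting on any whitespace; pandas' `str.split(' ')` follows the same convention for a single-character separator, so rows with an empty `image_desc` may end up with a word count of 1 rather than 0.

```python
# An empty string split on a single space yields one empty string
with_sep = "".split(" ")
print(with_sep, len(with_sep))        # → [''] 1

# Splitting with no argument (any whitespace) yields an empty list
no_sep = "".split()
print(no_sep, len(no_sep))            # → [] 0

# Repeated spaces also produce empty strings with an explicit separator
double_space = "red  nose".split(" ")
print(double_space)                   # → ['red', '', 'nose']
print("red  nose".split())            # → ['red', 'nose']
```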

# STEP 3: VISUALIZE DATA

In [None]:
# Pandas built-in plot tools (which use Matplotlib under the hood)
# This is convenient, but doesn't give as much control as using the Matplotlib API
sweaters.plot.scatter('num_colors', 'num_words')

## BONUS: NumPy and Matplotlib API
There are several ways to visualize our results... most are built on top of the Matplotlib package.

In [None]:
import matplotlib as mpl
from matplotlib import pyplot as plt

import numpy as np

# Change default plot size
plt.rcParams['figure.figsize'] = (12, 8)

# There are many pre-defined styles... view the available options
print(mpl.style.available)
# or use the default style
plt.style.use('default')

In [None]:
# Matplotlib scatter plot doesn't have built-in jitter option...
# but it's not too hard

def jitterify(arr, factor=0.01):
    """Add jitter 'factor' to 'arr' data
    :param arr: array-like, eg: list, ndarray
    :param factor: float, 0.0 -> 1.0
    :return: arr with added jitter
    """
    assert 0.0 <= factor <= 1.0, f"Error, invalid factor {factor}"
    arr = np.array(arr)
    assert arr.ndim == 1, f"Expected 1-d array, got {arr.ndim}"
    ptp = np.ptp(arr)  # peak-to-peak range; the ndarray.ptp() method was removed in NumPy 2.0
    jitter = np.random.randn(arr.size) * factor * ptp
    return arr + jitter
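
Because the jitter is random, the plot changes slightly on every run. If repeatable figures matter, one option is a seeded variant. The function name `jitter_seeded` and its `seed` parameter below are hypothetical and not part of this notebook.

```python
import numpy as np

def jitter_seeded(arr, factor=0.01, seed=0):
    """Hypothetical variant of jitterify that takes a seed for repeatable output."""
    arr = np.asarray(arr, dtype=float)
    rng = np.random.default_rng(seed)
    # Scale the noise by the data range so 'factor' is relative to the spread
    spread = np.ptp(arr)
    return arr + rng.standard_normal(arr.size) * factor * spread

x = [1, 1, 2, 2, 5]
a = jitter_seeded(x, factor=0.01, seed=42)
b = jitter_seeded(x, factor=0.01, seed=42)

# The same seed gives the same jitter on every call
print(np.array_equal(a, b))
```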

In [None]:
# Scatterplot docs:
# https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html

fontsize = 12
markersize = 75
color = '#424242'
alpha = 0.4
jitter = 0.01

# Instantiate plot objects
fig, ax = plt.subplots()

# Add labels to axes
ax.set_xlabel('Number of colors on sweater', fontsize=fontsize, color='k')
ax.set_ylabel('Number of words\nin sweater description', fontsize=fontsize, color='k')
# Add figure title
fig.suptitle("Relationship between the number of colors and\nlength of description for ugly holiday sweaters",
            color='k', fontsize=fontsize + 2)

# Specify what data to plot
x = sweaters.num_colors
y = sweaters.num_words
# Add jitter to the data so completely overlapping points are visible
x = jitterify(x, jitter)
y = jitterify(y, jitter)

# Plot
ax.scatter(x, y,
            c=color,
            s=markersize,
            alpha=alpha,
            edgecolors=color,
            linewidths=1.
           )

# Add a linear (degree-1) polyfit line
coeffs = np.polyfit(x, y, 1)
xlim = ax.get_xlim()
ax.plot(xlim, np.polyval(coeffs, xlim), color='k', alpha=0.9)

# Set tick increment
incr_x = 2
incr_y = 5
ax.xaxis.set_major_locator(mpl.ticker.MultipleLocator(incr_x))
ax.yaxis.set_major_locator(mpl.ticker.MultipleLocator(incr_y))
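
To see what the `np.polyfit` and `np.polyval` calls in the cell above return, here is a minimal sketch on synthetic points that lie exactly on a known line. The data is illustrative only and unrelated to the sweaters dataset.

```python
import numpy as np

# Synthetic points that lie exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Degree 1 requests a straight line; coefficients come back highest power first
slope, intercept = np.polyfit(x, y, 1)

# polyval evaluates the fitted polynomial at new x values
y_at_10 = np.polyval([slope, intercept], 10.0)

print(round(slope, 6), round(intercept, 6), round(y_at_10, 6))
```

Note that the plotting cell fits the line to the jittered values, so its coefficients will differ slightly from a fit on the raw `num_colors` and `num_words` columns.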

### Seaborn library
This library is designed with data science and clean aesthetics in mind... check it out!<br>
https://seaborn.pydata.org/

In [None]:
import seaborn as sns
sns.regplot(x=sweaters['num_colors'], y=sweaters['num_words'])