There are a lot of sections in here, so be sure to browse around - pandas, matplotlib, Illustrator, and some specifically for items in the homework.

# General pandas

## Importing

What's the full import statement? You'll want: pandas, matplotlib, pyplot, and you'll want to make sure graphs are going to be displayed in your notebook. And that fonts are going to save right in PDFs.

## Reading in files

When files are in a subdirectory, you can't just say `pd.read_csv("filename.csv")` - you need to say "oh you're in a subdirectory, let me go in there, too." Usually something like `pd.read_csv("my_folder/filename.csv")`.
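Here's a minimal sketch of that. The folder name `my_folder` and the data are made up, and the snippet writes a tiny CSV first just so it can run on its own.

```python
import os
import pandas as pd

# Set up a hypothetical subdirectory with a small made-up CSV inside it
os.makedirs("my_folder", exist_ok=True)
pd.DataFrame({"city": ["A", "B"], "population": [100, 200]}).to_csv(
    "my_folder/filename.csv", index=False
)

# The path includes the folder name, not just the filename
df = pd.read_csv("my_folder/filename.csv")
df.head()
```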

## Data types

Be sure to check your data types! Numbers should be ints or floats, dates should be datetimes (although usually it's okay for years to not be). You convert between most with `.astype`, but for dates you need to do something like `df['new_column'] = pd.to_datetime(df['date_column'], format="....")`.
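A small sketch of both conversions, using made-up columns where everything came in as text. The `format` string here is only an assumption about how these particular dates are written; yours will probably differ.

```python
import pandas as pd

# Made-up data where everything was read in as strings
df = pd.DataFrame({
    "price": ["10", "25", "7"],
    "date": ["01/15/2019", "02/20/2019", "03/05/2019"],
})

# Numbers: .astype does the conversion
df["price"] = df["price"].astype(int)

# Dates: pd.to_datetime, with a format that matches how the dates are written
df["date_parsed"] = pd.to_datetime(df["date"], format="%m/%d/%Y")

# Check the result: price is now an integer, date_parsed is a datetime
df.dtypes
```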

## Fixing things up

Sometimes it's easier to do things in pandas/matplotlib, sometimes it's easier to do in Illustrator. As a general rule, sorting, grids, and sizing are all easier in pandas and annotations are easier in Illustrator.

## Combining datasets

To combine datasets in pandas, you use `.merge`. In a perfect world, you'd do this to combine two dataframes named df1 and df2:

```python
df1.merge(df2)
```
to merge the LEFT dataset (df1) with the RIGHT dataset (df2). Unfortunately this usually doesn't work, because it requires your datasets to have column names in common so it can guess how to join them together.

Instead, you need to tell it two things: 

* `left_on`, the name of the join column for the first dataframe
* `right_on`, the name of the join column for the second dataframe.

It will usually be something like, "I want to join `city` in df1 with `municipality_name` in df2," in which case you'd run:

```python
df1.merge(df2, left_on='city', right_on='municipality_name')
```

This looks cool, but you also need to save your new merged dataframe somewhere. Your end result will probably look like this:

```python
merged = df1.merge(df2, left_on='city', right_on='municipality_name')
merged.head()
```
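To see what a merge actually does to your rows, here's a sketch with two tiny made-up dataframes. One thing worth noticing: by default `.merge` only keeps rows that have a match on both sides, so it's a good idea to compare lengths before and after.

```python
import pandas as pd

# Made-up data: the third city has no match in df2
df1 = pd.DataFrame({
    "city": ["Alphaville", "Betatown", "Gammaburg"],
    "population": [30000, 25000, 8000],
})
df2 = pd.DataFrame({
    "municipality_name": ["Alphaville", "Betatown"],
    "mayor": ["Lee", "Smith"],
})

merged = df1.merge(df2, left_on="city", right_on="municipality_name")

# Compare row counts to see whether anything got dropped
print(len(df1), len(merged))
merged.head()
```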

## Reversing

Usually you can use `.sort_values()` to sort things, but every now and again you just want it to go in reverse. You can use `ascending=False` or you can lose your mind and do something like:

```python
df.iloc[::-1]
```

Scary, right? You can also do it on a single column, `df['Name'].iloc[::-1]`.
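The two aren't the same thing, which a tiny made-up example shows: `ascending=False` sorts by value, while `.iloc[::-1]` just flips whatever order the rows are already in.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Ana", "Bo", "Cy"], "score": [3, 9, 5]})

# Sorted from highest score to lowest
by_score = df.sort_values("score", ascending=False)

# Same rows, no sorting, just flipped top-to-bottom
flipped = df.iloc[::-1]
```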

# Graphing

## Saving graphs

Make sure you're always doing the same imports up top!

```python
import pandas as pd
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

matplotlib.rcParams['pdf.fonttype'] = 42
```

Especially that last one, so when you save your PDFs they'll have editable text. You can save using this command:

```python
plt.savefig("output.pdf")
```

It will only work if it's in the **same cell** as where you're graphing.
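Put together, a graph-and-save cell might look something like this sketch (the data is made up, and `output.pdf` is just the filename from above):

```python
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

# Keep text editable when the PDF is opened in Illustrator
matplotlib.rcParams['pdf.fonttype'] = 42

df = pd.DataFrame({"month": ["Jan", "Feb", "Mar"], "tea": [20, 30, 50]})

# Graph and save in the same cell, one right after the other
ax = df.plot(kind="bar", x="month", y="tea")
plt.savefig("output.pdf")
```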


## Multiple and stacked bar charts

A normal bar chart is just a bar for every row. But sometimes you want TWO bars for every row!

Let's say we had a coffee shop, and every row was a month. If we had a `tea` column, we could plot tea sales like this:

```python
df.plot(kind='bar', x='month', y='tea')
```

If we also had a `coffee` column, though, we could also make a grouped bar chart, where each month gets a bar for tea AND a bar for coffee.

```python
df.plot(kind='bar', x='month', y=['tea', 'coffee'])
```

If we were interested in the total between tea and coffee, we could also stack the tea and coffee on top of each other by adding `stacked=True`:

```python
df.plot(kind='bar', x='month', y=['tea', 'coffee'], stacked=True)
```

Sometimes you want something stacked out of 100% but your columns don't add up to 100%. In that case, do math.
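"Do math" here usually means dividing each value by its row's total. A sketch with made-up coffee shop numbers (this is one way to do it, not the only one):

```python
import pandas as pd

df = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar"],
    "tea": [20, 30, 50],
    "coffee": [80, 90, 50],
})

# Divide each row's values by that row's total, then multiply by 100
drinks = df[["tea", "coffee"]]
pct = drinks.div(drinks.sum(axis=1), axis=0) * 100
pct["month"] = df["month"]

# Every bar now adds up to 100
ax = pct.plot(kind="bar", x="month", y=["tea", "coffee"], stacked=True)
```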

## Putting multiple graphs on the same chart

If you have two separate plots you want to put on top of each other, you can do it two different ways:

1) You can just save them separately and cut and paste in Illustrator to stack them on top of each other. This way sucks.
2) You know how we made a plot and saved it to `ax` and then did like `ax.set_ylim(...)` and `ax.set_title(...)` and stuff? You can make a blank `ax`, and then keep telling graphs to go on top of it.

For example, this will create a separate plot for each continent:

```python
df.groupby('Continent').plot(x='GDP_per_capita', y='life_expectancy', marker='.', linestyle='')
```

It's kind of like a for loop, making a new plot for each group. Instead, we need to create a blank graph, and then say "hey `.plot`, use this graph!!!" We do that like this:

```python
fig, ax = plt.subplots()

df.groupby('Continent').plot(ax=ax, x='GDP_per_capita', y='life_expectancy', marker='.', linestyle='')
```

In the first line, `fig, ax = plt.subplots()` we are building a blank canvas. Why does the code look like that? I don't know, because we aren't allowed to have nice things.

Then each time we want to draw a chart, we pass `ax=ax` to `.plot` to tell it where to draw instead of making a totally new graph.

Sometimes this works when you're stacking separate graphs on top of each other, too (like layering), but sometimes it doesn't. I'd like to explain more but it seriously changes every 6 months and I have no idea where we're at right now.
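If you'd rather be explicit about the layering, one option is to write the for loop yourself, which also lets you give each group its own legend label. This is a sketch with made-up numbers, not the only way to do it:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data using the same column names as above
df = pd.DataFrame({
    "Continent": ["Asia", "Asia", "Europe", "Europe"],
    "GDP_per_capita": [1000, 5000, 20000, 30000],
    "life_expectancy": [60, 70, 78, 81],
})

fig, ax = plt.subplots()

# groupby hands back (name, rows) pairs, so each continent
# gets drawn on the same ax with its own label
for name, group in df.groupby("Continent"):
    group.plot(ax=ax, x="GDP_per_capita", y="life_expectancy",
               marker=".", linestyle="", label=name)
```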

## Highlighting points/categories/etc

There are a lot of ways to do this!

```python
# Building colors to pass to matplotlib or Seaborn
# Use as color=colors for matplotlib, palette=colors for Seaborn
def build_colors(row):
    if row['Country'] == 'Switzerland':
        return 'red'
    elif row['Country'] == 'Germany':
        return 'red'
    else:
        return 'lightgrey'

colors = df.reset_index().apply(build_colors, axis=1)
```

And then you use it by passing `color=` to your `.plot` method:

```python
df.plot(kind='bar', x='Country', y='life_expectancy', color=colors)
```

## Line graphs (and other types)

If you want a line graph, `style='-'`. If you want dots, `style='.'`. If you want lines AND dots, you want `style='.-'`. Why can't it all just be `kind='bar'` and `kind='scatter'` and `kind='line'` and `kind='line-with-dots'`? I don't know, life sucks.

You can also look at https://matplotlib.org/api/markers_api.html - instead of `.` you can use `o` or `v` or a million other things.

## Coloring your dots and such

* `color='red'` to make everything red
* `markeredgecolor='black'` to make the edges black
* `markerfacecolor='blue'` to make the fill color blue (why is it called face color and not fill color????)
* `markersize=10` (line plots), `s=10` (scatter plots), or `size=10` (some Seaborn plots) to change the size of the circles (which one you need depends on the kind of graph you're making)
 
If you want the marker color or edge color or whatever to disappear, make the color `'none'`.
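Here's a sketch that combines the style and color options on a line graph. The `year` and `sales` columns are made up.

```python
import pandas as pd

df = pd.DataFrame({"year": [2005, 2006, 2007, 2008], "sales": [3, 5, 4, 8]})

ax = df.plot(
    x="year", y="sales",
    style=".-",                # lines AND dots
    color="red",               # red line
    markeredgecolor="black",   # black outline on each dot
    markerfacecolor="none",    # no fill, so the dots are hollow
    markersize=10,
)
```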

## Size of the graph

`figsize=(10, 20)` when you're graphing, or when you're doing `plt.subplots` if you're doing the `ax=ax` trick.
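Both spots look like this in practice (made-up data again). The numbers are width first, then height, in inches.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"month": ["Jan", "Feb", "Mar"], "tea": [20, 30, 50]})

# Option 1: pass figsize straight to .plot
ax1 = df.plot(kind="bar", x="month", y="tea", figsize=(10, 20))

# Option 2: pass figsize when building the blank canvas
fig, ax2 = plt.subplots(figsize=(10, 20))
df.plot(kind="bar", x="month", y="tea", ax=ax2)
```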

## Grid Lines

To add/remove/adjust grid lines, follow this guide: http://jonathansoma.com/lede/data-studio/matplotlib/adding-grid-lines-to-a-matplotlib-chart/

Or just note you can do things like this:

```python
ax.grid(True, axis='y', which='major', linestyle='-', linewidth=0.5, color='red')
```

To turn on the grid for a specific part of an axis and give it a specific style/color.

`linestyle` can be `:` or `-` or `--` or `-.` - read here if you don't believe me: https://matplotlib.org/devdocs/gallery/lines_bars_and_markers/line_styles_reference.html. You can also change the dashes in Illustrator, which might be easier.

"WHAT DOES 'major' MEAN???" - read the next section.

## Tick Marks

There are two kinds of tick marks on an axis. MINOR and MAJOR. MAJOR get labels. MINOR do not.

When setting tick marks on the x or y axis, you can use `.set_xticks` and `.set_yticks` to make tick mark labels appear in specific places. For example:

```python
ax.set_xticks([2005, 2006, 2007, 2008])
```

But!!! The above doesn't work if your data is actual dates, only if it's integers. If you try to do the above with a datetime, you'll get an error.

Instead, you need to reach in and adjust the MAJOR ticks (because you want labels). If you wanted to make there be a tick mark WITH A LABEL every single month, it would look like this:

```python
import matplotlib.dates as dates
ax.xaxis.set_major_locator(dates.MonthLocator())
```

If you also wanted there to be a tick mark WITHOUT a label (minor) every 5 days, it would look like this:

```python
ax.xaxis.set_minor_locator(dates.DayLocator(interval=5))
```
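For a complete picture, here's a sketch with made-up daily data. It draws with `ax.plot` directly, on the assumption that you want matplotlib's own date handling in charge of the axis. The `DateFormatter` line is an extra that isn't covered above; it controls how the major labels are written.

```python
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as dates

# Made-up daily data covering about three months
df = pd.DataFrame({
    "date": pd.date_range("2019-01-01", periods=90, freq="D"),
    "value": range(90),
})

fig, ax = plt.subplots(figsize=(10, 3))
ax.plot(df["date"], df["value"])

# MAJOR ticks (with labels) at the start of every month
ax.xaxis.set_major_locator(dates.MonthLocator())
# MINOR ticks (no labels) every 5 days
ax.xaxis.set_minor_locator(dates.DayLocator(interval=5))
# Write the major labels like "Jan 2019"
ax.xaxis.set_major_formatter(dates.DateFormatter("%b %Y"))
```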

## Rotating tick marks

`rot=0`, `rot=90`, etc when using `plot`, OR select them all in Illustrator (hold down SHIFT when clicking them, or just draw a box that touches them all), `Object > Transform > Transform Each`, check 'Preview', and then play around with 'Angle'

## Adding an extra axis

Sometimes you just want more labels! Let's say you graphed something very long where you had everyone's name on the left-hand side but also wanted it on the right-hand side. You'd build your graph, save it as `ax`, then use the following code:

```python
# Duplicate the graph, giving it the same x axis
alt = ax.twinx()
# Set the new graph to have its tick marks in the same position
alt.set_yticks(ax.yaxis.get_ticklocs()) 
alt.set_ylim(ax.get_ylim())
# Set the labels for the tick marks to be the same, too
alt.set_yticklabels(df['Name'])
```

Why does it work? Who knows. But be careful, you might need to sort/reverse to make the labels correct.

# Illustrator

## Learning on Lynda

You can log onto Lynda.com (for free!) and find some tutorials with this link: https://ctl.columbia.edu/resources-and-technology/teaching-with-technology/tech-resources/lynda/

There are a _lot_ of lessons (and series) on there about using Illustrator, I'm honestly not sure what the good ones are! You could try something like the below:

* https://www.lynda.com/Illustrator-tutorials/Illustrator-CC-2019-Essential-Training/756294-2.html
* https://www.lynda.com/Illustrator-tutorials/Illustrator-CC-2019-One-One-Fundamentals-Revision/784289-2.html
* https://www.lynda.com/learning-paths/Design/become-a-digital-illustrator
* https://www.lynda.com/Illustrator-training-tutorials/227-0.html?category=beginner_337

Love something on there? Hate something? **Let everyone else know on Slack** so we can make sure we're using good tutorials!

## Opening things in Illustrator

Before you do _ANYTHING_ you should remove clipping masks.

1. Select everything (Command+A)
2. `Object > Clipping Mask > Release` (Command + Option + 7)
3. Keep releasing clipping masks again and again and again until it doesn't work any more

## Fill vs. stroke colors

Fill is the inside, stroke is the outline.

![](images/fill-stroke.png)

You select them separately. The white-with-a-red-line color means no color.

## Background colors

In Illustrator, draw a rectangle as big as your entire artboard, then do `Object > Arrange > Send to Back` to make it go behind everything else.

## Editing lines

`Window > Stroke` to open up the stroke menu, then you can change the size with "Weight" or make it dashed with "Dashed line" (you might need to click the little... thing in the upper right-hand corner of the Stroke window and pick 'Show options' to be able to see that)

![](images/stroke-stuff.png)

## Selecting multiple things in Illustrator

Hold shift, click multiple things. Or click and drag a box around them.

## Selecting all of the _____

Things that look like what you have selected: `Select > Same > Appearance` or `Fill Color` or `Stroke Color` or whatever

Text: `Select > Object > All Text Objects`

## My grid/axis lines are on top of my chart!

Select the line, then `Object > Arrange > Send to Back`

## I sent something to the back and it disappeared!!!

Maybe you have a white rectangle as a background? Try clicking the background and hitting delete.

## Rotating text or other things

Click it (black arrow), then move your mouse around its edge until you see a thing that kind of implies you can rotate it. Click and drag.

## Drawing straight lines or rotating nicely

Hold shift while you draw the line or rotate or move a thing and it will go straight.

## Lining up things

When you have multiple things selected, the `Align` bar becomes active at the top. You can... align things with other things using it instead of manually pushing things around. You might want to play around with the different "Align to..." options.

![](images/align.png)

The "key object" one can be pretty good, as it uses the "key object" as an anchor and moves everything around it. You select the key object by clicking (without holding shift) after you've made your selection. Key object = blue box.

# Tips specifically for Part 2

## Making donut charts

Donut charts are just pies missing the center. So if you make a pie and just _draw a big white circle in the middle_, it suddenly becomes a donut.

## Making pie charts

If `kind='scatter'` makes a scatter and `kind='bar'` makes a bar, how do you think you make a pie? Well, [`kind='pie'`](https://pandas.pydata.org/pandas-docs/stable/visualization.html#pie-plot)!

If you pass `labels=` to the pie chart maker it'll label your data. Try passing `labels='region'` and see the totally-wrong-but-kind-of-funny thing that happens. Then fix it! (or ask me about it).

## Multiple pie charts

If you're trying to do multiple pie charts from the same data (which you are), you have a few options:

**Combining in Illustrator**

Make a pie chart for each graph, combine them in Illustrator.

**Building them in the same matplotlib graph**

First, select _only_ the columns you want to make the pie from. You can do this two ways:

1. Using the super-weird `df[['col1','col2']]` style multiple-column selection
2. You could also use `df.drop('col3', axis=1)` (without `inplace=True`, you don't want it to last forever!)

Try that in a cell by itself to make sure you've gotten rid of `region`. Once it looks okay, put a `.plot` right after that and tell it to make a pie, but with `subplots=True`.
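As a rough sketch of the shape of that code, here it is with made-up columns (`apples` and `oranges` are hypothetical, not the homework data). Labels are left alone on purpose.

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["North", "South", "East"],
    "apples": [10, 20, 30],
    "oranges": [5, 15, 25],
})

# Drop the text column (not in place, so df keeps it),
# then ask for one pie per remaining column
axes = df.drop("region", axis=1).plot(kind="pie", subplots=True, figsize=(8, 4))
```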

## The Economist arrow chart??

We aren't making an arrow chart, it's ugly. We're doing another kind of chart!

### Part One: The lines

It doesn't work with pandas, so we're going to be talking right to matplotlib. We're going to use something called `ax.hlines`.

To see how `hlines` works, try this out:

```python
fig, ax = plt.subplots()
ax.hlines(xmin=[1, 2, 3, 4], xmax=[7, 8, 9, 20], y=[1, 2, 3, 4])
```

The lines start from `xmin` and go to `xmax`. First xmin + first xmax + first y, second xmin + second xmax + second y, etc.

Start by adapting the code above to build the chart below.

![](images/dotlines-0.png)

While the y axis is _obviously_ the region, it won't let you do it! It wants numbers! So we'll cheat and use `df.index`, which is the `0, 1, 2...` thing on the left-hand side.

### Part Two: The dots

`ax.scatter` is going to be your new friend. It takes an `x` and a `y`, and you'll use it to add little nubs for the 2011 profits. Remember, since it's `ax.scatter` we're talking directly to matplotlib and we can't just use column names, we have to give it the `df`, too.

![](images/dotlines-1.png)

### Part Three: The labels

Just copy and paste this part! First we're saying "put the y ticks at these numbers" then we're saying "oh wait use these words instead." Why do we need both? I don't know, programming.

```python
ax.set_yticks(df.index)
ax.yaxis.set_ticklabels(df['region'])
```

![](images/dotlines-2.png)

### Part Four: Your options from here

You're picking the style, right? So do whatever you want!

1. You could also do a dot for 2007, just in a different color. It would also use `ax.scatter`.
2. If you really wanted to do arrows you could: get rid of the dots and use the 'Arrowheads' part of the 'Stroke' menu in Illustrator.
3. Would any annotations or grids be useful here? What are you trying to stress?
4. How do you feel about the size of the dots and the lines?
5. You probably wanted to sort it (especially since I told you to). If you use `.sort_values` it actually won't work because we keep using the index in a kind of unusual way (the left-hand-side number, the index, gets out of order). So after you sort, try `.reset_index(drop=True)`. Save that back into your `df` to update it forever.
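To see the index problem from that last tip, here's a tiny sketch with made-up regions and profits:

```python
import pandas as pd

df = pd.DataFrame({"region": ["North", "South", "East"], "profit": [5, 1, 3]})

# After sorting, the index numbers travel with their rows
sorted_df = df.sort_values("profit")
print(sorted_df.index.tolist())   # [1, 2, 0]

# reset_index(drop=True) renumbers from the top; save it back into df
df = sorted_df.reset_index(drop=True)
print(df.index.tolist())          # [0, 1, 2]
```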

## Colors won't change from grey

Open up the colors menu, click the thing in the upper right-hand corner, and select 'RGB' instead of 'Greyscale'.

## Bar plots with multiple things going on

You can give multiple `y` values, like `y=['col1','col2']` if you want to have a grouped bar plot. You add `stacked=True` if you want them stacked on top of each other.

## Invisible grids

Sometimes instead of having a grid you can see, you only see the grid lines where they intersect with your data. I'd probably graph the grid in an ugly color, then select the lines and `Object > Arrange > Bring to Front`, then make them white.

## More arrow tips for The Guardian one

Okay, maybe we didn't like the arrow chart before, but I guess we're doing one now. Make it similarly to how you made the one for the Economist.

It sure seems like all of the arrows are coming from one place. Maybe `hlines` is fine with that?

Make the arrowheads in Illustrator. `Window > Stroke` to open the stroke menu, then click the upper right-hand corner to make sure **Show Options** is turned on. Then you can play with the **Arrowheads** options.

## Tips for the Guardian commuting one

There are a _lot_ of ways to do all of the bits and pieces for this one! If you have a thought and want to talk through it before you start, feel free to chat me up in Slack.

By the way, did you know Illustrator has a graph tool?

![](images/illustrator-graph.png)

Sometimes instead of fighting with matplotlib it's an easy way to get lines or boxes or whatever that are the right relative size. Feel free to play around with it.