# Week 1 Data Vis Tutorial (Python)

In this tutorial, you will learn the basics of loading data and drawing simple plots with **Python**.

Because this **Jupyter Notebook** is hosted online, I have already placed the two .csv files "on disk" (i.e. where the notebook can read them). If you are not familiar with Python or Jupyter Notebook: when you work with data offline, it is common practice to store your data files in the same location as the notebook itself (i.e. the .ipynb file).

Now, to load the data files, we will use a **Python module** called *Pandas*, which provides data frames similar to those in R and to the spreadsheets used in Excel/SPSS/Stata/etc. Run the following cell (i.e. press play) to import this module:

In [None]:
import pandas as pd

Now, to load both files into *Pandas Data Frames*, run the following cells:

In [None]:
temp = pd.read_csv('../input/cm4125-week-1-lab/temp.csv')

In [None]:
pirates = pd.read_csv('../input/cm4125-week-1-lab/pirates.csv')

You can see the contents of the two variables by **writing the name** of each one as the last line of a cell (if you use the `print` function, they won't appear as neat as in the following cells):

In [None]:
temp

In [None]:
pirates
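When a table is long, it can help to look at only part of it. The following is a minimal sketch that uses a small made-up data frame (the variable `toy` and its values are hypothetical, not the contents of the .csv files) to show `head`, `shape` and `columns`:

```python
import pandas as pd

# Made-up values for illustration only; the real numbers live in the .csv files
toy = pd.DataFrame({
    "Year": [1820, 1860, 1900, 1940, 1980],
    "Global Average Temperature (Celsius)": [14.2, 14.4, 14.6, 14.9, 15.3],
})

first_rows = toy.head(3)          # only the first three rows
n_rows, n_cols = toy.shape        # (number of rows, number of columns)
column_names = list(toy.columns)  # the column headers as a plain list

print(first_rows)
print(n_rows, n_cols)
print(column_names)
```

The same calls should work on `temp` and `pirates` once they are loaded.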

Given that we have two tables with the same number of rows (and, more importantly, the *same* year in each row), we can create a single data frame with all the data in it!

We will see many ways to do this later on during the module, but for the time being, we will **concatenate** the columns we need into a variable called `df`:

In [None]:
df = pd.concat([temp, pirates], axis=1)
df

Unfortunately, the `Year` column has been duplicated! This can be avoided through several other operations, but for now we will simply **drop the duplicated column** by using the built-in Pandas method `duplicated` on the column names:

In [None]:
df = df.loc[:,~df.columns.duplicated()]
df
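One of the other operations that avoids the duplicated column is a merge on `Year`. This is only a sketch under assumptions: `toy_temp` and `toy_pirates` are hypothetical stand-ins with made-up values, built so the example runs without the .csv files.

```python
import pandas as pd

# Made-up values for illustration only
toy_temp = pd.DataFrame({
    "Year": [1820, 1860, 1900],
    "Global Average Temperature (Celsius)": [14.2, 14.4, 14.6],
})
toy_pirates = pd.DataFrame({
    "Year": [1820, 1860, 1900],
    "Number of Pirates (Approximate)": [45000, 35000, 20000],
})

# Rows are matched by the value in "Year", so that column appears only once
merged = pd.merge(toy_temp, toy_pirates, on="Year")
print(merged)
```

Unlike `pd.concat` with `axis=1`, which lines rows up by their index, a merge pairs rows that share the same `Year`, so it does not depend on both tables being in the same order.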

Now it's time to plot! Firstly, we will import the **Matplotlib** module, which allows us to draw plots easily:

In [None]:
import matplotlib.pyplot as plt

Then, we will draw a *scatterplot* with the pirates on the x-axis and the temperature on the y-axis:

In [None]:
plt.scatter(df['Number of Pirates (Approximate)'], df['Global Average Temperature (Celsius)'])

The difference between this plot and the first Pastafarian one is that, if the year is not considered, Python (like almost any other plotting tool) places the points along the x-axis by their numeric value and not by year.

This is because, in reality, the first Pastafarian plot is more like a "bar chart" where the number of pirates acts as a set of discrete categories ordered by year, and the temperature is the value on the y-axis!

Therefore, we need to create a plot that mimics this design:

In [None]:
# First we convert the pirate data into strings so that matplotlib doesn't want to sort them
xaxis = [str(i) for i in df['Number of Pirates (Approximate)']]
plt.plot(xaxis,df['Global Average Temperature (Celsius)'],'--bo')
plt.title('Global Temp vs Pirates')
plt.xlabel('Number of Pirates (Approximate)')
plt.ylabel('Global Average Temperature (Celsius)')
plt.ylim(13,16.5)
plt.show()
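To see why the conversion to strings matters, here is a small side-by-side sketch. The lists `counts` and `values` are made up and deliberately not in increasing order; they are not taken from the data files.

```python
import matplotlib.pyplot as plt

# Made-up values, deliberately not in increasing order
counts = [500, 100, 300]
values = [1.0, 2.0, 3.0]

fig, (ax_num, ax_cat) = plt.subplots(1, 2, figsize=(8, 3))

# Numeric x: each point sits at its numeric position, so 100 is left of 300 and 500
ax_num.plot(counts, values, '--bo')
ax_num.set_title('Numeric x')

# String x: each label is treated as a category, placed in order of appearance
ax_cat.plot([str(c) for c in counts], values, '--bo')
ax_cat.set_title('String (categorical) x')

fig.tight_layout()
```

In the right-hand panel the labels read 500, 100, 300 from left to right, which is the behaviour the Pastafarian-style plot relies on.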

**BONUS: CAN YOU ADD THE YEARS ON TOP OF EACH DOT?**

In [None]:
## Use this cell for your code


To plot the second version of the Pastafarian plot (i.e. the one with two lines), we can use Matplotlib as well:

In [None]:
fig = plt.figure()
ax = plt.axes()
ax.plot(df['Year'], df['Number of Pirates (Approximate)'])
ax.plot(df['Year'], df['Global Average Temperature (Celsius)'])

Once again, there are major differences between our plot and the second Pastafarian one! The main issue now is that the *scale* of the pirate data is vastly different from that of the temperature data, and thus the slope in the temperature data cannot be appreciated!

We need to specify that we want both lines drawn with different scales in the same plotting space (something that I highly discourage; we will see why later in this module!):

In [None]:
import matplotlib.pyplot as plt


fig, ax1 = plt.subplots()

color = 'tab:red'
ax1.set_xlabel('Year')
ax1.set_ylabel('Global Average Temperature (Celsius)', color=color)
ax1.plot(df['Year'], df['Global Average Temperature (Celsius)'], color=color)
ax1.tick_params(axis='y', labelcolor=color)

ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis

color = 'tab:blue'
ax2.set_ylabel('Number of Pirates (Approximate)', color=color)  # we already handled the x-label with ax1
ax2.plot(df['Year'], df['Number of Pirates (Approximate)'], color=color)
ax2.tick_params(axis='y', labelcolor=color)

fig.tight_layout()  # otherwise the right y-label is slightly clipped
plt.show()
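For comparison, one possible alternative to two y-axes in the same plotting space is to stack two panels that share the x-axis. This is only a sketch, not the design of the Pastafarian plot; the data frame `toy` holds made-up values so the example runs on its own.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    'Year': [1820, 1860, 1900, 1940, 1980],
    'Global Average Temperature (Celsius)': [14.2, 14.4, 14.6, 14.9, 15.3],
    'Number of Pirates (Approximate)': [45000, 35000, 20000, 5000, 400],
})

# Two rows, one column; sharex=True keeps the years aligned across both panels
fig, (ax_top, ax_bottom) = plt.subplots(2, 1, sharex=True)

ax_top.plot(toy['Year'], toy['Global Average Temperature (Celsius)'], color='tab:red')
ax_top.set_ylabel('Temperature (Celsius)')

ax_bottom.plot(toy['Year'], toy['Number of Pirates (Approximate)'], color='tab:blue')
ax_bottom.set_ylabel('Pirates')
ax_bottom.set_xlabel('Year')

fig.tight_layout()
```

Each panel keeps its own y-scale, so the slope of the temperature line stays visible without overlaying two unrelated scales.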