# First Steps with pandas
### Starter Code
* **PyData Bristol Meetup:** https://www.meetup.com/PyData-Bristol/events/268081062/
* **Date:** Thu 27th February 2020
* **Instructor:** John Sandall
* **Contact:** john@coefficient.ai / [@john_sandall](https://twitter.com/john_sandall)

<div class="alert alert-info">
    This workshop is sponsored by <a href="https://coefficient.ai/">Coefficient</a>. If you are interested in either Python training for your company or organisation, or in consultancy services to help accelerate delivery of your data science, analytics, data engineering or machine learning projects, please visit <a href="https://coefficient.ai/">https://coefficient.ai/</a> or contact me at <a href="mailto:john@coefficient.ai">john@coefficient.ai</a>.
</div>

---

# Packages in Python
Packages are libraries of code written to solve a particular set of problems. Some are "built-in" and are part of Python. For example, the `math` module:

In [None]:
import math

In [None]:
math.pi

You can import pi directly too:

In [None]:
from math import cos, pi

In [None]:
cos(pi)
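
The two import styles differ only in which names end up available in your notebook. A minimal sketch, using only the built-in `math` module (the variable names are just for illustration):

```python
import math                # binds the name `math`; use math.pi, math.cos, ...
from math import cos, pi   # binds `cos` and `pi` directly

# Both styles refer to the very same objects
same_pi = math.pi is pi
same_cos = math.cos is cos

# The cosine of pi is -1.0
value = cos(pi)
print(same_pi, same_cos, value)
```

Importing the whole module keeps it obvious where a name came from, whereas importing names directly saves typing. Either is fine for short notebooks.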

There are many "third party" (i.e. made by some Python enthusiast, so make sure you trust what you're installing!) Python packages relevant to data science and analytics. Commonly used are `pandas`, `scikit-learn`, `NumPy`, `matplotlib` and more.

These can be installed with "package managers" such as [pip](https://pypi.org/project/pip/) or [Conda](https://docs.conda.io/en/latest/). For example, if you installed Python using Anaconda, typing this into your command line will install the seaborn package:

```
conda install seaborn
```

You can reach your command line on Windows by searching for "Anaconda Prompt" or on macOS by opening up a Terminal window. If you are using Jupyter then it's even easier: just click "New" > "Terminal" (in JupyterLab, look under the "File" menu).

### A small warning
Ensure you trust the packages you're installing. Ask: who wrote this code? Do I trust giving their code access to my computer? **pip** especially can be a little more dangerous: anyone can upload something to PyPI (where pip installs from), so be careful, especially as the [smallest typo can be a security risk](https://snyk.io/blog/malicious-packages-found-to-be-typo-squatting-in-pypi/). **conda** is more of a walled garden and as such is much better positioned for enterprise usage.

### Some easter eggs

In [None]:
import this

This is the philosophy of Python. It can help you understand which of two solutions is the more "Pythonic". My favourite: `Explicit is better than implicit`.

This next one is an homage to the delightful xkcd webcomic:

In [None]:
import antigravity

# Data Analysis Packages
Data Scientists use a wide variety of libraries in Python that make working with data significantly easier. Those libraries primarily consist of:

| Package | Description |
| -- | -- |
| `NumPy` | Numerical calculations - does all the heavy lifting by passing out to C subroutines. This means you get _both_ the productivity of Python, _and_ the computational power of C. Best of both worlds! |
| `SciPy` | Scientific computing, statistical tests, and much more! |
| `pandas` | Your data manipulation Swiss Army knife. You'll likely see pandas used in any PyData demo! pandas is built on top of NumPy, so it's **fast**. |
| `matplotlib` | An old but powerful data visualisation package, inspired by MATLAB. |
| `seaborn` | A newer and easy-to-use but limited data visualisation package, built on top of matplotlib. |
| `scikit-learn` | Your one-stop machine learning shop! Classification, regression, clustering, dimensionality reduction and more. |
| `nltk` and `spacy` | nltk = natural language processing toolkit; spacy is a newer package for natural language processing but very easy to use. |
| `statsmodels` | Statistical tests, time series forecasting and more. The "model formula" interface will be familiar to R users. |
| `requests` and `Beautiful Soup` | `requests` + `Beautiful Soup` = great combination for building web scrapers. |
| `Jupyter` | Jupyter itself is a package too. See the latest version at https://pypi.org/project/jupyter/, and upgrade with e.g. `conda install jupyter==1.0.0` |

There are countless others available, too.

For today, we'll focus on the library that makes up 99% of our work: `pandas`. But first, since pandas is built on top of the speed and power of NumPy, let's dig into that briefly.

# NumPy is _fast_

Import "as np" because Python programmers are lazy, and two letters beats five.

> _"I choose a lazy person to do a hard job. Because a lazy person will find an easy way to do it."_
> – attributed to Bill Gates

In [None]:
import numpy as np

In [None]:
# Generate a random whole number from 1 to 99 (the upper bound, 100, is excluded)
np.random.randint(1, 100)

How does this function work? Let's use Jupyter's in-built help!

In [None]:
?np.random.randint
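
The `?` shortcut only works in Jupyter/IPython. As a rough sketch of what to look for in that help text: `help()` is the plain-Python equivalent, and the key detail is that `randint`'s upper bound is exclusive. The sample size and variable names here are my own choices.

```python
import numpy as np

# help() is built into Python and prints the same docstring that `?` shows
# help(np.random.randint)   # uncomment to print the full documentation

# randint(low, high) draws integers from low (inclusive) to high (exclusive)
samples = np.random.randint(1, 100, size=100_000)

print("Smallest value drawn:", samples.min())
print("Largest value drawn:", samples.max())
```

However many samples you draw, the values never fall below 1 or rise above 99.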

---

Let's create a randomly generated NumPy "array".

In [None]:
# Generate 100 random whole numbers from 1 to 99 (the upper bound, 100, is excluded)
array = np.random.randint(1, 100, size=100)
array

Arrays behave a bit like Python lists. For example, you can slice them:

In [None]:
array[:5]

But they're more powerful: they can do things Python lists don't know how to do.

In [None]:
print(array.mean())  # average
print(array.sum())  # sum
print(array.min())  # min
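
To make the difference from lists concrete, here is a small sketch (values chosen purely for illustration) of how the same operators behave on a list and on an array:

```python
import numpy as np

numbers_list = [1, 2, 3]
numbers_array = np.array([1, 2, 3])

# Multiplying a list repeats it
repeated = numbers_list * 2          # [1, 2, 3, 1, 2, 3]

# Multiplying an array multiplies every element
doubled = numbers_array * 2          # array([2, 4, 6])

# Comparisons are element-wise too, giving an array of True/False values
is_big = numbers_array > 1           # array([False,  True,  True])

print(repeated, doubled, is_big)
```

This element-wise behaviour is what pandas builds on when we filter and transform columns later.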

---

> **Exercise:** Use argmax and the array's index selector (e.g. array[3] gives element at position 3) to return the array's max value.

In [None]:
# This is the maximum value in array
print("Max value =", array.max())

# This is the index location of the maximum value in the array
print("Index of max value =", array.argmax())

In [None]:
# Exercise: Use argmax and the array's index selector (e.g. array[3] gives element at position 3)
#           to return the array's max value.




---

We can "chain" these in-built functions like `randint` and `mean` all in one line:

In [None]:
# Average of 10000 random whole numbers from 1 to 99
np.random.randint(1, 100, size=10000).mean()

Finally, note how blazing fast NumPy is:

In [None]:
# Average of 1000000 random whole numbers from 1 to 99
np.random.randint(1, 100, size=1000000).mean()
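
To put a rough number on "fast", one option is to time the same calculation in plain Python and in NumPy. This is only a sketch: the array size and the use of `time.perf_counter` are my choices, and the timings will vary from machine to machine.

```python
import time

import numpy as np

values = np.random.randint(1, 100, size=1_000_000)
values_list = values.tolist()  # the same numbers as a plain Python list

# Plain Python: loop over a list of Python integers
start = time.perf_counter()
python_mean = sum(values_list) / len(values_list)
python_seconds = time.perf_counter() - start

# NumPy: the loop runs in compiled C code
start = time.perf_counter()
numpy_mean = values.mean()
numpy_seconds = time.perf_counter() - start

print(f"Python: {python_seconds:.4f}s, NumPy: {numpy_seconds:.4f}s")
```

Inside Jupyter, the `%timeit` magic is a more convenient way to take the same kind of measurement.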

---

# Manipulating Data In Python Using Pandas

[pandas](http://pandas.pydata.org/pandas-docs/stable/) is a library built on top of NumPy, which allows us to use Excel-like tables in Python. These special tables are called DataFrames, the primary object in `pandas`.

Similar to before, `pd` is two characters whilst `pandas` is six. It's less to type. You could use anything (e.g. `import pandas as cute_bears`) but `pd` is a common convention.

In [None]:
import pandas as pd

## European Union 2016 membership referendum results
> The United Kingdom's European Union membership referendum took place on 23 June 2016 to gauge support for the country either remaining a member of or leaving the European Union. The referendum resulted in a simple majority of 51.9% being in favour of leaving the EU.

For this guided practice we will manually input data relating to the 2016 EU membership referendum. We will apply the Python knowledge we have acquired and see how some popular packages such as pandas, NumPy and matplotlib are used.

Data source: https://en.wikipedia.org/wiki/United_Kingdom_European_Union_membership_referendum,_2016#Regional_count_results

### 1. Create the dataset

Let us first define some lists that can be used to create a pandas dataframe:

In [None]:
# Source: https://en.wikipedia.org/wiki/United_Kingdom_European_Union_membership_referendum,_2016#Regional_count_results
regions  = [
    'East Midlands',
    'East of England',
    'Greater London',
    'North East England',
    'North West England',
    'Northern Ireland',
    'Scotland',
    'South East England',
    'South West England',  # includes Gibraltar
    'Wales',
    'West Midlands',
    'Yorkshire and the Humber'
]

Let us now list the electorate and turnout for the regions in question:

In [None]:
electorate = [
    3384299,
    4398796,
    5424768,
    1934341,
    5241568,
    1260955,
    3987112,
    6465404,
    4138134,
    2270272,
    4116572,
    3877780
]

turnout = [
    .742,
    .757,
    .697,
    .693,
    .700,
    .627,
    .672,
    .768,
    .767,
    .717,
    .720,
    .707
]

We are now ready to put all the information above in a single dataframe:

In [None]:
raw_data_dictionary = {
    'region': regions,
    'electorate': electorate,
    'turnout': turnout
}

results = pd.DataFrame(raw_data_dictionary)

In [None]:
results
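
The same pattern works for any dictionary of equal-length lists: each key becomes a column name and each list becomes that column's values. A toy sketch with made-up data (not part of the referendum dataset):

```python
import pandas as pd

# Hypothetical data, purely for illustration
toy_data = {
    'fruit': ['apple', 'banana', 'cherry'],
    'price': [0.5, 0.25, 3.0],
}
toy = pd.DataFrame(toy_data)

print(toy.columns.tolist())  # column names come from the dictionary keys
print(toy.shape)             # (number of rows, number of columns)
```

If the lists have different lengths, pandas raises an error rather than guessing how to line them up.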

### 2. Selecting columns

How do we select a single column?

In [None]:
results[['region']]

> **Exercise:** Try selecting just the turnout column.

In [None]:
# Enter your answer here


We can also select multiple columns:

In [None]:
results[['region', 'turnout']]

**Don't forget** to use double brackets here! Technically, we're passing a list of column names into the DataFrame's `[]` selector:

In [None]:
cols_to_select = ['region', 'turnout']
results[cols_to_select]
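
The number of brackets changes what you get back. A small sketch on made-up data: single brackets return a pandas Series (one column of values), while double brackets return a DataFrame (a table), even when the list names only one column.

```python
import pandas as pd

# Hypothetical data, purely for illustration
toy = pd.DataFrame({'fruit': ['apple', 'banana'], 'price': [0.5, 0.25]})

one_column_series = toy['price']    # single brackets: a Series
one_column_table = toy[['price']]   # double brackets: a one-column DataFrame

print(type(one_column_series).__name__)
print(type(one_column_table).__name__)
```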

Let us take a look at the first 5 entries:

In [None]:
results.head()

> **Exercise:** Copy the above and replace `.head()` with `.tail(10)` to get the last 10 entries in the dataframe.

In [None]:
# Enter your answer here...


### 3. Using `map()` and `apply()`

In [None]:
# What are the regions?
results[['region']]

In [None]:
# We can get just the values like this (technical note, this is a pandas "Series")
results.region

In [None]:
# We can turn the above into a normal Python list too
results.region.tolist()

In [None]:
# Let's create a dictionary lookup from region to country
countries = {
    'East Midlands': 'England',
    'East of England': 'England',
    'Greater London': 'England',
    'North East England': 'England',
    'North West England': 'England',
    'Northern Ireland': 'Northern Ireland',
    'Scotland': 'Scotland',
    'South East England': 'England',
    'South West England': 'England',
    'Wales': 'Wales',
    'West Midlands': 'England',
    'Yorkshire and the Humber': 'England'
}

In [None]:
# We can now "look up" the country (i.e. the dictionary value) associated with the
# dictionary key for Greater London.
countries['Greater London']
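
Looking up a key that isn't in the dictionary raises a `KeyError`. A short sketch, using a cut-down, hypothetical version of the lookup, showing the error and the gentler `.get()` alternative:

```python
# A cut-down lookup, purely for illustration
lookup = {'Scotland': 'Scotland', 'Wales': 'Wales'}

# Square brackets raise a KeyError for unknown keys
try:
    lookup['Atlantis']
    found = True
except KeyError:
    found = False

# .get() returns a default value instead of raising an error
fallback = lookup.get('Atlantis', 'Unknown')

print(found, fallback)
```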

How do we do a "vlookup" in pandas? We use the pandas `map()` method.

In [None]:
# We can magically "map" regions to countries (n.b. this is temporary & does not change the results dataframe!)
results.region.map(countries)
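
One thing worth knowing about `map()` with a dictionary: any value without a matching key becomes `NaN` (pandas' marker for missing data) rather than raising an error. A sketch with made-up values:

```python
import pandas as pd

# Hypothetical data, purely for illustration
cities = pd.Series(['Bristol', 'Cardiff', 'Atlantis'])
city_to_country = {'Bristol': 'England', 'Cardiff': 'Wales'}

mapped = cities.map(city_to_country)
print(mapped)

# Count how many values failed to find a match
n_missing = mapped.isna().sum()
print("Unmatched values:", n_missing)
```

Checking for missing values after a `map()` is a quick way to catch typos in the lookup dictionary.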

We'd like to create a new column in our original table called "country". First, let's see how to create a new column in pandas.

In [None]:
# This will add a new column with all values set to 123456789
# (It's a bit like adding a new entry into a dictionary.)
results['country'] = 123456789
results  # remember, Jupyter Notebook always prints out the last line of each cell!

In [None]:
# Let's create/overwrite the new country column from our region-to-country mapper
results['country'] = results.region.map(countries)  # just like inserting a new value into a dict
results

Similarly, let's create a column to state the turnout as a formatted percentage and another one for the absolute turnout.

Here's an easy way of doing this.

In [None]:
# The easy way...
results.turnout * 100

For demonstration purposes, we'll do this the long way. Or rather, you will!

> **Exercise:** Define a function called `percentage()` that takes a single argument `x` and returns `x` multiplied by 100.

In [None]:
def percentage(x):
    # return something here...
    pass  # placeholder so the cell runs; replace this line with your return statement

> **Exercise:** You can now "apply" this function to one of the columns of our dataframe to create a new column. In the same way that we previously did `results.region.map()`, try `results.turnout.apply()`. Inside the brackets, enter just the name of the function you want to "apply" to the turnout column, i.e. `apply(percentage)`.

In [None]:
# Remember, we passed map() a dict as its argument; apply() takes a function


> **Exercise:** Copy the last line of code into the cell below, and create a new column in our `results` dataframe called `turnout_percent` from your code in the exercise above.

In [None]:
# ...


In [None]:
# You should now see the new `turnout_percent` column with 74.2 for East Midlands, 75.7 for East of England, etc.
results

### 4. Selecting rows, filtering & sorting
How many columns and rows do we have in the dataframe?

In [None]:
results.shape

We have 12 rows and 5 columns.

Let us select some of the data points. For instance the first 4 entries for region and turnout_percent.

In [None]:
results[['region', 'turnout_percent']][0:4]

We can select some data based on conditional logic. For instance, let us show only those records where the electorate is at least 6 million people.

In [None]:
electorate_filter = (results.electorate >= 6000000)  # brackets added for clarity, you don't need them here
electorate_filter

In [None]:
results[electorate_filter]

In [None]:
# We can do this all in one go!
results[results.electorate >= 6000000]

In [None]:
# pandas provides another way of doing this kind of query
results.query('electorate >= 6000000')

As we can see, only South East England meets the condition.
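
Filters can also be combined. A sketch on made-up data: use `&` for "and" and `|` for "or", and wrap each condition in brackets (here they are required).

```python
import pandas as pd

# Hypothetical data, purely for illustration
toy = pd.DataFrame({
    'fruit': ['apple', 'banana', 'cherry', 'date'],
    'price': [0.5, 0.25, 3.0, 2.0],
    'stock': [10, 0, 5, 0],
})

# Rows where price is at least 1 AND the item is in stock
pricey_in_stock = toy[(toy.price >= 1) & (toy.stock > 0)]

# The same filter written with .query()
same_result = toy.query('price >= 1 and stock > 0')

print(pricey_in_stock)
```

Both approaches return the same rows, so choose whichever reads more clearly to you.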

Let's now try sorting the dataframe in descending order by the turnout.

In [None]:
results.sort_values('turnout', ascending=False)
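
Like `map()`, `sort_values()` returns a new, sorted dataframe and leaves the original untouched, so assign the result to a variable if you want to keep it. A sketch on made-up data:

```python
import pandas as pd

# Hypothetical data, purely for illustration
toy = pd.DataFrame({'fruit': ['apple', 'banana', 'cherry'], 'price': [0.5, 0.25, 3.0]})

by_price = toy.sort_values('price', ascending=False)

print(by_price.fruit.tolist())  # sorted copy: most expensive first
print(toy.fruit.tolist())       # original order is unchanged
```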

A neat trick we can try is to get a description of each of the numeric fields in the dataframe. We will be able to see the following statistics:

- count of records
- mean
- standard deviation
- minimum and maximum values
- percentiles

In [None]:
results.describe()
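
To see how to read that output, here is a sketch on four made-up numbers small enough to check by hand:

```python
import pandas as pd

# Hypothetical data, purely for illustration
scores = pd.Series([1, 2, 3, 4])
summary = scores.describe()
print(summary)

# Individual statistics can be picked out by name
median = summary['50%']   # the 50th percentile is the median
print("Median:", median)
```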

### 5. Visualisation

Finally, let's visualise this dataset. First, we need to enable plotting inside the Jupyter Notebook.

In [None]:
# this line is a "magic" function, and allows plots to display inside our notebook
%matplotlib inline

In [None]:
# You can plot pandas objects using the `.plot()` method on any dataframe or column
results.turnout_percent.plot(kind='bar')

In [None]:
# We can also try seaborn
import seaborn as sns

sns.barplot(y='region', x='turnout_percent', data=results, orient='h')

In [None]:
# Let's sort the bars and add axis labels
my_plot = sns.barplot(
    y='region',
    x='turnout_percent',
    hue='country',
    orient='h',
    dodge=False,
    data=results.sort_values('turnout_percent'),
)
my_plot.set(xlabel='Turnout (%)', ylabel='Region', title="Turnout by region")

In [None]:
results.head()

> **Exercise:** Using the seaborn [scatterplot](http://seaborn.pydata.org/generated/seaborn.scatterplot.html) function (`sns.scatterplot()`) create a scatterplot with:
> - `'electorate'` on the x-axis
> - `'turnout_percent'` on the y-axis
> - you can keep `data=results`

In [None]:
# ...


> **Exercise:** Add `hue='country'` to your code above to colour code by country.

In [None]:
# ...
