# Data manipulation with Python <a name="0."></a>

Welcome to our second course on Python, this time on the use of arrays to manipulate data and the use of `matplotlib` to generate complex plots of data.

Contents:
- [Refresher](#1.)
  - [Data types, operators and expressions](#1.1)
  - [Control structures and statements](#1.2)
  - [User-defined functions](#1.3)
  - [Error types](#1.4)
- [Arrays](#2.)
  - [What are arrays?](#2.1)
  - [Manipulating arrays](#2.2)
- [Plotting with matplotlib](#3.)
  - [Generating a basic plot](#3.1)
  - [Customising your plot](#3.2)
  - [Subplots](#3.3)
  - [Other graph types](#3.4)
  - [Importing data](#3.5)
  - [Curve fitting](#3.6)
  - [3D plots](#3.7)
- [What now?](#4.)

# 1. Refresher <a name="1."></a>

## 1.1 Data types, operators and expressions refresher <a name="1.1"></a>

<hr style="border:2px solid gray">

***Variables*** are assigned using `=`.  There are two broad categories of data type: ***numerical data types*** and ***sequenced data types***.

Numerical data types include:
- ***Ints*** (integers) - positive or negative whole numbers.  Example: `a = 4`
- ***Floats*** (floating-point numbers) - numbers with decimal points.  Example: `b = 1.1`, `c = 4E-7`
- ***Booleans*** - conditions that are `True` or `False`.  Example: `d = (4 < 7)` gives `True`

Sequenced data types include:
- ***Strings*** - collection of characters between two `'`, `"` or `'''` marks.  Examples: `'beep'`, `"Welcome to Python!"`.  Can be ***concatenated*** (added)
- ***Lists*** - similar to strings, though items do not need to be of the same type.  Examples: `L = [1, '2', 3, 'Four', 5]`, `L = []` (empty list).  Can access ***elements*** of a list by 'slicing' list by ***index***; notation is `list[start:stop:step]`.  Can concatenate lists, as well as ***append*** to them with `listName.append(addition)`
- ***Tuples*** - similar to lists, though ***immutable*** (cannot be changed once created).  Must instead convert to a list using `list(tupleName)`, make changes, then convert back to tuple using `tuple(listName)`
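
As a quick illustration of slicing and of the list/tuple round trip described above, here is a minimal sketch.  The variable names are made up for this example.

```python
# Slicing a list with list[start:stop:step]
L = [1, '2', 3, 'Four', 5]
first_three = L[0:3]      # elements at indices 0, 1 and 2
every_other = L[::2]      # every second element, starting from index 0
reversed_L = L[::-1]      # a negative step walks backwards through the list

# Tuples are immutable, so convert to a list, edit, then convert back
T = (1, 2, 3)
temp = list(T)
temp.append(4)
T = tuple(temp)

print(first_three)
print(every_other)
print(reversed_L)
print(T)
```

Note that the `stop` index is never included in a slice, which is why `L[0:3]` returns three elements rather than four.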

In the ***expression*** `a + b`, `a` and `b` are the ***operands*** whilst `+` is the ***operator***.  There are two broad categories of operator: ***arithmetic operators*** and ***comparison operators***.

Arithmetic operators include:
- ***Addition***, `+` - adds operands on either side of the operator.  Example: `2 + 2 = 4`
- ***Subtraction***, `-` - subtracts right-hand operand from left-hand operand.  Example: `5 - 2 = 3`
- ***Multiplication***, `*` - multiplies values on either side of operator.  Example: `3*3 = 9`
- ***Division***, `/` - divides left-hand operand by right-hand operand.  Example: `22/8 = 2.75`
- ***Exponent***, `**` - raises left-hand operand to power of right-hand operand.  Example: `2**3 = 8`
- ***Modulo***, `%` - divides left-hand operand by right-hand operand and returns remainder.  Example: `13 % 3 = 1`
- ***Floor division***, `//` - divides left-hand operand by right-hand operand and rounds the result down to the nearest whole number.  Example: `22//8 = 2`
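
The two less familiar operators, `//` and `%`, are easiest to understand side by side.  The short sketch below (with variable names chosen just for this example) shows that floor division rounds *down* rather than towards zero, and that the quotient and remainder always recombine to give the original number.

```python
# Floor division rounds down (towards negative infinity), not towards zero
print(22 // 8)     # 2
print(-22 // 8)    # -3, because -2.75 rounds down to -3

# Floor division and modulo fit together: a == (a // b) * b + (a % b)
a, b = 13, 3
quotient = a // b
remainder = a % b
print(quotient, remainder)
print(quotient * b + remainder == a)
```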

Comparison operators include:
- Equal to, `==` - compares both operands, returns `True` if equal
- Not equal to, `!=` - compares both operands, returns `True` if not equal.  (The older `<>` form only exists in Python 2 and is a syntax error in Python 3)
- Greater than, `>` - returns `True` if left-hand operand is greater than right-hand operand
- Less than, `<` - returns `True` if left-hand operand is less than right-hand operand
- Greater than or equal to, `>=` - returns `True` if left-hand operand is greater than or equal to right-hand operand
- Less than or equal to, `<=` - returns `True` if left-hand operand is less than or equal to right-hand operand

There is also an example of an ***assignment operator*** - the assignment operator itself, `=`, which should not be confused with the equality operator `==`.  It assigns a value from a right-hand operand to a left-hand operand.  Example: in `a = 4`, we have assigned a value of 4 to the left-hand operand `a`.

A ***function*** is a block of code that ***runs*** (does things) when it is ***called*** (written down).  You put in (***input***) *variables* and the function gives an ***output*** that may itself be a variable.  Programmers will often import ***modules*** which are ***libraries*** containing multiple functions.

Functions can be ***nested*** - put inside other functions.

[Return to contents](#0.)

<hr style="border:2px solid gray">

## 1.2 Control structures and statements refresher <a name="1.2"></a>

<hr style="border:2px solid gray">

A ***while*** loop will execute a block of code repeatedly for as long as a given condition remains true.  When this condition becomes false, the loop ends and the line immediately after the loop in the program is executed.

We can combine `while` loops with an ***else*** statement, which runs once when the loop condition becomes false.  In the example below, the loop body runs until `count` reaches a value of three.  Without the `else` statement, the program would simply carry on with whatever code follows the loop; as there is none, there would be no further output.  Inserting an `else` statement changes this by printing a final message once the loop has finished.

An example of a `while` loop with an `else` statement is below.

In [None]:
count = 0
while (count < 3):    
    count = count + 1
    print("As expected")
else:
    print("I'm sorry, Dave")

A ***for*** loop will run through a sequenced data type and perform an action that is repeated across the entire sequence.  We can also place control structures within other control structures - these are known as ***nested*** loops.

A useful statement to include in `for` loops is the ***break*** statement, which terminates the loop containing it if a condition for breaking is met.  There's no further iteration, the loop just ends.

An example of a nested `for` loop is below.

In [None]:
for i in range(1, 6):
    for j in range(i):
        print(i, end=' ')
    print()

An example of a `for` loop with a `break` statement is below.

In [None]:
for val in "hitchhiker":
    if val == "k":
        break
    print(val)

print("DON'T PANIC")

A `continue` statement will skip the rest of the code inside a loop for the current iteration only.  See below.

In [None]:
for val in "hitchhiker":
    if val == "i":
        continue
    if val == "e":
        continue
    print(val)

print("DON'T PANIC")

An ***if*** statement is a simple decision-making statement; it decides whether or not a certain statement or block of statements will be executed.  Indentation (empty space before the beginning of a line) is important here, as only indented statements will be identified as being within a control structure.

An example of an `if` statement is below.

In [None]:
i = 10

if (i > 15):
    print("i is greater than 15")  # Skipped, as the condition is False
print("This line is outside the if statement")

If we want to do something else when the condition in our `if` statement is `False` for a certain input, we can use an `else` statement; the use of an `else` statement after an `if` statement is known as an ***if else*** statement.

An example of an `if else` statement is below.

In [None]:
a = float(input("Enter a value for a: "))
b = 200
if b > a:
    print("b is greater than a")
else:
    print("a is greater than or equal to b")

`elif` ('else if') statements can be used to make so-called 'elif ladders'.  If the condition for the `if` statement is false, the conditions for the next `elif` statement are checked and so on.  If every condition is false, the `else` statement is activated.  In effect, `elif` automatically 'passes on' the sequence to the next statement.

An example of an `elif` ladder is below.

In [None]:
i = 20
if (i == 10):
    print("i is 10")
elif (i == 15):
    print("i is 15")
elif (i == 20):
    print("i is 20")
else:
    print("i is not present")

[Return to contents](#0.)

<hr style="border:2px solid gray">

## 1.3 User-defined functions refresher <a name="1.3"></a>

<hr style="border:2px solid gray">

***User-defined functions*** or ***UDFs*** are best used when a block of statements must be run through multiple times - this way, there is no need to repeatedly rewrite the same statement.  The layout of a UDF is as follows:

In [None]:
# def function_name(argument1, argument2, ...):
      # statement_1
      # statement_2
      # ...

To run a UDF with a particular set of arguments, we must call it with these arguments.  We can also call a UDF when nested within a different line of code, as well as nesting control structures and statements within one.

A ***docstring*** can be included in order to explain how a UDF works.  A docstring should give a basic overview of how a UDF works alongside the arguments it accepts.

An example of a UDF with all the above features is below.

In [None]:
def Sum(*numbers):
    '''
    Calculate and return the sum of any number of arguments.
    Created on 25/05/2022
    '''
    s = 0
    for n in numbers:
        s += n
    return s

print(Sum.__doc__)
print("The sum of all whole numbers from 1 to 4 is", Sum(1,2,3,4))

[Return to contents](#0.)

<hr style="border:2px solid gray">

## 1.4 Error types refresher <a name="1.4"></a>

<hr style="border:2px solid gray">

***Syntax error :*** something has been entered incorrectly. Often the result of missing or extra brackets or quotation marks.

Below is an example of a syntax error.

In [None]:
i = 10

if (i > 15) print("10 is less than 15")
print("This is an if statement")

***Name error :*** a referenced name has not been defined.  Often caused by misspelling the name of an object, or by only defining an object in a condition, loop or function then using it elsewhere.

Below is an example of a name error.

In [None]:
def sum(*numbers):
     s = 0
     for n in numbers:
           s += n
     return s

print(sun(1,2,3,4))

***Type error :*** used the wrong data type.  An example of a type error is below.

In [None]:
'1' + 2

***Indentation error :*** an indentation (made by pressing *tab* on your keyboard) is expected but not present.  Often occurs when building control structures or UDFs.

Below is an example of an indentation error.

In [None]:
for i in range(1,24):
print(i)
if i == 8:
break

***Zero division error :*** divided by zero, which gives an undefined result.  Below is an example of a dividing by zero error.

In [None]:
1/0

***Logical errors*** are distinct from the rest as they do not result in error messages and are not technically incorrect in terms of syntax.  They arise when the programmer misunderstands a mathematical or systematic process: the code runs and gives a valid result, but not the desired or expected one.

An example of a logical error is below.

In [None]:
x = 4
y = 5

z = x+y/2
print('The average of the two numbers you have entered is:',z)
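
To see why the result above is wrong, it helps to compare it against the intended calculation.  The sketch below reuses `x` and `y` from the example; the names `wrong` and `right` are ours.

```python
x = 4
y = 5

# Division binds more tightly than addition, so x + y/2 means x + (y/2)
wrong = x + y/2
# Brackets force the addition to happen first
right = (x + y)/2

print('Without brackets:', wrong)
print('With brackets:', right)
```

Python raised no error in either case, which is exactly what makes logical errors hard to spot: checking a result by hand for a simple input is often the quickest way to catch them.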

[Return to contents](#0.)

<hr style="border:2px solid gray">

# 2. Arrays <a name="2."></a>

### 2.1 What are arrays? <a name="2.1"></a>

<hr style="border:2px solid gray">

As we have seen, a list in Python is an ordered set of values, such as a set of integers or a set of floats.  ***Arrays*** are similar in that they are also ordered sets of values, but there are some important differences between lists and arrays:
- The number of elements in an array is fixed
- The elements of an array must all be of the same type, and this type cannot be changed once the array is created

Lists have neither of these limitations, so it may seem like a better idea to use them.  However, arrays have some considerable advantages over lists:
- Arrays can be two-dimensional like matrices in algebra, which allows us to have 'grids' of values.  Indeed, arrays can be n-dimensional, whilst a list is one-dimensional (lists can be nested inside one another, but this is clumsier to work with)
- Arrays behave roughly like vectors or matrices: you can do arithmetic with them, such as adding them together, and you will get the result you expect
- Arrays work faster than lists, particularly for large arrays and lists
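
The 'same type' rule is worth seeing in action, as it can catch you out.  The sketch below is a minimal illustration with made-up variable names; the exact integer type that `numpy` reports depends on your platform.

```python
import numpy as np

ints = np.array([1, 2, 3])
print(ints.dtype)          # an integer type; the exact width depends on your platform

# Assigning a float into an integer array silently drops the decimal part
ints[0] = 7.9
print(ints)

# A list, by contrast, happily holds mixed types
mixed = [1, 2, 3]
mixed[0] = 7.9
print(mixed)

# To hold floats, create a new array of a different type
floats = ints.astype(float)
floats[0] = 7.9
print(floats)
```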

In fact, let's see what happens if we add two lists together:

In [None]:
L1 = [1,2,3,4,5]
L2 = [6,7,8, 9, 10]
L = L1 + L2
print(L)

As you can see, we've just created a longer list by 'sticking' list L2 to the end of list L1.  This is known as ***concatenation***.

Let's do the same for arrays.  Note that we must import the `numpy` module to create arrays.

In [None]:
import numpy as np

A1 = np.array([1,2,3,4,5])
A2 = np.array([6,7,8,9,10])
A = A1 + A2
print(A)

Each element of array `A2` is added to the element of array `A1` that has the same ***index***, so the result is an array of the same length rather than a longer one.
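
The same contrast shows up with other operators.  Here is a small sketch (variable names are ours) comparing multiplication on a list with multiplication on an array.

```python
import numpy as np

L1 = [1, 2, 3]
A1 = np.array([1, 2, 3])

doubled_list = L1 * 2     # a list is repeated
doubled_array = A1 * 2    # an array is scaled element by element
squared_array = A1 ** 2   # other operators also act element by element

print(doubled_list)
print(doubled_array)
print(squared_array)
```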

Now that we've gone through general numpy arrays, there is another data structure that you may encounter: ***dictionaries***, also known as ***associative arrays***.  These are collections of ***key-value pairs***, where each ***key*** maps to its associated ***value***.  Note that dictionaries are built into Python and are not `numpy` arrays.  This is best shown through demonstration.

In [None]:
thisdict = {
  "Brand": "Fiat",
  "Model": "Cinquecento Hawaii",
  "Year": 1998
   }
print(thisdict)

In effect, each key (on the left of the colon) is assigned the value on its right.  A dictionary is essentially a collection of values that are looked up by key rather than by position.

A handy feature of dictionaries is that you can ***filter*** them by key to obtain particular values.  The case below uses a *dictionary comprehension*: `for key in l1` steps through each key in the list `l1`, and `d1[key]` looks up the matching value in `d1`.  (Used on its own, as in `'Grade A' in d1`, the ***in*** operator instead tests for membership, returning `True` if the key is found and `False` if it is not.)

In [None]:
d1 = {'Grade A':'65%', 'Grade B':'66%', 'Grade C':'67%', 'Grade D':'68%', 'Grade E':'69%', 'Grade F':'70%'}
l1 = ['Grade A','Grade C','Grade F']

requested_grades = {key: d1[key] for key in l1}
# A dictionary comprehension that looks up each key from the list l1 in dictionary d1
print(requested_grades)

You can add to a dictionary like so:

In [None]:
thisdict['Owner'] = 'Will McKenzie'

The keys and values of a dictionary can be accessed using the methods `keys()`, `values()` and `items()`; the last of these returns each key-value pair as a tuple.

In [None]:
for key in thisdict.keys():
    print(key)

for val in thisdict.values():
    print(val)

for item in thisdict.items():
    print(item)

You can also access values with variables.

In [None]:
accessor = "Owner"
print(thisdict[accessor])
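
Looking up a key that does not exist raises a `KeyError`, so it is often worth checking first.  The sketch below shows two ways of doing this, using a made-up `grades` dictionary: the `in` membership test and the dictionary's `get` method.

```python
grades = {'Grade A': '65%', 'Grade B': '66%', 'Grade C': '67%'}
wanted = ['Grade A', 'Grade C', 'Grade Z']   # 'Grade Z' is deliberately missing

# `key in grades` is True only if the key exists, so missing keys are skipped
found = {key: grades[key] for key in wanted if key in grades}
print(found)

# .get returns a fallback value instead of raising a KeyError
fallback = grades.get('Grade Z', 'not recorded')
print(fallback)
```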

[Return to contents](#0.)

<hr style="border:2px solid gray">

### 2.2 Manipulating arrays <a name="2.2"></a>

<hr style="border:2px solid gray">

As was the case with lists, we can also select elements from an array:

In [None]:
A = np.array([1,2,3,4,5])
print(A[0])
print(A[3])

We can also make 2D arrays by hand, as well as determine the size and shape of a given array:

In [None]:
import numpy as np
x = np.array([[1,2,3,4],[5,6,7,8]])
print("Size of array:", np.size(x))
print("Shape of array:", np.shape(x))

Creating arrays of higher dimensions than 2 or 3 by hand is difficult; here, it's best to use `numpy` methods.  The below code uses the `ndmin` argument to create a 5-dimensional array of shape `(1, 1, 1, 1, 4)`: the innermost (5th) dimension holds the 4 elements as a vector, the 4th dimension has 1 element that is this vector, the 3rd has 1 element that is a matrix containing the vector, the 2nd has 1 element that is a 3D array and the 1st has 1 element that is a 4D array.

In [None]:
import numpy as np

arr = np.array([1, 2, 3, 4], ndmin=5)

print(arr)
print('number of dimensions:', arr.ndim)

Several `numpy` functions can generate arrays of a certain type very quickly.  For instance, the below array contains 20 evenly spaced values from 0 to 1 (both ends included).

In [None]:
import numpy as np
y = np.linspace(0,1,20)
print(y)

The below array's elements go from 1 up to (but not including) 10 in steps of 0.1:

In [None]:
z = np.arange(1,10,0.1)
print(z)

Finally, the below array is simply the above 90-element array *z* reshaped into a (30,3) array.

In [None]:
Q = z.reshape(30,3)
print(Q)
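
It's easy to mix up `linspace` and `arange`, so here is a small side-by-side sketch with deliberately short arrays (the names `w`, `y` and `z` are reused purely for illustration).

```python
import numpy as np

# linspace(start, stop, num): you choose how many points; stop is included
y = np.linspace(0, 1, 5)
print(y)            # [0.   0.25 0.5  0.75 1.  ]

# arange(start, stop, step): you choose the step; stop is excluded
z = np.arange(0, 1, 0.25)
print(z)            # [0.   0.25 0.5  0.75]

# reshape only works if the total number of elements is unchanged
w = np.arange(12)
print(w.reshape(3, 4).shape)
print(w.reshape(2, -1).shape)   # -1 asks numpy to work out that dimension
```

As a rule of thumb, `linspace` is the safer choice when you care about the number of points and about hitting the end value exactly, whilst `arange` is convenient when the step size matters most.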

**Figure 1** should make it easier to understand the shapes of 1D, 2D and 3D arrays.

<CENTER><img src="NWTQH.png" style="width:60%"></CENTER>

**Figure 1:** (left to right) 1D, 2D and 3D arrays.  For the 1D array, element `[0]` would be $7$ and element `[2]` would be $9$.  For the 2D array, element `[0][0]` would be $5.2$ and element `[2][3]` would be $0.3$.  For the 3D array, element `[2][2][0]` would be $0$.

Looking at the 1D and 2D arrays raises a question: can we 'stack' two 1D arrays together to make a 2D array?  We can, by using the `stack` function from `numpy`.  `stack` takes arguments `arrays` and `axis`; the axes of stacking can be a little confusing, so pictograms of the stacking process are below.

In [None]:
arr1 = np.array([1,2,3])
arr2 = np.array([4,5,6])

print(arr1)
print(arr2)

horizontal = np.stack((arr1,arr2), axis = 0)
print(horizontal)

<CENTER><img src="stack1.png" style="width:20%"></CENTER>

**Figure 2:** Stacking arrays horizontally ($0^{th}$ axis).

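For stacking along the $0^{th}$ (horizontal) axis, there was actually no need to include an axis argument as `axis = 0` is the default.  This is not the case for stacking along the $1^{st}$ (vertical) axis, which must be requested explicitly.  The code below does so with `axis = -1`, which means 'the last axis'; for 1D input arrays this is the same as `axis = 1`.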

In [None]:
arr1 = np.array([1,2,3])
arr2 = np.array([4,5,6])

print(arr1)
print(arr2)

vertical = np.stack((arr1,arr2), axis = -1)
print(vertical)

<CENTER><img src="stack2.png" style="width:20%"></CENTER>

**Figure 3:** Stacking arrays vertically ($1^{st}$ axis).
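
If the axes are still confusing, checking the shape of each result can help.  The sketch below repeats both kinds of stacking with the same two arrays; the names `rows` and `cols` are ours.

```python
import numpy as np

arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

rows = np.stack((arr1, arr2), axis=0)   # each input becomes a row
cols = np.stack((arr1, arr2), axis=1)   # each input becomes a column

print(rows.shape)   # (2, 3)
print(cols.shape)   # (3, 2)

# For 1D inputs, axis=-1 (the last axis) gives the same result as axis=1
print(np.array_equal(cols, np.stack((arr1, arr2), axis=-1)))

# Transposing swaps rows and columns, so the two results are related by .T
print(np.array_equal(rows.T, cols))
```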

Creating a 3D array is somewhat finicky, so it's best to use a function to generate one rather than create it manually by typing.  We'll use two functions from `numpy`.  The first is `arange`, which returns an array of evenly spaced elements as given by the arguments `start`, `stop` and `step`.  The second is `reshape`, which changes the shape of an array without affecting its data.

In [None]:
arr = np.arange(24).reshape(4,3,2)
print(arr)

Let's see how the shape of an array changes by looking at printed output rather than a diagram.

In [None]:
dim_1 = np.arange(1,4)
dim_2 = np.arange(1,10).reshape(3,3)
dim_3 = np.arange(1,28).reshape(3,3,3)

print(dim_1)

print(dim_2)

print(dim_3)

It's best to visualise a 2D array as a series of 1D 'strips', and a 3D array as a series of 2D 'slices'.

We've seen that accessing elements of a 1D array is a very useful skill - let's extend this to higher-dimensional arrays.

In [None]:
dim_1 = np.arange(1,4)
dim_2 = np.arange(1,10).reshape(3,3)
dim_3 = np.arange(1,28).reshape(3,3,3)

print(dim_1[1])
print(dim_2[1][1])
print(dim_3[1][1][1])

This is very much like how we access elements from lists and 1D arrays; we just add on another 'position marker' for each *degree of dimensionality*, that is, the number of dimensions that an array has.
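
`numpy` arrays also accept all the position markers inside a single pair of brackets, separated by commas, and this can be combined with slicing.  A minimal sketch, reusing the `dim_2` and `dim_3` arrays:

```python
import numpy as np

dim_2 = np.arange(1, 10).reshape(3, 3)
dim_3 = np.arange(1, 28).reshape(3, 3, 3)

# Comma-separated indices pick out the same element as chained brackets
print(dim_2[1, 1] == dim_2[1][1])
print(dim_3[1, 1, 1])    # 14

# A colon selects everything along that axis
first_row = dim_2[0, :]
first_column = dim_2[:, 0]
print(first_row)
print(first_column)
print(dim_3[0].shape)    # the first 2D 'slice' of the 3D array
```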

[Return to contents](#0.)

<hr style="border:2px solid gray">

# 3. Plotting with matplotlib <a name="3."></a>

Plots appear everywhere in physics and beyond.  Some of you may have experience with generating plots yourself, perhaps in Excel.  However, this becomes quite painful if you have a very large dataset or if you wish to manipulate your data in a complicated manner.  Luckily, Python - which is very adept at handling and manipulating complex datasets - can also generate plots using the `matplotlib` library.

### 3.1 Generating a basic plot <a name="3.1"></a>

<hr style="border:2px solid gray">

Generally, making a plot will follow the same sort of procedure:
- Obtain and specify data sets (as in the `x` and `y` values)
- Manipulate data sets if necessary; this is often much harder than generating the plot
- Import the module `matplotlib.pyplot`, the Python module dedicated to generating plots
- Tailor the details of your plot
- Generate your plot!

**Example:** Generate a simple quadratic plot.

A good approach would be to manually create a data set for our *x* values and write a function in *y* that is our quadratic.  Let's do just that.

In [None]:
import numpy as np

x = np.arange(-5,8,1)
y = (x-1)**2
print(x)
print(y)

Now let's get a-plotting:

In [None]:
import matplotlib.pyplot as plt

plt.plot(x,y)
plt.show()

There we go - a nice quadratic!

[Return to contents](#0.)

<hr style="border:2px solid gray">

### 3.2 Customising your plot <a name="3.2"></a>

<hr style="border:2px solid gray">

Who wants a boring old line?  We can do much better.  We get a line because we haven't specified the plot type - a line plot is the default option.  The change we need is simple:

In [None]:
plt.plot(x,y,'o') # Specifying 'o' after the dataset inputs tells matplotlib to draw circular markers with no connecting line, giving a scatter plot
plt.show()

Or, we can combine the two using the `marker` option:

In [None]:
plt.plot(x,y,marker='o')
plt.show()

We can also change the appearance of the line:

In [None]:
plt.plot(x,y,linestyle='--')
plt.show()

We can also change the colour of our line - this will be handy later on.

In [None]:
plt.plot(x,y,c='r')
plt.show()

Adjusting the width of our line is an option.

In [None]:
plt.plot(x,y,linewidth=5.5)
plt.show()
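
These options can also be combined in a single call.  The sketch below uses `matplotlib`'s format-string shorthand, where one short string sets the colour, marker and line style at once; the particular combination `'ro--'` is just an example.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(-5, 8, 1)
y = (x - 1)**2

# 'ro--' is shorthand for colour 'r' (red), marker 'o' (circle), linestyle '--' (dashed)
lines = plt.plot(x, y, 'ro--', linewidth=2)
plt.show()
```

The shorthand is quick to type, whilst the keyword arguments (`c`, `marker`, `linestyle`) are easier to read back later; which you use is a matter of taste.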

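We can plot two lines on the same plot by calling `plt.plot` once for each line.  Here each call is given a single array, so `matplotlib` plots its values against their index (0, 1, 2, ...).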

In [None]:
plt.plot(x)
plt.plot(y)

plt.show()

Alternatively, we can group multiple lines in the same `plt.plot` call, though we must ensure that the arguments are ordered correctly.  If we intend to make two lines *1* and *2*, we should order them as $(x_1, y_1, x_2, y_2)$.  See below.

In [None]:
x1=np.array([0,1,2,3])
y1=np.array([3,8,1,10])
x2=np.array([0,1,2,3])
y2=np.array([6,2,7,11])

plt.plot(x1,y1,x2,y2)
plt.show()

If you were given this plot without context, you'd have no idea what was going on.  Labelling is key, so let's start with labelling axes with the `xlabel` and `ylabel` functions:

In [None]:
x = np.arange(-5,8,1)
y = (x-1)**2

plt.plot(x, y)

plt.xlabel("x values")
plt.ylabel("y values")

plt.show()

A title would also be helpful:

In [None]:
x = np.arange(-5,8,1)
y = (x-1)**2

plt.plot(x, y)

plt.xlabel("x values")
plt.ylabel("y values")
plt.title("Second-order quadratic")

plt.show()

Using a grid can be handy as it allows a reader to identify points of intersection.  The `plt.grid` function not only does this, but it aligns the grid to shown axis values.

In [None]:
x = np.arange(-5,8,1)
y = (x-1)**2

plt.plot(x, y)

plt.xlabel("x values")
plt.ylabel("y values")
plt.title("Second-order quadratic")
plt.grid()

plt.show()

A line plot isn't always what you're after.  *Scatter plots* are quite common in physics, particularly in particle physics and astrophysics.  As for generating a scatter plot, there's a function for that: `scatter`.

In [None]:
x = np.random.randint(1,10, size = 10)
y = np.random.randint(1,15,size = 10)

# The random.randint function takes arguments (lowest value, highest value, size)
# and generates an array of the specified size filled with random integers
# from the lowest value up to (but not including) the highest value

plt.scatter(x, y)
plt.show()

*Colour maps* may be used with a plot.  In short, a colour map assigns a colour to each value in a range; in the example below, the values in `col` range from 0 to 100 and `plt.colorbar` shows which colour corresponds to which value.  See below.

In [None]:
a = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
b = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
col = np.array([0,10,20,30,40,45,50,55,60,70,80,90,100])

plt.scatter(a, b, c=col, cmap='Spectral')
plt.colorbar()

plt.show()

Colour maps are particularly useful when used with a 2D array.  Without a colourmap, the below image would just be monochrome and effectively invisible.

![colourmap.png](attachment:colourmap.png)

**Figure 4:** A colour map of the surface of a silicon crystal, obtained using a scanning tunnelling microscope (STM).

If you're collecting data for scientific purposes, you'll almost never get definite results - there will always be some form of (potentially) quantifiable uncertainty.  Whilst your graphs will certainly look good, they won't indicate any degree of uncertainty.  Luckily, `matplotlib` allows you to generate *error bars*.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
   
x = np.arange(10)
y = 3*np.sin((x/20)*np.pi)

x_error = np.linspace(0.2, 0.5, 10)
y_error = np.linspace(0.2, 0.5, 10)

plt.plot(x,y, color = 'b')
plt.errorbar(x, y, xerr = x_error, yerr = y_error, ecolor = 'g')
plt.show()

It can also be handy to change the scale of one or both of your axes.  Take a look at the plot below.

In [None]:
x = 10.0**np.linspace(0.0, 5.0, 15) 
y = x**2.0

plt.plot(x,y)
plt.show()

An observer may not be aware that the data span several factors of 10: on linear axes, almost every point is squashed into the bottom-left corner.  In this case, it would be handy to show *both* our datasets on a logarithmic scale.  However, taking logarithms of our data by hand could get annoying.  What if we made both our axes logarithmic instead?

This is where `xscale` and `yscale` come in.  See below.

In [None]:
x = 10.0**np.linspace(0.0, 5.0, 15) 
y = x**2.0

plt.plot(x,y)

plt.xscale('log') 
plt.yscale('log')

plt.show()
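
A useful consequence of logarithmic axes is that a power law $y = A x^n$ becomes a straight line whose gradient is the power $n$.  As a rough check of this, the sketch below fits a straight line to the logarithms of the same data using `np.polyfit`, which is not used elsewhere in this course and is only here to read off the gradient.

```python
import numpy as np

x = 10.0**np.linspace(0.0, 5.0, 15)
y = x**2.0

# On log-log axes a power law y = A*x**n is a straight line:
# log10(y) = n*log10(x) + log10(A)
slope, intercept = np.polyfit(np.log10(x), np.log10(y), 1)
print("Slope (the power n):", slope)
print("Intercept (log10 of A):", intercept)
```

The slope comes out very close to 2, matching the `x**2.0` used to generate the data.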

[Return to contents](#0.)

<hr style="border:2px solid gray">

### 3.3 Subplots <a name="3.3"></a>

<hr style="border:2px solid gray">

The `subplot` function allows us to draw multiple plots in one figure.  It takes 3 arguments: the 1st describes the number of rows, the 2nd the number of columns and the 3rd the index of the current plot.  This is best shown with an example:

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Plot 1:
x = np.array([0,1,2,3])
y = np.array([3,8,1,10])

plt.subplot(2,1,1) # 2 rows, 1 column, index 1
plt.plot(x,y)

# Plot 2:
x = np.array([0,1,2,3])
y = np.array([10,20,30,40])

plt.subplot(2,1,2) # 2 rows, 1 column, index 2
plt.plot(x,y)

plt.show()
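
The index argument is easier to follow with more than one column.  The sketch below, using made-up data, fills a 2 x 2 grid and labels each plot with its index so you can see the order in which the positions are counted.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 1, 50)

# In a 2 x 2 grid the index counts along each row in turn:
#   1 2
#   3 4
for index in range(1, 5):
    ax = plt.subplot(2, 2, index)
    plt.plot(x, x**index)
    plt.title("Index " + str(index))

plt.tight_layout()
plt.show()
```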

You can add titles and axis labels to each plot as you would normally, as well as change the colour and style of each line.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Plot 1:
x = np.array([0,1,2,3])
y = np.array([3,8,1,10])

plt.subplot(2,1,1)
plt.plot(x,y,linestyle='--')

plt.xlabel("x values")
plt.ylabel("y values")
plt.title("Plot 1")

# Plot 2:
x = np.array([0,1,2,3])
y = np.array([10,20,30,40])

plt.subplot(2,1,2)
plt.plot(x,y,c='r')

plt.xlabel("x values")
plt.ylabel("y values")
plt.title("Plot 2")

plt.tight_layout() # This stops the title of Plot 2 overlapping with the x-axis label of Plot 1

plt.show()

We can also add a title to an entire block of plots using the `suptitle` function:

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Plot 1:
x = np.array([0,1,2,3])
y = np.array([3,8,1,10])

plt.subplot(2,1,1)
plt.plot(x,y)

plt.xlabel("x values")
plt.ylabel("y values")
plt.title("Plot 1")

# Plot 2:
x = np.array([0,1,2,3])
y = np.array([10,20,30,40])

plt.subplot(2,1,2)
plt.plot(x,y)

plt.xlabel("x values")
plt.ylabel("y values")
plt.title("Plot 2")

plt.suptitle("Subplot demonstration")
plt.tight_layout()

plt.show()

Finally, we can include multiple lines on the same plot, as well as adding a legend to more easily identify each line.  Drawing the legend just uses the `legend` function, but we must give each line its own `label` for the legend to identify the lines correctly.

In [None]:
# Plot 1:
x1 = np.array([0,1,2,3])
y1 = np.array([3,8,1,10])
x2 = np.array([0,1,2,3])
y2 = np.array([6,16,2,20])

plt.subplot(2,1,1)
plt.plot(x1,y1, label=r'Line 1')
plt.plot(x2,y2,c='r', label=r'Line 2')

plt.xlabel("x values")
plt.ylabel("y values")
plt.title("Plot 1")

plt.legend()

# Plot 2:
x1 = np.array([0,1,2,3])
y1 = np.array([10,20,30,40])
x2 = np.array([0,1,2,3])
y2 = np.array([30,6,24,9])

plt.subplot(2,1,2)
plt.plot(x1,y1,c='g', label=r'Line 3')
plt.plot(x2,y2,c='y', label=r'Line 4')

plt.xlabel("x values")
plt.ylabel("y values")
plt.title("Plot 2")

plt.suptitle("Subplot demonstration")
plt.tight_layout()

plt.legend()

plt.show()

[Return to contents](#0.)

<hr style="border:2px solid gray">

### 3.4 Other graph types <a name="3.4"></a>

<hr style="border:2px solid gray">

We can use the `bar` function to draw bar graphs:

In [None]:
x = np.array(["A","B","C","D"])
y = np.array([1,3,8,10])

plt.bar(x,y)
plt.show()

It's possible to instead use horizontal bars using `barh` (not a typo!), as well as adjust their thickness and colour.  Note that for `barh` the `height` argument sets the thickness of each bar, whilst the bar lengths come from the data.

In [None]:
x = np.array(["A", "B", "C", "D"])
y = np.array([3, 8, 1, 10])

plt.barh(x, y, height=0.1, color='r')
plt.show()

Producing a histogram is relatively easy: simply use the `hist` function to convert data into a histogram.  Here, we've used the `normal` function from `numpy`'s `random` module, which draws values from a normal (Gaussian) distribution.  Its arguments are `loc` (centre/mean of distribution), `scale` (spread/standard deviation) and `size`.

In [None]:
x = np.random.normal(170, 10, 250)

plt.hist(x)
plt.show()

We can also adjust the number of *bins*, which are the bars of the histogram.

In [None]:
x = np.random.normal(170, 10, 250)

plt.hist(x, bins = 20)
plt.show()

This looks rather messy and undefined, so it would be best to highlight the borders of each bar.

In [None]:
x = np.random.normal(170, 10, 250)

plt.hist(x, bins = 20, edgecolor = "black")
plt.show()

We can also choose the positions of our bins by specifying the positions of their edges, like so.

In [None]:
x = np.random.normal(170, 10, 250)

# binpos = np.linspace(130, 200, 15)

binpos = [130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200]

plt.hist(x, bins = binpos, edgecolor = "black")
plt.show()

Alternatively, we can use `linspace` to generate evenly spaced bin edges for us.  The below example gives the same bins as above, though the actual data will be different - it's random, after all.

In [None]:
x = np.random.normal(170, 10, 250)

binpos = np.linspace(130, 200, 15)

plt.hist(x, bins = binpos, edgecolor = "black")
plt.show()
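
If you want the numbers behind a histogram rather than the picture, `numpy` has its own `histogram` function that returns the count in each bin.  The sketch below is a minimal example; the fixed random seed is our addition so that the output is repeatable.

```python
import numpy as np

np.random.seed(0)   # fixed seed so the example is repeatable (an addition for this sketch)
x = np.random.normal(170, 10, 250)

binpos = np.linspace(130, 200, 15)   # 15 edges define 14 bins
counts, edges = np.histogram(x, bins=binpos)

print(len(edges), "edges give", len(counts), "bins")
print(counts)
# Values outside 130-200 fall outside every bin and are not counted
print(counts.sum(), "of", len(x), "values fall inside the bins")
```

Note that there is always one more bin edge than there are bins.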

Producing a pie chart follows a similar process to producing a histogram, though we use the `pie` function this time.  By default, the plotting of the first wedge starts from the x-axis and moves counterclockwise.

In [None]:
x = np.array([35,25,25,15])

plt.pie(x)
plt.show()

We can also label each segment of the pie chart, as well as adjusting the starting angle.  Choosing an angle of 0 degrees would start from the default x-axis or 'East' whilst selecting 90 degrees, 180 degrees and 270 degrees would start from 'North', 'West' and 'South', respectively.

In [None]:
x = np.array([35,25,25,15])
labl = ["A","B","C","D"]

plt.pie(x,labels=labl, startangle = 90)
plt.show()

If we want a particular section of our chart to stand out, we can use the `explode` argument, which sets how far each wedge is pulled out from the centre.

In [None]:
x = np.array([35,25,25,15])

sep = [0.2,0,0,0]

plt.pie(x, explode=sep)
plt.show()

Finally, we can adjust the colours used in the chart.  Whilst we could add a legend, this generally gets in the way of both the chart and its labels, so it's better to stick to just using labels.

In [None]:
y = np.array([35, 25, 25, 15])
labl = ["A","B","C","D"]
cols = ["black", "hotpink", "b", "#4CAF50"]

plt.pie(y, labels = labl, colors = cols)
plt.show()

[Return to contents](#0.)

<hr style="border:2px solid gray">

### 3.5 Importing data <a name="3.5"></a>

<hr style="border:2px solid gray">

Making large datasets in Python is a pain, particularly if you're collecting data rather than generating it.  You may instead use *Excel* or *Notepad* as a faster alternative for collecting and organising data.  The question is, how do we get this data back into Python without taking even more time to enter it manually?

A common file type is the *.csv* file, which stands for *comma-separated values*.  Data in a *.csv* file is separated by commas - here, the commas play the role of ***delimiters***.

The `pandas` module has a handy function called `read_csv` that does exactly what it says on the tin.  Let's practice by first importing a file using `read_csv` then printing it.

In [None]:
import pandas as pd

df = pd.read_csv("company_sales_data.csv")
print(df)

The `head` method allows you to read off a particular number of rows, starting from the first.  In a notebook it also renders as a table, making it a little nicer to look at than the direct print.

In [None]:
df.head(5)

It's actually a rather nice-looking display layout.  Selecting a data column and an element from the column is easy - see below.

In [None]:
month = df['month_number']
print(month)
print("First month:", month[0])

[Return to contents](#0.)

<hr style="border:2px solid gray">

### 3.6 Curve fitting <a name="3.6"></a>

<hr style="border:2px solid gray">

So far, we have been plotting with data that we entirely understand.  However, when you actually collect data, it may well appear random.  It may indeed be random, or follow a relationship that is too complex to identify by hand.  **Curve fitting** will allow you to 'trial' several predicted relations between your data and variables; for instance, you could fit a quadratic curve against your data and see if you get a decent match.  The `scipy` library's `optimize` module has a very handy *routine* - `curve_fit` - that is very competent at doing this.

Let's start by creating a dataset with random 'noise'.  Our $x$-data is drawn from a uniform random distribution between $0$ and $100$, whilst our $y$-data is $3 \times$ this with an offset of $2$, plus random noise drawn from a normal distribution with mean $0$ and standard deviation $10$.  The noise is small compared with the range of $y$, so the points stay close to a straight line.  See where this is going?

In [None]:
import numpy as np
import matplotlib.pyplot as plt

x = np.random.uniform(0, 100, 100)
y = 3*x + 2 + np.random.normal(0, 10, 100)

plt.plot(x, y, '.')
plt.show()

Now we define a function that returns a straight line with formula $y = m x +  c$.

In [None]:
def linear(x, m, c):
    return m*x + c

Now we use `curve_fit` to perform a *least-squares fit*, which finds the line that minimises the sum of the squares of the vertical *offsets* (distances) of the points from the fitted line.

In [None]:
from scipy.optimize import curve_fit

popt, pcov = curve_fit(linear, x, y)

`curve_fit` returns two items, `popt` and `pcov`; we are interested in `popt`, which holds the best-fit parameters for the *gradient* $m$ and the $y$*-intercept* $c$.

In [None]:
popt

Note that these are close to the values $3$ and $2$ we chose for our data.
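
To see what the least-squares fit is doing, note that for a straight line the same minimisation can be done by other routines.  The sketch below regenerates data in the same way as above (with a fixed seed, which is an addition so that the result is repeatable), compares `curve_fit` with NumPy's `polyfit`, and evaluates the sum of squared offsets that both minimise.  The helper `sum_sq` is a name introduced here for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)  # fixed seed: an assumption, for repeatability
x = rng.uniform(0, 100, 100)
y = 3*x + 2 + rng.normal(0, 10, 100)

def linear(x, m, c):
    return m*x + c

def sum_sq(m, c):
    # Sum of squared vertical offsets between the data and the line
    return np.sum((y - linear(x, m, c))**2)

popt, pcov = curve_fit(linear, x, y)
m_poly, c_poly = np.polyfit(x, y, 1)  # degree-1 fit returns gradient, then intercept

print("curve_fit:", popt)
print("polyfit:  ", m_poly, c_poly)
print("Sum of squares at best fit:", sum_sq(*popt))
print("Sum of squares for m=3, c=2:", sum_sq(3, 2))
```

The fitted parameters give a sum of squares no larger than the 'true' values $3$ and $2$, because the fit follows this particular noisy sample rather than the underlying line.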

However, we have not accounted for any error in our data.  We shall now do so, assuming that each point has a vertical error of $\pm 10$.  Here, the `repeat` function builds an array containing the value $10$ repeated $100$ times, giving one error for each point.

In [None]:
err = np.repeat(10, 100)

plt.plot(x,y,'.')
plt.errorbar(x, y, yerr = err, fmt = 'none')
plt.show()

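Let's refit, this time passing the errors to `curve_fit` through its `sigma` argument, and see what we get for our best-fit parameters now.  Setting `absolute_sigma = True` tells `curve_fit` that `err` gives the actual size of the errors rather than just relative weights.

In [None]:
popt, pcov = curve_fit(linear, x, y, sigma = err, absolute_sigma = True)
popt

Essentially the same, which makes sense - every point has the same error, so all points are weighted equally and the best-fit line does not move.  We're more interested in `pcov`, the *covariance matrix* of the parameters: its diagonal elements are the *variances* (squared uncertainties) of $m$ and $c$, and its off-diagonal elements are their *covariance* (how the uncertainties in the two parameters are related).  The square root of each diagonal element gives the uncertainty on that parameter.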

In [None]:
print("m =", popt[0], "+/-", pcov[0,0]**0.5)
print("c =", popt[1], "+/-", pcov[1,1]**0.5)
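
The diagonal of `pcov` holds the variance of each parameter, so its square root gives the standard error; the off-diagonal entry is the covariance between $m$ and $c$.  Below is a short sketch of unpacking the whole matrix at once.  The data is regenerated with a fixed seed (an addition for repeatability), the errors are passed through `sigma` with `absolute_sigma=True` so that they are treated as actual sizes rather than relative weights, and the names `perr` and `corr` are introduced here for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(1)  # fixed seed: an assumption, for repeatability
x = rng.uniform(0, 100, 100)
y = 3*x + 2 + rng.normal(0, 10, 100)
err = np.repeat(10, 100)

def linear(x, m, c):
    return m*x + c

popt, pcov = curve_fit(linear, x, y, sigma=err, absolute_sigma=True)

perr = np.sqrt(np.diag(pcov))            # standard errors of m and c
corr = pcov[0, 1] / (perr[0] * perr[1])  # correlation between m and c

print("m = %.3f +/- %.3f" % (popt[0], perr[0]))
print("c = %.3f +/- %.3f" % (popt[1], perr[1]))
print("Correlation between m and c: %.2f" % corr)
```

The correlation comes out negative here: with all $x$ values positive, a steeper line needs a lower intercept to pass through the same cloud of points.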

Now we can plot a best-fit line from our `curve_fit` results:

In [None]:
xfine = np.linspace(0, 100, 100)  # Define values to plot the function for

plt.errorbar(x, y, yerr = err, fmt="none")
plt.plot(xfine, linear(xfine, *popt), 'r-')
plt.show()

A straight line - as predicted!
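
The plot shows a good match by eye.  One common way to put a number on it is the *reduced chi-squared*: the sum of the squared offsets, each divided by its error, divided by the number of points minus the number of fitted parameters.  Values near $1$ suggest that the model and the quoted errors are consistent with the data.  This check is an addition to the walkthrough, sketched here with regenerated data and a fixed seed; `chi2`, `dof` and `chi2_red` are names chosen for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(2)  # fixed seed: an assumption, for repeatability
x = rng.uniform(0, 100, 100)
y = 3*x + 2 + rng.normal(0, 10, 100)
err = np.repeat(10, 100)

def linear(x, m, c):
    return m*x + c

popt, pcov = curve_fit(linear, x, y, sigma=err, absolute_sigma=True)

residuals = y - linear(x, *popt)     # vertical offsets from the fitted line
chi2 = np.sum((residuals / err)**2)  # offsets measured in units of the error
dof = len(x) - len(popt)             # number of points minus fitted parameters
chi2_red = chi2 / dof

print("Reduced chi-squared:", chi2_red)
```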

Let's put it all together:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

x = np.random.uniform(0, 100, 100)
y = 3*x + 2 + np.random.normal(0, 10, 100)
err = np.repeat(10, 100)

def linear(x, m, c):
    return m*x + c

popt, pcov = curve_fit(linear, x, y, sigma = err, absolute_sigma = True)  # Fit, treating err as the actual error on each point

print("m =", popt[0], "+/-", pcov[0,0]**0.5)
print("c =", popt[1], "+/-", pcov[1,1]**0.5)

xfine = np.linspace(0, 100, 100)

plt.errorbar(x, y, yerr = err, fmt="none")
plt.plot(xfine, linear(xfine, *popt), 'r-')
plt.show()
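
The same recipe works for other models: only the function handed to `curve_fit` changes.  As a sketch of the quadratic case mentioned at the start of this section, the cell below fits $y = a x^2 + b x + c$ to made-up data.  The coefficients $0.5$, $-2$ and $1$, the noise level and the fixed seed are all arbitrary choices for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(3)  # fixed seed: an assumption, for repeatability
x = rng.uniform(0, 10, 100)
y = 0.5*x**2 - 2*x + 1 + rng.normal(0, 1, 100)  # made-up quadratic data with noise

def quadratic(x, a, b, c):
    return a*x**2 + b*x + c

popt, pcov = curve_fit(quadratic, x, y)
perr = np.sqrt(np.diag(pcov))  # uncertainty on each parameter

for name, value, error in zip("abc", popt, perr):
    print(name, "=", value, "+/-", error)
```

`curve_fit` works out how many parameters to fit from the arguments of the function after `x`, so here `popt` has three entries.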

[Return to contents](#0.)

<hr style="border:2px solid gray">

### 3.7 3D plots <a name="3.7"></a>

<hr style="border:2px solid gray">

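We've spent quite a while dealing with 2D plots - how about we enter the third dimension?  We can use the `axes3d` module from `mpl_toolkits.mplot3d`, a toolkit that ships with `matplotlib`, to do just that.  Importing it makes the `3d` projection available, which we need because plotting in 3 dimensions requires a different kind of axes.  Recent versions of `matplotlib` do this automatically, but the import does no harm.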

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d

To plot in 3D, we must specify the *projection* used - in this case, `3d`.

In [None]:
fig = plt.figure(figsize=(10,6))
ax1 = plt.axes(projection='3d')

This produces a pretty boring empty cube.  Let's generate an interesting line - this time using functions and data for 3 axes!

In [None]:
x = np.linspace(0,15,1000)
y = np.sin(x)
z = np.cos(x)

Now we plot as normal, this time with three data arguments.

In [None]:
ax1.plot(x,y,z)

ax1.set_xlabel('x axis')
ax1.set_ylabel('y axis')
ax1.set_zlabel('z axis')

plt.show()

Putting it all together:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d

fig = plt.figure(figsize=(10,6))
ax1 = plt.axes(projection='3d')

x = np.linspace(0,15,1000)
y = np.sin(x)
z = np.cos(x)

ax1.plot(x,y,z)

ax1.set_xlabel('x axis')
ax1.set_ylabel('y axis')
ax1.set_zlabel('z axis')

plt.show()
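
The same 3D axes can draw surfaces as well as lines.  As a sketch that goes slightly beyond the line plot above, `np.meshgrid` builds a grid of $x$ and $y$ values and `plot_surface` draws $z$ over that grid.  The function $z = \sin\sqrt{x^2 + y^2}$ and the `viridis` colour map are arbitrary choices for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d  # makes the 3D projection available on older matplotlib versions

x = np.linspace(-5, 5, 50)
y = np.linspace(-5, 5, 50)
X, Y = np.meshgrid(x, y)          # 2D grids of x and y coordinates
Z = np.sin(np.sqrt(X**2 + Y**2))  # height of the surface at each grid point

fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(projection='3d')

ax.plot_surface(X, Y, Z, cmap='viridis')

ax.set_xlabel('x axis')
ax.set_ylabel('y axis')
ax.set_zlabel('z axis')

plt.show()
```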

[Return to contents](#0.)

<hr style="border:2px solid gray">

# What now? <a name="4."></a>

If you want to crunch some numbers, check out the [challenges](DataChallenges.ipynb) that we've prepared for you.  We've also provided [solutions](DataSolutions.ipynb) for when you're ready to check your answers.

<hr style="border:2px solid gray">