# Solutions to the data manipulation challenges

Contents:
- [Solutions - Easy](#1.)
  - [E1 solution - Part 1](#1.1.1)
  - [E1 solution - Part 2](#1.1.2)
  - [E1 solution - Part 3](#1.1.3)
  - [E2 solution](#1.2)
- [Solutions - Medium](#2.)
  - [M1 solution](#2.1)
  - [M2 solution](#2.2)
  - [M3 solution - Part 1](#2.3.1)
  - [M3 solution - Part 2](#2.3.2)
  - [M3 solution - Part 3](#2.3.3)
- [Solutions - Hard](#3.)
  - [H1 solution - Part 1](#3.1.1)
  - [H1 solution - Part 2](#3.1.2)
  - [H1 solution - Part 3](#3.1.3)

# 1. Solutions - Easy <a name="1."></a>

### 1.1.1 E1 solution - Part 1 <a name="1.1.1"></a>

As we want this to work for any set of vertices, we read each coordinate from the user and convert it to a float with the *float(input())* command, then collect the coordinates into an array.  Next we write a user-defined function (UDF) that gives the area of the triangle from this array, using the formula provided.  Finally, we call the UDF at the end to run it.

First of all, the float inputs:

In [None]:
x1=float(input("x1:"))
y1=float(input("y1:"))
x2=float(input("x2:"))
y2=float(input("y2:"))
x3=float(input("x3:"))
y3=float(input("y3:"))

Now we can build our array of the vertices above.  However, naming it *vertices* or something relatively long would be inefficient and time-consuming as we will need to constantly write out the name of this array.  As such, we shall simply name our array `v`.

In [None]:
import numpy as np

v=np.array([x1, y1, x2, y2, x3, y3])

Now we shall write our UDF.  It will directly give us our result, so we should include a `print` function and text in string form so we don't just get an unmarked value as the output.  We also need to call it at the end.

In [None]:
def area(v):
    print(f"Area of triangle = {((v[2]*v[5])-(v[4]*v[3])-(v[0]*v[5])+(v[4]*v[1])+(v[0]*v[3])-(v[2]*v[1]))/2} units")
area(v)

Putting it all together:

In [None]:
import numpy as np

x1=float(input("x1: "))
y1=float(input("y1: "))
x2=float(input("x2: "))
y2=float(input("y2: "))
x3=float(input("x3: "))
y3=float(input("y3: "))

v=np.array([x1,y1,x2,y2,x3,y3])

def area(v):
    print(f"Area of triangle is {((v[2]*v[5])-(v[4]*v[3])-(v[0]*v[5])+(v[4]*v[1])+(v[0]*v[3])-(v[2]*v[1]))/2} units")
area(v)

### 1.1.2 E1 solution - Part 2 <a name="1.1.2"></a>

The example triangle with coordinates $(0, 0), (1, 0), (0, 2)$ is actually a very good test case, as it is a right-angled triangle.  This means its area is given by $A = \frac{1}{2} l_1 l_2$, where $l_1$ and $l_2$ are the two sides that meet at the right angle.  Here $l_1 = 1$ and $l_2 = 2$, so the area is 1 square unit - which matches what we get from running our function with these values used for the vertices!
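The same check can be scripted rather than typed in by hand.  The sketch below is one possible variant, not the solution itself: `triangle_area` is a hypothetical helper that takes the three vertices as `(x, y)` pairs and returns the value instead of printing it, so the result can be compared with the expected area.  It takes the absolute value because the formula gives a negative number when the vertices are listed clockwise.

```python
def triangle_area(p1, p2, p3):
    """Return the area of the triangle with vertices p1, p2, p3 (each an (x, y) pair)."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    # Same six terms as in the E1 formula, before halving
    twice_signed = x2*y3 - x3*y2 - x1*y3 + x3*y1 + x1*y2 - x2*y1
    return abs(twice_signed) / 2

test_area = triangle_area((0, 0), (1, 0), (0, 2))
print(test_area)  # 1.0
```

Listing the same vertices in the opposite order gives the same answer, which is a quick way to confirm the absolute value is doing its job.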

### 1.1.3 E1 solution - Part 3 <a name="1.1.3"></a>

Our function - with an example docstring included - is below.  A docstring should at least include the arguments accepted by a UDF and a brief overview of its purpose.

In [None]:
def area(v):
    '''
    The arguments of the function are the vertices, given by (x1,y1), (x2,y2) and (x3,y3).
    The function will return the area of a triangle defined by these vertices
    '''
    print(f"Area of triangle = {((v[2]*v[5])-(v[4]*v[3])-(v[0]*v[5])+(v[4]*v[1])+(v[0]*v[3])-(v[2]*v[1]))/2} units")
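A docstring is more than a comment: Python stores it on the function, which is how `help()` finds it.  The sketch below assumes a hypothetical variant called `area_value` that returns the result rather than printing it, uses one common docstring layout (a summary line, then the arguments, then the return value), and reads the docstring back with `inspect.getdoc`.

```python
import inspect

def area_value(v):
    """Return the signed area of a triangle.

    Args:
        v: sequence [x1, y1, x2, y2, x3, y3] holding the three vertices.

    Returns:
        The area as a float; negative if the vertices are listed clockwise.
    """
    return (v[2]*v[5] - v[4]*v[3] - v[0]*v[5] + v[4]*v[1] + v[0]*v[3] - v[2]*v[1]) / 2

# The cleaned-up docstring, as help(area_value) would display it
doc_text = inspect.getdoc(area_value)
print(doc_text)
```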

## 1.2 E2 solution <a name="1.2"></a>

The first step is to generate data for our parametric variable $t$.  `linspace` would be the best option as it generates evenly spaced datapoints.

In [None]:
t = np.linspace(-5,5,1000)

Now we write these parametric equations in a form that Python will recognise.

In [None]:
x1 = np.sin(t)*(math.e**(np.cos(t)) - 2*np.cos(4*t) + np.sin(t/12)**5)
y1 = np.cos(t)*(math.e**(np.cos(t)) - 2*np.cos(4*t) + np.sin(t/12)**5)

x2 = (t/2)*np.cos(t)
y2 = -(t/2)*np.sin(t)

Let's generate basic plots for both of these and see what we get.

In [None]:
plt.plot(x1,y1)
plt.show()

plt.plot(x2,y2)
plt.show()

There's definitely a problem using these values of $t$ for the Fermat spiral - it isn't actually making a spiral!  We can identify two problems:
- The numerical difference between the upper and lower bounds is not enough for an obvious curve
- We have two curves that curl in different directions

Therefore, we must set the lower bound of our $t$ data to $0$ and increase the upper bound - perhaps $50$ will do?

In [None]:
t = np.linspace(0,50,1000)

x1 = np.sin(t)*(math.e**(np.cos(t)) - 2*np.cos(4*t) + np.sin(t/12)**5)
y1 = np.cos(t)*(math.e**(np.cos(t)) - 2*np.cos(4*t) + np.sin(t/12)**5)

x2 = (t/2)*np.cos(t)
y2 = -(t/2)*np.sin(t)

plt.plot(x1,y1)
plt.show()

plt.plot(x2,y2)
plt.show()

The Fermat spiral is looking good, but now the Butterfly is all messed up!  The best approach would be to use a separate $t$ dataset for each plot.

In [None]:
t1 = np.linspace(-5,5,1000)
t2 = np.linspace(0,50,1000)

x1 = np.sin(t1)*(math.e**(np.cos(t1)) - 2*np.cos(4*t1) + np.sin(t1/12)**5)
y1 = np.cos(t1)*(math.e**(np.cos(t1)) - 2*np.cos(4*t1) + np.sin(t1/12)**5)

x2 = (t2/2)*np.cos(t2)
y2 = -(t2/2)*np.sin(t2)

plt.plot(x1,y1)
plt.show()

plt.plot(x2,y2)
plt.show()

Much better!  However, these plots would be meaningless to a random observer.  We'll need to label and title both of them.

In [None]:
plt.plot(x1,y1, 'g')
plt.title('Plot of a Butterfly Curve')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

plt.plot(x2,y2, 'b')
plt.title('Plot of a Fermat Spiral')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

Putting it all together:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import math

t1 = np.linspace(-5,5,1000)
t2 = np.linspace(0,50,1000)

x1 = np.sin(t1)*(math.e**(np.cos(t1)) - 2*np.cos(4*t1) + np.sin(t1/12)**5)
y1 = np.cos(t1)*(math.e**(np.cos(t1)) - 2*np.cos(4*t1) + np.sin(t1/12)**5)

x2 = (t2/2)*np.cos(t2)
y2 = -(t2/2)*np.sin(t2)

plt.plot(x1,y1, 'g')
plt.title('Plot of a Butterfly Curve')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

plt.plot(x2,y2, 'b')
plt.title('Plot of a Fermat Spiral')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
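One way to see why the choice of $t$ mattered so much for the spiral: from the equations above, $x_2^2 + y_2^2 = (t/2)^2$, so the distance from the origin depends only on $|t|$.  The sketch below checks this numerically using only the `math` module; `spiral_point` is a hypothetical helper, not part of the solution.

```python
import math

def spiral_point(t):
    """Return (x, y) on the spiral x = (t/2)cos(t), y = -(t/2)sin(t)."""
    return (t / 2) * math.cos(t), -(t / 2) * math.sin(t)

# Distance from the origin at a few values of t; each should equal |t|/2
radii = {t: math.hypot(*spiral_point(t)) for t in (-5.0, 5.0, 50.0)}
for t, r in radii.items():
    print(t, r)
```

The points at $t = -5$ and $t = 5$ sit at the same radius of 2.5 and are mirror images of each other in the *y*-axis, which is why the range $[-5, 5]$ gave two short arms curling in opposite directions.  Only a longer, non-negative range lets the radius grow through many turns.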

# 2. Solutions - Medium <a name="2."></a>

## 2.1 M1 solution <a name="2.1"></a>

As both plots use the same limits for *x*, we can use `linspace()` to generate the x-data for both plots.

In [None]:
x1=np.linspace(-np.pi, np.pi, 100)

As the first graph has more required features, we’ll be focusing on just the top graph from now on.  We’ll start by generating the cosine subplot, as it’s displayed first on the plot’s label.  The first step is actually generating the subplot.  We’ll use a $2 \times 1$ figure (two rows, one column) and place this graph in the first, top position, hence (2,1,1).  Our x-data is `x1`, our y-data is `np.cos(x1)`, the colour of our line is blue and we shall label our line `cosine`.  The legend showing this label will be located in the upper left-hand corner of the plot.

In [None]:
ax1=plt.subplot(2,1,1)
ax1.plot(x1, np.cos(x1), 'b',label=r'$cosine$')
ax1.legend(loc='upper left')

Now we can add our title:

In [None]:
ax1.set_title('Figure showing two subplots, with points labelled using LaTeX text')

The next step is to set the position of the x and y axes, which are called *spines* in `matplotlib`.  We want them to be central, so we set them to be central:

In [None]:
ax1.spines['left'].set_position('center')
ax1.spines['right'].set_color('none')
ax1.spines['bottom'].set_position('center')
ax1.spines['top'].set_color('none')

We repeat these steps for the sine graph.  Now we set the `x` and `y` ticks.  These are values used to show specific points on the coordinate axis.  They will be labelled, which we shall do after setting their position:

In [None]:
ax1.set_xticks([-np.pi,-np.pi/2,0,np.pi/2,np.pi])
ax1.set_yticks([-1,0,1])

Now to label the ticks:

In [None]:
ax1.set_xticklabels([r'$-\pi$',r'$-\pi/2$','0',r'$\pi/2$',r'$\pi$'])
ax1.set_yticklabels([r'-1',r'0',r'1'])

We’ll describe how to do the annotation for the cosine subplot:

In [None]:
# Raw string (r'...') so that \pi reaches the LaTeX renderer as written
ax1.annotate(r'$cos(2\pi/3)=-1/2$', xy=(2*np.pi/3,-1/2),
             arrowprops=dict(arrowstyle='-',connectionstyle='angle3'), xytext=(1.5, 0.8))

Putting it all together:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import math


x1=np.linspace(-np.pi, np.pi, 100)

ax1=plt.subplot(2,1,1)
ax1.plot(x1, np.cos(x1), 'b',label=r'$cosine$')
ax1.legend(loc='upper left')

ax1.set_title('Figure showing a subplot, with a point on the line labelled using LaTeX text')

ax1.spines['left'].set_position('center')
ax1.spines['right'].set_color('none')
ax1.spines['bottom'].set_position('center')
ax1.spines['top'].set_color('none')

ax1.set_xticks([-np.pi,-np.pi/2,0,np.pi/2,np.pi])
ax1.set_yticks([-1,0,1])

ax1.set_xticklabels([r'$-\pi$',r'$-\pi/2$','0',r'$\pi/2$',r'$\pi$'])
ax1.set_yticklabels([r'-1',r'0',r'1'])

ax1.annotate(r'$cos(\frac{2\pi}{3})=-\frac{1}{2}$', xy=(2*np.pi/3,-1/2),
arrowprops=dict(arrowstyle='-',connectionstyle='angle3'), xytext=(1.5, 0.8))

plt.show()
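Before annotating a point, it is worth confirming that it really lies on the curve.  The sketch below is an optional check rather than part of the solution: it evaluates the cosine at $2\pi/3$ with the `math` module and compares it with $-1/2$ using a tolerance, since the floating-point result is not exactly $-0.5$.

```python
import math

x_point = 2 * math.pi / 3
y_point = math.cos(x_point)

# Floating-point cosine is not exactly -0.5, so compare with a tolerance
on_curve = math.isclose(y_point, -0.5)
print(y_point, on_curve)
```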

## 2.2 M2 solution <a name="2.2"></a>

We'll need to use arrays; a clear giveaway is that we're given our data in a component-style layout.  So, let's start by writing our data in array form.

In [None]:
p1 = np.array([8.705,1.114,1.092,-3.410])
p2 = np.array([6.808,-5.209,-14.834,7.561])
p3 = np.array([4.282,1.750,3.788,6.835])

A good next step would be to write the formula for invariant mass in Python mathematical syntax.

In [None]:
M = (x[0] + y[0])**2 - (x[1] + y[1])**2 - (x[2] + y[2])**2 - (x[3] + y[3])**2

Now to start comparing.  The best way forward would be a simple accept/reject rule, so this will involve a single `if` statement.  The criterion for our `if` statement will be whether a pair of photons has an invariant mass similar to that of the Higgs boson; that is, 125 GeV.

However, it's unlikely that a pair of photons will have an invariant mass of *exactly* 125 GeV, so the *equal to* operator `==` will not work.  This is where the `isclose` function comes in - our check will instead be gauging how close the invariant mass of a photon pair is to 125 GeV.
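The difference between `==` and `math.isclose` is easy to see on a couple of numbers.  In the sketch below the candidate values are made up purely for illustration; `abs_tol=5.0` means "accept anything within 5 of the target".

```python
import math

higgs_mass = 125.0   # GeV, the target value used in this section
candidate = 124.9    # made-up value, purely for illustration

exact_match = (candidate == higgs_mass)
close_match = math.isclose(candidate, higgs_mass, abs_tol=5.0)
far_match = math.isclose(17.8, higgs_mass, abs_tol=5.0)   # another made-up value
print(exact_match, close_match, far_match)  # False True False
```

Without `abs_tol`, `math.isclose` falls back on a very tight relative tolerance, so the tolerance should always be chosen deliberately for a check like this.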

It would be a real pain to rewrite our `if` statement for each possible photon pair, so a UDF is best used here.  Generally, multiple instances of similar-looking input data are a signal to use UDFs.  We'll incorporate our formula for invariant mass and the `if` statement into our UDF.

In [None]:
def Minv(x,y):
    M = (x[0] + y[0])**2 - (x[1] + y[1])**2 - (x[2] + y[2])**2 - (x[3] + y[3])**2
    if math.isclose(M, 125, abs_tol = 5.0):
        print("The photons",x,"and",y, "have an invariant mass of", M, "GeV and therefore came from a Higgs decay")

The full code is below, with the relevant modules imported and all possible combinations called.  Note that there is only an output for the correct (1,3) photon pair as we have not included an *else* condition.

In [None]:
import math
import numpy as np

p1 = np.array([8.705,1.114,1.092,-3.410])
p2 = np.array([6.808,-5.209,-14.834,7.561])
p3 = np.array([4.282,1.750,3.788,6.835])

def Minv(x,y):
    M = (x[0] + y[0])**2 - (x[1] + y[1])**2 - (x[2] + y[2])**2 - (x[3] + y[3])**2
    if math.isclose(M, 125, abs_tol = 5.0):
        print("The photons",x,"and",y, "have an invariant mass of", M, "GeV and therefore came from a Higgs decay")
   
Minv(p1,p2)
Minv(p2,p3)
Minv(p1,p3)
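With three photons, calling the function by hand for each pair is manageable, but the number of pairs grows quickly.  The sketch below shows one possible way to let `itertools.combinations` generate every pair.  It assumes a hypothetical helper `m_value` that returns the same expression as `M` in `Minv` instead of printing, and it stores the photons as plain tuples in a dictionary keyed by photon number.

```python
import math
from itertools import combinations

photons = {
    1: (8.705, 1.114, 1.092, -3.410),
    2: (6.808, -5.209, -14.834, 7.561),
    3: (4.282, 1.750, 3.788, 6.835),
}

def m_value(x, y):
    """Same expression as M in Minv, returned rather than printed."""
    return (x[0] + y[0])**2 - (x[1] + y[1])**2 - (x[2] + y[2])**2 - (x[3] + y[3])**2

# Keep only the pairs whose value is within 5 of 125
higgs_pairs = [
    (i, j)
    for (i, x), (j, y) in combinations(photons.items(), 2)
    if math.isclose(m_value(x, y), 125, abs_tol=5.0)
]
print(higgs_pairs)  # [(1, 3)]
```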

### 2.3.1 M3 solution - Part 1 <a name="2.3.1"></a>

Our first step is to import `pandas` in order to read our *.csv* file, as well as `matplotlib` in order to generate plots from the dataset *company_sales_data.csv*.  We can also store the result of the `read_csv` function of `pandas` in the variable `dat` in order to save time.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt  

dat = pd.read_csv("company_sales_data.csv")

Now to assign names to the various data columns so we can quickly and easily refer to them in their entirety, or by their element.  Though we won't yet be needing data for the total units of product sold or the total profit for the year, including them now will save time later.

In [None]:
mon = dat['month_number']
cream = dat['facecream']
wash = dat['facewash']
paste = dat['toothpaste']
soap = dat['bathingsoap']
shampoo = dat['shampoo']
moist = dat['moisturizer']
units = dat['total_units']
profit = dat['total_profit']

Since a single plot was specified, we won't need to muck around with any subplots.  Let's add a line for each product and label it so we can later render a legend.

In [None]:
plt.plot(mon, cream, label = 'Facecream sales')
plt.plot(mon, wash, label = 'Facewash sales')
plt.plot(mon, paste, label = 'Toothpaste sales')
plt.plot(mon, soap, label = 'Soap sales')
plt.plot(mon, shampoo, label = 'Shampoo sales')
plt.plot(mon, moist, label = 'Moisturizer sales')

It doesn't look the best, does it?  We can resolve this by adding markers - perhaps dot markers?  Simply add `marker = 'o'` to each `plt.plot` command and observe.

In [None]:
plt.plot(mon, cream, label = 'Facecream sales', marker='o')
plt.plot(mon, wash, label = 'Facewash sales',  marker='o')
plt.plot(mon, paste, label = 'Toothpaste sales', marker='o')
plt.plot(mon, soap, label = 'Soap sales', marker='o')
plt.plot(mon, shampoo, label = 'Shampoo sales', marker='o')
plt.plot(mon, moist, label = 'Moisturizer sales', marker='o')

We label our plot and generate the legend like so.  Since the progression of the year doesn't exactly depend on product sales, the `month_number` column will be our independent variable and we thus use the months as our *x*-axis and our product sales as the various *y* axes.  Another giveaway would be that the months are present for each product dataset, making it a common axis - the *x*-axis. 

In [None]:
plt.xlabel('Month Number')
plt.ylabel('Number of units')
plt.legend(loc='upper left')
plt.title('Sales data')
plt.xticks(mon)

Let's give it a trial run to see how the legend interacts with our various lines:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt  

dat = pd.read_csv("company_sales_data.csv")

mon = dat['month_number']
cream = dat['facecream']
wash = dat['facewash']
paste = dat['toothpaste']
soap = dat['bathingsoap']
shampoo = dat['shampoo']
moist = dat['moisturizer']
units = dat['total_units']
profit = dat['total_profit']

plt.plot(mon, cream, label = 'Facecream sales', marker='o')
plt.plot(mon, wash, label = 'Facewash sales',  marker='o')
plt.plot(mon, paste, label = 'Toothpaste sales', marker='o')
plt.plot(mon, soap, label = 'Soap sales', marker='o')
plt.plot(mon, shampoo, label = 'Shampoo sales', marker='o')
plt.plot(mon, moist, label = 'Moisturizer sales', marker='o')

plt.xlabel('Month Number')
plt.ylabel('Number of units')
plt.legend(loc='upper left')
plt.title('Sales data')
plt.show()

Hmm, not the best.  We'll need to resolve this by adjusting our *x* and *y* ticks.  Though having labels for every 2 months works, we may as well use `month_number` for our *x* ticks.  The *y* ticks are a little trickier, as we need to find a way of generating 'dead space' for our legend without 'squishing' our lines together too much.  A good approach would be to have several of the *y* ticks take values above the highest monthly sales value for a single product - in this case, the `14000` units of soap sold in December.

This is right at the end, however, so we should look more towards the highest values earlier in the year.  Soap sales take the highest value here again at `9550` units in March, so we can use this as a base.  The *x* and *y* ticks below should work:

In [None]:
plt.xticks(mon)
plt.yticks([1000, 2000, 4000, 6000, 8000, 10000, 12000, 15000, 18000])
plt.show()

Putting it all together:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt  

dat = pd.read_csv("company_sales_data.csv")

mon = dat['month_number']
cream = dat['facecream']
wash = dat['facewash']
paste = dat['toothpaste']
soap = dat['bathingsoap']
shampoo = dat['shampoo']
moist = dat['moisturizer']
units = dat['total_units']
profit = dat['total_profit']

plt.plot(mon, cream, label = 'Facecream sales', marker='o')
plt.plot(mon, wash, label = 'Facewash sales',  marker='o')
plt.plot(mon, paste, label = 'Toothpaste sales', marker='o')
plt.plot(mon, soap, label = 'Soap sales', marker='o')
plt.plot(mon, shampoo, label = 'Shampoo sales', marker='o')
plt.plot(mon, moist, label = 'Moisturizer sales', marker='o')

plt.xlabel('Month number')
plt.ylabel('Number of units')
plt.legend(loc='upper left')
plt.title('Sales data')
plt.xticks(mon)
plt.yticks([1000, 2000, 4000, 6000, 8000, 10000, 12000, 15000, 18000])
plt.show()
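The top tick can also be worked out rather than chosen by eye.  The sketch below is only one possible rule, with assumed parameters: `tick_ceiling` is a hypothetical helper that adds 25% headroom above the tallest value and rounds up to the next multiple of 1000.  Applied to the peak of `14000` quoted above, it lands on the same top tick used in the solution.

```python
import math

def tick_ceiling(peak, headroom=0.25, step=1000):
    """Round peak * (1 + headroom) up to the next multiple of step."""
    return math.ceil(peak * (1 + headroom) / step) * step

top_tick = tick_ceiling(14000)
print(top_tick)  # 18000
```

The amount of headroom is a judgement call: more of it leaves extra room for the legend but squeezes the lines closer together.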

### 2.3.2 M3 solution - Part 2 <a name="2.3.2"></a>

Let's generate a really basic bar graph for facecream data and facewash data to see what we get.

In [None]:
# mon is the month_number column defined in Part 1
plt.bar(mon, cream)
plt.bar(mon, wash)

plt.show()

This doesn't look the best - it may be better to try and generate the bars side-by-side for each product.  Let's start by reducing the bar width so side-by-side bars can actually fit next to each other:

In [None]:
plt.bar(mon, cream, width = 0.25)
plt.bar(mon, wash, width = 0.25)

plt.show()

Now let's adjust the *x* ticks so every month is showing.  Again, this isn't necessary, but it looks good.

In [None]:
plt.bar(mon, cream, width = 0.25)
plt.bar(mon, wash, width = 0.25)

plt.xticks(mon)
plt.show()

Now to separate the bars for each product - we do this by shifting them in different directions.  Though we wish to preserve the *y* data, the *x* data is more of a reference frame than anything, so shifting our *x* data doesn't actually matter.

How to go about this?  We can use a list comprehension (a compact `for` loop) to shift every data point in `month_number` - 'left' for the facecream data and 'right' for the facewash data.

In [None]:
# Shift each set of bars a quarter of a unit either side of the month tick
plt.bar([a-0.25 for a in mon], cream, width=0.25)
plt.bar([a+0.25 for a in mon], wash, width=0.25)

plt.xticks(mon)
plt.show()

Add labels, a title and a legend, and we're good to go.

In [None]:
plt.xlabel('Month number')
plt.ylabel('Number of units')
plt.legend(loc='upper left')
plt.title('Facewash and facecream sales data')

Putting it all together:

In [None]:
# Assumes mon, cream and wash have been defined as in Part 1
plt.bar([a-0.25 for a in mon], cream, width=0.25, label = 'Facecream sales')
plt.bar([a+0.25 for a in mon], wash, width=0.25, label = 'Facewash sales')

plt.xlabel('Month number')
plt.ylabel('Number of units')
plt.legend(loc='upper left')
plt.title('Facewash and facecream sales data')

plt.xticks(mon)
plt.show()
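The shifts of $\pm 0.25$ were picked by hand for two products.  For a different number of products, the shifts can be computed instead.  The sketch below assumes a hypothetical helper `bar_offsets`, which spaces the bars one bar-width apart and centres the whole group on the tick.  Note that this makes neighbouring bars touch, whereas the hand-picked shifts leave a gap between them.

```python
def bar_offsets(n_series, width):
    """Return the x-shift for each of n_series bars so the group is centred on the tick."""
    return [(i - (n_series - 1) / 2) * width for i in range(n_series)]

two = bar_offsets(2, 0.25)
three = bar_offsets(3, 0.25)
print(two)    # [-0.125, 0.125]
print(three)  # [-0.25, 0.0, 0.25]
```

Each offset would then be added to the month numbers in the same way as before, with one shifted list of *x* positions per `plt.bar` call.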

### 2.3.3 M3 solution - Part 3 <a name="2.3.3"></a>

Since this is for the whole year, we'll need to total up the monthly sales for each product.  We do this using the `sum` method that each `pandas` column provides.  Another complication with pie charts is that you can't add segments one at a time, the way you add lines to a line graph - you need to collect all segment data into a single list.  Implementing this isn't actually too bad:

In [None]:
sales = [cream.sum(), wash.sum(), paste.sum(), soap.sum(), shampoo.sum(), moist.sum()]

We also need to collect our labels into a list:

In [None]:
lab = ['Facecream', 'Facewash', 'Toothpaste', 'Bathing soap', 'Shampoo', 'Moisturizer']

Let's generate a 'prototype' pie chart.

In [None]:
sales = [cream.sum(), wash.sum(), paste.sum(), soap.sum(), shampoo.sum(), moist.sum()]
lab = ['Facecream', 'Facewash', 'Toothpaste', 'Bathing soap', 'Shampoo', 'Moisturizer']

plt.pie(sales, labels=lab)
plt.show()

This already looks like what we're after, but we can do better.  Let's make a title and a legend.

In [None]:
sales = [cream.sum(), wash.sum(), paste.sum(), soap.sum(), shampoo.sum(), moist.sum()]
lab = ['Facecream', 'Facewash', 'Toothpaste', 'Bathing soap', 'Shampoo', 'Moisturizer']

plt.pie(sales, labels=lab)

plt.title('Sales data')
plt.legend(loc='lower right')
plt.show()

On second thoughts, let's not use a legend... our segments are already labelled anyway.

As suggested, we use the `autopct` option for `plt.pie` to render percentages on our plot:

In [None]:
plt.pie(sales, labels=lab, autopct='%1.1f%%')

Now to show which product generated the most sales income - this would be bathing soap.  The best way to go about this would be the `explode` option.

In [None]:
boom = [0,0,0,0.2,0,0]
plt.pie(sales, labels = lab, autopct='%1.1f%%', explode = boom)

Putting it all together:

In [None]:
sales = [cream.sum(), wash.sum(), paste.sum(), soap.sum(), shampoo.sum(), moist.sum()]
lab = ['Facecream', 'Facewash', 'Toothpaste', 'Bathing soap', 'Shampoo', 'Moisturizer']

boom = [0,0,0,0.2,0,0]

plt.pie(sales, labels = lab, autopct='%1.1f%%', explode = boom)

plt.title('Sales data')

plt.show()
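Two details of the pie chart can be reproduced by hand: the percentages that `autopct` displays are each value divided by the total, and the `explode` list can be built automatically instead of counting positions.  The sketch below uses made-up totals, not the real sales figures, and plain Python lists.

```python
totals = [30, 20, 50]   # made-up totals, NOT the real sales figures

# Share of the whole for each segment, as a percentage
percentages = [100 * value / sum(totals) for value in totals]

# Build the explode list automatically: offset only the largest segment
largest = totals.index(max(totals))
boom = [0.2 if i == largest else 0 for i in range(len(totals))]
print(percentages)  # [30.0, 20.0, 50.0]
print(boom)         # [0, 0, 0.2]
```

Building `boom` this way keeps the highlighted segment correct even if the order of the products in the list changes.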

# 3. Solutions - Hard <a name="3."></a>

### 3.1.1 H1 solution - Part 1 <a name="3.1.1"></a>

### 3.1.2 H1 solution - Part 2 <a name="3.1.2"></a>

### 3.1.3 H1 solution - Part 3 <a name="3.1.3"></a>