# TMA 01, question 3 (45 marks)

**Name**: Daniel Smith
    
**PI**: A7603242

### The task

In question 1, you started looking at a dataset covering the wholesale values for fruit and vegetables for the years 2004-2012. The dataset was available at:

https://data.gov.uk/dataset/agricultural_market_reports
    
The entire dataset is available for download, but we have also provided a download of the data in the file <code>fruitveg.csv</code> in the <code>data</code> directory.

Visit the website via the above link, read the description of the data and then answer the following questions.

You are now required to combine some of the information from this dataset with data about the amount of orchard space in England and Wales, which you can download from this site:

    https://data.gov.uk/dataset/orchard_fruit_survey

but which is also contained in the file <code>orchfruit_ap&pr_30may13.csv</code> in the <code>data</code> directory.

**You must produce a graphical representation of the changes in average wholesale price and orchard space for each type of dessert apple grown in England and Wales. You should then discuss what your representation shows.**

*(45 marks)*

### Some guidance

This TMA question gives you the opportunity to demonstrate your mastery of the techniques in carrying out a small-scale data analysis. Specifically, this question requires you to clean two datasets, combine and reshape them, and graphically present the cleaned data. All the techniques required to answer this question can be found in Parts 2-5, and are illustrated in the associated notebooks.

There are many ways you could approach this task. One is to produce a pandas dataframe that lists, for each variety of apple and each year, the average wholesale price for that year and the total orchard space given over to that variety. The final dataframe could look something like this:



|Apple variety|Year|Average wholesale price|Orchard space|
|---|---|---|---|
|Cox|2004 | 45 | 12 |
|Cox|2012 | 45 | 12 |
|Worcester|2004 | 23 | 1 |
|Worcester|2012 | 23 | 1 |
|$\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ |


(although note that the figures 45, 12, 23 and 1 are just for illustration; they are not necessarily the correct values for the question).


You should then construct one or more plots showing how the relationship between the type of apple, the average wholesale price of that type, and the England and Wales orchard space for that type has changed over the period 2004-2015. You should also give an explanation of what you believe the plot shows.

This question requires that you complete a number of tasks:

1. You need to examine the datasets. You should consider questions such as how missing data is handled, whether there is any dirtiness or ambiguity in the data, and any differences in how data is represented in the two datasets. This task uses the techniques described in Part 3, section 2.

2. You will need to capture the data in a dataframe in the form described above. This task uses the techniques described in Part 3, section 3 and Part 4.

3. Finally, you should select a visualisation method for the data in the dataset, and present a plot of the data, with a description of how you think it should be interpreted. This task uses the techniques described in Part 5. We are not prescribing a particular choice of visualisation: you should choose one that you think is appropriate.


It is crucial for this question to bear in mind that at each stage, you must describe what you have done in sufficient detail that someone could replicate your work. This means that you must:

* explain what any code that you have written does, and execute it in the body of your submitted notebook,

* include some screenshots wherever you have used tools that are not accessed via Python or the notebooks (such as OpenRefine), to show what you did and to help the marker understand your thinking,

* clearly explain any assumptions or simplifications that you have made about the data, and

* interpret your final results in the context of these assumptions and simplifications.


Some guidance on presentation:

* You must present your answer in this notebook.
    
* Do not put too much text or code into each notebook cell. Each cell should contain one or two paragraphs at most, or around ten lines of Python.

* Ensure that in your code, you use meaningful variable names.

* You should have a specific cell whose return value is the dataframe described above.

* You should have a specific cell which plots the data in the dataframe.

We have provided a structure for your answer, and you should describe your work under the appropriate headings in the rest of the TMA. The headings do not represent equal amounts of work, nor do they necessarily carry the same weight as the equivalent headings in question 2, because different datasets and different tasks require the effort to be spent in different places. You may need to use several cells to address a particular heading. For example, you would expect to present substantially more work on identifying and handling the missing data than on importing the datasets.

### Your answer

#### 1. Import the two datasets

In [1]:
# Load the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

In [2]:
# read in the cleaned fruitveg_dcs.csv referenced as apples_df
apples_df = pd.read_csv('data/fruitveg_dcs.csv')

# check it looks ok
apples_df[:5]

Unnamed: 0,Year,Variety,Quality,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
0,2004,Cox’s Orange-group,1st,0.62,0.63,0.6,0.58,,,,,0.68,0.51,0.53,0.58
1,2004,Cox’s Orange-group,2nd,0.4,0.4,0.4,0.42,,,,,0.39,0.33,0.3,0.33
2,2004,Cox’s Orange-group,Ave,0.55,0.54,0.5,0.51,,,,,0.64,0.45,0.46,0.5
3,2004,Discovery,1st,,,,,,,0.75,0.51,0.37,,,
4,2004,Discovery,2nd,,,,,,,0.66,0.33,0.25,,,


In [3]:
months_list = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']

# average each row across the month columns; months with no price (NaN) are skipped
apples_df['Average'] = apples_df[months_list].mean(axis=1)
apples_df[:5]

# This post was very helpful
# https://stackoverflow.com/questions/25748683/pandas-sum-dataframe-rows-for-given-columns

Unnamed: 0,Year,Variety,Quality,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec,Average
0,2004,Cox’s Orange-group,1st,0.62,0.63,0.6,0.58,,,,,0.68,0.51,0.53,0.58,0.59125
1,2004,Cox’s Orange-group,2nd,0.4,0.4,0.4,0.42,,,,,0.39,0.33,0.3,0.33,0.37125
2,2004,Cox’s Orange-group,Ave,0.55,0.54,0.5,0.51,,,,,0.64,0.45,0.46,0.5,0.51875
3,2004,Discovery,1st,,,,,,,0.75,0.51,0.37,,,,0.543333
4,2004,Discovery,2nd,,,,,,,0.66,0.33,0.25,,,,0.413333
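
The `Average` column relies on `DataFrame.mean(axis=1)` skipping missing values by default (`skipna=True`), so each row is averaged only over the months in which a price was recorded. A minimal sketch, re-entering the first row of the table above by hand rather than reading the file (`example_df` and `months_with_price` are illustrative names):

```python
import numpy as np
import pandas as pd

months_list = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# 2004 Cox 1st-quality prices copied from the output above; NaN marks months with no price
cox_2004_prices = [0.62, 0.63, 0.6, 0.58, np.nan, np.nan,
                   np.nan, np.nan, 0.68, 0.51, 0.53, 0.58]
example_df = pd.DataFrame([cox_2004_prices], columns=months_list)

# mean(axis=1) averages across the columns of each row, ignoring NaN by default
example_df['Average'] = example_df[months_list].mean(axis=1)

# count how many months actually contributed to each row's average
months_with_price = example_df[months_list].notna().sum(axis=1)
```

The yearly figure is therefore an average over the months with a recorded price, not over all twelve months. This matters for a seasonal variety such as Discovery, which has prices for only three months of 2004.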


In [4]:
apples_df = apples_df[['Year','Variety','Quality','Average']]
apples_df[:5]

Unnamed: 0,Year,Variety,Quality,Average
0,2004,Cox’s Orange-group,1st,0.59125
1,2004,Cox’s Orange-group,2nd,0.37125
2,2004,Cox’s Orange-group,Ave,0.51875
3,2004,Discovery,1st,0.543333
4,2004,Discovery,2nd,0.413333


In [5]:
# average the yearly figures over the Quality rows for each Year and Variety
group = apples_df.groupby(['Year', 'Variety'])
group_df = pd.DataFrame(group['Average'].mean())

# reshape so that each variety becomes a column, indexed by Year
apples_df = group_df.pivot_table(index=['Year'], columns='Variety', values='Average')
apples_df[:12]


Year,Braeburn,Cox’s Orange-group,Discovery,Egremont Russet,Gala,Jonogold – group,Katy,Other Early Season,Other Late Season,Other Mid Season,Red Pippin,Spartan,Worcester Pearmain
2004,,0.49375,0.49,0.582857,0.43631,0.426667,0.463333,0.46,0.464286,0.5175,0.3425,0.382,0.445
2005,,0.456852,0.407778,0.517024,0.376111,0.399667,,0.5,0.45125,0.415,0.53,0.391,0.426667
2006,,0.515787,0.46,0.537196,0.457639,0.424,,0.465,0.48375,0.526,0.45,0.467778,0.485
2007,,0.599259,0.548333,0.610444,0.501429,0.451667,,0.5525,0.53125,0.516,0.496667,0.428333,0.471667
2008,,0.62787,0.552778,0.672361,0.523333,0.450357,,0.55,0.55,0.567143,0.445556,0.501111,0.573333
2009,,0.626917,0.533889,0.682083,0.588148,0.503651,,0.51,0.57875,0.515,0.53,0.531111,0.451667
2010,,0.65,0.545,0.654815,0.547167,0.461111,,0.616667,0.611111,0.58,0.555,0.59,0.393333
2011,,0.726806,0.623333,0.672917,0.638148,0.55,,0.535,0.65375,0.63,,0.628444,0.645
2012,,0.728009,0.726667,0.767143,0.681429,0.633333,,0.8225,0.7425,0.65,0.725,0.699444,0.684444
2013,,0.753571,0.618333,0.734127,0.686444,0.596667,,0.5575,0.718571,0.668571,0.66,0.608611,0.543889
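
The reshaping in the cell above can be checked on a small example. The sketch below uses the three Cox rows for 2004 shown earlier plus the two Discovery rows that are visible. `unstack` is used here as an assumed alternative to `pivot_table` that gives the same wide layout, and `long_df` and `wide_df` are illustrative names.

```python
import pandas as pd

# long-form sample: one row per Year, Variety and Quality
long_df = pd.DataFrame({
    'Year': [2004, 2004, 2004, 2004, 2004],
    'Variety': ['Cox', 'Cox', 'Cox', 'Discovery', 'Discovery'],
    'Quality': ['1st', '2nd', 'Ave', '1st', '2nd'],
    'Average': [0.59125, 0.37125, 0.51875, 0.543333, 0.413333],
})

# one mean per (Year, Variety), then move Variety from the row index to the columns
wide_df = long_df.groupby(['Year', 'Variety'])['Average'].mean().unstack('Variety')
```

The Cox value reproduces the 0.49375 shown for 2004, which confirms that the notebook's figure is the mean of the `1st`, `2nd` and `Ave` rows taken together. That is a simplification worth stating as an assumption; using only the `Ave` rows would be another defensible choice.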


In [6]:
# read in the cleaned orchard_dcs.csv as orchards_df
orchards_df = pd.read_csv('data/orchard_fruit_dcs.csv')

# check it looks ok
orchards_df

Unnamed: 0,year,Braeburn,Cameo,Cox(andclones),Discovery,EgremontRusset,Fiesta/RedPippin,Gala(andclones),Jonagold(andclones),Jazz,Kanzi,Katy,Spartan,WorcesterPearmain,Other dessert varieties
0,1999,,,4694,577,325,186.0,757,400,,,,299.0,294,1276
1,2000,,,4186,484,334,163.0,828,353,,,,286.0,283,1194
2,2001,,,3489,420,331,133.0,719,257,,,,257.0,207,1209
3,2002,,,3015,339,268,109.0,663,201,,,,195.0,196,945
4,2003,306.0,,2738,264,264,,674,227,,,,142.0,147,729
5,2004,194.0,,3144,301,308,,669,231,,,,164.0,213,957
6,2007,271.0,,2128,189,277,,740,204,,,,137.0,124,877
7,2009,304.0,36.0,1798,177,293,,878,316,83.0,144.0,133.0,,133,724
8,2012,509.0,41.0,1697,157,224,,1312,283,117.0,96.0,129.0,,115,847


The apples_df needs to have an average for the year

In [7]:
orchards_df.set_index(['year'], inplace=True)
current_cols = orchards_df.columns
current_cols

Index(['Braeburn', 'Cameo', 'Cox(andclones)', 'Discovery', 'EgremontRusset',
       'Fiesta/RedPippin', 'Gala(andclones)', 'Jonagold(andclones)', 'Jazz',
       'Kanzi', 'Katy', 'Spartan', 'WorcesterPearmain',
       'Other dessert varieties'],
      dtype='object')

In [8]:
apple_cols = apples_df.columns
apple_cols

Index(['Braeburn', 'Cox’s Orange-group', 'Discovery', 'Egremont Russet',
       'Gala', 'Jonogold – group', 'Katy', 'Other Early Season',
       'Other Late Season', 'Other Mid Season', 'Red Pippin', 'Spartan',
       'Worcester Pearmain'],
      dtype='object', name='Variety')

In [9]:

# Make a list of the columns containing the data we need to combine
others_list = ['Other Early Season', 'Other Mid Season','Other Late Season']

# add a new `Other Dessert Varieties` column holding the average price of the other varieties
apples_df['Other Dessert Varieties'] = apples_df[others_list].mean(axis=1)
# then drop the columns that are no longer needed
apples_df.drop(others_list, axis=1, inplace=True)

# check it looks ok
apples_df[:5]

Year,Braeburn,Cox’s Orange-group,Discovery,Egremont Russet,Gala,Jonogold – group,Katy,Red Pippin,Spartan,Worcester Pearmain,Other Dessert Varieties
2004,,0.49375,0.49,0.582857,0.43631,0.426667,0.463333,0.3425,0.382,0.445,0.480595
2005,,0.456852,0.407778,0.517024,0.376111,0.399667,,0.53,0.391,0.426667,0.455417
2006,,0.515787,0.46,0.537196,0.457639,0.424,,0.45,0.467778,0.485,0.491583
2007,,0.599259,0.548333,0.610444,0.501429,0.451667,,0.496667,0.428333,0.471667,0.53325
2008,,0.62787,0.552778,0.672361,0.523333,0.450357,,0.445556,0.501111,0.573333,0.555714


In [12]:
# Comparing the two column lists we can see there is some inconsistency between the naming of the individual varieties:
# the orchards_df names have no spaces and some carry an "(andclones)" suffix
orchard_cols = orchards_df.columns
orchard_cols

Index(['Braeburn', 'Cameo', 'Cox(andclones)', 'Discovery', 'EgremontRusset',
       'Fiesta/RedPippin', 'Gala(andclones)', 'Jonagold(andclones)', 'Jazz',
       'Kanzi', 'Katy', 'Spartan', 'WorcesterPearmain',
       'Other dessert varieties'],
      dtype='object')

In [14]:
apple_cols = apples_df.columns
apple_cols

Index(['Braeburn', 'Cox’s Orange-group', 'Discovery', 'Egremont Russet',
       'Gala', 'Jonogold – group', 'Katy', 'Red Pippin', 'Spartan',
       'Worcester Pearmain', 'Other Dessert Varieties'],
      dtype='object', name='Variety')
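
Rather than comparing the two listings by eye, the mismatches can be found with set operations. A small sketch, with the names typed in from the two outputs above (`shared`, `only_orchards` and `only_apples` are illustrative names):

```python
# column names copied from the orchards_df and apples_df outputs above
orchard_names = {'Braeburn', 'Cameo', 'Cox(andclones)', 'Discovery', 'EgremontRusset',
                 'Fiesta/RedPippin', 'Gala(andclones)', 'Jonagold(andclones)', 'Jazz',
                 'Kanzi', 'Katy', 'Spartan', 'WorcesterPearmain',
                 'Other dessert varieties'}
apple_names = {'Braeburn', 'Cox’s Orange-group', 'Discovery', 'Egremont Russet',
               'Gala', 'Jonogold – group', 'Katy', 'Red Pippin', 'Spartan',
               'Worcester Pearmain', 'Other Dessert Varieties'}

shared = sorted(orchard_names & apple_names)          # names that already match exactly
only_orchards = sorted(orchard_names - apple_names)   # need renaming, or have no price data
only_apples = sorted(apple_names - orchard_names)     # need renaming, or have no orchard data
```

Only four names match as they stand, so every other variety either needs renaming or appears in just one of the two datasets.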

In [18]:
# to prepare the tables for merging we should rename the columns to match
# notes: it was decided to simplify the following column names:
#     - Cox(andclones) to simply Cox
#     - Fiesta/RedPippin to just Red Pippin (a quick online search revealed they are
#       alternate names for the same apple)
#     - Gala(andclones) to Gala
#     - Jonagold(andclones) to just Jonagold
#     - Other dessert varieties to Other Dessert Varieties, to match apples_df
orchards_df.columns = ['Braeburn','Cameo','Cox','Discovery','Egremont Russet','Red Pippin',
                       'Gala','Jonagold','Jazz','Kanzi','Katy','Spartan','Worcester Pearmain',
                       'Other Dessert Varieties']

orchards_df.head()

year,Braeburn,Cameo,Cox,Discovery,Egremont Russet,Red Pippin,Gala,Jonagold,Jazz,Kanzi,Katy,Spartan,Worcester Pearmain,Other Dessert Varieties
1999,,,4694,577,325,186.0,757,400,,,,299.0,294,1276
2000,,,4186,484,334,163.0,828,353,,,,286.0,283,1194
2001,,,3489,420,331,133.0,719,257,,,,257.0,207,1209
2002,,,3015,339,268,109.0,663,201,,,,195.0,196,945
2003,306.0,,2738,264,264,,674,227,,,,142.0,147,729
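
Assigning a whole list to `.columns` depends on the list being in exactly the same order as the existing columns. A hedged alternative is `DataFrame.rename` with an explicit mapping, so that each old name is paired visibly with its new one. The sketch below covers only a few of the orchard columns, with two rows copied from the output above; `orchard_sample`, `name_map` and `renamed` are hypothetical names.

```python
import pandas as pd

# two rows and four columns of the orchard table, entered by hand
orchard_sample = pd.DataFrame(
    {'Cox(andclones)': [4694, 4186],
     'EgremontRusset': [325, 334],
     'Fiesta/RedPippin': [186.0, 163.0],
     'Gala(andclones)': [757, 828]},
    index=pd.Index([1999, 2000], name='year'),
)

# old name -> new name, using the spellings of the price table
name_map = {
    'Cox(andclones)': 'Cox',
    'EgremontRusset': 'Egremont Russet',
    'Fiesta/RedPippin': 'Red Pippin',
    'Gala(andclones)': 'Gala',
}

# rename returns a new dataframe; columns not in the mapping are left unchanged
renamed = orchard_sample.rename(columns=name_map)
```

Because the mapping is keyed by the old name, a misspelt key simply leaves that column unrenamed rather than silently mislabelling a different column.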


In [21]:
# rename the price columns to the same simplified variety names used in orchards_df
apples_df.columns = ['Braeburn', 'Cox', 'Discovery','Egremont Russet','Gala',
                     'Jonagold', 'Katy','Red Pippin','Spartan','Worcester Pearmain',
                     'Other Dessert Varieties']
apples_df.head()

Year,Braeburn,Cox,Discovery,Egremont Russet,Gala,Jonagold,Katy,Red Pippin,Spartan,Worcester Pearmain,Other Dessert Varieties
2004,,0.49375,0.49,0.582857,0.43631,0.426667,0.463333,0.3425,0.382,0.445,0.480595
2005,,0.456852,0.407778,0.517024,0.376111,0.399667,,0.53,0.391,0.426667,0.455417
2006,,0.515787,0.46,0.537196,0.457639,0.424,,0.45,0.467778,0.485,0.491583
2007,,0.599259,0.548333,0.610444,0.501429,0.451667,,0.496667,0.428333,0.471667,0.53325
2008,,0.62787,0.552778,0.672361,0.523333,0.450357,,0.445556,0.501111,0.573333,0.555714


In [22]:
apples_df.groupby('Variety')

KeyError: 'Variety'
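
The `KeyError` arises because `apples_df` is in wide form: the varieties are column labels and `Year` is the index, so there is no column called `Variety` to group on. One way forward is to reshape both wide tables to long form with `melt` and merge them on year and variety. The sketch below does this on a few values copied from the outputs above. The names `prices_wide`, `space_wide`, `to_long` and `combined_df` are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# small wide-form samples: Year index, one column per variety
prices_wide = pd.DataFrame(
    {'Cox': [0.49375, 0.599259], 'Discovery': [0.49, 0.548333]},
    index=pd.Index([2004, 2007], name='Year'),
)
space_wide = pd.DataFrame(
    {'Cox': [3144, 2128], 'Discovery': [301, 189]},
    index=pd.Index([2004, 2007], name='Year'),
)


def to_long(wide_df, value_name):
    """Turn a Year-indexed wide table into Year / Variety / value rows."""
    return wide_df.reset_index().melt(
        id_vars='Year', var_name='Variety', value_name=value_name)


prices_long = to_long(prices_wide, 'Average wholesale price')
space_long = to_long(space_wide, 'Orchard space')

# an inner merge keeps only the (Year, Variety) pairs present in both tables
combined_df = prices_long.merge(space_long, on=['Year', 'Variety'], how='inner')

# with a real Variety column, grouping by variety now works
variety_means = combined_df.groupby('Variety')['Average wholesale price'].mean()
```

The result has the Apple variety / Year / Average wholesale price / Orchard space layout suggested in the guidance. In the notebook itself the orchard index is named `year` in lower case, so it would need renaming to `Year` (or the merge keys adjusting) before this approach applies.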

#### 2. Identify and handle missing data
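
A first step here could be to count the missing values in each column. The sketch below re-enters four columns of the orchard table shown in section 1, for the survey years that overlap the price data. It is illustrative only and makes no claim about why the values are missing; `orchard_sample`, `missing_per_variety` and `complete_varieties` are hypothetical names.

```python
import numpy as np
import pandas as pd

# four orchard columns for 2004-2012, copied from the output in section 1
orchard_sample = pd.DataFrame(
    {'Braeburn': [194.0, 271.0, 304.0, 509.0],
     'Jazz': [np.nan, np.nan, 83.0, 117.0],
     'Katy': [np.nan, np.nan, 133.0, 129.0],
     'Spartan': [164.0, 137.0, np.nan, np.nan]},
    index=pd.Index([2004, 2007, 2009, 2012], name='year'),
)

# number of missing survey values per variety
missing_per_variety = orchard_sample.isna().sum()

# varieties with a value in every survey year
complete_varieties = orchard_sample.dropna(axis=1).columns.tolist()
```

The orchard table has rows only for 2004, 2007, 2009 and 2012 within the period covered by the prices, so an inner merge on year would keep just those four years.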

#### 3. Identify and handle inconsistent or dirty data

#### 4. Identify and handle ambiguity and vagueness

#### 5. Put the data into an appropriate form for plotting

#### 6. Visualise the data
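
One possible visualisation, offered as a sketch and not as the required answer, is a line plot per variety with price and orchard space on two y-axes that share the year axis. The example below uses only the Cox figures visible in section 1. The units are not stated in the outputs above, so the axis labels leave them unspecified; `cox_df`, `price_ax` and `space_ax` are illustrative names.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Cox values for the four survey years, copied from the outputs in section 1
cox_df = pd.DataFrame(
    {'Average wholesale price': [0.49375, 0.599259, 0.626917, 0.728009],
     'Orchard space': [3144, 2128, 1798, 1697]},
    index=pd.Index([2004, 2007, 2009, 2012], name='Year'),
)

fig, price_ax = plt.subplots()
space_ax = price_ax.twinx()  # second y-axis sharing the same x-axis

# price on the left axis, orchard space on the right axis
price_ax.plot(cox_df.index, cox_df['Average wholesale price'],
              marker='o', color='tab:red')
space_ax.plot(cox_df.index, cox_df['Orchard space'],
              marker='s', color='tab:blue')

price_ax.set_xlabel('Year')
price_ax.set_ylabel('Average wholesale price (units as in source)')
space_ax.set_ylabel('Orchard space (units as in source)')
price_ax.set_title('Cox: wholesale price and orchard space')
```

Two y-axes keep both series readable despite their different scales, but they can suggest a stronger relationship than the data supports, so the interpretation should say how the axes were chosen.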

#### 7. Interpret your plot