# MEI Introduction to Data Science
# Lesson 5 - Activity 1
The problem in this activity can be solved by displaying scatter diagrams that use an additional visual feature to represent a third variable. The activity uses data from the AQA large data set, which gives information about cars.

## Problem: 
> *Which features of cars are linked to higher emissions?*

To answer this question you could plot scatter diagrams of various features against the emissions.

## Getting the data

* Run the code below to import the data


In [None]:
import pandas as pd

# import matplotlib
import matplotlib.pyplot as plt

# importing the data
cars_data=pd.read_csv('../input/aqalds/AQA-large-data-set.csv')

# inspecting the dataset to check that it has imported correctly
cars_data.head()

## Exploring the data
You can explore the data by finding the shape of the data set and displaying the data types.

In [None]:
# Explore the shape of data set and get the data types
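One way to complete the cell above is sketched here. So that it runs on its own, it uses a small stand-in table called `sample_cars` with invented values (an assumption for illustration, not the real data). In the notebook you would apply the same two attributes to `cars_data`.

```python
import pandas as pd

# Hypothetical stand-in for cars_data: the values are invented for illustration
sample_cars = pd.DataFrame({
    'Mass': [1100, 1450, 1800],
    'EngineSize': [998, 1598, 1995],
    'PropulsionTypeId': [1, 2, 1],
    'CO2': [105.0, 120.0, 160.0],
})

# shape is a (rows, columns) tuple
print(sample_cars.shape)

# dtypes lists the data type stored in each column
print(sample_cars.dtypes)
```

Note that `shape` and `dtypes` are attributes, not functions, so they are written without brackets.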

In lesson 3 activity 2 you pre-processed the cars data by removing any cars with a recorded mass or engine size of 0 and converting the propulsion field to the appropriate text values.
* Run the code below to clean the data being used in this notebook

In [None]:
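# keep only cars with a recorded mass and engine size above 0
# (.copy() avoids a SettingWithCopyWarning when the propulsion column is updated below)
cars_data=cars_data[cars_data['Mass'] >0]
cars_data=cars_data[cars_data['EngineSize'] >0].copy()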
cars_data['PropulsionTypeId'] = cars_data['PropulsionTypeId'].replace({1: 'Petrol',
                                                                       2: 'Diesel',
                                                                       3: 'Electric', 
                                                                       7: 'Gas/Petrol', 
                                                                       8: 'Electric/Petrol'})
cars_data.head()

Before analysing the data you can use `describe()` to explore the recorded values of the emissions that you are going to investigate.
* Get a summary of the CO2 emissions
* Try changing the field to CO or NOX

In [None]:
# Get a summary of the CO2 field
cars_data['CO2'].describe()

It is useful to have a visual representation of the emissions as a single variable before looking for an association with other variables.
* Run the code below to view a boxplot of the CO2 emissions by propulsion type
* Try changing the column to CO or NOX

In [None]:
cars_data.boxplot(column = ['CO2'],by='PropulsionTypeId', vert=False,figsize=(12, 8))
plt.show()

**Checkpoint**
> * Does the range of emissions look appropriate for the different propulsion types? Give some evidence to back up your answer. 
> * Is there any further cleaning of the data that would be helpful?

## Analysing the data
### Plotting scatter diagrams 
Pandas features a command to draw a scatter diagram.

* Run the code below to generate the scatter diagram

In [None]:
# Plot the scatter diagram
cars_data.plot.scatter(x='EngineSize', y='CO2', figsize=(10,10))
plt.show()

The link between the mass and emissions can also be explored.
* Edit and run the code below so that it plots the scatter diagram for CO2 against mass
* Explore changing the field used for the y-axis to CO or NOX

In [None]:
# Plot the scatter diagram
cars_data.plot.scatter(x='EngineSize', y='CO2', figsize=(10,10))
plt.show()

**Checkpoint**
> * Do cars with larger engine sizes have higher emissions on average? How do the emissions vary as engine size increases?
> * Do heavier cars have higher emissions on average? How do the emissions vary as mass increases?
> * Are your answers to the above two questions similar for the different types of emissions: CO2, CO and NOX?

### Adding a third (numerical) variable to a scatter diagram
One of the additional features of scatter diagrams in pandas is the ability to format the points based on a third variable. Where this is a numerical value it can be assigned to either the shading or the size of the points.

To plot a scatter diagram with the points shaded based on a third value you use the parameter `c` for colour. 
* Run the code below to plot the scatter diagram
* Change the code so that engine size is plotted on the x-axis and the shading is set by mass
* Explore equivalent diagrams for CO and NOX

In [None]:
# plot CO2 against mass with the colour determined by the engine size
cars_data.plot.scatter(x='Mass', y='CO2', c='EngineSize', figsize=(10,10), sharex=False)
plt.show()
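The shading above uses the default colour scale. As a sketch of one possible variation, the `colormap` parameter lets you pick a different scale; `'viridis'` is one of matplotlib's built-in colour maps. The `sample_cars` table below is invented for illustration so that the cell runs on its own. In the notebook you would use `cars_data` instead.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for cars_data: the values are invented for illustration
sample_cars = pd.DataFrame({
    'Mass': [1100, 1450, 1800],
    'EngineSize': [1000, 1500, 2000],
    'CO2': [105.0, 120.0, 160.0],
})

# plot CO2 against mass, shading each point by engine size on the viridis scale
ax = sample_cars.plot.scatter(x='Mass', y='CO2', c='EngineSize',
                              colormap='viridis', figsize=(6, 6), sharex=False)
plt.show()
```

The colour bar at the side of the diagram shows which shade corresponds to which engine size.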

To plot a scatter diagram with the size of the points based on a third value you use the parameter `s` for size. This parameter is handled differently: instead of a column name, you need to pass the full column from the dataset: `cars_data['EngineSize']`. The `/50` is to scale down the size of the points.
* Run the code below to plot the scatter diagram
* Change the code so that engine size is plotted on the x-axis and the size is set by mass
* Explore changing the `/50` for the scale
* Explore equivalent diagrams for CO and NOX

In [None]:
# plot CO2 against mass with the size determined by the engine size 
cars_data.plot.scatter(x='Mass', y='CO2', s=cars_data['EngineSize']/50, figsize=(10,10))
plt.show()
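Matplotlib reads `s` as the area of each marker in points squared, so passing engine sizes unscaled would give very large points. The sketch below shows what the `/50` does to the values, using an invented `sample_cars` table (an assumption for illustration, not the real data).

```python
import pandas as pd

# Hypothetical stand-in for cars_data: the values are invented for illustration
sample_cars = pd.DataFrame({
    'Mass': [1100, 1450, 1800],
    'EngineSize': [1000, 1500, 2000],
    'CO2': [105.0, 120.0, 160.0],
})

# the values passed to s: each engine size divided by 50
sizes = sample_cars['EngineSize'] / 50
print(sizes.tolist())  # [20.0, 30.0, 40.0]
```

Dividing by a larger number gives smaller points and dividing by a smaller number gives larger points, but the points always keep the same sizes relative to each other.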

**Checkpoint**
> * How are mass and engine size collectively linked to emissions?
> * Which of the diagrams you have produced is the most helpful to show this? 

### Adding a third (categorical) variable to a scatter diagram
You can also format the colour of points in a scatter diagram based on a third (categorical) variable.

Unfortunately the `scatter` command in pandas doesn't contain a simple `by` parameter for grouping so the code is more complex. 

In the code below:

`cmap = {'Petrol': 'red', 'Diesel': 'blue'}` creates a *mapping* that links the values Petrol and Diesel to the colours red and blue.

`[cmap.get(c, 'black') for c in cars_data.PropulsionTypeId]` looks up each value from the propulsion field and returns the matching colour (or black if the value is not in the mapping).

* Run the code to generate the scatter diagram

In [None]:
# create a mapping for the colours
cmap = {'Petrol': 'red', 'Diesel': 'blue'}

# plot the scatter diagram with the colour set by the mapping
cars_data.plot.scatter(x='Mass', y='CO2', figsize=(10,10),  c=[cmap.get(c, 'black') for c in cars_data.PropulsionTypeId])
plt.show()
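To see what the list of colours passed to `c` looks like, here is a minimal sketch that applies the same mapping to a short list. The list `sample_types` is made up for illustration and stands in for the propulsion column.

```python
# the same mapping as above
cmap = {'Petrol': 'red', 'Diesel': 'blue'}

# hypothetical stand-in for cars_data.PropulsionTypeId
sample_types = ['Petrol', 'Diesel', 'Electric', 'Petrol']

# look up each value; anything not in the mapping falls back to black
colours = [cmap.get(t, 'black') for t in sample_types]
print(colours)  # ['red', 'blue', 'black', 'red']
```

Electric is not a key in the mapping, so `get` returns the fallback colour for it.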

* Edit and run the code below so that it plots a scatter diagram with cars registered in 2002 and 2016 in different colours (*note that the* `YearRegistered` *field is numerical and so the values 2002 and 2016 do not need to be in quotes*)
* Explore equivalent diagrams for engine size and CO or NOX

In [None]:
cmap = {'Petrol': 'red', 'Diesel': 'blue'}
cars_data.plot.scatter(x='Mass', y='CO2', figsize=(10,10),  c=[cmap.get(c, 'black') for c in cars_data.PropulsionTypeId])
plt.show()

**Checkpoint**
> * How does the propulsion type affect the link between mass and emissions or engine size and emissions?  
> * Are the links between mass/engine size and emissions similar for the cars registered in 2002 and 2016?

## Communicating the results
**Checkpoint**
> * Use the charts and values calculated to answer the initial problem: *Which features of cars are linked to higher emissions?*