# MEI Introduction to Data Science
# Lesson 3 - Activity 1
The problem to be solved in this activity requires the data to be cleaned in a different way (by removing characters and converting strings to numbers), derived fields to be created and the data to be grouped. The activity uses data from the OCR large data set, which gives the methods of travel to work by local authority for 2001 and 2011.

## Problem: 
> *Are people more likely to cycle to work or walk to work in different parts of the country?*

To answer this question you could compare the means and standard deviations of the number of people walking and cycling to work for different regions.

## Getting the data
Initially the data for 2011 will be imported.
* Run the code below to import the 2011 data

In [None]:
# import pandas module
import pandas as pd

# importing the data
travel_2011_data=pd.read_csv('../input/ocrlds/OCR-lds-travel-2011.csv')

# inspecting the dataset to check that it has imported correctly
travel_2011_data.head()

The numerical fields in this data have been imported with comma separators for thousands. This is problematic as they will be interpreted as text fields (or "strings"), so it will not be possible to calculate statistics such as the mean or the standard deviation. Use the `shape`, `dtypes`, `head(10)` and `tail(10)` commands you learnt in lesson 2 to explore the data.

*Reminder: the format for these commands in lesson 2 was:* `heathrow_2015_data.shape`, `heathrow_2015_data.dtypes` *and* `heathrow_2015_data.head(10)`*.*

In [None]:
# explore the data

**Checkpoint**
> Which fields of the dataset would it be appropriate to convert to numerical values?

### Removing characters and converting strings to numbers
The values as they are stored have two problems that need to be sorted:
* The strings contain commas for some of the values to separate thousands
* They are being stored as strings and not numbers

These two problems need to be sorted in order.

In lesson 2 you used the `replace` command to replace the value in a field:
`heathrow_2015_data['Daily Total Rainfall'] = heathrow_2015_data['Daily Total Rainfall'].replace({'tr': 0.025})`
In this example you just want to remove any commas but leave the rest of the numbers - to do this you can use the string replace command `str.replace` to replace any commas with an empty string.

Once this has been completed you can convert the values to the `float` type so they can be interpreted as numbers.
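
The two steps can be seen in isolation on a very small example. This is only a sketch: the `sample` DataFrame and the three values in it are made up for illustration and are not taken from the OCR data set.

```python
import pandas as pd

# Hypothetical values, stored as text in the way the CSV import leaves them
sample = pd.DataFrame({'In employment': ['1,234', '56,789', '432']})

# Step 1: remove the thousands separators (the values are still strings)
sample['In employment'] = sample['In employment'].str.replace(',', '')

# Step 2: convert the cleaned strings to the float type
sample['In employment'] = sample['In employment'].astype('float')

# The field is now numeric, so statistics such as the mean can be calculated
print(sample['In employment'].dtype)
print(sample['In employment'].mean())
```

The order matters: converting to `float` before removing the commas would fail, because a string such as `'1,234'` cannot be read as a number.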

* Run the code below that performs this for the In employment field

In [None]:
# any commas in the In employment fields are removed by replacing them with an empty string
travel_2011_data['In employment'] = travel_2011_data['In employment'].str.replace(',', '')

# the values are then converted to the float type
travel_2011_data['In employment'] = travel_2011_data['In employment'].astype('float')

# you can then use the describe command to check that this has worked and they can be analysed
travel_2011_data['In employment'].describe()

The code below performs this for the Bicycle field.

* Run the code to remove the commas and reformat as numbers.
* Change the code so that it removes the commas and reformats the On foot field

In [None]:
# any commas in the Bicycle field are removed by replacing them with an empty string
travel_2011_data['Bicycle'] = travel_2011_data['Bicycle'].str.replace(',', '')

# the values are then converted to the float type
travel_2011_data['Bicycle'] = travel_2011_data['Bicycle'].astype('float')

# you can then use the describe command to check that this has worked and they can be analysed
travel_2011_data['Bicycle'].describe()
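
As an aside, `pd.read_csv` has a `thousands` parameter that strips a separator while the file is being read. The sketch below uses a made-up two-row CSV held in a string (`csv_text` is hypothetical) rather than the real file, so it shows the idea only; it has not been checked here whether this alone would be enough for the OCR file.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a real file;
# values containing commas are quoted so they stay in one column
csv_text = 'Region,Bicycle\nNorth East,"1,234"\nLondon,"56,789"\n'

# thousands=',' tells pandas to treat the comma as a thousands separator
sample = pd.read_csv(io.StringIO(csv_text), thousands=',')

# the Bicycle field is read as a number rather than as a string
print(sample.dtypes)
print(sample['Bicycle'].mean())
```

Cleaning the fields by hand, as in the boxes above, is still worth practising: the same `str.replace` approach works for other unwanted characters as well.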

### Creating a derived field
The data in this dataset show the numbers of people in each local authority that use each type of transport as their primary means of getting to work. However, this doesn't take into account the size of the local authority: local authorities with more people living in them will generally have greater numbers across all the categories. To aid comparison between different sized local authorities it is useful to consider the proportion of people in each local authority who work that use each type of transport (expressed as a percentage).

The proportion of people who cycle is a *derived field*: it is not part of the original dataset but it can be derived from other values, in this case the number who cycle divided by the total number of people in work.
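
To see why the percentage can tell a different story from the raw numbers, here is a small illustration. The two authorities and all of the figures are invented for this sketch and do not come from the data set.

```python
import pandas as pd

# Hypothetical figures for two local authorities of different sizes
sample = pd.DataFrame({
    'Local authority': ['Authority A', 'Authority B'],
    'In employment': [200000.0, 50000.0],
    'Bicycle': [6000.0, 3000.0],
})

# Derived field: the number who cycle as a percentage of those in employment
sample['Bicycle percent'] = sample['Bicycle'] / sample['In employment'] * 100

print(sample)
```

In this made-up example Authority A has twice as many cyclists as Authority B (6,000 compared with 3,000), but only half the percentage (3% compared with 6%).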

* Run the code in the two boxes below to add the Bicycle percent field and check that it has added correctly to the data set

In [None]:
# The percentage is calculated and stored in a new field: Bicycle percent
travel_2011_data['Bicycle percent']=travel_2011_data['Bicycle']/travel_2011_data['In employment']*100

In [None]:
travel_2011_data.head()

* Edit the code in the first box above so that it adds a column for On foot percent. Run the second box again to confirm that a second column has been added to the dataset. 

**Checkpoint**
> When would you use the raw number of people using a type of transport and when would you use a percentage?
>* Give an example of a sentence comparing the number of people who cycle and walk to work using raw numbers.
>* Give an example of a sentence comparing the number of people who cycle and walk to work using percentages.

## Exploring the data

It is useful to see the boxplots for these fields. The code in the two boxes below will generate the boxplots grouped by region. Plotting charts requires the `matplotlib` module to be imported. This will be explored in more detail in lesson 4.

In [None]:
# import matplotlib for plotting
import matplotlib.pyplot as plt

In [None]:
# plot a boxplot for Bicycle percent grouped by Region
travel_2011_data.boxplot(column=['Bicycle percent'], by='Region', vert=False, figsize=(12, 8))
plt.show()

## Analysing the data
### Using the groupby command
The data in this dataset are listed by local authority and region. Pandas has a built-in command, `groupby`, that allows operations such as calculating the mean to be performed separately on different groups.

* Run the code below which calculates the mean for the data grouped by region

In [None]:
print(travel_2011_data.groupby(['Region'])['Bicycle percent'].mean())

The command for standard deviation is `std()`.
* Add to the code above so that it also calculates the standard deviation of the Bicycle percent field for the local authorities in each region
* Add to the code above so that it calculates the mean and standard deviation of the On foot percent field
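
One possible way to get several statistics at once is the `agg` command, which takes a list of statistic names. The sketch below uses a made-up `sample` DataFrame with two hypothetical regions, so the numbers mean nothing in themselves.

```python
import pandas as pd

# Hypothetical percentages for four local authorities in two regions
sample = pd.DataFrame({
    'Region': ['North', 'North', 'South', 'South'],
    'Bicycle percent': [2.0, 4.0, 1.0, 5.0],
})

# Calculate the mean and the standard deviation for each region in one step
summary = sample.groupby('Region')['Bicycle percent'].agg(['mean', 'std'])

print(summary)
```

Both made-up regions have the same mean, but the values for South are more spread out and so its standard deviation is larger. Note that `std()` in pandas calculates the sample standard deviation (dividing by n - 1) by default.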

## Communicating the results: 
Use the measures calculated to answer the original problem: *Are people more likely to cycle to work or walk to work in different parts of the country?*

**Checkpoint**
> * Which regions have the most variation in the proportion of people cycling or walking to work?
> * What local factors might affect the proportion of people cycling or walking to work? For example, the geography, the types of people who live there or other factors.