# Data Processing with `pandas`

## Part 1: Merging datasets

In [None]:
import pandas as pd

In [None]:
from dataframes import europe, americas, requirements, prices, currencies, exchange_rates, dublin

Here are some details of outlets in a small coffee shop chain:

In [None]:
europe

In [None]:
americas

Create a new DataFrame called `outlets` which contains all eight entries, and has a new row index with unique values (from 0 to 7), with the `europe` entries first. The `location_id` column can be left as it is (we will resolve the duplicate values later). 

*Hint:*  
[pd.concat](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) with the `ignore_index` parameter

In [None]:
# Stack the two regional tables (europe first) and renumber the rows 0 to 7
outlets = pd.concat([europe, americas], ignore_index=True)

In [None]:
outlets
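
To see what `ignore_index` changes, here is a minimal sketch on two made-up tables. The frames `left` and `right` and their contents are hypothetical stand-ins, not the real `europe` and `americas` data.

```python
import pandas as pd

# Hypothetical stand-ins for the two regional tables
left = pd.DataFrame({"location_id": [1, 2], "City": ["London", "Rome"]})
right = pd.DataFrame({"location_id": [1, 2], "City": ["Lima", "Quito"]})

# By default each input keeps its own row labels, so labels repeat
kept = pd.concat([left, right])

# ignore_index=True discards the old labels and numbers the rows from 0
renumbered = pd.concat([left, right], ignore_index=True)

print(list(kept.index))        # [0, 1, 0, 1]
print(list(renumbered.index))  # [0, 1, 2, 3]
```

Note that only the row index is renumbered; the `location_id` column still contains duplicates.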

A new outlet will be opened in Dublin. A site is found and it has the following `requirements`:

In [None]:
requirements

There's a catalogue of `prices` as follows:

In [None]:
prices

Merge the `requirements` table with `prices` on `name`, creating a new DataFrame called `purchases` which is the same as `requirements` but with a `price` column added, and another column `cost` which is equal to `price` * `quantity`:

In [None]:
# A left merge keeps every row of `requirements` and looks up its price by `name`
purchases = requirements.merge(prices, how="left", on="name")
# Total cost per item
purchases['cost'] = purchases['price'] * purchases['quantity']

In [None]:
purchases
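
The following sketch illustrates why `how="left"` suits this task: every row of the left table survives, even when the catalogue has no matching entry. The frames `needs` and `price_list` are hypothetical and much smaller than the real tables.

```python
import pandas as pd

# Hypothetical data: "grinder" has no entry in the price list,
# and "kettle" is priced but not required
needs = pd.DataFrame({"name": ["cup", "grinder"], "quantity": [10, 1]})
price_list = pd.DataFrame({"name": ["cup", "kettle"], "price": [2.5, 30.0]})

# how="left" keeps every row of `needs`, in its original order
order = needs.merge(price_list, how="left", on="name")
order["cost"] = order["price"] * order["quantity"]

print(order)
```

The unmatched item ends up with `NaN` in both `price` and `cost`, which makes gaps in the catalogue easy to spot.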

The details for the Dublin branch are in the `dublin` DataFrame:

In [None]:
dublin

Add this to the bottom of the `outlets` DataFrame, again updating the row index so that it has numbers `0` to `8`: 

In [None]:
outlets = pd.concat([outlets, dublin], ignore_index=True)


In [None]:
outlets

## Part 2: Data preparation and cleaning

Another DataFrame contains the `currency` for each country in which there is an outlet:

In [None]:
currencies

Create a DataFrame called `outlets_detail`, which is the same as `outlets` but has an additional column `Currency`, with the given currency for each outlet. 

- Notice that in the `currencies` DataFrame, the column heading `country` is lower case so does not quite match `Country`, and that `currency` needs to be renamed to `Currency`

- Avoid modifying the original `outlets` DataFrame

In [None]:
outlets_detail = pd.merge(outlets, currencies, how='left', left_on='Country', right_on='country')
outlets_detail.drop('country', axis=1, inplace=True)
outlets_detail.rename(columns={'currency': 'Currency'}, inplace=True)

In [None]:
outlets_detail
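
The same steps can also be written as a single chain of non-mutating calls, which is one way to be sure the original table is never touched. This is only a sketch; `shops` and `money` are hypothetical stand-ins for `outlets` and `currencies`.

```python
import pandas as pd

# Hypothetical stand-ins with the same mismatch in column names
shops = pd.DataFrame({"Country": ["UK", "France"], "location_id": [1, 2]})
money = pd.DataFrame({"country": ["France", "UK"], "currency": ["EUR", "GBP"]})

# Each method returns a new DataFrame, so `shops` is left as it was
shops_detail = (
    shops
    .merge(money, how="left", left_on="Country", right_on="country")
    .drop(columns="country")                   # remove the duplicate key column
    .rename(columns={"currency": "Currency"})  # match the capitalised style
)

print(shops_detail)
```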

Run the following code to create lists of the countries where there are outlets in the two regions:

In [None]:
EUROPE = ['UK', 'Italy', 'France', 'Germany', 'Ireland']
AMERICAS = ['Argentina', 'Brazil', 'USA']

Create a function called `region`, which takes a single argument, `country`, and returns 'Europe' if in `EUROPE`, 'Americas' if in `AMERICAS`, and 'Other' if in neither list:

In [None]:
def region(country):
    # Check each list in turn and fall back to 'Other'
    if country in EUROPE:
        return 'Europe'
    elif country in AMERICAS:
        return 'Americas'
    else:
        return 'Other'


Add a new column `Region` to `outlets_detail`, which uses your function and `.apply()` to populate the column values:

In [None]:
outlets_detail['Region'] = outlets_detail['Country'].apply(region)


In [None]:
outlets_detail
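
As an aside, a lookup like this can also be done without a function, using `.map()` with a dictionary. This is an alternative technique, not the one the exercise asks for, and the shortened lists and the `lookup` name below are made up for illustration.

```python
import pandas as pd

# Shortened, hypothetical versions of the two region lists
EUROPE = ['UK', 'Italy']
AMERICAS = ['Brazil']

# Build one dictionary that maps each country to its region
lookup = {c: 'Europe' for c in EUROPE}
lookup.update({c: 'Americas' for c in AMERICAS})

countries = pd.Series(['UK', 'Brazil', 'Japan'])

# Countries missing from the dictionary map to NaN, which fillna replaces
regions = countries.map(lookup).fillna('Other')

print(list(regions))  # ['Europe', 'Americas', 'Other']
```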

Create a column `new_id` which contains strings in the format `<Region>_<location_id>`, for example:

`Europe_1`

*Hint: you may find the `.astype()` method useful*

In [None]:
outlets_detail['new_id'] = outlets_detail['Region'] + '_' + outlets_detail['location_id'].astype(str)


In [None]:
outlets_detail
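
The role of `.astype(str)` is easier to see on a tiny example: it converts the integer ids to text so that `+` joins strings rather than attempting arithmetic. The `demo` frame is hypothetical, and `Series.str.cat` is shown only as an equivalent spelling.

```python
import pandas as pd

demo = pd.DataFrame({"Region": ["Europe", "Americas"], "location_id": [1, 2]})

# astype(str) turns the integers into text so that + concatenates
demo["new_id"] = demo["Region"] + "_" + demo["location_id"].astype(str)

# Equivalent result using Series.str.cat with a separator
alt = demo["Region"].str.cat(demo["location_id"].astype(str), sep="_")

print(list(demo["new_id"]))  # ['Europe_1', 'Americas_2']
```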

Finally, drop the original `location_id` column, and set the index of the DataFrame as the values in the `new_id` column, discarding the original index:

In [None]:
outlets_detail.drop('location_id', axis=1, inplace=True)
outlets_detail.set_index('new_id', inplace=True, drop=True)


*Note how the `.drop()` method is destructive, in that running the code a second time will throw an error because the given column cannot be found. In these circumstances you may need to re-run previous code to get the DataFrame back to its previous state.*

In [None]:
outlets_detail
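
If re-running a cell is likely, one option is to make the drop repeatable with the `errors="ignore"` argument of `.drop()`. The sketch below uses a hypothetical `demo` frame and reassigns the result instead of using `inplace=True`.

```python
import pandas as pd

demo = pd.DataFrame({"location_id": [1, 2], "City": ["Dublin", "Cork"]})

# The first call removes the column. The second finds nothing to remove,
# and errors="ignore" turns that into a no-op instead of a KeyError.
demo = demo.drop(columns="location_id", errors="ignore")
demo = demo.drop(columns="location_id", errors="ignore")

print(list(demo.columns))  # ['City']
```

The trade-off is that a misspelled column name will also be ignored silently, so this is best kept for cells that are known to be correct.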

### Preparation of a different dataset 

In Part 3 we will be working with a different dataset. It will be possible to load the prepared dataset directly later in the notebook, but let's have a go at doing some of this preparation work ourselves first:

In [None]:
df = pd.read_csv('data/ward-profiles.csv')

In [None]:
df.head()

The dataset contains data for each ward in London. However, you'll notice that (with the exception of `City of London`) each `Ward name` value is prefixed with the name of the Borough in which the ward is located.

Create a function which will identify the string ` - ` (a dash with a space on either side) within another string, and return the text which precedes it. If the string is not present, the whole string should be returned.

`City of London` would return `City of London`  
`Barking and Dagenham - Abbey` would return `Barking and Dagenham`  

In [None]:
def borough(ward):
    
    if ' - ' not in ward:
        return ward
    
    return ward[:ward.index(' - ')]


Use `.apply()` and your function to create a column called `Borough` which contains the returned string:

In [None]:
df['Borough'] = df['Ward name'].apply(borough)


In [None]:
df.head()

Follow the same process to create a column called `Ward`:

- `City of London` => `City of London`
- `Barking and Dagenham - Abbey` => `Abbey`

In [None]:
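def ward(ward_name):
    # Names without the separator (e.g. `City of London`) are returned unchanged
    if ' - ' not in ward_name:
        return ward_name

    # Skip past the three-character separator to keep only the ward part
    return ward_name[ward_name.index(' - ') + 3:]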

df['Ward'] = df['Ward name'].apply(ward)


In [None]:
df.head()
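
For comparison, both columns can be produced in one step with the vectorised string method `.str.split()`. This is an alternative to the `.apply()` approach practised above, sketched on a hypothetical two-row Series.

```python
import pandas as pd

names = pd.Series(["City of London", "Barking and Dagenham - Abbey"])

# Split on the first " - " only; expand=True returns one column per part,
# and rows without the separator get a missing value in column 1
parts = names.str.split(" - ", n=1, expand=True)

borough_col = parts[0]
# Where there was no separator, fall back to the whole name
ward_col = parts[1].fillna(parts[0])

print(list(borough_col))  # ['City of London', 'Barking and Dagenham']
print(list(ward_col))     # ['City of London', 'Abbey']
```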

Use `.drop()` to get rid of the original `Ward name` column:

In [None]:
df.drop(['Ward name'], axis=1, inplace=True)


In [None]:
df.head()

Finally, see if you can move the new `Borough` and `Ward` columns to be the first two columns in the DataFrame:

*Hint: this [Stack Overflow answer](https://stackoverflow.com/questions/35321812/move-column-in-pandas-dataframe/35322540#35322540) may be useful*

In [None]:
df = df[['Borough', 'Ward'] + [c for c in df if c not in ['Borough', 'Ward']]]


If you managed to do all of those tasks, your DataFrame should be the same as `wards` loaded at the beginning of Part 3 below.
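
The column reordering step relies on the fact that selecting with a list of labels returns the columns in the order given. A minimal sketch with a hypothetical `demo` frame:

```python
import pandas as pd

demo = pd.DataFrame({"a": [1], "Borough": ["x"], "b": [2], "Ward": ["y"]})

front = ["Borough", "Ward"]
# Put the chosen columns first, then every remaining column in its old order
demo = demo[front + [c for c in demo.columns if c not in front]]

print(list(demo.columns))  # ['Borough', 'Ward', 'a', 'b']
```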

## Part 3: Data grouping and aggregation

In [None]:
wards = pd.read_csv('data/ward-profiles-clean.csv')
wards

Use `.groupby()` to create a Series called `population` which contains the sum of the values in the `Population - 2015` column for each `Borough`:

In [None]:
population = wards.groupby('Borough')['Population - 2015'].sum()
population
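
To make the shape of the result concrete, here is the same pattern on a hypothetical three-row table. Selecting a single column after `.groupby()` gives a Series indexed by the group key.

```python
import pandas as pd

wards_demo = pd.DataFrame({
    "Borough": ["A", "A", "B"],
    "Population - 2015": [100, 150, 200],
})

# One total per Borough; the Borough names become the index
totals = wards_demo.groupby("Borough")["Population - 2015"].sum()

print(totals["A"])  # 250
```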

Use `.groupby()` and `.agg()` to create a DataFrame called `cars_stats` which contains the `max`, `min` and `mean` of the `Cars per household - 2011` for each `Borough`:

In [None]:
cars_stats = wards.groupby('Borough')['Cars per household - 2011'].agg(['max', 'min', 'mean'])


In [None]:
cars_stats

Update `cars_stats` so that `mean` is rounded to one decimal place:

In [None]:
cars_stats['mean'] = cars_stats['mean'].round(1)


In [None]:
cars_stats
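
The two steps above, sketched on made-up numbers: passing a list of function names to `.agg()` produces one column per function, and `.round(1)` is then applied to just one of them. The `demo` frame and its `cars` column are hypothetical.

```python
import pandas as pd

demo = pd.DataFrame({
    "Borough": ["A", "A", "A", "B"],
    "cars": [0.5, 1.0, 0.7, 1.26],
})

# A list of function names gives one output column per function
stats = demo.groupby("Borough")["cars"].agg(["max", "min", "mean"])

# Round only the mean column, leaving max and min untouched
stats["mean"] = stats["mean"].round(1)

print(stats)
```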

Create a Series called `ward_count` which has an index of `Borough` and with values showing the `.count()` of `Ward` for each, i.e. the number of wards in each `Borough`. Order this by the values, with the `Borough` with the most wards at the top:

In [None]:
ward_count = wards.groupby('Borough')['Ward'].count().sort_values(ascending=False)


In [None]:
ward_count
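
A small sketch of the count-and-sort pattern on hypothetical data. It also shows `value_counts()`, which gives the same numbers in a single call when the counted column has no missing values (`.count()` skips missing values, while `value_counts()` counts rows per key).

```python
import pandas as pd

demo = pd.DataFrame({
    "Borough": ["A", "B", "B", "B", "C", "C"],
    "Ward": ["u", "v", "w", "x", "y", "z"],
})

# Count wards per Borough, largest first
counts = demo.groupby("Borough")["Ward"].count().sort_values(ascending=False)

# Same figures from value_counts, which sorts in descending order by default
quick = demo["Borough"].value_counts()

print(list(counts.index))  # ['B', 'C', 'A']
```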

Create a DataFrame called `transport` which contains the columns `['Borough', 'Ward', 'Average Public Transport Accessibility score - 2014', '% travel by bicycle to work - 2011']` from `wards`:

In [None]:
transport = wards[['Borough', 
                   'Ward', 
                   'Average Public Transport Accessibility score - 2014', 
                   '% travel by bicycle to work - 2011']]


In [None]:
transport

Merge the columns from `cars_stats` into `transport`, such that:

- the number of rows in `transport` remains the same
- three new columns are added (`max`, `min`, `mean`)
- the values in each of these columns for all wards in a given `Borough` are the same

In [None]:
# A left merge keeps exactly the rows (and row order) of `transport`
transport = transport.merge(cars_stats, how='left', on='Borough')


Drop the `max` and `min` columns:

*Hint: remember that `.drop()` defaults to `axis=0`, which attempts to drop rows rather than columns*

In [None]:
transport.drop(['max','min'], axis=1, inplace=True)


In [None]:
transport
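
A left merge preserves the row count of the left table as long as each key appears only once on the right. The sketch below checks that assumption with the `validate` argument of `.merge()`. The frames are hypothetical, and here `Borough` is an ordinary column on both sides to keep the example simple.

```python
import pandas as pd

wards_demo = pd.DataFrame({"Borough": ["B", "A", "B"], "Ward": ["x", "y", "z"]})
stats = pd.DataFrame({"Borough": ["A", "B"], "mean": [1.0, 0.5]})

# validate="many_to_one" raises MergeError if a Borough repeats in `stats`,
# which is the situation that would otherwise add extra rows
merged = wards_demo.merge(stats, how="left", on="Borough", validate="many_to_one")

print(len(merged) == len(wards_demo))  # True
```

Every ward in the same `Borough` receives the same `mean`, and the rows stay in their original order.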

Finally, rename the `mean` column to `Borough household cars - average`:

In [None]:
transport.rename(columns={'mean': 'Borough household cars - average'}, inplace=True)


In [None]:
transport