
# Challenge Questions - TfL Dataset

# Instructions:
• Please ensure you don't overwrite any existing cells. Add new cells below by pressing ALT+ENTER

• Attempt all of the questions

• You are encouraged to look online for help should you need it

# Dataset overview:
There are three datasets stored in the same directory as this Notebook. They are all related to each other:

• **tfl-daily-cycle-hires.csv**: This dataset contains bike hire data from Transport for London for the period 30th July 2010 to 30th September 2021. 'Day' is the date in '%d/%m/%Y' format. 'Number of Bicycle Hires' is the total number of bikes hired that day.



## Import pandas, numpy and datetime

In [7]:
import pandas as pd
import numpy as np
import datetime as dt

## Load the files:
• "tfl-daily-cycle-hires.csv" should be assigned to the variable **tfl**

In [8]:
tfl = pd.read_csv('tfl-daily-cycle-hires.csv')

## Check the head of the DataFrame

In [9]:
tfl.head()

,Day,Number of Bicycle Hires,Unnamed: 2
0,30/07/2010,6897.0,
1,31/07/2010,5564.0,
2,01/08/2010,4303.0,
3,02/08/2010,6642.0,
4,03/08/2010,7966.0,


## Check the data types of the DataFrame columns

In [10]:
tfl.dtypes

,0
Day,object
Number of Bicycle Hires,float64
Unnamed: 2,float64


## Change the data types and remove unnecessary columns

• 'Day' should be a datetime64 data type

• 'Number of Bicycle Hires' should be float64

• Any other columns should be deleted

In [11]:
#tfl['Day'] = tfl['Day'].astype('datetime64[ns]')
tfl['Day'] = pd.to_datetime(tfl['Day'],format='%d/%m/%Y')

In [12]:
tfl = tfl.drop(columns=['Unnamed: 2'])
#tfl.drop(columns='Unnamed: 2', inplace=True)
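
A possible alternative to converting and dropping after the load is to skip the unwanted column while reading the file. The sketch below assumes the file's header ends with an empty field (which is what would make pandas create 'Unnamed: 2'). `sample_csv` and `sample` are hypothetical names, and the three rows are copied from the head shown earlier.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for the real file, including the empty
# trailing column that pandas would name 'Unnamed: 2'.
sample_csv = io.StringIO(
    "Day,Number of Bicycle Hires,\n"
    "30/07/2010,6897.0,\n"
    "31/07/2010,5564.0,\n"
    "01/08/2010,4303.0,\n"
)

# usecols keeps only the two named columns, so nothing needs dropping later.
sample = pd.read_csv(sample_csv, usecols=['Day', 'Number of Bicycle Hires'])

# An explicit format removes any day/month ambiguity (01/08 is 1 August here).
sample['Day'] = pd.to_datetime(sample['Day'], format='%d/%m/%Y')

print(sample.dtypes)
```

Passing an explicit `format` is the same choice made in the cell above; it is usually safer than letting pandas guess whether the day or the month comes first.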

## What is the average number of bicycle hires per day across the entire dataset?

In [13]:
tfl.describe()

,Day,Number of Bicycle Hires
count,4081,4081.0
mean,2016-02-29 00:00:00,26261.932124
min,2010-07-30 00:00:00,2764.0
25%,2013-05-15 00:00:00,19201.0
50%,2016-02-29 00:00:00,26030.0
75%,2018-12-15 00:00:00,33313.0
max,2021-09-30 00:00:00,73094.0
std,,9741.300803


In [14]:
tfl['Number of Bicycle Hires'].mean()

np.float64(26261.932124479295)

In [15]:
tfl.groupby('Day')['Number of Bicycle Hires'].mean()

Day,Number of Bicycle Hires
2010-07-30,6897.0
2010-07-31,5564.0
2010-08-01,4303.0
2010-08-02,6642.0
2010-08-03,7966.0
...,...
2021-09-26,45120.0
2021-09-27,32167.0
2021-09-28,32539.0
2021-09-29,39889.0
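
Grouping by 'Day' returns the original values because the data appears to hold one row per date, so the overall mean from the previous cell is the figure the question asks for. A minimal sketch of that check, using a hypothetical three-row `sample` copied from the head of the data:

```python
import pandas as pd

# Hypothetical sample with one row per day, as in the output above.
sample = pd.DataFrame({
    'Day': pd.to_datetime(['30/07/2010', '31/07/2010', '01/08/2010'],
                          format='%d/%m/%Y'),
    'Number of Bicycle Hires': [6897.0, 5564.0, 4303.0],
})

# If every date is unique, grouping by 'Day' has nothing to aggregate.
print(sample['Day'].is_unique)

# One group per row: the "mean" of each group is just that day's value.
per_day = sample.groupby('Day')['Number of Bicycle Hires'].mean()

# The average hires per day is the mean over all rows.
overall = sample['Number of Bicycle Hires'].mean()
print(overall)
```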


## Create a new column called 'Year' which contains the 4 digit year

In [16]:
#tfl['Year'] = tfl['Day'].dt.year
tfl['Year'] = tfl['Day'].dt.strftime('%Y')

In [17]:
tfl

,Day,Number of Bicycle Hires,Year
0,2010-07-30,6897.0,2010
1,2010-07-31,5564.0,2010
2,2010-08-01,4303.0,2010
3,2010-08-02,6642.0,2010
4,2010-08-03,7966.0,2010
...,...,...,...
4076,2021-09-26,45120.0,2021
4077,2021-09-27,32167.0,2021
4078,2021-09-28,32539.0,2021
4079,2021-09-29,39889.0,2021
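
The two ways of building 'Year' in the cell above give different data types: `.dt.year` yields integers, while `.dt.strftime('%Y')` yields strings. A small sketch of the difference, with `days` as a hypothetical two-value Series:

```python
import pandas as pd

# Hypothetical two-date Series in the same day-first format as the file.
days = pd.Series(pd.to_datetime(['30/07/2010', '29/09/2021'],
                                format='%d/%m/%Y'))

# Integer years: convenient for numeric comparisons such as year >= 2015.
year_as_int = days.dt.year

# String years: same digits, but compared and sorted as text.
year_as_str = days.dt.strftime('%Y')

print(year_as_int.tolist())
print(year_as_str.tolist())
```

Either works as a groupby key; the integer form is probably the more convenient one if the years are later filtered or plotted numerically.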


## What is the average number of bicycle hires per Year across the entire dataset?

In [18]:
tfl.groupby('Year')['Number of Bicycle Hires'].mean()

Year,Number of Bicycle Hires
2010,14069.76129
2011,19568.353425
2012,26008.969945
2013,22042.353425
2014,27462.731507
2015,27046.134247
2016,28152.013661
2017,28619.29863
2018,28952.164384
2019,28561.520548


In [19]:
tfl.pivot_table(index='Year',aggfunc='mean',values='Number of Bicycle Hires')

Year,Number of Bicycle Hires
2010,14069.76129
2011,19568.353425
2012,26008.969945
2013,22042.353425
2014,27462.731507
2015,27046.134247
2016,28152.013661
2017,28619.29863
2018,28952.164384
2019,28561.520548


## What is the total number of bicycle hires per Year across the entire dataset?

In [20]:
tfl.groupby('Year')['Number of Bicycle Hires'].sum()

Year,Number of Bicycle Hires
2010,2180813.0
2011,7142449.0
2012,9519283.0
2013,8045459.0
2014,10023897.0
2015,9871839.0
2016,10303637.0
2017,10446044.0
2018,10567540.0
2019,10424955.0
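
The yearly mean and total can also be computed in one pass, together with a day count. The count matters because the data starts on 30th July 2010 and ends in September 2021, so the first and last years are partial and their totals are not directly comparable with full years. A minimal sketch on made-up numbers (`sample`, `summary` and the output column names are hypothetical):

```python
import pandas as pd

# Made-up values: two days in one year, three in the next.
sample = pd.DataFrame({
    'Year': ['2010', '2010', '2011', '2011', '2011'],
    'Number of Bicycle Hires': [1000.0, 3000.0, 2000.0, 4000.0, 6000.0],
})

# Named aggregation: each keyword becomes a column in the result.
summary = sample.groupby('Year')['Number of Bicycle Hires'].agg(
    days='count',
    mean_hires='mean',
    total_hires='sum',
)

print(summary)
```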


## Create a new column called 'Category' on the tfl DataFrame that classifies the number of bike hires per day as:
* 'Low' if the 'Number of Bicycle Hires' is below 10,000
* 'Medium' if the 'Number of Bicycle Hires' is below 40,000 but greater than or equal to 10,000
* 'High' if the 'Number of Bicycle Hires' is greater than or equal to 40,000

In [21]:
tfl['Category'] = np.where(tfl['Number of Bicycle Hires'] < 10000, 'Low',
                           np.where(tfl['Number of Bicycle Hires'] < 40000, 'Medium', 'High'))

In [22]:
tfl

,Day,Number of Bicycle Hires,Year,Category
0,2010-07-30,6897.0,2010,Low
1,2010-07-31,5564.0,2010,Low
2,2010-08-01,4303.0,2010,Low
3,2010-08-02,6642.0,2010,Low
4,2010-08-03,7966.0,2010,Low
...,...,...,...,...
4076,2021-09-26,45120.0,2021,High
4077,2021-09-27,32167.0,2021,Medium
4078,2021-09-28,32539.0,2021,Medium
4079,2021-09-29,39889.0,2021,Medium


In [23]:
def classify(x):
  if x>=40000:
    return 'High'
  elif x>=10000:
    return 'Medium'
  else:
    return 'Low'

In [24]:
tfl['Category'] = tfl['Number of Bicycle Hires'].apply(classify)

In [30]:
tfl.head(15)

,Day,Number of Bicycle Hires,Year,Category
0,2010-07-30,6897.0,2010,Low
1,2010-07-31,5564.0,2010,Low
2,2010-08-01,4303.0,2010,Low
3,2010-08-02,6642.0,2010,Low
4,2010-08-03,7966.0,2010,Low
5,2010-08-04,7893.0,2010,Low
6,2010-08-05,8724.0,2010,Low
7,2010-08-06,9797.0,2010,Low
8,2010-08-07,6631.0,2010,Low
9,2010-08-08,7864.0,2010,Low
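
A third option, alongside the nested `np.where` and the `apply` approach above, is `pd.cut`. The sketch below is one way to express the same thresholds; `hires` and `category` are hypothetical names and the four values are chosen only to exercise the boundaries.

```python
import numpy as np
import pandas as pd

# Boundary values on either side of the 10,000 and 40,000 thresholds.
hires = pd.Series([9999.0, 10000.0, 39999.0, 40000.0])

# right=False makes each bin closed on the left, i.e.
# [-inf, 10000) -> Low, [10000, 40000) -> Medium, [40000, inf) -> High.
category = pd.cut(
    hires,
    bins=[-np.inf, 10000, 40000, np.inf],
    labels=['Low', 'Medium', 'High'],
    right=False,
)

print(category.tolist())
```

One behavioural difference to be aware of: for a missing value, the nested `np.where` falls through to 'High' and `classify` falls through to 'Low', whereas `pd.cut` leaves it missing. The `describe()` output earlier shows the same count for both columns, so this should not affect the results here.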


## For each year in the tfl DataFrame how many days are classed as 'Low', 'Medium' or 'High'?

In [25]:
tfl.groupby(['Year','Category']).size()

Year,Category,0
2010,Low,44
2010,Medium,111
2011,Low,30
2011,Medium,335
2012,High,27
2012,Low,19
2012,Medium,320
2013,Low,25
2013,Medium,340
2014,High,27


In [26]:
tfl.groupby(by=['Year','Category']).count()

Year,Category,Day,Number of Bicycle Hires
2010,Low,44,44
2010,Medium,111,111
2011,Low,30,30
2011,Medium,335,335
2012,High,27,27
2012,Low,19,19
2012,Medium,320,320
2013,Low,25,25
2013,Medium,340,340
2014,High,27,27
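
Both outputs above omit combinations that never occur (for example, there is no 'High' row for 2010). If a wide table with explicit zeros is easier to read, `pd.crosstab` is one option. A minimal sketch on made-up rows (`sample` and `counts` are hypothetical names):

```python
import pandas as pd

# Made-up rows: 2010 has no 'High' days, 2011 has no 'Low' days.
sample = pd.DataFrame({
    'Year': ['2010', '2010', '2010', '2011', '2011'],
    'Category': ['Low', 'Low', 'Medium', 'Medium', 'High'],
})

# One row per year, one column per category; missing pairs become 0.
counts = pd.crosstab(sample['Year'], sample['Category'])

# Put the columns in a logical order rather than alphabetical.
counts = counts.reindex(columns=['Low', 'Medium', 'High'], fill_value=0)

print(counts)
```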
