# Python Charmers 

## Python Fundamentals Lesson 5: Data Wrangling Part 2

### Lesson Overview
- **Objective:** This lesson dives into data prep with pandas and numpy
- **Source materials:** [stefmolin/pandas-workshop](https://github.com/stefmolin/pandas-workshop/blob/37c6e9cca94c29a4c5ffc8be6b241da8fb8ecb53/notebooks/2-data_wrangling.ipynb)
- **Prerequisites:** [Lesson 2: Packages](./fundamentals-02-packages.ipynb)
- **Duration:** 45 mins

In this section, we continue data wrangling with more advanced ways of restructuring, reshaping, and enriching data.

## Reshaping data

The taxi dataset we have been working with is in a format conducive to analysis. This isn't always the case. Let's now take a look at the TSA traveler throughput data, which compares 2021 throughput to the same day in 2020 and 2019:

In [1]:
import pandas as pd

tsa = pd.read_csv('../data/tsa_passenger_throughput.csv', parse_dates=['Date'])
tsa.head()

Unnamed: 0,Date,2021 Traveler Throughput,2020 Traveler Throughput,2019 Traveler Throughput
0,2021-05-14,1716561.0,250467,2664549
1,2021-05-13,1743515.0,234928,2611324
2,2021-05-12,1424664.0,176667,2343675
3,2021-05-11,1315493.0,163205,2191387
4,2021-05-10,1657722.0,215645,2512315


*Source: [TSA.gov](https://www.tsa.gov/coronavirus/passenger-throughput)*

First, we will lowercase the column names and take the first word (e.g., `2021` for `2021 Traveler Throughput`) to make this easier to work with:

In [2]:
tsa = tsa.rename(columns=lambda x: x.lower().split()[0])
tsa.head()

Unnamed: 0,date,2021,2020,2019
0,2021-05-14,1716561.0,250467,2664549
1,2021-05-13,1743515.0,234928,2611324
2,2021-05-12,1424664.0,176667,2343675
3,2021-05-11,1315493.0,163205,2191387
4,2021-05-10,1657722.0,215645,2512315


Now, we can work on reshaping it.

### Melting

Melting helps convert our data into long format. Now, we have all the traveler throughput numbers in a single column:

In [3]:
tsa_melted = pd.melt(tsa, # our dataframe
    id_vars=['date'], # column, or list of columns, that uniquely identifies a row
    var_name='year', # name for the new column created by melting
    value_name='travelers' # name for new column containing values from melted columns
)
tsa_melted.sample(5, random_state=1) # show some random entries

Unnamed: 0,date,year,travelers
974,2020-09-12,2019,1879822.0
435,2021-03-05,2020,2198517.0
1029,2020-07-19,2019,2727355.0
680,2020-07-03,2020,718988.0
867,2020-12-28,2019,2500396.0
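
To make the mechanics of melting easier to see, here is a minimal sketch on a tiny, made-up wide table (the values and the `wide`/`long` names are hypothetical, not taken from the TSA data). Every column not listed in `id_vars` is unpivoted into a pair of label and value columns:

```python
import pandas as pd

# Hypothetical wide-format data: one row per date, one column per year
wide = pd.DataFrame({
    'date': pd.to_datetime(['2021-01-01', '2021-01-02']),
    '2021': [10.0, 20.0],
    '2020': [30.0, 40.0],
})

# Columns not named in id_vars are stacked on top of each other:
# their names go into 'year', their values into 'travelers'
long = pd.melt(wide, id_vars=['date'], var_name='year', value_name='travelers')
print(long)
```

Two rows with two melted columns become four rows, one per date and year combination.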


To convert this into a time series of traveler throughput, we need to replace the year in the `date` column with the one in the `year` column. Otherwise, we are marking prior years' numbers with the wrong year.

In [4]:
# '.dt' is the pandas accessor for the datetime properties and methods of a column
# 'strftime' stands for "string format time". It converts datetime objects into strings
# '-%m-%d' formats the month and day from the existing date field, e.g. '-05-14'

tsa_melted = tsa_melted.assign(
    date=lambda x: pd.to_datetime(x.year + x.date.dt.strftime('-%m-%d'))
)

# alternatively you could write
tsa_melted['date'] = pd.to_datetime(tsa_melted['year'] + tsa_melted['date'].dt.strftime('-%m-%d'))

tsa_melted.sample(5, random_state=1)

Unnamed: 0,date,year,travelers
974,2019-09-12,2019,1879822.0
435,2020-03-05,2020,2198517.0
1029,2019-07-19,2019,2727355.0
680,2020-07-03,2020,718988.0
867,2019-12-28,2019,2500396.0
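
The year swap works because `year` holds strings, so `+` concatenates it with the formatted month and day before the result is parsed back into a date. The sketch below repeats the step on two made-up rows (the `toy` and `leap` names are hypothetical). It also shows one edge case worth knowing about: a leap day paired with a non-leap year is not a valid date, and `errors='coerce'` turns it into `NaT` instead of raising.

```python
import pandas as pd

# Hypothetical rows: the same calendar day, tagged with two different years
toy = pd.DataFrame({
    'date': pd.to_datetime(['2021-03-05', '2021-03-05']),
    'year': ['2020', '2019'],
})

# String concatenation: '2020' + '-03-05' -> '2020-03-05', then parse it
toy['date'] = pd.to_datetime(toy['year'] + toy['date'].dt.strftime('-%m-%d'))
print(toy)

# Edge case: February 29 does not exist in 2019, so coerce it to NaT
leap = pd.to_datetime(pd.Series(['2019-02-29']), errors='coerce')
print(leap)
```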


This leaves us with some null values (the dates that aren't present in the dataset):

In [5]:
tsa_melted.sort_values('date').tail(3)

Unnamed: 0,date,year,travelers
136,2021-12-29,2021,
135,2021-12-30,2021,
134,2021-12-31,2021,


These can be dropped with the `dropna()` method:

In [6]:
tsa_melted = tsa_melted.dropna()
tsa_melted.sort_values('date').tail(3)

Unnamed: 0,date,year,travelers
2,2021-05-12,2021,1424664.0
1,2021-05-13,2021,1743515.0
0,2021-05-14,2021,1716561.0


### Pivoting

Using the melted data, we can pivot the data to compare TSA traveler throughput on specific days across years. Let's look at the first 10 days in March:

In [7]:
# convert column to date format
tsa_melted['date'] = pd.to_datetime(tsa_melted['date'])

# 'copy()' here is ensuring that 'first_10_days' becomes a standalone DataFrame
first_10_days = tsa_melted.loc[(tsa_melted['date'].dt.month == 3) & (tsa_melted['date'].dt.day <= 10)].copy()
first_10_days['day_in_march'] = first_10_days['date'].dt.day

# pivot dataset
first_10_days_pivot = pd.pivot(first_10_days,index='year', columns='day_in_march', values='travelers')
first_10_days_pivot

day_in_march,1,2,3,4,5,6,7,8,9,10
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2019,2257920.0,1979558.0,2143619.0,2402692.0,2543689.0,2156262.0,2485430.0,2378673.0,2122898.0,2187298.0
2020,2089641.0,1736393.0,1877401.0,2130015.0,2198517.0,1844811.0,2119867.0,1909363.0,1617220.0,1702686.0
2021,1049692.0,744812.0,826924.0,1107534.0,1168734.0,992406.0,1278557.0,1119303.0,825745.0,974221.0


In [8]:
# Alternatively these steps can be combined as below
tsa_pivoted = tsa_melted\
    .query('date.dt.month == 3 and date.dt.day <= 10')\
    .assign(day_in_march=lambda x: x.date.dt.day)\
    .pivot(index='year', columns='day_in_march', values='travelers')
tsa_pivoted

day_in_march,1,2,3,4,5,6,7,8,9,10
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2019,2257920.0,1979558.0,2143619.0,2402692.0,2543689.0,2156262.0,2485430.0,2378673.0,2122898.0,2187298.0
2020,2089641.0,1736393.0,1877401.0,2130015.0,2198517.0,1844811.0,2119867.0,1909363.0,1617220.0,1702686.0
2021,1049692.0,744812.0,826924.0,1107534.0,1168734.0,992406.0,1278557.0,1119303.0,825745.0,974221.0


In [9]:
# Note: we currently have two header rows; to return to one, use '.reset_index()'
# 'day_in_march' then remains only as the name of the column index and can be dropped, e.g. with '.rename_axis(columns=None)'
tsa_pivoted.reset_index()

day_in_march,year,1,2,3,4,5,6,7,8,9,10
0,2019,2257920.0,1979558.0,2143619.0,2402692.0,2543689.0,2156262.0,2485430.0,2378673.0,2122898.0,2187298.0
1,2020,2089641.0,1736393.0,1877401.0,2130015.0,2198517.0,1844811.0,2119867.0,1909363.0,1617220.0,1702686.0
2,2021,1049692.0,744812.0,826924.0,1107534.0,1168734.0,992406.0,1278557.0,1119303.0,825745.0,974221.0
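
As a compact recap of the pivot steps, the following sketch uses a small made-up long table (the column name `day` and all values are hypothetical). It pivots to wide format, then flattens the result with `reset_index()` and removes the leftover column-index name with `rename_axis(columns=None)`:

```python
import pandas as pd

# Hypothetical long-format data: one row per (year, day) pair
long = pd.DataFrame({
    'year': ['2019', '2019', '2020', '2020'],
    'day': [1, 2, 1, 2],
    'travelers': [100.0, 110.0, 50.0, 55.0],
})

# One row per year, one column per day
wide = long.pivot(index='year', columns='day', values='travelers')
print(wide.columns.name)  # the column index is named after the pivoted column

# Move 'year' back into a regular column and drop the column-index name
flat = wide.reset_index().rename_axis(columns=None)
print(flat)
```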


**Important**: We aren't covering the `unstack()` and `stack()` methods, which are additional ways to pivot and melt, respectively. These come in handy when we have a multi-level index (e.g., if we ran `set_index()` with more than one column). More information can be found [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html).

### Transposing

The `T` attribute provides a quick way to flip rows and columns.

In [10]:
tsa_pivoted.T # or tsa_pivoted.transpose()

year,2019,2020,2021
day_in_march,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,2257920.0,2089641.0,1049692.0
2,1979558.0,1736393.0,744812.0
3,2143619.0,1877401.0,826924.0
4,2402692.0,2130015.0,1107534.0
5,2543689.0,2198517.0,1168734.0
6,2156262.0,1844811.0,992406.0
7,2485430.0,2119867.0,1278557.0
8,2378673.0,1909363.0,1119303.0
9,2122898.0,1617220.0,825745.0
10,2187298.0,1702686.0,974221.0


### Merging (Joining)

We typically observe changes in air travel around the holidays, so adding information about the dates in the TSA dataset provides more context. The `holidays.csv` file contains a few major holidays in the United States:

In [11]:
holidays = pd.read_csv('../data/holidays.csv', parse_dates=True, index_col='date')
holidays.loc['2019']

Unnamed: 0_level_0,holiday
date,Unnamed: 1_level_1
2019-01-01,New Year's Day
2019-05-27,Memorial Day
2019-07-04,July 4th
2019-09-02,Labor Day
2019-11-28,Thanksgiving
2019-12-24,Christmas Eve
2019-12-25,Christmas Day
2019-12-31,New Year's Eve


Merging the holidays with the TSA traveler throughput data will provide more context for our analysis:

In [12]:
# 'merge()' will join two dataframes, in the form df1.merge(df2, ......)
# 'left_on' & 'right_on' are the columns or list of columns you are joining on, 
#   - in this case we use an index so there is a special parameter for that
# 'how' defines the type of join, e.g. 'left','inner',etc.
tsa_melted_holidays = tsa_melted.merge(
    holidays, 
    left_on='date', 
    right_index=True, 
    how='left')
    
tsa_melted_holidays = tsa_melted_holidays.sort_values('date')
tsa_melted_holidays.head()

Unnamed: 0,date,year,travelers,holiday
863,2019-01-01,2019,2126398.0,New Year's Day
862,2019-01-02,2019,2345103.0,
861,2019-01-03,2019,2202111.0,
860,2019-01-04,2019,2150571.0,
859,2019-01-05,2019,1975947.0,
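
To see how `how` changes the result, here is a small sketch with made-up frames (`left` and `right` are hypothetical stand-ins for the TSA and holiday data). A left join keeps every row of the left frame and fills missing matches with nulls, while an inner join keeps only the rows that match:

```python
import pandas as pd

# Hypothetical daily data with the join key in a regular column
left = pd.DataFrame({
    'date': pd.to_datetime(['2019-07-03', '2019-07-04', '2019-07-05']),
    'travelers': [100, 200, 150],
})

# Hypothetical lookup table with the join key in the index
right = pd.DataFrame(
    {'holiday': ['July 4th']},
    index=pd.to_datetime(['2019-07-04']),
)

# Left join: all three dates survive, two of them without a holiday
left_join = left.merge(right, left_on='date', right_index=True, how='left')

# Inner join: only the date present in both frames survives
inner_join = left.merge(right, left_on='date', right_index=True, how='inner')

print(left_join)
print(inner_join)
```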


*Tip: There are many parameters for this method, so be sure to check out the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html).
To append rows (i.e., `UNION ALL` in SQL), take a look at the `pd.concat()` function.*
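
A minimal sketch of appending rows with `pd.concat()`, using two made-up frames (`jan` and `feb` are hypothetical). Concatenation keeps duplicates, like `UNION ALL`; chaining `drop_duplicates()` gives behaviour closer to a plain SQL `UNION`:

```python
import pandas as pd

# Hypothetical frames with the same columns; one row appears in both
jan = pd.DataFrame({'date': ['2021-01-01'], 'travelers': [100]})
feb = pd.DataFrame({'date': ['2021-02-01', '2021-01-01'], 'travelers': [200, 100]})

# UNION ALL: stack the rows and renumber the index
stacked = pd.concat([jan, feb], ignore_index=True)

# UNION: additionally remove fully duplicated rows
deduped = stacked.drop_duplicates(ignore_index=True)

print(stacked)
print(deduped)
```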

We can take this a step further by marking a few days before and after each holiday as part of the holiday. This would make it easier to compare holiday travel across years and look for any uptick in travel around the holidays:

In [13]:
tsa_melted_holiday_travel = tsa_melted_holidays.assign(
    holiday=lambda x:
        x.holiday.ffill(limit=1).bfill(limit=2)
)
tsa_melted_holiday_travel.head()

Unnamed: 0,date,year,travelers,holiday
863,2019-01-01,2019,2126398.0,New Year's Day
862,2019-01-02,2019,2345103.0,New Year's Day
861,2019-01-03,2019,2202111.0,
860,2019-01-04,2019,2150571.0,
859,2019-01-05,2019,1975947.0,


*Tip: Check out the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) for the full list of functionality available with the `fillna()` method.*

Notice that we now have values for the day after each holiday and the two days prior. Thanksgiving in 2019 was on November 28th, so the 26th, 27th, and 29th were filled. Since we are only replacing null values, we don't override Christmas Day with the forward fill of Christmas Eve:

In [14]:
# Note our 'year' column is formatted as a string
tsa_melted_holiday_travel.loc[
    (tsa_melted_holiday_travel['year'] == '2019') & 
    ((tsa_melted_holiday_travel['holiday'] == "Thanksgiving") | 
    (tsa_melted_holiday_travel['holiday'].str.contains("Christmas")))
]

# Alternatively this can be filtered using 'query()'
tsa_melted_holiday_travel.query(
    'year == "2019" and '
    '(holiday == "Thanksgiving" or holiday.str.contains("Christmas"))'
)

Unnamed: 0,date,year,travelers,holiday
899,2019-11-26,2019,1591158.0,Thanksgiving
898,2019-11-27,2019,1968137.0,Thanksgiving
897,2019-11-28,2019,2648268.0,Thanksgiving
896,2019-11-29,2019,2882915.0,Thanksgiving
873,2019-12-22,2019,1981433.0,Christmas Eve
872,2019-12-23,2019,1937235.0,Christmas Eve
871,2019-12-24,2019,2552194.0,Christmas Eve
870,2019-12-25,2019,2582580.0,Christmas Day
869,2019-12-26,2019,2470786.0,Christmas Day
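
The fill logic can be checked on a small made-up series (the labels `'Eve'` and `'Day'` and the dates are hypothetical). `ffill(limit=1)` copies each value forward over at most one null, and `bfill(limit=2)` then copies values backward over at most two nulls. Existing values are never overwritten:

```python
import numpy as np
import pandas as pd

# Hypothetical holiday labels for Dec 21 through Dec 27
holiday = pd.Series(
    [np.nan, np.nan, np.nan, 'Eve', 'Day', np.nan, np.nan],
    index=pd.date_range('2019-12-21', periods=7),
)

# Forward fill one day after each label, then back fill two days before
filled = holiday.ffill(limit=1).bfill(limit=2)
print(filled)
```

Dec 22 and 23 pick up `'Eve'`, Dec 26 picks up `'Day'`, and Dec 21 and 27 stay null because they fall outside the limits.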


## Aggregations and grouping

After reshaping and cleaning our data, we can perform aggregations to summarize it in a variety of ways. In this section, we will explore using pivot tables, crosstabs, and group by operations to aggregate the data.

### Pivot tables
We can build a pivot table to compare holiday travel across the years in our dataset:

In [15]:
# The pivot_table() function in pandas is similar to pivot(). However, it is more powerful:
# it enables you to create summary tables of data by pivoting on one or more columns and
# aggregating values across one or more columns.

pd.pivot_table(
    tsa_melted_holiday_travel, # the dataframe
    index = 'year', # the column(s) whose values become the row labels
    columns = 'holiday', # the column(s) whose values become the column headers
    values = 'travelers', # the column(s) to aggregate
    aggfunc = 'sum' # how we'll aggregate the data
)

# tsa_melted_holiday_travel.pivot_table(
#     index='year', columns='holiday', 
#     values='travelers', aggfunc='sum'
# )

holiday,Christmas Day,Christmas Eve,July 4th,Labor Day,Memorial Day,New Year's Day,New Year's Eve,Thanksgiving
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2019,5053366.0,6470862.0,9414228.0,8314811.0,9720691.0,4471501.0,6535464.0,9090478.0
2020,1745242.0,3029810.0,2682541.0,2993653.0,1126253.0,4490388.0,3057449.0,3364358.0
2021,,,,,,1998871.0,,
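
The practical difference between `pivot()` and `pivot_table()` shows up when several rows share the same index and column combination. The sketch below uses a made-up frame (all values are hypothetical): `pivot()` raises a `ValueError` because it cannot decide which value to keep, while `pivot_table()` resolves the duplicates with the aggregation function.

```python
import pandas as pd

# Hypothetical data: two rows share the pair ('2019', 'Thanksgiving')
df = pd.DataFrame({
    'year': ['2019', '2019', '2019', '2020'],
    'holiday': ['Thanksgiving', 'Thanksgiving', 'Labor Day', 'Labor Day'],
    'travelers': [10, 20, 5, 3],
})

# pivot() only reshapes, so duplicate pairs are an error
try:
    df.pivot(index='year', columns='holiday', values='travelers')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates the duplicates instead
summary = df.pivot_table(
    index='year', columns='holiday', values='travelers', aggfunc='sum'
)
print(pivot_failed)
print(summary)
```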


We can use the `pct_change()` method on this result to see which holiday travel periods saw the biggest change in travel:

In [16]:
pd.pivot_table(
    tsa_melted_holiday_travel, 
    index = 'year', 
    columns = 'holiday', 
    values = 'travelers', 
    aggfunc = 'sum' 
).pct_change(fill_method=None)


holiday,Christmas Day,Christmas Eve,July 4th,Labor Day,Memorial Day,New Year's Day,New Year's Eve,Thanksgiving
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2019,,,,,,,,
2020,-0.654638,-0.531776,-0.715055,-0.639961,-0.884139,0.004224,-0.532176,-0.629903
2021,,,,,,-0.554856,,
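
For reference, `pct_change()` computes `(current - previous) / previous` down each column. A small sketch with made-up totals (hypothetical values) shows the arithmetic, and how `fill_method=None` leaves a null wherever either value is missing rather than filling it first:

```python
import numpy as np
import pandas as pd

# Hypothetical yearly totals; Labor Day has no 2021 value
totals = pd.DataFrame(
    {'Labor Day': [200.0, 50.0, np.nan], "New Year's Day": [100.0, 110.0, 55.0]},
    index=pd.Index(['2019', '2020', '2021'], name='year'),
)

# Row-over-row relative change; nulls are not filled before computing
change = totals.pct_change(fill_method=None)
print(change)
```

Here Labor Day drops by 75% from 2019 to 2020 (`(50 - 200) / 200`), and the first row is always null because there is no previous year to compare against.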


Let's make one last pivot table with column and row subtotals, along with some formatting improvements. First, we set a display option for all floats:

In [17]:
pd.set_option('display.float_format', '{:,.0f}'.format)

### If else statements with numpy

In this next section, we'll use the numpy library to create conditional statements, similar to if/else statements, if/else-if chains, or `IIF()` in other tools.

In [18]:
# You may need to install numpy via terminal and then restart this notebook in a new tab
import numpy as np

# np.where(condition, value if true, value if false)
tsa_melted_holiday_travel['before_pandemic'] = np.where(tsa_melted_holiday_travel['date']<'2020-03-01','Y','N')
tsa_melted_holiday_travel.head()

# Note you can also nest np.where() statements, e.g.
# np.where(statement, if true, np.where(statement, if true, if false))

Unnamed: 0,date,year,travelers,holiday,before_pandemic
863,2019-01-01,2019,2126398,New Year's Day,Y
862,2019-01-02,2019,2345103,New Year's Day,Y
861,2019-01-03,2019,2202111,,Y
860,2019-01-04,2019,2150571,,Y
859,2019-01-05,2019,1975947,,Y
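
As a sketch of the nested form, the example below labels made-up traveler counts with three categories (the thresholds and labels are hypothetical, chosen only for illustration). It also shows `np.select()`, a numpy function that can be easier to read once there are more than two branches: it takes a list of conditions, a matching list of choices, and a default.

```python
import numpy as np
import pandas as pd

# Hypothetical daily traveler counts
travelers = pd.Series([400_000, 1_200_000, 2_500_000])

# Nested np.where(): the "if false" branch is itself another np.where()
nested = np.where(
    travelers < 1_000_000, 'low',
    np.where(travelers < 2_000_000, 'medium', 'high')
)

# np.select(): conditions are checked in order, the first match wins
selected = np.select(
    [travelers < 1_000_000, travelers < 2_000_000],
    ['low', 'medium'],
    default='high',
)

print(nested)
print(selected)
```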


Next, we group together Christmas Eve and Christmas Day, likewise for New Year's Eve and New Year's Day, and create the pivot table:

In [31]:
import numpy as np

# Group Christmas and New Year by removing Day or Eve
tsa_melted_holiday_travel = tsa_melted_holiday_travel.assign(
    holiday=lambda x: np.where(
        x.holiday.str.contains('Christmas|New Year', regex=True), 
        x.holiday.str.replace('Day|Eve', '', regex=True).str.strip(), 
        x.holiday
    )
)

tsa_melted_holiday_travel.pivot_table(
    index='year', 
    columns='holiday', 
    values='travelers', 
    aggfunc='sum', 
    margins=True, # creates column & row totals
    margins_name='Total'
)

holiday,Christmas,July 4th,Labor Day,Memorial Day,New Year's,Thanksgiving,Total
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2019,11524228.0,9414228.0,8314811.0,9720691.0,11006965.0,9090478.0,59071401.0
2020,4775052.0,2682541.0,2993653.0,1126253.0,7547837.0,3364358.0,22489694.0
2021,,,,,1998871.0,,1998871.0
Total,16299280.0,12096769.0,11308464.0,10846944.0,20553673.0,12454836.0,83559966.0


Before moving on, let's reset the display option:

In [20]:
pd.reset_option('display.float_format')

*Tip: Read more about options in the documentation [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html).*
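
If an option should only apply to a single block of code, `pd.option_context()` is an alternative to pairing `set_option()` with `reset_option()`. A minimal sketch, assuming a small made-up frame (`df` is hypothetical): the option is active inside the `with` block and restored automatically afterwards.

```python
import pandas as pd

# Hypothetical data to format
df = pd.DataFrame({'travelers': [2126398.0, 2345103.0]})

# Remember the current setting so we can confirm it is restored
before = pd.get_option('display.float_format')

# The float format applies only inside this block
with pd.option_context('display.float_format', '{:,.0f}'.format):
    formatted = df.to_string()

after = pd.get_option('display.float_format')

print(formatted)
print(after is before)
```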

### Exercise 3

Using the meteorite data from the `Meteorite_Landings.csv` file, create a pivot table that shows both the number of meteorites and the 95th percentile of meteorite mass for those that were found versus observed falling per year from 2005 through 2009 (inclusive). Hint: Be sure to convert the `year` column to a number as we did in the previous exercise.

In [21]:
# Enter your code here

### Crosstabs
The `pd.crosstab()` function provides an easy way to create a frequency table. Here, we count the number of low-, medium-, and high-volume travel days per year, using the `pd.cut()` function to create three travel volume bins of equal width:

In [22]:
# pd.cut() here is used to segment and sort data values 'travelers' into bins.
tsa_melted_holiday_travel['travel_volume'] = pd.cut(tsa_melted_holiday_travel['travelers'], bins=3, labels=['low', 'medium', 'high'])

# pd.crosstab() can now make a quick frequency table by year
pd.crosstab(
    index = tsa_melted_holiday_travel['travel_volume'], # the bins we created with pd.cut()
    columns = tsa_melted_holiday_travel['year'] # our column headers
)

year,2019,2020,2021
travel_volume,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
low,0,277,54
medium,42,44,80
high,323,44,0


*Tip: The `pd.cut()` function can also be used to specify custom bin ranges. For equal-sized bins based on quantiles, use the `pd.qcut()` function instead.*
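
The difference between the binning approaches is easiest to see on skewed data. This sketch uses a made-up series with one large outlier (values, bin edges, and labels are hypothetical):

```python
import pandas as pd

labels = ['low', 'medium', 'high']

# Hypothetical values with one outlier
values = pd.Series([1, 2, 3, 4, 5, 100])

# Equal-width bins: the outlier stretches the range, so most values land in 'low'
equal_width = pd.cut(values, bins=3, labels=labels)

# Custom bin edges: intervals are (0, 3], (3, 10], (10, 100]
custom = pd.cut(values, bins=[0, 3, 10, 100], labels=labels)

# Quantile-based bins: each bin receives roughly the same number of values
equal_count = pd.qcut(values, q=3, labels=labels)

print(pd.DataFrame({
    'value': values,
    'equal_width': equal_width,
    'custom': custom,
    'equal_count': equal_count,
}))
```

With equal-width bins, five of the six values fall into `low` and none into `medium`; with quantile bins, each label gets two values.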

Note that the `pd.crosstab()` function supports other aggregations provided you pass in the data to aggregate as `values` and specify the aggregation with `aggfunc`. You can also add subtotals and normalize the data. See the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html) for more information.
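
A short sketch of those `pd.crosstab()` options on a made-up frame (column names and values are hypothetical): a plain frequency count, a sum of another column via `values` and `aggfunc`, and column-wise shares via `normalize`.

```python
import pandas as pd

# Hypothetical data: one row per travel day
df = pd.DataFrame({
    'year': ['2019', '2019', '2020', '2020'],
    'volume': ['high', 'high', 'low', 'high'],
    'travelers': [10, 20, 1, 5],
})

# Default: count how often each (volume, year) pair occurs
counts = pd.crosstab(index=df['volume'], columns=df['year'])

# Aggregate another column instead of counting
sums = pd.crosstab(
    index=df['volume'], columns=df['year'],
    values=df['travelers'], aggfunc='sum'
)

# Share of each volume level within a year (each column sums to 1)
shares = pd.crosstab(index=df['volume'], columns=df['year'], normalize='columns')

print(counts)
print(sums)
print(shares)
```

Combinations that never occur show up as `0` in the count table but as a null in the aggregated table, since there is nothing to sum.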

### Group by operations
Rather than perform aggregations, like `mean()` or `describe()`, on the full dataset at once, we can perform these calculations per group by first calling `groupby()`:

In [32]:
tsa_melted_holiday_travel.groupby('year').describe(include=np.number)

Unnamed: 0_level_0,travelers,travelers,travelers,travelers,travelers,travelers,travelers,travelers
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max
year,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
2019,365.0,2309482.0,285061.490784,1534386.0,2091116.0,2358007.0,2538384.0,2882915.0
2020,365.0,881867.4,639775.194297,87534.0,507129.0,718310.0,983745.0,2507588.0
2021,134.0,1112632.0,338040.673782,468933.0,807156.0,1117391.0,1409377.75,1743515.0


Groups can also be used to perform separate calculations per subset of the data. For example, we can find the highest-volume travel day per year using `rank()`:

In [36]:
# Create rank of travelers each year
tsa_melted_holiday_travel['travel_volume_rank'] = tsa_melted_holiday_travel.groupby('year').travelers.rank(ascending=False)

# show top ranks for each year
tsa_melted_holiday_travel.sort_values(['travel_volume_rank', 'year']).head(3)

Unnamed: 0,date,year,travelers,holiday,before_pandemic,travel_volume,travel_volume_rank
896,2019-11-29,2019,2882915.0,Thanksgiving,Y,high,1.0
456,2020-02-12,2020,2507588.0,,Y,high,1.0
1,2021-05-13,2021,1743515.0,,N,medium,1.0
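
The per-group ranking can be verified by hand on a small made-up frame (values and the `rank_in_year` name are hypothetical). The rank restarts at 1 within each year, so filtering on rank 1 returns the busiest day of every year:

```python
import pandas as pd

# Hypothetical daily counts for two years
df = pd.DataFrame({
    'year': ['2019', '2019', '2019', '2020', '2020'],
    'travelers': [100, 300, 200, 50, 80],
})

# Rank within each year; ascending=False gives rank 1 to the largest value
df['rank_in_year'] = df.groupby('year')['travelers'].rank(ascending=False)

# Keep only the top-ranked day per year
busiest = df[df['rank_in_year'] == 1]

print(df)
print(busiest)
```

Note that `rank()` averages the ranks of tied values by default, so ties can produce fractional ranks; the `method` parameter (e.g., `method='min'`) changes this.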


The previous two examples called a single method on the grouped data, but using the `agg()` method we can specify any number of them:

In [38]:
# Create columns for travelers during the holidays and non-holidays, and convert year to a numeric type
tsa_melted_holiday_travel = tsa_melted_holiday_travel.assign(
    holiday_travelers=lambda x: np.where(x.holiday.isna(), np.nan, x.travelers),
    non_holiday_travelers=lambda x: np.where(x.holiday.isna(), x.travelers, np.nan),
    year=lambda x: pd.to_numeric(x.year)
)

# select_dtypes(include='number') is selecting all numerical data types from our dataframe
tsa_melted_holiday_travel.select_dtypes(include='number').groupby('year').agg(['mean', 'std'])

Unnamed: 0_level_0,travelers,travelers,travel_volume_rank,travel_volume_rank,holiday_travelers,holiday_travelers,non_holiday_travelers,non_holiday_travelers
Unnamed: 0_level_1,mean,std,mean,std,mean,std,mean,std
year,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
2019,2309482.0,285061.490784,183.0,105.510656,2271977.0,303021.675751,2312359.0,283906.226598
2020,881867.4,639775.194297,183.0,105.510656,864988.2,489938.240989,883161.9,650399.77293
2021,1112632.0,338040.673782,67.5,38.826537,999435.5,273573.24968,1114347.0,339479.298658


*Tip: The `select_dtypes()` method makes it possible to select columns by their data type. We can specify the data types to `exclude` and/or `include`.*

In addition, we can specify which aggregations to perform on each column:

In [26]:
tsa_melted_holiday_travel.assign(
    holiday_travelers=lambda x: np.where(x.holiday.isna(), np.nan, x.travelers),
    non_holiday_travelers=lambda x: np.where(x.holiday.isna(), x.travelers, np.nan)
).groupby('year').agg({
    'holiday_travelers': ['mean', 'std'], 
    'holiday': ['nunique', 'count']
})

Unnamed: 0_level_0,holiday_travelers,holiday_travelers,holiday,holiday
Unnamed: 0_level_1,mean,std,nunique,count
year,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
2019,2271977.0,303021.675751,6,26
2020,864988.2,489938.240989,6,26
2021,999435.5,273573.24968,1,2
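
Passing a dictionary to `agg()` produces two header rows. If flat, custom column names are preferred, pandas also supports named aggregation, where each keyword argument is `new_name=(column, function)`. A minimal sketch on made-up data (the frame and the output names are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data: one row per travel day
df = pd.DataFrame({
    'year': [2019, 2019, 2020, 2020],
    'travelers': [10.0, 30.0, 4.0, 6.0],
    'holiday': ['Thanksgiving', np.nan, 'Labor Day', 'Labor Day'],
})

# Each keyword names an output column and says what to compute for it
summary = df.groupby('year').agg(
    mean_travelers=('travelers', 'mean'),
    holidays_seen=('holiday', 'nunique'),  # distinct non-null holidays
    holiday_days=('holiday', 'count'),     # non-null holiday rows
)
print(summary)
```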


We are only scratching the surface; some additional functionalities to be aware of include the following:
- We can group by multiple columns; this creates a hierarchical index.
- Groups can be excluded from calculations with the `filter()` method.
- We can group on content in the index using the `level` parameter, which accepts either the position or the name of an index level, e.g., `groupby(level=0)` or `groupby(level='year')`.
- We can group by date ranges if we use a `pd.Grouper()` object.

Be sure to check out the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) for more details.
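
The four points listed above can be sketched on one small made-up frame (all names and values are hypothetical, and the monthly frequency `'MS'` is just one possible choice for `pd.Grouper()`):

```python
import pandas as pd

# Hypothetical data: one row per travel day
df = pd.DataFrame({
    'date': pd.to_datetime(
        ['2019-01-15', '2019-02-10', '2020-01-20', '2020-01-25', '2021-01-05']
    ),
    'year': [2019, 2019, 2020, 2020, 2021],
    'holiday': ['A', 'B', 'A', 'A', 'A'],
    'travelers': [10, 20, 3, 5, 7],
})

# 1. Grouping by multiple columns yields a hierarchical (Multi)index
by_year_holiday = df.groupby(['year', 'holiday'])['travelers'].sum()

# 2. filter() keeps only the rows of groups that pass a test
#    (here: years with more than one row)
multi_row_years = df.groupby('year').filter(lambda g: len(g) > 1)

# 3. Grouping on an index level, referenced by name
by_level = df.set_index('year').groupby(level='year')['travelers'].sum()

# 4. Grouping by date range: one group per calendar month
monthly = df.groupby(pd.Grouper(key='date', freq='MS'))['travelers'].sum()

print(by_year_holiday)
print(multi_row_years)
print(by_level)
print(monthly.head())
```

Note that `pd.Grouper()` creates a group for every month in the covered range, so months without data appear with a sum of 0.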

### Exercise 4

Using the meteorite data from the `Meteorite_Landings.csv` file, compare summary statistics of the mass column for the meteorites that were found versus observed falling.

In [27]:
# Enter your code here

## Mini Project: Preppin' Data 2022: Week 4

The final introductory challenge for 2022 looks at how students are getting to and from school. Are the students travelling in a sustainable manner? What's the most popular type of sustainable travel?

In [28]:
import pandas as pd 
student_df = pd.read_csv('../data/students.csv')
travel_df = pd.read_csv('../data/travel.csv')

print("""Input
""")

print('Students')
print(student_df.head(5))
print(""" 
 
 """)
print('Travel')
print(travel_df.head(5))

Input

Students
   id pupil first name pupil last name  gender Date of Birth  \
0   1            Ronna         Nellies  Female    12/21/2013   
1   2            Rusty       Andriulis    Male     7/21/2012   
2   3          Roberta       Oakeshott  Female     12/4/2011   
3   4             Lola       Rubinfajn    Male     6/29/2012   
4   5           Kamila        Benedtti  Female     7/10/2012   

  Parental Contact Name_1 Parental Contact Name_2 Preferred Contact Employer  \
0                 Purcell                   Ketti                     Demizz   
1                 Vassili                    Rivi                   Brainbox   
2                    Lind                 Haskell                   Centidel   
3                    Elie                   Tresa                   Edgeblab   
4                   Adela                  Clevey                     Trudoo   

   Parental Contact  
0                 1  
1                 1  
2                 2  
3                 2  
4       

### Requirements

- Input the data sets
- Join the data sets together based on their common field
- Remove any fields you don't need for the challenge
- Change the weekdays from separate columns to one column of weekdays and one of the pupil's travel choice
- Group the travel choices together to remove spelling mistakes
- Create a Sustainable (non-motorised) vs Non-Sustainable (motorised) data field 
- Scooters are the child type rather than the motorised type
- Total up the number of pupils travelling by each method of travel
- Work out the % of trips taken by each method of travel each day
- Round to 2 decimal places
- Output the data

In [29]:
# Enter your code here

In [30]:
import pandas as pd 
solution_df = pd.read_csv('../data/PD2022Wk4Output.csv')
print(solution_df.head(5))

      Sustainable?  % of trips per day  Trips per day  Number of Trips  \
0      Sustainable                0.51           1000              510   
1  Non-Sustainable                0.01           1000                9   
2  Non-Sustainable                0.01           1000                9   
3      Sustainable                0.01           1000               13   
4      Sustainable                0.22           1000              220   

  Weekday Method of Travel  
0      Th             Walk  
1      Th        Aeroplane  
2      Tu        Aeroplane  
3       W  Mum's Shoulders  
4      Tu          Bicycle  


## Additional Resources
- 📰 **Py Data** - Pandas Docs - https://pandas.pydata.org/docs/
- 📰 **wjsutton** - Python Preppin' Data Solutions - https://github.com/wjsutton/preppin-data
- 📺 **Alex the Analyst** The Best Python Pandas Tutorial - https://youtu.be/bDhvCp3_lYw?si=LljpeI6ad1lNgr5z

## Summary

In this lesson, we explored more ways to clean and prepare a dataset, utilised lambda functions, and saw how we can join, pivot, and union data in Python.

## Next Lesson

**[Lesson 6: Pandas Test](./fundamentals-06-pandas-test.ipynb)** 
Put those skills to work in this test on pandas.