https://worldroadstatistics.org/wrs-data/data/

Area of analysis: is Ireland a country heavily dependent on public transportation or not?

# Data loading

In [61]:
import pandas as pd

df_2021 = pd.read_csv("../Data/total-four-wheeled-traffic-(by-vehicle-type)---2021.csv")
df_2020 = pd.read_csv("../Data/total-four-wheeled-traffic-(by-vehicle-type)---2020.csv")
df_2019 = pd.read_csv("../Data/total-four-wheeled-traffic-(by-vehicle-type)---2019.csv")
df_2018 = pd.read_csv("../Data/total-four-wheeled-traffic-(by-vehicle-type)---2018.csv")
df_2017 = pd.read_csv("../Data/total-four-wheeled-traffic-(by-vehicle-type)---2017.csv")
df_2016 = pd.read_csv("../Data/total-four-wheeled-traffic-(by-vehicle-type)---2016.csv")

In [62]:
df_2021["Year"] = 2021 
df_2020["Year"] = 2020
df_2019["Year"] = 2019
df_2018["Year"] = 2018
df_2017["Year"] = 2017
df_2016["Year"] = 2016

In [63]:
df = pd.concat([df_2021,df_2020,df_2019,df_2018,df_2017,df_2016])
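
The six per-year loads above follow one pattern, so they can also be written as a loop. The following is a minimal sketch, assuming the same file-name pattern as the cells above; the helper name `load_traffic` and the constant `FILE_PATTERN` are hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd

# Assumed file-name pattern, taken from the read_csv calls above
FILE_PATTERN = "total-four-wheeled-traffic-(by-vehicle-type)---{year}.csv"


def load_traffic(data_dir, years):
    """Read one CSV per year, tag each frame with its year, and stack them."""
    frames = []
    for year in years:
        frame = pd.read_csv(Path(data_dir) / FILE_PATTERN.format(year=year))
        frame["Year"] = year  # same tagging as the per-year assignments above
        frames.append(frame)
    # No ignore_index: each year's original row index is kept, as in the cell above
    return pd.concat(frames)
```

Under these assumptions, `load_traffic("../Data", range(2021, 2015, -1))` should yield the same stacked frame as the three cells above, in the same year order.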

# Descriptive Statistics 

In [64]:
# Display the first 5 rows of df
df.head()

Unnamed: 0,Category,Passenger Car Traffic,Bus and Motor Coach Traffic,"Total Van, Pickup, Lorry and Road Tractor Traffic",Total four wheeled Traffic,Year
0,Austria,60721.0,390.0,13810.0,74921.0,2021
1,Belarus,244.0,324.0,935.0,1503.0,2021
2,Bulgaria,,353.0,2757.0,3110.0,2021
3,Croatia,19780.0,240.0,3170.0,23190.0,2021
4,Denmark,40313.0,624.0,9257.0,50194.0,2021


In [65]:
# Display the shape of our dataset
df.shape

(188, 6)

In [66]:
# Display data types
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 188 entries, 0 to 30
Data columns (total 6 columns):
 #   Column                                             Non-Null Count  Dtype  
---  ------                                             --------------  -----  
 0   Category                                           188 non-null    object 
 1   Passenger Car Traffic                              152 non-null    float64
 2   Bus and Motor Coach Traffic                        176 non-null    float64
 3   Total Van, Pickup, Lorry and Road Tractor Traffic  175 non-null    float64
 4   Total four wheeled Traffic                         158 non-null    float64
 5   Year                                               188 non-null    int64  
dtypes: float64(4), int64(1), object(1)
memory usage: 10.3+ KB


In [24]:
# Display sum of null values for each column
df.isnull().sum()

Category                                              0
Passenger Car Traffic                                36
Bus and Motor Coach Traffic                          12
Total Van, Pickup, Lorry and Road Tractor Traffic    13
Total four wheeled Traffic                           30
Year                                                  0
dtype: int64

In [29]:
# Display the unique values of the Category column
df.Category.unique()

array(['Austria', 'Belarus', 'Bulgaria', 'Croatia', 'Denmark', 'Estonia',
       'Finland', 'France', 'Germany', 'Hungary', 'Ireland', 'Latvia',
       'Lithuania', 'Malta', 'Moldova', 'Monaco', 'Montenegro',
       'Netherlands', 'North Macedonia', 'Norway', 'Poland', 'Portugal',
       'Romania', 'Russian Federation', 'Serbia', 'Slovak Republic',
       'Slovenia', 'Spain', 'Sweden', 'Switzerland', 'United Kingdom',
       'Ukraine', 'Belgium'], dtype=object)

In [38]:
# Display summary statistics for 2021
df.loc[df["Year"]==2021].describe()

Unnamed: 0,Passenger Car Traffic,Bus and Motor Coach Traffic,"Total Van, Pickup, Lorry and Road Tractor Traffic",Total four wheeled Traffic,Year
count,25.0,29.0,28.0,26.0,31.0
mean,89225.76,726.413793,19942.5,107020.653846,2021.0
std,149398.238108,1014.271009,32740.725805,180480.177848,0.0
min,55.0,3.0,4.0,62.0,2021.0
25%,9393.0,138.0,2738.0,11584.75,2021.0
50%,32284.0,324.0,8275.0,43019.5,2021.0
75%,63913.0,665.0,16833.75,78344.75,2021.0
max,582400.0,3862.0,116000.0,677800.0,2021.0


In [93]:
df_bus = df[["Year", "Category", "Bus and Motor Coach Traffic", "Total four wheeled Traffic"]]

In [94]:
df_bus_ireland = df_bus.loc[df_bus["Category"]=="Ireland"]

In [95]:
df_bus_ireland.describe()

Unnamed: 0,Year,Bus and Motor Coach Traffic,Total four wheeled Traffic
count,6.0,6.0,6.0
mean,2018.5,333.166667,43892.333333
std,1.870829,57.031278,4778.960375
min,2016.0,260.0,35435.0
25%,2017.25,285.5,42218.75
50%,2018.5,345.0,46279.0
75%,2019.75,379.75,46676.25
max,2021.0,392.0,47687.0


# Inferential Statistics on Ireland

In [96]:
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.api as sm

In [97]:
# We want to find a confidence interval for the mean of "Bus and Motor Coach Traffic" in Ireland
bus_traffic_data_dublin = df_bus_ireland['Bus and Motor Coach Traffic'].dropna()  # Remove missing values

# Calculate a 95% normal interval centred on the sample mean.
# Note: scale is the standard deviation of the observations (np.std, ddof=0), not the
# standard error of the mean, so the interval reflects the spread of the yearly values.
confidence_interval = stats.norm.interval(0.95, loc=np.mean(bus_traffic_data_dublin), scale=np.std(bus_traffic_data_dublin))

print("95% Confidence Interval for Bus and Motor Coach Traffic Mean:", confidence_interval)

95% Confidence Interval for Bus and Motor Coach Traffic Mean: (231.12663811756266, 435.2066952157707)


In [98]:
# Testing if the mean of "Bus and Motor Coach Traffic" is significantly different from 200
bus_traffic_data_dublin = df_bus_ireland['Bus and Motor Coach Traffic'].dropna()  # Remove missing values

t_stat, p_value = stats.ttest_1samp(bus_traffic_data_dublin, 200)
print("T-statistic:", t_stat)
print("P-value:", p_value)

T-statistic: 5.719499820142219
P-value: 0.0022847835958103276


**Interpretation of Confidence Interval:**

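A 95% confidence interval is a range constructed so that, over repeated sampling, about 95% of such intervals would contain the true population mean. The interval computed here, roughly 231 to 435, uses the standard deviation of the six yearly observations as its scale rather than the standard error of the mean. It is therefore wider than a confidence interval for the mean would be, and is better read as the range in which individual yearly values are expected to fall.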
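
For comparison, a confidence interval for the mean is usually built from the standard error (the sample standard deviation divided by the square root of the sample size). The sketch below is one way to do that, not the notebook's original method. It assumes the six Irish values shown in the `df_bus_ireland` table of this notebook, treats the years as independent and roughly normal observations, and uses Student's t distribution because the sample is small.

```python
import numpy as np
from scipy import stats

# Irish "Bus and Motor Coach Traffic" values for 2021..2016, copied from this notebook
ireland_bus = np.array([260.0, 271.0, 392.0, 386.0, 361.0, 329.0])

n = ireland_bus.size
sample_mean = ireland_bus.mean()
standard_error = stats.sem(ireland_bus)  # sample std (ddof=1) divided by sqrt(n)

# 95% interval for the mean using Student's t with n - 1 degrees of freedom
ci_low, ci_high = stats.t.interval(0.95, n - 1, loc=sample_mean, scale=standard_error)
print(f"mean = {sample_mean:.1f}, 95% CI = ({ci_low:.1f}, {ci_high:.1f})")
```

With only six observations, the t distribution gives a wider multiplier than the normal distribution, but the standard error is much smaller than the standard deviation, so the resulting interval is narrower than the one computed with `stats.norm.interval` and `np.std`.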

**Interpretation of p-value from t-Test:**

The p-value from the t-test is compared to a significance level (commonly 0.05) to determine if there is enough evidence to reject the null hypothesis.

In this case, the small p-value (0.00228) suggests that there is evidence to reject the null hypothesis that the mean of "Bus and Motor Coach Traffic" is equal to 200.


Let's try to do the same test for 300:

In [119]:
# Testing if the mean of "Bus and Motor Coach Traffic" is significantly different from 300
bus_traffic_data_dublin = df_bus_ireland['Bus and Motor Coach Traffic'].dropna()  # Remove missing values

t_stat, p_value = stats.ttest_1samp(bus_traffic_data_dublin, 300)
print("T-statistic:", t_stat)
print("P-value:", p_value)

T-statistic: 1.4245062130266608
P-value: 0.2136071555678542


**Result:**
- **T-statistic:** 1.42
- **P-value:** 0.21

**Interpretation:**
- The t-statistic of 1.42 indicates how many standard errors the sample mean for Ireland lies from the hypothesized population mean of 300.
- The p-value of 0.21 is the probability of observing a t-statistic at least this extreme if the null hypothesis (mean equals 300) is true.

**Conclusion:**
- With a p-value greater than the commonly used significance level of 0.05, we fail to reject the null hypothesis.
- There is not enough evidence to suggest that the mean of "Bus and Motor Coach Traffic" in Ireland is significantly different from 300 at a 5% significance level.

Failing to reject the null hypothesis does not show that the mean equals 300: with only six yearly observations, the test has little power to detect a moderate difference.
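
The one-sample t-statistic can also be computed by hand as the distance between the sample mean and the hypothesized mean, divided by the standard error. The sketch below assumes the six Irish values from the `df_bus_ireland` table in this notebook and compares the manual result with `stats.ttest_1samp`; the variable names are illustrative only.

```python
import numpy as np
from scipy import stats

# Irish "Bus and Motor Coach Traffic" values for 2021..2016, copied from this notebook
ireland_bus = np.array([260.0, 271.0, 392.0, 386.0, 361.0, 329.0])
mu_0 = 300.0  # hypothesized mean under H0

n = ireland_bus.size
standard_error = ireland_bus.std(ddof=1) / np.sqrt(n)
t_manual = (ireland_bus.mean() - mu_0) / standard_error

# Two-sided p-value from Student's t with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Reference values from SciPy for comparison
t_scipy, p_scipy = stats.ttest_1samp(ireland_bus, mu_0)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```

Changing `mu_0` to 200 repeats the earlier test against 200 with the same formula.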


# Inferential Statistics on Ireland and other countries

In [102]:
df_bus_ireland

Unnamed: 0,Year,Category,Bus and Motor Coach Traffic,Total four wheeled Traffic
10,2021,Ireland,260.0,40942.0
10,2020,Ireland,271.0,35435.0
10,2019,Ireland,392.0,46049.0
10,2018,Ireland,386.0,46509.0
11,2017,Ireland,361.0,47687.0
11,2016,Ireland,329.0,46732.0


In [99]:
n_dublin = df_bus_ireland.shape[0]
std_dublin = df_bus_ireland['Bus and Motor Coach Traffic'].std()
avg_dublin = df_bus_ireland['Bus and Motor Coach Traffic'].mean()

In [100]:
df_bus_switzerland = df_bus.loc[df_bus["Category"]=="Switzerland"]
n_switzerland = df_bus_switzerland.shape[0]
std_switzerland = df_bus_switzerland['Bus and Motor Coach Traffic'].std()
avg_switzerland = df_bus_switzerland['Bus and Motor Coach Traffic'].mean()

In [103]:
df_bus_switzerland

Unnamed: 0,Year,Category,Bus and Motor Coach Traffic,Total four wheeled Traffic
29,2021,Switzerland,138.0,64021.0
29,2020,Switzerland,131.0,60514.0
28,2019,Switzerland,142.0,66869.0
28,2018,Switzerland,139.0,66251.0
29,2017,Switzerland,136.0,65505.0
28,2016,Switzerland,134.0,64375.0

We perform a two-sample t-test on Ireland and Switzerland to check whether their mean bus and motor coach traffic is the same.

In [104]:
# H0: mu Ireland = mu Switzerland // H1: mu Ireland != mu Switzerland

t_test = stats.ttest_ind_from_stats(mean1 = avg_dublin, std1 = std_dublin, nobs1 = n_dublin, 
                                    mean2 = avg_switzerland, std2 = std_switzerland, nobs2 = n_switzerland, 
                                    equal_var = False)

In [105]:
t_test

Ttest_indResult(statistic=8.420182820205463, pvalue=0.00037033951688098595)

In [110]:
X1 = df_bus.loc[df_bus["Category"]=="Switzerland"]["Bus and Motor Coach Traffic"]
X2 = df_bus.loc[df_bus["Category"]=="Ireland"]["Bus and Motor Coach Traffic"]

t_test = stats.ttest_ind(X1, X2, equal_var = False)

In [111]:
t_test

Ttest_indResult(statistic=-8.420182820205463, pvalue=0.00037033951688098595)

**Welch's t-Test from Summary Statistics (`ttest_ind_from_stats`, `equal_var = False`):**

- **Null Hypothesis (H0):** The mean bus and motor coach traffic in Ireland is equal to the mean in Switzerland.
- **Alternative Hypothesis (H1):** The mean bus and motor coach traffic in Ireland is not equal to the mean in Switzerland.

**Result:**

- **T-statistic:** 8.42
- **P-value:** 0.00037

**Interpretation:**

- The t-statistic of 8.42 indicates how many standard errors separate the sample mean of Ireland from the sample mean of Switzerland.
- The p-value of 0.00037 is the probability of observing a t-statistic at least this extreme if the null hypothesis is true.

**Conclusion:**

- With a very low p-value (below commonly used significance levels like 0.05), we reject the null hypothesis.
- There is enough evidence to suggest that the mean "Bus and Motor Coach Traffic" in Ireland is significantly different from the mean in Switzerland.

**Welch's t-Test from the Raw Samples (`ttest_ind`, `equal_var = False`):**

- **Null Hypothesis (H0):** The mean bus and motor coach traffic in Ireland is equal to the mean in Switzerland.
- **Alternative Hypothesis (H1):** The mean bus and motor coach traffic in Ireland is not equal to the mean in Switzerland.

**Result:**

- **T-statistic:** -8.42 (negative because Switzerland is passed as the first sample in this call)
- **P-value:** 0.00037

**Interpretation:**

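- The result matches the previous test, as expected: both calls set `equal_var = False`, so both run Welch's t-test on the same data, once from summary statistics and once from the raw samples.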

**Conclusion:**

- The conclusion remains the same: there is enough evidence to suggest that the mean "Bus and Motor Coach Traffic" in Ireland is significantly different from the mean in Switzerland.

In summary, both calls (from summary statistics and from the raw samples) set `equal_var = False`, so neither pools the standard deviations. They give the same result and lead to the same conclusion: a significant difference in mean bus and motor coach traffic between Ireland and Switzerland.
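
To make explicit what `equal_var = False` computes, the sketch below works through an unpooled (Welch) t-test by hand: each sample contributes its own variance term, and the degrees of freedom come from the Welch-Satterthwaite approximation. It assumes the six values per country shown in the tables of this notebook; the function name `welch_t_test` is hypothetical.

```python
import numpy as np
from scipy import stats

# "Bus and Motor Coach Traffic" values for 2021..2016, copied from this notebook
ireland_bus = np.array([260.0, 271.0, 392.0, 386.0, 361.0, 329.0])
switzerland_bus = np.array([138.0, 131.0, 142.0, 139.0, 136.0, 134.0])


def welch_t_test(a, b):
    """Return (t, degrees of freedom, two-sided p) without pooling the variances."""
    se2_a = a.var(ddof=1) / a.size  # squared standard error of mean(a)
    se2_b = b.var(ddof=1) / b.size  # squared standard error of mean(b)
    t = (a.mean() - b.mean()) / np.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    dof = (se2_a + se2_b) ** 2 / (se2_a**2 / (a.size - 1) + se2_b**2 / (b.size - 1))
    p = 2 * stats.t.sf(abs(t), dof)
    return t, dof, p


t_welch, dof_welch, p_welch = welch_t_test(ireland_bus, switzerland_bus)

# Reference values from SciPy for comparison
t_scipy, p_scipy = stats.ttest_ind(ireland_bus, switzerland_bus, equal_var=False)
print(t_welch, dof_welch, p_welch)
print(t_scipy, p_scipy)
```

Because the Irish values vary far more from year to year than the Swiss ones, the unpooled form is the safer choice here; a pooled test (`equal_var = True`) would assume both countries share one variance.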


What happens if we repeat the calculation using bus traffic as a percentage of total four-wheeled traffic?

In [115]:
# Work on explicit copies so the new column is not assigned to a slice of df_bus
df_bus_ireland = df_bus_ireland.copy()
df_bus_switzerland = df_bus_switzerland.copy()
df_bus_ireland["Bus and Motor Coach Traffic %"] = df_bus_ireland["Bus and Motor Coach Traffic"]*100/df_bus_ireland["Total four wheeled Traffic"]
df_bus_switzerland["Bus and Motor Coach Traffic %"] = df_bus_switzerland["Bus and Motor Coach Traffic"]*100/df_bus_switzerland["Total four wheeled Traffic"]


In [114]:
df_bus_ireland

Unnamed: 0,Year,Category,Bus and Motor Coach Traffic,Total four wheeled Traffic,Bus and Motor Coach Traffic %
10,2021,Ireland,260.0,40942.0,0.635045
10,2020,Ireland,271.0,35435.0,0.764781
10,2019,Ireland,392.0,46049.0,0.851267
10,2018,Ireland,386.0,46509.0,0.829947
11,2017,Ireland,361.0,47687.0,0.75702
11,2016,Ireland,329.0,46732.0,0.704014


In [116]:
df_bus_switzerland

Unnamed: 0,Year,Category,Bus and Motor Coach Traffic,Total four wheeled Traffic,Bus and Motor Coach Traffic %
29,2021,Switzerland,138.0,64021.0,0.215554
29,2020,Switzerland,131.0,60514.0,0.216479
28,2019,Switzerland,142.0,66869.0,0.212356
28,2018,Switzerland,139.0,66251.0,0.209808
29,2017,Switzerland,136.0,65505.0,0.207618
28,2016,Switzerland,134.0,64375.0,0.208155


In [117]:
X1 = df_bus_ireland["Bus and Motor Coach Traffic %"]
X2 = df_bus_switzerland["Bus and Motor Coach Traffic %"]

t_test = stats.ttest_ind(X1, X2, equal_var = False)

In [118]:
t_test

Ttest_indResult(statistic=16.692779735745855, pvalue=1.3599795956174663e-05)

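With a p-value (about 1.4e-05) far below common significance levels like 0.05, we reject the null hypothesis that the two countries have the same mean percentage.

There is strong evidence that the mean percentage of "Bus and Motor Coach Traffic" differs between the two countries: over 2016 to 2021, buses and coaches average about 0.76% of four-wheeled traffic in Ireland, against about 0.21% in Switzerland.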