![](https://images.unsplash.com/photo-1701940616836-12baf70dc4f0?q=80&w=2028&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D)

# Dataset: urban mobility

We have a dataset about urban mobility that includes various features related to transportation modes and usage patterns in a city.

### Features:
1. **Distance (km)**: Numerical (continuous). Represents the distance of a trip.
2. **Traffic Density**: Categorical (Low, Medium, High). Captures the level of traffic in the region where the trip takes place.
3. **Time of Day**: Categorical (Morning, Afternoon, Evening, Night). Reflects the period in which the trip occurs.
4. **Weather Condition**: Categorical (Clear, Rainy, Foggy, Snowy). Indicates the weather at the time of the trip.
5. **Travel Duration (mins)**: Numerical (continuous). The total time taken for the trip.
6. **Avg Speed (km/h)**: Numerical (continuous). Average speed during the trip.
7. **Transport Mode**: Categorical (Walk, Bike, Bus, Train, Car, Motorcycle). This is the target variable, representing the mode of transportation.


In [None]:
import pandas as pd
import plotly.express as px

urban = pd.read_csv("https://cs.calvin.edu/courses/data/202/fsantos/datasets/urban_mobility.csv")
urban.head()

Unnamed: 0,Distance (km),Traffic Density,Time of Day,Weather Condition,Travel Duration (mins),Avg Speed (km/h),Transport Mode
0,0.93,Low,Evening,Rainy,8.9,6.3,Walk
1,1.77,Medium,Night,Snowy,31.5,3.4,Walk
2,9.15,Low,Afternoon,Rainy,16.9,32.5,Car
3,17.25,Medium,Morning,Clear,40.6,25.5,Car
4,5.8,High,Afternoon,Foggy,10.9,31.9,Bus


# Getting better visualizations

**📝 Exercise 1**: What is wrong with the plot below? Try to correct the mappings and visual cues so that the data is easier to read.
(There is no single right answer. You don't need to show everything; just choose a subset of interesting variables.)

In [None]:
import plotly.express as px

# Horrible Plotly Express plot using as many variables as possible
fig = px.scatter(
    urban,
    x='Distance (km)',
    y='Transport Mode',
    color='Travel Duration (mins)',
    size='Avg Speed (km/h)',
    facet_col='Traffic Density',
    facet_row='Weather Condition',
    symbol='Transport Mode',
    title="Urban Mobility",
    opacity=1.0,
    height=600,
    width=600
)
fig.show()

In [None]:
fig = px.scatter(
    urban,
    x='Distance (km)',
    y='Travel Duration (mins)',
    color='Transport Mode',
    facet_col='Traffic Density',
    title="Urban Mobility: Distance vs. Travel Duration",
    height=600,
    width=900
)
fig.show()

**📝 Exercise 2**: The following plot is trying to analyse `Travel Duration (mins)` and `Traffic Density`. Is this the best way to analyse these two variables? Clearly not. Try getting a better plot.
(No unique answer here).

In [None]:
fig_strange = px.pie(urban,
                   names='Traffic Density',
                   values='Travel Duration (mins)')
fig_strange.show()

In [None]:
fig_better = px.box(
    urban,
    x='Traffic Density',
    y='Travel Duration (mins)',
    color='Traffic Density',
    title="Travel Duration by Traffic Density",
    height=600,
    width=900
)
fig_better.show()
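
To see why the pie chart misleads: `px.pie` with `values=` adds up that column within each category, so a slice mixes how many trips a category has with how long those trips are. A minimal sketch of the effect, using made-up durations (the numbers are hypothetical, not taken from the dataset):

```python
from statistics import median

# Hypothetical trip durations (minutes) per traffic level; not from the dataset.
durations = {
    "Low": [10, 12, 11, 9, 10, 12],  # many short trips
    "High": [30, 28],                # few long trips
}

# What the pie slices represent: the total duration per category.
totals = {level: sum(values) for level, values in durations.items()}

# What a box plot is centered on: the typical trip in each category.
medians = {level: median(values) for level, values in durations.items()}

print(totals)   # "Low" has the larger total only because it has more trips
print(medians)  # "High" has the longer typical trip
```

The box plot answers the question we actually care about (how long a trip tends to take at each traffic level), which the totals hide.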

**📝 Exercise 3**: We want to check if there are some relationships between the columns. Make a **pair plot** (a matrix showing one-by-one scatter plots) relating the variables `Distance (km)`, `Travel Duration (mins)`, `Avg Speed (km/h)`. Also, map color to `Transport Mode`.

In [None]:
# Pair plot (scatter matrix)
fig_pairplot = px.scatter_matrix(
    urban,
    dimensions=['Distance (km)', 'Travel Duration (mins)', 'Avg Speed (km/h)'],
    color='Transport Mode',
    title="Pair Plot of Urban Mobility Variables",
    height=700,
    width=900
)
fig_pairplot.show()

**📝 Exercise 4**: Are there any patterns you can observe in the data? What are some relations between these variables? Explain.

**Answer:** In this chart, we can see some clear patterns between the different types of transportation:

1. Distance vs. Avg Speed:
   - Trains and cars cover longer distances at higher speeds.
   - Walking and biking tend to have shorter distances and lower speeds.

2. Distance vs. Travel Time:
   - The farther you go, the longer it takes.
   - Walking and biking take more time for longer distances compared to cars and trains, which move faster.

3. Travel Time vs. Avg Speed:
   - Trains have high speeds but don’t take a lot of time.
   - Buses and cars take more time but still move faster than walking or biking.

Each transport type forms its own group, with walking and biking moving slower and covering shorter distances, while cars, buses, and trains go faster and farther. The data helps to clearly distinguish between different modes of travel.
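
One relation can also be checked numerically. In the five rows printed by `urban.head()`, the average speed looks like distance divided by duration (in hours). The sketch below only checks those five rows; whether this holds for the whole dataset is an assumption worth verifying.

```python
# Rows copied from the urban.head() output: (distance km, duration mins, avg speed km/h).
head_rows = [
    (0.93, 8.9, 6.3),
    (1.77, 31.5, 3.4),
    (9.15, 16.9, 32.5),
    (17.25, 40.6, 25.5),
    (5.8, 10.9, 31.9),
]

# Recompute speed as distance / time-in-hours and compare with the recorded column.
recomputed = [round(dist / (mins / 60), 1) for dist, mins, _ in head_rows]
recorded = [speed for _, _, speed in head_rows]

for calc, rec in zip(recomputed, recorded):
    print(f"recomputed {calc:5.1f} km/h   recorded {rec:5.1f} km/h")
```

If the relation holds generally, the three numerical columns are not independent: knowing any two of them determines the third.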


**📝 Exercise 5**: Let's check the distribution of values for `Distance (km)` and `Avg Speed (km/h)`. Make two violin plots, for both variables, and show groups according to `Transport Mode`.

In [None]:
fig_distance_violin = px.violin(
    urban,
    x='Transport Mode',
    y='Distance (km)',
    color='Transport Mode',
    title="Distribution of Distance by Transport Mode",
    box=True,
    height=600,
    width=900
)
fig_distance_violin.show()

fig_speed_violin = px.violin(
    urban,
    x='Transport Mode',
    y='Avg Speed (km/h)',
    color='Transport Mode',
    title="Distribution of Avg Speed by Transport Mode",
    box=True,
    height=600,
    width=900
)
fig_speed_violin.show()

# Predicting Travel Duration

What if we wanted to use this data to predict `Travel Duration (mins)` given other variables? We are already going into machine learning models...

To estimate this numerical/continuous variable, we will perform a **Regression Analysis**.

We will import stuff from the [scikit-learn](https://scikit-learn.org/stable/) library.

We need to convert our categorical variables to numbers in order for it to work:

In [None]:
from sklearn.preprocessing import LabelEncoder

# Initialize the label encoder
label_encoder = LabelEncoder()

# Convert categorical columns to numerical using label encoding
X_regression_encoded = urban.copy()
X_regression_encoded['Transport Mode'] = label_encoder.fit_transform(urban['Transport Mode'])
X_regression_encoded['Traffic Density'] = label_encoder.fit_transform(urban['Traffic Density'])
X_regression_encoded['Time of Day'] = label_encoder.fit_transform(urban['Time of Day'])
X_regression_encoded['Weather Condition'] = label_encoder.fit_transform(urban['Weather Condition'])

X_regression_encoded.head()

Unnamed: 0,Distance (km),Traffic Density,Time of Day,Weather Condition,Travel Duration (mins),Avg Speed (km/h),Transport Mode
0,0.93,1,1,2,8.9,6.3,4
1,1.77,2,3,3,31.5,3.4,4
2,9.15,1,0,2,16.9,32.5,2
3,17.25,2,2,0,40.6,25.5,2
4,5.8,0,0,1,10.9,31.9,1
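
To make the encoded table easier to read, here is a rough sketch of what `LabelEncoder` does: it sorts the distinct labels and numbers them from 0. The helper `fit_label_mapping` is a hypothetical name written for illustration, not part of scikit-learn.

```python
def fit_label_mapping(values):
    """Map each distinct label to an integer, in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}

# Traffic Density values from the urban.head() output.
traffic = ["Low", "Medium", "Low", "Medium", "High"]

mapping = fit_label_mapping(traffic)
encoded = [mapping[value] for value in traffic]

print(mapping)  # {'High': 0, 'Low': 1, 'Medium': 2}
print(encoded)  # [1, 2, 1, 2, 0]
```

Note that the codes follow alphabetical order, not the natural order Low < Medium < High. A distance-based model treats these codes as ordinary numbers, so an explicit ordering or one-hot encoding may be worth trying as an alternative.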


We will split our columns into the data we are going to input, and the data we expect to get an answer for.

`Travel Duration (mins)` is called our **target**.
The other columns are called our **features**.

In [None]:
# Define features and target for regression
X_regression = X_regression_encoded.drop(columns=['Travel Duration (mins)'])
y_regression = urban['Travel Duration (mins)']

Finally, we are going to split these into two sets: one for training our model, the other for testing it.

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets for regression
X_train_regression, X_test_regression, y_train_regression, y_test_regression = train_test_split(X_regression, y_regression, test_size=0.3)

**📝 Exercise 6**: What does each dataframe here contain?

`X_train_regression`: This holds most of the information (about 70%) from the dataset, but without the travel time (Travel Duration). It includes things like distance, traffic density, and weather. This is used to help the computer learn how these factors are related to travel time.

`y_train_regression` : This holds the actual travel times (Travel Duration) for the same 70% of the data. The computer uses this to compare with the other information (distance, traffic, etc.) and learn patterns.

`X_test_regression`: This holds the same kind of information as X_train_regression (distance, traffic, etc.) but comes from the other 30% of the dataset that we saved for testing. The computer will make predictions based on this data once it's done learning.

`y_test_regression`: This holds the actual travel times for the remaining 30% of the data. It’s used to check how well the computer’s predictions match the real travel times.
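
The 70/30 split can be pictured as shuffling the row positions and setting aside 30% of them. This is a simplified sketch of the idea, not scikit-learn's actual implementation; `split_indices` is a hypothetical helper.

```python
import math
import random

def split_indices(n_rows, test_size=0.3, seed=None):
    """Shuffle row positions and reserve a test_size fraction of them for testing."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_size)
    return indices[n_test:], indices[:n_test]  # (train positions, test positions)

train_idx, test_idx = split_indices(10, test_size=0.3, seed=42)
print(len(train_idx), len(test_idx))  # 7 3
```

Because the shuffle is random, a different split (and slightly different error metrics) comes out on every run unless a seed is fixed. In scikit-learn, `train_test_split` accepts a `random_state` argument for this purpose.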

We now need to normalize our features using a "scaler". Observe the code below:

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_regression_scaled = scaler.fit_transform(X_train_regression)
X_test_regression_scaled = scaler.transform(X_test_regression)

**📝 Exercise 7**: Describe what happened with the values in each feature column.

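When we use a **scaler** like `StandardScaler`, it rescales each feature column (like distance or speed) independently so that, on the training set:

- The mean becomes 0.
- The standard deviation becomes 1, so most values end up within a few units of 0.

The values are not forced into a fixed range, and the shape of each distribution does not change. The test set is transformed with the mean and standard deviation learned from the training set, so its columns are only approximately centered.

This helps the model because all the features are now on a comparable scale, even though they had different units before (like kilometers vs. kilometers per hour).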
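
In formula terms, `StandardScaler` replaces each value x with z = (x - mean) / std, where the mean and standard deviation come from the training data. A minimal sketch by hand, using made-up distances (the training values are hypothetical):

```python
from statistics import fmean, pstdev

# Hypothetical training distances in km (not from the dataset).
train_distance = [1.0, 2.0, 4.0, 9.0]

# "fit": learn the mean and (population) standard deviation from the training data only.
mean = fmean(train_distance)
std = pstdev(train_distance)

# "transform": apply z = (x - mean) / std.
train_scaled = [(x - mean) / std for x in train_distance]

# A new trip is scaled with the *training* mean and std, not its own.
new_scaled = (7.27 - mean) / std

print([round(z, 2) for z in train_scaled])
print(round(new_scaled, 2))
```

This mirrors why the notebook calls `fit_transform` on the training set but only `transform` on the test set and on new inputs.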

At last, we are going to train our regressor. It is called k-NN Regressor, and the way it works will be explained next week. Observe:

In [None]:
from sklearn.neighbors import KNeighborsRegressor

# Train k-NN regressor
knn_regressor = KNeighborsRegressor(n_neighbors=5)
knn_regressor.fit(X_train_regression_scaled, y_train_regression)
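
As a short preview of the idea: with its default settings, a k-NN regressor predicts by averaging the targets of the k training points closest to the query (Euclidean distance). The sketch below is a simplified illustration with made-up data; `knn_predict` is a hypothetical helper, not scikit-learn's implementation.

```python
import math

def knn_predict(train_X, train_y, query, k=5):
    """Average the targets of the k training points closest to `query`."""
    by_distance = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_targets = [target for _, target in by_distance[:k]]
    return sum(nearest_targets) / len(nearest_targets)

# Hypothetical scaled features (two per trip) and durations in minutes.
train_X = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (1.1, 0.9), (5.0, 5.0)]
train_y = [10.0, 12.0, 20.0, 22.0, 60.0]

# The two closest trips have durations 20 and 22, so the prediction is their mean.
print(knn_predict(train_X, train_y, query=(1.0, 0.9), k=2))  # 21.0
```

Since the prediction depends entirely on distances between rows, this also shows why the features were scaled first.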

**📝 Exercise 8**: Now that we have our model, try to predict the Travel Time once we have this data:

Distance (km): 7.27\
Traffic Density: 1.00 (Low)\
Time of Day: 0.00 (Afternoon)\
Weather Condition: 1.00 (Foggy)\
Avg Speed (km/h): 29.90\
Transport Mode: 2.00 (Car)

Set it in the following way, and use the `scaler` we created to set the numbers to the right scale:

In [None]:
new_data = {
    "Distance (km)": [7.27],
    "Traffic Density": [1.00],
    "Time of Day": [0.00],
    "Weather Condition": [1.00],
    "Avg Speed (km/h)": [29.90],
    "Transport Mode": [2.00]
}
data_in = pd.DataFrame(new_data)
data_in_scaled = scaler.transform(data_in)

Use `knn_regressor.predict()` and pass `data_in_scaled` as the argument to get the prediction.
According to our data, this should be around 14.6 minutes. How far off is the prediction (the absolute error)?

In [None]:
predicted_duration = knn_regressor.predict(data_in_scaled)
actual_duration = 14.6
difference_error = abs(predicted_duration[0] - actual_duration)
predicted_duration[0], difference_error

(15.559999999999999, 0.9599999999999991)

**Answer:** The predicted travel time is 15.56 minutes, so the absolute error is 0.96 minutes.
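
The exercise does not say where the reference value of 14.6 minutes comes from. One reading that fits the numbers, offered here as an assumption, is that it is the duration implied by the given distance and average speed:

```python
distance_km = 7.27
avg_speed_kmh = 29.90

# Duration implied by distance and average speed, converted from hours to minutes.
implied_duration = distance_km / avg_speed_kmh * 60

predicted = 15.56  # value returned by knn_regressor.predict() in the run above
print(round(implied_duration, 1))                   # 14.6
print(round(abs(predicted - implied_duration), 2))  # 0.97
```

The small difference from the 0.96 reported above comes only from rounding the reference value to 14.6.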

**📝 Exercise 9**: Let's predict the whole test dataset and set it to `y_pred_regression`. If our model is working well, it should correspond closely to `y_test_regression` (which contains the "right answers").

How can we evaluate the errors? We can use metrics such as Mean Absolute Error (MAE) and Mean Squared Error (MSE).

Import these from scikit-learn: `from sklearn.metrics import mean_absolute_error, mean_squared_error`

Use them (you may want to check documentation). What are the results? Is this acceptable? (At least by looking at MAE... MSE applies a square and is harder to grasp.)

In [None]:
y_pred_regression = knn_regressor.predict(X_test_regression_scaled)

In [None]:
# Compare the predictions with the true durations using MAE and MSE
from sklearn.metrics import mean_absolute_error, mean_squared_error
mae = mean_absolute_error(y_test_regression, y_pred_regression)
mse = mean_squared_error(y_test_regression, y_pred_regression)
print("Mean Absolute Error(MAE): ", mae)
print("Mean Squared Error(MSE): ", mse)

Mean Absolute Error(MAE):  3.8080000000000003
Mean Squared Error(MSE):  25.583716923076924


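The **MAE** of about 3.81 means that, on average, our model’s predictions are about 3.8 minutes off from the actual travel time. This is a decent result for this kind of problem, but there’s room to make it better. The **MSE** (about 25.58) is measured in squared minutes, so it is not directly comparable to the MAE; squaring also makes bigger mistakes count more. Its square root (the RMSE) is about 5.06 minutes. Because `train_test_split` shuffles the rows randomly, these numbers change slightly from run to run.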
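
To make the two metrics concrete, here they are computed by hand on a few made-up values (the durations are hypothetical, not from the dataset). The functions `mae` and `mse` are written only for illustration; in the notebook we use the scikit-learn versions.

```python
import math

def mae(actual, predicted):
    """Mean of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical durations in minutes; the errors are 1, 1, 1 and 9.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [11.0, 19.0, 31.0, 49.0]

print(mae(actual, predicted))             # 3.0
print(mse(actual, predicted))             # 21.0
print(math.sqrt(mse(actual, predicted)))  # RMSE, back in minutes
```

The single 9-minute miss contributes 81 of the 84 squared-error units, which is how MSE penalizes large mistakes more heavily than MAE does.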

**📝 Reflection Exercise**: Write a sentence or two of your overall
reflections on this practice. You may write whatever you want, but you
might perhaps respond to one or two of these questions:

-   Was anything unclear about this assignment?
  - There wasn’t anything unclear about this assignment. Breaking the notebook into sections made it easier to follow.
-   How hard was it for you? Where did you get “stuck”?
  - Like previous assignments, it was a bit challenging, but it was a good opportunity to apply what we've learned in class.
-   How long did it take you?
  - It took me about 1 to 1.5 hours to complete.
-   What questions or uncertainties remain?
  - I don’t have any remaining questions or uncertainties.
-   What skills do you think you’ll need more practice with?
  - I think more practice in general would be helpful to further apply and strengthen the skills we've learned.
-   Did you try anything out of curiosity that you weren’t specifically
    asked to do?
      - I didn’t try anything outside the assignment instructions this time.

# Extra: publishing your model

If you want to publish this as a Streamlit app, install Streamlit with:

In [None]:
!pip install -q streamlit

Now write the complete code to a file called `app.py`:

In [None]:
%%writefile app.py

import streamlit as st
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import LabelEncoder

# Load the training data used to build the model
df = pd.read_csv("https://cs.calvin.edu/courses/data/202/fsantos/datasets/urban_mobility.csv")

# Label encode the categorical columns, keeping one encoder per column
# so that user inputs can be encoded the same way later
label_encoders = {
    "Traffic Density": LabelEncoder(),
    "Time of Day": LabelEncoder(),
    "Weather Condition": LabelEncoder(),
    "Transport Mode": LabelEncoder()
}
for col in label_encoders:
    df[col] = label_encoders[col].fit_transform(df[col])

# Define features and target variable
X = df.drop(columns=['Travel Duration (mins)'])
y = df['Travel Duration (mins)']

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Create and train the k-NN regressor model
knn_regressor = KNeighborsRegressor(n_neighbors=3)
knn_regressor.fit(X_scaled, y)

# Streamlit interface
st.title("Travel Duration Prediction")

st.write("Enter the values for the features below:")

# User inputs for the feature columns
distance = st.number_input("Distance (km)", min_value=0.0, max_value=50.0, step=0.1, value=7.27)

# Get label options from the LabelEncoder and use them for dropdown menus
traffic_density_input = st.selectbox("Traffic Density", options=label_encoders['Traffic Density'].classes_)
time_of_day_input = st.selectbox("Time of Day", options=label_encoders['Time of Day'].classes_)
weather_condition_input = st.selectbox("Weather Condition", options=label_encoders['Weather Condition'].classes_)
transport_mode_input = st.selectbox("Transport Mode", options=label_encoders['Transport Mode'].classes_)

# Avg Speed input
avg_speed = st.number_input("Avg Speed (km/h)", min_value=0.0, max_value=200.0, step=0.1, value=29.9)

# Encode the user inputs using the same encoders used during training
encoded_traffic_density = label_encoders['Traffic Density'].transform([traffic_density_input])[0]
encoded_time_of_day = label_encoders['Time of Day'].transform([time_of_day_input])[0]
encoded_weather_condition = label_encoders['Weather Condition'].transform([weather_condition_input])[0]
encoded_transport_mode = label_encoders['Transport Mode'].transform([transport_mode_input])[0]

# Create a DataFrame for the input values
input_data = pd.DataFrame({
    "Distance (km)": [distance],
    "Traffic Density": [encoded_traffic_density],
    "Time of Day": [encoded_time_of_day],
    "Weather Condition": [encoded_weather_condition],
    "Avg Speed (km/h)": [avg_speed],
    "Transport Mode": [encoded_transport_mode]
})

# Standardize the input data using the same scaler
input_data_scaled = scaler.transform(input_data)

# Make predictions using the trained model
predicted_duration = knn_regressor.predict(input_data_scaled)

# Display the prediction
st.write(f"### Predicted Travel Duration: {predicted_duration[0]:.2f} minutes")

Overwriting app.py


For publishing, install *localtunnel*:

In [None]:
!npm install localtunnel

[K[?25h
up to date, audited 23 packages in 1s

3 packages are looking for funding
  run `npm fund` for details

2 moderate severity vulnerabilities

To address all issues (including breaking changes), run:
  npm audit fix --force

Run `npm audit` for details.


Now, run a streamlit server.

You will get a URL. Open it and enter the password (the IP address printed below) on the landing page.

In [None]:
import urllib.request
print("Password/Endpoint IP for localtunnel is:", urllib.request.urlopen('https://ipv4.icanhazip.com').read().decode('utf8').strip("\n"))
!streamlit run /content/app.py &>/content/logs.txt & npx localtunnel --port 8501

Password/Endpoint IP for localtunnel is: 34.80.49.32
your url is: https://mean-states-hammer.loca.lt
