# Deep Learning Approach 1


---


**Mentor:**
  - ***Professor Richard Sowers***, Department of Industrial and Systems Engineering, University of Illinois at Urbana-Champaign (UIUC).

**Group Members:**
  - ***Advika Pattiwar*** (linkedin.com/in/advika-pattiwar)
  - ***Dhruv Borda*** (linkedin.com/thebordadhruv)
  - ***Hrithik Rathi*** (linkedin.com/in/hrithik-rathi)
  - ***Suvrata Gayathri Kappagantula*** (linkedin.com/in/gayathrikappagantula)


---

In this notebook, we implement a Graph Neural Network (GNN) using PyTorch and PyTorch Geometric to analyze bike sharing data. The approach encompasses data preprocessing, feature engineering, and the application of a GNN model for predictive analysis. The combination of weather, temporal, and spatial features within a graph structure allows for comprehensive analysis and prediction, providing valuable insights into bike sharing usage patterns.

### Data Preprocessing and Feature Engineering

1. **Filtering Rides**: Selecting trips where the start and end stations are different.
2. **Average Ride Time Calculation**: Computing the average ride time for each trip.
3. **Grouping and Aggregating Data**: Data is grouped by date, time, and end station characteristics. Aggregations include the number of rides and average ride time.
4. **Weather Data Integration**: Average daily weather features are merged with the bike sharing data.
5. **Categorical and Continuous Feature Processing**:
   - Categorical features like 'end_station_name' are encoded.
   - Continuous features (weather data, coordinates) are normalized.

6. **Cyclical Feature Encoding**: The 'weekday' feature is encoded using a cyclical method to preserve its nature.

### Graph Construction

- **Node Representation**: Each station is represented as a node in the graph.
- **Edge Construction**: Edges are created based on trips between stations.
- **Station Mapping**: A unique integer is assigned to each station for graph representation.

### GNN Model Definition

The model includes several key components:
1. **GCN Layers**: To capture neighborhood information.
2. **GAT Layer**: Introduces an attention mechanism in the graph structure.
3. **Dropout Layer**: For regularization and to prevent overfitting.
4. **Fully Connected Layer**: Final layer for output predictions.

### Training and Testing Procedure

- **Data Splitting**: The dataset is divided into training and testing sets.
- **Model Training**:
   - Loss function (Mean Absolute Error) and Adam optimizer are defined.
   - The model is trained for a predefined number of epochs, and training loss is monitored.
- **Model Testing**:
   - The model's performance is evaluated on the test set.
   - The Mean Absolute Error (MAE) is calculated to assess the predictive accuracy.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!pip install torch_geometric



In [None]:
import io
import requests

import pandas as pd
import numpy as np

import torch
import torch_geometric
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv, GATConv, global_mean_pool
import torch.nn.functional as F

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

### Access data from AWS s3 bucket

In [None]:
# Load the data from the URL
url = "https://s3-us-east-2.amazonaws.com/dhruvborda-project-nyccitibikerentals/Dataset/debugging.pkl"
response = requests.get(url)

if response.status_code == 200:
    debugging = pd.read_pickle(io.BytesIO(response.content))
    debugging_backup = debugging.copy()
    print("Data loaded successfully.")
else:
    print(f"Failed to download debugging.pkl. Status code: {response.status_code}")
    exit()  # Exit if data loading fails

Data loaded successfully.


### Feature Engineering and Data Preparation

In [None]:
debugging_filtered = debugging[debugging['start_station_name'] != debugging['end_station_name']]

# Calculate average ride time in minutes
debugging_filtered['avg_ride_time'] = (debugging_filtered['ended_at'] - debugging_filtered['started_at']).dt.total_seconds() / 60

new_debugging = debugging_filtered.groupby(['year', 'month', 'day', 'weekday', 'end_station_name','end_lat', 'end_lng']).agg(
    number_of_rides=('ride_id', 'size'),
    avg_ride_time=('avg_ride_time', 'mean')
).reset_index()

# Select the weather-related features and calculate the daily average
weather_features = ['AWND', 'PRCP', 'SNOW', 'SNWD', 'TMAX', 'TMIN', 'WDF2', 'WDF5', 'WSF2', 'WSF5', 'WT01', 'WT02', 'WT03', 'WT08']
weather_avg = debugging.groupby(['year', 'month', 'day'])[weather_features].mean().reset_index()

# Merge the grouped debugging with the weather averages
new_debugging = pd.merge(new_debugging, weather_avg, on=['year', 'month', 'day'], how='left')
new_debugging.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  debugging_filtered['avg_ride_time'] = (debugging_filtered['ended_at'] - debugging_filtered['started_at']).dt.total_seconds() / 60


Unnamed: 0,year,month,day,weekday,end_station_name,end_lat,end_lng,number_of_rides,avg_ride_time,AWND,...,TMAX,TMIN,WDF2,WDF5,WSF2,WSF5,WT01,WT02,WT03,WT08
0,2023,9,1,4,1 Ave & E 30 St,40.741444,-73.975361,1,12.7,4.47,...,76.0,61.0,60.0,20.0,13.0,18.1,0.0,0.0,0.0,0.0
1,2023,9,1,4,1 Ave & E 44 St,40.75002,-73.969053,2,11.591667,4.47,...,76.0,61.0,60.0,20.0,13.0,18.1,0.0,0.0,0.0,0.0
2,2023,9,1,4,1 Ave & E 62 St,40.761227,-73.96094,2,10.05,4.47,...,76.0,61.0,60.0,20.0,13.0,18.1,0.0,0.0,0.0,0.0
3,2023,9,1,4,1 Ave & E 68 St,40.765005,-73.958185,1,11.366667,4.47,...,76.0,61.0,60.0,20.0,13.0,18.1,0.0,0.0,0.0,0.0
4,2023,9,1,4,1 Ave & E 94 St,40.781721,-73.94594,1,25.583333,4.47,...,76.0,61.0,60.0,20.0,13.0,18.1,0.0,0.0,0.0,0.0


In [None]:
# Encode categorical variables: 'end_station_name' types
categorical_columns = ['end_station_name']
end_station_name_encoder = LabelEncoder()

# Fit and transform each categorical column separately
new_debugging['end_station_name_encoded'] = np.array(end_station_name_encoder.fit_transform(new_debugging['end_station_name']).tolist())
end_station_name_encoded_2d = new_debugging['end_station_name_encoded'].values.reshape(-1, 1)

# Normalize continuous variables: latitude, longitude, temperatures, wind speeds
continuous_columns = [  #'avg_ride_time',
                      'AWND', 'PRCP', 'SNOW', 'SNWD',
       'TMAX', 'TMIN', 'WDF2', 'WDF5', 'WSF2', 'WSF5', 'WT01', 'WT02', 'WT03',
       'WT08']
scaler = StandardScaler()
normalized_continuous = scaler.fit_transform(new_debugging[continuous_columns])

# Cyclical Encoding Functions
def encode_cyclical_feature(value, max_value):
    value_scaled = value / max_value
    return np.sin(value_scaled * 2 * np.pi), np.cos(value_scaled * 2 * np.pi)

encoded_weekday = np.array(new_debugging['weekday'].apply(encode_cyclical_feature, max_value=6).tolist())

In [None]:
# Convert 'new_debugging[['year', 'month', 'day']]' to a numpy array
date_features = new_debugging[['year', 'month', 'day']].to_numpy()

# Combine all features
node_features = np.concatenate([
    end_station_name_encoded_2d,
    normalized_continuous,
    new_debugging[['number_of_rides','end_lat', 'end_lng']].to_numpy(),

    encoded_weekday,
    date_features
], axis=1)

# Convert features to PyTorch tensor
node_features = torch.tensor(node_features, dtype=torch.float)

# Function to create a mapping from station names to unique integers
def create_station_mapping(dataframe):
    stations = pd.concat([dataframe['start_station_name'], dataframe['end_station_name']]).unique()
    return {station: i for i, station in enumerate(stations)}, stations

# Create a mapping from station names to unique integers
station_mapping, stations = create_station_mapping(debugging)

# Function to create edges for the graph (trips between stations)
def create_edges(dataframe, station_mapping):
    return dataframe.apply(lambda row: (station_mapping[row['start_station_name']],
                                        station_mapping[row['end_station_name']]), axis=1)

# Create edges for the graph
edges = create_edges(debugging, station_mapping)
edge_index = torch.tensor(list(edges), dtype=torch.long).t().contiguous()

# Labels: Encode end station names
labels = torch.tensor(new_debugging['number_of_rides'].values, dtype=torch.float)

# Create graph data
graph_data = Data(x=node_features, edge_index=edge_index, y=labels)

### Graph Neural Network (GNN) Model

In [None]:
class GNN(torch.nn.Module):
    def __init__(self, num_node_features):
        super(GNN, self).__init__()
        # First GCN layer
        self.conv1 = GCNConv(num_node_features, 32)
        # Second GCN layer
        self.conv2 = GCNConv(32, 64)
        # Graph Attention Network (GAT) layer
        self.gat_conv = GATConv(64, 64)
        # Dropout layer
        self.dropout = torch.nn.Dropout(p=0.5)
        # Fully connected layer
        self.fc = torch.nn.Linear(64, 1)

    def forward(self, x, edge_index):
        # Apply first GCN layer with ReLU activation
        x = F.relu(self.conv1(x, edge_index))
        # Apply dropout
        x = self.dropout(x)
        # Apply second GCN layer with ReLU activation
        x = F.relu(self.conv2(x, edge_index))
        # Apply GAT layer
        x = self.gat_conv(x, edge_index)
        # Final fully connected layer
        x = self.fc(x)
        return x

In [None]:
model = GNN(graph_data.num_node_features)

# Define loss function and optimizer
criterion = torch.nn.L1Loss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

def train_test_model(model, data, epochs=100, test_size=0.2):
    # Split the data into training and testing sets
    train_indices, test_indices = train_test_split(range(data.num_nodes), test_size=test_size, random_state=42)

    train_indices_tensor = torch.tensor(train_indices)
    test_indices_tensor = torch.tensor(test_indices)

    def adjust_edge_index(edge_index, node_indices):
    # Create a mapping from old indices to new indices
      idx_mapping = {old_idx: new_idx for new_idx, old_idx in enumerate(node_indices.tolist())}

      # Adjust edge indices based on the mapping
      adjusted_edge_index = edge_index.clone()
      for i in range(edge_index.size(1)):
          adjusted_edge_index[0, i] = idx_mapping[edge_index[0, i].item()]
          adjusted_edge_index[1, i] = idx_mapping[edge_index[1, i].item()]

      return adjusted_edge_index

    # Filter edge_index for training and testing subgraphs
    def filter_edges(edge_index, node_indices):
        mask = torch.isin(edge_index[0], node_indices) & torch.isin(edge_index[1], node_indices)
        return edge_index[:, mask]

    train_edge_index = filter_edges(data.edge_index, train_indices_tensor)
    test_edge_index = filter_edges(data.edge_index, test_indices_tensor)

    # Training loop
    for epoch in range(epochs):
        model.train()
        optimizer.zero_grad()
        # Use train_edge_index for training
        out = model(data.x[train_indices], train_edge_index)
        loss = criterion(out, data.y[train_indices])
        loss.backward()
        optimizer.step()

        if epoch % 10 == 0:
            print(f"Epoch {epoch}, Loss: {loss.item()}")

    # Test the model
    model.eval()
    with torch.no_grad():
        # Use test_edge_index for testing
        filtered_test_edge_index = filter_edges(data.edge_index, test_indices_tensor)
        adjusted_test_edge_index = adjust_edge_index(filtered_test_edge_index, test_indices_tensor)
        preds = model(data.x[test_indices], adjusted_test_edge_index).squeeze()
        preds = preds.relu()  # Apply ReLU again to ensure non-negative outputs
        rounded_preds = preds.round()  # Round to the nearest integer
        rounded_preds = rounded_preds.clamp(min=0)
        test_mae = torch.sqrt(criterion(preds, data.y[test_indices].float().view_as(preds)))
        print(f"Test MAE: {test_mae.item()}")

# Run the training and testing
train_test_model(model, graph_data)

  return F.l1_loss(input, target, reduction=self.reduction)


Epoch 0, Loss: 183.1399383544922
Epoch 10, Loss: 16.821413040161133
Epoch 20, Loss: 8.500924110412598
Epoch 30, Loss: 3.543086528778076
Epoch 40, Loss: 1.4592653512954712
Epoch 50, Loss: 0.789175271987915
Epoch 60, Loss: 0.6083827614784241
Epoch 70, Loss: 0.5541871786117554
Epoch 80, Loss: 0.49966609477996826
Epoch 90, Loss: 0.44967135787010193
Test MAE: 0.7657119035720825
