## Milestone 1: Problem Definition, Dataset Selection, and Data Exploration

LIS 640 - Introduction to Applied Deep Learning

Due 2/14/25

## **Overview**
In this milestone, you will:
1. **Define a deep learning problem** where AI can make a meaningful impact.
2. **Identify three datasets** that fit your topic and justify their relevance.
3. **Explore and visualize** the datasets to understand their structure.
4. **Implement a PyTorch Dataset class** to prepare data for deep learning.

This notebook provides an example of **fuel-efficient car usage** to illustrate what is expected.


## **Step 1: Define Your Deep Learning Problem**
Write a paragraph explaining:
- **Why your chosen topic is important.**
- **How deep learning can help solve the problem.**

### **Example Problem Statement: Fuel-Efficient Car Usage**
*Fuel efficiency is a major factor in reducing carbon emissions and lowering fuel costs. Drivers often adopt inefficient driving patterns, wasting fuel through unnecessary acceleration, braking, or idling. A deep learning model could analyze driving behavior and suggest optimizations in real-time, helping individuals improve their fuel economy.*

➡ **Write your problem statement below:**  

### **Lane Line Detection**
Lane line detection is a crucial aspect of most autonomous driving systems. It involves identifying and tracking the lane markings on the road to ensure the vehicle stays within its designated lane. Accurate lane detection is essential for maintaining safe driving conditions, enabling features like lane-keeping assistance, adaptive cruise control, and autonomous navigation. However, challenges such as varying lighting conditions, occlusions (e.g., by other vehicles or debris), and poorly marked or faded lane lines can make this task complex. A deep learning model trained on annotated road images can be used to detect lane lines in real-time, providing the vehicle with the necessary information to make informed decisions and navigate safely. By improving the robustness and accuracy of lane detection systems, we can enhance the safety and reliability of autonomous vehicles, ultimately contributing to safer roads and more efficient transportation systems.

## **Step 2: Identify and Justify Three Relevant Datasets**
Find three datasets that provide useful information for solving your problem.  
For each dataset, include:
1. A **short description** of what it contains.
2. A **link to the dataset**.
3. **Why this dataset is useful for your problem.**

### **Example Datasets for Fuel Efficiency**

- **Dataset 1: Vehicle Trajectory Data (NGSIM US 101 Dataset)**
	- Description: This dataset contains detailed vehicle trajectory data collected on a segment of the U.S. Highway 101 in Los Angeles, California. It includes precise location information of each vehicle within the study area every one-tenth of a second, capturing detailed lane-changing and car-following behaviors.
	- Source: [U.S. Department of Transportation - NGSIM Program](https://data.transportation.gov/stories/s/Next-Generation-Simulation-NGSIM-Open-Data/i5zb-xe34/)
	- Justification: Analyzing this data can help identify driving patterns that affect fuel efficiency, such as frequent lane changes or abrupt braking.

- **Dataset 2: Climate & Air Quality Data**  
  - Description: Contains CO2 emissions and climate-related metrics across different regions.
  - Source: [U.S. Historical Climatology Network](https://www.ncei.noaa.gov/products/land-based-station/us-historical-climatology-network)
  - Justification: Can correlate driving behavior with environmental impact. Provides environmental context to fuel consumption.

- **Dataset 3: Automobile Dataset (UCI Machine Learning Repository)**
  - Description: This dataset includes various characteristics of automobiles, such as engine size, horsepower, weight, and fuel consumption. It also provides information on insurance risk ratings and normalized losses in use as compared to other cars.
  - Source: [UCI Machine Learning Repository - Automobile Dataset](https://archive.ics.uci.edu/dataset/10/automobile)
  - Justification: The dataset’s detailed vehicle specifications and performance metrics can be used to analyze how different factors influence fuel efficiency, aiding in the development of predictive models.

➡ **Find and document three datasets for your problem below:**

- **Dataset 1: Lane Line Detection Dataset**
	- Description: This dataset contains 100 images of German roads, each with annotated lane lines. The images are diverse and include curved roads.
	- Source: [New-Lane detection Computer Vision](https://universe.roboflow.com/maanasa-prasad/new-lane-detection)
	- Justification: Provides lane line data from different environments and includes curved lane lines.

- **Dataset 2: Indian Roads Dataset**
	- Description: This dataset contains over 650 labeled lane images of various road environments, such as curves, traffic, and more. It is collected from real scenarios across multiple cities in India, and includes images with lane lines that are manually annotated with polygons.
	- Source: Uploaded on Roboflow
	- Justification: Provides lane line data from a different driving setting, giving us a greater variety of data (e.g., left- vs. right-hand drive, different road markings).

- **Dataset 3: US Roads Dataset**
	- Description: This dataset contains over 100 labeled lane images from highway driving in the United States in clear conditions. It includes images with the relevant lanes and lane lines annotated with polygons.
	- Source: Uploaded on Roboflow
	- Justification: Provides lane line data from a more local and potentially more relevant setting where lane line detection might be more necessary – ADAS/hands-off cruise control and steering for long highway drives.



### Additional datasets we found but have not yet analyzed because of their size; we may explore them later
- **Dataset 1: CurveLanes Dataset from Kaggle**
	- Description: This dataset has 150k lane images of difficult scenarios such as curves and multi-lanes in traffic. It is collected from real urban and highway scenarios in multiple cities in China. The dataset includes images with lane lines which are manually annotated with natural cubic splines. The labels include two key x, y coordinates of the lane marking.
	- Source: [Kaggle CurveLanes Dataset](https://www.kaggle.com/datasets/bnyadmohammed/curvelanes/data), uploaded from the [GitHub CurveLanes Dataset](https://github.com/SoulmateB/CurveLanes)
	- Justification: The dataset includes lane lines that are harder to detect, across a wider variety of complex scenarios.

- **Dataset 2: Waymo Open Dataset - Motion Dataset**
	- Description: This dataset contains lane line data that Waymo used internally for training and has since open-sourced. It includes lane connections, lane boundaries, and lane neighbors. Its labels provide multiple x, y coordinates along each lane line.
	- Source: [Waymo Open Dataset](https://github.com/waymo-research/waymo-open-dataset?tab=readme-ov-file)
	- Justification: In addition to detailed lane line data, this dataset includes features such as lane neighbors and lane connections.

- **Dataset 3: TuSimple Lane Line Dataset**
	- Description: This dataset consists of 6,408 images of US highways with annotated lane lines, captured in different weather conditions.
	- Source: [TuSimple Dataset on Kaggle](https://www.kaggle.com/datasets/manideep1108/tusimple)
	- Justification: This dataset emphasizes variation in weather conditions, which the other datasets do not mention. Training on it should help our model generalize better.


## **Step 3: Explore and Visualize Your Data**
Understanding the structure of your dataset is crucial. Perform the following tasks:
1. **Summarize dataset statistics:**
   - Number of samples
   - Number of features
   - Data types (numerical, categorical, text, etc.)
   - Ranges of values (min/max)
   - Missing values

2. **Create visualizations:**
   - Histograms: Show feature distributions.
   - Scatter plots: Explore relationships between key variables.
   - (Optional) PCA: Visualize high-dimensional data in 2D.

### **Example Exploration Code**
Modify this code to work with your dataset.


In [19]:
# import pandas as pd
# import matplotlib.pyplot as plt
# import seaborn as sns

# # Load dataset (modify with your own file)
# df = pd.read_csv("your_dataset.csv")

# # Display basic information
# print("Dataset Overview:")
# print(df.info())

# # Show summary statistics
# print("Summary Statistics:")
# print(df.describe())


In [20]:
# # Histogram of numerical features
# df.hist(figsize=(12, 8))
# plt.show()


In [21]:
# # Example scatter plot of two features (modify column names as needed)
# sns.scatterplot(x=df['feature1'], y=df['feature2'])
# plt.title("Feature1 vs Feature2")
# plt.show()


In [22]:
# from sklearn.decomposition import PCA
# import numpy as np

# # Apply PCA for dimensionality reduction (modify as needed)
# pca = PCA(n_components=2)
# X_pca = pca.fit_transform(df.select_dtypes(include=[np.number]))

# # Plot PCA results
# plt.scatter(X_pca[:, 0], X_pca[:, 1])
# plt.title("PCA Projection of Dataset")
# plt.show()


For each figure that you create, add an explanation of why this is a useful figure. What does it tell about your data? Which figures do you find most interesting and why?
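
For an image dataset with polygon annotations, the summary statistics listed in Step 3 translate into counts of label files, polygons per image, and points per polygon. The following is a minimal sketch, assuming a YOLO-style export in which each line of a `.txt` label file reads `class_id x1 y1 x2 y2 ...`. The function name `summarize_labels` is hypothetical and not part of any library.

```python
from collections import Counter
from pathlib import Path
from statistics import mean


def summarize_labels(label_dir):
    """Summarize YOLO-style polygon label files in a directory (assumed format)."""
    polygons_per_image = []
    points_per_polygon = []
    class_counts = Counter()
    empty_files = 0

    for label_path in sorted(Path(label_dir).glob("*.txt")):
        # One non-empty line per annotated polygon
        rows = [line.split() for line in label_path.read_text().splitlines() if line.strip()]
        if not rows:
            empty_files += 1  # label file with no annotated lanes
        polygons_per_image.append(len(rows))
        for row in rows:
            class_counts[int(row[0])] += 1
            points_per_polygon.append((len(row) - 1) // 2)  # number of (x, y) pairs

    return {
        "num_label_files": len(polygons_per_image),
        "empty_label_files": empty_files,
        "class_counts": dict(class_counts),
        "mean_polygons_per_image": mean(polygons_per_image) if polygons_per_image else 0.0,
        "min_points": min(points_per_polygon, default=0),
        "max_points": max(points_per_polygon, default=0),
    }
```

Calling `summarize_labels("New-Lane_detection.v1i.yolov11/train/labels")` would give a first overview of how many lanes each image contains and how detailed the polygons are. A histogram of the per-image counts would be a natural follow-up figure.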

## **Step 4: Implement a PyTorch Dataset Class**
Follow [this tutorial](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) to prepare data for deep learning by creating a PyTorch Dataset class that:
- Loads data from a CSV or another source.
- Applies preprocessing (e.g., normalization, missing value handling).
- Returns samples in a PyTorch-compatible format.

### **Example PyTorch Dataset Implementation**
Modify this template for your dataset.


In [23]:
# import torch
# from torch.utils.data import Dataset

# class CustomDataset(Dataset):
#     def __init__(self, csv_file):
#         self.data = pd.read_csv(csv_file)
#         self.features = self.data[['feature1', 'feature2']].values  # Modify features
#         self.labels = self.data['target'].values  # Modify target

#     def __len__(self):
#         return len(self.data)

#     def __getitem__(self, idx):
#         x = torch.tensor(self.features[idx], dtype=torch.float32)
#         y = torch.tensor(self.labels[idx], dtype=torch.float32)
#         return x, y

# # Example usage
# dataset = CustomDataset('your_dataset.csv')
# print(len(dataset))


In [24]:
import os
import torch
from torch.utils.data import Dataset
from PIL import Image
import numpy as np

class LaneDataset(Dataset):
    def __init__(self, image_dir, label_dir, transform=None):
        """
        Args:
            image_dir (str): Path to the directory with images.
            label_dir (str): Path to the directory with lane line annotations.
            transform (callable, optional): Optional transform to be applied on a sample.
        """
        self.image_dir = image_dir
        self.label_dir = label_dir
        self.transform = transform
        self.image_files = sorted(os.listdir(image_dir))
        self.label_files = sorted(os.listdir(label_dir))

    def __len__(self):
        return len(self.image_files)

    def __getitem__(self, idx):
        # Load image
        img_path = os.path.join(self.image_dir, self.image_files[idx])
        image = Image.open(img_path).convert("RGB")  # Ensure image is in RGB format

        # Load label (from the corresponding .txt file)
        label_path = os.path.join(self.label_dir, self.label_files[idx])
        with open(label_path, "r") as f:
            lines = f.readlines()
            # Parse lane line annotations
            lanes = []
            for line in lines:
                # Split the line into components
                parts = list(map(float, line.strip().split()))
                class_id = int(parts[0])  # Class ID (e.g., 0 for lane)
                points = parts[1:]  # Lane points (x1, y1, x2, y2, ...)
                lanes.append((class_id, points))

        # Keep lanes as a list: polygons can have different numbers of points,
        # so the annotations are ragged and cannot form a regular NumPy array

        # Apply transformations (if any)
        if self.transform:
            image = self.transform(image)

        # Return image and label
        return image, lanes

In [25]:
# %pip install torchvision

from torchvision import transforms

# Example transformations
transform = transforms.Compose([
    transforms.Resize((360, 640)),  # Resize to a fixed size
    transforms.ToTensor(),           # Convert to tensor
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])  # Normalize
])
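
Note that `transforms.Resize` changes only the image, not the labels. If the polygon coordinates are normalized to [0, 1], as is usual for YOLO-format exports, they remain valid after resizing but must be scaled before drawing them or building a mask. The following is a minimal sketch, assuming a flat `[x1, y1, x2, y2, ...]` list normalized by image width and height. `to_pixel_coords` is a hypothetical helper.

```python
def to_pixel_coords(points, width, height):
    """Scale a flat [x1, y1, x2, y2, ...] list of normalized coordinates to pixels."""
    if len(points) % 2 != 0:
        raise ValueError("expected an even number of values (x, y pairs)")
    xs = points[0::2]
    ys = points[1::2]
    return [(x * width, y * height) for x, y in zip(xs, ys)]


# Matches transforms.Resize((360, 640)): height 360, width 640
polygon = to_pixel_coords([0.25, 1.0, 0.5, 0.5], width=640, height=360)
print(polygon)  # [(160.0, 360.0), (320.0, 180.0)]
```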



In [26]:
from torch.utils.data import DataLoader

# Initialize dataset
train_dataset = LaneDataset(
    image_dir="New-Lane_detection.v1i.yolov11/train/images",
    label_dir="New-Lane_detection.v1i.yolov11/train/labels",
    transform=transform
)
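
`LaneDataset` pairs images and labels by their position in two sorted directory listings, so an extra or missing file in either directory would silently misalign every later sample. The following is a minimal sketch of a check by file stem, assuming each image `name.jpg` has a label file `name.txt`. `find_unpaired` is a hypothetical helper.

```python
from pathlib import Path


def find_unpaired(image_dir, label_dir):
    """Return (images without labels, labels without images), compared by file stem."""
    image_stems = {p.stem for p in Path(image_dir).iterdir() if p.is_file()}
    label_stems = {p.stem for p in Path(label_dir).glob("*.txt")}
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)
```

Both returned lists should be empty before training. If they are not, one option is to build the file list from the image stems and derive each label path from it instead of sorting the two directories independently.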


In [28]:
import os
import torch
from torch.utils.data import Dataset, DataLoader
from PIL import Image
import numpy as np
from torchvision import transforms

class LaneDataset(Dataset):
    def __init__(self, image_dir, label_dir, transform=None):
        """
        Args:
            image_dir (str): Path to the directory with images.
            label_dir (str): Path to the directory with lane line annotations.
            transform (callable, optional): Optional transform to be applied on a sample.
        """
        self.image_dir = image_dir
        self.label_dir = label_dir
        self.transform = transform
        self.image_files = sorted(os.listdir(image_dir))
        self.label_files = sorted(os.listdir(label_dir))

    def __len__(self):
        return len(self.image_files)

    def __getitem__(self, idx):
        # Load image
        img_path = os.path.join(self.image_dir, self.image_files[idx])
        image = Image.open(img_path).convert("RGB")  # Ensure image is in RGB format

        # Load label (from the corresponding .txt file)
        label_path = os.path.join(self.label_dir, self.label_files[idx])
        with open(label_path, "r") as f:
            lines = f.readlines()
            # Parse lane line annotations
            lanes = []
            for line in lines:
                # Split the line into components
                parts = list(map(float, line.strip().split()))
                class_id = int(parts[0])  # Class ID (e.g., 0 for lane)
                points = parts[1:]  # Lane points (x1, y1, x2, y2, ...)
                lanes.append((class_id, points))

        # Keep lanes as a list of (class_id, points) tuples. np.array(lanes) is avoided
        # because lanes can have different numbers of points, which makes the list ragged

        # Apply transformations (if any)
        if self.transform:
            image = self.transform(image)

        # Return image and label
        return image, lanes

# Define transformations
transform = transforms.Compose([
    transforms.Resize((360, 640)),  # Resize to a fixed size
    transforms.ToTensor(),           # Convert to tensor
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])  # Normalize
])

# Initialize dataset
train_dataset = LaneDataset(
    image_dir="New-Lane_detection.v1i.yolov11/train/images",
    label_dir="New-Lane_detection.v1i.yolov11/train/labels",
    transform=transform
)

# Create DataLoader
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True, num_workers=4)

# Example usage
image, label = train_dataset[0]
print("Image shape:", image.shape)
print("Label:", label)

Image shape: torch.Size([3, 360, 640])
Label: [(0, [0.355285625, 1.0, 0.4615511298076923, 0.5084549158653846, 0.45618570192307695, 0.5016216923076923, 0.3470609206730769, 1.0, 0.355285625, 1.0])]
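
Indexing `train_dataset[0]` works, but iterating over `train_dataloader` may fail: the default collate function tries to combine the labels of a batch, and the lanes can differ in number and length from image to image. The following is a minimal sketch of a custom `collate_fn` that stacks the images and keeps the labels as a plain Python list. `lane_collate` is a hypothetical name, and the stand-in samples only mimic the structure that `LaneDataset` returns.

```python
import torch


def lane_collate(batch):
    """Stack images into one tensor; keep per-image lane annotations as a list."""
    images, lanes = zip(*batch)
    return torch.stack(images, dim=0), list(lanes)


# Stand-in samples with the same structure LaneDataset returns (assumed shapes)
batch = [
    (torch.zeros(3, 360, 640), [(0, [0.1, 0.2, 0.3, 0.4])]),
    (torch.zeros(3, 360, 640), [(0, [0.1, 0.2]), (0, [0.5, 0.6, 0.7, 0.8, 0.9, 1.0])]),
]
images, lanes = lane_collate(batch)
print(images.shape)  # torch.Size([2, 3, 360, 640])
print(len(lanes))    # 2
```

The function can be passed to the loader as `DataLoader(train_dataset, batch_size=8, shuffle=True, collate_fn=lane_collate)`. Padding the point lists to a fixed length is an alternative if the model expects tensor labels.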


## **Final Submission**
Upload your submission for Milestone 1 to Canvas. 
Submit this notebook with:

✅ A **clear problem statement**.  
✅ Three **documented datasets** with justification.  
✅ **Exploratory analysis** with summary statistics & visualizations.  
✅ A **PyTorch Dataset class** for preparing data.  

📌 Use the provided example to guide your work. Happy Deep Learning! 🚀