## Milestone 1: Problem Definition, Dataset Selection, and Data Exploration

LIS 640 - Introduction to Applied Deep Learning

Due 2/14/25

## **Overview**
In this milestone, you will:
1. **Define a deep learning problem** where AI can make a meaningful impact.
2. **Identify three datasets** that fit your topic and justify their relevance.
3. **Explore and visualize** the datasets to understand their structure.
4. **Implement a PyTorch Dataset class** to prepare data for deep learning.

This notebook uses **fuel-efficient car usage** as a running example to illustrate what is expected.


## **Step 1: Define Your Deep Learning Problem**
Write a paragraph explaining:
- **Why your chosen topic is important.**
- **How deep learning can help solve the problem.**

### **Example Problem Statement: Fuel-Efficient Car Usage**
*Fuel efficiency is a major factor in reducing carbon emissions and lowering fuel costs. Drivers often adopt inefficient driving patterns, wasting fuel through unnecessary acceleration, braking, or idling. A deep learning model could analyze driving behavior and suggest optimizations in real time, helping individuals improve their fuel economy.*

➡ **Write your problem statement below:**  


## **Step 2: Identify and Justify Three Relevant Datasets**
Find three datasets that provide useful information for solving your problem.  
For each dataset, include:
1. A **short description** of what it contains.
2. A **link to the dataset**.
3. **Why this dataset is useful for your problem.**

### **Example Datasets for Fuel Efficiency**

- **Dataset 1: Vehicle Trajectory Data (NGSIM US 101 Dataset)**
  - Description: This dataset contains detailed vehicle trajectory data collected on a segment of U.S. Highway 101 in Los Angeles, California. It records the precise location of each vehicle within the study area every one-tenth of a second, capturing detailed lane-changing and car-following behaviors.
  - Source: [U.S. Department of Transportation - NGSIM Program](https://data.transportation.gov/stories/s/Next-Generation-Simulation-NGSIM-Open-Data/i5zb-xe34/)
  - Justification: Analyzing this data can help identify driving patterns that affect fuel efficiency, such as frequent lane changes or abrupt braking.

- **Dataset 2: Climate & Air Quality Data**  
  - Description: Contains long-term temperature and precipitation records from weather stations across the United States.
  - Source: [U.S. Historical Climatology Network](https://www.ncei.noaa.gov/products/land-based-station/us-historical-climatology-network)
  - Justification: Can correlate driving behavior with environmental impact. Provides environmental context to fuel consumption.

- **Dataset 3: Automobile Dataset (UCI Machine Learning Repository)**
  - Description: This dataset includes various characteristics of automobiles, such as engine size, horsepower, weight, and fuel consumption. It also provides information on insurance risk ratings and normalized losses in use as compared to other cars.
  - Source: [UCI Machine Learning Repository - Automobile Dataset](https://archive.ics.uci.edu/dataset/10/automobile)
  - Justification: The dataset’s detailed vehicle specifications and performance metrics can be used to analyze how different factors influence fuel efficiency, aiding in the development of predictive models.

➡ **Find and document three datasets for your problem below:**  


## **Step 3: Explore and Visualize Your Data**
Understanding the structure of your dataset is crucial. Perform the following tasks:
1. **Summarize dataset statistics:**
   - Number of samples
   - Number of features
   - Data types (numerical, categorical, text, etc.)
   - Ranges of values (min/max)
   - Missing values

2. **Create visualizations:**
   - Histograms: Show feature distributions.
   - Scatter plots: Explore relationships between key variables.
   - (Optional) PCA: Visualize high-dimensional data in 2D.
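
The summary items listed above can be collected in one pass with pandas. The following is a minimal sketch, not a required implementation: the helper name `summarize` and the toy columns (`horsepower`, `city_mpg`, `fuel_type`) are hypothetical stand-ins for your own data.

```python
import numpy as np
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, unique count, min, max."""
    numeric = df.select_dtypes(include=[np.number])
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),
    })
    # min/max are only defined here for numeric columns; others stay NaN
    summary["min"] = numeric.min()
    summary["max"] = numeric.max()
    return summary


# Hypothetical toy data standing in for your own CSV
toy = pd.DataFrame({
    "horsepower": [111.0, 154.0, np.nan, 102.0],
    "city_mpg": [21, 19, 24, 30],
    "fuel_type": ["gas", "gas", "diesel", "gas"],
})

n_samples, n_features = toy.shape
print(f"{n_samples} samples, {n_features} features")
print(summarize(toy))
```

To apply this to your own data, replace `toy` with the DataFrame returned by `pd.read_csv`.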

### **Example Exploration Code**
Modify this code to work with your dataset.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset (modify with your own file)
df = pd.read_csv("your_dataset.csv")

# Display column names, dtypes, and non-null counts
print("Dataset Overview:")
df.info()  # info() prints directly and returns None, so do not wrap it in print()

# Show summary statistics
print("Summary Statistics:")
print(df.describe())


In [None]:
# Histogram of numerical features
df.hist(figsize=(12, 8))
plt.show()


In [None]:
# Example scatter plot of two features (modify column names as needed)
sns.scatterplot(x=df['feature1'], y=df['feature2'])
plt.title("Feature1 vs Feature2")
plt.show()


In [None]:
from sklearn.decomposition import PCA
import numpy as np

# Apply PCA for dimensionality reduction (modify as needed).
# PCA cannot handle missing values, so drop incomplete rows first.
numeric = df.select_dtypes(include=[np.number]).dropna()
pca = PCA(n_components=2)
X_pca = pca.fit_transform(numeric)

# Plot PCA results
plt.scatter(X_pca[:, 0], X_pca[:, 1])
plt.title("PCA Projection of Dataset")
plt.show()
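
PCA is sensitive to feature scale: a column measured in thousands can dominate the projection. One common remedy is to standardize the numeric columns first. The sketch below assumes scikit-learn's `StandardScaler` is acceptable for your project and uses hypothetical toy columns (`weight_lb`, `city_mpg`, `doors`); it also prints the explained variance ratio, which indicates how much of the variation the 2D view retains.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: correlated columns on very different scales
rng = np.random.default_rng(0)
weight = rng.normal(3000, 500, size=200)
toy = pd.DataFrame({
    "weight_lb": weight,
    "city_mpg": 60 - 0.01 * weight + rng.normal(0, 1, size=200),
    "doors": rng.integers(2, 5, size=200).astype(float),
})

# Keep numeric columns, drop rows with missing values, then standardize
numeric = toy.select_dtypes(include=[np.number]).dropna()
X_scaled = StandardScaler().fit_transform(numeric)

# Project the standardized features onto the first two principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape)
print(pca.explained_variance_ratio_)
```

The resulting `X_pca` can be passed to `plt.scatter` exactly as in the cell above.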


For each figure that you create, add an explanation of why it is useful. What does it tell you about your data? Which figures do you find most interesting, and why?

## **Step 4: Implement a PyTorch Dataset Class**
Follow [this tutorial](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) to prepare data for deep learning by creating a PyTorch Dataset class that:
- Loads data from a CSV or another source.
- Applies preprocessing (e.g., normalization, missing value handling).
- Returns samples in a PyTorch-compatible format.

### **Example PyTorch Dataset Implementation**
Modify this template for your dataset.


In [None]:
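import pandas as pd
import torch
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, csv_file):
        # Load the full CSV into memory once
        self.data = pd.read_csv(csv_file)
        # Selected columns must be numeric to convert to float32 tensors
        self.features = self.data[['feature1', 'feature2']].values  # Modify features
        self.labels = self.data['target'].values  # Modify target

    def __len__(self):
        # Number of samples (rows) in the dataset
        return len(self.data)

    def __getitem__(self, idx):
        x = torch.tensor(self.features[idx], dtype=torch.float32)
        # float32 suits regression targets; use torch.long for class-index labels
        y = torch.tensor(self.labels[idx], dtype=torch.float32)
        return x, y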

# Example usage
dataset = CustomDataset('your_dataset.csv')
print(len(dataset))
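
The template above loads data but does not yet preprocess it. The sketch below shows one possible way to add missing-value handling and normalization, then batch the samples with a `DataLoader`. The class name `PreprocessedDataset`, the column names, and the choice of mean imputation are assumptions made for illustration, not requirements of the assignment.

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset


class PreprocessedDataset(Dataset):
    """Tabular dataset that imputes missing features and standardizes them."""

    def __init__(self, frame, feature_cols, target_col):
        # Rows without a label cannot be used for supervised training
        frame = frame.dropna(subset=[target_col])
        features = frame[feature_cols].astype("float32")
        # Assumption: fill missing feature values with the column mean
        features = features.fillna(features.mean())
        self.mean = features.mean()
        # Replace zero std with 1 to avoid dividing by zero on constant columns
        self.std = features.std(ddof=0).replace(0, 1.0)
        scaled = (features - self.mean) / self.std
        self.features = torch.tensor(scaled.to_numpy(), dtype=torch.float32)
        self.labels = torch.tensor(frame[target_col].to_numpy(), dtype=torch.float32)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]


# Hypothetical toy data standing in for pd.read_csv("your_dataset.csv")
toy = pd.DataFrame({
    "feature1": [1.0, 2.0, np.nan, 4.0, 5.0],
    "feature2": [10.0, 20.0, 30.0, 40.0, 50.0],
    "target": [0.0, 1.0, 0.0, np.nan, 1.0],
})

dataset = PreprocessedDataset(toy, ["feature1", "feature2"], "target")
loader = DataLoader(dataset, batch_size=2, shuffle=False)
x_batch, y_batch = next(iter(loader))
print(len(dataset), x_batch.shape, y_batch.shape)
```

Here the mean and standard deviation are computed over the whole table for brevity. Once you split your data, computing them on the training split only and reusing them for validation and test data avoids leaking information between splits.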


## **Final Submission**
Upload your submission for Milestone 1 to Canvas. 
Submit this notebook with:

✅ A **clear problem statement**.  
✅ Three **documented datasets** with justification.  
✅ **Exploratory analysis** with summary statistics & visualizations.  
✅ A **PyTorch Dataset class** for preparing data.  

📌 Use the provided example to guide your work. Happy Deep Learning! 🚀