
In [None]:

# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES
# TO THE CORRECT LOCATION (/kaggle/input) IN YOUR NOTEBOOK,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

import os
import sys
from tempfile import NamedTemporaryFile
from urllib.request import urlopen
from urllib.parse import unquote, urlparse
from urllib.error import HTTPError
from zipfile import ZipFile
import tarfile
import shutil

CHUNK_SIZE = 40960
DATA_SOURCE_MAPPING = 'water-protability:https%3A%2F%2Fstorage.googleapis.com%2Fkaggle-data-sets%2F5187090%2F8658162%2Fbundle%2Farchive.zip%3FX-Goog-Algorithm%3DGOOG4-RSA-SHA256%26X-Goog-Credential%3Dgcp-kaggle-com%2540kaggle-161607.iam.gserviceaccount.com%252F20240611%252Fauto%252Fstorage%252Fgoog4_request%26X-Goog-Date%3D20240611T173219Z%26X-Goog-Expires%3D259200%26X-Goog-SignedHeaders%3Dhost%26X-Goog-Signature%3Da63fde10ba0bacaa8ff8a2e960bab4415266bfc8098d85c0cb1b08a8b72c04d0482033d2466b2e932994fc938db50fad8215d451aedd1c076215d12daa8ba5a040ecc24a4014a42aa3ed959b265cf2f8710dad22d99c9577e6c93e675ba4af4401f0f5545c550d44517c51cded93cc15ca81368ebfa7168cd15e77516448a5571a42c3983f746b077c9583bc5df46f87015daad32bb8ce0dd7c8fd808dc1db14715e319a1cd095a742bc8e74ae14e436ef6623f3b4c5988d4ab1133ac55b8d7aabf5fbcb1165ad7903bc0421ee036da2431fff3f900c9c2015dd8be9a84255bd883de5ab5433232bf17ac1191241f1e8f58010be21a0d00456a2dab07dc7e280'

KAGGLE_INPUT_PATH='/kaggle/input'
KAGGLE_WORKING_PATH='/kaggle/working'
KAGGLE_SYMLINK='kaggle'

!umount /kaggle/input/ 2> /dev/null
shutil.rmtree('/kaggle/input', ignore_errors=True)
os.makedirs(KAGGLE_INPUT_PATH, 0o777, exist_ok=True)
os.makedirs(KAGGLE_WORKING_PATH, 0o777, exist_ok=True)

try:
  os.symlink(KAGGLE_INPUT_PATH, os.path.join("..", 'input'), target_is_directory=True)
except FileExistsError:
  pass
try:
  os.symlink(KAGGLE_WORKING_PATH, os.path.join("..", 'working'), target_is_directory=True)
except FileExistsError:
  pass

for data_source_mapping in DATA_SOURCE_MAPPING.split(','):
    directory, download_url_encoded = data_source_mapping.split(':')
    download_url = unquote(download_url_encoded)
    filename = urlparse(download_url).path
    destination_path = os.path.join(KAGGLE_INPUT_PATH, directory)
    try:
        with urlopen(download_url) as fileres, NamedTemporaryFile() as tfile:
            total_length = fileres.headers['content-length']
            print(f'Downloading {directory}, {total_length} bytes compressed')
            dl = 0
            data = fileres.read(CHUNK_SIZE)
            while len(data) > 0:
                dl += len(data)
                tfile.write(data)
                done = int(50 * dl / int(total_length))
                sys.stdout.write(f"\r[{'=' * done}{' ' * (50-done)}] {dl} bytes downloaded")
                sys.stdout.flush()
                data = fileres.read(CHUNK_SIZE)
            if filename.endswith('.zip'):
              with ZipFile(tfile) as zfile:
                zfile.extractall(destination_path)
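            else:
              # Flush buffered bytes to disk before reopening the file by name,
              # and avoid shadowing the tarfile module with the archive handle
              tfile.flush()
              with tarfile.open(tfile.name) as tar_archive:
                tar_archive.extractall(destination_path)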
            print(f'\nDownloaded and uncompressed: {directory}')
    except HTTPError as e:
        print(f'Failed to load (likely expired) {download_url} to path {destination_path}')
        continue
    except OSError as e:
        print(f'Failed to load {download_url} to path {destination_path}')
        continue

print('Data source import complete.')


# Using the TensorDataset class

In practice, loading your data into a PyTorch dataset will be one of the first steps you take in order to create and train a neural network with PyTorch.

The TensorDataset class is very helpful when your dataset can be loaded directly as a NumPy array. Recall that TensorDataset() takes one or more tensors as input, so NumPy arrays must first be converted to tensors.

In this exercise, you'll practice creating a PyTorch dataset using the TensorDataset class.

The cells below import torch and numpy, along with the TensorDataset class.

In [None]:
import torch.nn as nn
import pandas as pd
from torch.utils.data import TensorDataset, DataLoader

In [None]:
import numpy as np
import torch
from torch.utils.data import TensorDataset

np_features = np.array(np.random.rand(12, 8))
np_target = np.array(np.random.rand(12, 1))

# Convert arrays to PyTorch tensors
torch_features = torch.tensor(np_features)
torch_target = torch.tensor(np_target)
# Create a TensorDataset from two tensors
dataset = TensorDataset(torch_features, torch_target)
# dataset = TensorDataset(torch_features.float(), torch_target.float())

# Return the last element of this dataset
print(dataset[-1])

(tensor([0.7421, 0.5156, 0.8350, 0.6717, 0.2516, 0.8435, 0.3683, 0.8443],
       dtype=torch.float64), tensor([0.0609], dtype=torch.float64))
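The `dtype=torch.float64` in the output comes from NumPy, whose random arrays default to 64-bit floats, while `nn.Linear` layers use 32-bit weights by default. A minimal sketch of the cast, using illustrative variable names that are not part of the exercise:

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset

# Illustrative arrays (hypothetical names), same shapes as in the exercise
example_features = np.random.rand(12, 8)
example_target = np.random.rand(12, 1)

# torch.tensor keeps NumPy's float64; .float() casts to float32
features_f32 = torch.tensor(example_features).float()
target_f32 = torch.tensor(example_target).float()

example_dataset = TensorDataset(features_f32, target_f32)

# Each item is a (features, target) tuple of tensors
sample_features, sample_target = example_dataset[-1]
print(len(example_dataset), sample_features.dtype, sample_features.shape)
```

Casting once when the tensors are created avoids dtype mismatch errors later, when the batches are passed to a model.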


TensorDataset is great to use when your dataset can be loaded from NumPy arrays (or converted to NumPy arrays). However, sometimes you need to code a custom dataset class.
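A custom dataset only needs to subclass `torch.utils.data.Dataset` and implement `__len__` and `__getitem__`. The sketch below is one possible layout, not the course's implementation: the class name `FrameDataset`, the toy DataFrame, and the convention that the last column holds the target are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader


class FrameDataset(Dataset):
    """Hypothetical dataset: the last column is the target, the rest are features."""

    def __init__(self, frame: pd.DataFrame):
        values = frame.to_numpy(dtype=np.float32)
        self.features = torch.tensor(values[:, :-1])
        self.target = torch.tensor(values[:, -1])

    def __len__(self):
        # Number of samples in the dataset
        return len(self.features)

    def __getitem__(self, idx):
        # One (features, target) pair
        return self.features[idx], self.target[idx]


toy_frame = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0], "label": [0, 1, 0]})
toy_dataset = FrameDataset(toy_frame)
toy_loader = DataLoader(toy_dataset, batch_size=2)

first_features, first_target = next(iter(toy_loader))
print(first_features.shape, first_target)
```

Because the class follows the same indexing contract as TensorDataset, it can be passed to a DataLoader in exactly the same way.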



In [None]:
dataframe = pd.read_csv("/kaggle/input/water-protability/water_potability.csv")

In [None]:
dataframe = pd.DataFrame({
    'ph': [7.0, 8.1, np.nan, 7.8],
    'Sulfate': [300, 320, 330, np.inf],
    'Solids': [20000, 21000, 22000, 23000],
    'Conductivity': [400, 420, 430, 440],
    'Chloramines': [3.1, 3.2, 3.3, 3.4],
    'Turbidity': [4.0, 4.1, 4.2, 4.3],
    'Hardness': [150, 160, 170, 180],
    'Organic_carbon': [10, 11, 12, 13],
    'Potability': [0, 1, 0, 1]
})


In [None]:
# Select the feature columns
features = dataframe[['ph', 'Sulfate', 'Solids', 'Conductivity', 'Chloramines', 'Turbidity', 'Hardness', 'Organic_carbon']]

# Treat infinities as missing values and drop incomplete rows. Otherwise NaN/inf
# propagate through the normalization, the model outputs NaN, and nn.BCELoss
# raises "RuntimeError: all elements of input should be between 0 and 1"
features = features.replace([np.inf, -np.inf], np.nan).dropna()
target = dataframe.loc[features.index, 'Potability']

# Normalize the features
features = (features - features.mean()) / features.std()

# Convert to PyTorch tensors
features_tensor = torch.tensor(features.to_numpy()).float()
target_tensor = torch.tensor(target.to_numpy()).float()

# Create a dataset from the two generated tensors
dataset = TensorDataset(features_tensor, target_tensor)

# Create a dataloader using the above dataset
dataloader = DataLoader(dataset, shuffle=True, batch_size=2)

# Create a model using the nn.Sequential API
model = nn.Sequential(
    nn.Linear(8, 16),  # Input dimension is 8 to match the number of features
    nn.ReLU(),
    nn.Linear(16, 1),
    nn.Sigmoid()  # Sigmoid activation function to squash output values to [0, 1]
)

# Define loss function and optimizer
criterion = nn.BCELoss()  # Binary Cross-Entropy Loss
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # Adam optimizer

# Train the model
num_epochs = 10
for epoch in range(num_epochs):
    for features_batch, target_batch in dataloader:
        # Forward pass
        output = model(features_batch)

        # Debugging: print output values
        print(f"Output: {output.detach().numpy()}")

        # Add a trailing dimension so the target shape (N, 1) matches the output
        loss = criterion(output, target_batch.unsqueeze(1))

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# After training, print the output of the trained model
with torch.no_grad():
    output = model(features_tensor)
print(output)
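
NaN and infinite values survive mean/std normalization and every linear layer, and `nn.BCELoss` rejects predictions that are not in [0, 1]. A quick finiteness check before training catches this early. The following is a minimal sketch on a hypothetical two-column array, not on the actual water potability data:

```python
import numpy as np
import torch

# Hypothetical feature matrix with one NaN and one infinity
raw = np.array([[7.0, 300.0], [8.1, 320.0], [np.nan, 330.0], [7.8, np.inf]])
raw_tensor = torch.tensor(raw).float()

# torch.isfinite is False for NaN, +inf and -inf
finite_rows = torch.isfinite(raw_tensor).all(dim=1)
clean_tensor = raw_tensor[finite_rows]

# Standardize only after the bad rows are gone
normalized = (clean_tensor - clean_tensor.mean(dim=0)) / clean_tensor.std(dim=0)
print(finite_rows.tolist(), torch.isfinite(normalized).all().item())
```

Dropping rows is only one option; imputing missing values with a column mean or median keeps more data, at the cost of an extra modeling assumption.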

# Evaluating model performance

**Writing the evaluation loop**

In this exercise, you will practice writing the evaluation loop. Recall that the evaluation loop is similar to the training loop, except that you will not perform the gradient calculation and the optimizer step.

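The exercise assumes that the model is already defined, along with the object validationloader, which is a DataLoader over the validation data. This notebook never creates validationloader, so the cell below raises a NameError when run as is.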

In [None]:
# Set the model to evaluation mode
model.eval()
validation_loss = 0.0

with torch.no_grad():
    for data in validationloader:
        outputs = model(data[0])
        loss = criterion(outputs, data[1])

        # Sum the current loss to the validation_loss variable
        validation_loss += loss.item()

# Calculate the mean loss value
validation_loss_epoch = validation_loss / len(validationloader)
print(validation_loss_epoch)

# Set the model back to training mode
model.train()

NameError: name 'validationloader' is not defined
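
Since validationloader does not exist in this notebook, here is a self-contained sketch of the same evaluation loop. The random validation data, the small binary model, and every name prefixed with `example_` or `val_` are assumptions made so that the loop can run on its own.

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)

# Hypothetical validation data: 10 samples, 8 features, binary labels
val_features = torch.rand(10, 8)
val_labels = torch.randint(0, 2, (10, 1)).float()
example_validationloader = DataLoader(TensorDataset(val_features, val_labels), batch_size=4)

example_model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
example_criterion = nn.BCELoss()

example_model.eval()  # switch layers such as dropout/batch norm to inference behavior
running_loss = 0.0
with torch.no_grad():  # no gradients are tracked inside this block
    for batch_features, batch_labels in example_validationloader:
        predictions = example_model(batch_features)
        running_loss += example_criterion(predictions, batch_labels).item()

# Mean of the per-batch losses (len(loader) is the number of batches)
mean_validation_loss = running_loss / len(example_validationloader)
print(mean_validation_loss)

example_model.train()  # back to training mode for the next epoch
```

Note that the labels are created with shape (N, 1) so that they match the model output, which is what `nn.BCELoss` expects.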


**Calculating accuracy using torchmetrics**

In addition to the losses, you should also keep track of the accuracy during training. By doing so, you will be able to select the epoch when the model performed the best.

In this exercise, you will practice using the torchmetrics package to calculate the accuracy. You will be using a sample of the facemask dataset. This dataset contains three different classes. The plot_errors function will display samples where the model predictions do not match the ground truth. Performing such error analysis will help you understand your model's failure modes.

The torchmetrics package is already imported. The model outputs are the probabilities returned by a softmax as the last step of the model. The labels tensor contains the labels as one-hot encoded vectors.

In [None]:
import torchmetrics

In [None]:
# Create accuracy metric using torchmetrics
metric = torchmetrics.Accuracy(task="multiclass", num_classes=3)
for data in dataloader:
    features, labels = data
    outputs = model(features)

    # Calculate accuracy over the batch
    acc = metric(outputs.softmax(dim=-1), labels.argmax(dim=-1))

# Calculate accuracy over the whole epoch
acc = metric.compute()

# Reset the metric for the next epoch
metric.reset()
plot_errors(model, dataloader)

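ValueError: Either `preds` and `target` both should have the (same) shape (N, ...), or `target` should be (N, ...) and `preds` should be (N, C, ...).

This error most likely arises because the cell reuses the binary water potability model and dataloader defined earlier in the notebook. That model returns one value per sample (shape (N, 1)) and the labels are a 1-D tensor rather than one-hot vectors, so labels.argmax(dim=-1) collapses the whole batch to a single scalar. The exercise expects a three-class model with outputs of shape (N, 3) and one-hot labels of shape (N, 3).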

Accuracy is a useful metric for classification problems. Calculating the class-wise accuracy gives a more detailed picture of model performance than a single overall number. Moreover, by looking at the samples your model misclassifies, you can find trends in the errors and better understand when your model fails.
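
Class-wise accuracy can be illustrated by hand. The sketch below uses plain torch rather than torchmetrics, with made-up predicted and true class indices, and computes for each class the fraction of its samples that were predicted correctly.

```python
import torch

num_classes = 3
# Hypothetical predicted and true class indices
predicted = torch.tensor([0, 1, 2, 0, 1, 1])
actual = torch.tensor([0, 1, 2, 1, 1, 2])

per_class_accuracy = []
for cls in range(num_classes):
    mask = actual == cls                      # samples whose true class is cls
    correct = (predicted[mask] == cls).sum()  # of those, how many were predicted as cls
    per_class_accuracy.append((correct / mask.sum()).item())

print(per_class_accuracy)
```

In this toy example the overall accuracy is 4 out of 6, but the per-class values show that class 2 is recognized only half of the time, which the single number hides.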