# Evaluation via Inception Score


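The Inception Score (IS) is a popular metric for evaluating images produced by generative models such as Generative Adversarial Networks (GANs). 
It captures two aspects of the generated images: 
1. Quality: each image should be classified confidently into a single class, and 
2. Diversity: the set of images as a whole should cover many different classes. 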

The score is derived from the Inception model, which is a deep convolutional neural network pre-trained on ImageNet for image classification tasks.
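
For reference, the score is commonly written as below. Here $p(y \mid x)$ is the Inception model's class distribution for a generated image $x$, and $p(y)$ is the average of those distributions over all generated images. The symbol $p_g$ for the generator's distribution is notation assumed here, not taken from the text above.

```latex
\mathrm{IS} = \exp\Big( \mathbb{E}_{x \sim p_g} \big[ D_{\mathrm{KL}}\big( p(y \mid x) \,\|\, p(y) \big) \big] \Big),
\qquad
p(y) = \mathbb{E}_{x \sim p_g} \big[ p(y \mid x) \big]
```

Since the KL divergence is never negative, the score is at least 1, and with a 1000-class classifier it cannot exceed 1000.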

In [None]:
# !pip install torch torchvision scipy 

In [6]:
import torch
from torchvision import transforms
from torchvision.models.inception import inception_v3
import numpy as np
from scipy.stats import entropy
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder
from PIL import Image

In [7]:
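# Path to folder of generated images
# Note: ImageFolder expects the images inside at least one subfolder,
# e.g. generated_images/<some_name>/*.png; the subfolder only supplies a label, which is ignored here
image_folder_path = 'generated_images/'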

# Define the transform to resize and normalize images
# Note: Normalization uses the InceptionV3's expected mean and std
transform = transforms.Compose([
    transforms.Resize((299, 299)),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))
])

# Create the dataset and dataloader
dataset = ImageFolder(root=image_folder_path, transform=transform)
dataloader = DataLoader(dataset, batch_size=32, shuffle=False)

In [10]:


def inception_score(imgs, cuda=True, batch_size=32, resize=True, splits=1):
    """Computes the inception score of the generated images.
    imgs -- A PyTorch DataLoader yielding (image_tensor, label) batches, or a dataset of (image_tensor, label) pairs
    cuda -- If True, use GPU for computation
    batch_size -- Batch size for feeding into Inception v3
    resize -- Resize input images to 299x299 if not already done
    splits -- Number of splits for calculating the score
    """
    
    # Check if imgs is a DataLoader
    if not isinstance(imgs, DataLoader):
        imgs = DataLoader(imgs, batch_size=batch_size)

    # Set up dtype; fall back to CPU when CUDA is requested but unavailable
    cuda = cuda and torch.cuda.is_available()
    dtype = torch.cuda.FloatTensor if cuda else torch.FloatTensor

    # Set up inception model
    inception_model = inception_v3(pretrained=True, transform_input=False).type(dtype)
    inception_model.eval()
    up = torch.nn.Upsample(size=(299, 299), mode='bilinear').type(dtype) if resize else None

    def get_pred(x):
        if resize:
            x = up(x)
        # Inference only: skip gradient tracking to save memory
        with torch.no_grad():
            x = inception_model(x)
        return torch.nn.functional.softmax(x, dim=1).cpu().numpy()

    # Transform for input images
    transform = transforms.Compose([
        transforms.Resize((299, 299)) if resize else transforms.Lambda(lambda x: x),
        transforms.ToTensor(),
        transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))
    ])

    # Get predictions
    preds = np.zeros((len(imgs.dataset), 1000))

    
    start = 0  # running row offset into preds, independent of the loader's batch size
    for batch, _ in imgs:  # each item is an (images, labels) batch; labels are ignored
        # Check if the batch needs conversion
        if not torch.is_tensor(batch[0]):
            batch = torch.stack([transform(img).type(dtype) for img in batch])
        else:
            batch = batch.type(dtype)
        
        batch_size_i = batch.size(0)
        preds[start:start + batch_size_i] = get_pred(batch)
        start += batch_size_i


    # Calculate scores
    split_scores = []
    for k in range(splits):
        part = preds[k * (len(preds) // splits): (k+1) * (len(preds) // splits), :]
        py = np.mean(part, axis=0)
        scores = [entropy(pyx, py) for pyx in part]
        split_scores.append(np.exp(np.mean(scores)))

    return np.mean(split_scores), np.std(split_scores)

In [11]:
if __name__ == '__main__':
    # Compute the Inception Score
    mean_score, std_score = inception_score(dataloader, cuda=True, resize=False, splits=10)

    print(f"Inception Score: Mean = {mean_score}, Std = {std_score}")




Inception Score: Mean = 2.1783351147759924, Std = 0.3104690459466458


# TLDR Explanation of the Inception Score results
Interpretation
- A mean Inception Score of about 2.18 with a standard deviation of 0.31 suggests that the model produces images of:
    - moderate quality at best, and 
    - moderate diversity at best. 
- For context, the score cannot fall below 1 or exceed the number of classifier classes (1000 here), so 2.18 sits near the low end of that range. The images may be recognizable and varied to some extent, but there is clear room for improvement in both realism and diversity. 

- The standard deviation is computed over the 10 splits of the image set (`splits=10`). Its value indicates that the score is fairly consistent, although quality or diversity may differ noticeably from one split to another.

# Detailed Explanation of the Inception Score results of Mean and Standard Deviation

## Mean of the Inception Score (2.1783351147759924)
### Quality: 
The mean Inception Score partly reflects the quality of the generated images. The Inception model gives a confident, low-entropy class prediction for an image that contains a clearly recognizable object, so a higher mean score suggests that, on average, the images are more realistic and recognizable to the Inception model. 

### Interpretation of Quality of generated images based on Mean: 
A mean score of 2.18 indicates that the generated images have, at most, a moderate level of quality as judged by the Inception model. In the context of Inception Scores, higher values (e.g., scores close to or above 10) are typically seen in very high-quality models. However, what counts as "high quality" is relative and depends on the specific dataset and task.

### Diversity: 
The score also captures the diversity of the generated images. A higher score implies that the model can generate a variety of images across different classes. Diversity is assessed through the marginal class distribution, i.e., the average of the Inception model's predictions over all generated images: when the images are spread across many classes, this average is far from any single image's confident prediction, and the score rises. 

### Interpretation of Diversity of generated images based on Mean: 
A mean score of 2.18 indicates, at most, a moderate level of diversity among the generated images. Because the score is a single number, it cannot show whether quality or diversity is the limiting factor.
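
To make the interplay of quality and diversity concrete, here is a minimal sketch that computes the score directly from hand-made prediction arrays instead of real Inception outputs. The helper `score_from_preds` and the 10-class toy setup are assumptions for illustration; the formula mirrors the `entropy(pyx, py)` step in `inception_score` above.

```python
import numpy as np
from scipy.stats import entropy


def score_from_preds(preds):
    """Inception Score from an (N, C) array of softmax rows (hypothetical helper)."""
    py = preds.mean(axis=0)                     # marginal class distribution p(y)
    kls = [entropy(pyx, py) for pyx in preds]   # KL(p(y|x) || p(y)) for each image
    return float(np.exp(np.mean(kls)))


num_classes = 10  # toy value; the Inception model itself has 1000 classes

# Confident and diverse: each image maps to exactly one class, all classes covered
sharp_diverse = np.eye(num_classes)

# Confident but not diverse: every image maps to class 0
sharp_collapsed = np.tile(np.eye(num_classes)[0], (num_classes, 1))

# No confident prediction at all: every row is the uniform distribution
blurry = np.full((num_classes, num_classes), 1.0 / num_classes)

print(score_from_preds(sharp_diverse))    # about 10.0, the maximum for 10 classes
print(score_from_preds(sharp_collapsed))  # 1.0, the minimum
print(score_from_preds(blurry))           # 1.0, the minimum
```

Only the case that is both confident and diverse reaches the upper bound; losing either property pulls the score down to 1.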


## Standard Deviation of the Inception Score (0.3104690459466458)

### Consistency: 
The standard deviation shows how consistent the Inception Score is across subsets of the generated images. In the code above, the images are divided into `splits` equal parts (10 here), a score is computed for each part, and the reported mean and standard deviation are taken over those per-split scores. A lower standard deviation means the splits score similarly, while a higher one points to variability in quality and diversity from split to split.

 
### Interpretation of Consistency of generated images based on Standard Deviation:
A standard deviation of 0.31, about 14% of the mean, indicates some variability in the quality and diversity of the generated images, but it is not excessively high. The model's performance is therefore somewhat inconsistent across the 10 splits, with a moderate level of variation.
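
The per-split computation can be sketched in isolation as follows. The helper `split_scores` is hypothetical, and the predictions are random Dirichlet samples standing in for real classifier outputs (20 classes and 1000 images are assumed toy values), so the printed numbers say nothing about any real model.

```python
import numpy as np
from scipy.stats import entropy


def split_scores(preds, splits):
    """Per-split Inception Scores for an (N, C) array of softmax rows (hypothetical helper)."""
    size = len(preds) // splits  # images beyond size * splits are ignored
    scores = []
    for k in range(splits):
        part = preds[k * size:(k + 1) * size]
        py = part.mean(axis=0)                     # marginal p(y) within this split
        kls = [entropy(pyx, py) for pyx in part]   # KL(p(y|x) || p(y)) for each image
        scores.append(np.exp(np.mean(kls)))
    return np.array(scores)


# Synthetic stand-in for classifier outputs (assumed toy data, seeded for repeatability)
rng = np.random.default_rng(0)
preds = rng.dirichlet(np.full(20, 0.3), size=1000)

scores = split_scores(preds, splits=10)
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
print(f"relative spread = {scores.std() / scores.mean():.1%}")
```

Each split sees fewer images than the full set, so with many splits and few images the per-split marginal becomes noisy and the standard deviation tends to grow.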




## Additional Notes: 
It's important to note that while the Inception Score can provide useful insights into the performance of a generative model, it is not without limitations. 

It depends on the Inception model, which is trained on ImageNet and may not fully capture quality or diversity aspects specific to different datasets or domains. 

Additionally, it does not measure how well the generated images match the target distribution (i.e., the real images), because it is computed from the generated images alone. Therefore, it is best to use the Inception Score alongside other evaluation metrics and qualitative assessments to get a comprehensive understanding of a model's performance.