Learning from web data for image classification



Leveraging the abundance of web data is a promising strategy for addressing the shortage of training data for convolutional neural networks (CNNs). However, web images often carry incorrect tags, which may compromise the learned CNN model. In addition, the difference in data distribution between web data and well-labeled datasets (called dataset bias or built-in gap) also limits the effectiveness of web data. To address these problems, we propose two methods that focus on cleaning noisy web data and reducing dataset bias, respectively. We also collect 0.5 million web images covering all categories of four public image classification datasets (SD-198, Stanford Dogs, Food-101 and MIT Indoor67) to support related research.


We crawl web images from Google Images, Flickr and Twitter.
The dataset can be downloaded from the following link:
Baidu Disk

Our Works

(1) Recognition from Web Data: A Progressive Filtering Approach (TIP 2018)

In this paper, we present a novel progressive filtering method that effectively exploits web images for various image classification tasks. Moreover, a one-to-many label assignment strategy is employed for data correction based on the confidence values of labels and the tags of images. The method performs well in a variety of image classification tasks.
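
To make the idea of one-to-many label assignment concrete, here is a minimal sketch in plain Python. It is not the procedure from the paper: the function name `assign_labels`, the `threshold` value, and the rule "keep a class if its confidence is high or its name appears among the image tags" are all assumptions chosen for illustration.

```python
def assign_labels(confidences, tags, class_names, threshold=0.5):
    """Return every class name kept for one web image (possibly several).

    confidences: per-class scores in [0, 1] from some classifier.
    tags:        free-text tags attached to the web image.
    class_names: class names aligned with `confidences`.
    threshold:   hypothetical cut-off, not a value from the paper.
    """
    # Normalise the tags so that matching is case- and whitespace-insensitive.
    tag_set = {t.strip().lower() for t in tags}
    kept = []
    for name, score in zip(class_names, confidences):
        confident = score >= threshold      # the classifier supports this label
        tagged = name.lower() in tag_set    # the web tags support this label
        if confident or tagged:
            kept.append(name)
    return kept


classes = ["pizza", "lasagna", "ramen"]
print(assign_labels([0.7, 0.6, 0.1], ["Dinner"], classes))
```

Under these assumptions, an image that ends up with an empty label list has no support from either source and could be dropped as noise in a filtering round.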

(2) Learning from Web Data using Adversarial Discriminative Neural Networks for Fine-Grained Classification (AAAI 2019)

In this work, we first show that there is a gap between web data and standard datasets, which inhibits the training of the convolutional-layer parameters when both are used together. To address this problem, we present a novel multi-task learning framework that effectively exploits web images for various fine-grained classification tasks. An adversarial discriminative loss is proposed to encourage representation coherence between standard and web data.
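
The general shape of an adversarial loss between two data sources can be sketched as follows. This is a generic GAN-style formulation written in plain Python, not the loss defined in the paper: the helper names `bce`, `discriminator_loss` and `feature_adversarial_loss`, and the convention that the discriminator outputs the probability that a feature comes from the standard dataset, are assumptions.

```python
import math


def bce(p, target):
    """Binary cross-entropy for one probability p and a 0/1 target."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


def discriminator_loss(p_standard, p_web):
    """Loss for a domain discriminator (assumed setup).

    p_standard / p_web: discriminator outputs for standard and web features,
    each read as "probability that this feature is from the standard dataset".
    """
    losses = [bce(p, 1.0) for p in p_standard] + [bce(p, 0.0) for p in p_web]
    return sum(losses) / len(losses)


def feature_adversarial_loss(p_web):
    """Loss for the feature extractor: small when web features are mistaken
    for standard ones, which pushes the two representations together."""
    return sum(bce(p, 1.0) for p in p_web) / len(p_web)


# A discriminator that cannot tell the sources apart outputs 0.5 everywhere.
print(discriminator_loss([0.5, 0.5], [0.5, 0.5]))
```

In a multi-task setup of this kind, such a term would typically be added to the classification losses on standard and web data, with the discriminator and the feature extractor updated in alternation.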