A Mobile Text-to-Image Search Powered by AI

A minimal demo of semantic, multimodal text-to-image search on mobile, powered by pretrained vision-language models such as OpenAI's CLIP.

Features

  1. Text-to-image retrieval based on semantic similarity search.
  2. Support for multiple vector indexing strategies (linear scan and KMeans are currently implemented); a minimal retrieval sketch follows below.
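
To make the retrieval idea concrete, here is a minimal Swift sketch of the linear-scan strategy: score a text-query embedding against precomputed image embeddings with cosine similarity and return the top matches. The type and property names (LinearScanIndex, imageEmbeddings) are illustrative assumptions, not names from the project source.

```swift
// Minimal linear-scan retrieval over precomputed CLIP-style embeddings.
// Embeddings are assumed to come from the bundled text/image encoders.
struct LinearScanIndex {
    // One embedding vector per gallery image, all of equal dimension.
    let imageEmbeddings: [[Float]]

    // Cosine similarity between two vectors of equal length.
    private func cosine(_ a: [Float], _ b: [Float]) -> Float {
        var dot: Float = 0, normA: Float = 0, normB: Float = 0
        for i in 0..<min(a.count, b.count) {
            dot += a[i] * b[i]
            normA += a[i] * a[i]
            normB += b[i] * b[i]
        }
        return dot / ((normA * normB).squareRoot() + 1e-8)
    }

    // Returns the indices of the top-k images most similar to the query embedding.
    func search(query: [Float], topK: Int) -> [Int] {
        let scored = imageEmbeddings.enumerated()
            .map { (index: $0.offset, score: cosine(query, $0.element)) }
        return scored
            .sorted { $0.score > $1.score }
            .prefix(topK)
            .map { $0.index }
    }
}
```

A KMeans index follows the same scoring logic but first narrows the candidate set to the images assigned to the centroids closest to the query, trading a little recall for speed.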

Screenshot

  • All images in the gallery
  • Search results for the query "Three cats"

Install

  1. Download the two TorchScript model files (text encoder and image encoder) into the models folder and add them to the Xcode project.
  2. Required dependencies are declared in the Podfile and managed with CocoaPods. Run pod install and then open the generated .xcworkspace project file in Xcode.
pod install
  3. By default, the demo loads all images from the local photo gallery on your physical device or simulator. To restrict it to a specific album, set the albumName variable in the getPhotos method and replace assetResults on line 117 of GalleryInteractor.swift with photoAssets; a sketch of album-scoped fetching follows below.
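
The album-scoping step above can be illustrated with Apple's Photos framework. This is a rough sketch under assumed names (fetchAssets(inAlbumNamed:)) with a fallback to the whole library, not the project's actual getPhotos implementation.

```swift
import Photos

// Fetch image assets, either from the whole library or from a named album.
func fetchAssets(inAlbumNamed albumName: String?) -> PHFetchResult<PHAsset> {
    let options = PHFetchOptions()
    options.sortDescriptors = [NSSortDescriptor(key: "creationDate", ascending: false)]

    // No album specified: fall back to the entire photo library.
    guard let albumName = albumName else {
        return PHAsset.fetchAssets(with: .image, options: options)
    }

    // Look up a user-created album by its title.
    let collections = PHAssetCollection.fetchAssetCollections(
        with: .album, subtype: .albumRegular, options: nil)
    var matched: PHAssetCollection?
    collections.enumerateObjects { collection, _, stop in
        if collection.localizedTitle == albumName {
            matched = collection
            stop.pointee = true
        }
    }

    // Fall back to the whole library if the album was not found.
    guard let album = matched else {
        return PHAsset.fetchAssets(with: .image, options: options)
    }
    return PHAsset.fetchAssets(in: album, options: options)
}
```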

Todo

  • Basic features
      • Accessing a specified album or the whole photo library
      • Asynchronous model loading and vector computation
  • Indexing strategies
      • Linear indexing (persisted to file via the built-in Data type); see the sketch after this list
      • KMeans indexing (persisted to file via NSMutableDictionary)
      • Ball-tree indexing
      • Locality-sensitive hashing indexing
  • Choices of semantic representation models
      • OpenAI's CLIP model
      • Integration of other multimodal retrieval models
  • Efficiency
      • Reducing the memory consumption of models (the ViT-B/32 version of CLIP takes about 605 MB of storage and roughly 1 GB of memory at runtime on iPhone)
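
As a rough illustration of the "linear indexing persisted to file via Data" item above, the sketch below flattens Float embedding vectors into raw bytes, writes them to disk, and reads them back. The function names and fixed-dimension layout are assumptions, not the project's actual persistence code.

```swift
import Foundation

// Persist a list of equal-length Float embedding vectors as raw bytes.
func saveEmbeddings(_ vectors: [[Float]], to url: URL) throws {
    let flat = vectors.flatMap { $0 }
    let data = flat.withUnsafeBufferPointer { Data(buffer: $0) }
    try data.write(to: url, options: .atomic)
}

// Read the raw bytes back and re-split them into fixed-size rows.
func loadEmbeddings(from url: URL, dimension: Int) throws -> [[Float]] {
    let data = try Data(contentsOf: url)
    let flat: [Float] = data.withUnsafeBytes { (raw: UnsafeRawBufferPointer) in
        Array(raw.bindMemory(to: Float.self))
    }
    return stride(from: 0, to: flat.count, by: dimension).map {
        Array(flat[$0..<Swift.min($0 + dimension, flat.count)])
    }
}
```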

About us

This project is maintained by the ADAPT Lab at Shanghai Jiao Tong University. We expect it to continually integrate more advanced features and deliver a better cross-modal search experience.
