# Instagram

## Functional 
- Users can upload, download, and view photos and videos.
- Users can search based on photo and video titles.
- Users can follow other users.
- A user's feed is generated from the top photos of all users that the user follows.

## Non-functional
- High reliability.
- High availability. 
    - Availability is implied by reliability.
- Maximum 200ms latency for feed generation.
- Consistency can take a hit. (It's OK for a user not to see new photos for a while.)

## Extended
- Users can tag photos and videos.
- Users can search photos and videos based on tags.
- Users can comment on photos and videos.

## Design

<img src="img/instagram1.png" style="width:600px;height:400px;">

Users will
- Upload image.
- View/search image.

Need
- Object storage to store image.
- Database to store image metadata.

<img src="img/instagram2.png" style="width:600px;height:400px;">

The service will be read heavy.
- Web servers have connection limits. 
- Dedicated servers for reads and writes, so that
    - Write operations don't hog the system and disrupt read operations.
    - Reads and writes can be scaled and optimized independently.

<img src="img/instagram3.png" style="width:600px;height:400px;">

- Redundancy is needed so that photos are never lost and so that all other components stay highly available.

News feed generation
1. Fetch the list of users that the current user follows. Submit the photos of those users to the ranking algorithm and generate the feed. However, doing this at request time adds latency.
2. Have a dedicated server continuously pre-generate user feeds into a "UserNewsFeed" table.
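
The two options above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the `Photo` fields, the `likes` ranking signal, the `rank` placeholder, and the in-memory dictionaries standing in for the follow graph and the "UserNewsFeed" table are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Photo:
    photo_id: int
    user_id: int
    creation_date: int  # epoch seconds
    likes: int = 0      # hypothetical ranking signal, not in the schema


def rank(photos, limit):
    """Placeholder ranking: most likes first, newest first on ties."""
    ordered = sorted(photos, key=lambda p: (p.likes, p.creation_date), reverse=True)
    return ordered[:limit]


def generate_feed(user_id, follows, photos_by_user, limit=20):
    """Option 1: build the feed at request time (adds latency)."""
    candidates = [
        photo
        for followee in follows.get(user_id, ())
        for photo in photos_by_user.get(followee, ())
    ]
    return rank(candidates, limit)


def pregenerate_feeds(follows, photos_by_user, user_news_feed, limit=20):
    """Option 2: a background worker fills the feed table ahead of time."""
    for user_id in follows:
        feed = generate_feed(user_id, follows, photos_by_user, limit)
        user_news_feed[user_id] = [photo.photo_id for photo in feed]


follows = {1: [2, 3]}  # user 1 follows users 2 and 3
photos_by_user = {
    2: [Photo(10, 2, 1_700_000_000, likes=5)],
    3: [Photo(11, 3, 1_700_000_100, likes=9), Photo(12, 3, 1_700_000_200, likes=1)],
}
user_news_feed = {}  # stands in for the "UserNewsFeed" table
pregenerate_feeds(follows, photos_by_user, user_news_feed, limit=2)
print(user_news_feed)
```

With pre-generation, serving a feed becomes a single lookup in `user_news_feed`, which is what makes the 200ms latency target more realistic.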

News feed update
- Pull: the client pulls the news feed from the server at regular intervals. Most of the time, the client will receive an empty response.
- Push: the server pushes the news feed to the client whenever there is an update. The server may end up updating some clients very frequently.

Fetch latest photos
- Sort photos based on CreationDate.
- Make CreationDate part of PhotoID, which is indexed.
    - For example, epoch time + auto-incrementing ID from key generation service.
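
One possible way to combine epoch time with an auto-incrementing ID is to pack both into a single integer, with the timestamp in the high bits so that sorting by PhotoID also sorts by creation time. The sketch below assumes a 20-bit sequence and uses a local counter in place of the key generation service; both are illustrative choices.

```python
import itertools
import time

SEQUENCE_BITS = 20  # assumed width: about one million IDs per second
_sequence = itertools.count()  # stand-in for the key generation service


def make_photo_id(epoch_seconds=None, sequence=None):
    """Pack epoch seconds (high bits) and a sequence number (low bits)."""
    if epoch_seconds is None:
        epoch_seconds = int(time.time())
    if sequence is None:
        sequence = next(_sequence)
    return (epoch_seconds << SEQUENCE_BITS) | (sequence % (1 << SEQUENCE_BITS))


def creation_time(photo_id):
    """Recover the epoch seconds embedded in a PhotoID."""
    return photo_id >> SEQUENCE_BITS


older = make_photo_id(1_700_000_000, sequence=7)
newer = make_photo_id(1_700_000_001, sequence=0)
print(older < newer, creation_time(older))
```

Note that an ID built this way needs more than 4 bytes, so it would be stored as a 64-bit integer rather than the 4-byte "int" assumed in the size estimates.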

## Capacity

Storage 
- Assume 
    - 500M total users with 1M daily active users.
    - 2M new photos every day (23 new photos / s)
    - Average photo size 200KB
- Space for one day of photos: 2M * 200KB = 400GB
- For 10 yrs: 400GB * 365 * 10 = 1,460,000GB, roughly 1.4PB (1,425TB at 1,024GB per TB)
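
The storage estimate can be reproduced with a few lines of arithmetic. This sketch uses decimal units (1GB = 10^9 bytes), so the 10-year figure comes out as 1,460TB; the variable names are illustrative.

```python
KB, GB, TB = 10**3, 10**9, 10**12
SECONDS_PER_DAY = 24 * 60 * 60

photos_per_day = 2_000_000
avg_photo_bytes = 200 * KB

photos_per_second = photos_per_day / SECONDS_PER_DAY
daily_bytes = photos_per_day * avg_photo_bytes
ten_year_bytes = daily_bytes * 365 * 10

print(f"{photos_per_second:.0f} photos/s")
print(f"{daily_bytes / GB:.0f} GB per day")
print(f"{ten_year_bytes / TB:.0f} TB over 10 years")
```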

## API

## DB
- Need to index on (PhotoID, CreationDate) since we want to fetch recent photos.
- Store metadata in distributed key-value storage.
    - Key: PhotoID, value: object containing PhotoLocation.
- Store photos in distributed file storage like HDFS or S3.

Schema "Photo"
- PhotoID (int, PK)
- UserID (int)
- PhotoPath (varchar)
- PhotoLatitude (int)
- PhotoLongitude (int)
- UserLatitude (int)
- UserLongitude (int)
- CreationDate (datetime)

Schema "User"
- UserID (int, PK)
- Name (varchar)
- Email (varchar)
- CreationDate (datetime)
- LastLogin (datetime)

Schema "UserFollow" 
- FollowerID (int, PK)
- FolloweeID (int, PK)
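
To make the three schemas concrete, here is a small sketch that creates them in an in-memory SQLite database and fetches the most recent photos from the users someone follows. SQLite is only a stand-in chosen for a runnable example (the notes call for distributed storage), and the index name, sample rows, and integer timestamps are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE User (
    UserID INTEGER PRIMARY KEY,
    Name TEXT,
    Email TEXT,
    CreationDate INTEGER,
    LastLogin INTEGER
);
CREATE TABLE Photo (
    PhotoID INTEGER PRIMARY KEY,
    UserID INTEGER,
    PhotoPath TEXT,
    PhotoLatitude INTEGER,
    PhotoLongitude INTEGER,
    UserLatitude INTEGER,
    UserLongitude INTEGER,
    CreationDate INTEGER
);
CREATE TABLE UserFollow (
    FollowerID INTEGER,
    FolloweeID INTEGER,
    PRIMARY KEY (FollowerID, FolloweeID)  -- composite key
);
CREATE INDEX idx_photo_recent ON Photo (PhotoID, CreationDate);
""")

conn.executemany("INSERT INTO UserFollow VALUES (?, ?)", [(1, 2), (1, 3)])
conn.executemany(
    "INSERT INTO Photo (PhotoID, UserID, PhotoPath, CreationDate) VALUES (?, ?, ?, ?)",
    [
        (10, 2, "photos/10.jpg", 100),
        (11, 3, "photos/11.jpg", 300),
        (12, 4, "photos/12.jpg", 200),  # user 4 is not followed by user 1
    ],
)

# Most recent photos from the users that user 1 follows.
recent = conn.execute(
    """
    SELECT p.PhotoID
    FROM Photo AS p
    JOIN UserFollow AS f ON f.FolloweeID = p.UserID
    WHERE f.FollowerID = ?
    ORDER BY p.CreationDate DESC
    LIMIT ?
    """,
    (1, 10),
).fetchall()
print([row[0] for row in recent])
```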
    
Data size
- Assume "int" and "datetime" are 4 bytes.

User 
- UserID: 4 bytes
- Name: 20 bytes
- Email: 32 bytes
- DateOfBirth: 4 bytes
- CreationDate: 4 bytes
- LastLogin: 4 bytes 
- Total: 68 bytes, rounded up to approximately 100 bytes
- With 500M users, we need 100 * 500M = 50GB

Photo
- PhotoID: 4 bytes
- UserID: 4 bytes
- PhotoPath: 256 bytes
- PhotoLatitude: 4 bytes
- PhotoLongitude: 4 bytes
- UserLatitude: 4 bytes
- UserLongitude: 4 bytes
- CreationDate: 4 bytes
- Total: 284 bytes, rounded up to approximately 300 bytes
- With 2M photos every day, we need 300 * 2M = 0.6GB per day
- For 10 yrs, we need 0.6GB * 365 * 10 ≈ 2.2TB

UserFollow
- Assume each user follows 500 other users and each row in the UserFollow table is 8 bytes: 500M * 500 * 8 bytes = 2 * 10^12 bytes ≈ 1.8TB (in binary units)

Total space: 50GB + 2.2TB + 1.8TB ≈ 4TB
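
The metadata estimate can be checked the same way. This sketch sums the field sizes listed above, then applies the rounded row sizes (100 and 300 bytes) used in the notes. It works in decimal units, so the UserFollow figure appears as 2TB rather than 1.8TB and the total lands slightly above 4TB.

```python
GB, TB = 10**9, 10**12

# Exact row sizes from the field lists above.
user_row = 4 + 20 + 32 + 4 + 4 + 4              # 68 bytes
photo_row = 4 + 4 + 256 + 4 + 4 + 4 + 4 + 4     # 284 bytes
follow_row = 4 + 4                              # FollowerID + FolloweeID

total_users = 500_000_000
photos_per_day = 2_000_000
follows_per_user = 500

# Rounded row sizes, as used in the notes.
user_bytes = 100 * total_users
photo_bytes_10y = 300 * photos_per_day * 365 * 10
follow_bytes = total_users * follows_per_user * follow_row

total_bytes = user_bytes + photo_bytes_10y + follow_bytes
print(f"User: {user_bytes / GB:.0f} GB")
print(f"Photo (10 yrs): {photo_bytes_10y / TB:.2f} TB")
print(f"UserFollow: {follow_bytes / TB:.2f} TB")
print(f"Total: {total_bytes / TB:.2f} TB")
```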

## Data partitioning
- Partition photos into different DBs based on PhotoID (for example, PhotoID % 10), which can be generated by a key generation service.
- In the beginning, we can put all DBs on a single server. As the service scales, we can migrate the DBs to additional DB servers one by one.
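
A minimal sketch of this scheme: the logical shard is fixed by PhotoID % 10, while a separate mapping records which physical server currently hosts each shard. The server names and the mapping dictionary are hypothetical.

```python
NUM_SHARDS = 10  # logical DBs, fixed up front


def shard_for(photo_id):
    """Logical DB that owns a photo, following the PhotoID % 10 example."""
    return photo_id % NUM_SHARDS


# Initially every logical DB lives on one physical server.
shard_to_server = {shard: "db-server-0" for shard in range(NUM_SHARDS)}


def server_for(photo_id):
    """Physical server currently hosting the photo's logical DB."""
    return shard_to_server[shard_for(photo_id)]


# Later, as the service grows, move one logical DB to a new server.
shard_to_server[7] = "db-server-1"

print(server_for(1234567), server_for(1234560))
```

Because the number of logical DBs never changes, migrating a DB only updates the mapping; no PhotoID has to be rehashed.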

## Caching
- Cache servers: CDN
- Application servers: LRU
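
As an illustration of LRU eviction on an application server, here is a small cache built on `collections.OrderedDict`. The class name, capacity, and sample keys are assumptions; a production cache would more likely be a dedicated caching service.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-size cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the oldest entry


cache = LRUCache(capacity=2)
cache.put(1, "photos/1.jpg")
cache.put(2, "photos/2.jpg")
cache.get(1)                  # photo 1 is now the most recently used
cache.put(3, "photos/3.jpg")  # evicts photo 2
print(cache.get(2), cache.get(1), cache.get(3))
```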

## Load balancing

## DB cleanup