vijay120/insight


Running this project

bash run.sh

The above command runs median_unique.py and words_tweeted.py from the src directory and writes their results to f1.txt and f2.txt in the tweet_output directory.

Time taken

3.5 hrs

Dependencies

  1. bisect
  2. sets
  3. collections

Optimizations that were implemented

To improve the performance of median_unique.py, I compute a rolling median over the integer stream. This uses Python's bisect library to keep a list sorted as integers arrive: bisect locates each insertion point in O(log n), although the underlying list insert itself is O(n) because elements must be shifted. Since the list is always sorted, each median lookup is then O(1): the middle element(s) can be read directly by index.
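
A minimal sketch of the technique, assuming nothing about the actual structure of median_unique.py (the class and variable names here are illustrative, not the ones in the repo):

  import bisect

  class RollingMedian:
      """Keep a sorted list of values; report the median by direct indexing."""

      def __init__(self):
          self._sorted = []

      def add(self, value):
          # bisect finds the insertion point in O(log n); the underlying
          # list.insert then shifts elements, which is O(n).
          bisect.insort(self._sorted, value)

      def median(self):
          # The list is always sorted, so the median is an index lookup.
          n = len(self._sorted)
          mid = n // 2
          if n % 2:
              return float(self._sorted[mid])
          return (self._sorted[mid - 1] + self._sorted[mid]) / 2.0

Feeding the stream 5, 2, 8, 1 through add() and calling median() after each insert yields 5.0, 3.5, 5.0, 3.5.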

Memory considerations

According to Google, approximately 500 million tweets are generated per day, which comes to about 3.5 billion tweets a week. Both of my solutions keep in-memory data structures: a collections.Counter in words_tweeted.py and a list in median_unique.py.

Assuming 16 GB of main memory (what my machine has), the theoretical maximum space available per entry in these in-memory objects is about 20 bits (16 × 2^32 / (3.5 × 10^9)), or roughly 2.5 bytes. For median_unique.py, I therefore stored the unique counts in an unsigned short array, since each element takes 2 bytes, under the ~2.5-byte limit. For words_tweeted.py, I used the high-performance Counter container to tally words.
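
A sketch of both structures under the assumptions above (the tweets and counts are made up for illustration):

  import bisect
  from array import array
  from collections import Counter

  # median_unique.py-style storage: typecode 'H' is an unsigned short, so
  # each element occupies 2 bytes (values cap at 65,535, which is plenty
  # for a per-tweet unique-word count).
  unique_counts = array('H')
  for count in (12, 7, 9):
      bisect.insort(unique_counts, count)  # bisect also works on arrays

  # words_tweeted.py-style counting: Counter tallies word occurrences.
  word_counts = Counter()
  for tweet in ("insight data engineering", "data is fun"):
      word_counts.update(tweet.split())

  print(list(unique_counts))  # [7, 9, 12]
  print(word_counts["data"])  # 2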

I have not formally characterized the memory footprint of the Counter container, and there does not seem to be good documentation for it online; but based on Stack Overflow posts and blog articles, the container appears suitable for high-performance in-memory storage.
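
One rough way to check the footprint empirically is sys.getsizeof. Note it is shallow (it does not follow references), so keys and values are summed separately here, and the result is still only an estimate:

  import sys
  from collections import Counter

  counts = Counter("the quick brown fox jumps over the lazy dog the fox".split())

  # getsizeof reports only the dict structure itself; add the keys and
  # values it references for a rough "deep" estimate.
  shallow = sys.getsizeof(counts)
  deep = shallow + sum(sys.getsizeof(k) + sys.getsizeof(v)
                       for k, v in counts.items())
  print("shallow: %d bytes, rough deep estimate: %d bytes" % (shallow, deep))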
