vijay120/insight


Running this project

bash run.sh

The above command runs median_unique.py and words_tweeted.py from the src directory and writes their results to f1.txt and f2.txt in the tweet_output directory.

Time taken

3.5 hrs

Dependencies

  1. bisect
  2. sets
  3. collections

Optimizations that were implemented

To improve the performance of median_unique.py, I compute a rolling median over the integer stream. This uses Python's bisect library to keep a list sorted as integers arrive: bisect locates each insertion point in O(log n), although the underlying list insert itself is O(n) because elements must be shifted. Since the list is always sorted, each median lookup is then O(1): the middle element(s) can be read directly by index.
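
A minimal sketch of the technique, assuming nothing about the actual structure of median_unique.py (the class and variable names here are illustrative, not the ones in the repo):

  import bisect

  class RollingMedian:
      """Keep a sorted list of values; report the median by direct indexing."""

      def __init__(self):
          self._sorted = []

      def add(self, value):
          # bisect finds the insertion point in O(log n); the underlying
          # list.insert then shifts elements, which is O(n).
          bisect.insort(self._sorted, value)

      def median(self):
          # The list is always sorted, so the median is an index lookup.
          n = len(self._sorted)
          mid = n // 2
          if n % 2:
              return float(self._sorted[mid])
          return (self._sorted[mid - 1] + self._sorted[mid]) / 2.0

Feeding the stream 5, 2, 8, 1 through add() and calling median() after each insert yields 5.0, 3.5, 5.0, 3.5.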

Memory considerations

According to Google, approximately 500 million tweets are generated per day, which comes to about 3.5 billion tweets a week. Both of my solutions keep in-memory data structures: a collections.Counter in words_tweeted.py and a list in median_unique.py.

Assuming 16 GB of main memory (what my machine has), the theoretical maximum space available per entry in these in-memory objects is about 20 bits (16 × 2^32 / (3.5 × 10^9)), or roughly 2.5 bytes. For median_unique.py, I therefore stored the unique counts in an unsigned short array, since each element takes 2 bytes, under the ~2.5-byte limit. For words_tweeted.py, I used the high-performance Counter container to tally words.
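
A sketch of both structures under the assumptions above (the tweets and counts are made up for illustration):

  import bisect
  from array import array
  from collections import Counter

  # median_unique.py-style storage: typecode 'H' is an unsigned short, so
  # each element occupies 2 bytes (values cap at 65,535, which is plenty
  # for a per-tweet unique-word count).
  unique_counts = array('H')
  for count in (12, 7, 9):
      bisect.insort(unique_counts, count)  # bisect also works on arrays

  # words_tweeted.py-style counting: Counter tallies word occurrences.
  word_counts = Counter()
  for tweet in ("insight data engineering", "data is fun"):
      word_counts.update(tweet.split())

  print(list(unique_counts))  # [7, 9, 12]
  print(word_counts["data"])  # 2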

I have not formally characterized the memory footprint of the Counter container, and there does not seem to be good documentation for it online; but based on Stack Overflow posts and blog articles, the container appears suitable for high-performance in-memory storage.
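
One rough way to check the footprint empirically is sys.getsizeof. Note it is shallow (it does not follow references), so keys and values are summed separately here, and the result is still only an estimate:

  import sys
  from collections import Counter

  counts = Counter("the quick brown fox jumps over the lazy dog the fox".split())

  # getsizeof reports only the dict structure itself; add the keys and
  # values it references for a rough "deep" estimate.
  shallow = sys.getsizeof(counts)
  deep = shallow + sum(sys.getsizeof(k) + sys.getsizeof(v)
                       for k, v in counts.items())
  print("shallow: %d bytes, rough deep estimate: %d bytes" % (shallow, deep))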
