diff --git a/apsys2016/main.pdf b/apsys2016/main.pdf new file mode 100644 index 0000000..1e70600 Binary files /dev/null and b/apsys2016/main.pdf differ diff --git a/apsys2016/section/1_data.tex b/apsys2016/section/1_data.tex index a832fb0..eb562de 100644 --- a/apsys2016/section/1_data.tex +++ b/apsys2016/section/1_data.tex @@ -53,8 +53,8 @@ \section{Data Collection} {\color{red} (Work 3 and Work 5) Figure~\ref{fig:size} shows the file size distribution for VirusTotal malwares. -The smallest malware is only 704 bytes, and the largest one is more than 502 MB. -95.3\% of malwares fall into the range from 16 KB to 2 MB. +The smallest malware is only 704 bytes, and the largest one is more than 502\,MB. +95.3\% of malwares fall into the range from 16\,KB to 2\,MB. VirusTotal does not provide tags to differ 64-bit malwares from 32-bit malwares directly. We sample 10000 malwares and download their executable binaries from VirusTotal. We apply Linux command file to each sampled malware binary. diff --git a/apsys2016/section/3_1_LRU.tex b/apsys2016/section/3_1_LRU.tex index 74cb689..6555e2d 100644 --- a/apsys2016/section/3_1_LRU.tex +++ b/apsys2016/section/3_1_LRU.tex @@ -2,15 +2,17 @@ \section{Malware Temporal Properties} \label{sec:temporal} This section presents our study of the temporal properties of VirusTotal malwares -and answers two fundamental questions: -how many malware families appear everyday +and answers three fundamental questions: +how many malware families appear every day, +{\color{red} (Work 2) +how the lifetime of malwares on VirusTotal is distributed, +} and whether or not malwares occur in bursts. -To answer the second question, we design a new caching mechanism +To answer the last question, we design a new caching mechanism that can be used for both offline and online malware predictions. -This cache-based malware prediction technique can predict which malware families will appear in the near -future with high precision.
+This cache-based malware prediction technique can predict which malware families will appear in the near future with high precision. -\input{section/fignewfamily} +\input{section/figtemporal} We first study how many new malware families appear everyday. Figure~\ref{fig:new} shows the number of new malware families appearing on each day in November 2015. @@ -23,10 +25,25 @@ \section{Malware Temporal Properties} {\bf Observation 1:} {\em 100-400 new malware families appear each day.} -\input{section/hitsize} -\input{section/hitday} +%\input{section/hitsize} +%\input{section/hitday} + + +{\color{red} (Work 2) +Next, we study how the lifetime of malwares is distributed. +We define the lifetime of a submission as the time elapsed from when the malware was first seen on VirusTotal to when the submission was made. +As shown in Figure~\ref{fig:life}, malwares submitted to VirusTotal are quite new. +More than 75\% of malware submissions are the first time the malware is seen on VirusTotal. +For more than 90\% of malware submissions, +the submission occurs less than one month after the malware was first seen on VirusTotal. + +{\bf Observation 2:} +{\em Malwares on VirusTotal are quite new.} + +} + -Next, we investigate whether malwares behave temporal locality. +Finally, we investigate whether malwares exhibit temporal locality. Temporal locality is an important metric that can guide the prediction of near-future malwares. @@ -50,6 +67,7 @@ \section{Malware Temporal Properties} \ie, how many entries are inserted into and evicted from cache together. Cache prefetching loads spatially close entries into caches in advance. Cache replacement policy controls what entries to evict when cache is full. +{\color{red} (Writing 9) We use a simple cache setting in our evaluation. We fix the cache block size to be one, no prefetching, and use the LRU (least recently used) replacement policy. @@ -63,6 +81,7 @@ \section{Malware Temporal Properties} we create a new cache entry and add it into the front of the cache entry list.
If the cache is full, we evict the entry at the end of the list. The cache hit rate is calculated as follows: +} $$ \mbox{hit rate} = \dfrac{\mbox{\# of hits}}{\mbox{\# of hits + \# of misses}}$$ @@ -77,34 +96,45 @@ \section{Malware Temporal Properties} When using more than 80 cache entries, which is less than 1\% of the total number of malware families, the cache hit rate rises above 90\%, and when using more than 230 cache entries, which is less than 3\% of the total number of malware families, the cache hit rate rises above 95\%. -The high cache hit rate confirms that malwares occur in bursts. +{\color{red} (Writing 8) +The high cache hit rate and the small cache size confirm that malwares occur in bursts. +} -{\bf Observation 2:} +{\bf Observation 3:} {\em The occurrence of malwares in each family has strong temporal locality.} +{\color{red} (Writing 7) +In offline prediction, malwares encountered at each client site are merged together. +The merged data are used to predict, at a large scale, which malware families will appear in the near future. +Malwares encountered at one client site can be viewed as a random sample of all malwares. +If we can predict malware occurrence online, we can apply personalized defense and malware detection mechanisms on the client side. +} To support online malware occurrence prediction, it is essential to lower the performance overhead of running our cache mechanism. To this end, we lower the cache content update frequency from once per malware report to once per day. That is, we keep cache content unchanged to count cache hits and cache misses each day and update the cache content at the end of each day. In this second experiment, we fix the cache size to 200. -Figure~\ref{fig:batchcache} shows the cache hit rate for each day. +{\color{red} (Writing 1) +Figure~\ref{fig:batchcache} shows the prediction rate, \ie, the cache hit rate measured on the next day's data.
Most cache hit rate is above 70\%, -showing that even when lowering performance overhead, -our cache mechanism still achieves a good estimation of malware occurrences. +showing that even when we lower the performance overhead, \ie, update cache content only once a day, +our cache mechanism still achieves a good estimation of malware occurrences. +} +{\color{red} (Writing 9) +We do not need to combine malwares encountered at each client site and send out updated cache content frequently. Cache-based malware prediction can be used in on-line scenario. +} -{\color{red} +{\color{red} (Work 6) We also study which malware families cause more cache misses and have bad prediction results. We count cache misses for each malware family, and consider a family with more than half malwares causing cache misses as family with bad prediction results. -We find that whether a family will have bad prediction results is related to the cache size and the number of malwares contained in the family. +We find that whether a family has bad prediction results is related to the cache size and the number of malwares contained in the family. When cache size is 100, there are 6990 families with bad prediction results, and the largest number of malwares contained in one of these families is 2307. -When we change the cache size to 1000, +When we increase the cache size to 1000, there are 5458 families with bad prediction results, -and the largest number of malwares contained in one of these families is 126. +and the largest number of malwares contained in one of these families drops to 126. } -%\underline{Discussion.} -%Resources to combat malwares are limited. -%Any techniques that could allow antivirus vendors to focus their efforts would be great.
+ diff --git a/apsys2016/section/figtemporal.tex b/apsys2016/section/figtemporal.tex new file mode 100644 index 0000000..46620e0 --- /dev/null +++ b/apsys2016/section/figtemporal.tex @@ -0,0 +1,51 @@ +\begin{figure*}[!htb] +\minipage{0.31\textwidth} + \includegraphics[width=\linewidth]{figure/new_family} +\mycaption{fig:new}{New malware families on VirusTotal in November 2015.} +{ +The number of new malware families we observed every day in November 2015. +} + %\label{fig:overlap} +\endminipage\hfill +\minipage{0.31\textwidth} + \includegraphics[width=\linewidth]{figure/lifetime} + \mycaption{fig:life}{Malwares' lifetime in November 2015.} +{ +Cumulative distribution of malwares' lifetime in November 2015. +} + %\label{fig:maxUncover} +\endminipage\hfill +\minipage{0.31\textwidth}% + \includegraphics[width=\linewidth]{figure/LRU} + \mycaption{fig:cache}{Relation between cache hit rate and cache size.} +{Cache hit rate under different values of cache size from 10 to 1000.} + %\label{fig:aveUncover} +\endminipage\hfill + +\vspace{-0.1in} +\end{figure*} + +\begin{figure*}[!htb] +\minipage{0.31\textwidth} + \includegraphics[width=\linewidth]{figure/LRU_day} +\mycaption{fig:batchcache}{Cache hit rate in November 2015.} +{ +Cache hit rate every day in November 2015 if we only update cache content at the end of every day.
+} + %\label{fig:overlap} +\endminipage\hfill +\minipage{0.31\textwidth} + \includegraphics[width=\linewidth]{figure/id} + \mycaption{fig:id}{Skewness of malware submissions from different users in November 2015.} +{Cumulative distribution of malwares submitted by different users in November 2015.} + %\label{fig:maxUncover} +\endminipage\hfill +\minipage{0.31\textwidth}% + \includegraphics[width=\linewidth]{figure/country} + \mycaption{fig:country}{Skewness of malware submissions from different countries in November 2015.} +{Cumulative distribution of malwares submitted from different countries in November 2015.} + %\label{fig:aveUncover} +\endminipage\hfill + +\vspace{-0.1in} +\end{figure*} \ No newline at end of file diff --git a/apsys2016/section/hitsize.tex b/apsys2016/section/hitsize.tex index 47064bb..b243ad2 100644 --- a/apsys2016/section/hitsize.tex +++ b/apsys2016/section/hitsize.tex @@ -2,7 +2,8 @@ \begin{center} \includegraphics[width=2.5in]{figure/LRU} \mycaption{fig:cache}{Relation between cache hit rate and cache size.} -{Cache hit rate under different values of cache size from 10 to 1000.} +\mycaption{fig:batchcache}{Cache hit rate in November 2015.} +{Cache hit rate every day in November 2015 if we only update cache content at the end of every day.} %\label{fig:cache} \end{center} \vspace{-0.2in}
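The cache experiment described in section/3_1_LRU.tex (block size one, no prefetching, LRU replacement, hit rate = hits / (hits + misses)) can be sketched roughly as below. This is only an illustration under stated assumptions, not the authors' implementation: the function name `lru_hit_rate`, the use of Python's `OrderedDict`, and moving an entry to the front on a hit (the usual LRU behaviour, not visible in this diff) are all assumptions.

```python
from collections import OrderedDict
from typing import Hashable, Iterable


def lru_hit_rate(families: Iterable[Hashable], cache_size: int) -> float:
    """Replay a stream of malware family labels through an LRU cache.

    Hypothetical helper: each element of `families` stands for the family
    of one malware report, in arrival order.
    """
    if cache_size < 1:
        raise ValueError("cache_size must be at least 1")

    # The first item of the OrderedDict plays the role of the front of the
    # cache entry list; the last item is the end of the list.
    cache: "OrderedDict[Hashable, bool]" = OrderedDict()
    hits = 0
    misses = 0

    for family in families:
        if family in cache:
            hits += 1
            # Assumption: a hit moves the entry to the front (standard LRU).
            cache.move_to_end(family, last=False)
        else:
            misses += 1
            if len(cache) >= cache_size:
                # Cache is full: evict the entry at the end of the list.
                cache.popitem(last=True)
            # Create a new entry and add it to the front of the list.
            cache[family] = True
            cache.move_to_end(family, last=False)

    total = hits + misses
    return hits / total if total else 0.0
```

A bursty stream keeps the hit rate high even with a very small cache, while an interleaved stream does not, which is the intuition behind using the hit rate as evidence of temporal locality.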