<h1 id="tocheading">Table of Contents</h1>
<div id="toc"></div>

In [1]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')


# A brief introduction to TopicFlow

TopicFlow is a tool that visualizes the results of automatic topic detection and topic alignment between sets of tweets over time. The tool was developed by Jianyu Li, Sana Malik, Panagis (Pano) Papadatos and Alison Smith originally as a team project for CMSC 734 Information Visualization at the University of Maryland. You can find more information about TopicFlow by reading the README.md and their papers:
- [TopicFlow: Visualizing Topic Alignment of Twitter Data over Time](https://wiki.cs.umd.edu/cmsc734_f12/images/0/05/TopicFlowFinalReport2.pdf)
- [Visual Analysis of Topical Evolution in Unstructured Text: Design and Evaluation of TopicFlow](http://link.springer.com/chapter/10.1007/978-3-319-19003-7_9)

By applying TopicFlow to [PERCEIVE](https://github.com/sailuh/perceive), we want to visualize the "flow" of topics in Full Disclosure documents, which may help us identify upcoming cybersecurity threats.

PERCEIVE is developed and maintained by a joint effort of many contributors. The role of TopicFlow in PERCEIVE can be summarized by the graph below:

![work flow](https://github.com/estepona/PERCEIVE-freddie/blob/master/notebook_graphs/work%20flow.png?raw=true)

While the output of the TopicFlow pipeline is the visualization, the output of this data transformation pipeline is a *run.py* file that lets a user create new TopicFlow projects or run an existing project.

Although TopicFlow is a powerful tool, it was designed to visualize only the flow of "tweets". To make TopicFlow work for Full Disclosure data, several changes were made to the original scripts:
1. changed all "tweet" related content to "doc" or "document" in the final visualization;
2. disabled 
```javascript
if ($("g #"+j)[0].style.display != "none") { }
``` 
in *controller.js* to avoid `style errors`. Otherwise, TopicFlow couldn't configure text data other than tweets.
3. removed all selectable datasets except the Full Disclosure 2012 dataset. Several changes were made in *index.html*, *controller.js*, and the */topicflow/data* directory. For example, in *controller.js*, the original version lets users choose among these datasets:
```javascript
var idToName = {"HCI" : "HCI", "ModernFamily" : "Modern Family", "catfood": "Catfood" , 
					"drugs" : "Drugs", "earthquake" : "Earthquake", "umd" : "UMD", "debate":"#debate", "chi":"CHI Conference", 
					"sandy" : "Sandy and NJ"}
```
In our final version, however, we only need the following dataset as the starting point:
```javascript
var idToName = {
                // add new idToName
                "Full_Disclosure_2012":"Full_Disclosure_2012"
                }
```

# Methodology

So how do we approach this? After exploring TopicFlow and discussing it with Carlos Paradis several times, I believe the best way to use TopicFlow without overhauling the original code is to modify only the parts that help us display our datasets. Since much of TopicFlow is hard-coded, this data transformation pipeline has to edit the actual scripts and generate new files with our Python program.

To better understand what we need to do, let's take a look at how TopicFlow works in its simplest form, a triangle.

![methodology](https://github.com/estepona/PERCEIVE-freddie/blob/master/notebook_graphs/methodology.png?raw=true)

Essentially, two files and one data directory control how TopicFlow works: *index.html* provides the place for the visualization and the basic information, */data/< project >* stores the actual data to display, and *controller.js* coordinates all the JavaScript scripts and tells TopicFlow how to read the data and how to visualize it. The highlighted elements are what we will be modifying or creating in this data transformation pipeline.

**Please note that this transformation pipeline only works for Full Disclosure data**

The functions in this pipeline only work for PERCEIVE datasets. To create a new project, make sure you have the following four directories with the listed structure: 
    
*path_tf (path of TopicFlow)*  
&nbsp;&nbsp;&nbsp;&nbsp;|- css  
&nbsp;&nbsp;&nbsp;&nbsp;|- data  
&nbsp;&nbsp;&nbsp;&nbsp;|- scripts  
&nbsp;&nbsp;&nbsp;&nbsp;|- index.html  
&nbsp;&nbsp;&nbsp;&nbsp;|- run.py  
&nbsp;&nbsp;&nbsp;&nbsp;|- ......  

*path_doc (path of parsed documents directory)*   
&nbsp;&nbsp;&nbsp;&nbsp;|- yyyy_mm  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|- yyyy_mm_relative_id.extension

*path_meta (path of metadata of parsed documents directory)*   
&nbsp;&nbsp;&nbsp;&nbsp;|- yyyy_mm  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|- yyyy_mm.csv   

*path_LDA  (path of LDA directory)*  
&nbsp;&nbsp;&nbsp;&nbsp;|- Document_Topic_Matrix  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|- mm.csv    
&nbsp;&nbsp;&nbsp;&nbsp;|- Topic_Flow  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|- topic_flow.csv  
&nbsp;&nbsp;&nbsp;&nbsp;|- Topic_Term_Matrix  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|- mm.csv    


**Example of 2014 Full Disclosure datasets**

*/&ast;&ast;/Topicflow*  
&nbsp;&nbsp;&nbsp;&nbsp;|- css  
&nbsp;&nbsp;&nbsp;&nbsp;|- data  
&nbsp;&nbsp;&nbsp;&nbsp;|- scripts  
&nbsp;&nbsp;&nbsp;&nbsp;|- index.html  
&nbsp;&nbsp;&nbsp;&nbsp;|- run.py  
&nbsp;&nbsp;&nbsp;&nbsp;|- ......  

*/&ast;&ast;/2014.parsed*   
&nbsp;&nbsp;&nbsp;&nbsp;|- 2014_01    
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|- 2014_Jan_0.reply.body.txt  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|- 2014_Jan_0.reply.body_no_signature.txt  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|- 2014_Jan_1.reply.body.txt  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|- 2014_Jan_1.reply.body_no_signature.txt  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|- ......  
&nbsp;&nbsp;&nbsp;&nbsp;|- 2014_02  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|- ......  
&nbsp;&nbsp;&nbsp;&nbsp;|- ......    

*/&ast;&ast;/2014.csv*   
&nbsp;&nbsp;&nbsp;&nbsp;|- 2014_01    
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|- 2014_Jan.csv    
&nbsp;&nbsp;&nbsp;&nbsp;|- 2014_02    
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|- 2014_Feb.csv    
&nbsp;&nbsp;&nbsp;&nbsp;|- ......    

*/&ast;&ast;/2014_k_10*  
&nbsp;&nbsp;&nbsp;&nbsp;|- Document_Topic_Matrix  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|- Jan.csv  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|- Feb.csv  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|- ......  
&nbsp;&nbsp;&nbsp;&nbsp;|- Topic_Flow  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|- topic_flow.csv  
&nbsp;&nbsp;&nbsp;&nbsp;|- Topic_Term_Matrix   
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|- Jan.csv  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|- Feb.csv  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|- ......  
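
Before creating a project, it may help to check that the four directories look the way the listing above describes. The snippet below is only a sketch: `find_missing_inputs` is a hypothetical helper that is not part of *run.py*, and the entries it checks are taken from the directory trees above.

```python
import os


def find_missing_inputs(path_tf, path_doc, path_meta, path_lda):
    """Return the expected files and directories that do not exist.

    Hypothetical helper, not part of run.py. The expected entries are
    copied from the directory trees listed in this notebook.
    """
    expected = [
        os.path.join(path_tf, 'data'),
        os.path.join(path_tf, 'scripts'),
        os.path.join(path_tf, 'index.html'),
        path_doc,
        path_meta,
        os.path.join(path_lda, 'Document_Topic_Matrix'),
        os.path.join(path_lda, 'Topic_Flow', 'topic_flow.csv'),
        os.path.join(path_lda, 'Topic_Term_Matrix'),
    ]
    # keep only the entries that are missing on disk
    return [path for path in expected if not os.path.exists(path)]
```

An empty list means every expected entry was found; otherwise the list names what still has to be put in place.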

# Walking Through All Functions

In this section, I'll try to explain how each data transformation function works in language that's easy to comprehend. The flow of `run.py` looks like:

![run.py flow](https://raw.githubusercontent.com/estepona/PERCEIVE-freddie/master/notebook_graphs/run.py%20flow.png)

Argparse and the local server will be covered in Function 7. Another function not covered in this notebook is **read_data**, which just reads the data and stores it as pandas.DataFrame objects. 

**Notice**: In TopicFlow directory, a backup of the original *index.html*, *scripts*, and *data* is included in */topicflow_backup*. Use it to restore missing *index.html* and *controller.js* if something went wrong.

## Function 1 - transform_doc

Let's start with reading all text files. Functions **transform_doc**, **transform_bins**, and **transform_topicSimilarity** load the datasets that the user intends to visualize in TopicFlow, transform the data into a format that TopicFlow can read, and create a JavaScript file in the new project data directory.

In order to transform the data, I think it's worth spending some time on reverse engineering. Let's first understand what the end result of **transform_doc** is and how it works. The end result is a file called *Doc.js* inside the project data directory. If the new project is named "FD2014", the path of the end result is `/topicflow/data/FD2014/Doc.js`. *Doc.js* is essentially a JavaScript function that contains all the document text and metadata, and calls another function defined in *controller.js* to read the data. The skeleton of *Doc.js* looks like:
```javascript
function populate_tweets_FD2014(){
    var tweet_data ={"1":{"tweet_id":1,"author":...,"tweet_date":...,"text":...}, "2":...
    readTweetJSON(tweet_data);
}
```
If you open it for the first time, the length of this file can be daunting, but it actually has a very simple structure. First, a JavaScript function called **populate_tweets_FD2014** ("FD2014" is the project name) is defined. Then, a variable called "tweet_data" is defined, with all the document data in JSON format as its value. Finally, the function **readTweetJSON** defined in *controller.js* is called to read the data in the "tweet_data" variable. 
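
The three-part structure described above (function header, JSON payload, reader call) can be sketched in a few lines of Python. This is a minimal illustration, not the code used in *run.py*: `wrap_as_js` is a hypothetical helper, and the toy document is made up.

```python
import json


def wrap_as_js(function_name, variable_name, reader_name, data):
    """Wrap a dictionary as a TopicFlow-style JavaScript data function.

    Hypothetical helper for illustration only.
    """
    prefix = 'function ' + function_name + '(){\nvar ' + variable_name + ' = '
    suffix = ';\n' + reader_name + '(' + variable_name + ');\n}'
    return prefix + json.dumps(data) + suffix


# a made-up document, shaped like one entry of tweet_data
toy_docs = {"1": {"tweet_id": 1, "author": "A. Author",
                  "tweet_date": "12/31/2013 16:46", "text": "hello"}}

doc_js = wrap_as_js('populate_tweets_FD2014', 'tweet_data',
                    'readTweetJSON', toy_docs)
print(doc_js)
```

The same wrapper shape applies to the other two data files; only the function name, the variable name, and the reader function change.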

One important thing to clarify here is the word "tweet", or "tweets". Although we are using TopicFlow to read data other than tweets, the file names and function names in TopicFlow reflect its original purpose by including "tweet" or "tweets". So many functions and code fragments in different files contain "tweet", and they are so intertwined, that I couldn't alter this naming convention at this stage. But luckily we can name this file *Doc.js* instead of *Tweet.js*. Hooray!

Okay, now let's see what the JSON part in *Doc.js* looks like:
```json
{
  "1": {
    "tweet_id": 1,
    "author": "Luciano Bello <luciano () debian org>",
    "tweet_date": "12\/31\/2013 16:46",
    "text": "..."
    },
  "2": {
    ...
    },
  ...
}
```
To make the data transformation work, we first initialize two dictionaries: one stores our document data (mainly from .txt files) and the required metadata, and the other stores the document ids and the names of the matching text files. The second dictionary will be used in **transform_bins**. After transforming the first dictionary into JSON format, we add the code before and after the JSON part, customized with the project name. Finally, we write the result to *Doc.js*. The overall flow looks like:

![transform_doc](https://raw.githubusercontent.com/estepona/PERCEIVE-freddie/master/notebook_graphs/transform_doc.png)

In [None]:
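import json
import os

import pandas as pd

# read_data() and path_tf are not defined in this cell; the function assumes
# both are available at module level in run.py.


def transform_doc(project_name, path_doc, path_meta, doc_extension):
    """
    Transform Full Disclosure email documents from .txt formats into
    JavaScript format that TopicFlow can read.

    Args:
        project_name  -- name of the new project
        path_doc      -- path of documents directory
        path_meta     -- path of documents metadata directory
        doc_extension -- file name suffix of the documents to read
                         (matched with str.endswith)

    Returns:
        a dictionary that maps document ids to .txt file names; it is
        used in transform_bins

    Outcome:
        "Doc.js"
    """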

    ### READ METADATA
    df_list = read_data(df_list=True)


    ### DATA TRANSFORMATION
    # initiate one main dictionary of Doc.js and one dictionary that maps 
    # document id with .txt file name
    tweet_data = {}    # which contains all elements of documents
    tweet_id_txt = {}  # use this for transform_bins
    
    # find documents
    id_pointer = 1     # tweet_id starts with 1
    for month_ix, folder in enumerate(os.listdir(path_doc)):
        tweet_id_txt[str(month_ix)] = {}
        tweet_id_txt[str(month_ix)]['id'] = []
        tweet_id_txt[str(month_ix)]['txt'] = []
        path_folder = os.path.join(path_doc, folder)
        # read .txt files with the user-specified extension
        txt_list = [x for x in os.listdir(path_folder) if x.endswith(doc_extension)]
        # find .txt files that match their metadata entries
        for txt in txt_list:
            txt_entry_elements = txt.split('.')[0].split('_') # looks like ['2005', 'Jan', '0']
            txt_entry_elements[1] = folder[-2:]               # looks like ['2005', '01', '0']
            txt_entry = '_'.join(txt_entry_elements)          # looks like '2005_01_0', use this to find document metadata in .csv file
            # only make a record if there's a match between .txt file and metadata,
            # and the file is readable.
            try:
                row = df_list[month_ix][df_list[month_ix]['id'] == txt_entry]
                author = row['author'].values[0]
                date = pd.to_datetime(row['date']).apply(lambda x: str(x.month) + '/' + str(x.day) + '/' + str(x.year) + ' ' + str(x.hour) + ':' + str(x.minute)).values[0]
                with open(os.path.join(path_folder, txt), 'r',
                          encoding='latin1') as textfile:     # notice the encoding
                    text = textfile.read().replace('"','').replace('http://','').replace('\\','').replace('\n','') # remove irregular characters
                
                # populate content
                tweet_data[str(id_pointer)] = {}
                tweet_id_txt[str(month_ix)]['id'].append(id_pointer)
                tweet_id_txt[str(month_ix)]['txt'].append(txt.split('.')[0] + '.txt')
                tweet_data[str(id_pointer)]['tweet_id'] = id_pointer
                tweet_data[str(id_pointer)]['author'] = author
                tweet_data[str(id_pointer)]['tweet_date'] = date
                tweet_data[str(id_pointer)]['text'] = text
                
                id_pointer += 1
            # if anything in the try block fails, the document is not recorded
            except Exception:
                # here, you can do things like listing files that can't be parsed
                # e.g. print(txt)
                pass
                
    # transform body into .json format
    json_tmp = json.dumps(tweet_data)

    # transform into .js format that TopicFlow can read
    prefix = 'function populate_tweets_' + project_name + '(){\nvar tweet_data ='
    posfix = ';\nreadTweetJSON(tweet_data);\n}'
    doc_js = prefix + json_tmp + posfix

    ### WRITE
    # make a directory named after project_name
    if not os.path.isdir(os.path.join(path_tf, 'data', project_name)):
        os.mkdir(os.path.join(path_tf, 'data', project_name))
        
    # write
    with open(os.path.join(path_tf, 'data', project_name, 'Doc.js'), 'w') as file:
        file.write(doc_js)

    print('\nDoc.js created,             20% complete.')
    
    return tweet_id_txt

After the function runs, the line "Doc.js created,             20% complete." is printed in the terminal. The newly created file should populate the document content on the right side of TopicFlow. Clicking a document should let a user see the author, date, and text of that document. 
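
Two small steps inside **transform_doc** are easy to try in isolation: mapping a text file name to its metadata id, and formatting the date. The sketch below copies that logic into two hypothetical helpers (`metadata_id` and `topicflow_date` are not names used in *run.py*) and uses `datetime` from the standard library instead of pandas.

```python
from datetime import datetime


def metadata_id(txt_name, folder_name):
    """Map e.g. '2014_Jan_0.reply.body.txt' in folder '2014_01' to '2014_01_0'."""
    parts = txt_name.split('.')[0].split('_')   # ['2014', 'Jan', '0']
    parts[1] = folder_name[-2:]                 # ['2014', '01', '0']
    return '_'.join(parts)


def topicflow_date(ts):
    """Format a timestamp the same way transform_doc does (no zero padding)."""
    return (str(ts.month) + '/' + str(ts.day) + '/' + str(ts.year)
            + ' ' + str(ts.hour) + ':' + str(ts.minute))


print(metadata_id('2014_Jan_0.reply.body.txt', '2014_01'))  # 2014_01_0
print(topicflow_date(datetime(2013, 12, 31, 16, 46)))       # 12/31/2013 16:46
print(topicflow_date(datetime(2014, 1, 5, 9, 5)))           # 1/5/2014 9:5
```

Note that single-digit hours and minutes are not zero-padded. Whether TopicFlow's date handling depends on padding is not something this notebook verifies, so treat the last line as an observation rather than a known problem.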

## Function 2 - transform_bins

**transform_bins** is the hardest part of the whole data transformation pipeline. Although it has the same three-part structure as **transform_doc**, the JSON part in **transform_bins** is much more complex and thus requires very careful handling of indexing and of putting data in the right place. Here is a quick glance at the model drawn by Carlos Paradis:
![bins_model](https://raw.githubusercontent.com/estepona/topicflow/master/data_model/bins_model.png)

Again, let's do reverse engineering. The end result is a file called *Bins.js* inside the project data directory. If the new project is named "FD2014", the path of the end result is `/topicflow/data/FD2014/Bins.js`. *Bins.js* is essentially a JavaScript function that divides all documents by time (for Full Disclosure data, by month), which is called binning, and stores the LDA data (document-topic scores and topic-word scores) for all document-topic and topic-word pairs. The skeleton of *Bins.js* looks like:
```javascript
function populate_bins_FD2014(){
    var bin_data ={"0":{"tweet_Ids":[...],"start_time":...,"bin_id":...,"topic_model":{...},"end_time":...},"1":...
    readBinJSON(bin_data);
}
```
Again, it can be daunting the first time you open it: the file is very long, but the structure stays the same. First, a JavaScript function called **populate_bins_FD2014** ("FD2014" is the project name) is defined. Then, a variable called "bin_data" is defined, with all the relevant data in JSON format as its value. Finally, the function **readBinJSON** defined in *controller.js* is called to read the data in the "bin_data" variable. 

Now let's see what the JSON part in *Bins.js* looks like:
```json
{
  "0": {
    "tweet_Ids": [1,2,3...],
    "start_time": "12/31/2013 16:46",
    "bin_id": 0,
    "topic_model": {
            "topic_doc": {
                    "0_0": {
                        "1": 0.00010030434072387,
                        "2": 0.36551017173243494,
                        ...
                    },
                    "0_1": {...},
                    ...
                },
            "doc_topic": {
                "1": {
                    "0_0": 0.00010030434072387,
                    "0_1": 0.00010030434072383,
                    ...
                },
                "2": {...},
                ...
            },
            "topic_word": {
                "0_0": {
                    "x86_64": 0.0361921097895964,
                    "i586": 0.0335562698609424,
                    ...
                },
                "0_1": {...},
                ...
            },
            "topic_prob": {
                "0": "0_0",
                "1": "0_1",
                ...
            }
        },
    "end_time": "1/31/2014 21:25"
    },
  "1": {
    ...
  },
  ...
}
```
To make the data transformation work, we first process and store all the data in a dictionary and transform it into JSON format. Then, we add the code before and after the JSON part, customized with the project name. Finally, we write the result to *Bins.js*. The overall flow looks like:

![transform_bins](https://github.com/estepona/PERCEIVE-freddie/blob/master/notebook_graphs/transform_bins.png?raw=true)

Details of how **transform_bins** works can be found in the comments. One thing to notice is that this function requires more care with indexing than the other functions: there are many document-topic and topic-word pairs to populate, and some indices start at 0 while others start at 1, a consistency issue that's hard to fix.
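
To make the indexing issue concrete, here is a small sketch of how one bin's `doc_topic` and `topic_doc` dictionaries relate. It assumes, based on the column lookup in **transform_bins**, that LDA topic columns are numbered from 1 while TopicFlow topic names are numbered from 0. Both helpers are hypothetical and use plain dictionaries instead of DataFrames.

```python
def topicflow_topic_name(month_ix, lda_topic_number):
    """LDA topic numbers start at 1; TopicFlow names look like '0_0', '0_1', ..."""
    return str(month_ix) + '_' + str(lda_topic_number - 1)


def build_topic_dicts(month_ix, doc_scores):
    """Build doc_topic and topic_doc for one bin.

    doc_scores maps a document id to its list of per-topic scores,
    ordered by LDA topic number (hypothetical input shape).
    """
    doc_topic, topic_doc = {}, {}
    for doc_id, scores in doc_scores.items():
        doc_topic[str(doc_id)] = {}
        for position, score in enumerate(scores):
            name = topicflow_topic_name(month_ix, position + 1)
            doc_topic[str(doc_id)][name] = score
            # topic_doc holds the same scores, keyed the other way round
            topic_doc.setdefault(name, {})[str(doc_id)] = score
    return doc_topic, topic_doc


doc_topic, topic_doc = build_topic_dicts(0, {1: [0.9, 0.1], 2: [0.3, 0.7]})
print(doc_topic)  # {'1': {'0_0': 0.9, '0_1': 0.1}, '2': {'0_0': 0.3, '0_1': 0.7}}
print(topic_doc)  # {'0_0': {'1': 0.9, '2': 0.3}, '0_1': {'1': 0.1, '2': 0.7}}
```

Keeping the "subtract one" step in a single named function is one way to avoid scattering the offset across several loops.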

In [None]:
def transform_bins(project_name, path_doc, path_meta, path_LDA, tweet_id_txt):
    """
    Transform LDA-generated topic-document matrices and topic-word
    matrices into JavaScript format that TopicFlow can read.

    Args:
        project_name -- name of the new project
        path_doc     -- path of documents directory
        path_LDA     -- path of LDA main directory, this directory should
                        contain 3 sub-directories: Document_Topic_Matrix,
                        Topic_Flow, and Topic_word_Matrix
        tweet_id_txt -- a dictionary that maps document id with .txt file name
                        generated by transform_doc     

    Outcome:
        "Bins.js"
    """

    ### DEFINE month_list, READ DATA
    month_list = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    # read df_list
    df_list = read_data(df_list=True)
    # read topic-doc data sets
    df_topic_doc = read_data(df_topic_doc=True)
    # read topic-word data sets
    df_topic_word = read_data(df_topic_word=True)


    ### DATA TRANSFORMATION - 1
    # initiate bins, each month is one bin, each bin is also a dictionary
    bin_dict = {}
    for month_ix in range(len(month_list)):
        bin_dict[str(month_ix)] = {}

    # populate bin_id
    for month_ix in range(len(month_list)):
        bin_dict[str(month_ix)]['bin_id'] = month_ix

    # populate tweet_ids
    for month_ix in range(len(month_list)):
        bin_dict[str(month_ix)]['tweet_Ids'] = tweet_id_txt[str(month_ix)]['id']

    # populate start_time & end_time
    # here we need input from df_list, specifically the length of each month
    # this part finds the earliest and latest document time in each month, and
    # transforms them into "m/d/yyyy h:m" format (no zero padding)
    for month_ix in range(len(month_list)):
        bin_dict[str(month_ix)]['start_time'] = pd.to_datetime(df_list[month_ix].date).sort_values().apply(lambda x: str(x.month) + '/' + str(x.day) + '/' + str(x.year) + ' ' + str(x.hour) + ':' + str(x.minute)).tolist()[0]
    for month_ix in range(len(month_list)):
        bin_dict[str(month_ix)]['end_time'] = pd.to_datetime(df_list[month_ix].date).sort_values().apply(lambda x: str(x.month) + '/' + str(x.day) + '/' + str(x.year) + ' ' + str(x.hour) + ':' + str(x.minute)).tolist()[-1]

    # initiate topic_model
    for month_ix in range(len(month_list)):
        bin_dict[str(month_ix)]['topic_model'] = {}
        # add 4 sub-entries: three dictionaries and one list (topic_prob)
        bin_dict[str(month_ix)]['topic_model']['topic_doc'] = {}
        bin_dict[str(month_ix)]['topic_model']['doc_topic'] = {}
        bin_dict[str(month_ix)]['topic_model']['topic_word'] = {}
        bin_dict[str(month_ix)]['topic_model']['topic_prob'] = []

        
    ### DATA TRANSFORMATION - 2: POPULATE topic_model
    # topic_model is the hardest part. We need to populate them month by month,
    # and one by one.
    for month_ix in range(len(month_list)):
        overlap = set(df_topic_doc[month_ix].index.tolist()) & set(tweet_id_txt[str(month_ix)]['txt'])
        overlap = list(overlap)
        df_topic_doc_overlap = df_topic_doc[month_ix].copy().loc[overlap, :]
        
        # topic_prob & topic_doc
        for prob in range(10):
            bin_dict[str(month_ix)]['topic_model']['topic_prob'].append(str(month_ix) + '_' + str(prob))
            # initiate topic_doc
            bin_dict[str(month_ix)]['topic_model']['topic_doc'][str(month_ix) + '_' + str(prob)] = {}
            overlap_id = [tweet_id_txt[str(month_ix)]['txt'].index(index_txtfile) for index_txtfile in df_topic_doc_overlap.index.tolist()]
            overlap_id = [tweet_id_txt[str(month_ix)]['id'][index_tweetid] for index_tweetid in overlap_id]
            for overlap_ix in range(len(overlap_id)):
                bin_dict[str(month_ix)]['topic_model']['topic_doc'][str(month_ix) + '_' + str(prob)][str(overlap_id[overlap_ix])] = df_topic_doc_overlap[str(int(prob + 1))].tolist()[overlap_ix]
        
        # doc_topic
        overlap_id = [tweet_id_txt[str(month_ix)]['txt'].index(index_txtfile) for index_txtfile in df_topic_doc_overlap.index.tolist()]
        overlap_id = [tweet_id_txt[str(month_ix)]['id'][index_tweetid] for index_tweetid in overlap_id] 
        for overlap_ix2 in range(len(overlap_id)):
            row = df_topic_doc_overlap.iloc[overlap_ix2, :].tolist()
            bin_dict[str(month_ix)]['topic_model']['doc_topic'][str(overlap_id[overlap_ix2])] = {}
            for row_ix in range(len(row)):
                bin_dict[str(month_ix)]['topic_model']['doc_topic'][str(overlap_id[overlap_ix2])][str(month_ix) + '_' + str(row_ix)] = row[row_ix]
            
        # topic_word
        for topic_word_ix in range(10):
            name = str(month_ix) + '_' + str(topic_word_ix)
            bin_dict[str(month_ix)]['topic_model']['topic_word'][name] = {}
            topwords = df_topic_word[month_ix].iloc[topic_word_ix].sort_values(ascending=False)[:10]
            topwords = np.around(topwords, 17)
            # we choose top 10 most frequent words, so here the range is 10
            for topword_ix in range(10):
                bin_dict[str(month_ix)]['topic_model']['topic_word'][name][topwords.index[topword_ix]] = topwords.values[topword_ix]
        
        # delete df_topic_doc_overlap to avoid overwriting errors and save memory
        del df_topic_doc_overlap
        

    ### TRANSFORM INTO JS FORMAT
    # transform bin_dict into an ordered dictionary
    bin_dict_ordered = {}

    key_order = ('tweet_Ids','start_time','bin_id','topic_model','end_time')
    for month_ix in range(len(month_list)):
        tmp = OrderedDict()
        for k in key_order:
            tmp[k] = bin_dict[str(month_ix)][k]
        bin_dict_ordered[str(month_ix)] = tmp

    # transform body into .json format
    json_tmp = json.dumps(bin_dict_ordered)

    # transform into .js format that TopicFlow can read
    prefix = 'function populate_bins_' + project_name + '(){\nvar bin_data = '
    posfix = ';\nreadBinJSON(bin_data);\n}'
    bins_js = prefix + json_tmp + posfix


    ### WRITE
    with open(os.path.join(path_tf, 'data', project_name, 'Bins.js'), 'w') as file:
        file.write(bins_js)

    print('Bins.js created,            40% complete.')

After the function runs, the line "Bins.js created,            40% complete." is printed in the terminal. The newly created file should populate both the bottom-left and center areas of TopicFlow. Each column in the visualization is a bin and each box is a topic.
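
The `topic_word` part of each bin keeps only the highest-scoring words of a topic. The selection step can be sketched without pandas as follows; `top_words` is a hypothetical helper and the scores are made up.

```python
def top_words(term_scores, n=10):
    """Return the n highest-scoring terms of one topic, best first."""
    ranked = sorted(term_scores.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:n])


# made-up scores for a single topic
row = {'x86_64': 0.036, 'i586': 0.034, 'the': 0.001, 'rpm': 0.020}
print(top_words(row, n=2))  # {'x86_64': 0.036, 'i586': 0.034}
```

Unlike the fixed `range(10)` loop in **transform_bins**, slicing the ranked list does not fail when a topic has fewer than `n` terms.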

## Function 3 - transform_topicSimilarity

After bins and topics are created, **transform_topicSimilarity** generates nodes and links between topics in adjacent bins. It has the same three-part structure as **transform_doc** and **transform_bins**. 

Reverse engineering! The end result is a file called *TopicSimilarity.js* inside the project data directory. If the new project is named "FD2014", the path of the end result is `/topicflow/data/FD2014/TopicSimilarity.js`. *TopicSimilarity.js* is essentially a JavaScript function that holds scores for how similar the topics in two adjacent bins are. The scores are also generated by the LDA algorithm. The skeleton of *TopicSimilarity.js* looks like:
```javascript
function populate_similarity_FD2014(){
    var sim_data ={"nodes":[{"name":...,"value":...},...],"links":[{"source":...,"target":...,"value":...},...]}
    readSimilarityJSON(sim_data);
}
```
*TopicSimilarity.js* is the shortest of the three data files and it follows a simple logic: we have nodes, and we score the links between nodes. As for the overall JavaScript structure, first, a function called **populate_similarity_FD2014** ("FD2014" is the project name) is defined. Then, a variable called "sim_data" is defined, with all the relevant data in JSON format as its value. Finally, the function **readSimilarityJSON** defined in *controller.js* is called to read the data in the "sim_data" variable. 

Now let's see what the JSON part in *TopicSimilarity.js* looks like:
```json
{
  "nodes": [
      {
          "name": "0_0",
          "value": 43
      },
      {
          "name": "0_1",
          "value": 57
      },
      ...
  ],
  "links": [
      {
          "source":1,
          "target":18,
          "value":233.6647080989732
      },
      {
          "source":2,
          "target":13,
          "value":183.70069470814772
      },
      ...
  ]
}
```
To make the data transformation work, we first process and store all the similarity data in a dictionary and transform it into JSON format. Then, we add the code before and after the JSON part, customized with the project name. Finally, we write the result to *TopicSimilarity.js*. The overall flow looks like:

![transform_topicSimilarity](https://github.com/estepona/PERCEIVE-freddie/blob/master/notebook_graphs/transform_topicSimilarity.png?raw=true)
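
The `source` and `target` values in `links` are positions in the `nodes` list, not topic names. The sketch below reproduces the index arithmetic used in **transform_topicSimilarity**, assuming 10 topics per bin and LDA topic numbers that start at 1. `node_index` and `make_link` are hypothetical helper names.

```python
TOPICS_PER_BIN = 10  # assumption: k = 10 topics per month


def node_index(month_ix, lda_topic_number, topics_per_bin=TOPICS_PER_BIN):
    """Position of a topic in the flat nodes list (0-based)."""
    return month_ix * topics_per_bin + lda_topic_number - 1


def make_link(month_ix, topic_a, topic_b, similarity, scale=200):
    """Link topic_a in one month to topic_b in the following month."""
    return {
        'source': node_index(month_ix, topic_a),
        'target': node_index(month_ix + 1, topic_b),
        'value': similarity * scale,  # scaling only affects line thickness
    }


print(make_link(0, 2, 9, 0.5))  # {'source': 1, 'target': 18, 'value': 100.0}
```

With these assumptions, topic 2 of the first month is node 1 and topic 9 of the second month is node 18, which matches the shape of the first link in the JSON example above.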

In [None]:
def transform_topicSimilarity(project_name, path_LDA):
    """
    Transform topic similarity matrix into JavaScript format
    that TopicFlow can read.

    Args:
        project_name -- name of the new project
        path_LDA     -- path of LDA main directory, this directory should
                        contain 3 sub-directories: Document_Topic_Matrix,
                        Topic_Flow, and Topic_word_Matrix

    Outcome:
        "TopicSimilarity.js"
    """

    ### DEFINE month_list, READ DATA
    month_list = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    df_topic_sim = read_data(df_topic_sim=True)


    ### DATA TRANSFORMATION
    # initiate a dictionary
    sim_dict = {}

    # populate nodes
    # put topics into nodes, record their orders
    nodes = []
    for i in range(len(month_list)):
        for j in range(10):
            tmp = {}
            name = str(i) + '_' + str(j)
            # how to calculate the value of a topic? the paper didn't define clearly
            # so here I use a random number
            value = np.random.randint(1,100)
            tmp['name'], tmp['value'] = name, value
            nodes.append(tmp)

    # populate links
    # put source, target, value into links
    links = []
    for month_ix in range(len(month_list) - 1):
        # get unique pairs between every two adjacent months, in total we have 11 pairs
        mm1, mm2 = month_list[month_ix], month_list[month_ix + 1]
        sim = mm1 + '_' + mm2 + '_similarity'
        df_tmp = df_topic_sim[[mm1, mm2, sim]].dropna(axis=0).drop_duplicates()
        for row_ix in range(len(df_tmp)):
            source = month_ix*10 + int(df_tmp[mm1].values[row_ix]) - 1
            target = (month_ix+1)*10 + int(df_tmp[mm2].values[row_ix]) - 1
            score = df_tmp[sim].values[row_ix] * 200 # 200 makes it neither too thin nor too thick
            link_tmp = {}
            link_tmp['source'], link_tmp['target'], link_tmp['value'] = source, target, score
            links.append(link_tmp)

    # put two lists into sim_dict
    sim_dict['nodes'], sim_dict['links'] = nodes, links


    ### TRANSFORM INTO JS FORMAT
    json_tmp = json.dumps(sim_dict)

    # finally, transform into .js format that TopicFlow can read
    prefix = 'function populate_similarity_' + project_name + '(){\nvar sim_data = '
    posfix = ';\nreadSimilarityJSON(sim_data);\n}'
    topicSimilarity_js = prefix + json_tmp + posfix


    ### WRITE
    with open(os.path.join(path_tf, 'data', project_name, 'TopicSimilarity.js'), 'w') as file:
        file.write(topicSimilarity_js)

    print('TopicSimilarity.js created, 60% complete.')

After the function runs, the line "TopicSimilarity.js created, 60% complete." is printed in the terminal. The newly created file should control the top-left panel of TopicFlow and the lines between different topics. These data drive the topic flow.

## Function 4 - modify_html 

As mentioned earlier, the *index.html* file in TopicFlow controls the loading of datasets and displays the dataset selector when a user starts TopicFlow or changes datasets. The two corresponding parts of *index.html* look like:
```html
<script src="data/Full_Disclosure_2012/Tweet.js"></script>
<script src="data/Full_Disclosure_2012/Bins.js"></script>
<script src="data/Full_Disclosure_2012/TopicSimilarity.js"></script>

<!-- add new section after this line -->
<!-- end of adding new datasets. -->
```

and

```html
<li id="Full_Disclosure_2012"><a href="#">Full_Disclosure_2012</a></li>
<!-- add new dataset selector after this line -->
<!-- end of adding new dataset selector -->
```


We let the function find the locations of the parts above and add code for a new dataset in the same style. To make the locations easy to find, four comment lines are placed in *index.html* as markers for the insertion points. The overall flow looks like:

![modify_html](https://github.com/estepona/PERCEIVE-freddie/blob/master/notebook_graphs/modify_html.png?raw=true)
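
The marker-based insertion can be sketched on a plain list of lines. This is an illustration under stated assumptions, not the code in *run.py*: `insert_after_marker` is a hypothetical helper, and it compares lines after stripping whitespace so that indentation in *index.html* does not matter, whereas **modify_html** below matches the marker lines exactly.

```python
def insert_after_marker(lines, marker, new_lines):
    """Return a copy of lines with new_lines inserted right after the marker."""
    for ix, line in enumerate(lines):
        if line.strip() == marker:
            return lines[:ix + 1] + list(new_lines) + lines[ix + 1:]
    raise ValueError('marker not found: ' + marker)


html_lines = [
    '<script src="data/Full_Disclosure_2012/Bins.js"></script>',
    '\t<!-- add new section after this line -->',
    '<!-- end of adding new datasets. -->',
]
updated = insert_after_marker(
    html_lines,
    '<!-- add new section after this line -->',
    ['<script src="data/FD2014/Doc.js"></script>'],
)
print('\n'.join(updated))
```

Raising an error when the marker is missing makes it obvious that *index.html* needs to be restored from the backup before a new project can be added.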

In [None]:
def modify_html(project_name, path_tf):
    """
    Modify the content of topicflow/index.html.
    Two hand-added comments are used to locate the lines where new content can be
    added. Executing the function would replace the existing index.html.
    
    Args:
        project_name -- name of the new project
        path_tf      -- path of topicflow directory
    
    Outcome:
        a modified "index.html" that includes a new project
    """
    # read existing index.html and split it into lines
    with open(os.path.join(path_tf, 'index.html'), 'r') as file:
        html = file.read()

    html_parse = html.split('\n')

    # add new section after '<!-- add new section after this line -->'
    ix = html_parse.index('<!-- add new section after this line -->')
    new_section = '<script src="data/SHA/Doc.js"></script>\n<script src="data/SHA/Bins.js"></script>\n<script src="data/SHA/TopicSimilarity.js"></script>\n'.replace('SHA',project_name)
    html_parse.insert(ix+1, new_section)

    # add new selector after '<!-- add new dataset selector after this line -->'
    ix = html_parse.index('\t\t\t<!-- add new dataset selector after this line -->')
    new_selector = '\t\t\t<li id="SHA"><a href="#">SHA</a></li>'.replace('SHA', project_name)
    html_parse.insert(ix+1, new_selector)

    # replace existing index.html
    html_combine = '\n'.join(html_parse)
    os.remove(os.path.join(path_tf, 'index.html'))
    with open(os.path.join(path_tf, 'index.html'), 'w') as file:
        file.write(html_combine)

    print('index.html modified,        80% complete.')

After the modification, the message "index.html modified,        80% complete." is printed in the terminal. The new *index.html* should contain the following changes. In this example, the new project is called "FD2014":
```html
<script src="data/Full_Disclosure_2012/Tweet.js"></script>
<script src="data/Full_Disclosure_2012/Bins.js"></script>
<script src="data/Full_Disclosure_2012/TopicSimilarity.js"></script>

<!-- add new section after this line -->
<script src="data/FD2014/Doc.js"></script>
<script src="data/FD2014/Bins.js"></script>
<script src="data/FD2014/TopicSimilarity.js"></script>
<!-- end of adding new datasets. -->
```
and
```html
<li id="Full_Disclosure_2012"><a href="#">Full_Disclosure_2012</a></li>
<!-- add new dataset selector after this line -->
<li id="FD2014"><a href="#">FD2014</a></li>
<!-- end of adding new dataset selector -->
```

## Function 5 - modify_controller

Following the same approach as **modify_html**, **modify_controller** locates two parts of *controller.js* that control how TopicFlow reads the data of the new project and which functions it calls to parse that data. Both parts are nested in the function **populateVisualization** in *controller.js* and look like:
```javascript
var idToName = {
                // add new idToName
                "Full_Disclosure_2012":"Full_Disclosure_2012"
                }
```
and
```javascript
// Populate the interface with the selected data set
if (selected_data==="Full_Disclosure_2012") {
    populate_tweets_Full_Disclosure_2012();
    populate_bins_Full_Disclosure_2012();
    populate_similarity_Full_Disclosure_2012();
}
// add new selected dataset here
// end of adding new selected datasets

```


The function finds the two locations shown above and adds code for a new idToName entry and a new selected dataset in the same style. To make these locations easy to find, three comment lines were placed in *controller.js* by hand, and the function searches for them to determine the insertion points. The overall flow looks like:

![modify_controller](https://github.com/estepona/PERCEIVE-freddie/blob/master/notebook_graphs/modify_controller.png?raw=true)

In [None]:
def modify_controller(project_name, path_tf):
    """
    Modify the content of topicflow/scripts/controller.js.
    Two hand-added comments are used to locate the lines where new content can be
    added. Executing the function would replace the existing controller.js.
    
    Args:
        project_name -- name of the new project
        path_tf      -- path of topicflow directory
    
    Outcome:
        a modified "controller.js" that includes a new project
    """
    # read existing controller.js and split it into lines
    with open(os.path.join(path_tf, 'scripts', 'controller.js'), 'r') as file:
        controller = file.read()

    controller_parse = controller.split('\n')

    # add idToName after '// add new idToName'
    ix = controller_parse.index('\t\t\t\t\t// add new idToName')
    new_idToName = '\t\t\t\t\t"SHA":"SHA",'.replace('SHA', project_name)
    controller_parse.insert(ix+1, new_idToName)

    # add selected dataset after '// add new selected dataset here'
    ix = controller_parse.index('\t// add new selected dataset here')
    new_selectedDataset = '\tif (selected_data==="SHA") {\n\t\tpopulate_tweets_SHA();\n\t\tpopulate_bins_SHA();\n\t\tpopulate_similarity_SHA();\n\t}'.replace('SHA', project_name)
    controller_parse.insert(ix+1, new_selectedDataset)

    # replace existing controller.js
    controller_combine = '\n'.join(controller_parse)
    os.remove(os.path.join(path_tf, 'scripts', 'controller.js'))
    with open(os.path.join(path_tf, 'scripts', 'controller.js'), 'w') as file:
        file.write(controller_combine)

    print('controller.js modified,     100% complete.')

After the modification, the message "controller.js modified,     100% complete." is printed in the terminal. The new *controller.js* should contain the following changes. We again use the new project "FD2014" as the example. Note that the function names (created in function **transform_doc**) end with the project name "FD2014":
```javascript
var idToName = {
					// add new idToName
					"FD2014":"FD2014",
					"Full_Disclosure_2012":"Full_Disclosure_2012"
                }
```
and
```javascript
// Populate the interface with the selected data set
if (selected_data==="Full_Disclosure_2012") {
    populate_tweets_Full_Disclosure_2012();
    populate_bins_Full_Disclosure_2012();
    populate_similarity_Full_Disclosure_2012();
}
// add new selected dataset here
if (selected_data==="FD2014") {
    populate_tweets_FD2014();
    populate_bins_FD2014();
    populate_similarity_FD2014();
}
// end of adding new selected datasets
```
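
Because both functions insert unconditionally, adding the same project twice would presumably leave duplicate entries in the files. A minimal sketch of a guard is shown below. The helper name `is_registered` and the toy `controller_lines` list are assumptions made for illustration and do not appear in *run.py*.

```python
def is_registered(controller_lines, project_name):
    """Return True if controller.js already dispatches on this project."""
    # The closing quote keeps "FD2014" from matching "FD2014_v2".
    needle = 'selected_data==="{}"'.format(project_name)
    return any(needle in line for line in controller_lines)


# Toy stand-in for the relevant part of controller.js (assumed content).
controller_lines = [
    '\tif (selected_data==="Full_Disclosure_2012") {',
    '\t\tpopulate_tweets_Full_Disclosure_2012();',
    '\t}',
    '\t// add new selected dataset here',
    '\t// end of adding new selected datasets',
]
print(is_registered(controller_lines, 'Full_Disclosure_2012'))  # True
print(is_registered(controller_lines, 'FD2014'))                # False
```

A caller could skip the insertion, or stop with a clear message, whenever the check returns True.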

## Function 6 - del_project

As the inverse of **modify_html** and **modify_controller**, function **del_project** deletes the content related to the specified project(s) from *index.html* and *controller.js*, as well as the project folder under */data*. Hand-placed comments such as "// add new selected dataset here" are used to locate the content. The overall flow looks like: 

![del_project](https://github.com/estepona/PERCEIVE-freddie/blob/master/notebook_graphs/del_project.png?raw=true)

In [1]:
def del_project(project_name_delete):
    """
    Delete an existing project. Content of the project in index.html, 
    controller.js, and data/<project> folder will be deleted. The base project 
    "Full_Disclosure_2012" should not be deleted.
    
    Args:
        project_name_delete -- name of the project that should be deleted
    
    Outcome:
        Removal of an existing project or multiple existing projects.
    """
    ### DELETE CONTENT IN index.html
    # read existing index.html and split it into lines
    with open(os.path.join(path_tf, 'index.html'), 'r') as file:
        html = file.read()
    html_parse = html.split('\n')

    # delete section after '<!-- add new section after this line -->'
    ix_1 = html_parse.index('<!-- add new section after this line -->')
    ix_2 = html_parse.index('<!-- end of adding new datasets. -->')
    delete_ix = 0
    for i_1 in range(ix_1, ix_2):
        # make sure only the specified project is deleted; the length check skips other projects whose names merely contain this name
        if project_name_delete in html_parse[i_1] and 'Doc.js' in html_parse[i_1] and len(html_parse[i_1]) == 36+len(project_name_delete):
            delete_ix = i_1
    for i_2 in range(4): # there are 4 lines for each project section, and we don't want to delete the end line
        if not delete_ix == 0:
            html_parse.pop(delete_ix)

    # delete dataset selector after '<!-- add new dataset selector after this line -->'
    ix_1 = html_parse.index('\t\t\t<!-- add new dataset selector after this line -->')
    ix_2 = html_parse.index('\t\t\t<!-- end of adding new dataset selector -->')
    for i_3 in range(ix_1, ix_2):
        if 'id="' + project_name_delete + '"' in html_parse[i_3]:
            html_parse.pop(i_3)
    
    # replace existing index.html
    html_combine = '\n'.join(html_parse)
    os.remove(os.path.join(path_tf, 'index.html'))
    with open(os.path.join(path_tf, 'index.html'), 'w') as file:
        file.write(html_combine)
    
    
    ### DELETE CONTENT IN controller.js
    # read existing controller.js and split it into lines
    with open(os.path.join(path_tf, 'scripts', 'controller.js'), 'r') as file:
        controller = file.read()
    controller_parse = controller.split('\n')

    # delete idToName after '// add new idToName'
    ix_1 = controller_parse.index('\t\t\t\t\t// add new idToName')
    ix_2 = controller_parse.index('\t\t\t\t\t"Full_Disclosure_2012":"Full_Disclosure_2012"')
    for i_4 in range(ix_1, ix_2):
        if '"' + project_name_delete + '"' in controller_parse[i_4]:
            controller_parse.pop(i_4)

    # delete selected dataset after '// add new selected dataset here'
    ix_1 = controller_parse.index('\t// add new selected dataset here')
    ix_2 = controller_parse.index('\t// end of adding new selected datasets')
    delete_ix = 0
    for i_5 in range(ix_1, ix_2):
        if '"' + project_name_delete + '"' in controller_parse[i_5]:
            delete_ix = i_5
    for i_6 in range(5): # there are 5 lines for each selected dataset, and we don't want to delete the end line
        if not delete_ix == 0:
            controller_parse.pop(delete_ix)

    # replace existing controller.js
    controller_combine = '\n'.join(controller_parse)
    os.remove(os.path.join(path_tf, 'scripts', 'controller.js'))
    with open(os.path.join(path_tf, 'scripts', 'controller.js'), 'w') as file:
        file.write(controller_combine)
    
    
    ### DELETE data.<project_name_delete> FOLDER
    # delete three .js files
    for js_file in os.listdir(os.path.join(path_tf, 'data', project_name_delete)):
        os.remove(os.path.join(path_tf, 'data', project_name_delete, js_file))
    # delete project folder
    os.rmdir(os.path.join(path_tf, 'data', project_name_delete))

After the deletion, the main block prints "Project successfully deleted, running TopicFlow now..." (or "Projects successfully deleted, ..." when several names were given). If you then examine *index.html*, *controller.js*, or the */data* directory, you will no longer see the specified project(s).
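
The index-based deletion depends on a fixed number of lines per project. As an alternative sketch, the same lines can be removed by filtering on the script path. The helper `remove_project_scripts` is hypothetical and assumes every project line starts with `<script src="data/<project>/` once whitespace is stripped.

```python
def remove_project_scripts(lines, project_name):
    """Drop the <script> lines that load data/<project_name>/*.js."""
    # The trailing slash keeps "FD2014" from matching "FD2014_v2".
    prefix = '<script src="data/{}/'.format(project_name)
    return [line for line in lines if not line.strip().startswith(prefix)]


# Toy stand-in for the relevant part of index.html (assumed content).
html_lines = [
    '<!-- add new section after this line -->',
    '<script src="data/FD2014/Doc.js"></script>',
    '<script src="data/FD2014/Bins.js"></script>',
    '<script src="data/FD2014/TopicSimilarity.js"></script>',
    '<script src="data/FD2014_v2/Doc.js"></script>',
    '<!-- end of adding new datasets. -->',
]
kept = remove_project_scripts(html_lines, 'FD2014')
print('\n'.join(kept))
```

Filtering builds a new list, so no indices shift while the lines are being examined. For the folder itself, the standard library's `shutil.rmtree` could replace the per-file loop followed by `os.rmdir`.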

## Function 7 - argparse and local server

Finally, under   
>`if __name__ == "__main__":`  

Three pieces of functionality are added to allow control from the terminal and to start a local server instance. 

Using the argparse library in `run.py` makes it easier for a user to add or delete project(s) from the command line and view the TopicFlow visualization on a local server, or simply run an existing project.

The script first checks whether the "delete" option was given. If so, only the deletion part is executed. Otherwise it checks whether the "add" option was given and, if so, adds a new project. In every case, including a plain `python run.py`, the script then starts a local server instance on a randomly chosen port between 9000 and 9999 and prints the port number in the terminal.
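
The delete-or-add logic can also be expressed directly in argparse. The sketch below is a variation and not the parser used in *run.py*: it assumes that `-a` should take exactly five values and that `-a` and `-d` should never be combined. The `metavar` names are placeholders chosen for this example.

```python
import argparse


def build_parser():
    """Build a parser in which -a and -d exclude each other."""
    parser = argparse.ArgumentParser(prog='run.py')
    group = parser.add_mutually_exclusive_group()
    # nargs=5 makes argparse reject an incomplete -a call up front.
    group.add_argument('-a', '--add', nargs=5,
                       metavar=('NAME', 'DOCS', 'META', 'EXT', 'LDA'),
                       help='add a new project (exactly five values)')
    group.add_argument('-d', '--delete', nargs='+', metavar='NAME',
                       help='delete one or more existing projects')
    return parser


parser = build_parser()
args = parser.parse_args(['-d', 'FD2014', 'FD2015'])
print(args.delete)  # ['FD2014', 'FD2015']
print(args.add)     # None
```

With `nargs=5`, a call with too few values fails with a usage message instead of raising an `IndexError` later when the values are unpacked.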

In [None]:
if __name__ == "__main__":
    # record the path of topicflow by stripping the 6-character file name 'run.py' from the script path
    path_tf = sys.argv[0][:-6]
    if len(path_tf) == 0:
        path_tf = '.'

    ### ARGPARSE
    parser = argparse.ArgumentParser(prog = 'run.py',
                                     description = 'A program that lets you create a new project and transforms your data into TopicFlow readable format, or just run TopicFlow and choose existing projects with the command "python run.py".',
                                     epilog = 'Then you can open a browser and type in localhost:<port number> to see the visualization! When done, just stop the process in terminal.')
    parser.add_argument('-a', '--add', type = str, nargs = '+',
                        help = 'If adding a new project. Please specify all the following items: [the name of the project, path of document folder, path of document metadata folder, document extension, path of LDA folder], 5 items in total. Enclosing each in double quotes, and don\'t forget the dots. Please don\'t use space when naming. For the document extension, choose from [.reply.body.txt, .reply.body_no_signature.txt, .reply.body_tags.txt, .reply.title_body.txt, .reply.title_body_no_signature.txt]. If running an existing project, no need to use this flag. EXAMPLE: python run.py -a "FD2014" "E:\\...\\data\\docs" "E:\\...\\data\\docs_metadata" ".reply.body.txt" "E:\\...\\data\\LDA".')
    parser.add_argument('-d', '--delete', type = str, nargs = '+',
                        help = 'Delete one or multiple existing projects. Specify the name(s) of the project(s) that should be deleted in double quotes. The base project "Full_Disclosure_2012" should not be deleted. Single deletion example: python run.py -d "FD2014". Multiple deletion example: python run.py -d "FD2014" "FD2015".')
    args = parser.parse_args()

    # delete existing project(s); if this branch runs, the add branch below is skipped
    if args.delete:
        for arg_del in args.delete:
            if os.path.isdir(os.path.join(path_tf, 'data', arg_del)):
                project_name_delete = arg_del
                del_project(project_name_delete)
        if len(args.delete) == 1:
            print('Project successfully deleted, running TopicFlow now...')
        elif len(args.delete) > 1:
            print('Projects successfully deleted, running TopicFlow now...')
    
    # add a new project
    elif args.add:
        project_name = args.add[0]
        path_doc = args.add[1]
        path_meta = args.add[2]
        doc_extension = args.add[3]
        path_LDA = args.add[4]
        
        time_start = time.time()
        
        if os.path.isdir(path_doc) and os.path.isdir(path_LDA):
            print('\nData transformation started...')
            tweet_id_txt = transform_doc(project_name, path_doc, path_meta, doc_extension)
            transform_bins(project_name, path_doc, path_meta, path_LDA, tweet_id_txt)
            transform_topicSimilarity(project_name, path_LDA)
            modify_html(project_name, path_tf)
            modify_controller(project_name, path_tf)
            print('\nTotal time taken:', str(round(time.time() - time_start, 2)), 'seconds.\n')


    ### INVOKE SERVER
    PORT = np.random.randint(9000, 10000)

    # change the working directory to topicflow
    os.chdir(path_tf)

    Handler = http.server.SimpleHTTPRequestHandler

    with socketserver.TCPServer(("", PORT), Handler) as httpd:
        print("serving at port", PORT)
        httpd.serve_forever()

In a terminal, the code above provides the following options:

command:  
`python run.py -h`

![example\_-h](https://raw.githubusercontent.com/estepona/PERCEIVE-freddie/master/notebook_graphs/example_-h.png)

Creating a new project called "FD2014" from the Full Disclosure 2014 documents and the 2014 LDA data looks like this in the terminal (note the five arguments):

command:  
`python run.py -a "FD2014" "E:\documents\Learning Materials\from_UMD\projects\PERCEIVE\data\New Crawler Full Disclosure\2014.parsed" "E:\documents\Learning Materials\from_UMD\projects\PERCEIVE\data\New Crawler Full Disclosure\2014.csv" ".reply.body_no_signature.txt" "E:\documents\Learning Materials\from_UMD\projects\PERCEIVE\data\LDA_VEM\estepona_2014_k_10"`

![example_FD2014](https://raw.githubusercontent.com/estepona/PERCEIVE-freddie/master/notebook_graphs/example_FD2014.png)

A timer is also included to see how long the data transformation takes. Just nice to know.

Now, let us see the end result of our data transformation in browser!

![example_FD2014_screenshot](https://raw.githubusercontent.com/estepona/PERCEIVE-freddie/master/notebook_graphs/example_FD2014_screenshot.png)

You can also **search** for a term in the bottom-left panel. In the example below, the search term "security" highlights a number of cells that are related to "security".

![example_FD2014_screenshot_search](https://raw.githubusercontent.com/estepona/PERCEIVE-freddie/master/notebook_graphs/example_FD2014_screenshot_search.png)

When you are done with the visualization, just stop the process in the terminal.
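
A randomly chosen port can already be in use. As a hedged alternative, binding to port 0 asks the operating system for any free port, and catching `KeyboardInterrupt` lets Ctrl+C stop the server without a traceback. The helper names `make_server` and `serve` are assumptions for this sketch, as is binding to `127.0.0.1` rather than to all interfaces.

```python
import http.server
import socketserver


def make_server(host='127.0.0.1', port=0):
    """Create the HTTP server; port 0 lets the OS choose a free port."""
    handler = http.server.SimpleHTTPRequestHandler
    return socketserver.TCPServer((host, port), handler)


def serve(httpd):
    """Serve until Ctrl+C, then release the socket."""
    try:
        httpd.serve_forever()
    except KeyboardInterrupt:
        print('stopping server')
    finally:
        httpd.server_close()


httpd = make_server()
chosen_port = httpd.server_address[1]  # the port the OS actually assigned
print('serving at port', chosen_port)
# serve(httpd)  # uncomment to block and serve until Ctrl+C
httpd.server_close()
```

The call to `serve` is commented out so that the sketch terminates by itself; in a script it would be the last statement.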

# Remaining Issues

Although we now have a working data transformation pipeline, one issue remains:

    
**Value of node**  
    In function **transform_topicSimilarity**, the value of each individual node is not clearly defined in the original paper, so the current approach generates a random integer from 1 to 99 (`np.random.randint(1,100)` excludes the upper bound) and assigns it as the value of the node.
```python
# how to calculate the value of a topic? the paper didn't define clearly
# so here I use a random number
value = np.random.randint(1,100)
tmp['name'], tmp['value'] = name, value
```
We'd like to know how the values are defined and make changes to the data transformation pipeline accordingly.
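
Until the definition is known, one deterministic candidate would be the number of documents whose most probable topic is the given topic. This is purely an assumption for illustration and is not the definition used by TopicFlow. The function `topic_sizes` and the `doc_topic` matrix are hypothetical stand-ins for the LDA output.

```python
def topic_sizes(doc_topic):
    """Count documents per topic, using each document's top topic.

    `doc_topic` is a list of rows, one per document, where each row
    holds that document's probability for every topic.
    """
    counts = [0] * len(doc_topic[0])
    for row in doc_topic:
        # Ties go to the lowest topic index.
        counts[row.index(max(row))] += 1
    return counts


# Toy document-topic probabilities: 4 documents, 3 topics (made-up values).
doc_topic = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
]
sizes = topic_sizes(doc_topic)
print(sizes)  # [2, 2, 0]
```

Unlike a random number, such a value stays the same between runs, so two visualizations of the same data would look identical.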