<h1 id="tocheading">Table of Contents</h1>
<div id="toc"></div>

In [1]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')

<IPython.core.display.Javascript object>

# A brief introduction to TopicFlow

TopicFlow is a tool that visualizes the results of automatic topic detection and topic alignment between sets of tweets over time. The tool was developed by Jianyu Li, Sana Malik, Panagis (Pano) Papadatos and Alison Smith originally as a team project for CMSC 734 Information Visualization at the University of Maryland. You can find more information about TopicFlow by reading the README.md and their papers:
- [TopicFlow: Visualizing Topic Alignment of Twitter Data over Time](https://wiki.cs.umd.edu/cmsc734_f12/images/0/05/TopicFlowFinalReport2.pdf)
- [Visual Analysis of Topical Evolution in Unstructured Text: Design and Evaluation of TopicFlow](http://link.springer.com/chapter/10.1007/978-3-319-19003-7_9)

For [PERCEIVE](https://github.com/sailuh/perceive), we use TopicFlow to visualize the "flow" of topics in Full Disclosure documents, which may help us identify upcoming cybersecurity threats. 

PERCEIVE is developed and maintained by a joint effort of many contributors. The role of TopicFlow in PERCEIVE can be simplified with the graph below:

![work flow](https://raw.githubusercontent.com/estepona/PERCEIVE-freddie/master/work%20flow%20diagram/work%20flow.jpg)

While the output of the TopicFlow pipeline is the visualization, the output of this data transformation pipeline is a *run.py* file that enables a user to create new TopicFlow projects or run an existing project. 

Although TopicFlow is a powerful tool, it was designed to visualize only the flow of "tweets". To make TopicFlow work for Full Disclosure data, several changes were made to the original scripts:
1. changed all "tweet" related content to "doc" or "document" in the final visualization;
2. disabled 
```javascript
if ($("g #"+j)[0].style.display != "none") { }
``` 
in *controller.js* to avoid style errors. With this check in place, TopicFlow could not handle text data other than tweets.
3. removed every selectable dataset except the Full Disclosure 2012 dataset. This required changes in *index.html*, *controller.js*, and the */topicflow/data* directory. For example, the original *controller.js* lets users choose among these datasets:
```javascript
var idToName = {"HCI" : "HCI", "ModernFamily" : "Modern Family", "catfood": "Catfood" , 
					"drugs" : "Drugs", "earthquake" : "Earthquake", "umd" : "UMD", "debate":"#debate", "chi":"CHI Conference", 
					"sandy" : "Sandy and NJ"}
```
In our final version, however, we only need the following dataset as the starting point:
```javascript
var idToName = {
                // add new idToName
                "Full_Disclosure_2012":"Full_Disclosure_2012"
                }
```

# Methodology

So how do we approach this? After exploring TopicFlow and discussing it with Carlos Paradis several times, I believe the best way to use TopicFlow without overhauling the original code is to modify only the parts that help us display our datasets. Since much of TopicFlow is hard-coded, this data transformation pipeline has to edit the actual scripts and generate new files with our Python program. 

To better understand what we need to do, let's take a look at how TopicFlow works in its simplest form, a triangle.

![methodology](https://raw.githubusercontent.com/estepona/PERCEIVE-freddie/master/work%20flow%20diagram/methodology.jpg)

Essentially, two files and one data directory control how TopicFlow works: *index.html* provides the place for the visualization and the basic information, */data/< project >* stores the actual data to display, and *controller.js* coordinates all the JavaScript scripts that tell TopicFlow how to read the data and how to visualize it. The highlighted elements are what we will be modifying or creating in this data transformation pipeline.

**Please note that this transformation pipeline only works for Full Disclosure data.**  
To create a new project, the two directories specified after "-a" must contain the following files or sub-directories:


*path_doc*  
&nbsp;&nbsp;&nbsp;&nbsp; |- yyyy_mm_index.txt  
&nbsp;&nbsp;&nbsp;&nbsp; |- Full_Disclosure_Mailing_List_mmyyyy.csv   

*path_LDA*  
&nbsp;&nbsp;&nbsp;&nbsp; |- Document_Topic_Matrix  
&nbsp;&nbsp;&nbsp;&nbsp; |- Topic_Flow  
&nbsp;&nbsp;&nbsp;&nbsp; |- Topic_Term_Matrix  
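
The layout above can be checked before any transformation starts. The following is a minimal sketch, not part of the original `run.py`: the helper name `check_project_inputs` is hypothetical, and it only assumes what the listing states (at least one .txt document, a `Full_Disclosure_Mailing_List_*.csv` metadata file, and the three LDA sub-directories).

```python
import os


def check_project_inputs(path_doc, path_lda):
    """Return a list of problems found; an empty list means the layout looks right.

    Hypothetical helper: it checks only the file and directory names listed
    in the notebook, not the content of the files.
    """
    problems = []

    doc_files = os.listdir(path_doc) if os.path.isdir(path_doc) else []
    if not doc_files:
        problems.append('path_doc is missing or empty: ' + path_doc)
    else:
        if not any(name.endswith('.txt') for name in doc_files):
            problems.append('no .txt documents found in path_doc')
        if not any(name.startswith('Full_Disclosure_Mailing_List_')
                   and name.endswith('.csv') for name in doc_files):
            problems.append('no Full_Disclosure_Mailing_List_*.csv file found in path_doc')

    for sub in ('Document_Topic_Matrix', 'Topic_Flow', 'Topic_Term_Matrix'):
        if not os.path.isdir(os.path.join(path_lda, sub)):
            problems.append('missing sub-directory in path_LDA: ' + sub)

    return problems
```

Calling `check_project_inputs(path_doc, path_LDA)` and stopping when the returned list is non-empty would surface a wrong "-a" argument before *index.html* or *controller.js* is touched.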

# Walking Through All Functions

In this section, I'll try to explain how each data transformation function works in language that's easy to follow. The flow of `run.py` looks like:

![run.py flow](https://raw.githubusercontent.com/estepona/PERCEIVE-freddie/master/work%20flow%20diagram/run.py%20flow.jpg)

Argparse and the local server will be covered in Function 6. Another function not covered in this notebook is **read_data**, which simply reads the data and stores it as pandas.DataFrame objects. 

**Notice**: A backup of topicflow is included. Use it to restore *index.html* and *controller.js* if something goes wrong.

## Function 1 - modify_html 

As mentioned earlier, the *index.html* file in TopicFlow controls the loading of datasets and displays the dataset selector when a user starts TopicFlow or changes datasets. The two corresponding parts of *index.html* look like:
```html
<script src="data/Full_Disclosure_2012/Tweet.js"></script>
<script src="data/Full_Disclosure_2012/Bins.js"></script>
<script src="data/Full_Disclosure_2012/TopicSimilarity.js"></script>

<!-- add new section after this line -->
<!-- end of adding new datasets. -->
```

and

```html
<li id="Full_Disclosure_2012"><a href="#">Full_Disclosure_2012</a></li>
<!-- add new dataset selector after this line -->
<!-- end of adding new dataset selector -->
```


We will let the function find the locations of the above parts and add code for a new dataset in the same style. To make the locations easy to find, four comment lines are placed in the file so that the program can quickly locate the insertion points. The overall flow looks like:

![modify_html](https://raw.githubusercontent.com/estepona/PERCEIVE-freddie/master/function%20graph/modify_html.jpg)

In [None]:
import os


def modify_html(project_name, path_tf):
    """
    Modify the content of topicflow/index.html.
    Two hand-added comments are used to locate the lines where new content can be
    added. Executing the function would replace the existing index.html.

    Args:
        project_name -- name of the new project
        path_tf      -- path of topicflow directory
    """
    # read existing index.html and parse by lines
    with open(os.path.join(path_tf, 'index.html'), 'r') as file:
        html = file.read()

    html_parse = html.split('\n')

    # add new section after '<!-- add new section after this line -->'
    ix = html_parse.index('<!-- add new section after this line -->')
    new_section = '<script src="data/SHA/Doc.js"></script>\n<script src="data/SHA/Bins.js"></script>\n<script src="data/SHA/TopicSimilarity.js"></script>\n'.replace('SHA',project_name)
    html_parse.insert(ix+1, new_section)

    # add new selector after '<!-- add new dataset selector after this line -->'
    ix = html_parse.index('\t\t\t<!-- add new dataset selector after this line -->')
    new_selector = '\t\t\t<li id="SHA"><a href="#">SHA</a></li>'.replace('SHA', project_name)
    html_parse.insert(ix+1, new_selector)

    # replace existing index.html
    html_combine = '\n'.join(html_parse)
    os.remove(os.path.join(path_tf, 'index.html'))
    with open(os.path.join(path_tf, 'index.html'), 'w') as file:
        file.write(html_combine)

    print('\nindex.html modified,        20% complete.')

After the modification, the line "index.html modified,        20% complete." is printed in the terminal. The new *index.html* should contain the following changes. In this example, the new project is called "Fre":
```html
<script src="data/Full_Disclosure_2012/Tweet.js"></script>
<script src="data/Full_Disclosure_2012/Bins.js"></script>
<script src="data/Full_Disclosure_2012/TopicSimilarity.js"></script>

<!-- add new section after this line -->
<script src="data/Fre/Doc.js"></script>
<script src="data/Fre/Bins.js"></script>
<script src="data/Fre/TopicSimilarity.js"></script>
<!-- end of adding new datasets. -->
```
and
```html
<li id="Full_Disclosure_2012"><a href="#">Full_Disclosure_2012</a></li>
<!-- add new dataset selector after this line -->
<li id="Fre"><a href="#">Fre</a></li>
<!-- end of adding new dataset selector -->
```
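
The marker-based insertion can also be tried in isolation on a list of lines. This is a minimal sketch rather than the function above: `insert_after_marker` is a hypothetical helper, and unlike **modify_html** it compares lines after stripping whitespace, so the indentation in front of a comment does not have to match exactly.

```python
def insert_after_marker(lines, marker, new_lines):
    """Return a copy of `lines` with `new_lines` inserted right after the marker line.

    The marker is compared after stripping whitespace, so leading tabs or
    spaces in front of the comment do not matter.
    """
    for position, line in enumerate(lines):
        if line.strip() == marker:
            return lines[:position + 1] + list(new_lines) + lines[position + 1:]
    raise ValueError('marker not found: ' + marker)


# a tiny stand-in for the relevant lines of index.html
html_lines = [
    '<!-- add new section after this line -->',
    '<!-- end of adding new datasets. -->',
    '\t\t\t<!-- add new dataset selector after this line -->',
    '\t\t\t<!-- end of adding new dataset selector -->',
]
project_name = 'Fre'

scripts = ['<script src="data/{}/{}.js"></script>'.format(project_name, part)
           for part in ('Doc', 'Bins', 'TopicSimilarity')]
html_lines = insert_after_marker(
    html_lines, '<!-- add new section after this line -->', scripts)
html_lines = insert_after_marker(
    html_lines,
    '<!-- add new dataset selector after this line -->',
    ['\t\t\t<li id="{0}"><a href="#">{0}</a></li>'.format(project_name)],
)
print('\n'.join(html_lines))
```

Raising an error when a marker is missing makes it obvious that *index.html* should be restored from the backup before trying again.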

## Function 2 - modify_controller

Following the same methodology as **modify_html**, **modify_controller** locates the two parts of *controller.js* that control how TopicFlow reads the data of our new project and which functions it calls to parse that data. Both parts are nested in the function **populateVisualization** in *controller.js* and look like:
```javascript
var idToName = {
                // add new idToName
                "Full_Disclosure_2012":"Full_Disclosure_2012"
                }
```
and
```javascript
// Populate the interface with the selected data set
if (selected_data==="Full_Disclosure_2012") {
    populate_tweets_Full_Disclosure_2012();
    populate_bins_Full_Disclosure_2012();
    populate_similarity_Full_Disclosure_2012();
}
// add new selected dataset here
// end of adding new selected datasets

```


We will let the function find the locations of the above parts and add code for a new idToName entry and a new selected dataset in the same style. To make the locations easy to find, three comment lines are placed in the file so that the program can quickly locate the insertion points. The overall flow looks like:

![modify_controller](https://raw.githubusercontent.com/estepona/PERCEIVE-freddie/master/function%20graph/modify_controller.jpg)

In [None]:
import os


def modify_controller(project_name, path_tf):
    """
    Modify the content of topicflow/scripts/controller.js.

    Two hand-added comments are used to locate the lines where new content can be
    added. Executing the function would replace the existing controller.js.

    Args:
        project_name -- name of the new project
        path_tf      -- path of topicflow directory
    """
    # read existing controller.js and parse by lines
    with open(os.path.join(path_tf, 'scripts', 'controller.js'), 'r') as file:
        controller = file.read()

    controller_parse = controller.split('\n')

    # add idToName after '// add new idToName'
    ix = controller_parse.index('\t\t\t\t\t// add new idToName')
    new_idToName = '\t\t\t\t\t"SHA":"SHA",'.replace('SHA', project_name)
    controller_parse.insert(ix+1, new_idToName)

    # add selected dataset after '// add new selected dataset here'
    ix = controller_parse.index('\t// add new selected dataset here')
    new_selectedDataset = '\tif (selected_data==="SHA") {\n\t\tpopulate_tweets_SHA();\n\t\tpopulate_bins_SHA();\n\t\tpopulate_similarity_SHA();\n\t}'.replace('SHA', project_name)
    controller_parse.insert(ix+1, new_selectedDataset)

    # replace existing controller.js
    controller_combine = '\n'.join(controller_parse)
    os.remove(os.path.join(path_tf, 'scripts', 'controller.js'))
    with open(os.path.join(path_tf, 'scripts', 'controller.js'), 'w') as file:
        file.write(controller_combine)

    print('controller.js modified,     40% complete.')

After the modification, the line "controller.js modified,     40% complete." is printed in the terminal. The new *controller.js* should contain the following changes. We again use the example of a new project called "Fre". Notice that the function names (created in **transform_doc**) end with the project name "Fre":
```javascript
var idToName = {
					// add new idToName
					"Fre":"Fre",
					"Full_Disclosure_2012":"Full_Disclosure_2012"
                }
```
and
```javascript
// Populate the interface with the selected data set
if (selected_data==="Full_Disclosure_2012") {
    populate_tweets_Full_Disclosure_2012();
    populate_bins_Full_Disclosure_2012();
    populate_similarity_Full_Disclosure_2012();
}
// add new selected dataset here
if (selected_data==="Fre") {
    populate_tweets_Fre();
    populate_bins_Fre();
    populate_similarity_Fre();
}
// end of adding new selected datasets
```
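
Because **modify_controller** inserts its lines unconditionally, running it twice for the same project would add the same entries twice. A small guard could look like the sketch below. It is not part of the function shown above: `is_registered` is a hypothetical helper, and it assumes the dispatch branch keeps the exact `selected_data==="<name>"` spelling used in *controller.js*.

```python
def is_registered(controller_text, project_name):
    """Return True if controller.js already has a dispatch branch for the project."""
    return 'selected_data==="{}"'.format(project_name) in controller_text


# a tiny stand-in for the relevant lines of controller.js
sample = '\n'.join([
    '\tif (selected_data==="Full_Disclosure_2012") {',
    '\t\tpopulate_tweets_Full_Disclosure_2012();',
    '\t}',
    '\t// add new selected dataset here',
])

for name in ('Full_Disclosure_2012', 'Fre'):
    if is_registered(sample, name):
        print(name + ' is already registered, skipping')
    else:
        print(name + ' can be added')
```

Including the closing quote in the search string keeps a project name that is a prefix of another name from matching by accident.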

## Function 3 - transform_doc

Here we move to the actual data transformation. The functions **transform_doc**, **transform_bins**, and **transform_topicSimilarity** load the datasets that the user intends to visualize in TopicFlow, transform the data into a format that TopicFlow can read, and each create a JavaScript file in the new project data directory.

To transform the data, it's worth spending some time on reverse engineering. Let's first understand what the end result of **transform_doc** is and how it works. The end result is a file called *Doc.js* inside the project data directory. If the new project is named "Fre", the path of the end result is `/topicflow/data/Fre/Doc.js`. *Doc.js* is essentially a JavaScript function that contains all the document text and metadata, and that calls another function defined in *controller.js* to read the data. The skeleton of *Doc.js* looks like:
```javascript
function populate_tweets_Fre(){
    var tweet_data ={"1":{"tweet_id":1,"author":...,"tweet_date":...,"text":...}, "2":...
    readTweetJSON(tweet_data);
}
```
If you open it for the first time, the length of this file may be daunting, but it actually has a very simple structure. First, a JavaScript function called **populate_tweets_Fre** ("Fre" is the project name) is defined. Then, a variable called "tweet_data" is defined, with all the document data in JSON format as its value. Finally, the function **readTweetJSON** defined in *controller.js* is called to read the data in the "tweet_data" variable. 

One important thing to clarify here is the word "tweet", or "tweets". Although we are using TopicFlow to read data other than tweets, the file names and function names in TopicFlow reflect its original purpose and contain "tweet" or "tweets". So many functions and code fragments in different files contain "tweet", and they are so intertwined, that I couldn't change this naming convention at this stage. Luckily, we can at least name this file *Doc.js* instead of *Tweet.js*. Hooray!

Okay, now let's see what the JSON part in *Doc.js* looks like:
```json
{
  "1": {
    "tweet_id": 1,
    "author": "Luciano Bello <luciano () debian org>",
    "tweet_date": "12\/31\/2013 16:46",
    "text": "..."
    },
  "2": {
    ...
  }
  ...
}
```
To make the data transformation work, we first store our document data (mainly in .txt files) and metadata in a pandas.DataFrame object, and then transform it into JSON format. I chose pandas.DataFrame because the data for each document contains no nested dictionaries or lists, and because it lets us inspect the structure as a table if we want to. Next, we add the code before and after the JSON part, customized only by the project name. Finally, we write the result to *Doc.js*. The overall flow looks like:

![transform_doc](https://raw.githubusercontent.com/estepona/PERCEIVE-freddie/master/function%20graph/transform_doc.jpg)
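
The three-part structure (prefix, JSON body, suffix) can be reproduced on a toy example. This sketch uses the standard-library `json` module instead of pandas so that it runs without extra dependencies. The helper name `build_doc_js` and the two sample records are made up for illustration.

```python
import json


def build_doc_js(project_name, documents):
    """Wrap document records in the function skeleton that Doc.js uses.

    `documents` is a list of dicts with 'author', 'tweet_date' and 'text';
    ids are assigned from 1 in list order.
    """
    tweet_data = {}
    for tweet_id, doc in enumerate(documents, start=1):
        tweet_data[str(tweet_id)] = {
            'tweet_id': tweet_id,
            'author': doc['author'],
            'tweet_date': doc['tweet_date'],
            'text': doc['text'],
        }
    prefix = 'function populate_tweets_' + project_name + '(){\nvar tweet_data ='
    postfix = ';\nreadTweetJSON(tweet_data);\n}'
    return prefix + json.dumps(tweet_data) + postfix


# made-up sample records, only to show the shape of the output
doc_js = build_doc_js('Fre', [
    {'author': 'Alice <alice () example org>', 'tweet_date': '1/2/2012 9:30',
     'text': 'first message'},
    {'author': 'Bob <bob () example org>', 'tweet_date': '1/3/2012 14:05',
     'text': 'second message'},
])
print(doc_js)
```

The keys of the outer object are the document ids as strings, which is the same shape that `DataFrame.to_json(orient='index')` produces in the function below.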

In [None]:
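import os

import pandas as pd


def transform_doc(project_name, path_doc):
    """
    Transform Full Disclosure email documents in .txt format into
    a JavaScript file that TopicFlow can read.

    Relies on read_data and on a module-level path_tf (the topicflow
    directory), neither of which is defined in this cell.

    Args:
        project_name -- name of the new project
        path_doc     -- path of documents directory

    Returns:
        None. The JavaScript string is written to
        <path_tf>/data/<project_name>/Doc.js.
    """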

    ### DEFINE month_list, READ DATA
    month_list = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    df_list = read_data(df_list=True)


    ### DATA TRANSFORMATION
    # initiate four elements of Doc.js
    tweet_id = None
    author = []
    tweet_date = []
    text = []

    # populate tweet_id
    tweet_count = 0
    for month_ix in range(len(month_list)):
        tweet_count += len(df_list[month_ix])
    tweet_id = list(range(1, tweet_count + 1))

    # populate author
    for month_ix in range(len(month_list)):
        author += df_list[month_ix].author.apply(lambda x: x.replace('"','')).tolist()

    # populate tweet_date
    for month_ix in range(len(month_list)):
        # transform time into "m/d/yyyy h:m" format (values are not zero-padded)
        tweet_date += pd.to_datetime(df_list[month_ix].dateStamp).apply(lambda x: str(x.month) + '/' + str(x.day) + '/' + str(x.year) + ' ' + str(x.hour) + ':' + str(x.minute)).tolist()

    # populate text
    for month_ix in range(len(month_list)):
        M = df_list[month_ix]
        for text_ix in range(len(M)):
            # 'k' points to the name of the file
            ix = str(M['k'].values[text_ix])
            # iterate and read .txt files of a month, add text to a list
            # it's worth noting the encoding is 'latin1'
            try:
                for file in os.listdir(path_doc):
                    if file.endswith(".txt"):
                        yr = file[:4]
                        break
                filename = yr + '_' + month_list[month_ix] + '_' + ix + '.txt'
                path_file = os.path.join(path_doc, filename)
                with open(path_file, 'r',
                          encoding='latin1') as textfile:
                    tmp = textfile.read().replace('"','').replace('http://','').replace('\\','').replace('\n','')
                text.append(tmp)
            except:
                text.append('empty document')

    ### TRANSFORM INTO JS FORMAT
    # transform into pd.DataFrame
    df_tmp = pd.DataFrame({'tweet_id':tweet_id, 'author':author, 'tweet_date': tweet_date, 'text': text},
                          columns=['tweet_id','author','tweet_date','text'],
                          index=tweet_id)

    # transform body into .json format
    json_tmp = df_tmp.to_json(orient='index')

    # transform into .js format that TopicFlow can read
    prefix = 'function populate_tweets_' + project_name + '(){\nvar tweet_data ='
    posfix = ';\nreadTweetJSON(tweet_data);\n}'
    doc_js = prefix + json_tmp + posfix


    ### WRITE
    # make a directory named after project_name
    if os.path.isdir(os.path.join(path_tf, 'data', project_name)) == False:
        os.mkdir(os.path.join(path_tf, 'data', project_name))

    # write
    with open(os.path.join(path_tf, 'data', project_name, 'Doc.js'), 'w') as file:
        file.write(doc_js)

    print('Doc.js created,             60% complete.')

After the function runs, the line "Doc.js created,             60% complete." is printed in the terminal. This newly created file should populate the document content on the right side of TopicFlow. Clicking a document should let a user see the author, date, and actual text of that document. 

## Function 4 - transform_bins

**transform_bins** is the hardest part of the whole data transformation pipeline. Although it has the same three-part structure as **transform_doc**, the JSON part in **transform_bins** is much more complex, and thus requires very careful handling of indexing and of putting data in the right place. Here we can take a quick look at the model drawn by Carlos Paradis:
![bins_model](https://raw.githubusercontent.com/estepona/topicflow/master/data_model/bins_model.png)

Again, let's do some reverse engineering. The end result is a file called *Bins.js* inside the project data directory. If the new project is named "Fre", the path of the end result is `/topicflow/data/Fre/Bins.js`. *Bins.js* is essentially a JavaScript function that divides all documents by time (for Full Disclosure data, by month), which is called binning, and stores the LDA data (document-topic scores and topic-word scores) of all the pairs. The skeleton of *Bins.js* looks like:
```javascript
function populate_bins_Fre(){
    var bin_data ={"0":{"tweet_Ids":[...],"start_time":...,"bin_id":...,"topic_model":{...},"end_time":...},"1":...
    readBinJSON(bin_data);
}
```
Again, it can be daunting the first time you open it: it's very lengthy, but the structure stays the same. First, a JavaScript function called **populate_bins_Fre** ("Fre" is the project name) is defined. Then, a variable called "bin_data" is defined, with all the relevant data in JSON format as its value. Finally, the function **readBinJSON** defined in *controller.js* is called to read the data in the "bin_data" variable. 

Now let's see what the JSON part in *Bins.js* looks like:
```json
{
  "0": {
    "tweet_Ids": [1,2,3...],
    "start_time": "12/31/2013 16:46",
    "bin_id": 0,
    "topic_model": {
            "topic_doc": {
                    "0_0": {
                        "1": 0.00010030434072387,
                        "2": 0.36551017173243494,
                        ...
                    },
                    "0_1": {...},
                    ...
                },
            "doc_topic": {
                "1": {
                    "0_0": 0.00010030434072387,
                    "0_1": 0.00010030434072383,
                    ...
                },
                "2": {...},
                ...
            },
            "topic_word": {
                "0_0": {
                    "x86_64": 0.0361921097895964,
                    "i586": 0.0335562698609424,
                    ...
                },
                "0_1": {...},
                ...
            },
            "topic_prob": {
                "0": "0_0",
                "1": "0_1",
                ...
            }
        },
    "end_time": "1/31/2014 21:25"
    },
  "1": {
    ...
  }
  ...
}
```
To make the data transformation work, we first process and store all the data in a dictionary, and then transform it into JSON format. Next, we add the code before and after the JSON part, customized only by the project name. Finally, we write the result to *Bins.js*. The overall flow looks like:

![transform_bins](https://raw.githubusercontent.com/estepona/PERCEIVE-freddie/master/function%20graph/transform_bins.jpg)
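
In the JSON sample above, "topic_doc" and "doc_topic" hold the same scores, keyed once by topic and once by document. A minimal sketch of that relationship, with made-up scores and a hypothetical helper `invert_scores`, is:

```python
def invert_scores(topic_doc):
    """Turn {topic: {doc_id: score}} into {doc_id: {topic: score}}."""
    doc_topic = {}
    for topic, docs in topic_doc.items():
        for doc_id, score in docs.items():
            doc_topic.setdefault(doc_id, {})[topic] = score
    return doc_topic


# made-up scores for two topics of bin 0 and two documents
topic_doc = {
    '0_0': {'1': 0.9, '2': 0.2},
    '0_1': {'1': 0.1, '2': 0.8},
}
doc_topic = invert_scores(topic_doc)
print(doc_topic)
```

The function below fills both dictionaries directly from the document-topic DataFrame, but checking that one is the inverse of the other is a cheap sanity test for indexing mistakes.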

Details of how **transform_bins** works can be found in the comments. One thing to notice is that this function needs more care with indexing than the other functions: there are many document-topic and topic-word pairs to populate, and sometimes the index starts at 0 and sometimes at 1, a consistency issue that's hard to fix.
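
One example of the mixed indexing is the "tweet_Ids" list: bins are numbered from 0, while document ids run from 1 across all months. The sketch below isolates that step. The helper name `assign_tweet_ids` and the month sizes are made up for illustration.

```python
def assign_tweet_ids(month_sizes):
    """Map each 0-based bin to its list of 1-based, globally numbered document ids."""
    bins = {}
    next_id = 1
    for bin_ix, size in enumerate(month_sizes):
        bins[str(bin_ix)] = list(range(next_id, next_id + size))
        next_id += size
    return bins


# three made-up months with 2, 3 and 1 documents
print(assign_tweet_ids([2, 3, 1]))
```

The ids continue across bins instead of restarting each month, which is what lets the same id refer to one document in both *Doc.js* and *Bins.js*.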

In [None]:
import json
import os
from collections import OrderedDict

import numpy as np
import pandas as pd


def transform_bins(project_name, path_doc, path_LDA):
    """
    Transform LDA-generated topic-document matrices and topic-term
    matrices into a JavaScript file that TopicFlow can read.

    Relies on read_data and on a module-level path_tf (the topicflow
    directory), neither of which is defined in this cell.

    Args:
        project_name -- name of the new project
        path_doc     -- path of documents directory
        path_LDA     -- path of LDA main directory, this directory should
                        contain 3 sub-directories: Document_Topic_Matrix,
                        Topic_Flow, and Topic_Term_Matrix

    Returns:
        None. The JavaScript string is written to
        <path_tf>/data/<project_name>/Bins.js.
    """

    ### DEFINE month_list, READ DATA
    month_list = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    # read df_list
    df_list = read_data(df_list=True)
    # read topic-doc data sets
    df_topic_doc = read_data(df_topic_doc=True)
    # read topic-word data sets
    df_topic_word = read_data(df_topic_word=True)


    ### DATA TRANSFORMATION - 1
    # initiate bins, each month is one bin, each bin is also a dictionary
    bin_dict = {}
    for month_ix in range(len(month_list)):
        bin_dict[str(month_ix)] = {}

    # populate bin_id
    for month_ix in range(len(month_list)):
        bin_dict[str(month_ix)]['bin_id'] = month_ix

    # populate tweet_ids
    # here we need input from df_list, specifically the length of each month
    for month_ix in range(len(month_list)):
        bin_dict[str(month_ix)]['tweet_Ids'] = []
    # two points recording the starting position of tweet_id of each month
    lo,hi = 1,1
    for month_ix in range(len(month_list)):
        hi += len(df_list[month_ix])
        for tweet_ix in range(lo,hi):
            bin_dict[str(month_ix)]['tweet_Ids'].append(tweet_ix)
        lo = hi

    # populate start_time & end_time
    # here we need input from df_list, specifically the dateStamp column
    # this part sorts out the earliest and latest time of a tweet in each month, and
    # transforms them into "m/d/yyyy h:m" format (values are not zero-padded)
    for month_ix in range(len(month_list)):
        bin_dict[str(month_ix)]['start_time'] = pd.to_datetime(df_list[month_ix].dateStamp).sort_values().apply(lambda x: str(x.month) + '/' + str(x.day) + '/' + str(x.year) + ' ' + str(x.hour) + ':' + str(x.minute)).tolist()[0]
    for month_ix in range(len(month_list)):
        bin_dict[str(month_ix)]['end_time'] = pd.to_datetime(df_list[month_ix].dateStamp).sort_values().apply(lambda x: str(x.month) + '/' + str(x.day) + '/' + str(x.year) + ' ' + str(x.hour) + ':' + str(x.minute)).tolist()[-1]


    # initiate topic_model
    for month_ix in range(len(month_list)):
        bin_dict[str(month_ix)]['topic_model'] = {}
        # add 4 sub dictionaries
        bin_dict[str(month_ix)]['topic_model']['topic_doc'] = {}
        bin_dict[str(month_ix)]['topic_model']['doc_topic'] = {}
        bin_dict[str(month_ix)]['topic_model']['topic_word'] = {}
        bin_dict[str(month_ix)]['topic_model']['topic_prob'] = {}


    ###  DATA TRANSFORMATION - 2: POPULATE topic_model

    # to begin this section, create a DataFrame mapping Topic-doc.
    # the documents in the df_topic_doc are not the same as in metadata.
    # Thus, before populating 4 sub dictionaries, first we need to find
    # all the overlapping documents

    # step 1, creates a list of the starting position of each month's tweet_id
    month_start_tweetIds = []
    tweet_count = 0
    for month_ix in range(len(month_list)):
        month_start_tweetIds.append(tweet_count)
        tweet_count += len(df_list[month_ix])

    # step 2, iterate and find the overlapping documents of every month
    for month_ix in range(len(month_list)):
        doc_df_topic_doc = []
        for i in df_topic_doc[month_ix].index.values:
            doc_df_topic_doc.append(int(i[13:-4]))
        overlap = set(doc_df_topic_doc) & set(df_list[month_ix]['k'].values)

        # step 3, create a DataFrame mapping the overlapping documents and 10 topics
        overlap_ix = []
        ix_list = df_topic_doc[month_ix].index.tolist()
        doc_year = df_topic_doc[0].index.values[0][4:8]
        for item in overlap:
            name = str(month_list[month_ix]) + '/' + doc_year + '_' + str(month_list[month_ix]) + '_' + str(item) + '.txt'
            overlap_ix.append(ix_list.index(name))
        df_topic_doc_overlap = df_topic_doc[month_ix].iloc[overlap_ix, : ].copy()

        # pre-step 4, add tweet_Ids to df_topic_doc_overlap
        overlap_tweetIds = []
        for k in df_topic_doc_overlap.index.values:
            name = int(k[13:-4])
            name_ix = df_list[month_ix]['k'].tolist().index(name) + 1
            name_ix += month_start_tweetIds[month_ix]
            overlap_tweetIds.append(name_ix)
        df_topic_doc_overlap['tweet_Ids'] = overlap_tweetIds

        # now we have the overlapping documents, we can populate 4 sub dictionaries
        # populate topic_prob
        L = len(df_topic_doc[month_ix].columns)
        for ix in range(L):
            T = str(month_ix) + '_' + str(ix)
            bin_dict[str(month_ix)]['topic_model']['topic_prob'][str(ix)] = T

        # populate topic_doc
        # create 10 topic keys
        for ix in range(L):
            T = str(month_ix) + '_' + str(ix)
            bin_dict[str(month_ix)]['topic_model']['topic_doc'][T] = {}
        # add doc values to these keys
        for ix_2 in range(L):
            T = str(month_ix) + '_' + str(ix_2)
            col_score = df_topic_doc_overlap[str(ix_2 + 1)].values # there is +1 here because in the csv there is no column named '0'
            col_score = np.around(col_score, 17)                 # reduce crazy long decimal points and scientific notations
            col_k = df_topic_doc_overlap['tweet_Ids'].values
            for ix_3 in range(len(col_score)):
                bin_dict[str(month_ix)]['topic_model']['topic_doc'][T][str(col_k[ix_3])] = col_score[ix_3]

        # populate doc_topic
        for ix_4 in range(len(df_topic_doc_overlap)):
            row_score = df_topic_doc_overlap.iloc[ix_4,:]
            row_score = np.around(row_score, 17)
            bin_dict[str(month_ix)]['topic_model']['doc_topic'][ str(int(row_score['tweet_Ids'])) ] = {}
            for ix_5 in range(L):
                name = str(month_ix) + '_' + str(ix_5)
                # .iloc makes the positional lookup explicit (the row labels are strings)
                bin_dict[str(month_ix)]['topic_model']['doc_topic'][ str(int(row_score['tweet_Ids'])) ][name] = row_score.iloc[ix_5]

        # populate topic_word
        for ix_6 in range(L):
            name = str(month_ix) + '_' + str(ix_6)
            bin_dict[str(month_ix)]['topic_model']['topic_word'][name] = {}
            topwords = df_topic_word[month_ix].iloc[ix_6].sort_values(ascending=False)[:10]
            topwords = np.around(topwords, 17)
            # we choose top 10 most frequent words, so here the range is 10
            for ix_7 in range(10):
                bin_dict[str(month_ix)]['topic_model']['topic_word'][name][topwords.index[ix_7]] = topwords.values[ix_7]

        # delete df_topic_doc_overlap to avoid overwriting errors
        del df_topic_doc_overlap

    ### TRANSFORM INTO JS FORMAT
    # transform bin_dict into an ordered dictionary
    bin_dict_ordered = {}

    key_order = ('tweet_Ids','start_time','bin_id','topic_model','end_time')
    for month_ix in range(len(month_list)):
        tmp = OrderedDict()
        for k in key_order:
            tmp[k] = bin_dict[str(month_ix)][k]
        bin_dict_ordered[str(month_ix)] = tmp

    # transform body into .json format
    json_tmp = json.dumps(bin_dict_ordered)

    # transform into .js format that TopicFlow can read
    prefix = 'function populate_bins_' + project_name + '(){\nvar bin_data = '
    posfix = ';\nreadBinJSON(bin_data);\n}'
    bins_js = prefix + json_tmp + posfix


    ### WRITE
    with open(os.path.join(path_tf, 'data', project_name, 'Bins.js'), 'w') as file:
        file.write(bins_js)

    print('Bins.js created,            80% complete.')

After the function runs, the line "Bins.js created,            80% complete." is printed in the terminal. This newly created file should populate both the bottom-left and center areas of TopicFlow. Each column in the visualization is a bin and each box is a topic.
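
The "topic_word" step of **transform_bins** keeps only the ten highest-weighted terms of each topic. The same selection can be sketched without pandas using `heapq.nlargest` from the standard library. The helper name `top_words` and the weights are made up for illustration.

```python
import heapq


def top_words(term_weights, n=10):
    """Return the n highest-weighted terms as a {term: weight} dict, highest first."""
    best = heapq.nlargest(n, term_weights.items(), key=lambda item: item[1])
    return dict(best)


# made-up term weights for a single topic
weights = {'x86_64': 0.036, 'i586': 0.033, 'kernel': 0.012, 'update': 0.020}
print(top_words(weights, n=2))
```

Unlike the fixed `range(10)` loop in the function above, this version simply returns fewer entries when a topic has fewer than `n` terms.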

## Function 5 - transform_topicSimilarity

After bins and topics are created, **transform_topicSimilarity** generates nodes and links between topics in adjacent bins. It has the same three-part structure as **transform_doc** and **transform_bins**. 

Reverse engineering again! The end result is a file called *TopicSimilarity.js* inside the project data directory. If the new project is named "Fre", the path of the end result is `/topicflow/data/Fre/TopicSimilarity.js`. *TopicSimilarity.js* is essentially a JavaScript function that scores how similar the topics in two adjacent bins are. The scores are also generated by the LDA algorithm. The skeleton of *TopicSimilarity.js* looks like:
```javascript
function populate_similarity_Fre(){
    var sim_data ={"nodes":[{"name":...,"value":...},...],"links":[{"source":...,"target":...,"value":...},...]}
    readSimilarityJSON(sim_data);
}
```
*TopicSimilarity.js* is the shortest of the three data files and follows a simple logic: we have nodes, and we score the links between nodes. As for the overall JavaScript structure, first, a function called **populate_similarity_Fre** ("Fre" is the project name) is defined. Then, a variable called "sim_data" is defined, with all the relevant data in JSON format as its value. Finally, the function **readSimilarityJSON** defined in *controller.js* is called to read the data in the "sim_data" variable. 

Now let's see what the JSON part in *TopicSimilarity.js* looks like:
```json
{
  "nodes": [
      {
          "name": "0_0",
          "value": 43
      },
      {
          "name": "0_1",
          "value": 57
      },
      ...
  ],
  "links": [
      {
          "source":1,
          "target":18,
          "value":233.6647080989732
      },
      {
          "source":2,
          "target":13,
          "value":183.70069470814772
      },
      ...
  ]
}
```
To make the data transformation work, we first process and store all the similarity data in a dictionary and transform it into JSON format. Then we add the code before and after the JSON part, customized only by the project name. Finally, we write the result to *TopicSimilarity.js*. The overall flow looks like:

![transform_topicSimilarity](https://raw.githubusercontent.com/estepona/PERCEIVE-freddie/master/function%20graph/transform_topicSimilarity.jpg)
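
The wrapping step can be tried in isolation. The following is a minimal sketch, not the pipeline code itself: `wrap_similarity_js` and `unwrap_similarity_js` are hypothetical helper names, and the payload is made up only to show the shape. Stripping the wrapper and parsing the remainder is one way to confirm that the JSON part survives the round trip.

```python
import json

PREFIX_TEMPLATE = 'function populate_similarity_{}(){{\nvar sim_data = '
POSTFIX = ';\nreadSimilarityJSON(sim_data);\n}'


def wrap_similarity_js(project_name, sim_dict):
    """Wrap a dictionary as the JavaScript function that TopicFlow loads."""
    return PREFIX_TEMPLATE.format(project_name) + json.dumps(sim_dict) + POSTFIX


def unwrap_similarity_js(project_name, js_text):
    """Strip the JavaScript wrapper and parse the JSON part back into a dict."""
    prefix = PREFIX_TEMPLATE.format(project_name)
    return json.loads(js_text[len(prefix):-len(POSTFIX)])


# made-up payload, for illustration only
example = {
    "nodes": [{"name": "0_0", "value": 43}, {"name": "1_0", "value": 57}],
    "links": [{"source": 0, "target": 1, "value": 233.66}],
}

js_text = wrap_similarity_js("Fre", example)
print(js_text)
print(unwrap_similarity_js("Fre", js_text) == example)
```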

In [None]:
def transform_topicSimilarity(project_name, path_LDA):
    """
    Transform topic similarity matrix into JavaScript format
    that TopicFlow can read.

    Args:
        project_name -- name of the new project
        path_LDA     -- path of LDA main directory, this directory should
                        contain 3 sub-directories: Document_Topic_Matrix,
                        Topic_Flow, and Topic_Term_Matrix

    Returns:
        None. The JavaScript formatted string is written to
        "TopicSimilarity.js" in the project data directory.
    """

    ### DEFINE month_list, READ DATA
    month_list = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    df_topic_sim = read_data(df_topic_sim=True)


    ### DATA TRANSFORMATION
    # initiate a dictionary
    sim_dict = {}

    # populate nodes
    # put topics into nodes, record their orders
    nodes = []
    for i in range(len(month_list)):
        for j in range(10):
            tmp = {}
            name = str(i) + '_' + str(j)
            # how to calculate the value of a topic? the paper didn't define clearly
            # so here I use a random number
            value = np.random.randint(1,100)
            tmp['name'], tmp['value'] = name, value
            nodes.append(tmp)

    # populate links
    # put source, target, value into links
    links = []
    for month_ix in range(len(month_list) - 1):
        # get unique pairs between every two adjacent months; in total we have 11 pairs
        mm1, mm2 = month_list[month_ix], month_list[month_ix + 1]
        sim = mm1 + '_' + mm2 + '_similarity'
        df_tmp = df_topic_sim[[mm1, mm2, sim]].dropna(axis=0).drop_duplicates()
        for row_ix in range(len(df_tmp)):
            source = month_ix*10 + int(df_tmp[mm1].values[row_ix]) - 1
            target = (month_ix+1)*10 + int(df_tmp[mm2].values[row_ix]) - 1
            score = df_tmp[sim].values[row_ix] * 200 # 200 makes it neither too thin nor too thick
            link_tmp = {}
            link_tmp['source'], link_tmp['target'], link_tmp['value'] = source, target, score
            links.append(link_tmp)

    # put two lists into sim_dict
    sim_dict['nodes'], sim_dict['links'] = nodes, links


    ### TRANSFORM INTO JS FORMAT
    json_tmp = json.dumps(sim_dict)

    # finally, transform into .js format that TopicFlow can read
    prefix = 'function populate_similarity_' + project_name + '(){\nvar sim_data = '
    posfix = ';\nreadSimilarityJSON(sim_data);\n}'
    topicSimilarity_js = prefix + json_tmp + posfix


    ### WRITE
    with open(os.path.join(path_tf, 'data', project_name, 'TopicSimilarity.js'), 'w') as file:
        file.write(topicSimilarity_js)

    print('TopicSimilarity.js created, 100% complete.')

After the function runs, the line "TopicSimilarity.js created, 100% complete." is printed in the terminal. The newly created file controls the top-left panel of TopicFlow and the lines drawn between topics. These data are what make up the topic flow.
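
The `source` and `target` values in each link are positions in the `nodes` list, not topic names, so the order in which nodes are appended matters. The sketch below restates that index arithmetic on its own, assuming 10 topics per bin and 1-based topic ids as in the function above. `node_index` and `node_name` are hypothetical helper names.

```python
TOPICS_PER_BIN = 10  # assumption: matches range(10) in transform_topicSimilarity


def node_index(bin_ix, topic_id):
    """Position in the nodes list for a 0-based bin and a 1-based topic id."""
    return bin_ix * TOPICS_PER_BIN + topic_id - 1


def node_name(index):
    """Recover the 'bin_topic' name stored at a given position."""
    return str(index // TOPICS_PER_BIN) + '_' + str(index % TOPICS_PER_BIN)


# topic 3 of Jan (bin 0) linked to topic 9 of Feb (bin 1)
source = node_index(0, 3)
target = node_index(1, 9)
print(source, node_name(source))
print(target, node_name(target))
```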

## Function 6 - argparse and local server

Finally, under   
>`if __name__ == "__main__":`  

two features are added: command-line control from the terminal and a local server instance.

Using the argparse library in `run.py` makes it easier for a user to add a project in terminal and see the TopicFlow visualization in a local server, or run an existing project.

In [None]:
### ARGPARSE
parser = argparse.ArgumentParser(prog = 'TopicFlow Creator',
                                 description = 'A program that lets you create a new project and transforms your data into TopicFlow readable format, or run an existing project.',
                                 epilog = 'Then you can open a browser and type in localhost:8000 to see the visualization! When done, just stop the process in terminal.')
parser.add_argument('-n', '--new',  type = str,
                    help = 'Enter the name of a new project, no space allowed.')
parser.add_argument('-a', '--add', type = str, nargs = '+',
                    help = 'Please specify the paths of [document files, LDA files], enclosing each in double quotes. If starting a new project, both paths should be specified. If running an existing project, no need to use this flag. EXAMPLE: -n "Trending" -a "E:\\...\\data\\docs" "E:\\...\\data\\LDA".')
args = parser.parse_args()


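if args.new:
    project_name = args.new
    # a new project needs both paths; exit with a usage message otherwise
    if not args.add or len(args.add) != 2:
        parser.error('-n requires -a with two paths: document files and LDA files.')
    path_doc, path_LDA = args.add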

    if os.path.isdir(path_doc) and os.path.isdir(path_LDA):
        modify_html(project_name, path_tf)
        modify_controller(project_name, path_tf)
        transform_doc(project_name, path_doc)
        transform_bins(project_name, path_doc, path_LDA)
        transform_topicSimilarity(project_name, path_LDA)

Creating a new project using Full Disclosure 2012 document and LDA data in terminal looks like:

![run.py -n -a](https://raw.githubusercontent.com/estepona/PERCEIVE-freddie/master/screenshots/run%20-n%20-a.png)

**command**:

`python topicflow\run.py -n "Fre" -a "E:\documents\Learning Materials\from_UMD\projects\PERCEIVE\data\Full Disclosure\2012 - Copy" "E:\documents\Learning Materials\from_UMD\projects\PERCEIVE\data\LDA_VEM\2012_k_10_12"`
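
The same flags can be checked without a terminal by passing a list to `parse_args`. This is a small sketch that rebuilds only the two arguments shown above; `build_parser` is a hypothetical helper and the paths are placeholders, not real directories.

```python
import argparse


def build_parser():
    """Rebuild the two flags used by run.py (help texts shortened)."""
    parser = argparse.ArgumentParser(prog='TopicFlow Creator')
    parser.add_argument('-n', '--new', type=str,
                        help='name of a new project')
    parser.add_argument('-a', '--add', type=str, nargs='+',
                        help='paths of [document files, LDA files]')
    return parser


# placeholder paths, for illustration only
args = build_parser().parse_args(['-n', 'Fre', '-a', 'data/docs', 'data/LDA'])
print(args.new)
print(args.add)

# with no flags at all, both attributes are None
empty = build_parser().parse_args([])
print(empty.new, empty.add)
```

Because `args.add` is `None` when `-a` is omitted, indexing it directly fails unless the script checks for it first.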

A timer is also included to see how long the data transformation takes. Just nice to know.
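
The timer code is not shown in this notebook. One simple way to time a step, assuming nothing beyond the standard library, is sketched below. `timed` is a hypothetical helper and `sum` stands in for the transformation functions.

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# stand-in workload; in run.py this would be the transformation calls
result, seconds = timed(sum, range(1000))
print('finished in {:.4f} seconds'.format(seconds))
```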

Now, let us see the end result of our data transformation at localhost:8000!

In [None]:
## INVOKE SERVER
PORT = 8000

# change the working directory to topicflow
os.chdir(path_tf)

Handler = http.server.SimpleHTTPRequestHandler

with socketserver.TCPServer(("", PORT), Handler) as httpd:
    print("serving at port", PORT)
    httpd.serve_forever()

![TopicFlow Fre](https://raw.githubusercontent.com/estepona/PERCEIVE-freddie/master/screenshots/TopicFlow-Fre.png)

When you are done with the visualization, just stop the process in terminal.
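
Stopping the process with Ctrl+C raises `KeyboardInterrupt` inside `serve_forever`. The sketch below is a variant of the server cell, not the code used in `run.py`. It catches that interrupt so the script exits without a traceback, and it passes the directory to the handler (supported from Python 3.7) instead of changing the working directory. `make_server` and `serve` are hypothetical helper names.

```python
import functools
import http.server
import socketserver


def make_server(directory, port=8000):
    """Create (but do not start) a server rooted at `directory`."""
    handler = functools.partial(http.server.SimpleHTTPRequestHandler,
                                directory=directory)
    return socketserver.TCPServer(("", port), handler)


def serve(directory, port=8000):
    """Serve until Ctrl+C is pressed, then release the port."""
    with make_server(directory, port) as httpd:
        print("serving at port", httpd.server_address[1])
        try:
            httpd.serve_forever()
        except KeyboardInterrupt:
            print("server stopped")
```

Calling `serve(path_tf)` would then take the place of the `os.chdir(path_tf)` call and the `with` block above.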

# Remaining Issues

Although we now have a working data transformation pipeline, some issues remain:

    
1. **Search in TopicFlow**  
    Search is a useful feature in the original version that lets a user search for keywords in the bottom-left panel. Based on the search, irrelevant topics, nodes, links, and documents are filtered out. However, because we disabled 
    ```javascript
    if ($("g #"+j)[0].style.display != "none") { }
    ```
    at the beginning to avoid `style errors`, TopicFlow can no longer show the relevant topics and documents after a search (nodes and links are still shown). If we re-enable this line, for some reason TopicFlow does not load the Full Disclosure documents. Working around this line of code took me quite some time, and I still don't know how to keep it while keeping TopicFlow working.
 
2. **Value of node**  
    In function **transform_topicSimilarity**, the value of each individual node is not clearly defined in the original paper, so I generate a random integer from 1 to 99 (the upper bound of `np.random.randint(1,100)` is exclusive) and assign it as the node's value.
    ```python
    # how to calculate the value of a topic? the paper didn't define clearly
    # so here I use a random number
    value = np.random.randint(1,100)
    tmp['name'], tmp['value'] = name, value
    ```
    I'd like to know how the values are defined and make changes to the data transformation pipeline accordingly.

3. **Additional options to manage projects**  
    Currently `run.py` only allows creating a new project or running an existing project. In the future we may want to add options such as removing a project.
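
For the second issue, one possible replacement for the random placeholder is to use the number of documents assigned to each topic. This is only a sketch under the assumption that a node's value is meant to reflect topic size, which the source does not confirm. `node_values` is a hypothetical helper, and the `(bin_ix, topic_ix)` pairs are made-up input, not the actual LDA output format.

```python
from collections import Counter


def node_values(doc_topics, n_bins, topics_per_bin=10):
    """
    Build the nodes list with document counts as values.

    doc_topics -- iterable of (bin_ix, topic_ix) pairs, one per document,
                  both 0-based (assumed input format)
    """
    counts = Counter(doc_topics)
    nodes = []
    # same ordering as transform_topicSimilarity: bin by bin, topic by topic
    for i in range(n_bins):
        for j in range(topics_per_bin):
            nodes.append({'name': str(i) + '_' + str(j),
                          'value': counts[(i, j)]})
    return nodes


# made-up assignments: three documents in bin 0, one in bin 1
assignments = [(0, 0), (0, 0), (0, 1), (1, 1)]
nodes = node_values(assignments, n_bins=2, topics_per_bin=2)
print(nodes)
```

Unlike the random values, this gives the same output on every run, so two runs of the same project produce the same visualization.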