# Homework 1: CS-GY 9223
## Exploring 20 NewsGroups

In this homework, you will write a D3 visualization in JavaScript and integrate it into a Jupyter Notebook.

The goal of this exercise is to explore the **20 News Groups dataset**, a popular machine learning dataset that contains news articles grouped into 20 topics. Your visualization should receive the dataset and display a bar chart of the most frequent words in the dataset. The user should be able to filter the data by topic (for example, by clicking checkboxes or selecting from a drop-down menu). The user should also be able to export the documents from the selected topic back to Python (using a button).

In summary, your visualization should have the following capabilities:
- Display a bar chart with the top K words in the document collection
- Enable the user to filter the documents based on topic, and display a bar chart with the frequency of the top K words from that topic.
- Export the documents from the selected news topic back to Python (as a list of strings).
- The visualization has to be integrated with Python. The API should have two functions:
  - `plot_top_words(documents, K) # plot top K words using D3 and Javascript`
  - `get_exported_documents() # get the exported documents back to python`
  
Example of the resulting visualization:
<img src="https://github.com/yeb2Binfang/CS_9223_Visualization_for_ML/blob/main/HW/HW1/HW_Vis.png?raw=1" width = "500px" height="100px"/>

### Accessing the data

The data should be accessed from sklearn. In this section we show an example of code for accessing the documents and the document classes.

In [1]:
# Fetching the data
from sklearn.datasets import fetch_20newsgroups
import numpy as np
newsgroups = fetch_20newsgroups(subset='test',remove=('headers', 'footers', 'quotes'))

# getting the topic ids
topic_idx = np.array(newsgroups.target, dtype=int)

# getting the unique topic names
topic_names = np.array(newsgroups.target_names)

# getting the list of documents
documents = list(newsgroups.data)

# getting the list of topics (in the same order as documents)
topics = list(topic_names[topic_idx])

These are the 20 topics in the dataset:

In [2]:
topic_names

array(['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
       'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
       'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles',
       'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt',
       'sci.electronics', 'sci.med', 'sci.space',
       'soc.religion.christian', 'talk.politics.guns',
       'talk.politics.mideast', 'talk.politics.misc',
       'talk.religion.misc'], dtype='<U24')

The documents and document topics are assigned to the variables *documents* and *topics*. We print some document examples below.

In [3]:
for i in range(2):
    print("Topic: {}".format(topics[i]))
    print("-"*60)
    print("Document:")
    print(documents[i])
    print("="*60)
    print("")

Topic: rec.autos
------------------------------------------------------------
Document:
I am a little confused on all of the models of the 88-89 bonnevilles.
I have heard of the LE SE LSE SSE SSEI. Could someone tell me the
differences are far as features or performance. I am also curious to
know what the book value is for prefereably the 89 model. And how much
less than book value can you usually get them for. In other words how
much are they in demand this time of year. I have heard that the mid-spring
early summer is the best time to buy.

Topic: comp.windows.x
------------------------------------------------------------
Document:
I'm not familiar at all with the format of these "X-Face:" thingies, but
after seeing them in some folks' headers, I've *got* to *see* them (and
maybe make one of my own)!

I've got "dpg-view" on my Linux box (which displays "uncompressed X-Faces")
and I've managed to compile [un]compface too... but now that I'm *looking*
for them, I can't seem to find any

## plot_top_words(documents, K)

### Count word frequency
This function counts the word frequencies in a document. The text is lowercased and punctuation is replaced with spaces before splitting, so a contraction such as "I'll" becomes "i" and "ll".

The function returns two lists: the top k words and their frequencies, sorted in descending order of frequency.

In [4]:
import operator

def count_word_frequency(document,k):
  """Return the k most frequent words in `document` and their counts, most frequent first."""
  # lowercase, then replace punctuation with spaces so that e.g. "I'll" splits into "i" and "ll"
  text = document.lower()
  for ch in "()'*./_-:<>[]!?,":
    text = text.replace(ch, ' ')
  # count how often each word occurs
  data = {}
  for word in text.split():
    data[word] = data.get(word, 0) + 1
  # sort the (word, count) pairs by count in descending order and keep the top k
  sorted_fre = sorted(data.items(), key=operator.itemgetter(1), reverse=True)[:k]
  words = [w for w, _ in sorted_fre]
  fre = [c for _, c in sorted_fre]
  return words, fre
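The same counting can also be sketched with the standard library. This is an alternative under stated assumptions, not the notebook's implementation: the name `top_k_words` is hypothetical, and the `\w+` pattern keeps underscores inside a word, which the chained `replace` calls above do not.

```python
import re
from collections import Counter

def top_k_words(text, k):
    """Return the k most common words and their counts as two parallel lists."""
    # \w+ groups letters, digits and underscores; any other character separates tokens
    tokens = re.findall(r"\w+", text.lower())
    most_common = Counter(tokens).most_common(k)
    words = [w for w, _ in most_common]
    counts = [c for _, c in most_common]
    return words, counts

words, counts = top_k_words("I'll buy the car. The car, the seat.", 2)
print(words, counts)  # ['the', 'car'] [3, 2]
```

`Counter.most_common(k)` already returns the pairs sorted by count in descending order, so no separate sort is needed.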



### Get data in dict format
`getData` returns two dictionaries, both keyed by topic name.

data_dict

{

'topic1':[{'word':'w1','fre':f1},{'word':'w2','fre':f2},...],

'topic2':...

}


data_dict_words_and_wordFre

{

'topic1':{'words':['w1','w2',...,'wk'],'words_fre':[f1,f2,...,fk]},

'topic2':...

}


In [5]:
topic_names_List = topic_names.tolist()
def getData(documents,topic_names_List,k):
  # topic i is paired with documents[i]: the counts for each topic come from that single document
  data_dict_words_and_wordFre = {}
  data_dict = {}
  for i, topic in enumerate(topic_names_List):
    words,words_fre = count_word_frequency(document=documents[i],k=k)
    # parallel lists, used to set the axis domains
    data_dict_words_and_wordFre[topic] = {'words':words,'words_fre':words_fre}
    # one {'word','fre'} record per bar (a document may have fewer than k distinct words)
    data_dict[topic] = [{'word':w,'fre':f} for w,f in zip(words,words_fre)]
  return data_dict,data_dict_words_and_wordFre
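To count words over every document in a topic rather than a single one, the documents first have to be grouped by their topic label. A minimal sketch, assuming `documents` and `topics` are parallel lists as built in the data-access cell. The helper name `group_documents_by_topic` and the toy data are hypothetical.

```python
from collections import defaultdict

def group_documents_by_topic(documents, topics):
    """Map each topic name to the list of all documents labelled with it."""
    grouped = defaultdict(list)
    for doc, topic in zip(documents, topics):
        grouped[topic].append(doc)
    return dict(grouped)

# toy data standing in for the real `documents` and `topics` lists
toy_documents = ["the engine stalls", "x server crashed", "new tires for the car"]
toy_topics = ["rec.autos", "comp.windows.x", "rec.autos"]

by_topic = group_documents_by_topic(toy_documents, toy_topics)
# joining a topic's documents gives one string that a word counter can consume
autos_text = " ".join(by_topic["rec.autos"])
print(autos_text)  # the engine stalls new tires for the car
```

The joined string for each topic could then be passed to a word-counting function in place of `documents[i]`.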

### Import Packages

In [6]:
from IPython.display import display, HTML
import json
from string import Template

In [7]:
%%javascript
require.config({
    paths: {
        d3: "https://d3js.org/d3.v6.min"
     }
});

require(["d3"], function(d3) {
    window.d3 = d3;
});


### Communicate between JS and Python

In [8]:
name = ''
def target_func(comm, open_msg):
    # comm is the kernel Comm instance

    # Register handler for later messages
    @comm.on_msg
    def _recv(msg):
        # Use msg['content']['data'] for the data in the message
        document_name = msg['content']['data']['document_name']
        global name
        name = document_name
        #comm.send({'array':doc[n] })

get_ipython().kernel.comm_manager.register_target('my_comm_target', target_func)
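The handler reads the topic name from a nested message dictionary. The sketch below shows that message shape outside of a running kernel. The `state` dictionary and `on_message` function are hypothetical stand-ins for the module-level `name` variable and the `_recv` handler, and the message is built by hand rather than received over a comm.

```python
# hypothetical stand-in for the module-level `name` variable
state = {"name": ""}

def on_message(msg):
    """Store the topic name carried by a comm message."""
    # the payload passed to comm.send(...) on the JS side arrives under msg['content']['data']
    state["name"] = msg["content"]["data"]["document_name"]

# a hand-built message with the same shape as one produced by comm.send({'document_name': ...})
fake_msg = {"content": {"data": {"document_name": "sci.space"}}}
on_message(fake_msg)
print(state["name"])  # sci.space
```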

### Plot and visualize

In [9]:
def plot_top_words(documents,k):
    ## get data
    data_dict,data_dict_words_and_wordFre=getData(documents,topic_names_List,k)
    template = Template("""
                <html>

                <head>
                </head>

                <body>
                  <!-- Load d3.js -->
                  <script src="https://d3js.org/d3.v4.js"></script>

                  <!-- Initialize a select button -->
                  <select id="selectButton"></select>

                  <!-- Create a div where the graph will take place -->
                  <div id="my_dataviz"></div>

                  <button id="button">Exported Document</button>

                  <!-- Color Scale -->
                  <script src="https://d3js.org/d3-scale-chromatic.v1.min.js"></script>
                  <script>
                    //dataset
                    //console.log(name);
                    var name = 'alt.atheism'
                    // List of groups (here I have one group per column)
                    var topic = $topic;
                    //console.log(topic);

                    var dict1 = $dict1;
                    var dict2 = $dict2;
                    var data1 = dict1[name];
                    //console.log(dict1);

                    var words = dict2[name]['words'];
                    //console.log(words);

                    var words_fre = dict2[name]['words_fre'];
                    var max = words_fre[0]

                    // add the options to the button
                    d3.select("#selectButton")
                      .selectAll('myOptions')
                      .data(topic)
                      .enter()
                      .append('option')
                      .text(function (d) { return d; }) // text showed in the menu
                      .attr("value", function (d) { return d; }) // corresponding value returned by the button

                    // set the dimensions and margins of the graph
                    var margin = {top: 20, right: 30, bottom: 40, left: 90},
                        width = 460 - margin.left - margin.right,
                        height = 400 - margin.top - margin.bottom;

                    // append the svg object to the body of the page
                    var svg = d3.select("#my_dataviz")
                                .append("svg")
                                .attr("id","graph")
                                .attr("width", width + margin.left + margin.right)
                                .attr("height", height + margin.top + margin.bottom)
                                .append("g")
                                .attr("transform",
                                "translate(" + margin.left + "," + margin.top + ")");

                    // Initialize the X axis
                    var x = d3.scaleLinear()
                      .range([ 0, width]);
                    var xAxis = svg.append("g")
                      .attr("transform", "translate(0," + height + ")")
                      .attr("class", "myXaxis");
                      
                    // Add X axis label:
                    svg.append("text")
                      .attr("text-anchor", "end")
                      .attr("x", width)
                      .attr("y", height + 40)
                      .text("words fre");

                    // Y axis
                    var y = d3.scaleBand()
                      .range([ 0, height ])
                      .padding(.1);
                    var yAxis = svg.append("g")
                                   .text("words");
                    

                    function update(data_dict,data_word_and_fre){
                      // Update the Y axis
                      y.domain(data_word_and_fre['words']);
                      yAxis.transition().duration(1000).call(d3.axisLeft(y));

                      // Update the X axis
                      max = data_word_and_fre['words_fre'][0];
                      x.domain([0,max])
                      xAxis.call(d3.axisBottom(x));

                      //Bars
                      var bar = svg.selectAll("myRect")
                        .data(data_dict)

                      bar
                        .enter()
                        .append("rect")
                        .transition()
                        .duration(1000)
                          .attr("x", x(0) )
                          .attr("id","myrect")
                          .attr("y", function(d) { return y(d.word); })
                          .attr("width", function(d) { return x(d.fre); })
                          .attr("height", y.bandwidth() )
                          .attr("fill", "#69b3a2");

                      bar.exit()
                        .remove()
                    }

                    // When the selection changes, redraw the chart for the chosen topic
                    d3.select("#selectButton").on("change", function(d) {
                        // recover the option that has been chosen
                        var selectedOption = d3.select(this).property("value")
                        // remove the old bars before drawing the new ones
                        d3.select("#my_dataviz").selectAll("#myrect").remove();

                        update(dict1[selectedOption],dict2[selectedOption]);
                    })

                    // Draw the initial chart for the default topic
                    update(dict1['alt.atheism'],dict2['alt.atheism']);

                    // Open the comm and attach the export button handler once;
                    // calling output() inside update() would add another click listener on every redraw
                    output();

                    function output(){
                        let comm = Jupyter.notebook.kernel.comm_manager.new_comm('my_comm_target')
                        // Send data
                        comm.send({'document_name': 'alt.atheism'});

                        // Register a handler
                        comm.on_msg(function(msg) {
                            //let data = msg.content.data.array;        
                            //console.log(data)
                            //d3.select("#div_receive_data").selectAll("*").remove()
                            //bar_chart("#div_receive_data", data)
                            //document.getElementById("div_receive_data").innerHTML = data;

                        });

                        // Setting up button
                        document.getElementById("button").addEventListener("click", ()=>{
                            let n = d3.select("#selectButton").property("value");

                            //console.log(n);
                            comm.send({'document_name': n});

                        }); 
                    }

                  </script>

                </body>
                </html>
                """)
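    # serialize with json.dumps so the substituted values are valid JavaScript literals
    my_html = template.safe_substitute(topic=json.dumps(topic_names_List),dict1=json.dumps(data_dict),dict2=json.dumps(data_dict_words_and_wordFre))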
    display(HTML(my_html))
    
    

In [10]:
plot_top_words(documents,6)
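The way Python data reaches the JavaScript code can be seen in isolation. This is a minimal sketch with a hypothetical one-line template and toy data: `string.Template` replaces each `$placeholder` with text, and `json.dumps` produces text that JavaScript can parse as an array or object literal.

```python
import json
from string import Template

# a tiny template with the same kind of placeholders as the one in plot_top_words
template = Template("<script>var topic = $topic; var dict1 = $dict1;</script>")

html = template.safe_substitute(
    topic=json.dumps(["rec.autos"]),
    dict1=json.dumps({"rec.autos": [{"word": "car", "fre": 2}]}),
)
print(html)
# <script>var topic = ["rec.autos"]; var dict1 = {"rec.autos": [{"word": "car", "fre": 2}]};</script>
```

`safe_substitute` leaves any unknown `$name` in the text untouched instead of raising an error, which is convenient when the embedded script is long.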

## Export the document

### getDoc
The visualization sends the selected topic name to Python. `getDoc` builds a dictionary from topic name to document so that the document can be looked up by that name.

In [11]:
def getDoc():
    doc= {}
    for i in range(len(topic_names_List)):
        doc[topic_names_List[i]] = documents[i]
    return doc

### get_exported_documents

In [12]:
def get_exported_documents():
    global name
    doc = getDoc()
    
    return doc[name]

In [13]:
print(get_exported_documents())

I just called Texas' legislative bill tracking service and found out
that HB 1776 (Concealed Carry) is scheduled for a floor vote TODAY!
Let those phone calls roll in.

Daryl
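The assignment asks for the export to be a list of strings for the selected topic. One possible way to build such a list is sketched below. It assumes parallel `documents` and `topics` lists like those in the data-access cell; the function name `export_documents` and the toy data are hypothetical.

```python
def export_documents(documents, topics, selected_topic):
    """Return every document whose topic label equals `selected_topic`."""
    return [doc for doc, topic in zip(documents, topics) if topic == selected_topic]

# toy data standing in for the real `documents` and `topics` lists
toy_documents = ["first space post", "a hockey post", "second space post"]
toy_topics = ["sci.space", "rec.sport.hockey", "sci.space"]

exported = export_documents(toy_documents, toy_topics, "sci.space")
print(exported)  # ['first space post', 'second space post']
```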
