# Machine Learning Research Project Paper

October 5, 2017 – Research project proposals due.

**What is expected?**

a.	A clear description of the topic.  
b.	Background research of related work.  
c.	Data sources?  
d.	What algorithms are being used and code sources.  
e.	References.  

See the example research project proposals.  

**Group Size**

One to two people.

**Progress reports:**

Research project progress reports will be due December 9, 2017.

A draft of the final report written in a scientific paper format.

The project progress reports must have:

a.	Abstract (10 %)  
b.	Introduction (5 %)  
c.	Code with Documentation (50%)  
d.	Results (20 %)  
e.	Discussion (10 %)  
f.	References (5 %)   


**Projects Due:**

Research project code and reports will be due December 13, 2017.

**What is expected?**

A final draft of the research written in a scientific paper format. You are expected to respond to feedback you received from the progress report. 


**Grading Rubric:**

The following breakdown will be used for determining the score for the research project: 

| Assignment              | Points   |
|-------------------------|----------|
| Project proposals       |      100 |
| Progress report         |      150 |
| Progress draft          |      250 |
| Research project paper  |      500 |

	

**Research Projects**

This course will have individual machine learning research papers.  These papers should be in a style that could be submitted to a conference, workshop or journal. Students should begin work on them early in the semester.

These assignments will provide practice in real-world analysis and application of machine learning algorithms. The research can take one of the following forms:

i.	Tweaking an existing machine learning algorithm.  
ii.	Applying an existing machine learning algorithm in a novel context.  
iii.	Validating an existing machine learning algorithm in real-world contexts.  
iv.	Creating a novel machine learning algorithm.  
v.	Competing in a competition such as Kaggle [https://www.kaggle.com/](https://www.kaggle.com/)   

_Topic 1_

Tweaking an existing algorithm

This project involves finding an existing algorithm and making a modification that might improve its recall, precision, scalability, memory usage or speed. Other tweaks may include changing the algorithm so it applies in a different context.  

_Topic 2_

Applying an existing algorithm in a novel context

This project involves finding an existing algorithm and applying it in a novel way. This might involve applying genetic sequencing algorithms to the optimization of hardware resource sharing, or the use of genetic algorithms (GAs) to optimize the arrangement of class schedules.  This topic is very closely related to “Tweaking an existing algorithm” in that one usually needs to make changes to use an algorithm in a novel context. 

_Topic 3_

Validating an existing algorithm in real-world contexts

This project involves finding how practically useful and applicable an existing algorithm really is in the “real world.” This involves empirically validating the sensitivity, recall, precision, scalability, memory usage and speed claims of an algorithm with realistic, noisy and non-ideal data.

_Topic 4_

Creating a novel algorithm

This project involves creating a novel algorithm that can answer an interesting real-world question. This topic is very closely related to “Tweaking an existing algorithm” in that one usually extends and improves what exists rather than creating something entirely from scratch.

_Topic versus Thesis_

A topic is a general area of interest. A thesis statement presents an assertion: what you intend to do and how you intend to convince others that what you did is correct.  A topic of interest might be “What keywords should I add to my tweet to make it more viral?” 

A thesis is more specific and is framed so that the assertion is testable.  A thesis would be “We believe adding keywords according to our algorithm will significantly improve a tweet’s virality.”  The paper would then involve quantitatively defining virality and comparing random tweets with tweets modified by the keyword algorithm, using reasonable real-world data (e.g. Twitter).
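One way to make the example thesis testable is a permutation test on a virality measure. The sketch below is only an illustration: it assumes virality is measured as a retweet count, the function name `permutation_test` is hypothetical, and the numbers are made up rather than taken from Twitter.

```python
import random


def permutation_test(control, treated, n_permutations=10000, seed=0):
    """One-sided permutation test: is mean(treated) greater than mean(control)?

    Returns the observed difference in means and an estimated p-value.
    """
    rng = random.Random(seed)
    observed = sum(treated) / len(treated) - sum(control) / len(control)
    pooled = list(control) + list(treated)
    n_treated = len(treated)
    count = 0
    for _ in range(n_permutations):
        # Shuffle the group labels and recompute the difference in means.
        rng.shuffle(pooled)
        perm_treated = pooled[:n_treated]
        perm_control = pooled[n_treated:]
        diff = (sum(perm_treated) / len(perm_treated)
                - sum(perm_control) / len(perm_control))
        if diff >= observed:
            count += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    p_value = (count + 1) / (n_permutations + 1)
    return observed, p_value


# Illustrative, made-up retweet counts (NOT real Twitter data).
random_tweets = [3, 0, 5, 2, 1, 4, 2, 0, 3, 1]
keyword_tweets = [6, 4, 9, 5, 7, 3, 8, 6, 5, 7]

diff, p = permutation_test(random_tweets, keyword_tweets)
print(f"difference in mean retweets: {diff:.2f}, p-value: {p:.4f}")
```

A small p-value would support the thesis under this definition of virality; a real project would also need to justify the virality measure and how the tweets were sampled.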


**Submission**

You will submit your assignments via BlackBoard.
Click the title of the assignment (BlackBoard -> Assignment -> <Title of Assignment>) to go to the submission page. 

**What Topics are Interesting?**
* Sequence alignment
* Secure communication
* Scheduling (courses, trains, etc.)
* Packing (where to best place items in store/website/warehouse)
* Path finding (who/what is closest? How do I get there? Extracting a retweet’s origins?)
* What terms are associated with your Twitter/ Facebook/ LinkedIn/ Google+/ GitHub/ Web handle?
* Which genes cause cancer?
* What users are associated with your Twitter/ Facebook/ LinkedIn/ Google+/ GitHub/ Web handle?
* Friendship cliques on Twitter/ Facebook/ LinkedIn/ Google+/ GitHub/ Web
* Sentiment analysis/influence analysis. Who are most influential on Twitter/ Facebook/ LinkedIn/ Google+/ GitHub/ Web?
* Mapping (jobs, housing, crime, etc.)
* At what price should I start an eBay auction?
* Do I want to follow that person back?
* Where should I place transmission towers?
* What keywords should I add to my tweet/post?
* Identification of transcription factor binding sites
* When should I tweet/post?
* Matching (Who is like/unlike me? What is the best TV, college, etc. for me?)
* What is the reach of my tweet/post?
* What are people saying about me on Twitter/ Facebook/ LinkedIn/ Google+/ GitHub/ Web?
* Should I add a picture/url to my tweet/post?

There are countless interesting topics to which machine learning algorithms could be applied.  You want to keep your topic simple and doable. A topic like “Which genes cause cancer?” is of great interest but too broad for a project. Analyzing existing gene expression quantification algorithms is more suitable. The purpose of these projects is for students to get their “hands dirty,” not necessarily to develop breakthrough algorithms.  We will discuss how to do this research early in the semester.


## Example projects

There are several example projects and papers on NEU's BlackBoard for this course.
 

## Special Project in Deep Learning

Students who wish can contribute to a larger class project rather than to the small-group or individual assignments.

The description of the project is below:

**Understanding the semantics of the latent space in unsupervised deep learning models**

**Abstract:**

Deep learning and neural networks are increasingly important concepts, as demonstrated by their performance on difficult problems in computer vision, medical diagnosis, natural language processing and many other domains. Deep learning algorithms are unique in that they try to learn latent features from data, as opposed to traditional machine learning, where features are typically handcrafted. However, the semantics of the “hidden layers” of deep neural networks are poorly understood, and the networks are often treated as “black box” models.  The aim of this research is to develop tools and algorithms to better understand the semantics of the latent features learned by deep networks, particularly those used for unsupervised deep learning.

**Unsupervised models to be studied:**

_Autoencoders_

An autoencoder, autoassociator or Diabolo network is an artificial neural network used for unsupervised learning of efficient codings. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for the purpose of dimensionality reduction.

Autoencoders, though, are difficult to interpret in terms of the semantics of their representation, and they aren’t really generative models.

Autoencoders:

i.	encoder and decoder  
ii.	hidden layers  
iii.	bottleneck layer - forces the network to learn a compressed latent representation  
iv.	reconstruction loss - forces the hidden layer to represent information about the input  
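The components listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the course's reference implementation: the autoencoder is linear (no activation functions), the data are synthetic, and the names `encode`, `decode` and `reconstruction_loss` are hypothetical. A Keras or TensorFlow version would follow the same structure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): 200 samples in 8 dimensions that lie on a 2-D subspace.
latent_true = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 8))
X = latent_true @ mixing

n_inputs, n_bottleneck = X.shape[1], 2
W_enc = rng.normal(scale=0.1, size=(n_inputs, n_bottleneck))  # encoder weights
W_dec = rng.normal(scale=0.1, size=(n_bottleneck, n_inputs))  # decoder weights


def encode(x):
    """Map inputs to the compressed bottleneck representation."""
    return x @ W_enc


def decode(z):
    """Map bottleneck codes back to input space."""
    return z @ W_dec


def reconstruction_loss(x):
    """Mean squared error between the input and its reconstruction."""
    return np.mean((decode(encode(x)) - x) ** 2)


initial_loss = reconstruction_loss(X)
learning_rate = 0.02
for step in range(2000):
    Z = encode(X)
    error = decode(Z) - X
    # Gradients of the mean squared reconstruction error.
    grad_dec = Z.T @ error * (2 / error.size)
    grad_enc = X.T @ (error @ W_dec.T) * (2 / error.size)
    W_dec -= learning_rate * grad_dec
    W_enc -= learning_rate * grad_enc
final_loss = reconstruction_loss(X)

print(f"reconstruction loss: {initial_loss:.4f} -> {final_loss:.6f}")
```

Because the bottleneck has only two units, the network can reconstruct the input well only by discovering the two directions along which the data actually vary.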


_Variational autoencoders (VAEs)_

Variational autoencoders are a stochastic variational extension of autoencoders  that allow for a probabilistic representation of the data and amortized inference.
In a VAE, the encoder becomes a variational inference network that maps the data to a distribution over the hidden variables, and the decoder becomes a generative network that maps the latent variables back to the data. In just a couple of years, variational autoencoders (VAEs) have emerged as one of the most popular approaches to unsupervised learning of complicated distributions.

Variational Autoencoders (VAEs) have some very desirable properties:

i.	Clear semantics as a generative model  
ii.	Extension of autoencoders that allows sampling and estimating probabilities  
iii.	Creates a latent representation through its probabilities  
iv.	The latent variables have a fixed prior distribution  
v.	Probabilistic encoder and decoder  
vi.	Probabilistic representation of the data  

Variational autoencoders have a clear advantage over autoencoders in that the probabilistic representation of the data is a form of representation learning.
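Two of the ingredients above, the probabilistic encoder and the fixed prior, can be made concrete with a short NumPy sketch. It assumes the common choice of a diagonal Gaussian encoder and a standard normal prior N(0, I); the text above only says the prior is fixed, so that choice and the function names are assumptions.

```python
import numpy as np


def sample_latent(mu, log_var, rng):
    """Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)


rng = np.random.default_rng(0)

# Hypothetical encoder outputs for two inputs and a 2-D latent space.
mu = np.array([[0.0, 0.0], [1.0, -2.0]])
log_var = np.zeros((2, 2))

z = sample_latent(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
print(kl)  # the first row equals the prior, so its KL term is 0; the second is 2.5
```

In training, this KL term is added to the reconstruction loss, which is what pulls the encoder's distributions toward the prior and makes sampling from the prior meaningful.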


_Autoregressive variational autoencoders_

Standard VAEs have some issues:

i.	They do not encode what is not useful for them to decode, so they don’t capture fine details. Subtle features (e.g. small nodules that may be clinically important but don’t occupy much of the image space) may be missed.


Autoregressive VAEs use autoregressive networks in the encoder and the decoder to capture more local information when estimating the densities.


_Restricted Boltzmann machines (RBMs)_

Restricted Boltzmann machines (RBMs) are a very simple model: just a fully connected bipartite graph with an input (visible) layer and a hidden layer. Training re-weights the connections to lower an energy function on the training data, which reduces the difference between the input layer and its reconstruction from the hidden layer.


Like VAEs, restricted Boltzmann machines can be thought of as a form of representation learning.

Restricted Boltzmann machines:

i.	Clear semantics as a generative model.  
ii.	Simple, just a fully connected bipartite graph with an input layer and a hidden layer.  
iii.	Minimizes an energy function (re-weighting to reduce the difference between the input layer and its reconstruction from the hidden layer).  
iv.	Energies can easily be converted to probabilities.  
v.	Creates a latent representation through its probabilities.  
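The points above can be sketched for a small binary RBM. This is an illustration under assumptions the text does not state: the units are Bernoulli, training uses one step of contrastive divergence (CD-1), and the data are a toy set of two binary patterns. The names `energy` and `cd1_update` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def energy(v, h, W, a, b):
    """E(v, h) = -a.v - b.h - v^T W h for a binary RBM."""
    return -(v @ a) - (h @ b) - (v @ W @ h)


def cd1_update(v0, W, a, b, rng, lr=0.1):
    """One CD-1 step on a batch; updates W, a, b in place."""
    p_h0 = sigmoid(v0 @ W + b)                       # p(h = 1 | v0)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    p_v1 = sigmoid(h0 @ W.T + a)                     # reconstruction p(v = 1 | h0)
    p_h1 = sigmoid(p_v1 @ W + b)
    n = v0.shape[0]
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / n
    a += lr * (v0 - p_v1).mean(axis=0)
    b += lr * (p_h0 - p_h1).mean(axis=0)
    return np.mean((v0 - p_v1) ** 2)                 # reconstruction error


rng = np.random.default_rng(0)

# Toy data (assumption): two binary patterns, each repeated ten times.
data = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]] * 10, dtype=float)

W = rng.normal(scale=0.01, size=(6, 2))  # visible-to-hidden weights
a = np.zeros(6)                          # visible biases
b = np.zeros(2)                          # hidden biases

errors = [cd1_update(data, W, a, b, rng) for _ in range(1000)]
print(f"reconstruction error: {errors[0]:.3f} -> {np.mean(errors[-50:]):.3f}")
```

Energies convert to probabilities through p(v, h) proportional to exp(-E(v, h)), so configurations with lower energy are more probable.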

As with VAEs, the densities generated can be visualized.  When an RBM is trained on image data sets of handwritten digits, the visualizations clearly show where the densities capture portions of 2s, 3s, 7s, etc.


**Understanding the semantics of the latent space**

The research includes implementing models to help understand the semantics of the latent space in VAEs and RBMs. This includes:

A. Visualizing the latent space

i.	The densities generated can be visualized.  
ii.	The latent representations can be visualized using t-SNE.  
iii.	Generative models are also used to visualize the semantic meaning of hidden layers. To visualize the semantic meaning of each layer of a generative model, the parameters of the model can be gradually adjusted while the effect on the generated images is observed.
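The gradual adjustment described in the last item is often done as a latent traversal: one latent coordinate is swept over a range while the others are held fixed, and each resulting code is decoded. The sketch below uses a hypothetical stand-in decoder (a fixed random linear map) in place of a trained VAE or RBM; `traverse_latent` is an illustrative name, not an existing API.

```python
import numpy as np


def traverse_latent(decoder, z_base, dim, values):
    """Decode copies of z_base in which only coordinate `dim` is varied."""
    outputs = []
    for value in values:
        z = z_base.copy()
        z[dim] = value
        outputs.append(decoder(z))
    return np.stack(outputs)


rng = np.random.default_rng(0)

# Hypothetical decoder: maps a 4-D latent code to a flattened 8x8 "image".
# A trained model's decoder would be used here instead.
W_dec = rng.normal(size=(4, 64))


def decoder(z):
    return z @ W_dec


z_base = np.zeros(4)
values = np.linspace(-2.0, 2.0, 5)
images = traverse_latent(decoder, z_base, dim=1, values=values)
print(images.shape)  # (5, 64): one decoded image per traversal value
```

Plotting the rows of `images` side by side (for example, reshaped to 8x8) shows what the chosen latent dimension controls in the output.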

Latent space visualization — Deep Learning bits #2 [https://medium.com/@juliendespois/latent-space-visualization-deep-learning-bits-2-bd09a46920df](https://medium.com/@juliendespois/latent-space-visualization-deep-learning-bits-2-bd09a46920df)

Discovering Hidden Factors of Variation in Deep Networks [https://arxiv.org/abs/1412.6583](https://arxiv.org/abs/1412.6583)

Zeiler, Matthew D and Fergus, Rob. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014, pp. 818–833. Springer, 2014. [https://www.cs.nyu.edu/~fergus/papers/zeilerECCV2014.pdf](https://www.cs.nyu.edu/~fergus/papers/zeilerECCV2014.pdf)

Topic Modeling and t-SNE Visualization [https://shuaiw.github.io/2016/12/22/topic-modeling-and-tsne-visualzation.html](https://shuaiw.github.io/2016/12/22/topic-modeling-and-tsne-visualzation.html)

Visualizing Data Using t-SNE [https://www.youtube.com/watch?v=RJVL80Gg3lA](https://www.youtube.com/watch?v=RJVL80Gg3lA)

Visualizing data using t-SNE [https://www.researchgate.net/publication/228339739_Viualizing_data_using_t-SNE](https://www.researchgate.net/publication/228339739_Viualizing_data_using_t-SNE)


B. Ranking the latent space

Machine learning latent factor models such as singular value decomposition (SVD), principal component analysis (PCA), and probabilistic PCA (PPCA) have the very powerful property that their components or terms are ordered by how much of the signal they explain, so the first k can be ranked and kept. This is very desirable because it allows for:

i.	A quantitative assessment of signal loss.
ii.	Dimensionality reduction.
iii.	A powerful form of regularization by removing the components or terms that contribute very little to the signal.
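For SVD and PCA, these three properties can be seen directly in a short NumPy sketch. The data are synthetic (a rank-3 signal plus a small amount of noise, chosen only for illustration), and the helper name `truncate` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): strong rank-3 signal plus a little noise.
signal = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 20))
X = signal + 0.05 * rng.normal(size=(100, 20))
X = X - X.mean(axis=0)  # center the columns, as in PCA

# Singular values come back sorted from largest to smallest,
# which is what makes the components rankable.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
explained = s ** 2 / np.sum(s ** 2)  # fraction of variance per component


def truncate(k):
    """Best rank-k approximation of X, keeping the first k components."""
    return (U[:, :k] * s[:k]) @ Vt[:k]


for k in (1, 2, 3, 5):
    lost = np.linalg.norm(X - truncate(k)) ** 2 / np.linalg.norm(X) ** 2
    print(f"k={k}: fraction of signal lost = {lost:.4f}")
```

The squared error of the rank-k approximation equals the sum of the squared discarded singular values, which gives the quantitative assessment of signal loss. No comparably simple ordering is available by default for the latent dimensions of a VAE or RBM, which is the gap this part of the project targets.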

I’ve had a hard time finding papers related to ranking the latent space in deep learning.

So far, just this one:

Deep Variational Canonical Correlation Analysis [https://arxiv.org/abs/1610.03454](https://arxiv.org/abs/1610.03454)

C. Latent space arithmetic

Another way of exploring the learned representations is to perform arithmetic in the latent space. Generative nets have been shown to encode semantic knowledge of such things as glasses, chairs or faces in their latent space. One can “add” or “subtract” latent encodings and observe the effect on the generated images. Just as glasses can be added to a generated human face, in theory one could add nodules to a generated human lung.
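A common recipe for this kind of arithmetic is to estimate an attribute vector as the difference between the mean latent codes of examples with and without the attribute, then add it to a new code. The sketch below uses synthetic latent codes in which one direction is planted to stand for "glasses"; the codes, the planted direction and all variable names are assumptions for illustration, and a real experiment would take the codes from a trained encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim = 16

# Synthetic setup (assumption): pretend the first latent axis encodes "glasses".
planted_direction = np.zeros(latent_dim)
planted_direction[0] = 3.0
z_without = rng.normal(size=(50, latent_dim))                    # codes of faces without glasses
z_with = rng.normal(size=(50, latent_dim)) + planted_direction   # codes of faces with glasses

# Attribute vector: difference of the two group means.
attribute = z_with.mean(axis=0) - z_without.mean(axis=0)

# "Add" the attribute to a new latent code, then "subtract" it again.
z_new = rng.normal(size=latent_dim)
z_new_with_glasses = z_new + attribute
z_back = z_new_with_glasses - attribute

cosine = attribute @ planted_direction / (
    np.linalg.norm(attribute) * np.linalg.norm(planted_direction)
)
print(f"cosine similarity to planted direction: {cosine:.3f}")
```

Decoding `z_new_with_glasses` with the generative model is the step that would show whether the attribute transfers to the generated image.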


Alexey Dosovitskiy, Jost Tobias Springenberg, and Thomas Brox. Learning to generate chairs with convolutional neural networks. In CVPR, 2015.   [https://arxiv.org/abs/1411.5928](https://arxiv.org/abs/1411.5928)

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016. [https://arxiv.org/abs/1511.06434](https://arxiv.org/abs/1511.06434)

Rohit Girdhar, David F Fouhey, Mikel Rodriguez, and Abhinav Gupta. Learning a predictable and generative vector representation for objects. In ECCV, 2016. [https://arxiv.org/abs/1603.08637](https://arxiv.org/abs/1603.08637)

**Student responsibilities:**

a.	Implement models specified by Professor Brown in Keras and TensorFlow  
b.	Bi-weekly progress updates (once every two weeks)   
c.	Discuss results from models with Professor Brown   
d.	Should interesting results arise, co-author papers with Professor Brown   
e.	Think of novel ideas and approaches and suggest them to the working group     


## List of datasets for machine learning research

* [List of datasets for machine learning research](https://en.wikipedia.org/wiki/List_of_datasets_for_machine_learning_research)   
* [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/)  
* [Public Data Sets : Amazon Web Services](https://aws.amazon.com/datasets/) 
* [freebase](https://developers.google.com/freebase/)  
* [Google Public Data Explorer](https://www.google.com/publicdata/directory)  
* [datahub](http://datahub.io/)  
* [data.gov](https://www.data.gov/)  


## How to do a literature review

* [http://www.asbmb.org/asbmbtoday/asbmbtoday_article.aspx?id=15161](http://www.asbmb.org/asbmbtoday/asbmbtoday_article.aspx?id=15161)  
* [http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003149](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003149)  
* [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3715443/](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3715443/)  
* [http://www.monash.edu.au/lls/llonline/writing/science/lit-review/1.xml](http://www.monash.edu.au/lls/llonline/writing/science/lit-review/1.xml)  
* [http://library.lincoln.ac.nz/Research/Writing-your-research/Literature-Reviews/Sample-literature-reviews/](http://library.lincoln.ac.nz/Research/Writing-your-research/Literature-Reviews/Sample-literature-reviews/)  
* [http://www.coe.montana.edu/ee/rosss/Courses/EE578_Fall_2008/Writing%20a%20Review%20Paper.pdf](http://www.coe.montana.edu/ee/rosss/Courses/EE578_Fall_2008/Writing%20a%20Review%20Paper.pdf)  
* [http://writingcenter.unc.edu/handouts/literature-reviews/](http://writingcenter.unc.edu/handouts/literature-reviews/)  
* [http://writing.wisc.edu/Handbook/ReviewofLiterature.html](http://writing.wisc.edu/Handbook/ReviewofLiterature.html)  
* [http://guides.library.ucsc.edu/write-a-literature-review](http://guides.library.ucsc.edu/write-a-literature-review)  
* [http://www.duluth.umn.edu/~hrallis/guides/researching/litreview.html](http://www.duluth.umn.edu/~hrallis/guides/researching/litreview.html)  
* [https://ithacalibrary.com/sp/assets/users/_lchabot/lit_rev_eg.pdf](https://ithacalibrary.com/sp/assets/users/_lchabot/lit_rev_eg.pdf)  
* [http://www.lib.ncsu.edu/tutorials/litreview/](http://www.lib.ncsu.edu/tutorials/litreview/)   
* [http://library.concordia.ca/help/howto/litreview.php](http://library.concordia.ca/help/howto/litreview.php)  
* [http://library.bcu.ac.uk/learner/writingguides/1.04.htm](http://library.bcu.ac.uk/learner/writingguides/1.04.htm)  
* [http://guides.library.vcu.edu/lit-review](http://guides.library.vcu.edu/lit-review)  
* [http://www2.le.ac.uk/offices/ld/resources/writing/writing-resources/literature-review](http://www2.le.ac.uk/offices/ld/resources/writing/writing-resources/literature-review)  

Last update September 5, 2017