# Course Overview

**Objectives of Machine Learning**

*Making the simple complicated is commonplace; making the complicated
simple, awesomely simple, that's creativity.* --[Charles Mingus](http://en.wikiquote.org/wiki/Charles_Mingus)

Given some data, as samples of variables/attributes/features/measurements:

  * **find**: outliers, anomalies, redundant variables
  * **explain**: relationships, dependencies
  * **predict**: some variables from others
  * **classify**: samples into categories
  * **control**: by selecting actions in sequence that optimize performance

**Your Objectives in Class**

  * Develop intuition for finding patterns in data.
  * Learn concepts and algorithms in machine learning research.
  * Learn to program in Python.
  * Implement many common machine learning algorithms in Python.
  * Write Python code to apply algorithms to various data sets.
  * Use Python to analyze and visualize results.
  * Learn to write scientific papers in LaTeX with graphics from Python.
  * Learn to do machine learning research by asking and answering questions about your data and results.

**Tools**

* Python
    * *ipython* interpreter, for incremental code development
    * *numpy* for matrices
    * *matplotlib* for visualization
    * distributed computing
    * fast enough
    * [popular](http://spectrum.ieee.org/computing/software/top-10-programming-languages), huge community
    * [ipython notebook](http://ipython.org/notebook)
* LaTeX
    * programming language for documents
    * publication quality, better than Word

**Skills You Will Learn** or **Steps You Will Do For Each Assignment**

* Python and LaTeX
* Talk with an expert. Understand current practice for their data analysis problem.
* Read data.
* Visualize it. Ask questions about it.
* Create a simple model first.
* Discuss the result with the expert. Verify that you understand the goal.
* Try more complex models. Compare results fairly and honestly.
* Write up the problem description, current approaches, experimental methods and results, discussion, and conclusion in a scientific report that has
  * intuitive organization
    * no more than two levels of headings
    * short but meaningful section names
  * quality writing
    * no misspellings
    * good English grammar
    * well-formatted math
  * understandable graphics
    * intuitive plots, all axes labelled

Maintain a lab notebook, either handwritten or on the computer. Include enough detail to repeat the experiments and to write a paper for publication. Ideally, try some *literate programming* using ipython notebooks:

  * ["Literate computing" and computational reproducibility: IPython in the age of data-driven journalism](http://nbviewer.ipython.org/github/fperez/blog/blob/master/130418-Data-driven%20journalism.ipynb#"Literate-computing"-and-computational-reproducibility:-IPython-in-the-age-of-data-driven-journalism), by Fernando Perez
  * [Reproducible research, literate programming, IPython, and GitHub](http://peterwittek.com/2014/05/reproducible-research-literate-programming-ipython-and-github/), by Peter Wittek, published May 16th, 2014.

**Assignments**


  * About 7 assignments. Same ones for on- and off-campus students.
  * Implement a machine learning algorithm in Python
  * Apply it to a given data set
  * Write a report in LaTeX describing
    * data and machine learning problem being addressed,
    * method you followed, including summary of Python code and resources you used,
    * results described with figures and explanations in text,
    * discussion of the results, including your observations and questions you have,
    * attempts to answer some of the questions with further experiments, if you have time
  * The last assignment will be of your own design.


**The Power of Statistics and Visualization**

Simple analysis and dynamic visualization can lead to tremendous
insight with little effort.  For example, take the time to watch the
video at ted.com by Hans Rosling: 

[Talk by Hans Rosling](http://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen.html)

The first two steps to a data analysis problem should always be:

  - read data into Python, summarize and visualize the data until you have a sense of
    * number of samples and attributes
    * types of attributes and ranges of values
    * any missing attribute values
  - list the initial set of questions to be answered about the data and guess at possible answers (hypotheses), based on your summaries and visualizations.

These steps are often ignored, or not given enough attention.
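
As a rough illustration of the first step, the sketch below reads a small table and reports the number of samples and attributes, the range of each attribute, and the count of missing values. The data values, the column names, and the `summarize` helper are all made up for this example, and it uses only the standard library; *numpy* would be the more usual choice for real data.

```python
import csv
import io

# Made-up sample data standing in for a real file.
# An empty field marks a missing attribute value.
raw = io.StringIO(
    "mpg,cylinders,weight\n"
    "18.0,8,3504\n"
    "15.0,8,3693\n"
    ",4,2372\n"
    "24.0,4,2430\n"
)

rows = list(csv.DictReader(raw))
attributes = list(rows[0].keys())


def summarize(rows, attributes):
    """Return {attribute: (min, max, number missing)} for numeric columns."""
    summary = {}
    for name in attributes:
        # Keep only the values that are present, converted to floats.
        values = [float(r[name]) for r in rows if r[name] != ""]
        n_missing = len(rows) - len(values)
        summary[name] = (min(values), max(values), n_missing)
    return summary


summary = summarize(rows, attributes)
print("samples:", len(rows), "attributes:", len(attributes))
for name, (low, high, n_missing) in summary.items():
    print(f"{name}: range [{low}, {high}], missing {n_missing}")
```

A summary like this often suggests the first questions to ask, such as why a value is missing or whether an extreme value is an outlier.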


**Examples of Kinds of Problems We Will Study**

  * Given a bunch of images of hand-drawn digits, learn a mapping from image to digit (0 - 9) and test it on new images.  (supervised learning: classification)
  * Given a set of attributes (measurements) of a lot of automobiles, including their miles-per-gallon (mpg), learn a function of all attributes except mpg that predicts the mpg. (supervised learning: prediction)
  * Given the definition of a game, learn a function of the current game position that produces the best next move. (reinforcement learning)
  * Given  gene expression data for a bunch of samples, find groupings that are common among the different treatments. (unsupervised learning)
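
To make the prediction example concrete, here is a minimal sketch that fits a straight line predicting mpg from a single attribute by least squares. The numbers are invented for illustration (and deliberately fall exactly on a line), the names `fit_line` and `predict` are hypothetical, and a real model would use all of the attributes rather than just one.

```python
# Hypothetical (weight in 1000 lb, mpg) pairs, invented for illustration.
weights = [2.0, 2.5, 3.0, 3.5, 4.0]
mpgs = [32.0, 28.0, 24.0, 20.0, 16.0]


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one attribute."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(weights, mpgs)


def predict(weight):
    """Predict mpg for a new automobile from its weight."""
    return intercept + slope * weight


print("intercept:", intercept, "slope:", slope)
print("predicted mpg at weight 2.75:", predict(2.75))
```

The learned function is tested on a weight that was not in the training samples, which mirrors the "test it on new images" step in the classification example.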


# Introduction to Python

**Why Python?**

What do you (I) want in a programming and computing environment?

  * Concise, intuitive programming language.  
  * Ability to *play* with data and computation ideas.
    * Data persistence.
    * Matrix, vector structures and operations are easy.
    * Interpreter, using same language as programs.
    * Functional and object-oriented programming styles.
  * Rich, easy-to-use visualization
  * Fast computation.
  * Large community of users and developers.
  * Free

Python satisfies all of these, except perhaps *fast computation*, and it is
getting faster all the time. (See [Python Speed](http://wiki.python.org/moin/PythonSpeed))

Python 

  * is an open-source language and environment for computing and graphics
  * was started in 1989 by Guido van Rossum at CWI in the Netherlands
  * is a multi-paradigm programming language
  * has dynamic typing
  * has garbage collection
  * is easily extensible in C and C++
  * is available for Unix, Windows, and macOS systems
  * is the language of choice for many researchers in machine learning and  statistics, and
  * is the required language in many job ads (see [Python Jobs](http://www.python.org/community/jobs/))
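
A few of these properties can be seen in a handful of lines. The sketch below is only illustrative, and the `Accumulator` class and variable names are made up for the example. It shows dynamic typing (a name rebound to objects of different types) and two of the paradigms Python supports.

```python
# Dynamic typing: a name can be rebound to objects of different types.
x = 3
type_before = type(x).__name__
x = "three"
type_after = type(x).__name__

# Functional style: apply a function to every element of a list.
squares = list(map(lambda v: v * v, [1, 2, 3]))


# Object-oriented style: state and behavior bundled in a class.
class Accumulator:
    def __init__(self):
        self.total = 0

    def add(self, value):
        self.total += value
        return self.total


acc = Accumulator()
for v in squares:
    acc.add(v)

print(type_before, type_after, squares, acc.total)
```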


**Installing and Running Python**

Python is installed on our department's systems.

You may download and install Python on your own computer by following
the instructions at [python.org](http://www.python.org).

On our systems, enter the *ipython* interactive environment with

    ipython
  
To quit, type Control-D.

To run python code in file *code.py*, either type

    run code.py
  
in *ipython*, or type

    python code.py
  
at the unix command prompt.
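
For reference, a file such as *code.py* might look like the following. This is a hypothetical example, not a file provided with the course. The `if __name__ == "__main__":` guard runs `main` when the file is executed with either command above, but not when the file is imported as a module.

```python
# Hypothetical contents of code.py
import sys


def main(args):
    """Return a greeting built from the command-line arguments."""
    name = args[0] if args else "world"
    return f"Hello, {name}!"


if __name__ == "__main__":
    # sys.argv[0] is the script name; the rest are user-supplied arguments.
    print(main(sys.argv[1:]))
```

Keeping the work inside a function makes it easy to call `main` again from the interpreter after running the file.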

When in *ipython*, you may type Python statements or expressions,
which are evaluated, or *ipython* commands.  See the
[video tutorial on using ipython](http://showmedo.com/videotutorials/video?name=1000010&fromSeriesID=100), in five parts by Jeff Rush, for help
getting started with *ipython*.

    In [1]: list?