# 1. Mail Spam Detection with Naive Bayesian Classifier

(Sandeep, Abhimanyu)

* Description of the task (Mail Spam Detection)
* Introduction to Naive Bayesian Classifier
    - What it is
    - How it works
    - Strengths of the algorithm; when should we use it?
    - Weaknesses of the algorithm; when does it perform badly?
    - http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Document_classification
* Implement a Naive Bayesian Classifier in Python (there are many tutorials on the net for this; you can use them)
    - a function that takes train and test datasets as input, trains on the train dataset using NBC, and returns predictions for the test dataset as output
* Test the NBC you implemented on a real dataset
    - SpamAssassin public corpus https://spamassassin.apache.org/publiccorpus/
    - http://stackoverflow.com/questions/4743996/publicly-available-spam-filter-training-set
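
As a starting point for the implementation task above, here is a minimal sketch of such a function, assuming a multinomial model over whitespace-separated tokens with Laplace smoothing. The name `nbc_predict`, the `alpha` parameter, and the toy messages are illustrative assumptions, not part of the assignment; a real submission would still need email parsing and better tokenization.

```python
import math
from collections import Counter, defaultdict


def nbc_predict(train, test, alpha=1.0):
    """Train on (text, label) pairs and return one predicted label per test text."""
    class_docs = Counter()              # number of training documents per class
    word_counts = defaultdict(Counter)  # token counts per class
    vocab = set()
    for text, label in train:
        class_docs[label] += 1
        for token in text.lower().split():
            word_counts[label][token] += 1
            vocab.add(token)

    total_docs = sum(class_docs.values())
    predictions = []
    for text in test:
        best_label, best_score = None, -math.inf
        for label in class_docs:
            # log prior + sum of smoothed log likelihoods
            score = math.log(class_docs[label] / total_docs)
            denom = sum(word_counts[label].values()) + alpha * len(vocab)
            for token in text.lower().split():
                if token in vocab:  # tokens never seen in training are skipped
                    score += math.log((word_counts[label][token] + alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        predictions.append(best_label)
    return predictions


# Toy data, made up for illustration only
train = [
    ("win cash prize now", "spam"),
    ("cheap prize win win", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
print(nbc_predict(train, ["win a cheap prize", "monday meeting"]))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, which matters for long emails.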

# 2. An introduction to Gradient Boosting Machines with xgboost

(Nayan, Pranav)

* What is Boosting?
* What are Gradient Boosting Machines?
* What are the (hyper)parameters of the GBM algorithms?
* Introduction to xgboost
* An example of xgboost in practice (small end-to-end code sample without any tuning)
* How does one tune the hyperparameters of GBM/xgboost in practice?
    - What are good initial values to start with?
    - Any rules of thumb?
    - Grid search and Random search
* An example of hyperparameter tuning with xgboost (code)
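
The two search strategies listed above can be sketched without any library. In this sketch `toy_score` is a hypothetical stand-in for a cross-validated score (in practice it would train and evaluate an xgboost model), and the candidate values are arbitrary illustrations rather than recommended settings; `max_depth` and `learning_rate` are used only as familiar parameter names.

```python
import itertools
import random


def grid_search(space, score_fn):
    """Evaluate every combination of candidate values and keep the best."""
    names = list(space)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(space[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def random_search(space, score_fn, n_iter, seed=0):
    """Evaluate n_iter randomly sampled combinations and keep the best."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Hypothetical score surface that peaks at max_depth=6, learning_rate=0.1
    return -abs(params["max_depth"] - 6) - 10 * abs(params["learning_rate"] - 0.1)


space = {"max_depth": [3, 6, 9], "learning_rate": [0.01, 0.1, 0.3]}
print(grid_search(space, toy_score))
print(random_search(space, toy_score, n_iter=5))
```

Grid search cost grows multiplicatively with each added parameter, while random search has a fixed budget (`n_iter`) but may miss the best combination.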

# 3. Random Forests

(Raghav, Anup)

## Research the Algorithm

Research and describe the information processing objectives of the algorithm. Select at least 5 of the following 10 questions and answer them to complete this section:

* What are the standard names and abbreviations used for the algorithm?
* What is the information processing strategy of the algorithm?
* What is the objective or goal for the algorithm?
* What metaphors or analogies are commonly used to describe the behavior of the algorithm?
* What is the pseudocode or flowchart description of the algorithm?
* What are the heuristics or rules of thumb for using the algorithm?
* What classes of problem is the algorithm well suited to?
* What are common benchmarks and example datasets used to demonstrate the algorithm?
* What are useful resources for learning more about the algorithm?
* What are the primary references in which the algorithm was first described?

## Summarize Parameters

Summarize the algorithm parameters including their intentions and common usage rules of thumb.

Machine learning algorithms are parameterized so that the system can be customized to a given problem instance. Algorithms are also parameterized because there are decision points within the process that influence the behavior of the system. Research and summarize the algorithm parameters and the effect they are believed to have on the performance of the systems.

* What are the algorithm parameters and the range of valid values for each?
* What influence is each parameter believed to have on the system?
* What are common practices when configuring the algorithm parameters?

## Characterize Behaviors

Select properties that you can measure to characterize the behaviors of the algorithm over time.

A running algorithm can be considered a complex system that exhibits behaviors. Behaviors can be considered specific actions taken on the model that is being prepared during each algorithm iteration or decisions made with regard to the data that is being processed during each algorithm iteration.

The behaviors of a running algorithm and the final model that it creates can be measured and characterized. This could be as simple as a performance measure of the model against the problem over algorithm iterations or something more specific to the way the algorithm is designed or the decisions that it makes.

* What behaviors does the algorithm exhibit when preparing a model on a problem?
* What are common measures used to evaluate the behavior of the algorithm?
* What measures could be implemented to measure and characterize known behaviors of the system?
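
One simple behavior measure for a voting ensemble such as a random forest is accuracy as a function of the number of trees. The sketch below assumes you already have each tree's predictions on a held-out set; `tree_votes` and the toy numbers are hypothetical inputs, and ties are broken toward the smallest label purely to keep the result deterministic.

```python
from collections import Counter


def accuracy_by_ensemble_size(tree_votes, labels):
    """tree_votes[t][i] is tree t's prediction for sample i.

    Returns the majority-vote accuracy using the first 1, 2, ..., T trees.
    """
    curve = []
    for n_trees in range(1, len(tree_votes) + 1):
        correct = 0
        for i, truth in enumerate(labels):
            votes = Counter(tree_votes[t][i] for t in range(n_trees))
            # Majority vote; ties go to the smallest label
            predicted = max(sorted(votes), key=votes.get)
            correct += predicted == truth
        curve.append(correct / len(labels))
    return curve


# Made-up predictions from three trees on four samples
labels = [1, 0, 1, 1]
tree_votes = [
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
]
print(accuracy_by_ensemble_size(tree_votes, labels))
```

Plotting this curve for increasing ensemble sizes shows where adding more trees stops paying off.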

## Select Test Datasets

Select problem datasets that you can test the algorithm against in order to elicit specific behaviors.

An algorithm is suited to a class of problem, and instances of this class of problem need to be selected if you are going to perform experiments on the algorithm to investigate its behaviors. This task is concerned with selecting appropriate datasets that can be used for experiments that you design and execute.

It is useful to select small standard datasets that are commonly used in the field for assessing both the specific algorithm you have selected and the general class to which it belongs, because you can then use results in the literature as points of inspiration and comparison.

The following issues should be considered when selecting a dataset.

* Datasets that have been used in prior studies of the algorithm or the general class of algorithm.
* Datasets that are known to exhibit the specific behavior, feature or limitation that you are investigating.
* Datasets that allow you to complete your experiments relatively quickly and do not require significant resources to manage.

Create a list of at least 3 datasets to which the algorithm could be applied, and note the source where you found each dataset. Research papers on the algorithm are a great source of datasets.

## Run Experiments

Design and execute experiments on selected datasets in order to elicit specific system behaviors that you can characterize and measure.

Design and execute experiments to answer questions you have about the algorithm. Use the selected datasets and the measures you chose to characterize algorithm behaviors. Remember to focus your attention on the algorithm and make observations that teach you more about how to manage and use the algorithm more effectively.


# 4. Census Income Dataset

(Harshil, Akshay)

https://archive.ics.uci.edu/ml/datasets/Census+Income (read through this page very carefully)

## Summarize Data

Use descriptive statistical methods to summarize the data. There are tools and packages that make this process quick and effective and it should become the automatic first thing to do to get a feeling for the data.

* Five Number Summary: Compute a 5 number summary for each real-valued attribute (this is the minimum, lower quartile, median, upper quartile, and maximum values of each attribute).
* Standard Distributions: Check the distribution of each attribute against standard distributions such as Gaussian and Power Law distributions. (qq-plots)
* Relationships: Compute correlations for all pairs of attributes using both parametric and nonparametric methods (such as Pearson's and Spearman's correlation coefficients).
* Review the summary statistics and note down any observations, theories and testable ideas that the data inspires.
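
The summary statistics above can be sketched with the standard library alone. This is a minimal illustration, assuming small in-memory lists of numbers; the function names and sample values are made up, and for the full dataset a data-analysis package would be the more practical choice.

```python
import math
import statistics


def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile, maximum."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


def pearson(x, y):
    """Pearson's (parametric) correlation coefficient."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's (nonparametric) correlation: Pearson on the ranks."""
    return pearson(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]  # monotonic but not linear in x
print(five_number_summary(x))
print(pearson(x, y), spearman(x, y))
```

Comparing the two coefficients is informative: a monotonic but nonlinear relationship gives a Spearman value of 1 while the Pearson value stays below 1.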

## Visualize Data

Expose the structures in the data by visualizing them in plots and graphs. Graphs of attributes and attribute pairs can quickly highlight obvious outliers, missing data and other concerns that may need to be addressed in a data preparation step.

* Histograms: Summarize real and categorical data using histograms. Try coarse- and fine-grained binning of values in order to expose outliers and the mass of attribute values.
* Box and Whisker plots: These plots are a graphical depiction of the 5 number summary. Box plots can be useful for highlighting obvious outliers and highlighting the mass of the data.
* Pair-wise Scatter plots: Create pairwise scatter plots of all real-valued attributes, and use histograms for categorical attributes. Display all graphs as a matrix of all attributes versus all attributes. This can graphically show any relationships between paired features.
* Record observations when reviewing the plots and comment on any theories that you may be able to exploit or test.
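
To see why the bin width matters for histograms, here is a small sketch of equal-width binning, assuming a plain list of numbers; the `histogram` helper and the sample values are illustrative only. With coarse bins the outlier is visible but the main mass is lumped together, while finer bins separate both.

```python
def histogram(values, bins):
    """Count values in `bins` equal-width intervals between min and max."""
    low, high = min(values), max(values)
    width = (high - low) / bins or 1.0  # avoid zero width when all values match
    counts = [0] * bins
    for value in values:
        # The maximum value falls into the last bin
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts


values = [1, 2, 2, 3, 10]  # 10 is an obvious outlier
print(histogram(values, bins=3))  # coarse binning
print(histogram(values, bins=9))  # fine binning
```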

## Summarize Structure (optional, but would be nice)

Summarize the structures in the data using more advanced methods. The methods in this section are unsupervised, meaning that they reveal the structures that are resident in the data provided. This more involved form of data analysis can reveal deeper insights into the data that you can use in generating predictive models.

* Feature Selection: Run feature selection algorithms to determine which attributes you may be able to discard and which may have more influence in a predictive model. Run multiple methods and look for commonality in the results.
* Natural Clusters: Run unsupervised clustering algorithms to determine the number of natural clusters in the data (such as the k-means algorithm). Keep a record of which cluster each data instance is assigned to, as this may be useful information when building a predictive model.
* Projections: Run unsupervised projection algorithms to determine the features that are important and the relationships between the data (such as principal component analysis and self-organizing map algorithms). Visualize the projections so that you can make additional qualified observations.
* Write up observations and findings from the structure summaries created.
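
For the clustering step, a bare-bones k-means (Lloyd's algorithm) can be sketched as follows. It assumes numeric tuples and caller-supplied initial centroids; the function names and toy points are illustrative, and a library implementation would normally be used on the real data. The returned labels are the per-instance cluster assignments worth keeping for later modelling.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign_clusters(points, centroids):
    """Index of the nearest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]


def kmeans(points, centroids, n_iter=10):
    """Alternate assignment and centroid-update steps for n_iter rounds."""
    centroids = [tuple(float(v) for v in c) for c in centroids]
    for _ in range(n_iter):
        labels = assign_clusters(points, centroids)
        updated = []
        for c, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                # New centroid is the coordinate-wise mean of its members
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        centroids = updated
    return assign_clusters(points, centroids), centroids


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, centroids = kmeans(points, centroids=[(0, 0), (10, 10)])
print(labels, centroids)
```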

## Create and compare models

* Run multiple algorithms on the dataset (with cross-validation and some basic hyperparameter tuning). Compare the performance of the algorithms.
    - Logistic Regression
    - Support Vector Machines
    - Random Forests
    - Gradient Boosting Machines
* Do some basic feature engineering and rerun the models (and compare performance).
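
The cross-validation used to compare models can be sketched as a generator of train/test index splits. This is a minimal, unstratified k-fold sketch; the name `k_fold_indices` and the `seed` parameter are assumptions made for illustration, and the same splits should be reused for every algorithm so that the comparison is fair.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


for train_idx, test_idx in k_fold_indices(n_samples=10, k=5):
    print(len(train_idx), len(test_idx))
```

Each sample appears in exactly one test fold, so averaging the per-fold scores uses every row for evaluation once.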


# 5. Support Vector Machines

(Amit, Kunal)

## Research the Algorithm

Research and describe the information processing objectives of the algorithm. Select at least 5 of the following 10 questions and answer them to complete this section:

* What are the standard names and abbreviations used for the algorithm?
* What is the information processing strategy of the algorithm?
* What is the objective or goal for the algorithm?
* What metaphors or analogies are commonly used to describe the behavior of the algorithm?
* What is the pseudocode or flowchart description of the algorithm?
* What are the heuristics or rules of thumb for using the algorithm?
* What classes of problem is the algorithm well suited to?
* What are common benchmarks and example datasets used to demonstrate the algorithm?
* What are useful resources for learning more about the algorithm?
* What are the primary references in which the algorithm was first described?

## Summarize Parameters

Summarize the algorithm parameters including their intentions and common usage rules of thumb.

Machine learning algorithms are parameterized so that the system can be customized to a given problem instance. Algorithms are also parameterized because there are decision points within the process that influence the behavior of the system. Research and summarize the algorithm parameters and the effect they are believed to have on the performance of the systems.

* What are the algorithm parameters and the range of valid values for each?
* What influence is each parameter believed to have on the system?
* What are common practices when configuring the algorithm parameters?

## Characterize Behaviors

Select properties that you can measure to characterize the behaviors of the algorithm over time.

A running algorithm can be considered a complex system that exhibits behaviors. Behaviors can be considered specific actions taken on the model that is being prepared during each algorithm iteration or decisions made with regard to the data that is being processed during each algorithm iteration.

The behaviors of a running algorithm and the final model that it creates can be measured and characterized. This could be as simple as a performance measure of the model against the problem over algorithm iterations or something more specific to the way the algorithm is designed or the decisions that it makes.

* What behaviors does the algorithm exhibit when preparing a model on a problem?
* What are common measures used to evaluate the behavior of the algorithm?
* What measures could be implemented to measure and characterize known behaviors of the system?

## Select Test Datasets

Select problem datasets that you can test the algorithm against in order to elicit specific behaviors.

An algorithm is suited to a class of problem, and instances of this class of problem need to be selected if you are going to perform experiments on the algorithm to investigate its behaviors. This task is concerned with selecting appropriate datasets that can be used for experiments that you design and execute.

It is useful to select small standard datasets that are commonly used in the field for assessing both the specific algorithm you have selected and the general class to which it belongs, because you can then use results in the literature as points of inspiration and comparison.

The following issues should be considered when selecting a dataset.

* Datasets that have been used in prior studies of the algorithm or the general class of algorithm.
* Datasets that are known to exhibit the specific behavior, feature or limitation that you are investigating.
* Datasets that allow you to complete your experiments relatively quickly and do not require significant resources to manage.

Create a list of at least 3 datasets to which the algorithm could be applied, and note the source where you found each dataset. Research papers on the algorithm are a great source of datasets.

## Run Experiments

Design and execute experiments on selected datasets in order to elicit specific system behaviors that you can characterize and measure.

Design and execute experiments to answer questions you have about the algorithm. Use the selected datasets and the measures you chose to characterize algorithm behaviors. Remember to focus your attention on the algorithm and make observations that teach you more about how to manage and use the algorithm more effectively.
