# Visualizations and Random Forest 

Prior to this task, you should have watched a video on random forest on Canvas.

## Advantages of Random Forest

* Random forest can solve both classification and regression problems, and it performs reasonably well on both.
* Random forest can be used on both categorical and continuous variables.
* You do not have to scale features.
* It is fairly robust to missing data and outliers.

## Disadvantages of Random Forest

* It is complex, e.g., look at the tree at the end of this exercise!  This makes it feel like a black box, and we have very little control over what the model does.
* It can take a long time to train.

Here are some alternative ways to load packages in Python as aliases. This can be useful if you call them often.
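As a minimal sketch of aliasing, the cell below uses the conventional short names `np` and `pd`; the sample values and the name `square_root` are made up for illustration.

```python
# Import a whole package under a short alias
import numpy as np
import pandas as pd

# Import a single name from a module, optionally renamed
from math import sqrt as square_root

# The alias now stands in for the full package name
values = np.array([1.0, 2.0, 3.0])  # made-up sample values
series = pd.Series(values)
mean_value = series.mean()
print(mean_value)
print(square_root(16))
```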


The Boston Housing Dataset records house prices across towns in the Boston area. Alongside price, the dataset also provides information such as the per-capita crime rate (CRIM), the proportion of non-retail business acres in the town (INDUS), the proportion of owner-occupied homes built before 1940 (AGE), and many other attributes.

We should check to see if there are any null values.  There are several ways we've learned to do this.
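One possible way to do this with pandas is sketched below. The tiny DataFrame is made up (it is not the Boston data), and the missing value is planted on purpose so the check has something to find.

```python
import numpy as np
import pandas as pd

# Made-up frame with one planted missing value in RM
df = pd.DataFrame({
    "RM": [6.5, np.nan, 5.9],
    "TAX": [296.0, 242.0, 222.0],
})

# Count of missing values per column
null_counts = df.isnull().sum()
print(null_counts)

# Single True/False answer for the whole frame
has_nulls = bool(df.isnull().values.any())
print(has_nulls)
```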

We should check the data first to see if there are any weird anomalies.

We want to confirm that:
* No data points immediately appear anomalous.
* There are no zeros in any of the measurement columns.
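Both checks can be sketched with pandas as follows. The frame is made up for illustration, with a zero planted in TAX to show what a suspicious value looks like.

```python
import pandas as pd

# Made-up frame; the 0.0 in TAX is planted on purpose
df = pd.DataFrame({
    "RM": [6.5, 6.4, 7.1, 5.9],
    "TAX": [296.0, 242.0, 0.0, 222.0],
})

# Summary statistics: compare min and max against the quartiles to spot extremes
summary = df.describe()
print(summary)

# Number of exact zeros in each column
zero_counts = (df == 0).sum()
print(zero_counts)
```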

Another method to verify the quality of the data is to make basic plots. Often it is easier to spot anomalies in a graph than in numbers.

It is useful to know whether some pairs of attributes are correlated and how strongly. For many ML algorithms, strongly correlated features should be treated with caution because they are not independent. Here is a good [blog post](https://towardsdatascience.com/data-correlation-can-make-or-break-your-machine-learning-project-82ee11039cc9) explaining why.

To address this, there are methods for deriving features that are as uncorrelated as possible (PCA, ICA, autoencoders, dimensionality reduction, manifold learning, etc.), which we'll learn about in coming classes.

We can explore correlation with Pandas pretty easily...
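A minimal sketch, assuming a DataFrame named `df`. The data here is synthetic (randomly generated with Boston-style column names), so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: MEDV is built to depend on RM, TAX is independent
rng = np.random.default_rng(0)
n = 200
rm = rng.normal(6.3, 0.7, n)
df = pd.DataFrame({
    "RM": rm,
    "TAX": rng.uniform(180, 720, n),
    "MEDV": 5 * rm + rng.normal(0, 1, n),
})

# Pairwise Pearson correlation between every pair of columns
corr = df.corr()
print(corr.round(2))

# Correlation between two specific columns
rm_medv = df["RM"].corr(df["MEDV"])
print(round(rm_medv, 2))
```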

### Let's explore/review some visualization approaches

A good way to look at correlations quickly is a visualization called a heatmap.  Let's take a look at correlations between features in our dataset.
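One way to draw such a heatmap is with seaborn's `heatmap` function, sketched here on synthetic data (the column names mimic the Boston data, but the values are random). The color map and annotation format are just one reasonable choice.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data, not the real Boston dataset
rng = np.random.default_rng(0)
n = 200
rm = rng.normal(6.3, 0.7, n)
df = pd.DataFrame({
    "RM": rm,
    "TAX": rng.uniform(180, 720, n),
    "MEDV": 5 * rm + rng.normal(0, 1, n),
})

# Color each cell by its correlation and print the value inside it
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, ax=ax)
ax.set_title("Feature correlations (synthetic data)")
```

Fixing `vmin=-1` and `vmax=1` keeps the color scale symmetric, so zero correlation always maps to the neutral color.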

You can also save the plots you make in these notebooks locally.
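For instance, a matplotlib figure can be written to disk with `savefig`. The filename `my_plot.png` and the plotted points are hypothetical.

```python
from pathlib import Path

import matplotlib.pyplot as plt

# A throwaway figure with made-up points
fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])
ax.set_xlabel("x")
ax.set_ylabel("y")

# Hypothetical output path; dpi and bbox_inches are optional
out_path = Path("my_plot.png")
fig.savefig(out_path, dpi=150, bbox_inches="tight")
```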

Let's take a look at how we can explore the distribution of values within a specific feature.  Specifically, let's look at the distribution of property tax in Boston. We can do this in either matplotlib or seaborn.  There are so many tools available to you in Python!
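A histogram is one simple option. The sketch below uses matplotlib on a synthetic stand-in for the TAX column (uniform random values, not the real data); the bin count is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the TAX column
rng = np.random.default_rng(0)
tax = rng.uniform(180, 720, 200)

fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(tax, bins=20, edgecolor="black")
ax.set_xlabel("TAX")
ax.set_ylabel("Count")

# seaborn equivalent: sns.histplot(x=tax, bins=20)
```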

What's the correlation between property taxes and the number of rooms in a house?

Another possibility is to aggregate data points over 2D areas and estimate the [probability density function](https://en.wikipedia.org/wiki/Probability_density_function). It's a 2D generalization of a histogram. We can use either a rectangular grid or a hexagonal one.
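Both grid types can be sketched as follows, using NumPy's `histogram2d` for the rectangular grid and matplotlib's `hexbin` for the hexagonal one. The two variables are synthetic, and the grid sizes are arbitrary.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, correlated pair of variables
rng = np.random.default_rng(0)
x = rng.normal(6.3, 0.7, 500)
y = 5 * x + rng.normal(0, 2, 500)

# Rectangular grid: density=True scales counts so the histogram integrates to 1
density, x_edges, y_edges = np.histogram2d(x, y, bins=15, density=True)
cell_areas = np.outer(np.diff(x_edges), np.diff(y_edges))
total_probability = (density * cell_areas).sum()
print(total_probability)

# Hexagonal grid: each hexagon is colored by the number of points inside it
fig, ax = plt.subplots()
hb = ax.hexbin(x, y, gridsize=15, cmap="viridis")
fig.colorbar(hb, ax=ax, label="count")
```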

As you'll see, you have access to many visualizations.  A great way to explore them is through the seaborn gallery:  https://seaborn.pydata.org/examples/index.html


# How to implement Random Forest

First, we need to get a train and test dataset going...
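A minimal sketch using scikit-learn's `train_test_split`. The features and target are synthetic, and the 80/20 split and `random_state` value are assumptions rather than settings taken from this exercise.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic features and target (not the real Boston data)
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "RM": rng.normal(6.3, 0.7, n),
    "TAX": rng.uniform(180, 720, n),
})
y = 5 * X["RM"] + rng.normal(0, 1, n)

# Hold out 20% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```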

The 'ravel' method flattens an array: when y.shape == (10, 1), y.ravel().shape == (10,). scikit-learn expects a 1D target and warns when it receives a column vector, as discussed in this Stack Overflow thread:

https://stackoverflow.com/questions/34165731/a-column-vector-y-was-passed-when-a-1d-array-was-expected
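To make this concrete, the sketch below flattens a column-vector target and then fits a `RandomForestRegressor` on it. The data is synthetic, and the hyperparameters are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic features and a target stored as a column vector
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (3 * X[:, 0] + rng.normal(0, 0.1, 100)).reshape(-1, 1)
print(y.shape)  # two-dimensional: one column

# Flatten to the 1D shape that scikit-learn expects for a single target
y_flat = y.ravel()
print(y_flat.shape)

reg = RandomForestRegressor(n_estimators=50, random_state=0)
reg.fit(X, y_flat)
predictions = reg.predict(X[:5])
```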

How do we evaluate this model?  Previously, we've worked with discrete class labels, but now we have a continuous target, so classification metrics no longer apply.  For example, a confusion matrix doesn't make sense here, and scikit-learn raises an error if you try to compute one on continuous values.
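A small sketch of that failure, using made-up continuous values. The error is caught here so the cell keeps running.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([21.6, 34.7, 33.4, 36.2])  # made-up continuous targets
y_pred = np.array([22.1, 33.9, 35.0, 30.4])  # made-up predictions

# confusion_matrix only accepts discrete class labels
try:
    confusion_matrix(y_true, y_pred)
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
```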

Check out this [documentation](https://scikit-learn.org/stable/modules/model_evaluation.html) and see if you can find some ways to evaluate this model.
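As a starting point, here are three regression metrics from `sklearn.metrics`, applied to made-up values so the arithmetic is easy to check by hand. Which metric suits this model best is left open.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0])  # made-up targets
y_pred = np.array([2.0, 5.0, 9.0])  # made-up predictions

mae = mean_absolute_error(y_true, y_pred)  # average absolute error
mse = mean_squared_error(y_true, y_pred)   # average squared error
rmse = np.sqrt(mse)                        # same units as the target
r2 = r2_score(y_true, y_pred)              # fraction of variance explained

print(mae, mse, rmse, r2)
```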

The importance of our features can be found in reg.feature_importances_. We sort them by decreasing order of importance:
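A minimal sketch of this step. The data is synthetic, built so that only RM drives the target; on the real dataset the ranking will differ. The hyperparameters are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: only RM influences the target
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "RM": rng.normal(6.3, 0.7, n),
    "TAX": rng.uniform(180, 720, n),
    "CRIM": rng.exponential(3.0, n),
})
y = 5 * X["RM"] + rng.normal(0, 0.5, n)

reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X, y)

# Pair each importance with its column name, then sort from most to least important
importances = pd.Series(reg.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```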

These importances measure how much each feature contributes to decreasing the weighted impurity across the trees.  This is a fast calculation, but one should be cautious because it can be biased: it tends to inflate the importance of continuous features and high-cardinality categorical variables (those with many uncommon or unique values).

You'll need to open the tree.dot file in a text editor, e.g., Notepad.  Select all the code and paste it in here:  http://www.webgraphviz.com/.  Scroll right and the tree should show up.
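A file like tree.dot can be produced with scikit-learn's `export_graphviz`, sketched here for the first tree of a small forest. The data is synthetic, and `max_depth=3` is an assumption made only to keep the exported tree readable.

```python
from pathlib import Path

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_graphviz

# Synthetic data with two features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] + rng.normal(0, 0.1, 100)

# A shallow forest so the exported tree stays small
reg = RandomForestRegressor(n_estimators=10, max_depth=3, random_state=0)
reg.fit(X, y)

# Export one tree of the forest in Graphviz DOT format
dot_path = Path("tree.dot")
export_graphviz(
    reg.estimators_[0],
    out_file=str(dot_path),
    feature_names=["RM", "TAX"],  # illustrative names for the two columns
    filled=True,
    rounded=True,
)
print(dot_path.read_text()[:80])
```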

## More practice - optional but recommended because it's interesting and doesn't take too long

This is another good [tutorial](https://towardsdatascience.com/random-forest-in-python-24d0893d51c0) on random forest.  You can work through this tutorial on your own and expand it for your "choose your adventure", though you should be sure to demonstrate knowledge of this topic rather than copying and executing the tutorial.