# Week 6

Phew. Is it week 6 already? This week, we've gotten to the machine-learning part. There are lots of courses on machine learning at DTU. And across many research areas, people use ML for all kinds of things. So there's a good chance you're already familiar with what's going to happen today. 

So why are we doing a week devoted to machine learning? Well, there are two reasons:
* Not all of you have done any machine learning, so this is a chance to make sure that everyone has had a chance to catch up.
* The second reason, which I'd like to get across in the course (and IMHO it's an important one), is that visualization **AND** machine learning is a powerful combination. A combination that is pretty rare. 
  - Usually it's the case that people are either good at machine learning or data viz, but not both. 
  - So what we will be able to do in this class is an unusual combo: We can use ML to understand data and then visualize the outputs of the machine-learning.

To get started on all this, the idea for today is to give you a quick sense of machine learning and the theory behind it, and then get you guys ready to use the state-of-the-art machine learning framework for Python, scikit-learn (`sklearn`).

Thus, the elements are as follows.

* Lightning intro to machine learning.
* Playing around with `sklearn` through a couple of tutorials.
* Then learning about the K nearest neighbors (KNN) algorithm, including an exercise based on the SF data.
* And finally, learning about decision trees and trying them out on real-world crime data.

## Part 1: Lightning intro to machine learning

As mentioned above, we won't go too deep with machine learning in this class. Here the goal is to learn enough to combine data analysis with simple machine learning to create the next-level visualizations I advertised above.

So we kick off the machine-learning part by watching some video lectures on the *fundamentals of Machine learning*. The lectures have been prepared by our very own expert, Ole Winter, whose work focuses on Machine Learning. The lectures + slides have been prepared especially for you guys by Ole, and lovingly edited by yours truly.

In [1]:
# Ole Winter, "What is Machine Learning" 
from IPython.display import YouTubeVideo
YouTubeVideo("SsCYF9tDY9Y",width=800, height=450)

In [2]:
# Ole on Model Selection
YouTubeVideo("MHhlAtw3Ces",width=800, height=450)

In [3]:
# Ole on feature extraction and selection
YouTubeVideo("RZmitKn220Q",width=800, height=450)

Now it's time to read a little bit about machine learning. For this written intro, we're going to use a book that I've used in this class in previous years. That book has a little bit of a special style. It focuses on learning everything from scratch ... and that includes defining mathematical concepts in code (rather than equations). It might take a little tiny bit of time to get used to. 

But it's a great book and a nice + concise intro that matches Ole's lectures pretty well. So here goes.

*Reading*: Data Science From Scratch (DSFS), Chapter 11. Get it [here](https://cn.inside.dtu.dk/cnnet/filesharing/download/5d0bbcac-3370-4ddf-a8bc-4a62ef931c45).

> *Exercises*: A few questions about machine learning.

> * What do we mean by a 'feature' in a machine learning model?
> * What is the main problem with overfitting?
> * Explain the connection between the bias-variance trade-off and overfitting/underfitting.
> * The `Luke is for leukemia` case on page 145 in the reading is a great example of why accuracy is not a good measure in very unbalanced problems. You know about the incidents dataset we've been working with. Try to come up with a similar example based on the data we've been working with today.
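
To make the accuracy point concrete, here is a minimal sketch with made-up numbers (the class counts are hypothetical and are not taken from the incidents dataset). It shows how a "model" that always predicts the majority class can reach a high accuracy while finding none of the rare cases.

```python
# Hypothetical, made-up labels: 990 "common" incidents and 10 "rare" ones.
y_true = ["common"] * 990 + ["rare"] * 10

# A useless model that always predicts the majority class.
y_pred = ["common"] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for the rare class: fraction of the true "rare" cases the model finds.
true_positives = sum(t == "rare" and p == "rare" for t, p in zip(y_true, y_pred))
recall_rare = true_positives / y_true.count("rare")

print(f"accuracy = {accuracy:.2f}, recall (rare) = {recall_rare:.2f}")
```

The accuracy looks great, but the recall for the rare class tells the real story. That's the kind of mismatch to look for when you build your own example.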

## Part 2: Scikit-learn

In this section we introduce scikit-learn, `sklearn`. The best way to learn about any Python package is to find a good tutorial and run through it to get an intuition for the syntax and how its usage is intended (we've already done this with `pandas`, remember). 

The amazing package `sklearn` is state-of-the-art machine learning for Python. It's used in companies big and small all over the world and in lots of academic papers. Today we'll run through a couple of tutorials just to get you started.

We start with a high-level overview presented in [this tutorial](https://scikit-learn.org/stable/tutorial/basic/tutorial.html). Read/work through the first three sections (*Machine learning: the problem setting*, *Loading an example dataset*, *Learning and predicting*) to get a sense of data types and syntax.

> *Exercise*: Did you read the text?
>
> * Describe in your own words how data is organized in `sklearn` (how does a *dataset* work according to the tutorial)?
> * What is the dimensionality of the `.data` part of a dataset and what is the size of each dimension?
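
If you want to poke at a dataset object yourself, here is a small sketch. It uses the iris data (one of the toy datasets shipped with `sklearn`) rather than the digits data, and the choice of `KNeighborsClassifier` is only there to illustrate the generic `fit`/`predict` pattern, it is not what the tutorial uses.

```python
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

# .data is a 2D array: one row per sample, one column per feature.
X = iris.data
# .target holds one label per sample.
y = iris.target
print(X.shape, y.shape)  # (150, 4) (150,)

# Every estimator follows the same pattern: fit(X, y), then predict(X_new).
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X[:-1], y[:-1])           # train on all but the last sample
prediction = clf.predict(X[-1:])  # predict the held-out sample
print(prediction)
```

Try the same inspection on the digits dataset from the tutorial and compare the shapes.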

Now we're going to work through a [tutorial on text analytics](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html). You might wonder, why text? Well, there are two reasons. Firstly, this is a great tutorial that introduces various concepts that are useful for you guys to know about, so even if text isn't really part of the course, the tutorial still sets you up to be excellent users of `sklearn`. Secondly, it's not a bad thing to know about text analysis. Even in this class. This could be something you might like to use in the final project. And now you have a great place to go if you want to dig into some textual data.

We won't do the whole tutorial. I'd like you to work through it up to and including the section *Building a pipeline*.

> *Exercise*: Did you do the work?
>
> * Describe in your own words the dataset used in the tutorial. 
> * Investigate further: what kind of folder/file structure does the `sklearn.datasets.load_files` function expect?
> * What is the "bag-of-words" representation of text? How does this strategy turn text into data of the kind described above?
> * (Don't worry too much about tokenization and TF-IDF for now, but do check out those parts if you want to use real text analysis later)
> * Once you've built the classifier, play around with it a bit. Describe the content of the `predicted` variable.
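
To give a feel for the bag-of-words idea without the full newsgroups data, here is a minimal sketch on a tiny made-up corpus (the documents and the labels `sports`/`tech` are hypothetical). It uses `CountVectorizer`, `MultinomialNB` and `Pipeline`, the same building blocks as the tutorial, but skips the TF-IDF step to keep things short.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus (hypothetical, for illustration only).
train_docs = [
    "the striker scored a goal",
    "the team won the match",
    "the cpu runs the program",
    "the compiler builds the code",
]
train_labels = ["sports", "sports", "tech", "tech"]

# Bag-of-words: each document becomes a row of word counts,
# with one column per word in the vocabulary.
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(train_docs)
print(X_counts.shape)  # (number of documents, vocabulary size)

# A pipeline chains the vectorizer and the classifier into one estimator.
text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", MultinomialNB()),
])
text_clf.fit(train_docs, train_labels)

# `predicted` holds one predicted label per input document.
predicted = text_clf.predict(["goal match team", "compiler code"])
print(predicted)
```

Note that the word counts end up in exactly the samples-by-features layout from the first tutorial, which is why any `sklearn` classifier can take over from there.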


You can find an overview of all tutorials here https://scikit-learn.org/stable/tutorial/index.html.

## Part 3: KNN

Now it's time to work with a real algorithm. We'll take the simplest machine learning scheme in the universe. Although it's simple, it is still often useful (as we shall see when we use it to analyze crime data). It's called *K nearest neighbors*.

We start by Ole introducing the idea.

In [4]:
# Ole on K-nearest-neighbors
YouTubeVideo("OE159z8kC-Y",width=800, height=450)

Now, let's read about it. 

*Reading*: Again we turn to DSFS, this time chapter 12, as an intro to the KNN algorithm. Find it [here](https://cn.inside.dtu.dk/cnnet/filesharing/download/8923a9e2-ded5-4cd8-b334-42b9fc64d33a).

> *Warm up exercises*: K-nearest-neighbors
> 
> * How does K-nearest-neighbors work? Explain in your own words.
> * Explain in your own words: What is the curse of dimensionality? Use figure 12-6 in DSFS as part of your explanation.
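
If it helps to see the idea in code, here is a from-scratch sketch of the majority-vote step (this is my own minimal version on made-up 2D points, not the code from DSFS, and the function name `knn_predict` is just an illustrative choice). Ties are not handled in any special way here.

```python
from collections import Counter
import math

def knn_predict(k, labeled_points, new_point):
    """Classify new_point by majority vote among its k nearest labeled points."""
    # Sort the labeled points by Euclidean distance to the new point.
    by_distance = sorted(
        labeled_points,
        key=lambda item: math.dist(item[0], new_point),
    )
    # Collect the labels of the k closest points and take the most common one.
    k_nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(k_nearest_labels).most_common(1)[0][0]

# Made-up 2D points with labels (hypothetical data).
points = [
    ((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((0.2, 0.1), "A"),
    ((1.0, 1.0), "B"), ((0.9, 1.1), "B"), ((1.1, 0.9), "B"),
]

print(knn_predict(3, points, (0.15, 0.15)))
print(knn_predict(3, points, (0.95, 1.0)))
```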



Finally, we can start working with the crime data. Here's a little exercise.


> *Exercise*: K-nearest-neighbors map.
>
> We know from last week's exercises that the focus crimes `PROSTITUTION`, `DRUG/NARCOTIC` and `DRIVING UNDER THE INFLUENCE` tend to be concentrated in certain neighborhoods, so we focus on those crime types since they will make the most sense on a KNN map.
> 
> * Begin by using `folium` (see Week 4) to plot all incidents of the three crime types on their own map. This will give you an idea of how the various crimes are distributed across the city.
> * Next, it's time to set up your model based on the actual data. I recommend that you try out `sklearn`'s `KNeighborsClassifier`. For an intro, start with [this tutorial](https://scikit-learn.org/stable/tutorial/statistical_inference/supervised_learning.html) and follow the link to get a sense of the usage.
>   * You don't have to think a lot about testing/training and accuracy for this exercise. We're mostly interested in creating a map that's not too problematic. But do calculate the number of observations of each crime type. You'll find that the levels of each crime vary (lots of drug arrests, an intermediate amount of prostitution registered, and very little drunk driving in the dataset). Since the algorithm classifies each point according to its neighbors, *what could be a consequence of this imbalance in the number of examples from each class for your map*?
>   * You can make the dataset 'balanced' by grabbing an equal number of examples from each crime category. 
>       * How do you expect that will change the KNN result? 
>       * In which situations is the balanced map useful?
>       * When is the map where data is in proportion to occurrences useful? 
>       * Choose which map you will work on in the following.
> * Now create an approximately square grid of points that runs over SF. You get to decide the grid-size, but I recommend somewhere between $50\times50$ and $100\times100$ points. I recommend using `folium` for this task.
> * Visualize your model by coloring the grid, coloring each grid point according to its category. Create a plot of this kind for models where each point is colored according to the majority of its $5$, $10$, and $30$ nearest neighbors. Describe what happens to the map as you increase the number of neighbors, `K`. 
> * To see an example, [click here](https://raw.githubusercontent.com/suneman/socialdataanalysis2020/master/files/KNN-example.png). This one is a 100x100 grid based on crimes from 1st January 2017 until the end of 2018. And the categories are narcotics, prostitution and vehicle theft.
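
The grid-and-predict step can be sketched roughly as follows. Everything data-related here is an assumption: the points are synthetic stand-ins for incident coordinates, the two categories `A` and `B` are made up, and the bounding box is only a rough illustration. The point is just the mechanics of `np.meshgrid` plus `KNeighborsClassifier`.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for (longitude, latitude) of incidents; NOT real data.
# Two made-up clusters, one per hypothetical crime category.
X_a = rng.normal(loc=[-122.42, 37.78], scale=0.01, size=(200, 2))
X_b = rng.normal(loc=[-122.40, 37.76], scale=0.01, size=(200, 2))
X = np.vstack([X_a, X_b])
y = np.array(["A"] * 200 + ["B"] * 200)

# Bounding box for the grid (values chosen for illustration only).
lon = np.linspace(-122.45, -122.37, 50)
lat = np.linspace(37.73, 37.81, 50)
lon_grid, lat_grid = np.meshgrid(lon, lat)
grid_points = np.column_stack([lon_grid.ravel(), lat_grid.ravel()])

# Fit one model per K and predict a category for every grid point.
grid_labels = {}
for k in (5, 10, 30):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X, y)
    grid_labels[k] = knn.predict(grid_points).reshape(lon_grid.shape)

print(grid_points.shape, grid_labels[5].shape)
```

From here, one option is to loop over `grid_points` and draw each one on a `folium` map (for example with `folium.CircleMarker`), using one color per predicted category.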

## Part 4: Decision Trees

Now we turn to decision trees. This is a fantastically useful supervised machine-learning method that we use all the time in research. To get started on the decision trees, we'll use a fantastic *visual* introduction. 


*Decision Trees Reading 1*: The visual introduction to decision trees on this webpage is AMAZING. Take a look to get an intuitive feel for how trees work. Do not miss this one, it's a treat! http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

*Decision Trees Reading 2*: the second part of the visual introduction is about the topic of model selection, and bias/variance tradeoffs that we looked into earlier during this lesson. But once again, here those topics are visualized in a fantastic and inspiring way, that will make it stick in your brain better. So check it out http://www.r2d3.us/visual-intro-to-machine-learning-part-2/

*Decision Trees Reading 3*: Finally, you can also read about decision trees in DSFS, chapter 17. [Get it here](https://cn.inside.dtu.dk/cnnet/filesharing/download/6115e455-347f-4683-84a9-c6b6e2cb3802). 

And our little session on decision trees wouldn't be complete without hearing from Ole about these things. 

In [5]:
# Ole explains decision trees
YouTubeVideo("LAA_CnkAEx8",width=600, height=338)

> *Exercises*: Just a few questions to make sure you've read the text (DSFS chapter 17) and/or watched the video.
> 
> * There are two main kinds of decision trees depending on the type of output (numeric vs. categorical). What are they?
> * Explain in your own words: Why is entropy useful when deciding where to split the data?
> * Why are trees prone to overfitting?
> * Explain (in your own words) how random forests help prevent overfitting.
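
As a small companion to the entropy question, here is a sketch that computes the entropy of a set of labels and the weighted entropy after a split. The labels and the two candidate splits are made up, and the function names are my own choice for illustration.

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

def partition_entropy(subsets):
    """Weighted average entropy of the subsets produced by a split."""
    total = sum(len(subset) for subset in subsets)
    return sum(entropy(subset) * len(subset) / total for subset in subsets)

# Made-up labels for illustration.
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]

good_split = [["A", "A", "A", "A"], ["B", "B", "B", "B"]]  # pure subsets
bad_split = [["A", "A", "B", "B"], ["A", "A", "B", "B"]]   # still mixed

print(entropy(labels))
print(partition_entropy(good_split))
print(partition_entropy(bad_split))
```

A split that lowers the weighted entropy a lot has separated the classes well, which is what a tree is looking for when it picks where to split.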

> *Exercise*: Decision trees and real-world crime data
> 
> The idea for today is to pick two crime-types that have *different geographical patterns* and *different temporal patterns*. We can then use various variables of the real crime data as categories to build a decision tree. I'm thinking we can use
> * `DayOfWeek` (`Sunday`, ..., `Saturday`). (Note: Will need to be encoded as integers in `sklearn`)
> * `PD District` (`TENDERLOIN`, etc). (Note: Will need to be encoded as integers in `sklearn`)
> 
> And we can extract a few more from the `Time` and `Date` variables
> * Hour of the day (1-24)
> * Month of the year (1-12)
> 
> So your job is to **select two crime categories** that (based on your analyses from the past three weeks) have different spatio-temporal patterns. Then we are going to build a decision tree (or a random forest) that takes as input the four features (Hour-of-the-day, Day-of-the-week, Month-of-the-year, and PD-District) of some crime (from one of the two categories) and then tries to predict which category that crime is from.
>
> Some notes/hints
> * It is a good idea to create a balanced dataset, that is, **grab an equal number of examples** from each of the two crime categories. Pick categories with lots of training data. It's probably nice to have something like 10000+ examples of each category to train on. 
> * Also, I recommend you grab your training data at `random` from the set of all examples, since we want crimes to be distributed equally over time.
> * A good option is the  `DecisionTreeClassifier`.
> * Since you have created a balanced dataset, the baseline performance (random guess) is 50%. How good can your classifier get?
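
Here is a rough sketch of how the pieces could fit together, on a tiny made-up table (these are NOT real records). The column name `Category`, the categories `X` and `Y`, and the date/time formats are assumptions, so check them against your own dataframe. Also note that `.dt.hour` returns 0-23 rather than 1-24.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up table standing in for the incidents data (NOT real records).
df = pd.DataFrame({
    "Category": ["X", "X", "X", "X", "Y", "Y", "Y", "Y", "Y", "Y"],
    "DayOfWeek": ["Friday", "Saturday", "Friday", "Sunday", "Monday",
                  "Tuesday", "Monday", "Wednesday", "Tuesday", "Monday"],
    "PD District": ["TENDERLOIN", "TENDERLOIN", "MISSION", "TENDERLOIN",
                    "RICHMOND", "RICHMOND", "MISSION", "RICHMOND",
                    "RICHMOND", "MISSION"],
    "Date": ["01/05/2018", "01/06/2018", "02/09/2018", "03/11/2018",
             "04/02/2018", "05/08/2018", "06/04/2018", "07/11/2018",
             "08/14/2018", "09/03/2018"],
    "Time": ["23:10", "22:45", "23:55", "21:30", "09:15",
             "10:05", "08:40", "11:20", "09:50", "10:30"],
})

# Extract hour (0-23) and month (1-12) from the Time and Date columns.
df["Hour"] = pd.to_datetime(df["Time"], format="%H:%M").dt.hour
df["Month"] = pd.to_datetime(df["Date"], format="%m/%d/%Y").dt.month

# Encode the categorical columns as integers (codes follow alphabetical order).
df["DayCode"] = df["DayOfWeek"].astype("category").cat.codes
df["DistrictCode"] = df["PD District"].astype("category").cat.codes

# Balance the data: randomly sample the same number of rows per category.
n_per_class = df["Category"].value_counts().min()
balanced = pd.concat([
    group.sample(n=n_per_class, random_state=0)
    for _, group in df.groupby("Category")
])

features = ["Hour", "DayCode", "Month", "DistrictCode"]
X_train, X_test, y_train, y_test = train_test_split(
    balanced[features], balanced["Category"],
    test_size=0.25, random_state=0, stratify=balanced["Category"],
)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
```

With only a handful of made-up rows the accuracy number means nothing, so treat this purely as a template. The integer codes also impose an arbitrary ordering on days and districts. Trees can usually cope with that, but one-hot encoding is an alternative worth trying.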

> There are also a couple of optional things to try if you're done with the exercise and have time to spend. 
> * Optional: Try out more variables like
>     * Year
>     * Something based on the Street-name.
>     * create a finer geo-grid based on GPS
>     * Etc
> * Optional: Do proper cross-validation for your algorithm
> * Optional: Include weather data

(Thanks to TA Germans for collaborating on the design of this exercise)
