# DAML 01 - Introduction

Michal Grochmal <michal.grochmal@city.ac.uk>

### People

* Michal Grochmal
* Cosmin Stamate

### Material

- [Python Data Science Handbook][1] by Jake VanderPlas
- [Think Stats, 2nd Edition][2] by Allen Downey
- [Intro to Information Retrieval][3] by Manning, Raghavan and Schütze


- [Statistics Done Wrong][4] by Alex Reinhart
- [Scikit Learn User Guide][5]
- [Deep Learning][6] by Goodfellow, Bengio and Courville

[1]: https://jakevdp.github.io/PythonDataScienceHandbook/
[2]: http://greenteapress.com/wp/think-stats-2e/
[3]: https://nlp.stanford.edu/IR-book/
[4]: https://www.statisticsdonewrong.com/
[5]: http://scikit-learn.org/stable/user_guide.html
[6]: http://www.deeplearningbook.org/

## Course Outline

1. Jupyter (and Python Review)
2. NumPy
3. Matplotlib
4. Pandas
5. Statistics and Analytics
6. Scikit Learn and Classification (e.g. KNN)
7. Regression and Feature Engineering
8. Clustering and PCA
9. Decision Trees (and Random Forests) and SVMs
10. Neural Networks
11. The future

### Data Analysis

* Statistics
* Data manipulation
* Visualization
* Rinse repeat
* Extract knowledge
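
A minimal sketch of the loop above, using only the Python standard library. The readings and the cut-off of 5.0 are invented purely for illustration, and the visualization step is left out to keep the example self-contained.

```python
import statistics

# Hypothetical measurements, invented purely for illustration.
readings = [12.1, 11.8, 12.4, 12.0, 55.0, 11.9, 12.3]


def summarise(values):
    """Return basic summary statistics for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


# Statistics: a first look at the raw data.
raw = summarise(readings)

# Data manipulation: drop readings far from the median (arbitrary cut-off).
cleaned = [r for r in readings if abs(r - raw["median"]) <= 5.0]

# Rinse, repeat: recompute the statistics on the cleaned data.
clean = summarise(cleaned)

print(raw)
print(clean)
```

Comparing the two summaries is the "extract knowledge" step: here a single extreme reading pulls the mean and the standard deviation far away from the rest of the data.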

### Machine Learning

* Statistics *and* Linear Algebra
* Data manipulation
* Program reproducible scenarios
* Build model
* Validate model
* Reuse model for same problem on a bigger scale
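
The steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the data is synthetic (two made-up clusters), the model is a hand-written 1-nearest-neighbour classifier, and the helper names `make_data` and `predict` are hypothetical. In the course itself a library such as scikit-learn would normally do this work.

```python
import random


def make_data(n_per_class, rng):
    """Two well separated 2D clusters, labelled 0 and 1 (synthetic data)."""
    data = []
    for label, centre in enumerate([(0.0, 0.0), (5.0, 5.0)]):
        for _ in range(n_per_class):
            point = (rng.gauss(centre[0], 1.0), rng.gauss(centre[1], 1.0))
            data.append((point, label))
    return data


def predict(train, point):
    """1-nearest-neighbour: return the label of the closest training point."""
    def sq_dist(item):
        (x, y), _ = item
        return (x - point[0]) ** 2 + (y - point[1]) ** 2
    return min(train, key=sq_dist)[1]


# Program a reproducible scenario: a fixed seed gives the same data every run.
rng = random.Random(42)
data = make_data(100, rng)
rng.shuffle(data)

# Build the model: for nearest neighbours, "training" is just storing the data.
train, test = data[:150], data[150:]

# Validate the model on points it has not seen.
correct = sum(predict(train, p) == label for p, label in test)
accuracy = correct / len(test)
print(f"accuracy: {accuracy:.2f}")

# Reuse the model on new data.
new_points = [(0.2, -0.1), (4.8, 5.3)]
predictions = [predict(train, p) for p in new_points]
print(predictions)
```

The split into `train` and `test` is what makes the accuracy figure meaningful: the model is scored only on points it did not store.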

### Data Science

Kind of a combination of both of the above.
The person performing them is often called a Data Scientist.
Typically, one first tries to tackle a difficult problem with data analysis and,
if that does not solve it, then attempts machine learning.
Examples:

- Handwriting
- How many people on twitter are "positive" about a hashtag?
- Network distribution

Often, but not always,
the difference between the use of plain analysis and machine learning is the scale of the problem.
Yet, are there problems that can be solved by data analysis but cannot be solved by machine learning?

- Bus watcher problem

That said, these kinds of problems are often not tackled by a data scientist.

---

What is a Data Scientist then?

##### Data scientist vs. System administrator

> SA: Why do you need 45 GB of server memory?
>
> DS: The model needs 5M iterations to train, and I need to do it in parallel.
>
> SA: But you're booting 7k python VMs for that, and forking the same process
> thousands of times without doing much work in each fork.  This is incredibly inefficient.
>
> DS: Sorry, but that is how the model library works.  It is in alpha phase,
> was just pinched together by a bunch of guys at Berkeley.
>
> SA: Wait, and you are dumping an alpha phase library into production?!
>
> DS: That is the only one that has a Convolutional NN model that works on our GPUs.

##### Data scientist vs. Software developer

> SD: We cannot use that code, it has globals, no encapsulation, not even an API.
>
> DS: All we need is that something calls this every minute.
>
> SD: No!  That's an extra webserver on top of the one we have.
> It ain't even integrated with our single sign on.
>
> DS: It does not need to be, it is just the solution for the ML part.
>
> SD: But this will stay forever in the codebase, and people will forget what it does.

##### Data scientist vs. Project manager

> PM: So, do we have the solution for that problem?
>
> DS: Yes!  I finally got it validated for 87% accuracy.
>
> PM: Good!  So it is almost done!  By when do you think you will get the other 13% done?
>
> DS: No, no, no.  It is done, it has an accuracy of 87%.
>
> PM: But we need a solution, not 87% of a solution.
>
> DS: That is not how machine learning works.

## History

*NumPy* and friends were developed in this century, but its predecessor was called *Numeric*,
and even before that parts of it were developed as other libraries.
For example, *NumPy* can still be built against linear algebra libraries (BLAS and LAPACK) that were originally written in FORTRAN.
The continuity of its development had its ups and downs,
in the same way that enthusiasm for machine learning
(or intelligent systems, or simply artificial intelligence, as it has been called)
had its ups and downs.

* FORTRAN
* MATLAB
* gnuplot
* S & R
* NumPy

### Data Resources

- [Kaggle Datasets][kaggle]
- [UCI ML Repository (Irvine)][uci]
- [ML Data][mldata]
- [data.gov.uk][gov]
- [KDnuggets Data][kdnuggets]

[kaggle]: https://www.kaggle.com/datasets
[uci]: http://archive.ics.uci.edu/ml/index.php
[mldata]: http://mldata.org/repository/data/
[gov]: https://data.gov.uk/search
[kdnuggets]: https://www.kdnuggets.com/datasets/index.html