# Data Science Guide

## Intro to Data Science: What is it about?

[Jake's Talk](https://www.youtube.com/watch?v=gsTiokP91JE&feature=youtu.be) is recommended by Insight. Watch it to understand how to get on board, especially if you are a PhD in a quantitative field.

## Learning path example

Credit: Modified from an email from Insight, [KDNuggets](http://www.kdnuggets.com/2015/11/seven-steps-machine-learning-python.html), and the Insight blog post ["Preparing for the Transition to Data Science"](https://blog.insightdatascience.com/preparing-for-the-transition-to-data-science-e9194c90b42c#.axsrpmnki)

### 1. Prepare hardware and software

1) Avoid Windows! - Open-source tools and working from the command line are important, so a Mac or a machine with Ubuntu installed is highly recommended. On a Windows machine, you can run Ubuntu either in a virtual machine or on a separate partition.

2) Update your OS. - Ubuntu 14.04 LTS for Linux or OS X version 10.9 or later for Mac.

3) Install Python. - The best way is to install it through Anaconda. Python version 3.3+ is recommended. Get familiar with Jupyter Notebook (formerly called IPython Notebook), which is included in Anaconda. A minimal starter guide for IPython is [here](http://cs231n.github.io/ipython-tutorial/). Use it as a data science diary.

### 2. Get basic knowledge in CS (especially in Python), math and stats.

This includes [CS fundamentals with Python](http://interactivepython.org/runestone/static/pythonds/index.html).

### 3. Learn how to use tools around data science

Python - [Google's Python Class](https://developers.google.com/edu/python/?csw=1), [Codecademy’s Python Class](https://www.codecademy.com/learn/python), or [Learn Python The Hard Way](https://learnpythonthehardway.org/book/).

- Python modules for generic quantitative analysis: SciPy, NumPy
- Python module for data analysis: Pandas. You can learn from Wes McKinney's [video](https://www.youtube.com/watch?v=w26x-z-BdWQ&feature=youtu.be) and these [exercises](https://github.com/lemonbalm/pandas-exercises). See also [Michael Hansen’s tutorial](http://synesthesiam.com/posts/an-introduction-to-pandas.html) and the [Pandas cookbook](http://pandas.pydata.org/pandas-docs/stable/cookbook.html).
- Python modules for machine learning: Scikit-learn

SQL - [SQL School](https://community.modeanalytics.com/sql/tutorial/introduction-to-sql/), [Khan Academy](https://www.khanacademy.org/computing/computer-programming/sql), or [Learn SQL The Hard Way](http://sql.learncodethehardway.org/book/)

- Connect to the database management systems [MySQL](http://zetcode.com/db/mysqlpython/) and [PostgreSQL](http://zetcode.com/db/postgresqlpythontutorial/) from a Python environment.
- Python module for accessing and modifying SQL databases in a more "pythonic" way: [SQLAlchemy](http://www.rmunn.com/sqlalchemy-tutorial/tutorial.html)
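
Python database drivers generally follow the same DB-API pattern (PEP 249): connect, get a cursor, execute SQL, fetch rows. Here is a minimal sketch of that pattern using the standard-library `sqlite3` module as a stand-in, an assumption made so the example runs without a MySQL or PostgreSQL server. The `scores` table and the `top_scores` helper are hypothetical.

```python
import sqlite3


def top_scores(rows, min_score):
    """Load (name, score) rows into an in-memory table and query it."""
    conn = sqlite3.connect(":memory:")  # in-memory database, nothing touches disk
    try:
        cur = conn.cursor()
        cur.execute("CREATE TABLE scores (name TEXT, score REAL)")
        # Placeholders (?) keep values out of the SQL string itself
        cur.executemany("INSERT INTO scores VALUES (?, ?)", rows)
        conn.commit()
        cur.execute(
            "SELECT name, score FROM scores WHERE score >= ? ORDER BY score DESC",
            (min_score,),
        )
        return cur.fetchall()
    finally:
        conn.close()


result = top_scores([("ann", 0.9), ("bob", 0.4), ("cat", 0.7)], 0.5)
print(result)
```

With a real MySQL or PostgreSQL driver, the connection call and the placeholder style typically differ, while the cursor workflow stays much the same.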


Web Development - [HTML/CSS](https://www.codecademy.com/learn/web) and [JavaScript](https://www.codecademy.com/learn/javascript)

### 4. Machine Learning with Python

Basic topics: k-means clustering, decision trees, linear regression, logistic regression
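
To make one of these basic topics concrete, here is a minimal sketch of linear regression with a single feature, fitted by ordinary least squares in plain Python. The function name and the toy data are made up for illustration. In practice you would more likely use scikit-learn.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```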

Advanced topics: support vector machines, ensemble classifiers, dimensionality reduction, model validation techniques
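
One widely used model validation technique is k-fold cross-validation: split the data into k folds, then train on k-1 folds and test on the remaining one, rotating through all folds. The sketch below shows only the index bookkeeping in plain Python. `k_fold_indices` is a hypothetical helper, and it assumes the data has already been shuffled. scikit-learn offers a `KFold` class for real use.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print(len(train), test)
```

Every sample lands in exactly one test fold, so each data point is used for validation once.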

- [Andrew Ng’s Coursera Course](https://www.coursera.org/learn/machine-learning/home/info?source=cdpv2)


### 5. Specialize

Deep Learning in Python!

### 6. Get inspired and informed

See some cool data science projects
- Projects from Insight fellows: [Brenton's Legislator Prognosticator](https://blog.insightdatascience.com/legislator-prognosticator-45aa6fd12013#.2yr1c1l84), [Ruth's Star Wars and Machine Learning](https://blog.insightdatascience.com/catching-star-wars-surprises-and-other-spoilers-with-machine-learning-60803195761e#.timejt1rb), and [Lavanya's Building an Automatic Keyword Extractor for URX](https://blog.insightdatascience.com/building-an-automatic-keyword-extractor-for-urx-d1923d9d272e#.5mzisu9wt).
- [Visualizing MBTA Data](http://mbtaviz.github.io/)

Join the data science community. Even if you just glance at one or two of the following every couple of days, you'll start to see interesting trends and become much more “in the know”:

- [The Daily Crunch](https://techcrunch.com/crunch-daily/), [DataTau](http://www.datatau.com/), [Data Science on Reddit](https://www.reddit.com/r/datascience/), and [Twitter](https://twitter.com/InsightDataSci/lists/data-scientists)


### 7. Try your own project(s)!

## Free Resources & Books

Data Science: The Art of Data Science https://leanpub.com/artofdatascience

Big Data: Big Data Now http://www.oreilly.com/data/free/big-data-now-2014-edition.csp

Apache Hadoop: Hadoop explained https://www.packtpub.com/packt/free-ebook/hadoop-explained

Apache Spark: Mastering Apache Spark https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details

Theoretical Machine Learning: Elements of Statistical Learning http://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf

Practical Machine Learning: An Introduction to Statistical Learning with Applications in R http://www-bcf.usc.edu/~gareth/ISL/

Deep Learning: Neural Networks and Deep Learning http://neuralnetworksanddeeplearning.com/, Deep Learning http://www.deeplearningbook.org/

Data Mining: Mining of Massive Datasets http://www.mmds.org/

Statistics for Data Science:  Think Stats: Probability and Statistics for Programmers http://www.greenteapress.com/thinkstats2/index.html


Learn Data Science with Python: step by step (https://www.dataquest.io/blog/python-data-science/?utm_content=bufferabe5e&utm_medium=social&utm_source=facebook.com&utm_campaign=buffer)

One of the most complete resources on Data Science (https://github.com/donnemartin/data-science-ipython-notebooks)

General guideline to Data Science Online course (https://github.com/datasciencemasters/go)


## Basic Operations

<img src="figures/data-science-process.jpg" width = "500x">

(Source: http://www.kdnuggets.com/2016/03/data-science-process-rediscovered.html, http://www.kdnuggets.com/2015/03/10-steps-success-kaggle-data-science-competitions.html)

Stage 1: Ask A Question. Set a clear objective. Understand the performance measure.

- Skills: science, domain expertise, curiosity
- Tools: your brain, talking to experts, experience

Stage 2: Get the Data

- Skills: web scraping, data cleaning, querying databases, CS stuff
- Tools: python, pandas
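
As a small illustration of the data cleaning skill, the sketch below parses CSV text with the standard library, trims whitespace, normalizes capitalization, and drops duplicate and incomplete rows. The sample data and the `clean_rows` helper are made up, and dropping rows with a missing age is only one possible policy. pandas handles the same tasks more conveniently on real datasets.

```python
import csv
import io

# Made-up sample with typical problems: a missing value, stray spaces,
# inconsistent capitalization, and a duplicate row
RAW = """name,age,city
Alice,34,Boston
Bob,,Chicago
 Carol ,29,boston
Alice,34,Boston
"""


def clean_rows(text):
    """Return cleaned rows as dicts, skipping duplicates and rows without an age."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        name = row["name"].strip()
        age = row["age"].strip()
        city = row["city"].strip().title()
        if not age:
            continue  # incomplete row
        key = (name, age, city)
        if key in seen:
            continue  # duplicate row
        seen.add(key)
        cleaned.append({"name": name, "age": int(age), "city": city})
    return cleaned


rows = clean_rows(RAW)
print(rows)
```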

Stage 3: Explore the Data. Know your data.

- Skills: getting to know the data, developing hypotheses, spotting patterns and anomalies
- Tools: matplotlib, numpy, scipy, pandas, mrjob
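
A first pass at "knowing your data" often means computing summary statistics and looking for values that stand out. Here is a minimal sketch with the standard-library `statistics` module. The data, the helper names, and the threshold of two standard deviations are illustrative assumptions, not a recommended rule.

```python
import statistics


def summarize(values):
    """Return basic summary statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def flag_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > threshold * stdev]


data = [10, 12, 11, 13, 12, 11, 50]
summary = summarize(data)
outliers = flag_outliers(data)
print(summary)
print(outliers)
```

Note how the single large value pulls the mean well above the median. A gap like that is a hint to look at the distribution with a plot.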

Stage 4: Model the Data. Set up a local validation environment. Apply basics first. Combine several models.

- Skills: regression, machine learning, validation, big data
- Tools: scikit-learn, pandas, mrjob, mapreduce
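
The workflow in this stage can be sketched in plain Python: hold out a validation set, start with a trivial baseline, add a slightly better model, and try a simple combination (here, averaging predictions). All function names and the synthetic data are assumptions for illustration only.

```python
import random


def train_validation_split(rows, validation_fraction=0.25, seed=0):
    """Shuffle with a fixed seed for reproducibility, then hold out a fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def mean_model(train):
    """Baseline: always predict the mean target seen in training."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def line_model(train):
    """One-feature least-squares line."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    return lambda x: mean_y + slope * (x - mean_x)


def average_models(models):
    """Combine models by averaging their predictions."""
    return lambda x: sum(m(x) for m in models) / len(models)


def mse(model, rows):
    """Mean squared error of a model on (x, y) rows."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)


# Synthetic, noise-free data on y = 2x + 1
data = [(float(x), 2.0 * x + 1.0) for x in range(20)]
train, validation = train_validation_split(data)
baseline = mean_model(train)
line = line_model(train)
blend = average_models([baseline, line])
for name, model in [("baseline", baseline), ("line", line), ("blend", blend)]:
    print(name, mse(model, validation))
```

On this noise-free data the line alone wins, and blending it with the weak baseline makes it worse. Combining models tends to help only when the individual models are reasonably good and make different kinds of errors, which is exactly what the local validation set lets you check.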

Stage 5: Communicate the Data

- Skills: presentation, speaking, visuals, writing
- Tools: matplotlib, adobe illustrator, powerpoint/keynote

## Which model?

<img src="figures/scikit-learn-algorithm-cheatsheet.png" width = "800x">

Popular data mining algorithms (http://www.kdnuggets.com/2015/05/top-10-data-mining-algorithms-explained.html):

1. C4.5
2. k-means
3. Support vector machines
4. Apriori
5. EM
6. PageRank
7. AdaBoost
8. kNN
9. Naive Bayes
10. CART
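
As an example of explaining one of these in simple terms, kNN (k-nearest neighbors) classifies a new point by majority vote among the k closest labeled points. Here is a minimal plain-Python sketch using Euclidean distance. The function name and toy data are hypothetical, and real projects would normally use scikit-learn.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict a label for `query` from `train`, a list of (features, label) pairs."""

    def distance(a, b):
        # Euclidean distance between two equal-length feature tuples
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    nearest = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Two well-separated toy clusters
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]
print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.0, 5.2)))
```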

Other popular models that are relatively parallelizable:

1. Generalized Linear Models (GLMs), plus Lasso and Elastic-Net regularized GLMs
2. Gradient boosting machine (GBM)
3. Random Forest (RF)
4. Deep Learning (DL)

What you need to know about them:

- In simple terms, what does it do?
- What are the closest algorithms? What makes it different?
- What are its advantages? What are its disadvantages?


## Q: Are you sure you learned something? Try to explain the terms on this page and draw a concept map of them.