# Prologue to Data Science

> It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.

<footer>~ Arthur Conan Doyle, Sherlock Holmes</footer>

This guide acts as a prologue to Data Science and is a prerequisite for [General Assembly's Data Science](https://generalassemb.ly/education/data-science) course. It will guide you through the basics of the Python programming language, teach you how to operate your computer through the command line, and refresh your stats knowledge.

The modern Data Scientist doesn't always need to know the mathematics that goes on behind the scenes, but they _do_ need to be intimately familiar with the characteristics of the various machine learning algorithms: which types of data they are suitable for, how to measure their accuracy, and how to interpret their output.

To _design_ and _assess_ the validity of your data models, you'll need at least a **statistics** vocabulary equivalent to that offered by a college-level course. The materials referenced in this guide provide a succinct refresher. But as you'll also want to _implement_ and _iterate_ over your data models once you've designed them, you also need some **programming** skills. Data Scientists often work with a scripting language that has strong machine learning libraries. _R_ is a common contender, but for the purposes of this introduction _Python_ is used to teach you the basics of computational thinking.

Each of the following banners will lead you to an online resource that will set you on your journey as a fledgling Data Scientist. The topics presented here are both wide and deep, so where appropriate, the _minimum_ requirements are listed. Data Science is an intellectually challenging pursuit, but don't let that deter you. Data Scientists are often just better statisticians than most programmers, and better programmers than most statisticians. It's through a hybrid of the two domains that they develop their invaluable skill set.

![break](assets/code.png)

## Programming Fundamentals

### Python

Codecademy is a free website with tutorials that teach users rudimentary programming. Its Python course is aimed at non-programmers and will slowly take you through various programming concepts. The course is split into teaching and practice units, so you'll also learn why certain techniques are useful.

**Minimum Requirement** : up until Codecademy's _exam statistics_ unit

**Estimated Time** : 8 Hours

**Alternative** : If you'd prefer a book over Codecademy's interactive learning method, consider reading [Think Python](http://www.greenteapress.com/thinkpython/thinkpython.html). It starts with basic concepts of programming, and is carefully designed to define all terms when they are first used and to develop each new concept in a logical progression. 

**Advanced** : If you are already familiar with another programming language, and would just like to get acquainted with the Python syntax, follow the [Learn Python](http://www.learnpython.org/) path instead as it does less hand-holding.

![resource](assets/codecademy.png) [Codecademy](http://codecademy.com)

### Command Line

The popular introductory manual 'Learn Python the Hard Way' comes with an appendix about the command line. It's a crash course in using the command line to make your computer perform tasks. As a crash course, it's not as detailed or extensive as dedicated guides. It is simply designed to get you barely capable enough to start using your computer like a real programmer does. When you're done with the appendix, you will be able to give most of the basic commands that every shell user touches every day. You'll understand the basics of directories and a few other concepts.

Learning how to use the command line is important: if you want to learn to code, then you _must_ learn this. Programming languages are advanced ways to control your computer with language. The command line is the baby brother of programming languages. Learning the command line teaches you to control the computer using language. Once you get past that, you can move on to writing code and feeling like you actually own the hunk of metal in front of you.

**Minimum Requirement** : Exercises 1-15

**Advanced** : Consult the [Bash Cheat Sheet](http://cli.learncodethehardway.org/bash_cheat_sheet.pdf) and see if there's a trick or two you could add to your toolbelt

**Estimated Time** : 2 Hours

![resource](assets/terminal.jpg) [Command Line Crash Course](http://learnpythonthehardway.org/book/appendix-a-cli/introduction.html)

### Python Pandas

Pandas is a Python library for doing data analysis. It's really fast and lets you do exploratory work incredibly quickly. You can think of pandas as the tool that holds your data. The better you know how to merge in new data and ask for a particular subset of data, the simpler it will be for you to bring more evidence into your dataset and answer more specific questions about your data.
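To make "merging in new data" and "asking for a particular subset" concrete, here is a minimal sketch. The two tables, their column names, and their values are made up for illustration; only `DataFrame.merge` and boolean indexing are standard pandas features.

```python
import pandas as pd

# Hypothetical data: store names, units, and cities are invented for this example.
sales = pd.DataFrame({
    "store": ["A", "B", "C", "A"],
    "units": [10, 3, 7, 5],
})
stores = pd.DataFrame({
    "store": ["A", "B", "C"],
    "city": ["Hong Kong", "London", "New York"],
})

# Merge in new data: attach each store's city to every sales row.
combined = sales.merge(stores, on="store", how="left")

# Ask for a particular subset: Hong Kong rows with more than 4 units sold.
subset = combined[(combined["city"] == "Hong Kong") & (combined["units"] > 4)]

# Answer a specific question about that subset.
total_units = int(subset["units"].sum())
print(subset)
print(total_units)
```

A left merge keeps every row of `sales` even if a store were missing from `stores`, which is usually the safer default while exploring.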

**Minimum Requirement** : Lessons 1 - 2, 4 & 6, Exercises 1 - 3

**Estimated Time** : 2-4 Hours

**Alternative** : Depending on your available time, you might want to check out the [10 minute intro to Pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html#min), but if you want a more thorough introduction, I suggest you follow some of the lessons provided by Learn Pandas. Learn Pandas is a short guide spanning 11 lessons and 4 exercises on using pandas for data manipulation and analysis.

**Advanced** : Explore the [Pandas Cookbook](https://github.com/jvns/pandas-cookbook) and see what neat features you can use in your own data analysis.



![resource](assets/pandas.png) [Learn Pandas](https://bitbucket.org/hrojas/learn-pandas)

### Practice Python

If programming isn't your strong suit and time allows you to develop your programming skills further, consider joining CheckiO. CheckiO is an online game where coders compete and collaborate. For novice coders, the service is a self-tutoring community where code review and feedback are game elements.


Remember that Codecademy also provides an excellent [glossary](http://www.codecademy.com/glossary/python) of concepts and techniques you'll likely employ in your adventures.

![resource](assets/checkio.png) [CheckiO](http://check.io)

![break](assets/theory.png)

## Statistics Refresher

_Think Stats_ is an introduction to Probability and Statistics for people who have some exposure to Python. It emphasizes simple techniques you can use to explore real data sets and answer interesting questions. Readers are encouraged to work on a project with real datasets.

Because it uses a programming language, it covers data analysis from beginning to end: viewing data, calculating descriptive statistics, identifying outliers, and describing data using distributions (and explaining what the distributions really mean!). The goal of this small book is understanding and using statistics, not just learning statistics.
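As a taste of that workflow, the sketch below computes a few descriptive statistics and flags outliers using only Python's standard `statistics` module. The scores are invented, and the 1.5 × IQR rule is one common convention for flagging outliers, not necessarily the approach the book takes. `statistics.quantiles` requires Python 3.8 or later.

```python
import statistics

# Hypothetical sample: made-up exam scores with one suspicious value.
scores = [62, 65, 68, 70, 71, 73, 75, 78, 80, 140]

# Descriptive statistics.
mean = statistics.mean(scores)
median = statistics.median(scores)
stdev = statistics.stdev(scores)  # sample standard deviation

# Quartiles and interquartile range (IQR).
q1, _, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1

# Flag values more than 1.5 * IQR outside the quartiles (a common rule of thumb).
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in scores if x < low or x > high]

print(f"mean={mean:.1f} median={median} stdev={stdev:.1f}")
print("outliers:", outliers)
```

Note how the single extreme score pulls the mean well above the median, which is exactly the kind of observation that viewing the data first makes easy to spot.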

**Minimum Requirement** : Chapters 1 - 5

**Estimated Time** : 8 Hours

**Alternative** : You could instead sample some [Khan Academy](https://www.youtube.com/playlist?list=PL4C863861E3B2E380) videos from their statistics playlist. Each video focuses on a single topic, so they allow you to easily plug the gaps in your knowledge.


![resource](assets/think_stats.png) [Think Stats](http://www.greenteapress.com/thinkstats/index.html)

![break](assets/resources.png)

## Resources for Further Study

Upon strengthening your knowledge base in both Python and Statistics, you'll be ready to embark on your journey to become a Data Scientist. [General Assembly's Data Science](https://generalassemb.ly/education/data-science) course picks up where this prologue ends. It first offers a chance to clarify anything that wasn't clear from the prologue. The instructor then sets off on an 11-week tour to develop the student's programming ability and knowledge of statistical methods. The course provides an in-depth overview of the most popular machine learning algorithms, and culminates in an individual data science project.

The course curriculum was developed in-house by General Assembly, but any further study of Data Science benefits from having these two books handy for reference.

_Building Machine Learning Systems with Python_ shows you how to find patterns in raw data. The book starts by brushing up on your Python ML knowledge and introducing libraries, and then moves on to more serious projects covering datasets, modelling, and recommendations, improving those recommendations through examples, and working through sound and image processing in detail.

![resource](assets/building_machine_learning_python.png) [Building Machine Learning Systems](http://shop.oreilly.com/product/9781782161400.do)

_Python for Data Analysis_ is concerned with the nuts and bolts of manipulating, processing, cleaning, and crunching data in Python. It is also a practical, modern introduction to scientific computing in Python, tailored for data-intensive applications. It was authored by the lead developer of the pandas package discussed here, so it acts as an insider's guide to everything pandas.

![resource](assets/pydata.png) [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do)

With sophisticated analytics, cool new technologies, lean learning principles and agile delivery methods, data science is an exciting, emerging field to join.

Please reach out if you have questions about anything or need help!

All the best,

Mart van de Ven

**Data Science Instructor, Hong Kong**

![resource](assets/nounproject.png) [Credits](http://nbviewer.ipython.org/gist/anonymous/b2896a7012f262f674f0)
