# Python Data Science Handbook

## Preface: What is data science?

Data science comprises three distinct and overlapping areas: the skills of a **statistician** who knows how to model and summarize datasets; the skills of a **computer scientist** who can design and use algorithms to efficiently store, process, and visualize this data; and the domain expertise—what we might think of as "classical" training in a subject—necessary both to formulate the right questions and to put their answers in context.

The usefulness of Python for data science stems primarily from the large and active ecosystem of third-party packages: **NumPy** for manipulation of homogeneous array-based data, **Pandas** for manipulation of heterogeneous and labeled data, **SciPy** for common scientific computing tasks, **Matplotlib** for publication-quality visualizations, **IPython** for interactive execution and sharing of code, **Scikit-Learn** for machine learning, and many more tools.

## 1. IPython: Beyond Normal Python - Interactive Python

*"Tools for the entire life cycle of research computing."*

If Python is the engine of our data science task, you might think of IPython as the interactive control panel.

IPython provides three tools for quickly accessing information about objects: the **?** character to explore documentation, the **??** characters to explore source code, and the **Tab key** for auto-completion.

Every Python object contains a **reference to a string**, known as a **docstring**, which in most cases contains a concise summary of the object and how to use it. Python's built-in **help()** function accesses this information and prints the result.
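As a rough illustration in plain Python (outside IPython), the docstring is stored on the object's `__doc__` attribute, and `inspect.getdoc` returns a cleaned-up copy of it. The `square` function here is just a small example function.

```python
import inspect


def square(a):
    """Return the square of a."""
    return a ** 2


# The raw docstring lives on the __doc__ attribute.
raw_doc = square.__doc__

# inspect.getdoc returns the same text with indentation normalized.
clean_doc = inspect.getdoc(square)

print(raw_doc)
print(inspect.getdoc(len))  # built-in functions carry docstrings too
```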

IPython introduces the **?** character as a shorthand for accessing this documentation and other relevant information.

`len?`

`L.insert?`

`L?`

`def square(a):
  """Return the square of a."""
  return a ** 2`

`square?`

`square??`

Sometimes the **??** suffix doesn't display any source code: this is generally because the object in question is not implemented in Python. If this is the case, the ?? suffix gives the same output as the ? suffix.

`len??`
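The same limitation can be seen in plain Python. The sketch below assumes CPython, where `len` is implemented in C, and uses a hypothetical helper, `source_or_none`, that wraps the standard `inspect.getsource` call.

```python
import inspect


def source_or_none(obj):
    """Return the Python source of obj, or None if it cannot be retrieved."""
    try:
        return inspect.getsource(obj)
    except (TypeError, OSError):
        # TypeError: obj is a built-in that is not implemented in Python.
        # OSError: the source file is not available on this system.
        return None


# len has no Python source to show.
print(source_or_none(len))

# A function written in Python usually has retrievable source,
# provided the standard library .py files are installed.
print(source_or_none(inspect.getsource) is not None)
```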

`L.<TAB>`

`L.c<TAB>`

`L._<TAB>` shows private and dunder (double-underscore) methods, which are omitted from the completion list by default

`from itertools import co<TAB>`

`*Warning?` wildcard matching: lists every object in the namespace whose name ends with `Warning`

`str.*find*?`
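The wildcard queries above can be approximated in plain Python by filtering `dir()` output with shell-style patterns. This is only an analogy, not how IPython implements the feature, and `match_names` is a hypothetical helper.

```python
import builtins
from fnmatch import fnmatchcase


def match_names(obj, pattern):
    """Return the attribute names of obj that match a shell-style pattern."""
    return [name for name in dir(obj) if fnmatchcase(name, pattern)]


# Roughly what `*Warning?` reports for the built-in namespace.
warnings_found = match_names(builtins, "*Warning")

# Roughly what `str.*find*?` reports.
find_methods = match_names(str, "*find*")

print(warnings_found)
print(find_methods)  # ['find', 'rfind']
```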

### Magic Commands

When working in the IPython interpreter, one common gotcha is that pasting multi-line code blocks can lead to unexpected errors.

`%paste
def donothing(x):
  return x`

A command with a similar intent is **%cpaste**, which opens up an interactive multiline prompt in which you can paste one or more chunks of code to be executed in a batch.

`%run myscript.py`

Note also that after you've run this script, any functions defined within it are available for use in your IPython session.

Another example of a useful magic function is **%timeit**, which will automatically determine the execution time of the single-line Python statement that follows it.

`%timeit L = [n ** 2 for n in range(1000)]
1000 loops, best of 3: 325 µs per loop`

The benefit of **%timeit** is that for short commands it will automatically perform multiple runs in order to attain more robust results. For multiline statements, adding a second % sign turns this into a cell magic that can handle multiple lines of input.

`%%timeit
L = []
for n in range(1000):
    L.append(n ** 2)
1000 loops, best of 3: 373 µs per loop`
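Outside IPython, a comparable measurement can be made with the standard library's `timeit` module. This is a minimal sketch: the loop and repeat counts are fixed by hand here, whereas `%timeit` picks them automatically, and the printed numbers depend on the machine.

```python
import timeit

statement = "L = [n ** 2 for n in range(1000)]"

# Run the statement 1000 times per trial, for 3 trials.
trials = timeit.repeat(statement, number=1000, repeat=3)

# Report the best trial, converted to microseconds per loop.
best_us = min(trials) / 1000 * 1e6
print(f"1000 loops, best of 3: {best_us:.0f} µs per loop")
```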

`%timeit?`

`%magic` To access a general description of available magic functions

`%lsmagic` For a quick and simple list of all available magic functions

Inputs and outputs are displayed in the shell with **In/Out** labels, but there's more: IPython actually creates some Python variables called **In** and **Out** that are automatically updated to reflect this history.

The standard Python shell contains just one simple shortcut for accessing previous output: the variable **_**, which also works in IPython.

`print(_)`
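In the standard shell, the `_` shortcut comes from the interpreter's display hook, which stores each displayed result in `builtins._`. A minimal sketch of that mechanism, run as a plain script rather than at a prompt (IPython installs its own display hook, so this describes only the standard interpreter):

```python
import builtins
import sys

# sys.__displayhook__ is the interpreter's original display hook. The
# interactive prompt calls it with the result of each expression.
sys.__displayhook__(6 * 7)  # prints 42

# The hook stored the value it displayed in builtins._
last_output = builtins._
print(last_output)
```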

You can use a double underscore to access the second-to-last output, and a triple underscore to access the third-to-last output. IPython stops there.

`print(__)`
`print(___)`

`Out[2]`
`_2` shorthand for `Out[2]`

`math.sin(2) + math.cos(2);` suppress the output of a command by adding a semicolon

`%history -n 1-4` For accessing a batch of previous inputs

`%rerun` For re-executing some portion of the command history

`%save` For saving some set of the command history to a file

### IPython Shell

`!echo "hello world"`

`!pwd` print working directory

`!ls` list working directory

`cd projects/` change directory

`mkdir myproject` make directory

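`mv ../myproject.txt ./` move the file from the parent directory into the current directory

`cp myproject.txt tmp/` copy the file into the `tmp/` directory

`rm -r tmp` remove the `tmp` directory and everything in it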

`contents = !ls`
`directory = !pwd`

`type(directory)
IPython.utils.text.SList`

This looks and acts a lot like a Python list, but has additional functionality, such as the **grep and fields methods** and the **s, n, and p properties** that allow you to search, filter, and display the results.

`message = "hello from Python"
!echo {message}`
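For comparison, a plain-Python sketch of the same idea uses `subprocess` to pass a Python value to a child process and capture its output as a list of lines. A Python one-liner stands in for `echo` so the example does not depend on the platform; that substitution is an assumption of this sketch, not something IPython does.

```python
import subprocess
import sys

message = "hello from Python"

# Launch a child process and capture what it prints.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", message],
    capture_output=True,
    text=True,
    check=True,
)

# Split the captured text into a list of lines, similar in spirit to SList.
lines = result.stdout.splitlines()
print(lines)  # ['hello from Python']
```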

You cannot use `!cd` to navigate the filesystem, because shell commands in the notebook are executed in a temporary subshell and the directory change is lost when that subshell exits. To change the working directory in a way that persists, use the **%cd** magic command instead.

`%cd ..`
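The subshell behavior can be reproduced with the standard library. In this sketch, a `cd` issued in a child shell leaves the parent process where it was, while `os.chdir` changes the directory of the current process. The temporary directory is used only as a convenient target.

```python
import os
import subprocess
import tempfile

start = os.getcwd()

# Changing directory in a child shell affects only that child process.
subprocess.run("cd ..", shell=True, check=True)
after_subshell = os.getcwd()
print(after_subshell == start)  # True: the parent process did not move

# os.chdir changes the directory of the current process, so it persists.
target = os.path.realpath(tempfile.gettempdir())
os.chdir(target)
after_chdir = os.path.realpath(os.getcwd())
print(after_chdir == target)  # True

# Restore the original directory.
os.chdir(start)
```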

`cd myproject` works without the % sign because **%cd** is an automagic function; this behavior can be toggled with the **%automagic** magic function.

Besides **%cd**, other available shell-like magic functions are **%cat, %cp, %env, %ls, %man, %mkdir, %more, %mv, %pwd, %rm, and %rmdir**, any of which can be used without the % sign if automagic is on. This makes it so that you can almost treat the IPython prompt as if it's a normal shell.

### Controlling Exceptions

With the **%xmode** magic function, IPython allows you to control the amount of information printed when an exception is raised. **%xmode** takes a single argument, the **mode**, and there are three possibilities: **Plain, Context, and Verbose**. The default is Context. Plain is more compact and gives less information. The Verbose mode adds some extra information, including the arguments to any functions that are called.

The standard Python tool for interactive debugging is **pdb**, the Python debugger. The IPython-enhanced version of this is **ipdb**, the IPython debugger.

In IPython, the most convenient interface to debugging is the **%debug** magic command. If you call it after hitting an exception, it will automatically open an **interactive debugging prompt** at the point of the exception. The ipdb prompt lets you explore the current state of the stack, explore the available variables, and run Python commands.

If you'd like the debugger to launch automatically whenever an exception is raised, you can use the %pdb magic function to turn on this automatic behavior:

`%xmode Plain`
`%pdb on`

`Exception reporting mode: Plain`
`Automatic pdb calling has been turned ON`
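To get a feel for the difference between a compact and a detailed report without IPython, the standard `traceback` module can format the same exception with and without local variables. This is only an analogy to the `%xmode` modes, and `func1`, `func2`, and `capture_reports` are hypothetical names used for illustration.

```python
import traceback


def func1(a, b):
    return a / b


def func2(x):
    a = x
    b = x - 1
    return func1(a, b)


def capture_reports():
    """Return a standard and a locals-annotated traceback for one failure."""
    try:
        func2(1)
    except ZeroDivisionError as exc:
        # Standard report: the call chain and the error message.
        plain = "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        )
        # Detailed report: the same chain plus each frame's local variables.
        detailed = "".join(
            traceback.TracebackException.from_exception(
                exc, capture_locals=True
            ).format()
        )
        return plain, detailed


plain, detailed = capture_reports()
print(plain)
print(detailed)
```

For interactive inspection after a failure in a plain script, the standard library's `pdb.post_mortem` plays a role similar to `%debug`.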

### Profiling and Timing Code

- **%time**: Time the execution of a single statement
- **%timeit**: Time repeated execution of a single statement for more accuracy
- **%prun**: Run code with the profiler
- **%lprun**: Run code with the line-by-line profiler (requires the `line_profiler` extension)
- **%memit**: Measure the memory use of a single statement (requires the `memory_profiler` extension)
- **%mprun**: Run code with the line-by-line memory profiler (requires the `memory_profiler` extension)
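As a sketch of what function-level profiling looks like outside IPython, the standard library's `cProfile` and `pstats` modules can produce a similar report. The workload function `sum_of_lists` is hypothetical and exists only to give the profiler something to measure.

```python
import cProfile
import io
import pstats


def sum_of_lists(n):
    """Hypothetical workload: build a few lists and sum them."""
    total = 0
    for i in range(5):
        values = [j ^ (j >> i) for j in range(n)]
        total += sum(values)
    return total


profiler = cProfile.Profile()
profiler.enable()
result = sum_of_lists(100_000)
profiler.disable()

# Summarize the five most expensive entries by cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```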

Timing the same statement with %time generally reports a longer duration than %timeit. This is because %timeit does some clever things under the hood to prevent system calls from interfering with the timing. For example, it prevents cleanup of unused Python objects (known as garbage collection), which might otherwise affect the timing. For this reason, %timeit results are usually noticeably faster than %time results.
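The garbage collection behavior can be explored with the standard `timeit` module, which switches garbage collection off during timing by default and lets you turn it back on through the setup statement. This is a rough sketch: the statement and loop counts are arbitrary, and on a given machine the difference may be small or lost in noise.

```python
import timeit

statement = "L = [str(n) for n in range(1000)]"

# Default behavior: garbage collection is switched off during timing.
gc_off = min(timeit.repeat(statement, number=200, repeat=3))

# Re-enable garbage collection through the setup statement.
gc_on = min(
    timeit.repeat(
        statement, setup="import gc; gc.enable()", number=200, repeat=3
    )
)

print(f"GC disabled: {gc_off / 200 * 1e6:.1f} µs per loop")
print(f"GC enabled:  {gc_on / 200 * 1e6:.1f} µs per loop")
```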