# PSTAT 134/234 - Statistical Data Science <a class="tocSkip">
    

## Instructor: Sang-Yun Oh <a class="tocSkip">

# What is Data Science?

## (Classical) Scientific Method

![Scientific Method](images/scientific-method.png)
[[WikiMedia](https://commons.wikimedia.org/wiki/File:The_Scientific_Method_as_an_Ongoing_Process.svg)]

The scientific method has been in use since roughly the 1200s and was formalized in the 1500s

## New Paradigm of Science

**Fourth Paradigm: Data-intensive scientific discovery** (Jim Gray): everything about science is changing

* Scope of Research has broadened: **Basic science + Business insight**

* "Some models are useful": **Precise + Approximate**

* What is more important? **Understanding + Prediction**

## Data Science is ...

* A multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data [reference](https://doi.org/10.1145/2500499)

* Merger of statistics, data analysis, machine learning and their related methods in order to understand and analyze actual phenomena with data [reference](https://doi.org/10.1007/978-4-431-65950-1_3)

* Composed of techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, and information science. [reference](https://www.microsoft.com/en-us/research/publication/fourth-paradigm-data-intensive-scientific-discovery/)

## Beginnings of Data Science

![Cleveland](images/data-science-cleveland.png)
[[International Statistical Review](http://doi.org/10.2307/1403527)]

The article proposed a data science program to train a new generation of data analysts:

* Domain expertise: data analysis collaborations in subject matter areas.

* Mathematics/Statistics: models, estimation, and distribution based on probabilistic inference.

* Computing: hardware and software; computational algorithms
    
* Theory: foundations of data science; mathematical investigations of models and methods
    


## Data Scientific Approach

![DataScienceLifeCycle](images/DataScienceLifeCycle.jpg)
[[UC Berkeley, School of Information](https://datascience.berkeley.edu/about/what-is-data-science/)]

# In PSTAT 134, you will ...

- Analyze real sports, healthcare, financial, and text data for insight

- Learn data formats and ways to access them:  
    e.g. JSON data over web-based application programming interface (API) 

- Learn and practice new tools:  
    e.g. Jupyter notebook, pandas, interactive widgets, command line interface

- Practice conceptualization and communication of ideas through open-ended questions and final group project

- Use documentation effectively to achieve programming objectives
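
As a small illustration of the JSON format mentioned above, the sketch below parses a JSON string with Python's standard `json` module. The payload (team name, opponents, and points) is made up for illustration and does not come from any real API; in practice the text would be the body of a web API response.

```python
import json

# Hypothetical payload, similar in shape to what a web API might return
raw = """
{"team": "Gauchos",
 "games": [{"opponent": "A", "points": 71},
           {"opponent": "B", "points": 64}]}
"""

record = json.loads(raw)  # JSON text -> nested Python dicts and lists

# Pull one field out of every game and summarize it
points = [game["points"] for game in record["games"]]
average_points = sum(points) / len(points)

print(record["team"], average_points)
```

JSON objects become Python dictionaries and JSON arrays become lists, so the usual indexing and comprehension tools apply directly.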

# Course Information

## Teaching Staff 

### Instructor: Sang-Yun Oh <a class="tocSkip">

- Lectures: MW 11 am - 12:15 pm
- Room: PHELP3515
- Office: South Hall 5514

### TA: Zhipu (Aaron) Zhou <a class="tocSkip">
    
### TA: Fanqi (Franky) Meng <a class="tocSkip">
    
### Tutor: Dorsa Jenab <a class="tocSkip">
    
### Tutor: Junayed Naushad <a class="tocSkip">
    
### Tutor: Yiyi Xu <a class="tocSkip">

## Grading

* **Attendance**: both lectures and sections (**10%**)  
    Total of five will be dropped. _No exceptions_

* **Midterm** (11/6): computer-based individual in-class exam (**25%**)

* **Assignments**: discussion is encouraged, but the write-up must be individual work (**35%**)

* **Project**: group final project & poster session (**30%**)  
    Poster presentation: 12/12 @ 12-3pm  
    Report due: 12/12 @ 11:59 pm

## Course Material

- [PSTAT 134 - Github repository](https://github.com/UCSB-PSTAT-134-234/Fall2019)

- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) by Jake Vanderplas  
    
- Written in Jupyter notebooks, with [the source code available on Github](https://github.com/jakevdp/PythonDataScienceHandbook)

- You can follow the book in our course JupyterHub (file upload)
    ![](https://covers.oreillystatic.com/images/0636920034919/lrg.jpg) 
    

## Tentative Weekly Outline 

---

* **Week 1**: Data and uncertainty   
    - Computing: Jupyter notebook and Python primer
    
* **Week 2**: Data scraping, transformation, and wrangling
    - Computing: command line interface and Pandas
        
* **Week 3-4**: Visualization and exploratory analysis
    - Computing: Scikit-learn, Matplotlib, Seaborn

* **Week 5-6**: Finance data module

* **Week 7-8**: Health data module
               
* **Week 9**: Text data module

* **Final Projects**: Final project poster session

# Computational Environment

<table style="width: 100%;">
    <tr>
    <td style="width: 50%;"> <img src="images/jupyternotebook.png" alt="Drawing" style="width: 100%;"/> </td>
    <td style="width: 50%;"> <img src="images/jupyterlab.png" alt="Drawing" style="width: 100%;"/> </td>
    </tr>
    <tr>
    <td style="text-align: center; font-weight: bold;"> Jupyter Notebook <br> (required) </td>
    <td style="text-align: center; font-weight: bold;"> Jupyter Lab <br> (optional) </td>
    </tr>
</table>

## Jupyter Notebooks vs. Jupyter Lab

- Jupyter cluster is here: https://pstat134.lsit.ucsb.edu

- If your username is ****,  

- Jupyter Notebook: https://pstat134.lsit.ucsb.edu/user/****/tree  
    a.k.a. Classic Notebook (**required for PSTAT 134**)

- Jupyter Lab: https://pstat134.lsit.ucsb.edu/user/****/lab  
    (**optional for PSTAT 134**)

- **Jupyter Notebook + Additional Features = Jupyter Lab**

## Github

* Optional but recommended

* [Github Student Account](https://education.github.com/pack)

# Jupyter Notebook

- Notebook environment: code, formatted text, and graphics

- Conversion to other formats: e.g. markdown, LaTeX, PDF, HTML, etc.

- Interactivity through HTML widgets

![Jupyter Notebook](images/jupyternotebook.png)

- "Kernels" accessible through Jupyter include Python, shell, R, Julia, others  
    [Jupyter kernels](https://github.com/jupyter/jupyter/wiki/Jupyter-kernels)  
    ![Jupyter Diagram](images/jupyter-diagram.jpg)


- Read [Jupyter notebook basics](https://github.com/jupyter/notebook/blob/master/docs/source/examples/Notebook/Notebook%20Basics.ipynb)

## Writing formatted text

- Text formatting with [markdown syntax](https://guides.github.com/features/mastering-markdown/)

- Math equations with LaTeX commands:  
    e.g., `$$ \hat \mu = \frac{1}{n}\sum_{i=1}^n x_i,\qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i-\hat\mu)^2 $$` produces:
    $$ \hat \mu = \frac{1}{n}\sum_{i=1}^n x_i,\qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i-\hat\mu)^2 $$
    [Mathpix](https://mathpix.com/)

- Images with `![Not Lazy](http://thoughtfulcampaigner.org/wp-content/uploads/2017/10/im-not-lazy-im-just-in-energy-saving-mode-sleepy-cat.jpg)`

![Not Lazy](http://thoughtfulcampaigner.org/wp-content/uploads/2017/10/im-not-lazy-im-just-in-energy-saving-mode-sleepy-cat.jpg)

- Markdown is everywhere: e.g., [Github Flavored Markdown (GFM)](https://guides.github.com/features/mastering-markdown/#GitHub-flavored-markdown)
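
The two estimators typeset above translate directly into code. The following is a minimal sketch in plain Python; the function names and the sample data are illustrative choices, not part of the course material. Note that the variance divides by $n$, matching the displayed formula, rather than by $n-1$.

```python
import statistics

def sample_mean(xs):
    """mu_hat = (1/n) * sum of x_i"""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """sigma2_hat = (1/n) * sum of (x_i - mu_hat)^2, dividing by n"""
    mu = sample_mean(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

# Illustrative data only
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mu_hat = sample_mean(data)
sigma2_hat = sample_variance(data)
print(mu_hat, sigma2_hat)

# The standard library's population variance uses the same 1/n convention
print(statistics.pvariance(data))
```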

## More than Python

- Python interpreter

- IPython is short for interactive Python with additional functionality  
    e.g., tab completion, syntax highlighting, magic, etc. (Chapter 1 in Vanderplas)  

- IPython's "[line, cell] magic" commands  
    e.g., debugging, code timing, shell commands, execute external script, etc.

## Line magic: `%`

- [Demo] Debugging example: `%debug`, etc.

- [`xmode`](https://jakevdp.github.io/PythonDataScienceHandbook/01.06-errors-and-debugging.html#Controlling-Exceptions:-%xmode): Exception handler mode

- More detail on debugging: [Vanderplas - Chapter 1](https://jakevdp.github.io/PythonDataScienceHandbook/01.06-errors-and-debugging.html)

### Example: Debugging

In [1]:
# set the exception reporting mode: Plain is compact, Verbose adds more detail
%xmode Plain
# %xmode Verbose

from IPython.core.debugger import set_trace

def func1(a, b):
    return a / b

def func2(x):
    
    a = x
    # set_trace()
    b = x - 1
    return func1(a, b)

# Refer to https://docs.python.org/3/library/pdb.html#debugger-commands
# Press h for help
# Uncomment below to trigger an error
# func2(1) 

Exception reporting mode: Plain


In [2]:
# After an exception occurs, calling %debug
# starts the debugger at the last error.
# Run this cell right after the failing call for the demo
%debug

# input following at ipdb prompt
# print(a)
# print(b)

ERROR:root:No traceback has been produced, nothing to debug.
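
To make concrete what a post-mortem debugger looks at, the sketch below catches the same division error in plain Python and reads the local variables of the frame where it was raised. This is not how `%debug` is implemented; it is only an assumption-laden illustration using the standard traceback attributes, and the helper name `failing_frame_locals` is made up here.

```python
def func1(a, b):
    return a / b

def func2(x):
    a = x
    b = x - 1
    return func1(a, b)

def failing_frame_locals(exc):
    """Return the local variables of the innermost frame in exc's traceback."""
    tb = exc.__traceback__
    while tb.tb_next is not None:  # walk down to where the error was raised
        tb = tb.tb_next
    return dict(tb.tb_frame.f_locals)

try:
    func2(1)  # b becomes 0, so func1 divides by zero
except ZeroDivisionError as exc:
    local_vars = failing_frame_locals(exc)
    print(type(exc).__name__, local_vars)
```

These are the same values that `print(a)` and `print(b)` would show at the `ipdb` prompt when the debugger stops inside `func1`.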


### Other magic commands

- [many more](https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook)

In [3]:
# shift-tab for documentation
%lsmagic

Available line magics:
%alias  %alias_magic  %autoawait  %autocall  %automagic  %autosave  %bookmark  %cat  %cd  %clear  %colors  %conda  %config  %connect_info  %cp  %debug  %dhist  %dirs  %doctest_mode  %ed  %edit  %env  %gui  %hist  %history  %killbgscripts  %ldir  %less  %lf  %lk  %ll  %load  %load_ext  %loadpy  %logoff  %logon  %logstart  %logstate  %logstop  %ls  %lsmagic  %lx  %macro  %magic  %man  %matplotlib  %mkdir  %more  %mv  %notebook  %page  %pastebin  %pdb  %pdef  %pdoc  %pfile  %pinfo  %pinfo2  %pip  %popd  %pprint  %precision  %prun  %psearch  %psource  %pushd  %pwd  %pycat  %pylab  %qtconsole  %quickref  %recall  %rehashx  %reload_ext  %rep  %rerun  %reset  %reset_selective  %rm  %rmdir  %run  %save  %sc  %set_env  %store  %sx  %system  %tb  %time  %timeit  %unalias  %unload_ext  %who  %who_ls  %whos  %xdel  %xmode

Available cell magics:
%%!  %%HTML  %%SVG  %%bash  %%capture  %%debug  %%file  %%html  %%javascript  %%js  %%latex  %%markdown  %%perl  %%prun  %%pypy  %%

## Cell magic: `%%`

- The entire cell is passed to the magic command instead of being run as ordinary Python

### Example: Measuring running time

In [4]:
%%timeit -n500 -r10
total = []
for i in range(1000):
    total += [i]

99.6 µs ± 6.12 µs per loop (mean ± std. dev. of 10 runs, 500 loops each)


In [5]:
%%timeit -n500 -r10
total = [i for i in range(1000)]

47.7 µs ± 460 ns per loop (mean ± std. dev. of 10 runs, 500 loops each)
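
The same comparison can be run outside a notebook with the standard `timeit` module. This is a sketch, not a reproduction of `%%timeit`: the function names are illustrative, and it reports the best run rather than the mean and standard deviation shown above. Timings depend on the machine, so none are listed here.

```python
import timeit

def build_with_loop():
    total = []
    for i in range(1000):
        total += [i]
    return total

def build_with_comprehension():
    return [i for i in range(1000)]

# repeat=10 and number=500 mirror the -r10 and -n500 options used above
loop_times = timeit.repeat(build_with_loop, repeat=10, number=500)
comp_times = timeit.repeat(build_with_comprehension, repeat=10, number=500)

# Each entry is the total time for 500 calls; divide to get time per call
print(f"loop:          {min(loop_times) / 500 * 1e6:.1f} µs per call")
print(f"comprehension: {min(comp_times) / 500 * 1e6:.1f} µs per call")
```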


# Bash Shell

---

* Bash is a text interface to the operating system (OS)

* The OS handles file operations, network interfaces, etc.

* Bash allows you to execute system operations using text commands

    ```bash
    echo "######## hello" > somefile.txt ## write string to file 'somefile.txt'
    ls -alh                              ## list files in directory
    cat somefile.txt                     ## print contents of 'somefile.txt'
    rm somefile.txt                      ## remove 'somefile.txt'
    ```
* Super cool resource: [Explain Shell](https://explainshell.com/)

* [Demo] Use git command to clone a repository on Github  
    _Note: self-learning git is recommended but not required for the course_ [[Software carpentry lesson: git](https://swcarpentry.github.io/git-novice/)]
    
* More on command line later    
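
For comparison, the four file operations in the Bash snippet above can also be carried out from Python with `pathlib`. The sketch below works in a temporary directory so that it does not touch any real files; that choice is an assumption made for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "somefile.txt"

    path.write_text("######## hello\n")                    # echo "..." > somefile.txt
    listing = sorted(p.name for p in Path(tmp).iterdir())  # ls
    contents = path.read_text()                            # cat somefile.txt
    path.unlink()                                          # rm somefile.txt
    still_there = path.exists()

print(listing, repr(contents), still_there)
```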

## Example: Run bash commands: `%%bash`

Bash is both a shell and a scripting language. Bash commands can run alongside Python code in a Jupyter notebook.

In [6]:
%%bash

echo "######## hello" > somefile.txt
cat somefile.txt
echo "######## did you see a hello?"
echo "######## listing files"
rm somefile.txt
ls -alh

######## hello
######## did you see a hello?
######## listing files
total 52K
drwxr-xr-x 4 jovyan users 4.0K Sep 30 20:40 .
drwxrwxr-x 6 jovyan  1000 4.0K Sep 30 17:06 ..
-rw-r--r-- 1 jovyan users  33K Sep 30 20:38 01-Statistical-Data-Science.ipynb
drwxr-xr-x 3 jovyan users 4.0K Sep 30 17:04 images
drwxr-xr-x 2 jovyan users 4.0K Sep 27 07:33 .ipynb_checkpoints


## Example: Jupyter Notebook and Bash

- [Who is Jovyan?](https://github.com/jupyter/docker-stacks/issues/358)
- ["In science fiction, a Jovian is an inhabitant of the planet Jupiter."](https://en.wikipedia.org/wiki/Jovian_%28fiction%29)

Shell output can be saved into a Python variable

In [7]:
nbfiles = !ls *.ipynb  # store filenames in nbfiles variable

for i, name in enumerate(nbfiles):
    print("file", i, ":", name)

file 0 : 01-Statistical-Data-Science.ipynb
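
The `!` syntax only works inside IPython. In a plain Python script, a roughly equivalent approach is to capture the command's output with `subprocess`. The sketch below assumes a POSIX system where `ls` is available; the helper name `list_notebooks` and the temporary files are made up for the example.

```python
import subprocess
import tempfile
from pathlib import Path

def list_notebooks(directory):
    """Run `ls` in a directory and keep only the .ipynb filenames."""
    result = subprocess.run(
        ["ls", "-1", str(directory)],
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode bytes to str
        check=True,           # raise if ls exits with an error
    )
    return [name for name in result.stdout.splitlines() if name.endswith(".ipynb")]

# Demonstrate on a throwaway directory with one notebook and one other file
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "demo.ipynb").touch()
    Path(tmp, "notes.txt").touch()
    nbfiles = list_notebooks(tmp)

for i, name in enumerate(nbfiles):
    print("file", i, ":", name)
```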


## Example: Exporting notebooks

- Notebooks can be converted using [`nbconvert`](https://nbconvert.readthedocs.io/en/latest/)
    - Slides
    - Static web pages  
        e.g. auto-updating reports, blogs, etc.

- Shell command for automated execution  
    `jupyter nbconvert --to html --execute mynotebook.ipynb`
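
When several notebooks need to be converted, the command above can be assembled in a script. The following sketch only builds and prints the command, using the same `--to html --execute` options shown above; the helper name `nbconvert_command` is hypothetical, and the line that would actually run the command is left commented out because it assumes `jupyter` is installed and the notebook exists.

```python
import shlex

def nbconvert_command(notebook, to="html", execute=True):
    """Build the argument list for a `jupyter nbconvert` call."""
    cmd = ["jupyter", "nbconvert", "--to", to]
    if execute:
        cmd.append("--execute")
    cmd.append(notebook)
    return cmd

for notebook in ["mynotebook.ipynb"]:
    cmd = nbconvert_command(notebook)
    print(shlex.join(cmd))
    # import subprocess; subprocess.run(cmd, check=True)  # run where jupyter is installed
```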