# PSTAT 134/234 - Statistical Data Science

---

## Instructor: Sang-Yun Oh

- Lectures: MW 11 am - 12:15 pm

- Office: South Hall 5514

- Office hours: Tuesday 4-6 pm


## Teaching Assistant: Sergio Rodriguez 

- Sections: F 9 - 9:50 am / 12 - 12:50 pm

- Office: South Hall 6432-W

- Office hours: Thursday 1-3 pm


# Course Information 

---

## Grading

* Attendance in lectures and sections is required (20%)  
    A total of five absences will be dropped. No exceptions.

* Individual in-class midterm (20%)

* Individual assignments (30%)

* Group final project & presentations (30%)


## Textbooks

- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) by Jake Vanderplas

- [R for Data Science](http://r4ds.had.co.nz/) by Hadley Wickham and Garrett Grolemund

- Other resources as necessary


## Learn by doing

- Critical statistical thinking is crucial

- Significant programming is required

- Many software tools will be new  
    e.g., R, Python, command line tools, etc.

- Proactive attitude is a must!  
    e.g., asking questions, discussing, experimenting, RTM (read-the-manual)

- Diverse backgrounds mean you will have different strengths!  
    Help each other, and assess your own areas of improvement

- You don't have to be an expert at everything

- But you have to be willing to dig deeper on your own

# Data Science

---

- A Forbes magazine editorial on [History of Data Science](https://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/#7131af3355cf)  

- Isolated statistical/machine learning algorithms are not enough in real usage scenarios 

- Real-world challenges are much broader

## Data challenges

- Data collection (text files, web pages, pdf files, APIs, ...)

- Data storage (disk space, redundancy, data structure, databases, security, privacy, ...)

- Data access (ease of access, cloud vs. local, security, ...)

- Data uniformity and consistency (cleaning, wrangling, entity resolution, ...)
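
As a small illustration of the uniformity and entity resolution item above, the following sketch groups differently spelled records under one normalized key. The record strings and the `normalize` helper are hypothetical and chosen only for illustration; real entity resolution usually needs far more than string cleanup.

```python
import re
from collections import defaultdict

def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and collapse repeated whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", name.lower())
    return " ".join(cleaned.split())

# Hypothetical raw records that refer to only two distinct institutions
records = ["UC Santa Barbara", "uc  santa barbara.", "U.C. Santa Barbara", "UCLA", "ucla "]

# Group the raw spellings by their normalized form
groups = defaultdict(list)
for record in records:
    groups[normalize(record)].append(record)

for key, variants in groups.items():
    print(key, "<-", variants)
```

Here five raw strings collapse into two keys, which is the kind of consistency check that typically comes before any analysis.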

## Analysis challenges

- Not all data are useful: e.g., "signal to noise", curse of dimensionality

- Critical thinking for formulating questions  
    e.g., prediction of medical diagnosis based on predictors

- Derived features may be redundant and lead to unstable results 

- Effects of outliers on algorithms

- Getting a “feel” for the data (visualization and summary statistics)

- Setting a direction

- Getting a simple but complete analysis done is better than doing one part “perfectly”
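
To make the point about outliers and summary statistics concrete, here is a minimal sketch using Python's standard `statistics` module. The measurements are made up for illustration; the idea is only that a single erroneous value moves the mean far more than the median.

```python
import statistics

# Hypothetical measurements, all close to 10
values = [9.8, 10.1, 10.0, 9.9, 10.2]
# The same data with one erroneous reading appended
with_outlier = values + [100.0]

mean_clean = statistics.mean(values)
mean_outlier = statistics.mean(with_outlier)
median_clean = statistics.median(values)
median_outlier = statistics.median(with_outlier)

print("mean:  ", mean_clean, "->", mean_outlier)
print("median:", median_clean, "->", median_outlier)
```

Comparing the two lines of output gives a quick "feel" for how sensitive each summary is before committing to an algorithm.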

## Interpretation challenges

- What results do you obtain from your analysis?

- Do you believe them? What can you conclude? (analysis outcome vs. conclusion)

- Refining your analysis

# Course outline

---

* **Week 1** (4/2-4/6): Data and uncertainty   
    - Computing: Jupyter notebook and Python primer
    - Reading: [Chapter 1 (skim)-2](https://jakevdp.github.io/PythonDataScienceHandbook/index.html#1.-IPython:-Beyond-Normal-Python) in Vanderplas
    
* **Week 2** (4/9-4/13): Data scraping, transformation, and wrangling
    - Computing: Shell commands and Pandas
    - Reading: [Chapter 3](https://jakevdp.github.io/PythonDataScienceHandbook/index.html#3.-Data-Manipulation-with-Pandas) in Vanderplas  
        [The Unix Shell](http://swcarpentry.github.io/shell-novice/) by Software Carpentry  
        
* **Week 3-4** (4/16-4/27): Visualization and exploratory analysis
    - Computing: Matplotlib and Scikit-learn
    - Reading: [Chapter 4 (skim) - 5](https://jakevdp.github.io/PythonDataScienceHandbook/index.html#4.-Visualization-with-Matplotlib) in Vanderplas

* **In-class midterm** (4/30)

* **Week 5-6** (5/2-5/11): Finance data module

* **Week 7-8** (5/14-5/24): Health data module
               
* **Week 9** (5/28-6/1): Text data module

* **Week 10** (6/4-6/8): Final project presentations

* **Final Projects** (6/14): Final project presentations


# Computational Environment

---

## Github

* [Github Student Account](https://education.github.com/pack)


## Jupyterhub

* [Course Jupyter Hub](https://jupyterhub.lsit.ucsb.edu)

* PSTAT 134/234 coursework only

* Your work can be inspected by teaching staff at any time

* Sign the [privacy policy](https://goo.gl/forms/pwa0FKNy6F0ZT8U32)

# Jupyter Notebook

---

- Interactive Python environment for code and text

- Accessed using a web browser

- "Kernels" accessible through Jupyter include Python, shell, R, Julia, others  
    [Jupyter kernels](https://github.com/jupyter/jupyter/wiki/Jupyter-kernels)
    
- Cells contain either code (interpreted according to the kernel or a magic command) or formatted text (markdown)
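
A notebook file (`.ipynb`) is stored as JSON, with one entry per cell. The sketch below builds a minimal two-cell notebook as a Python dictionary, assuming the field names of notebook format version 4; the cell contents are placeholders chosen for illustration.

```python
import json

# Minimal notebook structure (assumes nbformat version 4 field names)
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {   # a formatted-text cell
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# A heading\n", "Some *formatted* text."],
        },
        {   # a code cell that has not been executed yet
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["print('hello')"],
        },
    ],
}

# Serialize to JSON text, then read it back and list the cell types
text = json.dumps(notebook, indent=1)
cell_types = [cell["cell_type"] for cell in json.loads(text)["cells"]]
print(cell_types)
```

The `execution_count` of `None` in an unexecuted code cell is what appears as `In [None]:` when such a notebook is exported to text.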

## Writing formatted text

- Text formatting with [markdown syntax](https://guides.github.com/features/mastering-markdown/)

- Math equations with LaTeX commands:  
    e.g., `$$ \hat \mu = \frac{1}{n}\sum_{i=1}^n x_i,\qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i-\hat\mu)^2 $$` produces:
    $$ \hat \mu = \frac{1}{n}\sum_{i=1}^n x_i,\qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i-\hat\mu)^2 $$

- Images with `![Not Lazy](http://thoughtfulcampaigner.org/wp-content/uploads/2017/10/im-not-lazy-im-just-in-energy-saving-mode-sleepy-cat.jpg)`
    ![Not Lazy](http://thoughtfulcampaigner.org/wp-content/uploads/2017/10/im-not-lazy-im-just-in-energy-saving-mode-sleepy-cat.jpg)

</br>

- Headings, tables, lists, bold, italic, etc.

- Markdown syntax can have variations: e.g., [Github Flavored Markdown (GFM)](https://guides.github.com/features/mastering-markdown/#GitHub-flavored-markdown)
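
The sample mean and variance formulas rendered in the math example of this section can be checked numerically. The sketch below uses a small made-up data set and compares the hand-computed variance with `statistics.pvariance`, which also divides by $n$.

```python
import statistics

# Hypothetical data chosen so the results are round numbers
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(x)

# mu_hat = (1/n) * sum of x_i
mu_hat = sum(x) / n

# sigma2_hat = (1/n) * sum of (x_i - mu_hat)^2
sigma2_hat = sum((xi - mu_hat) ** 2 for xi in x) / n

print("mu_hat     =", mu_hat)
print("sigma2_hat =", sigma2_hat)
print("pvariance  =", statistics.pvariance(x))
```

Note that `statistics.variance` (without the `p`) divides by $n-1$ instead, so it would not match the formula as written.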

## Coding in Python and more

- Python interpreter

- IPython (short for "interactive Python") extends the interpreter with additional functionality  
    e.g., tab completion, syntax highlighting, magic commands, etc. (Chapter 1 in Vanderplas)  
    [demo]

- IPython's "[line, cell] magic" commands  
    e.g., debugging, code timing, shell commands, execute external script, etc.

In [None]:
%lsmagic

### Line magic starts with `%`

- Debugging example (python debugger): `%pdb`  
    [Demo]


In [None]:
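# `%pdb on` opens the debugger automatically when an exception is raised; `%pdb off` disables it
%pdb off

def print_string(x, y):
    # str.islower() exists only for strings, so a non-string argument raises AttributeError
    print('x is lowercase:', x.islower())
    print('y is lowercase:', y.islower())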

print_string('a', 'b')
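
The call above succeeds, so the debugger has nothing to stop at. The following sketch, written as plain Python rather than a notebook cell, shows the kind of exception that automatic debugging would catch. Passing the integer `1` is an assumption made for illustration; any argument without an `islower` method behaves the same way.

```python
def print_string(x, y):
    print('x is lowercase:', x.islower())
    print('y is lowercase:', y.islower())

try:
    # 1 is an int, which has no .islower() method
    print_string('a', 1)
except AttributeError as err:
    error_message = str(err)
    print('caught:', error_message)
```

In a notebook with `%pdb on`, the same failing call would open an interactive debugger prompt at the line that raised the exception instead of only printing a traceback.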

### Cell magic starts with `%%`

- Entire cell is interpreted differently

#### Measure the running time of a cell with `%%timeit`

In [None]:
%%timeit -n500 -r10
total = []
for i in range(1000):
    total += [i]

In [None]:
%%timeit -n500 -r10
total = [i for i in range(1000)]
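
Outside a notebook, a similar comparison can be sketched with the standard `timeit` module. The function names below are hypothetical; `number=500` and `repeat=10` are meant to mirror the `-n500 -r10` flags. This sketch reports the best run, whereas `%%timeit` reports a mean and standard deviation, and wrapping the code in functions adds a small call overhead.

```python
import timeit

def build_with_loop():
    # Grow the list one element at a time
    total = []
    for i in range(1000):
        total += [i]
    return total

def build_with_comprehension():
    # Build the same list with a list comprehension
    return [i for i in range(1000)]

# Each entry is the total time in seconds for 500 calls; 10 such measurements
loop_times = timeit.repeat(build_with_loop, number=500, repeat=10)
comp_times = timeit.repeat(build_with_comprehension, number=500, repeat=10)

print(f"loop:          {min(loop_times) / 500 * 1e6:.1f} microseconds per call")
print(f"comprehension: {min(comp_times) / 500 * 1e6:.1f} microseconds per call")
```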

#### Run bash commands: `%%bash`

In [None]:
%%bash

echo "######## hello" > somefile.txt
ls -alh
cat somefile.txt
rm somefile.txt

- [Who is Jovyan?](https://github.com/jupyter/docker-stacks/issues/358)
- ["In science fiction, a Jovian is an inhabitant of the planet Jupiter."](https://en.wikipedia.org/wiki/Jovian_%28fiction%29)
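
For comparison, the file operations in the `%%bash` cell of this section can also be written in Python with `pathlib`. This is only a sketch of rough equivalents; it works in a temporary directory (an assumption made here so that nothing in the current directory is touched), and the listing contains names only, without the details that `ls -alh` prints.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as workdir:
    path = Path(workdir) / "somefile.txt"

    path.write_text("######## hello\n")                         # echo "..." > somefile.txt
    listing = sorted(p.name for p in Path(workdir).iterdir())   # ls (names only)
    contents = path.read_text()                                 # cat somefile.txt
    path.unlink()                                               # rm somefile.txt
    still_exists = path.exists()

print(listing)
print(contents, end="")
print("file still exists:", still_exists)
```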

### Other magic commands

- `%load`: load an external script into a cell
- `%store`: pass variables between notebooks
- [many more](https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook)

## Exporting notebooks

- Notebooks can be converted using [`nbconvert`](https://nbconvert.readthedocs.io/en/latest/)
    - Slides
    - Static web pages  
        e.g., automatically updated reports, blogs, etc.

- Shell scripts can automate execution  
    `jupyter nbconvert --to html --execute mynotebook.ipynb`
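
The same command can be assembled from Python when conversions need to be scripted. The helper name `nbconvert_command` is hypothetical; the sketch only builds and prints the command shown above and does not run it, since running it requires Jupyter to be installed and the notebook file to exist.

```python
import shlex

def nbconvert_command(notebook, output_format="html"):
    """Build the argument list for executing and converting one notebook."""
    return ["jupyter", "nbconvert", "--to", output_format, "--execute", notebook]

command = nbconvert_command("mynotebook.ipynb")
print(shlex.join(command))

# To actually run the conversion (assumes Jupyter is installed):
#   import subprocess
#   subprocess.run(command, check=True)
```

Keeping the command as a list of arguments avoids shell quoting problems if a notebook name contains spaces.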


## Following Python Data Science Handbook notebooks

- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) is written in Jupyter notebooks, and [the author made the source code available on Github](https://github.com/jakevdp/PythonDataScienceHandbook)

- You can follow the book in our course Jupyterhub (file upload)

- You can fork the repository to your Github and make changes

# Bash Shell

---

* Bash is a text interface to the operating system (OS)

* The OS handles file operations, network communication, etc.

* Bash allows you to execute system operations using text commands

    ```bash
    echo "######## hello" > somefile.txt ## write string to file 'somefile.txt'
    ls -alh                              ## list files in directory
    cat somefile.txt                     ## print contents of 'somefile.txt'
    rm somefile.txt                      ## remove 'somefile.txt'
    ```
* Super cool resource: [Explain Shell](https://explainshell.com/)

* [Demo] Use git command to clone a repository on Github
    
* More on command line later    