# Week 0 Friday

Welcome to Math 10!

[Canvas homepage](https://canvas.eee.uci.edu/courses/58295)

This class is an introduction to using Python for data science. There are two primary parts of the course:
* Part 1.  Exploratory Data Analysis.  (Weeks 1-5)
* Part 2.  Introduction to Machine Learning.  (Weeks 5-10)

Two in-class midterms: Monday Week 5 (10/30) and Monday Week 10 (12/04).  They're closed book and closed computer.

There's no final exam; instead there is a class project.


##### There will be NO official textbook for this course. You may find the following references helpful:
* For Basic Python Programming: [A Byte of Python](https://python.swaroopch.com/)
* For Machine Learning Code in Python: [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)
* For Machine Learning Applications and Theory: [An Introduction to Statistical Learning](https://www.statlearning.com/) with [solutions in Python](https://github.com/hardikkamboj/An-Introduction-to-Statistical-Learning); [The Elements of Statistical Learning](https://hastie.su.domains/ElemStatLearn/); [Probabilistic Machine Learning: An Introduction](https://probml.github.io/pml-book/book1.html)
* For Deep Learning: [Deep Learning](https://www.deeplearningbook.org/)


## Announcements

* If you're on the waitlist, please submit homework and take quizzes on the same schedule as the regular class.  (Assignments won't be excused, but I do drop the lowest worksheet score.)

## What is Data Science?

Three related concepts:
- Data Science
- Artificial Intelligence 
- Machine Learning

[Battle of the Data Science Venn Diagrams](https://www.kdnuggets.com/2016/10/battle-data-science-venn-diagrams.html)

The original Venn diagram from Drew Conway:
<img src="https://images.squarespace-cdn.com/content/v1/5150aec6e4b0e340ec52710a/1364352051365-HZAS3CLBF7ABLE3F5OBY/Data_Science_VD.png?format=2500w" width="300" height="300"/>



Another diagram from Steven Geringer:
<img src="http://2.bp.blogspot.com/-Qi-0utjhySM/UsteLrV6NyI/AAAAAAAACNQ/AdkizQfS8l8/s320/moz-screenshot-3-729576.png" width="300" height="300"/>


Perhaps the reality should be:
<img src="https://www.researchgate.net/profile/Iva-Kostadinova/publication/326715653/figure/fig1/AS:711037263302656@1546535902683/Data-science-disciplines-1.png" width="500" height="500"/>



[David Robinson's Auto-pilot example](http://varianceexplained.org/r/ds-ml-ai/):
- Machine learning: **predict** whether there is a stop sign in the camera
- Artificial intelligence: decide when to take the **action** of applying brakes (either by rules or from data)
- Data science: provide **insights** into why the system is more likely to miss a stop sign before sunrise or after sunset

### [Example: Precision Medicine and Single-cell Sequencing.](https://learn.gencore.bio.nyu.edu/single-cell-rnaseq/)

<img src="https://learn.gencore.bio.nyu.edu/wp-content/uploads/2018/01/scRNA-overview.jpg" width="500" height="500"/>


- A structured data table, with $n$ observations and $p$ variables.
- **Mathematical representation**: The data *matrix* $X\in\mathbb{R}^{p\times n}$. In notation, we write
$X=\left(\mathbf{x}^{(1)},\mathbf{x}^{(2)}, \cdots, \mathbf{x}^{(n)} \right)$, where the $i$-th column vector represents the $i$-th observation,  $\mathbf{x}^{(i)}=\left(
\begin{matrix}
   x_{1}^{(i)}\\
   x_{2}^{(i)} \\
   \vdots \\
   x_{p}^{(i)}
  \end{matrix} 
\right) \in\mathbb{R}^{p}$.


- *Roughly speaking*, big data means large $n$, and high-dimensional data means large $p$.
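
As a minimal sketch of this notation, the NumPy snippet below builds a small made-up data matrix with $p = 3$ variables and $n = 4$ observations, stored column-wise as described above. The numbers and the names `X` and `x_2` are illustrative assumptions, not course data. Note that many Python libraries (for example pandas and scikit-learn) usually store one observation per row instead, which corresponds to the transpose `X.T`.

```python
import numpy as np

# Hypothetical data: p = 3 variables (rows), n = 4 observations (columns)
X = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [0.5, 0.1, 0.9, 0.3],
    [10.0, 20.0, 30.0, 40.0],
])

p, n = X.shape
print(p, n)          # 3 4

# The 2nd observation x^(2) is the 2nd column (index 1, because Python counts from 0)
x_2 = X[:, 1]
print(x_2)

# Row-per-observation layout (n x p), as used by many libraries
print(X.T.shape)     # (4, 3)
```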

## Why Python?

### Python is Popular
How do we measure popularity? That is itself a data science problem!

- [TIOBE](https://tiobe.com/tiobe-index/): Based on Google search results
- [PYPL PopularitY](https://pypl.github.io/PYPL.html): Based on Google Trends
- [GitHut 2.0](https://madnight.github.io/githut/#/pull_requests/2020/2): Based on GitHub
- [Redmonk](https://redmonk.com/sogrady/2020/07/27/language-rankings-6-20/): Based on GitHub and Stack Overflow

### Python is Good
- Stable Learning Curves

[An entertaining cartoon from Tobias Hermann](https://github.com/Dobiasd/articles/blob/master/programming_language_learning_curves.md)

- Scalability of Computation (with the help of other packages)

[benchmarking of scientific computation problems](https://modelingguru.nasa.gov/docs/DOC-2783)

[comparison between NumPy and MATLAB](https://jekel.me/2017/Python-with-Numba-faster-than-fortran/)

- Useful Packages
    - [NumPy](https://numpy.org/): Scientific Computing
    - [Pandas](https://pandas.pydata.org/): Data Analysis and Manipulation
    - [Scikit-Learn](https://scikit-learn.org/stable/): Machine Learning
    - [Matplotlib](https://matplotlib.org/): Visualizing Functions/Datasets
    - [Seaborn](https://seaborn.pydata.org/): Visualizing Statistical Data

## Warm-up with Deepnote and some Python concepts

* [Deepnote vs Jupyter notebook](https://datasciencenotebook.org/compare/jupyter/deepnote)


You can type along with me.

You execute a cell/block in Deepnote (or in a Jupyter notebook) by holding down shift and hitting return. The order in which you execute cells is important.

This is an example of a markdown cell. Markdown cells are used to write explanations of your code and to format text nicely.

* This is an example of making a list.

To execute a cell, you can use `command+enter`. To edit a cell, make sure it's highlighted, and then press enter.

# Math 10
## Week 0 
### Friday
If you are familiar with $\LaTeX$, you can use it to write math:
* $\int_0^1 x^2 dx$

Here's an example of how we can change text color:

<font color = red> Warning: </font> Midterm 1 is on Monday of Week 5.

Here are some examples of code cells:

In [None]:
2+2

4

In [None]:
a = 4
b = 8
a+b

12

The order in which we evaluate cells matters!

In [None]:
print(x)

NameError: name 'x' is not defined

In [None]:
x = 10

In [None]:
print(x)

10


In [None]:
# take square of x
x**2

100

In [None]:
print('Hello World!')

Hello World!


In [None]:
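# avoid naming a variable `list`: that would shadow Python's built-in `list` type
my_list = [5, 6, 7, 8]

In [None]:
my_list[1]

6

Indexing in Python starts at 0!

In [None]:
my_list[0]

5

In [None]:
my_list[3]

8

In [None]:
my_list[-1]

8

In [None]:
my_list[::-1]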

[8, 7, 6, 5]
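
The last cell uses slice notation, `start:stop:step`. Here is a minimal sketch of how the three parts behave, using the same four numbers (the name `numbers` is just an illustrative choice):

```python
numbers = [5, 6, 7, 8]

# start:stop selects indices start, ..., stop - 1 (stop itself is excluded)
print(numbers[1:3])    # [6, 7]

# omitting start or stop means "from the beginning" or "to the end"
print(numbers[:2])     # [5, 6]
print(numbers[2:])     # [7, 8]

# step = 2 takes every other element
print(numbers[::2])    # [5, 7]

# step = -1 walks backwards, which reverses the list
print(numbers[::-1])   # [8, 7, 6, 5]
```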

* NumPy is one of the most important Python libraries.
* NumPy is not part of base Python, so we need to import it every time we start a new notebook.
* The abbreviation `np` is a standard convention, and we will always use it in Math 10.

In [None]:
# load Numpy
import numpy as np

In [None]:
# generate random numbers
# we can use NumPy's function random.default_rng

np.random.default_rng?

Docstring:
Construct a new Generator with the default BitGenerator (PCG64).

Parameters
----------
seed : {None, int, array_like[ints], SeedSequence, BitGenerator, Generator}, optional
    A seed to initialize the `BitGenerator`. If None, then fresh,
    unpredictable entropy will be pulled from the OS. If an ``int`` or
    ``array_like[ints]`` is passed, then it will be passed to
    `SeedSequence` to derive the initial `BitGenerator` state. One may also
    pass in a `SeedSequence` instance.
    Additionally, when passed a `BitGenerator`, it will be wrapped by
    `Generator`. If passed a `Generator`, it will be returned unaltered.

Returns
-------
Generator
    The initialized generator object.

Notes
-----
If ``seed`` is not a `BitGenerator` or a `Generator`, a new `BitGenerator`
is instantiated. This function does not manage a default global instance.

Examples
--------
``default_rng`` is the recommended constructor for the random number class
``Generator``. Here are se

In [None]:
rng = np.random.default_rng()
rng.random(5)

array([0.78238483, 0.81217947, 0.52612065, 0.91363018, 0.65672153])
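
Running the cell above again would give different numbers, because no seed was passed. As the docstring says, an `int` can be passed as the seed. The sketch below shows how that makes the results reproducible; the seed value `10` and the names `rng1`, `rng2` are arbitrary choices for illustration.

```python
import numpy as np

# Two generators created with the same (arbitrary) seed
rng1 = np.random.default_rng(seed=10)
rng2 = np.random.default_rng(seed=10)

a = rng1.random(5)
b = rng2.random(5)

# Same seed, same sequence of random numbers
print(np.array_equal(a, b))   # True
```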

In [None]:
help(rng.random)

Help on built-in function random:

random(...) method of numpy.random._generator.Generator instance
    random(size=None, dtype=np.float64, out=None)
    
    Return random floats in the half-open interval [0.0, 1.0).
    
    Results are from the "continuous uniform" distribution over the
    stated interval.  To sample :math:`Unif[a, b), b > a` multiply
    the output of `random` by `(b-a)` and add `a`::
    
      (b - a) * random() + a
    
    Parameters
    ----------
    size : int or tuple of ints, optional
        Output shape.  If the given shape is, e.g., ``(m, n, k)``, then
        ``m * n * k`` samples are drawn.  Default is None, in which case a
        single value is returned.
    dtype : dtype, optional
        Desired dtype of the result, only `float64` and `float32` are supported.
        Byteorder must be native. The default value is np.float64.
    out : ndarray, optional
        Alternative output array in which to place the result. If size is not None,
        it

In [None]:
rng.random(3)

array([0.39916726, 0.40975624, 0.24989856])
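
The help text above mentions that samples from $Unif[a, b)$ can be obtained as `(b - a) * random() + a`. A minimal sketch of that formula follows; the endpoints `2.0` and `5.0` and the sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng()

a, b = 2.0, 5.0   # illustrative endpoints

# Rescale uniform samples from [0, 1) to the interval from a to b
samples = (b - a) * rng.random(1000) + a

# The smallest and largest samples should lie between a and b
print(samples.min(), samples.max())
```

NumPy generators also provide `rng.uniform(a, b, size=1000)`, which draws from the same distribution directly.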