# Data Science with Python
## Ciência de Dados com Python
by Ladislav Vrbsky  
(born in Prague, Czech Republic, works at [Vibe Tecnologia](https://vibetecnologia.com/))  

<br>
for Estácio

<a href='https://portal.estacio.br/'> <img src='https://portal.estacio.br/imgs/logo-estacio.png' /></a>

[Open this file in Colab](https://colab.research.google.com/github/vrbsky/talks-classes-etc/blob/master/Data_Science_with_Python_Estacio/00_Class_slides.ipynb)  
[Show this file on GitHub](https://github.com/vrbsky/talks-classes-etc/blob/master/Data_Science_with_Python_Estacio/00_Class_slides.ipynb)

## Agenda

- Motivation
- IDE
- Data Analysis and visualization
- Machine Learning with scikit-learn
- References & Tips

## Motivation
- Data is one of the most valuable assets of an organization
- Becoming data-driven gets you ahead
- The data already exists, so you might as well use it

## IDE - Intro to Jupyter, Notebooks, Markdown

Open, read & interact with  
`./01_Intro_to_Jupyter_and_Markdown.ipynb`  
[Open locally](./01_Intro_to_Jupyter_and_Markdown.ipynb)  
[Open in Colab](https://colab.research.google.com/github/vrbsky/talks-classes-etc/blob/master/Data_Science_with_Python_Estacio/01_Intro_to_Jupyter_and_Markdown.ipynb)  
[Show on GitHub](https://github.com/vrbsky/talks-classes-etc/blob/master/Data_Science_with_Python_Estacio/01_Intro_to_Jupyter_and_Markdown.ipynb)

## Data Analysis and visualization

Head over to  
`./02_Intro_to_Data_Analysis.ipynb`  
[Open locally](./02_Intro_to_Data_Analysis.ipynb)  
[Open in Colab](https://colab.research.google.com/github/vrbsky/talks-classes-etc/blob/master/Data_Science_with_Python_Estacio/02_Intro_to_Data_Analysis.ipynb)  
[Show on GitHub](https://github.com/vrbsky/talks-classes-etc/blob/master/Data_Science_with_Python_Estacio/02_Intro_to_Data_Analysis.ipynb)  


## Machine Learning workflow

<img src='./img/ML_flow1.png'>

## Machine Learning workflow cont.

<img src='./img/ML_flow2.png'>

## Classification Metrics

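- Accuracy  
share of all predictions that are correct ($N = TP + TN + FP + FN$):
$$\frac{TP+TN}{N}$$

- Precision  
share of predicted positives that are actually positive:
$$\frac{TP}{TP+FP}$$

- Recall  
share of actual positives that are predicted as positive:
$$\frac{TP}{TP+FN}$$

- F1 score  
harmonic mean of precision and recall:
$$F_1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$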
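
The formulas above can be checked by hand on a small example. The sketch below uses made-up labels (1 = positive, 0 = negative) and plain Python, so every count is visible; the variable names are illustrative. scikit-learn ships equivalents in `sklearn.metrics` (`accuracy_score`, `precision_score`, `recall_score`, `f1_score`).

```python
# Hypothetical toy labels: 1 = positive class, 0 = negative class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1, 1, 0]

pairs = list(zip(y_true, y_pred))

# Count the four cells of the confusion matrix.
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
n = tp + tn + fp + fn

accuracy = (tp + tn) / n
precision = tp / (tp + fp)  # assumes at least one predicted positive
recall = tp / (tp + fn)     # assumes at least one actual positive
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Here there are 4 true positives, 3 true negatives, 2 false positives and 1 false negative, so accuracy is 0.70, precision 0.67, recall 0.80 and F1 0.73. The divisions fail if a denominator is zero, for example when the model never predicts the positive class, so real code needs a guard for that case.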

## Regression Metrics

Note: Judge an error metric relative to the average value of the label. An MAE of 5 is small when the labels average 1,000 and large when they average 10.


- Mean Absolute Error
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_{i} - \hat{y}_{i} \rvert$$

- Mean Squared Error
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_{i} - \hat{y}_{i})^{2}$$

- Root Mean Squared Error
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_{i} - \hat{y}_{i})^{2}}$$
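
A minimal sketch of the three error metrics above in plain Python, on made-up values. It also divides RMSE by the mean label, following the note at the top of this slide; the name `relative_rmse` is an illustrative choice, not a standard metric name. scikit-learn provides `mean_absolute_error` and `mean_squared_error` in `sklearn.metrics`.

```python
import math

# Hypothetical labels and predictions.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
n = len(y_true)

# Per-sample errors y_i - y_hat_i.
errors = [t - p for t, p in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / n   # Mean Absolute Error
mse = sum(e ** 2 for e in errors) / n   # Mean Squared Error
rmse = math.sqrt(mse)                   # Root Mean Squared Error

# Put the error in context: compare it with the average label value.
mean_label = sum(y_true) / n
relative_rmse = rmse / mean_label

print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f}")
print(f"mean label={mean_label:.2f} RMSE/mean={relative_rmse:.1%}")
```

MSE penalizes the single error of 2 much more than MAE does, because errors are squared before averaging. RMSE brings the result back to the units of the label.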


## Regression Metrics cont.
- R2 score (Coefficient of determination)
$$
\mathrm{R}^2 
= 1 - \frac{\mathrm{SSE}}{\mathrm{SSM}}
= 1 - \frac{\sum_{i}(y_{i} - \hat{y}_{i})^{2}}{\sum_{i}{(y_{i}-\bar{y})^2}}
= 1 - \frac{n \times \mathrm{MSE}}{\sum_{i}{(y_{i}-\bar{y})^2}}
$$

<center>
<img src='https://upload.wikimedia.org/wikipedia/commons/thumb/8/86/Coefficient_of_Determination.svg/400px-Coefficient_of_Determination.svg.png'>
<br/>
SSM: sum of squared deviations from the mean (left). SSE: sum of squared errors (right).
<br/>
Image source: <a href='https://en.wikipedia.org/wiki/Coefficient_of_determination'>Wikipedia</a>
</center>
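
The R2 formula can be sketched in a few lines of plain Python. The helper name `r2_manual` and the data are made up for illustration; scikit-learn offers `r2_score` in `sklearn.metrics`. The sketch also checks the identity SSE = n × MSE and shows that always predicting the mean of the labels gives an R2 of 0.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SSE / SSM."""
    mean_y = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # squared errors
    ssm = sum((t - mean_y) ** 2 for t in y_true)             # squares to mean
    return 1 - sse / ssm  # assumes the labels are not all identical


# Hypothetical labels and predictions.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
n = len(y_true)

r2 = r2_manual(y_true, y_pred)

# Same value through MSE, since SSE = n * MSE.
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
mean_y = sum(y_true) / n
ssm = sum((t - mean_y) ** 2 for t in y_true)
r2_from_mse = 1 - n * mse / ssm

# Baseline: a model that always predicts the mean label.
r2_baseline = r2_manual(y_true, [mean_y] * n)

print(f"R2={r2:.3f} R2 via MSE={r2_from_mse:.3f} baseline R2={r2_baseline:.3f}")
```

An R2 near 1 means the model explains most of the variation around the mean. A negative R2 means the model is worse than predicting the mean.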


## Machine Learning with scikit-learn

Head over to  
`./03_Intro_to_Machine_Learning_with_scikit_learn.ipynb`  
[Open locally](./03_Intro_to_Machine_Learning_with_scikit_learn.ipynb)  
[Open in Colab](https://colab.research.google.com/github/vrbsky/talks-classes-etc/blob/master/Data_Science_with_Python_Estacio/03_Intro_to_Machine_Learning_with_scikit_learn.ipynb)  
[Show on GitHub](https://github.com/vrbsky/talks-classes-etc/blob/master/Data_Science_with_Python_Estacio/03_Intro_to_Machine_Learning_with_scikit_learn.ipynb)
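
For orientation before opening the notebook, here is a minimal sketch of a typical scikit-learn workflow: split the data, fit a model, predict, evaluate. It is not taken from the notebook. The dataset (`load_iris`), the model (`LogisticRegression`), and the split parameters are assumptions chosen only because they run without downloads.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small dataset bundled with scikit-learn (features X, labels y).
X, y = load_iris(return_X_y=True)

# Hold out part of the data so the model is scored on unseen samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Fit on the training split only.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict on the held-out split and score the predictions.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.2f}")
```

Most scikit-learn estimators follow the same `fit` / `predict` pattern, so another model can usually be swapped in by changing only the line that constructs it.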

### References

- [Python docs](https://docs.python.org/3/)
- [Pandas docs](https://pandas.pydata.org/docs/)
- [Google Colab](https://colab.research.google.com/notebooks/intro.ipynb)
- [References by Google Colab](https://colab.research.google.com/notebooks/intro.ipynb), listed inside the intro notebook
- [VS Code - Data Science tools](https://code.visualstudio.com/docs/python/data-science-tutorial)
- [Databricks Community Edition](https://community.cloud.databricks.com/login.html)
- [UCI datasets](https://archive.ics.uci.edu/ml/datasets.php)
- Neural network frameworks
  - [TensorFlow](https://www.tensorflow.org/)
  - [PyTorch](https://pytorch.org/)
- [Spark](https://spark.apache.org/) (Big Data execution engine)
- Visualization libs
  - [Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html)
  - [Matplotlib](https://matplotlib.org/gallery/index.html)
  - [seaborn](https://seaborn.pydata.org/examples/index.html)
  - [plotly](https://plotly.com/python/)
  - [HoloViews](http://holoviews.org/)


### Tips
- On Windows, you can install Python from the Microsoft Store, with [Anaconda](https://www.anaconda.com/products/individual), or directly from [python.org](https://www.python.org/downloads/windows/).
- To keep each project's packages isolated, use virtual environments (`venv`, `virtualenv`, or Anaconda environments).
- Join the [SQL Norte](https://instagram.com/sqlnorte?igshid=1phf1zizjz9ec) community (data-oriented events)
- Practice on [kaggle](https://www.kaggle.com/)
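
Virtual environments are usually created from the command line with `python -m venv .venv`. The same standard-library `venv` module can also be driven from Python. The sketch below creates a throwaway environment in a temporary directory; the `create_env` helper and the `.venv` folder name are illustrative choices, and pip installation is skipped to keep the example fast and offline.

```python
import tempfile
import venv
from pathlib import Path


def create_env(env_dir: Path) -> Path:
    """Create a virtual environment and return the path to its config file."""
    # with_pip=False skips bootstrapping pip; the command-line form
    # `python -m venv <dir>` installs pip by default.
    venv.create(env_dir, with_pip=False)
    return env_dir / "pyvenv.cfg"


# Build the environment inside a temporary directory that is removed afterwards.
with tempfile.TemporaryDirectory() as tmp:
    cfg = create_env(Path(tmp) / ".venv")
    env_created = cfg.exists()

print(f"Environment created: {env_created}")
```

For a real project, create the environment inside the project folder and activate it before installing packages.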

```python
import pandas as pd

# show the output of every expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# show numbers without scientific notation (123e9)
pd.set_option('display.float_format', lambda x: format(x, ',.2f'))
```
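
The `',.2f'` format spec passed to `format` above is plain Python and can be tried without pandas. A small sketch with made-up values:

```python
# ',' adds thousands separators; '.2f' fixes two decimal places.
large = format(123e9, ',.2f')
small = format(1234.5678, ',.2f')

print(large)  # 123,000,000,000.00
print(small)  # 1,234.57
```

With the display option set, pandas applies this formatting to every float it prints.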

## Best Practices
- Know the business
- Explain the financial (or UX) impact
- Communicate effectively
- Build a solid grounding in probability and statistics
- Use SQL and Excel too when they fit the task
- Understand the algorithms
- Be honest: *All models are wrong, but some are useful.*
- Practice

## Systems' Best Practices
- Simplify access to data
- Fix source of data quality issues
- Set standards to strengthen the analytical workflow
- Document the processes
- Share metadata across the organization: maintain a business glossary, data dictionaries, a data catalog, and data governance
  - This standardizes terms, provides data lineage, helps deploy models, and makes results repeatable and data transparent

To present this notebook as slides, run the following in a notebook cell: `!jupyter nbconvert 00_Class_slides.ipynb --to slides --post serve`

