# 1. Packages and Data

## 1.1 Packages

In [9]:
# Standard data libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# IPython display helpers (imported from IPython.display; IPython.core.display is deprecated for this)
from IPython.display import Image, HTML

## 1.2 Data

Those of you coming from the R world may find this Stack Overflow thread useful for getting hold of some toy datasets to play with:

* https://stackoverflow.com/questions/16579407/are-there-any-example-data-sets-for-python

Some of the classics are also available in the scikit-learn datasets module. A word of warning though: be careful not to get sidetracked into the ML side of scikit-learn just now.

* https://scikit-learn.org/stable/datasets/index.html
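
As a minimal sketch of pulling one of these classics into pandas: the example below assumes scikit-learn 0.23 or later is installed (the `as_frame` argument does not exist in older versions) and uses the iris dataset purely as an illustration.

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects instead of numpy arrays
iris = load_iris(as_frame=True)

# .frame holds the four feature columns plus a "target" column
iris_df = iris.frame

print(iris_df.shape)   # (150, 5)
print(iris_df.head())
```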


# 2. Introduction

Now we get into the thick of things! In the previous week you got a basic grasp of what Python is and how to do some basic programming tasks with it. This week we have a few core learning objectives:

1. Data analysis in Python (Numpy & Pandas)
2. Calculus (...Calculus)
3. (bonus) Data Viz in python

The first is important so you can quickly undertake the EDA section of your ML projects, as well as manipulate and analyse your data. This will be covered in more depth later, but in many respects Deep Learning follows the same end-to-end project framework as standard ML. Similarly, point (3) will help you quickly gain insight into not only your data but also the modelling process. We will cover some specific (and cool!) libraries and packages for visualising Neural Networks later in the course, but there are some classic libraries that are powerful and useful, and it is good to get a basic handle on them now.

The inclusion of calculus is important if we are to really get under the hood of these Deep Learning algorithms and understand what is going on. Calculus plays a crucial role in what makes Neural Networks work, which we will cover in the first week of class. Therefore, to best prepare ourselves to apply and understand the use of calculus in deep learning, it is a good idea to get a bit of background in it.

Also - understanding a bit more of the math underneath these things puts you at a distinct advantage over many who just jump straight to the 'fun stuff'.

And this:

<h3 align="center"> How many people approach Deep Learning</h3>
<img src="https://s3-ap-southeast-2.amazonaws.com/mdsi-deep-learn-aut-19/math_trumpet.jpg" width="250" height="250"/>
<style>
 img {
    vertical-align: middle;
}
</style>

# 3. Introduction to numpy

## 3.1 Key topics

* Understand what numpy is, how it works, what it is used for
* The numpy array
    * Create a 1-dimensional array
    * Perform basic operations (sum, mean, sd, exponentials)
    * Extract from the array (single element, slicing, conditionally)
    * Perform operations on the entire array
    * Understand the need for .shape and .reshape (the dangers of rank 1 arrays)
    * Extending the above to n-dimensional arrays
        * Key terms: 'rank' and 'shape' of an array
* Broadcasting
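
To make the topics above concrete, here is a minimal sketch of the 1-dimensional array basics and the rank 1 array pitfall. The array values are made up for illustration.

```python
import numpy as np

# Create a 1-dimensional array from a Python list
a = np.array([1.0, 2.0, 3.0, 4.0])

# Basic operations
total = a.sum()      # 10.0
average = a.mean()   # 2.5
spread = a.std()     # population standard deviation (ddof=0 by default)
exps = np.exp(a)     # element-wise exponential

# Extraction: single element, slice, and boolean condition
first = a[0]         # 1.0
middle = a[1:3]      # array([2., 3.])
big = a[a > 2]       # array([3., 4.])

# Operations apply to the entire array at once, no loop needed
doubled = a * 2      # array([2., 4., 6., 8.])

# Rank 1 arrays: shape (4,) is neither a row vector nor a column vector
print(a.shape)         # (4,)
print(a.T.shape)       # (4,) - transposing a rank 1 array changes nothing
print(np.dot(a, a.T))  # 30.0 - a scalar, not a 4x4 matrix

# Reshape to an explicit column vector for predictable matrix behaviour
col = a.reshape(4, 1)
print(col.shape)                  # (4, 1)
print(np.dot(col, col.T).shape)   # (4, 4)
```

The last few lines are the reason `.shape` and `.reshape` matter: with a rank 1 array the transpose silently does nothing, so it is worth checking shapes before doing any matrix maths.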

Some resources that will assist with this learning:

* http://cs231n.github.io/python-numpy-tutorial/
    * Skip to the numpy section, though the top is a nice summary of key python programming skills
    * Pay particular attention to the broadcasting section
* https://www.datacamp.com/courses/intro-to-python-for-data-science 
    * Chapter 4
* https://jakevdp.github.io/PythonDataScienceHandbook/
    * Chapter 2
* https://www.machinelearningplus.com/python/101-numpy-exercises-python/
    * Once you have made some notes, feel free to work through these exercises to test your skills!
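
Since broadcasting is singled out above, here is a small sketch of what it looks like on a 2-dimensional array. The values are made up; 'rank' corresponds to numpy's `ndim` attribute.

```python
import numpy as np

# A 2-dimensional array: rank 2, shape (3, 2)
m = np.array([[1, 2],
              [3, 4],
              [5, 6]])
print(m.ndim, m.shape)   # 2 (3, 2)

# A scalar is broadcast to every element
plus_ten = m + 10

# A shape (2,) array is broadcast across each row
row = np.array([10, 20])
added_row = m + row      # [[11, 22], [13, 24], [15, 26]]

# A shape (3, 1) array is broadcast across each column
col = np.array([[100], [200], [300]])
added_col = m + col      # [[101, 102], [203, 204], [305, 306]]

# Shapes that cannot be lined up raise an error instead of guessing
try:
    m + np.array([1, 2, 3])   # shape (3,) does not match the trailing dimension 2
except ValueError as err:
    print("Broadcast failed:", err)
```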

# 4. Introduction to pandas

Pandas is quite a large and important topic for data science in python, so I have split it over two weeks. However, feel free to run ahead and keep working on pandas things. It will be very useful to have your own notebook of useful scripts and functions in pandas to refer back to when you want to do something. I will clean up one of my own personal notebooks and publish it to show you what I use. These notebooks will be your own and you will build upon them regularly. Every time you google something, you should add an example to your notebooks (perhaps with the link) so you can easily find it and refer back.

## 4.1 Key topics

* Different ways to create a dataframe
    * Reading in various data types 
    * From lists
    * From dictionaries
    * Attributes vs methods
* The series object - going from a series to a dataframe
* Basic useful things:
    * Summary stats (Shape, describe, info)
    * Useful math methods (.corr, min, max, mean, median, var, std etc)
    * Working with strings and dates
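
A minimal sketch of the first few topics, using a tiny made-up table (the column names and values are purely illustrative):

```python
import pandas as pd

# From a list of lists, naming the columns explicitly
df_from_lists = pd.DataFrame(
    [["red", 5.1], ["white", 6.3], ["red", 4.8]],
    columns=["colour", "quality"],
)

# From a dictionary: the keys become the column names
df_from_dict = pd.DataFrame(
    {"colour": ["red", "white", "red"], "quality": [5.1, 6.3, 4.8]}
)

# Attributes (no brackets) describe the object; methods (brackets) do some work
print(df_from_dict.shape)        # attribute -> (3, 2)
print(df_from_dict.dtypes)       # attribute
print(df_from_dict.describe())   # method
df_from_dict.info()              # method, prints its own summary

# A single column is a Series; to_frame() turns it into a DataFrame
quality = df_from_dict["quality"]
print(type(quality).__name__)              # Series
print(type(quality.to_frame()).__name__)   # DataFrame

# Some of the useful math methods
print(quality.min(), quality.max(), quality.mean(), quality.median())
```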

Some useful resources for pandas:

* https://www.datacamp.com/courses/intermediate-python-for-data-science
    * Chapter 2 and the back end of chapter 3
* https://jakevdp.github.io/PythonDataScienceHandbook/
    * Chapter 3

I would strongly recommend simply going through the pandas documentation and making notes and examples on all the useful elements. This does take a bit of time and you will not be able to do it all at once, but at least getting yourself familiar with what is in there will be very useful.

* https://pandas.pydata.org/pandas-docs/stable/reference/series.html
    * The pandas series reference documentation
* https://pandas.pydata.org/pandas-docs/stable/reference/frame.html
    * The pandas dataframe reference documentation
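
As an example of the kind of short note you might make while reading the reference docs, here is a sketch covering strings, dates and a couple of math methods. The data is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "name": [" alice ", "BOB", "Carol"],
    "joined": ["2019-01-15", "2019-03-02", "2019-03-20"],
    "score": [1.0, 2.0, 3.0],
    "hours": [2.0, 4.0, 6.0],
})

# Strings: vectorised string methods live under the .str accessor
df["name"] = df["name"].str.strip().str.title()

# Dates: parse to datetime first, then use the .dt accessor
df["joined"] = pd.to_datetime(df["joined"])
df["join_month"] = df["joined"].dt.month

print(df["name"].tolist())         # ['Alice', 'Bob', 'Carol']
print(df["join_month"].tolist())   # [1, 3, 3]

# Math methods: var and std use the sample (n - 1) definition by default
print(df["score"].var(), df["score"].std())   # 1.0 1.0

# hours is exactly 2 * score, so the correlation is (numerically) 1
print(df[["score", "hours"]].corr())
```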

## 4.2 A gift to you

I have also included one of my own notebooks for basic Pandas things that I created from some of these resources and others online. Please note, importantly, that **THIS IS NOT a teaching resource**. I really hesitated over whether to share this or not, as I strongly believe:

<h4 align="center"> You need to make <i>your own</i> notebooks</h4>

<h4 align="center"> That are in <i>your</i> words </h4>

<h4 align="center"> And have <i>your</i> code</h4>

Otherwise this is no different from just copying and pasting the resources listed above. You need the muscle memory (and actual memories) of doing this stuff yourself.

I came down on the side of including this notebook to show you how I like to keep my notes, and decided against doing too much clean up as this is not an end product but an evolving document. Whenever I get time I like to go back, work on these, update them, add more exercises etc., especially as new content and packages are released or when I solve some tricky or fiddly data manipulation/wrangling problem in a project I am working on.

It is included in this directory and is called `pandas_notes_1`.

I will emphasise this again so it really sticks:

<h3 align="center"> This is <b>not</b> a teaching resource</h3>
<h4 align="center"> This is one of my personal notebooks for your interest</h4>

So don't stress if it all doesn't make sense or is a bit short and sharp. It is written for me to refer back to. I hope you find it useful as you make your own notebooks that will be an important and everlasting valuable resource for your future ML work :) 

## 4.3 Pandas Profiling

<h4 align="center"> <b>NOTE:</b> You will need to install this package, see github link </h4>

Have you ever thought that some of this EDA work could be automated? Well, look no further! Tip of the hat to Anthony So for introducing me to this package.

This useful package generates profile reports from a pandas DataFrame. 

For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:

* Essentials: type, unique values, missing values
* Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
* Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
* Most frequent values
* Histogram
* Correlations: highlighting of highly correlated variables, Spearman and Pearson matrices

More info here: https://github.com/pandas-profiling/pandas-profiling/blob/master/README.md
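
To get a feel for what the report is summarising, here is a rough sketch that computes several of the listed statistics by hand with plain pandas. This is not how the package works internally, just the equivalent one-liners on a made-up column.

```python
import pandas as pd

# Hypothetical column with one missing value
s = pd.Series([1.0, 2.0, 2.0, 3.0, 10.0, None], name="x")

summary = {
    "dtype": s.dtype,
    "n_unique": s.nunique(),         # distinct non-missing values
    "n_missing": s.isna().sum(),
    "min": s.min(),
    "q1": s.quantile(0.25),
    "median": s.median(),
    "q3": s.quantile(0.75),
    "max": s.max(),
    "range": s.max() - s.min(),
    "iqr": s.quantile(0.75) - s.quantile(0.25),
    "mean": s.mean(),
    "std": s.std(),
    "cv": s.std() / s.mean(),        # coefficient of variation
    "skewness": s.skew(),
    "kurtosis": s.kurt(),
}
for name, value in summary.items():
    print(f"{name}: {value}")

# Most frequent values
print(s.value_counts().head(3))
```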

### 4.3.1 Import Package + Data

In [10]:
import pandas_profiling
import pandas as pd
import numpy as np

In [3]:
# Set the data file path (a "datasets" folder alongside the current working directory)
import os
data_dir = os.path.join(os.path.dirname(os.getcwd()), "datasets")

In [4]:
#We will import a toy dataset. Various chemical levels and a quality (target) variable for white and red wines
wine = pd.read_csv(data_dir + "/wine-total.csv")
#There are around 6.5k rows and 14 variables. 
print(wine.shape)

(6497, 14)


### 4.3.2 Save out profile to html

In [5]:
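# We can save the profile out to an html file if we want.
# Note: `outputfile` matches the pandas-profiling version used here;
# newer releases name this argument `output_file`.
profile = pandas_profiling.ProfileReport(wine)
profile.to_file(outputfile="wine_profile.html")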

### 4.3.3 Run profile in notebook

In [6]:
# Note: currently commented out as it doesn't look great on this (dark) theme.
# pandas_profiling.ProfileReport(wine)

# 5. Introduction to calculus

Calculus itself is a **huge** topic, so don't feel like I am asking you to know 'Calculus' (the whole thing) in a week. What I really want is for you to become familiar with just the basics, and really only *differentiation*. Pay special attention to the chain rule as it will become important later. Next week we will use calculus to differentiate matrices, so having the basics down this week is important.

## 5.1 Key topics

* What is a derivative, why do we care?
    * Derivatives and gradients, how do they link?
* Special focus on the chain rule
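
If it helps to see these ideas as numbers, here is a small sketch that approximates derivatives with finite differences and checks them against the rules. The functions and the helper names (`numerical_derivative`, `numerical_gradient`) are made up for illustration.

```python
def numerical_derivative(f, x, h=1e-5):
    """Central-difference approximation to the slope of f at x."""
    return (f(x + h) - f(x - h)) / (2 * h)


# A derivative is a slope: for f(x) = x^2 the rule gives f'(x) = 2x
def square(x):
    return x ** 2

print(numerical_derivative(square, 3.0))   # close to 6.0


# Chain rule: one function 'hidden' inside another
def inner(x):
    return 3 * x + 1          # inner'(x) = 3

def outer(u):
    return u ** 2             # outer'(u) = 2u

def composed(x):
    return outer(inner(x))    # (3x + 1)^2

x0 = 2.0
chain_rule = 2 * inner(x0) * 3                  # outer'(inner(x0)) * inner'(x0) = 42
numeric = numerical_derivative(composed, x0)
print(chain_rule, numeric)                      # the two should agree closely


# A gradient is the collection of partial derivatives, one per input
def surface(x, y):
    return x ** 2 + 3 * y     # partial derivatives: 2x and 3

def numerical_gradient(f, x, y, h=1e-5):
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return dfdx, dfdy

grad = numerical_gradient(surface, 1.0, 2.0)
print(grad)                                     # close to (2.0, 3.0)
```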

## 5.2 Resources

Some resources:

* https://the-learning-machine.com/article/machine-learning/calculus
    * This is a very thorough and nice introduction to calculus in general
    * I would recommend reading through this entire page. However, for this week don't worry about the trig examples (sin, cos etc), nor too much about area under the curve (integration) or the multivariate section.
    * The multivariate calculus section is great reading for next week, so I would recommend it as a primer. The *Hessian* is a useful term to know for next week.
        * Don't worry too much about the sections on Rosenbrock's function and Himmelblau's function, though they are interesting.
        * Similarly, the math gets a little hairy around the 'Detecting minima, maxima, and saddle points' section, so feel free to skim over that until you get to 'Computing derivatives', which is important.
* https://www.youtube.com/watch?v=rAof9Ld5sOg
    * A nice intro to what differentiation is. This is moving towards the formal definition using limits. Don't worry too much about that but mainly understand what differentiation is and why we care about it.
* https://www.youtube.com/watch?v=BcOPKQAZcn0
    * This is a nice walkthrough of basic differentiation. Don't worry too much about the end when she starts going towards the second derivative.
* https://www.youtube.com/watch?v=DYb-AN-lK94
    * Formal definition of the chain rule. Notice how we use this when one function is 'hidden' inside another? That is important!
* https://www.youtube.com/watch?v=H-ybCx8gt-8
    * Nice intro and walk through of the chain rule
* https://www.youtube.com/watch?v=U0m4MsOgETw
    * Eddie Woo is great at explaining concepts so his videos are worth a watch.

You don't necessarily need to make too many notes and examples on this topic. The key thing to come away with is an understanding of differentiation, and especially of what the chain rule is and why it matters.

# 6. Data Viz in python

This is very much an 'extra' topic so don't feel stressed about getting on top of this. You can use whatever tools you need to visualise and explore your data, though of course doing exploration and viz all in the one place is nice. Since you will need to submit notebooks for some assignments, this will be quite useful as you tell your data story.

## 6.1 Key topics

* Technologies:
    * Matplotlib
    * Seaborn
    * Bokeh
* Graph types
    * There are tonnes of graph types and this is not a data viz course, so getting the basic bar, scatter and line graphs down is a good start. You can come back later to work on fancier things for a specific use case.
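
As a starting point, here is a minimal Matplotlib sketch of the three basic graph types side by side. All of the data is made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data for illustration
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = np.sin(x)
noisy = y + rng.normal(scale=0.2, size=x.size)
categories = ["a", "b", "c", "d"]
counts = [3, 7, 5, 2]

# One figure with three side-by-side axes
fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

axes[0].bar(categories, counts)
axes[0].set_title("Bar")

axes[1].scatter(x, noisy, s=10)
axes[1].set_title("Scatter")

axes[2].plot(x, y)
axes[2].set_title("Line")

fig.tight_layout()
# In a notebook the figure renders at the end of the cell;
# in a plain script, call plt.show() to display it.
```

Seaborn and Bokeh follow a similar pattern of passing data to a plotting function, so once the Matplotlib basics are comfortable the others are easier to pick up.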

## 6.2 Resources

Some resources to help with this content:

* https://jakevdp.github.io/PythonDataScienceHandbook/
    * Chapter 4 for an introduction to Matplotlib
* https://www.datacamp.com/courses/intermediate-python-for-data-science
    * Chapter 1 for an introduction to Matplotlib as well
* https://www.datacamp.com/courses/data-visualization-with-seaborn
    * Good overview of seaborn
* https://www.datacamp.com/courses/interactive-data-visualization-with-bokeh
    * Bokeh

<h2 align="center"> That's all for this week - see you next week!</h2>