# **DSFM Demo**: Setup and Fundamentals

Creator: [Data Science for Managers - EPFL Program](https://www.dsfm.ch)  
Source:  [https://github.com/dsfm-org/code-bank.git](https://github.com/dsfm-org/code-bank.git)  
License: [MIT License](https://opensource.org/licenses/MIT). See open source [license](LICENSE) in the Code Bank repository. 

-------------

## Overview

In this notebook we will go through the basics of using GitHub and Python on your private virtual machine (VM), hosted on Google Cloud. Make a note of the URL to your VM, as we will need it throughout DSFM.

-------------

<img src="http://pngimg.com/uploads/github/github_PNG15.png" width="200" height="200" align="center"/>

The first section introduces GitHub, a common platform used for collaborating on software projects. GitHub is based on a distributed version control system called Git. Working with Git and GitHub makes it easy to track changes in the source code and effectively combine the work of a team of software engineers. 

We are using GitHub to share the lecture slides and all practical materials with you. Each day, we will "push" (that's the Git term for upload) new materials to our class repository. So you might wonder ... 

### How can I download class materials from GitHub at the beginning of the class?

1. Generate a personal access token (see instructions [here](https://docs.github.com/en/github/authenticating-to-github/keeping-your-account-and-data-secure/creating-a-personal-access-token)) and copy it.

2. Clone all materials from our GitHub class repository 

    `git clone <repository_link>`
    
3. Enter your GitHub username and personal access token as the password. For security reasons, nothing will show up as you paste in your token. Once you have pasted in your token, press `Enter`. The repository will now be cloned (i.e. downloaded) to your machine. 

4. You now have the repository on your private machine.

### Option 1: Clone the class repository daily

Here, we retain a copy of the class repo at the end of each day. So, we will be left with 5 folders at the end of the Boot Camp.

1. Right-click the existing repository folder and rename it by appending a tag like `-day1` to the end of the repository name. You are free to append whatever tag makes most sense to you.

2. Open a Terminal window on your private VM.

3. Type in `git clone <repository_url>`

4. Enter your GitHub username and personal access token as the password, then press `Enter`

5. You now have the latest version of the repository on your private machine.

### Option 2: Overwrite the class repository daily

Here, we overwrite any changes we made to the class repo. So, we will be left with 1 folder at the end of the Boot Camp.

1. Open a Terminal window on your private VM.

2. Navigate to the existing GitHub repository by typing in `cd <repository_name>`

3. Pull the latest version of class materials by typing in `git pull`

4. Enter your GitHub username and personal access token as the password, then press `Enter`

5. You should now have the latest version of the repository on your private machine.

If you run into any merge conflicts, navigate to your local copy of the repository in the Terminal, then download all materials (`git fetch --all`) and reset the master branch (`git reset --hard origin/master`). Note that this discards any local changes you made in that folder. More details on forcing Git to overwrite local changes are on [StackOverflow](https://stackoverflow.com/questions/1125968/how-do-i-force-git-pull-to-overwrite-local-files).

For more details on how GitHub works see the [official Quickstart documentation](https://help.github.com/en/github/getting-started-with-github/quickstart).

-------------

# Python

<img src="https://www.python.org/static/opengraph-icon-200x200.png" width="150" height="150" align="center"/>

The following section introduces you to the basic programming concepts that you will use with the demos and problems in the DSFM course. It is recommended that you review this introduction before the start of class. If some of the concepts are foreign to you, it's probably OK. Python is very easy to learn, and we will primarily use it to "glue together" pre-programmed libraries, passing data from one to the next for analysis.  

We have split further Python training materials into blue, red, and black:

**The Blue Piste  (7 hours)**
- A basic Kaggle course on Python:  https://www.kaggle.com/learn/python  
- Great for a fast overview.   
        
**The Red Piste  (34 hours)**
- A JetBrains Academy track on Python:  https://hi.hyperskill.org
- Great way to go from absolute beginner to fully prepared. Interactive, bite-sized exercises. 
- Cool tools to track the concepts you have studied:  https://hyperskill.org/knowledge-map
    
**The Black Piste  (many more hours)**
- A comprehensive Kaggle course on Pandas:  https://www.kaggle.com/learn/pandas
- Advanced Python with more focus on data. The most complete preparation for DSFM. 
  
The official Python documentation can be found on Python.org:   

  https://docs.python.org/3/  

### Why Python?

Python is the most popular language for data science due to its readable syntax, ease of learning, and large number of libraries for data science.

### Which Python?

We will use the latest version of __Python 3.X__. We will also provide you with a virtual machine from Google Cloud to use during the class. Everything will be pre-installed on the VM and ready to go, so all you __really__ need for the class is a lightweight laptop and a web browser. Using a VM is also advantageous when you need to run code on more powerful machines, and/or when models take a long time to train (hours to days). 

You may also want to set up your own computer so that you can continue to work on examples on a local machine after you leave the course. To do so, we strongly encourage you to use the [Anaconda Distribution](https://www.anaconda.com/distribution/) of Python to ensure that you have a full stack of scientific computing libraries and a properly configured environment. 

* __Anaconda__ is a free and open source distribution of the Python programming language for data science and machine learning applications (large-scale data processing, predictive analytics, scientific computing) that aims to simplify package management and deployment.  

* Anaconda comes with a set of pre-installed data science libraries, with the possibility of installing other useful libraries later. __Conda__ is the open source package and environment management system that ships with the Anaconda distribution.  

-------------

# Jupyter Lab

<img src="https://jupyter.org/assets/main-logo.svg" width="150" height="150" align="center"/>

Before we begin with Python, let's first introduce the programming user interface and execution environment that we will use in the DSFM course: [Jupyter Lab](https://jupyterlab.readthedocs.io/en/latest/getting_started/installation.html)  

Jupyter Lab is an [open-source tool](https://github.com/jupyterlab/jupyterlab) that runs in a web browser. Fostering reuse and reproducibility, it supports interactive data science and scientific computing across multiple programming languages (via "kernels") through the idea of notebooks. A Notebook provides an IDE-like experience to users, but also allows them to test and document what is going on in a very detailed (and visually attractive) manner. The introduction that you are reading right now is a Jupyter Notebook (or at least it started as a Jupyter Notebook if you are reading it as a PDF). 

Below you will begin to see how Jupyter runs code in an interactive manner.  Content in Jupyter Notebooks is organized into separate "cells." Each cell is either a section of CODE (which will execute when you run it) or a section of MARKDOWN (which is a quick-and-easy way to format text), or it is the result of code that has been run right above it. You can change the type of a cell by selecting it and pressing `M` (for **MARKDOWN**) or `Y` (for **CODE**). You can collapse output from cells by clicking along the blue bar on the left.

  * For documentation on the functionality of Jupyter Lab, see:  
    
    https://jupyterlab.readthedocs.io/en/stable/#  
  
  * For documentation on commands for Jupyter Notebooks, see:  
  
    https://jupyter-notebook.readthedocs.io/en/latest/#  
  
  * For documentation on formatting with MarkDown, see:  
    
    https://help.github.com/articles/basic-writing-and-formatting-syntax  
  
  * For other tips and tricks in Jupyter, see:  
  
    https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/


If you have the time and the expertise, you may want to install Anaconda and run Jupyter Lab on your own computer now. You can then play around with the following three techniques: Shortcuts, Shell Commands, and Magic Commands.

Please let the teaching team know if you need any help or advice on installing Anaconda on your local machine.

### Keyboard Shortcuts  

You can use the following shortcuts from the keyboard when working with notebooks (and/or see the __Commands__ tab along the left menu bar of Jupyter Lab).

    Shift + Enter        Execute a cell
    Shift + Tab          Show documentation ("intellisense")
    Cmd + ']'            Increase indentation
    Cmd + '['            Decrease indentation  
    Ctrl + Shift + -     Split current cell into two at cursor
    A                    Add a new cell above
    B                    Add a new cell below
    X                    Cut the current cell
    M                    Change type of current cell to markdown
    etc...

### A Code cell

In [None]:
print('hello world')

### A Markdown cell

here is a text

### Shell commands 

You can execute an OS shell command from inside your notebook using the exclamation '!' character. The following are examples of common UNIX shell commands that you might use:  

    !ls                        list contents of current folder
    !conda install <library>   install a new Python library
    !ps                        list the process status for current user
    etc...

In [None]:
!ls

### Magic commands

You can execute a special (magical) set of Jupyter commands that add extra functionality to the environment. You prefix magic commands with a `%` (applies to a single line) or `%%` (applies to the whole cell).  

    %timeit               runs a statement repeatedly (the number of loops is chosen automatically) and reports summary timing statistics.
    %%time                gives you information about a single run of the code in your cell.
    %lsmagic              view list of all magic commands
    ...

In [None]:
%%time 
a = [2**(1/float(b)) for b in range(1, 10000000)]
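
Magic commands only work inside Jupyter/IPython. As a rough sketch of the same idea in plain Python, the standard-library `timeit` module can time a statement outside a notebook. The statement and the repeat count of 100 below are arbitrary choices for illustration.

```python
import timeit

# Time a small list comprehension 100 times and report the total.
# The statement is passed as a string; number=100 is an illustrative choice.
repeats = 100
elapsed = timeit.timeit("[2 ** (1 / b) for b in range(1, 1000)]", number=repeats)

print("Total time for {} runs: {:.4f} seconds".format(repeats, elapsed))
print("Average per run: {:.6f} seconds".format(elapsed / repeats))
```

The exact timings depend on your machine, so expect different numbers on the VM and on your laptop.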

-------------

# Basic Python Programming

## Comments, Print and Execution

* Single-line comments start with `#`. Multi-line comments are usually written as string literals between a pair of `'''` or `"""`.
* The `print` function shows a message in the default output.
* You can execute a cell by pressing Shift+Enter.

In [None]:
# this comment is on a single line  

'''
this comment 
crosses two lines
'''

"""
this comment
also crosses two lines
"""

print('Hello World')
pass

## Variables

* A variable has a name and a value
* A variable is created when you assign a value to it using `=`
* Values can be numeric, string or any other data type
* Values are typed, but variables are NOT typed. The same variable can be reassigned to a value of any type.
* Strings are enclosed by quotes, such as:  `'HERE IS SOME TEXT'`  OR   `"HERE IS SOME TEXT"`

In [None]:
# Increment variable
myvar = 3
myvar += 2
print(myvar)
print()

# Decrement variable
myvar -= 1
print(myvar)
print()

# Append to a string 
mystring = "Hello"
mystring += " world."
print(mystring)
print('Myvar: {}; Mystring: {}'.format(myvar, mystring))
print()

# Swap variables in place
myvar, mystring = mystring, myvar
print(myvar, mystring)
print()

print('Check variable type:')
type(myvar)
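
To make the point about typed values and untyped variables concrete, here is a minimal sketch in which one name is reassigned to values of three different types. The variable names are only illustrative.

```python
value = 42                   # the name refers to an int
first_type = type(value)

value = "forty-two"          # the same name now refers to a str
second_type = type(value)

value = [4, 2]               # ... and now to a list
third_type = type(value)

print(first_type, second_type, third_type)
```

Python never complains about the reassignment, because the type belongs to the value, not to the name.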

## Data Structures

Python has four main data structures: 

#### Set

  - an *unordered* collection of *unique* items
  
#### List

  - an *ordered* sequence of items
  - items can be changed
  
#### Tuple  

  - an *ordered* sequence of items
  - items cannot be changed (they are "immutable")
  
#### Dictionary

  - a `key:value` store of items  
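  - since Python 3.7, dictionaries preserve the order in which keys were inserted (older versions did not guarantee any order; `collections.OrderedDict` is also available)  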
  
Supporting libraries give you many more data types. The `numpy` library, for example, also gives you arrays and matrices.  

We use extra white spaces to align variable assignments. Strictly speaking, Python enthusiasts consider this bad practice (see the [PEP 8 Style Guide](https://www.python.org/dev/peps/pep-0008/) for more). We are convinced it increases code readability though. 

In [None]:
# LIST
sample     = [1, ["another", "list"], ("a", "tuple")]
mylist     = ["List item 1", 2, 3.14, 4]
mylist[0]  = "List item 1 again"   # We're changing the item
mylist[-1] = 2                     # Here, we refer to the last item
print('Mylist variable')
print(mylist)
print()

# SET 
print('Myset variable')
myset = set(mylist)
print(myset)
print()

# DICTIONARY 
mydict       = dict({"Key 1": "Value 1", 2: 3, "pi": 3.14})
mydict["pi"] = 3.15 # This is how you change dictionary values.
print('Mydict variable')
print(mydict)
print()

# TUPLE 
print('Mytuple variable')
mytuple = (1, 2, 3)
print(mytuple)
print()

# LENGTH OF ALL VARIABLES ACROSS DATA TYPES
print('Length of data')
print(len(mylist))
print(len(myset))
print(len(mydict))
print(len(mytuple))
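
The differences between these structures show up as soon as you try to change them. The following is a small sketch with made-up data: a set drops duplicates, a tuple refuses item assignment, and a dictionary lets you add and look up keys.

```python
# A set keeps only one copy of each item
numbers = [1, 2, 2, 3, 3, 3]
unique_numbers = set(numbers)
print(len(numbers), len(unique_numbers))

# A tuple is immutable: item assignment raises a TypeError
point = (1, 2, 3)
try:
    point[0] = 99
    tuple_changed = True
except TypeError as error:
    tuple_changed = False
    print("Cannot change a tuple:", error)

# A dictionary maps keys to values; new keys can be added at any time
prices = {"apple": 0.5, "pear": 0.7}
prices["plum"] = 0.9
print(sorted(prices))        # the keys, sorted alphabetically
print("pear" in prices)      # membership tests look at the keys
```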


Lists support zero-based indexing and slicing of the form `[start:stop:step]`, where the item at `stop` is excluded (explained in the tutorials and in more detail in class).

In [None]:
# Indexing of lists example 
mylist = ["List item 1", 2, 3.14]
print(mylist[:])
print(mylist[0:2])
print(mylist[-3:-1])
print(mylist[1:])
print()

# Adding a third parameter, "step" will have Python step in
# N item increments, rather than 1.
# e.g., this will return the first item, then go to the third and
# return that (so, items 0 and 2 in 0-indexing).
print(mylist[::2])
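
A few more slicing patterns, sketched on a hypothetical list of letters so that each result is easy to read off. Slicing always returns a new list and leaves the original untouched.

```python
letters = ["a", "b", "c", "d", "e"]

first_two = letters[:2]         # items at index 0 and 1
last_two = letters[-2:]         # the final two items
every_other = letters[::2]      # items at index 0, 2 and 4
reversed_copy = letters[::-1]   # a negative step walks backwards

print(first_two)
print(last_two)
print(every_other)
print(reversed_copy)
print(letters)                  # the original list is unchanged
```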

## Flow Control

A program consists of a set of variables and control flow statements: 
* `if`: to define conditions
* `for`: to create a loop
* `while`: to create a loop
    
Blocks after control structures are defined by indentation after `:`.

List comprehensions provide a powerful way to create and manipulate lists. 
* They consist of an expression followed by a for clause followed by zero or more if or for clauses.

In [None]:
# Create a list from 0 to 10 (not including 10).
# In Python 3, range() is lazy, so we wrap it in list() to get an actual list.
rangelist = list(range(10))
print(rangelist)
print()

# Example: for loop (a very common type of loop!)
for number in rangelist:
    
    # Check if number is one of the numbers in the tuple.
    if number in (3, 4, 7, 9):
        print('We are in the if block: {}'.format(number))
        
    elif number in (5,6):
        print('We are in the elif block: {}'.format(number))
    else:
        print('We are in the else block: {}'.format(number))

In [None]:
# Example: while loop 
number_of_iterations = 10

while number_of_iterations > 0:   # runs exactly 10 times (10 down to 1)
    
    print('Current value of number_of_iterations is: {}'.format(number_of_iterations))
    
    number_of_iterations -= 1
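
The list comprehensions mentioned at the start of this section can be sketched as follows. The example builds the same list twice, once with a comprehension and once with an explicit loop, to show that the two forms are equivalent. The variable names are illustrative.

```python
numbers = range(10)

# Expression followed by a for clause
squares = [n ** 2 for n in numbers]

# Expression, for clause, and an if clause that filters the items
even_squares = [n ** 2 for n in numbers if n % 2 == 0]

# The same result written as an explicit loop
even_squares_loop = []
for n in numbers:
    if n % 2 == 0:
        even_squares_loop.append(n ** 2)

print(squares)
print(even_squares)
print(even_squares == even_squares_loop)
```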

## Functions

Functions introduce reusability and modularity into the code.
Functions receive a set of input arguments and produce an output.
Functions are defined using the *def* keyword.

Input arguments:
* Optional arguments are set in the function declaration after the mandatory arguments by being assigned a default value.
* For named arguments, the name of the argument is assigned a value.
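* Parameters are passed by *assignment*: the function receives a reference to the same object. Mutable objects (lists, dictionaries) can be changed in place by the callee, but immutable types (tuples, integers, strings, etc.) cannot be changed in the caller by the callee. 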

Function Output:
* Functions can return an output or return nothing.
* Functions can return a tuple (and using tuple unpacking you can effectively return multiple values).

Global variables are declared outside of functions and can be read without any special declarations. If you want to assign to them inside a function, you must declare them at the beginning of the function with the *global* keyword; otherwise, Python will create a new local variable with that name instead.

In [None]:
# an_int and a_string are optional: if they are not passed, they take
# their default values (2 and "A default string", respectively).
def passing_example(a_list, an_int = 2, a_string = "A default string"):
    
    a_list.append("A new item")
    an_int = 4
    
    return a_list, an_int, a_string

my_list = [1, 2, 3]
my_int = 10

print(passing_example(my_list, my_int))
print(my_list)
print(my_int)
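
The remaining points from the notes above (named arguments, returning several values through a tuple, and the *global* keyword) can be sketched in a few lines. The function and variable names here are hypothetical and chosen only for this illustration.

```python
counter = 0  # a global variable


def increment(step=1):
    """Add step to the global counter and return the new value."""
    # Without this declaration, the += below would raise UnboundLocalError,
    # because Python would treat counter as a local variable.
    global counter
    counter += step
    return counter


def min_and_max(values):
    """Return two results at once, packed into a tuple."""
    return min(values), max(values)


increment()            # uses the default step of 1
increment(step=5)      # passes the argument by name

lowest, highest = min_and_max([3, 1, 4, 1, 5])   # tuple unpacking

print(counter, lowest, highest)
```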


## Classes

* Classes provide a means of bundling data and functionality together.
* Creating a new class creates a new type of object, allowing new instances of that type to be made.
* Each class instance can have attributes attached to it for maintaining its state.
* Class instances can also have methods (defined by its class) for modifying its state.
* The constructor method (`__init__`) gives a newly instantiated object an initial state.

In [None]:
# Example class 
class Dog(object):

    def __init__(self, name): # class constructor
        self.name = name
        self.tricks = [] # creates a new empty list for each dog

    def add_trick(self, trick):
        self.tricks.append(trick)

# Initialize an instance of the Dog class
d = Dog('Fido')
e = Dog('Buddy')

# Call an instance method on each dog
d.add_trick('roll over')
e.add_trick('play dead')

print(d.tricks)
print(e.tricks)
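
The comment "creates a new empty list for each dog" matters more than it may seem. As a contrasting sketch, the hypothetical class below defines the list on the class itself rather than inside `__init__`, so every instance shares one list.

```python
class SharedTricksDog:
    tricks = []  # class attribute: ONE list shared by all instances

    def __init__(self, name):
        self.name = name

    def add_trick(self, trick):
        self.tricks.append(trick)


fido = SharedTricksDog("Fido")
buddy = SharedTricksDog("Buddy")

fido.add_trick("roll over")
buddy.add_trick("play dead")

print(fido.tricks)                   # contains both tricks
print(fido.tricks is buddy.tricks)   # both names point to the same list
```

This is usually not what you want, which is why state that belongs to a single object is set up inside the constructor.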

## File I/O

* Python has a wide array of built-in libraries for reading from and writing to files.
* The standard mechanism for serializing Python objects into files is *pickling*.

In [None]:
# Create a .txt file. The "with" statement closes the file automatically.
with open("test.txt", "w") as f:
    for i in range(10):
        f.write("This is line %d\n" % (i + 1))

# Append to the file
with open("test.txt", "a") as f:
    for i in range(2):
        f.write("Appended line %d\n" % (i + 1))

# Read the file
with open("test.txt", "r") as f:
    contents = f.read()

print('Content of the .txt file: \n')
print(contents)

In [None]:
# Serialize and deserialize a list
import pickle

# Avoid naming a variable "list": that would shadow the built-in list type
original_list = [1, 2, 3]
with open('list.pkl', 'wb') as f:
    pickle.dump(original_list, f)
    
with open('list.pkl', 'rb') as f:
    newlist = pickle.load(f)
    
print('Content of the .pkl file: \n')
print(newlist)
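
Pickling is not limited to lists or to files. As a small sketch, the example below round-trips a hypothetical dictionary through `pickle.dumps` and `pickle.loads`, which work with bytes in memory instead of a file on disk.

```python
import pickle

# A made-up record, used only for illustration
record = {"name": "Fido", "tricks": ["roll over"], "age": 3}

payload = pickle.dumps(record)     # serialize to a bytes object
restored = pickle.loads(payload)   # rebuild an equal, but separate, object

print(type(payload))
print(restored == record)   # same content
print(restored is record)   # but not the same object
```

Only unpickle data that comes from a source you trust, since loading a pickle can execute arbitrary code.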


## Libraries

* A library is a set of functions and classes that can be reused in your code.
* To use a library, you must first __import__ it using the `import` command.
* Either the whole library or specific classes and functions from within it can be imported.
* There are several data science libraries in Python that are very popular: *Numpy, Pandas, Matplotlib*
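
The different import styles can be sketched with modules from the standard library, so the example runs without installing anything. The alias `stats` is an arbitrary choice made for this illustration.

```python
import math                   # import the whole library
from math import sqrt         # import one function from a library
import statistics as stats    # import a library under a shorter alias

circle_area = math.pi * 2 ** 2        # names are reached through the module
root = sqrt(16)                       # imported names are used directly
average = stats.mean([1, 2, 3, 4])    # the alias stands in for the module

print(circle_area)
print(root)
print(average)
```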


### Numpy

NumPy is the fundamental library for scientific computing with Python and one of the core libraries used in various data science applications. Among other things, it provides the following:

* a powerful N-dimensional array as its main data structure
* numerous linear algebra capabilities over N-dimensional arrays

The separate SciPy library, a collection of mathematical algorithms and convenience functions, is built on top of NumPy. 

More information: [NumPy tutorial](https://numpy.org/doc/stable/user/quickstart.html)

In [None]:
import numpy as np

print(np.arcsin(1))  # pi/2; only the last expression in a cell is shown automatically
a = np.arange(15).reshape(3, 5)
a
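
To illustrate the array and linear algebra capabilities listed above, here is a minimal sketch on a small 2x3 array. It assumes NumPy is installed, as it is on the course VM.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

doubled = a * 2                  # arithmetic applies element by element
column_sums = a.sum(axis=0)      # sum down each column
product = a @ a.T                # matrix product of a with its transpose

print(a.shape)
print(doubled)
print(column_sums)
print(product)
```

No explicit loops are needed: operations are expressed on whole arrays at once.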

### Pandas

Pandas is a popular, open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Its functionality is based on two data structures: 

* Series: a one-dimensional, array-like sequence of labeled items
* DataFrame: a 2-dimensional, table-like, collection of rows and columns

More information: [10 mins with Pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html)

In [None]:
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df    = pd.DataFrame(np.random.randn(6,4), index=dates, columns=['A','B','C','D'])
df.head()

### Matplotlib

Matplotlib is a plotting library which can produce high quality figures in a variety of formats, and across a variety of platforms.

* __matplotlib__ itself is an object-oriented library, and its full interface can be confusing at first.
* __Pyplot__ provides a more convenient interface to matplotlib that is modelled closely on MATLAB.
* __Seaborn__, another popular visualization library for Python, produces prettier figures and is based on matplotlib at its core.
* __Bokeh__ and __Plotly__ are Python visualization libraries which, unlike Matplotlib and Seaborn, produce interactive visualizations.

More information: [matplotlib tutorial](https://matplotlib.org/tutorials/index.html)

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.plot([1, 2, 3, 3])
plt.ylabel('some numbers on y axis')
plt.show()

### Scikit-Learn

Scikit-Learn is the leading machine learning package in Python and the main library we deal with in this course. 

It includes tools for the full cycle of a data science project, including:

* data preprocessing
* predictive modeling (supervised and unsupervised, classification and regression)
* model evaluation and selection
* a lot more ...

![](https://scikit-learn.org/stable/_static/ml_map.png)


More information: [Scikit-Learn tutorial](https://scikit-learn.org/stable/tutorial/index.html)

In [None]:
# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn import datasets

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
Y = iris.target

# Create an instance of the Logistic Regression classifier
# (a large C means weak regularization) and fit the data.
logreg = LogisticRegression(C=1e5)
logreg.fit(X, Y)

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
h = .02  # step size in the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1, figsize=(12, 10))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())

plt.show()