<a href="https://colab.research.google.com/github/xren935/COMP206/blob/master/intro_to_python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Workshop 1 ― Python for Data Science

The following notebook was prepared for the [Python for Data Science](https://www.facebook.com/events/699389510612077) workshop, held by the [McGill AI Society](https://www.mcgillai.com/) on August 19, 2020. 

During the first half of this workshop, we will introduce you to the Python programming language before moving on to interesting data handling and visualization libraries often used in machine learning and data science.

*Built on material by Yutong Yan and excerpts from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas*

---



## What is Google Colab and why do we use it?

To write and run code, you need a development environment. Google Colaboratory (Colab) is an interactive development environment with the following perks:
- easy code sharing
- access to GPUs
- no setup or configuration required

Furthermore, it supports the combination of live code and rich text in a notebook format.

## Part 1: Python basics

In this section, we will cover Python syntax, control flow, and native data structures. 

In [None]:
# This is a comment
1 + 1

#### Variable assignment and types

In [None]:
x = 2
print("x: ", x, ", type(x): ", type(x))

x = 2.
print("x: ", x, ", type(x): ", type(x))

x = "a"
print("x: ", x, ", type(x): ", type(x))

x = 'a'
print("x: ", x, ", type(x): ", type(x))

x = [2]
print("x: ", x, ", type(x): ", type(x))

x = {"a": 2}
print("x: ", x, ", type(x): ", type(x))

#### String formatting

In [None]:
"""
Here we will look at the ages 
of our friends
Alice, 
John 
and Meg
"""
alice = 20
john = 18
meg = 18

print("Alice is {} years old".format(alice))
print(f"John is {john} years old")
print("Meg is %d years old"%meg)
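
The three formatting styles above produce the same kind of string. As a small sketch, f-strings also accept format specifiers after a colon; the variable names below are made up for illustration:

```python
age = 20
name = "Alice"
pi = 3.14159265

# The three styles from the cell above yield identical strings.
same = ("Alice is {} years old".format(age)
        == f"Alice is {age} years old"
        == "Alice is %d years old" % age)

# Format specifiers control precision, padding and separators.
two_decimals = f"{pi:.2f}"    # round to two decimal places
right_aligned = f"{name:>10}|"  # pad on the left to a width of 10
thousands = f"{1234567:,}"    # insert thousands separators

print(same, two_decimals, right_aligned, thousands)
```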

#### Booleans and logical operators

We will look at the following operators:

$$
\begin{gathered}
X \land Y \\
X \lor Y \\
\neg Y \\
X \oplus Y \\
X = Y
\end{gathered}
$$

In [None]:
x = True
y = False

print(f"x and y: {x and y}")
print(f"x or y: {x or y}")
print(f"not y: {not y}")
print(f"x != y (exclusive or): {x != y}")
print(f"x == y: {x == y}")
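
For booleans, `!=` behaves like the exclusive or in the list of operators. The following sketch prints the full truth table; the loop and the use of the `^` operator are additions for illustration and are not part of the original workshop cell:

```python
from itertools import product

# Enumerate every combination of two boolean values.
table = []
for x, y in product([True, False], repeat=2):
    row = {"x": x, "y": y, "and": x and y, "or": x or y,
           "xor": x != y, "eq": x == y}
    table.append(row)
    print(row)

# For booleans, the ^ operator gives the same result as !=.
xor_matches = all(r["xor"] == (r["x"] ^ r["y"]) for r in table)
print(f"!= agrees with ^ on booleans: {xor_matches}")
```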

#### Lists

In [None]:
arr1 = ["alice", True, 42]
print(arr1)

arr2 = [0]*10
print(arr2)

arr3 = [[1], [1,1], [1,1,1], [1,1,1,1]]
print(arr3)

# Note that range() returns a lazy iterable, so we convert it to a list
arr4 = list(range(10))
print(arr4)

In [None]:
print(arr1[0])
print(arr3[-1])

print(arr4[5:])
print(arr4[:5])

In [None]:
arr = "Bob is sleeping".split()
print(arr)

arr[0] = "Alice"
print(arr)

print(' '.join(arr))

##### Map and lambda

In [None]:
"""
lambda lets us define small anonymous functions inline
map applies a function to every element of an iterable
"""
print(map(lambda x : type(x), arr))  # map returns a lazy iterator, so this prints a map object
print(list(map(lambda x : type(x), arr)))

print(list(map(lambda x : x**2, arr4)))
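
Because `map` returns a lazy iterator, its result can only be consumed once. A minimal sketch of that behavior, next to the equivalent list comprehension (variable names are illustrative):

```python
numbers = list(range(5))

# map() is lazy: nothing is computed until the result is iterated over.
squares_map = map(lambda n: n**2, numbers)
first_pass = list(squares_map)   # consumes the iterator
second_pass = list(squares_map)  # already exhausted, so this is empty

# The equivalent list comprehension builds the list right away.
squares_comp = [n**2 for n in numbers]

print(first_pass)
print(second_pass)
print(squares_comp)
```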

##### Iterating over *lists*

In [None]:
l = [chr(n) for n in range(97,107)] # [chr(97), ..., chr(106)]

#For loop (Cannot easily get index of elements)
print("Simple for loop:")

for num in l:
    print(num)

#Index loop
print("\nfor loop over indices:")
for i in range(len(l)):
    print(i,l[i])
    
#Enumeration loop
print("\nEnumeration loop style for loop:")
for i, num in enumerate(l):
    print(i, num)

##### List comprehension

In [None]:
# Without square brackets this is a generator expression, so a generator object is printed
print(i**2 for i in range(10))

a = [i**2 for i in range(10)]
print(a)

a = list(i**2 for i in range(10))
print(a)

a = [i for i in range(1,20) if i%3 == 0]
print(a)

#Also if-else
a = [i if i%3 == 0 else 0 for i in range(1,20)]
print(a)

#### Strings

In [None]:
# Strings are similar to lists

greet = 'Hello World'
print(greet)
print(greet[0])
print(greet[0:3])
print(greet[3:])

In [None]:
print(greet.lower()) # Convert string to lowercase
print(greet.upper()) # Convert string to uppercase
print(greet)         # Note that this does not change the original string
print(greet.split()) # split based on space
print(greet.split('ll')) # split based on 'll'

In [None]:
#Strings are immutable (unlike lists)
greet[0] = 'a' # will generate a TypeError
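
Since strings cannot be modified in place, the usual approach is to build a new string. A short sketch (the replacement text is arbitrary):

```python
greet = "Hello World"

# Item assignment on a string raises TypeError; catch it to show the message.
try:
    greet[0] = "a"
except TypeError as e:
    print(f"TypeError: {e}")

# Build a new string instead of modifying the old one.
new_greet = "a" + greet[1:]
replaced = greet.replace("World", "McGill")

print(new_greet)
print(replaced)
print(greet)  # the original is unchanged
```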

#### Dictionary

In [None]:
# Nice way to store information

d = {"Name":"Harry", "Age":11, "Address":"4 Privet Drive, Little Whinging, Surrey"}
print(d)

# get value based on key
print(d["Name"])

# change an existing value 
d["Address"] = 'Hogwarts'
print(d)

# insert new element
d["Godfather"] = "Sirius"
print(d)

# delete a key-value pair
del d["Godfather"]
print(d)

In [None]:
# Iterating over dictionary keys
for key in d:
    print( key, ":", d[key])
    
for key, value in d.items():
    print(key, value)
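
Looking up a key that does not exist raises a `KeyError`. A minimal sketch of two common ways to handle missing keys, using a smaller version of the dictionary above (the `"House"` key is made up for illustration):

```python
d = {"Name": "Harry", "Age": 11}

# d["House"] would raise a KeyError because the key does not exist.
# Checking membership with `in` avoids the error.
has_house = "House" in d

# get() returns None, or a default of our choice, for a missing key.
house = d.get("House", "unknown")
age = d.get("Age")

print(has_house, house, age)
```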

#### If statements

In [None]:
# if statement: Python relies on indentation to delimit blocks!
x=3
if x>0:
    print("I do not like green eggs and ham")
elif x == 0: 
    print("I do not like them")
else: #x<0
    print("Sam-I-am")

#### Functions

In [None]:
#You can define your own functions:
def func1(x):
    return [x*2, x*3]

# You can give parameters default values
def func2(x=5):
    return [x*2, x*3]

print(f"func1(5) = {func1(5)}")
print(f"func2() = {func2()}")

# You can also define your functions using lambdas
func = lambda x: [x*2, x*3]
print(f"func(5) = {func(5)}")

# Python is dynamically typed: no type declarations are needed
print(f"func1(5) = {func1(5)}")

# Thanks to duck typing, the same function works with any type that supports the * operator
print(f"func('string') = {func('string')}")

#### Variable scope

In [None]:
a = 1 # Global variable

def foo(): 
    print(f'Inside foo() : {a}')
  
def boo():     
    a = 2
    print(f'Inside boo() : {a}')
  
def goo():     
    global a 
    a = 3
    print(f'Inside goo() : {a}')
  
# Global scope 
print(f'global : {a}')
foo() 
print(f'global : {a}')
boo() 
print(f'global : {a}')
goo() 
print(f'global : {a}')

#### Exceptions and try/except

In [None]:
def add_two_int(x, y):
  assert isinstance(x, int), f"Argument {x} should be of type int"
  if not isinstance(y, int):
    raise TypeError(f"Argument {y} should be of type int")
  return x + y

In [None]:
add_two_int(1,2)

In [None]:
add_two_int('1', 2)

In [None]:
add_two_int(1, "2")

In [None]:
add_two_int(1, True) # Careful: bool is a subclass of int, so this returns 2 without raising an error
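
The call above succeeds because `bool` is a subclass of `int`. If booleans should be rejected, one possible check is sketched below; `is_strict_int` is a hypothetical helper, not part of the workshop code:

```python
# bool is a subclass of int, so True passes an isinstance(..., int) check.
print(issubclass(bool, int))
print(1 + True)

def is_strict_int(value):
    """Return True for ints but False for booleans (hypothetical helper)."""
    return isinstance(value, int) and not isinstance(value, bool)

print(is_strict_int(2), is_strict_int(True))
```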

In [None]:
try:
  print(add_two_int(1, "2"))
except TypeError as e:
  print(e)

In [None]:
try:
  print(add_two_int(1, "2"))
except Exception as e:
  print(e)

In [None]:
try:
  print(add_two_int(1, "2"))
except ValueError as e:  # a TypeError is not a ValueError, so the exception is not caught here
  print(e)
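
A `try` statement can also carry `else` and `finally` clauses. The sketch below shows when each one runs; `safe_add` is a hypothetical function written for this illustration:

```python
def safe_add(x, y):
    """Add two values, reporting type problems instead of crashing."""
    try:
        result = x + y
    except TypeError as e:
        # Runs only if the addition raised a TypeError.
        print(f"Could not add {x!r} and {y!r}: {e}")
        return None
    else:
        # Runs only if no exception was raised.
        return result
    finally:
        # Runs in every case, just before the function returns.
        print("safe_add finished")

print(safe_add(1, 2))
print(safe_add(1, "2"))
```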

## Part 2: Python Data Handling and Visualization libraries

Now that you are comfortable with the Python essentials, let's look at some key data handling and visualization libraries!

### NumPy
NumPy provides support for multi-dimensional arrays and matrix manipulation, as well as a large collection of mathematical and logical operations. Its vectorized operations are key to fast scientific computing in Python.

**Note:** NumPy is so popular that many other Python libraries have functions that accept NumPy arrays as input.
 


In [None]:
import numpy as np

#### Arrays

We can create Numpy arrays from lists:

In [None]:
x = np.array([[1],[2],[3],[4.0]])
print(f"{x}\n")
print(f"{x.shape}\n")
print(x.dtype)

Similarly, we can define Numpy matrices as follows:

In [None]:
X = np.array([[1,2,3], [4,5,6]])
print(f"{X}\n")
print(X.shape)

No need to create your own identity matrix! Numpy provides direct support for it:

In [None]:
X = np.eye(3)  # Square matrix with only 1's on diagonal
print(X)

We can also fill matrices with random integers or floats:

In [None]:
# random integers between 0 and 9
X = np.random.randint(10, size = (4, 4))
print(f"{X}\n")

# random floats in the half-open interval [0, 1)
X = np.random.random((4,4))
print(X)

Elements of an array can be accessed much like list elements, with one index per dimension:

In [None]:
# selecting an element: array[row,column]
X[0,0]

In [None]:
# selecting rows: array[row]
X[0]

In [None]:
#selecting all rows, columns 0 and 2
X[:,[0,2]]

In [None]:
#selecting all rows except the first, all columns except the last:
X[1:,:-1]

#### Logical operations



We can also use comparison operators to create arrays of booleans:

In [None]:
X < 0.5

Or use them to select elements that satisfy certain conditions:

In [None]:
X[X < 0.5] #select all elements that are less than 0.5
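
Boolean masks can also be counted, combined and used for assignment. A minimal sketch with a small fixed array instead of the random one above, so the results are reproducible:

```python
import numpy as np

X = np.array([[0.1, 0.7],
              [0.4, 0.9]])

mask = X < 0.5        # boolean array with the same shape as X
count = mask.sum()    # True counts as 1, so this counts the matches

# Assign through a mask (on a copy, to keep X intact).
X_clipped = X.copy()
X_clipped[mask] = 0.0

# Combine conditions with & (and) or | (or); the parentheses are required.
combined = X[(X > 0.2) & (X < 0.8)]

print(count)
print(X_clipped)
print(combined)
```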

Numpy also comes with built-in logic functions that save you time on iterating through matrices: 

*   `np.any(arr, axis=None)` tests whether any array element along a given axis evaluates to True (OR operation)
*   `np.all(arr, axis=None)` tests whether all array elements along a given axis evaluate to True (AND operation)


You can find more useful numpy logical functions [here](https://numpy.org/doc/stable/reference/routines.logic.html). 




In [None]:
bool_arr = np.array([[True, True, True], [False, True, False], [False, False, False]])
print(f"{bool_arr}\n")

print(np.any(bool_arr)) # True: no axis specified, so all elements are checked
print(np.all(bool_arr, 0)) # axis=0 => reduce over the rows: one result per column
print(np.all(bool_arr, 1)) # axis=1 => reduce over the columns: one result per row
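
In practice, `np.any` and `np.all` are often applied to the result of a comparison. The sketch below uses made-up scores and an arbitrary threshold of 0.5:

```python
import numpy as np

# Made-up data: one row per student, one column per test.
scores = np.array([[0.9, 0.2, 0.4],
                   [0.1, 0.3, 0.2],
                   [0.8, 0.7, 0.6]])
passed = scores > 0.5

# axis=1 reduces over the columns: one answer per row (student).
any_pass_per_row = np.any(passed, axis=1)
all_pass_per_row = np.all(passed, axis=1)

# axis=0 reduces over the rows: one answer per column (test).
any_pass_per_col = np.any(passed, axis=0)

print(any_pass_per_row)
print(all_pass_per_row)
print(any_pass_per_col)
```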

#### Math operations

For a comprehensive list of all the available math functions, please have a look [here](https://numpy.org/doc/stable/reference/routines.math.html). We will go over a few:  

In [None]:
# elementwise addition
x = np.array([[1,2],[3,4]])
y = np.array([[5,6],[7,8]])

print(f"{x}\n")
print(f"{y}\n")
print(f"x + y = \n {x + y} \n")
print(f"np.add(x, y) = \n {np.add(x, y)} \n")

In [None]:
# elementwise multiplication
print(f"x * y = \n {x * y} \n")
print(f"np.multiply(x, y) = \n {np.multiply(x, y)} \n")

# useful when multiplying matrix by scalar
print(f"1.5 * x = \n {1.5 * x} \n")

# what happens when the arrays have different sizes? Try it out!
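
As a partial answer to the question in the cell above: when shapes differ, NumPy tries to broadcast the smaller array across the larger one. A minimal sketch with arbitrary values:

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])   # shape (2, 2)
row = np.array([10, 20])         # shape (2,)
col = np.array([[100], [200]])   # shape (2, 1)

# The row is stretched across both rows of x.
x_plus_row = x + row
# The column is stretched across both columns of x.
x_plus_col = x + col

print(x_plus_row)
print(x_plus_col)

# Shapes that cannot be aligned raise a ValueError.
try:
    x + np.array([1, 2, 3])
except ValueError as e:
    print(f"ValueError: {e}")
```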

Watch out! The `*` operator multiplies arrays element by element. For matrix products, use the `np.dot()` function, which computes inner products of vectors, multiplies a vector by a matrix, and multiplies matrices.

In [None]:
print(f"x.dot(y) = \n {x.dot(y)}")
print(f"np.dot(x,y) = \n {np.dot(x, y)}")
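
For two-dimensional arrays, the `@` operator gives the same matrix product as `np.dot()`. A short sketch contrasting it with the elementwise `*`:

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])

elementwise = x * y      # multiplies matching entries
matrix_product = x @ y   # rows of x times columns of y

print(elementwise)
print(matrix_product)
print(np.array_equal(matrix_product, np.dot(x, y)))
```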

Of course, dimensions must match!

In [None]:
x2 = np.array([[1,2], [1,2]])
print(f"np.shape(x2) = {np.shape(x2)}")
y2 = np.array([[1,2,3],[1,2,3],[1,2,3]])
print(f"np.shape(y2) = {np.shape(y2)}")

np.dot(x2,y2) # will produce a ValueError

In [None]:
x = np.array([[1,2],[3,4]])
x = x.reshape((1, 2, 2, 1))
print(f"shape of x: {x.shape}")
print(x)
y = np.array([[5,6],[7,8]])
#print (np.dot(x, y)) # will produce a ValueError

# Squeeze() removes single-dimensional entries from the shape of an array
x = np.squeeze(x)
print(f"shape of squeezed x: {x.shape}")
print(np.dot(x, y))

In [None]:
# Calculate transpose and inverse
x_transp = x.T
print(f"x = \n {x} \n")
print(f"x_transp = \n {x_transp} \n")

x_inv = np.linalg.inv(x)
print(f"x_inv = \n {x_inv} \n")

In [None]:
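# Solve Ax = b for x
# Note: a random integer matrix can be singular, in which case solve() raises a LinAlgError (just rerun the cell)
A = np.random.randint(5, size=(2, 2))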
b = np.random.random((2, 1))
x = np.linalg.solve(A, b)

print(f"{x}\n")
print(np.linalg.inv(A).dot(b))

# Avoid using inv() to solve linear systems; solve() is faster and numerically more stable
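
To make the result reproducible, here is a sketch with a fixed, invertible matrix (the values are chosen for illustration). It also checks the solution by substituting it back into the equation:

```python
import numpy as np

# 3x + y = 9 and x + 2y = 8, which has the solution x = 2, y = 3.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = np.linalg.solve(A, b)
print(x)

# Verify the solution: A @ x should reproduce b up to rounding error.
print(np.allclose(A @ x, b))

# A singular matrix has no unique solution, so solve() raises LinAlgError.
singular = np.array([[1.0, 2.0],
                     [2.0, 4.0]])
try:
    np.linalg.solve(singular, b)
except np.linalg.LinAlgError as e:
    print(f"LinAlgError: {e}")
```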

In [None]:
# Applying custom functions on numpy arrays
x = np.array([1, 2, 3, 4, 5])
f = lambda x: x ** 2
squares = f(x)
print(squares)

# Always try to look for built-in numpy functions first as they are much faster!
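
As a small illustration of that advice, the sketch below computes the same squares with a Python-level loop and with the built-in `np.square`, then chains built-in functions to get a root mean square:

```python
import numpy as np

x = np.arange(1, 6)  # array([1, 2, 3, 4, 5])

# A Python-level loop visits the elements one by one.
loop_squares = np.array([v**2 for v in x])

# The built-in function processes the whole array at once.
vectorized_squares = np.square(x)

# Built-in functions compose well: root mean square of x.
rms = np.sqrt(np.mean(np.square(x)))

print(loop_squares)
print(vectorized_squares)
print(rms)
```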

### Matplotlib

Matplotlib is a library that provides functionalities for plotting and data visualization similar to the MATLAB environment. 



In [None]:
import matplotlib.pyplot as plt

Let's make sure the plots properly appear in our notebook by using an IPython [magic function](https://ipython.readthedocs.io/en/stable/interactive/tutorial.html#magics-explained). 

In [None]:
%matplotlib inline

We use the `plt.plot()` function to plot 2D data:

In [None]:
plt.title("My Plot")
#x is 0, 1, 2, 3
plt.plot([1,2,3,4])
plt.ylabel("some numbers")
plt.xlabel("some numbers")
plt.savefig("a.pdf")
plt.show()

The `plt.plot()` function takes its arguments in the form (x-axis values, y-axis values, style). For more options, [check here](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html).

In [None]:
plt.plot([1,2,3,4], [1,4,9,16], "b--") # b colors the curve blue, -- dashes it
plt.axis([0, 6, 0, 20]) # sets the axis limits: [xmin, xmax, ymin, ymax]

We can also draw multiple curves on the same plot:

In [None]:
t = np.arange(0., 5., 0.2)

plt.plot(t, t, "r--")
plt.plot(t, t**2, "bs")
plt.plot(t, t**3, "g^")
plt.xlabel("x axis label")
plt.ylabel("y axis label")
plt.title("Curves")
plt.legend(["t", "t**2", "t**3"])
plt.show()

We can also plot the curves separately but keep them in the same figure using the `plt.subplot()` function.

In [None]:
def f(t):
    return np.exp(-t) * np.cos(2*np.pi*t)

t1 = np.arange(0.0, 5.0, 0.1)
t2 = np.arange(0.0, 5.0, 0.02)

plt.figure(figsize = (5, 4))
plt.subplot(211)
plt.plot(t1, f(t1), "bo", t2, f(t2), "k")

plt.subplot(212)
plt.plot(t2, np.cos(2*np.pi*t2), "r--")

plt.show()

In [None]:
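# Sample data for the next two figures
x = np.linspace(0, 2 * np.pi, 400)
y = np.sin(x ** 2)

fig, (ax1, ax2) = plt.subplots(2, sharex=True)
fig.suptitle("Aligning x-axis using sharex")
ax1.plot(x, y)
ax2.plot(x + 1, -y)

In [None]:
# sharex="col" and sharey="row" make the axes match the figure title
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, sharex="col", sharey="row")
fig.suptitle("Sharing x per column, y per row")
ax1.plot(x, y)
ax2.plot(x, y**2, "tab:orange")
ax3.plot(x, -y, "tab:green")
ax4.plot(x, -y**2, "tab:red")

# Only show tick labels on the outer edges of the grid
for ax in fig.get_axes():
    ax.label_outer()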

Find out more about Matplotlib [here](http://matplotlib.org/users/pyplot_tutorial.html)!

### Pandas
Pandas builds upon the NumPy library to provide essential data structures for organizing and manipulating real-world data flexibly (attaching labels to data, working with missing data, etc.).

Aside from high-performance data operations, Pandas objects integrate seamlessly with database systems and spreadsheet programs.

In [None]:
import pandas as pd

#### Series

A Pandas ``Series`` is a one-dimensional array of indexed data.
It can be created from a list or array as follows:

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
print(data)

As we see in the output, the ``Series`` wraps both a sequence of values and a sequence of indices, which we can access with the ``values`` and ``index`` attributes.
The ``values`` are simply a familiar numpy array:

In [None]:
print(data.values)

The ``index`` is an array-like object of type ``pd.Index``, which we'll discuss in more detail momentarily.

In [None]:
print(data.index)

Like with a numpy array, data can be accessed by the associated index via the familiar Python square-bracket notation:

In [None]:
print(data[1])

In [None]:
print(data[1:3])

However, a `Series` object is not simply a numpy array! The `index` can be customized instead of being a simple list of sequential integers. For example, we can index our `Series` object with strings:

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
print(data)

And the item access works as usual:

In [None]:
print(data['b'])

We can even use non-contiguous or non-sequential indices:

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[2, 5, 3, 7])
print(data)

In [None]:
print(data[5])

We can also construct a `Series` from a Python dictionary.

**Note:** just as a NumPy array is more efficient than a Python list for certain operations, the implementation of `Series` can make it much more efficient than a native dictionary.

In [None]:
area_codes_dict = {'Montreal': 514,
                   'Toronto': 416,
                   'Vancouver': 604,
                   'Calgary': 587, 
                   'Ottawa': 613}
area_codes = pd.Series(area_codes_dict)
print(area_codes)

Again, elements can be accessed in the usual way:

In [None]:
area_codes['Montreal']

Unlike a dictionary, though, the ``Series`` also supports array-style operations such as slicing:

In [None]:
area_codes['Toronto':'Calgary']
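
One detail worth noting: slicing by label includes the end label, whereas slicing by integer position excludes the end position, as with lists. A minimal sketch using the same data as above:

```python
import pandas as pd

area_codes = pd.Series({'Montreal': 514,
                        'Toronto': 416,
                        'Vancouver': 604,
                        'Calgary': 587,
                        'Ottawa': 613})

# Label-based slice: both 'Toronto' and 'Calgary' are included.
by_label = area_codes.loc['Toronto':'Calgary']

# Position-based slice: position 3 ('Calgary') is excluded.
by_position = area_codes.iloc[1:3]

print(by_label)
print(by_position)
```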

#### DataFrames
A `DataFrame` is analogous to a two-dimensional array with both flexible row indices and flexible column names.

One way to build a `DataFrame` is by using multiple `Series` objects! Let's create a provinces `Series` associated with the cities we saw earlier:

In [None]:
provinces_dict = {'Montreal': 'QC',
                  'Toronto': 'ON',
                  'Vancouver': 'BC',
                  'Calgary': 'AB', 
                  'Ottawa': 'ON'}
provinces = pd.Series(provinces_dict)
print(provinces)

We can now construct a two-column `DataFrame` using a Dictionary of `Series`:

In [None]:
cities = pd.DataFrame({'area_codes': area_codes,
                       'province': provinces})
print(cities)

Like the ``Series`` object, the ``DataFrame`` has an ``index`` attribute that gives access to the index labels:

In [None]:
cities.index

Additionally, the ``DataFrame`` has a ``columns`` attribute, which is an ``Index`` object holding the column labels:

In [None]:
cities.columns

Thus the ``DataFrame`` can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data.

Similarly, we can also think of a ``DataFrame`` as a specialization of a dictionary.
Where a dictionary maps a key to a value, a ``DataFrame`` maps a column name to a ``Series`` of column data.
For example, asking for the ``'area_codes'`` column returns the ``Series`` object containing the area codes we saw earlier:

In [None]:
cities['area_codes']

A Pandas ``DataFrame`` can be constructed in a variety of ways. Here we'll give several examples:

##### From a single Series object

A ``DataFrame`` is a collection of ``Series`` objects, and a single-column ``DataFrame`` can be constructed from a single ``Series``:

In [None]:
pd.DataFrame(area_codes, columns=['area_codes'])

##### From a Dictionary of Series objects

As we saw before, a ``DataFrame`` can be constructed from a dictionary of ``Series`` objects as well:

In [None]:
pd.DataFrame({'area_codes': area_codes,
              'province': provinces})

##### From a list of dicts

Any list of dictionaries can be made into a ``DataFrame``.
We'll use a simple list comprehension to create some data:

In [None]:
data = [{'a': i, 'b': 2 * i} for i in range(3)]
pd.DataFrame(data)

Even if some keys are missing from some of the dictionaries, Pandas will fill them in with ``NaN`` (i.e., "not a number") values:

In [None]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
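
Once a `DataFrame` contains `NaN` values, they can be located with `isna()` and replaced with `fillna()`. A short sketch on the same data (filling with 0 is an arbitrary choice for illustration):

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# Boolean DataFrame marking the missing entries.
missing = df.isna()

# Summing the booleans counts the missing values per column.
missing_per_column = missing.sum()

# Replace every missing value with 0 (returns a new DataFrame).
filled = df.fillna(0)

print(missing_per_column)
print(filled)
```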

##### From a two-dimensional NumPy array

Given a two-dimensional array of data, we can create a ``DataFrame`` with any specified column and index names.
If omitted, an integer index will be used for each:

In [None]:
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

#### Hands-on example
In this final section, we will have a look at an example with real-world data and see how Pandas can help us quickly get started with exploring it!

As mentioned previously, Pandas supports loading data seamlessly from a variety of sources, including CSV files, databases, etc.

Let's load our dataset from a CSV file stored on GitHub:

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/McGillAISociety/WorkshopDatasets/master/foot-players-stats.csv')

We can then get a preview of the data by using the `head()` method. We'll see that Pandas conveniently extracted the column names and indexed the data!

In [None]:
df.head(n = 10)

We observe that in row 8, the data is not available. This can happen and there are many ways to deal with it.

- If the entire row is empty, we will usually get rid of the row entirely.


Sometimes, if only a few values are missing, we can:
1. replace the value with the mean of that feature. For example, use the average weight of all players
2. replace the value with the mean with respect to another feature. For example, we can replace a player's missing height with the average height of players from the same country
3. remove the entire row (we should try to limit this if our dataset is quite small)
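
The three strategies above can be sketched as follows. The data is a toy table invented for illustration; its column names and values are assumptions and do not come from the workshop dataset:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
players = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Nationality": ["X", "X", "Y", "Y"],
    "Height": [180.0, np.nan, 170.0, 174.0],
    "Weight": [75.0, 80.0, np.nan, 70.0],
})

# 1. Replace missing weights with the overall mean weight.
players["Weight"] = players["Weight"].fillna(players["Weight"].mean())

# 2. Replace missing heights with the mean height of the same nationality.
group_mean = players.groupby("Nationality")["Height"].transform("mean")
players["Height"] = players["Height"].fillna(group_mean)

# 3. Drop any row that still has a missing value (none are left here).
players = players.dropna()

print(players)
```

Note that `mean()` skips `NaN` values by default, so the averages are computed from the available entries only.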

In [None]:
# remove, in place, the rows in which every value is missing
df.dropna(inplace = True, how = 'all')

In [None]:
# View the shape of the data
df.shape

Now let's use Pandas' convenient `describe()` function to get a good initial overview of our data. 

**Note:** The default `describe()` only shows numerical values while excluding NaN elements

In [None]:
df.describe()

Pandas also offers various functions to get information about your data, such as Matplotlib histograms:

In [None]:
df.hist(figsize = (14,16))
plt.tight_layout()

Now, let's quickly look at some commonly used indexing and data selection Pandas `DataFrame` operations:

- `iloc[]` is used on a `DataFrame` to access data using **integer** indexes. 

In [None]:
 # Rows:
df.iloc[0]  # first row (Cristiano Ronaldo)
df.iloc[1]  # second row (Lionel Messi)
df.iloc[-1]  # last row (Barry Richardson)

 # Columns:
df.iloc[:,0]  # first column (name) 
df.iloc[:,1]  # second column (nationality)
df.iloc[:,-1]  # last column (GK_reflexes)

 # Multiple row and column selections:
df.iloc[0:5]  # first five rows of dataframe (use head() instead!)
df.iloc[:, 0:2]  # first two columns w/ all rows
df.iloc[[0,3,6,24], [0,5,6]]  # 1st, 4th, 7th, 25th rows + 1st, 6th, 7th columns.
df.iloc[0:5, 5:8]  # first 5 rows and 6th, 7th, 8th columns

- `set_index()` can be used to change the current index to one or more existing columns

In [None]:
df.set_index("Name", inplace=True)
df.head()

- `loc[]` is used on a `DataFrame` to access data using **label-based** and/or **boolean/logical** indexes.

In [None]:
df.loc[["Lionel Messi","Neymar"]]

In [None]:
df.loc[["Lionel Messi","Neymar"], ["Nationality", "Height", "Weight"]]

In [None]:
df.loc[df["Nationality"] == "argentina"].head()
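
Boolean conditions can be combined inside `loc[]`. The sketch below uses a small made-up table rather than the workshop dataset, so the names and values are assumptions. It also shows a case-insensitive comparison, which can help when the capitalization of the stored values is uncertain:

```python
import pandas as pd

# Toy data, invented for illustration only.
players = pd.DataFrame(
    {"Nationality": ["Argentina", "Argentina", "Brazil", "Portugal"],
     "Height": [170, 185, 175, 187]},
    index=["P1", "P2", "P3", "P4"],
)

# Case-insensitive match on a text column.
argentine = players.loc[players["Nationality"].str.lower() == "argentina"]

# Combine conditions with & (and); each condition needs parentheses.
tall_argentine = players.loc[(players["Nationality"] == "Argentina")
                             & (players["Height"] > 180)]

# isin() tests membership in a list of values.
others = players.loc[players["Nationality"].isin(["Brazil", "Portugal"])]

print(argentine)
print(tall_argentine)
print(others)
```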

## Conclusion
For more information about Pandas' numerous operations, as well as the rest of the libraries mentioned above, we highly encourage you to have a look at the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas. You can buy the book through the link or find excerpts [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).

Thank you for following our workshop! Best of luck on what's in store for you. 