# Statistical Modelling

## 3.1 What is data science and modelling?

Python is a powerful language for analytics and modelling. Popular in industry, it is heavily used in **data science** and is gaining popularity in econometrics.

So what is data science? Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured.


## Setting

As a high-level refresher, suppose we have collected data on some [Pokemon](https://en.wikipedia.org/wiki/Pok%C3%A9mon). For each Pokemon, we noted down its height and weight, giving the following observations:

- Pikachu: Height-120 and Weight-54
- Charmander: Height-132 and Weight-58
- Squirtle: Height-144 and Weight-68
- Raichu: Height-150 and Weight-70

We want to see if there is a **linear relationship** between the height and weight of the Pokemon. That is, are the two _features_ of the Pokemon related?

More specifically, we want to investigate whether _taller_ Pokemon are heavier.

Fortunately, using Python, we can investigate this!
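
As a first quick check, one could store the four observations in arrays and compute the Pearson correlation coefficient between height and weight. The sketch below assumes NumPy (introduced in the next section) and uses its `np.corrcoef` function; the variable names are only illustrative.

```python
import numpy as np

# The four observations listed above.
height = np.array([120, 132, 144, 150])
weight = np.array([54, 58, 68, 70])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry
# is the correlation between height and weight.
r = np.corrcoef(height, weight)[0, 1]
print(round(r, 2))  # 0.98
```

A value close to 1 suggests a strong positive linear relationship, although four data points are far too few to draw firm conclusions.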

## 3.2 NumPy Arrays

For the rest of this tutorial, we will be using [numpy arrays](https://www.machinelearningplus.com/python/numpy-tutorial-part1-array-python-examples/) rather than lists.

#### Remark
Why are we now working with numpy arrays rather than lists? A few reasons actually!

1) NumPy arrays are a lot more memory-efficient than lists. As a rough illustration, if you had 10 different datasets to work with, storing them in a list of lists might take around 2 MB, whilst storing them in a NumPy array might take around 20% of that (the exact figures depend on the data). This is really important, as it means your laptop/desktop will be able to process the data much faster and more efficiently.

2) NumPy arrays are quite similar to vectors in R or matrices in Matlab, so you'll be able to port a lot of the skills you learn here over.

3) NumPy arrays allow for vector/matrix operations in a much smoother manner. In general, we have access to more functions that can be useful to us.
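
To get a feel for the memory point, here is a minimal sketch that compares a list of 1000 integers with the equivalent array. It is only a rough estimate: `sys.getsizeof` is used to add up the list and the integer objects it points to, and the array is assumed to hold 64-bit integers.

```python
import sys
import numpy as np

n = 1000
values = list(range(n))
arr = np.arange(n, dtype=np.int64)

# A list stores pointers to separate Python int objects, so we add up
# the size of the list itself plus the size of every int it refers to.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# An array stores the raw numbers side by side: 8 bytes per int64 entry.
array_bytes = arr.nbytes

print(list_bytes, array_bytes)
```

The exact list figure varies between Python versions, but the array should come out several times smaller.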

In [7]:
# First we import numpy 
import numpy as np

In [8]:
# Creating arrays is quite easy. Recall how we created lists.
my_list = [10, 20, 30, 40]

In [15]:
print(my_list)

[10, 20, 30, 40]


In [11]:
# To create a NumPy array, we use the same syntax as for the list, but we now wrap it
# in the np.array() function.

my_array = np.array([10, 20, 30, 40])

In [14]:
print(my_array)

[10 20 30 40]


Same effect!

In [13]:
# We can also pass a list variable straight into np.array().
my_array_two = np.array(my_list)
print(my_array_two)

[10 20 30 40]
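
Unlike a list, an array also carries some information about itself. The sketch below shows a few standard attributes (`shape`, `ndim`, `size` and `dtype`) on the same array as above.

```python
import numpy as np

my_array = np.array([10, 20, 30, 40])

print(my_array.shape)  # (4,): one dimension with 4 entries
print(my_array.ndim)   # 1
print(my_array.size)   # 4
print(my_array.dtype)  # an integer type such as int64 (platform dependent)
```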


#### Exercise

1) Try creating a numpy array of the list `["pika pika", "bulbasaur"]`. Does this work?

2) Generate a NumPy array of every 4th number from 0 up to (but not including) 100. Recall that the range function `range(0, 100, 4)` may help here.

Now where NumPy arrays come in really handy is vector operations. Let's see some examples.

In [23]:
# We have two lists.
x_one = [1,2,3,4,5]
x_two = [10,20,30,40,50]

# Let's add them together. What do you think we will get?
x_three = x_one + x_two
#print(x_three)

In [25]:
# So how would we actually add them together?

# Method 1: loop over the positions and add the matching entries.
x_three = []
for index in range(len(x_one)):
    result = x_one[index] + x_two[index]
    x_three.append(result)

# Method 2: a more compact way of doing the above, using a list comprehension.
x_sum = [entry_one + entry_two for entry_one, entry_two in zip(x_one, x_two)]

Unnecessarily difficult for a task that should be easy!

What happens if we use numpy?

In [26]:
x_one = np.array([1,2,3,4,5])
x_two = np.array([10,20,30,40,50])

print(x_one + x_two)

[11 22 33 44 55]


Too easy.
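
The same idea carries over to other arithmetic. As a small sketch, multiplying two arrays works entry by entry, and combining an array with a single number applies the operation to every entry.

```python
import numpy as np

x_one = np.array([1, 2, 3, 4, 5])
x_two = np.array([10, 20, 30, 40, 50])

print(x_one * x_two)  # [ 10  40  90 160 250]
print(x_one * 2)      # [ 2  4  6  8 10]
print(x_two / 10)     # [1. 2. 3. 4. 5.]
print(x_one.sum())    # 15
```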

#### Exercise

1) Create a function that takes in two NumPy arrays, subtracts one from the other and returns a NumPy array containing the results.

2) (Extension) Given two NumPy arrays of shapes (1 x n) and (n x 1), matrix multiply them together to get a single value. (You might want to Google what `@` does in Python.)

In [31]:
# x_one @ x_two

Now that's very cool and all, but what if we want to work with matrices instead?

It's easy: you pass in a `list of lists`. Write the first row as a list of numbers, followed by a `,` and then your next row of entries. Make sure the rows are all the same length; otherwise NumPy cannot build a proper matrix and the linear algebra will not work!

In [34]:
A = np.array([[1,2,3], [4,5,6], [7,8,9]])
print(A)

[[1 2 3]
 [4 5 6]
 [7 8 9]]
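
Once the matrix exists, a few standard attributes and operations are handy for checking it. The sketch below, using the same `A`, looks at its shape, its transpose and multiplication by a single number.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(A.shape)  # (3, 3): 3 rows and 3 columns
print(A.T)      # the transpose: rows and columns swapped
print(A * 2)    # every entry doubled
```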


In [36]:
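# What happens if the rows are not the same length?
# Recent NumPy versions (1.24 and later) raise a ValueError for ragged input
# unless we explicitly ask for an array of Python objects with dtype=object.
B = np.array([[1,2,3], [4], [7,8]], dtype=object)
print(B)

# This works, but B is a 1-D array of lists rather than a matrix,
# so it is generally not a good idea if you recall from your math classes...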

[list([1, 2, 3]) list([4]) list([7, 8])]


In [39]:
# We also have some special functions to create certain matrices we may need in maths.

# A 2x3 matrix of 0.
zeros = np.zeros( (2, 3) )
print(zeros)
print("\n\n") # Just to make output easier to read.

# A 5x6 matrix of 1.
ones = np.ones( ( 5, 6) )
print(ones)
print("\n\n") # Just to make output easier to read.

# A 3x3 identity matrix (these are always square, so we only need to specify the number of rows).
identity = np.eye(3)
print(identity)

[[0. 0. 0.]
 [0. 0. 0.]]



[[1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1.]]



[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
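
To see why the identity matrix is useful, here is a small sketch showing that matrix-multiplying by it (with the `@` operator) leaves a matrix unchanged. The matrix `A` is the one defined earlier.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
identity = np.eye(3)

# Matrix multiplication by the identity returns the same values
# (as floats, because np.eye creates a float matrix).
product = A @ identity
print(product)
print(np.array_equal(product, A))  # True
```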


NumPy has many more features, but one more important thing is accessing the elements of our NumPy array.

In [42]:
array = np.array([ [1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12] ])
print(array)

[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]


Let's say we want to access 4. It is in the **1st** row (remember we count from 0 in Python, so this is the second row from the top) and in the **0th** column.

In [43]:
# First specify row number and then column number.
array[1][0]

4
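
NumPy also accepts the row and column inside a single pair of brackets, separated by a comma, which is the form used for slicing later on. A short sketch, which also shows negative indices:

```python
import numpy as np

array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

print(array[1, 0])    # 4, the same element as array[1][0]
print(array[-1, -1])  # 12, negative indices count from the end
```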

#### Exercise
1) Access the number 6 in `array`.

2) Access the number 11 in `array`.

What we just did is known as _indexing_. Now what if we want to grab an entire row? We use a technique called [slicing](https://machinelearningmastery.com/index-slice-reshape-numpy-arrays-machine-learning-python/).

To grab the 1st row (so [4, 5, 6] in our case), we specify the row, then a `,`, then the columns we want from that row. If we leave out the `,` and the columns, we just get the entire row.

In [50]:
array[1]

array([4, 5, 6])

In [52]:
# This is equivalent to the following, where leaving both sides of : empty grabs every column.
array[1, :]

array([4, 5, 6])

In [55]:
# What if we want only the 1st and 2nd columns? Recall that 1:3 will grab the 1st and 2nd columns
# (but not the 3rd, because the end of a slice is exclusive).
array[1, 1:3]

array([5, 6])

In [58]:
# Grab first 2 rows and first 2 columns.
array[0:2, 0:2]

array([[1, 2],
       [4, 5]])
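
Slicing works on columns too. As a brief sketch, putting `:` in the row position grabs every row, and a third number in a slice sets the step size.

```python
import numpy as np

array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

# Every row, but only the 1st column (counting from 0).
print(array[:, 1])     # [ 2  5  8 11]

# Every second row, all columns.
print(array[::2, :])
```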

#### Exercise

Work with `array = np.array([ [1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12] ])`. Grab the following rows and columns from the array.

1) Grab the first 3 rows and first 2 columns.

2) Grab the first 3 rows and first column.

3) Grab the last 2 rows and last 2 columns.

# TODO

## 3.3 Summary Statistics

When you're first given a dataset, running some basic analysis, such as looking at summary statistics, can go a long way. Here, we will be using the [SciPy](https://scipy-lectures.org/) library, a very popular tool for scientific computing.

In [None]:
import scipy 
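
As one possible starting point, the sketch below runs `scipy.stats.describe` on the Pokemon heights from the Setting section. This is only an illustration of what summary statistics might look like here, not necessarily the approach this section will end up taking.

```python
import numpy as np
from scipy import stats

height = np.array([120, 132, 144, 150])

# describe() returns the number of observations, the min and max,
# the mean, the sample variance, the skewness and the kurtosis.
summary = stats.describe(height)

print(summary.nobs)      # 4
print(summary.minmax)    # smallest and largest values
print(summary.mean)      # 136.5
print(summary.variance)  # 177.0 (sample variance, dividing by n - 1)
```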

## 3.4 Regressions

## 3.5 Real data

## 3.6 Introduction to Machine Learning

In [3]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [4]:
height = np.array([120, 132, 144, 150])
weight = np.array([54, 58, 68, 70])

In [6]:
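# Note: sm.OLS does not add an intercept automatically, so this fits height = b * weight
# (a regression through the origin) and the R-squared reported is the uncentered one.
model = sm.OLS(height, weight)
print(model.fit().summary())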

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.999
Model:                            OLS   Adj. R-squared:                  0.999
Method:                 Least Squares   F-statistic:                     3700.
Date:                Sun, 17 Mar 2019   Prob (F-statistic):           9.79e-06
Time:                        16:07:29   Log-Likelihood:                -11.119
No. Observations:                   4   AIC:                             24.24
Df Residuals:                       3   BIC:                             23.62
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             2.1784      0.036     60.827      0.0

