# Scientific Computing and Modelling in Python Part 1

## 3.1 What is data science and modelling?

Python is a powerful language that is used for analytics and modelling. A popular language in industry, it is heavily used in the **data science field** and is gaining popularity in econometrics.

So what is data science? Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured.


## Setting

As a high-level refresher, suppose we have collected some data on [Pokemon](https://en.wikipedia.org/wiki/Pok%C3%A9mon). For each Pokemon, we noted down its height and weight, giving the following observations:

- Pikachu: Height 120 and Weight 54
- Charmander: Height 132 and Weight 58
- Squirtle: Height 144 and Weight 68
- Raichu: Height 150 and Weight 70

We want to see if there is a **linear relationship** between the height and weight of the Pokemon. That is, are the two _features_ of the Pokemon related?

More specifically, we want to investigate whether _taller_ Pokemon are heavier.

Fortunately, using Python, we can investigate this!
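As a first pass at that question, the four observations listed above can be stored in two NumPy arrays and summarised with a correlation coefficient. This is only a minimal sketch: the variable names `heights` and `weights` are our own choice, and `np.corrcoef` measures how strongly two features move together rather than fitting a regression.

```python
import numpy as np

# Observations from the list above, in the order
# Pikachu, Charmander, Squirtle, Raichu
heights = np.array([120, 132, 144, 150])
weights = np.array([54, 58, 68, 70])

# np.corrcoef returns a 2 x 2 correlation matrix; the off-diagonal
# entry is the correlation between the two features
correlation = np.corrcoef(heights, weights)[0, 1]
print(f"Correlation between height and weight: {correlation:.3f}")
```

A value close to 1 suggests that taller Pokemon in this tiny sample also tend to be heavier, which is what a regression can examine more carefully.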

## 3.2 NumPy Arrays

For the rest of this tutorial, we will be using [numpy arrays](https://www.machinelearningplus.com/python/numpy-tutorial-part1-array-python-examples/) rather than lists.

#### Remark
Why are we now working with numpy arrays rather than lists? A few reasons actually!

1) NumPy arrays are a lot more efficient than lists. For example, if you had 10 different datasets to work with, storing them in a list of lists would take around 2 MB, whilst storing them in a NumPy array would take around 20% of that. This is really important, as it means your laptop/desktop will be able to process the data MUCH faster and more efficiently.

2) NumPy arrays are quite similar to vectors in R or matrices in Matlab, hence you'll be able to port a lot of the skills you learn here over.

3) NumPy arrays allow for vector/matrix operations in a much smoother manner. In general, we have access to more functions that can be useful to us.
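To make the third point concrete, here is a small sketch comparing how `+` and `*` behave on lists and on arrays. The names `list_a`, `array_a` and so on are made up for illustration.

```python
import numpy as np

list_a = [1, 2, 3]
list_b = [10, 20, 30]
array_a = np.array(list_a)
array_b = np.array(list_b)

# With lists, + joins the two lists together
joined = list_a + list_b

# With arrays, + adds element by element
summed = array_a + array_b

# Arithmetic with a single number is applied to every element
doubled = array_a * 2

print(joined)   # a list with six entries
print(summed)   # an array with three entries
print(doubled)
```

With plain lists, the same elementwise sum would need a loop or a list comprehension.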

#### Exercise

1) Try creating a numpy array of the list `["pika pika", "bulbasaur"]`. Does this work?

2) Generate a numpy array of every 4th number from 1-100. Recall the range function `range(0, 100, 4)` may help here.

```python
# Code here
```
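If you are unsure where to start, the sketch below shows a few ways of building arrays, using different numbers from the exercise. The variable names are our own; `np.arange` is NumPy's array version of `range`.

```python
import numpy as np

# Convert an ordinary list of numbers into an array
numbers = np.array([2, 4, 6, 8])

# A range can be converted too: every 5th number from 0 up to
# (but not including) 20
from_range = np.array(range(0, 20, 5))

# np.arange builds the same array directly
from_arange = np.arange(0, 20, 5)

print(numbers)
print(from_range)
print(from_arange)
```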

#### Exercise

1) Create a function that takes in two numpy arrays, subtract them from each other and return a numpy array containing the results.

2) (Extension) Given two numpy arrays (1 x n) and (n x 1), matrix multiply them together to get a single value. (You might want to Google what `@` does in Python.)

```python
# Code here
# x_one @ x_two
```
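As a hint for the extension, `@` is Python's matrix multiplication operator, and NumPy arrays support it. The sketch below uses made-up arrays called `row` and `column`; note that the result of (1 x n) @ (n x 1) is still a 1 x 1 array, so the single value has to be pulled out by indexing.

```python
import numpy as np

row = np.array([[1, 2, 3]])          # shape (1, 3)
column = np.array([[4], [5], [6]])   # shape (3, 1)

# (1 x 3) @ (3 x 1) gives a 1 x 1 array
product = row @ column

# Index into the 1 x 1 array to get the single number:
# 1*4 + 2*5 + 3*6
value = product[0, 0]

print(product.shape)
print(value)
```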

#### Exercise

Use the following array for this exercise: `array = np.array([ [1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12] ])`.

1) Access the number 6 in `array`.

2) Access the number 11 in `array`.

#### Exercise

Using `array = np.array([ [1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12] ])`, grab the following rows and columns from the array.

1) Grab the first 3 rows and first 2 columns.

2) Grab the first 3 rows and first column.

3) Grab the last 2 rows and last 2 columns.
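Indexing and slicing a two-dimensional array follow the pattern `array[rows, columns]`. Here is a minimal sketch on a different, made-up array called `grid`, so the exercises above are still yours to solve.

```python
import numpy as np

grid = np.array([[10, 20, 30, 40],
                 [50, 60, 70, 80],
                 [90, 100, 110, 120]])

# A single element: row index first, then column index (both start at 0)
element = grid[1, 2]

# A slice start:stop includes start but excludes stop,
# so this is the first 2 rows and first 3 columns
top_left = grid[:2, :3]

# A lone colon means "everything"; negative indices count from the end,
# so this is the last column of every row
last_column = grid[:, -1]

print(element)
print(top_left)
print(last_column)
```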

## 3.3 Summary Statistics

#### Exercise

1) Calculate the standard deviation of `fake_data`. Use the function `np.std()` OR a combination of `np.sqrt()` (which takes the square root of numbers) and `np.var()`.
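The two approaches agree because the standard deviation is defined as the square root of the variance. The sketch below checks this on a made-up array called `sample` (not `fake_data`). One thing to be aware of: by default both `np.std` and `np.var` divide by n (the population versions); passing `ddof=1` divides by n - 1 instead.

```python
import numpy as np

sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Standard deviation computed directly
std_direct = np.std(sample)

# Standard deviation computed as the square root of the variance
std_from_var = np.sqrt(np.var(sample))

# Sample (n - 1) version, for comparison
std_sample = np.std(sample, ddof=1)

print(std_direct, std_from_var, std_sample)
```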

#### Exercise

Suppose we had a 2 x 5 matrix (2 rows and 5 columns): `A = np.array([ [5, 10, 15, 20, 25], [6, 12, 18, 24, 30] ])`.

1) Calculate the mean and variance of the first column.

2) Do the same thing for the second column.
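A useful habit before computing column statistics is to check the array's `shape`, so you know which direction is rows and which is columns. The sketch below uses a made-up 3 x 2 matrix `B` and shows two options: slicing out one column first, or using the `axis` argument to handle every column at once.

```python
import numpy as np

B = np.array([[1, 10],
              [2, 20],
              [3, 30]])

# shape is (number of rows, number of columns)
print(B.shape)

# Option 1: slice out the first column, then summarise it
first_column = B[:, 0]
col_mean = np.mean(first_column)
col_var = np.var(first_column)   # divides by n by default

# Option 2: axis=0 computes the statistic down each column
all_col_means = np.mean(B, axis=0)
all_col_vars = np.var(B, axis=0)

print(col_mean, col_var)
print(all_col_means, all_col_vars)
```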

## 3.4 Regressions

#### Exercise

1. Create your own numpy arrays and run your own regressions. Ask for help if needed! Try doing it with 2 or more variables.

2. (Extension). If you have done ECMT2160, you would have seen the `logit` regression. Try implementing [it](https://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.Logit.html). Instead of `sm.OLS`, use `sm.Logit` and keep everything else the same.
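To see what a one-variable regression is doing under the hood, here is a rough sketch on the Pokemon heights and weights from the Setting section. It deliberately does not use `sm.OLS`: to stay NumPy-only it uses `np.linalg.lstsq`, which solves the same least-squares problem but returns only the coefficients, without the standard errors and summary table that statsmodels provides. The variable names are our own.

```python
import numpy as np

# Pokemon data from the Setting section
heights = np.array([120, 132, 144, 150], dtype=float)
weights = np.array([54, 58, 68, 70], dtype=float)

# Design matrix: a column of ones (for the intercept) next to the heights
X = np.column_stack([np.ones_like(heights), heights])

# Least-squares fit of: weight = intercept + slope * height
coefficients, residuals, rank, singular_values = np.linalg.lstsq(
    X, weights, rcond=None
)
intercept, slope = coefficients

print(f"intercept: {intercept:.3f}")
print(f"slope: {slope:.3f}")
```

A positive slope here means that, in this small sample, each extra unit of height is associated with a higher weight.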