# Python assessment

This notebook is intended to assess your knowledge of Python, as well as three popular tools: numpy, pandas, and scikit-learn.

Don't worry if you don't finish the entire notebook. We will only use the results to decide what content to include in the upcoming Python tutorials.

Each exercise is designed to assess your knowledge of plain Python or of a particular aspect of one of these libraries.
Please attempt at least the first 3 exercises. From exercise 4 onwards, if you find an exercise particularly difficult, feel free to skip it and go to the next one.

In [None]:
# Only run this cell if you're using Google Colab

!git clone https://github.com/torresmateo/fgv-class-2021.git
!cp -r fgv-class-2021/data .
!cp -r fgv-class-2021/utils.py .

## Exercise 0: import modules <a id='ex1'></a>

Please import the following modules:
* `pandas` with the `pd` alias.
* `numpy` with the `np` alias.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from utils import *
# your code here (~2 lines)

## Exercise 1: create random matrices in numpy

Create the following matrices in numpy:
* An identity matrix of shape `(10,10)` into the variable `I`
* A random integer matrix where values are in the range `[0,100]` of shape `(5,5)` into the variable `A`
* A random float matrix where values are in the range `[0,1]` of shape `(5,5)` into the variable `B`
* Use `B` to calculate $C = BB^T$ and store the result into the variable `C`
* Invert `C` and store the inverted matrix into `C_inv`
* Transpose `A` and store the transposed matrix into `A_T`
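As a warm-up, here is a minimal sketch of the numpy calls these steps rely on, applied to smaller matrices. The variable names, shapes, and seed are hypothetical and chosen only for illustration; they are not the ones the exercise asks for.

```python
import numpy as np

np.random.seed(42)  # arbitrary seed for this sketch, not the one used in the exercise

eye3 = np.eye(3)                              # 3x3 identity matrix
ints = np.random.randint(0, 11, size=(2, 3))  # integers in [0, 10]; the upper bound is exclusive
floats = np.random.rand(3, 3)                 # floats sampled uniformly from [0, 1)
gram = floats @ floats.T                      # matrix product of a matrix with its transpose
gram_inv = np.linalg.inv(gram)                # matrix inverse
ints_t = ints.T                               # transpose, shape (3, 2)
```

Note that `np.random.randint` excludes its upper bound, so an inclusive range needs the bound raised by one.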

In [None]:
np.random.seed(0)

# your code here (replace None with your answer)
I = None
A = None
B = None
C = None
C_inv = None
A_T = None

ex_1_vars = [I, A, A_T, B, C, C_inv, C @ C_inv]

In [None]:
plot_ex1(ex_1_vars)

### Expected output:

![expected image ex 1](ex1.png "Expected output ex 1")

## Exercise 2: numpy and list operators

Starting from the provided matrices `A` and `B`, perform the following operations, and append each result to the `ex_2_vars` list:

* dot product between `A` and `B`
* element-wise multiplication between `A` and `B`
* vector of row-wise sum of `A`
* vector of column-wise mean of `B`
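The difference between these operations is easiest to see on small matrices. The sketch below uses hypothetical 2x2 inputs `M` and `N` (not the exercise's `A` and `B`) so that every result can be checked by hand.

```python
import numpy as np

M = np.array([[1, 2], [3, 4]])
N = np.array([[10, 20], [30, 40]])

matrix_product = M @ N      # same as np.dot(M, N) for 2-D arrays: [[70, 100], [150, 220]]
elementwise = M * N         # multiplies matching entries: [[10, 40], [90, 160]]
row_sums = M.sum(axis=1)    # axis=1 collapses the columns, one value per row: [3, 7]
col_means = N.mean(axis=0)  # axis=0 collapses the rows, one value per column: [20., 30.]
```

The `axis` argument names the dimension that disappears, which is why a row-wise sum uses `axis=1`.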


In [None]:
np.random.seed(0)
A = np.random.rand(10, 10)
B = np.random.rand(10, 10)

ex_2_vars = []
# your code here (~4 lines)


In [None]:
plot_ex2(ex_2_vars)

### Expected output:

![expected image ex 2](ex2.png "Expected output ex 2")

```
[array([5.28539835, 5.35380972, 5.6034787 , 5.13354102, 3.29925204,
       3.67902113, 4.29030669, 3.99955955, 6.52812667, 4.10689007]), array([0.5810242 , 0.65373969, 0.51279199, 0.4952426 , 0.68347771,
       0.46202251, 0.58440466, 0.38212016, 0.53170837, 0.39428567])]
```

## Exercise 3: sampling

* Sample 1000 points at random from a **uniform** distribution in 2 dimensions in the range $[30, 50]$ for $x$ and $[25,45]$ for $y$. Store the values in variable `uniform_sample` with shape `(1000, 2)`
* Sample 1000 points at random from a **normal** distribution in 2 dimensions with mean at $(15,10)$ and standard deviation of `1.5`. Store the values in variable `normal_sample` with shape `(1000, 2)`
* Sample 5000 points at random from a **normal** distribution in **3** dimensions with mean at the origin and standard deviation of `1`. Keep only the points where all components are positive. Normalize the points so that their norm is 1. Store the values in a variable `normal_sample_3d` with shape `(n, 3)`, where `n` is the number of points that remain after filtering
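The following sketch shows the building blocks on a smaller, 2-dimensional case: per-axis ranges and means passed as lists, a boolean mask to filter rows, and row-wise normalization. All names, sizes, and parameter values are hypothetical and differ from the ones requested above.

```python
import numpy as np

np.random.seed(42)  # arbitrary seed for this sketch

# low/high are broadcast per column: x in [0, 1), y in [10, 20)
uniform_pts = np.random.uniform(low=[0, 10], high=[1, 20], size=(5, 2))

# loc is broadcast per column as well: mean (0, 5), standard deviation 2
normal_pts = np.random.normal(loc=[0, 5], scale=2.0, size=(5, 2))

# keep only the rows whose components are all positive
pts = np.random.normal(size=(200, 2))
kept = pts[(pts > 0).all(axis=1)]

# divide each row by its Euclidean norm so that every point has norm 1
unit = kept / np.linalg.norm(kept, axis=1, keepdims=True)
```

The number of rows in `kept` depends on the random draw, so it is only known after filtering.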

In [None]:
np.random.seed(0)
# your code here (~6 lines)


In [None]:
plot_ex3(uniform_sample, normal_sample, normal_sample_3d)

### Expected output:

![expected image ex 3](ex3.png "Expected output ex 3")

## Exercise 4: data wrangling with pandas

Load the file `data/clinical_trials-raw.tsv` into a pandas DataFrame in a variable called `trials`, and perform the following operations on the data:

* Keep only the studies where the `Phases` column has one of these values: `Phase 2`, `Phase 3`, `Phase 4`
* Add a `date` column that converts the `Start Date` column to values of type `datetime`
* The `Interventions` column is a list separated by the character `|`. Add a column `interventions_list` to the DataFrame that holds the same information as a list
* Extract all the interventions into a variable `interventions_unique`. The variable should not contain duplicate entries.
* Use `interventions_unique` to make a list of all the interventions involving a drug. Store the list of drugs into a `drugs` variable. You should only store the drug name.
* Add an `age_clean` column to the dataset, and populate it using values from the `Age` column. Each entry should be a list that includes only the values `Child`, `Adult`, and `Older Adult`.
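The pandas operations involved can be sketched on a tiny hand-made table. The DataFrame `toy`, its values, and the `Type: name` layout of each intervention are assumptions made for this illustration; check the actual file before relying on them.

```python
import pandas as pd

# Hypothetical data; the real file may use different value formats.
toy = pd.DataFrame({
    "Phases": ["Phase 1", "Phase 2", "Phase 3"],
    "Start Date": ["2020-01-05", "2020-03-01", "2021-06-01"],
    "Interventions": ["Drug: A", "Drug: A|Device: B", "Drug: C|Drug: A"],
})

# filter rows by membership in a set of allowed values
kept = toy[toy["Phases"].isin(["Phase 2", "Phase 3"])].copy()

# parse strings into datetime values
kept["date"] = pd.to_datetime(kept["Start Date"])

# split a delimited string into a Python list per row
kept["items"] = kept["Interventions"].str.split("|")

# flatten the lists into one long Series, then drop duplicates
unique_items = kept["items"].explode().unique()

# keep the entries with the assumed "Drug: " prefix and strip it
drug_names = [s.split(": ", 1)[1] for s in unique_items if s.startswith("Drug: ")]
print(drug_names)  # ['A', 'C']
```

Calling `.copy()` after filtering avoids pandas' `SettingWithCopyWarning` when new columns are added to the filtered frame.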

In [None]:
# your code here (~15 lines)


In [None]:
plot_ex4(interventions_unique, drugs, trials)

### Expected output:

![expected image ex 4](ex4.png "Expected output ex 4")

```
Number of Interventions: 1306
Number of Drugs: 912
```

## Exercise 5: advanced data wrangling

Load the file `data/drugbank.pkl` into a pandas DataFrame in a variable called `drugbank`, and perform the following operations on the data (**note:** *the file is in long format*):

* Create a new DataFrame `approved` that only contains drugs that have the `Group` value as `approved`
* Use the `approved` DataFrame to create a new DataFrame `drugbank_wide` that has the following columns:
    * `DrugBank ID`
    * `Name`
    * `Target Count` (you will need to calculate this value)
    * `Top ATC` (the top ATC is the **first character** of the `ATC` variable. If no ATC is available, use the value `'NOT FOUND'`)
    

<details>
    <summary>HINTS</summary>
    <ul>
        <li>Use the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html"><code>set_index</code></a> function and set the <code>DrugBank ID</code> column as the DataFrame index.</li>
        <li>The following functions will be very useful in this situation:
            <ul>
                <li><a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html"><code>join</code></a></li>
                <li><a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html"><code>drop</code></a></li>
                <li><a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html"><code>rename</code></a></li>
            </ul>
        </li>
    </ul>
</details>
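To illustrate what going from long to wide format can look like, here is a minimal sketch on a hand-made table with one row per (drug, target) pair. The column names `id`, `name`, `target`, and `code` are hypothetical stand-ins, and this is only one of several possible approaches.

```python
import pandas as pd

# Hypothetical long-format data: each drug repeats once per target.
long_df = pd.DataFrame({
    "id": ["d1", "d1", "d2", "d3", "d3", "d3"],
    "name": ["alpha", "alpha", "beta", "gamma", "gamma", "gamma"],
    "target": ["t1", "t2", "t1", "t3", "t4", "t5"],
    "code": ["A01", "A01", None, "B02", "B02", "B02"],
})

# count distinct targets per drug
counts = long_df.groupby("id")["target"].nunique().rename("target_count")

# collapse to one row per drug and index by the identifier
per_id = long_df.drop(columns="target").drop_duplicates().set_index("id")

# first character of the code, with a placeholder where the code is missing
per_id["top_code"] = per_id["code"].str[0].fillna("NOT FOUND")

# join the counts on the shared index and restore the identifier as a column
wide = per_id.drop(columns="code").join(counts).reset_index()
print(wide)
```

Because both `per_id` and `counts` are indexed by `id`, `join` aligns them without an explicit key.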


In [None]:
# your code here (~12 lines)


In [None]:
plot_ex5(drugbank_wide, drugbank)

### Expected output:

![expected image ex 5](ex5.png "Expected output ex 5")

## Exercise 6: classification with scikit-learn

For this exercise, you will need to import the required components of scikit-learn.

* Load the iris dataset.
* Split the dataset into training and testing sets using `train_test_split` (set `random_state=0`; the expected output below shows 90 training samples out of the 150 in iris, which corresponds to `test_size=0.4`).
* Train a decision tree classifier (set `random_state=0`).
* Plot a confusion matrix of the results.
* In another cell, plot the decision tree.
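The general split, fit, and evaluate pattern can be sketched on a different dataset. The example below uses scikit-learn's wine dataset with the default split size, and prints the confusion matrix as numbers instead of plotting it; these choices are assumptions made for this illustration only.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# load features and labels as arrays
X, y = load_wine(return_X_y=True)

# hold out part of the data for evaluation (default test_size is 0.25)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit the classifier on the training portion only
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, clf.predict(X_test), labels=clf.classes_)
print(cm)
```

For the plots, recent scikit-learn versions (1.0 and later) provide `ConfusionMatrixDisplay.from_estimator` in `sklearn.metrics`, and `plot_tree` in `sklearn.tree`. The older `plot_confusion_matrix` function was removed in version 1.2.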

In [None]:
# your code here (~10 lines)


### Expected output:
```
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at ...>
```

![expected image ex 6-1](ex6-1.png "Expected output ex 6")

In [None]:
plt.figure(figsize = (4,8))
# your code here (1 line)


### Expected output:
```
[Text(89.28, 391.392, 'X[3] <= 0.75\ngini = 0.663\nsamples = 90\nvalue = [34, 27, 29]'),
 Text(44.64, 304.416, 'gini = 0.0\nsamples = 34\nvalue = [34, 0, 0]'),
 Text(133.92000000000002, 304.416, 'X[2] <= 5.05\ngini = 0.499\nsamples = 56\nvalue = [0, 27, 29]'),
 Text(89.28, 217.44, 'X[3] <= 1.75\ngini = 0.128\nsamples = 29\nvalue = [0, 27, 2]'),
 Text(44.64, 130.464, 'gini = 0.0\nsamples = 26\nvalue = [0, 26, 0]'),
 Text(133.92000000000002, 130.464, 'X[1] <= 3.1\ngini = 0.444\nsamples = 3\nvalue = [0, 1, 2]'),
 Text(89.28, 43.488, 'gini = 0.0\nsamples = 2\nvalue = [0, 0, 2]'),
 Text(178.56, 43.488, 'gini = 0.0\nsamples = 1\nvalue = [0, 1, 0]'),
 Text(178.56, 217.44, 'gini = 0.0\nsamples = 27\nvalue = [0, 0, 27]')]
```

![expected image ex 6-2](ex6-2.png "Expected output ex 6-2")