# Course 2 week 1 lecture notebook

## Outline
Please click on the desired link to jump to that section of the lecture notebook!

[Numpy and Pandas functions](#numpy-pandas-functions)

[Linear model](#linear-model)

[Concordance index](#c-index)

[Combine features](#combine-features)

<a name="numpy-pandas-functions"></a>
## Numpy and Pandas functions

In [1]:
import numpy as np
import pandas as pd
from utils import load_data

- We will load a small dataset and practice some pandas functions that will be helpful in this week's assignment.

In [2]:
X, y = load_data(2)

In [3]:
X

| | Age | Systolic_BP | Diastolic_BP | Cholesterol |
|---|---|---|---|---|
| 0 | 77.19634 | 104.559933 | 68.598186 | 98.729389 |
| 1 | 63.52985 | 107.219068 | 92.739634 | 114.992075 |


In [4]:
y

0    1.0
1    1.0
Name: y, dtype: float64

### Mean
- Calculate the mean of the dataframe

In [5]:
X.mean()

Age              70.363095
Systolic_BP     105.889501
Diastolic_BP     80.668910
Cholesterol     106.860732
dtype: float64

Notice how it calculates the mean of each column.  
- Pandas will treat each column separately.  
- If you were working with a 2D array in numpy, calling `np.mean` without an `axis` argument would return a single mean over every element of the matrix.
- Specifying the axis is a way to ensure that you will take the mean of each column instead of the entire table of data.
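
The difference between the two defaults can be checked directly. This is a minimal sketch that rebuilds the two-row table by hand instead of calling `load_data`; the variable names (`data`, `df`, and so on) are ours, for illustration only.

```python
import numpy as np
import pandas as pd

# Values copied from the two-row table shown above
data = np.array([
    [77.19634, 104.559933, 68.598186, 98.729389],
    [63.52985, 107.219068, 92.739634, 114.992075],
])
df = pd.DataFrame(
    data, columns=["Age", "Systolic_BP", "Diastolic_BP", "Cholesterol"]
)

overall_mean = np.mean(data)          # numpy default: one number for all 8 values
column_means = np.mean(data, axis=0)  # one mean per column (feature)
row_means = np.mean(data, axis=1)     # one mean per row (patient)
df_means = df.mean()                  # pandas default: one mean per column

print(overall_mean)
print(column_means)
print(df_means)
```

Passing `axis` explicitly makes the intent clear to the reader regardless of which library's default applies.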

- Calculate the mean of each column (each feature)

In [6]:
X.mean(axis=0)

Age              70.363095
Systolic_BP     105.889501
Diastolic_BP     80.668910
Cholesterol     106.860732
dtype: float64

- Calculate the mean of each example (also known as each record, row, or patient)

In [7]:
X.mean(axis=1)

0    87.270962
1    94.620157
dtype: float64

### Natural log
- Calculate the natural log of the data
- Notice, pandas doesn't have a `.log()` function, so we'll use numpy
- Also notice that `np.log` is the natural log (the base is the number 'e'). For other bases, numpy provides `np.log2` and `np.log10`.

In [8]:
np.log(X)

| | Age | Systolic_BP | Diastolic_BP | Cholesterol |
|---|---|---|---|---|
| 0 | 4.346352 | 4.64976 | 4.228266 | 4.592383 |
| 1 | 4.15151 | 4.674874 | 4.529796 | 4.744863 |
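
One way to confirm that `np.log` uses base e is to undo it with `np.exp`. The sketch below uses only the `Age` values from the table above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Age values copied from the table above
ages = pd.Series([77.19634, 63.52985], name="Age")

log_ages = np.log(ages)      # numpy functions accept a pandas Series and return one
recovered = np.exp(log_ages)  # exp is the inverse of the natural log
log_of_e = np.log(np.e)       # equals 1 only if the base is e

print(log_ages)
print(recovered)
print(log_of_e)
```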


### This is the end of this practice section.

Please continue on with the lecture videos!

---

<a name='linear-model'></a>
## Linear model

We'll practice using a scikit-learn model for linear regression. You will do something similar in this week's assignment (but with a different model).

[sklearn.linear_model.LinearRegression()](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

- Import the model

In [9]:
from sklearn.linear_model import LinearRegression

- Create the model object

In [10]:
model = LinearRegression()
model

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

- We'll load in some data

In [11]:
from utils import load_data

In [12]:
X, y = load_data(100)

- Fit the model

In [13]:
model.fit(X, y)
model

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

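- View the coefficients (these are the 'weights' associated with each feature).
- You'll use the coefficients for making predictions. Because the model was created with `fit_intercept=True`, the prediction also includes an intercept term $\beta_0$, stored in `model.intercept_`.

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_N x_N$$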

In [14]:
model.coef_

array([0.00975155, 0.00835816, 0.00836864, 0.00971064])
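
To see how the coefficients turn into predictions, the weighted sum can be computed by hand and compared with `model.predict`. This is a sketch on synthetic data, not the dataset returned by `load_data`; the names `X_demo`, `y_demo`, and `model_demo`, and the weights used to generate the data, are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noiseless data: y = 3.0 + 1.0*x1 - 2.0*x2 + 0.5*x3
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 3.0

model_demo = LinearRegression().fit(X_demo, y_demo)

# Prediction by hand: intercept plus the weighted sum of the features
manual_pred = model_demo.intercept_ + X_demo @ model_demo.coef_
sklearn_pred = model_demo.predict(X_demo)

print(model_demo.coef_)
print(model_demo.intercept_)
print(np.allclose(manual_pred, sklearn_pred))
```

Because the synthetic targets contain no noise, the fitted coefficients should match the generating weights up to floating-point error.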

### This is the end of this practice section.

Please continue on with the lecture videos!

---

<a name='c-index'></a>
## Concordance Index

- We'll generate some labels

In [18]:
import pandas as pd

- We will let `y` refer to the actual health outcome of the patient.
- 1 indicates disease, 0 indicates health (normal)

In [19]:
y = pd.Series([0,0,1,1,0])
y.name="health"
y

0    0
1    0
2    1
3    1
4    0
Name: health, dtype: int64

In [20]:
risk_score = pd.Series([2.2, 3.3, 4.4, 4.4])
risk_score.name='risk score'
risk_score

0    2.2
1    3.3
2    4.4
3    4.4
Name: risk score, dtype: float64

### Identify a permissible pair
- Compare the labels of the two patients: a pair is permissible only if their labels differ.

In [21]:
if y[0] != y[1]:
    print(f"y[0]={y[0]} and y[1]={y[1]} is a permissible pair")
else:
    print(f"y[0]={y[0]} and y[1]={y[1]} is not a permissible pair")

y[0]=0 and y[1]=0 is not a permissible pair


In [22]:
if y[0] != y[2]:
    print(f"y[0]={y[0]} and y[2]={y[2]} is a permissible pair")
else:
    print(f"y[0]={y[0]} and y[2]={y[2]} is not a permissible pair")

y[0]=0 and y[2]=1 is a permissible pair
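
The same check can be applied to every pair of patients at once. This is a minimal sketch using a plain Python list holding the same labels as `y`; the names `labels` and `permissible_pairs` are ours, for illustration only.

```python
from itertools import combinations

# Same labels as the `y` series above (1 = disease, 0 = healthy)
labels = [0, 0, 1, 1, 0]

# A pair (i, j) is permissible when the two outcomes differ
permissible_pairs = [
    (i, j)
    for i, j in combinations(range(len(labels)), 2)
    if labels[i] != labels[j]
]

print(permissible_pairs)
```

`combinations` visits each unordered pair exactly once, so no pair is counted twice.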


### Check for risk ties
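- For permissible pairs, check whether the two patients have the same risk score.
- Note that patients 2 and 3 below both have the disease, so they are not a permissible pair; the cell only illustrates how to compare two risk scores.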

In [23]:
if risk_score[2] == risk_score[3]:
    print(f"patient 2 ({risk_score[2]}) and patient 3 ({risk_score[3]}) have a risk tie")
else:
    print(f"patient 2 ({risk_score[2]}) and patient 3 ({risk_score[3]}) DO NOT have a risk tie")

patient 2 (4.4) and patient 3 (4.4) have a risk tie


### Concordant pairs
- Check if a permissible pair is also a concordant pair
- We'll check one case, where the first patient is healthy and the second has the disease

In [24]:
if y[1] == 0 and y[2] == 1:
    if risk_score[1] < risk_score[2]:
        print("patient 1 and 2 is a concordant pair")

patient 1 and 2 is a concordant pair


- Note that we checked the situation where patient 1 is healthy and patient 2 has the disease.
- We should also check the other situation where patient 1 has the disease and patient 2 is healthy.
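
Both situations can be handled in one pass by first working out which patient in the pair has the disease. The sketch below puts the three checks together (permissible pairs, risk ties, concordant pairs). It assumes the common definition in which a risk tie counts as half a concordant pair; the function name `cindex_sketch` is hypothetical, and the assignment may structure its solution differently.

```python
def cindex_sketch(y_true, scores):
    """Concordance index: (concordant + 0.5 * ties) / permissible."""
    permissible = 0
    concordant = 0
    ties = 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            # Skip pairs with the same outcome: they are not permissible
            if y_true[i] == y_true[j]:
                continue
            permissible += 1
            # Equal risk scores within a permissible pair are a risk tie
            if scores[i] == scores[j]:
                ties += 1
                continue
            # Work out which patient has the disease, covering both orderings
            sick, healthy = (i, j) if y_true[i] == 1 else (j, i)
            if scores[sick] > scores[healthy]:
                concordant += 1
    if permissible == 0:
        raise ValueError("no permissible pairs")
    return (concordant + 0.5 * ties) / permissible


# The four patients above that have both a label and a risk score
labels = [0, 0, 1, 1]
risk = [2.2, 3.3, 4.4, 4.4]
result = cindex_sketch(labels, risk)
print(result)
```

With these four patients, every permissible pair ranks the patient with the disease higher, so all four pairs are concordant.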

You'll practice implementing this algorithm in this week's assignment!

### This is the end of this practice section.

Please continue on with the lecture videos!

---

<a name="combine-features"></a>
## Combine features


In [25]:
import pandas as pd

In [26]:
from utils import load_data

In [27]:
X, y = load_data(2)

In [28]:
X

| | Age | Systolic_BP | Diastolic_BP | Cholesterol |
|---|---|---|---|---|
| 0 | 77.19634 | 104.559933 | 68.598186 | 98.729389 |
| 1 | 63.52985 | 107.219068 | 92.739634 | 114.992075 |


In [29]:
feature_names = X.columns
feature_names

Index(['Age', 'Systolic_BP', 'Diastolic_BP', 'Cholesterol'], dtype='object')

### Combine strings

- Use f-strings to combine two strings.
- There are other ways to do this, but Python's f-strings are quite useful.

In [30]:
name1 = feature_names[0]
name2 = feature_names[1]

In [31]:
name1

'Age'

In [32]:
name2

'Systolic_BP'

In [33]:
combined_names = f"{name1}_&_{name2}"
combined_names

'Age_&_Systolic_BP'
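
For comparison, here is a short sketch of some of the other ways to build the same string. All four forms produce an identical result; the variable names are illustrative.

```python
name1 = "Age"
name2 = "Systolic_BP"

with_fstring = f"{name1}_&_{name2}"           # f-string, as used above
with_plus = name1 + "_&_" + name2             # plain concatenation
with_join = "_&_".join([name1, name2])        # join a list with a separator
with_format = "{}_&_{}".format(name1, name2)  # str.format

print(with_fstring, with_plus, with_join, with_format)
```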

### Add two columns
- Add the values from two columns and put them into a new column.
- You'll do something similar in this week's assignment.

In [34]:
X['new_column'] = X['Age'] + X['Systolic_BP']
X

| | Age | Systolic_BP | Diastolic_BP | Cholesterol | new_column |
|---|---|---|---|---|---|
| 0 | 77.19634 | 104.559933 | 68.598186 | 98.729389 | 181.756273 |
| 1 | 63.52985 | 107.219068 | 92.739634 | 114.992075 | 170.748919 |
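
The two ideas in this section (naming with f-strings and adding columns) can be combined in a loop over every pair of features. This is a sketch that rebuilds the two-row table by hand instead of calling `load_data`; the names `df` and `combined` are ours, and summing the columns simply mirrors the example above.

```python
from itertools import combinations

import pandas as pd

# Values copied from the two-row table shown above
df = pd.DataFrame({
    "Age": [77.19634, 63.52985],
    "Systolic_BP": [104.559933, 107.219068],
    "Diastolic_BP": [68.598186, 92.739634],
    "Cholesterol": [98.729389, 114.992075],
})

# Fix the list of original columns before adding new ones
original_columns = list(df.columns)

# Work on a copy so the original table is left unchanged
combined = df.copy()
for a, b in combinations(original_columns, 2):
    combined[f"{a}_&_{b}"] = df[a] + df[b]

print(combined.columns.tolist())
```

Taking the column list up front avoids iterating over columns that the loop itself has just created.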


### This is the end of this practice section.

Please continue on with the lecture videos!

---