# Python for Data Science, Level I
### *Session \#10*
---

### Helpful shortcuts
---

**SHIFT** + **ENTER** ----> Execute Cell

**TAB** ----> See autocomplete options

**ESC** then **b** ----> Create Cell 

**ESC** then **dd** ----> Delete Cell

**\[python expression\]?** ----> Explanation of that Python expression

**ESC** then **m** then **ENTER** ----> Switch to Markdown mode

## I. Multilinear Foundations

### Warm Ups
---

**Setup**

```python
from matplotlib import pyplot as plt
%matplotlib inline
# This style is named "seaborn-v0_8" in matplotlib 3.6 and later
plt.style.use("seaborn")

import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

df = pd.read_csv("housing_prices.csv")
```

**Dividing data into feature matrix and target vector:**
```python
columns = ['charles_river', 'student_teacher_ratio']
X = df[columns]
y = df['median_price']
```

**Dividing data into test and train set:**
```python
X_train, X_test, y_train, y_test = train_test_split(X, y)
```

**Create a linear model:** 
```python
model = LinearRegression()
model.fit(X_train, y_train)
```

**Predict using your linear model:** `predicted = model.predict(X_test)`
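
The warm-up steps above can be tried end to end without the course CSV. The following is a minimal sketch on synthetic data: the `rooms`, `age`, and `price` columns and the rule that generates them are made up for illustration, and `test_size` and `random_state` are optional arguments added here only so the split is reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for housing_prices.csv (all values are invented)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "rooms": rng.uniform(3, 9, n),
    "age": rng.uniform(0, 100, n),
})
# Price (in thousands) follows a known linear rule plus a little noise
toy["price"] = 30 + 8 * toy["rooms"] - 0.1 * toy["age"] + rng.normal(0, 1, n)

# Feature matrix and target vector
X = toy[["rooms", "age"]]
y = toy["price"]

# Hold out 25% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit on the training rows only
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the held-out rows and measure the average miss
predicted = model.predict(X_test)
mae = mean_absolute_error(y_test, predicted)

print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
print("MAE:", mae)
```

Because the data were generated from a known rule, the fitted coefficients should land close to 8 and -0.1, which is a quick way to check that the pipeline is wired up correctly.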

### Exercises
---

**1. In the following equation, do the features** $X_0$ **and** $X_1$ **have a positive or negative relationship with Y?**

$$Y = 200X_{0} - 65X_{1} + 25$$

**2. Let's say our linear model estimates house prices by starting with \\$400,000, then adds \\$50,000 per room and subtracts \$2,000 per year since construction.**

**Can you translate this into an equation?**

**3. What would a 5-room house from 1990 cost, according to this model?**

**4. Create a feature matrix from the** `nitrous` **and** `avg_num_rooms` **columns. Use the** `median_price` **column as the target vector.**

**Finally, use** `train_test_split` **to separate out training vs test sets**

**5. Create a new** `LinearRegression` **model and fit it using the data from** `X_train` **and** `y_train`
   

**6. What is the mean absolute error for your model?**

**Remember, use** `X_test` **and** `y_test` **when evaluating your model**

## II. Multilinear Regression

### Warm Ups
---

**Correlations between columns:** `df.corr()`

**Compute R2 score:** `model.score(X_test, y_test)`
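
Note that `score` takes the feature matrix and the true targets, then makes its own predictions internally. As a rough check, the sketch below (synthetic data, made-up coefficients) compares it with `sklearn.metrics.r2_score`, which takes the true targets and predictions you computed yourself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: two features with a known linear relationship plus noise
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(120, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1.0, 120)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
model = LinearRegression().fit(X_train, y_train)

# score() predicts from X_test internally, then compares with y_test
r2_from_score = model.score(X_test, y_test)

# r2_score() compares true values with predictions you pass in
r2_from_metric = r2_score(y_test, model.predict(X_test))

print(r2_from_score, r2_from_metric)
```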

**Standardize features:** 
```python
X = scale(df[columns])
```
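
To see what standardizing does, here is a small sketch on a made-up two-column table (the `rooms` and `tax` names and values are illustrative only). After `scale()`, each column has mean 0 and standard deviation 1, and the result is a NumPy array, so the column names are gone.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import scale

# Two invented columns on very different scales
toy = pd.DataFrame({
    "rooms": [4.0, 5.0, 6.0, 7.0, 8.0],
    "tax": [200.0, 300.0, 400.0, 500.0, 600.0],
})

X_scaled = scale(toy)

print(type(X_scaled))         # a NumPy array, not a DataFrame
print(X_scaled.mean(axis=0))  # each column mean is (numerically) 0
print(X_scaled.std(axis=0))   # each column standard deviation is 1
```

Both toy columns are evenly spaced, so they come out identical after scaling even though the raw units differ by a factor of about 100.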

**Importance of features:** 
```python
sorted(zip(model.coef_, columns))
```
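
Sorting `(coefficient, name)` pairs puts the most negative coefficient first and the most positive last. Comparing coefficients this way is only meaningful when the features were standardized first. The sketch below uses synthetic data with hypothetical feature names to show the ordering.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import scale

rng = np.random.default_rng(2)

# Hypothetical feature names, chosen to describe how the target is built
names = ["helps_a_lot", "hurts_a_little", "irrelevant"]

# Three features on very different raw scales
raw = rng.normal(size=(300, 3)) * [1.0, 10.0, 100.0]

# The target depends on the first two features only
y = 5.0 * raw[:, 0] - 0.1 * raw[:, 1] + rng.normal(0, 0.5, 300)

# Standardize so the coefficients are comparable, then fit
X = scale(raw)
model = LinearRegression().fit(X, y)

# Tuples sort by their first element, the coefficient
ranked = sorted(zip(model.coef_, names))
for coef, name in ranked:
    print(f"{name:>15}: {coef:+.2f}")
```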

### Exercises
---

**1. Will the MAE for this model be high or low? Will the R2 be close to 1 or 0?**

![image](flat.png)

**2. Will the MAE be high or low? Will the R2 be close to 1 or 0?**

![image](steep.png)

**3. Use** `.corr()` **to see the column correlations, then select the correlations for** `median_price`

**Save the result as** `p_cor`

**4. Use** `plt.bar()` **to create a bar chart of the correlations.**

`plt.bar()` **needs two inputs:** `p_cor.index` **and** `p_cor.values`

**5. Retrain your model using all available column data, minus** `median_price` **of course!**

**Use** `scale()` **on your feature matrix before training, to standardize data**

*Hint: You can use* `df.columns[:-1]` *to grab all but the last column name*

**6. What is the R2 score for your new linear model?**

**7. Output the most important features. Which feature is the most positive for a house's price?** 

**What factor is the most damaging for a house's price?** 

## III. Linear Regression Assumptions

### Warm Ups
---

**Create residuals:**
```python
predictions = model.predict(X_test)
residuals = predictions - y_test
```

**Add column names back:**
```python
X_test = pd.DataFrame(X_test, columns=columns)
```

**Plot residuals against predictions:**
```python
plt.scatter(predictions, residuals)
```

**Plot residuals against a feature:**
```python
plt.scatter(X_test['avg_num_rooms'], residuals)
```

**Create a histogram of a column:**
```python
# median_price is the target, so it lives in df (and y_test), not in X_test
plt.hist(df['median_price'])
```
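
To show what a "systematic failure" looks like in these plots, the sketch below fits a straight line to synthetic data that is deliberately curved. The data and its quadratic term are assumptions made for illustration and are not taken from the housing dataset. Residuals use the same sign convention as above (predicted minus actual).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data with a curved relationship that a straight line cannot capture
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 300)
y = 2.0 * x + 0.5 * x**2 + rng.normal(0, 1.0, 300)
X = x.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)
model = LinearRegression().fit(X_train, y_train)

predictions = model.predict(X_test)
residuals = predictions - y_test

fig, (ax_scatter, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs predictions: a healthy model shows a shapeless cloud around 0
ax_scatter.scatter(predictions, residuals)
ax_scatter.axhline(0, color="black", linewidth=1)
ax_scatter.set_xlabel("predicted")
ax_scatter.set_ylabel("residual (predicted - actual)")

# Histogram of residuals: look for a roughly bell-shaped distribution
ax_hist.hist(residuals, bins=20)
ax_hist.set_xlabel("residual")
```

Here the scatter plot should show an arch rather than a cloud: the line overestimates in the middle of the range and underestimates at both ends, which signals that the linearity assumption does not hold.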

### Exercises
---

**1. Use a scatter plot to check linearity and equal variance for** `avg_num_rooms` **and** `student_teacher_ratio`. **Which of these seems like a better indicator for price?** 

**2. Plot the residuals against the predictions.**

**Are the errors evenly distributed, or are there systematic failures?** 

**3. Plot the residuals against** `avg_num_rooms`. **Are the errors evenly distributed, or are there patterns of failure?**

**4. Plot a histogram of the residuals. Are they (roughly) normally distributed, or are there a lot of outliers?** 