# Python for Data Science, Level I
### *Session \#10*
---

### Helpful shortcuts
---

**SHIFT** + **ENTER** ----> Execute Cell

**TAB** ----> See autocomplete options

**ESC** then **b** ----> Create Cell 

**ESC** then **dd** ----> Delete Cell

**\[python expression\]?** ----> Show help for that Python expression

**ESC** then **m** then __ENTER__ ----> Switch to Markdown mode

## I. Basic Linear Regression

### Warm Ups

---

**Import libraries and read in dataset:**

```python
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

housing_df = pd.read_csv("housing_prices.csv")
```

**Create a linear model:** `model = LinearRegression()`

**Manually set the y-intercept of the model:** `model.intercept_ = 6000.0`

**Manually set the coefficient(s) of the model:** `model.coef_ = np.array([3000.0])` 

**Make prediction column using your linear model:** `housing_df['predicted_price'] = model.predict(housing_df[['distance_to_city']])`

**Plot your data and model:**
    
```python
axes = housing_df.plot(kind='scatter', x='distance_to_city', y='median_price', color='black')
housing_df.plot(x='distance_to_city', y='predicted_price', ax=axes)
fig = plt.gcf()
```
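
As a minimal sketch of what the manual settings above do, the snippet below uses a small made-up DataFrame (`toy_df` and its values are hypothetical, not taken from `housing_prices.csv`) and compares `model.predict()` with the line `intercept + coefficient * x` computed by hand. Depending on your scikit-learn version, predicting with a model that was never fitted may print a warning about feature names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data standing in for housing_prices.csv
toy_df = pd.DataFrame({'distance_to_city': [1.0, 2.0, 5.0]})

model = LinearRegression()
model.intercept_ = 6000.0          # y-intercept, set by hand
model.coef_ = np.array([3000.0])   # one coefficient per feature

# predict() expects 2-D input, hence the double brackets
toy_df['predicted_price'] = model.predict(toy_df[['distance_to_city']])

# The same numbers computed by hand: intercept + coefficient * x
by_hand = 6000.0 + 3000.0 * toy_df['distance_to_city']
print(toy_df['predicted_price'].tolist())  # [9000.0, 12000.0, 21000.0]
```

Both columns hold the same values, which is all a single-feature linear model does when it predicts.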

### Exercises
---

**1. Create a scatter plot with** `avg_num_rooms` **as the x-axis and** `median_price` **as the y-axis.**

**2. Create a** `LinearRegression` **model and manually set the intercept and coefficient.**

Hint: Try setting the intercept around `-35000` and the coefficient around `6000`.

**3. Assign** `housing_df['predicted_price']` **to the output of** `model.predict()` **called on** `avg_num_rooms`

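Hint: Remember that `model.predict()` expects two-dimensional input such as a DataFrame, not a single column. Double brackets select a one-column DataFrame: `housing_df[['avg_num_rooms']]`.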
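
To see the difference between a column and a one-column DataFrame, here is a small sketch on made-up data (`toy_df` is hypothetical). Only the shapes matter:

```python
import pandas as pd

# Hypothetical toy data; only the shapes matter here
toy_df = pd.DataFrame({'avg_num_rooms': [5.0, 6.0, 7.0]})

column = toy_df['avg_num_rooms']     # single brackets -> Series (1-D)
frame = toy_df[['avg_num_rooms']]    # double brackets -> DataFrame (2-D)

print(type(column).__name__, column.shape)  # Series (3,)
print(type(frame).__name__, frame.shape)    # DataFrame (3, 1)
```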

**4. Create a line plot with** `predicted_price` **as the y-axis and** `avg_num_rooms` **as the x-axis. Make sure your new line plot is overlaid on top of your data points from exercise 1.**

Hint: `.plot` draws a line plot by default. Pass `ax=axes` to draw on your existing plot, then display `fig` to see the updated graph.

**5. Let's figure out which line has "best fit"! Write a function** `squared_error()` **that first takes two columns and finds the difference between each corresponding element. It should then square each difference and return the sum.**

Hint: No looping needed. You can subtract one column from the other directly, thanks to NumPy's element-wise operations.
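
As a quick illustration of element-wise operations, the sketch below uses two made-up columns (`actual` and `predicted` are hypothetical names and values). Wrapping these steps in a function and summing the result is left as the exercise.

```python
import pandas as pd

# Two hypothetical columns of equal length
actual = pd.Series([10.0, 20.0, 30.0])
predicted = pd.Series([12.0, 18.0, 30.0])

difference = actual - predicted   # element-wise: [-2.0, 2.0, 0.0]
squared = difference ** 2         # element-wise: [4.0, 4.0, 0.0]
```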

**6. Use** `squared_error()` **on the columns** `median_price` **and** `predicted_price` **to see how well your line fits the data. This will be a BIG number -- don't worry! Tweak the intercept and coefficient to see how each change improves or degrades the fit of your line.**

**7. Use** `model.fit()` **on** `avg_num_rooms` **and** `median_price` **to automatically optimize your linear model according to squared error. Repeat the steps from exercises 3 and 4 to overwrite** `predicted_price` **using your optimized model and graph the results.**
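
To get a feel for what `model.fit()` does before running it on the real data, here is a minimal sketch on made-up data (`toy_df` is hypothetical). The points lie exactly on a known line, so the fitted intercept and coefficient should come out very close to the values used to build them.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data lying exactly on the line price = -35000 + 6000 * rooms
rooms = np.array([4.0, 5.0, 6.0, 7.0, 8.0])
toy_df = pd.DataFrame({
    'avg_num_rooms': rooms,
    'median_price': -35000.0 + 6000.0 * rooms,
})

model = LinearRegression()
# Features go in as a DataFrame (2-D), the target as a single column (1-D)
model.fit(toy_df[['avg_num_rooms']], toy_df['median_price'])

# fit() picks the intercept and coefficient that minimize squared error
print(model.intercept_, model.coef_)
```

Real data will not sit exactly on a line, so the fitted values there are a best compromise rather than an exact recovery.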

## II. Multilinear Regression

### Warm Ups
---

**Training a model with multiple features:** 

```python
columns = ['avg_num_rooms', 'student_teacher_ratio', 'nitrous']
model.fit(housing_df[columns], housing_df['median_price'])
```

**Prediction with multiple features:** `model.predict(housing_df[columns])`

**Compute mean squared error:** `mean_squared_error(housing_df['median_price'], housing_df['predicted_price'])`

**Compute R2 score:** `r2_score(housing_df['median_price'], housing_df['predicted_price'])`
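
As a rough guide to what these two metrics measure, the sketch below computes both by hand on made-up numbers (`actual` and `predicted` are hypothetical) so they can be compared with the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted values
actual = np.array([10.0, 20.0, 30.0, 40.0])
predicted = np.array([12.0, 18.0, 33.0, 37.0])

# Mean squared error: the average of the squared differences
mse_by_hand = np.mean((actual - predicted) ** 2)

# R2: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum((actual - predicted) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2_by_hand = 1 - ss_res / ss_tot

print(mse_by_hand, mean_squared_error(actual, predicted))
print(r2_by_hand, r2_score(actual, predicted))
```

A lower mean squared error means a closer fit. An R2 score of 1 means the predictions match the data exactly, and 0 means the model does no better than always predicting the mean.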

### Exercises
---

**1. What are the mean squared error and R2 score for your single-variable linear model for** `median_price`?
   

**2. Retrain your model using the three features you think are most important to home price, and overwrite the** `predicted_price` **column using your new model.**

**3. What are the mean squared error and R2 score for your new three-variable linear model for** `median_price`?

**4. Retrain your model using all available column data, and overwrite the** `predicted_price` **column using your new model. What are your mean squared error and R2 score now?**

Hint: Don't forget to remove the `median_price` and `predicted_price` columns! Those are forms of the answer, which you do not want to train on.

**5. Output the coefficients of the linear model using** `.coef_`, **and take their absolute value. The function** `np.argmin()` **will give you the index of the smallest value in the coefficients. Which feature does this correspond to?**
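
Here is a minimal sketch of mapping an `np.argmin()` index back to a column name. The data is made up (`feature_a`, `feature_b`, and `feature_c` are hypothetical), and the target is built from known coefficients so the result is easy to check.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical features and a target built from known coefficients
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    'feature_a': rng.normal(size=50),
    'feature_b': rng.normal(size=50),
    'feature_c': rng.normal(size=50),
})
target = (5.0 * toy_df['feature_a']
          - 0.1 * toy_df['feature_b']
          + 2.0 * toy_df['feature_c'])

columns = ['feature_a', 'feature_b', 'feature_c']
model = LinearRegression().fit(toy_df[columns], target)

# coef_ holds one coefficient per column, in the same order as `columns`
smallest = np.argmin(np.abs(model.coef_))
print(columns[smallest])  # feature_b
```

Keep in mind that the size of a coefficient depends on the units of its feature, so the smallest coefficient is not always the least important feature.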

**6. Drop the corresponding column and retrain your model. How does this affect the R2 score and mean squared error?**

## III. Linear Regression Assumptions

### Warm Ups
---

**Use a scatter plot to check for linearity and equal variance:** 
```python
axes = housing_df.plot(kind='scatter', x='nitrous', y='median_price', color='black')
fig = plt.gcf()
```

**Use a correlation matrix to check for dependence between variables:** `housing_df.corr()`

**Create a histogram of a column (here** `error`, **which you will create in the exercises below):**
```python
axes = housing_df['error'].plot(kind='hist', bins=50)
fig = plt.gcf()
```

### Exercises
---

**1. Check the correlation matrix for** `housing_df` **to find pairs of features whose correlation coefficient is above 0.9 in absolute value.**
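
For a large correlation matrix, scanning by eye gets tedious. One possible way to list highly correlated pairs is sketched below on made-up data (`toy_df` and its columns are hypothetical), where column `b` is nearly a copy of column `a`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'b' is almost a copy of 'a', 'c' is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy_df = pd.DataFrame({
    'a': a,
    'b': a + 0.01 * rng.normal(size=200),
    'c': rng.normal(size=200),
})

corr = toy_df.corr().abs()

# Visit each pair once and skip the diagonal (every column correlates 1.0 with itself)
high_pairs = [
    (first, second)
    for i, first in enumerate(corr.columns)
    for second in corr.columns[i + 1:]
    if corr.loc[first, second] > 0.9
]
print(high_pairs)  # [('a', 'b')]
```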

**2. Use a scatter plot to check linearity and equal variance for** `avg_num_rooms` **and** `student_teacher_ratio`. **Which of these is a better indicator for price?** 

**3. Create a column called** `error` **which is the difference between** `predicted_price` **and** `median_price`.

**4. Create a histogram of the** `error` **column. Is it roughly normally distributed and centered on 0?**
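
Alongside the histogram, a couple of summary numbers can help. The sketch below uses made-up data (`toy_df` is hypothetical: a linear trend plus normally distributed noise) and checks the mean and skewness of the errors.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: a linear trend plus normally distributed noise
rng = np.random.default_rng(0)
rooms = rng.uniform(4, 8, size=500)
toy_df = pd.DataFrame({
    'avg_num_rooms': rooms,
    'median_price': -35000 + 6000 * rooms + rng.normal(scale=2000, size=500),
})

model = LinearRegression().fit(toy_df[['avg_num_rooms']], toy_df['median_price'])
toy_df['predicted_price'] = model.predict(toy_df[['avg_num_rooms']])
toy_df['error'] = toy_df['predicted_price'] - toy_df['median_price']

# For a least-squares fit with an intercept, the training errors average to ~0
print(toy_df['error'].mean())
# Skewness near 0 is one rough sign of a symmetric histogram
print(toy_df['error'].skew())
```

Because the errors on the training data average to about 0 by construction, the shape of the histogram is usually the more informative part of this check.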