# Lab_regression: Are there any relationships? 

Congratulations! In this lab, you are a researcher for the U.S. Department of Education and you are trying to establish whether there is a relationship between Education Expenditure and SAT Scores. As a researcher, you also want to know if a higher pupil/teacher ratio results in better SAT scores. This will help the Dept. of Education decide if the more impoverished students are at a disadvantage while taking the SAT.

For this investigation, you are provided a dataset which was collected between 1994 and 1995. In this dataset, we have a data frame with 50 observations on the following variables:

- state: a factor with names of each state

- expend: expenditure per pupil in average daily attendance in public elementary and secondary schools, 1994-95 (in thousands of US dollars)

- ratio: average pupil/teacher ratio in public elementary and secondary schools, Fall 1994

- total: average total SAT score, 1994-95

So let's start to look at the data and check for correlations!

### Reminder
- Complete all puzzles and submit this lab before Monday evening at 11:59pm.

### Getting Help

Remember, there are a lot of ways to get help if you find yourself stuck:

1. On the course Piazza page: https://piazza.com/class/k5n1f8g2b722s7

2. During the lab time, the lab TA will be available in the STAT 107 Zoom channel:
    - Anku: 2-4pm and 4-6pm on Wednesday
    - Sogol: 10am-12noon and 2-4pm on Friday
    - STAT 107 Zoom Lab Hours Password (all caps): DISCOVERY
    - STAT 107 Zoom Lab Hours Link : https://illinois.zoom.us/j/962372579

3. Office Hours:

    1-on-1 Zoom-based Office Hours: Available every weekday
    - Mondays, 4-6pm
    - Tuesdays, 4-6pm
    - Wednesdays, 9-11am  (“Professor Office Hours” with Karle or Wade)
    - Wednesdays, 4-6pm
    - Thursdays, 4:30pm-6:30pm
    - Fridays, 4-6pm
    
    When you join Zoom, you will be in the “waiting room”.  Whenever you have a question:
    1. Visit https://queue.illinois.edu/q/stat107
    2. Add your question to the queue
    3. As soon as you’re at the top, we’ll bring you into the Zoom channel to work 1:1 with staff

    - STAT 107 Zoom Office Hours Password (all caps): DISCOVERY
    - STAT 107 Zoom Office Hours Link (same as lab hours): https://illinois.zoom.us/j/962372579




Have fun discovering relationships and stay safe! :) 
###### And remember, keeping a cheat sheet for syntax always helps!

## Part 0 | Data Overview

At the beginning of any data analysis, it is important to start with some quick exploratory data analysis (EDA). We can look at the first few rows (head) and last few rows (tail) of the dataset to check that everything is what we would expect. Additionally, we can look at the variable names, formats, and some summary statistics.
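
As a minimal sketch of these EDA calls, the example below builds a small made-up DataFrame and looks at its first rows, last rows, variable names, formats, and summary statistics. The `toy` name and its `hours` and `score` columns are hypothetical and are not part of the lab dataset.

```python
import pandas as pd

# Hypothetical toy data (NOT the lab dataset):
toy = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "score": [52, 55, 61, 64, 70, 73, 79, 85],
})

first_rows = toy.head()      # first five rows by default
last_rows = toy.tail()       # last five rows by default
summary = toy.describe()     # count, mean, std, min, quartiles, and max

print(toy.columns.tolist())  # variable names
print(toy.dtypes)            # column formats (data types)
print(summary)
```

Both `head()` and `tail()` accept a number of rows if you want something other than five, for example `toy.head(3)`.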

### Notebook Setup

We have provided the setup for libraries today, including `pandas` and visualization libraries (`matplotlib` and `seaborn`):
#### In case you don't have seaborn or scikit-learn installed, please install them first by running:
```
conda install seaborn
conda install scikit-learn 
```

In [None]:
# Standard imports:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression

# Visualization options:
#%matplotlib inline

# We do this to ignore several specific pandas warnings:
#import warnings
#warnings.filterwarnings("ignore")

# Use default white plot style:
#sns.set(style="white")

### Read the CSV file

In the following cell, read the `dataa.csv` file and print out **10 random rows** from that file.

In [None]:
df = ...


df.sample(10)

In [None]:
## == TEST CASES for dataset ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any error or output, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
assert(len(df)==50), "Your dataset might not be correct"
## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

### EDA: Top Five Rows

Output the first five rows of your dataset:

In [None]:
head = ...
head


In [None]:
## == TEST CASES for head ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any error or output, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
assert(len(head)==5), "Your head might not be correct"
## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

### EDA: Last Five Rows

Output the last five rows of your dataset.

In [None]:
tail = ...
tail


In [None]:
## == TEST CASES for tail ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any error or output, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
assert(len(tail)==5), "Your tail might not be correct"
## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

### EDA: Numeric Data Overview

Use `df.describe()` to print out summary information about all numeric data columns in your dataset.

In [None]:
description = ...
description


In [None]:
## == TEST CASES for description ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any error or output, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
assert(len(description)==8), "Your description might not be correct"
## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

### Great job!

Now let's move forward with the analysis! As previously stated, in this lab we would like to find out if the two variables (Education Expenditure and pupil/teacher ratio) have a relationship with SAT scores. We already learned some techniques in class to figure this out, right? (Reminder to look at your notes!) But first, let's visualize the data to see if there are any correlations.

## Part 1 | First, let's create some scatterplots!

It is important to visualize your data first to see if there are any correlations between pairs of variables before doing further analysis. A scatter plot shows at a glance whether one variable tends to rise, fall, or stay flat as the other increases.
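
As a rough illustration of a labeled scatter plot, here is a sketch using made-up data. The `toy` DataFrame and its `hours` and `score` columns are hypothetical and are not part of the lab dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data (NOT the lab dataset):
toy = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "score": [52, 55, 61, 64, 70, 73, 79, 85],
})

# One point per row: x-values from one column, y-values from another.
fig, ax = plt.subplots()
ax.scatter(toy["hours"], toy["score"])

# Always label the axes so the reader knows what is plotted.
ax.set_xlabel("hours")
ax.set_ylabel("score")
```

In a notebook, the figure is normally displayed when the cell finishes running. pandas also offers `toy.plot.scatter(x="hours", y="score")`, which labels the axes with the column names for you.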

### Puzzle 1.1a: 

In the following cell, write the Python code to create a scatter plot of the expenditure per pupil, `expend`, against the average total SAT score, `total`. Label the x-axis as `expend` and the y-axis as `total`.

In [None]:
# Create a scatter plot between 'expend' (x-axis) and 'total'(y-axis). 
...



### Puzzle 1.1b: 

The researchers also had a hypothesis that a better pupil/teacher ratio leads to better SAT scores. Go ahead and also create a scatter plot of the average pupil/teacher ratio in public elementary and secondary schools, `ratio`, against the SAT score, `total`. Label the x-axis as `ratio` and the y-axis as `total`.

In [None]:
# Create a scatter plot between 'ratio'(x-axis) and 'total'(y-axis). 
...



### Puzzle 1.2a:  Correlations

Look at both of your scatter plots.  Do you think any of these pairs have **correlations**?

- Is the correlation coefficient positive?
- Is the correlation coefficient negative?
- Is the correlation coefficient zero?

Type your response in the Markdown cell below.

Your response here!

### Puzzle 1.2b: Correlation Matrix

You just took a guess at the correlations. Now, it's time to see the actual values. In the following cell, write the Python code to find the **correlation matrix** of our data and store it in a variable named `cor`. Think about what you can see between the pairs.
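
As a small sketch of what a correlation matrix looks like, the example below uses made-up numeric columns. The `toy` DataFrame, its column names, and the `toy_cor` variable are hypothetical and are not part of the lab dataset.

```python
import pandas as pd

# Hypothetical toy data (NOT the lab dataset):
toy = pd.DataFrame({
    "x":    [1, 2, 3, 4, 5],
    "up":   [2, 4, 6, 8, 10],   # rises in a straight line as x rises
    "down": [10, 8, 6, 4, 2],   # falls in a straight line as x rises
})

# Pairwise correlation coefficients between all columns:
toy_cor = toy.corr()
print(toy_cor)

# Look up a single pair by its two column names:
print(toy_cor["x"]["down"])
```

Every variable has a correlation of 1 with itself, so the diagonal of the matrix is all ones. Depending on your pandas version, you may need to select only the numeric columns before calling `corr()` if your DataFrame also contains text columns.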

In [None]:
## == TEST CASES for Puzzle 1.2b ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any error or output, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.

assert(cor['expend']["expend"] == cor['ratio']["ratio"] == cor['total']["total"]),"Are you sure the correlations are correct?"

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

### Puzzle 1.2c: Reflections

- What does a negative correlation mean?
- Can we say something about **causality** by looking for correlations between the variables?

## Part 2 | Simple Linear Regression

### 2.1a. Linear Regression coefficients (slope and y-intercept)

Checking the descriptive statistics is useful and gives us a basic idea of the data. Sometimes, we want to use the data that we have to make predictions.  We can do this using simple linear regression.

The simple linear equation is:

$Y = (\text{slope}) \cdot X + \text{intercept}$

- X is the independent variable.
- Y is the dependent variable.
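
To make the equation concrete, here is a minimal sketch that fits a line to made-up points lying exactly on $Y = 3X + 2$. The `toy` DataFrame and the `toy_model` name are hypothetical and are not part of the lab dataset.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data (NOT the lab dataset); every point satisfies y = 3x + 2:
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [5, 8, 11, 14, 17],
})

# Create the model, then fit it.
# The independent variable must be 2-D, so we pass a one-column DataFrame (double brackets).
toy_model = LinearRegression()
toy_model.fit(toy[["x"]], toy["y"])

print(toy_model.intercept_)  # the y-intercept, a single number
print(toy_model.coef_)       # the slope(s), an array with one entry per independent variable
```

Because `coef_` is an array, the slope for a single independent variable is its first entry, `toy_model.coef_[0]`.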

### Puzzle 2.1: 
In the following cell, write the Python code to create the simple regression model to predict the average total score from expenditure per pupil in average daily attendance in public elementary and secondary schools.

First, train the `LinearRegression` model with our data:

In [None]:
# Create a linear regression model:
model = ...

# Train ("fit") the model:
...



### Puzzle 2.1b:

Use `model.intercept_` and `model.coef_` to display the y-intercept ($b_0$) and slope ($b_1$) of the linear regression.  (If your `LinearRegression` variable is something other than `model`, you will need to use that instead of `model`.)

In [None]:
# Find the intercept:
intercept = ...
intercept



In [None]:
# Find the slope:
slope = ...
slope



In [None]:
## == TEST CASES for Puzzle 2.1 ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any error or output, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
assert(round(intercept, 2) == 1089.29), "The intercept doesn't seem correct"
assert(round(slope[0], 2) == -20.89), "The coefficient doesn't seem correct"
## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

### 2.2 Residuals

Look at the plot above and think about it for a second. Does this linear regression line predict all of our data perfectly or do we have some error?  The distance that each point is from the regression line is called the residual or prediction error.  Unless we have a perfect correlation, we will have some error.

**Residuals are the differences between the observed value of $y$ and the predicted value of $y$ ($\hat{y}$).**
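
As a sketch of this definition on made-up data, the example below fits a line, stores the predictions, and subtracts them from the observed values. The `toy` DataFrame, the `toy_model` name, and the `y_predicted` and `y_error` columns are hypothetical and are not part of the lab dataset.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data (NOT the lab dataset); the points are close to a line but not exactly on it:
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2.1, 3.9, 6.2, 7.8, 10.1],
})

toy_model = LinearRegression().fit(toy[["x"]], toy["y"])

# Predicted value (y-hat) for every row:
toy["y_predicted"] = toy_model.predict(toy[["x"]])

# Residual = observed value minus predicted value:
toy["y_error"] = toy["y"] - toy["y_predicted"]

print(toy)
print(toy["y_error"].sum())  # very close to zero, up to floating-point rounding
```

For a least-squares line with an intercept, the positive and negative residuals cancel out, so their sum is zero up to rounding.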

Let's find the residuals!

### Puzzle 2.2a: 

In the following cell, write the Python code to store the predicted `total` as `total_predicted`:

In [None]:
df["total_predicted"] = ...



df["total_predicted"]

### Puzzle 2.2b: 

In the following cell, write the Python code to store the error (residual) for the column `total` as `total_error`:

In [None]:
df["total_error"] = ...



df["total_error"]

In [None]:
## == TEST CASES for Puzzle 2.2 ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any error or output, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
assert(round(df["total_error"].sum(), 3) == 0), "The residuals might not be correct. Please check again!"
## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

### Turn In

You're almost done -- congratulations!

You need to do two more things:

1. Save your work.  To do this, create a **notebook checkpoint** by using the menu within the notebook to go **File -> Save and Checkpoint**

2. After you have saved and checkpointed, exit this notebook and return to the Data Science Discovery page for instructions on how to use git to turn this notebook in to the course!