# Lab: Regression Analysis

### Before you start:

* Read the README.md file
* Comment as much as you can and use the resources (README.md file) 

Happy learning!

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

## Challenge 1
I work at a coding bootcamp, and I have developed a theory that the younger my students are, the more often they are late to class. In order to test my hypothesis, I have collected some data in the following table:

| StudentID | Age | Tardies |
|-----------|-----|---------|
| 1         | 17  | 10      |
| 2         | 51  | 1       |
| 3         | 27  | 5       |
| 4         | 21  | 9       |
| 5         | 36  | 4       |
| 6         | 48  | 2       |
| 7         | 19  | 9       |
| 8         | 26  | 6       |
| 9         | 54  | 0       |
| 10        | 30  | 3       |

Use the following command as a template to create a dataframe with the data provided in the table, replacing `x_values` and `y_values` with the values of each column.
~~~~
student_data = pd.DataFrame({'X': [x_values], 'Y': [y_values]})
~~~~

In [None]:
# Creating the data frame
student_data = pd.DataFrame({'Age': [17, 51, 27, 21, 36, 48, 19, 26, 54, 30],
                             'Tardies': [10, 1, 5, 9, 4, 2, 9, 6, 0, 3]})
student_data.index = np.arange(1, len(student_data) + 1)
student_data.index.name = 'StudentID'

student_data

Draw a dispersion diagram (scatter plot) for the data.

In [None]:
x = student_data['Age']
y = student_data['Tardies']

plt.scatter(x, y)
plt.show()

Do you see a trend? Can you make any hypotheses about the relationship between age and number of tardies?

In [None]:
# ANSWER:
# The plot shows a clear downward trend: the number of tardies decreases as age increases.
# This supports the initial hypothesis that younger students are late more often.

Calculate the covariance and correlation of the variables in your plot. What is the difference between these two measures? Compare their values. What do they tell you in this case? Add your responses as comments after your code.

In [None]:
# Covariance between 'Age' and 'Tardies'
print(f'Covariance: {np.cov(x, y)[0, 1]}')

# Correlation between 'Age' and 'Tardies'
print(f'Correlation: {np.corrcoef(x, y)[0, 1]}')

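# Covariance (about -45.6): the negative sign shows an inverse relationship (the younger the student, the more tardies).
# Its magnitude depends on the units of both variables, so it is hard to interpret on its own.
# Correlation (about -0.94): this is the covariance rescaled to the range [-1, 1].
# A value this close to -1.0 indicates a very strong negative linear relationship.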
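
To make the difference between the two measures concrete, here is a minimal sketch that rebuilds the correlation from the covariance by dividing by both sample standard deviations. The variable names (`age`, `tardies`, `corr_manual`) are illustrative and not part of the lab; the numbers are copied from the table above.

```python
import numpy as np

# Data copied from the table in this challenge
age = np.array([17, 51, 27, 21, 36, 48, 19, 26, 54, 30])
tardies = np.array([10, 1, 5, 9, 4, 2, 9, 6, 0, 3])

# Sample covariance (np.cov uses ddof=1 by default); its unit is "years x tardies"
cov_xy = np.cov(age, tardies)[0, 1]

# Dividing by both sample standard deviations removes the units
# and rescales the value into the range [-1, 1]
corr_manual = cov_xy / (age.std(ddof=1) * tardies.std(ddof=1))

print(f'Covariance:           {cov_xy:.3f}')
print(f'Correlation (manual): {corr_manual:.3f}')
print(f'Correlation (numpy):  {np.corrcoef(age, tardies)[0, 1]:.3f}')
```

The manual value should agree with `np.corrcoef`, which shows that correlation carries the same sign information as covariance but on a unit-free scale.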

Build a regression model for this data. What will be your outcome variable? What type of regression are you using? Add your responses as comments after your code.

In [None]:
# Create linear regression model
model = linear_model.LinearRegression()

# Train model with 'Age' and 'Tardies'
result = model.fit(pd.DataFrame(x), y)

# Coefficient of Determination
result.score(pd.DataFrame(x), y)

# ANSWER:
# The outcome variable is 'Tardies'; 'Age' is the only predictor, so this is a simple linear regression.
# The Coefficient of Determination (R^2) is the share of the variance in 'Tardies' that the model explains.
# Our score is ~0.88, so the model fits the data well.
# I chose linear regression because the trend in the plot is well described by a straight line.
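
As a cross-check on the fitted model, the sketch below computes the same line without scikit-learn, using the closed-form least-squares solution for a single predictor. It is only an illustration under that assumption (one predictor, ordinary least squares); the variable names are not part of the lab.

```python
import numpy as np

# Data copied from the table in this challenge
age = np.array([17, 51, 27, 21, 36, 48, 19, 26, 54, 30])
tardies = np.array([10, 1, 5, 9, 4, 2, 9, 6, 0, 3])

# Closed-form least squares for one predictor:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
slope = np.cov(age, tardies)[0, 1] / np.var(age, ddof=1)
intercept = tardies.mean() - slope * age.mean()

# With a single predictor, R^2 equals the squared correlation coefficient
r_squared = np.corrcoef(age, tardies)[0, 1] ** 2

print(f'slope = {slope:.4f}, intercept = {intercept:.4f}, R^2 = {r_squared:.4f}')
```

The slope and intercept should match `result.coef_[0]` and `result.intercept_` from the model above up to floating-point error.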

Plot your regression model on your scatter plot.

In [None]:
# Regression model line
lin_reg_line = result.intercept_ + result.coef_[0] * x

# Regression model plot
plt.plot(x, lin_reg_line, c = 'orange')
plt.scatter(x, y)
plt.show()

Interpret the results of your model. What conclusions can you draw from your model and how confident in these conclusions are you? Can we say that age is a good predictor of tardiness? Add your responses as comments after your code.

In [None]:
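# Root mean squared error (RMSE) on the training data
# mean_squared_error expects the true values first, then the predictions
np.sqrt(mean_squared_error(y, result.predict(pd.DataFrame(x))))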

# ANSWER:
# The Coefficient of Determination is close to 1.0 (0.88) and the RMSE on the training data is about 1.15 tardies.
# On this sample, age is a good predictor of tardiness: older students tend to be late less often.
# With only 10 students and no held-out test data, confidence in this conclusion is limited.
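
To show where the two quality measures come from, this sketch derives both the RMSE and the Coefficient of Determination directly from the residuals. It uses `np.polyfit` instead of scikit-learn purely to stay self-contained; that substitution and the variable names are assumptions of the example.

```python
import numpy as np

# Data copied from the table in this challenge
age = np.array([17, 51, 27, 21, 36, 48, 19, 26, 54, 30])
tardies = np.array([10, 1, 5, 9, 4, 2, 9, 6, 0, 3])

# Least-squares line (degree-1 polynomial); polyfit returns the slope first
slope, intercept = np.polyfit(age, tardies, deg=1)
residuals = tardies - (intercept + slope * age)

# RMSE: square root of the mean squared residual, in units of tardies
rmse = np.sqrt(np.mean(residuals ** 2))

# R^2: 1 minus the ratio of unexplained to total variation around the mean
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((tardies - tardies.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f'RMSE: {rmse:.3f}')
print(f'R^2:  {r_squared:.3f}')
```

Both values are computed on the same 10 points used for fitting, so they describe the fit to this sample rather than performance on new students.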

## Challenge 2
For the second part of this lab, we will use the vehicles.csv data set. You can find a copy of the dataset in the GitHub folder. This dataset includes variables related to vehicle characteristics, including the model, make, and energy efficiency standards, as well as each car's CO2 emissions. As discussed in class, the goal of this exercise is to predict vehicles' CO2 emissions based on several independent variables.

In [None]:
# Import any libraries you may need & the data


Let's use the following variables for our analysis: Year, Cylinders, Fuel Barrels/Year, Combined MPG, and Fuel Cost/Year. We will use 'CO2 Emission Grams/Mile' as our outcome variable. 

Calculate the correlations between each of these variables and the outcome. Which variable do you think will be the most important in determining CO2 emissions? Which provides the least amount of helpful information for determining CO2 emissions? Add your responses as comments after your code.

In [None]:
# Your response here. 
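
One possible way to approach the correlation step is sketched below. The helper `correlations_with_outcome` and the synthetic frame `toy` are hypothetical and exist only for illustration; the numbers are randomly generated and are NOT taken from vehicles.csv. For the lab, you would pass the vehicles dataframe, the five predictor column names listed above, and `'CO2 Emission Grams/Mile'` as the outcome.

```python
import numpy as np
import pandas as pd

def correlations_with_outcome(df, predictors, outcome):
    """Pearson correlation of each predictor with the outcome, strongest first."""
    corr = df[predictors].corrwith(df[outcome])
    # Rank by absolute value so strong negative correlations are not buried
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Synthetic stand-in data (hypothetical columns, not from vehicles.csv)
rng = np.random.default_rng(0)
toy = pd.DataFrame({'strong': np.arange(50, dtype=float),
                    'noise': rng.normal(size=50)})
toy['outcome'] = 3 * toy['strong'] + rng.normal(size=50)

ranked = correlations_with_outcome(toy, ['strong', 'noise'], 'outcome')
print(ranked)
```

Sorting by absolute value matters here because a predictor with a correlation near -1 is just as informative as one near +1.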


Build a regression model for this data. What type of regression are you using? Add your responses as comments after your code.

In [None]:
# Your response here. 
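
With several predictors, the model becomes a multiple linear regression. The sketch below fits one with the same `linear_model.LinearRegression` class used in Challenge 1. The columns `x1`, `x2`, and `y` are hypothetical synthetic data with known coefficients (2, -1, intercept 5), not values from vehicles.csv, so the example only shows the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Synthetic data with a known linear relationship (hypothetical, not vehicles.csv)
rng = np.random.default_rng(1)
toy = pd.DataFrame({'x1': rng.normal(size=200),
                    'x2': rng.normal(size=200)})
toy['y'] = 2.0 * toy['x1'] - 1.0 * toy['x2'] + 5.0 + rng.normal(scale=0.1, size=200)

predictors = ['x1', 'x2']

# Multiple linear regression: one coefficient per predictor plus an intercept
model = linear_model.LinearRegression()
model.fit(toy[predictors], toy['y'])

for name, coef in zip(predictors, model.coef_):
    print(f'{name}: {coef:.3f}')
print(f'intercept: {model.intercept_:.3f}')
print(f'R^2: {model.score(toy[predictors], toy["y"]):.4f}')
```

For the lab, `predictors` would be the five vehicle columns and the target would be `'CO2 Emission Grams/Mile'`; the fitted coefficients are then read the same way.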


Print your regression summary, and interpret the results. What are the most important variables in your model and why? What conclusions can you draw from your model and how confident in these conclusions are you? Add your responses as comments after your code.

In [None]:
# Your response here. 


## Challenge 3: Error Analysis

I am suspicious about the last few parties I have thrown: it seems that the more people I invite, the more people are unable to attend. To find out whether my hunch is supported by data, I have decided to do an analysis. I have collected my data in the table below, where X is the number of people I invited, and Y is the number of people who attended.

|  X |  Y |
|----|----|
| 1  |  1 |
| 3  |  2 |
| 4  |  4 |
| 6  |  4 |
| 8  |  5 |
| 9  |  7 |
| 11 |  8 |
| 14 |  13 |

We want to know whether the relationship between the two variables is linear, and therefore whether it is appropriate to model it with a linear regression.
First, build a dataframe with the data.

In [None]:
# Your code here.


Draw a dispersion diagram (scatter plot) for the data, and fit a regression line.

In [None]:
# Your code here.
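
A minimal sketch of the dataframe and the fitted line is shown below, using the values from the table above. The name `party` and the column `Y_hat` are illustrative choices, and `np.polyfit` stands in for whichever regression tool you prefer. Plotting is left out to keep the example short; `plt.scatter` and `plt.plot` can draw the points and the line as in Challenge 1.

```python
import numpy as np
import pandas as pd

# X = people invited, Y = people who attended (values from the table)
party = pd.DataFrame({'X': [1, 3, 4, 6, 8, 9, 11, 14],
                      'Y': [1, 2, 4, 4, 5, 7, 8, 13]})

# Least-squares line through the points; polyfit returns the slope first
slope, intercept = np.polyfit(party['X'], party['Y'], deg=1)

# Fitted values, useful for drawing the regression line over the scatter plot
party['Y_hat'] = intercept + slope * party['X']

# For a single predictor, R^2 is the squared correlation coefficient
r_squared = np.corrcoef(party['X'], party['Y'])[0, 1] ** 2

print(f'slope = {slope:.4f}, intercept = {intercept:.4f}, R^2 = {r_squared:.4f}')
```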


What do you see? What does this plot tell you about the likely relationship between the variables? Print the results from your regression.

In [None]:
# Your response here. 


Do you see any problematic points, or outliers, in your data? Remove these points and recalculate your regression. Print the new dispersion diagram with your new model and the results of your model. 

In [None]:
# Your response here. 
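
One way to approach this step is sketched below. It assumes that the point with the largest absolute residual from the first fit is the problematic one; that rule is a simplification chosen for illustration, and other outlier criteria are possible. The helper `fit_line` is hypothetical.

```python
import numpy as np

# Values from the table: people invited and people who attended
invited = np.array([1, 3, 4, 6, 8, 9, 11, 14])
attended = np.array([1, 2, 4, 4, 5, 7, 8, 13])

def fit_line(x, y):
    """Return slope, intercept and R^2 of the least-squares line."""
    slope, intercept = np.polyfit(x, y, deg=1)
    r_squared = np.corrcoef(x, y)[0, 1] ** 2
    return slope, intercept, r_squared

slope_all, intercept_all, r2_all = fit_line(invited, attended)

# Assumption: the suspect point is the one farthest from the fitted line
residuals = attended - (intercept_all + slope_all * invited)
worst = int(np.argmax(np.abs(residuals)))

# Refit without that single point
keep = np.arange(len(invited)) != worst
slope_trim, intercept_trim, r2_trim = fit_line(invited[keep], attended[keep])

print(f'Dropped point: X={invited[worst]}, Y={attended[worst]}')
print(f'All points:    slope={slope_all:.3f}, R^2={r2_all:.3f}')
print(f'Without point: slope={slope_trim:.3f}, R^2={r2_trim:.3f}')
```

Comparing the two slopes shows how much a single point can pull the line, which is the comparison the next question asks about.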


What changed? Based on the results of the two models and your graphs, what can you say about the form of the data with the problematic point and without it?

In [None]:
# Your response here. 
