# Multiple Regression

## Biased About the Status Quo Study

Read about the study as described in the activity sheet.

First, import the `pandas` and `sklearn` modules. `sklearn` contains the `linear_model` module, which provides the classes we need to fit linear regression models.

In [1]:
import pandas as pd
from sklearn import linear_model # import the linear regression model library

### Load data

Next, load the data from the .csv file. Check the printed rows against the data file to confirm that the columns and values were read correctly.

In [16]:
# load the data
data = pd.read_csv('./data/tworek2016.csv')
print(data.head())

   excluded  RavensProgressiveMatrix_sum  Inherence_Bias  Should_Score  \
0         0                            5        7.666667      6.333333   
1         1                            5        7.333333      6.666667   
2         0                            4        5.333333      5.666667   
3         0                            2        5.800000      8.000000   
4         0                            2        4.266667      7.000000   

   Good_Score  Ought_Score  Belief_in_Just_World  \
0    6.666667     6.500000                  5.65   
1    8.000000     7.333333                  4.60   
2    2.666667     4.166667                  4.20   
3    7.666667     7.833333                  4.85   
4    3.666667     5.333333                  4.65   

   instructionsâ onthefollowingscreensyouwillbeaskedtofilloutseve  \
0                                                  1                
1                                                  1                
2                                  

### Filter out data

You may notice that this dataset includes an `excluded` column. When this is equal to 1, the participant was excluded from the study for various reasons. There is also a column named `filter_$` that is equal to 1 when the participant was included. We can use either column to remove the excluded participants from the data.

To do this, we can use a simple pandas expression. If we use an expression like 

`data['filter_$'] == 1` 

inside the `[ ]` of the DataFrame variable, then we retrieve only the rows where the value in the `'filter_$'` column is equal to 1. Try this in the code below.

In [17]:
# Keep only the participants who were included in the study,
# using the filter_$ column (1 = included)
filtered_data = data[data['filter_$'] == 1]
print(filtered_data.head())
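
To see what the boolean expression does on its own, here is a minimal sketch on a made-up four-row table. The `toy` DataFrame and its values are invented for illustration and are not the study data; the sketch also assumes, as the text describes, that `excluded` and `filter_$` are complementary.

```python
import pandas as pd

# Toy data, invented for illustration (not the study data)
toy = pd.DataFrame({
    'participant': [1, 2, 3, 4],
    'excluded':    [0, 1, 0, 0],
    'filter_$':    [1, 0, 1, 1],
})

# The comparison produces a boolean Series: one True/False per row
mask = toy['filter_$'] == 1
print(mask.tolist())                  # [True, False, True, True]

# Indexing the DataFrame with the mask keeps only the True rows
kept = toy[mask]
print(kept['participant'].tolist())   # [1, 3, 4]

# Filtering on the complementary column selects the same rows
same = toy[toy['excluded'] == 0]
print(kept.equals(same))              # True
```

Note that the filtered DataFrame keeps the original row labels (0, 2, 3 here), so the index may have gaps after filtering.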

### Simple regression using `sklearn`

We can use the `sklearn` module to easily do linear regression. 

In [19]:
# Sets x to be the Inherence_Bias values (predictor) and y to be the Ought_Score values (outcome).
# x is reshaped using .values.reshape(-1, 1) because sklearn expects a 2D array of features, even when there is only one.
x = filtered_data['Inherence_Bias'].values.reshape(-1, 1)
y = filtered_data['Ought_Score']

# Fits the linear regression model to the data
lm = linear_model.LinearRegression()
model = lm.fit(x, y)
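
The reshape step is easier to follow by looking at array shapes. The sketch below uses a small invented table (the `toy` values are not from the study) to show the one-dimensional array that a single column produces, the two-dimensional array after `reshape(-1, 1)`, and an alternative that selects the column with double brackets.

```python
import pandas as pd

# Toy data, invented for illustration (not the study data)
toy = pd.DataFrame({
    'Inherence_Bias': [5.0, 6.5, 4.0],
    'Ought_Score':    [6.0, 7.0, 5.5],
})

# A single column gives a 1D array: 3 values, no "feature" axis
one_d = toy['Inherence_Bias'].values
print(one_d.shape)      # (3,)

# reshape(-1, 1) turns it into 3 rows by 1 column
two_d = one_d.reshape(-1, 1)
print(two_d.shape)      # (3, 1)

# Double brackets select a one-column DataFrame, which is already 2D
as_frame = toy[['Inherence_Bias']]
print(as_frame.shape)   # (3, 1)
```

Either 2D form should be accepted as `x` by `fit`; the double-bracket form is the same pattern used later for multiple predictors.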

We can access the R<sup>2</sup> of the fitted regression using the `score` method.

In [20]:
# output the R^2 result of the fitting
lm.score(x, y)

0.09269136043436188
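
R<sup>2</sup> is the proportion of variance in `y` that the model accounts for, so a value near 0.09 means roughly 9% of the variance in the outcome is explained by the predictor. As a rough sketch of where the number comes from, the code below computes 1 - SS<sub>res</sub>/SS<sub>tot</sub> by hand and compares it with `score`. The data are synthetic (generated with arbitrary parameters) and the names `x_toy`, `y_toy`, and `toy_lm` are hypothetical.

```python
import numpy as np
from sklearn import linear_model

# Synthetic data, generated for illustration (not the study data)
rng = np.random.default_rng(0)
x_toy = rng.uniform(1, 9, size=50).reshape(-1, 1)
y_toy = 4.0 + 0.3 * x_toy[:, 0] + rng.normal(0, 1, size=50)

toy_lm = linear_model.LinearRegression().fit(x_toy, y_toy)

# Residual sum of squares: what the fitted line fails to explain
pred = toy_lm.predict(x_toy)
ss_res = np.sum((y_toy - pred) ** 2)

# Total sum of squares: variation around the mean of y
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot

# The two values should agree up to floating-point error
print(r2_manual, toy_lm.score(x_toy, y_toy))
```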

...and the coefficient(s).

In [22]:
# output the coefficient (slope of the line in this case) for x
lm.coef_

array([0.29683265])

Along with the intercept!

In [24]:
# output the y-intercept of the line
lm.intercept_

3.9105816212720557
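
Together, the intercept and slope define the fitted line: predicted `Ought_Score` = intercept + slope × `Inherence_Bias`. The sketch below plugs in the values printed above (the slope as displayed, so it is rounded). The helper `predict_ought` and the example score of 5 are hypothetical and only illustrate the arithmetic.

```python
# Values copied from the outputs printed above (slope as displayed, i.e. rounded)
intercept = 3.9105816212720557
slope = 0.29683265

def predict_ought(inherence_bias):
    """Predicted Ought_Score for a given Inherence_Bias score."""
    return intercept + slope * inherence_bias

# Hypothetical participant with an Inherence_Bias score of 5
print(round(predict_ought(5), 3))                      # 5.395

# Each additional point of Inherence_Bias shifts the prediction by the slope
print(round(predict_ought(6) - predict_ought(5), 3))   # 0.297
```

With the fitted model in hand, `lm.predict` performs the same calculation for a whole 2D array of predictor values.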

### Multiple regression using `sklearn`

It is easy to add more predictor variables to the regression model in `sklearn`. We just need to change our DataFrame to include multiple columns instead of just one. `pandas` makes this easy. For example, to include both `Inherence_Bias` and `educ` (Education level), all you would do is:

`x2 = filtered_data[['Inherence_Bias', 'educ']]`

To add more variables, just add commas and the appropriate string label for each column you want to add. Find the appropriate column labels for the Raven’s Progressive Matrices score, conservatism, and belief in a just world, and add those to `x2` to do multiple regression. You may have to open the .csv file and examine the columns!
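
As an alternative to opening the file, column labels can be searched from within the notebook. The sketch below is one possible approach: `find_columns` is a hypothetical helper, and `toy` is a stand-in frame that contains only a few column names already seen in this notebook, not the full dataset.

```python
import pandas as pd

# Stand-in frame holding a few column names that appear in this notebook
toy = pd.DataFrame(columns=[
    'excluded', 'RavensProgressiveMatrix_sum', 'Inherence_Bias',
    'Ought_Score', 'Belief_in_Just_World', 'educ', 'filter_$',
])

def find_columns(df, keyword):
    """Return the column labels that contain keyword, ignoring case."""
    return [c for c in df.columns if keyword.lower() in c.lower()]

print(find_columns(toy, 'raven'))        # ['RavensProgressiveMatrix_sum']
print(find_columns(toy, 'just_world'))   # ['Belief_in_Just_World']
```

On the real data, a call such as `find_columns(data, 'conserv')` might help locate the conservatism column, assuming its label contains that substring.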

In [32]:
# TODO: add more variables to this list!
x2 = filtered_data[['Inherence_Bias', 'educ']]

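# Refitting the same lm object replaces the simple regression fitted earlier
model2 = lm.fit(x2, y)
# output the R^2 result of the multiple regression
lm.score(x2, y)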

0.09673957792975474

In [31]:
# output the coefficients of each predictor variable
lm.coef_

array([ 0.29243454, -0.07907961])

In [29]:
# output the intercept of the fitted multiple regression equation
lm.intercept_

4.660068892910393
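
Two practical points follow from the cells above. First, `lm.coef_` returns the coefficients in the same order as the columns of `x2`, so labelling them makes the output easier to read. Second, calling `fit` again on the same object overwrites the earlier fit, so creating a separate estimator per model keeps both available for comparison. The sketch below illustrates both on synthetic data; the generating parameters, the 1 to 7 range for `educ`, and all variable names are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Synthetic data, generated for illustration (not the study data)
rng = np.random.default_rng(1)
n = 200
toy = pd.DataFrame({
    'Inherence_Bias': rng.uniform(1, 9, n),
    'educ': rng.integers(1, 8, n),   # assumed 1-7 scale, for illustration only
})
toy['Ought_Score'] = (4.5 + 0.3 * toy['Inherence_Bias']
                      - 0.1 * toy['educ'] + rng.normal(0, 1, n))
y_toy = toy['Ought_Score']

# One estimator per model, so the second fit does not overwrite the first
simple = linear_model.LinearRegression().fit(toy[['Inherence_Bias']], y_toy)
multiple = linear_model.LinearRegression().fit(toy[['Inherence_Bias', 'educ']], y_toy)

# Label each coefficient with the predictor it belongs to
coefs = pd.Series(multiple.coef_, index=['Inherence_Bias', 'educ'])
print(coefs)

# Compare R^2 of the two models on the same data
r2_simple = simple.score(toy[['Inherence_Bias']], y_toy)
r2_multiple = multiple.score(toy[['Inherence_Bias', 'educ']], y_toy)
print(r2_simple, r2_multiple)
```

For ordinary least squares, R<sup>2</sup> computed on the data used for fitting cannot decrease when a predictor is added, which is consistent with the small rise from about 0.093 to about 0.097 seen above. A small increase like that does not by itself show that the added predictor is important.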