# ASSOCIATIONS: QUANTITATIVE AND CATEGORICAL VARIABLES

## Introduction

Examining the relationship between variables can give us key insight into our data. In this lesson, we will cover ways of assessing the association between a quantitative variable and a categorical variable.

In the next few exercises, we’ll explore a dataset that contains the following information about students at two Portuguese schools:

- `school`: the school each student attends, Gabriel Pereira ('GP') or Mousinho da Silveira ('MS')
- `address`: the location of the student’s home ('U' for urban and 'R' for rural)
- `absences`: the number of times the student was absent during the school year
- `Mjob`: the student’s mother’s job industry
- `Fjob`: the student’s father’s job industry
- `G3`: the student’s score on a math assessment, ranging from 0 to 20

Suppose we want to know: Is a student’s score (`G3`) associated with their school (`school`)? If so, then knowing what school a student attends gives us information about what their score is likely to be. For example, maybe students at one of the schools consistently score higher than students at the other school.

To start answering this question, it is useful to save the scores from each school in two separate pandas Series:

```python
scores_GP = students.G3[students.school == 'GP']
scores_MS = students.G3[students.school == 'MS']
```
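The boolean-mask selection above can be tried out without the real file. The following is a minimal sketch that uses a tiny made-up DataFrame; the values and the `toy` names are invented for illustration and are not taken from the dataset, although the `school` and `G3` column names match the lesson.

```python
import pandas as pd

# Hypothetical toy data; the lesson itself loads these columns from a CSV file
toy = pd.DataFrame({
    'school': ['GP', 'GP', 'MS', 'GP', 'MS'],
    'G3': [12, 9, 10, 15, 8],
})

# toy.school == 'GP' is a boolean Series; indexing G3 with it keeps matching rows
toy_scores_GP = toy.G3[toy.school == 'GP']
toy_scores_MS = toy.G3[toy.school == 'MS']

print(toy_scores_GP.tolist())  # [12, 9, 15]
print(toy_scores_MS.tolist())  # [10, 8]
```

Each result is a pandas Series that keeps the original row labels, so it can be passed directly to functions such as `np.mean` or `np.median`.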

```python
import numpy as np
import pandas as pd

students = pd.read_csv('../data/students2.csv')

# print the first five rows of students:
print(students.head())

# separate out scores for students who live in urban and rural locations:
scores_urban = students.G3[students.address == 'U']
scores_rural = students.G3[students.address == 'R']
```


## Mean and Median Differences

Recall that in the last exercise, we began investigating whether there is an association between math scores and the school a student attends. We can begin quantifying this association with two common summary statistics: the mean difference and the median difference. To calculate the difference in mean `G3` scores for the two schools, we first find the mean math score for students at each school, then subtract one from the other:

```python
mean_GP = np.mean(scores_GP)
mean_MS = np.mean(scores_MS)
print(mean_GP)  # Output: 10.49
print(mean_MS)  # Output: 9.85
print(mean_GP - mean_MS)  # Output: 0.64
```

We see that the mean math score for students at GP is 10.49, while the mean score for students at MS is 9.85. The mean difference is 0.64. We can follow a similar process to calculate a median difference:

```python
median_GP = np.median(scores_GP)
median_MS = np.median(scores_MS)
print(median_GP)  # Output: 11.0
print(median_MS)  # Output: 10.0
print(median_GP - median_MS)  # Output: 1.0
```

GP students also have a higher median score, by one point. Highly associated variables tend to have a large mean or median difference. Since "large" could have different meanings depending on the variable, we will go into more detail in the next exercise.
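As an alternative to splitting the scores by hand, both statistics can be computed per group in one step. The sketch below assumes a small made-up DataFrame (the `toy` data and the `toy_*` variable names are hypothetical, chosen only for illustration) and uses pandas `groupby` with `agg`; it is one possible approach, not necessarily the one this lesson intends.

```python
import pandas as pd

# Hypothetical toy data, not values from the students dataset
toy = pd.DataFrame({
    'school': ['GP', 'GP', 'MS', 'GP', 'MS', 'MS'],
    'G3': [12, 9, 10, 15, 8, 9],
})

# One row per school, one column per summary statistic
summary = toy.groupby('school')['G3'].agg(['mean', 'median'])
print(summary)

# Differences are taken as GP minus MS, matching the order used above
toy_mean_diff = summary.loc['GP', 'mean'] - summary.loc['MS', 'mean']
toy_median_diff = summary.loc['GP', 'median'] - summary.loc['MS', 'median']

print(toy_mean_diff)    # 3.0
print(toy_median_diff)  # 3.0
```

The sign of each difference depends on the order of subtraction, so it helps to state which group is subtracted from which when reporting the result.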

```python
import numpy as np
import pandas as pd

students = pd.read_csv('students.csv')

scores_urban = students.G3[students.address == 'U']
scores_rural = students.G3[students.address == 'R']

# calculate means for each group:
scores_urban_mean = None
scores_rural_mean = None

# print mean scores:
print('Mean score - students w/ urban address:')
print(scores_urban_mean)
print('Mean score - students w/ rural address:')
print(scores_rural_mean)

# calculate mean difference:
mean_diff = None

# print mean difference:
print('Mean difference:')
print(mean_diff)

# calculate medians for each group:
scores_urban_median = None
scores_rural_median = None

# print median scores:
print('Median score - students w/ urban address:')
print(scores_urban_median)
print('Median score - students w/ rural address:')
print(scores_rural_median)

# calculate median difference:
median_diff = None

# print median difference:
print('Median difference:')
print(median_diff)
```
