# Tutorial 2: Exploring data with Seaborn

Welcome to part 2 of this tutorial series. In this short notebook, we explore the [EQAO](http://www.eqao.com/en/assessments/grade-9-math/Pages/grade-9-math.aspx) data using visualizations with the [Seaborn](http://seaborn.pydata.org/) library. Based on our exploration, we can come up with some questions for further investigation. 

First let's import our tools. 

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
from functools import partial
%matplotlib inline

Now read in the data and check out the preview. 

In [None]:
schools = pd.read_csv('../input/school_data.csv')

In [None]:
schools.head()

In [None]:
attitudes = pd.read_csv('../input/response_data.csv')

In [None]:
attitudes.head()

The schools table contains the name and ID number for each school, along with the school board (and board ID number), followed by the total number of students, the percentage of students performing at each level (level 3 represents the provincial standard, while level 4 exceeds the standard), the number of female and male students, and the number of students answering the attitudinal survey. 

The attitudes table contains the school ID and the percentage of students answering "agree" or "strongly agree" to each of the following statements:  
Q1: I like mathematics  
Q2: I am good at mathematics  
Q3: I am able to answer difficult mathematics questions  
Q4: Mathematics is one of my favourite subjects  
Q5: I understand most of the mathematics I am taught  
Q6: Mathematics is an easy subject  
Q7: I do my best in mathematics class  
Q8: The mathematics I learn now is useful for everyday life  
Q9: The mathematics I learn now helps me do work in other subjects  
Q10: I need to do well in mathematics to study what I want later  
Q11: I need to keep taking mathematics for the kind of job I want after I leave school  

Let's get started and look at a simple [scatter plot](http://seaborn.pydata.org/generated/seaborn.jointplot.html) of 2 variables: 

In [None]:
sns.jointplot(x="Q1(%)", y="Q4(%)", data=attitudes);

In older Seaborn releases the plot also displays the [Pearson's r](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) value, which measures the strength of the linear correlation (newer releases omit this annotation). Not surprisingly, there is a strong positive correlation between the statements "I like mathematics" and "Mathematics is one of my favourite subjects". 
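
The coefficient can also be computed directly, which helps when the plot does not annotate it. The sketch below uses made-up percentages rather than the EQAO data; the `toy` table and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical stand-in for the attitudes table: invented percentages, not EQAO data.
rng = np.random.default_rng(0)
q1 = rng.uniform(20, 80, size=50)
q4 = 0.8 * q1 + rng.normal(0, 5, size=50)
toy = pd.DataFrame({"Q1(%)": q1, "Q4(%)": q4})

# pearsonr returns the correlation coefficient and a two-sided p-value.
r, p_value = pearsonr(toy["Q1(%)"], toy["Q4(%)"])
print(f"r = {r:.2f}, p = {p_value:.3g}")
```

With the real data, the same call on `attitudes["Q1(%)"]` and `attitudes["Q4(%)"]` should give the value associated with the joint plot.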

Next let's look at the distribution of responses for each statement. We could get an idea using a [box plot](http://seaborn.pydata.org/generated/seaborn.boxplot.html):

In [None]:
sns.boxplot(data=attitudes);

Whoa, that doesn't look right! The school ID column is throwing off the graph: it is an identifier, not a percentage, so it doesn't belong on the same axis. Let's replot without the school ID and turn the plot horizontal.

In [None]:
sns.boxplot(data=attitudes.drop(columns=['School ID']), orient="h");

That looks better, but it's still a bit hard to understand the distribution. We can get a slightly different perspective using a [violin plot](http://seaborn.pydata.org/generated/seaborn.violinplot.html). 

In [None]:
plt.figure(figsize=(8,7));
sns.violinplot(data=attitudes.drop(columns=['School ID']), orient="h");

Ok, now we have a feeling for how students are answering the attitudinal survey. Next we'd like to be able to relate that information to school data such as performance. This information is contained in separate tables though, so let's [merge](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.merge.html) them and keep only the data we're interested in for now. 

In [None]:
all_info = schools.merge(attitudes, on="School ID").drop(columns=['School', 'Board ID', 'Num responses'])

In [None]:
all_info.head()
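
To see what the merge does on a small scale, here is a minimal sketch with two invented tables. The names `toy_schools` and `toy_attitudes` and all of their values are hypothetical.

```python
import pandas as pd

# Hypothetical miniature tables; the IDs and values are invented.
toy_schools = pd.DataFrame({
    "School ID": [1, 2, 3],
    "School": ["A", "B", "C"],
    "Level 1 (%)": [5, 12, 8],
})
toy_attitudes = pd.DataFrame({
    "School ID": [1, 2, 4],
    "Q1(%)": [55, 40, 62],
})

# The default inner join keeps only IDs present in both tables (1 and 2 here).
toy_merged = toy_schools.merge(toy_attitudes, on="School ID").drop(columns=["School"])
print(toy_merged)
```

Note that schools missing from either table are silently dropped by the default inner join, so it can be worth comparing row counts before and after merging.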

Let's look at the distributions of student responses again, but this time categorized by school board. To do so, we can use a [swarm plot](http://seaborn.pydata.org/generated/seaborn.swarmplot.html) and add a hue on the board column. 

First we have to use the [melt](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html) function to rearrange our table a bit. The following operation moves the heading values Q1-Q11 to row values under the heading "Question".

In [None]:
all_rearranged = pd.melt(all_info, id_vars=["School ID", "Board","Num students", "Level 1 (%)","Level 2 (%)", 
                                            "Level 3 (%)","Level 4 (%)", "Num F", "Num M" ], var_name="Question")

In [None]:
all_rearranged.head()
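
The same reshaping is easier to follow on a tiny table. This is a sketch with hypothetical data (`toy_wide` is invented), showing how each question column becomes rows under "Question", with the numbers in a column named "value" by default.

```python
import pandas as pd

# Hypothetical wide table with two question columns.
toy_wide = pd.DataFrame({
    "School ID": [1, 2],
    "Board": ["X", "Y"],
    "Q1(%)": [55, 40],
    "Q2(%)": [60, 45],
})

# Columns not listed in id_vars are stacked into (Question, value) pairs.
toy_long = pd.melt(toy_wide, id_vars=["School ID", "Board"], var_name="Question")
print(toy_long)
```

Each school now appears once per question, which is the long format that the `x`, `y`, and `hue` arguments of Seaborn's categorical plots expect.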

In [None]:
plt.figure(figsize=(9,7));
sns.swarmplot(x="Question", y="value", data=all_rearranged, hue="Board", dodge=True);

This type of plot allows us to check how distributions differ between categories. When the hue variable takes only two values, we can use a split violin plot instead, which draws the two distributions as the two halves of each violin. 

For example, let's split schools into "high risk" (those with more than 30% of students at level 1 or 2) and "low risk" (the remaining schools). 

In [None]:
all_rearranged["risk"] = np.where(all_rearranged['Level 1 (%)'] + all_rearranged['Level 2 (%)'] > 30, 'high', 'low')

In [None]:
all_rearranged.head()
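
As a quick check of the labelling rule, here is a sketch on three invented schools (the `toy` table is hypothetical). A school at exactly 30% is labelled "low", because the comparison is strictly greater than 30.

```python
import numpy as np
import pandas as pd

# Hypothetical percentages for three schools.
toy = pd.DataFrame({"Level 1 (%)": [5, 20, 10], "Level 2 (%)": [10, 15, 20]})

# Combined share of students below the provincial standard: 15, 35, 30.
below_standard = toy["Level 1 (%)"] + toy["Level 2 (%)"]
toy["risk"] = np.where(below_standard > 30, "high", "low")
print(toy)
```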

In [None]:
plt.figure(figsize=(9,7));
sns.violinplot(x="Question", y ="value", data=all_rearranged, hue="risk", split = True);

Looking at this plot, we might guess that student responses differ the most between high and low risk schools on statements 2, 3, and 5. Let's test this hypothesis by computing the Pearson's r value between each statement percentage and the percentage of students at level 1. We can then find the strongest negative correlation and display the corresponding plot. 

In [None]:
compare = all_info.drop(columns=['School ID', 'Board', 'Num students', 'Level 2 (%)', 'Level 3 (%)', 'Level 4 (%)', 'Num M', 'Num F'])
calc = partial(pearsonr,compare['Level 1 (%)'])

In [None]:
compare.apply(calc)

The right-hand column displays the result of [pearsonr](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.pearsonr.html) applied to each column against the Level 1 (%) column; each entry is a pair containing the r value and its p-value. We can see that statements 2, 3, and 5 have the strongest negative correlations as we expected, with Q3 being the strongest.
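
If only the r values are needed, pandas can compute them without `partial`. The sketch below uses `DataFrame.corrwith` on invented data (the `toy` table and its columns are assumptions, not the EQAO results) and then picks the most negative coefficient.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Q2 is built to fall as Level 1 rises, Q7 is unrelated noise.
rng = np.random.default_rng(1)
level1 = rng.uniform(0, 30, size=40)
toy = pd.DataFrame({
    "Level 1 (%)": level1,
    "Q2(%)": 70 - level1 + rng.normal(0, 5, size=40),
    "Q7(%)": rng.uniform(60, 90, size=40),
})

# Pearson's r of every question column against the Level 1 column.
r_values = toy.drop(columns=["Level 1 (%)"]).corrwith(toy["Level 1 (%)"])
print(r_values.sort_values())

# Label of the column with the most negative correlation.
most_negative = r_values.idxmin()
print(most_negative)
```

Unlike `pearsonr`, this approach does not report p-values, so it is a convenience for ranking rather than a replacement for the test above.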

In [None]:
sns.jointplot(x="Level 1 (%)", y="Q3(%)", data=all_info);

We have therefore identified that, across schools, agreement with the statement "I am able to answer difficult mathematics questions" is the most negatively correlated with the percentage of students performing at level 1.   

What other types of questions could we ask? Practice exploring the data using visualizations to come up with your own questions!