### Colleges Chosen by Non-Matrics

This notebook explores the new data on where accepted students decide to go instead of Siena.  The bulk of the plots in this notebook were generated using [Altair](https://altair-viz.github.io/).

Import necessary libraries.

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import pickle as pkl
import os
import sys
import altair as alt
alt.renderers.enable('notebook')

from vega_datasets import data

import warnings
warnings.filterwarnings('ignore')

sys.path.insert(0, '../src/visualization/')
import visualize as vis

Load the processed .csv file as a DataFrame.  The college data (originally three DataFrames, concatenated into one) is mapped to `df['College_chosen_by_non-matrics']`.

In [None]:
df = pd.read_csv('../data/processed/CriticalPath_Data_EM_Confidential_lessNoise.csv').drop(columns='Unnamed: 0')

Create a DataFrame that groups students by the college they chose over Siena and by their intended major.

In [None]:
college_by_major = df.groupby(["College_chosen_by_non-matrics",
                                              "Major"]).count().rename(columns={"Unique_student_ID":"# Students"})
college_by_major = college_by_major.reset_index()
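To make the grouping step concrete, here is a minimal sketch on toy data. The column names mirror the ones used in this notebook, but the values are invented for illustration and the variable names `toy` and `counts` are hypothetical.

```python
import pandas as pd

# Toy data: column names follow the notebook, values are made up.
toy = pd.DataFrame({
    "Unique_student_ID": [1, 2, 3, 4],
    "College_chosen_by_non-matrics": ["A", "A", "A", "B"],
    "Major": ["PHYS", "PHYS", "BUSI", "BUSI"],
})

# count() tallies the non-null entries of every remaining column per group,
# so renaming the ID column gives a per-(college, major) student count.
counts = (
    toy.groupby(["College_chosen_by_non-matrics", "Major"])
       .count()
       .rename(columns={"Unique_student_ID": "# Students"})
       .reset_index()
)
print(counts)
```

Note that `count()` skips missing values, so a row with a null `Unique_student_ID` would not contribute to `# Students`.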

Create a barplot showing the breakdown, by major, of students who chose UAlbany over Siena College.

In [None]:
alt.Chart(college_by_major[college_by_major['College_chosen_by_non-matrics']=='SUNY UNIVERSITY AT ALBANY'].iloc[:15]).mark_bar().encode(
    x='# Students:Q',
    y=alt.Y(
        'Major:O',
        sort = alt.EncodingSortField(
                field='# Students',
                op = "sum",
                order = "descending"
        )
    )
).properties(height=200,width=300,title='Applicants who go to UAlbany instead of Siena: Last 3 Years')

All students who were accepted to Siena but chose another college, broken down by major.  This barplot appears to match up well with the applicants-by-major barplot in [01-st-exploratory.ipynb](https://github.com/stibbs1998/admissions_internship/blob/master/notebooks/01-st-exploratory.ipynb).

In [None]:
alt.Chart(college_by_major.groupby("Major").sum().reset_index(
).sort_values("# Students",ascending=False).iloc[:30]).mark_bar().encode(
    x='# Students:Q',
    y=alt.Y(
        'Major:O',
        sort = alt.EncodingSortField(
                field='# Students',  # plain field name; the ':Q' shorthand does not belong in a sort field
                op = "sum",
                order = "descending"
        )
    )
).properties(height=400,width=400,title="Applicants Who Don't Attend Siena by Major (Last 3 Years)")

Create a DataFrame to break down applicants by their `'CollegeCode'`, that is, whether they are applying to the School of Science, Business, or Liberal Arts.

In [None]:
college_by_school = df[~df['College_chosen_by_non-matrics'].isnull()]

college_by_school = college_by_school.groupby(["College_chosen_by_non-matrics",
                                              "CollegeCode"]).count().rename(columns={"Unique_student_ID":"# Students"})
college_by_school = college_by_school.reset_index().rename(columns={"CollegeCode":"School"})
college_by_school['School'] = college_by_school['School'].map({"AD":"School of Liberal Arts","BD":"School of Business","SD":"School of Science"})

Create a barplot of the top thirty colleges by the total students who chose to go there over Siena.  Further break this down by the number who applied to the School of Science, Business, and Liberal Arts.

In [None]:
num_colleges = 30
height = 500
width = 500

top_choices = college_by_school.groupby("College_chosen_by_non-matrics").sum().sort_values("# Students",
                                                        ascending=False).iloc[:num_colleges].index.values

_source = college_by_school.set_index("College_chosen_by_non-matrics").loc[top_choices].reset_index()

def popular_college_by_school(source,title):

    bars = alt.Chart(source).mark_bar().encode(
        x=alt.X('# Students:Q', stack='zero'),
        y=alt.Y('College_chosen_by_non-matrics:O',axis=alt.Axis(title=''),
               sort=alt.EncodingSortField(
                field="# Students",  # The field to use for the sort
                op="sum",  # The operation to run on the field prior to sorting
                order="ascending"  # The order to sort in
            )),
        color=alt.Color('School')
    ).properties(height=height,width=width,title=title)

    text = alt.Chart(source).mark_text(
        dx=-10, dy=3, color='white').encode(
        x=alt.X('# Students:Q', stack='zero'),
        y=alt.Y('College_chosen_by_non-matrics:O',sort=alt.EncodingSortField(
                field="# Students",  # The field to use for the sort
                op="sum",  # The operation to run on the field prior to sorting
                order="ascending"  # The order to sort in
            )),
        detail='School:O',
        text=alt.Text('# Students:Q', format='.0f')
    ).properties(height=height,width=width)

    return bars + text

popular_college_by_school(_source,title='College Breakdown by Department: Last Three Years')
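The top-N selection used in the cell above (total the students per college, sort, keep the first N colleges, then pull their rows back out) can be sketched on toy data. This is only an illustration: the values are invented, and it filters with `isin` rather than `set_index(...).loc[...]`, which keeps the same rows but preserves the original row order instead of the ranking order.

```python
import pandas as pd

# Toy data: column names follow the notebook, values are made up.
toy = pd.DataFrame({
    "College_chosen_by_non-matrics": ["A", "A", "B", "C", "C"],
    "School": ["School of Science", "School of Business", "School of Science",
               "School of Science", "School of Business"],
    "# Students": [5, 2, 1, 4, 4],
})

n = 2  # number of colleges to keep

# Total students per college, summed across schools.
totals = toy.groupby("College_chosen_by_non-matrics")["# Students"].sum()

# Names of the n colleges with the largest totals.
top = totals.sort_values(ascending=False).iloc[:n].index

# Keep every per-school row that belongs to one of those colleges.
subset = toy[toy["College_chosen_by_non-matrics"].isin(top)]
print(subset)
```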

Barplot of where undeclared liberal arts majors go.

In [None]:
alt.Chart(college_by_major[college_by_major['Major']=='UNAR'].groupby(
    "College_chosen_by_non-matrics").sum().reset_index(
).sort_values("# Students",ascending=False).iloc[:30]).mark_bar().encode(
    x='# Students:Q',
    y=alt.Y(
        'College_chosen_by_non-matrics:O',
        sort = alt.EncodingSortField(
                field='# Students',
                op = "sum",
                order = "descending"
        )
    )
).properties(height=400,width=400,title='Colleges Chosen by Undeclared Arts Majors: Last 3 Years').configure_mark(
   opacity=0.5,color='blue')

We can even look at each individual major to find where other students tend to go.

Below we define a function that takes in a major and returns the corresponding barplot.

In [None]:
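def major_breakdown(major,n=20,col='blue'):
    # Filter to the requested major first, then keep the n most popular colleges.
    # (Filtering before sorting avoids indexing a re-sorted frame with an unaligned boolean mask.)
    subset = college_by_major[college_by_major['Major'] == major]
    subset = subset.sort_values("# Students",ascending=False)[:n]
    return alt.Chart(subset).mark_bar(color=col).encode(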
        x='# Students:Q',
        y=alt.Y(
            'College_chosen_by_non-matrics:O',axis=alt.Axis(title=''),
            sort = alt.EncodingSortField(
                    field='# Students',
                    op = "sum",
                    order = "descending"
            )
        )
    ).properties(height=300,width=200,title=f"Where else do {major} Majors go?")

Side-by-side barplots of where both Physics and Business majors decide to attend over Siena College.

In [None]:
major_breakdown('PHYS',n=10,col='green') | major_breakdown('BUSI',n=10,col='gold')

Is there a way to find the average distance from someone's house to the school they go to?

Using the [Haversine Formula](https://en.wikipedia.org/wiki/Haversine_formula), we can calculate the distance from one point to another in miles.  This is done in the source code found [here]

Haversine Formula:  $$ d = 3958.8 \text{ mi} \cdot c $$
$$ c = 2 \cdot \operatorname{atan2}( \sqrt{a}, \sqrt{1-a} ) $$
$$ a = \sin^2 (\Delta \phi /2) + \cos\phi_1 \cdot \cos\phi_2 \cdot \sin^2(\Delta \lambda /2) $$
* $d$ is the distance from A $\to$ B
* $\phi$ is the latitude (North/South)
* $\lambda$ is the longitude (East/West)
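As a rough illustration of the formula above, the sketch below implements it with only the standard library. It is not the source code referenced earlier; the function name `haversine_miles` and the constant name `EARTH_RADIUS_MI` are hypothetical, and inputs are assumed to be in decimal degrees.

```python
import math

EARTH_RADIUS_MI = 3958.8  # Earth radius in miles, as used in the formula above

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)     # difference in latitude
    dlambda = math.radians(lon2 - lon1)  # difference in longitude

    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return EARTH_RADIUS_MI * c

# A quarter of the way around the equator is (pi / 2) * R.
print(haversine_miles(0.0, 0.0, 0.0, 90.0))
```

The two-argument `atan2` form is numerically stable for both very small and near-antipodal separations, which is why it is commonly preferred over `asin(sqrt(a))`.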

What is the median distance to Siena College for students, broken down by admission status?

In [None]:
alt.Chart(df.groupby("Admission_status").median(numeric_only=True).reset_index() ).mark_bar().encode(
    x=alt.X('Dist_to_Siena:Q',axis=alt.Axis(title='Distance to Siena (mi)')),
    y=alt.Y(
        'Admission_status:O', title='Admission Status',
        sort = alt.EncodingSortField(
                field='Dist_to_Siena',
                op = "sum",
                order = "descending",
        )
    ),
    color='Admission_status:O'
).properties(height=300,width=300,title="Median Distance to Siena College").configure_mark() 
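One reason the median is a sensible summary here is that a few very distant applicants can dominate a mean. The toy sketch below (invented values, column names borrowed from this notebook) compares the two per admission status.

```python
import pandas as pd

# Toy data: one far-away applicant among otherwise local students.
toy = pd.DataFrame({
    "Admission_status": ["Applied", "Applied", "Applied", "Enrolled", "Enrolled"],
    "Dist_to_Siena": [10.0, 20.0, 600.0, 5.0, 15.0],
})

grouped = toy.groupby("Admission_status")["Dist_to_Siena"]
medians = grouped.median()  # robust to the 600-mile outlier
means = grouped.mean()      # pulled upward by the outlier

print(pd.DataFrame({"median": medians, "mean": means}))
```

In this toy case the "Applied" median stays at 20 miles while the mean jumps to 210 miles.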

Create a layered, normalized histogram of distance to Siena by admission status.

In [None]:
f, axes = plt.subplots(figsize=(10,6))
mile_limit = 500
bins = 50

sns.distplot(df[(df['Admission_status']=='Applied') & df['Dist_to_Siena'].le(mile_limit)]['Dist_to_Siena'],
             color='skyblue',label='Applied',hist_kws={"alpha":0.5},bins=bins);
sns.distplot(df[(df['Admission_status']=='Accepted') & df['Dist_to_Siena'].le(mile_limit)]['Dist_to_Siena'],
             color='red',label='Accepted',hist_kws={"alpha":0.4}, bins=bins);
sns.distplot(df[(df['Admission_status']=='Enrolled') & df['Dist_to_Siena'].le(mile_limit)]['Dist_to_Siena'],
             color='gold',label='Enrolled',hist_kws={"alpha":0.3}, bins=bins);

plt.legend(loc='best');
plt.ylabel('Kernel Density Estimate')
plt.xlabel("Distance to Siena College (mi)")
plt.title('Distance to Siena by Admission Status');

For the 20 most popular colleges selected by accepted applicants to Siena, how does distance from Siena vs the distance to other colleges affect their popularity?

Using [this](https://altair-viz.github.io/gallery/selection_histogram.html) as the boilerplate for the code, we can select ***any*** range of distances to Siena College and generate the barplot of the schools attended by students within that range.

In [None]:
# Rank colleges by the number of students (count), not by the sum of their IDs.
top_choices = df.groupby("College_chosen_by_non-matrics").count().sort_values("Unique_student_ID",
                                                        ascending=False).iloc[:20].index.values

source = df.set_index("College_chosen_by_non-matrics").loc[top_choices].reset_index()
source = source[(source['Dist_to_Siena']<500)&(source['Dist_to_Ccbnm']<1000)]
source['index'] = source.index
source['Year_of_entry'] = (source['Year_of_entry']-30)/100

brush = alt.selection(type='interval')

points = alt.Chart(source).mark_point().encode(
    y=alt.Y('Dist_to_Ccbnm:Q',axis=alt.Axis(title='Distance to College Attended (mi)')),
    x=alt.X('Dist_to_Siena:Q',axis=alt.Axis(title='Distance to Siena (mi)')),
    color=alt.condition(brush, 'CollegeCode:N', alt.value('lightgray'))
).add_selection(
    brush
).properties(height=800,width=800)

bars = alt.Chart(source).mark_bar().encode(
    y=alt.Y('College_chosen_by_non-matrics:N',sort=alt.EncodingSortField(
            field="College_chosen_by_non-matrics",
            op="count",
            order="descending")
        ),
    color='CollegeCode:N',
    x=alt.X('count(College_chosen_by_non-matrics):Q')
).transform_filter(
    brush
).properties(height=800,width=800)

text = alt.Chart(source).mark_text(
        dx=-10, dy=3, color='white').encode(
        x=alt.X('count(College_chosen_by_non-matrics):Q', stack='zero',title='# Students'),
        y=alt.Y('College_chosen_by_non-matrics:N', axis=alt.Axis(title=''),
               sort=alt.EncodingSortField(
                    field="College_chosen_by_non-matrics",
                    op="count",
                    order="descending")),
        detail='CollegeCode:O',
        text=alt.Text('count(CollegeCode):Q', format='.0f')
).transform_filter(
    brush
).properties(height=800,width=800)

(points | (bars+text)).save('../reports/Dist2Siena_Ccbnm.html')

Now let's look at the distribution of distances another way.  Let's add the ability to click on any college and obtain a histogram detailing the distribution of distances for the students who chose it.

In [None]:
alt.data_transformers.enable('json')

selector = alt.selection_single(empty='all', fields=['College_chosen_by_non-matrics'])

states = alt.topo_feature(data.us_10m.url, feature='states')

source = df.dropna(subset=['ccbnm_for_dist'])

base = alt.Chart(source).properties(
    width=800,
    height=800
).add_selection(selector)


background = alt.Chart(states).mark_geoshape(
    fill='lightgray',
    stroke='white'
).properties(title="Colleges Chosen by Non-Matrics",
    width=800,
    height=800
).project('albersUsa')

points = base.mark_circle(size=20,color='steelblue').encode(
    longitude='ccbnm_long:Q',
    latitude='ccbnm_lat:Q',
    tooltip=['College_chosen_by_non-matrics','ccbnm_lat','ccbnm_long']
).add_selection(
    selector
)

hists = base.mark_bar(opacity=0.5, thickness=100).encode(
    x=alt.X('Dist_to_Ccbnm', axis=alt.Axis(title='Distance to College (mi)'),
            bin=alt.Bin(step=50)),
    y=alt.Y('count()', axis=alt.Axis(title='Number of Students'),
            stack=None)
).transform_filter(
    selector
).properties(width=800,height=800)

((background + points) | hists).save('../reports/College_Map_Histogram.html')