# College Data
#### Stephano Casuso
---

In [19]:
import pandas, seaborn
from matplotlib import pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

In [None]:
df = pandas.read_csv("Project Proposal/college_data.csv")
pandas.set_option('display.max_columns', None)  # show every column when displaying df

## Exploring the Data

**This dataset contains 145 features of over 1500 colleges across the US.**

**Here is some of the specific data:**
![image.png](attachment:image.png)

---
## Map of Colleges

In [None]:
df.plot.scatter(x='Longitude location of institution', 
                y='Latitude location of institution',
                c='purple',  # one fixed color, so no colormap is needed
                alpha=0.3,
                figsize=(17,12))

---
## Distribution of University Costs

In [None]:
expense = (df['Tuition and fees, 2010-11']
           +df['Tuition and fees, 2011-12']
           +df['Tuition and fees, 2012-13']
           +df['Tuition and fees, 2013-14'])
df['expense'] = expense/4

**University expense is the average yearly tuition and fees for in-state students over the four academic years from 2010-11 through 2013-14.**
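
The same average can also be taken as a row-wise mean. Here is a minimal sketch on a made-up two-row frame (the numbers are hypothetical, not from the college data). It shows one difference worth knowing: adding the columns with `+` gives `NaN` for any college missing a single year, while `mean(axis=1)` skips missing years by default.

```python
import pandas as pd

tuition_cols = [
    'Tuition and fees, 2010-11',
    'Tuition and fees, 2011-12',
    'Tuition and fees, 2012-13',
    'Tuition and fees, 2013-14',
]

# Hypothetical toy data; the second college is missing one year.
nan = float('nan')
toy = pd.DataFrame(
    [[10000.0, 11000.0, 12000.0, 13000.0],
     [20000.0, nan, 22000.0, 24000.0]],
    columns=tuition_cols,
)

# Sum-and-divide propagates NaN: one missing year makes the average missing.
strict_mean = sum(toy[c] for c in tuition_cols) / len(tuition_cols)

# Row-wise mean ignores NaN by default (skipna=True).
lenient_mean = toy[tuition_cols].mean(axis=1)

print(pd.DataFrame({'strict': strict_mean, 'lenient': lenient_mean}))
```

Which behavior is better depends on whether a college with an incomplete tuition history should stay in the analysis.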

In [None]:
seaborn.displot(data=df, x='expense', aspect=2, kde=True)

**Notice the two humps in the distribution. The first is where public college tuition is concentrated, while the second belongs to private college tuition.**

---
## Composition of Private Schools vs Public Schools

In [None]:
# Count colleges per sector; value_counts() sorts from most to least common.
df2 = df['Sector of institution'].value_counts()
df2.plot.pie(autopct='%1.1f%%', labels=['Private', 'Public'], startangle=90)

**Type of Colleges in Data:**
+ Private: 971
+ Public: 563
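
As a quick check, the shares follow directly from these two counts:

```python
# Counts taken from the list above.
private, public = 971, 563
total = private + public

private_share = 100 * private / total
public_share = 100 * public / total

print(f"{private_share:.1f}% private, {public_share:.1f}% public")
# prints: 63.3% private, 36.7% public
```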

---
## Are Private Colleges More Expensive Than Public Ones?

In [None]:
plt.figure(figsize=(10,5))
pvp = df[['expense', 'Sector of institution']].groupby(by='Sector of institution').mean()
seaborn.barplot(data=pvp, x=pvp.index, y='expense')

**Private colleges average a yearly tuition of ﹩26,580, about 3.4 times the ﹩7,780 average at public colleges.**
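
The ratio can be worked out from the two averages quoted above:

```python
# Average yearly tuition figures quoted in the text.
private_avg = 26580
public_avg = 7780

ratio = private_avg / public_avg
print(f"Private tuition is {ratio:.2f} times public tuition")
# prints: Private tuition is 3.42 times public tuition
```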

---
## What States Have the Highest Average University Tuition?

In [None]:
states = df[['expense', 'FIPS state code']].groupby(by='FIPS state code').mean()
plt.figure(figsize=(10,10))
seaborn.barplot(data=states, 
                y=states.index, 
                x='expense', 
                order=states.sort_values('expense', ascending = False).index
               )

**Top 5 States with the Most Expensive Tuition:**
1. Massachusetts
2. Connecticut
3. District of Columbia
4. Rhode Island
5. Iowa
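
A ranking like this can also be read off in code rather than from the chart. Below is a minimal sketch using `nlargest` on a made-up frame (the state labels `A`, `B`, `C` and the values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data standing in for the college table.
toy = pd.DataFrame({
    'FIPS state code': ['A', 'A', 'B', 'C', 'C'],
    'expense': [10000, 20000, 30000, 5000, 7000],
})

# Mean expense per state, then keep the highest ones.
state_means = toy.groupby('FIPS state code')['expense'].mean()
top2 = state_means.nlargest(2)
print(top2)
```

On the real data, the same two lines with `nlargest(5)` applied to `df` would list the five most expensive states.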

---
## Can a College's Academic Performance Be Predicted Through its Tuition?

#### Which features can be used to define academic performance?

Instinctively, I chose the colleges' graduation rate to determine the academic performance. If the graduation rate is high, then the students are doing well. If it's low, then the students are having a harder time passing their classes.

After looking at the graph comparing expense to graduation rate, I thought it would be best to add more features to define academic performance. The data contains some of the SAT and ACT scores that students submitted as part of their applications to the colleges, so I'll add those as well.

Relevant features:
+ Graduation rate - Bachelor degree within 4 years, total
+ Graduation rate - Bachelor degree within 5 years, total
+ Graduation rate - Bachelor degree within 6 years, total
+ SAT Critical Reading 75th percentile score
+ SAT Math 75th percentile score
+ SAT Writing 75th percentile score
+ ACT Composite 75th percentile score

In [None]:
print(df['SAT Math 75th percentile score'].isna().value_counts(),
'\n',df['SAT Writing 75th percentile score'].isna().value_counts(),
'\n',df['ACT Composite 75th percentile score'].isna().value_counts(),
'\n',df['SAT Critical Reading 75th percentile score'].isna().value_counts()
     )

One problem: a large number of SAT and ACT scores are missing from the data, because not all students were required to submit their scores.
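
To put a number on how much is missing, the counts and fractions can be collected for several columns at once. This is a minimal sketch on a made-up frame (the values are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame; NaN marks a score that was not reported.
nan = float('nan')
toy = pd.DataFrame({
    'SAT Math 75th percentile score': [600, nan, 710, nan],
    'ACT Composite 75th percentile score': [27, 24, nan, nan],
})

missing_counts = toy.isna().sum()   # number of missing values per column
missing_share = toy.isna().mean()   # fraction of missing values per column

print(pd.DataFrame({'missing': missing_counts, 'share': missing_share}))
```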

In [None]:
# 'totalGradRate' is created in the next cell; if this raises a KeyError, run that cell first
df['totalGradRate'].isna().value_counts()

In [None]:
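# Average the 4-, 5-, and 6-year bachelor's graduation rates.
totalGradRate = (df['Graduation rate - Bachelor degree within 4 years, total']
                 +df['Graduation rate - Bachelor degree within 5 years, total']
                 +df['Graduation rate - Bachelor degree within 6 years, total']
                )/3
# Average the 75th-percentile test scores. The SAT and ACT use different scales,
# and a college missing any one of these scores gets NaN here.
testingPerformance = (df['SAT Critical Reading 75th percentile score']
                      +df['SAT Math 75th percentile score']
                      +df['SAT Writing 75th percentile score']
                      +df['ACT Composite 75th percentile score']
                      )/4
df['totalGradRate'] = totalGradRate
df['testingPerformance'] = testingPerformance

# Combine graduation rate and test scores into a single performance measure.
# Graduation-rate-only alternative: df['academicPerformance'] = totalGradRate
df['academicPerformance'] = (totalGradRate+testingPerformance)/2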
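
One caveat with a plain average is that the inputs sit on different scales: each SAT section runs from 200 to 800, the ACT composite from 1 to 36, and the graduation rate is presumably a percentage (an assumption about this dataset). The SAT sections therefore dominate the result. A possible alternative, not what this notebook does, is to standardize each column into z-scores before averaging. Here is a minimal sketch on made-up values with shortened, hypothetical column names:

```python
import pandas as pd

# Hypothetical toy values; column names are shortened for illustration.
toy = pd.DataFrame({
    'grad_rate': [40.0, 60.0, 80.0],       # assumed to be a percentage
    'sat_math_75': [500.0, 600.0, 700.0],  # 200-800 scale
    'act_75': [20.0, 25.0, 30.0],          # 1-36 scale
})

# Z-score each column: subtract its mean, divide by its standard deviation.
standardized = (toy - toy.mean()) / toy.std()

# Averaging z-scores gives every measure equal weight.
toy['performance_z'] = standardized.mean(axis=1)
print(toy)
```

The result is unitless: 0 means average across the colleges in the table, and positive values mean above average.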

In [None]:
plt.rcParams["figure.figsize"] = [15, 5]
f, axes = plt.subplots(1, 2)
seaborn.regplot(data=df, x='expense', y='academicPerformance', ax=axes[0])
seaborn.kdeplot(data=df, x='expense', y='academicPerformance', ax=axes[1])
plt.show()

**We can see two distinct clusters of data in the plot. What could they be?**

In [None]:
seaborn.relplot(data=df, 
                x='expense', 
                y='academicPerformance', 
                hue='Sector of institution',
                col='Sector of institution',
)

When the data is split by type of institution, the two clusters separate, and the difference between public and private colleges becomes clear.

It seems that academic performance at most private colleges tends to increase as their price goes up, while public colleges' academic performance varies widely within a small range of tuition cost.
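
One way to back up this visual impression is to compute the correlation between expense and performance separately for each sector. This is a minimal sketch on made-up numbers (the values and the short sector labels are hypothetical, chosen only to mimic the pattern in the plots):

```python
import pandas as pd

# Hypothetical toy data: private performance rises with price, public does not.
toy = pd.DataFrame({
    'Sector of institution': ['Private'] * 4 + ['Public'] * 4,
    'expense': [20000, 25000, 30000, 35000, 7000, 7500, 8000, 8500],
    'academicPerformance': [300, 320, 340, 360, 310, 290, 315, 295],
})

# Pearson correlation between expense and performance within each sector.
corr_by_sector = {
    sector: group['expense'].corr(group['academicPerformance'])
    for sector, group in toy.groupby('Sector of institution')
}
print(corr_by_sector)
```

A value near 1 means performance climbs steadily with price, while a value near 0 means price says little about performance.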

## Colleges that Deserve Special Recognition

**First, let's look at the best public colleges money can buy.**

In [None]:
pu = df[ df['Sector of institution'] == 'Public, 4-year or above' ]
pu = pu[['Name', 'FIPS state code', 'expense', 'academicPerformance']]
pu = pu.dropna()

In [None]:
px.scatter(pu, x='expense', y='academicPerformance', hover_data=['Name', 'FIPS state code'], color='expense')

**Now let's look at the best private colleges for your money.**

In [None]:
pr = df[ df['Sector of institution'] == 'Private not-for-profit, 4-year or above' ]
pr = pr[['Name', 'FIPS state code', 'expense', 'academicPerformance']]
pr = pr.dropna()

In [None]:
px.scatter(pr, x='expense', y='academicPerformance', hover_data=['Name', 'FIPS state code'], color='expense')

These two interactive graphs could be a great resource for high school seniors looking for colleges/universities, or even current college undergraduates looking for better options than the current college they're attending.