## Project Proposal (5\%)

### Due (Each Student): October 17

Each **individual student** will submit a project proposal which:

1. (2\%) Describes and motivates a real-world problem where data science may provide helpful insights. This problem must consist of at least two key questions of interest and your description of the problem and questions should be easily understood by a casual reader. Citations to motivating sources are preferred where possible (e.g. news articles, published papers, etc. Do not use Wikipedia itself, but the links Wikipedia articles cite may be useful)

2. (2\%) Characterize the source(s) of your dataset by either:
* Explicitly loading and showing a dataset's contents
* Describing a data source(s) for your project (include links if applicable, or explain how the data will be collected)

In either case, describe the contents, reliability, and issues you expect to encounter in collecting these data. Describe, in brief, how you intend to clean the dataset to prepare it for the analysis.

* **note**: Datasets must be sufficiently technically challenging to collect/clean for full credit to be assigned. It is *not* sufficient to download one Kaggle csv for your data collection.
3. (1\%) Write one or two sentences about how the data will be used to solve the problem and your two questions of interest. At this point of the semester, we haven't studied the machine learning methods yet, but you should have a general idea of what you can do with ML. If you do not, ask a TA or the professor or do a little googling.

# Understanding Freshman Satisfaction and Retention in Higher Education
## Sahana Dhar

College and university life for students is an importance experience for many young adults nowadays. According to the Bureau of Labor Statistics, 61.8% of high school graduates were enrolled in college (2021). A significant issue that many institutions face is their impact on freshman satsifaction and the return rate of first year students. Colleges and unversities must continue adapting to the needs and expectations of the student bodies, in order to mintain enrollment rates. The questions I will be exploring in this projects are as follows:
1) What are the key determinants of freshman satisfaction within colleges, and how do they differ against institutions.
2) How do various institutional factors, such as tuition costs, diversity, academic calendar systems, on-campus housing availability, the number of undergraduates, and whether an institution is public or private, impact the likelihood of freshmen returning for their sophomore year?

These questions will include analyzing aspects of college life, such as the environment, diversity, facilities, and the student body, which ultimately impacts any students enjoyment of their college experience and their decision to continue their education there.


In [1]:
# the following modules will be necessary to complete the quiz
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from datetime import datetime
import pandas as pd
import requests
from bs4 import BeautifulSoup
import json

In [2]:
import time
import pandas as pd
from selenium import webdriver

# Initialize data containers
college_names = []
public_private = []
undergraduates = []
freshman_satisfaction = []
admission_rate = []
in_state_tuition = []
out_of_state_tuition = []

# Set the base URL
base_url = 'https://www.collegedata.com/college-search'

# Initialize a Selenium webdriver with Chrome DevTools Protocol
chrome_options = webdriver.ChromeOptions()
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option("useAutomationExtension", False)
chrome_options.add_argument("--headless")  # Run Chrome in headless mode
driver = webdriver.Chrome(options=chrome_options)

# Open the webpage
driver.get(base_url)

# Keep scrolling down to load all content
scrolls = 100  # Adjust the number of scrolls as needed

for _ in range(scrolls):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # Adjust the sleep time as needed

# Get the page source with all the loaded content
html = driver.page_source

# Create a BeautifulSoup object
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

# Find all the college card containers
college_cards = soup.find_all('div', class_='CollegeCard_container__6iCEj')

# Loop through each college card and extract the data
for card in college_cards:
    college_names.append(card.find('h3', class_='CollegeCard_name__2Qo2g').text)
    public_private.append(card.find('div', class_='CollegeCard_type__3VK-P').text)

    stat_lines = card.find_all('div', class_='StatLine_container__axaVo')
    
    # Check if there are enough stat_lines to extract all the required data
    if len(stat_lines) >= 6:
        undergraduates.append(stat_lines[0].find('div', class_='StatLine_value__1ASq0').text)
        freshman_satisfaction.append(stat_lines[1].find('div', class_='StatLine_value__1ASq0').text)
        admission_rate.append(stat_lines[2].find('div', class_='StatLine_value__1ASq0').text)
        in_state_tuition.append(stat_lines[4].find('div', class_='StatLine_value__1ASq0').text)
        out_of_state_tuition.append(stat_lines[5].find('div', class_='StatLine_value__1ASq0').text)
    else:
        # Handle cases where not all data is available
        undergraduates.append("N/A")
        freshman_satisfaction.append("N/A")
        admission_rate.append("N/A")
        in_state_tuition.append("N/A")
        out_of_state_tuition.append("N/A")

# Create a DataFrame
data = {
    'College Name': college_names,
    'Public/Private': public_private,
    'Undergraduates': undergraduates,
    'Freshman Satisfaction': freshman_satisfaction,
    'Admission Rate': admission_rate,
    'In-State Tuition': in_state_tuition,
    'Out-of-State Tuition': out_of_state_tuition
}

df1 = pd.DataFrame(data)

# Print or save the DataFrame as needed


# Close the driver
driver.quit()

In [34]:
(df1)

Unnamed: 0,College Name,Public/Private,Undergraduates,Freshman Satisfaction,Admission Rate,In-State Tuition,Out-of-State Tuition
0,Aaniiih Nakoda College,Public • Coed,110,0%,Not reported,"$2,410","$2,410"
1,Abilene Christian University,Private • Coed,3189,78.5%,"79% of 9,397 applicants were admitted","$40,500","$40,500"
2,Abraham Baldwin Agricultural College,Public • Coed,3327,67%,"79% of 2,927 applicants were admitted","$3,565","$10,471"
3,Abraham Lincoln University,Private for-profit • Coed,63,0%,76% of 58 applicants were admitted,"$6,440","$6,440"
4,Academy College,Private for-profit • Coed,95,100%,Not reported,Not reported,Not reported
...,...,...,...,...,...,...,...
2393,York University,Private • Coed,457,58%,60% of 485 applicants were admitted,"$21,525","$21,525"
2394,Young Harris College,Private • Coed,923,76%,"64% of 1,485 applicants were admitted","$30,900","$30,900"
2395,Youngstown State University,Public • Coed,4487,75.4%,"78% of 6,718 applicants were admitted","$8,754","$9,114"
2396,Zane State College,Public • Coed,2275,65%,Not reported,"$5,556","$10,866"


In [1]:
import pandas as pd
ls = []
tables = pd.read_html('https://www.4icu.org/us/a-z/')
table1 = tables[0]

# Flatten the MultiIndex to a single-level index
table1.columns = table1.columns.get_level_values(0)

# Now you can access the column names
table1.columns = ['Rank', 'Name', 'Location']
table1

Unnamed: 0,Rank,Name,Location
0,1005,A.T. Still University,Kirksville ...
1,504,Abilene Christian University,Abilene
2,1453,Abraham Baldwin Agricultural College,Tifton
3,495,Academy of Art University,San Francisco
4,1022,Adams State University,Alamosa
...,...,...,...
1745,1708,"York College, City University of New York",Jamaica
1746,1485,York University,York
1747,1465,Young Harris College,Young Harris
1748,487,Youngstown State University,Youngstown


In [8]:
for college in table1['Name']:
    ls.append(college)

In [9]:
ls

['A.T. Still University',
 'Abilene Christian University',
 'Abraham Baldwin Agricultural College',
 'Academy of Art University',
 'Adams State University',
 'Adelphi University',
 'Adler Graduate School',
 'Adler University',
 'Adrian College',
 'AdventHealth University',
 'Agnes Scott College',
 'Air Force Institute of Technology',
 'Alabama A&M University',
 'Alabama State University',
 'Alaska Bible College',
 'Alaska Pacific University',
 'Albany College of Pharmacy and Health Sciences',
 'Albany Law School',
 'Albany Medical College',
 'Albany State University',
 'Albertus Magnus College',
 'Albion College',
 'Albright College',
 'Alcorn State University',
 'Alderson Broaddus University',
 'Alfred State College',
 'Alfred University',
 'Alice Lloyd College',
 'Allegheny College',
 'Allen College',
 'Allen University',
 'Alliance University',
 'Alliant International University',
 'Alma College',
 'Alvernia University',
 'Alverno College',
 'Amberton University',
 'American Baptist

In [10]:
#df1.to_csv('college.csv', index=False)

In [11]:
#ls = ['MIT', 'Harvard', 'Yale University', 'Stanford', 'Northeastern University', 'UCLA', 'UC Berkeley', 'Brown University', 'Dartmouth', 'Brandeis', 'UMass Amherst']
#for x in df1['College Name']:
    #ls.append(x)

In [12]:
"""import requests

api_key = '6aCAk5BkeG4rDXGzFYzA44Wyv08gXcM0uuFPqMAI'

# Base URL for the College Scorecard API
base_url = "http://api.data.gov/ed/collegescorecard/v1/schools"

# Specify the parameters for your API request
params = {
    'api_key': api_key,
    'school.name': 'Harvard'}

# Make the GET request
response = requests.get(base_url, params=params)

# Check if the request was successful
if response.status_code == 200:
    data = response.json()
    # Process and use the data as needed
    print(data)
else:
    print(f"API request failed with status code: {response.status_code}")

        
data"""


'import requests\n\napi_key = \'6aCAk5BkeG4rDXGzFYzA44Wyv08gXcM0uuFPqMAI\'\n\n# Base URL for the College Scorecard API\nbase_url = "http://api.data.gov/ed/collegescorecard/v1/schools"\n\n# Specify the parameters for your API request\nparams = {\n    \'api_key\': api_key,\n    \'school.name\': \'Harvard\'}\n\n# Make the GET request\nresponse = requests.get(base_url, params=params)\n\n# Check if the request was successful\nif response.status_code == 200:\n    data = response.json()\n    # Process and use the data as needed\n    print(data)\nelse:\n    print(f"API request failed with status code: {response.status_code}")\n\n        \ndata'

In [13]:
import requests
import pandas as pd

schools = ls[0:10]
minority_percentages = []
retention_rate = []
women_percentage = []
men_percentage = []
tuition_instate = []
tuition_outofstate = []
sat = []
act = []
school_to_remove = [] 

# Define a function to calculate diversity percentage
def calculate_diversity(data):
    try:
        total_population = data['latest.student.demographics.race_ethnicity.non_resident_alien']
        white_population = data['latest.student.demographics.race_ethnicity.white']

        diversity_percentage = 1 - (white_population / total_population)
    except (KeyError, ZeroDivisionError):
        diversity_percentage = None

    return diversity_percentage

# Modify the calculate_sat_scores function to use the 'average' overall SAT score
def calculate_sat_scores(data):
    try:
        sat_scores = data['admissions']['sat_scores']['average']
        overall_score = sat_scores.get('overall', None)

        return overall_score
    except KeyError:
        return None

# Modify the calculate_act_scores function to use the 'cumulative' ACT score
def calculate_act_scores(data):
    try:
        act_scores = data['admissions']['act_scores']['midpoint']
        cumulative_act = act_scores.get('cumulative', None)

        return cumulative_act
    except KeyError:
        return None

# Loop through the list of schools
for school in schools:
    # Replace 'YOUR_API_KEY' with the API key you obtained after registration
    api_key = '6aCAk5BkeG4rDXGzFYzA44Wyv08gXcM0uuFPqMAI'

    # Base URL for the College Scorecard API
    base_url = "http://api.data.gov/ed/collegescorecard/v1/schools"

    # Specify the parameters for your API request
    params = {
        'api_key': api_key,
        'school.name': school
    }

    # Make the GET request
    response = requests.get(base_url, params=params)

    # Check if the request was successful
    if response.status_code == 200:
        data = response.json()

        # Check if the data contains the expected structure
        if 'results' in data and data['results']:
            school_data = data['results'][0]['latest']

            # Calculate and add diversity percentage
            diversity_percentage = calculate_diversity(school_data)
            minority_percentages.append(diversity_percentage * 100 if diversity_percentage is not None else None)

            # Add retention rate, women percentage, men percentage, tuition, SAT, and ACT scores
            retention_rate.append(school_data['student']['retention_rate']['overall']['full_time'])
            women_percentage_value = school_data['student']['demographics']['women']
            men_percentage_value = school_data['student']['demographics']['men']

            # Check if the values are not None before multiplying by 100
            if women_percentage_value is not None:
                women_percentage.append(women_percentage_value * 100)
            else:
                women_percentage.append(None)

            if men_percentage_value is not None:
                men_percentage.append(men_percentage_value * 100)
            else:
                men_percentage.append(None)
            
            tuition_instate.append(school_data['cost']['tuition']['in_state'])
            tuition_outofstate.append(school_data['cost']['tuition']['out_of_state'])
            sat.append(calculate_sat_scores(school_data))
            act.append(calculate_act_scores(school_data))
        else:
            print(f"Data for {school} not found in the API response.")
            schools.remove(school_to_remove)

    else:
        print(f"API request for {school} failed with status code: {response.status_code}")
        #schools.remove(school_to_remove)


# Create a DataFrame
data = {
    'School': schools,
    'Minority Percentage': minority_percentages,
    'Retention Rate': retention_rate,
    'Women Percentage': women_percentage,
    'Men Percentage': men_percentage,
    'Tuition In-State': tuition_instate,
    'Tuition Out-of-State': tuition_outofstate,
    'SAT Score': sat,
    'ACT Score': act
}

df = pd.DataFrame(data)

# Print the DataFrame
print(df)

API request for A.T. Still University failed with status code: 429
API request for Abilene Christian University failed with status code: 429
API request for Abraham Baldwin Agricultural College failed with status code: 429
API request for Academy of Art University failed with status code: 429
API request for Adams State University failed with status code: 429
API request for Adelphi University failed with status code: 429
API request for Adler Graduate School failed with status code: 429
API request for Adler University failed with status code: 429
API request for Adrian College failed with status code: 429
API request for AdventHealth University failed with status code: 429


ValueError: All arrays must be of the same length

In [None]:
data = {
    'Minority Percentage': minority_percentages,
    'Retention Rate': retention_rate,
    'Women Percentage': women_percentage,
    'Men Percentage': men_percentage,
    'Tuition In-State': tuition_instate,
    'Tuition Out-of-State': tuition_outofstate,
    'SAT Score': sat,
    'ACT Score': act
}

In [None]:
df = pd.DataFrame(data)

# Dataset Description

The dataset for this project is primarily collected from web scraping the website 'https://www.collegedata.com/college-search.' It currently consists of the following columns:
- 'College Name' 
- 'Public/Private' (whether the college is a public or private institution)
- 'Undergraduates' (Number of undergaduates)
- 'Freshman Satisfaction' (percentage of freshman satisfaction)
- 'Admission Rate'
- 'In-State Tuition' (price for students living within the same state)
- 'Out-of-State Tuition' (price for students who lived in a different state)

However, this is simply a subset of the entire Dataframe, as all 2400 colleges have not been added yet. In order to do this, a tool like Selenium would have to be employed in ordeer to access the rest of the unloaded text from the given url.

To expand the target features and address the research questions of interest, I plan to add additional columns by individually web scraping the college pages. These features include:

- Ethnicity percentages of different races (assesses diversity and invlusivity)
- Freshman return rate (to analyze freshman satisfaction and engagement)
- Housing availability (to understand whether the university offers housing on campus)
- Percentage of students in college housing (to understand the residential community on campus)
- Greek life participation (information on the presence of Greek Life on campus)
- Athletic division (will help to understand the importance of sports involvement)
- Academic Calendar System (To understand the academic year breaakdown)
- Average Percent of Need Met (To analyze how much the college provides in terms of financial aid)
- Room and Board Cost (Price for housing on campus)



# Reliability and Data Collection Issues

Web scraping can be difficult in this scenario as information for all colleges may not be present. There might be several rows of missing information or may be inconsistent in its formatting. To prepare the dataset for the most efficient analysis, I will input or remove missing values or rows as necessary, and convert any numerical data into a float to run mathematical operations, such as correlations or regressions. I will be combining the newly scraped data for each college with the dataframe above to properly align colleges with all of their corresponding features, This will help ensure that this dataframe is best fit for modeling and analysis.

# Solving the Problem and Answering Our Questions

The collected data will be used for comprehensive data analysis, allowing me to understand the  elationship between various college factors and the key questions of interest: (1) understanding the determinants of freshman satisfaction, and (2) assessing the impact of institutional characteristics on the likelihood of freshmen returning for another year. By discovering correlations in factors, employing machine learning techniques, my analysis aims to provide colleges and universities with predictive insights that can guide colleges to enlist changes and techniques that will enhance the quality of the freshman experience and improve retention rates. 





