## Project Proposal (5\%)

### Due (Each Student): October 17

Each **individual student** will submit a project proposal which:

1. (2\%) Describes and motivates a real-world problem where data science may provide helpful insights. This problem must consist of at least two key questions of interest, and your description of the problem and questions should be easily understood by a casual reader. Citations to motivating sources are preferred where possible (e.g., news articles, published papers, etc.). Do not cite Wikipedia itself, but the sources that Wikipedia articles cite may be useful.

2. (2\%) Characterizes the source(s) of your dataset by either:
* Explicitly loading and showing a dataset's contents
* Describing a data source(s) for your project (include links if applicable, or explain how the data will be collected)

In either case, describe the contents, reliability, and issues you expect to encounter in collecting these data. Describe, in brief, how you intend to clean the dataset to prepare it for the analysis.

* **note**: Datasets must be sufficiently technically challenging to collect/clean for full credit to be assigned. It is *not* sufficient to download one Kaggle csv for your data collection.
3. (1\%) States, in one or two sentences, how the data will be used to solve the problem and answer your two questions of interest. At this point in the semester, we haven't studied the machine learning methods yet, but you should have a general idea of what you can do with ML. If you do not, ask a TA or the professor, or do a little googling.

# Understanding Freshman Satisfaction and Retention in Higher Education
## Sahana Dhar

College and university life is an important experience for many young adults. According to the Bureau of Labor Statistics, 61.8% of high school graduates were enrolled in college (2021). A significant issue that many institutions face is freshman satisfaction and the return rate of first-year students. Colleges and universities must continue adapting to the needs and expectations of their student bodies in order to maintain enrollment rates. The questions I will explore in this project are as follows:
1) What are the key determinants of freshman satisfaction within colleges, and how do they differ across institutions?
2) How do various institutional factors, such as tuition costs, diversity, academic calendar systems, on-campus housing availability, the number of undergraduates, and whether an institution is public or private, impact the likelihood of freshmen returning for their sophomore year?

These questions involve analyzing aspects of college life, such as the environment, diversity, facilities, and the student body, which ultimately affect students' enjoyment of their college experience and their decision to continue their education there.


In [1]:
# Modules used for data collection in this proposal
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from datetime import datetime
import pandas as pd
import requests
from bs4 import BeautifulSoup
import json

In [2]:
import time
import pandas as pd
from selenium import webdriver

# Initialize data containers
college_names = []
public_private = []
undergraduates = []
freshman_satisfaction = []
admission_rate = []
in_state_tuition = []
out_of_state_tuition = []

# Set the base URL
base_url = 'https://www.collegedata.com/college-search'

# Configure Chrome options and start a headless Selenium webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option("useAutomationExtension", False)
chrome_options.add_argument("--headless")  # Run Chrome in headless mode
driver = webdriver.Chrome(options=chrome_options)

# Open the webpage
driver.get(base_url)

# Keep scrolling down to load all content
scrolls = 100  # Adjust the number of scrolls as needed

for _ in range(scrolls):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # Adjust the sleep time as needed

# Get the page source with all the loaded content
html = driver.page_source

# Create a BeautifulSoup object
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

# Find all the college card containers
college_cards = soup.find_all('div', class_='CollegeCard_container__6iCEj')

# Loop through each college card and extract the data
for card in college_cards:
    college_names.append(card.find('h3', class_='CollegeCard_name__2Qo2g').text)
    public_private.append(card.find('div', class_='CollegeCard_type__3VK-P').text)

    stat_lines = card.find_all('div', class_='StatLine_container__axaVo')
    
    # Check if there are enough stat_lines to extract all the required data
    if len(stat_lines) >= 6:
        undergraduates.append(stat_lines[0].find('div', class_='StatLine_value__1ASq0').text)
        freshman_satisfaction.append(stat_lines[1].find('div', class_='StatLine_value__1ASq0').text)
        admission_rate.append(stat_lines[2].find('div', class_='StatLine_value__1ASq0').text)
        in_state_tuition.append(stat_lines[4].find('div', class_='StatLine_value__1ASq0').text)
        out_of_state_tuition.append(stat_lines[5].find('div', class_='StatLine_value__1ASq0').text)
    else:
        # Handle cases where not all data is available
        undergraduates.append("N/A")
        freshman_satisfaction.append("N/A")
        admission_rate.append("N/A")
        in_state_tuition.append("N/A")
        out_of_state_tuition.append("N/A")

# Create a DataFrame
data = {
    'College Name': college_names,
    'Public/Private': public_private,
    'Undergraduates': undergraduates,
    'Freshman Satisfaction': freshman_satisfaction,
    'Admission Rate': admission_rate,
    'In-State Tuition': in_state_tuition,
    'Out-of-State Tuition': out_of_state_tuition
}

df = pd.DataFrame(data)

# Print or save the DataFrame as needed


# Close the driver
driver.quit()

In [3]:
df

| | College Name | Public/Private | Undergraduates | Freshman Satisfaction | Admission Rate | In-State Tuition | Out-of-State Tuition |
|---|---|---|---|---|---|---|---|
| 0 | Aaniiih Nakoda College | Public • Coed | 110 | 0% | Not reported | \$2,410 | \$2,410 |
| 1 | Abilene Christian University | Private • Coed | 3189 | 78.5% | 79% of 9,397 applicants were admitted | \$40,500 | \$40,500 |
| 2 | Abraham Baldwin Agricultural College | Public • Coed | 3327 | 67% | 79% of 2,927 applicants were admitted | \$3,565 | \$10,471 |
| 3 | Abraham Lincoln University | Private for-profit • Coed | 63 | 0% | 76% of 58 applicants were admitted | \$6,440 | \$6,440 |
| 4 | Academy College | Private for-profit • Coed | 95 | 100% | Not reported | Not reported | Not reported |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2393 | York University | Private • Coed | 457 | 58% | 60% of 485 applicants were admitted | \$21,525 | \$21,525 |
| 2394 | Young Harris College | Private • Coed | 923 | 76% | 64% of 1,485 applicants were admitted | \$30,900 | \$30,900 |
| 2395 | Youngstown State University | Public • Coed | 4487 | 75.4% | 78% of 6,718 applicants were admitted | \$8,754 | \$9,114 |
| 2396 | Zane State College | Public • Coed | 2275 | 65% | Not reported | \$5,556 | \$10,866 |


In [4]:
"""from bs4 import BeautifulSoup
import requests

# Define the range of rows to scrape (e.g., rows 100 to 140)
start_row = 100
end_row = 140

# Iterate through the specified range of rows
for index, row in df.iloc[start_row:end_row].iterrows():
    name = row['College Name']
    url = f'https://www.collegedata.com/college-search/{name}'

    # Send a GET request to the URL
    response = requests.get(url)

    if response.status_code == 200:
        # Parse the HTML content of the page
        soup = BeautifulSoup(response.text, 'html.parser')
        # Extract the relevant information based on the HTML structure
        entrance_difficulty = soup.find('div', text='Entrance Difficulty').find_next('div', class_='TitleValue_value__1JT0d').text
        admission_rate = soup.find('div', text='Overall Admission Rate').find_next('div', class_='TitleValue_value__1JT0d').text
        admission_rate = admission_rate.split(' of ')[0]  # Extract the percentage part
        number_of_applicants = soup.find('div', text='Overall Admission Rate').find_next('div', class_='TitleValue_value__1JT0d').text
        number_of_applicants = number_of_applicants.split(' of ')[1].split(' applicants were admitted')[0]
        early_action_offered = soup.find('div', text='Early Action Offered').find_next('div', class_='TitleValue_value__1JT0d').text
        early_decision_offered = soup.find('div', text='Early Decision Offered').find_next('div', class_='TitleValue_value__1JT0d').text
        gpa = soup.find('div', text='Average GPA').find_next('div', class_='TitleValue_value__1JT0d').text
        sat_math = soup.find('div', text='SAT Math').find_next('div', class_='TitleValue_value__1JT0d').text
        sat_ebrw = soup.find('div', text='SAT EBRW').find_next('div', class_='TitleValue_value__1JT0d').text

        # Update the DataFrame with the extracted data
        df.at[index, 'GPA'] = gpa
        df.at[index, 'Entrance Difficulty'] = entrance_difficulty
        df.at[index, 'Admission Rate'] = admission_rate
        df.at[index, 'Number of Applicants'] = number_of_applicants
        df.at[index, 'Early Action Offered'] = early_action_offered
        df.at[index, 'Early Decision Offered'] = early_decision_offered
        df.at[index, 'SAT Math'] = sat_math
        df.at[index, 'SAT EBRW'] = sat_ebrw
        gpa_element = soup.find('div', text='Average GPA')
        if gpa_element:
            gpa = gpa_element.find_next('div', class_='TitleValue_value__1JT0d').text
            df.at[index, 'GPA'] = gpa
        else:
            df.at[index, 'GPA'] = 'Data Not Found'"""

In [5]:
"""df.iloc[start_row:end_row]"""

'df.iloc[start_row:end_row]'

In [7]:
"""from bs4 import BeautifulSoup
import requests

# Create empty columns in the DataFrame to store the extracted data
df['GPA'] = None
df['Entrance Difficulty'] = None
df['Admission Rate'] = None
df['Number of Applicants'] = None
df['Early Action Offered'] = None
df['Early Decision Offered'] = None
df['SAT Math'] = None
df['SAT EBRW'] = None

# Define the range of rows to scrape (e.g., rows 100 to 140)
start_row = 100  # Change this to the starting row you want
end_row = 140   # Change this to the ending row you want

# Iterate through the specified range of rows
for index, row in df.iloc[start_row:end_row].iterrows():
    name = row['College Name']
    url = f'https://www.collegedata.com/college-search/{name}'

    # Send a GET request to the URL
    response = requests.get(url)

    if response.status_code == 200:
        # Parse the HTML content of the page
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract the relevant information based on the HTML structure
        entrance_difficulty = soup.find('div', text='Entrance Difficulty').find_next('div', class_='TitleValue_value__1JT0d').text
        admission_rate = soup.find('div', text='Overall Admission Rate').find_next('div', class_='TitleValue_value__1JT0d').text
        admission_rate = admission_rate.split(' of ')[0]  # Extract the percentage part
        number_of_applicants = soup.find('div', text='Overall Admission Rate').find_next('div', class_='TitleValue_value__1JT0d').text
        number_of_applicants = number_of_applicants.split(' of ')[1].split(' applicants were admitted')[0]
        early_action_offered = soup.find('div', text='Early Action Offered').find_next('div', class_='TitleValue_value__1JT0d').text
        early_decision_offered = soup.find('div', text='Early Decision Offered').find_next('div', class_='TitleValue_value__1JT0d').text
        gpa = soup.find('div', text='Average GPA').find_next('div', class_='TitleValue_value__1JT0d').text
        sat_math = soup.find('div', text='SAT Math').find_next('div', class_='TitleValue_value__1JT0d').text
        sat_ebrw = soup.find('div', text='SAT EBRW').find_next('div', class_='TitleValue_value__1JT0d').text

        # Update the DataFrame with the extracted data
        df.at[index, 'GPA'] = gpa
        df.at[index, 'Entrance Difficulty'] = entrance_difficulty
        df.at[index, 'Admission Rate'] = admission_rate
        df.at[index, 'Number of Applicants'] = number_of_applicants
        df.at[index, 'Early Action Offered'] = early_action_offered
        df.at[index, 'Early Decision Offered'] = early_decision_offered
        df.at[index, 'SAT Math'] = sat_math
        df.at[index, 'SAT EBRW'] = sat_ebrw"""

In [8]:
"""(df.iloc[100:140])"""

'(df.iloc[100:140])'

# Dataset Description

The dataset for this project is primarily collected by web scraping the website 'https://www.collegedata.com/college-search'. It currently consists of the following columns:
- 'College Name' 
- 'Public/Private' (whether the college is a public or private institution)
- 'Undergraduates' (number of undergraduates)
- 'Freshman Satisfaction' (percentage of freshman satisfaction)
- 'Admission Rate'
- 'In-State Tuition' (price for students living within the same state)
- 'Out-of-State Tuition' (price for students who lived in a different state)

However, this is only a subset of the full DataFrame, as not all of the roughly 2,400 colleges have been added yet. Adding the rest requires a tool such as Selenium to load the content that the page at the given URL does not render initially.

To expand the target features and address the research questions of interest, I plan to add additional columns by individually web scraping the college pages. These features include:

- Ethnicity percentages (to assess diversity and inclusivity)
- Freshman return rate (to analyze freshman satisfaction and engagement)
- Housing availability (to understand whether the university offers housing on campus)
- Percentage of students in college housing (to understand the residential community on campus)
- Greek life participation (information on the presence of Greek Life on campus)
- Athletic division (will help to understand the importance of sports involvement)
- Academic calendar system (to understand the academic year breakdown)
- Average Percent of Need Met (To analyze how much the college provides in terms of financial aid)
- Room and Board Cost (Price for housing on campus)
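
Because not every college page is likely to list every one of these fields, the per-college scraping step needs to tolerate missing labels rather than fail on the first absent one. The following is a minimal sketch of one way to do that, not a finished scraper. It assumes the label/value layout and the `TitleValue_value__1JT0d` class name used in the scraping cells above (both may change on the live site); the helper names `college_url` and `get_field`, the field labels, and the HTML fragment are hypothetical and for illustration only.

```python
from typing import Optional
from urllib.parse import quote

from bs4 import BeautifulSoup

# Class name copied from the scraping cells above; the live site may differ.
VALUE_CLASS = "TitleValue_value__1JT0d"


def college_url(name: str) -> str:
    """Build a per-college URL, percent-encoding spaces and punctuation.

    The path pattern is an assumption carried over from the cells above.
    """
    return "https://www.collegedata.com/college-search/" + quote(name)


def get_field(soup: BeautifulSoup, label: str) -> Optional[str]:
    """Return the value displayed after `label`, or None if it is absent."""
    label_div = soup.find("div", string=label)
    if label_div is None:
        return None
    value_div = label_div.find_next("div", class_=VALUE_CLASS)
    if value_div is None:
        return None
    return value_div.get_text(strip=True)


# Hypothetical HTML fragment standing in for a downloaded college page.
sample_html = """
<div>Freshman Return Rate</div><div class="TitleValue_value__1JT0d"> 78% </div>
<div>Academic Calendar System</div><div class="TitleValue_value__1JT0d">Semester</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
return_rate = get_field(soup, "Freshman Return Rate")  # label present
greek_life = get_field(soup, "Greek Life")             # label absent, so None
```

Returning `None` for an absent label lets the loop over colleges continue and leaves the decision about imputing or dropping the value to the cleaning step.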



# Reliability and Data Collection Issues

Web scraping can be difficult in this scenario because information may not be present for all colleges. Some rows may have missing values, and some fields may be inconsistently formatted. To prepare the dataset for analysis, I will impute or remove missing values or rows as necessary and convert numerical data to floats so that mathematical operations, such as correlations or regressions, can be run. I will combine the newly scraped data for each college with the dataframe above so that each college is aligned with all of its corresponding features. This will help ensure that the dataframe is well suited for modeling and analysis.
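
The conversion step described above could be sketched as follows, using only the value formats visible in the displayed rows (for example `$40,500`, `78.5%`, `79% of 9,397 applicants were admitted`, `Public • Coed`, and `Not reported`). The function names `to_number`, `split_admission`, and `split_type` are hypothetical, and other formats may appear in rows that are not shown.

```python
import re
from typing import Optional, Tuple

# Strings treated as missing; "N/A" is the placeholder written by the scraper.
MISSING = {"", "N/A", "Not reported"}


def to_number(text: str) -> Optional[float]:
    """Convert strings such as '$40,500' or '78.5%' to a float, None if missing."""
    text = text.strip()
    if text in MISSING:
        return None
    return float(re.sub(r"[$,%]", "", text))


def split_admission(text: str) -> Tuple[Optional[float], Optional[float]]:
    """Split '79% of 9,397 applicants were admitted' into (rate, applicants)."""
    match = re.match(r"([\d.]+)% of ([\d,]+) applicants", text.strip())
    if match is None:
        return None, None
    rate = float(match.group(1))
    applicants = float(match.group(2).replace(",", ""))
    return rate, applicants


def split_type(text: str) -> Tuple[str, str]:
    """Split 'Public • Coed' into its institution type and student-body parts."""
    control, _, student_body = text.partition("•")
    return control.strip(), student_body.strip()


tuition = to_number("$40,500")
satisfaction = to_number("78.5%")
rate, applicants = split_admission("79% of 9,397 applicants were admitted")
control, student_body = split_type("Private for-profit • Coed")
```

These functions can be applied column by column, for instance with `df['In-State Tuition'].map(to_number)`. Whether a reported freshman satisfaction of `0%` is a real value or a placeholder for missing data would need to be checked against the source pages before deciding to keep it.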

# Solving the Problem and Answering Our Questions

The collected data will be used for data analysis that relates various college factors to the key questions of interest: (1) understanding the determinants of freshman satisfaction, and (2) assessing the impact of institutional characteristics on the likelihood of freshmen returning for another year. By identifying correlations among these factors and employing machine learning techniques, my analysis aims to provide colleges and universities with predictive insights that can guide changes to enhance the quality of the freshman experience and improve retention rates.
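
As a first step toward the correlation analysis mentioned above, a Pearson correlation between one institutional factor and freshman satisfaction could be computed as in the sketch below. The six data points are the in-state tuition and satisfaction values from the displayed rows where both are reported and satisfaction is nonzero, so the result is illustrative only and says nothing about the full dataset. The function name `pearson` is hypothetical.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Values copied from the rows displayed earlier (illustrative subset only).
in_state_tuition = [40500.0, 3565.0, 21525.0, 30900.0, 8754.0, 5556.0]
freshman_satisfaction = [78.5, 67.0, 58.0, 76.0, 75.4, 65.0]

r = pearson(in_state_tuition, freshman_satisfaction)
print(f"Pearson r (illustrative subset): {r:.3f}")
```

With the cleaned DataFrame, the same quantity is available for all numeric columns at once through `df.corr()`. A correlation on its own does not establish that a factor causes higher satisfaction or retention, so it is best read as a screening step before modeling.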





