# INFO2950 Final Project

## Background Information

The pursuit of a college education has consistently been tied to graduates' financial success compared to non-graduates: on average, graduates are half as likely to be unemployed and earn 84% more than those whose highest education is a high school diploma. Every year, high school graduates must decide whether the cost of college -- both tuition _and_ being out of the workforce for 4 years -- is worth the financial benefit. On top of the decision to attend, they must choose a major, a decision that could determine the degree of their financial success and the magnitude of the benefits they reap from attending college.

However, this relationship is highly generalized. Not every degree concentration yields the same average financial success as others. For example, a degree in computer science or another form of engineering leads to a job market that has, on average, higher salaries than a degree in English.

A major's typical salary matters to many college graduates because they come to college for upward mobility. College is seen by many as a pathway to higher financial and social standing as well as a higher quality of life. This extends beyond post-graduation salary to the mean income of the place where graduates live: quality of life is largely affected by the wealth of those around you, which shapes the opportunities available, public school funding, health, and much more. Perhaps the mean income of postgraduates' region of residence indicates the degree of their major's impact on their financial success and quality of life.

Thus, we are interested in exploring the relationship between the college major of a college postgraduate and the wealth of their county of residence. We decided we were most interested in exploring this relationship within New York State, where there is a _dramatic_ wealth gap between its most affluent regions (for example, New York City) and its poorest regions; in fact, New York was recognized as the state with the largest gap between the rich and the poor, with a Gini coefficient of 0.52 in 2022.


## Research Question

Question: Can we reliably predict a New York county's wealth based on its percentage of postgraduates of a certain age and major?

At the conclusion of this project, we seek to determine whether the per capita income of postgraduates' county of residence in New York is correlated with (1) college major and (2) age. We will train a multivariable regression on data tables from both Wikipedia and the U.S. Census Bureau to see if we can reliably predict a county's per capita income given the population percentages of Science and Engineering, Business, Education, and Arts, Humanities, and Others for each of these age groups: 25-39 years, 40-64 years, and 65 years and over. Essentially, we aim to determine if college major and age affect the average wealth of where people live.

We defined county wealth as the county's per capita income, and we considered only New York counties for our regression. We defined postgraduates as people aged 25 and over who hold a Bachelor's degree or higher.
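
To make "multivariable regression" concrete, here is a minimal sketch using ordinary least squares in NumPy. It is not our fitted model: the data is synthetic, and the two predictors, the coefficients, and names such as `major_pct` are assumptions chosen only for illustration.

```python
import numpy as np

# Synthetic, made-up data purely for illustration
# (NOT taken from the Census Bureau or Wikipedia tables).
rng = np.random.default_rng(0)
n_counties = 40

# Two hypothetical predictors: percentage of graduates in two major groups.
major_pct = rng.uniform(10, 40, size=(n_counties, 2))
true_intercept = 20000.0
true_coefs = np.array([500.0, -200.0])
income = (true_intercept + major_pct @ true_coefs
          + rng.normal(0, 100, n_counties))

# Prepend a column of ones so the model can fit an intercept.
design = np.column_stack([np.ones(n_counties), major_pct])
fitted, *_ = np.linalg.lstsq(design, income, rcond=None)
intercept, coefs = fitted[0], fitted[1:]

# R^2: the share of variance in income explained by the fitted model.
predicted = design @ fitted
ss_res = np.sum((income - predicted) ** 2)
ss_tot = np.sum((income - income.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print("intercept:", round(float(intercept), 1))
print("coefficients:", np.round(coefs, 1))
print("R^2:", round(float(r_squared), 4))
```

Each coefficient estimates how much per capita income changes when that predictor rises by one percentage point while the other predictors are held fixed.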

## Original Data Description

We have two data tables, one from Wikipedia and one from the U.S. Census Bureau.

Our data from Wikipedia:
https://en.wikipedia.org/wiki/List_of_New_York_locations_by_per_capita_income

- Each row (observation) is a New York State county
- There are two columns (attributes): the county name and the county's per capita income. Before we copy-pasted it, the original table on the website had additional columns for median household income, median family income, population, and number of households. We omitted these columns because they did not serve a purpose in our regression
- This table was compiled on Wikipedia to inform the public of the per capita income of each county in New York State
- The data is originally from the 2010 United States Census Data and the 2006-2010 American Community Survey 5-Year Estimates. The data from the Census is created, funded, and collected by the United States government for the purposes of better understanding the demographics of the country and informing policy
- Since this was a survey, the process of a voluntary survey may have influenced which data was actually gathered, since some may have declined to share their data or others may not have been within reach of or had access to the survey
- Those who participated were aware their data was going to be used for analyzing demographics
- We collected this data through copying and pasting the data into an Excel sheet and downloading it as a .csv file

Our data from the U.S. Census Bureau: https://data.census.gov/table/ACSST1Y2022.S1502?g=040XX00US36,36 using New York State (as a whole) and all New York Counties as filters for their data set, created in 2022.

- Each row (observation) represents a group of surveyed people, categorized by age range and college major.
- The columns (attributes) are described below:
    - The first column houses the labels of groups of people by age and college major; the following columns pre- data cleaning have each county and the general state's estimate of people falling into each category, the margin of error, percent estimate, percent error, male estimate, male margin of error, male percent estimate, male percent margin of error, female estimate, female margin of error, female percent estimate, and female percent margin of error
- This dataset was created to gather information about the residents of the United States to better understand the demographics of the country and create more informed policy
- This data was created, funded, and collected by the government of the United States through the U.S. Census Bureau. The table ID in the URL above (ACSST1Y2022) indicates American Community Survey 1-year estimates for 2022, an annual survey, rather than the once-a-decade Census count
- Those who participated in the survey were aware that their data was being used to better understand demographics
- Since this was a survey, the process of a voluntary survey may have influenced which data was actually gathered, since some may have declined to share their data or been without access to complete the survey
- There are 5 rows and 469 columns, full of numerical data. No information is missing and the data is self-contained. The data identifies subpopulations by age and gender. Age is used to further divide the groups of majors and gender is used to divide each county's data
- This data was accessed using filters to narrow down the dataset to the variables: age, college major, and county

## Data Cleaning and Collection

### Importing packages

In [1]:
import requests
from bs4 import BeautifulSoup

import pandas as pd
import numpy as np
import time
import seaborn as sns
import matplotlib.pyplot as plt
import duckdb  # third-party; if this raises ModuleNotFoundError, install it first (e.g. `pip install duckdb`)

### Data cleaning: selecting and extracting the relevant columns of U.S. Census Bureau data table and renaming columns

Read the .csv file:

In [None]:
census_majors = pd.read_csv("nys_counties_college.csv")
census_majors.columns = census_majors.columns.str.replace(
    "    ", "")

Below, we see that there are many columns that are not relevant to our research question. We are only interested in each county's percent estimate column.

In [None]:
print(census_majors.head())

Below, we replace the !!s and the spaces with underscores, then select the relevant columns, which hold each county's total percentage (excluding the columns for margin of error, total estimate, NYS as a whole, and the Male/Female-specific columns). More than half of the columns had to be dropped, so after trying to use .drop(), a TA helped us realize we should just extract the columns of interest and put them into a new dataframe, which we called census_majors_cleaned. We then remove the suffix ,_New_York_Percent_Estimate from each column so that each column is named [name] County, consistent with the County column of the Wikipedia dataframe.

In [None]:
census_majors.columns = census_majors.columns.str.replace(
    "!!", "_")
census_majors.columns = census_majors.columns.str.replace(
    " ", "_")
#Checking that the column names are correct
display(census_majors)

In [None]:
census_majors_cleaned = census_majors[[
'Label_(Grouping)', 'Albany_County,_New_York_Percent_Estimate',
'Bronx_County,_New_York_Percent_Estimate',
'Broome_County,_New_York_Percent_Estimate',
'Cattaraugus_County,_New_York_Percent_Estimate',
'Cayuga_County,_New_York_Percent_Estimate',
'Chautauqua_County,_New_York_Percent_Estimate',
'Chemung_County,_New_York_Percent_Estimate',
'Clinton_County,_New_York_Percent_Estimate',
'Dutchess_County,_New_York_Percent_Estimate',
'Erie_County,_New_York_Percent_Estimate',
'Jefferson_County,_New_York_Percent_Estimate',
'Kings_County,_New_York_Percent_Estimate',
'Madison_County,_New_York_Percent_Estimate',
'Monroe_County,_New_York_Percent_Estimate',
'Nassau_County,_New_York_Percent_Estimate',
'New_York_County,_New_York_Percent_Estimate',
'Niagara_County,_New_York_Percent_Estimate',
'Oneida_County,_New_York_Percent_Estimate',
'Onondaga_County,_New_York_Percent_Estimate',
'Ontario_County,_New_York_Percent_Estimate',
'Orange_County,_New_York_Percent_Estimate',
'Oswego_County,_New_York_Percent_Estimate',
'Putnam_County,_New_York_Percent_Estimate',
'Queens_County,_New_York_Percent_Estimate',
'Rensselaer_County,_New_York_Percent_Estimate',
'Richmond_County,_New_York_Percent_Estimate',
'Rockland_County,_New_York_Percent_Estimate',
'St._Lawrence_County,_New_York_Percent_Estimate',
'Saratoga_County,_New_York_Percent_Estimate',
'Schenectady_County,_New_York_Percent_Estimate',
'Steuben_County,_New_York_Percent_Estimate',
'Suffolk_County,_New_York_Percent_Estimate',
'Sullivan_County,_New_York_Percent_Estimate',
'Tompkins_County,_New_York_Percent_Estimate',
'Ulster_County,_New_York_Percent_Estimate',
'Warren_County,_New_York_Percent_Estimate',
'Wayne_County,_New_York_Percent_Estimate',
'Westchester_County,_New_York_Percent_Estimate',
]].copy()  # independent copy, so the in-place edits below do not trigger SettingWithCopyWarning
#Confidence check: does the dataframe have the columns we want?
display(census_majors_cleaned)
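
As an alternative to typing every column name, the same selection could be sketched by filtering on the shared suffix. The toy table below is made up, and the statewide and Male column names in it are assumptions about how the export labels those columns; only the `..._County,_New_York_Percent_Estimate` pattern comes from the list above.

```python
import pandas as pd

# Toy stand-in for census_majors after the column renaming; values are made up.
toy = pd.DataFrame({
    "Label_(Grouping)": ["Science and Engineering", "Business"],
    # Hypothetical statewide column (no county prefix, so no leading comma).
    "New_York_Percent_Estimate": ["40.0%", "20.0%"],
    "Albany_County,_New_York_Total_Estimate": ["1,000", "500"],
    "Albany_County,_New_York_Percent_Estimate": ["41.0%", "19.0%"],
    "Bronx_County,_New_York_Percent_Estimate": ["30.0%", "25.0%"],
    # Hypothetical sex-specific column name.
    "Bronx_County,_New_York_Percent_Male_Estimate": ["31.0%", "26.0%"],
})

# Keep only the columns that end with the county-level percent suffix.
suffix = ",_New_York_Percent_Estimate"
county_percent_columns = [c for c in toy.columns if c.endswith(suffix)]

# .copy() makes the selection an independent dataframe.
selected = toy[["Label_(Grouping)"] + county_percent_columns].copy()
print(selected.columns.tolist())
```

Before relying on a filter like this, it would be worth checking the real column names, since a suffix that also matches unwanted columns would silently include them.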

In [None]:
#Getting rid of added title (we just want the county name + "County")
census_majors_cleaned.columns = census_majors_cleaned.columns.str.replace(
    ",_New_York_Percent_Estimate", "")
#Changing the _s to a space for more familiar names 
#(underscore is less familiar to us)
census_majors_cleaned.columns = census_majors_cleaned.columns.str.replace(
    "_", " ")
#Setting the index to be the majors so when we transpose the dataframe,
#the columns will be named with the majors
census_majors_cleaned.set_index('Label (Grouping)', inplace=True)
#Checking to make sure the dataframe has the index we want
display(census_majors_cleaned)

We want to merge this data with the Wikipedia data, so we will make each county a row, because a taller dataframe is better for regressions. We use .transpose() to switch the columns and rows of the cleaned census data. For the majors to become the column names after transposing, the "Label (Grouping)" column must first be the index, which is why we called set_index() in the cell above before transposing.

In [None]:
census_majors_cleaned = census_majors_cleaned.transpose()
display(census_majors_cleaned)

Now we reset the index so that the county names become an ordinary column rather than the index, because we want to use County as the key column when merging dataframes.

In [None]:
census_majors_cleaned.reset_index(inplace=True)
census_majors_cleaned.rename(
    columns = {'index':'County'}, inplace = True)
display(census_majors_cleaned)
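
The reshaping steps above (set the index, transpose, reset the index) can be sketched on a tiny made-up table. The values and the two-county layout are assumptions for illustration only.

```python
import pandas as pd

# Made-up miniature of the wide census table: one row per major, one column per county.
wide = pd.DataFrame({
    "Label (Grouping)": ["Science and Engineering", "Business"],
    "Albany County": ["41.0%", "19.0%"],
    "Bronx County": ["30.0%", "25.0%"],
})

# 1. Make the major labels the index so they become column names after transposing.
# 2. Transpose: counties become rows, majors become columns.
tall = wide.set_index("Label (Grouping)").transpose()

# 3. Move the county names out of the index into an ordinary "County" column.
tall = tall.reset_index().rename(columns={"index": "County"})
tall.columns.name = None  # drop the leftover "Label (Grouping)" axis name

print(tall)
```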

Now we are ready to split this data into other subgroups and merge that data with the Wikipedia data!
Note: we know we have NaNs in our dataframe, but we will use this dataframe solely to extract columns from, so we "clean" the NaNs out by not using those columns (the all-NaN columns only serve to mark which columns belong to which age group)

### Data collection: web scraping from Wikipedia's "List of New York locations by per capita income":

We attempted to scrape the Wikipedia page, but we were unsuccessful: the `<tr>` tags that contained the data we were trying to scrape did not have a class associated with them, and the TAs we consulted recommended manually pasting the data into an Excel spreadsheet and downloading it as a .csv file instead. Below, you will find our initial web scraping process, from saving the URL and opening the HTML file up to the point where we realized we had no clear way to scrape without a class on the `<tr>` tags.

In [None]:
wikepedia_url = "https://en.wikipedia.org/wiki/List_of_New_York_locations_by_per_capita_income"

In [None]:
wikepedia_result = requests.get(wikepedia_url)

In [None]:
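# Use an explicit encoding so non-ASCII characters on the page
# are written and read back the same way on any platform
with open("county_wealth.html", "w", encoding="utf-8") as writer:
    writer.write(wikepedia_result.text)

In [None]:
with open("county_wealth.html", "r", encoding="utf-8") as reader:
    html_source = reader.read()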

In [None]:
# Confidence check
html_source[:20]

In [None]:
page = BeautifulSoup(html_source, "html.parser")

Below, you will find that there are no classes in the `<tr>` tags that we scraped.

In [None]:
wikepedia_county_income = page.find("table", 
                                    {"class": "wikitable"})
wikepedia_county_income = wikepedia_county_income.find_next(
    'tbody').find_all('tr')
print(wikepedia_county_income)
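
For reference, rows without a class can still be read by walking every `<tr>` inside the table and collecting the text of its cells. The sketch below runs on a tiny made-up HTML fragment, not the real page, so the column names and values are assumptions and the live page's layout may differ.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up fragment that mimics a wikitable whose rows have no class attribute.
html = """
<table class="wikitable">
  <tbody>
    <tr><th>Rank</th><th>County</th><th>Per capita income</th></tr>
    <tr><td>1</td><td>Example A</td><td>$50,000</td></tr>
    <tr><td>2</td><td>Example B</td><td>$30,000</td></tr>
  </tbody>
</table>
"""

page = BeautifulSoup(html, "html.parser")
table = page.find("table", {"class": "wikitable"})

# Collect the text of every header/data cell, one list per row.
rows = []
for tr in table.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    if cells:
        rows.append(cells)

# The first row holds the column names; the rest hold the data.
header, *body = rows
scraped = pd.DataFrame(body, columns=header)
print(scraped)
```

`pandas.read_html` is another option that may handle simple tables like this directly, though it depends on an HTML parser library being installed.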

### Data collection: Excel and downloading the Wikipedia table

Thus, we manually copy-pasted the Wikipedia data into Excel and downloaded that file as a .csv.

In [None]:
county_wealth = pd.read_csv("county_wealth.csv")
display(county_wealth)

### Data Cleaning: Adding "County" to the county names in county_wealth

To perform an inner join on the tables, we have to rename each value in the "County" column of county_wealth, since the county names there do not have "County" after them, as they do in census_majors' county columns.

In [None]:
county_wealth['County'] = county_wealth['County'] + ' County'
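#Set the first row's name directly; a single .loc call avoids
#chained assignment, which pandas may silently ignore
county_wealth.loc[county_wealth.index[0], 'County'] = 'New York County'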
display(county_wealth)

We also have to change the Per capita income column values from strings to integers.

In [None]:
#Strip every non-digit character (commas, spaces, currency symbols)
#so the values can be converted to integers
county_wealth['Per capita income'] = county_wealth[
    'Per capita income'].str.replace(r'\D', '', regex=True)
#Convert to int
county_wealth['Per capita income'] = county_wealth[
    'Per capita income'].astype(int)
#Confidence check - print the type of the column 
#and display the dataframe
print(county_wealth['Per capita income'].dtypes)
display(county_wealth)

### Data cleaning: splitting census data

Here we split the data into two kinds of subsets:
- One dataframe that disregards age, with a single value per major across all ages (ageless_majors)
- Three dataframes, one per age group, each showing that age group's majors on its own (young_majors, middle_majors, old_majors)

After we create these, we display each dataframe and then merge it with the Wikipedia dataframe. We merge with an inner join (the default for .merge()), which keeps only the counties present in both dataframes, so we can look at each age group's majors alongside county per capita income.

In [None]:
#create the census dataframe for all ages (columns 1 - 6)
only_census_total = census_majors_cleaned.iloc[:, 1: 7]
#Drop the NaN column from the only_census_total dataframe 
#to create the no-NaN ageless_majors dataframe
ageless_majors = only_census_total.drop(columns=[
"Total population 25 years and over with a Bachelor's degree or higher"])
#Add the County column
ageless_majors['County'] = census_majors_cleaned['County']
#Confidence check
display(ageless_majors)

In [None]:
#create the census dataframe for only ages 25-39 
#(columns 9 - 13)
young_majors = census_majors_cleaned.iloc[:, 9:14].copy()
#Add the County column
young_majors['County'] = census_majors_cleaned['County']
#Confidence check
display(young_majors)

In [None]:
#create census dataframe only for ages 40-64 
#(columns 15 - 19)
middle_majors = census_majors_cleaned.iloc[:, 15:20].copy()
#Add the County column
middle_majors['County'] = census_majors_cleaned[
    'County']
#Confidence check
display(middle_majors)

In [None]:
#create census dataframe for only ages 65+ 
#(columns 21 - 25)
old_majors = census_majors_cleaned.iloc[:, 21:26].copy()
#Add County column
old_majors['County'] = census_majors_cleaned[
    'County']
#Confidence check
display(old_majors)
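
One likely reason positional slicing with .iloc is the practical choice here is that the same major names repeat once per age group, so selecting by label is ambiguous. The sketch below uses a made-up one-row table; the column labels and values are assumptions that only mimic the repeated-name layout.

```python
import pandas as pd

# Made-up row: major names repeat for "all ages" and for one age group,
# with an empty (None) column acting as the group header.
toy = pd.DataFrame(
    [["A County", None, "40.0%", "20.0%", None, "45.0%", "15.0%"]],
    columns=["County", "Total population", "Science and Engineering",
             "Business", "25 to 39 years", "Science and Engineering",
             "Business"],
)

# Selecting by label returns BOTH columns named "Business".
by_label = toy["Business"]
print(by_label.shape)  # two columns, not one

# Selecting by position picks exactly one age group's block of majors.
all_ages = toy.iloc[:, 2:4].copy()
young = toy.iloc[:, 5:7].copy()
print(young)
```

Because the slice boundaries are hard-coded positions, it is worth re-checking them against the displayed dataframe whenever the source table changes.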

### Data cleaning: merging dataframes

We want to merge our cleaned census dataframe with our Wikipedia dataframe for comparison purposes. We will use an inner join (the default behavior of .merge()) on the County key, because both dataframes have a County column that we have ensured has shared values. The Wikipedia data has more counties in its County column than the census data, but the inner join ensures that the resulting merged dataframes contain only the shared counties. We will do this for census_majors_cleaned (the cleaned version of the original census dataframe), ageless_majors, and the three age-specific dataframes (young_majors, middle_majors, and old_majors).

In [None]:
#Merge the whole census_majors_cleaned 
#dataframe with the county_wealth (Wikipedia) dataframe
merged_df = county_wealth.merge(
    census_majors_cleaned, 
    left_on='County', right_on ='County')
display(merged_df)
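
To check which counties an inner join drops, one option is an outer join with `indicator=True`, which records where each row came from. The sketch below uses made-up toy tables; the county list and values are assumptions for illustration.

```python
import pandas as pd

# Made-up stand-ins for county_wealth and a census majors dataframe.
wealth = pd.DataFrame({
    "County": ["Albany County", "Bronx County", "Hamilton County"],
    "Per capita income": [30000, 20000, 25000],
})
majors = pd.DataFrame({
    "County": ["Albany County", "Bronx County"],
    "Business": [19.0, 25.0],
})

# Inner join (the default): only counties present in both tables survive.
inner = wealth.merge(majors, on="County")

# Outer join with an indicator column shows which rows failed to match.
outer = wealth.merge(majors, on="County", how="outer", indicator=True)
unmatched = outer.loc[outer["_merge"] != "both", "County"].tolist()

print(inner)
print("Counties without a match:", unmatched)
```

A quick look at the unmatched list helps separate counties that are genuinely absent from one source from counties lost to a spelling mismatch in the key column.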

Here we will merge the non-age-specific dataframe with the county_wealth dataframe.

In [None]:
ageless_merged_df = county_wealth.merge(
    ageless_majors, left_on='County', 
    right_on ='County')
display(ageless_merged_df)

We also merged each of the age-specific data frames with the county_wealth dataframe.

In [None]:
young_merged_df = county_wealth.merge(
    young_majors, left_on='County', 
    right_on ='County')
display(young_merged_df)

In [None]:
middle_merged_df = county_wealth.merge(
    middle_majors, left_on='County', 
    right_on ='County')
display(middle_merged_df)

In [None]:
old_merged_df = county_wealth.merge(
    old_majors, left_on='County', 
    right_on ='County')
display(old_merged_df)

Lastly, we will get rid of the percent signs for the county's college major data and convert the strings into floats. We will do this for every merged dataframe so that we can run regressions and compute functions on the percentage data.

We will start by getting rid of the %s in the ageless_merged_df dataframe by replacing the % for each column:

In [None]:
#Strip the % sign from each major column and convert it to float
for column in ['Science and Engineering',
               'Science and Engineering Related Fields',
               'Business', 'Education',
               'Arts, Humanities and Others']:
    ageless_merged_df[column] = ageless_merged_df[
        column].str.replace('%', '').astype(float)
#Confidence check
print(ageless_merged_df[
    'Science and Engineering'].dtypes)
display(ageless_merged_df)

And now we will do the same to the three age-divided dataframes by replacing the % for each column in each dataframe:

In [None]:
#Strip the % sign from each major column and convert it to float
for column in ['Science and Engineering',
               'Science and Engineering Related Fields',
               'Business', 'Education',
               'Arts, Humanities and Others']:
    young_merged_df[column] = young_merged_df[
        column].str.replace('%', '').astype(float)
#Confidence check
print(young_merged_df['Science and Engineering'
                     ].dtypes)
display(young_merged_df)

In [None]:
#Strip the % sign from each major column and convert it to float
for column in ['Science and Engineering',
               'Science and Engineering Related Fields',
               'Business', 'Education',
               'Arts, Humanities and Others']:
    middle_merged_df[column] = middle_merged_df[
        column].str.replace('%', '').astype(float)
#Confidence check
print(middle_merged_df['Science and Engineering'
                      ].dtypes)
display(middle_merged_df)

In [None]:
#Strip the % sign from each major column and convert it to float
for column in ['Science and Engineering',
               'Science and Engineering Related Fields',
               'Business', 'Education',
               'Arts, Humanities and Others']:
    old_merged_df[column] = old_merged_df[
        column].str.replace('%', '').astype(float)
#Confidence check
print(old_merged_df['Science and Engineering'
                   ].dtypes)
display(old_merged_df)

Now the merged dataframes are ready for our exploratory data analysis!

## Exploratory Data Analysis

We will begin by performing general analysis and visualization on the larger, more overarching all-ages dataframe, ageless_merged_df.

We want to find the values of:
- Mean of each major's percentage within all counties
- Median of each major's percentage within all counties
- Mean of each major's percentage within all counties for each age
- Median of each major's percentage within all counties for each age

We want to display contextual visualizations to demonstrate:
- Each county's per capita income
- A comparison of counties' science and engineering majors percentage for all ages
- A comparison of counties' science and engineering related majors percentage for all ages
- A comparison of counties' business majors percentage for all ages
- A comparison of counties' education majors percentage for all ages
- A comparison of counties' arts, humanities, and others majors percentage for all ages

Begin with the mean and median values for ageless_merged_df:

In [None]:
print('Mean of Science and Engineering percentage within all counties: ' +
str(np.round(np.mean(ageless_merged_df['Science and Engineering']),2)))
print(
'Mean of Science and Engineering Related Fields percentage within all counties: ' 
+ str(np.round(np.mean(ageless_merged_df[
    'Science and Engineering Related Fields']),2)))
print('Mean of Business percentage within all counties: ' + 
str(np.round(np.mean(ageless_merged_df['Business']),2)))
print('Mean of Education percentage within all counties: ' + 
str(np.round(np.mean(ageless_merged_df['Education']),2)))
print('Mean of Arts, Humanities and Others percentage within all counties: ' + 
str(np.round(np.mean(ageless_merged_df['Arts, Humanities and Others']),2)))

In [None]:
print('Median of Science and Engineering percentage within all counties: ' + 
str(np.round(np.median(ageless_merged_df['Science and Engineering']),2)))
print('Median of Science and Engineering Related Fields percentage within all counties: ' 
+ str(np.round(np.median(ageless_merged_df['Science and Engineering Related Fields']),2)))
print('Median of Business percentage within all counties: ' + 
str(np.round(np.median(ageless_merged_df['Business']),2)))
print('Median of Education percentage within all counties: ' +
str(np.round(np.median(ageless_merged_df['Education']),2)))
print('Median of Arts, Humanities and Others percentage within all counties: ' + 
str(np.round(np.median(ageless_merged_df['Arts, Humanities and Others']),2)))

Now the mean and median for each age range: 

For young_merged_df (ages 25-39):

In [None]:
print('Mean of Science and Engineering percentage within all counties (ages 25-39): '
+ str(np.round(np.mean(young_merged_df['Science and Engineering']),2)))
print('Mean of Science and Engineering Related Fields percentage within all counties (ages 25-39): ' 
+ str(np.round(np.mean(young_merged_df['Science and Engineering Related Fields']),2)))
print('Mean of Business percentage within all counties (ages 25-39): ' + 
str(np.round(np.mean(young_merged_df['Business']),2)))
print('Mean of Education percentage within all counties (ages 25-39): ' + 
str(np.round(np.mean(young_merged_df['Education']),2)))
print('Mean of Arts, Humanities and Others percentage within all counties (ages 25-39): ' + 
str(np.round(np.mean(young_merged_df['Arts, Humanities and Others']),2)))

In [None]:
print('Median of Science and Engineering percentage within all counties (ages 25-39): ' 
+ str(np.round(np.median(young_merged_df['Science and Engineering']),2)))
print('Median of Science and Engineering Related Fields percentage within all counties (ages 25-39): ' 
+ str(np.round(np.median(young_merged_df['Science and Engineering Related Fields']),2)))
print('Median of Business percentage within all counties (ages 25-39): ' 
+ str(np.round(np.median(young_merged_df['Business']),2)))
print('Median of Education percentage within all counties (ages 25-39): '
+ str(np.round(np.median(young_merged_df['Education']),2)))
print('Median of Arts, Humanities and Others percentage within all counties (ages 25-39): '
+ str(np.round(np.median(young_merged_df['Arts, Humanities and Others']),2)))

For middle_merged_df (ages 40-64):

In [None]:
print('Mean of Science and Engineering percentage within all counties (ages 40-64): ' + 
str(np.round(np.mean(middle_merged_df['Science and Engineering']),2)))
print('Mean of Science and Engineering Related Fields percentage within all counties (ages 40-64): ' 
+ str(np.round(np.mean(middle_merged_df['Science and Engineering Related Fields']),2)))
print('Mean of Business percentage within all counties (ages 40-64): ' 
+ str(np.round(np.mean(middle_merged_df['Business']),2)))
print('Mean of Education percentage within all counties (ages 40-64): ' 
+ str(np.round(np.mean(middle_merged_df['Education']),2)))
print('Mean of Arts, Humanities and Others percentage within all counties (ages 40-64): ' 
+ str(np.round(np.mean(middle_merged_df['Arts, Humanities and Others']),2)))

In [None]:
print('Median of Science and Engineering percentage within all counties (ages 40-64): ' 
+ str(np.round(np.median(middle_merged_df['Science and Engineering']),2)))
print('Median of Science and Engineering Related Fields percentage within all counties (ages 40-64): ' 
+ str(np.round(np.median(middle_merged_df['Science and Engineering Related Fields']),2)))
print('Median of Business percentage within all counties (ages 40-64): ' 
+ str(np.round(np.median(middle_merged_df['Business']),2)))
print('Median of Education percentage within all counties (ages 40-64): '
+ str(np.round(np.median(middle_merged_df['Education']),2)))
print('Median of Arts, Humanities and Others percentage within all counties (ages 40-64): ' 
+ str(np.round(np.median(middle_merged_df['Arts, Humanities and Others']),2)))

For old_merged_df (ages 65+):

In [None]:
print('Mean of Science and Engineering percentage within all counties (ages 65+): ' 
+ str(np.round(np.mean(old_merged_df['Science and Engineering']),2)))
print('Mean of Science and Engineering Related Fields percentage within all counties (ages 65+): ' 
+ str(np.round(np.mean(old_merged_df['Science and Engineering Related Fields']),2)))
print('Mean of Business percentage within all counties (ages 65+): ' 
+ str(np.round(np.mean(old_merged_df['Business']),2)))
print('Mean of Education percentage within all counties (ages 65+): ' 
+ str(np.round(np.mean(old_merged_df['Education']),2)))
print('Mean of Arts, Humanities and Others percentage within all counties (ages 65+): ' 
+ str(np.round(np.mean(old_merged_df['Arts, Humanities and Others']),2)))
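
The per-column print statements above could also be condensed into one summary table with `DataFrame.agg`. The sketch below uses a made-up three-county dataframe, so the column values are assumptions; the same call would apply to any of the merged dataframes once their major columns are floats.

```python
import pandas as pd

# Made-up stand-in for one of the merged dataframes.
toy = pd.DataFrame({
    "County": ["A County", "B County", "C County"],
    "Business": [20.0, 25.0, 30.0],
    "Education": [10.0, 14.0, 12.0],
})

major_columns = ["Business", "Education"]

# One row per statistic, one column per major, rounded to 2 decimals.
summary = toy[major_columns].agg(["mean", "median"]).round(2)
print(summary)
```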

Start by displaying per capita income for each county (ranked) to understand the general differences between the rankings.

In [None]:
#display general trend of each county (ranked by per capita income) 
#vs. their per capita income
plt.plot(ageless_merged_df['County'], 
         ageless_merged_df['Per capita income'])
plt.title("NYS Counties' per capita income")
plt.xlabel("County name, ranked by per capita income")
plt.ylabel("Per capita income ($)")
plt.xticks(rotation=85)
#Render the figure; display() on the return value of plt.plot
#would only show a list of Line2D objects
plt.show()

Now we will display each county's percentage of Science and Engineering majors relative to the other counties:

In [None]:
science_engineering_barplot_ageless = sns.barplot(
    ageless_merged_df, x="County", y="Science and Engineering")
science_engineering_barplot_ageless.set_xticklabels(
    ageless_merged_df['County'],rotation=85)
science_engineering_barplot_ageless.set(
    xlabel="County name, ranked by per capita income", 
    ylabel='Percentage of Science and Engineering', 
    title='Percentage of Science and Engineering majors for each county')
display(science_engineering_barplot_ageless)

Now we will display each county's percentage of Science and Engineering Related Fields majors relative to the other counties:

In [None]:
# Bar plot of the Science and Engineering Related Fields percentage for each county
related_fields_barplot_ageless = sns.barplot(
    data=ageless_merged_df, x="County", 
    y="Science and Engineering Related Fields")
related_fields_barplot_ageless.set_xticklabels(
    ageless_merged_df['County'],rotation=85)
related_fields_barplot_ageless.set(
    xlabel="County name, ranked by per capita income", 
    ylabel='Percentage of Science and Engineering Related Fields', 
    title='Percentage of Science and Engineering Related Fields majors for each county')
display(related_fields_barplot_ageless)

Now we will display each county's percentage of Business majors relative to the other counties:

In [None]:
# Bar plot of the Business percentage for each county
business_barplot_ageless = sns.barplot(
    data=ageless_merged_df, x="County", y="Business")
business_barplot_ageless.set_xticklabels(
    ageless_merged_df['County'],rotation=85)
business_barplot_ageless.set(
    xlabel="County name, ranked by per capita income", 
    ylabel='Percentage of Business', 
    title='Percentage of Business majors for each county')
display(business_barplot_ageless)

Now we will display each county's percentage of Education majors relative to the other counties:

In [None]:
# Bar plot of the Education percentage for each county
education_barplot_ageless = sns.barplot(
    data=ageless_merged_df, x="County", y="Education")
education_barplot_ageless.set_xticklabels(
    ageless_merged_df['County'],rotation=85)
education_barplot_ageless.set(
    xlabel="County name, ranked by per capita income", 
    ylabel='Percentage of Education', 
    title='Percentage of Education majors for each county')
display(education_barplot_ageless)

Now we will display each county's percentage of Arts, Humanities and Others majors relative to the other counties:

In [None]:
# Bar plot of the Arts, Humanities and Others percentage for each county
arts_humanities_barplot_ageless = sns.barplot(
    data=ageless_merged_df, x="County", y="Arts, Humanities and Others")
arts_humanities_barplot_ageless.set_xticklabels(
    ageless_merged_df['County'],rotation=85)
arts_humanities_barplot_ageless.set(
    xlabel="County name, ranked by per capita income", 
    ylabel='Percentage of Arts, Humanities and Others', 
    title='Percentage of Arts, Humanities and Others majors for each county')
display(arts_humanities_barplot_ageless)

## Data Description of Analysis-Ready Data

- What are the observations (rows) and the attributes (columns)?
    - The ageless_merged_df dataframe has 37 rows and 7 columns. Each row is a county (rows are ranked by per capita income), and the first column is the county name. The remaining columns hold the county's per capita income and each group / category of college major. The values in the major columns are the percentage of that major group in that county.
    - The young_merged_df dataframe has 37 rows and 7 columns, all of which are the same as in the ageless_merged_df dataframe, but its values pertain only to participants between the ages of 25 and 39.
    - The middle_merged_df dataframe has 37 rows and 7 columns, all of which are the same as in the ageless_merged_df dataframe, but its values pertain only to participants between the ages of 40 and 64.
    - The old_merged_df dataframe has 37 rows and 7 columns, all of which are the same as in the ageless_merged_df dataframe, but its values pertain only to participants aged 65 and older.
- What preprocessing was done, and how did the data come to be in the form that you are using?
    - We selected the columns of the Census data that we were interested in; originally, the data had many columns unrelated to each county's major percentages. We also converted many string columns into floats and ints, because the numerical values we extracted from both the Census and Wikipedia were stored as strings containing % signs and commas. We renamed many columns, as well as the county names, because the originals were unnecessarily long and verbose. We merged the Wikipedia data with the full cleaned Census data to form one dataframe. We also divided the cleaned Census data into three dataframes, one per age group, and merged the Wikipedia dataframe with each of them. We split the data into age-specific dataframes because we wanted to view it in both an age-specific and an age-blind context.
- What do the instances that comprise the dataset represent?
    - The data in the major-name columns represent the percentage of people in that county who reported their major as being within that category of majors.
    - The data in the county name column represents the county being measured.
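
As a minimal sketch of the string-to-number conversion described in the preprocessing answer above, the helper below strips `%` signs and thousands separators before converting. The function name `to_number` and the sample values are hypothetical and are not taken from our notebook; the sketch is written in plain Python so that it runs without the Census or Wikipedia files.

```python
def to_number(raw):
    """Convert strings such as '12.5%' or '45,123' to a float or an int.

    Hypothetical helper: the name and the sample inputs are illustrative only.
    """
    cleaned = raw.strip().replace(",", "").replace("%", "")
    # Values with a decimal point become floats; all others become ints.
    return float(cleaned) if "." in cleaned else int(cleaned)


# Made-up values for illustration; NOT from the Census or Wikipedia data.
sample_percentages = ["12.5%", "8.0%", "30.25%"]
sample_incomes = ["45,123", "101,250"]

clean_percentages = [to_number(value) for value in sample_percentages]
clean_incomes = [to_number(value) for value in sample_incomes]
```

With pandas, a helper like this could be applied to a whole column through `Series.map(to_number)`.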

## Preregistration Hypotheses

- Hypothesis 1: Do counties with more business majors than any other major tend to be richer than counties with a lower percentage of business majors? 

- $H_0$: Counties with more business majors than any other major do not have a significant difference in wealth compared to counties with a lower percentage of business majors.
- $H_A$: Counties with more business majors than any other major do have a significant difference in wealth compared to counties with a lower percentage of business majors.

We seek to determine whether a correlation exists between college major and the wealth of a graduate's county of residence. Business majors typically go on to positions in finance, an industry that is assumed to be lucrative. By specifically investigating a potential relationship between business major percentage and a county's wealth, we can support or counter this assumption.

Using a linear regression, we will treat business major percentage as the input variable and county wealth (per capita income) as the output. We will use this model to determine the correlation coefficient between the input and the output. We will also create graphs to display the potential relationship between the input and the output.

- Hypothesis 2: Are science and engineering majors more likely to reside in a county with a high per capita income? 

- $H_0$: Science and engineering majors are as likely as other majors to reside in a county with a high per capita income.
- $H_A$: Science and engineering majors are significantly more likely than other majors to reside in a county with a high per capita income.

Similar to business, science and engineering professions have been increasingly viewed as profitable due to the rise of the technology industry. We seek to determine whether a relationship exists between the percentage of science and engineering majors and the wealth of the county in which they reside. By determining whether such a relationship exists, we can either reinforce or contradict this assumption with evidence.

Similar to the previous hypothesis, we will use a linear regression to determine the correlation coefficient between the input and the output. We will also display visuals to determine the distribution of the data.

## Hypothesis 1

Below, we isolate the relevant columns of our dataframe (the Per capita income and Business columns of ageless_merged_df) so that we can fit a regression to them.

In [None]:
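# Select multiple columns with a list of names (double brackets);
# a bare tuple inside single brackets raises a KeyError
hyp1_df = ageless_merged_df[['Per capita income', 'Business']]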
display(hyp1_df)

## Data Limitations

- The Census data is limited in that it is essentially a large sample, as it does not cover the entire population, and typically, more vulnerable groups are not represented. 
- This data and any findings from it cannot be generalized to any region outside New York State, as the only counties considered are New York State counties. Especially considering that the state has one of the biggest and wealthiest cities in the world, the findings are unlikely to map validly onto another region. In the same regard, New York residents most likely attended in-state colleges at a disproportionately high rate compared to residents of other states, which is yet another reason not to generalize this data outside of New York. 
- Another limitation is that the Census lumps together a broad variety of majors into "STEM", "Humanities", etc., which may hide outliers within those major groups. For example, Psychology may be considered STEM, but it may not follow the same trends as Computer Science.
- Another limitation is that the Wikipedia data is outdated by 13 years, as its most recent data is from 2010. This may be an issue because our Census data is much more recent.