# **Guide to Picking a College Major - Project 2**
### Analyzing the linear relationship between two quantitative variables.

# **Importing Necessary Python Modules**

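Python offers a variety of open-source add-ins called **modules** that extend what the language can do. The cell below imports the modules used in this project for working with data (pandas, numpy), graphing (plotly), displaying images (IPython), and fitting regression models (statsmodels).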

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
from IPython.display import Image
import statsmodels.api as sm

In [None]:
# Assigns the URL where the image is stored to 'image_url'.
image_url = 'https://media.istockphoto.com/id/1470208665/photo/multi-ethnic-group-of-latin-and-african-american-college-students-smiling-diversity-portrait.jpg?s=2048x2048&w=is&k=20&c=zicp2F74iFTRKjJUwFBgs_Mb_Xd5vvkvdmYSVoekb1I='

# Display the image
Image(url=image_url, width=600)



# **Context**

In 2014, the website FiveThirtyEight.com published an article, *The Economic Guide To Picking A College Major*, which looked at the earnings of various college majors. The data analyzed by the author, Ben Casselman, came from the American Community Survey (ACS) 2010-2012 Public Use Microdata Series (PUMS). According to its website, the ACS, part of the U.S. Census Bureau, "is the premier source for detailed population and housing information in our nation."

With this dataset, you have the power to explore college programs and their graduates like never before and create stories of your own!

https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/

https://www.census.gov/programs-surveys/acs

# **About the Dataset**

This dataset contains 172 rows, one per college major, built from a random sample of people with at least some college education. It consists of one identifier variable, Index, and 22 other variables describing each major. Descriptions of the variables are listed below:

**Variables**

| Column name                 | Description                                                |
|-----------------------------|------------------------------------------------------------|
| Index                       | A number assigned to each row, one per major (Integer)     |
| Major_code                  | The code associated with the major (Integer)               |
| Major                       | The specific major of the field of study (String)          |
| Major_category              | The category of the major (String)                         |
| Grad_total                  | The total number of graduates from the major (Integer)     |
| Grad_sample_size            | The sample size of graduates from the major (Integer)      |
| Grad_employed               | The number of graduates employed (Integer)                 |
| Grad_full_time_year_round   | The number of graduates employed full-time year-round (Integer) |
| Grad_unemployed             | The number of graduates unemployed (Integer)               |
| Grad_unemployment_rate      | The unemployment rate of graduates (Float)                 |
| Grad_median                 | The median salary of graduates (Integer)                   |
| Grad_P25                    | The 25th percentile salary of graduates (Integer)          |
| Grad_P75                    | The 75th percentile salary of graduates (Integer)          |
| Nongrad_total               | The total number of non-graduates from the major (Integer) |
| Nongrad_employed            | The number of non-graduates employed (Integer)             |
| Nongrad_full_time_year_round| The number of non-graduates employed full-time year-round (Integer) |
| Nongrad_unemployed          | The number of non-graduates unemployed (Integer)           |
| Nongrad_unemployment_rate   | The unemployment rate of non-graduates (Float)             |
| Nongrad_median              | The median salary of non-graduates (Integer)               |
| Nongrad_P25                 | The 25th percentile salary of non-graduates (Integer)      |
| Nongrad_P75                 | The 75th percentile salary of non-graduates (Integer)      |
| Grad_share                  | The share of graduates in the major (Float)                |
| Grad_premium                | The relative difference between the median salaries of graduates and non-graduates, as a proportion of the non-graduate median (Float) |


# **A Snippet of the Data**

Let's take a look at the data. To do this, we first import it directly from the URL.

In [None]:
# Assigns the URL containing the data file to 'file_path'.
file_path = "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/grad-students.csv"

# Reads in the CSV data file and assigns it to the DataFrame 'df'.
df = pd.read_csv(file_path)
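
After reading a file, it is worth checking that it loaded as expected. The sketch below is only an illustration: to stay runnable without a network connection, it reads a small excerpt (three columns from two rows of this data set) from a text string rather than from the URL. The names `csv_text` and `sample` are made up for this example.

```python
import io
import pandas as pd

# Three columns from two rows of this data set, written as CSV text.
csv_text = """Major_code,Major,Grad_median
5601,CONSTRUCTION SERVICES,75000.0
6004,COMMERCIAL ART AND GRAPHIC DESIGN,60000.0
"""

# pd.read_csv accepts a file-like object as well as a URL or file path.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (number of rows, number of columns)
print(sample.dtypes)   # the data type pandas inferred for each column
```

The same two checks, `df.shape` and `df.dtypes`, can be run on the full DataFrame `df`.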


Next, we can display the data by typing the name of the DataFrame. To ensure we can see all columns, we'll use the `pd.set_option` function.

In [None]:
# Set display options to show all columns
pd.set_option('display.max_columns', None)

# When you type the object name, the object gets printed.
df

Unnamed: 0,Major_code,Major,Major_category,Grad_total,Grad_sample_size,Grad_employed,Grad_full_time_year_round,Grad_unemployed,Grad_unemployment_rate,Grad_median,Grad_P25,Grad_P75,Nongrad_total,Nongrad_employed,Nongrad_full_time_year_round,Nongrad_unemployed,Nongrad_unemployment_rate,Nongrad_median,Nongrad_P25,Nongrad_P75,Grad_share,Grad_premium
0,5601,CONSTRUCTION SERVICES,Industrial Arts & Consumer Services,9173,200,7098,6511,681,0.087543,75000.0,53000,110000.0,86062,73607,62435,3928,0.050661,65000.0,47000,98000.0,0.096320,0.153846
1,6004,COMMERCIAL ART AND GRAPHIC DESIGN,Arts,53864,882,40492,29553,2482,0.057756,60000.0,40000,89000.0,461977,347166,250596,25484,0.068386,48000.0,34000,71000.0,0.104420,0.250000
2,6211,HOSPITALITY MANAGEMENT,Business,24417,437,18368,14784,1465,0.073867,65000.0,45000,100000.0,179335,145597,113579,7409,0.048423,50000.0,35000,75000.0,0.119837,0.300000
3,2201,COSMETOLOGY SERVICES AND CULINARY ARTS,Industrial Arts & Consumer Services,5411,72,3590,2701,316,0.080901,47000.0,24500,85000.0,37575,29738,23249,1661,0.052900,41600.0,29000,60000.0,0.125878,0.129808
4,2001,COMMUNICATION TECHNOLOGIES,Computers & Mathematics,9109,171,7512,5622,466,0.058411,57000.0,40600,83700.0,53819,43163,34231,3389,0.072800,52000.0,36000,78000.0,0.144753,0.096154
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
168,5203,COUNSELING PSYCHOLOGY,Psychology & Social Work,51812,724,38468,28808,1420,0.035600,50000.0,36000,65000.0,16781,12377,8502,835,0.063200,40000.0,25000,50000.0,0.755354,0.250000
169,5202,CLINICAL PSYCHOLOGY,Psychology & Social Work,22716,355,16612,12022,782,0.044958,70000.0,47000,95000.0,6519,4368,3033,357,0.075556,46000.0,30000,70000.0,0.777014,0.521739
170,6106,HEALTH AND MEDICAL PREPARATORY PROGRAMS,Health,114971,1766,78132,58825,1732,0.021687,135000.0,70000,294000.0,26320,16221,12185,1012,0.058725,51000.0,35000,87000.0,0.813718,1.647059
171,2303,SCHOOL STUDENT COUNSELING,Education,19841,260,11313,8130,613,0.051400,56000.0,42000,70000.0,2232,1328,980,169,0.112892,42000.0,27000,51000.0,0.898881,0.333333
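
The `Grad_premium` values in the output above are proportions rather than dollar amounts. The sketch below assumes the premium is the gap between the two medians divided by the non-graduate median. This formula is an assumption inferred from the displayed rows, not a documented definition.

```python
import numpy as np

# Medians copied from the first three rows displayed above.
grad_median = np.array([75000.0, 60000.0, 65000.0])
nongrad_median = np.array([65000.0, 48000.0, 50000.0])

# Assumed formula: relative gap between the two medians.
premium = (grad_median - nongrad_median) / nongrad_median

print(np.round(premium, 6))
```

Rounded to six decimal places, the results are 0.153846, 0.25, and 0.3, which agree with the `Grad_premium` values shown for those three majors.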


# **INSTRUCTIONS**

* Use Python to analyze the data set and complete each of the following.
* Replace each ellipsis (...) with the relevant names or code.
* For problems that require a written response, replace the ellipsis (...) by double-clicking the text box to start typing.
* Refer to the tutorial from the activity for assistance.
* Attend office hours if you still need help.

This assignment explores the linear relationship between two quantitative variables in the data. Choose the dependent and independent variables based on your last name.

| Last Name | Dependent Variable (y) | Independent Variable (x) |
|-----------|------------------------|--------------------------|
| A-L       | Grad_median            | Grad_unemployment_rate   |
| M-Z       | Nongrad_median         | Nongrad_unemployment_rate|




Based on your last name, which variable is explanatory and which is the response?

Explanatory: ...

Response: ...

What do you think the direction (positive/negative) and strength (weak/moderate/strong) will be?

# ...

# **Question 1**

## Scatterplot

**1.1)** Construct a scatterplot that could be used to show the relationship between the variables stated above.

In [None]:
# Create a scatter plot using Plotly Express

                                                          # STUDENTS: replace ... as stated
scatter_plot = px.scatter(df,
                 x='...',                                 # Replace ... with the explanatory variable name
                 y='...',                                 # Replace ... with the response variable name
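                 # The label keys below must match your x and y column names.
                 # M-Z: change them to 'Nongrad_unemployment_rate' and 'Nongrad_median'.
                 labels={'Grad_unemployment_rate': '...', # Replace ... with a better x-axis label
                         'Grad_median': '...'})           # Replace ... with a better y-axis label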

# Updating layout
# STUDENTS: Do not change anything in this block of code.
scatter_plot.update_layout(
    plot_bgcolor='rgba(255,255,255,1)',        # Sets the background color of the plot area to white with full opacity.
    xaxis=dict(
        showline=True,                         # Displays a line on the x-axis.
        showgrid=False,                        # Hides the grid lines on the x-axis.
        linecolor='black'                      # Sets the color of the x-axis line to black.
    ),
    yaxis=dict(
        showline=True,                         # Displays a line on the y-axis.
        showgrid=False,                        # Hides the grid lines on the y-axis.
        linecolor='black'                      # Sets the color of the y-axis line to black.
    ),
    title={
        'text': 'Scatter Plot',                # Sets the title text.
        'y': 0.9,                              # Positions the title 90% of the way up the plot.
        'x': 0.5,                              # Centers the title horizontally.
        'xanchor': 'center',                   # Anchors the title at its center on the x-axis.
        'yanchor': 'top',                      # Anchors the title at the top on the y-axis.
        'font': dict(
            size=16                            # Sets the title font size to 16 (smaller than default).
        ),
    },
    width=575,                                 # Sets the width of the plot.
    height=400                                 # Sets the height of the plot for portrait mode.
)

# Show the plot
scatter_plot.show()


**1.2)** Describe the relationship between the two variables in context, identifying the direction, strength, and form. Since you don't yet know the value of the correlation coefficient (r), estimate the strength (weak/moderate/strong).


# ...

**1.3)** Is this what you expected?


# ...

# **Question 2**
## Correlation Coefficient

**2.1)** Calculate the value of the correlation coefficient.

In [None]:
# Calculate the correlation coefficient

# STUDENTS: Replace ... as stated
# Replace the 1st ... with the response variable name.
# Replace the 2nd ... with the explanatory variable name
correlation = df['...'].corr(df['...'])

# Print the correlation coefficient
print(f"Correlation coefficient between your two variables is: r = {correlation}")
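
For reference, the correlation coefficient is the sum of the products of the two variables' z-scores divided by n - 1. The sketch below computes r that way on hypothetical toy data (not the project data) and compares it with the `Series.corr` method used above. The names `toy`, `zx`, `zy`, `r_by_hand`, and `r_pandas` are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the project data set.
toy = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0],
                    'y': [2.0, 4.0, 5.0, 4.0, 5.0]})

# Standardize each variable using the sample standard deviation (ddof=1).
zx = (toy['x'] - toy['x'].mean()) / toy['x'].std(ddof=1)
zy = (toy['y'] - toy['y'].mean()) / toy['y'].std(ddof=1)

# r is the sum of the products of the z-scores divided by n - 1.
r_by_hand = (zx * zy).sum() / (len(toy) - 1)

# pandas computes the same Pearson correlation with Series.corr.
r_pandas = toy['y'].corr(toy['x'])

print(r_by_hand, r_pandas)
```

The two printed values agree. Swapping the two variables in `corr` does not change r.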

**2.2)** Interpret the correlation coefficient.


# ...

**2.3)** Now that you know the value of the correlation coefficient, r, does the strength match what you wrote in 1.2?

# ...

# **Question 3**

## Least Squares Regression

**3.1)** Calculate the linear model and add the regression line to the scatterplot.


In [None]:
# Calculate the LSRL and add the regression line to the scatterplot.

# Extract the variables from the dataframe
                   # STUDENTS: Replace ... as stated below.
X = df['...']      # Replace ... with the explanatory variable name
Y = df['...']      # Replace ... with the response variable name

# Add a constant to the explanatory variable for the regression model.
# This forces Statsmodels to calculate the y-intercept.
# STUDENTS: Do not change anything in this block of code.
X_with_const = sm.add_constant(X)

# Fit the linear regression model (calculate slope and y-intercept)
# STUDENTS: Do not change anything in this block of code.
model = sm.OLS(Y, X_with_const).fit()

# Recreate the scatterplot
                                                          # STUDENTS: Replace ... as stated to the right below.
scatter_plot = px.scatter(df,
                 x='...',                                 # Replace ... with the explanatory variable name
                 y='...',                                 # Replace ... with the response variable name
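                 # The label keys below must match your x and y column names.
                 # M-Z: change them to 'Nongrad_unemployment_rate' and 'Nongrad_median'.
                 labels={'Grad_unemployment_rate': '...', # Replace ... with a better x-axis label
                         'Grad_median': '...'})           # Replace ... with a better y-axis label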

# Updating layout
# STUDENTS: Do not change anything in this block of code.
scatter_plot.update_layout(
    plot_bgcolor='rgba(255,255,255,1)',        # Sets the background color of the plot area to white with full opacity.
    xaxis=dict(
        showline=True,                         # Displays a line on the x-axis.
        showgrid=False,                        # Hides the grid lines on the x-axis.
        linecolor='black'                      # Sets the color of the x-axis line to black.
    ),
    yaxis=dict(
        showline=True,                         # Displays a line on the y-axis.
        showgrid=False,                        # Hides the grid lines on the y-axis.
        linecolor='black'                      # Sets the color of the y-axis line to black.
    ),
    title={
        'text': 'Scatter Plot',                # Sets the title text.
        'y': 0.9,                              # Positions the title 90% of the way up the plot.
        'x': 0.5,                              # Centers the title horizontally.
        'xanchor': 'center',                   # Anchors the title at its center on the x-axis.
        'yanchor': 'top',                      # Anchors the title at the top on the y-axis.
        'font': dict(
            size=16                            # Sets the title font size to 16 (smaller than default).
        ),
    },
    width=600,                                 # Sets the width of the plot.
    height=400,                                # Sets the height of the plot for portrait mode.
    showlegend=False                           # Disables the legend (key) display
)

# Add the regression line to the scatterplot
scatter_plot.add_scatter(x=X, y=model.predict(X_with_const),
                         mode='lines', name='Regression Line')

# Show the plot
scatter_plot.show()
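
For a model with one explanatory variable, the least squares slope and intercept can also be found from the formulas b1 = r * s_y / s_x and b0 = y-bar - b1 * x-bar. The sketch below applies them to hypothetical toy data (not the project data) and uses `np.polyfit` as an independent check. It does not use statsmodels, so treat it as an illustration of the formulas rather than a replacement for the code above.

```python
import numpy as np

# Hypothetical toy data, not taken from the project data set.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

r = np.corrcoef(x, y)[0, 1]                    # Pearson correlation
slope = r * y.std(ddof=1) / x.std(ddof=1)      # b1 = r * s_y / s_x
intercept = y.mean() - slope * x.mean()        # b0 = y-bar - b1 * x-bar

# np.polyfit with degree 1 fits the same least squares line.
check_slope, check_intercept = np.polyfit(x, y, 1)

print(slope, intercept)
```

For this toy data the slope is 0.6 and the intercept is 2.2, up to floating-point rounding.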

**3.2)** Print the coefficients (Intercept and Slope)



In [None]:
# Print the coefficients (Intercept and Slope)

# STUDENTS: Do not change the code below.
# The coefficients were calculated when the model was fit in 3.1 and stored
# in 'model'. This assigns the values of the y-intercept and slope to
# 'coefficients', then displays the values.
coefficients = model.params
coefficients.name = 'Coefficients'
coefficients


**3.3)** Write the equation of the line that predicts your response variable from your explanatory variable. To type 𝑦̂ you can copy-and-paste the symbol, or type “y-hat” or “predicted y”. Do NOT round.



# ...

# **Question 4**
## Slope
Interpret the slope of the least squares regression line in the context of this study.
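
When interpreting a slope, keep the units of x in mind. The unemployment rates in this data set are stored as proportions (for example, 0.05 means 5%), so a one-unit increase in x would be a jump of 100 percentage points. The sketch below rescales a slope to a one-percentage-point change. The slope value is hypothetical and chosen only for illustration.

```python
# Hypothetical slope, for illustration only; use the value from your own model.
slope = -100000.0          # change in predicted y per 1.0 increase in x

# Unemployment rates in this data set are stored as proportions (0.05 = 5%),
# so a one-percentage-point increase in x is a change of 0.01.
one_point = 0.01
change_in_prediction = slope * one_point

print(change_in_prediction)
```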


# ...

# **Question 5**
## Y-Intercept

Interpret the y-intercept of the least squares regression line in the context of this study. State whether the interpretation is reasonable.
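
The y-intercept is the predicted value of y when x = 0. One way to judge whether that interpretation is reasonable is to check whether 0 falls within the observed x values. The sketch below runs that check on only the first five `Grad_unemployment_rate` values displayed in this notebook, so its result does not describe the full column. Run the same comparison on your own explanatory variable.

```python
import numpy as np

# Grad_unemployment_rate values from the first five rows displayed in this notebook.
x = np.array([0.087543, 0.057756, 0.073867, 0.080901, 0.058411])

# If 0 lies outside the observed x values, interpreting the intercept
# means extrapolating beyond the data.
zero_in_range = x.min() <= 0 <= x.max()

print(x.min(), x.max(), zero_in_range)
```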

# ...

# **Question 6**
## Prediction

The values in the table below show the unemployment rate for graduates (A-L) and non-graduates (M-Z) for Criminology majors.

Show work to predict the value of your dependent variable for Criminology majors using the actual value of the independent variable shown in the table below. Round to two (2) decimal places.

| Last Name      | Independent Variable (x)                |
|----------------|----------------------------------------|
| **A-L**        | `Grad_unemployment_rate` = 0.045299817 |
| **M-Z**        | `Nongrad_unemployment_rate` = 0.056334794 |
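
A prediction is made by substituting the x value into the regression equation, y-hat = intercept + slope * x. The sketch below shows the arithmetic with hypothetical coefficients and a hypothetical x value. None of these numbers come from the project data, so substitute your own values from 3.2 and the table above.

```python
# Hypothetical coefficients, for illustration only; use the values from 3.2.
intercept = 50000.0
slope = -100000.0

# Hypothetical x value (an unemployment rate stored as a proportion).
x_value = 0.05

# Predicted y: y-hat = intercept + slope * x
predicted = intercept + slope * x_value

print(round(predicted, 2))
```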


# ...

# **Question 7**
## Compare
**7.1)** Look up the actual value of your dependent variable for the value of the independent variable above. We know it is for Criminology majors, but be sure to type in the value of the independent variable listed in Question 6 in the code below.

In [None]:
# Filter the DataFrame to find the row where your independent variable
#   equals the value from the table in Question 6.

# STUDENTS:
# Replace the 1st ... with the explanatory variable name
# Replace the 2nd ... with the value from the table in Question 6
# Replace the 3rd ... with the response variable name
actual_dependent = df[df['...'] == ...]['...']

actual = int(actual_dependent.values[0])
print(f"The actual value of the dependent variable is: ${actual}")
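
Exact comparison with `==` only finds a row when the typed value matches the stored decimal exactly. If the filter above returns no rows, a tolerance-based match is one alternative. The sketch below uses `np.isclose` on two rows copied from the displayed data, where the values are shown rounded to six decimal places. The tolerance `atol=1e-4` and the names `sample`, `mask`, and `match` are choices made for this example.

```python
import numpy as np
import pandas as pd

# Two rows copied from the data displayed in this notebook.
sample = pd.DataFrame({
    'Major': ['CONSTRUCTION SERVICES', 'HOSPITALITY MANAGEMENT'],
    'Grad_unemployment_rate': [0.087543, 0.073867],
    'Grad_median': [75000.0, 65000.0],
})

# np.isclose matches values within a small tolerance instead of exactly.
mask = np.isclose(sample['Grad_unemployment_rate'], 0.0875, atol=1e-4)
match = sample.loc[mask, 'Grad_median']

print(int(match.values[0]))
```

Here the search value 0.0875 is within the tolerance of 0.087543, so only the first row is selected.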


**7.2)** Compare the predicted value of your dependent variable from Question 6 to the actual value from 7.1. Use the actual name of your dependent variable, as modeled in activity.
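
One common way to compare an actual value with a prediction is the residual, actual minus predicted. The sketch below uses hypothetical numbers only. The comparison modeled in the activity may be worded differently.

```python
# Hypothetical values, for illustration only.
actual = 48000.0
predicted = 45000.0

# Residual = actual - predicted.
# A positive residual means the model underpredicted the actual value.
residual = actual - predicted

print(residual)
```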


# ...

# **Question 8**

Write a paragraph of at least 100 words to address one of the following prompts. That is, answer either 8a or 8b, but not both.

**8a)** Discuss how analyzing your chosen data set using statistical methods could help you become better prepared for future courses in your major.

# ...

--OR--

**8b)** Discuss how analyzing your chosen data set using statistical methods could be instrumental in becoming better prepared for your future career.

# ...


<br><br>
### Once you are done and ready to submit, follow the instructions below to save as a PDF and submit to GradeScope.

### Save as PDF
Note 1: You do not have to select Print Preview. You can print directly from the notebook.

Note 2: Image and graph sizes have been set so you should be able to see them correctly without making any changes to the browser width or the layout (portrait vs. landscape).

1. Run all code one last time and make sure your graphs can be seen.
2. Select File -> Print (or press Ctrl+P / Cmd+P).
3. Change the "Destination" to PDF.
4. Save the PDF, taking note of where it is saved.

### Submit to GradeScope
**Watch the "GradeScope Submission" video for help.**
1. Log in to the Canvas course.
2. Click on GradeScope in the course navigation.
3. If you see multiple courses in GradeScope, click on the STAT 108 course
4. Click on the name of the assignment that matches your data set
5. Click on "Submit Work", select PDF
6. Select the PDF you just created
7. You need to tell GradeScope which page each problem answer/output is on. You should see a list of problems on the left, and a display of pages (thumbnails) on the right. Assign pages to questions by clicking on the question number on the left, then clicking on all pages that question is on.
8. After ALL questions have been assigned to their respective page(s), click "Submit"

#### **Still need help? Your STAT 108 team is here to help. Take your laptop to office hours.**
