# **College Majors - Project 1**
### Analyzing qualitative and quantitative variables.

# **Importing Necessary Python Modules**

Python incorporates a variety of open-source add-ins called **modules** that extend what the language can do. The cell below imports the modules used in this project.

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
from IPython.display import Image
import warnings
warnings.simplefilter('ignore', FutureWarning)

In [2]:
# Replace 'image_url' with the URL of the image you want to display
image_url = 'https://media.istockphoto.com/id/1470208665/photo/multi-ethnic-group-of-latin-and-african-american-college-students-smiling-diversity-portrait.jpg?s=2048x2048&w=is&k=20&c=zicp2F74iFTRKjJUwFBgs_Mb_Xd5vvkvdmYSVoekb1I='

# Display the image
Image(url=image_url)

# **Context**

In 2014, the website FiveThirtyEight.com published an article, *The Economic Guide To Picking A College Major*, that examined the earnings of various college majors. The data analyzed by the author, Ben Casselman, came from the American Community Survey (ACS) 2010-2012 Public Use Microdata Series (PUMS). According to its website, the ACS, part of the U.S. Census Bureau, "is the premier source for detailed population and housing information in our nation."

With this dataset, you have the power to explore college programs and their graduates like never before and create stories of your own!

https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/

https://www.census.gov/programs-surveys/acs

# **About the Dataset**

This dataset contains 172 rows, one per college major, summarizing a random sample of people with at least some college education. The data consists of one identifier variable, the index, and 22 other variables describing each of the 172 majors. Descriptions of the variables are listed below:

**Variables**

| Column name                 | Description                |
|:----------------------------|:---------------------------|
| Index                       | A number assigned to each row, one per major (Integer)     |
| Major_code                  | The code associated with the major (Integer)               |
| Major                       | The specific major of the field of study (String)          |
| Major_category              | The category of the major (String)                         |
| Grad_total                  | The total number of graduates from the major (Integer)     |
| Grad_sample_size            | The sample size of graduates from the major (Integer)      |
| Grad_employed               | The number of graduates employed (Integer)                 |
| Grad_full_time_year_round   | The number of graduates employed full-time year-round (Integer) |
| Grad_unemployed             | The number of graduates unemployed (Integer)               |
| Grad_unemployment_rate      | The unemployment rate of graduates (Float)                 |
| Grad_median                 | The median salary of graduates (Integer)                   |
| Grad_P25                    | The 25th percentile salary of graduates (Integer)          |
| Grad_P75                    | The 75th percentile salary of graduates (Integer)          |
| Nongrad_total               | The total number of non-graduates from the major (Integer) |
| Nongrad_employed            | The number of non-graduates employed (Integer)             |
| Nongrad_full_time_year_round| The number of non-graduates employed full-time year-round (Integer) |
| Nongrad_unemployed          | The number of non-graduates unemployed (Integer)           |
| Nongrad_unemployment_rate   | The unemployment rate of non-graduates (Float)             |
| Nongrad_median              | The median salary of non-graduates (Integer)               |
| Nongrad_P25                 | The 25th percentile salary of non-graduates (Integer)      |
| Nongrad_P75                 | The 75th percentile salary of non-graduates (Integer)      |
| Grad_share                  | The share of graduates in the major (Float)                |
| Grad_premium                | The relative difference between the median salaries of graduates and non-graduates, as a fraction of the non-graduate median (Float) |


Let's take a look at the data. To do this, we first import it directly from the URL below.

# **A Snippet of the Data**

In [3]:
file_path = "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/grad-students.csv"
df = pd.read_csv(file_path)


Next, we can display the data by typing the name of the DataFrame. To ensure we can see all columns, we'll use the *pd.set_option* function.

In [4]:
# Set display options to show all columns
pd.set_option('display.max_columns', None)
df

Unnamed: 0,Major_code,Major,Major_category,Grad_total,Grad_sample_size,Grad_employed,Grad_full_time_year_round,Grad_unemployed,Grad_unemployment_rate,Grad_median,Grad_P25,Grad_P75,Nongrad_total,Nongrad_employed,Nongrad_full_time_year_round,Nongrad_unemployed,Nongrad_unemployment_rate,Nongrad_median,Nongrad_P25,Nongrad_P75,Grad_share,Grad_premium
0,5601,CONSTRUCTION SERVICES,Industrial Arts & Consumer Services,9173,200,7098,6511,681,0.087543,75000.0,53000,110000.0,86062,73607,62435,3928,0.050661,65000.0,47000,98000.0,0.096320,0.153846
1,6004,COMMERCIAL ART AND GRAPHIC DESIGN,Arts,53864,882,40492,29553,2482,0.057756,60000.0,40000,89000.0,461977,347166,250596,25484,0.068386,48000.0,34000,71000.0,0.104420,0.250000
2,6211,HOSPITALITY MANAGEMENT,Business,24417,437,18368,14784,1465,0.073867,65000.0,45000,100000.0,179335,145597,113579,7409,0.048423,50000.0,35000,75000.0,0.119837,0.300000
3,2201,COSMETOLOGY SERVICES AND CULINARY ARTS,Industrial Arts & Consumer Services,5411,72,3590,2701,316,0.080901,47000.0,24500,85000.0,37575,29738,23249,1661,0.052900,41600.0,29000,60000.0,0.125878,0.129808
4,2001,COMMUNICATION TECHNOLOGIES,Computers & Mathematics,9109,171,7512,5622,466,0.058411,57000.0,40600,83700.0,53819,43163,34231,3389,0.072800,52000.0,36000,78000.0,0.144753,0.096154
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
168,5203,COUNSELING PSYCHOLOGY,Psychology & Social Work,51812,724,38468,28808,1420,0.035600,50000.0,36000,65000.0,16781,12377,8502,835,0.063200,40000.0,25000,50000.0,0.755354,0.250000
169,5202,CLINICAL PSYCHOLOGY,Psychology & Social Work,22716,355,16612,12022,782,0.044958,70000.0,47000,95000.0,6519,4368,3033,357,0.075556,46000.0,30000,70000.0,0.777014,0.521739
170,6106,HEALTH AND MEDICAL PREPARATORY PROGRAMS,Health,114971,1766,78132,58825,1732,0.021687,135000.0,70000,294000.0,26320,16221,12185,1012,0.058725,51000.0,35000,87000.0,0.813718,1.647059
171,2303,SCHOOL STUDENT COUNSELING,Education,19841,260,11313,8130,613,0.051400,56000.0,42000,70000.0,2232,1328,980,169,0.112892,42000.0,27000,51000.0,0.898881,0.333333
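
The rate and share columns in the snippet appear to be derived from the count and median columns. The sketch below checks this for row 0 (CONSTRUCTION SERVICES). The three formulas are assumptions inferred from the displayed numbers; the source does not state them.

```python
# Values copied from row 0 (CONSTRUCTION SERVICES) of the snippet above
grad_total, grad_employed, grad_unemployed = 9173, 7098, 681
nongrad_total = 86062
grad_median, nongrad_median = 75000.0, 65000.0

# Assumed formulas (inferred from the numbers, not stated in the source)
unemployment_rate = grad_unemployed / (grad_employed + grad_unemployed)
grad_share = grad_total / (grad_total + nongrad_total)
grad_premium = (grad_median - nongrad_median) / nongrad_median

# Print each derived value rounded to six decimal places
print(round(unemployment_rate, 6))
print(round(grad_share, 6))
print(round(grad_premium, 6))
```

Under these assumptions, the three results agree to six decimal places with the row's Grad_unemployment_rate (0.087543), Grad_share (0.096320), and Grad_premium (0.153846).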


# **INSTRUCTIONS**

* Use Python to analyze the data set and complete each of the following.
* Replace each ellipsis (...) with the relevant names or code.
* For problems that require a written response, double click the text box to start typing.
* Reference the 3 tutorials from activity for assistance.
* Attend office hours if you still need help.

## **QUESTION 1**

Determine whether the four variables below are qualitative or quantitative. If they are quantitative, specify whether they are continuous or discrete.

| Variable               | Classification |
|:-----------------------|:---------------|
|Major                   | ... |
|Grad_unemployment_rate  | ... |
|Grad_unemployed         | ... |
|Major_category          | ... |


## **QUESTION 2**

Construct a frequency table, relative frequency table, and relative frequency bar chart to describe the distribution of Major_category. State any fact that jumps out to you.

**2a)** Construct a table that contains the frequency and relative frequency distribution for the major category variable. Round relative frequency to 3 decimal places.
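
For reference, here is a minimal sketch of this kind of table, built on a small made-up Series. The `colors` data and all variable names are hypothetical and not part of the project dataset. It illustrates the general pattern only and is not the expected answer.

```python
import pandas as pd

# Made-up categorical data (hypothetical, not from the project dataset)
colors = pd.Series(['red', 'blue', 'red', 'green', 'blue', 'red'], name='color')

# Frequency: count each category, then sort the categories alphabetically
freq = colors.value_counts().sort_index().rename('Frequency')

# Relative frequency: each count divided by the total number of observations
rel_freq = (freq / freq.sum()).round(3).rename('Relative Frequency')

# Place the two Series side by side as columns (axis=1)
table = pd.concat([freq, rel_freq], axis=1)
print(table)
```

Rounding after the division keeps the counts exact and rounds only the proportions.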

In [14]:
# Define the name of the variable to be analyzed
variable = df['...']

# Create the frequency table and sort the categories in alphabetical order.
# rename "count" to "frequency"
freq_table = variable.value_counts().sort_index()
freq_table = freq_table.rename('Frequency')

# Create the relative frequency table, and rename the counts column to
#   Relative Frequency.
relative_freq_table=freq_table/...      #HINT: look back at Project 0 or Tutorial 1.
relative_freq_table = relative_freq_table.rename('...').round(...)

# Combine both tables
# axis=1 says to put the tables together as columns
combined_table=pd.concat([..., ...], axis=1)

# Rename the index label 'Major_category'
combined_table.index.name = 'Major Category'

# Print the combined table.
...


KeyError: Ellipsis

**2b)** Construct a relative frequency bar chart to describe the distribution of major category.

In [15]:
# The argument in (...) tells the system which table to use to make the
#      bar graph. Since we want a relative frequency bar chart, which is the
#      correct label to replace the ellipsis (...) in this code?
dfrf = pd.DataFrame(...)
fig = px.bar(x=dfrf.index, y=dfrf['Relative Frequency'],
             title='Relative Frequency Distribution Bar Chart')

# Update axis labels
fig.update_layout(xaxis_title='...')
fig.update_layout(yaxis_title='...')

# Show the bar graph
fig.show()

ValueError: DataFrame constructor not properly called!

**2c)** Describe the distribution of major category. Point out 2 or 3 things that jump out at you, including a comment about your major category.

...

## **Question 3**

For question 3 you will analyze a quantitative variable. Find your variable based on your last name and use that variable when answering all parts of question 3.  

Once you find your variable description, scroll up to "About the Dataset" to find the variable name. Then look at the "Snippet of Data" to get the exact variable name, especially since variable names are case sensitive.

| Last Name | Variable                  |
|:----------|:--------------------------|
| A-L       | Grad_unemployment_rate    |
| M-Z       | Nongrad_unemployment_rate |



**3a)** Construct a histogram for your variable. Use number of bins = 12.

In [16]:
# Create the histogram, with the x-axis being the variable specified in the
#   table based on your last name.
fig = px.histogram(x=df['...'],nbins = ...,
                   title='...',
                   labels={'x':'...'})

# Update the vertical axis title.
fig.update_layout(yaxis_title='...')

# Print the histogram
fig.show()

KeyError: '...'

**3b)** Construct a boxplot for your variable.  

In [17]:
# Create the boxplot, with a title, and specify horizontal axis label.
px.box(x=df['...'],
       title='...',
       labels={'x':'...'})

KeyError: '...'

**3c)** Calculate the following summary statistics for your variable: 5 number summary, mean, and standard deviation. Round to three decimal places.
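
As a hedged illustration of what `describe()` reports, the sketch below runs it on a small made-up column. The `scores` DataFrame is hypothetical and not from the project dataset. The `min`, `25%`, `50%`, `75%`, and `max` rows together form the five-number summary, and `std` is the sample standard deviation.

```python
import pandas as pd

# Made-up quantitative data (hypothetical, not from the project dataset)
scores = pd.DataFrame({'score': [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 11.0]})

# describe() returns count, mean, std, min, quartiles, and max in one table
summary = scores[['score']].describe().round(3)
print(summary)

# Individual statistics can be looked up by row label
mean_value = summary.loc['mean', 'score']
median_value = summary.loc['50%', 'score']
```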

In [19]:
# Calculate the numerical summaries
# Indicate your variable.
descriptive_stats = df[['...']].describe().round(...)

# Print the results.
...

KeyError: "None of [Index([Ellipsis], dtype='object')] are in the [columns]"

**3d)** Use information from (3a), (3b), and (3c) to describe your variable in terms of shape, center, spread, and outliers.
* Use the correct center and the correct spread based on the shape of the distribution.
* Specify which center and which spread you are using. For example, say "The mean is ..." or "The median is ...", rather than "The center is ..."
* When addressing outliers, if any, list the values of **all** outliers.
* Include units, if any, for all numbers.
* Round to 3 decimal places.
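
One common way to identify outliers numerically is the 1.5 × IQR rule. The sketch below applies it to made-up values. The rule is an assumption here, since the source does not say which rule to use, and the `values` data is hypothetical. Plotting libraries may compute quartiles slightly differently from pandas, so the fences drawn on a box plot can differ a little from these.

```python
import pandas as pd

# Made-up quantitative data with one unusually large value (hypothetical)
values = pd.Series([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 30.0])

# Quartiles and interquartile range (pandas uses linear interpolation)
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences under the 1.5 * IQR rule (assumed rule, see the note above)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Keep only the observations that fall outside the fences
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(f"IQR = {iqr}, fences = ({lower_fence}, {upper_fence})")
print(outliers.tolist())
```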

...

**3e)** Interpret the standard deviation in context.

...

**3f)** Interpret the IQR in context.

...

## **QUESTION 4**

How do unemployment rates and median salaries compare for graduates and nongraduates?

Calculate the unemployment rate for graduates, the median salary for graduates, the unemployment rate for non-graduates, and the median salary for non-graduates for your major or intended major. Compare the results and answer a question about the code.

**4a)** Think about what your major is (or will be). Which of the major categories includes your major? Look at the table in part (2a) to find the name of the major category. Then, choose a 2nd, different, major category for comparison.  Calculate the unemployment rate for graduates, the median salary for graduates, the unemployment rate for non-graduates, and the median salary for non-graduates for your major or intended major. Round to two decimal places.
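
The pattern used in this part is to filter rows by a category and then average selected columns. It can be sketched on a small made-up table. The `pets` DataFrame and its column names are hypothetical and not from the project dataset.

```python
import pandas as pd

# Made-up data (hypothetical, not from the project dataset)
pets = pd.DataFrame({
    'species': ['cat', 'dog', 'cat', 'dog'],
    'weight_kg': [4.0, 20.0, 5.0, 30.0],
    'age_years': [2, 5, 4, 7],
})

# Step 1: a boolean Series, True where the row's species is 'cat'
is_cat = pets['species'] == 'cat'

# Step 2: keep only the rows where the condition is True
cat_rows = pets[is_cat]

# Step 3: select the quantitative columns and average each one
cat_means = cat_rows[['weight_kg', 'age_years']].mean().round(2)
print(cat_means)
```

Writing the three steps separately makes each intermediate result easy to inspect. The single-line form chains them in the same order.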

In [12]:
# heart_disease_means = df[df['heart disease variable name'] == 1]
# [['quant 1', 'quant 2', 'quant 3']].mean().round(...)

#df[df['qual var'] == 'major category'][['quant1', 'quant2', 'quant3', 'quant4']].mean().round(...)
MajCat1_means = df[df['...'] =='...'][['...', '...', '...', '...']].mean().round(...)

MajCat2_means = df[df['Major_category'] =='...'][['...', '...', '...', '...']].mean().round(...)

# Combine the two tables and specify labels.
combined_means = pd.DataFrame({'Grad Unemployment Rate': [MajCat1_means['Grad_unemployment_rate'], MajCat2_means['Grad_unemployment_rate']],
                               'Grad Median': [MajCat1_means['Grad_median'], MajCat2_means['Grad_median']],
                               'Nongrad Unemployment Rate': [MajCat1_means['Nongrad_unemployment_rate'], MajCat2_means['Nongrad_unemployment_rate']],
                               'Nongrad Median': [MajCat1_means['Nongrad_median'], MajCat2_means['Nongrad_median']]},
                             index =['Engineering', 'Education'])

# Print the table
# Replace ... with the name of the table. We are using print() because the
#   table is too wide to display fully otherwise.
print(...)

             Grad Unemployment Rate  Grad Median  Nongrad Unemployment Rate  \
Engineering                    0.04     94327.59                       0.05   
Education                      0.03     58437.50                       0.05   

             Nongrad Median  
Engineering        79927.59  
Education          44762.50  


**4b)** Compare the results for your two chosen major categories.

...

**4c)** In the table below are some snippets of code.

| **Last Name Initial** | **Code**                                    |
|-----------------------|----------------------------------------------|
| A-L                    | df['Major_category'] == 'Engineering'  |
| M-Z                    | df[df['Major_category'] == 'Education'][['Grad_unemployment_rate','Grad_median']]               |


Based on your last name, interpret the snippet of code.

...


## **QUESTION 5**

Generate a paragraph of at least 100 words to address one of the following questions. That is, answer only 5a or 5b, but not both.

**5a)** Discuss how analyzing your chosen data set using statistical methods could help you become better prepared for future courses in your major.

...

--OR--

**5b)** Discuss how analyzing your chosen data set using statistical methods could be instrumental in becoming better prepared for your future career.

...


<br><br>
### Once you are done and ready to submit, follow the instructions below to save as a PDF and submit to GradeScope.

### Save as PDF
1. Run all code one last time
2. File-> Print Preview opens in a new browser window
3. Verify you can see all graphs. If not, go back to step 1.
4. File -> Print (or Ctrl-P/Cmd-P)
5. Change destination to PDF (don't save, yet)
6. Scroll through preview to make sure you can see your graphs entirely. If not, click Cancel. Make the browser window narrower. Go back to step 4.
7. Repeat steps 4-6 until you can see your graphs completely. But do not make them too narrow.
8. Save the PDF, taking note of where it is saved.

### Submit to GradeScope
1. Log in to the Canvas course
2. Click on GradeScope in the course navigation.
3. If you see multiple courses in GradeScope, click on the STAT 108 course
4. Click on the "Tutorial 2 Practice Upload" assignment
5. Click on "Submit Work", select PDF
6. Select the PDF you just created
7. You need to tell GradeScope which page each problem answer/output is on. You should see a list of problems on the left, and a display of pages (thumbnails) on the right.
Assign pages to questions by clicking on the question number on the left, then clicking on all pages that question is on.
8. After ALL questions have been assigned to their respective page(s), click "Submit"
