# **[LS22] UC Berkeley Admission Rate**

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

<br/>

<hr style="border: 1px solid #fdb515;" />

## Part 2: Exploratory Data Analysis (EDA) and Observations


## Takeaways
In this lab, we aim to build data science and decision-making skills using the UC Berkeley 1973 Graduate Admission Rate dataset. The objectives of this lab are as follows:


- **Heuristics**:
  - Understand and apply heuristics in decision-making, while identifying and avoiding biases like Base Rate Neglect, Representativeness Heuristic, Conjunction Fallacy, and Availability Heuristic.
  - Learn the fundamentals of Bayesian reasoning to enhance judgment and decision-making skills.

- **Confirmation Bias**:
  - Be aware of the tendency to favor existing beliefs, even against evidence.
  - Learn about selective exposure and biased assimilation, and how to mitigate confirmation bias by seeking counter-evidence.

- **When is Science Suspect**:
  - Recognize the potential to use science for social and political ends.
  - Be cautious of science that studies human groups and is subsequently used to validate societal power structures.
  - Recognize how one's own involvement in the social dynamic affects the assessment of any study of human groups.

A university's admissions process is tied to many aspects of society and often reflects societal values and dynamics. For this part of the assignment, we will be working with a segment of **UC Berkeley's 1973 graduate admission data** to further explore how gender (recorded in 1973 as binary: Female and Male) plays a role in admission.

In *Part 1: Observation and Instrumentation*, we explored the data through a less objective lens: we made observations and claims, and assigned credence levels to what we thought the dataset represented, simply by glancing at the raw data. In this part, we will dive deeper into the data by performing Exploratory Data Analysis.

Exploratory Data Analysis (EDA) is the process of analyzing and summarizing data to extract insights and patterns that guide further analysis. It is usually performed at the beginning of a data science project, where it helps us understand the data, identify patterns or anomalies, and detect potential issues that may affect later analysis. In the following problems, we will perform EDA on our admission rates dataset.
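As a rough illustration of what a first EDA pass can look like, here is a minimal sketch on a tiny made-up table. The DataFrame `toy` and its columns `group` and `outcome` are hypothetical and are not taken from `berkeley.csv`.

```python
import pandas as pd

# Hypothetical toy data (not from berkeley.csv), used only to illustrate EDA steps.
toy = pd.DataFrame({
    "group": ["X", "X", "Y", "Y", "Y", "X"],
    "outcome": ["yes", "no", "yes", "yes", "no", "no"],
})

# 1. Size of the table: (rows, columns).
n_rows, n_cols = toy.shape

# 2. Missing values per column, a common source of problems later on.
missing_per_column = toy.isna().sum()

# 3. How often each category appears.
outcome_counts = toy["outcome"].value_counts()

print(n_rows, n_cols)
print(missing_per_column)
print(outcome_counts)
```

Checks like these are only a starting point; which summaries are worth computing depends on the dataset and the question being asked.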

## **Question 1: Exploring the Data**

Let's jog our memories from *Part 1* and perform EDA on our data.

**Question 1.1)** As in the last lab, load the `berkeley.csv` dataset below.

In [3]:
berkeley = ...
berkeley.head()

As we saw in *Part 1*, relying only on our senses limits our ability to accurately describe the story that our dataset is trying to tell. Thus, let's utilize instruments such as `pandas`, `numpy`, and `matplotlib` to paint a better picture!

**Question 1.2)** Using *`berkeley.csv`*, calculate the admission rates of Female vs. Male applicants. Note that `female_admission_rate` and `male_admission_rate` should have `float` values. *Hint: Conditional expressions may be useful here*

In [None]:
...

female_admission_rate = ...
male_admission_rate = ...

print(f"Admission rate for female applicants: {female_admission_rate}\nAdmission rate for male applicants: {male_admission_rate}")

**Question 1.3)** Using the `berkeley` dataframe, create a pivot table `admissions_by_gender` that displays the totals for accepted and rejected admissions per gender.

In [None]:
admissions_by_gender = ...
admissions_by_gender

**Question 1.4)** In the same `admissions_by_gender` table, add a column `"Acceptance Rate"`, which contains the acceptance rates per row.

In [None]:
admissions_by_gender["Acceptance Rate"] = ...
admissions_by_gender
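For reference, the general pattern behind Questions 1.3 and 1.4 (count outcomes per group, then derive a rate column) can be sketched on made-up data. The column names `Gender` and `Admission` and their values are assumptions made for this illustration and may not match `berkeley.csv`.

```python
import pandas as pd

# Hypothetical toy data; the column names "Gender" and "Admission" are assumptions.
toy = pd.DataFrame({
    "Gender": ["F", "F", "F", "M", "M", "M", "M"],
    "Admission": ["Accepted", "Rejected", "Rejected",
                  "Accepted", "Accepted", "Rejected", "Rejected"],
})

# Count rows for each (Gender, Admission) pair; missing pairs become 0.
toy_pivot = toy.pivot_table(
    index="Gender", columns="Admission", aggfunc="size", fill_value=0
)

# Acceptance rate per row = accepted / (accepted + rejected).
toy_pivot["Acceptance Rate"] = toy_pivot["Accepted"] / (
    toy_pivot["Accepted"] + toy_pivot["Rejected"]
)
print(toy_pivot)
```

Computing the rate from the two count columns, rather than from a precomputed total, keeps the table consistent if a count is later corrected.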

**Question 1.5)** Referring back to the results we obtained throughout Question 1, what do we observe about the admission trends of male and female applicants?

*Your Answer Here*

## **Question 2: Visualizing the Data**

**Question 2.1)** Using the `admissions_by_gender` table, create a `barh` plot comparing the acceptance rates across genders. Don't forget to title your plot and label your axes!

In [None]:
...

**Question 2.2)** Using the same pivot table, create a stacked `bar` plot that visualizes this information. Your plot should clearly distinguish between accepted and rejected counts by stacking them within the same bar for each gender category. `matplotlib`'s `bar` documentation may be useful here: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.bar.html.

In [None]:
...
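The stacking mechanics in Question 2.2 rely on the `bottom` parameter of `bar`, which sets where each bar segment starts. A minimal sketch with made-up counts (the categories and numbers below are hypothetical, not taken from `berkeley.csv`):

```python
import matplotlib.pyplot as plt

# Hypothetical counts for two made-up categories (not from berkeley.csv).
categories = ["X", "Y"]
accepted = [30, 45]
rejected = [70, 55]

fig, ax = plt.subplots()

# Draw the first layer, then start the second layer where the first one ends.
ax.bar(categories, accepted, label="Accepted")
ax.bar(categories, rejected, bottom=accepted, label="Rejected")

ax.set_title("Accepted vs. rejected counts (toy data)")
ax.set_xlabel("Category")
ax.set_ylabel("Number of applicants")
ax.legend()
```

The total height of each bar is then the total count for that category, with the two colors showing how it splits.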

**Question 2.3)** Looking at the visualizations we just created, what observations can you make about the admission trends between male and female applicants?

*Your Answer Here*

## **Question 3: Simpson's Paradox**

Now that we've explored the data with respect to gender and admission rates, let's take a look at how major plays a role in admission rates.

**Question 3.1)** Create a pivot table `admissions_by_major` that displays the totals for accepted and rejected admissions per major.

In [None]:
admissions_by_major = ...
admissions_by_major

**Question 3.2)** Building on the `admissions_by_major` table, add a `"Counts"` column and an `"Acceptance Rate"` column, which show the total number of applicants and the acceptance rate for each major, respectively.

In [None]:
admissions_by_major['Counts'] = ...
admissions_by_major["Acceptance Rate"] = ...
admissions_by_major

**Question 3.3)** Let's use data visualizations to help us really understand the data. Create a stacked `bar` plot similar to the one in Question 2.2, plotting the stacked accepted and rejected counts of Majors A-F (excluding `"Other"`).

In [None]:
...

**Question 3.4)** Using the stacked `bar` plot and the `admissions_by_major` table, what observations can we make about the admission trends for each major? For reference, majors "A-F are the six majors with the most applicants in Fall 1973", as stated here: https://discovery.cs.illinois.edu/dataset/berkeley/.

*Your Answer Here*

Now that we've explored Gender and Major separately, let's analyze admission rates taking both into account.

**Question 3.5)** Using the `berkeley` dataset, calculate the number of acceptances, rejections, total applicants, and acceptance rate for each combination of major and gender. Your resulting DataFrame should have the following columns: Major, Gender, Accepted, Rejected, Counts, and Acceptance Rate.

In [None]:
...

admissions_by_all = ...
admissions_by_all

**Question 3.6)** Construct a bar plot that visualizes the admission rates by gender within each major. We recommend using `seaborn` for this particular task, which was imported for you at the start.

In [None]:
...

**Question 3.7)** Looking at the bar plot you made above, what do you conclude about the admission rates of male and female applicants? Is there any noticeable discrepancy between them, given the rates per major? How is it different from the observations you made in Questions 1 and 2?

*Your Answer Here*

**Question 3.8)** How can making premature observations and claims be harmful when performing exploratory data analysis? What assumptions did you mistakenly make throughout this notebook?

*Your Answer Here*
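Simpson's Paradox, the subject of this question, describes a trend that appears in aggregated data but weakens or reverses once the data are split into subgroups. The sketch below uses made-up counts chosen to produce such a reversal. The departments `Easy` and `Hard`, the groups `P` and `Q`, and all numbers are hypothetical and are not the Berkeley figures.

```python
import pandas as pd

# Made-up counts chosen to produce a reversal; these are NOT the Berkeley numbers.
toy = pd.DataFrame({
    "Dept":     ["Easy", "Easy", "Hard", "Hard"],
    "Group":    ["P", "Q", "P", "Q"],
    "Accepted": [80, 18, 5, 30],
    "Applied":  [100, 20, 20, 100],
})

# Rate within each department: Q is ahead in both.
toy["Rate"] = toy["Accepted"] / toy["Applied"]

# Aggregate rate per group, ignoring department: P is ahead overall.
overall = toy.groupby("Group")[["Accepted", "Applied"]].sum()
overall["Rate"] = overall["Accepted"] / overall["Applied"]

print(toy)
print(overall)
```

In this toy table the reversal arises because most of group `Q` applies to the department with the lower acceptance rate, so the aggregate rate mixes two very different application patterns.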

INSTRUCTOR ONLY: Continue with more visualizations then get into some predictive modeling so students can play around with Simpson's Paradox themselves

## **Question 4: Predictive Modeling**

To be completed... (Could be exploring other datasets that exhibit Simpson's Paradox, datasets with college application data, or some sort of modeling component)

## **Question 5: Closing Thoughts**

To be completed... (Will mostly be short answer responses about more general implications of what we explored here; EX: what are the consequences of a data analyst performing incomplete EDA)

          /\\_/\\      
         / o o  \\     
        (   "    ))    
         \\~(*)~//     
          \\~~~//      
