Analysis of demographics and scores for the 2016 AP Exams. The CSV file was retrieved from the Kaggle dataset at https://www.kaggle.com/datasets/collegeboard/ap-scores

##**Import Libraries**

In this cell, we import pandas, numpy and matplotlib. 

*   Pandas is used for parsing the input CSV file that contains the AP exam scores from 2016.
*   Numpy is used for calculating mean(), median() and other statistical summaries.
*   Matplotlib is used to display score distributions across the male/female demographic.

In [None]:
### import packages and google drive
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from google.colab import drive

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
#open file
GDRIVE_BASE_DIR = '/content/drive/MyDrive/Colab Notebooks'
GDRIVE_DATA_DIR = GDRIVE_BASE_DIR + '/'
CSV_FILE = GDRIVE_DATA_DIR + 'AP_2016.csv'

print(GDRIVE_DATA_DIR)
print(CSV_FILE)

/content/drive/MyDrive/Colab Notebooks/
/content/drive/MyDrive/Colab Notebooks/AP_2016.csv


##**General Methods**
These methods are shared by the nested-dictionary ("Dictionary Dictionary") part and the Male vs. Female part of this project.

---



**skip_average()**

Identifies the lines in the CSV file that hold the average for each AP subject, so that allDict can skip them

**Parameters:**
*   arr - single line from input CSV file

**Returns:**
*  boolean indicating whether or not the line is an 'Average' line



In [None]:
def skip_average(arr):
  #the second field is the score label; 'Average' marks a summary row
  return arr[1] == "Average"

**percentDict**

Converts the AP score counts into percentages of each subject's 'All' total

**Parameters:**
*   ap_scores - dictionary of AP subjects where every value is another dictionary of counts keyed by score or gender, including an 'All' total

**Returns:**
*  dictionary of percent scores

In [None]:
#put all scores in terms of percents of All

def percentDict(ap_scores):
  percent_scores = {}

  for examName, examScore in ap_scores.items():
    total = float(examScore['All'])
    #build a new inner dict so the caller's raw counts are not overwritten
    percent_scores[examName] = {score: round((float(numPer) / total) * 100)
                                for score, numPer in examScore.items()}

  return percent_scores
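
The percentage step can be checked by hand against the ART HISTORY counts that mvf_all prints later in this notebook (8344 male, 16526 female, 24870 total). The following is a minimal sketch, not the notebook's own function: the helper name `to_percent` is hypothetical, and it assumes every inner dictionary carries an 'All' key.

```python
def to_percent(counts):
    """Return a new dict with each count as a rounded percent of 'All'."""
    total = counts['All']
    return {key: round(value / total * 100) for key, value in counts.items()}

# Counts copied from the mvf_all output shown later in this notebook
art_history = {'Male': 8344, 'Female': 16526, 'All': 24870}

print(to_percent(art_history))  # {'Male': 34, 'Female': 66, 'All': 100}
print(art_history)              # the input counts are left unchanged
```

Returning a new dictionary keeps the raw counts available for later steps that need them as denominators.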

**histogram**

Displays percent scores as a histogram

**Parameters:**
*   myDict - dictionary of AP subjects where every value is another dictionary of percent scores

**Returns:**
*  none

In [None]:
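def histogram(myDict):
  for subject, score_dist in myDict.items():
    #leave out the 'All' total without deleting it from the caller's dictionary
    newDict = {key: val for key, val in score_dist.items() if key != 'All'}
    plt.title(subject)
    plt.bar(list(newDict.keys()), list(newDict.values()), color = 'salmon')
    plt.show()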

##**Put everything into a Dictionary Dictionary**

**allDict**

Makes a dictionary of all AP subjects where the key is the subject name and the value is a dictionary mapping each score to the number of people who received it

**Parameters:**
*   f - csv file AP_2016.csv

**Returns:**
*  dictionary of scores

In [None]:
#read the file and make a dict for every subject
#and put all those dicts into an overarching dict

def allDict(f):
  subjects = {}

  for idx, line in enumerate(f):
    LOI = line.split(",")

    #skip the header row and the per-subject 'Average' rows
    if idx == 0 or skip_average(LOI):
      continue

    exam_name = LOI[0]
    exam_scores = subjects.get(exam_name, {})
    exam_scores[LOI[1]] = int(LOI[13])  #column 13 holds the all-students count
    subjects[exam_name] = exam_scores

  return subjects
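
Splitting on "," works only while no field contains a quoted comma. As a hedged alternative, the same grouping can be sketched with the standard-library `csv` module. The helper name `all_dict_csv` and the sample rows below are made up for illustration; the only layout assumed is the one allDict already relies on (subject in column 0, score label in column 1, count in column 13).

```python
import csv
import io

def all_dict_csv(f):
    """Variant of allDict that parses rows with csv.reader (hypothetical helper)."""
    subjects = {}
    reader = csv.reader(f)
    next(reader, None)  # skip the header row
    for row in reader:
        if row[1] == "Average":
            continue  # per-subject average rows are not score counts
        subjects.setdefault(row[0], {})[row[1]] = int(row[13])
    return subjects

# Made-up sample with 14 columns; filler zeros stand in for the unused columns
filler = ",".join(["0"] * 11)
sample = io.StringIO(
    "Exam,Score," + filler + ",All\n"
    "ART HISTORY,5," + filler + ",10\n"
    "ART HISTORY,All," + filler + ",40\n"
    "ART HISTORY,Average," + filler + ",3\n"
)
result = all_dict_csv(sample)
print(result)  # {'ART HISTORY': {'5': 10, 'All': 40}}
```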

In [None]:
#testing allDict

f = open(CSV_FILE, 'r')
ap_scores1 = allDict(f)
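#copy each inner dict too, so percentDict cannot overwrite the raw counts in ap_scores1
ap_scores = {name: dict(scores) for name, scores in ap_scores1.items()}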
print(ap_scores)

{'ART HISTORY': {'5': 11, '4': 22, '3': 28, '2': 28, '1': 11, 'All': 100}, 'BIOLOGY': {'5': 6, '4': 21, '3': 34, '2': 29, '1': 10, 'All': 100}, 'CALCULUS AB': {'5': 24, '4': 17, '3': 17, '2': 10, '1': 31, 'All': 100}, 'CALCULUS BC': {'5': 49, '4': 16, '3': 17, '2': 6, '1': 13, 'All': 100}, 'CHEMISTRY': {'5': 10, '4': 15, '3': 28, '2': 25, '1': 22, 'All': 100}, 'CHINESE LANGUAGE & CULTURE': {'5': 60, '4': 17, '3': 16, '2': 3, '1': 4, 'All': 100}, 'COMPUTER SCIENCE A': {'5': 20, '4': 20, '3': 23, '2': 13, '1': 24, 'All': 100}, 'MACROECONOMICS': {'5': 16, '4': 23, '3': 16, '2': 17, '1': 27, 'All': 100}, 'MICROECONOMICS': {'5': 15, '4': 27, '3': 23, '2': 14, '1': 20, 'All': 100}, 'ENGLISH LANGUAGE & COMPOSITION': {'5': 11, '4': 17, '3': 27, '2': 32, '1': 13, 'All': 100}, 'ENGLISH LITERATURE & COMPOSITION': {'5': 7, '4': 18, '3': 29, '2': 33, '1': 12, 'All': 100}, 'ENVIRONMENTAL SCIENCE': {'5': 8, '4': 23, '3': 15, '2': 26, '1': 29, 'All': 100}, 'EUROPEAN HISTORY': {'5': 7, '4': 16, '3': 29

In [None]:
#testing percentDict

percents = percentDict(ap_scores)
print(percents)
print(ap_scores1)
f.close()

{'ART HISTORY': {'5': 11, '4': 22, '3': 28, '2': 28, '1': 11, 'All': 100}, 'BIOLOGY': {'5': 6, '4': 21, '3': 34, '2': 29, '1': 10, 'All': 100}, 'CALCULUS AB': {'5': 24, '4': 17, '3': 17, '2': 10, '1': 31, 'All': 100}, 'CALCULUS BC': {'5': 49, '4': 16, '3': 17, '2': 6, '1': 13, 'All': 100}, 'CHEMISTRY': {'5': 10, '4': 15, '3': 28, '2': 25, '1': 22, 'All': 100}, 'CHINESE LANGUAGE & CULTURE': {'5': 60, '4': 17, '3': 16, '2': 3, '1': 4, 'All': 100}, 'COMPUTER SCIENCE A': {'5': 20, '4': 20, '3': 23, '2': 13, '1': 24, 'All': 100}, 'MACROECONOMICS': {'5': 16, '4': 23, '3': 16, '2': 17, '1': 27, 'All': 100}, 'MICROECONOMICS': {'5': 15, '4': 27, '3': 23, '2': 14, '1': 20, 'All': 100}, 'ENGLISH LANGUAGE & COMPOSITION': {'5': 11, '4': 17, '3': 27, '2': 32, '1': 13, 'All': 100}, 'ENGLISH LITERATURE & COMPOSITION': {'5': 7, '4': 18, '3': 29, '2': 33, '1': 12, 'All': 100}, 'ENVIRONMENTAL SCIENCE': {'5': 8, '4': 23, '3': 15, '2': 26, '1': 29, 'All': 100}, 'EUROPEAN HISTORY': {'5': 7, '4': 16, '3': 29

In [None]:
#write a new file for summary of the analysis

#the with block closes the file even if a write fails
with open('/content/drive/MyDrive/Colab Notebooks/summary.csv', 'w') as summary:
  for examName, examScore in percents.items():
    for score, perc in examScore.items():
      summary.write("{}, {}, {}\n".format(examName, score, perc))
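
Writing the rows through `csv.writer` is one way to get quoting handled automatically if a subject name ever contains a comma. This is only a sketch: it writes to an in-memory buffer instead of the Drive path, and `percents_sample` is a two-score excerpt of the ART HISTORY percentages printed above.

```python
import csv
import io

# Excerpt of the percent values printed above for ART HISTORY
percents_sample = {'ART HISTORY': {'5': 11, '4': 22}}

buffer = io.StringIO()  # stands in for the real summary file
writer = csv.writer(buffer, lineterminator="\n")
for exam_name, exam_scores in percents_sample.items():
    for score, perc in exam_scores.items():
        writer.writerow([exam_name, score, perc])

print(buffer.getvalue())
```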

##**Male Vs. Female Tests**

**mvf_all()**

Makes a dictionary for all AP subjects. Each key is a subject, and each value is a dictionary that maps 'Male', 'Female' and 'All' to the number of students in that group who took the exam

**Parameters:**
*   f - csv file AP_2016.csv

**Returns:**
*  dictionary of subjects

In [None]:
#all males vs. all females per test

def mvf_all(f):
  subjects = {}

  for idx, line in enumerate(f):
    LOI = line.split(",")

    #only the 'All' row of each subject carries the totals per gender
    if idx == 0 or LOI[1] != "All":
      continue

    exam_name = LOI[0]
    exam_gen = subjects.get(exam_name, {})
    exam_gen["Male"] = int(LOI[4])
    exam_gen["Female"] = int(LOI[5])
    exam_gen["All"] = int(LOI[13])
    subjects[exam_name] = exam_gen

  return subjects

In [None]:
#test mvf_all

f = open(CSV_FILE, 'r')
mvf1 = mvf_all(f)
mvf = mvf1.copy()
f.close()

print(mvf)

{'ART HISTORY': {'Male': 8344, 'Female': 16526, 'All': 24870}, 'BIOLOGY': {'Male': 90525, 'Female': 141451, 'All': 231976}, 'CALCULUS AB': {'Male': 148993, 'Female': 145463, 'All': 294456}, 'CALCULUS BC': {'Male': 65534, 'Female': 47161, 'All': 112695}, 'CHEMISTRY': {'Male': 72581, 'Female': 72211, 'All': 144792}, 'CHINESE LANGUAGE & CULTURE': {'Male': 4688, 'Female': 5568, 'All': 10256}, 'COMPUTER SCIENCE A': {'Male': 41737, 'Female': 12642, 'All': 54379}, 'MACROECONOMICS': {'Male': 68464, 'Female': 56415, 'All': 124879}, 'MICROECONOMICS': {'Male': 41321, 'Female': 29248, 'All': 70569}, 'ENGLISH LANGUAGE & COMPOSITION': {'Male': 201627, 'Female': 337730, 'All': 539357}, 'ENGLISH LITERATURE & COMPOSITION': {'Male': 147595, 'Female': 250110, 'All': 397705}, 'ENVIRONMENTAL SCIENCE': {'Male': 65952, 'Female': 81424, 'All': 147376}, 'EUROPEAN HISTORY': {'Male': 49608, 'Female': 57743, 'All': 107351}, 'FRENCH LANGUAGE & CULTURE': {'Male': 6288, 'Female': 13780, 'All': 20068}, 'GERMAN LANGUA

In [None]:
#write mvf in terms of percentages
    #for each test what percent of people were male vs. female

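#pass a copy of each inner dict so the raw counts in mvf stay intact
mvf_perc = percentDict({name: dict(gen) for name, gen in mvf.items()})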
print(mvf_perc)
print(mvf)
#histogram(mvf_perc)

{'ART HISTORY': {'Male': 34, 'Female': 66, 'All': 100}, 'BIOLOGY': {'Male': 39, 'Female': 61, 'All': 100}, 'CALCULUS AB': {'Male': 51, 'Female': 49, 'All': 100}, 'CALCULUS BC': {'Male': 58, 'Female': 42, 'All': 100}, 'CHEMISTRY': {'Male': 50, 'Female': 50, 'All': 100}, 'CHINESE LANGUAGE & CULTURE': {'Male': 46, 'Female': 54, 'All': 100}, 'COMPUTER SCIENCE A': {'Male': 77, 'Female': 23, 'All': 100}, 'MACROECONOMICS': {'Male': 55, 'Female': 45, 'All': 100}, 'MICROECONOMICS': {'Male': 59, 'Female': 41, 'All': 100}, 'ENGLISH LANGUAGE & COMPOSITION': {'Male': 37, 'Female': 63, 'All': 100}, 'ENGLISH LITERATURE & COMPOSITION': {'Male': 37, 'Female': 63, 'All': 100}, 'ENVIRONMENTAL SCIENCE': {'Male': 45, 'Female': 55, 'All': 100}, 'EUROPEAN HISTORY': {'Male': 46, 'Female': 54, 'All': 100}, 'FRENCH LANGUAGE & CULTURE': {'Male': 31, 'Female': 69, 'All': 100}, 'GERMAN LANGUAGE & CULTURE': {'Male': 51, 'Female': 49, 'All': 100}, 'GOVERNMENT & POLITICS: COMPARATIVE': {'Male': 50, 'Female': 50, 'All

**findAll4or5()**

Same as mvf_all(), except that instead of counting all males and females who took the exam, it only counts those who received a 4 or 5

**Parameters:**
*   f - csv file AP_2016.csv

**Returns:**
*  dictionary of subjects, each with the 'Male', 'Female' and 'All' counts of 4s and 5s

In [None]:
#for each test find num males that got a 4 or a 5
#do the same thing for females
#put all the subject dictionaries into a bigger dictionary

def findAll4or5(f):
  subjects = {}

  for line in f:
    LOI = line.split(",")

    #only the rows for scores of 4 and 5 are counted
    if LOI[1] not in ('4', '5'):
      continue

    exam_name = LOI[0]
    exam_gen = subjects.get(exam_name, {"Male": 0, "Female": 0, "All": 0})

    exam_gen['Male'] += int(LOI[4])
    exam_gen['Female'] += int(LOI[5])
    exam_gen['All'] += int(LOI[13])

    subjects[exam_name] = exam_gen

  return subjects

In [None]:
#print all45 for file f
f = open(CSV_FILE, 'r')
all45 = findAll4or5(f)
print(all45)
f.close()

{'ART HISTORY': {'Male': 2615, 'Female': 5676, 'All': 8291}, 'BIOLOGY': {'Male': 30362, 'Female': 32719, 'All': 63081}, 'CALCULUS AB': {'Male': 65720, 'Female': 56660, 'All': 122380}, 'CALCULUS BC': {'Male': 43932, 'Female': 28285, 'All': 72217}, 'CHEMISTRY': {'Male': 21921, 'Female': 13905, 'All': 35826}, 'CHINESE LANGUAGE & CULTURE': {'Male': 3472, 'Female': 4372, 'All': 7844}, 'COMPUTER SCIENCE A': {'Male': 17417, 'Female': 4752, 'All': 22169}, 'MACROECONOMICS': {'Male': 29943, 'Female': 18950, 'All': 48893}, 'MICROECONOMICS': {'Male': 19552, 'Female': 10635, 'All': 30187}, 'ENGLISH LANGUAGE & COMPOSITION': {'Male': 62341, 'Female': 89097, 'All': 151438}, 'ENGLISH LITERATURE & COMPOSITION': {'Male': 34809, 'Female': 64702, 'All': 99511}, 'ENVIRONMENTAL SCIENCE': {'Male': 24253, 'Female': 20982, 'All': 45235}, 'EUROPEAN HISTORY': {'Male': 12585, 'Female': 12258, 'All': 24843}, 'FRENCH LANGUAGE & CULTURE': {'Male': 2646, 'Female': 5559, 'All': 8205}, 'GERMAN LANGUAGE & CULTURE': {'Mal

**compareTo10()**

Takes the output of percentDict(mvf_all(f)) and returns two dictionaries: one with all the subjects where the difference between the male and female shares of test takers is at least 10 percentage points, and another where the difference is less than 10 percentage points

**Parameters:**
*   subjects - dictionary of AP subject where every value is another dictionary of percent genders

**Returns:**
*  dictionary of percent genders where diff >= 10%
*  dictionary of percent genders where diff < 10%

In [None]:
#make two dicts:
    #one with all the subjects where the |all m - all f| >= 10%
    #one with all the subjects where the |all m - all f| < 10%

def compareTo10(subjects):

  greater10 = {}
  less10 = {}

  for subject, gends in subjects.items():

    if abs(gends['Male'] - gends['Female']) >= 10:
      greater10[subject] = gends
    else:
      less10[subject] = gends

  return greater10, less10
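
The boundary case is worth seeing on real values: a gap of exactly 10 percentage points lands in the first dictionary. The sketch below uses three percentages copied from the mvf_perc output above; the function name `split_by_gap` and its `threshold` parameter are hypothetical and only mirror what compareTo10 does.

```python
def split_by_gap(subjects, threshold=10):
    """Split subjects by the absolute male/female gap in percentage points."""
    wide, narrow = {}, {}
    for subject, gends in subjects.items():
        gap = abs(gends['Male'] - gends['Female'])
        (wide if gap >= threshold else narrow)[subject] = gends
    return wide, narrow

# Percent values copied from the mvf_perc output above
sample = {
    'CALCULUS AB': {'Male': 51, 'Female': 49, 'All': 100},
    'MACROECONOMICS': {'Male': 55, 'Female': 45, 'All': 100},
    'COMPUTER SCIENCE A': {'Male': 77, 'Female': 23, 'All': 100},
}
wide, narrow = split_by_gap(sample)
print(sorted(wide))    # ['COMPUTER SCIENCE A', 'MACROECONOMICS']
print(sorted(narrow))  # ['CALCULUS AB']
```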

In [None]:
#test compareTo10
compG, compL = compareTo10(mvf_perc)

print(">= 10% : " + str(compG))
print("------------")
print("< 10% : " + str(compL))

>= 10% : {'ART HISTORY': {'Male': 34, 'Female': 66, 'All': 100}, 'BIOLOGY': {'Male': 39, 'Female': 61, 'All': 100}, 'CALCULUS BC': {'Male': 58, 'Female': 42, 'All': 100}, 'COMPUTER SCIENCE A': {'Male': 77, 'Female': 23, 'All': 100}, 'MACROECONOMICS': {'Male': 55, 'Female': 45, 'All': 100}, 'MICROECONOMICS': {'Male': 59, 'Female': 41, 'All': 100}, 'ENGLISH LANGUAGE & COMPOSITION': {'Male': 37, 'Female': 63, 'All': 100}, 'ENGLISH LITERATURE & COMPOSITION': {'Male': 37, 'Female': 63, 'All': 100}, 'ENVIRONMENTAL SCIENCE': {'Male': 45, 'Female': 55, 'All': 100}, 'FRENCH LANGUAGE & CULTURE': {'Male': 31, 'Female': 69, 'All': 100}, 'HUMAN GEOGRAPHY': {'Male': 44, 'Female': 56, 'All': 100}, 'ITALIAN LANGUAGE & CULTURE': {'Male': 36, 'Female': 64, 'All': 100}, 'JAPANESE LANGUAGE & CULTURE': {'Male': 44, 'Female': 56, 'All': 100}, 'PHYSICS C: ELECTRICITY & MAGNETISM': {'Male': 76, 'Female': 24, 'All': 100}, 'PHYSICS C: MECHANICS': {'Male': 72, 'Female': 28, 'All': 100}, 'PHYSICS 1': {'Male': 59,

**display45()**

Displays the 4/5 rates per gender for every subject

**Parameters:**
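*   mvf_perc - dictionary of percent genders per subject
*   comp - dictionary of the subjects to display (one of the two outputs of compareTo10)
*   all45 - dictionary of everyone who got a 4 or 5 on each exam, by gender (the numerator)
*   mvf - dictionary of the raw number of test takers by gender per subject (the denominator)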

**Returns:**
*  none

In [None]:
#display

    #mvf_perc = percentages for first two categories
    #comp = which subjects to display data for
    #all45 = numerator
    #mvf = denominator

def display45(mvf_perc, comp, all45, mvf):
  print("Subject, % Male, % Female, % Male with 5 or 4, % Female with 5 or 4")
  for exam in comp:
    allMale = mvf_perc[exam]['Male']
    allFemale = mvf_perc[exam]['Female']

    #floor division, so the rates are rounded down to whole percents
    male45 = (all45[exam]['Male'] * 100) // mvf[exam]['Male']
    female45 = (all45[exam]['Female'] * 100) // mvf[exam]['Female']

    print(f"{exam}: {allMale}%, {allFemale}%, {male45}%, {female45}%")
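
As a worked check of the 4/5 rate, the ART HISTORY counts from the all45 and mvf outputs above can be plugged in directly. The helper name `rate_4_or_5` is hypothetical; it only isolates the floor-division step used in display45.

```python
def rate_4_or_5(high_scorers, test_takers):
    """Percent of test takers who scored 4 or 5, rounded down by floor division."""
    return (high_scorers * 100) // test_takers

# ART HISTORY counts copied from the all45 and mvf outputs above
male_rate = rate_4_or_5(2615, 8344)
female_rate = rate_4_or_5(5676, 16526)
print(male_rate, female_rate)  # 31 34
```

Because `//` truncates, a rate such as 31.3% and one such as 31.9% would both print as 31%; `round()` would be the alternative if nearest-percent values are preferred.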

In [None]:
display45(mvf_perc, compG, all45, mvf)

Subject, % Male, % Female, % Male with 5 or 4, % Female with 5 or 4
ART HISTORY: 34%, 66%, 31%, 34%
BIOLOGY: 39%, 61%, 33%, 23%
CALCULUS BC: 58%, 42%, 67%, 59%
COMPUTER SCIENCE A: 77%, 23%, 41%, 37%
MACROECONOMICS: 55%, 45%, 43%, 33%
MICROECONOMICS: 59%, 41%, 47%, 36%
ENGLISH LANGUAGE & COMPOSITION: 37%, 63%, 30%, 26%
ENGLISH LITERATURE & COMPOSITION: 37%, 63%, 23%, 25%
ENVIRONMENTAL SCIENCE: 45%, 55%, 36%, 25%
FRENCH LANGUAGE & CULTURE: 31%, 69%, 42%, 40%
HUMAN GEOGRAPHY: 44%, 56%, 34%, 29%
ITALIAN LANGUAGE & CULTURE: 36%, 64%, 37%, 38%
JAPANESE LANGUAGE & CULTURE: 44%, 56%, 56%, 55%
PHYSICS C: ELECTRICITY & MAGNETISM: 76%, 24%, 57%, 49%
PHYSICS C: MECHANICS: 72%, 28%, 61%, 48%
PHYSICS 1: 59%, 41%, 22%, 10%
PHYSICS 2: 71%, 29%, 26%, 18%
PSYCHOLOGY: 35%, 65%, 44%, 44%
RESEARCH: 40%, 60%, 25%, 26%
SEMINAR: 40%, 60%, 15%, 19%
SPANISH LANGUAGE: 37%, 63%, 59%, 63%
SPANISH LITERATURE: 34%, 66%, 28%, 32%
STUDIO ART: DRAWING: 21%, 79%, 37%, 45%
STUDIO ART: 2-D DESIGN: 25%, 75%, 43%, 48%
STUDI

In [None]:
display45(mvf_perc, compL, all45, mvf)

Subject, % Male, % Female, % Male with 5 or 4, % Female with 5 or 4
CALCULUS AB: 51%, 49%, 44%, 38%
CHEMISTRY: 50%, 50%, 30%, 19%
CHINESE LANGUAGE & CULTURE: 46%, 54%, 74%, 78%
EUROPEAN HISTORY: 46%, 54%, 25%, 21%
GERMAN LANGUAGE & CULTURE: 51%, 49%, 37%, 43%
GOVERNMENT & POLITICS: COMPARATIVE: 50%, 50%, 45%, 36%
GOVERNMENT & POLITICS: U.S.: 47%, 53%, 29%, 22%
LATIN : 49%, 51%, 36%, 30%
MUSIC THEORY: 52%, 48%, 38%, 31%
STATISTICS: 48%, 52%, 39%, 31%
U.S. HISTORY: 46%, 54%, 31%, 28%
