<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 1: Standardized Test Analysis

### 1. Problem Statement/ Background

The SAT and ACT are standardized tests that many colleges and universities in the United States require for admission. Since the ACT was introduced, the two tests have been [rivals](https://www.bestcolleges.com/blog/history-of-act/). Despite their [differences](https://www.crimsoneducation.org/sg/blog/test-prep/sat-vs-act-whats-the-difference/), online resources exist for converting between SAT and ACT scores; see, for example, [The Princeton Review](https://www.princetonreview.com/college-advice/act-to-sat-conversion) and [Crimson Education](https://www.crimsoneducation.org/sg/blog/test-prep/sat-vs-act-whats-the-difference/).

This project aims to:
1. examine the reliability of the SAT and ACT concordance table, taken from the official websites of the respective test boards, by comparing it with college admission scores; and
2. determine which test each state performed better on, based on the concordance table.

Given the findings, the project seeks to inform high school students:
- of the reliability of the SAT and ACT concordance table as they work towards their dream college; and
- of the college admission statistics implied by the SAT and ACT concordance table, given their geographical location.

### 2. Datasets

The project will make use of the following datasets for analysis:
1. [`act_2019.csv`](./data/act_2019.csv): 2019 ACT Scores by State ([*source*](https://blog.prepscholar.com/act-scores-by-state-averages-highs-and-lows))
2. [`sat_2019.csv`](./data/sat_2019.csv): 2019 SAT Scores by State ([*source*](https://blog.prepscholar.com/average-sat-scores-by-state-most-recent))
3. [`sat_act_by_college.csv`](./data/sat_act_by_college.csv): Ranges of Accepted ACT & SAT Student Scores by Colleges ([*source*](https://www.compassprep.com/college-profiles/))
4. [`sat_act_score_convertor.csv`](./data/sat_act_score_convertor.csv): ACT & SAT Student Scores Concordance Table from official websites (sources: [ACT](https://www.act.org/content/act/en/products-and-services/the-act/scores/act-sat-concordance.html) & [SAT](https://satsuite.collegeboard.org/higher-ed-professionals/score-reports/score-comparisons/sat-act))

### 3. Data Import and Cleaning

In [1]:
# Importing all the relevant packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Importing all the datasets
act_2019 = pd.read_csv('../data/act_2019.csv')
sat_2019 = pd.read_csv('../data/sat_2019.csv')
sat_act_by_college = pd.read_csv('../data/sat_act_by_college.csv')
sat_act_score_convertor = pd.read_csv('../data/sat_act_score_convertor.csv')

#### 3.1 Individual Dataset

We shall proceed to:
1. examine each of the datasets
2. prepare each dataset for further analysis based on the observations made

#### 3.1.1 'act_2019' Dataset

In [3]:
act_2019.head()

Unnamed: 0,State,Participation,Composite
0,Alabama,100%,18.9
1,Alaska,38%,20.1
2,Arizona,73%,19.0
3,Arkansas,100%,19.3
4,California,23%,22.6


In [4]:
act_2019.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   State          52 non-null     object 
 1   Participation  52 non-null     object 
 2   Composite      52 non-null     float64
dtypes: float64(1), object(2)
memory usage: 1.3+ KB


In [5]:
# Listing the unique states in order to compare with the sat_2019 dataset
print(act_2019['State'].unique())
print(f'The number of unique states is {act_2019.State.nunique()}.')

['Alabama' 'Alaska' 'Arizona' 'Arkansas' 'California' 'Colorado'
 'Connecticut' 'Delaware' 'District of Columbia' 'Florida' 'Georgia'
 'Hawaii' 'Idaho' 'Illinois' 'Indiana' 'Iowa' 'Kansas' 'Kentucky'
 'Louisiana' 'Maine' 'Maryland' 'Massachusetts' 'Michigan' 'Minnesota'
 'Mississippi' 'Missouri' 'Montana' 'Nebraska' 'Nevada' 'New Hampshire'
 'New Jersey' 'New Mexico' 'New York' 'North Carolina' 'North Dakota'
 'Ohio' 'Oklahoma' 'Oregon' 'Pennsylvania' 'Rhode Island' 'South Carolina'
 'South Dakota' 'Tennessee' 'Texas' 'Utah' 'Vermont' 'Virginia'
 'Washington' 'West Virginia' 'Wisconsin' 'Wyoming' 'National']
The number of unique states is 52.


#### Observations

The dataset has no missing or suspicious values.

There are three things that we want to adjust in the 'act_2019.csv' dataset:
1. Drop the % sign in the participation rate column and convert its values to integers.
2. Round the composite score to an integer, as an ACT score is [only meaningful](https://www.quadeducationgroup.com/blog/act-scores-everything-you-need-to-know) as a whole number.
3. Rename the columns to be consistent with the 'sat_2019.csv' dataset.

In [6]:
act_2019['Participation'] = act_2019['Participation'].str.replace('%','').astype(int)
act_2019.head()

Unnamed: 0,State,Participation,Composite
0,Alabama,100,18.9
1,Alaska,38,20.1
2,Arizona,73,19.0
3,Arkansas,100,19.3
4,California,23,22.6


In [7]:
act_2019['Composite'] = act_2019['Composite'].round(0).astype(int)
act_2019.head()

Unnamed: 0,State,Participation,Composite
0,Alabama,100,19
1,Alaska,38,20
2,Arizona,73,19
3,Arkansas,100,19
4,California,23,23


In [8]:
act_2019.rename(columns={'State': 'state',
                         'Participation': 'act_participation_%',
                         'Composite': 'act_composite'}, inplace=True)
act_2019.head()

Unnamed: 0,state,act_participation_%,act_composite
0,Alabama,100,19
1,Alaska,38,20
2,Arizona,73,19
3,Arkansas,100,19
4,California,23,23


#### 3.1.2 'sat_2019' Dataset

In [9]:
sat_2019.head()

Unnamed: 0,State,Participation Rate,EBRW,Math,Total
0,Alabama,7%,583,560,1143
1,Alaska,41%,556,541,1097
2,Arizona,31%,569,565,1134
3,Arkansas,6%,582,559,1141
4,California,63%,534,531,1065


In [10]:
sat_2019.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53 entries, 0 to 52
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   State               53 non-null     object
 1   Participation Rate  53 non-null     object
 2   EBRW                53 non-null     int64 
 3   Math                53 non-null     int64 
 4   Total               53 non-null     int64 
dtypes: int64(3), object(2)
memory usage: 2.2+ KB


In [11]:
# List the unique states so they can be compared with the act_2019 dataset
print(sat_2019['State'].unique())
print(f'The number of unique states is {sat_2019.State.nunique()}.')
# This dataset has two entries that the act_2019 dataset lacks (namely, Puerto Rico and Virgin Islands)
# The act_2019 dataset, in turn, has a national average that this dataset lacks

['Alabama' 'Alaska' 'Arizona' 'Arkansas' 'California' 'Colorado'
 'Connecticut' 'Delaware' 'District of Columbia' 'Florida' 'Georgia'
 'Hawaii' 'Idaho' 'Illinois' 'Indiana' 'Iowa' 'Kansas' 'Kentucky'
 'Louisiana' 'Maine' 'Maryland' 'Massachusetts' 'Michigan' 'Minnesota'
 'Mississippi' 'Missouri' 'Montana' 'Nebraska' 'Nevada' 'New Hampshire'
 'New Jersey' 'New Mexico' 'New York' 'North Carolina' 'North Dakota'
 'Ohio' 'Oklahoma' 'Oregon' 'Pennsylvania' 'Puerto Rico' 'Rhode Island'
 'South Carolina' 'South Dakota' 'Tennessee' 'Texas' 'Utah' 'Vermont'
 'Virgin Islands' 'Virginia' 'Washington' 'West Virginia' 'Wisconsin'
 'Wyoming']
The number of unique states is 53.


#### Observations

The dataset has no missing or suspicious values.

There are four things that we want to adjust in the 'sat_2019.csv' dataset:

1. Delete the Puerto Rico and Virgin Islands rows from the sat_2019 dataset. Although they are US territories, they are not considered [US states](https://en.wikipedia.org/wiki/List_of_U.S._state_and_territory_abbreviations).
2. Drop the % sign in the participation rate column and convert its values to integers.
3. Compute a national average for the sat_2019 dataset from the remaining rows. We have no information on how the national average in the act_2019 dataset was calculated, so an unweighted mean is the next best alternative.
4. Rename the columns to be consistent with the 'act_2019.csv' dataset.

In [12]:
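# Keep every row except the two territories; .copy() avoids a SettingWithCopyWarning on later column assignments
sat_2019 = sat_2019[~sat_2019['State'].isin(['Puerto Rico', 'Virgin Islands'])].copy()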
print(sat_2019['State'].unique())
print(f'The number of unique states is {sat_2019.State.nunique()}.')

['Alabama' 'Alaska' 'Arizona' 'Arkansas' 'California' 'Colorado'
 'Connecticut' 'Delaware' 'District of Columbia' 'Florida' 'Georgia'
 'Hawaii' 'Idaho' 'Illinois' 'Indiana' 'Iowa' 'Kansas' 'Kentucky'
 'Louisiana' 'Maine' 'Maryland' 'Massachusetts' 'Michigan' 'Minnesota'
 'Mississippi' 'Missouri' 'Montana' 'Nebraska' 'Nevada' 'New Hampshire'
 'New Jersey' 'New Mexico' 'New York' 'North Carolina' 'North Dakota'
 'Ohio' 'Oklahoma' 'Oregon' 'Pennsylvania' 'Rhode Island' 'South Carolina'
 'South Dakota' 'Tennessee' 'Texas' 'Utah' 'Vermont' 'Virginia'
 'Washington' 'West Virginia' 'Wisconsin' 'Wyoming']
The number of unique states is 51.


In [13]:
sat_2019['Participation Rate'] = sat_2019['Participation Rate'].str.replace('%','').astype(int)
sat_2019.head()

Unnamed: 0,State,Participation Rate,EBRW,Math,Total
0,Alabama,7,583,560,1143
1,Alaska,41,556,541,1097
2,Arizona,31,569,565,1134
3,Arkansas,6,582,559,1141
4,California,63,534,531,1065


In [14]:
# Unweighted mean across the remaining rows; astype(int) truncates the decimals
nat_avg = {'State':'National',
           'Participation Rate':sat_2019['Participation Rate'].mean().astype(int),
           'EBRW':sat_2019['EBRW'].mean().astype(int),
           'Math':sat_2019['Math'].mean().astype(int),
           'Total':sat_2019['Total'].mean().astype(int)}
# DataFrame.append is deprecated (and removed in pandas 2.0); pd.concat is the supported equivalent
sat_2019 = pd.concat([sat_2019, pd.DataFrame([nat_avg])], ignore_index=True)
sat_2019.tail()


Unnamed: 0,State,Participation Rate,EBRW,Math,Total
47,Washington,70,539,535,1074
48,West Virginia,99,483,460,943
49,Wisconsin,3,635,648,1283
50,Wyoming,3,623,615,1238
51,National,49,560,552,1113


In [15]:
sat_2019.rename(columns={'State': 'state',
                         'Participation Rate': 'sat_participation_%',
                         'EBRW': 'sat_ebrw',
                         'Math': 'sat_math',
                         'Total':'sat_total'}, inplace=True)
sat_2019.head()

Unnamed: 0,state,sat_participation_%,sat_ebrw,sat_math,sat_total
0,Alabama,7,583,560,1143
1,Alaska,41,556,541,1097
2,Arizona,31,569,565,1134
3,Arkansas,6,582,559,1141
4,California,63,534,531,1065


#### 3.1.3 'sat_act_by_college' Dataset

In [16]:
sat_act_by_college.head()

Unnamed: 0,School,Test Optional?,Applies to Class Year(s),Policy Details,Number of Applicants,Accept Rate,SAT Total 25th-75th Percentile,ACT Total 25th-75th Percentile
0,Stanford University,Yes,2021,Stanford has adopted a one-year test optional ...,47452,4.3%,1440-1570,32-35
1,Harvard College,Yes,2021,Harvard has adopted a one-year test optional p...,42749,4.7%,1460-1580,33-35
2,Princeton University,Yes,2021,Princeton has adopted a one-year test optional...,35370,5.5%,1440-1570,32-35
3,Columbia University,Yes,2021,Columbia has adopted a one-year test optional ...,40203,5.5%,1450-1560,33-35
4,Yale University,Yes,2021,Yale has adopted a one-year test optional poli...,36844,6.1%,1460-1570,33-35


In [17]:
sat_act_by_college.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 416 entries, 0 to 415
Data columns (total 8 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   School                          416 non-null    object
 1   Test Optional?                  416 non-null    object
 2   Applies to Class Year(s)        390 non-null    object
 3   Policy Details                  416 non-null    object
 4   Number of Applicants            416 non-null    int64 
 5   Accept Rate                     416 non-null    object
 6   SAT Total 25th-75th Percentile  416 non-null    object
 7   ACT Total 25th-75th Percentile  416 non-null    object
dtypes: int64(1), object(7)
memory usage: 26.1+ KB


#### Observations

There are null values in the dataset, but only in a column that is not required for analysis.

There are four things that we want to adjust in the 'sat_act_by_college.csv' dataset:

1. Drop the rows of colleges that do not report both ACT and SAT scores, as they cannot be compared against the ACT and SAT scores in the concordance table.
2. Extract the 25th percentile score from the ACT and SAT score range columns. Since the analysis is meant to gauge the scores needed on both tests to enter each college, the 25th percentile is the more useful bound: the lower the percentile, the closer it is to the minimum score required for admission.
3. Convert the 25th percentile values of the ACT and SAT score columns to integers for analysis.
4. Rename the columns to reflect the values they now hold.
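
The first three steps above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the notebook's own cleaning code: the helper name `lower_bound` and the column names `sat_range` and `act_range` are hypothetical, and the sample strings are taken from the column outputs shown in this section, paired arbitrarily rather than belonging to real colleges.

```python
import pandas as pd

def lower_bound(ranges: pd.Series) -> pd.Series:
    """Return the lower ('25th percentile') bound of 'low-high' strings.

    Zero-width spaces and padding are removed first. Entries such as '--'
    (no score reported) become NaN so the caller can drop those rows.
    """
    cleaned = ranges.str.replace("\u200b", "", regex=False).str.strip()
    low = cleaned.str.split("-").str[0]  # '--' yields an empty string here
    return pd.to_numeric(low, errors="coerce").round()

# Illustrative rows only (hypothetical pairings of strings seen in the outputs)
colleges = pd.DataFrame({
    "sat_range": ["1440-1570", "\u200b\u200b 1530-1560", "1283-1510", "1250-1420"],
    "act_range": ["32-35", "35-36", "--", "19.3-25.3"],
})

colleges["sat_25th"] = lower_bound(colleges["sat_range"])
colleges["act_25th"] = lower_bound(colleges["act_range"])

# Step 1: drop colleges missing either score, then step 3: convert to integers
colleges = (
    colleges.dropna(subset=["sat_25th", "act_25th"])
            .astype({"sat_25th": int, "act_25th": int})
)
print(colleges[["sat_25th", "act_25th"]])
```

Splitting on the hyphen, rather than slicing a fixed number of characters, keeps the parsing independent of how many digits each bound has.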

In [18]:
sat_act_by_college['SAT Total 25th-75th Percentile'].unique()

array(['1440-1570', '1460-1580', '1450-1560', '1460-1570',
       '\u200b\u200b 1530-1560', '\u200b\u200b 1500-1570',
       '\u200b\u200b 1440-1570', '\u200b\u200b 1490-1570', '1390-1540',
       '1440-1560', '1330-1520', '1450-1570', '1380-1540', '1440-1550',
       '1460-1560', '1360-1510', '1470-1560', '1400-1560', '1410-1550',
       '1340-1520', '1270-1480', '1290-1510', '1360-1520', '1350-1530',
       '1350-1510', '1340-1490', '1300-1480', '1490-1570', '1370-1530',
       '1300-1530', '1470-1570', '1360-1530', '1400-1550', '1283-1510',
       '1370-1510', '1320-1510', '1370-1520', '1310-1485', '1350-1520',
       '1360-1540', '1300-1510', '1348-1490', '1330-1500', '1300-1490',
       '1340-1530', '1180-1440', '1340-1500', '1270-1450', '1280-1420',
       '1370-1490', '1320-1470', '1290-1460', '1250-1420', '1290-1450',
       '1300-1500', '1260-1460', '1320-1490', '1220-1400', '1250-1470',
       '1250-1460', '1280-1500', '1250-1440', '1220-1380', '1240-1470',
       '1333-1490'

In [19]:
sat_act_by_college['SAT Total 25th-75th Percentile'] = sat_act_by_college['SAT Total 25th-75th Percentile'].str.replace('\u200b','')

In [20]:
sat_act_by_college['SAT Total 25th-75th Percentile'] = sat_act_by_college['SAT Total 25th-75th Percentile'].str.strip()

In [21]:
sat_act_by_college['SAT Total 25th-75th Percentile'].unique()

array(['1440-1570', '1460-1580', '1450-1560', '1460-1570', '1530-1560',
       '1500-1570', '1490-1570', '1390-1540', '1440-1560', '1330-1520',
       '1450-1570', '1380-1540', '1440-1550', '1460-1560', '1360-1510',
       '1470-1560', '1400-1560', '1410-1550', '1340-1520', '1270-1480',
       '1290-1510', '1360-1520', '1350-1530', '1350-1510', '1340-1490',
       '1300-1480', '1370-1530', '1300-1530', '1470-1570', '1360-1530',
       '1400-1550', '1283-1510', '1370-1510', '1320-1510', '1370-1520',
       '1310-1485', '1350-1520', '1360-1540', '1300-1510', '1348-1490',
       '1330-1500', '1300-1490', '1340-1530', '1180-1440', '1340-1500',
       '1270-1450', '1280-1420', '1370-1490', '1320-1470', '1290-1460',
       '1250-1420', '1290-1450', '1300-1500', '1260-1460', '1320-1490',
       '1220-1400', '1250-1470', '1250-1460', '1280-1500', '1250-1440',
       '1220-1380', '1240-1470', '1333-1490', '1280-1450', '1300-1460',
       '1210-1380', '1110-1320', '1270-1460', '1260-1430', '1255

In [22]:
sat_act_by_college['ACT Total 25th-75th Percentile'].unique()

array(['32-35', '33-35', '35-36', '34-36', '31-34', '31-35', '29-33',
       '27-34', '31-33', '30-34', '30-33', '28-34', '32-34', '29-34',
       '27-33', '26-32', '28-32', '--', '29-32', '25-33', '26-33',
       '28-33', '27-31', '28-31', '22-29', '19-24', '27-32', '22-27',
       '26-30', '27-30', '15-20', '20-26', '25-32', '25-31', '24-30',
       '22-26', '19-25', '26-31', '23-29', '25-30', '24-29', '22-28',
       '20-25', '24-31', '19-26', '19-27', '21-27', '21-26', '24-33',
       '16-22', '21-28', '22-30', '23-30', '21-29', '23-28', '23-31',
       '20-27', '17-23', '24-28.5', '15-19', '24-28', '17-22', '20-29',
       '18-24', '19-28', '20-28', '19.3-25.3', '25-29', '17-24', '17-25',
       '18-25'], dtype=object)

In [23]:
sat_act_by_college = sat_act_by_college[(sat_act_by_college["SAT Total 25th-75th Percentile"]  != '--')]
sat_act_by_college = sat_act_by_college[(sat_act_by_college["ACT Total 25th-75th Percentile"]  != '--')]

In [24]:
sat_act_by_college['ACT Total 25th-75th Percentile'].unique()

array(['32-35', '33-35', '35-36', '34-36', '31-34', '31-35', '29-33',
       '27-34', '31-33', '30-34', '30-33', '28-34', '32-34', '29-34',
       '27-33', '26-32', '28-32', '29-32', '25-33', '26-33', '28-33',
       '27-31', '28-31', '22-29', '19-24', '27-32', '22-27', '26-30',
       '27-30', '15-20', '20-26', '25-32', '25-31', '24-30', '22-26',
       '19-25', '26-31', '23-29', '25-30', '24-29', '22-28', '20-25',
       '24-31', '19-26', '19-27', '21-27', '21-26', '24-33', '16-22',
       '21-28', '22-30', '23-30', '21-29', '23-28', '23-31', '20-27',
       '17-23', '24-28.5', '15-19', '24-28', '17-22', '20-29', '18-24',
       '19-28', '20-28', '19.3-25.3', '25-29', '17-24', '17-25', '18-25'],
      dtype=object)

In [25]:
sat_act_by_college['SAT Total 25th-75th Percentile'] = sat_act_by_college['SAT Total 25th-75th Percentile'].map(lambda x: str(x)[:-5])

In [26]:
for i in sat_act_by_college['SAT Total 25th-75th Percentile']:
    if len(i) > 4:
        print(i)

1142.5


In [27]:
sat_act_by_college['SAT Total 25th-75th Percentile'] = sat_act_by_college['SAT Total 25th-75th Percentile'].astype(float)
sat_act_by_college['SAT Total 25th-75th Percentile'] = sat_act_by_college['SAT Total 25th-75th Percentile'].round()
sat_act_by_college['SAT Total 25th-75th Percentile'] = sat_act_by_college['SAT Total 25th-75th Percentile'].astype(int)

In [28]:
sat_act_by_college['SAT Total 25th-75th Percentile'].unique()

array([1440, 1460, 1450, 1530, 1500, 1490, 1390, 1330, 1380, 1360, 1470,
       1400, 1410, 1340, 1270, 1290, 1350, 1300, 1370, 1283, 1320, 1310,
       1348, 1180, 1280, 1250, 1260, 1220, 1240, 1333, 1210, 1110, 1255,
       1200, 1040, 1030, 1100,  890, 1325, 1150, 1160, 1080, 1230, 1190,
       1140, 1020,  990, 1090, 1050, 1130, 1203,  950, 1070, 1170,  910,
       1060, 1120, 1142, 1045, 1248, 1010, 1078,  940, 1000,  980, 1153,
       1012, 1108,  970, 1038,  960, 1143, 1008, 1245, 1275,  793,  820])

In [29]:
sat_act_by_college['ACT Total 25th-75th Percentile'] = sat_act_by_college['ACT Total 25th-75th Percentile'].map(lambda x: str(x)[:-3])

In [30]:
for i in sat_act_by_college['ACT Total 25th-75th Percentile']:
    if len(i) > 2:
        print(i)

24-2
19.3-2


In [31]:
sat_act_by_college['ACT Total 25th-75th Percentile'] = sat_act_by_college['ACT Total 25th-75th Percentile'].map(lambda x: str(x)[0:2])
sat_act_by_college['ACT Total 25th-75th Percentile'] = sat_act_by_college['ACT Total 25th-75th Percentile'].astype(int)

In [32]:
sat_act_by_college['ACT Total 25th-75th Percentile'].unique()

array([32, 33, 35, 34, 31, 29, 27, 30, 28, 26, 25, 22, 19, 15, 20, 24, 23,
       21, 16, 17, 18])

In [33]:
sat_act_by_college.head()

Unnamed: 0,School,Test Optional?,Applies to Class Year(s),Policy Details,Number of Applicants,Accept Rate,SAT Total 25th-75th Percentile,ACT Total 25th-75th Percentile
0,Stanford University,Yes,2021,Stanford has adopted a one-year test optional ...,47452,4.3%,1440,32
1,Harvard College,Yes,2021,Harvard has adopted a one-year test optional p...,42749,4.7%,1460,33
2,Princeton University,Yes,2021,Princeton has adopted a one-year test optional...,35370,5.5%,1440,32
3,Columbia University,Yes,2021,Columbia has adopted a one-year test optional ...,40203,5.5%,1450,33
4,Yale University,Yes,2021,Yale has adopted a one-year test optional poli...,36844,6.1%,1460,33


In [34]:
sat_act_by_college.rename(columns={'School': 'college',
                                   'Test Optional?': 'test_required?',
                                   'Applies to Class Year(s)': 'class_year(s)',
                                   'Policy Details': 'policy_details',
                                   'Number of Applicants':'num_of_applicants',
                                   'Accept Rate':'accept_rate',
                                   'SAT Total 25th-75th Percentile':'sat_25th_pclt',
                                   'ACT Total 25th-75th Percentile':'act_25th_pclt'},inplace=True)
sat_act_by_college.head()

Unnamed: 0,college,test_required?,class_year(s),policy_details,num_of_applicants,accept_rate,sat_25th_pclt,act_25th_pclt
0,Stanford University,Yes,2021,Stanford has adopted a one-year test optional ...,47452,4.3%,1440,32
1,Harvard College,Yes,2021,Harvard has adopted a one-year test optional p...,42749,4.7%,1460,33
2,Princeton University,Yes,2021,Princeton has adopted a one-year test optional...,35370,5.5%,1440,32
3,Columbia University,Yes,2021,Columbia has adopted a one-year test optional ...,40203,5.5%,1450,33
4,Yale University,Yes,2021,Yale has adopted a one-year test optional poli...,36844,6.1%,1460,33


#### 3.1.4 'sat_act_score_convertor' Dataset

In [35]:
sat_act_score_convertor.head()

Unnamed: 0,SAT_Score,ACT_Score
0,1560,36
1,1540,35
2,1500,34
3,1460,33
4,1430,32


In [36]:
sat_act_score_convertor.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28 entries, 0 to 27
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   SAT_Score  28 non-null     int64
 1   ACT_Score  28 non-null     int64
dtypes: int64(2)
memory usage: 576.0 bytes


In [37]:
sat_act_score_convertor.rename(columns={'SAT_Score': 'sat_score', 'ACT_Score': 'act_score'}, inplace=True)
sat_act_score_convertor.head()

Unnamed: 0,sat_score,act_score
0,1560,36
1,1540,35
2,1500,34
3,1460,33
4,1430,32


#### 3.2 Combining Dataset

We shall proceed to:

1. Merge the 'act_2019' and 'sat_2019' datasets together
2. Calculate the projected SAT score for each state in the merged dataset using the 'sat_act_score_convertor' dataset. The ACT score is used as the lookup key because the rounded ACT composites match the concordance table's entries exactly.
3. Calculate the projected SAT score for each college in the 'sat_act_by_college' dataset using the 'sat_act_score_convertor' dataset.
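
As a compact illustration of steps 1 and 2, the lookup can also be written with `Series.map`. This is a hedged sketch rather than the notebook's implementation: it rebuilds only the first five state rows and three concordance entries shown in this notebook, and the names `act_to_sat`, `proj_sat` and `sat_gap` are hypothetical. The `sat_gap` column is an addition that the notebook does not compute; it is one possible way to approach the second project aim.

```python
import pandas as pd

# First five rows of the cleaned state tables, as shown in the outputs above
act = pd.DataFrame({
    "state": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
    "act_composite": [19, 20, 19, 19, 23],
})
sat = pd.DataFrame({
    "state": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
    "sat_total": [1143, 1097, 1134, 1141, 1065],
})

# Subset of the concordance table (ACT score -> SAT score)
act_to_sat = {19: 1010, 20: 1040, 23: 1140}

merged = act.merge(sat, how="inner", on="state")
merged["proj_sat"] = merged["act_composite"].map(act_to_sat)

# map() leaves NaN where an ACT score has no concordance entry, so fail loudly
assert merged["proj_sat"].notna().all(), "ACT score missing from concordance table"

# Hypothetical comparison column: a positive gap means the state's actual SAT
# mean exceeds the SAT score projected from its ACT mean
merged["sat_gap"] = merged["sat_total"] - merged["proj_sat"]
print(merged)
```

The sign of `sat_gap` alone should be read with care, since states with very low participation on one test may have a self-selected group of test takers.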

In [38]:
act_sat_2019 = act_2019.merge(sat_2019, how='inner', on='state')
act_sat_2019.head()

Unnamed: 0,state,act_participation_%,act_composite,sat_participation_%,sat_ebrw,sat_math,sat_total
0,Alabama,100,19,7,583,560,1143
1,Alaska,38,20,41,556,541,1097
2,Arizona,73,19,31,569,565,1134
3,Arkansas,100,19,6,582,559,1141
4,California,23,23,63,534,531,1065


In [39]:
# Convert the sat_act_score_convertor into a dict mapping ACT score -> SAT score
sat_act_score_dict = sat_act_score_convertor.set_index('act_score')['sat_score'].to_dict()
print(sat_act_score_dict)

{36: 1560, 35: 1540, 34: 1500, 33: 1460, 32: 1430, 31: 1400, 30: 1370, 29: 1340, 28: 1310, 27: 1280, 26: 1240, 25: 1210, 24: 1180, 23: 1140, 22: 1110, 21: 1080, 20: 1040, 19: 1010, 18: 970, 17: 930, 16: 890, 15: 850, 14: 800, 13: 760, 12: 710, 11: 670, 10: 630, 9: 590}


In [40]:
# Look up the concordance SAT score for each state's rounded ACT composite
proj_sat_score_state = [sat_act_score_dict[score] for score in act_sat_2019['act_composite']]
print(proj_sat_score_state)

[1010, 1040, 1010, 1010, 1140, 1180, 1240, 1180, 1180, 1040, 1080, 1010, 1110, 1180, 1110, 1110, 1080, 1040, 1010, 1180, 1110, 1240, 1180, 1080, 970, 1080, 1040, 1040, 970, 1210, 1180, 1010, 1180, 1010, 1040, 1040, 1010, 1080, 1180, 1210, 1010, 1110, 1010, 1040, 1040, 1180, 1180, 1110, 1080, 1040, 1040, 1080]


In [41]:
act_sat_2019['proj_sat_score_state'] = proj_sat_score_state
act_sat_2019.head()

Unnamed: 0,state,act_participation_%,act_composite,sat_participation_%,sat_ebrw,sat_math,sat_total,proj_sat_score_state
0,Alabama,100,19,7,583,560,1143,1010
1,Alaska,38,20,41,556,541,1097,1040
2,Arizona,73,19,31,569,565,1134,1010
3,Arkansas,100,19,6,582,559,1141,1010
4,California,23,23,63,534,531,1065,1140


In [42]:
# Look up the concordance SAT score for each college's 25th percentile ACT score
proj_sat_score_college = [sat_act_score_dict[score] for score in sat_act_by_college['act_25th_pclt']]
print(proj_sat_score_college)

[1430, 1460, 1430, 1460, 1460, 1540, 1500, 1460, 1460, 1430, 1460, 1430, 1400, 1460, 1400, 1460, 1460, 1400, 1460, 1430, 1460, 1400, 1400, 1340, 1280, 1400, 1370, 1430, 1400, 1370, 1340, 1460, 1400, 1310, 1430, 1430, 1430, 1460, 1430, 1400, 1370, 1430, 1400, 1430, 1430, 1370, 1400, 1400, 1340, 1400, 1400, 1370, 1400, 1280, 1400, 1400, 1240, 1370, 1370, 1310, 1340, 1400, 1400, 1370, 1240, 1340, 1370, 1340, 1210, 1370, 1310, 1240, 1310, 1340, 1310, 1280, 1280, 1370, 1340, 1340, 1370, 1340, 1340, 1310, 1110, 1340, 1310, 1310, 1280, 1240, 1010, 1340, 1310, 1370, 1370, 1280, 1110, 1240, 1280, 850, 1040, 1370, 1310, 1310, 1210, 1340, 1210, 1370, 1180, 1280, 1340, 1310, 1110, 1280, 1340, 1240, 1280, 1280, 1240, 1180, 1240, 1310, 1210, 1340, 1280, 1240, 1010, 1280, 1240, 1240, 1310, 1140, 1210, 1240, 1210, 1340, 1210, 1040, 1280, 1180, 1180, 1280, 1010, 1280, 1280, 1210, 1310, 1010, 1110, 1040, 1180, 1010, 1010, 1310, 1280, 1180, 1310, 1180, 1240, 1280, 1080, 1310, 1110, 1240, 1080, 1240, 1210

In [43]:
sat_act_by_college['proj_sat_score_college'] = proj_sat_score_college
sat_act_by_college.head()

Unnamed: 0,college,test_required?,class_year(s),policy_details,num_of_applicants,accept_rate,sat_25th_pclt,act_25th_pclt,proj_sat_score_college
0,Stanford University,Yes,2021,Stanford has adopted a one-year test optional ...,47452,4.3%,1440,32,1430
1,Harvard College,Yes,2021,Harvard has adopted a one-year test optional p...,42749,4.7%,1460,33,1460
2,Princeton University,Yes,2021,Princeton has adopted a one-year test optional...,35370,5.5%,1440,32,1430
3,Columbia University,Yes,2021,Columbia has adopted a one-year test optional ...,40203,5.5%,1450,33,1460
4,Yale University,Yes,2021,Yale has adopted a one-year test optional poli...,36844,6.1%,1460,33,1460


In [48]:
my_rho = np.corrcoef(sat_act_by_college['sat_25th_pclt'], sat_act_by_college['proj_sat_score_college'])
print(my_rho)

[[1.         0.96064067]
 [0.96064067 1.        ]]


In [50]:
my_rho = np.corrcoef(act_sat_2019['sat_total'], act_sat_2019['proj_sat_score_state'])
print(my_rho)

[[ 1.         -0.41762673]
 [-0.41762673  1.        ]]


--- 
# Part 2

Part 2 requires knowledge of Pandas, EDA, data cleaning, and data visualization.

---

*All libraries used should be added here*

### Data Dictionary

Now that we've fixed our data, and given it appropriate names, let's create a [data dictionary](http://library.ucmerced.edu/node/10249). 

A data dictionary provides a quick overview of features/variables/columns, alongside data types and descriptions. The more descriptive you can be, the more useful this document is.

Example of a Fictional Data Dictionary Entry: 

|Feature|Type|Dataset|Description|
|---|---|---|---|
|**county_pop**|*integer*|2010 census|The population of the county (units in thousands, where 2.5 represents 2500 people).| 
|**per_poverty**|*float*|2010 census|The percent of the county over the age of 18 living below the 200% of official US poverty rate (units percent to two decimal places 98.10 means 98.1%)|

[Here's a quick link to a short guide for formatting markdown in Jupyter notebooks](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html).

Provided is the skeleton for formatting a markdown table, with columns headers that will help you create a data dictionary to quickly summarize your data, as well as some examples. **This would be a great thing to copy and paste into your custom README for this project.**

*Note*: if you are unsure of what a feature is, check the source of the data! This can be found in the README.

**To-Do:** *Edit the table below to create your own data dictionary for the datasets you chose.*

|Feature|Type|Dataset|Description|
|---|---|---|---|
|column name|int/float/object|ACT/SAT|This is an example| 


## Exploratory Data Analysis

Complete the following steps to explore your data. You are welcome to do more EDA than the steps outlined here as you feel necessary:
1. Summary Statistics.
2. Use a **dictionary comprehension** to apply the standard deviation function you create in part 1 to each numeric column in the dataframe.  **No loops**.
    - Assign the output to variable `sd` as a dictionary where: 
        - Each column name is now a key 
        - The standard deviation of the column is the value 
        - *Example Output :* `{'ACT_Math': 120, 'ACT_Reading': 120, ...}`
3. Investigate trends in the data.
    - Using sorting and/or masking (along with the `.head()` method to avoid printing our entire dataframe), consider questions relevant to your problem statement. Some examples are provided below (but feel free to change these questions for your specific problem):
        - Which states have the highest and lowest participation rates for the 2017, 2018, or 2019 SAT and ACT?
        - Which states have the highest and lowest mean total/composite scores for the 2017, 2018, or 2019 SAT and ACT?
        - Do any states with 100% participation on a given test have a rate change year-to-year?
        - Do any states have >50% participation on *both* tests each year?
        - Which colleges have the highest median SAT and ACT scores for admittance?
        - Which California school districts have the highest and lowest mean test scores?
    - **You should comment on your findings at each step in a markdown cell below your code block**. Make sure you include at least one example of sorting your dataframe by a column, and one example of using boolean filtering (i.e., masking) to select a subset of the dataframe.
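
Step 2 above might look like the following. This is only a sketch: the part 1 function is not shown in this notebook, so `std_dev` below is a hypothetical stand-in that assumes the population formula (dividing by n), and the dataframe holds just the first five state rows shown earlier.

```python
import math
import pandas as pd

def std_dev(values):
    """Hand-written population standard deviation (divides by n, not n - 1)."""
    values = list(values)
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

# First five rows of the merged state data shown earlier in the notebook
states = pd.DataFrame({
    "state": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
    "act_composite": [19, 20, 19, 19, 23],
    "sat_total": [1143, 1097, 1134, 1141, 1065],
})

# Dictionary comprehension: column name -> standard deviation, numeric columns only
sd = {
    col: std_dev(states[col])
    for col in states.select_dtypes(include="number").columns
}
print(sd)
```

Note that `pandas.Series.std()` divides by n - 1 by default, so its results will differ slightly from a population formula like this one.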

In [None]:
#Code:

**To-Do:** *Edit this cell with your findings on trends in the data (step 3 above).*

## Visualize the Data

There's not a magic bullet recommendation for the right number of plots to understand a given dataset, but visualizing your data is *always* a good idea. Not only does it allow you to quickly convey your findings (even if you have a non-technical audience), it will often reveal trends in your data that escaped you when you were looking only at numbers. It is important to not only create visualizations, but to **interpret your visualizations** as well.

**Every plot should**:
- Have a title
- Have axis labels
- Have appropriate tick labels
- Have legible text
- Demonstrate meaningful and valid relationships
- Have an interpretation to aid understanding

Here is an example of what your plots should look like following the above guidelines. Note that while the content of this example is unrelated, the principles of visualization hold:

![](https://snag.gy/hCBR1U.jpg)
*Interpretation: The above image shows that as we increase our spending on advertising, our sales numbers also tend to increase. There is a positive correlation between advertising spending and sales.*

---

Here are some prompts to get you started with visualizations. Feel free to add additional visualizations as you see fit:
1. Use Seaborn's heatmap with pandas `.corr()` to visualize correlations between all numeric features.
    - Heatmaps are generally not appropriate for presentations, and should often be excluded from reports as they can be visually overwhelming. **However**, they can be extremely useful in identifying relationships of potential interest (as well as identifying potential collinearity before modeling).
    - Please take time to format your output, adding a title. Look through some of the additional arguments and options. (Axis labels aren't really necessary, as long as the title is informative).
2. Visualize distributions using histograms. If you have a lot, consider writing a custom function and use subplots.
    - *OPTIONAL*: Summarize the underlying distributions of your features (in words & statistics)
         - Be thorough in your verbal description of these distributions.
         - Be sure to back up these summaries with statistics.
         - We generally assume that data we sample from a population will be normally distributed. Do we observe this trend? Explain your answers for each distribution and how you think this will affect estimates made from these data.
3. Plot and interpret boxplots. 
    - Boxplots demonstrate central tendency and spread in variables. In a certain sense, these are somewhat redundant with histograms, but you may be better able to identify clear outliers or differences in IQR, etc.
    - Multiple values can be plotted to a single boxplot as long as they are of the same relative scale (meaning they have similar min/max values).
    - Each boxplot should:
        - Only include variables of a similar scale
        - Have clear labels for each variable
        - Have appropriate titles and labels
4. Plot and interpret scatter plots to view relationships between features. Feel free to write a custom function, and subplot if you'd like. Functions save both time and space.
    - Your plots should have:
        - Two clearly labeled axes
        - A proper title
        - Colors and symbols that are clear and unmistakable
5. Additional plots of your choosing.
    - Are there any additional trends or relationships you haven't explored? Was there something interesting you saw that you'd like to dive further into? It's likely that there are a few more plots you might want to generate to support your narrative and recommendations that you are building toward. **As always, make sure you're interpreting your plots as you go**.
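
As one possible starting point for prompt 4, the sketch below plots each college's actual 25th percentile SAT score against the SAT score projected from its ACT score. It is an illustration under assumptions: it uses only the five college rows shown earlier in this notebook, and the axis limits, colors and figure size are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; this line can be dropped inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# The five college rows shown earlier in the notebook
colleges = pd.DataFrame({
    "college": ["Stanford University", "Harvard College", "Princeton University",
                "Columbia University", "Yale University"],
    "sat_25th_pclt": [1440, 1460, 1440, 1450, 1460],
    "proj_sat_score_college": [1430, 1460, 1430, 1460, 1460],
})

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(colleges["proj_sat_score_college"], colleges["sat_25th_pclt"],
           color="tab:blue", label="College")

# Reference line: points on it have identical actual and projected scores
limits = [1420, 1470]
ax.plot(limits, limits, linestyle="--", color="gray", label="Actual = projected")

ax.set_xlabel("Projected SAT score (from ACT 25th percentile)")
ax.set_ylabel("Actual SAT 25th percentile score")
ax.set_title("Actual vs. projected SAT score by college")
ax.legend()
fig.tight_layout()
```

Points above the dashed line are colleges whose actual SAT 25th percentile is higher than the concordance table projects from their ACT 25th percentile.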

In [None]:
# Code

## Conclusions and Recommendations

Based on your exploration of the data, what are your key takeaways and recommendations? Make sure to answer your question of interest or address your problem statement here.

**To-Do:** *Edit this cell with your conclusions and recommendations.*

Don't forget to create your README!

**To-Do:** *If you combine your problem statement, data dictionary, brief summary of your analysis, and conclusions/recommendations, you have an amazing README.md file that quickly aligns your audience to the contents of your project.* Don't forget to cite your data sources!