# SAT & ACT Analysis 

## Part 1: Data Cleaning

![](https://bostonglobe-prod.cdn.arcpublishing.com/resizer/S0jNbdqmfG9pcGdG13r6MLGIieU=/1440x0/cloudfront-us-east-1.images.arcpublishing.com/bostonglobe/PBOB6NQGFJGUPOO6EVWSF5GVYQ.jpeg)

# Problem Statement

The SAT dataset contains state-level statistics for the SAT in the U.S. for 2017, 2018, and 2019. The SAT is a standardized test widely used for college admissions in the United States. It has four sections: Reading, Writing and Language, Math (no calculator), and Math (calculator allowed).

The ACT dataset contains state-level statistics for the ACT in the U.S. for 2017, 2018, and 2019. The ACT is a standardized test used for college admissions in the United States that covers four academic skill areas: English, mathematics, reading, and science reasoning. The four main ACT sections are each scored on a scale of 1–36, and the composite score is the average of the four section scores, rounded to the nearest whole number.

With these datasets, we look for correlations and trends in SAT and ACT participation rates across states, with the goal of identifying ways to improve participation in specific states. To do this, we analyze the data using pandas, NumPy, matplotlib, and seaborn.

In [None]:
#Imports:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## 2017 Data Import and Cleaning

#### 1. Read In SAT & ACT  Data

Read in the `sat_2017.csv` and `act_2017.csv` files and assign them to appropriately named pandas dataframes.

In [None]:
#Code:
sat = pd.read_csv('../input/sat-act-files/sat_2017.csv')
act = pd.read_csv('../input/sat-act-files/act_2017.csv')

#### 2. Display Data

In [None]:
#Code:
sat.head()

In [None]:
act.head()

#### 3. Verbally Describing the Data

The SAT dataset shows state-level statistics for the SAT. The SAT is a standardized test widely used for college admissions in the United States. It has four sections: Reading, Writing and Language, Math (no calculator), and Math (calculator allowed).

The total test score is displayed as 'Total'. The 'Participation' column gives the percentage of students in each state who took the test. The 'Math' and 'Evidence-Based Reading and Writing' columns hold the scores in mathematics and reading/writing, respectively. We can check this reading of the columns by adding the math and reading/writing values, which should yield the total score. The three score columns are integers, while 'State' and 'Participation' are stored as strings.
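
The check described above can be sketched as follows. This is a minimal illustration on a hypothetical dataframe (`sat_demo`, with invented values) that mimics the SAT column layout. The one-point tolerance is an assumption, on the grounds that state-level figures are averages and may not add up exactly.

```python
import pandas as pd

# Hypothetical rows that mimic the SAT column layout; all values are invented.
sat_demo = pd.DataFrame({
    "State": ["A", "B", "C"],
    "Evidence-Based Reading and Writing": [593, 547, 563],
    "Math": [572, 533, 503],
    "Total": [1165, 1080, 1116],
})

# Absolute difference between the reported total and the sum of the two sections.
section_sum = sat_demo["Evidence-Based Reading and Writing"] + sat_demo["Math"]
sat_demo["gap"] = (sat_demo["Total"] - section_sum).abs()

# Assumed tolerance: allow a one-point gap caused by rounding of averages.
mismatches = sat_demo[sat_demo["gap"] > 1]
print(mismatches[["State", "gap"]])
```

Rows that appear in `mismatches` are candidates for a data-entry error in one of the section columns or in the total.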

The ACT dataset shows state-level statistics for the ACT. The ACT is a standardized test used for college admissions in the United States that covers four academic skill areas: English, mathematics, reading, and science reasoning. The four main ACT sections are each scored on a scale of 1–36, and the composite score is the average of the four section scores, rounded to the nearest whole number.

Each score column holds the score for one subject; that is, 'English' is a score from 1-36 for English, 'Science' is a score from 1-36 for Science, and so on. All subject scores range from 1 to 36 and are stored as floats.
The 'Participation' column gives the percentage of students in each state who took the test.
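
As a worked example of the composite rule, here is a small sketch for a single test-taker. The function name `act_composite` is hypothetical, and the code assumes that an average ending in exactly one-half rounds up.

```python
import math

def act_composite(english, math_score, reading, science):
    """Average the four section scores and round half up to a whole number."""
    mean = (english + math_score + reading + science) / 4
    # floor(mean + 0.5) rounds halves up, unlike Python's built-in round().
    return math.floor(mean + 0.5)

print(act_composite(24, 19, 23, 18))  # mean is 21.0
print(act_composite(20, 21, 21, 20))  # mean is 20.5, which rounds up
print(act_composite(20, 20, 21, 20))  # mean is 20.25, which rounds down
```

The state-level values in the dataset are averages over many test-takers, which is why they appear as floats rather than whole numbers.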

For both the ACT and SAT datasets, there are no missing (null) values. However, the SAT dataset does not report separate scores for all four sections. Reading and Writing and Language are combined into the single integer column 'Evidence-Based Reading and Writing', while the calculator and no-calculator Math scores are combined into the 'Math' column.


In [None]:
print(sat.min())
print('\n')
print(sat.max())
print('\n')
print(act.min())
print('\n')
print(act.max())
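
Printing the minimum and maximum only helps if we compare them against the valid score scales. A minimal sketch of that comparison is below; it uses a hypothetical dataframe `act_demo` with invented values and a hypothetical helper `out_of_range`.

```python
import pandas as pd

# Hypothetical ACT-style rows; the values are invented for illustration.
act_demo = pd.DataFrame({
    "State": ["A", "B", "C"],
    "English": [18.9, 18.7, 23.3],
    "Science": [19.4, 40.2, 23.2],
})

def out_of_range(df, columns, low, high):
    """Return the rows where any of `columns` falls outside [low, high]."""
    in_range = df[columns].apply(lambda col: col.between(low, high))
    return df[~in_range.all(axis=1)]

# ACT section scores are on a 1-36 scale.
suspect = out_of_range(act_demo, ["English", "Science"], low=1, high=36)
print(suspect)
```

A check like this only flags values outside the scale. A value that is wrong but still within 1-36 would need a different test, such as comparing the section average against the composite.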

#### Fixing any errors identified

In [None]:
act.loc[act['State'] == 'Wyoming', ['Composite']] 

In [None]:
# Wyoming's Composite value carries a stray 'x'; overwrite it with the numeric score
act.loc[act['State'] == 'Wyoming', ['Composite']] = 20.2

In [None]:
#code
sat.info()
print('\n')
act.info()

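The `info()` output shows that some numeric columns are stored with the wrong type: 'Participation' holds percent strings in both dataframes, and 'Composite' in the ACT data is not yet a float column. We fix this below: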

In [None]:
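# Helper: convert a percent string such as '5%' to a fraction such as 0.05
def percent_to_float(x):
    return float(x.strip('%'))/100

# Convert the participation columns from percent strings to fractions
sat['Participation'] = sat['Participation'].str.rstrip('%').astype('float') / 100.0
act['Participation'] = act['Participation'].str.rstrip('%').astype('float') / 100.0
# Cast Composite to float now that the malformed Wyoming value has been replaced
act['Composite'] = act['Composite'].astype(float)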
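
Malformed entries like the stray "x" can also be located programmatically instead of by eye. The sketch below assumes a hypothetical Series `composite_raw` with invented values and uses `pd.to_numeric` with `errors="coerce"`.

```python
import pandas as pd

# Hypothetical column with one malformed entry; the values are invented.
composite_raw = pd.Series(["19.2", "20.2x", "21.4"], name="Composite")

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
composite_num = pd.to_numeric(composite_raw, errors="coerce")

# The malformed originals are the entries that became NaN after conversion.
bad_entries = composite_raw[composite_num.isna()]
print(bad_entries)
```

This only works as a detector if the column has no genuine missing values, because those would also show up as NaN.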

In [None]:
#make columns lowercase
sat.columns = sat.columns.str.lower()
act.columns = act.columns.str.lower()

# The .rename method

sat.rename(columns={
    'participation' : 'participation_sat_2017',
    'evidence-based reading and writing' : 'ebrw_sat_2017',
    'math' : 'math_sat_2017',
    'total' : 'total_sat_2017'
}, inplace=True)

act.rename(columns={
    'math' : 'math_act_2017',
    'english' : 'english_act_2017',
    'reading' : 'reading_act_2017',
    'science' : 'science_act_2017',
    'participation' : 'participation_act_2017',
    'composite' : 'composite_act_2017',
    'state' : 'state_act'
}, inplace=True)

#### Our data dictionary

|Feature|Type|Dataset|Description|
|---|---|---|---|
|state|object|ACT/SAT|U.S. state or the District of Columbia|
|participation_sat / participation_act|float|ACT/SAT|fraction of students in the class of {year} that took the specified test (0.98 means 98%)|
|ebrw_sat|int|SAT|score for Evidence-Based Reading and Writing (maximum score = 800)|
|math_sat|int|SAT|score for mathematics (maximum score = 800)|
|total_sat|int|SAT|total score for the SAT out of 1600 points|
|english_act|float|ACT|English score (scale 1-36)|
|math_act|float|ACT|mathematics score (scale 1-36)|
|reading_act|float|ACT|reading score (scale 1-36)|
|science_act|float|ACT|science score (scale 1-36)|
|composite_act|float|ACT|composite score (average of the four section scores; for an individual test-taker it is rounded to the nearest whole number, with fractions of one-half or more rounded up)|

The ACT dataframe contains an extra row at index 0. Let's remove it so that both dataframes cover the same rows.

In [None]:
# Drop the extra first row and renumber the index from 0
act.drop([0], inplace = True)
act = act.reset_index(drop=True)

In [None]:
act.head()
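
Extra or duplicated rows can be found by comparing the state columns of the two dataframes. Here is a minimal sketch on hypothetical data; the label "National" is an invented example of a non-state row.

```python
import pandas as pd

# Hypothetical state columns; "National" stands in for a non-state summary row.
sat_states = pd.Series(["Alabama", "Alaska", "Arizona"])
act_states = pd.Series(["National", "Alabama", "Alaska", "Arizona"])

# Labels present in one dataframe but not the other.
extra_in_act = sorted(set(act_states) - set(sat_states))
extra_in_sat = sorted(set(sat_states) - set(act_states))

# Labels that occur more than once within a single dataframe.
duplicated = act_states[act_states.duplicated()].tolist()

print(extra_in_act, extra_in_sat, duplicated)
```

Running this before dropping rows by index makes it easier to confirm that the dropped index really is the extra row.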

#### Merge Dataframes

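Combine the 2017 SAT and ACT dataframes side by side and assign the result to a new variable. Note that `pd.concat` with `axis=1` aligns rows on the index rather than on the state name, so this step relies on both dataframes listing the states in the same order.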

In [None]:
#Code:
merged = pd.concat([sat, act], axis=1)
del merged['state_act']
merged.head()
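
An alternative that joins on the state name itself, rather than on row position, is `pd.merge`. The sketch below is not the notebook's method. It uses hypothetical dataframes (`sat_demo`, `act_demo`) with invented values, and `validate` and `indicator` are used to surface duplicate or unmatched states.

```python
import pandas as pd

# Hypothetical frames with the states in a different order; values are invented.
sat_demo = pd.DataFrame({"state": ["Alabama", "Alaska"],
                         "total_sat_2017": [1165, 1080]})
act_demo = pd.DataFrame({"state_act": ["Alaska", "Alabama"],
                         "composite_act_2017": [19.8, 19.2]})

merged_demo = pd.merge(
    sat_demo, act_demo,
    left_on="state", right_on="state_act",
    how="outer",              # keep unmatched rows so they can be inspected
    validate="one_to_one",    # raise if a state appears twice on either side
    indicator=True,           # adds a _merge column: both / left_only / right_only
).drop(columns="state_act")

# Rows that failed to match on the state name.
unmatched = merged_demo[merged_demo["_merge"] != "both"]
print(merged_demo)
```

With a key-based join, a misspelled or differently capitalized state name shows up in `unmatched` instead of silently pairing the wrong rows.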

In [None]:
#code
merged.to_csv('combined_2017.csv', index=False)

## 2018 Data Import and Cleaning

Read in the `sat_2018.csv` and `act_2018.csv` files and assign them to appropriately named pandas dataframes. For the **2018 ACT Data**, only the `Composite` scores are available. We repeat the cleaning steps from the 2017 section here.

In [None]:
sat2018 = pd.read_csv('../input/sat-act-files/sat_2018.csv')
act2018 = pd.read_csv('../input/sat-act-files/act_2018.csv')

In [None]:
#percent to float
sat2018['Participation'] = sat2018['Participation'].apply(percent_to_float)
act2018['Participation'] = act2018['Participation'].apply(percent_to_float)

#remove duplicate
act2018.drop([20], inplace = True)
act2018 = act2018.reset_index(drop=True)

#make everything lowercase
sat2018.columns = sat2018.columns.str.lower()
act2018.columns = act2018.columns.str.lower()

# The .rename method

sat2018.rename(columns={
    'participation' : 'participation_sat_2018',
    'evidence-based reading and writing' : 'ebrw_sat_2018',
    'math' : 'math_sat_2018',
    'total' : 'total_sat_2018'
}, inplace=True)

act2018.rename(columns={
    'participation' : 'participation_act_2018',
    'composite' : 'composite_act_2018',
}, inplace=True)
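
Because the same lowercase-and-rename pattern recurs for every year, it could be factored into a helper. The function `tag_columns` below is a hypothetical sketch, not part of the notebook, and the demo values are invented.

```python
import pandas as pd

def tag_columns(df, test, year, keep=("state",)):
    """Return a copy with lowercase columns renamed to '<column>_<test>_<year>'."""
    out = df.copy()
    out.columns = out.columns.str.lower()
    # Leave key columns such as 'state' untouched so they can be used for joins.
    mapping = {c: f"{c}_{test}_{year}" for c in out.columns if c not in keep}
    return out.rename(columns=mapping)

# Hypothetical one-row frame with the 2018 ACT column layout.
demo = pd.DataFrame({"State": ["Alabama"], "Participation": [1.0], "Composite": [19.1]})
tagged = tag_columns(demo, test="act", year=2018)
print(list(tagged.columns))
# ['state', 'participation_act_2018', 'composite_act_2018']
```

Long column names such as 'evidence-based reading and writing' would still need an explicit rename to a short form like `ebrw` before calling the helper.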

In [None]:
merged1 = pd.concat([sat2018, act2018], axis=1)
del merged1['state']

In [None]:
sat_act_2017_2018 = pd.concat([merged, merged1], axis = 1)

## 2019 Data Import and Cleaning

In [None]:
# Repeat the same steps for the 2019 datasets
sat2019 = pd.read_csv('../input/sat-act-files/sat_2019.csv', encoding = "ISO-8859-1")
act2019 = pd.read_csv('../input/sat-act-files/act_2019.csv')

In [None]:
#remove duplicates and unnecessary rows

act2019.drop([51], inplace = True)
act2019 = act2019.reset_index(drop=True)

sat2019.drop([47], inplace = True)
sat2019.drop([39], inplace = True)
sat2019 = sat2019.reset_index(drop = True)

#percent to float
sat2019['Participation Rate'] = sat2019['Participation Rate'].apply(percent_to_float)
act2019['Participation'] = act2019['Participation'].apply(percent_to_float)

#make everything lowercase
sat2019.columns = sat2019.columns.str.lower()
act2019.columns = act2019.columns.str.lower()

# The .rename method

sat2019.rename(columns={
    'participation rate' : 'participation_sat_2019',
    'ebrw' : 'ebrw_sat_2019',
    'math' : 'math_sat_2019',
    'total' : 'total_sat_2019'
}, inplace=True)

act2019.rename(columns={
    'participation' : 'participation_act_2019',
    'composite' : 'composite_act_2019',
}, inplace=True)

In [None]:
merged2 = pd.concat([sat2019, act2019], axis=1)
del merged2['state']

In [None]:
final = pd.concat([sat_act_2017_2018, merged2], axis = 1)
final.T.head(10)

In [None]:
final.to_csv('final.csv', index=False)

## Part 2: EDA

In [None]:
# Set up the canvas for plotting
plt.subplots(figsize=(12,12))
# Overlay horizontal bars for 2019 ACT and SAT participation
plt.barh(final.state, final.participation_act_2019, label='ACT part.', alpha=0.5)
plt.barh(final.state, final.participation_sat_2019, label='SAT part.', alpha=0.5)
plt.legend(loc='lower center')
plt.title("2019 Participation Rate for ACT and SAT per State", fontsize=16)
plt.xlabel("Participation Rate", fontsize=14)
plt.ylabel("State",fontsize=14)
plt.show()

In [None]:
def subplot_histograms(dataframe, list_of_columns, list_of_titles, list_of_xlabels):
    nrows = int(np.ceil(len(list_of_columns)/2)) # Makes sure you have enough rows
    fig, ax = plt.subplots(nrows=nrows, ncols=2, figsize=(12,10)) # You'll want to specify your figsize
    ax = ax.ravel() # Ravel turns a matrix into a vector, which is easier to iterate
    plt.subplots_adjust(hspace = 0.5)
    for i, column in enumerate(list_of_columns): # Gives us an index value to get into all our lists
        ax[i].hist(dataframe[column]) # feel free to add more settings
        ax[i].set_title(list_of_titles[i]) # Set titles for each subplot
        ax[i].set_xlabel(list_of_xlabels[i]) # Set xlabels for each subplot

In [None]:
# Code
subplot_histograms(final,
                   list_of_columns=['participation_sat_2017', 'participation_sat_2018','participation_sat_2019', 
                           'participation_act_2017', 'participation_act_2018', 'participation_act_2019'],
                   
                   list_of_titles=['SAT 2017 Participation', 'SAT 2018 Participation', 'SAT 2019 Participation',
                                   'ACT 2017 Participation', 'ACT 2018 Participation', 'ACT 2019 Participation'], 
                   
                   list_of_xlabels=['participation rate', 'participation rate', 'participation rate', 'participation rate', 'participation rate', 'participation rate'])


Participation distributions are similar across years. The SAT has more states with very low participation rates, and the ACT has more states with very high participation rates.

In [None]:
# plt.subplots returns a (figure, axes) pair; draw the boxplot on the axes
fig, ax = plt.subplots(figsize=(15,10))
ax.set_title('SAT Scores',fontsize=30)
ax.set_ylabel('Score',fontsize=20)
ax.set_ylim(450, 675)
sns.boxplot(data=final[['ebrw_sat_2017','ebrw_sat_2018','ebrw_sat_2019',
                        'math_sat_2017', 'math_sat_2018', 'math_sat_2019']],width=0.4, ax=ax)


In [None]:
f, axes = plt.subplots(2, figsize = (10, 12))
sns.scatterplot(data = final, x = "participation_sat_2019", y = "total_sat_2019", ax=axes[0]).set_title('SAT vs. participation 2019');
sns.scatterplot(data = final, x = "participation_act_2019", y = "composite_act_2019", ax=axes[1]).set_title('ACT vs. participation 2019');
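
The relationship shown in a scatter plot can be summarized with a correlation coefficient. The sketch below uses a hypothetical dataframe with invented values, so the resulting number says nothing about the real data.

```python
import pandas as pd

# Hypothetical participation/score pairs; the values are invented.
demo = pd.DataFrame({
    "participation_sat_2019": [0.05, 0.20, 0.60, 1.00],
    "total_sat_2019": [1250, 1180, 1090, 1000],
})

# Pearson correlation between participation and total score.
r = demo["participation_sat_2019"].corr(demo["total_sat_2019"])
print(round(r, 2))
```

Applied to the merged dataframe, the same call on the 2019 columns would put a number on the trend seen in each plot.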

Let's test a hypothesis: do states that offer free SAT testing have higher participation?

In [None]:
#all states with free sat tests
free_state_SAT = ['Colorado', 'Connecticut', 'Delaware', 'District of Columbia', 
                'Idaho', 'Illinois', 'Maine', 'Michigan', 'New Hampshire', 
                'Ohio', 'Oklahoma', 'Rhode Island', 'South Carolina', 'Tennessee', 
                'West Virginia']
# Boolean mask selecting the rows for states in the list above
free = final['state'].isin(free_state_SAT)

# Set up the canvas for plotting
plt.subplots(figsize=(12,12))
# Overlay horizontal bars for 2019 ACT and SAT participation in the selected states
plt.barh(final[free].state, final[free].participation_act_2019, label='ACT part.', color = 'green', alpha=0.5)
plt.barh(final[free].state, final[free].participation_sat_2019, label='SAT part.', color = 'red', alpha=0.5)
plt.legend(loc='lower center')
plt.title("2019 Participation Rate for ACT and SAT in States with Free SAT Testing", fontsize=16)
plt.xlabel("Participation Rate", fontsize=14)
plt.ylabel("State",fontsize=14)
plt.show()
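
To compare the two groups of states numerically rather than visually, one option is to group by the free-testing flag and summarize. This sketch uses placeholder state names and invented participation values.

```python
import pandas as pd

# Placeholder states and invented participation fractions.
demo = pd.DataFrame({
    "state": ["A", "B", "C", "D"],
    "participation_sat_2019": [1.00, 0.50, 0.25, 0.25],
})
free_states = ["A", "B"]  # hypothetical list of states with free SAT testing

# Flag each state, then summarize participation within each group.
demo["free_sat"] = demo["state"].isin(free_states)
summary = demo.groupby("free_sat")["participation_sat_2019"].agg(["mean", "median", "count"])
print(summary)
```

A difference in group means would be consistent with the hypothesis but would not by itself show that free testing causes higher participation.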