# Introduction

What we know from researching the sources for our datasets:

- Only high school students take the SAT, so we'll want to focus on high schools.
- New York City is made up of five boroughs, which are essentially distinct regions.
- New York City schools fall within several different school districts, each of which can contain dozens of schools.
- Our data sets include several different types of schools. We'll need to clean them so that we can focus on high schools only.
- Each school in New York City has a unique code called a DBN, or district borough number.
- Aggregating data by district will allow us to use the district mapping data to plot district-by-district differences.


## Dataset

This dataset was provided and prepared by dataquest.io as part of their data cleaning certification program for Python. Over the last three missions within this certification program, we explored relationships between SAT scores and demographic factors in New York City public schools.

- **SAT scores by school:** SAT scores for each high school in New York City.
- **School attendance:** Attendance information for each school in New York City.
- **Class size:** Information on class size for each school.
- **AP test results:** Advanced Placement (AP) exam results for each high school (passing an optional AP exam in a particular subject can earn a student college credit in that subject).
- **Graduation outcomes:** The percentage of students who graduated, and other outcome information.
- **Demographics:** Demographic information for each school.
- **School survey:** Surveys of parents, teachers, and students at each school.


### Goal

Explore the dataset to find how High Schools with high average SAT Scores are different than other General Education schools in their district.

### Project Goal

The goal of this project is to demonstrate pandas and numpy fundamentals regarding data cleaning tasks and data story telling.

### Setting Up The Environment

For now, we will read each dataset into a pandas dataframe and then store the dataframes in a dictionary. We store them in a dictionary for convenience; otherwise, it would be hard to keep track of all the dataframes used in this project.

#### Steps
1. Import the needed modules.
2. Create a list that will store the file name of the datasets we want to import.
3. Create an empty dictionary.
4. Create a loop that reads in each dataset and stores it under a key made from the file name in the list, without the .csv extension.
5. Check to see if the function worked as intended.

In [None]:
#Import the needed modules
import pandas as pd
import numpy
import re

#Create a list that contains all your datasets
data_files = [
    "ap_2010.csv",
    "class_size.csv",
    "demographics.csv",
    "graduation.csv",
    "hs_directory.csv",
    "sat_results.csv"
]

#Read each dataset and store it in a dictionary keyed by file name (without .csv)
data = {}

for f in data_files:
    d = pd.read_csv("../input/nyc-high-school-data/nyc_highschool_data/schools/{0}".format(f))
    data[f.replace(".csv", "")] = d
    
#Check the dictionary
print(data.keys())
print('\n')

#Check dataframes
for d in data:
    print(data[d].head(3))

### Reading In The Surveys

When we printed the first rows of each dataframe above, we noticed that the DBN column acts as a unique key for each school. Because the dataframes share this key, we can use it to combine them later on. In the survey dataframes, however, the column is named dbn in lower case rather than DBN in upper case as in most of the other dataframes in the data dictionary. This inconsistency would make the dataframes hard to combine, so we will add a DBN column to the survey data to match the rest. After that, we will drop the columns that are not needed for the analysis and add the combined survey dataframe to the data dictionary alongside the other dataframes in this project.

#### Steps
1. Read both survey datasets into pandas dataframes.
2. Explore both survey dataframes and find out if there are any inconsistencies with the rest of the dataframes in the data dictionary.
3. Combine both survey dataframes using concat.
4. Create a new column named DBN that holds the values from the lower case dbn column.
5. Create a list of the column names we want to keep in the combined survey dataframe.
6. Filter the combined survey dataframe with the list of column names.
7. Add the combined survey dataframe in the data dictionary with the rest of the dataframes.
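
As a variation on steps 3 to 6, the key column can be renamed instead of copied, which leaves a single key column behind. The sketch below is only illustrative: the two small dataframes and their values are made-up stand-ins for the survey files, and `ignore_index=True` is an optional choice that gives the stacked result a clean row index.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the two survey files.
all_survey = pd.DataFrame(
    {"dbn": ["01M015", "01M019"], "rr_s": [89.0, 84.0], "unused": [1, 2]}
)
d75_survey = pd.DataFrame({"dbn": ["75K004"], "rr_s": [38.0], "unused": [3]})

# Stack the rows of both dataframes; ignore_index rebuilds a 0..n-1 index.
survey = pd.concat([all_survey, d75_survey], axis=0, ignore_index=True)

# Rename the key column rather than copying it into a second column.
survey = survey.rename(columns={"dbn": "DBN"})

# Keep only the columns needed for later analysis.
survey = survey.loc[:, ["DBN", "rr_s"]]
print(survey)
```

Renaming avoids carrying both dbn and DBN through the filtering step, although the column list removes the lower case column in either approach.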

In [None]:
#Read in .txt datasets using the encoding for windows-1252
all_survey = pd.read_csv("../input/nyc-high-school-data/nyc_highschool_data/schools/survey_all.txt", delimiter="\t", encoding='windows-1252')
d75_survey = pd.read_csv("../input/nyc-high-school-data/nyc_highschool_data/schools/survey_d75.txt", delimiter="\t", encoding='windows-1252')

print(all_survey.head(3))
print('\n')
print(d75_survey.head(3))

#Combine dataframes using concat
survey = pd.concat([all_survey, d75_survey], axis=0)

#Create a new column that aligns with the rest of the dataframes inside the dictionary.
#Change dbn to DBN by creating a new column and bringing over all the same values.
survey["DBN"] = survey["dbn"]

#Preselect the columns that we need to complete future tasks
#Create a list that contains all the columns that we need
survey_fields = [
    "DBN", 
    "rr_s", 
    "rr_t", 
    "rr_p", 
    "N_s", 
    "N_t", 
    "N_p", 
    "saf_p_11", 
    "com_p_11", 
    "eng_p_11", 
    "aca_p_11", 
    "saf_t_11", 
    "com_t_11", 
    "eng_t_11", 
    "aca_t_11", 
    "saf_s_11", 
    "com_s_11", 
    "eng_s_11", 
    "aca_s_11", 
    "saf_tot_11", 
    "com_tot_11", 
    "eng_tot_11", 
    "aca_tot_11",
]

#Filter the columns in the survey dataframe by using the survey_fields list
survey = survey.loc[:,survey_fields]

#Add the survey dataframe into the data dictionary with the rest of the datasets
data["survey"] = survey

### Add DBN columns

Like the survey dataframes above, two other dataframes in the data dictionary do not follow the DBN column naming standard. The hs_directory adjustment is similar to the survey one: we simply copy the values from the lower case dbn column into a new upper case DBN column. The class_size dataframe is different in that it has no DBN column at all. Instead, it contains a CSD column and a SCHOOL CODE column that together form the unique DBN key used by the other dataframes. Before the two can be joined, CSD needs to be padded to two digits to match the format of DBN values.

#### Steps
1. Create a new DBN column within the hs_directory dataframe that will hold the values of the old dbn column in the same dataframe.
2. Create a function that converts a value to a string and checks its length: if the string has two or more characters, return it as is; otherwise return it with a 0 in front of it.
3. Create a new column named padded_csd that will hold the results for the string function by applying it to the CSD column in the class_size dataframe.
4. Create a new column named DBN in the class_size dataframe by concatenating the values of the padded_csd and SCHOOL CODE columns.
5. Update the class_size dataframe within the data dictionary.
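
The padding in steps 2 to 4 can also be sketched with pandas string methods instead of a custom function. This is an alternative to the approach used in this notebook, not a replacement for it; the three rows below are made-up examples in the shape of the CSD and SCHOOL CODE columns.

```python
import pandas as pd

# Hypothetical rows shaped like the class_size columns used to build DBN.
class_size = pd.DataFrame(
    {"CSD": [1, 4, 12], "SCHOOL CODE": ["M015", "M034", "X092"]}
)

# Convert the district number to text and left-pad it with zeros to width 2.
class_size["padded_csd"] = class_size["CSD"].astype(str).str.zfill(2)

# Concatenate the padded district and the school code into the DBN key.
class_size["DBN"] = class_size["padded_csd"] + class_size["SCHOOL CODE"]
print(class_size["DBN"].tolist())  # ['01M015', '04M034', '12X092']
```

`str.zfill(2)` leaves two-digit districts untouched, which matches the behaviour of the length check described in step 2.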

In [None]:
#We want to align the DBN column name with the rest of the datasets within the data dictionary
#Change dbn to DBN by creating a new column and bringing over the same values from the old column
data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]

#Update the CSD to match the format needed to align with the DBN columns in the other datasets
#Create a function that adds a zero to the front of a string with fewer than two characters.
def pad_csd(num):
    string_representation = str(num)
    if len(string_representation) > 1:
        return string_representation
    else:
        return "0" + string_representation
    
#Create a new column named padded_csd that holds the results of applying the function to the CSD column
data["class_size"]["padded_csd"] = data["class_size"]["CSD"].apply(pad_csd)

#Create a new column named DBN within the class_size dataframe to align itself with the other datasets
#Combine both padded_csd and SCHOOL CODE to create the properly formatted DBN column values
data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]

### Convert columns to numeric

When looking at the sat_results dataframe, we notice that all of the columns have the object datatype, meaning we cannot perform any numerical analysis on them. This is a problem because we want to find out whether total SAT scores give any insight into demographics or population within schools in NYC. Additionally, when looking at the hs_directory dataframe, we notice that the Location 1 column holds the geographic location of each school. We want to extract the latitude and longitude from the Location 1 column so that it is easier to find geographic patterns later in the analysis. Latitude and longitude will each get their own column within the dataframe.

#### Steps 
1. Explore the datatypes within the sat_results dataframe.
2. Create a list of the columns from the sat_results dataframe that we want to change to a numeric datatype.
3. Create a for loop that iterates over the list and changes each column's datatype to numeric.
4. Create a column named sat_score that sums the three SAT section scores.
5. Create two functions that extract the latitude and the longitude from the location string in the Location 1 column of hs_directory by finding the coordinate pair, splitting it, and removing the parentheses.
6. Create two columns named lat and lon by applying these functions to the Location 1 column.
7. Change the newly created lat and lon columns in the hs_directory dataframe to a numeric datatype.
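
Steps 5 to 7 can also be sketched in a single pass with `Series.str.extract`, which returns one column per regex capture group. The address text below is a made-up example of the Location 1 format, and the pattern is an assumption about how the coordinates appear (a pair of decimal numbers in parentheses). Rows without a coordinate pair come back as NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical Location 1 values: an address followed by "(lat, lon)".
hs_directory = pd.DataFrame(
    {
        "Location 1": [
            "123 Example Street\nBrooklyn, NY 11225\n(40.67, -73.96)",
            "Address with no coordinates",
        ]
    }
)

# Named groups become the column names of the extracted dataframe.
pattern = r"\((?P<lat>-?\d+\.?\d*),\s*(?P<lon>-?\d+\.?\d*)\)"
coords = hs_directory["Location 1"].str.extract(pattern)

# Convert the extracted text to numbers; unmatched rows stay NaN.
hs_directory["lat"] = pd.to_numeric(coords["lat"], errors="coerce")
hs_directory["lon"] = pd.to_numeric(coords["lon"], errors="coerce")
print(hs_directory[["lat", "lon"]])
```

The function-based approach in this notebook indexes the first regex match directly, so it assumes every row contains a coordinate pair; the extract-based sketch trades that assumption for missing values.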

In [None]:
#Explore the datatypes for the sat_results dataframe
print(data['sat_results'].info())

#Convert the SAT score columns to numeric dtypes
#Create a list of the three columns we want to change
cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']

#Write a for loop that changes the columns' datatypes to numeric
for c in cols:
    data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")

#Create a new column that totals all the SAT Scores
data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]

#We want the longitude and latitude in separate columns within the dataframe
#Create a function that splits the location string so that all we have left is the latitude value
def find_lat(loc):
    #Raw string so the escaped parentheses reach the regex engine unchanged
    coords = re.findall(r"\(.+, .+\)", loc)
    lat = coords[0].split(",")[0].replace("(", "")
    return lat
#Create a function that splits the location string so that all we have left is the longitude value
def find_lon(loc):
    coords = re.findall(r"\(.+, .+\)", loc)
    lon = coords[0].split(",")[1].replace(")", "").strip()
    return lon

#Create a column that holds the lat or longitude by using the new function inside the apply function
data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(find_lat)
data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(find_lon)

#Convert the lat and lon values back into numeric datatypes
data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"], errors="coerce")
data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"], errors="coerce")

### Condense datasets

Given that the goal of this project is to analyze differences between high schools across the General Education school districts, we will begin by condensing the class_size dataframe to fit our needs. Next, we want to average its numeric columns for each school so that every DBN appears only once in later analysis.

Class_size Columns:
- **Number of students/Seats filled:** the total number of students.
- **Number of sections:** number of class sections.
- **Average class size:** average number of students per class.
- **Size of smallest class:** The smallest number of students in a class.
- **Size of largest class:** The largest number of students in a class.
- **Schoolwide pupil - teacher ratio:** Number of pupils per teacher across the school.

Once the class_size dataframe has been filtered and updated, we move on to the demographics dataframe, which we will filter down to the most recent school year. In the graduation dataframe, we want each DBN to appear as few times as possible, so we will keep only the rows where the cohort is the most recent one and the demographic column contains the total cohort.

#### Steps
1. Find the unique values for the GRADE and PROGRAM TYPE columns in the class_size dataframe.
2. Filter the class_size dataframe on the GRADE column, keeping grades 09-12.
3. Filter the class_size dataframe on the PROGRAM TYPE column, keeping the GEN ED program only.
4. Group the class_size dataframe by DBN, average its numeric columns, and update it within the data dictionary.
5. Filter the demographics dataframe on the schoolyear column, keeping the most recent school year.
6. Update the demographics dataframe within the data dictionary.
7. Filter the graduation dataframe on the Cohort column, keeping the most recent cohort.
8. Filter the graduation dataframe on the Demographic column, keeping the Total Cohort group.
9. Update the graduation dataframe within the data dictionary.
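
To illustrate why filtering alone does not make DBN unique in class_size, here is a minimal sketch on made-up rows. The `subject` column and all values are hypothetical; only the idea of several rows per school, one per subject, reflects the structure described in this section.

```python
import pandas as pd

# Hypothetical class_size rows: one school appears once per subject.
class_size = pd.DataFrame(
    {
        "DBN": ["01M292", "01M292", "01M332"],
        "subject": ["ENGLISH", "MATH", "ENGLISH"],
        "AVERAGE CLASS SIZE": [22.0, 26.0, 30.0],
    }
)

# Even after filtering by grade and program type, a DBN can still repeat.
print(class_size["DBN"].is_unique)  # False

# Average the numeric columns per school; text columns are skipped.
per_school = class_size.groupby("DBN", as_index=False).mean(numeric_only=True)
print(per_school)
```

After the aggregation each DBN appears exactly once, which is what the later merges on DBN rely on.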


In [None]:
#Check the unique values for the GRADE and PROGRAM TYPE columns in the class_size dataframe
print(data['class_size']['GRADE '].value_counts())
print('\n')
print(data['class_size']['PROGRAM TYPE'].value_counts())
print('\n')

#Filter class_size dataset to contain highschools only and with a general education program type
class_size = data["class_size"]
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]

#Group the class size dataset by DBN and average the numeric columns
#numeric_only=True skips text columns, which recent pandas versions refuse to average
class_size = class_size.groupby("DBN").mean(numeric_only=True)
class_size.reset_index(inplace=True)
print(class_size.head(2))

#return the newly adjusted class_size dataset back into the dataset dictionary 
data["class_size"] = class_size

#Filter the demographics dataset to contain the most recent school years
data["demographics"] = data["demographics"][data["demographics"]["schoolyear"] == 20112012]

#Filter graduation dataset to contain the most recent school year and demographic to total cohort only
data["graduation"] = data["graduation"][data["graduation"]["Cohort"] == "2006"]
data["graduation"] = data["graduation"][data["graduation"]["Demographic"] == "Total Cohort"]

### Convert AP scores to numeric

To better understand whether there is a correlation between AP exams and SAT scores across schools, we will need to convert three columns in the ap_2010 dataframe to a numeric datatype. AP exams tend to be offered in schools that are academically challenging, so it stands to reason that high schools with less funding or less academic rigor will participate in fewer AP exams.

#### Steps
1. Find out the datatypes of the columns in the ap_2010 dataframe.
2. Create a list of column names from the ap_2010 dataframe that we want to change to numeric.
3. Create a for loop that will convert those columns from the list to numeric datatype.
4. Make sure to update the ap_2010 dataframe within the data dictionary.

In [None]:
#Check the column datatypes for the ap_2010 dataframe
print(data['ap_2010'].info())

#Create a list of column names that we want to change to numeric
cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']

#create a for loop that will change the column values into a numeric datatype
for col in cols:
    data["ap_2010"][col] = pd.to_numeric(data["ap_2010"][col], errors="coerce")

### Combine the datasets

Now that we have cleaned all the dataframes to a functional degree, we will start by merging all the dataframes within the data dictionary together by the DBN column. However, we must be careful in how we merge our data together because it could introduce an influx of null values within the combined dataframe. To combat this issue I will use a combination of left joins and inner joins to minimize the null value counts.

#### Steps
1. Create a new dataframe called combined with the sat_results dataframe inserted into it.
2. Left join the combined dataframe with ap_2010 and graduation dataframes on DBN individually.
3. Create a list of dataframes remaining that we want to merge.
4. Create a for loop that takes the list of dataframes we want to merge, and merges them with an inner join by DBN.
5. Fill null values in numeric columns with the column's mean.
6. Fill the remaining null values, which sit in non-numeric columns, with the number zero.
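
The effect of the two join types and the two fill steps can be sketched on a few made-up rows. All values below are hypothetical; `indicator=True` is an optional merge argument, used here only to show which rows found a match.

```python
import pandas as pd

# Hypothetical miniature versions of three of the dataframes.
sat = pd.DataFrame(
    {"DBN": ["01M292", "01M448", "02M047"], "sat_score": [1122.0, 1172.0, 1149.0]}
)
ap = pd.DataFrame({"DBN": ["01M448"], "AP Test Takers ": [39.0]})
directory = pd.DataFrame(
    {"DBN": ["01M292", "01M448"], "boro": ["Manhattan", "Manhattan"]}
)

# Left join: every SAT row is kept; schools without AP data get NaN.
left_flagged = sat.merge(ap, on="DBN", how="left", indicator=True)
print(left_flagged["_merge"].tolist())
left = left_flagged.drop(columns="_merge")

# Inner join: only schools present in both dataframes survive.
inner = left.merge(directory, on="DBN", how="inner")

# Fill numeric gaps with the column mean, then anything left with 0.
filled = inner.fillna(inner.mean(numeric_only=True)).fillna(0)
print(filled)
```

The left join keeps all three schools but introduces two missing AP values, while the inner join drops the school that is absent from the directory. That trade-off between lost rows and added nulls is the reason for mixing the two join types.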

In [None]:
#Combine all the datasets using left joins or inner joins
combined = data["sat_results"]

#Combine ap_2010 and graduation data set to the combined dataframe
combined = combined.merge(data["ap_2010"], on="DBN", how="left")
combined = combined.merge(data["graduation"], on="DBN", how="left")

#Create a list of datasets we want to merge into the combined dataframe
to_merge = ["class_size", "demographics", "survey", "hs_directory"]

#Create a for loop that will merge the list of datasets using an inner join
for m in to_merge:
    combined = combined.merge(data[m], on="DBN", how="inner")

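#Fill any null values in numeric columns with the mean of their columns
#numeric_only=True keeps text columns out of the mean calculation
combined = combined.fillna(combined.mean(numeric_only=True))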

#Fill the rest of the null values with zero if they are not numeric datatype
combined = combined.fillna(0)

### Add a school district column for mapping

To make it easier to find the district of each high school, I will create a column named school_dist that holds each school's district number. The district number is the first two characters of the DBN value, so we will need a function that extracts those two characters and puts them into our new school_dist column.

#### Steps
1. Check the DBN column in the combined dataframe to understand its format.
2. Create a function that returns the first two characters of any string value.
3. Create a column called school_dist by applying the newly created function to the DBN column.
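
The same extraction can be sketched without a helper function by slicing through the pandas `.str` accessor. The DBN values below are made-up examples in the district, borough, number format.

```python
import pandas as pd

# Hypothetical DBN values: two-digit district, borough letter, school number.
combined = pd.DataFrame({"DBN": ["01M292", "13K430", "31R605"]})

# Slice the first two characters of every DBN in one vectorized step.
combined["school_dist"] = combined["DBN"].str[:2]
print(combined["school_dist"].tolist())  # ['01', '13', '31']
```

The result is kept as text, so leading zeros such as "01" are preserved.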

In [None]:
#Check the DBN column in the combined dataframe
print(combined['DBN'].head(3))

#Create a function that will take the first two characters from a string
def get_first_two_chars(dbn):
    return dbn[0:2]

#Create a new column named school_dist that uses the new function to pull two characters from the DBN column
combined["school_dist"] = combined["DBN"].apply(get_first_two_chars)

### Find correlations

We want to find if there are any strong correlations with the sat_score column, since we speculate that sat scores can be impacted by numerous factors.

#### Steps
1. Find the correlations between all numeric columns within the dataset and save the result to a variable named correlations.
2. Update correlations by selecting the sat_score column, which leaves a Series of each column's correlation with sat_score.
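
A small sketch of these two steps on made-up data is shown below. The column `frl_percent` and all values are hypothetical and chosen so the correlations are easy to read; sorting the result is an optional extra that puts the strongest negative and positive relationships at the two ends.

```python
import pandas as pd

# Hypothetical combined dataframe with one text and three numeric columns.
combined = pd.DataFrame(
    {
        "SCHOOL NAME": ["A", "B", "C", "D"],
        "sat_score": [1000.0, 1200.0, 1400.0, 1600.0],
        "total_enrollment": [200, 400, 600, 800],
        "frl_percent": [90.0, 70.0, 50.0, 30.0],
    }
)

# Pairwise Pearson correlations; numeric_only=True skips the text column.
correlations = combined.corr(numeric_only=True)["sat_score"]

# Drop the trivial self-correlation and sort from most negative to most positive.
ranked = correlations.drop("sat_score").sort_values()
print(ranked)
```

In this toy data total_enrollment rises in lockstep with sat_score and frl_percent falls in lockstep, so they land at opposite ends of the ranking.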

In [None]:
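#Find correlations between all numeric columns within the dataframe
#numeric_only=True skips text columns, which recent pandas versions cannot correlate
correlations = combined.corr(numeric_only=True)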

#Find the correlations that align with the sat_score column
correlations = correlations["sat_score"]
print(correlations.head(40))

### Plotting survey correlations

The data dictionary for the 2011 NYC School Survey can be found [here](https://data.cityofnewyork.us/Education/NYC-School-Survey-2011/mnz3-dyi8) and can be downloaded as an xls file.

#### Steps
1. Remove DBN from the survey_fields list.
2. import matplotlib.
3. Make a barplot from the correlation between sat_score and the column names inside survey_field list.

#### Analysis

- **rr_s:** The student response rate shows a decent correlation, which makes sense because students who take the time to respond to a survey are likely to be involved academically.
- **N_s, N_t, N_p:** These are the numbers of student, teacher, and parent respondents, not SAT scores. They show a strong correlation, likely because they track school enrollment.
- **saf_t_11:** sat_score correlates with how safe and respected teachers feel within their classrooms.
- **saf_s_11:** sat_score correlates with how safe and respected students feel within their classrooms, and the correlation is slightly stronger than for saf_t_11.
- **aca_s_11:** The academic expectations of the students show a strong correlation with SAT scores.

In conclusion: SAT scores appear to correlate with how teachers and students perceive their school environment. The correlations suggest that sat_score may be related to how safe and respected teachers and students feel within their classrooms. Additionally, sat_score may be related to students' academic expectations: a student who expects a low score might not score high on the SAT. Lastly, there seems to be an interesting correlation between student engagement and SAT scores.

In [None]:
# Remove DBN since it's a unique identifier, not a useful numerical value for correlation.
survey_fields.remove("DBN")

#import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

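#Plot the correlations of sat_score with the columns in the survey_fields list.
#numeric_only=True skips text columns, which recent pandas versions cannot correlate
combined.corr(numeric_only=True)["sat_score"][survey_fields].plot.bar()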


### Investigating Safety Scores

Given the decent correlation of safety and respect with SAT scores, I investigated it further with scatter plots. The plot on the left shows the students' safety and respect score on the x-axis against SAT scores on the y-axis, and the plot on the right shows the teachers' safety and respect score against SAT scores.

#### Safety Score and SAT Score

After looking at both graphs, it is apparent that there is a weak positive correlation in both the teacher and the student plots. The points cluster near the middle, with outliers scattered across different safety scores. For some points, a higher safety score does not come with a higher SAT score.

#### Avg Borough Safety Score

Across all boroughs, teachers rate their safety and respect higher than students do. Brooklyn has the lowest safety scores for both teachers and students. The highest scores for both teachers and students are in Queens and Manhattan.

#### Steps
1. Determine the figure size.
2. Determine how to format your subplots.
3. Create a scatter plot for both teachers and students against sat_score.
4. Group the combined dataframe by borough and find the average safety score for each.
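
Step 4 can be sketched for both safety columns in a single groupby, which also makes it easy to compare teacher and student scores side by side. The rows below are made up, and the `teacher_minus_student` column is a hypothetical helper added only for this illustration.

```python
import pandas as pd

# Hypothetical safety scores for three schools in two boroughs.
combined = pd.DataFrame(
    {
        "boro": ["Bronx", "Bronx", "Queens"],
        "saf_s_11": [6.0, 7.0, 7.0],
        "saf_t_11": [7.0, 8.0, 8.0],
    }
)

# Select both safety columns before averaging so only they are aggregated.
boro_safety = combined.groupby("boro")[["saf_s_11", "saf_t_11"]].mean()

# Hypothetical helper column: how much higher teachers score than students.
boro_safety["teacher_minus_student"] = (
    boro_safety["saf_t_11"] - boro_safety["saf_s_11"]
)
print(boro_safety)
```

Selecting the columns before calling `mean()` avoids averaging every other column in the dataframe.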

In [None]:
#Create side-by-side scatter plots of student and teacher safety scores against SAT scores
fig = plt.figure(figsize = (16,5))
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)

ax1.scatter(combined['saf_s_11'],combined['sat_score'])
ax1.set_title('Student Safety Response')
ax2.scatter(combined['saf_t_11'], combined['sat_score'])
ax2.set_title('Teacher Safety Response')
plt.show()

In [None]:
#Find the average Safety and Respect score for teachers and students by borough
#Select the column before averaging so text columns are never aggregated
boro_s = combined.groupby('boro')['saf_s_11'].mean()
boro_t = combined.groupby('boro')['saf_t_11'].mean()

print(boro_s)
print(boro_t)


### Investigating Racial Differences In SAT Scores

When looking at the correlation between race and SAT scores, we notice that the white and Asian demographics have a strong positive correlation. The black demographic has a weak negative correlation with SAT scores while the Hispanic demographic has a moderate negative correlation with SAT scores. 

#### Explore Schools With Low SAT Scores

When looking at the correlation between SAT score and hispanic_per column, there is a downtrend when hispanic_per numbers increase. In other words, when the hispanic_per numbers increase we see a decrease in sat_scores.


#### Steps
1. Create a list that holds the race columns we will use for the barplot.
2. Create a barplot that shows the correlation between sat scores and race.
3. Create a scatter plot of sat_score and hispanic_per column.
4. Find the school names for schools with a hispanic_per greater than 95%.
5. Find the school names for schools with a hispanic_per less than 10% and a sat_score greater than 1800.


In [None]:
#Create a single graph that has several bar plots that shows correlation of sat scores and different demographics

#Create a list of columns that we want to include for race
race_fields= ['white_per', 'asian_per', 'black_per', 'hispanic_per']

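#Create a bar plot that shows the correlation between SAT scores and race
#numeric_only=True skips text columns, which recent pandas versions cannot correlate
combined.corr(numeric_only=True)['sat_score'][race_fields].plot.bar(title = 'SAT Score by Race')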

In [None]:
#Create a scatter plot of hispanic_per and sat_score columns
fig, ax3 = plt.subplots()
ax3.scatter(combined['hispanic_per'], combined['sat_score'])
ax3.set_title('Correlation Between SAT Score and Hispanic Demographic')
ax3.set_ylabel('SAT Score')
ax3.set_xlabel('Hispanic Per')
ax3.set_ylim(800,2200)
ax3.set_xlim(0,110)



In [None]:
#Create a series that shows the school names for schools with hispanic_per greater than 95%
hispanic_greater_95 = combined[combined['hispanic_per'] > 95 ]['SCHOOL NAME']
print(hispanic_greater_95)
print('\n')



- The schools listed above are geared towards recent immigrants to the U.S. and focus on helping students learn English, which could explain their low SAT scores.

In [None]:
#Create a series that shows the school names for schools with hispanic_per less than 10% and avg SAT score greater than 1800.
hispanic_lower_10 = combined[(combined['hispanic_per'] < 10) & (combined['sat_score'] > 1800)]['SCHOOL NAME']
print(hispanic_lower_10)


- The schools listed above are focused on science and may receive financial help to run those programs. Additionally, these schools require students to pass an entrance exam to enter, although this would not explain the low Hispanic population. Their focus on science and technical skills could be one reason why their SAT scores are on the high side.

### Investigating Gender Differences In SAT Score


#### Bar plot Observation

When looking at the correlation between gender and SAT Scores, we notice that both genders show weak signs of correlation. Males show a weak negative correlation, while females show a weak positive correlation. 

#### Scatter plot Observation

The scatter plots confirm our bar plot observations: gender has a weak correlation with SAT scores. For most schools, the percentages of female and male students cluster in the range of 40 to 60 percent. Furthermore, the high SAT scores for both genders sit above this cluster and mostly stay within the same bounds of 40 to 60 percent.

#### School Name Observations



#### Steps
1. Create a gender field that will hold the columns names for gender.
2. Create a barplot that shows the correlation of gender and SAT Scores.
3. Create a scatter plot for female SAT Scores.
4. Create a scatter plot for male SAT Scores.
5. Find the school names for both genders where schools have a gender_per greater than 60 percent and a SAT Score higher than 1700.



In [None]:
#Create a barplot that shows the correlation for both genders against SAT scores

#Create a gender list
gender_field = ['male_per', 'female_per']

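#Create a barplot
#numeric_only=True skips text columns, which recent pandas versions cannot correlate
combined.corr(numeric_only=True)['sat_score'][gender_field].plot.bar(title = 'Correlation of Gender and SAT Scores')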

In [None]:
#Create a scatter plot of both genders and their SAT scores
#male_per scatter plot with SAT Scores
combined.plot.scatter('male_per', 'sat_score', title = 'Male SAT Scores')

#female scatter plot with SAT Scores
combined.plot.scatter('female_per', 'sat_score', title = 'Female SAT Scores')

In [None]:
#Find the school names where female_per is greater than 60 percent and the SAT Scores are greater than 1700
female_greater_60 = combined[(combined['female_per'] > 60) & (combined['sat_score'] > 1700)]['SCHOOL NAME']
print(female_greater_60)
print('\n')



- The schools that meet the criteria above have a strong emphasis on college preparation and humanities studies. There are also more schools that fit these criteria for females than males.

In [None]:
#Find the school names where male_per is greater than 60 percent and the SAT Scores are greater than 1700
male_greater_60 = combined[(combined['male_per'] > 60) & (combined['sat_score'] > 1700)]['SCHOOL NAME']
print(male_greater_60)

- Only one school meets the criteria above. This school emphasizes STEM subjects and college preparation.

### Investigating The Relationship between AP Scores and SAT Scores

After converting the AP test taker counts to ap_per, we notice that the positive correlation between ap_per and sat_score is moderate in strength. Some points follow the positive trend; however, the majority of points are clustered between 0 and 40% ap_per with a sat_score between 1000 and 1300. Moreover, across the range of ap_per values, sat_score mostly stays within 1000 to 1300, so these results are not convincing evidence that a higher share of AP test takers goes along with higher SAT scores.

#### Steps 
1. Create a column that holds the results for the division of ap test takers by total enrollment.
2. Find the correlation between ap_per and sat_score.
3. Create a scatter plot of ap_per and sat_score.
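
Steps 1 and 2 can be sketched on made-up numbers as follows. The three rows are hypothetical; `Series.corr` computes the Pearson correlation between just the two columns of interest, which avoids building the full correlation matrix.

```python
import pandas as pd

# Hypothetical schools: AP test takers, enrollment, and average SAT score.
combined = pd.DataFrame(
    {
        "AP Test Takers ": [20.0, 50.0, 150.0],
        "total_enrollment": [200, 250, 500],
        "sat_score": [1100.0, 1250.0, 1500.0],
    }
)

# Share of enrolled students who took at least one AP exam (a fraction, not a percent).
combined["ap_per"] = combined["AP Test Takers "] / combined["total_enrollment"]

# Pearson correlation between the two columns only.
r = combined["ap_per"].corr(combined["sat_score"])
print(round(r, 3))
```

Because ap_per is a fraction between 0 and 1, a value of 0.4 on the scatter plot axis corresponds to 40% of students.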

In [None]:
#Create a column that holds the results for the division of ap test takers by total enrollment
combined['ap_per'] = combined['AP Test Takers '] / combined['total_enrollment']

#Find the correlation between ap_per and sat_score directly from the two columns
print(combined['ap_per'].corr(combined['sat_score']))

#Create a scatter plot with ap_per and sat_score
combined.plot.scatter('ap_per', 'sat_score', title = 'Ap Scores and SAT Scores')