<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 1: Standardized Test Analysis

### Saloni, DSI-SG-25

--- 
# Part 1

Part 1 requires knowledge of basic Python.

---

## Problem Statement

Since 2001, states have been adopting and implementing varying state-wide assessments, including the ACT and SAT. Signing a contract with ACT, Inc. to administer the ACT had, and continues to have, two benefits: it (i) provides students with an opportunity to take a college admissions test and (ii) encourages more students to consider college upon high school graduation. Recently, however, the SAT has been outpacing the ACT in participation. In this project, we analyse ACT and SAT participation and scores by state for 2017 and 2018 to better understand how ACT participation rates can be improved across states, and whether increased participation in either test is associated with more high school graduates going on to college.

### Contents:
- [Background](#Background)
- [Data Import & Cleaning](#Data-Import-and-Cleaning)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
- [Data Visualization](#Visualize-the-Data)
- [Conclusions and Recommendations](#Conclusions-and-Recommendations)

## Background

The SAT and ACT are standardized tests that many colleges and universities in the United States require for their admissions process. This score is used along with other materials such as grade point average (GPA) and essay responses to determine whether or not a potential student will be accepted to the university.

The SAT has two sections of the test: Evidence-Based Reading and Writing and Math ([*source*](https://www.princetonreview.com/college/sat-sections)). The ACT has 4 sections: English, Mathematics, Reading, and Science, with an additional optional writing section ([*source*](https://www.act.org/content/act/en/products-and-services/the-act/scores/understanding-your-scores.html)). They have different score ranges, which you can read more about on their websites or additional outside sources (a quick Google search will help you understand the scores for each test):
* [SAT](https://collegereadiness.collegeboard.org/sat)
* [ACT](https://www.act.org/content/act/en.html)

Standardized tests have long been a controversial topic for students, administrators, and legislators. Since the 1940s, an increasing number of colleges have been using scores from students' performances on tests like the SAT and the ACT as a measure of college readiness and aptitude ([*source*](https://www.minotdailynews.com/news/local-news/2017/04/a-brief-history-of-the-sat-and-act/)). Supporters of these tests argue that the scores can be used as an objective measure to determine college admittance. Opponents claim that the tests are not accurate measures of students' potential or ability and serve as an inequitable barrier to entry. Lately, more and more schools are opting to drop the SAT/ACT requirement for their Fall 2021 applications ([*read more about this here*](https://www.cnn.com/2020/04/14/us/coronavirus-colleges-sat-act-test-trnd/index.html)).

The ACT was introduced in late 1959 by ACT, Inc. and has since gained steady traction, with Colorado and Illinois among the first to implement it as a state-wide test in 2001 and numerous other states following suit ([source](https://www.act.org/content/act/en/about-act.html)). The ACT was marketed as a test to replace and standardize the graduation requirements for high school juniors, with the intent to give every student access to a college admissions test and motivate more students to consider college upon graduation ([source](https://blog.prepscholar.com/which-states-require-the-act-full-list-and-advice)). Currently, 13 states require high school juniors to sit the test while 8 others offer it as an option, at no extra cost to the student ([source](https://blog.prepscholar.com/which-states-require-the-act-full-list-and-advice)).

The SAT was developed and introduced in 1926 by the College Board, with major revisions made to the test in 2016 to better prepare students for the reading and writing standards expected at college level ([source](https://sat.ivyglobal.com/new-vs-old)). Over time, the test has gained popularity as it came to be administered at no cost to students and better test-prep resources became readily available ([source](https://www.wsj.com/articles/sat-scores-fall-as-more-students-take-the-test-11569297660)). 20 states have made it a graduation requirement for high school juniors while 7 states offer it but do not require it ([source](https://blog.prepscholar.com/which-states-require-the-sat)).


### Choose your Data

* [`act_2017.csv`](./data/act_2017.csv): 2017 ACT Scores by State [(source)](https://www.act.org/content/dam/act/unsecured/documents/cccr2017/ACT_2017-Average_Scores_by_State.pdf)
* [`act_2018.csv`](./data/act_2018.csv): 2018 ACT Scores by State [(source)](https://www.act.org/content/dam/act/unsecured/documents/cccr2018/Average-Scores-by-State.pdf)
* [`sat_2017.csv`](./data/sat_2017.csv): 2017 SAT Scores by State [(source)](https://nces.ed.gov/programs/digest/d19/tables/dt19_226.40.asp)
* [`sat_2018.csv`](./data/sat_2018.csv): 2018 SAT Scores by State [(source)](https://nces.ed.gov/programs/digest/d19/tables/dt19_226.40.asp)
* [`college_going_rate_2018.csv`](./data/college_going_rate_2018.csv): 2018 College Going Rate by State [(source)](http://www.higheredinfo.org/dbrowser/index.php?submeasure=63&year=2018&level=nation&mode=data&state=)

### Outside Research

In 2013, there were increasing debates to consider replacing existing high school graduation tests with standardized college admissions tests to increase the chances of students going to college and to reduce the content load for both students and teachers by administering only one high school test ([source](https://www.nytimes.com/2013/08/04/education/edlife/more-students-are-taking-both-the-act-and-sat.html)). Over time, the move to make college admissions tests state-mandated has not only made access to tertiary education more equitable but has shown evidence suggesting that more students consider and enroll in college after graduating from high school ([source](https://www.csmonitor.com/USA/Education/2015/0903/As-states-change-use-of-SAT-and-ACT-disadvantaged-students-get-boost)).

College going rates by state from 2018, published by the National Information Center for Higher Education Policymaking and Analysis ([source](http://www.higheredinfo.org/dbrowser/index.php?submeasure=63&year=2018&level=nation&mode=data&state=)) were used in this project to corroborate the above-mentioned information.

### Coding Challenges

1. Manually calculate mean:

    Write a function that takes in values and returns the mean of the values. Create a list of numbers that you test on your function to check to make sure your function works!
    
    *Note*: Do not use any mean methods built-in to any Python libraries to do this! This should be done without importing any additional libraries.

In [1]:
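def find_mean(numbers):
    # Arithmetic mean: sum of the values divided by how many there are
    return sum(numbers)/len(numbers)

# Test list, reused for the standard deviation check below
num1 = [2,4,8,16,17]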

find_mean(num1)

9.4
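The function above raises a `ZeroDivisionError` on an empty list. The following is a minimal sketch of one way to guard against that, still without importing any library. The name `find_mean_safe` is hypothetical and is not used elsewhere in this notebook.

```python
def find_mean_safe(numbers):
    """Return the arithmetic mean of an iterable of numbers.

    Hypothetical variant of find_mean that raises a descriptive error
    on empty input instead of a ZeroDivisionError.
    """
    values = list(numbers)
    if not values:
        raise ValueError("cannot compute the mean of an empty sequence")
    return sum(values) / len(values)


sample = [2, 4, 8, 16, 17]
print(find_mean_safe(sample))  # 47 / 5
```

Converting the input to a list first also lets the function accept generators, which have no `len()`.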

2. Manually calculate standard deviation:

    The formula for standard deviation is below:

    $$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^n(x_i - \mu)^2}$$

    Where $x_i$ represents each value in the dataset, $\mu$ represents the mean of all values in the dataset and $n$ represents the number of values in the dataset.

    Write a function that takes in values and returns the standard deviation of the values using the formula above. Hint: use the function you wrote above to calculate the mean! Use the list of numbers you created above to test on your function.
    
    *Note*: Do not use any standard deviation methods built-in to any Python libraries to do this! This should be done without importing any additional libraries.

In [2]:
def stdev(numbers):
    mean = find_mean(numbers)
    deviations = [(x-mean)**2 for x in numbers]
    variance = sum(deviations)/len(numbers)
    return variance**0.5

stdev(num1)

6.118823416311343
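Because the formula divides by $n$, the function returns the population standard deviation rather than the sample one (which divides by $n - 1$). As a cross-check only, and outside the "no imports" constraint of the challenge itself, the sketch below compares the hand-written function with the standard library's `statistics.pstdev` and `statistics.stdev`. The two functions are re-declared here so the block runs on its own.

```python
import statistics


def find_mean(numbers):
    return sum(numbers) / len(numbers)


def stdev(numbers):
    mean = find_mean(numbers)
    squared_deviations = [(x - mean) ** 2 for x in numbers]
    return (sum(squared_deviations) / len(numbers)) ** 0.5


num1 = [2, 4, 8, 16, 17]

# statistics.pstdev divides by n (population);
# statistics.stdev divides by n - 1 (sample).
population_sd = statistics.pstdev(num1)
sample_sd = statistics.stdev(num1)

print(abs(stdev(num1) - population_sd) < 1e-9)  # hand-written result matches pstdev
print(sample_sd > population_sd)                # sample estimate is larger
```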

3. Data cleaning function:
    
    Write a function that takes in a string that is a number and a percent symbol (ex. '50%', '30.5%', etc.) and converts this to a float that is the decimal approximation of the percent. For example, inputting '50%' in your function should return 0.5, '30.5%' should return 0.305, etc. Make sure to test your function to make sure it works!

You will use these functions later on in the project!

In [3]:
def convert_float(string):
    # Remove the percent sign, then scale from percent to a decimal fraction
    return float(string.strip("%"))/100

percent = "50.5%"
convert_float(percent)

0.505
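Note that `strip("%")` only removes percent signs at the very ends of the string, so an input with whitespace after the percent sign (for example `" 60% "`) would fail to convert. A minimal sketch of a more tolerant variant follows. The name `convert_percent` is hypothetical, and whether the raw files actually contain such whitespace is an assumption.

```python
def convert_percent(text):
    """Convert a percent string such as ' 60% ' to a decimal fraction.

    Hypothetical variant of convert_float that tolerates whitespace
    around the number and around the percent sign.
    """
    cleaned = text.strip().rstrip("%").strip()
    return float(cleaned) / 100


for raw in ["50%", "30.5%", " 60% "]:
    print(repr(raw), "->", convert_percent(raw))
```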

--- 
# Part 2

Part 2 requires knowledge of Pandas, EDA, data cleaning, and data visualization.

---

*All libraries used should be added here*

In [4]:
# Import numpy
import numpy as np

# Import pandas
import pandas as pd

# Import matplotlib.pyplot
import matplotlib.pyplot as plt

# Import seaborn
import seaborn as sns

# Import os
import os

%matplotlib inline

## Data Import and Cleaning

### Data Import & Cleaning

Import the datasets that you selected for this project and go through the following steps at a minimum. You are welcome to do further cleaning as you feel necessary:
1. Display the data: print the first 5 rows of each dataframe to your Jupyter notebook.
2. Check for missing values.
3. Check for any obvious issues with the observations (keep in mind the minimum & maximum possible values for each test/subtest).
4. Fix any errors you identified in steps 2-3.
5. Display the data types of each feature.
6. Fix any incorrect data types found in step 5.
    - Fix any individual values preventing other columns from being the appropriate type.
    - If your dataset has a column of percents (ex. '50%', '30.5%', etc.), use the function you wrote in Part 1 (coding challenges, number 3) to convert this to floats! *Hint*: use `.map()` or `.apply()`.
7. Rename Columns.
    - Column names should be all lowercase.
    - Column names should not contain spaces (underscores will suffice--this allows for using the `df.column_name` method to access columns in addition to `df['column_name']`).
    - Column names should be unique and informative.
8. Drop unnecessary rows (if needed).
9. Merge dataframes that can be merged.
10. Perform any additional cleaning that you feel is necessary.
11. Save your cleaned and merged dataframes as csv files.
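Step 9 can be sketched as follows, assuming each frame has exactly one row per state and a shared `state` column after renaming. The frames below are toy data for illustration, not the project files. Passing `validate="one_to_one"` is a defensive choice that makes pandas raise an error if a state appears more than once on either side.

```python
import pandas as pd

# Toy frames for illustration; column names follow the convention in step 7.
act = pd.DataFrame(
    {"state": ["Alabama", "Alaska"], "act_composite_2017": [19.2, 19.8]}
)
sat = pd.DataFrame(
    {"state": ["Alabama", "Alaska"], "sat_total_2017": [1165, 1080]}
)

# validate="one_to_one" raises MergeError if "state" is duplicated on either side.
merged = act.merge(sat, on="state", how="inner", validate="one_to_one")
print(merged)
```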

### Cleaning ACT and SAT data from 2017 

#### Import datasets

In [5]:
# Import ACT and SAT 2017 data
# Reference: https://stackoverflow.com/questions/56100674/i-cannot-import-csv-file
datapath = "../data"

act_2017_path = os.path.join(datapath,"act_2017.csv")
sat_2017_path = os.path.join(datapath,"sat_2017.csv")

act_2017 = pd.read_csv(act_2017_path)
sat_2017 = pd.read_csv(sat_2017_path)

#### 1) Display first 5 rows of data

In [6]:
# First 5 rows of ACT 2017
act_2017.head()

Unnamed: 0,State,Participation,English,Math,Reading,Science,Composite
0,National,60%,20.3,20.7,21.4,21.0,21.0
1,Alabama,100%,18.9,18.4,19.7,19.4,19.2
2,Alaska,65%,18.7,19.8,20.4,19.9,19.8
3,Arizona,62%,18.6,19.8,20.1,19.8,19.7
4,Arkansas,100%,18.9,19.0,19.7,19.5,19.4


In [7]:
# First 5 rows of SAT 2017
sat_2017.head()

Unnamed: 0,State,Participation,Evidence-Based Reading and Writing,Math,Total
0,Alabama,5%,593,572,1165
1,Alaska,38%,547,533,1080
2,Arizona,30%,563,553,1116
3,Arkansas,3%,614,594,1208
4,California,53%,531,524,1055


#### 2,3,5) Check for missing values and obvious issues with data

In [8]:
# Checking for missing values and data types of each feature in ACT 2017. Displays datatype hence fulfilling step 5.
print("Check for missing values and data types of each feature in ACT 2017")
print(act_2017.info())

# Max values of every column in ACT 2017. Check for obvious issues.
print("\nMax value for each column in ACT 2017")
print(act_2017.max())

# Min values of every value in ACT 2017. Check for obvious issues.
print("\nMin value for each column in ACT 2017")
print(act_2017.min())

Check for missing values and data types of each feature in ACT 2017
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   State          52 non-null     object 
 1   Participation  52 non-null     object 
 2   English        52 non-null     float64
 3   Math           52 non-null     float64
 4   Reading        52 non-null     float64
 5   Science        52 non-null     float64
 6   Composite      52 non-null     object 
dtypes: float64(4), object(3)
memory usage: 3.0+ KB
None

Max value for each column in ACT 2017
State            Wyoming
Participation        98%
English             25.5
Math                25.3
Reading             26.0
Science             24.9
Composite           25.5
dtype: object

Min value for each column in ACT 2017
State            Alabama
Participation        60%
English             16.3
Math                18.0
Reading   

In [9]:
#Finding which state has the min. value of 2.3 for science
act_2017[act_2017["Science"] == 2.3]

Unnamed: 0,State,Participation,English,Math,Reading,Science,Composite
21,Maryland,28%,23.3,23.1,24.2,2.3,23.6


**ACT 2017 observations and errors**

* No missing or null values. There are 50 states and the District of Columbia (a.k.a. Washington D.C.), which should give 51 entries under the 'state' column, but act_2017 has 52 entries. The 'national' row in act_2017 needs to be removed as it will not be used in analysis.
* The 'composite' column has the data type 'object'. It should be a float, which indicates that at least one of its values cannot be read as a number.
* 'participation' column shows a max. value of 98% and min. value of 60%. Because the column is stored as strings, these are string comparisons rather than numeric ones. Referring to the [source](https://www.act.org/content/dam/act/unsecured/documents/cccr2017/ACT_2017-Average_Scores_by_State.pdf) of the data, we know that the max. should be 100% and the min. 8%. Requires correction.
* 'participation' column datatype is shown to be an 'object'. It needs to be converted to a float.
* 'science' column shows a min. value of 2.3. We have identified that this score comes from Maryland. Based on the [scoring system](https://blog.prepscholar.com/how-is-the-act-scored), the score for each subject and the composite should each be a number between 1 and 36. Looking at the composite for Maryland, we can tell that the Science score has been mistyped. Referring to the source, we know it needs to be corrected to 23.2.
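The mismatch between the reported max./min. participation and the source comes from comparing strings instead of numbers. A small sketch with toy values (not taken from the project files) shows the effect:

```python
import pandas as pd

# Toy values for illustration only.
participation = pd.Series(["100%", "8%", "98%", "60%"])

# String comparison goes character by character, so "98%" sorts last
# and "100%" sorts first.
print(participation.max(), participation.min())

# Numeric comparison after converting to decimal fractions.
as_fraction = participation.str.rstrip("%").astype(float) / 100
print(as_fraction.max(), as_fraction.min())
```

Once the column is numeric, `max()` and `min()` agree with the values in the source table.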

In [10]:
# Checking for missing values and data types of each feature in SAT 2017. Displays datatype hence fulfilling step 5.
print("Check for missing values and data types of each feature in SAT 2017")
print(sat_2017.info())

# Max values of every column in SAT 2017. Check for obvious issues.
print("\nMax value for each column SAT 2017")
print(sat_2017.max())

# Min values of every value in SAT 2017. Check for obvious issues.
print("\nMin value for each column in SAT 2017")
print(sat_2017.min())

Check for missing values and data types of each feature in SAT 2017
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 5 columns):
 #   Column                              Non-Null Count  Dtype 
---  ------                              --------------  ----- 
 0   State                               51 non-null     object
 1   Participation                       51 non-null     object
 2   Evidence-Based Reading and Writing  51 non-null     int64 
 3   Math                                51 non-null     int64 
 4   Total                               51 non-null     int64 
dtypes: int64(3), object(2)
memory usage: 2.1+ KB
None

Max value for each column SAT 2017
State                                 Wyoming
Participation                             96%
Evidence-Based Reading and Writing        644
Math                                      651
Total                                    1295
dtype: object

Min value for each column in SAT 2017
State   

In [11]:
#Finding which state has the min. value of 52 for math
sat_2017[sat_2017["Math"] == 52]

Unnamed: 0,State,Participation,Evidence-Based Reading and Writing,Math,Total
20,Maryland,69%,536,52,1060


**SAT 2017 observations and errors**

* No missing or null values. There are 50 states and the District of Columbia (a.k.a. Washington D.C.) equating to 51 entries under the 'state' column.  
* 'participation' column shows a max. value of 96% and min. value of 10%. Because the column is stored as strings, these are string comparisons rather than numeric ones. Referring to the [source](https://nces.ed.gov/programs/digest/d19/tables/dt19_226.40.asp) of the data, we know that the max. should be 100% and the min. 2%. Requires correction.
* 'participation' column datatype is shown to be an 'object'. It needs to be converted to a float.
* 'math' column shows a min. value of 52. We have identified that this score comes from Maryland. Based on the [scoring system](https://blog.prepscholar.com/how-is-the-sat-scored-scoring-charts), the score for each section should be between 200 and 800 and the total should be between 400 and 1600. Looking at the total for Maryland, we can tell that the Math score has been mistyped. Referring to the source, we know it needs to be corrected to 524.
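Instead of inspecting `min()` and `max()` by eye, out-of-range scores can be located directly with a range filter. The sketch below uses toy rows that mirror the values shown above. The 200 to 800 bounds come from the scoring system cited in the observations.

```python
import pandas as pd

# Toy rows for illustration; values mirror the outputs shown above.
scores = pd.DataFrame(
    {
        "State": ["Alabama", "Alaska", "Maryland"],
        "Math": [572, 533, 52],
    }
)

# SAT section scores should lie between 200 and 800 inclusive.
out_of_range = scores[~scores["Math"].between(200, 800)]
print(out_of_range)
```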

#### 4,6,8) Fix identified errors and drop unnecessary rows
**Summary of errors fixed**
* For ACT 2017 "Science", replaced 2.3 with 23.2
* For ACT 2017 converted "Composite" column from str to float
* For ACT 2017 dropped "national" from rows
* For ACT 2017 converted "Participation" column from str to float using a function defined in Part 1
* For SAT 2017 "Math", replaced 52 with 524
* For SAT 2017 converted "Participation" column from str to float using a function defined in Part 1

In [12]:
# For ACT 2017 "Science", replace 2.3 with 23.2
act_2017.loc[act_2017.State == "Maryland", "Science"] = 23.2
act_2017[act_2017["State"] == "Maryland"]

Unnamed: 0,State,Participation,English,Math,Reading,Science,Composite
21,Maryland,28%,23.3,23.1,24.2,23.2,23.6


In [13]:
# For ACT 2017 convert "Composite" column from str to float
###### Run this : act_2017["Composite"].astype(float) ######

In [14]:
## Error message: Revealed that there is a value 20.2x that cannot be converted to a float.
## Finding which state has composite value of 20.2x and replacing with 20.2
act_2017[act_2017["Composite"] == "20.2x"]
act_2017.loc[act_2017.State == "Wyoming", "Composite"] = "20.2"
act_2017[act_2017["State"] == "Wyoming"]
act_2017["Composite"] = act_2017["Composite"].astype(float)
act_2017.dtypes

State             object
Participation     object
English          float64
Math             float64
Reading          float64
Science          float64
Composite        float64
dtype: object
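An alternative to reading the error message from `astype(float)` is to let pandas flag the unparseable entries. This is a sketch on toy data, not the approach used in the cell above. `pd.to_numeric` with `errors="coerce"` turns values it cannot parse into `NaN`, which can then be used as a mask.

```python
import pandas as pd

# Toy data for illustration; "20.2x" mirrors the malformed value found above.
composite = pd.Series(["19.2", "19.8", "20.2x"], name="Composite")

# errors="coerce" replaces unparseable entries with NaN instead of raising.
parsed = pd.to_numeric(composite, errors="coerce")

# The entries that failed to parse are the ones to inspect by hand.
bad_values = composite[parsed.isna()]
print(bad_values.tolist())
```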

In [15]:
# For ACT 2017 drop "national" from rows
act_2017.drop(act_2017[act_2017["State"] == "National"].index, axis = 0, inplace = True)
act_2017.head()

Unnamed: 0,State,Participation,English,Math,Reading,Science,Composite
1,Alabama,100%,18.9,18.4,19.7,19.4,19.2
2,Alaska,65%,18.7,19.8,20.4,19.9,19.8
3,Arizona,62%,18.6,19.8,20.1,19.8,19.7
4,Arkansas,100%,18.9,19.0,19.7,19.5,19.4
5,California,31%,22.5,22.7,23.1,22.2,22.8


In [16]:
# For ACT 2017 convert "Participation" column from str to float using a function defined in Part 1
act_2017["Participation"] = act_2017["Participation"].map(convert_float)
act_2017.head()

Unnamed: 0,State,Participation,English,Math,Reading,Science,Composite
1,Alabama,1.0,18.9,18.4,19.7,19.4,19.2
2,Alaska,0.65,18.7,19.8,20.4,19.9,19.8
3,Arizona,0.62,18.6,19.8,20.1,19.8,19.7
4,Arkansas,1.0,18.9,19.0,19.7,19.5,19.4
5,California,0.31,22.5,22.7,23.1,22.2,22.8


In [17]:
# For SAT 2017 "Math", replace 52 with 524
sat_2017.loc[sat_2017.State == "Maryland", "Math"] = 524
sat_2017[sat_2017["State"] == "Maryland"]

Unnamed: 0,State,Participation,Evidence-Based Reading and Writing,Math,Total
20,Maryland,69%,536,524,1060


In [18]:
# For SAT 2017 convert "Participation" column from str to float using a function defined in Part 1
sat_2017["Participation"] = sat_2017["Participation"].map(convert_float)
sat_2017.head()

Unnamed: 0,State,Participation,Evidence-Based Reading and Writing,Math,Total
0,Alabama,0.05,593,572,1165
1,Alaska,0.38,547,533,1080
2,Arizona,0.3,563,553,1116
3,Arkansas,0.03,614,594,1208
4,California,0.53,531,524,1055


#### 7) Renaming columns

In [19]:
# Renaming columns in ACT 2017
new_columns_act2017 = ["state","act_participation_2017","act_english_2017","act_math_2017","act_reading_2017",
                       "act_science_2017","act_composite_2017"]
act_2017.columns = new_columns_act2017
act_2017.head()

Unnamed: 0,state,act_participation_2017,act_english_2017,act_math_2017,act_reading_2017,act_science_2017,act_composite_2017
1,Alabama,1.0,18.9,18.4,19.7,19.4,19.2
2,Alaska,0.65,18.7,19.8,20.4,19.9,19.8
3,Arizona,0.62,18.6,19.8,20.1,19.8,19.7
4,Arkansas,1.0,18.9,19.0,19.7,19.5,19.4
5,California,0.31,22.5,22.7,23.1,22.2,22.8


In [20]:
# Renaming columns in SAT 2017
new_columns_sat2017 = ["state","sat_participation_2017","sat_ebrw_2017","sat_math_2017","sat_total_2017"]
sat_2017.columns = new_columns_sat2017
sat_2017.head()

Unnamed: 0,state,sat_participation_2017,sat_ebrw_2017,sat_math_2017,sat_total_2017
0,Alabama,0.05,593,572,1165
1,Alaska,0.38,547,533,1080
2,Arizona,0.3,563,553,1116
3,Arkansas,0.03,614,594,1208
4,California,0.53,531,524,1055
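The explicit lists above give full control over the final names. Where only lowercasing and replacing separators is needed, the same convention can be applied programmatically. The sketch below is illustrative only and runs on an empty toy frame. It does not add the `sat_` prefix or `_2017` suffix used in this project.

```python
import pandas as pd

# Empty toy frame carrying only column labels.
frame = pd.DataFrame(
    columns=["State", "Participation", "Evidence-Based Reading and Writing"]
)

# Lowercase, then replace spaces and hyphens with underscores.
frame.columns = (
    frame.columns.str.lower()
    .str.replace(" ", "_", regex=False)
    .str.replace("-", "_", regex=False)
)
print(list(frame.columns))
```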


### Cleaning ACT and SAT data from 2018

#### Import datasets

In [21]:
#Import ACT and SAT 2018 data
act_2018_path = os.path.join(datapath,"act_2018.csv")
sat_2018_path = os.path.join(datapath,"sat_2018.csv")

act_2018 = pd.read_csv(act_2018_path)
sat_2018 = pd.read_csv(sat_2018_path)

#### 1) Display first 5 rows of data

In [22]:
# First 5 rows of ACT 2018
act_2018.head()

Unnamed: 0,State,Participation,Composite
0,Alabama,100%,19.1
1,Alaska,33%,20.8
2,Arizona,66%,19.2
3,Arkansas,100%,19.4
4,California,27%,22.7


In [23]:
# First 5 rows of SAT 2018
sat_2018.head()

Unnamed: 0,State,Participation,Evidence-Based Reading and Writing,Math,Total
0,Alabama,6%,595,571,1166
1,Alaska,43%,562,544,1106
2,Arizona,29%,577,572,1149
3,Arkansas,5%,592,576,1169
4,California,60%,540,536,1076


#### 2,3,5) Check for missing values and obvious issues with data

In [24]:
# Checking for missing values and data types of each feature in ACT 2018. Displays datatype hence fulfilling step 5.
print("Check for missing values and data types of each feature in ACT 2018")
print(act_2018.info())

# Max values of every column in ACT 2018. Check for obvious issues.
print("\nMax value for each column in ACT 2018")
print(act_2018.max())

# Min values of every value in ACT 2018. Check for obvious issues.
print("\nMin value for each column in ACT 2018")
print(act_2018.min())

Check for missing values and data types of each feature in ACT 2018
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   State          52 non-null     object 
 1   Participation  52 non-null     object 
 2   Composite      52 non-null     float64
dtypes: float64(1), object(2)
memory usage: 1.3+ KB
None

Max value for each column in ACT 2018
State            Wyoming
Participation        99%
Composite           25.6
dtype: object

Min value for each column in ACT 2018
State            Alabama
Participation       100%
Composite           17.7
dtype: object


**ACT 2018 observations and errors**

* No missing or null values. There are 50 states and the District of Columbia (a.k.a. Washington D.C.), which should give 51 entries under the 'state' column, but act_2018 has 52 entries. The 'national' row in act_2018 needs to be removed as it will not be used in analysis.
* 'participation' column shows a max. value of 99% and min. value of 100%. Because the column is stored as strings, these are string comparisons rather than numeric ones. Referring to the [source](https://www.act.org/content/dam/act/unsecured/documents/cccr2018/Average-Scores-by-State.pdf) of the data, we know that the max. should be 100% and the min. 7%. Requires correction.
* 'participation' column datatype is shown to be an 'object'. It needs to be converted to a float.
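When a frame has more rows than expected, comparing its state labels against another frame is a quick way to identify the extra row. The sketch below uses toy frames. Which label is actually responsible in the real file is not assumed here.

```python
import pandas as pd

# Toy frames for illustration; the real frames have 52 and 51 rows.
act = pd.DataFrame({"State": ["National", "Alabama", "Alaska"]})
sat = pd.DataFrame({"State": ["Alabama", "Alaska"]})

# Labels present in one frame but not in the other.
extra_in_act = set(act["State"]) - set(sat["State"])
print(extra_in_act)

# A repeated label would also inflate the row count.
print(act["State"].duplicated().any())
```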

In [25]:
# Checking for missing values and data types of each feature in SAT 2018. Displays datatype hence fulfilling step 5.
print("Check for missing values and data types of each feature in SAT 2018")
print(sat_2018.info())

# Max values of every column in SAT 2018. Check for obvious issues.
print("\nMax value for each column SAT 2018")
print(sat_2018.max())

# Min values of every value in SAT 2018. Check for obvious issues.
print("\nMin value for each column in SAT 2018")
print(sat_2018.min())

Check for missing values and data types of each feature in SAT 2018
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 5 columns):
 #   Column                              Non-Null Count  Dtype 
---  ------                              --------------  ----- 
 0   State                               51 non-null     object
 1   Participation                       51 non-null     object
 2   Evidence-Based Reading and Writing  51 non-null     int64 
 3   Math                                51 non-null     int64 
 4   Total                               51 non-null     int64 
dtypes: int64(3), object(2)
memory usage: 2.1+ KB
None

Max value for each column SAT 2018
State                                 Wyoming
Participation                             99%
Evidence-Based Reading and Writing        643
Math                                      655
Total                                    1298
dtype: object

Min value for each column in SAT 2018
State   

**SAT 2018 observations and errors**

* No missing or null values. There are 50 states and the District of Columbia (a.k.a. Washington D.C.) equating to 51 entries under the 'state' column.  
* 'participation' column shows a max. value of 99% and min. value of 10%. Referring to the [source](https://nces.ed.gov/programs/digest/d19/tables/dt19_226.40.asp) of the data, we know that the max. should be 100% and min. 3%. Requires correction.
* 'participation' column datatype is shown to be an 'object'. It needs to be converted to a float.

#### 4,6,8) Fix identified errors and drop unnecessary rows
**Summary of errors fixed**
* For ACT 2018 dropped "national" from rows
* For ACT 2018 converted "Participation" column from str to float using a function defined in Part 1
* For SAT 2018 converted "Participation" column from str to float using a function defined in Part 1

In [26]:
# For ACT 2018 dropped "national" from rows
act_2018.drop(act_2018[act_2018["State"] == "National"].index, axis = 0, inplace = True)
act_2018.head()

Unnamed: 0,State,Participation,Composite
0,Alabama,100%,19.1
1,Alaska,33%,20.8
2,Arizona,66%,19.2
3,Arkansas,100%,19.4
4,California,27%,22.7


In [27]:
# For ACT 2018 converted "Participation" column from str to float using a function defined in Part 1
act_2018["Participation"] = act_2018["Participation"].map(convert_float)
act_2018.head()

Unnamed: 0,State,Participation,Composite
0,Alabama,1.0,19.1
1,Alaska,0.33,20.8
2,Arizona,0.66,19.2
3,Arkansas,1.0,19.4
4,California,0.27,22.7


In [28]:
# For SAT 2018 converted "Participation" column from str to float using a function defined in Part 1
sat_2018["Participation"] = sat_2018["Participation"].map(convert_float)
sat_2018.head()

Unnamed: 0,State,Participation,Evidence-Based Reading and Writing,Math,Total
0,Alabama,0.06,595,571,1166
1,Alaska,0.43,562,544,1106
2,Arizona,0.29,577,572,1149
3,Arkansas,0.05,592,576,1169
4,California,0.6,540,536,1076


#### 7) Renaming columns

In [29]:
# Renaming columns in ACT 2018
new_columns_act2018 = ["state","act_participation_2018","act_composite_2018"]
act_2018.columns = new_columns_act2018
act_2018.head()

Unnamed: 0,state,act_participation_2018,act_composite_2018
0,Alabama,1.0,19.1
1,Alaska,0.33,20.8
2,Arizona,0.66,19.2
3,Arkansas,1.0,19.4
4,California,0.27,22.7


In [30]:
# Renaming columns in SAT 2018
new_columns_sat2018 = ["state","sat_participation_2018","sat_ebrw_2018","sat_math_2018","sat_total_2018"]
sat_2018.columns = new_columns_sat2018
sat_2018.head()

Unnamed: 0,state,sat_participation_2018,sat_ebrw_2018,sat_math_2018,sat_total_2018
0,Alabama,0.06,595,571,1166
1,Alaska,0.43,562,544,1106
2,Arizona,0.29,577,572,1149
3,Arkansas,0.05,592,576,1169
4,California,0.6,540,536,1076


### Cleaning college going rate data from 2018

#### Import datasets

In [31]:
#Import college going rate 2018 data
college_going_path = os.path.join(datapath,"college_going_rate_2018.csv")

college_going_rate_2018 = pd.read_csv(college_going_path)

#### 1) Display first 5 rows of data

In [32]:
# First 5 rows of college going rate 2018
college_going_rate_2018.head()

Unnamed: 0,State,Percent of High School Graduates Going Directly to College (%),Projected High School Graduates - 2018,First-Time Freshmen Directly from High School Enrolled Anywhere in the US - Fall 2018
0,Alabama,65.95,48690,32111
1,Alaska,41.14,7758,3192
2,Arizona,50.09,68985,34558
3,Arkansas,63.13,31315,19770
4,California,66.01,431009,284529


#### 2,3,5) Check for missing values and obvious issues with data

In [33]:
# Checking for missing values and data types of each feature in college going rate 2018. Displays datatype hence fulfilling step 5.
print("Check for missing values and data types of each feature in college going rate 2018")
print(college_going_rate_2018.info())

# Max values of every column in college going rate 2018. Check for obvious issues.
print("\nMax value for each column in college going rate 2018")
print(college_going_rate_2018.max())

# Min values of every value in college going rate 2018. Check for obvious issues.
print("\nMin value for each column in college going rate 2018")
print(college_going_rate_2018.min())

Check for missing values and data types of each feature in college going rate 2018
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 4 columns):
 #   Column                                                                                 Non-Null Count  Dtype  
---  ------                                                                                 --------------  -----  
 0   State                                                                                  51 non-null     object 
 1   Percent of High School Graduates Going Directly to College (%)                         51 non-null     float64
 2   Projected High School Graduates - 2018                                                 51 non-null     object 
 3   First-Time Freshmen Directly from High School Enrolled Anywhere in the US - Fall 2018  51 non-null     object 
dtypes: float64(1), object(3)
memory usage: 1.7+ KB
None

Max value for each column in college going rate 2018
State   

**College going rate 2018 observations and errors**

* No missing or null values. The data covers 50 states (the District of Columbia, a.k.a. Washington D.C., is not included) plus a 'Nation' row, giving 51 entries under the 'state' column. The 'Nation' row in college_going_rate_2018 needs to be removed as it will not be used in analysis.
* 'Percent of High School Graduates Going Directly to College' column datatype is shown to be a 'float', but its values are on a 0-100 scale. It needs to be converted to a decimal fraction (0-1) to match the participation columns in the other datasets.
* Columns 'Projected High School Graduates - 2018' and 'First-Time Freshmen Directly from High School Enrolled Anywhere in the US - Fall 2018' need to be dropped as they will not be used in analysis.

#### 4,6,8) Fix identified errors and drop unnecessary rows/columns
**Summary of errors fixed**
* For College going rate 2018 dropped "nation" from rows
* For College going rate 2018 converted "Percent of High School Graduates Going Directly to College (%)" column from a percentage to a decimal fraction, rounded to 2 decimal places
* For College going rate 2018 dropped 'Projected High School Graduates - 2018' and 'First-Time Freshmen Directly from High School Enrolled Anywhere in the US - Fall 2018' from columns

In [34]:
# For College going rate 2018 drop "nation" from rows
college_going_rate_2018.drop(college_going_rate_2018[college_going_rate_2018["State"] == "Nation"].index, axis = 0, inplace = True)
college_going_rate_2018.tail()

Unnamed: 0,State,Percent of High School Graduates Going Directly to College (%),Projected High School Graduates - 2018,First-Time Freshmen Directly from High School Enrolled Anywhere in the US - Fall 2018
45,Virginia,69.0,90213,62243
46,Washington,53.23,70411,37480
47,West Virginia,54.85,17447,9570
48,Wisconsin,58.28,65548,38202
49,Wyoming,56.12,5864,3291


In [35]:
# For College going rate 2018, convert the "Percent of High School Graduates Going Directly to College (%)" column from a percentage to a proportion (divide by 100, round to 2 decimal places)
college_going_rate_2018["Percent of High School Graduates Going Directly to College (%)"] = college_going_rate_2018["Percent of High School Graduates Going Directly to College (%)"].div(100).round(2)
college_going_rate_2018.head()

Unnamed: 0,State,Percent of High School Graduates Going Directly to College (%),Projected High School Graduates - 2018,First-Time Freshmen Directly from High School Enrolled Anywhere in the US - Fall 2018
0,Alabama,0.66,48690,32111
1,Alaska,0.41,7758,3192
2,Arizona,0.5,68985,34558
3,Arkansas,0.63,31315,19770
4,California,0.66,431009,284529


In [36]:
#For College going rate 2018 dropped 'Projected High School Graduates - 2018' and 'First-Time Freshmen Directly from High School Enrolled Anywhere in the US - Fall 2018' from columns
college_going_rate_2018.drop(["Projected High School Graduates - 2018", "First-Time Freshmen Directly from High School Enrolled Anywhere in the US - Fall 2018"], axis = 1, inplace = True)
college_going_rate_2018.head()

Unnamed: 0,State,Percent of High School Graduates Going Directly to College (%)
0,Alabama,0.66
1,Alaska,0.41
2,Arizona,0.5
3,Arkansas,0.63
4,California,0.66
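
The three cleaning steps above can also be written as a single method chain that returns a new DataFrame rather than mutating the original with `inplace = True`. The following is only a sketch of that alternative, not the approach used in this notebook. The `raw` frame is a toy stand-in: the Washington and Wyoming rows are copied from the earlier output, while the "Nation" row holds placeholder values.

```python
import pandas as pd

pct_col = "Percent of High School Graduates Going Directly to College (%)"
grads_col = "Projected High School Graduates - 2018"
freshmen_col = "First-Time Freshmen Directly from High School Enrolled Anywhere in the US - Fall 2018"

# Toy stand-in for college_going_rate_2018.
# The "Nation" values are placeholders, not real figures.
raw = pd.DataFrame({
    "State": ["Washington", "Wyoming", "Nation"],
    pct_col: [53.23, 56.12, 0.0],
    grads_col: ["70411", "5864", "0"],
    freshmen_col: ["37480", "3291", "0"],
})

cleaned = (
    raw.loc[raw["State"] != "Nation"]                                 # drop the aggregate row
       .assign(**{pct_col: lambda d: d[pct_col].div(100).round(2)})   # percentage -> proportion
       .drop(columns=[grads_col, freshmen_col])                       # drop unused columns
       .reset_index(drop=True)
)
print(cleaned)
```

Because nothing is modified in place, `raw` stays available for re-running or checking the steps.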


#### 7) Renaming columns

In [37]:
# Renaming columns in college going rate 2018
new_columns_cgr2018 = ["state","cgr_2018"]
college_going_rate_2018.columns = new_columns_cgr2018
college_going_rate_2018.head()

Unnamed: 0,state,cgr_2018
0,Alabama,0.66
1,Alaska,0.41
2,Arizona,0.5
3,Arkansas,0.63
4,California,0.66


### 9,10,11) Merge datasets, additional cleaning and converting to csv files

#### Merge ACT 2017 and 2018

In [38]:
#Merge ACT 2017 and ACT 2018
act_final = pd.merge(act_2017, act_2018, on= "state", how="left")
act_final.head()

Unnamed: 0,state,act_participation_2017,act_english_2017,act_math_2017,act_reading_2017,act_science_2017,act_composite_2017,act_participation_2018,act_composite_2018
0,Alabama,1.0,18.9,18.4,19.7,19.4,19.2,1.0,19.1
1,Alaska,0.65,18.7,19.8,20.4,19.9,19.8,0.33,20.8
2,Arizona,0.62,18.6,19.8,20.1,19.8,19.7,0.66,19.2
3,Arkansas,1.0,18.9,19.0,19.7,19.5,19.4,1.0,19.4
4,California,0.31,22.5,22.7,23.1,22.2,22.8,0.27,22.7


In [39]:
#Find the change in ACT participation from 2017 to 2018 (2018 rate minus 2017 rate, as a proportion) and add it as a new column in the dataset
act_final["act_participation_change"] = act_final["act_participation_2018"] - act_final["act_participation_2017"]
act_final.head()

Unnamed: 0,state,act_participation_2017,act_english_2017,act_math_2017,act_reading_2017,act_science_2017,act_composite_2017,act_participation_2018,act_composite_2018,act_participation_change
0,Alabama,1.0,18.9,18.4,19.7,19.4,19.2,1.0,19.1,0.0
1,Alaska,0.65,18.7,19.8,20.4,19.9,19.8,0.33,20.8,-0.32
2,Arizona,0.62,18.6,19.8,20.1,19.8,19.7,0.66,19.2,0.04
3,Arkansas,1.0,18.9,19.0,19.7,19.5,19.4,1.0,19.4,0.0
4,California,0.31,22.5,22.7,23.1,22.2,22.8,0.27,22.7,-0.04
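
A left merge keeps every row of the left frame and fills the right-hand columns with `NaN` when a key has no match, so a spelling or capitalization difference in `state` can go unnoticed. The `act_sat_final` summary later in this notebook lists 51 non-null values out of 52 for `act_participation_2018`, which may be a gap of this kind. Below is a minimal sketch of one way to surface unmatched keys, using the `indicator` and `validate` parameters of `pd.merge`. The frames are toy data: the Alabama and Alaska values come from the output above, while the District of Columbia values and its lower-case spelling are hypothetical.

```python
import pandas as pd

# Toy frames; the District of Columbia rows are hypothetical placeholders.
left = pd.DataFrame({
    "state": ["Alabama", "Alaska", "District of Columbia"],
    "act_participation_2017": [1.0, 0.65, 0.5],
})
right = pd.DataFrame({
    "state": ["Alabama", "Alaska", "District of columbia"],  # note the lower-case "c"
    "act_participation_2018": [1.0, 0.33, 0.5],
})

merged = pd.merge(
    left, right, on="state", how="left",
    validate="one_to_one",   # raises if a state appears more than once on either side
    indicator=True,          # adds a "_merge" column describing where each row matched
)

# Rows that exist only in the left frame did not find a partner.
unmatched = merged.loc[merged["_merge"] == "left_only", "state"].tolist()
print(unmatched)
```

Any state listed in `unmatched` would carry `NaN` in the 2018 columns and in the derived change column.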


#### Merge SAT 2017 and 2018

In [40]:
#Merge SAT 2017 and SAT 2018
sat_final = pd.merge(sat_2017, sat_2018, on= "state", how="left")
sat_final.head()

Unnamed: 0,state,sat_participation_2017,sat_ebrw_2017,sat_math_2017,sat_total_2017,sat_participation_2018,sat_ebrw_2018,sat_math_2018,sat_total_2018
0,Alabama,0.05,593,572,1165,0.06,595,571,1166
1,Alaska,0.38,547,533,1080,0.43,562,544,1106
2,Arizona,0.3,563,553,1116,0.29,577,572,1149
3,Arkansas,0.03,614,594,1208,0.05,592,576,1169
4,California,0.53,531,524,1055,0.6,540,536,1076


In [41]:
#Find the change in SAT participation from 2017 to 2018 (2018 rate minus 2017 rate, as a proportion) and add it as a new column in the dataset
sat_final["sat_participation_change"] = sat_final["sat_participation_2018"] - sat_final["sat_participation_2017"]
sat_final.head()

Unnamed: 0,state,sat_participation_2017,sat_ebrw_2017,sat_math_2017,sat_total_2017,sat_participation_2018,sat_ebrw_2018,sat_math_2018,sat_total_2018,sat_participation_change
0,Alabama,0.05,593,572,1165,0.06,595,571,1166,0.01
1,Alaska,0.38,547,533,1080,0.43,562,544,1106,0.05
2,Arizona,0.3,563,553,1116,0.29,577,572,1149,-0.01
3,Arkansas,0.03,614,594,1208,0.05,592,576,1169,0.02
4,California,0.53,531,524,1055,0.6,540,536,1076,0.07


#### Merge SAT and ACT 2017-2018 datasets

In [42]:
act_sat_final = pd.merge(act_final, sat_final, on= "state", how="left")
act_sat_final.head()

Unnamed: 0,state,act_participation_2017,act_english_2017,act_math_2017,act_reading_2017,act_science_2017,act_composite_2017,act_participation_2018,act_composite_2018,act_participation_change,sat_participation_2017,sat_ebrw_2017,sat_math_2017,sat_total_2017,sat_participation_2018,sat_ebrw_2018,sat_math_2018,sat_total_2018,sat_participation_change
0,Alabama,1.0,18.9,18.4,19.7,19.4,19.2,1.0,19.1,0.0,0.05,593,572,1165,0.06,595,571,1166,0.01
1,Alaska,0.65,18.7,19.8,20.4,19.9,19.8,0.33,20.8,-0.32,0.38,547,533,1080,0.43,562,544,1106,0.05
2,Arizona,0.62,18.6,19.8,20.1,19.8,19.7,0.66,19.2,0.04,0.3,563,553,1116,0.29,577,572,1149,-0.01
3,Arkansas,1.0,18.9,19.0,19.7,19.5,19.4,1.0,19.4,0.0,0.03,614,594,1208,0.05,592,576,1169,0.02
4,California,0.31,22.5,22.7,23.1,22.2,22.8,0.27,22.7,-0.04,0.53,531,524,1055,0.6,540,536,1076,0.07


#### Convert ACT and SAT 2017-2018 into a csv

In [53]:
act_sat_final.to_csv("act_sat_final.csv", index = False)
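
Passing `index = False` matters when the file is read back: without it, pandas writes the row index as an unnamed first column, which comes back as `Unnamed: 0`. A small sketch of the difference, using an in-memory buffer instead of a file and a toy two-row frame:

```python
import io
import pandas as pd

toy = pd.DataFrame({"state": ["Alabama", "Alaska"], "cgr_2018": [0.66, 0.41]})

# Default behaviour: the index is written as an extra, unnamed column.
buffer_with_index = io.StringIO()
toy.to_csv(buffer_with_index)
buffer_with_index.seek(0)
cols_with_index = pd.read_csv(buffer_with_index).columns.tolist()

# With index=False only the real columns are written.
buffer_without_index = io.StringIO()
toy.to_csv(buffer_without_index, index=False)
buffer_without_index.seek(0)
cols_without_index = pd.read_csv(buffer_without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'state', 'cgr_2018']
print(cols_without_index)  # ['state', 'cgr_2018']
```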

#### Additional step for later analysis with college going rate

In [44]:
#Drop row District of Columbia (a.k.a. Washington D.C.) from act_sat_final
act_sat_final.drop(act_sat_final[act_sat_final["state"] == "District of Columbia"].index, axis = 0, inplace = True)
act_sat_final[:13]

Unnamed: 0,state,act_participation_2017,act_english_2017,act_math_2017,act_reading_2017,act_science_2017,act_composite_2017,act_participation_2018,act_composite_2018,act_participation_change,sat_participation_2017,sat_ebrw_2017,sat_math_2017,sat_total_2017,sat_participation_2018,sat_ebrw_2018,sat_math_2018,sat_total_2018,sat_participation_change
0,Alabama,1.0,18.9,18.4,19.7,19.4,19.2,1.0,19.1,0.0,0.05,593,572,1165,0.06,595,571,1166,0.01
1,Alaska,0.65,18.7,19.8,20.4,19.9,19.8,0.33,20.8,-0.32,0.38,547,533,1080,0.43,562,544,1106,0.05
2,Arizona,0.62,18.6,19.8,20.1,19.8,19.7,0.66,19.2,0.04,0.3,563,553,1116,0.29,577,572,1149,-0.01
3,Arkansas,1.0,18.9,19.0,19.7,19.5,19.4,1.0,19.4,0.0,0.03,614,594,1208,0.05,592,576,1169,0.02
4,California,0.31,22.5,22.7,23.1,22.2,22.8,0.27,22.7,-0.04,0.53,531,524,1055,0.6,540,536,1076,0.07
5,Colorado,1.0,20.1,20.3,21.2,20.9,20.8,0.3,23.9,-0.7,0.11,606,595,1201,1.0,519,506,1025,0.89
6,Connecticut,0.31,25.5,24.6,25.6,24.6,25.2,0.26,25.6,-0.05,1.0,530,512,1041,1.0,535,519,1053,0.0
7,Delaware,0.18,24.1,23.4,24.8,23.6,24.1,0.17,23.8,-0.01,1.0,503,492,996,1.0,505,492,998,0.0
9,Florida,0.73,19.0,19.4,21.0,19.4,19.8,0.66,19.9,-0.07,0.83,520,497,1017,0.56,550,549,1099,-0.27
10,Georgia,0.55,21.0,20.9,22.0,21.3,21.4,0.53,21.4,-0.02,0.61,535,515,1050,0.7,542,522,1064,0.09
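
Note that the index above jumps from 7 to 9 once the District of Columbia row is gone. As an alternative sketch (not the method used in the cell above), the same row can be excluded with a boolean mask, and `reset_index(drop=True)` renumbers the rows. The frame is a toy example: the Delaware and Florida values are copied from the output, and the District of Columbia value is a placeholder.

```python
import pandas as pd

# Toy frame; the District of Columbia value is a placeholder.
toy = pd.DataFrame({
    "state": ["Delaware", "District of Columbia", "Florida"],
    "act_participation_2017": [0.18, 0.0, 0.73],
})

# Keep every row whose state is not District of Columbia, then renumber.
filtered = toy[toy["state"] != "District of Columbia"].reset_index(drop=True)
print(filtered)
```

Whether to reset the index is a judgment call; the later merge is keyed on `state`, so it does not depend on the index values.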


#### Merge SAT-ACT 2017-2018 with college going rate 2018

In [45]:
act_sat_cgr_final = pd.merge(act_sat_final,college_going_rate_2018, on= "state", how="left")
act_sat_cgr_final.head()

Unnamed: 0,state,act_participation_2017,act_english_2017,act_math_2017,act_reading_2017,act_science_2017,act_composite_2017,act_participation_2018,act_composite_2018,act_participation_change,sat_participation_2017,sat_ebrw_2017,sat_math_2017,sat_total_2017,sat_participation_2018,sat_ebrw_2018,sat_math_2018,sat_total_2018,sat_participation_change,cgr_2018
0,Alabama,1.0,18.9,18.4,19.7,19.4,19.2,1.0,19.1,0.0,0.05,593,572,1165,0.06,595,571,1166,0.01,0.66
1,Alaska,0.65,18.7,19.8,20.4,19.9,19.8,0.33,20.8,-0.32,0.38,547,533,1080,0.43,562,544,1106,0.05,0.41
2,Arizona,0.62,18.6,19.8,20.1,19.8,19.7,0.66,19.2,0.04,0.3,563,553,1116,0.29,577,572,1149,-0.01,0.5
3,Arkansas,1.0,18.9,19.0,19.7,19.5,19.4,1.0,19.4,0.0,0.03,614,594,1208,0.05,592,576,1169,0.02,0.63
4,California,0.31,22.5,22.7,23.1,22.2,22.8,0.27,22.7,-0.04,0.53,531,524,1055,0.6,540,536,1076,0.07,0.66


#### Convert ACT,SAT and CGR dataset into a csv

In [54]:
act_sat_cgr_final.to_csv("act_sat_cgr_final.csv", index = False)

### Data Dictionary

Now that we've fixed our data, and given it appropriate names, let's create a [data dictionary](http://library.ucmerced.edu/node/10249). 

A data dictionary provides a quick overview of features/variables/columns, alongside data types and descriptions. The more descriptive you can be, the more useful this document is.

Example of a Fictional Data Dictionary Entry: 

|Feature|Type|Dataset|Description|
|---|---|---|---|
|**county_pop**|*integer*|2010 census|The population of the county (units in thousands, where 2.5 represents 2500 people).| 
|**per_poverty**|*float*|2010 census|The percent of the county over the age of 18 living below the 200% of official US poverty rate (units percent to two decimal places 98.10 means 98.1%)|

[Here's a quick link to a short guide for formatting markdown in Jupyter notebooks](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html).

Provided is the skeleton for formatting a markdown table, with column headers that will help you create a data dictionary to quickly summarize your data, as well as some examples. **This would be a great thing to copy and paste into your custom README for this project.**

*Note*: if you are unsure of what a feature is, check the source of the data! This can be found in the README.

**To-Do:** *Edit the table below to create your own data dictionary for the datasets you chose.*

|Feature|Type|Dataset|Description|
|---|---|---|---|
|column name|int/float/object|ACT/SAT|This is an example| 
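
One way to get a head start on the table is to generate the Feature and Type columns from the DataFrame itself and fill in the descriptions by hand. The helper below, `data_dictionary_rows`, is a hypothetical name introduced only for this sketch, and the frame is a toy example.

```python
import pandas as pd

def data_dictionary_rows(df, dataset_name):
    """Return a markdown table with one row per column of df.

    Descriptions are left as TODO placeholders to be written by hand.
    """
    lines = ["|Feature|Type|Dataset|Description|", "|---|---|---|---|"]
    lines += [
        f"|**{col}**|*{dtype}*|{dataset_name}|TODO|"
        for col, dtype in df.dtypes.items()
    ]
    return "\n".join(lines)

toy = pd.DataFrame({"state": ["Alabama"], "cgr_2018": [0.66]})
table = data_dictionary_rows(toy, "College going rate 2018")
print(table)
```

The printed text can be pasted into a markdown cell or the README and the TODO entries replaced with real descriptions.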


In [47]:
# Import act_sat_final and act_sat_cgr_final
# Reference: https://stackoverflow.com/questions/56100674/i-cannot-import-csv-file
#datapath = "../data"

#act_sat_final_path = os.path.join(datapath,"act_sat_final.csv")
#act_sat_cgr_final_path = os.path.join(datapath,"act_sat_cgr_final.csv")

#act_sat_final = pd.read_csv(act_sat_final_path)
#act_sat_cgr_final = pd.read_csv(act_sat_cgr_final_path)

In [52]:
# Details of each data set
#act_sat_final.info()
#act_sat_final.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Unnamed: 0                52 non-null     int64  
 1   state                     52 non-null     object 
 2   act_participation_2017    52 non-null     float64
 3   act_english_2017          52 non-null     float64
 4   act_math_2017             52 non-null     float64
 5   act_reading_2017          52 non-null     float64
 6   act_science_2017          52 non-null     float64
 7   act_composite_2017        52 non-null     float64
 8   act_participation_2018    51 non-null     float64
 9   act_composite_2018        51 non-null     float64
 10  act_participation_change  51 non-null     float64
 11  sat_participation_2017    52 non-null     float64
 12  sat_ebrw_2017             52 non-null     int64  
 13  sat_math_2017             52 non-null     int64  
 14  sat_total_20

Unnamed: 0.1,Unnamed: 0,state,act_participation_2017,act_english_2017,act_math_2017,act_reading_2017,act_science_2017,act_composite_2017,act_participation_2018,act_composite_2018,act_participation_change,sat_participation_2017,sat_ebrw_2017,sat_math_2017,sat_total_2017,sat_participation_2018,sat_ebrw_2018,sat_math_2018,sat_total_2018,sat_participation_change
0,0,Alabama,1.0,18.9,18.4,19.7,19.4,19.2,1.0,19.1,0.0,0.05,593,572,1165,0.06,595,571,1166,0.01
1,1,Alaska,0.65,18.7,19.8,20.4,19.9,19.8,0.33,20.8,-0.32,0.38,547,533,1080,0.43,562,544,1106,0.05
2,2,Arizona,0.62,18.6,19.8,20.1,19.8,19.7,0.66,19.2,0.04,0.3,563,553,1116,0.29,577,572,1149,-0.01
3,3,Arkansas,1.0,18.9,19.0,19.7,19.5,19.4,1.0,19.4,0.0,0.03,614,594,1208,0.05,592,576,1169,0.02
4,4,California,0.31,22.5,22.7,23.1,22.2,22.8,0.27,22.7,-0.04,0.53,531,524,1055,0.6,540,536,1076,0.07


In [51]:
act_sat_cgr_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Unnamed: 0                51 non-null     int64  
 1   state                     51 non-null     object 
 2   act_participation_2017    51 non-null     float64
 3   act_english_2017          51 non-null     float64
 4   act_math_2017             51 non-null     float64
 5   act_reading_2017          51 non-null     float64
 6   act_science_2017          51 non-null     float64
 7   act_composite_2017        51 non-null     float64
 8   act_participation_2018    51 non-null     float64
 9   act_composite_2018        51 non-null     float64
 10  act_participation_change  51 non-null     float64
 11  sat_participation_2017    51 non-null     float64
 12  sat_ebrw_2017             51 non-null     int64  
 13  sat_math_2017             51 non-null     int64  
 14  sat_total_20

## Exploratory Data Analysis

Complete the following steps to explore your data. You are welcome to do more EDA than the steps outlined here as you feel necessary:
1. Summary Statistics.
2. Use a **dictionary comprehension** to apply the standard deviation function you create in part 1 to each numeric column in the dataframe.  **No loops**.
    - Assign the output to variable `sd` as a dictionary where: 
        - Each column name is now a key 
        - The standard deviation of the column is the value 
        - *Example Output :* `{'ACT_Math': 120, 'ACT_Reading': 120, ...}`
3. Investigate trends in the data.
    - Using sorting and/or masking (along with the `.head()` method to avoid printing our entire dataframe), consider questions relevant to your problem statement. Some examples are provided below (but feel free to change these questions for your specific problem):
        - Which states have the highest and lowest participation rates for the 2017, 2018, or 2019 SAT and ACT?
        - Which states have the highest and lowest mean total/composite scores for the 2017, 2018, or 2019 SAT and ACT?
        - Do any states with 100% participation on a given test have a rate change year-to-year?
        - Do any states have >50% participation on *both* tests each year?
        - Which colleges have the highest median SAT and ACT scores for admittance?
        - Which California school districts have the highest and lowest mean test scores?
    - **You should comment on your findings at each step in a markdown cell below your code block**. Make sure you include at least one example of sorting your dataframe by a column, and one example of using boolean filtering (i.e., masking) to select a subset of the dataframe.
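
Step 2 above can be sketched as follows. The Part 1 function is not shown in this section, so `std_dev` below is a stand-in that assumes a population standard deviation (dividing by `n`). The toy frame uses three rows copied from the merged output earlier in the notebook.

```python
import pandas as pd

def std_dev(values):
    """Population standard deviation; a stand-in for the Part 1 function."""
    values = list(values)
    mean = sum(values) / len(values)
    return (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5

toy = pd.DataFrame({
    "state": ["Alabama", "Alaska", "Arizona"],
    "act_composite_2017": [19.2, 19.8, 19.7],
    "sat_total_2017": [1165, 1080, 1116],
})

# Dictionary comprehension over the numeric columns only:
# column name -> standard deviation of that column.
sd = {
    col: std_dev(toy[col])
    for col in toy.select_dtypes(include="number").columns
}
print(sd)
```

`select_dtypes(include="number")` skips the `state` column, so the function is only applied where a standard deviation is defined. Note that pandas' own `.std()` defaults to the sample formula (`ddof=1`), so its result will differ slightly from this stand-in unless `ddof=0` is passed.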

In [48]:
#Code:

**To-Do:** *Edit this cell with your findings on trends in the data (step 3 above).*
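
As a starting point for step 3, here is a minimal sketch of one sort and one boolean mask. It uses only the five rows shown in the merged output earlier, so the results describe this toy subset and not the full dataset.

```python
import pandas as pd

# Five rows copied from the act_sat_final output above.
toy = pd.DataFrame({
    "state": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
    "act_participation_2017": [1.0, 0.65, 0.62, 1.0, 0.31],
    "act_participation_2018": [1.0, 0.33, 0.66, 1.0, 0.27],
})

# Sorting: the three lowest ACT participation rates in 2018.
lowest_act_2018 = toy.sort_values("act_participation_2018").head(3)

# Masking: states with 100% ACT participation in both years.
full_both_years = toy[
    (toy["act_participation_2017"] == 1.0)
    & (toy["act_participation_2018"] == 1.0)
]

print(lowest_act_2018[["state", "act_participation_2018"]])
print(full_both_years["state"].tolist())
```

Each condition in the mask needs its own parentheses because `&` binds more tightly than `==` in Python.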

## Visualize the Data

There's not a magic bullet recommendation for the right number of plots to understand a given dataset, but visualizing your data is *always* a good idea. Not only does it allow you to quickly convey your findings (even if you have a non-technical audience), it will often reveal trends in your data that escaped you when you were looking only at numbers. It is important to not only create visualizations, but to **interpret your visualizations** as well.

**Every plot should**:
- Have a title
- Have axis labels
- Have appropriate tick labels
- Have legible text
- Demonstrate meaningful and valid relationships
- Have an interpretation to aid understanding

Here is an example of what your plots should look like following the above guidelines. Note that while the content of this example is unrelated, the principles of visualization hold:

![](https://snag.gy/hCBR1U.jpg)
*Interpretation: The above image shows that as we increase our spending on advertising, our sales numbers also tend to increase. There is a positive correlation between advertising spending and sales.*

---

Here are some prompts to get you started with visualizations. Feel free to add additional visualizations as you see fit:
1. Use Seaborn's heatmap with pandas `.corr()` to visualize correlations between all numeric features.
    - Heatmaps are generally not appropriate for presentations, and should often be excluded from reports as they can be visually overwhelming. **However**, they can be extremely useful in identifying relationships of potential interest (as well as identifying potential collinearity before modeling).
    - Please take time to format your output, adding a title. Look through some of the additional arguments and options. (Axis labels aren't really necessary, as long as the title is informative).
2. Visualize distributions using histograms. If you have a lot, consider writing a custom function and use subplots.
    - *OPTIONAL*: Summarize the underlying distributions of your features (in words & statistics)
         - Be thorough in your verbal description of these distributions.
         - Be sure to back up these summaries with statistics.
         - We generally assume that data we sample from a population will be normally distributed. Do we observe this trend? Explain your answers for each distribution and how you think this will affect estimates made from these data.
3. Plot and interpret boxplots. 
    - Boxplots demonstrate central tendency and spread in variables. In a certain sense, these are somewhat redundant with histograms, but you may be better able to identify clear outliers or differences in IQR, etc.
    - Multiple values can be plotted to a single boxplot as long as they are of the same relative scale (meaning they have similar min/max values).
    - Each boxplot should:
        - Only include variables of a similar scale
        - Have clear labels for each variable
        - Have appropriate titles and labels
4. Plot and interpret scatter plots to view relationships between features. Feel free to write a custom function, and subplot if you'd like. Functions save both time and space.
    - Your plots should have:
        - Two clearly labeled axes
        - A proper title
        - Colors and symbols that are clear and unmistakable
5. Additional plots of your choosing.
    - Are there any additional trends or relationships you haven't explored? Was there something interesting you saw that you'd like to dive further into? It's likely that there are a few more plots you might want to generate to support your narrative and recommendations that you are building toward. **As always, make sure you're interpreting your plots as you go**.
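
For prompt 2, a custom function with subplots could look like the sketch below. The name `plot_histograms` and its parameters are hypothetical, and the frame holds only the five rows shown earlier, so the shapes are not meaningful. The `matplotlib.use("Agg")` line is there so the sketch runs without a display; in a notebook it would normally be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

def plot_histograms(df, columns, titles, xlabels, bins=10):
    """Draw one histogram per column, side by side, each with title and axis labels."""
    fig, axes = plt.subplots(1, len(columns), figsize=(5 * len(columns), 4), squeeze=False)
    for ax, col, title, xlabel in zip(axes[0], columns, titles, xlabels):
        ax.hist(df[col], bins=bins)
        ax.set_title(title)
        ax.set_xlabel(xlabel)
        ax.set_ylabel("Number of states")
    fig.tight_layout()
    return fig, axes[0]

# Five rows copied from the act_sat_final output above.
toy = pd.DataFrame({
    "act_participation_2018": [1.0, 0.33, 0.66, 1.0, 0.27],
    "sat_participation_2018": [0.06, 0.43, 0.29, 0.05, 0.60],
})

fig, axes = plot_histograms(
    toy,
    columns=["act_participation_2018", "sat_participation_2018"],
    titles=["ACT participation, 2018", "SAT participation, 2018"],
    xlabels=["Participation rate", "Participation rate"],
)
```

Passing titles and labels as arguments keeps each plot in line with the checklist above without repeating the plotting code.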

In [49]:
# Code

## Conclusions and Recommendations

Based on your exploration of the data, what are your key takeaways and recommendations? Make sure to answer your question of interest or address your problem statement here.

**To-Do:** *Edit this cell with your conclusions and recommendations.*

Don't forget to create your README!

**To-Do:** *If you combine your problem statement, data dictionary, brief summary of your analysis, and conclusions/recommendations, you have an amazing README.md file that quickly aligns your audience to the contents of your project.* Don't forget to cite your data sources!