# Project 1: SAT & ACT Analysis

The first markdown cell in a notebook is a great place to provide an overview of your entire project. If you want to, you can also use relative links to direct your audience to various sections of a notebook. **HERE'S A DEMONSTRATION WITH THE CURRENT SECTION HEADERS**:

### Contents:
- [Data Acquisition & Grooming](#Step-1%3A-Data-Acquisition-%26amp%3B-Grooming)
- [Dataframe Manipulation](#Step-2%3A-Manipulate-the-dataframe)
- [Data Visualization](#Step-3%3A-Visualize-the-data)
- [Descriptive and Inferential Statistics](#Step-4%3A-Descriptive-and-Inferential-Statistics)

*All libraries used should be added here*

In [None]:
#Imports:

## Step 1: Data Acquisition & Grooming


#### 1. Read In SAT & ACT Data

Read in the `sat_2017.csv` and `act_2017.csv` files and assign them to appropriately named pandas dataframes.

In [None]:
#Code:

#### 2. Display Data

Print the first 10 rows of each dataframe to your Jupyter notebook.

In [None]:
#Code:
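
A minimal sketch of steps 1 and 2. The in-memory CSV text and its column names are made-up stand-ins so the example runs on its own; in the project you would pass the relative path of each file instead (for example `pd.read_csv("sat_2017.csv")`, adjusted to wherever your data lives).

```python
import io

import pandas as pd

# Toy stand-in for a CSV file; the columns and values are made up.
toy_csv = io.StringIO(
    "State,Participation,Math\n"
    "State A,5%,600\n"
    "State B,38%,540\n"
)

# pd.read_csv accepts either a file path or a file-like object.
sat_toy = pd.read_csv(toy_csv)

# head(10) returns up to the first 10 rows of the dataframe.
first_rows = sat_toy.head(10)
```

In a notebook, leaving `first_rows` as the last expression in a cell renders it as a table.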

#### 3. Verbally Describe Data

Take your time looking through the data and thoroughly describe it in the markdown cell below.

Answer:

#### 4a. Does the data look complete? 

Answer:

#### 4b. Are there any obvious issues with the observations?

**What is the minimum *possible* value for each test/subtest? What is the maximum *possible* value?**

Consider comparing any questionable values to the sources of your data:
- [SAT](https://blog.prepscholar.com/average-sat-scores-by-state-most-recent)
- [ACT](https://blog.prepscholar.com/act-scores-by-state-averages-highs-and-lows)

Answer:

#### 4c. Fix any errors you identified

**The data is available** so there's no need to guess or calculate anything. If you didn't find any errors, continue to the next step.

In [None]:
#code

#### 5. What are your data types? 
Display the data types of each feature. 

In [None]:
#code

What did you learn?
- Do any of them seem odd?  
- Which ones are not as they should be?  

Answer:

#### 6. Fix Incorrect Data Types
Based on what you discovered above, use appropriate methods to re-type incorrectly typed data.
- Define a function that will allow you to convert participation rates to an appropriate numeric type. Use `map` or `apply` to change these columns in each dataframe.

In [None]:
#code
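
One possible shape for the conversion function, as a sketch. It assumes participation rates are stored as strings such as `"38%"`; the column name `Participation` and the values are hypothetical.

```python
import pandas as pd


def percent_to_float(value):
    """Convert a string such as '38%' to the float 38.0."""
    return float(str(value).strip().rstrip("%"))


# Toy frame with a hypothetical column name; the values are made up.
sat_toy = pd.DataFrame({"Participation": ["5%", "38%", "100%"]})

# Series.map applies the function to every element of the column.
sat_toy["Participation"] = sat_toy["Participation"].map(percent_to_float)
```

Whether to keep the result on a 0 to 100 scale or divide by 100 is a design choice; just apply the same choice to both dataframes.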

- Fix any individual values preventing other columns from being the appropriate type.

In [1]:
#code

- Finish your data modifications by making sure the columns are now typed appropriately.

In [2]:
#code
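
The two bullets above can be sketched as follows. The column name `Composite`, the malformed entry `"20.2x"`, and its correction are all made up for illustration; in the project, take the correct value from your data source.

```python
import pandas as pd

# Toy column with one made-up malformed entry ("20.2x").
act_toy = pd.DataFrame({"Composite": ["21.5", "19.8", "20.2x"]})

# errors="coerce" turns unparseable strings into NaN, which exposes the bad rows.
as_numbers = pd.to_numeric(act_toy["Composite"], errors="coerce")
bad_rows = act_toy[as_numbers.isna()]

# Correct the offending cell (hypothetical fix), then re-type the whole column.
act_toy.loc[2, "Composite"] = "20.2"
act_toy["Composite"] = act_toy["Composite"].astype(float)
```

Inspecting `bad_rows` before fixing anything shows exactly which observations block the conversion.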

- Display the data types again to confirm they are correct.

In [None]:
#Code:

#### 7. Rename Columns
Change the names of the columns to more expressive names such that you can tell the difference between the SAT columns and the ACT columns. Your solution should map all column names being changed at once (no repeated singular name-changes).

**Guidelines**:
- Column names should be all lowercase (you will thank yourself when you start pushing data to SQL later in the course)
- Column names should not contain spaces (underscores will suffice--this allows for using the `df.column_name` method to access columns in addition to `df['column_name']`).
- Column names should be unique and informative (the only feature that we actually share between dataframes is the state).

In [None]:
#code
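
A sketch of renaming every column in one call with a dictionary passed to `DataFrame.rename`. Both the original and the new column names below are hypothetical.

```python
import pandas as pd

# Toy frame; the original column names here are hypothetical.
sat_toy = pd.DataFrame(
    {"State": ["State A"], "Participation": [5.0], "Math": [600]}
)

# One dictionary maps every old name to its new name in a single call.
sat_names = {
    "State": "state",
    "Participation": "sat_participation",
    "Math": "sat_math",
}
sat_toy = sat_toy.rename(columns=sat_names)
```

Because the new names are lowercase and contain no spaces, attribute access such as `sat_toy.sat_math` works alongside `sat_toy["sat_math"]`.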

#### 8. Create a data dictionary

Now that we've fixed our data, and given it appropriate names, let's create a [data dictionary](http://library.ucmerced.edu/node/10249). 

A data dictionary provides a quick overview of features/variables/columns, alongside data types and descriptions. The more descriptive you can be, the more useful this document is.

Example of a Fictional Data Dictionary Entry: 

|Feature|Type|Dataset|Description|
|---|---|---|---|
|**county_pop**|*float*|2010 census|The population of the county (units in thousands, where 2.5 represents 2,500 people).|
|**per_poverty**|*float*|2010 census|The percent of the county over the age of 18 living below 200% of the official US poverty rate (units: percent to two decimal places, where 98.10 means 98.1%).|

[Here's a quick link to a short guide for formatting markdown in Jupyter notebooks](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html).

Below is a skeleton for a markdown table, with column headers that will help you create a data dictionary to quickly summarize your data, along with an example row. **This would be a great thing to copy and paste into your custom README for this project.**

|Feature|Type|Dataset|Description|
|---|---|---|---|
|column name|int/float/object|ACT/SAT|This is an example| 


#### 9. Drop unnecessary rows

One of our dataframes contains an extra row. Identify and remove this from the dataframe.

In [3]:
#code
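
One way to locate a row that exists in only one dataframe is to compare the sets of state labels. This is a sketch on toy data; `"Extra Row"` is a made-up label standing in for whatever the unwanted row turns out to be.

```python
import pandas as pd

# Toy frames; "Extra Row" is a made-up label for the unwanted row.
sat_toy = pd.DataFrame({"state": ["State A", "State B"]})
act_toy = pd.DataFrame({"state": ["Extra Row", "State A", "State B"]})

# Labels present in one frame but not the other point to the extra row.
extra = set(act_toy["state"]) - set(sat_toy["state"])

# Keep only rows whose state is not in the extra set, then tidy the index.
act_toy = act_toy[~act_toy["state"].isin(extra)].reset_index(drop=True)
```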

#### 10. Merge Dataframes

Join the ACT and SAT dataframes using the state in each dataframe as the key. Assign this to a new variable. **Use this combined dataframe for the remainder of the project**.

In [None]:
#Code:
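
A sketch of the join using `pd.merge`, assuming both dataframes already share a key column named `state` (a hypothetical name, as are the other columns and values).

```python
import pandas as pd

# Toy frames with made-up values; note the rows are in different orders.
sat_toy = pd.DataFrame({"state": ["State A", "State B"], "sat_math": [600, 540]})
act_toy = pd.DataFrame({"state": ["State B", "State A"], "act_math": [21.0, 19.5]})

# Inner join on the shared key; row order in each frame does not matter.
combined = pd.merge(sat_toy, act_toy, on="state", how="inner")
```

An inner join keeps only states found in both dataframes, so comparing `len(combined)` with the length of each input is a quick check that nothing was dropped.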

#### 11. Save your cleaned, merged dataframe

Use a relative path to save out your data as `combined_2017.csv`.

In [7]:
#code
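
A sketch of the save step on a toy dataframe with made-up values. Calling `to_csv` without a path returns the CSV text, which is used here only to make the effect of `index=False` visible; in the project you would pass the relative path.

```python
import pandas as pd

# Toy merged frame; column names and values are made up.
combined = pd.DataFrame(
    {"state": ["State A"], "sat_math": [600], "act_math": [19.5]}
)

# With no path, to_csv returns the CSV text. index=False leaves out the
# row index, so no unnamed extra column appears when the file is read back.
csv_text = combined.to_csv(index=False)

# In the project, pass a relative path instead, for example:
# combined.to_csv("combined_2017.csv", index=False)
```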

## Summary Statistics

Transpose the output of pandas' `describe` method to create a quick overview of each numeric feature.

In [None]:
#Code:
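
A sketch on a toy dataframe (hypothetical column names, made-up values) showing what the transposed summary looks like.

```python
import pandas as pd

combined = pd.DataFrame(
    {
        "state": ["State A", "State B", "State C"],
        "sat_math": [600, 540, 510],
        "act_math": [19.5, 21.0, 23.5],
    }
)

# describe() summarizes the numeric columns by default;
# .T transposes the result so each feature becomes one row.
summary = combined.describe().T
```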

#### 12. Manually calculate standard deviation

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^n(x_i - \mu)^2}$$

- Write a function to calculate standard deviation using the formula above

In [4]:
#code
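
One way to write the function, as a sketch that follows the formula above term by term. The name `population_std` is a hypothetical choice.

```python
import math


def population_std(values):
    """Population standard deviation: divide by n, as in the formula above."""
    values = list(values)
    n = len(values)
    mu = sum(values) / n
    return math.sqrt(sum((x - mu) ** 2 for x in values) / n)
```

For example, `population_std([2, 4, 4, 4, 5, 5, 7, 9])` has mean 5 and squared deviations summing to 32, so the result is the square root of 32 / 8.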

- Use only dictionary comprehensions to apply your standard deviation function to each numeric column in the dataframe.  **No loops**  
- Assign the output to variable `sd` as a dictionary where: 
    - Each column name is now a key 
    - The standard deviation of the column is the value 
     
*Example Output :* `{'ACT_Math': 120, 'ACT_Reading': 120, ...}`

In [None]:
#Code:
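
A sketch of the dictionary comprehension on toy data. The helper `population_std` is a hypothetical name, repeated here so the example stands alone; the column names and values are made up.

```python
import math

import pandas as pd


def population_std(values):
    """Population standard deviation (divides by n)."""
    values = list(values)
    n = len(values)
    mu = sum(values) / n
    return math.sqrt(sum((x - mu) ** 2 for x in values) / n)


combined = pd.DataFrame(
    {
        "state": ["State A", "State B", "State C"],
        "sat_math": [500, 520, 540],
        "act_math": [18.0, 21.0, 24.0],
    }
)

# select_dtypes keeps only numeric columns, so "state" is skipped.
numeric = combined.select_dtypes(include="number")

# One key per numeric column, with its standard deviation as the value.
sd = {col: population_std(numeric[col]) for col in numeric.columns}
```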

Do your manually calculated standard deviations match up with those output by pandas `describe`? What about numpy's `std` method?

Answer:
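
When comparing results, it helps to know that the two libraries use different defaults: pandas' `std` (which `describe` relies on) uses `ddof=1` and divides by n - 1, while NumPy's `std` uses `ddof=0` and divides by n. A small sketch on made-up numbers:

```python
import numpy as np
import pandas as pd

scores = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

# pandas default: ddof=1, the sample standard deviation (divides by n - 1).
pandas_default = scores.std()

# NumPy default: ddof=0, the population standard deviation (divides by n).
numpy_default = np.std(scores.to_numpy())

# pandas can be made to match the population formula by passing ddof=0.
pandas_population = scores.std(ddof=0)
```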

## Step 2: Manipulate the dataframe

#### 13. Sort Data
Sort the observations in your dataframe by SAT participation rate in ascending order (lowest to highest).

In [None]:
#Code:
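
A sketch of the sort using `sort_values`, on toy data with a hypothetical `sat_participation` column.

```python
import pandas as pd

combined = pd.DataFrame(
    {
        "state": ["State A", "State B", "State C"],
        "sat_participation": [38.0, 5.0, 100.0],
    }
)

# sort_values returns a new, sorted frame; ascending=True is the default
# and is spelled out here only for clarity.
by_sat_rate = combined.sort_values("sat_participation", ascending=True)
```

The original `combined` is left untouched unless you reassign the result.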

##### 15. Filter Dataframe
Using boolean logic to filter the dataframe, display a subset of the data where observations in a column of your choice exceed a threshold of your choosing.

*Example: only rows where ACT_Math is greater than 20, or only rows where ACT_Participation is greater than 75%.*

You should not be permanently altering your dataframe, just print to the notebook the observations meeting your chosen criteria.

Enter your Criteria Here:

In [None]:
# Code for your chosen Criteria
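
A sketch of boolean filtering on toy data. The column name `act_participation` and the threshold of 75 are illustrative choices, not requirements.

```python
import pandas as pd

combined = pd.DataFrame(
    {
        "state": ["State A", "State B", "State C"],
        "act_participation": [29.0, 100.0, 80.0],
    }
)

# The comparison yields a boolean Series; indexing with it keeps the True rows.
high_act = combined[combined["act_participation"] > 75]
```

Because the result is assigned to a new name, `combined` itself is not altered.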

## Step 3: Visualize the data

##### 16. Plot Histograms of Rates for SAT & ACT

Using Matplotlib's PyPlot, plot histograms of the distributions of the Rate columns for both SAT and ACT.

You should show:
 - Two histograms, side-by-side
 - Each histogram should be properly labelled

[Helpful Link for Plotting Multiple Figures](https://matplotlib.org/users/pyplot_tutorial.html#working-with-multiple-figures-and-axes)

In [None]:
# Code
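
A sketch of two side-by-side histograms using `plt.subplots`. The column names, values, bin count, and label text are all illustrative assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data; column names and values are made up.
combined = pd.DataFrame(
    {
        "sat_participation": [5.0, 38.0, 100.0, 60.0],
        "act_participation": [100.0, 80.0, 29.0, 45.0],
    }
)

# One row and two columns of axes gives side-by-side panels.
fig, (ax_sat, ax_act) = plt.subplots(1, 2, figsize=(10, 4))

ax_sat.hist(combined["sat_participation"], bins=10)
ax_sat.set_title("SAT participation rate")
ax_sat.set_xlabel("Participation (%)")
ax_sat.set_ylabel("Number of states")

ax_act.hist(combined["act_participation"], bins=10)
ax_act.set_title("ACT participation rate")
ax_act.set_xlabel("Participation (%)")
ax_act.set_ylabel("Number of states")

# Prevent the two panels' labels from overlapping.
fig.tight_layout()
```

The same layout works for the Math and Reading histograms by swapping in the relevant columns and labels.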

##### 17. Plot Histograms of Math for SAT & ACT

Using Matplotlib's PyPlot, plot histograms of the distributions of the Math columns for both SAT and ACT.

You should show:
 - Two histograms, side-by-side
 - Each histogram should be properly labelled

In [None]:
# Code

##### 18. Plot Histograms of Reading for SAT & ACT

Using Matplotlib's PyPlot, plot histograms of the distributions of the Reading columns for both SAT and ACT.

You should show:
 - Two histograms, side-by-side
 - Each histogram should be properly labelled

In [None]:
# Code

##### 19. What is the most common assumption we make about distributions when working with data?

In [None]:
Answer:

Does this assumption hold for:
- Math
- Reading
- Rates

Explain your answers for each distribution and how you think this will affect estimates made from these data.

In [None]:
Answer:

##### 20. Scatter Plot of SAT vs. ACT Math Scores

Plot the two variables against each other using Matplotlib & PyPlot

Your plots should show:
- Two clearly labeled axes
- A proper title
- Colors and symbols that are clear and unmistakable




In [None]:
# Code
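
A sketch of a labeled scatter plot on toy data. The column names, values, color, marker, and label text are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data; column names and values are made up.
combined = pd.DataFrame(
    {
        "sat_math": [600, 540, 510],
        "act_math": [19.5, 21.0, 23.5],
    }
)

fig, ax = plt.subplots(figsize=(6, 5))

# One point per state: SAT math on the x-axis, ACT math on the y-axis.
ax.scatter(combined["sat_math"], combined["act_math"], color="tab:blue", marker="o")

ax.set_title("SAT vs. ACT average math score by state")
ax.set_xlabel("SAT math (average score)")
ax.set_ylabel("ACT math (average score)")
```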

##### 21. Scatter Plot of SAT vs. ACT Reading Scores

Plot the two variables against each other using Matplotlib & PyPlot

Your plots should show:
- Two clearly labeled axes
- A proper title
- Colors and symbols that are clear and unmistakable


In [None]:
# Code

##### 23. Create Boxplots
For each variable in the dataframe, create a boxplot.

Each one should:
 - Be in a separate cell of the notebook
 - Be properly labelled
 - Have appropriate axes & scales
 - Have appropriate axis labels
 - Have a title which clearly communicates the relationships illustrated by the graph
 - Have an accompanying Markdown Cell with your interpretation of the chart

In [None]:
# Code
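
A sketch of a single labeled boxplot; repeating it in a separate cell for each variable satisfies the list above. The column name, values, and label text are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data; the column name and values are made up.
combined = pd.DataFrame({"sat_math": [500, 520, 540, 600, 610]})

fig, ax = plt.subplots(figsize=(4, 5))

# A single box drawn at x-position 1 (Matplotlib's default for one dataset).
ax.boxplot(combined["sat_math"].to_numpy())

ax.set_title("Distribution of SAT math scores across states")
ax.set_ylabel("SAT math (average score)")
ax.set_xticks([1])
ax.set_xticklabels(["sat_math"])
```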

##### 21. Make at Least 5 More Plots 
*(do research and choose your own chart types & variables)*

You should make at least five more plots (of your choosing) to get a feel for these data.
Each plot should:
- Be in a separate cell of the notebook
- Be properly labelled
- Have appropriate axes & scales
- Have appropriate axis labels
- Have a title which clearly communicates the relationships illustrated by the graph
- Have an accompanying Markdown Cell with your interpretation of the chart

##### BONUS: Using Tableau, create a choropleth map for each variable using a map of the US. 

## Step 4: Descriptive and Inferential Statistics

#### 22. Statistical Evaluation of Distributions 

Using methods we discussed in class, calculate and show summary and test statistics for each of these distributions. You may want to make a table or dataframe to show them in a more comprehensible or elegant way.

*(Hint: What are the three things we care about when describing distributions?)*

In [None]:
# Code:
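
A sketch of a summary table built with pandas, assuming the hint refers to center, spread, and shape (that reading is an assumption, as are the column names and values). It covers summary statistics only; any test statistics would come from the methods covered in class.

```python
import pandas as pd

# Toy data; column names and values are made up.
combined = pd.DataFrame(
    {
        "sat_math": [500, 520, 540, 600],
        "act_math": [18.0, 21.0, 24.0, 21.0],
    }
)

# Center (mean, median), spread (std), and shape (skew),
# with one row per numeric feature.
stats = pd.DataFrame(
    {
        "mean": combined.mean(),
        "median": combined.median(),
        "std": combined.std(),
        "skew": combined.skew(),
    }
)
```

Note that `combined.std()` uses pandas' default `ddof=1`, so it reports the sample standard deviation.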

##### 23. Summarizing Distributions

As a data scientist, you need a complete understanding of your data before building any models from it.

For each variable in your data, summarize the underlying distributions (in words & statistics)
 - Be thorough in your verbal description of these distributions.
 - Be sure to back up these summaries with statistics 

*(Hint: Again, what are the three things we care about when describing distributions?)*

Answers:

#### 24. Plotting Takeaways
 
What do you think are the most important takeaways from your plots and graphs?

Answer:

##### 25. Is it appropriate to compare  *these*  specific SAT and ACT math scores? 

Why or why not?

Answer:

##### 26. Estimate Limits of Data

Suppose we only seek to understand the relationship between SAT and ACT data in 2017. 

Does it make sense to conduct statistical inference given these data specifically? 

Why or why not?

*(think about granularity, aggregation, the relationships between population size & rates...consider the actual populations these data describe in answering this question)*

Answer: