## Exploratory Data Analysis


### Summary Statistics
Transpose the output of the pandas `describe` method to get a quick overview of each numeric feature, one row per column.

In [1]:
#Imports:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math

In [2]:
#Code:

df_final = pd.read_csv("../a_data/final.csv")

In [3]:
df_final.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| sat_2017_participation | 50.0 | 39.12 | 35.216323 | 2.0 | 4.0 | 34.0 | 65.0 | 100.0 |
| sat_2017_read_write | 50.0 | 570.4 | 44.870335 | 503.0 | 532.75 | 559.5 | 613.5 | 644.0 |
| sat_2017_math | 50.0 | 558.16 | 46.359007 | 492.0 | 523.25 | 549.5 | 601.0 | 651.0 |
| sat_2017_total | 50.0 | 1128.66 | 90.922171 | 996.0 | 1055.25 | 1107.5 | 1214.0 | 1295.0 |
| act_2017_participation | 50.0 | 65.52 | 32.711628 | 8.0 | 31.0 | 71.0 | 100.0 | 100.0 |
| act_2017_english | 50.0 | 20.88 | 2.346948 | 16.3 | 19.0 | 20.55 | 23.1 | 25.5 |
| act_2017_math | 50.0 | 21.154 | 1.996242 | 18.0 | 19.4 | 20.9 | 23.0 | 25.3 |
| act_2017_reading | 50.0 | 21.968 | 2.061448 | 18.1 | 20.425 | 21.7 | 23.875 | 26.0 |
| act_2017_science | 50.0 | 21.42 | 1.743911 | 18.2 | 19.925 | 21.3 | 22.975 | 24.9 |
| act_2017_composite | 50.0 | 21.0 | 2.070197 | 17.0 | 19.0 | 21.0 | 22.75 | 25.0 |


In [4]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   50 non-null     object 
 1   sat_2017_participation  50 non-null     float64
 2   sat_2017_read_write     50 non-null     int64  
 3   sat_2017_math           50 non-null     int64  
 4   sat_2017_total          50 non-null     int64  
 5   act_2017_participation  50 non-null     float64
 6   act_2017_english        50 non-null     float64
 7   act_2017_math           50 non-null     float64
 8   act_2017_reading        50 non-null     float64
 9   act_2017_science        50 non-null     float64
 10  act_2017_composite      50 non-null     float64
 11  sat_2018_participation  50 non-null     float64
 12  sat_2018_read_write     50 non-null     int64  
 13  sat_2018_math           50 non-null     int64  
 14  sat_2018_total          50 non-null     int64  

#### Manually calculate standard deviation

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^n(x_i - \mu)^2}$$

- Write a function to calculate standard deviation using the formula above

In [5]:
#code
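def std_dev(values):
    """Return the population standard deviation (divides by n, as in the formula above)."""
    avg = np.mean(values)
    n = len(values)
    sum_sqr = 0
    # Accumulate the squared deviations from the mean.
    for x in values:
        sum_sqr += (x - avg) ** 2
    return math.sqrt(sum_sqr / n)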

In [6]:
std_dev(df_final["sat_2017_participation"])

34.86238087107649

- Use a **dictionary comprehension** to apply your standard deviation function to each numeric column in the dataframe. **No explicit loops.**
- Assign the output to the variable `sd` as a dictionary where:
    - Each column name is a key
    - The standard deviation of that column is the value
     
*Example Output :* `{'ACT_Math': 120, 'ACT_Reading': 120, ...}`

In [8]:
# Skip the non-numeric `state` column and apply std_dev to every numeric column.
sd = {col: round(std_dev(df_final[col]), 2)
      for col in df_final.select_dtypes(include="number").columns}
sd

{'sat_2017_participation': 34.86,
 'sat_2017_read_write': 44.42,
 'sat_2017_math': 45.89,
 'sat_2017_total': 90.01,
 'act_2017_participation': 32.38,
 'act_2017_english': 2.32,
 'act_2017_math': 1.98,
 'act_2017_reading': 2.04,
 'act_2017_science': 1.73,
 'act_2017_composite': 2.05,
 'sat_2018_participation': 37.25,
 'sat_2018_read_write': 47.05,
 'sat_2018_math': 47.09,
 'sat_2018_total': 93.01,
 'act_2018_participation': 34.38,
 'act_2018_composite': 2.12}

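Do your manually calculated standard deviations match the output from pandas `describe`? What about numpy's `std` method?

- They match numpy's `std`, which by default divides by $n$ (`ddof=0`), exactly like the formula above. They are consistently slightly smaller than the values from `describe`, which reports the sample standard deviation and divides by $n - 1$ (`ddof=1`). For `sat_2017_participation`, the manual result is 34.86 versus 35.22 from `describe`.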
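
The difference between the two conventions can be checked directly. The sketch below is only an illustration: the `values` series is hypothetical, while `manual_sd` and `describe_sd` are the figures printed earlier in this notebook for `sat_2017_participation`.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical sample values, chosen only to make the arithmetic easy to follow.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(values)

population_sd = np.std(values.to_numpy())  # NumPy default ddof=0: divide by n
sample_sd = values.std()                   # pandas default ddof=1: divide by n - 1
pandas_population_sd = values.std(ddof=0)  # pandas switched to the population formula

# The two conventions differ only by the factor sqrt(n / (n - 1)).
rescaled = population_sd * math.sqrt(n / (n - 1))

# Same check with the figures printed in this notebook (n = 50 states).
manual_sd = 34.86238087107649
describe_sd = 35.216323
notebook_rescaled = manual_sd * math.sqrt(50 / 49)

print(population_sd, sample_sd, pandas_population_sd)
print(notebook_rescaled, describe_sd)
```

Passing `ddof=0` to the pandas `std` method, or `ddof=1` to `np.std`, should make the two libraries agree.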

#### Investigate trends in the data
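Using sorting and/or masking (along with the `.head` method to avoid printing the entire dataframe), consider the following questions:

- Which states have the highest and lowest participation rates for the:
    - 2017 SAT? *Highest: Michigan (100%)*, *Lowest: Iowa (2%)*
    - 2018 SAT? *Highest: Colorado (100%)*, *Lowest: North Dakota*
    - 2017 ACT? *Highest: Alabama (100%)*, *Lowest: Maine (8%)*
    - 2018 ACT? *Highest: Alabama (100%)*, *Lowest: Maine (7%)*
    - Note: the 100% maximum is shared by several states in every case (for example Connecticut and Delaware on the SAT, North Carolina and South Carolina on the ACT), so each *Highest* entry names only one of them.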
- Which states have the highest and lowest mean total/composite scores for the:
    - 2017 SAT? 
    - 2018 SAT?
    - 2017 ACT?
    - 2018 ACT?
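- Do any states with 100% participation on a given test have a rate change year-to-year?
    - The mask in cell 13 returns Connecticut, Delaware, and Michigan, but it compares SAT total scores rather than participation rates. All three stayed at 100% SAT participation in both years; only their total scores changed.
- Do any states have >50% participation on *both* tests in either year?
    - Florida, Georgia, Hawaii, North Carolina, and South Carolina. These are the five rows kept by `.head()`, so further matches may exist.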

Based on what you've just observed, have you identified any states that you're especially interested in? **Make a note of these and state *why* you think they're interesting**.

- Colorado is interesting because its SAT participation jumped from 11% to 100% between 2017 and 2018.

In [9]:
#code
df_final.sort_values(["act_2018_participation"], ascending=[False]).head(1)

| | state | sat_2017_participation | sat_2017_read_write | sat_2017_math | sat_2017_total | act_2017_participation | act_2017_english | act_2017_math | act_2017_reading | act_2017_science | act_2017_composite | sat_2018_participation | sat_2018_read_write | sat_2018_math | sat_2018_total | act_2018_participation | act_2018_composite |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | 5.0 | 593 | 572 | 1165 | 100.0 | 18.9 | 18.4 | 19.7 | 19.4 | 19.0 | 6.0 | 595 | 571 | 1166 | 100.0 | 19.1 |


In [10]:
df_final.sort_values(["act_2018_participation"], ascending=[False]).tail(1)

| | state | sat_2017_participation | sat_2017_read_write | sat_2017_math | sat_2017_total | act_2017_participation | act_2017_english | act_2017_math | act_2017_reading | act_2017_science | act_2017_composite | sat_2018_participation | sat_2018_read_write | sat_2018_math | sat_2018_total | act_2018_participation | act_2018_composite |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18 | Maine | 95.0 | 513 | 499 | 1012 | 8.0 | 24.2 | 24.0 | 24.8 | 23.7 | 24.0 | 99.0 | 512 | 501 | 1013 | 7.0 | 24.0 |
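
Sorting and keeping a single row with `.head(1)` or `.tail(1)` hides ties. One possible alternative, sketched here on a small frame rebuilt from rows printed in this notebook (a subset, not the full file), is to mask on the column maximum and minimum so that every tied state is returned.

```python
import pandas as pd

# Rows copied from outputs printed in this notebook (a subset of final.csv).
sample = pd.DataFrame({
    "state": ["Alabama", "Maine", "North Carolina", "South Carolina", "Florida"],
    "act_2018_participation": [100.0, 7.0, 100.0, 100.0, 66.0],
})

col = "act_2018_participation"

# Keep every state whose rate equals the extreme value, not just the first one.
highest = sample.loc[sample[col] == sample[col].max(), "state"].tolist()
lowest = sample.loc[sample[col] == sample[col].min(), "state"].tolist()

print("highest:", highest)
print("lowest:", lowest)
```

On this subset the highest rate is shared by Alabama, North Carolina, and South Carolina, which a single-row `.head(1)` would not reveal.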


In [13]:
df_final.loc[(df_final["sat_2017_participation"] == 100) & 
             (df_final["sat_2017_total"] - df_final["sat_2018_total"] != 0), :].head()

| | state | sat_2017_participation | sat_2017_read_write | sat_2017_math | sat_2017_total | act_2017_participation | act_2017_english | act_2017_math | act_2017_reading | act_2017_science | act_2017_composite | sat_2018_participation | sat_2018_read_write | sat_2018_math | sat_2018_total | act_2018_participation | act_2018_composite |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | Connecticut | 100.0 | 530 | 512 | 1041 | 31.0 | 25.5 | 24.6 | 25.6 | 24.6 | 25.0 | 100.0 | 535 | 519 | 1053 | 26.0 | 25.6 |
| 7 | Delaware | 100.0 | 503 | 492 | 996 | 18.0 | 24.1 | 23.4 | 24.8 | 23.6 | 24.0 | 100.0 | 505 | 492 | 998 | 17.0 | 23.8 |
| 21 | Michigan | 100.0 | 509 | 495 | 1005 | 29.0 | 24.1 | 23.7 | 24.5 | 23.8 | 24.0 | 100.0 | 511 | 499 | 1011 | 22.0 | 24.2 |
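
The mask in cell 13 compares total scores. If the goal is to detect a change in the participation rate itself, one possible variant is sketched below. The frame is rebuilt from figures that appear in this notebook (the Colorado rates come from the note above, the other rows from printed outputs), and the column choice is an assumption about what the question intends.

```python
import pandas as pd

# Figures taken from this notebook; SAT participation columns only.
sample = pd.DataFrame({
    "state": ["Colorado", "Connecticut", "Delaware", "Michigan", "Maine"],
    "sat_2017_participation": [11.0, 100.0, 100.0, 100.0, 95.0],
    "sat_2018_participation": [100.0, 100.0, 100.0, 100.0, 99.0],
})

rate_2017 = sample["sat_2017_participation"]
rate_2018 = sample["sat_2018_participation"]

# A state qualifies if it reached 100% in either year and its rate differs between years.
full_in_either_year = (rate_2017 == 100) | (rate_2018 == 100)
rate_changed = rate_2017 != rate_2018

changed = sample.loc[full_in_either_year & rate_changed, "state"].tolist()
print(changed)
```

On this subset only Colorado qualifies; Connecticut, Delaware, and Michigan are excluded because their rate is 100% in both years. The same mask could be repeated with the ACT participation columns.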


In [14]:
df_final.loc[((df_final["sat_2017_participation"] > 50) & (df_final["act_2017_participation"] > 50)) |
             ((df_final["sat_2018_participation"] > 50) & (df_final["act_2018_participation"] > 50)), :].head()

| | state | sat_2017_participation | sat_2017_read_write | sat_2017_math | sat_2017_total | act_2017_participation | act_2017_english | act_2017_math | act_2017_reading | act_2017_science | act_2017_composite | sat_2018_participation | sat_2018_read_write | sat_2018_math | sat_2018_total | act_2018_participation | act_2018_composite |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | Florida | 83.0 | 520 | 497 | 1017 | 73.0 | 19.0 | 19.4 | 21.0 | 19.4 | 19.0 | 56.0 | 550 | 549 | 1099 | 66.0 | 19.9 |
| 9 | Georgia | 61.0 | 535 | 515 | 1050 | 55.0 | 21.0 | 20.9 | 22.0 | 21.3 | 21.0 | 70.0 | 542 | 522 | 1064 | 53.0 | 21.4 |
| 10 | Hawaii | 55.0 | 544 | 541 | 1085 | 90.0 | 17.8 | 19.2 | 19.2 | 19.3 | 19.0 | 56.0 | 480 | 530 | 1010 | 89.0 | 18.9 |
| 32 | North Carolina | 49.0 | 546 | 535 | 1081 | 100.0 | 17.8 | 19.3 | 19.6 | 19.3 | 19.0 | 52.0 | 554 | 543 | 1098 | 100.0 | 19.1 |
| 39 | South Carolina | 50.0 | 543 | 521 | 1064 | 100.0 | 17.5 | 18.6 | 19.1 | 18.9 | 18.0 | 55.0 | 547 | 523 | 1070 | 100.0 | 18.3 |
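
The combined mask above does not say in which year each state qualifies, and `.head()` caps the result at five rows. A minimal sketch of a per-year breakdown follows, using rows copied from outputs printed in this notebook. The helper name `both_above` is hypothetical and introduced only for illustration.

```python
import pandas as pd

# Rows copied from outputs printed in this notebook (a subset of final.csv).
sample = pd.DataFrame({
    "state": ["Florida", "Georgia", "Hawaii", "North Carolina", "South Carolina", "Alabama"],
    "sat_2017_participation": [83.0, 61.0, 55.0, 49.0, 50.0, 5.0],
    "act_2017_participation": [73.0, 55.0, 90.0, 100.0, 100.0, 100.0],
    "sat_2018_participation": [56.0, 70.0, 56.0, 52.0, 55.0, 6.0],
    "act_2018_participation": [66.0, 53.0, 89.0, 100.0, 100.0, 100.0],
})


def both_above(frame, year, threshold=50):
    """Hypothetical helper: states strictly above `threshold` on both tests in `year`."""
    sat = frame[f"sat_{year}_participation"]
    act = frame[f"act_{year}_participation"]
    return frame.loc[(sat > threshold) & (act > threshold), "state"].tolist()


both_2017 = both_above(sample, 2017)
both_2018 = both_above(sample, 2018)

print("2017:", both_2017)
print("2018:", both_2018)
```

On these rows, North Carolina and South Carolina qualify only in 2018, because their 2017 SAT rates (49% and 50%) are not strictly above 50%.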
