# Project 1

## Step 1: Load the data and perform basic operations.

##### 1. Load the data in using pandas.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import sklearn.preprocessing as skl
plt.style.use('fivethirtyeight')
sns.set(style='whitegrid')
sns.set_context('poster')

In [2]:
sat = pd.read_csv("../data/sat.csv")
act = pd.read_csv("../data/act.csv")

##### 2. Print the first ten rows of each dataframe.

In [3]:
sat.head(10)

Unnamed: 0.1,Unnamed: 0,State,Participation,Evidence-Based Reading and Writing,Math,Total
0,0,Alabama,5%,593,572,1165
1,1,Alaska,38%,547,533,1080
2,2,Arizona,30%,563,553,1116
3,3,Arkansas,3%,614,594,1208
4,4,California,53%,531,524,1055
5,5,Colorado,11%,606,595,1201
6,6,Connecticut,100%,530,512,1041
7,7,Delaware,100%,503,492,996
8,8,District of Columbia,100%,482,468,950
9,9,Florida,83%,520,497,1017


In [4]:
act.head(10)

Unnamed: 0.1,Unnamed: 0,State,Participation,English,Math,Reading,Science,Composite
0,0,National,60%,20.3,20.7,21.4,21.0,21.0
1,1,Alabama,100%,18.9,18.4,19.7,19.4,19.2
2,2,Alaska,65%,18.7,19.8,20.4,19.9,19.8
3,3,Arizona,62%,18.6,19.8,20.1,19.8,19.7
4,4,Arkansas,100%,18.9,19.0,19.7,19.5,19.4
5,5,California,31%,22.5,22.7,23.1,22.2,22.8
6,6,Colorado,100%,20.1,20.3,21.2,20.9,20.8
7,7,Connecticut,31%,25.5,24.6,25.6,24.6,25.2
8,8,Delaware,18%,24.1,23.4,24.8,23.6,24.1
9,9,District of Columbia,32%,24.4,23.5,24.9,23.5,24.2


In [5]:
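#Previewing the frame without the unnamed column, which appears to be a duplicate of the index values.
#drop() returns a new DataFrame, so sat itself is unchanged here; the column is removed for good after the merge in step 9.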
list(sat.columns.values)
sat.drop(sat.columns[0], axis=1)

Unnamed: 0,State,Participation,Evidence-Based Reading and Writing,Math,Total
0,Alabama,5%,593,572,1165
1,Alaska,38%,547,533,1080
2,Arizona,30%,563,553,1116
3,Arkansas,3%,614,594,1208
4,California,53%,531,524,1055
5,Colorado,11%,606,595,1201
6,Connecticut,100%,530,512,1041
7,Delaware,100%,503,492,996
8,District of Columbia,100%,482,468,950
9,Florida,83%,520,497,1017


In [6]:
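#Previewing the frame without the unnamed column, which appears to be a duplicate of the index values.
#drop() returns a new DataFrame, so act itself is unchanged here; the column is removed for good in step 10.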
list(act.columns.values)
act.drop(act.columns[0], axis=1)

Unnamed: 0,State,Participation,English,Math,Reading,Science,Composite
0,National,60%,20.3,20.7,21.4,21.0,21.0
1,Alabama,100%,18.9,18.4,19.7,19.4,19.2
2,Alaska,65%,18.7,19.8,20.4,19.9,19.8
3,Arizona,62%,18.6,19.8,20.1,19.8,19.7
4,Arkansas,100%,18.9,19.0,19.7,19.5,19.4
5,California,31%,22.5,22.7,23.1,22.2,22.8
6,Colorado,100%,20.1,20.3,21.2,20.9,20.8
7,Connecticut,31%,25.5,24.6,25.6,24.6,25.2
8,Delaware,18%,24.1,23.4,24.8,23.6,24.1
9,District of Columbia,32%,24.4,23.5,24.9,23.5,24.2
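
The unnamed column can also be avoided when the file is loaded. This is a minimal sketch, assuming the first CSV column is only a saved index with an empty header; the inline CSV text is a stand-in for `../data/sat.csv`, with values copied from the rows shown above.

```python
import io

import pandas as pd

# Stand-in for ../data/sat.csv: the first column has an empty header,
# which is what makes pandas name it "Unnamed: 0".
csv_text = """,State,Participation,Math
0,Alabama,5%,572
1,Alaska,38%,533
"""

# index_col=0 uses the first column as the index instead of loading it as data.
sat_sample = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(sat_sample.columns))
```

With this option there is no duplicate index column left to drop later.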


##### 3. Describe in words what each variable (column) is.

SAT

* State is the state that the row describes.
* Participation is the percentage of students in that state who took the exam.
* The Evidence-Based Reading and Writing and Math columns give the mean score in each section for the state.
* Total is the mean total score for the state. It should equal the sum of the Evidence-Based Reading and Writing and Math columns.

ACT

* State is the state that the row describes. The ACT table also contains a "National" row.
* Participation is the percentage of students in that state who took the exam.
* The English, Math, Reading, and Science columns give the mean score in each section for the state.
* Composite is the mean of the English, Math, Reading, and Science scores, reported to the nearest tenth.

##### 4. Does the data look complete? Are there any obvious issues with the observations?

In [7]:
all(sat.Math >= 400)

True

In [8]:
all(sat.Math <= 800)

True

In [9]:
all(sat['Evidence-Based Reading and Writing'] >=400)

True

In [10]:
all(sat['Evidence-Based Reading and Writing'] <=800)

True

In [11]:
all(sat.Total == (sat.Math)+(sat['Evidence-Based Reading and Writing']))

False

In [12]:
all(act.English >= 1)

True

In [13]:
all(act.English <= 36)

True

In [14]:
all(act.Math >= 1)

True

In [15]:
all(act.Math <= 36)

True

In [16]:
all(act.Reading >= 1)

True

In [17]:
all(act.Reading <= 36)

True

In [18]:
all(act.Science >= 1)

True

In [19]:
all(act.Science <= 36)

True

In [20]:
all(sat.Math + sat['Evidence-Based Reading and Writing'] == sat.Total)

False

In [21]:
all(round((act.Math + act.Reading + act.Science + act.English) / 4, 1) == act.Composite)


False
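
The pairs of range checks above can be collapsed into one call per table. This is a sketch only: `in_range` is a hypothetical helper that is not part of the notebook, and the sample rows are copied from the SAT table shown earlier.

```python
import pandas as pd

def in_range(frame, columns, low, high):
    """Return one boolean per column: True if every value lies in [low, high]."""
    return frame[columns].apply(lambda col: col.between(low, high).all())

# First three rows of the SAT table.
sat_sample = pd.DataFrame({
    "Evidence-Based Reading and Writing": [593, 547, 563],
    "Math": [572, 533, 553],
})

# SAT section scores run from 200 to 800.
range_ok = in_range(sat_sample, ["Evidence-Based Reading and Writing", "Math"], 200, 800)
print(range_ok)
```

`Series.between` is inclusive on both ends by default, which matches the `>=` and `<=` checks used above.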

In [22]:
sat['Calculated Total'] = sat['Evidence-Based Reading and Writing'] + sat.Math
sat['Percent Commonality'] = ((sat['Calculated Total'] / sat.Total) * 100).round(1)

#This code block adds two columns: Calculated Total and Percent Commonality.  The checks above indicate that the
#total scores listed are not always equal to the sum of the individual categories.  The calculated total is simply the sum
#of the two section columns, and the percent commonality is the calculated total as a percentage of the listed total.

In [23]:
sat.drop(sat.columns[0],axis=1)

Unnamed: 0,State,Participation,Evidence-Based Reading and Writing,Math,Total,Calculated Total,Percent Commonality
0,Alabama,5%,593,572,1165,1165,100.0
1,Alaska,38%,547,533,1080,1080,100.0
2,Arizona,30%,563,553,1116,1116,100.0
3,Arkansas,3%,614,594,1208,1208,100.0
4,California,53%,531,524,1055,1055,100.0
5,Colorado,11%,606,595,1201,1201,100.0
6,Connecticut,100%,530,512,1041,1042,100.1
7,Delaware,100%,503,492,996,995,99.9
8,District of Columbia,100%,482,468,950,950,100.0
9,Florida,83%,520,497,1017,1017,100.0


In [24]:
act['Calculated'] = ((act.Math + act.Reading + act.Science + act.English) / 4).round(1)
act['Percent Commonality'] = ((act.Calculated / act.Composite) * 100).round(1)

#Much like with the SAT values, there appeared to be some issues in how the composite score was calculated. The calculated
#composite is the mean of the four section columns, and the percent commonality is the calculated composite as a percentage of the listed one.

All of the individual SAT section scores pass: each is at least 400 and at most 800, which is inside the valid 200 to 800 range for an SAT section.
All of the individual ACT scores pass: each is greater than or equal to the minimum individual category score of 1 and less than or equal to the maximum individual category score of 36.

The dataset's Composite ACT scores do not exactly match the mean of the four section scores. The difference between the given and calculated composites/totals is on the order of 1% or less (the commonality values range from 98.9% to 100.1%), and is most likely due to rounding. It is possible that the individual scores were rounded first, and then these rounded scores were used to calculate the composite, rather than rounding the calculated composite after the fact.
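
To see exactly which rows disagree, the recomputed composite can be compared with the listed one using a tolerance. This is a sketch on three rows copied from the ACT table above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Rows copied from the ACT table above.
act_sample = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona"],
    "English": [18.9, 18.7, 18.6],
    "Math": [18.4, 19.8, 19.8],
    "Reading": [19.7, 20.4, 20.1],
    "Science": [19.4, 19.9, 19.8],
    "Composite": [19.2, 19.8, 19.7],
})

subjects = ["English", "Math", "Reading", "Science"]
recomputed = act_sample[subjects].mean(axis=1).round(1)

# Compare with a tolerance rather than ==, so floating-point noise
# is not reported as a mismatch.
mismatch = ~np.isclose(recomputed, act_sample["Composite"], atol=0.01)
report = act_sample.loc[mismatch, ["State", "Composite"]].assign(Recomputed=recomputed[mismatch])
print(report)
```

Listing the mismatched rows makes it easier to judge whether the gap looks like rounding or like a data-entry problem.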

##### 5. Print the types of each column.

In [25]:
#SAT
sat['State'].dtype # -> Object
sat['Participation'].dtype # -> Object
sat['Evidence-Based Reading and Writing'].dtype # -> Integer
sat['Math'].dtype # ->Integer
sat['Total'].dtype # -> Integer

#ACT
act['State'].dtype # -> Object
act['Participation'].dtype # -> Object
act['English'].dtype # -> Float
act['Reading'].dtype # -> Float
act['Math'].dtype # -> Float
act['Science'].dtype # -> Float
act['Composite'].dtype # -> Float

dtype('float64')

##### 6. Do any types need to be reassigned? If so, go ahead and do it.

In [26]:
#Converting the Participation columns to floats to more easily do calculations
sat['Participation'] = sat['Participation'].replace('%','', regex = True)
act['Participation'] = act['Participation'].replace('%','', regex = True)

In [27]:
sat['Participation'] = sat['Participation'].astype(float)
act['Participation'] = act['Participation'].astype(float)
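
The two steps above can also be done in a single expression. This is a sketch on a small hand-made Series, assuming every value is a number followed by a percent sign.

```python
import pandas as pd

participation = pd.Series(["5%", "38%", "100%"])

# Strip the trailing percent sign, then convert. errors="raise" surfaces any
# unexpected value instead of silently turning it into NaN.
as_float = pd.to_numeric(participation.str.rstrip("%"), errors="raise").astype(float)
print(as_float.tolist())
```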

##### 7. Create a dictionary for each column mapping the State to its respective value for that column. (For example, you should have three SAT dictionaries.)

In [28]:
satmath = dict(zip(sat.State, sat.Math))
satwrite = dict(zip(sat.State, sat["Evidence-Based Reading and Writing"]))

acteng = dict(zip(act.State, act.English))
actread = dict(zip(act.State, act.Reading))
actmath = dict(zip(act.State, act.Math))
actsci = dict(zip(act.State, act.Science))
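
Building every mapping from the same frame avoids pairing the states of one table with the scores of another. This is a sketch of one way to do that with a dictionary comprehension; the sample frame holds two rows copied from the ACT table.

```python
import pandas as pd

act_sample = pd.DataFrame({
    "State": ["Alabama", "Alaska"],
    "Math": [18.4, 19.8],
    "Science": [19.4, 19.9],
})

# One State -> value dictionary per column, keyed by column name.
by_state = act_sample.set_index("State")
act_dicts = {col: by_state[col].to_dict() for col in by_state.columns}
print(act_dicts["Math"])
```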

##### 8. Create one dictionary where each key is the column name, and each value is an iterable (a list or a Pandas Series) of all the values in that column.

In [29]:
actstatelist = act['State'].tolist()
actpartlist = act['Participation'].tolist()
actenglist = act['English'].tolist()
actreadlist = act['Reading'].tolist()
actmathlist = act['Math'].tolist()
actscilist = act['Science'].tolist()
actcomplist = act['Composite'].tolist()


satstatelist = sat['State'].tolist()
satpartlist = sat['Participation'].tolist()
satwritelist = sat['Evidence-Based Reading and Writing'].tolist()
satmathlist = sat['Math'].tolist()
sattotallist = sat['Total'].tolist()

In [30]:
actdict = {'State':actstatelist, 'Participation':actpartlist, 'English':actenglist, 'Math':actmathlist, 'Reading':actreadlist,  'Science':actscilist, 'Composite':actcomplist}
satdict = {'State':satstatelist, 'Participation':satpartlist, 'Evidence-Based Reading and Writing':satwritelist, 'Math':satmathlist, 'Total':sattotallist}
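
pandas can produce the same column-name-to-list dictionary directly, which removes the chance of mislabeling a list. This is a sketch on two rows copied from the SAT table.

```python
import pandas as pd

sat_sample = pd.DataFrame({
    "State": ["Alabama", "Alaska"],
    "Math": [572, 533],
    "Total": [1165, 1080],
})

# orient="list" maps each column name to a plain Python list of its values.
sat_columns = sat_sample.to_dict(orient="list")
print(sat_columns["Math"])
```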

##### 9. Merge the dataframes on the state column.

In [31]:
Full = sat.merge(act, left_on="State", right_on="State")

In [32]:
Full = Full.drop(['Unnamed: 0_x'], axis=1)
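
The default merge is an inner join, so rows that appear in only one table (such as the ACT "National" row) are dropped without notice. This is a sketch of how to list such rows first, using small stand-in frames.

```python
import pandas as pd

sat_sample = pd.DataFrame({"State": ["Alabama", "Alaska"], "Math": [572, 533]})
act_sample = pd.DataFrame({"State": ["National", "Alabama", "Alaska"], "Math": [20.7, 18.4, 19.8]})

# how="outer" keeps every row; indicator=True adds a "_merge" column saying
# whether each row came from the left frame, the right frame, or both.
check = sat_sample.merge(act_sample, on="State", how="outer", indicator=True)
unmatched = check.loc[check["_merge"] != "both", "State"].tolist()
print(unmatched)
```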



##### 10. Change the names of the columns so you can distinguish between the SAT columns and the ACT columns.

In [33]:
#Massive dictionary to rename things.  Probably an easier way to do this.
Full = Full.rename(index=str, columns={"State":"State", "Participation_x":"SAT Participation (%)", 
                                "Evidence-Based Reading and Writing":"SAT Reading/Writing","Math_x":"SAT Math",
                                "Total":"SAT Total", "Calculated Total":"Calculated SAT Total",
                                "Percent Commonality_x":"Calculated SAT % Commonality","Unnamed: 0_y":"Junk",
                                "Participation_y":"ACT Participation (%)", "English":"ACT English","Math_y":"ACT Math",
                                "Reading":"ACT Reading","Science":"ACT Science","Composite":"ACT Composite",
                                "Calculated":"Calculated ACT Composite","Percent Commonality_y":"Calculated ACT % Commonality"}) 

In [34]:
Full = Full.drop(['Junk'], axis=1)
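
One possible shortcut for the renaming is to prefix each table's columns before joining, so no rename dictionary is needed. This is a sketch on one-row stand-in frames; the resulting names differ slightly from the ones chosen above.

```python
import pandas as pd

sat_sample = pd.DataFrame({"State": ["Alabama"], "Participation": [5.0], "Math": [572]})
act_sample = pd.DataFrame({"State": ["Alabama"], "Participation": [100.0], "Math": [18.4]})

# Move the join key into the index so that add_prefix leaves it alone.
sat_named = sat_sample.set_index("State").add_prefix("SAT ")
act_named = act_sample.set_index("State").add_prefix("ACT ")

merged = sat_named.join(act_named, how="inner").reset_index()
print(list(merged.columns))
```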

In [35]:
Full

Unnamed: 0,State,SAT Participation (%),SAT Reading/Writing,SAT Math,SAT Total,Calculated SAT Total,Calculated SAT % Commonality,ACT Participation (%),ACT English,ACT Math,ACT Reading,ACT Science,ACT Composite,Calculated ACT Composite,Calculated ACT % Commonality
0,Alabama,5.0,593,572,1165,1165,100.0,100.0,18.9,18.4,19.7,19.4,19.2,19.1,99.5
1,Alaska,38.0,547,533,1080,1080,100.0,65.0,18.7,19.8,20.4,19.9,19.8,19.7,99.5
2,Arizona,30.0,563,553,1116,1116,100.0,62.0,18.6,19.8,20.1,19.8,19.7,19.6,99.5
3,Arkansas,3.0,614,594,1208,1208,100.0,100.0,18.9,19.0,19.7,19.5,19.4,19.3,99.5
4,California,53.0,531,524,1055,1055,100.0,31.0,22.5,22.7,23.1,22.2,22.8,22.6,99.1
5,Colorado,11.0,606,595,1201,1201,100.0,100.0,20.1,20.3,21.2,20.9,20.8,20.6,99.0
6,Connecticut,100.0,530,512,1041,1042,100.1,31.0,25.5,24.6,25.6,24.6,25.2,25.1,99.6
7,Delaware,100.0,503,492,996,995,99.9,18.0,24.1,23.4,24.8,23.6,24.1,24.0,99.6
8,District of Columbia,100.0,482,468,950,950,100.0,32.0,24.4,23.5,24.9,23.5,24.2,24.1,99.6
9,Florida,83.0,520,497,1017,1017,100.0,73.0,19.0,19.4,21.0,19.4,19.8,19.7,99.5


##### 11. Print the minimum and maximum of each numeric column in the data frame.

In [36]:
#Maximums of all Columns
Fullmax = Full.max(axis=0, numeric_only=True)
print(Fullmax)

SAT Participation (%)            100.0
SAT Reading/Writing              644.0
SAT Math                         651.0
SAT Total                       1295.0
Calculated SAT Total            1295.0
Calculated SAT % Commonality     100.1
ACT Participation (%)            100.0
ACT English                       25.5
ACT Math                          25.3
ACT Reading                       26.0
ACT Science                       24.9
ACT Composite                     25.5
Calculated ACT Composite          25.4
Calculated ACT % Commonality     100.0
dtype: float64


In [37]:
#Minimums of all Columns
Fullmin = Full.min(axis=0, numeric_only=True)
print(Fullmin)

SAT Participation (%)             2.0
SAT Reading/Writing             482.0
SAT Math                        468.0
SAT Total                       950.0
Calculated SAT Total            950.0
Calculated SAT % Commonality     99.9
ACT Participation (%)             8.0
ACT English                      16.3
ACT Math                         18.0
ACT Reading                      18.1
ACT Science                      18.2
ACT Composite                    17.8
Calculated ACT Composite         17.6
Calculated ACT % Commonality     98.9
dtype: float64


##### 12. Write a function using only list comprehensions, no loops, to compute standard deviation. Using this function, calculate the standard deviation of each numeric column in both data sets. Add these to a list called `sd`.

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^n(x_i - \mu)^2}$$

In [38]:
def sdtest(Value):
    #Population standard deviation: divides by n rather than n - 1.
    mean = sum(Value)/len(Value)
    sdme = (sum([(i-mean)**2 for i in Value]) / len(Value))**0.5
    return float(sdme)


In [39]:
sdtest(Full['SAT Math'])

In [None]:
#Dropping out the state column so that I can better apply my function over the DataFrame
FullMat = Full.drop(Full.columns[0], axis=1)


In [None]:
sd = FullMat.apply(sdtest, axis=0)
sd
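
A hand-written standard deviation is easy to check against NumPy. This is a sketch in which `population_sd` is an illustrative stand-in for the list-comprehension function, and the sample values are the first five SAT Math scores from the table.

```python
import numpy as np

def population_sd(values):
    """Population standard deviation (divide by n), using a list comprehension."""
    mean = sum(values) / len(values)
    return (sum([(v - mean) ** 2 for v in values]) / len(values)) ** 0.5

sample = [572, 533, 553, 594, 524]

# np.std defaults to ddof=0, the same divide-by-n definition.
matches_numpy = np.isclose(population_sd(sample), np.std(sample, ddof=0))
print(matches_numpy)

# pandas' Series.std() and DataFrame.describe() use ddof=1 (divide by n - 1),
# so their values come out slightly larger.
print(population_sd(sample), np.std(sample, ddof=1))
```

The `ddof` difference is why a divide-by-n result will not exactly match the `std` row reported by `describe()`.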

## Step 2: Manipulate the dataframe

##### 13. Turn the list `sd` into a new observation in your dataset.

In [None]:
import collections

In [None]:
#First off I generated a new ordered dictionary (sddict) that contains the column names (keys) and their standard
#deviations (values).  I then stacked on the other summaries that I wanted (Max, Min, Mean).  Finally I set the index
#values to newindex, thus giving me labeled rows.
newindex=['Standard Deviation','Maximum','Minimum','Mean']
sddict=collections.OrderedDict({'SAT Participation (%)': 34.929071, 'SAT Reading/Writing':45.216970, 'SAT Math':46.657134,
        'SAT Total':91.583511, 'Calculated SAT Total':91.576390, 'Calculated SAT % Commonality':0.053339,
        'ACT Participation (%)':31.824176, 'ACT English':2.330488, 'ACT Math':1.962462, 'ACT Reading':2.046903,
        'ACT Science':1.722216, 'ACT Composite':2.000786, 'Calculated ACT Composite':2.003217, 'Calculated ACT % Commonality':0.241248})
sd = pd.DataFrame(sddict,index=[0])
#DataFrame.append was removed in pandas 2.0, so the extra rows are stacked with pd.concat instead.
summaryrows = [Fullmax.to_frame().T, Fullmin.to_frame().T, Full.mean(numeric_only=True).to_frame().T]
sd = pd.concat([sd] + summaryrows, ignore_index=True)
sd = sd.round(2)
sd.index = newindex
sd


##### 14. Sort the dataframe by the values in a numeric column (e.g. observations descending by SAT participation rate)

In [None]:
#Sorting based upon highest SAT Participation and SAT Math scores.
Full.sort_values(['SAT Participation (%)', 'SAT Math'], ascending=[False, False])

In [None]:
Full.sort_values(['ACT Math'], ascending=[False])

In [None]:
Full.sort_values(['SAT Participation (%)','ACT Participation (%)'], ascending=[True, True])

##### 15. Use a boolean filter to display only observations with a score above a certain threshold (e.g. only states with a participation rate above 50%)

In [None]:
FullBool50 = Full[(Full['SAT Participation (%)'] >= 50)]

In [None]:
FullBool50.sort_values(['SAT Participation (%)'], ascending = False)

## Step 3: Visualize the data

##### 16. Using MatPlotLib and PyPlot, plot the distribution of the Rate columns for both SAT and ACT using histograms. (You should have two histograms. You might find [this link](https://matplotlib.org/users/pyplot_tutorial.html#working-with-multiple-figures-and-axes) helpful in organizing one plot above the other.) 

In [None]:
import matplotlib.pyplot as plt
import scipy.stats as stats

In [None]:
plt.figure()
foo1 = plt.hist(Full['SAT Participation (%)'], color = 'blue', bins = 30)
plt.xlabel("% Participation")
plt.ylabel("Number of States")
plt.title("Distribution of SAT Participation")
plt.savefig('foo1.png')
plt.show()

plt.figure()
bar1 = plt.hist(Full['ACT Participation (%)'], color = 'red', bins = 30)
plt.xlabel("% Participation")
plt.ylabel("Number of States")
plt.title("Distribution of ACT Participation")
plt.savefig('bar1.png')
plt.show()

#Plotting histograms of the participation rates for both tests.  Each savefig call comes before plt.show() so that
#the exported image is not blank.
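
The question suggests placing one histogram above the other, which `plt.subplots` handles directly. This is a sketch only: `plot_stacked_hists` is a hypothetical helper, and the values are the participation rates from the first ten rows of the merged table.

```python
import matplotlib.pyplot as plt

def plot_stacked_hists(sat_rates, act_rates, bins=10):
    """Draw SAT and ACT participation histograms one above the other."""
    # sharex=True keeps both panels on the same 0-100 scale for comparison.
    fig, (ax_sat, ax_act) = plt.subplots(2, 1, sharex=True, figsize=(8, 8))
    ax_sat.hist(sat_rates, bins=bins, color="blue")
    ax_sat.set_title("Distribution of SAT Participation")
    ax_act.hist(act_rates, bins=bins, color="red")
    ax_act.set_title("Distribution of ACT Participation")
    ax_act.set_xlabel("% Participation")
    for ax in (ax_sat, ax_act):
        ax.set_ylabel("Number of States")
    fig.tight_layout()
    return fig, (ax_sat, ax_act)

# Participation rates from the first ten rows of the merged table.
sat_rates = [5, 38, 30, 3, 53, 11, 100, 100, 100, 83]
act_rates = [100, 65, 62, 100, 31, 100, 31, 18, 32, 73]
fig, axes = plot_stacked_hists(sat_rates, act_rates)
```

Calling `fig.savefig(...)` on the returned figure exports both panels as a single image.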

In [None]:
with pd.ExcelWriter('pandas_simple.xlsx', engine='xlsxwriter') as writer:
    Full.to_excel(writer, sheet_name='Sheet 1')

#I think the above histograms, while serviceable, are lacking a bit in clarity.  I'm exporting my Full data frame
#as an Excel file so I can make a fancier looking heat map in Tableau. See below for the finished products.
#The with block closes the writer, which is what actually writes the file to disk.

##### 17. Plot the Math(s) distributions from both data sets.

In [None]:
sns.distplot(Full['SAT Math'])
plt.xlabel("SAT Math Scores")
plt.ylabel("Probability")
plt.title("Distribution of SAT Math Scores")
plt.show()

plt.figure()
sns.distplot(Full['ACT Math'])
plt.xlabel("ACT Math Scores")
plt.ylabel("Probability")
plt.title("Distribution of ACT Math Scores")
plt.show()

##### 18. Plot the Verbal distributions from both data sets.

In [None]:
sns.distplot(Full['SAT Reading/Writing'])
plt.xlabel("SAT Reading/Writing Scores")
plt.ylabel("Probability")
plt.title("Distribution of SAT Reading/Writing")
plt.show()

plt.figure()
sns.distplot(Full['ACT English'])
plt.xlabel("ACT English Scores")
plt.ylabel("Probability")
plt.title("Distribution of ACT English Scores")
plt.show()

In [None]:
plt.figure()
sns.distplot(Full['ACT Science'])
plt.xlabel("ACT Science Scores")
plt.ylabel("Probability")
plt.title("Distribution of ACT Science Scores")
plt.show()

plt.figure()
sns.distplot(Full['ACT Reading'])
plt.xlabel("ACT Reading Scores")
plt.ylabel("Probability")
plt.title("Distribution of ACT Reading Scores")
plt.show()

In [None]:
plt.figure()
sns.distplot(Full['SAT Total'])
plt.xlabel("SAT Final Scores")
plt.ylabel("Probability")
plt.title("Distribution of SAT Final Scores")
plt.show()

plt.figure()
sns.distplot(Full['ACT Composite'])
plt.xlabel("ACT Composite Scores")
plt.ylabel("Probability")
plt.title("Distribution of ACT Composite Scores")
plt.show()

##### 19. When we make assumptions about how data are distributed, what is the most common assumption?

That the data is normally distributed.  Based purely upon the plots, the columns do not appear to be normally distributed.  However, since the number of data points is greater than 30 (51 total: the 50 states plus the District of Columbia), the Central Limit Theorem lets us treat the mean of each column as approximately normally distributed, even though the individual values are not.

##### 20. Does this assumption hold true for any of our columns? Which?

No column displays a pure normal distribution.  However, as mentioned previously, the column means can be approximated using a normal distribution, as the number of data points is greater than 30.
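
A visual judgment of normality can be backed with a formal test. This is a sketch using the Shapiro-Wilk test from SciPy, run only on the SAT participation rates of the first ten rows shown earlier, so it illustrates the call rather than giving a result for the full column.

```python
from scipy import stats

# SAT participation rates from the first ten rows of the merged table.
sat_rates = [5, 38, 30, 3, 53, 11, 100, 100, 100, 83]

# Null hypothesis: the values were drawn from a normal distribution.
statistic, p_value = stats.shapiro(sat_rates)
print(f"W = {statistic:.3f}, p = {p_value:.3f}")

alpha = 0.05
if p_value < alpha:
    print("Reject normality at the 5% level.")
else:
    print("No evidence against normality at the 5% level.")
```

Passing a full column such as `Full['SAT Participation (%)']` in place of the list would test all 51 observations.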

##### 21. Plot some scatterplots examining relationships between all variables.

In [None]:
fig, ax = plt.subplots()
foo=plt.scatter(y=Full['SAT Participation (%)'],x=Full['SAT Math'], color='blue')
z = np.polyfit(Full['SAT Math'], Full['SAT Participation (%)'], 1)
p = np.poly1d(z)
plt.plot(Full['SAT Math'],p(Full['SAT Math']),"r--")
plt.xlabel("Range of Mean SAT Math Scores")
plt.ylabel("Percent Participation")
plt.title("Distribution of Mean SAT Math Scores")
fig.set_size_inches(18.5, 10.5)
plt.ylim([-3,103])
plt.show()

fig, ax = plt.subplots()
bar=plt.scatter(y=Full['ACT Participation (%)'], x=Full['ACT Math'],color='blue')
z = np.polyfit(Full['ACT Math'], Full['ACT Participation (%)'], 1)
p = np.poly1d(z)
plt.plot(Full['ACT Math'],p(Full['ACT Math']),"r--")
plt.xlabel("Range of Mean ACT Math Scores")
plt.ylabel("Percent Participation")
plt.title("Distribution of Mean ACT Math Scores")
plt.ylim([-3,103])
fig.set_size_inches(18.5, 10.5)
plt.show()

#Making scatterplots of participation rate against mean score for each test. np.polyfit gives the line of best fit.  ylim sets
#the Y range to something more manageable (defaults to -20 to 120 for some odd reason)

In [None]:
fig, ax = plt.subplots()
foo2=plt.scatter(y=Full['SAT Participation (%)'],x=Full['SAT Reading/Writing'], color = 'blue') 
z = np.polyfit(Full['SAT Reading/Writing'], Full['SAT Participation (%)'], 1)
p = np.poly1d(z)
plt.plot(Full['SAT Reading/Writing'],p(Full['SAT Reading/Writing']),"r--")
plt.ylabel("Percent Participation")
plt.xlabel("Range of Mean SAT Reading/Writing Scores")
plt.title("Distribution of Mean SAT Reading/Writing Scores")
fig.set_size_inches(18.5, 10.5)
plt.show()

fig, ax = plt.subplots()
bar2=plt.scatter(y=Full['ACT Participation (%)'], x=Full['ACT English'], color = 'blue')
z = np.polyfit(Full['ACT English'], Full['ACT Participation (%)'], 1)
p = np.poly1d(z)
plt.plot(Full['ACT English'],p(Full['ACT English']),"r--")
plt.xlabel("Range of Mean ACT English Scores")
plt.ylabel("Percent Participation")
plt.ylim([-3,103])
plt.title("Distribution of Mean ACT English Scores")
fig.set_size_inches(18.5, 10.5)
plt.show()

In [None]:
fig, ax = plt.subplots()
foo=plt.scatter(x=Full['ACT Participation (%)'],y=Full['SAT Participation (%)'], color='blue')
z = np.polyfit(Full['ACT Participation (%)'], Full['SAT Participation (%)'], 1)
p = np.poly1d(z)
plt.plot(Full['ACT Participation (%)'],p(Full['ACT Participation (%)']),"r--")
plt.xlabel('ACT Participation (%)')
plt.ylabel('SAT Participation (%)')
plt.title("SAT Participation vs ACT Participation")
fig.set_size_inches(14.5, 10.5)
plt.show()

##### 22. Are there any interesting relationships to note?

From the scatterplots, there appears to be a negative correlation between participation percentage and scores.  To put it another way, the highest scores on both tests occur in the states with the lowest participation.  This correlation appears to hold for both the SAT and ACT data. From this, I theorize that the people who take the test in the lowest-participation states are more highly motivated students who spend a great deal of time studying for the test, as opposed to students in states with higher participation.

Secondly, the distribution of participation is heavily skewed. SAT participation is heavily skewed to the right, while ACT participation is heavily skewed to the left. In addition, it appears that the majority of the states that offer the SAT do not mandate it, or see it taken at a very low (<15%) rate. On the other hand, ACT participation is more "all or nothing": states that mandated the ACT tended to give the test to everyone rather than a select few.
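
The strength of the participation-versus-score relationship can be quantified with a correlation coefficient. This is a sketch using `scipy.stats.pearsonr` on the first ten rows of the SAT data shown earlier; the full column would be needed for a result that describes every state.

```python
from scipy import stats

# First ten rows of the merged table.
sat_participation = [5, 38, 30, 3, 53, 11, 100, 100, 100, 83]
sat_math = [572, 533, 553, 594, 524, 595, 512, 492, 468, 497]

# r close to -1 means scores fall steadily as participation rises.
r, p_value = stats.pearsonr(sat_participation, sat_math)
print(f"Pearson r = {r:.2f}, p = {p_value:.4f}")
```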

##### 23. Create box plots for each variable. 

In [None]:
ax = sns.boxplot(data = Full.loc[:,['ACT English', 'ACT Math','ACT Reading', 'ACT Science']],orient="v", width=0.6)
ax = sns.swarmplot(data = Full.loc[:,['ACT English', 'ACT Math','ACT Reading', 'ACT Science']],orient="v", color = "Black")
plt.ylabel("Range of ACT Scores")
plt.show()

In [None]:
ax2 = sns.boxplot(data = Full.loc[:,['SAT Reading/Writing', 'SAT Math']],orient="v", width=0.4)
ax2 = sns.swarmplot(data = Full.loc[:,['SAT Reading/Writing', 'SAT Math']],orient="v", color = "Black")
plt.ylabel("Range of SAT Scores")
plt.show()

In [None]:
ax3 = sns.boxplot(data = Full.loc[:,['SAT Total']],orient="v", width=0.2)
ax3 = sns.swarmplot(data = Full.loc[:,['SAT Total']],orient="v", color = "Black")
plt.ylabel("Range of SAT Final Scores")
plt.show()

In [None]:
ax4 = sns.boxplot(data = Full.loc[:,['ACT Composite']],orient="v", width=0.2, color='yellow')
ax4 = sns.swarmplot(data = Full.loc[:,['ACT Composite']],orient="v", color = "Black")
plt.ylabel("Range of ACT Composite Scores")
plt.show()

##### BONUS: Using Tableau, create a heat map for each variable using a map of the US. 

Done!

SAT Participation Heat Map: https://public.tableau.com/shared/BR3WHQN8D?:display_count=yes

ACT Participation Heat Map: https://public.tableau.com/shared/2DZZK656Y?:display_count=yes

Thanks to Diego Rodriguez for help in getting these set up.

## Step 4: Descriptive and Inferential Statistics

##### 24. Summarize each distribution. As data scientists, be sure to back up these summaries with statistics. (Hint: What are the three things we care about when describing distributions?)

Distributions are typically summarized via shape, spread, and center. All the score distributions display a non-symmetric spread of data points around the mean. This indicates that we are not dealing with a pure normal distribution.

##### 25. Summarize each relationship. Be sure to back up these summaries with statistics.

In [None]:
Full.describe()

From the data there are several relationships that can be inferred. The raw category scores (English and Math) for both tests are heavily skewed to the right, indicating that the outliers tended to be very high achievers and thus inflated the mean of the data above the median.

Secondly, based upon the mean participation rates (about 65% for the ACT versus about 40% for the SAT), the ACT is taken more widely than the SAT.

##### 26. Execute a hypothesis test comparing the SAT and ACT participation rates. Use $\alpha = 0.05$. Be sure to interpret your results.

In [None]:
stats.ttest_ind(Full['SAT Participation (%)'], Full['ACT Participation (%)'])
#This call runs a two-sample t-test on the two columns and returns the test statistic and the p-value, which is then compared against alpha.

Due to the extremely small p-value (0.002%), which is well below $\alpha = 0.05$, we reject the null hypothesis that the SAT and ACT have the same mean participation rate.  Conceptually, this makes sense.  The tests are offered in very different geographic areas, with widely different curricula, and with students having widely ranging motivations for taking the test.  It is extremely unlikely that all these variables would come together to produce identical participation.
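
Two variations on the test may be worth comparing, since the two samples have different spreads and each state contributes one value to both. This is a sketch on the first ten rows only; whether Welch's test or a paired test is the better fit here is a judgment call, not something the data settles.

```python
from scipy import stats

# Participation rates from the first ten rows of the merged table.
sat_rates = [5, 38, 30, 3, 53, 11, 100, 100, 100, 83]
act_rates = [100, 65, 62, 100, 31, 100, 31, 18, 32, 73]

# Welch's t-test: does not assume the two groups share a variance.
welch = stats.ttest_ind(sat_rates, act_rates, equal_var=False)
# Paired t-test: treats each state's SAT and ACT rates as a matched pair.
paired = stats.ttest_rel(sat_rates, act_rates)

alpha = 0.05
for name, result in [("Welch", welch), ("paired", paired)]:
    decision = "reject" if result.pvalue < alpha else "fail to reject"
    print(f"{name}: t = {result.statistic:.2f}, p = {result.pvalue:.3f} -> {decision} H0")
```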

##### 27. Generate and interpret 95% confidence intervals for SAT and ACT participation rates.

In [None]:
#SAT 95% Confidence Intervals
stats.norm.interval(0.95, loc=39.803922, scale=35.276632/np.sqrt(len(Full)))
#This call computes the 95% confidence interval for the mean.  loc is the mean of the data; scale is the standard error,
#i.e. the standard deviation divided by the square root of the number of observations (51 rows).

In [None]:
#ACT 95% Confidence Intervals
stats.norm.interval(0.95, loc=65.254902, scale=32.140842/np.sqrt(len(Full)))

##### 28. Given your answer to 26, was your answer to 27 surprising? Why?

No, the answer is not surprising. As mentioned in 26, the extremely small p-value indicates that the two tests have different mean participation rates. The confidence intervals agree: the interval for the SAT mean (roughly 30% to 49%) lies entirely below the interval for the ACT mean (roughly 56% to 74%), so the two do not overlap.  Both results point to the same conclusion, that ACT participation is higher on average than SAT participation.
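
A confidence interval for a mean is usually built from the standard error rather than the raw standard deviation, and for small samples from the t distribution rather than the normal. This is a sketch in which `mean_confidence_interval` is a hypothetical helper, applied to the first ten SAT participation rates.

```python
import numpy as np
from scipy import stats

def mean_confidence_interval(values, confidence=0.95):
    """Return (low, high) for the mean of values, using the t distribution."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    # stats.sem is the sample standard deviation (ddof=1) divided by sqrt(n).
    standard_error = stats.sem(values)
    return stats.t.interval(confidence, df=n - 1, loc=values.mean(), scale=standard_error)

# SAT participation rates from the first ten rows of the merged table.
sat_rates = [5, 38, 30, 3, 53, 11, 100, 100, 100, 83]
low, high = mean_confidence_interval(sat_rates)
print(f"95% CI for the mean: ({low:.1f}, {high:.1f})")
```

With all 51 rows the t-based and normal-based intervals would be close, since the t distribution approaches the normal as the sample grows.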

##### 29. Is it appropriate to generate correlation between SAT and ACT math scores? Why?

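Only with caution. The different score scales are not the obstacle: a correlation coefficient is unchanged when either variable is linearly rescaled, so the two datasets do not need to be normalized with respect to one another. The concern is that each state's mean comes from a different, self-selected group of test takers, as the participation rates show, so a correlation between the state means says little about how individual students would perform on the two tests.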

##### 30. Suppose we only seek to understand the relationship between SAT and ACT data in 2017. Does it make sense to conduct statistical inference given the data we have? Why?

No, because the data we have IS the population data. Since this is the population, there is nothing to infer from a sample: we do not need to take a sample to begin with.  We have all the data of the population readily available to us.