<h1> The Challenge:</h1>

Using this dataset of school financial, enrollment, and achievement data, we are interested in which information is a useful indicator of student performance at the state level.

This question is a bit too big for a checkpoint, however. Instead, we want you to look at smaller questions related to our overall goal. Here's the overview:

1. Choose a specific test to focus on
>Math/Reading for 4th/8th grade
2. Pick or create features to use
>Will all the features be useful in predicting test score? Are some more important than others? Should you standardize, bin, or scale the data?
3. Explore the data as it relates to that test
>Create 2 well-labeled visualizations (graphs), each with a caption describing the graph and what it tells us about the data
4. Create training and testing data
>Do you want to train on all the data? Only data from the last 10 years? Only Michigan data?
5. Train an ML model to predict the outcome
>Pick whether you want to do a regression or classification task. In either case, define _exactly_ what you want to predict, and pick any model in sklearn to use (see sklearn <a href="https://scikit-learn.org/stable/modules/linear_model.html">regressors</a> and <a href="https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html">classifiers</a>).
6. Summarize your findings
>Write a 1 paragraph summary of what you did and make a recommendation about if and how student performance can be predicted

**Include comments throughout your code!** Every cleanup and preprocessing task should be documented.


Of course, if you're finding this assignment interesting (and we really hope you do!), you are welcome to do more than the requirements! For example, you may want to see if expenditure affects 4th graders more than 8th graders. Maybe you want to look into the extended version of this dataset and see how factors like sex and race are involved. You can include all your work in this notebook when you turn it in -- just always make sure you explain what you did and interpret your results. Good luck!

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# feel free to import other libraries! 

In [2]:
df = pd.read_csv('states_edu.csv')

Chosen test: **<Math for 4th grade\>**

<h2> Cleanup (optional)</h2>

_Use this space to rename columns, deal with missing data, etc._
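
As one possible starting point for this section, the sketch below drops rows that lack the target score and renames a column. It runs on a small made-up frame rather than on `states_edu.csv`; the `STATE` column, the values, and the new column name `ENROLL_4TH_GRADE` are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df (values are made up for illustration, not from states_edu.csv)
toy = pd.DataFrame({
    'STATE': ['A', 'B', 'C', 'D'],
    'ENROLL_4': [1000.0, np.nan, 3000.0, 4000.0],
    'AVG_MATH_4_SCORE': [230.0, 235.0, np.nan, 240.0],
})

# Rows without the target cannot be used for training or evaluation, so drop them
toy_clean = toy.dropna(subset=['AVG_MATH_4_SCORE'])

# Rename a column to something more readable (hypothetical new name)
toy_clean = toy_clean.rename(columns={'ENROLL_4': 'ENROLL_4TH_GRADE'})

# Count the missing values that remain in each column
print(toy_clean.isna().sum())
```

Dropping only on the target keeps rows that are missing a feature, so those can still be handled later, once the final feature list is known.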

<h2> Feature Selection </h2>

_Use this space to modify or create features_

In [0]:
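# Keep only rows with TOTAL_REVENUE of at least 20,000,000; .copy() avoids a SettingWithCopyWarning below
df = df[df['TOTAL_REVENUE'] >= 20000000].copy()
# New feature: share of total revenue that comes from local sources
df['LOCAL_REVENUE TOTAL_REVENUE RATIO'] = df['LOCAL_REVENUE'] / df['TOTAL_REVENUE']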

Final feature list: **<`ENROLL_8` and `LOCAL_REVENUE TOTAL_REVENUE RATIO`, a new feature equal to local revenue divided by total revenue. I also removed all rows where `TOTAL_REVENUE` was less than 20,000,000.\>**

Feature selection justification: **<I wanted to compare richer states and see whether the ratio of local revenue to total revenue had any effect on fourth grade math scores.\>**
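
One quick way to sanity-check a justification like this is to look at how each candidate feature correlates with the target before modeling. The sketch below does that on a made-up frame, so the numbers say nothing about the real data. The column names match the ones used in this notebook; the values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for df (made-up values, not results from states_edu.csv)
toy = pd.DataFrame({
    'LOCAL_REVENUE': [10.0, 20.0, 30.0, 40.0, 50.0],
    'TOTAL_REVENUE': [100.0, 100.0, 100.0, 100.0, 100.0],
    'ENROLL_8': [5.0, 3.0, 8.0, 1.0, 9.0],
    'AVG_MATH_4_SCORE': [240.0, 238.0, 236.0, 234.0, 232.0],
})
toy['LOCAL_REVENUE TOTAL_REVENUE RATIO'] = toy['LOCAL_REVENUE'] / toy['TOTAL_REVENUE']

features = ['ENROLL_8', 'LOCAL_REVENUE TOTAL_REVENUE RATIO']

# Pearson correlation of each feature with the target
corr_with_target = toy[features].corrwith(toy['AVG_MATH_4_SCORE'])
print(corr_with_target)
```

A correlation near zero does not prove a feature is useless, and a strong one does not prove causation, but the table is a cheap first look at which features may matter.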

<h2> EDA </h2>

Visualization 1

In [0]:
df.plot.scatter(x='ENROLL_4',y='AVG_MATH_4_SCORE')
plt.xlabel('4th grade enrollment')
plt.ylabel('4th grade math score')
plt.title('Distribution of 4th grade math scores based on 4th grade enrollment')

**<CAPTION FOR VIZ 1>**

Visualization 2

In [0]:
df.plot.scatter(x='LOCAL_REVENUE TOTAL_REVENUE RATIO',y='AVG_MATH_4_SCORE')
plt.xlabel('Local_revenue total_revenue ratio')
plt.ylabel('4th grade math score')
plt.title('Distribution of 4th grade math scores based on Local Revenue and Total Revenue ratio')

**<CAPTION FOR VIZ 2>**

<h2> Data Creation </h2>

_Use this space to create train/test data_

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
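# Drop rows missing either a feature or the target so X and y stay aligned and NaN-free
data = df[['ENROLL_8', 'LOCAL_REVENUE TOTAL_REVENUE RATIO', 'AVG_MATH_4_SCORE']].dropna()
X = data[['ENROLL_8', 'LOCAL_REVENUE TOTAL_REVENUE RATIO']]
y = data['AVG_MATH_4_SCORE']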

In [0]:
X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.3, random_state=0)
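
If you decide to standardize the features, a common approach is to fit the scaler on the training split only and reuse it on the test split, so no information from the test data leaks into training. The sketch below shows this with scikit-learn's `StandardScaler` on made-up arrays; the `X_toy_*` names and values are assumptions for illustration, and the notebook's model does not require this step.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for X_train and X_test (made-up values): two features on very different scales
X_toy_train = np.array([[1000.0, 0.2], [2000.0, 0.4], [3000.0, 0.6]])
X_toy_test = np.array([[2500.0, 0.5]])

scaler = StandardScaler()
# Learn each column's mean and standard deviation from the training data only
X_train_scaled = scaler.fit_transform(X_toy_train)
# Apply the same training-set statistics to the test data
X_test_scaled = scaler.transform(X_toy_test)

print(X_train_scaled)
print(X_test_scaled)
```

Plain linear regression gives the same predictions with or without scaling, but scaled features make the coefficients easier to compare with one another.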

<h2> Prediction </h2>

Chosen ML task: **<REGRESSION>**

In [0]:
# import your sklearn class here
from sklearn.linear_model import LinearRegression

In [0]:
# create your model here
model = LinearRegression()

In [0]:
model.fit(X_train, y_train)

In [0]:
y_pred = model.predict(X_test)

In [0]:
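# for classification only (skip this cell for a regression model such as LinearRegression):
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay

# ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, cmap=plt.cm.Blues)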

In [0]:
# for regression: (pick a single column to visualize results)

# Results from this graph _should not_ be used as a part of your results -- it is just here to help with intuition. 
# Instead, look at the error values and individual intercepts.


col_name = 'LOCAL_REVENUE TOTAL_REVENUE RATIO'
col_index = X_train.columns.get_loc(col_name)

f = plt.figure(figsize=(12,6))
# Label each series directly so the legend cannot get out of order
plt.scatter(X_train[col_name], y_train, color="red", label='true training')
plt.scatter(X_train[col_name], model.predict(X_train), color="green", label='predicted training')
plt.scatter(X_test[col_name], model.predict(X_test), color="blue", label='predicted testing')

# Line with the model's slope for col_name, holding the other features at the row with the smallest col_name
new_x = np.linspace(X_train[col_name].min(),X_train[col_name].max(),200)
intercept = model.predict(X_train.sort_values(col_name).iloc[[0]]) - X_train[col_name].min()*model.coef_[col_index]
plt.plot(new_x, intercept+new_x*model.coef_[col_index], label='controlled model')

plt.legend()
plt.xlabel(col_name)
plt.ylabel('Math 4 score')
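
The comment above recommends looking at error values rather than the graph. A minimal sketch of how those could be computed with `sklearn.metrics` follows. It uses made-up data and `demo_`-prefixed names so that it runs on its own; in the notebook you would pass `y_test` and `y_pred` instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy stand-in data (made up for illustration, not taken from states_edu.csv)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(60, 2))
y_demo = 230 + 10 * X_demo[:, 0] - 5 * X_demo[:, 1] + rng.normal(0, 1, size=60)

# Simple hold-out split: first 40 rows for training, last 20 for testing
X_demo_train, X_demo_test = X_demo[:40], X_demo[40:]
y_demo_train, y_demo_test = y_demo[:40], y_demo[40:]

demo_model = LinearRegression().fit(X_demo_train, y_demo_train)
y_demo_pred = demo_model.predict(X_demo_test)

# Average absolute error, in the same units as the score
mae = mean_absolute_error(y_demo_test, y_demo_pred)
# Root mean squared error, which penalizes large misses more heavily
rmse = np.sqrt(mean_squared_error(y_demo_test, y_demo_pred))
# Fraction of variance in the test scores explained by the model
r2 = r2_score(y_demo_test, y_demo_pred)

print('MAE:', mae)
print('RMSE:', rmse)
print('R^2:', r2)
print('coefficients:', demo_model.coef_, 'intercept:', demo_model.intercept_)
```

Reporting these on the test split, not the training split, gives a fairer picture of how well the model would predict unseen state-year rows.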

<h2> Summary </h2>

**<I wanted to compare the ratio of local revenue to total revenue against fourth grade math scores in richer states. It is often claimed that putting more money into students leads to higher exam scores, but that relationship can be driven by other variables when the states being compared have very different levels of wealth. To limit this, I kept only the rows with `TOTAL_REVENUE` of at least 20,000,000 and fit a linear regression on that subset. I found that fourth grade math scores were actually higher as the ratio decreased. This is surprising, because a larger share of local revenue is usually thought to be good for school education.\>**