### College of Computing and Informatics, Drexel University
### INFO 213: Data Science Programming II
---

## Final Report

## Project Title: Countrywide Car Accidents Analysis and Forecasting

## Student(s): Khanh Tran, Amanjyot Singh

#### Date: August 30, 2020
---

### 1. Introduction
---
*(Introduce the project and describe the objectives.)* 

With enough data and carefully executed analytics, one can uncover the many faces of a problem or phenomenon. Data science is widely regarded as a direct and reliable way to attack a problem: tracing it to its root and predicting what consequences will follow, and when. This project follows the same direction and tries to address a specific real-world question: what can data analytics do to reduce the number of car accidents in the U.S.? The analysis is based on “A Countrywide Traffic Accident Dataset” by Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. In this project, we strive to understand the cause-and-effect rules of the accidents and, from that, to build several machine learning models that can help forecast future accidents.

### 2. Problem Definition
---
*(Define the problem that will be solved in this data analytics project.)*

On average, there are 6 million car accidents in the U.S. every year. That's roughly 16,438 per day. Over 37,000 Americans die in automobile crashes per year, and an additional 3 million are injured or disabled annually. Economically, traffic accidents cost the country $871 billion a year, and that figure is from 6 years ago. These are only a few quick statistics on car crashes in the U.S. today. Even though the country ranks 110th on the list of countries with the highest traffic-related death rate, the number can still be lowered substantially if science-based solutions are carried out to improve the safety of people on the roads. With a good dataset, data analysis can be an efficient way to extract useful information and work out the cause-and-effect rules of the accidents, which in turn supports better accident prevention.

### 3. Data Sources
---
*(Describe the origin of the data sources. What is the format of the original data? How to access the data?)*

The dataset was acquired on Kaggle. Because of its size, downloading it to a local computer is quite time-consuming, so we work in a Kaggle notebook, which gives users direct access to the datasets hosted on the site without a manual download. There are currently about 3.5 million accident records in this dataset. It covers 49 states of the USA, and the data were collected from February 2016 to June 2020 using two APIs that provide streaming traffic incident (or event) data. Along with the large number of records, this dataset also provides a wide range of attributes for each accident. With 49 columns, analysts can examine many aspects of the accidents, such as start and end time, exact start and end location, address, weather conditions, and the presence of crossings, junctions, or bumps. Our goals are built on this variety of features. We also use the pandas, numpy, matplotlib.pyplot, math, and sklearn packages of Python to analyze, visualize, and model our data.

Acknowledgements

- Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. "A Countrywide Traffic Accident Dataset." 2019.

- Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. "Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights." In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.

https://www.kaggle.com/sobhanmoosavi/us-accidents

### 4. The Goal(s) of the predictions
---
*(What are the expected results of the project?)*

The project's mission is to assist the effort against car accidents with statistics-based findings and data-based analysis. More specifically, we aim to determine how important each attribute is for predicting the severity level of an accident. The process is to create two similar models that predict the severity level of recorded accidents from a set of attributes. One model takes in all attributes except the target attribute (for example, weather conditions), and the other takes in every attribute including the target attribute. Here, "target attribute" means the attribute under study, not the prediction label, which is always Severity. Two sets of metric scores are then calculated and compared to see whether adding the target attribute improves the model's performance or hurts it. Additionally, the influence of each target attribute is evaluated to find out which one contributes the most and which the least to the performance of the algorithms. The target attributes are:

- Weather Conditions
- Locations
- Time of the day
- Time of the year

For each attribute, we will create a separate set of models. We will try to implement as many machine learning algorithms as possible. Each of the attributes listed above will be carefully processed and fed into the models, so that it retains its information and, we hope, is influential enough to change the performance of the algorithms for better or worse.
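
The comparison of the two sets of scores can be sketched as follows. This is a minimal illustration, not the notebook's code: the helper name `compare_to_benchmark` and the accuracy values are hypothetical. It reports the size of each change in percentage points rather than only whether a score went up.

```python
def compare_to_benchmark(benchmark, candidate):
    """Return the accuracy change (candidate - benchmark) per model, in percentage points."""
    return {
        name: round(candidate[name] - benchmark[name], 2)
        for name in benchmark
        if name in candidate
    }


# Hypothetical accuracies in percent, for illustration only (not results from this project)
benchmark_scores = {"knn": 70.0, "decision_tree": 68.5}
weather_scores = {"knn": 70.4, "decision_tree": 68.1}

deltas = compare_to_benchmark(benchmark_scores, weather_scores)

# List the models from largest gain to largest loss
for model, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {delta:+.2f} points")
```

Keeping the signed difference makes it possible to rank the target attributes by how much they help or hurt each algorithm.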

### 5. Experimental Models
---

See the other notebook.

### 6. Final Models
---

In [None]:
# Import libraries and models
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.linear_model import LogisticRegression, Perceptron, SGDClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("../input/us-accidents/US_Accidents_June20.csv")
df.head()

In [None]:
# With the Weather_Condition column
df2 = df[["Distance(mi)", 
          "Temperature(F)", 
          "Wind_Chill(F)", 
          "Humidity(%)", 
          "Pressure(in)", 
          "Visibility(mi)", 
          "Precipitation(in)", 
          "Weather_Condition",
          "Severity"]]

# Without the Weather_Condition column
df1 = df[["Distance(mi)",  
          "Temperature(F)", 
          "Wind_Chill(F)", 
          "Humidity(%)", 
          "Pressure(in)", 
          "Visibility(mi)", 
          "Precipitation(in)",
          "Severity"
          ]]

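# Treat -1 as a missing-value marker, then drop incomplete rows.
# Reassigning instead of using inplace=True on a slice of df avoids SettingWithCopyWarning.
df1 = df1.replace(-1, np.nan).dropna()
df2 = df2.replace(-1, np.nan).dropna()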

Y1 = df1.Severity.values
X1 = df1.loc[:, df1.columns != 'Severity']

After dropping NA values, we have nearly 1.3 million records left.
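
The cleaning step can be checked on a small example. The sketch below uses a tiny hypothetical frame (the values are made up, only the column names come from the dataset) and assumes, as the notebook does, that -1 marks a missing value. Note that this rule would also discard a genuine reading of -1, such as a temperature of -1 °F.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration; not taken from the accident dataset
raw = pd.DataFrame({
    "Temperature(F)": [36.9, -1, 50.0, np.nan],
    "Visibility(mi)": [10.0, 5.0, -1, 2.0],
    "Severity": [3, 2, 2, 3],
})

# Replace the -1 marker with NaN, then drop every row that has any NaN
clean = raw.replace(-1, np.nan).dropna()

print(f"rows before: {len(raw)}, rows after: {len(clean)}")
```

Printing the row count before and after makes the amount of discarded data explicit.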

In [None]:
print(Y1.shape)
print(X1.shape)

Because SVC in scikit-learn takes an enormous amount of time on large inputs (its time complexity is roughly O(n_samples^2 * n_features)), we leave that model out of the final run. In Section 5, it took us an hour to run SVC on 100k records. With quadratic scaling, 10 times more data means at least 10^2 = 100 times more time, so a run on 1 million records would take at least 100 hours. Even though SVC is absent from the final run, we can still draw a conclusion on its performance with and without the target attributes.
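
The scaling argument can be written out as a short calculation. This is a back-of-the-envelope sketch under the assumption that runtime grows with the square of the sample count; the function name `estimate_runtime_hours` is hypothetical, and real runtimes also depend on the kernel, cache size, and data.

```python
def estimate_runtime_hours(base_hours, base_samples, target_samples, exponent=2):
    """Extrapolate runtime assuming it scales as n_samples ** exponent."""
    scale = target_samples / base_samples
    return base_hours * scale ** exponent


# One hour at 100k records (from Section 5), extrapolated to 1 million records
estimate = estimate_runtime_hours(1.0, 100_000, 1_000_000)
print(f"Estimated SVC runtime: {estimate:.0f} hours")
```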

Each model below uses the best set of parameters found in Section 5. We will find out how each target attribute affects the performance of the models after fine-tuning. For the first target attribute, "Weather_Condition", we again run the models on two dataframes, one with the attribute and one without. The run without "Weather_Condition" serves as the benchmark not just for this attribute but also for the other three: every later set of models is compared against this set of benchmark scores.
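
The benchmark procedure can also be expressed as one reusable loop instead of repeated cells. The sketch below is an illustration under stated assumptions, not the notebook's implementation: it uses synthetic data from `make_classification` as a stand-in for the accident records, only three of the models, and the hypothetical helpers `make_models` and `evaluate_models`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def make_models():
    """Return fresh, unfitted models (hypothetical helper)."""
    return {
        "knn": KNeighborsClassifier(n_neighbors=10),
        "gaussian_nb": GaussianNB(),
        "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    }


def evaluate_models(X, y, test_size=0.33, random_state=99):
    """Fit each model on one train/test split and return test accuracy in percent."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    scores = {}
    for name, model in make_models().items():
        model.fit(X_train, y_train)
        scores[name] = round(model.score(X_test, y_test) * 100, 2)
    return scores


# Synthetic stand-in for the accident data (assumption, not the real dataset)
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, n_classes=4, random_state=0
)

benchmark = evaluate_models(X[:, :7], y)  # without the last column
candidate = evaluate_models(X, y)         # with the last column

for name in benchmark:
    print(name, benchmark[name], candidate[name], candidate[name] > benchmark[name])
```

Because every score comes from the same function, each model is always evaluated on the test split, and the benchmark names cannot drift between cells.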

In [None]:
X_train1, X_test1,Y_train1,Y_test1 = train_test_split(X1, Y1, test_size=0.33, random_state=99)
#Without Weather

# KNN

knn = KNeighborsClassifier(n_neighbors = 10)
knn.fit(X_train1, Y_train1)
Y_pred = knn.predict(X_test1)
acc_knn1 = round(knn.score(X_test1, Y_test1) * 100, 2)
print("Accuracy KNN: " , acc_knn1)


# Logistic Regression

logreg = LogisticRegression(max_iter = 2000, C=0.1)
logreg.fit(X_train1, Y_train1)
Y_pred = logreg.predict(X_test1)
acc_log1 = round(logreg.score(X_test1, Y_test1) * 100, 2)  # score on the test split, like the other models
print("Accuracy Log: ", acc_log1)


# Gaussian Naive Bayes

gaussian = GaussianNB()
gaussian.fit(X_train1, Y_train1)
Y_pred = gaussian.predict(X_test1)
acc_gaussian1 = round(gaussian.score(X_test1, Y_test1) * 100, 2)
print("Accuracy Gaussian (still no fine tuning): ", acc_gaussian1)

# Perceptron

perceptron = Perceptron(early_stopping=False, validation_fraction=0.2)
perceptron.fit(X_train1, Y_train1)
Y_pred = perceptron.predict(X_test1)
acc_perceptron1 = round(perceptron.score(X_test1, Y_test1) * 100, 2)
print("Accuracy Perceptron: ", acc_perceptron1)

# Stochastic Gradient Descent

sgd = SGDClassifier(early_stopping=True, validation_fraction=0.1)
sgd.fit(X_train1, Y_train1)
Y_pred = sgd.predict(X_test1)
acc_sgd1 = round(sgd.score(X_test1, Y_test1) * 100, 2)
print("Accuracy SGD: ", acc_sgd1)

# Decision Tree

decision_tree = DecisionTreeClassifier(max_depth=5)
decision_tree.fit(X_train1, Y_train1)
Y_pred = decision_tree.predict(X_test1)
acc_decision_tree1 = round(decision_tree.score(X_test1, Y_test1) * 100, 2)
print("Accuracy Decision Tree: ", acc_decision_tree1)

# Random Forest

random_forest = RandomForestClassifier(n_estimators=32)
random_forest.fit(X_train1, Y_train1)
Y_pred = random_forest.predict(X_test1)
random_forest.score(X_train1, Y_train1)
acc_random_forest1 = round(random_forest.score(X_test1, Y_test1) * 100, 2)
print("Accuracy Random Forest: ", acc_random_forest1)

In [None]:
# Mapping
encoded_cons = []
for con in df2["Weather_Condition"].values:
    if "Rain" in con.split(" "):
        encoded_cons.append(1)
    elif "Snow" in con.split(" "):
        encoded_cons.append(2)
    elif "Fog" in con.split(" "):
        encoded_cons.append(3)
    else:
        encoded_cons.append(4)

# New column and delete the original Weather_Condition column
df2['Encoded_Weather'] = encoded_cons
del df2["Weather_Condition"]

Y = df2.Severity.values
X = df2.loc[:, df2.columns != 'Severity']
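
The mapping rule used in the cell above can be isolated in a small function, which makes it easy to test on individual strings. This is a sketch of the same rule; the function name `encode_weather` and the sample condition strings are assumptions for illustration.

```python
def encode_weather(condition):
    """Map a weather description to 1 (rain), 2 (snow), 3 (fog), or 4 (anything else)."""
    words = condition.split(" ")
    if "Rain" in words:
        return 1
    if "Snow" in words:
        return 2
    if "Fog" in words:
        return 3
    return 4


# Sample strings for illustration; the real column holds many more variants
for text in ["Light Rain", "Heavy Snow", "Fog", "Clear"]:
    print(text, "->", encode_weather(text))
```

In the notebook, such a function could be applied with `df2["Weather_Condition"].map(encode_weather)`. Because the rule matches whole words, a description such as "Rainy" would fall into category 4.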

In [None]:
X_train2, X_test2, Y_train2, Y_test2 = train_test_split(X, Y, test_size=0.33, random_state=99)
#With weather

# KNN

knn = KNeighborsClassifier(n_neighbors = 10)
knn.fit(X_train2, Y_train2)
Y_pred = knn.predict(X_test2)
acc_knn2 = round(knn.score(X_test2, Y_test2) * 100, 2)
print("Accuracy KNN: " , acc_knn2)
print("Improvement: ", acc_knn2 > acc_knn1)


# Logistic Regression

logreg = LogisticRegression(max_iter = 2000, C=0.1)
logreg.fit(X_train2, Y_train2)
Y_pred = logreg.predict(X_test2)
acc_log2 = round(logreg.score(X_test2, Y_test2) * 100, 2)
print("Accuracy Log: ", acc_log2)
print("Improvement: ", acc_log2 > acc_log1)


# Gaussian Naive Bayes

gaussian = GaussianNB()
gaussian.fit(X_train2, Y_train2)
Y_pred = gaussian.predict(X_test2)
acc_gaussian2 = round(gaussian.score(X_test2, Y_test2) * 100, 2)
print("Accuracy Gaussian (still no fine tuning): ", acc_gaussian2)
print("Improvement: ", acc_gaussian2 > acc_gaussian1)

# Perceptron

perceptron = Perceptron(early_stopping=True, validation_fraction=0.1)
perceptron.fit(X_train2, Y_train2)
Y_pred = perceptron.predict(X_test2)
acc_perceptron2 = round(perceptron.score(X_test2, Y_test2) * 100, 2)
print("Accuracy Perceptron: ", acc_perceptron2)
print("Improvement: ", acc_perceptron2 > acc_perceptron1)


# Stochastic Gradient Descent

sgd = SGDClassifier(early_stopping=False, validation_fraction=0.2)
sgd.fit(X_train2, Y_train2)
Y_pred = sgd.predict(X_test2)
acc_sgd2 = round(sgd.score(X_test2, Y_test2) * 100, 2)
print("Accuracy SGD: ", acc_sgd2)
print("Improvement: ", acc_sgd2 > acc_sgd1)


# Decision Tree

decision_tree = DecisionTreeClassifier(max_depth=4)
decision_tree.fit(X_train2, Y_train2)
Y_pred = decision_tree.predict(X_test2)
acc_decision_tree2 = round(decision_tree.score(X_test2, Y_test2) * 100, 2)
print("Accuracy Decision Tree: ", acc_decision_tree2)
print("Improvement: ", acc_decision_tree2 > acc_decision_tree1)


# Random Forest

random_forest = RandomForestClassifier(n_estimators=32)
random_forest.fit(X_train2, Y_train2)
Y_pred = random_forest.predict(X_test2)
random_forest.score(X_train2, Y_train2)
acc_random_forest2 = round(random_forest.score(X_test2, Y_test2) * 100, 2)
print("Accuracy Random Forest: ", acc_random_forest2)
print("Improvement: ", acc_random_forest2 > acc_random_forest1)

In [None]:
# With State column
dfState = df[["Distance(mi)", 
          "Temperature(F)", 
          "Wind_Chill(F)", 
          "Humidity(%)", 
          "Pressure(in)", 
          "Visibility(mi)", 
          "Precipitation(in)", 
          "State",
          "Severity"]]

# Reassign instead of using inplace=True on a slice of df
dfState = dfState.replace(-1, np.nan).dropna()

# Mapping 
encoded_states = []
for states in dfState["State"].values:
    if "PA" in states.split(" "):
        encoded_states.append(1)
    elif "CA" in states.split(" "):
        encoded_states.append(2)
    elif "NY" in states.split(" "):
        encoded_states.append(3)
    else:
        encoded_states.append(4)

# New column and delete the original State column
dfState['Encoded_States'] = encoded_states
del dfState["State"]


YState = dfState.Severity.values
XState = dfState.loc[:, dfState.columns != 'Severity']

In [None]:
X_trainState, X_testState, Y_trainState, Y_testState = train_test_split(XState, YState, test_size=0.33, random_state=99)
#With state

# KNN

knn = KNeighborsClassifier(n_neighbors = 10)
knn.fit(X_trainState, Y_trainState)
Y_pred = knn.predict(X_testState)
acc_knnState = round(knn.score(X_testState, Y_testState) * 100, 2)
print("Accuracy KNN: " , acc_knnState)
print("Improvement: ", acc_knnState > acc_knn1)


# Logistic Regression

logreg = LogisticRegression(max_iter = 2000, C=0.1)
logreg.fit(X_trainState, Y_trainState)
Y_pred = logreg.predict(X_testState)
acc_logState = round(logreg.score(X_testState, Y_testState) * 100, 2)
print("Accuracy Log: ", acc_logState)
print("Improvement: ", acc_logState > acc_log1)


# Gaussian Naive Bayes

gaussian = GaussianNB()
gaussian.fit(X_trainState, Y_trainState)
Y_pred = gaussian.predict(X_testState)
acc_gaussianState = round(gaussian.score(X_testState, Y_testState) * 100, 2)
print("Accuracy Gaussian (still no fine tuning): ", acc_gaussianState)
print("Improvement: ", acc_gaussianState > acc_gaussian1)


# Perceptron

perceptron = Perceptron(early_stopping=True, validation_fraction=0.1)
perceptron.fit(X_trainState, Y_trainState)
Y_pred = perceptron.predict(X_testState)
acc_perceptronState = round(perceptron.score(X_testState, Y_testState) * 100, 2)
print("Accuracy Perceptron: ", acc_perceptronState)
print("Improvement: ", acc_perceptronState > acc_perceptron1)


# Stochastic Gradient Descent

sgd = SGDClassifier(early_stopping=False, validation_fraction=0.2)
sgd.fit(X_trainState, Y_trainState)
Y_pred = sgd.predict(X_testState)
acc_sgdState = round(sgd.score(X_testState, Y_testState) * 100, 2)
print("Accuracy SGD: ", acc_sgdState)
print("Improvement: ", acc_sgdState > acc_sgd1)


# Decision Tree

decision_tree = DecisionTreeClassifier(max_depth=7)
decision_tree.fit(X_trainState, Y_trainState)
Y_pred = decision_tree.predict(X_testState)
acc_decision_treeState = round(decision_tree.score(X_testState, Y_testState) * 100, 2)
print("Accuracy Decision Tree: ", acc_decision_treeState)
print("Improvement: ", acc_decision_treeState > acc_decision_tree1)


# Random Forest

random_forest = RandomForestClassifier(n_estimators=32)
random_forest.fit(X_trainState, Y_trainState)
Y_pred = random_forest.predict(X_testState)
random_forest.score(X_trainState, Y_trainState)
acc_random_forestState = round(random_forest.score(X_testState, Y_testState) * 100, 2)
print("Accuracy Random Forest: ", acc_random_forestState)
print("Improvement: ", acc_random_forestState > acc_random_forest1)

In [None]:
import datetime


df["Year"] = pd.DatetimeIndex(df["Start_Time"]).year

#With Year
dfYear = df[["Distance(mi)", 
          "Temperature(F)", 
          "Wind_Chill(F)", 
          "Humidity(%)", 
          "Pressure(in)", 
          "Visibility(mi)", 
          "Precipitation(in)", 
          "Year",
          "Severity"]]

# Clean dfYear (the original cleaned dfState again by mistake, leaving NaN values in dfYear)
dfYear = dfYear.replace(-1, np.nan).dropna()

YYear = dfYear.Severity.values
XYear = dfYear.loc[:, dfYear.columns != 'Severity']

In [None]:
X_trainYear, X_testYear, Y_trainYear, Y_testYear = train_test_split(XYear, YYear, test_size=0.33, random_state=99)
#With year

# KNN

knn = KNeighborsClassifier(n_neighbors = 10)
knn.fit(X_trainYear, Y_trainYear)
Y_pred = knn.predict(X_testYear)
acc_knnYear = round(knn.score(X_testYear, Y_testYear) * 100, 2)
print("Accuracy KNN: " , acc_knnYear)
print("Improvement: ", acc_knnYear > acc_knn1)


# Logistic Regression

logreg = LogisticRegression(max_iter = 400, C=1)
logreg.fit(X_trainYear, Y_trainYear)
Y_pred = logreg.predict(X_testYear)
acc_logYear = round(logreg.score(X_testYear, Y_testYear) * 100, 2)
print("Accuracy Log: ", acc_logYear)
print("Improvement: ", acc_logYear > acc_log1)


# Gaussian Naive Bayes

gaussian = GaussianNB()
gaussian.fit(X_trainYear, Y_trainYear)
Y_pred = gaussian.predict(X_testYear)
acc_gaussianYear = round(gaussian.score(X_testYear, Y_testYear) * 100, 2)
print("Accuracy Gaussian (still no fine tuning): ", acc_gaussianYear)
print("Improvement: ", acc_gaussianYear > acc_gaussian1)

# Perceptron

perceptron = Perceptron(early_stopping=True, validation_fraction=0.1)
perceptron.fit(X_trainYear, Y_trainYear)
Y_pred = perceptron.predict(X_testYear)
acc_perceptronYear = round(perceptron.score(X_testYear, Y_testYear) * 100, 2)
print("Accuracy Perceptron: ", acc_perceptronYear)
print("Improvement: ", acc_perceptronYear > acc_perceptron1)


# Stochastic Gradient Descent

sgd = SGDClassifier(early_stopping=True, validation_fraction=0.1)
sgd.fit(X_trainYear, Y_trainYear)
Y_pred = sgd.predict(X_testYear)
acc_sgdYear = round(sgd.score(X_testYear, Y_testYear) * 100, 2)
print("Accuracy SGD: ", acc_sgdYear)
print("Improvement: ", acc_sgdYear > acc_sgd1)


# Decision Tree

decision_tree = DecisionTreeClassifier(max_depth=5)
decision_tree.fit(X_trainYear, Y_trainYear)
Y_pred = decision_tree.predict(X_testYear)
acc_decision_treeYear = round(decision_tree.score(X_testYear, Y_testYear) * 100, 2)
print("Accuracy Decision Tree: ", acc_decision_treeYear)
print("Improvement: ", acc_decision_treeYear > acc_decision_tree1)


# Random Forest

random_forest = RandomForestClassifier(n_estimators=32)
random_forest.fit(X_trainYear, Y_trainYear)
Y_pred = random_forest.predict(X_testYear)
random_forest.score(X_trainYear, Y_trainYear)
acc_random_forestYear = round(random_forest.score(X_testYear, Y_testYear) * 100, 2)
print("Accuracy Random Forest: ", acc_random_forestYear)
print("Improvement: ", acc_random_forestYear > acc_random_forest1)

In [None]:
df["Hour"] = pd.DatetimeIndex(df["Start_Time"]).hour

# With Hour
dfHour = df[["Distance(mi)", 
          "Temperature(F)", 
          "Wind_Chill(F)", 
          "Humidity(%)", 
          "Pressure(in)", 
          "Visibility(mi)", 
          "Precipitation(in)", 
          "Hour",
          "Severity"]]

# Reassign instead of using inplace=True on a slice of df
dfHour = dfHour.replace(-1, np.nan).dropna()

YHour = dfHour.Severity.values
XHour = dfHour.loc[:, dfHour.columns != 'Severity']
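
The year and hour features can be checked in isolation. The sketch below parses two hypothetical timestamps (not rows from the dataset) with `pd.to_datetime` and reads the components through the `.dt` accessor, which gives the same values as the `pd.DatetimeIndex(...)` calls used in the notebook.

```python
import pandas as pd

# Hypothetical timestamps in the same text format as the Start_Time column
starts = pd.Series(["2016-02-08 05:46:00", "2019-12-31 23:10:00"])

parsed = pd.to_datetime(starts)
features = pd.DataFrame({"Year": parsed.dt.year, "Hour": parsed.dt.hour})

print(features)
```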

In [None]:
X_trainHour, X_testHour, Y_trainHour, Y_testHour = train_test_split(XHour, YHour, test_size=0.33, random_state=99)
#With Hour

# KNN

knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_trainHour, Y_trainHour)
Y_pred = knn.predict(X_testHour)
acc_knnHour = round(knn.score(X_testHour, Y_testHour) * 100, 2)
print("Accuracy KNN: " , acc_knnHour)
print("Improvement: ", acc_knnHour > acc_knn1)


# Logistic Regression

logreg = LogisticRegression(max_iter = 400)
logreg.fit(X_trainHour, Y_trainHour)
Y_pred = logreg.predict(X_testHour)
acc_logHour = round(logreg.score(X_testHour, Y_testHour) * 100, 2)
print("Accuracy Log: ", acc_logHour)
print("Improvement: ", acc_logHour > acc_log1)


# Gaussian Naive Bayes

gaussian = GaussianNB()
gaussian.fit(X_trainHour, Y_trainHour)
Y_pred = gaussian.predict(X_testHour)
acc_gaussianHour = round(gaussian.score(X_testHour, Y_testHour) * 100, 2)
print("Accuracy Gaussian: ", acc_gaussianHour)
print("Improvement: ", acc_gaussianHour > acc_gaussian1)

# Perceptron

perceptron = Perceptron()
perceptron.fit(X_trainHour, Y_trainHour)
Y_pred = perceptron.predict(X_testHour)
acc_perceptronHour = round(perceptron.score(X_testHour, Y_testHour) * 100, 2)
print("Accuracy Perceptron: ", acc_perceptronHour)
print("Improvement: ", acc_perceptronHour > acc_perceptron1)


# Stochastic Gradient Descent

sgd = SGDClassifier()
sgd.fit(X_trainHour, Y_trainHour)
Y_pred = sgd.predict(X_testHour)
acc_sgdHour = round(sgd.score(X_testHour, Y_testHour) * 100, 2)
print("Accuracy SGD: ", acc_sgdHour)
print("Improvement: ", acc_sgdHour > acc_sgd1)


# Decision Tree

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_trainHour, Y_trainHour)
Y_pred = decision_tree.predict(X_testHour)
acc_decision_treeHour = round(decision_tree.score(X_testHour, Y_testHour) * 100, 2)
print("Accuracy Decision Tree: ", acc_decision_treeHour)
print("Improvement: ", acc_decision_treeHour > acc_decision_tree1)


# Random Forest

random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_trainHour, Y_trainHour)
Y_pred = random_forest.predict(X_testHour)
random_forest.score(X_trainHour, Y_trainHour)
acc_random_forestHour = round(random_forest.score(X_testHour, Y_testHour) * 100, 2)
print("Accuracy Random Forest: ", acc_random_forestHour)
print("Improvement: ", acc_random_forestHour > acc_random_forest1)

# Project Requirements

This final project examines the level of knowledge the students have learned from the course. The following course outcomes will be checked against the content of the report:

Upon successful completion of this course, a student will be able to:
* Describe the key Python tools and libraries that relate to a typical data analytics project.
* Identify data science libraries, frameworks, modules, and toolkits in Python that efficiently implement the most common data science algorithms and techniques.
* Apply latest Python techniques in data acquisition, transformation and predictive analytics for data science projects.
* Discuss the underlying principles and main characteristics of the most common methods and techniques for data analytics. 
* Build data analytic and predictive models for real world data sets using existing Python libraries.

**Marking will be focused on both presentation and content.**

## Written Presentation Requirements
The report will be judged on the basis of visual appearance, grammatical correctness, and quality of writing, as well as its contents. Please make sure that the text of your report is well-structured, using paragraphs, full sentences, and other features of well-written presentation.

## Technical Content:
* Is the problem well defined and described thoroughly?
* Is the size and complexity of the data set used in this project comparable to that of the example data sets used in the lectures and assignments?
* Did the report describe the characteristics of the data?
* Did the report describe the goals of the data analysis?
* Did the analysis conduct exploratory analyses on the data?
* Did the analysis build models of the data and evaluate the performance of the models?
* Overall, what is the rating of this project?