Several simple fixes for #8
rhiever committed Aug 21, 2015
1 parent 35f4a4f commit 98c20fe
Showing 1 changed file with 10 additions and 10 deletions.
@@ -34,9 +34,9 @@
"\n",
"This notebook uses several Python packages that come standard with the Anaconda Python distribution. The primary libraries that we'll be using are:\n",
"\n",
-"* **numpy**: Provides a fast numerical array structure and helper functions.\n",
+"* **NumPy**: Provides a fast numerical array structure and helper functions.\n",
"* **pandas**: Provides a DataFrame structure to store data in memory and work with it easily and efficiently.\n",
-"* **scikit-learn**: The main Machine Learning package in Python.\n",
+"* **scikit-learn**: The essential Machine Learning package in Python.\n",
"* **matplotlib**: Basic plotting library in Python; most other Python plotting libraries are built on top of it.\n",
"* **Seaborn**: Advanced statistical plotting library.\n",
"\n",
@@ -87,15 +87,15 @@
"\n",
">Did you define the metric for success before beginning?\n",
"\n",
-"Let's do that now. Since we're performing classification, we can use [accuracy](https://en.wikipedia.org/wiki/Accuracy_and_precision) to quantify how well our model is performing. Our head of data has told us that we should achieve at least 90% accuracy.\n",
+"Let's do that now. Since we're performing classification, we can use [accuracy](https://en.wikipedia.org/wiki/Accuracy_and_precision) — the fraction of correctly classified flowers — to quantify how well our model is performing. Our head of data has told us that we should achieve at least 90% accuracy.\n",
"\n",
">Did you understand the context for the question and the scientific or business application?\n",
"\n",
"We're building part of a data analysis pipeline for a smartphone app that will be able to classify the species of flowers from pictures taken on the smartphone. In the future, this pipeline will be connected to another pipeline that automatically measures from pictures the traits we're using to perform this classification.\n",
"\n",
">Did you record the experimental design?\n",
"\n",
-"Our head of data has told us that the field researchers are hand-measuring 100 randomly-sampled flowers of each species using a standardized methodology. The field researchers take pictures of each flower they sample from pre-defined angles so the measurements and species can be confirmed by the other field researchers at a later point. At the end of each day, the data is compiled and stored on a private company GitHub repository.\n",
+"Our head of data has told us that the field researchers are hand-measuring 50 randomly-sampled flowers of each species using a standardized methodology. The field researchers take pictures of each flower they sample from pre-defined angles so the measurements and species can be confirmed by the other field researchers at a later point. At the end of each day, the data is compiled and stored on a private company GitHub repository.\n",
"\n",
">Did you consider whether the question could be answered with the available data?\n",
"\n",
@@ -395,7 +395,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"Next, let's create a **scatter matrix**. Scatter matrices plot the distribution of each column along the diagonal, and then plot a scatter matrix for the combination of each variable. They make for an efficient tool to look for errors in our data.\n",
+"Next, let's create a **scatterplot matrix**. Scatterplot matrices plot the distribution of each column along the diagonal, and then plot a scatterplot matrix for the combination of each variable. They make for an efficient tool to look for errors in our data.\n",
"\n",
"We can even have the plotting package color each entry by its class to look for trends within the classes."
]
@@ -439,7 +439,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"From the scatter matrix, we can already see some issues with the data set:\n",
+"From the scatterplot matrix, we can already see some issues with the data set:\n",
"\n",
"1. There are five classes when there should only be three, meaning there were some coding errors.\n",
"\n",
@@ -1016,7 +1016,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"Now, let's take a look at the scatter matrix now that we've tidied the data."
+"Now, let's take a look at the scatterplot matrix now that we've tidied the data."
]
},
{
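Before re-plotting tidied data, it can help to confirm that only the expected classes remain, since the earlier hunk notes five classes where there should be three. The sketch below assumes hypothetical label strings and a hypothetical `unexpected_classes` helper; neither appears in the notebook.

```python
from collections import Counter

# Hypothetical set of valid class labels, for illustration only.
EXPECTED_CLASSES = {"species_a", "species_b", "species_c"}


def unexpected_classes(labels, expected=EXPECTED_CLASSES):
    """Return a count of every label that is not in the expected set."""
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if label not in expected}


# Two coding errors (a capitalization slip and a typo) yield five distinct labels.
labels = ["species_a", "species_b", "Species_b", "species_c", "specie_c", "species_c"]
print(unexpected_classes(labels))
```

An empty result would suggest the coding errors have been cleaned up; any remaining entries point at the labels that still need fixing.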
@@ -1147,7 +1147,7 @@
"\n",
"This is the stage where we plot all the data in as many ways as possible. Create many charts, but don't bother making them pretty — these charts are for internal use.\n",
"\n",
-"Let's return to that scatter matrix that we used earlier."
+"Let's return to that scatterplot matrix that we used earlier."
]
},
{
@@ -1371,9 +1371,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"Now we can start fitting models to our data. Our head of data is all about random forest classifiers, so let's start with one of those.\n",
+"With our data split, we can start fitting models to our data. Our head of data is all about random forest classifiers, so let's start with one of those.\n",
"\n",
-"There are several random forest classifier [parameters](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) that we can tune, but let's use a basic classifier with 10 decision trees."
+"There are several random forest classifier [parameters](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) that we can tune, but one major advantage of random forests is that we typically only need to tune the number of decision trees. For now, let's use a basic classifier with 10 decision trees."
]
},
{
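A basic random forest with 10 decision trees, as described in this hunk, might look like the sketch below. The data here is synthetic (three well-separated clusters of 50 samples with four made-up features), the split proportions are an assumption, and the import path follows current scikit-learn releases (`sklearn.model_selection`), which may differ from what the notebook used in 2015.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 3 classes, 50 samples each, 4 features (assumed).
rng = np.random.RandomState(0)
centers = np.array([[0, 0, 0, 0], [3, 3, 3, 3], [6, 6, 6, 6]], dtype=float)
features = np.vstack([c + rng.normal(scale=1.0, size=(50, 4)) for c in centers])
labels = np.repeat([0, 1, 2], 50)

# Hold out a quarter of the rows for testing, keeping class proportions equal.
train_X, test_X, train_y, test_y = train_test_split(
    features, labels, test_size=0.25, random_state=0, stratify=labels
)

# n_estimators sets the number of decision trees in the forest.
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(train_X, train_y)

# score() reports accuracy: the fraction of correctly classified test rows.
test_accuracy = forest.score(test_X, test_y)
meets_target = test_accuracy >= 0.90  # the 90% goal stated earlier
print(test_accuracy, meets_target)
```

Fixing `random_state` keeps the sketch reproducible; on real data, a single split like this gives only a rough estimate of how the model generalizes.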