# Connect Intensive - Machine Learning Nanodegree
# Lesson 1: An intro to Statistical Analysis using `pandas`

## Objectives
  - Practice running Python from within a [Jupyter Notebook](http://jupyter.org/) (formerly known as the IPython Notebook).
  - Become familiar with importing useful modules and packages, *e.g.* `pandas`, `numpy`, `matplotlib.pyplot`.
  - Learn about the [`pandas` data structures](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#), including the `Series` and `DataFrame` objects.
  - Create a `DataFrame` object from data in a comma-separated values (csv) file using [`pandas.read_csv`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
  - [Index and select data](http://pandas.pydata.org/pandas-docs/stable/indexing.html) from `Series` and `DataFrame` objects using `loc` and `iloc`
  - Compute descriptive statistics on a `Series` or `DataFrame`, including the [`mean`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mean.html), the [`median`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.median.html), and the [`min`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.min.html) & [`max`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.max.html)
  - Explore a public data set found on [Kaggle](https://www.kaggle.com/)
  - Conduct some exploratory data analysis, and visualize trends in data using `matplotlib`
  


## Introduction
Exploring and understanding datasets is a crucial skill for any machine learning engineer. The library **`pandas`** is a Python package developed by Wes McKinney that machine learning engineers use to quickly and efficiently navigate data sets. From [the pandas documentation](http://pandas.pydata.org/pandas-docs/stable/index.html):

> "**`pandas`** is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive."

Fun fact: the name "`pandas`" derives from **pan**el **da**ta, a term for multi-dimensional data sets! [(source)](http://www.dlr.de/sc/Portaldata/15/Resources/dokumente/pyhpc2011/submissions/pyhpc2011_submission_9.pdf)

## There's plenty to learn!
Much of the information and code in this Jupyter Notebook may be new to you. There's no cause for concern: a wide array of documentation and references has been hyperlinked for you. Click on any of the hyperlinks to learn more -- the links should open in a new tab, so you won't lose your progress in the Jupyter Notebook. If you're still feeling uncertain about a topic, a quick web search is often the best place to start. For example, if Python throws an error I've never seen before, I'll often copy and paste the error right into a Google search and see what comes up!

## First things first: Import statements
In Jupyter Notebooks, importing modules and packages should be among the first tasks, because they provide the functionality that the rest of the code relies on. For example, in this lesson we import `pandas` so that we can manipulate data with the `DataFrame` object. We also want our Jupyter Notebook to create figures and plots directly within the notebook. To do this, we use the [magic function](http://ipython.readthedocs.io/en/stable/interactive/tutorial.html#magics-explained) `%matplotlib inline`. For more information on plotting within the IPython kernel, [check this out](http://ipython.readthedocs.io/en/stable/interactive/plotting.html)!

**Run** the cell below (**click** on the cell to highlight it, then press **shift + enter** or **shift + return** to run it) to import modules and libraries for this Jupyter Notebook.

In [None]:
'''
Importing libraries:

import numpy as np
import warnings
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
from IPython.display import display
from zipfile import ZipFile
import os.path
'''


%matplotlib inline


try:
    import numpy as np
    print("Successfully imported numpy! (Version {})".format(np.version.version))
except ImportError:
    print("Could not import numpy!")

    
try:
    import warnings
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        import matplotlib
        import matplotlib.pyplot as plt
    plt.style.use('ggplot')
    print("Successfully imported matplotlib.pyplot! (Version {})".format(matplotlib.__version__))
except ImportError:
    print("Could not import matplotlib.pyplot!")

    
try:
    import pandas as pd
    print("Successfully imported pandas! (Version {})".format(pd.__version__))
    pd.options.display.max_rows = 10
except ImportError:
    print("Could not import pandas!")

    
try:
    from IPython.display import display
    print("Successfully imported display from IPython.display!")
except ImportError:
    print("Could not import display from IPython.display")

    
try:
    from zipfile import ZipFile
    print("Successfully imported ZipFile from zipfile!")
except ImportError:
    print("Could not import ZipFile from zipfile")
    
try:
    import os.path
    print("Successfully imported os.path!")
except ImportError:
    print("Could not import os.path")

## Extracting from a ZIP archive
Let's create our first `DataFrame` using `pandas`! You can learn more about the `DataFrame` object from [the pandas documentation on DataFrames](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html). The dataset that we are going to explore and manipulate was obtained from [Kaggle datasets](https://www.kaggle.com/datasets). The user Cathie So crawled the [Kickstarter](https://www.kickstarter.com/) website to retrieve data on the 4000 most backed projects on the site, and put the [project data on Kaggle](https://www.kaggle.com/socathie/kickstarter-project-statistics). The csv file `'most_backed.csv'`, which contains the project data, is zipped within the archive `'kickstarter.zip'`.

**Run** the cell below to extract the csv file from the ZIP archive using [the zipfile module](https://docs.python.org/2/library/zipfile.html).

In [None]:
# The dataset is compressed within the zip file 'kickstarter.zip' in this directory
zip_file_name = 'kickstarter.zip'

# Create a ZipFile object using the zip file name
zf = ZipFile(file=zip_file_name)

# Within the zip file is a comma-separated values (csv) file containing the project data named 'most_backed.csv'
in_file_name = 'most_backed.csv'

# Extract the project data into the current directory from the zip file 
zf.extract(member=in_file_name)

# Close the ZipFile object -- we won't need it any more
zf.close()

# Print a success message if the csv file was extracted from the zip file
if os.path.isfile(in_file_name):
    print("The file {} has been extracted!".format(in_file_name))
else:
    print("Could not extract the file {}".format(in_file_name))
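The cell above closes the archive by hand with `zf.close()`. As an alternative, `ZipFile` can be used in a `with` block, which closes the archive automatically. The sketch below is a minimal, self-contained illustration: it builds a throwaway archive in a temporary directory, and the names `demo.zip` and `demo.csv` are made up for the example (they are not part of the lesson's data).

```python
import os
import tempfile
from zipfile import ZipFile

# Work in a temporary directory so nothing in the lesson folder is touched
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, "demo.zip")  # hypothetical archive name

# Build a tiny archive containing one csv file (hypothetical content)
with ZipFile(zip_path, mode="w") as zf:
    zf.writestr("demo.csv", "title,num_backers\nA,10\nB,20\n")

# Reopen the archive; the with-block closes it even if an error occurs
with ZipFile(zip_path) as zf:
    member_names = zf.namelist()                       # files inside the archive
    extracted_path = zf.extract("demo.csv", path=tmp_dir)

print(member_names)
print(os.path.isfile(extracted_path))
```

Listing the members with `namelist()` before extracting is a handy way to check what an unfamiliar archive contains.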

## Our first `DataFrame`!
Now that we've extracted the file `'most_backed.csv'` from the ZIP archive containing the project data, we can create a `DataFrame` object from the csv file. The first line of the csv file contains the **feature** or **attribute** names, while each subsequent line in the file describes one **instance** or **input** of the data.

**Run** the cell below to use `pd.read_csv` to read the csv into a `DataFrame` object that we will call `df`. Then, the first 5 lines of the `DataFrame` will be displayed using `df.head(5)`. For more information on these methods, the documentation for [`read_csv`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) and [`head`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html) is a good place to start!

In [None]:
df = pd.read_csv(in_file_name)
display(df.head(5))
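To see how `read_csv` treats the header line and the data lines without needing a file on disk, here is a small sketch that reads csv text from memory. The three rows are invented for illustration and do not come from the Kickstarter data.

```python
import io
import pandas as pd

# Hypothetical csv content: the first line holds the feature names,
# and each subsequent line describes one instance
csv_text = (
    "title,category,num_backers\n"
    "Project A,Gadgets,120\n"
    "Project B,Wearables,45\n"
    "Project C,Gadgets,300\n"
)

# read_csv accepts a file path or any file-like object
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.columns.tolist())  # feature names taken from the header line
print(toy_df.shape)             # (number of rows, number of columns)
print(toy_df.head(2))           # first two instances
```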

## Removing non-useful features
Our `DataFrame` contains a lot of information for each project:


- Unnamed: 0: A number indicating which row the instance is on. 
- amt_pledged: The amount of funding the project received from backers. 
- blurb: A short description of the project. 
- by: The person or group who submitted the project. 
- category: The Kickstarter designated category the project belongs to. 
- currency: The currency that the project was funded in. 
- goal: The initial funding amount being requested for the project. 
- location: The geographic location of the submitter. 
- num_backers: The number of people who backed the project by pledging money. 
- num_backers.tier: The number of people who pledged money in each pledge tier (see below). 
- pledge_tier: Designated funding amount categories in which backers can pledge (in dollars). 
- title: The title of the project. 
- url: The URL of the project. 



However, the first column of the `DataFrame`, the column labeled `'Unnamed: 0'`, appears to be redundant for our purposes. It seems to simply repeat the [index of the instance](http://pandas.pydata.org/pandas-docs/stable/indexing.html). It is not necessary to keep `'Unnamed: 0'`, because the same information is available from `df.index.values`. So let's delete the first column:

**Run** the cell below to delete the column `'Unnamed: 0'` from the `DataFrame` object `df`. Then, display the first few lines of the `DataFrame` to see what it looks like without `'Unnamed: 0'`.


In [None]:
if 'Unnamed: 0' in df.columns:
    del df['Unnamed: 0']
    print("Deleted the column 'Unnamed: 0' from df!")
else:
    print("The column 'Unnamed: 0' has already been deleted!")
display(df.head(5))
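Deleting with `del` is one option. Two other common approaches are sketched below on a tiny made-up csv: `drop(columns=...)`, which returns a new `DataFrame` and leaves the original untouched, and the `index_col=0` argument of `read_csv`, which uses the first column as the index at load time. The csv text here is hypothetical.

```python
import io
import pandas as pd

# Hypothetical csv whose first column has an empty header, as happens
# when a DataFrame is saved together with its index
csv_text = ",title,num_backers\n0,Project A,120\n1,Project B,45\n"

# Default read: pandas labels the unnamed first column 'Unnamed: 0'
raw_df = pd.read_csv(io.StringIO(csv_text))

# Option 1: drop() returns a new DataFrame without the column
dropped_df = raw_df.drop(columns=["Unnamed: 0"])

# Option 2: treat the first column as the index while reading
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(raw_df.columns.tolist())
print(dropped_df.columns.tolist())
print(indexed_df.index.tolist())
```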

## Indexing `DataFrame` and `Series` objects
On the leftmost edge of the `DataFrame`, we can see the index. Each row (instance, input) in the `DataFrame` has an index. To access a specific row based on the index, we can use `loc` or `iloc`. Label-based indexing is done with `loc`, while integer-position based indexing is done with `iloc`. For example, looking above, we can see that the first row in the `DataFrame` contains the `'category'` Tabletop Games. Let's get the first row (index 0) using `loc`!

**Run** the cell below to get the first row of the `DataFrame` using `df.loc[0]`. What does the result look like?

In [None]:
df.loc[0]

The result doesn't look like a DataFrame! That's because one-dimensional objects in `pandas` are `Series` objects. `Series` objects are displayed as columns, with the indices shown on the left and the values shown on the right. Below the `Series` object, we see the name of the `Series` object and the `dtype` or data type of the `Series` object. The `dtype` of a `Series` object is chosen to accommodate all data within the `Series`.
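To make the point about `dtype` concrete, here is a small sketch on an invented three-column `DataFrame` (the values are not from the Kickstarter data). A row that mixes text and numbers can only be stored as `object`, while a row that mixes integers and floats is stored as `float64`.

```python
import pandas as pd

# Hypothetical data: one text feature, one float feature, one integer feature
toy_df = pd.DataFrame({
    "title": ["Project A", "Project B"],
    "goal": [1000.0, 2500.0],
    "num_backers": [120, 45],
})

# A row containing text and numbers: the Series falls back to dtype object
mixed_row = toy_df.loc[0]

# A row containing only numbers: the integer is upcast to float
numeric_row = toy_df[["goal", "num_backers"]].loc[0]

print(mixed_row.dtype)
print(numeric_row.dtype)
```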

What if we don't want the entire first row of the DataFrame, but just the `amt_pledged`?

**Run** the cell below to see one way to get the `amt_pledged` from the first row:

In [None]:
df.loc[0,'amt_pledged']

`'8782571'` should have been displayed. There are a variety of different methods for retrieving `amt_pledged` from the first row of the `DataFrame`.

**Run** the cells below to see many other ways to get the same `amt_pledged` from the first row of the `DataFrame`. Do you understand how each line works?

In [None]:
df.iloc[0,0]

In [None]:
df.loc[0].loc['amt_pledged']

In [None]:
df['amt_pledged'][0]

In [None]:
df.iloc[0].iloc[0]

In [None]:
df['amt_pledged'].loc[0]

We are also able to retrieve multiple rows from the `DataFrame` by doing `numpy`-like slicing: `df.iloc[lower:upper]` will take a slice of the `DataFrame` object from the lower bound `lower` up to (but not including) the upper bound `upper`.

It's important to note that we get different results by slicing the `DataFrame` with `loc` and with `iloc`.

When slicing a `DataFrame` using `iloc` (the *integer-based* position indexing) the lower bound is included, while the upper bound is excluded.

**Run** the cell below to get the first three rows of the `DataFrame` using `df.iloc[0:3]`

In [None]:
df.iloc[0:3]

The lower bound (0) is *included*, while the upper bound (3) is *excluded*.

However, when slicing a `DataFrame` using `loc` (the *label-based* indexing), the lower and upper bounds are **both** included!

**Run** the cell below to see what happens when we call `df.loc[0:3]`

In [None]:
df.loc[0:3]

Note that both the lower bound (0) and the upper bound (3) are *included* with *label-based* indexing.
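The difference between labels and positions is easier to see when the index does not start at 0. The sketch below uses an invented `DataFrame` whose index labels run from 100 to 104, so a label can no longer be mistaken for a position.

```python
import pandas as pd

# Hypothetical data with index labels 100..104 (positions are still 0..4)
toy_df = pd.DataFrame(
    {"num_backers": [10, 20, 30, 40, 50]},
    index=[100, 101, 102, 103, 104],
)

# Position-based: positions 0, 1, 2 (upper bound excluded)
by_position = toy_df.iloc[0:3]

# Label-based: labels 100, 101, 102, 103 (upper bound included)
by_label = toy_df.loc[100:103]

print(by_position.index.tolist())
print(by_label.index.tolist())
```

With the default index (0, 1, 2, ...) labels and positions coincide, which is why `df.loc[0:3]` returned exactly one more row than `df.iloc[0:3]`.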

# Exercises - Part 1
Now it's time to use your `pandas` knowledge to answer questions about the most_backed data set. If you're uncertain how to do something, please feel free to ask questions, refer to the code examples above or review the `pandas` documentation. 

For this section, the `pandas` documentation on [indexing and slicing](http://pandas.pydata.org/pandas-docs/stable/indexing.html) will be very helpful.

## Question 1 
What is the `'location'` of the project in the row with index 250? 

## Question 2
What is the `'category'` of the project in the row with index 1000? 

## Question 3
Display rows 100 up to (but not including) 105.

## Question 4
What does `df.iloc[0:100:10]` return? How would you describe integer-based slicing when there are three numbers in the square brackets?

## Question 5
What does `df.iloc[:8]` return? How would you describe integer-based slicing when no lower bound is specified?

## Question 6
What does `df.iloc[-5:]` return? How would you describe integer-based slicing with a negative lower bound and no upper bound?

## How much data do we have?
We are able to slice the `DataFrame` object using `loc` and `iloc`, but it may not be readily apparent how much data is actually in the dataset from using these methods. 

The size of the dataset can be determined by calling the built-in function `len(df)`, which counts the number of rows in the `DataFrame` object.

**Run** the cell below to count the number of rows in the `DataFrame` object.

In [None]:
print("There are {} rows in the DataFrame".format(len(df)))

[The method `df.count()`](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.count.html) returns more information than `len(df)`. 

For each feature (column) in the `DataFrame` object, `df.count()` will count the number of non-NA/null values for that feature. This is useful to get an idea of how many missing values are in your `DataFrame`, if any.

**Run** the cell below to count the number of non-NA/null values for each feature using `df.count()`. Is there a feature with missing values?

In [None]:
display(df.count())
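Subtracting the result of `count()` from `len()` gives the number of missing values per feature. The sketch below shows this on an invented `DataFrame` with one missing location; the data are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in 'location'
toy_df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "location": ["San Jose, CA", np.nan, "Palo Alto, CA"],
})

n_rows = len(toy_df)          # total number of rows
non_null = toy_df.count()     # non-NA/null values per feature
n_missing = n_rows - non_null # missing values per feature

print(n_rows)
print(non_null)
print(n_missing)
```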

## More about the `Series` and `DataFrame` objects
**Note:** This section assumes that you are familiar with basic data structures, in particular Python dictionaries. If you need to brush up on this aspect of Python, check out the [Python tutorial on data structures](https://docs.python.org/3/tutorial/datastructures.html)!

The `DataFrame` object is a two-dimensional labeled data structure. As we saw above, each row is labeled with an index and each column is labeled with a feature name. Recall that if we take a single row from the DataFrame, it becomes a `Series` object. Similarly, if we take a single column of the `DataFrame` object, we would get a `Series` object. Thus, we can think of the `DataFrame` as a Python dictionary. The key-value pairs for this dictionary are the feature names and the `Series` objects. You can retrieve a single `Series` object from a `DataFrame` the same way you would retrieve a value from a dictionary: `df['title']` would retrieve the column `'title'` from `df` as a `Series` object. Let's try it!

**Run** the cell below to print the first ten lines of the `Series` object containing all of the project titles:

In [None]:
display(df['title'].head(10))

The `Name` and `dtype` (data type) of the `Series` object are also displayed. Notice that `title` has a data type of `object` -- we'll learn more about that later in the project.

We can access more than one feature of a `DataFrame` object with a list of keys rather than a single key.

**Run** the cell below to print the first ten lines of just the `'title'` and `'amt_pledged'` features of the `DataFrame`.

In [None]:
display(df[['title','amt_pledged']].head(10))

We accessed two features from the `DataFrame` object, and got a new `DataFrame` object as a result! That's a neat feature of `pandas`: one-dimensional labeled data structures become `Series` objects, while two-dimensional labeled data structures remain as `DataFrame` objects!
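One detail worth checking for yourself: what decides the return type is whether the key is a single name or a list, not how many names the list holds. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical data, not taken from the Kickstarter file
toy_df = pd.DataFrame({
    "title": ["Project A", "Project B"],
    "amt_pledged": [5000.0, 12000.0],
    "num_backers": [120, 45],
})

one_feature = toy_df["title"]                      # single key -> Series
two_features = toy_df[["title", "amt_pledged"]]    # list of keys -> DataFrame
one_feature_frame = toy_df[["title"]]              # one-item list -> DataFrame

print(type(one_feature).__name__)
print(type(two_features).__name__)
print(type(one_feature_frame).__name__)
```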


## Descriptive statistics on a `Series` object

Now that we are more familiar with the `Series` and `DataFrame` objects in `pandas`, let's explore the advantages of keeping our data within these structures. Let's analyze an important quantitative feature in our data set: the `'num_backers'` column. Let's learn more about this feature by calculating some of its descriptive statistics!

### Computing the mean

The mean, or average, is one measure of central tendency for a data set. The mean of a set of values is the sum of all values, divided by the number of values. 

We can compute the mean of the `Series` object `df['num_backers']` using [the `sum()` method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sum.html) to add up all the values, and then divide by the length of the `Series` object.

**Run** the cell below to store the `Series` object `df['num_backers']` into the variable `backers`, and then compute the mean.

In [None]:
# Manual computation 

backers = df['num_backers']
# Divide by a float so the result is not truncated under Python 2 integer division
print("The mean number of backers is {:.2f}".format(backers.sum() / float(len(backers))))

**Run** the cell below to compute the mean of `backers` using the `mean()` method. Do we find the same result?

In [None]:
# Using the mean() method 

print("The mean number of backers is {:.2f}".format(backers.mean()))

### Computing the median
The median is another measure of central tendency for a data set. When the data is arranged from smallest to largest, the median is the value in the middle of the ordered data set. 

- With an *odd* number of values in the data set $(2n+1)$, one value will be directly in the middle, with $n$ values above it and $n$ values below it. 

- With an *even* number of values in the data set $(2n)$, *two* values will be in the middle, with $n-1$ values above them and $n-1$ values below them: in this case, the median is actually the average of the two values in the middle.

**Run** the cell below to compute the median by sorting `backers` in ascending order with the method [sort_values()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sort_values.html#pandas.Series.sort_values) and then finding the value(s) right in the middle of the data set with [the iloc indexer](http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.DataFrame.iloc.html).

In [None]:
backers = backers.sort_values(ascending=True)
n = len(backers)
if n % 2 == 1:
    median = backers.iloc[(n-1)//2]
else:
    median = (backers.iloc[n//2 - 1] + backers.iloc[n//2]) / 2.0
print("The median of the number of backers is {:.2f}".format(median))


**Run** the cell below to compute the median of `backers` using the [`median()` method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.median.html). Do we get the same result?

In [None]:
print("The median number of backers is {:.2f}".format(backers.median()))
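To check the odd and even cases described above on numbers small enough to verify by hand, the manual computation can be wrapped in a helper. The function name `manual_median` and the two short `Series` are made up for this sketch.

```python
import pandas as pd

def manual_median(values):
    """Median of a Series, computed by sorting and picking the middle value(s)."""
    ordered = values.sort_values(ascending=True)
    n = len(ordered)
    if n % 2 == 1:
        # Odd count: a single value sits in the middle
        return float(ordered.iloc[(n - 1) // 2])
    # Even count: average the two middle values
    return (ordered.iloc[n // 2 - 1] + ordered.iloc[n // 2]) / 2.0

odd_values = pd.Series([7, 1, 5])       # sorted: 1, 5, 7
even_values = pd.Series([7, 1, 5, 3])   # sorted: 1, 3, 5, 7

print(manual_median(odd_values), odd_values.median())    # both 5.0
print(manual_median(even_values), even_values.median())  # both 4.0
```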

### Finding the maximum and minimum values
The maximum and the minimum values within a data set give us an idea of the range for that feature. 

Because we sorted `backers` in an earlier cell, the smallest (minimum) value within `backers` is at the start of the `Series` object, while the largest (maximum) value within `backers` is at the end of the `Series` object. Recall that the indexing convention in Python starts with 0 as the first index, and we can use the index -1 to retrieve the last item.

**Run** the cell below to print the minimum and maximum values from the sorted `backers` object.

In [None]:
print("Minimum # of backers:{:.1f} \nMaximum # of backers:{:.1f}".format(backers.iloc[0],backers.iloc[-1]))


**Run** the cell below to print the minimum and maximum values using the [`min()` method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.min.html) and the [`max()` method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.max.html). 

In [None]:
print("Minimum # of backers:{:.1f} \nMaximum # of backers:{:.1f}".format(backers.min(),backers.max()))

## Selecting instances
There may be situations in which it is advantageous to select specific instances in the `DataFrame` prior to processing the data. To select specific instances, we can use [Boolean indexing](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing).

Before we explore Boolean indexing, we should see what happens when we compute a logical expression using a `Series` object. For example, let's see what happens when we check whether the `'location'` feature is equal to `'Los Angeles, CA'`.

**Run** the cell below to compute the logical expression `df['location'] == 'Los Angeles, CA'`. What is the result?

In [None]:
df['location'] == 'Los Angeles, CA'



The logical expression above returns a `Series` object of dtype `bool`, with a value of `True` at the indices where `df['location']` is equal to `'Los Angeles, CA'`, and `False` at the indices where `df['location']` is not equal to `'Los Angeles, CA'`. 

**Run** the cell below to see what happens when we index a `DataFrame` object using this `Series` object.

In [None]:
df[df['location'] == 'Los Angeles, CA']


Boolean indexing returns a `DataFrame` that keeps only the rows where the logical expression was `True`. So we are able to easily select instances using a logical expression. 
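Here is the same idea on three invented rows, small enough to follow by eye. The sketch also shows two things that follow directly from the mask being a `Series` of `bool`: summing it counts the `True` values, and the `~` operator inverts it.

```python
import pandas as pd

# Hypothetical data, not taken from the Kickstarter file
toy_df = pd.DataFrame({
    "location": ["Los Angeles, CA", "San Jose, CA", "Los Angeles, CA"],
    "num_backers": [100, 200, 300],
})

is_la = toy_df["location"] == "Los Angeles, CA"  # Series of True/False
la_df = toy_df[is_la]                            # rows where the mask is True
not_la_df = toy_df[~is_la]                       # rows where the mask is False

print(is_la.tolist())
print(int(is_la.sum()))  # True counts as 1, so this counts matching rows
print(la_df.index.tolist(), not_la_df.index.tolist())
```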

We can use other methods to return `Series` objects of dtype `bool`. For example, we will use [the method `Series.str.contains()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html) below to return all the projects that contain the string `'smart'` in their blurbs.

**Run** the cell below to select only the instances where the blurb contains the string `'smart'` and display the first few `'smart'` projects.

In [None]:
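# na=False treats a missing blurb as "does not contain", so the mask has no missing values
smart_df = df[df['blurb'].str.contains('smart', na=False)]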
display(smart_df.head(5))
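Two details of `str.contains()` are easy to miss: the match is case-sensitive by default, and a missing value produces a missing result unless the `na` argument says otherwise. The sketch below illustrates both on four invented blurbs.

```python
import pandas as pd

# Hypothetical blurbs, including one missing value
toy_df = pd.DataFrame({
    "blurb": ["A smart watch", "Smart home hub", None, "A board game"],
})

# Case-sensitive (the default): 'Smart' does not match 'smart'
strict_mask = toy_df["blurb"].str.contains("smart", na=False)

# Case-insensitive: both 'smart' and 'Smart' match
loose_mask = toy_df["blurb"].str.contains("smart", case=False, na=False)

print(strict_mask.tolist())
print(loose_mask.tolist())
```

Whether a case-insensitive match is appropriate depends on the question being asked; the lesson's cell uses the case-sensitive default.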

Now that we've selected the instances where the `blurb` feature contains the `'smart'` string, we can check to see if smart projects have more backers than the projects in the dataset as a whole.

**Run** the block of code below to display the mean number of backers for the smart projects and compare it to the overall mean. 

In [None]:
print("Mean # of backers for smart projects: {:.2f} \nMean # of backers for all projects: {:.2f}".format(
    smart_df['num_backers'].mean(), df['num_backers'].mean()))



**Run** the block of code below to see the mean number of backers in San Jose, CA. 

In [None]:
df_san_jose = df[df['location']=='San Jose, CA']
print("The mean # of backers of the {} projects located in San Jose, CA  is {:.2f}.".format(len(df_san_jose),df_san_jose['num_backers'].mean()))

We can use [the `isin()` method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.isin.html) to see if the value of a feature is equal to any of the values within a list. 

This is useful if we want to select items that belong to a category that is not necessarily numerical or lexical. For example, if we want to consider projects utilizing sensor-based technology, we might want to select only the instances from `df` where the `'category'` feature is `'Wearables'`, `'DIY Electronics'`, or `'Gadgets'`.



In [None]:
sensor_tech_df = df[df['category'].isin(['Wearables', 'DIY Electronics', 'Gadgets'])]
print("The mean # of backers of the {} sensor-based technology projects is {:.2f}.".format(len(sensor_tech_df),sensor_tech_df['num_backers'].mean()))
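Put differently, `isin()` is shorthand for chaining several equality tests with the "or" operator `|`. The sketch below checks that equivalence on four invented rows.

```python
import pandas as pd

# Hypothetical data, not taken from the Kickstarter file
toy_df = pd.DataFrame({
    "category": ["Wearables", "Tabletop Games", "Gadgets", "Video Games"],
    "num_backers": [100, 900, 300, 700],
})

sensor_categories = ["Wearables", "DIY Electronics", "Gadgets"]
cat = toy_df["category"]

# One call to isin() ...
mask = cat.isin(sensor_categories)

# ... gives the same mask as three equality tests joined with |
equivalent = (cat == "Wearables") | (cat == "DIY Electronics") | (cat == "Gadgets")

print(mask.tolist())
print(mask.equals(equivalent))
print(toy_df[mask]["num_backers"].mean())  # mean of 100 and 300
```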

## Describing and grouping
There are a few other `pandas` techniques that will help you with your initial data exploration. The first is [the `describe()` method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html), which provides a descriptive summary of all numerical features within a `DataFrame` object.

**Run** the cell below to see what happens when you call `df.describe()`

In [None]:
display(df.describe())

`describe()` returns the count, the mean, the standard deviation, the minimum, the 25th, 50th, and 75th percentiles, and the maximum for features with numeric dtypes. 
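The result of `describe()` is itself a `DataFrame`, so individual statistics can be pulled out with `loc`. A minimal sketch on invented data, where the text feature `'title'` is left out of the summary because it is not numeric:

```python
import pandas as pd

# Hypothetical data: one text feature and one numeric feature
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "num_backers": [10, 20, 30, 40, 50],
})

summary = toy_df.describe()

print(summary.columns.tolist())            # only the numeric feature
print(summary.loc["mean", "num_backers"])  # 30.0
print(summary.loc["50%", "num_backers"])   # 30.0, the median
```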

It is also valuable to view descriptive statistics for each of the groups in a feature. This can be done by using [the `groupby()` method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html). We can combine the `groupby()` method with descriptive statistics methods, such as `min()`, `max()`, `mean()`, and `median()`, to apply the function to each group. 

Let's use `groupby()` to see the mean number of backers for each `category`.

**Run** the code below to try your first `groupby()` command!

In [None]:
df.groupby('category')['num_backers'].mean()
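To compute several statistics per group in one step, the `agg()` method accepts a list of function names. The sketch below uses four invented rows so the group results can be verified by hand.

```python
import pandas as pd

# Hypothetical data: two categories with two projects each
toy_df = pd.DataFrame({
    "category": ["Gadgets", "Gadgets", "Wearables", "Wearables"],
    "num_backers": [100, 300, 50, 150],
})

grouped = toy_df.groupby("category")["num_backers"]

# One statistic per group: a Series indexed by category
mean_by_category = grouped.mean()

# Several statistics per group: a DataFrame with one column per statistic
stats_by_category = grouped.agg(["min", "max", "mean", "median"])

print(mean_by_category)
print(stats_by_category)
```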



**Run** the cell below to use `groupby()` to plot the results for only a few specific cities:

In [None]:
bay_area = ['Palo Alto, CA','San Francisco, CA','San Jose, CA']
ax = df.groupby('location')['num_backers'].mean()[bay_area].plot(kind="bar")
xlabel = ax.set_xlabel("City")
ylabel = ax.set_ylabel("Average # of backers")
ylim = ax.set_ylim([0,10000])