# Connect Intensive - Machine Learning Nanodegree

## Week 1. Python Crash Course  

## Objectives    

- Jupyter notebook 
- Basic Python programming  
- Numpy
- Pandas 
- Data visualization with Matplotlib and Seaborn 

## Prerequisites   

 - You should have **Python 2.7** installed (if not, please [download and install Python 2.7](https://www.python.org/downloads/))
 - You should also install (and perhaps upgrade) the following packages, if you haven't already:
    - [numpy](http://www.numpy.org/)
    - [pandas](http://pandas.pydata.org/)
    - [matplotlib](http://matplotlib.org/)  
    - [seaborn](http://seaborn.pydata.org)  

---

## 3 | Pandas

One of the major skills you can bring to the table as a machine learnist is the ability to explore and understand a data set. The library **`pandas`** is a Python package developed by Wes McKinney that machine learnists use to quickly and efficiently navigate data sets. From [the pandas documentation](http://pandas.pydata.org/pandas-docs/stable/index.html):

> "**`pandas`** is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive."


Some basic topics we cover here include:
- Read data into a DataFrame
- Get summary information about the data
- Selection and indexing
- Conditional selection
- Others, e.g., creating and dropping columns

> **Fun fact:** the name "`pandas`" derives from **pan**el **da**ta, a term for multi-dimensional data sets! [(source)](http://www.dlr.de/sc/Portaldata/15/Resources/dokumente/pyhpc2011/submissions/pyhpc2011_submission_9.pdf)  

> **Additional reference:** Check out a Pandas cheat sheet [here](http://www.kdnuggets.com/2017/01/pandas-cheat-sheet.html).

In [None]:
# import numpy and pandas 
import numpy as np
import pandas as pd

In [None]:
# set maximum rows to display
pd.options.display.max_rows = 10 # default is 60

### Read Data Into DataFrame

- `pd.read_csv('data.csv')`
- `pd.read_excel('data.xlsx', sheet_name='Sheet1')` (older pandas versions spell this keyword `sheetname`)
- `pd.read_table()` reads a general delimited file into a DataFrame. It can also iterate over the file or read it in chunks instead of loading everything at once.
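
As a minimal sketch of reading delimited data and reading it in chunks, the example below parses a small in-memory string instead of a file. The rows in `csv_text` are made up for illustration and only mimic the column names used later in this notebook.

```python
from io import StringIO

import pandas as pd

# Hypothetical comma-separated data standing in for a file on disk
csv_text = (
    "sepal_l,sepal_w,type\n"
    "5.1,3.5,Iris-setosa\n"
    "4.9,3.0,Iris-setosa\n"
    "7.0,3.2,Iris-versicolor\n"
    "6.4,3.2,Iris-versicolor\n"
)

# Read everything at once
df = pd.read_csv(StringIO(csv_text))

# Read the same data two rows at a time; each chunk is itself a DataFrame
chunk_sizes = [len(chunk) for chunk in pd.read_csv(StringIO(csv_text), chunksize=2)]

print(df.shape)
print(chunk_sizes)
```

Chunked reading is mainly useful for files too large to fit comfortably in memory.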

**Run** the cell below to read the iris dataset. 

In [None]:
# read data
data = pd.read_table('./data/iris.txt', delimiter=',')

### Information of Data  

After reading the data file into a DataFrame, we can use built-in pandas DataFrame methods to get some information about the data.

**Run** the cells below to view the first 5 rows of the DataFrame, a concise summary of the DataFrame, and summary statistics of the data.

In [None]:
# View first 5 rows
# Note: use data.tail() to view the last 5 rows
# You can pass an integer to specify the number of rows to show, e.g., data.head(20)
data.head() 

In [None]:
# Get a concise summary of data
data.info()

In [None]:
# Generates descriptive statistics that summarize the central tendency, dispersion 
# and shape of a dataset's distribution, excluding missing values
data.describe()
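
To illustrate what "excluding missing values" means in practice, here is a small sketch on a made-up DataFrame (`toy` and its values are hypothetical, not part of the iris data).

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with one missing value in sepal_l
toy = pd.DataFrame({'sepal_l': [5.0, 6.0, np.nan, 7.0],
                    'sepal_w': [3.0, 3.0, 3.5, 2.5]})

summary = toy.describe()

# 'count' reports non-missing values only
print(summary.loc['count'])

# The mean of sepal_l is computed from the three values that are present
print(summary.loc['mean', 'sepal_l'])
```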

### Selection and Indexing   

There are various ways to grab data from a dataframe. 

**Grab by columns**

In [None]:
data['sepal_l']

Note, the result doesn't look like a DataFrame! That's because one-dimensional objects in `pandas` are `Series` objects. 

**Run** the cell below to check.

In [None]:
type(data['sepal_l'])

`Series` objects are displayed as columns, with the indices shown on the left and the values shown on the right. Below the `Series` object, we see the name of the `Series` object and the `dtype` or data type of the `Series` object. The `dtype` of a `Series` object is chosen to accommodate all data within the `Series`.
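
A short sketch of how the `dtype` adapts to the values it has to hold (the three `Series` below are made up for illustration):

```python
import pandas as pd

ints = pd.Series([1, 2, 3])                # integers only
mixed_numbers = pd.Series([1, 2.5, 3])     # integers and a float
mixed_types = pd.Series([1, 'two', 3.0])   # numbers and a string

print(ints.dtype)           # an integer dtype (int64 on most platforms)
print(mixed_numbers.dtype)  # float64, so that 2.5 can be represented
print(mixed_types.dtype)    # object, the catch-all for mixed Python objects
```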

In [None]:
# Not recommended: attribute access breaks for column names that contain
# spaces or that clash with existing DataFrame attributes and methods
data.sepal_l
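
The sketch below shows the kind of column names that cause trouble with attribute access. The column names `count` and `sepal length` are hypothetical and chosen only to trigger the problem.

```python
import pandas as pd

# Hypothetical columns: one clashes with a DataFrame method, one contains a space
toy = pd.DataFrame({'count': [10, 20], 'sepal length': [5.1, 4.9]})

# Attribute access returns the DataFrame.count method, not the column
print(callable(toy.count))

# Bracket notation always returns the column
print(toy['count'].tolist())
print(toy['sepal length'].tolist())
```

A name with a space, such as `sepal length`, cannot be written as an attribute at all, so bracket notation is the safer habit.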

In [None]:
# Pass a list of column names
data[['sepal_l', 'sepal_w']]

**Grab by rows**

On the leftmost edge of the `DataFrame`, we can see the index. Each row (instance, input) in the `DataFrame` has an index. To access a specific row based on the index, we can use `loc` or `iloc`. Label-based indexing is done with `loc`, while integer-position based indexing is done with `iloc`. For example, looking above, we can see that the first row in the `DataFrame` contains the iris type 'Iris-setosa'. Let's get the first row (index 0) using `loc`!

**Run** the cell below to get the first row of the `DataFrame` using `data.loc[0]`.

In [None]:
data.loc[0]

In [None]:
type(data.loc[0])

You will see that again, we get a `Series` object.  
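
With the default index, labels and positions coincide, so `loc[0]` and `iloc[0]` return the same row. The difference shows up once the index labels are something else. A minimal sketch, assuming a made-up DataFrame `toy` whose index is 10, 20, 30:

```python
import pandas as pd

# Hypothetical data whose index labels are not 0, 1, 2
toy = pd.DataFrame({'sepal_w': [3.5, 3.0, 3.2]}, index=[10, 20, 30])

by_label = toy.loc[10]      # row whose index label is 10
by_position = toy.iloc[0]   # row at integer position 0

print(by_label.equals(by_position))

# toy.loc[0] raises a KeyError here, because no row is labeled 0
try:
    toy.loc[0]
except KeyError:
    print('no row with label 0')
```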

Sometimes, we do not want to grab the entire row. For example, we might want only the sepal width `sepal_w` of the iris flower in the first row.

**Run** the cell below to see one way to get the `sepal_w` from the first row:

In [None]:
data.loc[0, 'sepal_w']

There are other ways to get the same thing. **Run** the cells below.

In [None]:
data.iloc[0, 1]

In [None]:
data.loc[0].loc['sepal_w']

In [None]:
data.iloc[0].iloc[1]

In [None]:
data['sepal_w'][0]

In [None]:
data['sepal_w'].loc[0]

We can also get multiple rows from the `DataFrame` by doing `numpy`-like slicing: `df.iloc[lower:upper]` will take a slice of the `DataFrame` object from the lower bound `lower` up to (but not including) the upper bound `upper`. Be careful! We get different results by slicing the `DataFrame` with `loc` and with `iloc`.

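When slicing a `DataFrame` using `iloc` (the *integer-based* position indexing) the lower bound is included, while the upper bound is excluded. When slicing with `loc` (the *label-based* indexing), both the lower and the upper bound are included.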

**Run** the cell below to get the first three rows of the `DataFrame` using `data.iloc[0:3]`

In [None]:
data.iloc[0:3]
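
To make the difference between the two kinds of slicing concrete, here is a small sketch on a made-up five-row DataFrame with the default index:

```python
import pandas as pd

toy = pd.DataFrame({'sepal_w': [3.5, 3.0, 3.2, 3.1, 3.6]})

positional = toy.iloc[0:3]  # positions 0, 1, 2 (upper bound excluded)
by_label = toy.loc[0:3]     # labels 0, 1, 2, 3 (upper bound included)

print(len(positional))
print(len(by_label))
```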

### Conditional Selection  

An important feature of pandas is conditional selection using bracket notation, very similar to numpy. For example, **run** the cell below to select rows with sepal length `sepal_l` greater than 7. 

In [None]:
data[data['sepal_l'] > 7]

What if we just want to see the `type` for those rows with `sepal_l` greater than 7? **Run** the following cell: 

In [None]:
data[data['sepal_l'] > 7]['type']
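
The same selection can also be written as a single `loc` call that takes the row condition and the column name together. This form is generally preferred over chaining two bracket lookups, particularly when you later want to assign values to the selection. A sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({'sepal_l': [5.1, 7.2, 7.7],
                    'type': ['Iris-setosa', 'Iris-virginica', 'Iris-virginica']})

# Two lookups chained together
chained = toy[toy['sepal_l'] > 7]['type']

# One lookup: rows chosen by the condition, column chosen by name
single_step = toy.loc[toy['sepal_l'] > 7, 'type']

print(single_step.tolist())
print(chained.equals(single_step))
```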

We can also combine multiple conditions. Use `|` (or) and `&` (and), and wrap each condition in parentheses. **Run** the cell below to find flowers with `sepal_l` greater than 7 and `sepal_w` smaller than 3.

In [None]:
data[(data['sepal_l'] > 7) & (data['sepal_w'] < 3)]
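
The parentheses matter because `&` and `|` bind more tightly than comparisons such as `>` in Python. One way to keep expressions readable is to name each condition first. The sketch below uses made-up data and also shows `|` (or) and `~` (not):

```python
import pandas as pd

toy = pd.DataFrame({'sepal_l': [5.1, 7.2, 7.7, 6.0],
                    'sepal_w': [3.5, 3.6, 2.8, 2.2]})

# Each condition is a boolean Series with one entry per row
long_sepal = toy['sepal_l'] > 7
narrow_sepal = toy['sepal_w'] < 3

both = toy[long_sepal & narrow_sepal]        # rows meeting both conditions
either = toy[long_sepal | narrow_sepal]      # rows meeting at least one
neither = toy[~(long_sepal | narrow_sepal)]  # rows meeting none

print(list(both.index), list(either.index), list(neither.index))
```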

### Others

**Create new columns**

In [None]:
# create a new column called 'delta_sepal' which is the difference between sepal_l and sepal_w
data['delta_sepal'] = data['sepal_l'] - data['sepal_w']
data.head()

**Removing columns** 

In [None]:
data.drop('delta_sepal', axis=1).head()

Note that this operation is not in place: `drop` returns a new `DataFrame` with the specified column removed and leaves `data` unchanged. Use `inplace=True` to modify `data` itself.

In [None]:
data.head() # delta_sepal should still be there

In [None]:
data.drop('delta_sepal', axis=1, inplace=True)

In [None]:
data.head() # delta_sepal should be gone
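
As an alternative to `inplace=True`, you can assign the result of `drop` back to the same name. A minimal sketch on a made-up DataFrame (`toy` and `trimmed` are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame({'sepal_l': [5.1, 4.9], 'sepal_w': [3.5, 3.0]})
toy['delta_sepal'] = toy['sepal_l'] - toy['sepal_w']

# drop returns a new DataFrame; toy itself still has the column
trimmed = toy.drop('delta_sepal', axis=1)
kept_in_original = 'delta_sepal' in toy.columns
print(kept_in_original)
print('delta_sepal' in trimmed.columns)

# Reassigning the result has the same effect as inplace=True
toy = toy.drop('delta_sepal', axis=1)
print(list(toy.columns))
```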