# Connect Intensive - Machine Learning Nanodegree

## Week 1. Python Crash Course  

## Objectives    

- Jupyter notebook 
- Basic Python programming  
- Numpy
- Pandas 
- Data visualization with Matplotlib and Seaborn 

## Prerequisites   

 - You should have **Python 2.7** installed (if not, please [download and install Python 2.7](https://www.python.org/downloads/))
 - You should also install (and perhaps upgrade) the following packages, if you haven't already:
    - [numpy](http://www.numpy.org/)
    - [pandas](http://pandas.pydata.org/)
    - [matplotlib](http://matplotlib.org/)  
    - [seaborn](http://seaborn.pydata.org)  

---

## 3 | Pandas

One of the major skills you can bring to the table as a machine learnist is the ability to explore and understand a data set. The library **`pandas`** is a Python package developed by Wes McKinney that machine learnists use to quickly and efficiently navigate data sets. From [the pandas documentation](http://pandas.pydata.org/pandas-docs/stable/index.html):

> "**`pandas`** is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive."


Some basic topics we cover here include:
- Reading data into a DataFrame
- Getting summary information about the data
- Selection and indexing
- Conditional selection
- Others, e.g., creating new columns and dropping columns

> **Fun fact:** the name "`pandas`" derives from **pan**el **da**ta, a term for multi-dimensional data sets! [(source)](http://www.dlr.de/sc/Portaldata/15/Resources/dokumente/pyhpc2011/submissions/pyhpc2011_submission_9.pdf)  

> **Additional reference:** Check out a Pandas cheat sheet [here](http://www.kdnuggets.com/2017/01/pandas-cheat-sheet.html).

In [30]:
# import numpy and pandas 
import numpy as np
import pandas as pd

In [31]:
# set maximum rows to display
pd.options.display.max_rows = 10 # default is 60

### Read Data Into DataFrame

- `pd.read_csv('data.csv')`
- `pd.read_excel('data.xlsx', sheet_name='Sheet1')` (older pandas versions spelled this parameter `sheetname`)
- `pd.read_table()`: reads a general delimited file into a DataFrame (tab-delimited by default) and optionally supports iterating over the file or breaking it into chunks.
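
As a minimal sketch of the readers listed above, the snippet below parses CSV text held in memory and then reads the same text in chunks. The variable `csv_text` and its three rows are made up for illustration; in practice you would pass a file path instead.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV text standing in for a file on disk
csv_text = (
    u"sepal_l,sepal_w,type\n"
    u"5.1,3.5,Iris-setosa\n"
    u"4.9,3.0,Iris-setosa\n"
    u"7.1,3.0,Iris-virginica\n"
)

# read_csv accepts a path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# With chunksize set, read_csv returns an iterator of DataFrames
chunk_sizes = [len(chunk) for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)]

print(df.shape)
print(chunk_sizes)
```

Reading in chunks is mainly useful when a file is too large to load comfortably in one go.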

**Run** the cell below to read the iris dataset. 

In [32]:
# read data
data = pd.read_table('./data/iris.txt', delimiter=',')

In [33]:
type(data)

pandas.core.frame.DataFrame

### Information of Data  

After reading the data file into a DataFrame, we can use built-in pandas DataFrame methods to get some information about the data.

**Run** the cells below to view the first 5 rows of the dataframe, the information of the dataframe, and summary statistics of the data. 

In [34]:
# View first 5 rows
# Note: use data.tail() to view the last 5 rows
# You can pass an integer to specify the number of rows to show, e.g., data.head(20)
data.head() 

Unnamed: 0,sepal_l,sepal_w,petal_l,petal_w,type
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [35]:
# Get a concise summary of data
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal_l    150 non-null float64
sepal_w    150 non-null float64
petal_l    150 non-null float64
petal_w    150 non-null float64
type       150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB


In [36]:
# Generates descriptive statistics that summarize the central tendency, dispersion 
# and shape of a dataset's distribution, excluding missing values
data.describe()

Unnamed: 0,sepal_l,sepal_w,petal_l,petal_w
count,150.0,150.0,150.0,150.0
mean,5.843333,3.054,3.758667,1.198667
std,0.828066,0.433594,1.76442,0.763161
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5
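
A few other summary tools complement `head()`, `info()`, and `describe()`. The sketch below uses a small made-up DataFrame (not the iris file) to show `shape`, `value_counts()`, and `describe(include='all')`, which also summarizes non-numeric columns.

```python
import pandas as pd

# Small illustrative DataFrame; the values are made up for this sketch
df = pd.DataFrame({
    'sepal_l': [5.1, 4.9, 7.1, 6.3],
    'type': ['Iris-setosa', 'Iris-setosa', 'Iris-virginica', 'Iris-virginica'],
})

n_rows, n_cols = df.shape               # (number of rows, number of columns)
counts = df['type'].value_counts()      # number of rows per category
stats_all = df.describe(include='all')  # adds unique/top/freq rows for non-numeric columns

print(n_rows, n_cols)
print(counts)
print(stats_all)
```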


### Selection and Indexing   

There are various ways to grab data from a dataframe. 

**Grab by columns**

In [37]:
data['sepal_l']

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sepal_l, Length: 150, dtype: float64

Note, the result doesn't look like a DataFrame! That's because one-dimensional objects in `pandas` are `Series` objects. 

**Run** the cell below to check the type.

In [None]:
type(data['sepal_l'])

`Series` objects are displayed as columns, with the indices shown on the left and the values shown on the right. Below the `Series` object, we see the name of the `Series` object and the `dtype` or data type of the `Series` object. The `dtype` of a `Series` object is chosen to accommodate all data within the `Series`.
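
To see how the `dtype` adapts to its contents, here is a small sketch with three made-up `Series` objects: integers only, integers mixed with a float, and numbers mixed with a string.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])               # integers only
mixed_numbers = pd.Series([1, 2.5, 3])    # integers and a float
mixed_types = pd.Series([1, 'two', 3.0])  # numbers and a string

print(ints.dtype)           # an integer dtype
print(mixed_numbers.dtype)  # upcast to a float dtype
print(mixed_types.dtype)    # falls back to the generic object dtype
```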

In [None]:
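# Attribute access also works, but it is not recommended: it breaks for column
# names that contain spaces or that clash with existing DataFrame attributes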
data.sepal_l
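
The sketch below illustrates why bracket notation is the safer choice. The column names `'sepal l'` and `'mean'` are hypothetical and chosen only to trigger the two problems: a name with a space cannot be written after a dot, and a name that matches a DataFrame method is shadowed by that method.

```python
import pandas as pd

# Hypothetical column names chosen to show the pitfalls
df = pd.DataFrame({'sepal l': [5.1, 4.9], 'mean': [1.0, 2.0]})

col = df['sepal l']            # brackets handle the space; `df.sepal l` is a syntax error
is_method = callable(df.mean)  # df.mean is the DataFrame method, not the column
mean_col = df['mean']          # brackets always return the column

print(is_method)
print(list(mean_col))
```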

In [42]:
# Pass a list of column names
data[['sepal_l', 'sepal_w']]

Unnamed: 0,sepal_l,sepal_w
0,5.1,3.5
1,4.9,3.0
2,4.7,3.2
3,4.6,3.1
4,5.0,3.6
...,...,...
145,6.7,3.0
146,6.3,2.5
147,6.5,3.0
148,6.2,3.4
149,5.9,3.0


**Grab by rows**

On the leftmost edge of the `DataFrame`, we can see the index. Each row (instance, input) in the `DataFrame` has an index. To access a specific row based on the index, we can use `loc` or `iloc`. Label-based indexing is done with `loc`, while integer-position based indexing is done with `iloc`. For example, looking above, we can see that the first row in the `DataFrame` contains the iris type 'Iris-setosa'. Let's get the first row (index 0) using `loc`!

**Run** the cell below to get the first row of the `DataFrame` using `data.loc[0]`.

In [38]:
data.loc[0]

sepal_l            5.1
sepal_w            3.5
petal_l            1.4
petal_w            0.2
type       Iris-setosa
Name: 0, dtype: object

In [None]:
type(data.loc[0])

You will see that again, we get a `Series` object.  
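
In the iris data the row labels happen to equal the row positions, so `loc` and `iloc` look interchangeable. The sketch below uses a made-up DataFrame with labels 10, 20, 30 to show where they differ.

```python
import pandas as pd

# Illustrative DataFrame whose row labels differ from the row positions
df = pd.DataFrame({'sepal_w': [3.5, 3.0, 3.2]}, index=[10, 20, 30])

by_label = df.loc[10, 'sepal_w']  # the row labeled 10
by_position = df.iloc[0, 0]       # the first row, first column

# df.loc[0] would raise a KeyError here, because no row is labeled 0
print(by_label, by_position)
```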

Sometimes, we do not want to grab the entire row. For example, we might want only the sepal width `sepal_w` of the iris flower in the first row.

**Run** the cell below to see one way to get the `sepal_w` from the first row:

In [16]:
data.loc[0, 'sepal_w']

3.5

There are other ways to get the same thing. **Run** the cells below.

In [39]:
data.iloc[0, 1]

3.5

In [40]:
data.loc[0].loc['sepal_w']

3.5

In [18]:
data.iloc[0].iloc[1]

3.5

In [19]:
data['sepal_w'][0]

3.5

In [41]:
data['sepal_w'].loc[0]

3.5

We can also get multiple rows from the `DataFrame` with `numpy`-like slicing: `data.iloc[lower:upper]` takes a slice of the `DataFrame` object from the lower bound `lower` up to (but not including) the upper bound `upper`. Be careful! Slicing the `DataFrame` with `loc` and slicing it with `iloc` give different results.

When slicing a `DataFrame` using `iloc` (*integer-position based* indexing), the lower bound is included, while the upper bound is excluded. When slicing with `loc` (*label-based* indexing), both bounds are included.

**Run** the cells below. The first displays the `sepal_w` column for reference, and the second gets the first three rows of the `DataFrame` using `data.iloc[0:3]`.

In [29]:
data['sepal_w']

0      3.5
1      3.0
2      3.2
3      3.1
4      3.6
      ... 
145    3.0
146    2.5
147    3.0
148    3.4
149    3.0
Name: sepal_w, Length: 150, dtype: float64

In [None]:
data.iloc[0:3]
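
To make the slicing difference concrete, here is a small sketch on a made-up five-row DataFrame with the default integer index. The same `0:3` slice returns three rows with `iloc` and four rows with `loc`.

```python
import pandas as pd

# Illustrative DataFrame with the default integer index 0..4
df = pd.DataFrame({'sepal_w': [3.5, 3.0, 3.2, 3.1, 3.6]})

by_position = df.iloc[0:3]  # positions 0, 1, 2 (upper bound excluded)
by_label = df.loc[0:3]      # labels 0, 1, 2, 3 (upper bound included)

print(len(by_position), len(by_label))
```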

### Conditional Selection  

An important feature of pandas is conditional selection using bracket notation, very similar to numpy. For example, **run** the cell below to select rows with sepal length `sepal_l` greater than 7. 

In [43]:
data[data['sepal_l'] > 7]

Unnamed: 0,sepal_l,sepal_w,petal_l,petal_w,type
102,7.1,3.0,5.9,2.1,Iris-virginica
105,7.6,3.0,6.6,2.1,Iris-virginica
107,7.3,2.9,6.3,1.8,Iris-virginica
109,7.2,3.6,6.1,2.5,Iris-virginica
117,7.7,3.8,6.7,2.2,Iris-virginica
...,...,...,...,...,...
125,7.2,3.2,6.0,1.8,Iris-virginica
129,7.2,3.0,5.8,1.6,Iris-virginica
130,7.4,2.8,6.1,1.9,Iris-virginica
131,7.9,3.8,6.4,2.0,Iris-virginica
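
Under the hood, the comparison inside the brackets produces a boolean `Series`, and the `DataFrame` keeps only the rows where that `Series` is `True`. A minimal sketch on made-up values:

```python
import pandas as pd

# Illustrative values, not the full iris data
df = pd.DataFrame({'sepal_l': [5.1, 7.1, 6.3, 7.6]})

mask = df['sepal_l'] > 7  # boolean Series, one value per row
selected = df[mask]       # keeps only the rows where mask is True

print(mask)
print(selected)
```

Because `True` counts as 1, `mask.sum()` gives the number of matching rows.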


What if we just want to see the `type` for those rows with `sepal_l` greater than 7? **Run** the following cell: 

In [26]:
data[data['sepal_l'] > 7]['type']


102    Iris-virginica
105    Iris-virginica
107    Iris-virginica
109    Iris-virginica
117    Iris-virginica
            ...      
125    Iris-virginica
129    Iris-virginica
130    Iris-virginica
131    Iris-virginica
135    Iris-virginica
Name: type, dtype: object
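
The chained form above filters the rows first and then picks the column. As an alternative sketch, `loc` accepts the row condition and the column label in a single step. The DataFrame here is made up for illustration.

```python
import pandas as pd

# Illustrative values, not the full iris data
df = pd.DataFrame({
    'sepal_l': [5.1, 7.1, 7.6],
    'type': ['Iris-setosa', 'Iris-virginica', 'Iris-virginica'],
})

chained = df[df['sepal_l'] > 7]['type']      # filter rows, then select the column
single = df.loc[df['sepal_l'] > 7, 'type']   # rows and column in one loc call

same = chained.equals(single)
print(same)
```

Both forms return the same `Series` when reading data. The single `loc` call is generally the safer habit if you later want to assign values to the selection.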

We can also use multiple conditions to do selection. To combine conditions, use `|` (or) and `&` (and), wrapping each condition in parentheses. **Run** the cell below to find flowers with `sepal_l` greater than 7 and `sepal_w` smaller than 3.

In [16]:
data[(data['sepal_l'] > 7) & (data['sepal_w'] < 3)]

Unnamed: 0,sepal_l,sepal_w,petal_l,petal_w,type
107,7.3,2.9,6.3,1.8,Iris-virginica
118,7.7,2.6,6.9,2.3,Iris-virginica
122,7.7,2.8,6.7,2.0,Iris-virginica
130,7.4,2.8,6.1,1.9,Iris-virginica
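
The parentheses matter because `&` and `|` bind more tightly than comparison operators such as `>` and `<` in Python. The sketch below, on made-up values, shows "and", "or", and negation with `~`.

```python
import pandas as pd

# Illustrative values, not the full iris data
df = pd.DataFrame({
    'sepal_l': [5.1, 7.3, 7.7, 6.0],
    'sepal_w': [3.5, 2.9, 3.8, 2.2],
})

long_sepal = df['sepal_l'] > 7
narrow_sepal = df['sepal_w'] < 3

both = df[long_sepal & narrow_sepal]        # both conditions hold
either = df[long_sepal | narrow_sepal]      # at least one condition holds
neither = df[~(long_sepal | narrow_sepal)]  # ~ negates a boolean Series

print(list(both.index), list(either.index), list(neither.index))
```

Naming the individual conditions first, as done here, is one way to keep longer filters readable.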


### Others

**Create new columns**

In [25]:
# create a new column called 'delta_sepal', the difference between sepal_l and sepal_w
data['delta_sepal'] = data['sepal_l'] - data['sepal_w']
# create a second column called 'delta_lzy', the difference between sepal_l and petal_l
data['delta_lzy'] = data['sepal_l'] - data['petal_l']
data.head()

Unnamed: 0,sepal_l,sepal_w,petal_l,petal_w,type,delta_sepal,delta_lzy
0,5.1,3.5,1.4,0.2,Iris-setosa,1.6,3.7
1,4.9,3.0,1.4,0.2,Iris-setosa,1.9,3.5
2,4.7,3.2,1.3,0.2,Iris-setosa,1.5,3.4
3,4.6,3.1,1.5,0.2,Iris-setosa,1.5,3.1
4,5.0,3.6,1.4,0.2,Iris-setosa,1.4,3.6
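
Column arithmetic is element-wise, so any expression on existing columns can define a new one. The sketch below uses a made-up two-row DataFrame; the column names `sepal_ratio` and `is_long` are hypothetical. It also shows `assign`, which returns a new `DataFrame` instead of modifying the original.

```python
import pandas as pd

# Illustrative values, not the full iris data
df = pd.DataFrame({'sepal_l': [5.1, 4.9], 'sepal_w': [3.5, 3.0]})

# Bracket assignment adds the column to df itself
df['sepal_ratio'] = df['sepal_l'] / df['sepal_w']

# assign returns a new DataFrame and leaves df unchanged
with_flag = df.assign(is_long=df['sepal_l'] > 5)

print(list(df.columns))
print(list(with_flag.columns))
```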


**Removing columns** 

In [32]:
data.drop('delta_sepal', axis=1).head()

Unnamed: 0,sepal_l,sepal_w,petal_l,petal_w,type,delta_lzy
0,5.1,3.5,1.4,0.2,Iris-setosa,3.7
1,4.9,3.0,1.4,0.2,Iris-setosa,3.5
2,4.7,3.2,1.3,0.2,Iris-setosa,3.4
3,4.6,3.1,1.5,0.2,Iris-setosa,3.1
4,5.0,3.6,1.4,0.2,Iris-setosa,3.6


Note that this operation is not in place: `drop` returns a new `DataFrame` with the specified column removed and leaves `data` itself unchanged. Use `inplace=True` to modify `data` directly.

In [10]:
data.head() # delta_sepal should still be there

Unnamed: 0,sepal_l,sepal_w,petal_l,petal_w,type,delta_sepal,delta_lzy
0,5.1,3.5,1.4,0.2,Iris-setosa,1.6,3.7
1,4.9,3.0,1.4,0.2,Iris-setosa,1.9,3.5
2,4.7,3.2,1.3,0.2,Iris-setosa,1.5,3.4
3,4.6,3.1,1.5,0.2,Iris-setosa,1.5,3.1
4,5.0,3.6,1.4,0.2,Iris-setosa,1.4,3.6


In [34]:
data.drop('delta_sepal', axis=1, inplace=True)

In [35]:
data.head() # delta_sepal should be gone

Unnamed: 0,sepal_l,sepal_w,petal_l,petal_w,type,delta_lzy
0,5.1,3.5,1.4,0.2,Iris-setosa,3.7
1,4.9,3.0,1.4,0.2,Iris-setosa,3.5
2,4.7,3.2,1.3,0.2,Iris-setosa,3.4
3,4.6,3.1,1.5,0.2,Iris-setosa,3.1
4,5.0,3.6,1.4,0.2,Iris-setosa,3.6
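
As an alternative to `inplace=True`, a common pattern is to reassign the result of `drop` to the same name. The sketch below, on a made-up DataFrame, shows that calling `drop` alone leaves the original untouched, while reassignment makes the change stick.

```python
import pandas as pd

# Illustrative values, not the full iris data
df = pd.DataFrame({'sepal_l': [5.1, 4.9], 'delta_sepal': [1.6, 1.9]})

dropped = df.drop('delta_sepal', axis=1)  # returns a new DataFrame
columns_after_call = list(df.columns)     # df still has both columns

df = df.drop('delta_sepal', axis=1)       # reassign to keep the change

print(columns_after_call)
print(list(df.columns))
```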
