## Working with pandas dataframes and columns
Let's learn some basics of working with pandas dataframes. 

### Read in the dataset
Let's read in the Iris dataset from a URL, just like we did in the lecture: 

In [2]:
import pandas as pd

# url to get file from
url = "http://mlr.cs.umass.edu/ml/machine-learning-databases/iris/iris.data"

# read the file into a dataframe
iris = pd.read_csv(url)

### Add column names
Column names are stored in a pandas `Index` object, which behaves much like a list, and can be accessed with the following syntax: 

`df.columns`

In [3]:
iris.columns

Index(['5.1', '3.5', '1.4', '0.2', 'Iris-setosa'], dtype='object')

The file has no header row, so pandas has used the first row of data as the column names (and that row is no longer part of the data). Let's create a list of proper column names and apply it to our dataframe: 

In [12]:
iris.columns = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Class']

In [13]:
iris.head()

|   | Sepal Length | Sepal Width | Petal Length | Petal Width | Class |
|---|---|---|---|---|---|
| 0 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 2 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 3 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 4 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
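
Note that the table starts at 4.9 rather than 5.1: the first data row was consumed as the header when the file was read. The following is a minimal sketch of one way to avoid that, assuming a small inline sample (`sample_csv`, a hypothetical stand-in for the downloaded file) so that it runs without a network connection. It passes `header=None` and `names=` to `read_csv`.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the headerless iris.data file
sample_csv = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

column_names = ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth", "Class"]

# Default behaviour: the first data row is used up as the header
lost_row = pd.read_csv(io.StringIO(sample_csv))

# header=None says the file has no header row; names= supplies the labels
kept_row = pd.read_csv(io.StringIO(sample_csv), header=None, names=column_names)

print(len(lost_row), len(kept_row))  # row counts with and without the fix
print(kept_row.head())
```

With this approach the column names are set at read time and no observation is lost.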


### Rename Columns
Actually... it's a bad idea to have spaces in our column names. They make the columns harder to work with downstream; for example, the `df.column_name` shortcut does not work for a name that contains a space. Let's rename our columns. 

We can simply assign a new list of names to the `columns` attribute: 

In [14]:
iris.columns = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Class']

In [15]:
iris.head()

|   | SepalLength | SepalWidth | PetalLength | PetalWidth | Class |
|---|---|---|---|---|---|
| 0 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 2 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 3 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 4 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |


But what if we have hundreds or thousands of columns? That's pretty tedious. Also, how do we know our columns will be in the same order as our list? 

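A safer way to rename columns is the pandas `rename()` method with a dictionary that maps old names to new names. Names in the dictionary that do not match an existing column are ignored by default, so running the cell below after the columns have already been renamed leaves the dataframe unchanged: 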

In [18]:
iris = iris.rename(columns={
    "Sepal Length": "SepalLength",
    "Sepal Width": "SepalWidth",
    "Petal Length": "PetalLength",
    "Petal Width": "PetalWidth",
    "Class": "Class",
})

In [19]:
iris.head()

|   | SepalLength | SepalWidth | PetalLength | PetalWidth | Class |
|---|---|---|---|---|---|
| 0 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 2 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 3 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 4 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
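
To make the two concerns raised above concrete (many columns, and column order), here is a small sketch on a hypothetical three-column frame `df`. It shows that a dictionary mapping is independent of column order, that `rename()` also accepts a function applied to every name, and that `errors="raise"` (available in pandas 0.25 and later) turns a misspelled old name into an error instead of a silent no-op.

```python
import pandas as pd

# Small hypothetical frame with spaces in its column names
df = pd.DataFrame({
    "Sepal Length": [4.9, 4.7],
    "Sepal Width": [3.0, 3.2],
    "Class": ["Iris-setosa", "Iris-setosa"],
})

# Explicit mapping: the order of the dictionary entries does not matter
mapped = df.rename(columns={"Sepal Width": "SepalWidth",
                            "Sepal Length": "SepalLength"})

# Function form: applied to every column name, handy for hundreds of columns
stripped = df.rename(columns=lambda name: name.replace(" ", ""))

# errors="raise" makes a misspelled old name fail loudly
try:
    df.rename(columns={"Sepal Lenght": "SepalLength"}, errors="raise")
except KeyError as exc:
    print("rename failed:", exc)

print(list(mapped.columns))
print(list(stripped.columns))
```

Both `mapped` and `stripped` are new dataframes; `df` itself keeps its original names because `rename()` returns a copy.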


### Viewing a column
We can pull out a single column from a dataframe using the following syntax:
`df["column_name"]`

In [20]:
iris["SepalLength"].head()

0    4.9
1    4.7
2    4.6
3    5.0
4    5.4
Name: SepalLength, dtype: float64

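You can also select a column with attribute (dot) access, as long as the column name is a valid Python identifier and does not clash with an existing dataframe attribute or method: 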

In [23]:
iris.SepalLength.head()

0    4.9
1    4.7
2    4.6
3    5.0
4    5.4
Name: SepalLength, dtype: float64
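
The two access styles are not always interchangeable. The sketch below uses a hypothetical frame `df` whose columns are chosen to show the cases where dot access falls short: a name containing a space, and a name (`count`) that is also a dataframe method.

```python
import pandas as pd

# Hypothetical frame: one safe name, one with a space, one clashing with a method
df = pd.DataFrame({
    "SepalLength": [4.9, 4.7],
    "Sepal Width": [3.0, 3.2],
    "count": [1, 2],
})

# Valid identifier with no clash: both styles return the same Series
same = df.SepalLength.equals(df["SepalLength"])

# "count" is also DataFrame.count, so dot access returns the method, not the column
is_method = callable(df.count)
count_column = df["count"]

# A name with a space can only be reached with brackets
width = df["Sepal Width"]

print(same, is_method)
```

Bracket access works for every column name, which is why it is usually the safer default in scripts.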

### Viewing Multiple Columns
We can use a similar syntax to view multiple dataframe columns. We just pass a list of column names instead of a single name: 

In [21]:
iris[ ["SepalLength", "SepalWidth"]  ].head()

|   | SepalLength | SepalWidth |
|---|---|---|
| 0 | 4.9 | 3.0 |
| 1 | 4.7 | 3.2 |
| 2 | 4.6 | 3.1 |
| 3 | 5.0 | 3.6 |
| 4 | 5.4 | 3.9 |
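
The type of the result depends on what goes inside the brackets. As a brief illustration on a hypothetical two-row frame `df`: a string gives a Series, while a list (even a one-element list) gives a dataframe with the columns in the order listed.

```python
import pandas as pd

# Hypothetical two-row frame for illustration
df = pd.DataFrame({
    "SepalLength": [4.9, 4.7],
    "SepalWidth": [3.0, 3.2],
    "Class": ["Iris-setosa", "Iris-setosa"],
})

one_series = df["SepalLength"]                 # string -> Series
one_frame = df[["SepalLength"]]                # one-element list -> DataFrame
reordered = df[["SepalWidth", "SepalLength"]]  # columns follow the list order

print(type(one_series).__name__, type(one_frame).__name__)
print(list(reordered.columns))
```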


### loc and iloc
You can use `loc` and `iloc` to select data in pandas when you don't know the column name, or when you want to grab a row by position.

- iloc = select by integer position (0-based; the end of a slice is excluded)
- loc = select by label or by boolean condition (the end of a label slice is included)

The syntax is: 

`df.iloc[<row selection>, <column selection>]`

`df.loc[<row selection>, <column selection>]`

In [24]:
iris.iloc[0] # first row 

SepalLength            4.9
SepalWidth               3
PetalLength            1.4
PetalWidth             0.2
Class          Iris-setosa
Name: 0, dtype: object

In [32]:
iris.iloc[1:5] # second to fifth rows

|   | SepalLength | SepalWidth | PetalLength | PetalWidth | Class |
|---|---|---|---|---|---|
| 1 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 2 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 3 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 4 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |


In [26]:
iris.iloc[-1] # last row 

SepalLength               5.9
SepalWidth                  3
PetalLength               5.1
PetalWidth                1.8
Class          Iris-virginica
Name: 148, dtype: object

In [35]:
iris.iloc[:,0].head() # first column

0    4.9
1    4.7
2    4.6
3    5.0
4    5.4
Name: SepalLength, dtype: float64

In [37]:
iris.iloc[:,-1].head() # last column

0    Iris-setosa
1    Iris-setosa
2    Iris-setosa
3    Iris-setosa
4    Iris-setosa
Name: Class, dtype: object

In [40]:
# first five rows and third and fourth columns
iris.iloc[0:5, 2:4].head() 

|   | PetalLength | PetalWidth |
|---|---|---|
| 0 | 1.4 | 0.2 |
| 1 | 1.3 | 0.2 |
| 2 | 1.5 | 0.2 |
| 3 | 1.4 | 0.2 |
| 4 | 1.7 | 0.4 |
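
The examples above all use `iloc`. As a complement, here is a minimal sketch of `loc` on a hypothetical three-row frame `df` with string row labels (`"a"`, `"b"`, `"c"`, chosen only to make labels and positions easy to tell apart). It covers selection by label, by label slice, and by boolean condition.

```python
import pandas as pd

# Hypothetical frame with string row labels
df = pd.DataFrame(
    {
        "SepalLength": [4.9, 4.7, 6.3],
        "PetalLength": [1.4, 1.3, 6.0],
        "Class": ["Iris-setosa", "Iris-setosa", "Iris-virginica"],
    },
    index=["a", "b", "c"],
)

# Single value by row label and column label
by_label = df.loc["b", "PetalLength"]

# Label slices include the end label; position slices exclude the end position
label_slice = df.loc["a":"b"]
position_slice = df.iloc[0:1]

# Boolean condition for rows, list of labels for columns
long_petals = df.loc[df["PetalLength"] > 5, ["SepalLength", "Class"]]

print(by_label)
print(len(label_slice), len(position_slice))
print(long_petals)
```

The different slice-end behaviour of `loc` and `iloc` is a common source of off-by-one mistakes, so it is worth checking row counts when switching between them.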


### Setting an Index
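Setting an index on a dataframe can make it easier to work with downstream. The index is the main point of reference for the rows in your dataset: its values become your row labels. Here we use the `Class` column; note that its values repeat, so these labels do not uniquely identify a row. 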

In [44]:
iris = iris.set_index("Class")

In [45]:
iris.head()

| Class | SepalLength | SepalWidth | PetalLength | PetalWidth |
|---|---|---|---|---|
| Iris-setosa | 4.9 | 3.0 | 1.4 | 0.2 |
| Iris-setosa | 4.7 | 3.2 | 1.3 | 0.2 |
| Iris-setosa | 4.6 | 3.1 | 1.5 | 0.2 |
| Iris-setosa | 5.0 | 3.6 | 1.4 | 0.2 |
| Iris-setosa | 5.4 | 3.9 | 1.7 | 0.4 |
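
Once the index is set, rows can be looked up by label with `loc`. The following sketch, on a hypothetical three-row frame `df`, shows what that looks like when labels repeat and how `reset_index()` undoes the change.

```python
import pandas as pd

# Hypothetical three-row frame for illustration
df = pd.DataFrame({
    "Class": ["Iris-setosa", "Iris-setosa", "Iris-virginica"],
    "SepalLength": [4.9, 4.7, 6.3],
})

indexed = df.set_index("Class")

# With repeated labels, a label lookup returns every matching row
setosa = indexed.loc["Iris-setosa"]

# is_unique reports whether each label identifies exactly one row
is_unique = indexed.index.is_unique

# reset_index moves the index back into an ordinary column
restored = indexed.reset_index()

print(setosa)
print(is_unique)
print(list(restored.columns))
```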


### Add a new column
Adding a new column to an existing dataframe is easy: 

In [46]:
iris["fake_column"] = "testing"

In [47]:
iris.head()

| Class | SepalLength | SepalWidth | PetalLength | PetalWidth | fake_column |
|---|---|---|---|---|---|
| Iris-setosa | 4.9 | 3.0 | 1.4 | 0.2 | testing |
| Iris-setosa | 4.7 | 3.2 | 1.3 | 0.2 | testing |
| Iris-setosa | 4.6 | 3.1 | 1.5 | 0.2 | testing |
| Iris-setosa | 5.0 | 3.6 | 1.4 | 0.2 | testing |
| Iris-setosa | 5.4 | 3.9 | 1.7 | 0.4 | testing |
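
Assigning a single value, as above, repeats it in every row. A new column can also be computed from existing ones. The sketch below uses a hypothetical frame `df`; the names `PetalArea` and `PetalRatio` are made up for illustration and are not part of the Iris dataset.

```python
import pandas as pd

# Hypothetical three-row frame for illustration
df = pd.DataFrame({
    "PetalLength": [1.4, 1.3, 1.5],
    "PetalWidth": [0.2, 0.2, 0.2],
})

# A scalar is broadcast to every row
df["fake_column"] = "testing"

# Arithmetic on columns works element by element
df["PetalArea"] = df["PetalLength"] * df["PetalWidth"]

# assign() returns a new dataframe and leaves df untouched
with_ratio = df.assign(PetalRatio=df["PetalLength"] / df["PetalWidth"])

print(df.head())
print(list(with_ratio.columns))
```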


### Delete a column
Let's get rid of that fake column using the drop() method. We pass `axis=1` to tell pandas that "fake_column" is a column label. To drop rows by their index label instead, we would use `axis=0`, which is the default. 

In [48]:
iris = iris.drop("fake_column", axis=1)

In [49]:
iris.head()

| Class | SepalLength | SepalWidth | PetalLength | PetalWidth |
|---|---|---|---|---|
| Iris-setosa | 4.9 | 3.0 | 1.4 | 0.2 |
| Iris-setosa | 4.7 | 3.2 | 1.3 | 0.2 |
| Iris-setosa | 4.6 | 3.1 | 1.5 | 0.2 |
| Iris-setosa | 5.0 | 3.6 | 1.4 | 0.2 |
| Iris-setosa | 5.4 | 3.9 | 1.7 | 0.4 |
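
For comparison, here is a short sketch of the equivalent `columns=` keyword (available in pandas 0.21 and later) alongside the `axis` form, plus a row drop, on a hypothetical three-row frame `df`.

```python
import pandas as pd

# Hypothetical three-row frame for illustration
df = pd.DataFrame({
    "SepalLength": [4.9, 4.7, 4.6],
    "fake_column": ["testing"] * 3,
})

# Two equivalent ways to drop a column
by_axis = df.drop("fake_column", axis=1)
by_keyword = df.drop(columns=["fake_column"])

# axis=0 drops rows by index label; here the row labelled 0
without_first_row = df.drop(0, axis=0)

print(list(by_axis.columns))
print(list(without_first_row.index))
```

In every case `drop()` returns a new dataframe, which is why the result is assigned back to `iris` in the cell above.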
