# Intro to pandas

**Learning Objectives:**
  * Gain an introduction to the `DataFrame` data structure of the *pandas* library
  * Access and manipulate data within a `DataFrame`
  * Import CSV data into a *pandas* `DataFrame`
  

[*pandas*](http://pandas.pydata.org/) is a column-oriented data analysis API. It's a great tool for handling and analyzing input data, and many ML frameworks support *pandas* data structures as inputs.
Although a comprehensive introduction to the *pandas* API would span many pages, the core concepts are fairly straightforward, and we'll present them below. For a more complete reference, the [*pandas* docs site](http://pandas.pydata.org/pandas-docs/stable/index.html) contains extensive documentation and many tutorials.

## Basic Operation: Library import

The following line imports the *pandas* library and binds it to the conventional alias `pd`:

In [1]:
import pandas as pd


## Basic Operation: Data loading and DataFrame creation


Most of the time, you load an entire file into a `DataFrame`. The following example loads a file with California housing data. Run the following cell to load the data and create your first `DataFrame`:

In [2]:
california_housing_dataframe = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv", sep=",")
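
`pd.read_csv` also accepts a file-like object, which is handy for experimenting without a network connection. The sketch below assumes a hypothetical in-memory sample (`sample_csv`) holding three records and three columns of the California housing data; it is an illustration, not the full dataset.

```python
import io

import pandas as pd

# Hypothetical in-memory sample: three records and three columns
# taken from the California housing data used in this notebook.
sample_csv = """longitude,latitude,total_rooms
-114.31,34.19,5612.0
-114.47,34.4,7650.0
-114.56,33.69,720.0
"""

# read_csv accepts a URL, a local path, or a file-like object such as StringIO.
sample_dataframe = pd.read_csv(io.StringIO(sample_csv), sep=",")

print(sample_dataframe.shape)          # (number of rows, number of columns)
print(list(sample_dataframe.columns))  # column names come from the header row
```

The header row becomes the column names, and pandas assigns a default integer index starting at zero.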


A useful method is `DataFrame.head()`, which displays the first few records of a `DataFrame`:

In [16]:
california_housing_dataframe.head()

,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0
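
By default `head()` returns five records. It also accepts the number of records to return, and `tail()` does the same from the end of the `DataFrame`. A minimal sketch, assuming a toy `DataFrame` (`toy_dataframe`, a name introduced here for illustration) built from two columns of the records above:

```python
import pandas as pd

# Toy DataFrame (an assumption for illustration) with values from the records above.
toy_dataframe = pd.DataFrame({
    "longitude": [-114.31, -114.47, -114.56, -114.57, -114.57],
    "total_rooms": [5612.0, 7650.0, 720.0, 1501.0, 1454.0],
})

first_two = toy_dataframe.head(2)  # the first two records
last_two = toy_dataframe.tail(2)   # the last two records

print(first_two)
print(last_two)
```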


## Basic Operation: Row selection by name

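Sometimes you will need to select specific rows from your `DataFrame`. The following example selects a specific row from the California housing data.

We use the `DataFrame.loc[]` property to select rows by their index labels. Because this `DataFrame` has the default integer index, the labels are the row numbers 0, 1, 2, and so on.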


In [24]:
# Select the first row. Its label is 0 because the default index starts counting from zero.
california_housing_dataframe.loc[0]

longitude              -114.3100
latitude                 34.1900
housing_median_age       15.0000
total_rooms            5612.0000
total_bedrooms         1283.0000
population             1015.0000
households              472.0000
median_income             1.4936
median_house_value    66900.0000
Name: 0, dtype: float64

In [25]:
# Select the tenth row. Its label is 9 because the default index starts counting from zero.
california_housing_dataframe.loc[9]

longitude              -114.6000
latitude                 34.8300
housing_median_age       46.0000
total_rooms            1497.0000
total_bedrooms          309.0000
population              787.0000
households              271.0000
median_income             2.1908
median_house_value    48100.0000
Name: 9, dtype: float64
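
`loc` looks rows up by label, not by position. The difference is easier to see when the index is not the default range of integers. The sketch below assumes a toy `DataFrame` with hypothetical string labels (`block_a`, `block_b`, `block_c`); the housing data itself has no such labels.

```python
import pandas as pd

# Toy DataFrame with hypothetical string row labels (not part of the housing data).
toy_dataframe = pd.DataFrame(
    {
        "total_rooms": [5612.0, 7650.0, 720.0],
        "population": [1015.0, 1129.0, 333.0],
    },
    index=["block_a", "block_b", "block_c"],
)

# Selecting one row by its label returns a Series whose name is that label.
row = toy_dataframe.loc["block_b"]
print(row)
```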

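We can also select a range of rows by slicing the `DataFrame` directly. Unlike `DataFrame.loc[]`, this slice notation is position-based and excludes the end of the range: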

In [27]:
california_housing_dataframe[0:3]

,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0


In [28]:
california_housing_dataframe[3:6]

,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0
5,-114.58,33.63,29.0,1387.0,236.0,671.0,239.0,3.3438,74000.0
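
Plain slicing and `loc` slicing treat the end of the range differently, which is a common source of off-by-one errors. A small sketch on a toy `DataFrame` (an assumption for illustration, using `total_rooms` values from the records above):

```python
import pandas as pd

# Toy DataFrame (an assumption for illustration) with the default integer index.
toy_dataframe = pd.DataFrame(
    {"total_rooms": [5612.0, 7650.0, 720.0, 1501.0, 1454.0]}
)

by_position = toy_dataframe[0:3]   # positions 0, 1, 2: the end is excluded
by_label = toy_dataframe.loc[0:3]  # labels 0, 1, 2, 3: the end is included

print(len(by_position))
print(len(by_label))
```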


## Basic Operation: Column selection by name

You will often need to select specific columns from your `DataFrame`. The following examples select individual columns from the California housing data.

To select a column, pass its name in square brackets:

In [35]:
# Select the column named "longitude".
california_housing_dataframe['longitude']

0       -114.31
1       -114.47
2       -114.56
3       -114.57
4       -114.57
          ...  
16995   -124.26
16996   -124.27
16997   -124.30
16998   -124.30
16999   -124.35
Name: longitude, Length: 17000, dtype: float64

In [36]:
# Select the column named "total_rooms".
california_housing_dataframe['total_rooms']

0        5612.0
1        7650.0
2         720.0
3        1501.0
4        1454.0
          ...  
16995    2217.0
16996    2349.0
16997    2677.0
16998    2672.0
16999    1820.0
Name: total_rooms, Length: 17000, dtype: float64

You can also select several columns at once by passing a list of column names, as shown in the next cell:

In [5]:
california_housing_dataframe[['housing_median_age','total_rooms']]

,housing_median_age,total_rooms
0,15.0,5612.0
1,19.0,7650.0
2,17.0,720.0
3,14.0,1501.0
4,20.0,1454.0
...,...,...
16995,52.0,2217.0
16996,36.0,2349.0
16997,17.0,2677.0
16998,19.0,2672.0
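
The type of the result depends on what goes inside the brackets: a single name returns a `Series`, while a list of names returns a `DataFrame`, even if the list has only one entry. A minimal sketch on a toy `DataFrame` (an assumption for illustration, using values from the first three records):

```python
import pandas as pd

# Toy DataFrame (an assumption for illustration) with three of the housing columns.
toy_dataframe = pd.DataFrame({
    "housing_median_age": [15.0, 19.0, 17.0],
    "total_rooms": [5612.0, 7650.0, 720.0],
    "population": [1015.0, 1129.0, 333.0],
})

one_column = toy_dataframe["total_rooms"]          # a single name gives a Series
one_column_frame = toy_dataframe[["total_rooms"]]  # a one-item list gives a DataFrame
two_columns = toy_dataframe[["housing_median_age", "total_rooms"]]

print(type(one_column).__name__)
print(type(one_column_frame).__name__)
print(list(two_columns.columns))
```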


## Basic Operation: Row and Column selection by name

Sometimes you might need to access specific elements of a `DataFrame`. To do so, pass both a row label and a column name to `DataFrame.loc[]`:

In [37]:
california_housing_dataframe.loc[0,'total_rooms']

5612.0

In [38]:
california_housing_dataframe.loc[0,'longitude']

-114.31

You can also select a range of elements from a `DataFrame`. Note that `loc` slices include both endpoints:

In [39]:
california_housing_dataframe.loc[0:3,'longitude']

0   -114.31
1   -114.47
2   -114.56
3   -114.57
Name: longitude, dtype: float64

In [41]:
california_housing_dataframe.loc[10:20,'total_rooms']

10    3741.0
11    1988.0
12    1291.0
13    2478.0
14    1448.0
15    2556.0
16    1678.0
17      44.0
18    1388.0
19      97.0
20    1491.0
Name: total_rooms, dtype: float64

In [43]:
california_housing_dataframe.loc[10:20,['longitude','total_rooms']]

,longitude,total_rooms
10,-114.6,3741.0
11,-114.6,1988.0
12,-114.61,1291.0
13,-114.61,2478.0
14,-114.63,1448.0
15,-114.65,2556.0
16,-114.65,1678.0
17,-114.65,44.0
18,-114.66,1388.0
19,-114.67,97.0
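
The three forms of `loc[rows, columns]` used in this section return three different kinds of object. A small sketch on a toy `DataFrame` (an assumption for illustration, using values from the first five records):

```python
import pandas as pd

# Toy DataFrame (an assumption for illustration) with the default integer index.
toy_dataframe = pd.DataFrame({
    "longitude": [-114.31, -114.47, -114.56, -114.57, -114.57],
    "total_rooms": [5612.0, 7650.0, 720.0, 1501.0, 1454.0],
})

# One row label and one column name: a single value.
single_value = toy_dataframe.loc[0, "total_rooms"]

# A label slice and one column name: a Series (labels 1, 2 and 3).
column_slice = toy_dataframe.loc[1:3, "longitude"]

# A label slice and a list of column names: a DataFrame.
block = toy_dataframe.loc[1:3, ["longitude", "total_rooms"]]

print(single_value)
print(column_slice)
print(block)
```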


## Basic Operation: Row selection by name

From time to time, you will need to select specific rows from your `DataFrame`. The following example selects some rows from the California housing data.

In [14]:
california_housing_dataframe.head()

,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


In [None]:
california_housing_dataframe

## Basic Operation: Column selection by position

From time to time you will need to select specific columns based on their position in your `DataFrame`. The following example selects the first two columns from the California housing data.

Columns in a `DataFrame` can be accessed by position using the `iloc` attribute:

In [11]:
california_housing_dataframe.head(3)

,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0


In [13]:
# Select the first column. Its position is 0 because Python starts counting from zero.
california_housing_dataframe.iloc[:,0]

0       -114.31
1       -114.47
2       -114.56
3       -114.57
4       -114.57
          ...  
16995   -124.26
16996   -124.27
16997   -124.30
16998   -124.30
16999   -124.35
Name: longitude, Length: 17000, dtype: float64
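
Positions can also be given as a slice. The sketch below selects the first two columns with `iloc[:, 0:2]` on a toy `DataFrame` (an assumption for illustration, using values from the first three records). Like ordinary Python slices, `iloc` slices exclude the end position.

```python
import pandas as pd

# Toy DataFrame (an assumption for illustration) with three of the housing columns.
toy_dataframe = pd.DataFrame({
    "longitude": [-114.31, -114.47, -114.56],
    "latitude": [34.19, 34.4, 33.69],
    "housing_median_age": [15.0, 19.0, 17.0],
})

# All rows, column positions 0 and 1 (position 2 is excluded).
first_two_columns = toy_dataframe.iloc[:, 0:2]

# Negative positions count from the end: -1 is the last column.
last_column = toy_dataframe.iloc[:, -1]

print(list(first_two_columns.columns))
print(last_column.name)
```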