# Intro to pandas (Lab1)

**Learning Objectives:**
  * Gain an introduction to the `DataFrame` data structure of the *pandas* library
  * Access and manipulate data within a `DataFrame`
  * Import CSV data into a *pandas* `DataFrame`
  

[*pandas*](http://pandas.pydata.org/) is a column-oriented data analysis API. It's a great tool for handling and analyzing input data, and many ML frameworks support *pandas* data structures as inputs.
Although a comprehensive introduction to the *pandas* API would span many pages, the core concepts are fairly straightforward, and we'll present them below. For a more complete reference, the [*pandas* docs site](http://pandas.pydata.org/pandas-docs/stable/index.html) contains extensive documentation and many tutorials.

## Basic Operation: Library import

The following line imports the *pandas* library under its conventional alias, `pd`:

In [2]:
import pandas as pd


## Basic Operation: Data loading and DataFrame creation


Most of the time, you load an entire file into a `DataFrame`. The following example loads a file with California housing data. Run the following cell to load the data and create your first `DataFrame`:

In [3]:
california_housing_dataframe = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv", sep=",")
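
As a small offline sketch of the same loading step, `pd.read_csv` can also read from a file-like object. The example below assumes a hypothetical in-memory string, `csv_text`, holding a few rows and columns copied from the housing data, so it runs without network access.

```python
import io

import pandas as pd

# A few rows copied from the California housing data, kept in memory
# so the example does not depend on a download. csv_text is a
# hypothetical name used only for this illustration.
csv_text = """longitude,latitude,housing_median_age,total_rooms
-114.31,34.19,15.0,5612.0
-114.47,34.4,19.0,7650.0
-114.56,33.69,17.0,720.0
"""

# read_csv accepts a local path, a URL, or a file-like object such as StringIO.
sample_dataframe = pd.read_csv(io.StringIO(csv_text))

print(sample_dataframe.shape)   # (number of rows, number of columns)
print(sample_dataframe.dtypes)  # the type pandas inferred for each column
```

Checking `shape` and `dtypes` right after loading is a quick way to confirm that the file was parsed as expected.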


A useful method is `DataFrame.head()`, which displays the first few records (five by default) of a `DataFrame`:

In [4]:
california_housing_dataframe.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0
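
`head()` also takes an optional row count, and `tail()` is its counterpart for the end of the table. The following is a minimal sketch on a small stand-in `DataFrame`, here called `sample_dataframe` (a hypothetical name), built from two columns of the rows shown above.

```python
import pandas as pd

# Stand-in for the housing data: two columns from the first five rows.
sample_dataframe = pd.DataFrame({
    "longitude": [-114.31, -114.47, -114.56, -114.57, -114.57],
    "total_rooms": [5612.0, 7650.0, 720.0, 1501.0, 1454.0],
})

first_two = sample_dataframe.head(2)  # first 2 rows instead of the default 5
last_two = sample_dataframe.tail(2)   # last 2 rows

print(first_two)
print(last_two)
```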


## Basic Operation: Assigning new values to an element

Sometimes you need to change the value of a specific element of a `DataFrame`. The following example changes one element of the California housing data.

We use the `DataFrame.loc[]` indexer to select an element by its row and column labels, and then assign a new value to that element.


In [6]:
# we select the value from the first row and the column total_rooms
california_housing_dataframe.loc[0,'total_rooms']

5612.0

In [8]:
# we change the value from the first row and the column total_rooms
california_housing_dataframe.loc[0,'total_rooms']=10

In [10]:
# we check that we have changed the value
california_housing_dataframe.loc[0,'total_rooms']

10.0
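
Because `loc[]` works with labels, the `0` in the cells above is the row's index label, which happens to coincide with its position. The sketch below uses a stand-in `DataFrame` with made-up string labels (`"a"`, `"b"`, `"c"`) to show the difference between label-based `loc[]`, position-based `iloc[]`, and the scalar accessor `at[]`.

```python
import pandas as pd

# Stand-in data with string row labels; the labels are hypothetical and
# chosen only to make label-based and position-based access look different.
sample_dataframe = pd.DataFrame(
    {
        "total_rooms": [5612.0, 7650.0, 720.0],
        "population": [1015.0, 1129.0, 333.0],
    },
    index=["a", "b", "c"],
)

by_label = sample_dataframe.loc["b", "total_rooms"]  # row labelled "b"
by_position = sample_dataframe.iloc[1, 0]            # second row, first column

# .at[] reads or writes exactly one element, addressed by labels.
sample_dataframe.at["b", "total_rooms"] = 10.0

print(by_label, by_position)
print(sample_dataframe.loc["b", "total_rooms"])
```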

## Basic Operation: Assigning new values to a column

A very common operation is assigning new values to a whole column.

The following example sets every value of the `population` column in the California housing data to 100.


In [11]:
california_housing_dataframe['population']=100

In [13]:
# we observe that the column population has new values
california_housing_dataframe

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,10.0,1283.0,100,472.0,1.4936,66900.0
1,-114.47,34.40,19.0,7650.0,1901.0,100,463.0,1.8200,80100.0
2,-114.56,33.69,17.0,720.0,174.0,100,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,100,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,100,262.0,1.9250,65500.0
...,...,...,...,...,...,...,...,...,...
16995,-124.26,40.58,52.0,2217.0,394.0,100,369.0,2.3571,111400.0
16996,-124.27,40.69,36.0,2349.0,528.0,100,465.0,2.5179,79000.0
16997,-124.30,41.84,17.0,2677.0,531.0,100,456.0,3.0313,103600.0
16998,-124.30,41.80,19.0,2672.0,552.0,100,478.0,1.9797,85800.0
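
Assigning a single value fills the whole column with it. A column can also receive one value per row, for instance computed from other columns. The sketch below assumes a small stand-in `DataFrame`; the new column name `rooms_per_household` is hypothetical.

```python
import pandas as pd

# Stand-in for the housing data: two columns from the first three rows.
sample_dataframe = pd.DataFrame({
    "total_rooms": [5612.0, 7650.0, 720.0],
    "households": [472.0, 463.0, 117.0],
})

# A scalar is broadcast to every row of the column.
sample_dataframe["population"] = 100

# A Series supplies one value per row; arithmetic between columns is element-wise.
sample_dataframe["rooms_per_household"] = (
    sample_dataframe["total_rooms"] / sample_dataframe["households"]
)

print(sample_dataframe)
```

Assigning to a column name that does not exist yet creates the column, which is how both new columns appear here.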


## Basic Operation: Assigning new values to a row

It is also possible to assign new values to an entire row.

The following example overwrites every value in the first row of the California housing data with the same string.


In [15]:
california_housing_dataframe.loc[0]='this is a new value'

In [16]:
california_housing_dataframe

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,this is a new value,this is a new value,this is a new value,this is a new value,this is a new value,this is a new value,this is a new value,this is a new value,this is a new value
1,-114.47,34.4,19.0,7650.0,1901.0,100,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,100,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,100,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,100,262.0,1.925,65500.0
...,...,...,...,...,...,...,...,...,...
16995,-124.26,40.58,52.0,2217.0,394.0,100,369.0,2.3571,111400.0
16996,-124.27,40.69,36.0,2349.0,528.0,100,465.0,2.5179,79000.0
16997,-124.3,41.84,17.0,2677.0,531.0,100,456.0,3.0313,103600.0
16998,-124.3,41.8,19.0,2672.0,552.0,100,478.0,1.9797,85800.0
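
Writing a string into a row of numeric columns mixes types within each column, and depending on the pandas version this kind of assignment may also trigger a warning or be rejected. When the goal is to replace a row with new numbers, one option is to assign a list with one value per column. This is a minimal sketch on a stand-in `DataFrame`; the replacement numbers are placeholders, not real data.

```python
import pandas as pd

# Stand-in for the housing data: three columns from the first two rows.
sample_dataframe = pd.DataFrame({
    "longitude": [-114.31, -114.47],
    "latitude": [34.19, 34.4],
    "total_rooms": [5612.0, 7650.0],
})

# One value per column, in column order. The numbers are placeholders.
sample_dataframe.loc[0] = [-120.0, 35.0, 1000.0]

print(sample_dataframe)
print(sample_dataframe.dtypes)  # column types after the assignment
```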


## Basic Operation: Create new DataFrames from others

It is quite common to create a new `DataFrame` from an existing one by selecting a subset of its columns.

The following example creates a new `DataFrame` from the existing one.


In [18]:
MyNewDataFrame=california_housing_dataframe[['longitude','latitude']]

In [19]:
MyNewDataFrame

Unnamed: 0,longitude,latitude
0,this is a new value,this is a new value
1,-114.47,34.4
2,-114.56,33.69
3,-114.57,33.64
4,-114.57,33.57
...,...,...
16995,-124.26,40.58
16996,-124.27,40.69
16997,-124.3,41.84
16998,-124.3,41.8


In [20]:
YetAnotherDataFrame=california_housing_dataframe[['total_rooms','total_bedrooms','population']]

In [21]:
YetAnotherDataFrame

Unnamed: 0,total_rooms,total_bedrooms,population
0,this is a new value,this is a new value,this is a new value
1,7650.0,1901.0,100
2,720.0,174.0,100
3,1501.0,337.0,100
4,1454.0,326.0,100
...,...,...,...
16995,2217.0,394.0,100
16996,2349.0,528.0,100
16997,2677.0,531.0,100
16998,2672.0,552.0,100
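
Selecting with a list of column names (double brackets) returns a `DataFrame`, while a single name returns a `Series`. If the new `DataFrame` will be modified later, adding `.copy()` makes it explicitly independent of the original. The sketch below assumes a small stand-in `DataFrame`; the variable names are illustrative.

```python
import pandas as pd

# Stand-in for the housing data: three columns from the first two rows.
sample_dataframe = pd.DataFrame({
    "longitude": [-114.31, -114.47],
    "latitude": [34.19, 34.4],
    "total_rooms": [5612.0, 7650.0],
})

one_column = sample_dataframe["longitude"]  # single name: a Series

# List of names: a DataFrame. .copy() gives it its own data.
coordinates = sample_dataframe[["longitude", "latitude"]].copy()

# Changing the copy does not touch the original.
coordinates["latitude"] = 0.0

print(type(one_column).__name__)
print(sample_dataframe["latitude"].tolist())
```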


## Basic Operation: Drop columns from a DataFrame

It is quite common to drop columns from a `DataFrame` to save memory and keep things simple.

We use the method `DataFrame.drop()`, which returns a new `DataFrame` and leaves the original unchanged.

The following example drops a column from an existing `DataFrame`:

In [29]:
# We drop the column named 'longitude'. The argument axis=1 tells drop() to look
# for the label among the columns (axis=0 would look among the row labels).
california_housing_dataframe.drop(['longitude'],axis=1)

Unnamed: 0,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,this is a new value,this is a new value,this is a new value,this is a new value,this is a new value,this is a new value,this is a new value,this is a new value
1,34.4,19.0,7650.0,1901.0,100,463.0,1.82,80100.0
2,33.69,17.0,720.0,174.0,100,117.0,1.6509,85700.0
3,33.64,14.0,1501.0,337.0,100,226.0,3.1917,73400.0
4,33.57,20.0,1454.0,326.0,100,262.0,1.925,65500.0
...,...,...,...,...,...,...,...,...
16995,40.58,52.0,2217.0,394.0,100,369.0,2.3571,111400.0
16996,40.69,36.0,2349.0,528.0,100,465.0,2.5179,79000.0
16997,41.84,17.0,2677.0,531.0,100,456.0,3.0313,103600.0
16998,41.8,19.0,2672.0,552.0,100,478.0,1.9797,85800.0
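
The output above is the return value of `drop()`. By default the call does not modify the `DataFrame` it is called on, which the following sketch checks on a small stand-in `DataFrame` (the names `sample_dataframe` and `result` are illustrative).

```python
import pandas as pd

# Stand-in for the housing data: two columns from the first two rows.
sample_dataframe = pd.DataFrame({
    "longitude": [-114.31, -114.47],
    "latitude": [34.19, 34.4],
})

# drop() returns a new DataFrame without the listed column.
result = sample_dataframe.drop(["longitude"], axis=1)

print(list(result.columns))            # ['latitude']
print(list(sample_dataframe.columns))  # ['longitude', 'latitude']
```

This is why the result has to be stored in a variable if it is needed later.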


The following example drops a column from an existing `DataFrame` and saves the result as a new `DataFrame`:

In [30]:
DataFrameWithDroppedColumn=california_housing_dataframe.drop(['longitude'],axis=1)

In [31]:
DataFrameWithDroppedColumn

Unnamed: 0,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,this is a new value,this is a new value,this is a new value,this is a new value,this is a new value,this is a new value,this is a new value,this is a new value
1,34.4,19.0,7650.0,1901.0,100,463.0,1.82,80100.0
2,33.69,17.0,720.0,174.0,100,117.0,1.6509,85700.0
3,33.64,14.0,1501.0,337.0,100,226.0,3.1917,73400.0
4,33.57,20.0,1454.0,326.0,100,262.0,1.925,65500.0
...,...,...,...,...,...,...,...,...
16995,40.58,52.0,2217.0,394.0,100,369.0,2.3571,111400.0
16996,40.69,36.0,2349.0,528.0,100,465.0,2.5179,79000.0
16997,41.84,17.0,2677.0,531.0,100,456.0,3.0313,103600.0
16998,41.8,19.0,2672.0,552.0,100,478.0,1.9797,85800.0


We can also drop several columns at the same time:

In [35]:
MiniDataFrame=california_housing_dataframe.drop(['longitude','latitude','housing_median_age','total_rooms','population','median_house_value'],axis=1)

In [36]:
MiniDataFrame

Unnamed: 0,total_bedrooms,households,median_income
0,this is a new value,this is a new value,this is a new value
1,1901.0,463.0,1.82
2,174.0,117.0,1.6509
3,337.0,226.0,3.1917
4,326.0,262.0,1.925
...,...,...,...
16995,394.0,369.0,2.3571
16996,528.0,465.0,2.5179
16997,531.0,456.0,3.0313
16998,552.0,478.0,1.9797
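
When most columns are being removed, two alternatives may read more clearly: the `columns=` keyword of `drop()`, which replaces `axis=1`, or selecting the columns to keep. The sketch below assumes a one-row stand-in `DataFrame` and shows that both routes give the same result here.

```python
import pandas as pd

# Stand-in for the housing data: four columns from the first row.
sample_dataframe = pd.DataFrame({
    "longitude": [-114.31],
    "latitude": [34.19],
    "total_bedrooms": [1283.0],
    "households": [472.0],
})

# columns= names the columns to remove, with no need for axis=1.
dropped = sample_dataframe.drop(columns=["longitude", "latitude"])

# Selecting the columns to keep is the complementary approach.
selected = sample_dataframe[["total_bedrooms", "households"]]

print(dropped.equals(selected))
```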
