# <font color='tomato' style="font-size:40px"><center><b>Introduction to Pandas</b></center></font>


* We have already seen several ways to structure data in Python. First we dealt with lists. Then we talked about NumPy arrays. Now we discuss **Pandas**. Recall that all elements of a NumPy array must be of the same type (if they originally are not, they are converted to a common type).

* On the other hand, in databases and spreadsheets, data are usually of different types (strings, floats, etc.). Pandas provides functionality similar to NumPy's, but with this additional flexibility. Pandas DataFrames are very similar in concept to data frame objects in R, which you are just learning about, and are essentially like **Excel on steroids**.

* Another cool thing is that both exporting and importing data are much simpler with Pandas than with plain Python. With Pandas we get the tools that are most commonly used when working with data in Python in real-life situations.
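
The difference in type handling can be seen in a short sketch. The tickers, prices and share counts below are made up purely for illustration:

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single type, so mixed input is converted to strings.
mixed = np.array([1, 2.5, "three"])
print(mixed.dtype)

# A DataFrame keeps a separate type for every column.
portfolio = pd.DataFrame({
    "ticker": ["AAA", "BBB"],   # text
    "price": [10.5, 20.0],      # floats
    "shares": [100, 200],       # integers
})
print(portfolio.dtypes)
```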

## <font color='orange' style="font-size:25px"><b>Installing packages</b></font>

We are going to use two packages:
* <font color='mediumseagreen'><b>Pandas</b></font>
* <font color='mediumseagreen'><b>Pandas-DataReader</b></font>

Google Colab users already have them installed! However, if you use any other platform, you need to install them first. To install them for your desktop Jupyter notebook, just type the following commands in the command prompt:
<br><br>
<b>`pip install pandas-datareader`</b> and

 <b>`pip install pandas`</b>

In [1]:
%pip install pandas-datareader



In [2]:
%pip install pandas




Since <font color='mediumseagreen'><b>YFinance</b></font> doesn't come with Colab by default, you have to install it. For that we use the magic command:
<br> <font color='Plum'><b>%</b></font>`pip install`

In [3]:
%pip install yfinance



Once all packages are properly installed, you should import them into your notebook.
* PS: note that the last line of the code below uses <font color='mediumseagreen'><b>YFinance</b></font> to patch <font color='mediumseagreen'><b>Pandas-DataReader</b></font>

In [4]:
import pandas as pd
import numpy as np
from pandas_datareader import data as dr
import yfinance as yfin
yfin.pdr_override()

## <font color='orange' style="font-size:25px"><b>Importing and displaying data with Pandas</b></font>

Up to this point we have used plain Python to import data, i.e. the function <font color='DodgerBlue'><b>open</b></font> in combination with the string method <font color='DeepPink'><b>split</b></font> or regex to transform one large string into a list of numbers that we can operate on. With <font color='mediumseagreen'><b>Pandas</b></font>, importing data becomes much simpler.

The function <font color='DodgerBlue'><b>pd.read_csv</b></font> has only one required argument: the path to the file that you want to import (a string). If the data file is in the same directory as the .ipynb file into which you want to import it, you only have to give the file name with its extension.
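
As a minimal sketch of how this works, the snippet below reads a small CSV text held in memory instead of a file on disk (<font color='DodgerBlue'><b>pd.read_csv</b></font> accepts file-like objects as well as paths). The tickers AAA and BBB and their prices are hypothetical:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """Date,AAA,BBB
1/2/2013,10.0,20.5
1/3/2013,10.2,20.1
1/4/2013,10.1,20.8
"""

# The first line of the text becomes the header of the table.
demo = pd.read_csv(io.StringIO(csv_text))
print(demo)
print(demo.shape)  # (rows, columns)
```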

> Import: "fivepricesNew.csv". Recall that these are daily prices of various stocks and the S&P 500 index. As usual, Google Colab users first have to upload the file:

In [5]:
from google.colab import files
uploaded1 = files.upload()

Saving fivepricesNew.csv to fivepricesNew.csv


Now import data into <font color='mediumseagreen'><b>Pandas</b></font>:

In [6]:
df1=pd.read_csv("fivepricesNew.csv") #data is imported and placed into data frame called df1

When we call the variable *df1*, the first and last few rows are displayed by default:

In [7]:
df1

Unnamed: 0,Date,AMZN,YHOO,IBM,AAPL,^GSPC
0,1/2/2013,257.309998,20.080000,196.350006,78.432899,1462.42
1,1/3/2013,258.480011,19.780001,195.270004,77.442299,1459.37
2,1/4/2013,259.149994,19.860001,193.990005,75.285698,1466.47
3,1/7/2013,268.459015,19.400000,193.139999,74.842903,1461.89
4,1/8/2013,266.380005,19.660000,192.869995,75.044296,1457.15
...,...,...,...,...,...,...
499,12/24/2014,303.029999,50.650002,161.820007,112.010002,2081.88
500,12/26/2014,309.089996,50.860001,162.339996,113.989998,2088.77
501,12/29/2014,312.040008,50.529999,160.509995,113.910004,2090.57
502,12/30/2014,310.299988,51.220001,160.050003,112.519997,2080.35


If you want to see only the first 5 rows, you can use the method <font color='DeepPink'><b>head</b></font>:

In [8]:
df1.head() #by default display just the first 5 rows

Unnamed: 0,Date,AMZN,YHOO,IBM,AAPL,^GSPC
0,1/2/2013,257.309998,20.08,196.350006,78.432899,1462.42
1,1/3/2013,258.480011,19.780001,195.270004,77.442299,1459.37
2,1/4/2013,259.149994,19.860001,193.990005,75.285698,1466.47
3,1/7/2013,268.459015,19.4,193.139999,74.842903,1461.89
4,1/8/2013,266.380005,19.66,192.869995,75.044296,1457.15


The number of initial rows to display can be specified as the argument of the <font color='DeepPink'><b>head</b></font> method:

In [9]:
df1.head(10)

Unnamed: 0,Date,AMZN,YHOO,IBM,AAPL,^GSPC
0,1/2/2013,257.309998,20.08,196.350006,78.432899,1462.42
1,1/3/2013,258.480011,19.780001,195.270004,77.442299,1459.37
2,1/4/2013,259.149994,19.860001,193.990005,75.285698,1466.47
3,1/7/2013,268.459015,19.4,193.139999,74.842903,1461.89
4,1/8/2013,266.380005,19.66,192.869995,75.044296,1457.15
5,1/9/2013,266.350006,19.33,192.320007,73.871399,1461.02
6,1/10/2013,265.339996,18.99,192.880005,74.787102,1472.12
7,1/11/2013,267.940002,19.290001,194.449997,74.328598,1472.05
8,1/14/2013,272.730011,19.43,192.619995,71.678596,1470.68
9,1/15/2013,271.899994,19.52,192.5,69.417099,1472.34


The method <font color='DeepPink'><b>tail</b></font> works similarly for the last rows:

In [10]:
df1.tail(8)

Unnamed: 0,Date,AMZN,YHOO,IBM,AAPL,^GSPC
496,12/19/2014,299.899994,50.880001,158.509995,111.779999,2070.65
497,12/22/2014,306.540008,51.150002,161.440002,112.940002,2078.54
498,12/23/2014,306.285004,50.02,162.240005,112.540001,2082.17
499,12/24/2014,303.029999,50.650002,161.820007,112.010002,2081.88
500,12/26/2014,309.089996,50.860001,162.339996,113.989998,2088.77
501,12/29/2014,312.040008,50.529999,160.509995,113.910004,2090.57
502,12/30/2014,310.299988,51.220001,160.050003,112.519997,2080.35
503,12/31/2014,310.350006,50.509998,160.440002,110.379997,2058.9


* <font color='mediumseagreen'><b>Pandas</b></font> by default takes the first row of the data set as the **table headings**. The table has an additional column called the **index**. By default, it is the ordinal number of each row in the data set and serves as an ID number for each row.

* The index can be used to uniquely determine which rows the user wants to extract from the table when performing transformations of the data set.

* We can also use as the index another variable that uniquely identifies rows. In finance, price information is typically ordered in time, so the variable "Date" can identify each row. To set a column as the index, we use the optional argument **index_col** when importing data.

By default, dates are imported as strings. More advanced analysis of dates can be done only if they are transformed into <font color='mediumseagreen'><b>DateTime</b></font> objects.

You can transform dates from strings into <font color='mediumseagreen'><b>DateTime</b></font> objects by using the optional argument **parse_dates** when importing the data.
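
The effect of these two arguments can be sketched on a small in-memory CSV (hypothetical tickers and prices, not the course file):

```python
import io
import pandas as pd

# Hypothetical CSV content with dates written as month/day/year.
csv_text = """Date,AAA,BBB
1/2/2013,10.0,20.5
1/3/2013,10.2,20.1
"""

# Dates used as the index, but left as plain strings.
as_strings = pd.read_csv(io.StringIO(csv_text), index_col="Date")

# Dates used as the index and converted to DateTime objects.
as_dates = pd.read_csv(io.StringIO(csv_text), index_col="Date", parse_dates=True)

print(type(as_strings.index))
print(type(as_dates.index))
```

With the parsed version, the index labels can be compared and sliced as dates rather than as text.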

> Let's redefine our data set so that the index column is now "Date":

In [11]:
df2 = pd.read_csv("fivepricesNew.csv", index_col="Date",parse_dates=True)
df2.head()

Date,AMZN,YHOO,IBM,AAPL,^GSPC
2013-01-02,257.309998,20.08,196.350006,78.432899,1462.42
2013-01-03,258.480011,19.780001,195.270004,77.442299,1459.37
2013-01-04,259.149994,19.860001,193.990005,75.285698,1466.47
2013-01-07,268.459015,19.4,193.139999,74.842903,1461.89
2013-01-08,266.380005,19.66,192.869995,75.044296,1457.15


To import data from an Excel file you can use the function <font color='DodgerBlue'><b>pd.read_excel</b></font>. It has only one required argument: the path (or name) of the file that you want to import.

> Let's use it to import data from the file "threereturns.xlsx". These are daily returns on 3 gaming stocks. Again, Google Colab users have to upload the file first:

In [12]:
from google.colab import files
uploaded2 = files.upload()

Saving threereturns.xlsx to threereturns.xlsx


In [13]:
df3 = pd.read_excel("threereturns.xlsx")
df3.head()

Unnamed: 0,Date,Ubisoft,Capcom,Electronic Arts
0,2019-07-02,0.010918,-0.002745,0.018139
1,2019-07-03,0.041296,0.006685,-0.044879
2,2019-07-05,-0.018914,0.020508,-0.045969
3,2019-07-08,-0.012438,0.000957,-0.001709
4,2019-07-09,0.002519,-0.032505,-0.014876


For the most part, it works like <font color='DodgerBlue'><b>pd.read_csv</b></font>. However, if there are several worksheets in the Excel file (which is not the case here), you can specify which sheet to import data from (by default the first sheet, i.e. sheet 0 in Python terminology, is used).

Although the argument is called **sheet_name**, you can give either the ordinal number of the sheet that you want to import (numbering starts from 0) or a string with the exact name of that sheet.

Now we specify that we want to import data from the first sheet. As before, we set the column "Date" as the index column. The argument **usecols** determines which columns to import (here a string of capital letters corresponding to Excel columns).

**dtype** is a dictionary whose keys are column names and whose values are the data types that you want to set (if you don't specify them explicitly, <font color='mediumseagreen'><b>Pandas</b></font> will infer them for you).

In [14]:
df4 = pd.read_excel('threereturns.xlsx',index_col='Date',sheet_name=0,usecols='A,C',dtype={'Date':str,'Capcom':float})
df4

Date,Capcom
2019-07-02,-0.002745
2019-07-03,0.006685
2019-07-05,0.020508
2019-07-08,0.000957
2019-07-09,-0.032505
...,...
2020-06-23,0.031746
2020-06-24,-0.025995
2020-06-25,0.018519
2020-06-26,-0.001604


To compare the case where dates are imported as strings with the case where they are imported as <font color='mediumseagreen'><b>DateTime</b></font> objects, we have created this data set (*df4*) with dates given as strings.

### <font color='MediumVioletRed' style="font-size:20px"><b>Importing data from the Internet using DataReader</b></font>

<font color='mediumseagreen'><b>Pandas-DataReader</b></font> retrieves data from commonly used online sources and stores them in <font color='mediumseagreen'><b>Pandas</b></font> objects. Very useful indeed.

Some data collections supported by this library:
*  <a href="https://finance.yahoo.com/">Yahoo! Finance</a>
* <a href="https://www.google.com/search?client=firefox-b-d&q=Tiingo">Tiingo</a>
* <a href="https://data.worldbank.org/">World Bank</a>
* <a href="https://data.oecd.org/">OECD</a>
* <a href="https://ec.europa.eu/eurostat/data/database">Eurostat</a>
* <a href="https://fred.stlouisfed.org/">Federal Reserve Bank of St. Louis (FRED)</a>
* <a href="https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html">Ken French Data Library</a>
* and many others

From *Yahoo! Finance* one can easily acquire historical data for different stocks and indexes. To do that, use <font color='DodgerBlue'><b>dr.get_data_yahoo</b></font>. You need to specify the ticker symbol of the index or stock that you want to import, as well as the starting and ending dates of your sample.

Let us use this function to import daily data on Google stock for the period from Jan 1st to May 1st of 2022. Note that we write the year first, then the month, then the day.

In [15]:
goog=dr.get_data_yahoo('goog',start='2022-01-01',end='2022-05-01')
goog.head()

[*********************100%***********************]  1 of 1 completed


Date,Open,High,Low,Close,Adj Close,Volume
2022-01-03,144.475494,145.550003,143.502502,145.074493,145.074493,25214000
2022-01-04,145.550507,146.610001,143.816147,144.416504,144.416504,22928000
2022-01-05,144.181,144.298004,137.523499,137.653503,137.653503,49642000
2022-01-06,137.497498,139.686005,136.763504,137.550995,137.550995,29050000
2022-01-07,137.904999,138.254745,135.789001,137.004501,137.004501,19408000


<font color='crimson'>
A new feature in this version of Google Colab adds an interactive button with a magic wand on it next to the displayed table. If you click it, your static <font color='MediumSeaGreen'><b>Pandas</b></font> table is transformed into an interactive table. You then have two additional interactive objects:
</font>

* <font color='crimson'>
    A drop-down menu in the bottom left corner of the table, with which you can choose how many rows you want to see. It has no effect on <font color='MediumSeaGreen'><b>Pandas</b></font> tables unless you increase the number of rows displayed with <font color='DeepPink'><b>head</b></font> or display the entire table (i.e. don't use the <font color='DeepPink'><b>head</b></font> method).
    </font>
* <font color='crimson'>
    The second one is the <button>Filter</button> button, which appears in the upper right corner of your table. Clicking it opens a form view in which you can state different criteria by which to filter the table. There is an interactive form field for each column of your table, in which you enter the filtering criteria.
    </font>

<font color='crimson'>
Display the entire table <i>goog</i> in the next code cell and see that you can control the number of rows displayed in this way:
    </font>

In [16]:
goog

Date,Open,High,Low,Close,Adj Close,Volume
2022-01-03,144.475494,145.550003,143.502502,145.074493,145.074493,25214000
2022-01-04,145.550507,146.610001,143.816147,144.416504,144.416504,22928000
2022-01-05,144.181000,144.298004,137.523499,137.653503,137.653503,49642000
2022-01-06,137.497498,139.686005,136.763504,137.550995,137.550995,29050000
2022-01-07,137.904999,138.254745,135.789001,137.004501,137.004501,19408000
...,...,...,...,...,...,...
2022-04-25,119.429497,123.278000,118.769249,123.250000,123.250000,34522000
2022-04-26,122.750000,122.750000,119.161850,119.505997,119.505997,49394000
2022-04-27,114.373001,117.500000,113.124252,115.020500,115.020500,62238000
2022-04-28,117.114998,120.438499,115.143898,119.411499,119.411499,36790000


*FRED (Federal Reserve Economic Data)* has a large collection of economic indicators, primarily for the USA (like inflation, interest rates, GDP and so on). To import data from FRED, use the <font color='DodgerBlue'><b>dr.get_data_fred</b></font> function. You need to know the symbol of the appropriate macroeconomic indicator.

Import monthly data on 4-Week Treasury Bill interest rates (symbol 'TB4WK') for the period from January 2020 to May 2021 (these are yields in percentage points, annualized):

In [17]:
tb=dr.get_data_fred('TB4WK',start='2020-01-01',end='2021-05-01')
tb

DATE,TB4WK
2020-01-01,1.5
2020-02-01,1.55
2020-03-01,0.36
2020-04-01,0.11
2020-05-01,0.1
2020-06-01,0.13
2020-07-01,0.11
2020-08-01,0.08
2020-09-01,0.09
2020-10-01,0.09


* The *Kenneth R. French - Data Library* website provides the data needed to run Fama-French factor regressions. Use <font color='DodgerBlue'><b>dr.get_data_famafrench</b></font>. Besides the starting and ending dates, you need the name of the data set that you want to import.

* In investments, it is very common to look for the most important, salient risk factors that drive returns on financial assets. One very commonly used set of factors is the so-called Fama-French factors.

* The data set which contains daily data for the basic factors in the Fama-French model is called "F-F_Research_Data_Factors_daily". Import data from it for the period from January 1st to May 1st of 2021. Below we have daily data, in percentage points:

In [18]:
ff=dr.get_data_famafrench('F-F_Research_Data_Factors_daily',start='2021-01-01',end='2021-05-01')[0]
ff

Date,Mkt-RF,SMB,HML,RF
2021-01-04,-1.41,0.22,0.58,0.0
2021-01-05,0.86,1.23,0.48,0.0
2021-01-06,0.79,2.14,3.93,0.0
2021-01-07,1.76,0.33,-0.83,0.0
2021-01-08,0.51,-0.75,-1.38,0.0
...,...,...,...,...
2021-04-26,0.43,0.87,-0.50,0.0
2021-04-27,-0.03,-0.16,0.84,0.0
2021-04-28,-0.06,0.20,0.19,0.0
2021-04-29,0.39,-1.14,1.09,0.0


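Please note that by default the function returns a dictionary that contains both the data set and the description which comes with it. Here we used the key [0] to extract only the data set. Its columns are risk factors that are supposed to explain the bulk of asset returns, together with the risk-free rate:

- Mkt-RF: excess return of the market portfolio over the risk-free rate
- SMB: excess return of small vs large market cap stocks
- HML: excess return of value stocks (high book-to-market ratio) vs growth stocks (low book-to-market ratio)
- RF: risk-free rate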

## <font color='orange' style="font-size:25px"><b>Two basic types of data objects in Pandas</b></font>

Now that we have learned how to import data properly, we are ready to discuss the objects which are used to store data in <font color='mediumseagreen'><b>Pandas</b></font>. There are two basic objects (i.e. data structures):
* <font color='DodgerBlue'><b>pd.DataFrame</b></font>: 2-dimensional data object with two indexes (rows and columns). This is tabular data organized in columns (each column is one series).

* <font color='DodgerBlue'><b>pd.Series</b></font>: 1-dimensional data object with a single index (one column, many rows). This is one time series (i.e. one column).

### <font color='MediumVioletRed' style="font-size:20px"><b>DataFrame class</b></font>

At the core of pandas is <font color='DodgerBlue'><b>pd.DataFrame</b></font>, a class designed to efficiently handle data in tabular form — i.e., data organized in columns.

It provides column labeling and flexible indexing capabilities for the rows (records) of the data set, similar to a table in a relational database or an Excel spreadsheet.

> Let us create one data set from scratch. It will contain two columns with stock prices. The header will be stock names while each row will be indexed by the dates.

When you define a <font color='DodgerBlue'><b>pd.DataFrame</b></font>, you can specify the following arguments:
1. **data** - the data to store. It is typically a list of lists where each sub-list represents one row (a tuple, a dictionary or even <font color='mediumseagreen'><b>NumPy</b></font>'s <font color='DodgerBlue'><b>np.array</b></font> can be used as well).
2. **columns** - optional argument (if omitted, labels 0,1,2... will be used as the header). Sets the header of the table.
3. **index** - optional argument (if omitted, rows will be indexed by 0,1,2...). Sets the values with which you want to index the rows of the table.

In [19]:
df5= pd.DataFrame([[100,90,106],[120,88,111],[130,81,103],[135,93,95]],
                  columns=['stock_A','stock_B','stock_C'],
                  index=['day1','day2','day3','day4'])
df5

Unnamed: 0,stock_A,stock_B,stock_C
day1,100,90,106
day2,120,88,111
day3,130,81,103
day4,135,93,95
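
The other input types accepted by **data** behave slightly differently, which the following sketch illustrates on a made-up two-day table. With a dictionary, each key becomes a column label rather than a row:

```python
import numpy as np
import pandas as pd

days = ["day1", "day2"]

# Dictionary input: each key becomes a column label.
from_dict = pd.DataFrame({"stock_A": [100, 120], "stock_B": [90, 88]}, index=days)

# NumPy input: each row of the array becomes a row of the table.
from_array = pd.DataFrame(np.array([[100, 90], [120, 88]]),
                          columns=["stock_A", "stock_B"], index=days)

# Both tables hold the same values under the same labels.
print((from_dict.values == from_array.values).all())
```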


We can retrieve the elements used to define an instance of <font color='DodgerBlue'><b>pd.DataFrame</b></font> in the same way as with any other class, by accessing its attributes. Let's extract the column names (i.e. the header):

In [20]:
df5.columns

Index(['stock_A', 'stock_B', 'stock_C'], dtype='object')

Notice that the result is an object called **Index**. It is often more convenient to work with a plain list, so we can convert it. Now try to extract the row labels and convert the resulting object into a list:

In [21]:
list(df5.index)

['day1', 'day2', 'day3', 'day4']
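
A few more attributes and Index operations are sketched below on a small made-up table (the variable name `toy` is only for illustration):

```python
import pandas as pd

toy = pd.DataFrame([[100, 90], [120, 88]],
                   columns=["stock_A", "stock_B"], index=["day1", "day2"])

print(toy.shape)                  # (number of rows, number of columns)
print(toy.columns.tolist())       # same result as list(toy.columns)
print("stock_A" in toy.columns)   # membership test works on an Index directly
print(toy.index.get_loc("day2"))  # position of a label within the index
```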

### <font color='MediumVioletRed' style="font-size:20px"><b>Series class</b></font>

<font color='DodgerBlue'><b>pd.Series</b></font> is a data set with only one column. Its name (Series) comes from the fact that if you use dates as the index (which we often do in finance), you get a list of date-value pairs, which is a time series.

<font color='DodgerBlue'><b>pd.Series</b></font> can be constructed in almost the same way as <font color='DodgerBlue'><b>pd.DataFrame</b></font>. The argument **data** is now a simple list (or another appropriate data type), while instead of **columns** we have the argument **name**, which labels the column. The argument **index** stays the same.

> Let's create a <font color='DodgerBlue'><b>pd.Series</b></font> for first stock (stock A) data:

In [22]:
s1 = pd.Series([100,120,130,135], name="stock_A",index=['day1','day2','day3','day4'])
s1

day1    100
day2    120
day3    130
day4    135
Name: stock_A, dtype: int64

Although they are constructed in the same way, their output looks a little bit different. The output is now printed as plain text, i.e. it isn't shown as a table. Furthermore, the name of the column is given below the printed pairs, as a footnote.
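
As a small sketch with made-up prices, a <font color='DodgerBlue'><b>pd.Series</b></font> can also be built from a dictionary (the keys become the index), and its **name** becomes the column label when it is turned into a one-column table:

```python
import pandas as pd

# Dictionary keys become the index labels of the Series.
prices = pd.Series({"day1": 100, "day2": 120, "day3": 130}, name="stock_A")

# A single value is looked up by its index label.
print(prices["day2"])

# Converting to a DataFrame uses the Series name as the column header.
frame = prices.to_frame()
print(frame.columns.tolist())
```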

## <font color='orange' style="font-size:25px"><b>Extracting parts of a pandas object</b></font>

A big advantage of <font color='DodgerBlue'><b>pd.DataFrame</b></font>s is that they are more flexible to work with than other data types which can be used for storing data (for example <font color='mediumseagreen'><b>NumPy</b></font>'s <font color='DodgerBlue'><b>np.array</b></font>). At the same time, <font color='mediumseagreen'><b>Pandas</b></font> is built on top of <font color='mediumseagreen'><b>NumPy</b></font>, so many operations on <font color='DodgerBlue'><b>pd.DataFrame</b></font> objects remain computationally efficient.

> Assume that you want to extract data about AMZN stock from **df2**. As with any dictionary, you just need to put the name of the column (the dictionary key) in square brackets after the name of the data frame, and you will get only the data from the desired column:

In [23]:
df2['AMZN'] #Selecting Amazon stock from the df2 data frame (selects a column of data). We obtain a Series object if we select just one column

Date
2013-01-02    257.309998
2013-01-03    258.480011
2013-01-04    259.149994
2013-01-07    268.459015
2013-01-08    266.380005
                 ...    
2014-12-24    303.029999
2014-12-26    309.089996
2014-12-29    312.040008
2014-12-30    310.299988
2014-12-31    310.350006
Name: AMZN, Length: 504, dtype: float64

If we want to select 2 columns, for example, we specify a list of column names:

> Extract two specified columns:

In [24]:
df2[['AMZN','YHOO']]

Date,AMZN,YHOO
2013-01-02,257.309998,20.080000
2013-01-03,258.480011,19.780001
2013-01-04,259.149994,19.860001
2013-01-07,268.459015,19.400000
2013-01-08,266.380005,19.660000
...,...,...
2014-12-24,303.029999,50.650002
2014-12-26,309.089996,50.860001
2014-12-29,312.040008,50.529999
2014-12-30,310.299988,51.220001


Suppose we want to select data corresponding to **a single date** (i.e. a row). If we know what date it is, we can use the <font color='DeepPink'><b>loc</b></font> indexer. The argument is the index label, in this case the date. The result in this case is a <font color='DodgerBlue'><b>pd.Series</b></font>, not a (smaller) <font color='DodgerBlue'><b>pd.DataFrame</b></font>!

In [25]:
df2.loc['2013-01-02'] #this selects data for January 2nd, 2013

AMZN      257.309998
YHOO       20.080000
IBM       196.350006
AAPL       78.432899
^GSPC    1462.420000
Name: 2013-01-02 00:00:00, dtype: float64

As in the case when you extract a single column, the output of a single row extraction is a <font color='DodgerBlue'><b>pd.Series</b></font> (but in this case its index consists of the column names). However, if you extract multiple rows, the output will, of course, be another <font color='DodgerBlue'><b>pd.DataFrame</b></font>. In that case you have to give a list of all the row labels that you want to extract instead of a single row label:

In [26]:
df2.loc[['2013-01-02','2013-01-03']]

Date,AMZN,YHOO,IBM,AAPL,^GSPC
2013-01-02,257.309998,20.08,196.350006,78.432899,1462.42
2013-01-03,258.480011,19.780001,195.270004,77.442299,1459.37


In [27]:
df2.loc['2013-01-04':'2013-01-08']  # This selects data from January 4th to January 8th, 2013

Date,AMZN,YHOO,IBM,AAPL,^GSPC
2013-01-04,259.149994,19.860001,193.990005,75.285698,1466.47
2013-01-07,268.459015,19.4,193.139999,74.842903,1461.89
2013-01-08,266.380005,19.66,192.869995,75.044296,1457.15


Suppose you want to extract data from the third row, but **you cannot remember** the exact index label of that row. For this we use the <font color='DeepPink'><b>iloc</b></font> indexer. You have to give the position of the row that you want to extract in square brackets (counting starts from 0). Let's use it to extract the third row from the same data set:

In [28]:
df2.iloc[2]

AMZN      259.149994
YHOO       19.860001
IBM       193.990005
AAPL       75.285698
^GSPC    1466.470000
Name: 2013-01-04 00:00:00, dtype: float64

We can use the same notation as with NumPy arrays. Suppose we want to select data from row 5 to row 10.


In [29]:
df2.iloc[4:10] #Recall the notation

Date,AMZN,YHOO,IBM,AAPL,^GSPC
2013-01-08,266.380005,19.66,192.869995,75.044296,1457.15
2013-01-09,266.350006,19.33,192.320007,73.871399,1461.02
2013-01-10,265.339996,18.99,192.880005,74.787102,1472.12
2013-01-11,267.940002,19.290001,194.449997,74.328598,1472.05
2013-01-14,272.730011,19.43,192.619995,71.678596,1470.68
2013-01-15,271.899994,19.52,192.5,69.417099,1472.34



When we extract columns from <font color='DodgerBlue'><b>np.array</b></font>, we need to give two dimensions in square brackets (one for rows and the other for columns). The same works here. Extract the whole second column from the same data set using only its position:

In [30]:
df2.iloc[:,1] #These are Yahoo! stock prices

Date
2013-01-02    20.080000
2013-01-03    19.780001
2013-01-04    19.860001
2013-01-07    19.400000
2013-01-08    19.660000
                ...    
2014-12-24    50.650002
2014-12-26    50.860001
2014-12-29    50.529999
2014-12-30    51.220001
2014-12-31    50.509998
Name: YHOO, Length: 504, dtype: float64

Thus, everything you have learned about slicing NumPy arrays can be used with Pandas as well.
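
One difference is worth keeping in mind: a label-based slice with <font color='DeepPink'><b>loc</b></font> includes both end points, while a position-based slice with <font color='DeepPink'><b>iloc</b></font> excludes the end position, as in NumPy. A minimal sketch on a made-up table (hypothetical tickers AAA and BBB):

```python
import pandas as pd

dates = pd.to_datetime(["2013-01-02", "2013-01-03", "2013-01-04", "2013-01-07"])
toy = pd.DataFrame({"AAA": [10.0, 10.2, 10.1, 10.4],
                    "BBB": [20.5, 20.1, 20.8, 21.0]}, index=dates)

# Label-based slice: both the start and the end label are included.
by_label = toy.loc["2013-01-03":"2013-01-04"]

# Position-based slice: position 3 is excluded, as with NumPy arrays.
by_position = toy.iloc[1:3]

print(by_label.equals(by_position))

# All rows, second column (position 1), selected by position only.
second_column = toy.iloc[:, 1]
print(second_column.name)
```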

### <font color='MediumVioletRed' style="font-size:20px"><b>Conditional data extraction</b></font>

When we analyze a data set, we frequently have to select only the data which satisfy some logical condition. This can be done in the same way as with <font color='mediumseagreen'><b>NumPy</b></font>'s <font color='DodgerBlue'><b>np.array</b></font>.

First let's redefine *df3*, which contains data from the file "threereturns.xlsx". The only difference is that here I want to use the dates as the index, but I will keep them in string format (you will see why):

In [31]:
df3=pd.read_excel('threereturns.xlsx',index_col='Date')
df3.head()

Date,Ubisoft,Capcom,Electronic Arts
2019-07-02,0.010918,-0.002745,0.018139
2019-07-03,0.041296,0.006685,-0.044879
2019-07-05,-0.018914,0.020508,-0.045969
2019-07-08,-0.012438,0.000957,-0.001709
2019-07-09,0.002519,-0.032505,-0.014876


> Assume that we want to select Ubisoft daily returns which were greater than 4%:

In [32]:
df3['Ubisoft'] > 0.04

Date
2019-07-02    False
2019-07-03     True
2019-07-05    False
2019-07-08    False
2019-07-09    False
              ...  
2020-06-23    False
2020-06-24    False
2020-06-25    False
2020-06-26    False
2020-06-29    False
Name: Ubisoft, Length: 251, dtype: bool

The output of the test given above is a <font color='DodgerBlue'><b>pd.Series</b></font> which for each date (index) shows whether the test yields True or False. So, as in the case of <font color='mediumseagreen'><b>NumPy</b></font>'s <font color='DodgerBlue'><b>np.array</b></font>, if we put this output in square brackets after the name of our <font color='DodgerBlue'><b>pd.DataFrame</b></font>, only rows with the value True will be displayed. Let's try that out:

In [33]:
df3[df3['Ubisoft'] > 0.04] # Selects all rows for which the condition is True.

Date,Ubisoft,Capcom,Electronic Arts
2019-07-03,0.041296,0.006685,-0.044879
2019-07-10,0.040201,0.012846,0.02075
2019-10-28,0.109932,-0.026134,0.004257
2019-12-04,0.051852,0.043698,-0.00632
2019-12-12,0.044391,0.06613,0.018314
2019-12-13,0.042504,0.0,-0.005138
2020-01-08,0.041176,0.002907,0.010149
2020-01-28,0.04321,-0.012075,0.01212
2020-03-10,0.045586,0.03,0.024274
2020-03-13,0.042707,0.049497,0.039743


Note that as output you got all rows from our data set that satisfy the given condition. That means that columns which weren't part of the test (Capcom and EA's returns) were shown as well.

If you want to see only Ubisoft's returns, you can extract the column Ubisoft from the previous output:

In [34]:
df3[df3['Ubisoft'] > 0.04]['Ubisoft']

Date
2019-07-03    0.041296
2019-07-10    0.040201
2019-10-28    0.109932
2019-12-04    0.051852
2019-12-12    0.044391
2019-12-13    0.042504
2020-01-08    0.041176
2020-01-28    0.043210
2020-03-10    0.045586
2020-03-13    0.042707
2020-03-17    0.099693
2020-03-24    0.064491
2020-03-26    0.073746
2020-04-27    0.049519
Name: Ubisoft, dtype: float64
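
The same selection can also be written in a single step with <font color='DeepPink'><b>loc</b></font>, giving the row condition and the column name together. A minimal sketch on made-up returns (the column names Alpha and Beta are hypothetical):

```python
import pandas as pd

returns = pd.DataFrame(
    {"Alpha": [0.05, -0.01, 0.02, 0.06], "Beta": [0.01, 0.03, -0.02, -0.04]},
    index=["d1", "d2", "d3", "d4"],
)

mask = returns["Alpha"] > 0.04           # Boolean Series, one value per row

chained = returns[mask]["Alpha"]         # filter rows first, then pick the column
direct = returns.loc[mask, "Alpha"]      # rows and column in one step

print(direct)
print(chained.equals(direct))
```

Both forms give the same result when reading data; the single-step form is usually the safer habit once you start assigning values to a selection.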

Say you are interested in days on which Ubisoft's returns were positive while returns on the other two stocks were negative. Here we have three conditions, and all three of them need to be satisfied in order for a row to be extracted, i.e. all three of them are connected with the operator <font color='DarkGreen'><b>&</b></font>.

Note that when we have multiple conditions, we have to put each of them in parentheses (the operator & binds more tightly than comparison operators such as > and <):

> Find dates for which Ubisoft makes money while the other two companies lose money (on a daily basis)

In [35]:
cond1 = (df3['Ubisoft'] > 0) & (df3['Capcom']<0) & (df3['Electronic Arts'] <0)
df3[cond1]

Date,Ubisoft,Capcom,Electronic Arts
2019-07-09,0.002519,-0.032505,-0.014876
2019-09-11,0.016656,-0.044611,-0.00392
2019-10-31,0.021739,-0.029286,-0.003309
2019-11-13,0.009009,-0.007692,-0.003426
2019-11-26,0.023932,-0.014309,-0.00521
2019-12-02,0.016639,-0.008746,-0.006039
2019-12-27,0.011852,-0.00292,-0.000737
2020-01-02,0.010885,-0.004298,-0.001581
2020-01-09,0.014831,-0.008333,-0.001187
2020-01-21,0.015184,-0.001879,-0.004605
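
Conditions can be combined in other ways too. The sketch below, on made-up returns with hypothetical column names, uses & (and), | (or) and ~ (not), each condition wrapped in its own parentheses:

```python
import pandas as pd

returns = pd.DataFrame(
    {"Alpha": [0.05, -0.01, 0.02, -0.03], "Beta": [-0.01, 0.03, 0.02, -0.02]},
    index=["d1", "d2", "d3", "d4"],
)

# and: Alpha gains while Beta loses
both = (returns["Alpha"] > 0) & (returns["Beta"] < 0)

# or: at least one of the two conditions holds
either = (returns["Alpha"] > 0) | (returns["Beta"] < 0)

# not: neither condition holds
neither = ~either

print(returns[both].index.tolist())
print(returns[either].index.tolist())
print(returns[neither].index.tolist())
```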


Suppose we want to find dates for which all 3 returns were positive. Instead of using the previous method, we can also do it as follows. First we apply the condition to the whole data set. Note that wherever a return is not positive, we get NaN (missing value).

In [36]:
cond2=df3 > 0
df3c = df3[cond2]
df3c

Date,Ubisoft,Capcom,Electronic Arts
2019-07-02,0.010918,,0.018139
2019-07-03,0.041296,0.006685,
2019-07-05,,0.020508,
2019-07-08,,0.000957,
2019-07-09,0.002519,,
...,...,...,...
2020-06-23,0.008382,0.031746,0.006784
2020-06-24,0.004476,,0.002067
2020-06-25,0.009548,0.018519,
2020-06-26,,,0.001453


One can use the method <font color='DeepPink'><b>dropna</b></font> to get rid of the missing data. If you apply it directly with no optional arguments, it returns only the rows which do not contain any missing data, but it doesn't remove anything from the original data set:

In [37]:
df3c.dropna() # This drops rows where at least one element in the row is NaN

Date,Ubisoft,Capcom,Electronic Arts
2019-07-10,0.040201,0.012846,0.02075
2019-08-01,0.001824,0.051025,0.022703
2019-08-06,0.013401,0.030424,0.017045
2019-08-08,0.008833,0.010761,0.03208
2019-08-13,0.00743,0.004643,0.006211
2019-08-16,0.027724,0.029748,0.019117
2019-08-21,0.030856,0.021707,0.014455
2019-10-09,0.002355,0.010303,0.003674
2019-11-08,0.024793,0.008718,0.010696
2019-11-25,0.014744,0.049785,0.00615
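
An alternative route to the same rows, sketched here on made-up returns, is to ask directly whether the condition holds in every column of a row, using the method <font color='DeepPink'><b>all</b></font> with axis=1:

```python
import pandas as pd

returns = pd.DataFrame(
    {"Alpha": [0.05, -0.01, 0.02], "Beta": [0.01, 0.03, -0.02]},
    index=["d1", "d2", "d3"],
)

# Route 1: mask the table (non-positive values become NaN), then drop rows with NaN.
via_dropna = returns[returns > 0].dropna()

# Route 2: keep rows for which the condition is True in all columns.
via_all = returns[(returns > 0).all(axis=1)]

print(via_all)
print(via_dropna.equals(via_all))
```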


Suppose you want to drop only rows where **all** elements are missing. Use the optional argument **how** to specify that. Once again, this won't change your data set! It will only show you what the data set would look like without the rows in which all data are missing:

In [38]:
df3c.dropna(how='all') # This only drops rows where all elements of the row are NaN

Date,Ubisoft,Capcom,Electronic Arts
2019-07-02,0.010918,,0.018139
2019-07-03,0.041296,0.006685,
2019-07-05,,0.020508,
2019-07-08,,0.000957,
2019-07-09,0.002519,,
...,...,...,...
2020-06-22,0.016383,0.027270,
2020-06-23,0.008382,0.031746,0.006784
2020-06-24,0.004476,,0.002067
2020-06-25,0.009548,0.018519,


To make the changes to your data set permanent, you would have to use the optional argument **inplace** and set it to True. But be careful: once you erase rows with missing data, you cannot get them back.
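
The difference between the two behaviours can be sketched on a small made-up table. Keeping a copy before an in-place change is one simple way to stay on the safe side:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    {"Alpha": [0.05, np.nan, np.nan], "Beta": [0.01, 0.03, np.nan]},
    index=["d1", "d2", "d3"],
)

# Without inplace, dropna returns a new table and leaves data untouched.
cleaned = data.dropna(how="all")
print(len(data), len(cleaned))

# Keep a copy, then modify data itself.
backup = data.copy()
data.dropna(inplace=True)   # removes every row that has at least one NaN
print(len(data), len(backup))
```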