# <font color='tomato' style="font-size:40px"><center><b>Introduction to Pandas</b></center></font>


* We have already seen several ways to structure data in Python. First we dealt with lists, then with NumPy arrays. Now we discuss **Pandas**. Recall that all elements of a NumPy array must have the same type (if they originally do not, they are converted to a common type).

* In databases and spreadsheets, on the other hand, data are usually of different types (strings, floats, etc.). Pandas provides functionality similar to NumPy's, but with this additional flexibility. Its DataFrame objects are very similar in concept to data frames in R, which you are just learning about, and are essentially like **Excel on steroids**.

* Another cool thing is that both exporting and importing data are much simpler with Pandas than with plain Python. Pandas thus gives us the tools we use most often when working with Python in real-life situations.
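
To make the contrast between NumPy and Pandas concrete, here is a small sketch. The column names `ticker`, `price` and `volume` and the values are made up for illustration only. NumPy converts mixed inputs to one common type, while a Pandas table keeps a separate type for each column.

```python
import numpy as np
import pandas as pd

# NumPy: mixing a string, a float and an integer forces one common type (text)
arr = np.array(["AMZN", 257.31, 1])
print(arr.dtype)

# Pandas: every column keeps its own type (illustrative column names)
df = pd.DataFrame({"ticker": ["AMZN"], "price": [257.31], "volume": [1]})
print(df.dtypes)
```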

## <font color='orange' style="font-size:25px"><b>Installing packages</b></font>

We are going to use two packages:
* <font color='mediumseagreen'><b>Pandas</b></font>
* <font color='mediumseagreen'><b>Pandas-DataReader</b></font>

Google Colab users already have them installed! On any other platform, however, you need to install them first. To install them for your desktop Jupyter notebook, type the following commands in the command prompt:
<br><br>
<b>`pip install pandas-datareader`</b> and

 <b>`pip install pandas`</b>

In [1]:
%pip install pandas-datareader



In [2]:
%pip install pandas



<font color='crimson'>
    Since Yahoo!Finance has changed its website recently, the current version of <font color='mediumseagreen'><b>Pandas-DataReader</b></font> is unable to download data from it. However, until a new update for this package comes out, we can use the highly specialized package <font color='mediumseagreen'><b>YFinance</b></font> (short for Yahoo!Finance) to download data from this website. We discuss this package in more detail next semester; here we only use it to patch <font color='mediumseagreen'><b>Pandas-DataReader</b></font> so that it can download data from Yahoo!Finance.
</font>


Since <font color='mediumseagreen'><b>YFinance</b></font> doesn't come with Colab by default, you have to install it. For that we use the magic command:
<br> <font color='Plum'><b>%</b></font>`pip install`

In [3]:
%pip install yfinance
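
If you are unsure whether an installation succeeded, one possible check is sketched below. The helper name `is_installed` is hypothetical (it is not part of any of these packages); it only asks Python whether a module with the given import name can be found.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this import name can be found."""
    return importlib.util.find_spec(module_name) is not None


# Import names of the packages used in this notebook
for name in ("pandas", "pandas_datareader", "yfinance"):
    print(name, "installed" if is_installed(name) else "missing")
```

Note that the import name uses an underscore (`pandas_datareader`), while the name given to `pip` uses a hyphen.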



Once all packages are properly installed, you should import them into your notebook.
* PS: note that the last line of the code below uses <font color='mediumseagreen'><b>YFinance</b></font> to patch <font color='mediumseagreen'><b>Pandas-DataReader</b></font>.

In [4]:
import pandas as pd
import numpy as np
from pandas_datareader import data as dr
import yfinance as yfin
yfin.pdr_override()  # patch Pandas-DataReader so that Yahoo!Finance downloads go through YFinance

## <font color='orange' style="font-size:25px"><b>Importing and displaying data with Pandas</b></font>

Up to this point we have used general Python to import data, i.e. the function <font color='DodgerBlue'><b>open</b></font> in combination with the string method <font color='DeepPink'><b>split</b></font> or regex, to transform one large string into a list of numbers we can work with. With <font color='mediumseagreen'><b>Pandas</b></font>, importing data becomes much simpler.

The function <font color='DodgerBlue'><b>pd.read_csv</b></font> has only one required argument: the path to the file you want to import (a string). If the data file is in the same directory as the .ipynb file into which you want to import it, you only have to give the file name with its extension.
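
A minimal sketch of the call is shown below. It assumes a tiny CSV held in memory through `io.StringIO` instead of a file on disk, purely so that the example runs on its own; the variable names `csv_text` and `sample` are illustrative. With a real file you would pass the file name as a string.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk: the first line holds the column names
csv_text = """Date,AMZN,IBM
1/2/2013,257.309998,196.350006
1/3/2013,258.480011,195.270004
"""

# The only required argument is the path (or a file-like object, as here)
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
```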

> Import: "fivepricesNew.csv". Recall that these are daily prices of various stocks and the S&P 500 index. As usual, Google Colab users first have to upload the file:

In [5]:
from google.colab import files
uploaded1 = files.upload()

Saving fivepricesNew.csv to fivepricesNew.csv


Now import data into <font color='mediumseagreen'><b>Pandas</b></font>:

In [6]:
df1 = pd.read_csv("fivepricesNew.csv")  # data is imported and placed into a data frame called df1

When we call the variable *df1*, the first and last few rows are displayed by default:

In [7]:
df1

Unnamed: 0,Date,AMZN,YHOO,IBM,AAPL,^GSPC
0,1/2/2013,257.309998,20.080000,196.350006,78.432899,1462.42
1,1/3/2013,258.480011,19.780001,195.270004,77.442299,1459.37
2,1/4/2013,259.149994,19.860001,193.990005,75.285698,1466.47
3,1/7/2013,268.459015,19.400000,193.139999,74.842903,1461.89
4,1/8/2013,266.380005,19.660000,192.869995,75.044296,1457.15
...,...,...,...,...,...,...
499,12/24/2014,303.029999,50.650002,161.820007,112.010002,2081.88
500,12/26/2014,309.089996,50.860001,162.339996,113.989998,2088.77
501,12/29/2014,312.040008,50.529999,160.509995,113.910004,2090.57
502,12/30/2014,310.299988,51.220001,160.050003,112.519997,2080.35


If you want to see only the first 5 rows, you can use the method <font color='DeepPink'><b>head</b></font>:

In [8]:
df1.head() #by default display just the first 5 rows

Unnamed: 0,Date,AMZN,YHOO,IBM,AAPL,^GSPC
0,1/2/2013,257.309998,20.08,196.350006,78.432899,1462.42
1,1/3/2013,258.480011,19.780001,195.270004,77.442299,1459.37
2,1/4/2013,259.149994,19.860001,193.990005,75.285698,1466.47
3,1/7/2013,268.459015,19.4,193.139999,74.842903,1461.89
4,1/8/2013,266.380005,19.66,192.869995,75.044296,1457.15


You can specify how many initial rows you want to see by passing a number to the <font color='DeepPink'><b>head</b></font> method:

In [10]:
df1.head(10)

Unnamed: 0,Date,AMZN,YHOO,IBM,AAPL,^GSPC
0,1/2/2013,257.309998,20.08,196.350006,78.432899,1462.42
1,1/3/2013,258.480011,19.780001,195.270004,77.442299,1459.37
2,1/4/2013,259.149994,19.860001,193.990005,75.285698,1466.47
3,1/7/2013,268.459015,19.4,193.139999,74.842903,1461.89
4,1/8/2013,266.380005,19.66,192.869995,75.044296,1457.15
5,1/9/2013,266.350006,19.33,192.320007,73.871399,1461.02
6,1/10/2013,265.339996,18.99,192.880005,74.787102,1472.12
7,1/11/2013,267.940002,19.290001,194.449997,74.328598,1472.05
8,1/14/2013,272.730011,19.43,192.619995,71.678596,1470.68
9,1/15/2013,271.899994,19.52,192.5,69.417099,1472.34


The method <font color='DeepPink'><b>tail</b></font> works similarly for the last rows:

In [11]:
df1.tail(8)

Unnamed: 0,Date,AMZN,YHOO,IBM,AAPL,^GSPC
496,12/19/2014,299.899994,50.880001,158.509995,111.779999,2070.65
497,12/22/2014,306.540008,51.150002,161.440002,112.940002,2078.54
498,12/23/2014,306.285004,50.02,162.240005,112.540001,2082.17
499,12/24/2014,303.029999,50.650002,161.820007,112.010002,2081.88
500,12/26/2014,309.089996,50.860001,162.339996,113.989998,2088.77
501,12/29/2014,312.040008,50.529999,160.509995,113.910004,2090.57
502,12/30/2014,310.299988,51.220001,160.050003,112.519997,2080.35
503,12/31/2014,310.350006,50.509998,160.440002,110.379997,2058.9
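
The behaviour of both methods can be checked on a small made-up table (the data frame `demo` and its single column `value` are illustrative, not part of the course data):

```python
import pandas as pd

demo = pd.DataFrame({"value": range(20)})  # 20 rows with index 0..19

first_five = demo.head()     # no argument: first 5 rows
first_three = demo.head(3)   # first 3 rows
last_two = demo.tail(2)      # last 2 rows

print(len(first_five), len(first_three), len(last_two))
print(list(last_two.index))
```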


* By default, <font color='mediumseagreen'><b>Pandas</b></font> takes the first row of the data set as the **table headings**. The table has an additional column called the **index**. By default, it is the ordinal number of each row in the data set (counting from 0) and serves as an ID for each row.

* The index can be used to uniquely identify the rows a user wants to extract from the table when transforming the data set.

* We can also use as the index another variable that uniquely identifies the rows. In finance, price information is typically ordered in time, so the variable "Date" can identify each row. To set a column as the index, we use the optional argument **index_col** when importing data.
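
The role of the index can be sketched as follows. Note that the `.loc` selector and the `set_index` method are not introduced in the text above; they are used here only as one possible way to show how an index identifies rows, on three rows copied from the table above.

```python
import pandas as pd

prices = pd.DataFrame(
    {
        "Date": ["1/2/2013", "1/3/2013", "1/4/2013"],
        "AMZN": [257.309998, 258.480011, 259.149994],
    }
)

# Default index: the rows are identified by the numbers 0, 1, 2
second_row_price = prices.loc[1, "AMZN"]

# With "Date" as the index, the rows are identified by their date instead
by_date = prices.set_index("Date")
same_price = by_date.loc["1/3/2013", "AMZN"]

print(second_row_price, same_price)
```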

By default, dates are imported as strings. More advanced analysis of dates is possible only if they are transformed into <font color='mediumseagreen'><b>DateTime</b></font> objects.

You can transform dates from strings into <font color='mediumseagreen'><b>DateTime</b></font> objects by using the optional argument **parse_dates** when importing the data.
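
The difference can be seen in a small sketch, again assuming an in-memory CSV built with `io.StringIO` so that the example is self-contained (the names `as_text` and `as_dates` are illustrative):

```python
import io
import pandas as pd

csv_text = """Date,AMZN
1/2/2013,257.309998
1/3/2013,258.480011
"""

# Without parse_dates the index entries stay plain strings
as_text = pd.read_csv(io.StringIO(csv_text), index_col="Date")

# With parse_dates=True pandas tries to convert the index into dates
as_dates = pd.read_csv(io.StringIO(csv_text), index_col="Date", parse_dates=True)

print(type(as_text.index[0]))
print(type(as_dates.index[0]))
print(as_dates.index.year)  # date parts such as the year are now available
```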

> Let's re-import our data set so that the index column is now "Date":

In [13]:
df2 = pd.read_csv("fivepricesNew.csv", index_col="Date", parse_dates=True)
df2.head()

Date,AMZN,YHOO,IBM,AAPL,^GSPC
2013-01-02,257.309998,20.08,196.350006,78.432899,1462.42
2013-01-03,258.480011,19.780001,195.270004,77.442299,1459.37
2013-01-04,259.149994,19.860001,193.990005,75.285698,1466.47
2013-01-07,268.459015,19.4,193.139999,74.842903,1461.89
2013-01-08,266.380005,19.66,192.869995,75.044296,1457.15


To import data from an Excel file, you can use the function <font color='DodgerBlue'><b>pd.read_excel</b></font>. It has only one required argument: the path (or name) of the file you want to import.

> Let's use it to import data from the file "threereturns.xlsx". These are daily returns on 3 gaming stocks. Again, Google Colab users have to upload the file first:

In [14]:
from google.colab import files
uploaded2 = files.upload()

Saving threereturns.xlsx to threereturns.xlsx


In [15]:
df3 = pd.read_excel("threereturns.xlsx")
df3.head()

Unnamed: 0,Date,Ubisoft,Capcom,Electronic Arts
0,2019-07-02,0.010918,-0.002745,0.018139
1,2019-07-03,0.041296,0.006685,-0.044879
2,2019-07-05,-0.018914,0.020508,-0.045969
3,2019-07-08,-0.012438,0.000957,-0.001709
4,2019-07-09,0.002519,-0.032505,-0.014876


For the most part, it works like <font color='DodgerBlue'><b>pd.read_csv</b></font>. However, if the Excel file contains several worksheets (which is not the case here), you can specify the sheet from which to import data (by default the first sheet, i.e. sheet 0 in Python terminology, is used).

Although the argument is named **sheet_name**, you can pass either the ordinal number of the sheet you want to import (numbering starts from 0) or a string with the exact name of that sheet.

Now we specify that we want to import data from the first sheet. As before, we set the column "Date" as the index column. The argument **usecols** determines which columns to import (given as a string of capital letters corresponding to Excel columns).

The argument **dtype** is a dictionary whose keys are column names and whose values are the data types you want to set (if you don't specify them explicitly, <font color='mediumseagreen'><b>Pandas</b></font> will set them for you).