# pandas for Data Science

![Data Science Workflow](img/ds-workflow.png)

## pandas
- When working with tabular data (spreadsheets, databases, etc.), **pandas** is the right tool
- **pandas** makes it easy to acquire, explore, clean, process, analyze, and visualize your data
- Together, these steps cover most of the Data Science process

## pandas help
- **pandas** is a large and complex tool
- **pandas** can do (almost) everything with data
    - if you can do it in Excel, you can do it in **pandas**
- **pandas** has a great [Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) to help you
- **pandas** also has great [tutorials](https://pandas.pydata.org/docs/getting_started/index.html)

## What will we cover here?
- Some insights into **DataFrames** (the main data structure in **pandas**)
- How to work with data

## This course also covers
- Later we will dive into how **pandas** can get data from various sources
    - Web Scraping, Databases, CSV, Parquet, Excel files
- How to combine data from different sources
- How to deal with missing data

## Getting started with pandas
- **pandas** is installed by default in Anaconda (Jupyter Notebooks)
- In other environments you can install it with
    - ```pip install pandas```
- To access **pandas** you need to import it
    - ```import pandas as pd```

In [34]:
import pandas as pd

### What is pandas?
- **pandas** is like an Excel sheet - just better
- To learn **pandas**, let's play with some data

### Read data from CSV
- What is CSV? See this lecture ([Lecture on CSV](https://youtu.be/LEyojSOg4EI))
- ```pd.read_csv(filename, parse_dates, index_col)``` ([docs](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html))
    - ```filename```: The path to the file
    - ```parse_dates=True```: If True, try to parse the index as dates (default False)
    - ```index_col=0```: Use column 0 as the index

In [6]:
data = pd.read_csv("files/aapl.csv", parse_dates=True, index_col=0)
data.head()

Date,High,Low,Open,Close,Volume,Adj Close
2020-01-02,75.150002,73.797501,74.059998,75.087502,135480400.0,73.988464
2020-01-03,75.144997,74.125,74.287498,74.357498,146322800.0,73.26915
2020-01-06,74.989998,73.1875,73.447502,74.949997,118387200.0,73.852982
2020-01-07,75.224998,74.370003,74.959999,74.597504,108872000.0,73.505653
2020-01-08,76.110001,74.290001,74.290001,75.797501,132079200.0,74.68808
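If you do not have `files/aapl.csv` at hand, the same call can be tried on a small in-memory CSV. This is a minimal sketch: the CSV text and its values are made up for illustration, and `io.StringIO` only stands in for a real file path.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a file on disk (illustrative values only)
csv_text = """Date,Open,Close
2020-01-02,10.0,11.0
2020-01-03,12.0,11.5
2020-01-06,11.0,11.0
"""

# index_col=0 makes the first column (Date) the index;
# parse_dates=True asks pandas to parse that index as dates
df = pd.read_csv(io.StringIO(csv_text), parse_dates=True, index_col=0)

print(type(df.index))  # the index is now a DatetimeIndex, not plain strings
print(df.head())
```

Without `parse_dates=True` the index would hold plain strings, and date-based slicing would not behave the same way.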


### Always check data
- ```.head()```: Returns the first 5 rows (pass a number to get more or fewer)

In [7]:
data.head()

Date,High,Low,Open,Close,Volume,Adj Close
2020-01-02,75.150002,73.797501,74.059998,75.087502,135480400.0,73.988464
2020-01-03,75.144997,74.125,74.287498,74.357498,146322800.0,73.26915
2020-01-06,74.989998,73.1875,73.447502,74.949997,118387200.0,73.852982
2020-01-07,75.224998,74.370003,74.959999,74.597504,108872000.0,73.505653
2020-01-08,76.110001,74.290001,74.290001,75.797501,132079200.0,74.68808
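As a quick sketch of how `.head()` behaves, here it is on a made-up ten-row DataFrame (the column name `value` is just an illustration), together with its counterpart `.tail()`:

```python
import pandas as pd

# Made-up DataFrame with ten rows
df = pd.DataFrame({"value": range(10)})

first_five = df.head()    # default: the first 5 rows
first_three = df.head(3)  # the first 3 rows
last_two = df.tail(2)     # the last 2 rows

print(first_three)
print(last_two)
```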


## Index and columns
- ```.index```: Returns the index
- ```.columns```: Returns the column names (as an ```Index``` object, which can be converted to a list)

In [8]:
data.index

DatetimeIndex(['2020-01-02', '2020-01-03', '2020-01-06', '2020-01-07',
               '2020-01-08', '2020-01-09', '2020-01-10', '2020-01-13',
               '2020-01-14', '2020-01-15',
               ...
               '2021-11-01', '2021-11-02', '2021-11-03', '2021-11-04',
               '2021-11-05', '2021-11-08', '2021-11-09', '2021-11-10',
               '2021-11-11', '2021-11-12'],
              dtype='datetime64[ns]', name='Date', length=472, freq=None)
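The cell above only shows `.index`. A small sketch with a made-up two-row DataFrame shows `.columns` as well, and how to turn it into a plain Python list:

```python
import pandas as pd

# Made-up data with a date index, similar in shape to the stock data
df = pd.DataFrame(
    {"Open": [10.0, 12.0], "Close": [11.0, 11.5]},
    index=pd.to_datetime(["2020-01-02", "2020-01-03"]),
)

print(df.index)    # a DatetimeIndex with the row labels
print(df.columns)  # an Index object with the column names

# Convert to a plain list when one is needed
column_names = df.columns.tolist()
print(column_names)
```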

## Each column has a data type
- ```.dtypes```: Returns the data types of each column

In [9]:
data.dtypes

High         float64
Low          float64
Open         float64
Close        float64
Volume       float64
Adj Close    float64
dtype: object
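In the stock data every column is `float64`. A made-up DataFrame with mixed columns (the names `price`, `volume`, and `ticker` are only illustrative) shows that each column carries its own type, and that `.astype()` can convert one:

```python
import pandas as pd

# Made-up DataFrame with one float, one integer, and one text column
df = pd.DataFrame({
    "price": [10.5, 11.0],
    "volume": [100, 200],
    "ticker": ["AAA", "BBB"],
})

# One dtype per column; text columns typically show up as object
# (or a dedicated string dtype, depending on the pandas version)
print(df.dtypes)

# Convert the integer column to float
df["volume"] = df["volume"].astype("float64")
print(df.dtypes)
```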

## The size and shape of data
- ```len(data)```: gives the number of rows in the DataFrame
- ```.shape```: Returns the number of rows and columns in the DataFrame

In [10]:
len(data)

472

In [11]:
data.shape

(472, 6)
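Since `.shape` is a tuple, it can be unpacked directly. A minimal sketch on a made-up DataFrame:

```python
import pandas as pd

# Made-up DataFrame with 3 rows and 2 columns
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows = len(df)              # number of rows only
rows, cols = df.shape         # (rows, columns) as a tuple

print(n_rows, rows, cols)
```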

## Slicing rows and columns
- ```data['Close']```: Select one column (Series)
- ```data[['Open', 'Close']]```: Select multiple columns with specific names
- ```data.loc['2020-05-01':'2021-05-01']```: Select all rows between the dates (including 2021-05-01), with all columns
- ```data.iloc[50:55]```: Select the rows at positions 50-54 (excluding 55), with all columns

In [14]:
data['Close']

Date
2020-01-02     75.087502
2020-01-03     74.357498
2020-01-06     74.949997
2020-01-07     74.597504
2020-01-08     75.797501
                 ...    
2021-11-08    150.440002
2021-11-09    150.809998
2021-11-10    147.919998
2021-11-11    147.869995
2021-11-12    149.990005
Name: Close, Length: 472, dtype: float64

In [20]:
data[['Open','Close']]

Date,Open,Close
2020-01-02,74.059998,75.087502
2020-01-03,74.287498,74.357498
2020-01-06,73.447502,74.949997
2020-01-07,74.959999,74.597504
2020-01-08,74.290001,75.797501
...,...,...
2021-11-08,151.410004,150.440002
2021-11-09,150.199997,150.809998
2021-11-10,150.020004,147.919998
2021-11-11,148.960007,147.869995


In [21]:
data[['Open','Close']].loc['2021-05-01':'2021-05-24']

Date,Open,Close
2021-05-03,132.039993,132.539993
2021-05-04,131.190002,127.849998
2021-05-05,129.199997,128.100006
2021-05-06,127.889999,129.740005
2021-05-07,130.850006,130.210007
2021-05-10,129.410004,126.849998
2021-05-11,123.5,125.910004
2021-05-12,123.400002,122.769997
2021-05-13,124.580002,124.970001
2021-05-14,126.25,127.449997


In [23]:
data.loc['2021-05']

Date,High,Low,Open,Close,Volume,Adj Close
2021-05-03,134.070007,131.830002,132.039993,132.539993,75135100.0,131.924759
2021-05-04,131.490005,126.699997,131.190002,127.849998,137564700.0,127.256538
2021-05-05,130.449997,127.970001,129.199997,128.100006,84000900.0,127.505386
2021-05-06,129.75,127.129997,127.889999,129.740005,78128300.0,129.137756
2021-05-07,131.259995,129.479996,130.850006,130.210007,78973300.0,129.825745
2021-05-10,129.539993,126.809998,129.410004,126.849998,88071200.0,126.475639
2021-05-11,126.269997,122.769997,123.5,125.910004,126142800.0,125.538422
2021-05-12,124.639999,122.25,123.400002,122.769997,112172300.0,122.407684
2021-05-13,126.150002,124.260002,124.580002,124.970001,105861300.0,124.601189
2021-05-14,127.889999,125.849998,126.25,127.449997,81918000.0,127.073868


In [24]:
data.iloc[50:55]

Date,High,Low,Open,Close,Volume,Adj Close
2020-03-16,64.769997,60.0,60.487499,60.552502,322423600.0,59.807819
2020-03-17,64.402496,59.599998,61.877499,63.215,324056000.0,62.437572
2020-03-18,62.5,59.279999,59.942501,61.6675,300233600.0,60.909103
2020-03-19,63.209999,60.6525,61.8475,61.195,271857200.0,60.442413
2020-03-20,62.9575,57.0,61.794998,57.310001,401693200.0,56.605202
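The difference between `.loc` (label-based, end included) and `.iloc` (position-based, end excluded) is easy to miss. The sketch below uses a made-up eight-day DataFrame so the row counts can be checked by hand:

```python
import pandas as pd

# Made-up data: 8 consecutive days starting 2021-04-28
dates = pd.date_range("2021-04-28", periods=8, freq="D")
df = pd.DataFrame({"Open": range(8), "Close": range(10, 18)}, index=dates)

one_col = df["Close"]             # a single column is a Series
two_cols = df[["Open", "Close"]]  # a list of names gives a DataFrame

# Label-based: both endpoints are included (3 rows)
by_label = df.loc["2021-04-30":"2021-05-02"]

# Position-based: the end position is excluded (positions 2, 3, 4)
by_pos = df.iloc[2:5]

# Partial date string: all rows in May 2021
may = df.loc["2021-05"]

print(by_label)
print(by_pos)
print(may)
```

Here `by_label` and `by_pos` happen to select the same three rows, one by date and one by position.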


## Arithmetic operations
- Calculations on columns are applied to all rows at once
    - Example: ```data['Close'] - data['Open']```
- Creating new columns
    - Example: ```data['New'] = data['Open'] - data['Close']```

In [25]:
data['Close'] - data['Open']

Date
2020-01-02    1.027504
2020-01-03    0.070000
2020-01-06    1.502495
2020-01-07   -0.362495
2020-01-08    1.507500
                ...   
2021-11-08   -0.970001
2021-11-09    0.610001
2021-11-10   -2.100006
2021-11-11   -1.090012
2021-11-12    1.560013
Length: 472, dtype: float64

In [26]:
data['New'] = data['Open'] - data['Close']

In [27]:
data.head()

Date,High,Low,Open,Close,Volume,Adj Close,New
2020-01-02,75.150002,73.797501,74.059998,75.087502,135480400.0,73.988464,-1.027504
2020-01-03,75.144997,74.125,74.287498,74.357498,146322800.0,73.26915,-0.07
2020-01-06,74.989998,73.1875,73.447502,74.949997,118387200.0,73.852982,-1.502495
2020-01-07,75.224998,74.370003,74.959999,74.597504,108872000.0,73.505653,0.362495
2020-01-08,76.110001,74.290001,74.290001,75.797501,132079200.0,74.68808,-1.5075
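The same column arithmetic can be checked by hand on a tiny made-up DataFrame (the numbers are illustrative, not taken from the stock data):

```python
import pandas as pd

# Made-up prices, small enough to verify mentally
df = pd.DataFrame({"Open": [10.0, 12.0, 11.0], "Close": [11.0, 11.5, 11.0]})

# Element-wise subtraction: one result per row, no loop needed
diff = df["Close"] - df["Open"]

# Assigning to a new column name adds the column to the DataFrame
df["New"] = df["Open"] - df["Close"]

print(diff.tolist())       # [1.0, -0.5, 0.0]
print(df["New"].tolist())  # [-1.0, 0.5, 0.0]
```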


## Select data
- Select data based on boolean expressions
    - Example: ```data['New'] > 0``` (a boolean Series with one True/False per row)
    - Example: ```data[data['New'] > 0]``` (only the rows where the condition is True)

In [28]:
data['New'] > 0

Date
2020-01-02    False
2020-01-03    False
2020-01-06    False
2020-01-07     True
2020-01-08    False
              ...  
2021-11-08     True
2021-11-09    False
2021-11-10     True
2021-11-11     True
2021-11-12    False
Name: New, Length: 472, dtype: bool

In [29]:
data[data['New'] > 0]

Date,High,Low,Open,Close,Volume,Adj Close,New
2020-01-07,75.224998,74.370003,74.959999,74.597504,108872000.0,73.505653,0.362495
2020-01-10,78.167503,77.062500,77.650002,77.582497,140644800.0,76.446938,0.067505
2020-01-14,79.392502,78.042503,79.175003,78.169998,161954400.0,77.025856,1.005005
2020-01-15,78.875000,77.387497,77.962502,77.834999,121923600.0,76.695763,0.127502
2020-01-21,79.754997,79.000000,79.297501,79.142502,110843200.0,77.984108,0.154999
...,...,...,...,...,...,...,...
2021-11-04,152.429993,150.639999,151.580002,150.960007,60394600.0,150.740005,0.619995
2021-11-05,152.199997,150.059998,151.889999,151.279999,65414600.0,151.279999,0.610001
2021-11-08,151.570007,150.160004,151.410004,150.440002,55020900.0,150.440002,0.970001
2021-11-10,150.130005,147.850006,150.020004,147.919998,65187100.0,147.919998,2.100006
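A small sketch of the same filtering on made-up data, which also shows how two conditions can be combined. Note that pandas uses `&` and `|` rather than `and` and `or`, and each condition needs its own parentheses:

```python
import pandas as pd

# Made-up prices (illustrative values only)
df = pd.DataFrame({
    "Open": [10.0, 12.0, 11.0, 9.0],
    "Close": [11.0, 11.5, 11.0, 8.0],
})
df["New"] = df["Open"] - df["Close"]

# Step 1: the comparison gives a boolean Series (a mask)
mask = df["New"] > 0

# Step 2: indexing with the mask keeps only the rows where it is True
positive = df[mask]

# Combining conditions: & means "and", | means "or"
both = df[(df["New"] > 0) & (df["Open"] > 10)]

print(mask.tolist())  # [False, True, False, True]
print(positive)
print(both)
```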


## Groupby and value_counts
- Example: group the rows by a category and take the mean of each column per group
```Python
data['Category'] = data['New'] > 0
data.groupby('Category').mean()
```
- Example: count how many rows fall into each category
```Python
data['Category'].value_counts()
(data['New'] > 0).value_counts()
```

In [30]:
data['Category'] = data['New'] > 0

In [31]:
data.groupby('Category').mean()

Category,High,Low,Open,Close,Volume,Adj Close,New
False,113.79496,111.09517,111.81481,113.178203,126004200.0,112.481534,-1.363393
True,118.2662,115.529327,117.593958,116.201648,124526700.0,115.548601,1.39231


In [32]:
data['Category'].value_counts()

False    249
True     223
Name: Category, dtype: int64

In [33]:
(data['New'] > 0).value_counts()

False    249
True     223
Name: New, dtype: int64
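To see what `groupby` and `value_counts` compute, it helps to use numbers small enough to check by hand. The sketch below uses a made-up five-row DataFrame (illustrative values only):

```python
import pandas as pd

# Made-up data: three rows with New <= 0 and two rows with New > 0
df = pd.DataFrame({
    "New": [-1.0, 0.5, -3.0, 1.5, -0.5],
    "Volume": [100.0, 200.0, 300.0, 400.0, 200.0],
})
df["Category"] = df["New"] > 0

# One row per category (False, True), with the mean of each numeric column
means = df.groupby("Category").mean()

# Number of rows per category, most frequent first
counts = df["Category"].value_counts()

print(means)
print(counts)
```

For the `False` group the mean of `Volume` is (100 + 300 + 200) / 3 = 200, and for the `True` group it is (200 + 400) / 2 = 300.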