In this tutorial, we will look at Pandas, a data analysis toolkit 
for Python. We will learn how to read a CSV file and edit the series 
or dataframe that is created. More information about Pandas can 
be found here: https://pandas.pydata.org/pandas-docs/stable/dsintro.html

Needed packages:  ```pandas```
Note: Run this command on the unix command line. 
This might take a while to install:
```
conda install pandas
```

In [45]:
from IPython.display import Image
##Image(filename='test.png') 




Here are a few quick definitions and images before we get started:
- Series: One-dimensional labeled array capable of holding any data type
![alt text](./seriesexample.PNG "Series Example")
- DataFrame: Two-dimensional labeled data structure with columns of potentially different types
![alt text](./dataframeexample.PNG "Dataframe Example")

Table of Contents:

[Introduction](#Introduction)

[Creating your own CSV file](#Creating your own CSV file)

[Getting Started](#Getting Started)

[Working With Rows and Columns](#Working With Rows and Columns)

[Data Operations](#Data Operations)

[Deleting Rows and Columns](#Deleting Rows and Columns)

[Filtering the Outputs](#Filtering the Outputs)

[Fixing Missing Data](#Fixing Missing Data)

[Using Numpy with Pandas](#Using Numpy with Pandas)

[Using Other File Formats in Pandas](#Using Other File Formats in Pandas)

[Summary](#Summary)

<a id='Creating your own CSV file'></a>
## Creating your own CSV file

In this tutorial, you can either follow along using the data that I have provided, or get your own to analyze. Below are instructions to gather data from Google Trends. If you do not want to gather new data, you can skip this cell.

- Go to https://trends.google.com/trends/?geo=US
- Search for multiple terms that you want to compare
- When you are satisfied with your search terms, click on the download symbol above the graph titled "Interest over time"
![alt text](graphexample2.png "Example Graph")
- Now import the file into a new Google Sheets. File > Import > Upload and select the file. It will probably be named "multiTimeline.csv"
- A popup will appear. Click "Import Data", then click "Open now" which appears at the top of the popup
![alt text](popupexample.PNG "Popup Example")
- Your data will now appear in a sheet and we just have to clean it up a little
![alt text](sheetsexample.PNG "Sheets Example")
- Remove the block that says "Week"
- Now we need to download the file as a CSV file. File > Download as > Comma-Separated Values (.csv, current sheet). Move the downloaded file into the same folder as your new Jupyter notebook on your computer. Make sure to rename the file so it is easy to type into the Jupyter notebook. This allows you to read the file in the next steps

<a id='Getting Started'></a>
## Getting Started

First we need to import pandas. We can follow that with "as pd" so we don't have to write pandas every time we want to use one of the methods that pandas contains.

In [46]:
import pandas as pd

Now we are ready to start looking at the data. We will use the command pd.read_csv() and store the result in the variable df. Inside the parentheses, we will write the name of the CSV file we downloaded, the position of the column whose values should become the row labels (index_col=0 uses the first column, which is displayed farthest to the left), and the number of rows to skip from the top of the file (skiprows). Feel free to change any of the values and see what the different results are!

<a id='Introduction'></a>
# Working with Pandas

### Created by Joshua Bay, REHS Internship, 2019


In [47]:
df = pd.read_csv('TrendsDataMissing.csv', index_col=0, skiprows=2)
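
To see what index_col and skiprows do without needing a file on disk, here is a minimal sketch that reads a small CSV from a string. The file contents and the name `sample` are made up for illustration; only `pd.read_csv()` and its two arguments come from the cell above.

```python
import io
import pandas as pd

# Hypothetical file contents: two lines of export notes, then a header row and data.
raw = """Exported by a hypothetical tool
Category: All categories
Game,2018-07-29,2018-08-05
Minecraft,27,26
Fortnite,82,
"""

# skiprows=2 drops the two note lines; index_col=0 turns the first
# column ("Game") into the row labels.
sample = pd.read_csv(io.StringIO(raw), index_col=0, skiprows=2)

print(sample)
print(sample.index.tolist())    # row labels taken from the first column
print(sample.columns.tolist())  # the remaining header cells
```

The empty cell at the end of the Fortnite line is read as a missing value (NaN), which is also why some columns in the tutorial data show decimals.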

Now that we have the CSV file read, we can start to look at the data. head() will show the first five rows of data, and tail() will show the last five rows of data. You can change the number of rows you see by changing the number you pass into the parentheses.

In [48]:
df.head()

Unnamed: 0,2018-07-29,2018-08-05,2018-08-12,2018-08-19,2018-08-26,2018-09-02
Minecraft,27,26.0,24.0,22.0,20.0,19.0
Fortnite,82,,76.0,88.0,80.0,75.0
World of Warcraft,15,15.0,,20.0,18.0,
Overwatch,7,7.0,6.0,,7.0,
Rocket League,3,3.0,2.0,2.0,,3.0


In [49]:
df.tail(3)

Unnamed: 0,2018-07-29,2018-08-05,2018-08-12,2018-08-19,2018-08-26,2018-09-02
World of Warcraft,15,15.0,,20.0,18.0,
Overwatch,7,7.0,6.0,,7.0,
Rocket League,3,3.0,2.0,2.0,,3.0


You can also just write the variable name to see the entire dataframe.

In addition to looking at the whole DataFrame, you can look at just the titles of the columns and rows by using the attributes columns and index.

In [50]:
df.columns

Index(['2018-07-29', '2018-08-05', '2018-08-12', '2018-08-19', '2018-08-26',
       '2018-09-02'],
      dtype='object')

In [51]:
df.index

Index(['Minecraft', 'Fortnite', 'World of Warcraft', 'Overwatch',
       'Rocket League'],
      dtype='object')

The shape attribute gives the dimensions of your data as (rows, columns). In this example, it is 5 rows by 6 columns. The size attribute gives the total number of values in the table, which is the number of rows multiplied by the number of columns (5 * 6 = 30). Finally, the len() function returns one of the two values that shape does, depending on what you pass into the parentheses: passing df.index gives the number of rows, and passing df.columns gives the number of columns.

In [52]:
df.shape

(5, 6)

In [53]:
df.size

30

In [54]:
len(df.index)

5

In [55]:
len(df.columns)

6
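
The relationship between shape, size, and len() can be checked on any table. The sketch below uses a small stand-in DataFrame (the name `small` and its column labels are invented for this example).

```python
import pandas as pd

# A tiny stand-in table: 2 rows (games) by 3 columns (weeks).
small = pd.DataFrame(
    [[27, 26, 24], [82, 80, 76]],
    index=["Minecraft", "Fortnite"],
    columns=["week 1", "week 2", "week 3"],
)

rows, cols = small.shape            # shape is a (rows, columns) tuple
total_cells = small.size            # size counts every cell
row_count = len(small)              # len() of a DataFrame is its row count
col_count = len(small.columns)      # len() of the column labels

print(rows, cols, total_cells, row_count, col_count)
```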

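Now we have to clean up the column titles. A raw Google Trends export typically labels each column with the search term followed by a colon and the region, so we use the str.split() command to split the names at the colon and then re-assign the first piece of each name to the columns of the DataFrame. This process just makes the names look cleaner. The column names in the file used here are plain dates with no colon, so this cell leaves them unchanged.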

In [56]:
names_ids = df.columns.str.split(':')
df.columns = names_ids.str[0]
df.head(3)

Unnamed: 0,2018-07-29,2018-08-05,2018-08-12,2018-08-19,2018-08-26,2018-09-02
Minecraft,27,26.0,24.0,22.0,20.0,19.0
Fortnite,82,,76.0,88.0,80.0,75.0
World of Warcraft,15,15.0,,20.0,18.0,
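
Because the dates in this file contain no colon, the effect of the split is easier to see on headers that do. This is a small sketch with hypothetical headers in the "term: (region)" style; the name `labelled` is invented for the example.

```python
import pandas as pd

# Hypothetical headers in the style "term: (region)".
labelled = pd.DataFrame(
    [[27, 82], [26, 80]],
    columns=["Minecraft: (United States)", "Fortnite: (United States)"],
)

parts = labelled.columns.str.split(":")  # each header becomes a list of pieces
labelled.columns = parts.str[0]          # keep only the piece before the colon

print(labelled.columns.tolist())
```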


<a id='Working With Rows and Columns'></a>
## Working With Rows and Columns

When using DataFrames, you might have to access the values of a certain row or column. You are able to do this in a few different ways:

- Brackets
    - [ ] Single brackets with one value will return a Series
    - [ ] Single brackets with many values will return a DataFrame
    - [[ ]] Double brackets will return a DataFrame
- Selecting Rows
    - iloc[ ] with a number will select that row (ex: df.iloc[2])
    - loc[ ] with a name will select that row (ex: df.loc['Overwatch'])
- Selecting Columns
    - A set of brackets with a column name will select that column (ex: df['2018-07-29'])
    
**Important Note: When accessing data with an index number, the first index value is 0, not 1. Counting is as follows: 0, 1, 2, 3, etc.**
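
The bracket rules above can be verified by checking the type of each result. This is a minimal sketch on a made-up table called `games`; the values are placeholders rather than the tutorial data.

```python
import pandas as pd

games = pd.DataFrame(
    {"2018-07-29": [27, 82, 15], "2018-08-05": [26, 80, 15]},
    index=["Minecraft", "Fortnite", "World of Warcraft"],
)

one_row = games.loc["Fortnite"]        # single row label -> Series
row_frame = games.loc[["Fortnite"]]    # list of row labels -> DataFrame
one_col = games["2018-07-29"]          # single column name -> Series
col_frame = games[["2018-07-29"]]      # list of column names -> DataFrame

for result in (one_row, row_frame, one_col, col_frame):
    print(type(result).__name__, result.shape)
```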

### Rows

These first two commands will output the third row of data in two different ways. The first is as a series, because we are using one set of brackets. The second is a dataframe, because there are two sets of brackets.

In [57]:
df.iloc[2]

2018-07-29    15.0
2018-08-05    15.0
2018-08-12     NaN
2018-08-19    20.0
2018-08-26    18.0
2018-09-02     NaN
Name: World of Warcraft, dtype: float64

In [58]:
df.iloc[[2]]

Unnamed: 0,2018-07-29,2018-08-05,2018-08-12,2018-08-19,2018-08-26,2018-09-02
World of Warcraft,15,15.0,,20.0,18.0,


This next example shows multiple rows. Notice that single brackets are being used, but a dataframe is still the output. Also, be careful which indexes are being passed into the brackets. The first index value **will** be included, but the last value **will not**.

In [59]:
df.iloc[2:5]

Unnamed: 0,2018-07-29,2018-08-05,2018-08-12,2018-08-19,2018-08-26,2018-09-02
World of Warcraft,15,15.0,,20.0,18.0,
Overwatch,7,7.0,6.0,,7.0,
Rocket League,3,3.0,2.0,2.0,,3.0
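
One detail worth knowing: the "last value is not included" rule applies to position-based slices with iloc. A label-based slice with loc keeps both endpoints. The sketch below shows the difference on a Series built from the first column of the table (the name `scores` is invented here).

```python
import pandas as pd

scores = pd.Series(
    [27, 82, 15, 7, 3],
    index=["Minecraft", "Fortnite", "World of Warcraft", "Overwatch", "Rocket League"],
)

# Position-based: positions 2 and 3 are returned, position 4 is excluded.
by_position = scores.iloc[2:4]

# Label-based: both the start label and the end label are returned.
by_label = scores.loc["World of Warcraft":"Rocket League"]

print(by_position.index.tolist())
print(by_label.index.tolist())
```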


If you do not know the index of the row you want to see, you can pass the name of the row instead. Notice that the number of brackets and the number of rows being passed determine whether the output is a series or a dataframe.

In [60]:
df.loc['World of Warcraft']

2018-07-29    15.0
2018-08-05    15.0
2018-08-12     NaN
2018-08-19    20.0
2018-08-26    18.0
2018-09-02     NaN
Name: World of Warcraft, dtype: float64

In [61]:
rows = ['World of Warcraft', 'Overwatch']
df.loc[rows]

Unnamed: 0,2018-07-29,2018-08-05,2018-08-12,2018-08-19,2018-08-26,2018-09-02
World of Warcraft,15,15.0,,20.0,18.0,
Overwatch,7,7.0,6.0,,7.0,


In [62]:
df.columns

Index(['2018-07-29', '2018-08-05', '2018-08-12', '2018-08-19', '2018-08-26',
       '2018-09-02'],
      dtype='object')

### Columns

Columns are much simpler than rows, because you do not have to remember whether to use loc or iloc. Instead, you just put the name or names of the columns into single or double brackets. Again, the number of names and the number of brackets determine whether the result is a series or a dataframe. We can also use head() so we only see the first few lines of data, rather than the whole column.

In [63]:
df['2018-07-29'].head()


Minecraft            27
Fortnite             82
World of Warcraft    15
Overwatch             7
Rocket League         3
Name: 2018-07-29, dtype: int64

In [64]:
names = ['2018-07-29']
df[names].head()

Unnamed: 0,2018-07-29
Minecraft,27
Fortnite,82
World of Warcraft,15
Overwatch,7
Rocket League,3


<a id='Data Operations'></a>
## Data Operations

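If you just want the total of all the items in a row or column, the sum() command will add them up and give the output. Missing values are skipped. The first cell below totals the 'Overwatch' row, and the second adds a 'Total' column holding the sum of every row (axis=1 adds across the columns).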

In [65]:
overwatch_row_sum = df.loc['Overwatch'].sum()


print("Sum of scores in the 'Overwatch' row:", overwatch_row_sum)


Sum of scores in the 'Overwatch' row: 27.0


In [66]:
df['Total'] = df.fillna(0).sum(axis=1)
print(df.head())



                   2018-07-29  2018-08-05  2018-08-12  2018-08-19  2018-08-26  \
Minecraft                  27        26.0        24.0        22.0        20.0   
Fortnite                   82         NaN        76.0        88.0        80.0   
World of Warcraft          15        15.0         NaN        20.0        18.0   
Overwatch                   7         7.0         6.0         NaN         7.0   
Rocket League               3         3.0         2.0         2.0         NaN   

                   2018-09-02  Total  
Minecraft                19.0  138.0  
Fortnite                 75.0  401.0  
World of Warcraft         NaN   68.0  
Overwatch                 NaN   27.0  
Rocket League             3.0   13.0  
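
The axis argument decides the direction of the sum. As a small sketch, the table below reuses the first four weeks of the 'Overwatch' and 'Rocket League' rows; the name `table` is invented for the example.

```python
import numpy as np
import pandas as pd

table = pd.DataFrame(
    {
        "2018-07-29": [7, 3],
        "2018-08-05": [7.0, 3.0],
        "2018-08-12": [6.0, 2.0],
        "2018-08-19": [np.nan, 2.0],
    },
    index=["Overwatch", "Rocket League"],
)

row_totals = table.sum(axis=1)     # one total per row, adding across the columns
column_totals = table.sum(axis=0)  # one total per column, adding down the rows

# sum() skips NaN by default, so filling with 0 first gives the same totals.
same_totals = row_totals.equals(table.fillna(0).sum(axis=1))

print(row_totals)
print(column_totals)
print(same_totals)
```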


You can also do math with the rows and columns. Just use any operator (+, -, *, /, %, etc.) on the values from the rows or columns and pandas will apply that operation to each pair of numbers. You can assign the result to a row or column that does not exist yet, and pandas will create it for you.

In [67]:

df['Average'] = df['Total'] / (len(df.columns) - 1)


In [68]:
print(df)

                   2018-07-29  2018-08-05  2018-08-12  2018-08-19  2018-08-26  \
Minecraft                  27        26.0        24.0        22.0        20.0   
Fortnite                   82         NaN        76.0        88.0        80.0   
World of Warcraft          15        15.0         NaN        20.0        18.0   
Overwatch                   7         7.0         6.0         NaN         7.0   
Rocket League               3         3.0         2.0         2.0         NaN   

                   2018-09-02  Total    Average  
Minecraft                19.0  138.0  23.000000  
Fortnite                 75.0  401.0  66.833333  
World of Warcraft         NaN   68.0  11.333333  
Overwatch                 NaN   27.0   4.500000  
Rocket League             3.0   13.0   2.166667  
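
Note that the 'Average' column divides each total by all six weeks, so a missing week counts as a zero. If you would rather average only the weeks that have data, mean() skips missing values. The sketch below compares the two approaches on the 'Overwatch' row; the variable names are invented for this example.

```python
import numpy as np
import pandas as pd

# The six weekly values from the 'Overwatch' row, two of them missing.
overwatch = pd.Series([7, 7, 6, np.nan, 7, np.nan])

# Missing weeks treated as zero: 27 / 6
zero_filled_average = overwatch.fillna(0).sum() / len(overwatch)

# Missing weeks ignored: 27 / 4
observed_average = overwatch.mean()

print(zero_filled_average, observed_average)
```

Which one is appropriate depends on whether a missing week really means "no interest" or just "no measurement".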


<a id='Deleting Rows and Columns'></a>
## Deleting Rows and Columns

The drop() command returns a copy of your data without the rows or columns that you name. Pass row names with index=[...] and column names with columns=[...], then assign the result back to df to keep the change. The del statement removes a single column in place when put in front of the call to that column (ex: del df['Total']). From this point forward, the removed row or column will not be present in the data. If you want to undo that, just re-run the read_csv() cell near the top of the notebook. You will have to re-run some of the cells below it if you changed the data in any other ways.

In [70]:
df = df.drop(columns=['Total', 'Average'])
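
The difference between drop() and del is easiest to see side by side. This is a minimal sketch on a throwaway table; the names `frame`, `without_b`, and `without_x` are invented for illustration.

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]}, index=["x", "y"])

without_b = frame.drop(columns=["b"])  # returns a new DataFrame without column 'b'
without_x = frame.drop(index=["x"])    # same idea for a row label

# drop() did not change the original table.
print(frame.columns.tolist())

del frame["c"]                         # del removes the column in place
print(frame.columns.tolist())
```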


<a id='Filtering the Outputs'></a>
## Filtering the Outputs

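You can use the max() and idxmax() commands to find the largest number in a row or column and the label where that largest number occurs, respectively. Here we look at the 'Fortnite' row, so idxmax() returns the date column that holds the peak.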

In [72]:
df.loc['Fortnite'].max()

88.0

In [74]:
df.loc['Fortnite'].idxmax()

'2018-08-19'
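
Both commands also work on a whole DataFrame at once, with axis choosing the direction. The sketch below uses two weeks from the 'Minecraft' and 'Fortnite' rows; the variable names are invented for this example.

```python
import pandas as pd

games = pd.DataFrame(
    {"2018-08-12": [24.0, 76.0], "2018-08-19": [22.0, 88.0]},
    index=["Minecraft", "Fortnite"],
)

peak_value = games.max(axis=1)     # largest value in each row
peak_week = games.idxmax(axis=1)   # column label where each row peaks
top_game = games.idxmax(axis=0)    # row label that leads each column

print(peak_value)
print(peak_week)
print(top_game)
```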

You can also locate certain rows that meet some criteria. In these cells, we filter the outputs to only show rows that have data over a certain threshold. We can also just show certain columns.

In [89]:
# Filter rows where the value in the '2018-07-29' column is greater than 50
filtered_rows = df[df['2018-07-29'] > 50]

# Print the filtered DataFrame (only the 'Fortnite' row, with 82, passes)
print(filtered_rows)

In [88]:
# Rows that grew from 2018-07-29 to 2018-08-19 and scored above 50 in the later week
df.loc[(df['2018-08-19'] > df['2018-07-29']) & (df['2018-08-19'] > 50)]

In [77]:
df[['2018-07-29', '2018-08-05']]
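
A boolean filter works when the labels of the True/False series line up with the axis being filtered: a comparison on a column is labeled by row, and a comparison on a row is labeled by column. The following sketch shows both directions on a made-up table; the names `games`, `row_mask`, and `col_mask` are invented for the example.

```python
import pandas as pd

games = pd.DataFrame(
    {"2018-07-29": [27, 82, 15], "2018-08-19": [22, 88, 20]},
    index=["Minecraft", "Fortnite", "World of Warcraft"],
)

row_mask = games["2018-07-29"] > 20      # one True/False per row label
col_mask = games.loc["Minecraft"] > 25   # one True/False per column label

rows_kept = games[row_mask]              # keeps rows where the mask is True
cols_kept = games.loc[:, col_mask]       # keeps columns where the mask is True

print(rows_kept)
print(cols_kept)
```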

Now if we put some of the ideas from above together, we will be able to show certain columns and certain rows. This command is so common that you can even leave out 'iloc' and you will get the same result. Just put the names of the columns you want in double brackets and follow that by the range of the index of the rows you want in single brackets.

In [78]:
df[['2018-07-29', '2018-08-05', '2018-08-12']].iloc[1:4]

In [79]:
df[['2018-07-29', '2018-08-05', '2018-08-12']][1:4]

If you don't know the index of the rows you want, you can always just enter the names of the rows in double brackets with the loc command.

In [80]:
df[['2018-07-29', '2018-08-05', '2018-08-12']].loc[['World of Warcraft', 'Overwatch', 'Rocket League']]
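
As an alternative to chaining two selections, loc and iloc accept rows and columns together, separated by a comma. This is a short sketch on a made-up table called `games`, not the tutorial file.

```python
import pandas as pd

games = pd.DataFrame(
    {
        "2018-07-29": [27, 82, 15, 7],
        "2018-08-05": [26, 80, 15, 7],
        "2018-08-12": [24, 76, 16, 6],
    },
    index=["Minecraft", "Fortnite", "World of Warcraft", "Overwatch"],
)

# Rows first, then columns, both given by name.
by_label = games.loc[["World of Warcraft", "Overwatch"], ["2018-07-29", "2018-08-05"]]

# The same block of cells, given by position.
by_position = games.iloc[2:4, 0:2]

print(by_label)
print(by_label.equals(by_position))
```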

<a id='Fixing Missing Data'></a>
## Fixing Missing Data

Now we are going to create a new table, but leave out some values. For this example, it will be easiest to use the CSV file I created, because I have flipped the headers for the rows and columns. I just transferred over a few of the numbers, and left out some of the data. We are going to be working with missing values and how pandas can help you fill in the data. The command fillna() is going to be used in most of these examples.

In [81]:
df2 = pd.read_csv('TrendsDataMissing.csv', index_col=0, skiprows=2)
df2

Unnamed: 0,2018-07-29,2018-08-05,2018-08-12,2018-08-19,2018-08-26,2018-09-02
Minecraft,27,26.0,24.0,22.0,20.0,19.0
Fortnite,82,,76.0,88.0,80.0,75.0
World of Warcraft,15,15.0,,20.0,18.0,
Overwatch,7,7.0,6.0,,7.0,
Rocket League,3,3.0,2.0,2.0,,3.0


The first way is the easiest, and just fills in all the missing numbers with one value, in this case, 200.

In [82]:
df2.fillna(value=200)

Unnamed: 0,2018-07-29,2018-08-05,2018-08-12,2018-08-19,2018-08-26,2018-09-02
Minecraft,27,26.0,24.0,22.0,20.0,19.0
Fortnite,82,200.0,76.0,88.0,80.0,75.0
World of Warcraft,15,15.0,200.0,20.0,18.0,200.0
Overwatch,7,7.0,6.0,200.0,7.0,200.0
Rocket League,3,3.0,2.0,2.0,200.0,3.0


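The next method is filling forward, which takes the previous value and assigns that to the missing number. Recent pandas releases deprecate the method argument of fillna(), so the dedicated ffill() command is used here; it produces the same result.

In [83]:
df2.ffill(axis=1)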

Unnamed: 0,2018-07-29,2018-08-05,2018-08-12,2018-08-19,2018-08-26,2018-09-02
Minecraft,27.0,26.0,24.0,22.0,20.0,19.0
Fortnite,82.0,82.0,76.0,88.0,80.0,75.0
World of Warcraft,15.0,15.0,15.0,20.0,18.0,18.0
Overwatch,7.0,7.0,6.0,6.0,7.0,7.0
Rocket League,3.0,3.0,2.0,2.0,2.0,3.0


Additionally, you can fill backwards, taking the next value and assigning it back. You may notice that there are a few values still missing in the last column of the table, and that is because there are no values to backfill the last column with. This would also happen using forward filling if you are missing values in the first column.

In [None]:
df2.bfill(axis=1)

Finally, you can use the interpolate() command with the linear method to fill the missing value with a number between the one before and after. There are also a few different methods that you can use to change the number that will be given, such as quadratic, cubic, polynomial, etc.

In [None]:
df2.interpolate(method='linear', axis=1)
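
To compare the filling strategies side by side, here is a minimal sketch on a four-value Series. The name `week` and its labels are invented for the example.

```python
import numpy as np
import pandas as pd

week = pd.Series([15.0, np.nan, 20.0, np.nan], index=["w1", "w2", "w3", "w4"])

filled_constant = week.fillna(0)    # every gap becomes 0
filled_forward = week.ffill()       # each gap copies the value before it
filled_backward = week.bfill()      # each gap copies the value after it
filled_linear = week.interpolate()  # w2 becomes 17.5, halfway between 15 and 20

comparison = pd.DataFrame(
    {
        "constant": filled_constant,
        "forward": filled_forward,
        "backward": filled_backward,
        "linear": filled_linear,
    }
)
print(comparison)
```

The last value stays missing under backward filling because nothing comes after it.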

<a id='Using Numpy with Pandas'></a>
## Using Numpy with Pandas

You can convert numpy arrays to dataframes, and name the rows and columns whatever you wish. In this example, we use a random number generator to make a 10 by 3 array.

In [None]:
import numpy as np
a = np.random.rand(10, 3)
a

In [None]:
df3 = pd.DataFrame(a, columns=['Col 1', 'Col 2', 'Col 3'], index=['Row 1', 'Row 2', 'Row 3', 'Row 4', 'Row 5', 'Row 6', 'Row 7', 'Row 8', 'Row 9', 'Row 10'])
df3
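
The conversion also works in the other direction. The sketch below builds the row labels with a list comprehension instead of typing them out, and uses to_numpy() to get the array back. It uses a seeded generator and a smaller 4 by 2 array so the run is repeatable; those choices are assumptions for the example, not part of the cells above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # seeded so the numbers repeat between runs
values = rng.random((4, 2))

# Build 'Row 1' ... 'Row 4' without typing each label.
labels = [f"Row {i}" for i in range(1, 5)]
frame = pd.DataFrame(values, index=labels, columns=["Col 1", "Col 2"])

back = frame.to_numpy()              # DataFrame -> NumPy array
column_means = frame.mean()          # pandas methods work on the converted data

print(frame)
print(np.array_equal(values, back))
```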

<a id='Using Other File Formats in Pandas'></a>
## Using Other File Formats in Pandas

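Lastly, there are a few other file formats that can be read by pandas. They include JSON, HTML, Excel, and HDF, each with its own reader (pd.read_json(), pd.read_html(), pd.read_excel(), and pd.read_hdf()). Once the data is loaded into a DataFrame, everything in this tutorial works the same way as it does for the CSV file.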
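
As one example, JSON can be tried without any extra files by writing a table to a JSON string and reading it back. This is a sketch under stated assumptions: the table `games` and its column names are made up, and the Excel and HDF readers usually need additional packages installed, so they are not shown.

```python
import io
import pandas as pd

games = pd.DataFrame(
    {"week_a": [27, 82], "week_b": [26, 80]},
    index=["Minecraft", "Fortnite"],
)

json_text = games.to_json()                      # DataFrame -> JSON string
restored = pd.read_json(io.StringIO(json_text))  # JSON string -> DataFrame

print(json_text)
print(restored)
```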

<a id='Summary'></a>
## Summary

Congratulations! Now you know the basics of using pandas! We have learned how to create a new CSV file from Google Trends, import it into pandas, look at the data, create and delete rows and columns, fill missing values, and much more! I hope you learned something, and have a good rest of your day!

Miss anything? Go back and review!

[Introduction](#Introduction)

[Creating your own CSV file](#Creating your own CSV file)

[Getting Started](#Getting Started)

[Working With Rows and Columns](#Working With Rows and Columns)

[Data Operations](#Data Operations)

[Deleting Rows and Columns](#Deleting Rows and Columns)

[Filtering the Outputs](#Filtering the Outputs)

[Fixing Missing Data](#Fixing Missing Data)

[Using Numpy with Pandas](#Using Numpy with Pandas)

[Using Other File Formats in Pandas](#Using Other File Formats in Pandas)

[Summary](#Summary)